  A monologue involving binary log  
From: Orchid Win7 v1
Date: 14 Feb 2015 10:33:32
Message: <54df6acc@news.povray.org>
OK, so it's been a while since I posted a big wall of text. So here goes...



The name of the game is data compression. If you already know how that 
works, this is going to be a VERY boring post. If you don't know, it 
probably seems like magic; you take a file, throw it at WinZip, and 
suddenly the file gets 4x smaller.

Now MP3 does that to sound files, but it does that by DELETING DATA. The 
compressed file has lower fidelity than the original. It's not that 
surprising that you can make a file smaller by removing data from it. 
But LOSSLESS compression algorithms do something seemingly impossible: 
they take a file, somehow make it smaller, and yet the process is 
EXACTLY REVERSIBLE. When you decompress the file again, it is IDENTICAL 
to the original. How is that even possible?

Well, it ISN'T always possible. Sometimes when you zip a file it gets 
tiny, and sometimes it doesn't get that much smaller. You may not have 
experienced this, but occasionally a file will get LARGER. For example, 
zip a file - any file - and then go zip the zip file. It WILL NOT get 
any smaller. I promise. (Think about it; otherwise you could take an 
arbitrary file and zip it over and over until it gets arbitrarily small. 
But that really IS impossible!)

Rather than dwell on this further, I'm going to show you some 
compression algorithms now. We're going to start with Huffman coding.

As you probably know, the American Standard Code for Information 
Interchange ("ASCII") assigns an 8-bit code to every letter of the 
alphabet. (You may or may not know that technically ASCII is actually a 
7-bit code; any text encoding which actually *uses* the 8th bit for 
something technically isn't ASCII, it's an ASCII extension.)

You might also be aware that Unicode defines zillions of characters, 
whose code points need up to 21 bits. In UTF-32, every character is 
stored as 32 bits. In
UTF-16, almost all characters are stored as 16 bits, but particularly 
unusual characters take 32 bits. And in UTF-8, many characters take only 
8 bits, a few take 16 bits, the really rare ones take 24 bits and a tiny 
few take 32 bits. (The extra bits are necessary to signal how many bits 
make up the character.)

The various UTF encodings are of course designed to make encoding and 
decoding as painless as possible, while not wasting absurd amounts of 
space. UTF-32 is particularly easy to deal with, but very space-hungry. 
UTF-8 is the most frugal, but quite awkward if there are lots of 
multi-byte characters.

But suppose we want to assign codes to characters, not so that 
processing is easy, but so that space is minimised. Ease of use is not a 
requirement; we just want to crunch the space down as much as possible.

It seems obvious that the fewer bits per character we have, the less 
space this is all going to take up. If you have a plain ASCII file, just 
dropping every 8th bit already gives us 1/8 compression ( = 12.5%).
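
As a toy illustration (this isn't how any real archiver works, and the 
function names are invented for this post), packing 7 bits per character 
might look like this in Python:

   def pack_7bit(text):
       # Concatenate the low 7 bits of every character, then chop into bytes.
       bits = "".join(format(ord(c) & 0x7F, "07b") for c in text)
       bits += "0" * (-len(bits) % 8)        # pad the last byte with zeros
       return bytes(int(bits[i:i+8], 2) for i in range(0, len(bits), 8))

   def unpack_7bit(data, n_chars):
       # Reverse the process; we need to be told how many characters to expect.
       bits = "".join(format(b, "08b") for b in data)
       return "".join(chr(int(bits[i:i+7], 2)) for i in range(0, 7 * n_chars, 7))

   assert unpack_7bit(pack_7bit("hello world"), 11) == "hello world"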

We can go one step further: Scan the whole file, and see which 
characters are *actually used* in this file. Codes 0 to 31 in ASCII are 
all "control codes", and with the exception of the newline character(s), 
they're all pretty much obsolete now. By not assigning any code at all 
to characters we aren't actually using, we *might* be able to get away 
with 6 bits per character. Maybe.

But even if we do, that's still "only" 25% compression. And if there are 
64 characters used then 6 bits per character will do, but if there are 65 
characters used then we have to have 7 bits - which is annoying.

But now, suppose we assign different code lengths to different 
characters. By assigning shorter codes to common letters and longer 
codes to rare letters, we can potentially store the same data in very 
much reduced space.

Suppose, for example, that we assign a 2-bit code to the letter E, and 
7-bit codes to everything else. Then every single time the letter E 
appears, we're saving 6 bits (compared to the original ASCII file). 
Since E probably appears quite a bit, that could add up to substantial 
savings. (Hint: The character " " probably appears *way more* than ANY 
actual letter of the alphabet, but hey.)

It might seem tempting to assign the shortest possible codes to 
everything. However, for X number of bits, there are only 2^X possible 
codes. If you make one code shorter, you have to make other codes 
correspondingly longer. So if you take a character that isn't *that* 
common and give it a really short code, you'll have to make all the 
other codes longer to compensate. Because the character isn't *that* 
common, the savings from the short code don't make up for the losses in 
all the other longer codes.

In summary, for every character there is a mathematically "perfect" code 
length, which gives the best possible compression. And that length is 
equal to [the negation of] the binary logarithm of the probability of 
that character appearing:

   bits(X) = -log2 Pr(X)

If X takes more bits than this, you're wasting space. If X takes fewer 
bits than this, you're wasting bits somewhere else. So this is the 
"perfect" length of X.

(Note that if we were trying to encode characters with decimal codes, it 
would be log10 rather than log2. The 2 is because computers are binary.)

You can work out the "probability of X appearing" by simply counting how 
many X characters there are in the entire file, divided by the *total* 
number of characters. So that's easy enough.
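
In code, that calculation is short. A minimal sketch in Python (the 
function names are made up for this post):

   import math
   from collections import Counter

   def ideal_code_lengths(data):
       # -log2 of each byte's observed probability: its "perfect" length in bits.
       counts = Counter(data)
       total = len(data)
       return {byte: -math.log2(n / total) for byte, n in counts.items()}

   def best_average_bits(data):
       # Weighting each ideal length by its probability gives the smallest
       # average bits-per-character any such code could possibly achieve.
       counts = Counter(data)
       lengths = ideal_code_lengths(data)
       total = len(data)
       return sum((n / total) * lengths[b] for b, n in counts.items())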

Now, how to assign codes to characters? And how to figure out where one 
character ends and the next begins? In a normal (ASCII) text file, you 
know that each byte is one character. But if different characters take 
different numbers of bits, how are we to know where the start and end of 
each character code is?

The answer is to use a binary tree. For example:

       +
      / \
     +   C
    / \
   A   B

Starting from the top, go left for 0 and go right for 1. If we start at 
the top and go right, we immediately reach C. So C = 1. If we go left 
and then left, we get to A. So A = 00. And B = 01. So from this tree, we 
have

   A = 00
   B = 01
   C = 1

As we read bits from the input file, we can walk down the tree, going 
left when we read a 0 and right when we read a 1. When we get to a leaf, 
we spit out that character, and go back to the top of the tree. Easy. 
[But very inefficient. There are tricks to speed this up, though.]
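
Here's a sketch of that (naive) decoding loop, with the tree represented 
as nested pairs - a representation invented just for this example:

   # A node is either a leaf (a single character) or a (left, right) pair.
   tree = (("A", "B"), "C")      # the example tree above: A=00, B=01, C=1

   def decode(bits, tree):
       out, node = [], tree
       for bit in bits:
           node = node[0] if bit == "0" else node[1]   # 0 = go left, 1 = go right
           if not isinstance(node, tuple):             # we've reached a leaf
               out.append(node)
               node = tree                             # back to the top
       return "".join(out)

   assert decode("00011", tree) == "ABC"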

So to build a code which can be unambiguously decoded, we just need to 
build a suitable code tree. How can we do that?

One way is to build a "Shannon-Fano tree". We count the characters in the 
file, and end up with a table like

   Char  Prob
    A      7%
    B     12%
    C      3%
   ...    ...

Now we need to build a tree. What we do is to try to split the set of 
all [used] characters into two sets whose total probability is as close 
to 50% as possible. All the items in the left subset get codes starting 
with a 0, all the codes in the right subset get codes starting with a 1. 
Now recurse. When a subset contains only one item, we have determined 
its code.

Trouble is, splitting a set into two subsets with equal sum... That's 
the Partition Problem, and it's NP-complete. Which is a technical way of 
saying that this is REALLY FREAKING HARD to do well. Worse, there may be 
more than one partitioning that yields roughly 50%, and which one you 
pick affects how easy subsequent partitionings are. This looks really, 
really hard...

Worse still, it can be proven that Shannon-Fano trees are sometimes 
unavoidably sub-optimal. So this approach is really hard to implement 
well, really slow to execute, and doesn't even give optimal results. 
Great. :-/

The thing is, you don't need to do any of this. Because there's a way, 
way easier algorithm: You can build a Huffman tree instead.

Instead of starting with the set of all characters and recursively 
splitting it to build a tree, we start with single characters and 
recursively join them to build a tree. The algorithm (sketched in code 
just below) is:

   1. Start with a set of trivial one-element trees, each containing a 
single character.
   2. Pick the two lowest-probability trees and remove them from the set.
   3. Combine the two trees, and insert the new tree back into the set.
   4. If the set has more than one tree, go to step 2.
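
Here's a minimal sketch of that algorithm in Python, using a priority 
queue to grab the two lowest-probability trees each time. (The 
tuple-based tree format and the function names are just for this post.)

   import heapq
   from itertools import count

   def build_huffman_tree(probs):
       # probs maps each character to its probability (raw counts work too).
       tie = count()       # tie-breaker so heapq never tries to compare trees
       heap = [(p, next(tie), ch) for ch, p in probs.items()]
       heapq.heapify(heap)
       while len(heap) > 1:
           p1, _, t1 = heapq.heappop(heap)    # the two lowest-probability trees
           p2, _, t2 = heapq.heappop(heap)
           heapq.heappush(heap, (p1 + p2, next(tie), (t1, t2)))
       return heap[0][2]

   def code_table(tree, prefix=""):
       # Walk the tree; a leaf's code is the path of 0s and 1s that reaches it.
       if not isinstance(tree, tuple):
           return {tree: prefix or "0"}
       left, right = tree
       return {**code_table(left, prefix + "0"),
               **code_table(right, prefix + "1")}

Feeding it the example probabilities below gives the same code lengths as 
the hand-worked tree (the exact 0/1 patterns may differ, depending on how 
ties are broken).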

It's difficult to come up with an example big enough to see what's 
happening, but small enough to not take thirty pages of printout. Let's 
try this:

   Char  Prob
    A      2%
    B      2%
    C      5%
    D     20%
    E     68%
    F      3%

Our initial set of trees is therefore:

    2%    2%    3%    5%    20%    68%
    A     B     F     C      D      E

The two lowest-probability trees are A (2%) and B (2%). 2% + 2% = 4%. So 
now we get:

    3%    4%    5%    20%    68%
    F     +     C      D      E
         / \
        A   B
        0   1

Now we join F to AB:

    5%    7%    20%    68%
    C     +      D      E
         / \
        F   +
        0  / \
          A   B
         10   11

Now we join C to that tree:

     12%    20%    68%
      +      D      E
     / \
    C   +
    0  / \
      F   +
     10  / \
        A   B
      110   111

Add on D:

       32%    68%
        +      E
       / \
      +   D
     / \  1
    C   +
   00  / \
      F   +
    010  / \
        A   B
     0110   0111

And finally, our Huffman tree:

        100%
          +
         / \
        +   E
       / \  1
      +   D
     / \  01
    C   +
  000  / \
      F   +
   0010  / \
        A   B
    00110   00111

So we have

   Char   Prob   Code    Bits   Log2(Prob)
    A      2%    00110   5      -5.643
    B      2%    00111   5      -5.643
    C      5%    000     3      -4.321
    D     20%    01      2      -2.321
    E     68%    1       1      -0.556
    F      3%    0010    4      -5.058

As you can see, the actual number of bits is quite close to -Log2(Prob). 
(Only C is a bit off.) If E really does make up 68% of the characters in 
the file, then by encoding it with just 1 bit, we're saving a huge 
amount of space.

The trouble with all this, of course, is that in order to *decode* the 
data, you need a copy of the Huffman tree used to encode it. So you have 
to store that somewhere. Fortunately, if you pick the codes in a 
suitable systematic way, it's possible to reconstruct the codes just by 
knowing how many bits were assigned to each character.
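
The usual systematic scheme is called a "canonical" Huffman code: sort 
the characters by (bit length, character), then hand out codes by 
counting upwards, shifting left whenever the length increases. A sketch, 
assuming you already know each character's bit length:

   def canonical_codes(bit_lengths):
       # bit_lengths maps character -> code length, e.g. {'E': 1, 'D': 2, ...}
       codes, code, prev_len = {}, 0, 0
       for ch, length in sorted(bit_lengths.items(), key=lambda kv: (kv[1], kv[0])):
           code <<= (length - prev_len)      # lengthen the code when needed
           codes[ch] = format(code, "0%db" % length)
           code += 1
           prev_len = length
       return codes

For the table above this produces E=0, D=10, C=110, F=1110, A=11110, 
B=11111 - different bit patterns from the hand-built tree, but the same 
lengths, so it compresses exactly as well, and the decompressor can 
regenerate the codes from the lengths alone.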



Alright, so that's nice and everything. But look at that table again:

   Char   Prob   Code    Bits   Log2(Prob)
    A      2%    00110   5      -5.643
    B      2%    00111   5      -5.643
    C      5%    000     3      -4.321
    D     20%    01      2      -2.321
    E     68%    1       1      -0.556
    F      3%    0010    4      -5.058

Ideally, we want to assign half a bit to E. Well, that's *obviously* 
impossible, right? I mean, a bit is a 0 or a 1. What the heck would half 
a bit look like?

It turns out you *can* actually do this. (!) Suppose we want to use 
exactly 1.5 bits per character. If you ask the encoder to encode 1 
character, then 2 bits come out. But if you ask it to encode 2 
characters, only 3 bits come out. Not 4, just 3. And this is what it 
means; the number of bits output is always a whole number (obviously), 
but it's the fractional number of bits, rounded up. (It's a bit like 
buying a box of 7400s priced at 21.225p each. Obviously there's no such 
thing as 0.225p; if you buy one chip, they charge you 22p. But if you 
buy 1,000 chips...)

Huffman encoding can't do fractional bits. And there are some 
limitations on the combinations of bit-lengths you can get. (E.g., C 
should be 4 bits or 5 bits, but we've given it 3 bits.) But there is an 
algorithm that lacks these limitations: range coding.

I'm going to demonstrate this in decimal, but the principle in binary is 
just the same. We start off with an "active range", which initially runs 
from 0.0000 to 0.9999:

   Max = 0.9999
   Min = 0.0000

We then split that range into several subranges, one for each possible 
character. The width of each subrange equals the probability of the 
corresponding character appearing next:

         0.9999
    F    ||||||
         0.9700
         0.9699
    E    ||||||
         0.2900
          0.2899
    D    ||||||
         0.0900
         0.0899
    C    ||||||
         0.0400
         0.0399
    B    ||||||
         0.0200
         0.0199
    A    ||||||
         0.0000

Suppose the first character to be encoded happens to be a C. Then that 
becomes the current range:

   Max = 0.0899
   Min = 0.0400

and we split this new subrange up as before:

         0.089999
    F    ||||||||
          0.088500
          0.088499
    E    ||||||||
         0.054500
         0.054499
    D    ||||||||
         0.044500
         0.044499
    C    ||||||||
         0.042000
         0.041999
    B    ||||||||
         0.041000
         0.040999
    A    ||||||||
         0.040000

This process continues for each and every character. Finally, at the end 
of the calculation, we have a really, really tiny range. For example:

   Max = 0.04487515156722040876510168073524657069870453400
   Min = 0.04487515156722040876510168073524657069870453399

We then output one of the two values (it doesn't matter which), together 
with some indication of how many characters are supposed to be decoded. 
(E.g., if the input was just lots of As, the min value would stay at 
zero forever, and we couldn't tell how many As to decode.)
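
Here's a toy version of the encoder, keeping the range as exact 
fractions so there are no precision worries yet (which also makes it 
hopelessly impractical for big files). The probability table matches the 
example above; the names are invented for this post:

   from fractions import Fraction

   PROBS = {"A": "0.02", "B": "0.02", "C": "0.05",
            "D": "0.20", "E": "0.68", "F": "0.03"}

   def build_intervals(probs):
       # Stack the probabilities into subranges: A gets [0, 0.02), B gets
       # [0.02, 0.04), C gets [0.04, 0.09), and so on up to 1.
       low, table = Fraction(0), {}
       for ch in sorted(probs):
           width = Fraction(probs[ch])
           table[ch] = (low, low + width)
           low += width
       return table

   def range_encode(message):
       table = build_intervals(PROBS)
       low, high = Fraction(0), Fraction(1)
       for ch in message:
           ch_low, ch_high = table[ch]
           width = high - low
           low, high = low + width * ch_low, low + width * ch_high
       return low, high    # any number in [low, high) identifies the message

   low, high = range_encode("CA")   # low = 0.04, high = 0.041, as in the table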

This is great and all, but it looks like we need to calculate with 
*ridiculous* precision. If I compress a 2GB file and it becomes a 200MB 
file, that's 200MB for *one number*. We have to do all calculations with 
200MB of precision?! That doesn't sound fun...

Fortunately, this is not the case. As the range gets smaller and 
smaller, soon the leading digits all become the same. Look at the table 
above: every single code begins 0.0xxx. So we can output the 0.0 part, 
and shift it out of our calculations. The result is that when we get to 
here:

   Max = 0.448737
   Min = 0.448611

we can output "0.448" and change the numbers to

   Max = 73799999
   Min = 61100000

We can then continue our calculation as before. So we only need to hold 
in memory the handful of digits that are *actually changing*. To the 
left are digits we've already output into the compressed file. To the 
right, the max is 999 recurring, and the min is 000 recurring.
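
Here's a sketch of that digit-shifting trick, treating min and max as 
fixed-width decimal integers (eight digits, purely to line up with the 
example above):

   def shift_out_matching_digits(low, high, digits=8):
       # Emit the leading decimal digits low and high agree on, then rescale
       # so we only ever track `digits` decimal places of each.
       out, scale = [], 10 ** (digits - 1)
       while low // scale == high // scale:
           d = low // scale
           out.append(str(d))
           low  = (low  - d * scale) * 10       # shift left, pad the min with a 0
           high = (high - d * scale) * 10 + 9   # shift left, pad the max with a 9
       return "".join(out), low, high

   # 0.44861100 / 0.44873799  ->  emit "448", carry on with 61100000 / 73799999
   assert shift_out_matching_digits(44861100, 44873799) == ("448", 61100000, 73799999)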

(There is an issue called "underflow", where the min gets stuck at 9999 
and the max gets stuck at 0000 and the digits don't meet before you run 
out of precision. You can work around that by "remembering" how many 0s 
and 9s there are, and cancelling them out temporarily. When the digits 
do align, you just need to remember to reanimate the cancelled digits. 
It's fiddly, but not hard.)

Notice that the narrower a symbol's subrange is, the more it shrinks the 
active range, and so the sooner the leading digits of min and max match 
and get written out to the compressed file. In other words, narrow ranges 
take more digits to specify. To summarise, rare symbols cost more digits 
than common ones - just like they cost more bits with Huffman codes.



Huffman coding and range coding are both examples of "entropy coding". 
In either case, how much compression you get depends on:

   1. How much difference there is between the probabilities of the symbols.
   2. How accurately you can predict the probabilities.

#1 is a property of the file itself; you can't change that. Some files 
just compress better than others. #2, however, depends on how you 
estimate your symbol probabilities.

The simplest thing to do is to count how many times each character 
appears in the whole file, and divide by the total character count to 
get a probability for each character. That *works*, but the decoder 
doesn't have access to the uncompressed file to compute the statistics, 
which means you have to *tell* the decompressor what to do.

Fundamentally, there are two kinds of compression:

* In "static" compression, you scan the entire file, decide how to 
compress it, and then do the actual compression.

* In "dynamic" compression, you decide how to compress the next item 
based only on the data you've already processed so far.

Static compression has the advantage that you can "see" all the data, so 
you can make the best compression decisions. But the decompressor can't 
see that data, so you have to "tell" the decompressor what you decided 
to do (e.g., tell it what Huffman table you built), or it won't be able 
to decompress. "Telling" the decompressor takes up disk space, making 
the file larger and partially negating the potentially superior 
compression ratio.

Dynamic compression means you have limited information to work with, so 
you can't make compression decisions quite as well, resulting in a lower 
compression rate. On the other hand, the decompressor can "see" exactly 
what the compressor saw, and by running the same algorithm it can come 
to the same conclusion. In other words, you don't have to "tell" the 
decompressor what you did; it can work it out automatically from the 
compressed data itself. So you get lower compression rates, but there's 
no extra data tables to send, so there's less file overhead.

In the end, it's swings and roundabouts. But I will say this: some 
algorithms work better static, and some work better dynamic.

You can do dynamic Huffman compression. You start by assuming every 
character is equally frequent, and AFTER compressing each character, you 
update your frequency counts and rebuild the whole Huffman tree. The 
decompressor can do the same, and hence recover the Huffman tree without 
needing any tables. However, rebuilding the Huffman tree takes time; more 
likely you'll rebuild it every dozen characters or so, otherwise speed 
will suffer too much.
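
In sketch form, dynamic Huffman might look something like this, reusing 
the build_huffman_tree and code_table sketches from earlier (the rebuild 
interval of 16 is an arbitrary choice):

   from collections import Counter

   def dynamic_huffman_encode(text, rebuild_every=16):
       counts = Counter({chr(i): 1 for i in range(128)})   # all equally likely
       codes = code_table(build_huffman_tree(counts))
       out = []
       for i, ch in enumerate(text, start=1):
           out.append(codes[ch])
           counts[ch] += 1                   # update the model AFTER coding, so
           if i % rebuild_every == 0:        # the decoder can do exactly the same
               codes = code_table(build_huffman_tree(counts))
       return "".join(out)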

But in general, people usually do static compression when they use 
Huffman coding. It's quicker to build the Huffman tree once, and then 
send it once.

You can do static range coding. You calculate all your probabilities, 
write them into the header of the file, and then encode everything. 
It'll probably take a bit of space (it's not just bit-counts like 
Huffman, it's limited-precision probability values for each character), 
but you can do it.

The key advantage of range coding, though, is that it *costs nothing* to 
completely change all your probability estimates for every single 
character. So range coders are usually dynamic.

Whether Huffman or range coder, static compression of an English text 
ASCII file usually gets about 50% compression. That's good, but not 
fantastic. I just tried some random text files on my PC, and by zipping 
them I got around 80% compression. So how do we do that?

A key thing you can do is to use a higher-order model. U is a fairly 
uncommon letter. There aren't too many words with U in them. So the 
probability of seeing a U might seem fairly low. But if the previous 
letter happens to be Q... surely the probability that the next letter is 
a U ought to be *a lot* higher now!

If you use one fixed set of probabilities for every character, 
regardless of context, that's an "order-0 model". If you keep a separate 
set of probabilities for each possible previous letter, that's an 
"order-1 Markov model". And it can give substantially improved 
compression.
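
Here's a sketch of the bookkeeping behind an order-1 model: one table of 
counts per possible previous byte (plain dictionaries, purely for 
readability):

   from collections import Counter, defaultdict

   class Order1Model:
       def __init__(self):
           # One frequency table per context (i.e. per previous byte). Every
           # count starts at 1 so no character ever has probability zero.
           self.tables = defaultdict(lambda: Counter({b: 1 for b in range(256)}))

       def probability(self, prev, ch):
           table = self.tables[prev]
           return table[ch] / sum(table.values())

       def update(self, prev, ch):
           self.tables[prev][ch] += 1   # compressor and decompressor both do this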

Of course, coming up with probabilities for 256 different byte values is 
one thing. Coming up with 256 tables of 256 probabilities, that's 65,536 
probabilities we need to compute... If each one takes 1 byte to store, 
that's 64KB just to store the probability tables! If you're trying to 
compress a 5KB text file, adding 64KB of overhead is going to give you 
some serious negative compression! (I.e., the "compressed" file will 
actually be *bigger* than the original.)

Suffice it to say, nobody does static order-1 modelling. It eats too 
much space. But dynamic? Sure. And *that* probably means you're using a 
range coder.

But why base your probabilities only on the *last* character? Why not 
the last two? Or three? Or ten?

In fact, if you do this, you quickly hit a problem: if you've never seen 
a particular sequence of characters before, how do you estimate the 
probabilities of what comes after it?

The usual solution is "prediction by partial matching" (PPM). The 
probability tables [of which there are a hell of a lot!] are sparse. If 
you haven't seen a certain 10-character sequence before, you ignore the 
oldest character and look up the corresponding 9-character sequence. If 
you haven't seen that either, you try the 8-character one, and so on, 
until you finally get a match. In this way, for each character you can 
make the best possible 
probability estimate. And better probability estimates = better 
compression. (And because this is all dynamic, there's no need to store 
huge tables in the compressed file. The decompressor will run the same 
algorithm and compute the same estimates.)
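
A very rough sketch of that back-off search (real PPM also has to encode 
an explicit "escape" symbol each time it falls back to a shorter context, 
which this toy version quietly ignores):

   def best_context_table(history, tables, max_order=10):
       # Find the longest previously-seen suffix of `history` that we have a
       # frequency table for, backing off one character at a time.
       for order in range(min(max_order, len(history)), -1, -1):
           context = history[len(history) - order:]
           if context in tables:
               return tables[context]   # best available probability estimates
       return None                      # nothing seen at all: use a flat model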

Compressors such as PAQ use range coding and PPM (among other things) 
to top the compression ratio ratings. (At the expense of speed and 
memory usage, it must be said.)



Actually, PAQ is interesting, since it uses context mixing. That is, 
half a dozen different algorithms (not just PPM) estimate the 
probabilities for the next character. It then uses a small, continuously 
trained artificial neural network (!!) to combine these probability 
estimates to get a final estimate.

It does all of this PER BIT!

The result is an excruciatingly slow compressor that gives astonishing 
compression ratios.
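
For flavour, here's a toy version of just the mixing step: each model 
guesses the probability that the next bit is a 1, the guesses are 
combined in the logit domain with learned weights, and the weights get 
nudged after the real bit arrives. This is only a cartoon of the idea, 
not PAQ's actual code:

   import math

   def logit(p):  return math.log(p / (1.0 - p))
   def squash(x): return 1.0 / (1.0 + math.exp(-x))

   class Mixer:
       def __init__(self, n_models, learning_rate=0.02):
           self.w = [0.0] * n_models
           self.lr = learning_rate

       def mix(self, probs):
           # Combine several P(next bit = 1) estimates into one.
           self.inputs = [logit(min(max(p, 1e-6), 1 - 1e-6)) for p in probs]
           self.p = squash(sum(w * x for w, x in zip(self.w, self.inputs)))
           return self.p

       def learn(self, bit):
           # After seeing the actual bit, nudge each weight towards whichever
           # models predicted it well. This happens for every single bit.
           error = bit - self.p
           self.w = [w + self.lr * error * x for w, x in zip(self.w, self.inputs)]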

