OK, so it's been a while since I posted a big wall of text. So here goes...
The name of the game is data compression. If you already know how that
works, this is going to be a VERY boring post. If you don't know, it
probably seems like magic; you take a file, throw it at WinZip, and
suddenly the file gets 4x smaller.
Now MP3 does something similar to sound files, but it manages it by DELETING DATA. The
compressed file has lower fidelity than the original. It's not that
surprising that you can make a file smaller by removing data from it.
But LOSSLESS compression algorithms do something seemingly impossible:
they take a file, somehow make it smaller, and yet the process is
EXACTLY REVERSIBLE. When you decompress the file again, it is IDENTICAL
to the original. How is that even possible?
Well, it ISN'T always possible. Sometimes when you zip a file it gets
tiny, and sometimes it doesn't get that much smaller. You may not have
experienced this, but occasionally a file will get LARGER. For example,
zip a file - any file - and then go zip the zip file. It WILL NOT get
any smaller. I promise. (Think about it; otherwise you could take an
arbitrary file and zip it over and over until it gets arbitrarily small.
But that really IS impossible!)
Rather than dwell on this further, I'm going to show you some
compression algorithms now. We're going to start with Huffman coding.
As you probably know, the American Standard Code for Information
Interchange ("ASCII") assigns a code to every letter of the alphabet
(plus digits, punctuation and so on), stored as 8 bits per character.
(You may or may not know that technically ASCII is actually a 7-bit
code; any text encoding which actually *uses* the 8th bit for
something technically isn't ASCII, it's an ASCII extension.)
You might also be aware that Unicode defines zillions of characters,
with code points up to 21 bits long. In UTF-32, every character is
stored as 32 bits. In
UTF-16, almost all characters are stored as 16 bits, but particularly
unusual characters take 32 bits. And in UTF-8, many characters take only
8 bits, a few take 16 bits, the really rare ones take 24 bits and a tiny
few take 32 bits. (The extra bits are necessary to signal how many bits
make up the character.)
The various UTF encodings are of course designed to make encoding and
decoding as painless as possible, while not wasting absurd amounts of
space. UTF-32 is particularly easy to deal with, but very space-hungry.
UTF-8 is the most frugal, but quite awkward if there are lots of
multi-byte characters.
But suppose we want to assign codes to characters, not so that
processing is easy, but so that space is minimised. Ease of use is not a
requirement; we just want to crunch the space down as much as possible.
It seems obvious that the fewer bits per character we have, the less
space this is all going to take up. If you have a plain ASCII file, just
dropping every 8th bit already gives us 1/8 compression ( = 12.5%).
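To make that concrete, here's a tiny Python sketch (my own toy code, not
what WinZip actually does) which packs plain 7-bit ASCII text into 7 bits
per character:
def pack_ascii7(text: bytes) -> bytes:
    bits = []
    for byte in text:
        assert byte < 128, "not 7-bit ASCII"
        # keep the 7 low bits, most significant first
        bits.extend((byte >> i) & 1 for i in reversed(range(7)))
    while len(bits) % 8:              # pad the final byte with zeros
        bits.append(0)
    out = bytearray()
    for i in range(0, len(bits), 8):
        out.append(int("".join(map(str, bits[i:i+8])), 2))
    return bytes(out)                 # ceil(7 * len(text) / 8) bytes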
We can go one step further: Scan the whole file, and see which
characters are *actually used* in this file. Codes 0 to 31 in ASCII are
all "control codes", and with the exception of the newline character(s),
they're all pretty much obsolete now. By not assigning any code at all
to characters we aren't actually using, we *might* be able to get away
with 6 bits per character. Maybe.
But even if we do, that's still "only" 25% compression. And if there are
64 characters used then 6 bits per character will do, but if there are 65
characters used then we have to have 7 bits - which is annoying.
But now, suppose we assign different code lengths to different
characters. By assigning shorter codes to common letters and longer
codes to rare letters, we can potentially store the same data in very
much reduced space.
Suppose, for example, that we assign a 2-bit code to the letter E, and
7-bit codes to everything else. Then every single time the letter E
appears, we're saving 6 bits (compared to the original ASCII file).
Since E probably appears quite a bit, that could add up to substantial
savings. (Hint: The character " " probably appears *way more* than ANY
actual letter of the alphabet, but hey.)
It might seem tempting to assign the shortest possible codes to
everything. However, for X number of bits, there are only 2^X possible
codes. If you make one code shorter, you have to make other codes
correspondingly longer. So if you take a character that isn't *that*
common and give it a really short code, you'll have to make all the
other codes longer to compensate. Because the character isn't *that*
common, the savings from the short code don't make up for the losses in
all the other longer codes.
In summary, for every character there is a mathematically "perfect" code
length, which gives the best possible compression. And that length is
equal to [the negation of] the binary logarithm of the probability of
that character appearing:
bits(X) = -log2 Pr(X)
If X takes more bits than this, you're wasting space. If X takes fewer
bits than this, you're wasting bits somewhere else. So this is the
"perfect" length of X.
(Note that if we were trying to encode characters with decimal codes, it
would be log10 rather than log2. The 2 is because computers are binary.)
You can work out the "probability of X appearing" by simply counting how
many X characters there are in the entire file, divided by the *total*
number of characters. So that's easy enough.
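In Python, for example, that's just a few lines (a quick sketch; the
function name is my own):
from collections import Counter
from math import log2
def ideal_code_lengths(data: bytes) -> None:
    counts = Counter(data)
    total = len(data)
    for byte, count in counts.most_common():
        p = count / total                  # probability of this character
        print(f"{chr(byte)!r}  Pr = {p:.4f}  ideal bits = {-log2(p):.3f}")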
Now, how to assign codes to characters? And how to figure out where one
character ends and the next begins? In a normal (ASCII) text file, you
know that each byte is one character. But if different characters take
different numbers of bits, how are we to know where the start and end of
each character code is?
The answer is to use a binary tree. For example:
+
/ \
+ C
/ \
A B
Starting from the top, go left for 0 and go right for 1. If we start at
the top and go right, we immediately reach C. So C = 1. If we go left
and then left, we get to A. So A = 00. And B = 01. So from this tree, we
have
A = 00
B = 01
C = 1
As we read bits from the input file, we can walk down the tree, going
left when we read a 0 and right when we read a 1. When we get to a leaf,
we spit out that character, and go back to the top of the tree. Easy.
[But very inefficient. There are tricks to speed this up, though.]
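Here's that decoding walk as a little Python sketch (names are mine),
using the toy tree above (A = 00, B = 01, C = 1), with each internal node
written as a (left, right) pair and each leaf as a character:
tree = (("A", "B"), "C")
def decode(bits: str, tree) -> str:
    out, node = [], tree
    for bit in bits:               # e.g. bits = "00011"
        node = node[int(bit)]      # 0 = go left, 1 = go right
        if isinstance(node, str):  # reached a leaf: emit it, back to the top
            out.append(node)
            node = tree
    return "".join(out)
# decode("00011", tree) returns "ABC"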
So to build a code which can be unambiguously decoded, we just need to
build a suitable code tree. How can we do that?
One way is to build a "Shannon-Fano tree". We count the characters in the
file, and end up with a table like
Char Prob
A 7%
B 12%
C 3%
... ...
Now we need to build a tree. What we do is try to split the set of
all [used] characters into two subsets whose total probabilities are as
close to 50/50 as possible. All the items in the left subset get codes
starting with a 0, and all the items in the right subset get codes
starting with a 1. Now recurse on each subset. When a subset contains
only one item, we have determined its code.
Trouble is, splitting a set into two subsets with equal sum... That's
the Partition Problem, and it's NP-complete. Which is a technical way of
saying that this is REALLY FREAKING HARD to do well. Worse, there may be
more than one partitioning that yields roughly 50%, and which one you
pick affects how easy subsequent partitionings are. This looks really,
really hard...
Worse still, it can be proven that Shannon-Fano trees are sometimes
unavoidably sub-optimal. So this algorithm is really hard to program,
really slow to execute, and doesn't even give optimal results. Great. :-/
The thing is, you don't need to do any of this. Because there's a way,
way easier algorithm: You can build a Huffman tree instead.
Instead of starting with the set of all characters and trying to
recursively split it to build a tree, we start with single characters
and recursively join them to build a tree. The algorithm is:
1. Start with a set of trivial 1-element trees, each containing just a
character (and that character's probability).
2. Pick the two lowest-probability trees and remove them from the set.
3. Join the two trees under a new root; the new tree's probability is the
sum of the two. Insert the new tree back into the set.
4. If the set still has more than one tree, go to step 2.
It's difficult to come up with an example big enough to see what's
happening, but small enough to not take thirty pages of printout. Let's
try this:
Char Prob
A 2%
B 2%
C 5%
D 20%
E 68%
F 3%
Our initial set of trees is therefore:
2% 2% 3% 5% 20% 68%
A B F C D E
The two lowest-probability trees are A (2%) and B (2%). 2% + 2% = 4%. So
now we get:
3% 4% 5% 20% 68%
F + C D E
/ \
A B
0 1
Now we join F to AB:
5% 7% 20% 68%
C + D E
/ \
F +
0 / \
A B
10 11
Now we join C to that tree:
12% 20% 68%
+ D E
/ \
C +
0 / \
F +
10 / \
A B
110 111
Add on D:
32% 68%
+ E
/ \
+ D
/ \ 1
C +
00 / \
F +
010 / \
A B
0110 0111
And finally, our Huffman tree:
100%
+
/ \
+ E
/ \ 1
+ D
/ \ 01
C +
000 / \
F +
0010 / \
A B
00110 00111
So we have
Char Prob Code Bits Log2(Prob)
A 2% 00110 5 -5.643
B 2% 00111 5 -5.643
C 5% 000 3 -4.321
D 20% 01 2 -2.321
E 68% 1 1 -0.556
F 3% 0010 4 -5.058
As you can see, the actual number of bits is quite close to -Log2(Prob).
(Only C is a bit off.) If E really does make up 68% of the characters in
the file, then by encoding it with just 1 bit, we're saving a huge
amount of space.
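If you want to play with this, here's the four-step algorithm as a Python
sketch, using the standard heapq module as the "set". (My own code, so
treat it as illustrative; depending on how ties are broken you can get
different 0/1 labels, but the code *lengths* always come out the same.)
import heapq
def huffman_codes(weights: dict) -> dict:
    # one trivial tree per character; the counter just breaks ties
    heap = [(w, i, ch) for i, (ch, w) in enumerate(weights.items())]
    heapq.heapify(heap)
    count = len(heap)
    while len(heap) > 1:
        w1, _, t1 = heapq.heappop(heap)    # two lowest-probability trees
        w2, _, t2 = heapq.heappop(heap)
        count += 1
        heapq.heappush(heap, (w1 + w2, count, (t1, t2)))  # join, reinsert
    codes = {}
    def walk(node, prefix):
        if isinstance(node, tuple):        # internal node: 0 left, 1 right
            walk(node[0], prefix + "0")
            walk(node[1], prefix + "1")
        else:
            codes[node] = prefix or "0"
    walk(heap[0][2], "")
    return codes
print(huffman_codes({"A": 2, "B": 2, "C": 5, "D": 20, "E": 68, "F": 3}))
# {'C': '000', 'F': '0010', 'A': '00110', 'B': '00111', 'D': '01', 'E': '1'}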
The trouble with all this, of course, is that in order to *decode* the
data, you need a copy of the Huffman tree used to encode it. So you have
to store that somewhere. Fortunately, if you pick the codes in a
suitable systematic way, it's possible to reconstruct the codes just by
knowing how many bits were assigned to each character.
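One such systematic scheme is the "canonical" Huffman code: sort the
characters by code length (breaking ties alphabetically) and hand out bit
patterns in counting order. A sketch, again in Python - the exact bit
patterns come out mirrored compared to the tree above, but the lengths
(and hence the compression) are identical, and real formats like DEFLATE
transmit their Huffman tables essentially this way:
def canonical_codes(lengths: dict) -> dict:
    codes, code, prev_len = {}, 0, 0
    # hand out codes in order of (length, character)
    for ch, length in sorted(lengths.items(), key=lambda kv: (kv[1], kv[0])):
        code <<= (length - prev_len)       # grow the code to the new length
        codes[ch] = format(code, f"0{length}b")
        code += 1
        prev_len = length
    return codes
print(canonical_codes({"A": 5, "B": 5, "C": 3, "D": 2, "E": 1, "F": 4}))
# {'E': '0', 'D': '10', 'C': '110', 'F': '1110', 'A': '11110', 'B': '11111'}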
Alright, so that's nice and everything. But look at that table again:
Char Prob Code Bits Log2(Prob)
A 2% 00110 5 -5.643
B 2% 00111 5 -5.643
C 5% 000 3 -4.321
D 20% 01 2 -2.321
E 68% 1 1 -0.556
F 3% 0010 4 -5.058
Ideally, we want to assign about half a bit to E. Well, that's *obviously*
impossible, right? I mean, a bit is a 0 or a 1. What the heck would half
a bit look like?
It turns out you *can* actually do this. (!) Suppose we want to use
exactly 1.5 bits per character. If you ask the encoder to encode 1
character, then 2 bits come out. But if you ask it to encode 2
characters, only 3 bits come out. Not 4, just 3. And this is what it
means; the number of bits output is always a whole number (obviously),
but it's the fractional number of bits, rounded up. (It's a bit like
buying a box of 7400s priced at 21.225p each. Obviously there's no such
thing as 0.225p; if you buy one chip, they charge you 22p. But if you
buy 1,000 chips...)
Huffman encoding can't do fractional bits. And there are some
limitations on the combinations of bit-lengths you can get. (E.g., C
should be 4 bits or 5 bits, but we've given it 3 bits.) But there is an
algorithm that lacks these limitations: range coding.
I'm going to demonstrate this in decimal, but the principle in binary is
just the same. We start off with an "active range", which begins with
0.0000 to 0.9999:
Max = 0.9999
Min = 0.0000
We then split that range into several subranges, one for each possible
character. The width of that range equals the probability of that
character appearing next:
0.9999
F ||||||
0.9700
0.9699
E ||||||
0.2900
0.2899
D ||||||
0.0900
0.0899
C ||||||
0.0400
0.0399
B ||||||
0.0200
0.0199
A ||||||
0.0000
Suppose the first character to be encoded happens to be a C. Then that
becomes the current range:
Max = 0.0899
Min = 0.0400
and we split this new subrange up as before:
0.089999
F ||||||||
0.088500
0.088499
E ||||||||
0.054500
0.054499
D ||||||||
0.044500
0.044499
C ||||||||
0.042000
0.041999
B ||||||||
0.041000
0.040999
A ||||||||
0.040000
This process continues for each and every character. Finally, at the end
of the calculation, we have a really, really tiny range. For example:
Max = 0.04487515156722040876510168073524657069870453400
Min = 0.04487515156722040876510168073524657069870453399
We then output one of the two values (it doesn't matter which), together
with some indication of how many characters are supposed to be decoded.
(E.g., if the input was just lots of As, the min value would stay at
zero forever, and we couldn't tell how many As to decode.)
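Here's a toy Python sketch of exactly that walkthrough. It leans on
Python's exact Fraction type, which quietly hides the precision problem
we're about to run into, and it just returns the final min and max; a
real encoder would also record the number of characters, as mentioned:
from fractions import Fraction
PROBS = {"A": 2, "B": 2, "C": 5, "D": 20, "E": 68, "F": 3}   # percentages
def encode(message: str):
    low, width = Fraction(0), Fraction(1)
    for ch in message:
        cum = 0                            # find this character's sub-range
        for sym, p in PROBS.items():
            if sym == ch:
                low += width * Fraction(cum, 100)
                width *= Fraction(p, 100)
                break
            cum += p
    return low, low + width                # any value in between will decode
print(encode("C"))   # Fraction(1, 25) to Fraction(9, 100): 0.0400 to 0.0899...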
This is great and all, but it looks like we need to calculate with
*ridiculous* precision. If I compress a 2GB file and it becomes a 200MB
file, that's 200MB for *one number*. We have to do all calculations with
200MB of precision?! That doesn't sound fun...
Fortunately, this is not the case. As the range gets smaller and
smaller, soon the leading digits all become the same. Look at the table
above: every single code begins 0.0xxx. So we can output the 0.0 part,
and shift it out of our calculations. The result is that when we get to
here:
Max = 0.448737
Min = 0.448611
we can output "0.448" and change the numbers to
Max = 73799999
Min = 61100000
We can then continue our calculation as before. So we only need to hold
in memory the handful of digits that are *actually changing*. To the
left are digits we've already output into the compressed file. To the
right, the max is 999 recurring, and the min is 000 recurring.
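As a toy illustration of that shuffling, treating the two ends of the
range as strings of decimal digits (real coders do the same thing in
binary, inside a machine register):
def shift_out(lo: str, hi: str, out: list):
    while lo[0] == hi[0]:          # leading digits agree...
        out.append(lo[0])          # ...so they can go straight to the file
        lo = lo[1:] + "0"          # min is 000... recurring on the right
        hi = hi[1:] + "9"          # max is 999... recurring on the right
    return lo, hi
digits = []
lo, hi = shift_out("448611", "448737", digits)
# digits == ['4', '4', '8'], lo == "611000", hi == "737999"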
(There is an issue called "underflow", where the min gets stuck at 9999
and the max gets stuck at 0000 and the digits don't meet before you run
out of precision. You can work around that by "remembering" how many 0s
and 9s there are, and cancelling them out temporarily. When the digits
do align, you just need to remember to reanimate the cancelled digits.
It's fiddly, but not hard.)
Notice that the smaller a range is, the quicker the leading digits start
to match, and the sooner we write digits out to the compressed file. In
other words, narrow ranges take more precision to specify. To summarise,
rare symbols take more digits than common ones - just like with Huffman
codes.
Huffman coding and range coding are both examples of "entropy coding".
In either case, how much compression you get depends on:
1. How much difference there is between the probabilities of the symbols.
2. How accurately you can predict the probabilities.
#1 is a property of the file itself; you can't change that. Some files
just compress better than others. #2, however, depends on how you
estimate your symbol probabilities.
The simplest thing to do is to count how many times each character
appears in the whole file, and divide by the total character count to
get a probability for each character. That *works*, but the decoder
doesn't have access to the uncompressed file to compute the statistics,
which means you have to *tell* the decompressor what to do.
Fundamentally, there are two kinds of compression:
* In "static" compression, you scan the entire file, decide how to
compress it, and then do the actual compression.
* In "dynamic" compression, you decide how to compress the next item
based only on the data you've already processed so far.
Static compression has the advantage that you can "see" all the data, so
you can make the best compression decisions. But the decompressor can't
see that data, so you have to "tell" the decompressor what you decided
to do (e.g., tell it what Huffman table you built), or it won't be able
to decompress. "Telling" the decompressor takes up disk space, making
the file larger and partially negating the potentially superior
compression ratio.
Dynamic compression means you have limited information to work with, so
you can't make compression decisions quite as well, resulting in a lower
compression rate. On the other hand, the decompressor can "see" exactly
what the compressor saw, and by running the same algorithm it can come
to the same conclusion. In other words, you don't have to "tell" the
decompressor what you did; it can work it out automatically from the
compressed data itself. So you get lower compression rates, but there
are no extra data tables to send, so there's less file overhead.
In the end, it's swings and roundabouts. But I will say this: some
algorithms work better static, and some work better dynamic.
You can do dynamic Huffman compression. You start by assuming all
characters are equally likely, and AFTER compressing each character,
update your frequency estimates and rebuild the whole Huffman tree. The
decompressor can do the same, and hence recover the Huffman tree without
needing any tables. However, rebuilding the Huffman tree takes time;
more likely you'll update it every dozen characters or something,
otherwise speed will suffer too much.
But in general, people usually do static compression when they use
Huffman coding. It's quicker to build the Huffman tree once, and then
send it once.
You can do static range coding. You calculate all your probabilities,
write them into the header of the file, and then encode everything.
It'll probably take a bit of space (it's not just bit-counts like
Huffman, it's limited-precision probability values for each character),
but you can do it.
The key advantage of range coding, though, is that it *costs nothing* to
completely change all your probability estimates for every single
character. So range coders are usually dynamic.
Whether Huffman or range coder, static compression of an English text
ASCII file usually gets about 50% compression. That's good, but not
fantastic. I just tried some random text files on my PC, and by zipping
them I got around 80% compression. So how do we do that?
A key thing you can do is to use a higher-order model. U is a fairly
uncommon letter. There aren't too many words with U in them. So the
probability of seeing a U might seem fairly low. But if the previous
letter happens to be Q... surely the probability that the next letter is
a U ought to be *a lot* higher now!
If you calculate one fixed probability for each character, that is an
"order-0 model". If you calculate a different set of probabilities for
each letter depending on what the previous letter was, that's an
"order-1 Markov model". And it can give substantially improved
compression.
Of course, coming up with probabilities for 256 different byte values is
one thing. Coming up with 256 tables of 256 probabilities, that's 65,536
probabilities we need to compute... If each one takes 1 byte to store,
that's 64KB just to store the probability tables! If you're trying to
compress a 5KB text file, adding 64KB of overhead is going to give you
some serious negative compression! (I.e., the "compressed" file will
actually be *bigger* than the original.)
Suffice it to say, nobody does static order-1 Markov. It eats too much
space. But dynamic? Sure. And *that* probably means you're using a range
coder.
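A sketch of what that dynamic order-1 model looks like: keep one table of
counts per "previous byte", and have the compressor and decompressor both
update it the same way after every byte, so there's nothing to transmit.
(My own illustrative code; every count starts at 1 so nothing ever has
probability zero.)
counts = [[1] * 256 for _ in range(256)]      # counts[prev][ch]
def probability(prev: int, ch: int) -> float:
    return counts[prev][ch] / sum(counts[prev])
def update(prev: int, ch: int) -> None:
    counts[prev][ch] += 1
# The encoder feeds probability(prev, ch) to the entropy coder and then
# calls update(prev, ch); the decoder does exactly the same, in step.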
But why base your probabilities only on the *last* character? Why not
the last two? Or three? Or ten?
In fact, if you do this, you soon hit a problem: if you've never seen a
certain sequence of characters before, how do you estimate the
probabilities of what comes after it?
The usual solution is "prediction by partial matching" (PPM). The
probability tables [of which there are a hell of a lot!] are sparse. If
you haven't seen a certain 10-character sequence before, you ignore the
oldest character and look up the corresponding 9-character sequence. If
you haven't seen that either, you try the 8-character one, and so on,
until you finally get a
match. In this way, for each character you can make the best possible
probability estimate. And better probability estimates = better
compression. (And because this is all dynamic, there's no need to store
huge tables in the compressed file. The decompressor will run the same
algorithm and compute the same estimates.)
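In rough Python, the back-off looks something like this. (A sketch of the
flavour only - real PPM also has explicit "escape" symbols and other
wrinkles. The "models" structure is hypothetical: models[k] maps each
k-character context to its observed counts, and models[0] is a plain
table of counts with no context at all.)
def predict(history: str, models: dict, max_order: int = 10):
    # try the longest context first, then keep dropping the oldest
    # character until we find statistics we've actually gathered
    for order in range(max_order, 0, -1):
        if order > len(history):
            continue
        context = history[-order:]
        if context in models[order]:
            return models[order][context]
    return models[0]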
Compressors such as PAQ use range coding and PPM (among other things)
to top the compression ratio ratings. (At the expense of speed and
memory usage, it must be said.)
Actually, PAQ is interesting, since it uses context mixing. That is,
half a dozen different algorithms (not just PPM) estimate the
probabilities for the next character. It then uses a self-learning
artificial neural network (!!) to combine these probability estimates to
get a final estimate.
It does all of this PER BIT!
The result is an excruciatingly slow compressor that gives astonishing
compression ratios.