scott wrote:
>> No, really??
>>
>> Heh. Now tell me something I don't already know. ;-) [Such as how to
>> *find* a better job... that'd be useful!]
>
> Try hammering some random search terms into monster.co.uk, even
> "haskell" brings up a couple of results!
Yep. Already done that. And they have my CV, so every now and then they
email me a job advert or 12. [Usually in Canada... WTF??] I even went to
an interview a few weeks back...
They've sent me a couple of messages telling me I'm "ideal" to work for
HMGCC. ("HMG" as in "Her Majesty's Government". Scary shit, right there!)
> I can't believe there is no job on there for you.
I'm fairly confident somebody somewhere wants to employ somebody like
me. The part I'm worried about is *finding* this company...
>> While I'm not disputing the truth of your statement, I am curious to
>> know what led you to this conclusion.
>
> The fact that you always go on about quite complicated stuff that not
> many other people would understand, and I mean not many within the IT
> industry, not the entire population.
Pfft. It's all stuff that's readily available on Wikipedia. And it's not
exactly "complicated".
Assuming you already comprehend what a binary tree data structure is, a
heap is just a special case: Every node holds data, and the data in each
node must be less than or equal to the data in both its child nodes.
[Notice there's no ordering constraint on siblings - so the tree need
not be fully ordered, only partially.]
Finding the minimum element is O(1) because, by definition, the minimum
element is always in the root node.
Inserting an element means comparing it to what's in the root node,
putting the smaller of the two back into the root, and propagating the
larger of the two down the tree recursively. [Notice I didn't say which
child node. There are actually several possible algorithms here, each
with different characteristics.]
Deleting an element likewise involves throwing away what's in the root
and moving the smaller of the two child elements up into the root. This
"moving" thereby deletes an element from one of the child nodes, so you
must recurse.
Personally I think Haskell source code says it better, but that's just me.
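Since I just said that, here's a quick sketch. (The names are mine, and I've arbitrarily chosen to sink inserted elements down one side while swapping the children as I go - which, as noted above, is only one of the several possible algorithms.)

```haskell
data Heap a = Leaf | Node a (Heap a) (Heap a)

-- O(1): the minimum is, by definition, at the root.
findMin :: Heap a -> Maybe a
findMin Leaf         = Nothing
findMin (Node x _ _) = Just x

-- Compare with the root; the smaller value stays, the larger one
-- propagates down. Swapping the children spreads successive inserts
-- over both sides instead of always hammering the same child.
insert :: Ord a => a -> Heap a -> Heap a
insert x Leaf = Node x Leaf Leaf
insert x (Node y l r)
  | x <= y    = Node x (insert y r) l
  | otherwise = Node y (insert x r) l

-- Throw away the root, promote the smaller child's element, recurse.
deleteMin :: Ord a => Heap a -> Heap a
deleteMin Leaf         = Leaf
deleteMin (Node _ l r) = promote l r
  where
    promote Leaf h = h
    promote h Leaf = h
    promote h1@(Node a l1 r1) h2@(Node b l2 r2)
      | a <= b    = Node a (promote l1 r1) h2
      | otherwise = Node b h1 (promote l2 r2)
```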
A Huffman code is a similarly straightforward idea. When you save a
normal ASCII text file, each character is given an 8-bit code. [But if
you want to be technical, ASCII is a 7-bit code, so the extra bit is
always a 0.] The idea behind Huffman codes is that instead of every
character having an 8-bit code, different characters can have codes of
different lengths.
If you give the most common characters shorter codes [and therefore, by
necessity, give the rarer characters longer codes], then simply
re-encoding your text from ASCII to this new coding system will make
your file smaller! :-D
But how to construct this new coding scheme? And if codes can be
different lengths, how do you figure out where one code ends and the
next begins?
The answer to the second question is that the code must be a "prefix
code". That is, no full code can ever be the prefix of another full
code. (It may be a suffix, but never a prefix.) So if 011 is a valid
code, all on its own, then no other valid code can begin with 011
(although it can END with 011). This means that if the decoder sees 011,
it knows it just read a whole code, and the next bit is a separate code.
An easy way to build a prefix code is to make each code be a set of
directions from the root of a tree to the leaves. Assuming each leaf
stores a character, each leaf's address will be a prefix code.
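To make the decoding side concrete, here's a tiny sketch (the type names are my own). Each code is a path from the root, with 0 meaning "go left" and 1 meaning "go right"; hitting a leaf means we just finished reading a whole code, so we emit the character and start again at the root:

```haskell
data Code = CLeaf Char | CNode Code Code

decode :: Code -> String -> String
decode _    []   = []
decode root bits = go root bits
  where
    go (CLeaf c)   bs       = c : decode root bs  -- whole code read
    go (CNode l _) ('0':bs) = go l bs
    go (CNode _ r) ('1':bs) = go r bs
    go _           _        = error "truncated input"
```

With 'a' at address 1, 'b' at 00 and 'c' at 01, the bit string 10001 decodes unambiguously as "abc" - no delimiters needed.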
So, "all" we have to do is build a tree where the rare characters are
deep in the tree, and the common characters are shallow. We then read
off the tree address for each character and we will have our prefix code.
One obvious way is to take our set of characters and split it into two
halves such that the total probability for each half is roughly 50%.
Then all the characters in one set go on one side of our tree, and all
the characters in the other half go on the other side of our tree.
Recurse until you're done, and you have a nice tree.
This is called a Shannon-Fano tree, and it doesn't work very well. It's
actually fairly hard to split a set into two equal-probability sets. (I
have a vague recollection it might be NP-complete. Darren will know.)
The *other* thing you can do is work backwards. The rarest characters
should end up at the bottom of the tree, so let's put those into trees
*first*. And the common characters need to end up near the top, so let's
add those *last*. In other words, pick the two lowest-probability trees,
join them into a single tree, repeat until, like Highlander, there is Only
One.
*This* is called a Huffman tree. And it works much better. So, to summarise:
- Assign a "probability" to each character. [Guess them, estimate them,
count them up, whatever.]
- Build a set of trivial trees each consisting of a leaf containing a
character and its probability.
- Select the two lowest-probability trees, remove them from the set,
join them into a larger tree, insert this back into the set. Repeat
until only 1 tree remains.
- For each ASCII code in your input, replace it with the tree address of
that character in your tree.
- Assuming the probabilities you assigned were correct, the output file
is *guaranteed* to be smaller than or, in the worst case (every
character has equal probability), exactly the same size as the original.
(But note that you probably need to write down what the tree was so the
decoder can reconstruct it! This will make the file fractionally larger.)
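The steps above, sketched in Haskell. (Names and the use of Double weights are my choices; ties can break either way, so the exact codes you get may vary, but the code *lengths* come out right.)

```haskell
import Data.List (insertBy, sortBy)
import Data.Ord  (comparing)

data HTree = HLeaf Double Char | HNode Double HTree HTree

weight :: HTree -> Double
weight (HLeaf w _)   = w
weight (HNode w _ _) = w

-- Keep the set as a sorted list; repeatedly join the two lightest
-- trees until, like Highlander, there is Only One.
buildHuffman :: [(Char, Double)] -> HTree
buildHuffman = go . sortBy (comparing weight) . map (\(c, w) -> HLeaf w c)
  where
    go []         = error "empty alphabet"
    go [t]        = t
    go (a:b:rest) =
      go (insertBy (comparing weight) (HNode (weight a + weight b) a b) rest)

-- Read off each leaf's tree address (0 = left, 1 = right) as its code.
codes :: HTree -> [(Char, String)]
codes t = go t []
  where
    go (HLeaf _ c)   path = [(c, reverse path)]
    go (HNode _ l r) path = go l ('0' : path) ++ go r ('1' : path)
```

For example, with probabilities a = 0.5, b = 0.25, c = 0.25, the common character 'a' ends up with a 1-bit code and the rarer 'b' and 'c' with 2-bit codes each.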
Note that *any* prefix code will work. There's nothing "magical" about
Huffman codes. The Huffman algorithm is just an efficient way to build
such a code. If you know that, for example, low numbers are common in
your input and high numbers are rare, you can use a "Fibonacci code",
which is a prefix code with a predefined set of probability assumptions,
so it takes no effort to compute. [It's based on Fibonacci numbers,
hence the name.]
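A sketch of the encoder, if you're curious (naming is mine): greedily pick the largest Fibonacci number that fits, then the next, and so on, emitting one bit per Fibonacci number, smallest first, plus a trailing 1. Because a greedy decomposition never uses two *consecutive* Fibonacci numbers, "11" can only ever appear at the very end of a code - which is exactly what makes it a prefix code.

```haskell
-- The Fibonacci sequence as used for this code: 1, 2, 3, 5, 8, ...
fibs :: [Integer]
fibs = 1 : 2 : zipWith (+) fibs (tail fibs)

fibEncode :: Integer -> String
fibEncode n = go (reverse (takeWhile (<= n) fibs)) n []
  where
    go []     _ acc = acc ++ "1"            -- the terminating 1
    go (f:fs) m acc
      | f <= m      = go fs (m - f) ('1' : acc)
      | otherwise   = go fs m       ('0' : acc)
```

So 1 becomes 11, 2 becomes 011, 4 becomes 1011, and so forth - small numbers get short codes, and no probability table needs to be computed or stored.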
For something like a sound signal, where most samples are near to zero,
this could be a win. (Also, look at Predictive Coding, where you make a
"prediction" for each sample based on some property of the last few
samples, and record the "error" between the prediction and the actual
sample. Assuming your predictions are good, the errors will always be
"small" numbers.)
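The simplest possible predictor just guesses "same as the last sample", so encoding amounts to storing the deltas, and decoding to adding them back up (function names are my own):

```haskell
-- Predict each sample equals the previous one (first prediction: 0);
-- store the prediction errors.
predictEncode :: [Int] -> [Int]
predictEncode xs = zipWith (-) xs (0 : xs)

-- Undo it: running sum of the errors recovers the samples.
predictDecode :: [Int] -> [Int]
predictDecode = tail . scanl (+) 0
```

A slowly-varying signal like [100,101,103,102] encodes as [100,1,2,-1] - mostly small numbers, ripe for a Fibonacci or Huffman coder.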
Huffman codes will work on anything, and they work quite well. There
*is*, however, a thing called "arithmetic coding", which does the same
thing but better. You see, it assigns a *fractional* number of bits to
each character.
That probably sounds completely incomprehensible. Put simply, the
compressed file contains an integral number of bits. (Duh!) But if you
encode 100 identical characters, the output bit length need not be
divisible by 100. You could say the number of bits gets rounded up to
the nearest whole number when you complete the file.
In practice, what this means is that the number of bits per character
can be closer to the binary logarithm of the probability for that
character - the "perfect" bit size. So you get a few percent more
compression this way. OTOH, the algorithm is harder. And patented, I
believe...?
Of course, coming up with better probability estimates will have a much
more dramatic effect on compression. Running a Huffman coder on a normal
text file gives typically 50% compression. But take into account letter
*combination* frequencies and you can go much higher than that.
Theoretically, PPM + arithmetic coding should give the best compression
that money can buy!
You know what? I think I might have just disproven the point I was
trying to make...? <:-,
--
http://blog.orphi.me.uk/
http://www.zazzle.com/MathematicalOrchid*