  Re: 10 things to use a Min Heap for  
From: Invisible
Date: 18 Jun 2008 08:19:42
Message: <4858fd5e$1@news.povray.org>
scott wrote:
>> No, really??
>>
>> Heh. Now tell me something I don't already know. ;-) [Such as how to 
>> *find* a better job... that'd be useful!]
> 
> Try hammering some random search terms into monster.co.uk, even 
> "haskell" brings up a couple of results!

Yep. Already done that. And they have my CV, so every now and then they 
email me a job advert or 12. [Usually in Canada... WTF??] I even went to 
an interview a few weeks back...

They've sent me a couple of messages telling me I'm "ideal" to work for 
HMGCC. ("HMG" as in "Her Majesty's Government". Scary shit, right there!)

> I can't believe there is no job on there for you.

I'm fairly confident somebody somewhere wants to employ somebody like 
me. The part I'm worried about is *finding* this company...

>> While I'm not disputing the truth of your statement, I am curious to 
>> know what led you to this conclusion.
> 
> The fact that you always go on about quite complicated stuff that not 
> many other people would understand, and I mean not many within the IT 
> industry, not the entire population.

Pfft. It's all stuff that's readily available on Wikipedia. And it's not 
exactly "complicated".


Assuming you already comprehend what a binary tree data structure is, a 
heap is just a special case: Every node holds data, and the data in each 
node must be less than or equal to the data in both its child nodes. 
[Notice there's no ordering constraint on siblings - so the tree need 
not be fully ordered, only partially.]

Finding the minimum element is O(1) because, by definition, the minimum 
element is always in the root node.

Inserting an element means comparing it to what's in the root node, 
putting the smaller of the two back into the root, and propagating the 
larger of the two down the tree recursively. [Notice I didn't say which 
child node. There are actually several possible algorithms here, each 
with different characteristics.]

Deleting an element likewise involves throwing away what's in the root 
and moving the smaller of the two child elements up into the root. This 
"moving" thereby deletes an element from one of the child nodes, so you 
must recurse.

Personally I think Haskell source code says it better, but that's just me.
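
For instance, a bare-bones version might look like this. (A quick 
sketch of my own, not from any library. The insert uses the Braun tree 
trick of always inserting on the right and then swapping the subtrees, 
which keeps the tree balanced; the merge-style delete is the simplest 
correct thing, although it gives up strict balance.)

data Heap a = Tip | Node a (Heap a) (Heap a)

-- The minimum is always at the root: O(1).
findMin :: Heap a -> Maybe a
findMin Tip          = Nothing
findMin (Node x _ _) = Just x

-- Keep the smaller value at the root, push the larger one down.
insert :: Ord a => a -> Heap a -> Heap a
insert x Tip          = Node x Tip Tip
insert x (Node y l r) = Node (min x y) (insert (max x y) r) l

-- Throw away the root, then let the smaller child float up, recursively.
deleteMin :: Ord a => Heap a -> Heap a
deleteMin Tip          = Tip
deleteMin (Node _ l r) = merge l r
  where
    merge Tip h = h
    merge h Tip = h
    merge a@(Node x xl xr) b@(Node y yl yr)
      | x <= y    = Node x (merge xl xr) b
      | otherwise = Node y (merge yl yr) a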


A Huffman code is a similarly straightforward idea. When you save a 
normal ASCII text file, each character is given an 8-bit code. [But if 
you want to be technical, ASCII is a 7-bit code, so the extra bit is 
always a 0.] The idea behind Huffman codes is that instead of every 
character having an 8-bit code, different characters can have codes of 
different lengths.

If you give the most common characters shorter codes [and therefore, by 
necessity, give the rarer characters longer codes], then simply 
re-encoding your text from ASCII to this new coding system will make 
your file smaller! :-D

But how to construct this new coding scheme? And if codes can be 
different lengths, how do you figure out where one code ends and the 
next begins?

The answer to the second question is that the code must be a "prefix 
code". That is, no full code can ever be the prefix of another full 
code. (It may be a suffix, but never a prefix.) So if 011 is a valid 
code, all on its own, then no other valid code can begin with 011 
(although it can END with 011). This means that if the decoder sees 011, 
it knows it just read a whole code, and the next bit starts a separate code.

An easy way to build a prefix code is to make each code a sequence of 
directions from the root of a tree down to a leaf. Assuming each leaf 
stores a character, the set of leaf addresses automatically forms a 
prefix code, because no leaf can be an ancestor of another leaf.

So, "all" we have to do is build a tree where the rare characters are 
deep in the tree, and the common characters are shallow. We then read 
off the tree address for each character and we will have our prefix code.
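
In fact, the decoder is almost embarrassingly short. (Toy code of my 
own; '0' means "go left" and '1' means "go right".)

data CodeTree = Leaf Char | Branch CodeTree CodeTree

-- No separators needed: walk the tree until you hit a leaf, emit the
-- character, then restart from the root.
decode :: CodeTree -> String -> String
decode root = go root
  where
    go (Leaf c)     bits         = c : go root bits
    go _            []           = []
    go (Branch l _) ('0' : bits) = go l bits
    go (Branch _ r) ('1' : bits) = go r bits
    go _            _            = error "decode: not a valid bit"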

One obvious way is to take our set of characters and split it into two 
halves such that the total probability for each half is roughly 50%. 
Then all the characters in one set go on one side of our tree, and all 
the characters in the other half go on the other side of our tree. 
Recurse until you're done, and you have a nice tree.

This is called a Shannon-Fano tree, and it doesn't work very well. It's 
actually fairly hard to split a set into two equal-probability sets. (I 
have a vague recollection it might be NP-complete. Darren will know.)

The *other* thing you can do is work backwards. The rarest characters 
should end up at the bottom of the tree, so let's put those into trees 
*first*. And the common characters need to end up near the top, so let's 
add those *last*. In other words, pick the two lowest-probability trees, 
join them into a single tree, repeat until, like Highlander, there is Only 
One.

*This* is called a Huffman tree. And it works much better. So, to summarise:

- Assign a "probability" to each character. [Guess them, estimate them, 
count them up, whatever.]

- Build a set of trivial trees each consisting of a leaf containing a 
character and its probability.

- Select the two lowest-probability trees, remove them from the set, 
join them into a larger tree, insert this back into the set. Repeat 
until only 1 tree remains. (This step is exactly what a min heap is 
good at; see the sketch after this list.)

- For each ASCII code in your input, replace it with the tree address of 
that character in your tree.

- Assuming the probabilities you assigned were correct, the output file 
is *guaranteed* to be smaller than or, in the worst case (every 
character has equal probability), exactly the same size as the original. 
(But note that you probably need to write down what the tree was so the 
decoder can reconstruct it! This will make the file fractionally larger.)
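
The whole construction is only a few lines. (Again, a sketch of my own; 
I'm using a sorted list as a stand-in for the min heap, because 
grabbing the two lightest trees is just two delete-min operations.)

import Data.List (insertBy)
import Data.Ord  (comparing)

-- The same code tree as in the decoder sketch above.
data CodeTree = Leaf Char | Branch CodeTree CodeTree

-- Build the tree from (character, count) pairs.
buildHuffman :: [(Char, Int)] -> CodeTree
buildHuffman freqs = go (foldr ins [] [ (w, Leaf c) | (c, w) <- freqs ])
  where
    ins = insertBy (comparing fst)
    go [(_, t)] = t
    go ((w1, t1) : (w2, t2) : rest) =
        go (ins (w1 + w2, Branch t1 t2) rest)  -- join the two lightest
    go [] = error "buildHuffman: empty alphabet"

-- Read off every character's tree address.
codes :: CodeTree -> [(Char, String)]
codes (Leaf c)     = [(c, "")]
codes (Branch l r) = [ (c, '0' : p) | (c, p) <- codes l ]
                  ++ [ (c, '1' : p) | (c, p) <- codes r ]

For example, codes (buildHuffman [('a',5),('b',2),('c',1)]) gives 'a' a 
1-bit code and 'b' and 'c' a 2-bit code each.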


Note that *any* prefix code will work. There's nothing "magical" about 
Huffman codes. The Huffman algorithm is just an efficient way to build 
such a code. If you know that, for example, low numbers are common in 
your input and high numbers are rare, you can use a "Fibonacci code", 
which is a prefix code with a predefined set of probability assumptions, 
so it takes no effort to compute. [It's based on Fibonacci numbers, 
hence the name.]
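
For the curious, the encoder fits in a few lines. (My own toy code; the 
point is that a codeword never contains "11" except right at the end, 
which is what makes it a prefix code.)

-- The Fibonacci numbers as used for coding: 1, 2, 3, 5, 8, ...
fibs :: [Int]
fibs = 1 : 2 : zipWith (+) fibs (tail fibs)

-- Zeckendorf's theorem: every positive integer is a sum of
-- non-consecutive Fibonacci numbers. Emit one bit per Fibonacci number
-- (smallest first), then append a final 1 to terminate the codeword.
fibEncode :: Int -> String
fibEncode n = go (reverse (takeWhile (<= n) fibs)) n ""
  where
    go []       _ acc = acc ++ "1"
    go (f : fs) m acc
      | f <= m    = go fs (m - f) ('1' : acc)
      | otherwise = go fs m       ('0' : acc)

-- fibEncode 1 == "11", fibEncode 4 == "1011", fibEncode 11 == "001011"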

For something like a sound signal, where most samples are near to zero, 
this could be a win. (Also, look at Predictive Coding, where you make a 
"prediction" for each sample based on some property of the last few 
samples, and record the "error" between the prediction and the actual 
sample. Assuming your predictions are good, the errors will always be 
"small" numbers.)


Huffman codes will work on anything, and they work quite well. There 
*is*, however, a thing called "arithmetic coding", which does the same 
thing but better. You see, it assigns a *fractional* number of bits to 
each character.

That probably sounds completely incomprehensible. Put simply, the 
compressed file contains an integral number of bits. (Duh!) But if you 
encode 100 identical characters, the output bit length need not be 
divisible by 100. You could say the number of bits gets rounded up to 
the nearest whole number when you complete the file.

In practice, what this means is that the number of bits per character 
can get closer to the "perfect" bit size: -log2(p) for a character of 
probability p. So you get a few percent more compression this way. 
OTOH, the algorithm is harder. And patented, I believe...?
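
A throwaway one-liner (mine, not from any library) makes the gap concrete:

idealBits :: Double -> Double
idealBits p = negate (logBase 2 p)

-- idealBits 0.5 == 1.0    (Huffman can hit this exactly)
-- idealBits 0.9 ~= 0.152  (Huffman must still spend a whole bit here;
--                          arithmetic coding averages about 0.152)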

Of course, coming up with better probability estimates will have a much 
more dramatic effect on compression. Running a Huffman coder on a normal 
text file typically gives around 50% compression. But take into account letter 
*combination* frequencies and you can go much higher than that. 
Theoretically, PPM + arithmetic coding should give the best compression 
that money can buy!



You know what? I think I might have just disproven the point I was 
trying to make...? <:-,

-- 
http://blog.orphi.me.uk/
http://www.zazzle.com/MathematicalOrchid*

