|
 |
Ever since I read that book I got for Christmas, I've been carrying
around this implementation of a minimum-heap ("min-heap") in my head.
The design always seemed quite neat to me, and it seemed to have
impressive properties. But today I sat down to actually put those
properties to the test.
I've written a short program that randomly inserts and deletes elements
from the heap and measures how badly unbalanced the heap tree is. As per
my expectations, it seems that inserting produces nicely balanced trees.
Somewhat surprisingly, I have discovered that deletions can produce
arbitrarily unbalanced trees.
The actual algorithm can be succinctly described thus:
data Heap x = Node x (Heap x) (Heap x) | Empty
insert x' (Node x h0 h1) = Node (min x x') (insert (max x x') h1) h0
insert x' (Empty ) = Node x' Empty Empty
get_min (Node x _ _) = x
get_min (Empty ) = error "empty heap"
delete_min (Node _ h0 h1) = merge h0 h1
delete_min (Empty ) = error "empty heap"
merge Empty h1 = h1
merge h0 Empty = h0
merge h0 h1 =
if get_min h0 < get_min h1
then Node (get_min h0) h1 (delete_min h0)
else Node (get_min h1) h0 (delete_min h1)
For those who do not grok Haskell, let me summarise thus:
1. A heap is simply a binary tree where every node contains one datum
and has between 0 and 2 children.
2. Insert:
- Inserting into an empty heap yields a trivial 1-node tree.
- Inserting into a non-empty heap proceeds as follows:
A. Construct a new tree node containing the smaller of the existing
root element and the new element to be inserted.
B. The right child of the new node is the left child of the original
root node.
C. The left child of the new node is constructed by recursively
inserting the larger of the two elements into the previous right subtree.
3. Get minimum element: Just return the datum from the root node of the
tree.
4. Delete minimum element: Just take the two subtrees of the root node
and heap-merge them.
5. Heap merge:
- Merging a heap with an empty heap is no-op.
- To marge two non-empty heaps:
A. Determine which heap has the lowest datum in its root node.
B. Put the lowest datum into the new root node, and recursively
delete it from the subtree you got it from.
It might just be me, but I can't help noticing how the English
description is vastly more wordy yet less precise than the source code.
But anyway...
Notice how an insertion always swaps the two subtrees, and how it always
inserts into the (old) right subtree [which becomes the new left
subtree]. In this way, the left subtree of any node is always the larger
than or the same size as the corresponding right subtree. In other
words, insertions leave the tree balanced.
Notice also how a deletion (or rather, a merge) always puts the subtree
you just deleted from as the right subtree, reinforcing the same
invariant. The left subtree can never be smaller than the right subtree.
However, they *can* be wildly different sizes. If, by some misfortune,
it happens that you have a node where *all* the elements in one subtree
are strictly less than *all* of the elements in the other subtree, then
deleting from this heap will cause it to become more and more
unbalanced. And no amount of swapping subtrees around is going to fix that.
We have that deletion leaves the heap in a state where a subsequent
insertion will definitely insert into the smaller subtree, which is
good. In other words, inserting will always bring the tree back towards
being balanced again. But that still doesn't change the fact that, for
sufficiently pathological cases, deletions can make the tree unbalanced
to an unlimited degree.
Since a heap is (quite deliberately) only *partially* sorted, there's
nothing to prevent this kind of thing from happening. The only way we're
gonna fix this is to start moving stuff from one subtree to the other if
it gets too unbalanced. But can we detect this condition in a way that
doesn't slow the whole thing down too much?
Insert has the invariant that the sizes of the left and right subtrees
are related. At all times, we have either L = R or L = R + 1. The
constant swapping of the two subtrees mean that they both fill at an
equal rate.
Insert propagates items from the top downwards. Which way the item
propagates is down to the insertion function. But deletion moves items
up to the top. Which side the item comes from depends on which item is
the lowest, which the deletion function has no control over. Hence the
problem.
Deleting from the left subtree causes it to become the right subtree. If
we always deleted from the left, the tree would always remain balanced,
and there would be no problem. Deleting from the right subtree, however,
can unbalance the tree.
Now obviously, if the lowest element is in the right subtree, then it's
in the right subtree. There's nothing we can do about that. But this
(and only this) condition looks like a suitable signal that some
rebalancing is required. The question is, what do we do to rebalance?
One obvious possibility is to completely rebuild the entire subtree at
this moment. This operation has O(n log n) complexity, and is obviously
ludicrously expensive to do on (potentially) every single delete operation.
A better possibility is to move one element from the left subtree into
the right subtree if it's going to become too small. Actually to do this
reliably requires that we know whether L = R or L = R + 1. In the former
case, we do nothing, and in the latter case we rebalance by moving a
single element across. Here's the code:
data Balance = Equal | Unequal
invert Equal = Unequal
invert Unequal = Equal
data Heap x = Node Balance x (Heap x) Heap x) | Empty
insert x' (Node b x h0 h1) = Node (invert b) (min x x') (insert (max
x x') h1) h0
insert x' (Empty ) = Node Equal x Empty Empty
get_min (Node _ x _ _) = x
get_min (Empty ) = error "empty heap"
delete_min (Node x _ h0 h1) = merge b h0 h1
delete_min (Empty ) = error "empty heap"
merge _ Empty h1 = h1
merge _ h0 Empty = h0
merge _ h0 h1 =
if get_min h0 < get_min h1
then Node (invert b) (get_min h0) h1 (delete_min h0)
else rebalance b (get_min h1) h0 (delete_min h1)
rebalance Equal x h0 h1 = Node Unequal x h0 h1
rebalance Unequal x h0 h1 = let x' = get_min h0 in Node Equal x
(delete_min h0) (insert x' h1)
Here the functions "invert", "merge" and "rebalance" are private
functions not accessible to external clients.
I've run this code through my simulator, and after many minutes of
number crunching it has failed to produce one single unbalanced tree.
This of course is not proof, but it shows at least that this structure
works somewhat better than the original. (And for very little cost in
either code size of run-time complexity.)
Unfortunately, because we've added a new invariant to the data structure
(namely that the balance factor is always exactly 0 or -1), we lose the
ability to cheaply merge two heaps. (In the old version, this was
already a very cheap way to make an arbitrarily unbalanced tree; just
merge a size-1 tree with a size-1000 tree and watch how lopsided it now
is!) You would now be forced to do some kind of elaborate rebalancing
procedure to restore the new invariant on the subtree sizes.
Post a reply to this message
|
 |