A min-heap
From: Invisible
Subject: A min-heap
Date: 12 Oct 2010 09:49:29
Message: <4cb46769$1@news.povray.org>
Ever since I read that book I got for Christmas, I've been carrying 
around this implementation of a minimum-heap ("min-heap") in my head. 
The design always seemed quite neat to me, and it seemed to have 
impressive properties. But today I sat down to actually put those 
properties to the test.

I've written a short program that randomly inserts and deletes elements 
from the heap and measures how badly unbalanced the heap tree is. As per 
my expectations, it seems that inserting produces nicely balanced trees. 
Somewhat surprisingly, I have discovered that deletions can produce 
arbitrarily unbalanced trees.

The actual algorithm can be succinctly described thus:

   data Heap x = Node x (Heap x) (Heap x) | Empty

   insert x' (Node x h0 h1) = Node (min x x') (insert (max x x') h1) h0
   insert x' (Empty       ) = Node x' Empty Empty

   get_min (Node x _ _) = x
   get_min (Empty     ) = error "empty heap"

   delete_min (Node _ h0 h1) = merge h0 h1
   delete_min (Empty       ) = error "empty heap"

   merge Empty h1    = h1
   merge h0    Empty = h0
   merge h0    h1    =
     if get_min h0 < get_min h1
       then Node (get_min h0) h1 (delete_min h0)
       else Node (get_min h1) h0 (delete_min h1)

For those who do not grok Haskell, let me summarise thus:

1. A heap is simply a binary tree where every node contains one datum 
and has between 0 and 2 children.

2. Insert:
   - Inserting into an empty heap yields a trivial 1-node tree.
   - Inserting into a non-empty heap proceeds as follows:
    A. Construct a new tree node containing the smaller of the existing 
root element and the new element to be inserted.
    B. The right child of the new node is the left child of the original 
root node.
    C. The left child of the new node is constructed by recursively 
inserting the larger of the two elements into the previous right subtree.

3. Get minimum element: Just return the datum from the root node of the 
tree.

4. Delete minimum element: Just take the two subtrees of the root node 
and heap-merge them.

5. Heap merge:
   - Merging a heap with an empty heap is a no-op.
   - To merge two non-empty heaps:
     A. Determine which heap has the lowest datum in its root node.
     B. Put the lowest datum into the new root node, and recursively 
delete it from the subtree you got it from.

It might just be me, but I can't help noticing how the English 
description is vastly more wordy yet less precise than the source code. 
But anyway...

Notice how an insertion always swaps the two subtrees, and how it always 
inserts into the (old) right subtree [which becomes the new left 
subtree]. In this way, the left subtree of any node is always larger 
than or the same size as the corresponding right subtree. In other 
words, insertions leave the tree balanced.
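
A minimal sketch of how one might quantify this, assuming the Heap type 
above (the helper names are mine, not the actual simulator's):

   size :: Heap x -> Int
   size Empty          = 0
   size (Node _ h0 h1) = 1 + size h0 + size h1

   depth :: Heap x -> Int
   depth Empty          = 0
   depth (Node _ h0 h1) = 1 + max (depth h0) (depth h1)

   -- The invariant described above: at every node, the left subtree is
   -- the same size as the right, or exactly one element bigger.
   balanced :: Heap x -> Bool
   balanced Empty          = True
   balanced (Node _ h0 h1) =
     (size h0 == size h1 || size h0 == size h1 + 1)
       && balanced h0 && balanced h1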

Notice also how a deletion (or rather, a merge) always puts the subtree 
you just deleted from as the right subtree, reinforcing the same 
invariant. The left subtree can never be smaller than the right subtree.

However, they *can* be wildly different sizes. If, by some misfortune, 
it happens that you have a node where *all* the elements in one subtree 
are strictly less than *all* of the elements in the other subtree, then 
deleting from this heap will cause it to become more and more 
unbalanced. And no amount of swapping subtrees around is going to fix that.
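
To make this concrete, here is a hand-built heap of exactly that shape 
(a sketch using the Node constructor directly; the names are mine):

   -- Illustrative pathological heap: every element of 'lo' is strictly
   -- less than every element of 'hi', and the root is below both.
   pathological :: Heap Int
   pathological = Node 0 hi lo
     where lo = foldr insert Empty [1 .. 100]
           hi = foldr insert Empty [1000 .. 1100]

   -- Each delete_min merges 'hi' with what is left of 'lo'. The minimum
   -- always comes from the 'lo' side, so 'hi' (101 elements) is never
   -- shrunk, while 'lo' dwindles away: the size gap grows and grows.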

Deletion does leave the heap in a state where a subsequent 
insertion will definitely insert into the smaller subtree, which is 
good. In other words, inserting will always bring the tree back towards 
being balanced again. But that still doesn't change the fact that, for 
sufficiently pathological cases, deletions can make the tree unbalanced 
to an unlimited degree.

Since a heap is (quite deliberately) only *partially* sorted, there's 
nothing to prevent this kind of thing from happening. The only way we're 
gonna fix this is to start moving stuff from one subtree to the other if 
it gets too unbalanced. But can we detect this condition in a way that 
doesn't slow the whole thing down too much?

Insert has the invariant that the sizes of the left and right subtrees 
are related. At all times, we have either L = R or L = R + 1. The 
constant swapping of the two subtrees means that they both fill at an 
equal rate.

Insert propagates items from the top downwards. Which way the item 
propagates is down to the insertion function. But deletion moves items 
up to the top. Which side the item comes from depends on which item is 
the lowest, which the deletion function has no control over. Hence the 
problem.

Deleting from the left subtree causes it to become the right subtree. If 
we always deleted from the left, the tree would always remain balanced, 
and there would be no problem. Deleting from the right subtree, however, 
can unbalance the tree.

Now obviously, if the lowest element is in the right subtree, then it's 
in the right subtree. There's nothing we can do about that. But this 
(and only this) condition looks like a suitable signal that some 
rebalancing is required. The question is, what do we do to rebalance?

One obvious possibility is to completely rebuild the entire subtree at 
this moment. This operation has O(n log n) complexity, and is obviously 
ludicrously expensive to do on (potentially) every single delete operation.

A better possibility is to move one element from the left subtree into 
the right subtree if the right subtree is about to become too small. 
Doing this reliably requires that we know whether L = R or L = R + 1. 
In the former case, we do nothing; in the latter case, we rebalance by 
moving a single element across. Here's the code:

   data Balance = Equal | Unequal

   invert Equal   = Unequal
   invert Unequal = Equal

   data Heap x = Node Balance x (Heap x) (Heap x) | Empty

   insert x' (Node b x h0 h1) =
     Node (invert b) (min x x') (insert (max x x') h1) h0
   insert x' (Empty         ) = Node Equal x' Empty Empty

   get_min (Node _ x _ _) = x
   get_min (Empty       ) = error "empty heap"

   delete_min (Node b _ h0 h1) = merge b h0 h1
   delete_min (Empty         ) = error "empty heap"

   merge _ Empty h1    = h1
   merge _ h0    Empty = h0
   merge b h0    h1    =
     if get_min h0 < get_min h1
       then Node (invert b) (get_min h0) h1 (delete_min h0)
       else rebalance    b  (get_min h1) h0 (delete_min h1)

   rebalance Equal   x h0 h1 = Node Unequal x h0 h1
   rebalance Unequal x h0 h1 =
     let x' = get_min h0 in Node Equal x (delete_min h0) (insert x' h1)

Here the functions "invert", "merge" and "rebalance" are private 
functions not accessible to external clients.
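
As a usage sketch (my example, not from the original post), the public 
interface is enough to heap-sort a list:

   heapSort :: Ord x => [x] -> [x]
   heapSort = drain . foldr insert Empty
     where drain Empty = []
           drain h     = get_min h : drain (delete_min h)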

I've run this code through my simulator, and after many minutes of 
number crunching it has failed to produce one single unbalanced tree. 
This of course is not proof, but it shows at least that this structure 
works somewhat better than the original. (And for very little cost in 
either code size or run-time complexity.)

Unfortunately, because we've added a new invariant to the data structure 
(namely that the left subtree is always the same size as the right, or 
exactly one element bigger), we lose the ability to cheaply merge two 
heaps. (In the old version, this was already a very cheap way to make an 
arbitrarily unbalanced tree; just merge a size-1 tree with a size-1000 
tree and watch how lopsided the result is!) You would now be forced to 
do some kind of elaborate rebalancing procedure to restore the new 
invariant on the subtree sizes.
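
For example, with the first version's merge (a sketch; the names are mine):

   lopsided :: Heap Int
   lopsided = merge tiny huge
     where tiny = insert 0 Empty                  -- size 1
           huge = foldr insert Empty [1 .. 1000]  -- size 1000

   -- tiny's root (0) is the smaller, so the result is
   -- Node 0 huge (delete_min tiny) = Node 0 huge Empty:
   -- 1000 elements on the left, none on the right.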



From: Warp
Subject: Re: A min-heap
Date: 12 Oct 2010 11:48:02
Message: <4cb48332@news.povray.org>
Invisible <voi### [at] devnull> wrote:
> For those who do not grok Haskell, let me summarise thus:

  What are the advantages of a heap compared to a balanced binary search
tree in Haskell?

  The reason heaps are sometimes used in imperative languages is because
a heap can be constructed inside an array without requiring any ancillary
data whatsoever (in other words, the array contains the elements and
nothing else; the order of the elements can be heapified in-place, without
any ancillary data or extra memory), which makes it extremely efficient
(both memorywise, rather obviously, and also speedwise).

  However, Haskell doesn't use (mutable) arrays, so you have to do it
as a regular tree, with each node allocated separately and containing
references and whatever ancillary data.

-- 
                                                          - Warp



From: Invisible
Subject: Re: A min-heap
Date: 13 Oct 2010 04:24:51
Message: <4cb56cd3$1@news.povray.org>
On 12/10/2010 04:48 PM, Warp wrote:

>    What are the advantages of a heap compared to a balanced binary search
> tree in Haskell?

I hadn't really thought about it.

However, now that I'm thinking about it, the thing that occurs to me is 
that finding the minimum or maximum element of a binary search tree 
presumably requires O(log N) operations, while for a min-heap you can 
find the minimum in O(1) operations. (Similarly, you can find the 
maximum of a max-heap in O(1) time.)

So if you're doing something like implementing a priority queue or 
similar, a heap looks like a better choice than a BST.

On the other hand, I'm not intimately familiar with the various kinds of 
BSTs and their detailed properties...
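
For concreteness, here is the lookup in question, sketched for a plain 
(unbalanced) BST; the names are mine:

   data BST x = Leaf | Branch (BST x) x (BST x)

   -- The minimum of a BST is its leftmost element: O(depth) to reach,
   -- versus O(1) for get_min on the heap above.
   bstMin :: BST x -> Maybe x
   bstMin Leaf              = Nothing
   bstMin (Branch Leaf x _) = Just x
   bstMin (Branch l _ _)    = bstMin l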

>    The reason heaps are sometimes used in imperative languages is because
> a heap can be constructed inside an array without requiring any ancillary
> data whatsoever.
>
>    However, Haskell doesn't use (mutable) arrays, so you have to do it
> as a regular tree, with each node allocated separately and containing
> references and whatever ancillary data.

You could do it as a mutable array (but nobody will). More to the point, 
the easiest way to do it is a mutable array of pointers, which still has 
the overhead (although I guess you avoid needing the child pointers). If 
you want (say) a heap of integers represented as just an array of 
unboxed integers, you're going to have to do a fair bit of leg-work in 
Haskell. Sadly...
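
To illustrate the leg-work: a sketch of one sift-up step for a 1-based 
array min-heap of unboxed Ints, using GHC's STUArray (the function name 
is mine):

   import Control.Monad.ST (ST)
   import Data.Array.ST (STUArray, readArray, writeArray)

   -- Parent of slot i in a 1-based array heap lives at i `div` 2.
   siftUp :: STUArray s Int Int -> Int -> ST s ()
   siftUp heap i
     | i <= 1    = return ()              -- reached the root; done
     | otherwise = do
         let p = i `div` 2
         x  <- readArray heap i
         px <- readArray heap p
         if x < px
           then do writeArray heap p x    -- bubble the smaller element up
                   writeArray heap i px
                   siftUp heap p
           else return ()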



From: Warp
Subject: Re: A min-heap
Date: 13 Oct 2010 13:06:06
Message: <4cb5e6fe@news.povray.org>
Invisible <voi### [at] devnull> wrote:
> However, now that I'm thinking about it, the thing that occurs to me is 
> that finding the minimum or maximum element of a binary search tree 
> presumably requires O(log N) operations, while for a min-heap you can 
> find the minimum in O(1) operations. (Similarly, you can find the 
> maximum of a max-heap in O(1) time.)

  Your tree data structure can have a reference/handle to the first and
last elements for fast access (which is, in fact, what eg. std::set and
std::map do, in order to be able to get the iterators to the first and
last elements fast). In this case you get fast access to the smallest,
largest and median elements (the root node is obviously the median in
a balanced binary search tree).

> So if you're doing something like implementing a priority queue or 
> similar, a heap looks like being a better choice than a BST.

  A heap is at least much *simpler* to implement (and requires less
ancillary data) than a balanced binary search tree. However, the fact
that you can construct a heap on an array without any ancillary data
whatsoever is still, IMO, by far its biggest advantage as a data structure
in such a case (ie. a priority queue). You lose much of that efficiency
if you allocate all the nodes separately.

> You could do it as a mutable array (but nobody will). More to the point, 
> the easiest way to do it is a mutable array of pointers, which still has 
> the overhead (although I guess you avoid needing the child pointers).

  I assume that means that Haskell doesn't handle objects (if it even has
such a concept) by value, only by reference (which, in fact, isn't very
surprising to me).

> If you want (say) a heap of integers represented as just an array of 
> unboxed integers, you're going to have to do a fair bit of leg-work in 
> Haskell. Sadly...

  Sometimes low-level efficiency means you have to bypass fancy
abstractions...

-- 
                                                          - Warp



From: Darren New
Subject: Re: A min-heap
Date: 13 Oct 2010 14:24:10
Message: <4cb5f94a$1@news.povray.org>
Warp wrote:
>   I assume that means that Haskell doesn't handle objects (if it even has
> such a concept) by value, only by reference (which, in fact, isn't very
> surprising to me).

Semantically, this is a rather difficult question to answer for a purely 
functional language where the data can't be mutated.

-- 
Darren New, San Diego CA, USA (PST)
   Serving Suggestion:
     "Don't serve this any more. It's awful."



From: Invisible
Subject: Re: A min-heap
Date: 14 Oct 2010 05:09:49
Message: <4cb6c8dd$1@news.povray.org>
On 13/10/2010 06:06 PM, Warp wrote:
> Invisible<voi### [at] devnull>  wrote:
>> However, now that I'm thinking about it, the thing that occurs to me is
>> that finding the minimum or maximum element of a binary search tree
>> presumably requires O(log N) operations, while for a min-heap you can
>> find the minimum in O(1) operations. (Similarly, you can find the
>> maximum of a max-heap in O(1) time.)
>
>    Your tree data structure can have a reference/handle to the first and
> last elements for fast access (which is, in fact, what eg. std::set and
> std::map do, in order to be able to get the iterators to the first and
> last elements fast). In this case you get fast access to the smallest,
> largest and median elements (the root node is obviously the median in
> a balanced binary search tree).

Presumably the root is only the median if the tree is *perfectly* balanced?

At any rate, I guess you can include direct pointers to any elements 
likely to be of interest. The question then becomes "how fast is it to 
keep those pointers up to date?" At least in the case of min/max, if you 
delete the present minimum, or insert a new, lower one, it should be 
quite efficient to update the pointers.

>> So if you're doing something like implementing a priority queue or
>> similar, a heap looks like being a better choice than a BST.
>
>    A heap is at least much *simpler* to implement (and requires less
> ancillary data) than a balanced binary search tree.

I guess that's one of the reasons I ended up playing with it. It's a 
nice simple algorithm that works well, which is kinda cool.

> However, the fact
> that you can construct a heap on an array without any ancillary data
> whatsoever is still, IMO, by far its biggest advantage as a data structure
> in such a case (ie. a priority queue). You lose much of that efficiency
> if you allocate all the nodes separately.

True enough I guess.

>    I assume that means that Haskell doesn't handle objects (if it even has
> such a concept) by value, only by reference (which, in fact, isn't very
> surprising to me).

Haskell defaults to handling everything by reference, yes. There are a 
couple of reasons for that.

In Java, if you ask for an array of Pixel objects (i.e., "Pixel[]"), 
what you get is an array of *pointers* to Pixel objects. Similarly, in 
Haskell if you ask for an array of Pixel values, you get an array of 
pointers. (This is regardless of whether you ask for an immutable array 
or a mutable array.)

Reason #1: All pointers have the same structure, which means that the 
person writing the array implementation code only needs to handle one 
case - pointers.

Reason #2: Each "Pixel value" could in fact be an unevaluated 
*expression* which will yield a Pixel value when actually evaluated. You 
cannot store an expression in 24 bits of storage. It's impossible.

Of course, in Java you could ask for a byte[] instead of a Pixel[]. Then 
you get an array of actual bytes. But now *you* must write the 
boilerplate code for writing a Pixel into the appropriate place in a 
byte[], and reading it back out again.

You can do a similar thing in Haskell. And it's similarly tedious.

Haskell adds some innovations that make it easier for *the client* of 
this code, but it's still quite tedious to write the code in the first 
place.

To use the Haskell jargon, most types are "boxed" (i.e., they are 
pointers to "boxes" on the heap). Haskell (or rather, GHC) also has 
support for "unboxed" types. (Things like integers, characters, etc., 
which really are just plain binary.) The unboxed types are necessarily 
"unlifted" (i.e., they cannot represent unevaluated expressions, only 
evaluated answers).

So, a "boxed array" is an array of pointers. An "unboxed array" is an 
array of primitive values, such as an array of integers or an array of 
characters. A side-effect of this is that the data in an unboxed array 
is always fully evaluated; you cannot store unevaluated data in an 
unboxed array (or any other unboxed type).
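
A small illustration using GHC's standard array types (my example):

   import Data.Array (Array, listArray)
   import Data.Array.Unboxed (UArray)
   import qualified Data.Array.Unboxed as U

   -- Boxed: an array of three pointers, each possibly aimed at a thunk.
   boxed :: Array Int Integer
   boxed = listArray (0, 2) [1, 2, 3]

   -- Unboxed: three flat machine integers, necessarily fully evaluated.
   unboxed :: UArray Int Int
   unboxed = U.listArray (0, 2) [1, 2, 3]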

In languages such as C and Pascal, you can make an (unboxed) array of 
any type, but Haskell does not provide this level of sophistication (yet).

I say "yet" because several people are working on libraries that are 
supposed to make this sort of thing easier.



Interestingly, you *can* have unboxed record fields quite trivially. For 
example, if I say

   data Pixel = Pixel {red, green, blue :: Word8}

then each Pixel value contains three pointers to Word8 values. But if I say

   data Pixel = Pixel {red, green, blue :: {-# UNPACK #-} !Word8}

then each Pixel value contains three unboxed Word8 values. In other 
words, no pointers and no indirection.

(Amusingly, GHC stores Word8, Word16 and Word32 as 32-bit quantities, 
and just implements arithmetic on them differently. So declaring 
something as Word8 doesn't actually reduce storage space at all, it just 
changes the way overflow wraps around. Presumably because implementing 
one storage type rather than 3 is less work...)

Not only can you unpack Word8 and similar primitives, but you can in 
fact unpack almost any type you want. Automatically. (There are a few 
restrictions. For example, you cannot unpack a type inside itself.)

Since GHC is apparently perfectly capable of unboxing arbitrary types in 
constructor fields, there's no particular reason, in principle, why it 
can't do the same for arrays. It just doesn't.

Then again, contrary to my expectations, it seems that the entire GHC 
tool-chain has approximately 3 developers working on it full-time, with 
a large cloud of irregular contributors floating around the outside. No 
wonder it takes so long to implement anything...

(On top of that, implementing sophisticated abstractions is *far* more 
interesting than fixing mundane low-level problems. Thus we have things 
like Haskell's floor function being about 5x slower than the C equivalent, 
for no apparent reason other than "nobody has got round to fixing it 
yet". The ticket has been open for years, but nobody has bothered to fix 
it yet...)

>    Sometimes low-level efficiency means you have to bypass fancy
> abstractions...

You said it!



From: Warp
Subject: Re: A min-heap
Date: 14 Oct 2010 15:40:08
Message: <4cb75c98@news.povray.org>
Darren New <dne### [at] sanrrcom> wrote:
> Warp wrote:
> >   I assume that means that Haskell doesn't handle objects (if it even has
> > such a concept) by value, only by reference (which, in fact, isn't very
> > surprising to me).

> Semantically, this is a rather difficult question to answer for a purely 
> functional language where the data can't be mutated.

  Well, if eg. integers can be handled by value, why not user-defined
objects?

-- 
                                                          - Warp



From: Darren New
Subject: Re: A min-heap
Date: 14 Oct 2010 17:48:24
Message: <4cb77aa8$1@news.povray.org>
Warp wrote:
> Darren New <dne### [at] sanrrcom> wrote:
>> Warp wrote:
>>>   I assume that means that Haskell doesn't handle objects (if it even has
>>> such a concept) by value, only by reference (which, in fact, isn't very
>>> surprising to me).
> 
>> Semantically, this is a rather difficult question to answer for a purely 
>> functional language where the data can't be mutated.
> 
>   Well, if eg. integers can be handled by value, why not user-defined
> objects?

There's no need to?  Whether the language implements user objects as values 
or as something pointed to makes no difference to the meaning of the 
language. It's of course an efficiency trade-off, which is why I said it's 
*semantically* difficult to answer.

Even if Haskell is passing around pointers to user objects, it's still not 
pass-by-reference in the usual sense of the word, because the values are 
immutable.

My comment might not have been 100% in line with what you were asking about. 
I just thought it was an interesting point.  In a language where you can't 
modify memory once you've allocated it, is there any semantic difference 
between allocating it on the heap or allocating it inline somewhere? I don't 
think so.

-- 
Darren New, San Diego CA, USA (PST)
   Serving Suggestion:
     "Don't serve this any more. It's awful."



From: Invisible
Subject: Re: A min-heap
Date: 15 Oct 2010 04:38:01
Message: <4cb812e9$1@news.povray.org>
> My comment might not have been 100% in line with what you were asking
> about. I just thought it was an interesting point. In a language where
> you can't modify memory once you've allocated it, is there any semantic
> difference between allocating it on the heap or allocating it inline
> somewhere? I don't think so.

The Haskell language specification describes the syntax of the language 
and approximately how to execute it. However, it is silent on 
implementation details such as calling conventions. (This is not an 
oversight.)

The language spec says nothing about what goes on the heap, what goes on 
the stack, what goes in registers, etc. In Haskell, I can say "print (x 
+ 1)". Now, whether the value of "x" is in a register, on the stack, on 
the heap somewhere, and so on makes absolutely no difference to the code 
I just wrote there. One Haskell implementation might do it one way, and 
another might do it a completely different way, and the program will 
still run the same way.

In practice, when people say "Haskell does X", what they really mean is 
"the only currently available production-grade Haskell implementation 
(i.e., GHC) does X". Which isn't quite the same statement.

The simplest and easiest way to implement Haskell is for *everything* to 
be a heap-allocated dynamic object. Obviously this has quite laughable 
performance, and so real Haskell implementations go out of their way to 
be more efficient where possible. Haskell is particularly helpful in 
this regard, since it has no specified calling convention that you have 
to adhere to, and everything is immutable, so it doesn't matter where 
it's stored or whether you copy it.

Darren is quite right: Since [almost] everything is immutable, you can't 
distinguish between a thing and a reference to a thing. You can't 
distinguish between two pointers to the same thing and two pointers to 
things that are identical. (I.e., there is no difference between 
value-equal and reference-equal.) You can't distinguish copies from each 
other. Heck, you can't even distinguish an evaluated result from an 
unevaluated result!

This gives people trying to implement Haskell considerable design freedom.

In fact, consider this:

   repeat x = x : repeat x

Semantically, this generates an infinite list with all elements equal to 
x. But how is it implemented? Does it return a circular list, or does it 
actually *execute* an infinite function-call loop?

The answer is... it doesn't matter. From within Haskell, it is 
impossible to tell the difference.
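
Both strategies can even be written down in Haskell itself (a sketch 
with my names; as far as I know, GHC's Prelude uses the sharing form):

   repeatLoop :: a -> [a]
   repeatLoop x = x : repeatLoop x         -- a fresh cons cell per element

   repeatShared :: a -> [a]
   repeatShared x = xs where xs = x : xs   -- one cons cell, tied into a cycle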

Note that this is different from saying that there *is* no difference. 
For example, a circular list clearly takes less storage space (and less 
CPU time to manage) than an infinite loop that allocates another new 
list node on each cycle. (To say nothing of cache locality and so on.)

In other words, there are a whole raft of design decisions that make 
absolutely no difference to Haskell (i.e., your programs will still work 
right), but potentially make a very big difference to run-time performance.

Since it does affect performance, GHC (and probably other 
implementations) allow you to have some limited control over some of 
these things using compiler pragmas, command line switches and so on. 
But for the most part, the compiler internally uses sophisticated 
algorithms to attempt to determine the best choices automatically on a 
case-by-case basis. (And if you doubt this, check out the extensive 
published literature output of the GHC team.)



Final note: If I start dealing with mutable data structures, now it 
*does* matter whether you're talking about value-equality or 
reference-equality. Since, in a multi-threaded scenario, value-equality 
could potentially change at any moment, all the mutable containers 
implement reference-equality.

Comparing two things for equality is a pure operation, which you can do 
anywhere, at any time. If you want to compare two mutable objects for 
value-equality, you must do it by hand. This involves manually fetching 
the contents of the container - and thereby fixing exactly which moment 
in time the comparison occurs at, since fetching the contents of a 
mutable structure is an impure operation which must be explicitly 
sequenced with respect to other impure operations in the same thread.
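
A concrete sketch of the distinction, using IORef (my example):

   import Data.IORef (newIORef, readIORef)

   main :: IO ()
   main = do
     a <- newIORef (42 :: Int)
     b <- newIORef 42
     print (a == b)    -- False: Eq on IORef is reference-equality
     va <- readIORef a
     vb <- readIORef b
     print (va == vb)  -- True: value-equality, fixed at the moment of the reads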

In summary, once again Haskell's type system forces us to stop, think, 
and do the semantically correct thing.



From: Warp
Subject: Re: A min-heap
Date: 15 Oct 2010 10:03:04
Message: <4cb85f17@news.povray.org>
Invisible <voi### [at] devnull> wrote:
> In Java, if you ask for an array of Pixel objects (i.e., "Pixel[]"), 
> what you get is an array of *pointers* to Pixel objects.

  I think that in Java it's not completely correct to call them "pointers".

  A "pointer" is usually understood as being a raw memory address pointing
to the memory location which contains the binary representation of the object
(its "value"). This is what a pointer is in C, C++ and some other languages
which have pointers.

  In Java a reference might internally be a true pointer, or it might not.
(Instead, it could be something else, such as an index to a table which
contains information about the object.)

  There is a difference. In C/C++ you can compare pointers with an ordering
comparison (less than, etc), which allows you to eg. sort pointers (ie. sort
them by the values of the pointers themselves rather than the objects they
are pointing to). Pointer arithmetic is also a "side-effect" of this: You
can add an offset to a pointer and get another pointer which points that
much further in memory (which is used for raw arrays).

  In Java you can't sort references (because references might not have a
well-defined ordering) nor perform "reference arithmetic" (because it would
be nonsensical if the references are something else than raw memory
addresses).

  Moreover, in C and C++ objects cannot be moved in memory behind the scenes
because there may be pointers pointing to them, and hence moving an object
would break the pointers (and thus the entire program). In Java, however,
objects can be moved in memory without breaking any existing references to
them, if the references are not raw memory addresses but something smarter
(such as indices to some table, or whatever; of course this introduces an
additional indirection level to accessing the object through the reference,
but it might be worth it in some cases).

-- 
                                                          - Warp

