Ever since I read that book I got for Christmas, I've been carrying
around this implementation of a minimum-heap ("min-heap") in my head.
The design always seemed quite neat to me, and it seemed to have
impressive properties. But today I sat down to actually put those
properties to the test.
I've written a short program that randomly inserts and deletes elements
from the heap and measures how badly unbalanced the heap tree is. As per
my expectations, it seems that inserting produces nicely balanced trees.
Somewhat surprisingly, I have discovered that deletions can produce
arbitrarily unbalanced trees.
The actual algorithm can be succinctly described thus:
data Heap x = Node x (Heap x) (Heap x) | Empty

insert x' (Node x h0 h1) = Node (min x x') (insert (max x x') h1) h0
insert x' Empty          = Node x' Empty Empty

get_min (Node x _ _) = x
get_min Empty        = error "empty heap"

delete_min (Node _ h0 h1) = merge h0 h1
delete_min Empty          = error "empty heap"

merge Empty h1 = h1
merge h0 Empty = h0
merge h0 h1 =
    if get_min h0 < get_min h1
        then Node (get_min h0) h1 (delete_min h0)
        else Node (get_min h1) h0 (delete_min h1)
For those who do not grok Haskell, let me summarise thus:
1. A heap is simply a binary tree where every node contains one datum,
has between 0 and 2 children, and holds a datum no larger than any datum
in its subtrees (so the minimum is always at the root).
2. Insert:
- Inserting into an empty heap yields a trivial 1-node tree.
- Inserting into a non-empty heap proceeds as follows:
A. Construct a new tree node containing the smaller of the existing
root element and the new element to be inserted.
B. The right child of the new node is the left child of the original
root node.
C. The left child of the new node is constructed by recursively
inserting the larger of the two elements into the previous right subtree.
3. Get minimum element: Just return the datum from the root node of the
tree.
4. Delete minimum element: Just take the two subtrees of the root node
and heap-merge them.
5. Heap merge:
- Merging a heap with an empty heap is a no-op.
- To merge two non-empty heaps:
A. Determine which heap has the lower datum in its root node.
B. Put that datum into the new root node; the other heap becomes the new
left child, and the heap the datum came from (with that datum recursively
deleted) becomes the new right child.
It might just be me, but I can't help noticing how the English
description is vastly more wordy yet less precise than the source code.
But anyway...
Notice how an insertion always swaps the two subtrees, and how it always
inserts into the (old) right subtree [which becomes the new left
subtree]. In this way, the left subtree of any node is always larger
than or the same size as the corresponding right subtree. In other
words, insertions leave the tree balanced.
Notice also how a deletion (or rather, a merge) always puts the subtree
you just deleted from as the right subtree, reinforcing the same
invariant. The left subtree can never be smaller than the right subtree.
However, they *can* be wildly different sizes. If, by some misfortune,
it happens that you have a node where *all* the elements in one subtree
are strictly less than *all* of the elements in the other subtree, then
deleting from this heap will cause it to become more and more
unbalanced. And no amount of swapping subtrees around is going to fix that.
Deletion does at least leave the heap in a state where a subsequent
insertion will definitely insert into the smaller subtree, which is
good. In other words, inserting will always bring the tree back towards
being balanced again. But that still doesn't change the fact that, for
sufficiently pathological cases, deletions can make the tree unbalanced
to an unlimited degree.
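Roughly speaking, the measurement amounts to something like this. (This
is only a sketch of the idea, not my actual simulator; "size", "depth"
and "skewDemo" are helper names made up for illustration, assuming the
Heap definitions above.)

-- Helpers for measuring how lopsided a heap is.
size :: Heap x -> Int
size Empty          = 0
size (Node _ h0 h1) = 1 + size h0 + size h1

depth :: Heap x -> Int
depth Empty          = 0
depth (Node _ h0 h1) = 1 + max (depth h0) (depth h1)

-- Insert 1..n, delete the minimum n/2 times, and compare the resulting
-- depth against the ideal log2(size). Whether any particular sequence
-- actually triggers the pathological case is exactly what the simulator
-- explores by trying random sequences.
skewDemo :: Int -> (Int, Int)
skewDemo n = let h  = foldr insert Empty [1 .. n]
                 h' = iterate delete_min h !! (n `div` 2)
             in (size h', depth h')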
Since a heap is (quite deliberately) only *partially* sorted, there's
nothing to prevent this kind of thing from happening. The only way we're
gonna fix this is to start moving stuff from one subtree to the other if
it gets too unbalanced. But can we detect this condition in a way that
doesn't slow the whole thing down too much?
Insert has the invariant that the sizes of the left and right subtrees
are related. At all times, we have either L = R or L = R + 1. The
constant swapping of the two subtrees means that they both fill at an
equal rate.
Insert propagates items from the top downwards. Which way the item
propagates is down to the insertion function. But deletion moves items
up to the top. Which side the item comes from depends on which item is
the lowest, which the deletion function has no control over. Hence the
problem.
Deleting from the left subtree causes it to become the right subtree. If
we always deleted from the left, the tree would always remain balanced,
and there would be no problem. Deleting from the right subtree, however,
can unbalance the tree.
Now obviously, if the lowest element is in the right subtree, then it's
in the right subtree. There's nothing we can do about that. But this
(and only this) condition looks like a suitable signal that some
rebalancing is required. The question is, what do we do to rebalance?
One obvious possibility is to completely rebuild the entire subtree at
this moment. This operation has O(n log n) complexity, and is obviously
ludicrously expensive to do on (potentially) every single delete operation.
A better possibility is to move one element from the left subtree into
the right subtree if it's going to become too small. Doing this
reliably requires that we know whether L = R or L = R + 1. In the former
case, we do nothing, and in the latter case we rebalance by moving a
single element across. Here's the code:
data Balance = Equal | Unequal

invert Equal   = Unequal
invert Unequal = Equal

data Heap x = Node Balance x (Heap x) (Heap x) | Empty

insert x' (Node b x h0 h1) =
    Node (invert b) (min x x') (insert (max x x') h1) h0
insert x' Empty = Node Equal x' Empty Empty

get_min (Node _ x _ _) = x
get_min Empty          = error "empty heap"

delete_min (Node b _ h0 h1) = merge b h0 h1
delete_min Empty            = error "empty heap"

merge _ Empty h1 = h1
merge _ h0 Empty = h0
merge b h0 h1 =
    if get_min h0 < get_min h1
        then Node (invert b) (get_min h0) h1 (delete_min h0)
        else rebalance b (get_min h1) h0 (delete_min h1)

rebalance Equal   x h0 h1 = Node Unequal x h0 h1
rebalance Unequal x h0 h1 =
    let x' = get_min h0 in Node Equal x (delete_min h0) (insert x' h1)
Here the functions "invert", "merge" and "rebalance" are private
functions not accessible to external clients.
I've run this code through my simulator, and after many minutes of
number crunching it has failed to produce one single unbalanced tree.
This of course is not proof, but it shows at least that this structure
works somewhat better than the original. (And for very little cost in
either code size or run-time complexity.)
Unfortunately, because we've added a new invariant to the data structure
(namely that the left subtree is always exactly the same size as, or one
element larger than, the right subtree), we lose the
ability to cheaply merge two heaps. (In the old version, this was
already a very cheap way to make an arbitrarily unbalanced tree; just
merge a size-1 tree with a size-1000 tree and watch how lopsided it now
is!) You would now be forced to do some kind of elaborate rebalancing
procedure to restore the new invariant on the subtree sizes.
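Just to illustrate the cost, the most naive way out would be to flatten
one heap and re-insert its elements one by one. (A sketch only; "toList"
and "mergeHeaps" are made-up names, not part of the structure above.)

-- Naive merge sketch: flatten one heap and re-insert its elements.
-- This restores the size invariant (because insert maintains it), but
-- it costs a full insertion per element, rather than the original
-- version's cheap structural merge.
toList :: Ord x => Heap x -> [x]
toList Empty = []
toList h     = get_min h : toList (delete_min h)

mergeHeaps :: Ord x => Heap x -> Heap x -> Heap x
mergeHeaps h0 h1 = foldr insert h1 (toList h0)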
Invisible <voi### [at] devnull> wrote:
> For those who do not grok Haskell, let me summarise thus:
What are the advantages of a heap compared to a balanced binary search
tree in Haskell?
The reason heaps are sometimes used in imperative languages is because
a heap can be constructed inside an array without requiring any ancillary
data whatsoever (in other words, the array contains the elements and
nothing else; the order of the elements can be heapified in-place, without
any ancillary data or extra memory), which makes it extremely efficient
(both memorywise, rather obviously, and also speedwise).
However, Haskell doesn't use (mutable) arrays, so you have to do it
as a regular tree, with each node allocated separately and containing
references and whatever ancillary data.
--
- Warp
On 12/10/2010 04:48 PM, Warp wrote:
> What are the advantages of a heap compared to a balanced binary search
> tree in Haskell?
I hadn't really thought about it.
However, now that I'm thinking about it, the thing that occurs to me is
that finding the minimum or maximum element of a binary search tree
presumably requires O(log N) operations, while for a min-heap you can
find the minimum in O(1) operations. (Similarly, you can find the
maximum of a max-heap in O(1) time.)
So if you're doing something like implementing a priority queue or
similar, a heap looks like being a better choice than a BST.
On the other hand, I'm not intimately familiar with the various kinds of
BSTs and their detailed properties...
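As a rough illustration of what I mean, using Data.Set from the
containers package as a stand-in for a balanced BST and the Heap type
from my earlier post ("bstMin" and "heapMin" are just names for this
example):

import qualified Data.Set as Set

bstMin :: Set.Set Int -> Int
bstMin = Set.findMin      -- walks down the left spine: O(log n)

heapMin :: Ord x => Heap x -> x
heapMin = get_min         -- just reads the root node: O(1)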
> The reason heaps are sometimes used in imperative languages is because
> a heap can be constructed inside an array without requiring any ancillary
> data whatsoever.
>
> However, Haskell doesn't use (mutable) arrays, so you have to do it
> as a regular tree, with each node allocated separately and containing
> references and whatever ancillary data.
You could do it as a mutable array (but nobody will). More to the point,
the easiest way to do it is a mutable array of pointers, which still has
the overhead (although I guess you avoid needing the child pointers). If
you want (say) a heap of integers represented as just an array of
unboxed integers, you're going to have to do a fair bit of leg-work in
Haskell. Sadly...
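To give a flavour of the leg-work, here is a sketch of just the
sift-down step for an array-based min-heap of Ints, using the standard
"array" package. ("siftDown" and "pickSmaller" are names invented for
the example; this is not a complete heap implementation.)

import Control.Monad (when)
import Control.Monad.ST (ST)
import Data.Array.ST (STUArray, readArray, writeArray)

-- Restore the heap property at index i, for a heap occupying indices
-- 0..n-1 of an unboxed mutable array of Ints.
siftDown :: STUArray s Int Int -> Int -> Int -> ST s ()
siftDown arr n i = do
    let l = 2 * i + 1
        r = 2 * i + 2
    smallest <- pickSmaller arr n i l >>= \j -> pickSmaller arr n j r
    when (smallest /= i) $ do
        xi <- readArray arr i
        xs <- readArray arr smallest
        writeArray arr i xs
        writeArray arr smallest xi
        siftDown arr n smallest

-- Index of the smaller element between positions i and j
-- (j may be out of range, in which case i wins).
pickSmaller :: STUArray s Int Int -> Int -> Int -> Int -> ST s Int
pickSmaller arr n i j
    | j >= n    = return i
    | otherwise = do
        xi <- readArray arr i
        xj <- readArray arr j
        return (if xj < xi then j else i)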
Invisible <voi### [at] devnull> wrote:
> However, now that I'm thinking about it, the thing that occurs to me is
> that finding the minimum or maximum element of a binary search tree
> presumably requires O(log N) operations, while for a min-heap you can
> find the minimum in O(1) operations. (Similarly, you can find the
> maximum of a max-heap in O(1) time.)
Your tree data structure can have a reference/handle to the first and
last elements for fast access (which is, in fact, what eg. std::set and
std::map do, in order to be able to get the iterators to the first and
last elements fast). In this case you get fast access to the smallest,
largest and median elements (the root node is obviously the median in
a balanced binary search tree).
> So if you're doing something like implementing a priority queue or
> similar, a heap looks like being a better choice than a BST.
A heap is at least much *simpler* to implement (and requires less
ancillary data) than a balanced binary search tree. However, the fact
that you can construct a heap on an array without any ancillary data
whatsoever is still, IMO, by far its biggest advantage as a data structure
in such a case (ie. a priority queue). You lose much of that efficiency
if you allocate all the nodes separately.
> You could do it as a mutable array (but nobody will). More to the point,
> the easiest way to do it is a mutable array of pointers, which still has
> the overhead (although I guess you avoid needing the child pointers).
I assume that means that Haskell doesn't handle objects (if it even has
such a concept) by value, only by reference (which, in fact, isn't very
surprising to me).
> If you want (say) a heap of integers represented as just an array of
> unboxed integers, you're going to have to do a fair bit of leg-work in
> Haskell. Sadly...
Sometimes low-level efficiency means you have to bypass fancy
abstractions...
--
- Warp
Warp wrote:
> I assume that means that Haskell doesn't handle objects (if it even has
> such a concept) by value, only by reference (which, in fact, isn't very
> surprising to me).
Semantically, this is a rather difficult question to answer for a purely
functional language where the data can't be mutated.
--
Darren New, San Diego CA, USA (PST)
Serving Suggestion:
"Don't serve this any more. It's awful."
On 13/10/2010 06:06 PM, Warp wrote:
> Invisible<voi### [at] devnull> wrote:
>> However, now that I'm thinking about it, the thing that occurs to me is
>> that finding the minimum or maximum element of a binary search tree
>> presumably requires O(log N) operations, while for a min-heap you can
>> find the minimum in O(1) operations. (Similarly, you can find the
>> maximum of a max-heap in O(1) time.)
>
> Your tree data structure can have a reference/handle to the first and
> last elements for fast access (which is, in fact, what eg. std::set and
> std::map do, in order to be able to get the iterators to the first and
> last elements fast). In this case you get fast access to the smallest,
> largest and median elements (the root node is obviously the median in
> a balanced binary search tree).
Presumably the root is only the median if the tree is *perfectly* balanced?
At any rate, I guess you can include direct pointers to any elements
likely to be of interest. The question then becomes "how fast is it to
keep those pointers up to date?" At least in the case of min/max, if you
delete the present minimum, or insert a new, lower one, it should be
quite efficient to update the pointers.
>> So if you're doing something like implementing a priority queue or
>> similar, a heap looks like being a better choice than a BST.
>
> A heap is at least much *simpler* to implement (and requires less
> ancillary data) than a balanced binary search tree.
I guess that's one of the reasons I ended up playing with it. It's a
nice simple algorithm that works well, which is kinda cool.
> However, the fact
> that you can construct a heap on an array without any ancillary data
> whatsoever is still, IMO, by far its biggest advantage as a data structure
> in such a case (ie. a priority queue). You lose much of that efficiency
> if you allocate all the nodes separately.
True enough I guess.
> I assume that means that Haskell doesn't handle objects (if it even has
> such a concept) by value, only by reference (which, in fact, isn't very
> surprising to me).
Haskell defaults to handling everything by reference, yes. There are a
couple of reasons for that.
In Java, if you ask for an array of Pixel objects (i.e., "Pixel[]"),
what you get is an array of *pointers* to Pixel objects. Similarly, in
Haskell if you ask for an array of Pixel values, you get an array of
pointers. (This is regardless of whether you ask for an immutable array
or a mutable array.)
Reason #1: All pointers have the same structure, which means that the
person writing the array implementation code only needs to handle one
case - pointers.
Reason #2: Each "Pixel value" could in fact be an unevaluated
*expression* which will yield a Pixel value when actually evaluated. You
cannot store an expression in 24 bits of storage. It's impossible.
Of course, in Java you could ask for a byte[] instead of a Pixel[]. Then
you get an array of actual bytes. But now *you* must write the
boilerplate code for writing a Pixel into the appropriate place in a
byte[], and reading it back out again.
You can do a similar thing in Haskell. And it's similarly tedious.
Haskell adds some innovations that make it easier for *the client* of
this code, but it's still quite tedious to write the code in the first
place.
To use the Haskell jargon, most types are "boxed" (i.e., they are
pointers to "boxes" on the heap). Haskell (or rather, GHC) also has
support for "unboxed" types. (Things like integers, characters, etc.,
which really are just plain binary.) The unboxed types are necessarily
"unlifted" (i.e., they cannot represent unevaluated expressions, only
evaluated answers).
So, a "boxed array" is an array of pointers. An "unboxed array" is an
array of primitive values, such as an array of integers or an array of
characters. A side-effect of this is that the data in an unboxed array
is always fully evaluated; you cannot store unevaluated data in an
unboxed array (or any other unboxed type).
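To make that concrete, a tiny sketch using the standard "array" package
(the names "boxed" and "unboxed" are just for the example; Data.Vector
would be the more modern choice):

import Data.Array         (Array, listArray)
import Data.Array.Unboxed (UArray)
import qualified Data.Array.Unboxed as U

-- An array of pointers; each element may even be an unevaluated thunk.
boxed :: Array Int Integer
boxed = listArray (0, 4) [10, 20, 30, 40, 50]

-- A flat array of machine integers; always fully evaluated.
unboxed :: UArray Int Int
unboxed = U.listArray (0, 4) [10, 20, 30, 40, 50]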
In languages such as C and Pascal, you can make an (unboxed) array of
any type, but Haskell does not provide this level of sophistication (yet).
I say "yet" because several people are working on libraries that are
supposed to make this sort of thing easier.
Interestingly, you *can* have unboxed record fields quite trivially. For
example, if I say
data Pixel = Pixel {red, green, blue :: Word8}
then each Pixel value contains three pointers to Word8 values. But if I say
data Pixel = Pixel {red, green, blue :: {-# UNPACK #-} !Word8}
then each Pixel value contains three unboxed Word8 values. In other
words, no pointers and no indirection.
(Amusingly, GHC stores Word8, Word16 and Word32 as 32-bit quantities,
and just implements arithmetic on them differently. So declaring
something as Word8 doesn't actually reduce storage space at all, it just
changes the way overflow wraps around. Presumably because implementing
one storage type rather than 3 is less work...)
Not only can you unpack Word8 and similar primitives, but you can in
fact unpack almost any type you want. Automatically. (There are a few
restrictions. For example, you cannot unpack a type inside itself.)
Since GHC is apparently perfectly capable of unboxing arbitrary types in
constructor fields, there's no particular reason, in principle, why it
can't do the same for arrays. It just doesn't.
Then again, contrary to my expectations, it seems that the entire GHC
tool-chain has approximately 3 developers working on it full-time, with
a large cloud of irregular contributors floating around the outside. No
wonder it takes so long to implement anything...
(On top of that, implementing sophisticated abstractions is *far* more
interesting than fixing mundane low-level problems. Thus we have things
like Haskell's floor function being about 5x slower than the C equivalent,
for no apparent reason other than "nobody has got round to fixing it
yet". The ticket has been open for years, but nobody has bothered to fix
it yet...)
> Sometimes low-level efficiency means you have to bypass fancy
> abstractions...
You said it!
Darren New <dne### [at] sanrrcom> wrote:
> Warp wrote:
> > I assume that means that Haskell doesn't handle objects (if it even has
> > such a concept) by value, only by reference (which, in fact, isn't very
> > surprising to me).
> Semantically, this is a rather difficult question to answer for a purely
> functional language where the data can't be mutated.
Well, if eg. integers can be handled by value, why not user-defined
objects?
--
- Warp
Warp wrote:
> Darren New <dne### [at] sanrrcom> wrote:
>> Warp wrote:
>>> I assume that means that Haskell doesn't handle objects (if it even has
>>> such a concept) by value, only by reference (which, in fact, isn't very
>>> surprising to me).
>
>> Semantically, this is a rather difficult question to answer for a purely
>> functional language where the data can't be mutated.
>
> Well, if eg. integers can be handled by value, why not user-defined
> objects?
There's no need to? Whether the language implements user objects as values
or as something pointed to makes no difference to the meaning of the
language. It's of course an efficiency trade-off, which is why I said it's
*semantically* difficult to answer.
Even if Haskell is passing around pointers to user objects, it's still not
pass-by-reference in the usual sense of the word, because the values are
immutable.
My comment might not have been 100% in line with what you were asking about.
I just thought it was an interesting point. In a language where you can't
modify memory once you've allocated it, is there any semantic difference
between allocating it on the heap or allocating it inline somewhere? I don't
think so.
--
Darren New, San Diego CA, USA (PST)
Serving Suggestion:
"Don't serve this any more. It's awful."
> My comment might not have been 100% in line with what you were asking
> about. I just thought it was an interesting point. In a language where
> you can't modify memory once you've allocated it, is there any semantic
> difference between allocating it on the heap or allocating it inline
> somewhere? I don't think so.
The Haskell language specification describes the syntax of the language
and approximately how to execute it. However, it is silent on
implementation details such as calling conventions. (This is not an
oversight.)
The language spec says nothing about what goes on the heap, what goes on
the stack, what goes in registers, etc. In Haskell, I can say "print (x
+ 1)". Now, whether the value of "x" is in a register, on the stack, on
the heap somewhere, and so on makes absolutely no difference to the code
I just wrote there. One Haskell implementation might do it one way, and
another might do it a completely different way, and the program will
still run the same way.
In practice, when people say "Haskell does X", what they really mean is
"the only currently available production-grade Haskell implementation
(i.e., GHC) does X". Which isn't quite the same statement.
The simplest and easiest way to implement Haskell is for *everything* to
be a heap-allocated dynamic object. Obviously this has quite laughable
performance, and so real Haskell implementations go out of their way to
be more efficient where possible. Haskell is particularly helpful in
this regard, since it has no specified calling convention that you have
to adhere to, and everything is immutable, so it doesn't matter where
it's stored or whether you copy it.
Darren is quite right: Since [almost] everything is immutable, you can't
distinguish between a thing and a reference to a thing. You can't
distinguish between two pointers to the same thing and two pointers to
things that are identical. (I.e., there is no difference between
value-equal and reference-equal.) You can't distinguish copies from each
other. Heck, you can't even distinguish an evaluated result from an
unevaluated result!
This gives people trying to implement Haskell considerable design freedom.
In fact, consider this:
repeat x = x : repeat x
Semantically, this generates an infinite list with all elements equal to
x. But how is it implemented? Does it return a circular list, or does it
actually *execute* an infinite function-call loop?
The answer is... it doesn't matter. From within Haskell, it is
impossible to tell the difference.
Note that this is different from saying that there *is* no difference.
For example, a circular list clearly takes less storage space (and less
CPU time to manage) than an infinite loop that allocates another new
list node on each cycle. (To say nothing of cache locality and so on.)
In other words, there are a whole raft of design decisions that make
absolutely no difference to Haskell (i.e., your programs will still work
right), but potentially make a very big difference to run-time performance.
Since it does affect performance, GHC (and probably other
implementations) allow you to have some limited control over some of
these things using compiler pragmas, command line switches and so on.
But for the most part, the compiler internally uses sophisticated
algorithms to attempt to determine the best choices automatically on a
case-by-case basis. (And if you doubt this, check out the extensive
published literature output of the GHC team.)
Final note: If I start dealing with mutable data structures, now it
*does* matter whether you're talking about value-equality or
reference-equality. Since, in a multi-threaded scenario, value-equality
could potentially change at any moment, all the mutable containers
implement reference-equality.
Comparing two things for equality is a pure operation, which you can do
anywhere, at any time. If you want to compare two mutable objects for
value-equality, you must do it by hand. This involves manually fetching
the contents of the container - and thereby fixing exactly which moment
in time the comparison occurs at, since fetching the contents of a
mutable structure is an impure operation which must be explicitly
sequenced with respect to other impure operations in the same thread.
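A tiny sketch of the difference, with IORef standing in for "a mutable
object" ("sameRef" and "sameValue" are just names for this example):

import Data.IORef (IORef, readIORef)

-- Reference-equality: IORef has an Eq instance, and the comparison is pure.
sameRef :: IORef a -> IORef a -> Bool
sameRef r1 r2 = r1 == r2

-- Value-equality: we must explicitly read both refs, which pins down the
-- moment at which the comparison happens.
sameValue :: Eq a => IORef a -> IORef a -> IO Bool
sameValue r1 r2 = do
    x <- readIORef r1
    y <- readIORef r2
    return (x == y)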
In summary, once again Haskell's type system forces us to stop, think,
and do the semantically correct thing.
Invisible <voi### [at] devnull> wrote:
> In Java, if you ask for an array of Pixel objects (i.e., "Pixel[]"),
> what you get is an array of *pointers* to Pixel objects.
I think that in Java it's not completely correct to call them "pointers".
A "pointer" is usually understood as being a raw memory address pointing
to the memory location which contains the binary representation of the object
(its "value"). This is what a pointer is in C, C++ and some other languages
which have pointers.
In Java a reference might internally be a true pointer, or it might not.
(Instead, it could be something else, such as an index to a table which
contains information about the object.)
There is a difference. In C/C++ you can compare pointers with an ordering
comparison (less than, etc), which allows you to eg. sort pointers (ie. sort
them by the values of the pointers themselves rather than the objects they
are pointing to). Pointer arithmetic is also a "side-effect" of this: You
can add an offset to a pointer and get another pointer which points that
much further in memory (which is used for raw arrays).
In Java you can't sort references (because references might not have a
well-defined ordering) nor perform "reference arithmetic" (because it would
be nonsensical if the references are something else than raw memory
addresses).
Moreover, in C and C++ objects cannot be moved in memory behind the scenes
because there may be pointers pointing to them, and hence moving an object
would break the pointers (and thus the entire program). In Java, however,
objects can be moved in memory without breaking any existing references to
them, if the references are not raw memory addresses but something smarter
(such as indices to some table, or whatever; of course this introduces an
additional indirection level to accessing the object through the reference,
but it might be worth it in some cases).
--
- Warp