POV-Ray : Newsgroups : povray.off-topic : Nice data structure Server Time
5 Sep 2024 21:23:34 EDT (-0400)
  Nice data structure (Message 6 to 15 of 25)  
<<< Previous 5 Messages Goto Latest 10 Messages Next 10 Messages >>>
From: Darren New
Subject: Re: Nice data structure
Date: 20 Jun 2009 15:32:49
Message: <4a3d3961@news.povray.org>
Warp wrote:
> Darren New <dne### [at] sanrrcom> wrote:
>> I think that would work with the caveat that you have to distinguish between 
>> nodes without a child and nodes whose only child is '\0'.  That is, not all 
>> your branches will end in a '\0' as shown in the picture.
> 
>   Each node needs a bit of information denoting whether it's an end node
> or not (ie. whether that node is the last character of a string stored in
> the data structure or not) regardless. For example, if you store "CAT" and
> "CATS" as two separate strings, you'll need a bit in the 'T' node to
> indicate that there exists a string which end there.

I think the way the original paper worked it is to store the '\0' as part of 
the string in the tree. So your CAT vs CATS would be
-->C-->A-->T-->S-->\0
            |
            v
           \0

You could of course represent it as a separate bit instead.

>   Doing it like I suggested doesn't change anything. For the extraneous
> nodes you simply set the bit to "no string ends here".

Yes, that's another way to do it. However, if you want to associate some 
data with each string (e.g., you're using it in place of a hashtable map, 
instead of in place of a list of strings like a spelling checker might use), 
you'll want a place to store the pointer associated with the "string ends 
here" anyway, and it might be more space-efficient to have a '\0' node to 
tack that pointer onto than it would be to have that pointer on every 
character whether it's the end of a string or not. If you have '\0' nodes, 
you could special-case the "next character" (i.e., middle) pointer to point 
to the data you want to associate with the word.

I wonder how ugly this would be to implement functionally.

-- 
   Darren New, San Diego CA, USA (PST)
   Insanity is a small city on the western
   border of the State of Mind.


Post a reply to this message

From: Warp
Subject: Re: Nice data structure
Date: 20 Jun 2009 16:20:22
Message: <4a3d4486@news.povray.org>
Darren New <dne### [at] sanrrcom> wrote:
> I think the way the original paper worked it is to store the '\0' as part of 
> the string in the tree. So your CAT vs CATS would be
> -->C-->A-->T-->S-->\0
>             |
>             v
>            \0

  But that's not a null pointer indicating the end of the string. That's
a regular node containing a special character which indicates the end of
the string. Implementing my automatically-balanced tree creation idea can
use the exact same method for marking ends of strings.

  Btw, the node containing an end-of-string value idea works only as long
as there's a special character value which never appears in a string. It
can't be done like that if the range of values of each character in your
"string" can hold is the full 0-255 range. Of course you could use values
larger than 8 bits, but that's just basically the exact same thing as having
an end-of-string bit at each node. (And given that the "character" in the
node will probably take 4 bytes for alignment purposes anyways, there's
plenty of unused bits.)

  Which got me thinking: This data container uses quite a lot of space,
actually. Each node requires three pointers plus the character itself.
In a 32-bit system the pointers will be 4 bytes each, and as said, memory
alignment requires for the character to take 4 bytes as well, for a grand
total of 16 bytes per node (and this assuming you are storing each node
optimally in an array rather than allocating each node separately).
16 bytes to store one single character sounds quite a lot... :)

  Of course strings with equal beginnings will share nodes, but if the
strings are very diverse, they will still end up using quite a lot of
memory.

> If you have '\0' nodes, 
> you could special-case the "next character" (i.e., middle) pointer to point 
> to the data you want to associate with the word.

  In fact, since no character can be smaller than '\0', you could also
have additional '\0' nodes on the left branch, each one pointing to
associated data. That way you can have a multiset, ie. the same key
pointing to multiple data, rather than just one.

-- 
                                                          - Warp


Post a reply to this message

From: Darren New
Subject: Re: Nice data structure
Date: 20 Jun 2009 17:26:18
Message: <4a3d53fa$1@news.povray.org>
Warp wrote:
> Darren New <dne### [at] sanrrcom> wrote:
>> I think the way the original paper worked it is to store the '\0' as part of 
>> the string in the tree. So your CAT vs CATS would be
>> -->C-->A-->T-->S-->\0
>>             |
>>             v
>>            \0
> 
>   But that's not a null pointer indicating the end of the string. That's
> a regular node containing a special character which indicates the end of
> the string. Implementing my automatically-balanced tree creation idea can
> use the exact same method for marking ends of strings.

Then we're in agreement. I was using \0 to indicate ASCII NUL, not C's NULL 
pointer.  You can either use a node whose character value is '\0', or you 
can use a separate bit, depending on your needs.

>   Btw, the node containing an end-of-string value idea works only as long
> as there's a special character value which never appears in a string. It
> can't be done like that if the range of values of each character in your
> "string" can hold is the full 0-255 range. Of course you could use values
> larger than 8 bits, but that's just basically the exact same thing as having
> an end-of-string bit at each node.

Yep. Or you could have another pointer that points to the data that is 
associated with that string (i.e., the "value" of the "key/value" pair), and 
if that's not NULL, then a string ends at that node.

> 16 bytes to store one single character sounds quite a lot... :)

And a fourth pointer for the value associated, yes. Of course, any tree 
structure is going to be close to the same size too.  If you have lots and 
lots of strings, and you're more worried about space than speed, a hash 
table works out better, I expect.

>   Of course strings with equal beginnings will share nodes, but if the
> strings are very diverse, they will still end up using quite a lot of
> memory.

If you're using up enough memory to be worried about it, chances are you 
have enough strings that the beginnings will overlap a lot? I mean, unless 
you're really memory-constrained.

>   In fact, since no character can be smaller than '\0', you could also
> have additional '\0' nodes on the left branch, each one pointing to
> associated data. That way you can have a multiset, ie. the same key
> pointing to multiple data, rather than just one.

Hmmmmm. Well, certainly the '\0' node could point to a different structure 
than the other nodes. You already have three pointers there to play with, 
and you could reasonably hang an entire tree underneath if you wanted.

-- 
   Darren New, San Diego CA, USA (PST)
   Insanity is a small city on the western
   border of the State of Mind.


Post a reply to this message

From: Darren New
Subject: Re: Nice data structure
Date: 20 Jun 2009 17:36:52
Message: <4a3d5674$1@news.povray.org>
Warp wrote:
> Each node requires three pointers plus the character itself.

Or, with enough work, you could get away with not using full-sized pointers, 
perhaps. Indexes into arrays (with said arrays split into chunks if you have 
trouble finding that much contiguous address space) might work too.

Or, if it's on disk, you could have two kinds of nodes - ones that point to 
other nodes nearby (or in the same block or file, say), and others that 
point to different files. So the root and first 2 or 3 levels might be in 
one block, with the "leaves" in other blocks or files or something. Since 
you don't randomly walk around the tree when you're updating it, you don't 
need to worry quite so much about caching as you might otherwise.

-- 
   Darren New, San Diego CA, USA (PST)
   Insanity is a small city on the western
   border of the State of Mind.


Post a reply to this message

From: Warp
Subject: Re: Nice data structure
Date: 20 Jun 2009 17:53:09
Message: <4a3d5a45@news.povray.org>
Darren New <dne### [at] sanrrcom> wrote:
> > 16 bytes to store one single character sounds quite a lot... :)

> And a fourth pointer for the value associated

  That's not necessary if you use the '\0' node to end strings, as you
pointed out.

> Hmmmmm. Well, certainly the '\0' node could point to a different structure 
> than the other nodes. You already have three pointers there to play with, 
> and you could reasonably hang an entire tree underneath if you wanted.

  You can't use the "right child" pointer of the '\0' node for ancillary
data because it can be part of the regular tree structure.

-- 
                                                          - Warp


Post a reply to this message

From: Warp
Subject: Re: Nice data structure
Date: 20 Jun 2009 18:09:14
Message: <4a3d5e0a@news.povray.org>
Darren New <dne### [at] sanrrcom> wrote:
> Warp wrote:
> > Each node requires three pointers plus the character itself.

> Or, with enough work, you could get away with not using full-sized pointers, 
> perhaps. Indexes into arrays (with said arrays split into chunks if you have 
> trouble finding that much contiguous address space) might work too.

  I suppose that if you limit the size of the data container to 2^24 nodes,
you could use "pointers" of 3 bytes, which would be 9 bytes plus the
character. Alignment usually requires to round that up to 12 bytes. (The
extra space could be used for the end-of-string bit.)

  If you limited the number of nodes to 2^16 then you could use 8 bytes
per node. However, I'm not sure if such a limitation is very practical.

  Of course if you are willing to sacrifice some speed in order to squeeze
space, you could break the word/half-word boundaries, and compact eg. the
three 24-bit pointers and the 8-bit character into a 10-byte node in a
byte array. Or if you are not afraid of bit-fu, something more exotic like
21 bits for the "pointers" (giving a maximum of 2097152 nodes), 8 bits for
the characters and 1 bit for the end-of-string bit, which would result in
9 bytes per node.

-- 
                                                          - Warp


Post a reply to this message

From: Darren New
Subject: Re: Nice data structure
Date: 20 Jun 2009 18:35:45
Message: <4a3d6441@news.povray.org>
Warp wrote:
>   You can't use the "right child" pointer of the '\0' node for ancillary
> data because it can be part of the regular tree structure.

True, true. :-)

-- 
   Darren New, San Diego CA, USA (PST)
   Insanity is a small city on the western
   border of the State of Mind.


Post a reply to this message

From: Darren New
Subject: Re: Nice data structure
Date: 20 Jun 2009 18:40:25
Message: <4a3d6559$1@news.povray.org>
Warp wrote:
>   I suppose that if you limit the size of the data container to 2^24 nodes,

No, you could use "offsets to next element" or something, too, methinks. But 
yeah, that's kind of the thing I was thinking about. By the time you have 
more than 2^16 nodes, you're probably worrying more about speed, paging, etc 
than you are about raw size, perhaps. And you're going to start to see a lot 
of overlap in the earlier parts of the strings.

-- 
   Darren New, San Diego CA, USA (PST)
   Insanity is a small city on the western
   border of the State of Mind.


Post a reply to this message

From: clipka
Subject: Re: Nice data structure
Date: 20 Jun 2009 20:40:00
Message: <web.4a3d808c8e78d9a973b318a20@news.povray.org>
Warp <war### [at] tagpovrayorg> wrote:
> Darren New <dne### [at] sanrrcom> wrote:
>   However, I can think of a much easier solution which doesn't need any
> rebalancing:
>
>   Always create every node so that it partitions the alphabet in two equal
> parts. In other words, assuming an alphabet of A-Z, the root node would
> *always* be N.

This would require your data to have a linear distribution.

What if all your keys had the form "img00000000002AC73F.jpg"?


As a matter of fact, there already are good tree concepts out there that include
automatic balancing without any a-priori knowledge of the data. Not perfect, but
with a certain guaranteed minimum quality. Red-Black-Trees are a quite popular
example. I'd expect the same principle to be adaptable to ternary trees, too.


Post a reply to this message

From: Darren New
Subject: Re: Nice data structure
Date: 21 Jun 2009 01:18:18
Message: <4a3dc29a$1@news.povray.org>
clipka wrote:
> Red-Black-Trees are a quite popular example. 

Skip lists are also a cool data structure for when you have a lot of data 
you'd like to put in a tree but don't want to balance things etc. 
Generalized to skip graphs in the same sense that a DHT generalizes a hashtable.

-- 
   Darren New, San Diego CA, USA (PST)
   Insanity is a small city on the western
   border of the State of Mind.


Post a reply to this message

<<< Previous 5 Messages Goto Latest 10 Messages Next 10 Messages >>>

Copyright 2003-2023 Persistence of Vision Raytracer Pty. Ltd.