|
|
|
|
|
|
| |
| |
|
|
|
|
| |
| |
|
|
http://www.pcplus.co.uk/node/3074/
--
Darren New, San Diego CA, USA (PST)
Insanity is a small city on the western
border of the State of Mind.
Post a reply to this message
|
|
| |
| |
|
|
|
|
| |
| |
|
|
Darren New <dne### [at] sanrrcom> wrote:
> http://www.pcplus.co.uk/node/3074/
> --
> Darren New, San Diego CA, USA (PST)
> Insanity is a small city on the western
> border of the State of Mind.
Cool idea. I've often wondered about the possibility of multiple child nodes,
and what situations would benefit. Seems this is the case where three nodes is
optimum.
....Chambers
Post a reply to this message
|
|
| |
| |
|
|
|
|
| |
| |
|
|
Darren New <dne### [at] sanrrcom> wrote:
> http://www.pcplus.co.uk/node/3074/
One problem which I see with the tree building algorithm presented there
is that the tree is not balanced. The geometry of the tree will heavily
depend on which order you insert the strings into it. A suboptimal insertion
order will cause the tree to be heavily unbalanced.
For example, assuming an alphabet of 26 characters, you may end up with
a tree such that if you try to search, for example, for the string "AAAA",
you first have to choose the left node 25 times before you find the first
one equal to A, after which you have to choose again the left node 25
times for the second A, and so on.
However, if the tree was fully balanced, then you would only have to
choose the left node 5 times (or less) for each of the A's (because
log2(26) = 5). The problem of course accentuates itself with larger
alphabets (eg. if you had a full byte alphabet, ie. 256 distinct "letters").
Thinking about the problem, I see two possible solutions: You could come
up with a really contrived solution, similar to how binary search trees are
kept balanced (eg. by using the same techniques as with a red-black tree).
However, I can think of a much easier solution which doesn't need any
rebalancing:
Always create every node so that it partitions the alphabet in two equal
parts. In other words, assuming an alphabet of A-Z, the root node would
*always* be N. Then create new nodes as needed (for the string to be
inserted): If the first character of the new string is less then N, then
create a G node at the left of the root (regardless of what the first
character of the string was). If it's larger, create a T node at the right.
If it was equal, then create an new N node as the "equal" child node for
the root. If the first character of the string was not N, keep splitting
like this until you end up with the character in question, and then always
create a new N node as a new child node for that. Repeat this until the
entire string has been inserted.
There will be some extra nodes (something line O(log2(n)) extra nodes
per character), but the tree will always be fully balanced, and immediately
when a new string is added which already has an existing node in the tree,
it will benefit from it. In an optimal case the amount of extraneous nodes
in the tree will be quite small compared to the amount of "true" nodes.
--
- Warp
Post a reply to this message
|
|
| |
| |
|
|
|
|
| |
| |
|
|
Warp wrote:
> when a new string is added which already has an existing node in the tree,
> it will benefit from it.
I think that would work with the caveat that you have to distinguish between
nodes without a child and nodes whose only child is '\0'. That is, not all
your branches will end in a '\0' as shown in the picture. I'd have to think
harder about the performance implications for failed lookups, too.
--
Darren New, San Diego CA, USA (PST)
Insanity is a small city on the western
border of the State of Mind.
Post a reply to this message
|
|
| |
| |
|
|
|
|
| |
| |
|
|
Darren New <dne### [at] sanrrcom> wrote:
> I think that would work with the caveat that you have to distinguish between
> nodes without a child and nodes whose only child is '\0'. That is, not all
> your branches will end in a '\0' as shown in the picture.
Each node needs a bit of information denoting whether it's an end node
or not (ie. whether that node is the last character of a string stored in
the data structure or not) regardless. For example, if you store "CAT" and
"CATS" as two separate strings, you'll need a bit in the 'T' node to
indicate that there exists a string which end there.
Doing it like I suggested doesn't change anything. For the extraneous
nodes you simply set the bit to "no string ends here".
--
- Warp
Post a reply to this message
|
|
| |
| |
|
|
|
|
| |
| |
|
|
Warp wrote:
> Darren New <dne### [at] sanrrcom> wrote:
>> I think that would work with the caveat that you have to distinguish between
>> nodes without a child and nodes whose only child is '\0'. That is, not all
>> your branches will end in a '\0' as shown in the picture.
>
> Each node needs a bit of information denoting whether it's an end node
> or not (ie. whether that node is the last character of a string stored in
> the data structure or not) regardless. For example, if you store "CAT" and
> "CATS" as two separate strings, you'll need a bit in the 'T' node to
> indicate that there exists a string which end there.
I think the way the original paper worked it is to store the '\0' as part of
the string in the tree. So your CAT vs CATS would be
-->C-->A-->T-->S-->\0
|
v
\0
You could of course represent it as a separate bit instead.
> Doing it like I suggested doesn't change anything. For the extraneous
> nodes you simply set the bit to "no string ends here".
Yes, that's another way to do it. However, if you want to associate some
data with each string (e.g., you're using it in place of a hashtable map,
instead of in place of a list of strings like a spelling checker might use),
you'll want a place to store the pointer associated with the "string ends
here" anyway, and it might be more space-efficient to have a '\0' node to
tack that pointer onto than it would be to have that pointer on every
character whether it's the end of a string or not. If you have '\0' nodes,
you could special-case the "next character" (i.e., middle) pointer to point
to the data you want to associate with the word.
I wonder how ugly this would be to implement functionally.
--
Darren New, San Diego CA, USA (PST)
Insanity is a small city on the western
border of the State of Mind.
Post a reply to this message
|
|
| |
| |
|
|
|
|
| |
| |
|
|
Darren New <dne### [at] sanrrcom> wrote:
> I think the way the original paper worked it is to store the '\0' as part of
> the string in the tree. So your CAT vs CATS would be
> -->C-->A-->T-->S-->\0
> |
> v
> \0
But that's not a null pointer indicating the end of the string. That's
a regular node containing a special character which indicates the end of
the string. Implementing my automatically-balanced tree creation idea can
use the exact same method for marking ends of strings.
Btw, the node containing an end-of-string value idea works only as long
as there's a special character value which never appears in a string. It
can't be done like that if the range of values of each character in your
"string" can hold is the full 0-255 range. Of course you could use values
larger than 8 bits, but that's just basically the exact same thing as having
an end-of-string bit at each node. (And given that the "character" in the
node will probably take 4 bytes for alignment purposes anyways, there's
plenty of unused bits.)
Which got me thinking: This data container uses quite a lot of space,
actually. Each node requires three pointers plus the character itself.
In a 32-bit system the pointers will be 4 bytes each, and as said, memory
alignment requires for the character to take 4 bytes as well, for a grand
total of 16 bytes per node (and this assuming you are storing each node
optimally in an array rather than allocating each node separately).
16 bytes to store one single character sounds quite a lot... :)
Of course strings with equal beginnings will share nodes, but if the
strings are very diverse, they will still end up using quite a lot of
memory.
> If you have '\0' nodes,
> you could special-case the "next character" (i.e., middle) pointer to point
> to the data you want to associate with the word.
In fact, since no character can be smaller than '\0', you could also
have additional '\0' nodes on the left branch, each one pointing to
associated data. That way you can have a multiset, ie. the same key
pointing to multiple data, rather than just one.
--
- Warp
Post a reply to this message
|
|
| |
| |
|
|
|
|
| |
| |
|
|
Warp wrote:
> Darren New <dne### [at] sanrrcom> wrote:
>> I think the way the original paper worked it is to store the '\0' as part of
>> the string in the tree. So your CAT vs CATS would be
>> -->C-->A-->T-->S-->\0
>> |
>> v
>> \0
>
> But that's not a null pointer indicating the end of the string. That's
> a regular node containing a special character which indicates the end of
> the string. Implementing my automatically-balanced tree creation idea can
> use the exact same method for marking ends of strings.
Then we're in agreement. I was using \0 to indicate ASCII NUL, not C's NULL
pointer. You can either use a node whose character value is '\0', or you
can use a separate bit, depending on your needs.
> Btw, the node containing an end-of-string value idea works only as long
> as there's a special character value which never appears in a string. It
> can't be done like that if the range of values of each character in your
> "string" can hold is the full 0-255 range. Of course you could use values
> larger than 8 bits, but that's just basically the exact same thing as having
> an end-of-string bit at each node.
Yep. Or you could have another pointer that points to the data that is
associated with that string (i.e., the "value" of the "key/value" pair), and
if that's not NULL, then a string ends at that node.
> 16 bytes to store one single character sounds quite a lot... :)
And a fourth pointer for the value associated, yes. Of course, any tree
structure is going to be close to the same size too. If you have lots and
lots of strings, and you're more worried about space than speed, a hash
table works out better, I expect.
> Of course strings with equal beginnings will share nodes, but if the
> strings are very diverse, they will still end up using quite a lot of
> memory.
If you're using up enough memory to be worried about it, chances are you
have enough strings that the beginnings will overlap a lot? I mean, unless
you're really memory-constrained.
> In fact, since no character can be smaller than '\0', you could also
> have additional '\0' nodes on the left branch, each one pointing to
> associated data. That way you can have a multiset, ie. the same key
> pointing to multiple data, rather than just one.
Hmmmmm. Well, certainly the '\0' node could point to a different structure
than the other nodes. You already have three pointers there to play with,
and you could reasonably hang an entire tree underneath if you wanted.
--
Darren New, San Diego CA, USA (PST)
Insanity is a small city on the western
border of the State of Mind.
Post a reply to this message
|
|
| |
| |
|
|
|
|
| |
| |
|
|
Warp wrote:
> Each node requires three pointers plus the character itself.
Or, with enough work, you could get away with not using full-sized pointers,
perhaps. Indexes into arrays (with said arrays split into chunks if you have
trouble finding that much contiguous address space) might work too.
Or, if it's on disk, you could have two kinds of nodes - ones that point to
other nodes nearby (or in the same block or file, say), and others that
point to different files. So the root and first 2 or 3 levels might be in
one block, with the "leaves" in other blocks or files or something. Since
you don't randomly walk around the tree when you're updating it, you don't
need to worry quite so much about caching as you might otherwise.
--
Darren New, San Diego CA, USA (PST)
Insanity is a small city on the western
border of the State of Mind.
Post a reply to this message
|
|
| |
| |
|
|
|
|
| |
| |
|
|
Darren New <dne### [at] sanrrcom> wrote:
> > 16 bytes to store one single character sounds quite a lot... :)
> And a fourth pointer for the value associated
That's not necessary if you use the '\0' node to end strings, as you
pointed out.
> Hmmmmm. Well, certainly the '\0' node could point to a different structure
> than the other nodes. You already have three pointers there to play with,
> and you could reasonably hang an entire tree underneath if you wanted.
You can't use the "right child" pointer of the '\0' node for ancillary
data because it can be part of the regular tree structure.
--
- Warp
Post a reply to this message
|
|
| |
| |
|
|
|
|
| |
|
|