POV-Ray: Newsgroups: povray.off-topic: An interesting problem

POV-Ray : Newsgroups : povray.off-topic : An interesting problem		Server Time 17 Jul 2025 05:52:21 EDT (-0400)

From: Orchid Win7 v1
Subject: An interesting problem
Date: 24 Apr 2013 03:51:34
Message: <51778f06$1@news.povray.org>

OK, so here's an interesting thought...

I've been looking at string searching algorithms, and regular 
expressions, and generally at pattern matching on data. There are lots 
of ideas about how to find a specific pattern as fast as possible. But 
what I don't see anywhere is how to look for *several* patterns and see 
which one is present. (Especially patterns occurring [or not] at a 
specific location.)

If I'm looking at some data which will contain one of 1,000 possible 
patterns, it sounds like it would be quite inefficient to run 1,000 
different checks on it. I suppose I could try building a decision tree, 
but if most of the patterns are at least half a dozen bytes long, 
there's a danger that tree could become really, really big.

Anyone have any other suggestions?

Post a reply to this message

From: Le Forgeron
Subject: Re: An interesting problem
Date: 24 Apr 2013 09:09:37
Message: <5177d991$1@news.povray.org>

Le 24/04/2013 09:51, Orchid Win7 v1 a écrit :
> OK, so here's an interesting thought...
> 
> I've been looking at string searching algorithms, and regular
> expressions, and generally at pattern matching on data. There are lots
> of ideas about how to find a specific pattern as fast as possible. But
> what I don't see anywhere is how to look for *several* patterns and see
> which one is present. (Especially patterns occurring [or not] at a
> specific location.)
> 
> If I'm looking at some data which will contain one of 1,000 possible
> patterns, it sounds like it would be quite inefficient to run 1,000
> different checks on it. I suppose I could try building a decision tree,
> but if most of the patterns are at least half a dozen bytes long,
> there's a danger that tree could become really, really big.
> 
> Anyone have any other suggestions?

That's where extended regular expression kick in.
two distinct patterns can be combined as a single one, hence the search
API only need to allow the specification of 1 pattern...

But it solves only half of your problem: it usually won't return the
branch that matches (but rather the start of the match, so after the
search, you still have to identify which sub-pattern was matched, but
that should be "easy", as you do not have to search, only compare)



-- 
Just because nobody complains does not mean all parachutes are perfect.

Post a reply to this message

From: bart
Subject: Re: An interesting problem
Date: 24 Apr 2013 09:57:49
Message: <5177e4dd@news.povray.org>

On 04/24/2013 01:51 PM, Orchid Win7 v1 wrote:
 >I've been looking at string searching algorithms, and regular
 >expressions, and generally at pattern matching on data. There are lots
 >of ideas about how to find a specific pattern as fast as possible. But
 >what I don't see anywhere is how to look for *several* patterns and see
 >which one is present. (Especially patterns occurring [or not] at a
 >specific location.)

Check out a classic book  (http://tinyurl.com/d9o565h)
"The design and analysis of computer algorithms"
Alfred V. Aho, John E. Hopcroft, Jeffrey D. Ullman
Addison-Wesley Pub. Co., 1974.

In particular, Theorem 9.9.

Post a reply to this message

From: Warp
Subject: Re: An interesting problem
Date: 24 Apr 2013 13:31:43
Message: <517816ff@news.povray.org>

Orchid Win7 v1 <voi### [at] devnull> wrote:
> OK, so here's an interesting thought...

> I've been looking at string searching algorithms, and regular 
> expressions, and generally at pattern matching on data. There are lots 
> of ideas about how to find a specific pattern as fast as possible. But 
> what I don't see anywhere is how to look for *several* patterns and see 
> which one is present. (Especially patterns occurring [or not] at a 
> specific location.)

> If I'm looking at some data which will contain one of 1,000 possible 
> patterns, it sounds like it would be quite inefficient to run 1,000 
> different checks on it. I suppose I could try building a decision tree, 
> but if most of the patterns are at least half a dozen bytes long, 
> there's a danger that tree could become really, really big.

> Anyone have any other suggestions?

IIRC a regular expression can be (and is usually) converted into a state
machine that traverses the input data linearly once. When the state machine
ends up in one of specific nodes, you know that a match was found (and the
node in question uniquely identifies which pattern was matching.)

-- 
                                                          - Warp

Post a reply to this message

From: Orchid Win7 v1
Subject: Re: An interesting problem
Date: 24 Apr 2013 16:07:59
Message: <51783b9f$1@news.povray.org>

On 24/04/2013 06:31 PM, Warp wrote:
> IIRC a regular expression can be (and is usually) converted into a state
> machine that traverses the input data linearly once.

Yes, that was my understanding. (Although I've never actually seen the 
algorithm in question, I imagine it can't be that complicated.)

> When the state machine
> ends up in one of specific nodes, you know that a match was found (and the
> node in question uniquely identifies which pattern was matching.)

I wasn't sure whether you could actually build such a state machine for 
*several* different regular expressions simultaneously. I'm not sure how 
you'd go about doing that...

Post a reply to this message

From: Warp
Subject: Re: An interesting problem
Date: 24 Apr 2013 17:23:14
Message: <51784d41@news.povray.org>

Orchid Win7 v1 <voi### [at] devnull> wrote:
> I wasn't sure whether you could actually build such a state machine for 
> *several* different regular expressions simultaneously. I'm not sure how 
> you'd go about doing that...

Any regular expressions can be joined with the | operator (after which
you get a regular expression that matches either one of those.) It's
a simple "or" operator.

I don't know how much optimization goes into building the state machine
from the RE, though.

-- 
                                                          - Warp

Post a reply to this message

From: Orchid Win7 v1
Subject: Re: An interesting problem
Date: 24 Apr 2013 18:08:53
Message: <517857f5$1@news.povray.org>

On 24/04/2013 10:23 PM, Warp wrote:
> Orchid Win7 v1<voi### [at] devnull>  wrote:
>> I wasn't sure whether you could actually build such a state machine for
>> *several* different regular expressions simultaneously. I'm not sure how
>> you'd go about doing that...
>
> Any regular expressions can be joined with the | operator (after which
> you get a regular expression that matches either one of those.) It's
> a simple "or" operator.

...You're right. I hadn't thought of that. (Mainly because I've been 
looking mostly at a subset of the full power of regex.)

Post a reply to this message

From: Orchid Win7 v1
Subject: Re: An interesting problem
Date: 8 May 2013 16:26:01
Message: <518ab4d9@news.povray.org>

On 24/04/2013 06:31 PM, Warp wrote:
> IIRC a regular expression can be (and is usually) converted into a state
> machine that traverses the input data linearly once. When the state machine
> ends up in one of specific nodes, you know that a match was found (and the
> node in question uniquely identifies which pattern was matching.)

I've spent the last few days playing with this. I managed to convert my 
regular expression list into a non-deterministic finite state automaton. 
This was a rather interesting process.

Initially I tried to build a simple decision tree, where every node has 
a dictionary for every possible byte stating what to do if that byte is 
read next. Trouble is, what this is essentially doing is constructing 
every possible matching byte sequence. For expressions with lots of 
alternatives, you get a combinatorial explosion of matching byte sequences.

This became especially evident when I tried to generate code for the 
decision tree. I ended up producing an 8MB Haskell source file, which 
the Haskell compiler died trying to compile. Clearly this approach is a 
dead end.

I then did some reading about how the professionals do this stuff. It 
seems the key is that where a subexpression has alternatives, you 
generate a series of states, but those diverging branches *merge* again 
at the end of the subexpression. What I was doing was effectively 
duplicating the rest of the expression - which is obviously more 
space-hungry.

The slightly tricky thing is that some transitions happen only for one 
specific byte, and others happen for *any* byte. And yet other 
transitions happen without consuming any input at all. This is how we 
end up with a *non*-deterministic state automaton, after all.

Anyway, having implemented all this, the generated Haskell file is now a 
mere 600KB. It still takes a while to compile it, but nowhere near as 
long as before.

I then began reading about how to transform a non-deterministic 
automaton into a deterministic one. It appears you can get rid of the 
"zero byte" moves by generating a new state for every collection of 
states reachable by zero-byte moves. I have yet to implement this yet, 
but it looks like it should significantly reduce the size of the 
generated code.

I didn't actually get as far as the part where you remove the 
non-determinism completely...

Post a reply to this message

From: Orchid Win7 v1
Subject: Re: An interesting problem
Date: 10 May 2013 17:11:14
Message: <518d6272$1@news.povray.org>

On 08/05/2013 09:26 PM, Orchid Win7 v1 wrote:
> I then began reading about how to transform a non-deterministic
> automaton into a deterministic one. It appears you can get rid of the
> "zero byte" moves by generating a new state for every collection of
> states reachable by zero-byte moves. I have yet to implement this yet,
> but it looks like it should significantly reduce the size of the
> generated code.
>
> I didn't actually get as far as the part where you remove the
> non-determinism completely...

It seems with a bit of clever coding, you can avoid generating null 
moves at all in the first place. However, I can't see a way to easily 
remove the non-determinism without a combinatorial explosion of states 
(i.e., producing a full decision tree like before).

Perhaps slightly disappointingly, I can't find any measurable 
performance difference between the laughably naive version, and the 
"optimised" NFA version... It seems the I/O takes so long that any 
changes to the actual data processing are of no significance.

Post a reply to this message

From: Orchid Win7 v1
Subject: Re: An interesting problem
Date: 24 May 2013 14:14:48
Message: <519fae18$1@news.povray.org>

On 08/05/2013 09:26 PM, Orchid Win7 v1 wrote:
> I then did some reading about how the professionals do this stuff. It
> seems the key is that where a subexpression has alternatives, you
> generate a series of states, but those diverging branches *merge* again
> at the end of the subexpression. What I was doing was effectively
> duplicating the rest of the expression - which is obviously more
> space-hungry.
>
> Anyway, having implemented all this, the generated Haskell file is now a
> mere 600KB. It still takes a while to compile it, but nowhere near as
> long as before.

So... I've written a program that reads a text file, and [eventually] 
generates executable code which implements the instructions in that file?

Holy crap, I've implemented a compiler! :-D

Post a reply to this message