|
|
|
|
|
|
| |
| |
|
|
|
|
| |
| |
|
|
OK, so here's an interesting thought...
I've been looking at string searching algorithms, and regular
expressions, and generally at pattern matching on data. There are lots
of ideas about how to find a specific pattern as fast as possible. But
what I don't see anywhere is how to look for *several* patterns and see
which one is present. (Especially patterns occurring [or not] at a
specific location.)
If I'm looking at some data which will contain one of 1,000 possible
patterns, it sounds like it would be quite inefficient to run 1,000
different checks on it. I suppose I could try building a decision tree,
but if most of the patterns are at least half a dozen bytes long,
there's a danger that tree could become really, really big.
Anyone have any other suggestions?
Post a reply to this message
|
|
| |
| |
|
|
|
|
| |
| |
|
|
Le 24/04/2013 09:51, Orchid Win7 v1 a écrit :
> OK, so here's an interesting thought...
>
> I've been looking at string searching algorithms, and regular
> expressions, and generally at pattern matching on data. There are lots
> of ideas about how to find a specific pattern as fast as possible. But
> what I don't see anywhere is how to look for *several* patterns and see
> which one is present. (Especially patterns occurring [or not] at a
> specific location.)
>
> If I'm looking at some data which will contain one of 1,000 possible
> patterns, it sounds like it would be quite inefficient to run 1,000
> different checks on it. I suppose I could try building a decision tree,
> but if most of the patterns are at least half a dozen bytes long,
> there's a danger that tree could become really, really big.
>
> Anyone have any other suggestions?
That's where extended regular expression kick in.
two distinct patterns can be combined as a single one, hence the search
API only need to allow the specification of 1 pattern...
But it solves only half of your problem: it usually won't return the
branch that matches (but rather the start of the match, so after the
search, you still have to identify which sub-pattern was matched, but
that should be "easy", as you do not have to search, only compare)
--
Just because nobody complains does not mean all parachutes are perfect.
Post a reply to this message
|
|
| |
| |
|
|
|
|
| |
| |
|
|
On 04/24/2013 01:51 PM, Orchid Win7 v1 wrote:
>I've been looking at string searching algorithms, and regular
>expressions, and generally at pattern matching on data. There are lots
>of ideas about how to find a specific pattern as fast as possible. But
>what I don't see anywhere is how to look for *several* patterns and see
>which one is present. (Especially patterns occurring [or not] at a
>specific location.)
Check out a classic book (http://tinyurl.com/d9o565h)
"The design and analysis of computer algorithms"
Alfred V. Aho, John E. Hopcroft, Jeffrey D. Ullman
Addison-Wesley Pub. Co., 1974.
In particular, Theorem 9.9.
Post a reply to this message
|
|
| |
| |
|
|
|
|
| |
| |
|
|
Orchid Win7 v1 <voi### [at] devnull> wrote:
> OK, so here's an interesting thought...
> I've been looking at string searching algorithms, and regular
> expressions, and generally at pattern matching on data. There are lots
> of ideas about how to find a specific pattern as fast as possible. But
> what I don't see anywhere is how to look for *several* patterns and see
> which one is present. (Especially patterns occurring [or not] at a
> specific location.)
> If I'm looking at some data which will contain one of 1,000 possible
> patterns, it sounds like it would be quite inefficient to run 1,000
> different checks on it. I suppose I could try building a decision tree,
> but if most of the patterns are at least half a dozen bytes long,
> there's a danger that tree could become really, really big.
> Anyone have any other suggestions?
IIRC a regular expression can be (and is usually) converted into a state
machine that traverses the input data linearly once. When the state machine
ends up in one of specific nodes, you know that a match was found (and the
node in question uniquely identifies which pattern was matching.)
--
- Warp
Post a reply to this message
|
|
| |
| |
|
|
|
|
| |
| |
|
|
On 24/04/2013 06:31 PM, Warp wrote:
> IIRC a regular expression can be (and is usually) converted into a state
> machine that traverses the input data linearly once.
Yes, that was my understanding. (Although I've never actually seen the
algorithm in question, I imagine it can't be that complicated.)
> When the state machine
> ends up in one of specific nodes, you know that a match was found (and the
> node in question uniquely identifies which pattern was matching.)
I wasn't sure whether you could actually build such a state machine for
*several* different regular expressions simultaneously. I'm not sure how
you'd go about doing that...
Post a reply to this message
|
|
| |
| |
|
|
|
|
| |
| |
|
|
Orchid Win7 v1 <voi### [at] devnull> wrote:
> I wasn't sure whether you could actually build such a state machine for
> *several* different regular expressions simultaneously. I'm not sure how
> you'd go about doing that...
Any regular expressions can be joined with the | operator (after which
you get a regular expression that matches either one of those.) It's
a simple "or" operator.
I don't know how much optimization goes into building the state machine
from the RE, though.
--
- Warp
Post a reply to this message
|
|
| |
| |
|
|
|
|
| |
| |
|
|
On 24/04/2013 10:23 PM, Warp wrote:
> Orchid Win7 v1<voi### [at] devnull> wrote:
>> I wasn't sure whether you could actually build such a state machine for
>> *several* different regular expressions simultaneously. I'm not sure how
>> you'd go about doing that...
>
> Any regular expressions can be joined with the | operator (after which
> you get a regular expression that matches either one of those.) It's
> a simple "or" operator.
...You're right. I hadn't thought of that. (Mainly because I've been
looking mostly at a subset of the full power of regex.)
Post a reply to this message
|
|
| |
| |
|
|
|
|
| |
| |
|
|
On 24/04/2013 06:31 PM, Warp wrote:
> IIRC a regular expression can be (and is usually) converted into a state
> machine that traverses the input data linearly once. When the state machine
> ends up in one of specific nodes, you know that a match was found (and the
> node in question uniquely identifies which pattern was matching.)
I've spent the last few days playing with this. I managed to convert my
regular expression list into a non-deterministic finite state automaton.
This was a rather interesting process.
Initially I tried to build a simple decision tree, where every node has
a dictionary for every possible byte stating what to do if that byte is
read next. Trouble is, what this is essentially doing is constructing
every possible matching byte sequence. For expressions with lots of
alternatives, you get a combinatorial explosion of matching byte sequences.
This became especially evident when I tried to generate code for the
decision tree. I ended up producing an 8MB Haskell source file, which
the Haskell compiler died trying to compile. Clearly this approach is a
dead end.
I then did some reading about how the professionals do this stuff. It
seems the key is that where a subexpression has alternatives, you
generate a series of states, but those diverging branches *merge* again
at the end of the subexpression. What I was doing was effectively
duplicating the rest of the expression - which is obviously more
space-hungry.
The slightly tricky thing is that some transitions happen only for one
specific byte, and others happen for *any* byte. And yet other
transitions happen without consuming any input at all. This is how we
end up with a *non*-deterministic state automaton, after all.
Anyway, having implemented all this, the generated Haskell file is now a
mere 600KB. It still takes a while to compile it, but nowhere near as
long as before.
I then began reading about how to transform a non-deterministic
automaton into a deterministic one. It appears you can get rid of the
"zero byte" moves by generating a new state for every collection of
states reachable by zero-byte moves. I have yet to implement this yet,
but it looks like it should significantly reduce the size of the
generated code.
I didn't actually get as far as the part where you remove the
non-determinism completely...
Post a reply to this message
|
|
| |
| |
|
|
|
|
| |
| |
|
|
On 08/05/2013 09:26 PM, Orchid Win7 v1 wrote:
> I then began reading about how to transform a non-deterministic
> automaton into a deterministic one. It appears you can get rid of the
> "zero byte" moves by generating a new state for every collection of
> states reachable by zero-byte moves. I have yet to implement this yet,
> but it looks like it should significantly reduce the size of the
> generated code.
>
> I didn't actually get as far as the part where you remove the
> non-determinism completely...
It seems with a bit of clever coding, you can avoid generating null
moves at all in the first place. However, I can't see a way to easily
remove the non-determinism without a combinatorial explosion of states
(i.e., producing a full decision tree like before).
Perhaps slightly disappointingly, I can't find any measurable
performance difference between the laughably naive version, and the
"optimised" NFA version... It seems the I/O takes so long that any
changes to the actual data processing are of no significance.
Post a reply to this message
|
|
| |
| |
|
|
|
|
| |
| |
|
|
On 08/05/2013 09:26 PM, Orchid Win7 v1 wrote:
> I then did some reading about how the professionals do this stuff. It
> seems the key is that where a subexpression has alternatives, you
> generate a series of states, but those diverging branches *merge* again
> at the end of the subexpression. What I was doing was effectively
> duplicating the rest of the expression - which is obviously more
> space-hungry.
>
> Anyway, having implemented all this, the generated Haskell file is now a
> mere 600KB. It still takes a while to compile it, but nowhere near as
> long as before.
So... I've written a program that reads a text file, and [eventually]
generates executable code which implements the instructions in that file?
Holy crap, I've implemented a compiler! :-D
Post a reply to this message
|
|
| |
| |
|
|
|
|
| |
|
|