|
|
On 24/04/2013 06:31 PM, Warp wrote:
> IIRC a regular expression can be (and is usually) converted into a state
> machine that traverses the input data linearly once. When the state machine
> ends up in one of specific nodes, you know that a match was found (and the
> node in question uniquely identifies which pattern was matching.)
I've spent the last few days playing with this. I managed to convert my
regular expression list into a non-deterministic finite state automaton.
This was a rather interesting process.
Initially I tried to build a simple decision tree, where every node has
a dictionary for every possible byte stating what to do if that byte is
read next. Trouble is, what this is essentially doing is constructing
every possible matching byte sequence. For expressions with lots of
alternatives, you get a combinatorial explosion of matching byte sequences.
This became especially evident when I tried to generate code for the
decision tree. I ended up producing an 8MB Haskell source file, which
the Haskell compiler died trying to compile. Clearly this approach is a
dead end.
I then did some reading about how the professionals do this stuff. It
seems the key is that where a subexpression has alternatives, you
generate a series of states, but those diverging branches *merge* again
at the end of the subexpression. What I was doing was effectively
duplicating the rest of the expression - which is obviously more
space-hungry.
The slightly tricky thing is that some transitions happen only for one
specific byte, and others happen for *any* byte. And yet other
transitions happen without consuming any input at all. This is how we
end up with a *non*-deterministic state automaton, after all.
Anyway, having implemented all this, the generated Haskell file is now a
mere 600KB. It still takes a while to compile it, but nowhere near as
long as before.
I then began reading about how to transform a non-deterministic
automaton into a deterministic one. It appears you can get rid of the
"zero byte" moves by generating a new state for every collection of
states reachable by zero-byte moves. I have yet to implement this yet,
but it looks like it should significantly reduce the size of the
generated code.
I didn't actually get as far as the part where you remove the
non-determinism completely...
Post a reply to this message
|
|