>> In short, people tend to use regexs for quick and dirty hacks that kinda
>> work, rather than doing the job properly with a full parser. And I'm
>> really not fond of hacks.
>
> Writing a parser, even with the aid of a specialized description language
> such as BNF, is very laborious. Simple string matching can often be
> expressed with very short regular expressions which you can write in a few
> seconds. Writing a parser would be complete overkill.
I think he meant using a parser library, not writing one from scratch.
Indeed, writing a parser (or a regex library) would be total overkill for
pretty much any project.
scott <sco### [at] scott com> wrote:
> >> In short, people tend to use regexs for quick and dirty hacks that kinda
> >> work, rather than doing the job properly with a full parser. And I'm
> >> really not fond of hacks.
> >
> > Writing a parser, even with the aid of a specialized description language
> > such as BNF, is very laborious. Simple string matching can often be
> > expressed with very short regular expressions which you can write in a few
> > seconds. Writing a parser would be complete overkill.
> I think he meant using a parser library, not writing one from scratch.
I was talking about parser libraries (you know, those which eat BNF or
other such syntax definition languages).
If you had to code a parser from scratch, it would be a hundred times
more work still.
> Indeed writing a parser (or a regex library) would be a total overkill for
> pretty much any project.
*Using* a parser library for something which can be easily expressed with
a regexp string would be complete overkill.
--
- Warp
On 09/11/2010 10:11 PM, Warp wrote:
> *Using* a parser library for something which can be easily expressed with
> a regexp string would be complete overkill.
Yeah, that's true. Writing
string "foo"
many char
string "bar"
is drastically harder than just saying "foo*bar".
Oh, wait...
On 08/11/2010 05:31 PM, Darren New wrote:
> Invisible wrote:
>> I'd much prefer to see a much bigger separation between what's a
>> literal character and what's a command.
>
> Technically, they're all commands. The letter "s" means "match against
> the letter s." :-)
Well, if you wanted to split hairs, it's the implicit command "match
against a specific character" plus the argument "s". :-P
>> you could probably use it to match against streams of other data, not
>> just characters.
>
> You can.
Not in any regex product I've ever seen. (Although I won't claim to be
an authority on the subject.) Most people seem to equate "regex" with
"shorthand for writing text parsers".
>> A quick inspection of Wikipedia suggests that POSIX ERE involves at
>> least .[]^$()\*{}?+|:, which is 16, not 10. (Still, it's not the
>> thousands it seemed like last time I tried to learn this stuff.)
>
> : and {} and ^ $ aren't original regular expression characters.
> Technically not + either, so I think that's where the 10 come from. The
> rest are short-cuts for what you can already otherwise specify (: + {
> }), or are useful for programming but outside the theory (^$).
So much for theory. Most "regular expressions" out there aren't even
regular. I don't know much about the theory; what I know about is the
actual regex tools that you can actually use.
>> I recall reading somewhere that Perl's "regular expressions" aren't
>> actually regular, and so require exponential time for matching. Truly
>> regular expressions apparently require only linear time.
>
> Correct. And not only exponential time, but memory as well. A regular
> expression is regular because it requires a fixed amount of memory to
> match or reject.
Is that the definition of "regular" then?
>> The other thing I dislike is that people seem to have a tendency to
>> use regexs where they should be using a real parser.
>
> Yes, well, that's because people are stupid, not regexps.
> Only stupid people. Learn regexps, and learn the theory behind them, so
> when the boss asks you to write a parser, you know which one to use.
My boss will never, ever ask me to write a parser. (Mostly because he
doesn't know what one is, or that such a process is actually necessary.)
Regardless, if you're trying to do complicated parsing, you should use
real parser tools, not a regex.
Now if you just want to do a quick wildcard search, then why not? It can
be useful to be able to say, for example, DELETE *.PNG or something. But
by the time you get to the point where your search term is practically a
predicate calculus, you really shouldn't be trying to encode the entire
thing as a flat character string. You should use a real language instead.
Invisible <voi### [at] dev null> wrote:
> On 09/11/2010 10:11 PM, Warp wrote:
> > *Using* a parser library for something which can be easily expressed with
> > a regexp string would be complete overkill.
> Yeah, that's true. Writing
> string "foo"
> many char
> string "bar"
> is drastically harder than just saying "foo*bar".
> Oh, wait...
Nice straw man. (And "foo*bar" still doesn't mean what you think it means
as a regex.)
--
- Warp
> Nice straw man.
Wasn't his problem that he didn't have a brain?
> (And "foo*bar" still doesn't mean what you think it means
> as a regex.)
So, what, it means
string "fo"
many (char 'o')
string "bar"
Or am I reading this wrong?
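A quick way to check that reading, sketched in Python with the standard `re` module (the test strings and the `full` helper are made up for illustration). As a regex, `foo*bar` is "fo", then zero or more "o", then "bar"; the shell-wildcard meaning would be spelled `foo.*bar`.

```python
import re

# As a regex, the * applies only to the preceding "o".
as_regex = re.compile(r"foo*bar")
# The shell-wildcard reading ("foo", anything, "bar") is spelled with ".*".
as_wildcard = re.compile(r"foo.*bar")

def full(pattern, text):
    """Return True if the whole of text matches pattern."""
    return pattern.fullmatch(text) is not None

print(full(as_regex, "fobar"))         # "fo" + zero "o" + "bar"
print(full(as_regex, "foooobar"))      # "fo" + three "o" + "bar"
print(full(as_regex, "foo123bar"))     # digits are not "o"
print(full(as_wildcard, "foo123bar"))  # ".*" accepts any characters
```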
Invisible wrote:
> Not in any regex product I've ever seen.
You don't even use text regular expressions, so that's not very
authoritative. ;-)
Seriously, any DFA is the equivalent of a regex.
> Most people seem to equate "regex" with "shorthand for writing text parsers".
Sure. But exactly the same theory works for any stream of tokens. You can
use most modern libraries to match, for example, Unicode, which includes
Chinese, so right there you're outside the "text" area, let alone if you
write your own parser.
> So much for theory. Most "regular expressions" out there aren't even
> regular.
Sure they are. Unless you use a backreference (i.e., substitute in something
that you earlier matched) then it's all regular. Stuff like {} and + are
just trivial macros to reduce typing. I think I use a backmatch maybe once
every two or three years interactively, and I don't think I've ever used one
programmatically.
>>> I recall reading somewhere that Perl's "regular expressions" aren't
>>> actually regular, and so require exponential time for matching. Truly
>>> regular expressions apparently require only linear time.
>>
>> Correct. And not only exponential time, but memory as well. A regular
>> expression is regular because it requires a fixed amount of memory to
>> match or reject.
>
> Is that the definition of "regular" then?
Wikipedia is your friend. But yes, that's part of the definition. A language
is regular if it can be matched by a DFA.
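To make the DFA remark concrete, here is a minimal sketch in Python of a table-driven DFA run over a stream of arbitrary tokens rather than characters. The state names, the token names and the `run_dfa` helper are all invented for illustration; the automaton accepts the token-level pattern OPEN ITEM* CLOSE. The only thing it remembers while matching is the current state, however long the input is.

```python
# Transition table for the token-level pattern: OPEN ITEM* CLOSE.
# A missing entry means "reject".
TRANSITIONS = {
    ("start", "OPEN"): "inside",
    ("inside", "ITEM"): "inside",
    ("inside", "CLOSE"): "done",
}
ACCEPTING = {"done"}

def run_dfa(tokens):
    """Return True if the token stream is accepted by the DFA."""
    state = "start"  # the only memory used while matching
    for token in tokens:
        state = TRANSITIONS.get((state, token))
        if state is None:
            return False
    return state in ACCEPTING

print(run_dfa(["OPEN", "ITEM", "ITEM", "CLOSE"]))
print(run_dfa(["OPEN", "CLOSE", "ITEM"]))
```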
> Regardless, if you're trying to do complicated parsing, you should use
> real parser tools, not a regex.
Again, it depends what you're trying to parse. Are you trying to parse a
file full of lines like
structure_size = 37
structure_drift = 92.7E13
etc? A regexp will do just fine.
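As a rough sketch of what that could look like in Python's `re` module: the `LINE` pattern and the `parse_settings` helper are assumptions made for this example, and the number syntax is only guessed from the two sample lines.

```python
import re

# One line: identifier, "=", then an integer or a float with optional exponent.
LINE = re.compile(
    r"\s*([A-Za-z_]\w*)\s*=\s*([+-]?\d+(?:\.\d+)?(?:E[+-]?\d+)?)\s*"
)

def parse_settings(text):
    """Parse 'name = number' lines into a dict; raise ValueError on a bad line."""
    settings = {}
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        m = LINE.fullmatch(line)
        if m is None:
            raise ValueError(f"cannot parse line: {line!r}")
        settings[m.group(1)] = float(m.group(2))
    return settings

print(parse_settings("structure_size = 37\nstructure_drift = 92.7E13\n"))
```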
--
Darren New, San Diego CA, USA (PST)
Serving Suggestion:
"Don't serve this any more. It's awful."
Invisible wrote:
> On 09/11/2010 10:11 PM, Warp wrote:
>
>> *Using* a parser library for something which can be easily expressed
>> with a regexp string would be complete overkill.
>
> Yeah, that's true. Writing
>
> string "foo"
> many char
> string "bar"
>
> is drastically harder than just saying "foo*bar".
>
> Oh, wait...
OK, so how do you do
(\+|-)?[0-9]+(\.[0-9]+)?(E(\+|-)?[0-9]{1,3})?
in your parser language?
(That is, optional sign, one or more digits, optional decimal point followed
by one or more digits, optional E followed by optional sign followed by one
to three digits.)
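For anyone who wants to poke at it, here is a small check in Python's `re` module. The pattern follows the prose description, so the leading sign is optional; the sample strings and the `is_number` helper are made up for illustration.

```python
import re

# Optional sign, digits, optional fraction, optional exponent of 1 to 3 digits.
NUMBER = re.compile(r"(\+|-)?[0-9]+(\.[0-9]+)?(E(\+|-)?[0-9]{1,3})?")

def is_number(text):
    """Return True if the whole string matches the number pattern."""
    return NUMBER.fullmatch(text) is not None

for sample in ["37", "-92.7E13", "+1.5E-300", "1.", "1E1234", "E5"]:
    print(sample, is_number(sample))
```

The first three samples are accepted. "1." is rejected because the fraction needs at least one digit after the point, "1E1234" because the exponent has four digits, and "E5" because there are no leading digits.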
--
Darren New, San Diego CA, USA (PST)
Serving Suggestion:
"Don't serve this any more. It's awful."
Darren New <dne### [at] san rr com> wrote:
> (\+|-)?[0-9]+(\.[0-9]+)?(E(\+|-)?[0-9]{1,3})?
Btw, it's a surprisingly little-known fact that at least in C and C++
things like 10. and 10.e5 are valid floating point literals (besides the
more usual .1 form).
If you wanted to take that into account in a regexp like the one above,
it actually becomes a bit verbose (so that a lone . wouldn't be considered
a valid floating point literal). Basically you need to write the above twice
(with slight differences, and the two parts separated with a |.)
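A sketch of that two-branch form, again in Python's `re` module. This is one possible way to write it, not a full C literal grammar: it keeps the earlier pattern's simplifications (a leading sign, plain integers accepted, exponent limited to three digits) and ignores suffixes and hex floats. `C_STYLE` and `is_c_float` are names invented for the example.

```python
import re

C_STYLE = re.compile(
    r"[+-]?"
    r"(?:[0-9]+(?:\.[0-9]*)?"     # 10, 10., 10.5: digits, then optional ".digits*"
    r"|\.[0-9]+)"                 # .1: a leading point needs at least one digit
    r"(?:[eE][+-]?[0-9]{1,3})?"   # optional exponent, 1 to 3 digits
)

def is_c_float(text):
    """Return True if the whole string matches the two-branch pattern."""
    return C_STYLE.fullmatch(text) is not None

for sample in ["10.", "10.e5", ".1", "10.5", ".", ".e5"]:
    print(sample, is_c_float(sample))
```

Because each branch demands at least one digit on one side of the point, a lone "." matches neither.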
--
- Warp
> OK, so how do you do
>
> (\+|-)?[0-9]+(\.[0-9]+)?(E(\+|-)?[0-9]{1,3})?
>
> in your parser language?
OK, that's one big ol' complex regex, right there.
> (That is, optional sign, one or more digits, optional decimal point
> followed by one or more digits, optional E followed by optional sign
> followed by one to three digits.)
If I've understood the spec correctly, it's
  do
    optional (char '+' <|> char '-')
    many1 digit
    -- fractional part: '.' and its digits are present together or not at all
    optional (do char '.'; many1 digit)
    optional (do char 'E'; optional (char '+' <|> char '-'); many1 digit)
Enforcing that the exponent is less than or equal to 3 digits would be
slightly more wordy. The obvious way is
  xs <- many1 digit
  if length xs > 3 then fail "exponent too long" else return ()
Notice that since this is written in a /real/ programming language and
not a text string, we can do
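  -- assuming Parsec's Parser type (Text.Parsec.String)
  sign :: Parser Char
  sign = char '+' <|> char '-'

  number :: Parser ()
  number = do
    optional sign
    many1 digit
    optional (do char '.'; many1 digit)
    optional (do char 'E'; optional sign; many1 digit)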
and save a little typing. You can also factor the task into smaller pieces:
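  import Text.Parsec
  import Text.Parsec.String (Parser)

  sign :: Parser Char
  sign = char '+' <|> char '-'

  -- named exponentPart because Prelude already exports exponent
  exponentPart :: Parser ()
  exponentPart = do
    char 'E'
    optional sign
    xs <- many1 digit
    if length xs > 3 then fail "exponent too long" else return ()

  number :: Parser ()
  number = do
    optional sign
    many1 digit
    optional (do char '.'; many1 digit)
    optional exponentPart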
You can also do things like write a function that builds a special kind
of parser given a simpler spec.
With a regex, on the other hand, you cannot even statically guarantee
that a given string is even a syntactically valid regex. And that's
before you try to programmatically construct new ones. :-P
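To illustrate the validity point in Python terms: a pattern is just a string until it is compiled, so a malformed one is only rejected at run time, when `re.compile` raises `re.error`. The `is_valid_regex` helper is a made-up name for this sketch; `re.escape` is the standard way to splice literal text into a pattern that is built programmatically.

```python
import re

def is_valid_regex(pattern):
    """Return True if pattern compiles; validity is only known at run time."""
    try:
        re.compile(pattern)
    except re.error:
        return False
    return True

print(is_valid_regex(r"(\+|-)?[0-9]+"))  # well-formed
print(is_valid_regex(r"(\+|-?[0-9]+"))   # unbalanced parenthesis

# Building a pattern from pieces: escape literal text so "." stays literal.
literal = "1.5"
built = re.compile(re.escape(literal) + r"E[0-9]+")
print(built.fullmatch("1.5E3") is not None)
print(built.fullmatch("1x5E3") is not None)
```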