>> Other amusing edge cases include "/":
>>
>> - A name is usually executable; by preceding it with "/", it becomes
>> literal.
>>
>> - The token "/" by itself (i.e., not preceding a name) is a valid
>> (executable) name.
>>
>> Trixy Hobbitses!
>
> Also fun is trying to write a correct number parser:
>
> - ".0" and "0." are both real number objects (equal to 0.0).
>
> - "." by itself is a name object.
>
> - PostScript allows both "-" and "+" as sign prefixes (which is good).
> Haskell does not, however (which is bad).
Ah, but these interact!
Anything that isn't parsable as a number is a name. Therefore,
"0." -> real
".0" -> real
"." -> name
"1.1" -> real
"1.1.1" -> name
"1e1" -> real
"1x1" -> name
"s1" -> name
"1s" -> name
Will the insanity never end?? >_<
Good luck writing a parser that can untangle all of that... :-(
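
For what it's worth, the table above can be checked with a small hand-written scanner. This is only a sketch in Python, not the Haskell parser under discussion: `classify` and `DIGITS` are made-up names, integers are lumped in with reals just as the list does, and it assumes an exponent may be written with "e" or "E" and may carry a sign.

```python
DIGITS = "0123456789"


def classify(token: str) -> str:
    """Return "real" if the whole token reads as a number, else "name"."""
    i, n = 0, len(token)

    # Optional sign in front of the mantissa.
    if i < n and token[i] in "+-":
        i += 1

    # Digits before the decimal point (possibly none).
    start = i
    while i < n and token[i] in DIGITS:
        i += 1
    int_digits = i - start

    # Optional decimal point and the digits after it (possibly none).
    frac_digits = 0
    if i < n and token[i] == ".":
        i += 1
        start = i
        while i < n and token[i] in DIGITS:
            i += 1
        frac_digits = i - start

    # "." on its own, or a bare sign, has no digits at all: that is a name.
    if int_digits + frac_digits == 0:
        return "name"

    # Optional exponent: e or E, optional sign, at least one digit.
    if i < n and token[i] in "eE":
        i += 1
        if i < n and token[i] in "+-":
            i += 1
        start = i
        while i < n and token[i] in DIGITS:
            i += 1
        if i == start:
            return "name"

    # Anything left over ("1.1.1", "1x1", "1s") makes the token a name.
    return "real" if i == n else "name"
```

The last line is what implements "anything that isn't parsable as a number is a name": the scanner never rejects input, it just falls back to "name" whenever something is left over.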
> Anything that isn't parsable as a number is a name. Therefore,
>
> "0." -> real
> ".0" -> real
> "." -> name
> "1.1" -> real
> "1.1.1" -> name
> "1e1" -> real
> "1x1" -> name
> "s1" -> name
> "1s" -> name
>
> Will the insanity never end?? >_<
>
> Good luck writing a parser that can untangle all of that... :-(
http://xkcd.com/208/
scott wrote:
> http://xkcd.com/208/
Seriously... You gotta love the way this guy manages to draw stick
figures that have no facial expressions, yet you can tell *exactly* what
emotion they're having! o_O
Also... Yes, I am very, very glad I'm not writing a parser for regular
expressions. (My God, think of the massacre...!)
>> http://xkcd.com/208/
>
> Seriously... You gotta love the way this guy manages to draw stick figures
> that have no facial expressions, yet you can tell *exactly* what emotion
> they're having! o_O
>
> Also... Yes, I am very, very glad I'm not writing a parser for regular
> expressions. (My God, think of the massacre...!)
I meant using regular expressions to help in your parser to decipher
numbers.
>> Also... Yes, I am very, very glad I'm not writing a parser for regular
>> expressions. (My God, think of the massacre...!)
>
> I meant using regular expressions to help in your parser to decipher
> numbers.
I fail to see how a pattern matching language is of help here...
(I already *have* a real parser construction toolkit. The *problem* is
that the rules I'm trying to puzzle out are quite complex - and not
fantastically well-documented.)
Still, sooner or later I'll reach this stage:
http://xkcd.com/349/
>> I meant using regular expressions to help in your parser to decipher
>> numbers.
>
> I fail to see how a pattern matching language is of help here...
Well it seemed from your example, a "number" is quite easily distinguished
from a non-number.
A number takes one of the four forms (where n is 1 or more digits):
n.n
n.
.n
n
And is optionally prefixed by a minus sign, and optionally suffixed by an
exponential term, which takes the form E or e followed by an optional minus
sign followed by one or more digits.
I would use regular expressions to decide if my string matched this form or
not, but maybe your language/library already has similar functions to do
that?
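
A literal rendering of that description as a regular expression might look like the following Python sketch. The names `SCOTT_NUMBER` and `looks_like_number` are invented for illustration, and the pattern follows the wording above exactly, so only a minus sign is accepted as a prefix.

```python
import re

# The four mantissa forms (n.n, n., .n, n), an optional leading minus,
# and an optional exponent: E or e, optional minus, one or more digits.
SCOTT_NUMBER = re.compile(
    r"-?(?:[0-9]+\.[0-9]+|[0-9]+\.|\.[0-9]+|[0-9]+)(?:[eE]-?[0-9]+)?"
)


def looks_like_number(token: str) -> bool:
    """True if the whole token matches the form described above."""
    return SCOTT_NUMBER.fullmatch(token) is not None
```

Using `fullmatch` rather than `match` matters here: "1.1.1" starts with a valid number, but the whole string has to fit the pattern for it to count.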
>> I fail to see how a pattern matching language is of help here...
>
> Well it seemed from your example, a "number" is quite easily
> distinguished from a non-number.
Yeah, maybe.
> A number takes one of the four forms (where n is 1 or more digits):
>
> n.n
> n.
> .n
> n
>
> And is optionally prefixed by a minus sign, and optionally suffixed by
> an exponential term, which takes the form E or e followed by an optional
> minus sign followed by one or more digits.
This isn't quite correct.
- The optional sign prefix can also be "+" instead of "-" (in both the
mantissa and any exponent there might be).
- Numbers may also take the form "n#n".
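
The "n#n" form is a radix number: the part before the "#" is the base, written in decimal, and the part after it is the value written in that base. A minimal Python sketch, assuming bases 2 through 36 with letters standing for the digits above 9 (`parse_radix` is a hypothetical helper, and signs are not handled):

```python
import string

_BASE_CHARS = set(string.digits)
_VALUE_CHARS = set(string.digits + string.ascii_letters)


def parse_radix(token: str):
    """Return the integer value of a "base#digits" token, or None if it isn't one."""
    base_text, sep, digits = token.partition("#")
    if not sep or not base_text or not digits:
        return None
    # Check the characters ourselves: int() alone would also accept
    # signs, underscores and surrounding whitespace.
    if not set(base_text) <= _BASE_CHARS or not set(digits) <= _VALUE_CHARS:
        return None
    base = int(base_text)
    if not 2 <= base <= 36:
        return None
    try:
        return int(digits, base)
    except ValueError:
        # A digit that is too large for the base, e.g. "2#12".
        return None
```

Returning None instead of raising fits the "if it isn't a number, it's a name" rule: a caller can simply fall through to treating the token as a name.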
> I would use regular expressions to decide if my string matched this form
> or not, but maybe your language/library already has similar functions to
> do that?
Well, given that I already need to cut the string into bits anyway so I
can modify it so the number parser will accept it, I'm not sure this
buys me anything. (Haskell's number parser doesn't like "+" as a prefix,
doesn't like ".7" or "7." as a number, and so forth.)
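
The "modify it so the number parser will accept it" step could be sketched roughly as below. This is Python rather than Haskell, purely to show the string surgery; `normalize_real` is an invented name, the target form (no leading "+", at least one digit on each side of the point) is an assumption about what a strict parser wants, and the input is assumed to have been recognised as a number already.

```python
def normalize_real(token: str) -> str:
    """Rewrite a number such as "+7." or "-.5E3" into a stricter spelling."""
    sign = ""
    if token[:1] in ("+", "-"):
        # Drop a leading "+", keep a leading "-".
        sign = "-" if token[0] == "-" else ""
        token = token[1:]

    # Split off the exponent, if any, treating "E" like "e".
    mantissa, e, exponent = token.lower().partition("e")

    # Pad a missing digit on either side of the decimal point with "0".
    if "." in mantissa:
        whole, _, frac = mantissa.partition(".")
        mantissa = (whole or "0") + "." + (frac or "0")

    return sign + mantissa + (e + exponent if e else "")
```

For example, `normalize_real("7.")` gives "7.0" and `normalize_real("-.5E3")` gives "-0.5e3".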
There is also a whole bunch of "interesting" rules about how token
parsing works. A PostScript program can take an arbitrary text string
and ask the interpreter to parse one token from it. Page 703 of the
PostScript Language Reference Manual states the following facts:
- If the token read is a name object or a number object, and it is
followed by a whitespace character, one whitespace character is consumed.
- If the token ends with a delimiter that's part of the token, that
delimiter is consumed, and no other characters after it.
- If the token is terminated by a delimiter that marks the start of the
next token, that character is not consumed.
In other words, if you have "123 456" then the space is consumed, but if
you have "<123> 456" then the space is *not* consumed. Likewise, if you
have "123/abc" then the "/" is not consumed. However, "123abc" is a
single (name object) token.
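
Those consumption rules can be sketched in a few lines of Python. This is a simplification under stated assumptions: `read_token` is a made-up name, only runs of regular characters and plain "<...>" strings are handled, and the whitespace and delimiter sets are my reading of the manual.

```python
WHITESPACE = "\x00\t\n\f\r "
DELIMITERS = "()<>[]{}/%"


def read_token(text: str):
    """Return (token, rest) for the first token in text, or (None, "") if there is none."""
    i = 0
    while i < len(text) and text[i] in WHITESPACE:
        i += 1
    if i == len(text):
        return None, ""

    # "<...>": the closing ">" is part of the token, so it is consumed
    # and nothing after it is touched.
    if text[i] == "<" and text[i + 1:i + 2] not in ("<", "~"):
        end = text.find(">", i)
        if end == -1:
            raise ValueError("unterminated string")
        return text[i:end + 1], text[end + 1:]

    # A run of regular characters: a name or a number, decided later.
    j = i
    while j < len(text) and text[j] not in WHITESPACE and text[j] not in DELIMITERS:
        j += 1
    if j == i:
        raise NotImplementedError("other delimiters are left out of this sketch")
    token = text[i:j]

    # Exactly one trailing whitespace character is consumed; a delimiter
    # that starts the next token is left alone.
    if j < len(text) and text[j] in WHITESPACE:
        j += 1
    return token, text[j:]
```

With this, `read_token("123 456")` leaves "456" as the remainder, while `read_token("<123> 456")` leaves " 456" with the space intact.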
Looking at all these facts, it appears that the interpreter actually
uses some simple rule to break the whole input stream into "tokens", and
then decides what kind of token it is separately.
I am now reimplementing my parser so that instead of trying to classify
and split the input at the same time, it splits it first, and only then
attempts to decide what it just read. I think this is probably how the
"real" PostScript interpreters work.
Invisible <voi### [at] dev null> wrote:
> Anything that isn't parsable as a number is a name. Therefore,
> "0." -> real
> ".0" -> real
> "." -> name
> "1.1" -> real
> "1.1.1" -> name
> "1e1" -> real
> "1x1" -> name
> "s1" -> name
> "1s" -> name
> Will the insanity never end?? >_<
> Good luck writing a parser that can untangle all of that... :-(
I really can't see the problem. When the input contains a sequence of
valid characters (ie. which can form a real or a name), if this sequence
has the form:
^[+-]?([0-9]+\.?|[0-9]*\.[0-9]+)(e[+-]?([0-9]+\.?|[0-9]*\.[0-9]+))?$
then it's a real, else it's a name.
If we translate that regexp to plain English, it means:
- There's nothing before this pattern (which is what the ^ at the beginning
means), and nothing after it (which is what the $ at the end means).
- The sequence optionally starts with a + or a -.
- After that two possible patterns must appear (the expression in
parentheses, where the two patterns are separated with the | symbol):
- A sequence of one or more digits ([0-9]+), optionally followed by
the dot character (a plain "." has a special meaning in regexps, so the
dot character has to be escaped, and thus written as "\.")
- A sequence of zero or more digits ([0-9]*) followed by a dot character
followed by a sequence of one or more digits.
- Optionally the character "e" can follow, and if that's the case, a real
(not containing an "e") must follow as well (the whole last part in
parentheses, with the "?" at the end to indicate optionality).
In an actual BNF-style parser the rule probably becomes simpler because
the repetition can be removed.
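
As a quick check, here is that regexp dropped unchanged into Python and run over the examples from the quoted list. The wrapper names `WARP_REAL` and `warp_classify` are just for illustration.

```python
import re

# The pattern exactly as written above.
WARP_REAL = re.compile(
    r"^[+-]?([0-9]+\.?|[0-9]*\.[0-9]+)(e[+-]?([0-9]+\.?|[0-9]*\.[0-9]+))?$"
)


def warp_classify(token: str) -> str:
    """Return "real" if the whole token matches the pattern, else "name"."""
    return "real" if WARP_REAL.fullmatch(token) else "name"


if __name__ == "__main__":
    for tok in ["0.", ".0", ".", "1.1", "1.1.1", "1e1", "1x1", "s1", "1s"]:
        print(f"{tok!r:10} -> {warp_classify(tok)}")
```

`fullmatch` is used on top of the `^` and `$` anchors because Python's `$` would otherwise also match just before a trailing newline.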
--
- Warp
Warp wrote:
>> Good luck writing a parser that can untangle all of that... :-(
>
> I really can't see the problem. When the input contains a sequence of
> valid characters (ie. which can form a real or a name), if this sequence
> has the form:
>
> ^[+-]?([0-9]+\.?|[0-9]*\.[0-9]+)(e[+-]?([0-9]+\.?|[0-9]*\.[0-9]+))?$
>
> then it's a real, else it's a name.
Yeah. As I said, I'm currently changing my design from one that attempts
to recognise and delimit numbers to one that just chops the text into
chunks, and *then* decides what kind of thing each chunk is.
> If we translate that regexp to plain English, it means:
>
> - There's nothing before this pattern (which is what the ^ at the beginning
> means), and nothing after it (which is what the $ at the end means).
> - The sequence optionally starts with a + or a -.
> - After that two possible patterns must appear (the expression in
> parentheses, where the two patterns are separated with the | symbol):
> - A sequence of one or more digits ([0-9]+), optionally followed by
> the dot character (a plain "." has a special meaning in regexps, so the
> dot character has to be escaped, and thus written as "\.")
> - A sequence of zero or more digits ([0-9]*) followed by a dot character
> followed by a sequence of one or more digits.
> - Optionally the character "e" can follow, and if that's the case, a real
> (not containing an "e") must follow as well (the whole last part in
> parentheses, with the "?" at the end to indicate optionality).
...which would be incorrect then, for at least the following reasons:
- There can be *zero* or more digits before the decimal point. (But
notice that there must be more than zero digits *in total* before
and after the decimal point. It's just that there can be zero in either
place, but not both.)
- The "e" can also be "E".
- The exponent is an integer, not a real.
See? Not as easy as it looks, is it? Gotta pay careful attention to
*exactly* what the manual says is and isn't permissible.
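
Folding those three points into a pattern gives something like the Python sketch below. `REVISED_REAL` and `is_real` are illustrative names, and this is one reading of the rules rather than anything taken from the manual; it still ignores the "n#n" form and makes no distinction between integers and reals.

```python
import re

REVISED_REAL = re.compile(
    r"[+-]?"                        # optional sign
    r"(?:[0-9]+\.?[0-9]*|\.[0-9]+)"  # digits on at least one side of the point
    r"(?:[eE][+-]?[0-9]+)?"          # exponent: e or E, optional sign, integer
)


def is_real(token: str) -> bool:
    """True if the whole token fits the revised pattern."""
    return REVISED_REAL.fullmatch(token) is not None
```

Under this pattern "1E1" is accepted and "1e1.5" is rejected, while ".5" and "5." are both accepted and "." on its own is not.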
> In an actual BNF-style parser the rule probably becomes simpler because
> the repetition can be removed.
It would be *really nice* if the reference manual included a BNF syntax
diagram... :-S
Between the rules for splitting up tokens, the tricky rules for escaping
things in strings, and characters that have multiple meanings depending
on context, it's really quite hard!
E.g., "<" is the start of a string, "<~" is the start of another string,
and "<<" is an ordinary name object. Go figure. Similarly, "[" is
classified as a "delimiter character", yet it's also the *name* of an
operator (and a name is a sequence of "regular characters" - that is,
can't contain delimiters).
It all gets confusing very fast...
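
The "<" cases at least come down to peeking one character ahead and trying the longer openers first. A minimal Python sketch, where `classify_angle` is a hypothetical helper and the labels "hex string" and "ascii85 string" are my own names for the two string kinds:

```python
def classify_angle(text: str) -> str:
    """Say what kind of token starts at the beginning of text, which must start with "<"."""
    # Two-character openers have to be tested before the bare "<".
    if text.startswith("<<"):
        return "name"
    if text.startswith("<~"):
        return "ascii85 string"
    if text.startswith("<"):
        return "hex string"
    raise ValueError("text does not start with '<'")
```

The order of the tests is the whole trick: checking for a bare "<" first would swallow the other two cases.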
Orchid XP v8 wrote:
>
> Why...why...WHY...why would they do this? o_O
>
Why not? After all, it's possible.
-Aero