POV-Ray: Newsgroups: povray.general: Parser Scanner Test Case Challenge

From: clipka
Subject: Parser Scanner Test Case Challenge
Date: 14 May 2018 11:58:16
Message: <5af9b218$1@news.povray.org>

Hi folks,

I've started work on (trying to) re-write portions of the parser, in
the hope of improving performance in scenes with loops or
frequently-invoked macros (though only time will tell whether I'll just
wreck performance for linear scenes), and with a bit of luck also
improving stability.

As a first step, I'm writing a brand new _scanner_, i.e. a module that
will take a scene file and, in simplified terms, split it up into
individual "words" -- such words being (A) potential identifiers and
keywords, (B) numeric literals, (C) string literals, and (D) single- or
two-character operators and punctuation -- while at the same time
skipping over whitespace, line breaks and comments. The scanner will
also be responsible for earmarking each "word" with the line and column
where it was found.
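
As a rough illustration of the job described above, here is a minimal
Python sketch. It is not POV-Ray's actual scanner (which is C++); the
names `Token` and `scan`, the four category labels, and the choice of
`<=`, `>=` and `!=` as the two-character operators are assumptions made
for the example. It does not nest block comments and does not report
malformed input.

```python
import re
from typing import Iterator, NamedTuple

class Token(NamedTuple):
    kind: str    # "word", "number", "string" or "punct" (labels are illustrative)
    text: str
    line: int    # 1-based
    column: int  # 1-based

# Alternatives are tried left to right, so skippable text comes first and
# the catch-all single character comes last.
_PATTERN = re.compile(r"""
    (?P<skip>    \s+ | //[^\n]* | /\*.*?\*/ )          # whitespace, comments
  | (?P<word>    [A-Za-z_][A-Za-z0-9_]* )              # identifier or keyword
  | (?P<number>  (?:\d+\.?\d*|\.\d+)(?:[eE][+-]?\d+)? )
  | (?P<string>  "(?:\\.|[^"\\\n])*" )
  | (?P<punct>   [<>!]=|. )                            # two-char operator or any char
""", re.VERBOSE | re.DOTALL)

def scan(source: str) -> Iterator[Token]:
    """Yield the "words" of `source`, each tagged with line and column."""
    line, line_start = 1, 0
    for m in _PATTERN.finditer(source):
        kind, text = m.lastgroup, m.group()
        if kind != "skip":
            yield Token(kind, text, line, m.start() - line_start + 1)
        # Keep the position bookkeeping up to date, including for skipped text.
        newlines = text.count("\n")
        if newlines:
            line += newlines
            line_start = m.start() + text.rfind("\n") + 1
```

Feeding it `"sphere{\n  x<=1.5 // note\n}"` yields the words `sphere`,
`{`, `x`, `<=`, `1.5` and `}`, with `<=` reported at line 2, column 4.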

Since I intend to subject the module to a lot of automated tests, but am
a bit short on time and imagination to conjure up those tests all by
myself, and since some bugs might slip in simply because I'm oblivious
to a certain problematic scenario, I hereby present to you the
following

------------------------------------------------------------------------
CHALLENGE: CONCOCT A TEST CASE THAT WILL REVEAL AN ISSUE IN THE SCANNER.
------------------------------------------------------------------------

Notes:

- Single-line test cases are preferred. In general, the shorter a test
case is, the better.

- Since only the scanner will be tested, the test cases do not have to
make any sense syntactically. This is only about whether the scanner
correctly splits the text up into "words" as explained above.

- Test cases do not even have to be well-formed. Part of the scanner's
job is to detect and report malformed input, such as a bad escape
sequence in a string or an unclosed comment.

- Besides the suggested input, test cases should also outline the
expected output.

- Speaking of comments and expected output, if your test case involves
block comments, please consider both (a) the expected result if nested
block comments are allowed (the current parser's behaviour), and (b) the
expected result if block comments do not nest (the current syntax
highlighting behaviour).

- An "issue" may be pretty much anything on the spectrum from blatant
bugs to just surprising but perfectly good behaviour.

- An "issue" may also be a case where the new scanner's behaviour is
perfectly in line with the documentation, but the existing parser's
behaviour turns out to differ. In such a case, please consider both (a)
the expected behaviour as per the documentation, as well as (b) the
observed behaviour in POV-Ray v3.7.

- By accepting this challenge, you agree that your submission /may/ end
up in POV-Ray's official repository, as part of a set of unit tests.
(For this to happen, your test case does not have to actually trigger a
bug in the current version of the scanner; after all, future changes
might accidentally break things again.)
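
To illustrate the notes on block comments and malformed input, here is
a minimal Python sketch of how the two expected results can diverge.
The function name `skip_block_comment` and its `nested` switch are made
up for this example, and the behaviour shown is that of the sketch, not
a statement about what POV-Ray does.

```python
def skip_block_comment(source: str, start: int, nested: bool) -> int:
    """Return the index just past the block comment opening at `start`.

    Raises ValueError if `start` is not at "/*" or the comment never closes.
    """
    if not source.startswith("/*", start):
        raise ValueError("no block comment at this offset")
    depth, i = 1, start + 2
    while i < len(source):
        if source.startswith("*/", i):
            depth -= 1
            i += 2
            if depth == 0:
                return i
        elif nested and source.startswith("/*", i):
            depth += 1          # only counted when nesting is allowed
            i += 2
        else:
            i += 1
    raise ValueError(f"unclosed block comment starting at offset {start}")
```

For the input `/* a /* b */ c */ d`, the sketch leaves ` d` when nesting
is allowed and ` c */ d` when it is not. The three-character input `/*/`
raises the "unclosed" error in both modes, because the `*` has already
been consumed as part of the opening delimiter.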


From: Bald Eagle
Subject: Re: Parser Scanner Test Case Challenge
Date: 14 May 2018 12:55:01
Message: <web.5af9be843f003c3bc437ac910@news.povray.org>

clipka <ano### [at] anonymousorg> wrote:
> The scanner will
> also be responsible for earmarking each "word" with the line and column
> where it was found.

Well, this all sounds like a herculean task.

Perhaps you could kick-start this by posting an illustrative example,
say a past known issue that's since been resolved.

A related aside:
----------------------------------------------------

Since writing all of that benefits from incorporating certain features early on,
I'm wondering if this would be the point to think about implementing a
nested-level-counter.

Quite often with loops or include files, a closing bracket or something gets
left out, and it's a huge nightmare to backtrack through it all.
Perhaps if the level of the instruction were returned along with the line number
and column, it would be easier to see "where" in the code things went wrong.

// Level 0
#for (X, 0 10)
     // Level 1
          #if (Something = true)
               // Level 2
               #debug"True!"
          #end
#end


From: clipka
Subject: Re: Parser Scanner Test Case Challenge
Date: 14 May 2018 14:00:56
Message: <5af9ced8$1@news.povray.org>

On 14.05.2018 at 18:51, Bald Eagle wrote:
> clipka <ano### [at] anonymousorg> wrote:
>> The scanner will
>> also be responsible for earmarking each "word" with the line and column
>> where it was found.
> 
> Well, this all sounds like a herculean task.

I think it's manageable -- as long as I don't go into over-engineering
mode (like attempting to optimize the implementation for performance
right from the start while at the same time trying to add support for
generic character encodings; I had to convince myself that we can
probably do without EBCDIC for now ;)).

It's not like I'll be re-writing the entire parser; I'll just trim away
some very basic functionality of the tokenizer, and adapt the remainder
of the tokenizer to use the new dedicated scanner class. I think I can
get a clean cut through it all.

With that cut made, it will presumably be a lot easier to address the
true purpose of the whole operation: Cutting the tokenizer right down
the middle where it hurts, to separate context-insensitive tokenization
(such as identifying `sphere` as SPHERE_TOKEN) from context-sensitive
tokenization (such as identifying `Foo` as VECTOR_ID_TOKEN representing
the value <1,2,3> because the scene happened to contain a `#declare
Foo=<1,2,3>;` earlier).
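
The distinction between the two halves can be sketched in a few lines of
Python. Only `SPHERE_TOKEN` and `VECTOR_ID_TOKEN` come from the text
above; the function names, the `IDENTIFIER_TOKEN` placeholder and the
shape of the symbol table are assumptions made for illustration.

```python
# Fixed table: the same word always maps to the same token.
KEYWORDS = {"sphere": "SPHERE_TOKEN"}

def classify_static(word):
    """Context-insensitive step: depends on the word alone."""
    return KEYWORDS.get(word, "IDENTIFIER_TOKEN")

def classify_in_context(word, symbols):
    """Context-sensitive step: also consults what was declared so far.

    `symbols` maps a name to a (type name, value) pair.
    """
    token = classify_static(word)
    if token == "IDENTIFIER_TOKEN" and word in symbols:
        type_name, value = symbols[word]
        return f"{type_name}_ID_TOKEN", value
    return token, None

symbols = {}
before = classify_in_context("Foo", symbols)   # nothing declared yet
symbols["Foo"] = ("VECTOR", (1, 2, 3))         # effect of: #declare Foo=<1,2,3>;
after = classify_in_context("Foo", symbols)
```

Here `before` is `("IDENTIFIER_TOKEN", None)` and `after` is
`("VECTOR_ID_TOKEN", (1, 2, 3))`: the same word yields different tokens
depending on what came earlier, which is exactly what the first step
never has to care about.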

> Perhaps you could kick-start this by posting an illustrative example,
> say a past known issue that's since been resolved.

Example:

    314.e-2

should be interpreted as the numeric literal `314.e-2`, but at one point
was interpreted as numeric literal `314.`, identifier `e`, punctuation
`-`, numeric literal `2`.
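
Written as an automated check, the example might look like the Python
sketch below. Both patterns are assumptions: `NUMBER` is one plausible
grammar for numeric literals, and `NAIVE` is a hypothetical pattern that
only accepts an exponent after fraction digits, included purely to show
one way such a split could arise (not how the historical bug actually
came about).

```python
import re

# Digits with optional fraction (or a leading-dot fraction), then an
# optional exponent that may directly follow the decimal point.
NUMBER = re.compile(r"(?:\d+\.?\d*|\.\d+)(?:[eE][+-]?\d+)?")

# Hypothetical faulty variant: exponent permitted only after fraction digits.
NAIVE = re.compile(r"\d+\.(?:\d+(?:[eE][+-]?\d+)?)?")

def first_literal(pattern, text):
    """Return the numeric literal at the start of `text`, or None."""
    m = pattern.match(text)
    return m.group() if m else None

expected = first_literal(NUMBER, "314.e-2")
faulty = first_literal(NAIVE, "314.e-2")
```

`expected` is the full `314.e-2`, whereas `faulty` stops at `314.` and
leaves `e-2` to be scanned as three further "words".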

> A related aside:
> ----------------------------------------------------
> 
> Since writing all of that benefits from incorporating certain features early on,
> I'm wondering if this would be the point to think about implementing a
> nested-level-counter.

Such functionality would be far beyond the layers I'm working on right
now. The scanner has no idea what semantics `(` or `{`
have, let alone `#` followed by `for` (yup, from the current tokenizer's
perspective those are two separate "words", and I'll leave it at that).
It doesn't even know that `end` is not an identifier but a keyword.

Even the envisioned context-insensitive half of the tokenizer would have
no concept of nesting of anything - by virtue of being context-insensitive.

> Quite often with loops or include files, a closing bracket or something gets
> left out, and it's a huge nightmare to backtrack through it all.
> Perhaps if the level of the instruction were returned along with the line number
> and column, it would be easier to see "where" in the code things went wrong.
> 
> // Level 0
> #for (X, 0 10)
>      // Level 1
>           #if (Something = true)
>                // Level 2
>                #debug"True!"
>           #end
> #end

I think just a bare number would be of little use. In cases simple
enough that you can figure out by yourself what "level 2" actually
means, chances are you don't need that information anyway.

Also, cases of missing closing brackets/braces/parentheses/`#end`
statements etc. are typically found far later than where they are
actually missing, and by then the parser is usually back on the lowest
level already.
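
A toy Python sketch shows the effect. It is a plain bracket matcher over
(line, text) pairs, with the made-up name `check_nesting`; it is far
simpler than anything the real parser does and is only meant to show
where such a check ends up pointing.

```python
PAIRS = {")": "(", "}": "{", "]": "["}

def check_nesting(tokens):
    """tokens: iterable of (line, text). Return a message, or None if balanced."""
    stack = []       # (opener, line) for every bracket still open
    last_line = 0
    for line, text in tokens:
        last_line = line
        if text in PAIRS.values():
            stack.append((text, line))
        elif text in PAIRS:
            if not stack or stack[-1][0] != PAIRS[text]:
                return f"line {line}: unexpected '{text}'"
            stack.pop()
    if stack:
        opener, line = stack[-1]
        return f"line {last_line}: end of input, '{opener}' from line {line} still open"
    return None

# An outer block opens on line 1; the block opened on line 2 is never
# closed; a third block opens and closes on line 3; line 9 has the final '}'.
tokens = [(1, "{"), (2, "{"), (3, "{"), (3, "}"), (9, "}")]
message = check_nesting(tokens)
```

The brace that is actually missing belongs to line 2, yet `message`
reads `line 9: end of input, '{' from line 1 still open`: the problem
surfaces only at the very end, and the only opener left to blame is the
outermost one.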

Remember, the parser has a _very_ poor understanding of where brackets
and the like would really be required. _At best_ it can tell you in
hindsight that something was missing _somewhere_; at worst, it may
actually be so far off the page that it hard-crashes before it even gets
a fair chance of noticing the mismatch.
