POV-Ray: Newsgroups: povray.beta-test: New version of new tokenizer

From: clipka
Subject: New version of new tokenizer
Date: 31 May 2018 12:03:10
Message: <5b101cbe@news.povray.org>

https://github.com/POV-Ray/povray/releases/tag/v3.8.0-x.tokenizer.9684878

This version re-implements the `#read` directive.

The new implementation also no longer requires commas at the
end of each line (so regular CSV files can be read), or even
between individual values on the same line (so tab-separated
values are supported as well).

(Double quotes around strings remain mandatory though.)
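
To illustrate the relaxed separator rules, here is a minimal Python sketch. It is not POV-Ray code and not the actual tokenizer implementation; the function name `split_values` and the regular expression are assumptions made purely for illustration, and only numbers and double-quoted strings are handled.

```python
import re

# Hypothetical illustration only: a value is either a double-quoted string
# or a run of characters containing no separator (comma or whitespace).
_VALUE = re.compile(r'"([^"]*)"|([^\s,"]+)')

def split_values(text):
    """Split data-file text into values; commas and whitespace both separate."""
    values = []
    for match in _VALUE.finditer(text):
        quoted, bare = match.groups()
        if quoted is not None:
            values.append(quoted)        # string value, quotes stripped
        else:
            values.append(float(bare))   # bare values are treated as numbers
    return values

csv_like = '1, 2.5, "red"\n3, 4.5, "green"\n'   # no comma at end of line
tsv_like = '1\t2.5\t"red"\n3\t4.5\t"green"\n'   # no commas at all
print(split_values(csv_like))
print(split_values(tsv_like))
```

Under these assumptions both inputs yield the same list of values, which is the point of the change: the separator no longer matters, only the quoting of strings does.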

From: clipka
Subject: Re: New version of new tokenizer
Date: 31 May 2018 13:59:03
Message: <5b1037e7$1@news.povray.org>

On 31.05.2018 at 18:03, clipka wrote:
> https://github.com/POV-Ray/povray/releases/tag/v3.8.0-x.tokenizer.9684878
> 
> This version re-implements the `#read` directive.
> 
> The new implementation also no longer requires commas at the
> end of each line (so regular CSV files can be read), or even
> between individual values on the same line (so tab-separated
> values are supported as well).
> 
> (Double quotes around strings remain mandatory though.)

After some trouble with the auto-build service, Windows binaries are up now.

From: clipka
Subject: Re: New version of new tokenizer
Date: 1 Jun 2018 04:35:16
Message: <5b110544$1@news.povray.org>

On 31.05.2018 at 18:03, clipka wrote:
> https://github.com/POV-Ray/povray/releases/tag/v3.8.0-x.tokenizer.9684878
> 
> This version re-implements the `#read` directive.
> 
> The new implementation also no longer requires commas at the
> end of each line (so regular CSV files can be read), or even
> between individual values on the same line (so tab-separated
> values are supported as well).
> 
> (Double quotes around strings remain mandatory though.)

Remaining known issues:

- system-specific character encoding in strings currently not supported
(but utf8 should work)
- signature BOM in utf8-encoded files currently not supported

- performance of loops with macros in them still needs work.

From: clipka
Subject: Re: New version of new tokenizer
Date: 1 Jun 2018 13:29:15
Message: <5b11826b$1@news.povray.org>

On 01.06.2018 at 10:35, clipka wrote:

> - signature BOM in utf8-encoded files currently not supported

And now that one has also been addressed:

https://github.com/POV-Ray/povray/releases/tag/v3.8.0-x.tokenizer.9686180

And once again this re-implementation comes with improvements over v3.7:

- The v3.7 implementation simply swallowed any contiguous sequence of
non-ASCII bytes at the start of a scene file, and just /presumed/ them
to be a UTF-8 signature BOM. The v3.8.0-x.tokenizer implementation
actually checks whether the non-ASCII byte sequence matches the UTF-8
signature BOM.

- The v3.7 implementation only covered the main scene file. The new
implementation extends to include files as well.
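
The difference between the two behaviours can be sketched roughly as follows in Python. This is only an illustration of the description above, not the actual C++ code; the function names are made up, and what the new implementation does with a non-matching byte sequence is not stated here, so the sketch simply leaves such bytes in place.

```python
UTF8_BOM = b"\xef\xbb\xbf"  # the UTF-8 signature (byte order mark)

def skip_bom_lenient(data: bytes) -> bytes:
    """v3.7-like behaviour as described: drop any leading run of
    non-ASCII bytes, presuming it to be a signature."""
    i = 0
    while i < len(data) and data[i] >= 0x80:
        i += 1
    return data[i:]

def skip_bom_strict(data: bytes) -> bytes:
    """New behaviour as described: drop the leading bytes only if they
    are exactly the UTF-8 signature (assumption: otherwise keep them)."""
    if data.startswith(UTF8_BOM):
        return data[len(UTF8_BOM):]
    return data

good = UTF8_BOM + b"#version 3.8;"
odd = b"\xff\xfe#version 3.8;"     # leading bytes that are not a UTF-8 BOM

print(skip_bom_lenient(good), skip_bom_strict(good))
print(skip_bom_lenient(odd), skip_bom_strict(odd))
```

With a correct signature both variants agree; they differ only for files that start with some other non-ASCII byte sequence.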


This leaves only two known issues:

- Performance of loops invoking macros: I will not address this for now,
as it does not seem to impede functionality, and the root cause will
most likely be eliminated anyway when I implement token-level loop caching.

- Non-ASCII characters in string literals: This I will also set aside
for now, until I get a clearer picture of whether the current
scene-global `charset` mechanism is even used to any extent worth
supporting, as I think it may be easier and cleaner to throw it
overboard (or at least ditch the `sys` setting) in favour of a per-file
mechanism.

From: Thorsten Froehlich
Subject: Re: New version of new tokenizer
Date: 1 Jun 2018 17:20:01
Message: <web.5b11b74a905c8c8f535efa580@news.povray.org>

clipka <ano### [at] anonymousorg> wrote:
> - The v3.7 implementation simply swallowed any contiguous sequence of
> non-ASCII bytes at the start of a scene file, and just /presumed/ them
> to be a UTF-8 signature BOM. The v3.8.0-x.tokenizer implementation
> actually checks whether the non-ASCII byte sequence matches the UTF-8
> signature BOM.

This was actually a feature. Originally it checked, but it turned out that at
least at the time several editors created incorrect BOMs...

Thorsten

From: clipka
Subject: Re: New version of new tokenizer
Date: 2 Jun 2018 03:52:57
Message: <5b124cd9$1@news.povray.org>

On 01.06.2018 at 23:15, Thorsten Froehlich wrote:
> clipka <ano### [at] anonymousorg> wrote:
>> - The v3.7 implementation simply swallowed any contiguous sequence of
>> non-ASCII bytes at the start of a scene file, and just /presumed/ them
>> to be a UTF-8 signature BOM. The v3.8.0-x.tokenizer implementation
>> actually checks whether the non-ASCII byte sequence matches the UTF-8
>> signature BOM.
> 
> This was actually a feature. Originally it checked, but it turned out that at
> least at the time several editors created incorrect BOMs...

Thanks for the info. Should the change prompt any issue reports, I'll
know what to do. For now, I'll just take the chance.

From: Le Forgeron
Subject: Re: New version of new tokenizer
Date: 2 Jun 2018 04:14:18
Message: <5b1251da$1@news.povray.org>

On 01/06/2018 at 19:29, clipka wrote:
> - Non-ASCII characters in string literals: This I will also set aside
> for now, until I get a clearer picture of whether the current
> scene-global `charset` mechanism is even used to any extent worth
> supporting, as I think it may be easier and cleaner to throw it
> overboard (or at least ditch the `sys` setting) in favour of a per-file
> mechanism.
> 

1. Is there, in our modern world, a need for anything other than UTF-8?

2. I hope you do not expect editors to always insert a BOM header.

From: clipka
Subject: Re: New version of new tokenizer
Date: 2 Jun 2018 05:22:19
Message: <5b1261cb@news.povray.org>

On 02.06.2018 at 10:14, Le_Forgeron wrote:
> On 01/06/2018 at 19:29, clipka wrote:
>> - Non-ASCII characters in string literals: This I will also set aside
>> for now, until I get a clearer picture of whether the current
>> scene-global `charset` mechanism is even used to any extent worth
>> supporting, as I think it may be easier and cleaner to throw it
>> overboard (or at least ditch the `sys` setting) in favour of a per-file
>> mechanism.
>>
> 
> 1. Is there, in our modern world, a need for anything other than UTF-8?

I'm primarily thinking of legacy files, or files created by legacy software.

> 2. I hope you do not expect editors to always insert a BOM header.

No, of course not. I'm sticking to the current UCS specs there,
according to which the signature is optional in the UTF-8 encoding
scheme.

Having or not having a signature BOM /may/ have side effects though --
most notably because without a signature it is impossible to distinguish
the format from ASCII or classic extended ASCII until the first
non-ASCII character is encountered (and even then it is a guess whether
it's really UTF-8), unless some other means of specifying the encoding
is used. This was the case in v3.7, where a signature BOM was taken to
imply `global_settings { charset utf8 }`, while the absence of both
signature BOM and `charset` caused UTF-8 files to be interpreted as
ASCII with unrecognized characters (quietly replaced with blanks, IIRC).
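
As a rough Python sketch of that v3.7 side effect (an illustration of the description above, not the real code): the function name `interpret_v37_like` is invented, and replacing each non-ASCII byte with exactly one blank is an assumption, since the replacement behaviour is only recalled from memory.

```python
UTF8_BOM = b"\xef\xbb\xbf"  # the UTF-8 signature (byte order mark)

def interpret_v37_like(data, charset=None):
    """Decode scene text the way v3.7 is described to: a signature BOM
    implies `charset utf8`; without BOM or `charset`, assume ASCII."""
    has_bom = data.startswith(UTF8_BOM)
    if has_bom:
        data = data[len(UTF8_BOM):]
    if has_bom or charset == "utf8":
        return data.decode("utf-8")
    # Neither signature nor `charset utf8`: treat the file as ASCII and
    # blank out anything else (assumption: one blank per byte).
    return "".join(chr(b) if b < 0x80 else " " for b in data)

text = "caf\u00e9".encode("utf-8")              # b'caf\xc3\xa9'
print(repr(interpret_v37_like(UTF8_BOM + text)))    # BOM present
print(repr(interpret_v37_like(text, "utf8")))       # explicit charset
print(repr(interpret_v37_like(text)))               # neither one
```

The same bytes thus come out as intended only when either the signature or the `charset` setting is present.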

From: clipka
Subject: Re: New version of new tokenizer
Date: 2 Jun 2018 13:59:51
Message: <5b12db17@news.povray.org>

On 01.06.2018 at 19:29, clipka wrote:

> - Non-ASCII characters in string literals: This I will also set aside
> for now, until I get a clearer picture of whether the current
> scene-global `charset` mechanism is even used to any extent worth
> supporting, as I think it may be easier and cleaner to throw it
> overboard (or at least ditch the `sys` setting) in favour of a per-file
> mechanism.

Well, that was easier than I expected:

- According to the v3.6 documentation, POV-Ray for Windows and Mac do
not support the `charset sys` setting. The platform-specific docs for
Unix do not seem to mention `charset sys`, but according to the source
code, POV-Ray for Unix does not support the setting either.

- According to the v3.7 source code, neither POV-Ray for Windows nor
POV-Ray for Unix supports the setting.

So apparently `charset sys` has never really been implemented, and can't
have been used in any legacy scene.

This leaves only `charset ascii` and `charset utf8` to be supported for
backward compatibility, which in theory would be trivial from the
perspective of the scanner and low-level tokenizer, because ASCII is a
proper subset of UTF-8 in every respect.


In practice it's a little less trivial: in legacy (pre-v3.5) scenes
using the default `charset ascii` setting, non-ASCII characters are
passed "as is" to some portions of the code (most notably debug
output). This means the interpretation of non-ASCII characters would
have to be context-sensitive, depending not necessarily on the
`charset` setting but on the `#version` setting.

I guess I'll address this as follows:

- I'll presume that in any file using neither plain ASCII nor UTF-8
encoding, the first occurrence of one or more non-ASCII characters does
/not/ happen to be a valid UTF-8 encoding, allowing for detection of
UTF-8 without knowledge of the `charset` setting. (This presumption is
guaranteed to be true for any file where the first non-ASCII character
is followed by an ASCII character; otherwise, it depends on other
properties of the first non-ASCII sequence.)

- I'll naively presume that any non-ASCII non-UTF-8 file uses ISO
Latin-1 encoding. (As far as I can see, this matches the implemented
behaviour of v3.7 in `charset ascii` pre-v3.5 legacy mode; in `charset
ascii` non-legacy-mode, the encoding is irrelevant as any non-ASCII
characters are replaced with blanks; in `charset utf8` mode, we should
detect ASCII or UTF-8.)

- I'll have the scanner always translate the input file to UCS based on
the above presumptions, and leave it to the parser proper to decide what
to do with non-ASCII characters, based on `#version` and `charset` settings.
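
The plan above can be sketched in Python roughly like this. It is a minimal illustration under the stated presumptions, not the scanner's actual code; all function names are hypothetical, and Python's built-in UTF-8 decoder stands in for whatever validity check the scanner would use.

```python
def first_non_ascii_run(data: bytes) -> bytes:
    """Return the first contiguous run of non-ASCII bytes (empty if none)."""
    start = next((i for i, b in enumerate(data) if b >= 0x80), None)
    if start is None:
        return b""
    end = start
    while end < len(data) and data[end] >= 0x80:
        end += 1
    return data[start:end]

def guess_encoding(data: bytes) -> str:
    """Guess the encoding from the first non-ASCII sequence only."""
    run = first_non_ascii_run(data)
    if not run:
        return "ascii"            # nothing to distinguish
    try:
        run.decode("utf-8")       # is the first run valid UTF-8?
        return "utf-8"
    except UnicodeDecodeError:
        return "latin-1"          # naive fallback: ISO Latin-1

def to_ucs(data: bytes) -> str:
    """Translate the input to Unicode text based on the guess.
    Assumption: later invalid UTF-8 is replaced rather than rejected."""
    return data.decode(guess_encoding(data), errors="replace")

print(guess_encoding(b"sphere { 0, 1 }"))
print(guess_encoding(b"caf\xe9 au lait"))        # Latin-1 byte, then ASCII
print(guess_encoding("caf\u00e9".encode("utf-8")))
```

A lone Latin-1 byte followed by an ASCII character can never form valid UTF-8, which is why the second case is detected reliably; a Latin-1 sequence such as `\xc3\xa9` would be misread as UTF-8, which is the residual risk the presumption accepts.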
