POV-Ray: Newsgroups: povray.beta-test: New version of new tokenizer: Re: New version of new tokenizer

POV-Ray : Newsgroups : povray.beta-test : New version of new tokenizer : Re: New version of new tokenizer		Server Time 20 Apr 2024 02:43:04 EDT (-0400)

From: clipka
Date: 2 Jun 2018 13:59:51
Message: <5b12db17@news.povray.org>

Am 01.06.2018 um 19:29 schrieb clipka:

> - Non-ASCII characters in string literals: This I will also set aside
> for now, until I get a clearer picture of whether the current
> scene-global `charset` mechanism is even used to any extent worth
> supporting, as I think it may be easier and cleaner to throw it
> overboard (or at least ditch the `sys` setting) in favour of a per-file
> mechanism.

Well, that was easier than I expected:

- According to the v3.6 documentation, POV-Ray for Windows and Mac do
not support the `charset sys` setting. The platform-specific docs for
Unix do not seem to mention `charset sys`, but according to the source
code, POV-Ray for Unix does not support the setting either.

- According to the v3.7 source code, neither POV-Ray for Windows nor
POV-Ray for Unix support the setting.

So apparently `charset sys` has never really been implemented, and can't
have been used in any legacy scene.

This leaves only `charset ascii` and `charset utf8` to be supported for
backward compatibility. Which in theory would be trivial from the
perspective of the scanner and low-level tokenizer, because ASCII is a
true subset of UTF-8 in every respect.


In practice it's a little bit less trivial, as in legacy (pre-v3.5)
scenes using the default `charset ascii` setting, non-ASCII characters
are passed "as is" to some portions of the code (most notably debug
output). This means interpretation of non-ASCII characters would have to
be context-sensitive not necessarily with regards to the `charset`
setting, but to the `#version` setting.

I guess I'll address this as follows:

- I'll presume that in any file using neither plain ASCII nor UTF-8
encoding, the first occurrence of one or more non-ASCII characters does
/not/ happen to be a valid UTF-8 encoding, allowing for detection of
UTF-8 without knowledge of the `charset` setting. (This presumption is
guaranteed to be true for any file where the first non-ASCII character
is followed by an ASCII character; otherwise, it depends on other
properties of the first non-ASCII sequence.)

- I'll naively presume that any non-ASCII non-UTF-8 file uses ISO
Latin-1 encoding. (As far as I can see, this matches the implemented
behaviour of v3.7 in `charset ascii` pre-v3.5 legacy mode; in `charset
ascii` non-legacy-mode, the encoding is irrelevant as any non-ASCII
characters are replaced with blanks; in `charset utf8` mode, we should
detect ASCII or UTF-8.)

- I'll have the scanner always translate the input file to UCS based on
the above presumptions, and leave it to the parser proper to decide what
to do with non-ASCII characters, based on `#version` and `charset` settings.

Post a reply to this message