On 01.06.2018 at 19:29, clipka wrote:
> - Non-ASCII characters in string literals: This I will also set aside
> for now, until I get a clearer picture of whether the current
> scene-global `charset` mechanism is even used to any extent worth
> supporting, as I think it may be easier and cleaner to throw it
> overboard (or at least ditch the `sys` setting) in favour of a per-file
> mechanism.
Well, that was easier than I expected:
- According to the v3.6 documentation, POV-Ray for Windows and Mac do
not support the `charset sys` setting. The platform-specific docs for
Unix do not seem to mention `charset sys`, but according to the source
code, POV-Ray for Unix does not support the setting either.
- According to the v3.7 source code, neither POV-Ray for Windows nor
POV-Ray for Unix supports the setting.
So apparently `charset sys` has never really been implemented, and can't
have been used in any legacy scene.
This leaves only `charset ascii` and `charset utf8` to be supported for
backward compatibility, which in theory would be trivial from the
perspective of the scanner and low-level tokenizer, because ASCII is a
proper subset of UTF-8 in every respect.
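The subset property can be checked directly. The following is only an illustrative Python sketch (the function name is made up for this example and is not POV-Ray code): each of the 128 ASCII byte values decodes to the same character under both encodings, and encodes back to the same single byte in UTF-8.

```python
def ascii_is_utf8_subset() -> bool:
    """Check that each ASCII byte means the same thing in ASCII and UTF-8."""
    for value in range(128):
        raw = bytes([value])
        # Same byte, same character, under both decoders.
        if raw.decode("ascii") != raw.decode("utf-8"):
            return False
        # Encoding the character as UTF-8 yields the original single byte.
        if chr(value).encode("utf-8") != raw:
            return False
    return True
```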
In practice it's a little bit less trivial, as in legacy (pre-v3.5)
scenes using the default `charset ascii` setting, non-ASCII characters
are passed "as is" to some portions of the code (most notably debug
output). This means interpretation of non-ASCII characters would have to
depend not so much on the `charset` setting as on the `#version`
setting.
I guess I'll address this as follows:
- I'll presume that in any file using neither plain ASCII nor UTF-8
encoding, the first occurrence of one or more non-ASCII characters does
/not/ happen to be a valid UTF-8 encoding, allowing for detection of
UTF-8 without knowledge of the `charset` setting. (This presumption is
guaranteed to be true for any file where the first non-ASCII character
is followed by an ASCII character; otherwise, it depends on other
properties of the first non-ASCII sequence.)
- I'll naively presume that any non-ASCII non-UTF-8 file uses ISO
Latin-1 encoding. (As far as I can see, this matches the implemented
behaviour of v3.7 in `charset ascii` pre-v3.5 legacy mode; in `charset
ascii` non-legacy mode, the encoding is irrelevant as any non-ASCII
characters are replaced with blanks; in `charset utf8` mode, we should
detect ASCII or UTF-8.)
- I'll have the scanner always translate the input file to UCS based on
the above presumptions, and leave it to the parser proper to decide what
to do with non-ASCII characters, based on `#version` and `charset` settings.
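The three presumptions above can be sketched roughly as follows. This is not the actual POV-Ray scanner code: Python is used only for brevity, all function names are hypothetical, and the handling of invalid sequences later in a UTF-8 file (replacement characters) is an assumption of this sketch. The policy function at the end is also a simplification, since in legacy scenes only some portions of the code see non-ASCII characters "as is".

```python
def first_non_ascii_run(data: bytes) -> bytes:
    """Return the first maximal run of bytes >= 0x80, or b"" if there is none."""
    start = next((i for i, b in enumerate(data) if b >= 0x80), None)
    if start is None:
        return b""
    end = start
    while end < len(data) and data[end] >= 0x80:
        end += 1
    return data[start:end]


def guess_encoding(data: bytes) -> str:
    """Guess 'ascii', 'utf-8' or 'latin-1' from the first non-ASCII run only."""
    run = first_non_ascii_run(data)
    if not run:
        return "ascii"
    try:
        # Strict decoding rejects stray continuation bytes, truncated
        # sequences and overlong forms.
        run.decode("utf-8")
    except UnicodeDecodeError:
        return "latin-1"  # naive presumption: anything else is ISO Latin-1
    return "utf-8"


def scan_to_ucs(data: bytes) -> str:
    """Translate the raw file contents to Unicode code points."""
    if guess_encoding(data) == "utf-8":
        # Assumption of this sketch: later invalid sequences become U+FFFD.
        return data.decode("utf-8", errors="replace")
    # Latin-1 maps every byte to the code point of the same value,
    # which also covers the plain ASCII case.
    return data.decode("latin-1")


def apply_charset_policy(text: str, version: float, charset: str) -> str:
    """Hypothetical parser-side decision, made after scanning."""
    if charset == "utf8" or version < 3.5:
        return text  # keep non-ASCII characters
    # `charset ascii` in non-legacy mode: replace non-ASCII with blanks.
    return "".join(ch if ord(ch) < 0x80 else " " for ch in text)
```

A lone Latin-1 byte followed by an ASCII character, such as `b"caf\xe9 x"`, can never pass as UTF-8, because `0xE9` would need two continuation bytes. The heuristic does misfire on a Latin-1 file whose first non-ASCII run happens to form valid UTF-8, for example `b"\xc3\xa9"`, which is the "other properties" case mentioned above.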