POV-Ray: Newsgroups: povray.beta-test: v3.8 character set handling: v3.8 character set handling

POV-Ray : Newsgroups : povray.beta-test : v3.8 character set handling : v3.8 character set handling		Server Time 18 Jul 2025 18:52:21 EDT (-0400)
From: clipka
Date: 3 Jan 2019 14:49:26
Message: <5c2e6746@news.povray.org>
The internal reworking of the parser will make it necessary to overhaul 
the handling of character sets, in a way that will not be entirely 
compatible with earlier versions. Here are my current plans - comments 
welcome:


(1) The current `global_settings { charset FOO }` mechanism (apparently 
introduced with v3.5) will be disabled.

For backward compatibility it will still be accepted, but produce a 
warning. (Not sure yet if we want the warning for pre-v3.5 scenes, 
and/or an outright error for v3.8 and later scenes.)


(2) Internal character handling will use UCS (essentially Unicode minus 
a truckload of semantics rules) throughout.

For example, `chr(X)` will return the character corresponding to UCS 
codepoint X.

Specifically, for now the internal character set will be UCS-2 (i.e. the 
16-bit subset of UCS, aka the Base Multilingual Plane; not to be 
confused with UTF-16, which is a 16-bit-based encoding of the full UCS-4 
character set).

Note that the choice of UCS-2 will make the internal handling compatible 
with both Latin-1 and ASCII (but not Windows-1252, which is frequently 
misidentified as Latin-1).

In later versions, the internal character handling may be extended to 
fully support UCS-4.


(3) Character encoding of input files will be auto-detected.

Full support will be provided for UTF-8 (and, by extension, ASCII), 
which will be auto-detected reliably.

Reasonable support will be provided for Windows-1252 (and, by extension, 
Latin-1) except for a few fringe cases that will be misidentified as UTF-8.

For now, it will be recommended that any other encodings (or 
misidentified fringe cases) be converted to UTF-8.

In later versions, a mechanism may be added to specify encoding on a 
per-file basis, to add support for additional encodings or work around 
misidentified fringe cases.


(4) Character encoding of output (created via `#write` or `#debug`) will 
be left undefined for now except for being a superset of ASCII.

In later versions, output may be more rigorously defined as either 
UTF-8, Latin-1 or Windows-1252, or mechanisms may be added to choose 
between these (and maybe other) encodings.


(5) Text primitives will use UCS encoding unless specified otherwise.

If the current syntax is used, the specified text will be interpreted as 
a sequence of UCS codepoints (conformant with (2)), and looked up in one 
of the Unicode CMAP tables in the font file. If the font file has no 
Unicode CMAP tables, the character will first be converted from UCS to 
the character set of an available CMAP table, and then looked up via 
that table. If this also fails, the character will be rendered using the 
repacement glyph (often a square).

To simplify conversion of old scenes, the text primitive syntax will be 
extended with a syntax allowing for more control over the lookup process:

     #declare MyText = "a\u20ACb";
     text { ttf "sym.ttf" cmap { 3,0 charset windows1252 } MyText }

This example will do the following:

- The sequence '\u20AC' in a string literal instructs POV-Ray to place 
the UCS character U+20AC there (this syntax already exists), in this 
case resulting in a string that contains `a` (U+0061) followed by a Euro 
sign (U+20AC) followed by `b` (U+0062).

- Specifying `cmap { ... charset windows1252 }` instructs POV-Ray to 
first convert each character to the Windows-1252 character set, in this 
case giving character codes of 97 (`a`), 128 (the Euro sign) and 98 
(`b`), respectively.

- Specifying `cmap { 3,0 ... }` instructs POV-Ray to use the CMAP table 
with the specified PlatformID (in this case 3, identifying the platform 
as "Microsoft") and PlatformSpecificID (in this case 0, identifiying the 
encoding as "Symbol").

- POV-Ray will take the character codes as per the charset conversion, 
and use those directly as indices into the specified CMAP table to look 
up the glyph (character shape), without any further conversion.

The `charset` setting will be optional; omitting it will instruct 
POV-Ray to not do a character conversion, but rather use the UCS 
character codes directly as indices into the CMAP table.

For now, the set of supported charsets will be limited to the following:

- Windows-1252
- Mac OS Roman
- UCS (implicit via omission)

Additional charsets will be supported by virtue of being subsets of the 
above, e.g.:

- Latin-1 (being a subset of both Windows-1252 and UCS)
- ASCII (being a subset of all of the above)

(Note that all of the above refer to the respective character sets, not 
encodings thereof.)

The syntax for the `charset` setting is not finalized yet. Instead of a 
keyword as a parameter (e.g. `charset windows1252`) I'm also considering 
using a string parameter (e.g. `charset "Windows-1252"`), or even 
Windows codepage numbers (e.g. `charset 1252` for Windows-1252 and 
`charset 10000` for Mac OS Roman). Comments highly welcome.


For scenes using `global_setting { charset utf8 }`, as well as scenes 
using only ASCII characters, the above rules should result in unchanged 
behaviour (except for the effects of a few bugs in the TrueType handling 
code that I intend to eliminate along the way).
Post a reply to this message