The internal reworking of the parser will make it necessary to overhaul
the handling of character sets, in a way that will not be entirely
compatible with earlier versions. Here are my current plans - comments
welcome:
(1) The current `global_settings { charset FOO }` mechanism (apparently
introduced with v3.5) will be disabled.
For backward compatibility it will still be accepted, but produce a
warning. (Not sure yet if we want the warning for pre-v3.5 scenes,
and/or an outright error for v3.8 and later scenes.)
(2) Internal character handling will use UCS (essentially Unicode minus
a truckload of semantics rules) throughout.
For example, `chr(X)` will return the character corresponding to UCS
codepoint X.
Specifically, for now the internal character set will be UCS-2 (i.e. the
16-bit subset of UCS, aka the Basic Multilingual Plane; not to be
confused with UTF-16, which is a 16-bit-based encoding of the full UCS-4
character set).
Note that the choice of UCS-2 will make the internal handling compatible
with both Latin-1 and ASCII (but not Windows-1252, which is frequently
misidentified as Latin-1).
In later versions, the internal character handling may be extended to
fully support UCS-4.
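To illustrate the compatibility claim, here is a small Python sketch (Python rather than POV-Ray SDL, purely for demonstration; the helper `ucs_of` is a made-up name). It assumes Python's `latin-1` and `cp1252` codecs are faithful stand-ins for the respective character sets.

```python
# Sketch only: compares single-byte character codes with UCS code points.
# `ucs_of` is a hypothetical helper, not part of POV-Ray.

def ucs_of(byte_value, encoding):
    """Return the UCS code point that `byte_value` denotes in `encoding`."""
    return ord(bytes([byte_value]).decode(encoding))

# Every Latin-1 code equals its UCS code point, so Latin-1 (and hence
# ASCII) character codes can be used as UCS code points unchanged.
latin1_matches = all(ucs_of(b, "latin-1") == b for b in range(256))

# Windows-1252 deviates in the range 0x80..0x9F; for instance, 0x80 is
# the Euro sign, which lives at U+20AC in UCS.
euro_from_cp1252 = ucs_of(0x80, "cp1252")

# Python's own chr() happens to behave like the proposed `chr(X)`:
# it returns the character at UCS code point X.
e_acute = chr(0xE9)
```

A scene that relied on `chr(128)` yielding a Euro sign would therefore be among those affected by the change.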
(3) Character encoding of input files will be auto-detected.
Full support will be provided for UTF-8 (and, by extension, ASCII),
which will be auto-detected reliably.
Reasonable support will be provided for Windows-1252 (and, by extension,
Latin-1) except for a few fringe cases that will be misidentified as UTF-8.
For now, it will be recommended that any other encodings (or
misidentified fringe cases) be converted to UTF-8.
In later versions, a mechanism may be added to specify encoding on a
per-file basis, to add support for additional encodings or work around
misidentified fringe cases.
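One plausible way to do such auto-detection is to try strict UTF-8 first and fall back to Windows-1252 if that fails. The Python sketch below shows the idea. It is an assumption about the general approach, not the actual parser code, and `detect_and_decode` is a hypothetical name.

```python
# Sketch only: UTF-8-first detection with a Windows-1252 fallback.
# This is an assumed strategy, not POV-Ray's actual implementation.

def detect_and_decode(raw):
    """Guess the encoding of `raw` (bytes); return (encoding_name, text)."""
    try:
        # Pure ASCII is valid UTF-8, so it is covered by this branch too.
        return "utf-8", raw.decode("utf-8")
    except UnicodeDecodeError:
        # Not well-formed UTF-8: assume Windows-1252. The few byte values
        # that Windows-1252 leaves undefined are replaced with U+FFFD.
        return "windows-1252", raw.decode("cp1252", errors="replace")

utf8_case = detect_and_decode("a\u20acb".encode("utf-8"))
cp1252_case = detect_and_decode("a\u20acb".encode("cp1252"))

# Fringe case: the Windows-1252 bytes for "\u00c3\u00a9" (A-tilde followed
# by a copyright sign) happen to form a valid UTF-8 sequence, so they are
# misidentified as UTF-8.
fringe_case = detect_and_decode("\u00c3\u00a9".encode("cp1252"))
```

Misidentification needs a non-ASCII byte sequence that is also well-formed UTF-8, which is rare in real Windows-1252 text.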
(4) Character encoding of output (created via `#write` or `#debug`) will
be left undefined for now except for being a superset of ASCII.
In later versions, output may be more rigorously defined as either
UTF-8, Latin-1 or Windows-1252, or mechanisms may be added to choose
between these (and maybe other) encodings.
(5) Text primitives will use UCS encoding unless specified otherwise.
If the current syntax is used, the specified text will be interpreted as
a sequence of UCS codepoints (conformant with (2)), and looked up in one
of the Unicode CMAP tables in the font file. If the font file has no
Unicode CMAP tables, the character will first be converted from UCS to
the character set of an available CMAP table, and then looked up via
that table. If this also fails, the character will be rendered using the
replacement glyph (often a square).
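The default lookup order described above can be modelled roughly as follows. This is a toy Python sketch in which CMAP tables are plain dictionaries. The names `lookup_glyph`, `unicode_cmap` and `legacy_cmaps` are invented for illustration, and real TrueType CMAP parsing is considerably more involved.

```python
# Toy model only: CMAP tables are dicts mapping character code -> glyph
# index. All names here are hypothetical.

REPLACEMENT_GLYPH = 0  # glyph index 0 is conventionally ".notdef" in TrueType

def lookup_glyph(codepoint, unicode_cmap, legacy_cmaps):
    """
    codepoint:    UCS code point of the character to render
    unicode_cmap: {code point: glyph index}, or None if the font has none
    legacy_cmaps: {Python codec name: {character code: glyph index}}
    """
    if unicode_cmap is not None:
        # Preferred path: look the UCS code point up directly.
        return unicode_cmap.get(codepoint, REPLACEMENT_GLYPH)
    for codec, table in legacy_cmaps.items():
        try:
            # Convert from UCS to the character set of this CMAP table.
            encoded = chr(codepoint).encode(codec)
        except UnicodeEncodeError:
            continue  # character does not exist in that character set
        if len(encoded) == 1 and encoded[0] in table:
            return table[encoded[0]]
    # Nothing matched: fall back to the replacement glyph.
    return REPLACEMENT_GLYPH
```

For example, with no Unicode table and a Windows-1252 table that maps code 128 to some glyph, a Euro sign (U+20AC) would resolve to that glyph, while a CJK character would end up as the replacement glyph.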
To simplify conversion of old scenes, the text primitive syntax will be
extended with a syntax allowing for more control over the lookup process:
#declare MyText = "a\u20ACb";
text { ttf "sym.ttf" cmap { 3,0 charset windows1252 } MyText }
This example will do the following:
- The sequence '\u20AC' in a string literal instructs POV-Ray to place
the UCS character U+20AC there (this syntax already exists), in this
case resulting in a string that contains `a` (U+0061) followed by a Euro
sign (U+20AC) followed by `b` (U+0062).
- Specifying `cmap { ... charset windows1252 }` instructs POV-Ray to
first convert each character to the Windows-1252 character set, in this
case giving character codes of 97 (`a`), 128 (the Euro sign) and 98
(`b`), respectively.
- Specifying `cmap { 3,0 ... }` instructs POV-Ray to use the CMAP table
with the specified PlatformID (in this case 3, identifying the platform
as "Microsoft") and PlatformSpecificID (in this case 0, identifying the
encoding as "Symbol").
- POV-Ray will take the character codes as per the charset conversion,
and use those directly as indices into the specified CMAP table to look
up the glyph (character shape), without any further conversion.
The `charset` setting will be optional; omitting it will instruct
POV-Ray to not do a character conversion, but rather use the UCS
character codes directly as indices into the CMAP table.
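The character codes that result with and without a `charset` setting can be checked with a few lines of Python. This sketch only shows the conversion step, not the CMAP lookup. The function `cmap_indices` is a hypothetical name, and Python's `cp1252` codec stands in for the Windows-1252 character set.

```python
# Sketch only: computes the codes that would be used as CMAP indices.
# `cmap_indices` is a made-up helper, not POV-Ray API.

def cmap_indices(text, charset=None):
    """Return the list of character codes used to index the CMAP table."""
    if charset is None:
        # No `charset` given: use the UCS code points as they are.
        return [ord(ch) for ch in text]
    # `charset` given: convert each character to that character set first.
    # (Assumes a single-byte charset, as with Windows-1252 or Mac OS Roman.)
    return list(text.encode(charset))

my_text = "a\u20ACb"  # same escape syntax as in the scene example

with_charset = cmap_indices(my_text, "cp1252")  # 97, 128, 98
without_charset = cmap_indices(my_text)         # 97, 8364 (0x20AC), 98
```

In a symbol font's 3,0 table the Euro glyph would thus be looked up at index 128 with `charset windows1252`, but at index 8364 without it.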
For now, the set of supported charsets will be limited to the following:
- Windows-1252
- Mac OS Roman
- UCS (implicit via omission)
Additional charsets will be supported by virtue of being subsets of the
above, e.g.:
- Latin-1 (being a subset of both Windows-1252 and UCS)
- ASCII (being a subset of all of the above)
(Note that all of the above refer to the respective character sets, not
encodings thereof.)
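The subset relations can be verified mechanically. The Python sketch below uses the `cp1252` and `mac_roman` codecs as stand-ins for the character sets (for single-byte sets the distinction between set and encoding does not matter here) and considers only the printable ranges.

```python
# Sketch only: checks the subset claims using Python's built-in codecs.

ascii_chars = [chr(cp) for cp in range(0x20, 0x7F)]    # printable ASCII
latin1_chars = [chr(cp) for cp in range(0xA0, 0x100)]  # Latin-1 upper half

# ASCII characters have the same code in Windows-1252, Mac OS Roman and UCS.
ascii_agrees = all(
    ch.encode("cp1252")[0] == ch.encode("mac_roman")[0] == ord(ch)
    for ch in ascii_chars
)

# Latin-1 characters have the same code in Windows-1252 as in UCS.
latin1_in_cp1252 = all(
    ch.encode("cp1252")[0] == ord(ch) for ch in latin1_chars
)

# Mac OS Roman is not a superset of Latin-1 in that sense: shared
# characters generally sit at different codes, e.g. A-umlaut (U+00C4).
a_umlaut_mac = "\u00c4".encode("mac_roman")[0]   # 0x80
a_umlaut_win = "\u00c4".encode("cp1252")[0]      # 0xC4
```

This is why Latin-1 and ASCII need no separate `charset` value, whereas Mac OS Roman does.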
The syntax for the `charset` setting is not finalized yet. Instead of a
keyword as a parameter (e.g. `charset windows1252`) I'm also considering
using a string parameter (e.g. `charset "Windows-1252"`), or even
Windows codepage numbers (e.g. `charset 1252` for Windows-1252 and
`charset 10000` for Mac OS Roman). Comments highly welcome.
For scenes using `global_settings { charset utf8 }`, as well as scenes
using only ASCII characters, the above rules should result in unchanged
behaviour (except for the effects of a few bugs in the TrueType handling
code that I intend to eliminate along the way).