From: clipka
Subject: v3.8 character set handling
Date: 3 Jan 2019 14:49:26
Message: <5c2e6746@news.povray.org>
The internal reworking of the parser will make it necessary to overhaul 
the handling of character sets, in a way that will not be entirely 
compatible with earlier versions. Here are my current plans - comments 
welcome:


(1) The current `global_settings { charset FOO }` mechanism (apparently 
introduced with v3.5) will be disabled.

For backward compatibility it will still be accepted, but produce a 
warning. (Not sure yet if we want the warning for pre-v3.5 scenes, 
and/or an outright error for v3.8 and later scenes.)
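
For illustration, a minimal sketch of the kind of legacy fragment this concerns 
(the `#version` value here is just an assumption for the example):

     // Under the proposal, this directive would still parse, but would only
     // emit a deprecation warning and otherwise be ignored.
     #version 3.5;
     global_settings { charset utf8 }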


(2) Internal character handling will use UCS (essentially Unicode minus 
a truckload of semantics rules) throughout.

For example, `chr(X)` will return the character corresponding to UCS 
codepoint X.
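
As a small sketch of how this is intended to behave (using the existing `chr()`, 
`asc()`, `str()` and `concat()` functions):

     // chr() takes a UCS codepoint and returns a one-character string;
     // asc() returns the codepoint of the first character of a string.
     #declare Euro = chr(8364);                        // 8364 = U+20AC, Euro sign
     #debug concat("codepoint: ", str(asc(Euro), 0, 0), "\n")   // prints 8364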

Specifically, for now the internal character set will be UCS-2 (i.e. the 
16-bit subset of UCS, aka the Basic Multilingual Plane; not to be 
confused with UTF-16, which is a 16-bit-based encoding of the full UCS-4 
character set).

Note that the choice of UCS-2 will make the internal handling compatible 
with both Latin-1 and ASCII (but not Windows-1252, which is frequently 
misidentified as Latin-1).

In later versions, the internal character handling may be extended to 
fully support UCS-4.


(3) Character encoding of input files will be auto-detected.

Full support will be provided for UTF-8 (and, by extension, ASCII), 
which will be auto-detected reliably.

Reasonable support will be provided for Windows-1252 (and, by extension, 
Latin-1) except for a few fringe cases that will be misidentified as UTF-8.

For now, it will be recommended that any other encodings (or 
misidentified fringe cases) be converted to UTF-8.

In later versions, a mechanism may be added to specify encoding on a 
per-file basis, to add support for additional encodings or work around 
misidentified fringe cases.


(4) Character encoding of output (created via `#write` or `#debug`) will 
be left undefined for now except for being a superset of ASCII.

In later versions, output may be more rigorously defined as either 
UTF-8, Latin-1 or Windows-1252, or mechanisms may be added to choose 
between these (and maybe other) encodings.
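
As a sketch of where this matters, consider writing a non-ASCII character to a 
file via `#write` (the file name is just a placeholder):

     // The Euro sign ends up in the output file in some ASCII-compatible
     // encoding; exactly which one is deliberately left undefined for now.
     #fopen Out "charset_demo.txt" write
     #write (Out, "price: 10", chr(8364), "\n")
     #fclose Out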


(5) Text primitives will use UCS encoding unless specified otherwise.

If the current syntax is used, the specified text will be interpreted as 
a sequence of UCS codepoints (conformant with (2)), and looked up in one 
of the Unicode CMAP tables in the font file. If the font file has no 
Unicode CMAP tables, the character will first be converted from UCS to 
the character set of an available CMAP table, and then looked up via 
that table. If this also fails, the character will be rendered using the 
replacement glyph (often a square).

To simplify conversion of old scenes, the text primitive syntax will be 
extended with a syntax allowing for more control over the lookup process:

     #declare MyText = "a\u20ACb";
     text { ttf "sym.ttf" cmap { 3,0 charset windows1252 } MyText }

This example will do the following:

- The sequence '\u20AC' in a string literal instructs POV-Ray to place 
the UCS character U+20AC there (this syntax already exists), in this 
case resulting in a string that contains `a` (U+0061) followed by a Euro 
sign (U+20AC) followed by `b` (U+0062).

- Specifying `cmap { ... charset windows1252 }` instructs POV-Ray to 
first convert each character to the Windows-1252 character set, in this 
case giving character codes of 97 (`a`), 128 (the Euro sign) and 98 
(`b`), respectively.

- Specifying `cmap { 3,0 ... }` instructs POV-Ray to use the CMAP table 
with the specified PlatformID (in this case 3, identifying the platform 
as "Microsoft") and PlatformSpecificID (in this case 0, identifiying the 
encoding as "Symbol").

- POV-Ray will take the character codes as per the charset conversion, 
and use those directly as indices into the specified CMAP table to look 
up the glyph (character shape), without any further conversion.

The `charset` setting will be optional; omitting it will instruct 
POV-Ray to not do a character conversion, but rather use the UCS 
character codes directly as indices into the CMAP table.
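
For comparison, a hedged sketch of the plain-UCS case; the font name and the 
choice of the 3,1 (Microsoft / Unicode BMP) table are assumptions for the 
example, not part of the proposal:

     // No `charset`, so the UCS codepoints of the string are used directly
     // as indices into the selected CMAP table.
     text {
        ttf "arial.ttf" cmap { 3,1 } "a\u20ACb"
        0.25, 0    // thickness, offset
     }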

For now, the set of supported charsets will be limited to the following:

- Windows-1252
- Mac OS Roman
- UCS (implicit via omission)

Additional charsets will be supported by virtue of being subsets of the 
above, e.g.:

- Latin-1 (being a subset of both Windows-1252 and UCS)
- ASCII (being a subset of all of the above)

(Note that all of the above refer to the respective character sets, not 
encodings thereof.)

The syntax for the `charset` setting is not finalized yet. Instead of a 
keyword as a parameter (e.g. `charset windows1252`) I'm also considering 
using a string parameter (e.g. `charset "Windows-1252"`), or even 
Windows codepage numbers (e.g. `charset 1252` for Windows-1252 and 
`charset 10000` for Mac OS Roman). Comments highly welcome.


For scenes using `global_settings { charset utf8 }`, as well as scenes 
using only ASCII characters, the above rules should result in unchanged 
behaviour (except for the effects of a few bugs in the TrueType handling 
code that I intend to eliminate along the way).



From: ingo
Subject: Re: v3.8 character set handling
Date: 4 Jan 2019 08:27:59
Message: <XnsA9CD932976F8Dseed7@news.povray.org>
in news:5c2e6746@news.povray.org clipka wrote:

Sorry, can't comment on the whole character sets thing, it's beyond me,

> backward compatibility

but this got my attention. 

I know that backwards compatibility has always been a big deal with POV-
Ray, but wouldn't breaking it and keeping all previous versions around as 
binaries and/or source be an easier way to handle that? Maybe even as a 
single download.

I would not mind having several versions on my system and having the IDE 
automatically switch to the right engine based on the #version directive 
in the scene.

Ingo



From: jr
Subject: Re: v3.8 character set handling
Date: 4 Jan 2019 09:05:01
Message: <web.5c2f677ad2c7bd4748892b50@news.povray.org>
hi,

ingo <ing### [at] tagpovrayorg> wrote:
> in news:5c2e6746@news.povray.org clipka wrote:
> > backward compatibility
> I would not mind having several versions on my system and having the IDE
> automatically switch to the right engine based on the #version directive
> in the scene.

neat. and it could be done via a shell script/batch file, without needing any
modification to the command-line parser; in the script, grep the version number
from the scene or ini file and run the corresponding program.


regards, jr.



From: Alain
Subject: Re: v3.8 character set handling
Date: 4 Jan 2019 13:15:49
Message: <5c2fa2d5$1@news.povray.org>
On 2019-01-03 at 14:49, clipka wrote:
> The internal reworking of the parser will make it necessary to overhaul 
> the handling of character sets, in a way that will not be entirely 
> compatible with earlier versions. Here are my current plans - comments 
> welcome:
> 
> 
> (1) The current `global_settings { charset FOO }` mechanism (apparently 
> introduced with v3.5) will be disabled.
> 
> For backward compatibility it will still be accepted, but produce a 
> warning. (Not sure yet if we want the warning for pre-v3.5 scenes, 
> and/or an outright error for v3.8 and later scenes.)
Maybe only a warning, and then ignore it.

> 

Will it be possible to directly use UTF-8 characters ?
After all, if you can directly enter characters like à é è ô ç (direct 
access) or easily like €(altchar+e) ñ(altchar+ç,n) from your keyboard as 
I just did, you should be able to use them instead of the cumbersome codes.



Alain



From: clipka
Subject: Re: v3.8 character set handling
Date: 4 Jan 2019 20:09:47
Message: <5c3003db$1@news.povray.org>
On 04.01.2019 at 14:27, ingo wrote:

> I know that backwards compatibility has always been a big deal with POV-
> Ray, but wouldn't breaking it and keeping all previous versions around as
> binaries and/or source be an easier way to handle that? Maybe even as a
> single download.
> 
> I would not mind having several versions on my system and having the IDE
> automatically switch to the right engine based on the #version directive
> in the scene.

That would require a strict separation and stable interface between IDE 
and render engine - which we currently don't really have. 
Architecturally, the IDE and render engine are currently a rather 
monolithic thing.

Even if we had such an interface, what you suggest would still require 
at least some baseline maintenance on all the old versions we want to 
support, so that they keep working as runtime and build environments 
change over time. Cases in point: v3.6 had to be modified to play nice 
with Windows Vista; and v3.7.0 had to be modified to play nice with 
modern versions of the boost library, compilers conforming to modern C++ 
language standards, and modern versions of the Unix Automake toolset.

And then there is the issue of how to deal with legacy include files 
used in new scenes. Switching parsers mid-scene is not a viable option.



From: clipka
Subject: Re: v3.8 character set handling
Date: 4 Jan 2019 21:04:41
Message: <5c3010b9@news.povray.org>
On 04.01.2019 at 19:18, Alain wrote:

> Will it be possible to directly use UTF-8 characters ?
> After all, if you can directly enter characters like à é è ô ç (direct 
> access) or easily like €(altchar+e) ñ(altchar+ç,n) from your keyboard as 
> I just did, you should be able to use them instead of the cumbersome codes.

Short answer: The `\uXXXX` notation won't be necessary. I just used it 
to avoid non-ASCII characters in my post.


Looooong answer:


It depends on what you're talking about.

First, let's get an elephant - or should I say mammoth - out of the 
room: The editor component of the Windows GUI. It's old and crappy, and 
doesn't support UTF-8 at all. It does support Windows-1252 though (at 
least on my system; I guess it may depend on what locale you have 
configured in Windows), which has all the characters you mentioned.


Now if you are using a different editor, using verbatim "UTF-8 
characters" should be no problem: Enter the characters, save the file as 
UTF-8, done.

The characters will be encoded directly as UTF-8, and the parser will 
work with them just fine (provided you're only using them in string 
literals or comments); no need for `\uXXXX` notation.


Alternatively, you could enter the same characters in the same editor, 
and save the file as "Windows-1252" (or maybe called "ANSI" or 
"Latin-1"), or enter them in POV-Ray for Windows and just save the file 
without specifying a particular encoding (because you can't).

In that case the characters will be encoded as Windows-1252, and in most 
cases the parser will also work with them just fine (again, string 
literals or comments only); again no need for `\uXXXX` notation.

What the parser will do in such a case is first convert the 
Windows-1252-encoded characters to Unicode, and then proceed in just the 
same way.


For example:

     #declare MyText = "a€b"; // a Euro sign between `a` and `b`

will create a string containing `a` (U+0061) followed by a Euro sign 
(U+20AC) followed by `b` (U+0062), no matter whether the file uses UTF-8 
encoding or Windows-1252 encoding. In both cases, the parser will 
interpret the thing between `a` and `b` as U+20AC, even though in a 
UTF-8 encoded file that thing is represented by the byte sequence hex 
E2,82,AC while in a Windows-1252 encoded file it is represented by the 
single byte hex 80.
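
Once the new handling is in place, a quick way to convince yourself of that 
would be something along these lines (a sketch, assuming the file itself is 
saved as UTF-8 or Windows-1252):

     // Both spellings should produce the identical UCS string, regardless
     // of which of the two encodings the file was saved in.
     #declare A = "a\u20ACb";   // using the escape notation
     #declare B = "a€b";        // literal Euro sign
     #debug concat("strcmp = ", str(strcmp(A, B), 0, 0), "\n")   // 0 = equal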



From: ingo
Subject: Re: v3.8 character set handling
Date: 5 Jan 2019 04:06:53
Message: <XnsA9CE66E56E743seed7@news.povray.org>
in news:5c3003db$1@news.povray.org clipka wrote:

> what you suggest would still require 
> at least some baseline maintenance on all the old versions we want to 
> support

Where does the "break-even point" lie: how much does staying backwards 
compatible cost, versus this other maintenance, with regard to the ability 
to take bigger / different development steps? Now, I know you can't put a 
percentage on that ;) Just me wondering, looking at what happened in the 
Python world with 2 & 3. Yesterday I 'broke' my mesh macros that are also 
in 3.7 by adding a dictionary and by changing the way resolution is set...

Ingo



From: Alain
Subject: Re: v3.8 character set handling
Date: 6 Jan 2019 12:08:35
Message: <5c323613$1@news.povray.org>
On 2019-01-04 at 21:04, clipka wrote:
> On 04.01.2019 at 19:18, Alain wrote:
> 
>> Will it be possible to directly use UTF-8 characters ?
>> After all, if you can directly enter characters like à é è ô ç (direct 
>> access) or easily like €(altchar+e) ñ(altchar+ç,n) from your keyboard 
>> as I just did, you should be able to use them instead of the 
>> cumbersome codes.
> 
> Short answer: The `\uXXXX` notation won't be necessary. I just used it 
> to avoid non-ASCII characters in my post.
> 
> 
> Looooong answer:
> 
> 
> It depends on what you're talking about.
> 
> First, let's get an elephant - or should I say mammoth - out of the 
> room: The editor component of the Windows GUI. It's old and crappy, and 
> doesn't support UTF-8 at all. It does support Windows-1252 though (at 
> least on my system; I guess it may depend on what locale you have 
> configured in Windows), which has all the characters you mentioned.
> 
> 
> Now if you are using a different editor, using verbatim "UTF-8 
> characters" should be no problem: Enter the characters, save the file as 
> UTF-8, done.
> 
> The characters will be encoded directly as UTF-8, and the parser will 
> work with them just fine (provided you're only using them in string 
> literals or comments); no need for `\uXXXX` notation.
> 
> 
> Alternatively, you could enter the same characters in the same editor, 
> and save the file as "Windows-1252" (or maybe called "ANSI" or 
> "Latin-1"), or enter them in POV-Ray for Windows and just save the file 
> without specifying a particular encoding (because you can't).
> 
> In that case the characters will be encoded as Windows-1252, and in most 
> cases the parser will also work with them just fine (again, string 
> literals or comments only); again no need for `\uXXXX` notation.
> 
> What the parser will do in such a case is first convert the 
> Windows-1252-encoded characters to Unicode, and then proceed in just the 
> same way.
> 
> 
> For example:
> 
>      #declare MyText = "a€b"; // a Euro sign between `a` and `b`
> 
> will create a string containing `a` (U+0061) followed by a Euro sign 
> (U+20AC) followed by `b` (U+0062), no matter whether the file uses UTF-8 
> encoding or Windows-1252 encoding. In both cases, the parser will 
> interpret the thing between `a` and `b` as U+20AC, even though in a 
> UTF-8 encoded file that thing is represented by the byte sequence hex 
> E2,82,AC while in a Windows-1252 encoded file it is represented by the 
> single byte hex 80.

Nice.



From: clipka
Subject: Re: v3.8 character set handling
Date: 9 Jan 2019 20:06:40
Message: <5c369aa0$1@news.povray.org>
On 03.01.2019 at 20:49, clipka wrote:

> (5) Text primitives will use UCS encoding unless specified otherwise.
...
> To simplify conversion of old scenes, the text primitive syntax will be 
> extended with a syntax allowing for more control over the lookup process:
> 
>      #declare MyText = "a\u20ACb";
>      text { ttf "sym.ttf" cmap { 3,0 charset windows1252 } MyText }

I think I will change that as follows:

     text { ttf "sym.ttf" cmap { 3,0 charset 1252 } MyText }

with a few select charset numbers defined; most notably:

     1252    Windows code page 1252
             (for obvious reasons)

     10000   Mac OS Roman
             (because Windows supports this as code page 10000)

     61440   MS legacy symbol font (Wingdings etc.) remapping to
             Unicode Private Use Area U+F000..U+F0FF
             (because 61440 = hex F000)
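
Filled out into a complete primitive, the revised syntax would then look 
something like this (thickness and offset values are just illustrative):

     // Numeric charset id: 1252 selects the Windows-1252 character set.
     #declare MyText = "a\u20ACb";
     text {
        ttf "sym.ttf" cmap { 3,0 charset 1252 } MyText
        0.25, 0    // thickness, offset
     }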



From: jr
Subject: Re: v3.8 character set handling
Date: 12 Jan 2019 13:10:00
Message: <web.5c3a2ce8d2c7bd4748892b50@news.povray.org>
hi,

clipka <ano### [at] anonymousorg> wrote:
> To simplify conversion of old scenes, the text primitive syntax will be
> extended with a syntax allowing for more control over the lookup process:
>
>      #declare MyText = "a\u20ACb";
>      text { ttf "sym.ttf" cmap { 3,0 charset windows1252 } MyText }

with alpha.10008988, and the same code as in the other thread, modified to read:

#version 3.8;
.....
text { ttf "arialbd.ttf" cmap { 1,0 charset utf8 S }
.....

I get the following error:

File 'pav-patt.pov' line 61: Parse Warning: Text primitive 'cmap' extension is
 experimental and may be subject to future changes.
File 'pav-patt.pov' line 61: Parse Error: Expected 'numeric expression', utf8
 found instead
Fatal error in parser: Cannot parse input.
Render failed


same for 'ascii'.


regards, jr.


