POV-Ray: Newsgroups: povray.beta-test: v3.8 character set handling

From: clipka
Subject: Re: v3.8 character set handling
Date: 4 Jan 2019 21:04:41
Message: <5c3010b9@news.povray.org>

On 04.01.2019 at 19:18, Alain wrote:

> Will it be possible to directly use UTF-8 characters ?
> After all, if you can directly enter characters like à é è ô ç (direct 
> access) or easily like €(altchar+e) ñ(altchar+ç,n) from your keyboard as 
> I just did, you should be able to use them instead of the cumbersome codes.

Short answer: The `\uXXXX` notation won't be necessary. I just used it 
to avoid non-ASCII characters in my post.


Looooong answer:


It depends on what you're talking about.

First, let's get an elephant - or should I say mammoth - out of the 
room: The editor component of the Windows GUI. It's old and crappy, and 
doesn't support UTF-8 at all. It does support Windows-1252 though (at 
least on my system; I guess it may depend on what locale you have 
configured in Windows), which has all the characters you mentioned.


Now if you are using a different editor, using verbatim "UTF-8 
characters" should be no problem: Enter the characters, save the file as 
UTF-8, done.

The characters will be encoded directly as UTF-8, and the parser will 
work with them just fine (provided you're only using them in string 
literals or comments); no need for `\uXXXX` notation.


Alternatively, you could enter the same characters in the same editor, 
and save the file as "Windows-1252" (or maybe called "ANSI" or 
"Latin-1"), or enter them in POV-Ray for Windows and just save the file 
without specifying a particular encoding (because you can't).

In that case the characters will be encoded as Windows-1252, and in most 
cases the parser will also work with them just fine (again, string 
literals or comments only); again no need for `\uXXXX` notation.

What the parser will do in such a case is first convert the 
Windows-1252-encoded characters to Unicode, and then proceed in just the 
same way.


For example:

     #declare MyText = "a€b"; // a Euro sign between `a` and `b`

will create a string containing `a` (U+0061) followed by a Euro sign 
(U+20AC) followed by `b` (U+0062), no matter whether the file uses UTF-8 
encoding or Windows-1252 encoding. In both cases, the parser will 
interpret the thing between `a` and `b` as U+20AC, even though in a 
UTF-8 encoded file that thing is represented by the byte sequence hex 
E2,82,AC while in a Windows-1252 encoded file it is represented by the 
single byte hex 80.
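
The byte-level claim above can be checked outside POV-Ray. The following is a minimal Python sketch, not POV-Ray code: it only assumes Python's standard `utf-8` and `cp1252` codecs, and the helper name `code_points` is made up for illustration. It encodes the example string both ways and then decodes each byte sequence back to code points.

```python
# The example string from the post, written with an escape to keep this file ASCII.
text = "a\u20acb"

# The bytes as they would be stored on disk under each encoding.
utf8_bytes = text.encode("utf-8")
cp1252_bytes = text.encode("cp1252")

print(utf8_bytes.hex(" "))    # Euro sign takes three bytes here
print(cp1252_bytes.hex(" "))  # Euro sign takes one byte here

def code_points(raw, encoding):
    """Hypothetical helper: decode raw bytes and list the resulting code points."""
    return [f"U+{ord(ch):04X}" for ch in raw.decode(encoding)]

# Both files decode to the same three code points.
print(code_points(utf8_bytes, "utf-8"))
print(code_points(cp1252_bytes, "cp1252"))
```

Whatever the parser does internally, the point is the same: once decoded, the two files are indistinguishable.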

From: ingo
Subject: Re: v3.8 character set handling
Date: 5 Jan 2019 04:06:53
Message: <XnsA9CE66E56E743seed7@news.povray.org>

in news:5c3003db$1@news.povray.org clipka wrote:

> what you suggest would still require 
> at least some baseline maintenance on all the old versions we want to 
> support

Where lies the "break even point": how much does being backwards 
compatible cost versus this other maintenance, with regard to the ability 
/ possibility to take bigger / different development steps? Now, I know 
you can't put a percentage on that ;) Just me wondering, looking at what 
happened in the Python world with 2 & 3. Yesterday I 'broke' my mesh 
macros that are also in 3.7 by adding a dictionary and by changing the 
way resolution is set...

Ingo

From: Alain
Subject: Re: v3.8 character set handling
Date: 6 Jan 2019 12:08:35
Message: <5c323613$1@news.povray.org>

On 19-01-04 at 21:04, clipka wrote:
> On 04.01.2019 at 19:18, Alain wrote:
> 
>> Will it be possible to directly use UTF-8 characters ?
>> After all, if you can directly enter characters like à é è ô ç (direct 
>> access) or easily like €(altchar+e) ñ(altchar+ç,n) from your keyboard 
>> as I just did, you should be able to use them instead of the 
>> cumbersome codes.
> 
> Short answer: The `\uXXXX` notation won't be necessary. I just used it 
> to avoid non-ASCII characters in my post.
> 
> 
> Looooong answer:
> 
> 
> It depends on what you're talking about.
> 
> First, let's get an elephant - or should I say mammoth - out of the 
> room: The editor component of the Windows GUI. It's old and crappy, and 
> doesn't support UTF-8 at all. It does support Windows-1252 though (at 
> least on my system; I guess it may depend on what locale you have 
> configured in Windows), which has all the characters you mentioned.
> 
> 
> Now if you are using a different editor, using verbatim "UTF-8 
> characters" should be no problem: Enter the characters, save the file as 
> UTF-8, done.
> 
> The characters will be encoded directly as UTF-8, and the parser will 
> work with them just fine (provided you're only using them in string 
> literals or comments); no need for `\uXXXX` notation.
> 
> 
> Alternatively, you could enter the same characters in the same editor, 
> and save the file as "Windows-1252" (or maybe called "ANSI" or 
> "Latin-1"), or enter them in POV-Ray for Windows and just save the file 
> without specifying a particular encoding (because you can't).
> 
> In that case the characters will be encoded as Windows-1252, and in most 
> cases the parser will also work with them just fine (again, string 
> literals or comments only); again no need for `\uXXXX` notation.
> 
> What the parser will do in such a case is first convert the 
> Windows-1252-encoded characters to Unicode, and then proceed in just the 
> same way.
> 
> 
> For example:
> 
>      #declare MyText = "a€b"; // a Euro sign between `a` and `b`
> 
> will create a string containing `a` (U+0061) followed by a Euro sign 
> (U+20AC) followed by `b` (U+0062), no matter whether the file uses UTF-8 
> encoding or Windows-1252 encoding. In both cases, the parser will 
> interpret the thing between `a` and `b` as U+20AC, even though in a 
> UTF-8 encoded file that thing is represented by the byte sequence hex 
> E2,82,AC while in a Windows-1252 encoded file it is represented by the 
> single byte hex 80.

Nice.

From: clipka
Subject: Re: v3.8 character set handling
Date: 9 Jan 2019 20:06:40
Message: <5c369aa0$1@news.povray.org>

On 03.01.2019 at 20:49, clipka wrote:

> (5) Text primitives will use UCS encoding unless specified otherwise.
...
> To simplify conversion of old scenes, the text primitive syntax will be 
> extended with a syntax allowing for more control over the lookup process:
> 
>      #declare MyText = "a\u20ACb";
>      text { ttf "sym.ttf" cmap { 3,0 charset windows1252 } MyText }

I think I will change that as follows:

     text { ttf "sym.ttf" cmap { 3,0 charset 1252 } MyText }

with a few select charset numbers defined; most notably:

     1252    Windows code page 1252
             (for obvious reasons)

     10000   Mac OS Roman
             (because Windows supports this as code page 10000)

     61440   MS legacy symbol font (Wingdings etc.) remapping to
             Unicode Private Use Area U+F000..U+F0FF
             (because 61440 = hex F000)
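
To illustrate the arithmetic behind the 61440 value, here is a small Python sketch. It is not taken from the POV-Ray sources: the helper name `symbol_code_point` is hypothetical, and the sketch assumes the remapping simply adds the 8-bit character code to the Private Use Area base.

```python
PUA_BASE = 0xF000  # start of the range U+F000..U+F0FF mentioned above

def symbol_code_point(byte_value):
    """Hypothetical helper: map an 8-bit symbol-font code into U+F000..U+F0FF."""
    if not 0 <= byte_value <= 0xFF:
        raise ValueError("expected an 8-bit character code")
    return PUA_BASE + byte_value

print(PUA_BASE)                             # decimal form of hex F000
print(f"U+{symbol_code_point(0x41):04X}")   # where the code for 'A' would land
print(f"U+{symbol_code_point(0xFF):04X}")   # top of the remapped range
```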

From: jr
Subject: Re: v3.8 character set handling
Date: 12 Jan 2019 13:10:00
Message: <web.5c3a2ce8d2c7bd4748892b50@news.povray.org>

hi,

clipka wrote:
> To simplify conversion of old scenes, the text primitive syntax will be
> extended with a syntax allowing for more control over the lookup process:
>
>      #declare MyText = "a\u20ACb";
>      text { ttf "sym.ttf" cmap { 3,0 charset windows1252 } MyText }

with alpha.10008988, and same code as in other thread modified to read:

#version 3.8;
.....
text { ttf "arialbd.ttf" cmap { 1,0 charset utf8 S }
.....

I get the following error:

File 'pav-patt.pov' line 61: Parse Warning: Text primitive 'cmap' extension is
 experimental and may be subject to future changes.
File 'pav-patt.pov' line 61: Parse Error: Expected 'numeric expression', utf8
 found instead
Fatal error in parser: Cannot parse input.
Render failed

same for 'ascii'.

regards, jr.

From: clipka
Subject: Re: v3.8 character set handling
Date: 12 Jan 2019 20:54:02
Message: <5c3a9a3a@news.povray.org>

On 12.01.2019 at 19:09, jr wrote:

>> To simplify conversion of old scenes, the text primitive syntax will be
>> extended with a syntax allowing for more control over the lookup process:
>>
>>       #declare MyText = "a\u20ACb";
>>       text { ttf "sym.ttf" cmap { 3,0 charset windows1252 } MyText }
> 
> with alpha.10008988, and same code as in other thread modified to read:
> 
> #version 3.8;
> ......
> text { ttf "arialbd.ttf" cmap { 1,0 charset utf8 S }
> ......
> 
> I get the following error:
> 
> File 'pav-patt.pov' line 61: Parse Warning: Text primitive 'cmap' extension is
>   experimental and may be subject to future changes.
> File 'pav-patt.pov' line 61: Parse Error: Expected 'numeric expression', utf8
>   found instead
> Fatal error in parser: Cannot parse input.
> Render failed
> 
> 
> same for 'ascii'.

Yes, change of plan, sorry. Specify `charset FLOAT` here, with `FLOAT` 
being one of the following values:

     0       No remapping (effectively UCS4)
     1200    UCS2 character set (16-bit subset of UCS, aka BMP)
     1251    Windows-1251 character set (aka "ANSI Cyrillic")
     1252    Windows-1252 character set (aka "ANSI Latin")
     10000   Mac OS Roman
     12000   UCS4 character set
     28591   ISO-8859-1 character set (aka Latin-1)
     -1      Special remapping for legacy Microsoft symbol fonts

Note that these are character sets (collections of characters with an 
associated mapping to integral values, aka code points), _not_ character 
encoding schemes (character set with an associated scheme for storing 
character sequences as byte streams). So with UTF-8 being an encoding 
scheme, there's no dedicated value for it - use the value for UCS4 
instead, which is the character set used in UTF-8.

There is no specific value for ASCII, but any of the above values except 
-1 will do, as they're all supersets of ASCII.

We could also probably do without values 1200 (UCS2 being a subset of 
UCS4) and 28591 (ISO-8859-1 being a subset of both UCS2 and 
Windows-1252), but I happen to have implemented them anyway.
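
The superset and subset relations above can be spot-checked with Python's standard codecs. This is only a sketch under that assumption: the codec names `cp1251`, `cp1252`, `mac_roman` and `latin-1` are Python's, not POV-Ray's, and the comparison with Windows-1252 is limited to the printable upper range 0xA0..0xFF, since bytes 0x80..0x9F are where the two tables differ (ISO-8859-1 has control codes there).

```python
ASCII_RANGE = bytes(range(0x00, 0x80))
LATIN1_PRINTABLE = bytes(range(0xA0, 0x100))

def same_ascii(codec):
    """True if the codec decodes bytes 0x00..0x7F exactly like ASCII."""
    return ASCII_RANGE.decode(codec) == ASCII_RANGE.decode("ascii")

# Each 8-bit character set from the list should agree with ASCII in its lower half.
ascii_compatible = {
    codec: same_ascii(codec)
    for codec in ("cp1251", "cp1252", "mac_roman", "latin-1")
}
print(ascii_compatible)

# ISO-8859-1 and Windows-1252 agree on the printable upper range.
latin1_in_cp1252 = (
    LATIN1_PRINTABLE.decode("latin-1") == LATIN1_PRINTABLE.decode("cp1252")
)
print(latin1_in_cp1252)

# Every ISO-8859-1 byte maps to the code point of the same value, so it fits in UCS2.
latin1_is_identity = all(
    ord(bytes([b]).decode("latin-1")) == b for b in range(0x100)
)
print(latin1_is_identity)
```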


I concede that the numeric values aren't easy to memorize, but this 
could be solved by supplying an include file that defines some common 
macros for the entire CMAP block, and/or variables (or maybe even a 
dictionary with string keys) for the charset numeric values.
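
As a rough idea of what such a lookup could look like, here is a Python sketch of a name-to-number table. It is purely illustrative and not a proposal for the actual include file: all key names, the `ALIASES` table and the `charset_number` helper are hypothetical. Only the numeric values come from the list above, along with the rules that UTF-8 should use the UCS4 value and that ASCII can use any value except -1.

```python
# Numeric values as listed in this post; the key names are made up.
CHARSETS = {
    "none": 0,
    "ucs2": 1200,
    "windows1251": 1251,
    "windows1252": 1252,
    "macroman": 10000,
    "ucs4": 12000,
    "latin1": 28591,
    "mssymbol": -1,
}

# UTF-8 is an encoding scheme for UCS4; ASCII is covered by any value except -1.
ALIASES = {"utf8": "ucs4", "ascii": "ucs4"}

def charset_number(name):
    """Hypothetical helper: resolve a charset name to its numeric value."""
    key = name.lower().replace("-", "").replace("_", "")
    key = ALIASES.get(key, key)
    return CHARSETS[key]

print(charset_number("UTF-8"))
print(charset_number("windows-1252"))
print(charset_number("ascii"))
```

An unknown name raises `KeyError` here, which roughly mirrors the parser rejecting an unsupported value.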


Also, as the first warning message already mentions, stay tuned for 
future changes to this feature. I'm still not happy with it - ideas for 
improvement continue to be highly welcome - and integration of the 
FreeType library may also necessitate modifications.

From: jr
Subject: Re: v3.8 character set handling
Date: 13 Jan 2019 04:50:00
Message: <web.5c3b093dd2c7bd4748892b50@news.povray.org>

hi,

clipka wrote:
> On 12.01.2019 at 19:09, jr wrote:
>
> >> To simplify conversion of old scenes, the text primitive syntax will be
> >> extended with a syntax allowing for more control over the lookup process:
> >>
> >>       #declare MyText = "a\u20ACb";
> >>       text { ttf "sym.ttf" cmap { 3,0 charset windows1252 } MyText }
> >
> > with alpha.10008988, and same code as in other thread modified to read:
> >
> > #version 3.8;
> > ......
> > text { ttf "arialbd.ttf" cmap { 1,0 charset utf8 S }
> > ......
> >
> > I get the following error:
> >
> > File 'pav-patt.pov' line 61: Parse Warning: Text primitive 'cmap' extension is
> >   experimental and may be subject to future changes.
> > File 'pav-patt.pov' line 61: Parse Error: Expected 'numeric expression', utf8
> >   found instead
> > Fatal error in parser: Cannot parse input.
> > Render failed
> >
> >
> > same for 'ascii'.
>
> Yes, change of plan, sorry. Specify `charset FLOAT` here, with `FLOAT`
> being one of the following values:
>
>      0       No remapping (effectively UCS4)
>      1200    UCS2 character set (16-bit subset of UCS, aka BMP)
>      1251    Windows-1251 character set (aka "ANSI Cyrillic")
>      1252    Windows-1252 character set (aka "ANSI Latin")
>      10000   Mac OS Roman
>      12000   UCS4 character set
>      28591   ISO-8859-1 character set (aka Latin-1)
>      -1      Special remapping for legacy Microsoft symbol fonts
>
> Note that these are character sets (collections of characters with an
> associated mapping to integral values, aka code points), _not_ character
> encoding schemes (character set with an associated scheme for storing
> character sequences as byte streams). So with UTF-8 being an encoding
> scheme, there's no dedicated value for it - use the value for UCS4
> instead, which is the character set used in UTF-8.
>
> There is no specific value for ASCII, but any of the above values except
> -1 will do, as they're all supersets of ASCII.
>
> We could also probably do without values 1200 (UCS2 being a subset of
> UCS4) and 28591 (ISO-8859-1 being a subset of both UCS2 and
> Windows-1252), but I happen to have implemented them anyway.
>
>
> I concede that the numeric values aren't easy to memorize, but this
> could be solved by supplying an include file that defines some common
> macros for the entire CMAP block, and/or variables (or maybe even a
> dictionary with string keys) for the charset numeric values.
>
>
> Also, as the first warning message already mentions, stay tuned for
> future changes to this feature. I'm still not happy with it - ideas for
> improvement continue to be highly welcome - and integration of the
> FreeType library may also necessitate modifications.

I think a dictionary (provided in 'charsets.inc'?) with keys like 'utf8' and
'ascii' etc. sounds ok.


regards, jr.

From: jr
Subject: Re: v3.8 character set handling
Date: 13 Jan 2019 07:20:00
Message: <web.5c3b2c0bd2c7bd4748892b50@news.povray.org>

hi,

clipka wrote:
> >>       text { ttf "sym.ttf" cmap { 3,0 charset windows1252 } MyText }
>
>      0       No remapping (effectively UCS4)
>      1200    UCS2 character set (16-bit subset of UCS, aka BMP)
>      1251    Windows-1251 character set (aka "ANSI Cyrillic")
>      1252    Windows-1252 character set (aka "ANSI Latin")
>      10000   Mac OS Roman
>      12000   UCS4 character set
>      28591   ISO-8859-1 character set (aka Latin-1)
>      -1      Special remapping for legacy Microsoft symbol fonts
>

can you confirm that I'm using the correct syntax?  because the new alpha gives
me the same error.

Script started on Sun 13 Jan 2019 12:05:40 GMT
jr@crow:1:pave$ c### [at] pav-pattpov
// Hintergrund
#version 3.8;
global_settings {assumed_gamma 1}
  ...
    text { ttf "arialbd.ttf" cmap { 1,0 charset 0 } S }
  ...

jr@crow:2:pave$ pov38 +a0.1 +ipa### [at] tpov
Persistence of Vision(tm) Ray Tracer Version 3.8.0-alpha.10011104.unofficial
 (g++ -std=gnu++11 4.8.2 @ x86_64-slackware-linux-gnu)
  ...
==== [Parsing...] ==========================================================
File 'pav-patt.pov' line 61: Parse Warning: Text primitive 'cmap' extension is
 experimental and may be subject to future changes.
File 'pav-patt.pov' line 61: Parse Error: Expected 'numeric expression', } found
 instead
Fatal error in parser: Cannot parse input.
Render failed

regards, jr.

From: clipka
Subject: Re: v3.8 character set handling
Date: 13 Jan 2019 08:35:18
Message: <5c3b3e96@news.povray.org>

On 13.01.2019 at 13:16, jr wrote:

> can you confirm that I'm using the correct syntax?  because the new alpha gives
> me the same error.

To be precise, it gives you the same error /message/.

It's not my usual style, but for the sake of maximum user experience 
I'll say no more, except that no nits were picked in the making of this 
post ;)

(Took me a while, too.)

From: jr
Subject: Re: v3.8 character set handling
Date: 13 Jan 2019 09:00:00
Message: <web.5c3b4410d2c7bd4748892b50@news.povray.org>

hi,

clipka wrote:
> On 13.01.2019 at 13:16, jr wrote:
> > can you confirm that I'm using the correct syntax?  because the new alpha gives
> > me the same error.
>
> To be precise, it gives you the same error /message/.

syntax correct, then.  on to the next alpha..  :-)

> It's not my usual style, but for the sake of maximum user experience
> I'll say no more, except that no nits were picked in the making of this
> post ;)
>
> (Took me a while, too.)

(I blame the binge-watching.  :-))


regards, jr.
