POV-Ray : Newsgroups : povray.beta-test : POV-Ray v3.7 charset behaviour (Messages 1 to 7 of 7)
From: clipka
Subject: POV-Ray v3.7 charset behaviour
Date: 5 Jun 2018 21:28:11
Message: <5b17006b$1@news.povray.org>
POV-Ray v3.7 (for Windows) behaves as follows with respect to different
#version/charset settings (as tested with German locale):


Common
======

- "\uXXXX" escape sequences are technically always interpreted as UCS-2
character codes. Note however that depending on the context in which a
string is used, the /effective/ interpretation may vary.

- `asc()` and `chr()` functions technically always operate according to
UCS-2 character encoding. Note however that depending on the context in
which a string is used, the /effective/ encoding may vary.


3.0 / ascii
===========

- Non-ASCII octets in strings are technically decoded according to ISO
8859-1 (Latin-1; matching the display in the editor for many characters,
except for codes hex 80-9F, though I presume the editor display may vary
with the system's locale, while the Latin-1 decoding is invariant). Note
however that depending on the context in which a string is used, the
/effective/ decoding may vary.

- When used in file names or debug output, strings are effectively
subject to re-interpretation of the UCS-2 character codes as
Windows-1252 codes (matching the display in the editor; presumably this
varies with the system's locale), with character codes above hex FF
interpreted modulo 256.

- When used in text primitives, non-ASCII characters in strings are
typically garbled, depending on the font used; for instance, with
Microsoft's Arial font the text appears to be subject to
re-interpretation of the UCS-2 character codes as Macintosh Roman codes,
while with POV-Ray's `cyrvetic.ttf` the re-interpretation seems to be as
Windows-1251 (Cyrillic) codes. Character codes above hex FF are treated
according to obscure rules in all cases. (In some cases,
re-interpretation may happen to match the display in the editor.)
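
As a rough illustration of the 3.0 / ascii observations above, the following Python sketch decodes one octet as Latin-1 and then re-interprets the resulting character code (modulo 256) under several legacy code pages. The helper name `reinterpret` is made up for this example, and Python's codecs merely stand in for whatever POV-Ray, Windows, or the font tables actually do; it models the observed effect and is not POV-Ray's code.

```python
def reinterpret(text: str, codepage: str) -> str:
    """Treat each character code (modulo 256) as an octet in `codepage`."""
    octets = bytes(ord(ch) % 256 for ch in text)
    # Some code pages leave a few octets undefined; show those as U+FFFD.
    return octets.decode(codepage, errors="replace")


# Octet hex E4 in the scene file, decoded as Latin-1, yields U+00E4.
decoded = b"\xe4".decode("latin-1")

as_file_name = reinterpret(decoded, "cp1252")    # file names / debug output
as_arial = reinterpret(decoded, "mac_roman")     # text primitive, Arial
as_cyrvetic = reinterpret(decoded, "cp1251")     # text primitive, cyrvetic.ttf

print(as_file_name, as_arial, as_cyrvetic)
```

The same input character thus ends up as three different glyphs depending on where the string is used, which is the "effective decoding may vary" caveat in practice.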


3.0 / utf8
==========

- Non-ASCII octets in strings are technically decoded according to
UTF-8. Note however that depending on the context in which a string is
used, non-ASCII characters may effectively become garbled.

- When used in file names or debug output, strings are effectively
subject to re-interpretation of the UCS-2 codes as Windows-1252 codes
(presumably this varies with the system's locale), with codes above hex
FF interpreted modulo 256.

- When used in text primitives, non-ASCII characters in strings may or
may not be garbled, depending on the font used; for instance, with
Microsoft's Arial font the text is displayed as expected for UCS-2
encoded text, while with POV-Ray's `cyrvetic.ttf` it appears to be
subject to re-interpretation of the UCS-2 codes as Macintosh Cyrillic
codes with codes above hex FF being treated according to obscure rules.
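
To make the "modulo 256" effect for file names and debug output more concrete, here is a small Python sketch of the 3.0 / utf8 case. The function name `to_file_name_text` is hypothetical, and Windows-1252 is assumed because that is what was observed with a German locale; other locales would presumably call for a different code page.

```python
def to_file_name_text(source_octets: bytes) -> str:
    """Model of the observed path: UTF-8 decode, then codes mod 256 as cp1252."""
    text = source_octets.decode("utf-8")
    # Only the low 8 bits of each character code survive.
    octets = bytes(ord(ch) % 256 for ch in text)
    return octets.decode("cp1252", errors="replace")


euro = "\u20ac".encode("utf-8")      # EURO SIGN, U+20AC
a_umlaut = "\u00e4".encode("utf-8")  # U+00E4, below hex 100

print(to_file_name_text(euro))      # U+20AC -> hex AC -> NOT SIGN
print(to_file_name_text(a_umlaut))  # unchanged, since the code is below hex 100
```

Characters up to hex FF mostly survive this path (codes hex 80-9F aside), while anything above it is folded onto an unrelated character.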


3.7 / ascii
===========

- Non-ASCII octets in strings are decoded as blanks (ASCII hex 20;
non-ASCII characters can still be inserted via `\uXXXX` escape sequences
or the `chr()` function though).

- When used in file names or debug output, non-ASCII characters in
strings (entered via "\uXXXX" or `chr()`) are substituted with blanks.

- When used in text primitives, non-ASCII characters in strings (entered
via "\uXXXX" or `chr()`) are typically garbled, depending on the font
used; for instance, with Microsoft's Arial font the text appears to be
subject to re-interpretation of the codes as Macintosh Roman codes,
while with POV-Ray's `cyrvetic.ttf` the re-interpretation seems to be as
Windows-1251 (Cyrillic) codes. Character codes above hex FF are treated
according to obscure rules in all cases.
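
The blank substitution in 3.7 / ascii can be modelled in a few lines of Python. Both function names are invented for this sketch, and the per-octet replacement simply follows the description above ("non-ASCII octets are decoded as blanks"); it is not taken from the POV-Ray source.

```python
def decode_ascii_37(source_octets: bytes) -> str:
    """Decode a string literal: every non-ASCII octet becomes a blank."""
    return "".join(chr(b) if b < 0x80 else " " for b in source_octets)


def for_file_name_or_debug(text: str) -> str:
    """File names / debug output: every non-ASCII character becomes a blank."""
    return "".join(ch if ord(ch) < 0x80 else " " for ch in text)


# Latin-1 encoded source: two non-ASCII octets, hence two blanks.
print(repr(decode_ascii_37(b"Gr\xfc\xdfe")))

# A character inserted via chr() or an escape is kept in the string itself,
# but is blanked when the string is used as a file name or in debug output.
print(repr(for_file_name_or_debug("caf\u00e9")))
```

Note that under this model a UTF-8 encoded source file would produce one blank per octet, i.e. two or more blanks per non-ASCII character.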


3.7 / utf8
==========

- Non-ASCII octets in strings are decoded according to UTF-8.

- When used in file names or debug output, non-ASCII characters in
strings are substituted with blanks.

- When used in text primitives, non-ASCII characters in strings may or
may not be garbled, depending on the font used; for instance, with
Microsoft's Arial font the text is displayed as expected for UCS-2
encoded text, while with POV-Ray's `cyrvetic.ttf` it appears to be
subject to re-interpretation of the UCS-2 codes as Windows-1251
(Cyrillic) codes, again with codes above hex FF being treated according
to obscure rules.



From: Thorsten Froehlich
Subject: Re: POV-Ray v3.7 charset behaviour
Date: 6 Jun 2018 17:15:01
Message: <web.5b18164a4a827126535efa580@news.povray.org>
Note that the TrueType font decoding used to be rather buggy before 3.6 (or was
it 3.5?), which is why e.g. in 3.0 it matched the MacRoman font tables before
ever getting to the other tables.




From: Thorsten Froehlich
Subject: Re: POV-Ray v3.7 charset behaviour
Date: 6 Jun 2018 17:30:07
Message: <web.5b1819544a827126535efa580@news.povray.org>
clipka <ano### [at] anonymousorg> wrote:
> 3.7 / ascii
> ===========
>
> - Non-ASCII octets in strings are decoded as blanks (ASCII hex 20;
> non-ASCII characters can still be inserted via `\uXXXX` escape sequences
> or the `chr()` function though).
>
> - When used in file names or debug output, non-ASCII characters in
> strings (entered via "\uXXXX" or `chr()`) are substituted with blanks.
>
> - When used in text primitives, non-ASCII characters in strings (entered
> via "\uXXXX" or `chr()`) are typically garbled, depending on the font
> used; for instance, with Microsoft's Arial font the text appears to be
> subject to re-interpretation of the codes as Macintosh Roman codes,
> while with POV-Ray's `cyrvetic.ttf` the re-interpretation seems to be as
> Windows-1251 (Cyrillic) codes. Character codes above hex FF are treated
> according to obscure rules in all cases.
>
>
> 3.7 / utf8
> ==========
>
> - Non-ASCII octets in strings are decoded according to UTF-8.
>
> - When used in file names or debug output, non-ASCII characters in
> strings are substituted with blanks.
>
> - When used in text primitives, non-ASCII characters in strings may or
> may not be garbled, depending on the font used; for instance, with
> Microsoft's Arial font the text is displayed as expected for UCS-2
> encoded text, while with POV-Ray's `cyrvetic.ttf` it appears to be
> subject to re-interpretation of the UCS-2 codes as Windows-1251
> (Cyrillic) codes, again with codes above hex FF being treated according
> to obscure rules.

I won't dispute your results because they don't surprise me, but I dispute the
fact that the code behaves unpredictably: The code is in OpenFontFile, and it
does pick the Unicode tables as top priority if Unicode is specified as text
format. It only falls back to MacRoman (which nearly all TTFs contain for
legacy reasons) if it cannot find any of the Unicode formats that it does
support. So the real question is if maybe in the 18 years since this code was
introduced in 3.5 based on TTFs shipped with Mac OS 9 and Windows 2000, some
other TTF Unicode formats were introduced that the code does not support...
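
The fallback described above can be sketched roughly as follows in Python. This is a toy model and not the actual OpenFontFile code: the function name `choose_subtable`, the platform priority order, and the set of supported cmap formats (0 and 4) are all assumptions made for illustration. The platform IDs themselves (0 = Unicode, 1 = Macintosh, 3 = Microsoft) and the format numbers come from the TrueType specification.

```python
from typing import Optional, Sequence, Tuple

# A cmap subtable reduced to (platformID, format).
Subtable = Tuple[int, int]

# Assumption for this sketch: only the older cmap formats are understood.
SUPPORTED_FORMATS = {0, 4}

# Assumed priority for Unicode text: Microsoft, Unicode, then Macintosh.
PLATFORM_PRIORITY = (3, 0, 1)


def choose_subtable(subtables: Sequence[Subtable]) -> Optional[Subtable]:
    """Pick the first usable subtable in priority order, or None."""
    for platform in PLATFORM_PRIORITY:
        for subtable in subtables:
            if subtable[0] == platform and subtable[1] in SUPPORTED_FORMATS:
                return subtable
    return None


# Font with a classic format 4 Unicode table: the Unicode table wins.
print(choose_subtable([(1, 0), (3, 4)]))

# Font whose only Unicode table uses the newer format 12: the loop skips it
# and ends up on the legacy Macintosh table.
print(choose_subtable([(3, 12), (1, 0)]))
```

Under these assumptions, a font that ships only newer Unicode table formats would silently end up on its Macintosh table, which would look like garbling from the scene author's point of view.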



From: Thorsten Froehlich
Subject: Re: POV-Ray v3.7 charset behaviour
Date: 6 Jun 2018 17:35:02
Message: <web.5b181ab04a827126535efa580@news.povray.org>
"Thorsten Froehlich" <nomail@nomail> wrote:
> I won't dispute your results because they don't surprise me, but I dispute the
> fact that the code behaves unpredictably: The code is in OpenFontFile, and it
> does pick the Unicode tables as top priority if Unicode is specified as text
> format. It only falls back to MacRoman (which nearly all TTFs contain for
> legacy reasons) if it cannot find any of the Unicode formats that it does
> support. So the real question is if maybe in the 18 years since this code was
> introduced in 3.5 based on TTFs shipped with Mac OS 9 and Windows 2000, some
> other TTF Unicode formats were introduced that the code does not support...

To answer my own questions, the formats did change...
https://developer.apple.com/fonts/TrueType-Reference-Manual/RM06/Chap6cmap.html

So what you see is the code falling back to an ASCII (MacRoman) code table as a
fail-safe because it cannot use the newer Unicode tables most current fonts seem
to contain.



From: clipka
Subject: Re: POV-Ray v3.7 charset behaviour
Date: 6 Jun 2018 22:04:38
Message: <5b185a76$1@news.povray.org>
Am 06.06.2018 um 19:32 schrieb Thorsten Froehlich:
> "Thorsten Froehlich" <nomail@nomail> wrote:
>> I won't dispute your results because they don't surprise me, but I dispute the
>> fact that the code behaves unpredictably:

I never used the word "unpredictably" - I used the phrase "according to
obscure rules". As in, "pretty difficult to guess without looking at the
actual code and maybe the font file".

>> The code is in OpenFontFile, and it
>> does pick the Unicode tables as top priority if Unicode is specified as text
>> format. It only falls back to MacRoman (which nearly all TTFs contain for
>> legacy reasons) if it cannot find any of the Unicode formats that it does
>> support. So the real question is if maybe in the 18 years since this code was
>> introduced in 3.5 based on TTFs shipped with Mac OS 9 and Windows 2000, some
>> other TTF Unicode formats were introduced that the code does not support...
> 
> To answer my own questions, the formats did change...
> https://developer.apple.com/fonts/TrueType-Reference-Manual/RM06/Chap6cmap.html
> 
> So what you see is the code falling back to an ASCII (MacRoman) code table as a
> fail-safe because it cannot use the newer Unicode tables most current fonts seem
> to contain.

Um... I do /not/ really think so... you know, the font that I found to
be ok (with `charset utf8`) is the Microsoft Arial font on my Windows 10
machine, while the other is POV-Ray's cyrvetic.ttf. So I have a hunch
that the ok font /may/ be the newer one.

So I'd guess ye olde cyrvetic.ttf never had a Unicode table to begin with.

Also, I don't see a fallback to Mac Roman in the cyrvetic font, but
rather a fallback to Mac Cyrillic.


(As it turns out, the POV-Ray TrueType font code doesn't give a rat's arse about the
platformSpecificID when looking for a code table it can use; platformID
is all it tests for.)
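
For anyone who wants to check which code tables a given font file actually contains, here is a minimal Python sketch that reads the encoding records of the `cmap` table directly from the file's bytes. The layout (table directory, then `cmap` header and encoding records) follows the TrueType specification; the function names are my own, and the sketch handles plain .ttf files only, not font collections. The second function mimics the "platformID only" matching described above and is a model, not POV-Ray's code.

```python
import struct
from typing import List, Optional, Tuple

# (platformID, platformSpecificID, subtable format)
Record = Tuple[int, int, int]


def list_cmap_records(font: bytes) -> List[Record]:
    """Return the encoding records of the 'cmap' table of a TrueType font."""
    num_tables = struct.unpack_from(">H", font, 4)[0]
    cmap_offset = None
    for i in range(num_tables):
        # Table directory entry: tag, checksum, offset, length.
        tag, _checksum, offset, _length = struct.unpack_from(
            ">4sIII", font, 12 + 16 * i)
        if tag == b"cmap":
            cmap_offset = offset
            break
    if cmap_offset is None:
        raise ValueError("font has no cmap table")
    _version, count = struct.unpack_from(">HH", font, cmap_offset)
    records = []
    for i in range(count):
        platform_id, specific_id, sub_offset = struct.unpack_from(
            ">HHI", font, cmap_offset + 4 + 8 * i)
        # Every subtable starts with its 16-bit format number.
        (fmt,) = struct.unpack_from(">H", font, cmap_offset + sub_offset)
        records.append((platform_id, specific_id, fmt))
    return records


def first_for_platform(records: List[Record], platform_id: int) -> Optional[Record]:
    """Match on platformID only, ignoring platformSpecificID."""
    for record in records:
        if record[0] == platform_id:
            return record
    return None
```

With a font's bytes in hand (e.g. read from `cyrvetic.ttf`), `list_cmap_records` shows at a glance whether there is any Unicode or Microsoft table. On the Macintosh platform (1), platformSpecificID 0 is Roman and 7 is Cyrillic, so `first_for_platform(records, 1)` will happily return a Cyrillic table to code that expects Mac Roman.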



From: clipka
Subject: Re: POV-Ray v3.7 charset behaviour
Date: 6 Jun 2018 22:18:05
Message: <5b185d9d$1@news.povray.org>
Am 06.06.2018 um 19:13 schrieb Thorsten Froehlich:
> Note that the TrueType font decoding used to be rather buggy before 3.6 (or was
> it 3.5?), which is why e.g. in 3.0 it matched the MacRoman font tables before
> ever getting to the other tables.

Presumably before v3.5, because that's the version number threshold
POV-Ray v3.7 tests for.

But right now I don't really care much /why/ POV-Ray is behaving the way
it does; at the moment I'm just taking stock of /how/ it currently
behaves, so that I can mimic it as accurately as reasonably possible
within the framework of the v3.8 parser changes.



From: Thorsten Froehlich
Subject: Re: POV-Ray v3.7 charset behaviour
Date: 7 Jun 2018 04:45:00
Message: <web.5b18b7c44a8271266dfe572e0@news.povray.org>
clipka <ano### [at] anonymousorg> wrote:
> Am 06.06.2018 um 19:13 schrieb Thorsten Froehlich:
> > Note that the TrueType font decoding used to be rather buggy before 3.6 (or was
> > it 3.5?), which is why e.g. in 3.0 it matched the MacRoman font tables before
> > ever getting to the other tables.
>
> Presumably before v3.5, because that's the version number threshold
> POV-Ray v3.7 tests for.

Yes, there should be a compatibility warning in there somewhere because it was
decided back then that the previous behavior was so badly broken that only ASCII
ever worked. The other problem found back then was that especially many (old)
free fonts used to be not exactly conforming to the specifications available at
the time (i.e. some only worked on Macs, others only on Windows), which might
also be due to the specs not being easily accessible when the freeware that some
people used to create those fonts was published. So there is very little point
trying to recreate all that odd behavior.

> But right now I don't really care much /why/ POV-Ray is behaving the way
> it does; at the moment I'm just taking stock of /how/ it currently
> behaves, so that I can mimic it as accurately as reasonably possible
> within the framework of the v3.8 parser changes.

I would suggest going with the documented behavior, which is ASCII support for
charset ASCII and UCS-2 support (because that is all that fonts had back then)
for UTF-8 input. And after that maybe consider adding the missing cmap table
support for the extended Unicode stuff beyond the 16 bit codes. You will also
need to fiddle with color fonts and the like then ... but at least nowadays you
can find those specs on the internet. Adding Unicode support to 3.5 meant buying
the Unicode 3.0 book...

Btw., I found that the best way to determine what happens inside the TrueType
parser code is enabling the debug code in there. Otherwise you can never be
certain there isn't some problem with the input file rather than the code.

Thorsten



Copyright 2003-2008 Persistence of Vision Raytracer Pty. Ltd.