povray.beta-test: POV-Ray v3.7 charset behaviour (Messages 1 to 10 of 22)
From: clipka
Subject: POV-Ray v3.7 charset behaviour
Date: 5 Jun 2018 17:28:11
Message: <5b17006b$1@news.povray.org>
POV-Ray v3.7 (for Windows) behaves as follows with respect to different
#version/charset settings (as tested with German locale):


Common
======

- "\uXXXX" escape sequences are technically always interpreted as UCS-2
character codes. Note however that depending on the context in which a
string is used, the /effective/ interpretation may vary.

- `asc()` and `chr()` functions technically always operate according to
UCS-2 character encoding. Note however that depending on the context in
which a string is used, the /effective/ encoding may vary.


3.0 / ascii
===========

- Non-ASCII octets in strings are technically decoded according to ISO
8859-1 (Latin-1; matching the display in the editor for many characters,
except for codes hex 80-9F, though I presume the editor display may vary
with the system's locale, while the Latin-1 decoding is invariant). Note
however that depending on the context in which a string is used, the
/effective/ decoding may vary.

- When used in file names or debug output, strings are effectively
subject to re-interpretation of the UCS-2 character codes as
Windows-1252 codes (matching the display in the editor; presumably this
varies with the system's locale), with character codes above hex FF
interpreted modulo 256.

- When used in text primitives, non-ASCII characters in strings are
typically garbled, depending on the font used; for instance, with
Microsoft's Arial font the text appears to be subject to
re-interpretation of the UCS-2 character codes as Macintosh Roman codes,
while with POV-Ray's `cyrvetic.ttf` the re-interpretation seems to be as
Windows-1251 (Cyrillic) codes. Character codes above hex FF are treated
according to obscure rules in all cases. (In some cases,
re-interpretation may happen to match the display in the editor.)
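
A minimal sketch, in Python rather than POV-Ray SDL, of the two steps described above for file names and debug output: octets are first decoded as Latin-1, and the resulting character codes are later re-read as Windows-1252 bytes, modulo 256. The function names are hypothetical and this is not POV-Ray's code; it only illustrates the arithmetic, assuming a system whose locale code page is Windows-1252.

```python
from typing import Iterable, List


def decode_latin1(octets: bytes) -> List[int]:
    """ISO 8859-1 decoding: each character code equals the octet value."""
    return list(octets)


def reinterpret_as_cp1252(codes: Iterable[int]) -> str:
    """Re-read character codes as Windows-1252 bytes, taking codes modulo 256."""
    raw = bytes(code % 256 for code in codes)
    # Five byte values are unassigned in Windows-1252; "replace" maps them to U+FFFD.
    return raw.decode("cp1252", errors="replace")


# Bytes as a Windows-1252 editor would store "a-umlaut, space, euro sign".
source = b"\xe4 \x80"
codes = decode_latin1(source)
print([hex(c) for c in codes])          # ['0xe4', '0x20', '0x80']
print(reinterpret_as_cp1252(codes))     # matches the editor display: ä €
print(reinterpret_as_cp1252([0x20AC]))  # 0x20AC % 256 == 0xAC, shown as ¬
```

The last line shows why a code above hex FF, such as one entered via "\u20AC", would not survive this path: only its low byte is kept.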


3.0 / utf8
==========

- Non-ASCII octets in strings are technically decoded according to
UTF-8. Note however that depending on the context in which a string is
used, non-ASCII characters may effectively become garbled.

- When used in file names or debug output, strings are effectively
subject to re-interpretation of the UCS-2 codes as Windows-1252 codes
(presumably this varies with the system's locale), with codes above hex
FF interpreted modulo 256.

- When used in text primitives, non-ASCII characters in strings may or
may not be garbled, depending on the font used; for instance, with
Microsoft's Arial font the text is displayed as expected for UCS-2
encoded text, while with POV-Ray's `cyrvetic.ttf` it appears to be
subject to re-interpretation of the UCS-2 codes as Macintosh Cyrillic
codes with codes above hex FF being treated according to obscure rules.


3.7 / ascii
===========

- Non-ASCII octets in strings are decoded as blanks (ASCII hex 20;
non-ASCII characters can still be inserted via `\uXXXX` escape sequences
or the `chr()` function though).

- When used in file names or debug output, non-ASCII characters in
strings (entered via "\uXXXX" or `chr()`) are substituted with blanks.

- When used in text primitives, non-ASCII characters in strings (entered
via "\uXXXX" or `chr()`) are typically garbled, depending on the font
used; for instance, with Microsoft's Arial font the text appears to be
subject to re-interpretation of the codes as Macintosh Roman codes,
while with POV-Ray's `cyrvetic.ttf` the re-interpretation seems to be as
Windows-1251 (Cyrillic) codes. Character codes above hex FF are treated
according to obscure rules in all cases.
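
As a rough illustration of the first point in this list, the following Python sketch replaces every non-ASCII octet with a blank. The function name is hypothetical, and the one-blank-per-octet result for multi-byte UTF-8 input is an assumption drawn from the wording "non-ASCII octets"; it is not taken from POV-Ray's source.

```python
def decode_ascii_with_blanks(octets: bytes) -> str:
    """Keep ASCII octets as they are; turn every octet >= hex 80 into a blank (hex 20)."""
    return "".join(chr(b) if b < 0x80 else " " for b in octets)


# "Gruesse" with u-umlaut and sharp s, stored once as Latin-1 and once as UTF-8.
latin1_bytes = b"Gr\xfc\xdfe"
utf8_bytes = b"Gr\xc3\xbc\xc3\x9fe"

print(repr(decode_ascii_with_blanks(latin1_bytes)))  # 'Gr  e'
print(repr(decode_ascii_with_blanks(utf8_bytes)))    # 'Gr    e'
```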


3.7 / utf8
==========

- Non-ASCII octets in strings are decoded according to UTF-8.

- When used in file names or debug output, non-ASCII characters in
strings are substituted with blanks.

- When used in text primitives, non-ASCII characters in strings may or
may not be garbled, depending on the font used; for instance, with
Microsoft's Arial font the text is displayed as expected for UCS-2
encoded text, while with POV-Ray's `cyrvetic.ttf` it appears to be
subject to re-interpretation of the UCS-2 codes as Windows-1251
(Cyrillic) codes, again with codes above hex FF being treated according
to obscure rules.



From: Thorsten Froehlich
Subject: Re: POV-Ray v3.7 charset behaviour
Date: 6 Jun 2018 13:15:01
Message: <web.5b18164a4a827126535efa580@news.povray.org>
Note that the TrueType font decoding used to be rather buggy before 3.6 (or was
it 3.5?), which is why, e.g., in 3.0 it matched the MacRoman font tables before
ever getting to the other tables.




From: Thorsten Froehlich
Subject: Re: POV-Ray v3.7 charset behaviour
Date: 6 Jun 2018 13:30:07
Message: <web.5b1819544a827126535efa580@news.povray.org>
clipka <ano### [at] anonymousorg> wrote:
> 3.7 / ascii
> ===========
>
> - Non-ASCII octets in strings are decoded as blanks (ASCII hex 20;
> non-ASCII characters can still be inserted via `\uXXXX` escape sequences
> or the `chr()` function though).
>
> - When used in file names or debug output, non-ASCII characters in
> strings (entered via "\uXXXX" or `chr()`) are substituted with blanks.
>
> - When used in text primitives, non-ASCII characters in strings (entered
> via "\uXXXX" or `chr()`) are typically garbled, depending on the font
> used; for instance, with Microsoft's Arial font the text appears to be
> subject to re-interpretation of the codes as Macintosh Roman codes,
> while with POV-Ray's `cyrvetic.ttf` the re-interpretation seems to be as
> Windows-1251 (Cyrillic) codes. Character codes above hex FF are treated
> according to obscure rules in all cases.
>
>
> 3.7 / utf8
> ==========
>
> - Non-ASCII octets in strings are decoded according to UTF-8.
>
> - When used in file names or debug output, non-ASCII characters in
> strings are substituted with blanks.
>
> - When used in text primitives, non-ASCII characters in strings may or
> may not be garbled, depending on the font used; for instance, with
> Microsoft's Arial font the text is displayed as expected for UCS-2
> encoded text, while with POV-Ray's `cyrvetic.ttf` it appears to be
> subject to re-interpretation of the UCS-2 codes as Windows-1251
> (Cyrillic) codes, again with codes above hex FF being treated according
> to obscure rules.

I won't dispute your results because they don't surprise me, but I dispute the
claim that the code behaves unpredictably: The code is in OpenFontFile, and it
does pick the Unicode tables as top priority if Unicode is specified as text
format. It only falls back to MacRoman (which nearly all TTFs contain for
legacy reasons) if it cannot find any of the Unicode formats that it does
support. So the real question is whether, in the 18 years since this code was
introduced in 3.5 based on TTFs shipped with Mac OS 9 and Windows 2000, some
other TTF Unicode formats were introduced that the code does not support...



From: Thorsten Froehlich
Subject: Re: POV-Ray v3.7 charset behaviour
Date: 6 Jun 2018 13:35:02
Message: <web.5b181ab04a827126535efa580@news.povray.org>
"Thorsten Froehlich" <nomail@nomail> wrote:
> I won't dispute your results because they don't surprise me, but I dispute the
> claim that the code behaves unpredictably: The code is in OpenFontFile, and it
> does pick the Unicode tables as top priority if Unicode is specified as text
> format. It only falls back to MacRoman (which nearly all TTFs contain for
> legacy reasons) if it cannot find any of the Unicode formats that it does
> support. So the real question is whether, in the 18 years since this code was
> introduced in 3.5 based on TTFs shipped with Mac OS 9 and Windows 2000, some
> other TTF Unicode formats were introduced that the code does not support...

To answer my own questions, the formats did change...
https://developer.apple.com/fonts/TrueType-Reference-Manual/RM06/Chap6cmap.html

So what you see is the code falling back to the ASCII (MacRoman) code tables as a
fail-safe because it cannot use the newer Unicode tables most current fonts seem
to contain.



From: clipka
Subject: Re: POV-Ray v3.7 charset behaviour
Date: 6 Jun 2018 18:04:38
Message: <5b185a76$1@news.povray.org>
On 06.06.2018 at 19:32, Thorsten Froehlich wrote:
> "Thorsten Froehlich" <nomail@nomail> wrote:
>> I won't dispute your results because they don't surprise me, but I dispute the
>> claim that the code behaves unpredictably:

I never used the word "unpredictably" - I used the phrase "according to
obscure rules". As in, "pretty difficult to guess without looking at the
actual code and maybe the font file".

>> The code is in OpenFontFile, and it
>> does pick the Unicode tables as top priority if Unicode is specified as text
>> format. It only falls back to MacRoman (which nearly all TTFs contain for
>> legacy reasons) if it cannot find any of the Unicode formats that it does
>> support. So the real question is whether, in the 18 years since this code was
>> introduced in 3.5 based on TTFs shipped with Mac OS 9 and Windows 2000, some
>> other TTF Unicode formats were introduced that the code does not support...
> 
> To answer my own questions, the formats did change...
> https://developer.apple.com/fonts/TrueType-Reference-Manual/RM06/Chap6cmap.html
> 
> So what you see is the code falling back to the ASCII (MacRoman) code tables as a
> fail-safe because it cannot use the newer Unicode tables most current fonts seem
> to contain.

Um... I do /not/ really think so... you know, the font that I found to
be ok (with `charset utf8`) is the Microsoft Arial font on my Windows 10
machine, while the other is POV-Ray's cyrvetic.ttf. So I have a hunch
that the ok font /may/ be the newer one.

So I'd guess ye olde cyrvetic.ttf never had a Unicode table to begin with.

Also, I don't see a fallback to Mac Roman in the cyrvetic font, but
rather a fallback to Mac Cyrillic.


(As it turns out, POV-Ray's TrueType font code gives a rat's arse about the
platformSpecificID when looking for a code table it can use; platformID
is all it tests for.)
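
To make the consequence of matching on platformID alone concrete, here is a hedged Python sketch. It is not POV-Ray's OpenFontFile code; the record type, the function name, the priority order, and the two example fonts are all hypothetical. The numeric IDs are the ones defined for the TrueType `cmap` table (platform 0 = Unicode, 1 = Macintosh, 3 = Microsoft; Macintosh script 0 = Roman, 7 = Cyrillic).

```python
from typing import NamedTuple, Optional, Sequence

PLATFORM_UNICODE = 0
PLATFORM_MACINTOSH = 1
PLATFORM_MICROSOFT = 3

MAC_ROMAN = 0
MAC_CYRILLIC = 7


class CmapRecord(NamedTuple):
    platform_id: int
    platform_specific_id: int


def pick_by_platform_only(records: Sequence[CmapRecord],
                          want_unicode: bool) -> Optional[CmapRecord]:
    """Pick the first record whose platformID matches; platformSpecificID is ignored."""
    order = []
    if want_unicode:
        # Assumption: both platforms are treated as sources of Unicode tables.
        order += [PLATFORM_UNICODE, PLATFORM_MICROSOFT]
    order.append(PLATFORM_MACINTOSH)  # fallback intended to mean "MacRoman"
    for platform_id in order:
        for record in records:
            if record.platform_id == platform_id:
                return record
    return None


# Hypothetical fonts: one with only a Macintosh Cyrillic table, one with a Unicode table.
legacy_font = [CmapRecord(PLATFORM_MACINTOSH, MAC_CYRILLIC)]
modern_font = [CmapRecord(PLATFORM_MACINTOSH, MAC_ROMAN),
               CmapRecord(PLATFORM_MICROSOFT, 1)]

print(pick_by_platform_only(legacy_font, want_unicode=True))
print(pick_by_platform_only(modern_font, want_unicode=True))
print(pick_by_platform_only(modern_font, want_unicode=False))
```

For the first font the "MacRoman" fallback returns a Mac Cyrillic table, because nothing checks the script ID; that is the kind of mismatch the parenthetical remark above points at.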



From: clipka
Subject: Re: POV-Ray v3.7 charset behaviour
Date: 6 Jun 2018 18:18:05
Message: <5b185d9d$1@news.povray.org>
On 06.06.2018 at 19:13, Thorsten Froehlich wrote:
> Note that the TrueType font decoding used to be rather buggy before 3.6 (or was
> it 3.5?), which is why, e.g., in 3.0 it matched the MacRoman font tables before
> ever getting to the other tables.

Presumably before v3.5, because that's the version number threshold
POV-Ray v3.7 tests for.

But right now I don't really care much /why/ POV-Ray is behaving the way
it does; at the moment I'm just taking stock of /how/ it currently
behaves, so that I can mimic it as accurately as reasonably possible
within the framework of the v3.8 parser changes.



From: Thorsten Froehlich
Subject: Re: POV-Ray v3.7 charset behaviour
Date: 7 Jun 2018 00:45:00
Message: <web.5b18b7c44a8271266dfe572e0@news.povray.org>
clipka <ano### [at] anonymousorg> wrote:
> On 06.06.2018 at 19:13, Thorsten Froehlich wrote:
> > Note that the TrueType font decoding used to be rather buggy before 3.6 (or was
> > it 3.5?), which is why, e.g., in 3.0 it matched the MacRoman font tables before
> > ever getting to the other tables.
>
> Presumably before v3.5, because that's the version number threshold
> POV-Ray v3.7 tests for.

Yes, there should be a compatibility warning in there somewhere because it was
decided back then that the previous behavior was so badly broken that only ASCII
ever worked. The other problem found back then was that especially many (old)
free fonts used to be not exactly conforming to the specifications available at
the time (i.e. some only worked on Macs, others only on Windows), which might
also be due to the specs not being easily accessible when the freeware that some
people used to create those fonts was published. So there is very little point
trying to recreate all that odd behavior.

> But right now I don't really care much /why/ POV-Ray is behaving the way
> it does; at the moment I'm just taking stock of /how/ it currently
> behaves, so that I can mimic it as accurately as reasonably possible
> within the framework of the v3.8 parser changes.

I would suggest going with the documented behavior, which is ASCII support for
charset ASCII and UCS-2 support (because that is all that fonts had back then)
for UTF-8 input. And after that maybe consider adding the missing cmap table
support for the extended Unicode stuff beyond the 16 bit codes. You will also
need to fiddle with color fonts and the like then ... but at least nowadays you
can find those specs on the internet. Adding Unicode support to 3.5 meant buying
the Unicode 3.0 book...

Btw., I found that the best way to determine what happens inside the TrueType
parser code is enabling the debug code in there. Otherwise you can never be
certain there isn't some problem with the input file rather than the code.

Thorsten



From: Kenneth
Subject: Re: POV-Ray v3.7 charset behaviour
Date: 25 Jun 2018 05:45:00
Message: <web.5b30b8c84a827126a47873e10@news.povray.org>
Living in the U.S., I've never paid much attention to text encodings other than
ASCII ("US-ASCII" I suppose)-- although I've seen "UTF-8" etc. show up from time
to time in others' scene files or include files.

When writing my own include files manually (for my own use, and saved as 'plain
text'), I've always used either Windows' NOTEPAD or WORDPAD-- only because they
are simple and available. But I'm having a problem saving even a *simple* UTF-8
file.

WORDPAD (in my Win7 installation) can encode text in different ways:
plain text (.txt)-- is that the same as ANSI?
Rich Text Format (.rtf)-- one of Microsoft's own file-types
Unicode (.txt ?)
Unicode big endian (.txt ?)
UTF-8 (.txt ?)
.... plus a few others like XML and such.

[I assume that the various Unicode files are output as .txt file types.]

The thing is, WORDPAD's  'plain text file' is the ONLY one of its own encodings
that can be successfully read by POV-Ray as a text include file; all the others
produce various error messages. Most of those errors are expected-- but even a
UTF-8 file doesn't work. This is... odd. Perhaps I don't understand how to use
Unicode files. OR, WORDPAD isn't writing the file correctly??

Code example:
#version 3.71;  // using 3.7.1 beta 9
global_settings {assumed_gamma 1.0 charset utf8}
#include "text file as UTF8.txt" // Saved as UTF-8. No strings in
           // the contents, just a single line--  #local R = 45;

Fatal error result:
"illegal character in input file, value is ef"
This happens whether global_settings has charset utf8 or no charset at all.
(BTW, I can't 'see' the 'ef' value when I open the file.) So it appears that
WORDPAD is appending a small header-- a BOM?-- which may not conform to UTF-8
specs(?)
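
For reference, the UTF-8 byte order mark is the three-byte sequence EF BB BF, so a leading "ef" is consistent with a BOM at the start of the file. The following Python sketch (not part of POV-Ray; the helper name is hypothetical) shows one way to check for the BOM and strip it before handing the file to a parser that does not accept it.

```python
import codecs


def strip_utf8_bom(data: bytes) -> bytes:
    """Return data without a leading UTF-8 byte order mark (EF BB BF), if present."""
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):]
    return data


# Simulated file contents as an editor that prepends a BOM would save them.
saved = codecs.BOM_UTF8 + b"#local R = 45;\n"

print(hex(saved[0]))                   # 0xef
print(strip_utf8_bom(saved))           # b'#local R = 45;\n'
print(repr(saved.decode("utf-8-sig"))) # '#local R = 45;\n'
```

Python's "utf-8-sig" codec does the same thing while decoding: it skips a leading BOM if there is one and otherwise behaves like plain UTF-8.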

So I tried it a different way, just to be complete:
#version 3.71;
global_settings {assumed_gamma 1.0} // NO charset here
#include "text file as UTF8.txt" // Saved as UTF-8 again, but this one
// has  global_settings{charset utf8}, plus   #local R = 45;

.... which has the same fatal result.

I looked at various Wikipedia pages ( "US-ASCII", "Windows WordPad app", "UTF-8",
"Comparison of Unicode encodings" ), but I *still* don't have a full grasp of
the facts:

"WordPad for Windows XP [and later?] added full Unicode support, enabling
WordPad to support multiple languages..."

"[Windows] files saved as Unicode text are encoded as UTF-16 LE.  [not UTF-8,
unless that is specified] ...Such [Unicode] files normally begin with Byte Order
Mark (BOM), which communicates the endianness of the file content. Although
UTF-8 does not suffer from endianness problems, many Windows programs --i.e.
Notepad-- prepend the contents of UTF-8-encoded files with BOM, to differentiate
UTF-8 encoding from other 8-bit encodings." [Other 8-bit encodings meaning
"plain ASCII"?]

[However...]
"The Unicode Standard neither requires nor recommends the use of the BOM for
UTF-8, but warns that it may be encountered at the start of a file as a
transcoding artifact. The presence of the UTF-8 BOM may cause problems with
existing software that can [otherwise] handle UTF-8..."

"UTF-16 [is] incompatible with ASCII files, and thus requires Unicode-aware
programs to display, print and manipulate them, even if the file is known to


"UTF-16 does not have endianness defined, [but] this may be achieved by using a
byte-order mark [BOM] at the start of the text, or assuming big-endian... [but]
UTF-8 is standardised on a single byte order and does not have this problem."

"If any stored data is in UTF-8 (such as file contents or names), it is very
difficult to write a system that uses UTF-16 or UTF-32 as an API. [However,] it
is trivial to translate invalid UTF-16 to a unique (though technically invalid)
UTF-8 string, so a UTF-8 API can control both UTF-8 and UTF-16 files and names,
making UTF-8 preferred in any such mixed environment."



From: Stephen
Subject: Re: POV-Ray v3.7 charset behaviour
Date: 25 Jun 2018 06:21:07
Message: <5b30c213@news.povray.org>
On 25/06/2018 10:41, Kenneth wrote:
> Living in the U.S., I've never paid much attention to text encodings other than
> ASCII ("US-ASCII" I suppose)-- although I've seen "UTF-8" etc. show up from time
> to time in others' scene files or include files.
> 
Similarly, living in the UK I use the encoding used in the King James 
bible. ;-)
And if foreigners want to understand, I will shout. :-)


> When writing my own include files manually (for my own use, and saved as 'plain
> text'), I've always used either Windows' NOTEPAD or WORDPAD-- only because they
> are simple and available. But I'm having a problem saving even a *simple* UTF-8
> file.

Have you come across Notepad++?
https://notepad-plus-plus.org/

It is free, simple to use and jam packed with features such as Encoding, 
Syntax Highlighting for lots of languages, Macro recording etc.



-- 

Regards
     Stephen



From: jr
Subject: Re: POV-Ray v3.7 charset behaviour
Date: 25 Jun 2018 06:35:01
Message: <web.5b30c49c4a827126635cc5ad0@news.povray.org>
hi,

"Kenneth" <kdw### [at] gmailcom> wrote:
> ... text encodings other than ASCII ...
> When writing my own include files manually (for my own use, and saved as 'plain
> text'), I've always used either Windows' NOTEPAD or WORDPAD-- only because they
> are simple and available.

many years ago I used to use 'Qedit'* but there will be any number of other,
often free editors.  the aBOMination is purely Microsoft, a "programmer's
editor" should not have issues.

https://en.wikipedia.org/wiki/The_SemWare_Editor
free for personal use but it only cost 30 USD (I think) to license, back then.


regards, jr.



Copyright 2003-2023 Persistence of Vision Raytracer Pty. Ltd.