POV-Ray: Newsgroups: povray.beta-test: POV-Ray v3.7 charset behaviour

POV-Ray : Newsgroups : povray.beta-test : POV-Ray v3.7 charset behaviour		Server Time 12 Jul 2025 18:01:02 EDT (-0400)

<<< Previous 2 Messages

Goto Latest 10 Messages

Next 10 Messages >>>

From: Thorsten Froehlich
Subject: Re: POV-Ray v3.7 charset behaviour
Date: 6 Jun 2018 13:30:07
Message: <web.5b1819544a827126535efa580@news.povray.org>

clipka <ano### [at] anonymousorg> wrote:
> 3.7 / ascii
> ===========
>
> - Non-ASCII octets in strings are decoded as blanks (ASCII hex 20;
> non-ASCII characters can still be inserted via `\uXXXX` escape sequences
> or the `chr()` function though).
>
> - When used in file names or debug output, non-ASCII characters in
> strings (entered via "\uXXXX" or `chr()`) are substituted with blanks.
>
> - When used in text primitives, non-ASCII characters in strings (entered
> via "\uXXXX" or `chr()`) are typically garbled, depending on the font
> used; for instace, with Microsoft's Arial font the text appears to be
> subject to re-interpretation of the codes as Macintosh Roman codes,
> while with POV-Ray's `cyrvetic.ttf` the re-interpretation seems to be as
> Windows-1251 (Cyrillic) codes. Character codes above hex FF are treated
> according to obscure rules in all cases.
>
>
> 3.7 / utf8
> ==========
>
> - Non-ASCII octets in strings are decoded according to UTF-8.
>
> - When used in file names or debug output, non-ASCII characters in
> strings are substituted with blanks.
>
> - When used in text primitives, non-ASCII characters in strings may or
> may not be garbled, depending on the font used; for instace, with
> Microsoft's Arial font the text is displayed as expected for UCS-2
> encoded text, while with POV-Ray's `cyrvetic.ttf` it appears to be
> subject to re-interpretation of the UCS-2 codes as Windows-1251
> (Cyrillic) codes, again with codes above hex FF being treated according
> to obscure rules.

I won't dispute your results because they don't surprise me, but I dispute the
fact that the code behaves unpredictably: The code is in OpenFontFile, and it
does pick the Unicode tables as top priority if Unicode is specified as text
format. It only falls back to MacRoman (which next to all TTFs contain for
legacy reasons) if it cannot find any of the Unicode formats that it does
support. So the real question is if maybe in the 18 years since this code was
introduced in 3.5 based on TTFs shipped with Mac OS 9 and Windows 2000, some
other TTF Unicode formats were introduced that the code does not support...

Post a reply to this message

From: Thorsten Froehlich
Subject: Re: POV-Ray v3.7 charset behaviour
Date: 6 Jun 2018 13:35:02
Message: <web.5b181ab04a827126535efa580@news.povray.org>

"Thorsten Froehlich" <nomail@nomail> wrote:
> I won't dispute your results because they don't surprise me, but I dispute the
> fact that the code behaves unpredictably: The code is in OpenFontFile, and it
> does pick the Unicode tables as top priority if Unicode is specified as text
> format. It only falls back to MacRoman (which next to all TTFs contain for
> legacy reasons) if it cannot find any of the Unicode formats that it does
> support. So the real question is if maybe in the 18 years since this code was
> introduced in 3.5 based on TTFs shipped with Mac OS 9 and Windows 2000, some
> other TTF Unicode formats were introduced that the code does not support...

To answer my own questions, the formats did change...
https://developer.apple.com/fonts/TrueType-Reference-Manual/RM06/Chap6cmap.html

So what you see is the code falling back to a ASCII (MacRoman) code tables as a
fail-safe because it cannot use the newer Unicode tables most current fonts seem
to contain.

Post a reply to this message

From: clipka
Subject: Re: POV-Ray v3.7 charset behaviour
Date: 6 Jun 2018 18:04:38
Message: <5b185a76$1@news.povray.org>

Am 06.06.2018 um 19:32 schrieb Thorsten Froehlich:
> "Thorsten Froehlich" <nomail@nomail> wrote:
>> I won't dispute your results because they don't surprise me, but I dispute the
>> fact that the code behaves unpredictably:

I never used the word "unpredictably" - I used the phrase "according to
obscure rules". As in, "pretty difficult to guess without looking at the
actual code and maybe the font file".

>> The code is in OpenFontFile, and it
>> does pick the Unicode tables as top priority if Unicode is specified as text
>> format. It only falls back to MacRoman (which next to all TTFs contain for
>> legacy reasons) if it cannot find any of the Unicode formats that it does
>> support. So the real question is if maybe in the 18 years since this code was
>> introduced in 3.5 based on TTFs shipped with Mac OS 9 and Windows 2000, some
>> other TTF Unicode formats were introduced that the code does not support...
> 
> To answer my own questions, the formats did change...
> https://developer.apple.com/fonts/TrueType-Reference-Manual/RM06/Chap6cmap.html
> 
> So what you see is the code falling back to a ASCII (MacRoman) code tables as a
> fail-safe because it cannot use the newer Unicode tables most current fonts seem
> to contain.

Um... I do /not/ really think so... you know, the font that I found to
be ok (with `charset utf8`) is the Microsoft Arial font on my Windows 10
machine, while the other is POV-Ray's cyrvetic.ttf. So I have a hunch
that the ok font /may/ be the newer one.

So I'd guess ye olde cyrvetic.ttf never had a Unicode table to begin with.

Also, I don't see a fallback to Mac Roman in the cyrvetic font, but
rather a fallback to Mac Cyrillic.

(As it turns out, the POV-Ray TrueType font gives a rat's arse about the
platformSpecificID when looking for a code table it can use; platformID
is all it tests for.)

Post a reply to this message

From: clipka
Subject: Re: POV-Ray v3.7 charset behaviour
Date: 6 Jun 2018 18:18:05
Message: <5b185d9d$1@news.povray.org>

Am 06.06.2018 um 19:13 schrieb Thorsten Froehlich:
> Note that the TrueType font decoding used to be rather buggy before 3.6 (or was
> it 3.5?), which is why i.e. in 3.0 it matched the MacRoman font tables before
> ever getting to the other tables.

Presumably before v3.5, because that's the version number threshold
POV-Ray v3.7 tests for.

But right now I don't really care much /why/ POV-Ray is behaving the way
it does; at the moment I'm just taking stock of /how/ it currently
behaves, so that I can mimick it as accurately as reasonably possible
within the framework of the v3.8 parser changes.

Post a reply to this message

From: Thorsten Froehlich
Subject: Re: POV-Ray v3.7 charset behaviour
Date: 7 Jun 2018 00:45:00
Message: <web.5b18b7c44a8271266dfe572e0@news.povray.org>

clipka <ano### [at] anonymousorg> wrote:
> Am 06.06.2018 um 19:13 schrieb Thorsten Froehlich:
> > Note that the TrueType font decoding used to be rather buggy before 3.6 (or was
> > it 3.5?), which is why i.e. in 3.0 it matched the MacRoman font tables before
> > ever getting to the other tables.
>
> Presumably before v3.5, because that's the version number threshold
> POV-Ray v3.7 tests for.

Yes, there should be a compatibility warning in there somewhere because it was
decided back then that the previous behavior was so badly broken that only ASCII
ever worked. The other problem found back then was that especially many (old)
free fonts used to be not exactly conforming to the specifications available at
the time (i.e. some only worked on Macs, others only on Windows), which might
also be due to the specs not being easily accessible when the freeware that some
people used to create those fonts was published. So there is very little point
trying to recreate all that odd behavior.

> But right now I don't really care much /why/ POV-Ray is behaving the way
> it does; at the moment I'm just taking stock of /how/ it currently
> behaves, so that I can mimick it as accurately as reasonably possible
> within the framework of the v3.8 parser changes.

I would suggest to go with the documented behavior, which is ASCII support for
charset ASCII and UCS-2 support (because that is all what fonts had back then)
for UTF-8 input. And after that maybe consider adding the missing cmap table
support for the extended Unicode stuff beyond the 16 bit codes. You will also
need to fiddle with color fonts and the like then ... but at least nowadays you
can find those specs on the internet. Adding Unicode support to 3.5 meant buying
the Unicode 3.0 book...

Btw., I found that the best way to determine what happens inside the TrueType
parser code is enabling the debug code in there. Otherwise you can never be
certain there isn't some problem with the input file rather than the code.

Thorsten

Post a reply to this message

From: Kenneth
Subject: Re: POV-Ray v3.7 charset behaviour
Date: 25 Jun 2018 05:45:00
Message: <web.5b30b8c84a827126a47873e10@news.povray.org>

Living in the U.S, I've never paid much attention to text encodings other than
ASCII ("US-ASCII" I suppose)-- although I've seen "UTF-8" etc. show up from time
to time in others' scene files or include files.

When writing my own include files manually (for my own use, and saved as 'plain
text'), I've always used either Window's NOTEPAD or WORDPAD-- only because they
are simple and available. But I'm having a problem saving even a *simple* UTF-8
file.

WORDPAD (in my Win7 installation) can encode text in different ways:
plain text (.txt)-- is that the same as ANSI?
Rich Text Format (.rtf)-- one of Microsoft's own file-types
Unicode (.txt ?)
Unicode big endian (.txt ?)
UTF-8 (.txt ?)
.... plus a few others like XML and such.

[I assume that the various Unicode files are output as .txt file types.]

The thing is, WORDPAD's  'plain text file' is the ONLY one of its own encodings
that can be successfully read by POV-Ray as a text include file; all the others
produce various error messages. Most of those errors are expected-- but even a
UTF-8 file doesn't work. This is... odd. Perhaps I don't understand how to use
Unicode files. OR, WORDPAD isn't writng the file correctly??

Code example:
#version 3.71;  // using 3.7.1 beta 9
global_settings {assumed_gamma 1.0 charset utf8}
#include "text file as UTF8.txt" // Saved as UTF-8. No strings in
           // the contents, just a single line--  #local R = 45;

Fatal error result:
"illegal character in input file, value is ef"
This happens whether global_settings has charset utf8 or no charset at all.
(BTW, I can't 'see' the 'ef' value when I open the file.) So It appears that
WORDPAD is appending a small header-- a BOM?-- which may not conform to UTF-8
specs(?)

So I tried it a different way, just to be complete:
#version 3.71;
global_settings {assumed_gamma 1.0} // NO charset here
#include "text file as UTF8.txt" // Saved as UTF-8 again, but this one
// has  global_settings{charset utf8}, plus   #local R = 45;

.... which has the same fatal result.

I looked at various Wikipedia pages ( "US-ASCII", Windows WordPad app", "UTF-8",
"Comparison of Unicode encodings" ), but I *still* don't have a full grasp of
the facts:

"WordPad for Windows XP [and later?] added full Unicode support, enabling
WordPad to support multiple languages..."

"[Windows] files saved as Unicode text are encoded as UTF-16 LE.  [not UTF-8,
unless that is specified] ...Such [Unicode] files normally begin with Byte Order
Mark (BOM), which communicates the endianness of the file content. Although
UTF-8 does not suffer from endianness problems, many Windows programs --i.e.
Notepad-- prepend the contents of UTF-8-encoded files with BOM, to differentiate
UTF-8 encoding from other 8-bit encodings." [Other 8-bit encodings meaning
"plain ACSII"?]

[However...]
"The Unicode Standard neither requires nor recommends the use of the BOM for
UTF-8, but warns that it may be encountered at the start of a file as a
transcoding artifact. The presence of the UTF-8 BOM may cause problems with
existing software that can [otherwise] handle UTF-8..."

"UTF-16 [is] incompatible with ASCII files, and thus requires Unicode-aware
programs to display, print and manipulate them, even if the file is known to


"UTF-16 does not have endianness defined, [but] this may be achieved by using a
byte-order mark [BOM] at the start of the text, or assuming big-endian... [but]
UTF-8 is standardised on a single byte order and does not have this problem."

"If any stored data is in UTF-8 (such as file contents or names), it is very
difficult to write a system that uses UTF-16 or UTF-32 as an API. [However,] it
is trivial to translate invalid UTF-16 to a unique (though technically invalid)
UTF-8 string, so a UTF-8 API can control both UTF-8 and UTF-16 files and names,
making UTF-8 preferred in any such mixed environment."

Post a reply to this message

From: Stephen
Subject: Re: POV-Ray v3.7 charset behaviour
Date: 25 Jun 2018 06:21:07
Message: <5b30c213@news.povray.org>

On 25/06/2018 10:41, Kenneth wrote:
> Living in the U.S, I've never paid much attention to text encodings other than
> ASCII ("US-ASCII" I suppose)-- although I've seen "UTF-8" etc. show up from time
> to time in others' scene files or include files.
> 
Similarly, living in the UK I use the encoding used in the King James 
bible. ;-)
And if foreigners want to understand. I will shout. :-)


> When writing my own include files manually (for my own use, and saved as 'plain
> text'), I've always used either Window's NOTEPAD or WORDPAD-- only because they
> are simple and available. But I'm having a problem saving even a*simple*  UTF-8
> file.

Have you come across Notepad ++ ?
httos://noteoadplusplus.oro/

It is free, simple to use and jam packed with features such as Encoding, 
Syntax Highlighting for lots of languages, Macro recording etc.



-- 

Regards
     Stephen

Post a reply to this message

From: jr
Subject: Re: POV-Ray v3.7 charset behaviour
Date: 25 Jun 2018 06:35:01
Message: <web.5b30c49c4a827126635cc5ad0@news.povray.org>

hi,

"Kenneth" <kdw### [at] gmailcom> wrote:
> ... text encodings other than ASCII ...
> When writing my own include files manually (for my own use, and saved as 'plain
> text'), I've always used either Window's NOTEPAD or WORDPAD-- only because they
> are simple and available.

many years ago I used to use 'Qedit'* but there will be any number of other,
often free editors.  the aBOMination is purely Microsoft, a "programmer's
editor" should not have issues.

https://en.wikipedia.org/wiki/The_SemWare_Editor
free for personal use but it only cost 30 USD (I think) to license, back then.

regards, jr.

Post a reply to this message

From: Kenneth
Subject: Re: POV-Ray v3.7 charset behaviour
Date: 25 Jun 2018 09:55:00
Message: <web.5b30f3c24a827126a47873e10@news.povray.org>

Stephen <mca### [at] aolcom> wrote:
>
> Have you come across Notepad ++ ?
> httos://noteoadplusplus.oro/
>

Ya know, I actually *did* have that app at one time-- on a hard drive that
failed-- but then I completely forgot about it! Thanks for the memory boost ;-)
I'll download it again. (I never expected Windows' own long-time 'core' apps to
have such a basic text-encoding flaw; apparently neither NOTEPAD nor WORDPAD
conform to the UTF-8 published standards, as far as I can tell. Maybe the
standards came later(!)

Post a reply to this message

From: clipka
Subject: Re: POV-Ray v3.7 charset behaviour
Date: 25 Jun 2018 10:12:31
Message: <5b30f84f$1@news.povray.org>

Am 25.06.2018 um 11:41 schrieb Kenneth:
> Living in the U.S, I've never paid much attention to text encodings other than
> ASCII ("US-ASCII" I suppose)-- although I've seen "UTF-8" etc. show up from time
> to time in others' scene files or include files.
> 
> When writing my own include files manually (for my own use, and saved as 'plain
> text'), I've always used either Window's NOTEPAD or WORDPAD-- only because they
> are simple and available. But I'm having a problem saving even a *simple* UTF-8
> file.

Actually, no, you don't (if you do indeed use plain ASCII): Every ASCII
text file is also a perfectly valid UTF-8 file ;)


> WORDPAD (in my Win7 installation) can encode text in different ways:
> plain text (.txt)-- is that the same as ANSI?

It's the same as whatever codepage you happen to be using; probably
Windows-1252, also erroneously referred to as "ISO-8859-1" (of which
codepage 1252 happens to be a superset) or "ANSI" (presumably shorthand
for "ANSI/ISO-8859-1").

> Rich Text Format (.rtf)-- one of Microsoft's own file-types

Entirely different thing: This is not just a character encoding, but an
entirely different file format.

> Unicode (.txt ?)

That would be UTF-16 with Little Endian encoding (*).

> Unicode big endian (.txt ?)

That would presumably (the Windows 10 version doesn't have this) be
UTF-16 with Big Endian encoding (*).

(* Not to be confused with UTF-16LE or UTF-16BE, respectively; UTF-16
always has a signature (aka BOM = Byte Order Mark), UTF-16LE and
UTF-16BE never have.)

> UTF-8 (.txt ?)

That would presumably (again the Windows 10 version doesn't have this)
be UTF-8 with signature (aka BOM = Byte Order Mark", though in this
context that term might be misleading; see below).

> The thing is, WORDPAD's  'plain text file' is the ONLY one of its own encodings
> that can be successfully read by POV-Ray as a text include file; all the others
> produce various error messages. Most of those errors are expected-- but even a
> UTF-8 file doesn't work. This is... odd. Perhaps I don't understand how to use
> Unicode files. OR, WORDPAD isn't writng the file correctly??
> 
> Code example:
> #version 3.71;  // using 3.7.1 beta 9
> global_settings {assumed_gamma 1.0 charset utf8}
> #include "text file as UTF8.txt" // Saved as UTF-8. No strings in
>            // the contents, just a single line--  #local R = 45;
> 
> Fatal error result:
> "illegal character in input file, value is ef"
> This happens whether global_settings has charset utf8 or no charset at all.
> (BTW, I can't 'see' the 'ef' value when I open the file.) So It appears that
> WORDPAD is appending a small header-- a BOM?-- which may not conform to UTF-8
> specs(?)

The UCS specification explicitly and indiscriminately allows for UTF-8
files both with and without a signature. The Unicode specification is
more partial in that it discourages the presence of a signature, but
still allows it.

If present, the signature is (hex) `EF BB BF`, which "happens" to match
the UTF-8 encoding of U+FEFF, the UTF-16 and UTF-32 encodings of which
"happen" to match the signatures of UTF-16 and UTF-32 indicating byte order.

So Wordpad is without fault here (these days, at any rate). It is
POV-Ray that is to blame -- or, rather, whoever added support for
signatures in UTF-8 files, because while they did add such support for
the scene file proper, they completely forgot to do the same for include
files.

The overhauled parser will address this.


> I looked at various Wikipedia pages ( "US-ASCII", Windows WordPad app", "UTF-8",
> "Comparison of Unicode encodings" ), but I *still* don't have a full grasp of
> the facts:
> 
> "WordPad for Windows XP [and later?] added full Unicode support, enabling
> WordPad to support multiple languages..."

Yup. It's a pity they've apparently thrown part of this support
overboard again between Windows 7 and Windows 10 (saving as Big-Endian
UTF-16 and saving as UTF-8).

> "[Windows] files saved as Unicode text are encoded as UTF-16 LE.  [not UTF-8,

Nope. It's not UTF-16LE (that would be a file using UTF-16 little-endian
encoding /without/ a signature BOM), but UTF-16 with Little Endian byte
order (carrying a signature BOM).

> unless that is specified] ...Such [Unicode] files normally begin with Byte Order
> Mark (BOM), which communicates the endianness of the file content. Although
> UTF-8 does not suffer from endianness problems, many Windows programs --i.e.
> Notepad-- prepend the contents of UTF-8-encoded files with BOM, to differentiate
> UTF-8 encoding from other 8-bit encodings." [Other 8-bit encodings meaning
> "plain ACSII"?]

No, "other 8-bit encodings" means /any/ encodings where each character
is encoded as a sequence of one or more 8-bit /code units/.

Many such encodings are /compatible/ with ASCII, in that they are
indistinguishable from plain ASCII if the text uses only the 128
printable and control characters from the ASCII character set. However,
such encodings (commonly referred to as "extended ASCII") are legion,
differing in how the remaining 128 code unit values (80 through FF) are
interpreted.

There are also 8-bit encodings that are incompatible even with plain
ASCII, the most important examples still in use these days certainly
being EBCDIC and its extensions.

Post a reply to this message

<<< Previous 2 Messages

Goto Latest 10 Messages

Next 10 Messages >>>