POV-Ray: Newsgroups: povray.beta-test: POV-Ray v3.7 charset behaviour: Re: POV-Ray v3.7 charset behaviour

From: clipka
Date: 25 Jun 2018 10:12:31
Message: <5b30f84f$1@news.povray.org>
On 25.06.2018 at 11:41, Kenneth wrote:
> Living in the U.S., I've never paid much attention to text encodings other than
> ASCII ("US-ASCII" I suppose)-- although I've seen "UTF-8" etc. show up from time
> to time in others' scene files or include files.
> 
> When writing my own include files manually (for my own use, and saved as 'plain
> text'), I've always used either Windows' NOTEPAD or WORDPAD-- only because they
> are simple and available. But I'm having a problem saving even a *simple* UTF-8
> file.

Actually, no, you don't (if you do indeed use plain ASCII): Every ASCII
text file is also a perfectly valid UTF-8 file ;)
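
A quick way to convince yourself, sketched in Python (nothing here is POV-Ray-specific; the sample text is simply the one-line include file from your example):

```python
# The single line from the include file; it contains only ASCII characters.
text = "#local R = 45;\n"

ascii_bytes = text.encode("ascii")
utf8_bytes = text.encode("utf-8")   # Python's "utf-8" codec writes no signature

# Both encodings yield the very same bytes, so the ASCII file
# already is a valid (signature-less) UTF-8 file.
print(ascii_bytes == utf8_bytes)
print(ascii_bytes.decode("utf-8") == text)
```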


> WORDPAD (in my Win7 installation) can encode text in different ways:
> plain text (.txt)-- is that the same as ANSI?

It's the same as whatever codepage you happen to be using; probably
Windows-1252, also erroneously referred to as "ISO-8859-1" (of which
codepage 1252 happens to be a superset) or "ANSI" (presumably shorthand
for "ANSI/ISO-8859-1").

> Rich Text Format (.rtf)-- one of Microsoft's own file-types

Entirely different thing: This is not just a character encoding, but an
entirely different file format.

> Unicode (.txt ?)

That would be UTF-16 with Little Endian encoding (*).

> Unicode big endian (.txt ?)

That would presumably (the Windows 10 version doesn't have this) be
UTF-16 with Big Endian encoding (*).

(* Not to be confused with UTF-16LE or UTF-16BE, respectively; UTF-16
always has a signature (aka BOM = Byte Order Mark), UTF-16LE and
UTF-16BE never have.)
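
To make the distinction concrete, here is a small Python sketch. The variable names are made up for illustration; the signature is prepended by hand so that the byte order does not depend on the machine running it:

```python
import codecs

text = "R"

# UTF-16LE / UTF-16BE: byte order fixed by the name, no signature.
le_no_sig = text.encode("utf-16-le")
be_no_sig = text.encode("utf-16-be")

# UTF-16 with a signature telling the reader which byte order follows.
le_with_sig = codecs.BOM_UTF16_LE + le_no_sig
be_with_sig = codecs.BOM_UTF16_BE + be_no_sig

for data in (le_no_sig, be_no_sig, le_with_sig, be_with_sig):
    print(data.hex(" "))

# The generic "utf-16" decoder reads the signature, picks the byte
# order accordingly, and drops the signature from the result.
print(le_with_sig.decode("utf-16") == text)
print(be_with_sig.decode("utf-16") == text)
```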

> UTF-8 (.txt ?)

That would presumably (again the Windows 10 version doesn't have this)
be UTF-8 with signature (aka BOM = "Byte Order Mark", though in this
context that term might be misleading; see below).

> The thing is, WORDPAD's  'plain text file' is the ONLY one of its own encodings
> that can be successfully read by POV-Ray as a text include file; all the others
> produce various error messages. Most of those errors are expected-- but even a
> UTF-8 file doesn't work. This is... odd. Perhaps I don't understand how to use
> Unicode files. OR, WORDPAD isn't writing the file correctly??
> 
> Code example:
> #version 3.71;  // using 3.7.1 beta 9
> global_settings {assumed_gamma 1.0 charset utf8}
> #include "text file as UTF8.txt" // Saved as UTF-8. No strings in
>            // the contents, just a single line--  #local R = 45;
> 
> Fatal error result:
> "illegal character in input file, value is ef"
> This happens whether global_settings has charset utf8 or no charset at all.
> (BTW, I can't 'see' the 'ef' value when I open the file.) So it appears that
> WORDPAD is appending a small header-- a BOM?-- which may not conform to UTF-8
> specs(?)

The UCS specification explicitly and indiscriminately allows for UTF-8
files both with and without a signature. The Unicode specification is
more partial in that it discourages the presence of a signature, but
still allows it.

If present, the signature is (hex) `EF BB BF`, which "happens" to match
the UTF-8 encoding of U+FEFF, the UTF-16 and UTF-32 encodings of which
"happen" to match the signatures of UTF-16 and UTF-32 indicating byte order.

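For illustration, a short Python sketch that encodes U+FEFF in the various encoding forms; the first byte of the UTF-8 result is the `ef` from your error message:

```python
sig = "\ufeff"  # U+FEFF, the code point behind every Unicode signature

print(sig.encode("utf-8").hex(" "))      # ef bb bf
print(sig.encode("utf-16-be").hex(" "))  # fe ff
print(sig.encode("utf-16-le").hex(" "))  # ff fe
print(sig.encode("utf-32-be").hex(" "))  # 00 00 fe ff
print(sig.encode("utf-32-le").hex(" "))  # ff fe 00 00
```
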
So Wordpad is without fault here (these days, at any rate). It is
POV-Ray that is to blame -- or, rather, whoever added support for
signatures in UTF-8 files, because while they did add such support for
the scene file proper, they completely forgot to do the same for include
files.

The overhauled parser will address this.
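
Until then, one possible workaround is to strip the signature from include files before rendering. A minimal sketch in Python; the helper name `strip_utf8_signature` is made up for illustration and is not part of POV-Ray:

```python
import codecs
import sys
from pathlib import Path


def strip_utf8_signature(path):
    """Remove a leading EF BB BF from the file at `path`, if present.

    Returns True if the file was rewritten, False if it had no signature.
    (Hypothetical helper, not something POV-Ray ships.)
    """
    p = Path(path)
    data = p.read_bytes()
    if not data.startswith(codecs.BOM_UTF8):
        return False
    p.write_bytes(data[len(codecs.BOM_UTF8):])
    return True


if __name__ == "__main__":
    # Treat every command-line argument as a file to clean up.
    for name in sys.argv[1:]:
        changed = strip_utf8_signature(name)
        print(name, "signature removed" if changed else "left unchanged")
```

Run it with the include file's name as the argument. The file is modified in place, so keep a copy if in doubt.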


> I looked at various Wikipedia pages ( "US-ASCII", "Windows WordPad app", "UTF-8",
> "Comparison of Unicode encodings" ), but I *still* don't have a full grasp of
> the facts:
> 
> "WordPad for Windows XP [and later?] added full Unicode support, enabling
> WordPad to support multiple languages..."

Yup. It's a pity they've apparently thrown part of this support
overboard again between Windows 7 and Windows 10 (saving as Big-Endian
UTF-16 and saving as UTF-8).

> "[Windows] files saved as Unicode text are encoded as UTF-16 LE.  [not UTF-8,

Nope. It's not UTF-16LE (that would be a file using UTF-16 little-endian
encoding /without/ a signature BOM), but UTF-16 with Little Endian byte
order (carrying a signature BOM).

> unless that is specified] ...Such [Unicode] files normally begin with Byte Order
> Mark (BOM), which communicates the endianness of the file content. Although
> UTF-8 does not suffer from endianness problems, many Windows programs --i.e.
> Notepad-- prepend the contents of UTF-8-encoded files with BOM, to differentiate
> UTF-8 encoding from other 8-bit encodings." [Other 8-bit encodings meaning
> "plain ASCII"?]

No, "other 8-bit encodings" means /any/ encoding in which each character
is encoded as a sequence of one or more 8-bit /code units/.

Many such encodings are /compatible/ with ASCII, in that they are
indistinguishable from plain ASCII if the text uses only the 128
printable and control characters from the ASCII character set. However,
such encodings (commonly referred to as "extended ASCII") are legion,
differing in how the remaining 128 code unit values (80 through FF) are
interpreted.
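
As a small illustration in Python, take the same two bytes and decode them with three such encodings; the three codecs are picked arbitrarily from those that ship with Python:

```python
raw = bytes([0x41, 0x80])  # 'A', followed by the code unit 80 (hex)

# ascii() shows non-ASCII characters as escapes, so the output
# does not depend on the console's own encoding.
for codec in ("cp1252", "latin-1", "cp437"):
    print(codec, ascii(raw.decode(codec)))
```

The first byte comes out as `A` in all three. The second is the euro sign (U+20AC) in Windows-1252, a control character (U+0080) in ISO-8859-1, and a capital C with cedilla (U+00C7) in the old DOS codepage 437.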

There are also 8-bit encodings that are incompatible even with plain
ASCII, the most important examples still in use these days certainly
being EBCDIC and its extensions.
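
For comparison, a Python sketch using `cp037`, one of the EBCDIC code pages that ships with Python (chosen here only as an example):

```python
text = "POV"

print(text.encode("ascii").hex(" "))  # 50 4f 56
print(text.encode("cp037").hex(" "))  # d7 d6 e5

# Reading the EBCDIC bytes as if they were ASCII fails outright,
# because all three values lie outside the 00..7F range.
try:
    text.encode("cp037").decode("ascii")
except UnicodeDecodeError:
    print("not valid ASCII")
```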