povray.beta-test : POV-Ray v3.7 charset behaviour (Messages 11 to 20 of 22)
From: Kenneth
Subject: Re: POV-Ray v3.7 charset behaviour
Date: 25 Jun 2018 09:55:00
Message: <web.5b30f3c24a827126a47873e10@news.povray.org>
Stephen <mca### [at] aolcom> wrote:
>
> Have you come across Notepad ++ ?
> https://notepadplusplus.org/
>

Ya know, I actually *did* have that app at one time-- on a hard drive that
failed-- but then I completely forgot about it! Thanks for the memory boost ;-)
I'll download it again. (I never expected Windows' own long-time 'core' apps to
have such a basic text-encoding flaw; apparently neither NOTEPAD nor WORDPAD
conforms to the published UTF-8 standard, as far as I can tell. Maybe the
standards came later!)


Post a reply to this message

From: clipka
Subject: Re: POV-Ray v3.7 charset behaviour
Date: 25 Jun 2018 10:12:31
Message: <5b30f84f$1@news.povray.org>
On 25.06.2018 at 11:41, Kenneth wrote:
> Living in the U.S, I've never paid much attention to text encodings other than
> ASCII ("US-ASCII" I suppose)-- although I've seen "UTF-8" etc. show up from time
> to time in others' scene files or include files.
> 
> When writing my own include files manually (for my own use, and saved as 'plain
> text'), I've always used either Window's NOTEPAD or WORDPAD-- only because they
> are simple and available. But I'm having a problem saving even a *simple* UTF-8
> file.

Actually, no, you don't (if you do indeed use plain ASCII): Every ASCII
text file is also a perfectly valid UTF-8 file ;)


> WORDPAD (in my Win7 installation) can encode text in different ways:
> plain text (.txt)-- is that the same as ANSI?

It's the same as whatever codepage you happen to be using; probably
Windows-1252, also erroneously referred to as "ISO-8859-1" (of which
codepage 1252 happens to be a superset) or "ANSI" (presumably shorthand
for "ANSI/ISO-8859-1").

> Rich Text Format (.rtf)-- one of Microsoft's own file-types

Entirely different thing: This is not just a character encoding, but an
entirely different file format.

> Unicode (.txt ?)

That would be UTF-16 with Little Endian encoding (*).

> Unicode big endian (.txt ?)

That would presumably (the Windows 10 version doesn't have this) be
UTF-16 with Big Endian encoding (*).

(* Not to be confused with UTF-16LE or UTF-16BE, respectively; UTF-16
always has a signature (aka BOM = Byte Order Mark), while UTF-16LE and
UTF-16BE never do.)
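
For illustration, here is how that distinction plays out in practice (a
minimal Python sketch, nothing POV-Ray-specific; Python's codec names
happen to mirror the terminology above):

    text = "A"
    print(text.encode("utf-16-le"))  # b'A\x00'          -- no signature
    print(text.encode("utf-16-be"))  # b'\x00A'          -- no signature
    print(text.encode("utf-16"))     # b'\xff\xfeA\x00'  -- signature first,
                                     # then native byte order (LE here)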

> UTF-8 (.txt ?)

That would presumably (again, the Windows 10 version doesn't have this)
be UTF-8 with signature (aka BOM = Byte Order Mark, though in this
context that term might be misleading; see below).

> The thing is, WORDPAD's 'plain text file' is the ONLY one of its own encodings
> that can be successfully read by POV-Ray as a text include file; all the others
> produce various error messages. Most of those errors are expected-- but even a
> UTF-8 file doesn't work. This is... odd. Perhaps I don't understand how to use
> Unicode files. OR, WORDPAD isn't writing the file correctly??
> 
> Code example:
> #version 3.71;  // using 3.7.1 beta 9
> global_settings {assumed_gamma 1.0 charset utf8}
> #include "text file as UTF8.txt" // Saved as UTF-8. No strings in
>            // the contents, just a single line--  #local R = 45;
> 
> Fatal error result:
> "illegal character in input file, value is ef"
> This happens whether global_settings has charset utf8 or no charset at all.
> (BTW, I can't 'see' the 'ef' value when I open the file.) So it appears that
> WORDPAD is prepending a small header-- a BOM?-- which may not conform to UTF-8
> specs(?)

The UCS specification explicitly and indiscriminately allows for UTF-8
files both with and without a signature. The Unicode specification is
more partial in that it discourages the presence of a signature, but
still allows it.

If present, the signature is (hex) `EF BB BF`, which "happens" to match
the UTF-8 encoding of U+FEFF, the UTF-16 and UTF-32 encodings of which
"happen" to match the signatures of UTF-16 and UTF-32 indicating byte order.

So Wordpad is without fault here (these days, at any rate). It is
POV-Ray that is to blame -- or, rather, whoever added support for
signatures in UTF-8 files, because while they did add such support for
the scene file proper, they completely forgot to do the same for include
files.

The overhauled parser will address this.
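
In other words, the missing piece is roughly the following, shown here
as a minimal Python sketch rather than POV-Ray's actual C++ code
(`open_include` and `default_encoding` are purely illustrative names):

    import codecs

    def open_include(path, default_encoding="ascii"):
        # Peek at the first three bytes; if they are the UTF-8
        # signature EF BB BF, reopen the file as UTF-8 and have the
        # codec skip the signature.
        with open(path, "rb") as f:
            head = f.read(3)
        if head == codecs.BOM_UTF8:  # b'\xef\xbb\xbf'
            return open(path, encoding="utf-8-sig")
        return open(path, encoding=default_encoding)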


> I looked at various Wikipedia pages ("US-ASCII", "Windows WordPad app", "UTF-8",
> "Comparison of Unicode encodings"), but I *still* don't have a full grasp of
> the facts:
> 
> "WordPad for Windows XP [and later?] added full Unicode support, enabling
> WordPad to support multiple languages..."

Yup. It's a pity they've apparently thrown part of this support
overboard again between Windows 7 and Windows 10 (saving as Big-Endian
UTF-16 and saving as UTF-8).

> "[Windows] files saved as Unicode text are encoded as UTF-16 LE.  [not UTF-8,

Nope. It's not UTF-16LE (that would be a file using UTF-16 little-endian
encoding /without/ a signature BOM), but UTF-16 with Little Endian byte
order (carrying a signature BOM).

> unless that is specified] ...Such [Unicode] files normally begin with Byte Order
> Mark (BOM), which communicates the endianness of the file content. Although
> UTF-8 does not suffer from endianness problems, many Windows programs --i.e.
> Notepad-- prepend the contents of UTF-8-encoded files with BOM, to differentiate
> UTF-8 encoding from other 8-bit encodings." [Other 8-bit encodings meaning
> "plain ACSII"?]

No, "other 8-bit encodings" means /any/ encodings where each character
is encoded as a sequence of one or more 8-bit /code units/.

Many such encodings are /compatible/ with ASCII, in that they are
indistinguishable from plain ASCII if the text uses only the 128
printable and control characters from the ASCII character set. However,
such encodings (commonly referred to as "extended ASCII") are legion,
differing in how the remaining 128 code unit values (80 through FF) are
interpreted.
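
To see how ambiguous a bare 8-bit file really is, consider the single
code unit E9 under a few such encodings (a Python sketch; the codepage
choices are just examples):

    raw = b"\xe9"
    print(raw.decode("cp1252"))   # 'é'  (Windows Western)
    print(raw.decode("cp1251"))   # 'й'  (Windows Cyrillic)
    print(raw.decode("cp437"))    # 'Θ'  (original IBM PC codepage)
    try:
        raw.decode("utf-8")       # E9 opens a 3-byte UTF-8 sequence,
    except UnicodeDecodeError:    # so on its own it is invalid
        print("not valid UTF-8")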

There are also 8-bit encodings that are incompatible even with plain
ASCII, the most important examples still in use these days certainly
being EBCDIC and its extensions.


Post a reply to this message

From: Kenneth
Subject: Re: POV-Ray v3.7 charset behaviour
Date: 25 Jun 2018 11:15:00
Message: <web.5b3106724a827126a47873e10@news.povray.org>
clipka <ano### [at] anonymousorg> wrote:
> On 25.06.2018 at 11:41, Kenneth wrote:
>
> > Rich Text Format (.rtf)-- one of Microsoft's own file-types
>
> Entirely different thing: This is not just a character encoding, but an
> entirely different file format.

Yeah, and I even tried #including such a file, just to see what would happen:
After the fatal error, the file contents actually showed up-- a bunch of
Microsoft 'rtf' gibberish as a header.

>
> The UCS specification explicitly and indiscriminately allows for UTF-8
> files both with and without a signature. The Unicode specification is
> more partial in that it discourages the presence of a signature, but
> still allows it.
>
> If present, the signature is (hex) `EF BB BF`...

... so I assume that the error message I see (the "illegal ef value") is the
first byte of that signature. It appears to be, anyway.
>
> So Wordpad is without fault here (these days, at any rate). It is
> POV-Ray that is to blame -- or, rather, whoever added support for
> signatures in UTF-8 files, because while they did add such support for
> the scene file proper, they completely forgot to do the same for include
> files.
>

That's a surprise! Thanks for clarifying. And I was all set to write a
multi-page screed to Mr. Microsoft, detailing my grievances with NOTEPAD :-P

So I assume that NO one has had any luck trying to #include a UTF-8-encoded
file, regardless of the text-editing app that was used.

Just from my own research into all this stuff, it seems that dealing with text
encoding is a real headache for software developers. What a complex mess.


Post a reply to this message

From: Stephen
Subject: Re: POV-Ray v3.7 charset behaviour
Date: 25 Jun 2018 12:39:50
Message: <5b311ad6@news.povray.org>
On 25/06/2018 14:53, Kenneth wrote:
> Stephen <mca### [at] aolcom> wrote:
>>
>> Have you come across Notepad ++ ?
>> https://notepadplusplus.org/
>>
> 
> Ya know, I actually *did* have that app at one time-- on a hard drive that
> failed-- but then I completely forgot about it! Thanks for the memory boost ;-)

Fish oils help. ;-)

If you use XML files, you might find that XML Notepad is not bad.
https://www.microsoft.com/en-gb/download/details.aspx?id=7973



> I'll download it again. (I never expected Windows' own long-time 'core' apps to
> have such a basic text-encoding flaw; apparently neither NOTEPAD nor WORDPAD
> conform to the UTF-8 published standards, as far as I can tell. Maybe the
> standards came later(!)
> 
> 

If at all, at all. :-)

-- 

Regards
     Stephen


Post a reply to this message

From: clipka
Subject: Re: POV-Ray v3.7 charset behaviour
Date: 25 Jun 2018 14:00:06
Message: <5b312da6@news.povray.org>
On 25.06.2018 at 17:12, Kenneth wrote:

>> If present, the signature is (hex) `EF BB BF`...
> 
> ... so I assume that the error message I see (the "illegal ef value") is the
> first byte of that signature. It appears to be, anyway.

Exactly.

>> So Wordpad is without fault here (these days, at any rate). It is
>> POV-Ray that is to blame -- or, rather, whoever added support for
>> signatures in UTF-8 files, because while they did add such support for
>> the scene file proper, they completely forgot to do the same for include
>> files.
...
> So I assume that NO one has had any luck trying to #include a UTF-8-encoded
> file, regardless of the text-editing app that was used.

UTF-8 encoded files /without/ a signature are fine(*). Unfortunately,
Notepad and WordPad can't write those.

(* For certain definitions of "fine", that is; the `global_settings {
charset FOO }` mechanism isn't really useful for mixing files with
different encodings. Also, the editor component of POV-Ray for Windows
doesn't do UTF-8.)

> Just from my own research into all this stuff, it seems that dealing with text
> encoding is a real headache for software developers. What a complex mess.

It's gotten better. The apparent increase in brain-wrecking is just due
to the fact that nowadays it has become worth /trying/. Historically,
the only way to stay sane in a small software project was to just
pretend that there wasn't such a thing as different file encodings,
hoping that the users could work around the fallout. Now proper handling
of text encoding can be as simple as supporting UTF-8 with and without
signature - no need to even worry about plain ASCII separately, since it
is merely a special case of UTF-8 without a signature.
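
In a language with library support, this really is a one-liner these
days. In Python, for instance (purely an illustration; "scene.pov" is a
made-up file name):

    # The 'utf-8-sig' codec silently skips the EF BB BF signature if
    # present; plain ASCII and unsigned UTF-8 decode unchanged, so one
    # codec covers all three cases.
    with open("scene.pov", encoding="utf-8-sig") as f:
        source = f.read()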

It is only when you want to implement /some/ support for legacy
encodings, and do it in a clean way, that things still get a bit tricky.

In POV-Ray for Windows, I consider it pretty evident that besides ASCII
and UTF-8 the parser should also support whatever encoding the editor
module is effectively using. On my machine that's Windows-1252, but I
wouldn't be surprised if that depended on the operating system's locale
settings.


Post a reply to this message

From: Kenneth
Subject: Re: POV-Ray v3.7 charset behaviour
Date: 25 Jun 2018 23:15:00
Message: <web.5b31ae954a827126a47873e10@news.povray.org>
clipka <ano### [at] anonymousorg> wrote:
> On 25.06.2018 at 17:12, Kenneth wrote:

> > So I assume that NO one has had any luck trying to #include a UTF-8-encoded
> > file, regardless of the text-editing app that was used.
>
> UTF-8 encoded files /without/ a signature are fine(*). Unfortunately,
> Notepad and WordPad can't write those.

Ah yes, of course. (I did see that tidbit of information in my research, but it
didn't sink in.) All clear now.
>
> (* For certain definitions of "fine", that is; the `global_settings {
> charset FOO }` mechanism isn't really useful for mixing files with
> different encodings. Also, the editor component of POV-Ray for Windows
> doesn't do UTF-8.)
[snip]
> Now proper handling
> of text encoding can be as simple as supporting UTF-8 with and without
> signature - no need to even worry about plain ASCII, being merely a
> special case of UTF-8 without signature.
>

Going forward:
When you have finished your work on restructuring the parser, and someone wants
to write an #include file using UTF-8 encoding (with or without a BOM), which of
the following two constructs is the proper way to code the scene file/#include
file combo:

A)
Scene file:
global_settings{... charset utf8}
#include "MY FILE.txt" // encoded as UTF-8 but with no charset keyword

OR B):
Scene file:
global_settings{...} // no charset keyword
#include "MY FILE.txt" // encoded as UTF-8 and with its *own*
                       // global_settings{charset utf8}

I'm still a bit confused as to which is correct-- although B) looks like the
logical choice(?). The documentation about 'charset' seems to imply this.


Post a reply to this message

From: clipka
Subject: Re: POV-Ray v3.7 charset behaviour
Date: 26 Jun 2018 09:41:26
Message: <5b324286$1@news.povray.org>
On 26.06.2018 at 05:12, Kenneth wrote:

> Going forward:
> When you have finished your work on restructuring the parser, and someone wants
> to write an #include file using UTF-8 encoding (with or without a BOM), which of
> the following two constructs is the proper way to code the scene file/#include
> file combo:
> 
> A)
> Scene file:
> global_settings{... charset utf8}
> #include "MY FILE.txt" // encoded as UTF-8 but with no charset keyword
> 
> OR B):
> Scene file:
> global_settings{...} // no charset keyword
> #include "MY FILE.txt" // encoded as UTF-8 and with its *own*
>                        // global_settings{charset utf8}
> 
> I'm still a bit confused as to which is correct-- although B) looks like the
> logical choice(?). The documentation about 'charset' seems to imply this.

When I have finished my work?

Probably neither. The `global_settings { charset FOO }` mechanism isn't
really ideal, and I'm pretty sure I'll be deprecating it and introducing
something different, possibly along the following lines:

(1) A signature-based mechanism to auto-detect UTF-8 with signature, and
maybe also UTF-16 and/or UTF-32 (either endian variant).

(2) A `#charset STRING_LITERAL` directive to explicitly specify the
encoding on a per-file basis. This setting would explicitly apply only
to the respective file itself, and would probably have to appear at the
top of the file (right alongside the initial `#version` directive).

(3a.1) An INI setting `Charset_Autodetect=BOOL` to specify whether
POV-Ray should attempt to auto-detect UTF-8 without signature (and maybe
certain other encodings) based on the first non-ASCII byte sequence in
the file.

(3a.2) An INI setting `Charset_Default=STRING` to specify what character
set should be presumed for files that have neither a signature, nor a
`#charset` statement, nor can be recognized based on the first non-ASCII
byte sequence.

-or-

(3b) An INI setting `Charset_Autodetect=STRING_LIST` to specify a list
of character sets, in order of descending preference, to try to
auto-detect based on the first non-ASCII byte sequence in the file.
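
For what it's worth, the signature-based part (1) might look roughly
like this -- a Python sketch of the idea only, not a committed design;
`detect_charset` and its `default` parameter (standing in for
`Charset_Default`) are illustrative names:

    import codecs

    # Order matters: the UTF-32 LE signature FF FE 00 00 begins with
    # the UTF-16 LE signature FF FE, so UTF-32 must be tested first.
    _SIGNATURES = [
        (codecs.BOM_UTF8,     "utf-8-sig"),
        (codecs.BOM_UTF32_LE, "utf-32-le"),
        (codecs.BOM_UTF32_BE, "utf-32-be"),
        (codecs.BOM_UTF16_LE, "utf-16-le"),
        (codecs.BOM_UTF16_BE, "utf-16-be"),
    ]

    def detect_charset(path, default="utf-8"):
        with open(path, "rb") as f:
            head = f.read(4)
        for signature, name in _SIGNATURES:
            if head.startswith(signature):
                return name   # a real reader would also skip the
                              # signature bytes themselves
        return default        # no signature: fall back to the default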


Post a reply to this message

From: Thorsten Froehlich
Subject: Re: POV-Ray v3.7 charset behaviour
Date: 28 Jun 2018 13:30:05
Message: <web.5b351a2a4a8271265315c0590@news.povray.org>
clipka <ano### [at] anonymousorg> wrote:
> On 26.06.2018 at 05:12, Kenneth wrote:
>
> > Going forward:
> > When you have finished your work on restructuring the parser, and someone wants
> > to write an #include file using UTF-8 encoding (with or without a BOM), which of
> > the following two constructs is the proper way to code the scene file/#include
> > file combo:
> >
> > A)
> > Scene file:
> > global_settings{... charset utf8}
> > #include "MY FILE.txt" // encoded as UTF-8 but with no charset keyword
> >
> > OR B):
> > Scene file:
> > global_settings{...} // no charset keyword
> > #include "MY FILE.txt" // encoded as UTF-8 and with its *own*
> >                        // global_settings{charset utf8}
> >
> > I'm still a bit confused as to which is correct-- although B) looks like the
> > logical choice(?). The documentation about 'charset' seems to imply this.
>
> When I have finished my work?
>
> Probably neither. The `global_settings { charset FOO }` mechanism isn't
> really ideal, and I'm pretty sure I'll be deprecating it and introducing
> something different, possibly along the following lines:

Well, the idea of it being inside a file was really simple: to have the encoding
of a file's strings inside the file itself.

Of course, the more meaningful question - now that even Windows 10 Notepad
supports UTF-8 properly - is whether there is any non-legacy (aka Windows
editor) reason not to require input to be UTF-8. It is extremely unlikely that
anything will replace UTF-8 any time soon...


Post a reply to this message

From: Stephen
Subject: Re: POV-Ray v3.7 charset behaviour
Date: 28 Jun 2018 14:12:23
Message: <5b352507$1@news.povray.org>
On 28/06/2018 18:26, Thorsten Froehlich wrote:
> It is extremely unlikely that anything
> will replace UTF-8 any time soon...

Who in their right mind would ever need more than 640K of RAM?

;-)

-- 

Regards
     Stephen


Post a reply to this message

From: Thorsten Froehlich
Subject: Re: POV-Ray v3.7 charset behaviour
Date: 29 Jun 2018 08:15:01
Message: <web.5b3621a24a8271265315c0590@news.povray.org>
Stephen <mca### [at] aolcom> wrote:
> On 28/06/2018 18:26, Thorsten Froehlich wrote:
> > It is extremely unlikely that anything
> > will replace UTF-8 any time soon...
>
> Who in their right mind would ever need more than 640k of ram?
>
> ;-)

ROFL - at least designers of standards got smarter and Unicode is pretty
extensible, though the focus seems to be on more and more emoticons...


Post a reply to this message

