POV-Ray: Newsgroups: povray.beta-test: POV-Ray v3.7 charset behaviour

POV-Ray : Newsgroups : povray.beta-test : POV-Ray v3.7 charset behaviour		Server Time 25 Apr 2024 18:42:00 EDT (-0400)

<<< Previous 10 Messages

Goto Initial 10 Messages

From: Kenneth
Subject: Re: POV-Ray v3.7 charset behaviour
Date: 25 Jun 2018 11:15:00
Message: <web.5b3106724a827126a47873e10@news.povray.org>

clipka <ano### [at] anonymousorg> wrote:
> Am 25.06.2018 um 11:41 schrieb Kenneth:
>
> > Rich Text Format (.rtf)-- one of Microsoft's own file-types
>
> Entirely different thing: This is not just a character encoding, but an
> entirely different file format.

Yeah, and I even tried #including such a file, just to see what would happen:
After the fatal error, the file contents actually showed up-- a bunch of
Microsoft 'rtf' gibberish as a header.

>
> The UCS specification explicitly and indiscriminately allows for UTF-8
> files both with and without a signature. The Unicode specification is
> more partial in that it discourages the presence of a signature, but
> still allows it.
>
> If present, the signature is (hex) `EF BB BF`...

.... so I assume that the error message I see (the "illegal ef value") is the
first part of that hex code. It appears to be, anyway.
>
> So Wordpad is without fault here (these days, at any rate). It is
> POV-Ray that is to blame -- or, rather, whoever added support for
> signatures in UTF-8 files, because while they did add such support for
> the scene file proper, they completely forgot to do the same for include
> files.
>

That's a surprise! Thanks for clarifying. And I was all set to write a
multi-page screed to Mr. Microsoft, detailing my grievances with NOTEPAD :-P

So I assume that NO one has had any luck trying to #include a UTF-8 -encoded
file, regardless of the text-editing app that was used.

Just from my own research into all this stuff, it seems that dealing with text
encoding is a real headache for software developers. What a complex mess.

Post a reply to this message

From: Stephen
Subject: Re: POV-Ray v3.7 charset behaviour
Date: 25 Jun 2018 12:39:50
Message: <5b311ad6@news.povray.org>

On 25/06/2018 14:53, Kenneth wrote:
> Stephen <mca### [at] aolcom> wrote:
>>
>> Have you come across Notepad ++ ?
>> httos://noteoadplusplus.oro/
>>
> 
> Ya know, I actually *did* have that app at one time-- on a hard drive that
> failed-- but then I completely forgot about it! Thanks for the memory boost ;-)

Fish oils help. ;-)

If you use XML files. You might find that XML Notepad is not bad.
https://www.microsoft.com/en-gb/download/details.aspx?id=7973



> I'll download it again. (I never expected Windows' own long-time 'core' apps to
> have such a basic text-encoding flaw; apparently neither NOTEPAD nor WORDPAD
> conform to the UTF-8 published standards, as far as I can tell. Maybe the
> standards came later(!)
> 
> 

If at all, at all. :-)

-- 

Regards
     Stephen

Post a reply to this message

From: clipka
Subject: Re: POV-Ray v3.7 charset behaviour
Date: 25 Jun 2018 14:00:06
Message: <5b312da6@news.povray.org>

Am 25.06.2018 um 17:12 schrieb Kenneth:

>> If present, the signature is (hex) `EF BB BF`...
> 
> ..... so I assume that the error message I see (the "illegal ef value") is the
> first part of that hex code. It appears to be, anyway.

Exactly.

>> So Wordpad is without fault here (these days, at any rate). It is
>> POV-Ray that is to blame -- or, rather, whoever added support for
>> signatures in UTF-8 files, because while they did add such support for
>> the scene file proper, they completely forgot to do the same for include
>> files.
...
> So I assume that NO one has had any luck trying to #include a UTF-8 -encoded
> file, regardless of the text-editing app that was used.

UTF-8 encoded files /without/ a signature are fine(*). Unfortunately,
Notepad and WordPad can't write those.

(* For certain definitions of "fine", that is; the `global_settings {
charset FOO }` mechanism isn't really useful for mixing files with
different encodings. Also, the editor component of POV-Ray for Windows
doesn't do UTF-8.)

> Just from my own research into all this stuff, it seems that dealing with text
> encoding is a real headache for software developers. What a complex mess.

It's gotten better. The apparent increase in brain-wrecking is just due
to the fact that nowadays it has become worth /trying/. Historically,
the only way to stay sane in a small software project was to just
pretend that there wasn't such a thing as different file encodings,
hoping that the users could work around the fallout. Now proper handling
of text encoding can be as simple as supporting UTF-8 with and without
signature - no need to even worry about plain ASCII, being merely a
special case of UTF-8 without signature.

It is only when you want to implement /some/ support for legacy
encodings, and do it in a clean way, that things still get a bit tricky.

In POV-Ray for Windows, I consider it pretty evident that besides ASCII
and UTF-8 the parser should also support whatever encoding the editor
module is effectively using. On my machine that's Windows-1252, but I
wouldn't be surprised if that depended on the operating system's locale
settings.

Post a reply to this message

From: Kenneth
Subject: Re: POV-Ray v3.7 charset behaviour
Date: 25 Jun 2018 23:15:00
Message: <web.5b31ae954a827126a47873e10@news.povray.org>

clipka <ano### [at] anonymousorg> wrote:
> Am 25.06.2018 um 17:12 schrieb Kenneth:

> > So I assume that NO one has had any luck trying to #include a UTF-8 -encoded
> > file, regardless of the text-editing app that was used.
>
> UTF-8 encoded files /without/ a signature are fine(*). Unfortunately,
> Notepad and WordPad can't write those.

Ah yes, of course. (I did see that tidbit of information in my research, but it
didn't sink in.) All clear now.
>
> (* For certain definitions of "fine", that is; the `global_settings {
> charset FOO }` mechanism isn't really useful for mixing files with
> different encodings. Also, the editor component of POV-Ray for Windows
> doesn't do UTF-8.)
[snip]
> Now proper handling
> of text encoding can be as simple as supporting UTF-8 with and without
> signature - no need to even worry about plain ASCII, being merely a
> special case of UTF-8 without signature.
>

Going forward:
When you have finished your work on restructuring the parser, and someone wants
to write an #include file using UTF-8 encoding (with or without a BOM): Which of
the following two contructs is the proper way to code the scene file/#include
file combo:

A)
Scene file:
global_settings{... charset utf8}
#include "MY FILE.txt" // encoded as UTF-8 but with no charset keyword

OR B):
scene file:
global_settings{...) // no charset keyword
#include "MY FILE.txt" // encoded as UTF-8 and with its *own*
                       // global_settings{charset utf8}

I'm still a bit confused as to which is correct-- although B) looks like the
logical choice(?). The documentation about 'charset' seems to imply this.

Post a reply to this message

From: clipka
Subject: Re: POV-Ray v3.7 charset behaviour
Date: 26 Jun 2018 09:41:26
Message: <5b324286$1@news.povray.org>

Am 26.06.2018 um 05:12 schrieb Kenneth:

> Going forward:
> When you have finished your work on restructuring the parser, and someone wants
> to write an #include file using UTF-8 encoding (with or without a BOM): Which of
> the following two contructs is the proper way to code the scene file/#include
> file combo:
> 
> A)
> Scene file:
> global_settings{... charset utf8}
> #include "MY FILE.txt" // encoded as UTF-8 but with no charset keyword
> 
> OR B):
> scene file:
> global_settings{...) // no charset keyword
> #include "MY FILE.txt" // encoded as UTF-8 and with its *own*
>                        // global_settings{charset utf8}
> 
> I'm still a bit confused as to which is correct-- although B) looks like the
> logical choice(?). The documentation about 'charset' seems to imply this.

When I have finished my work?

Probably neither. The `global_settings { charset FOO }` mechanism isn't
really ideal, and I'm pretty sure I'll be deprecating it and introducing
something different, possibly along the following lines:

(1) A signature-based mechanism to auto-detect UTF-8 with signature, and
maybe also UTF-16 and/or UTF-32 (either endian variant).

(2) A `#charset STRING_LITERAL` directive to explicitly specify the
encoding on a per-file basis. This setting would explicitly apply only
to the respective file itself, and would probably have to appear at the
top of the file (right alongside the initial `#version` directive).

(3a.1) An INI setting `Charset_Autodetect=BOOL` to specify whether
POV-Ray should attempt to auto-detect UTF-8 without signature (and maybe
certain other encodings) based on the first non-ASCII byte sequence in
the file.

(3a.2) An INI setting `Charset_Default=STRING` to specify what character
set should be presumed for files that have neither a signature, nor a
`#charset` statement, nor can be recognized based on the first non-ASCII
byte sequence.

-or-

(3b) An INI setting `Charset_Autodetect=STRING_LIST` to specify a list
of character sets, in order of descending preference, to try to
auto-detect based on the first non-ASCII byte sequence in the file.

Post a reply to this message

From: Thorsten Froehlich
Subject: Re: POV-Ray v3.7 charset behaviour
Date: 28 Jun 2018 13:30:05
Message: <web.5b351a2a4a8271265315c0590@news.povray.org>

clipka <ano### [at] anonymousorg> wrote:
> Am 26.06.2018 um 05:12 schrieb Kenneth:
>
> > Going forward:
> > When you have finished your work on restructuring the parser, and someone wants
> > to write an #include file using UTF-8 encoding (with or without a BOM): Which of
> > the following two contructs is the proper way to code the scene file/#include
> > file combo:
> >
> > A)
> > Scene file:
> > global_settings{... charset utf8}
> > #include "MY FILE.txt" // encoded as UTF-8 but with no charset keyword
> >
> > OR B):
> > scene file:
> > global_settings{...) // no charset keyword
> > #include "MY FILE.txt" // encoded as UTF-8 and with its *own*
> >                        // global_settings{charset utf8}
> >
> > I'm still a bit confused as to which is correct-- although B) looks like the
> > logical choice(?). The documentation about 'charset' seems to imply this.
>
> When I have finished my work?
>
> Probably neither. The `global_settings { charset FOO }` mechanism isn't
> really ideal, and I'm pretty sure I'll be deprecating it and introducing
> something different, possibly along the following lines:

Well, the idea of it being inside a file was really simple: To have the encoding
of a file's strings inside the file.

Of course, the more meaningful question - nowadays that even Windows 10 Notepad
supports UTF-8 properly - is if there is any non-legacy (aka Windows editor)
reason not to require input to be UTF-8. It is extremely unlikely that anything
will replace UTF-8 any time soon...

Post a reply to this message

From: Stephen
Subject: Re: POV-Ray v3.7 charset behaviour
Date: 28 Jun 2018 14:12:23
Message: <5b352507$1@news.povray.org>

On 28/06/2018 18:26, Thorsten Froehlich wrote:
> It is extremely unlikely that anything
> will replace UTF-8 any time soon...

Who in their right mind would ever need more than 640k of ram?

;-)

-- 

Regards
     Stephen

Post a reply to this message

From: Thorsten Froehlich
Subject: Re: POV-Ray v3.7 charset behaviour
Date: 29 Jun 2018 08:15:01
Message: <web.5b3621a24a8271265315c0590@news.povray.org>

Stephen <mca### [at] aolcom> wrote:
> On 28/06/2018 18:26, Thorsten Froehlich wrote:
> > It is extremely unlikely that anything
> > will replace UTF-8 any time soon...
>
> Who in their right mind would ever need more than 640k of ram?
>
> ;-)

ROFL - at least designers of standards got smarter and Unicode is pretty
extensible, though the focus seems to be on more and more emoticons...

Post a reply to this message

From: clipka
Subject: Re: POV-Ray v3.7 charset behaviour
Date: 29 Jun 2018 08:26:33
Message: <5b362579$1@news.povray.org>

Am 28.06.2018 um 19:26 schrieb Thorsten Froehlich:

>> Probably neither. The `global_settings { charset FOO }` mechanism isn't
>> really ideal, and I'm pretty sure I'll be deprecating it and introducing
>> something different, possibly along the following lines:
> 
> Well, the idea of it being inside a file was really simple: To have the encoding
> of a file's strings inside the file.

The problem there is that the setting in the file also governs the
presumed encoding of strings inside /other/ files, namely include files.


> Of course, the more meaningful question - nowadays that even Windows 10 Notepad
> supports UTF-8 properly - is if there is any non-legacy (aka Windows editor)
> reason not to require input to be UTF-8. It is extremely unlikely that anything
> will replace UTF-8 any time soon...

I pretty much agree there. But as long as we're rolling out legacy tools
(aka Windows editor) ourselves, we ought to support it properly.

Post a reply to this message

From: Stephen
Subject: Re: POV-Ray v3.7 charset behaviour
Date: 29 Jun 2018 09:00:45
Message: <5b362d7d$1@news.povray.org>

On 29/06/2018 13:10, Thorsten Froehlich wrote:
> Stephen <mca### [at] aolcom> wrote:
>> On 28/06/2018 18:26, Thorsten Froehlich wrote:
>>> It is extremely unlikely that anything
>>> will replace UTF-8 any time soon...
>>
>> Who in their right mind would ever need more than 640k of ram?
>>
>> ;-)
> 
> ROFL - at least designers of standards got smarter and Unicode is pretty
> extensible, though the focus seems to be on more and more emoticons...
> 
> 
> 
> 
I could not resist it. It is a classic (rebuff).

I think that in English, emoticons are not a bad idea*. As English can 
be quite ambiguous.
Warp seldom used them and I was always misunderstanding his meaning. :-(

For instance. Replying "interesting". Does not mean "It is interesting" 
but "Not worth perusing, it is not a good idea".

*
I don't mean crying cats and the like.

-- 

Regards
     Stephen

Post a reply to this message

<<< Previous 10 Messages

Goto Initial 10 Messages