clipka <ano### [at] anonymousorg> wrote:
> On 25.06.2018 at 11:41, Kenneth wrote:
>
> > Rich Text Format (.rtf)-- one of Microsoft's own file-types
>
> Entirely different thing: This is not just a character encoding, but an
> entirely different file format.
Yeah, and I even tried #including such a file, just to see what would happen:
After the fatal error, the file contents actually showed up-- a bunch of
Microsoft 'rtf' gibberish as a header.
>
> The UCS specification explicitly and indiscriminately allows for UTF-8
> files both with and without a signature. The Unicode specification is
> more partial in that it discourages the presence of a signature, but
> still allows it.
>
> If present, the signature is (hex) `EF BB BF`...
.... so I assume that the error message I see (the "illegal ef value") is the
first part of that hex code. It appears to be, anyway.
>
> So Wordpad is without fault here (these days, at any rate). It is
> POV-Ray that is to blame -- or, rather, whoever added support for
> signatures in UTF-8 files, because while they did add such support for
> the scene file proper, they completely forgot to do the same for include
> files.
>
That's a surprise! Thanks for clarifying. And I was all set to write a
multi-page screed to Mr. Microsoft, detailing my grievances with NOTEPAD :-P
So I assume that NO one has had any luck trying to #include a UTF-8-encoded
file, regardless of the text-editing app that was used.
Just from my own research into all this stuff, it seems that dealing with text
encoding is a real headache for software developers. What a complex mess.
On 25/06/2018 14:53, Kenneth wrote:
> Stephen <mca### [at] aolcom> wrote:
>>
>> Have you come across Notepad++?
>> https://notepad-plus-plus.org/
>>
>
> Ya know, I actually *did* have that app at one time-- on a hard drive that
> failed-- but then I completely forgot about it! Thanks for the memory boost ;-)
Fish oils help. ;-)
If you use XML files, you might find that XML Notepad is not bad.
https://www.microsoft.com/en-gb/download/details.aspx?id=7973
> I'll download it again. (I never expected Windows' own long-time 'core' apps to
> have such a basic text-encoding flaw; apparently neither NOTEPAD nor WORDPAD
> conform to the published UTF-8 standards, as far as I can tell. Maybe the
> standards came later!)
>
>
If at all, at all. :-)
--
Regards
Stephen
On 25.06.2018 at 17:12, Kenneth wrote:
>> If present, the signature is (hex) `EF BB BF`...
>
> ..... so I assume that the error message I see (the "illegal ef value") is the
> first part of that hex code. It appears to be, anyway.
Exactly.
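Purely as an illustration (the scene text is arbitrary; `EF BB BF` is the
signature quoted above, the rest is plain ASCII): a signed UTF-8 file whose
text starts with `#version` begins like this in a hex viewer:

  EF BB BF 23 76 65 72 73 69 6F 6E ...
  \______/  #  v  e  r  s  i  o  n
  signature

The include-file reader trips over that very first `EF` byte, which is
where the "illegal ef value" comes from.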
>> So Wordpad is without fault here (these days, at any rate). It is
>> POV-Ray that is to blame -- or, rather, whoever added support for
>> signatures in UTF-8 files, because while they did add such support for
>> the scene file proper, they completely forgot to do the same for include
>> files.
...
> So I assume that NO one has had any luck trying to #include a UTF-8-encoded
> file, regardless of the text-editing app that was used.
UTF-8 encoded files /without/ a signature are fine(*). Unfortunately,
Notepad and WordPad can't write those.
(* For certain definitions of "fine", that is; the `global_settings {
charset FOO }` mechanism isn't really useful for mixing files with
different encodings. Also, the editor component of POV-Ray for Windows
doesn't do UTF-8.)
> Just from my own research into all this stuff, it seems that dealing with text
> encoding is a real headache for software developers. What a complex mess.
It's gotten better. The apparent increase in brain-wrecking is just due
to the fact that nowadays it has become worth /trying/. Historically,
the only way to stay sane in a small software project was to just
pretend that there wasn't such a thing as different file encodings,
hoping that the users could work around the fallout. Now proper handling
of text encoding can be as simple as supporting UTF-8 with and without
signature - no need to even worry about plain ASCII, which is merely a
special case of UTF-8 without signature.
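To make that special case concrete (the snippet is arbitrary; the byte
values are standard Unicode/ASCII facts):

  // A pure-ASCII line like this one is byte-for-byte identical whether
  // the file is saved as ASCII or as UTF-8 without signature:
  sphere { <0,0,0>, 1 }
  // Only characters beyond U+007F differ; in a string literal such as
  // "café", the 'é' is encoded as the two bytes C3 A9 in UTF-8.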
It is only when you want to implement /some/ support for legacy
encodings, and do it in a clean way, that things still get a bit tricky.
In POV-Ray for Windows, I consider it pretty evident that besides ASCII
and UTF-8 the parser should also support whatever encoding the editor
module is effectively using. On my machine that's Windows-1252, but I
wouldn't be surprised if that depended on the operating system's locale
settings.
clipka <ano### [at] anonymousorg> wrote:
> On 25.06.2018 at 17:12, Kenneth wrote:
> > So I assume that NO one has had any luck trying to #include a UTF-8-encoded
> > file, regardless of the text-editing app that was used.
>
> UTF-8 encoded files /without/ a signature are fine(*). Unfortunately,
> Notepad and WordPad can't write those.
Ah yes, of course. (I did see that tidbit of information in my research, but it
didn't sink in.) All clear now.
>
> (* For certain definitions of "fine", that is; the `global_settings {
> charset FOO }` mechanism isn't really useful for mixing files with
> different encodings. Also, the editor component of POV-Ray for Windows
> doesn't do UTF-8.)
[snip]
> Now proper handling
> of text encoding can be as simple as supporting UTF-8 with and without
> signature - no need to even worry about plain ASCII, being merely a
> special case of UTF-8 without signature.
>
Going forward:
When you have finished your work on restructuring the parser, and someone wants
to write an #include file using UTF-8 encoding (with or without a BOM): Which of
the following two constructs is the proper way to code the scene file/#include
file combo:
A)
Scene file:
global_settings{... charset utf8}
#include "MY FILE.txt" // encoded as UTF-8 but with no charset keyword
OR B):
scene file:
global_settings{...} // no charset keyword
#include "MY FILE.txt" // encoded as UTF-8 and with its *own*
// global_settings{charset utf8}
I'm still a bit confused as to which is correct-- although B) looks like the
logical choice(?). The documentation about 'charset' seems to imply this.
On 26.06.2018 at 05:12, Kenneth wrote:
> Going forward:
> When you have finished your work on restructuring the parser, and someone wants
> to write an #include file using UTF-8 encoding (with or without a BOM): Which of
> the following two constructs is the proper way to code the scene file/#include
> file combo:
>
> A)
> Scene file:
> global_settings{... charset utf8}
> #include "MY FILE.txt" // encoded as UTF-8 but with no charset keyword
>
> OR B):
> scene file:
> global_settings{...} // no charset keyword
> #include "MY FILE.txt" // encoded as UTF-8 and with its *own*
> // global_settings{charset utf8}
>
> I'm still a bit confused as to which is correct-- although B) looks like the
> logical choice(?). The documentation about 'charset' seems to imply this.
When I have finished my work?
Probably neither. The `global_settings { charset FOO }` mechanism isn't
really ideal, and I'm pretty sure I'll be deprecating it and introducing
something different, possibly along the following lines:
(1) A signature-based mechanism to auto-detect UTF-8 with signature, and
maybe also UTF-16 and/or UTF-32 (either endian variant).
(2) A `#charset STRING_LITERAL` directive to explicitly specify the
encoding on a per-file basis. This setting would explicitly apply only
to the respective file itself, and would probably have to appear at the
top of the file (right alongside the initial `#version` directive).
(3a.1) An INI setting `Charset_Autodetect=BOOL` to specify whether
POV-Ray should attempt to auto-detect UTF-8 without signature (and maybe
certain other encodings) based on the first non-ASCII byte sequence in
the file.
(3a.2) An INI setting `Charset_Default=STRING` to specify what character
set should be presumed for files that have neither a signature, nor a
`#charset` statement, nor can be recognized based on the first non-ASCII
byte sequence.
-or-
(3b) An INI setting `Charset_Autodetect=STRING_LIST` to specify a list
of character sets, in order of descending preference, to try to
auto-detect based on the first non-ASCII byte sequence in the file.
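To illustrate -- purely hypothetical, since none of this is implemented
yet and the exact names could still change -- an include file under
proposal (2) might start with:

  #version 3.8; // version number illustrative
  #charset "utf-8" // would apply to this file only
  // ... any string literals below would then be read as UTF-8

and an INI file under proposal (3a) might contain:

  Charset_Autodetect=on
  Charset_Default=windows-1252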
clipka <ano### [at] anonymousorg> wrote:
> On 26.06.2018 at 05:12, Kenneth wrote:
>
> > Going forward:
> > When you have finished your work on restructuring the parser, and someone wants
> > to write an #include file using UTF-8 encoding (with or without a BOM): Which of
> > the following two constructs is the proper way to code the scene file/#include
> > file combo:
> >
> > A)
> > Scene file:
> > global_settings{... charset utf8}
> > #include "MY FILE.txt" // encoded as UTF-8 but with no charset keyword
> >
> > OR B):
> > scene file:
> > global_settings{...} // no charset keyword
> > #include "MY FILE.txt" // encoded as UTF-8 and with its *own*
> > // global_settings{charset utf8}
> >
> > I'm still a bit confused as to which is correct-- although B) looks like the
> > logical choice(?). The documentation about 'charset' seems to imply this.
>
> When I have finished my work?
>
> Probably neither. The `global_settings { charset FOO }` mechanism isn't
> really ideal, and I'm pretty sure I'll be deprecating it and introducing
> something different, possibly along the following lines:
Well, the idea of it being inside a file was really simple: To have the encoding
of a file's strings inside the file.
Of course, the more meaningful question - now that even Windows 10's Notepad
supports UTF-8 properly - is whether there is any non-legacy (aka Windows
editor) reason not to require input to be UTF-8. It is extremely unlikely that anything
will replace UTF-8 any time soon...
On 28/06/2018 18:26, Thorsten Froehlich wrote:
> It is extremely unlikely that anything
> will replace UTF-8 any time soon...
Who in their right mind would ever need more than 640K of RAM?
;-)
--
Regards
Stephen
Stephen <mca### [at] aolcom> wrote:
> On 28/06/2018 18:26, Thorsten Froehlich wrote:
> > It is extremely unlikely that anything
> > will replace UTF-8 any time soon...
>
> Who in their right mind would ever need more than 640K of RAM?
>
> ;-)
ROFL - at least designers of standards got smarter and Unicode is pretty
extensible, though the focus seems to be on more and more emoticons...
On 28.06.2018 at 19:26, Thorsten Froehlich wrote:
>> Probably neither. The `global_settings { charset FOO }` mechanism isn't
>> really ideal, and I'm pretty sure I'll be deprecating it and introducing
>> something different, possibly along the following lines:
>
> Well, the idea of it being inside a file was really simple: To have the encoding
> of a file's strings inside the file.
The problem there is that the setting in the file also governs the
presumed encoding of strings inside /other/ files, namely include files.
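A sketch of the pitfall (file names hypothetical, using the current
`charset` syntax):

  // scene.pov -- saved as UTF-8, and declaring as much:
  global_settings { charset utf8 }
  // ...but that setting also governs this include file; if its author
  // saved it as Windows-1252, an 'é' in it is the single byte E9, which
  // is not a valid UTF-8 sequence, so its strings parse wrong:
  #include "strings_1252.inc"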
> Of course, the more meaningful question - nowadays that even Windows 10 Notepad
> supports UTF-8 properly - is if there is any non-legacy (aka Windows editor)
> reason not to require input to be UTF-8. It is extremely unlikely that anything
> will replace UTF-8 any time soon...
I pretty much agree there. But as long as we're rolling out a legacy tool
(aka Windows editor) ourselves, we ought to support it properly.
On 29/06/2018 13:10, Thorsten Froehlich wrote:
> Stephen <mca### [at] aolcom> wrote:
>> On 28/06/2018 18:26, Thorsten Froehlich wrote:
>>> It is extremely unlikely that anything
>>> will replace UTF-8 any time soon...
>>
>> Who in their right mind would ever need more than 640k of ram?
>>
>> ;-)
>
> ROFL - at least designers of standards got smarter and Unicode is pretty
> extensible, though the focus seems to be on more and more emoticons...
>
I could not resist it. It is a classic (rebuff).
I think that in English, emoticons are not a bad idea*, as English can
be quite ambiguous.
Warp seldom used them and I was always misunderstanding his meaning. :-(
For instance, replying "interesting" does not mean "It is interesting"
but "Not worth pursuing; it is not a good idea".
*
I don't mean crying cats and the like.
--
Regards
Stephen