  The trouble with XSLT (Message 21 to 30 of 84)  
From: Orchid Win7 v1
Subject: Re: The trouble with XSLT
Date: 22 Feb 2012 16:28:49
Message: <4f455e11$1@news.povray.org>
>>> Ah, so your /real/ problem is a crappy text editor.
>>
>> The /real/ problem is that there is no way of knowing what encoding a
>> given text file has. So it's not safe to use non-ASCII characters in a
>> text file. Which is why character entities were invented in the first
>> place...
>
> Use an editor that places a BOM at the start of UTF-8 files and Bob's
> your uncle.

In other words, "there is this informal undocumented /convention/ that 
if a file starts with a BOM [even though UTF-8 does not require such a 
mark, since there /is/ no byte order], then it is presumed to contain 
UTF-8".

I wonder how widely implemented this undocumented feature is?



From: clipka
Subject: Re: The trouble with XSLT
Date: 22 Feb 2012 16:47:21
Message: <4f456269$1@news.povray.org>
Am 22.02.2012 22:28, schrieb Orchid Win7 v1:

>> Use an editor that places a BOM at the start of UTF-8 files and Bob's
>> your uncle.
>
> In other words, "there is this informal undocumented /convention/ that
> if a file starts with a BOM [even though UTF-8 does not require such a
> mark, since there /is/ no byte order], then it is presumed to contain
> UTF-8".
>
> I wonder how widely implemented this undocumented feature is?

Most likely more widespread than you think.

As an alternative, forget UTF-8 and go for UTF-16.



From: Invisible
Subject: Re: The trouble with XSLT
Date: 23 Feb 2012 04:11:08
Message: <4f4602ac$1@news.povray.org>
>> I wonder how widely implemented this undocumented feature is?
>
> Most likely more widespread than you think.

I got the distinct impression that this is a Windows-specific 
convention. (Doesn't Linux do something strange with using environment 
variables to define the "system locale"?)

> As an alternative, forget UTF-8 and go for UTF-16.

Or that, I suppose.

This still leaves me with the problem of how to generate unusual 
characters in the first place. Typing &rarr; is pretty simple. Figuring 
out how to actually generate the arrow character is not.
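
For readers who want the literal character rather than the entity, here is a minimal Python sketch of three ways to produce it. It assumes the arrow in question is U+2192 RIGHTWARDS ARROW (the character that the HTML entity &rarr; stands for); the helper name describe is made up for illustration.

```python
import html
import unicodedata

# Three spellings of the same code point, U+2192 RIGHTWARDS ARROW.
by_number = chr(0x2192)               # from the numeric code point
by_name = "\N{RIGHTWARDS ARROW}"      # from the Unicode character name
by_entity = html.unescape("&rarr;")   # from the HTML named entity


def describe(ch: str) -> str:
    """Return a 'U+XXXX NAME' label for a single character (illustrative helper)."""
    return f"U+{ord(ch):04X} {unicodedata.name(ch)}"


print(describe(by_number))
print(by_number == by_name == by_entity)
```

Once the character is on screen it can be copied into the document, provided the file is then saved in an encoding that can represent it.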



From: Le Forgeron
Subject: Re: The trouble with XSLT
Date: 23 Feb 2012 04:39:02
Message: <4f460936$1@news.povray.org>
Le 23/02/2012 10:11, Invisible a écrit :
> This still leaves me with the problem of how to generate unusual
> characters in the first place. Typing &rarr; is pretty simple. Figuring
> out how to actually generate the arrow character is not.

You just need the right documentation.
Either you search for it on the web each time (like "utf-8 chartable"...),

or you call it by its proper name and ask for Unicode (UTF-8 is just one
representation; Unicode is what you really want), and print out the
parts that you need often.

See http://unicode.org/charts/

And soon you will discover that there is no single font to display all
possible unicode glyphs.

You also need a unicode-compatible editor...



From: Invisible
Subject: Re: The trouble with XSLT
Date: 23 Feb 2012 05:57:34
Message: <4f461b9e$1@news.povray.org>
On 23/02/2012 09:39 AM, Le_Forgeron wrote:
> Le 23/02/2012 10:11, Invisible a écrit :
>> This still leaves me with the problem of how to generate unusual
>> characters in the first place. Typing &rarr; is pretty simple. Figuring
>> out how to actually generate the arrow character is not.
>
> You just need the right documentation.

Remembering to type &#x2192; is much harder than remembering to type 
&rarr;. It's also far less readable. Working out how to generate the 
actual character so you can copy and paste it into your document is far 
slower than just typing some stuff with your keyboard. (It involves 
moving your hand to the mouse, for one thing...)

> and get to print the parts that you need often.

Well, yeah, I guess that's what it comes down to. Sadly.

> And soon you will discover that there is no single font to display all
> possible unicode glyphs.

Yeah, but nobody /needs/ all possible Unicode glyphs. I'm never going to 
write stuff in Linear B or Ogham. I just need a small handful of 
non-ASCII characters - most of which /are/ widely supported in many fonts.

> You also need a unicode-compatible editor...

I've got that. The problem isn't the editor handling Unicode, it's 
figuring out which encoding an arbitrary text file happens to use. It 
seems as soon as you use any encoding other than the Windows default 
(whatever the hell that is), things get messy, rapidly.



From: Le Forgeron
Subject: Re: The trouble with XSLT
Date: 23 Feb 2012 06:53:15
Message: <4f4628ab$1@news.povray.org>
Le 23/02/2012 11:57, Invisible a écrit :
> It seems as soon as you use any encoding other than the Windows default
> (whatever the hell that is), things get messy, rapidly.


On Windows, yes. Even two Windows machines can disagree once you are
outside Windows' natural country (around Redmond, I believe). For
example, at work we have some English Windows installations and some
French ones, and they default to different code pages. (Code pages date
back to DOS, a time when printers had micro-switches to select
different mappings, not only for the 127-255 range but also inside the
33-126 range, and none of those usually matched the local code page
exactly either.)

Windows chose the UTF-16 route at one point (hence the wchar in the
API), but that doubles the size of an ASCII file, and the integration
is still far from done (a lot of code still expects 8-bit chars).
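
The size claim is easy to check. A small Python sketch, using a made-up sample string and the little-endian variant of UTF-16 (chosen only so that no BOM is counted):

```python
# Pure-ASCII text: one byte per character in UTF-8, two in UTF-16.
text = "plain ASCII text"
utf8 = text.encode("utf-8")
utf16 = text.encode("utf-16-le")  # explicit byte order, so no BOM is added
print(len(utf8), len(utf16))

# The picture changes for non-ASCII characters: U+2192 takes three bytes
# in UTF-8 but only two in UTF-16.
arrow = "\u2192"
print(len(arrow.encode("utf-8")), len(arrow.encode("utf-16-le")))
```

So the doubling applies to ASCII-heavy files; text dominated by characters above U+07FF can actually come out smaller in UTF-16.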

Maybe you could use a forked resource (what NTFS calls an alternate
data stream) to tag the file with its encoding. A forked resource is
addressed as the filename followed by a colon and a name; it is usually
invisible, yet stays associated with the file. (Yet another kludge from
the west of America: the text names the resource, and the value is
stored in the associated "file".)
Maybe you can search the web about that, and find a program that lets
you handle forked resources.
As long as the file stays on an NTFS filesystem, it's safe.



From: clipka
Subject: Re: The trouble with XSLT
Date: 23 Feb 2012 06:55:55
Message: <4f46294b$1@news.povray.org>
Am 23.02.2012 10:11, schrieb Invisible:
>>> I wonder how widely implemented this undocumented feature is?
>>
>> Most likely more widespread than you think.
>
> I got the distinct impression that this is a Windows-specific
> convention. (Doesn't Linux do something strange with using environment
> variables to define the "system locale"?)

From the current XML spec:

"Entities encoded in UTF-16 MUST and entities encoded in UTF-8 MAY begin 
with the Byte Order Mark described by Annex H of [ISO/IEC 10646:2000], 
section 16.8 of [Unicode] (the ZERO WIDTH NO-BREAK SPACE character, 
#xFEFF)."

As for this being supported by editors, there are two possible cases:

(1) The editor treats UTF-8 with a leading BOM as a special encoding; in 
that case, it will strip the BOM from the character stream upon reading, 
and prepend it upon writing, so you're perfectly safe here.

(2) The editor does not expect a leading BOM in UTF-8; in that case, it 
/must/ treat it according to the Unicode standard, which explicitly 
states that the BOM is actually a perfectly valid normal character, 
which just happens to be one of the many space characters, non-breaking 
in this case, with zero width; so you're perfectly safe here as well, 
unless you accidentally strip it from the very beginning of the file.
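
The two cases can be imitated with Python's standard codecs, as a rough sketch: "utf-8-sig" behaves like a BOM-aware editor, while plain "utf-8" behaves like one that does not expect a BOM. The sample XML declaration is made up for illustration.

```python
import codecs

# A UTF-8 file that starts with a BOM, as raw bytes.
raw = codecs.BOM_UTF8 + "<?xml version='1.0'?>".encode("utf-8")

# Case (1): a BOM-aware reader strips the BOM from the character stream.
bom_aware = raw.decode("utf-8-sig")

# Case (2): a BOM-unaware reader keeps it as an ordinary U+FEFF character.
bom_unaware = raw.decode("utf-8")

print(repr(bom_aware[0]), repr(bom_unaware[0]))

# Writing back: "utf-8-sig" prepends the BOM again, and plain "utf-8"
# preserves the U+FEFF that was never stripped. Both round-trip the file.
print(bom_aware.encode("utf-8-sig") == raw)
print(bom_unaware.encode("utf-8") == raw)
```

Either way the bytes survive a load and save, which is the sense in which both cases are safe.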



From: clipka
Subject: Re: The trouble with XSLT
Date: 23 Feb 2012 07:13:03
Message: <4f462d4f$1@news.povray.org>
Am 23.02.2012 11:57, schrieb Invisible:

>> You also need a unicode-compatible editor...
>
> I've got that. The problem isn't the editor handling Unicode, it's
> figuring out which encoding an arbitrary text file happens to use. It
> seems as soon as you use any encoding other than the Windows default
> (whatever the hell that is), things get messy, rapidly.

That's because figuring out the encoding of an extended-ASCII text file 
is, in fact, virtually impossible (unless you know details about the 
contents, e.g. you can recognize the encoding if you know that it's an 
HTML file), due to the fact that none of them has a standardized file 
signature. The sole exception is UTF-8 with a leading BOM, where the 
BOM character can double as such a signature.

A leading BOM in UTF-8 files can cause problems with files such as shell 
scripts or C/C++ source code (because it is indeed a part of the 
character stream rather than a mere file signature; strictly speaking 
the same is actually true for UTF-16 as well); but at the beginning of 
XML files it is explicitly allowed.
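
Using the BOM as a file signature could look something like the sketch below. The function name sniff_bom and the signature table are made up for illustration, and only UTF-8 and UTF-16 are handled (UTF-32 would need checking first, since its little-endian BOM begins with the same two bytes as UTF-16's).

```python
import codecs
from typing import Optional

# BOM byte sequences mapped to a Python codec that consumes the BOM on decode.
_SIGNATURES = [
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
]


def sniff_bom(data: bytes) -> Optional[str]:
    """Return a codec name if data starts with a known BOM, else None."""
    for bom, codec in _SIGNATURES:
        if data.startswith(bom):
            return codec
    # No signature: could be ASCII, Latin-1, a DOS code page, BOM-less UTF-8...
    return None


print(sniff_bom(codecs.BOM_UTF8 + b"<doc/>"))
print(sniff_bom(b"caf\xe9"))
```

The second call returns None, which is exactly the problem described above: without a signature, a lone 0xE9 byte says nothing about which extended-ASCII encoding produced it.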



From: clipka
Subject: Re: The trouble with XSLT
Date: 23 Feb 2012 07:23:42
Message: <4f462fce$1@news.povray.org>
Am 23.02.2012 12:53, schrieb Le_Forgeron:
> Le 23/02/2012 11:57, Invisible a écrit :
>> It seems as soon as you use any encoding other than the Windows default
>> (whatever the hell that is), things get messy, rapidly.
>
>
> On Windows, yes. Even two Windows machines can disagree once you are
> outside Windows' natural country (around Redmond, I believe). For
> example, at work we have some English Windows installations and some
> French ones, and they default to different code pages. (Code pages date
> back to DOS, a time when printers had micro-switches to select
> different mappings, not only for the 127-255 range but also inside the
> 33-126 range, and none of those usually matched the local code page
> exactly either.)

Huh? I'd have expected French Windows to use Latin-1 as well. Wouldn't 
be surprised about problems with the keyboard mapping though.

Unless you're talking about the command prompt, which may indeed still 
use those old IBM codepages. (But the US-American IBM codepage and 
Latin-1 differ as well, so the only advantage the Redmondians have over 
you in this matter is that their native language doesn't need more than 
95 printable characters in the first place.)
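
To see how the code pages disagree, here is a short Python sketch that decodes one and the same byte under several codecs. The choice of byte (0xE9) and of codecs is only an illustration: cp437 and cp850 stand in for the old IBM/DOS code pages, cp1252 for the Western European Windows default.

```python
# One byte, several interpretations.
byte = b"\xe9"
for codec in ("latin-1", "cp1252", "cp437", "cp850"):
    print(codec, repr(byte.decode(codec)))

# Going the other way: the same letter lands on different bytes.
print("é".encode("latin-1").hex(), "é".encode("cp437").hex())

# The 95 printable ASCII characters (space through tilde) are the part
# that all of these code pages agree on.
printable_ascii = [chr(c) for c in range(0x20, 0x7F)]
print(len(printable_ascii))
```

Text that stays inside those 95 characters reads the same under every one of these codecs, which is why English-only files rarely expose the problem.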



From: Warp
Subject: Re: The trouble with XSLT
Date: 23 Feb 2012 08:54:15
Message: <4f464507@news.povray.org>
clipka <ano### [at] anonymousorg> wrote:
> (2) The editor does not expect a leading BOM in UTF-8; in that case, it 
> /must/ treat it according to the Unicode standard, which explicitly 
> states that the BOM is actually a perfectly valid normal character, 
> which just happens to be one of the many space characters, non-breaking 
> in this case, with zero width; so you're perfectly safe here as well, 
> unless you accidentally strip it from the very beginning of the file.

  Does that mean that xFEFF is the zero-width nbsp in both UTF-16 and UTF-8?

  Also: If the byte order happened to be the reverse of what the editor
expects (assuming the editor does not support the BOM), wouldn't the
multi-byte characters be garbage then?

-- 
                                                          - Warp
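
Both questions can be probed with a few lines of Python, as a sketch rather than an authoritative answer. U+FEFF is the same code point in every Unicode encoding; only its byte representation differs. The arrow used for the byte-order experiment is an arbitrary sample character.

```python
bom = "\ufeff"  # ZERO WIDTH NO-BREAK SPACE, used as the byte order mark

# Same code point, different bytes per encoding.
print(bom.encode("utf-8").hex())      # efbbbf
print(bom.encode("utf-16-be").hex())  # feff
print(bom.encode("utf-16-le").hex())  # fffe

# Reading UTF-16 with the wrong byte order swaps the two bytes of every
# code unit, so the text turns into unrelated characters.
arrow = "\u2192"
swapped = arrow.encode("utf-16-be").decode("utf-16-le")
print(f"U+{ord(swapped):04X}")

# A byte-swapped BOM decodes to U+FFFE, a noncharacter, which is what
# makes the BOM usable for detecting the byte order in the first place.
swapped_bom = bom.encode("utf-16-be").decode("utf-16-le")
print(f"U+{ord(swapped_bom):04X}")
```

In UTF-8 there is no byte order to get wrong, so the three bytes EF BB BF always decode to U+FEFF.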



