POV-Ray: Newsgroups: povray.off-topic: The trouble with XSLT

From: Invisible
Subject: Re: The trouble with XSLT
Date: 23 Feb 2012 05:57:34
Message: <4f461b9e$1@news.povray.org>

On 23/02/2012 09:39 AM, Le_Forgeron wrote:
> On 23/02/2012 10:11, Invisible wrote:
>> This still leaves me with the problem of how to generate unusual
>> characters in the first place. Typing &rarr; is pretty simple. Figuring
>> out how to actually generate the arrow character is not.
>
> You just need the right documentation.

Remembering to type &#x2192; is much harder than remembering to type
&rarr;. It's also far less readable. Working out how to generate the
actual character so you can copy and paste it into your document is far
slower than just typing some stuff with your keyboard. (It involves
moving your hand to the mouse, for one thing...)
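
A minimal sketch of the spellings being compared here, assuming Python 3.8+
and its standard html.entities table. The helper name describe_entity is
made up for illustration; it only shows how a named entity, its numeric
references and the literal character relate to each other.

```python
import html
import unicodedata
from html.entities import name2codepoint


def describe_entity(name: str) -> dict:
    """Return the different spellings of one named HTML entity."""
    codepoint = name2codepoint[name]      # e.g. "rarr" -> 8594
    char = chr(codepoint)
    return {
        "named": f"&{name};",
        "decimal": f"&#{codepoint};",
        "hex": f"&#x{codepoint:X};",
        "char": char,
        "unicode_name": unicodedata.name(char),
        "utf8_bytes": char.encode("utf-8").hex(" "),
    }


info = describe_entity("rarr")
for key, value in info.items():
    print(f"{key:13} {value}")

# All three references decode to the same literal character.
literal = html.unescape(info["named"])
```

Once the literal character has been printed like this, it can be copied into
a document directly, provided the file is saved in an encoding that can
represent it.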

> and get to print the parts that you need often.

Well, yeah, I guess that's what it comes down to. Sadly.

> And soon you will discover that there is no single font to display all
> possible unicode glyphs.

Yeah, but nobody /needs/ all possible Unicode glyphs. I'm never going to
write stuff in Linear B or Ogham. I just need a small handful of
non-ASCII characters - most of which /are/ widely supported in many fonts.

> You also need a unicode-compatible editor...

I've got that. The problem isn't the editor handling Unicode, it's 
figuring out which encoding an arbitrary text file happens to use. It 
seems as soon as you use any encoding other than the Windows default 
(whatever the hell that is), things get messy, rapidly.

From: Le Forgeron
Subject: Re: The trouble with XSLT
Date: 23 Feb 2012 06:53:15
Message: <4f4628ab$1@news.povray.org>

On 23/02/2012 11:57, Invisible wrote:
> It seems as soon as you use any encoding other than the Windows default
> (whatever the hell that is), things get messy, rapidly.


On Windows, yes. Even two Windows machines can get messy once they leave
their natural country (around Redmond, I believe). For example, at work we
have some English Windows installs and some French ones, and they default
to different code pages. (Code pages are a leftover from DOS, a time when
printers had micro-switches to select different mappings not only for the
128-255 range but also inside the 33-126 range, none of which usually
matched the local code page exactly either.)

Windows chose the UTF-16 route at one point (hence the wchar in the
API)... but that doubles the size of an ASCII file... and yet the
integration is far from done (a lot of code still expects 8-bit chars...).

Maybe you could use a forked resource to tag the file as being in a given
encoding. (A forked resource of a file is addressed as the filename
followed by a colon and another name. It's usually invisible and yet
remains associated with the file. Yet another kludge from the West of
America... the name identifies the resource, the value lives in the
associated "file".)
Maybe you can search the web about that?
(And find a program which allows you to handle forked resources...)
As long as the file is on an NTFS filesystem, it's safe.

From: clipka
Subject: Re: The trouble with XSLT
Date: 23 Feb 2012 06:55:55
Message: <4f46294b$1@news.povray.org>

On 23.02.2012 10:11, Invisible wrote:
>>> I wonder how widely implemented this undocumented feature is?
>>
>> Most likely more widespread than you think.
>
> I got the distinct impression that this is a Windows-specific
> convention. (Doesn't Linux do something strange with using environment
> variables to define the "system locale"?)

 From the current XML spec:

"Entities encoded in UTF-16 MUST and entities encoded in UTF-8 MAY begin 
with the Byte Order Mark described by Annex H of [ISO/IEC 10646:2000], 
section 16.8 of [Unicode] (the ZERO WIDTH NO-BREAK SPACE character, 
#xFEFF)."

As for this being supported by editors, there are two possible cases:

(1) The editor treats UTF-8 with a leading BOM as a special encoding; in 
that case, it will strip the BOM from the character stream upon reading, 
and prepend it upon writing, so you're perfectly safe here.

(2) The editor does not expect a leading BOM in UTF-8; in that case, it 
/must/ treat it according to the Unicode standard, which explicitly 
states that the BOM is actually a perfectly valid normal character, 
which just happens to be one of the many space characters, non-breaking 
in this case, with zero width; so you're perfectly safe here as well, 
unless you accidentally strip it from the very beginning of the file.
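
The two cases can be sketched with Python's standard codecs, which happen
to offer both behaviours: "utf-8-sig" treats the leading BOM as a signature,
while plain "utf-8" keeps it as an ordinary character. This only illustrates
the distinction; it says nothing about how any particular editor is
implemented.

```python
# A small XML document saved as UTF-8 with a leading BOM (U+FEFF).
data = "\ufeff<?xml version='1.0'?><doc/>".encode("utf-8")

# Case (1): a BOM-aware reader strips the signature on reading...
aware = data.decode("utf-8-sig")
# ...and prepends it again on writing.
rewritten = aware.encode("utf-8-sig")

# Case (2): a plain UTF-8 reader keeps U+FEFF as the first character,
# so writing the text back unchanged still reproduces the original bytes.
naive = data.decode("utf-8")
roundtrip = naive.encode("utf-8")

print(repr(aware[:5]), repr(naive[:6]))
```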

From: clipka
Subject: Re: The trouble with XSLT
Date: 23 Feb 2012 07:13:03
Message: <4f462d4f$1@news.povray.org>

On 23.02.2012 11:57, Invisible wrote:

>> You also need a unicode-compatible editor...
>
> I've got that. The problem isn't the editor handling Unicode, it's
> figuring out which encoding an arbitrary text file happens to use. It
> seems as soon as you use any encoding other than the Windows default
> (whatever the hell that is), things get messy, rapidly.

That's because figuring out the encoding of an extended-ASCII text file 
is, in fact, virtually impossible (unless you know details about the 
contents, e.g. you can recognize the encoding if you know that it's an 
HTML file), due to the fact that none of them has a standardized file 
signature. With the sole exception of UTF-8 with leading BOM, where the 
BOM character can double-feature as such a signature.

A leading BOM in UTF-8 files can cause problems with files such as shell 
scripts or C/C++ source code (because it is indeed a part of the 
character stream rather than a mere file signature; strictly speaking 
the same is actually true for UTF-16 as well); but at the beginning of 
XML files it is explicitly allowed.
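
As a rough sketch of using the BOM as a file signature, the function below
checks the first bytes of a file against the BOM constants from Python's
codecs module. The name sniff_bom and the returned labels are illustrative
choices. A result of None means only that no signature was found, so the
encoding still has to be guessed by other means.

```python
import codecs

# Longest signatures first: the UTF-32 LE BOM begins with the same two
# bytes as the UTF-16 LE BOM, so the order of the checks matters.
_SIGNATURES = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def sniff_bom(raw: bytes):
    """Return the encoding named by a leading BOM, or None if there is none."""
    for bom, label in _SIGNATURES:
        if raw.startswith(bom):
            return label
    return None


print(sniff_bom(codecs.BOM_UTF8 + b"<doc/>"))   # BOM present
print(sniff_bom(b"caf\xe9"))                    # extended ASCII, no signature
```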

From: clipka
Subject: Re: The trouble with XSLT
Date: 23 Feb 2012 07:23:42
Message: <4f462fce$1@news.povray.org>

On 23.02.2012 12:53, Le_Forgeron wrote:
> On 23/02/2012 11:57, Invisible wrote:
>> It seems as soon as you use any encoding other than the Windows default
>> (whatever the hell that is), things get messy, rapidly.
>
>
> On Windows, yes. Even two Windows machines can get messy once they leave
> their natural country (around Redmond, I believe). For example, at work we
> have some English Windows installs and some French ones, and they default
> to different code pages. (Code pages are a leftover from DOS, a time when
> printers had micro-switches to select different mappings not only for the
> 128-255 range but also inside the 33-126 range, none of which usually
> matched the local code page exactly either.)

Huh? I'd have expected French Windows to use Latin-1 as well. Wouldn't 
be surprised about problems with the keyboard mapping though.

Unless you're talking about the command prompt, which may indeed still 
use those old IBM codepages. (But the US-American IBM codepage and 
Latin-1 differ as well, so the only advantage the Redmondians have over 
you in this matter is that their native language doesn't need more than 
95 printable characters in the first place.)
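
To see how much the code pages disagree, the sketch below decodes the same
single byte under four codecs that ship with Python. Which code pages a
given Windows installation actually uses is an assumption here: cp437 and
cp850 stand in for the old IBM code pages, latin-1 and cp1252 for the
Western European ones.

```python
# One byte from the upper half of the 8-bit range.
raw = bytes([0xE9])

# The same byte, interpreted under four different code pages.
decoded = {
    codec: raw.decode(codec)
    for codec in ("latin-1", "cp1252", "cp437", "cp850")
}

for codec, char in decoded.items():
    print(f"{codec:8} U+{ord(char):04X} {char}")
```

A file containing this byte reads as an accented "e" under the first two
and as something else entirely under the IBM code pages, which is exactly
the kind of mess described above.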

From: Warp
Subject: Re: The trouble with XSLT
Date: 23 Feb 2012 08:54:15
Message: <4f464507@news.povray.org>

clipka wrote:
> (2) The editor does not expect a leading BOM in UTF-8; in that case, it 
> /must/ treat it according to the Unicode standard, which explicitly 
> states that the BOM is actually a perfectly valid normal character, 
> which just happens to be one of the many space characters, non-breaking 
> in this case, with zero width; so you're perfectly safe here as well, 
> unless you accidentally strip it from the very beginning of the file.

  Does that mean that xFEFF is the zero-width nbsp in both UTF-16 and UTF-8?

  Also: If the byte order happened to be the reverse of what the editor
expects (assuming the editor does not support the BOM), wouldn't the
multi-byte characters be garbage then?

-- 
                                                          - Warp

From: Le Forgeron
Subject: Re: The trouble with XSLT
Date: 23 Feb 2012 10:11:01
Message: <4f465705$1@news.povray.org>

On 23/02/2012 14:54, Warp wrote:

>   Does that mean that xFEFF is the zero-width nbsp in both UTF-16 and UTF-8?
> 

Tsss... no cake for you.

0xFEFF is the UTF-16 encoding of the BOM (byte order mark). It is used to
signal endianness in UTF-16: read with the wrong byte order it shows up as
0xFFFE, which is not valid UTF-16 text (indeed, U+FFFE will never be
assigned to a character).

Encoding U+FEFF in UTF-8:
 * has no purpose, as there is no endianness to detect for the UTF-8
   encoding (but it is legitimate to have a BOM in UTF-8)
 * would be done as 3 bytes: 0xEF 0xBB 0xBF

A BOM can also be useful when using UTF-32 (and other esoteric encodings
of Unicode, such as UTF-7, UTF-EBCDIC, UTF-1 (misnamed, IMHO), ...).

Notice that U+FEFF is deprecated as a zero-width non-breaking space.
You should use U+2060 (word joiner: zero-width, non-breaking)
and/or U+200B (zero-width space, but breaking). At least in Unicode 6.1.

>   Also: If the byte order happened to be the reverse of what the editor
> expects (assuming the editor does not support the BOM), wouldn't the
> multi-byte characters be garbage then?
> 
That's the reason utf-16/32 need a BOM for automatic detection.
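
The byte sequences mentioned above can be checked with a few lines of
Python, using only the built-in codecs. This is a quick sketch: it encodes
the single code point U+FEFF in several encodings and then shows what a
reader sees if it assumes the wrong byte order for UTF-16.

```python
bom = "\ufeff"

# The same code point, serialised under different encodings.
encodings = ("utf-8", "utf-16-be", "utf-16-le", "utf-32-be", "utf-32-le")
encoded = {enc: bom.encode(enc) for enc in encodings}

for enc, raw in encoded.items():
    print(f"{enc:10} {raw.hex(' ')}")

# Little-endian bytes read as big-endian yield U+FFFE, a noncharacter,
# which is how a reader can tell that the byte order is reversed.
misread = encoded["utf-16-le"].decode("utf-16-be")
print(f"misread as U+{ord(misread):04X}")
```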

From: clipka
Subject: Re: The trouble with XSLT
Date: 23 Feb 2012 11:49:39
Message: <4f466e23$1@news.povray.org>

On 23.02.2012 14:54, Warp wrote:
> clipka wrote:
>> (2) The editor does not expect a leading BOM in UTF-8; in that case, it
>> /must/ treat it according to the Unicode standard, which explicitly
>> states that the BOM is actually a perfectly valid normal character,
>> which just happens to be one of the many space characters, non-breaking
>> in this case, with zero width; so you're perfectly safe here as well,
>> unless you accidentally strip it from the very beginning of the file.
>
>    Does that mean that xFEFF is the zero-width nbsp in both UTF-16 and UTF-8?

If you're talking about codepoint, then obviously yes.

If you're talking about encoded byte sequence, then no; in UTF-16, it 
would be encoded as xFE xFF or xFF xFE respectively, while in UTF-8 it 
would always be encoded as xEF xBB xBF.

>    Also: If the byte order happened to be the reverse of what the editor
> expects (assuming the editor does not support the BOM), wouldn't the
> multi-byte characters be garbage then?

I would be surprised to find an editor supporting UTF-16 but not the 
BOM. As for UTF-8, the byte order for multi-byte characters (i.e. 
codepoints x0100 and above) is unambiguously defined by the standard 
(using a big-endian-ish encoding); as UTF-8 requires bit shifting 
anyway, a byte-reversed encoding would provide no benefit for 
little-endian machines and therefore doesn't exist.
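
The bit shifting can be made concrete with a hand-written encoder. This is
a minimal sketch in Python; the function name utf8_encode is made up, and
for brevity it rejects only out-of-range values rather than also rejecting
surrogates. The most significant bits always go into the first byte, which
is why there is only one possible byte order. Multi-byte sequences start at
code point 0x80.

```python
def utf8_encode(codepoint: int) -> bytes:
    """Encode one code point as UTF-8, most significant bits first."""
    if not 0 <= codepoint <= 0x10FFFF:
        raise ValueError("code point out of Unicode range")
    if codepoint < 0x80:                      # 1 byte:  0xxxxxxx
        return bytes([codepoint])
    if codepoint < 0x800:                     # 2 bytes: 110xxxxx 10xxxxxx
        return bytes([
            0xC0 | (codepoint >> 6),
            0x80 | (codepoint & 0x3F),
        ])
    if codepoint < 0x10000:                   # 3 bytes: 1110xxxx + 2 more
        return bytes([
            0xE0 | (codepoint >> 12),
            0x80 | ((codepoint >> 6) & 0x3F),
            0x80 | (codepoint & 0x3F),
        ])
    return bytes([                            # 4 bytes: 11110xxx + 3 more
        0xF0 | (codepoint >> 18),
        0x80 | ((codepoint >> 12) & 0x3F),
        0x80 | ((codepoint >> 6) & 0x3F),
        0x80 | (codepoint & 0x3F),
    ])


print(utf8_encode(0xFEFF).hex(" "))
```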

From: clipka
Subject: Re: The trouble with XSLT
Date: 23 Feb 2012 11:55:51
Message: <4f466f97$1@news.povray.org>

On 23.02.2012 17:49, clipka wrote:

> BOM. As for UTF-8, the byte order for multi-byte characters (i.e.
> codepoints x0100 and above) is unambiguously defined by the standard

Strike "x0100", replace with "x0080".

From: Warp
Subject: Re: The trouble with XSLT
Date: 23 Feb 2012 12:35:03
Message: <4f4678c5@news.povray.org>

clipka wrote:
> As an alternative, forget UTF-8 and go for UTF-16.

  UTF-16 is more compact if the text consists mostly of non-ASCII
characters, especially if it contains e.g. kanji symbols, hiragana, etc.
(The vast majority of Japanese kanji can be expressed with 2 bytes using
UTF-16 but require 3 bytes with UTF-8.)

  However, if the text consists mostly of ASCII characters, as English
usually does, then UTF-8 is more compact than UTF-16 (which will
basically double the size of the file).
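
  A quick way to check the size trade-off is to encode a sample of each
kind of text and count the bytes. The two sample strings below are
arbitrary picks for illustration, and the sketch assumes Python 3. The
"utf-16-le" codec is used so that no BOM is added to the count.

```python
samples = {
    "english": "The trouble with XSLT",
    "japanese": "日本語のテキスト",
}

# (UTF-8 size, UTF-16 size) in bytes for each sample.
sizes = {
    name: (len(text.encode("utf-8")), len(text.encode("utf-16-le")))
    for name, text in samples.items()
}

for name, (utf8_size, utf16_size) in sizes.items():
    print(f"{name:9} utf-8: {utf8_size:3} bytes   utf-16: {utf16_size:3} bytes")
```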

  Support for UTF-16 is still relatively poor (although getting better).
Most modern browsers should handle it OK, though it requires the server
to send the proper HTTP header to tell the browser the encoding, and
configuring the server to do this might not be trivial. (Without that
header, an HTML file encoded in UTF-16 will look like garbage.)

-- 
                                                          - Warp
