On 23/02/2012 14:54, Warp wrote:
> Does that mean that xFEFF is the zero-width nbsp in both UTF-16 and UTF-8?
>
Tsss... no cake for you.
U+FEFF is the BOM (byte order mark); in UTF-16 it is the single code
unit 0xFEFF. It is used to signal endianness with UTF-16: read with the
wrong byte order it shows up as 0xFFFE, and U+FFFE is a noncharacter
that will never be assigned.
Encoding U+FEFF in utf-8:
* has no purpose as a byte order mark, there is no endianness to detect
for the utf-8 encoding (but it is legit to have a BOM in utf-8)
* would be done as 3 bytes: 0xEF 0xBB 0xBF
A BOM can also be useful when using UTF-32 (and other esoteric encodings
of unicode, such as utf-7, or utf-ebcdic, utf-1 (misnamed, IMHO), ... ).
Notice that U+FEFF is deprecated as zero-width non-breaking space.
You should use U+2060 (word joiner, zero width and non-breaking),
and/or U+200B (zero width space, but breaking). At least in unicode 6.1.
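To make the byte sequences above concrete, here is a minimal Python sketch that serializes U+FEFF in a few encoding forms. The helper name `hex_bytes` is made up for this illustration; the codec names are Python's standard ones.

```python
# U+FEFF serialized in several Unicode encoding forms.
BOM = "\ufeff"

def hex_bytes(text: str, encoding: str) -> str:
    """Return the encoded bytes as space-separated uppercase hex."""
    return text.encode(encoding).hex(" ").upper()

for enc in ("utf-8", "utf-16-be", "utf-16-le", "utf-32-be", "utf-32-le"):
    print(f"{enc:10} {hex_bytes(BOM, enc)}")
# utf-8      EF BB BF
# utf-16-be  FE FF
# utf-16-le  FF FE
# utf-32-be  00 00 FE FF
# utf-32-le  FF FE 00 00
```

Only the UTF-16 and UTF-32 rows change with byte order; the UTF-8 row is always the same three bytes.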
> Also: If the byte order happened to be the reverse of what the editor
> expects (assuming the editor does not support the BOM), wouldn't the
> multi-byte characters be garbage then?
>
That's the reason utf-16/32 need a BOM for automatic detection.
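As a rough illustration of that automatic detection, here is a sketch in Python. The function names `sniff_bom` and `decode_with_bom` and the UTF-8 fallback are assumptions of this example, not something any particular editor is known to do; the BOM constants come from the standard `codecs` module.

```python
import codecs

# Longest signatures first: the UTF-32-LE BOM (FF FE 00 00) starts with
# the UTF-16-LE BOM (FF FE), so it has to be tested before it.
_BOMS = [
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
]

def sniff_bom(data: bytes):
    """Return (encoding, bom_length), or (None, 0) if no BOM is present."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name, len(bom)
    return None, 0

def decode_with_bom(data: bytes, fallback: str = "utf-8") -> str:
    """Decode using the BOM if there is one, else the assumed fallback."""
    encoding, skip = sniff_bom(data)
    return data[skip:].decode(encoding or fallback)

# Without a BOM, guessing the wrong byte order silently yields garbage:
# "A" is 00 41 in UTF-16BE, which read as little-endian is U+4100.
garbled = "A".encode("utf-16-be").decode("utf-16-le")
```

The last line shows the failure mode Warp asked about: the bytes decode without error, just to the wrong characters.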
On 23.02.2012 14:54, Warp wrote:
> clipka<ano### [at] anonymous org> wrote:
>> (2) The editor does not expect a leading BOM in UTF-8; in that case, it
>> /must/ treat it according to the Unicode standard, which explicitly
>> states that the BOM is actually a perfectly valid normal character,
>> which just happens to be one of the many space characters, non-breaking
>> in this case, with zero width; so you're perfectly safe here as well,
>> unless you accidently strip it from the very beginning of the file.
>
> Does that mean that xFEFF is the zero-width nbsp in both UTF-16 and UTF-8?
If you're talking about the codepoint, then obviously yes.
If you're talking about the encoded byte sequence, then no; in UTF-16 it
would be encoded as xFE xFF (big-endian) or xFF xFE (little-endian),
while in UTF-8 it would always be encoded as xEF xBB xBF.
> Also: If the byte order happened to be the reverse of what the editor
> expects (assuming the editor does not support the BOM), wouldn't the
> multi-byte characters be garbage then?
I would be surprised to find an editor supporting UTF-16 but not the
BOM. As for UTF-8, the byte order for multi-byte characters (i.e.
codepoints x0100 and above) is unambiguously defined by the standard
(using a big-endian-ish encoding); as UTF-8 requires bit shifting
anyway, a byte-reversed encoding would provide no benefit for
little-endian machines and therefore doesn't exist.
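For illustration, a hand-rolled UTF-8 encoder in Python shows the bit shifting and the fixed, most-significant-bits-first order. This is only a sketch: `utf8_encode` is a made-up name, and it skips the surrogate-range check a real encoder would do.

```python
def utf8_encode(cp: int) -> bytes:
    """Encode one code point to UTF-8 by hand (no surrogate checks)."""
    if cp < 0x80:                      # 1 byte: plain ASCII
        return bytes([cp])
    if cp < 0x800:                     # 2 bytes: 110xxxxx 10xxxxxx
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    if cp < 0x10000:                   # 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    if cp <= 0x10FFFF:                 # 4 bytes: 11110xxx and three 10xxxxxx
        return bytes([0xF0 | (cp >> 18),
                      0x80 | ((cp >> 12) & 0x3F),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    raise ValueError("code point out of range")

print(utf8_encode(0xFEFF).hex(" "))   # ef bb bf
```

The high-order bits always go into the first byte, whatever the machine's native byte order is, so there is nothing for a BOM to disambiguate.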
On 23.02.2012 17:49, clipka wrote:
> BOM. As for UTF-8, the byte order for multi-byte characters (i.e.
> codepoints x0100 and above) is unambiguously defined by the standard
Strike "x0100", replace with "x0080".
clipka <ano### [at] anonymous org> wrote:
> As an alternative, forget UTF-8 and go for UTF-16.
UTF-16 is more compact if the text consists mostly of non-ASCII
characters, especially if it contains e.g. kanji symbols, hiragana, etc.
(The vast majority of Japanese kanji can be expressed with 2 bytes using
UTF-16 but require 3 bytes with UTF-8.)
However, if the text consists mostly of ASCII characters, as English
usually does, then UTF-8 is more compact than UTF-16 (which will
basically double the size of the file).
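A quick way to check the size trade-off is to encode the same text both ways and count bytes. The sample strings and the helper name `encoded_sizes` below are just illustrations, and the counts leave out any BOM.

```python
def encoded_sizes(text: str) -> dict:
    """Byte counts for the same text in UTF-8 and UTF-16 (no BOM)."""
    return {
        "utf-8": len(text.encode("utf-8")),
        "utf-16": len(text.encode("utf-16-le")),
    }

english = "The quick brown fox"   # 19 ASCII characters
japanese = "日本語のテキスト"       # 8 characters: kanji, hiragana, katakana

print(encoded_sizes(english))     # {'utf-8': 19, 'utf-16': 38}
print(encoded_sizes(japanese))    # {'utf-8': 24, 'utf-16': 16}
```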
Support for UTF-16 is still relatively poor (although getting better).
Most modern browsers should handle it OK, but it requires the server
to send the proper HTTP header to tell the browser the encoding, and
configuring the server to do this might not be trivial. (Without it, an
HTML file encoded in UTF-16 will look like garbage.)
--
- Warp
The trouble with the whole xml thing is that it's just another
enterprisey BS to grind away CPUs' idle time. You need an xml document, an
xml document describing the structure of the previous document, yet
another xml document to describe how to style the original document's
items, perhaps an xml document describing how to transform your xml
document into another xml document. It's an insanely verbose and
homogeneous pile of crap, barely readable by humans or machines.
People resented it and thus insist on saner formats, such as CSS, JSON
and real programming languages rather than a shitload of xml abstraction
layers, tools and java frameworks.
On 23.02.2012 18:35, Warp wrote:
> clipka<ano### [at] anonymous org> wrote:
>> As an alternative, forget UTF-8 and go for UTF-16.
>
> UTF-16 is more compact if the text consists mostly of non-ascii
> characters, especially if it contains eg. kanji symbols, hiragana, etc.
> (The vast majority of Japanese kanji can be expressed with 2 bytes using
> UTF-16 but require 3 bytes with UTF-8.)
>
> However, if the text consists mostly of ascii characters, such as
> English usually does, then UTF-8 is more compact than UTF-16 (which will
> basically double the size of the file).
I guess a factor of 2 in text stream size is not a serious problem with
today's internet bandwidths.
> Support for UTF-16 is still relatively poor (although getting better).
> Most modern browsers should handle it ok, though, but it requires for the
> server to send the proper http header to tell the browser the encoding,
> and configuring the server to do this might not be trivial. (A html file
> encoded in UTF-16 will look like garbage.)
It didn't sound like Andy would want to retrieve the XML file from a web
server, but rather directly from the local file system. Otherwise he
could simply go for server-side XSLT processing.
On 23.02.2012 19:04, nemesis wrote:
> The trouble with the whole xml thing is that it's just another
> enterprisey BS to grind CPUs idle times. You need an xml document, an
> xml document describing the structure of the previous document, yet
> another xml document to describe how to style the original document
> itens, perhaps a xml document describing how to transform your xml
> document into another xml document. It's an insanely verbose and
> homogeneous pile of human and machine barely readable crap.
>
> People resented it and thus insist on saner formats, such as CSS, JSON
> and real programming languages rather than a shitload of xml abstraction
> layers, tools and java frameworks.
Businesses do use it quite a lot for data exchange.
But yes, XML as a mere replacement for HTML is a rather silly thing
(except in its incarnation as XHTML); its legitimate ecological niche on
the web is on the server side (if anywhere), and its native habitat is
actually somewhere else entirely.
In some sense, XML is today's CSV: A generic file or data stream format
a human /can/ create, read and/or modify with an ASCII text editor, but
that still follows clear-cut rules so that it can also be evaluated
by software; and actually just a meta-format, in the sense that the
content of the individual data fields needs to be agreed upon separately.
nemesis <nam### [at] gmail com> wrote:
> The trouble with the whole xml thing is that it's just another
> enterprisey BS to grind CPUs idle times. You need an xml document, an
> xml document describing the structure of the previous document, yet
> another xml document to describe how to style the original document
> itens, perhaps a xml document describing how to transform your xml
> document into another xml document. It's an insanely verbose and
> homogeneous pile of human and machine barely readable crap.
It's verbose, but it has one advantage over most other formats: It's
standardized and pretty well supported.
It has many advantages over many other formats. One example is character
encoding. With all types of character encodings out there, and support
for them in different file formats and programs being what they are, a
*standardized* form for representing special characters can be really
useful. Also, any program that reads XML ought to support it regardless
of which character encoding it uses (at least if the program uses a
generic XML parser internally).
Compare this to, for example, just a simple raw .txt file. Which encoding
does it use? ISO-Latin-1? ISO-Latin-9? UTF-8? Shift JIS? EUC-JP? ISO-2022-JP?
Something else completely? Impossible to say. With an XML file you don't have
to care. (As said, if your program uses a generic XML parser, the character
encoding used in the input XML file becomes a non-issue.)
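As a small sketch of that point, Python's standard `xml.etree.ElementTree` parser can be fed the raw bytes and will pick the encoding from the XML declaration (or the BOM). The sample documents and the helper name `name_from_xml` are made up for the example. One caveat: Python's parser does not handle every encoding natively, so only ISO-8859-1, UTF-8 and UTF-16 are shown here.

```python
import xml.etree.ElementTree as ET

def name_from_xml(raw: bytes) -> str:
    """Parse raw XML bytes and return the text of the root element."""
    return ET.fromstring(raw).text

# The same document stored in three different encodings.
latin1 = ('<?xml version="1.0" encoding="ISO-8859-1"?>'
          '<name>Müller</name>').encode("iso-8859-1")
utf8 = ('<?xml version="1.0" encoding="UTF-8"?>'
        '<name>Müller</name>').encode("utf-8")
utf16 = ('<?xml version="1.0" encoding="UTF-16"?>'
         '<name>Müller</name>').encode("utf-16")   # Python adds the BOM

for raw in (latin1, utf8, utf16):
    print(name_from_xml(raw))   # Müller, three times
```

The calling code never mentions an encoding; it only ever sees the decoded string.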
Not that this exact same thing wouldn't be possible with a less verbose
format, but as said, XML is widely supported so it has this implicit
advantage over many other formats.
--
- Warp
On 23/02/2012 20:06, Warp made us read:
> It's verbose, but it has one advantage over most other formats: It's
> standardized and pretty well supported.
>
Well, XML is a container. The problem is the lack of intelligent design for
the inside. It is too often the Excel sheet of today: A bunch of
entries, without consistency.
Indeed, with a bit of base64 encapsulation, you could put a BINARY
Excel sheet file into an XML document. And advertise that you output XML.
And to make matters more interesting, some find it enterprisey to have
xml inside xml... and others put old CSV into XML too (without
reinterpreting the data, so it's just a formatting. A dumb formatting).
> It has many advantages over many other formats. One example is character
> encoding. With all types of character encodings out there, and support
> for them in different file formats and programs being what they are, a
> *standardized* form for representing special characters can be really
> useful. Also, any program that reads XML ought to support it regardless
> of which character encoding it uses (at least if the program uses a
> generic XML parser internally).
Read it, yes. Understand it, that's a whole other story!
Same as: I can read latin, or japanese in katakana, with few errors in
pronunciation. That does not mean I get the meaning. At least I can edit
it like a monkey.
> Not that this exact same thing wouldn't be possible with a less verbose
> format, but as said, XML is widely supported so it has this implicit
> advantages over many other formats.
XML is interesting when exchanging documents/data, once the big bosses
and their technical staffs have agreed on an XSD. But whenever you add a
third company, you need to negotiate another XSD (with a totally
different approach to the data, not even compatible with the first one).
On 23/02/2012 17:35, Warp wrote:
> Support for UTF-16 is still relatively poor (although getting better).
> Most modern browsers should handle it ok, though, but it requires for the
> server to send the proper http header to tell the browser the encoding,
> and configuring the server to do this might not be trivial. (A html file
> encoded in UTF-16 will look like garbage.)
Isn't that what the HTML encoding tag is for? Or the XML encoding
declaration?