On 15.08.2012 13:06, Invisible wrote:
>>>> Hint: There's more to Unicode than just ~1 million code points with
>>>> glyphs mapped to some of them.
>>>
>>> Sure. But if you can't even /write down/ those code-points, that's kinda
>>> limiting.
>>>
>>> (It also strikes me that if the String type /can/ hold them all and Char
>>> /can't/, that has interesting implications for trying to iterate over a
>>> String one Char at a time...)
>>
>> As soon as you think of combining diacritics, you'll see that this type
>> of limitation is actually inevitable
>
> I guess combining characters are The Real WTF...
Nope. You just can't do without them for plenty of purposes, because
encoding each conceivable combination as a single character would make
the Unicode code space explode (and make comprehensive Unicode fonts
even scarcer). You'd be surprised how many diacritics some languages
stack on a single character, or how many different diacritics there are
in IPA phonetic notation.
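You can watch the stacking at work; here's a minimal Python sketch (the sample string is my own, built from two real combining code points):

```python
import unicodedata

# One user-perceived character: a base letter plus two combining
# diacritics, the kind of stacking IPA notation routinely needs.
s = "e\u0301\u0330"  # 'e' + COMBINING ACUTE ACCENT + COMBINING TILDE BELOW

# Iterating "one Char at a time" walks code points, not characters:
for cp in s:
    print(f"U+{ord(cp):04X} {unicodedata.name(cp)}")

print(len(s))  # 3 code points, though it renders as a single glyph
```

Which also illustrates the String/Char point above: per-code-point iteration happily splits a single on-screen character into pieces.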
> (No, wait - that's the BOM.)
Nope, the BOM is quite sane.
First, it is a legal Unicode code point (U+FEFF), denoting a zero-width
non-breaking space. While this may sound like a weird thing to have, it
does allow you to suppress kerning, as well as lots of other stuff that
might happen to pairs of characters (just think of properly implemented
Arabic script, where a character's glyph depends not only on the
character itself, but also on its neighbors).
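You can verify that identity directly; a one-liner sketch in Python (standard library only):

```python
import unicodedata

# The BOM is an ordinary, legal code point with an ordinary name:
bom = "\ufeff"
print(unicodedata.name(bom))  # ZERO WIDTH NO-BREAK SPACE
```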
Second, it is suggested to prepend this Unicode character (which by
definition doesn't do any harm) to any text stream that leaves a
program's boundaries.
Why not mandate a canonical byte ordering? Simple: As long as the next
program to pick up the data runs on the same machine (and hence uses the
same byte ordering), re-ordering would just add unnecessary overhead.
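The receiving side can be sketched in a few lines of Python (the function name and sample bytes are my own; only the two BOM byte sequences come from the standard):

```python
def utf16_byte_order(data: bytes) -> str:
    """Guess the byte order of a UTF-16 stream from a leading BOM."""
    if data.startswith(b"\xff\xfe"):
        return "little-endian"
    if data.startswith(b"\xfe\xff"):
        return "big-endian"
    return "no BOM; assume the platform's native order"

# The same text, serialized in either byte order with a BOM prepended:
le = b"\xff\xfe" + "Hi".encode("utf-16-le")
be = b"\xfe\xff" + "Hi".encode("utf-16-be")
print(utf16_byte_order(le))  # little-endian
print(utf16_byte_order(be))  # big-endian
```

Which is the whole trick: the reader only pays for a byte swap when the data actually crossed an endianness boundary.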