Fredrik Eriksson <fe79}--at--{yahoo}--dot--{com> wrote:
> When dealing with just one character at a time, the application receives
> UTF-16 code points and must identify and deal with surrogates if needed.
Hmm, I'm not exactly sure what you mean by "UTF-16 code point".
According to the unicode.org glossary, a "code point" is a value in the
Unicode codespace, i.e. a value in the range between 0 and 10FFFF. (Or what
I often refer to as "raw Unicode value".)
UTF-16 is a translation format between Unicode code points and bytes.
In other words, the raw Unicode value is taken and encoded into one or two
16-bit code units (2 or 4 bytes, depending on the value) using a certain
algorithm (values above FFFF are split into a "surrogate pair" of code units
taken from the reserved range D800-DFFF, so that they can never be confused
with ordinary characters).
Decoding from UTF-16 back to a Unicode code point is the reverse operation.
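  Just to make the algorithm concrete, here's a rough sketch in Python (the
function names are my own, and real code would also have to reject unpaired
surrogates and out-of-range values):

```python
def utf16_encode(cp):
    # Code points up to FFFF (outside the surrogate range) fit in a
    # single 16-bit code unit.
    if cp < 0x10000:
        return [cp]
    # Larger values become a surrogate pair: subtract 0x10000, then
    # spread the remaining 20 bits over two code units taken from the
    # reserved ranges D800-DBFF (high) and DC00-DFFF (low).
    cp -= 0x10000
    return [0xD800 + (cp >> 10), 0xDC00 + (cp & 0x3FF)]

def utf16_decode(units):
    # The reverse operation: recombine a surrogate pair into one
    # Unicode code point.
    if len(units) == 2:
        return 0x10000 + ((units[0] - 0xD800) << 10) + (units[1] - 0xDC00)
    return units[0]
```

For example, U+1D11E (the musical G clef) encodes to the pair D834 DD1E,
and decoding that pair gives back 1D11E.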
Thus I'm not exactly sure what you mean by "UTF-16 code point", as it
seems to be mixing the two things into one concept.
Anyways, if the Unicode-aware program requests a Unicode character from
the system, and the system returns it UTF-16-encoded, I suppose that means
that the program must decode it to a Unicode code point before it can use
it (unless it specifically handles UTF-16 strings directly, of course).
Hmm, that sounds like a hindrance. Couldn't the system return raw Unicode
code points directly?
> For an application that does not need to deal with "exotic" alphabets
> (e.g. Chinese), one can typically get away just fine with treating the
> UTF-16 code points as if they were UCS-2.
OTOH, if a program is Unicode-aware, it should really be prepared to
handle any characters in the entire Unicode codespace.
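  That basically just means watching for the surrogate range while walking
the code-unit sequence. A rough self-contained sketch (again my own naming,
and without error handling for malformed input):

```python
def codepoints(units):
    # Walk a sequence of 16-bit UTF-16 code units, combining surrogate
    # pairs (D800-DBFF followed by DC00-DFFF) into full code points.
    i = 0
    while i < len(units):
        u = units[i]
        if 0xD800 <= u <= 0xDBFF and i + 1 < len(units):
            lo = units[i + 1]
            yield 0x10000 + ((u - 0xD800) << 10) + (lo - 0xDC00)
            i += 2
        else:
            # A BMP character: the code unit *is* the code point, which
            # is why treating the data as UCS-2 works right up until a
            # surrogate shows up.
            yield u
            i += 1
```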
--
- Warp