On Thu, 01 Nov 2007 21:13:15 +0000, Orchid XP v7 wrote:
> I was under the impression that these encodings apply to *strings*, not
> individual characters by themselves...
The encoding applies to individual characters, and
the string is composed of those encoded characters.
code point  character  UTF-8 encoding
U+006B      k          6B
U+0070      p          70
U+0069      i          69
So the UTF-8 encoding of the string becomes 9 bytes long in total.
Similarly, the Czech word for "cat" would be encoded like this:
code point  character  UTF-8 encoding
U+006B      k          6B
U+006F      o          6F
U+010D      ?          C4 8D
U+006B      k          6B
U+0061      a          61
(Note: I'm posting in iso-8859-1, which cannot express
the third character in the word: a "c" with a hacek,
hence substituting with "?".)
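A table like the one above can be reproduced with a short Python sketch. The helper name utf8_table is made up for this illustration; it relies only on the built-in str.encode and ord, and writes the hacek character as an escape so the source stays pure ASCII.

```python
def utf8_table(text):
    """Return (code point, UTF-8 bytes in hex) for each character of text."""
    rows = []
    for ch in text:
        encoded = ch.encode("utf-8")  # each character is encoded on its own
        hex_bytes = " ".join(f"{b:02X}" for b in encoded)
        rows.append((f"U+{ord(ch):04X}", hex_bytes))
    return rows

# "\u010d" is the "c" with a hacek, written as an escape sequence.
czech_cat = "ko\u010dka"

for code_point, hex_bytes in utf8_table(czech_cat):
    print(code_point, hex_bytes)

# Total length in bytes: four 1-byte characters plus one 2-byte character.
print(len(czech_cat.encode("utf-8")))  # prints 6
```

Encoding the whole string at once gives the same bytes as concatenating the per-character encodings, which is the point being made to the original poster.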
And the Japanese word for Japan would be:
code point  character  UTF-8 encoding
U+65E5      ?          E6 97 A5
U+672C      ?          E6 9C AC
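How a code point turns into one, two, or three bytes can be sketched by doing the bit shuffling by hand. The function name encode_utf8 is hypothetical and the sketch is deliberately simplified: it does not reject surrogates or values above U+10FFFF, which a real encoder must do.

```python
def encode_utf8(cp):
    """Encode a single code point to UTF-8 by hand (no validation; sketch only)."""
    if cp < 0x80:
        # 0xxxxxxx: plain ASCII, one byte
        return bytes([cp])
    if cp < 0x800:
        # 110xxxxx 10xxxxxx
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    if cp < 0x10000:
        # 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    # 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([0xF0 | (cp >> 18),
                  0x80 | ((cp >> 12) & 0x3F),
                  0x80 | ((cp >> 6) & 0x3F),
                  0x80 | (cp & 0x3F)])

for cp in (0x006B, 0x010D, 0x65E5, 0x672C):
    hex_bytes = " ".join(f"{b:02X}" for b in encode_utf8(cp))
    print(f"U+{cp:04X} {hex_bytes}")
```

The results for these four code points match the byte values listed in the tables above, and also match what Python's built-in chr(cp).encode("utf-8") returns.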
UTF-8 meets several clever design goals:
- Backwards compatibility with ASCII
- Asciibetical (byte-wise) sorting still works, and gives the same order as sorting by code point
- Forward and backward seeking in the string is possible without desynchronization
- Minimal space wasted
- Possibility to extend naturally if the Unicode set grows
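Three of these goals are easy to check in a few lines of Python. This is only a sketch: the helper names is_continuation and char_start are made up for the illustration, and "kocka" (without the hacek) is just a test string, not a real word.

```python
def is_continuation(byte):
    """Continuation bytes always look like 10xxxxxx."""
    return byte & 0xC0 == 0x80

def char_start(data, pos):
    """Step backwards from pos to the first byte of the character containing it."""
    while pos > 0 and is_continuation(data[pos]):
        pos -= 1
    return pos

# 1. ASCII compatibility: pure ASCII text has identical bytes in both encodings.
ascii_ok = "kpi".encode("utf-8") == "kpi".encode("ascii")

# 2. Sorting: comparing raw UTF-8 bytes gives the same order as comparing
#    code points (Python compares str values by code point).
words = ["\u65e5\u672c", "kpi", "ko\u010dka", "kocka"]
by_code_point = sorted(words)
by_bytes = sorted(words, key=lambda w: w.encode("utf-8"))
same_order = by_code_point == by_bytes

# 3. Resynchronization: landing in the middle of a character is detectable.
data = "ko\u010dka".encode("utf-8")  # 6B 6F C4 8D 6B 61
resynced = char_start(data, 3)       # index 3 is 8D, a continuation byte

print(ascii_ok, same_order, resynced)  # prints: True True 2
```

Because lead bytes and continuation bytes occupy disjoint ranges, a reader dropped at an arbitrary byte offset never has to back up more than three bytes to find a character boundary.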
--
Joel Yliluoma - http://iki.fi/bisqwit/