POV-Ray: Newsgroups: povray.off-topic: Haskell raving: Re: Haskell raving

From: Joel Yliluoma
Date: 19 Nov 2007 05:31:53
Message: <slrnfk2pgp.529.bisqwit@bisqwit.iki.fi>

On Thu, 01 Nov 2007 21:13:15 +0000, Orchid XP v7 wrote:
> I was under the impression that these encodings apply to *strings*, not 
> individual characters by themselves...

The encoding applies to individual characters, and
the string is composed of those encoded characters.



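For example, the Finnish word "kääpiö" would be encoded like this:

  code point character   utf8 encoding
   U+006B     k            6B
   U+00E4     ä            C3 A4
   U+00E4     ä            C3 A4
   U+0070     p            70
   U+0069     i            69
   U+00F6     ö            C3 B6
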
So the UTF-8 encoding of the string becomes 9 bytes long in total.

Similarly, the Czech word for "cat" would be encoded like this:

  code point character   utf8 encoding
   U+006B     k            6B
   U+006F     o            6F
   U+010D     ?            C4 8D
   U+006B     k            6B
   U+0061     a            61

(Note: I'm posting in iso-8859-1, which cannot express
the third character in the word: a "c" with a hacek,
hence substituting with "?".)
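
As a rough sketch of the point above (a string's UTF-8 encoding is just
its characters' encodings placed one after another), the table can be
rebuilt in Python. The helper name utf8_table is made up for this
illustration, and the word is written with an escape for U+010D so the
source stays plain ASCII.

```python
def utf8_table(word):
    """Return (code point, character, UTF-8 bytes in hex) per character."""
    rows = []
    for ch in word:
        encoded = ch.encode("utf-8")
        hex_bytes = " ".join(f"{b:02X}" for b in encoded)
        rows.append((f"U+{ord(ch):04X}", ch, hex_bytes))
    return rows

word = "ko\u010dka"  # Czech for "cat"; \u010d is "c" with a hacek
for code_point, _ch, hex_bytes in utf8_table(word):
    # The character itself is left out so this prints on any terminal.
    print(code_point, hex_bytes)

# Encoding character by character and concatenating gives the same
# bytes as encoding the whole string at once.
per_char = b"".join(ch.encode("utf-8") for ch in word)
assert per_char == word.encode("utf-8")
print(len(per_char), "bytes in total")
```

For this word the total comes to 6 bytes: four 1-byte characters and
one 2-byte character.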

And the Japanese word for Japan would be:

  code point character   utf8 encoding
   U+65E5     ?            E6 97 A5
   U+672C     ?            E6 9C AC 
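
To show where the byte values in these tables come from, here is a
minimal hand-written encoder. It is only a sketch: the function name
encode_utf8 is my own, and it skips the validity checks a real encoder
needs (it does not reject surrogates or values above U+10FFFF). In
practice you would just call str.encode("utf-8").

```python
def encode_utf8(code_point):
    """Encode one Unicode code point to UTF-8 bytes by hand (no validation)."""
    if code_point < 0x80:
        # 1 byte:  0xxxxxxx
        return bytes([code_point])
    if code_point < 0x800:
        # 2 bytes: 110xxxxx 10xxxxxx
        return bytes([0xC0 | (code_point >> 6),
                      0x80 | (code_point & 0x3F)])
    if code_point < 0x10000:
        # 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (code_point >> 12),
                      0x80 | ((code_point >> 6) & 0x3F),
                      0x80 | (code_point & 0x3F)])
    # 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([0xF0 | (code_point >> 18),
                  0x80 | ((code_point >> 12) & 0x3F),
                  0x80 | ((code_point >> 6) & 0x3F),
                  0x80 | (code_point & 0x3F)])

# The code points from the tables in this post.
for cp in (0x006B, 0x010D, 0x65E5, 0x672C):
    encoded = encode_utf8(cp)
    # Cross-check against Python's built-in encoder.
    assert encoded == chr(cp).encode("utf-8")
    print(f"U+{cp:04X}", " ".join(f"{b:02X}" for b in encoded))
```

The leading bits of the first byte tell how many bytes the character
occupies, and every following byte starts with the bits 10.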

The UTF-8 encoding achieves a few clever design goals:
- Backwards compatibility with ASCII
- ASCIIbetical sorting still works the same way: sorting by byte value
  gives the same order as sorting by code point
- Forward and backward seeking in the string is possible without desynchronization
- Minimal space wasted
- Possibility to extend naturally if the Unicode set grows
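
A few of these properties are easy to check for yourself. The sketch
below is only an illustration: the word list is arbitrary, and the
helper char_start is a made-up name, not part of any library.

```python
words = ["kocka", "ko\u010dka", "\u65e5\u672c", "zebra", "Kiwi"]

# ASCII compatibility: pure ASCII text has identical bytes in UTF-8.
assert "kocka".encode("utf-8") == "kocka".encode("ascii")

# Sorting: comparing the UTF-8 bytes gives the same order as comparing
# code points (Python compares str values by code point).
by_code_point = sorted(words)
by_bytes = sorted(words, key=lambda w: w.encode("utf-8"))
assert by_code_point == by_bytes

def char_start(data, i):
    """Step back from byte offset i to the first byte of its character.

    Continuation bytes always look like 10xxxxxx, so they can never be
    mistaken for the start of a character.
    """
    while i > 0 and (data[i] & 0xC0) == 0x80:
        i -= 1
    return i

data = "ko\u010dka".encode("utf-8")  # 6B 6F C4 8D 6B 61
assert char_start(data, 3) == 2      # 8D is a continuation byte of C4 8D
assert char_start(data, 4) == 4      # 6B already starts a character
```

The last part is what makes seeking safe: landing in the middle of a
multi-byte character costs at most a few steps backward.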

-- 
Joel Yliluoma - http://iki.fi/bisqwit/
