On Sat, 26 Nov 2005 14:01:29 -0700, Patrick Elliott wrote:
> How can it be, given that it doesn't allow access to all unicode
> characters? It by definition can't, since true unicode requires a
> 'section' code, followed by a 'character' code, of *always* two bytes.
You must be confusing something.
Unicode is a character set whose code points are integers in the range
U+0000 to U+10FFFF, each mapping to a different character.
You can browse it at http://bisqwit.iki.fi/japtools/unicodemap.php,
among other places.
For example, the character U+05E1 always means the Hebrew letter samekh,
and nothing else. There is no "section code" (whatever that means).
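To illustrate that a code point is just one integer, here is a small Python sketch. The helper name describe() is made up for this post; it only wraps the standard unicodedata module.

```python
import unicodedata


def describe(code_point: int) -> str:
    """Return 'U+XXXX NAME' for a single Unicode code point."""
    character = chr(code_point)  # one integer in, one character out
    return f"U+{code_point:04X} {unicodedata.name(character)}"


print(describe(0x05E1))
print(describe(0x0041))
```

No second "section" value is involved anywhere: the integer alone identifies the character.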
In UTF-8, Unicode characters are encoded in a varying number of bytes,
from one to four.
UTF-8 encoding:
bytes  bits  representation
1      7     0bbbbbbb
2      11    110bbbbb 10bbbbbb
3      16    1110bbbb 10bbbbbb 10bbbbbb
4      21    11110bbb 10bbbbbb 10bbbbbb 10bbbbbb
Each b represents a bit that can be used to store character data.
So, the character U+05E1, which is 10111100001 in binary (11 bits),
is encoded in two bytes as 11010111 10100001, that is, D7 A1.
The character U+0041, that is, the Latin capital letter "A",
is encoded as 01000001, that is, 41, which, not coincidentally,
is exactly the same as "A" in ASCII.
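The bit patterns in the table can be applied by hand. Below is a minimal Python sketch of that; utf8_encode() is a name invented for this post, not a library function, and it rejects surrogates and values above U+10FFFF, which UTF-8 does not allow. The loop at the end compares it against Python's built-in str.encode().

```python
def utf8_encode(cp: int) -> bytes:
    """Encode one Unicode code point into UTF-8 bytes by hand."""
    if cp < 0 or cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError("not an encodable Unicode code point")
    if cp < 0x80:
        # 1 byte, 7 bits: 0bbbbbbb
        return bytes([cp])
    if cp < 0x800:
        # 2 bytes, 11 bits: 110bbbbb 10bbbbbb
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    if cp < 0x10000:
        # 3 bytes, 16 bits: 1110bbbb 10bbbbbb 10bbbbbb
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    # 4 bytes, 21 bits: 11110bbb 10bbbbbb 10bbbbbb 10bbbbbb
    return bytes([0xF0 | (cp >> 18),
                  0x80 | ((cp >> 12) & 0x3F),
                  0x80 | ((cp >> 6) & 0x3F),
                  0x80 | (cp & 0x3F)])


# Cross-check against the built-in encoder for one code point of each length.
for cp in (0x0041, 0x05E1, 0x20AC, 0x1F600):
    assert utf8_encode(cp) == chr(cp).encode("utf-8")
    print(f"U+{cp:04X} -> {utf8_encode(cp).hex(' ').upper()}")
```

For U+05E1 this gives the two bytes D7 A1, and for U+0041 the single byte 41.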
--
Joel Yliluoma - http://bisqwit.iki.fi/
: comprehension = 1 / (2 ^ precision)
: Try to be as precise as can be and no one will comprehend what you mean.
: Say nothing, and everybody will understand.