Le Forgeron <jgr### [at] freefr> wrote:
> It is past 24 bits since a few... (even past 32 bits!!!)
That's not true. The current Unicode standard defines about 100000
characters, and the code space tops out at U+10FFFF, so a raw Unicode
value needs at most 21 bits.
UTF-8 encoding "wastes" some bits (in order to use fewer bytes for the
most common Western characters) and requires at most 4 bytes per
character (and the characters that need more than 3 bytes are rarely
used).
> The real thing is how you encode all these.
> UTF-8 is one way (the popular one these days),
> UTF-16 another... and raw storage the worst idea ever!
Why would raw storage be the worst idea? It has several advantages.
For instance, each character takes the same amount of space (instead of
a variable amount, as with the encodings), which means you can directly
index the nth character in a string. In a UTF-8-encoded string, to get
the nth character you have to traverse the entire string up to that
point, decoding it along the way. Raw storage is also the fastest way
of handling the characters, because you never need to convert back and
forth between the encoding and the raw values.
The disadvantage is, of course, an increased memory requirement.
--
- Warp