On 25 Jan 1999 10:13:26 -0500, Ron Parker <par### [at] my-dejanewscom> wrote:
>You would need three high-bit characters in a row: the first one
>has to be one of the sixteen characters that has a high nybble of
>$E, and the other two have to have a high nybble of $8 through $B.
>(mostly symbols of one form or
>another, unlikely to be inside or at the end of a word, particularly
>in combination, though there are conceivable exceptions.) Some of
>the combinations thus formed might even be valid in the font you're
>using (particularly if you're using a Unicode font on NT, most of
>which include Arabic and Hebrew script.)
Erg. It's worse than I thought. You can also have sequences of
two high-bit characters, where the first has a high nybble of
0xC or 0xD and the second has a high nybble of 0x8 through 0xB.
These combinations encode characters in the range 0x80 through
0x7FF, which includes the entire set of ISO-8859-1 high-bit
characters (0x80 through 0xFF).
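So if you use a character from the range 0xC0 through 0xC3
followed by one with a high nybble of 0x8 through 0xB (a range that
includes the inverted exclamation and question marks that are
trademarks of e.g. Spanish words!), you're guaranteed to have a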
valid UTF-8 representation for a character that's probably
represented in your font.
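
To make the ambiguity concrete, here is a minimal Python sketch of the
byte-pattern test described above. The function name utf8_shaped_length
is made up for illustration, and it checks only the high nybbles as
described in this thread. A strict modern UTF-8 decoder also rejects
lead bytes 0xC0 and 0xC1 as overlong forms, which this sketch does not
do.

```python
def utf8_shaped_length(data: bytes, i: int) -> int:
    """Return 2 or 3 if data[i:] starts with a UTF-8-shaped multibyte
    sequence (judged by high nybbles only), otherwise 0.

    Hypothetical helper for illustration; not a full UTF-8 validator.
    """
    def is_trail(b: int) -> bool:
        # Trail bytes have a high nybble of 0x8 through 0xB.
        return 0x80 <= b <= 0xBF

    lead = data[i]
    # Two-byte form: lead has a high nybble of 0xC or 0xD.
    if 0xC0 <= lead <= 0xDF:
        if i + 1 < len(data) and is_trail(data[i + 1]):
            return 2
    # Three-byte form: lead has a high nybble of 0xE.
    elif 0xE0 <= lead <= 0xEF:
        if (i + 2 < len(data)
                and is_trail(data[i + 1])
                and is_trail(data[i + 2])):
            return 3
    return 0


# ISO-8859-1 text: A-tilde (0xC3) followed by inverted question mark (0xBF).
latin1_bytes = "\u00c3\u00bf".encode("latin-1")

# The same two bytes are also a well-formed UTF-8 encoding of U+00FF.
as_utf8 = latin1_bytes.decode("utf-8")
```

Here utf8_shaped_length(latin1_bytes, 0) reports a two-byte sequence,
so a parser that guesses the charset from the bytes alone cannot tell
which reading the author intended.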
Maybe I need to add a "charset" keyword with allowed values of
US-ASCII, ISO-8859-1, UTF-8, and ISO-10646-UCS-2 with a default
of ISO-8859-1. (All of these are IANA names for the various
charsets, so don't blame me for the huge name UCS-2 has. :) )
Any thoughts? Anyone? Does anyone but me and Jon care anymore?