Ron Parker wrote:
> It is UTF-8, but it starts with EF BB BF, the UTF-8 encoding of the FFFE
> endianness indicator (as written on an Intel machine, obviously. A Motorola
> machine would use EF BF BE)
Actually, UTF-8 is byte-order independent. So the UTF-8 BOM will always be EF BB
BF.
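A quick way to check this is the small Python sketch below (Python is my choice here, not something from the thread, and `hex_bytes` is just a helper made up for display). It encodes U+FEFF under both UTF-16 byte orders and under UTF-8. Only the UTF-16 results depend on byte order. EF BF BE would be the UTF-8 encoding of a different code point, U+FFFE.

```python
BOM = "\ufeff"  # U+FEFF, the code point used as the byte order mark


def hex_bytes(data: bytes) -> str:
    """Format bytes as space-separated uppercase hex (display helper only)."""
    return " ".join(f"{b:02X}" for b in data)


# UTF-16 has two serializations, so the BOM bytes reveal the byte order.
utf16_le = hex_bytes(BOM.encode("utf-16-le"))
utf16_be = hex_bytes(BOM.encode("utf-16-be"))

# UTF-8 defines a single byte sequence per code point, on any machine.
utf8 = hex_bytes(BOM.encode("utf-8"))

# U+FFFE is a separate code point (a noncharacter), not a byte-swapped BOM.
swapped = hex_bytes("\ufffe".encode("utf-8"))

print("UTF-16LE:", utf16_le)
print("UTF-16BE:", utf16_be)
print("UTF-8:   ", utf8)
print("U+FFFE in UTF-8:", swapped)
```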
> We could easily interpret the presence of those
> three bytes as an implicit UTF-8 charmap, and infer the endianness of the
> other UTF-8 characters in the file at the same time.
http://www.unicode.org/unicode/faq/utf_bom.html
I just ran into this while dealing with some Java-related issues.
Basically, the BOM is a special use of the standard "ZERO WIDTH NO-BREAK SPACE"
character (U+FEFF). A reader may treat it as a BOM (or UTF-8 flag) and strip it out,
but it doesn't have to. At the beginning of a file, stripping it is probably a good
idea, though.
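Stripping a leading BOM could look something like this minimal Python sketch. The function name `strip_utf8_bom` is hypothetical; `codecs.BOM_UTF8` and the `utf-8-sig` codec are part of the Python standard library. It only removes the three bytes at the very start of the data and leaves any later U+FEFF alone.

```python
import codecs


def strip_utf8_bom(data: bytes) -> bytes:
    """Return data without a leading UTF-8 BOM (EF BB BF), if one is present."""
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):]
    return data


with_bom = codecs.BOM_UTF8 + "hello".encode("utf-8")

# Manual stripping at the byte level.
stripped = strip_utf8_bom(with_bom)

# The plain "utf-8" codec keeps U+FEFF as an ordinary character...
kept = with_bom.decode("utf-8")

# ...while "utf-8-sig" drops one leading BOM while decoding.
dropped = with_bom.decode("utf-8-sig")

print(stripped, repr(kept), repr(dropped))
```

Data without a BOM passes through both approaches unchanged, so either can be applied to every input file.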
--
Jon A. Cruz
http://www.geocities.com/joncruz/action.html