Keith Thompson
glen herrmannsfeldt said:
At least some MS software puts special flag bytes at the beginning
of a UTF-8 or UTF-16 file. As I understand it, to indicate the
endianness and also that it is UTF-8 or UTF-16.
(And to confuse programs not expecting them.)
The special flag is a "byte order mark", represented as the 16-bit
character '\uFEFF' (ZERO WIDTH NO-BREAK SPACE). In UTF-16, you can tell
whether a file is big-endian or little-endian by checking whether the
first two bytes are (FE FF) or (FF FE). In UTF-8, it's represented by
the 3-byte sequence (EF BB BF). Without such a marker, it can be
difficult (and in principle sometimes impossible) to distinguish between
valid big-endian and little-endian UTF-16 files.
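
As a rough illustration of the byte patterns described above, here is a minimal Python sketch that checks the start of a byte string for one of those three BOMs. The function name `detect_bom` and its return convention are my own invention for this example, not part of any standard API. It deliberately ignores UTF-32, whose little-endian BOM (FF FE 00 00) also begins with FF FE, so treat it as a sketch rather than a complete detector.

```python
def detect_bom(data: bytes):
    """Return (encoding_name, bom_length) for a leading BOM, or (None, 0).

    Hypothetical helper for illustration only. It checks just the three
    byte sequences discussed above and does not consider UTF-32.
    """
    if data.startswith(b"\xef\xbb\xbf"):   # U+FEFF encoded as UTF-8
        return ("utf-8", 3)
    if data.startswith(b"\xfe\xff"):       # U+FEFF in big-endian UTF-16
        return ("utf-16-be", 2)
    if data.startswith(b"\xff\xfe"):       # U+FEFF in little-endian UTF-16
        return ("utf-16-le", 2)
    return (None, 0)                       # no BOM: the encoding must be guessed


# Example: decode the text after skipping the BOM, if one was found.
raw = "\ufeffhello".encode("utf-16-le")
encoding, skip = detect_bom(raw)
if encoding is not None:
    text = raw[skip:].decode(encoding)
```

When no BOM is found the function returns `(None, 0)`, which is exactly the case where the caller has to fall back on a default or a heuristic.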
Outside the Windows world, UTF-8 is by far the most common encoding for
Unicode text, and byte order marks are rare. There are *some*
UTF-8-with-BOM files floating around (likely converted from Windows
UTF-16-with-BOM).