David said:
Do you consider a unicode file a text file? The C standard probably
doesn't but unfortunately a lot of people do and they mail them to
me.
Yeah, it's called "globalization". Anyway, the best the C library
can do is read the file as binary; after that you're on your own for
decoding the Unicode. Ironically, the wchar_t stuff is not a portable
solution.
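To put a number on the wchar_t problem, here's a one-line sketch (mine,
not from the original message). The point is that wchar_t is 16 bits on
Windows but usually 32 bits on Unix-like systems, so code that assumes
one encoding unit per wchar_t doesn't travel well:

    #include <stdio.h>
    #include <wchar.h>

    int main(void)
    {
        /* Prints 2 on Windows (UTF-16 units), usually 4 on Unix-like
         * systems (UTF-32 units) -- one reason wchar_t-based code
         * tends not to be portable. */
        printf("sizeof(wchar_t) = %zu bytes\n", sizeof(wchar_t));
        return 0;
    }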
The first indication that you've got one is when "cat" works
but "grep" won't find any of the words that "cat" shows. On a
Windows system these act just like a text file: Notepad,
WordPad, and the DOS-level TYPE and FIND all show the same thing,
with no overt indication that the file contains Unicode.
Sounds like you need a better grep?
Run "od -c" on a unicode file and you'll find that it starts with
a Byte Order Mark (FE FF). After that every other byte is null.
Well, technically a UTF-16 file may start with either FE FF or FF FE,
and any of the octets that follow it may be NUL -- the encoding really
is a mapping to 16-bit values.
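If you want a program to tell you which flavour you've got, something
like this quick sketch does it (my code, not from the thread; the file
name is just a placeholder):

    #include <stdio.h>

    int main(void)
    {
        /* Binary mode, so the bytes come through untranslated. */
        FILE *fp = fopen("message.txt", "rb");   /* placeholder name */
        if (fp == NULL) { perror("fopen"); return 1; }

        int b0 = fgetc(fp), b1 = fgetc(fp);
        if (b0 == 0xFE && b1 == 0xFF)
            puts("UTF-16, big-endian (FE FF)");
        else if (b0 == 0xFF && b1 == 0xFE)
            puts("UTF-16, little-endian (FF FE)");
        else
            puts("no UTF-16 BOM found");

        fclose(fp);
        return 0;
    }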
Take out the BOM and the null characters and you've got an ASCII file,
assuming it was originally written in English.
Better yet, convert it to UTF-8: the ASCII parts stay plain ASCII,
and the non-English characters are not lost. If you are
consistently seeing every other byte as NUL, then the author (or the
program the author is using) has almost certainly chosen a very
sub-optimal encoding (they should have chosen UTF-8 instead).
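If anyone wants to do that conversion without an external tool, here is
a rough sketch of a UTF-16-to-UTF-8 filter (my own code, not anything
from the thread). It assumes the file really does start with a BOM, the
input file name is a placeholder, and broken surrogate pairs are not
validated:

    #include <stdio.h>
    #include <stdint.h>

    /* Read one 16-bit unit in the file's byte order; return 0 at EOF. */
    static int get_u16(FILE *fp, int big_endian, uint16_t *out)
    {
        int b0 = fgetc(fp), b1 = fgetc(fp);
        if (b0 == EOF || b1 == EOF)
            return 0;
        *out = big_endian ? (uint16_t)((b0 << 8) | b1)
                          : (uint16_t)((b1 << 8) | b0);
        return 1;
    }

    /* Write one code point as 1 to 4 UTF-8 bytes. */
    static void put_utf8(uint32_t cp, FILE *out)
    {
        if (cp < 0x80) {
            fputc((int)cp, out);
        } else if (cp < 0x800) {
            fputc((int)(0xC0 | (cp >> 6)), out);
            fputc((int)(0x80 | (cp & 0x3F)), out);
        } else if (cp < 0x10000) {
            fputc((int)(0xE0 | (cp >> 12)), out);
            fputc((int)(0x80 | ((cp >> 6) & 0x3F)), out);
            fputc((int)(0x80 | (cp & 0x3F)), out);
        } else {
            fputc((int)(0xF0 | (cp >> 18)), out);
            fputc((int)(0x80 | ((cp >> 12) & 0x3F)), out);
            fputc((int)(0x80 | ((cp >> 6) & 0x3F)), out);
            fputc((int)(0x80 | (cp & 0x3F)), out);
        }
    }

    int main(void)
    {
        FILE *in = fopen("mail.txt", "rb");           /* placeholder name */
        if (in == NULL) { perror("fopen"); return 1; }

        /* Consume the BOM and note the byte order (FE FF = big-endian). */
        int b0 = fgetc(in), b1 = fgetc(in);
        int big_endian = (b0 == 0xFE && b1 == 0xFF);

        uint16_t u;
        while (get_u16(in, big_endian, &u)) {
            uint32_t cp = u;
            if (u >= 0xD800 && u <= 0xDBFF) {         /* high surrogate: pair it */
                uint16_t lo;
                if (!get_u16(in, big_endian, &lo))
                    break;
                cp = 0x10000 + (((uint32_t)(u - 0xD800) << 10)
                                | (uint32_t)(lo - 0xDC00));
            }
            put_utf8(cp, stdout);
        }
        fclose(in);
        return 0;
    }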
[...] Somewhat ironically
all of the ones I've seen so far have only LF EOLs after being
processed like this. This is for UTF-16. There's also
UTF-32 but thankfully nobody has sent me one of those yet.
UTF-32 is mostly a "theoretical" transfer format. It's commonly used
internally within a program to simplify text manipulation (and can
sometimes be mapped to "wchar_t"); however, nobody would ever use it as
a format for storing or sending a file. The reason is that UTF-16 is
never longer than UTF-32, and UTF-8 is often shorter than both (though
it can be longer than UTF-16 for some scripts).
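To make the size comparison concrete, here's a small sketch (mine) that
prints how many bytes each encoding needs for a few sample code points --
ASCII favours UTF-8, CJK favours UTF-16, and UTF-32 always pays four
bytes:

    #include <stdio.h>
    #include <stdint.h>

    /* Bytes needed to encode one code point in UTF-8. */
    static int utf8_len(uint32_t cp)
    {
        return cp < 0x80 ? 1 : cp < 0x800 ? 2 : cp < 0x10000 ? 3 : 4;
    }

    /* Bytes needed in UTF-16 (surrogate pair above the BMP). */
    static int utf16_len(uint32_t cp)
    {
        return cp < 0x10000 ? 2 : 4;
    }

    int main(void)
    {
        /* 'A', 'e' with acute accent, a CJK character, an emoji. */
        uint32_t samples[] = { 0x41, 0xE9, 0x4E2D, 0x1F600 };
        for (size_t i = 0; i < sizeof samples / sizeof samples[0]; i++) {
            unsigned cp = (unsigned)samples[i];
            printf("U+%05X  UTF-8: %d  UTF-16: %d  UTF-32: 4\n",
                   cp, utf8_len(cp), utf16_len(cp));
        }
        return 0;
    }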