For HTML or XML, the meta charset element or the XML declaration's
encoding attribute, respectively, should tell you what encoding to expect.
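As an illustration of that kind of in-band declaration, here is a rough
sketch in Python (rather than Perl, purely for illustration) that reads the
leading bytes of a document and looks for either declaration. The regexes
are deliberately simplified; the real HTML5 encoding prescan handles many
more cases.

```python
import re

def sniff_declared_encoding(raw):
    """Look for an XML declaration or an HTML meta charset in the
    leading bytes of a document.  Returns an encoding name or None.

    Simplified sketch: the real HTML5 prescan algorithm is stricter.
    """
    head = raw[:1024]
    # <?xml version="1.0" encoding="..."?>
    m = re.search(rb'<\?xml[^>]*encoding=["\']([\w.-]+)["\']', head)
    if m:
        return m.group(1).decode('ascii')
    # <meta charset=...> (also catches charset= inside a content attribute)
    m = re.search(rb'<meta[^>]*charset=["\']?([\w.-]+)', head, re.IGNORECASE)
    if m:
        return m.group(1).decode('ascii')
    return None
```

Note that the scan has to happen on raw bytes, before any decoding layer
is chosen, which is exactly the chicken-and-egg problem discussed below.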
Maybe. Depends how the files got there. For HTTP transactions it's
legal (and often preferable) to supply the character encoding in the
HTTP Content-Type header itself, and to make no mention of it inside
the body of the content. However, the OP speaks of "files",
so presumably you're right, and the HTTP transaction issue is outside
this particular problem domain. But there's still the BOM option to
keep in mind!
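BOM detection itself is a mechanical prefix check on the raw bytes. A
small Python sketch (the thread is presumably about Perl, but the logic
is the same in any language):

```python
import codecs

# Map leading BOM bytes to the encoding they imply.  Order matters:
# the UTF-32 BOMs must be checked before UTF-16, because the
# UTF-32-LE BOM starts with the UTF-16-LE BOM bytes.
_BOMS = [
    (codecs.BOM_UTF32_LE, 'utf-32-le'),
    (codecs.BOM_UTF32_BE, 'utf-32-be'),
    (codecs.BOM_UTF8,     'utf-8-sig'),   # utf-8-sig strips the BOM on read
    (codecs.BOM_UTF16_LE, 'utf-16-le'),
    (codecs.BOM_UTF16_BE, 'utf-16-be'),
]

def bom_encoding(raw):
    """Return the encoding implied by a leading BOM, or None."""
    for bom, name in _BOMS:
        if raw.startswith(bom):
            return name
    return None
```

Of course, the absence of a BOM proves nothing; plenty of valid UTF-8
files carry no BOM at all.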
Just scan for it and evaluate the rest of the file accordingly.
But you can't scan it without reading it, and you can't read it
without opening it; so you'd have to open it provisionally with *some*
mode, scan for the stuff that you have described - and then maybe
re-open it with a different mode?
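That open-scan-reopen dance might look like this, sketched in Python
rather than Perl just to make the shape concrete. The *sniff* callable is
a stand-in for whatever detection you use (BOM check, declaration scan,
or both):

```python
def open_detected(path, sniff, default='utf-8'):
    """Open *path* for text reading, detecting the encoding first.

    *sniff* is any callable taking the leading raw bytes and returning
    an encoding name, or None to fall back on *default*.
    """
    # First open: binary ("raw") mode, just to peek at the bytes.
    with open(path, 'rb') as fh:
        head = fh.read(1024)
    encoding = sniff(head) or default
    # Second open: text mode, with the encoding we detected (or assumed).
    return open(path, 'r', encoding=encoding)
```

The cost is opening the file twice, which is usually negligible for
local files but may matter for pipes or sockets, where you can't reopen
at all.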
OTOH, if one opens it in raw mode, and scans it in a way which can
accommodate itself to different encodings, then, when the relevant
encoding information has been found, the data can be piped through the
appropriate encoding layers explicitly.
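That single-open approach, sketched again in Python: peek at the raw
bytes, rewind, then push a decoding layer onto the same handle. This is
roughly what binmode($fh, ':encoding(...)') does in Perl.

```python
import io

def text_stream(binary_fh, sniff, default='utf-8'):
    """Turn an open, seekable binary handle into a text stream,
    detecting the encoding from its leading bytes."""
    # Peek at the leading bytes, then rewind so nothing is lost.
    head = binary_fh.read(1024)
    binary_fh.seek(0)
    encoding = sniff(head) or default
    # Push a decoding layer onto the already-open handle --
    # the Python analogue of Perl's binmode($fh, ':encoding(...)').
    return io.TextIOWrapper(binary_fh, encoding=encoding)
```

This only needs the handle to be seekable (or the scan to be done with
a peek that doesn't consume input), but it avoids the second open
entirely.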
There are a lot of options, and I'm not sure of the practical
implications of choosing one over another. If the data is to be
processed by an appropriate HTML or XML module, maybe that module can
adapt to different data encodings as read in raw mode?
What I think it comes down to is that it would definitely be a mistake
to open the file with a utf8 IO layer without being sure that it's
UTF-8 encoded, because of the errors that will inevitably result if it
isn't.
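The failure mode is easy to demonstrate. Bytes that are perfectly valid
in, say, Latin-1 are often malformed as UTF-8, and a strict UTF-8 layer
will refuse them (a Python sketch; Perl's :encoding(UTF-8) layer warns
or dies in the analogous situation):

```python
# Bytes that are valid Latin-1 but not valid UTF-8: 0xE9 would have to
# start a multi-byte UTF-8 sequence, and no continuation bytes follow.
data = 'caf\xe9'.encode('latin-1')   # b'caf\xe9'

try:
    data.decode('utf-8')             # what a strict utf8 layer would do
except UnicodeDecodeError:
    print('decode failed')           # raises UnicodeDecodeError
```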
Hope this helps a bit.