Simon said:
May i know is there any possible solutions to detect the encoding or
character set (charset) of a file automatically? Second, how to
convert a particular encoding to Unicode once the file encoding is
detected?
The short answer: there's no easy way to detect charset automatically.
The long answer:
Typically, no filesystem stores metadata that one can associate with a
file encoding. All of your ISO 8859-* codes differ only in what the
codepoints in the x80 - xFF range look like, be it standard accented
characters (like à ), Greek characters (α), or some other language.
Pragmatically differentiating between these single-byte encodings forces
you to resort to either heuristics or getting help from the user (if you
notice, all major browsers allow you to select a web page's encoding for
this very reason).
There is another class of encodings-- variable-length encodings like
UTF-8 or Shift-JIS. One can sometimes rule out these encodings, if
invalid sequences are produced. For example, 0xa4 0xf4 is invalid UTF-8,
so it's probably an ISO 8859-* language instead.
Context is also helpful. You may recall coming across documents that
have unusual character pairings, like ۍ or something (if your
newsreader sucks at i18n, you'll probably be seeing those in this
message as well). That is pretty much a dead giveaway that the message
is UTF-8 but someone is treating it as ISO 8859-1 (or it's very close
sibling, Windows-1252). If you're seeing multiple high-byte characters
in a row, it's more likely UTF-8 than it is ISO 8859-1, although some
other languages may have these cases routinely (like Greek).
The final way to guess at the encoding is to look at what the platform's
default is. Western European-localized products will tend to be in
either Cp1252 (which is pretty much ISO 8859-1) or UTF-8; Japanese are
probably either Shift-JIS or UTF-8. I believe Java's conversion methods
will default to platform encoding for you anyways, so that may be a
safer bet for you. The other alternative is to just assume everyone uses
the same charset and not think about it.