[java programming] How to detect the file encoding?

Simon

Hi all,

May I know, are there any possible solutions to detect the encoding or
character set (charset) of a file automatically? Second, how can I
convert a particular encoding to Unicode once the file encoding is
detected?

Thanks in advance.
 
Stefan Ram

Peter Duniho said:
AFAIK, Unicode is the only commonly used encoding with a "signature" (the
byte-order marker, "BOM"). Detecting other encodings can be done
heuristically, but I'm not aware of any specific support within Java to do
so, and it wouldn't be 100% reliable anyway.

The program could return a /set/ of possible encodings.
Or a map, mapping each encoding to its probability.
Or the top encoding together with its probability (a reliability estimate).

One could gather byte-value frequency statistics from many files
in some common encodings and compare them to the byte-value
frequencies of the given source. (Advanced: frequencies of
byte pairs and so on.)

It would help for this purpose if one could assume a certain
natural language for the content.
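
For illustration, a rough sketch of the frequency comparison in Java;
the per-encoding profiles here are empty placeholders, a real tool
would build them from a corpus of files whose encodings are known:

import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.HashMap;
import java.util.Map;

// Score candidate encodings by how closely the file's byte histogram
// matches a precomputed per-encoding profile. The profiles used in main()
// are placeholders (all zeros); real ones would come from training data.
public class FrequencyGuesser {

    public static Map<String, Double> score(byte[] data, Map<String, double[]> profiles) {
        double[] observed = histogram(data);
        Map<String, Double> scores = new HashMap<>();
        for (Map.Entry<String, double[]> e : profiles.entrySet()) {
            scores.put(e.getKey(), similarity(observed, e.getValue()));
        }
        return scores; // higher = closer to that encoding's typical distribution
    }

    private static double[] histogram(byte[] data) {
        double[] freq = new double[256];
        for (byte b : data) {
            freq[b & 0xFF]++;
        }
        for (int i = 0; i < 256; i++) {
            freq[i] /= Math.max(1, data.length);
        }
        return freq;
    }

    // Cosine similarity between the observed histogram and a profile.
    private static double similarity(double[] a, double[] b) {
        double dot = 0, na = 0, nb = 0;
        for (int i = 0; i < 256; i++) {
            dot += a[i] * b[i];
            na += a[i] * a[i];
            nb += b[i] * b[i];
        }
        return dot / (Math.sqrt(na) * Math.sqrt(nb) + 1e-12);
    }

    public static void main(String[] args) throws Exception {
        byte[] data = Files.readAllBytes(Paths.get(args[0]));
        Map<String, double[]> profiles = new HashMap<>();
        profiles.put("ISO-8859-1", new double[256]);   // placeholder profile
        profiles.put("windows-1252", new double[256]); // placeholder profile
        score(data, profiles).forEach((cs, s) -> System.out.println(cs + " -> " + s));
    }
}

A higher score only means »looks more like that encoding's typical byte
distribution«; it is evidence, not proof.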

Or, one might study how other software is doing this. Such software
can be found using Google, for example:

»enca -- detect and convert encoding of text files«

http://www.digipedia.pl/man/enca.1.html

(Or, install and call this software from Java.)
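
A minimal sketch of that last option, assuming enca is installed and on
the PATH; check the local man page, since the exact options differ
between versions (»-L none« is meant to switch off the language-specific
tests):

import java.io.BufferedReader;
import java.io.InputStreamReader;

// Call the external detector and read its report from stdout.
public class EncaCall {
    public static void main(String[] args) throws Exception {
        Process p = new ProcessBuilder("enca", "-L", "none", args[0])
                .redirectErrorStream(true)
                .start();
        try (BufferedReader r = new BufferedReader(
                new InputStreamReader(p.getInputStream()))) {
            String line;
            while ((line = r.readLine()) != null) {
                System.out.println(line); // e.g. a human-readable charset description
            }
        }
        System.out.println("exit code: " + p.waitFor());
    }
}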
 
Stefan Ram

One could gather byte-value frequency statistics from many files
in some common encodings and compare them to the byte-value
frequencies of the given source. (Advanced: frequencies of
byte pairs and so on.)

Of course, one can take advantage of the fact that certain
octet values and octet sequences are absolutely forbidden
in certain encodings, so as to exclude those encodings.

The program might then sometimes even detect the encoding better
than a declared one does. For example, some authors declare
»ISO-8859-1«, but actually use »Windows-1252«.

Another idea would be to assume a /common/ encoding first, such as
UTF-8 (including US-ASCII), ISO-8859-1, or Windows-1252, and to
detect a rare encoding only when there is strong evidence
for it.

It is easy to tell UTF-8 from ISO-8859-1 by the encoding of
character values above 127 and to tell ISO-8859-1 from
Windows-1252 by the presence of the Windows-1252 extension
octet values.
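
A sketch of exactly that three-way decision: valid UTF-8 wins,
otherwise bytes in the 0x80-0x9F range point to Windows-1252,
otherwise ISO-8859-1. It is a heuristic, not a definitive detector:

import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Distinguish UTF-8, Windows-1252 and ISO-8859-1 by the clues described above.
public class SimpleGuess {
    public static String guess(byte[] data) {
        if (isValidUtf8(data)) {
            return "UTF-8";            // also covers pure US-ASCII
        }
        for (byte b : data) {
            int v = b & 0xFF;
            if (v >= 0x80 && v <= 0x9F) {
                return "windows-1252"; // ISO-8859-1 has only control codes here
            }
        }
        return "ISO-8859-1";
    }

    // Strict decode: any malformed sequence means the data is not UTF-8.
    private static boolean isValidUtf8(byte[] data) {
        try {
            StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(data));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }
}

A clean UTF-8 decode is strong evidence rather than proof; short
ISO-8859-1 texts can happen to form valid UTF-8 sequences.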

So it will help if the user can give an estimate of the
encodings most common in his realm.
 
Joshua Cranmer

Simon said:
May I know, are there any possible solutions to detect the encoding or
character set (charset) of a file automatically? Second, how can I
convert a particular encoding to Unicode once the file encoding is
detected?

The short answer: there's no easy way to detect charset automatically.

The long answer:
Typically, no filesystem stores metadata that one can associate with a
file encoding. All of the ISO 8859-* encodings differ only in what the
codepoints in the 0x80 - 0xFF range look like, be it standard accented
characters (like à), Greek characters (α), or some other language.
Pragmatically, differentiating between these single-byte encodings forces
you to resort either to heuristics or to help from the user (if you
notice, all major browsers let you select a web page's encoding for
this very reason).

There is another class of encodings: variable-length encodings like
UTF-8 or Shift-JIS. One can sometimes rule these encodings out, if
decoding produces invalid sequences. For example, 0xa4 0xf4 is invalid
UTF-8, so a file containing it is probably in an ISO 8859-* encoding instead.
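
A sketch of that exclusion step: run each candidate's decoder in strict
mode and drop the ones that report malformed input. The candidate list
is whatever you pass in (say, UTF-8 and Shift_JIS); note that pure
single-byte charsets like ISO 8859-1 accept every byte value, so only
the variable-length candidates can ever be excluded this way:

import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.util.ArrayList;
import java.util.List;

// Keep only the charsets whose decoders accept the data without errors.
public class RuleOut {
    public static List<Charset> possible(byte[] data, List<Charset> candidates) {
        List<Charset> survivors = new ArrayList<>();
        for (Charset cs : candidates) {
            try {
                cs.newDecoder()
                  .onMalformedInput(CodingErrorAction.REPORT)
                  .onUnmappableCharacter(CodingErrorAction.REPORT)
                  .decode(ByteBuffer.wrap(data));
                survivors.add(cs);
            } catch (CharacterCodingException e) {
                // invalid byte sequence for this charset -> exclude it
            }
        }
        return survivors;
    }
}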

Context is also helpful. You may recall coming across documents that
have unusual character pairings, like Ã© or something (if your
newsreader sucks at i18n, you'll probably be seeing those in this
message as well). That is pretty much a dead giveaway that the message
is UTF-8 but someone is treating it as ISO 8859-1 (or its very close
sibling, Windows-1252). If you're seeing multiple high-byte characters
in a row, it's more likely UTF-8 than ISO 8859-1, although some
other languages may have these cases routinely (like Greek).

The final way to guess at the encoding is to look at what the platform's
default is. Western European-localized products will tend to be in
either Cp1252 (which is pretty much ISO 8859-1) or UTF-8; Japanese ones
are probably either Shift-JIS or UTF-8. I believe Java's conversion
methods will default to the platform encoding for you anyway, so that
may be a safer bet for you. The other alternative is to just assume everyone uses
the same charset and not think about it.
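
For the second half of the original question: once a charset has been
picked (detected, platform default, or just assumed), getting to Unicode
is the easy part, since Java strings are Unicode internally. A minimal
sketch:

import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.InputStreamReader;
import java.nio.charset.Charset;

// Decode a file into a Unicode String using a chosen charset,
// falling back to the platform default when none is given.
public class ConvertToUnicode {
    public static void main(String[] args) throws Exception {
        Charset charset = args.length > 1
                ? Charset.forName(args[1])  // e.g. "windows-1252"
                : Charset.defaultCharset(); // the platform default mentioned above
        StringBuilder text = new StringBuilder();
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(new FileInputStream(args[0]), charset))) {
            int c;
            while ((c = in.read()) != -1) {
                text.append((char) c);
            }
        }
        // text now holds Unicode characters; write it back out as UTF-8 if desired.
        System.out.println(text.length() + " chars read as " + charset);
    }
}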
 
Roedy Green

May I know, are there any possible solutions to detect the encoding or
character set (charset) of a file automatically? Second, how can I
convert a particular encoding to Unicode once the file encoding is
detected?

I wrote a utility to manually assist the process. You could do it
automatically if you know the vocabulary of the file. Search for byte
patterns of encoded words.

see http://mindprod.com/jgloss/encoding.html
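
A rough sketch of that idea, with a made-up probe word; note that
ISO-8859-1 and Windows-1252 happen to encode this particular word with
the same bytes, so the probe cannot separate those two here:

import java.nio.charset.Charset;

// Encode a word expected to occur in the file in each candidate charset
// and see which byte pattern actually appears in the raw data.
public class VocabularyProbe {
    public static void main(String[] args) throws Exception {
        byte[] data = java.nio.file.Files.readAllBytes(java.nio.file.Paths.get(args[0]));
        String word = "naïve"; // any word with non-ASCII characters you expect in the file
        for (String name : new String[] {"UTF-8", "ISO-8859-1", "windows-1252"}) {
            byte[] pattern = word.getBytes(Charset.forName(name));
            System.out.println(name + ": " + (contains(data, pattern) ? "match" : "no match"));
        }
    }

    // Naive byte-pattern search.
    private static boolean contains(byte[] data, byte[] pattern) {
        outer:
        for (int i = 0; i + pattern.length <= data.length; i++) {
            for (int j = 0; j < pattern.length; j++) {
                if (data[i + j] != pattern[j]) continue outer;
            }
            return true;
        }
        return false;
    }
}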

The fact that you can't tell is like dirty coffee cups and pizza boxes on
the floor. I can't imagine that happening if someone like Martha Stewart
were in charge.
--
Roedy Green Canadian Mind Products
http://mindprod.com

"Everybody’s worried about stopping terrorism. Well, there’s a really easy way: stop participating in it."
~ Noam Chomsky
 
Roedy Green

Of course, one can take advantage of the fact that certain
octet values and octet sequences are absolutely forbidden
in certain encodings, so as to exclude those encodings.

The biggest clue is the country source of the file. Check the
national encodings first.
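
For example, a small (purely illustrative) table mapping a country code
to the candidate charsets to try first:

import java.nio.charset.Charset;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Order candidate charsets by the country the file is believed to come from,
// trying the national encodings right after UTF-8. The table is not exhaustive.
public class NationalCandidates {
    private static final Map<String, List<String>> BY_COUNTRY = Map.of(
            "JP", List.of("UTF-8", "Shift_JIS", "EUC-JP", "ISO-2022-JP"),
            "RU", List.of("UTF-8", "windows-1251", "KOI8-R"),
            "DE", List.of("UTF-8", "windows-1252", "ISO-8859-1"));

    public static List<Charset> candidatesFor(String countryCode) {
        return BY_COUNTRY.getOrDefault(countryCode, List.of("UTF-8", "ISO-8859-1"))
                .stream()
                .map(Charset::forName)
                .collect(Collectors.toList());
    }
}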
--
Roedy Green Canadian Mind Products
http://mindprod.com

"Everybody’s worried about stopping terrorism. Well, there’s a really easy way: stop participating in it."
~ Noam Chomsky
 
Roedy Green

I've often thought an elegant solution would be to define more than
one BOM (byte order mark) in Unicode. They could allocate enough
BOMs to have a different one for each encoding.

There are hundreds of encodings. You could add it now with:

BOM BOM name-of-encoding BOM.

That way you don't have to reserve any new characters.

While we are at it, we should encode the MIME type and create an
extensible scheme to add other meta-information.
--
Roedy Green Canadian Mind Products
http://mindprod.com

"Everybody’s worried about stopping terrorism. Well, there’s a really easy way: stop participating in it."
~ Noam Chomsky
 
Stefan Ram

Roedy Green said:
There are hundreds of encodings. You could add it now with:
BOM BOM name-of-encoding BOM.

It is called »XML«:

<?xml encoding="name-of-encoding" ?><text><![CDATA[...]]></text>
¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯
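
A naive sketch of pulling that pseudo-attribute out of a file's prolog;
a real XML parser does this properly (including declarations in
non-ASCII-compatible encodings), this only inspects the first few bytes
as ASCII:

import java.io.FileInputStream;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Extract the encoding pseudo-attribute from an XML declaration, if present.
public class XmlDeclSniffer {
    private static final Pattern ENC =
            Pattern.compile("<\\?xml[^>]*encoding\\s*=\\s*[\"']([^\"']+)[\"']");

    public static String sniff(String path) throws Exception {
        byte[] head = new byte[256];
        int n;
        try (FileInputStream in = new FileInputStream(path)) {
            n = in.read(head);
        }
        if (n <= 0) return null;
        String prolog = new String(head, 0, n, StandardCharsets.US_ASCII);
        Matcher m = ENC.matcher(prolog);
        return m.find() ? m.group(1) : null; // e.g. "ISO-8859-1", or null if absent
    }
}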
 
Roedy Green

FE FF UTF-16BE BOM
FF FE UTF-16LE BOM
EF BB BF UTF-8 BOM

So multiple BOMs are already defined, including one
for UTF-8. (I knew it was a good idea! :)

I suppose we could try to get rid of all the old 8-bit encodings and
use Unicode/UTF rather than try to patch all those text files out
there with some scheme to mark the encoding.
--
Roedy Green Canadian Mind Products
http://mindprod.com

Never discourage anyone... who continually makes progress, no matter how slow.
~ Plato 428 BC died: 348 BC at age: 80
 
Mayeul

Wayne said:
Stefan said:
Roedy Green said:
There are hundreds of encodings. You could add it now with:
BOM BOM name-of-encoding BOM.
It is called »XML«:

<?xml encoding="name-of-encoding" ?><text><![CDATA[...]]></text>
¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯

Right, but there should be a simple way to deal with plain text
files too.

Turns out there is one! I've been reading the HTML5 draft spec
and came across this:

2.7.3 Content-Type sniffing: text or binary

1. The user agent may wait for 512 or more bytes of the resource
to be available.
2. Let n be the smaller of either 512 or the number of bytes
already available.
3. If n is 4 or more, and the first bytes of the resource match
one of the following byte sets:

Bytes in Hexadecimal    Description
FE FF                   UTF-16BE BOM
FF FE                   UTF-16LE BOM
EF BB BF                UTF-8 BOM

So multiple BOMs are already defined, including one
for UTF-8. (I knew it was a good idea! :)

I wouldn't say that "multiple" BOMs are already defined. The idea of the
BOM is to insert a zero-width no-break space character, whose code point
is U+FEFF, at the start of the file.

Since this character will be encoded differently by different encodings,
it makes it possible to distinguish between UTF-16BE, UTF-16LE, UTF-8 and
other Unicode encodings.
It is also a somewhat acceptable way to indicate that a file is UTF-8 rather
than Latin-1 or something, since it seems unlikely that a plain-text file
would start with the characters that the BOM's bytes represent in
non-Unicode encodings.
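
A minimal sketch of that sniff, mapping the three encoded forms of
U+FEFF from the quoted table to a charset name (null meaning "no BOM,
fall back to other heuristics"):

import java.io.FileInputStream;
import java.io.IOException;

// Look at the first bytes of a file and report the charset its BOM implies.
public class BomSniffer {
    public static String sniff(String path) throws IOException {
        byte[] head = new byte[3];
        int n;
        try (FileInputStream in = new FileInputStream(path)) {
            n = in.read(head);
        }
        if (n >= 2 && (head[0] & 0xFF) == 0xFE && (head[1] & 0xFF) == 0xFF) return "UTF-16BE";
        if (n >= 2 && (head[0] & 0xFF) == 0xFF && (head[1] & 0xFF) == 0xFE) return "UTF-16LE";
        if (n >= 3 && (head[0] & 0xFF) == 0xEF && (head[1] & 0xFF) == 0xBB
                   && (head[2] & 0xFF) == 0xBF) return "UTF-8";
        return null; // no recognized BOM
    }
}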

Bottom line, the BOM is a zero-width no-break space. It is unique; there
are no multiple BOMs.

Or if there are some that I don't know of, that would be another standard
the given table wouldn't conform with.
 
