How to detect the character encoding of a web page, such as this one: http://python.org/ ?
And how to let Python do it for you? For example, how would you detect the
character encoding of these two pages in Python?
http://python.org/
http://msdn.microsoft.com/en-us/library/bb802962(v=office.12).aspx
$ wget -q -O - http://python.org/ | chardetect.py
stdin: ISO-8859-2 with confidence 0.803579722043
$
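(For completeness, the same guess can be made from inside Python instead of
the shell -- a minimal sketch using chardet's detect() API, which is what
the chardetect script wraps:)

import urllib.request

import chardet  # third-party: pip install chardet

# Fetch raw bytes; decoding has to wait until we know (or guess) the encoding.
raw = urllib.request.urlopen("http://python.org/").read()

# chardet guesses from byte statistics alone -- no HTML parsing involved.
guess = chardet.detect(raw)
print(guess)  # e.g. {'encoding': 'ISO-8859-2', 'confidence': 0.80, ...}

if guess["encoding"]:
    text = raw.decode(guess["encoding"], errors="replace")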
And it sucks, because it uses magic instead of reading the HTML tags. The
RIGHT thing to do for websites is to detect the meta charset definition,
which is
<meta http-equiv="content-type" content="text/html; charset=utf-8">
or
<meta charset="utf-8">
The second form is for HTML5 websites; matching either may require case
conversion and tolerating the useless ` /` at the end. But if somebody is
using HTML5, you are pretty much guaranteed to get UTF-8.
In today's world, the proper assumption to make is "UTF-8 or GTFO",
because nobody in their right mind would use anything else today.
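A minimal sketch of that meta-charset detection -- one quick bytes-level
regex that handles both forms case-insensitively. A real implementation
should tolerate more attribute orderings and quoting styles, or use a
proper HTML parser; this is just the idea:

import re
import urllib.request

# One case-insensitive, bytes-level pattern catches both forms:
#   <meta charset="utf-8">                                    (HTML5)
#   <meta http-equiv="content-type"
#         content="text/html; charset=utf-8">                 (HTML4)
# Searching the raw bytes sidesteps the chicken-and-egg problem of
# having to decode the page before knowing its encoding.
META_RE = re.compile(rb'<meta[^>]+charset=["\']?([\w-]+)', re.IGNORECASE)

def sniff_meta_charset(raw):
    """Return the charset declared in a <meta> tag (lowercased), or None."""
    m = META_RE.search(raw)
    return m.group(1).decode("ascii").lower() if m else None

raw = urllib.request.urlopen("http://python.org/").read()
print(sniff_meta_charset(raw))  # e.g. 'utf-8'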
Alister said: Indeed, due to the poor quality of most websites it is not
possible to be 100% accurate for all sites.
Personally I would start by checking the doc type & then the meta data, as
these should be quick & correct; I then use chardetect only if these
fail to provide any result.
I agree that checking the metadata is the right thing to do. But, I
wouldn't go so far as to assume it will always be correct. There's a
lot of crap out there with perfectly formed metadata which just happens
to be wrong.
Although it pains me greatly to quote Ronald Reagan as a source of
wisdom, I have to admit he got it right with "Trust, but verify". It's
the only way to survive in the unicode world. Write defensive code.
Wrap try blocks around calls that might raise exceptions if the external
data is borked w/r/t what the metadata claims it should be.
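Something like this sketch, say -- where declared_encoding is whatever the
metadata claimed, and the chardet fallback is my assumption, not something
anyone above prescribed:

def decode_defensively(raw, declared_encoding):
    """Trust the declared encoding, but verify against the actual bytes."""
    try:
        return raw.decode(declared_encoding)
    except (UnicodeDecodeError, LookupError):
        # The metadata lied, or named a codec Python doesn't know.
        # Fall back to a statistical guess, then to a lossy decode,
        # rather than letting the exception kill the whole pipeline.
        try:
            import chardet
            guess = chardet.detect(raw)["encoding"]
            if guess:
                return raw.decode(guess, errors="replace")
        except ImportError:
            pass
        return raw.decode("utf-8", errors="replace")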
On Monday, December 24, 2012 at 8:34:47 AM UTC+8, iMath wrote:
By the way, we cannot get the character encoding programmatically from the meta data without knowing the character encoding ahead of time!
The HTTP header is completely out of band. This is the best way to
transmit encoding information. Otherwise, you assume 7-bit ASCII and start
parsing. Once you find a meta tag, you stop parsing and go back to the
top, decoding in the new way.
"ASCII-compatible" covers a huge number of
encodings, so it's not actually much of a problem to do this.
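As a sketch, that two-step workflow might look like the following, reusing
sniff_meta_charset from the earlier sketch (the final UTF-8 default is an
assumption of mine, not part of this post):

import urllib.request

def fetch_and_decode(url):
    """Prefer the out-of-band HTTP header; fall back to in-band sniffing."""
    resp = urllib.request.urlopen(url)
    raw = resp.read()

    # 1. Out of band: "Content-Type: text/html; charset=..." header.
    charset = resp.headers.get_content_charset()

    # 2. In band: scan the (assumed ASCII-compatible) bytes for a <meta>
    #    declaration, then "go back to the top" and decode the whole page.
    if charset is None:
        charset = sniff_meta_charset(raw) or "utf-8"  # assumed default

    return raw.decode(charset, errors="replace")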
Provided that the meta tag indicates an ASCII-compatible encoding, and you
haven't encountered any decode errors due to 8-bit characters, then
there's no need to go back to the top.
With slight modifications, you can also handle some
almost-ASCII-compatible encodings such as Shift-JIS.
Personally, I'd start by assuming ISO-8859-1, keep track of which bytes
have actually been seen, and only re-start parsing from the top if the
encoding change actually affects the interpretation of any of those bytes.
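A sketch of that restart test. The per-byte comparison is only meaningful
for single-byte target encodings; a multibyte lead byte fails to decode on
its own, which conservatively forces a restart anyway. ISO-8859-1 maps
every byte 0x00-0xFF straight to the code point of the same value, so the
provisional parse never loses information:

def needs_restart(seen_bytes, new_encoding):
    """True if any byte already parsed reads differently under new_encoding."""
    for b in set(seen_bytes):
        provisional = bytes([b]).decode("latin-1")  # the initial assumption
        try:
            real = bytes([b]).decode(new_encoding)
        except (UnicodeDecodeError, LookupError):
            return True  # multibyte lead byte or unknown codec: start over
        if real != provisional:
            return True
    return False

# All-ASCII input never needs a restart, whatever the declared encoding:
assert not needs_restart(b"<html><head>", "utf-8")
# A byte >= 0x80 reads differently in cp1251 than in latin-1:
assert needs_restart(b"caf\xe9", "cp1251")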