How to detect the character encoding of a web page, such as this one: http://python.org/ ?
And how to let Python do it for you? For example, how would you detect the
character encoding of these two pages in Python?
http://python.org/
http://msdn.microsoft.com/en-us/library/bb802962(v=office.12).aspx
$ wget -q -O - http://python.org/ | chardetect.py
stdin: ISO-8859-2 with confidence 0.803579722043
$
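(For completeness, the same guess can be made from inside Python instead of
the shell -- a minimal sketch using chardet's detect() API, which is what
the chardetect script wraps:)

import urllib.request

import chardet  # third-party: pip install chardet

# Fetch raw bytes; decoding has to wait until we know (or guess) the encoding.
raw = urllib.request.urlopen("http://python.org/").read()

# chardet guesses from byte statistics alone -- no HTML parsing involved.
guess = chardet.detect(raw)
print(guess)  # e.g. {'encoding': 'ISO-8859-2', 'confidence': 0.80, ...}

if guess["encoding"]:
    text = raw.decode(guess["encoding"], errors="replace")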
And it sucks, because it uses magic instead of reading the HTML tags. The
RIGHT thing to do for websites is to detect the meta charset definition,
which is
<meta http-equiv="content-type" content="text/html; charset=utf-8">
or
<meta charset="utf-8">
The second form is for HTML5 websites; matching either may require case
conversion and tolerating the useless ` /` at the end. But if somebody is
using HTML5, you are pretty much guaranteed to get UTF-8.
In today's world, the proper assumption to make is "UTF-8 or GTFO",
because nobody in their right mind would use anything else today.
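A minimal sketch of that meta-charset detection -- one quick bytes-level
regex that handles both forms case-insensitively. A real implementation
should tolerate more attribute orderings and quoting styles, or use a
proper HTML parser; this is just the idea:

import re
import urllib.request

# One case-insensitive, bytes-level pattern catches both forms:
#   <meta charset="utf-8">                                    (HTML5)
#   <meta http-equiv="content-type"
#         content="text/html; charset=utf-8">                 (HTML4)
# Searching the raw bytes sidesteps the chicken-and-egg problem of
# having to decode the page before knowing its encoding.
META_RE = re.compile(rb'<meta[^>]+charset=["\']?([\w-]+)', re.IGNORECASE)

def sniff_meta_charset(raw):
    """Return the charset declared in a <meta> tag (lowercased), or None."""
    m = META_RE.search(raw)
    return m.group(1).decode("ascii").lower() if m else None

raw = urllib.request.urlopen("http://python.org/").read()
print(sniff_meta_charset(raw))  # e.g. 'utf-8'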
Alister said: Indeed, due to the poor quality of most websites it is not
possible to be 100% accurate for all sites.
Personally I would start by checking the doc type & then the meta data, as
these should be quick & correct; I then use chardetect only if these
fail to provide any result.
I agree that checking the metadata is the right thing to do. But, I
wouldn't go so far as to assume it will always be correct. There's a
lot of crap out there with perfectly formed metadata which just happens
to be wrong.
Although it pains me greatly to quote Ronald Reagan as a source of
wisdom, I have to admit he got it right with "Trust, but verify". It's
the only way to survive in the unicode world. Write defensive code.
Wrap try blocks around calls that might raise exceptions if the external
data is borked w/r/t what the metadata claims it should be.
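Something like this sketch, say -- where declared_encoding is whatever the
metadata claimed, and the chardet fallback is my assumption, not something
anyone above prescribed:

def decode_defensively(raw, declared_encoding):
    """Trust the declared encoding, but verify against the actual bytes."""
    try:
        return raw.decode(declared_encoding)
    except (UnicodeDecodeError, LookupError):
        # The metadata lied, or named a codec Python doesn't know.
        # Fall back to a statistical guess, then to a lossy decode,
        # rather than letting the exception kill the whole pipeline.
        try:
            import chardet
            guess = chardet.detect(raw)["encoding"]
            if guess:
                return raw.decode(guess, errors="replace")
        except ImportError:
            pass
        return raw.decode("utf-8", errors="replace")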
On Monday, December 24, 2012 at 8:34:47 AM UTC+8, iMath wrote:
By the way, we cannot get the character encoding programmatically from the meta data without knowing the character encoding ahead of time!
The HTTP header is completely out of band. This is the best way to
transmit encoding information. Otherwise, you assume 7-bit ASCII and start
parsing. Once you find a meta tag, you stop parsing and go back to the
top, decoding in the new way.
"ASCII-compatible" covers a huge number of
encodings, so it's not actually much of a problem to do this.
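As a sketch, that two-step workflow might look like the following, reusing
sniff_meta_charset from the earlier sketch (the final UTF-8 default is an
assumption of mine, not part of this post):

import urllib.request

def fetch_and_decode(url):
    """Prefer the out-of-band HTTP header; fall back to in-band sniffing."""
    resp = urllib.request.urlopen(url)
    raw = resp.read()

    # 1. Out of band: "Content-Type: text/html; charset=..." header.
    charset = resp.headers.get_content_charset()

    # 2. In band: scan the (assumed ASCII-compatible) bytes for a <meta>
    #    declaration, then "go back to the top" and decode the whole page.
    if charset is None:
        charset = sniff_meta_charset(raw) or "utf-8"  # assumed default

    return raw.decode(charset, errors="replace")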
Provided that the meta tag indicates an ASCII-compatible encoding, and you
haven't encountered any decode errors due to 8-bit characters, then
there's no need to go back to the top.
With slight modifications, you can also handle some
almost-ASCII-compatible encodings such as Shift-JIS.
Personally, I'd start by assuming ISO-8859-1, keep track of which bytes
have actually been seen, and only re-start parsing from the top if the
encoding change actually affects the interpretation of any of those bytes.
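A sketch of that restart test. The per-byte comparison is only meaningful
for single-byte target encodings; a multibyte lead byte fails to decode on
its own, which conservatively forces a restart anyway. ISO-8859-1 maps
every byte 0x00-0xFF straight to the code point of the same value, so the
provisional parse never loses information:

def needs_restart(seen_bytes, new_encoding):
    """True if any byte already parsed reads differently under new_encoding."""
    for b in set(seen_bytes):
        provisional = bytes([b]).decode("latin-1")  # the initial assumption
        try:
            real = bytes([b]).decode(new_encoding)
        except (UnicodeDecodeError, LookupError):
            return True  # multibyte lead byte or unknown codec: start over
        if real != provisional:
            return True
    return False

# All-ASCII input never needs a restart, whatever the declared encoding:
assert not needs_restart(b"<html><head>", "utf-8")
# A byte >= 0x80 reads differently in cp1251 than in latin-1:
assert needs_restart(b"caf\xe9", "cp1251")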