Some <head> clauses cause BeautifulSoup to choke?

Frank Stutzman

I've got a simple script that looks like this:
---------------------------------------------------
import BeautifulSoup, urllib

url = ("http://www.naco.faa.gov/digital_tpp_search.asp?fldIdent=klax"
       "&fld_ident_type=ICAO&ver=0711&bnSubmit=Complete+Search")
ifile = urllib.urlopen(url).read()

soup = BeautifulSoup.BeautifulSoup(ifile)
print soup.prettify()
----------------------------------------------------

and all I get out of it is garbage. Other similar URLs from the same site
work fine (use
http://www.naco.faa.gov/digital_tpp_search.asp?fldIdent=klax&fld_ident_type=ICAO&ver=0711&bnSubmit=Complete+Search
as one example).

I did some poking and prodding and it seems that there is something in the
<head> clause that is causing the problem. Heck if I can see what it is.

I'm new to BeautifulSoup (heck, I'm new to python). If I'm doing something
dumb, you don't need to be gentle.
 

Chris Mellon

Frank Stutzman said:
Other similar URLs from the same site work fine (use
http://www.naco.faa.gov/digital_tpp_search.asp?fldIdent=klax&fld_ident_type=ICAO&ver=0711&bnSubmit=Complete+Search
as one example).

You have the same URL as both your good and bad example.
 

Marc Christiansen

Frank Stutzman said:
I've got a simple script [...] and all I get out of it is garbage.

Same for me.
I did some poking and prodding and it seems that there is something in the
<head> clause that is causing the problem. Heck if I can see what it is.

The problem is this line:
<META http-equiv="Content-Type" content="text/html; charset=UTF-16">

which is wrong: the content is not utf-16 encoded. The line after it
declares the charset as utf-8, which is correct, although ascii would be
ok too.

If I save the search result and remove this line, everything works. So,
you could:
- ignore problematic pages
- save and edit them, then reparse them (not always practical)
- use the fromEncoding argument:
soup=BeautifulSoup.BeautifulSoup(ifile, fromEncoding="utf-8")
(or 'ascii'). Of course this only works if you guess/predict the
encoding correctly ;) Which can be difficult. Since BeautifulSoup uses
"an encoding discovered in the document itself" (quote from
<http://www.crummy.com/software/BeautifulSoup/documentation.html#Beautiful Soup Gives You Unicode, Dammit>)
when the encoding you supply does not work, using fromEncoding="ascii"
should not hurt too much. But this being usenet, I'm sure someone will
tell me that I'm wrong and there is some weird 7bit encoding in use
somewhere on the web...
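To make the failure mode concrete, here is a minimal Python 3 sketch (not from the thread; the byte string is made up for illustration): a page whose first meta tag claims UTF-16 while the bytes are really UTF-8. Trusting the declared encoding raises the same kind of "truncated data" error that shows up later in this thread, while decoding as UTF-8 works fine.

```python
# Illustrative only: the first <meta> tag lies about the encoding.
# The body is 35 bytes of UTF-8/ASCII; an odd byte count can never be
# valid UTF-16, so trusting the declaration raises UnicodeDecodeError.
body = '<meta charset="UTF-16"><p>hello</p>'.encode("utf-8")

try:
    body.decode("utf-16")
    trusted_meta_worked = True
except UnicodeDecodeError:
    trusted_meta_worked = False  # "truncated data"

print(trusted_meta_worked)   # the declared charset is a lie
print(body.decode("utf-8"))  # decoding as utf-8 succeeds
```

This is the same mismatch BeautifulSoup runs into when it believes the first meta tag.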
I'm new to BeautifulSoup (heck, I'm new to python). If I'm doing something
dumb, you don't need to be gentle.

No, you did nothing dumb. The server sent you broken content.

Ciao
Marc
 

Duncan Booth

Frank Stutzman said:
I did some poking and prodding and it seems that there is something in the
<head> clause that is causing the problem. Heck if I can see what it is.

Maybe Beautifulsoup believes the incorrect encoding in the meta tags?
 

Chris Mellon

Marc Christiansen said:
The problem is this line:
<META http-equiv="Content-Type" content="text/html; charset=UTF-16">
[...]
No, you did nothing dumb. The server sent you broken content.

Correct. However, this is the sort of real-life broken HTML that BS is
tasked to handle. It looks like the major browsers handle this by using
the last content type (header or meta tag) encountered before other
content. Right now, BS has a number of fallback mechanisms, but its
meta-tag fallback only looks at the first tag.

Posting a feature request through whatever mechanism BS uses for this
sort of thing would probably be a good idea.
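A hedged Python 3 sketch of the browser-like behaviour described above (this is illustrative, not BeautifulSoup's actual algorithm; the regex, function name, and sample page are all made up): collect every charset declared in a meta tag and try the later declarations first, falling back to utf-8 and finally latin-1.

```python
import re

# Hypothetical "later meta tag wins" fallback, sketching what the major
# browsers reportedly do. Not BeautifulSoup's real implementation.
META_CHARSET = re.compile(rb'charset=["\']?([A-Za-z0-9_-]+)', re.IGNORECASE)

def decode_with_meta_fallback(raw):
    """Try declared charsets from last to first, then utf-8, then latin-1."""
    declared = [m.group(1).decode("ascii") for m in META_CHARSET.finditer(raw)]
    for enc in list(reversed(declared)) + ["utf-8"]:
        try:
            return raw.decode(enc), enc
        except (UnicodeDecodeError, LookupError):
            continue
    return raw.decode("latin-1"), "latin-1"  # latin-1 accepts any byte

# A page like the FAA one: a bogus UTF-16 declaration followed by a
# correct utf-8 one, with genuinely UTF-8 content.
page = (b'<meta http-equiv="Content-Type" content="text/html; charset=UTF-16">\n'
        b'<meta http-equiv="Content-Type" content="text/html; charset=utf-8">\n'
        b'<p>caf\xc3\xa9</p>')
text, enc = decode_with_meta_fallback(page)
print(enc)  # utf-8: the later (correct) declaration wins
```

With the bogus first declaration skipped, the page decodes cleanly instead of dying on the UTF-16 attempt.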
 

Frank Stutzman

Some kind person replied:
You have the same URL as both your good and bad example.

Oops, dang emacs cut buffer (yeah, that's what did it). A working
example URL would be (again, mind the wrap):

http://www.naco.faa.gov/digital_tpp...t_type=ICAO&ver=0711&bnSubmit=Complete+Search


Marc Christiansen said:
The problem is this line:
<META http-equiv="Content-Type" content="text/html; charset=UTF-16">

Which is wrong. The content is not utf-16 encoded. The line after that
declares the charset as utf-8, which is correct, although ascii would be
ok too.

Ah, er, hmmm. Take a look at the 'good' URL I mentioned above. You will
notice that it has the same utf-16, utf-8 encoding that the 'bad' one
has. And BeautifulSoup works great on it.

I'm still scratchin' ma head...
If I save the search result and remove this line, everything works. So,
you could:
- ignore problematic pages

Not an option for my application.
- save and edit them, then reparse them (not always practical)

That's what I'm doing at the moment during my development. It sure
seems inelegant.
- use the fromEncoding argument:
soup=BeautifulSoup.BeautifulSoup(ifile, fromEncoding="utf-8")
(or 'ascii'). Of course this only works if you guess/predict the
encoding correctly ;) Which can be difficult. Since BeautifulSoup uses
"an encoding discovered in the document itself" (quote from
<http://www.crummy.com/software/BeautifulSoup/documentation.html#Beautiful Soup Gives You Unicode, Dammit>)

I'll try that. For what I'm doing it ought to be safe enough.

Much appreciate all the comments so far.
 

Marc Christiansen

Frank Stutzman said:
Ah, er, hmmm. Take a look at the 'good' URL I mentioned above. You will
notice that it has the same utf-16, utf-8 encoding that the 'bad' one
has. And BeautifulSoup works great on it.

I'm still scratchin' ma head...
Decoding the page content as utf-16 fails:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python2.5/encodings/utf_16.py", line 16, in decode
    return codecs.utf_16_decode(input, errors, True)
UnicodeDecodeError: 'utf16' codec can't decode byte 0x0a in position 41176: truncated data

(Here bad contains the content of the 'bad' URL, good the content of the
'good' URL.) Because of the UnicodeDecodeError, BeautifulSoup tries
either the next encoding or the next step described in the documentation
linked earlier.
Much appreciate all the comments so far.

You're welcome.

Marc
 
