Dragon Lord
I am trying to download a few IEEE pages using urllib2, but for
certain pages I get only the first part of the page. For other pages
from the same server and URL (just a different page ID) I get the right
results. The difference between these pages seems to be the publication
date of the paper the page describes. Pages for papers from before 2000
end just before the date; pages for papers from 2000 and later end at </html>.
Two example URLs:
Does not work: http://ieeexplore.ieee.org/xpl/freeabs_all.jsp?arnumber=517048
Does work: http://ieeexplore.ieee.org/xpl/freeabs_all.jsp?arnumber=854728
I tried both urlopen and urlretrieve and tried both urllib and
urllib2. With urlopen I tried both .read() and .read(10000) to make
sure I got the whole page, but nothing helped.
Sample code:
import urllib2

# Fetch the page that comes back truncated (the URL must be on a single line)
response = urllib2.urlopen("http://ieeexplore.ieee.org/xpl/freeabs_all.jsp?arnumber=517048")
html = response.read()
print html
The cutoff is always at the same location: just after the label
"Meeting Date" and before the date itself. Could it be that something
is being interpreted as an EOF marker or something like that?
Example of the cutoff point on a bad page:
<br/><b>Meeting Date: </b>
Example of the same point on a good page:
<br/><b>Meeting Date: </b>
13 jun 2000
By the way, the bad pages do continue after this point if you use a
web browser, so it does not seem to be a server problem.
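
One way to confirm that the response really is truncated, rather than
cut short by a single read() call, is to read until EOF in a loop and
compare the byte count with the Content-Length header. The following is
only a minimal sketch under assumptions: it uses the Python 3 module
urllib.request (the counterpart of urllib2.urlopen), and the helper
names read_all, looks_complete and fetch_report are hypothetical, not
part of any library.

```python
import io
import urllib.request


def read_all(stream, chunk_size=8192):
    """Read a file-like object in chunks until EOF and return all bytes."""
    chunks = []
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:  # an empty result signals EOF
            break
        chunks.append(chunk)
    return b"".join(chunks)


def looks_complete(body):
    """Return True if the body ends with a closing </html> tag."""
    return body.rstrip().lower().endswith(b"</html>")


def fetch_report(url):
    """Fetch url and return (bytes_received, declared_length, complete).

    declared_length is None when the server sends no Content-Length header.
    """
    response = urllib.request.urlopen(url)
    declared = response.headers.get("Content-Length")
    body = read_all(response)
    declared_length = int(declared) if declared is not None else None
    return len(body), declared_length, looks_complete(body)


# Offline illustration with an in-memory stream standing in for a response:
sample = io.BytesIO(b"<html><br/><b>Meeting Date: </b>")
body = read_all(sample)
truncated = not looks_complete(body)  # the closing tag is missing here
```

Calling fetch_report on the two example URLs would show whether the
bad page is short on the wire (fewer bytes than declared, or no closing
tag) or whether the full body arrives and is lost later, for example
when printing.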