How do I enter/receive webpage information?

Mudcat

Hi,

I'm wondering the best way to do the following.

I would like to use a map webpage (like yahoo maps) to find the
distance between two places that are pulled in from a text file. I want
to accomplish this without displaying the browser.

I am looking at several options right now, including urllib, httplib,
packet trace, etc. But I don't know where to start with it or if there
are existing tools that I could incorporate.

Can someone explain how to do this or point me in the right direction?

Thanks,
Marc
 
Jorgen Grahn

Mudcat said:
> Hi,
>
> I'm wondering the best way to do the following.
>
> I would like to use a map webpage (like yahoo maps) to find the
> distance between two places that are pulled in from a text file. I want
> to accomplish this without displaying the browser.

That's called "web scraping", in case you want to Google for info.

> I am looking at several options right now, including urllib, httplib,
> packet trace, etc. But I don't know where to start with it or if there
> are existing tools that I could incorporate.
>
> Can someone explain how to do this or point me in the right direction?

I did it this way successfully once ... it's probably the wrong approach in
some ways, but It Works For Me.

- used httplib.HTTPConnection for the HTTP parts, building my own requests
with headers and all, calling h.send() and h.getresponse() etc.

- created my own cookie container class (because there was a session
involved, and logging in and such things, and all of it used cookies)

- subclassed sgmllib.SGMLParser once for each kind of page I expected to
receive. This class knew how to pull the information from an HTML document,
provided it looked as I expected it to (a rough sketch follows below). Very
tedious work. It can be easier and safer to just use module re in some cases.

Wrapped in classes this ended up as (fictive):

client = Client('somehost:80')
client.login('me', 'secret')
a, b = theAsAndBs(client, 'tomorrow', 'Wiltshire')
foo = theFoo(client, 'yesterday')
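
For what it's worth, one of those page-specific parsers looked roughly like
the sketch below. The tag name, attribute and example HTML here are invented
for illustration, not taken from any real map page:

import sgmllib

class DistancePageParser(sgmllib.SGMLParser):
    """Pulls one piece of text out of a page, if the page looks as expected."""

    def __init__(self):
        sgmllib.SGMLParser.__init__(self)
        self.in_distance = False
        self.distance = None

    def start_span(self, attrs):
        # called for each <span ...>; attrs is a list of (name, value) pairs
        if ('class', 'distance') in attrs:
            self.in_distance = True

    def end_span(self):
        self.in_distance = False

    def handle_data(self, data):
        if self.in_distance and self.distance is None:
            self.distance = data.strip()

parser = DistancePageParser()
parser.feed('<p>Route: <span class="distance">42 km</span></p>')
parser.close()
print parser.distance        # -> 42 km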

I had to look deeply into the HTTP RFCs to do this, and also snoop the
traffic for a "real" session to see what went on between server and client.

/Jorgen
 
John J. Lee

Jorgen Grahn said:
> I did it this way successfully once ... it's probably the wrong approach in
> some ways, but It Works For Me.
>
> - used httplib.HTTPConnection for the HTTP parts, building my own requests
> with headers and all, calling h.send() and h.getresponse() etc.
>
> - created my own cookie container class (because there was a session
> involved, and logging in and such things, and all of it used cookies)
>
> - subclassed sgmllib.SGMLParser once for each kind of page I expected to
> receive. This class knew how to pull the information from an HTML document,
> provided it looked as I expected it to. Very tedious work. It can be easier
> and safer to just use module re in some cases.
>
> Wrapped in classes this ended up as (fictive):
>
> client = Client('somehost:80')
> client.login('me', 'secret')
> a, b = theAsAndBs(client, 'tomorrow', 'Wiltshire')
> foo = theFoo(client, 'yesterday')
>
> I had to look deeply into the HTTP RFCs to do this, and also snoop the
> traffic for a "real" session to see what went on between server and client.

I see little benefit and significant loss in using httplib instead of
urllib2, unless and until you get a particularly stubborn problem and
want to drop down a level to debug. It's easy to see and modify
urllib2's headers if you need to get low level.
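
For example, a plain urllib2 fetch with a custom header is only a few lines
(the URL and header value here are made up):

import urllib2

req = urllib2.Request("http://example.com/route?from=here&to=there")
req.add_header("User-Agent", "my-scraper/0.1")   # whatever header you need
response = urllib2.urlopen(req)
print response.read()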

One starting point for web scraping with Python:

http://wwwsearch.sourceforge.net/bits/GeneralFAQ.html

There are some modules you may find useful there, too.

Google Groups for urlencode. Or use my module ClientForm, if you
prefer. Experiment a little with an HTML form in a local file and
(eg.) the 'ethereal' sniffer to see what happens when you click
submit.
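
A bare-bones form submission with urlencode might look something like this;
the URL and field names are placeholders you'd replace with whatever the
sniffer shows the real form sending:

import urllib, urllib2

data = urllib.urlencode({"from": "London", "to": "Bristol"})
# passing a data argument makes urlopen do a POST instead of a GET
response = urllib2.urlopen("http://example.com/distance", data)
print response.read()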

The stdlib now has cookie support (in Python 2.4):

import cookielib, urllib2

# the CookieJar remembers cookies the server sets; the opener sends
# them back automatically on later requests
cj = cookielib.CookieJar()
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))

r = opener.open("http://example.com/")
print r.read()

Unfortunately, it's true that network sniffing and a reasonable
smattering of knowledge about HTTP &c. do often turn out to be
necessary to scrape stuff. A few useful tips:

http://wwwsearch.sourceforge.net/ClientCookie/doc.html#debugging


John
 
John J. Lee

Jorgen Grahn said:
> - subclassed sgmllib.SGMLParser once for each kind of page I expected to
> receive. This class knew how to pull the information from an HTML document,
> provided it looked as I expected it to. Very tedious work. It can be easier
> and safer to just use module re in some cases.
[...]

BeautifulSoup is often recommended (never tried it myself).
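
As I said, I haven't used it, but judging from its docs a minimal scrape
looks something like this (the HTML and the class attribute are invented):

from BeautifulSoup import BeautifulSoup

html = '<html><body>Distance: <span class="distance">42 miles</span></body></html>'
soup = BeautifulSoup(html)
span = soup.find('span', {'class': 'distance'})
if span is not None:
    print span.string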

Remember HTMLtidy and its offshoots (eg. tidylib, mxTidy) are
available for cleaning horrid HTML while-u-scrape, too.

Alternatively, some people swear by automating Internet Explorer;
other people would rather be hit on the head with a ball-peen hammer
(not only the MS-haters)...


John
 
Jorgen Grahn

John J. Lee said:
[...]
> I see little benefit and significant loss in using httplib instead of
> urllib2, unless and until you get a particularly stubborn problem and
> want to drop down a level to debug. It's easy to see and modify
> urllib2's headers if you need to get low level.

That's quite possibly true. I remember looking at and rejecting
urllib/urllib2, but I cannot remember my reasons. Maybe I didn't feel they
were documented well enough (in Python 2.1, which is where I live).

[more useful info snipped]

/Jorgen
 
