How do I enter/receive webpage information?

Mudcat

Hi,

I'm wondering the best way to do the following.

I would like to use a map webpage (like yahoo maps) to find the
distance between two places that are pulled in from a text file. I want
to accomplish this without displaying the browser.

I am looking at several options right now, including urllib, httplib,
packet trace, etc. But I don't know where to start with it or if there
are existing tools that I could incorporate.

Can someone explain how to do this or point me in the right direction?

Thanks,
Marc
 
Jorgen Grahn

Mudcat said:
> Hi,
>
> I'm wondering the best way to do the following.
>
> I would like to use a map webpage (like yahoo maps) to find the
> distance between two places that are pulled in from a text file. I want
> to accomplish this without displaying the browser.

That's called "web scraping", in case you want to Google for info.

> I am looking at several options right now, including urllib, httplib,
> packet trace, etc. But I don't know where to start with it or if there
> are existing tools that I could incorporate.
>
> Can someone explain how to do this or point me in the right direction?

I did it this way successfully once ... it's probably the wrong approach in
some ways, but It Works For Me.

- used httplib.HTTPConnection for the HTTP parts, building my own requests
with headers and all, calling h.send() and h.getresponse() etc.

- created my own cookie container class (because there was a session
involved, and logging in and such things, and all of it used cookies)

- subclassed sgmllib.SGMLParser once for each kind of page I expected to
receive. This class knew how to pull the information from an HTML document,
provided it looked as I expected it to (a rough sketch follows below). Very
tedious work. It can be easier and safer to just use module re in some cases.

Wrapped in classes this ended up as (fictive):

client = Client('somehost:80')
client.login('me', 'secret')
a, b = theAsAndBs(client, 'tomorrow', 'Wiltshire')
foo = theFoo(client, 'yesterday')
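
For what it's worth, one of those page-specific parsers looked roughly like
the sketch below. The tag name, attribute and example HTML here are invented
for illustration, not taken from any real map page:

import sgmllib

class DistancePageParser(sgmllib.SGMLParser):
    """Pulls one piece of text out of a page, if the page looks as expected."""

    def __init__(self):
        sgmllib.SGMLParser.__init__(self)
        self.in_distance = False
        self.distance = None

    def start_span(self, attrs):
        # called for each <span ...>; attrs is a list of (name, value) pairs
        if ('class', 'distance') in attrs:
            self.in_distance = True

    def end_span(self):
        self.in_distance = False

    def handle_data(self, data):
        if self.in_distance and self.distance is None:
            self.distance = data.strip()

parser = DistancePageParser()
parser.feed('<p>Route: <span class="distance">42 km</span></p>')
parser.close()
print parser.distance        # -> 42 km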

I had to look deeply into the HTTP RFCs to do this, and also snoop the
traffic for a "real" session to see what went on between server and client.

/Jorgen
 
John J. Lee

Jorgen Grahn said:
> I did it this way successfully once ... it's probably the wrong approach in
> some ways, but It Works For Me.
>
> - used httplib.HTTPConnection for the HTTP parts, building my own requests
> with headers and all, calling h.send() and h.getresponse() etc.
>
> - created my own cookie container class (because there was a session
> involved, and logging in and such things, and all of it used cookies)
>
> - subclassed sgmllib.SGMLParser once for each kind of page I expected to
> receive. This class knew how to pull the information from an HTML document,
> provided it looked as I expected it to. Very tedious work. It can be easier
> and safer to just use module re in some cases.
>
> Wrapped in classes this ended up as (fictive):
>
> client = Client('somehost:80')
> client.login('me', 'secret')
> a, b = theAsAndBs(client, 'tomorrow', 'Wiltshire')
> foo = theFoo(client, 'yesterday')
>
> I had to look deeply into the HTTP RFCs to do this, and also snoop the
> traffic for a "real" session to see what went on between server and client.

I see little benefit and significant loss in using httplib instead of
urllib2, unless and until you get a particularly stubborn problem and
want to drop down a level to debug. It's easy to see and modify
urllib2's headers if you need to get low level.
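
For example, a plain urllib2 fetch with a custom header is only a few lines
(the URL and header value here are made up):

import urllib2

req = urllib2.Request("http://example.com/route?from=here&to=there")
req.add_header("User-Agent", "my-scraper/0.1")   # whatever header you need
response = urllib2.urlopen(req)
print response.read()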

One starting point for web scraping with Python:

http://wwwsearch.sourceforge.net/bits/GeneralFAQ.html

There are some modules you may find useful there, too.

Google Groups for urlencode. Or use my module ClientForm, if you
prefer. Experiment a little with an HTML form in a local file and
(eg.) the 'ethereal' sniffer to see what happens when you click
submit.
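
A bare-bones form submission with urlencode might look something like this;
the URL and field names are placeholders you'd replace with whatever the
sniffer shows the real form sending:

import urllib, urllib2

data = urllib.urlencode({"from": "London", "to": "Bristol"})
# passing a data argument makes urlopen do a POST instead of a GET
response = urllib2.urlopen("http://example.com/distance", data)
print response.read()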

The stdlib now has cookie support (in Python 2.4):

import cookielib, urllib2

# the CookieJar remembers cookies the server sets; the opener sends
# them back automatically on later requests
cj = cookielib.CookieJar()
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))

r = opener.open("http://example.com/")
print r.read()

Unfortunately, it's true that network sniffing and a reasonable
smattering of knowledge about HTTP &c. do often turn out to be
necessary to scrape stuff. A few useful tips:

http://wwwsearch.sourceforge.net/ClientCookie/doc.html#debugging


John
 
John J. Lee

Jorgen Grahn said:
> - subclassed sgmllib.SGMLParser once for each kind of page I expected to
> receive. This class knew how to pull the information from an HTML document,
> provided it looked as I expected it to. Very tedious work. It can be easier
> and safer to just use module re in some cases.
[...]

BeautifulSoup is often recommended (never tried it myself).
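
As I said, I haven't used it, but judging from its docs a minimal scrape
looks something like this (the HTML and the class attribute are invented):

from BeautifulSoup import BeautifulSoup

html = '<html><body>Distance: <span class="distance">42 miles</span></body></html>'
soup = BeautifulSoup(html)
span = soup.find('span', {'class': 'distance'})
if span is not None:
    print span.string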

Remember HTMLtidy and its offshoots (eg. tidylib, mxTidy) are
available for cleaning horrid HTML while-u-scrape, too.

Alternatively, some people swear by automating Internet Explorer;
other people would rather be hit on the head with a ball-peen hammer
(not only the MS-haters)...


John
 
Jorgen Grahn

John J. Lee said:
[...]
> I see little benefit and significant loss in using httplib instead of
> urllib2, unless and until you get a particularly stubborn problem and
> want to drop down a level to debug. It's easy to see and modify
> urllib2's headers if you need to get low level.

That's quite possibly true. I remember looking at and rejecting
urllib/urllib2, but I cannot remember my reasons. Maybe I didn't feel they
were documented well enough (in Python 2.1, which is where I live).

[more useful info snipped]

/Jorgen
 
