Stealth screen scraping with Python?

different.engine

Folks:

I am screen scraping a large volume of data from Yahoo Finance each
evening, and parsing with Beautiful Soup.

I was wondering if anyone could give me some pointers on how to make
it less obvious to Yahoo that this is what I am doing, as I fear that
they probably monitor for this type of activity, and will soon ban my
IP.

-DE
 
Dotan Cohen

So long as you are sending a regular HTTP request, as a browser would,
they will have little way of telling the difference. Just keep your
requests down to no more than one every 3-5 seconds and you should be
fine. Rotate your IP, too, if you can.
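
For what it's worth, the pacing advice above could be sketched roughly like
this. It is only a minimal sketch: the names fetch and fetch_all are made up
for illustration, the User-Agent value is a placeholder rather than anything
Yahoo requires, and the 3-5 second range is simply the figure suggested above.

```python
import random
import time
from urllib.request import Request, urlopen

# Illustrative header value only; substitute whatever your own browser sends.
HEADERS = {"User-Agent": "Mozilla/5.0"}


def fetch(url):
    """Fetch one page with browser-like headers and return the raw bytes."""
    with urlopen(Request(url, headers=HEADERS), timeout=30) as resp:
        return resp.read()


def fetch_all(urls, min_delay=3.0, max_delay=5.0, fetch=fetch, sleep=time.sleep):
    """Fetch each URL in turn, pausing a random 3-5 seconds between requests."""
    pages = []
    for i, url in enumerate(urls):
        if i:  # no need to wait before the first request
            sleep(random.uniform(min_delay, max_delay))
        pages.append(fetch(url))
    return pages
```

The fetch and sleep arguments are there only so the loop can be tried out
without touching the network; in normal use you would call fetch_all(urls)
and leave them alone.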

Dotan Cohen

 
kyosohma

Depends on what you're doing exactly. I've done something like this
and it only hits the page once:

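from urllib.request import urlopen  # Python 3 location of urlopen

URL = 'http://quote.yahoo.com/d/quotes.csv?s=%s&f=sl1c1p2'
TICKS = ('AMZN', 'AMD', 'EBAY', 'GOOG', 'MSFT', 'YHOO')
u = urlopen(URL % ','.join(TICKS))  # one request covers every ticker
for data in u:
    # the response arrives as bytes, one comma-separated line per ticker
    tick, price, chg, per = data.decode().strip().split(',')
    # do something with data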

If you're grabbing all the data in one fell swoop (which is what you
should aim for), then it's harder for Yahoo! to know what you're doing
exactly. And I can't see why they'd care, as that is all a browser does
anyway. It's hitting the site a bunch of times in a short period that
sets off the alarms.
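
If the ticker list is long, the same "one fell swoop" idea might look
something like the sketch below. The helper names batch_urls and parse_quotes
and the batch size of 50 are assumptions for illustration, not a documented
Yahoo limit, and the csv module is used on the guess that some fields come
back quoted.

```python
import csv

URL = 'http://quote.yahoo.com/d/quotes.csv?s=%s&f=sl1c1p2'


def batch_urls(tickers, batch_size=50):
    """Yield one URL per batch of tickers instead of one URL per ticker."""
    tickers = list(tickers)
    for start in range(0, len(tickers), batch_size):
        yield URL % ','.join(tickers[start:start + batch_size])


def parse_quotes(text):
    """Turn a CSV response body into (ticker, price, change, percent) tuples."""
    rows = csv.reader(text.splitlines())
    # skip blank or malformed lines rather than failing on them
    return [tuple(row) for row in rows if len(row) == 4]
```

With a few hundred tickers this comes to a handful of requests rather than
one per symbol, which is the point being made above.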

Mike
 
Steven D'Aprano

Write a virus to hijack tens of thousands of Windows PCs around the world,
and use your army of zombie-PCs to do the screen scraping for you. Each
one only needs to scrape a small amount of data, and Yahoo will have no
way of telling that it is you.

*wink*
 
