How do I correctly download Wikipedia pages?


Steven D'Aprano

I'm trying to scrape a Wikipedia page from Python. Following instructions
here:

http://en.wikipedia.org/wiki/Wikipedia:Database_download
http://en.wikipedia.org/wiki/Special:Export

I use the URL "http://en.wikipedia.org/wiki/Special:Export/Train" instead
of just "http://en.wikipedia.org/wiki/Train". But instead of getting the
page I expect, and can see in my browser, I get an error page:

....
Our servers are currently experiencing a technical problem. This is
probably temporary and should be fixed soon
....


(Output is obviously truncated for your sanity and mine.)


Is there a trick to downloading from Wikipedia with urllib?
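(A common cause, and the trick that usually works: Wikipedia's servers turn away requests that arrive with urllib's default "Python-urllib/x.y" User-Agent string, and what comes back is that generic "technical problem" page. Sending a descriptive User-Agent header usually gets Special:Export through. A minimal Python 3 sketch, where the agent string and contact address are placeholders rather than anything Wikipedia prescribes:

    import urllib.request

    # Special:Export wraps the article's wikitext in an XML envelope.
    url = "http://en.wikipedia.org/wiki/Special:Export/Train"

    # Identify the script explicitly; the default urllib agent is refused.
    req = urllib.request.Request(
        url,
        headers={"User-Agent": "MyScraper/0.1 (contact: me@example.com)"},
    )

    with urllib.request.urlopen(req) as resp:
        xml_text = resp.read().decode("utf-8")

    print(xml_text[:500])  # peek at the start of the export XML

The same header trick applies to plain article URLs, not just Special:Export.)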
 

Cousin Stanley

> I'm trying to scrape a Wikipedia page from Python.
> ....

On occasion I use a program under Debian Linux
called wikipedia2text that is very handy
for downloading Wikipedia pages as plain text files ....

Description: displays Wikipedia articles on the command line

This script fetches Wikipedia articles (currently supports
around 30 Wikipedia languages) and displays them as plain text
in a pager or just sends the text to standard out. Alternatively
it opens the Wikipedia article in a (possibly GUI) web browser
or just shows the URL of the appropriate Wikipedia article.

Example directed through the lynx browser ....

wp2t -b lynx gorilla > gorilla.txt
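
The same plain-text output can also be captured from inside Python by shelling out to the script; a hedged sketch with subprocess, assuming the Debian package puts a wikipedia2text executable on PATH:

    import subprocess

    # Run wikipedia2text and capture its plain-text rendering of the article.
    result = subprocess.run(
        ["wikipedia2text", "gorilla"],
        capture_output=True,
        text=True,   # decode stdout/stderr as text
        check=True,  # raise CalledProcessError on a non-zero exit
    )

    print(result.stdout[:500])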
 
