Magnus.Moraberg
Hi,
I wish to extract all the words on a set of webpages and store them in
a large dictionary. I then wish to produce a list of the most common
words for the language under consideration. My code below reads
the page -
http://news.bbc.co.uk/welsh/hi/newsid_7420000/newsid_7420900/7420967.stm
a Welsh-language page. I hope to then establish the 1000 most commonly
used words in Welsh. The problem I'm having is that
soup.findAll(text=True) is returning the likes of -
u'doctype html public "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd"'
and -
<a href=" \'+url+\'?rss=\'+rssURI+\'" class="sel"
Any suggestions how I might overcome this problem?
Thanks,
Barry.
Here's my code -
import urllib
import urllib2
from BeautifulSoup import BeautifulSoup
# proxy_support = urllib2.ProxyHandler({"http": "http://999.999.999.999:8080"})
# opener = urllib2.build_opener(proxy_support)
# urllib2.install_opener(opener)
page = urllib2.urlopen('http://news.bbc.co.uk/welsh/hi/newsid_7420000/newsid_7420900/7420967.stm')
soup = BeautifulSoup(page)
pageText = soup.findAll(text=True)
print pageText
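
For illustration, here is a rough sketch of the kind of result I'm after: text with the doctype, comments and script/style contents left out, then counted per word. It is only a sketch under assumptions. It is written for Python 3 and uses the standard library's html.parser in place of BeautifulSoup, and the names VisibleTextParser, word_counts and WORD_RE are made up for this example. The word pattern (runs of letters, optionally joined by apostrophes) is a guess at what should count as a Welsh word.

```python
from collections import Counter
from html.parser import HTMLParser
import re

# Assumption: a "word" is a run of Unicode letters, optionally joined by
# straight apostrophes (so "mae'r" stays one word).
WORD_RE = re.compile(r"[^\W\d_]+(?:'[^\W\d_]+)*")


class VisibleTextParser(HTMLParser):
    """Collect text nodes, skipping anything inside <script> or <style>.

    Doctype declarations and comments never reach handle_data, so they
    are left out without any extra work.
    """

    SKIPPED_TAGS = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self.chunks.append(data)


def word_counts(html):
    """Return a Counter mapping each lower-cased word to its frequency."""
    parser = VisibleTextParser()
    parser.feed(html)
    parser.close()
    # Join with spaces so words from adjacent tags do not run together.
    text = " ".join(parser.chunks).lower()
    return Counter(WORD_RE.findall(text))
```

Given the decoded HTML of each page as a string, the per-page Counters could be added together, and combined.most_common(1000) would then give the 1000 most frequent words.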