Getting URLs

softwindow

It is difficult to get all the URLs in a page.
You can use the sgmllib module to parse HTML files,
which can get the standard href attributes.
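
For readers on Python 3, where sgmllib is no longer available, here is a minimal sketch of the same idea using the standard library's html.parser module instead. The class name LinkCollector and the sample_html string are made up for illustration; this is one possible approach under those assumptions, not the poster's code.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect the href value of every <a> start tag (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names before calling this
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value is not None:
                    self.links.append(value)


# made-up sample input; a real page would be read from a file or URL
sample_html = (
    '<p><a href="/intl/en/about.html">About Google</a> '
    '<a name="top">no link here</a> '
    '<A HREF="http://news.google.com/">News</A></p>'
)

collector = LinkCollector()
collector.feed(sample_html)
collector.close()
print(collector.links)
```

Like the sgmllib approach, this only sees href attributes on anchor tags; URLs that appear in scripts or plain text would need separate handling.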
 
Paul McGuire

> it is difficult to get all URLs in a page
<snip>

Is this really so hard?

-----------------
from pyparsing import Literal,Suppress,CharsNotIn,CaselessLiteral,\
    Word,dblQuotedString,alphanums,SkipTo,makeHTMLTags
import urllib

# extract all <a> anchor tags - makeHTMLTags defines a
# fairly robust pair of match patterns, not just "<tag>","</tag>"
linkOpenTag,linkCloseTag = makeHTMLTags("a")
link = linkOpenTag + \
    SkipTo(linkCloseTag).setResultsName("body") + \
    linkCloseTag.suppress()

# read the HTML source from some random URL
serverListPage = urllib.urlopen( "http://www.google.com" )
htmlText = serverListPage.read()
serverListPage.close()

# use the link grammar to scan the HTML source
for toks,strt,end in link.scanString(htmlText):
    print toks.startA.href,"->",toks.body

-----------------
Prints:
/url?sa=p&pref=ig&pval=2&q=http://www.google.com/ig?hl=en -> Personalized Home
https://www.google.com/accounts/Login?continue=http://www.google.com/&hl=en -> Sign in
/imghp?hl=en&tab=wi&ie=UTF-8 -> Images
http://groups.google.com/grphp?hl=en&tab=wg&ie=UTF-8 -> Groups
http://news.google.com/nwshp?hl=en&tab=wn&ie=UTF-8 -> News
http://froogle.google.com/frghp?hl=en&tab=wf&ie=UTF-8 -> Froogle
/maphp?hl=en&tab=wl&ie=UTF-8 -> Maps
/intl/en/options/ -> more&nbsp;&raquo;
/advanced_search?hl=en -> Advanced Search
/preferences?hl=en -> Preferences
/language_tools?hl=en -> Language Tools
/intl/en/ads/ -> Advertising&nbsp;Programs
/services/ -> Business Solutions
/intl/en/about.html -> About Google


-- Paul
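
Several of the hrefs printed above are relative (for example /imghp?hl=en&tab=wi&ie=UTF-8), and some link bodies still contain HTML entities. As a possible follow-up step, the sketch below resolves hrefs against the page URL and decodes the entities. It assumes Python 3 and uses only the standard library; the names base_url, scanned, and resolved are illustrative, and the sample pairs are copied from the output above.

```python
from html import unescape
from urllib.parse import urljoin

# assumed base: the URL the page was fetched from
base_url = "http://www.google.com/"

# (href, body) pairs copied from the printed output above
scanned = [
    ("/imghp?hl=en&tab=wi&ie=UTF-8", "Images"),
    ("http://news.google.com/nwshp?hl=en&tab=wn&ie=UTF-8", "News"),
    ("/intl/en/options/", "more&nbsp;&raquo;"),
]

# urljoin leaves absolute URLs unchanged and resolves relative ones;
# unescape turns entities such as &nbsp; and &raquo; into characters
resolved = [(urljoin(base_url, href), unescape(body)) for href, body in scanned]

for url, text in resolved:
    print(url, "->", text)
```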
 
