How to scrape a URL out of an href

homepricemaps

I need to scrape a URL out of an href. People seem to recommend
Beautiful Soup, but I had some problems with it.

Does anyone have sample code for scraping the actual URL out of an href
like this one?

<a href="http://www.cnn.com" target="_blank">
 
homepricemaps

Sorry Paul, I'm an extreme beginner as a programmer, if that! ;-) Can you
give me an example?

Thanks in advance.
 
Mike Meyer

> I need to scrape a URL out of an href. People seem to recommend
> Beautiful Soup, but I had some problems with it.

What problem are you having with BeautifulSoup? It's working fine for
me here.

> Does anyone have sample code for scraping the actual URL out of an href
> like this one?
>
> <a href="http://www.cnn.com" target="_blank">

The following fragment works fine for me:

linktext = soup.fetchText('Next')
if not linktext:
    return pages
else:
    url = linktext[0].findParent('a')['href']


So you probably want something like:

for anchor in soup.fetch('a', {'target': '_blank'}):
    print anchor['href']
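
For readers without BeautifulSoup installed, the same idea can be sketched with Python 3's standard-library html.parser instead. This is a swapped-in technique, not the BeautifulSoup API used above; the class name BlankTargetLinks and the helper blank_target_hrefs are hypothetical names chosen for illustration.

```python
from html.parser import HTMLParser


class BlankTargetLinks(HTMLParser):
    """Collect href values from <a> tags whose target is "_blank"."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        # attrs arrives as a list of (name, value) pairs
        attr_map = dict(attrs)
        href = attr_map.get("href")
        if tag == "a" and attr_map.get("target") == "_blank" and href:
            self.hrefs.append(href)


def blank_target_hrefs(html):
    """Return the hrefs of all <a target="_blank"> tags found in html."""
    parser = BlankTargetLinks()
    parser.feed(html)
    parser.close()
    return parser.hrefs


sample = '<a href="http://www.cnn.com" target="_blank">CNN</a> <a href="/local">local</a>'
print(blank_target_hrefs(sample))  # ['http://www.cnn.com']
```

The second anchor in the sample is skipped because it has no target="_blank" attribute, which mirrors the attribute filter passed to soup.fetch.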


<mike
 
homepricemaps

Mike's code worked like a charm. I have one more question. I have an
href which looks like this:

<td class="all">
<a class="btn" name="D1" href="http://www.cnn.com">
</a>

I thought I would use this code to get the href out, but it fails and
gives me a KeyError:

for incident in row('td', {'class':'all'}):
    n = incident.findNextSibling('a', {'class': 'btn'})
    link = incident.findNextSibling['href'] + "','"


Any idea what I'm doing wrong here with the syntax? Thanks in advance.
 
Kent Johnson

> Mike's code worked like a charm. I have one more question. I have an
> href which looks like this:
>
> <td class="all">
> <a class="btn" name="D1" href="http://www.cnn.com">
> </a>
>
> I thought I would use this code to get the href out, but it fails and
> gives me a KeyError:
>
> for incident in row('td', {'class':'all'}):
>     n = incident.findNextSibling('a', {'class': 'btn'})
>     link = incident.findNextSibling['href'] + "','"
>
> Any idea what I'm doing wrong here with the syntax? Thanks in advance.

ISTM that <a class="btn"> is a child of <td>, not a sibling, and
findNextSibling is a method, not an indexable element. Calling a tag
searches its children and returns a list of matches, so take the first
one. Try
n = incident('a', {'class': 'btn'})[0]
link = n['href'] + "','"
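
The child-versus-sibling point can be illustrated without BeautifulSoup. Below is a minimal sketch using Python 3's standard-library html.parser (a substitute for the BeautifulSoup calls above, not the same API). The names CellLinks and cell_link_hrefs are hypothetical, the second URL in the sample is made up, and the sketch assumes cells are not nested inside one another.

```python
from html.parser import HTMLParser


class CellLinks(HTMLParser):
    """Collect hrefs of <a class="btn"> tags nested inside <td class="all">."""

    def __init__(self):
        super().__init__()
        self.in_cell = False  # True while inside a <td class="all">
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        attr_map = dict(attrs)
        if tag == "td":
            # Every new cell resets the flag (assumes no nested cells)
            self.in_cell = attr_map.get("class") == "all"
        elif tag == "a" and self.in_cell and attr_map.get("class") == "btn":
            href = attr_map.get("href")
            if href:
                self.hrefs.append(href)

    def handle_endtag(self, tag):
        if tag == "td":
            self.in_cell = False


def cell_link_hrefs(html):
    """Return hrefs of btn links that are children of td.all cells."""
    parser = CellLinks()
    parser.feed(html)
    parser.close()
    return parser.hrefs


sample = (
    '<table><tr>'
    '<td class="all"><a class="btn" name="D1" href="http://www.cnn.com"></a></td>'
    '<td class="other"><a class="btn" href="http://example.com/skip"></a></td>'
    '</tr></table>'
)
print(cell_link_hrefs(sample))  # ['http://www.cnn.com']
```

The link is only picked up while the parser is inside the matching cell, which is the stdlib analogue of searching the <td>'s children rather than its siblings.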

Kent
 
homepricemaps

Actually, the full error is this:


  File "/home/felafela/BeautifulSoup.py", line 301, in __getitem__
    return self._getAttrMap()[key]
KeyError: 'href'
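
The traceback shows a dictionary-style lookup of 'href' failing, which suggests the tag being indexed has no href attribute at all (a named anchor such as <a name="D1">, for example). A defensive lookup avoids the exception. The sketch below uses Python 3's standard-library html.parser rather than BeautifulSoup; AnchorAudit and audit_anchors are hypothetical names used only for illustration.

```python
from html.parser import HTMLParser


class AnchorAudit(HTMLParser):
    """Separate <a> tags that carry an href from those that do not."""

    def __init__(self):
        super().__init__()
        self.with_href = []
        self.without_href = 0

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        # dict.get returns None for a missing key instead of raising KeyError
        href = dict(attrs).get("href")
        if href is None:
            self.without_href += 1
        else:
            self.with_href.append(href)


def audit_anchors(html):
    """Return (list of hrefs, count of anchors lacking an href)."""
    parser = AnchorAudit()
    parser.feed(html)
    parser.close()
    return parser.with_href, parser.without_href


sample = '<a name="D1"></a><a class="btn" href="http://www.cnn.com"></a>'
print(audit_anchors(sample))  # (['http://www.cnn.com'], 1)
```

Counting the anchors that lack an href makes it easy to see whether the loop is reaching the tag that was intended.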
 
