how to scrape url out of href

homepricemaps · Jan 2, 2006

i need to scrape a url out of an href. it seems that people recommend
that i use beautiful soup, but i had some problems with it.

does anyone have sample code for scraping the actual url out of an href
like this one

<a href="http://www.cnn.com" target="_blank">

Paul Rubin · Jan 2, 2006

> does anyone have sample code for scraping the actual url out of an href
> like this one
>
> <a href="http://www.cnn.com" target="_blank">

If you've got the tag by itself like that, just use a regexp to get
the href out.
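
A minimal sketch of the regexp approach, assuming the tag is already isolated in a string as in the example above. The pattern and the helper name extract_href are illustrative and not from the thread, and the code is written for Python 3. A regexp like this is reasonable for a single known tag but tends to be brittle on whole pages, where a parser is usually safer.

```python
import re

# Matches href="..." or href='...' inside a single tag string.
# The backreference \1 requires the closing quote to match the opening one.
HREF_RE = re.compile(r"""href\s*=\s*(["'])(.*?)\1""", re.IGNORECASE)


def extract_href(tag):
    """Return the href value from one anchor tag string, or None if absent."""
    match = HREF_RE.search(tag)
    return match.group(2) if match else None


print(extract_href('<a href="http://www.cnn.com" target="_blank">'))
# prints http://www.cnn.com
```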

homepricemaps · Jan 2, 2006

sorry paul - i'm an extreme beginner as a programmer, if that! ;-) can you
give me an example?

thanks in advance

Mike Meyer · Jan 2, 2006

> i need to scrape a url out of an href. it seems that people recommend
> that i use beautiful soup, but i had some problems with it.

What problem are you having with BeautifulSoup? It's working fine for
me here.

> does anyone have sample code for scraping the actual url out of an href
> like this one
>
> <a href="http://www.cnn.com" target="_blank">

The following fragment works fine for me:

linktext = soup.fetchText('Next')
if not linktext:
    return pages
else:
    url = linktext[0].findParent('a')['href']

So you probably want something like:

for anchor in soup.fetch('a', {'target': '_blank'}):
    print anchor['href']
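
For readers without BeautifulSoup installed, the same idea can be sketched with the standard library's html.parser module in Python 3. This swaps in a different technique from the soup.fetch call above; the class name BlankTargetLinks and the helper blank_target_hrefs are hypothetical names chosen for illustration.

```python
from html.parser import HTMLParser


class BlankTargetLinks(HTMLParser):
    """Collect href values from <a> tags that have target="_blank"."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag != 'a':
            return
        attr_map = dict(attrs)  # attrs arrives as a list of (name, value) pairs
        if attr_map.get('target') == '_blank' and 'href' in attr_map:
            self.hrefs.append(attr_map['href'])


def blank_target_hrefs(html):
    """Return the hrefs of all target="_blank" anchors in an HTML string."""
    parser = BlankTargetLinks()
    parser.feed(html)
    parser.close()
    return parser.hrefs


for href in blank_target_hrefs('<a href="http://www.cnn.com" target="_blank">'):
    print(href)
```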

<mike

homepricemaps · Jan 2, 2006

mike's code worked like a charm. i have one more question. i have an
href which looks like this:

<td class="all">
<a class="btn" name="D1" href="http://www.cnn.com">
</a>

i thought i would use this code to get the href out, but it fails with
a KeyError:

for incident in row('td', {'class':'all'}):
    n = incident.findNextSibling('a', {'class': 'btn'})
    link = incident.findNextSibling['href'] + "','"

any idea what i'm doing wrong here with the syntax? thanks in advance

Kent Johnson · Jan 2, 2006

> mike's code worked like a charm. i have one more question. i have an
> href which looks like this:
>
> <td class="all">
> <a class="btn" name="D1" href="http://www.cnn.com">
> </a>
>
> i thought i would use this code to get the href out, but it fails with
> a KeyError:
>
> for incident in row('td', {'class':'all'}):
>     n = incident.findNextSibling('a', {'class': 'btn'})
>     link = incident.findNextSibling['href'] + "','"
>
> any idea what i'm doing wrong here with the syntax? thanks in advance

ISTM that <a class="btn"> is a child of <td>, not a sibling, and
findNextSibling is a method, not an indexable element. Try
n = incident('a', {'class': 'btn'})
link = n['href'] + "','"

Kent

homepricemaps · Jan 2, 2006

hey kent, thanks for writing. when i try that i get:

KeyError: 'href'

homepricemaps · Jan 2, 2006

actually, the full error is this:

  File "/home/felafela/BeautifulSoup.py", line 301, in __getitem__
    return self._getAttrMap()[key]
KeyError: 'href'
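
The traceback shows the lookup failing inside the tag's attribute map, which suggests that the tag being indexed has no href attribute, so it may not be the <a> tag at all. One way to sidestep the exception is to look up attributes with .get() and skip tags that lack them. The sketch below shows that pattern with the standard library's html.parser in Python 3 rather than BeautifulSoup; the names CellLinks and cell_links are hypothetical, and nested tables are not handled.

```python
from html.parser import HTMLParser


class CellLinks(HTMLParser):
    """Collect hrefs of <a class="btn"> tags nested in <td class="all">."""

    def __init__(self):
        super().__init__()
        self.in_cell = False
        self.links = []

    def handle_starttag(self, tag, attrs):
        attr_map = dict(attrs)
        if tag == 'td':
            # Track whether we are inside a matching cell (child, not sibling).
            self.in_cell = attr_map.get('class') == 'all'
        elif tag == 'a' and self.in_cell and attr_map.get('class') == 'btn':
            # .get() returns None instead of raising KeyError if href is missing.
            href = attr_map.get('href')
            if href is not None:
                self.links.append(href)

    def handle_endtag(self, tag):
        if tag == 'td':
            self.in_cell = False


def cell_links(html):
    """Return hrefs of btn anchors found inside td cells with class "all"."""
    parser = CellLinks()
    parser.feed(html)
    parser.close()
    return parser.links
```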
