Strip away HTML tags from extracted links

Max Cuban · Nov 29, 2013

I have the following code to extract certain links from a webpage:

from bs4 import BeautifulSoup
import urllib2, sys
import re

def tonaton():
    site = "http://tonaton.com/en/job-vacancies-in-ghana"
    hdr = {'User-Agent' : 'Mozilla/5.0'}
    req = urllib2.Request(site, headers=hdr)
    jobpass = urllib2.urlopen(req)
    invalid_tag = ('h2')
    soup = BeautifulSoup(jobpass)
    print soup.find_all('h2')

The links are contained in the 'h2' tags so I get the links as follows:

<h2><a href="/en/cashiers-accra">cashiers </a></h2>
<h2><a href="/en/cake-baker-accra">Cake baker</a></h2>
<h2><a href="/en/automobile-technician-accra">Automobile Technician</a></h2>
<h2><a href="/en/marketing-officer-accra-4">Marketing Officer</a></h2>

But I'm interested in getting rid of all the 'h2' tags so that I have links only in this manner:

<a href="/en/cashiers-accra">cashiers </a>
<a href="/en/cake-baker-accra">Cake baker</a>
<a href="/en/automobile-technician-accra">Automobile Technician</a>
<a href="/en/marketing-officer-accra-4">Marketing Officer</a>
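One way to get output in that shape is to pick the a element out of each h2 rather than editing the h2 itself. What follows is only a minimal sketch, assuming bs4 is installed and that each h2 wraps a single link. The inline SAMPLE string and the helper name extract_links are made up for illustration, and it is written for Python 3 (print as a function) rather than the Python 2 used in the post.

```python
from bs4 import BeautifulSoup

# Two lines of the listing above, standing in for the fetched page.
SAMPLE = """
<h2><a href="/en/cashiers-accra">cashiers </a></h2>
<h2><a href="/en/cake-baker-accra">Cake baker</a></h2>
"""

def extract_links(html):
    """Return the first <a> tag found inside each <h2> (hypothetical helper)."""
    soup = BeautifulSoup(html, "html.parser")
    # h2.a is the first <a> descendant of the heading, or None if it has none.
    return [h2.a for h2 in soup.find_all("h2") if h2.a is not None]

for link in extract_links(SAMPLE):
    print(link)            # the bare <a ...>...</a> element
    print(link["href"])    # just the URL part
```

Collecting the inner tags leaves the parsed page untouched, which may be preferable if the soup is needed for anything else afterwards.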

I therefore updated my code to look like this:

def tonaton():
    site = "http://tonaton.com/en/job-vacancies-in-ghana"
    hdr = {'User-Agent' : 'Mozilla/5.0'}
    req = urllib2.Request(site, headers=hdr)
    jobpass = urllib2.urlopen(req)
    invalid_tag = ('h2')
    soup = BeautifulSoup(jobpass)
    jobs = soup.find_all('h2')
    for tag in invalid_tag:
        for match in jobs(tag):
            match.replaceWithChildren()
    print jobs

But I couldn't get it to work, even though that was the best logic I could come up with. I'm a newbie, though, so I know there is probably a better way to do it.

Any help will be gratefully appreciated.

Thanks
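
Two things in the updated loop probably explain the failure. First, ('h2') without a trailing comma is a plain string, so the outer loop walks the characters 'h' and '2'. Second, find_all() returns a list-like result which, as far as I can tell, cannot be called the way a soup or tag object can. A hedged sketch of the same idea with both points addressed, run on an inline sample instead of the live site (the sample string and the name invalid_tags are assumptions, and it is written for Python 3):

```python
from bs4 import BeautifulSoup

invalid_tags = ("h2",)             # trailing comma makes a one-element tuple
assert list("h2") == ["h", "2"]    # a bare string would be looped char by char

html = '<div><h2><a href="/en/cashiers-accra">cashiers </a></h2></div>'
soup = BeautifulSoup(html, "html.parser")

for name in invalid_tags:
    # Search from the soup itself, not from an earlier find_all() result.
    for match in soup.find_all(name):
        match.unwrap()             # drop the <h2> wrapper, keep its children

print(soup.div)
```

Here unwrap() is the current bs4 name for what the post calls replaceWithChildren(). Because it edits the tree in place, the result has to be read back from the soup, not from a list collected before the loop.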

Chris Angelico · Nov 29, 2013

One last thing, I observe that you've a gmail address. This is currently
guaranteed to send shivers down my spine. So if you're using google groups,
would you be kind enough to read and action this,
https://wiki.python.org/moin/GoogleGroupsPython, thanks.

Don't blame all gmail users, some of us are using the mailing list.

You should be able to check the headers - with the email posts,
there's an Injection-Info header which cites Google Groups. Presumably
you get the same or similar if you read as a newsgroup.

And the OP was, indeed, using GG. Why is it suddenly so popular?

ChrisA

Gene Heskett · Nov 29, 2013

Thank you for that hint Chris, it should enhance my enjoyment of this list.

Cheers, Gene
--
"There are four boxes to be used in defense of liberty:
soap, ballot, jury, and ammo. Please use in that order."
-Ed Howdershelt (Author)
Genes Web page <http://geneslinuxbox.net:6309/gene>

There is a 20% chance of tomorrow.
A pen in the hand of this president is far more
dangerous than 200 million guns in the hands of
law-abiding citizens.
