Strip away HTML tags from extracted links

Max Cuban

I have the following code to extract certain links from a webpage:

from bs4 import BeautifulSoup
import urllib2, sys
import re

def tonaton():
    site = "http://tonaton.com/en/job-vacancies-in-ghana"
    hdr = {'User-Agent' : 'Mozilla/5.0'}
    req = urllib2.Request(site, headers=hdr)
    jobpass = urllib2.urlopen(req)
    invalid_tag = ('h2')
    soup = BeautifulSoup(jobpass)
    print soup.find_all('h2')

The links are contained in the 'h2' tags so I get the links as follows:

<h2><a href="/en/cashiers-accra">cashiers </a></h2>
<h2><a href="/en/cake-baker-accra">Cake baker</a></h2>
<h2><a href="/en/automobile-technician-accra">Automobile Technician</a></h2>
<h2><a href="/en/marketing-officer-accra-4">Marketing Officer</a></h2>

But I'm interested in getting rid of all the 'h2' tags so that I have links only in this manner:

<a href="/en/cashiers-accra">cashiers </a>
<a href="/en/cake-baker-accra">Cake baker</a>
<a href="/en/automobile-technician-accra">Automobile Technician</a>
<a href="/en/marketing-officer-accra-4">Marketing Officer</a>


I therefore updated my code to look like this:

def tonaton():
    site = "http://tonaton.com/en/job-vacancies-in-ghana"
    hdr = {'User-Agent' : 'Mozilla/5.0'}
    req = urllib2.Request(site, headers=hdr)
    jobpass = urllib2.urlopen(req)
    invalid_tag = ('h2')
    soup = BeautifulSoup(jobpass)
    jobs = soup.find_all('h2')
    for tag in invalid_tag:
        for match in jobs(tag):
            match.replaceWithChildren()
    print jobs

But I couldn't get it to work, even though that was the best logic I could come up with. I'm a newbie, though, so I know there is probably a better way to do this.

Any help will be gratefully appreciated.

Thanks
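
For reference, here is a minimal sketch of one way to get the links without the surrounding 'h2' tags. It is not taken from the thread. Two things in the updated code are likely to cause trouble: ('h2') is just the string 'h2' (a one-element tuple needs a trailing comma), so the outer loop iterates over the characters 'h' and '2'; and jobs is the list-like result of find_all, which cannot be called the way a tag or the soup object can. Instead of unwrapping, the sketch simply picks the <a> tag inside each <h2>. Assumptions: Python 3 and bs4 with the built-in "html.parser" parser; the markup is the sample output from the question rather than a live fetch; extract_job_links and SAMPLE_HTML are hypothetical names introduced for illustration.

```python
from bs4 import BeautifulSoup

# Sample markup copied from the output shown in the question (assumption:
# the real page would be fetched first, as in the original code).
SAMPLE_HTML = """
<h2><a href="/en/cashiers-accra">cashiers </a></h2>
<h2><a href="/en/cake-baker-accra">Cake baker</a></h2>
<h2><a href="/en/automobile-technician-accra">Automobile Technician</a></h2>
<h2><a href="/en/marketing-officer-accra-4">Marketing Officer</a></h2>
"""


def extract_job_links(html):
    """Return the first <a> tag found inside each <h2> heading."""
    soup = BeautifulSoup(html, "html.parser")
    links = []
    for heading in soup.find_all("h2"):
        anchor = heading.find("a")
        if anchor is not None:  # skip headings that contain no link
            links.append(anchor)
    return links


for link in extract_job_links(SAMPLE_HTML):
    print(link)
```

If the aim is to change the parsed tree itself, calling unwrap() (the current name for replaceWithChildren()) on each tag returned by soup.find_all('h2') should work, but the links then have to be read back from soup, since the list collected earlier still holds the detached 'h2' tags.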
 
Chris Angelico

One last thing, I observe that you've a gmail address. This is currently
guaranteed to send shivers down my spine. So if you're using google groups,
would you be kind enough to read and action this,
https://wiki.python.org/moin/GoogleGroupsPython, thanks.

Don't blame all gmail users, some of us are using the mailing list. :)
You should be able to check the headers - with the email posts,
there's an Injection-Info header which cites Google Groups. Presumably
you get the same or similar if you read as a newsgroup.

And the OP was, indeed, using GG. Why is it suddenly so popular?

ChrisA
 
Gene Heskett

Thank you for that hint Chris, it should enhance my enjoyment of this list.

Cheers, Gene
 
