SAKTHEESH
I am using Beautiful Soup to parse an HTML document and find all text that is not
contained inside any anchor element.
I came up with the code below, which finds every link that has an href attribute,
but I cannot work out how to do the reverse.
How can I modify this code to get only the plain text using Beautiful
Soup, so that I can do a find-and-replace on it and modify the soup?
for a in soup.findAll('a', href=True):
    print(a['href'])
Example:
<html><body>
<div> <a href="www.test1.com/identify">test1</a> </div>
<div><br></div>
<div><a href="www.test2.com/identify">test2</a></div>
<div><br></div><div><br></div>
<div>
This should be identified
Identify me 1
Identify me 2
<p id="firstpara" align="center"> This paragraph should be<b>
identified </b>.</p>
</div>
</body></html>
Expected output:
This should be identified
Identify me 1
Identify me 2
This paragraph should be identified.
I want to find the text that is not within `<a></a>`, then find
"Identify" in that text and replace it with "Replaced".
So the final output will look like this:
<html><body>
<div> <a href="www.test1.com/identify">test1</a> </div>
<div><br></div>
<div><a href="www.test2.com/identify">test2</a></div>
<div><br></div><div><br></div>
<div>
This should be identified
Replaced me 1
Replaced me 2
<p id="firstpara" align="center"> This paragraph should be<b>
identified </b>.</p>
</div>
</body></html>
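To make the goal concrete, here is a minimal sketch of the kind of thing I am after. It assumes the newer `bs4` package (Beautiful Soup 4.4 or later, where `find_all` accepts `string=True`; older releases use `text=True`) and the built-in `html.parser`. The helper name `replace_outside_anchors` is made up for illustration, and this may well not be the best approach.

```python
from bs4 import BeautifulSoup

html = """<html><body>
<div> <a href="www.test1.com/identify">test1</a> </div>
<div>
This should be identified
Identify me 1
Identify me 2
<p id="firstpara" align="center"> This paragraph should be<b>
identified </b>.</p>
</div>
</body></html>"""


def replace_outside_anchors(soup, old, new):
    """Replace `old` with `new` in every text node that has no <a> ancestor.

    Hypothetical helper, not part of Beautiful Soup itself.
    """
    # find_all returns a list, so editing the tree while looping is safe.
    for node in soup.find_all(string=True):
        # Skip text that sits anywhere inside an anchor element.
        if node.find_parent("a") is None and old in node:
            node.replace_with(node.replace(old, new))
    return soup


soup = BeautifulSoup(html, "html.parser")
replace_outside_anchors(soup, "Identify", "Replaced")
print(soup)
```

The match is case-sensitive, so the lowercase "identified" is left alone, and href values are never touched because only text nodes are visited.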
Thanks for your time and help!