Beautiful Soup Looping Extraction Question

Tess · Mar 24, 2008

Hello All,

I have a Beautiful Soup question and I'd appreciate any guidance the
forum can provide.

Let's say I have a file that looks like file.html pasted below.

My goal is to extract all elements where the following is true: <p
align="left"> and <div align="center">.

The lines should be ordered in the same order as they appear in the
file - therefore the output file would look like output.txt below.

I experimented with something similar to this code:
for i in soup.findAll('p', align="left"):
    print i
for i in soup.findAll('div', align="center"):
    print i

I get something like this:
<p align="left">P4</p>
<p align="left">P3</p>
<p align="left">P1</p>
<div align="center">div4b</div>
<div align="center">div3b</div>
<div align="center">div2b</div>
<div align="center">div2a</div>

Any guidance would be greatly appreciated.

Best,

Ira

##########begin: file.html############
<html>
<body>

<p align="left">P1</p>

<p align="right">P2</p>

<div align="center">div2a</div>
<div align="center">div2b</div>

<p align="left">P3</p>
<div align="right">div3a</div>
<div align="center">div3b</div>
<div align="left">div3c</div>

<p align="left">P4</p>
<div align="left">div4a</div>
<div align="center">div4b</div>

</body>
</html>

##########end: file.html############

===================begin: output.txt===================
<p align="left">P1</p>
<div align="center">div2a</div>
<div align="center">div2b</div>
<p align="left">P3</p>
<div align="center">div3b</div>
<p align="left">P4</p>
<div align="center">div4b</div>
===================end: output.txt===================

Paul McGuire · Mar 24, 2008

> Hello All,
>
> I have a Beautiful Soup question and I'd appreciate any guidance the
> forum can provide.

I *know* you're using Beautiful Soup, and I *know* that BS is the de
facto HTML parser/processor library. Buuuuuut, I just couldn't help
myself in trying a pyparsing scanning approach to your problem. See
the program below for a pyparsing treatment of your question.

-- Paul

"""
My goal is to extract all elements where the following is true: <p
align="left"> and <div align="center">.
"""
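from pyparsing import makeHTMLTags, withAttribute, keepOriginalText, SkipTo

# assumption: html holds the contents of file.html from the original post
html = open("file.html").read()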

p,pEnd = makeHTMLTags("P")
p.setParseAction( withAttribute(align="left") )
div,divEnd = makeHTMLTags("DIV")
div.setParseAction( withAttribute(align="center") )

# basic scanner for matching either <p> or <div> with desired attrib value
patt = ( p + SkipTo(pEnd) + pEnd ) | ( div + SkipTo(divEnd) + divEnd )
patt.setParseAction( keepOriginalText )

print "\nBasic scanning"
for match in patt.searchString(html):
    print match[0]

# simplified data access, by adding some results names
patt = ( p + SkipTo(pEnd)("body") + pEnd )("P") | \
       ( div + SkipTo(divEnd)("body") + divEnd )("DIV")
patt.setParseAction( keepOriginalText )

print "\nSimplified field access using results names"
for match in patt.searchString(html):
    if match.P:
        print "P -", match.body
    if match.DIV:
        print "DIV -", match.body

Prints:

Basic scanning
<p align="left">P1</p>
<div align="center">div2a</div>
<div align="center">div2b</div>
<p align="left">P3</p>
<div align="center">div3b</div>
<p align="left">P4</p>
<div align="center">div4b</div>

Simplified field access using results names
P - P1
DIV - div2a
DIV - div2b
P - P3
DIV - div3b
P - P4
DIV - div4b

Tess · Mar 25, 2008

Paul - thanks for the input, it's interesting to see how pyparsing
handles it.

Anyhow, a simple regex took care of the issue in BS:

for i in soup.findAll(re.compile('^p|^div'), align=re.compile('^center|^left')):
    print i

Thanks again!

T

Paul McGuire · Mar 25, 2008

> Anyhow, a simple regex took care of the issue in BS:
>
> for i in soup.findAll(re.compile('^p|^div'), align=re.compile('^center|^left')):
>     print i

But I thought you only wanted certain combinations:

"My goal is to extract all elements where the following is true:
<p align="left"> and <div align="center">."

Won't this solution give you false hits, such as <p align="center"> and
<div align="left">?

-- Paul
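
To make the false-hit concern concrete, here is a small standard-library sketch that needs no Beautiful Soup at all. It only mimics the two regex filters from the findAll() call above being applied independently of each other. The names regex_accepts, wanted and combos are made up for this illustration, and the way Beautiful Soup applies the patterns internally is assumed, not quoted from its source.

```python
import re
from itertools import product

# The two patterns from the findAll() call, each applied on its own.
tag_pattern = re.compile('^p|^div')
align_pattern = re.compile('^center|^left')

# The only two combinations the original question asks for.
wanted = {('p', 'left'), ('div', 'center')}

def regex_accepts(tag, align):
    """Both patterns must match, but nothing ties a tag to its align value."""
    return bool(tag_pattern.search(tag)) and bool(align_pattern.search(align))

# Every tag/align pairing that occurs in file.html.
combos = list(product(['p', 'div'], ['left', 'center', 'right']))
accepted = [c for c in combos if regex_accepts(*c)]
false_hits = [c for c in accepted if c not in wanted]

for tag, align in false_hits:
    print('false hit: <%s align="%s">' % (tag, align))
```

Since '^p' is not anchored at the end, a pattern like this would presumably also accept other tag names that merely start with "p", such as pre.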

Tess · Mar 25, 2008

Paul - you are very right. I am back to the drawing board. Tess
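
One possible way forward that stays within Beautiful Soup: as far as its documentation describes, findAll (find_all in later releases) also accepts a callable that receives each tag and returns True or False. A single predicate can then check the tag name and the align value together, so only the two requested pairings pass. The sketch below shows just the predicate. FakeTag is a hypothetical stand-in with the same .name and .get() interface as a parsed tag, there only so the snippet runs without Beautiful Soup installed. WANTED and wanted are likewise illustrative names.

```python
# The (tag name, align value) pairs requested in the original post.
WANTED = {('p', 'left'), ('div', 'center')}

def wanted(tag):
    """Return True only for <p align="left"> or <div align="center">."""
    return (tag.name, tag.get('align')) in WANTED

class FakeTag(object):
    """Hypothetical stand-in for a parsed tag, used only to exercise wanted()."""
    def __init__(self, name, **attrs):
        self.name = name
        self.attrs = attrs

    def get(self, key, default=None):
        return self.attrs.get(key, default)

samples = [FakeTag('p', align='left'), FakeTag('p', align='center'),
           FakeTag('div', align='left'), FakeTag('div', align='center'),
           FakeTag('p')]
kept = [(t.name, t.get('align')) for t in samples if wanted(t)]
print(kept)
```

With a real soup object, the intended call would be along the lines of soup.findAll(wanted). Because that is one traversal of the tree, the matches should come back in document order, which two separate loops cannot give you.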

Stefan Behnel · Mar 25, 2008

Hi,

again, not BS related, but still a solution.

> Let's say I have a file that looks like file.html pasted below.
>
> My goal is to extract all elements where the following is true: <p
> align="left"> and <div align="center">.

Using lxml:

from lxml import html
tree = html.parse("file.html")
for el in tree.iter():
    if el.tag == 'p' and el.get('align') == 'left':
        print el.tag
    elif el.tag == 'div' and el.get('align') == 'center':
        print el.tag

I assume that BS can do something similar, though.

Stefan
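
The single-pass idea in the lxml loop above does not depend on any particular library. As a rough illustration, here is a Python 3 sketch using only the standard library's html.parser. It walks file.html once, keeps an element only when its tag and align value match one of the two requested pairings, and so preserves document order. AlignedCollector, WANTED and sample are names invented for this sketch. It assumes the wanted elements are not nested inside each other and rebuilds each line from the tag, the align value and the text, so any other attributes would be lost.

```python
from html.parser import HTMLParser

# The (tag name, align value) pairs requested in the original post.
WANTED = {('p', 'left'), ('div', 'center')}

class AlignedCollector(HTMLParser):
    """Collect wanted elements in document order (no nesting support)."""

    def __init__(self):
        super().__init__()
        self.current = None   # tag name of the wanted element being read
        self.align = None     # its align value
        self.text = []        # text pieces seen inside it
        self.found = []       # rebuilt elements, in document order

    def handle_starttag(self, tag, attrs):
        align = dict(attrs).get('align')
        if self.current is None and (tag, align) in WANTED:
            self.current, self.align, self.text = tag, align, []

    def handle_data(self, data):
        if self.current is not None:
            self.text.append(data)

    def handle_endtag(self, tag):
        if tag == self.current:
            self.found.append('<%s align="%s">%s</%s>'
                              % (tag, self.align, ''.join(self.text), tag))
            self.current = None

# Contents of file.html from the original post.
sample = """<html>
<body>
<p align="left">P1</p>
<p align="right">P2</p>
<div align="center">div2a</div>
<div align="center">div2b</div>
<p align="left">P3</p>
<div align="right">div3a</div>
<div align="center">div3b</div>
<div align="left">div3c</div>
<p align="left">P4</p>
<div align="left">div4a</div>
<div align="center">div4b</div>
</body>
</html>"""

collector = AlignedCollector()
collector.feed(sample)
collector.close()
for line in collector.found:
    print(line)
```

Run against the sample, this should print the same seven lines as output.txt in the original post.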
