Parsing HTML

mtuller

Alright. I have tried everything I can find, but am not getting
anywhere. I have a web page that has data like this:

<tr >
<td headers="col1_1" style="width:21%" >
<span class="hpPageText" >LETTER</span></td>
<td headers="col2_1" style="width:13%; text-align:right" >
<span class="hpPageText" >33,699</span></td>
<td headers="col3_1" style="width:13%; text-align:right" >
<span class="hpPageText" >1.0</span></td>
<td headers="col4_1" style="width:13%; text-align:right" >
</tr>

What is shown is only a small section.

I want to extract the 33,699 (which is dynamic) and set the value to a
variable so that I can insert it into a database. I have tried parsing
the html with pyparsing, and the examples will get it to print all
instances with span, of which there are a hundred or so when I use:

for srvrtokens in printCount.searchString(printerListHTML):
    print srvrtokens

If I set the last line to srvrtokens[3] I get the values, but I don't
know how to grab a single line and then set that as a variable.

I have also tried Beautiful Soup, but had trouble understanding the
documentation, and HTMLParser doesn't seem to do what I want. Can
someone point me to a tutorial or give me some pointers on how to
parse html where there are multiple lines with the same tags and then
be able to go to a certain line and grab a value and set a variable's
value to that?


Thanks,

Mike
 
Samuel Karl Peterson

Alright. I have tried everything I can find, but am not getting
anywhere. I have a web page that has data like this:
[snip]

What is shown is only a small section.

I want to extract the 33,699 (which is dynamic) and set the value to a
variable so that I can insert it into a database.
[snip]

I have also tried Beautiful Soup, but had trouble understanding the
documentation.

====================
from BeautifulSoup import BeautifulSoup as parser

soup = parser("""<tr >
<td headers="col1_1" style="width:21%" >
<span class="hpPageText" >LETTER</span></td>
<td headers="col2_1" style="width:13%; text-align:right" >
<span class="hpPageText" >33,699</span></td>
<td headers="col3_1" style="width:13%; text-align:right" >
<span class="hpPageText" >1.0</span></td>
<td headers="col4_1" style="width:13%; text-align:right" >
</tr>""")

value = \
int(soup.find('td', headers='col2_1').span.contents[0].replace(',', ''))
====================

Hope that helped. This code assumes there aren't any td tags with
headers="col2_1" that come before the value you are trying to extract,
because find() returns only the first match. There are several ways to
do things in BeautifulSoup. You should play around with BeautifulSoup in
the interactive prompt. It's simply awesome if you don't need speed on
your side.
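If the page has many rows and the first-match assumption above does not hold, one option is to group the cell text by row and look the value up by its label. The following is only a sketch under stated assumptions: it uses Python 3's standard html.parser rather than BeautifulSoup, the names RowCollector and page_counts are made up for illustration, and the second row in the sample (LEGAL, 1,204, headers col1_2 and col2_2) is invented to show the multi-row case.

```python
from html.parser import HTMLParser


class RowCollector(HTMLParser):
    """Collect the text of every <span>, grouped by the <tr> it sits in."""

    def __init__(self):
        super().__init__()
        self.rows = []          # one list of span texts per table row
        self._row = None        # span texts of the row being parsed
        self._in_span = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag == "span":
            self._in_span = True

    def handle_endtag(self, tag):
        if tag == "span":
            self._in_span = False
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None

    def handle_data(self, data):
        if self._in_span and self._row is not None:
            self._row.append(data.strip())


def page_counts(html):
    """Map the first cell of each row (the label) to the integer in the second."""
    collector = RowCollector()
    collector.feed(html)
    collector.close()
    counts = {}
    for cells in collector.rows:
        if len(cells) < 2:
            continue
        try:
            counts[cells[0]] = int(cells[1].replace(",", ""))
        except ValueError:
            pass  # second cell is not an integer; skip this row
    return counts


# First row is the sample from the question; the second row is hypothetical.
SAMPLE = """<tr >
<td headers="col1_1" style="width:21%" >
<span class="hpPageText" >LETTER</span></td>
<td headers="col2_1" style="width:13%; text-align:right" >
<span class="hpPageText" >33,699</span></td>
<td headers="col3_1" style="width:13%; text-align:right" >
<span class="hpPageText" >1.0</span></td>
<td headers="col4_1" style="width:13%; text-align:right" >
</tr>
<tr >
<td headers="col1_2" style="width:21%" >
<span class="hpPageText" >LEGAL</span></td>
<td headers="col2_2" style="width:13%; text-align:right" >
<span class="hpPageText" >1,204</span></td>
</tr>"""

counts = page_counts(SAMPLE)
letter_pages = counts["LETTER"]   # the variable the question asks for
print(letter_pages)
```

Looking the row up by its label instead of by position means the result does not change if rows are added or reordered, provided the labels are unique.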
 
Paul McGuire

mtuller said:
[snip]

If I set the last line to srvrtokens[3] I get the values, but I don't
know how to grab a single line and then set that as a variable.

So what you are saying is that you need to make your pattern more
specific. I suggest adding these conditions to your matching pattern:
- only match a span if it is inside a <td> with attribute 'headers="col2_1"'
- only match if the span body is an integer (with optional comma
separator for thousands)

This grammar adds these more specific tests for matching the input
HTML (note also the use of results names to make it easy to extract
the integer number, and a parse action added to integer to convert the
'33,699' string to the integer 33699).

-- Paul


htmlSource = """<tr >
<td headers="col1_1" style="width:21%" >
<span class="hpPageText" >LETTER</span></td>
<td headers="col2_1" style="width:13%; text-align:right" >
<span class="hpPageText" >33,699</span></td>
<td headers="col3_1" style="width:13%; text-align:right" >
<span class="hpPageText" >1.0</span></td>
<td headers="col4_1" style="width:13%; text-align:right" >
</tr>"""

from pyparsing import makeHTMLTags, Word, nums, ParseException

tdStart, tdEnd = makeHTMLTags('td')
spanStart, spanEnd = makeHTMLTags('span')

def onlyAcceptWithTagAttr(attrname, attrval):
    # Build a parse action that rejects a tag unless attrname == attrval.
    def action(tagAttrs):
        if not (attrname in tagAttrs and tagAttrs[attrname] == attrval):
            raise ParseException("", 0, "")
    return action

tdStart.setParseAction(onlyAcceptWithTagAttr("headers", "col2_1"))
spanStart.setParseAction(onlyAcceptWithTagAttr("class", "hpPageText"))

# An integer with optional thousands separators, converted to an int.
integer = Word(nums, nums + ',')
integer.setParseAction(lambda t: int(t[0].replace(',', '')))

patt = (tdStart + spanStart + integer.setResultsName("intValue") +
        spanEnd + tdEnd)

for matches in patt.searchString(htmlSource):
    print(matches.intValue)

prints:
33699
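Once the value is an int, the remaining step from the original question is storing it. The thread never says which database is in use, so this is only a sketch assuming SQLite through the standard sqlite3 module; the table name page_counts, its columns, and the helper store_count are all hypothetical.

```python
import sqlite3


def store_count(conn, paper_size, pages):
    """Insert or update one page count. Table and column names are hypothetical."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS page_counts "
        "(paper_size TEXT PRIMARY KEY, pages INTEGER)"
    )
    # Placeholders let the driver handle quoting of the values.
    conn.execute(
        "INSERT OR REPLACE INTO page_counts (paper_size, pages) VALUES (?, ?)",
        (paper_size, pages),
    )
    conn.commit()


conn = sqlite3.connect(":memory:")   # in-memory database, for illustration only
store_count(conn, "LETTER", 33699)   # 33699 is the value extracted above
row = conn.execute(
    "SELECT pages FROM page_counts WHERE paper_size = ?", ("LETTER",)
).fetchone()
print(row[0])
```

Other DB-API drivers follow the same execute-with-parameters shape, though the placeholder style can differ from the "?" used here.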
 
Frederic Rentsch

mtuller said:
[snip]

Posted problems rarely provide exhaustive information. It's just not
possible. I have been taking shots in the dark of late, suggesting a
stream-editing approach to extracting data from HTML files. The
mainstream approach is to use a parser (Beautiful Soup or pyparsing).

Often nothing more is attempted than locating and extracting some text
irrespective of page layout. This can sometimes be done with a simple
regular expression, or with a stream editor if a regular expression
gets too unwieldy. The advantage of the stream editor over a parser is
that it doesn't mobilize an arsenal of unneeded functionality and
therefore tends to be easier, faster and shorter to implement. The
editor's inability to understand structure isn't a shortcoming when
structure doesn't matter, and can even be an advantage in the presence
of malformed input that sends a parser on a tough and potentially
hazardous mission for no purpose at all.

SE doesn't impose the study of massive documentation, nor the
memorization of dozens of classes, methods and whatnot. The following
four lines would solve the OP's problem (provided the post really is all
there is to the problem):

>>> import re, SE # http://cheeseshop.python.org/pypi/SE/2.3
>>> Filter = SE.SE ('<EAT> "~(?i)col[0-9]_[0-9](.|\n)*?/td>~==SOME SPLIT MARK"')
>>> r = re.compile ('(?i)(col[0-9]_[0-9])(.|\n)*?([0-9,]+)</span')
>>> for line in Filter (s).split ('SOME SPLIT MARK'):
...     print r.search (line).group (1, 3)
...
('col2_1', '33,699')
('col3_1', '0')
('col4_1', '7,428')
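For readers without the SE package, a similar regex-only extraction can be sketched with just the standard re module. This is not the SE approach itself but a stand-in under stated assumptions: Python 3, each value sits in a <span> directly inside its <td>, and the names CELL and integer_cells are made up for illustration. Unlike the pattern above, it only accepts a span whose whole body is an integer, so the 1.0 cell is skipped rather than reported as '0'.

```python
import re

# headers value, rest of the <td ...> tag, then a <span> whose entire
# body is digits and commas. Case-insensitive, like the (?i) patterns above.
CELL = re.compile(
    r'(col[0-9]_[0-9])"[^<]*>\s*<span[^>]*>\s*([0-9,]+)\s*</span',
    re.IGNORECASE,
)


def integer_cells(html):
    """Map each matching headers value to its integer cell content."""
    return {name: int(text.replace(",", "")) for name, text in CELL.findall(html)}


# Same cells as the input listed in this post.
s = '''<tr >
<td headers="col1_1" style="width:21%" >
<span class="hpPageText" >LETTER</span></td>
<td headers="col2_1" style="width:13%; text-align:right" >
<span class="hpPageText" >33,699</span></td>
<td headers="col3_1" style="width:13%; text-align:right" >
<span class="hpPageText" >1.0</span></td>
<td headers="col5_1" style="width:13%; text-align:right" >
<span class="hppagetext" >7,428</span></td>
</tr>'''

values = integer_cells(s)
count = values["col2_1"]   # the number the original question asks for
print(count)
```

As with any regex over HTML, this depends on the markup staying in the shape shown; a change in tag nesting or attribute order would need a new pattern.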


-----------------------------------------------------------------------

Input:

>>> s = '''<tr >
<td headers="col1_1" style="width:21%" >
<span class="hpPageText" >LETTER</span></td>
<td headers="col2_1" style="width:13%; text-align:right" >
<span class="hpPageText" >33,699</span></td>
<td headers="col3_1" style="width:13%; text-align:right" >
<span class="hpPageText" >1.0</span></td>
<td headers="col5_1" style="width:13%; text-align:right" >
<span class="hppagetext" >7,428</span></td>
</tr>'''

The SE object handles file input too:
'' commands string output
print r.search (line).group (1, 3)
 
