re troubles


Evanda Remington

I'm trying to filter some rows of an html table out, based on their
contents. For input like:
"""
<table>
<tr>
<td>Lasers</td><td>17</td> </tr>
<tr> << want to filter
<td>kittens</td><td>8</td> << this out.
</tr> <<
<tr> <td>robots</td><td>8</td> </tr>
</table>
"""
I would like to completely remove the (3 line) table row that mentions
kittens. The regexp I have tried to use is: r"<tr>.*?kittens.*?</tr>".
When compiled and used with sub("", data), it strangely removes everything
from the first "<tr>" to the first "</tr>" after kittens.

That is, the ".*?" notation works in the second half, but not in the first
half. It behaves the same as ".*" should.

Any advice?

-e
 

Bengt Richter

See if this works for you. I added some more kittens and robots to cover multiple
matches; a single instance could be handled differently. I used 'XXX' rather than ''
as the replacement to make the example clearer.

====< evanda.py >====================
import re
s = """\
<table>
<tr>
<td>Lasers</td><td>17</td> </tr>
<tr> << want to filter
<td>kittens</td><td>8</td> << this out.
</tr> <<
<tr> <td>robots</td><td>8</td> </tr>
<tr> << want to filter
<td>more kittens</td><td>8</td> << this out.
</tr> <<
<tr> <td>more robots</td><td>8</td> </tr>
</table>
"""
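# Between "<tr>" and "kittens", allow only text that does not begin another "<tr>",
# so the match cannot start in an earlier row.
rxo = re.compile(r"(?ms)<tr>(?:[^<]|<[^t]|<t[^r]|<tr[^>])*?kittens.*?</tr>")
# A single parenthesized argument keeps this line valid in Python 2 and Python 3.
print('==== before ====\n%s==== after sub XXX ====\n%s====' % (s, rxo.sub('XXX', s)))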
=====================================
Result:

[19:02] C:\pywk\clp>evanda.py
==== before ====
<table>
<tr>
<td>Lasers</td><td>17</td> </tr>
<tr> << want to filter
<td>kittens</td><td>8</td> << this out.
</tr> <<
<tr> <td>robots</td><td>8</td> </tr>
<tr> << want to filter
<td>more kittens</td><td>8</td> << this out.
</tr> <<
<tr> <td>more robots</td><td>8</td> </tr>
</table>
==== after sub XXX ====
<table>
<tr>
<td>Lasers</td><td>17</td> </tr>
XXX <<
<tr> <td>robots</td><td>8</td> </tr>
XXX <<
<tr> <td>more robots</td><td>8</td> </tr>
</table>
====
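
For readers wondering why the original pattern over-matched: a regex search returns the leftmost position at which a match is possible, and a lazy ".*?" only keeps the match short from that starting point; it never moves the start later. The sketch below reproduces this on a shortened copy of the table, then shows a negative-lookahead variant that also stops the first half from crossing a "<tr>". The lookahead is a different technique from the alternation used in evanda.py, not a restatement of it. The names data, naive, and tempered are illustrative, and re.DOTALL is an assumption about how the original pattern was compiled, since without it "." does not match newlines.

```python
import re

data = """\
<table>
<tr>
<td>Lasers</td><td>17</td> </tr>
<tr>
<td>kittens</td><td>8</td>
</tr>
<tr> <td>robots</td><td>8</td> </tr>
</table>
"""

# Original pattern: the match starts at the first <tr> in the text,
# so the Lasers row is swallowed along with the kittens row.
naive = re.compile(r"<tr>.*?kittens.*?</tr>", re.DOTALL)
naive_result = naive.sub("", data)

# Lookahead variant: each character consumed before "kittens" must not
# begin another <tr>, so a match cannot start in an earlier row.
tempered = re.compile(r"<tr>(?:(?!<tr>).)*?kittens.*?</tr>", re.DOTALL)
tempered_result = tempered.sub("", data)

print(naive_result)
print(tempered_result)
```

Both variants still assume rows open with a literal "<tr>" (no attributes, no uppercase tags), so they fit this input rather than HTML in general.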

Regards,
Bengt Richter
 

Robin Munn

Parsing HTML with regular expressions is notoriously tricky. Have you
tried using HTMLParser yet? If you've tried it and it doesn't work for
you for some reason, then you may have to deal with regexps. But if you
haven't tried HTMLParser, you may find it a lot easier than regexps for
this task.
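
As a rough illustration of the HTMLParser route, here is a minimal sketch, assuming Python 3, where the class lives in html.parser. RowFilter and drop_rows are hypothetical names invented for this example. It buffers each row and re-emits it only if none of its text mentions the word. It also assumes rows are not nested (no table inside a table), that every row is closed with "</tr>", and that comments and doctype declarations do not need to be preserved.

```python
from html.parser import HTMLParser


class RowFilter(HTMLParser):
    """Re-emit the input, dropping each <tr>...</tr> whose text mentions a word."""

    def __init__(self, word):
        # convert_charrefs=False keeps entities such as &amp; as written.
        super().__init__(convert_charrefs=False)
        self.word = word
        self.out = []      # pieces of the filtered document
        self.row = None    # pieces of the current row, or None outside a row
        self.drop = False  # True once the current row mentions the word

    def _emit(self, text):
        # Inside a row, buffer the text; outside, write it straight through.
        (self.out if self.row is None else self.row).append(text)

    def handle_starttag(self, tag, attrs):
        if tag == "tr" and self.row is None:
            self.row, self.drop = [], False
        self._emit(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br/> are passed through unchanged.
        self._emit(self.get_starttag_text())

    def handle_endtag(self, tag):
        self._emit("</%s>" % tag)
        if tag == "tr" and self.row is not None:
            row, self.row = self.row, None
            if not self.drop:
                self.out.extend(row)

    def handle_data(self, data):
        if self.row is not None and self.word in data:
            self.drop = True
        self._emit(data)

    def handle_entityref(self, name):
        self._emit("&%s;" % name)

    def handle_charref(self, name):
        self._emit("&#%s;" % name)


def drop_rows(html, word):
    """Return html with every table row mentioning word removed."""
    parser = RowFilter(word)
    parser.feed(html)
    parser.close()
    return "".join(parser.out)


table = """\
<table>
<tr>
<td>Lasers</td><td>17</td> </tr>
<tr>
<td>kittens</td><td>8</td>
</tr>
<tr> <td>robots</td><td>8</td> </tr>
</table>
"""
filtered = drop_rows(table, "kittens")
print(filtered)
```

Because the parser tracks tags rather than raw text, this version keeps working when a row is written as "<TR>" or carries attributes, which the regexp approaches would need extra care to handle.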
 
