re troubles

Evanda Remington · Dec 18, 2003

I'm trying to filter some rows out of an HTML table, based on their
contents. For input like:
"""
<table>
<tr>
<td>Lasers</td><td>17</td> </tr>
<tr> << want to filter
<td>kittens</td><td>8</td> << this out.
</tr> <<
<tr> <td>robots</td><td>8</td> </tr>
</table>
"""
I would like to completely remove the (3-line) table row that mentions
kittens. The regexp I have tried to use is: r"<tr>.*?kittens.*?</tr>".
When compiled and used with sub("", data), it strangely removes everything
from the first "<tr>" to the first "</tr>" after kittens.

That is, the ".*?" notation works in the second half, but not in the first
half, where it behaves the way ".*" would.

Any advice?

-e

Bengt Richter · Dec 19, 2003

See if this works for you. I added some more kittens and robots to show
multiple matches; a single instance could be handled differently. I used
'XXX' rather than '' as the replacement so the substitutions are easy to
see in the output.

====< evanda.py >====================
import re
s = """\
<table>
<tr>
<td>Lasers</td><td>17</td> </tr>
<tr> << want to filter
<td>kittens</td><td>8</td> << this out.
</tr> <<
<tr> <td>robots</td><td>8</td> </tr>
<tr> << want to filter
<td>more kittens</td><td>8</td> << this out.
</tr> <<
<tr> <td>more robots</td><td>8</td> </tr>
</table>
"""
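# (?ms) lets "." cross newlines. The alternation only consumes text that does
# not begin another "<tr>", so a match cannot start in an earlier row.
rxo = re.compile(r"(?ms)<tr>(?:[^<]|<[^t]|<t[^r]|<tr[^>])*?kittens.*?</tr>")
print('==== before ====\n%s==== after sub XXX ====\n%s====' % (s, rxo.sub('XXX', s)))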
=====================================
Result:

[19:02] C:\pywk\clp>evanda.py
==== before ====
<table>
<tr>
<td>Lasers</td><td>17</td> </tr>
<tr> << want to filter
<td>kittens</td><td>8</td> << this out.
</tr> <<
<tr> <td>robots</td><td>8</td> </tr>
<tr> << want to filter
<td>more kittens</td><td>8</td> << this out.
</tr> <<
<tr> <td>more robots</td><td>8</td> </tr>
</table>
==== after sub XXX ====
<table>
<tr>
<td>Lasers</td><td>17</td> </tr>
XXX <<
<tr> <td>robots</td><td>8</td> </tr>
XXX <<
<tr> <td>more robots</td><td>8</td> </tr>
</table>
====
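
The original pattern overshoots because a regex search reports the leftmost
match, and a non-greedy ".*?" only limits how far a match extends once its
starting point is fixed; it does not move the start forward. A minimal
sketch, assuming Python 3 and a shortened copy of the sample table without
the "<<" annotations (the names naive, guarded, and data are illustrative,
not from the thread), comparing where the two patterns begin matching:

```python
import re

# Shortened sample input (assumption: annotations removed for brevity).
data = """<table>
<tr>
<td>Lasers</td><td>17</td> </tr>
<tr>
<td>kittens</td><td>8</td>
</tr>
<tr> <td>robots</td><td>8</td> </tr>
</table>
"""

# Original pattern (DOTALL assumed, so "." can cross line breaks).
naive = re.compile(r"(?s)<tr>.*?kittens.*?</tr>")
# Pattern from the listing above: the text before "kittens" may not
# contain another "<tr>".
guarded = re.compile(r"(?s)<tr>(?:[^<]|<[^t]|<t[^r]|<tr[^>])*?kittens.*?</tr>")

naive_match = naive.search(data)
guarded_match = guarded.search(data)

# The naive match starts at the first "<tr>" and swallows the Lasers row.
print("Lasers" in naive_match.group(0))
# The guarded match starts at the "<tr>" that opens the kittens row.
print("Lasers" in guarded_match.group(0))

# Removing the guarded match leaves the other rows in place.
cleaned = guarded.sub("", data)
print(cleaned)
```

The first check prints True and the second prints False: both patterns end
at the same "</tr>", and only the starting point differs.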

Regards,
Bengt Richter

Robin Munn · Dec 22, 2003

Parsing HTML with regular expressions is notoriously tricky. Have you
tried using HTMLParser yet? If you've tried it and it doesn't work for
you for some reason, then you may have to deal with regexps. But if you
haven't tried HTMLParser, you may find it a lot easier than regexps for
this task.
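
The HTMLParser approach could look roughly like the following. This is a
minimal sketch, assuming Python 3, where the class lives in the html.parser
module (in 2003 it was the top-level HTMLParser module). The RowFilter class
and drop_rows helper are hypothetical names, not from the thread. The sketch
buffers each row and re-emits it only if none of its text mentions the
unwanted word. It assumes no nested tables and does not pass through
comments or declarations.

```python
from html.parser import HTMLParser


class RowFilter(HTMLParser):
    """Re-emit HTML, dropping any <tr>...</tr> row whose text mentions a word."""

    def __init__(self, unwanted):
        # convert_charrefs=False keeps entities such as &amp; as written.
        super().__init__(convert_charrefs=False)
        self.unwanted = unwanted
        self.out = []      # pieces of the filtered document
        self.row = None    # pieces of the row being buffered, or None
        self.drop = False  # True once the current row mentions the word

    def _emit(self, text):
        # Inside a row, buffer the text; outside, write it straight through.
        (self.out if self.row is None else self.row).append(text)

    def handle_starttag(self, tag, attrs):
        if tag == "tr" and self.row is None:
            self.row, self.drop = [], False
        self._emit(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags are passed through as written.
        self._emit(self.get_starttag_text())

    def handle_endtag(self, tag):
        self._emit("</%s>" % tag)
        if tag == "tr" and self.row is not None:
            row, self.row = self.row, None
            if not self.drop:
                self.out.extend(row)

    def handle_data(self, data):
        if self.row is not None and self.unwanted in data:
            self.drop = True
        self._emit(data)

    def handle_entityref(self, name):
        self._emit("&%s;" % name)

    def handle_charref(self, name):
        self._emit("&#%s;" % name)


def drop_rows(html, unwanted):
    parser = RowFilter(unwanted)
    parser.feed(html)
    parser.close()
    # Keep an unterminated final row unless it mentioned the word.
    if parser.row is not None and not parser.drop:
        parser.out.extend(parser.row)
    return "".join(parser.out)


# Shortened sample input (assumption: annotations removed for brevity).
table = """<table>
<tr>
<td>Lasers</td><td>17</td> </tr>
<tr>
<td>kittens</td><td>8</td>
</tr>
<tr> <td>robots</td><td>8</td> </tr>
</table>
"""
result = drop_rows(table, "kittens")
print(result)
```

Because the decision is made per row after the parser has identified the row
boundaries, there is no question of where a match starts, which is the part
that trips up the regexp.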
