Newbie regular expression and whitespace question

googleboy · Sep 22, 2005

Hi.

I am trying to collapse an html table into a single line. Basically,
anytime I see ">" & "<" with nothing but whitespace between them, I'd
like to remove all the whitespace, including newlines. I've read the
how-to and I have tried a bunch of things, but nothing seems to work
for me:

--

table = open(r'D:\path\to\tabletest.txt', 'rb')
strTable = table.read()

#Below find the different sort of things I have tried, one at a time:

strTable = strTable.replace(">\s<", "><") #I got this from the module docs

strTable = strTable.replace(">.<", "><")

strTable = ">\s+<".join(strTable)

strTable = ">\s<".join(strTable)

print strTable

--

The table in question looks like this:

<table width="80%" border="0">
<tr>
<td> </td>
<td colspan="2">Introduction</td>
<td><div align="right">3</div></td>
</tr>
<tr>
<td> </td>
</tr>
<tr>
<td><i>ONE</i></td>
<td colspan="2">Childraising for Parrots</td>
<td><div align="right">11</div></td>
</tr>
</table>

For extra kudos (and I confess I have been so stuck on the above
problem I haven't put much thought into how to do this one) I'd like to
be able to measure the number of characters between the <p> & </p>
tags, and then insert a newline character at the end of the next word
after an arbitrary number of characters..... I am reading in to a
script a bunch of paragraphs formatted for a webpage, but they're all
on one big long line and I would like to split them for readability.

TIA

Googleboy

Paul McGuire · Sep 22, 2005

If you're absolutely stuck on using RE's, then others will have to step
forward. Meanwhile, here's a pyparsing solution (get pyparsing at
http://pyparsing.sourceforge.net):

---------------
from pyparsing import *

LT = Literal("<")
GT = Literal(">")

# matches with or without intervening whitespace
collapsableSpace = GT + LT
collapsableSpace.setParseAction( replaceWith("><") )

print collapsableSpace.transformString(data)
---------------

The reason this works is that pyparsing implicitly skips over whitespace
while looking for matches of collapsable space (a '>' followed by a '<').
When found, the parse action is triggered, which in this case, replaces
whatever was matched with the string "><". Finally, the input data (in this
case your HTML table, stored in the string variable, data) is passed to
transformString, which scans for matches of the collapsableSpace expression,
runs the parse action when they are found, and returns the final transformed
string.

As for word wrapping within <p>...</p> tags, there are at least two recipes
in the Python Cookbook for word wrapping. Be careful, though, as many HTML
pages are very bad about omitting the trailing </p> tags.

-- Paul

Fredrik Lundh · Sep 22, 2005

Paul said:
If you're absolutely stuck on using RE's, then others will have to step
forward. Meanwhile, here's a pyparsing solution (get pyparsing at
http://pyparsing.sourceforge.net):

so, let's see. using ...

from pyparsing import *
import re

data = """ ... table example from op ... """

def test1():
    LT = Literal("<")
    GT = Literal(">")
    collapsableSpace = GT + LT
    collapsableSpace.setParseAction( replaceWith("><") )
    return collapsableSpace.transformString(data)

def test2():
    return re.sub(r">\s+<", "><", data)

I get

timeit -s "import test" "test.test1()"

100 loops, best of 3: 6.8 msec per loop

timeit -s "import test" "test.test2()"

10000 loops, best of 3: 33.3 usec per loop

or in other words, five lines instead of one, and a 200x slowdown.

but alright, maybe we should precompile the expressions to get a
fair comparison. adding

LT = Literal("<")
GT = Literal(">")
collapsableSpace = GT + LT
collapsableSpace.setParseAction( replaceWith("><") )

def test3():
    return collapsableSpace.transformString(data)

p = re.compile(r">\s+<")

def test4():
    return p.sub("><", data)

to the first program, I get

timeit -s "import test" "test.test3()"

100 loops, best of 3: 6.73 msec per loop

timeit -s "import test" "test.test4()"

10000 loops, best of 3: 27.8 usec per loop

that's a 240x slowdown. hmm.
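
For anyone who wants to repeat the regex half of this measurement without a separate test module, here is a rough sketch using the standard timeit module from inside Python. It is written in Python 3 syntax, the `data` string is a made-up stand-in for the OP's table, and the function names are only illustrative. The numbers will differ from machine to machine.

```python
import re
import timeit

# Hypothetical stand-in for the table in the question.
data = "<table>\n<tr>\n<td> </td>\n<td>Introduction</td>\n</tr>\n</table>\n"

# Compiled once, up front, like test4 above.
pattern = re.compile(r">\s+<")

def with_module_function():
    # re.sub has to look the pattern up in re's internal cache on every call
    return re.sub(r">\s+<", "><", data)

def with_compiled_pattern():
    # the precompiled pattern object skips that lookup
    return pattern.sub("><", data)

runs = 1000
for func in (with_module_function, with_compiled_pattern):
    seconds = timeit.timeit(func, number=runs)
    print("%s: %.2f usec per call" % (func.__name__, seconds / runs * 1e6))
```

Both functions return the same string; only the per-call overhead differs.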

</F>

Bruno Desthuilliers · Sep 22, 2005

googleboy wrote:

Hi.

I am trying to collapse an html table into a single line. Basically,
anytime I see ">" & "<" with nothing but whitespace between them, I'd
like to remove all the whitespace, including newlines. I've read the
how-to and I have tried a bunch of things, but nothing seems to work
for me:

--

table = open(r'D:\path\to\tabletest.txt', 'rb')
strTable = table.read()

#Below find the different sort of things I have tried, one at a time:

strTable = strTable.replace(">\s<", "><") #I got this from the module docs

From which module's doc?

">\s<" is the literal string ">\s<", not a regular expression. Please
re-read the re module doc, and the re howto (you'll find a link to it in
the re module's doc...)
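
To make the difference concrete, here is a small sketch in Python 3 syntax. The `sample` string is a shortened, made-up piece of the table, not the OP's actual file. str.replace searches for its first argument as plain text, while re.sub interprets it as a pattern.

```python
import re

# Hypothetical sample input, shortened from the table in the question.
sample = "<tr>\n<td> </td>\n</tr>"

# str.replace looks for the four literal characters > \ s < and finds none,
# so the string comes back unchanged.
unchanged = sample.replace(r">\s<", "><")

# re.sub reads \s+ as "one or more whitespace characters", newlines included.
collapsed = re.sub(r">\s+<", "><", sample)

print(unchanged == sample)
print(collapsed)
```

The same reasoning applies to the join attempts in the original post: str.join also treats ">\s+<" as ordinary text and inserts it between every character of strTable.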

George Sakkis · Sep 23, 2005

googleboy said:
Hi.

I am trying to collapse an html table into a single line. Basically,
anytime I see ">" & "<" with nothing but whitespace between them, I'd
like to remove all the whitespace, including newlines. I've read the
how-to and I have tried a bunch of things, but nothing seems to work
for me:

[snip]

As others have shown you already, you need to use the sub method of the re module:

import re
regex = re.compile(r'>\s*<')
print regex.sub('><',data)

For extra kudos (and I confess I have been so stuck on the above
problem I haven't put much thought into how to do this one) I'd like to
be able to measure the number of characters between the <p> & </p>
tags, and then insert a newline character at the end of the next word
after an arbitrary number of characters..... I am reading in to a
script a bunch of paragraphs formatted for a webpage, but they're all
on one big long line and I would like to split them for readability.

What I guess you want to do is wrap some text. Do not reinvent the wheel; there's already a module
for that:

import textwrap
print textwrap.fill(oneBigLongLine, 60)

HTH,
George
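
Combining the two suggestions, a possible sketch for wrapping only the text between <p> and </p> could look like this. It uses Python 3 syntax; `wrap_paragraphs`, `PARAGRAPH` and `page` are illustrative names, not anything from the thread. It assumes every <p> has a matching </p>, which, as Paul noted, many real pages do not. Note also that textwrap.fill breaks each line at or before the given width, which is slightly different from "at the end of the next word after N characters".

```python
import re
import textwrap

# Non-greedy match of a single paragraph; DOTALL lets "." cross newlines.
# Assumption: every <p> is closed by a </p>.
PARAGRAPH = re.compile(r"<p>(.*?)</p>", re.DOTALL)

def wrap_paragraphs(html, width=60):
    """Re-wrap the text inside each <p>...</p> pair to at most `width` columns."""
    def rewrap(match):
        # group(1) is the paragraph text without the surrounding tags
        return "<p>" + textwrap.fill(match.group(1), width) + "</p>"
    return PARAGRAPH.sub(rewrap, html)

# Made-up one-line paragraph standing in for the OP's long lines.
page = "<p>" + "word " * 30 + "</p>"
print(wrap_paragraphs(page, 40))
```

Text outside the <p> tags is left untouched, so tables collapsed with the regex above stay on one line.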

googleboy · Sep 23, 2005

Thanks for the great positive responses. I was close with what I was
trying, I guess, but close only counts in horseshoes and um..
something else that close counts in.

googleboy

Paul McGuire · Sep 23, 2005

100 loops, best of 3: 6.73 msec per loop

10000 loops, best of 3: 27.8 usec per loop

that's a 240x slowdown. hmm.

</F>

Well, what of it? How fast does it have to be? Is it a one-shot
conversion? People tend to be willing to wait a bit longer for one-time
conversion programs. What else is going on in this program? Is this the
bottleneck? Are we reading the input over the Internet through HTTP?

If I'm running this program and waiting for the results, 7 msec isn't
perceptibly slower than 28 usec - both are going to seem pretty much
instantaneous. On the other hand, if I'm processing 100 files, then this
goes up to, um, .7 sec vs 3 msec.

There is no question, regexp's beat the pants off of pyparsing in raw
performance. But this newsgroup has visited the raw performance issue many
times in the past, usually when responding to the "Python can't be very
fast, it's interpreted" argument. Raw performance is just one aspect in
determining suitability of a given technical approach.

-- Paul
