Regular expression fun. Repeated matching of a group

matteosartori

Hi all,

I've spent all morning trying to work this one out:

I've got the following string:

<td>04/01/2006</td><td>Wednesday</td><td>&nbsp;</td><td>09:14</td><td>12:44</td><td>12:50</td><td>17:58</td><td>&nbsp;</td><td>&nbsp;</td><td>&nbsp;</td><td>&nbsp;</td><td>08:14</td>

from which I'm attempting to extract the date and the five times
into a list. Only the very last time is guaranteed to be there, so it
should also work for a line like:

<td>03/01/2006</td><td>Tuesday</td><td>Annual_Holiday</td><td>&nbsp;</td><td>&nbsp;</td><td>&nbsp;</td><td>&nbsp;</td><td>&nbsp;</td><td>&nbsp;</td><td>&nbsp;</td><td>&nbsp;</td><td>08:00</td>

My Python regular expression to match that is currently:

digs = re.compile(
r'<td>(\d{2}\/\d{2}\/\d{4})</td>.*?(?:<td>(\d+\:\d+)</td>).*$' )

which first extracts the date into group 1
then matches the tags between the date and the first instance of a time
into group 2
then matches the first instance of a time into group 3
but then group 4 grabs all the remaining string.

I've tried changing the time pattern into

(?:<td>(\d+\:\d+)</td>)+

but that doesn't seem to mean "grab one or more cases of the previous
regexp."

Any Python regexp gurus with a hint would be greatly appreciated.

M@
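
A note for later readers on why the `(?:...)+` attempt seems to "lose" matches: in Python's re module a quantified group does match every repetition, but the capturing group inside it keeps only the last one. Below is a minimal sketch, written for Python 3. The row is shortened from the post above, and the names `row`, `repeated`, `last_of_run` and `all_times` are made up for illustration.

```python
import re

# Shortened version of the row from the post (fewer &nbsp; cells).
row = ("<td>04/01/2006</td><td>Wednesday</td><td>&nbsp;</td>"
       "<td>09:14</td><td>12:44</td><td>12:50</td><td>17:58</td>"
       "<td>&nbsp;</td><td>08:14</td>")

# The quantified group consumes the whole run of adjacent time cells,
# but group(1) only remembers the final repetition of that run.
repeated = re.compile(r'(?:<td>(\d+:\d+)</td>)+')
last_of_run = repeated.search(row).group(1)
print(last_of_run)   # 17:58

# Applying the single-cell pattern repeatedly with findall keeps them all.
all_times = re.findall(r'<td>(\d+:\d+)</td>', row)
print(all_times)     # ['09:14', '12:44', '12:50', '17:58', '08:14']
```

So the pattern itself is fine; it is the "one group, one value" rule that gets in the way.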
 
johnzenger

There's more to re than just sub. How about:

sanesplit = re.split(r"</td><td>|<td>|</td>", text)
date = sanesplit[1]
times = [t for t in sanesplit if re.match(r"\d\d:\d\d", t)]

... then "date" contains the date at the beginning of the line and
"times" contains all your times.
 
matteosartori

Thanks,

The date = sanesplit[1] line complains about the "list index being out
of range", which is probably due to the fact that not all lines have
the <td> in them, something I didn't explain in the previous post.

I'd need some way of ensuring, as with the pattern I'd concocted, that
a valid line actually starts with a <td> containing a / separated date
tag.

As an aside, is it not actually possible to do what I was trying with a
single pattern or is it just not practical?

M@
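
On the aside about doing it in one go: one possible compromise, sketched here for Python 3, is to let one anchored pattern decide whether the line is a valid row and capture the date, then run a second small pattern over the rest of the line for the times. The names `ROW`, `TIME_CELL` and `extract` are hypothetical, and this is only a sketch under the assumption that a valid row always begins with the date cell.

```python
import re

# A valid row must start with a <td> holding a dd/mm/yyyy date.
ROW = re.compile(r'<td>(\d{2}/\d{2}/\d{4})</td>(.*)$')
# A single time cell; findall applies it as many times as it matches.
TIME_CELL = re.compile(r'<td>(\d{2}:\d{2})</td>')

def extract(line):
    """Return [date, time, time, ...] or None if the line is not a row."""
    m = ROW.match(line)
    if m is None:
        return None
    return [m.group(1)] + TIME_CELL.findall(m.group(2))

holiday = ("<td>03/01/2006</td><td>Tuesday</td><td>Annual_Holiday</td>"
           "<td>&nbsp;</td><td>&nbsp;</td><td>08:00</td>")
print(extract(holiday))     # ['03/01/2006', '08:00']
print(extract("<tr>"))      # None
```

Lines without a leading date cell come back as None, so they can simply be skipped.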
 
johnzenger

You can check len(sanesplit) to see how big your list is. If it is <
2, then there were no <td>'s, so move on to the next line.
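
Something along these lines might do for the length check plus the "starts with a date" requirement. It is a sketch for Python 3, not tested against the real file; `parse_row`, `DATE` and `TIME` are names invented here.

```python
import re

DATE = re.compile(r"\d{2}/\d{2}/\d{4}$")
TIME = re.compile(r"\d{2}:\d{2}$")

def parse_row(line):
    """Return (date, times) for a row, or None for any other line."""
    cells = re.split(r"</td><td>|<td>|</td>", line.strip())
    # A line that starts with <td> splits into ['', first_cell, ...];
    # a line with no <td> at all comes back as a one-element list.
    if len(cells) < 2 or cells[0] != "" or not DATE.match(cells[1]):
        return None
    times = [c for c in cells if TIME.match(c)]
    return cells[1], times

row = ("<td>04/01/2006</td><td>Wednesday</td><td>&nbsp;</td>"
       "<td>09:14</td><td>12:44</td><td>&nbsp;</td><td>08:14</td>")
print(parse_row(row))        # ('04/01/2006', ['09:14', '12:44', '08:14'])
print(parse_row("<tr>"))     # None
```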

It is probably possible to do the whole thing with a regular
expression. It is probably not wise to do so. Regular expressions are
difficult to read, and, as you discovered, difficult to program and
debug. In many cases, Python code that relies on regular expressions
for lots of program logic runs slower than code that uses normal
Python.

Suppose "words" contains all the words in English. Compare these two
lines:

foobarwords1 = [x for x in words if re.search("foo|bar", x) ]
foobarwords2 = [x for x in words if "foo" in x or "bar" in x ]

I haven't tested this with 2.4, but as of a few years ago it was a safe
bet that foobarwords2 would be calculated much, much faster. Also, I
think you will agree, foobarwords2 is a lot easier to read.
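
If anyone wants to check the speed claim on their own interpreter, a rough timeit sketch (Python 3) could look like the following. The `words` list is a small stand-in, not a real English word list, and the function names are made up; no timings are quoted because they depend on the machine and the Python version.

```python
import re
import timeit

# Stand-in data: a real run would load an actual word list instead.
words = ["food", "barn", "table", "rebar", "python", "snafoo"] * 1000

def with_regex():
    return [x for x in words if re.search("foo|bar", x)]

def with_in():
    return [x for x in words if "foo" in x or "bar" in x]

# Time 20 passes of each approach over the same list.
regex_seconds = timeit.timeit(with_regex, number=20)
in_seconds = timeit.timeit(with_in, number=20)
print("re.search: %.3fs   'in': %.3fs" % (regex_seconds, in_seconds))
```

Both functions return the same list, so whichever is faster can be swapped in without changing results.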
 
matteosartori

Yes, it's easier to read without a doubt. I just wondered if I was
failing to do what I was trying to do because it couldn't be done or
because I hadn't properly understood what I was doing. Alas, it was
probably the latter.

Thanks for your help,

M@
 
Paul McGuire

Here's a (surprise!) pyparsing solution. -- Paul
(Get pyparsing at http://pyparsing.sourceforge.net.)

data = [
"""<td>04/01/2006</td><td>Wednesday</td><td>&nbsp;</td><td>09:14</td><td>12:44</td><td>12:50</td><td>17:58</td><td>&nbsp;</td><td>&nbsp;</td><td>&nbsp;</td><td>&nbsp;</td><td>08:14</td>""",
"""<td>03/01/2006</td><td>Tuesday</td><td>Annual_Holiday</td><td>&nbsp;</td><td>&nbsp;</td><td>&nbsp;</td><td>&nbsp;</td><td>&nbsp;</td><td>&nbsp;</td><td>&nbsp;</td><td>&nbsp;</td><td>08:00</td>"""
]

from pyparsing import *

# Match <td> and </td> but keep them out of the results.
startTD, endTD = makeHTMLTags("TD")
startTD = startTD.suppress()
endTD = endTD.suppress()
dayOfWeek = oneOf("Sunday Monday Tuesday Wednesday Thursday Friday Saturday")
nbsp = Literal("&nbsp;")
time = Combine(Word(nums, exact=2) + ":" + Word(nums, exact=2))
date = Combine(Word(nums, exact=2) + "/" + Word(nums, exact=2) + "/" +
               Word(nums, exact=4))

entry = ( startTD + date.setResultsName("date") + endTD +
          startTD + dayOfWeek.setResultsName("dayOfWeek") + endTD +
          startTD + ( Suppress(nbsp) |
                      Word(alphanums+"_").setResultsName("name") ) + endTD +
          OneOrMore(startTD + (Suppress(nbsp) | time) + endTD
                    ).setResultsName("dates")
        )

for d in data:
    res = entry.parseString(d)
    print res.date
    print res.dayOfWeek
    print res.name
    print res.dates
    print


Returns:

04/01/2006
Wednesday

['09:14', '12:44', '12:50', '17:58', '08:14']

03/01/2006
Tuesday
Annual_Holiday
['08:00']
 
plahey

Doesn't this do what you want?

import re

DATE_TIME_RE = re.compile(r'<td>((\d{2}\/\d{2}\/\d{4})|(\d{2}:\d{2}))<\/td>')

test = '<td>04/01/2006</td>' \
'<td>Wednesday</td>' \
'<td>&nbsp;</td>' \
'<td>09:14</td>' \
'<td>12:44</td>' \
'<td>12:50</td>' \
'<td>17:58</td>' \
'<td>&nbsp;</td>' \
'<td>&nbsp;</td>' \
'<td>&nbsp;</td>' \
'<td>&nbsp;</td>' \
'<td>08:14</td>'

out = [m[0] for m in DATE_TIME_RE.findall(test)]

for m in out:
    print m
 
Larry Bates

This works:

import BeautifulSoup

test = '<td>04/01/2006</td>' \
'<td>Wednesday</td>' \
'<td>&nbsp;</td>' \
'<td>09:14</td>' \
'<td>12:44</td>' \
'<td>12:50</td>' \
'<td>17:58</td>' \
'<td>&nbsp;</td>' \
'<td>&nbsp;</td>' \
'<td>&nbsp;</td>' \
'<td>&nbsp;</td>' \
'<td>08:14</td>'

c = BeautifulSoup.BeautifulSoup(test)
times = []
for i in c.childGenerator():
    # Skip the empty (&nbsp;) cells.
    if i.contents[0] == "&nbsp;": continue
    times.append(i.contents[0])

date=times.pop(0)
day=times.pop(0)

print "date=", date
print "day=", day
print "times=", times

-Larry Bates
 
