Pattern Matching


Greg Lindstrom

Hello-

I'm running Python 2.2.3 on Windows XP "Professional" and am reading a file
with one very long line of text (the line consists of multiple records with no
cr/lf). What I would like to do is scan for occurrences of a specific
pattern of characters, which I expect to repeat many times in the file.
Suppose I want to search for "Start: mm/dd/yyyy" and capture the mm/dd/yyyy
data for processing each time I find it. This is the type of problem I used
to solve with <duck>Perl<\duck> in a former lifetime using regular
expressions. The following does not work, but is the flavor of what I want
to do:

long_line_of_text = 'Start: 1/1/2004 and some stuff.~Start: 2/3/2004 stuff.~Start 5/1/2004 morestuff.~'
while re.match('Start:\ (\D?/\D?/\D+)', long_line_of_text):
    # process the date string here, which I hoped to catch in the parentheses above

I'd like this to keep matching and processing the string as long as it keeps
matching the pattern, bopping down the string as it goes.

Another way to handle this is to replace all of the tildes with linefeeds
(tildes are the end of segment marker), or split the records on the tilde
and go from there. I'd just like to know how I could do it with the regular
expressions.

Thanks for your help,
--greg

Greg Lindstrom (501) 975-4859
NovaSys Health (e-mail address removed)

"We are the music makers, and we are the dreamers of dreams" W.W.
 

Christopher T King

The following does not work, but is the flavor of what I want to do:

long_line_of_text = 'Start: 1/1/2004 and some stuff.~Start: 2/3/2004 stuff.~Start 5/1/2004 morestuff.~'
while re.match('Start:\ (\D?/\D?/\D+)', long_line_of_text):
    # process the date string here, which I hoped to catch in the parentheses above

I'd like this to keep matching and processing the string as long as it keeps
matching the pattern, bopping down the string as it goes.

That line tastes distinctly Perlish ;)

What you want to write in Python is:

for match in re.finditer('Start:\ (\D?/\D?/\D+)', long_line_of_text):
    <do something with match.group(1)>

re.finditer() returns an iterator that loops over all non-overlapping
occurrences of the pattern in the string, yielding a match object for each
one. match.group() returns the full text of the match, and match.group(n)
returns the text captured by group n.
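
A minimal runnable sketch of the loop above, for anyone trying it out. It assumes the pattern was meant to use \d (a digit) where the post has \D (a non-digit); as written, the pattern cannot match a date such as 1/1/2004. The sample text comes from the original question, and the names date_pattern and dates are only illustrative.

```python
import re

# Sample text from the question; note the third record has no colon after "Start".
long_line_of_text = ('Start: 1/1/2004 and some stuff.~'
                     'Start: 2/3/2004 stuff.~'
                     'Start 5/1/2004 morestuff.~')

# Assumption: \d (digit) was intended where the post has \D (non-digit).
date_pattern = re.compile(r'Start: (\d{1,2}/\d{1,2}/\d{4})')

dates = []
for match in date_pattern.finditer(long_line_of_text):
    # group(0) is the whole match; group(1) is the parenthesized date.
    dates.append(match.group(1))

print(dates)  # ['1/1/2004', '2/3/2004']
```

The third record is skipped because it reads "Start" without the colon, so the pattern does not match there.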

I'm curious, though, why do you escape the space? My guess is it's
something from Perl that I don't remember.
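
One plausible origin of the habit, offered as a guess: Perl's /x modifier ignores unescaped whitespace in a pattern, and Python's re.VERBOSE flag behaves the same way. The sketch below shows that the backslash makes no difference in the default mode but does matter in verbose mode. The sample text and variable names are illustrative, not from the thread.

```python
import re

text = 'Start: 1/1/2004'

# Default mode: a literal space and an escaped space match the same thing.
plain = re.search(r'Start: (\d+)', text)
escaped = re.search(r'Start:\ (\d+)', text)

# Verbose mode ignores unescaped whitespace in the pattern, so only the
# escaped space still matches the space in the text.
verbose_plain = re.search(r'Start: (\d+)', text, re.VERBOSE)
verbose_escaped = re.search(r'Start:\ (\d+)', text, re.VERBOSE)

print(plain.group(1), escaped.group(1))
print(verbose_plain, verbose_escaped.group(1))
```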
 

Kristofer Pettijohn

Greg Lindstrom said:
long_line_of_text = 'Start: 1/1/2004 and some stuff.~Start: 2/3/2004 stuff.~Start 5/1/2004 morestuff.~'
while re.match('Start:\ (\D?/\D?/\D+)', long_line_of_text):
    # process the date string here, which I hoped to catch in the parentheses above

I'd like this to keep matching and processing the string as long as it keeps
matching the pattern, bopping down the string as it goes.

p = re.compile(your_pattern_from_above)
matches = p.findall(long_line_of_text)

matches will be a list of the strings captured by the parentheses, one per
occurrence of the pattern.
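
A short sketch of the findall() approach, under the assumption that the dates should be matched with \d (digit) rather than the \D (non-digit) in the quoted pattern. The shortened sample string is illustrative.

```python
import re

long_line_of_text = 'Start: 1/1/2004 and some stuff.~Start: 2/3/2004 stuff.~'

# Assumption: \d was intended; with a single group, findall() returns
# a list of the captured strings rather than match objects.
p = re.compile(r'Start: (\d{1,2}/\d{1,2}/\d{4})')
matches = p.findall(long_line_of_text)

print(matches)  # ['1/1/2004', '2/3/2004']
```

findall() builds the whole list up front, which is convenient when only the captured text is needed; finditer() is the better fit when match positions or several groups matter.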
 

Eddie Corns

In addition to previous answers, a useful resource might be:
http://gnosis.cx/TPiP/
 
