Simple regexp question

Tom · Oct 26, 2005


I'm having trouble doing something with regular expressions in Ruby
that should be simple.

All I want to do is find each successive regexp, and its offset in a
string. The regexp may have multiple capture groups in it.

The obvious answers of split/scan/index/match all fail, as each of
them fails to return some necessary piece of data.

This is trivial to do in other languages, so I feel I must be missing
something.

Example:

If I have a string like " blah blah 7pm something happens 8pm
something else happens 9pm something different"

I want to use the times to split, and get the text between the times.

Why scan doesn't work
I have no trouble getting the times:

string = " blah blah 7pm something happens 8pm something else happens 9pm something different"
timepattern = /(\d{1,2})(:\d\d)?\s?([aApP]\.?[mM]\.?)/

irb(main):005:0> string.scan(timepattern)
=> [["7", nil, "pm"], ["8", nil, "pm"], ["9", nil, "pm"]]

This gives me exactly what I want about the times, but no way to find
what was between the matches.
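
One possible way around this: String#scan also accepts a block, and inside that block Regexp.last_match returns the MatchData for the current match, offsets included. A minimal sketch, where the helper name matches_with_offsets is made up for illustration and the (:\d\d)? group is an assumed reading of the pattern above:

```ruby
# Sketch: collect [offset, matched_text, captures] for every match.
# Inside a scan block, Regexp.last_match is the MatchData of the current match.
def matches_with_offsets(string, pattern)
  results = []
  string.scan(pattern) do
    m = Regexp.last_match
    results << [m.begin(0), m[0], m.captures]
  end
  results
end

string = " blah blah 7pm something happens 8pm something else happens 9pm something different"
# The optional minutes group is an assumption about the original pattern.
timepattern = /(\d{1,2})(:\d\d)?\s?([aApP]\.?[mM]\.?)/

matches_with_offsets(string, timepattern).each do |offset, text, captures|
  puts "#{offset}: #{text.inspect} #{captures.inspect}"
end
```

With the offsets in hand, the text between two matches is the slice from the end of one match to the start of the next.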

Why split doesn't work

If I use split, I can get everything, but in a format that is useless
to me (and to anybody, I'd guess).

irb(main):006:0> string.split(timepattern)
=> [" blah blah ", "7", "pm", " something happens ", "8", "pm", " something else happens ", "9", "pm", " something different"]

This gives me everything mixed together, but since some capture
groups are missing from the output, you can't tell which part is a
regexp match and which part is text between matches.

Why index() doesn't work

Using string.index(timepattern) allows me to walk through the string
by passing the offset, but it returns only the position, not the
match, so I can get the data, but no times.
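
For what it's worth, String#index with a regexp also sets Regexp.last_match (the $~ variable) as a side effect, so the match is recoverable after each call. A hedged sketch of walking the string that way; each_match_by_index is a hypothetical helper name, and the (:\d\d)? group is an assumed reading of the pattern:

```ruby
# Sketch: walk the string with String#index and an explicit offset.
# index returns only the position, but it also sets Regexp.last_match.
def each_match_by_index(string, pattern)
  pos = 0
  while (start = string.index(pattern, pos))
    m = Regexp.last_match
    yield start, m
    # Step past this match; move at least one character forward so an
    # empty match cannot loop forever.
    pos = [m.end(0), start + 1].max
  end
end

string = " blah blah 7pm something happens 8pm something else happens 9pm something different"
# The optional minutes group is an assumption about the original pattern.
timepattern = /(\d{1,2})(:\d\d)?\s?([aApP]\.?[mM]\.?)/

each_match_by_index(string, timepattern) do |start, m|
  puts "#{start}: #{m[0]} (hour=#{m[1]}, meridian=#{m[3]})"
end
```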

Why match doesn't work
timepattern.match(string) returns the match, so I get the times, and
it gives me a starting offset, so I can find the data. But I can't
figure out how to do a "next match", since match doesn't take an
offset, so this is of no use. This is where I really feel I must be
missing something, since it's hard to believe something so
fundamental is absent.

The Java equivalent of MatchData has a "next match" function that is
commonly used, so I don't quite understand why it's missing here.

What's wrong with post_match & slice
One can traverse the matches like this:

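def reg_split(r, string)
  while (match = r.match(string))
    # Look ahead for the next match so we know where this text ends.
    next_match = r.match(match.post_match)
    if next_match
      length = next_match.begin(0)
    else
      length = match.post_match.length
    end
    # Text between this match and the next one (or the end of the string).
    text = match.post_match.slice(0, length)
    yield(match, text)
    # Continue from the start of the next match.
    string = match.post_match.slice(length,
                                    match.post_match.length - length)
  end
end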

But each slice creates (I believe) a new string object, so you end up
copying on the order of n*n/2 characters, which is quadratic and
horrible with any large string.
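
A sketch of the same traversal that tracks offsets instead of re-slicing the remainder. It assumes Ruby 1.9 or later, where Regexp#match accepts a start position as its second argument; reg_split_by_offset is a hypothetical name, and the (:\d\d)? group is an assumed reading of the pattern:

```ruby
# Sketch (assumes Ruby 1.9+, where Regexp#match accepts a start position).
# Yields each MatchData with the text that follows it, up to the next
# match or the end of the string. Only the in-between text is copied.
def reg_split_by_offset(pattern, string)
  match = pattern.match(string)
  while match
    text_start = match.end(0)
    # Search at least one character past the match start, so an empty
    # match cannot loop forever.
    search_from = [text_start, match.begin(0) + 1].max
    next_match = pattern.match(string, search_from)
    text_end = next_match ? next_match.begin(0) : string.length
    yield match, string[text_start...text_end]
    match = next_match
  end
end

string = " blah blah 7pm something happens 8pm something else happens 9pm something different"
# The optional minutes group is an assumption about the original pattern.
timepattern = /(\d{1,2})(:\d\d)?\s?([aApP]\.?[mM]\.?)/

reg_split_by_offset(timepattern, string) do |m, text|
  puts "#{m[0]} at #{m.begin(0)}: #{text.inspect}"
end
```

Like reg_split above, this skips the text before the first match; that text is available as pre_match on the first MatchData.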

What I'd really like
If the Regexp class did a yield on matches, it would be a very nice
thing. It would be more Ruby-like, and would give people an easy way
to iterate through matches.

For example:
r = /foo/
r.match(string) { | matchdata | puts matchdata[0]}

Or even just a regex.match(string, offset)

Any suggestions?
