Tom
I'm having trouble doing something with regular expressions in Ruby
that should be simple.
All I want to do is find each successive match of a regexp, and its
offset in the string. The regexp may have multiple capture groups in it.
The obvious answers of split/scan/index/match all fail, as each of
them fails to return some necessary piece of data.
This is trivial to do in other languages, so I feel I must be missing
something.
Example:
If I have a string like " blah blah 7pm something happens 8pm
something else happens 9pm something different"
I want to use the times to split, and get the text between the times.
Why scan doesn't work
I have no trouble getting the times:
string = " blah blah 7pm something happens 8pm something else happens 9pm something different"
timepattern = /(\d{1,2})(\d\d)?\s?([aApP]\.?[mM]\.?)/
irb(main):005:0> string.scan(timepattern)
=> [["7", nil, "pm"], ["8", nil, "pm"], ["9", nil, "pm"]]
This gives me exactly what I want about the times, but no way to find
what was between the matches.
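One thing that may be worth checking here: when scan is given a block,
Ruby sets the last-match data for each match, so the offsets can be read
inside the block. A minimal sketch, assuming the second capture group in
the pattern was meant to be an optional two-digit group; the `matches`
array is just an illustrative name:

```ruby
string = " blah blah 7pm something happens 8pm something else happens 9pm something different"
# Assumption: the second group is an optional pair of digits.
timepattern = /(\d{1,2})(\d\d)?\s?([aApP]\.?[mM]\.?)/

matches = []
string.scan(timepattern) do
  m = Regexp.last_match                     # MatchData for the match just found
  matches << [m[0], m.begin(0), m.end(0)]   # matched text, start offset, end offset
end

matches.each { |text, start, stop| puts "#{text} at #{start}...#{stop}" }
```

With the start and end offsets in hand, the text between two matches is
the slice from one match's end to the next match's start.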
Why split doesn't work
If I use split, I can get everything, but in a format that is useless
to me (and to anybody, I'd guess).
irb(main):006:0> string.split(timepattern)
=> [" blah blah ", "7", "pm", " something happens ", "8", "pm", "
something else happens ", "9", "pm", " something different"]
This gives me everything mixed together, and since capture groups that
did not match are simply left out, you can't figure out which elements
belong to a regexp match and which are the text between matches.
Why index() doesn't work
Using string.index(timepattern) allows me to walk through the string
by passing the offset, but it returns only the position, not the match,
so I can get the data, but no times.
Why match doesn't work
timepattern.match(string) returns the match, so I get the times, and
I get a starting offset, so I can find the data. But I can't figure
out how to do a "next match", since match doesn't take an offset, so
this is of no use. This is where I really feel I must be missing
something, since it's hard to believe something so fundamental is missing.
The Java equivalent of MatchData has a next-match function that is
commonly used, so I don't quite understand why Ruby lacks one.
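For comparison, a sketch of what a "next match" loop can look like,
assuming Ruby 1.9 or later, where Regexp#match accepts an optional start
position as a second argument (older versions do not have it). The
pattern again assumes an optional two-digit second group, and `found` is
just an illustrative name:

```ruby
string = " blah blah 7pm something happens 8pm something else happens 9pm something different"
# Assumption: the second group is an optional pair of digits.
timepattern = /(\d{1,2})(\d\d)?\s?([aApP]\.?[mM]\.?)/

found = []
pos = 0
while (m = timepattern.match(string, pos))
  found << [m[0], m.begin(0)]   # offsets are relative to the whole string
  # Resume after this match; step one extra character if the match was empty.
  pos = m.end(0) > m.begin(0) ? m.end(0) : m.end(0) + 1
end

found.each { |text, offset| puts "#{text} at #{offset}" }
```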
What's wrong with post_match & slice
One can traverse the matches like this:
def reg_split(r, string)
  while (match = r.match(string))
    rest = match.post_match
    # The text runs up to the next match, or to the end of the string.
    next_match = r.match(rest)
    length = next_match ? next_match.begin(0) : rest.length
    text = rest.slice(0, length)
    yield(match, text)
    # Continue from the start of the next match.
    string = rest.slice(length, rest.length - length)
  end
end
But each slice is creating (I believe) a new string object, so the
total work grows like n*n/2, i.e. quadratically with the string length.
That is horrible with any large string.
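A sketch of the same traversal that keeps the original string and only
tracks offsets, so the remainder is never re-sliced. It assumes Ruby 1.9
or later for the two-argument Regexp#match; the helper name
each_match_with_text is hypothetical, and the pattern assumes an optional
two-digit second group:

```ruby
# Hypothetical helper: yields each MatchData together with the text that
# follows it, up to the next match or the end of the string.
def each_match_with_text(regexp, string)
  return enum_for(:each_match_with_text, regexp, string) unless block_given?

  match = regexp.match(string, 0)
  while match
    # Search for the next match from the end of this one
    # (one character further if this match was empty).
    start = match.end(0) > match.begin(0) ? match.end(0) : match.end(0) + 1
    following = regexp.match(string, start)
    stop = following ? following.begin(0) : string.length
    yield match, string[match.end(0)...stop]
    match = following
  end
end

string = " blah blah 7pm something happens 8pm something else happens 9pm something different"
# Assumption: the second group is an optional pair of digits.
timepattern = /(\d{1,2})(\d\d)?\s?([aApP]\.?[mM]\.?)/

pairs = each_match_with_text(timepattern, string).map { |m, text| [m[0], text] }
pairs.each { |time, text| puts "#{time}: #{text.strip}" }
```

Each match is searched for once and only the text between matches is
copied, so the amount of copying stays proportional to the string length.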
What I'd really like
If the Regexp class yielded each match to a block, it would be a very
nice thing. It would be more Ruby-like, and would give people an easy way
to iterate through matches.
For example:
r = /foo/
r.match(string) { |matchdata| puts matchdata[0] }
Or even just a regex.match(string, offset).
Any suggestions?