regex: capture groups and term binding

Simon Mullis · Sep 28, 2007

Hi All,

Let's get down to it...

I have a long string of the form:

string = <<-EOVAR
XD 1 * 100000436 3441863 1550663 1161254 951982
XD 1 479903531056 47988002622 21360568539 18276299303 15476234490
XD 1 66934 5552 321640438 40297830 0
XD 1 0 3235 2197 10907 1631621
XD 1 15488078 210564267 574075997 2405132745 7805716381
XD 1 0 4949 0 58361 0
(goes for about 17 lines, all separated by \n)
EOVAR

I'm building a regex for this string and it's pretty straightforward.
The only requirement is to capture all the numbers for later Ruby fun:

regex = %r{XD\s1\s\*\s(\d+)\s+(\d+)\s+(\d+)\s+(\d+)\s+(\d+)\n ...etc... }mx

I would like to pare it down a bit, using term binding:

regex = %r{XD\s1\s\*\s(\d+\s+){5} ...etc...}mx

If I do this, only the last repetition of the group is captured:

pp string.scan(regex)
[["951982\n"]]

If this worked, I could shorten it much, much more: all of the lines
after the first one have exactly the same format and I need to capture
all of the variables.

mother_of_all_regexen = %r{XD\s1\s\*\s((\d+\s+){5})(XD\s1\s(\d+\s+){5}){17}}mx

or something

So,

- Can I use capture groups and term binding?
- Why am I only capturing the last term?
- Should I just stop trying to be clever and explicitly match against
all parts of the string?

The reason I want to do this as a single regex is that I've written a
framework that grabs files, monkeys around with them and then applies
a rule-set from a YAML file to create output. For each "signature" in
the YAML file one can choose a defined action (match, count, compare
etc.) which relates to a method in the main code. This allows the editor
of the YAML to add signatures etc. to their heart's desire... And more
importantly, it means that I won't have to maintain the ruleset.
(woohoo!)

Thanks in advance for any suggestions

SM

Peter Szinek · Sep 28, 2007

Hey Simon,

string = <<-EOVAR
XD 1 * 100000436 3441863 1550663 1161254 951982
XD 1 479903531056 47988002622 21360568539 18276299303 15476234490
XD 1 66934 5552 321640438 40297830 0
XD 1 0 3235 2197 10907 1631621
XD 1 15488078 210564267 574075997 2405132745 7805716381
XD 1 0 4949 0 58361 0
(goes for about 17 lines, all separated by \n)
EOVAR

Maybe I am seriously misunderstanding something, but why not just:

string.split("\n").map{|line| line.scan(/\d+/)} ?
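For reference, here is a minimal runnable sketch of that one-liner. It uses only the first three sample lines from the post, and the variable name rows is just for illustration. Because scan returns however many numbers a line contains, it should not matter whether a line holds three, four or five of them.

```ruby
# Sample data copied from the post (first three lines only).
string = <<-EOVAR
XD 1 * 100000436 3441863 1550663 1161254 951982
XD 1 479903531056 47988002622 21360568539 18276299303 15476234490
XD 1 66934 5552 321640438 40297830 0
EOVAR

# One array of number strings per input line.
rows = string.split("\n").map { |line| line.scan(/\d+/) }

rows.each { |row| p row }
```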

Cheers,
Peter
__
http://www.rubyrailways.com
http://scrubyt.org

Simon Mullis · Sep 28, 2007

Hi Peter,

This is a good idea... I wasn't clear in my original post but the
problem is that some of the lines have 3 (\d+), some 4 and some 5.
Also, there are 4 different groups of data sprinkled through a load of
log files.

Another way of slimming down the regex horror might be to write a bunch
of mini regexes and then combine them into "recipes".

So, a new method for the Regexp class (shamelessly plagiarized from this group):

class Regexp
  def +(other)
    if other.is_a?(Regexp)
      if self.options == other.options
        Regexp.new(source + other.source, options)
      else
        Regexp.new(source + other.to_s, options)
      end
    else
      Regexp.new(source + Regexp.escape(other.to_s), options)
    end
  end
end

r1 = %r{XD\s\*\s}
r2 = %r{(\d)\s(\d+)\s(\d+)\s(\d+)\s(\d+)\s(\d+)\n}mx
r3 = %r{(\d)\s(\d+)\s(\d+)\s(\d+)\s(\d+)\n}mx
r4 = %r{(\d)\s(\d+)\s(\d+)\s(\d+)\n}mx

recipe1 = r1 + r2 + r2 + r3 + r2 + r4 + r3 .... and so on
recipe2 = r1 + r2 + r4 + r4 + r3 ....
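The same recipe idea can be sketched without reopening Regexp, by joining the pattern sources directly. The names five and three and the sample text below are made up for illustration, and each piece is assumed to carry its own "XD" prefix, which differs from r1..r4 above.

```ruby
# Hypothetical building blocks: a line with five numbers, a line with three.
five  = /XD\s(\d)\s(\d+)\s(\d+)\s(\d+)\s(\d+)\s(\d+)\n/
three = /XD\s(\d)\s(\d+)\s(\d+)\s(\d+)\n/

# Join the sources in recipe order. All pieces share the same options,
# so concatenating the raw sources is safe here.
recipe = Regexp.new([five, three, five].map(&:source).join)

# Made-up input that follows the recipe: 5 numbers, 3 numbers, 5 numbers.
text = "XD 1 1 2 3 4 5\nXD 1 6 7 8\nXD 1 9 10 11 12 13\n"
captures = recipe.match(text).captures

p captures
```

Every group is written out explicitly in the joined pattern, so every number ends up in its own capture.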

In the end I've used one huge whacking great regex for each "recipe".
My main question was whether we can combine capture groups and term
binding (quantifiers). It seems that when a capture group is repeated,
the engine keeps only what the last repetition matched. Or something.
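A small sketch of that behaviour, together with one possible workaround: capture the whole repeated run in a single outer group and split it afterwards. The variable names are illustrative only, and this is just one way to do it.

```ruby
line = "XD 1 * 100000436 3441863 1550663 1161254 951982\n"

# A quantified capture group is overwritten on every repetition,
# so only the final number survives in the capture.
inner_only = /XD\s1\s\*\s(?:(\d+)\s+){5}/.match(line).captures

# Workaround: capture the complete run once, then split it on whitespace.
run = /XD\s1\s\*\s((?:\d+\s+){5})/.match(line)[1]
numbers = run.split

p inner_only
p numbers
```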

Cheers

SM

Peter Szinek · Sep 28, 2007

Simon said:
Hi Peter,

This is a good idea... I wasn't clear in my original post but the
problem is that some of the lines have 3 (\d+), some 4 and some 5.
Also, there are 4 different groups of data sprinkled through a load of
log files.

Could you please give an example of what the expected result looks like
for the above dataset? Possibly it's not my day, but I still don't get
what you are trying to accomplish.

The result of my solution was:

[["1", "100000436", "3441863", "1550663", "1161254", "951982"],
["1", "479903531056", "47988002622", "21360568539", "18276299303",
"15476234490"],
["1", "66934", "5552", "321640438", "40297830", "0"],
["1", "0", "3235", "2197", "10907", "1631621"],
["1", "15488078", "210564267", "574075997", "2405132745", "7805716381"],
["1", "0", "4949", "0", "58361", "0"]]

How does the result you are expecting differ from the above one?

Cheers,
Peter
__
http://www.rubyrailways.com
http://scrubyt.org

Simon Mullis · Sep 28, 2007

Hi Peter,

Sorry if I'm not being clear - this is more a regex question than a Ruby one.

I'll try again.

str = "100000436 3441863 1550663 1161254 951982"

re = %r{(\d+)\s(\d+)\s(\d+)\s(\d+)\s(\d+)}

Let's shorten the re using grouping and a quantifier:
re2 = %r{(\d+)(?:\s(\d+)){4}}

pp re.match(str)
#<MatchData
"100000436 3441863 1550663 1161254 951982"
"100000436"
"3441863"
"1550663"
"1161254"
"951982">

pp re2.match(str)
#<MatchData "100000436 3441863 1550663 1161254 951982" "100000436" "951982">

so, either:

1 - My re2 regex is incorrect.
2 - You cannot do this with the ruby regex engine.

From experience, I'd guess it's probably 1. ;-)
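As far as I know, a quantified group only ever holds its last repetition, so option 2 is probably closer to the truth than option 1. One way to keep the pattern short in the source while still getting one capture per number is to generate the explicit groups programmatically. A rough sketch, where count and re3 are hypothetical names:

```ruby
str = "100000436 3441863 1550663 1161254 951982"

# Hypothetical parameter: how many numbers the line is expected to hold.
count = 5

# Builds the source (\d+)\s(\d+)\s(\d+)\s(\d+)\s(\d+),
# i.e. one explicit capture group per number.
re3 = Regexp.new((['(\d+)'] * count).join('\s'))

captures = re3.match(str).captures
p captures
```

Changing count to 3 or 4 would give the shorter line formats without writing each pattern out by hand.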

Thanks!

SM

Ari Brown · Sep 29, 2007

Hi Peter,

Hi Simon and Peter

class Regexp
  def +(other)
    if other.is_a?(Regexp)
      if self.options == other.options
        Regexp.new(source + other.source, options)
      else
        Regexp.new(source + other.to_s, options)
      end
    else
      Regexp.new(source + Regexp.escape(other.to_s), options)
    end
  end
end

This is also exactly (mostly) what is implemented in the aw3s0m3
(sic) library, TextualRegexp.

And let me say, DAMN Regexp.+ makes life easier :-D

---------------------------------------------------------------|
~Ari
"I don't suffer from insanity. I enjoy every minute of it" --1337est
man alive
