Defining regexp's and variables set by them

  • Thread starter Garance A Drosehn

Garance A Drosehn

Sometimes I get in a situation where I have a case statement
with several when-clauses, each of which is a regular
expression. Some of those regular-expressions may be
rather complicated. If that case statement is going to be
processed many times, then I like to define objects via
Regexp.new, and use those pre-compiled objects in the
when clauses:

re_simple = Regexp.new('check (\d+)')
...
very_huge_file.each_line { |aline|
  case aline
  when re_simple
    check_number = $1
  when re_other
    ...

The downside to this is that the regular-expression is now
defined somewhere "far away" from the when-clause that
uses it. After some testing I may need to make changes to
the regexp, such as:

re_simple = Regexp.new('(check|test) (\d+)')

The thing is, by changing 'check' to '(check|test)', I have to
remember that the when clause also needs to change from
referencing $1 to referencing $2. Note that in this case I do
not *care* whether it matched 'check' as opposed to 'test'.
Either word is acceptable to me, so I do not need the actual
value of $1 for anything.

I was kinda wondering if it would make sense for ruby to
support something like:

re_simple = Regexp.new('check (\d+)') { cnum = $1 }

so I could then have the when clause say:

when re_simple
  check_number = cnum

I realize this is a trivial example, but as the expressions get
more involved, and the case-statement has many when's, it
would be nice if I could have the compiled regular-expression
set values on variable names that *I* pick, in addition to the
standard values in a MatchData object.

Another thing that this might let me do, is something like:

when re_simple, re_other, re_yetmore
  check_number = cnum

where the different regular-expressions may find 'cnum' in
different positions in the string ($1 vs $2 vs $3), and yet
they could all be processed in the same when-clause.

Or is there already some way to do this?

-- 
Garance Alistair Drosehn = (e-mail address removed)
Senior Systems Programmer or (e-mail address removed)
Rensselaer Polytechnic Institute; Troy, NY; USA
 

daz

Garance said:
> The thing is, by changing 'check' to '(check|test)', I have to
> remember that the when clause also needs to change from
> referencing $1 to referencing $2. Note that in this case I do
> not *care* whether it matched 'check' as opposed to 'test'.
> Either word is acceptable to me, so I do not need the actual
> value of $1 for anything.

Hi Garance,

You can prevent the grouped expression from creating a
back-reference by using the ?: extension ...


s = 'start: check 41 is less than ...'

re_simple = /check (\d+)/
s =~ re_simple
p [$1, $2] #-> ["41", nil]

re_simple = /(?:check|test) (\d+)/
s =~ re_simple
p [$1, $2] #-> ["41", nil]
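As a minimal sketch of how this plays out in the case statement from the original question (the constant RE_SIMPLE and the helper check_number are names made up for this illustration), the when clause can keep using $1 after the alternation is added, because (?:...) groups without capturing:

```ruby
# Hypothetical pattern and helper, for illustration only.
RE_SIMPLE = /(?:check|test) (\d+)/   # (?:...) groups without creating $1

def check_number(line)
  case line
  when RE_SIMPLE
    $1.to_i   # the digits are still group 1
  end
end

p check_number("start: test 41 is less than ...")
p check_number("no number here")
```

The case statement returns nil when no when clause matches, so the helper returns nil for lines that contain neither word.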


HTH,

daz
 

Caleb Clausen

> If that case statement is going to be
> processed many times, then I like to define objects via
> Regexp.new, and use those pre-compiled objects in the
> when clauses:
>
> re_simple = Regexp.new('check (\d+)')
> ...
> very_huge_file.each_line { |aline|
>   case aline
>   when re_simple
>     check_number = $1
>   when re_other
>     ...

Why? I don't think you have to do this to avoid recompiling each time.
Ruby should compile it once when the program is first parsed, and then
recompiles are not needed (unless your regexp has an interpolation).

Just so long as you do this:

when /some_rex/

and not this:

when Regexp.new("some_rex")
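One way to observe the difference, as a rough sketch (the method names literal_rx and rebuilt_rx are invented here, and the result is what I would expect on current MRI rather than something the language guarantees): a literal without interpolation hands back the same Regexp object every time the line runs, while Regexp.new builds a fresh object on each call.

```ruby
# Hypothetical helpers, for illustration only.
def literal_rx
  /some_rex/              # literal, no interpolation
end

def rebuilt_rx
  Regexp.new("some_rex")  # constructed again on every call
end

same_literal = literal_rx.equal?(literal_rx)
same_rebuilt = rebuilt_rx.equal?(rebuilt_rx)

puts "literal reused:    #{same_literal}"
puts "Regexp.new reused: #{same_rebuilt}"
```

Object identity is only a proxy for "not recompiled", but it is an easy check to run.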

> I was kinda wondering if it would make sense for ruby to
> support something like:
>
> re_simple = Regexp.new('check (\d+)') { cnum = $1 }

Ah! a man after my own heart. I think this would be just lovely. It's
never quite so simple, tho. What about, eg re_simple.to_s?

You might want to look at my Reg library/pattern matching language. If
I ever finish the required features, it will support things like what
you have above. Not quite the same syntax, it'd look more like:

re_simple=/check (\d+)/>>BR[1]

and then

case str
when re_simple: check_number=str

Ok, that probably makes no sense to anyone but me yet.
 

Garance A Drosehn

> You can prevent the grouped expression from creating a
> back-reference by using the ?: extension ...

Ah. That's one of those things which didn't sink in when I
first read about it, since I didn't need it at the time. I wish I
had paid better attention to it! Thanks.


Garance A Drosehn

> Why? I don't think you have to do this to avoid recompiling
> each time. Ruby should compile it once when the program
> is first parsed, and then recompiles are not needed (unless
> your regexp has an interpolation).

A few of the regexp's are based on global options, which is
to say a regexp would be constant for any one run of the
program, but it is built from the value of other variables. I
don't do that very often, but sometimes I do.

I did have the impression that a regexp was compiled only
for the context of the method it was in. So, while I would
expect that it would be compiled only once in the simple
example that I gave, back in my real-world ruby script the
regexp is in a method which is called many times. So I
store the regexp's as class objects (@@rx_simple). It
would not surprise me if I had the wrong idea there. Does
ruby keep the compiled-code for a method after the method
is finished?

But the main reason I like to split things up is that the
regexp's involved in my real-world example are rather
complicated. I'd like to have one section of code which
defines the regexp's, and comments why they are the
way they are. And then a separate section of code which
just says "When you *do* match rx_whatever, then this
is what you should do with the line".

> Ah! a man after my own heart. I think this would be just lovely. It's
> never quite so simple, tho. What about, eg re_simple.to_s?

What about it? My program isn't doing to_s on any regexp's
which it defines, so I don't understand the significance of your
question...


Garance A Drosehn

> The downside to this is that the regular-expression is now
> defined somewhere "far away" from the when-clause that
> uses it. After some testing I may need to make changes to
> the regexp, such as:
>
> re_simple = Regexp.new('(check|test) (\d+)')
>
> The thing is, by changing 'check' to '(check|test)', I have to
> remember that the when clause also needs to change from
> referencing $1 to referencing $2. [...]
>
> I was kinda wondering if it would make sense for ruby to
> support something like:
>
> re_simple = Regexp.new('check (\d+)') { cnum = $1 }
>
> so I could then have the when clause say:
>
> when re_simple
>   check_number = cnum
>
> I realize this is a trivial example, but as the expressions get
> more involved, and the case-statement has many when's, it
> would be nice if I could have the compiled regular-expression
> set values on variable names that *I* pick, in addition to the
> standard values in a MatchData object.

I thought about this some more after going home and getting
some sleep... One obvious question is what would be the
scope of the commands inside the { ...code-fragment...}. It
also occurred to me that I sometimes make a match, and
then I pass around the resulting MatchData object to other
methods, and *they* do things based on info in MatchData.

So, I came up with this idea:

Allow MatchData to include some user-settable value,
which would initially be set to 'nil' at the time of the match.
And then support:

re_simple = Regexp.new('check (\d+)') { |mdata|
  mdata.userdata = mdata[1]
}

or:

re_simple = Regexp.new('check (\d+)') { |mdata|
  mdata.userdata = Hash.new
  mdata.userdata["cnum"] = mdata[1]
  mdata.userdata["otherval"] = mdata[7]
}

That way, all the variables that the user is setting will
be tied to the appropriate MatchData object.

I almost think I could implement this by creating my own
subclasses for Regexp and MatchData...


Brian Schröder

You could define it like this:

bschroed@black:~/svn/projekte/ruby-things$ cat regexp_data.rb
class DataRegexp < Regexp
  def initialize(regexp, &block)
    @block = block
    @userdata = {}
    super(regexp)
  end

  def match(str)
    result = super(str)
    # No match: return nil here, otherwise "class << result" would
    # reopen NilClass and define userdata on nil itself.
    return nil unless result
    # Give this one MatchData object a lazily created userdata hash.
    class << result
      def userdata
        @userdata ||= {}
      end
    end
    @block[result] if @block
    result
  end
end

re_simple = DataRegexp.new('check (\d+)') { |mdata|
  mdata.userdata[:check_number] = mdata[1].to_i if mdata
}

if match = re_simple.match("Something")
  puts "Something matched"
end

if match = re_simple.match("check 12")
  puts "Checking #{match.userdata[:check_number]}"
end
bschroed@black:~/svn/projekte/ruby-things$ ruby regexp_data.rb
Checking 12
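For comparison, Ruby 1.9 and later support named groups, which cover the "names that I pick" part of the request without a subclass. A minimal sketch, assuming such a Ruby version (the patterns, the group names cnum and who, and the helper check_number are invented for this example):

```ruby
# Hypothetical patterns: cnum sits at a different position in each one.
RE_SIMPLE = /(?:check|test) (?<cnum>\d+)/
RE_OTHER  = /(?<who>\w+) ran item (?<cnum>\d+)/

def check_number(line)
  case line
  when RE_SIMPLE, RE_OTHER
    # $~ is the MatchData left behind by whichever pattern matched.
    $~[:cnum].to_i
  end
end

p check_number("test 12")
p check_number("bob ran item 99")
```

Because the when clause looks the value up by name, the two patterns can share one branch even though the number is the first group in one and the second group in the other.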

regards,

Brian

-- 
http://ruby.brian-schroeder.de/

Stringed instrument chords: http://chordlist.brian-schroeder.de/
 

Caleb Clausen

> A few of the regexp's are based on global options, which is
> to say a regexp would be constant for any one run of the
> program, but it is built from the value of other variables. I
> don't do that very often, but sometimes I do.

This is what the /o regexp option is for; it forces the regexp to be
compiled only once, even if it has interpolations.
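A small sketch of what that means in practice (option_rx and its argument values are made up for the illustration): with /o the interpolation is evaluated the first time the literal is reached, and later calls get that first result back even when the interpolated value has changed.

```ruby
# Hypothetical helper: builds a word-boundary pattern from an option value.
def option_rx(word)
  /\b#{word}\b/o   # /o: interpolate once, then reuse the result
end

first  = option_rx("check")
second = option_rx("test")   # "test" is ignored, the first regexp is reused

puts first.source
puts second.source
```

That suits the case described in the quote, where the value is fixed for one run of the program, but it would be a trap if the value could change between calls.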

> Does ruby keep the compiled-code for a method after the method
> is finished?

Uhhh, it's nowhere near that fancy. Ruby is a fairly traditional
interpreter, without even bytecode compilation. That's why it's so
slow.

> But the main reason I like to split things up is that the
> regexp's involved in my real-world example are rather
> complicated. I'd like to have one section of code which
> defines the regexp's, and comments why they are the
> way they are.

But you were just saying you don't like the regexp distant from its use...

> What about it? My program isn't doing to_s on any regexp's
> which it defines, so I don't understand the significance of your
> question...

Maybe not, but if it's to be a general solution, you need to handle
all this stuff. If you're just going to use it in your own program,
then that's fine but I thought you were talking about something more
general-purpose.

You may be calling Regexp#to_s without knowing it if you do
something like this:

rex1=/bar/
rex2=/foo#{rex1}baz/

The interpolation calls to_s.
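To make that visible, here is a short sketch (the /i flag is added only so that the effect of to_s shows up in the output): Regexp#to_s wraps the pattern together with its option flags, and that string is all the outer literal ever sees, which is presumably why anything extra attached to the inner Regexp object would not carry over.

```ruby
rex1 = /bar/i
rex2 = /foo#{rex1}baz/   # interpolation inserts rex1.to_s

puts rex1.to_s           # (?i-mx:bar)
puts rex2.source         # foo(?i-mx:bar)baz

p rex2 =~ "fooBARbaz"    # the /i flag still applies to the inner part
p rex2 =~ "FOObarbaz"    # but not to the rest of rex2
```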
 

Garance A Drosehn

> Uhhh, it's nowhere near that fancy. Ruby is a fairly traditional
> interpreter, without even bytecode compilation. That's why it's
> so slow.

That is what I expected. So I'm back to wishing to have one
method which creates all the Regexp.new's as @@variables,
and then I can reference those compiled Regexp's in other
methods for that class.

> But you were just saying you don't like the regexp distant
> from its use...

Almost. I was saying that I *wanted* to have them distant,
because the result is more readable (for what I'm doing, IMO),
and for the efficiency benefit. This may seem weird, but most
of the regexp's that I'm talking about are three or four full lines
long, complete with a few regexp tricks that take a few more
lines of comments to explain what the regexp is doing. The
case-statement is *much* more readable if the regexp's are
separated from the case statement.

But there is a downside from doing that, so I am looking
for ideas on how I might eliminate that downside.

> Maybe not, but if it's to be a general solution, you need to handle
> all this stuff. If you're just going to use it in your own program,
> then that's fine but I thought you were talking about something
> more general-purpose.

The more general the solution, the better! :)

> You may be calling Regexp#to_s without knowing it if you
> do something like this:
>
> rex1=/bar/
> rex2=/foo#{rex1}baz/
>
> The interpolation calls to_s.

Ah. I don't do that much, but I can understand why that would
be important. Your reply and the other replies in this thread
have given me quite a few good suggestions to think about.
Very instructive. Thanks!


Robert Klemme

Garance A Drosehn said:
> That is what I expected. So I'm back to wishing to have one
> method which creates all the Regexp.new's as @@variables,
> and then I can reference those compiled Regexp's in other
> methods for that class.

Performance does not differ much between a regexp in place and a regexp
compiled once and stored in a variable or constant (assuming no
interpolation is used or interpolation with "o" is used - otherwise both
scenarios have different semantics anyway and can't be compared). Ruby
*has* been optimized to make in place regexps efficient - there is no
recompilation of the regexp on every pass.

You can try it out with the attached script. Over several invocations,
sometimes one of the two is faster and sometimes the other:

               user     system      total        real
direct     0.312000   0.000000   0.312000 (  0.305000)
compiled   0.313000   0.000000   0.313000 (  0.306000)

               user     system      total        real
direct     0.313000   0.000000   0.313000 (  0.319000)
compiled   0.297000   0.000000   0.297000 (  0.305000)

> Almost. I was saying that I *wanted* to have them distant,
> because the result is more readable (for what I'm doing, IMO),
> and for the efficiency benefit.

As I said there is no such thing as an efficiency benefit in using "remote"
regexps.

> This may seem weird, but most
> of the regexp's that I'm talking about are three or four full lines
> long, complete with a few regexp tricks that take a few more
> lines of comments to explain what the regexp is doing. The
> case-statement is *much* more readable if the regexp's are
> separated from the case statement.

I'd stick with the readability argument and forget about the performance
here. The question is, does the code become more readable by moving the
regexps out of the case statement? I don't know your code but I'd say it's
not automatically so.

Kind regards

robert
 

Robert Klemme

Sorry, I forgot the attachment. Here's the script:

robert


require 'benchmark'

REPEAT = 10000

RX = /foo/

TEXT = <<EOS
akdnhkaj dahdk ahda da#dada
da
dopakdjalkjdlak djadklasd
adasklfoodköasjhdjkasdha
dadjkadjkashdjkasd#
aajdhkasjdjkfooashd
aldaksjhdjasd
EOS

Benchmark.bmbm 10 do |b|
  b.report "direct" do
    REPEAT.times { TEXT.scan(/foo/o) {|m| m + "x"} }
  end

  b.report "compiled" do
    REPEAT.times { TEXT.scan(RX) {|m| m + "x"} }
  end
end
 

Garance A Drosehn

> As I said there is no such thing as an efficiency benefit in using
> "remote" regexps.

Ah, okay. I guess I was reading too much into the word "compiled",
such that I thought it would be significantly faster.

> I'd stick with the readability argument and forget about the performance
> here. The question is, does the code become more readable by moving
> the regexps out of the case statement? I don't know your code but I'd
> say it's not automatically so.

In the script that I am working on right now, it is definitely more
readable. But in most scripts I write, the readability is probably
about the same either way. It wouldn't surprise me if readability
was usually better with regexp's in the case statement that uses
them, especially if they are all single-line regexp's.

I first wrote this script with the regexp's in place, and that was
getting too messy (IMO). So I've now redone them with the
regexp's separate. I might do some performance comparison
of the two versions once I'm done. But I doubt that will be very
accurate, because I am changing so many other things at the
same time.
