DRYing a Regex

RichardOnRails · Nov 12, 2009

I've got a routine that works fine at building an array of upper-case
strings extracted from a string:

require 'strscan'

aNewList = []
s = StringScanner.new sNewList
upper = /[A-Z]+/
not_upper = Regexp.new( upper.source.sub( /\[/, '[^' ) )
while not s.eos?
  case
  when s.skip(upper); aNewList << s.matched
  else s.skip(not_upper)
  end
end

But the not_upper Regexp definition is really a kludge. It somewhat
camouflages what is really /[^A-Z]+/

I'd like to DRY it by expressing it as something like !upper. I need
something like !~ we use normally with string searches.

Any ideas?

Caleb Clausen · Nov 12, 2009

upper = /[A-Z]+/
not_upper= Regexp.new( upper.source.sub( /\[/, '[^' ) ) [snip]
But the not_upper Regexp definition is really a kludge. It somewhat
camouflages what is really /[^A-Z]+/

I'd like to DRY it by expressing it as something like !upper. I need
something like !~ we use normally with string searches.

This is somewhat better, but still not real obvious:
not_upper=/(?:(?!#{upper}).)+/ #untested, tho

Myself, I'd just write not_upper=/[^A-Z]+/ ... for something this
short, is it really worth trying all that hard to be DRY?
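
For anyone who wants to try the lookahead idea, here is a minimal sketch. It assumes the negative lookahead is placed before the dot, so the character about to be consumed is the one that gets tested. The constant names and the split_runs helper are made up for illustration and are not from the original post.

```ruby
# Sketch only: derive "anything that is not the start of an upper-case run"
# from the original pattern, so the character class is written once.
UPPER = /[A-Z]+/

# (?!...) is checked at each position before the dot consumes a character,
# so the match stops exactly where an UPPER match could begin.
# The /m flag lets the dot cross newlines.
NOT_UPPER = /(?:(?!#{UPPER}).)+/m

# Hypothetical helper: split a string into alternating runs.
def split_runs(text)
  text.scan(/#{UPPER}|#{NOT_UPPER}/)
end

p split_runs("ab CD,e\nFG")
```

This works for any pattern, not just a character class, which is the one thing the source.sub trick cannot do.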

James Edward Gray II · Nov 12, 2009

I've got a routine that works fine at building an array of upper-case
strings extracted from a string:

aNewList = []
s = StringScanner.new sNewList
upper = /[A-Z]+/
not_upper= Regexp.new( upper.source.sub( /\[/, '[^' ) )
while not s.eos?
case
when s.skip(upper); aNewList << s.matched
else s.skip(not_upper)
end
end

But the not_upper Regexp definition is really a kludge. It somewhat
camouflages what is really /[^A-Z]+/

I'd like to DRY it by expressing it as something like !upper. I need
something like !~ we use normally with string searches.

Any ideas?

Well, you don't really need a StringScanner for this simple task. Your
code really just rebuilds String#scan():

a_new_list = s_new_list.scan(/[A-Z]+/)

Note that I've also switched your variable naming style to the
snake_case that we Rubyists prefer.

Hope that helps.

James Edward Gray II

Caleb Clausen · Nov 13, 2009

I've got a routine that works fine at building an array of upper-case
strings extracted from a string:

aNewList = []
s = StringScanner.new sNewList
upper = /[A-Z]+/
not_upper= Regexp.new( upper.source.sub( /\[/, '[^' ) )
while not s.eos?
case
when s.skip(upper); aNewList << s.matched
else s.skip(not_upper)
end
end

OTOH, you can rewrite it like this, and not have to even mention the
complement of the match you're interested in:

aNewList = []
s = StringScanner.new sNewList
upper = /[A-Z]+/
aNewList<< s.matched while s.skip_until(upper)

(Not tested real thoroughly, corner cases may break.)
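
A quick way to check the corner cases is to run the original loop, the skip_until variant, and String#scan side by side. This is only a sketch: the method names and the sample strings are made up for the comparison.

```ruby
require 'strscan'

UPPER = /[A-Z]+/

# Original approach: alternate between skipping matches and non-matches.
def with_skip(text)
  found = []
  s = StringScanner.new(text)
  not_upper = /[^A-Z]+/
  until s.eos?
    if s.skip(UPPER)
      found << s.matched
    else
      s.skip(not_upper)
    end
  end
  found
end

# skip_until variant: jump to the end of the next match, collect it,
# and stop when skip_until returns nil (no further match).
def with_skip_until(text)
  found = []
  s = StringScanner.new(text)
  found << s.matched while s.skip_until(UPPER)
  found
end

["TMxxx CSCO, fdx AA", "", "no capitals here", "ENDS WITH X"].each do |text|
  p [with_skip(text), with_skip_until(text), text.scan(UPPER)]
end
```

StringScanner#matched returns only the part that matched the pattern, not the skipped prefix, which is why the one-liner collects the same strings as the longer loop.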

Caleb Clausen · Nov 13, 2009

Well, you don't really need a StringScanner for this simple task. Your code
really just rebuilds String#scan():

a_new_list = s_new_list.scan(/[A-Z]+/)

ooh! that's even better.

RichardOnRails · Nov 13, 2009

Well, you don't really need a StringScanner for this simple task. Your code
really just rebuilds String#scan():

a_new_list = s_new_list.scan(/[A-Z]+/)

ooh! that's even better.

You're right. I didn't NEED to DRY that simple thing. I'm just
trying to improve my coding generally, especially to write things that
don't break easily when the inevitable changes are made.

But cutting out 90% of the code, wow! That's DRY!!

Thank you very much for your ideas. I haven't tested it yet, but it
looks right to me.

Best wishes,
Richard

RichardOnRails · Nov 13, 2009

I've got a routine that works fine at building an array of upper-case
strings extracted from a string:

aNewList = []
s = StringScanner.new sNewList
upper = /[A-Z]+/
not_upper= Regexp.new( upper.source.sub( /\[/, '[^' ) )
while not s.eos?
case
when s.skip(upper); aNewList << s.matched
else s.skip(not_upper)
end
end

But the not_upper Regexp definition is really a kludge. It somewhat
camouflages what is really /[^A-Z]+/

I'd like to DRY it by expressing it as something like !upper. I need
something like !~ we use normally with string searches.

Any ideas?

Well, you don't really need a StringScanner for this simple task. Your code really just rebuilds String#scan():

a_new_list = s_new_list.scan(/[A-Z]+/)

Note that I've also switched your variable naming style to the snake_case that we Rubyists prefer.

Hope that helps.

James Edward Gray II

Hi James,

As I said to Caleb, cutting my 10-liner down to 1 is extreme DRYing!!
Thanks for that.

As far as underscoring vs. Camel-case goes, I know Rubyists'
preference, but I bow to Shakespeare's notion that "a rose by any
other name is just as sweet." I spent a couple decades writing/
maintaining Windows applications for clients using C and C++, so I've
a fondness for Polish notation (at least that's what I think it was
called.) Typing extra hyphens vs pressing the shift key lets me write
code faster, and the a/s/h prefix for arrays/strings/hashes helps me
avoid a lot of interpreter complaints. And fellow programmers of
almost any stripe know what I mean. Finally, I'm a retired curmudgeon,
and you know how we old folks are.

Seriously, your insight was very helpful and will help me avoid a
bunch of wasteful code.

Best wishes,
Richard

RichardOnRails · Nov 13, 2009

Hey Caleb & James,

With your insights, I was able to cut down 18 lines of somewhat
obscure code to 6 lines that I find very readable. That's such an
improvement on the quality of the code.

Though I expect you guys are tired of this thread, I included the new
and old code below, along with results that both of them produce.

Again, thank you very much for your insights.

Best wishes,
Richard

# Accept a new list as a string; extract an array of contiguous upper-case
# letters as stock symbols, ignoring any duplicates (Test data)
# Delete any symbol in the current list that occurs here
sNewList = %{TMxxx CSCO COL INTC BRCM FDX AA CAT BUR FSLR MSFT',
PNC HPQ CSCO AMAT ORCL FCX ABX PVTB XHB CSCO TM FDX}

#===============
# New technique
#===============
aRawNewList = sNewList.scan(/[A-Z]+/)
aNewList = Set.new(aRawNewList ).to_a.sort
nDeleted = 0
aNewList.each { |sym| hCurrentList.delete sym and nDeleted += 1 if hCurrentList[sym] }
show_array( aNewList, 10, "New List (unique:%d, dups:%d, deleted:%s)" %
  [aNewList.size, aRawNewList.size - aNewList.size, nDeleted] , true)

#============================
# Old technique; No longer used
#============================
aNewList = []
s = StringScanner.new sNewList
upper = /[A-Z]+/
non_upper= Regexp.new( upper.source.sub( /\[/, '[^' ) )
nNewSyms = nCurrSymsDeleted = 0
while not s.eos?
  case
  when s.skip(upper)
    nNewSyms+=1
    aNewList << s.matched unless aNewList.include? s.matched
    ( hCurrentList.delete s.matched and nCurrSymsDeleted += 1) if hCurrentList[s.matched]
  else
    s.skip(non_upper)
  end
end
show_array( aNewList.sort, 10, "New List (%d unique; %d dups; %d curr. deleted)" %
  [aNewList.size, nNewSyms - aNewList.size, nCurrSymsDeleted] )

#=======
# Output
#=======
===== New List (unique:19, dups:4, deleted:3) =====
AA ABX AMAT BRCM BUR CAT COL CSCO FCX FDX
FSLR HPQ INTC MSFT ORCL PNC PVTB TM XHB
===== =====

Robert Klemme · Nov 13, 2009

2009/11/13 RichardOnRails said:
As far as underscoring vs. Camel-case goes, I know Rubyists'
preference, but I bow to Shakespeare's notion that "a rose by any
other name is just as sweet." I spent a couple decades writing/
maintaining Windows applications for clients using C and C++, so I've
a fondness for Polish notation (at least that's what I think it was
called.)

I believe you mean Hungarian Notation:
http://en.wikipedia.org/wiki/Polish_notation

Typing extra hyphens vs pressing the shift key lets me write
code faster, and the a/s/h prefix for arrays/strings/hashes helps me
avoid a lot of interpreter complaints. And fellow programmers of
almost any stripe know what I mean.

There's always something to be said for conventions. The issue with
your notation is that it seems to be far less used among Ruby
programmers than the snake case. Snake case for variables and methods
also has the added advantage that classes and modules stand out
immediately.

Side note: with modern IDEs I believe there is not much reason to use
Hungarian Notation any more. I personally find it more difficult to
spot certain variables when all variables of the same type start with
the same letter. For me, HN actually _reduces_ readability.

Finally, I'm a retired curmudgeon,
and you know how we old folks are

LOL

Cheers

robert

--
remember.guy do |as, often| as.you_can - without end
http://blog.rubybestpractices.com/

Mark Thomas · Nov 13, 2009

Hey Caleb & James,

With your insights, I was able to cut down 18 lines of somewhat
obscure code to 6 lines that I find very readable. That's such an
improvement on the quality of the code.

I believe you can go further. For example, these three lines:

sNewList = %{TMxxx CSCO COL INTC BRCM FDX AA CAT BUR FSLR MSFT',
PNC HPQ CSCO AMAT ORCL FCX ABX PVTB XHB CSCO TM FDX}
aRawNewList = sNewList.scan(/[A-Z]+/)
aNewList = Set.new(aRawNewList ).to_a.sort

can be replaced by one:

aNewList = %W{TMxxx CSCO COL INTC BRCM FDX AA CAT BUR FSLR MSFT',
PNC HPQ CSCO AMAT ORCL FCX ABX PVTB XHB CSCO TM FDX}

(you can add .sort to the end but I don't think you need it)

also, consider something like this:

hCurrentList.delete_if { |key,v| aNewList.include?(key) }

-- Mark.

Robert Klemme · Nov 13, 2009

2009/11/13 Mark Thomas said:
Hey Caleb & James,

With your insights, I was able to cut down 18 lines of somewhat
obscure code to 6 lines that I find very readable. That's such an
improvement on the quality of the code.

I believe you can go further. For example, these three lines:

sNewList = %{TMxxx CSCO COL INTC BRCM FDX AA CAT BUR FSLR MSFT',
PNC HPQ CSCO AMAT ORCL FCX ABX PVTB XHB CSCO TM FDX}
aRawNewList = sNewList.scan(/[A-Z]+/)
aNewList = Set.new(aRawNewList ).to_a.sort

can be replaced by one:

aNewList = %W{TMxxx CSCO COL INTC BRCM FDX AA CAT BUR FSLR MSFT',
PNC HPQ CSCO AMAT ORCL FCX ABX PVTB XHB CSCO TM FDX}

I don't think so, because that appears to be input from the outside
which is provided as a single String.
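
To make the difference concrete, here is a small sketch with a shortened, made-up sample. %W only applies to a literal written in the source file, and it keeps every non-whitespace character, while scan works on any String at run time and keeps only the upper-case runs.

```ruby
# Hypothetical sample input, shortened from the thread's test data.
INPUT = "TMxxx CSCO MSFT',\n  PNC CSCO"

# %W splits a literal on whitespace when the file is parsed.
# Nothing is filtered: the "xxx" and the "'," stay attached.
LITERAL_WORDS = %W{TMxxx CSCO MSFT', PNC CSCO}

# scan runs against whatever string arrives at run time
# and returns only the runs of upper-case letters.
SCANNED = INPUT.scan(/[A-Z]+/)

p LITERAL_WORDS
p SCANNED
p INPUT.split == LITERAL_WORDS   # %W behaves like a whitespace split
```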

(you can add .sort to the end but I don't think you need it)

also, consider something like this:

hCurrentList.delete_if { |key,v| aNewList.include?(key) }

Basically the question is which of the two is larger. But if you do
it this way round (i.e. iterate the Hash and check for existence in
the new list), then that list should definitely be a Set.
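
A minimal sketch of that direction, iterating the Hash and testing membership against a Set. The prune method name and the sample data are made up for illustration.

```ruby
require 'set'

# Remove every key of `current` that appears in `symbols`.
# Converting to a Set first makes each include? a hash lookup
# instead of a linear search through an Array.
def prune(current, symbols)
  wanted = symbols.to_set
  current.delete_if { |key, _value| wanted.include?(key) }
end

current = { "CSCO" => 1, "FOO" => 99 }
prune(current, %w[CSCO TM CSCO])
p current
```

With a handful of symbols the difference is negligible; it starts to matter when both the Hash and the list are large.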

Here's my suggestion

require 'set'

# dummy base
current = {"CSCO" => 1, "COL" => 2, "INTC" => 3, "BRCM" => 4, "FOO" => 99}

# user input
input = %{TMxxx CSCO COL INTC BRCM FDX AA CAT BUR FSLR MSFT
PNC HPQ CSCO AMAT ORCL FCX ABX PVTB XHB CSCO TM FDX}

# algorithm
symbols = input.scan(/[A-Z]+/)
deduped = symbols.to_set

old_size = current.size
deduped.each {|sym| current.delete sym}

p(deduped.sort,
  10,
  sprintf("New List (unique:%d, dups:%d, deleted:%s)",
          deduped.size,
          symbols.size - deduped.size,
          old_size - current.size),
  true)

p current

Cheers

robert

--
remember.guy do |as, often| as.you_can - without end
http://blog.rubybestpractices.com/

Marnen Laibow-Koser · Nov 14, 2009

RichardOnRails wrote:
[...]

As far as underscoring vs. Camel-case goes, I know Rubyists'
preference, but I bow to Shakespeare's notion that "a rose by any
other name is just as sweet."

It doesn't work that way in programming. Good naming practices are an
important part of readable code. This is particularly so in a language
like Ruby, in which "literate" interfaces are common.

I spent a couple decades writing/
maintaining Windows applications for clients using C and C++, so I've
a fondness for Polish notation (at least that's what I think it was
called.)

Polish Notation is Łukasiewicz-style prefix notation, rather like what's
used in Lisp. You mean Hungarian Notation.

But in any case, *you've been had*. Hungarian Notation as developed by
Charles Simonyi is extremely useful in non-OO code (I've used it in PHP
with great success). Hungarian Notation as the term is usually
understood is a very stupid thing indeed, which has unfortunately been
foisted by Microsoft on huge numbers of Windows programmers who really
should know better.

It is (at best) marginally useful in statically
typed languages like C, and downright misleading in dynamically typed
languages like Ruby.

The difference is that Simonyi's original concept encodes information
*outside the scope* of the variable's type (which, after all, the
interpreter or compiler already knows about). For example, in a mapping
system, you might have kmDistance and ftCorrection. It's entirely clear
from those names that kmDistance + ftCorrection would be adding
kilometers and feet without a conversion, and thus it's immediately
clear that that operation is wrong.

OTOH, legions of misled Windows developers would simply call those two
variables intDistance and intCorrection, incorporating no new useful
information and making the names harder to read.

For more on the misuse of Hungarian Notation, please see
http://www.joelonsoftware.com/articles/Wrong.html (Simonyi's original is
there called Apps Hungarian, while the popular perversion is called
Systems Hungarian). There's also some interesting discussion at
http://c2.com/cgi/wiki?HungarianNotation , if you can wade through the
disorganization.

Systems Hungarian, BTW, is bad enough in C, where you should be able to
refer to your variable declarations. If your functions are so long that
you can't refer easily to declarations, then you need to refactor to
shorter methods for overall readability anyway -- methods should be
short. Systems Hungarian has no use at all in Ruby, since although
objects are typed, variables are not, so it's perfectly possible to do
intValue = 1
# later
intValue = {:foo => 'bar'}

Even Apps Hungarian is not a great idea in OO code. Instead, just use
the type system, so that distance would be a Kilometer object and
correction would be a Foot object. Kilometer.+(foot) could then either
raise an exception or invoke a conversion.
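
As a rough illustration of "use the type system", here is a sketch with two unit classes. The class layout, the to_km method, and the choice to convert rather than raise for known units are all assumptions made for the example, not anything prescribed above.

```ruby
# Hypothetical unit classes for illustration only.
class Foot
  attr_reader :value

  def initialize(value)
    @value = value
  end

  # 1 foot = 0.3048 m = 0.0003048 km
  def to_km
    Kilometer.new(value * 0.0003048)
  end
end

class Kilometer
  attr_reader :value

  def initialize(value)
    @value = value
  end

  def to_km
    self
  end

  # Convert anything that knows how to become kilometers;
  # refuse everything else instead of silently adding bare numbers.
  def +(other)
    unless other.respond_to?(:to_km)
      raise TypeError, "cannot add #{other.class} to Kilometer"
    end
    Kilometer.new(value + other.to_km.value)
  end
end

distance   = Kilometer.new(1)
correction = Foot.new(1000)
p((distance + correction).value)
```

The unit now travels with the object, so no prefix on the variable name is needed to catch a kilometers-plus-feet mistake.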

In summary, then, Hungarian Notation of either sort is inappropriate in
Ruby. Drop the habit.

Typing extra hyphens vs pressing the shift key lets me write
code faster, and the a/s/h prefix for arrays/strings/hashes helps me
avoid a lot of interpreter complaints.

If you care about removing characters from variable names, start with
removing the Hungarian warts. As I explained above, they serve no
useful purpose in Ruby at all. And I have to say, I don't find
wordsRunTogether as easy to read as words_with_underscores -- the
underscores look more like spaces and delineate the words better to my
eye. WouldYouRatherReadThisClauseHere, or
would_you_rather_read_this_clause_here?

In any case, "snake_case" is the prevailing style in Ruby, and virtually
every Ruby library uses it (including the standard library and Rails) --
your code will look strange if you don't follow suit. The examples in
Programming Ruby tend to use camelCase, but that's more of a flaw in the
book than an indicator of Ruby practice.

And fellow programmers of
almost any stripe know what I mean. Finally, I'm a retired curmudgeon,
and you know how we old folks are

Age is not an excuse. If you're going to learn a language, take the
time to learn the idioms and the "spirit" of the language, not just the
bare essentials of syntax. I've seen far too many people try to write
C, Java, or PHP in Ruby -- avoid the temptation!

Seriously, your insight was very helpful and will help me avoid a
bunch of wasteful code.

Best wishes,
Richard

Best,

James Edward Gray II · Nov 14, 2009

RichardOnRails wrote:
[...]

As far as underscoring vs. Camel-case goes, I know Rubyists'
preference, but I bow to Shakespeare's notion that "a rose by any
other name is just as sweet."

It doesn't work that way in programming. Good naming practices are an
important part of readable code.

As the saying goes, "When in Rome, do as the Romans do." You're
speaking our language now and you want to learn to speak it like us,
even with our slang. That allows you to communicate with us better so
we can learn from each other.

Even Apps Hungarian is not a great idea in OO code. Instead, just use
the type system, so that distance would be a Kilometer object and
correction would be a Foot object. Kilometer.+(foot) could then either
raise an exception or invoke a conversion.

I would like to see us move away from considering classes to be types at
all in Ruby. Who knows what modules an object has mixed into it and who
knows what singleton methods are defined on it. A class, which is what
people traditionally take for the type, is just one piece of an object's
identity.

James Edward Gray II
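
A small sketch of the point about class being only part of the picture: two objects of the same class that answer different messages. The Shouting module and the wave method are invented for the example.

```ruby
# Hypothetical mixin for illustration.
module Shouting
  def shout
    upcase + "!"
  end
end

PLAIN    = String.new("hello")
EXTENDED = String.new("hello").extend(Shouting)

# A singleton method: defined on this one object only.
def EXTENDED.wave
  "o/"
end

p PLAIN.class == EXTENDED.class      # same class
p PLAIN.respond_to?(:shout)          # but different capabilities
p EXTENDED.shout
p EXTENDED.singleton_methods         # includes methods from extended modules
p EXTENDED.is_a?(Shouting)
```

Asking an object for its class tells you where it started, not everything it can do now.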

Marnen Laibow-Koser · Nov 14, 2009

James Edward Gray II wrote:
[...]

I would like to see us move away from considering classes to be types at
all in Ruby. Who knows what modules an object has mixed into it and who
knows what singleton methods are defined on it.

Do you make much use of singleton mixins or singleton methods in your
code? I know I don't.

A class, which is what
people traditionally take for the type, is just one piece of an object's
identity.

You're right. But with a proper class system, my point about not
needing Apps Hungarian in Ruby still stands, I think. Do you disagree?

James Edward Gray II

Best,

David Turnbull · Nov 14, 2009

I would like to see us move away from considering classes to be
types at all in Ruby. Who knows what modules an object has mixed
into it and who knows what singleton methods are defined on it. A
class, which is what people traditionally take for the type, is just
one piece of an object's identity.

I would still look immediately to the class of the object in order to
find out what it's supposed to do. From there, the class definition
will probably list its module inclusions prominently.

As a vim user, with very limited interactive debugging, my primary
exploration technique will usually consist of at most a couple of
'obj.methods.grep' calls followed by grepping ~/gems which seems to
emphasize the actual reading of the source for object identity info.

Python's integrated documentation would be really welcome in this
case, I think.

I'm curious what you think the most correct way is to discover object
identity.

Marnen Laibow-Koser · Nov 14, 2009

David said:
I would still look immediately to the class of the object in order to
find out what it's supposed to do.

I would too. James is correct that it isn't the whole story, but it's
the best place to start.

From there, the class definition
will probably list its module inclusions prominently.

As a vim user, with very limited interactive debugging,

What? You can use ruby-debug interactively in a console session. I
often do.

my primary
exploration technique will usually consist of at most a couple of
'obj.methods.grep' calls followed by grepping ~/gems which seems to
emphasize the actual reading of the source for object identity info.

Python's integrated documentation would be really welcome in this
case, I think.

WTF? Aren't you familiar with RDoc? And didn't you know that running
"gem server" will start a Web server with gem RDoc pages on port 8808?

I'm curious what you think the most correct way is to discover object
identity.

Object identity? Well, for that, you need object_id. That's something
different than object type.

Bill Kelly · Nov 14, 2009

David Turnbull said:
I would still look immediately to the class of the object in order to
find out what it's supposed to do. From there, the class definition
will probably list its module inclusions prominently.

A human looking to documentation to find out what an object
of a particular class is supposed to *do*, is one thing. But
then there's the programmatic flipside where one could code
a method to select between different behaviors based on the
class-type of a given argument-object.

def foo(bar)
  if bar.is_a? Array
    do_array_thing(bar)
  elsif bar.is_a? String
    do_string_thing(bar)
  else
    # ... ?
  end
end

I believe it's (variations on) the above that are viewed
as unreasonably restrictive in ruby.

It's challenging, too, because even :respond_to? can be
misleading.

I like Og (Object Graph), an Object Relational Mapping
library in ruby providing high-level database access.

require 'og'

class Address
  property :name, String
  property :company, String
  property :dept, String
  property :addr1, String
  property :addr2, String
  property :city, String
  property :state, String
  property :zip, String
  property :country, String
  belongs_to :order, Order
end

When Og is initialized, it searches ObjectSpace for
classes like the above, and detects that they are
intended to be Og-managed classes, and imbues them
with certain basic features. (It also generates the
SQL needed to create the database tables
corresponding to such classes.)

An example is that, given nothing more than the above
Address class declaration... I could now say:

result = Address.find_by_name_and_state("Bob Jones", "CA")

But..! The Address.find_by_name_and_state method doesn't even
exist until the time that it is called. Part of the
magic with which an Og-managed class is imbued is
some method_missing logic which looks for particular
method signatures, like /find_by_(.*)/ , and, at the
moment such a method is called, tests its name against the
following, behind the scenes:

def method_missing(sym, *args, &block)
  if match = /find_(all_by|by)_([_a-zA-Z]\w*)/.match(sym.to_s)
    return find_by_(match, args, &block)
  elsif match = /find_or_create_by_([_a-zA-Z]\w*)/.match(sym.to_s)
    return find_or_create_by_(match, args, &block)
  else
    super
  end
end

(Note: In this case, it appears Og _always_ handles the
request via method_missing. But I've seen other code
in Og (or maybe Nitro) that did *define* the method when
it was first called, such that on subsequent invocations
the method would already exist.)
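
Here is a toy sketch of the define-on-first-call variant, which also shows why respond_to? can mislead. It is not Og's code: the Address class below, its in-memory RECORDS array, and the "_and_" field splitting are stand-ins invented for the example.

```ruby
# Hypothetical stand-in for an ORM-managed class; not Og's actual code.
class Address
  RECORDS = [
    { name: "Bob Jones", state: "CA" },
    { name: "Ann Smith", state: "NY" }
  ]

  def self.method_missing(sym, *args, &block)
    match = /\Afind_by_(\w+)\z/.match(sym.to_s)
    return super unless match   # unknown names still raise NoMethodError

    fields = match[1].split("_and_").map(&:to_sym)

    # Define the finder now, so later calls bypass method_missing.
    define_singleton_method(sym) do |*values|
      RECORDS.find { |rec| fields.zip(values).all? { |f, v| rec[f] == v } }
    end
    public_send(sym, *args)
  end
  # No respond_to_missing? is defined here, on purpose: that is what
  # makes respond_to? report false for a call that would succeed.
end

p Address.respond_to?(:find_by_name_and_state)   # before the first call
p Address.find_by_name_and_state("Bob Jones", "CA")
p Address.respond_to?(:find_by_name_and_state)   # after the first call
```

Defining respond_to_missing? alongside method_missing is the usual way to keep respond_to? honest for the dynamically handled names.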

. . Anyway, the point being, Ruby is pretty dynamic.

Python's integrated documentation would be really welcome in this
case, I think.

I seem to recall mention awhile back on ruby-talk of
a gem or module that integrated `ri` into `irb`, such
that one could pull up the documentation from within
irb. (I don't have any links for that, sorry.)

Regards,

Bill

Ralph Shnelvar · Nov 14, 2009

BK> elsif bar.is_a? String

As a newbie I would surely like to know why the language decided on
"elsif" rather than "elseif".

And before anyone accuses me of not doing a Google search on the
subject ... I did. E.g. http://www.ruby-forum.com/topic/100350

I'm not trying to change the language ... I'm wondering if the
language developer(s) had a reason for it? Did they want to save the
typing of an 'e'?

- - - -

While we're at it, is there a (undocumented?) compiler switch that says "always check
for then in if/elsif statements"?

And before anyone accuses me of not doing a Google search on the
subject of compiler switches ... I did. E.g. http://www.zenspider.com/Languages/Ruby/QuickRef.html

Mark Thomas · Nov 14, 2009

BK> elsif bar.is_a? String

As a newbie I would surely like to know why the language decided on
"elsif" rather than "elseif".

Because a precedent had been set in Perl. That's one of the
unfortunate Perlisms in Ruby.

At least Matz didn't borrow it from Bash, which uses "elif".

Todd Benson · Nov 14, 2009

BK> elsif bar.is_a? String

As a newbie I would surely like to know why the language decided on
"elsif" rather than "elseif".

I'm pretty sure it's a Perl artifact.
