How does Ruby compare to its older brother Python?

Peter Hansen

Cameron said:
"suggests"? I thought he was "chanting", or "intoning", or
"inveighing" or "denouncing" or ... Well, I'll say it on my
own behalf: friends, REs are not always the best choice <URL:
http://www.unixreview.com/documents/s=2472/uni1037388368795/ >.

"Denouncing" it definitely was. I wanted to find the original
source for myself, and Google conveniently provided it. Still
fun reading, and quite relevant to the thread:

http://slashdot.org/comments.pl?sid=19607&cid=1871619

-Peter
 
Peter Hansen

Skip said:
Roy> I don't see anything in the reference manual which says re.match()
Roy> caches compilations, but I suspect it does.

Yup, every call does. I believe the only time you really need to compile
them is if you have more than 100 different regular expressions. In this
case the caching code clears the cache when the threshold is reached instead
of just picking a random few elements to toss (which seems like a mistake to
me but one that should rarely be encountered).

I would think the mistake would be in writing code that actually
depended on more than 100 different regular expressions, without
explicitly compiling the expressions itself. The 100-expression
cache seems like a cheap, transparent optimization that shouldn't
cause anyone trouble in well-written code.
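As a rough illustration of compiling explicitly rather than leaning on the module's internal cache, here is a minimal sketch. The pattern names and the `classify` helper are hypothetical, invented for this example; the cache size and eviction policy mentioned above are implementation details that may differ between Python versions.

```python
import re

# Hypothetical patterns for illustration; each is compiled exactly once,
# when the module is imported, so the re module's cache never matters.
PATTERNS = {
    "int": re.compile(r"[+-]?\d+$"),
    "word": re.compile(r"[A-Za-z]+$"),
}

def classify(token):
    """Return the name of the first pattern that matches token, or None."""
    for name, rx in PATTERNS.items():
        if rx.match(token):
            return name
    return None
```

Code written this way behaves the same however many distinct expressions the program uses.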

-Peter
 
Peter Hansen

Paramjit said:
If only compiled regular expression objects allowed easy access to the
last match object, python regxes would become significantly more
convenient. For example, today you have to write:

m = myregex.match(line)
if (m):
    xyz = m.group('xyz')

It would be much easier to do:

if myregex.match(line):
    xyz = myregex.lastm.group('xyz')

Having to introduce a new variable name for a match object just
muddles the code unnecessarily in common regex usage scenarios.

Wouldn't that be a fairly trivial class to write, and use
in your own code? This is the kind of thing people tend
to whip up on the fly and add to their little bag of tools
if they really like it.

Something like (untested, not real code, just for the idea):

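import re
class myregex:
    def __init__(self, pattern):
        self.pattern = re.compile(pattern)  # wrap a compiled pattern
        self.lastm = None                   # no match attempted yet
    def match(self, data):
        self.lastm = self.pattern.match(data)
        return self.lastm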

-Peter
 
Roy Smith

Asun> Regex is much more 'immediate' in Perl.

Sure, it's syntactically bound into the language. There will always be an
extra constant overhead to enable regular expressions in Python. That

If only compiled regular expression objects allowed easy access to the
last match object, python regxes would become significantly more
convenient. For example, today you have to write:

m = myregex.match(line)
if (m):
    xyz = m.group('xyz')

It would be much easier to do:

if myregex.match(line):
    xyz = myregex.lastm.group('xyz')

Having to introduce a new variable name for a match object just
muddles the code unnecessarily in common regex usage scenarios.

-param

I like that idea. I would go one step further and eliminate the lastm
attribute, so you could just do:

if myregex.match(line):
    xyz = myregex.group('xyz')

Maybe even go all the way and let a regex object have all the methods of
a match object, cache a reference to the last match, and delegate calls
to the match object methods to the cached match object.
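
The delegation idea could be sketched roughly as follows. The class name `DelegatingRegex` and its private attributes are hypothetical; nothing like this exists in the `re` module, and this is only one of several ways to forward the calls.

```python
import re

class DelegatingRegex:
    """Hypothetical wrapper: remembers the last match and forwards to it."""

    def __init__(self, pattern, flags=0):
        self._rx = re.compile(pattern, flags)
        self._last = None

    def match(self, text):
        self._last = self._rx.match(text)
        return self._last

    def __getattr__(self, name):
        # Only called for attributes not found normally, e.g. group(), end().
        last = self.__dict__.get("_last")
        if last is None:
            raise AttributeError(f"no successful match to delegate {name!r} to")
        return getattr(last, name)

myregex = DelegatingRegex(r"(?P<xyz>\d+)")
if myregex.match("123 abc"):
    xyz = myregex.group("xyz")   # forwarded to the cached match object
```

Note that the wrapper carries state between calls, so a later `match()` silently changes what `group()` returns.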
 
Roy Smith

Peter Hansen said:
Wouldn't that be a fairly trivial class to write, and use
in your own code?

It would be. And everybody would invent their own flavor. Putting it
into the core means everybody does it the same way, so everybody's code
is easier to understand and maintain.
 
Michele Dondi

On Mon, 26 Apr 2004 16:55:02 -0400, "Ruby Tuesdays" wrote:
Would this super perl program of yours can convert the massive amount of
perl script to ruby or python?

If it could, it would be great so ruby/python programmers does not have to
learn those cryptic perl-ish syntax and the non-OOish scripting language.

Huh?!?


Michele
 
Carl Banks

Paramjit said:
If only compiled regular expression objects allowed easy access to the
last match object, python regxes would become significantly more
convenient. For example, today you have to write:

m = myregex.match(line)
if (m):
    xyz = m.group('xyz')

It would be much easier to do:

if myregex.match(line):
    xyz = myregex.lastm.group('xyz')

Having to introduce a new variable name for a match object just
muddles the code unnecessarily in common regex usage scenarios.


Hmm. The reason this hasn't been done is that it makes the match
method non-reentrant. For example, suppose you call some functions in
between the matching and use of the matched object, like this:

if myregex.match(line):
    xyz = (subprocess(line[myregex.lastm.end():])
           + myregex.lastm.group(1))

And suppose subprocess goes on to use the same regexp. By the time
subprocess returns, myregex.lastm could have been overwritten. This
is not a far-fetched example at all; one could easily encounter this
problem when writing, say, a recursive descent parser.

Murphy's law says that if anything bad can happen, sooner or later it
will, and this is why non-reentrant functions like your proposed
myregex.match are so heinous. So, I can't agree that this is a good
idea as it stands.
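
A small sketch of how the overwrite can bite in practice. `StatefulRegex` is a hypothetical stand-in for the proposed behaviour, and both helper functions are invented for this illustration.

```python
import re

class StatefulRegex:
    """Hypothetical regex wrapper that stores its last match (the proposal)."""
    def __init__(self, pattern):
        self._rx = re.compile(pattern)
        self.lastm = None
    def match(self, text):
        self.lastm = self._rx.match(text)
        return self.lastm

word = StatefulRegex(r"\s*(\w+)")   # shared, module-level object

def first_words(line):
    """Buggy: reads word.lastm AFTER the recursive call has replaced it."""
    if not word.match(line):
        return []
    rest = first_words(line[word.lastm.end():])   # overwrites word.lastm
    return [word.lastm.group(1) if word.lastm else None] + rest

def first_words_local(line):
    """Safe: each call keeps its own match object in a local variable."""
    m = word.match(line)
    if not m:
        return []
    return [m.group(1)] + first_words_local(line[m.end():])
```

With the input `"a b"`, the first version loses both words because the innermost failed match resets `lastm` to `None`, while the second returns `['a', 'b']`.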
 
Roy Smith

Carl Banks said:
Hmm. The reason this hasn't been done is that it makes the match
method non-reentrant. For example, suppose you call some functions in
between the matching and use of the matched object, like this:

if myregex.match(line):
    xyz = (subprocess(line[myregex.lastm.end():])
           + myregex.lastm.group(1))

And suppose subprocess goes on to use the same regexp. By the time
subprocess returns, myregex.lastm could have been overwritten. This
is not a far-fetched example at all; one could easily encounter this
problem when writing, say, a recursive descent parser.

I don't see that this is any worse than any other stateful object. If
you change the state of the object, you can't expect to get the same
data from it as you did before.
 
Carl Banks

Roy said:
Carl Banks said:
Hmm. The reason this hasn't been done is that it makes the match
method non-reentrant. For example, suppose you call some functions in
between the matching and use of the matched object, like this:

if myregex.match(line):
    xyz = (subprocess(line[myregex.lastm.end():])
           + myregex.lastm.group(1))

And suppose subprocess goes on to use the same regexp. By the time
subprocess returns, myregex.lastm could have been overwritten. This
is not a far-fetched example at all; one could easily encounter this
problem when writing, say, a recursive descent parser.

I don't see that this is any worse than any other stateful object.

It's worse because, unlike most objects, regexp objects are usually
global (at least they are when I use them). Moreover, the library
encourages us to make regexp objects global by exposing the regexp
compiler. So even if you personally use local regexps (and accept the
resulting performance hit), many will declare them global.

In other words, myregexp.match is essentially a global function, so
it shouldn't have state.
 
Roy Smith

Carl Banks said:
It's worse because, unlike most objects, regexp objects are usually
global (at least they are when I use them). Moreover, the library
encourages us to make regexp objects global by exposing the regexp
compiler. So even if you personally use local regexps (and accept the
resulting performance hit), many will declare them global.

I don't see why regexps are usually global. Nor do I see why exposing
the compiler encourages them to be global, or why making them local
should result in a performance hit.

I do a lot of regex work. I just looked over a bunch of scripts I
happen to have handy and only found one where I used global regexp
objects. In that script, the regexps were only used in a single
routine, so moving them down in scope to be local to that routine would
have made more sense anyway. Looking back at the code, which I wrote
several years ago, I have no idea why I decided to make them global.
 
Asun Friere

Skip Montanaro said:
Sure, it's syntactically bound into the language. There will always be an
extra constant overhead to enable regular expressions in Python. That
doesn't make them any less powerful than the Perl variety. It's simply a
pair of different design decisions Guido and Larry made (along with a few
others).
Sure.

Asun> Probably the only time I would reach for Perl rather than for
Asun> python is when I knew a task involved a lot of regex (and no
Asun> object orientation).

Why? I write non-object-oriented Python code all the time.

What I meant is that if it involves lots of regexp I'd probably use
Perl. If it involved lots of regex AND object orientation, I wouldn't
consider Perl.
Python/Perl switch you'd still have to shift your mental gears to deal with
a different syntax, different way of getting at and using functionality that
isn't builtin, etc. Even with lots of regex fiddling to do, I think the
extra overhead of using regexes in Python would be swamped by the other
differences.

It's good for the soul to shift your mental gears every now and then.
 
Carl Banks

Roy said:
I don't see why regexps are usually global. Nor do I see why exposing
the compiler encourages them to be global, or why making them local
should result in a performance hit.

Because if regexp objects are local, they have to be recompiled every
time you call the function. If you're doing that, you could be taking
a performance hit. I guess it depends on your functional style,
though. If your scripts have only one or two functions where all the
regexps are and it only gets called a few times, then it probably
won't matter too much to you.

If you do stuff recursively (as I often do) or break up code into
smaller functions (as I often do) so that the functions with these
regexps get called numerous times, it can help performance to move the
compile step out of the functions.

The point is, the existence of re.compile encourages people to make
regexp objects global so they only need to be compiled once, when the
module is loaded. Because of this, and especially because regexps are
prone to being used in recursive functions and such, it's dangerous to
allow them to have state.


I do a lot of regex work. I just looked over a bunch of scripts I
happen to have handy and only found one where I used global regexp
objects. In that script, the regexps were only used in a single
routine, so moving them down in scope to be local to that routine would
have made more sense anyway. Looking back at the code, which I wrote
several years ago, I have no idea why I decided to make them global.

Well, that's fine. For the reasons I've stated, I think there are
good reasons to not do it the way you did it.
 
Roy Smith

Carl Banks said:
especially because regexps are
prone to being used in recursive functions and such

Why are regexps prone to being used in recursive functions?
 
Paramjit Oberoi

If only compiled regular expression objects allowed easy access to the
Hmm. The reason this hasn't been done is that it makes the match
method non-reentrant. For example, suppose you call some functions in
between the matching and use of the matched object, like this:

I agree that that's a good reason...

So: to make regular expressions convenient, it should be possible to use
them without necessarily declaring a variable for the regular expression
object or the match objects. The former is fairly easy; the latter is
not.

The match object needs to have local scope; but, a function cannot access
the caller's locals without mucking around with sys._getframe(). Is there
any other way of injecting objects into the caller's namespace? Are there
any objects that exist per-frame that could be co-opted for this purpose?

I suppose a convenience module that uses sys._getframe() could be written,
but I don't think it would be suitable for the standard library. Of
course, once we get decorators, it would be possible to munge functions to
make them more amenable to regexes... but then, we don't want such hackery
in the standard library either.

Question about decorators: are they only going to be for methods, or for
all functions?
 
Paramjit Oberoi

if myregex.match(line):
Hmm. The reason this hasn't been done is that it makes the match
method non-reentrant. For example, suppose you call some functions in
between the matching and use of the matched object, like this:

Another thing in perl that makes regexes convenient is the 'default'
variable $_. So, maybe the following could be done:

line_a = re.compile(...)
line_b = re.compile(...)

rx = re.context()
for rx.text in sys.stdin:
    if rx(line_a).match():
        total += a_function() + rx[1]
    elif rx(r'^msg (?P<text>.*)$').match():
        print rx['text']
    elif rx(line_b).match(munge(text)):
        print 'munged text matches'

# similarly rx.{search, sub, ...}

But, as Peter Hansen pointed out, maybe this is more suited to a personal
utility module... What are the considerations for whether something
should or shouldn't be in the standard library?
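
For a personal utility module, the idea might look something like the sketch below. `RegexContext` is a hypothetical name; the `re` module has no `context()` function, and only `match` is shown here.

```python
import re

class RegexContext:
    """Hypothetical 'default text' helper, loosely modelled on Perl's $_."""

    def __init__(self):
        self.text = None      # current default subject string
        self._rx = None       # pattern selected by the last call
        self._m = None        # most recent match object, or None

    def __call__(self, pattern):
        # Accept either a compiled pattern or a pattern string.
        self._rx = re.compile(pattern) if isinstance(pattern, str) else pattern
        return self

    def match(self, text=None):
        self._m = self._rx.match(self.text if text is None else text)
        return self._m

    def __getitem__(self, group):
        return self._m.group(group)

rx = RegexContext()
seen = []
for rx.text in ["msg hello", "total 3", "other"]:
    if rx(r"msg (?P<text>.*)$").match():
        seen.append(rx["text"])
    elif rx(r"total (\d+)$").match():
        seen.append(int(rx[1]))
```

Because the context object is created by the caller and can live in a local variable, it sidesteps the shared-state problem of putting the last match on a global regex object.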
 
Michael

Regular expressions are like duct tape. They may not be the best tool
for everything, but they usually get the job done.
Until you are in the middle of a vital project with a limited time to
get the job done and suddenly all your duct tape starts to twist
together incorrectly. :)
 
Carl Banks

Roy said:
Why are regexps prone to being used in recursive functions?

Prone was probably a bit too strong a word, but using regexps inside a
recursive function is far from an obscure use. It can show up in
functions that parse arbitrarily nested text, which typically use
recursion to handle the arbitrarily nested part.
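
A minimal sketch of that kind of function, assuming a toy grammar of words and parentheses that is invented here purely for illustration:

```python
import re

# Compiled once at module load; shared by every (recursive) call.
TOKEN = re.compile(r"\s*([()]|[^\s()]+)")

def parse(text, pos=0):
    """Parse nested parenthesised words into nested lists.

    Returns (items, next_position). Each call keeps its own match object
    in a local variable, so recursion cannot clobber it.
    """
    items = []
    while True:
        m = TOKEN.match(text, pos)
        if not m:
            return items, pos
        tok, pos = m.group(1), m.end()
        if tok == "(":
            sub, pos = parse(text, pos)   # recurse for the nested part
            items.append(sub)
        elif tok == ")":
            return items, pos
        else:
            items.append(tok)
```

For example, `parse("a (b (c d)) e")[0]` yields `['a', ['b', ['c', 'd']], 'e']`.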
 
S Koppelman

Leif said:
Cameron Laird rose and spake:
.
I hear this more often than I understand it. Perl certainly
does support many string-oriented operations. What's a specific
example, though, of an action you feel more comfortable
coding in external Perl? I suspect there's something I need
to learn about PHP's deficiencies, or Perl's power.

I'm glad that you asked :)

The routine is for a phonetic search in Norwegian 18th century names,
which can be spelled in an amazing number of different ways. As I found
that the Soundex algorithm was useless for Norwegian spellings, I
invented my own. It's not really an algorithm, but a series of
substitutions that reduces names to a kind of primitives. Thus, eg.....

Here's a small sample:

$str =~ s/HN/N/g; # John --> JON
$str =~ s/TH/T/g; # Thor --> TOR ....
[snip]

In theory, the web routine for phonetic searches might have been
implemented in PHP. The trouble with that is that I would have to
maintain both a PHP and a Perl version of the same routine. I find it
much easier to just copy and paste the whole mess (at present about 120
lines) between the encoding and the decoding routines in Perl, and run
an exec("perl norphon.pl $name") from PHP.

Well, that's not PHP's fault, especially with such straightforward
regexps. The only reason to use perl over PHP in that case is the valid
one you cite: you already wrote the damned code in perl. ;)
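
To show how little such substitutions depend on the host language, here is a sketch of the two quoted rules in Python. The function name `norphon` is borrowed from the script name in the quote, and the upper-casing step is an assumption based on the upper-case results in the Perl comments; the real routine has many more rules that are not shown in the thread.

```python
import re

# Only the two rules quoted in the thread; the real routine has many more.
RULES = [
    (re.compile(r"HN"), "N"),   # John --> JON
    (re.compile(r"TH"), "T"),   # Thor --> TOR
]

def norphon(name):
    """Apply each substitution in order to the upper-cased name."""
    key = name.upper()          # assumption: names are compared in upper case
    for rx, repl in RULES:
        key = rx.sub(repl, key)
    return key
```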

Meanwhile, wouldn't it run more efficiently if you hosted your perl
functions under mod_perl or some other persistent harness like a SOAP or
XML-RPC daemon and had the PHP call across to that? Execing out and
having perl launch, compile and run your script each time is a waste of
resources, not to mention needlessly slow. You might not notice it if
you're the only person using it, but if there's a sudden uptick in
traffic to your site from fellow Scandinavian diaspora genealogists, you
will.

-sk
 
Carl Banks

Paramjit said:
I agree that that's a good reason...

So: to make regular expressions convenient, it should be possible to use
them without necessarily declaring a variable for the regular expression
object or the match objects. The former is fairly easy; the latter is
not.

The match object needs to have local scope; but, a function cannot access
the caller's locals without mucking around with sys._getframe(). Is there
any other way of injecting objects into the caller's namespace? Are there
any objects that exist per-frame that could be co-opted for this purpose?

I suppose a convenience module that uses sys._getframe() could be written,

Maybe something like this (I'll leave filling in details and fixing
bugs as an exercise):


class magic_sre(_sre.whatever):

    _matchstore = {}

    def match(self, s):
        # key the stored match on the identity of the caller's frame
        callhash = id(sys._getframe(1))
        match = _sre.whatever.match(self, s)
        self._matchstore[callhash] = match
        return match

    def matchobj(self):
        callhash = id(sys._getframe(1))
        return self._matchstore[callhash]


Calling matchobj() gets the match object that was matched in the
calling function using this regexp. I'm pretty sure using id() of the
frame works: until the calling function exits, no other frame can have
the same id.

It has drawbacks though (only works with CPython, match objects stay
around forever, could accidentally hit the storage if another frame has
the same id).
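
A runnable variant of the same idea, written as a wrapper around a compiled pattern rather than a subclass. `FrameRegex` and the two demo functions are hypothetical names, and the sketch shares the drawbacks just listed: it relies on CPython's `sys._getframe`, and the store is never cleaned up.

```python
import re
import sys

class FrameRegex:
    """Hypothetical wrapper keeping one last-match slot per calling frame."""

    def __init__(self, pattern):
        self._rx = re.compile(pattern)
        self._matchstore = {}          # id(frame) -> match object or None

    def match(self, s):
        key = id(sys._getframe(1))     # identity of the caller's frame
        m = self._rx.match(s)
        self._matchstore[key] = m
        return m

    def matchobj(self):
        # Look up the slot belonging to whoever called matchobj().
        return self._matchstore.get(id(sys._getframe(1)))

digits = FrameRegex(r"\d+")

def inner():
    digits.match("999")                # stored under inner()'s frame
    return digits.matchobj().group()

def outer():
    if digits.match("42 apples"):
        inner()                        # does not disturb outer()'s slot
        return digits.matchobj().group()
```

Here `outer()` still sees its own match after `inner()` has used the same regex, because the two frames are alive at the same time and so have different ids.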

but I don't think it would be suitable for the standard library.

It's such a silly thing I don't really see the need for it in the
standard library at all. In fact, I prefer working explicitly with
match objects, working around unwieldy testing when I have to.
 
