Regex driving me crazy...

Lie Ryan

There's a spectrum of parsing solutions:
The above dialog tends to appear when the task isn't in the
sweet-spot of regexps. Either it's sufficiently simple that
simple split/slice notation will do, or (at the other end of the
spectrum) the effort to get it working with a regexp is hairy and
convoluted, worthy of a more readable solution implemented with
pyparsing. The problem comes from people thinking that regexps
are the right solution to *every* problem...often demonstrated by
the OP writing "how do I write a regexp to solve this
<non-regexp-optimal> problem" assuming regexps are the right tool
for everything.

There are some problem-classes for which regexps are the *right*
solution, and I don't see as much of your example dialog in those
cases.
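As a rough illustration of the two ends of that spectrum, here is a minimal sketch. The sample strings and the log format are invented for the example and do not come from the original question.

```python
import re

# Simple end of the spectrum: a fixed delimiter, so str methods suffice.
line = "host=example.org"
key, _, value = line.partition("=")

# Regex sweet spot: several variable-width fields, but no nesting.
# (The log format is a made-up example.)
LOG_RE = re.compile(r"(\d{4})-(\d{2})-(\d{2}) (\w+): (.*)")
m = LOG_RE.match("2010-04-08 ERROR: disk full")
year, month, day, level, message = m.groups()

print(key, value)
print(year, month, day, level, message)
```

Anything with recursive structure (nested brackets, quoted strings containing escapes, and so on) sits at the far end, where a real parser is usually easier to read.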

I would have agreed with that statement until a few weeks ago; somehow,
in the last week or so, the mood about regex seems to have shifted to
"regex is not suitable for anything". As soon as someone (OP or not)
proposes a regex solution, someone else retorts with "don't use regex,
use string built-ins or pyparsing". It appears that the group has
developed some sense of regexphobia; some people push string built-ins
for moderately complex requirements and suggest pyparsing for
not-so-complex needs, and that keeps shrinking regex's sweet spot. But
that's just my inherently subjective observation.
 
Dotan Cohen

I would have agreed with that statement until a few weeks ago; somehow,
in the last week or so, the mood about regex seems to have shifted to
"regex is not suitable for anything". As soon as someone (OP or not)
proposes a regex solution, someone else retorts with "don't use regex,
use string built-ins or pyparsing". It appears that the group has
developed some sense of regexphobia; some people push string built-ins
for moderately complex requirements and suggest pyparsing for
not-so-complex needs, and that keeps shrinking regex's sweet spot. But
that's just my inherently subjective observation.

Isn't that a core feature of a high-level language such as Python?
Providing the tools to perform common or difficult tasks easily
through built-in functions?

I am hard pressed to think of a situation in which a regex is
preferable to a built-in function.

--
Dotan Cohen

http://bido.com
http://what-is-what.com

Please CC me if you want to be sure that I read your message. I do not
read all list mail.
 
MRAB

Dotan said:
Isn't that a core feature of a high-level language such as Python?
Providing the tools to perform common or difficult tasks easily
through built-in functions?

I am hard pressed to think of a situation in which a regex is
preferable to a built-in function.

Regexes do have their uses. It's a case of knowing when they are the
best approach and when they aren't.
 
Dotan Cohen

Regexes do have their uses. It's a case of knowing when they are the
best approach and when they aren't.

Agreed. The problems begin when the "when they aren't" is not recognised.
 
Patrick Maupin

Agreed. The problems begin when the "when they aren't" is not recognised.

Arguing against this is like arguing against motherhood and apple
pie. The same argument can validly be made for any Python construct,
any C construct, etc. This argument is so broad and vague that it's
completely meaningless. Platitudes don't help people learn how to
code. Even constant measuring of speed doesn't really help people
start learning how to code -- it just shows them that there are a lot
of OCD people in this profession.

The great thing about Python is that a lot of people, with differing
ambitions, capabilities, amounts of time to invest, and backgrounds
can pick it up and just start using it.

If somebody asks "how do I use re for this" then IMO the *best*
possible response is to tell them how to use re for this (unless
"this" is *difficult* or *impossible* to do with re, in which case you
shouldn't answer the question unless you've had your coffee and you're
in a good mood). You might also gently explain that there are other
techniques that might, in some cases, be easier to code or read. But
performance? It's all fine and dandy for the experienced coders to
discuss the finer points of different techniques (which, BTW, are
usually all predicated on using the current CPython implementation,
and might in some cases be completely wrong with one of the new JITs
under development), but you have to trust people to know their own
needs! If somebody says "this is too slow -- how do I speed it up?"
then that's really the time to strut your stuff and show that you know
how to milk the language for all it's worth. Until then, just tell
them what they want to know, perhaps with a small disclaimer that it's
probably not the most efficient or elegant or whatever way to solve
their problem. The process of learning a computer language is one of
breaking through a series of brick walls, and in many cases people
will learn things faster if you help give them the tools to get past
their mental roadblocks.

The thing that Lie and I were reacting to was the visceral "don't do
that" that seems to crop up whenever somebody asks how to do something
with re. There are a lot of good use cases for re. Arguably,
something like mxtexttools or some other low-level text processor
would be better for a few of the cases, but they're not part of the
standard library and re is.

One of the great things about Python is that a lot of useful programs
can be written just using Python and the standard library. No C, no
third-party binary libraries, etc. It's not just batteries included
-- it's everything included!

I've written C extensions, both bare, and wrapped with Pyrex, and I've
used third-party extension modules, and while that's OK, it's much
better to have some Python source code in a repository that you can
pull down to any kind of system and just RUN. And look at. And learn
from.

Many useful programs need to do text processing. Often, the built-in
string functions are sufficient. But sometimes they are not.
Discouraging somebody from learning re is doing them a disservice,
because, for the things it is really good at, it is the *only* thing
in the standard library that IS really good.

Yes, you can construct regular expressions and example texts that will
exhibit horrible worst-case performance. But there are a lot of ways
to shoot yourself in the foot performance-wise in Python (as in any
language), and most of them don't require you to use *any* library
functions, much less the dreaded re module.
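For readers who have not seen the worst case, here is a minimal sketch of it. The pattern and input are a textbook illustration, not something from this thread, and the timings depend on the machine and on the CPython version.

```python
import re
import time

# Nested quantifiers: on a failed match the backtracking engine may try
# a very large number of ways to split the run of "a" characters.
slow = re.compile(r"(a+)+b")
# Accepts the same strings, without the nested quantifier.
fast = re.compile(r"a+b")

text = "a" * 18  # no trailing "b", so neither pattern can match


def timed(pattern, s):
    """Return (match result, elapsed seconds) for pattern.match(s)."""
    start = time.perf_counter()
    result = pattern.match(s)
    return result, time.perf_counter() - start


slow_result, slow_secs = timed(slow, text)
fast_result, fast_secs = timed(fast, text)
print("nested quantifier:", slow_secs)
print("flat quantifier:  ", fast_secs)
```

Lengthening `text` by a few characters makes the gap grow quickly for the first pattern, which is why the input here is kept short.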

Often, when I see people give advice that is (I don't want to say
"knee-jerk" because the advice usually has a good foundation) so let's
say "terse" and "unexplained" or maybe even that it is an
"admonishment", it makes me feel that perhaps the person giving the
advice doesn't really trust Python.

I don't remember where I first read it, or heard it, but one of the
core strengths of Python is how easy it is to throw away code and
replace it with something better. So, trust Python to help people get
something going, and then (if they need or want to!) to make it
better.

Just my 2 cents worth.

Pat
 
Lie Ryan

Agreed. The problems begin when the "when they aren't" is not recognised.

But problems also arise when people suggest an overly complex series
of built-in functions for what is better handled by regex.

Using built-in functions is (to me at least) not a natural way to match
strings, and it makes things less understandable for anything but very
simple manipulations. Regex is like Query-by-Example (QBE) in databases:
you give an example and you get a result; you give the general pattern
and you get a match. Regex is declarative, similar to a full-blown
parser, instead of procedural like built-in functions. Regex's
unsuitability for complex parsing stems from its terseness and its
inability to handle arbitrary nesting.
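A small sketch of both halves of that argument, using a made-up "NNN-NNNN" format and hypothetical function names. The first pair contrasts a declarative pattern with the same check written procedurally; the last function shows the kind of nesting check that the standard library's re module has no recursion feature to express.

```python
import re

# Declarative: the pattern reads like an example of the text it matches.
PHONE_RE = re.compile(r"\d{3}-\d{4}")


def is_phone_re(s):
    return PHONE_RE.fullmatch(s) is not None


# Procedural: roughly the same check spelled out with string built-ins.
# (str.isdigit() and \d differ on a few non-ASCII characters.)
def is_phone_builtin(s):
    head, sep, tail = s.partition("-")
    return (sep == "-" and len(head) == 3 and len(tail) == 4
            and head.isdigit() and tail.isdigit())


# Arbitrary nesting: a depth counter does what a single re pattern cannot.
def balanced(s):
    depth = 0
    for ch in s:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:
                return False
    return depth == 0


print(is_phone_re("555-1234"), is_phone_builtin("555-1234"))
print(balanced("(a(b)c)"), balanced("(()"))
```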

People need to recognize when built-in functions aren't suitable and
when bringing in pyparsing to parse one or two things is just overkill.

An unreasonable phobia of regex is just as harmful as overuse of it.
 
Steven D'Aprano

But problems also arise when people suggest an overly complex series
of built-in functions for what is better handled by regex.

What defines "overly complex"?

For some reason, people seem to have the idea that pattern matching of
strings must be a single expression, no matter how complicated the
pattern they're trying to match. If we have a complicated task to do in
almost any other field, we don't hesitate to write a function to do it,
or even multiple functions: we break our code up into small,
understandable, testable pieces. We recognise that a five-line function
may very well be less complex than a one-line expression that does the
same thing. But if it's a string pattern matching task, we somehow become
resistant to the idea of writing a function and treat one-line
expressions as "simpler", no matter how convoluted they become.
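To make the comparison concrete, here is a hedged sketch using dotted-quad IPv4 addresses, an example chosen for illustration rather than taken from the thread. The function names are hypothetical, and the two versions are not claimed to agree on every edge case (they differ on leading zeros such as "001", for instance).

```python
import re

# One expression: four fields in the range 0..255, separated by dots.
IPV4_RE = re.compile(
    r"(?:(?:25[0-5]|2[0-4]\d|1?\d?\d)\.){3}(?:25[0-5]|2[0-4]\d|1?\d?\d)",
    re.ASCII,
)


def is_ipv4_re(s):
    return IPV4_RE.fullmatch(s) is not None


# A short function: split, then check each field as a number.
def is_ipv4_func(s):
    parts = s.split(".")
    if len(parts) != 4:
        return False
    for part in parts:
        # ASCII digits only, at most three of them, value at most 255.
        if not (part.isascii() and part.isdigit() and len(part) <= 3):
            return False
        if int(part) > 255:
            return False
    return True


for candidate in ["192.168.0.1", "256.1.1.1", "1.2.3", "a.b.c.d"]:
    print(candidate, is_ipv4_re(candidate), is_ipv4_func(candidate))
```

The range check is a plain integer comparison in the function, whereas the pattern has to enumerate digit shapes to express "at most 255".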

It's as if we decided that every maths problem had to be solved by a
single expression, no matter how complex, and invented a painfully terse
language unrelated to normal maths syntax for doing so:

# Calculate the roots of sin**2(3*x-y):
result = me.compile("{^g.?+*y:h}|\Y^r&(?P:2+)|\w+(x&y)|[?#\s]").solve()

That's not to say that regexes aren't useful, or that they don't have
advantages. They are well-studied from a theoretical basis. You don't
have to re-invent the wheel: the re module provides useful pattern
matching functionality with quite good performance.

One disadvantage is that you have to learn an entire new language, a
language which is painfully terse and obfuscated, with virtually no
support for debugging. Larry Wall has criticised the Perl regex syntax on
a number of grounds:

* things which look similar often are very different;
* things which are commonly needed are long and verbose, while things
which are rarely needed are short;
* too much reliance on too few metacharacters;
* the default is to treat whitespace around tokens as significant,
instead of defaulting to verbose-mode for readability;
* overuse of parentheses;
* difficulty working with non-ASCII data;
* insufficient abstraction;
* even though regexes are source code in a regular expression language,
they're treated as mere strings, even in Perl;

and many others.

http://dev.perl.org/perl6/doc/design/apo/A05.html
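On the whitespace point: Python's re module does offer an opt-in verbose mode through the re.VERBOSE flag, even though it is not the default. A minimal sketch, with a date format invented for the example:

```python
import re

# Default mode: whitespace in the pattern is significant, so nothing
# can be spaced out or annotated.
compact = re.compile(r"(\d{4})-(\d{2})-(\d{2})")

# Verbose mode: whitespace and "#" comments inside the pattern are ignored.
verbose = re.compile(r"""
    (?P<year>  \d{4} ) -   # four-digit year
    (?P<month> \d{2} ) -   # two-digit month
    (?P<day>   \d{2} )     # two-digit day
""", re.VERBOSE)

m = verbose.fullmatch("2010-04-09")
print(m.group("year"), m.group("month"), m.group("day"))
```

Named groups address a little of the "insufficient abstraction" complaint as well, since the caller no longer depends on group positions.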

As programming languages go, regular expressions -- even Perl's regular
expressions on steroids -- are particularly low-level. It's the assembly
language of pattern matching, compared to languages like Prolog, SNOBOL
and Icon. These languages use patterns equivalent in power to Backus-Naur
Form grammars, or context-free grammars, much more powerful and readable
than regular expressions.

But in any case, not all text processing problems are pattern-matching
problems, and even those that are don't necessarily require the 30lb
sledgehammer of regular expressions.

I find it interesting to note that there is such a thing as "regex
culture", as Larry Wall describes it. There seems to be a sort of
programmers' machismo about solving problems via regexes, even when
they're not the right tool for the job, and in the fewest number of
characters possible. I think regexes have a bad reputation because of
regex culture, and not just within Python circles either:

http://echochamber.me/viewtopic.php?f=11&t=57405


For the record, I'm not talking about "Because It's There" regexes like
this 6343-character monster:

http://www.ex-parrot.com/~pdw/Mail-RFC822-Address.html

or these:

http://mail.pm.org/pipermail/athens-pm/2003-January/000033.html
http://blog.sigfpe.com/2007/02/modular-arithmetic-with-regular.html

The fact that these exist at all is amazing and wonderful. And yes, I
admire the Obfuscated C and Underhanded C contests too :)
 
Alf P. Steinbach

* Steven D'Aprano:
For some reason, people seem to have the idea that pattern matching of
strings must be a single expression, no matter how complicated the
pattern they're trying to match. If we have a complicated task to do in
almost any other field, we don't hesitate to write a function to do it,
or even multiple functions: we break our code up into small,
understandable, testable pieces. We recognise that a five-line function
may very well be less complex than a one-line expression that does the
same thing. But if it's a string pattern matching task, we somehow become
resistant to the idea of writing a function and treat one-line
expressions as "simpler", no matter how convoluted they become.

It's as if we decided that every maths problem had to be solved by a
single expression, no matter how complex, and invented a painfully terse
language unrelated to normal maths syntax for doing so:

# Calculate the roots of sin**2(3*x-y):
result = me.compile("{^g.?+*y:h}|\Y^r&(?P:2+)|\w+(x&y)|[?#\s]").solve()



Cheers,

- Alf
 
Paul Rubin

Steven D'Aprano said:
One disadvantage is that you have to learn an entire new language, a
language which is painfully terse and obfuscated, with virtually no
support for debugging. Larry Wall has criticised the Perl regex syntax on
a number of grounds: ...

There is a parser combinator library for Python called Pyparsing but it
is apparently dog slow. Maybe someone can do a faster one sometime.
See: http://pyparsing.wikispaces.com/ for info. I haven't used it,
but it is apparently similar in style to Parsec (a Haskell library):

http://research.microsoft.com/users/daan/download/papers/parsec-paper.pdf

I use Parsec sometimes, and it's much nicer than complicated regexps.
There is a version called Attoparsec now that is slightly less powerful
but very fast.
 
Stefan Behnel

Tim Chase, 08.04.2010 16:23:
There are some problem-classes for which regexps are the *right*
solution, and I don't see as much of your example dialog in those cases.

Obviously. People rarely complain about problems that are easy to solve
with the solution at hand.

Stefan
 
Lie Ryan

What defines "overly complex"?

These discussions about the readability and suitability of regex are
orthogonal to the sub-topic I started. We are all fully aware of the
limitations of each approach. What I am complaining about is the recent
development of people just saying no to regex when the problem is in
fact in regex's very sweet spot. We have all seen people abusing regex;
but nowadays I'm starting to see people abusing built-ins as well.

We don't like it when regex gets convoluted, but that doesn't mean
built-ins fare much better.
 
Stefan Behnel

Steven D'Aprano, 09.04.2010 10:59:
It's as if we decided that every maths problem had to be solved by a
single expression, no matter how complex, and invented a painfully terse
language unrelated to normal maths syntax for doing so:

# Calculate the roots of sin**2(3*x-y):
result = me.compile("{^g.?+*y:h}|\Y^r&(?P:2+)|\w+(x&y)|[?#\s]").solve()

Actually, I would expect that the result of any mathematical calculation
can be found by applying a suitable regular expression to pi.

Stefan
 
Tim Chase

Tim Chase, 08.04.2010 16:23:

Obviously. People rarely complain about problems that are easy to solve
with the solution at hand.

Well, you still see the "Got a problem with a string" and the
"having a problem with this regex" questions, but you don't see
the remainder of the "now you have two problems" dialog.
Granted, some folks give that as a knee-jerk reaction so we just
learn to ignore their input because sometimes a regexp is exactly
the right solution ;-)

-tkc
 
Dotan Cohen

An unreasonable phobia of regex is just as harmful as overuse of it.

Agreed. I did not mean to sound as if I am against the use of regular
expressions.
 
