[regex] Can't get it to be ungreedy

Jane Doe

Hello,

[Since this question is actually a regex used in the open-source
filtering application Privoxy, it's a bit off-topic here, but this is
the closest ng I found about using regex, so please bear with me :)]

Although I read the O'Reilly book on regex along with "The
Filter File" part of the documentation and various links, I can't
figure out why Privoxy searches for patterns in greedy mode, regardless
of my use of either the U switch or the ? limiter after a counter
like .* or .+.

Here's the starting HTML, the Privoxy filter, and the output:

1. I just want to remove the complete "Item2" row, as shown :

<html>
<head>
</head>
<body>
<table>
<tr>
<td>
Item1
</td>
</tr>
<!-- I want to remove this part -->
<tr>
<td>
Item2
</td>
</tr>
<!-- until this point -->
</table>
</body>
</html>

2. Here's the Privoxy config I used :

Default Action

{+filter{acme}}
www.acme.com

Filter

FILTER: acme
s|<tr>.+Item2.+</tr>||sigU
# DOESN'T WORK EITHER, SAME RESULT s|<tr>.+?Item2.+?</tr>||sig

3. Here's what Privoxy generates:

<html>
<head>
</head>
<body>
<table>

</table>
</body>
</html>

=> Ie. the U switch is ignored. Removing the U switch and using the
manual ? ungreedifier doesn't work any better.

Thx much for any tip
JD.
 
Gunnar Hjalmarsson

Jane said:
# DOESN'T WORK EITHER, SAME RESULT
s|<tr>.+?Item2.+?</tr>||sig
----------^

The first question mark does not make a difference, if the page
doesn't include multiple matches of the part of the pattern that
_follows_ it.

I'd say it works exactly as expected. It matches everything from the
_first_ occurrence of <tr> until the first occurrence of </tr> after
the string 'Item2'.

You may want to try:

s|<tr>\s*<td>\s*Item2.+?</tr>||sig;
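A minimal Perl sketch of the difference, run against a cut-down copy of the sample table. The sub names are illustrative only, and this shows Perl's behaviour; it assumes nothing about Privoxy's own regex engine beyond what the thread reports.

```perl
use strict;
use warnings;

# Pattern from the original post: the match is free to start at the
# first <tr> in the text, because .+? may cross the Item1 row.
sub strip_original {
    my ($html) = @_;
    $html =~ s|<tr>.+?Item2.+?</tr>||sig;
    return $html;
}

# Suggested pattern: only whitespace and <td> may sit between <tr> and
# Item2, so the match can only start at the row that holds Item2.
sub strip_anchored {
    my ($html) = @_;
    $html =~ s|<tr>\s*<td>\s*Item2.+?</tr>||sig;
    return $html;
}

my $sample = join "\n",
    '<table>',
    '<tr>', '<td>', 'Item1', '</td>', '</tr>',
    '<tr>', '<td>', 'Item2', '</td>', '</tr>',
    '</table>', '';

print "original pattern:\n", strip_original($sample), "\n";
print "anchored pattern:\n", strip_anchored($sample), "\n";
```

The first call removes both rows, the second leaves the Item1 row in place.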
 
mgarrish

James E Keenan said:
if ($str =~ m|.*(<tr>.*Item2.*?<\/tr>)|s) {
print "To be removed:\n$1\n";
}

Which isn't necessarily going to do what he wants (i.e., you're only

print "\n";
if ($str =~ m|^(.*)<tr>.*Item2.*?<\/tr>\n(.*)|s) {
print "To be kept:\n$1$2\n";
}

Adding ^ to the regex is pointless, because .* will start from the beginning
anyway (and escaping slashes just adds noise when you aren't using slashes
as your delimiters). He also used /g on his regex, so you can't assume that
he's only looking to remove one occurrence.

Matt
 
Jürgen Exner

Jane Doe wrote:

Subject: [regex] Can't get it to be ungreedy

Please see 'perldoc -q greedy':
"What does it mean that regexes are greedy? How can I get around it?"

1. I just want to remove the complete "Item2" row, as shown :

<html>
<head>
</head>
<body>
<table>
<tr>
<td>
Item1
</td>
</tr>
<!-- I want to remove this part -->
<tr>
<td>
Item2
</td>
</tr>
<!-- until this point -->
</table>
</body>
</html>

As has been pointed out many, many times REs are not powerful enough to
parse HTML. You should use an HTML parser for that purpose.

jue
 
Jane Doe

I'd say it works exactly as expected. It matches everything from the
_first_ occurrence of <tr> until the first occurrence of </tr> after
the string 'Item2'.

You may want to try:

s|<tr>\s*<td>\s*Item2.+?</tr>||sig;

Thanks Gunnar and others. I tried the ideas you gave, and did read the
PerlDoc "What does it mean that regexes are greedy? How can I get
around it?" _before_ asking the question... but Privoxy is still
acting greedy, no matter what I try. I'll come up with another trick
somehow :)

Thx again
JD.
 
matija

figure out why Privoxy searches for patterns in greedy mode, regardless
of my use of either the U switch, or the ? limiter after a counter
like .* or .+ .

Here's the starting HTML, the Privoxy filter, and the output:

1. I just want to remove the complete "Item2" row, as shown :

<html>
<head>
</head>
<body>
<table>
<tr>
<td>
Item1
</td>
</tr>
<!-- I want to remove this part -->
<tr>
<td>
Item2
</td>
</tr>
<!-- until this point -->
</table>
</body>
</html>

You might get away with an eval regex if your app supports it:

$table =~ s{(<tr.+?</tr>)}{
    my $tr = $1;

    # do something to $tr
    # ...

    $tr;    # value of the last statement replaces the match
}iges;

Still, it's probably highly inefficient, but as you're not using Perl...
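One way the "do something to $tr" step could be filled in, as a hedged Perl sketch: each row is matched on its own and dropped only if it contains the unwanted text. remove_rows() is a made-up helper name, and the sketch assumes rows are not nested inside other rows.

```perl
use strict;
use warnings;

# remove_rows() is a hypothetical helper, not part of any library.
# It visits every <tr>...</tr> block and deletes those containing $needle.
sub remove_rows {
    my ($table, $needle) = @_;
    $table =~ s{(<tr\b.+?</tr>\s*)}{
        my $tr = $1;
        # An empty replacement deletes the row; otherwise put it back as is.
        index($tr, $needle) >= 0 ? '' : $tr;
    }igse;
    return $table;
}

my $table = join "\n",
    '<table>',
    '<tr>', '<td>', 'Item1', '</td>', '</tr>',
    '<tr>', '<td>', 'Item2', '</td>', '</tr>',
    '</table>', '';

print remove_rows($table, 'Item2');
```

Because every row is a separate match, the lazy .+? never has a chance to run from the Item1 row into the Item2 row.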
 
Jane Doe

If you by that mean that it also removes 'Item1', it sounds weird.

Someone mentioned that I could take a look at the [^...] syntax to
exclude patterns. Might do the trick.

Thank you for your help anyway :)
JD.
 
Tad McClellan

Jane Doe said:
Someone mentioned that I could take a look at the [^...] syntax to
exclude patterns.


You cannot use the [^...] syntax to exclude patterns.

A "character class", even a negated one, matches a *single character*.

You can use the [^...] syntax to exclude _characters_, not patterns.
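A small Perl sketch of that distinction. The negated class below refuses one character at a time; to refuse a whole substring such as <tr>, a negative lookahead tested at each position is one common alternative. The sub names are illustrative, and whether a given Privoxy build accepts the lookahead form is not established in this thread.

```perl
use strict;
use warnings;

# [^<]* excludes the single character "<", so it stops at the next tag.
sub first_cell_text {
    my ($text) = @_;
    my ($cell) = $text =~ m|<tr>([^<]*)</tr>|;
    return $cell;
}

# (?:(?!<tr>).)*? consumes one character at a time, but only while the
# text at that position does not start another "<tr>". The match therefore
# cannot begin in an earlier row and run on into the Item2 row.
sub item2_row {
    my ($text) = @_;
    my ($row) = $text =~ m|(<tr>(?:(?!<tr>).)*?Item2.*?</tr>)|s;
    return $row;
}

my $text = '<tr>Item1</tr><tr>Item2</tr>';
print first_cell_text($text), "\n";
print item2_row($text), "\n";
```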
 
ctcgag

Jane Doe said:
Thanks Gunnar and others. I tried the ideas you gave, and did read the
PerlDoc "What does it mean that regexes are greedy? How can I get
around it?" _before_ asking the question... but Privoxy is still
acting greedy,

No, it isn't (or at least you haven't demonstrated such). You just do
not know what it means to be greedy or non-greedy. The nongreedy
quantifier between <tr> and Item2 means it will find the first Item2 after
a <tr>, not that it will find the last <tr> before an Item2.

Xho
 
Jane Doe

No, it isn't (or at least you haven't demonstrated such).

Mmm... Using the regex I gave (s|<tr>.+Item2.+</tr>||sigU), Privoxy
returns this:

<body>
<table>

</table>
</body>

ie. making it non-greedy with either U or the ? quantifier doesn't
limit the search to the second line.

The nongreedy quantifier between <tr> and Item2 means it will find the first Item2 after
a <tr>, not that it will find the last <tr> before an Item2.

OK, but I expected a non-greedy regex to backtrack when finding Item2,
and stop when it found the first occurrence of <tr> before Item2.
Obviously, it doesn't. The search goes on...

Thx anyhow
JD.
 
Matt Garrish

Jane Doe said:
OK, but I expected a non-greedy regex to backtrack when finding Item2,
and stop when it found the first occurrence of <tr> before Item2.
Obviously, it doesn't. The search goes on...

You have to keep in mind that regexes do what you tell them to do, not what
you would like them to do. Your regex works from left to right, not from the
middle out. In other words: find a <tr> (which will be the first in the
file, since you have nothing before <tr> in your regex), then do a
non-greedy match until you find Item2, then if you find Item2 keep going
until you find the next </tr>. And that's what it's doing.
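The left-to-right behaviour can be checked in Perl by looking at where the match actually starts. This is only a sketch: match_span() is an illustrative helper, and the sample string is a shortened stand-in for the posted HTML.

```perl
use strict;
use warnings;

# match_span() is an illustrative helper, not a library function.
# It returns the start and end offsets of the first match, or an empty list.
sub match_span {
    my ($text) = @_;
    return unless $text =~ m|<tr>.+?Item2.+?</tr>|s;
    # $-[0] and $+[0] hold the start and end offsets of the last match.
    my ($start, $end) = ($-[0], $+[0]);
    return ($start, $end);
}

my $html = "<tr>\nItem1\n</tr>\n<tr>\nItem2\n</tr>\n";
my ($start, $end) = match_span($html);
my $second_tr = index($html, '<tr>', 1);

print "match starts at offset $start and ends at offset $end\n";
print "the Item2 row only begins at offset $second_tr\n";
```

The reported start is the first <tr> in the string, well before the row that contains Item2; the lazy quantifier only decides where the match ends, not where it begins.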

It's almost hopeless to parse any markup language by a single regular
expression. I still haven't found anything that compares with Omnimark (Perl
and James Clark are the next best thing), but I don't suppose there's any
point in telling you to program in another language, since it looks like
you're stuck with what you have. And even with those tools, most html isn't
going to parse cleanly anyway.

I did have a similar problem once, and the only solution I came up with was
to read the file into an array and then read through the array looking for,
in your case, Item2. Whenever you hit a <tr> in a line, save that entry
number to a variable. Then when you find Item2, start looking for the next
</tr>. When you get that entry number, you can then wipe out the
corresponding array entries in between and do a little cleanup on the start
and end (i.e., to make sure nothing precedes the opening <tr> or follows the
</tr> on those lines).
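A rough Perl sketch of that array approach, assuming one tag per line as in the sample HTML. drop_row_lines() is a hypothetical helper name, and the cleanup of partial lines mentioned above is left out.

```perl
use strict;
use warnings;

# drop_row_lines() is a hypothetical helper. It remembers the index of the
# most recent <tr> line and, once the needle is seen, removes every line
# from that <tr> through the next </tr>.
sub drop_row_lines {
    my ($needle, @lines) = @_;
    my $open;                        # index of the current row's <tr> line
    my $i = 0;
    while ($i < @lines) {
        $open = $i if $lines[$i] =~ /<tr\b/i;
        if (defined $open && index($lines[$i], $needle) >= 0) {
            # Scan forward for the closing tag of this row.
            my $close = $i;
            $close++ while $close < $#lines && $lines[$close] !~ m{</tr>}i;
            # If </tr> is missing, give up rather than delete to the end.
            return @lines if $lines[$close] !~ m{</tr>}i;
            splice @lines, $open, $close - $open + 1;
            $i = $open;              # resume at the line after the removed row
            undef $open;
            next;
        }
        undef $open if $lines[$i] =~ m{</tr>}i;   # row closed without a hit
        $i++;
    }
    return @lines;
}

my @html = ('<table>',
            '<tr>', '<td>', 'Item1', '</td>', '</tr>',
            '<tr>', '<td>', 'Item2', '</td>', '</tr>',
            '</table>');
print "$_\n" for drop_row_lines('Item2', @html);
```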

I offer this up only as a kludge and not as an elegant solution. And there
are, as you would find, still a lot of problems inherent in this method
(i.e., everything and more on one line, missing </tr> tags that will destroy
your data, etc.), but when you're getting desperate...

Matt
 
Jane Doe

I offer this up only as a kludge and not as an elegant solution.

Thank you very much for your help :) Unfortunately, since the regex
engine in Privoxy doesn't seem to allow running an external program to
handle a given filter entry, I'm stuck. Bah, I can live with this :)

Thx
JD.
 
Matt Garrish

Jane Doe said:
On Mon, 4 Aug 2003 19:39:21 -0400, "Matt Garrish"

Thank you very much for your help :) Unfortunately, since the regex
engine in Privoxy doesn't seem to allow running an external program to
handle a given filter entry, I'm stuck. Bah, I can live with this :)

Are you trying to rewrite the code for filtering, or are you just trying to
add a regex to a filter file? If the latter, you're probably pretty limited
in what you're ever going to accomplish.

Also, if you are just trying to write a line for your filter file, is there
any reason you need to remove the full <tr>...</tr> section? If you're just
looking to strip something from your email, why not just match Item2 and
remove it? It would probably make your life a whole lot easier, and an empty
row in the table probably isn't going to kill you... : )

Matt
 
Jane Doe

Are you trying to rewrite the code for filtering, or are you just trying to
add a regex to a filter file?

Adding an entry in the filter file :)

Also, if you are just trying to write a line for your filter file, is there
any reason you need to remove the full <tr>...</tr> section?

Yes, I want to remove a line from a table. Hence the need to locate a
whole <tr><td></td></tr> with a specific pattern somewhere inside a
line.

Thx :)
JD.
 
