[regex] Can't get it to be ungreedy

Jane Doe · Aug 2, 2003

Hello,

[Since this question is actually a regex used in the open-source
filtering application Privoxy, it's a bit off-topic here, but this is
the closest ng I found about using regex, so please bear with me.]

Although I read the O'Reilly book on regex along with the "The
Filter File" part of the documentation and various links, I can't
figure out why Privoxy searches for patterns in greedy mode, regardless
of my use of either the U switch, or the ? limiter after a counter
like .* or .+ .

Here's the starting HTML, the Privoxy filter, and the output:

1. I just want to remove the complete "Item2" row, as shown :

<html>
<head>
</head>
<body>
<table>
<tr>
<td>
Item1
</td>
</tr>

<tr>
<td>
Item2
</td>
</tr>

</table>
</body>
</html>

2. Here's the Privoxy config I used :

Default Action

{+filter{acme}}
www.acme.com

Filter

FILTER: acme
s|<tr>.+Item2.+</tr>||sigU
# DOESN'T WORK EITHER, SAME RESULT s|<tr>.+?Item2.+?</tr>||sig

3. Here's what Privoxy generates:

<html>
<head>
</head>
<body>
<table>

</table>
</body>
</html>

=> I.e., the U switch is ignored. Removing the U switch and using the
manual ? ungreedifier doesn't work any better.

Thx much for any tip
JD.

Gunnar Hjalmarsson · Aug 2, 2003

Jane said:
# DOESN'T WORK EITHER, SAME RESULT
s|<tr>.+?Item2.+?</tr>||sig

----------^

The first question mark does not make a difference, if the page
doesn't include multiple matches of the part of the pattern that
_follows_ it.

I'd say it works exactly as expected. It matches everything from the
_first_ occurrence of <tr> until the first occurrence of </tr> after
the string 'Item2'.

You may want to try:

s|<tr>\s*<td>\s*Item2.+?</tr>||sig;
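
To see the difference outside Privoxy, here is a minimal sketch that uses Python's re module as a stand-in for Privoxy's engine (that substitution is an assumption; the flags are meant to mirror the s and i switches, and re.sub replaces every match, like g). The HTML variable is a trimmed copy of the table from the original post.

```python
import re

# Trimmed copy of the table from the original post.
HTML = """<table>
<tr>
<td>
Item1
</td>
</tr>

<tr>
<td>
Item2
</td>
</tr>

</table>
"""

# DOTALL lets "." cross newlines (the "s" switch); IGNORECASE is the "i" switch.
FLAGS = re.DOTALL | re.IGNORECASE

# Pattern from the original post: the match begins at the first <tr> in the page.
original = re.sub(r"<tr>.+?Item2.+?</tr>", "", HTML, flags=FLAGS)

# Suggested pattern: only whitespace and <td> may sit between <tr> and Item2,
# so the match can only begin at the row that really contains Item2.
anchored = re.sub(r"<tr>\s*<td>\s*Item2.+?</tr>", "", HTML, flags=FLAGS)

print(original)
print(anchored)
```

With this input, the first result loses both rows, while the second keeps the Item1 row. The anchored pattern relies on the row having exactly the layout shown, so it is a workaround for this page rather than a general fix.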

mgarrish · Aug 3, 2003

James E Keenan said:
if ($str =~ m|.*(<tr>.*Item2.*?<\/tr>)|s) {
print "To be removed:\n$1\n";
}

Which isn't necessarily going to do what he wants (i.e., you're only

print "\n";
if ($str =~ m|^(.*)<tr>.*Item2.*?<\/tr>\n(.*)|s) {
print "To be kept:\n$1$2\n";
}

Adding ^ to the regex is pointless, because .* will start from the beginning
anyway (and escaping slashes just adds noise when you aren't using slashes
as your delimiters). He also used /g on his regex, so you can't assume that
he's only looking to remove one occurrence.

Matt

Jürgen Exner · Aug 3, 2003

Jane Doe wrote:

Subject: [regex] Can't get it to be ungreedy

Please see 'perldoc -q greedy':
"What does it mean that regexes are greedy? How can I get around it?"

1. I just want to remove the complete "Item2" row, as shown :

<html>
<head>
</head>
<body>
<table>
<tr>
<td>
Item1
</td>
</tr>

<tr>
<td>
Item2
</td>
</tr>

</table>
</body>
</html>

As has been pointed out many, many times REs are not powerful enough to
parse HTML. You should use an HTML parser for that purpose.
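
As a rough illustration of the parser route, here is a sketch in Python using the standard library's html.parser. It is not something Privoxy can run; the names RowFilter and drop_rows are made up for this example. It assumes no nested tables, and it silently drops comments and declarations.

```python
from html.parser import HTMLParser


class RowFilter(HTMLParser):
    """Re-emit the document, dropping any <tr>...</tr> whose text contains needle."""

    def __init__(self, needle):
        super().__init__(convert_charrefs=False)
        self.needle = needle
        self.out = []     # pieces of the document that are kept
        self.row = None   # buffer for the row being read, None outside a row
        self.hit = False  # True once needle was seen in the current row

    def _emit(self, text):
        # Inside a row, hold the text back until we know whether to keep the row.
        (self.out if self.row is None else self.row).append(text)

    def handle_starttag(self, tag, attrs):
        if tag == "tr" and self.row is None:
            self.row, self.hit = [], False
        self._emit(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        self._emit(self.get_starttag_text())

    def handle_endtag(self, tag):
        self._emit(f"</{tag}>")
        if tag == "tr" and self.row is not None:
            row, self.row = self.row, None
            if not self.hit:
                self.out.extend(row)

    def handle_data(self, data):
        if self.row is not None and self.needle in data:
            self.hit = True
        self._emit(data)

    def handle_entityref(self, name):
        self._emit(f"&{name};")

    def handle_charref(self, name):
        self._emit(f"&#{name};")


def drop_rows(html, needle):
    parser = RowFilter(needle)
    parser.feed(html)
    parser.close()
    if parser.row is not None:  # unterminated row at end of input: keep it
        parser.out.extend(parser.row)
    return "".join(parser.out)


HTML = """<table>
<tr>
<td>
Item1
</td>
</tr>

<tr>
<td>
Item2
</td>
</tr>

</table>
"""

result = drop_rows(HTML, "Item2")
print(result)
```

The parser decides where each row starts and ends, so the "which <tr> does the match begin at" question never comes up.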

jue

Jane Doe · Aug 3, 2003

I'd say it works exactly as expected. It matches everything from the
_first_ occurrence of <tr> until the first occurrence of </tr> after
the string 'Item2'.

You may want to try:

s|<tr>\s*<td>\s*Item2.+?</tr>||sig;

Thanks Gunnar and others. I tried the ideas you gave, and did read the
PerlDoc "What does it mean that regexes are greedy? How can I get
around it?" _before_ asking the question... but Privoxy is still
acting greedy, no matter what I try. I'll come up with another trick
somehow

Thx again
JD.

matija · Aug 3, 2003

figure out why Privoxy searches for patterns in greedy mode, regardless
of my use of either the U switch, or the ? limiter after a counter
like .* or .+ .

Here's the starting HTML, the Privoxy filter, and the output:

1. I just want to remove the complete "Item2" row, as shown :

<html>
<head>
</head>
<body>
<table>
<tr>
<td>
Item1
</td>
</tr>

<tr>
<td>
Item2
</td>
</tr>

</table>
</body>
</html>

You might get away with an eval regex, if your app supports it:

$table =~ s{(<tr.+?</tr>)}{
    my $tr = $1;

    # do something to $tr
    # ..

    $tr;    # the value of the block becomes the replacement text
}iges;

Still it's probably highly inefficient, but as you're not using perl..

Jane Doe · Aug 3, 2003

If you by that mean that it also removes 'Item1', it sounds weird.

Someone mentioned that I could take a look at the [^...] syntax to
exclude patterns. Might do the trick.

Thank you for your help anyway

JD.

Tad McClellan · Aug 3, 2003

Jane Doe said:
Someone mentioned that I could take a look at the [^...] syntax to
exclude patterns.

You cannot use the [^...] syntax to exclude patterns.

A "character class", even a negated one, matches a *single character*.

You can use the [^...] syntax to exclude _characters_, not patterns.
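
To exclude a multi-character sequence, the usual tool in Perl-style engines is a negative lookahead, not a character class. The sketch below uses Python's re module for illustration. Whether Privoxy's engine accepts (?!...) is an assumption to check against its documentation. HTML is a trimmed copy of the table from the original post.

```python
import re

# Trimmed copy of the table from the original post.
HTML = """<table>
<tr>
<td>
Item1
</td>
</tr>

<tr>
<td>
Item2
</td>
</tr>

</table>
"""

# (?:(?!<tr>).)*? consumes one character at a time, but only while the text
# at that position does not start with "<tr>". The match therefore cannot
# run from one row into the next on its way to Item2.
ROW_WITH_ITEM2 = re.compile(
    r"<tr>(?:(?!<tr>).)*?Item2.*?</tr>",
    re.DOTALL | re.IGNORECASE,
)

result = ROW_WITH_ITEM2.sub("", HTML)
print(result)
```

On this input only the Item2 row is removed. By contrast, [^<tr>] would mean "any single character other than <, t, r or >", which is a different thing altogether.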

ctcgag · Aug 4, 2003

Jane Doe said:
Thanks Gunnar and others. I tried the ideas you gave, and did read the
PerlDoc "What does it mean that regexes are greedy? How can I get
around it?" _before_ asking the question... but Privoxy is still
acting greedy,

No, it isn't (or at least you haven't demonstrated such). You just do
not know what it means to be greedy or non-greedy. The nongreedy
quantifier between <tr> and Item2 means it will find the first Item2 after
a <tr>, not that it will find the last <tr> before an Item2.

Xho

Jane Doe · Aug 4, 2003

No, it isn't (or at least you haven't demonstrated such).

Mmm... Using the regex I gave (s|<tr>.+Item2.+</tr>||sigU), Privoxy
returns this:

<body>
<table>

</table>
</body>

i.e., making it non-greedy with either U or the ? quantifier doesn't
limit the search to the second line.

The nongreedy quantifier between <tr> and Item2 means it will find the first Item2 after
a <tr>, not that it will find the last <tr> before an Item2.

OK, but I expected a non-greedy regex to backtrack when finding Item2,
and stop when it found the first occurrence of <tr> before Item2.
Obviously, it doesn't. The search goes on...

Thx anyhow
JD.

Matt Garrish · Aug 5, 2003

Jane Doe said:
OK, but I expected a non-greedy regex to backtrack when finding Item2,
and stop when it found the first occurrence of <tr> before Item2.
Obviously, it doesn't. The search goes on...

You have to keep in mind that regexes do what you tell them to do, not what
you would like them to do. Your regex works from left to right, not from the
middle out. In other words: find a <tr> (which will be the first in the
file, since you have nothing before <tr> in your regex), then do a
non-greedy match until you find Item2, then if you find Item2 keep going
until you find the next </tr>. And that's what it's doing.
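
A small sketch of that left-to-right behaviour, using Python's re module as a stand-in for Privoxy's engine (an assumption). A third row, Item3, is added to the sample here so that the greedy and non-greedy forms end in different places; it is not in the original page.

```python
import re

# Sample extended with a hypothetical third row (Item3) for illustration.
HTML = (
    "<table>\n"
    "<tr><td>Item1</td></tr>\n"
    "<tr><td>Item2</td></tr>\n"
    "<tr><td>Item3</td></tr>\n"
    "</table>\n"
)
FLAGS = re.DOTALL | re.IGNORECASE

lazy = re.search(r"<tr>.+?Item2.+?</tr>", HTML, FLAGS)
greedy = re.search(r"<tr>.+Item2.+</tr>", HTML, FLAGS)

first_tr = HTML.index("<tr>")

# Both matches begin at the first <tr> in the text: the engine tries start
# positions from left to right and keeps the first one that can succeed.
print(lazy.start() == first_tr, greedy.start() == first_tr)

# The quantifier only changes how far the match extends to the right.
print(repr(lazy.group()))    # stops at the first </tr> after Item2
print(repr(greedy.group()))  # runs on to the last </tr> in the text
```

Non-greedy shortens the match on the right, never on the left, which is why the Item1 row is swallowed either way.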

It's almost hopeless to parse any markup language by a single regular
expression. I still haven't found anything that compares with Omnimark (Perl
and James Clark are the next best thing), but I don't suppose there's any
point in telling you to program in another language, since it looks like
you're stuck with what you have. And even with those tools, most html isn't
going to parse cleanly anyway.

I did have a similar problem once, and the only solution I came up with was
to read the file into an array and then read through the array looking for,
in your case, Item2. Whenever you hit a <tr> in a line, save that entry
number to a variable. Then when you find Item2, start looking for the next
</tr>. When you get that entry number, you can then wipe out the
corresponding array entries in between and do a little cleanup on the start
and end (i.e., to make sure nothing precedes the opening <tr> or follows the
</tr> on those lines).

I offer this up only as a kludge and not as an elegant solution. And there
are, as you would find, still a lot of problems inherent in this method
(i.e., everything and more on one line, missing </tr> tags that will destroy
your data, etc.), but when you're getting desperate...
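
A minimal sketch of the kludge in Python, for illustration only. The function name drop_row_by_lines is made up, and the code assumes each <tr> and </tr> sits on its own line, so it skips the cleanup of text before the opening <tr> or after the closing </tr>.

```python
def drop_row_by_lines(html, needle):
    """Remove the lines from the last <tr> before needle to the next </tr>."""
    lines = html.splitlines(keepends=True)
    keep = [True] * len(lines)
    last_tr = None  # index of the most recent line containing <tr>
    pending = None  # index of the <tr> line of a row in which needle was found

    for i, line in enumerate(lines):
        low = line.lower()
        if "<tr>" in low:
            last_tr = i
        if pending is None and last_tr is not None and needle in line:
            pending = last_tr
        if pending is not None and "</tr>" in low:
            # Wipe out every line of the row, from its <tr> through this </tr>.
            for j in range(pending, i + 1):
                keep[j] = False
            pending = None
            last_tr = None

    return "".join(line for line, kept in zip(lines, keep) if kept)


# Trimmed copy of the table from the original post.
HTML = """<table>
<tr>
<td>
Item1
</td>
</tr>

<tr>
<td>
Item2
</td>
</tr>

</table>
"""

result = drop_row_by_lines(HTML, "Item2")
print(result)
```

If a </tr> is missing, the row stays "pending" and nothing is removed for it, which avoids destroying data but also means the filter quietly does nothing.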

Matt

Jane Doe · Aug 5, 2003

I offer this up only as a kludge and not as an elegant solution.

Thank you very much for your help. Unfortunately, since the regex
engine in Privoxy doesn't seem to allow running an external program to
handle a given filter entry, I'm stuck. Bah, I can live with this.

Thx
JD.

Matt Garrish · Aug 6, 2003

Jane Doe said:
On Mon, 4 Aug 2003 19:39:21 -0400, "Matt Garrish"

Thank you very much for your help. Unfortunately, since the regex
engine in Privoxy doesn't seem to allow running an external program to
handle a given filter entry, I'm stuck. Bah, I can live with this

Are you trying to rewrite the code for filtering, or are you just trying to
add a regex to a filter file? If the latter, you're probably pretty limited
in what you're ever going to accomplish.

Also, if you are just trying to write a line for your filter file, is there
any reason you need to remove the full <tr>...</tr> section? If you're just
looking to strip something from your email, why not just match Item2 and
remove it? It would probably make your life a whole lot easier, and an empty
row in the table probably isn't going to kill you... : )

Matt

Jane Doe · Aug 6, 2003

Are you trying to rewrite the code for filtering, or are you just trying to
add a regex to a filter file?

Adding an entry in the filter file

Also, if you are just trying to write a line for your filter file, is there
any reason you need to remove the full <tr>...</tr> section?

Yes, I want to remove a line from a table. Hence the need to locate a
whole <tr><td></td></tr> with a specific pattern somewhere inside a
line.

Thx

JD.
