Can't find a syntax error, hoping a second set of eyes will help

Jason C

Can someone look at this and tell me what I'm messing up? I've been coding all night, and my eyes have gone fuzzy :)

while ($text =~ #<a[^>]* href=(["'])*[^\1>]*\1[^>]*?>(.*?)</a>#gsi) {
    if ($2 =~ /^http/i) {
        $text =~ s#<a[^>]*? href=(["'])*([^\1>]*)\1[^>]*?>(.*?)</a>#$2#gsi;
    }
}

The error is on the while() line (at least, when I remove that line the error goes away). The error just says:

syntax error at blah.cgi line 239, near "if"
syntax error at blah.cgi line 246, near "}"

The purpose of the function is to remove the <a href=...></a> code in submitted text, but only if the linked text begins with http.

TIA,

Jason
 
Uri Guttman

JC> Can someone look at this and tell me what I'm messing up? I've been coding all night, and my eyes have gone fuzzy :)
JC> while ($text =~ #<a[^>]* href=(["'])*[^\1>]*\1[^>]*?>(.*?)</a>#gsi) {

why do you think the # marks the start of a regex? only if you use m//
can you change the regex delim from /.
and ^ will not invert a char class for \1 as \1 isn't a char class
element. so even if you fix the regex delim, that will fail. finally,
why are you parsing out urls with a regex when there are modules that do
it correctly?

uri
 
Jason C

> while ($text =~ #<a[^>]* href=(["'])*[^\1>]*\1[^>]*?>(.*?)</a>#gsi) {
>                 ^^ m
>
> (I would suggest finding a highlighting editor. It makes this sort of
> syntactic mistake much easier to spot.)

Thanks, Ben. I didn't realize the m//; was required; since you can change the delimiter with s/// ad hoc, I thought you could here, too.

I'm using Notepad++, and while it helps me catch opening and ending brackets, it didn't do a lot in recognizing syntax errors (at least, not that I know of). What editor do you recommend?
 
Jason C

> why do you think the # marks the start of a regex? only if you use m//
> can you change the regex delim from /.

Thanks to you, too, Uri. Like I replied to Ben a second ago, I thought that since you could replace the delimiter in s/// ad hoc, that you could in m//, too. Learn something new every day! :)

> and ^ will not invert a char class for \1 as \1 isn't a char class
> element. so even if you fix the regex delim, that will fail.

Oh. Now THAT I did NOT know at all! It does explain a few other errors I've had, though, and couldn't figure out.

> finally,
> why are you parsing out urls with a regex when there are modules that do
> it correctly?

Two reasons:

1. I've been working with regex for a year or two, and while it's by no means a strong point in my vocabulary (yet), I'm at least familiar enough with it to usually figure it out.

2. I briefly looked for a module that would handle this correctly, but wasn't sure what to look for. And, I'm not sure that it warrants the including of a full module if it could potentially be done in a simple regex. If you can recommend a module that would be more stable and/or faster than what I'm doing, though, then I would definitely appreciate the reference!

FWIW, this modification did work:

while ($text =~ m#(<a[^>]* href=["'].*?["'].*?>)(.*?)(</a>)#gsi) {
    $pattern = $1$2$3;
    $repl = $2;

    if ($2 =~ /^http/i) {
        $text =~ s/$pattern/$repl/gsi;
    }
}

Admittedly, I'm not sure why $2 is stored long enough for the if() statement, but inside of the if() statement it's empty. Storing them to a different variable worked for this purpose, but if there's a better way, I'm very much open to it.
 
Peter Makholm

Jason C said:
> > while ($text =~ #<a[^>]* href=(["'])*[^\1>]*\1[^>]*?>(.*?)</a>#gsi) {
> >                 ^^ m
>
> Thanks, Ben. I didn't realize the m//; was required; since you can
> change the delimiter with s/// ad hoc, I thought you could here, too.

You can change the delimiter, but the m is only optional when you use
the // delimiters.

//Makholm
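
A minimal sketch of the rule above, for anyone landing on this thread later. The sample string in $text is made up for the example; it is not from the original script.

```perl
use strict;
use warnings;

# Hypothetical sample input, not taken from the original script.
my $text = 'see <a href="http://example.com">http://example.com</a> here';

# With the default / delimiter the leading m is optional,
# but every / inside the pattern has to be escaped.
my $slash = $text =~ /<\/a>/i ? 1 : 0;

# With any other delimiter the m is mandatory. Without it,
# perl reads a bare # as the start of a comment.
my $braces = $text =~ m{</a>}i ? 1 : 0;
my $hash   = $text =~ m#</a>#i ? 1 : 0;

print "slash=$slash braces=$braces hash=$hash\n";
```

All three forms match the same closing tag; only the first one needs the backslash.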
 
anotheranne

Jason said:
> Can someone look at this and tell me what I'm messing up? I've been coding all night, and my eyes have gone fuzzy :)
>
> while ($text =~ #<a[^>]* href=(["'])*[^\1>]*\1[^>]*?>(.*?)</a>#gsi) {
>     if ($2 =~ /^http/i) {
>         $text =~ s#<a[^>]*? href=(["'])*([^\1>]*)\1[^>]*?>(.*?)</a>#$2#gsi;
>     }
> }

Whatever other errors your regex may have, I would suggest that
you stick with the regular m// and s/// constructs. You should of
course then escape the '/' in </a>. Changing this should make it run.

Don't use # as an eye-easy replacement for / because a) it is the Perl
character for a comment, and b) in a regex (at least with the /x
modifier) it is also a metacharacter. Trouble will come your way if
you use this.

If you do want to get away from m// and s/// then use balanced
delimiters like m{} and s{}{}. See p. 319 in Friedl, MASTERING REGULAR
EXPRESSIONS, O'Reilly.

When you use any alternative to m// the m is then mandatory. Only when
using // can you omit the m. Thus // or m{} are valid constructs.

Also you can remove the ';' after the gsi.

hope this helps.

anotheranne
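
A rough sketch of the balanced-delimiter form with the /x modifier, assuming a made-up sample string. It also shows why # is awkward as a delimiter: under /x it starts a comment inside the pattern.

```perl
use strict;
use warnings;

# Hypothetical sample input for illustration only.
my $text = 'a <a href="http://example.com">example</a> b';

# Copy first, then substitute on the copy so $text stays untouched.
# With s{}{} nothing in </a> needs a backslash, and under /x the
# whitespace in the pattern is ignored and # begins a comment.
(my $stripped = $text) =~ s{
    <a \b [^>]* >    # opening tag with any attributes
    (.*?)            # link text, first capture group
    </a>             # closing tag, no backslash needed before the slash
}{$1}gisx;

print "$stripped\n";
```

This strips every anchor tag regardless of its link text, so it is only a demonstration of the delimiter syntax, not of the conditional removal asked about in the thread.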
 
anotheranne

Jason said:
> > while ($text =~ #<a[^>]* href=(["'])*[^\1>]*\1[^>]*?>(.*?)</a>#gsi) {
> >                 ^^ m
> >
> > (I would suggest finding a highlighting editor. It makes this sort of
> > syntactic mistake much easier to spot.)
>
> Thanks, Ben. I didn't realize the m//; was required; since you can change the delimiter with s/// ad hoc, I thought you could here, too.
>
> I'm using Notepad++, and while it helps me catch opening and ending brackets, it didn't do a lot in recognizing syntax errors (at least, not that I know of). What editor do you recommend?

Padre is a nice Perl IDE.

http://padre.perlide.org/

anotheranne
 
Uri Guttman

JC> Thanks to you, too, Uri. Like I replied to Ben a second ago, I
JC> thought that since you could replace the delimiter in s/// ad hoc,
JC> that you could in m//, too. Learn something new every day! :)

but s/// has the s to mark the next char. =~ ## has no leading marker so it
would just be a comment. also using # for the delimiter is just a bad
idea as it confuses many readers.

JC> Two reasons:

JC> 1. I've been working with regex for a year or two, and while it's
JC> by no means a strong point in my vocabulary (yet), I'm at least
JC> familiar enough with it to usually figure it out.

good that you are studying them but it still is the wrong tool for
this. learning when regexes aren't a good solution is part of learning
regexes.

JC> 2. I briefly looked for a module that would handle this correctly,
JC> but wasn't sure what to look for. And, I'm not sure that it
JC> warrants the including of a full module if it could potentially be
JC> done in a simple regex. If you can recommend a module that would
JC> be more stable and/or faster than what I'm doing, though, then I
JC> would definitely appreciate the reference!

JC> FWIW, this modification did work:

JC> while ($text =~ m#(<a[^>]* href=["'].*?["'].*?>)(.*?)(</a>)#gsi) {

it will fail if the opening quote is " and the string has a ' inside
it. perfectly legal html but you can't parse it that way.

JC> Admittedly, I'm not sure why $2 is stored long enough for the if()
JC> statement, but inside of the if() statement it's empty. Storing
JC> them to a different variable worked for this purpose, but if
JC> there's a better way, I'm very much open to it.

you need to read more about regexes and the $1 stuff. they live until
the next regex is run (they are global).

uri
 
Jason C

> > FWIW, this modification did work:
> >
> > while ($text =~ m#(<a[^>]* href=["'].*?["'].*?>)(.*?)(</a>)#gsi) {
> > $pattern = $1$2$3;
> ^^ ^^
> I think not...

Blah, sorry; that's what I get for trying to type up dummy code at 5am. In practice, I put it in quotes:

$pattern = "$1$2$3";

> This almost certainly doesn't do what you think. If nothing else, you
> want to \Q $pattern.

Excellent point about \Q. What do you mean, though, that it doesn't do what I think?

> What are you trying to do here: strip tags?

Yes and no. I'm using a contenteditable instead of a textarea, and I've discovered that when someone copy-and-pastes a URL from Chrome or FF, it's automatically making the URL a link. Eg:

<a href="http://www.google.com">http://www.google.com</a>

But of course, if you just type the address, then it doesn't. So on my end, I was using URI::Find to convert addresses to links, and ending up with a mess like:

<a href="<a href="http://www.google.com">http://www.google.com</a>"><a href="http://www.google.com">http://www.google.com</a></a>

So, my goal here is to remove the <a href> tag, but only if the linked text is a URL.

> Why not
> just do one s/// (or, you know, use a module)?

I had originally tried doing it with a simple s///, but couldn't figure out how to make it conditional. Like this:

$text =~ s#<a[^>]*? href=(["'])*([^\1>]*)\1[^>]*?>(.*?)</a>#$2#gsi
if ($3 =~ /^http/i);

This worked correctly if I removed the if() statement. In testing, I changed the replacement to:

1 - $1, 2 - $2, 3 - $3

just to make sure that $3 did begin with http, and it did, so I couldn't figure out why the if() wasn't catching it unless it was dropping the $3 value before reaching the if().

> The $N variables last until the next successful pattern match. In this
> case, the '$2 =~ /^http/i' in the condition of the if clears them all
> (even though it doesn't capture anything).

Ahh, that makes sense. I mistakenly thought that, since I wasn't assigning $N, then they would retain the previous value.

> In general I prefer to assign captures to real variables right away:
>
> while (my ($tag, $url) = m#(<a...>(.*?)</a>)#gsi) {
>
> (notice also that captures can be nested, and DTRT).

Great to know! Thanks.
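
For later readers, a hedged sketch of the "copy the captures right away" advice. The sample input and the names @unwrap, $tag, and $label are made up for the example. One caveat: a list-context match with /g returns every match at once rather than one per iteration, so this sketch iterates in scalar context and copies $1 and $2 on the first line of the loop body.

```perl
use strict;
use warnings;

# Hypothetical sample input: one link whose text is a URL, one that is not.
my $text = join ' ',
    '<a href="http://example.com">http://example.com</a>',
    '<a href="http://example.org">Example site</a>';

my @unwrap;   # whole tags whose link text starts with http

# Scalar-context m//g steps through the matches one at a time.
while ($text =~ m{(<a\b[^>]*>(.*?)</a>)}gis) {
    # Copy the captures before any other match can overwrite them.
    my ($tag, $label) = ($1, $2);

    # A successful match here resets $1 and $2, but the copies are safe.
    push @unwrap, [$tag, $label] if $label =~ /^http/i;
}

# Replace each collected tag with its link text. \Q...\E quotes the
# metacharacters in the tag so that it is matched literally.
for my $pair (@unwrap) {
    my ($tag, $label) = @$pair;
    $text =~ s/\Q$tag\E/$label/g;
}

print "$text\n";
```

The substitutions run after the loop on purpose: modifying $text while m//g is stepping through it would reset the match position.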
 
Jason C

> while (my ($tag, $url) = m#(<a...>(.*?)</a>)#gsi) {

In this, how does it know that we're testing $text? Or, did you mean to type something like:

while (my ($tag, $url) = $text =~ m#(<a...>(.*?)</a>)#gsi)
 
Jason C

> JC> while ($text =~ m#(<a[^>]* href=["'].*?["'].*?>)(.*?)(</a>)#gsi) {
>
> it will fail if the opening quote is " and the string has a ' inside
> it. perfectly legal html but you can't parse it that way.

I'll probably discard this idea and pursue a module, like you guys suggested. But for the sake of learning...

I recognized this issue, too, which is why I was originally using [^\1], like so:

(["'])*([^\1>]*)\1

I think it was you that pointed out that I can't negate a backreference like that, though.

What would be the correct way to do this, if I can't negate a backreference as a character class?
 
Jim Gibson

Jason C said:
> > JC> while ($text =~ m#(<a[^>]* href=["'].*?["'].*?>)(.*?)(</a>)#gsi) {
> >
> > it will fail if the opening quote is " and the string has a ' inside
> > it. perfectly legal html but you can't parse it that way.
>
> I'll probably discard this idea and pursue a module, like you guys suggested.
> But for the sake of learning...
>
> I recognized this issue, too, which is why I was originally using [^\1], like
> so:
>
> (["'])*([^\1>]*)\1
>
> I think it was you that pointed out that I can't negate a backreference like
> that, though.
>
> What would be the correct way to do this, if I can't negate a backreference as a character class?

Capture the leading delimiter and use a backreference that is not in a
character class:

while ($text =~ m{(<a[^>]* href=(["']).*?\2.*?>)(.*?)(</a>)}gsi) {
                                         ^^
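
Another way to get the effect that [^\1] was reaching for is a negative lookahead in front of each character, and the "only if the link text starts with http" condition can go into the pattern itself so that one s/// is enough. This is only a sketch under assumptions: the sample input is made up, and a regex is still not a real HTML parser.

```perl
use strict;
use warnings;

# Hypothetical sample input. The first href is double-quoted and
# contains a single quote, which is legal HTML.
my $text = join "\n",
    q{<a href="http://example.com/it's">http://example.com/it's</a>},
    q{<a href='http://example.org'>Example site</a>};

$text =~ s{
    <a \b [^>]*? \s href=
    (["'])                  # opening quote, first capture group
    (?: (?! \1 ) . )*       # any character that is not that same quote
    \1                      # matching closing quote
    [^>]* >
    ( https?:// .*? )       # link text must itself start with http
    </a>
}{$2}gisx;

print "$text\n";
```

Only the first link is unwrapped, because the second link's text does not start with http and so the pattern never matches it.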
 
