regexp pipe problems..

willwade

OK, I have a "fairly" straightforward regular expression which grabs
some bits out of a url. Now this should be easy but I can't see the
wood for the trees. I have a few or's in there but it seems to be
adding each to memory - when I only want the found one. Also - instead
of matching just the subdomain it matches the whole domain ($1 see
below) - what's that all about?

$url = 'http://sub1.site3.org/slash/slashey2/slashey3/39/4/223';

if ($url =~ m{(([^/]+).site1.org|([^/]+).site2.org|([^/]+).site3.org)
/slash/slashey2/([^/]+)/([0-9]+)/([0-9]+)/([0-9a-zA-Z-]+)}){
print "yay! $1,$2,$3,$4,$5";
} else {
print "poo";
exit;
}

and it prints:

yay! sub1.site3.org,,,sub1,

when I want to print:

yay! sub1,slashey3,39,4,223

this regular expression stuff makes my head hurt

any help muchly appreciated
:)

will
 
A. Sinan Unur

(e-mail address removed) wrote in @f14g2000cwb.googlegroups.com:
OK, I have a "fairly" straightforward regular expression which grabs
some bits out of a url. Now this should be easy but I can't see the
wood for the trees. I have a few or's in there but it seems to be
adding each to memory - when I only want the found one. Also - instead
of matching just the subdomain it matches the whole domain ($1 see
below) - what's that all about?

$url = 'http://sub1.site3.org/slash/slashey2/slashey3/39/4/223';

if ($url =~ m{(([^/]+).site1.org|([^/]+).site2.org|([^/]+).site3.org)
/slash/slashey2/([^/]+)/([0-9]+)/([0-9]+)/([0-9a-zA-Z-]+)}){

I don't understand what exactly you are doing here, but are you, by any
chance, forgetting that dots are special in regular expressions?
print "yay! $1,$2,$3,$4,$5";
} else {
print "poo";
exit;
}

and it prints:

yay! sub1.site3.org,,,sub1,

When I ran your code, it printed poo. Post the code you actually ran.

Please see the posting guidelines for this group to learn how you can
help yourself, and help others help you.
when I want to print:

yay! sub1,slashey3,39,4,223

#!/usr/bin/perl

use strict;
use warnings;

my $url = q{http://sub1.site3.org/slash/slashey2/slashey3/39/4/223};

if( $url =~ m{^http://
([[:alnum:]]+)
\..+ /
[[:alnum:]]+ /
[[:alnum:]]+ /
([[:alnum:]]+)/
([[:digit:]]+)/
([[:digit:]]+)/
([[:digit:]]+)$}x ) {
print join('|', $1, $2, $3, $4, $5)."\n";
} else {
print "did not match\n";
}

__END__

D:\Home> ttt
sub1|slashey3|39|4|223

See perldoc perlre for explanations.
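
To illustrate the point about dots: an unescaped "." matches any single character (except a newline), so a pattern such as .site1.org also accepts strings that contain no dots at all. A minimal sketch, with host names made up purely for the demonstration:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical host names, chosen only to show the difference.
my @hosts = ('sub1.site1.org', 'sub1xsite1yorg');

for my $host (@hosts) {
    # Unescaped dots: each "." matches any single character.
    my $loose  = ($host =~ m{^[^/]+.site1.org$})   ? 'match' : 'no match';
    # Escaped dots: each "\." matches only a literal dot.
    my $strict = ($host =~ m{^[^/]+\.site1\.org$}) ? 'match' : 'no match';
    print "$host: loose=$loose strict=$strict\n";
}
```

The second host should pass the loose pattern and fail the strict one.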

Sinan
 
Simon Taylor

Hello Will,
OK, I have a "fairly" straightforward regular expression which grabs
some bits out of a url. Now this should be easy but I can't see the
wood for the trees.
[snip]

and it prints:

yay! sub1.site3.org,,,sub1,

when I want to print:

yay! sub1,slashey3,39,4,223

Actually, when I run the code you've posted it prints

poo

The code below uses split() to get the following output:

sub1.site3.org
slashey3
39
4
223

So depending on your needs, this might be adequate, (though you'd need
to handle the "sub1.site3.org" string as an extra step).


#!/usr/bin/perl
use strict;
use warnings;

my $url = 'http://sub1.site3.org/slash/slashey2/slashey3/39/4/223';

my @matches = (split m:/:, $url)[2,5..8];
print "$_\n" for @matches;


__END__
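
The extra step mentioned above could be a second split on the dot. This is only a sketch, and it assumes the subdomain is everything before the last two labels of the host name, which would not hold for hosts such as example.co.uk:

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $url = 'http://sub1.site3.org/slash/slashey2/slashey3/39/4/223';

# Fields after the split: 0 'http:', 1 '', 2 host, 3 'slash',
# 4 'slashey2', 5..8 the remaining path parts.
my ($host, @rest) = (split m{/}, $url)[2, 5 .. 8];

# Assumption: the registered domain is the last two labels.
my @labels    = split /\./, $host;
my $subdomain = join '.', @labels[0 .. $#labels - 2];

print join('|', $subdomain, @rest), "\n";
```

For the URL above this prints sub1|slashey3|39|4|223.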


Regards,

Simon Taylor
 
Tad McClellan

Christian Winter said:
So to work around this the easiest solution would be to
move the hostname pattern outside of the or-clause:
m{
([^/]+)\.(site1|site2|site3)\.org
/slash/slashey2/
( [^/]+ ) / ( \d+ ) / ( \d+ ) / ( [0-9a-zA-Z-]+ )
}x;

and to change the print statement (or whatever uses the
capturing variables) to ignore $2:
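
The print statement from the quoted post is not reproduced here, so the following is only a guess at its shape. Capture groups are numbered by the position of their opening parenthesis, which means the site name lands in the second group and the path parts shift to groups three through six:

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $url = 'http://sub1.site3.org/slash/slashey2/slashey3/39/4/223';
my $line;

if ($url =~ m{
        ([^/]+)\.(site1|site2|site3)\.org
        /slash/slashey2/
        ( [^/]+ ) / ( \d+ ) / ( \d+ ) / ( [0-9a-zA-Z-]+ )
    }x) {
    # Group 2 holds the site name (here "site3"), so it is skipped.
    $line = "yay! $1,$3,$4,$5,$6\n";
} else {
    $line = "poo\n";
}

print $line;
```

With the sample URL this should print yay! sub1,slashey3,39,4,223.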


It is probably better to use the non-capturing form of parentheses,
(?: ... ), for the second pair of parens:

m{
([^/]+)\. (?: site1|site2|site3 ) \.org
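
The quoted pattern is cut off above. A complete version might look like the sketch below, which assumes the rest of the pattern is the same as in Christian Winter's version. With the alternation non-capturing, the five wanted values come back as captures one to five with no gap:

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $url = 'http://sub1.site3.org/slash/slashey2/slashey3/39/4/223';

# In list context a successful match returns the captured groups.
my @parts = $url =~ m{
    ([^/]+) \. (?: site1 | site2 | site3 ) \.org   # capture 1: subdomain
    /slash/slashey2/
    ( [^/]+ ) / ( \d+ ) / ( \d+ ) / ( [0-9a-zA-Z-]+ )
}x;

if (@parts) {
    print "yay! ", join(',', @parts), "\n";
} else {
    print "poo\n";
}
```

One thing to watch under /x: variables are still interpolated inside pattern comments, so it is safer not to write things like a dollar sign followed by a digit in them.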
 
yokoda

Thanks everyone. All really useful points.
Apologies for posting utter tripe to begin with - it was due to me
adding some line breaks to make sure it didn't wrap in anybody's
reader. Christian Winter's suggestion was very handy - I can now see all
of my weird regular expression.

To explain the problem a little further, I'm writing a regex that will
check a list of about 253 domains to see if the supplied URL's domain
is in that list. If it is, I then want to grab the subdomain and the
bits of the URL that match (slashey3 etc., as above).

As a result (and thanks to all above) I am using this:

$url =~ m{
([^/]+)\.(site1\.org|site2\.com|site3\.org)
/slash/slashey2/
( [^/]+ ) / ( \d+ ) / ( \d+ ) / ( [0-9a-zA-Z-]+ )
}x;

Is that prone to any problems???
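
A couple of possible pitfalls, for what it's worth: the pattern is not anchored, so the host part could in principle match inside the path of some other URL, and a hand-typed alternation of 253 domains is easy to get wrong (one missed backslash is enough). One way around both, sketched below, is to build the alternation from the list with quotemeta and anchor the match at the scheme. The three-entry list, the name parse_url, the https? prefix, and the restriction of the subdomain to a single label are all assumptions made for the example:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical stand-in for the real list of about 253 domains.
my @domains = qw(site1.org site2.com site3.org);

# quotemeta escapes the dots so they only match literal dots.
my $alternation = join '|', map { quotemeta($_) } @domains;

my $re = qr{
    ^https?://                 # anchor at the scheme (assumed http or https)
    ([^/.]+) \.                # subdomain, assumed to be a single label
    (?: $alternation )         # any listed domain, not captured
    /slash/slashey2/
    ([^/]+) / (\d+) / (\d+) / ([0-9a-zA-Z-]+)
}x;

# Hypothetical helper: returns the five captures, or an empty list.
sub parse_url {
    my ($url) = @_;
    return $url =~ $re;
}

my @parts = parse_url('http://sub1.site3.org/slash/slashey2/slashey3/39/4/223');
if (@parts) {
    print join('|', @parts), "\n";
} else {
    print "no match\n";
}
```

If the matched domain itself is needed later, the (?: ... ) can be turned back into a capturing group, at the cost of shifting the later capture numbers by one.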

Thanks again.
Will
 
