Problem with ([\w ]+?)

forwax

Hi there,

I'm trying to use "([\w ]+?)" to capture a city name and its team name
in a few regex conditions, but it doesn't seem to work with 3 of the
city/team names that I have. The code that I use is at the bottom of the post.

Here are some that work:
- "Edmonton Oilers"
- "Detroit Red Wings"
- "Montreal Canadiens"

But it doesn't work with these 3:
- "St.Louis Blues"
- "Wilkes-Barre/Scranton Penguins"
- "Omaha Ak-Sar-Ben Knights"

Should "([\w ]+?)" be enough to work, or am I missing something?

Here is the code from 2 different conditions:
1.
elsif ($_ =~ /<tr><td class="STHSSchedule_GameNumber">\d+<\/td><td
class="STHSSchedule_ProLink"><a href="$prefix-(\d+).html">([\w ]+?)
\((\d+)\) vs ([\w ]+?) \((\d+)\)<\/a>(.*)<\/td><td
class="STHSSchedule_FarmLink">/i) { }

2.
elsif ($_ =~ /<a href="$prefix-Farm-(\d+).html">([\w ]+?) \((\d+)\) vs
([\w ]+?) \((\d+)\)(<\/a>)(.*)<\/td><\/tr>/i) { }
 
Jürgen Exner

forwax said:
Hi there,

I'm trying to use "([\w ]+?)" to capture a city name and its team name
in a few regex conditions, but it doesn't seem to work with 3 of the
city/team names that I have. The code that I use is at the bottom of the post.

Here are some that work:
- "Edmonton Oilers"
- "Detroit Red Wings"
- "Montreal Canadiens"

But it doesn't work with these 3:
- "St.Louis Blues"
- "Wilkes-Barre/Scranton Penguins"
- "Omaha Ak-Sar-Ben Knights"

Should "([\w ]+?)" be enough to work, or am I missing something?

Neither . nor - nor / is a word character. It appears you expect \w to match
them nevertheless.
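
To illustrate the point, here is a minimal sketch (not from the original post): \w covers only letters, digits, and underscore, so any extra punctuation has to be listed in the character class explicitly. The sub name split_matchup is made up for the example, the first sample line is taken from later in the thread, and the second sample line is invented. The sketch assumes team names contain nothing beyond word characters, spaces, '.', '/' and '-'.

```perl
use strict;
use warnings;

# Hypothetical helper: split "Team A (1) vs Team B (2)" into four parts.
# The class [\w ./-] adds '.', '/' and '-' to the word characters and space.
# The '-' is placed last so it is taken literally rather than as a range.
# m{...} is used as the delimiter so '/' needs no escaping.
sub split_matchup {
    my ($line) = @_;
    return $line =~ m{^([\w ./-]+?) \((\d+)\) vs ([\w ./-]+?) \((\d+)\)$};
}

for my $line (
    'Hartford Wolfpack (3) vs Wilkes-Barre/Scranton Penguins (7)',
    'St.Louis Blues (2) vs Omaha Ak-Sar-Ben Knights (4)',   # invented sample
) {
    # In list context a failed match returns an empty list, so skip the line.
    my ($team1, $score1, $team2, $score2) = split_matchup($line)
        or next;
    print "$team1 [$score1] / $team2 [$score2]\n";
}
```

If other characters can show up in a name (an apostrophe, for instance), they would have to be added to the class as well.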
1.
elsif ($_ =~ /<tr><td class="STHSSchedule_GameNumber">\d+<\/td><td
class="STHSSchedule_ProLink"><a href="$prefix-(\d+).html">([\w ]+?)
\((\d+)\) vs ([\w ]+?) \((\d+)\)<\/a>(.*)<\/td><td
class="STHSSchedule_FarmLink">/i) { }

2.
elsif ($_ =~ /<a href="$prefix-Farm-(\d+).html">([\w ]+?) \((\d+)\) vs
([\w ]+?) \((\d+)\)(<\/a>)(.*)<\/td><\/tr>/i) { }

Urg! This is excellent code if you are trying to win the Perl obfuscation
contest.
However, if you are interested in readable, maintainable, robust, and correct
code, you may want to check out one of the HTML parser modules.

jue
 
forwax

I'll go straight to it... I'm a newb, a newb who works in IT but is
just a technician. It's been a while since I wrote any code, and my
best language was structured C (yeah I know, I'm a dinosaur in that
field).

I've read the link you've given me and did a search on google to see if
I could get more of a beginner approach to parsing an HTML file. I
want to learn but I have to be honest, I was trying to keep it simple
so I could comprehend what I was doing.

The solution you have given me seems the best: extract the text from
the HTML and put it in the file I want. It's just great. But do you
know another web site that would be more at my level (beginner, that
is)?


A. Sinan Unur said:
I'm trying to use "([\w ]+?)"
...

But it doesn't work with these 3:
- "St.Louis Blues"
- "Wilkes-Barre/Scranton Penguins"
- "Omaha Ak-Sar-Ben Knights"

Should "([\w ]+?)" be enough to work, or am I missing something?

perldoc perlre

In addition, Perl defines the following:

\w Match a "word" character (alphanumeric plus "_")
1.
elsif ($_ =~ /<tr><td class="STHSSchedule_GameNumber">\d+<\/td><td
class="STHSSchedule_ProLink"><a href="$prefix-(\d+).html">([\w ]+?)
\((\d+)\) vs ([\w ]+?) \((\d+)\)<\/a>(.*)<\/td><td
class="STHSSchedule_FarmLink">/i) { }

2.
elsif ($_ =~ /<a href="$prefix-Farm-(\d+).html">([\w ]+?) \((\d+)\) vs
([\w ]+?) \((\d+)\)(<\/a>)(.*)<\/td><\/tr>/i) { }

This is impossible to read and very error prone. You would do yourself a
service by properly parsing HTML:

perldoc -q html

http://search.cpan.org/~gaas/HTML-Parser-3.55/

Sinan

--
A. Sinan Unur <[email protected]>
(remove .invalid and reverse each component for email address)

comp.lang.perl.misc guidelines on the WWW:
http://augustmail.com/~tadmc/clpmisc/clpmisc_guidelines.html
 
Tad McClellan

forwax said:
I've read the link you've given me


Please also see the Posting Guidelines link that was in his .sig

and did a search on google to see if


The place to search is http://search.cpan.org. Maybe you can find
a module that does most of the heavy lifting for you, such as
HTML::TableExtract, since your data looks to be in an HTML table.

I could get more of a beginner approach to parsing an HTML file. I
want to learn but I have to be honest, I was trying to keep it simple
so I could comprehend what I was doing.


I would have posted an example using that module if I didn't have to
reverse-engineer what your data looks like (hint).

You can probably get something working by copying some of the code
given in the module's docs, and adding a bit to it.

perldoc HTML::TableExtract



[ snip TOFU ]
 
forwax

Tad, you are right, I'm trying to extract text from HTML tables, so I'll
look into HTML::TableExtract like you said.

As for the snippet you asked for, these are the 5 types of lines I want
to extract from; all the others are of no importance.

1.
<tr><td colspan="3" class="STHSSchedule_GameDay"><b>Day 1</b></td></tr>

2.
<tr><td class="STHSSchedule_GameNumber">1</td><td
class="STHSSchedule_ProLink"><a href="AHSQ2006-1.html">New York Rangers
(1) vs Pittsburgh Penguins (2)</a> </td><td
class="STHSSchedule_FarmLink">

3.
<a href="AHSQ2006-Farm-1.html">Hartford Wolfpack (3) vs
Wilkes-Barre/Scranton Penguins (7)</a></td></tr>

4.
<tr><td class="STHSSchedule_GameNumber">55</td><td
class="STHSSchedule_ProLink">Carolina Hurricanes vs Atlanta
Thrashers</td><td>Lowell Lock Monsters

5.
vs Chicago Wolves </td></tr>

With all 5 of these HTML lines I need the text of the cell as well as the
"a href" if there is one.

I hope this post is OK, because I'm just starting to read the "posting
guidelines".

And as for running before walking, I did get some Perl books from the
library, but I do so little Perl programming that I always forget the
intermediate and advanced stuff. So I would consider myself a
walking baby at best LOL.
 
forwax

I did what you told me and read a bit about HTML::Parser, but I think I'm
going to go with HTML::TokeParser. I find it easier to comprehend, but
there's one thing I'm not sure about.

My HTML file has some cells with a URL tag and some without. Which method
would you choose? Should I use 2 while loops, one with get_tag
and one with get_text? And with get_text, how do I tell whether the
line contains a link, so that I can skip it?
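
One possible approach, sketched here under assumptions rather than given as the definitive answer: a single loop can handle both kinds of cells by stepping to each td with get_tag and then reading tokens with get_token until the cell closes, noting an href if an "a" start tag turns up. The sub name cells_with_links and the table wrapper around the sample rows are made up for the example; the rows themselves are the ones posted earlier in the thread.

```perl
use strict;
use warnings;
use HTML::TokeParser;

# Hypothetical helper: return one [href, text] pair per <td> cell.
# href is undef when the cell contains no <a> tag.
sub cells_with_links {
    my ($html) = @_;
    my $p = HTML::TokeParser->new(\$html) or die "Can't parse HTML";
    my @cells;
    while ($p->get_tag('td')) {
        my ($href, $text) = (undef, '');
        # Read tokens until this cell's closing </td>.
        while (my $t = $p->get_token) {
            my ($type, $tag) = @$t;
            last if $type eq 'E' && $tag eq 'td';
            if ($type eq 'S' && $tag eq 'a') {
                $href = $t->[2]{href};   # start tags carry attributes in slot 2
            }
            elsif ($type eq 'T') {
                $text .= $t->[1];        # text tokens carry the text in slot 1
            }
        }
        $text =~ s/\s+/ /g;              # collapse wrapped lines into one
        $text =~ s/^ | $//g;             # trim the ends
        push @cells, [$href, $text];
    }
    return @cells;
}

my $html = <<'HTML';
<table>
<tr><td class="STHSSchedule_GameNumber">1</td><td
class="STHSSchedule_ProLink"><a href="AHSQ2006-1.html">New York Rangers
(1) vs Pittsburgh Penguins (2)</a> </td><td
class="STHSSchedule_FarmLink">
<a href="AHSQ2006-Farm-1.html">Hartford Wolfpack (3) vs
Wilkes-Barre/Scranton Penguins (7)</a></td></tr>
</table>
HTML

for my $cell (cells_with_links($html)) {
    my ($href, $text) = @$cell;
    print defined $href ? "$href: $text\n" : "(no link) $text\n";
}
```

Cells without a link simply come back with an undefined href, so the caller can decide whether to skip them or print the plain text.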
 
Mumia W.

[...]
2.
<tr><td class="STHSSchedule_GameNumber">1</td><td
class="STHSSchedule_ProLink"><a href="AHSQ2006-1.html">New York Rangers
(1) vs Pittsburgh Penguins (2)</a> </td><td
class="STHSSchedule_FarmLink">
[...]

Where is all this data coming from? What are you trying to do
with it?
 
forwax

I finally changed ([\w ]+?) to (.*) and it's working great, for now. I
will, however, replace it with something more elegant using
HTML::TokeParser.

That brings me to another question. I have this code for now...

$p = HTML::TokeParser->new("./$folder/${prefix}-Schedule.html") || die
"Can't open: $!";
while (my $token = $p->get_tag("a")) {
    my $url = $token->[1]{href} || "-";
    my $text = $p->get_trimmed_text("/td");
    if ($url =~ /.*farm.*/) {
        print FILE_OUT_FARM "<a href=\"$url\"> $text <\/a><br>\n";
    }
    else {
        print FILE_OUT_PRO "<a href=\"$url\"> $text <\/a><br>\n";
    }
}

and it seems that "if ($url =~ /.*farm.*/) {" doesn't work. Can I use
a regex with the parser, or am I missing something?
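
One likely cause, offered as a guess based on the sample link posted earlier in the thread (AHSQ2006-Farm-1.html): the href spells "Farm" with a capital F, and a Perl regex is case-sensitive unless the /i modifier is given. The following minimal sketch shows the difference; the sub name is_farm_link is made up for the example.

```perl
use strict;
use warnings;

# Hypothetical helper: true if the link points at a farm-team game page.
# /i makes the match case-insensitive. The leading and trailing .* of the
# original pattern are dropped because an unanchored regex already matches
# anywhere in the string.
sub is_farm_link {
    my ($url) = @_;
    return $url =~ /farm/i ? 1 : 0;
}

for my $url ('AHSQ2006-Farm-1.html', 'AHSQ2006-1.html') {
    printf "%-22s %s\n", $url, is_farm_link($url) ? 'farm' : 'pro';
}
```

Regexes work on the $url string like on any other scalar; the parser itself places no restriction on them.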
 
