Problem with ([\w ]+?)

forwax · Jul 15, 2006

Hi there,

I'm trying to use "([\w ]+?)" to pull a city name and its team name
out inside a condition, but it doesn't seem to work with 3 of the
city/team names I have. The code that I use is at the bottom of the post.

Here are some that work:
- "Edmonton Oilers"
- "Detroit Red Wings"
- "Montreal Canadiens"

But it doesn't work with these 3:
- "St.Louis Blues"
- "Wilkes-Barre/Scranton Penguins"
- "Omaha Ak-Sar-Ben Knights"

Should "([\w ]+?)" be enough to work, or am I missing something?

Here is the code from 2 different conditions:
1.
elsif ($_ =~ /<tr><td class="STHSSchedule_GameNumber">\d+<\/td><td
class="STHSSchedule_ProLink"><a href="$prefix-(\d+).html">([\w ]+?)
\((\d+)\) vs ([\w ]+?) \((\d+)\)<\/a>(.*)<\/td><td
class="STHSSchedule_FarmLink">/i) { }

2.
elsif ($_ =~ /<a href="$prefix-Farm-(\d+).html">([\w ]+?) \((\d+)\) vs
([\w ]+?) \((\d+)\)(<\/a>)(.*)<\/td><\/tr>/i) { }

Jürgen Exner · Jul 15, 2006

forwax said:
Hi there,

I'm trying to use "([\w ]+?)" to pull a city name and its team name
out inside a condition, but it doesn't seem to work with 3 of the
city/team names I have. The code that I use is at the bottom of the post.

Here are some that work:
- "Edmonton Oilers"
- "Detroit Red Wings"
- "Montreal Canadiens"

But it doesn't work with these 3:
- "St.Louis Blues"
- "Wilkes-Barre/Scranton Penguins"
- "Omaha Ak-Sar-Ben Knights"

Should "([\w ]+?)" be enough to work, or am I missing something?

Neither . nor - or / are word characters. It appears you expect \w to match
them nevertheless.
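
To illustrate the point, here is a minimal sketch in plain Perl. The trailing " (3)" score and the variable names are made up for the test; only the team names come from the original post. It compares the original class with one widened to also accept '.', '/' and '-', which is one possible fix among several, not necessarily the best one.

```perl
use strict;
use warnings;

# Team names from the question; the " (3)" suffix below is a made-up score.
my @names = (
    'Edmonton Oilers',
    'St.Louis Blues',
    'Wilkes-Barre/Scranton Penguins',
    'Omaha Ak-Sar-Ben Knights',
);

# Original class: word characters (letters, digits, "_") and spaces only.
my $narrow = qr{^([\w ]+?) \((\d+)\)$};
# Widened class: also allow '.', '/' and '-' (hyphen placed last so it is literal).
my $wide   = qr{^([\w ./-]+?) \((\d+)\)$};

my (%narrow_ok, %wide_ok);
for my $name (@names) {
    my $line = "$name (3)";
    $narrow_ok{$name} = $line =~ $narrow ? 1 : 0;
    $wide_ok{$name}   = $line =~ $wide   ? 1 : 0;
    printf "%-32s narrow=%d wide=%d\n", $name, $narrow_ok{$name}, $wide_ok{$name};
}
```

Only the first name gets past the narrow pattern; all four get past the widened one. Any new punctuation in a team name would still need to be added to the class by hand.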

1.
elsif ($_ =~ /<tr><td class="STHSSchedule_GameNumber">\d+<\/td><td
class="STHSSchedule_ProLink"><a href="$prefix-(\d+).html">([\w ]+?)
\((\d+)\) vs ([\w ]+?) \((\d+)\)<\/a>(.*)<\/td><td
class="STHSSchedule_FarmLink">/i) { }

2.
elsif ($_ =~ /<a href="$prefix-Farm-(\d+).html">([\w ]+?) \((\d+)\) vs
([\w ]+?) \((\d+)\)(<\/a>)(.*)<\/td><\/tr>/i) { }

Urg! This is excellent code if you are trying to win the Perl obfuscation
contest.
However, if you are interested in readable, maintainable, robust, and correct
code you may want to check out one of the HTML parser modules.

jue

forwax · Jul 15, 2006

I'll get straight to it... I'm a newb, a newb who works in IT but is
just a technician. It's been a while since I wrote any code, and my
best language was structural C (yeah I know, I'm a dinosaur in that
field).

I've read the link you've given me and did a search on Google to see if
I could get something more of a beginner approach to parsing an HTML file. I
want to learn but I have to be honest, I was trying to keep it simple
so I could comprehend what I was doing.

The solution you have given me seems the best: extract the text from
the HTML and put it in the file I want. It's just great. But do you
know another web site that would be more at my level (beginner, that
is)?

A. Sinan Unur said:
I'm trying to use "([\w ]+?)"
...

But it doesn't work with these 3:
- "St.Louis Blues"
- "Wilkes-Barre/Scranton Penguins"
- "Omaha Ak-Sar-Ben Knights"

Should "([\w ]+?)" be enough to work, or am I missing something?

perldoc perlre

In addition, Perl defines the following:

\w Match a "word" character (alphanumeric plus "_")

1.
elsif ($_ =~ /<tr><td class="STHSSchedule_GameNumber">\d+<\/td><td
class="STHSSchedule_ProLink"><a href="$prefix-(\d+).html">([\w ]+?)
\((\d+)\) vs ([\w ]+?) \((\d+)\)<\/a>(.*)<\/td><td
class="STHSSchedule_FarmLink">/i) { }

2.
elsif ($_ =~ /<a href="$prefix-Farm-(\d+).html">([\w ]+?) \((\d+)\) vs
([\w ]+?) \((\d+)\)(<\/a>)(.*)<\/td><\/tr>/i) { }

This is impossible to read and very error prone. You would do yourself a
service by properly parsing HTML:

perldoc -q html

http://search.cpan.org/~gaas/HTML-Parser-3.55/

Sinan

--
A. Sinan Unur <[email protected]>
(remove .invalid and reverse each component for email address)

comp.lang.perl.misc guidelines on the WWW:
http://augustmail.com/~tadmc/clpmisc/clpmisc_guidelines.html

Tad McClellan · Jul 16, 2006

forwax said:
I've read the link you've given me

Please also see the Posting Guidelines link that was in his .sig

and did a search on Google to see if

The place to search is http://search.cpan.org, maybe you can find
a module that does most of the heavy lifting for you, such as
HTML::TableExtract, since your data looks to be in an HTML table.

I could get something more of a beginner approach to parsing an HTML file. I
want to learn but I have to be honest, I was trying to keep it simple
so I could comprehend what I was doing.

I would have posted an example using that module if I didn't have to
reverse-engineer what your data looks like (hint).

You can probably get something working by copying some of the code
given in the module's docs, and adding a bit to it.

perldoc HTML::TableExtract

[ snip TOFU ]

forwax · Jul 16, 2006

Tad, you are right, I'm trying to extract text from HTML tables, so I'll
look into HTML::TableExtract like you said.

As for the snippet you asked for, these are the 5 types of line I want
to extract from; all the others are of no importance.

1.
<tr><td colspan="3" class="STHSSchedule_GameDay"><b>Day 1</b></td></tr>

2.
<tr><td class="STHSSchedule_GameNumber">1</td><td
class="STHSSchedule_ProLink"><a href="AHSQ2006-1.html">New York Rangers
(1) vs Pittsburgh Penguins (2)</a> </td><td
class="STHSSchedule_FarmLink">

3.
<a href="AHSQ2006-Farm-1.html">Hartford Wolfpack (3) vs
Wilkes-Barre/Scranton Penguins (7)</a></td></tr>

4.
<tr><td class="STHSSchedule_GameNumber">55</td><td
class="STHSSchedule_ProLink">Carolina Hurricanes vs Atlanta
Thrashers</td><td>Lowell Lock Monsters

5.
vs Chicago Wolves </td></tr>

With all 5 of these HTML lines I need the text of the cell as well as the
"a href" if there is one.

I hope this post is OK, because I'm just starting to read the "posting
guidelines".

And as for running before walking, I did get some Perl books from the
library, but I do so little Perl programming that I always forget the
intermediate and advanced stuff. So I would consider myself a
walking baby at best LOL.

forwax · Jul 16, 2006

I did what you told me and read a bit about HTML::Parser, but I think I'm
gonna go with HTML::TokeParser. I find it easier to comprehend, but
there's one thing I'm not sure about.

My HTML file has some cells with a URL tag and some without. Which method
would you choose? Should I use two while loops, one with get_tag
and one with get_text? And with get_text, how can I tell whether the
line has a reference to a URL link, so I can skip it?
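
One way to handle both kinds of cell in a single loop, sketched here under assumptions: step from one <td> to the next with get_tag, then read that cell's tokens with get_token, noting an href if an <a> turns up. The sample HTML is a trimmed-down stand-in modelled on the lines posted earlier in the thread, and the @cells structure is made up for illustration. get_tag and get_token are documented HTML::TokeParser methods.

```perl
use strict;
use warnings;
use HTML::TokeParser;

# Hypothetical sample: one cell with a link, one cell with plain text.
my $html = <<'HTML';
<table>
<tr><td class="STHSSchedule_ProLink"><a href="AHSQ2006-1.html">New York Rangers (1) vs Pittsburgh Penguins (2)</a> </td></tr>
<tr><td class="STHSSchedule_ProLink">Carolina Hurricanes vs Atlanta Thrashers</td></tr>
</table>
HTML

# Passing a scalar reference makes the parser read from the string.
my $p = HTML::TokeParser->new(\$html) or die "Can't parse sample HTML";

my @cells;    # each entry: { text => ..., href => ... or undef }
while ($p->get_tag('td')) {
    my ($href, $text) = (undef, '');
    # Walk the tokens of this cell until its closing </td>.
    while (my $tok = $p->get_token) {
        my $type = $tok->[0];
        if    ($type eq 'S' && $tok->[1] eq 'a')  { $href = $tok->[2]{href} }
        elsif ($type eq 'T')                      { $text .= $tok->[1] }
        elsif ($type eq 'E' && $tok->[1] eq 'td') { last }
    }
    $text =~ s/^\s+|\s+$//g;    # trim surrounding whitespace
    push @cells, { text => $text, href => $href };
}

for my $cell (@cells) {
    printf "%s [%s]\n", $cell->{text},
        defined $cell->{href} ? $cell->{href} : 'no link';
}
```

A cell without a link simply ends up with an undefined href, so the same loop covers both cases and the caller decides what to do with each.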

Mumia W. · Jul 16, 2006

[...]
2.
<tr><td class="STHSSchedule_GameNumber">1</td><td
class="STHSSchedule_ProLink"><a href="AHSQ2006-1.html">New York Rangers
(1) vs Pittsburgh Penguins (2)</a> </td><td
class="STHSSchedule_FarmLink">
[...]

Where is all this data coming from? What are you trying to do
with it?

forwax · Jul 17, 2006

I finally changed ([\w ]+?) to (.*) and it's working great, for now. I
will replace it, however, with something more elegant using
HTML::TokeParser.

That brings me to another question. I have this code for now...

$p = HTML::TokeParser->new("./$folder/${prefix}-Schedule.html")
    || die "Can't open: $!";
while (my $token = $p->get_tag("a")) {
    my $url = $token->[1]{href} || "-";
    my $text = $p->get_trimmed_text("/td");
    if ($url =~ /.*farm.*/) {
        print FILE_OUT_FARM "<a href=\"$url\"> $text <\/a><br>\n";
    }
    else {
        print FILE_OUT_PRO "<a href=\"$url\"> $text <\/a><br>\n";
    }
}

and it seems that "if ($url =~ /.*farm.*/) {" doesn't work. Can I use
a regex with the parser, or am I missing something?
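
One possible cause, assuming the real links look like the sample posted earlier in the thread (AHSQ2006-Farm-1.html): the match is case-sensitive, so lowercase "farm" never matches "Farm". A minimal sketch in plain Perl, independent of the parser, with made-up variable names:

```perl
use strict;
use warnings;

# Link target as it appears in the sample HTML posted earlier in the thread.
my $url = 'AHSQ2006-Farm-1.html';

# Case-sensitive: lowercase "farm" does not match "Farm".
my $sensitive   = $url =~ /farm/  ? 'farm' : 'pro';
# Case-insensitive: the /i modifier lets "farm" match "Farm".
my $insensitive = $url =~ /farm/i ? 'farm' : 'pro';

print "without /i: $sensitive\n";
print "with /i:    $insensitive\n";
```

The leading and trailing .* are not needed either; /farm/i on its own already tests whether the string contains the word anywhere.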
