Perl script to extract data from webpage? (knucklehead newbie).


Ryan Haskell

Hello folks. I regret to announce that my understanding of Perl is
virtually nonexistent, and I'm looking for a little instruction. My
goal is to utilize a Perl script to extract specific numeric data from
various web pages, and then feed that data to MRTG for graphing
purposes. I have this running now using a script I found elsewhere,
and am using it to pull current temperature for my area from
www.weather.com and create a graph. Now I want to use the same
technique for other data elsewhere. Problem is, I can't figure out
how to modify this perl script to find the data of interest in a given
page, because I don't understand how the script actually locates the
data. The script itself is available from

http://howto.aphroland.de/HOWTO/MRTG/Scripts/weather4.pl

and here is a short excerpt from it, where the script parses the html
page from www.weather.com for the humidity data:

if ( /\%/ && /obsInfo2/ && ! /WIDTH/ ) {
if (/[0-9]{1,3}\%/) {
if ( $debug == 1 ) {
unless ( $& ) { die "Cannot determine the humidity!\n"; }
$humidity = $&;
chop ($humidity);
print "Humidity: $humidity\n";



And below is the relevant section of the html code from
www.weather.com that is being parsed:


<BR>
<TABLE BORDER=0 CELLPADDING=0 WIDTH=100% CELLSPACING=0>
<TR><TD ALIGN=LEFT VALIGN=TOP CLASS=obsInfo1 WIDTH=40%>UV Index:</TD>
<TD ALIGN=LEFT VALIGN=TOP CLASS=obsInfo2>3&nbsp;Low</TD></TR>
<TR><TD ALIGN=LEFT VALIGN=TOP CLASS=obsInfo1>Dew Point:</TD>
<TD ALIGN=LEFT VALIGN=TOP CLASS=obsInfo2>51&deg;F</TD></TR>
<TR><TD ALIGN=LEFT VALIGN=TOP CLASS=obsInfo1>Humidity:</TD>
<TD ALIGN=LEFT VALIGN=TOP CLASS=obsInfo2>40%</TD></TR>
<TR><TD ALIGN=LEFT VALIGN=TOP CLASS=obsInfo1>Visibility:</TD>
<TD ALIGN=LEFT VALIGN=TOP CLASS=obsInfo2>10.0 miles</TD></TR>
<TR><TD ALIGN=LEFT VALIGN=TOP CLASS=obsInfo1>Pressure:</TD>
<TD ALIGN=LEFT VALIGN=TOP CLASS=obsInfo2>29.79 inches and
rising</TD></TR>
<TR><TD ALIGN=LEFT VALIGN=TOP CLASS=obsInfo1>Wind:</TD>
<TD ALIGN=LEFT VALIGN=TOP CLASS=obsInfo2>From the North at 13 gusting
to 18&nbsp;mph</TD></TR>


I can see that "%" and "obsInfo2" are text strings found within the
html page on either side of the desired value, but I'm not clear on
how the perl script pulls the actual value (in this case 40) out of
the data and assigns it to the $humidity variable. How would I modify
the perl script if I wanted to get, for example, the pressure instead?
(which is 29.79 in the html example above.) I think if I could
understand how this variable matching/assignment is occurring, I could
then use this script to fetch almost any number from any web page,
right?

For another example, let's say I wanted to pull the value for "Heat
Index" off the NWS Weather page at:

http://weather.noaa.gov/weather/current/KVDF.html

What would I do?

Thanks for any help!
Ryan Haskell
 

Ryan Haskell

Gunnar Hjalmarsson said:


Been there already, Gunnar. I was hoping to get a little help from
the community... what would take me 10 hours to figure out could be
explained in less than 5 minutes by an experienced perl programmer.
I'm all for RTFM, and have been doing so. Hopefully there are others
out there a little more reminiscent of the days when they were first
trying to learn perl, or anything else for that matter.

Ryan
 

Gunnar Hjalmarsson

Ryan said:
Been there already, Gunnar. I was hoping to get a little help from
the community...
http://www.catb.org/~esr/faqs/smart-questions.html

what would take me 10 hours to figure out could be explained in
less than 5 minutes by an experienced perl programmer.

You asked in your first post how to modify the script to get pressure
instead. That would be easy:

if ( /inches/ && /obsInfo2/ && ! /WIDTH/ ) {
if (/\d\d\.\d\d/) {
print "Pressure: $&\n";
} else {
die "Cannot determine the pressure!\n";
}
}

Then you said: "I think if I could understand how this variable
matching/assignment is occurring, I could then use this script to fetch
almost any number from any web page, right?"

That sentence reveals very unrealistic expectations. Either you spend
quite some time learning Perl, or else you might be better off at

http://jobs.perl.org/
 

Jim Gibson

[description of problem snipped]
http://howto.aphroland.de/HOWTO/MRTG/Scripts/weather4.pl

and here is a short excerpt from it, where the script parses the html
page from www.weather.com for the humidity data:

if ( /\%/ && /obsInfo2/ && ! /WIDTH/ ) {
if (/[0-9]{1,3}\%/) {
if ( $debug == 1 ) {
unless ( $& ) { die "Cannot determine the humidity!\n"; }
$humidity = $&;
chop ($humidity);
print "Humidity: $humidity\n";



And below is the relevant section of the html code from
www.weather.com that is being parsed:


<BR>
<TABLE BORDER=0 CELLPADDING=0 WIDTH=100% CELLSPACING=0>
<TR><TD ALIGN=LEFT VALIGN=TOP CLASS=obsInfo1 WIDTH=40%>UV Index:</TD>
<TD ALIGN=LEFT VALIGN=TOP CLASS=obsInfo2>3&nbsp;Low</TD></TR>
<TR><TD ALIGN=LEFT VALIGN=TOP CLASS=obsInfo1>Dew Point:</TD>
<TD ALIGN=LEFT VALIGN=TOP CLASS=obsInfo2>51&deg;F</TD></TR>
<TR><TD ALIGN=LEFT VALIGN=TOP CLASS=obsInfo1>Humidity:</TD>
<TD ALIGN=LEFT VALIGN=TOP CLASS=obsInfo2>40%</TD></TR>
<TR><TD ALIGN=LEFT VALIGN=TOP CLASS=obsInfo1>Visibility:</TD>
<TD ALIGN=LEFT VALIGN=TOP CLASS=obsInfo2>10.0 miles</TD></TR>
<TR><TD ALIGN=LEFT VALIGN=TOP CLASS=obsInfo1>Pressure:</TD>
<TD ALIGN=LEFT VALIGN=TOP CLASS=obsInfo2>29.79 inches and
rising</TD></TR>
<TR><TD ALIGN=LEFT VALIGN=TOP CLASS=obsInfo1>Wind:</TD>
<TD ALIGN=LEFT VALIGN=TOP CLASS=obsInfo2>From the North at 13 gusting
to 18&nbsp;mph</TD></TR>


I can see that "%" and "obsInfo2" are text strings found within the
html page on either side of the desired value, but I'm not clear on
how the perl script pulls the actual value (in this case 40) out of
the data and assigns it to the $humidity variable. How would I modify
the perl script if I wanted to get, for example, the pressure instead?
(which is 29.79 in the html example above.) I think if I could
understand how this variable matching/assignment is occurring, I could
then use this script to fetch almost any number from any web page,
right?

The extraction method shown above depends upon the fact that the
humidity value contains a percent sign, appears on the same line as the
string 'obsInfo2', and doesn't appear on the same line as the WIDTH
parameter, which also contains a percent sign. If any of these
restrictions changes in the future, this script will fail.

It works by finding the line with the humidity on it using the above
three rules: if ( /\%/ && /obsInfo2/ && ! /WIDTH/ )
If it passes that test, it looks for 1 to 3 numerical digits followed
by a percent sign: if ( /[0-9]{1,3}\%/ )
If that matches, the matched text (here "40%") is placed in the special
$& variable; the script copies it to $humidity and uses chop to remove
the trailing percent sign before printing. Note that if the website ever
adds a decimal point to the humidity reading, the pattern will match
only the digits after the decimal point and the script will report the
wrong value. See 'perldoc perlre' for information about regular
expressions and 'perldoc perlvar' about Perl's special variables.
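
A minimal sketch of the two-step test described above, assuming the page has already been fetched and is processed one line at a time. The sub name humidity_from_line and the sample lines are made up for illustration and are not part of weather4.pl; the sketch also uses a capture group in place of $& and chop.

```perl
use strict;
use warnings;

# Hypothetical helper: returns the humidity digits from one line of HTML,
# or undef if the line does not look like the humidity row.
sub humidity_from_line {
    my ($line) = @_;

    # Step 1: pick the line. It must contain a percent sign and "obsInfo2",
    # and must not contain WIDTH (which also carries a percent sign).
    return
        unless $line =~ /%/ && $line =~ /obsInfo2/ && $line !~ /WIDTH/;

    # Step 2: capture 1 to 3 digits that sit directly in front of a "%".
    # The parentheses put the digits in $1, so no chop() is needed.
    return $line =~ /([0-9]{1,3})%/ ? $1 : undef;
}

# Sample lines (assumed input, modelled on the HTML quoted in this thread).
my @lines = (
    '<TABLE BORDER=0 CELLPADDING=0 WIDTH=100% CELLSPACING=0>',
    '<TD ALIGN=LEFT VALIGN=TOP CLASS=obsInfo2>40%</TD></TR>',
    '<TD ALIGN=LEFT VALIGN=TOP CLASS=obsInfo2>40.5%</TD></TR>',
);

for my $line (@lines) {
    my $humidity = humidity_from_line($line);
    print defined $humidity ? "Humidity: $humidity\n" : "no match\n";
}
```

The third sample line shows the decimal-point problem: the pattern finds only the "5%" at the end of "40.5%", so the reported value is 5.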

If you want to do a good job extracting information from web pages, you
should be using an HTML parser and not regular expressions. Check out
HTML::Parser, or look on www.cpan.org for HTML Table modules. You
should be looking at both columns in this table to figure out what
values are being displayed on the page.
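
To illustrate the point about reading both columns, here is a rough sketch that pairs each label cell (CLASS=obsInfo1) with the value cell (CLASS=obsInfo2) that follows it. It is still a regular expression, not a real HTML parser, so it is only a stand-in that runs with core Perl; the sub name observations and the sample HTML are assumptions for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: maps each label (text of an obsInfo1 cell, without
# the trailing colon) to the text of the obsInfo2 cell that follows it.
sub observations {
    my ($html) = @_;
    my %value_for;

    # /s lets "." match newlines, so a value that wraps onto the next line
    # is still captured; /g repeats the match for every table row;
    # /x allows the pattern to be split across lines for readability.
    while ( $html =~ m{CLASS=obsInfo1[^>]*>\s*([^<:]+):\s*</TD>\s*
                       <TD[^>]*CLASS=obsInfo2[^>]*>(.*?)</TD>}gisx ) {
        my ( $label, $value ) = ( $1, $2 );
        $value =~ s/&nbsp;/ /g;    # treat non-breaking spaces as spaces
        $value =~ s/\s+/ /g;       # collapse line breaks and runs of spaces
        $value_for{$label} = $value;
    }
    return \%value_for;
}

# Assumed input: the whole page in one string, not one line at a time.
my $html = <<'HTML';
<TR><TD ALIGN=LEFT VALIGN=TOP CLASS=obsInfo1>Humidity:</TD>
<TD ALIGN=LEFT VALIGN=TOP CLASS=obsInfo2>40%</TD></TR>
<TR><TD ALIGN=LEFT VALIGN=TOP CLASS=obsInfo1>Pressure:</TD>
<TD ALIGN=LEFT VALIGN=TOP CLASS=obsInfo2>29.79 inches and
rising</TD></TR>
HTML

my $obs = observations($html);
print "$_ => $obs->{$_}\n" for sort keys %$obs;
```

Looking a value up by its label ("Pressure", "Humidity") avoids guessing which row is which from the units, but the pattern still depends on the exact markup of the page and will need adjusting if the site changes it.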
 

Gavin Williams

Your $& is a special perl variable that holds the string matched by
the last successful pattern match, which in your example is
/[0-9]{1,3}\%/. That pattern basically says "match a number 1 to 3
digits long followed by a '%' character".

Maybe an easier way of writing that same section of code would be:

# true if $_ contains "CLASS=obsInfo2>" followed by a 1-3 digit number and a
# "%", concluded by a "</TD>"

if ( /CLASS=obsInfo2>([0-9]{1,3}\%)<\/TD>/i ) {
print "Humidity: $+\n" ;
}

# Note that I had to use \ to "quote" the / in </TD> or it would have been
# interpreted as the end of the pattern
# Also used an "i" after the pattern to make the match case insensitive

# "$+" is another special perl variable; it returns the text matched by the
# last ( ) group of the last successful match (with a single group, as here,
# that is the same as $1)
# "$&" returns the entire matched string
# "$`" returns everything before the matched string
# "$'" returns everything after the matched string
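
A small sketch of what those special variables hold after a successful match, assuming one line of the HTML quoted in this thread. The sub name match_parts is made up for illustration. Unlike the pattern above, the "%" sits outside the parentheses here, so the capture group holds only the digits.

```perl
use strict;
use warnings;

# Hypothetical helper: runs the pattern against one line and returns the
# contents of the match variables, or nothing if the line does not match.
sub match_parts {
    my ($line) = @_;
    return unless $line =~ /CLASS=obsInfo2>([0-9]{1,3})%<\/TD>/i;

    # Copy the match variables right away: the next successful match
    # overwrites them.
    my %parts = (
        before     => $`,    # everything before the matched string
        match      => $&,    # the entire matched string
        after      => $',    # everything after the matched string
        group1     => $1,    # text matched by the first ( ) group
        last_group => $+,    # same as $1 here, since there is one group
    );
    return \%parts;
}

my $line  = '<TD ALIGN=LEFT VALIGN=TOP CLASS=obsInfo2>40%</TD></TR>';
my $parts = match_parts($line);
print "$_: $parts->{$_}\n" for qw(before match after group1 last_group);
```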

To get pressure, you might add:

# true if $_ contains the string "inches", and uses ".*" as a wildcard match
# for the text we want to return
# (the closing </TD> must be on the same line as the value for this to match)

if ( /inches/i && /CLASS=obsInfo2>(.*)<\/TD>/i ) {
print "Pressure: $+\n" ;
}
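
Since the goal is a number for MRTG and (.*) returns the whole cell text, one possible follow-up step is to pull the first number out of that text. This is only a sketch; the sub name first_number and the sample strings are assumptions for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: returns the first integer or decimal number found in
# a piece of cell text, or undef when the text contains no digits.
sub first_number {
    my ($text) = @_;
    return $text =~ /(-?\d+(?:\.\d+)?)/ ? $1 : undef;
}

# Sample cell texts modelled on the table quoted in this thread, plus one
# made-up value ("Calm") that contains no number at all.
for my $cell ( '29.79 inches and rising', '40%', '10.0 miles', 'Calm' ) {
    my $n = first_number($cell);
    printf "%-25s -> %s\n", $cell, defined $n ? $n : 'no number';
}
```

Returning undef instead of an empty string lets the caller tell "no reading found" apart from a real value of 0.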




 

Ryan Haskell

Gavin Williams said:
Your $& is a special perl variable that holds the string matched by
the last successful pattern match, which in your example is
/[0-9]{1,3}\%/. That pattern basically says "match a number 1 to 3
digits long followed by a '%' character".

Maybe an easier way of writing that same section of code would be:

# true if $_ contains "CLASS=obsInfo2>" followed by a 1-3 digit number and a
# "%", concluded by a "</TD>"

if ( /CLASS=obsInfo2>([0-9]{1,3}\%)<\/TD>/i ) {
print "Humidity: $+\n" ;
}
<snip>

Thanks for the help everyone. After much trial and error I've managed
to produce a working script with which I've been successful in
obtaining the info I need. It really wasn't that hard after I
researched regular expressions for a while. I've come up with some
other complications now that I know how to actually get the data, but
I'll try to figure those out on my own...

Ryan
 
