Regex...HTML::Parser...Getting webpage data?

Wesley Bresson · Aug 3, 2006

I'm pretty new to Perl. My past experience has been in modifying other
people's code to do what I want, but now I'm trying to write my own to
do a specific task that I can't find code for, and I'm having issues. I
am trying to retrieve data from a webpage, say
http://www.apmex.com/shop/buy/Silver_American_Eagles.asp?orderid=0 for
example, the price of a 2006 1oz Silver American Eagle in the 20-99
price break quantity. Should I use Regex to do that, or would I be
better off with HTML::Parser? I've attempted Regex since I seem to
understand it better, but haven't had much success in getting it to
pull the right price. HTML::Parser I understand even less than Regex,
but I've read that it's a more reliable way of pulling webpage data. I
can't seem to find easy-to-understand documentation on it though, so
I'm even farther away from getting it to work than Regex. Any advice?

Paul Lalli · Aug 3, 2006

Wesley said:
I'm pretty new to Perl. My past experience has been in modifying other
people's code to do what I want, but now I'm trying to write my own to
do a specific task that I can't find code for, and I'm having issues. I
am trying to retrieve data from a webpage, say
http://www.apmex.com/shop/buy/Silver_American_Eagles.asp?orderid=0 for
example, the price of a 2006 1oz Silver American Eagle in the 20-99
price break quantity. Should I use Regex to do that

No. Regular Expressions are notoriously unable to parse "real" HTML.

or would I be better off with HTML::Parser?

Well, you'd be better than with Regular expressions...

I've attempted Regex since I seem to understand it better,
but haven't had much success in getting it to pull the right price.
HTML::Parser I understand even less than Regex

I agree. I don't like HTML::Parser's interface at all. I suggest you
give HTML::TokeParser a shot, though. After a few tries, I'm generally
able to get it to do what I want. I find the interface much more
understandable than HTML::Parser's.

Good luck,
Paul Lalli
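
For readers who want to see what the HTML::TokeParser suggestion might
look like in practice, here is a minimal sketch. It assumes the
HTML-Parser distribution is installed. The markup is made up for
illustration, since the thread never shows the real page source, and the
helper name price_for is hypothetical; the 14.45 figure is a placeholder.
The sketch walks the table cells in order and reads the cell that follows
the wanted price break.

```perl
use strict;
use warnings;
use HTML::TokeParser;

# Hypothetical markup: the real page layout is not shown in the thread.
my $html = <<'HTML';
<table>
  <tr><td>2006 1oz Silver American Eagles</td></tr>
  <tr><td>1 - 19</td><td>$14.45</td></tr>
  <tr><td>20 - 99</td><td>$13.95</td></tr>
</table>
HTML

# Return the price in the cell right after the $break cell, looking only
# at cells that come after the $product cell. Returns undef on failure.
sub price_for {
    my ($doc, $product, $break) = @_;
    my $p = HTML::TokeParser->new(\$doc) or die "cannot parse document";
    my $in_product = 0;
    while (my $tag = $p->get_tag('td')) {
        my $text = $p->get_trimmed_text('/td');
        if ($text eq $product) { $in_product = 1; next; }
        next unless $in_product and $text eq $break;
        # Assumption: the price sits in the very next cell.
        $p->get_tag('td') or return undef;
        my $cell = $p->get_trimmed_text('/td');
        return $cell =~ /^\$(\d+\.\d\d)$/ ? $1 : undef;
    }
    return undef;
}

my $price = price_for($html, '2006 1oz Silver American Eagles', '20 - 99');
print defined $price ? "$price\n" : "not found\n";
```

Unlike a regex over the raw text, this refuses to return anything when
the cell after the price break does not look like a price.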

Wesley Bresson · Aug 3, 2006

.....

I agree. I don't like HTML::Parser's interface at all. I suggest you
give HTML::TokeParser a shot, though. After a few tries, I'm generally
able to get it to do what I want. I find the interface much more
understandable than HTML::Parser's.

Good luck,
Paul Lalli

Thanks, I'll look into that. It looks like my provider does have it
installed (http://links.1and1faqs.com/perldiver.cgi), so I'll start
looking up documentation on it.

xhoster · Aug 3, 2006

Wesley Bresson said:
I'm pretty new to Perl. My past experience has been in modifying other
people's code to do what I want, but now I'm trying to write my own to
do a specific task that I can't find code for, and I'm having issues. I
am trying to retrieve data from a webpage, say
http://www.apmex.com/shop/buy/Silver_American_Eagles.asp?orderid=0 for
example, the price of a 2006 1oz Silver American Eagle in the 20-99
price break quantity.

What do you mean by "say" and "for example"? Are all the examples going
to be extremely similar to that one, or not? If not, I don't think there
is a magic bullet for you.

Should I use Regex to do that, or would I be better off with
HTML::Parser?

If I just wanted to parse that page every day to see how the price changes,
I would do it with a regex. If you want to parse a lot of pages that are
kind of, but not exactly, like that, then I would probably use some kind
of HTML parsing module.

I've attempted Regex since I seem to understand it better, but
haven't had much success in getting it to pull the right price.

$ perl -0777 -lne 's/\s+/ /g;
/2006 1oz Silver American Eagles.+?20 - 99.*?\$(\d{1,5}\.\d\d)/
and print "$1\n";' Silver_American_Eagles.html

13.95

If 20 - 99 is no longer offered for 2006 1oz Silver American Eagles, but
is for something further down on the list, you will get the price for that
thing further down on the list. Similarly, if the price for 20 - 99 is
somehow malformed, the regex will silently move on to the next price that
is formatted the way it expects, and report that one.

Xho
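
The failure mode described above can be reproduced with a small sketch.
The page text here is made up (only 13.95 comes from the thread), and the
helper name price_20_99 is hypothetical; the pattern itself is the one
from the post.

```perl
use strict;
use warnings;

# The pattern from the post, wrapped in a sub (hypothetical helper name).
sub price_20_99 {
    my ($text) = @_;
    $text =~ s/\s+/ /g;
    return $text =~ /2006 1oz Silver American Eagles.+?20 - 99.*?\$(\d{1,5}\.\d\d)/
        ? $1 : undef;
}

# Made-up page text: in the second string the 2006 item has lost its
# 20 - 99 price break, but a later item still has one.
my $normal  = "2006 1oz Silver American Eagles 1 - 19 \$14.45 20 - 99 \$13.95";
my $missing = "2006 1oz Silver American Eagles 1 - 19 \$14.45 "
            . "2005 1oz Silver American Eagles 20 - 99 \$13.75";

print price_20_99($normal),  "\n";   # 13.95
print price_20_99($missing), "\n";   # 13.75, a price for the wrong item
```

The second call succeeds without any warning, which is exactly why the
result needs a sanity check if the page layout can change.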

Wesley Bresson · Aug 3, 2006

What do you mean by "say" and "for example"? Are all the examples going
to be extremely similar to that one, or not? If not, I don't think there
is a magic bullet for you.

If I just wanted to parse that page every day to see how the price changes,
I would do it with a regex. If you want to parse a lot of pages that are
kind of, but not exactly, like that, then I would probably use some kind
of HTML parsing module.

$ perl -0777 -lne 's/\s+/ /g;
/2006 1oz Silver American Eagles.+?20 - 99.*?\$(\d{1,5}\.\d\d)/
and print "$1\n";' Silver_American_Eagles.html

13.95

If 20 - 99 is no longer offered for 2006 1oz Silver American Eagles, but
is for something further down on the list, you will get the price for that
thing further down on the list. Similarly, if the price for 20 - 99 is
somehow malformed, the regex will silently move on to the next price that
is formatted the way it expects, and report that one.

Xho

By "say" and "for example" I mean that yes, that is one page that I want
to start on, but there are others that would be nice also once that one
is figured out. I tried your code for this page and it errored out, but
I'm assuming it's either my Windows Perl that is messing it up or extra
spaces in the copy/paste. I saved the page to the same dir that I was
running from, but no go. I'll look at it more later. Thanks for your
help.

C:\Users\Me\Desktop>perl -0777 -lne 's/\s+/ /g;/2006 1oz Silver American Eagles.+?20 - 99.*?\$(\d{1,5}\.\d\d)/and print "$1\n";' Silver_American_Eagles.html
Can't find string terminator "'" anywhere before EOF at -e line 1.

xhoster · Aug 3, 2006

Wesley Bresson said:
I tried your code for this page and it errored out, but
I'm assuming it's either my Windows Perl that is messing it up or extra
spaces in the copy/paste. I saved the page to the same dir that I was
running from, but no go. I'll look at it more later. Thanks for your
help.

C:\Users\Me\Desktop>perl -0777 -lne 's/\s+/ /g;/2006 1oz Silver American Eagles.+?20 - 99.*?\$(\d{1,5}\.\d\d)/and print "$1\n";' Silver_American_Eagles.html
Can't find string terminator "'" anywhere before EOF at -e line 1.

On Windows, you need to wrap your -e program in double quotes rather
than single quotes, which means you need to change any double quotes
occurring inside the script to something else, like qq'$1\n'

Or just put the program into a file.

#!/usr/bin/perl
use strict;
use warnings;
$/=undef; # same as the -0777 command line
$_=<>; # slurp
s/\s+/ /g;
/2006 1oz Silver American Eagles.+?20 - 99.*?\$(\d{1,5}\.\d\d)/
and print "$1\n";
__END__

Wesley Bresson · Aug 4, 2006

On Windows, you need to wrap your -e program in double quotes rather
than single quotes, which means you need to change any double quotes
occurring inside the script to something else, like qq'$1\n'

Or just put the program into a file.

#!/usr/bin/perl
use strict;
use warnings;
$/=undef; # same as the -0777 command line
$_=<>; # slurp
s/\s+/ /g;
/2006 1oz Silver American Eagles.+?20 - 99.*?\$(\d{1,5}\.\d\d)/
and print "$1\n";
__END__

Thanks, I can see that works now. Now, hang in with a newbie, but I'm
trying to understand why exactly your code works.

$/=undef --- inputs the whole file instead of one line at a time,
correct? Why is it needed?

$_=<> --not sure... is this what inputs the file off of the command
line?

s/\s+/ /g; --not sure... is this taking out the white spaces? If so, why
is it needed?

Paul Lalli · Aug 4, 2006

Thanks, I can see that works now. Now, hang in with a newbie, but I'm
trying to understand why exactly your code works.

$/=undef --- inputs the whole file instead of one line by one line
correct ?

Not quite. $/ is the "input record separator" variable. It determines
what makes a "line" when the readline operator is used. As you
surmised, setting it to undef makes the entire file the "line". But
this doesn't do any reading of the file by itself.

Why is it needed ?

Because HTML can have linebreaks in it. You need to match the entire
string - you can't process the file "line by line" because the pattern
you're searching for could span more than one line.

$_=<> --not sure....is this what inputs the file off of the command
line ?

Yes. This is what does the reading. Ordinarily, this would read one
line from either the first file given on the command line, or if there
are none, from STDIN. However, because the $/ variable was set to
undef earlier, this reads the entire file, and puts it into $_.
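
A small sketch of the difference $/ makes, using an in-memory filehandle
so that no file on disk is needed. The sample data is made up, and
localizing $/ inside a do block is just one common way to write the
slurp; the script in the thread sets $/ globally instead.

```perl
use strict;
use warnings;

my $data = "line one\nline two\nline three\n";

# Default $/ is "\n": one readline returns one line.
open my $fh, '<', \$data or die "open: $!";
my $first = <$fh>;
close $fh;

# With $/ undefined, one readline returns everything that is left.
open $fh, '<', \$data or die "open: $!";
my $all = do { local $/; <$fh> };
close $fh;

print length($first), "\n";   # 9
print length($all),   "\n";   # 29
```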

s/\s+/ /g; --not sure...is this taking out the white spaces ?

Again, not quite. It's replacing each sequence of one or more
whitespace characters (which includes space characters, newlines, and
tabs) with one single space character. It does so for every
sequence of one or more whitespace characters it finds (that's the /g
part).

If so why is it needed ?

Because your pattern is specifically looking for one space to separate
each "word". Just because that's how it appears in the browser does
not mean that's how it's formatted in the HTML. HTML doesn't care
about whitespace. So it's possible there are actually 5 newlines and 3
tabs between two words where your pattern is expecting a single
space.

Paul Lalli
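
To make the whitespace point concrete, here is a short sketch with a
made-up fragment in which the words are separated by newlines, tabs, and
runs of spaces. The helper name matches_name is hypothetical.

```perl
use strict;
use warnings;

# Made-up fragment: renders as one line in a browser, but the source
# has newlines, a tab, and multiple spaces between the words.
my $raw = "2006 1oz\n\n\t Silver   American\r\nEagles";

# Copy the string, then collapse every run of whitespace to one space.
(my $flat = $raw) =~ s/\s+/ /g;

# Hypothetical helper: does the text contain the name with single spaces?
sub matches_name {
    my ($text) = @_;
    return $text =~ /2006 1oz Silver American Eagles/ ? 1 : 0;
}

print "$flat\n";                   # 2006 1oz Silver American Eagles
print matches_name($raw),  "\n";   # 0
print matches_name($flat), "\n";   # 1
```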

Wesley Bresson · Aug 4, 2006

Jim said:
To read the entire file at once and process it, after getting rid of
newlines, since the text you are looking for may be on more than one
line.

The <> is the input operator and returns the result of a read-line
operation. Since $/ is undef, your input file is treated as one big
line, and the whole file ends up as a string in the $_ variable.

It is changing all occurrences of whitespace (\s) to a single space,
collapsing any successive whitespace characters (\s+) into one space
character. Since newlines (\n) are whitespace, this also removes all
newlines from your string, and you can use space characters in your
regular expression.

Thanks people, I'm slowly getting it. Examples help a lot compared to
hard-to-read documentation. If anyone knows of any good regex
documentation or books listing all of the options and variables, let me
know and I'd appreciate it. I'm working on fully understanding how this
code works for this page first, and then I'll move on to some others of
my own and see how I do there. Thanks for your help.

Paul Lalli · Aug 4, 2006

Wesley said:
Thanks people, I'm slowly getting it. Examples help a lot compared to
hard-to-read documentation. If anyone knows of any good regex
documentation or books listing all of the options and variables, let me
know and I'd appreciate it. I'm working on fully understanding how this
code works for this page first, and then I'll move on to some others of
my own and see how I do there. Thanks for your help.

perldoc perlretut
perldoc perlre
perldoc perlreref
(in that order)

If you don't like the built-in documentation for some reason, I suggest
Mastering Regular Expressions (
http://www.oreilly.com/catalog/regex3/index.html ). Note that it
covers more than just Perl regular expressions...

Paul Lalli
