Can this be done (by a noob :))

Thomas Andersson · Jul 30, 2010

I have set myself the task of creating a script that can collect data from
web pages and insert it into a MySQL database. I'm a complete noob at this,
though, and not even sure what language I need (to learn), but I think Perl
might be it. What I ask now is not for you to tell me how to do it, only
whether it's feasible or I'm barking up the wrong tree (pointers on where to
find relevant information are welcome, though).

The first step would be to export a list of pids to be processed, each paired
with the last sid processed for that pid.
The script would read the list and set the first pid in the list as current.
The next step would be for it to add the current pid to a URL and load that
page, which contains a list.
From this page a list of sids needs to be collected until I hit the "last
processed" one. These might be spread over several pages, so it needs to keep
going either until it finds "last processed" or there are no further pages to
load (a failure, I guess).

Next comes the new sid list created in the previous step; each sid needs to
be processed and data collected.
Some basic data is collected from each sid page, and then 2 possible (but not
always existent) lists.
The basic data collected for the sid contains two values to be set as
variables; these decide how many data blocks need to be collected lower
down on the page.
Go to the first type of block, collect the data I want and repeat as many
times as the variable says.
Go to the second type of block and repeat.

Store the data collected so far in a text file named after the pid. It
should contain 4 sections of data to be inserted into 4 databases:
First section: update the pid with the new last processed sid.
Second section: add sids with info to DB.
Third section: add the data from type 1 blocks on sid pages to DB.
Fourth section: add the data from type 2 blocks on sid pages to DB.

Close the file, load the next pid from the list and repeat the process until
the pid list is empty.

I guess a bonus at the end would be if it could also insert all the data
collected into the DB as well.

Is this something Perl would be suitable for, or is there a better choice?
My system is Win 7 64-bit btw, running MySQL 5.1.

TIA
Thomas

RedGrittyBrick · Jul 30, 2010

Thomas Andersson said:
I have set myself the task of creating a script that can collect data from
web pages and insert it into a MySQL database. I'm a complete noob at this,
though, and not even sure what language I need (to learn), but I think Perl
might be it. What I ask now is not for you to tell me how to do it, only
whether it's feasible or I'm barking up the wrong tree (pointers on where to
find relevant information are welcome, though).

It is feasible using Perl.

Other languages have subroutine libraries. Perl has "modules" for
handling specific tasks. Some modules are "core modules" that are
included with a normal Perl installation. Other modules can be found in
an online repository called CPAN. You can search it at
http://search.cpan.org/

You will need a module for fetching web pages and for extracting data
from the retrieved HTML.

You will need a module for working with MySQL. Perl's Database Interface
module is called DBI. See http://dbi.perl.org/

I confess I can't fully follow your description but I didn't notice
anything that would be difficult using Perl.

I suggest you start with a Perl script that just fetches a web page. If
you have problems, try to reproduce the problem in the smallest possible
Perl program and post that here with a short description of what you
expected to happen and what actually happened (cut & paste messages
rather than re-typing them).

There's a Posting FAQ posted regularly in this newsgroup, it is worth
reading.
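
A minimal sketch of that first step, assuming the CPAN module LWP::Simple is installed. The names fetch_page and describe_fetch, and the $getter hook that lets the sketch run without a network connection, are made up for illustration and are not from this thread.

```perl
use strict;
use warnings;

# Fetch a page body; returns undef on failure.
# $getter is a hypothetical hook (not part of any module) for offline use;
# by default LWP::Simple's get() is loaded on demand.
sub fetch_page {
    my ($url, $getter) = @_;
    $getter ||= do { require LWP::Simple; \&LWP::Simple::get };
    return $getter->($url);
}

# Turn the result of one fetch into a one-line report.
sub describe_fetch {
    my ($url, $getter) = @_;
    my $html = fetch_page($url, $getter);
    return defined $html
        ? sprintf('Fetched %d characters from %s', length($html), $url)
        : "Could not fetch $url";
}

# Only touches the network when a URL is given on the command line.
print describe_fetch($ARGV[0]), "\n" if @ARGV;
```

Run it with a URL as the only argument. If it prints the "Could not fetch" line, that message is the kind of thing worth pasting into a follow-up post.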

RedGrittyBrick · Jul 30, 2010

Oh yes, http://learn.perl.org/

Thomas Andersson · Jul 31, 2010

Hmm, been playing around a bit and gotten further than I had thought.
I open a file and read in the next web page to be processed (an ID number) and
set up the page count to 1 (each ID to process can have any number of
pages).
I create my URL from page count and current ID (pid)
The idea I have is that it will loop as long as there is a page to grab by
increasing the page count (this plan was flawed I realised though, but
that's another problem).
As it is now it keeps grabbing the same page over and over thousands of
times (creating new files for each loop).

#Create URL for sid list from pid and page count.
my $pcnt = 1;
my $page = get
"http://csr.wwiionline.com/scripts/services/persona/sorties.jsp?page=$pcnt&pid=$pid";
while ($page) {
if ($page) {
print "Site is alive\n";
}
else {
print "Site is not accessible\n";
};

#Create filename and write file, then save grabbed webpage into it.
open FILE, ">", "c:\\scr\\$pid-pg$pcnt.txt" or die $!;
print FILE $page;
$pcnt += 1;
};

I guess the URL doesn't get updated by the increased pagecount, any
suggestions on how to fix that part?

Thomas Andersson · Jul 31, 2010

Sherm said:
I'd put the "base" URL in a separate variable, to avoid repetition:
my $base =
'http://csr.wwiionline.com/scripts/services/persona/sorties.jsp';

Excellent idea, just realised that the links I will collect from the page
also use the same base. Thanks for the examples, they help me a lot!

The if() is redundant here; if $page is false, the while() will exit
and the if() won't be reached.

Sorry, didn't quite get what you were saying here?
One problem I've realised that kinda breaks this is that if you just up the
page count it will never fail and exit, as you just keep getting empty sortie
pages back with an ever higher page number (there's a string "No more
sorties found" on them, though, that I guess could be detected and used to
exit the loop).

You can use forward slashes on Windows too - it's only the command
shell (aka "DOS Box") that requires backslashes. Also, it's a good
idea to include the filename you're trying to open when reporting an
error, because that can help you figure out why it failed.

Ah, didn't realize, good to know, will definitely follow your suggestion
(might as well pick up good habits early on).
Thanks for your good advice, I really appreciate it (and will likely come
back time and again for more).

Best Wishes
Thomas

Thomas Andersson · Jul 31, 2010

Also: get into the habit, now, of keeping your filehandles in proper
variables. It will make life easier later.

open my $FILE, ">", "..." or ...;

Will definitely try to pick up good habits on coding and formatting, so
thanks for the advice.
But if I create a variable for the filehandle like this, won't it contain the
file path, so that when I do the print $FILE it will print the file path
instead of the content of the file as I want? Or am I misunderstanding?
(quite likely).

Best Wishes
Thomas
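
A small sketch of what the lexical filehandle does, using only core Perl. The sub name save_page is made up for illustration. The variable holds the open handle rather than the path, so printing to it sends the content into the file.

```perl
use strict;
use warnings;

# Write $content to $path through a lexical filehandle.
# $fh is the open handle, not the path: "print {$fh} ..." writes into the
# file, and the path only appears in the error messages.
sub save_page {
    my ($path, $content) = @_;
    open my $fh, '>', $path or die "Could not open $path: $!";
    print {$fh} $content;
    close $fh or die "Could not close $path: $!";
    return $path;
}
```

The braces around $fh are optional here. They just make it obvious that the first thing after print is the destination, not part of the data.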

Jürgen Exner · Jul 31, 2010

Thomas Andersson said:
As it is now it keeps grabbing the same page over and over thousands of
times (creating new files for each loop).

my $pcnt = 1;
my $page = get
"http://csr.wwiionline.com/scripts/services/persona/sorties.jsp?page=$pcnt&pid=$pid";
while ($page) {
if ($page) {
print "Site is alive\n";
}
else {
print "Site is not accessible\n";
};

#Create filename and write file, then save grabbed webpage into it.
open FILE, ">", "c:\\scr\\$pid-pg$pcnt.txt" or die $!;
print FILE $page;
$pcnt += 1;
};

I guess the URL doesn't get updated by the increased pagecount, any
suggestions on how to fix that part?

It may or it may not. Had you used better indentation then you might
have spotted that your get() is outside of the loop, therefore it is
executed only once, therefore the value of $page never changes, and
therefore of course your loop never terminates because the loop
condition will always be the same value as in the first test.

jue
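
A sketch of the same loop with the fetch moved inside it, so every pass builds a fresh URL. To keep it runnable without a network, the fetcher is passed in as a code reference ($fetch stands in for LWP::Simple's get). The sub name and the $max_pages safety cap are assumptions, not something from this thread.

```perl
use strict;
use warnings;

# Request page 1, 2, 3, ... for one pid. The fetch happens inside the
# loop, so $page changes on every pass.
# $fetch: code reference taking a URL and returning the body or undef.
# $max_pages: assumed safety cap so the loop always terminates.
sub fetch_all_pages {
    my ($fetch, $base, $pid, $max_pages) = @_;
    my @pages;
    my $pcnt = 1;
    while ($pcnt <= $max_pages) {
        my $page = $fetch->("$base?page=$pcnt&pid=$pid");
        last unless defined $page;    # fetch failed: stop looping
        push @pages, $page;
        $pcnt++;
    }
    return @pages;
}
```

With LWP::Simple loaded, the call would look like fetch_all_pages(\&get, $base, $pid, 50). That only stops early if the server eventually fails a request.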

Thomas Andersson · Jul 31, 2010

Using the suggestions from here I've rewritten it a bit, and now it works as
far as grabbing additional pages and storing them. Now I just need to figure
out how to make it exit the loop under either of two conditions (found a
processed link or reached the end of the pages).
Eventually an additional loop needs to be inserted to process the subpages we
collect the links for in these pages. (The plan is to build lists of links
from these pages and then collect data from those pages, which in turn
contain lists with a variable number of data items.)

# Define some variables.
my $pbase =
'http://csr.wwiionline.com/scripts/services/persona/sorties.jsp';
my $pcnt = 1;
my $pidfile = 'c:/scr/pidlist.txt';
# Open list of pid's and set first one as current pid.
open PIDLIST, "<", $pidfile or die "Could not open $pidfile: $!";
my $pid = <PIDLIST>;
print $pid; # print just so we know we have a pid to process.
chomp $pid; # Remove endline from pid.
#Create URL for sid list from pid and page count.
my $page = get "$pbase?page=$pcnt&pid=$pid";
while ($page) {
# Create file for storing pages containing the sids.
my $tmpf = "c:/scr/$pid.txt";
open TEMPF, ">>", $tmpf or die "Could not open $tmpf: $!";
print TEMPF $page; # Store grabbed webpage into the file
$pcnt += 1; # Update page number for next grab.
$page = get "$pbase?page=$pcnt&pid=$pid"; # Grab next page.
};

Uri Guttman · Jul 31, 2010

TA> Using the suggestions from here I've rewritten it a bit, and now it
TA> works as far as grabbing additional pages and storing them. Now I
TA> just need to figure out how to make it exit the loop under either
TA> of two conditions (found a processed link or reached the end of
TA> the pages). Eventually an additional loop needs to be inserted to
TA> process the subpages we collect the links for in these pages. (The
TA> plan is to build lists of links from these pages and then collect
TA> data from those pages, which in turn contain lists with a variable
TA> number of data items.)

so you need to put some conditionals in the loop. first, how would you
know when the pages are done? can you look for a link to the next page
and exit the loop if it isn't there? then define what a 'processed link'
is. keep track (likely in a hash) of processed links and if you find one
exit the loop. exiting a loop is easy, use the last function.

TA> # Define some variables.

use fewer comments. make your comments mean something outside the
code. code is what, comments are why. and you are writing code to be
read by a maintainer. always keep that person in your mind and your code
will be better for it.

TA> my $pbase =
TA> 'http://csr.wwiionline.com/scripts/services/persona/sorties.jsp';
TA> my $pcnt = 1;
TA> my $pidfile = 'c:/scr/pidlist.txt';
TA> # Open list of pid's and set first one as current pid.

have you ever heard of white space? jamming lines of code together makes
major migraines when reading it. loosen up a little. blank lines between
sections is a good idea.

TA> open PIDLIST, "<", $pidfile or die "Could not open $pidfile: $!";
TA> my $pid = <PIDLIST>;
TA> print $pid; # print just so we know we have a pid to process.

comments on the code line are a poor idea in most cases. when they are
long comments it is a horrible idea.

TA> chomp $pid; # Remove endline from pid.

again, you are telling us what you just did. redundant to anyone who
knows what chomp is.

TA> #Create URL for sid list from pid and page count.

this is actually getting the page AND building the url.

TA> my $page = get "$pbase?page=$pcnt&pid=$pid";
TA> while ($page) {

bah. it is not clear why you are testing page in the loop. and you have
two duplicate lines with the get. make it an infinite loop and exit when
the get fails.

TA> # Create file for storing pages containing the sids.
TA> my $tmpf = "c:/scr/$pid.txt";
TA> open TEMPF, ">>", $tmpf or die "Could not open $tmpf: $!";
TA> print TEMPF $page; # Store grabbed webpage into the file

you can do that with getstore or use File::Slurp's write_file (from cpan).

use File::Slurp ;

write_file( "c:/scr/$pid.txt", $page ) ;

much easier to read.

TA> $pcnt += 1; # Update page number for next grab.
TA> $page = get "$pbase?page=$pcnt&pid=$pid"; # Grab next page.
TA> };

here is a better loop:

while( 1 ) {

    my $page = get "$pbase?page=$pcnt&pid=$pid";
    last unless $page ;
    write_file( "c:/scr/$pid.txt", $page ) ;
    $pcnt++ ;
}

short, easy to read, easy to maintain. now you can add in the checks for
exiting the loop and it will be easier.

uri

Thomas Andersson · Jul 31, 2010

Sherm said:
You had originally written something like this:

while ($page) {
if ($page) {
# do stuff
} else {
}
}

Since the while() loop repeats only if $page evaluates to a true
value, you don't need to check $page again with an if(). If $page is
false, the body of the loop will not execute at all, so by the time
you reach the line that the if() is on, you already know that $page
is true. So, the if() block will always run, and the else block never
will; that being the case, it's simpler to just omit the if():

Ah, I realized that afterwards while looking over the code. That if/else bit
was a leftover from an example script I found and is now gone, as it serves
no purpose in my script. The next thing I need to add is a check for the exit
conditions. Using $page as the condition might be a bad idea, so how about
checking for a signal variable instead? Inside the loop, code would run until
my exit conditions are met and then set the signal variable, telling the loop
to end. The two conditions would be finding either of two strings within the
captured page (either a sid we already know or the string "No more sorties").

Thomas Andersson · Jul 31, 2010

so you need to put some conditionals in the loop. first, how would you
know when the pages are done? can you look for a link to the next page
and exit the loop if it isn't there? then define what a 'processed
link' is. keep track (likely in a hash) of processed links and if you
find one exit the loop. exiting a loop is easy, use the last function.

They've been quite helpful there, as the empty pages contain the string "No
more sorties". The other condition is trickier: I need to load a variable at
the same time as the pid that tells the last processed sid, and when that sid
is found no further pages need to be loaded (the whole point of capturing
these list pages is so we can extract all sids we find in them for further
processing).

use fewer comments. make your comments mean something outside the
code. code is what, comments are why. and you are writing code to be
read by a maintainer. always keep that person in your mind and your
code will be better for it.

Well, I only started learning perl a day ago and the comments are mostly for
my own sake to remind me what I'm doing as most of this stuff is still
pretty voodoo to me.

have you ever heard of white space? jamming lines of code together
makes major migraines when reading it. loosen up a little. blank
lines between sections is a good idea.

Roger that, will do.

comments on the code line are a poor idea in most cases. when they are
long comments it is a horrible idea.

OK, will stop doing that then.

again, you are telling us what you just did. redundant to anyone who
knows what chomp is.

OK, but as I said before, I'm learning and those comments are only for my
own information to help me learn. Once it's done I can go over it, remove
all those comments and put something more useful in.

bah. it is not clear why you are testing page in the loop. and you
have two duplicate lines with the get. make it an infinite loop and
exit when the get fails.

Yeah, that's a big bug in my code and I know about it. The idea was to
keep loading pages until there were no more, but that idea failed as the
server keeps serving empty pages with ever higher page numbers. Another
way of ending the loop is needed, and I have two conditions that should
each end it.

you can do that with getstore or use File::Slurp's write_file (from
cpan).

use File::Slurp ;

write_file( "c:/scr/$pid.txt", $page ) ;

much easier to read.

Definitely, so that one call replaces all 3 of my lines? But will I get an
error message like the previous one if it fails?

here is a better loop:

while( 1 ) {

    my $page = get "$pbase?page=$pcnt&pid=$pid";
    last unless $page ;
    write_file( "c:/scr/$pid.txt", $page ) ;
    $pcnt++ ;
}

short, easy to read, easy to maintain. now you can add in the checks
for exiting the loop and it will be easier.

Hmm, as I'm a noob I don't quite get it, but I think it's along the lines I
mentioned in another message. I assume a non-failure signals 1? And I need
to set it to anything else inside the loop to exit it? But what do I set? It
has no variable name.

Thomas Andersson · Jul 31, 2010

Tad said:
A non-failure stores the contents of the page in $page (a true value).
A failure stores an undef in $page (a false value).

Prob is it will never fail, the server keeps feeding pages with no content
in so a test needs to be added inside the loop.
Would the following code help exiting the loop?

if ( $page eq $endstring ) {
exit( 0 );
};

($endstring = "No more sorties", which is the string replacing data in empty
pages).

Thomas Andersson · Jul 31, 2010

Tad said:
No that will never work unless this is the "web page" that is
returned:

No more sorties

Doh, stupid me, of course, thanks!

Most web pages will have tags and newlines and whatnot in them,
so an equality test will not work. A pattern match would work though:

last if $page =~ /No more sorties/;

That's clean and nice, if inserted before the page is saved I won't have to
deal with useless pages!

Thank you very much sir!
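
A short sketch of why the equality test fails where a pattern match or index() succeeds. The sample page text is invented for illustration, not copied from the real site, and the three sub names are made up.

```perl
use strict;
use warnings;

my $endstring = 'No more sorties';

# Three ways of asking "is this the empty page?"
sub ends_by_eq    { my ($page) = @_; return $page eq $endstring }
sub ends_by_regex { my ($page) = @_; return $page =~ /\Q$endstring\E/ }
sub ends_by_index { my ($page) = @_; return index($page, $endstring) != -1 }

# Invented sample: the marker text is wrapped in tags and newlines.
my $sample = "<html>\n<body><p>No more sorties found</p></body>\n</html>\n";

for my $test (['eq', \&ends_by_eq], ['regex', \&ends_by_regex], ['index', \&ends_by_index]) {
    my ($name, $code) = @$test;
    printf "%-5s %s\n", $name, $code->($sample) ? 'match' : 'no match';
}
```

The eq test only succeeds when the whole page is exactly the marker string. The other two succeed whenever the marker appears anywhere in the page.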

Jürgen Exner · Jul 31, 2010

Thomas Andersson said:
Next thing I need to add is a check for the exit
conditions. Thinking about using $page as condition might be a bad idea, how
about it checking for a signal variable to be set? Inside the loop code
would run untill my exit conditions are meet and then it sets the signal
variable telling the loop to end?

Why does this remind me of the typical poor approaches of first year
Computer Science students?

No, this is almost always a Very Bad Idea(TM). Setting flags like that
quickly leads to very hard to maintain code.

I have no idea what your exit criterion is, but you should loop while
it is not met
while (!exit_criterion(whateverArgYouNeedToComputeIt))

Perl also gives you an additional function "last" which will exit the
loop immediately. It is a nice pragmatic shortcut, although programming
purists frown upon it.

jue
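
A sketch of what such an exit criterion could look like as a sub, so the while() condition stays readable. The two tests ("No more sorties" and a "sid=NNN" link for the last processed sid) are taken from the descriptions in this thread, but the exact markup is an assumption.

```perl
use strict;
use warnings;

# True when paging should stop: the fetch failed, the page is the empty
# "No more sorties" page, or it mentions the last processed sid.
# Assumption: sids appear in the page as "sid=" followed by the value.
sub exit_criterion {
    my ($page, $last_sid) = @_;
    return 1 unless defined $page;
    return 1 if $page =~ /No more sorties/;
    return 1 if $page =~ /sid=\Q$last_sid\E\b/;
    return 0;
}

# Shape of the calling loop, with get() from LWP::Simple assumed:
#   my $page = get($url);
#   while ( !exit_criterion($page, $lproc) ) {
#       # store $page, bump the page number, rebuild $url
#       $page = get($url);
#   }
```

The \b after the sid keeps "sid=5" from matching inside "sid=55".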

Jürgen Exner · Jul 31, 2010

Thomas Andersson said:
Prob is it will never fail, the server keeps feeding pages with no content
in so a test needs to be added inside the loop.

Then obviously you are using the wrong condition for your loop.

Would the following code help exiting the loop?

if ( $page eq $endstring ) {
exit( 0 );
};

($endstring = "No more sorties", which is the string replacing data in empty
pages).

If this is the end condition for the loop then it would help even more
if you put it in the condition for the loop.

while (........ and ($page ne $endstring)) {

jue

Thomas Andersson · Jul 31, 2010

Jürgen Exner said:
Why does this remind me of the typical poor approaches of first year
Computer Science students?

Well, I'm a one-day hobby learner, so same same

No, this is almost always a Very Bad Idea(TM). Setting flags like that
quickly leads to very hard to maintain code.

OK, good to know.

I have no idea what your exit criterion is, but you should loop while
it is not met
while (!exit_criterion(whateverArgYouNeedToComputeIt))

Perl also gives you an additional function "last" which will exit the
loop immediately. It is a nice pragmatic shortcut, although
programming purists frown upon it.

The loop has been rewritten and works as intended now, using last to exit
on the two possible conditions. It now looks like this:

while (1) {
my $page = get "$pbase?page=$pcnt&pid=$pid";
last if $page =~/No sorties/;
# Store grabbed webpage into the file
append_file( "c:/scr/$pid.txt", $page ) ;
last if $page =~/"sid=$lproc"/;
# Update page number and grab next.
$pcnt++;
};

Thomas Andersson · Jul 31, 2010

Sherm said:
Exit() exits the *program*. In this case, since your loop is basically
the whole program, it amounts to the same thing, but that won't always
be the case! Better to use last - that's what it's for.

It will make a difference as this is only part of the program I need to do.
I'm using last now.

Is it sent as plain text, without even a newline character at the end?
I doubt that - more likely, it's an HTML page that *contains* that
string. That being the case, you could use the index() function to see
if $endstring appears anywhere in $page:

if ( index($page, $endstring) != -1 ) {
    last;
}

I'm currently using:
last if $page =~/No sorties/;
Which seems to do the trick, is there a downside to using my solution?

Thomas Andersson · Jul 31, 2010

Sherm said:
Which form to use is best judged on a case-by-case basis, with the
goal being readability.

OK, in my case I have two conditions and I want to end the loop in a
different way for each, so I think the "last" version will work for me (if
the page is empty, exit before storing it; if it contains the last processed
sid, I still need to store it, so do that and then exit).
Your advice and examples really help and are appreciated!

Best Wishes
Thomas
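
A sketch of that two-exit loop, written so it can be tried without the real site. $fetch is a code reference standing in for LWP::Simple's get, pages are collected in an array instead of a file, and $max_pages is an assumed safety cap. The "sid=NNN" link format is a guess.

```perl
use strict;
use warnings;

# Collect listing pages for one pid.
# - An empty page ("No more sorties") ends the loop before it is kept.
# - A page mentioning the last processed sid is kept, then the loop ends.
sub collect_pages {
    my ($fetch, $base, $pid, $last_sid, $max_pages) = @_;
    my @kept;
    for my $pcnt (1 .. $max_pages) {
        my $page = $fetch->("$base?page=$pcnt&pid=$pid");
        last if !defined $page or $page =~ /No more sorties/;
        push @kept, $page;
        last if $page =~ /sid=\Q$last_sid\E\b/;
    }
    return @kept;
}
```

The order of the three statements in the loop body is what gives the two conditions their different behaviour.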

sln · Jul 31, 2010

Well, I'm a one-day hobby learner, so same same

OK, good to know.

The loop has been rewritten and works as intended now, using last to exit
on the two possible conditions. It now looks like this:

while (1) {
my $page = get "$pbase?page=$pcnt&pid=$pid";
last if $page =~/No sorties/;
# Store grabbed webpage into the file
append_file( "c:/scr/$pid.txt", $page ) ;
last if $page =~/"sid=$lproc"/;
# Update page number and grab next.
$pcnt++;
};

my $sid_rx = qr/sid=$lproc/i;

for my $pid (1 .. 4)
{
    my $fname = "c:/scr/$pid.txt";
    open my $FHpid, ">>", $fname or die "Can't open $fname: $!";
    my $pnumb = 1;
    # Declared before the loop: a "my" inside the condition would not yet
    # be visible to the second test in that same condition.
    my $page;
    while (defined( $page = get( "$pbase?page=$pnumb&pid=$pid")) and
           $page !~ /No sorties/i )
    {
        # Store webpage
        print $FHpid $page,"\n";
        last if $page =~ $sid_rx;

        # Update page number, get next.
        $pnumb++;
    }
    close $FHpid;
}

------------------

Beware that if $page is generated html,
using a regex on it as in

$page !~ /No sorties/i
$page =~ /$sid_rx/

can be done, but only after it is parsed.
It can be parsed with regexes ... though
that's beyond the scope of this post and another
thing entirely.

But, if you don't care, then it's ok.

-sln

Jürgen Exner · Jul 31, 2010

Thomas Andersson said:
Jürgen Exner wrote:

The loop has been rewritten and works as intended now, using last to exit
on the two possible conditions. It now looks like this:

while (1) {

Ouch, this hurts! Usually this line indicates a daemon which is never
supposed to terminate.

my $page = get "$pbase?page=$pcnt&pid=$pid";
last if $page =~/No sorties/;
# Store grabbed webpage into the file
append_file( "c:/scr/$pid.txt", $page ) ;
last if $page =~/"sid=$lproc"/;
# Update page number and grab next.
$pcnt++;
};

Why not move the loop condition into the loop condition?

my $page = get "$pbase?page=$pcnt&pid=$pid";
while (($page !~ /No sorties/) and ($page !~ /"sid=$lproc"/)) {
    append_file( "c:/scr/$pid.txt", $page );
    $pcnt++;
    $page = get "$pbase?page=$pcnt&pid=$pid";
}

Yes, I know the condition could be formulated better, but I transformed
it as little as possible to demonstrate how the exit() cond can
trivially be moved into the while() cond.

BTW: space characters are very cheap, I just saw them on sale at
Costco. Feel free to use as many as you like to make your code more
readable.

jue
