Thomas Andersson
I have set myself the task of creating a script that can collect data from web
pages and insert it into a MySQL database. I'm a complete noob at this,
though, and not even sure what language I need (to learn), but I think Perl
might be it. What I ask now is not for you to tell me how to do it, only whether
it's feasible or whether I'm barking up the wrong tree (pointers on where to find
relevant information are welcome, though).
First step would be to export a list of pids to be processed, each paired
with the last sid processed for the pid.
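To make this first step concrete, here is a minimal sketch of reading such a list. It uses Python purely for illustration (the language choice is still the open question), and it assumes the export is a CSV file with one `pid,last_sid` pair per line; the function name `read_pid_list` and the sample values are made up.

```python
import csv
import io

def read_pid_list(handle):
    """Return a list of (pid, last_sid) pairs from a CSV file-like object.

    Assumed layout (not from any real export): one "pid,last_sid" per line.
    """
    pairs = []
    for row in csv.reader(handle):
        if not row:  # skip blank lines
            continue
        pid, last_sid = row[0].strip(), row[1].strip()
        pairs.append((pid, last_sid))
    return pairs

# Made-up sample data; a real run would open the exported file instead.
sample = io.StringIO("1001,555\n1002,777\n")
pid_list = read_pid_list(sample)
```

The script would then take the pairs from `pid_list` one at a time, the first one being the "current" pid.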
The script would read the list and set the first pid in the list as current.
The next step would be for it to add the current pid to a URL and load that page,
which contains a list.
From this page a list of sids needs to be collected until I hit the "last
processed" one. These might be spread over several pages, so it needs to keep
going either until it finds "last processed" or there are no further pages to
load (a fail, I guess).
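The paging logic described here can be sketched roughly as follows. This is only an illustration under assumptions: `fetch_page` is a hypothetical function that would download and parse one list page and return the sids on it (an empty list when the page does not exist), and the fake pages below stand in for the real site.

```python
def collect_new_sids(fetch_page, pid, last_sid):
    """Collect sids for a pid, page by page, stopping at last_sid.

    fetch_page(pid, page_no) is a hypothetical callable returning the list
    of sids on that page, or an empty list when there is no such page.
    Returns (new_sids, found_last).
    """
    new_sids = []
    page_no = 1
    while True:
        sids = fetch_page(pid, page_no)
        if not sids:  # no further pages to load: the "fail" case
            return new_sids, False
        for sid in sids:
            if sid == last_sid:  # reached the last processed sid
                return new_sids, True
            new_sids.append(sid)
        page_no += 1

# Fake pages (made-up sids) so the sketch runs without any network access.
fake_pages = {1: ["909", "908"], 2: ["907", "906"]}

def fake_fetch(pid, page_no):
    return fake_pages.get(page_no, [])

new_sids, found = collect_new_sids(fake_fetch, "1001", "907")
```

Returning the `found` flag alongside the list lets the caller decide what to do when "last processed" never turns up.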
Next is the new sid list created in the previous step: each one needs to be
processed and data collected.
Some basic data is collected from each sid, and then 2 possible (but not
always existent) lists.
The basic data collected for the sid contains two values to be set as
variables; these decide how many data blocks need to be collected lower
down on the page.
Go to the first type block, collect the data I want and repeat as many times as
the variable says.
Go to the second type block and repeat.
Store the data collected in the previous steps in a text file named after the pid. It
should contain 4 sections of data to be inserted into 4 databases:
First section: update the pid with the new last processed sid.
Second section: add sids with info to DB.
Third section: add the data from type 1 blocks on sid pages to DB.
Fourth section: add the data from type 2 blocks on sid pages to DB.
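As a rough sketch of what such a per-pid file could look like, the snippet below writes the four sections as tab-separated lines under bracketed headers. The header names (`[PID]`, `[SIDS]`, `[TYPE1]`, `[TYPE2]`), the function `write_pid_file` and all sample values are assumptions made for illustration, not a required format.

```python
import os
import tempfile

def write_pid_file(directory, pid, last_sid, sids, type1_rows, type2_rows):
    """Write one text file named after the pid, with four sections.

    Each row is a tuple of strings and is written as one tab-separated line.
    """
    path = os.path.join(directory, f"{pid}.txt")
    sections = [
        ("PID", [(pid, last_sid)]),  # section 1: new last processed sid
        ("SIDS", sids),              # section 2: sids with their basic info
        ("TYPE1", type1_rows),       # section 3: data from type 1 blocks
        ("TYPE2", type2_rows),       # section 4: data from type 2 blocks
    ]
    with open(path, "w", encoding="utf-8") as out:
        for name, rows in sections:
            out.write(f"[{name}]\n")
            for row in rows:
                out.write("\t".join(row) + "\n")
    return path

# Made-up data written to a temporary directory.
out_dir = tempfile.mkdtemp()
path = write_pid_file(
    out_dir, "1001", "909",
    sids=[("909", "info a"), ("908", "info b")],
    type1_rows=[("909", "x")],
    type2_rows=[],
)
```

A section with no rows (here the type 2 one) still gets its header, so a later import step can rely on all four sections being present.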
Close the file, load the next pid from the list and repeat the process until the pid
list is empty.
I guess a bonus at the end would be if it could also insert all the data
collected into the db as well.
Is this something Perl would be suitable for, or is there a better choice?
My system is Win 7 64bit btw, running MySQL 5.1.
TIA
Thomas