Can I get a little help with my program? (string searching and regex)

michael.hincke

So here's my issue: I'm trying to find a way that isn't insanely
roundabout to accomplish the following.

I am scraping book information from a website. That part was quite
easy, but I'm having problems when the site returns more than
one book. I need a way to say:

for each regex on the page
store info into next unused excel line
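
As a rough sketch of that loop in plain Ruby (not a tested Watir solution): String#scan returns every match rather than only the first, so each hit can be appended to the next free row. The html string and the rows array are stand-ins invented for this example; in the real script they would presumably be the page source (e.g. browser.html) and the worksheet.

```ruby
# Sample HTML standing in for the live page (an assumption, so this runs without Watir).
html = <<~HTML
  <span id="rptCourses_ctl00_rptItems_ctl00_lblItemTxtTitle" style="font-weight: bold;">Book title 1</span>
  <span id="rptCourses_ctl00_rptItems_ctl01_lblItemTxtTitle" style="font-weight: bold;">Another book title</span>
HTML

# Capture the text of every title span; \d+ also covers ids past ctl99.
title_re = /<span id="rptCourses_ctl00_rptItems_ctl\d+_lblItemTxtTitle"[^>]*>(.*?)<\/span>/m

# rows is a hypothetical stand-in for the worksheet: one inner array per Excel line.
rows = []
html.scan(title_re).each do |(title)|
  rows << [title.strip] # appending always lands on the next unused line
end

rows.each_with_index { |row, i| puts "row #{i + 1}: #{row.first}" }
```

If the page has no matching spans, scan returns an empty array and the loop body never runs, so the zero-book case needs no special handling here.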

(I will be doing a separate search for each piece of info (author,
ISBN, etc.) because of the way the HTML is set up.)
*Note: I am using Watir, but I believe the issues I'm having are core
Ruby issues.

<span id="rptCourses_ctl00_rptItems_ctl00_lblItemTxtTitle" style="font-weight: bold;">Book title 1</span>
<span id="rptCourses_ctl00_rptItems_ctl01_lblItemTxtTitle" style="font-weight: bold;">Another book title</span>

Notice the slight difference in the second ctl0#: that number simply
increments with each book on the page. I have yet to see a search
return 10+ books, but I would imagine the leading 0 would increment
in that case; I'm not positive.

Then the corresponding authors are:

<span id="rptCourses_ctl00_rptItems_ctl00_lblItemTxtAuthor">author 1</span>
<span id="rptCourses_ctl00_rptItems_ctl01_lblItemTxtAuthor">author 2</span>

with the ctl0# matching the titles.
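
Since that number ties a title to its author, one option is to capture it in the regex and use it as a hash key. This is only a sketch: books_by_index is a made-up helper name, and it assumes the page source is available as a string and that the first ctl00 in the id stays fixed, as it does in the samples above.

```ruby
# Collect title/author text keyed by the item number found in the span id.
def books_by_index(html)
  field_re = /<span id="rptCourses_ctl00_rptItems_ctl(\d+)_lblItemTxt(Title|Author)"[^>]*>(.*?)<\/span>/m
  books = Hash.new { |hash, key| hash[key] = {} }
  html.scan(field_re) do |index, field, text|
    # "00" -> 0, "01" -> 1, ...; field becomes :title or :author
    books[index.to_i][field.downcase.to_sym] = text.strip
  end
  books
end

# Sample page (an assumption): the second book has no author span.
html = <<~HTML
  <span id="rptCourses_ctl00_rptItems_ctl00_lblItemTxtTitle" style="font-weight: bold;">Book title 1</span>
  <span id="rptCourses_ctl00_rptItems_ctl00_lblItemTxtAuthor">author 1</span>
  <span id="rptCourses_ctl00_rptItems_ctl01_lblItemTxtTitle" style="font-weight: bold;">Another book title</span>
HTML

books_by_index(html).sort.each do |index, book|
  puts "#{index}: #{book[:title]} / #{book[:author] || '(no author)'}"
end
```

Because each field is filed under its own item number, a book with a missing author simply has no :author entry and the later books do not shift up.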

HOWEVER, when I am done pulling info from the page and go to the next
page, the numbering resets and the first book is ctl00 again.

This is what I have been using, but it never tests the regex a second
time around, so I never get more than one book's data per search:

# Do some search stuff based on an Excel list of 4-digit numbers.
# The website will return 0-many books (currently the script crashes if 0
# books are returned).
while contLoop do
  colVal = worksheet.Cells(row, 'a').Value
  if (colVal) then
    browser.goto("http://www.website.com/searchterm=" + colVal)
    for i in 1...browser.spans.length
      if (browser.span(:id, /rptCourses_ctl00_rptItems_ctl\d\d_lblItemTxtTitle/).text) then
        var = browser.span(:id, /rptCourses_ctl00_rptItems_ctl\d\d_lblItemTxtTitle/).text
        worksheet.Cells(row, 'b').value = var
      end
      if (browser.span(:id, /rptCourses_ctl00_rptItems_ctl\d\d_lblItemTxtAuthor/).text) then
        var = browser.span(:id, /rptCourses_ctl00_rptItems_ctl\d\d_lblItemTxtAuthor/).text
        worksheet.Cells(row, 'c').value = var
      end
    end
  else
    contLoop = false
  end
  row += 1
  sleep 1
end

I'm worried that if it doesn't find an author (or some other field) for
a book, the list will get out of sync. The other way I think it could be
done is to make the number that iterates in the regex a variable and
step through that, but this might cause issues on subsequent pages.
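
A minimal sketch of that counter idea, run against a plain HTML string so it works without a browser. span_text and books_on_page are hypothetical helpers, and the two-digit zero padding (ctl00, ctl01, ...) is an assumption that is unconfirmed past ctl09. With Watir itself, an exists? check on the element would play the role of span_text returning nil.

```ruby
# Return the text of the span with the given id, or nil when the page has no such span.
def span_text(html, id)
  match = html.match(/<span id="#{Regexp.escape(id)}"[^>]*>(.*?)<\/span>/m)
  match && match[1].strip
end

# Walk item numbers 0, 1, 2, ... and stop at the first missing title.
def books_on_page(html)
  books = []
  n = 0
  loop do
    prefix = format("rptCourses_ctl00_rptItems_ctl%02d_lblItemTxt", n)
    title = span_text(html, prefix + "Title")
    break if title.nil?
    books << [title, span_text(html, prefix + "Author")] # author may be nil
    n += 1
  end
  books
end

# Sample page (an assumption): the second book has no author span.
page = <<~HTML
  <span id="rptCourses_ctl00_rptItems_ctl00_lblItemTxtTitle" style="font-weight: bold;">Book title 1</span>
  <span id="rptCourses_ctl00_rptItems_ctl00_lblItemTxtAuthor">author 1</span>
  <span id="rptCourses_ctl00_rptItems_ctl01_lblItemTxtTitle" style="font-weight: bold;">Another book title</span>
HTML

books_on_page(page).each { |title, author| puts "#{title} | #{author || '(no author)'}" }
```

The counter starts at 0 on every call, so the numbering resetting on the next page should not matter, and a page with no books just returns an empty array.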

That is the main problem. The other problem I have to tackle is building
a cross-reference list for each book found (this is kept on a separate
sheet), i.e.

Searchterm | BookID
0001       | 1
0001       | 2
0001       | 3
0002       | 4
0003       | 5
0004       | 1

(BookID is just a simple 1 through however many books, created when the
book is stored in the spreadsheet.)
This would denote that 3 books were found when searching for 0001,
referenced by BookIDs 1, 2 and 3, and one book each for 0002 and
0003. BookID 1 comes up when searching for both 0001 and 0004, so I
also need a way to make sure that another BookID is not created
for the same book when 0004 comes around.

I believe this is easiest to do when storing the book, but I haven't
tried to tackle that yet.
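
One way this could look, sketched with plain Ruby collections in place of the two sheets: a hash remembers which BookID each book already has, so a repeat lookup reuses it. The results data, the variable names, and the choice of [title, author] as the key are all assumptions for illustration; an ISBN would probably make a more reliable key.

```ruby
book_ids  = {} # book key => BookID, e.g. ["Book A", "Author A"] => 1
cross_ref = [] # [search term, BookID] rows destined for the second sheet

# Hypothetical search results: search term => list of [title, author] found.
results = {
  "0001" => [["Book A", "Author A"], ["Book B", "Author B"], ["Book C", "Author C"]],
  "0002" => [["Book D", "Author D"]],
  "0003" => [["Book E", "Author E"]],
  "0004" => [["Book A", "Author A"]]
}

results.each do |term, books|
  books.each do |book|
    # Reuse the existing ID if this book was stored before, else hand out the next one.
    id = (book_ids[book] ||= book_ids.size + 1)
    cross_ref << [term, id]
  end
end

cross_ref.each { |term, id| puts "#{term} | #{id}" }
```

A new row on the book sheet would only be written when the hash lookup misses, which keeps the book list and the reference list in step.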

To sum up my problems:

1) getting info from more than one book when searching
2) crashing when no books are found
3) creating the reference list
4) not double storing in reference list


Any insight or sample code you can provide would be awesome. I don't
particularly want to code this, find out it doesn't work, and have to
recode it 15 times.

Mike
 
