Logging in to a page and scraping values

Vikash Kumar · Jan 12, 2007

I am running a test case in which I first have to log in to a web page,
then go to a particular page on the same web site, and then extract some
data from that page. The data is in a table.

For example, the script first calls http://localhost/login.asp, enters
the user name and password, and clicks the login button. Once logged in,
it goes to http://localhost/achievements.asp and extracts the data in
the HTML table on that page. What is the right approach for this?

I can use the code below to extract the data when I do not have to log
in to the web site.

require 'net/http'

# read the page data

http = Net::HTTP.new('kvcrpf.org', 80)
resp = http.get('/achievements.htm')
page = resp.body # Ruby 1.9+ returns only the response; read the body from it

# BEGIN processing HTML

def parse_html(data, tag)
  return data.scan(%r{<#{tag}\s*.*?>(.*?)</#{tag}>}im).flatten
end

output = []
table_data = parse_html(page, "table")
table_data.each do |table|
  out_row = []
  row_data = parse_html(table, "tr")
  row_data.each do |row|
    cell_data = parse_html(row, "td")
    cell_data.each do |cell|
      cell.gsub!(%r{<.*?>}, "")
    end
    out_row << cell_data
  end
  output << out_row
end

# END processing HTML

# examine the result

def parse_nested_array(array, tab = 0)
  n = 0
  array.each do |item|
    if item.size > 0
      puts "#{"\t" * tab}[#{n}] {"
      if item.class == Array
        parse_nested_array(item, tab + 1)
      else
        puts "#{"\t" * (tab + 1)}#{item}"
      end
      puts "#{"\t" * tab}}"
    end
    n += 1
  end
end

parse_nested_array(output[2][4])

aa, ab, ac, ad = output[2][4]

puts "hello"
puts aa + "\t" + ab + "\t" + ac + "\t" + ad
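
The script above only covers a page that needs no login. As a rough sketch of the missing step with the same net/http library, and assuming the site tracks the login with a session cookie, one could post the form and then send the returned cookie along with the second request. The form field names 'username' and 'password' and the helper names cookie_header and fetch_after_login are assumptions for illustration; the real names have to be read from the HTML of login.asp.

```ruby
require 'net/http'

# Build a Cookie request header from the Set-Cookie headers of a response,
# keeping only the "name=value" part of each cookie.
def cookie_header(response)
  cookies = response.get_fields('set-cookie') || []
  cookies.map { |cookie| cookie.split(';').first.strip }.join('; ')
end

# Post the login form, then request the protected page with the session cookie.
# The paths come from the question; the field names are assumed.
def fetch_after_login(host, port, user, password)
  Net::HTTP.start(host, port) do |http|
    login = Net::HTTP::Post.new('/login.asp')
    login.set_form_data('username' => user, 'password' => password)
    login_response = http.request(login)

    page_request = Net::HTTP::Get.new('/achievements.asp')
    page_request['Cookie'] = cookie_header(login_response)
    http.request(page_request).body
  end
end
```

Calling fetch_after_login('localhost', 80, 'user', 'secret') would return the HTML of the achievements page as a string, which can then go through parse_html as before. A site that uses hidden form fields or redirects during login would need more handling than this.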

Peter Szinek · Jan 12, 2007

In 2 days I am going to release a web extraction toolkit which will do
exactly what you want (and more, of course, but this is a basic use
case). It is based on Mechanize (which handles login and similar
interaction) and Hpricot (which extracts the relevant data). The scenario
you described is absolutely typical, so you could try it with my toolkit.

I will post an announcement here after the release.

Cheers,
Peter

__
http://www.rubyrailways.com

Vikash Kumar · Jan 13, 2007

The code given above can extract values from a web page if we don't
have to log in and we know in advance which URL holds the data. The
problem is to first log in to a page and then go to the desired
location to scrape values from it.

Please help me out in doing this.
Thanks in advance
Vikash

lrlebron · Jan 13, 2007

If you are running on a Windows platform, then you should look at Watir.
It will let you control Internet Explorer and log in to a site.

Luis

Rodrigo Bermejo · Jan 14, 2007

There are a few ways of doing this (I am in a hurry right now, so I
cannot elaborate). If you are on Windows, Watir [1] can help you with
the login. The tricky part may be getting the data, but I am sure there
is a method that lets you extract the whole HTML.

[1] http://wtr.rubyforge.org/
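
A hedged sketch of what that could look like with the current Watir gem, where the browser object is created with Watir::Browser.new and browser.html returns the whole HTML of the loaded page. The field names 'username' and 'password', the submit-button locator, and the helper names login_and_grab_html and run_with_watir are assumptions, not taken from the real login page.

```ruby
# Drive a browser through the login form and return the HTML of the data page.
# The element locators below are assumed; inspect login.asp for the real ones.
def login_and_grab_html(browser, base_url, user, password)
  browser.goto("#{base_url}/login.asp")
  browser.text_field(name: 'username').set(user)
  browser.text_field(name: 'password').set(password)
  browser.button(type: 'submit').click
  browser.goto("#{base_url}/achievements.asp")
  browser.html # the whole HTML of the current page as a string
end

# Hypothetical entry point: the gem is loaded here so that the helper above
# can be read and tried without a browser installed.
def run_with_watir(user, password)
  require 'watir' # gem install watir; also needs a browser and its driver
  browser = Watir::Browser.new
  login_and_grab_html(browser, 'http://localhost', user, password)
ensure
  browser&.close
end
```

The returned string can be passed to the same table-parsing code that already works for the public page.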


Vikash Kumar · Jan 15, 2007

I am working on the Windows platform. I have tried many times to first
log in to a web page and then go to the desired page to get some data
from it, but I have been unable to do it.

Anyone's help will be appreciated.
Thanks
Vikash

Charles Lowe · Jan 15, 2007

Try a combination of WWW::Mechanize (gem install mechanize), and Hpricot
(gem install hpricot).
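
As a starting point, here is a minimal sketch of the login-then-scrape flow with Mechanize. It uses the current gem, where the class is called Mechanize (WWW::Mechanize is the older namespace) and pages are parsed with Nokogiri rather than Hpricot. The field names 'username' and 'password' and the helper names login_and_fetch, table_rows and scrape_achievements are assumptions for illustration; the URLs are the ones from the question.

```ruby
# Log in through the first form on the login page, then fetch the data page.
# 'username' and 'password' are assumed field names; check the real form.
def login_and_fetch(agent, login_url, data_url, user, password)
  login_page = agent.get(login_url)
  form = login_page.forms.first
  form['username'] = user
  form['password'] = password
  agent.submit(form, form.buttons.first)
  agent.get(data_url) # the agent keeps the session cookie between requests
end

# Turn every table row on the page into an array of stripped cell texts.
def table_rows(page)
  page.search('table tr').map do |row|
    row.search('td').map { |cell| cell.text.strip }
  end
end

# Hypothetical entry point: the gem is loaded here so that the helpers above
# can be read and tried without it.
def scrape_achievements(user, password)
  require 'mechanize' # gem install mechanize
  agent = Mechanize.new
  page = login_and_fetch(agent, 'http://localhost/login.asp',
                         'http://localhost/achievements.asp', user, password)
  table_rows(page)
end
```

scrape_achievements('user', 'secret') would return one array per table row, similar in shape to the nested output array built by the regex-based script. If the login page has more than one form, the form would have to be picked by name or action instead of forms.first.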

Vikash Kumar · Jan 15, 2007

I am new to Mechanize and Hpricot. I have installed them, but I still
cannot scrape the values when I first have to log in to the web site and
then go to another page to extract the data.

Please help me.
Vikash

alex_f_il · Jan 15, 2007

You can also try SWExplorerAutomation (SWEA) from http://webiussoft.com.
SWEA is a .NET API, but it can be used from Ruby through RubyCLR.

example:

require 'rubyclr'
RubyClr::reference 'System'
RubyClr::reference 'SWExplorerAutomationClient'
include SWExplorerAutomation::Client
include SWExplorerAutomation::Client::Controls
include SWExplorerAutomation::Client::DialogControls
explorerManager = ExplorerManager.new
explorerManager.Connect(-1)
explorerManager.LoadProject('google.htp')
explorerManager.Navigate('http://www.google.com/')
scene = explorerManager['Scene_0']
scene.WaitForActive(30000)
scene["q"].Value = 'c#'
scene['btnG'].Click()
scene = explorerManager['Scene_1']
scene.WaitForActive(30000)
explorerManager.DisconnectAndClose()
