scraping web pages for cisco products

Chuck Dawit

I submitted a post a few days ago about scraping the web for Cisco
products. I didn't receive that much input so I thought I would ask
again. Here are the requirements. I have a list of 2000 URLs that all
have Cisco in their domain names.
(ex. http://www.soldbycisco.net
http://www.ciscoindia.net
http://www.ciscobootcamp.net
http://www.cisco-guy.net)

I want to scrape through them and determine which websites are
selling new Cisco products; I'm actually looking for around 20 or so
products (ex. WIC-1T, NM-4E, WS-G2950-24). One idea I was given was to
split the pages into ones with forms and those without forms. Those
without forms probably won't have anything for sale, so I can eliminate
those. But I really don't know what to do after that. Does
anyone have a different/better approach? Any help would be appreciated.
 
Konrad Meyer

Quoth Glen Holcomb:
I don't remember who, but someone suggested using Froogle and parsing that
output. Froogle and a few other sites like Pricewatch might be a far less
complicated approach; you won't find all of them, but then again I don't
think you can possibly find everything anyway.
--
"Hey brother Christian with your high and mighty errand, Your actions speak
so loud, I can't hear a word you're saying."

-Greg Graffin (Bad Religion)

That was me. Seems to me you shouldn't parse froogle so much as just use it.
Writing a script is a lot more work and won't get you what you want; froogle
will.

--
Konrad Meyer <[email protected]> http://konrad.sobertillnoon.com/

 
Chuck Dawit

Konrad said:
That was me. Seems to me you shouldn't parse froogle so much as just use it.
Writing a script is a lot more work and won't get you what you want; froogle
will.

But see I need to use only the list that I have with Cisco in the domain
name. (ex. usedcisco.com, ciscoequipment.com) Can froogle look up
website names like the ones I have?
 
Konrad Meyer

Quoth Chuck Dawit:
But see I need to use only the list that I have with Cisco in the domain
name. (ex. usedcisco.com, ciscoequipment.com) Can froogle look up
website names like the ones I have?

Assuming it uses a similar interface to google (I don't know much about it),
yes, "site:usedcisco.com" etc.

Why do you need the list? Just search for anything below 60% MSRP, and ANY
website selling counterfeit cisco devices should come up.
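
For what it's worth, generating those queries from the list is trivial. A
minimal Ruby sketch, assuming the same "sites" file used later in this
thread and a Google-style site: operator (the product list is just the
examples from the first post):

# Emit one "site:" query per domain, restricted to a few product names.
products = "WIC-1T OR NM-4E OR WS-G2950-24"

File.readlines("sites").each do |site|
  puts "site:#{site.chomp} (#{products})"
end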

--
Konrad Meyer <[email protected]> http://konrad.sobertillnoon.com/

 
Chuck Dawit

Glen said:
Why is the domain important if you are looking for fraudulent equipment
based on selling price? I don't think you can search by URL; I don't see
why anyone looking for a specific product would need to do that.

--
"Hey brother Christian with your high and mighty errand, Your actions speak
so loud, I can't hear a word you're saying."

-Greg Graffin (Bad Religion)


I'm looking for copyright infringement on Cisco's name too. So I'm not only
looking for those companies that are selling counterfeit Cisco equipment
but also those who are infringing on Cisco's name as well.
 
brabuhr

One idea I was given was to
split the pages into ones with forms and those without forms. Those
without forms probably won't have anything for sale, so I can eliminate
those. But I really don't know what to do after that.

Here's a naive implementation of binning by forms:

$ cat sites
www.cnn.com
www.usedcisco.com
www.rubyforge.org
slashdot.org
technocrat.net
bk.com

$ cat firstbin.rb
#!/usr/bin/env ruby

require 'rubygems'
require 'mechanize'

agent = WWW::Mechanize.new

sites = File.readlines("sites")
bin1 = []
bin2 = []
bin3 = []

sites.each do |site|
  site.chomp!

  page  = agent.get "http://#{site}"
  forms = page.forms
  # A form whose name or action mentions "search" is probably a search box.
  search_forms = forms.select{|f|
    (f.name and f.name.match /search/i) or
    (f.action and f.action.to_s.match /search/i)
  }

  if search_forms.size > 0
    bin1 << site      # has a search-like form
  elsif forms.size > 0
    bin2 << site      # has forms, but none look like search
  else
    bin3 << site      # no forms at all
  end
end

p bin1
p bin2
p bin3
$ ruby firstbin.rb
["www.cnn.com", "www.rubyforge.org", "slashdot.org"]
["www.usedcisco.com", "technocrat.net"]
["bk.com"]
 
Chuck Dawit

With this method do I need to know the name of the form to use it? With
mechanize I thought you had to look at the form name first before you
could use it?
 
brabuhr

With this method do I need to know the name of the form to use it? With
mechanize I thought you had to look at the form name first before you
could use it?

It helps to know some way to distinguish the form you're looking for
from the other forms on the page. It would be possible to iterate
through all the forms on a page, entering some text into the text
fields in each form and submitting them; but, most of the time the
script would probably be in either the wrong form or the wrong field
in the right form (and, of course, there are other issues, e.g. forms
that require multiple fields to be edited). I don't see any way to
avoid customizing the code for each site (though, if you get a good
framework built, the effort per site should decrease).
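
A rough sketch of that brute-force iteration, using the same WWW::Mechanize
API as the earlier script; the target URL and product name are placeholders,
and matching the product name in the result body is only a guess at what a
"hit" would look like:

#!/usr/bin/env ruby

require 'rubygems'
require 'mechanize'

agent = WWW::Mechanize.new
page  = agent.get "http://www.usedcisco.com"

# Blindly type a product name into every field of every form and submit;
# expect mostly misses, per the caveats above.
page.forms.each do |form|
  form.fields.each do |field|
    field.value = "WIC-1T"
  end
  begin
    result = agent.submit(form)
    puts "possible hit via #{form.action}" if result.body =~ /WIC-1T/i
  rescue Exception => e
    # Broken form, server error, etc.; skip it.
  end
end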
 
Chuck Dawit

brabuhr said:
It helps to know some way to distinguish the form you're looking for
from the other forms on the page. It would be possible to iterate
through all the forms on a page, entering some text into the text
fields in each form and submitting them; but, most of the time the
script would probably be in either the wrong form or the wrong field
in the right form (and, of course, there are other issues, e.g. forms
that require multiple fields to be edited). I don't see any way to
avoid customizing the code for each site (though, if you get a good
framework built, the effort per site should decrease).

I agree, but I have around 2000 sites to look at and I can't look at each
and every form; that would take way too long. Do you think a better
approach would be to use a search engine's API to search for the products
on each site? I've never used a search engine API. If I know the
website name, the product name, and a price I want, can I use those
parameters in the search to find results?
 
Brad Phelan

Chuck said:
I agree, but I have around 2000 sites to look at and I can't look at each
and every form; that would take way too long. Do you think a better
approach would be to use a search engine's API to search for the products
on each site? I've never used a search engine API. If I know the
website name, the product name, and a price I want, can I use those
parameters in the search to find results?


This query seems to work

site:solecentral.com.au OR site:xtargets.com AND crocs

I advertise my brother's e-commerce site on my site and they both contain
the same keyword "crocs". Google returns all the pages from my site and
his site that contain the word "crocs". However, I am not sure how well
the query scales; I think Google truncates the search string after
some length, so adding 2000 sites to the query string might break.
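
One way around that might be to batch the domains into small groups and
issue one query per group. A quick sketch; the group size of 10 is an
arbitrary guess at what stays under the limit:

require 'enumerator'  # for each_slice on Ruby 1.8

# Build one "site:... OR site:..." query per group of domains so that
# no single query string grows past the (unknown) truncation length.
sites = File.readlines("sites").map { |s| s.chomp }

sites.each_slice(10) do |group|
  clause = group.map { |s| "site:#{s}" }.join(" OR ")
  puts "(#{clause}) AND (WIC-1T OR NM-4E OR WS-G2950-24)"
end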

Not sure if the same query trick also works in froogle as well as
vanilla google.

Hope this is somewhat helpful.
 
Chuck Dawit

brabuhr said:
Here's a naive implementation of binning by forms:
  page  = agent.get "http://#{site}"
  forms = page.forms
  search_forms = forms.select{|f|
    (f.name and f.name.match /search/i) or
    (f.action and f.action.to_s.match /search/i)
  }

  if search_forms.size > 0
    bin1 << site
  elsif forms.size > 0
    bin2 << site
  else
    bin3 << site
  end
end

I'm checking the size of the forms like in the code above, but when it
gets to the 13th URL the script just exits. Does anyone know
why? How can I run a check on this?
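
One likely cause: agent.get raises an unrescued exception for that site
(a timeout, DNS failure, or non-200 response) and that kills the whole
loop. A minimal sketch of a more defensive loop, assuming a hypothetical
bin4 bucket for unreachable sites:

#!/usr/bin/env ruby

require 'rubygems'
require 'mechanize'

agent = WWW::Mechanize.new
bin4  = []  # hypothetical extra bucket for sites that error out

File.readlines("sites").each do |site|
  site.chomp!
  begin
    page = agent.get "http://#{site}"
  rescue Exception => e
    # Report the failing URL and error instead of letting the script die.
    # (On Ruby 1.8, Timeout::Error is not a StandardError, hence Exception.)
    STDERR.puts "#{site}: #{e.class}: #{e.message}"
    bin4 << site
    next
  end
  # ... same form-binning logic as firstbin.rb ...
end

p bin4

Running that should print the offending URL and exception to stderr
instead of dying partway through the list.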
 
Todd Benson

I submitted a post a few days ago about scraping the web for Cisco
products. I didn't receive that much input so I thought I would ask
again. Here are the requirements. I have a list of 2000 URLs that all
have Cisco in their domain names.
(ex. http://www.soldbycisco.net
http://www.ciscoindia.net
http://www.ciscobootcamp.net
http://www.cisco-guy.net)

I suspect that if Cisco has a problem with counterfeit products that
hurt their long-term bottom line, it would most certainly come from
web sites that do not have the word cisco in the DNS name.

You should have asked about scraping for some more generic term, maybe?

There are basically two things that bother me with your question.

1. There is something fundamentally wrong with using an open source
product to protect the integrity of a select few relatively expensive
products.

2. An employee of Cisco would have no problem securing funds for a
proposal that was delivered on a hardware level (unless Cisco is
having some monetary problems I'm not aware of). If you don't know
what I'm talking about, then I'll shut up.

Todd
 
