Help with HTML parsing

Vivek Netha

Hello,

I'm new to Watir/Ruby and need to solve something that involves HTML
parsing - you could also call it screen scraping. I haven't used either
library before, but I wanted to know whether it is better to use Hpricot
or open-uri. The problem is similar to the one below:

Let's say I'm searching Google for some string, "Dungeons & Dragons" for
instance. I want to parse the first results page and get the title text
and URL for each of the top 5 results. How would I do this using
Hpricot, open-uri, or both?

Please help!


Viv.
 
Vivek Netha

OK, I've started to work with Hpricot. I'm still debating whether I
should use open-uri or Watir, but that's another discussion. My current
issue, though, is with XPath. This is what I have so far...

#!ruby
require 'watir'
require 'open-uri'
require 'rubygems'
require 'hpricot'


Watir.options_file = 'c:/ruby/options.yml'
Watir::Browser.default = 'ie'

test_site = "http://www.google.com"

br = Watir::Browser.new

br.goto test_site
br.text_field(:name, 'q').set("Dungeons & Dragons")
br.button(:name, 'btnG').click

doc = Hpricot(br.html)

# After the above, I'm trying to store all result elements in an array:

x = Array.new
x << doc.search("//li[@class='g w0']/h3/a")

The XPath checker in Firefox shows that I have the right path, but the
command doesn't work in irb. How do I drill down to specific elements so
that I can extract the title text and URL? And what's wrong with my
XPath?

Please help!


Viv.
 
Phlip

Vivek said:
x = Array.new
x << doc.search("//li[@class='g w0']/h3/a")

the XPath checker in Firefox shows that I have the right path, but the
command doesn't work in irb. How do I drill down to specific elements so
that I can extract the title text and URL? And what's wrong with my
XPath?

Don't use Hpricot - its XPath support only covers a very few predicate and path
types. Use libxml-ruby, if you can install it, or REXML, if you don't mind the
slow speed (the RE stands for Regular Expressions!), or nokogiri, which I don't
know how to recommend yet, but I will soon.

BTW Google for my street-name and XPath for all kinds of fun in Ruby with them.
 
Vivek Netha

Hi Phlip,

Could you give me more pointers on how exactly you would do it using
REXML, with actual code if possible?

Thanks.
 
Phlip

Vivek said:
could you give me more pointers on how exactly you would do it using
REXML. with actual code, if possible.
x = Array.new
x << doc.search("//li[@class='g w0']/h3/a")

Sure, but this is just tutorial-level REXML (hint hint), and uncompiled:

require 'rexml/document'
doc = REXML::Document.new(my_xml_string)
x << REXML::XPath.first(doc, '//li[ @class = "g w0" ]/h3/a').text

Now we come to the very nub of the gist. A @class of "g w0" is the same as "w0
g", or any other permutation, yet XPath is literal-minded, and unaware of CSS,
so it will only match one of those permutations, when any other could have been
just as significant.
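That literal-mindedness is easy to demonstrate with REXML. In this made-up fragment, both list items carry the same two classes, just in different orders, yet the literal predicate matches only one of them:

```ruby
require 'rexml/document'

# Two list items carrying the same two classes, in different orders.
# Markup invented purely to illustrate the point.
doc = REXML::Document.new('<ul><li class="g w0">a</li><li class="w0 g">b</li></ul>')

# The literal predicate matches only one ordering of the attribute value.
hits = REXML::XPath.match(doc, '//li[@class="g w0"]')
puts hits.map { |e| e.text }.inspect   # the "w0 g" element is missed
```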

You could try contains(@class, "w0"), but that would match "w0p0p", which is a
different class.

This is why nokogiri is interesting - it might do CSS Selector notation, which
is less exact than XPath, and more aware of CSS rules. (Look up assert_select()
to see what I mean.)

For further REXML abuse, try this...

http://www.google.com/codesearch?q=REXML::XPath.first

...but be warned (again, I think), it's _almost_ as slow as Windows Vista with
more than one program running, so /caveat emptor/.
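For reference, here is that REXML sketch filled out into a self-contained script. The result markup below is invented for illustration - real Google markup differs and changes often, so the path is an assumption about the page being scraped:

```ruby
require 'rexml/document'

# A tiny, well-formed stand-in for two Google result blocks
# (invented for illustration; real result markup is messier).
xml = <<XML
<ol>
  <li class="g w0"><h3><a href="http://example.com/1">First hit</a></h3></li>
  <li class="g w0"><h3><a href="http://example.com/2">Second hit</a></h3></li>
</ol>
XML

doc = REXML::Document.new(xml)

# XPath.first returns the first matching element (or nil).
first = REXML::XPath.first(doc, '//li[@class="g w0"]/h3/a')
puts first.text                  # title text
puts first.attributes['href']    # URL

# XPath.each walks every match.
REXML::XPath.each(doc, '//li[@class="g w0"]/h3/a') do |a|
  puts "#{a.text} -> #{a.attributes['href']}"
end
```

Note that REXML wants well-formed XML; real-world HTML usually needs tidying before it will parse.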
 
Tom Morris

Vivek said:
x << doc.search("//li[@class='g w0']/h3/a")

Phlip said:
You could try contains(@class, "w0"), but that would match "w0p0p", which is a
different class.

XPath that solves this:
contains(concat(' ', @class, ' '), ' w0 ')

You can join multiple class tests together like this:
contains(concat(' ', @class, ' '), ' w0 ') and contains(concat(' ',
@class, ' '), ' g ')

It's not pretty. I've written it too much in my life.
I do wish XPath had a class selector.
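Tom's whole-token predicate can be checked with REXML against a made-up fragment - only the element that genuinely carries the "w0" class token matches, while the lookalike "w0p0p" is excluded:

```ruby
require 'rexml/document'

# Only the first <li> really has the class token "w0"; the second has the
# lookalike "w0p0p". Fragment invented for the demonstration.
doc = REXML::Document.new(<<XML)
<ul>
  <li class="g w0">genuine</li>
  <li class="w0p0p">lookalike</li>
</ul>
XML

# Padding @class with spaces means ' w0 ' can only match a whole token.
path = %q{//li[contains(concat(' ', @class, ' '), ' w0 ')]}
hits = REXML::XPath.match(doc, path)
puts hits.map { |e| e.text }.inspect   # the "w0p0p" element is excluded
```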
 
Phlip

Tom said:
XPath that solves this:
contains(concat(' ', @class, ' '), ' w0 ')

You can join multiple classes together like this:
contains(concat(' ', @class, ' '), ' w0 ') and contains(concat(' ',
@class, ' '), ' g ')

It's not pretty. I've written it too much in my life.
I do wish XPath had a class selector.

Noted. I'm writing an XPath DSL, above the level of a raw XML library, and I
just added to its pending feature list these line-items:

# TODO :class => :symbol should emit contains(concat(' ', @class, ' '), ' w0 ')
# TODO :class => [] should emit contains(concat(' ', @class, ' '), ' w0 ')
#      and contains(concat(' ', @class, ' '), ' g ')
# TODO :class => a string should be passed through raw.

Those won't fix low-level XPath, but they will be useful when my DSL targets
XHTML...
 
Jun Young Kim

You can also use the Ruby library Sanitize (http://wonko.com/post/sanitize).

This library makes it very easy to clean up HTML.

Let's see the following examples.

Using Sanitize is easy. First, install it:

sudo gem install sanitize

Then call it like so:

require 'rubygems'
require 'sanitize'

html = '<b><a href="http://foo.com/">foo</a></b><img src="http://foo.com/bar.jpg" />'

Sanitize.clean(html) # => 'foo'

By default, Sanitize removes all HTML. You can use one of the built-in
configs to tell Sanitize to allow certain attributes and elements:

Sanitize.clean(html, Sanitize::Config::RESTRICTED)
# => '<b>foo</b>'

Sanitize.clean(html, Sanitize::Config::BASIC)
# => '<b><a href="http://foo.com/" rel="nofollow">foo</a></b>'

Sanitize.clean(html, Sanitize::Config::RELAXED)
# => '<b><a href="http://foo.com/">foo</a></b><img src="http://foo.com/bar.jpg" />'

Or, if you'd like more control over what's allowed, you can provide
your own custom configuration:

Sanitize.clean(html, :elements => ['a', 'span'],
               :attributes => {'a' => ['href', 'title'], 'span' => ['class']},
               :protocols => {'a' => {'href' => ['http', 'https', 'mailto']}})

Good one :)

2009-01-02, 6:42 AM, Vivek Netha wrote:
 
Santosh Turamari

Hi,

I am using Sanitize.clean() to strip HTML tags from content, but the
difficulty is that I want to preserve some of the tags. I have written
it like this:

require 'rubyful_soup'   # provides BeautifulSoup
require 'sanitize'

html = File.new(file).read
soup = BeautifulSoup.new(html)
soup.title.contents = ['']
soup.find_all.each do |tag|
  if tag.string != nil
    tag.contents = ['<strong>' + tag.contents.to_s + '</strong>'] if (tag['style'] =~ /bold/)
    tag.contents = ['<em>' + tag.contents.to_s + '</em>'] if (tag['style'] =~ /italic/)
    tag.contents = ['<u>' + tag.contents.to_s + '</u>'] if (tag['style'] =~ /underline/)
  end
end
soup_string = str_replace(soup.html.to_s)  # str_replace is a custom helper

return Sanitize.clean(soup_string.to_s, :elements =>
  ['div', 'p', 'span', 'center', 'table', 'tr', 'th', 'td', 'blockquote', 'br',
   'cite', 'code', 'dd', 'dl', 'dt', 'em', 'i', 'li', 'ol', 'pre', 'q',
   'small', 'strike', 'strong', 'sub', 'sup', 'u', 'ul', 'tbody'])

The problem is that I want to preserve the center and right
justification as well, which is not happening even though I allow
'center' here. If anybody knows how to preserve the justification,
please help me.

Thanks in advance,
Santosh



 
