Help with HTML parsing

Vivek Netha

Hello,

I'm new to Watir/Ruby and need to solve something that involves HTML
parsing - you could also call it screen scraping. I haven't used either
library before, but I wanted to know whether it is better to use Hpricot
or open-uri. The problem is similar to the one below:

Let's say I'm searching Google for some string, "Dungeons & Dragons" for
instance. I want to parse the first results page and get the title text
and URL for each of the top 5 results. How would I do this using
Hpricot, open-uri, or both?

Please help!


Viv.
 
Vivek Netha

OK, I've started to work with Hpricot. I'm still debating whether I
should use open-uri or Watir, but that's another discussion. My current
issue, though, is with XPath. This is what I have so far...

#!ruby
require 'watir'
require 'open-uri'
require 'rubygems'
require 'hpricot'


Watir.options_file = 'c:/ruby/options.yml'
Watir::Browser.default = 'ie'

test_site = "http://www.google.com"

br = Watir::Browser.new

br.goto test_site
br.text_field(:name, 'q').set("Dungeons & Dragons")
br.button(:name, 'btnG').click

doc = Hpricot(br.html)

# After the above, I'm trying to store all result elements in an array:

x = Array.new
x << doc.search("//li[@class='g w0']/h3/a")

The XPath checker in Firefox shows that I have the right path, but the
command doesn't work in irb. How do I drill down to specific elements so
that I can extract the title text and URL? And what's wrong with my
XPath?

Please help!


Viv.
 
Phlip

Vivek said:
x = Array.new
x << doc.search("//li[@class='g w0']/h3/a")

the XPath checker in Firefox shows that I have the right path, but the
command doesn't work in irb. How do I drill down to specific elements so
that I can extract the title text and URL? And what's wrong with my
XPath?

Don't use Hpricot - its XPath support only covers a very few predicate and path
types. Use libxml-ruby, if you can install it, or REXML, if you don't mind the
slow speed (the RE stands for Regular Expressions!), or nokogiri, which I don't
know how to recommend yet, but I will soon.

BTW Google for my street-name and XPath for all kinds of fun in Ruby with them.
 
Vivek Netha

Hi Phlip,

Could you give me more pointers on how exactly you would do it using
REXML, with actual code if possible?

Thanks.
 
Phlip

Vivek said:
could you give me more pointers on how exactly you would do it using
REXML. with actual code, if possible.
x = Array.new
x << doc.search("//li[@class='g w0']/h3/a")

Sure, but this is just tutorial-level REXML (hint hint), and uncompiled:

require 'rexml/document'
doc = REXML::Document.new(my_xml_string)
x << REXML::XPath.first(doc, '//li[ @class = "g w0" ]/h3/a').text

Now we come to the very nub of the gist. A @class of "g w0" is the same as "w0
g", or any other permutation, yet XPath is literal-minded, and unaware of CSS,
so it will only match one of those permutations, when any other could have been
just as significant.
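That literal-mindedness is easy to demonstrate with REXML. In this made-up fragment, both list items carry the same two classes, just in different orders, yet the literal predicate matches only one of them:

```ruby
require 'rexml/document'

# Two list items carrying the same two classes, in different orders.
# Markup invented purely to illustrate the point.
doc = REXML::Document.new('<ul><li class="g w0">a</li><li class="w0 g">b</li></ul>')

# The literal predicate matches only one ordering of the attribute value.
hits = REXML::XPath.match(doc, '//li[@class="g w0"]')
puts hits.map { |e| e.text }.inspect   # the "w0 g" element is missed
```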

You could try contains(@class, "w0"), but that would match "w0p0p", which is a
different class.

This is why nokogiri is interesting - it might do CSS Selector notation, which
is less exact than XPath, and more aware of CSS rules. (Look up assert_select()
to see what I mean.)

For further REXML abuse, try this...

http://www.google.com/codesearch?q=REXML::XPath.first

...but be warned (again, I think), it's _almost_ as slow as Windows Vista with
more than one program running, so /caveat emptor/.
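For reference, here is that REXML sketch filled out into a self-contained script. The result markup below is invented for illustration - real Google markup differs and changes often, so the path is an assumption about the page being scraped:

```ruby
require 'rexml/document'

# A tiny, well-formed stand-in for two Google result blocks
# (invented for illustration; real result markup is messier).
xml = <<XML
<ol>
  <li class="g w0"><h3><a href="http://example.com/1">First hit</a></h3></li>
  <li class="g w0"><h3><a href="http://example.com/2">Second hit</a></h3></li>
</ol>
XML

doc = REXML::Document.new(xml)

# XPath.first returns the first matching element (or nil).
first = REXML::XPath.first(doc, '//li[@class="g w0"]/h3/a')
puts first.text                  # title text
puts first.attributes['href']    # URL

# XPath.each walks every match.
REXML::XPath.each(doc, '//li[@class="g w0"]/h3/a') do |a|
  puts "#{a.text} -> #{a.attributes['href']}"
end
```

Note that REXML wants well-formed XML; real-world HTML usually needs tidying before it will parse.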
 
Tom Morris

Vivek said:
x << doc.search("//li[@class='g w0']/h3/a")

Phlip said:
You could try contains(@class, "w0"), but that would match "w0p0p", which is a
different class.

XPath that solves this:
contains(concat(' ', @class, ' '), ' w0 ')

You can join multiple class tests together like this:
contains(concat(' ', @class, ' '), ' w0 ') and contains(concat(' ',
@class, ' '), ' g ')

It's not pretty. I've written it too much in my life.
I do wish XPath had a class selector.
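Tom's whole-token predicate can be checked with REXML against a made-up fragment - only the element that genuinely carries the "w0" class token matches, while the lookalike "w0p0p" is excluded:

```ruby
require 'rexml/document'

# Only the first <li> really has the class token "w0"; the second has the
# lookalike "w0p0p". Fragment invented for the demonstration.
doc = REXML::Document.new(<<XML)
<ul>
  <li class="g w0">genuine</li>
  <li class="w0p0p">lookalike</li>
</ul>
XML

# Padding @class with spaces means ' w0 ' can only match a whole token.
path = %q{//li[contains(concat(' ', @class, ' '), ' w0 ')]}
hits = REXML::XPath.match(doc, path)
puts hits.map { |e| e.text }.inspect   # the "w0p0p" element is excluded
```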
 
Phlip

Tom said:
XPath that solves this:
contains(concat(' ', @class, ' '), ' w0 ')

You can join multiple classes together like this:
contains(concat(' ', @class, ' '), ' w0 ') and contains(concat(' ',
@class, ' '), ' g ')

It's not pretty. I've written it too much in my life.
I do wish XPath had a class selector.

Noted. I'm writing an XPath DSL, above the level of a raw XML library, and I
just added to its pending feature list these line-items:

# TODO :class => :symbol should emit contains(concat(' ', @class, ' '), ' w0 ')
# TODO :class => [] should emit contains(concat(' ', @class, ' '), ' w0 ')
#      and contains(concat(' ', @class, ' '), ' g ')
# TODO :class => a string should be passed through raw.

Those won't fix low-level XPath, but they will be useful when my DSL targets
XHTML...
 
Jun Young Kim

You can also use the Ruby library Sanitize (http://wonko.com/post/sanitize).

This library makes it very easy to clean up HTML.

Let's see the following examples.

Using Sanitize is easy. First, install it:

sudo gem install sanitize

Then call it like so:

require 'rubygems'
require 'sanitize'

html = '<b><a href="http://foo.com/">foo</a></b><img src="http://foo.com/bar.jpg" />'

Sanitize.clean(html) # => 'foo'

By default, Sanitize removes all HTML. You can use one of the built-in
configs to tell Sanitize to allow certain attributes and elements:

Sanitize.clean(html, Sanitize::Config::RESTRICTED)
# => '<b>foo</b>'

Sanitize.clean(html, Sanitize::Config::BASIC)
# => '<b><a href="http://foo.com/" rel="nofollow">foo</a></b>'

Sanitize.clean(html, Sanitize::Config::RELAXED)
# => '<b><a href="http://foo.com/">foo</a></b><img src="http://foo.com/bar.jpg" />'

Or, if you'd like more control over what's allowed, you can provide
your own custom configuration:

Sanitize.clean(html, :elements => ['a', 'span'],
               :attributes => {'a' => ['href', 'title'], 'span' => ['class']},
               :protocols => {'a' => {'href' => ['http', 'https', 'mailto']}})

Good one :)

2009-01-02, 6:42 AM, Vivek Netha wrote:
 
Santosh Turamari

Hi,

I am using Sanitize.clean() to strip HTML tags from content, but the
difficulty is that I want to preserve some of the tags. I have written
it like this:

require 'rubyful_soup'   # provides BeautifulSoup
require 'sanitize'

html = File.new(file).read
soup = BeautifulSoup.new(html)
soup.title.contents = ['']
soup.find_all.each do |tag|
  if tag.string != nil
    tag.contents = ['<strong>' + tag.contents.to_s + '</strong>'] if (tag['style'] =~ /bold/)
    tag.contents = ['<em>' + tag.contents.to_s + '</em>'] if (tag['style'] =~ /italic/)
    tag.contents = ['<u>' + tag.contents.to_s + '</u>'] if (tag['style'] =~ /underline/)
  end
end
soup_string = str_replace(soup.html.to_s)  # str_replace is a custom helper

return Sanitize.clean(soup_string.to_s, :elements =>
  ['div', 'p', 'span', 'center', 'table', 'tr', 'th', 'td', 'blockquote', 'br',
   'cite', 'code', 'dd', 'dl', 'dt', 'em', 'i', 'li', 'ol', 'pre', 'q',
   'small', 'strike', 'strong', 'sub', 'sup', 'u', 'ul', 'tbody'])

The problem is that I want to preserve the center and right
justification as well, which is not happening even though I allow
'center' here. If anybody knows how to preserve the justification,
please help me.

Thanks in advance,
Santosh



 
