Nokogiri not pulling correct XPath

Scott B.

Hi everyone,

I was wondering if anyone could help me. I'm trying to pull text from a
website using Nokogiri, and not all of the text is being pulled into my
variables through XPath.

I have used Firebug (Firefox extension) to pull the correct XPath from
the page so I'm thinking it should be correct. So far, I have:

variable1 =
(doc/"/html/body/div[2]/div[7]/div[4]/div[3]/div[6]/div/div/div/div/div/div/div/h2").inner_html

variable2 =
(doc/"/html/body/div[3]/div[7]/div[4]/div[3]/div[6]/div/div/div/div/div/div[2]/table/tbody/tr/td[2]/strong").inner_html

variable3 =
(doc/"/html/body/div[3]/div[7]/div[4]/div[3]/div[6]/div/div/div/div/div/div[2]/table/tbody/tr/td[2]/strong[2]").inner_html

Now, variable1 is working but I can't get any values out of variable2 or
variable3. Is there a different syntax I should be using? To test, I've
only been outputting to the cli but I want to eventually push these into
a sqlite3 database.

Anyone have any ideas?
Cheers.

Scott.
 
Luis G.

Hello...

I've been using Nokogiri for a while and I never had problems with it.
It works great.

I have some questions for you. Why do you use the full path to the h2
tag?
Does the h2 have a class or an id defined? What about the divs in
between, do they have a class or id defined?

I'm asking because you can get the inner_html of an HTML tag like
this:

doc.xpath("//div[@class='(class of the div here)']/h2").each do |node|
  var = node.inner_html
end

You don't really need to use the full path to the HTML tag. You can also
use //div[@id='(id of the div here)'], for example.

The other variables are probably not working because you missed a div or
something else in between. I think the approach I showed in the lines
above makes it easy to get the HTML content without making mistakes.

If you want, just let me know the URL you want to get the content from
and I'll build a small script to do that.

Regards,

Luis Goncalves
 
Robert Klemme

First I would dump the page _as loaded by your program_ (this is
important) to disk and verify that those XPaths do work independently
(e.g. with Firefox's DOM Inspector or Eclipse XML tools).

Kind regards

robert
 
Eric Christopherson

In my experience, Firebug shows a tbody element as part of the XPath,
even if there is no actual tbody tag in the HTML. In that case,
Nokogiri will fail to find the right element unless you take out the
'tbody/'.
 
Scott B.

Thanks guys for the help. In the end, I think it had more to do with the
tbody than anything. I still couldn't get it working with XPath, however,
so I used CSS and was able to get it working that way (albeit in a
roundabout fashion using an array).

Cheers.

Scott.
 
