Help missing something BASIC

Don Norcott

This code is conceptually what I want to do with the nokogiri code below
s1 = [1,2,3] ; s2 = [4,5,6]; s3 = [7,8,9]
str = [s1,s2,s3]
str.each do |itm|
puts "********"
puts " #{itm[2]}" # Select middle item from each of s1, s2, s3
puts "*********"
end
Results as expected
********
3
*********
********
6
*********
********
9
*********

I have an HTML page with multiple <table>...</table> elements
(equivalent to str above) and want to process each table (equivalent to
s1, s2, s3) and extract one item, <td class="itemNumbr" ...>, from the
table (equivalent to extracting the middle element in any of s1, s2, s3).

I initially thought this was straightforward, but I am missing
something very fundamental when I move the concept to Nokogiri objects.
---------------------- NOKOGIRI CODE ----------------------
require 'open-uri'
require 'nokogiri'

doc = Nokogiri::HTML(open("c:/RUBY_OUT.TXT")) # file containing web page
doc.xpath("//table[@class='result']").each do |node| # select a table
puts "*************"
puts node.to_html # as expected
puts node.xpath("//td[@class='itemNumbr']") # 15 per each
puts "*************"
end
---------------------- NOKOGIRI CODE ----------------------

The output below displays the table HTML as expected, but the itemNumbr cells are not what I expected:
***********
<table ..................../table> for item 1
<td class ="itemNumbr.....<b1> 1.</b>...../td>
<td class ="itemNumbr.....<b1> 2.</b>...../td>
......
<td class ="itemNumbr.....<b1> 15.</b>...../td>
**********
**********
<table ..................../table> for item 2
<td class ="itemNumbr.....<b1> 1.</b>...../td>
<td class ="itemNumbr.....<b1> 2.</b>...../td>
......
<td class ="itemNumbr.....<b1> 15.</b>...../td>
**********
**********
<table ..................../table> for item 3
<td class ="itemNumbr.....<b1> 1.</b>...../td>
.......

The tables are output as expected: tables with itemNumbr 1 to 15, in
sequence.
node.xpath("//td[@class='itemNumbr']") acts as if node contains all
15 tables, but printing node indicates otherwise. I thought node would
always contain the HTML for a single table only, but I appear to be wrong.

Also, if I put a subscript on the first xpath
doc.xpath("//table[@class='result'][5]").each do |node|
to ensure only one table is found, I still get the itemNumbr cells for all
15 table elements.

WHAT AM I MISSING HERE
 
Don Norcott

I posted incorrect code for the number array and should have said last
item, not middle item.

s1 = [1,2,3] ; s2 = [4,5,6]; s3 = [7,8,9]
str = [s1,s2,s3]
str.each do |itm|
puts "********"
puts itm.to_s # ADDED LINE
puts " #{itm[2]}" # Select last item from each of s1, s2, s3
puts "*********"
end

Giving the output below - more in line with table & item printout
********
[1, 2, 3] # table
3 # item selected
*********
 
Robert Klemme

This code is conceptually what I want to do with the nokogiri code below
s1 = [1,2,3] ; s2 = [4,5,6]; s3 = [7,8,9]
str = [s1,s2,s3]
str.each do |itm|
 puts "********"
 puts " #{itm[2]}" # Select middle item from each of s1, s2, s3
 puts "*********"
end
Results as expected
********
 3
*********
********
 6
*********
********
 9
*********

I have an html page with multiple <table>...</table> elements
(equivalent to str above) and want to process each table (equivalent to
s1, s2, s3) and extract one item <td class="itemNumbr" ...> from the
table (equivalent to extracting the middle element in any of s1 s2 s3)
 
Don Norcott

Thanks Robert

I now have the code working with one additional line.

require 'open-uri'
require 'nokogiri'

doc = Nokogiri::HTML(open("c:/RUBY_OUT.TXT"));
doc.xpath("//table[@class='result']").each do |node|

# next line has been added.
doc2 = Nokogiri::HTML("<body>" << "#{node}" << "</body>")

puts "*************"
puts doc2.xpath("//td[@class='itemNumbr']")
puts "*************"
end

I realized early on (even though I cannot figure out why) that I had to
save off each table inside the "do" block before processing it to get
around this problem. I tried many things, including an array, which
worked fine to save the <table>s, but I could not xpath the saved <table>s.

What I did not realize was that if I took the <table> raw, it was no longer
valid XML. What twigged me was your code adding <body> to give the
correct HTML header <!DOCTYPE .....>

Now, if you can, please shed some light on my underlying question.
Above I embed the node in a new HTML object; the new object contains
nothing other than a single <table>.

Also, if I print the node, it contains only a single <table>.

Yet if I attempt to execute
"puts node.xpath("//td[@class='itemNumbr']")"
it finds the "itemNumbr" cells for all <table> items even though they do
not exist in node. Does node actually contain the entire HTML, just not
visibly?

Thanks for any insight you can provide.
Don
 
Robert Klemme

Thanks Robert

I now have the code working with one additional line.

require 'open-uri'
require 'nokogiri'

doc = Nokogiri::HTML(open("c:/RUBY_OUT.TXT"));
doc.xpath("//table[@class='result']").each do |node|

# next line has been added.
   doc2 = Nokogiri::HTML("<body>" << "#{node}" << "</body>")

I'm sorry but this is ridiculous!
   puts "*************"
   puts doc2.xpath("//td[@class='itemNumbr']")
   puts "*************"
end

I realized (even though I can not figure out why) early on that I had to
save off each table in the "do" before processing it to get around this
problem. I tried many things including an array which worked fine to
save the <table>s but I could not xpath the saved <table>s.

As I said: you need a *relative XPath*. Your problem is the global
XPath. You need to shave off the leading "//" or prefix it with ".".

doc = Nokogiri::HTML(open("c:/RUBY_OUT.TXT"));
doc.xpath("//table[@class='result']").each do |node|
puts "*************"
# no leading "//": matches only td elements that are direct children of node
puts node.xpath("td[@class='itemNumbr']")
puts "*************"
end

doc = Nokogiri::HTML(open("c:/RUBY_OUT.TXT"));
doc.xpath("//table[@class='result']").each do |node|
puts "*************"
# leading ".": matches td elements at any depth below node
puts node.xpath(".//td[@class='itemNumbr']")
puts "*************"
end
What I did not realize that if I took the <table> raw it was no longer
valid XML. What twigged me is your code adding in <body> to give the
correct html header <!DOCTYPE .....>

Now if you can shed some light on my underlying question.
Above I embed the node in a new html object, the new object contains
nothing other than a single <table>.

Also if print the node it contains only a single <table>.

Yet if I attempt to execute
"puts node.xpath("//td[@class='itemNumbr']")"
it finds the "itemnumbr" for all <table> items even though they do not
exist in node. Does node actually contain the entire html, just not
visible.

http://www.w3schools.com/xpath/xpath_syntax.asp

Cheers

robert

--
remember.guy do |as, often| as.you_can - without end
http://blog.rubybestpractices.com/
 
Don Norcott

I have actually taken this first tutorial, plus a few more.

From everything I did to verify the contents of
node (parse, to_s, to_html), I thought it contained a single table
selected from the HTML page, and I could not prove otherwise.

That is why I was not looking at relative paths: I thought I was
dealing with only a single table on each iteration. That is also why I
mistakenly used the array concept, which is obviously not a parallel.

Obviously I still do not understand the contents of node.
 
Robert Klemme

I have actually taken this first tutorial plus a few more

From everything I did to verify the contents of
node(parse,to_s,to_html), I thought it to contain a single table
selected from the html page and could not prove different.

That is why I was not looking at relative paths since I thought I was
dealing with only a single table on "each". That is why I mistakenly
used the array concept which is obviously not a parallel.

Obviously I still do not understand the contents of node.

From this posting of yours it's not clear to me what issue you have.
Any XML or HTML parser that rips a document apart and builds a DOM of
some kind will create a nested, strictly hierarchical tree structure.
The only thing that may seem odd is that XPath queries beginning with
"//" search through the complete document regardless of the node you
invoke the method on.

Cheers

robert
 
Don Norcott

My issue is not with anything you have shown me, nor with my ability
to get the code working or to get the next piece of code working predictably.

My only issue is conceptual. I have no problem understanding that node
contains the entire tree structure since I am able to return all the
itemNumbr elements with
node.xpath("//td[@class='itemNumbr']")


What I am not able to do is output proof to convince myself that the
node contains anything other than the selected table (i.e. inspect, parse,
to_s, to_html all return the single table).

This is my last post - I really have no issue other than the conceptual
one above and I will re-visit the contents of node again when I have
time.

Thanks again for all your help.
 
Robert Klemme

My issue is not with anything you have shown me and not with my ability
to get the code working or get the next piece of code working predictably.

My only issue is conceptual. I have no problem understanding that node
contains the entire tree structure since I am able to return all the
itemNumbr elements with
   node.xpath("//td[@class='itemNumbr']")


What I am not able to do is output proof to convince myself that the
node contains other than the selected table (ie inspect, parse, to_s,
to_html all return the single table).

The problem might lie in the term "contains". Conceptually one would
probably say that a node contains all its sub nodes. Technically a
node can also (indirectly) contain the whole document. This happens
if you include a reference to the parent node or the document. Here's
an example with parent node inclusion.

http://gist.github.com/638085

If you add a line "pp ch" to the iteration code at the end of the
file, you will see that each node "contains" all the rest of the
document.
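
The gist is not reproduced here. As an independent sketch of the same idea in plain Ruby, a node that keeps a reference to its parent prints only its own subtree but can still reach the whole tree. The TreeNode class and the node names below are hypothetical and are not taken from the gist or from Nokogiri.

```ruby
# Hypothetical minimal tree node with a back-reference to its parent.
class TreeNode
  attr_reader :name, :children, :parent

  def initialize(name, parent = nil)
    @name = name
    @children = []
    @parent = parent
    parent.children << self if parent
  end

  # Prints only this node and what lies below it,
  # comparable to calling to_html on a single table node.
  def to_s
    children.empty? ? name : "#{name}(#{children.join(', ')})"
  end

  # Follows the parent references up to the top of the tree.
  def root
    parent ? parent.root : self
  end
end

body   = TreeNode.new("body")
table1 = TreeNode.new("table1", body)
table2 = TreeNode.new("table2", body)
TreeNode.new("td1", table1)
TreeNode.new("td2", table2)

puts table1       # only the subtree of table1
puts table1.root  # the whole tree, reached through the parent link
```

The first puts shows table1 and its cell only, the second shows body with both tables. In that sense table1 "contains" the rest of the tree: not in its printed form, but through the reference it holds. Dumping the object with pp also shows the @parent instance variable, which is where the rest of the tree becomes visible.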
This is my last post - I really have no issue other than the conceptual

Hopefully not. :)
Thanks again for all your help.

You're welcome!

Kind regards

robert

 
