Nokogiri bug or intended behavior?

  • Thread starter Jeremy Woertink

Jeremy Woertink

I'm trying to parse this (poorly formatted) page, and when I look at the
page I see:

Name: ZITO, PEDRO OSVALDO

When I look at the source I get:

<td colspan="4" ><font face="Arial" size="2">
<b>Name: </b>ZITO, PEDRO OSVALDO </font>

</td>

When I parse the page I get:
page.search("/html/body/table[3]/tr[1]/td[4]/table/tr[1]/td[1]/table/tr[3]/td[2]/table/tr[1]/td[1]/table/tr[2]/td[1]").first
=> #<Nokogiri::XML::Element:0x1eb1f76 name="td"
attributes=[#<Nokogiri::XML::Attr:0x1eb1eea name="colspan" value="4">]
children=[#<Nokogiri::XML::Element:0x1eb0d24 name="font"
attributes=[#<Nokogiri::XML::Attr:0x1eb0c7a name="face" value="Arial">,
#<Nokogiri::XML::Attr:0x1eb0c70 name="size" value="2">]
children=[#<Nokogiri::XML::Text:0x1eb0694 "
\r\n ">, #<Nokogiri::XML::Element:0x1eb066c name="b"
children=[#<Nokogiri::XML::Text:0x1eb0496 "Name: ">]>,
#<Nokogiri::XML::Text:0x1eb03c4 "ZITO,PEDROOSVALDO
">]>, #<Nokogiri::XML::Text:0x1eaf636 " \r\n
\r\n ">]>
Notice that in the text node #<Nokogiri::XML::Text:0x1eb03c4 "ZITO,PEDROOSVALDO
">, all the spaces in the name have been removed.


Here's what I'm using:
=> "2.7.3"
macbook-pro:~ jeremywoertink$ ruby -v
ruby 1.8.6 (2009-06-08 patchlevel 369) [universal-darwin9.0]


Does anyone have any ideas? My guess is that it's an encoding issue. There are
other places on the page where I have to call string.gsub("\302\240", "").

Thanks,

~Jeremy
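
A note on the gsub above: "\302\240" is the UTF-8 byte sequence for U+00A0, the non-breaking space (what &nbsp; becomes once parsed). If the spaces in the name are non-breaking spaces, which is only a guess from what is posted here, then replacing them with "" would squash the words together, while replacing them with " " would keep them apart. A minimal sketch, assuming Ruby 1.9 or later with UTF-8 strings; normalize_spaces is a hypothetical helper and the sample string is made up, not taken from the real page:

```ruby
# Assumption: the text is UTF-8 and the "missing" spaces are really
# U+00A0 (non-breaking space), whose UTF-8 bytes are \302\240.
NBSP = "\u00A0"

# Hypothetical helper: turn non-breaking spaces into ordinary spaces,
# then collapse runs of whitespace and trim both ends.
def normalize_spaces(text)
  text.gsub(NBSP, " ").gsub(/\s+/, " ").strip
end

# Made-up sample standing in for the text node from the page.
raw = "ZITO,#{NBSP}PEDRO#{NBSP}OSVALDO \r\n"

p normalize_spaces(raw)   # => "ZITO, PEDRO OSVALDO"
p raw.gsub(NBSP, "")      # => "ZITO,PEDROOSVALDO \r\n"
```

The second line of output shows how deleting the non-breaking spaces produces the same squished name as in the original post.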
 

G_ F_

Try using the .content() or .text() methods to get the text content of
the nodes.
 

Mike Dalessio

If you post this question to nokogiri-talk with a reproducible test case, I
think you'll quickly get a response from the helpful nokogiri community.


 
Jeremy Woertink

G_ F_ said:
Try using the .content() or .text() methods to get the text content of
the nodes.

Yeah, I tried that. It just returns the name all squished. Any other
ideas?
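
One way to test the encoding guess is to look at the actual characters in the string that .text returns, since an invisible or unusual whitespace character will not show up when the string is simply printed. The sketch below is only an illustration, assuming Ruby 1.9 or later with UTF-8 strings; unusual_chars is a hypothetical helper and the sample string is made up to stand in for node.text:

```ruby
# Hypothetical helper: list every character that is not printable ASCII,
# as [index, code point] pairs.
def unusual_chars(text)
  text.each_char.with_index
      .reject { |ch, _| ch =~ /\A[\x20-\x7E]\z/ }
      .map { |ch, i| [i, format("U+%04X", ch.ord)] }
end

# Made-up sample standing in for the string returned by node.text.
sample = "ZITO,\u00A0PEDRO\u00A0OSVALDO\r\n"

p unusual_chars(sample)
# => [[5, "U+00A0"], [11, "U+00A0"], [19, "U+000D"], [20, "U+000A"]]
```

If U+00A0 shows up where the spaces should be, the parser kept the characters and the problem is in how they are displayed or stripped afterwards. Output like this is also a handy thing to include in a reproducible test case.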
 
Jeremy Woertink

Cool, I'll try that. Thanks man.

~Jeremy


Mike said:
If you post this question to nokogiri-talk with a reproducible test case, I
think you'll quickly get a response from the helpful nokogiri community.
 
