Jeremy Woertink
I'm trying to parse this (poorly formatted) page, and when I look at the
page I see:
Name: ZITO, PEDRO OSVALDO
When I look at the source I get:
<td colspan="4" ><font face="Arial" size="2">
<b>Name: </b>ZITO, PEDRO OSVALDO </font>
</td>
When I parse the page I get:
page.search("/html/body/table[3]/tr[1]/td[4]/table/tr[1]/td[1]/table/tr[3]/td[2]/table/tr[1]/td[1]/table/tr[2]/td[1]").first
=> #<Nokogiri::XML::Element:0x1eb1f76 name="td"
attributes=[#<Nokogiri::XML::Attr:0x1eb1eea name="colspan" value="4">]
children=[#<Nokogiri::XML::Element:0x1eb0d24 name="font"
attributes=[#<Nokogiri::XML::Attr:0x1eb0c7a name="face" value="Arial">,
#<Nokogiri::XML::Attr:0x1eb0c70 name="size" value="2">]
children=[#<Nokogiri::XML::Text:0x1eb0694 "
\r\n ">, #<Nokogiri::XML::Element:0x1eb066c name="b"
children=[#<Nokogiri::XML::Text:0x1eb0496 "Name: ">]>,
#<Nokogiri::XML::Text:0x1eb03c4 "ZITO,PEDROOSVALDO
">]>, #<Nokogiri::XML::Text:0x1eaf636 " \r\n
\r\n ">]>
Notice that in #<Nokogiri::XML::Text:0x1eb03c4 "ZITO,PEDROOSVALDO
"> all the spaces in the name have been removed.
Here's what I'm using:
=> "2.7.3"
macbook-pro:~ jeremywoertink$ ruby -v
ruby 1.8.6 (2009-06-08 patchlevel 369) [universal-darwin9.0]
Does anyone have any ideas? My guess is that it may be an encoding issue. There are
other areas in the pages where I have to call string.gsub("\302\240", "").
Thanks,
~Jeremy
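
One way to test the encoding guess: the bytes "\302\240" are the UTF-8 encoding of a non-breaking space (U+00A0), so deleting them with gsub would remove word separators if the page uses them between the words of the name. Whether this page actually does is an assumption, not something the output above confirms. A minimal sketch, using a hypothetical helper name (normalize_spaces) and a made-up sample string, and written for a Ruby version with string encodings (1.9 or later) rather than the 1.8.6 shown above:

```ruby
# Non-breaking space, U+00A0. In UTF-8 this is the byte pair \302\240.
NBSP = "\u00A0"

# Hypothetical helper: replace non-breaking spaces with ordinary spaces
# instead of deleting them, then trim surrounding whitespace.
def normalize_spaces(text)
  text.gsub(NBSP, " ").strip
end

# Assumed sample: the name as it might appear if the page separates
# the words with non-breaking spaces.
raw = "ZITO,#{NBSP}PEDRO#{NBSP}OSVALDO\r\n"

# Inspect the raw bytes to see whether 194, 160 (\302\240) is present.
p raw.bytes.to_a

puts raw.gsub(NBSP, "").strip   # separators deleted
puts normalize_spaces(raw)      # separators kept as ordinary spaces
```

If the byte dump of the real text node shows ordinary spaces (byte 32) rather than 194, 160, then non-breaking spaces are not the cause and the problem lies elsewhere.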