How to use ReXML "in the wild"?

Kenneth McDonald · Dec 16, 2008

I'd very much like to use ReXML's XPATH features to extract info from
Google's financial info pages, but find that Rexml chokes on the
Javascript, here's the result of trying to read in a page with this
bit of code:

require "rexml/document"
require 'net/http'

Net::HTTP.start('finance.google.com') do |http|
  response = http.get('/finance?fstype=ii&q=NYSE:WAT')
  rdoc = REXML::Document.new(response.body)
end

==========
Output:

/usr/local/lib/ruby/1.8/rexml/parsers/treeparser.rb:92:in `parse':
#<RuntimeError: Illegal character '&' in raw string
" (REXML::ParseException)
(function(){
var d=navigator.userAgent.toLowerCase().indexOf("msie")!=-1;function
e(){var b=document.styleSheets;for(var a=b.length-1;a>=0;--a){var
c=b[a].href;if(c)if(c.indexOf("styles/finance_")!=-1||
c.indexOf("styles_")!=-1)return b[a]}return null}function f(){var
b=e();if(b){var a=b.rules;return
a.length>0&&a[a.length-1].selectorText==".lastFinanceRule"}return false}
function g(){if(document.scripts)for(var b=0;b">
/usr/local/lib/ruby/1.8/rexml/text.rb:91:in `initialize'

peter · Dec 16, 2008

Hi Kenneth,

I'd very much like to use ReXML's XPATH features to extract info from
Google's financial info pages, but find that Rexml chokes on the
Javascript, here's the result of trying to read in a page with this
bit of code:

Don't try that

REXML in the wild == epic FAIL. At this level, you might
want to try Hpricot or Nokogiri. At a bit higher level, scRUBYt!
You can read about web scraping in Ruby here (my most successful article
ever; it was even mentioned in Learning Ruby from O'Reilly):

http://www.rubyrailways.com/data-extraction-for-web-20-screen-scraping-in-rubyrails/

Is there a good way to get around this problem? If, not, I guess it's
back to regular expressions...

Web scraping with regular expressions is almost never a good idea.

Try scRUBYt!:

require 'rubygems'
require 'scrubyt'

data = Scrubyt::Extractor.define do
  fetch 'http://finance.google.com/finance?fstype=ii&q=NYSE:WAT'

  body '/html/body' do
    revenue '/div[4]/div[2]/table/tr[2]' do
      ending_9_27 '/td[2]'
      ending_6_28 '/td[3]'
    end

    gross_profit '/div[4]/div[2]/table/tr[2]' do
      ending_9_27 '/td[2]'
    end
  end
end

puts data.to_xml

output:

<root>
  <body>
    <revenue>
      <ending_9_27>386.31</ending_9_27>
      <ending_6_28>398.77</ending_6_28>
    </revenue>
    <gross_profit>
      <ending_9_27>386.31</ending_9_27>
    </gross_profit>
  </body>
</root>
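
Because the extractor output above is well-formed XML, REXML's XPath support can be applied to it without the problems seen on the raw page. The following is only a minimal sketch: it pastes the output in as a literal string, and the variable names (`latest_revenue`, `revenue_by_period`) are made up for illustration.

```ruby
require "rexml/document"

# The extractor output shown above, pasted in as a literal string.
xml = <<~XML
  <root>
    <body>
      <revenue>
        <ending_9_27>386.31</ending_9_27>
        <ending_6_28>398.77</ending_6_28>
      </revenue>
      <gross_profit>
        <ending_9_27>386.31</ending_9_27>
      </gross_profit>
    </body>
  </root>
XML

doc = REXML::Document.new(xml)

# A single value, addressed by its full path.
latest_revenue = REXML::XPath.first(doc, "/root/body/revenue/ending_9_27").text

# Every child element of <revenue>, collected as name => Float.
revenue_by_period = {}
REXML::XPath.each(doc, "//revenue/*") do |node|
  revenue_by_period[node.name] = node.text.to_f
end

puts latest_revenue
puts revenue_by_period.inspect
```
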

HTH,
Peter
___
http://scrubyt.org
http://www.rubyrailways.com

Phlip · Dec 16, 2008

Kenneth said:
I'd very much like to use ReXML's XPATH features to extract info from
Google's financial info pages, but find that Rexml chokes on the
Javascript, here's the result of trying to read in a page with this
bit of code:

I have studied REXML for many years, and I still can't figure out how to get it
to recognize an &mdash; or similar advanced entity.

Like the other responder said, give up while you still can. libxml-ruby is also
stable enough to give a shot - oh yeah, except it crashes on non-tiny inputs.

Aaaand...

/usr/local/lib/ruby/1.8/rexml/parsers/treeparser.rb:92:in `parse':
#<RuntimeError: Illegal character '&' in raw string

That's because REXML and your web browser disagree on the definition of
well-formed. Your browser accepts a naked & inside a JavaScript tag, but REXML
does not. REXML is technically correct, and your browser would have accepted
&amp;&amp; here, but...

a.length>0&&a[a.length-1].selectorText==".lastFinanceRule"}return false}

...browsers cannot correctly interpolate &amp; appearing inside JavaScript literal
strings, because some lowlife coder using Notepad might have actually wanted
"&amp;" when they wrote "&amp;" - such as with document.write().

So, because REXML cannot accept normal HTML, due to hits and misses of standards
compliance on all sides - you are better off with a dedicated parser!
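
The well-formedness point can be checked directly with REXML. This is a small sketch, not taken from the thread: the helper `well_formed?` and the two sample strings are made up for illustration, and the script body is a shortened stand-in for the page's real JavaScript.

```ruby
require "rexml/document"

# Hypothetical helper: true when REXML accepts the string as well-formed XML.
def well_formed?(markup)
  REXML::Document.new(markup)
  true
rescue REXML::ParseException
  false
end

# A bare && inside the element, as browsers tolerate in HTML.
bare = '<script>if (a && b) go()</script>'

# The same script with each ampersand written as an entity.
escaped = '<script>if (a &amp;&amp; b) go()</script>'

puts well_formed?(bare)
puts well_formed?(escaped)

# Reading the text back turns the entities into plain ampersands again.
puts REXML::Document.new(escaped).root.text
```

The bare version should be rejected with the same kind of ParseException shown in the original post, while the escaped version should parse. That only helps when you control the markup; for pages you merely fetch, a lenient HTML parser remains the practical choice.
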
