How to get REXML to return items in order?

ted

Hi,

I'm new to Ruby and can't figure out why REXML isn't returning the elements
in the order they appear in the document. Here's my code and the document.
Any help appreciated.

Thanks,
Ted

#==============================
# ruby
#==============================
require 'rexml/document'

# File.read hands the parser a String, so no file handle is left open
xml = REXML::Document.new(File.read("test.html"))
xml.elements.each("//span[@class='c5']") do |element|
  puts element
end

#==============================
# the "test.html" file
#==============================
<html>
<body>
<a name="1"/>
<table><tr><td><span class="c5"><b>1st Title</b></span></td></tr></table>
<a name="2"/>
<table><tr><td><span class="c5"><b>2nd Title</b></span></td></tr></table>
<a name="3"/>
<table><tr><td><span class="c5"><b>3rd Title</b></span></td></tr>
</table>
</body>
</html>
 
Gavin Kistner


ted said:
I'm new to Ruby and can't figure out why REXML isn't returning the elements
in the order they appear in the document. Here's my code and the document.

I confirm the problem. Looks like a bug. If I remove some of the
anchors, it works.
(Off-topic - no need to use empty named anchors in your page - just
use IDs on existing elements instead.)

[Sliver:~/Desktop] gkistner$ cat tmp.rb
code = <<ENDHTML
<html><body>
<a name="1"/>
<table><tr><td><span class="c5"><b>1st Title</b></span></td></tr></table>
<a name="2"/>
<table><tr><td><span class="c5"><b>2nd Title</b></span></td></tr></table>
<a name="3"/>
<table><tr><td><span class="c5"><b>3rd Title</b></span></td></tr></table>
</body></html>
ENDHTML

require 'rexml/document'
xml = REXML::Document.new(code)
xml.elements.each("//span[@class='c5']") do |element|
  puts element
end


[Sliver:~/Desktop] gkistner$ ruby -v tmp.rb
ruby 1.8.2 (2004-12-25) [powerpc-darwin7.7.2]
<span class='c5'><b>3rd Title</b></span>
<span class='c5'><b>1st Title</b></span>
<span class='c5'><b>2nd Title</b></span>

ted

Thanks Gavin. Unfortunately I can't remove the anchors. The html is just a
sample of the documents (not my docs) that I'm given to parse. Someone on
IRC mentioned that XPath 1.0 doesn't guarantee the order of elements.




David A. Black

Hi --

ted said:
Thanks Gavin. Unfortunately I can't remove the anchors. The html is just a
sample of the documents (not my docs) that I'm given to parse. Someone on
IRC mentioned that XPath 1.0 doesn't guarantee the order of elements.

I would be astonished if Sean Russell had combed through the 1.0 spec
to find some loophole that made it plausible to have an iteration not
follow document order. I could be wrong but I think it's more likely
a REXML bug.


David
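
For anyone who has to stay on the affected REXML, one possible workaround is to skip the `//` query and walk the tree by hand, which yields elements in document order by construction. This is only a minimal sketch: the helper names `each_element_in_order` and `spans_with_class` are made up for illustration, and it assumes each span's first child is the `<b>` element, as in the sample document.

```ruby
require 'rexml/document'

# Hypothetical helper: depth-first walk over child elements using the
# plain children array, so no XPath expression is evaluated.
def each_element_in_order(node, &block)
  node.children.each do |child|
    next unless child.is_a?(REXML::Element)
    block.call(child)
    each_element_in_order(child, &block)
  end
end

# Hypothetical helper: collect <span> elements with the given class.
def spans_with_class(doc, css_class)
  found = []
  each_element_in_order(doc) do |el|
    found << el if el.name == 'span' && el.attributes['class'] == css_class
  end
  found
end

html = <<ENDHTML
<html><body>
<a name="1"/>
<table><tr><td><span class="c5"><b>1st Title</b></span></td></tr></table>
<a name="2"/>
<table><tr><td><span class="c5"><b>2nd Title</b></span></td></tr></table>
<a name="3"/>
<table><tr><td><span class="c5"><b>3rd Title</b></span></td></tr></table>
</body></html>
ENDHTML

doc = REXML::Document.new(html)
spans = spans_with_class(doc, 'c5')

# Assumes the first child of each span is the <b> element
titles = spans.map { |span| span.children.first.text }
puts titles
```

The walk visits every element once, so it is slower than a targeted query on large documents, but it does not depend on how the XPath engine orders its results.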
 
daz

Gavin said:
I'm new to Ruby and can't figure out why REXML isn't returning the elements
in the order they appear in the document. Here's my code and the document.

I confirm the problem. Looks like a bug. [...]


... and it's fixed in CVS for 1.8.3

If you need this now, you could download the later version here:
http://www.ruby-lang.org/cgi-bin/cv...rexml.tar.gz?only_with_tag=ruby_1_8;tarball=1

to e.g. "C:\Ruby\TEMP" then change the lookup path at the top of your script.



$:.unshift('C:/Ruby/TEMP') # for rexml fixes
require 'rexml/document'
xml = REXML::Document.new(DATA)
xml.elements.each("//span[@class='c5']") do |element|
  puts element
end

#-> <span class='c5'><b>1st Title</b></span>
#-> <span class='c5'><b>2nd Title</b></span>
#-> <span class='c5'><b>3rd Title</b></span>

__END__
<html>
<body>
<a name="1"/>
<table><tr><td><span class="c5"><b>1st Title</b></span></td></tr></table>
<a name="2"/>
<table><tr><td><span class="c5"><b>2nd Title</b></span></td></tr></table>
<a name="3"/>
<table><tr><td><span class="c5"><b>3rd Title</b></span></td></tr>
</table>
</body>
</html>


daz
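
When overriding the load path like this, it can help to confirm which copy of REXML was actually picked up. The following is a rough sketch rather than part of daz's instructions: it reuses the `C:/Ruby/TEMP` directory from the post as an assumed location, and it assumes the version constant is `REXML::VERSION`, falling back to `REXML::Version` in case an older copy only defines that name.

```ruby
# Assumed location of the unpacked newer REXML (taken from the post above);
# the directory is only added to the load path if it exists.
override_dir = 'C:/Ruby/TEMP'
$:.unshift(override_dir) if File.directory?(override_dir)

require 'rexml/document'

# $LOADED_FEATURES lists every file pulled in by require. Newer Rubies
# store full paths there; older ones may store only the relative name.
loaded_from = $LOADED_FEATURES.grep(/rexml\/document/)

# Assumption: the version constant is VERSION, with Version as a fallback.
version = REXML.const_defined?(:VERSION) ? REXML::VERSION : REXML::Version

puts "REXML version: #{version}"
puts "Loaded from:   #{loaded_from.inspect}"
```

If the printed path does not point at the override directory, the stock library was loaded first and the `$:.unshift` line needs to come before any `require` of REXML.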
 
ted

Thanks daz.


Dan Kohn

I just wanted to mention that I encountered the same bug and that the
new version of the library fixed it for me. Thank you very much for
the clear instructions. If only for-pay products had support that was
this good....

- dan
 
Dan Kohn

Daz, there's a bug in the CVS version of REXML. The following code
produces the error below, but works perfectly with the default 1.8.2
REXML (i.e., when I comment out the first line).
ruby rexmlbug.rb
C:/Dan/dev/rexml/xpath_parser.rb:157:in `expr': undefined method `delete_if' for nil:NilClass (NoMethodError)
        from C:/Dan/dev/rexml/xpath_parser.rb:481:in `d_o_s'
        from C:/Dan/dev/rexml/xpath_parser.rb:478:in `each_index'
        from C:/Dan/dev/rexml/xpath_parser.rb:478:in `d_o_s'
        from C:/Dan/dev/rexml/xpath_parser.rb:469:in `descendant_or_self'
        from C:/Dan/dev/rexml/xpath_parser.rb:314:in `expr'
        from C:/Dan/dev/rexml/xpath_parser.rb:125:in `match'
        from C:/Dan/dev/rexml/xpath_parser.rb:56:in `parse'
        from C:/Dan/dev/rexml/xpath.rb:53:in `each'
        from rexmlbug.rb:28
Exit code: 1


$:.unshift('C:/Dan/dev') # for rexml fixes
require "rexml/document"
include REXML
string = <<EOF
<html>
<td class="t4"><a href="javascript:lu('OZ')">OZ</a>
0204 F Class
<a href="/cgi/get?apt:uMl8TIcSlHI*itn/airports/ICN,itn/air/mp">
ICN</a> to <a
href="/cgi/get?apt:uMl8TIcSlHI*itn/airports/LAX,itn/air/mp">
LAX</a></td>
<tr>
<td class="t4"><font color="white">UNITED</font></td>
<td colspan="4" align="right">
<strong>48,164</strong></td>
</tr>
<tr>
<td class="t4"><font color="white">Star
Alliance</font></td>
<td colspan="4" align="right">
<strong>49,072</strong></td>
</tr>
</html>
EOF

# gsub (not gsub!) always returns a String; gsub! returns nil when nothing matches
doc = Document.new string.gsub(/\s+|&nbsp;/, " ")
array = Array.new
XPath.each(doc, "//td[@colspan='4']/preceding-sibling::td/child::*") { |cell|
  array << cell.texts.to_s
}
puts array
 
