magic/xml library for easy XML processing

  • Thread starter Tomasz Wegrzanowski

Tomasz Wegrzanowski

Hello,

Some of you may be interested in a new library for XML processing.
It is inspired by languages designed just for XML-processing like CDuce
and to some extent by Perl's XML::Twig. Basically easy things
should be easy, and everything should be integrated very tightly
with Ruby.

The code and (rather incomplete) documentation
are here -> http://zabor.org/taw/magic_xml/

A few examples, so you can quickly see whether you're interested or not :)

Parse the Atom feed for my blog and print post titles and URLs:

doc = XML.from_url "http://t-a-w.blogspot.com/atom.xml"
doc.children(:entry).children(:link) {|c|
  print "#{c[:title]}\n#{c[:href]}\n\n" if c[:rel] == "alternate"
}

Get my del.icio.us posts about magic/xml and format them
as an XHTML list (for magic/xml's website):

deli_passwd = File.read("/home/taw/.delipasswd").chomp
url = "http://taw:#{deli_passwd}@del.icio.us/api/posts/recent?tag=taw+blog+magicxml"
XML.from_url(url).children(:post).reverse.each_with_index {|p,i|
  print XML.li("#{i+1}. ", XML.a({:href => p[:href]}, p[:description]))
}

Extract articles and IDs from a Wikipedia dump. It keeps only
small fragments in memory, but provides all convenient access
methods (works like XML::Twig, but with much nicer interface):

XML.parse_as_twigs(STDIN) {|node|
  next unless node.name == :page
  node.complete!
  t = node.children(:title)[0].contents
  i = node.children(:id)[0].contents
  print "#{i}: #{t}\n"
}

More about stream processing with magic/xml at
http://t-a-w.blogspot.com/2006/08/xml-stream-processing-with-magicxml.html
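
To illustrate the general idea of keeping only one fragment in memory at a time, here is a minimal sketch built directly on REXML's stream events. It is not magic/xml's implementation and does not use its API: the `PageListener` class, the `extract_pages` helper, and the inline sample dump are all hypothetical names and data made up for the example.

```ruby
require "rexml/document"
require "rexml/streamlistener"

# Hypothetical listener: builds one small Hash per <page> from parse events,
# so memory use is proportional to a single page, not to the whole dump.
class PageListener
  include REXML::StreamListener

  def initialize(&block)
    @block = block
    @page = nil   # Hash being filled while inside <page>, nil otherwise
    @stack = []   # names of the currently open elements
  end

  def tag_start(name, _attrs)
    @stack.push(name)
    @page = { "title" => +"", "id" => +"" } if name == "page"
  end

  def text(data)
    return unless @page
    # Only direct children of <page> count, so <revision><id> is ignored.
    return unless @stack.length >= 2 && @stack[-2] == "page"
    key = @stack.last
    @page[key] << data if @page.key?(key)
  end

  def tag_end(name)
    @stack.pop
    return unless name == "page"
    @block.call(@page)   # hand the finished page to the caller
    @page = nil          # and forget it
  end
end

# Made-up sample standing in for a Wikipedia dump read from STDIN.
SAMPLE_DUMP = <<~XML
  <mediawiki>
    <page><title>Ruby</title><id>1</id><revision><id>99</id></revision></page>
    <page><title>XML</title><id>2</id></page>
  </mediawiki>
XML

# Returns the pages as an Array only to make the sketch easy to check;
# a real run over a huge file would print or store each page in the block.
def extract_pages(source)
  pages = []
  REXML::Document.parse_stream(source, PageListener.new { |page| pages << page })
  pages
end

extract_pages(SAMPLE_DUMP).each { |page| puts "#{page["id"]}: #{page["title"]}" }
```

The listener has to track the element stack by hand, which is exactly the bookkeeping a twig-style interface is meant to hide.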

The most important thing to do would be to find cases
where other libraries are more expressive than magic/xml
and fix these cases if possible :) As I don't know half
of the other libraries, and you certainly do, I need your help here :)

And I guess I should also add XPath, port to a faster XML parser (currently
using REXML to get a stream of XML parse events), and add
some interface for accessing fancy XML features like
processing instructions to get it out of alpha :)
 

Adam Keys

Some of you may be interested in a new library for XML processing.
It is inspired by languages designed just for XML-processing like CDuce
and to some extent by Perl's XML::Twig. Basically easy things
should be easy, and everything should be integrated very tightly
with Ruby.

This looks really promising. You almost lost me though. I suspected
from the mention of XML::Twig that magic/xml might handle Really Huge
XML (TM) more gracefully than our current options and it looks like
it does. Point being, you should mention it handles Really Huge XML
gracefully up front, as that is something I find myself inflicting
ugly hacks upon myself to achieve today.

And I guess I should also add XPath, port to a faster XML parser (currently
using REXML to get a stream of XML parse events), and add
some interface for accessing fancy XML features like
processing instructions to get it out of alpha :)

Yes, please, XPath! It could be I'm the only one who likes XPath,
but I find it a great way to pluck data out of XML.

This library looks really good. I'm going to keep it in mind for all
my "some idiot sent me a 1 GB XML file that is a bunch of smaller
files concatenated together" needs.
 

Tomasz Wegrzanowski

This looks really promising. You almost lost me though. I suspected
from the mention of XML::Twig that magic/xml might handle Really Huge
XML (TM) more gracefully than our current options and it looks like
it does. Point being, you should mention it handles Really Huge XML
gracefully up front, as that is something I find myself inflicting
ugly hacks upon myself to achieve today.
[...]

This library looks really good. I'm going to keep it in mind for all
my "some idiot sent me a 1 GB XML file that is a bunch of smaller
files concatenated together" needs.

Handling huge XML files is just a bonus. The main reason the library exists
is its sheer expressive power.

I tried to recode W3C's XQuery Use Cases
( http://www.w3.org/TR/xquery-use-cases/ )
in magic/xml to see how it compares with XQuery on XQuery's own terms,
and they're very close. For the use cases I have translated so far, the
results are below. Sizes are in characters, with runs of whitespace merged
and a few other normalizations applied to make the comparison more
meaningful; the percentage is the Ruby size relative to the XQuery size:

Problem XMP 1: Ruby 187 (114%), XQuery: 164
Problem XMP 2: Ruby 132 (100%), XQuery: 132
Problem XMP 3: Ruby 115 (103%), XQuery: 112
Problem XMP 4: Ruby 400 (101%), XQuery: 398
Problem XMP 5: Ruby 367 (124%), XQuery: 296
Problem XMP 6: Ruby 220 (104%), XQuery: 211
Problem XMP 7: Ruby 232 (135%), XQuery: 172
Problem XMP 8: Ruby 150 (88%), XQuery: 170
Problem XMP 9: Ruby 157 (129%), XQuery: 122
Problem XMP 10: Ruby 298 (142%), XQuery: 210
Problem XMP 11: Ruby 295 (136%), XQuery: 217
Problem XMP 12: Ruby 457 (118%), XQuery: 387
Problem Tree 1: Ruby 166 (61%), XQuery: 270
Problem Tree 2: Ruby 118 (109%), XQuery: 108
Problem Tree 3: Ruby 133 (101%), XQuery: 132
Problem Tree 4: Ruby 75 (93%), XQuery: 81
Problem Tree 5: Ruby 168 (104%), XQuery: 161
Problem Tree 6: Ruby 255 (69%), XQuery: 369
Total: Ruby 3925 (106%), XQuery: 3712
Median ratio: 104%
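
For anyone who wants to recheck the arithmetic, the following sketch recomputes the per-problem percentages, the totals, and the median from the character counts listed above. The counts are copied from the table; the `COUNTS` constant and the `percent` and `median` helpers are names introduced here for illustration, and the median is taken over the rounded percentages, which is an assumption about how the figure was derived.

```ruby
# [problem, Ruby characters, XQuery characters], copied from the table above.
COUNTS = [
  ["XMP 1", 187, 164], ["XMP 2", 132, 132], ["XMP 3", 115, 112],
  ["XMP 4", 400, 398], ["XMP 5", 367, 296], ["XMP 6", 220, 211],
  ["XMP 7", 232, 172], ["XMP 8", 150, 170], ["XMP 9", 157, 122],
  ["XMP 10", 298, 210], ["XMP 11", 295, 217], ["XMP 12", 457, 387],
  ["Tree 1", 166, 270], ["Tree 2", 118, 108], ["Tree 3", 133, 132],
  ["Tree 4", 75, 81], ["Tree 5", 168, 161], ["Tree 6", 255, 369]
].freeze

# Ruby size as a whole-number percentage of the XQuery size.
def percent(ruby, xquery)
  (100.0 * ruby / xquery).round
end

# Median of a list of numbers; averages the two middle values for even sizes.
def median(values)
  sorted = values.sort
  mid = sorted.length / 2
  sorted.length.odd? ? sorted[mid] : (sorted[mid - 1] + sorted[mid]) / 2.0
end

ratios = COUNTS.map { |_name, ruby, xquery| percent(ruby, xquery) }
total_ruby = COUNTS.sum { |_name, ruby, _xquery| ruby }
total_xquery = COUNTS.sum { |_name, _ruby, xquery| xquery }

COUNTS.zip(ratios).each do |(name, ruby, xquery), ratio|
  puts "Problem #{name}: Ruby #{ruby} (#{ratio}%), XQuery: #{xquery}"
end
puts "Total: Ruby #{total_ruby} (#{percent(total_ruby, total_xquery)}%), XQuery: #{total_xquery}"
puts "Median ratio: #{median(ratios).round}%"
```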

I don't think any other Ruby library for XML can get anywhere
close to such results. And efficient processing of large XML files?
That's just a small freebie :-D
 

Zed Shaw

Hello,

Some of you may be interested in a new library for XML processing.
It is inspired by languages designed just for XML-processing like CDuce
and to some extent by Perl's XML::Twig. Basically easy things
should be easy, and everything should be integrated very tightly
with Ruby.

doc = XML.from_url "http://t-a-w.blogspot.com/atom.xml"
doc.children(:entry).children(:link) {|c|
  print "#{c[:title]}\n#{c[:href]}\n\n" if c[:rel] == "alternate"
}

I'd like to have it work like Hpricot as well:

(doc/:entry/:link).each {|c|
...
}

Especially considering you overload [] to get attributes, why not / to
get children?
 

Trans

Zed said:
Hello,

Some of you may be interested in a new library for XML processing.
It is inspired by languages designed just for XML-processing like CDuce
and to some extent by Perl's XML::Twig. Basically easy things
should be easy, and everything should be integrated very tightly
with Ruby.

doc = XML.from_url "http://t-a-w.blogspot.com/atom.xml"
doc.children(:entry).children(:link) {|c|
  print "#{c[:title]}\n#{c[:href]}\n\n" if c[:rel] == "alternate"
}

I'd like to have it work like Hpricot as well:

(doc/:entry/:link).each {|c|
...
}

Especially considering you overload [] to get attributes, why not / to
get children?

http://cherry.rubyforge.org
http://rubyforge.org/projects/cherry/

T.
 

Tomasz Wegrzanowski

doc = XML.from_url "http://t-a-w.blogspot.com/atom.xml"
doc.children(:entry).children(:link) {|c|
  print "#{c[:title]}\n#{c[:href]}\n\n" if c[:rel] == "alternate"
}

I'd like to have it work like Hpricot as well:

(doc/:entry/:link).each {|c|
...
}

Especially considering you overload [] to get attributes, why not / to
get children?

Basically because there are three reasonable things to do with node[:foo]:
* return attribute :foo
* return the first child with tag :foo
* return the list of children with tag :foo
Ruby is not Perl and has no scalar/list context, so we cannot have both
the second and the third folded into one, and doing only the second or
only the third doesn't sound that convincing ;-)

Another issue is that I'd have to overload Array#/ to get
(doc/:entry/:link) working,
and that would have a much higher mental cost than adding a long-named
method like #children to it. Or I could use something other than an Array
for sequences of XML nodes (Hpricot does so with Hpricot::Elements),
but that wouldn't be nice. I'll look at it again after I have all W3C
XQuery Use Cases recoded :)
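
The trade-off described above can be made concrete with a small sketch of the second option: a dedicated sequence class that carries the `/` operator, so that Array itself stays untouched. Everything here is hypothetical and written only for illustration. `Node`, `NodeList`, `SAMPLE_FEED`, and `alternate_hrefs` are made-up names, not part of magic/xml or Hpricot.

```ruby
# Toy element type, not magic/xml's node class.
class Node
  attr_reader :name, :attrs, :kids

  def initialize(name, attrs = {}, kids = [])
    @name = name
    @attrs = attrs
    @kids = kids
  end

  # node[:foo] returns attribute :foo.
  def [](key)
    @attrs[key]
  end

  # node / :foo returns every child with tag :foo, wrapped in a NodeList.
  def /(tag)
    NodeList.new(@kids.select { |kid| kid.name == tag })
  end
end

# Sequence of nodes with its own / operator, instead of patching Array#/.
class NodeList
  include Enumerable

  def initialize(nodes)
    @nodes = nodes
  end

  def each(&block)
    @nodes.each(&block)
  end

  # list / :foo collects the matching children of every node in the list.
  def /(tag)
    NodeList.new(@nodes.flat_map { |node| (node / tag).to_a })
  end
end

# Made-up feed with two entries and three links.
SAMPLE_FEED = Node.new(:feed, {}, [
  Node.new(:entry, {}, [
    Node.new(:link, { rel: "alternate", href: "http://example.com/1" }),
    Node.new(:link, { rel: "self", href: "http://example.com/1.atom" })
  ]),
  Node.new(:entry, {}, [
    Node.new(:link, { rel: "alternate", href: "http://example.com/2" })
  ])
])

# Path-style traversal, then attribute access with [] as in the thread.
def alternate_hrefs(doc)
  (doc / :entry / :link).select { |c| c[:rel] == "alternate" }.map { |c| c[:href] }
end

puts alternate_hrefs(SAMPLE_FEED)
```

The cost is visible in the sketch: results of `/` are no longer plain Arrays, so callers only get Array methods that come through Enumerable unless the wrapper forwards more of them.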
 
