XML parser; maybe ruby is too slow?


nutsmuggler

Hello folks.
I managed to write an SGML parser with the hpricot library. As I
explained in a previous thread, I just need to compare the source and
target tags of translation memory (TM) files from IBM Translation Manager.
The script now works, but I realised that it cannot cope
with large files; I tried to process a TM file larger than 1 MB and the
script took ages to generate the output. Should I switch to a compiled
language for this specific task?
At any rate, here is the script, it's very basic; please let me know
if I did something wrong or if its slowness is a necessary drawback of
Ruby being interpreted. Cheers,
Davide


#!/usr/local/bin/ruby
require 'rubygems'
require 'hpricot'

# Literal text to search for; escaped and compiled once, outside the loop.
$pattern = "server"
pattern_re = Regexp.new(Regexp.escape($pattern))

# Send everything written with puts to result.html.
result = File.new("result.html", "w")
$stdout = result
puts "<!DOCTYPE HTML PUBLIC '-//W3C//DTD HTML 4.01//EN' 'http://www.w3.org/TR/html4/strict.dtd'>
<head>
<meta http-equiv='Content-type' content='text/html; charset=utf-8'>
<title>Ricerca di '#{$pattern}'</title>
<style type='text/css'>
body {
}
p {
  margin: 0px;
}
p.source {
  background: #FFFFCC;
  padding: 10px 5px 10px 5px;
}
p.target {
  background: #F8A271;
  padding: 10px 5px 10px 5px;
}
span.pattern {
  background: #B6B6B6;
}
</style>
</head>
<body>"

# To read from stdin instead:
# doc = Hpricot.XML(STDIN)

doc = Hpricot.XML(open("bch01aad006_MEMORIA.EXP"))
doc.search("Source").each do |item|
  source = item.inner_html
  next unless source =~ pattern_re

  # Wrap every occurrence of the pattern in a highlighting span.
  highlighted_source = source.gsub(pattern_re, "<span class='pattern'>#{$pattern}</span>")
  puts "<p class='source'>EN: #{highlighted_source}</p>"
  # The element following each Source is expected to be its translation.
  puts "<p class='target'>IT: #{item.next_sibling.inner_html}</p>"
  puts "<hr/>"
end
puts "</body>"

# Restore the real stdout before closing the output file.
$stdout = STDOUT
result.close
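
Before switching languages, it may help to measure where the time actually goes: parsing the file, or the search loop. Below is a minimal sketch using Benchmark from the standard library. The timed helper and its labels are hypothetical names made up for illustration, and the report goes to $stderr because the script above redirects $stdout to result.html.

```ruby
require 'benchmark'

# Hypothetical helper: runs the block, reports the elapsed wall-clock
# time on stderr, and returns whatever the block returned.
def timed(label)
  value = nil
  seconds = Benchmark.realtime { value = yield }
  $stderr.puts format("%s: %.3f s", label, seconds)
  value
end

# Stand-in workload so the sketch runs on its own.
total = timed("sum") { (1..1000).reduce(:+) }

# In the script above, the same helper could wrap each stage, e.g.:
#   doc = timed("parse") { Hpricot.XML(open("bch01aad006_MEMORIA.EXP")) }
#   timed("search") { doc.search("Source").each { |item| ... } }
```

If nearly all of the time is reported for the parse stage, the search loop is probably not the part worth rewriting.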
 

yermej

I haven't done any comparison testing, but if your *.EXP files are
truly XML, Ruby libxml might be a better choice as it's just a wrapper
around the libxml2 library (see http://libxml.rubyforge.org/).

Jeremy
 

nutsmuggler

The problem is that the EXP files are actually SGML; I could not parse them
with REXML precisely because they are not well-formed XML: they
contain unclosed tags, which are apparently valid in some SGML formats, but
not in XML. That is why I had to use hpricot, which is less picky.
Cheers,
Davide
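
Since the files are not well-formed XML, one alternative to building a full document tree is to scan the raw text for Source/Target pairs. The following is only a sketch of that different technique, not what the script above does. It assumes that every `<Source>...</Source>` is immediately followed by a `<Target>...</Target>`, that neither tag carries attributes, and that the pairs are not nested. PAIR_RE, matching_pairs and the sample text are hypothetical and made up for illustration; the real EXP layout may differ.

```ruby
# Assumed layout (not confirmed by the thread): a closed Source element
# directly followed by a closed Target element, with no attributes.
PAIR_RE = %r{<Source>(.*?)</Source>\s*<Target>(.*?)</Target>}m

# Returns [source, target] pairs whose source contains the literal pattern.
def matching_pairs(sgml, pattern)
  needle = Regexp.new(Regexp.escape(pattern))
  sgml.scan(PAIR_RE).select { |source, _target| source =~ needle }
end

# Invented sample with an unclosed tag, standing in for a real EXP file.
sample = <<~SGML
  <Control>
  <Source>Restart the server.</Source>
  <Target>Riavviare il server.</Target>
  <Control>
  <Source>Close the window.</Source>
  <Target>Chiudere la finestra.</Target>
SGML

matching_pairs(sample, "server").each do |source, target|
  puts "EN: #{source}"
  puts "IT: #{target}"
end
```

A scan like this ignores everything outside the two tags, so unclosed tags elsewhere in the file do not matter, but it will silently miss pairs that break the assumed layout.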
 
