XML parser; maybe ruby is too slow?

nutsmuggler · Sep 15, 2007

Hello folks.
I managed to write an SGML parser with the hpricot library. As I
explained in a previous thread, I just need to compare the source and
target tags of translation memory files from IBM Translation Manager.
The script now runs correctly, but I realised that it cannot cope
with large files; I tried to process a TM file larger than 1MB and the
script took ages to generate the output. Should I switch to a compiled
language for this specific task?
At any rate, here is the script; it's very basic. Please let me know
if I did something wrong or if its slowness is a necessary drawback of
Ruby being interpreted. Cheers,
Davide

#!/usr/local/bin/ruby
require 'rubygems'
require 'hpricot'

$pattern = "server"
result = File.new("result.html", "w")
$stdout = result
puts "<!DOCTYPE HTML PUBLIC '-//W3C//DTD HTML 4.01//EN'
'http://www.w3.org/TR/html4/strict.dtd'>\n
<head>\n
<meta http-equiv='Content-type' content='text/html; charset=utf-8'>\n
<title>Ricerca di '#{$pattern}'</title>\n
<style type='text/css'>
body {
}
p {
margin: 0px;
}
p.source {
background: #FFFFCC;
padding: 10px 5px 10px 5px;
}
p.target {
background: #F8A271;
padding: 10px 5px 10px 5px;
}
span.pattern {
background: #B6B6B6;
}
</style>
</head>\n
<body>\n"
# To read from stdin instead:
# doc = Hpricot.XML(STDIN)

doc = Hpricot.XML(open("bch01aad006_MEMORIA.EXP"))
doc.search("Source").each do |item|
  if item.innerHTML =~ /#{$pattern}/
    highlightedSource = item.innerHTML.gsub(/#{$pattern}/, "<span class='pattern'>#{$pattern}</span>")
    puts "<p class='source'>EN: #{highlightedSource}</p>\n"
    puts "<p class='target'>IT: #{item.next_sibling.html}</p>\n<hr/>"
  end
end
puts "</body>"
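
Before blaming the interpreter, it may help to time each stage separately, since the parse and the search loop could have very different costs. The following is only a sketch using the standard `benchmark` library: `time_steps` is a hypothetical helper, and the workload is stand-in data so the example runs without hpricot or the EXP file. In the real script the two steps would wrap `Hpricot.XML(...)` and the `doc.search("Source")` loop.

```ruby
require 'benchmark'

# Hypothetical helper (not part of the original script): runs each named
# step once, in order, and returns a hash of step name => elapsed seconds.
def time_steps(steps)
  steps.each_with_object({}) do |(name, step), timings|
    timings[name] = Benchmark.realtime { step.call }
  end
end

# Stand-in workload (an assumption, not real TM data).
lines = Array.new(20_000) { |i| "segment #{i}: restart the server" }
matcher = /server/
matched = []

timings = time_steps(
  "filter"    => -> { matched = lines.grep(matcher) },
  "highlight" => -> { matched.map { |l| l.gsub(matcher) { |m| "<b>#{m}</b>" } } }
)

# Report on stderr, because the original script redirects $stdout to result.html.
timings.each { |name, secs| warn format("%-10s %.4fs", name, secs) }
```

If one stage dominates, that is the place to look first, whatever language ends up being used.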

yermej · Sep 16, 2007

I haven't done any comparison testing, but if your *.EXP files are
truly XML, Ruby libxml might be a better choice as it's just a wrapper
around the libxml2 library (see http://libxml.rubyforge.org/).

Jeremy

nutsmuggler · Sep 16, 2007

The problem is that the EXP files are actually SGML; I could not parse them
with REXML precisely because they are not well-formed XML: they
contain unclosed tags, which are apparently valid in some SGML formats but
not in XML. That is why I had to use hpricot, which is less picky.
Cheers,
Davide
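
Since the files are not well-formed, one alternative to building a full document tree is to scan the text for the pairs directly. This is a minimal sketch, not a tested replacement: it assumes every `<Source>...</Source>` is immediately followed by a closed `<Target>...</Target>` (the `Target` tag name and the sample data are assumptions), and `matching_pairs` and `highlight` are hypothetical helpers. Unclosed tags elsewhere in the file are simply ignored.

```ruby
# Assumed layout: <Source>...</Source> directly followed by <Target>...</Target>.
PAIR = %r{<Source>(.*?)</Source>\s*<Target>(.*?)</Target>}m

# Hypothetical helper: returns [source, target] pairs whose source text
# contains the literal search string.
def matching_pairs(sgml, pattern)
  matcher = Regexp.new(Regexp.escape(pattern))
  sgml.scan(PAIR).select { |source, _target| source =~ matcher }
end

# Hypothetical helper: wraps every literal occurrence of the pattern in a span.
def highlight(text, pattern)
  text.gsub(Regexp.new(Regexp.escape(pattern))) { |m| "<span class='pattern'>#{m}</span>" }
end

# Invented sample with unclosed <Segment> and <Control> tags; the real EXP
# layout may differ.
sample = <<~SGML
  <Segment>0000000001
  <Control>
  <Source>Restart the server.</Source>
  <Target>Riavviare il server.</Target>
  <Segment>0000000002
  <Source>Close the window.</Source>
  <Target>Chiudere la finestra.</Target>
SGML

matching_pairs(sample, "server").each do |source, target|
  puts "<p class='source'>EN: #{highlight(source, 'server')}</p>"
  puts "<p class='target'>IT: #{target}</p>"
  puts "<hr/>"
end
```

The regular expression is compiled once and the search string is escaped, so a pattern containing characters such as `.` or `(` is matched literally. Whether this is actually faster than hpricot on a given file would need to be measured.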
