NEWBIE: Ruby & XML

  • Thread starter Jerome David Sallinger

Jerome David Sallinger

Hello.

I am working with some XML logs coming from a network simulator.
My aim is to extract the time-varying values of any given variable.

Here is some example data:

string = <<EOF
<seqexml version="1.0">
<primitive name='PHY_DATA_IND' time="00:00:40.450" CFN="206"
sap='NW_MAC_SAP' direction='uplink' bts_unit='1:1'
channel_name='HS_DPCCH' channel_number='0' >
<parameter name="CQI">23</parameter>
<parameter name="H-ARQ Status">DTX</parameter>
</primitive><primitive name='PHY_DATA_IND' time="00:00:40.460" CFN="207"
sap='NW_MAC_SAP' direction='uplink' bts_unit='1:1'
channel_name='HS_DPCCH' channel_number='0' >
<parameter name="CQI">22</parameter>
<parameter name="H-ARQ Status">DTX</parameter>
</primitive><primitive name='PHY_DATA_IND' time="00:00:40.460" CFN="207"
sap='NW_MAC_SAP' direction='uplink' bts_unit='1:1'
channel_name='HS_DPCCH' channel_number='0' >
<parameter name="CQI">22</parameter>
<parameter name="H-ARQ Status">DTX</parameter>
</primitive><primitive name='PHY_DATA_IND' time="00:00:40.460" CFN="207"
sap='NW_MAC_SAP' direction='uplink' bts_unit='1:1'
channel_name='HS_DPCCH' channel_number='0' >
<parameter name="CQI">23</parameter>
<parameter name="H-ARQ Status">DTX</parameter>
</primitive><primitive name='PHY_DATA_IND' time="00:00:40.460" CFN="207"
sap='NW_MAC_SAP' direction='uplink' bts_unit='1:1'
channel_name='HS_DPCCH' channel_number='0' >
<parameter name="CQI">22</parameter>
<parameter name="H-ARQ Status">DTX</parameter>
</primitive><primitive name='PHY_DATA_IND' time="00:00:40.460" CFN="207"
sap='NW_MAC_SAP' direction='uplink' bts_unit='1:1'
channel_name='HS_DPCCH' channel_number='0' >
<parameter name="CQI">24</parameter>
<parameter name="H-ARQ Status">DTX</parameter>
</primitive><primitive name='PHY_DATA_IND' time="00:00:40.470" CFN="208"
sap='NW_MAC_SAP' direction='uplink' bts_unit='1:1'
channel_name='HS_DPCCH' channel_number='0' >
<parameter name="CQI">21</parameter>
<parameter name="H-ARQ Status">DTX</parameter>
</primitive><primitive name='PHY_DATA_IND' time="00:00:40.470" CFN="208"
sap='NW_MAC_SAP' direction='uplink' bts_unit='1:1'
channel_name='HS_DPCCH' channel_number='0' >
<parameter name="CQI">22</parameter>
<parameter name="H-ARQ Status">DTX</parameter>
</primitive>
</seqexml>
EOF

For example, I may want to extract all the "CQI" values and their
timestamps to get:

23, 00:00:40.450
22, 00:00:40.460
22, 00:00:40.460
23, 00:00:40.460
22, 00:00:40.460
24, 00:00:40.460
21, 00:00:40.470
22, 00:00:40.470

Question: These files can be very large, and keeping the computer
resource overhead low is important. I've looked at other threads on this
forum to decide which method of extracting data against timestamps would
be the quickest, but the information has been conflicting.

I understand that stream parsing is faster than DOM. I also understand
that libxml is faster than REXML, but libxml streaming uses DOM. So is
it safe to assume that REXML streaming is faster than libxml streaming?

I also need to consider which approach would be easier to implement.
 
brabuhr

I make no claim about what might be best :) but Nokogiri seems to be
the leading Ruby XML library at the moment. I quickly adapted an old
REXML pull parser to work with your sample data:

require 'rexml/parsers/pullparser'

# Yields one hash per <primitive>: its attributes plus each <parameter> value.
def parse(stream)
  raise "BlockRequired" unless block_given?

  parser = REXML::Parsers::PullParser.new(stream)

  row = {}
  col = nil

  while parser.has_next?
    event = parser.pull

    case event.event_type
    when :start_element
      case event[0]
      when 'primitive'
        # event[1] is the attribute hash; it becomes the start of the row
        row = event[1]; col = nil
      when 'parameter'
        col = event[1]["name"]
      end

      row[col] ||= "" if col

    when :end_element
      col = nil

      case event[0]
      when 'primitive'
        yield(row)
      else
        # ignore
      end

    when :text
      # only collect text while inside a <parameter>
      row[col] << event[0].chomp if col

    else
      # ignore
    end
  end
end

parse(string) { |row|
  # p row
  puts "#{row["CQI"]}, #{row["time"]}"
}

$ ruby x.rb
23, 00:00:40.450
22, 00:00:40.460
22, 00:00:40.460
23, 00:00:40.460
22, 00:00:40.460
24, 00:00:40.460
21, 00:00:40.470
22, 00:00:40.470

The original program I lifted that from was processing XML files of up
to several gigabytes. Particularly on the largest files, we saw much
better performance running under JRuby than under MRI (1.8.5 or so).
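
Since the files are large, it may also be worth feeding the parser an
open file rather than a string, so the whole log never has to sit in
memory. Below is a rough sketch of the same extraction using REXML's
push-style stream listener instead of the pull parser. The names
PrimitiveListener and each_primitive are made up for this example, and
no claim is made here about which of the two styles is faster.

```ruby
require 'rexml/document'
require 'rexml/streamlistener'
require 'stringio'

# Hypothetical listener: builds one hash per <primitive> element
# (its attributes plus each <parameter> value) and passes it to a block.
class PrimitiveListener
  include REXML::StreamListener

  def initialize(&on_row)
    @on_row = on_row
    @row = nil
    @col = nil
  end

  def tag_start(name, attrs)
    case name
    when 'primitive'
      @row = attrs.dup            # start the row from the attributes
    when 'parameter'
      @col = attrs['name']
      @row[@col] = +'' if @row && @col
    end
  end

  def text(data)
    # only collect text while inside a <parameter>
    @row[@col] << data.strip if @row && @col
  end

  def tag_end(name)
    case name
    when 'parameter'
      @col = nil
    when 'primitive'
      @on_row.call(@row)
      @row = nil
    end
  end
end

# Hypothetical helper: io can be a String or any IO-like object.
def each_primitive(io, &block)
  REXML::Document.parse_stream(io, PrimitiveListener.new(&block))
end

sample = StringIO.new(<<XML)
<seqexml version="1.0">
<primitive name='PHY_DATA_IND' time="00:00:40.450" CFN="206">
<parameter name="CQI">23</parameter>
<parameter name="H-ARQ Status">DTX</parameter>
</primitive>
</seqexml>
XML

each_primitive(sample) { |row| puts "#{row['CQI']}, #{row['time']}" }
```

With a log on disk, the StringIO would be replaced by an open file,
along the lines of File.open(path) { |f| each_primitive(f) { |row| ... } },
where path is whatever your simulator writes to.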