Jerome David Sallinger
Hello.
I am working with some XML logs coming from a network simulator.
My aim is to extract how any given variable changes over time, i.e. its
value at each timestamp.
Here is some example data:
string = <<EOF
<seqexml version="1.0">
<primitive name='PHY_DATA_IND' time="00:00:40.450" CFN="206"
sap='NW_MAC_SAP' direction='uplink' bts_unit='1:1'
channel_name='HS_DPCCH' channel_number='0' >
<parameter name="CQI">23</parameter>
<parameter name="H-ARQ Status">DTX</parameter>
</primitive><primitive name='PHY_DATA_IND' time="00:00:40.460" CFN="207"
sap='NW_MAC_SAP' direction='uplink' bts_unit='1:1'
channel_name='HS_DPCCH' channel_number='0' >
<parameter name="CQI">22</parameter>
<parameter name="H-ARQ Status">DTX</parameter>
</primitive><primitive name='PHY_DATA_IND' time="00:00:40.460" CFN="207"
sap='NW_MAC_SAP' direction='uplink' bts_unit='1:1'
channel_name='HS_DPCCH' channel_number='0' >
<parameter name="CQI">22</parameter>
<parameter name="H-ARQ Status">DTX</parameter>
</primitive><primitive name='PHY_DATA_IND' time="00:00:40.460" CFN="207"
sap='NW_MAC_SAP' direction='uplink' bts_unit='1:1'
channel_name='HS_DPCCH' channel_number='0' >
<parameter name="CQI">23</parameter>
<parameter name="H-ARQ Status">DTX</parameter>
</primitive><primitive name='PHY_DATA_IND' time="00:00:40.460" CFN="207"
sap='NW_MAC_SAP' direction='uplink' bts_unit='1:1'
channel_name='HS_DPCCH' channel_number='0' >
<parameter name="CQI">22</parameter>
<parameter name="H-ARQ Status">DTX</parameter>
</primitive><primitive name='PHY_DATA_IND' time="00:00:40.460" CFN="207"
sap='NW_MAC_SAP' direction='uplink' bts_unit='1:1'
channel_name='HS_DPCCH' channel_number='0' >
<parameter name="CQI">24</parameter>
<parameter name="H-ARQ Status">DTX</parameter>
</primitive><primitive name='PHY_DATA_IND' time="00:00:40.470" CFN="208"
sap='NW_MAC_SAP' direction='uplink' bts_unit='1:1'
channel_name='HS_DPCCH' channel_number='0' >
<parameter name="CQI">21</parameter>
<parameter name="H-ARQ Status">DTX</parameter>
</primitive><primitive name='PHY_DATA_IND' time="00:00:40.470" CFN="208"
sap='NW_MAC_SAP' direction='uplink' bts_unit='1:1'
channel_name='HS_DPCCH' channel_number='0' >
<parameter name="CQI">22</parameter>
<parameter name="H-ARQ Status">DTX</parameter>
</primitive>
</seqexml>
EOF
For example, I may want to pull out all the "CQI" values and their
timestamps to get:
23, 00:00:40.450
22, 00:00:40.460
22, 00:00:40.460
23, 00:00:40.460
22, 00:00:40.460
24, 00:00:40.460
21, 00:00:40.470
22, 00:00:40.470
Question: These files can be very large, so keeping the computer
resource overhead low is important. I've looked at other threads on this
forum to decide which method of extracting data against timestamps would
be the quickest, but the information has been conflicting.
I understand that stream parsing is faster than DOM. I also understand
that libxml is faster than REXML, but libxml streaming uses DOM. So is
it safe to assume that REXML streaming is faster than libxml streaming?
I also need to consider which approach would be easier to
implement.
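To give an idea of the implementation effort on the REXML side, here is a minimal sketch of the streaming route using REXML::StreamListener. The class name SeriesListener and the helper extract_series are made up for illustration, and the sketch assumes every value of interest sits in a <parameter> element nested inside a <primitive> that carries the time attribute, as in the sample above. It says nothing about speed relative to libxml.

```ruby
require 'rexml/document'
require 'rexml/streamlistener'

# Hypothetical listener: collects [value, time] pairs for one parameter name.
class SeriesListener
  include REXML::StreamListener
  attr_reader :rows

  def initialize(wanted)
    @wanted  = wanted
    @rows    = []
    @time    = nil
    @capture = false
    @buffer  = String.new
  end

  # Remember the timestamp of the enclosing <primitive>, and start
  # buffering text when the wanted <parameter> opens.
  def tag_start(name, attrs)
    case name
    when 'primitive'
      @time = attrs['time']
    when 'parameter'
      @capture = (attrs['name'] == @wanted)
      @buffer  = String.new
    end
  end

  # Text may arrive in several pieces, so append rather than assign.
  def text(data)
    @buffer << data if @capture
  end

  def tag_end(name)
    return unless name == 'parameter' && @capture
    @rows << [@buffer.strip, @time]
    @capture = false
  end
end

# Hypothetical helper: source may be a String or an open IO.
def extract_series(source, wanted = 'CQI')
  listener = SeriesListener.new(wanted)
  REXML::Document.parse_stream(source, listener)
  listener.rows
end

sample = <<EOF
<seqexml version="1.0">
<primitive name='PHY_DATA_IND' time="00:00:40.450" CFN="206"
sap='NW_MAC_SAP' direction='uplink' bts_unit='1:1'
channel_name='HS_DPCCH' channel_number='0' >
<parameter name="CQI">23</parameter>
<parameter name="H-ARQ Status">DTX</parameter>
</primitive><primitive name='PHY_DATA_IND' time="00:00:40.460" CFN="207"
sap='NW_MAC_SAP' direction='uplink' bts_unit='1:1'
channel_name='HS_DPCCH' channel_number='0' >
<parameter name="CQI">22</parameter>
<parameter name="H-ARQ Status">DTX</parameter>
</primitive>
</seqexml>
EOF

extract_series(sample).each { |value, time| puts "#{value}, #{time}" }
```

For a large log, passing an open File instead of a String (for example File.open(path) { |io| extract_series(io) }, with path being whatever the log file is called) should avoid holding the whole document in memory as one string, though the collected rows still grow with the number of matches.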