Get XML content using XML::Twig

A

alwaysonnet

Hello all,
I'm trying to parse the XML using the XML::Twig module, as my XML could be
too large to handle using XML::Simple. Please help me work out how to
print the values based on the following:

- get the values of Sender and Receiver
- get the FileType; in this case the possible values are
  InitTAP, FatalRAP and ReTxTAP
Here is the XML content:
<CODE>
<?xml version="1.0" encoding="UTF-8"?>
<Data>
<ConnectionList>
<Connection>
<Sender>BRADD</Sender>
<Receiver>SHANE</Receiver>
<FileItemList>
<FileItem>
<FileID>378910</FileID>
<Tmstp>2009-01-16T16:59:07+01:00</Tmstp>
<FileType>
<InitTAP>
<TAPSeqNo>00083</TAPSeqNo>
<NotifFileInd>false</NotifFileInd>
<ChargeInfo>
<TAPTxCutoffTmstp>2009-01-16T09:43:26+02:00</TAPTxCutoffTmstp>
<TAPAvailTmstp>2009-01-16T16:59:07+01:00</TAPAvailTmstp>
<TAPCurrency>XDR</TAPCurrency>
<TotalNoOfCalls>39</TotalNoOfCalls>
<TotalNetCharge>11.470</TotalNetCharge>
<TotalTax>0.000</TotalTax>
</ChargeInfo>
</InitTAP>
</FileType>
</FileItem>
<FileItem>
<FileID>380582</FileID>
<Tmstp>2009-01-20T18:00:00+01:00</Tmstp>
<FileType>
<ReTxTAP>
<TAPSeqNo>00083</TAPSeqNo>
<NotifFileInd>false</NotifFileInd>
<RefRAPSeqNo>00044</RefRAPSeqNo>
<RefRAPID>380573</RefRAPID>
<ChargeInfo>
<TAPTxCutoffTmstp>2009-01-16T09:43:26+02:00</TAPTxCutoffTmstp>
<TAPAvailTmstp>2009-01-20T18:00:00+01:00</TAPAvailTmstp>
<TAPCurrency>XDR</TAPCurrency>
<TotalNoOfCalls>39</TotalNoOfCalls>
<TotalNetCharge>11.470</TotalNetCharge>
<TotalTax>0.000</TotalTax>
</ChargeInfo>
</ReTxTAP>
</FileType>
</FileItem>
<FileItem>
<FileID>380573</FileID>
<Tmstp>2009-01-16T20:34:45+01:00</Tmstp>
<FileType>
<FatalRAP>
<RAPSeqNo>00044</RAPSeqNo>
<RAPStatus>Exchanged</RAPStatus>
<RefTAPSeqNo>00083</RefTAPSeqNo>
<RefTAPID>378910</RefTAPID>
<RAPCreatTmstp>2009-01-16T20:21:30+01:00</RAPCreatTmstp>
<RAPAvailTmstp>2009-01-16T20:21:30+01:00</RAPAvailTmstp>
<ChargeInfo>
<TAPTxCutoffTmstp>2009-01-16T09:43:26+02:00</TAPTxCutoffTmstp>
<TAPAvailTmstp>2009-01-16T16:59:07+01:00</TAPAvailTmstp>
<TAPCurrency>XDR</TAPCurrency>
<TotalNoOfCalls>-39</TotalNoOfCalls>
<TotalNetCharge>-11.470</TotalNetCharge>
<TotalTax>0.000</TotalTax>
</ChargeInfo>
</FatalRAP>
</FileType>
</FileItem>
</FileItemList>
</Connection>
</ConnectionList>
</Data>
</CODE>
 
J

John Bokma

alwaysonnet said:
Hello all,
I'm trying to parse the XML using XML::Twig Module as my XML could be
very large to handle using XML::Simple. Please help me out of how to
print the values based on the following...
get the values of Sender, Receiver
get the FileType. In this case possible values are
InitTAP,FatalRAP,ReTxTAP

For very simple things like this I would (probably, based on what I just
read) use XML::SAX or (even) XML::Parser. Regarding the latter,
http://johnbokma.com/perl/ has some simple examples under "XML
Processing using Perl".
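
To make the XML::Parser suggestion concrete, here is a minimal sketch of an event-driven (streaming) pass over a trimmed stand-in for the XML in the original post. It is not taken from the examples linked above; the variable names (%conn, @rows, $text) and the shortened sample document are assumptions made for illustration. It relies on XML::Parser's Start, Char and End handlers and on current_element, which inside a Start handler returns the parent of the element being opened.

```perl
use strict;
use warnings;
use XML::Parser;

# Trimmed stand-in for the XML in the original post (assumption: same layout).
my $xml = <<'XML';
<Data><ConnectionList><Connection>
  <Sender>BRADD</Sender>
  <Receiver>SHANE</Receiver>
  <FileItemList>
    <FileItem><FileType><InitTAP><TAPSeqNo>00083</TAPSeqNo></InitTAP></FileType></FileItem>
    <FileItem><FileType><FatalRAP><RAPSeqNo>00044</RAPSeqNo></FatalRAP></FileType></FileItem>
  </FileItemList>
</Connection></ConnectionList></Data>
XML

my (%conn, @rows);
my $text = '';

my $parser = XML::Parser->new(
    Handlers => {
        Start => sub {
            my ($expat, $tag) = @_;
            $text = '';    # begin collecting character data for this element
            # Inside a Start handler, current_element is the parent element.
            my $parent = $expat->current_element;
            if (defined $parent && $parent eq 'FileType') {
                # The child of <FileType> names the file type.
                push @rows, [ @conn{qw(Sender Receiver)}, $tag ];
            }
        },
        # Char may fire several times for one text node, so append.
        Char => sub { $text .= $_[1] },
        End  => sub {
            my ($expat, $tag) = @_;
            $conn{$tag} = $text if $tag eq 'Sender' || $tag eq 'Receiver';
            $text = '';
        },
    },
);
$parser->parse($xml);

printf "Sender: %s, Receiver: %s, Type: %s\n", @$_ for @rows;
```

Only the current Sender, Receiver and the collected rows are held in memory; for a very large file the rows could be printed inside the Start handler instead of being collected.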
 
K

Klaus

Hello all,
I'm trying to parse the XML using XML::Twig Module as my XML could be
very large to handle using XML::Simple. Please help me out of how to
print the values based on the following...
get the values of Sender, Receiver
get the FileType. In this case possible values are
InitTAP,FatalRAP,ReTxTAP

What Tad McClellan and John Bokma suggested should be your first path
of investigation.

However, let me bring in a shameless plug:

You could also use my module XML::Reader
http://search.cpan.org/~keichner/XML-Reader-0.32/lib/XML/Reader.pm

This module is specifically designed to handle very big XML files. It
only uses the memory it needs to hold one XML element at a time
(plus a small amount of additional memory for buffering, which is
independent of the size of the XML file).

Here is a sample program:

use strict;
use warnings;
use XML::Reader;

my $rdr = XML::Reader->newhd(\*DATA, {filter => 5},
    { root => '/Data/ConnectionList/Connection/Sender',   branch => [ '/' ] },
    { root => '/Data/ConnectionList/Connection/Receiver', branch => [ '/' ] },
    { root => '/Data/ConnectionList/Connection/FileItemList/FileItem/FileType',
      branch => [
          '/InitTAP/TAPSeqNo',
          '/ReTxTAP/TAPSeqNo',
          '/FatalRAP/RAPSeqNo',
      ] },
);

my ($sender, $receiver);

while ($rdr->iterate) {
    if    ($rdr->rx == 0) { $sender   = $rdr->rvalue->[0]; }
    elsif ($rdr->rx == 1) { $receiver = $rdr->rvalue->[0]; }
    else {
        my ($InitTAP, $ReTxTAP, $FatalRAP) = @{$rdr->rvalue};
        my ($type, $seqno) = defined $InitTAP  ? ('InitTAP',  $InitTAP)
                           : defined $ReTxTAP  ? ('ReTxTAP',  $ReTxTAP)
                           : defined $FatalRAP ? ('FatalRAP', $FatalRAP)
                           :                     ('???',      '???');

        printf "Sender: %-5s, Receiver: %-5s, Type: %-8s, Seqno: %s\n",
            $sender, $receiver, $type, $seqno;
    }
}

__DATA__
<?xml version="1.0" encoding="UTF-8"?>
<Data>
<ConnectionList>
<Connection>
<Sender>BRADD</Sender>
<Receiver>SHANE</Receiver>
<FileItemList>
<FileItem>
<FileID>378910</FileID>
<Tmstp>2009-01-16T16:59:07+01:00</Tmstp>
<FileType>
<InitTAP>
<TAPSeqNo>00083</TAPSeqNo>
<NotifFileInd>false</NotifFileInd>
<ChargeInfo>
<TAPTxCutoffTmstp>2009-01-16T09:43:26+02:00</TAPTxCutoffTmstp>
<TAPAvailTmstp>2009-01-16T16:59:07+01:00</TAPAvailTmstp>
<TAPCurrency>XDR</TAPCurrency>
<TotalNoOfCalls>39</TotalNoOfCalls>
<TotalNetCharge>11.470</TotalNetCharge>
<TotalTax>0.000</TotalTax>
</ChargeInfo>
</InitTAP>
</FileType>
</FileItem>
<FileItem>
<FileID>380582</FileID>
<Tmstp>2009-01-20T18:00:00+01:00</Tmstp>
<FileType>
<ReTxTAP>
<TAPSeqNo>00083</TAPSeqNo>
<NotifFileInd>false</NotifFileInd>
<RefRAPSeqNo>00044</RefRAPSeqNo>
<RefRAPID>380573</RefRAPID>
<ChargeInfo>
<TAPTxCutoffTmstp>2009-01-16T09:43:26+02:00</TAPTxCutoffTmstp>
<TAPAvailTmstp>2009-01-20T18:00:00+01:00</TAPAvailTmstp>
<TAPCurrency>XDR</TAPCurrency>
<TotalNoOfCalls>39</TotalNoOfCalls>
<TotalNetCharge>11.470</TotalNetCharge>
<TotalTax>0.000</TotalTax>
</ChargeInfo>
</ReTxTAP>
</FileType>
</FileItem>
<FileItem>
<FileID>380573</FileID>
<Tmstp>2009-01-16T20:34:45+01:00</Tmstp>
<FileType>
<FatalRAP>
<RAPSeqNo>00044</RAPSeqNo>
<RAPStatus>Exchanged</RAPStatus>
<RefTAPSeqNo>00083</RefTAPSeqNo>
<RefTAPID>378910</RefTAPID>
<RAPCreatTmstp>2009-01-16T20:21:30+01:00</RAPCreatTmstp>
<RAPAvailTmstp>2009-01-16T20:21:30+01:00</RAPAvailTmstp>
<ChargeInfo>
<TAPTxCutoffTmstp>2009-01-16T09:43:26+02:00</TAPTxCutoffTmstp>
<TAPAvailTmstp>2009-01-16T16:59:07+01:00</TAPAvailTmstp>
<TAPCurrency>XDR</TAPCurrency>
<TotalNoOfCalls>-39</TotalNoOfCalls>
<TotalNetCharge>-11.470</TotalNetCharge>
<TotalTax>0.000</TotalTax>
</ChargeInfo>
</FatalRAP>
</FileType>
</FileItem>
</FileItemList>
</Connection>
</ConnectionList>
</Data>

=======
Here is the output:

Sender: BRADD, Receiver: SHANE, Type: InitTAP , Seqno: 00083
Sender: BRADD, Receiver: SHANE, Type: ReTxTAP , Seqno: 00083
Sender: BRADD, Receiver: SHANE, Type: FatalRAP, Seqno: 00044
 
S

sln

Hello all,
I'm trying to parse the XML using XML::Twig Module as my XML could be
very large to handle using XML::Simple. Please help me out of how to
print the values based on the following...
get the values of Sender, Receiver
get the FileType. In this case possible values are
InitTAP,FatalRAP,ReTxTAP

What Tad McClellan and John Bokma suggested should be your first path
of investigation.

However, let me bring in a shameless plug:

You could also use my module XML::Reader
http://search.cpan.org/~keichner/XML-Reader-0.32/lib/XML/Reader.pm

Indeed shameless.

This module is specifically designed to handle very big XML files, it
only uses the memory it needs to have one XML element at a time in
memory (plus a small additional memory for buffering, which is
independent of the size of the XML file)

Is memory at a premium?

Here is a sample program:

use strict;
use warnings;
use XML::Reader;

my $rdr = XML::Reader->newhd(\*DATA, {filter => 5},
    { root => '/Data/ConnectionList/Connection/Sender',   branch => [ '/' ] },
    { root => '/Data/ConnectionList/Connection/Receiver', branch => [ '/' ] },
    { root => '/Data/ConnectionList/Connection/FileItemList/FileItem/FileType',
      branch => [
          '/InitTAP/TAPSeqNo',
          '/ReTxTAP/TAPSeqNo',
          '/FatalRAP/RAPSeqNo',
          ^^^^^^^^^^^^^^^^^^^^^
What do these have to do with it?
      ] },
);

my ($sender, $receiver);

while ($rdr->iterate) {
    if    ($rdr->rx == 0) { $sender   = $rdr->rvalue->[0]; }
    elsif ($rdr->rx == 1) { $receiver = $rdr->rvalue->[0]; }
    else {
        my ($InitTAP, $ReTxTAP, $FatalRAP) = @{$rdr->rvalue};
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Again, what do these have to do with it?
[snip]
=======
Here is the output:

Sender: BRADD, Receiver: SHANE, Type: InitTAP , Seqno: 00083
Sender: BRADD, Receiver: SHANE, Type: ReTxTAP , Seqno: 00083
Sender: BRADD, Receiver: SHANE, Type: FatalRAP, Seqno: 00044

That's nice. Let's say he was speaking generally when he said "in this case it's":
InitTAP ReTxTAP FatalRAP
Why? Because it's the file type.
Maybe he wants all file types for the sender/receiver.
But it's hard to know what the OP wants, isn't it?

-sln
 
K

Klaus

That's nice. Let's say he was speaking generally when he said "in this case it's":
InitTAP  ReTxTAP  FatalRAP
Why? Because it's the file type.
Maybe he wants all file types for the sender/receiver.

In that case you can use XML::Reader->newhd(... {filter => 2}):

use strict;
use warnings;
use XML::Reader;

my $rdr = XML::Reader->newhd(\*DATA, {filter => 2});

my ($sender, $receiver);

while ($rdr->iterate) {
    if ($rdr->path eq '/Data/ConnectionList/Connection/Sender') {
        $sender = $rdr->value;
    }
    elsif ($rdr->path eq '/Data/ConnectionList/Connection/Receiver') {
        $receiver = $rdr->value;
    }
    elsif ($rdr->is_start
       and $rdr->path =~ m{\A /Data/ConnectionList/Connection/FileItemList/FileItem/FileType/ (\w+) \z}xms) {
        printf "Sender: %-5s, Receiver: %-5s, Type: %s\n",
            $sender, $receiver, $1;
    }
}

Here is the output:

Sender: BRADD, Receiver: SHANE, Type: InitTAP
Sender: BRADD, Receiver: SHANE, Type: ReTxTAP
Sender: BRADD, Receiver: SHANE, Type: FatalRAP
 
S

sln

[filter => 2 example and output snipped]

This is pretty good. I assume it does attribute/value as well.
It appears to be a lot of regex work, the more unknown the
elements become, but that's a tree stack.

It would be good though to have a capture mechanism, where
XML capture can be triggered on/off by the user, later to
be regurgitated to the user (on demand), and given to an
XML::Simple-style mechanism to turn it into filtered records.

It wouldn't change the simple, low-memory stream parsing at all,
just the source would be captured (appended) on/off to a named buffer,
on demand.

It's not as easy as it seems though. CaptureON/OFF (bufname, before/after),
nested captures, single data pool. I think I've done this before.

-sln
 
K

Klaus

This is pretty good. I assume it does attribute/value as well.

Yes it does, just put an '@' symbol in the path, for example
'/InitTAP/ChargeInfo/@attrib1'.

It appears to be a lot of regex work, the more unknown the
elements become, but that's a tree stack.

It would be good though to have a capture mechanism, where
XML capture can be triggered on/off by the user, later to
be regurgitated to the user (on demand), and given to an
XML::Simple-style mechanism to turn it into filtered records.

For simple structures where you know exactly what you are looking for,
you can use {filter => 5} like so:

use strict;
use warnings;
use XML::Reader;

use Data::Dumper;

my $rdr = XML::Reader->newhd(\*DATA, {filter => 5},
    { root => '/Data/ConnectionList/Connection/FileItemList/FileItem/FileType',
      branch => [
          '/InitTAP/TAPSeqNo',
          '/ReTxTAP/TAPSeqNo',
          '/FatalRAP/RAPSeqNo',
          '/InitTAP/ChargeInfo/@attrib1',
          '/InitTAP/ChargeInfo/TAPCurrency',
          '/ReTxTAP/ChargeInfo/TAPCurrency',
          '/FatalRAP/ChargeInfo/TAPCurrency',
      ] },
);

while ($rdr->iterate) {
    print Dumper($rdr->rvalue), "\n";
}

It wouldn't change the simple, low-memory stream parsing at all,
just the source would be captured (appended) on/off to a named buffer,
on demand.

It's not as easy as it seems though. CaptureON/OFF (bufname, before/after),
nested captures, single data pool. I think I've done this before.

For general capture into a buffer, you would use {filter => 3, using
=> '/Data/ConnectionList/Connection/FileItemList/FileItem/FileType'}:

use strict;
use warnings;
use XML::Reader;

my $rdr = XML::Reader->newhd(\*DATA, {filter => 3,
    using => '/Data/ConnectionList/Connection/FileItemList/FileItem/FileType'});

my $buffer = '';

while ($rdr->iterate) {
    my $indentation = '  ' x ($rdr->level - 1);

    if ($rdr->path eq '/') {
        if ($rdr->is_start) {
            $buffer = '';
        }
        elsif ($rdr->is_end) {
            print "\n\n buffer ==>\n", $buffer, "\n\n";
        }
        next;
    }

    if ($rdr->is_start) {
        $buffer .= $indentation.'<'.$rdr->tag.
          join('', map{" $_='".$rdr->att_hash->{$_}."'"} sort keys %{$rdr->att_hash}).
          '>'."\n";
    }

    if ($rdr->type eq 'T' and $rdr->value ne '') {
        $buffer .= $indentation.'  '.$rdr->value."\n";
    }

    if ($rdr->is_end) {
        $buffer .= $indentation.'</'.$rdr->tag.'>'."\n";
    }
}
 
A

alwaysonnet

[XML::Reader examples and discussion omitted]

My intention is to:

- Get each sender and receiver
- Get the file type (could be InitTAP, FatalRAP, etc.)
- For each file type, get the TAPSeqNo, TotalNoOfCalls, etc.

Basically I want all the information in place for processing the
data.

Also, apart from XML::Twig, is there any other module that can handle
large XML files?

Any help or suggestions are appreciated.
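
Since the question was originally about XML::Twig and nobody in the thread shows it, here is a minimal sketch of how the three goals listed above might be met with twig_handlers. It is an illustration under assumptions, not anyone's posted solution: the sample XML is a trimmed stand-in for the original document, and the record field names (sender, receiver, type, seqno, calls) are made up for the example. Calling purge after each FileItem is what keeps memory use roughly constant on a large file.

```perl
use strict;
use warnings;
use XML::Twig;

# Trimmed stand-in for the XML in the original post.
my $xml = <<'XML';
<Data>
 <ConnectionList>
  <Connection>
   <Sender>BRADD</Sender>
   <Receiver>SHANE</Receiver>
   <FileItemList>
    <FileItem>
     <FileType>
      <InitTAP>
       <TAPSeqNo>00083</TAPSeqNo>
       <ChargeInfo><TotalNoOfCalls>39</TotalNoOfCalls></ChargeInfo>
      </InitTAP>
     </FileType>
    </FileItem>
    <FileItem>
     <FileType>
      <FatalRAP>
       <RAPSeqNo>00044</RAPSeqNo>
       <ChargeInfo><TotalNoOfCalls>-39</TotalNoOfCalls></ChargeInfo>
      </FatalRAP>
     </FileType>
    </FileItem>
   </FileItemList>
  </Connection>
 </ConnectionList>
</Data>
XML

my ($sender, $receiver, @records);

my $twig = XML::Twig->new(
    twig_handlers => {
        # Remember the current connection's endpoints as they close.
        Sender   => sub { $sender   = $_[1]->text },
        Receiver => sub { $receiver = $_[1]->text },

        # Called once per completely parsed <FileItem>.
        FileItem => sub {
            my ($t, $item) = @_;
            my $file_type = $item->first_child('FileType')  or return;
            my $type      = $file_type->first_child('#ELT') or return;

            # TAP files carry TAPSeqNo, RAP files carry RAPSeqNo.
            my $seqno = $type->first_child_text('TAPSeqNo');
            $seqno = $type->first_child_text('RAPSeqNo') if $seqno eq '';

            my $charge = $type->first_child('ChargeInfo');
            push @records, {
                sender   => $sender,
                receiver => $receiver,
                type     => $type->tag,    # InitTAP, ReTxTAP, FatalRAP, ...
                seqno    => $seqno,
                calls    => $charge ? $charge->first_child_text('TotalNoOfCalls') : '',
            };

            $t->purge;    # free everything parsed so far
        },
    },
);
$twig->parse($xml);    # for a real file: $twig->parsefile('sample.xml')

printf "%s -> %s: %s seq=%s calls=%s\n",
    @{$_}{qw(sender receiver type seqno calls)} for @records;
```

For a 40MB input the records would more likely be processed inside the FileItem handler than accumulated in @records; the array is only there to keep the sketch short.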
 
K

Klaus

Hello all,
I'm trying to parse the XML using XML::Twig Module as my XML could be
very large to handle using XML::Simple.

Klaus said:
However, let me bring in a shameless plug:
You could also use my module XML::Reader
http://search.cpan.org/~keichner/XML-Reader-0.32/lib/XML/Reader.pm
Indeed shameless.

[...]

It would be good though to have a capture mechanism, where
xml capture can be triggered on/off by the user, later to
be regurgitated to the user (on demand), and given to an
xml::simple style mechanism to turn it into filtered records.

Here is an example of how to use XML::Reader to capture sub-trees from
a (potentially very big) XML file into a buffer and pass that buffer
to XML::Simple:

use strict;
use warnings;
use XML::Reader;
use XML::Simple;
use Data::Dumper;

my $rdr = XML::Reader->newhd(\*DATA, {filter => 3,
    using => '/Data/ConnectionList/Connection/FileItemList/FileItem/FileType'});

my $buffer = '';

while ($rdr->iterate) {

    # path '/' marks the start and the end of each captured FileType sub-tree
    if ($rdr->path eq '/') {
        if ($rdr->is_start) {
            $buffer = qq{<?xml version="1.0" encoding="UTF-8"?><FileType>};
        }
        if ($rdr->is_end) {
            $buffer .= qq{</FileType>};

            my $ref = XMLin($buffer);
            print Dumper($ref), "\n\n";
        }
        next;
    }

    if ($rdr->is_start) {
        $buffer .= '<'.$rdr->tag.
          join('', map{" $_='".$rdr->att_hash->{$_}."'"} sort keys %{$rdr->att_hash}).
          '>';
    }

    if ($rdr->type eq 'T' and $rdr->value ne '') {
        $buffer .= $rdr->value;
    }

    if ($rdr->is_end) {
        $buffer .= '</'.$rdr->tag.'>';
    }
}
 
K

Klaus

Hello all,
I'm trying to parse the XML using XML::Twig Module as my XML could be
very large to handle using XML::Simple.

What Tad McClellan and John Bokma suggested should be your first
path of investigation.
However, let me bring in a shameless plug:
You could also use my module XML::Reader
http://search.cpan.org/~keichner/XML-Reader-0.32/lib/XML/Reader.pm

Indeed shameless.

My intention is to ~
- Get each sender and receiver
- Get the filetype ( could be InitTAP, FatalRAP etc )
- For each of filetype get the TAPSeqNo, NoofCalls etc....

Basically I want all the information in place for processing the
data....

Also, apart from XML::Twig, is there any module which can handle
larger XML files..

As I said before, take the advice of Tad McClellan and John Bokma
first.

If, for whatever reason, you can't follow their advice (and, for
whatever reason, you can't use XML::Twig either), there is always my
"shameless plug" XML::Reader.

There are, in my opinion, two scenarios:

Scenario 1:
You already know how to parse your XML with XML::Simple, but the XML
file is too big to fit entirely into memory.
In that case, I suggest you follow the XML::Reader example that
I gave in this thread today (where I said: "...Here is an example of
how to use XML::Reader to capture sub-trees...");
see http://groups.google.com/group/comp.lang.perl.misc/msg/4bb3a769d96c1b2e

Scenario 2:
You know the general rules of your XML parsing, but you don't know
which XML module to use (and you can't follow the advice from Tad
McClellan and from John Bokma).
In that case, I suggest you follow the XML::Reader example that I
gave in this thread yesterday (where I said: "...use
XML::Reader->newhd(... {filter => 2})...");
see http://groups.google.com/group/comp.lang.perl.misc/msg/762534f342f939e6
 
R

RedGrittyBrick

[XML::Reader examples and discussion omitted]

My intention is to ~

- Get each sender and receiver
- Get the filetype ( could be InitTAP, FatalRAP etc )
- For each of filetype get the TAPSeqNo, NoofCalls etc....

Basically I want all the information in place for processing the
data....

Also, apart from XML::Twig, is there any module which can handle
larger XML files..

Well there's the XML::Reader that Klaus has thoughtfully spent time
explaining and providing examples for. You didn't say whether there is
some reason you'd not use that.

any help or suggestions are appreciated.

For very large XML files, the obvious approach to consider is any SAX
parser. Perl SAX modules I've used before include XML::Parser and XML::SAX.

Have you Googled for "Perl SAX" and searched CPAN for SAX?
 
R

RedGrittyBrick

I recommend you read this
http://xmltwig.com/article/ways_to_rome/ways_to_rome.html
 
A

alwaysonnet

Well there's the XML::Reader that Klaus has thoughtfully spent time
explaining and providing examples for. You didn't say whether there is
some reason you'd not use that.

I do find XML::Reader quite helpful.

I'm running my existing code on a 40MB XML file with both XML::Simple
and XML::Reader to find out which fits my bill.
 
A

alwaysonnet

I'll post my observations in my next post regarding the comparison
times between the XML::Simple and XML::Reader modules.

Also, is it a good idea to use the Storable module to store my data
structure on disk, or should I use it directly? I know this is off-topic
in this context, but I'm trying to understand the possible ways of
parsing the XML file.

use strict;
use warnings;
use XML::Simple;
use Storable;
use Data::Dumper;

my $XML_FILE = "sample.xml";

my $mldata = XMLin($XML_FILE);

# XMLin() already returns a reference, so store it as-is (not \$mldata)
store $mldata, 'file';
my $hashref = retrieve('file');

#print Dumper($hashref);
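
As a quick check of what Storable gives back, here is a small self-contained sketch that needs no XML file. The hash below is only a hypothetical stand-in for the structure XMLin() would return, and the cache file is a temporary file created for the example. The point it illustrates is that store() takes the data reference itself, and retrieve() returns an independent deep copy with the same shape.

```perl
use strict;
use warnings;
use Storable qw(store retrieve);
use File::Temp qw(tempfile);

# Hypothetical stand-in for the hash reference returned by XMLin().
my $data = {
    Sender   => 'BRADD',
    Receiver => 'SHANE',
    FileItem => [ { FileID => 378910 }, { FileID => 380582 } ],
};

# Temporary cache file, removed automatically when the program exits.
my (undef, $cache) = tempfile(UNLINK => 1);

store($data, $cache);           # pass the reference itself, not \$data
my $copy = retrieve($cache);    # deep copy with the same structure

print "Sender: $copy->{Sender}, items: ", scalar @{ $copy->{FileItem} }, "\n";
```

Caching this way only saves the parsing time on later runs; the whole structure still has to fit in memory, so it does not help with the original "too large for XML::Simple" problem.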
 
K

Klaus

Hello all,
I'm trying to parse the XML using XML::Twig Module as my XML could be
very large to handle using XML::Simple.
Klaus said:
However, let me bring in a shameless plug:
You could also use my module XML::Reader
http://search.cpan.org/~keichner/XML-Reader-0.32/lib/XML/Reader.pm
Indeed shameless.
[...]
It would be good though to have a capture mechanism, where
xml capture can be triggered on/off by the user, later to
be regurgitated to the user (on demand), and given to an
xml::simple style mechanism to turn it into filtered records.

use XML::Reader;
my $rdr = XML::Reader->newhd(\*DATA, {filter => 3,
    using => '/Data/ConnectionList/Connection/FileItemList/FileItem/FileType'});

I have now released XML::Reader 0.34
http://search.cpan.org/~keichner/XML-Reader-0.34/lib/XML/Reader.pm

This new version allows the same program (the one that
uses XML::Reader to capture sub-trees from a potentially very big XML
file into a buffer and pass that buffer to XML::Simple) to be written
even more concisely:

use strict;
use warnings;
use XML::Reader 0.34;

use XML::Simple;
use Data::Dumper;

my $rdr = XML::Reader->newhd(\*DATA, {filter => 5},
    { root => '/Data/ConnectionList/Connection/FileItemList/FileItem/FileType',
      branch => '*' },
);

while ($rdr->iterate) {
    my $buffer = $rdr->rval;
    my $ref    = XMLin($buffer);
    print Dumper($ref), "\n\n";
}
 
S

sln

[XML::Reader 0.34 announcement and example snipped]

Good job on this.

my $buffer = '';

while ($rdr->iterate) {
    $buffer .= $rdr->rval;
}

if (length $buffer) {
    my $ref = XMLin('<FileItem>'.$buffer.'</FileItem>');
    print Dumper($ref), "\n\n";
}

-sln
 
J

John Bokma

Klaus said:
my $rdr = XML::Reader->newhd(\*DATA, {filter => 5},

To me filter is very unclear. I understand that these are options to the
program, but just 5 is very confusing. Maybe split "filter" into several
options which, combined, result in 1,2,3,4,5?

Why is the constructor called newhd?

Anyway, thanks for mentioning this module, I will check it out when I
have more time.
 
K

Klaus

my $buffer = '';

while ($rdr->iterate) {
   $buffer .= $rdr->rval;

}

if (length $buffer) {
   my $ref = XMLin('<FileItem>'.$buffer.'</FileItem>');
   print Dumper($ref), "\n\n";

}

If memory is not a concern, then you can import slurp_xml with
use XML::Reader 0.34 qw(slurp_xml):

use strict;
use warnings;
use XML::Reader 0.34 qw(slurp_xml);

use XML::Simple;
use Data::Dumper;

my $root = '/Data/ConnectionList/Connection/FileItemList/FileItem/FileType';
my $lref = slurp_xml(\*DATA, {root => $root, branch => '*'});
my $buffer = join '', map {$$_} @{$lref->[0]};
my $ref = XMLin("<Item>$buffer</Item>");

print Dumper($ref), "\n\n";
 
K

Klaus

To me filter is very unclear. I understand that it are options to the
program, but just 5 is very confusing. Maybe split "filter" in several
options which combined result in 1,2,3,4,5 ?

"filter => 2,3,4,5" is just a construction that has historically grown
inside XML::Reader.

But I agree very much with you, I also find that "filter => 2,3,4,5"
is not expressive at all. I will think of a better way to select the
mode of operation for XML::Reader.

why is the constructor called newhd?

Thanks for the question.

That, again, is a historic accident. Back in the old days of
XML::Reader ver 0.01, there used to be an option {filter => 1}, and the
constructor back then was called new() and defaulted to {filter => 1}.

Then, in version 0.03 (or so) I decided to have the constructor
default to {filter => 2}, but I didn't want to break code that already
used the old default, so I came up with a second constructor called
newhd() that defaults to {filter => 2}.

At some later version of XML::Reader, {filter => 1} and the old
constructor new() that went with it were removed. Therefore it is now
possible to rename newhd() back to new(). I think I will go back to the
constructor new() in a future version of XML::Reader.
 
K

Klaus

To me filter is very unclear. I understand that it are options to the
program, but just 5 is very confusing. Maybe split "filter" in several
options which combined result in 1,2,3,4,5 ?

I will think of a better way to select the
mode of operation for XML::Reader.

why is the constructor called newhd?

[...] I think I will go back to constructor
new() in a future version of XML::Reader.

I have now released a new version of XML::Reader (ver
0.35) with some bug fixes, warts removed, relicensing, etc...
http://search.cpan.org/~keichner/XML-Reader-0.35/lib/XML/Reader.pm

The line I wrote in my previous post (which was for XML::Reader ver
0.34) was:

my $rdr = XML::Reader->newhd(\*DATA, {filter => 5},

With the new version 0.35 of XML::Reader, the same line would be
spelled:

my $rdr = XML::Reader->new(\*DATA, {mode => 'branches'},
 
