split xml file between two processing instructions

kcwolle

hello,

I want to split an XML file at processing instructions (PIs) into
separate files.
All content between two consecutive PIs should go into a new file.
The file name should contain the content of the first and the last <no>
elements.


example:
<?split ?>
<h1>... text ...</h1>
<start-element/>
<text>
....text text text...
<nr>4</nr>
</text>
text text text
<nr>18</nr>
<end-element/>
<h6> ... text ...</h6>
<?split ?>

In this case the file name should be: test-no4to18.xml and everything
from <h1> to </h6> should be included.
(btw there can be different start and end tags so that no rule on the
starting and ending elements is possible)
I would like to use an XML module (e.g. XML::Twig), but how do I get a
node list that contains all nodes between the processing instructions
for further processing?

Can anybody help me?

Yours

Wolfgang
 
Anno Siegel

kcwolle said:
hello,

I want to split an xml file on processing instructions into different
files.
All content between the two PIs should be included in the new file.
The file name should contain the content of first and the last <no>
elements.


example:
<?split ?>
<h1>... text ...</h1>
<start-element/>
<text>
...text text text...
<nr>4</nr>
</text>
text text text
<nr>18</nr>
<end-element/>
<h6> ... text ...</h6>
<?split ?>

In this case the file name should be: test-no4to18.xml and everything
from <h1> to </h6> should be included.
(btw there can be different start and end tags so that no rule on the
starting and ending elements is possible)
I would like to use an XML module (eg XML::Twig) but how do I get a
node list that contains all nodes between the processing instructions
for further processing?

What have you tried so far?

We help people with programming, but we don't deliver programs
according to specification.

Anno
 
Tad McClellan

kcwolle said:
I want to split an xml file on processing instructions into different
files.


Does it have to work on arbitrary XML or only on "your" XML?

Might you have PIs like this?

<?split ?>
or
<?split
?>

If so, you're on your own. If not, see below.

All content between the two PIs should be included in the new file.
The file name should contain the content of first and the last <no>
elements.


example:
<?split ?>
<h1>... text ...</h1>
<start-element/>
<text>
...text text text...
<nr>4</nr>
</text>
text text text
<nr>18</nr>
<end-element/>
<h6> ... text ...</h6>
<?split ?>

In this case the file name should be: test-no4to18.xml and everything
from <h1> to </h6> should be included.

I would like to use an XML module


Since you don't need to make use of the XML structuring, I would
treat them as plain ol' text files.

Can anybody help me?


What have you tried so far?

We generally prefer to help those who have attempted to help
themselves first...


This should get you started:

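# assumes the whole file has already been read into $_
foreach my $section ( split /\Q<?split ?>\E/ ) {
    # first and last <nr> value found in this section
    my( $num1, $num2 ) = ($section =~ /<nr>(\d+)/g)[0, -1];
    next unless defined $num1;    # skip sections without any <nr>
    my $fname = "test-no${num1}to$num2.xml";
    print "$fname\n";
}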
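A slightly fuller sketch of the same idea, in case it helps. It assumes the whole document fits in memory and that every PI is written exactly as <?split ?>. The helper name sections_with_names and the sample string are made up for illustration, and the part that writes the files is only hinted at in a comment.

```perl
use strict;
use warnings;

# Hypothetical helper: returns [file name, section text] pairs for every
# section that contains at least one <nr> element.
sub sections_with_names {
    my ($xml) = @_;
    my @result;
    foreach my $section ( split /\Q<?split ?>\E/, $xml ) {
        # List slice: first and last capture of all <nr> matches.
        my ( $first, $last ) = ( $section =~ /<nr>(\d+)<\/nr>/g )[ 0, -1 ];
        next unless defined $first;    # no <nr> in this section
        push @result, [ "test-no${first}to${last}.xml", $section ];
    }
    return @result;
}

# Made-up sample data: two sections, the second with a single <nr>.
my $sample = join '',
    '<?split ?><h1>a</h1><text><nr>4</nr></text>x<nr>18</nr><h6>b</h6>',
    '<?split ?><h1>c</h1><nr>21</nr><?split ?>';

for my $pair ( sections_with_names($sample) ) {
    my ( $fname, $content ) = @$pair;
    print "$fname: ", length($content), " characters\n";
    # Writing the file is left out of this sketch, e.g.:
    # open my $out, '>', $fname or die "Cannot write $fname: $!";
    # print {$out} $content;
    # close $out;
}
```

Note that a section with only one <nr> gets the same number for both ends of the name, because index 0 and index -1 then refer to the same match.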
 
kcwolle

Hello Anno,

I tried the following code to split the document. The problem is that
I get only the first two <no> elements and not the first and the last.

use strict;

my $text;
my $file = shift;
my $outfile = shift;
my $testfile;
open(INPUT, "<$file") or die "Cannot read file $file!\n";
local $/;
$text = <INPUT>;
close INPUT;


while ($text =~ /<\?split \?>(.*?)(?=<\?split \?>)/sg)
{
    my $fragment = $1;
    my ($from, $to) = $fragment =~ /<no>(.*?)<\/no>/isg;
    $testfile = $outfile."\\test-nr".${from}."to".${to}."\.xml";
    open(OUTPUT, ">$testfile") or die "Cannot write file $testfile!!!\n";
    print OUTPUT $fragment;
    close OUTPUT;
}

The general problem with using regular expressions is that elements can
be broken across a split, e.g.
<?split ?><level1><text>xxx</text><level2><text>yyy</text></level2><?split ?><level2><text>zzz</text></level2></level1>
where a level1 element opens in the first <?split ?> section and closes
in the second.
How can such broken elements be handled so that each output file is
well-formed XML?

On the other hand, if I use an XML module, the PI is a node that has no
children. How can the nodes that follow it, up to the next PI, be handled?

Btw, I'm a relative newbie to Perl and XML programming, so I need
some support with these things. Maybe you can help me? :-|

Yours

Wolfgang
 
Tad McClellan

kcwolle said:
The problem is that
I get only the first two <no> elements and not the first and the last.

my ($from, $to) = $fragment =~ /<no>(.*?)<\/no>/isg;


Use a "list slice" ("Slices" section in perldata.pod) to slice
the list that m//g is returning, like I did in my earlier followup:


my ($from, $to) = ($fragment =~ /<no>(.*?)<\/no>/isg)[ 0, -1 ];
                  ^                                 ^^^^^^^^^^
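To see the difference side by side, here is a small self-contained sketch. The sample string and the variable names $a1 and $a2 are made up for illustration. In list context m//g returns every capture, a plain list assignment keeps only as many as there are variables on the left, and the slice picks index 0 (first) and index -1 (last).

```perl
use strict;
use warnings;

# Made-up fragment with three <no> elements.
my $fragment = '<no>4</no> text <no>9</no> text <no>18</no>';

# Plain list assignment: only the first two captures are kept.
my ( $a1, $a2 ) = $fragment =~ /<no>(.*?)<\/no>/isg;

# List slice: index 0 is the first capture, index -1 the last.
my ( $from, $to ) = ( $fragment =~ /<no>(.*?)<\/no>/isg )[ 0, -1 ];

print "without slice: $a1 to $a2\n";
print "with slice:    $from to $to\n";
```

The parentheses around the match matter: they make the match a list that the [ 0, -1 ] subscript can then slice.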
 
