Splitting up an XML File

JAG

I have an XML file that looks like this:

<root>
<economist publications="true" >
<name>
<first>John</first>
<last>Doe</last>
</name>
<keywords>
<keyword>Foo</keyword>
<keyword>Bar</keyword>
</keywords>
<title>Indian Chief</title>
</economist>

<economist publications="true" >
<name>
<first>Jane</first>
<last>Smith</last>
</name>
<keywords>
<keyword>More Foo</keyword>
<keyword>More Bar</keyword>
</keywords>
<title>President</title>
</economist>
</root>

But the actual file has about 100 <economist> elements.
I need to write some Perl code to parse this XML file and
write out 100 smaller XML files, each file corresponding to one
<economist> element.

So in my example, I'd write 2 smaller files, one that
looks like this:
<economist publications="true" >
<name>
<first>John</first>
<last>Doe</last>
</name>
<keywords>
<keyword>Foo</keyword>
<keyword>Bar</keyword>
</keywords>
<title>Indian Chief</title>
</economist>

and one that looks like this:
<economist publications="true" >
<name>
<first>Jane</first>
<last>Smith</last>
</name>
<keywords>
<keyword>More Foo</keyword>
<keyword>More Bar</keyword>
</keywords>
<title>President</title>
</economist>

There are some nested elements in the real file, so I think
XML::Simple won't work for this.

Any ideas about how I can do this? I don't need to do any processing
(at least not now) - just reading and writing smaller chunks.

Thanks!
 
Tad McClellan

JAG said:
But the actual file has about 100 <economist> elements.
I need to write some Perl code to parse this XML file and
write out 100 smaller XML files, each file corresponding to one
<economist> element.

There are some nested elements in the real file,


I will assume that <economist> is NOT nested, and that the
start/end tags are on lines by themselves.

Any ideas about how I can do this?


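# Assumes every </economist> end tag is on a line by itself.
$/ = "</economist>\n";   # read one <economist> element per loop iteration
my $n = 0;
while ( my $rec = <> ) {
    # strip non-<economist> stuff (e.g. <root>) before the start tag;
    # skip the trailing chunk that holds only </root>
    next unless $rec =~ s/\A.*?(?=<economist\b)//s;
    open my $fh, '>', 'record.' . $n++ or die "Cannot open output file: $!";
    print {$fh} $rec;
    close $fh or die "Cannot close output file: $!";
}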
 
JAG

I have an XML file that looks like this:
But the actual file has about 100 <economist> elements.
I need to write some Perl code to parse this XML file and
write out 100 smaller XML files, each file corresponding to one
<economist> element.

So in my example, I'd write 2 smaller files, one that
looks like this:
There are some nested elements in the real file, so I think
XML::Simple won't work for this.

Any ideas about how I can do this? I don't need to do any processing
(at least not now) - just reading and writing smaller chunks.

This uses one of my favorite modules, XML::XPath:

[trwww@waveright trwww]$ perl
use warnings;
use strict;
use XML::XPath;
use IO::File;

my($xp) = XML::XPath->new( xml => join('', <DATA>) );
my($nodeset) = $xp->find( '/root/economist' );

my($ext) = 0;

foreach my $record ( $nodeset->get_nodelist() ) {
IO::File->new('> record.'.$ext++)->print($record->toString());
}

__DATA__
<root>
<economist publications="true" >
<name>
<first>John</first>
<last>Doe</last>
</name>
<keywords>
<keyword>Foo</keyword>
<keyword>Bar</keyword>
</keywords>
<title>Indian Chief</title>
</economist>

<economist publications="true" >
<name>
<first>Jane</first>
<last>Smith</last>
</name>
<keywords>
<keyword>More Foo</keyword>
<keyword>More Bar</keyword>
</keywords>
<title>President</title>
</economist>
</root>
Ctrl-D
[trwww@waveright trwww]$ ls -l
total 24
drwxr-xr-x 3 trwww trwww 4096 Aug 17 19:00 apps
drwx------ 3 trwww trwww 4096 Sep 16 20:49 Desktop
drwxr-xr-x 3 trwww trwww 4096 Aug 18 16:50 misc
drwxrwxr-x 3 trwww trwww 4096 Sep 6 19:00 public_html
-rw-rw-r-- 1 trwww trwww 297 Sep 17 22:56 record.0
-rw-rw-r-- 1 trwww trwww 306 Sep 17 22:56 record.1
[trwww@waveright trwww]$ cat record.0
<economist publications="true">
<name>
<first>John</first>
<last>Doe</last>
</name>
<keywords>
<keyword>Foo</keyword>
<keyword>Bar</keyword>
</keywords>
<title>Indian Chief</title>
</economist>[trwww@waveright trwww]$ cat record.1
<economist publications="true">
<name>
<first>Jane</first>
<last>Smith</last>
</name>
<keywords>
<keyword>More Foo</keyword>
<keyword>More Bar</keyword>
</keywords>
<title>President</title>
</economist>[trwww@waveright trwww]$

Todd W.


Thanks! This works beautifully.
Now, here are two more things.

Instead of naming the files record.[0..n], I want each
output file to have the name of the person.
So these two files would be named Jane.Smith and John.Doe

Also, within each <economist> element, there is now an element
called <work> that contains other elements. I need each of these
<work> elements to be written to its own file called lastname_work
and not in the first output file.

So for this XML file:

<root>
<economist publications="true" >
<name>
<first>John</first>
<last>Doe</last>
</name>
<keywords>
<keyword>Foo</keyword>
<keyword>Bar</keyword>
</keywords>
<title>Indian Chief</title>
<work>
<title>Title 1</title>
<content>Some Content</content>
</work>
</economist>

<economist publications="true" >
<name>
<first>Jane</first>
<last>Smith</last>
</name>
<keywords>
<keyword>More Foo</keyword>
<keyword>More Bar</keyword>
</keywords>
<title>President</title>
<work>
<title>Title 2</title>
<content>Some More Content</content>
</work>
</economist>
</root>

So this would produce the same two files your original code produced,
but named John.Doe and Jane.Smith and also without the <work> element.
Instead of printing the work element in this file, it should be printed
in its own file, in this case, called Smith_work and Doe_work.

Thanks again.
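
A minimal sketch of one way the follow-up could be handled, using the same XML::XPath module as the earlier reply. This is not from the thread: write_file is a hypothetical helper, the sketch assumes the first and last names are safe to use as file names, and it assumes that if an <economist> has several <work> elements they may all go into the one lastname_work file. The sample XML is inlined in a heredoc only to keep the example self-contained.

```perl
use strict;
use warnings;
use XML::XPath;

# Hypothetical helper: write $text to $file, dying on any I/O error.
sub write_file {
    my ($file, $text) = @_;
    open my $fh, '>', $file or die "Cannot open $file: $!";
    print {$fh} $text, "\n";
    close $fh or die "Cannot close $file: $!";
}

# Sample input copied from the question above.
my $xml = <<'XML';
<root>
<economist publications="true" >
<name>
<first>John</first>
<last>Doe</last>
</name>
<title>Indian Chief</title>
<work>
<title>Title 1</title>
<content>Some Content</content>
</work>
</economist>

<economist publications="true" >
<name>
<first>Jane</first>
<last>Smith</last>
</name>
<title>President</title>
<work>
<title>Title 2</title>
<content>Some More Content</content>
</work>
</economist>
</root>
XML

my $xp = XML::XPath->new( xml => $xml );

foreach my $econ ( $xp->findnodes('/root/economist') ) {
    # Relative paths are evaluated against the current <economist> node.
    my $first = $xp->findvalue( 'name/first', $econ );
    my $last  = $xp->findvalue( 'name/last',  $econ );

    # Serialize each <work> child, then detach it from its parent so it
    # no longer appears when the <economist> element is printed.
    my @work;
    foreach my $work ( $xp->findnodes( 'work', $econ ) ) {
        push @work, $work->toString;
        $econ->removeChild($work);
    }

    write_file( "$first.$last", $econ->toString );             # e.g. John.Doe
    write_file( "${last}_work", join( "\n", @work ) ) if @work; # e.g. Doe_work
}
```

For a real input file, XML::XPath->new( filename => 'economists.xml' ) could replace the heredoc (the file name here is made up). Removing the <work> node leaves the whitespace that surrounded it, so the person's file may contain an extra blank line where the element used to be.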
 
