Regular Expression for XML Parsing

tushar.saxena

Hi,

I have a set of XML files from which I need to extract some data. The
format of the file is as follows :

<tag1>
<tag3>DATA1</tag3>
</tag1>

<tag2>
<tag3>DATA2</tag3>
</tag2>

I need to extract the DATA part of the xml structure

Note : tag3 can be contained either within tag1 or tag2, but I need to
extract data only from tag1. i.e. DATA1 should be extracted, but not
DATA2

If I want to get both DATA1 and DATA2 I can use a simple regex like :

if (($_ =~ /<tag3>(\w+)<\/tag3>/g))
{
print $1
}

But if I try to get only DATA1 (embedded within tag1) I try using
something like this, but am unable to get it to work

if (($_ =~ /<tag1>[\n\s\S\w\W]*<tag2>(\w+)<\/tag2>[\n\s\S\w\W]*<\/tag1>/g))
{
print $1
}

In this second case, the match itself fails.

Any help would be appreciated !
 
Jürgen Exner

> I have a set of XML files
> I need to extract the DATA part of the xml structure
> If I want to get both DATA1 and DATA2 I can use a simple regex like :

It's a bad idea in the first place. XML is not a regular language, why would
you use regular expressions to parse it?

> Any help would be appreciated !

Use a tool that is designed to parse XML like e.g. any of the XML parser
modules on CPAN.

jue
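
As one possible illustration of the parser-module suggestion above, here is a minimal sketch using XML::LibXML from CPAN. The choice of module is an assumption, since the post does not name one. Because the sample in the question has no single root element, the sketch wraps it in a made-up <root> element before parsing; the helper name tag1_data is also hypothetical.

```perl
use strict;
use warnings;
use XML::LibXML;    # CPAN module, not part of core Perl

# Hypothetical helper: return the text of every <tag3> that is a direct
# child of a <tag1>. The fragment is wrapped in a made-up <root> element
# because the sample in the question has no single root.
sub tag1_data {
    my ($fragment) = @_;
    my $doc = XML::LibXML->load_xml( string => "<root>$fragment</root>" );
    return map { $_->textContent } $doc->findnodes('//tag1/tag3');
}

my $sample = <<'XML';
<tag1>
<tag3>DATA1</tag3>
</tag1>

<tag2>
<tag3>DATA2</tag3>
</tag2>
XML

print "$_\n" for tag1_data($sample);
```

The XPath expression //tag1/tag3 selects tag3 only where its parent is tag1, so DATA2 is skipped without any pattern matching on the raw text.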
 
patriknym

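$/ = "";    # paragraph mode: read one blank-line-separated block at a time

while (<>) {
    # /s lets . match the newlines inside a block
    if ( m{<tag1>.*?<tag3>(\w+)</tag3>.*?</tag1>}s ) {
        print "$1\n";
    }
}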
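
If a regex must be used anyway, one way to reduce the risk of matching across element boundaries is to work in two steps: first isolate each <tag1>...</tag1> block, then look for <tag3> only inside that block. Note also that the second pattern in the question searches for <tag2> inside <tag1>, which never occurs in the sample, so it cannot match. The following is only a sketch, assuming tag1 elements are never nested and the tags carry no attributes; the sub name extract_tag1_data is hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: collect the contents of <tag3> elements that sit
# inside a <tag1>...</tag1> block. Assumes tag1 is never nested and that
# the tags carry no attributes; real XML should go through a parser.
sub extract_tag1_data {
    my ($text) = @_;
    my @data;
    # Step 1: isolate each tag1 block (non-greedy, /s so . matches newlines).
    while ( $text =~ m{<tag1>(.*?)</tag1>}gs ) {
        my $block = $1;
        # Step 2: search for tag3 only within that block.
        push @data, $block =~ m{<tag3>(\w+)</tag3>}g;
    }
    return @data;
}

my $sample = "<tag1>\n<tag3>DATA1</tag3>\n</tag1>\n\n"
           . "<tag2>\n<tag3>DATA2</tag3>\n</tag2>\n";
print "$_\n" for extract_tag1_data($sample);    # prints DATA1
```

Unlike the paragraph-mode loop, this does not depend on blank lines separating the elements, but it is still fragile compared with a real XML parser.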
 
Tad J McClellan

> I have a set of XML files from which I need to extract some data. The
> format of the file is as follows :
>
> <tag1>
> <tag3>DATA1</tag3>
> </tag1>
>
> <tag2>
> <tag3>DATA2</tag3>
> </tag2>


I thought you said you had an XML file.

That is not a valid XML file...

> I need to extract the DATA part of the xml structure
>
> Note : tag3 can be contained either within tag1 or tag2, but I need to
> extract data only from tag1. i.e. DATA1 should be extracted, but not
> DATA2
>
> If I want to get both DATA1 and DATA2 I can use a simple regex like :


Using a regular expression to "parse" a non-regular language is
fraught with peril, and nearly always a Bad Idea.

Use a module that understands XML for processing XML data.

> Any help would be appreciated !


Assuming that you have actual valid XML in $xml, then:

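use XML::Simple;

# ForceArray keeps tag1 an array ref even when only one <tag1> is present;
# without it, a single <tag1> comes back as a plain hash ref.
my $ref = XMLin($xml, ForceArray => ['tag1']);
foreach my $child ( @{ $ref->{tag1} } ) {
    print "$child->{tag3}\n";
}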
 
Michele Dondi

Subject: Regular Expression for XML Parsing

Nope. Perhaps a Regex for XML Parsing, in the Perl 6 acceptation of a
"Regex" which is not assumed to be a "Regular Expression" any more.
You will have to wait for quite a while, though...


Michele
 