Slow performance with Specific XSLT

Darren

Hi all,

I have a Xalan-C performance problem that I need some help with.
I have a large document on which I need to perform two very simple
transformations:
1) Sort
and
2) Remove the 1st level of the document hierarchy.

The Document is structured like the following:

<Batch>
<Batch>
<ProductTypeA>
<ProductID>009466</ProductID>
<!-- ... other elements -->
</ProductTypeA>
</Batch>
<Batch>
<ProductTypeB>
<ProductID>002700</ProductID>
<!-- ... other elements -->
</ProductTypeB>
<ProductTypeA>
<ProductID>002600</ProductID>
<!-- ... other elements -->
</ProductTypeA>
</Batch>
</Batch>

Within the real document I have over 500,000 ProductTypeX records, and
I want the document to come out like the following:

<Batch>
<ProductTypeA>
<ProductID>002600</ProductID>
<!-- ... other elements -->
</ProductTypeA>
<ProductTypeB>
<ProductID>002700</ProductID>
<!-- ... other elements -->
</ProductTypeB>
<ProductTypeA>
<ProductID>009466</ProductID>
<!-- ... other elements -->
</ProductTypeA>
</Batch>

The problem is that when I run the following XSLT with XalanTransform,
the process takes nearly 2 hours (not including the sort)! However, if I
manually remove the intermediate <Batch> tags, the process takes only 5
minutes (including the sort).

<?xml version='1.0' ?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
version="1.0">
<xsl:output omit-xml-declaration="yes" />
<xsl:template match="Batch">
<Batch>
<xsl:apply-templates select="Batch/*">
<!--<xsl:sort select="ProductID"/>-->
</xsl:apply-templates>
</Batch>
</xsl:template>
<xsl:template match="/ | @* | node()">
<xsl:copy>
<xsl:apply-templates select="@* | node()" />
</xsl:copy>
</xsl:template>
</xsl:stylesheet>

So the version that takes 5 minutes differs from the above by changing
the line:
<xsl:apply-templates select="Batch/*">
To:
<xsl:apply-templates select="*">

And un-commenting the sort.

Can anyone help?

P.S.
1) The file I get is generated by a 3rd-party system, and its
format is not in my control.
2) The File is originally all in 1 physical line, and far too big for
something like SED to process initially.

Darren
 
Jürgen Kahrs

Darren said:
2) The File is originally all in 1 physical line, and far too big for
something like SED to process initially.

There are other tools aside from sed which can do this.
A standard POSIX command for doing this is tr.
But AWK can also do this (you may re-define AWK's line-separator RS).
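
A minimal sketch of the RS idea, assuming a POSIX awk and a well-formed file that ends with ">" (optionally followed by a newline). The function name split_tags is made up for illustration. With RS set to ">", awk reads one tag-sized record at a time instead of one enormous line, and a newline is written only in front of records that begin with "<", so element text such as the ProductID values is left alone. Any "><" sequence gets a break, including inside comments or CDATA, so treat this as a starting point rather than a verified fix.

```shell
# Hypothetical helper: insert a line break between adjacent tags ("><").
# Reads XML on stdin, writes the reformatted XML to stdout.
split_tags() {
  awk 'BEGIN { RS = ">"; ORS = "" }      # one record per ">"-terminated chunk
       $0 == "\n" { next }               # skip the newline left after the last ">"
       {
         # start a new line only when this chunk opens with a tag
         if (NR > 1 && substr($0, 1, 1) == "<") print "\n"
         print $0 ">"                    # put back the ">" that RS consumed
       }
       END { print "\n" }'
}

# Example run on a tiny one-line document
printf '<Batch><Batch><ProductTypeA><ProductID>009466</ProductID></ProductTypeA></Batch></Batch>\n' | split_tags
```

On the real file this would be run as something like split_tags < Products.xml > new.xml, where the file names are only placeholders.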
 
Darren

Thanks for the Reply!

I did initially use the command " awk '{gsub(/
/,"\n");print}'
" to break the file into lines, however the file was left slightly
corrupted by this (I think the AIX version of awk used couldn't cope
with the size of the file).

And tr's performance was non-existent when trying to do an equivalent.
 
Jürgen Kahrs

Darren said:
I did initially use the command " awk '{gsub(/
/,"\n");print}'
" to break the file into lines, however the file was left slightly
corrupted by this (I think the AIX version of awk used couldn't cope
with the size of the file).

I forgot to say about AWK: Whenever you have insane lengths for
lines or fields or anything else, you should at least try
GNU Awk. GNU Awk is well-known for not having limitations
on line length.

Darren said:
And tr's performance was non-existent when trying to do an equivalent.

Interesting.
 
David Carlisle

Don't know why it should be so slow (what does Saxon do, for example?).
You could probably speed it up a bit by using copy-of rather than a
recursive copying template, since below a certain level you just want to
copy whole branches:
<xsl:template match="/Batch">
<Batch>
<xsl:for-each select="Batch/*">
<!--<xsl:sort select="ProductID"/>-->
<xsl:copy-of select="."/>
</xsl:for-each>
</Batch>
</xsl:template>

David
 
Darren

Thanks for this - I'm trying it now ... however, it's already been
running over an hour -- so it doesn't look good!

I will then retry and change the <xsl:for-each select="Batch/*"> with
a <xsl:for-each select="*"> - just to see the time difference.
 
Dimitre Novatchev

Darren said:
Thanks for this - I'm trying it now ... however, it's already been
running over an hour -- so it doesn't look good!

I will then retry and change the <xsl:for-each select="Batch/*"> with
a <xsl:for-each select="*"> - just to see the time difference.

Probably you do not have sufficient memory.

I produced 500000 records of the type you describe and both transformations
provided by you (slightly corrected) take about a minute with MSXML4.

I have a 3GHz Pentium 4 with 2GB of RAM.

The correction to your first transformation is the following:

<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
version="1.0">
<xsl:output omit-xml-declaration="yes" />

<xsl:template match="/*">
<Batch>
<xsl:apply-templates select="Batch/*">
<!--<xsl:sort select="ProductID"/>-->
</xsl:apply-templates>
</Batch>
</xsl:template>

<xsl:template match="/ | @* | node()">
<xsl:copy>
<xsl:apply-templates select="@* | node()" />
</xsl:copy>
</xsl:template>
</xsl:stylesheet>

It avoids trying:

<xsl:apply-templates select="Batch/*">

when the current node is a "Batch"

and this probably saves some time.

The second transformation is:

<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
version="1.0">
<xsl:output omit-xml-declaration="yes" />

<xsl:template match="/*">
<Batch>
<xsl:for-each select="Batch/*">
<xsl:sort select="ProductID"/>

<xsl:copy-of select="."/>
</xsl:for-each>
</Batch>
</xsl:template>
</xsl:stylesheet>

Here I use xsl:copy-of instead of the potentially deep-recursive identity
rule.

Cheers,
Dimitre Novatchev.
 
Darren

Thanks for your reply - I have tested with your suggested changes and
it is still in the nearly-2-hours bracket. I don't think the issue is
related to memory, as the host in question has 24GB of RAM, although it
has an old (700MHz, RISC-based) processor.

Reworking the process into the following steps

1. Use a simple <xsl:copy-of select="."/> to format document into
readable XML (break into multi-lines)
XalanTransform Products.xml copy.xslt new.xml

2. Use AWK to remove Batch Tags
awk '{gsub("</*Batch>","");print}' <new.xml >out.xml

3. Re-insert outer Batch Tags (real 0m22.03s)
print "print \<Batch\>\ncat out.xml\nprint \</Batch\>\n" | sh >new.xml

4. Sort the file!
XalanTransform new.xml sort.xslt out.xml

reduces the processing time to 8.5 minutes - however far less elegant!
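
Steps 2 and 3 could probably be folded into a single pass. The sketch below assumes the input already has one tag per line (the output of step 1) and that the Batch tags carry no attributes. The function name rewrap_batch and the output file name in the usage comment are hypothetical.

```shell
# Hypothetical helper: drop every <Batch> and </Batch> tag from stdin and
# wrap what remains in a single outer <Batch> element on stdout.
rewrap_batch() {
  echo '<Batch>'
  # remove the Batch tags, then skip lines that are left empty
  awk '{ gsub(/<\/?Batch>/, ""); if ($0 != "") print }'
  echo '</Batch>'
}

# usage (file names are placeholders):
#   rewrap_batch < new.xml > flat.xml
```

This avoids building a command string and piping it to sh, but it has not been timed against the original steps.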
 
