XSL for recursive transformation

Indy

Hi,
I have an XHTML input file with a custom tag which specifies HTML
fragments to include.
For example:
<html>
....
<include frag1="frag1.html" frag2="frag2.html">
More html here
</include>
....html...
<include frag1="frag3.html" ....>...

</html>
The include tag can be nested. The contents of an include tag would be
combined with the fragments [frag1.html and frag2.html] to produce the
output XML, which would replace the currently processed include tag.
After that, the whole output has to be checked for valid XML, and the
process continues until there are no more include tags.

I was wondering about the best way to go about doing this. Is XSL
suitable? If so how?

Thanks
Indy
 
Joe Kesselman

Indy said:
> I was wondering about the best way to go about doing this. Is XSL
> suitable? If so how?

Given that XHTML is an XML language, the *right* way to do this would be
to use XInclude tags. Assuming your XHTML processor supports XInclude,
of course.
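To make that concrete, here is a minimal sketch of XInclude processing using Python's standard-library `xml.etree.ElementInclude` module (which implements a limited subset of XInclude); the fragment filenames and the in-memory fragment table are hypothetical stand-ins for files that the default loader would instead read from disk:

```python
from xml.etree import ElementTree as ET
from xml.etree import ElementInclude

# Hypothetical in-memory fragments standing in for frag1.html etc.
FRAGMENTS = {"frag1.html": "<p>Fragment one</p>"}

def loader(href, parse, encoding=None):
    # Resolve hrefs from the in-memory table instead of the filesystem.
    if parse == "xml":
        return ET.fromstring(FRAGMENTS[href])
    return FRAGMENTS[href]

DOC = """\
<html xmlns:xi="http://www.w3.org/2001/XInclude">
  <body><xi:include href="frag1.html"/></body>
</html>"""

root = ET.fromstring(DOC)
ElementInclude.include(root, loader=loader)   # splice fragments in place
print(ET.tostring(root, encoding="unicode"))
```

Note that the included content must itself be well-formed XML for this to work.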

If it doesn't -- yes, you can implement XInclude, or similar
functionality, in XSLT if you want to. One such implementation can be
seen at http://www.dpawson.co.uk/xsl/sect2/include.html
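By way of illustration, an untested XSLT 1.0 sketch along those lines, using the frag1/frag2 attribute names from the original post and assuming every fragment is itself well-formed XML, might look like:

```xml
<xsl:stylesheet version="1.0"
                xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <!-- identity template: copy everything else through unchanged -->
  <xsl:template match="@*|node()">
    <xsl:copy><xsl:apply-templates select="@*|node()"/></xsl:copy>
  </xsl:template>

  <!-- replace each include element with the documents its attributes
       name, then its own content; nested includes are handled because
       apply-templates recurses -->
  <xsl:template match="include">
    <xsl:apply-templates select="document(@frag1)/node()"/>
    <xsl:apply-templates select="document(@frag2)/node()"/>
    <xsl:apply-templates select="node()"/>
  </xsl:template>
</xsl:stylesheet>
```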

(It's always worth checking Dave Pawson's XSLT FAQ website. He's done a
very good job of collecting many of the best answers from the XSLT
user's mailing list. Which, by the way, is also worth subscribing to if
you're looking for a deeper understanding of stylesheets.)
 
Nick Kew

Joe said:
> Given that XHTML is an XML language, the *right* way to do this would be
> to use XInclude tags. Assuming your XHTML processor supports XInclude,
> of course.

FWIW, mod_transform for Apache is an XSLT filter that supports XInclude
(based on libxml2/libxslt). So it's a solved problem on the Web.

However, XSLT is not a good solution to this, except for small
documents. Inclusion can be streamed, so it'll be hugely faster
and more scalable using a SAX-based parser. mod_publisher would
be a better choice.
 
Peter Flynn

Indy said:
> Hi,
> I have an XHTML input file with a custom tag which specifies HTML
> fragments to include.
> For example:
> <html>
> ...
> <include frag1="frag1.html" frag2="frag2.html">
> More html here
> </include>
> ...html...
> <include frag1="frag3.html" ....>...
>
> </html>
> The include tag can be nested. The contents of an include tag would be
> combined with the fragments [frag1.html and frag2.html] to produce the
> output XML, which would replace the currently processed include tag.
> After that, the whole output has to be checked for valid XML, and the
> process continues until there are no more include tags.
>
> I was wondering about the best way to go about doing this.

Why not just use entity declarations?
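As a sketch of what that looks like, here is an internal-subset example; in real use the entity would be declared SYSTEM and point at a fragment file (which would then need an entity-resolving parser, since Python's stdlib expat only expands internal entities):

```python
import xml.etree.ElementTree as ET

# Internal general entity holding a markup fragment; a SYSTEM entity
# pointing at frag1.html would serve the same role for external files.
DOC = """\
<!DOCTYPE html [
  <!ENTITY header "<table><tr><td>This is a header</td></tr></table>">
]>
<html><body>&header;</body></html>"""

root = ET.fromstring(DOC)   # the entity reference expands during parsing
print(ET.tostring(root, encoding="unicode"))
```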

///Peter
 
Joe Kesselman

Peter said:
> Why not just use entity declarations?

Parsed entities are pretty much dying as XML Schema replaces DTDs.
Schemas don't have any equivalent. XInclude/XLink were supposed to take
over that role.
 
Indy

Hi,
Thanks for your comments, I tried using XInclude tags but came across
some problems.
The fragments that I'm trying to include are not valid XML themselves;
they could, for example, be:
---sof---
<table><tr><td>This is a header</td></tr>
---eof---

and only when the fragments are assembled do they form valid XML.

Do you think XInclude can still be used to achieve this?

Thanks again,
Indeera
 
Richard Tobin

Indy said:
> The fragments that I'm trying to include are not valid XML themselves, ....
> and only when the fragments are assembled do they form valid XML.
> Do you think XInclude can still be used to achieve this?

No. XInclude operates at the level of the XML Infoset, not on
characters. You will need to use a non-XML tool to put them together.

-- Richard
 
Joe Kesselman

Indy said:
> The fragments that I'm trying to include are not valid XML themselves,

In which case XML-aware tools aren't going to handle them. Write
something text-based.
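A minimal text-based sketch of that approach (the comment-style include syntax and the fragment table here are invented for illustration): assemble first as plain text, then check well-formedness once at the end, precisely because the individual fragments need not be well-formed on their own:

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical fragments; neither is well-formed XML by itself.
FRAGMENTS = {
    "frag1.html": '<table><tr><td>This is a header</td></tr>',
    "frag2.html": '</table>',
}

# Hypothetical textual include syntax: <!-- #include file="frag1.html" -->
INCLUDE_RE = re.compile(r'<!--\s*#include\s+file="([^"]+)"\s*-->')

def assemble(text, depth=0):
    if depth > 10:                      # guard against include cycles
        raise RecursionError("include nesting too deep")
    def repl(match):
        # Recurse so fragments may themselves contain includes.
        return assemble(FRAGMENTS[match.group(1)], depth + 1)
    return INCLUDE_RE.sub(repl, text)

doc = ('<html><body>'
       '<!-- #include file="frag1.html" -->'
       '<!-- #include file="frag2.html" -->'
       '</body></html>')

result = assemble(doc)
ET.fromstring(result)   # well-formedness check happens only at the end
print(result)
```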
 
Joe Kesselman

.... or redesign the whole problem so you're working with XML structure
rather than text fragments.
 
Andy Dingley

Indy said:
> I have an XHTML input file with a custom tag which specifies HTML
> fragments to include.

Other posters have suggested ways to include XML fragments in XML.

However, I'd advise against this, because you're trying to embed HTML as
the fragment, and HTML is _not_ XML. HTML needs to be processed with
text- or SGML-aware tools, not XML tools. What happens if you encounter
an HTML fragment that's not well-formed? What happens if you _want_ to
use a fragment that's not well formed?

RSS has addressed this same problem before now. Worth reading the
background.
 
Peter Flynn

Joe said:
> Parsed entities are pretty much dying as XML Schema replaces DTDs.

I think you'll find them alive and kicking in many places. Reports
of the death of DTDs are greatly exaggerated.

> Schemas don't have any equivalent.

QED

> XInclude/XLink were supposed to take over that role.

Oooh look, flying pigs :)

///Peter
 
Joe Kesselman

Peter said:
>> Parsed entities are pretty much dying as XML Schema replaces DTDs.
> I think you'll find them alive and kicking in many places. Reports
> of the death of DTDs are greatly exaggerated.

Uhm. I agree that schemas are taking longer to find their way in than
might have been expected, partly because they're a syntax only a
database expert or computer science geek could love. (Though frankly the
DTD syntax is also pretty hideous.)

However, entities are definitely on the way out. The problem is that
they really aren't all that useful unless there's a fragment that will
appear in a huge number of instances of this kind of document, and even
then they're only a significant advantage when producing the document by
hand; it is a significant pain for software to recognize that the
opportunity exists to take advantage of a parsed entity, and there
usually isn't much to be gained by doing so.

Entities had value when most docs were produced by humans pounding on
raw XML text; they really aren't useful for docs produced by smarter
editors. Most of the things you might still want to use them for can be
handled better by an appropriate tool -- an editor that lets you see and
enter the actual characters rather than their named equivalents, for
example, or a syntax that's actually defined in the document rather than
in a non-tag-language secondary file. Among other things, that permits
different documents to reference different resources rather than having
only a single set, hard-wired into the DTD, that they can name.

> Oooh look, flying pigs :)

I did put it in the imperfect tense... Part of the problem is that we're
finding that the need for a portable syntax for documents referencing
other documents isn't as universal as we expected. Or at least isn't so
right now.

If we'd designed XML completely before releasing it to the public, we
would have started with the infoset (including namespaces and schemas
and includes and links), then designed the syntax and APIs from that.
Instead the W3C started with the syntax and a known-inadequate schema
language (DTDs), and has built everything out from there. The upside is
that folks had a chance to start using XML much earlier, and we've
gotten some benefit from seeing which directions everyone has gone with
it. The downside is that there have been some warts and hiccups and
direction changes along the way, and tools have not always been quick to
catch up -- and even when they have, folks who have working solutions
using the old stopgaps are often reluctant to make the effort to move
over. Which leaves all of us with the job of supporting multiple ways of
doing things and trying to gently push folks toward the ones that will
make their life -- and ours -- easier in the long run.

Oh well. The cutting edge usually has a few nicks in it.
 
Peter Flynn

Joe said:
> Uhm. I agree that schemas are taking longer to find their way in than
> might have been expected, partly because they're a syntax only a
> database expert or computer science geek could love. (Though frankly the
> DTD syntax is also pretty hideous.)

Only a syntax geek would love it, but it has the advantage of being very
terse, and once learned, quite expressive. RelaxNG seems to be the way
forward, but I still feel we did the community a disservice by not
properly investigating the possibility of adding datatyping to DTDs
before running amok with W3C Schemas. Ah well. Another time.

> However, entities are definitely on the way out. The problem is that
> they really aren't all that useful unless there's a fragment that will
> appear in a huge number of instances of this kind of document, and even
> then they're only a significant advantage when producing the document by
> hand;

Actually there is rather a lot of stuff out there that does this.

> it is a significant pain for software to recognize that the
> opportunity exists to take advantage of a parsed entity, and there
> usually isn't much to be gained by doing so.

For parsed entities, yes. Legal boilerplate, tech doc, and chapter
files for long documents are the only real candidates.

Parameter entities are a different matter.

> Entities had value when most docs were produced by humans pounding on
> raw XML text; they really aren't useful for docs produced by smarter
> editors. Most of the things you might still want to use them for can be
> handled better by an appropriate tool -- an editor that lets you see and
> enter the actual characters rather than their named equivalents, for

This refers to character entities. Sadly, editors are still in their
infancy when it comes to the interface (hence my thesis topic), and
there are still a gazillion so-called plaintext editors (non-XML) out
there that XML beginners use, which seriously screws up their chances
when they start editing UTF-8. For this reason, several companies and
projects I have been dealing with have made it policy for the moment
to create ISO-8859-1 files only, and ALL other characters go in as
character entity references or numeric references (fortunately for them
they deal only with western languages in Latin scripts).

> example, or a syntax that's actually defined in the document rather than
> in a non-tag-language secondary file. Among other things, that permits
> different documents to reference different resources rather than having
> only a single set, hard-wired into the DTD, that they can name.


> I did put it in the imperfect tense...

Sorry, I was being deliberately provocative.

> Part of the problem is that we're
> finding that the need for a portable syntax for documents referencing
> other documents isn't as universal as we expected. Or at least isn't so
> right now.

Ahead of the curve as usual :) Although the demand for a syntax to
refer from one document to another is slowly approaching FAQ-level.
It's just embarrassing that we had multi-way bidirectional 3rd-party
linking in the Panorama plugin a decade ago, and still nothing to
replace it.

> If we'd designed XML completely before releasing it to the public,

We'd still be discussing it.

> would have started with the infoset (including namespaces and schemas
> and includes and links), then designed the syntax and APIs from that.
> Instead the W3C started with the syntax and a known-inadequate schema
> language (DTDs), and has built everything out from there. The upside is
> that folks had a chance to start using XML much earlier, and we've
> gotten some benefit from seeing which directions everyone has gone with

I like the description, although I disagree about the infoset. Coming
from the tech doc background, I would have preferred to see some of the
useful SGML features retained and more attention paid to the usability
of markup. Pretending that a document is a tree when it's not (it's a
document!) was a mistake we are still paying for. Starting with the
syntax was OK, IMHO, and pretty much 99% of what we did was right. But
schemas were a later development, a bolt-on which only came when the
XML-Data folks saw the market for the syntax (and that's something else
we'll end up paying for -- I see way too many slabs of data being done
into XML when CSV would be much more sensible).

> it. The downside is that there have been some warts and hiccups and
> direction changes along the way, and tools have not always been quick to
> catch up -- and even when they have, folks who have working solutions
> using the old stopgaps are often reluctant to make the effort to move
> over.

This is going to be the interesting bit. New tools -- *really good* new
tools -- are few and far between. And there are too many good old tools
which have become unavailable just at the point when they were most
needed, because of corporate buyouts resulting in technically-unaware
people dropping the ball.

> Which leaves all of us with the job of supporting multiple ways of
> doing things and trying to gently push folks toward the ones that will
> make their life -- and ours -- easier in the long run.

It does work eventually. I've only had one breakage so far, and that was
due to sabotage.

> Oh well. The cutting edge usually has a few nicks in it.

Mind that axe, Eugene.

///Peter
 
Joe Kesselman

Peter said:
>> If we'd designed XML completely before releasing it to the public,
> We'd still be discussing it.

Which is why they went the other way around. Unfortunately that left us
with some warts where the afterthoughts were tacked on (including some
that could have been avoided, but... oh well; too much water over the
dam at this point).

> I like the description, although I disagree about the infoset. Coming
> from the tech doc background, I would have preferred to see some of the
> useful SGML features retained

Trimming away everything that wasn't absolutely required is what made
implementing XML easy. If you've ever written an SGML processor, you
know getting it right is messy at best. XML was deliberately restricted
to the point where the parser is implementable by an average student in
a week or less.

> This is going to be the interesting bit. New tools -- *really good* new
> tools -- are few and far between.

They're starting to appear, though. If you see a market not being
adequately served, think of it as a marketing opportunity. That's what
got us started on Xerces and Xalan...<grin/>
 
Peter Flynn

Joe said:
> Trimming away everything that wasn't absolutely required is what made
> implementing XML easy. If you've ever written an SGML processor, you
> know getting it right is messy at best. XML was deliberately restricted
> to the point where the parser is implementable by an average student in
> a week or less.

I think Tim Bray's comment was "implementable in 'just a few' 30-hour
Perl hacking sessions" :)

> They're starting to appear, though. If you see a market not being
> adequately served, think of it as a marketing opportunity.

Oh I am, believe me :)

///Peter
 
Joseph Kesselman

Peter said:
> I think Tim Bray's comment was "implementable in 'just a few' 30-hour
> Perl hacking sessions" :)

The concept of the DPH -- Desperate Perl Hacker -- has been invoked a
number of times as an argument for why everything should be kept as
simple as possible. (But not simpler.)
 
