Normalizing XHTML with XML

Ryan Stewart

I'm getting XHTML input that can be in a number of formats, and I'm
trying to get it into a consistent format for later use. "Consistent"
in this case means everything in the root/body is in either a p, table,
img, ol, or ul tag. I'm processing just the body text. There is no head
section or anything. So the body is the root of the tree that I'm
processing. I've got almost everything working except one thing. If I
get input like the following:
some text<br/>some more text

then I need that to become two paragraphs, like:
<p>some text</p>
<p>some more text</p>

That's easy enough. But if I get this input:
some text <a href="blah">link</a> some more text

that should all become one paragraph:
<p>some text <a href="blah">link</a> some more text</p>

And if a table, list, or image is encountered, that should be the end
of a paragraph if there is one:
some text<table> ... </table>some more text

becomes
<p>some text</p>
<table> ... </table>
<p>some more text</p>

Again, placing the text nodes inside p tags is simple, but a
problem arises if there is a link or other tag inside some of that
text. (At this point other tags don't actually matter because I'm
stripping them out, but links need to be passed through.)

Basically, my problem boils down to this:
1) I need to select any text node child of the root and surround it
with p tags, but
2) if an a element is a child of the root, it should be joined with any
adjacent text nodes and the whole thing should be surrounded with p
tags.

Can someone give me an example of how to do this with XSL?
 
Joe Kesselman

> 1) I need to select any text node child of the root and surround it
> with p tags, but
> 2) if an a element is a child of the root, it should be joined with any
> adjacent text nodes and the whole thing should be surrounded with p
> tags.

If I put those two rules together, I get "I want to wrap a <p>
element around all the root's children". Since that's trivial, I presume
there's some case where you don't want to do that?
 
Ryan Stewart

Yes, only text nodes and links should be inside p tags. Tables, lists,
and images will also be present and must not be wrapped, especially
since tables and lists are block elements and p tags may only contain
inline elements. Maybe a more complex example:
some text <a href="blah">a link</a> some more text<br/>
third text node<table>...</table>final text node

should become:
<p>some text <a href="blah">a link</a> some more text</p>
<p>third text node</p>
<table>...</table>
<p>final text node</p>

Notice that the <br/> causes a new p element, the first two root-level
text nodes and the a element in between them become one paragraph, the
third text node becomes a paragraph, the table is not touched, and the
last text node becomes a paragraph.
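
To make the rule concrete, here is a minimal sketch of the grouping in Python rather than XSL. It works on a hypothetical flat list of (kind, value) tuples standing in for the root's children, not on a real DOM, and the names INLINE, BREAKS, and group_children are made up for illustration. Runs of adjacent text and a nodes become one paragraph, br ends a run and is dropped, and anything else passes through as a block.

```python
from itertools import groupby

# Hypothetical flat view of the root's children: ("text", string) for a text
# node, ("elem", tag) for an element. This is an illustration, not a real DOM.
INLINE = {"a"}    # elements that may join a paragraph
BREAKS = {"br"}   # elements that end a paragraph and are then dropped

def is_inline(node):
    kind, value = node
    return kind == "text" or value in INLINE

def group_children(nodes):
    """Return ("p", [nodes]) and ("block", node) items in document order."""
    result = []
    for inline, run in groupby(nodes, key=is_inline):
        run = list(run)
        if inline:
            # A run of adjacent text/inline nodes becomes one paragraph.
            result.append(("p", run))
        else:
            # Block elements pass through; <br/> only acts as a separator.
            result.extend(("block", n) for n in run if n[1] not in BREAKS)
    return result

example = [
    ("text", "some text "), ("elem", "a"), ("text", " some more text"),
    ("elem", "br"),
    ("text", "third text node"), ("elem", "table"), ("text", "final text node"),
]
grouped = group_children(example)
for item in grouped:
    print(item)
```

For the example above this yields a paragraph of three nodes, a paragraph of one node, the table, and a final paragraph of one node.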
 
Ryan Stewart

From looking around some more, I'm seeing that XSLT should be viewed as
transforming nodes from a source tree into nodes in a result tree. So a
different way of looking at my problem might be, "How do I grab
consecutive text and inline nodes (besides the br and img elements)
that are children of the root node from the source tree and put them
inside one node (a p element) in the result tree?"
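
That reframing can be sketched on a real tree. The following is Python with the standard library's xml.etree.ElementTree, not XSL, and it is only one possible reading of the rules in this thread: the function name normalize_body and the INLINE and BLOCK sets are assumptions, p is passed through untouched on the guess that existing paragraphs are already fine, and any other tag (including br) is simply dropped after ending the current paragraph. In XSLT 2.0 the same adjacency grouping is the kind of thing xsl:for-each-group with group-adjacent is meant for.

```python
import xml.etree.ElementTree as ET

INLINE = {"a"}                               # allowed inside a paragraph
BLOCK = {"table", "img", "ol", "ul", "p"}    # passed through untouched (assumed set)

def normalize_body(xhtml):
    """Wrap runs of root-level text and inline elements in <p>; return a new <body>."""
    src = ET.fromstring(xhtml)
    out = ET.Element("body")
    para = None  # the <p> currently being filled, if any

    def add_text(text):
        nonlocal para
        if not text:
            return
        if para is None:
            if not text.strip():
                return  # do not open a paragraph for whitespace alone
            para = ET.SubElement(out, "p")
        if len(para):
            # ElementTree keeps text that follows an element in its .tail
            last = para[-1]
            last.tail = (last.tail or "") + text
        else:
            para.text = (para.text or "") + text

    add_text(src.text)
    for child in src:
        # Detach the trailing text so it can be placed separately.
        tail, child.tail = child.tail, None
        if child.tag in INLINE:
            if para is None:
                para = ET.SubElement(out, "p")
            para.append(child)
        else:
            para = None  # any non-inline element ends the current paragraph
            if child.tag in BLOCK:
                out.append(child)
            # <br/> and unknown tags are dropped here
        add_text(tail)
    return out

sample = ('<body>some text <a href="blah">a link</a> some more text<br/>'
          'third text node<table><tr><td>x</td></tr></table>final text node</body>')
print(ET.tostring(normalize_body(sample), encoding="unicode"))
```

Leading or trailing whitespace inside a paragraph is kept as-is in this sketch, so input that has a newline after the br would carry that newline into the next p.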
 
