Newbie question about how to handle escape characters

Mark Chao · Nov 15, 2005

Hi, I am a newbie. I spent quite some time searching the web, but I
didn't find anything. I hope this question is not too basic to ask here.

I am trying to convert an XML document into another form. For example, this:

<a>
	A
	<b>B</b>
	<c>C</c>
</a>

should be converted to this:

a A
a b B
a c C

I am using Java's SAX parser with my own handler that extends
DefaultHandler. The XML documents given to me usually have elements and
child elements properly indented (as above). This causes a problem,
because characters() in the handler class is called even between two
endElement() calls, and sometimes between two startElement() calls.
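
To see the callbacks described above, a small trace handler can record every SAX event for the sample document. This is only a sketch: the class name SaxEventTrace and the method trace are made up for illustration, and the sample input assumes that B sits inside a <b> element, as the desired output implies. How the parser splits text into characters() chunks is implementation-dependent, so the exact number of "chars" events may vary.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

public class SaxEventTrace {

    /** Parses xml and returns one line per SAX callback (hypothetical helper). */
    static List<String> trace(String xml) {
        final List<String> events = new ArrayList<>();
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName, String qName, Attributes atts) {
                events.add("start " + qName);
            }

            @Override
            public void endElement(String uri, String localName, String qName) {
                events.add("end " + qName);
            }

            @Override
            public void characters(char[] ch, int start, int length) {
                // Make newlines and tabs visible in the trace.
                String text = new String(ch, start, length)
                        .replace("\n", "\\n")
                        .replace("\t", "\\t");
                events.add("chars \"" + text + "\"");
            }
        };
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
        } catch (Exception e) {
            throw new IllegalArgumentException("could not parse input", e);
        }
        return events;
    }

    public static void main(String[] args) {
        String xml = "<a>\n\tA\n\t<b>B</b>\n\t<c>C</c>\n</a>";
        trace(xml).forEach(System.out::println);
    }
}
```

The trace shows "chars" events made up only of indentation between "end b" and "start c" and again between "end c" and "end a", which is the behaviour the handler has to cope with.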

It also causes a problem because "A" is reported as "\n\tA": the text
is passed through exactly as it appears in the file. The obvious fix is
to make my handler accept only XML files that contain no "\n" or "\t"
whitespace characters. I could also strip these characters manually,
but that would accidentally remove any that were intended as well.

Another way would be to disallow XML documents that have character
data between two startElement() calls or two endElement() calls, i.e.
to allow character data only between a startElement() and its matching
endElement(). However, this constraint is too restrictive and not
appropriate.

This is just a semantic problem, but I would like to know whether there
are any other ways to tackle it.

Peter Flynn · Nov 16, 2005

Mark said:
Hi, I am a newbie. I spent quite some time searching the web, but I
didn't find anything. I hope this question is not too basic to ask here.

I am trying to convert an XML document into another form. For example, this:

<a>
	A
	<b>B</b>
	<c>C</c>
</a>

This should ring immediate warning bells. Mixed Content (interspersed
text and markup) is normally the wrong model in data-oriented
applications. A more useful form would be

<a>
	<something>A</something>
	<b>B</b>
	<c>C</c>
</a>

After all, the "A" must have some function, so it should be identified.

should be converted to this:

a A
a b B
a c C

The following XSLT will do this.

<?xml version="1.0" encoding="iso-8859-1"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
                version="1.0">

  <xsl:output method="text"/>
  <xsl:strip-space elements="*"/>

  <xsl:template match="*">
    <xsl:for-each select="ancestor::*">
      <xsl:value-of select="name()"/>
      <xsl:text> </xsl:text>
    </xsl:for-each>
    <xsl:value-of select="name()"/>
    <xsl:apply-templates/>
  </xsl:template>

  <xsl:template match="text()">
    <xsl:text> </xsl:text>
    <xsl:value-of select="normalize-space(.)"/>
    <xsl:text>
</xsl:text>
  </xsl:template>

</xsl:stylesheet>

I am using Java's SAX parser with my own handler that extends
DefaultHandler. The XML documents given to me usually have elements and
child elements properly indented (as above). This causes a problem,
because characters() in the handler class is called even between two
endElement() calls, and sometimes between two startElement() calls.

That's why I suggest that this is a suboptimal format for the data.

This is just a semantic problem, but I would like to know whether there
are any other ways to tackle it.

Try XSLT.
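
For anyone who wants to stay in Java, a stylesheet like the one above can be run through the JDK's built-in javax.xml.transform API. The sketch below is one way to do that, not the only one. It assumes the output declaration in the stylesheet is <xsl:output method="text"/>, writes the newline in xsl:text as &#10; so it fits in a Java string, and assumes the input has B inside a <b> element. The class name XsltSketch and the method transform are made up for illustration.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

public class XsltSketch {

    // The stylesheet from the post, held inline so the example is self-contained.
    static final String STYLESHEET = String.join("\n",
        "<xsl:stylesheet xmlns:xsl=\"http://www.w3.org/1999/XSL/Transform\" version=\"1.0\">",
        "<xsl:output method=\"text\"/>",
        "<xsl:strip-space elements=\"*\"/>",
        "<xsl:template match=\"*\">",
        "<xsl:for-each select=\"ancestor::*\">",
        "<xsl:value-of select=\"name()\"/><xsl:text> </xsl:text>",
        "</xsl:for-each>",
        "<xsl:value-of select=\"name()\"/>",
        "<xsl:apply-templates/>",
        "</xsl:template>",
        "<xsl:template match=\"text()\">",
        "<xsl:text> </xsl:text>",
        "<xsl:value-of select=\"normalize-space(.)\"/>",
        "<xsl:text>&#10;</xsl:text>",
        "</xsl:template>",
        "</xsl:stylesheet>");

    /** Applies STYLESHEET to the given XML text and returns the plain-text result. */
    static String transform(String xml) {
        try {
            Transformer transformer = TransformerFactory.newInstance()
                    .newTransformer(new StreamSource(new StringReader(STYLESHEET)));
            StringWriter out = new StringWriter();
            transformer.transform(new StreamSource(new StringReader(xml)),
                                  new StreamResult(out));
            return out.toString();
        } catch (TransformerException e) {
            throw new IllegalArgumentException("transformation failed", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<a>\n\tA\n\t<b>B</b>\n\t<c>C</c>\n</a>";
        System.out.print(transform(xml));
    }
}
```

With the indented sample as input, this should print the three lines "a A", "a b B" and "a c C". The whitespace-only text between elements is dropped by xsl:strip-space, and normalize-space() trims the indentation around "A".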

///Peter

mcha226 · Nov 16, 2005

Thanks a lot. I'll start learning XSLT as well.

About what I have done: I used the decorator pattern and created a
decorator that wraps my base handler. It buffers the text received in
characters() and sends the complete text in one go. It also strips the
\n and \t from the beginning and the end of the text.
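
A decorator along these lines might look roughly like the sketch below. It is a guess at the design, not the actual code: the class and method names (BufferedHandlerSketch, PathHandler, convert) are made up, String.trim() is used for simplicity even though it also removes spaces and carriage returns rather than only \n and \t, and callbacks other than startElement(), endElement() and characters() are not forwarded. The sample input assumes B sits inside a <b> element.

```java
import java.io.StringReader;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.ContentHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

public class BufferedHandlerSketch {

    /** Decorator: collects characters() chunks and forwards one trimmed string. */
    static class BufferedHandler extends DefaultHandler {
        private final ContentHandler delegate;
        private final StringBuilder buffer = new StringBuilder();

        BufferedHandler(ContentHandler delegate) {
            this.delegate = delegate;
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            buffer.append(ch, start, length); // accumulate only, forward later
        }

        /** Sends buffered text to the delegate unless it is empty after trimming. */
        private void flush() throws SAXException {
            String text = buffer.toString().trim();
            buffer.setLength(0);
            if (!text.isEmpty()) {
                delegate.characters(text.toCharArray(), 0, text.length());
            }
        }

        @Override
        public void startElement(String uri, String localName, String qName, Attributes atts)
                throws SAXException {
            flush();
            delegate.startElement(uri, localName, qName, atts);
        }

        @Override
        public void endElement(String uri, String localName, String qName) throws SAXException {
            flush();
            delegate.endElement(uri, localName, qName);
        }
    }

    /** Base handler: emits "ancestor names + text" for every text run it receives. */
    static class PathHandler extends DefaultHandler {
        final List<String> lines = new ArrayList<>();
        private final Deque<String> path = new ArrayDeque<>();

        @Override
        public void startElement(String uri, String localName, String qName, Attributes atts) {
            path.addLast(qName);
        }

        @Override
        public void endElement(String uri, String localName, String qName) {
            path.removeLast();
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            lines.add(String.join(" ", path) + " " + new String(ch, start, length));
        }
    }

    static List<String> convert(String xml) {
        PathHandler target = new PathHandler();
        try {
            XMLReader reader = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
            reader.setContentHandler(new BufferedHandler(target));
            reader.parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalArgumentException("could not parse input", e);
        }
        return target.lines;
    }

    public static void main(String[] args) {
        convert("<a>\n\tA\n\t<b>B</b>\n\t<c>C</c>\n</a>").forEach(System.out::println);
    }
}
```

Flushing on every element boundary is what lets the base handler treat each characters() call as one complete text run, so it never sees the indentation-only chunks.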

I found out later that there is an XMLFilterImpl class. Interestingly,
it implements both the reader interface and all the handler interfaces,
whereas my decorator implements only ContentHandler. Just a personal
opinion, but I think my design can be a little more efficient. For
example:

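// Uses org.xml.sax.XMLReader and org.xml.sax.helpers.XMLReaderFactory.
XMLReader reader = XMLReaderFactory.createXMLReader();
SimpleHandler handler = new SimpleHandler(); // extends DefaultHandler

// Only content events pass through the buffering decorator;
// errors go straight to the base handler.
reader.setContentHandler(new BufferedHandler(handler));
reader.setErrorHandler(handler);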

My design is easier to understand (it implements only the handler part
of the interface), and it avoids passing calls along unnecessarily. (If
you use XMLFilterImpl to create a separate filter for the ContentHandler
and for the ErrorHandler, this causes extra calls across layers.)

Does anyone else think the same?
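
For comparison, the same buffering idea can be written as an XMLFilterImpl subclass. This is only a sketch with made-up names (TrimFilterSketch, TrimFilter, collectText), and it uses String.trim(), which strips spaces as well as \n and \t. Because XMLFilterImpl itself implements ContentHandler, ErrorHandler, DTDHandler and EntityResolver, a single filter instance sits between the parser and the handlers here, and callbacks that are not overridden are forwarded by the base class.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;
import org.xml.sax.helpers.XMLFilterImpl;

public class TrimFilterSketch {

    /** Filter that merges characters() chunks and forwards them trimmed. */
    static class TrimFilter extends XMLFilterImpl {
        private final StringBuilder buffer = new StringBuilder();

        TrimFilter(XMLReader parent) {
            super(parent);
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            buffer.append(ch, start, length); // hold text until the next element boundary
        }

        private void flush() throws SAXException {
            String text = buffer.toString().trim();
            buffer.setLength(0);
            if (!text.isEmpty()) {
                // super.characters() passes the text on to the registered ContentHandler.
                super.characters(text.toCharArray(), 0, text.length());
            }
        }

        @Override
        public void startElement(String uri, String localName, String qName, Attributes atts)
                throws SAXException {
            flush();
            super.startElement(uri, localName, qName, atts);
        }

        @Override
        public void endElement(String uri, String localName, String qName) throws SAXException {
            flush();
            super.endElement(uri, localName, qName);
        }
    }

    /** Returns every text run the downstream handler receives through the filter. */
    static List<String> collectText(String xml) {
        final List<String> texts = new ArrayList<>();
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void characters(char[] ch, int start, int length) {
                texts.add(new String(ch, start, length));
            }
        };
        try {
            XMLReader parser = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
            TrimFilter filter = new TrimFilter(parser);
            filter.setContentHandler(handler);
            filter.setErrorHandler(handler);
            filter.parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalArgumentException("could not parse input", e);
        }
        return texts;
    }

    public static void main(String[] args) {
        System.out.println(collectText("<a>\n\tA\n\t<b>B</b>\n\t<c>C</c>\n</a>"));
    }
}
```

The trade-off is roughly the one raised above: the filter routes every event type through one extra layer, while the decorator wraps only the ContentHandler and leaves error reporting wired directly to the base handler.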
