transforming xhtml to html (resolving namespace dependencies)

Andy

Hi,

I am using Apache Xalan to transform XHTML files to HTML files.

My xslt stylesheet is:

<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output method="html" encoding="UTF-8"/>
<xsl:template match="/"><xsl:copy-of select="/"/></xsl:template>
</xsl:stylesheet>

Seems to work. For example, I had an xhtml file which had entities
defined in DOCTYPE and those were resolved successfully.

However, I'm more concerned with another document:

It's an XHTML file and begins with:

<html xmlns:v="urn:schemas-microsoft-com:vml"
      xmlns:o="urn:schemas-microsoft-com:office:office"
      xmlns:w="urn:schemas-microsoft-com:office:word"
      xmlns:m="http://schemas.microsoft.com/office/2004/12/omml"
      xmlns:xlink="http://www.w3.org/1999/xlink"
      xmlns="http://www.w3.org/1999/xhtml"><head><meta http-equiv="Content-Type"
      content="text/html; charset=utf-8"/>

My concern is whether Xalan will resolve all the dependencies that such
an XHTML file has on the schemas referenced in the html tag.

Will it?

The xalan output to html began with:

<html xmlns="http://www.w3.org/1999/xhtml"
      xmlns:m="http://schemas.microsoft.com/office/2004/12/omml"
      xmlns:o="urn:schemas-microsoft-com:office:office"
      xmlns:v="urn:schemas-microsoft-com:vml"
      xmlns:w="urn:schemas-microsoft-com:office:word"
      xmlns:xlink="http://www.w3.org/1999/xlink">

So I'm obviously concerned that the dependencies are still there!

If it's OK, can I strip all those xmlns attributes in the <html> tag?

Or maybe I need a much better xslt stylesheet.

Thanks,
Andy
 
Stanimir Stamenkov

Sun, 30 Jan 2011 15:30:44 -0800 (PST), /Andy/:
> It's an XHTML file and begins with:
>
> <html xmlns:v="urn:schemas-microsoft-com:vml"
>       xmlns:o="urn:schemas-microsoft-com:office:office"
>       xmlns:w="urn:schemas-microsoft-com:office:word"
>       xmlns:m="http://schemas.microsoft.com/office/2004/12/omml"
>       xmlns:xlink="http://www.w3.org/1999/xlink"
>       xmlns="http://www.w3.org/1999/xhtml"><head><meta http-equiv="Content-Type"
>       content="text/html; charset=utf-8"/>

It is not XHTML, it's MS Office output for HTML which happens to
be some sort of XML.

> My concern is whether Xalan will resolve all the dependencies that such
> an XHTML file has on the schemas referenced in the html tag.
>
> Will it?

I don't see any direct schema references, but I don't think you need
any in this case.

> The xalan output to html began with:
>
> <html xmlns="http://www.w3.org/1999/xhtml"
>       xmlns:m="http://schemas.microsoft.com/office/2004/12/omml"
>       xmlns:o="urn:schemas-microsoft-com:office:office"
>       xmlns:v="urn:schemas-microsoft-com:vml"
>       xmlns:w="urn:schemas-microsoft-com:office:word"
>       xmlns:xlink="http://www.w3.org/1999/xlink">
>
> So I'm obviously concerned that the dependencies are still there!
>
> If it's OK, can I strip all those xmlns attributes in the <html> tag?

Yes, if you want to output pure HTML you need to strip those
namespace declaration attributes off. See the
'exclude-result-prefixes' attribute [1][2].

> Or maybe I need a much better xslt stylesheet.

I guess you would need to include templates for converting elements
like <o:p> into HTML ones such as <p>. The amount of crap MS Office
outputs for HTML is enormous. I can't give you all the rules you need
for converting such a file to clean HTML. You may also look at HTML
Tidy [3].

[1] http://www.w3.org/TR/xslt#stylesheet-element
[2] http://www.w3.org/TR/xslt#literal-result-element
[3] http://tidy.sourceforge.net/
 
Martin Honnen

Stanimir said:
> Sun, 30 Jan 2011 15:30:44 -0800 (PST), /Andy/:
>> The xalan output to html began with:
>>
>> <html xmlns="http://www.w3.org/1999/xhtml"
>>       xmlns:m="http://schemas.microsoft.com/office/2004/12/omml"
>>       xmlns:o="urn:schemas-microsoft-com:office:office"
>>       xmlns:v="urn:schemas-microsoft-com:vml"
>>       xmlns:w="urn:schemas-microsoft-com:office:word"
>>       xmlns:xlink="http://www.w3.org/1999/xlink">
>>
>> So I'm obviously concerned that the dependencies are still there!
>>
>> If it's OK, can I strip all those xmlns attributes in the <html> tag?
>
> Yes, if you want to output pure HTML you need to strip those namespace
> declaration attributes off. See the 'exclude-result-prefixes' attribute
> [1][2].

exclude-result-prefixes does not help if namespaces are copied from an
input node, as was done in the posted stylesheet by using
<xsl:copy-of select="/"/>

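You would need to write a stylesheet doing
<!-- rebuild each element so that its namespace nodes are not copied -->
<xsl:template match="*">
  <xsl:element name="{name()}" namespace="{namespace-uri()}">
    <xsl:apply-templates select="@* | node()"/>
  </xsl:element>
</xsl:template>
<!-- copy attributes; the built-in rule would only output their value as text -->
<xsl:template match="@*">
  <xsl:copy/>
</xsl:template>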
in XSLT 1.0, to make sure elements are copied but their namespace nodes
are not automatically copied. But even this way, as long as elements in
a certain namespace are copied through, the result document when
serialized is going to declare those namespaces.
So in that input document you could only get rid of e.g.
xmlns:w="urn:schemas-microsoft-com:office:word" as long as there are no
elements in that namespace copied.

If you want to strip all namespaces then use
<xsl:template match="*">
  <xsl:element name="{local-name()}">
    <xsl:apply-templates select="@* | node()"/>
  </xsl:element>
</xsl:template>
or perhaps add templates for elements in namespaces like
urn:schemas-microsoft-com:office:word that don't copy them at all, if you
don't need or want such elements.
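
For reference, here is a minimal sketch of a complete stylesheet along these lines, assuming the goal is HTML with no namespaces at all. Two things in it are assumptions rather than anything stated in the thread: the decision to drop v:, w: and m: elements together with their content, and the extra template for attributes (without one, the built-in XSLT 1.0 rule writes each attribute's value out as text instead of creating an attribute).

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:v="urn:schemas-microsoft-com:vml"
    xmlns:w="urn:schemas-microsoft-com:office:word"
    xmlns:m="http://schemas.microsoft.com/office/2004/12/omml">

  <xsl:output method="html" encoding="UTF-8"/>

  <!-- Assumption: VML, Word and OMML elements are not wanted in the
       result, so they are dropped together with their content. -->
  <xsl:template match="v:* | w:* | m:*"/>

  <!-- Rebuild every other element under its local name, in no namespace. -->
  <xsl:template match="*">
    <xsl:element name="{local-name()}">
      <xsl:apply-templates select="@* | node()"/>
    </xsl:element>
  </xsl:template>

  <!-- Rebuild attributes the same way, so prefixed attributes such as
       xml:lang lose their prefix as well. -->
  <xsl:template match="@*">
    <xsl:attribute name="{local-name()}">
      <xsl:value-of select="."/>
    </xsl:attribute>
  </xsl:template>

  <!-- Keep comments; text nodes are copied by the built-in rule. -->
  <xsl:template match="comment()">
    <xsl:copy/>
  </xsl:template>

</xsl:stylesheet>
```

Note that with local-name() an element such as o:p comes out as a plain p, which may end up nested inside an HTML p; give such elements their own template if that matters.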
 
Martin Honnen

Andy said:
> However, I'm more concerned with another document:
>
> It's an XHTML file and begins with:
>
> <html xmlns:v="urn:schemas-microsoft-com:vml"
>       xmlns:o="urn:schemas-microsoft-com:office:office"
>       xmlns:w="urn:schemas-microsoft-com:office:word"
>       xmlns:m="http://schemas.microsoft.com/office/2004/12/omml"
>       xmlns:xlink="http://www.w3.org/1999/xlink"
>       xmlns="http://www.w3.org/1999/xhtml"><head><meta http-equiv="Content-Type"
>       content="text/html; charset=utf-8"/>
>
> My concern is whether Xalan will resolve all the dependencies that such
> an XHTML file has on the schemas referenced in the html tag.
>
> Will it?

There are namespace declarations in that document. An XML parser does
not resolve the URLs in namespace declarations.
Schemas are not referenced, that would be done with
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="urn:schemas-microsoft-com:vml
http://example.com/someschema.xsd"
 
Andy

Stanimir said:
> It is not XHTML, it's an MS Office output for HTML which happens to
> be some sort of XML.

Let me tell you about my bigger problem. I have 1000s of epubs, which
are wrappers for xhtml content type chapters. The requirement is that
each of the chapters be xhtml, which is enforced by the epub format.
My requirement is that I convert them to html. One problem I've seen
is a DOCTYPE prelude to the chapter that defines entity substitutions
local to that chapter. xalan + xslt with the simple stylesheet I
posted in the first question resolves those entities to html entities
successfully.

But the MS Office generated epub document made me worried that in a
variety of ways, the creator of an epub chapter (xhtml subdocument)
could embed references to schemas other than just the xhtml schema,
and still expect firefox to resolve all those dependencies in its
parser.

I.e. the schemas referenced in the xhtml chapter might define
entities. Is there an xslt stylesheet that would tell xalan to
resolve all the entities in externally referenced schemas other than
xhtml schema itself?

The second question is about this MS Office generated epub. There
are schemas referenced that probably define the structure of o: and v:
and m: tags. What would Firefox's parser do with such a tag if I told
Firefox that the page content type was "application/xhtml+xml"? And
is there a simple stylesheet (that doesn't special case every external
schema tag definition) that will resolve each xhtml page to html (via
xalan xslt interpreter) the same way firefox does?

Andy
 
Andy

Martin Honnen said:
> There are namespace declarations in that document. An XML parser does
> not resolve the URLs in namespace declarations.
> Schemas are not referenced, that would be done with
>    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
>    xsi:schemaLocation="urn:schemas-microsoft-com:vml http://example.com/someschema.xsd"

So you are saying that for this particular document, I can safely
strip the declarations from the html tag?

What I noticed is that when I display the document in firefox, even if
it does not have much size, it takes 10 to 15 seconds to load the
page, which made me think that firefox was going out to the microsoft
site and parsing those external schemas. For what purpose if the
"schemas are not referenced"?
 
Martin Honnen

Andy said:
> So you are saying that for this particular document, I can safely
> strip the declarations from the html tag?

I can't assess that; removing namespace declarations is only safe as
long as there are no elements or attributes in those namespaces. And you
have not shown any contents of that document, just the root element with
the namespace declarations.

> What I noticed is that when I display the document in firefox, even if
> it does not have much size, it takes 10 to 15 seconds to load the
> page, which made me think that firefox was going out to the microsoft
> site and parsing those external schemas. For what purpose if the
> "schemas are not referenced"?

Well the document does not reference any schemas, it simply uses XML
namespace declarations. An XML parser simply recognizes elements and
attributes based on their namespace i.e. if you send application/xml or
text/xml or application/xhtml+xml (or other MIME types that trigger XML
parsing) to a browser then it only renders a HTML link if it finds an 'a
href' element in the XHTML namespace http://www.w3.org/1999/xhtml. An
'a' element in no namespace does not have any meaning as a link.
And the XML parser in Firefox is not even schema aware, it is Expat I
think, so even if you referenced a schema with xsi:schemaLocation, it
wouldn't matter to Firefox.

I don't know why it took that long to load and render your document but
it is certainly not because of namespace declarations.
 
Peter Flynn

On 31/01/11 14:33, Andy wrote:
> [...]
> Let me tell you about my bigger problem. I have 1000s of epubs, which
> are wrappers for xhtml content type chapters. The requirement is that
> each of the chapters be xhtml, which is enforced by the epub format.
> My requirement is that I convert them to html. One problem I've seen
> is a DOCTYPE prelude to the chapter that defines entity substitutions
> local to that chapter. xalan + xslt with the simple stylesheet I
> posted in the first question resolves those entities to html entities
> successfully.
>
> But the MS Office generated epub document made me worried that in a
> variety of ways, the creator of an epub chapter (xhtml subdocument)
> could embed references to schemas other than just the xhtml schema,
> and still expect firefox to resolve all those dependencies in its
> parser.

AFAIK FF does not pay any attention to resolving namespace URIs (it's
not required by XML in any case: they merely have to be present; actual
schema locations can be specified separately with the xxx:schemaLocation
attribute). FF doesn't even resolve DTD references, FFS :)

In any case, if you are stripping off all this gunk and making plain ol'
HTML, there won't be any namespaces for a browser to resolve...

> I.e. the schemas referenced in the xhtml chapter might define
> entities.

Schemas can't declare entities. Only DTDs can do that.

> The second question is about this MS Office generated epub. There
> are schemas referenced that probably define the structure of o: and v:
> and m: tags. What would Firefox's parser do with such a tag if I told
> Firefox that the page content type was "application/xhtml+xml"?

Probably ignore it, but why not try it and see?
I thought you were generating HTML from these epubs, not XHTML.

> is there a simple stylesheet (that doesn't special case every external
> schema tag definition) that will resolve each xhtml page to html (via
> xalan xslt interpreter) the same way firefox does?

Tidy.

///Peter
 
Peter Flynn

On 30/01/11 23:30, Andy wrote:
> [...]
> It's an XHTML file and begins with:
>
> <html xmlns:v="urn:schemas-microsoft-com:vml"
>       xmlns:o="urn:schemas-microsoft-com:office:office"
>       xmlns:w="urn:schemas-microsoft-com:office:word"
>       xmlns:m="http://schemas.microsoft.com/office/2004/12/omml"
>       xmlns:xlink="http://www.w3.org/1999/xlink"
>       xmlns="http://www.w3.org/1999/xhtml"><head><meta http-equiv="Content-Type"
>       content="text/html; charset=utf-8"/>

That's Word's "Save As...HTML". It's a horrendous kludge, and even omits
the title element, the one text-bearing element in HTML that is actually
compulsory :)

> My concern is whether Xalan will resolve all the dependencies that such
> an XHTML file has on the schemas referenced in the html tag.
>
> Will it?

No, and I'm not clear why you'd want to do that. There are no
dependencies there unless you are a copy of word.exe :)

Stanimir's suggestion of HTML Tidy is worth following. This needs the
bogus o:p elements replacing (I suggest span); you can then clean out
the rest of the rubbish with the -c and -n options:

$ sed -e "s+o:p>+span>+g" foo.htm | tidy -c -n -asxml - >foo.xhtml
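
The same o:p replacement can also be sketched in XSLT 1.0 instead of sed, for anyone who would rather keep everything inside Xalan. This is only a sketch: the class name office-p is made up for illustration, and the namespace URI bound to the o prefix is the one from the html start tag quoted above.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:o="urn:schemas-microsoft-com:office:office">

  <xsl:output method="xml" encoding="UTF-8"/>

  <!-- Turn each o:p into an XHTML span; "office-p" is a hypothetical
       class name, pick whatever suits your CSS. -->
  <xsl:template match="o:p">
    <xsl:element name="span" namespace="http://www.w3.org/1999/xhtml">
      <xsl:attribute name="class">office-p</xsl:attribute>
      <xsl:apply-templates select="node()"/>
    </xsl:element>
  </xsl:template>

  <!-- Rebuild every other element with the same name and namespace,
       without copying its unused namespace nodes. -->
  <xsl:template match="*">
    <xsl:element name="{name()}" namespace="{namespace-uri()}">
      <xsl:apply-templates select="@* | node()"/>
    </xsl:element>
  </xsl:template>

  <!-- Copy attributes, comments and processing instructions unchanged;
       text nodes are copied by the built-in rule. -->
  <xsl:template match="@* | comment() | processing-instruction()">
    <xsl:copy/>
  </xsl:template>

</xsl:stylesheet>
```

Elements from the other Office namespaces still pass through with this sketch, so their namespace declarations reappear wherever such elements occur; Tidy or further templates would be needed to remove them.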

///Peter
 
Peter Flynn

On 31/01/11 14:39, Andy wrote:
> [...]
> So you are saying that for this particular document, I can safely
> strip the declarations from the html tag?

Unless there are any private element types embedded in there which have
the same names (modulo the namespace) as XHTML element types but
different semantics.

You could write a little XSLT script to pass over the document and check
for that. <o:p> is a good example, as it is a p element type in an o
namespace, yet it gets embedded in a span inside an HTML p element type.
As I suggested earlier, you can trivially convert those to a span, with
a specific class if you wanted to.
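
A minimal sketch of such a check in XSLT 1.0, producing plain text. It prints one line per element that is not in the XHTML namespace and flags those whose local name is also used by an XHTML element somewhere in the same document. That flagging rule is an assumption about what counts as a clash; the stylesheet does not know the full list of XHTML element names.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <xsl:output method="text" encoding="UTF-8"/>

  <xsl:variable name="xhtml" select="'http://www.w3.org/1999/xhtml'"/>

  <xsl:template match="/">
    <!-- Visit every element that is outside the XHTML namespace. -->
    <xsl:for-each select="//*[namespace-uri() != $xhtml]">
      <xsl:value-of select="name()"/>
      <xsl:text> in namespace "</xsl:text>
      <xsl:value-of select="namespace-uri()"/>
      <xsl:text>"</xsl:text>
      <!-- Flag it if an XHTML element in this document has the same
           local name, e.g. o:p versus p. -->
      <xsl:if test="//*[namespace-uri() = $xhtml][local-name() = local-name(current())]">
        <xsl:text> (same local name as an XHTML element)</xsl:text>
      </xsl:if>
      <xsl:text>&#10;</xsl:text>
    </xsl:for-each>
  </xsl:template>

</xsl:stylesheet>
```

An empty result means every element is in the XHTML namespace, in which case dropping the extra declarations cannot change the meaning of any element (prefixed attributes would need a similar pass).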
> What I noticed is that when I display the document in firefox, even if
> it does not have much size, it takes 10 to 15 seconds to load the
> page, which made me think that firefox was going out to the microsoft
> site and parsing those external schemas.

No, FF is probably parsing the XML and then converting it to its
internal HTML-based rendering model, so it's doing twice as much work as
it does when loading plain HTML.
> For what purpose if the "schemas are not referenced"?

I think you may be confusing schemas with namespaces.

Schemas are for guiding the formation of a document, and for providing a
validating parser with a "reference map" of possible element type
locations and node structures. Their principal use in rendering is --
like DTDs -- to provide information about default attribute values; and
these are minimal in HTML anyway.

Namespaces are a way of identifying and disambiguating element and
attribute types which have the same name but come from different
backgrounds or have different semantics. This lets you embed (for
example) MathML in DocBook without <arg> in MathML being confused with
<arg> in DocBook; you also see this in XSLT, if you want to output
MathML: <xsl:otherwise> cannot be confused with <m:otherwise>.

Unfortunately, some document type designers think you're not
well-dressed unless you obfuscate everything with vast namespaces. They
have their place, and can be very useful, but they are often abused as a
substitute for rigorous document type analysis.

///Peter
 
