Tidy: how to make it XML-conformant? <BR> needs to be closed

Ragnar

Hi

I have one question regarding Tidy (http://tidy.sourceforge.net). My
source XML file has a lot of unclosed <BR> tags. Which option do I
need (in my Tidy config file) to close them as <BR/> and make valid XML
out of it?


regards
Rag.
 
Bjoern Hoehrmann

* Ragnar wrote in comp.text.xml:
I have one question regarding Tidy (http://tidy.sourceforge.net). My
source XML file has a lot of unclosed <BR> tags. Which option do I
need (in my Tidy config file) to close them as <BR/> and make valid XML
out of it?

HTML Tidy is not designed to clean up arbitrary XML documents, so if by
"XML-file" you really mean some arbitrary XML document, then it might be
difficult to address your problem. If you mean "HTML" or "XHTML" instead
then use the output-* family of options, or the -asxml command line
option and ensure that you have not set the input-xml flag.
 
Ragnar

Thank you for your help. It is very important for me to get support,
because I have to finish this today.

My command line looks like this: tidy -asxml -config config.txt old.xml

I get the same error as without "-asxml":

Error: unexpected </reference> in <BR>

That means it finds an unclosed <BR> tag inside the "reference" node.

To get rid of it I could use "no-xml" as the input format, but then Tidy
would transform my XML into an HTML structure, which is not wanted.


Ragnar
 
Ragnar

Another question regarding Tidy:

I want to use the COM wrapper for Tidy and have found the example below.
I don't know why "Stat As Long" is used. I tried to work without "Stat",
but I cannot call objTidyDoc.MethodName directly.


Dim objTidyDoc As TidyDocument
Dim Stat As Long   ' holds the return value of each call

Set objTidyDoc = New TidyDocument
Stat = 0
Stat = objTidyDoc.LoadConfig(strTidyConfig)                ' read the Tidy config file
Stat = objTidyDoc.ParseFile(strFilePath & strXmlFileName)  ' parse the source file
Stat = objTidyDoc.CleanAndRepair()
Stat = objTidyDoc.RunDiagnostics()
Stat = objTidyDoc.SaveFile(strFilePath & strXmlFileName)   ' writes back to the same file
 
Ragnar

Now I know how to use the COM wrapper, but my main question is still
open.

How can I transform this source XML into valid XML without using the
workaround of getting HTML output? I don't want to have HTML tags
like <HEAD> and <BODY> around it.

http://www.ticope.de/tmp/source.xml/download

Help is VERY much appreciated; this task has kept me busy for too long.
Rag.
 
Joseph Kesselman

If your input isn't HTML, Tidy may not be able to help you, and nothing
else out there is likely to be able to read your mind and guess that you
intended <BR> tags to autoterminate.

Since you know that *was* your intent, how about just doing a text-level
global replace of <BR> with <BR/>?
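
A minimal sketch of such a text-level replace, written in Python purely for illustration. The function name close_br_tags is hypothetical, and the pattern assumes that BR is the only element that needs closing and that no attribute value contains a ">" character. It works on raw text, so it would also rewrite a <BR> that appears inside a comment or CDATA section.

```python
import re

# Matches an opening <BR> in any letter case, with optional attributes,
# but neither an already self-closed <BR/> nor a longer name such as <BRANCH>.
_OPEN_BR = re.compile(r"<(br)\b([^<>]*?)(?<!/)>", re.IGNORECASE)


def close_br_tags(text):
    """Return text with every unclosed <BR> rewritten as <BR/>."""
    return _OPEN_BR.sub(r"<\1\2/>", text)


if __name__ == "__main__":
    sample = "<reference>first line<BR>second line</reference>"
    print(close_br_tags(sample))
```

Because the replace runs before any XML parser sees the file, it does not matter that the input is not yet well-formed.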
 
Ragnar

Joseph said:
Since you know that *was* your intent, how about just doing a text-level
global replace of <BR> with <BR/>?

Joseph,
that is a very nice idea.

It could look like this (assuming <BR> appears in the "reference" node):

Set objDOMnode = objDom.selectSingleNode("//reference")
If Not objDOMnode Is Nothing Then
    strReference = objDOMnode.Text
End If
strReference = Replace(strReference, "<BR>", "<BR/>", 1, -1, vbTextCompare)

But I don't get a value in strReference, which means that the XML has to
be valid before working with XMLDOM. Am I right? I checked it by closing
the tag manually as <BR/>; then I do get a value for strReference.
 
Joe Kesselman

Ragnar said:
But I don't get a value in strReference, which means that the XML has to
be valid before working with XMLDOM.

XML has to be well-formed before using any XML tools. An unterminated
element, such as your <BR>, is not well-formed XML. Fix it first.
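
To illustrate the point, here is a small Python sketch using the standard library's xml.etree.ElementTree parser in place of XMLDOM. The sample document and the helper name is_well_formed are made up for the example. The parser rejects the document outright while <BR> is unterminated, and accepts it once the raw text has been fixed.

```python
import xml.etree.ElementTree as ET


def is_well_formed(xml_text):
    """Return True if the parser accepts xml_text, False otherwise."""
    try:
        ET.fromstring(xml_text)
    except ET.ParseError:
        return False
    return True


# Hypothetical sample: <BR> is opened but never closed.
broken = "<doc><reference>line one<BR>line two</reference></doc>"

# Fix the raw text first, before any XML tool touches it.
fixed = broken.replace("<BR>", "<BR/>")

if __name__ == "__main__":
    print(is_well_formed(broken))
    print(is_well_formed(fixed))
```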
 
Andy Dingley

Ragnar said:
How can I transform this source XML into valid XML without using the
workaround of getting HTML output?

Find some non-Tidy, Tidy-like XML tool? Maybe write one for your
specific task?

Tidy uses an approximation of an SGML parser and a tag-soup strainer to
take "approximate HTML", turn it into the best-guess internal
(DOM-like) model of the intended page, then serialise it accurately.
This relies on three things that you don't have available:

* SGML parsing (omitted tags can often be inferred cleanly)
* A known HTML DTD
* Fix-up code outside the SGML parser that has assumed HTML-soup
behaviours coded explicitly into it.

If your problem is "bad XML" that isn't even approximating HTML, then I
sympathise, but Tidy has three of its hands tied.

Why is your bad XML bad? What's the problem? Can you build some specific
tool that fixes some specific problem? Even if it has to work with
simple text-file processing and can't support more than one encoding,
it might be enough.

I've done a lot of work with RSS, which is only approximate XML at best
and often significantly invalid. Typically it includes HTML entity
references (e.g. &eacute;) that aren't part of XML. It's not too hard to
scan the whole document with a crude entity reference expander that can
map these (from a known list) onto the numeric form. I usually try to
XML-parse the feeds first; if this fails, I check for the presence of
such entities, convert them and then attempt to re-parse.
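
A rough Python sketch of that parse, convert, re-parse approach. It is not Andy's actual code: the names expand_html_entities and parse_lenient are hypothetical, and the "known list" is assumed here to be the HTML entity table in Python's standard html.entities module. The five entities XML itself predefines are left alone.

```python
import re
import xml.etree.ElementTree as ET
from html.entities import name2codepoint

# Entities that XML already defines; these must stay as they are.
XML_PREDEFINED = {"amp", "lt", "gt", "quot", "apos"}

_NAMED_ENTITY = re.compile(r"&([A-Za-z][A-Za-z0-9]*);")


def expand_html_entities(text):
    """Rewrite known HTML named entities (e.g. &eacute;) as numeric references."""
    def to_numeric(match):
        name = match.group(1)
        if name in XML_PREDEFINED or name not in name2codepoint:
            return match.group(0)  # leave XML's own and unknown names untouched
        return "&#%d;" % name2codepoint[name]
    return _NAMED_ENTITY.sub(to_numeric, text)


def parse_lenient(text):
    """Try a normal parse; on failure, expand HTML entities and parse again."""
    try:
        return ET.fromstring(text)
    except ET.ParseError:
        return ET.fromstring(expand_html_entities(text))
```

If the second parse also fails, the ParseError propagates to the caller, which signals that the document has problems beyond HTML entities.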
 
