Tidy transforms "&amp;" in the source-xml into a "&"

Ragnar

Hi,

2 issues left with my tidy-work:

1) Tidy transforms a "&amp;" in the source-xml into a "&" in the tidied
version. My XML-Importer cannot handle it.
2) in a long <title>-string a wrap is produced like:
<title>my very long title blab la blab la
Blabla bla </title>
Importer also has got problems with it


My tidy.bat
tidy.exe --output-xhtml yes --show-body-only yes --new-blocklevel-tags
component,bblocation,title2,short_intro,long_intro,date,reference,category,image_small,image_medium,image_large,body2,external_link_text1,external_link_url1
--indent auto --write-back yes %1


regards
Ragnar
 
Joe Kesselman

Ragnar said:
1) Tidy transforms a "&amp;" in the source-xml into a "&" in the tidied
version.

Hold it a moment -- if your source is XML, why are you going through Tidy?

Having said that, this shouldn't happen in XHTML output mode. Contact
Tidy's authors, and/or show us a failing example so we can crosscheck
this and make sure

2) in a long <title>-string a wrap is produced like:
<title>my very long title blab la blab la
Blabla bla </title>
Importer also has got problems with it

Turn off auto-indent.
 
Timo Harmo

Joe said:
Hold it a moment -- if your source is XML, why are you going through Tidy?

Is there a better way to check the well-formedness of an XML file than
tidy -xml ?
-Timo
 
Joe Kesselman

Timo said:
Is there a better way to check the well-formedness of a xml-file than
tidy -xml ?

Tidy is not primarily an XML tool. It's a tool for repairing
sloppily-written HTML and XHTML.

To check well-formedness of XML, feed it to any proper XML parser. If
the parser doesn't accept it, the XML is not well-formed.
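
To make that concrete, here is a minimal sketch of such a check using the parser in Python's standard library (xml.etree.ElementTree). The helper name is_well_formed and the three sample strings are made up for illustration; they are not taken from anyone's actual files.

```python
import xml.etree.ElementTree as ET


def is_well_formed(xml_text):
    """Return (True, None) if xml_text parses, else (False, parser message)."""
    try:
        ET.fromstring(xml_text)
    except ET.ParseError as err:
        return False, str(err)
    return True, None


# Hypothetical samples: one good document and two common kinds of damage.
good = '<item lang="en"><title>Fish &amp; chips</title></item>'
bad_attr = '<item lang=en><title>Fish &amp; chips</title></item>'   # unquoted attribute value
bad_amp = '<item lang="en"><title>Fish & chips</title></item>'      # bare ampersand

for sample in (good, bad_attr, bad_amp):
    ok, error = is_well_formed(sample)
    print("well-formed" if ok else "NOT well-formed: %s" % error)
```

Any other conforming XML parser gives the same yes/no answer; only the wording of the error message differs.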
 
Joe Kesselman

You never answered my question: If this is already XML, why are you
putting it through Tidy in the first place?
 
Joe Kesselman

Ragnar said:

Not well formed, so it isn't XML, despite the file name. First obvious
error is that someone failed to put quotes around the value of the lang
attribute. I'd recommend you fix this where it originates, rather than
trying to patch it later by running it through Tidy, especially since
you say Tidy's doing things you don't expect.
 
Joe Kesselman

Tried running the most recent copy of Tidy against your input file,
using your batchfile. It is *NOT* damaging the &. Either you're
confusing yourself badly (for example, looking at the text in an XML
tool, which of course will see &amp; as the & character since that's
what &amp; represents), or you're running a damaged copy of Tidy and
need to upgrade.

I'll bet on the former.
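
A small sketch of why an XML tool shows a plain & here, assuming Python's standard xml.etree.ElementTree as the "XML tool" (the sample title is made up): the parser hands back the character that &amp; stands for, and the serializer writes the escape again on output.

```python
import xml.etree.ElementTree as ET

# Hypothetical source text, as it would appear in a plain text editor.
source = "<title>Fish &amp; chips</title>"

element = ET.fromstring(source)

# The parsed value: what an XML-aware viewer displays.
print(element.text)                             # Fish & chips

# Serialized again: the ampersand is escaped, exactly as in the source.
print(ET.tostring(element, encoding="unicode")) # <title>Fish &amp; chips</title>
```

So to see what is really in the file, look at it in a plain text editor rather than in an XML viewer.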
 
Joe Kesselman

Oh, forgot to say: The only thing I did differently was that I named the
input file test.html.
 
Joe Kesselman

I may also have accidentally dropped the "--write-back yes".

Still, this does suggest that Tidy isn't your problem.
 
Ragnar

Joe said:
Tried running the most recent copy of Tidy against your input file,
using your batchfile. It is *NOT* damaging the &. Either you're
confusing yourself badly (for example, looking at the text in an XML
tool, which of course will see &amp; as the & character since that's
what &amp; represents), or you're running a damaged copy of Tidy and
need to upgrade.


Hi Joe

thank you so much for your work and help.

Yes, you might be right. I was confused by the tool, which presented
&amp; as &.
So you say I don't have well-formed XML and therefore I cannot use Tidy.
The content was exported automatically from an older version of a CMS,
and the rich-text fields were not XHTML-compliant. But you are right: I
should focus on the export and try to optimize the exporter
instead of the importer. Maybe it is enough to run Tidy there, or to
do a lot of string manipulations (Replace) in the phase where the
content is exported using SOAP.


Ragnar
 
Joe Kesselman

Ragnar said:
So you say I don't have well-formed XML and therefore I cannot use Tidy.

Tidy's job is to (take an informed guess at how to) fix ill-formed HTML,
not ill-formed XML. And even there, it should be considered a stopgap,
used only because so few people (or tools!) produce officially correct HTML.

If you're working in XML, you should start by producing real XML. That
really shouldn't be hard to do.
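
As a rough illustration of what "producing real XML" can look like, here is a sketch that builds the output with an XML library instead of string concatenation, so escaping and attribute quoting are handled by the serializer. It assumes Python's standard xml.etree.ElementTree; the function build_component and the sample values are hypothetical, and the element names are borrowed from the tag list in the batch file above purely for illustration. The real exporter may be written in something else entirely, where the same idea applies.

```python
import xml.etree.ElementTree as ET


def build_component(title, lang, body):
    """Serialize one record; the library escapes text and quotes attributes."""
    component = ET.Element("component", {"lang": lang})
    ET.SubElement(component, "title").text = title
    ET.SubElement(component, "body2").text = body
    return ET.tostring(component, encoding="unicode")


# Hypothetical content containing characters that must be escaped in XML.
xml_text = build_component("Fish & chips", "en", "Price < 5 & tasty")
print(xml_text)

# Round trip: the importer side gets the original characters back.
parsed = ET.fromstring(xml_text)
print(parsed.find("title").text)
```

The point of the design is that no Replace calls are needed: the raw & and < never reach the output unescaped, and the lang value is always quoted.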
 
Andy Dingley

Joe said:
To check well-formedness of XML, feed it to any proper XML parser. If
the parser doesn't accept it, the XML is not well-formed.

What would you suggest if it _isn't_ well-formed XML? (dodgy use of
HTML entities being an obvious "fixable" problem that springs to mind)

It's not an uncommon problem to have to deal with cruddy XML like this.
I'd be interested to hear what other people's favourite tools for
helping with it are.
 
Joseph Kesselman

Andy said:
What would you suggest if it _isn't_ well-formed XML? (dodgy use of
HTML entities being an obvious "fixable" problem that springs to mind)

There really is no good way to repair a damaged document without deep
knowledge of exactly what the intended document structure was -- which
is why Tidy is such a complicated application; it needs to understand
HTML well enough to make intelligent guesses about what the author's
intent was.

The *best* you can hope to do is to sweep the problem under the carpet
and guess right most of the time.

So I would, very strongly, suggest fixing the problem at the source. If
it isn't well-formed XML, fix the tool that generated it.
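
For the one narrow case Andy mentions (HTML named entities such as &nbsp; turning up in what is otherwise XML), a stopgap can be sketched as below. This is very much a guess-and-patch approach under stated assumptions, not a general repair: it assumes the only damage is HTML entity names that XML does not predefine, and it leaves everything else untouched. The function name html_entities_to_numeric is made up for illustration; html.entities.name2codepoint is from the Python standard library.

```python
import re
import xml.etree.ElementTree as ET
from html.entities import name2codepoint

# The five entities every XML parser already knows; these must stay as they are.
XML_PREDEFINED = {"amp", "lt", "gt", "quot", "apos"}


def html_entities_to_numeric(text):
    """Rewrite HTML named entities (e.g. &nbsp;) as numeric character references."""
    def replace(match):
        name = match.group(1)
        if name in XML_PREDEFINED or name not in name2codepoint:
            return match.group(0)  # leave XML entities and unknown names alone
        return "&#%d;" % name2codepoint[name]

    return re.sub(r"&([A-Za-z][A-Za-z0-9]*);", replace, text)


# Hypothetical damaged input: &nbsp; is not defined in plain XML.
damaged = "<p>a&nbsp;b &amp; c</p>"
patched = html_entities_to_numeric(damaged)
print(patched)
print(repr(ET.fromstring(patched).text))
```

Anything beyond this kind of mechanical substitution (missing quotes, unclosed tags, bare ampersands) needs knowledge of the intended structure, which is the argument for fixing the generator.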
 
