&nbsp; and XML::DOM::Parser


sakcee

hi

I am very new to Perl. I have a file that has &nbsp; in it.

When I parse it using XML::DOM::Parser, it crashes on it; I assume
this is because the parser cannot take &nbsp;.

When I replace &nbsp; with &#x26, the parser accepts it, but on
doc->toString it adds &amp; at every &nbsp;, and I have to replace
that &amp; again with &nbsp; to make it look correct in HTML.

Also, when I try replacing &nbsp; with &#160, it converts to some weird A
character, which I think is c2c0 or something, so the HTML page in utf-8 is
not rendered properly and I see weird A's.

The HTML document is coded in utf-8, there is no XSL, and I have the XML
input as a CGI input. It is not a file, so I cannot declare DTD
entities etc.

Is this the correct approach? It is 2 regex replaces through the whole input,
which seems kind of expensive.

thanks
 

Bart Van der Donck

If you can't declare DTD entities, then you can't make your XML
well-formed. Then you have no choice but to do it by regexp, or to
invoke XML::DOM::Parser with some option(s) so that it ignores the
entities (this seems nearly impossible though).
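
For the regexp route, a single substitution may be enough. This is only a minimal sketch, assuming the result really is served as UTF-8: it rewrites &nbsp; to the numeric reference &#160; before parsing, so nothing has to be converted back afterwards. The helper name nbsp_to_numeric is made up for illustration, and the sketch ignores the corner case of a literal &nbsp; inside a CDATA section.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (not part of XML::DOM): turn the HTML-only
# entity &nbsp; into the numeric character reference for U+00A0,
# which any XML parser accepts without a DTD.
sub nbsp_to_numeric {
    my ($xml) = @_;
    $xml =~ s/&nbsp;/&#160;/g;   # one pass over the input
    return $xml;
}

# Stands in for the XML string received from the CGI parameter.
my $input = '<foo>AA &nbsp; BB</foo>';
print nbsp_to_numeric($input), "\n";
```

The parser then sees an ordinary character reference, and the document it returns contains the no-break space character itself.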

But is it really that hard to add a DOCTYPE? Even if just for parsing
purposes...

When you can insert a fake DTD reference, XML::DOM::Parser will not complain
about any entities:

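#!/usr/bin/perl
use strict;
use warnings;
use XML::DOM;
# Note the capital P: the class is XML::DOM::Parser
my $parser = XML::DOM::Parser->new;
# The DOCTYPE points at an external DTD that is never actually read
my $doc = $parser->parse('<!DOCTYPE rec SYSTEM "fake.dtd" []>
<foo>AA &nbsp; BB</foo>
');
print $doc->toString;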

Alternatively (specific for &nbsp;) :

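#!/usr/bin/perl
use strict;
use warnings;
use XML::DOM;
# NoExpand => 1 asks the parser to leave entity references unexpanded
my $parser = XML::DOM::Parser->new(NoExpand => 1);
# Declare nbsp in an internal DTD subset so the reference is known
my $doc = $parser->parse('<!DOCTYPE foo [
<!ENTITY nbsp "&nbsp;">
]>
<foo>AA &nbsp; BB</foo>
');
print $doc->toString;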
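
Since the XML arrives as a CGI string rather than a file, the DOCTYPE can be glued on in code. The following is only a sketch under assumptions: the helper add_nbsp_doctype is hypothetical, the root element name has to be supplied by the caller, and the entity is mapped to &#160; (a variation on the declaration above, for output served as UTF-8). The parse step runs only if XML::DOM is installed, and it assumes toString returns a character string that needs encoding on output.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: prepend an internal DTD subset that declares
# &nbsp; as the numeric character reference for U+00A0.
sub add_nbsp_doctype {
    my ($xml, $root) = @_;

    # Leave the input alone if it already carries a DOCTYPE.
    return $xml if $xml =~ /<!DOCTYPE\b/;

    my $doctype = '<!DOCTYPE ' . $root . " [\n"
                . '<!ENTITY nbsp "&#160;">' . "\n"
                . "]>\n";

    # An XML declaration, if present, must stay in front of the DOCTYPE.
    if ($xml =~ s/\A(<\?xml\b.*?\?>\s*)//s) {
        return $1 . $doctype . $xml;
    }
    return $doctype . $xml;
}

# Stands in for the XML string received from the CGI parameter.
my $input = '<foo>AA &nbsp; BB</foo>';
my $xml   = add_nbsp_doctype($input, 'foo');
print $xml, "\n";

# Parse only when XML::DOM is available on this machine.
if (eval { require XML::DOM; 1 }) {
    my $doc = XML::DOM::Parser->new->parse($xml);
    binmode STDOUT, ':encoding(UTF-8)';   # assumption: output is UTF-8
    print $doc->toString;
}
```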

Hope this helps,
 

Brian McCauley

> I am very newbie to perl ,

Please see the posting guidelines.

> I have a file that have &nbsp; in it,
>
> when I parse using xml::dom::parser, it is crashing on it, i assume
> because parser can not take
> &nbsp,

Yes, XML does not know most named entities. It doesn't need to, because
XML is usually exchanged in a Unicode encoding (typically UTF-8) and
thus you don't need entities.

> when i replace &nbsp; with &#x26 ,

&#x26 is an ampersand, not a non-breaking space.

> the parser accepts it but on
> doc-->toString , it adds &amp;
> at every &nbsp;

Er, yes, if you replace all non-breaking spaces with ampersands then
every non-breaking space will have been replaced with an ampersand.

> and I have to replace that &amp; again with &nbsp to
> make it look correct it html
> also when i try replacing nbsp with &#160 , it converts to some wierd A
> character

That's what you see if you take a UTF-8 encoding of a non-breaking space
and then pass that data to something that thinks it is working with
Latin-1.

> the html document is coded in utf-8,

What do you mean by "is coded"? It looks to me like there is
disagreement between two parts as to whether the document is UTF-8 or
Latin-1.

> it is not a file to i can not declare dtd
> entities etc

What do you mean by this?

> is this correct approach?

Why not just make sure everything agrees that you want to use UTF-8?
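
To make the UTF-8 versus Latin-1 point concrete, here is a small sketch using only the core Encode module. It shows which bytes U+00A0 becomes in UTF-8, and which characters come out when those bytes are wrongly read as Latin-1. The variable names are illustrative only.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode qw(encode decode);

# U+00A0 is the no-break space that &#160; refers to.
my $nbsp = "\x{A0}";

# Encoded as UTF-8 it becomes two bytes.
my $utf8_bytes = encode('UTF-8', $nbsp);
my $hex = join ' ', map { sprintf '%02X', ord($_) } split //, $utf8_bytes;

# A consumer that wrongly assumes Latin-1 sees one character per byte.
my $misread    = decode('ISO-8859-1', $utf8_bytes);
my @codepoints = map { sprintf 'U+%04X', ord($_) } split //, $misread;

print "UTF-8 bytes:     $hex\n";          # C2 A0
print "Read as Latin-1: @codepoints\n";   # U+00C2 U+00A0
```

U+00C2 is the letter A with a circumflex, which matches the stray A in front of every space that the original poster describes.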
 

sakcee

Hi
thanks for the info and code, this declaration works fine. So can I add
my text inside the foo tag and display the $doc->toString output inside an
HTML page?

Or does something have to be declared in XSL?

thanks


sakcee

> What do you mean by "is coded"? It looks to me like there is
> disagreement between two parts as to whether the document is utf8 or
> Latin1.

the charset declaration in html headers is utf-8

> Why not just make sure everything agrees that you want to use utf8.

the html page is returned as utf-8, probably the parent frame has some
other charset defined for it
 

Bart Van der Donck

> thanks for the info and code, this declaration works fine, so can i add
> my text inside the foo tag and display the to->toString inside an html
> page?
> or something has to de declared in xsl

You can insert whatever text you want inside the <foo>-tag, as long as
the XML rules are respected. Example:

<foo>
<bar>The quick brown fox jumps over the lazy dog</bar>
<bar>
<br/>
<p>
<a target="_blank" href="http://www.google.com">Google</a>
</p>
</bar>
</foo>

You cannot feed arbitrary HTML to XML::DOM::Parser, if that's what you
mean. The passed string or file needs to be well-formed XML (the parser
is non-validating, so it checks well-formedness, not validity against a DTD).

You can display $doc->toString any way you want, so also inside an
HTML page. But non-HTML tags will then have no meaning for the browser,
unless you make special arrangements so the browser knows what to do with
them. XSL could be such an "arrangement" technology for that purpose.

But I think you need to understand XML better, see e.g.
http://www.w3schools.com/xml/default.asp
 

Bart Van der Donck

> the charset declaration in html headers is utf-8

Then your CGI should set that header as well:

#!/usr/bin/perl
use strict;
use warnings;
# Send the HTTP header and the page; both declare UTF-8
print <<'HTML';
Content-Type: text/html; charset=utf-8

<html>
<head>
<meta http-equiv="Content-Type"
content="text/html; charset=utf-8">
</head>
<body>pretty sure we are in utf-8 now</body>
</html>
HTML
 
