Reading UTF-8 Data from XML file

Guest

We have an XML file that contains text in various languages, i.e. English,
French, German, Chinese, etc.
We currently have a StringWriter object that reads this in and transforms
against an XslTransform object.
The problem arises when we encounter Chinese characters; these characters
just come out as garbage in the Internet Explorer browser.

Setting the charset type on the .aspx page, in the web.config and in the
.xsl file to be transformed against has no effect.

Using a simple transform in classic ASP,
we can correctly display the text as it's meant to be seen; however, getting
the same output in C# seems a lot more tricky.

After trying various 'fixes' posted on several developer sites, nothing has
worked and the problem is still there.
We overloaded the StringWriter object to allow changing the Encoding type
to force UTF-8, but to no avail.

When the transform is complete, we return the result of the StringWriter
object's .ToString method. This is where the error seems to lie:
checking the .Encoding.EncodingName just prior to returning, it's
labelled as 'Unicode (UTF-8)', yet when output
to screen via a Text Literal, all we see is garbage.


Some of the characters are replaced with ???????. We know our browser is
functioning correctly because we can see these types of text on
http://www.yahoo.com.hk
 
Joerg Jooss

Matt said:
We have an XML file that contains text in various languages, i.e.
English, French, German, Chinese, etc.
We currently have a StringWriter object that reads this in and
transforms against an XslTransform object.

I really don't believe that you use a String*Writer* to *read* input
;-)

Characters and strings in .NET are always Unicode and use UTF-16 as their
internal representation. This means
a) a UTF-8 StringWriter is an oxymoron
b) truly character-based operations aren't susceptible to encoding
problems
c) encodings are only relevant when you need to transport strings using
a byte representation, e.g. when rendering a string on a web page. Make
sure that your web application uses UTF-8 (or any other UTF that suits
your needs) as response encoding.
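
To make points a) to c) concrete, here is a minimal console sketch (not code from this thread). The class name EncodingBoundaryDemo, its helper methods and the sample text are made up for illustration. It shows that a StringWriter only ever holds .NET chars, and that an encoding matters only once the string is turned into bytes. Decoding those bytes with a different encoding than the one that produced them is one common way to end up with garbage.

```csharp
using System;
using System.IO;
using System.Text;

public static class EncodingBoundaryDemo
{
    // Hypothetical sample: two Chinese characters followed by a closing tag.
    public const string Sample = "\u4E2D\u6587</p>";

    // A StringWriter accumulates chars; ToString() hands them back unchanged.
    public static string ThroughStringWriter(string text)
    {
        using (StringWriter writer = new StringWriter())
        {
            writer.Write(text);
            return writer.ToString();
        }
    }

    // Encode to bytes with one encoding, then decode with another.
    public static string Recode(string text, Encoding encodeWith, Encoding decodeWith)
    {
        byte[] bytes = encodeWith.GetBytes(text);
        return decodeWith.GetString(bytes);
    }

    public static void Main()
    {
        // Which encoding does a plain StringWriter report?
        Console.WriteLine(new StringWriter().Encoding.WebName);

        // Purely character-based handling: is the text still intact?
        Console.WriteLine(ThroughStringWriter(Sample) == Sample);

        // Same encoding on both sides of the byte boundary.
        Console.WriteLine(Recode(Sample, Encoding.UTF8, Encoding.UTF8) == Sample);

        // UTF-8 bytes read back as ISO-8859-1.
        Encoding latin1 = Encoding.GetEncoding("iso-8859-1");
        Console.WriteLine(Recode(Sample, Encoding.UTF8, latin1) == Sample);
    }
}
```

In a web application the byte boundary is the HTTP response, so the response encoding and the charset the browser is told about have to agree.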

Cheers,
 
Guest

Joerg,

Thanks - A developer wrote this question...

"We currently have a StringWriter object that reads this in and ..."

Sorry, to clarify: this means that the result of a transformation of an
XmlDocument object is written to a StringWriter.


My Webform does use UTF-8 response and request encoding, and I have tried
several other encoding types to get it to work.

I can get Chinese characters to display, but some of the content is still
broken. Could the fact that my transformation results in a mixture of HTML
code + English text + Chinese text be part of the problem?

It seems I get something like "藛鈥犆bå“鈥?/P>"; notice the question mark and half
a </p> tag. I have disabled output escaping in my XSLT, but still to no avail.

Your help appreciated,
Thanks
Matt
 
Joerg Jooss

Matt said:
I can get Chinese characters to display, but some of the content is
still broken. Could the fact that my transformation results in a
mixture of HTML code + English text + Chinese text be part of the
problem?

Only if you were not using Unicode. But since you use UTF-8 as response
encoding, and assuming you don't mistreat any string objects in your
code, that should not be a problem.

Matt said:
It seems I get something like "藛鈥犆bå“鈥?/P>"; notice the
question mark and half a </p> tag.

What characters are missing in this string? Is it only the opening '<'?

Cheers,
 
Guest

Joerg Jooss said:
What characters are missing in this string? Is it only the opening '<'?

Yes, although if I disable output escaping in my XSL I can see that ?lt; is
in the code, as if the & has been replaced with a ?


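Here is the code for your reference:
// Load the source document and the stylesheet
XmlDocument oDoc = new XmlDocument();
XslTransform oXsl = new XslTransform();

oDoc.Load(Server.MapPath(""));
oXsl.Load(Server.MapPath("xsl/x_language_test.xsl"));

// Transform into an in-memory StringWriter
StringWriter oSw = new StringWriter();

oXsl.Transform(oDoc,null,oSw);

// Show the transformed markup through the Literal control
litTestText.Text = oSw.ToString();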


Thanks
Matt
 
Guest

Matt Hollingworth said:
Yes, although if I disable output escaping in my XSL I can see that ?lt; is
in the code, as if the & has been replaced with a ?


Having investigated further, I forgot to say that I only see this output
after changing the encoding to Simplified Chinese in the browser. If I choose
UTF-8, it is all still encoded, just as it appears in Notepad when you click
View Source.

I did the same page in ASP and it all displays correctly without issue.
 
Joerg Jooss

Matt said:
Here is the code for your ref:
XmlDocument oDoc = new XmlDocument();
XslTransform oXsl = new XslTransform();

oDoc.Load(Server.MapPath(""));
oXsl.Load(Server.MapPath("xsl/x_language_test.xsl"));

StringWriter oSw = new StringWriter();

oXsl.Transform(oDoc,null,oSw);

litTestText.Text = oSw.ToString();

Save for the weird Server.MapPath(""), there seems to be nothing wrong
here. I can only imagine that there's something wrong with the XSL
itself; maybe somebody over in the XML group can help out.

Cheers,
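
One thing that may be worth checking, offered as a guess and not as a diagnosis: when an XSLT result is written to a TextWriter such as a StringWriter, the encoding named in the output comes from the writer and not from the encoding attribute of xsl:output. An html-method stylesheet may therefore emit a META tag naming UTF-16 even though the response is sent as UTF-8. The sketch below runs the same transform once into a StringWriter and once into a Stream so the two results can be compared. The inline stylesheet, the input document and every name in TransformTargetDemo are hypothetical; in a page, Response.OutputStream would take the place of the MemoryStream.

```csharp
using System;
using System.IO;
using System.Text;
using System.Xml;
using System.Xml.Xsl;

public static class TransformTargetDemo
{
    // Hypothetical stylesheet: html output method, UTF-8 requested.
    const string StylesheetText =
        "<xsl:stylesheet version='1.0' xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>" +
        "<xsl:output method='html' encoding='utf-8'/>" +
        "<xsl:template match='/'>" +
        "<html><head><title>test</title></head>" +
        "<body><p><xsl:value-of select='/doc'/></p></body></html>" +
        "</xsl:template></xsl:stylesheet>";

    // Hypothetical input document containing two Chinese characters.
    public const string Chinese = "\u4E2D\u6587";
    const string DocumentText = "<doc>" + Chinese + "</doc>";

    static XslTransform LoadStylesheet()
    {
        // XslTransform matches the thread; later framework versions mark it
        // obsolete in favour of XslCompiledTransform.
        XslTransform xsl = new XslTransform();
        xsl.Load(new XmlTextReader(new StringReader(StylesheetText)));
        return xsl;
    }

    static XmlDocument LoadDocument()
    {
        XmlDocument doc = new XmlDocument();
        doc.LoadXml(DocumentText);
        return doc;
    }

    // Same call shape as in the thread: the result is collected as chars.
    public static string TransformToString()
    {
        StringWriter writer = new StringWriter();
        LoadStylesheet().Transform(LoadDocument(), null, writer);
        return writer.ToString();
    }

    // Alternative: let the transform serialize straight to bytes.
    public static byte[] TransformToBytes()
    {
        MemoryStream stream = new MemoryStream();
        LoadStylesheet().Transform(LoadDocument(), null, stream);
        return stream.ToArray();
    }

    public static void Main()
    {
        string viaWriter = TransformToString();
        string viaStream = Encoding.UTF8.GetString(TransformToBytes());

        // Does the generated markup mention utf-16 in either case?
        Console.WriteLine(viaWriter.IndexOf("utf-16", StringComparison.OrdinalIgnoreCase) >= 0);
        Console.WriteLine(viaStream.IndexOf("utf-16", StringComparison.OrdinalIgnoreCase) >= 0);
    }
}
```

If the two results differ in the charset they declare, that would point at the output target and not at the XML or the string handling.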
 
