Reading UTF-8 Data from XML file

Guest · May 26, 2005

We have an XML file that contains text in various languages, e.g. English,
French, German and Chinese.
We currently have a StringWriter object that reads this in and transforms
against an XslTransform object.
The problem arises when we encounter Chinese characters; these characters
just come out as garbage in the Internet Explorer browser.

Setting the charset type on the .aspx page, in the web.config and in the
.xsl file to be transformed against has no effect.

Using a simple transform in classic ASP,
we can correctly display the text as it's meant to be seen; however, getting
the same output in C# seems a lot trickier.

After trying various 'fixes' posted on several developer sites, nothing has
prevailed and the problem is still there.
We overloaded the StringWriter object to allow changing of the Encoding type
to force UTF-8 in, but to no avail.

When the transform is complete, we return the result of the StringWriter
object's .ToString() method. This is where the error seems to lie,
because checking the .Encoding.EncodingName just prior to returning, it's
labelled as 'Unicode (UTF-8)'; however, when output
to screen via a Text Literal, all we see is garbage.

Some of the characters are replaced with ???????. We know our browser is
functioning correctly because we can see the types of text on
http://www.yahoo.com.hk

Joerg Jooss · May 27, 2005

Matt said:
We have an XML file that contains text in various languages, e.g.
English, French, German and Chinese.
We currently have a StringWriter object that reads this in and
transforms against an XslTransform object.

I really don't believe that you use a String*Writer* to *read* input
;-)

The problem arises when we encounter Chinese characters; these
characters just come out as garbage in the Internet Explorer browser.

Setting the charset type on the .aspx page, in the web.config and in
the .xsl file to be transformed against has no effect.

Using a simple transform in classic ASP,
we can correctly display the text as it's meant to be seen; however,
getting the same output in C# seems a lot trickier.

After trying various 'fixes' posted on several developer sites,
nothing has prevailed and the problem is still there.
We overloaded the StringWriter object to allow changing of the
Encoding type to force UTF-8 in, but to no avail.

When the transform is complete, we return the result of the StringWriter
object's .ToString() method. This is where the error seems to lie,
because checking the .Encoding.EncodingName just prior to returning,
it's labelled as 'Unicode (UTF-8)'; however, when output
to screen via a Text Literal, all we see is garbage.

Some of the characters are replaced with ???????. We know our
browser is functioning correctly because we can see the types of text
on http://www.yahoo.com.hk

Characters and strings in .NET are always Unicode and use UTF-16 as
their internal representation. This means
a) a UTF-8 StringWriter is an oxymoron
b) truly character-based operations aren't susceptible to encoding
problems
c) encodings are only relevant when you need to transport strings using
a byte representation, e.g. when rendering a string on a web page. Make
sure that your web application uses UTF-8 (or any other UTF that suits
your needs) as response encoding.
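
A minimal sketch of points a) and c), assuming a hypothetical Utf8StringWriter subclass similar to the "overloaded" StringWriter described in the question (the class and method names here are made up for illustration). It shows that overriding the Encoding property only changes what the writer reports, not the string it returns, and that garbage appears only when bytes are decoded with the wrong encoding.

```csharp
using System;
using System.IO;
using System.Text;

// Hypothetical subclass: it only changes what the Encoding property reports.
public sealed class Utf8StringWriter : StringWriter
{
    public override Encoding Encoding
    {
        get { return Encoding.UTF8; }
    }
}

public static class EncodingDemo
{
    // Both writers hold the same UTF-16 .NET string, whatever Encoding says.
    public static bool WritersProduceSameString(string text)
    {
        StringWriter plain = new StringWriter();
        Utf8StringWriter utf8 = new Utf8StringWriter();
        plain.Write(text);
        utf8.Write(text);
        return plain.ToString() == utf8.ToString();
    }

    // An encoding only comes into play when the string is turned into bytes.
    public static int Utf8ByteCount(string text)
    {
        return Encoding.UTF8.GetByteCount(text);
    }

    // Simulates a receiver that decodes UTF-8 bytes with the wrong encoding.
    public static string DecodeUtf8BytesAs(string text, Encoding wrong)
    {
        byte[] bytes = Encoding.UTF8.GetBytes(text);
        return wrong.GetString(bytes);
    }
}
```

For the two-character string "\u4E2D\u6587", WritersProduceSameString returns true and Utf8ByteCount returns 6. Passing Encoding.GetEncoding("iso-8859-1") to DecodeUtf8BytesAs yields six unrelated characters, which is the kind of garbage a browser shows when it guesses the wrong charset.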

Cheers,

Guest · May 31, 2005

Joerg,

Thanks - a developer wrote this question...

We currently have a StringWriter object that reads this in and

Sorry, to clarify: this means that the result of transforming an XmlDocument
object is written to a StringWriter.

My Webform does use UTF-8 response and request encoding, and I have tried
several other encoding types to get it to work.

I can get Chinese characters to display, but some of the content is still
broken. Could the fact that my transformation results in a mixture of HTML
code + English text + Chinese text be part of the problem?

It seems I get something like "è—›éˆ¥çŠ†ï½‚å“éˆ¥?/P>" notice the question mark and half
a </p> tag. I have disabled output escaping in my xslt but still to no avail.

Your help appreciated,
Thanks
Matt

Joerg Jooss · May 31, 2005

Matt said:
Joerg,

Thanks - A developer wrote this question...

We currently have a StringWriter object that reads this in and

Sorry - this means that the result of a transformation of an
XmlDocument object is written to a string writer to clarify.

My Webform does use UTF-8 response and request encoding, and I have
tried several other encoding types to get it to work.

I can get Chinese characters to display, but some of the content is
still broken. Could the fact that my transformation results in a
mixture of HTML code + English text + Chinese text be part of the
problem?

Only if you were not using Unicode. But since you use UTF-8 as response
encoding, and assuming you don't mistreat any string objects in your
code, that should not be a problem.

It seems I get something like "è—›éˆ¥çŠ†ï½‚å“éˆ¥?/P>" notice the
question mark and half a </p> tag.

What characters are missing in this string? Is it only the opening '<'?

Cheers,

Guest · Jun 1, 2005

Joerg Jooss said:
Only if you were not using Unicode. But since you use UTF-8 as response
encoding, and assuming you don't mistreat any string objects in your
code, that should not be a problem.

What characters are missing in this string? Is it only the opening '<'?

Cheers,

Yes - although if I disable output escaping in my XSL, I can see that ?lt; is
in the code, as if the & has been replaced with a ?

Here is the code for your ref:
XmlDocument oDoc = new XmlDocument();
XslTransform oXsl = new XslTransform();

oDoc.Load(Server.MapPath(""));
oXsl.Load(Server.MapPath("xsl/x_language_test.xsl"));

StringWriter oSw = new StringWriter();

oXsl.Transform(oDoc,null,oSw);

litTestText.Text = oSw.ToString();

Thanks
Matt

Guest · Jun 1, 2005

Matt Hollingworth said:
Yes - although if I disable output escaping in my XSL, I can see that ?lt; is
in the code, as if the & has been replaced with a ?

Here is the code for your ref:
XmlDocument oDoc = new XmlDocument();
XslTransform oXsl = new XslTransform();

oDoc.Load(Server.MapPath(""));
oXsl.Load(Server.MapPath("xsl/x_language_test.xsl"));

StringWriter oSw = new StringWriter();

oXsl.Transform(oDoc,null,oSw);

litTestText.Text = oSw.ToString();

Thanks
Matt

Having investigated further, I forgot to say that I only see the output
described above after changing the browser encoding to Simplified Chinese.
If I choose UTF-8, it all still appears encoded, just as it does in Notepad
when you click View Source.

I did the same page in ASP and it all displays correctly without issue.

Joerg Jooss · Jun 3, 2005

Matt said:
Yes - although if I disable output escaping in my XSL, I can see that
?lt; is in the code, as if the & has been replaced with a ?

Here is the code for your ref:
XmlDocument oDoc = new XmlDocument();
XslTransform oXsl = new XslTransform();

oDoc.Load(Server.MapPath(""));
oXsl.Load(Server.MapPath("xsl/x_language_test.xsl"));

StringWriter oSw = new StringWriter();

oXsl.Transform(oDoc,null,oSw);

litTestText.Text = oSw.ToString();

Save for the weird Server.MapPath(""), there seems to be nothing wrong
here. I can only imagine that there's something wrong with the XSL
itself -- maybe somebody over in the XML group can help out.
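
One thing that may be worth checking, though nothing in this thread confirms it as the cause: when a transform writes to a TextWriter such as a StringWriter, the encoding named in xsl:output is not what ends up in the output declaration. The writer's own encoding is used instead, which is UTF-16 for a plain StringWriter. A rough sketch follows, using XslCompiledTransform (the newer replacement for XslTransform, an assumption here) and made-up inline XML and XSL rather than the real files.

```csharp
using System;
using System.IO;
using System.Text;
using System.Xml;
using System.Xml.Xsl;

public static class TransformDemo
{
    // Hypothetical sample input and stylesheet; not the poster's files.
    const string SampleXml = "<doc><p>\u4E2D\u6587</p></doc>";
    const string SampleXsl =
        "<xsl:stylesheet version='1.0' " +
        "xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>" +
        "<xsl:output method='xml' encoding='utf-8'/>" +
        "<xsl:template match='/'>" +
        "<out><xsl:value-of select='doc/p'/></out>" +
        "</xsl:template>" +
        "</xsl:stylesheet>";

    static XslCompiledTransform LoadTransform()
    {
        XslCompiledTransform xslt = new XslCompiledTransform();
        using (XmlReader reader = XmlReader.Create(new StringReader(SampleXsl)))
        {
            xslt.Load(reader);
        }
        return xslt;
    }

    // Output goes to a StringWriter: the declaration follows the writer.
    public static string TransformToString()
    {
        XslCompiledTransform xslt = LoadTransform();
        StringWriter writer = new StringWriter();
        using (XmlReader input = XmlReader.Create(new StringReader(SampleXml)))
        {
            xslt.Transform(input, null, writer);
        }
        return writer.ToString();
    }

    // Output goes to a byte stream: the xsl:output encoding is applied.
    public static string TransformToStreamAndDecode()
    {
        XslCompiledTransform xslt = LoadTransform();
        using (MemoryStream stream = new MemoryStream())
        using (XmlReader input = XmlReader.Create(new StringReader(SampleXml)))
        {
            xslt.Transform(input, null, stream);
            stream.Position = 0;
            // StreamReader decodes the bytes as UTF-8 and skips any BOM.
            using (StreamReader reader = new StreamReader(stream, Encoding.UTF8))
            {
                return reader.ReadToEnd();
            }
        }
    }
}
```

With this sample, the result of TransformToString declares encoding="utf-16", while the stream variant declares encoding="utf-8"; both contain the same Chinese text. If a declaration or generated META tag in the page disagrees with the response encoding, a browser may pick the wrong charset.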

Cheers,
