Help with utf8

Francois

I read a file encoded as UTF-8, and it has accented characters displayed
as Rémi (in gvim).

I read and parse the file as follows, where xmlFile is a File object:

InputStreamReader in = new InputStreamReader(new FileInputStream(xmlFile), "UTF-8");
filter.parse(new InputSource(new BufferedReader(in)));

When the parsing is done, I output the file with:

Writer out = new OutputStreamWriter(new FileOutputStream(outfile), "UTF-8");
filter.setContentHandler(new XMLWriter(out));

During the parsing, I substitute the attributes' content using a
HashMap which is read from another file with:

FileInputStream r = new FileInputStream(d);
InputStreamReader is = new InputStreamReader(r);
System.out.println("Zmodif encoding " + is.getEncoding());
BufferedReader reader = new BufferedReader(is);
String line;
while ((line = reader.readLine()) != null) {
    byte[] conv = line.getBytes("ISO-8859-1");
    String u8Line = new String(conv, "UTF8");
    ...
    // I put u8Line in the HashMap and use it to make the substitutions
}

My problem is that the output file has accented characters like this:
R&#233;mi instead of Rémi
I don't know where it comes from or how to change it ...

Thanks for any help

Francois
 
Roedy Green

Francois said:
I read a file encoded as UTF-8, and it has accented characters displayed
as Rémi (in gvim).

I read and parse the file

File xmlFile is the file handler.

Java's UTF-8 created with writeUTF is not standard UTF-8, but a modified
form with a length field in front. Think of it as a binary format. The only
thing you can read it with is readUTF. See http://mindprod.com/jgloss/utf.html
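
To make the length field concrete, here is a minimal sketch (the class and method names are made up for illustration) that compares the bytes from DataOutputStream.writeUTF with plain UTF-8 bytes for the same string. It only uses in-memory streams, so nothing here touches the files from the original question.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;

class WriteUtfDemo {
    // Returns the raw bytes that DataOutputStream.writeUTF produces for s.
    static byte[] writeUtfBytes(String s) throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buf)) {
            out.writeUTF(s);
        }
        return buf.toByteArray();
    }

    // Reads back a string that was written with writeUTF.
    static String readUtf(byte[] data) throws IOException {
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(data))) {
            return in.readUTF();
        }
    }

    public static void main(String[] args) throws IOException {
        String name = "R\u00e9mi"; // "Rémi", written with an escape to avoid source-encoding issues
        byte[] plain = name.getBytes(StandardCharsets.UTF_8);
        byte[] framed = writeUtfBytes(name);
        System.out.println("plain UTF-8 length: " + plain.length);   // 5
        System.out.println("writeUTF length:    " + framed.length);  // 7 (2-byte length + 5 bytes)
        System.out.println("round trip ok:      " + name.equals(readUtf(framed))); // true
    }
}
```

The two extra bytes are the length prefix, which is why a text editor or an XML parser cannot treat writeUTF output as an ordinary UTF-8 text file.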

For files written in UTF-8 with a Writer or by some other app, you
read with a BufferedReader wrapped around an InputStreamReader that is
given the encoding as a parameter.

See http://mindprod.com/applet/fileio.html
for sample code for reading both kinds of data.
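
For the plain-text case, a minimal sketch might look like the following. The class name, method name and temporary file are hypothetical; the point is only that the charset is passed explicitly to the InputStreamReader instead of relying on the platform default.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

class ReadUtf8Lines {
    // Reads all lines of a text file, decoding the bytes explicitly as UTF-8.
    static List<String> readLines(Path file) throws IOException {
        List<String> lines = new ArrayList<>();
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(Files.newInputStream(file), StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                lines.add(line);
            }
        }
        return lines;
    }

    public static void main(String[] args) throws IOException {
        // Hypothetical test file containing one accented name.
        Path tmp = Files.createTempFile("utf8-demo", ".txt");
        Files.write(tmp, "R\u00e9mi\n".getBytes(StandardCharsets.UTF_8));
        System.out.println(readLines(tmp).get(0).equals("R\u00e9mi")); // true
        Files.delete(tmp);
    }
}
```

Once the reader decodes correctly, the strings need no further conversion before they go into a HashMap or anywhere else.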
--
Roedy Green Canadian Mind Products
http://mindprod.com

"At this point, 29 percent of fish and seafood species have collapsed - that is,
their catch has declined by 90 percent. It is a very clear trend, and it is accelerating.
If the long-term trend continues, all fish and seafood species are projected to collapse
within my lifetime -- by 2048."
~ Dr. Boris Worm of Dalhousie University
 
Tom Anderson

Francois said:
I read a file encoded as UTF-8, and it has accented characters displayed
as Rémi (in gvim).

I read and parse the file, where xmlFile is a File object, using:
InputStreamReader in = new InputStreamReader(new FileInputStream(xmlFile), "UTF-8");
filter.parse(new InputSource(new BufferedReader(in)));

When the parsing is done, I output the file with
Writer out = new OutputStreamWriter(new FileOutputStream(outfile), "UTF-8");
filter.setContentHandler(new XMLWriter(out));

During the parsing, I substitute the attributes' content using a
HashMap which is read from another file with

I don't understand what you mean by that. Substitute how?

Francois said:
FileInputStream r = new FileInputStream(d);
InputStreamReader is = new InputStreamReader(r);
System.out.println("Zmodif encoding " + is.getEncoding());
BufferedReader reader = new BufferedReader(is);
String line;
while ((line = reader.readLine()) != null) {
    byte[] conv = line.getBytes("ISO-8859-1");
    String u8Line = new String(conv, "UTF8");
    ...

That looks like a really odd thing to do. What are you trying to achieve
by encoding a string as ISO-8859-1 and then decoding it as UTF-8?

Francois said:
I put u8Line in the HashMap and use it to make the substitutions
}
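
To see what that ISO-8859-1/UTF-8 round trip actually does, here is a small sketch with hypothetical names. It assumes the bytes in the file are UTF-8 and looks at two cases: the reader decoded them wrongly as ISO-8859-1 (which can happen when no encoding is given and that is the platform default), or the reader decoded them correctly.

```java
import java.nio.charset.StandardCharsets;

class RoundTripDemo {
    // The conversion from the loop: encode as ISO-8859-1, then decode as UTF-8.
    static String latin1ToUtf8(String s) {
        return new String(s.getBytes(StandardCharsets.ISO_8859_1), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // Bytes of "Rémi" as they would sit in a UTF-8 file.
        byte[] fileBytes = "R\u00e9mi".getBytes(StandardCharsets.UTF_8);

        // Case 1: the reader wrongly decoded the bytes as ISO-8859-1.
        // The round trip undoes that mistake.
        String misread = new String(fileBytes, StandardCharsets.ISO_8859_1);
        System.out.println(latin1ToUtf8(misread).equals("R\u00e9mi")); // true

        // Case 2: the reader already decoded the bytes correctly as UTF-8.
        // The round trip now damages the accented character.
        String correct = new String(fileBytes, StandardCharsets.UTF_8);
        System.out.println(latin1ToUtf8(correct).equals("R\u00e9mi")); // false
    }
}
```

So the conversion only "works" as a repair for a wrongly decoded string; giving the InputStreamReader the right encoding in the first place makes it unnecessary.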

Francois said:
My problem is that the output file has accented characters like this:
R&#233;mi instead of Rémi
I don't know where it comes from or how to change it ...

That's an XML numeric character escape. &#233; means the Unicode character
with code 233, which is a lowercase e with an acute accent. It's a
perfectly valid thing to find in an XML document; if the purpose of your
XML file is to be read by another program, it will be fine. If you want to
encode it as a normal character, you need to tell the XML encoder to do
that rather than use an escape; I don't know what this XMLWriter class
you're using is, but that's the object which is making that decision.
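
As a quick check that the escape and the literal character are the same thing to an XML parser, here is a minimal sketch using the JDK's DOM parser. The element name, attribute name and class name are made up for the example.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

class CharRefDemo {
    // Parses a tiny document and returns the value of the root element's
    // (hypothetical) "first" attribute.
    static String firstAttr(String xml) throws Exception {
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        Document doc = builder.parse(new InputSource(new StringReader(xml)));
        return doc.getDocumentElement().getAttribute("first");
    }

    public static void main(String[] args) throws Exception {
        String escaped = firstAttr("<name first=\"R&#233;mi\"/>");
        String literal = firstAttr("<name first=\"R\u00e9mi\"/>");
        System.out.println(escaped.equals(literal));   // true
        System.out.println((int) escaped.charAt(1));   // 233
    }
}
```

Any program that reads the output with a real XML parser sees the same string either way; the difference only matters to a human looking at the file in an editor.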

tom
 
Francois

Thanks for replying and for the suggestion to take a closer look at the
XMLWriter I was using. It was com.megginson.sax.XMLWriter,
and it removed the encoding attribute from the XML declaration of the file
produced. Thanks also for pointing out my wrong way of reading lines.
A BufferedReader was enough.

I wanted a way to output everything from the input XML file, and found
a page, http://www.acooke.org/cute/SAXXMLFilt0.html,
which uses a TransformerHandler and a SAXTransformerFactory to
create a ContentHandler. With handler.setResult(new StreamResult(out))
before passing the handler to the parser, I could get a parser reading a
file and giving the result to another file or to System.out. I found it a
lot easier to do the same with Perl because I've found the docs much better.
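
For anyone finding this later, the TransformerHandler approach might look roughly like the sketch below. It is not the code from the linked page: the class name, the in-memory streams and the sample document are assumptions for illustration, and a real filter would sit between the parser and the handler.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.SAXParserFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.sax.SAXTransformerFactory;
import javax.xml.transform.sax.TransformerHandler;
import javax.xml.transform.stream.StreamResult;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;

class SaxCopy {
    // Parses the given XML bytes with SAX and serializes the events
    // unchanged, as UTF-8, through an identity TransformerHandler.
    static byte[] copy(byte[] xml) throws Exception {
        SAXTransformerFactory tf = (SAXTransformerFactory) TransformerFactory.newInstance();
        TransformerHandler handler = tf.newTransformerHandler();
        handler.getTransformer().setOutputProperty(OutputKeys.ENCODING, "UTF-8");

        // The result must be set before the handler receives any events.
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        handler.setResult(new StreamResult(out));

        SAXParserFactory spf = SAXParserFactory.newInstance();
        spf.setNamespaceAware(true);
        XMLReader reader = spf.newSAXParser().getXMLReader();
        reader.setContentHandler(handler);
        reader.parse(new InputSource(new ByteArrayInputStream(xml)));
        return out.toByteArray();
    }

    public static void main(String[] args) throws Exception {
        // Hypothetical input document with one accented attribute value.
        byte[] in = "<name first=\"R\u00e9mi\"/>".getBytes(StandardCharsets.UTF_8);
        System.out.println(new String(copy(in), StandardCharsets.UTF_8));
    }
}
```

Handing the StreamResult an OutputStream rather than a Writer lets the serializer do the encoding itself, so the XML declaration and the bytes stay consistent. With the JDK's built-in serializer and UTF-8 output, I would expect the accented character to come out as a literal character rather than a numeric escape.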
 
Arne Vajhøj

Roedy said:
Java's utf-8 created with writeUTF is not UTF, but a modification with
a length field.

And the relevance is?

(there is nothing in the post talking about DataOutputStream and
writeUTF - he is trying to read an XML file)

Roedy said:
Think of it as a binary format.

It is a binary format.

Arne
 
