converting from one charset encoding to another ...

Albretch Mueller · Nov 23, 2009

Some time ago I coded some methods for charset re-encoding. Say you get
files in Cyrillic, “KOI8-R”, and you want them as UTF-8.

What I did was basically open an InputStreamReader(FileInputStream
FIS, String aEncoding1) and an OutputStreamWriter(FOS, “UTF-8”), then
call InputStreamReader.read(char[] chrBffr) and
OutputStreamWriter.write(chrBffr, 0, iRdByts) in a while loop until it
hit EOF.
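
A minimal sketch of the loop described above, assuming Java 7 or later for try-with-resources. The class name Transcoder and its parameter names are made up for illustration and are not from the original code.

```java
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.Reader;
import java.io.Writer;

final class Transcoder {
    /** Copies srcPath to dstPath, decoding with srcCharset and encoding with dstCharset. */
    static void transcode(String srcPath, String srcCharset,
                          String dstPath, String dstCharset) throws IOException {
        try (Reader in = new InputStreamReader(new FileInputStream(srcPath), srcCharset);
             Writer out = new OutputStreamWriter(new FileOutputStream(dstPath), dstCharset)) {
            char[] buf = new char[8192];
            int n; // number of chars (not bytes) read in this pass
            while ((n = in.read(buf)) != -1) {
                out.write(buf, 0, n);
            }
        }
    }
}
```

A call such as Transcoder.transcode("in.txt", "KOI8-R", "out.txt", "UTF-8") would cover the Cyrillic case mentioned above; the file names are placeholders.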

That works just fine, yet I wonder if there are better/faster ways to
do that using channels/memory-mapped files.

Also, where can you get actual files in different encodings
to test these methods?

Thanks
lbrtchx
{comp.lang.java.programmer}

Mike Schilling · Nov 23, 2009

Albretch said:
Also, where can you get actual files in different encodings
to test these methods?

You can create them easily enough with an OutputStreamWriter of the
desired encoding wrapped around a FileOutputStream.
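
One way to produce such test files, sketched with an OutputStreamWriter wrapped around a FileOutputStream. The class and method names (SampleFiles, writeSample) are hypothetical, and the charset name must be one the running JVM supports.

```java
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.Writer;

final class SampleFiles {
    /** Writes text to path, encoded with the named charset. */
    static void writeSample(String path, String text, String charsetName) throws IOException {
        try (Writer out = new OutputStreamWriter(new FileOutputStream(path), charsetName)) {
            out.write(text);
        }
    }
}
```

Writing the same string once as "KOI8-R" and once as "UTF-8" gives a pair of files with identical text but different bytes, which is enough to exercise a re-encoding method.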

Albretch Mueller · Nov 23, 2009

Mike Schilling said:
You can create them easily enough with an OutputStreamWriter of the
desired encoding wrapped around a FileOutputStream.

~
After checking the API I don't see what the difference would be
between a plain reader and a FileOutputStream. What is it?

Thank you
lbrtchx

Lew · Nov 23, 2009

Albretch said:
After checking the API I don't see what the difference would be
between a plain reader and a FileOutputStream. What is it?

I'll assume you either meant a "plain writer" or a 'FileInputStream', but the
question remains what you mean by a "plain reader/writer".

'Reader's and 'Writer's deal with encoded 'char's. Streams deal with raw bytes.

Mike Schilling · Nov 23, 2009

Albretch said:
~
After checking the API I don't see what the difference would be
between a plain reader and a FileOutputStream. What is it?

A Writer converts from characters (Unicode) to whatever encoding it
was created with. An OutputStream just outputs bytes with no
conversion being done.
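
To make the distinction concrete, here is a small sketch that pushes the same text through an OutputStreamWriter into an in-memory byte stream and returns the bytes that come out. The class name WriterVsStream is invented for this example.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.Writer;

final class WriterVsStream {
    /** Returns the bytes an OutputStreamWriter produces for text in the named charset. */
    static byte[] encodeViaWriter(String text, String charsetName) throws IOException {
        ByteArrayOutputStream sink = new ByteArrayOutputStream(); // raw bytes, no conversion
        try (Writer w = new OutputStreamWriter(sink, charsetName)) {
            w.write(text); // the chars are encoded here
        }
        return sink.toByteArray();
    }
}
```

The same one-character string can come out as one byte in a single-byte charset such as KOI8-R and as two bytes in UTF-8; the ByteArrayOutputStream underneath only ever sees bytes.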

Roedy Green · Nov 23, 2009

That works just fine, yet I wonder if there are better/faster ways to
do that using channels/memory mapped files

What I don't understand is this: nio uses ordinary file I/O
underneath. So how could it be faster, provided you don't do something
stupid with ordinary file I/O, in a case where caching would not help?
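
For comparison, the channel/memory-mapped variant asked about could look roughly like the sketch below. It assumes Java 7 or later, a file small enough to map and decode in one piece (under 2 GB, and it is held in memory after decoding), and it makes no claim about being faster than the stream version. MappedTranscoder is a hypothetical name.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.Charset;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;

final class MappedTranscoder {
    /** Maps src into memory, decodes it with 'from', and writes it to dst encoded with 'to'. */
    static void transcode(String src, Charset from, String dst, Charset to) throws IOException {
        try (FileChannel in = FileChannel.open(Paths.get(src), StandardOpenOption.READ);
             FileChannel out = FileChannel.open(Paths.get(dst),
                     StandardOpenOption.CREATE, StandardOpenOption.WRITE,
                     StandardOpenOption.TRUNCATE_EXISTING)) {
            MappedByteBuffer mapped = in.map(FileChannel.MapMode.READ_ONLY, 0, in.size());
            // Charset.decode/encode silently replace malformed or unmappable input.
            CharBuffer chars = from.decode(mapped);
            ByteBuffer encoded = to.encode(chars);
            while (encoded.hasRemaining()) {
                out.write(encoded);
            }
        }
    }
}
```

Because the whole file is decoded before anything is written, this trades the small fixed buffer of the stream loop for memory proportional to the file size.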

Albretch Mueller · Nov 23, 2009

I'll assume you either meant a "plain writer" or a 'FileInputStream'
~
;-)
~

'Reader's and 'Writer's deal with encoded 'char's. Streams deal with raw bytes.

~
but once you write to a file as I am doing it all becomes a stream of
bytes anyway, till you eventually reopen the file using a Reader and
specifying the charset to interpret chunks of bytes as they are being
read into an array of chars, and as specified by the API:
~
http://java.sun.com/javase/6/docs/api/java/lang/Character.html
~
"The Java 2 platform uses the UTF-16 representation in char arrays
and in the String and StringBuffer classes."
~
So I think there is nothing fancy about converting streams from one
charset to another, as long as your OS/Java supports both encodings.
It is by nature a serial process.
~
Thank you
lbrtchx

Lew · Nov 24, 2009

Albretch said:
but once you write to a file as I am doing it all becomes a stream of
bytes anyway, till you eventually reopen the file using a Reader and
specifying the charset to interpret chunks of bytes as they are being
read into an array of chars, and as specified by the API:

The exact bytes written through a Writer depend on the encoding used. If you
use a Reader with a different encoding, you'll get garbage.
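
A small in-memory sketch of that effect: text is written through an OutputStreamWriter in one charset and read back through an InputStreamReader in another. The class and method names are invented for illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.Reader;
import java.io.Writer;

final class WrongCharsetRead {
    /** Encodes text with writeCharset, then decodes the resulting bytes with readCharset. */
    static String writeThenRead(String text, String writeCharset, String readCharset)
            throws IOException {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        try (Writer w = new OutputStreamWriter(sink, writeCharset)) {
            w.write(text);
        }
        StringBuilder sb = new StringBuilder();
        try (Reader r = new InputStreamReader(
                new ByteArrayInputStream(sink.toByteArray()), readCharset)) {
            char[] buf = new char[1024];
            int n;
            while ((n = r.read(buf)) != -1) {
                sb.append(buf, 0, n);
            }
        }
        return sb.toString();
    }
}
```

With matching charsets the text round-trips unchanged. With, say, "UTF-8" for writing and "ISO-8859-1" for reading, no exception is thrown; the result is simply a different, longer string.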

Albretch Mueller · Nov 25, 2009

The exact bytes written through a Writer depend on the encoding used. If you
use a Reader with a different encoding, you'll get garbage.

OK, you have made me wonder about what to do when you don't know the
encoding of a file you got. As far as I know this is not taken care of
by Readers, even though some heuristics may be used.

So, what do you do in those situations?

Thank you
lbrtchx

Lew · Nov 25, 2009

Don't quote sigs.

OK, you have made me wonder about what to do when you don't know the
encoding of a file you got. As far as I know this is not taken care of
by Readers, even though some heuristics may be used.

So, what do you do in those situations?

The editor in Rational Software Architect, an IDE built on Eclipse, simply
reports that the file is not in the specified encoding. I haven't looked at
its source, but I guess it notices illegal code points. Other editors just
display the wrong thing.
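
How that editor does it is a guess, as stated above. One check of this general kind is available in the JDK: a CharsetDecoder configured to report malformed or unmappable input instead of replacing it. The sketch below uses that; EncodingCheck and decodesCleanly are hypothetical names.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;

final class EncodingCheck {
    /** Returns true if data decodes in the given charset without any coding error. */
    static boolean decodesCleanly(byte[] data, Charset charset) {
        CharsetDecoder decoder = charset.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            decoder.decode(ByteBuffer.wrap(data));
            return true;
        } catch (CharacterCodingException e) {
            return false; // some byte sequence is not legal in this charset
        }
    }
}
```

This can rule a charset out, but it cannot confirm one: a single-byte charset such as ISO-8859-1 assigns a character to every byte value, so any input "decodes cleanly" in it.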

Mike Schilling · Nov 25, 2009

Albretch said:
OK, you have made me wonder about what to do when you don't know the
encoding of a file you got. As far as I know this is not taken care of
by Readers, even though some heuristics may be used.

Readers assume that what you tell them is true. (If you don't create
a Reader with an explicit charset, it uses the platform's default.)
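
A short sketch to see which charset a Reader picks when none is given. DefaultCharsetDemo is an invented name; note that getEncoding() may return a historical alias (for example "UTF8") rather than the canonical name.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;

final class DefaultCharsetDemo {
    /** Returns the encoding name an InputStreamReader reports when created without a charset. */
    static String encodingUsedWithoutCharset() throws IOException {
        try (InputStreamReader r =
                     new InputStreamReader(new ByteArrayInputStream(new byte[0]))) {
            return r.getEncoding();
        }
    }
}
```

Comparing the result with java.nio.charset.Charset.defaultCharset() on different machines shows why code that relies on the default can behave differently from one platform to the next.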

Roedy Green · Nov 25, 2009

OK, you have made me wonder about what to do when you don't know the
encoding of a file you got. As far as I know this is not taken care of
by Readers, even though some heuristics may be used.

see http://mindprod.com/applet/encodingrecogniser.html

http://mindprod.com/project/encodingidentification.html

Arne Vajhøj · Nov 25, 2009

Albretch said:
OK, you have made me wonder about what to do when you don't know the
encoding of a file you got. As far as I know this is not taken care of
by Readers, even though some heuristics may be used.

So, what do you do in those situations?

Ask for a specification.

The same sequence of bytes can be several different sequences of
chars depending on encoding.

A specification is necessary.

Arne
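
The point about one byte sequence mapping to several char sequences can be illustrated with a small sketch that decodes the same bytes under a list of charset names. SameBytes and decodeAs are hypothetical names, and the charset names passed in must be supported by the running JVM.

```java
import java.nio.charset.Charset;
import java.util.LinkedHashMap;
import java.util.Map;

final class SameBytes {
    /** Decodes data once per charset name and returns the results in call order. */
    static Map<String, String> decodeAs(byte[] data, String... charsetNames) {
        Map<String, String> result = new LinkedHashMap<>();
        for (String name : charsetNames) {
            result.put(name, new String(data, Charset.forName(name)));
        }
        return result;
    }
}
```

Feeding it the KOI8-R bytes of a Russian word with "KOI8-R", "windows-1251" and "ISO-8859-1" yields three different strings, none of which raises an error, so only outside knowledge of the encoding says which one is right.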
