converting from one charset encoding to another ...


Albretch Mueller

Some time ago I coded some methods for charset re-encoding. Say you get
files in Cyrillic, “KOI8-R”, and you want them as UTF-8.

What I did was basically open an InputStreamReader(FileInputStream
FIS, String aEncoding1) and an OutputStreamWriter(FOS, “UTF-8”), then
call InputStreamReader.read(char[] chrBffr) and
OutputStreamWriter.write(chrBffr, 0, iRdByts) in a while loop until it
hit EOF.
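
The loop described above could be sketched roughly as follows. This is only an illustration, not the poster's actual code: the class and method names (`StreamRecoder`, `recode`) and the buffer size are assumptions. Note that `Reader.read(char[])` returns a count of chars, not bytes.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.OutputStream;
import java.io.OutputStreamWriter;
import java.io.Reader;
import java.io.Writer;
import java.nio.charset.Charset;

// Hypothetical helper: names and buffer size are illustrative assumptions.
final class StreamRecoder {
    // Decodes bytes from 'in' using 'from' and re-encodes them to 'out' using 'to'.
    static void recode(InputStream in, Charset from, OutputStream out, Charset to)
            throws IOException {
        try (Reader reader = new InputStreamReader(in, from);
             Writer writer = new OutputStreamWriter(out, to)) {
            char[] buffer = new char[8192];
            int charsRead;
            // read() returns -1 at end of stream
            while ((charsRead = reader.read(buffer)) != -1) {
                writer.write(buffer, 0, charsRead);
            }
        }
    }
}
```

For the KOI8-R case it would be called with `Charset.forName("KOI8-R")` as the source charset and `StandardCharsets.UTF_8` as the target, assuming the runtime supports KOI8-R.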

That works just fine, yet I wonder if there are better/faster ways to
do that using channels/memory-mapped files.
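
For comparison, a channel-based variant might look like the sketch below. It maps the source file into memory, decodes it with `Charset.decode`, and writes the re-encoded bytes through a `FileChannel`. The names (`MappedRecoder`, `recode`) are assumptions, the whole decoded text is held in memory at once, and whether this is actually faster than the stream loop is not established here; it would have to be measured.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.Charset;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Hypothetical helper: a sketch for files small enough to decode in one piece.
final class MappedRecoder {
    static void recode(Path source, Charset from, Path target, Charset to)
            throws IOException {
        try (FileChannel in = FileChannel.open(source, StandardOpenOption.READ);
             FileChannel out = FileChannel.open(target,
                     StandardOpenOption.CREATE,
                     StandardOpenOption.WRITE,
                     StandardOpenOption.TRUNCATE_EXISTING)) {
            // Map the whole source file read-only.
            MappedByteBuffer mapped = in.map(FileChannel.MapMode.READ_ONLY, 0, in.size());
            // Charset.decode/encode replace malformed input instead of failing.
            CharBuffer chars = from.decode(mapped);
            ByteBuffer encoded = to.encode(chars);
            while (encoded.hasRemaining()) {
                out.write(encoded);
            }
        }
    }
}
```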

Also, where can you get actual files in different encodings
to test these methods?

Thanks
lbrtchx
{comp.lang.java.programmer}
 

Mike Schilling

Albretch said:
Some time ago I coded some methods for charset re-encoding. Say you get
files in Cyrillic, “KOI8-R”, and you want them as UTF-8.

What I did was basically open an InputStreamReader(FileInputStream
FIS, String aEncoding1) and an OutputStreamWriter(FOS, “UTF-8”), then
call InputStreamReader.read(char[] chrBffr) and
OutputStreamWriter.write(chrBffr, 0, iRdByts) in a while loop until it
hit EOF.

That works just fine, yet I wonder if there are better/faster ways to
do that using channels/memory-mapped files.

Also, where can you get actual files in different encodings
to test these methods?

You can create them easily enough with a FileOutputStream wrapped in an
OutputStreamWriter of the desired encoding.
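
One way to produce such test files is to wrap a file output stream in an `OutputStreamWriter` created with the desired charset. The sketch below is an assumption about how that might be done; `SampleFiles` and `write` are made-up names.

```java
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.nio.charset.Charset;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper for generating test files in a chosen encoding.
final class SampleFiles {
    // Writes 'text' to 'file', encoding it with 'charset'.
    static void write(Path file, String text, Charset charset) throws IOException {
        try (Writer writer = new OutputStreamWriter(Files.newOutputStream(file), charset)) {
            writer.write(text);
        }
    }
}
```

A call such as `SampleFiles.write(Path.of("sample-koi8r.txt"), "Привет", Charset.forName("KOI8-R"))` would then give a Cyrillic test file, assuming the runtime supports KOI8-R (`Charset.isSupported("KOI8-R")` can confirm this).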
 

Albretch Mueller

Albretch said:
Some time ago I coded some methods for charset re-encoding. Say you get
files in Cyrillic, “KOI8-R”, and you want them as UTF-8.

What I did was basically open an InputStreamReader(FileInputStream
FIS, String aEncoding1) and an OutputStreamWriter(FOS, “UTF-8”), then
call InputStreamReader.read(char[] chrBffr) and
OutputStreamWriter.write(chrBffr, 0, iRdByts) in a while loop until it
hit EOF.

That works just fine, yet I wonder if there are better/faster ways to
do that using channels/memory-mapped files.

Also, where can you get actual files in different encodings
to test these methods?

You can create them easily enough with a FileOutputStream wrapped in an
OutputStreamWriter of the desired encoding.
~
After checking the API I don't see what the difference would be
between a plain reader and a FileOutputStream. What is it?

Thank you
lbrtchx
 

Lew

Albretch said:
After checking the API I don't see what the difference would be
between a plain reader and a FileOutputStream. What is it?

I'll assume you either meant a "plain writer" or a 'FileInputStream', but the
question remains what you mean by a "plain reader/writer".

'Reader's and 'Writer's deal with encoded 'char's. Streams deal with raw bytes.
 

Mike Schilling

Albretch said:
~
After checking the API I don't see what the difference would be
between a plain reader and a FileOutputStream. What is it?

A Writer converts from characters (Unicode) to whatever encoding it
was created with. An OutputStream just outputs bytes, with no
conversion being done.
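
To make the distinction concrete, here is a small sketch that pushes the same text through Writers with different encodings and returns the bytes that reach the underlying stream. `WriterBytes` and `encode` are hypothetical names used only for this illustration.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.nio.charset.Charset;

// Hypothetical demo class: shows what bytes a Writer hands to its OutputStream.
final class WriterBytes {
    static byte[] encode(String text, Charset charset) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        // The Writer performs the char-to-byte conversion; the stream only stores bytes.
        try (Writer writer = new OutputStreamWriter(bytes, charset)) {
            writer.write(text);
        }
        return bytes.toByteArray();
    }
}
```

For example, the single character `"\u00e9"` (é) comes out as the one byte `E9` with ISO-8859-1 and as the two bytes `C3 A9` with UTF-8.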
 

Roedy Green

That works just fine, yet I wonder if there are better/faster ways to
do that using channels/memory mapped files

What I don't understand is this: NIO uses ordinary file I/O
underneath. So, unless you are doing something stupid with ordinary
file I/O, how would it be faster in a case where caching would not help?
 

Albretch Mueller

I'll assume you either meant a "plain writer" or a 'FileInputStream'
~
;-)
~
'Reader's and 'Writer's deal with encoded 'char's. Streams deal with raw bytes.
~
But once you write to a file, as I am doing, it all becomes a stream of
bytes anyway, until you eventually reopen the file using a Reader and
specify the charset used to interpret chunks of bytes as they are being
read into an array of chars. As specified by the API:
~
http://java.sun.com/javase/6/docs/api/java/lang/Character.html
~
"The Java 2 platform uses the UTF-16 representation in char arrays
and in the String and StringBuffer classes."
~
So I think there is nothing fancy about converting streams from one
charset to another, as long as your OS/Java supports both encodings; it
is by nature a serial process.
~
Thank you
lbrtchx
 

Lew

Albretch said:
But once you write to a file, as I am doing, it all becomes a stream of
bytes anyway, until you eventually reopen the file using a Reader and
specify the charset used to interpret chunks of bytes as they are being
read into an array of chars. As specified by the API:

The exact bytes written through a Writer depend on the encoding used. If you
use a Reader with a different encoding, you'll get garbage.
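
A short sketch of that effect, writing through a Writer in one encoding and reading back through a Reader in another. The names (`MismatchDemo`, `writeThenRead`) are made up, and in-memory streams stand in for a file.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.Reader;
import java.io.Writer;
import java.nio.charset.Charset;

// Hypothetical demo class: round-trips text through mismatched encodings.
final class MismatchDemo {
    static String writeThenRead(String text, Charset writeCharset, Charset readCharset)
            throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (Writer writer = new OutputStreamWriter(bytes, writeCharset)) {
            writer.write(text);
        }
        StringBuilder result = new StringBuilder();
        try (Reader reader = new InputStreamReader(
                new ByteArrayInputStream(bytes.toByteArray()), readCharset)) {
            int c;
            while ((c = reader.read()) != -1) {
                result.append((char) c);
            }
        }
        return result.toString();
    }
}
```

Writing `"\u00e9"` (é) as UTF-8 and reading it back as ISO-8859-1 yields the two characters `"\u00c3\u00a9"` (Ã©); using the same charset on both sides returns the original text.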
 

Albretch Mueller

The exact bytes written through a Writer depend on the encoding used.  If you
use a Reader with a different encoding, you'll get garbage.

OK, you have made me wonder what to do when you don't know the
encoding of a file you got. As far as I know, this is not taken care
of by Readers, even though some heuristics may be used.

So, what do you do in those situations?

Thank you
lbrtchx
 

Lew

Don't quote sigs.
OK, you have made me wonder what to do when you don't know the
encoding of a file you got. As far as I know, this is not taken care
of by Readers, even though some heuristics may be used.

So, what do you do in those situations?

The editor in Rational Software Architect, an IDE built on Eclipse, simply
reports that the file is not in the specified encoding. I haven't looked at
its source, but I guess it notices illegal code points. Other editors just
display the wrong thing.
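
How that editor does it is not known here, but in Java a similar check can be sketched with a `CharsetDecoder` configured to report errors rather than silently replace them. The names `EncodingCheck` and `isValid` are assumptions for illustration.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;

// Hypothetical helper: tests whether bytes decode cleanly in a given charset.
final class EncodingCheck {
    static boolean isValid(byte[] data, Charset charset) {
        CharsetDecoder decoder = charset.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            // Throws if any byte sequence is illegal for this charset.
            decoder.decode(ByteBuffer.wrap(data));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }
}
```

This only rules encodings out. A single-byte charset such as ISO-8859-1 accepts every byte sequence, so passing the check does not prove the file was really written in that encoding.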
 

Mike Schilling

Albretch said:
OK, you have made me wonder what to do when you don't know the
encoding of a file you got. As far as I know, this is not taken care
of by Readers, even though some heuristics may be used.

Readers assume that what you tell them is true. (If you don't create
a Reader with an explicit charset, it uses the platform's default.)
 

Arne Vajhøj

Albretch said:
OK, you have made me wonder what to do when you don't know the
encoding of a file you got. As far as I know, this is not taken care
of by Readers, even though some heuristics may be used.

So, what do you do in those situations?

Ask for a specification.

The same sequence of bytes can be several different sequences of
chars depending on encoding.

A specification is necessary.

Arne
 
