Problem with getting correct data out of buffer reading from channel

  • Thread starter nooneinparticular314159

nooneinparticular314159

I'm reading data from a socket channel in a network program. To test
my code, I'm telnetting to the program and typing in some data, which
I then try to read from the buffer, view as a CharBuffer, and write to
standard out. Unfortunately, what I type in are English letters and
numbers, and what I get out seems to be Unicode Chinese! Here's what
I'm doing:

try {
    // Read available data into the buffer
    NumberOfBytesReadFromChannel = Channel.read(ReceiveBuffer);
}

ReceiveBuffer.flip(); // Flip the buffer so it can be read

// Read the new data out of the buffer and add it to IncomingDataString,
// which stores unprocessed incoming data
IncomingMessageBuffer = ReceiveBuffer.asCharBuffer();
if (IncomingDataString == null) {
    IncomingDataString = IncomingMessageBuffer.toString();
} else {
    IncomingDataString = IncomingDataString + IncomingMessageBuffer.toString();
}

//*************************
System.out.println("String received was: " + IncomingDataString);

ReceiveBuffer.clear();

(Not shown: the IOException catch statement)

What I get are a series of strings that look like:
String received was: 摧æ æ‘¦à´Šæœæ æ 

So somehow, I seem to be reading the data incorrectly, even though I
am receiving it. Any idea what I'm doing wrong here?

Thanks!
 
Roedy Green

On Sun, 19 Jul 2009 17:52:24 -0700 (PDT),
ReceiveBuffer.asCharBuffer();

This implies you have 16-bit Unicode, not UTF-8 or some other 8-bit
encoding.
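
To make the point concrete, here is a minimal sketch of the difference between viewing bytes as 16-bit chars and decoding them with an 8-bit charset. The class and method names are made up for illustration, and the input "dgdf" is only an assumed stand-in for what was typed into telnet.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

public class CharBufferViewDemo {
    // Pairs up the bytes into big-endian 16-bit chars, as asCharBuffer() does
    // on a buffer with the default byte order.
    public static String viewAsChars(byte[] data) {
        return ByteBuffer.wrap(data).asCharBuffer().toString();
    }

    // Decodes one character per byte using US-ASCII.
    public static String decodeAsAscii(byte[] data) {
        return StandardCharsets.US_ASCII.decode(ByteBuffer.wrap(data)).toString();
    }

    public static void main(String[] args) {
        // Hypothetical input: four ASCII bytes 0x64 0x67 0x64 0x66
        byte[] typed = "dgdf".getBytes(StandardCharsets.US_ASCII);
        System.out.println(viewAsChars(typed));   // two CJK characters, U+6467 U+6466
        System.out.println(decodeAsAscii(typed)); // dgdf
    }
}
```

Each pair of ASCII bytes lands in the CJK range when read as a single 16-bit char, which is consistent with the "Chinese" output described in the question.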
--
Roedy Green Canadian Mind Products
http://mindprod.com

"The industrial civilisation is based on the consumption of energy resources that are inherently limited in quantity, and that are about to become scarce. When they do, competition for what remains will trigger dramatic economic and geopolitical events; in the end, it may be impossible for even a single nation to sustain industrialism as we have known it in the twentieth century."
~ Richard Heinberg, The Party’s Over: Oil, War, and the Fate of Industrial Societies
 
nooneinparticular314159

Hmm...telnet should just be sending raw ASCII. Is there a way to
force Java not to use Unicode?

Thanks!
 
nooneinparticular314159

Ok, Richard. Looks like you were right! I created a CharsetDecoder
for ISO-8859-1, and use that to decode my buffer, and what I get out
is the text I transmitted using telnet!
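
The poster's actual code is not shown, so the following is only a sketch of what decoding with a CharsetDecoder for ISO-8859-1 might look like. The class name, the method name, and the bytes standing in for the channel read are all assumptions.

```java
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;

public class Latin1DecodeDemo {
    // Decodes the readable part of an already flipped buffer as ISO-8859-1.
    public static String decodeLatin1(ByteBuffer flipped) {
        CharsetDecoder decoder = Charset.forName("ISO-8859-1").newDecoder();
        try {
            CharBuffer chars = decoder.decode(flipped);
            return chars.toString();
        } catch (CharacterCodingException e) {
            // Not expected for ISO-8859-1, where every byte maps to a character.
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        ByteBuffer receiveBuffer = ByteBuffer.allocate(64);
        // Stand-in for Channel.read(receiveBuffer)
        receiveBuffer.put(new byte[] {'h', 'i', '\r', '\n'});
        receiveBuffer.flip();   // switch from writing to reading
        System.out.println("String received was: " + decodeLatin1(receiveBuffer).trim());
        receiveBuffer.clear();  // ready for the next read
    }
}
```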

So my remaining questions are: Let's say that I write a program in
Java to transmit some data to the program above. If I don't
explicitly change the encoding, will the encoding be correct, since it
will be using whatever Java natively uses?

Also, let's say that I want to get some data from something other than
my own Java programs. Is there a way to detect the encoding that they
are using, so that I can work with any program that transmits to my
program? Or do I have to just know what they are using?

Thanks,
Michael
 
Roedy Green

Hmm...telnet should just be sending raw ASCII. Is there a way to
force Java not to use Unicode?

You could process raw bytes with nio or with an InputStream.
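
As a rough illustration of the InputStream route, the sketch below reads bytes one at a time and prints them as hex, with no character decoding involved. The class and method names are hypothetical, and a ByteArrayInputStream stands in for the socket's stream.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;

public class RawBytesDemo {
    // Reads the stream to its end and renders every byte as two hex digits.
    public static String hexDump(InputStream in) {
        StringBuilder sb = new StringBuilder();
        try {
            int b;
            while ((b = in.read()) != -1) {   // read() yields 0..255, or -1 at end of stream
                sb.append(String.format("%02x ", b));
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sb.toString().trim();
    }

    public static void main(String[] args) {
        // Stand-in for socket.getInputStream()
        InputStream in = new ByteArrayInputStream(new byte[] {'h', 'i', '\r', '\n'});
        System.out.println(hexDump(in));
    }
}
```

Looking at the raw bytes this way is a quick check on what telnet really sends before any encoding is chosen.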
 
Roedy Green

So my remaining questions are: Let's say that I write a program in
Java to transmit some data to the program above. If I don't
explicitly change the encoding, will the encoding be correct, since it
will be using whatever Java natively uses?

The answer is ugly. See http://mindprod.com/jgloss/encoding.html

The answer is, in general, data are not tagged with the encoding.
I assume this is a result of the male propensity to surround himself
with dirty coffee cups and empty pizza boxes. I can't imagine Martha
Stewart as computer programmer putting up with such a slovenly state
of affairs.

HTTP has some encoding headers, and some ways to request your
preferred encodings.

I think the way out is gradually to discard all encodings except
UTF-8.

I wrote a little utility to help you guess what encoding was used.
See http://mindprod.com/applet/officialencoding.html

Basically the receiver is just supposed to "know" the encoding.
This might have been reasonable in 1960 when every datacentre had its
own private encoding and everyone used it, and people rarely exchanged
data with the outside world. But today, with the international sharing
on the Internet, it is crazy.

 
markspace

nooneinparticular314159 said:
for ISO-8859-1, and use that to decode my buffer, and what I get out


As others have mentioned, this probably should be "US-ASCII" for telnet.

So my remaining questions are: Let's say that I write a program in
Java to transmit some data to the program above. If I don't
explicitly change the encoding, will the encoding be correct, since it
will be using whatever Java natively uses?


No, Java IO "natively" uses the platform default encoding, which can
differ from platform to platform. Internally, all characters are encoded
the same way in Java (UTF-16, as long as you work with char, not byte),
but almost all IO converts Java's internal encoding to the platform default.

It's possible to write Java's internal raw characters to a stream, but
you have to be careful to do it correctly or Java will translate
(encode) that character. It's easier just to specify an encoding, imo.
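
A small sketch of "just specify an encoding" on the sending side, assuming an OutputStreamWriter wrapped around the output stream. The class and method names are made up, and a ByteArrayOutputStream stands in for the socket.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class ExplicitEncodingDemo {
    // Writes the text through a Writer that uses the given charset and
    // returns the bytes that would go out on the wire.
    public static byte[] encode(String text, Charset charset) {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        try (Writer out = new OutputStreamWriter(sink, charset)) {
            out.write(text);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sink.toByteArray();
    }

    public static void main(String[] args) {
        String text = "\u00a35"; // pound sign followed by the digit 5
        System.out.println("Platform default: " + Charset.defaultCharset());
        System.out.println("ISO-8859-1 bytes: " + encode(text, StandardCharsets.ISO_8859_1).length);
        System.out.println("UTF-8 bytes:      " + encode(text, StandardCharsets.UTF_8).length);
    }
}
```

The same two characters come out as a different number of bytes depending on the charset, which is why sender and receiver need to agree on one explicitly.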

Is there a way to detect the encoding that they
are using, so that I can work with any program that transmits to my
program? Or do I have to just know what they are using?

There's no way to figure it out; you have to "just know". That's for
any language or IO, not just Java. Very few IO operations specify an
encoding or how to obtain one. Two big exceptions which DO specify an
encoding are HTTP and XML, which is one reason why they're so popular.
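
For HTTP, the encoding usually travels in the charset parameter of the Content-Type header. The following is a simplified sketch of pulling it out of a header value. The class and method names are hypothetical, and it does not cover every quoting rule in the HTTP grammar.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Locale;

public class ContentTypeCharsetDemo {
    // Returns the charset named in a Content-Type value, or the fallback
    // when no charset parameter is present or the name is not usable.
    public static Charset charsetOf(String contentType, Charset fallback) {
        for (String part : contentType.split(";")) {
            String p = part.trim();
            if (p.toLowerCase(Locale.ROOT).startsWith("charset=")) {
                String name = p.substring("charset=".length()).trim().replace("\"", "");
                try {
                    return Charset.forName(name);
                } catch (IllegalArgumentException e) {
                    // Covers both illegal and unsupported charset names.
                    return fallback;
                }
            }
        }
        return fallback;
    }

    public static void main(String[] args) {
        System.out.println(charsetOf("text/html; charset=ISO-8859-1", StandardCharsets.UTF_8));
        System.out.println(charsetOf("text/plain", StandardCharsets.US_ASCII));
    }
}
```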
 
Lew

Roedy said:
The answer is, in general, data are not tagged with the encoding.
I assume this is a result of the male propensity to surround himself
with dirty coffee cups and empty pizza boxes. I can't imagine Martha
Stewart as computer programmer putting up with such a slovenly state
of affairs.

What a profoundly bigoted and sexist thing to say.
 
Arne Vajhøj

Lew said:
What a profoundly bigoted and sexist thing to say.

The question is whether it is derogatory against women
or computer programmers.

:)

From a more serious perspective, the stereotypical computer
programmer described is not very common. Software engineers
are professionals like other engineers: they follow office
dress codes, try to work normal hours when possible, eat healthily
on doctor's orders, have a wife and a house, etc.

Arne
 
RedGrittyBrick

markspace said:
As others have mentioned, this probably should be "US-ASCII" for telnet.

My first reaction was that US-ASCII is something of an oxymoron, but I
do recall that there were some early 7-bit character-sets that were
national variations of ASCII (ANSI X-3.4 1968, ISO 646). Was there a UK
version with a £ in place of the $? I can't find any references to this.

I find telnet is mostly character-set agnostic. I can switch character
sets without the telnet protocol needing to know what I am doing.
 
RedGrittyBrick

RedGrittyBrick said:
My first reaction was that US-ASCII is something of an oxymoron, but I
do recall that there were some early 7-bit character-sets that were
national variations of ASCII (ANSI X-3.4 1968, ISO 646). Was there a UK
version with a £ in place of the $? I can't find any references to this.

Not quite right, http://rabbit.eng.miami.edu/info/ascii.html#BS4730 says
£ in place of #.

Since these non-American variants of ASCII have their own names I still
think the US- prefix is, at least, somewhat redundant for ASCII.

There were significant differences between the 1963, 1965 and 1967 ASCII
standards which might be more important to highlight than the US-ness of
the A in ASCII.

I'll shut up now :)
 
Arne Vajhøj

RedGrittyBrick said:
Not quite right, http://rabbit.eng.miami.edu/info/ascii.html#BS4730 says
£ in place of #.

Since these non-American variants of ASCII have their own names I still
think the US- prefix is, at least, somewhat redundant for ASCII.

There were significant differences between the 1963, 1965 and 1967 ASCII
standards which might be more important to highlight than the US-ness of
the A in ASCII.

http://www.iana.org/assignments/character-sets

says:

Name: ANSI_X3.4-1968 [RFC1345,KXS2]
MIBenum: 3
Source: ECMA registry
Alias: iso-ir-6
Alias: ANSI_X3.4-1986
Alias: ISO_646.irv:1991
Alias: ASCII
Alias: ISO646-US
Alias: US-ASCII (preferred MIME name)
Alias: us
Alias: IBM367
Alias: cp367
Alias: csASCII

US-ASCII is listed even claiming to be "preferred MIME name" !
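
Java follows the same convention: "US-ASCII" is the canonical name of the charset every Java platform must support. Here is a quick sketch for checking how a name resolves. The class and method names are made up, and which aliases are accepted beyond the canonical name depends on the Java implementation.

```java
import java.nio.charset.Charset;

public class AsciiAliasDemo {
    // Resolves a charset name or alias to the canonical name Java reports.
    public static String canonicalName(String nameOrAlias) {
        return Charset.forName(nameOrAlias).name();
    }

    public static void main(String[] args) {
        System.out.println(canonicalName("US-ASCII"));
        // The alias set is implementation-specific, so print it rather than assume it.
        System.out.println(Charset.forName("US-ASCII").aliases());
        // Guarded lookup: "ASCII" is an assumed alias, not one the spec guarantees.
        if (Charset.isSupported("ASCII")) {
            System.out.println(canonicalName("ASCII"));
        }
    }
}
```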

Arne
 
RedGrittyBrick

Arne said:
RedGrittyBrick said:
Not quite right, http://rabbit.eng.miami.edu/info/ascii.html#BS4730
says £ in place of #.

Since these non-American variants of ASCII have their own names I
still think the US- prefix is, at least, somewhat redundant for ASCII.

There were significant differences between the 1963, 1965 and 1967
ASCII standards which might be more important to highlight than the
US-ness of the A in ASCII.

http://www.iana.org/assignments/character-sets

says:

Name: ANSI_X3.4-1968 [RFC1345,KXS2]
MIBenum: 3
Source: ECMA registry
Alias: iso-ir-6
Alias: ANSI_X3.4-1986
Alias: ISO_646.irv:1991
Alias: ASCII
Alias: ISO646-US
Alias: US-ASCII (preferred MIME name)
Alias: us
Alias: IBM367
Alias: cp367
Alias: csASCII

US-ASCII is listed even claiming to be "preferred MIME name" !

Oh well, If IANA say so, though I think the US prefix is about as
redundant as one of the numbers in "PIN number".
 
Lew

RedGrittyBrick said:
Oh well, If IANA say so, though I think the US prefix is about as
redundant as one of the numbers in "PIN number".

Not at all. That's no more redundant than "D-Day".
 
RedGrittyBrick

Lew said:
Not at all. That's no more redundant than "D-Day".

:) I'm beginning to regret starting this.

AIUI "D-Day" is a military term often used to *name* a specific day in a
military operation. The most famous D-Day (in my locale anyway) is June
6th 1944. Saying "D-Day" is like saying "Day X" and not like saying
"Day". So "D-Day" does not mean the exact same thing as "Day". D-Day is
part of a family of similarly named but distinct days such as VE-Day.
The prefixes are needed to distinguish amongst those days.

By contrast my ATM card has a Personal Identification Number. As this is
usually abbreviated to PIN, I could say my ATM card has a PIN. A PIN
number would be a Personal Identification Number number. Perhaps we
could abbreviate that to PINN and start talking about PINN numbers?

Unlike with "D-Day" and "Day", when people say "PIN number", they would
lose no information by saying "PIN" instead.

Therefore, it seems to me, the "number" in "PIN number" is *much* more
redundant than the "D-" in "D-Day". One is, the other isn't.
 
Arne Vajhøj

RedGrittyBrick said:
Arne said:
RedGrittyBrick said:
RedGrittyBrick wrote:
markspace wrote:
As others have mentioned, this probably should be "US-ASCII" for
telnet.

My first reaction was that US-ASCII is something of an oxymoron, but
I do recall that there were some early 7-bit character-sets that
were national variations of ASCII (ANSI X-3.4 1968, ISO 646). Was
there a UK version with a £ in place of the $? I can't find any
references to this.

Not quite right, http://rabbit.eng.miami.edu/info/ascii.html#BS4730
says £ in place of #.

Since these non-American variants of ASCII have their own names I
still think the US- prefix is, at least, somewhat redundant for ASCII.

There were significant differences between the 1963, 1965 and 1967
ASCII standards which might be more important to highlight than the
US-ness of the A in ASCII.

http://www.iana.org/assignments/character-sets

says:

Name: ANSI_X3.4-1968 [RFC1345,KXS2]
MIBenum: 3
Source: ECMA registry
Alias: iso-ir-6
Alias: ANSI_X3.4-1986
Alias: ISO_646.irv:1991
Alias: ASCII
Alias: ISO646-US
Alias: US-ASCII (preferred MIME name)
Alias: us
Alias: IBM367
Alias: cp367
Alias: csASCII

US-ASCII is listed even claiming to be "preferred MIME name" !

Oh well, If IANA say so, though I think the US prefix is about as
redundant as one of the numbers in "PIN number".

I would agree, but if the choice is between following the standard
and doing what the standard should have been, then one should follow
the standard.

Arne
 
Lew

RedGrittyBrick said:
:) I'm beginning to regret starting this.

AIUI "D-Day" is a military term often used to *name* a specific day in a
military operation. The most famous D-Day (in my locale anyway) is June
6th 1944. Saying "D-Day" is like saying "Day X" and not like saying
"Day". So "D-Day" does not mean the exact same thing as "Day". D-Day is
part of a family of similarly named but distinct days such as VE-Day.
The prefixes are needed to distinguish amongst those days.

Actually, in (at least U.S.) military parlance, "D-Day" is part of a family of
terms like "H-Hour" - the first "D" is the specific day of a particular
operation (e.g., the landing at Normandy Beach). It stands for "Day", as in
"the Day of the operation", just as "H" in "H-Hour" stands for "Hour".

There are as many "D-Day"s as there are military operations that have a
scheduled day.

By contrast my ATM card has a Personal Identification Number. As this is
usually abbreviated to PIN, I could say my ATM card has a PIN. A PIN
number would be a Personal Identification Number number. Perhaps we
could abbreviate that to PINN and start talking about PINN numbers?

Unlike with "D-Day" and "Day", when people say "PIN number", they would
lose no information by saying "PIN" instead.

Actually, they would. The "N" in "PIN" is generic like the first "D" in
"D-Day", and it means generally the personal identification number, and "PIN
number" is the particular person's identifying personal identification number.

Therefore, it seems to me, the "number" in "PIN number" is *much* more
redundant than the "D-" in "D-Day". One is, the other isn't.

No more so than saying "machine" in "ATM machine". You have to say "ATM
machine" so you know you aren't speaking of the "ATM card" or the "ATM machine
PIN number".
 
Joshua Cranmer

Lew said:
No more so than saying "machine" in "ATM machine". You have to say "ATM
machine" so you know you aren't speaking of the "ATM card" or the "ATM
machine PIN number".

You kitten mass-murderer! ;-)
 
