Zero Byte Terminated Strings

PurpleServerMonkey

Hi,

I'm writing a simple UDP server in Java. It's designed to take an
initial request packet from a C based client and perform further
actions. The networking side of things is fine; however, I'm having
problems dealing with a zero byte terminated string being sent from
the client.

Code Snippet:
byte[] data = new byte[1000];
DatagramSocket serverSocket = new DatagramSocket(1025);
DatagramPacket packet = new DatagramPacket(data, data.length);
serverSocket.receive(packet);

The received packet then gets put onto a queue for pickup by a thread
pool. It's in the thread pool that I process the packet and extract
the string information (which represents a filename, mode, etc.).
Note that the strings in this packet are zero byte terminated.

Code Snippet:
byte[] payload = new byte[1000];
payload = packet.getData();

What I'd like to know is, what is the best way to retrieve zero byte
terminated strings from the byte array?

Thanks in advance for your assistance.
 
Knute Johnson

PurpleServerMonkey said:
What I'd like to know is, what is the best way to retrieve zero byte
terminated strings from the byte array?

Actually very easy to do. Just create a String from your byte[] buffer
and split it on the 0s.

public class test {
    public static void main(String[] args) throws Exception {
        // "THIS", "IS", "A", "TEST", each followed by a zero byte
        byte[] buf = { 0x54, 0x48, 0x49, 0x53, 0x00, 0x49, 0x53, 0x00,
                       0x41, 0x00, 0x54, 0x45, 0x53, 0x54, 0x00 };

        // Decodes using the platform default charset
        String str = new String(buf);
        String[] arr = str.split("\u0000");

        for (int i = 0; i < arr.length; i++)
            System.out.println(arr[i]);
    }
}
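
For the packet in the original question, the same split idea might look like the sketch below. It decodes only the bytes that were actually received (packet.getOffset() and packet.getLength()) rather than the whole 1000-byte buffer, and it assumes the client sends US-ASCII, which is an assumption to confirm with whoever owns the C side. The class name, the fields() helper and the sample field values are made up for illustration.

```java
import java.net.DatagramPacket;
import java.nio.charset.StandardCharsets;

public class SplitPacket {
    // Decode only the received part of the buffer, then split on the NUL character.
    // US-ASCII is an assumption about what the client sends.
    static String[] fields(DatagramPacket packet) {
        String text = new String(packet.getData(), packet.getOffset(),
                                 packet.getLength(), StandardCharsets.US_ASCII);
        return text.split("\u0000");
    }

    public static void main(String[] args) {
        // Simulate a received packet without touching the network:
        // a 1000-byte buffer of which only the first few bytes are valid.
        byte[] buf = new byte[1000];
        byte[] sample = "report.txt\u0000octet\u0000".getBytes(StandardCharsets.US_ASCII);
        System.arraycopy(sample, 0, buf, 0, sample.length);
        DatagramPacket packet = new DatagramPacket(buf, sample.length);

        for (String field : fields(packet)) {
            System.out.println(field);
        }
    }
}
```

Note that String.split() drops trailing empty strings, so an empty last field would not show up in the result.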
 
Adam Maass

Knute Johnson said:
Actually very easy to do. Just create a String from your byte[] buffer
and split it on the 0s.

public class test {
public static void main (String[] args) throws Exception {
byte[] buf = { 0x54, 0x48, 0x49, 0x53, 0x00, 0x49, 0x53, 0x00,
0x41, 0x00, 0x54, 0x45, 0x53, 0x54, 0x00 };

String str = new String(buf);

Ahem, it will be critically important to specify the encoding to the String
constructor!

String str = new String(buf, "ASCII");
String[] arr = str.split("\u0000");

for (int i=0; i<arr.length; i++)
System.out.println(arr[i]);
}
}
 
Chris Uppal

PurpleServerMonkey said:
What I'd like to know is, what is the best way to retrieve zero byte
terminated strings from the byte array?

There is no easy way to do it. That's to say, the /code/ will be trivially
simple once you know what you have to do, but finding out what you have to do
will be tricky unless the C programmers who generate the input are unusually
knowledgeable.

There is no equivalence between character data and binary data, so one is
always turned into the other by using some character encoding or other (often
called a "charset" or a "code page"). In Java, when you convert bytes to text
(or vice versa) you /always/ have to tell the system what character encoding to
use. (There are some "convenience" methods which use a system-default code
page, but you should avoid those in most circumstances, and you should
/definitely/ avoid them in this case).

So how do you find out what character set has been used by the C programmers?
The first thing to do is to ask them. The chances are fairly good that they'll
have no idea what you are talking about. If so, then presumably they haven't
taken any steps at all to /control/ what code page is being used, and it will
be either:
    some system default, if they are generating the text themselves
or
    whatever character set the /real/ source of the data used.

If they are generating the data themselves, then you can probably get a decent
guess as to what character set they are using by running the following little
Java program on the machine where they compile their stuff.

public class Main
{
    public static void main(String[] args)
    {
        System.out.println(
            "file.encoding: " + System.getProperty("file.encoding"));
    }
}

That will tell you what character set Java thinks is most likely to be a
sensible default for that machine, and it /may/ be correct. On my system
today, that name is "Cp1252" (which cognoscenti will recognise as meaning I
have a Windows box set up to use an English/Western European character set by
default).

If you can't find any sensible information, then it's probably a good idea to
assume that the data is pure ASCII -- which is a 7-bit encoding which
(therefore) only defines 128 characters, but those 128 characters are common to
all (as far as I know) encodings that your UDP packets are likely to be using.
To use that character encoding use an encoding name of "US-ASCII".

Once you have decided what character set is in use, actually decoding it is
trivial. Just find the start of the text data in your byte[] buffer (which you
must already know how to do), loop down the buffer looking for the terminating
byte which has value 0 (but see below), and then pass the resulting data into
the String constructor:
String(byte[] bytes, int offset, int length, String charsetName)
or, if you prefer:
String(byte[] bytes, int offset, int length, Charset charset)
which will do the conversion for you.
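
A minimal sketch of that loop, using the Charset overload of the constructor. The class and method names are invented for illustration, and it assumes the packet body is nothing but consecutive zero-terminated strings starting at the given offset. The charset is passed in explicitly; US-ASCII in main() is only a stand-in for whatever you find out from the C side.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

public class NulStrings {
    // Reads consecutive zero-terminated strings from buf[offset .. offset+length).
    static List<String> read(byte[] buf, int offset, int length, Charset charset) {
        List<String> result = new ArrayList<>();
        int end = offset + length;
        int start = offset;
        for (int i = offset; i < end; i++) {
            if (buf[i] == 0) {
                // Decode the bytes between the previous terminator and this one.
                result.add(new String(buf, start, i - start, charset));
                start = i + 1;
            }
        }
        // Any bytes after the last terminator are ignored in this sketch.
        return result;
    }

    public static void main(String[] args) {
        byte[] buf = { 0x54, 0x48, 0x49, 0x53, 0x00, 0x49, 0x53, 0x00,
                       0x41, 0x00, 0x54, 0x45, 0x53, 0x54, 0x00 };
        for (String s : read(buf, 0, buf.length, StandardCharsets.US_ASCII)) {
            System.out.println(s);
        }
    }
}
```

With a real DatagramPacket you would pass packet.getData(), packet.getOffset() and packet.getLength(), so that stale bytes beyond the received length are never scanned.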

(The potential gotcha about looking for the value 0 is that it assumes that the
data is encoded using a byte-oriented encoding like "ISO-8859-1", "UTF-8", or
"Cp1252", where a zero byte can only ever be the NUL character, rather than a
16-bit encoding like "UTF-16", where zero bytes turn up inside ordinary
characters -- but that seems a safe bet, or even C programmers would know that
there was a potential problem and warn you about it.)

If you can, I'd advise getting the C people to send a packet containing /all/
the 255 possible non-zero byte values, and then compare what you decode it as
with what they expect it to look like. Needless to say, you'll have to be
careful about character encoding issues when you do the comparison...
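
As a rough first check on such a probe packet, something like the following sketch could be run on the Java side. The names are hypothetical. It only tests whether a candidate charset can represent every byte without loss (decode, re-encode, compare); a charset that passes is not thereby proven to be the right one, so the by-eye comparison with what the C people expect is still needed.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class ProbeCheck {
    // All 255 non-zero byte values, 0x01 through 0xFF, in order.
    static byte[] probe() {
        byte[] p = new byte[255];
        for (int i = 0; i < p.length; i++) {
            p[i] = (byte) (i + 1);
        }
        return p;
    }

    // True if decoding and then re-encoding gives back exactly the same bytes.
    static boolean roundTrips(byte[] data, Charset cs) {
        return Arrays.equals(data, new String(data, cs).getBytes(cs));
    }

    public static void main(String[] args) {
        byte[] p = probe();
        // US-ASCII cannot represent 0x80..0xFF, so those bytes are lost on decode.
        System.out.println("US-ASCII:   " + roundTrips(p, StandardCharsets.US_ASCII));
        // ISO-8859-1 maps every byte value to exactly one character.
        System.out.println("ISO-8859-1: " + roundTrips(p, StandardCharsets.ISO_8859_1));
    }
}
```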

-- chris
 
Knute Johnson

Adam said:
Ahem, it will be critically important to specify the encoding to the
String constructor!

Only if he doesn't want his system default character set. Mine
certainly doesn't default to ASCII, or as it is more correctly known
ANSI_X3.4-1968. What character set does your C compiler default to?
 
PurpleServerMonkey

Knute Johnson said:
Only if he doesn't want his system default character set. Mine
certainly doesn't default to ASCII, or as it is more correctly known
ANSI_X3.4-1968. What character set does your C compiler default to?


Thanks guys, that has worked a treat.

The client is an old C based application and is using ASCII encoding,
the above info has solved the problem and is working well.
 
Chris Uppal

Knute said:
Ahem, it will be critically important to specify the encoding to the
String constructor!
[..]
Only if he doesn't want his system default character set. Mine
certainly doesn't default to ASCII, or as it is more correctly known
ANSI_X3.4-1968. What character set does your C compiler default to?

But using the Java system default charset is almost always going to be a bad
mistake in this situation. Or do you have a good reason to believe that the
default charset of the C compiler installation where the code which generates
the UDP packets was compiled will be the same[*] as the default Java charset
set on the system where the UDP packets are received?

([*] Note: that is "will be the same", not "is likely to be the same").

Using the default system charset for real data, in production code, is nothing
better than lazy and incompetent.

-- chris
 
Knute Johnson

Chris said:
Using the default system charset for real data, in production code, is nothing
better than lazy and incompetent.

You know I don't like being called lazy and incompetent this late in the
evening. The other fellow mentioned nothing about the character set he
was using. Picking one out of a hat is no better than using the system
default. Odds are pretty good that system defaults will be the same if
used on the same computer, albeit different compilers. Specifying the
wrong character set may very well cause it to not work at all. If he
said gee this doesn't work for my Chinese clients, they get a bunch of
?????? then you can deal with his character set problems. Or you can
force his Chinese clients to use ANSI_X3.4-1968 and they will get ??????
right off the bat.

It's late and this lazy incompetent is going to bed now.
 
Chris Uppal

Knute Johnson wrote:

[me:]
You know I don't like being called lazy and incompetent this late in the
evening.

You won't see this until tomorrow, and I suppose you'll like it even less then.
But I'm afraid that I'm going to stick by my comment, and if -- by
implication -- it applies to you, then that's unfortunate because I had meant
nothing personal, but I will also stand by the implications.

-- chris
 
Adam Maass

Thank you Chris for a thorough, thoughtful, and detailed response.

If you expect 0-byte terminated strings, you absolutely need to know the
character encoding in use; some of the more exotic encodings (to those of us
using Latin charsets) will contain 0-bytes that do not indicate the end of a
string. If you don't specify the charset and operate on a system that
defaults to one of these "exotic" encodings, then the String(byte[])
constructor will not do what you expect.

In short, when dealing with raw bytes that represent character data, you
need to know what encoding was used to generate the bytes.
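
To make the zero-byte point concrete, here is a small sketch (class and method names are mine) that counts zero bytes in the same four-letter string under three encodings. UTF-16LE stands in for the "exotic" case; the other two are the byte-oriented encodings mentioned earlier in the thread.

```java
import java.nio.charset.StandardCharsets;

public class ZeroBytes {
    // Counts how many bytes in the array have the value 0.
    static int countZeros(byte[] data) {
        int n = 0;
        for (byte b : data) {
            if (b == 0) {
                n++;
            }
        }
        return n;
    }

    public static void main(String[] args) {
        String name = "test";
        // No zero bytes: a zero can safely be used as a terminator.
        System.out.println(countZeros(name.getBytes(StandardCharsets.US_ASCII)));
        System.out.println(countZeros(name.getBytes(StandardCharsets.UTF_8)));
        // Every character here carries a zero high byte, so scanning for a
        // single zero byte would cut the string off after the first byte.
        System.out.println(countZeros(name.getBytes(StandardCharsets.UTF_16LE)));
    }
}
```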
 
Knute Johnson

Chris said:
You won't see this until tomorrow, and I suppose you'll like it even less then.
But I'm afraid that I'm going to stick by my comment, and if -- by
implication -- it applies to you, then that's unfortunate because I had meant
nothing personal, but I will also stand by the implications.

Computer programs are tools, just like any other tool. They have a cost
and a benefit. You can buy a rusty box-end wrench or you can buy a gold
plated spanner. They do the same job most of the time. To say that you
absolutely have to use the gold plated spanner and that you are lazy and
incompetent if you don't is just plain rude.

If the default character set wasn't adequate for his purposes he could
easily change it. As it turns out he was happy with the solution
provided and it worked just fine.

And now I'm going to take my lazy butt to town.
 
