UTF-8 in char*


Joona I Palaste

Jacky Cheung said:
I am developing a vCard application which have to support UTF-8. Does the
UTF-8 in char* will crash the strlen, I mean does UTF-8 have some char which
treat as NULL character in strlen?

AFAIK UTF-8 does not have NUL characters (most people prefer to spell it
that way to avoid confusion). UTF-8 only includes "normal" ASCII
characters and special characters with bit 7 set. You're in no more
danger of seeing NUL in UTF-8 than you are of seeing it in ASCII.
Note that vCards, by themselves, are completely off-topic here.
 

Hallvard B Furuseth

Jacky said:
I am developing a vCard application which have to support UTF-8. Does
the UTF-8 in char* will crash the strlen, I mean does UTF-8 have some
char which treat as NULL character in strlen?

Well, it has a null control character, but it means more or less the
same as the ASCII null character. So if you just want to handle normal
text, you can use normal C strings, and thus strlen().


BTW, if you have only written programs for ASCII before, you might note
that functions like getchar() return character values in the range of
'unsigned char' or EOF, while 'char' can be negative. So code like
    char buf[] = "<UTF-8 string>";
    int ch, i;
    ...
    while ((ch = getchar()) != EOF) {
        if (ch == buf[i]) ...
is wrong. (This is true even if you don't use UTF-8, though you may not have
noticed before.) You need to convert ch to char or buf[i] to unsigned char
before comparing the two.
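
A minimal sketch of the second fix (converting the stored char to unsigned char), assuming the byte came from getchar() or a similar function. The helper name byte_matches is hypothetical and used only for illustration:

```c
#include <stddef.h>

/* Hypothetical helper: returns 1 if ch (an unsigned char value held in an
   int, as getchar() returns it) equals the byte stored at buf[i]. */
int byte_matches(int ch, const char *buf, size_t i)
{
    /* Converting to unsigned char puts both sides in the range 0..UCHAR_MAX,
       so bytes with the top bit set compare correctly even when plain char
       is signed. */
    return ch == (unsigned char)buf[i];
}
```

Without the cast, a byte such as 0xC3 stored in a signed char would be negative and could never equal the non-negative value that getchar() returns.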
 

Chris Torek

I am developing a vCard application which have to support UTF-8. Does the
UTF-8 in char* will crash the strlen, I mean does UTF-8 have some char which
treat as NULL character in strlen?

UTF-8 is simply an encoding mechanism for taking larger-than-8-bit
values and storing them in 8-bit values. The details of this
mechanism are pretty much off-topic in comp.lang.c, but here we
can say that UTF-8 encoded characters will always fit in objects
of type "unsigned char", as those will have at least 8 bits.

Your actual question above cannot (quite) be answered as asked as
it appears to contain at least one false assumption, i.e., that
the presence of a '\0' character in an array of unsigned char will
"crash" strlen(). In fact, strlen() simply operates on an array
of (plain, i.e., optionally-signed at the compiler's discretion)
char, searching forward until it finds a '\0' value, then returning
the number of non-'\0'-"char"s it has skipped. Passing strlen()
the address of an array of "char" that does *not* contain a '\0'
could cause the program to crash (or indeed exhibit any behavior
at all); so I think what you really mean to ask is:

"Given some sequence of values in some wider-than-8-bit
character set (such as 16 or 32 bit Unicode), suppose I have
encoded it in 8-bit bytes using the UTF-8 scheme. Can I
(usefully) apply strlen() to the result?"

The answer to this version of the question is "maybe". In particular,
you must ensure that:

a) none of the 8-bit values is a trap representation in plain
"char" if plain "char" is signed (and the C language proper is
not terribly helpful here, but you could constrain yourself to
two's complement systems or those with wide-enough "plain" chars,
by checking that either CHAR_MAX >= 255 -- i.e., no UTF-8 value
will be negative -- or that CHAR_MIN <= -128);

b) that the "char" array you have used to store the encoded
values is '\0'-terminated;

c) that you did not embed any '\0' values in that array, and

d) that the resulting strlen() value meets any other criteria
you may hide beneath the word "useful".

The conditions in part (a) are met by most C systems today, so you
might simply assume them (and document that assumption somewhere).
The conditions in part (b) and (c) may, or may not, arise naturally
out of the values you are UTF-8 encoding -- this part is up to you.
Part (d) is likewise something only you can answer.
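
One way to document the assumption in part (a) is a compile-time check. This is only a sketch, assuming the intended test is "CHAR_MAX >= 255 or CHAR_MIN <= -128"; the function name is hypothetical:

```c
#include <limits.h>

/* Stop the build on implementations where plain char might not be able to
   hold every 8-bit UTF-8 code unit (assumed condition, see lead-in). */
#if !(CHAR_MAX >= 255 || CHAR_MIN <= -128)
#error "plain char cannot safely hold every 8-bit UTF-8 code unit"
#endif

/* Hypothetical run-time form of the same check, e.g. for a start-up
   self-test or a diagnostic message. Returns 1 if the assumption holds. */
int plain_char_holds_utf8_bytes(void)
{
    return CHAR_MAX >= 255 || CHAR_MIN <= -128;
}
```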
 

Chris

I remember when I did UCS2 for similar vCard application, I used the
following structure:

    typedef struct ucs2_tag {
        unsigned short* str_ptr;
        unsigned int length;
    } ucs2, *ucs2_ptr;

By doing it like this, it'll be obvious to people who need to work on your
code what you're doing. Stick to memcpy() for string copying and you don't
need to worry about whether there is a NUL terminator or not. This comes at
the small cost of managing the memory for each structure that you create.

Also, you might need to write your own myStrlen() to count the number of
characters of an input string, since its length can be unpredictable.

So in your case you can declare a structure for UTF-8 as:

    typedef struct utf8_tag {
        unsigned char* str_ptr;
        unsigned int length;
    } utf8, *utf8_ptr;

The answer to your 2nd question is that NUL character is NOT EQUIVALENT to
NULL!!!
There is no such thing as "NULL character" but there exists an "NUL
character", which is the '\0' at the end of a string buffer.

Just out of curiosity, are you a mobile phone software developer?
 

Jacky Cheung

Hi,

I am developing a vCard application which have to support UTF-8. Does the
UTF-8 in char* will crash the strlen, I mean does UTF-8 have some char which
treat as NULL character in strlen?

Jacky
 

Simon Biber

Chris said:
I remember when I did UCS2 for similar vCard application, I used the
following structure:

    typedef struct ucs2_tag {
        unsigned short* str_ptr;
        unsigned int length;
    } ucs2, *ucs2_ptr;

That looks good. I would use a value of type size_t to store the
length, and as a style point I wouldn't provide a pointer typedef.
Users can declare a
ucs2 *my_ptr;
if they like.
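
A rough sketch of that variant, with size_t for the length, no pointer typedef, and memcpy() for copying. The function ucs2_copy and its return convention are hypothetical, not something from the original code:

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

typedef struct {
    unsigned short *str_ptr;  /* 16-bit units, not NUL-terminated */
    size_t length;            /* number of units in str_ptr */
} ucs2;

/* Hypothetical helper: makes dst an independent copy of src.
   Returns 0 on success, -1 if memory could not be allocated.
   The caller releases dst->str_ptr with free(). */
int ucs2_copy(ucs2 *dst, const ucs2 *src)
{
    unsigned short *p = NULL;

    if (src->length > 0) {
        /* Guard against overflow in the byte-count multiplication. */
        if (src->length > SIZE_MAX / sizeof *p)
            return -1;
        p = malloc(src->length * sizeof *p);
        if (p == NULL)
            return -1;
        /* The explicit length means embedded zero units are copied too. */
        memcpy(p, src->str_ptr, src->length * sizeof *p);
    }
    dst->str_ptr = p;
    dst->length = src->length;
    return 0;
}
```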
By doing it like this, it'll be obvious to people who need to work on
your code what you're doing. Stick to memcpy() for string copying
and you don't need to worry about whether there is a NUL terminator
or not. This comes at the small cost of managing the memory for each
structure that you create.

True, and useful in the case of UCS2. Or you could just use C's
wide strings if they are implemented in UCS2 on your platform.
Also, you might need to write your own myStrlen() to count the
number of characters of an input string, since its length can
be unpredictable.

That would be more a file format issue; in the case of a UTF-8
encoded text file, there should not be any embedded zero bytes
and usual I/O or string functions like fgets and strlen should
work fine.
So in your case you can declare a structure for UTF-8 as:
    typedef struct utf8_tag {
        unsigned char* str_ptr;
        unsigned int length;
    } utf8, *utf8_ptr;

This is typically not needed for UTF-8. UTF-8 has the important
property that any code value from 0 to 127 inclusive codes for
the respective ASCII character and cannot occur as part of the
multi-byte representation for a higher character. Therefore, any
zero byte occurring in the UTF-8 string is indeed a real ASCII NUL
character and therefore can be used transparently with the usual
C semantics of string termination.
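
The same property means that searching for any ASCII byte with the ordinary string functions cannot land in the middle of a multi-byte character. As an illustration only, here is a sketch that splits a "NAME:value" style line at the first colon; the helper name and the sample line format are assumptions made for the example:

```c
#include <string.h>

/* Hypothetical helper: returns a pointer to the text after the first ':'
   in a NUL-terminated UTF-8 line, or NULL if there is no ':'.
   Bytes inside multi-byte UTF-8 sequences are all >= 0x80, so strchr()
   can only stop on a genuine ASCII colon. */
const char *value_after_colon(const char *line)
{
    const char *colon = strchr(line, ':');
    return colon != NULL ? colon + 1 : NULL;
}
```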
 

grobbeltje

Chris Torek said:
In fact, strlen() simply operates on an array
of (plain, i.e., optionally-signed at the compiler's discretion)
char,
Do you mean the compiler can basically do what it wants concerning
signing of chars? Where can I find more info on this?
I've been trying to read some iso documentation from 1999,
and how 'plain' char works is a bit hard to understand for me.

It says: "If the value of an object of type char is treated as a signed
integer when used in an expression, the value of CHAR_MIN shall be the same
as that of SCHAR_MIN and the value of CHAR_MAX shall be the same as that of
SCHAR_MAX. Otherwise, the value of CHAR_MIN shall be 0 and the value of
CHAR_MAX shall be the same as that of UCHAR_MAX." (sorry for any typo's,
they are mine).

As I read it, this would mean 'char' is always 'unsigned char' when compared
to other chars. The same document says the null character is defined as a
byte with all bits set to 0. So to my understanding a simple strlen consisting
of a for/while loop searching for a '\0' should operate on unsigned chars.

Am I wrong (again)? Does it really matter whether the chars in
this comparison are signed or not?

ps: Sorry for my bad english.
Grobbeltje (just another curious newbee).
 

Kevin Goodsell

grobbeltje said:
Do you mean the compiler can basically do what it wants concerning
signing of chars? Where can I find more info on this?
I've been trying to read some iso documentation from 1999,
and how 'plain' char works is a bit hard to understand for me.

It says: "If the value of an object of type char is treated as a signed
integer when used in an expression, the value of CHAR_MIN shall be the same
as that of SCHAR_MIN and the value of CHAR_MAX shall be the same as that of
SCHAR_MAX. Otherwise, the value of CHAR_MIN shall be 0 and the value of
CHAR_MAX shall be the same as that of UCHAR_MAX." (sorry for any typo's,
they are mine).

You're reading in the wrong place. The behavior of char is described
earlier, in section 6.2.5/15:

The three types char, signed char, and unsigned char are
collectively called the character types. The implementation
shall define char to have the same range, representation,
and behavior as either signed char or unsigned char.

So char behaves exactly like either signed char or unsigned char, and
the implementation must decide (and document) which. The three types are
distinct, however.
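
If you want to see which choice your implementation made, <limits.h> tells you. A small sketch (the function name is made up for the example):

```c
#include <limits.h>

/* Hypothetical helper: returns 1 if plain char behaves like signed char
   on this implementation, 0 if it behaves like unsigned char.
   CHAR_MIN is 0 exactly when plain char is unsigned. */
int plain_char_is_signed(void)
{
    return CHAR_MIN < 0;
}
```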
As I read it, this would mean 'char' is always 'unsigned char' when compared
to other chars.

This interpretation is wrong. I'm unsure of how you got that from the
passage you quoted.
The same document says the null character is defined as a
byte with all bits set to 0. So to my understanding a simple strlen consisting
of a for/while loop searching for a '\0' should operate on unsigned chars.

I'm not totally sure what you mean here. To the best of my knowledge,
strlen should operate correctly on any character type.
Am I wrong (again)? Does it really matter whether the chars in
this comparison are signed or not?

The 'signed-ness' of the objects involved in a comparison is certainly
important, but I'm still not sure what you are getting at.

-Kevin
 

David Resnick

Jacky Cheung said:
Hi,

I am developing a vCard application which have to support UTF-8. Does the
UTF-8 in char* will crash the strlen, I mean does UTF-8 have some char which
treat as NULL character in strlen?

Jacky

There are no embedded NUL characters in a UTF-8 encoded string,
that is one of its primary virtues. However, you need to note
that the strlen of a UTF-8 string is greater than (unless that
string is all ASCII) the number of characters represented by
that string...
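
For the character count, a common approach is to skip UTF-8 continuation bytes (those of the form 10xxxxxx). A minimal sketch, assuming the input is well-formed, NUL-terminated UTF-8 and that "character" means code point; the function name is hypothetical and no validation is done:

```c
#include <stddef.h>

/* Hypothetical helper: counts code points in a well-formed UTF-8 string.
   Every byte that is not a continuation byte (10xxxxxx) starts a new
   code point. */
size_t utf8_count_code_points(const char *s)
{
    size_t n = 0;

    for (; *s != '\0'; s++) {
        if (((unsigned char)*s & 0xC0) != 0x80)
            n++;
    }
    return n;
}
```

For a string containing one two-byte character among four ASCII letters, strlen() reports 6 bytes while this function reports 5 code points.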

-David
 

Christian Bau

grobbeltje said:
In fact, strlen() simply operates on an array
of (plain, i.e., optionally-signed at the compiler's discretion)
char,
Do you mean the compiler can basically do what it wants concerning
signing of chars? Where can I find more info on this?
I've been trying to read some iso documentation from 1999,
and how 'plain' char works is a bit hard to understand for me.

It says: "If the value of an object of type char is treated as a signed
integer when used in an expression, the value of CHAR_MIN shall be the same
as that of SCHAR_MIN and the value of CHAR_MAX shall be the same as that of
SCHAR_MAX. Otherwise, the value of CHAR_MIN shall be 0 and the value of
CHAR_MAX shall be the same as that of UCHAR_MAX." (sorry for any typo's,
they are mine).

As I read it, this would mean 'char' is always 'unsigned char' when compared
to other chars. The same document says the null character is defined as a
byte with all bits set to 0. So to my understanding a simple strlen consisting
of a for/while loop searching for a '\0' should operate on unsigned chars.

The compiler has to make a choice between two possibilities: Either
plain "char" behaves exactly the same way as "unsigned char", or plain
"char" behaves exactly the same as "signed char". The compiler must make
its decision and then stick with it.
 

Jared Dykstra

Christian Bau said:
grobbeltje said:
In fact, strlen() simply operates on an array
of (plain, i.e., optionally-signed at the compiler's discretion)
char,
Do you mean the compiler can basically do what it wants concerning
signing of chars? Where can I find more info on this?
I've been trying to read some iso documentation from 1999,
and how 'plain' char works is a bit hard to understand for me.

It says: "If the value of an object of type char is treated as a signed
integer when used in an expression, the value of CHAR_MIN shall be the same
as that of SCHAR_MIN and the value of CHAR_MAX shall be the same as that of
SCHAR_MAX. Otherwise, the value of CHAR_MIN shall be 0 and the value of
CHAR_MAX shall be the same as that of UCHAR_MAX." (sorry for any typo's,
they are mine).

As I read it, this would mean 'char' is always 'unsigned char' when compared
to other chars. The same document says the null character is defined as a
byte with all bits set to 0. So to my understanding a simple strlen consisting
of a for/while loop searching for a '\0' should operate on unsigned chars.

The compiler has to make a choice between two possibilities: Either
plain "char" behaves exactly the same way as "unsigned char", or plain
"char" behaves exactly the same as "signed char". The compiler must make
its decision and then stick with it.

Signed...Unsigned, it doesn't really matter. Most string functions
don't care about the "real" numerical value of something, just whether
it equates to something else or not. If you're going to start testing
the range of a character, like if (c < 'a') then you have to ensure
the range doesn't cross a sign change. Just ensure all lvalues are
either signed or unsigned, casting them if necessary.
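
As a small illustration of the range-test pitfall, here is a sketch of an "is this byte plain ASCII?" test. The function name is hypothetical:

```c
/* Hypothetical helper: returns 1 if c is an ASCII byte (0..127).
   The cast matters: where plain char is signed, a UTF-8 lead or
   continuation byte is negative, so a bare (c < 0x80) would be true
   for it as well. */
int is_ascii_byte(char c)
{
    return (unsigned char)c < 0x80;
}
```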

Of course the UTF-8 aspect complicates things. Strings could have the
same number of characters encoded in them but be different byte
lengths, so characters need to be decoded into a larger data type
before comparing. However, this has already been noted in this
thread.
 

Chris Torek

[on plain "char" in general, and strlen() specifically]
grobbeltje said:
Do you mean the compiler can basically do what it wants concerning
sign[edness] of chars?

(The answer is "sort of", as Kevin Goodsell explained in text I snipped.)

Now suppose we have a UTF-8 encoding in an array "a" defined
as:

unsigned char a[UTF8_MAX_LEN];

where a[len] == '\0' for some len in [0..UTF8_MAX_LEN).

... To the best of my knowledge,
strlen should operate correctly on any character type.

The concern here is for plain char that is the same as "signed
char" on a machine where CHAR_MIN is -127 rather than -128. The
UTF-8 encoding potentially uses all 256 possible values in the
range [0..255], which will certainly fit into a C array of "unsigned
char", because UCHAR_MAX must be at least 255. But what if UCHAR_MAX
is indeed 255, and some array element a[i] (where i < len) is set
to either 255 (for ones' complement systems) or 128 (for
sign-and-magnitude)? Then a[i], interpreted as a plain (and thus
signed) char, will be what C99 calls an "object representation" of
the value negative zero (which exists in ones' complement and
signed-magnitude, but not in the much more common two's complement
that most C systems use). This may be a "trap representation",
and it may be the case that strlen(a) traps, instead of returning
some number.

On any ordinary two's complement system today, you will find that
CHAR_MIN is either -128 (char being signed) or 255 (char being
unsigned), so that strlen(a) will indeed find that a[len]=='\0'
byte even if a[i]==128 or a[i]==255. And if CHAR_MIN exceeds
255 (e.g., on DSP C compilers with CHAR_BIT being 16 or more),
then all a[i] values, for i in [0..len), are valid "char" values
as well. Thus, strlen(a) will "work right" on all these systems.
But, this being comp.lang.c, we must worry about systems that
do not have these properties, too. :)
 

Ben Pfaff

Chris Torek said:
On any ordinary two's complement system today, you will find that
CHAR_MIN is either -128 (char being signed) or 255 (char being
unsigned), so that strlen(a) will indeed find that a[len]=='\0'
byte even if a[i]==128 or a[i]==255. And if CHAR_MIN exceeds
255 (e.g., on DSP C compilers with CHAR_BIT being 16 or more),


Are you getting a little sleepy this late at night, Chris? There
is no possible way that CHAR_MIN can be 255, and certainly no way
that CHAR_MIN can exceed 255.
 

Chris Torek

Chris Torek said:
On any ordinary two's complement system today, you will find that
CHAR_MIN is either -128 (char being signed) or 255 (char being
unsigned), so that strlen(a) will indeed find that a[len]=='\0'
byte even if a[i]==128 or a[i]==255. And if CHAR_MIN exceeds
255 (e.g., on DSP C compilers with CHAR_BIT being 16 or more),


Are you getting a little sleepy this late at night, Chris? There
is no possible way that CHAR_MIN can be 255, and certainly no way
that CHAR_MIN can exceed 255.

Er, right. CHAR_M{IN,AX} -- just two little letters... :)

Make that:

On ordinary two's complement systems today, you will find that
either CHAR_MIN is -128 (char being signed), or CHAR_MAX is 255
(char being unsigned), so that strlen(a) will indeed find the
a[len]=='\0' byte even if a[i]==128 (-128 if signed) or 255
(-1 if signed). And if CHAR_MAX exceeds 255 ...
 

J. J. Farrell

Chris said:
The answer to your 2nd question is that NUL character is NOT EQUIVALENT to
NULL!!!
There is no such thing as "NULL character" but there exists an "NUL
character", which is the '\0' at the end of a string buffer.

Just for pedantry's sake: ASCII gives the character with value 0
the name "NUL" and the description "the null character".
 

those who know me have no need of my name

in comp.lang.c i read:
The answer to your 2nd question is that NUL character is NOT EQUIVALENT to
NULL!!!
There is no such thing as "NULL character" but there exists an "NUL
character", which is the '\0' at the end of a string buffer.

NUL is used exactly once in a footnote -- footnotes are not normative.

the standard defines the term `null character' (below). it doesn't present
the word null in all upper-case, so there certainly is room for confusion
with the macro, and it's good that this be clarified, i.e., the intention
of this correction is well meant but it is slightly flawed. there is a
null character, which will compare equal with NULL, but they are not the
same, in that NULL may be of type void* rather than int.

| A byte with all bits set to 0, called the null character, shall exist in
| the basic execution character set; it is used to terminate a character
| string.
 

Christian Bau

Of course the UTF-8 aspect complicates things. Strings could have the
same number of characters encoded in them but be different byte
lengths, so characters need to be decoded into a larger data type
before comparing. However, this has already been noted in this
thread.

The beauty of UTF-8 is that there is no need to do this.

If two sequences of characters A and B are encoded in UTF-8, then the
_bytes_ of the encoding of A will match a subsequence of bytes in the
encoding of B if and only if the characters of A match a subsequence of
the characters in B.
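
In practice this means strstr() can be used directly on UTF-8 data. A minimal sketch, assuming both strings are well-formed, NUL-terminated UTF-8 and that no Unicode normalization is wanted; the wrapper name is hypothetical:

```c
#include <string.h>

/* Hypothetical wrapper: returns 1 if the UTF-8 string needle occurs in the
   UTF-8 string haystack, comparing bytes only. Because lead bytes and
   continuation bytes use disjoint value ranges, a byte-level match always
   starts and ends on character boundaries. */
int utf8_contains(const char *haystack, const char *needle)
{
    return strstr(haystack, needle) != NULL;
}
```

Searching for a two-byte character whose second byte matches, but whose first byte differs, correctly fails: the continuation byte alone never matches a whole character.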
 

J. J. Farrell

J. J. Farrell said:
"Chris" wrote in message

Just for pedantry's sake: ASCII gives the character with value 0
the name "NUL" and the description "the null character".

Well, if I'm going to be pedantic, what has ASCII got to do with
anything? What's relevant is that C defines the "null character"
as a character with value 0. "NUL" is the name of that character
in both the ASCII and EBCDIC character sets (and those based on
them, and perhaps a few others as well) but they aren't relevant
to C. Chris's correction would have been better as

! There is no such thing as "NULL character" but there exists a
! "null character", which is the '\0' at the end of a string buffer.
 

Jared Dykstra

Christian Bau said:
The beauty of UTF-8 is that there is no need to do this.

If two sequences of characters A and B are encoded in UTF-8, then the
_bytes_ of the encoding of A will match a subsequence of bytes in the
encoding of B if and only if the characters of A match a subsequence of
the characters in B.

True. An encoding scheme is less useful if it doesn't encode the same
data the same way twice.

The original post wasn't clear if all data was encoded this way or
just some of it. Obviously if the two strings are encoded
differently, conversion to a common encoding scheme is required for
any useful byte-wise comparison. If not, compare bytes.
 

Christian Bau

True. An encoding scheme is less useful if it doesn't encode the same
data the same way twice.

That is not what I said. UTF8 is better than that. Consider the case
where x, y, and a are single characters that are all three encoded to
two bytes each. It would be possible that the last byte of x matches the
first byte of a, and the first byte of y matches the second byte of a,
so the encoding of xy has the encoding of a as a substring. This is not
the case with UTF8.
 
