I am developing a vCard application which have to support UTF-8. Does the
UTF-8 in char* will crash the strlen, I mean does UTF-8 have some char which
treat as NULL character in strlen?
UTF-8 is simply an encoding mechanism for taking larger-than-8-bit
values and storing them in 8-bit values. The details of this
mechanism are pretty much off-topic in comp.lang.c, but here we
can say that UTF-8 encoded characters will always fit in objects
of type "unsigned char", as those will have at least 8 bits.
Your actual question above cannot (quite) be answered as asked as
it appears to contain at least one false assumption, i.e., that
the presence of a '\0' character in an array of unsigned char will
"crash" strlen(). In fact, strlen() simply operates on an array
of (plain, i.e., optionally-signed at the compiler's discretion)
char, searching forward until it finds a '\0' value, then returning
the number of non-'\0'-"char"s it has skipped. Passing strlen()
the address of an array of "char" that does *not* contain a '\0'
could cause the program to crash (or indeed exhibit any behavior
at all); so I think what you really mean to ask is:
"Given some sequence of values in some wider-than-8-bit
character set (such as 16 or 32 bit Unicode), suppose I have
encoded it in 8-bit bytes using the UTF-8 scheme. Can I
(usefully) apply strlen() to the result?"
The answer to this version of the question is "maybe". In particular,
you must ensure that:
a) none of the 8-bit values is a trap representation in plain
"char" if plain "char" is signed (and the C language proper is
not terribly helpful here, but you could constrain yourself to
two's complement systems or those with wide-enough "plain" chars,
by checking that either CHAR_MAX >= 255 -- i.e., no UTF-8 value
will be negative -- or that -CHAR_MIN <= -128);
b) that the "char" array you have used to stored the encoded
values is '\0'-terminated;
c) that you did not embed any '\0' values in that array, and
d) that the resulting strlen() value meets any other criteria
you may hide beneath the word "useful".
The conditions in part (a) are met by most C systems today, so you
might simply assume them (and document that assumption somewhere).
The conditions in part (b) and (c) may, or may not, arise naturally
out of the values you are UTF-8 encoding -- this part is up to you.
Part (d) is likewise something only you can answer.