strncmp and unsigned char

me · May 19, 2011

Hi guys,

I'm using a UTF-8 state machine I made to check and handle Unicode
strings, and was wondering if strncmp could be used for comparing the
after check or if I should roll my own?

Its prototype accepts const char * and (on Linux at least) internally
uses unsigned char.

What should I do?

Regards,

Shao Miller · May 19, 2011

me said:
I'm using a UTF-8 state machine I made to check and handle Unicode
strings, and was wondering if strncmp could be used for comparing the
after check or if I should roll my own?

Its prototype accepts const char * and (on Linux at least) internally
uses unsigned char.

What should I do?

Might you be interested in 'wcsncmp()'?

Ben Bacarisse · May 19, 2011

me said:
I'm using a UTF-8 state machine I made to check and handle Unicode
strings, and was wondering if strncmp could be used for comparing the
after check or if I should roll my own?

This confused me until I decided that a "strings" was missing:

| if strncmp could be used for comparing the [strings] after check[ing]

Is that what you meant? If so, you certainly could use strncmp, but the
result would be much less useful than a proper Unicode compare. As has
been suggested, you could convert to a wide string and use wcsncmp (or
wcscmp).

However, if all you want is a rather arbitrary but consistent ordering
(say for a binary search) then the byte comparison of the UTF-8 encoded
strings would do.

Its prototype accepts const char * and (on Linux at least) internally
uses unsigned char.

That's not an issue. All of C's compare functions treat the bytes as if
they were unsigned char, despite the prototypes. If you don't like the
look of the prototype, memcmp uses void *.
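
To illustrate the point about unsigned char, here is a minimal sketch. The helper names (sign_of, high_byte_vs_ascii, high_byte_vs_ascii_n) are made up for this example. It compares a UTF-8 string whose first byte is 0xC3 against "z"; the result should be positive whether or not plain char is signed on the platform.

```c
#include <string.h>

/* Hypothetical helper: reduce a comparison result to -1, 0 or 1. */
static int sign_of(int r)
{
    return (r > 0) - (r < 0);
}

/* "\xC3\xA9" is U+00E9 encoded as UTF-8; its first byte is 0xC3.
 * Where plain char is signed, 0xC3 is stored as a negative value,
 * but strcmp interprets each byte as unsigned char (195), so the
 * string sorts after "z" (122). */
static int high_byte_vs_ascii(void)
{
    return sign_of(strcmp("\xC3\xA9", "z"));
}

/* Same comparison with strncmp, limited to the first 2 bytes. */
static int high_byte_vs_ascii_n(void)
{
    return sign_of(strncmp("\xC3\xA9", "z", 2));
}
```

A hand-rolled loop that subtracts plain char values directly would get this wrong on a signed-char platform, which is a reason to prefer the library functions here.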

Angel · May 19, 2011

Ben Bacarisse said:
That's not an issue. All of C's compare functions treat the bytes as if
they were unsigned char, despite the prototypes. If you don't like the
look of the prototype, memcmp uses void *.

Unlike the str*cmp() functions, memcmp() doesn't stop at a null byte, so
if you use it you might end up comparing garbage data past the
terminator when the strings are shorter than the given size.
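
A small sketch of that difference, with made-up buffer and function names: both buffers hold the string "ab", but the bytes after the terminator differ.

```c
#include <string.h>

/* Two buffers that hold the same string "ab" but differ in the
 * bytes that follow the terminating null byte. */
static const char buf_a[8] = { 'a', 'b', '\0', 'X', 'X', 'X', 'X', 'X' };
static const char buf_b[8] = { 'a', 'b', '\0', 'Y', 'Y', 'Y', 'Y', 'Y' };

/* strncmp stops at the terminator, so the trailing bytes are ignored. */
static int equal_as_strings(void)
{
    return strncmp(buf_a, buf_b, sizeof buf_a) == 0;
}

/* memcmp compares all n bytes, including those after the terminator. */
static int equal_as_bytes(void)
{
    return memcmp(buf_a, buf_b, sizeof buf_a) == 0;
}
```

Here equal_as_strings() reports a match and equal_as_bytes() does not. If the string lengths are already known, passing the shorter length to memcmp avoids reading past either terminator.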

Keith Thompson · May 20, 2011

christian.bau said:
strcmp will compare strings and return a result assuming that the data
is signed char.

No, it won't.

strcmp's arguments are of type const char*; plain char may be either
signed or unsigned. But even if plain char is signed, C99 7.21.4p1 says:

The sign of a nonzero value returned by the comparison functions
memcmp, strcmp, and strncmp is determined by the sign of the
difference between the values of the first pair of characters
(both interpreted as unsigned char) that differ in the objects
being compared.

[...]

The main problem is that with Unicode, just comparing code points
isn't very meaningful. You'd have to put the code points into a
canonical order at least to get any meaningful result. And when you do
that, using strcmp is quite pointless.

I *think* that strcmp() returns correctly ordered results for UTF-8
strings. UTF-8 was carefully designed to make this work.

Ben Bacarisse · May 20, 2011

Keith Thompson said:
I *think* that strcmp() returns correctly ordered results for UTF-8
strings. UTF-8 was carefully designed to make this work.

It all depends on "correctly ordered" of course. A byte-by-byte compare
of correctly encoded UTF-8 strings preserves the ordering on the
code points the strings represent. To put it another way, converting to
wide strings and using wcscmp will give the same result as strcmp will
when passed the originals. The encoded strings must not contain any
over-long representations (nor any other forbidden bytes or byte
combinations) but I think the OP has covered that since they talked
about checking the strings first.
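
As a rough check of the ordering claim, the sketch below pairs a few code points with their shortest-form UTF-8 encodings (one each of 1, 2, 3 and 4 bytes) and verifies that strcmp orders the encodings the same way as a numeric compare of the code points. The struct, table and function names are invented for this example, and four samples are an illustration, not a proof.

```c
#include <stddef.h>
#include <string.h>

struct sample {
    unsigned long code_point;   /* Unicode scalar value */
    const char *utf8;           /* its shortest-form UTF-8 encoding */
};

static const struct sample samples[] = {
    { 0x7AUL,    "z" },                  /* U+007A,  1 byte  */
    { 0xE9UL,    "\xC3\xA9" },           /* U+00E9,  2 bytes */
    { 0x20ACUL,  "\xE2\x82\xAC" },       /* U+20AC,  3 bytes */
    { 0x1F600UL, "\xF0\x9F\x98\x80" }    /* U+1F600, 4 bytes */
};

/* Reduce a comparison result to -1, 0 or 1. */
static int sign_of(int r)
{
    return (r > 0) - (r < 0);
}

/* Returns 1 if, for every pair of samples, strcmp on the UTF-8 bytes
 * orders them the same way as comparing the code points numerically;
 * returns 0 on the first mismatch. */
static int byte_order_matches_code_point_order(void)
{
    size_t n = sizeof samples / sizeof samples[0];
    for (size_t i = 0; i < n; i++) {
        for (size_t j = 0; j < n; j++) {
            int by_cp = (samples[i].code_point > samples[j].code_point)
                      - (samples[i].code_point < samples[j].code_point);
            int by_bytes = sign_of(strcmp(samples[i].utf8, samples[j].utf8));
            if (by_cp != by_bytes)
                return 0;
        }
    }
    return 1;
}
```

The property holds because longer encodings start with a numerically larger lead byte, so an unsigned byte-wise compare never ranks a higher code point below a lower one.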

However, because Unicode says so much about the characters, one could
argue that a truly correct ordering should be rather more than this.
For example, "fine" with an fi ligature should compare equal to "fine"
without one and so on. If that seems too much like a detail, in some
scripts the code points are not in the correct collating sequence for
even the most basic ordering. That's what Christian Bau is saying, I
think.
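
The ligature case can be sketched as follows (variable and function names are made up for the example). U+FB01 LATIN SMALL LIGATURE FI encodes as the bytes EF AC 81, so a byte-wise compare sees two different strings.

```c
#include <string.h>

/* "fine" spelled with plain ASCII letters. */
static const char plain[] = "fine";

/* "fine" spelled with U+FB01 (bytes EF AC 81) followed by "ne".
 * The literal is split so the hex escape ends before the letters. */
static const char ligature[] = "\xEF\xAC\x81" "ne";

/* Returns 1 if strcmp considers the two spellings equal. */
static int bytewise_equal(void)
{
    return strcmp(plain, ligature) == 0;
}

/* Returns 1 if the plain spelling sorts before the ligature one. */
static int plain_sorts_first(void)
{
    return strcmp(plain, ligature) < 0;
}
```

bytewise_equal() returns 0 here. Making the two spellings compare equal would need something like Unicode compatibility normalization before the compare, which the standard C library does not provide.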
