strncmp and unsigned char

me

Hi guys,

I'm using a UTF-8 state-machine I made to check and handle Unicode
strings, and was wondering if strncmp could be used for comparing the
after check or if I should roll my own?

Its prototype accepts const char * and (on Linux at least) internally
uses unsigned char.

What should I do?

Regards,
 
Shao Miller

me said:
I'm using a UTF-8 state-machine I made to check and handle Unicode
strings, and was wondering if strncmp could be used for comparing the
after check or if I should roll my own?

Its prototype accepts const char * and (on Linux at least) internally
uses unsigned char.

What should I do?

Might you be interested in 'wcsncmp()'? :)
 
Ben Bacarisse

me said:
I'm using a UTF-8 state-machine I made to check and handle Unicode
strings, and was wondering if strncmp could be used for comparing the
after check or if I should roll my own?

This confused me until I decided that a "strings" was missing:

| if strncmp could be used for comparing the [strings] after check[ing]

is that what you meant? If so, you certainly could use strncmp but the
result would be much less useful than a proper Unicode compare. As has
been suggested, you could convert to a wide string and use wcsncmp (or
wcscmp).

However, if all you want is a rather arbitrary ordering (say for a
binary search) then the byte comparison of the UTF8 encoded strings
would do.

me said:
Its prototype accepts const char * and (on Linux at least) internally
uses unsigned char.

That's not an issue. All of C's compare functions treat the bytes as if
they were unsigned char, despite the prototypes. If you don't like the
look of the prototype, memcmp uses void *.
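
To illustrate the point about unsigned char, here is a minimal sketch of what strncmp is specified to do, spelled out by hand. The name strncmp_spelled_out is made up for this example and is not a library function; it only shows that the casts happen inside the comparison, so the caller can pass ordinary char pointers holding UTF-8 bytes.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper, not part of any library: follows the same rules
 * the standard gives for strncmp. Each byte is compared as an
 * unsigned char, whatever the signedness of plain char. */
static int strncmp_spelled_out(const char *a, const char *b, size_t n)
{
    assert(a != NULL && b != NULL);
    for (size_t i = 0; i < n; i++) {
        unsigned char ca = (unsigned char)a[i];
        unsigned char cb = (unsigned char)b[i];
        if (ca != cb)
            return ca < cb ? -1 : 1;   /* sign of the unsigned difference */
        if (ca == '\0')
            break;                     /* both strings ended together */
    }
    return 0;
}
```

With this, strncmp_spelled_out("\xC3\xA9", "z", 4) is positive, because the lead byte 0xC3 is compared as 195 rather than as a negative signed value, and the library strncmp gives a result of the same sign. So there should be no need to roll your own just because of the prototype.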
 
Angel

Ben Bacarisse said:
That's not an issue. All of C's compare functions treat the bytes as if
they were unsigned char, despite the prototypes. If you don't like the
look of the prototype, memcmp uses void *.

Unlike the str*cmp() functions, memcmp() doesn't stop at a null byte, so
if you use it you might end up comparing garbage data past the
terminators when the strings are shorter than the given size.
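
A small sketch of that difference, assuming two fixed 8-byte buffers that hold the same string "ab" followed by different leftover bytes. The buffer and function names are invented for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Same string "ab" in both buffers, but different stale bytes after
 * the terminator. Each literal fills its 8-byte array exactly. */
static const char buf_a[8] = "ab\0XXXX";
static const char buf_b[8] = "ab\0YYYY";

/* strncmp stops at the terminator, so the stale bytes are ignored. */
static int equal_by_strncmp(size_t n)
{
    assert(n <= sizeof buf_a);
    return strncmp(buf_a, buf_b, n) == 0;
}

/* memcmp compares all n bytes, including whatever follows the '\0'. */
static int equal_by_memcmp(size_t n)
{
    assert(n <= sizeof buf_a);
    return memcmp(buf_a, buf_b, n) == 0;
}
```

Here equal_by_strncmp(8) reports the buffers as equal while equal_by_memcmp(8) does not. With memcmp you would have to work out the real string lengths first and compare only that many bytes.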
 
Keith Thompson

christian.bau said:
strcmp will compare strings and return a result assuming that the data
is signed char.

No, it won't.

strcmp's arguments are of type const char*; plain char may be either
signed or unsigned. But even if plain char is signed, 7.21.4p1 says:

The sign of a nonzero value returned by the comparison functions
memcmp, strcmp, and strncmp is determined by the sign of the
difference between the values of the first pair of characters
(both interpreted as unsigned char) that differ in the objects
being compared.

[...]
The main problem is that with Unicode, just comparing code points
isn't very meaningful. You'd have to put the code points into a
canonical order at least to get any meaningful result. And when you do
that, using strcmp is quite pointless.

I *think* that strcmp() returns correctly ordered results for UTF-8
strings. UTF-8 was carefully designed to make this work.
 
Ben Bacarisse

Keith Thompson said:
I *think* that strcmp() returns correctly ordered results for UTF-8
strings. UTF-8 was carefully designed to make this work.

It all depends on "correctly ordered" of course. A byte-by-byte compare
of correctly encoded UTF-8 strings preserves the ordering on the
code points the strings represent. To put it another way, converting to
wide strings and using wcscmp will give the same result as strcmp will
when passed the originals. The encoded strings must not contain any
over-long representations (nor any other forbidden bytes or byte
combinations) but I think the OP has covered that since they talked
about checking the strings first.
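
As a rough check of the byte-order claim, the sketch below lists a few code points in ascending order next to their UTF-8 encodings and verifies that strcmp sorts the encoded forms the same way. The table and the function name are made up for this example, and it only samples one code point per encoded length, so it illustrates the property rather than proving it. The comparison with wcscmp additionally assumes that wchar_t holds whole code points.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

struct sample {
    unsigned long code_point;
    const char *utf8;          /* shortest-form UTF-8 encoding */
};

/* Listed in ascending code point order. */
static const struct sample samples[] = {
    { 0x7AUL,    "z" },                  /* U+007A, 1 byte  */
    { 0xE9UL,    "\xC3\xA9" },           /* U+00E9, 2 bytes */
    { 0x20ACUL,  "\xE2\x82\xAC" },       /* U+20AC, 3 bytes */
    { 0xFFFDUL,  "\xEF\xBF\xBD" },       /* U+FFFD, 3 bytes */
    { 0x1F600UL, "\xF0\x9F\x98\x80" },   /* U+1F600, 4 bytes */
};

/* Returns 1 if strcmp orders every adjacent pair of encodings the
 * same way the code points are ordered, 0 otherwise. */
static int byte_order_matches_code_points(void)
{
    size_t n = sizeof samples / sizeof samples[0];
    for (size_t i = 0; i + 1 < n; i++) {
        assert(samples[i].code_point < samples[i + 1].code_point);
        if (strcmp(samples[i].utf8, samples[i + 1].utf8) >= 0)
            return 0;
    }
    return 1;
}
```

For these samples byte_order_matches_code_points() returns 1: the lead bytes 0x7A, 0xC3, 0xE2, 0xEF and 0xF0 already increase with the code point.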

However, because Unicode says so much about the characters, one could
argue that a truly correct ordering should be rather more than this.
For example, "fine" with an fi ligature should compare equal to "fine"
without one and so on. If that seems too much like a detail, in some
scripts the code points are not in the correct collating sequence for
even the most basic ordering. That's what Christian Bau is saying, I
think.
 
