Using std::lexicographical_compare with ignore case equality doesn't always work

Alex Buell · Dec 28, 2008

The short snippet below demonstrates the problem I'm having with
std::lexicographical_compare() in that it does not reliably work!

#include <algorithm>
#include <iostream>
#include <string>
#include <vector>
#include <ctype.h>

bool compare_ignore_case_equals(char c1, char c2)
{
    return toupper(c1) == toupper(c2);
}

bool compare_ignore_case_less(char c1, char c2)
{
    return toupper(c1) < toupper(c2);
}

int main(int argc, char *argv[])
{
    std::vector<std::string> args(argv + 1, argv + argc);
    const char *words[] =
    {
        "add", "del", "new", "help"
    };

    std::vector<std::string> list(words, words + (sizeof words / sizeof words[0]));
    std::vector<std::string>::iterator word = list.begin();
    while (word != list.end())
    {
        std::cout << "Testing " << *word << " = " << args[0];
        if (std::lexicographical_compare(
                word->begin(), word->end(),
                args[0].begin(), args[0].end(),
                compare_ignore_case_equals))
        {
            std::cout << " found!\n";
            break;
        }

        std::cout << "\n";
        word++;
    }
}

Here's an example:

./quick new
Testing add = new
Testing del = new found!

That simply cannot be correct, what is it that I've done wrongly? Thanks

Alex Buell · Dec 28, 2008

if (std::lexicographical_compare(
word->begin(), word->end(),
args[0].begin(), args[0].end(),
compare_ignore_case_equals))

First, remove compare_ignore_case_equals and try again. You'll get
similar problems. Then read about lexicographical_compare and what
its return value means.

I've now switched to using this:

#include <string.h>
#include <string>

inline int strcasecmp(const std::string& s1, const std::string& s2)
{
    return strcasecmp(s1.c_str(), s2.c_str());
}

This leverages C++'s ability to overload functions and works better.

stricmp() isn't standard whilst strcasecmp() is standard ANSI/ISO. Some
posters have mentioned using stricmp() instead of strcasecmp(), which
happens not to be the correct answer. Why?

Alex Buell · Dec 28, 2008

No, it's not. It's Unix, if I remember correctly. But I think I didn't
make my point clearly enough. The problem isn't fundamentally in the
predicate. So drop the predicate and use the default predicate until
you understand what lexicographical_compare does.

strcasecmp() is actually defined in the POSIX standards. But I will
look again at std::lexicographical_compare() when I get some time. The
program works well enough with strcasecmp().
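
For readers following along, the result in the original post can be traced by hand. The sketch below is not from the thread: lex_compare_sketch is a hypothetical helper that mirrors the element-by-element loop lexicographical_compare is specified to perform, so the effect of an equality predicate can be seen without relying on what a particular library does when handed a predicate that is not an ordering. The wrapper names are illustrative only.

```cpp
#include <algorithm>
#include <cctype>
#include <string>

// Hypothetical helper: mirrors the loop that lexicographical_compare is
// specified to perform, so the effect of any predicate can be traced.
template <typename It1, typename It2, typename Pred>
bool lex_compare_sketch(It1 first1, It1 last1, It2 first2, It2 last2, Pred pred)
{
    for (; first1 != last1 && first2 != last2; ++first1, ++first2) {
        if (pred(*first1, *first2)) return true;   // reported as "first is less"
        if (pred(*first2, *first1)) return false;  // reported as "second is less"
    }
    // One range ran out: the first is "less" only if it is a proper prefix.
    return first1 == last1 && first2 != last2;
}

// The equality predicate from the original post, with an unsigned char cast
// added so that toupper never sees a negative value.
inline bool ignore_case_equals(char c1, char c2)
{
    return std::toupper(static_cast<unsigned char>(c1))
        == std::toupper(static_cast<unsigned char>(c2));
}

// Default predicate (operator<): true only if a sorts strictly before b.
inline bool less_default(const std::string& a, const std::string& b)
{
    return std::lexicographical_compare(a.begin(), a.end(), b.begin(), b.end());
}

// Same loop, but with the equality predicate passed where an ordering belongs.
inline bool less_with_equality_pred(const std::string& a, const std::string& b)
{
    return lex_compare_sketch(a.begin(), a.end(), b.begin(), b.end(),
                              ignore_case_equals);
}
```

With these, less_default("new", "new") is false, because the function answers "is the first range less?", not "are the ranges equal?". less_with_equality_pred("del", "new") is true: the second characters are both 'e', the predicate fires, and the loop reports "less". That matches the " found!" printed for "del" in the original output, while "add" against "new" shares no character at any position and yields false.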

James Kanze · Dec 29, 2008

stricmp() isn't standard whilst strcasecmp() is standard
ANSI/ISO.

It's not present in any version of the standard I have handy
(C++98, C99, and the latest C++ draft). The standard C++
functional object for comparing strings in a locale dependent
way is std::locale (which has an operator() which does exactly
what is needed for lexicographical_compare). And as any
comparisons involving case are locale sensitive, it's really what
you need, e.g.:

if ( std::lexicographical_compare(
word->begin(), word->end(),
args[ 0 ].begin(), args[ 0 ].end(),
std::locale() ) ) {...}

(or std::locale( "xxx" ), with whatever locale you want).

Some posters have mentioned using stricmp() instead of
strcasecmp(), which happens not to be the correct answer.
Why?

Neither are the correct answer, since neither are standard
C/C++. (strcasecmp is defined in Posix, but not very well: "In
the POSIX locale, [...]. The results are unspecified in other
locales." So unless you happen to live in POSIX, it's not very
useful.)

James Kanze · Dec 29, 2008

The short snippet below demonstrates the problem I'm having with
std::lexicographical_compare() in that it does not reliably work!

#include <iostream>
#include <vector>
#include <ctype.h>

bool compare_ignore_case_equals(char c1, char c2)
{
return toupper(c1) == toupper(c2);

Just a reminder, but this is, of course, undefined behavior.

}

bool compare_ignore_case_less(char c1, char c2)
{
return toupper(c1) < toupper(c2);

As is this.

}

(I've addressed the other issues in another posting.)
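
The undefined behavior presumably refers to passing a plain char that may be negative to toupper, which is only defined for EOF and values in the range 0...UCHAR_MAX. A minimal sketch of the usual workaround, assuming single-byte characters and the <cctype> functions; upper_code is a made-up helper name:

```cpp
#include <cctype>

// Hypothetical helper: converts through unsigned char first, so the value
// handed to toupper is always in the range 0..UCHAR_MAX.
inline int upper_code(char c)
{
    return std::toupper(static_cast<unsigned char>(c));
}

inline bool compare_ignore_case_equals(char c1, char c2)
{
    return upper_code(c1) == upper_code(c2);
}

inline bool compare_ignore_case_less(char c1, char c2)
{
    return upper_code(c1) < upper_code(c2);
}
```

This only removes the undefined behavior; it does not change what either predicate means when passed to an algorithm.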

Alex Buell · Dec 29, 2008

Actually, it looks more like it leverages C++'s ability to cause a
stack overflow due to infinite recursion. strcasecmp isn't part of
ISO C++, so on plenty of compilers, this function will simply call
itself.

As this snippet below shows, you're actually correct.

#include <iostream>
#include <string>

int hahaha(const std::string& s1, const std::string& s2)
{
    return hahaha(s1.c_str(), s2.c_str());
}

int main()
{
    std::string s1 = "hahaha";
    std::string s2 = "HAHAHA";

    if (hahaha(s1, s2) == 0)
        std::cout << "Equal!\n";

    return 0;
}

As far as I can tell, neither are part of standard C++.
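
One way to sidestep the accidental recursion, assuming a POSIX system where <strings.h> declares strcasecmp, is to give the wrapper its own name and qualify the call. The name compare_ignore_case_posix is invented for illustration, and this is a sketch rather than a portable answer: the function is not part of ISO C or C++, and POSIX leaves its results unspecified outside the POSIX locale.

```cpp
#include <string>
#include <strings.h>  // POSIX header, not ISO C or C++: declares strcasecmp

// Hypothetical wrapper: a distinct name plus the :: qualifier means the call
// can only resolve to the C library function, never to this wrapper itself.
inline int compare_ignore_case_posix(const std::string& s1, const std::string& s2)
{
    return ::strcasecmp(s1.c_str(), s2.c_str());
}
```

If the header does not declare strcasecmp, this fails to compile instead of recursing at run time.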

Yes, at some point in time I'm going to have to change to
std::lexicographical_compare, or is there anything else I can try for
case insensitive compares on std::string objects?

Thomas J. Gritzan · Dec 29, 2008

James said:
stricmp() isn't standard whilst strcasecmp() is standard
ANSI/ISO.

It's not present in any version of the standard I have handy
(C++98, C99, and the latest C++ draft). The standard C++
functional object for comparing strings in a locale dependent
way is std::locale (which has an operator() which does exactly
what is needed for lexicographical_compare). And as any
comparisons involving case are locale sensitive, it's really what
you need, e.g.:

if ( std::lexicographical_compare(
word->begin(), word->end(),
args[ 0 ].begin(), args[ 0 ].end(),
std::locale() ) ) {...}

(or std::locale( "xxx" ), with whatever locale you want).

operator() of std::locale works on strings by itself. You could use
operator() directly:

/* true, if *word < args[0] */
if ( std::locale()(*word, args[0]) ) {...}

But does std::locale()() really compare case insensitive?

James Kanze · Dec 29, 2008

But does std::locale()() really compare case insensitive?

The answer to that is a definite maybe. It does (or it should)
in locales where case insensitive comparison makes sense. And
it does so correctly, matching "Straße" and "STRASSE" (or
"ändern" and "Aendern", in Switzerland, but not in Germany).
And "I" and "i" won't compare equal in a Turkish locale. Since
the "C" locale is designed for parsing C code, and the POSIX
locale for working in a Posix environment (including the file
systems and filenames), the comparison in those locales will NOT
be case insensitive.
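
A small sketch of that last point, using std::locale::classic() (the "C" locale) so the result does not depend on the environment. The wrapper name is made up for illustration:

```cpp
#include <locale>
#include <string>

// Hypothetical wrapper: true if a collates before b in the classic "C" locale.
// std::locale::operator() compares whole strings via the locale's collate facet.
inline bool collates_before_in_c_locale(const std::string& a, const std::string& b)
{
    return std::locale::classic()(a, b);
}
```

Here collates_before_in_c_locale("ABC", "abc") is true and the reverse is false, so the two strings are ordered rather than treated as equivalent: the comparison is case sensitive in this locale.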

And of course, you can always define your own locale. (At
least, that's what it says. In practice, it takes a pretty high
level of C++ competence to do it reliably. More than I have, at
any rate.)

Thomas J. Gritzan · Dec 29, 2008

James said:
Just a reminder, but this is, of course, undefined behavior.

#include <locale>

struct compare_ignore_case_equals
{
    compare_ignore_case_equals(const std::locale& loc_ = std::locale())
        : loc(loc_) {}

    bool operator()(char c1, char c2) const
    {
        return std::tolower(c1, loc) == std::tolower(c2, loc);
    }

private:
    std::locale loc;
};

How about this? Doesn't depend on users locale, you can provide your own
locale, and isn't UB.

Why does ::toupper actually take an int?

Thomas J. Gritzan · Dec 29, 2008

James said:
On Dec 29, 3:10 pm, "Thomas J. Gritzan" [...]

But does std::locale()() really compare case insensitive?

The answer to that is a definite maybe. [...]

If you want to parse commands case insensitively, like in a shell, script
interpreter or text based protocol, a maybe isn't enough.

And of course, you can always define your own locale. (At
least, that's what it says. In practice, it takes a pretty high
level of C++ competence to do it reliably. More than I have, at
any rate.)

Then it would be easier to build a comparison predicate with
std::toupper/tolower as I showed else-thread.

What do people do for multibyte encodings like UTF-8?

jason.cipriani · Dec 30, 2008

Why does ::toupper actually take an int?

See particularly Eric Sosman's response to the OP here (message #2):

http://groups.google.com/group/comp.lang.c/browse_frm/thread/3b27e652f1a7ab32

The other immediate responses to the OP are also informative.

Jason

Thomas J. Gritzan · Dec 30, 2008

Daniel said:
Still doesn't work with lexicographical_compare...

Replace the == with < and you've got the ordering predicate needed for
lexicographical_compare.
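
Spelled out, that suggestion might look like the sketch below. The functor follows the shape of the earlier compare_ignore_case_equals, with == swapped for <. The helper names less_ignore_case and equivalent_ignore_case are invented here, and defaulting them to std::locale::classic() is an assumption made to keep the result independent of the global locale.

```cpp
#include <algorithm>
#include <locale>
#include <string>

// Ordering predicate: same shape as the equality functor, but using <.
struct compare_ignore_case_less
{
    compare_ignore_case_less(const std::locale& loc_ = std::locale())
        : loc(loc_) {}

    bool operator()(char c1, char c2) const
    {
        return std::tolower(c1, loc) < std::tolower(c2, loc);
    }

private:
    std::locale loc;
};

// Hypothetical helper: true if a sorts strictly before b, ignoring case.
inline bool less_ignore_case(const std::string& a, const std::string& b,
                             const std::locale& loc = std::locale::classic())
{
    return std::lexicographical_compare(a.begin(), a.end(), b.begin(), b.end(),
                                        compare_ignore_case_less(loc));
}

// Hypothetical helper: two strings are equivalent when neither sorts first.
inline bool equivalent_ignore_case(const std::string& a, const std::string& b,
                                   const std::locale& loc = std::locale::classic())
{
    return !less_ignore_case(a, b, loc) && !less_ignore_case(b, a, loc);
}
```

Testing for a match then takes two ordering calls, one in each direction, because a single call only answers whether the first string sorts before the second.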

James Kanze · Dec 30, 2008

James Kanze wrote:

On Dec 29, 3:10 pm, "Thomas J. Gritzan" [...]

But does std::locale()() really compare case insensitive?

The answer to that is a definite maybe. [...]

If you want to parse commands case insensitively, like in a
shell, script interpreter or text based protocol, a maybe
isn't enough.

The problem is that case insensitive comparison is locale
dependent. So of course, you have to involve the locale
somehow. But yes, there is a gap between literal comparison
(all bytes equal) and locale dependent collating (which can
involve a number of things, e.g. "é" compares equal to "E", "ä"
collates as "ae", etc.). And there's no real support for anything
between these two extremes in the language (either C or C++).

Then it would be easier to build a comparison predicate with
std::toupper/tolower as I showed else-thread.

Probably. You have to define what equality actually means
first (e.g. does "ß" compare equal to "SS"), but for things like
filenames and interpreter commands, you're often limited to a
small set of characters where the definition isn't too
difficult. (This is becoming less and less true with regards to
filenames, of course.)

What do people do for multibyte encodings like UTF-8?

A lot of hand written code. In practice, you can't count on
the presence of a UTF-8 locale, and you can't count on it working
right if it's present. Note too that anything case insensitive
will still be locale dependent, even if you limit it to UTF-8;
in practice, if you want case insensitivity over the full
Unicode range, you have a lot of defining to do (although the
Unicode Consortium data files help a lot).

James Kanze · Dec 30, 2008

James Kanze wrote:

#include <locale>

struct compare_ignore_case_equals
{
    compare_ignore_case_equals(const std::locale& loc_ = std::locale())
        : loc(loc_) {}

    bool operator()(char c1, char c2) const
    {
        return std::tolower(c1, loc) == std::tolower(c2, loc);
    }

private:
    std::locale loc;
};

How about this? Doesn't depend on users locale, you can
provide your own locale, and isn't UB.

I'm not sure what you mean by "doesn't depend on the user's
locale". The constructor std::locale() creates a copy of the
current global locale, which if you're writing library code, is
unknown, but which will usually be the user's locale, since the
very first action in most main functions is to set the global
locale to "".

Why does ::toupper actually take an int?

So that things like:

for ( int ch = getchar() ; isspace( ch ) ; ch = getchar() )
...

work. It is defined for EOF, as well as all of the values in
the range 0...UCHAR_MAX. (The reason for toupper, of course, is
coherence---all of the functions in <ctype.h> take the same type
of argument.) It's a useful idiom; I still use it a lot (not
with ::toupper, etc., but with some of my own stuff).

The real question is why plain char is allowed to be signed, if
it is intended to contain "characters". I don't know of any
character encoding which uses negative values.

Alex Buell · Dec 30, 2008

Replace the == with < and you've got the ordering predicate needed
for lexicographical_compare.

You might want to look at the OP's question again. His complaint (as
can be seen by the subject line) was that "lexicographical_compare
with ignore case *equality* doesn't always work." [stress added]
Think about that sentence for a second...

If the OP hasn't already figured it out, lexicographical_compare
isn't *designed* to work with equality functors in the first place.

[pained grin]

Yeah.

Perhaps this should be a FAQ: How do we do a case insensitive equality
compare on std::string values?
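
For plain single-byte text such as command names, one possible answer is to use std::equal with an equality predicate, since equality predicates are what that algorithm expects. This is a minimal sketch and not a general solution: iequals and char_equal_ignore_case are made-up names, the <cctype> functions follow the current C locale, and one-to-many mappings such as "ß" against "SS" or multibyte encodings like UTF-8 are not handled.

```cpp
#include <algorithm>
#include <cctype>
#include <string>

// Per-character equality, ignoring case. The unsigned char cast keeps the
// argument to tolower inside the range the <cctype> functions accept.
inline bool char_equal_ignore_case(char c1, char c2)
{
    return std::tolower(static_cast<unsigned char>(c1))
        == std::tolower(static_cast<unsigned char>(c2));
}

// Hypothetical helper: whole-string equality, ignoring case. The size check
// comes first because the three-iterator std::equal assumes the second range
// is at least as long as the first.
inline bool iequals(const std::string& a, const std::string& b)
{
    return a.size() == b.size()
        && std::equal(a.begin(), a.end(), b.begin(), char_equal_ignore_case);
}
```

In the original program the test inside the loop would then read iequals(*word, args[0]).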

James Kanze · Dec 30, 2008

It also won't work reliably for all languages. Personally I
don't think anything will work reliably for all languages. A
programmer is better off IMHO to ignore locales and the "upper"
and "lower" functions in <cctype>, and write his own code that
works with the languages he has to deal with.

It's supposed to work reliably for all supported locales. (A
locale is more than just a language.) Which is sort of vague:
the standard doesn't make any requirements with regards to what
locales are supported (other than "C"), and it leaves the
definition as to what the behavior is in a given locale
"implementation defined".

If you're targeting a single compiler, for a single locale or a
small set of locales, and that compiler provides them, and they
behave "correctly" (for your definition of "correctly"), there's
no problem with using locales for this. Otherwise, you're
right: it can be a bit tricky.

jason.cipriani · Dec 30, 2008

You might want to look at the OP's question again. His complaint (as
can be seen by the subject line) was that "lexicographical_compare
with ignore case *equality* doesn't always work." [stress added]
Think about that sentence for a second...

If the OP hasn't already figured it out, lexicographical_compare
isn't *designed* to work with equality functors in the first place.

[pained grin]

Yeah.

Perhaps this should be a FAQ: How do we do a case insensitive equality
compare on std::string values?

Why? It's easy enough to find on Google already. Here is a good
article discussing all of the issues with proposed solutions, which
everybody involved in this thread should read:

http://lafstern.org/matt/col2_new.pdf

It was linked to from GCC's page on case-insensitive strings:

http://gcc.gnu.org/onlinedocs/libstdc++/manual/bk01pt05ch13s02.html

Which was linked to in a forum post in the first Google result for
"std string case insensitive compare":

http://bytes.com/groups/c/489747-lowercase-std-string-compare

Although it did require a bit of poking around on gcc.gnu.org since
the link in the forum post was actually broken.

Jason

Alex Buell · Dec 30, 2008

Thanks for all that, I'd already seen some of these pages.

Alex Buell · Dec 30, 2008

As this thread, and every other thread/article on the subject shows,
it is a rather complex subject. Pretty much any subject that deals
with natural language is.

I suggest you don't perform case insensitive compares in your code.

Seems a lot of thought has gone into designing the STL libraries. I've
just been playing with std::locale and std::locale::global, with
currencies. I can see how useful this can be in conjunction with glibc.
