std::string and case insensitive comparison

M

Mosfet

Hi,

what is the most efficient way of doing a case insensitive comparison ?
I am trying to write a universal String class and I am stuck with the
case insensitive part :

TCHAR is a char in MultiByte String env (MBCS)
and wchar_t if UNICODE

#if defined(WIN32) || defined(UNDER_CE)
typedef std::basic_string<TCHAR, std::char_traits<TCHAR>,
std::allocator<TCHAR> > tstring;
#else

#endif
#endif


class String : public Object
{
private:
tstring m_str;

public:
String(){}

String(LPCTSTR lpsz)
{
m_str = lpsz;
}

String(tstring str)
{
m_str = str;
}
// Comparison
int Compare( LPCTSTR psz ) const
{
return m_str.compare(psz);
}

int CompareNoCase( LPCTSTR psz ) const
{
???
}

// Convert the string to lowercase
String& MakeLower( LPCTSTR psz )
{
std::transform(m_str.begin(), m_str.end(), m_str.begin(), tolower);
return *this;
}

}
 
?

=?ISO-8859-1?Q?Erik_Wikstr=F6m?=

Hi,

what is the most efficient way of doing a case insensitive comparison ?
I am trying to write a universal String class and I am stuck with the
case insensitive part :

Usually it involves modifying both strings to either upper- or lower-
case and then compare. Checkout the toupper()/tolower() functions from
<cctype>/ctype.h>.
 
J

Jim Langston

Mosfet said:
Hi,

what is the most efficient way of doing a case insensitive comparison ?
I am trying to write a universal String class and I am stuck with the case
insensitive part :

TCHAR is a char in MultiByte String env (MBCS)
and wchar_t if UNICODE

#if defined(WIN32) || defined(UNDER_CE)
typedef std::basic_string<TCHAR, std::char_traits<TCHAR>,
std::allocator<TCHAR> > tstring;
#else

#endif
#endif


class String : public Object
{
private:
tstring m_str;

public:
String(){}

String(LPCTSTR lpsz)
{
m_str = lpsz;
}

String(tstring str)
{
m_str = str;
}
// Comparison
int Compare( LPCTSTR psz ) const
{
return m_str.compare(psz);
}

int CompareNoCase( LPCTSTR psz ) const
{
???
}

// Convert the string to lowercase
String& MakeLower( LPCTSTR psz )
{
std::transform(m_str.begin(), m_str.end(), m_str.begin(), tolower);
return *this;
}

}

This is what I use. I'm not sure if it's optimal, but it works.

bool StrLowCompare( std::string String1, std::string String2 )
{
std::transform(String1.begin(), String1.end(), String1.begin(),
tolower);
std::transform(String2.begin(), String2.end(), String2.begin(),
tolower);
return String1 == String2;
}
 
M

Mosfet

Jim Langston a écrit :
This is what I use. I'm not sure if it's optimal, but it works.

bool StrLowCompare( std::string String1, std::string String2 )
{
std::transform(String1.begin(), String1.end(), String1.begin(),
tolower);
std::transform(String2.begin(), String2.end(), String2.begin(),
tolower);
return String1 == String2;
}
Ir doesn't seem very efficient...
 
J

Jim Langston

Mosfet said:
Jim Langston a écrit :
Ir doesn't seem very efficient...

No, it doesn't. But then, there is no way to do a case insensitive compare
without converting both to either upper or lower. Or to determine if one is
uppercase before converting to lower, but it probably takes about the same
amount of time for the if statement.

Basically, that's the way case insensitve works. You convert both to upper
or lower, then compare, or compare character by character converting.

It may be faster to compare character by character and see if you can return
early without having to go through the whole string, but the you're doing a
bunch of if statments anyway. I.E something like: (untested code)

bool StrLowCompare( std::string& String1, std::string& String2 )
{
if ( String1.size() != String2.size() )
return false;

for ( std::string::size_type i = 0; i < String1.size(); ++i )
{
if ( tolower( String1 ) != tolower( String2 )
return false;
}
return true;
}
 
J

John Harrison

bool StrLowCompare( std::string& String1, std::string& String2 )
{
if ( String1.size() != String2.size() )
return false;

for ( std::string::size_type i = 0; i < String1.size(); ++i )
{
if ( tolower( String1 ) != tolower( String2 )
return false;
}
return true;
}


If I had a pound for everytime this mistake is made I would be as rich
as Bill Gates.


tolower( String1 )

is undefined since char may be signed and therefore you may pass a
negative number to tolower. tolower is only defined on integer values in
the range of unsigned char and the value of EOF.

tolower( (unsigned char) String1 )

is correct.

This also means that

std::transform(str.begin(), str.end(), tolower)

is undefined for the same reason.

john
 
K

Kai-Uwe Bux

John said:
bool StrLowCompare( std::string& String1, std::string& String2 )
{
if ( String1.size() != String2.size() )
return false;

for ( std::string::size_type i = 0; i < String1.size(); ++i )
{
if ( tolower( String1 ) != tolower( String2 )
return false;
}
return true;
}


If I had a pound for everytime this mistake is made I would be as rich
as Bill Gates.


tolower( String1 )

is undefined since char may be signed and therefore you may pass a
negative number to tolower. tolower is only defined on integer values in
the range of unsigned char and the value of EOF.

tolower( (unsigned char) String1 )

is correct.

This also means that

std::transform(str.begin(), str.end(), tolower)

is undefined for the same reason.


That wording is a little too harsh. The above code has perfectly
well-defined behavior for quite a lot of input values. To dismiss it as
undefined is like saying *p is undefined since p might be null. I agree,
however, that one can and should do better.

For the use in std::transform(), I would suggest a function object like
this:

#include <locale>
#include <string>
#include <iostream>
#include <algorithm>

class to_lower {

std::locale const & loc;

public:

to_lower ( std::locale const & r_loc = std::locale() )
: loc ( r_loc )
{}

template < typename CharT >
CharT operator() ( CharT chr ) const {
return( std::tolower( chr, this->loc ) );
}

}; // class to_lower;

int main ( void ) {
std::string str ( "Hello World!" );
std::transform ( str.begin(), str.end(), str.begin(), to_lower() );
std::cout << str << '\n';
}


Best

Kai-Uwe Bux
 

nmi

Joined
Jul 6, 2007
Messages
13
Reaction score
0
Mosfet said:
Hi,

what is the most efficient way of doing a case insensitive comparison ?

A simple mapping table. For example a 256 long array of unsigned char-s. You fill it up first so that every item contains its index, e.g.:

maptbl[32] = 32...

Then you map lower case and upper case letters to the same value:

maptbl[(unsigned char)'a'] = 'A';
maptbl[(unsigned char)'b'] = 'B';

and so on.

In the comparison function you index into the table with the characters to be compared and compare what you find there:

if(maptbl[(unsigned char)char1] == maptbl[(unsigned char)char2])
printf("hey, these are equal!!!");

It works and it is as fast as can be.
 
J

James Kanze

news:[email protected]... [...]
This is what I use. I'm not sure if it's optimal, but it works.
bool StrLowCompare( std::string String1, std::string String2 )
{
std::transform(String1.begin(), String1.end(), String1.begin(),
tolower);
std::transform(String2.begin(), String2.end(), String2.begin(),
tolower);
return String1 == String2;
}

Using which headers? It shouldn't compile if you happen to
include <locale>, and <string> may (or may not) include
<locale>. (With g++, <string> doesn't include <locale>, but
<iostream> does, so if you happen to also include <iostream>, it
doesn't compile.) If it does compile, it has undefined
behavior if char is signed. (Of course, g++ and VC++ have
options to force char to be unsigned, so if you're using these,
you may be OK.)

If you want to use transform, the correct invocation is:

std::transform(
source.begin(), source.end(),
std::back_inserter( dest ),
boost::bind(
&Cvt::tolower,
&std::use_facet< Cvt >( std::locale() ),
_1 ) ) ;

You need boost::bind, or else write you're own functional
object.

Note that using std::equal with boost::transform_iterator (and
the functional object used with transform---either your own, or
the results of boost::bind) will allow the comparison without
making a copy.
 
J

James Kanze

"Mosfet" <[email protected]> wrote in message

[...]
No, it doesn't. But then, there is no way to do a case
insensitive compare without converting both to either upper or
lower. Or to determine if one is uppercase before converting
to lower, but it probably takes about the same amount of time
for the if statement.

If you do it on a character by character basis, you can avoid
the copy of the string. If the string is long, this could be
significant.

On the other hand, if you're doing a lot of comparisons, it's
probably better to convert the strings once, and then just use
== on them.
Basically, that's the way case insensitve works. You convert
both to upper or lower, then compare, or compare character by
character converting.

Actually, it's more complicated than that. In German, for
example, in a case insensitive comparison, the single character
'ß' must compare equal to the two character sequence "SS"; in
Swiss German (and according to DIN, I think), 'ä' compares equal
to "AE", etc., where as in Turkish, 'i' shouldn't compare equal
to 'I'. Just defining what you actually mean by "case
insensitive compare" is a nightmare, even before you start
trying to implement it.
It may be faster to compare character by character and see if
you can return early without having to go through the whole
string, but the you're doing a bunch of if statments anyway.

There's never the slightest need for an if; tolower is generally
implemented as a simple table mapping. By comparing character
by character, however, you save copying the string.
 
J

James Kanze

[...]
If I had a pound for everytime this mistake is made I would be as rich
as Bill Gates.
tolower( String1 )
is undefined since char may be signed and therefore you may
pass a negative number to tolower. tolower is only defined
on integer values in the range of unsigned char and the
value of EOF.
tolower( (unsigned char) String1 )
is correct.
This also means that
std::transform(str.begin(), str.end(), tolower)
is undefined for the same reason.

That wording is a little too harsh. The above code has perfectly
well-defined behavior for quite a lot of input values.

By "the above code", which example to you mean? "tolower(
String1 )" has undefined behavior for slightly more than half
all input values if char is signed (as it is by default with
most C++ compilers).
To dismiss it as undefined is like saying *p is undefined
since p might be null.

If p might be null, it is undefined. That's why we generally
check it before hand, or require the user to do so. If the
specification of his StrLowCompare function specifically says
that the behavior is undefined if e.g. either of the strings
actually contains a character not in the basic execution
character set, then he's off the hook. But then every user must
verify any strings which contain characters from the outside.
And it's a pain, because a lot of normal text does contain
characters outside the basic execution character set.
I agree,
however, that one can and should do better.
For the use in std::transform(), I would suggest a function object like
this:
#include <locale>
#include <string>
#include <iostream>
#include <algorithm>
class to_lower {
std::locale const & loc;
public:
to_lower ( std::locale const & r_loc = std::locale() )
: loc ( r_loc )
{}

The defaul argument (and probably most of the arguments a user
will pass here) are temporaries, and will leave you with a
dangling reference once you return from the constructor. The
loc member should not be a temporary.
template < typename CharT >
CharT operator() ( CharT chr ) const {
return( std::tolower( chr, this->loc ) );
}
}; // class to_lower;

I'd suggest extracting the ctype facet once up front, since
that's what std::tolower is going to do anyway.

For most applications, using a std::ctype<char> const* as the
member is probably the appropriate solution, e.g. :

template< typename charT >
class toLower
{
public:
typedef std::ctype< charT >
CType ;
explicit toLower( std::locale const& loc =
std::locale() )
: myCType( &std::use_facet< CType >( loc ) )
{
}

charT operator( charT in ) const
{
return myCType->tolower( in ) ;
}

private:
CType const* myCType ;
} ;

This has a potential problem with the lifetime of the facet if
the user passes it a temporary locale, or changes the locale
while instance of the class is alive. A perfectly robust
solution requires keeping a copy of the locale in the object as
well (which in turn makes copying it significantly more
expensive).
int main ( void ) {
std::string str ( "Hello World!" );
std::transform ( str.begin(), str.end(), str.begin(), to_lower() );

This actually will work with your code, because the temporary
passed to the constructor of to_lower will last until the end of
the full expression. Something like:

to_lower l ;
std::transform( s1.begin(), s1.end(), s1.begin(), l ) ;

won't, however. And it's what I'd naturally write if I wanted
to call transform on a number of strings. e.g.:

to_lower l ;
for ( std::vector< std::string > it = v.begin() ;
it != v.end() ;
++ it ) {
std::transform( it->begin(), it->end(), it->begin(), l ) ;
}
std::cout << str << '\n';
}

In professional code, I agree that using <locale> is the way to
go. But <locale> was designed to make it particularly difficult
to use. For a beginner, I'd suggest writing your own functional
object with the tolower in <ctype>, and casting the char to
unsigned char. While less flexible as a solution based on
<locale>, it's an order of magnitude (or more) simpler to write
and understand.
 
K

Kai-Uwe Bux

James said:
The defaul argument (and probably most of the arguments a user
will pass here) are temporaries, and will leave you with a
dangling reference once you return from the constructor. The
loc member should not be a temporary.

Right. I lifted this code from some inner parts where to_lower is only used
as to_lower() in expressions where the life-time extension for const
references ensure that no BadThings(tm) happen. The code is not ready for
re-use.
I'd suggest extracting the ctype facet once up front, since
that's what std::tolower is going to do anyway.

That's a good idea. I did some benchmarking, at it appears that pulling this
step out of the inner loop really pays off.

For most applications, using a std::ctype<char> const* as the
member is probably the appropriate solution, e.g. :

template< typename charT >
class toLower
{
public:
typedef std::ctype< charT >
CType ;
explicit toLower( std::locale const& loc =
std::locale() )
: myCType( &std::use_facet< CType >( loc ) )
{
}

charT operator( charT in ) const
{
return myCType->tolower( in ) ;
}

private:
CType const* myCType ;
} ;

This has a potential problem with the lifetime of the facet if
the user passes it a temporary locale, or changes the locale
while instance of the class is alive. A perfectly robust
solution requires keeping a copy of the locale in the object as
well (which in turn makes copying it significantly more
expensive).

One can use a shared_ptr<> to circumvent this. I tested both versions, and
they seem to be equally efficient (in a single-threaded environment, that
is). Test code follows:


#include <iostream>
#include <algorithm>
#include <tr1/memory>
#include <cstdlib>

template < typename CharT >
class to_lower_a {

typedef std::ctype< CharT > char_type;

std::tr1::shared_ptr< std::locale > the_loc_ptr;
char_type const * the_type_ptr;

public:

to_lower_a ( std::locale const & r_loc = std::locale() )
: the_loc_ptr ( new std::locale ( r_loc ) )
, the_type_ptr ( &std::use_facet< char_type >( *the_loc_ptr ) )
{}

CharT operator() ( CharT chr ) const {
return ( the_type_ptr->tolower( chr ) );
}

}; // class to_lower;

template < typename CharT >
class to_lower_b {

typedef std::ctype< CharT > char_type;

char_type const * the_type_ptr;

public:

to_lower_b ( std::locale const & r_loc = std::locale() )
: the_type_ptr ( &std::use_facet< char_type >( r_loc ) )
{}

CharT operator() ( CharT chr ) const {
return ( the_type_ptr->tolower( chr ) );
}

}; // class to_lower;

class to_lower_any {

std::locale the_loc;

public:

to_lower_any ( std::locale const & loc = std::locale() )
: the_loc ( loc )
{}

template < typename CharT >
CharT operator() ( CharT chr ) const {
return ( tolower( chr, the_loc ) );
}

}; // class to_lower;


unsigned long loop_to_lower_a ( unsigned long runs ) {
std::srand( 412341 );
unsigned long sum = 0;
std::locale copy;
to_lower_a< char > l ( copy );
while ( runs -- != 0 ) {
char dummy = std::rand() % 100;
sum += l ( dummy );
}
return ( sum );
}

unsigned long loop_to_lower_b ( unsigned long runs ) {
std::srand( 412341 );
unsigned long sum = 0;
std::locale copy;
to_lower_b< char > l ( copy );
while ( runs -- != 0 ) {
char dummy = std::rand() % 100;
sum += l ( dummy );
}
return ( sum );
}

unsigned long loop_to_lower_any ( unsigned long runs ) {
std::srand( 412341 );
unsigned long sum = 0;
to_lower_any l;
while ( runs -- != 0 ) {
to_lower_any ll ( l ); // stupid;
char dummy = std::rand() % 100;
sum += l ( dummy );
}
return ( sum );
}



int main ( void ) {
// std::cout << loop_to_lower_a( 100000000 ) << '\n';
std::cout << loop_to_lower_b( 100000000 ) << '\n';
// std::cout << loop_to_lower_any( 100000000 ) << '\n';
}




sys 0m0.112s
1487052608

real 0m10.890s
user 0m9.353s
sys 0m0.232s
1487052608

real 0m6.259s
user 0m5.424s
sys 0m0.084s
1487052608

real 0m6.266s
user 0m5.264s
sys 0m0.180s
1487052608

real 0m6.263s
user 0m5.384s
sys 0m0.156s

real 0m1.503s
user 0m1.300s
sys 0m0.024s

news_group> while true; do time to_lower_b; done
1487052608

real 0m6.013s
user 0m5.360s
sys 0m0.088s
1487052608

real 0m6.345s
user 0m5.436s
sys 0m0.100s
1487052608

real 0m9.603s
user 0m8.169s
sys 0m0.196s
1487052608

real 0m7.154s
user 0m6.124s
sys 0m0.184s
1487052608

real 0m7.217s
user 0m6.092s
sys 0m0.144s

real 0m0.715s
user 0m0.624s
sys 0m0.020s

news_group> while true; do time to_lower_any; done
1487052608

real 0m11.045s
user 0m9.525s
sys 0m0.180s
1487052608

real 0m11.127s
user 0m9.457s
sys 0m0.220s
1487052608

real 0m11.065s
user 0m9.517s
sys 0m0.172s
1487052608

real 0m11.528s
user 0m9.537s
sys 0m0.320s
1487052608

real 0m11.512s
user 0m9.433s
sys 0m0.200s
1487052608

real 0m11.848s
user 0m9.569s
sys 0m0.208s

real 0m9.682s
user 0m8.113s
sys 0m0.168s



This actually will work with your code, because the temporary
passed to the constructor of to_lower will last until the end of
the full expression. Something like:

to_lower l ;
std::transform( s1.begin(), s1.end(), s1.begin(), l ) ;

won't, however. And it's what I'd naturally write if I wanted
to call transform on a number of strings. e.g.:

to_lower l ;
for ( std::vector< std::string > it = v.begin() ;
it != v.end() ;
++ it ) {
std::transform( it->begin(), it->end(), it->begin(), l ) ;
}

Yup. My bad.

In professional code, I agree that using <locale> is the way to
go. But <locale> was designed to make it particularly difficult
to use. For a beginner, I'd suggest writing your own functional
object with the tolower in <ctype>, and casting the char to
unsigned char. While less flexible as a solution based on
<locale>, it's an order of magnitude (or more) simpler to write
and understand.

Agreed. However, this question comes up regularly on this group. Maybe it
would be worthwhile to have a truly robust and efficient solution in the
archives or the FAQ. As of now, I would vote for to_lower_a<>.


Best

Kai-Uwe Bux
 
J

James Kanze

[...]
That's a good idea. I did some benchmarking, at it appears
that pulling this step out of the inner loop really pays off.

I didn't want to stress it too much, because you do want to get
the code working first. But in this case, it will come up,
sooner or later. And once you're familiar with the way facets
work, it's just as natural, or even more so, to use them
directly.
One can use a shared_ptr<> to circumvent this.

std::locale itself is more or less required to use the
equivalent of an intrusive shared_ptr for each of its facettes.
It seems somehow abherent to then be required to allocate the
std::locale itself dynamically, and use a shared_ptr for it.
But I suspect that if there is much copying going on, this might
be the fastest way to go.
I tested both versions, and they seem to be equally efficient
(in a single-threaded environment, that is).

A priori:

-- If you can be sure that at least one locale using the facet
will stay alive, the fastest solution is to not worry about
it. For an application specific to_lower, this is likely
the case; not many applications play with locales once
they've set the global locale in main. For a generic
to_lower, on the other hand, it's playing with fire, even if
you document the restriction.

-- Copying a to_lower object is probably a lot faster using the
shared_ptr, rather than copying a complete std::locale
object; the shared_ptr requires updating one counter,
copying locale requires updating a counter for each facet
the locale contains.

-- Creating a to_lower object is probably a lot faster using
the complete copy of the std::locale object, since that
doesn't entail any dynamic allocation, just updating
something like 7 counters. (I am assuming that on most
systems, a dynamic allocation will require significantly
more work.)

-- Various compiler optimizations may change the number of
copies, so any measurements you make may be invalidated with
the next release of the compiler:).

[...]
Agreed. However, this question comes up regularly on this
group. Maybe it would be worthwhile to have a truly robust and
efficient solution in the archives or the FAQ. As of now, I
would vote for to_lower_a<>.

That's basically I'm currently using:).

I only recently got around to putting this into my library, and
I'm having problems with my provider (nothing new---I'm trying
to use ADSL, I'm five kilometers from the switch, and the line
is old, runs next to a high density electric rail line, and very
noisy), and will not be updating the version on the web
immediately. I'm looking into alternatives, however, and hope,
sometime soon. (But these sort of things always end up taking
more time than I imagine.)
 
M

Milburn Young

Hi,

what is the most efficient way of doing a case insensitive comparison ?
I am trying to write a universal String class and I am stuck with the
case insensitive part :

TCHAR is a char in MultiByte String env (MBCS)
and wchar_t if UNICODE

#if defined(WIN32) || defined(UNDER_CE)
typedef std::basic_string<TCHAR, std::char_traits<TCHAR>,
std::allocator<TCHAR> > tstring;
#else

#endif
#endif

class String : public Object
{ ....
}
I have no idea why you have an Object class unless you are trying to
mimic Java. In "Exceptional C++" by Herb Sutter, there is an
excellent solution: provide your own char_traits<TCHAR> (pp. 5-6).
You will have to overload operators <<, >>, =, +, +=, -, -=, etc. (or
always call c_str()).

template<typename CHAR>
class ci_char_traits: public char_traits<CHAR>
{
public: static bool eq( CHAR c1, CHAR c2 )
{
return( toupper( static_cast<unsigned char>( c1 ))
== toupper( static_cast<unsigned char>( c2 )));
}
public: static bool lt( CHAR c1, CHAR c2 )
{
return( toupper( static_cast<unsigned char>( c1 ))
< toupper( static_cast<unsigned char>( c2 )));
}
public: static int compare( CHAR *s1, CHAR *s2, size_t n )
{
return( memicmp( s1, s2, n ));
}
public: static const CHAR *find( const CHAR *s, int n, CHAR a )
{
while((n-- > 0) && toupper( static_cast<unsigned char>( *s ))
!= toupper( static_cast<unsigned char>( a )))
{
++s;
}
return((n > 0)?(s):(NULL));
}
};

static_cast<>'s would've been forgotten if it hadn't been for John
Harrison. Keep in mind James Kanze's comments about equality. For
more information, search the Internet for Java collator classes.

Milburn Young
 
J

Jim Langston

John Harrison said:
bool StrLowCompare( std::string& String1, std::string& String2 )
{
if ( String1.size() != String2.size() )
return false;

for ( std::string::size_type i = 0; i < String1.size(); ++i )
{
if ( tolower( String1 ) != tolower( String2 )
return false;
}
return true;
}


If I had a pound for everytime this mistake is made I would be as rich as
Bill Gates.


tolower( String1 )

is undefined since char may be signed and therefore you may pass a
negative number to tolower. tolower is only defined on integer values in
the range of unsigned char and the value of EOF.

tolower( (unsigned char) String1 )

is correct.

This also means that

std::transform(str.begin(), str.end(), tolower)

is undefined for the same reason.


I'm using Microsoft VC++ .net 2003. I'm using the tolower defined in
ctype.h. I just did a test, put in values from 0 to 255 (-1), displayed
their values, ran them through tolower and displayed their values. Yes,
some were negative going in, but they were also negative coming out. I
would be interested in seeing any results other than this in any compiler.
 
B

Bo Persson

Jim Langston wrote:
:: ::::
:::: bool StrLowCompare( std::string& String1, std::string& String2 )
:::: {
:::: if ( String1.size() != String2.size() )
:::: return false;
::::
:::: for ( std::string::size_type i = 0; i < String1.size(); ++i )
:::: {
:::: if ( tolower( String1 ) != tolower( String2 )
:::: return false;
:::: }
:::: return true;
:::: }
:::
::: If I had a pound for everytime this mistake is made I would be as
::: rich as Bill Gates.
:::
:::
::: tolower( String1 )
:::
::: is undefined since char may be signed and therefore you may pass a
::: negative number to tolower. tolower is only defined on integer
::: values in the range of unsigned char and the value of EOF.
:::
::: tolower( (unsigned char) String1 )
:::
::: is correct.
:::
::: This also means that
:::
::: std::transform(str.begin(), str.end(), tolower)
:::
::: is undefined for the same reason.
::
:: I'm using Microsoft VC++ .net 2003. I'm using the tolower defined
:: in ctype.h. I just did a test, put in values from 0 to 255 (-1),
:: displayed their values, ran them through tolower and displayed
:: their values. Yes, some were negative going in, but they were
:: also negative coming out. I would be interested in seeing any
:: results other than this in any compiler.

You can't test for undefined behavior, because there is no requirement
that it should be consistent. In fact, there are no requirements at
all. :)

It so happens that the MS implementation handles this by casting to
unsigned internally, and checking the range of the parameter. Possibly
because there is a compiler option to make the char type unsigned, in
which case your test must work.

It's non-portable anyway.


Bo Persson
 
J

Jim Langston

Bo Persson said:
Jim Langston wrote:
:: ::::
:::: bool StrLowCompare( std::string& String1, std::string& String2 )
:::: {
:::: if ( String1.size() != String2.size() )
:::: return false;
::::
:::: for ( std::string::size_type i = 0; i < String1.size(); ++i )
:::: {
:::: if ( tolower( String1 ) != tolower( String2 )
:::: return false;
:::: }
:::: return true;
:::: }
:::
::: If I had a pound for everytime this mistake is made I would be as
::: rich as Bill Gates.
:::
:::
::: tolower( String1 )
:::
::: is undefined since char may be signed and therefore you may pass a
::: negative number to tolower. tolower is only defined on integer
::: values in the range of unsigned char and the value of EOF.
:::
::: tolower( (unsigned char) String1 )
:::
::: is correct.
:::
::: This also means that
:::
::: std::transform(str.begin(), str.end(), tolower)
:::
::: is undefined for the same reason.
::
:: I'm using Microsoft VC++ .net 2003. I'm using the tolower defined
:: in ctype.h. I just did a test, put in values from 0 to 255 (-1),
:: displayed their values, ran them through tolower and displayed
:: their values. Yes, some were negative going in, but they were
:: also negative coming out. I would be interested in seeing any
:: results other than this in any compiler.

You can't test for undefined behavior, because there is no requirement
that it should be consistent. In fact, there are no requirements at all.
:)

It so happens that the MS implementation handles this by casting to
unsigned internally, and checking the range of the parameter. Possibly
because there is a compiler option to make the char type unsigned, in
which case your test must work.

It's non-portable anyway.


Assigning a signed value a non signed value of the same size is well
defined, isn't it?
unsigned int = -1;
is well defined, is it not?

So what if tolower expects an unsigned char and I pass it a signed char, it
is well defined, right?

Please quote from the standard where it is undefined.
 
J

James Kanze

John Harrison said:
bool StrLowCompare( std::string& String1, std::string& String2 )
{
if ( String1.size() != String2.size() )
return false;
for ( std::string::size_type i = 0; i < String1.size(); ++i )
{
if ( tolower( String1 ) != tolower( String2 )
return false;
}
return true;
}

If I had a pound for everytime this mistake is made I would be as rich as
Bill Gates.
tolower( String1 )
is undefined since char may be signed and therefore you may pass a
negative number to tolower. tolower is only defined on integer values in
the range of unsigned char and the value of EOF.
tolower( (unsigned char) String1 )
is correct.
This also means that
std::transform(str.begin(), str.end(), tolower)
is undefined for the same reason.

I'm using Microsoft VC++ .net 2003. I'm using the tolower
defined in ctype.h. I just did a test, put in values from 0
to 255 (-1), displayed their values, ran them through tolower
and displayed their values. Yes, some were negative going in,
but they were also negative coming out.

Just any random negative value isn't right.
I would be interested
in seeing any results other than this in any compiler.

The standard almost toupper to fail in an ISO 8859-15 locale, at
least if EOF is -1 (as it usually is).

Many modern implementations "protect" against this error, in so
far as possible, but creating an array where negative indexes
work, and return the correct value. (I first saw this under
Solaris, but some quick tests indicate that Linux does it as
well, and if you always get a negative value for negative input,
it's likely that VC++ does too. On the other hand, I've
definitly used more than a few systems in the past where it
really did return random results.) As far as I know, however,
this is nowhere a documented extension---the behavior remains
undefined. In general, it can't be made to work 100% if EOF is
representable in a signed char (and since EOF is almost always
-1, it is representable in a signed char); the C standard says
very exactly what the behavior for EOF should be, so you can't
wiggle out and make it behave as if it were really 255. Thus,
for example, in locale fr.FR.ISO-8859-15, if I assign 0xFF
(Latin small letter y with dïaeresis) to a char, with g++ under
Linux, islower() still returns 0 (false), and to upper() still
returns -1 (and not 0xBF/-61, as it should).
 
J

James Kanze

[...]
It so happens that the MS implementation handles this by
casting to unsigned internally, and checking the range of the
parameter.

I would be very surprised if it does this, because that would
not be conform. It probably does like Sun, and simply ensures
that the bytes in from of the mapping array contain the right
thing. And it will probably fail for 0xFF, if the function
requires returning something other than 0xFF, because
tolower(EOF) is defined, and EOF is usually -1 (although I don't
think I've ever looked for Microsoft).

(BTW: I tend to think that Sun invented this trick, since I
first saw it under Solaris, at a time when the versions of HP/UX
and (I thing) AIX didn't use it. But to tell the truth, I have
no idea who did it first.)
Possibly because there is a compiler option to make the char
type unsigned, in which case your test must work.

G++ and VC++ both have this option. IMHO, it would have made
sense for the C committee to require that char be unsigned, and
it certainly makes sense for the implementation to make it
unsigned whenever there is no additional overhead. But
historically, most early C users were on PDP-11, where making
char unsigned had a very significant run-time overhead. And
they wrote a lot of code assuming that char was signed.
Wrong---even K&R I stress the fact that its signedness is
implementation dependant. But vendors don't like breaking
existing code, even if it's wrong, and to this day, most
implementations default to plain char being signed.
 
J

James Kanze

[...]
Assigning a signed value a non signed value of the same size is well
defined, isn't it?

Assigning a signed value to an unsigned variable is well
defined, regardless of the sizes. The signed to unsigned
conversion is defined to be modulo 2^n, where N is the number of
value bits in the target unsigned type.
unsigned int = -1;
is well defined, is it not?

Yes, for a given implementation. (The exact value will, of
course, depend on the size of a int.)
So what if tolower expects an unsigned char and I pass it a signed char, it
is well defined, right?

tolower doesn't expect an unsigned char. It expects an int,
whose value is either EOF or in the range of [0...UCAR_MAX].
Converting a negative signed char to int will preserve its
value, and the results will be negative.
Please quote from the standard where it is undefined.
From the C standard (obviously), §7.4/1:

The header <ctype.h> declares several functions useful for
classifying and mapping characters.166) In all cases the
argument is an int, the value of which shall be
representable as an unsigned char or shall equal the value
of the macro EOF. If the argument has any other value, the
behavior is undefined.

(This is from C99, the only version I can copy/paste from. But
the text is unchanged from C90, the reference version for C++.)
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Members online

Forum statistics

Threads
473,962
Messages
2,570,134
Members
46,690
Latest member
MacGyver

Latest Threads

Top