Safe to use substr?

Immortal Nephi · Jan 31, 2010

I want to know that size_type returns –1 (minus one) is safe before I
extract one string into two substrings. First example is safe and
second example is not sure.

const basic_string <char>::size_type npos = -1;
basic_string< char >::size_type begin_index, end_index, length_index;

end_index = 0;

string data = "Hello World!!", token1, token2;

First example:

begin_index = data.find_first_not_of( " ", end_index );
end_index = data.find_first_of( " ", begin_index );
token1 = data.substr( begin_index, end_index - begin_index );
length_index = token1.length();

begin_index returns 0 and end_index returns 5. substr is safe.

Second example:

begin_index = data.find_first_not_of( " ", end_index );
end_index = data.find_first_of( " ", begin_index );
token2 = data.substr( begin_index, end_index - begin_index );
length_index = token2.length();

begin_index returns 6 and end_index returns –1. Is substr safe for
token2 because end_index returns –1 indicates space character is not
found.

Another question—is size_type the same as size_t? They are always
unsigned maximum integer. Can I always copy variable from size_type
to signed integer or unsigned integer?

const basic_string <char>::size_type npos = -1;

signed int sNpos = npos;
unsigned int uNpos = npos;

Robert Fendt · Jan 31, 2010

I want to know that size_type returns –1 (minus one) is safe before I
extract one string into two substrings. First example is safe and
second example is not sure.

const basic_string <char>::size_type npos = -1;

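Why don't you just use std::string (which is a typedef of std::basic_string<char>)? It is more readable. Secondly, consider using string::npos instead of redefining it yourself. IIRC, the exact definition of npos is implementation-defined, thus it is dangerous to assume too much about it. It _is_ in fact defined as (size_t)-1 on almost all systems, but strictly speaking that depends on implementation and processor architecture.

begin_index returns 0 and end_index returns 5. substr is safe.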

Let's just say, it does what you expected it to do.

Second example:

begin_index = data.find_first_not_of( " ", end_index );
end_index = data.find_first_of( " ", begin_index );
token2 = data.substr( begin_index, end_index - begin_index );
length_index = token2.length();

begin_index returns 6 and end_index returns –1. Is substr safe for
token2 because end_index returns –1 indicates space character is not
found.

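Yes. The standard specifies that its parameters are of type string::size_type, thus (at least in case of basic_string<char> and basic_string<wchar_t>) they are definitely unsigned. So in fact you are passing a _very_ large number as second parameter. My standard library docs state that if the second parameter points beyond the string, the end of the string is assumed instead (in fact, the default value for the second parameter is string::npos).

Another question---is size_type the same as size_t? They are always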
unsigned maximum integer. Can I always copy variable from size_type
to signed integer or unsigned integer?

First question: yes, and no. size_type gets into string via traits class templates. If you are not familiar with that technique, I suggest you read up on it, since it is used extensively throughout the STL. The point being that you can adapt basic_string to just about any type of underlying data, thus it does not assume that every length is of type size_t but rather gets its definition via a traits template.

That said, it _is_ true that basic_string<char> and basic_string<wchar_t> (i.e., string and wstring) do use a definition of size_type that is identical to size_t.

Second question: no. You cannot assume that size_t is the same size as "int". It might, or it might not (in fact, e.g. on newer MSVC++ in 64bit mode, it is not!). Secondly, while it is safe to cast an unsigned value to signed and back (the resulting value being IIRC guaranteed to be identical to the original), the semantics of interpreting an unsigned value as signed if it is 'too large' are unspecified.

On most systems, casting a large unsigned number to signed yields a negative number, since that is how signed values are usually implemented. However, I don't think the standard actually specifies that casting numeric_limits<unsigned>::max() to int actually yields "-1".

Regards,
Robert

James Kanze · Jan 31, 2010

I want to know that size_type returns -1 (minus one)

size_type never contains -1. It can't, since it is an unsigned
type. (Also, variables and types don't "return" anything. Only
functions return things.)

is safe before I extract one string into two substrings.
First example is safe and second example is not sure.

const basic_string <char>::size_type npos = -1;

Which results in an implicit conversion, according to the rules
of conversion of signed to unsigned. Basically, npos will be
the largest possible value of size_type.

But why are you defining this? (And why are you using
basic_string< char > instead of the typedef std::string?) If,
for convenience, you want a local constant variable (to be able
to write npos, rather than std::string::npos), then:

std::string::size_type const npos = std::string::npos;

is the simplest solution.

basic_string< char >::size_type begin_index, end_index, length_index;

Just a general rule (good practice, not a language requirement):
don't define variables until you can initialize them.

end_index = 0;

string data = "Hello World!!", token1, token2;

First example:

begin_index = data.find_first_not_of( " ", end_index );
end_index = data.find_first_of( " ", begin_index );
token1 = data.substr( begin_index, end_index - begin_index );
length_index = token1.length();

begin_index returns 0 and end_index returns 5. substr is safe.

Second example:

begin_index = data.find_first_not_of( " ", end_index );
end_index = data.find_first_of( " ", begin_index );
token2 = data.substr( begin_index, end_index - begin_index );
length_index = token2.length();

begin_index returns 6 and end_index returns -1.

Again, end_index doesn't return anything; data.find_first_of
returns std::string::npos. Which is the largest possible value
which can be held in an std::string::size_type.

Is substr safe for token2 because end_index returns -1
indicates space character is not found.

What does the documentation for substr say? What is the meaning
of the second argument? (I don't have my copy of the standard
handy to quote exactly, but what it says is something along the
lines of "the second argument specifies the maximum length of
the returned string", and that the return value is something
like "std::string( s.begin() + position, s.begin() + position +
std::min(length, s.size() - position))".)
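To make the clamping described above concrete, here is a minimal sketch using the original poster's find/substr calls. The helper name `token_from` is hypothetical and not part of the thread. It also guards the one case that is not safe: when find_first_not_of itself returns npos, passing that as the position would make substr throw.

```cpp
#include <string>

// Hypothetical helper: returns the token that starts at or after `from`,
// using the same find/substr calls as the original post.
std::string token_from(const std::string& data, std::string::size_type from)
{
    const std::string::size_type begin = data.find_first_not_of(" ", from);
    if (begin == std::string::npos) {
        return std::string();   // only spaces remain, so there is no token
    }
    const std::string::size_type end = data.find_first_of(" ", begin);
    // If no space follows, end == npos and end - begin is a very large
    // count; substr limits the count to data.size() - begin.
    return data.substr(begin, end - begin);
}
```

With data = "Hello World!!", token_from(data, 0) gives "Hello" and token_from(data, 5) gives "World!!", even though the second call passes npos - 6 as the length.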

Another question---is size_type the same as size_t?

For std::string and std::wstring, yes. If you instantiate
std::basic_string with a non-standard allocator, not
necessarily.

They are always unsigned maximum integer.

No. size_t is an unsigned integer large enough that the size of
the largest possible object can be represented in it. I've used
machines where size_t was 16 bits, for example.

Can I always copy variable from size_type to signed integer or
unsigned integer?

There are several possible answers to that question. If you
mean copy without loss of value, the answer is no; a lot of
modern machines have a 64 bit size_type, but a 32 bit integer
type, and there's no way you can convert a 64 bit type into a 32
bit type without loss of value.

Formally, of course, you can convert to the unsigned
integer---the results of converting to the signed integer are
implementation defined, but on most implementations, the
conversion is well defined as well. But if the value doesn't
fit, you'll get some other value.

Finally, in practice, it's likely that practical constraints
mean that you won't have strings larger than what can be
represented in an int. In which case, there's no problem.

const basic_string <char>::size_type npos = -1;

signed int sNpos = npos;

The results here are implementation defined. It's very likely
that sNpos will end up -1, but it's not guaranteed by the
standard. (And if sNpos does end up -1, then the conversion
back to size_t is guaranteed, so comparison with a size_t will
work.)

unsigned int uNpos = npos;

Perfectly legal, but uNpos will not compare equal to npos on
most 64 bit machines.
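A small sketch of the unsigned case, assuming nothing beyond the standard headers; the two function names are made up for illustration. It only reports what the platform does, rather than claiming a particular width for size_type.

```cpp
#include <limits>
#include <string>

// Hypothetical check: does npos keep its value when stored in unsigned int?
bool npos_fits_in_unsigned()
{
    const unsigned int narrowed = static_cast<unsigned int>(std::string::npos);
    // narrowed is converted back to size_type for the comparison
    return narrowed == std::string::npos;
}

// Hypothetical check: can size_type hold larger values than unsigned int?
bool size_type_is_wider()
{
    return std::numeric_limits<std::string::size_type>::max()
         > std::numeric_limits<unsigned int>::max();
}
```

Where size_type is wider than unsigned int (typical for 64 bit builds), the first function returns false, which is exactly the "will not compare equal" case above.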

I'm not too clear as to what your goal is. First, for better or
for worse, std::string uses an unsigned size_t for all of its
indexing and positioning. Mixing signed and unsigned in C++
often gives surprising results, and should be avoided. (Using
unsigned for numeric values should generally be avoided as well,
but the rule about not mixing is more critical, and trumps this
rule---if an external library uses unsigned, you should stick
with whatever type it uses.)

Also, and this is really just a question of personal preference,
but I prefer by far using the algorithms in <algorithm> to the
special member functions in std::string. Once you're used to
the standard library, it just seems more comfortable working
with iterators than with indexes. And it avoids all of the
issues related to unsigned types in C++. Given that any time
you're going to be processing text, you're going to be using
functions like isalpha, isspace, etc. a lot, the first thing to
do is to define predicate object types for each of the
functions and its complement. (Macros make this fairly easy.)
Then you use them with std::find_if. So your initial example
becomes:

typedef std::string::const_iterator text_iterator;
std::string const data( "Hello, world!" );
text_iterator begin_token = std::find_if(data.begin(), data.end(),
                                         is_not_space());
text_iterator end_token = std::find_if(begin_token, data.end(),
                                       is_space());
// or is_not_alnum(), or whatever...
std::string const first_token( begin_token, end_token );

(As I say, this is a personal preference, not any established
rule. But IMHO, it fits in better with the philosophy of the
standard library.)
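The snippet above relies on predicate types that the post does not show. One possible way to write them, as a sketch only: the definitions of `is_space` and `is_not_space` below are assumptions, and the wrapper function `first_token` is hypothetical.

```cpp
#include <algorithm>
#include <cctype>
#include <string>

// Assumed definition of the predicate used in the post.
struct is_space
{
    bool operator()(char ch) const
    {
        // The cast avoids undefined behavior for negative char values.
        return std::isspace(static_cast<unsigned char>(ch)) != 0;
    }
};

// Assumed definition of its complement.
struct is_not_space
{
    bool operator()(char ch) const { return !is_space()(ch); }
};

// Hypothetical wrapper: returns the first whitespace-delimited token,
// or an empty string if there is none.
std::string first_token(const std::string& data)
{
    typedef std::string::const_iterator text_iterator;
    const text_iterator begin_token =
        std::find_if(data.begin(), data.end(), is_not_space());
    const text_iterator end_token =
        std::find_if(begin_token, data.end(), is_space());
    return std::string(begin_token, end_token);
}
```

Because find_if returns data.end() when nothing matches, there is no sentinel value to test for and no unsigned arithmetic; an all-space input simply yields an empty range.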

James Kanze · Jan 31, 2010

And thus spake Immortal Nephi
Sat, 30 Jan 2010 18:49:34 -0800 (PST):

Why don't you just use std::string (which is a typedef of
std::basic_string<char>)? It is more readable. Secondly,
consider using string::npos instead of redefining it yourself.
IIRC, the exact definition of npos is implementation-defined,
thus it is dangerous to assume too much about it. It _is_ in
fact defined as (size_t)-1 on almost all systems, but strictly
speaking that depends on implementation and processor
architecture.

The standard requires it to be defined as
static_cast< size_type >( -1 )
The implementation and processor architecture dependencies are in
the definition of size_type (which must be size_t in the default
allocator). The actual numeric value will vary, but it is well
defined, and used correctly as a sentinel value, there should be
no portability problems.

Immortal Nephi · Jan 31, 2010

Let's just say, it does what you expected it to do.

Yes. The standard specifies that its parameters are of type string::size_type, thus (at least in case of basic_string<char> and basic_string<wchar_t>) they are definitely unsigned. So in fact you are passing a _very_ large number as second parameter. My standard library docs state that if the second parameter points beyond the string, the end of the string is assumed instead (in fact, the default value for the second parameter is string::npos).

find_first_not_of() function and find_first_of() function always
return unsigned integer like size_type. The size_type gives you the
information if unsigned integer is valid or not valid.
The minimum size_type is 0 and maximum size_type is 0xFFFFFFFE (on 32
bit machine). Both integer values provide you the information how
many elements do string have. The 0xFFFFFFFF or –1 indicates that
data in the string is not found or is not valid.
Let’s discuss substr() function. The substr() function’s first
parameter must always have minimum size_type and maximum size_type.
If 0xFFFFFFFF or –1 is detected, then exception will be thrown.
The second parameter always has default 0xFFFFFFFF or –1 if you do
not assign second parameter.

For example

string data( "Hello World!" );
string token = data.substr( 0 );

The data has 11 elements in length. Notice that second parameter in
substr() function is not assigned. The default is –1. How do substr()
function know to count 11 elements correctly? It should always
count all 256 values of character set including ‘\0’.
If you insert ‘\0’ between Hello and World ( “Hello \0World!” ), then
it will count 12 elements including ‘\0’. The string object is not
like C string. It does not check null terminator and it always check
number of elements in size with size() function or length() function.

end_token = 5;
begin_token = data.find_first_not_of( " ", end_token + 1 );
end_token = data.find_first_of( " ", begin_token );

string token = data.substr( begin_token, end_token - begin_token );
length_token = token.length();

find_first_of() function returns –1 indicates space is not found.
substr() function cannot guarantee to assume to be 11. Possibly, it
will go beyond 11 elements boundary until it detects ‘\0’ and returns
the wrong end_token value.

I think that my example code above is not a good solution. I will
use iterator loop to test each element instead.

LR · Jan 31, 2010

Immortal said:
find_first_not_of() function and find_first_of() function always
return unsigned integer like size_type. The size_type gives you the
information if unsigned integer is valid or not valid.
The minimum size_type is 0 and maximum size_type is 0xFFFFFFFE (on 32
bit machine). Both integer values provide you the information how
many elements do string have. The 0xFFFFFFFF or –1 indicates that
data in the string is not found or is not valid.

Let’s discuss substr() function. The substr() function’s first
parameter must always have minimum size_type and maximum size_type.

I think you mean the argument pos must be between 0 and size().
const std::string s ("Hello World");
const std::string t = s.substr(); // pos == 0
const std::string u = s.substr(0);
const std::string v = s.substr(s.size());

If 0xFFFFFFFF or –1 is detected, then exception will be thrown.
The second parameter always has default 0xFFFFFFFF or –1 if you do
not assign second parameter.

For example

string data( "Hello World!" );
string token = data.substr( 0 );

The data has 11 elements in length. Notice that second parameter in
substr() function is not assigned. The default is –1. How do substr()
function know to count 11 elements correctly?

std::string keeps track of the length or size of the string. It doesn't
use zero termination the way C strings do.

Also, note that a std::string cannot grow to be larger than
std::string::max_size(). In the implementation I use this is
std::numeric_limits<std::string::size_type>::max()-1.

It should always
count all 256 values of character set including ‘\0’.

It will. Try this:

const std::string s =
std::string("Hello") + '\0' + std::string("World");
std::cout << s << std::endl;
std::cout << s.size() << std::endl;

If you insert ‘\0’ between Hello and World ( “Hello \0World!” ), then
it will count 12 elements including ‘\0’. The string object is not
like C string. It does not check null terminator and it always check
number of elements in size with size() function or length() function.

end_token = 5;
begin_token = data.find_first_not_of( " ", end_token + 1 );

You're not looking for the '\0'.

end_token = data.find_first_of( " ", begin_token );
Same.

string token = data.substr( begin_token, end_token - begin_token );
length_token = token.length();

I think this will work:

const std::string
data = std::string("Hello ") + '\0' + std::string("World");

const std::string look_for = std::string(" ")+'\0';
const std::string::size_type first = data.find_first_of(look_for);
const std::string::size_type
begin_token = data.find_first_not_of(look_for, first+1);
const std::string::size_type
end_token = data.find_first_of(look_for, begin_token);

const std::string
token = data.substr(begin_token, end_token-begin_token);
const std::string::size_type length_token = token.length();

LR

James Kanze · Jan 31, 2010

[...]

find_first_not_of() function and find_first_of() function
always return unsigned integer like size_type. The size_type
gives you the information if unsigned integer is valid or not
valid.

I'm afraid I don't understand that last sentence. A type can't
give you any information.

The minimum size_type is 0 and maximum size_type is 0xFFFFFFFE
(on 32 bit machine). Both integer values provide you the
information how many elements do string have.

What do you mean by "both" here? A zero value designates the
first character of the string, or indicates that the length of
the string is 0. The maximum value is used as a sentinel:
std::string::size will never return it. The only functions
which do return it are those which look for something, and they
use it as a special value, to indicate that they didn't find
what they were looking for.

The 0xFFFFFFFF or -1 indicates that data in the string is not
found or is not valid.

(Just a nit, but 0xFFFFFFFF is *not* -1. They're two different
values.)

Let’s discuss substr() function. The substr() function’s
first parameter must always have minimum size_type and maximum
size_type.

The first argument must be in the range [0...s.size()], where s
is the string you're concerned with. It specifies the index of
the first character in the substring you want.
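A short sketch of that range rule, assuming only the standard library: a position equal to size() is accepted and gives an empty string, while anything larger (including npos) makes substr throw std::out_of_range. The helper name `substr_throws` is hypothetical.

```cpp
#include <stdexcept>
#include <string>

// Hypothetical helper: reports whether s.substr(pos) throws std::out_of_range.
bool substr_throws(const std::string& s, std::string::size_type pos)
{
    try {
        const std::string ignored = s.substr(pos);
        (void)ignored;          // only the exception behavior matters here
        return false;
    } catch (const std::out_of_range&) {
        return true;
    }
}
```

For s = "Hello World", positions 0 and s.size() do not throw; s.size() + 1 and std::string::npos do.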

If 0xFFFFFFFF or -1 is detected, then exception will be
thrown.

(Again, -1 cannot be detected, because it cannot be represented
on the type of the argument.)

The second parameter always has default 0xFFFFFFFF or -1 if
you do not assign second parameter.

For example

string data( "Hello World!" );
string token = data.substr( 0 );

The data has 11 elements in length. Notice that second
parameter in substr() function is not assigned. The default
is -1.

The default is std::string::npos, not -1.

How do substr () function know to count 11 elements
correctly?

It's a member function. It knows the length of the string.
(How do you think std::string::size works?)

It should always count all 256 values of character set
including ‘\0’.

It doesn't count anything.

If you insert ‘\0’ between Hello and World ( "Hello \0World!"
), then it will count 12 elements including ‘\0’. The string
object is not like C string. It does not check null
terminator and it always check number of elements in size with
size() function or length() function.

end_token = 5;
begin_token = data.find_first_not_of( " ", end_token + 1 );
end_token = data.find_first_of( " ", begin_token );

string token = data.substr( begin_token, end_token - begin_token );
length_token = token.length();

find_first_of() function returns -1 indicates space is not found.

It returns std::string::npos (which is *not* -1) to indicate
that it didn't find any character in the list given.

substr() function cannot guarantee to assume to be 11.
Possibly, it will go beyond 11 elements boundary until it
detects ‘\0’ and returns the wrong end_token value.

Why on earth would it do a thing like that? An std::string
knows its length, and unless the standard specifically states
otherwise, it uses this length. No member function ever looks
for '\0'.

I think that my example code above is not a good solution. I
will use iterator loop to test each element instead.

I think you still have a lot to learn about the standard
library. (And also expressing yourself clearly---which is a
prerequisite to good programming. I don't know how much of this
is due to English not being your native language, however.)

James Kanze · Jan 31, 2010

Immortal Nephi wrote:

[...]

I think this will work:

const std::string
data = std::string("Hello ") + '\0' + std::string("World");

An even simpler solution might be:
std::string const data( "Hello \0World", 12 );
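To show why the explicit count matters, here is a minimal sketch comparing the two constructors; the function names are made up for illustration. Constructing from a bare char const* is the one place where the '\0' is looked for, because the pointer carries no length.

```cpp
#include <string>

// Built from a pointer only: construction stops at the first '\0'.
std::string from_c_string()
{
    return std::string("Hello \0World");
}

// Built from a pointer plus an explicit count: the embedded '\0' is kept.
std::string from_counted_buffer()
{
    return std::string("Hello \0World", 12);
}
```

The first string has size 6 ("Hello " only). The second has size 12, with '\0' at index 6, and substr(7) on it gives "World".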

Öö Tiib · Feb 1, 2010

assert(static_cast<unsigned int>(-1) == -1);

/Leigh

Anyway you get diagnostic warnings for it from most compilers. If
'static_cast<unsigned int>(-1)' is needed then '~0U' is perhaps
shortest form that makes all compilers happy with it.

LR · Feb 1, 2010

James said:
The minimum size_type is 0 and maximum size_type is 0xFFFFFFFE
(on 32 bit machine). Both integer values provide you the
information how many elements do string have.

What do you mean by "both" here? A zero value designates the
first character of the string, or indicates that the length of
the string is 0. The maximum value is used as a sentinel:
std::string::size will never return it. The only functions
which do return it are those which look for something, and they
use it as a special value, to indicate that they didn't find
what they were looking for.

The 0xFFFFFFFF or -1 indicates that data in the string is not
found or is not valid.

(Just a nit, but 0xFFFFFFFF is *not* -1. They're two different
values.)

Let’s discuss substr() function. The substr() function’s
first parameter must always have minimum size_type and maximum
size_type.

The first argument must be in the range [0...s.size()], where s
is the string you're concerned with. It specifies the index of
the first character in the substring you want.

If 0xFFFFFFFF or -1 is detected, then exception will be
thrown.

(Again, -1 cannot be detected, because it cannot be represented
on the type of the argument.)

My copy of the standard, or my most recent copy of a working draft
explicitly initializes static const size_type npos = -1;

LR

LR · Feb 1, 2010

James said:
Immortal Nephi wrote:
[...]
I think this will work:

const std::string
data = std::string("Hello ") + '\0' + std::string("World");

An even simpler solution might be:
std::string const data( "Hello \0World", 12 );

I didn't think of that, but I hate to count things since I think it
makes maintenance more difficult.

LR

James Lothian · Feb 1, 2010

LR said:
James said:

Immortal Nephi wrote: [...]
I think this will work:
const std::string
data = std::string("Hello ") + '\0' + std::string("World");

An even simpler solution might be:
std::string const data( "Hello \0World", 12 );

I didn't think of that, but I hate to count things since I think it
makes maintenance more difficult.

LR

Presumably then you could do something like:
const char blah[] = "Hello \0World";
const std::string data(blah, sizeof(blah));

James

James Kanze · Feb 1, 2010

James said:
James said:

Immortal Nephi wrote:

[...]

I think this will work:
const std::string
data = std::string("Hello ") + '\0' + std::string("World");

An even simpler solution might be:
std::string const data( "Hello \0World", 12 );

I didn't think of that, but I hate to count things since I
think it makes maintenance more difficult.

Yes. I was afraid my usual solution would confuse the original
poster:

static char const init[] = "Hello \0World";
std::string const data(begin(init), end(init)-1);

(In this case, of course, begin and end are the usual template
functions.)

As soon as you accept to give a name to the initialization, you
can get the compiler to do the counting. You need the name,
however, since you need to refer to the initialization object
twice. (
static char const init[] = "Hello \0World";
std::string const data(init, init + sizeof(init) - 1);
will also work, but the begin and end solution is more general.)
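The post does not show the begin and end templates it refers to. A plausible sketch for built-in arrays follows; the names `array_begin`, `array_end` and `make_data` are hypothetical and chosen to avoid clashing with any library functions.

```cpp
#include <cstddef>
#include <string>

// Assumed shape of the "usual template functions" for built-in arrays.
template <typename T, std::size_t N>
T* array_begin(T (&array)[N])
{
    return array;
}

template <typename T, std::size_t N>
T* array_end(T (&array)[N])
{
    return array + N;   // one past the last element
}

// Hypothetical wrapper around the initialization shown in the post.
std::string make_data()
{
    static char const init[] = "Hello \0World";
    // The array includes the literal's terminating '\0', so step back one
    // to keep only the 12 characters that were written.
    return std::string(array_begin(init), array_end(init) - 1);
}
```

The compiler deduces N from the array type, so nothing has to be counted by hand and the code keeps working if the literal is edited.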

James Kanze · Feb 1, 2010

LR said:
LR said:

James said:

Immortal Nephi wrote:
[...]
I think this will work:
const std::string
data = std::string("Hello ") + '\0' + std::string("World");
An even simpler solution might be:
std::string const data( "Hello \0World", 12 );

I didn't think of that, but I hate to count things since I think it
makes maintenance more difficult.

Presumably then you could do something like:
const char blah[] = "Hello \0World";
const std::string data(blah, sizeof(blah));

sizeof(blah) - 1, if you don't want the final '\0'.

James Kanze · Feb 1, 2010

Anyway you get diagnostic warnings for it from most compilers.
If 'static_cast<unsigned int>(-1)' is needed then '~0U' is
perhaps shortest form that makes all compilers happy with it.

Except that it doesn't work. There are only two portable
solutions to get the value yourself:
static_cast< size_t >( -1 )
or
std::numeric_limits< size_t >::max();
(Both are guaranteed to be equal.)

Of course, the best solution is just to use std::string::npos.
There's no reason for you to worry about anything else. (And
you don't care what the value really is.)
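A hedged sketch of the comparison, using only standard headers; the function names are invented for the example. It checks the two portable spellings against npos and shows that ~0U matches only by coincidence of widths.

```cpp
#include <cstddef>
#include <limits>
#include <string>

// Portable: -1 converted to size_t is the largest size_t value.
bool cast_matches_npos()
{
    return static_cast<std::size_t>(-1) == std::string::npos;
}

// Portable: the same value, spelled through numeric_limits.
bool limits_match_npos()
{
    return std::numeric_limits<std::size_t>::max() == std::string::npos;
}

// Not portable: ~0U has type unsigned int, so it equals npos only
// when size_t and unsigned int have the same range.
bool tilde_zero_matches_npos()
{
    return ~0U == std::string::npos;
}
```

On a platform where size_t is wider than unsigned int, the third function returns false, which is the sense in which ~0U "doesn't work".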
