Efficiently Building Strings

P

Pavel

Juha said:
Luca Risolia said:
int main(int argc, char** argv) {
ifstream f(argv[1], ifstream::in | ifstream::binary);
istreambuf_iterator<char> i(f), eof;
string s(i, eof);
cout<< s;
}

Note that this is not the fastest possible way of reading a large file
because reading one byte at a time from a C++ stream is pretty slow.

If maximum reading speed is required, the best compromise between speed
and relative ease of usage is to use std::fread() instead. It would go
something like this:

std::FILE* inFile = std::fopen(argv[1], "rb");
if(!inFile) { std::perror(argv[1]); return 1; }
std::fseek(inFile, 0, SEEK_END);
std::string s(std::ftell(inFile), ' ');
std::rewind(inFile);
std::fread(&s[0], 1, s.length(), inFile);
Just a nit-pick: std::string is not guaranteed to be stored contiguously so
technically your code has UB.

std::fclose(inFile);

For example in my system reading a 700-megabyte file takes about 7.5 seconds
by using the istreambuf_iterator method, while with fread it takes about
1.5 seconds.

(If maximum reading speed is not imperative, or if input files are
expected to be small, then use whichever method is safest and most
convenient.)


-Pavel
 
W

woodbrian77

Juha said:
Luca Risolia said:
int main(int argc, char** argv) {
ifstream f(argv[1], ifstream::in | ifstream::binary);
istreambuf_iterator<char> i(f), eof;
string s(i, eof);
cout<< s;
}

Note that this is not the fastest possible way of reading a large file
because reading one byte at a time from a C++ stream is pretty slow.

If maximum reading speed is required, the best compromise between speed
and relative ease of usage is to use std::fread() instead. It would go
something like this:

std::FILE* inFile = std::fopen(argv[1], "rb");
if(!inFile) { std::perror(argv[1]); return 1; }
std::fseek(inFile, 0, SEEK_END);
std::string s(std::ftell(inFile), ' ');
std::rewind(inFile);
std::fread(&s[0], 1, s.length(), inFile);
Just a nit-pick: std::string is not guaranteed to be stored contiguously so
technically your code has UB.

Iirc, this changed with C++ 2011. Now the storage is
guaranteed to be contiguous.

Brian
Ebenezer Enterprises
http://webEbenezer.net
 
B

Bo Persson

Pavel skrev 2012-05-27 04:55:
Juha said:
Luca Risolia said:
int main(int argc, char** argv) {
ifstream f(argv[1], ifstream::in | ifstream::binary);
istreambuf_iterator<char> i(f), eof;
string s(i, eof);
cout<< s;
}

Note that this is not the fastest possible way of reading a large file
because reading one byte at a time from a C++ stream is pretty slow.

If maximum reading speed is required, the best compromise between speed
and relative ease of usage is to use std::fread() instead. It would go
something like this:

std::FILE* inFile = std::fopen(argv[1], "rb");
if(!inFile) { std::perror(argv[1]); return 1; }
std::fseek(inFile, 0, SEEK_END);
std::string s(std::ftell(inFile), ' ');
std::rewind(inFile);
std::fread(&s[0], 1, s.length(), inFile);
I know it is often assumed, but std::string's characters are not
guaranteed to be stored contiguously (despite many implementations'
doing so).

Change that to "ALL known implementations doing so".

In C++11 it IS required (non-controversial, because nobody had to change
anything).


Bo Persson
 
P

Pavel

Bo said:
Pavel skrev 2012-05-27 04:55:
Juha said:
int main(int argc, char** argv) {
ifstream f(argv[1], ifstream::in | ifstream::binary);
istreambuf_iterator<char> i(f), eof;
string s(i, eof);
cout<< s;
}

Note that this is not the fastest possible way of reading a large file
because reading one byte at a time from a C++ stream is pretty slow.

If maximum reading speed is required, the best compromise between speed
and relative ease of usage is to use std::fread() instead. It would go
something like this:

std::FILE* inFile = std::fopen(argv[1], "rb");
if(!inFile) { std::perror(argv[1]); return 1; }
std::fseek(inFile, 0, SEEK_END);
std::string s(std::ftell(inFile), ' ');
std::rewind(inFile);
std::fread(&s[0], 1, s.length(), inFile);
I know it is often assumed, but std::string's characters are not
guaranteed to be stored contiguously (despite many implementations'
doing so).

Change that to "ALL known implementations doing so".

In C++11 it IS required (non-controversial, because nobody had to change anything).


Bo Persson
I just took a look at this part of C+11 and was surprised c_str() is no longer
required to invalidate iterators etc and even it does not seem to be required to
return a null-terminated string. Is this intentional?

-Pavel
 
D

Dombo

Op 28-May-12 2:13, Pavel schreef:
Bo said:
Pavel skrev 2012-05-27 04:55:
Juha Nieminen wrote:
int main(int argc, char** argv) {
ifstream f(argv[1], ifstream::in | ifstream::binary);
istreambuf_iterator<char> i(f), eof;
string s(i, eof);
cout<< s;
}

Note that this is not the fastest possible way of reading a large file
because reading one byte at a time from a C++ stream is pretty slow.

If maximum reading speed is required, the best compromise between speed
and relative ease of usage is to use std::fread() instead. It would go
something like this:

std::FILE* inFile = std::fopen(argv[1], "rb");
if(!inFile) { std::perror(argv[1]); return 1; }
std::fseek(inFile, 0, SEEK_END);
std::string s(std::ftell(inFile), ' ');
std::rewind(inFile);
std::fread(&s[0], 1, s.length(), inFile);
I know it is often assumed, but std::string's characters are not
guaranteed to be stored contiguously (despite many implementations'
doing so).

Change that to "ALL known implementations doing so".

In C++11 it IS required (non-controversial, because nobody had to
change anything).


Bo Persson
I just took a look at this part of C+11 and was surprised c_str() is no
longer required to invalidate iterators etc and even it does not seem to
be required to return a null-terminated string. Is this intentional?

If it doesn't return a null-terminated string it would be pretty useless
and break almost all code that uses c_str(). I didn't look it up in the
C++11 standard, but if what you say is true than I'm pretty sure it is
not intentional.

According to the C++ 2003 version of the standard c_str() does return a
null-terminated string, but AFAICT does not require iterators to be
invalidated when this member is called. However any subsequent call to a
non-const member of the string object may make the pointer returned by
the c_str() function invalid.
 
J

Jan Andres

Bo said:
Pavel skrev 2012-05-27 04:55:
Juha Nieminen wrote:
int main(int argc, char** argv) {
ifstream f(argv[1], ifstream::in | ifstream::binary);
istreambuf_iterator<char> i(f), eof;
string s(i, eof);
cout<< s;
}

Note that this is not the fastest possible way of reading a large file
because reading one byte at a time from a C++ stream is pretty slow.

If maximum reading speed is required, the best compromise between speed
and relative ease of usage is to use std::fread() instead. It would go
something like this:

std::FILE* inFile = std::fopen(argv[1], "rb");
if(!inFile) { std::perror(argv[1]); return 1; }
std::fseek(inFile, 0, SEEK_END);
std::string s(std::ftell(inFile), ' ');
std::rewind(inFile);
std::fread(&s[0], 1, s.length(), inFile);
I know it is often assumed, but std::string's characters are not
guaranteed to be stored contiguously (despite many implementations'
doing so).

Change that to "ALL known implementations doing so".

In C++11 it IS required (non-controversial, because nobody had to change anything).


Bo Persson
I just took a look at this part of C+11 and was surprised c_str() is no longer
required to invalidate iterators etc and even it does not seem to be required to
return a null-terminated string. Is this intentional?

I tend to disagree on your second point above. 21.4.7.1(1) requires
c_str() and data() to return

A pointer p such that p + i == &operator[](i) for each i in [0,size()],

i.e. the interval includes the index size() which is one past the end of
the string. 21.4.5(2) specifies that operator[](pos) returns an object
of value charT() for pos == size(), which is a NUL character for your
typical charT.
 
P

Pavel

Dombo said:
Op 28-May-12 2:13, Pavel schreef:
Bo said:
Pavel skrev 2012-05-27 04:55:
Juha Nieminen wrote:
int main(int argc, char** argv) {
ifstream f(argv[1], ifstream::in | ifstream::binary);
istreambuf_iterator<char> i(f), eof;
string s(i, eof);
cout<< s;
}

Note that this is not the fastest possible way of reading a large file
because reading one byte at a time from a C++ stream is pretty slow.

If maximum reading speed is required, the best compromise between speed
and relative ease of usage is to use std::fread() instead. It would go
something like this:

std::FILE* inFile = std::fopen(argv[1], "rb");
if(!inFile) { std::perror(argv[1]); return 1; }
std::fseek(inFile, 0, SEEK_END);
std::string s(std::ftell(inFile), ' ');
std::rewind(inFile);
std::fread(&s[0], 1, s.length(), inFile);
I know it is often assumed, but std::string's characters are not
guaranteed to be stored contiguously (despite many implementations'
doing so).


Change that to "ALL known implementations doing so".

In C++11 it IS required (non-controversial, because nobody had to
change anything).


Bo Persson
I just took a look at this part of C+11 and was surprised c_str() is no
longer required to invalidate iterators etc and even it does not seem to
be required to return a null-terminated string. Is this intentional?

If it doesn't return a null-terminated string it would be pretty useless and
break almost all code that uses c_str(). I didn't look it up in the C++11
standard, but if what you say is true than I'm pretty sure it is not intentional.

According to the C++ 2003 version of the standard c_str() does return a
null-terminated string, but AFAICT does not require iterators to be invalidated
when this member is called.
See 21.3-5:
References, pointers, and iterators referring to the elements of a basic_string
sequence may be invalidated by the following uses of that basic_string object:
....
— Calling data() and c_str() member functions.

However any subsequent call to a non-const member of
the string object may make the pointer returned by the c_str() function invalid.

-Pavel
 
P

Pavel

Jan said:
Bo said:
Pavel skrev 2012-05-27 04:55:
Juha Nieminen wrote:
int main(int argc, char** argv) {
ifstream f(argv[1], ifstream::in | ifstream::binary);
istreambuf_iterator<char> i(f), eof;
string s(i, eof);
cout<< s;
}

Note that this is not the fastest possible way of reading a large file
because reading one byte at a time from a C++ stream is pretty slow.

If maximum reading speed is required, the best compromise between speed
and relative ease of usage is to use std::fread() instead. It would go
something like this:

std::FILE* inFile = std::fopen(argv[1], "rb");
if(!inFile) { std::perror(argv[1]); return 1; }
std::fseek(inFile, 0, SEEK_END);
std::string s(std::ftell(inFile), ' ');
std::rewind(inFile);
std::fread(&s[0], 1, s.length(), inFile);
I know it is often assumed, but std::string's characters are not
guaranteed to be stored contiguously (despite many implementations'
doing so).


Change that to "ALL known implementations doing so".

In C++11 it IS required (non-controversial, because nobody had to change anything).


Bo Persson
I just took a look at this part of C+11 and was surprised c_str() is no longer
required to invalidate iterators etc and even it does not seem to be required to
return a null-terminated string. Is this intentional?

I tend to disagree on your second point above. 21.4.7.1(1) requires
c_str() and data() to return

A pointer p such that p + i ==&operator[](i) for each i in [0,size()],

i.e. the interval includes the index size() which is one past the end of
the string. 21.4.5(2) specifies that operator[](pos) returns an object
of value charT() for pos == size(), which is a NUL character for your
typical charT.
Thanks, I see now. It was more explicit before. But, then, in conjunction with
non-invalidating iterators on c_str(), and the requirement of contiguous storage
this means that any practical implementation of std::string shall have its
contiguous storage ended with an extra NUL character, even if the client code
don't need it. This is probably minor, but good to know when calculating memory
requirements for many small strings.

As for adding the requirement contiguous storage,, I think it closed a
possibility for some optimizations related of string growing (the subject that
arose earlier in this thread). For example, before the change an std::string
implementation with size < C * pow(2, n) could keep an array of n contiguous
chunks with sizes C * pow(2, i) where C is a constant and 0 <= i < n. Then
appending would require neither memory deallocation nor data copying.

-Pavel
 
J

Jan Andres

Jan said:
Bo Persson wrote:
Pavel skrev 2012-05-27 04:55:
Juha Nieminen wrote:
int main(int argc, char** argv) {
ifstream f(argv[1], ifstream::in | ifstream::binary);
istreambuf_iterator<char> i(f), eof;
string s(i, eof);
cout<< s;
}

Note that this is not the fastest possible way of reading a large file
because reading one byte at a time from a C++ stream is pretty slow.

If maximum reading speed is required, the best compromise between speed
and relative ease of usage is to use std::fread() instead. It would go
something like this:

std::FILE* inFile = std::fopen(argv[1], "rb");
if(!inFile) { std::perror(argv[1]); return 1; }
std::fseek(inFile, 0, SEEK_END);
std::string s(std::ftell(inFile), ' ');
std::rewind(inFile);
std::fread(&s[0], 1, s.length(), inFile);
I know it is often assumed, but std::string's characters are not
guaranteed to be stored contiguously (despite many implementations'
doing so).


Change that to "ALL known implementations doing so".

In C++11 it IS required (non-controversial, because nobody had to change anything).


Bo Persson
I just took a look at this part of C+11 and was surprised c_str() is no longer
required to invalidate iterators etc and even it does not seem to be required to
return a null-terminated string. Is this intentional?

I tend to disagree on your second point above. 21.4.7.1(1) requires
c_str() and data() to return

A pointer p such that p + i ==&operator[](i) for each i in [0,size()],

i.e. the interval includes the index size() which is one past the end of
the string. 21.4.5(2) specifies that operator[](pos) returns an object
of value charT() for pos == size(), which is a NUL character for your
typical charT.
Thanks, I see now. It was more explicit before. But, then, in conjunction with
non-invalidating iterators on c_str(), and the requirement of contiguous storage
this means that any practical implementation of std::string shall have its
contiguous storage ended with an extra NUL character, even if the client code
don't need it. This is probably minor, but good to know when calculating memory
requirements for many small strings.

Agreed, though for strings so short that a single-character size
difference is significant, there are other types of overhead that must
be taken into account. Take e.g. the memory allocator's internal
overhead which can easily be something on the order of 2 pointers per
allocated block, plus many malloc libraries will round up the allocation
size to the next multiple of 16 bytes or so. IMHO that makes the one
extra NUL character irrelevant in the vast majority of cases. (Unless
you are using a truly gigantic character type.:)
As for adding the requirement contiguous storage,, I think it closed a
possibility for some optimizations related of string growing (the subject that
arose earlier in this thread). For example, before the change an std::string
implementation with size < C * pow(2, n) could keep an array of n contiguous
chunks with sizes C * pow(2, i) where C is a constant and 0 <= i < n. Then
appending would require neither memory deallocation nor data copying.

Agreed also, but I'd like to add that from a complexity theoretic
perspective this isn't as bad as it sounds. Say the size of the buffer
is doubled every time it fills up and we are building up a string of
total length n where 2^(k-1) < n <= 2^k.
Then taking the sum over all reallocations they will copy a total of
2^0 + 2^1 + ... + 2^(k-1) = 2^k - 1 characters and there will be only
(k-1) deallocations, so the time required to build up the whole
n-character string is still linear in n.

Of course, the usual care must be taken when applying such complexity
theoretic considerations to practical programming.
 
P

Pavel

Jan said:
Jan said:
Bo Persson wrote:
Pavel skrev 2012-05-27 04:55:
Juha Nieminen wrote:
int main(int argc, char** argv) {
ifstream f(argv[1], ifstream::in | ifstream::binary);
istreambuf_iterator<char> i(f), eof;
string s(i, eof);
cout<< s;
}

Note that this is not the fastest possible way of reading a large file
because reading one byte at a time from a C++ stream is pretty slow.

If maximum reading speed is required, the best compromise between speed
and relative ease of usage is to use std::fread() instead. It would go
something like this:

std::FILE* inFile = std::fopen(argv[1], "rb");
if(!inFile) { std::perror(argv[1]); return 1; }
std::fseek(inFile, 0, SEEK_END);
std::string s(std::ftell(inFile), ' ');
std::rewind(inFile);
std::fread(&s[0], 1, s.length(), inFile);
I know it is often assumed, but std::string's characters are not
guaranteed to be stored contiguously (despite many implementations'
doing so).


Change that to "ALL known implementations doing so".

In C++11 it IS required (non-controversial, because nobody had to change anything).


Bo Persson
I just took a look at this part of C+11 and was surprised c_str() is no longer
required to invalidate iterators etc and even it does not seem to be required to
return a null-terminated string. Is this intentional?

I tend to disagree on your second point above. 21.4.7.1(1) requires
c_str() and data() to return

A pointer p such that p + i ==&operator[](i) for each i in [0,size()],

i.e. the interval includes the index size() which is one past the end of
the string. 21.4.5(2) specifies that operator[](pos) returns an object
of value charT() for pos == size(), which is a NUL character for your
typical charT.
Thanks, I see now. It was more explicit before. But, then, in conjunction with
non-invalidating iterators on c_str(), and the requirement of contiguous storage
this means that any practical implementation of std::string shall have its
contiguous storage ended with an extra NUL character, even if the client code
don't need it. This is probably minor, but good to know when calculating memory
requirements for many small strings.

Agreed, though for strings so short that a single-character size
difference is significant, there are other types of overhead that must
be taken into account. Take e.g. the memory allocator's internal
overhead which can easily be something on the order of 2 pointers per
allocated block, plus many malloc libraries will round up the allocation
size to the next multiple of 16 bytes or so. IMHO that makes the one
extra NUL character irrelevant in the vast majority of cases. (Unless
you are using a truly gigantic character type.:)
True; but I have surprisingly many strings that have size of 4 bytes. By adding
an extra byte, they will break to another category (allocate 32 bytes instead of
16 otherwise -- on 32-bit system). This is all theoretical of course because
many implementations have chosen to always allocate NUL terminator even under
the previous standard.
Agreed also, but I'd like to add that from a complexity theoretic
perspective this isn't as bad as it sounds. Say the size of the buffer
is doubled every time it fills up and we are building up a string of
total length n where 2^(k-1)< n<= 2^k.
Then taking the sum over all reallocations they will copy a total of
2^0 + 2^1 + ... + 2^(k-1) = 2^k - 1 characters and there will be only
(k-1) deallocations, so the time required to build up the whole
n-character string is still linear in n.
Yes big-O is same, but when you are in a business where speed is a competitive
advantage, constant C matters :). No copying vs extra copy for every allocation
adds a factor about 2 for copying (4 reallocations from 4->8->16->32->64 will
copy 60 characters during reallocations whereas alternative approach copies
none). For extra k deallocations, its cost in comparison to allocations varies
wildly, which makes it difficult to make a general estimate. I only know it is
far from being free.
Of course, the usual care must be taken when applying such complexity
theoretic considerations to practical programming.

-Pavel
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Members online

Forum statistics

Threads
474,137
Messages
2,570,799
Members
47,347
Latest member
edward_eden

Latest Threads

Top