read/parse flat file / performance / boost::tokenizer

Knackeback · May 8, 2004

task:
- read/parse CSV file

code snippet:
string key, line;
typedef tokenizer<char_separator<char> > tokenizer;
tokenizer tok(string(""), sep);
while ( getline(f, line) ){
    ++lineNo;
    tok.assign(line, sep);
    short tok_counter = 0;
    for (tokenizer::iterator beg = tok.begin(); beg != tok.end(); ++beg){
        // check whether this token should be part of the key
        if ( ( idx = lineArr[tok_counter] ) != -1 ){
            keyArr[idx] = *beg;
        }
        ++tok_counter;
    }
    // build the key, let's say from the first and third token
    for (int i = 0; i < keySize; i++ ){
        key += keyArr[i];
        key += delim;
    }
    m.insert(make_pair(key, LO(new Line(line, lineNo)))); // m is a multimap
    key.erase();
}

gprof hits:
  %   cumulative   self               self     total
 time   seconds   seconds     calls  s/call   s/call  name

16.89 0.50 0.50 2621459 0.00 0.00 bool boost::char_separator<char, std::char_traits<char> >::operator()<__gnu_cxx::__normal_iterator<char const*, std::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::basic_string<char, std::char_traits<char>, std::allocator<char> > >(__gnu_cxx::__normal_iterator<char const*, std::basic_string<char, std::char_traits<char>, std::allocator<char> > >&, __gnu_cxx::__normal_iterator<char const*, std::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::basic_string<char, std::char_traits<char>, std::allocator<char> >&)

11.99 0.85 0.35 24903838 0.00 0.00 boost::char_separator<char, std::char_traits<char> >::is_dropped(char) const

7.09 1.06 0.21 28508346 0.00 0.00 bool __gnu_cxx::operator!=<char const*, std::basic_string<char, std::char_traits<char>, std::allocator<char> > >(__gnu_cxx::__normal_iterator<char const*, std::basic_string<char, std::char_traits<char>, std::allocator<char> > > const&, __gnu_cxx::__normal_iterator<char const*, std::basic_string<char, std::char_traits<char>, std::allocator<char> > > const&)

problem:
I want to improve the performance of this code passage.

questions:
I hope the goal is somewhat clear. I want to read every line of a file into the
container as a line object (consisting of line number and line content),
identified by a key.
Any idea that improves the style or performance of this snippet is welcome!

Thomas

John Harrison · May 8, 2004

All your performance bottlenecks seem to be inside the boost tokenizer
library. The obvious answer, then, is to replace that code with your own
custom code. The tokenizer library is a generic tokenizer; you have a
specific problem to solve, so you should be able to beat the performance
of boost by taking advantage of the specific knowledge you have about
your application.

john

Knackeback · May 8, 2004

Yes, I will try handcrafted line reading.
But can you say a bit more about what you mean by "generic tokenizer"?
My task is to split a line into tokens, and the example from boost::tokenizer
does exactly that.
At the moment I don't need ALL the tokens for my line key. Therefore I think
boost::tokenizer is too expensive.
BTW, I compiled my program with g++ and icc (Intel's C++ compiler for Linux).
The icc-compiled code was five times faster, and the compile warnings from icc
are very good. Good work!

Thomas

John Harrison · May 8, 2004

Knackeback said:
Yes, I will try handcrafted line reading.
But can you say a bit more about what you mean by "generic tokenizer"?
My task is to split a line into tokens, and the example from boost::tokenizer
does exactly that.
At the moment I don't need ALL the tokens for my line key. Therefore I think
boost::tokenizer is too expensive.

That's exactly what I mean. For instance boost will probably create a string
for each token, but you throw some of those tokens away. Your custom code
will only create a string for the tokens you actually need.
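
A minimal sketch of that idea, assuming a single-character delimiter. The
names extractFields and wanted are hypothetical; wanted plays the role of
lineArr in the original snippet (-1 means the token is not needed). Note one
difference in behaviour: char_separator drops empty tokens by default, while
this sketch keeps them, so it is not a drop-in replacement.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper: copy only the wanted fields of `line` into `out`.
// wanted[n] is the slot in `out` for field n, or -1 if field n is skipped.
// `out` must already be large enough for every slot that `wanted` refers to.
void extractFields(const std::string& line, char delim,
                   const std::vector<int>& wanted,
                   std::vector<std::string>& out)
{
    std::size_t start = 0;
    // Stop as soon as the last field of interest has been passed.
    for (std::size_t field = 0; field < wanted.size(); ++field) {
        std::size_t end = line.find(delim, start);
        if (end == std::string::npos)
            end = line.size();                 // last field on the line
        if (wanted[field] != -1)               // only now is a string copied
            out[wanted[field]].assign(line, start, end - start);
        if (end == line.size())
            break;                             // no more fields
        start = end + 1;                       // step past the delimiter
    }
}
```

Skipped fields cost only a find call, and the scan ends once wanted is
exhausted, so trailing columns are never examined. If a line has fewer fields
than expected, the corresponding slots in out keep their previous contents.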

Also, looking at your original code, it seems that after extracting a token
you add the delimiter back into the key you are building up. That would be
another improvement: for your purposes a token can include the trailing
delimiter.
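
A rough sketch of that second point, under the assumption that the key fields
are wanted in the order they appear in the line. The names appendKey and isKey
are hypothetical. Each key field is appended together with its trailing
delimiter in one call; only the last field of a line, which has no delimiter
after it, needs one added.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper: append the key fields of `line` to `key`, each one
// followed by the delimiter. isKey[n] is true when field n belongs to the key.
void appendKey(const std::string& line, char delim,
               const std::vector<bool>& isKey, std::string& key)
{
    std::size_t start = 0;
    for (std::size_t field = 0; field < isKey.size(); ++field) {
        std::size_t end = line.find(delim, start);
        const bool last = (end == std::string::npos);
        if (last)
            end = line.size();
        if (isKey[field]) {
            if (last) {
                key.append(line, start, end - start);
                key += delim;                  // the line has no delimiter here
            } else {
                // Copy the field and its trailing delimiter in one go.
                key.append(line, start, end - start + 1);
            }
        }
        if (last)
            break;
        start = end + 1;
    }
}
```

With the first and third fields marked as key fields, the line "a,b,c" would
give the key "a,c,", the same shape the original loop builds from keyArr and
delim, without the intermediate keyArr strings.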

john

Knackeback · May 11, 2004

Thanks for your hint. The handcrafted solution is now three times faster than
the boost tokenizer!

John Harrison · May 11, 2004

Knackeback said:
Thanks for your hint. The handcrafted solution is now three times faster than
the boost tokenizer!

Don't take that as an argument against boost tokenizer. It still does its
job, and presumably does it efficiently (I haven't looked at the code).

What I liked about your post was that you did things the right way round.
First you got a working solution using general purpose tools available to
you, then you decided that it wasn't fast enough so you looked to replace
general purpose code with hand crafted code. That's the way it should be
done.

And of course many times, the hand crafted code isn't necessary at all.

john
