C Containers Library vs STL

jacob navia · Aug 3, 2011

Hi

I would like to compare the C containers library (written in C for C
programmers) against the STL.

Here is the code for the CCL. Maybe one C++ wizard would solve the
same problem using the STL?

I would be very ingterested in comparing code size, complexity, etc.

Thanks in advance, and here is the C part:
--------------------------------------------------------------
Unique
------

Given a text file, print in standard output the lines that are
unique in it, i.e. filtering all duplicated lines.

Algorithm:
---------

Normally this involves keeping a sorted list/array of lines
and testing if a line is in the set or not.

Solution using the CCL.
----------------------

1 #include <containers.h>
2 int main(int argc,char *argv[])
3 {
4 FILE *f;
5 int i=1,r;
6 Dictionary *dict;
7 char buf[8192];
8
9 if (argc < 2) {
10 fprintf(stderr,"%s <file name>\n",argv[0]);
11 return -1;
12 }
13 f = fopen(argv[1],"r");
14 if (f == NULL)
15 return -1;
16 dict = iDictionary.Create(0,500);
17 if (dict == NULL)
18 return -1;
19 while (fgets(buf,sizeof(buf),f)) {
20 r= iDictionary.Add(dict,buf,NULL);
21 if (r > 0)
22 printf("[%3d] %s",i,buf);
23 else if (r < 0) break;
24 i++;
25 }
26 iDictionary.Finalize(dict);
27 fclose(f);
28 }

Algorithm
---------
A hash table will be used to determine if a line is a duplicate
or not.

Commentary
----------
We use the following local variables (lines 4-7):
Name Usage
f Input stream bound to the file to read
i Counter for lines read
r Result of adding a line
dict Dictionary (Hash table)
buf Line buffer limited to 8K per line

Lines 9-15 are concerned with opening the input file, with some error
checking.

In line 16 we create a dictionary, requesting a size of zero for the
data associated with the key since we aren't storing any data, just the
key, and we suppose that the table will contain more or less 500
entries. If the file contains much more lines performance could
suffer but the algorithm would still work.

Lines 19-25 are the main loop of the program. We read each line into
the buffer and add it to then dictionary. If the "Add" API returns
a positive number the line wasn't there, if it returns zero the
line was already in the dictionary. If the result is negative it
is an error code and we stop the loop aborting the operation. Failure
can be provoked only by lack of memory.

If the result is positive we print the line.

Cleanup is performed in lines 26 and 27: we dispose of the dictionary
and close the file.

The C containers library web page is:
http://code.google.com/p/ccl/
You will find there the source code and the documentation.

-----------------------------------------------------------------------

Ian Collins · Aug 3, 2011

Hi

I would like to compare the C containers library (written in C for C
programmers) against the STL.

Here is the code for the CCL. Maybe one C++ wizard would solve the
same problem using the STL?

I would be very ingterested in comparing code size, complexity, etc.

Thanks in advance, and here is the C part:
--------------------------------------------------------------
Unique
------

Given a text file, print in standard output the lines that are
unique in it, i.e. filtering all duplicated lines.

Algorithm:
---------

Normally this involves keeping a sorted list/array of lines
and testing if a line is in the set or not.

Solution using the CCL.
----------------------

1 #include<containers.h>
2 int main(int argc,char *argv[])
3 {
4 FILE *f;
5 int i=1,r;
6 Dictionary *dict;
7 char buf[8192];
8
9 if (argc< 2) {
10 fprintf(stderr,"%s<file name>\n",argv[0]);
11 return -1;
12 }
13 f = fopen(argv[1],"r");
14 if (f == NULL)
15 return -1;
16 dict = iDictionary.Create(0,500);
17 if (dict == NULL)
18 return -1;
19 while (fgets(buf,sizeof(buf),f)) {
20 r= iDictionary.Add(dict,buf,NULL);
21 if (r> 0)
22 printf("[%3d] %s",i,buf);
23 else if (r< 0) break;
24 i++;
25 }
26 iDictionary.Finalize(dict);
27 fclose(f);
28 }

Algorithm

This is probably the closest equivalent (sticking to a similar layout
style):

#include <set>
#include <string>
#include <iostream>
#include <fstream>

typedef std::set<std::string> Lines;

int main( int argc, char** argv )
{
std::ifstream in( argv[1] );

Lines unique;

while( in ) {
std::string line;

std::getline( in, line );

if( unique.insert(line).second ) {
std::cout << line << std::endl;
}
}
}

Commentary
----------
We use the following local variables (lines 4-7):
Name Usage
f Input stream bound to the file to read
i Counter for lines read
r Result of adding a line
dict Dictionary (Hash table)
buf Line buffer limited to 8K per line

It would have been much easier to give the variables meaningful names!

jacob navia · Aug 4, 2011

Le 04/08/11 00:38, Robert Wessel a écrit :

On Wed, 03 Aug 2011 17:36:54 -0500, Robert Wessel

I'd point out that unlike Jacob's original, you're not displaying a
line number, although that's trivial. You'd probably change to cout
to something like

std::cout<< "["<< setw(3)<< linenumber<< "] "<< line<< std:endl;

And add the appropriate code to maintain linenumber.

Click to expand...

errr... std::setw(3)

Thanks to you and to Ian.

More important than the line number, is the behavior of the STL
container in case of errors, for instance when there is no more
memory. Does it throw an exception? In that case it would be
fully equivalent of the C code that tests for the return code
being less than zero.

In my Macintosh I compile using gcc the code to
-rw-r--r-- 1 jacobnavia staff 6548 3 aoû 23:23 unique.o // CPP
-rw-r--r-- 1 jacobnavia staff 1348 3 aoû 23:23 unique1.o // C

both are compiled using -Os (size optimization) using the same commpiler
~/Documents/stl $ gcc -v
Using built-in specs.
Target: i686-apple-darwin10
gcc version 4.2.1 (Apple Inc. build 5664)

I think that the codes are really equivalent, and I am pleased that
the CCL doesn't come all that bad. What performance is concerned
I doubt that there would be any differences since the programs are
bound by I/O anyway

Thanks again for your answers.

jacob

Ian Collins · Aug 4, 2011

Thanks to you and to Ian.

More important than the line number, is the behavior of the STL
container in case of errors, for instance when there is no more
memory. Does it throw an exception? In that case it would be
fully equivalent of the C code that tests for the return code
being less than zero.

In my Macintosh I compile using gcc the code to
-rw-r--r-- 1 jacobnavia staff 6548 3 aoû 23:23 unique.o // CPP
-rw-r--r-- 1 jacobnavia staff 1348 3 aoû 23:23 unique1.o // C

both are compiled using -Os (size optimization) using the same commpiler
~/Documents/stl $ gcc -v
Using built-in specs.
Target: i686-apple-darwin10
gcc version 4.2.1 (Apple Inc. build 5664)

I think that the codes are really equivalent, and I am pleased that
the CCL doesn't come all that bad. What performance is concerned
I doubt that there would be any differences since the programs are
bound by I/O anyway

Um, I tried both versions with the same optimisations (cc/CC -fast) on
my Solaris box and the run times (over several runs) for a 350K line
file with 214K unique lines were:

C: 5.027s
CC: 1.835s

Ian Collins · Aug 4, 2011

Um, I tried both versions with the same optimisations (cc/CC -fast) on
my Solaris box and the run times (over several runs) for a 350K line
file with 214K unique lines were:

C: 5.027s
CC: 1.835s

A quick profile shows the C version spends most of its time in strcmp
while the C++ version spends most of its time in I/O. So I guess the
performance difference is mainly between the r-b tree used by std::set
and your hashing algorithm.

So if I change your code to use std::set in order to remove the
difference in I/O libraries:

#include <set>
#include <string>
#include <stdio.h>

int main(int argc,char *argv[])
{
FILE* f = fopen(argv[1],"r");
if (f == NULL)
return -1;

std::set<std::string> unique;
char buf[8192];
int i = 0;

while (fgets(buf,sizeof(buf),f)) {
if( unique.insert(buf).second ) {
printf("[%3d] %s",i,buf);
}
i++;
}
fclose(f);
}

the run time drops to 0.66 seconds.

Juha Nieminen · Aug 4, 2011

Ian Collins said:
the run time drops to 0.66 seconds.

If you wanted to reduce the running time even further, and assuming the
entire input file fits in RAM, what you could do is to read the entire
file into RAM with one single fread() call and then use a std::set of
char* (and the proper comparator), each pointer pointing to this single
data block (which needs newlines replaced with null characters).

In this case the (probably significant) speedup comes from removing the
need to make a memory allocation for each line.

(And if that's not enough, then the next step is to not to use a
std::set, but a std::vector<char*> instead. Once it has been initialized,
sort it and make the elements unique (both one-liners). That way we avoid
making another memory allocation per line.)

Ian Collins · Aug 4, 2011

If you wanted to reduce the running time even further, and assuming the
entire input file fits in RAM, what you could do is to read the entire
file into RAM with one single fread() call and then use a std::set of
char* (and the proper comparator), each pointer pointing to this single
data block (which needs newlines replaced with null characters).

Running more than once should be sufficient to cache the file in RAM.

In this case the (probably significant) speedup comes from removing the
need to make a memory allocation for each line.

My point was the choice of container was the dominant player in this
comparison. All of the above optimisations can be applied to either
version.

jacob navia · Aug 4, 2011

Le 04/08/11 02:34, Ian Collins a écrit :

Um, I tried both versions with the same optimisations (cc/CC -fast) on
my Solaris box and the run times (over several runs) for a 350K line
file with 214K unique lines were:

C: 5.027s
CC: 1.835s

Did you change the number passed to the Create API?

The code as shown is optimized for a file of 500 lines or so.

Changing that 500 to (say 135000) would be significantly
better but would use much more memory.

jacob

Ian Collins · Aug 4, 2011

Le 04/08/11 02:34, Ian Collins a écrit :

Did you change the number passed to the Create API?

The code as shown is optimized for a file of 500 lines or so.

Changing that 500 to (say 135000) would be significantly
better but would use much more memory.

That does make a very significant difference:

0.51s

Slightly quicker than using std::set.

jacob navia · Aug 4, 2011

Le 04/08/11 01:40, Robert Wessel a écrit :

Yes, you'd get an exception if you ran out of memory. The handling of
that is a bit different than your program (you quietly return -1,
where as the C++ program would probably die with an "unhandled
exception" message, neither of which is really acceptable, although
both programs could obviously be improved).

The default behavior in the CCL is to print an error message in the
standard error stream. In this case the user would get a "No memory
left" message. That is why I do not print any error message and
return -1.

You can obviously change this behavior by another one if you wish.
(For instance aborting the program).

Marc · Aug 4, 2011

jacob said:
Le 04/08/11 02:34, Ian Collins a écrit :

Comparing a hash table to a binary tree is not really fair, try with
hash_set (common extension) or unordered_set (C++0x).

Did you change the number passed to the Create API?

The code as shown is optimized for a file of 500 lines or so.

Changing that 500 to (say 135000) would be significantly
better but would use much more memory.

Standard C++ hash tables automatically increase their number of buckets
when the number of objects is large enough.

Why should 135000 buckets use much more memory if there are enough
objects to fill most buckets?

jacob navia · Aug 4, 2011

Le 04/08/11 14:47, Marc a écrit :

Comparing a hash table to a binary tree is not really fair, try with
hash_set (common extension) or unordered_set (C++0x).

Standard C++ hash tables automatically increase their number of buckets
when the number of objects is large enough.

I am thinking about adding such an extension to the Dictionary
container. I have already such an extension for the general hash
table container, it resizes automatically. The Dictionary container
was thought as a light weight hash table, with character keys. The
general hash table can have keys of any type but its usage is
more cumbersome, that is why I splitted it into a dictionary
(good for 80% of applications) and a general hash table
(when you need resisizing, and keys that aren't character data.

Now, I am wondering if that interface is not MORE complicated than
a more capable dictionary... Anyway if the resizing is automatic I can
keep the same interface. The code is bulkier though.

Why should 135000 buckets use much more memory if there are enough
objects to fill most buckets?

Because there is more overhead per bucket. In my implementation
we have a table of <n> slots, and each slot with a single linked
list of objects that map to the same hash code.

Space overhead is:
(1) overhead for the linked list: 2*sizeof(void *) for each object
(2) Overhead of the table for the slots: sizeof(void *)/ number of
objects per slot.

You see? The larger the table, the more overhead it makes since the
occupancy decreases (and hence the speed of search increases)

jacob

Juha Nieminen · Aug 4, 2011

Ian Collins said:
Running more than once should be sufficient to cache the file in RAM.

That's not the point. The point was to avoid needless memory allocations.

jacob navia · Aug 4, 2011

Le 04/08/11 16:16, Juha Nieminen a écrit :

That's not the point. The point was to avoid needless memory allocations.

In the C containers library the container ALWAYS owns the memory used
by the objects/keys. This means the given buffers are always copied
when you add an item.

What is the behavior in C++, it is different?

Nobody · Aug 4, 2011

In the C containers library the container ALWAYS owns the memory used
by the objects/keys. This means the given buffers are always copied
when you add an item.

What is the behavior in C++, it is different?

No. But if you have a container which contains pointers, it
literally contains the pointers, not the data which they point to. E.g.
std::vector<char*>:

ush_back() will just add the supplied pointer, it
won't perform a strdup() (char* doesn't necessarily imply a NUL-terminated
string).

Using the containers library	0	Aug 3, 2011
HELP:function at c returning (null)	4	Mar 21, 2024
Weird Behavior with Rays in C and OpenGL	4	Feb 13, 2024
C# problem	1	Sep 11, 2024
Can't solve problems! please Help	0	Sep 26, 2022
Blue J Ciphertext Program	2	Nov 22, 2023
How can I view / open / render / display a pdf file with c code?	0	Sep 23, 2023
The C Containers library	17	Apr 23, 2011

C Containers Library vs STL

jacob navia

Ian Collins

jacob navia

Ian Collins

Ian Collins

Juha Nieminen

Ian Collins

jacob navia

Ian Collins

jacob navia

Marc

jacob navia

Juha Nieminen

jacob navia

Nobody

Ask a Question

Similar Threads

Members online

Forum statistics

Latest Threads