size_t

assaarpa

IRIX 6.5.21 / Octane R10000 250 MHz / g++ 3.4.0:

std::list
push_back() 149.844 ms, sum: -1474736480
iteration 24.733 ms, sum: -1474736480

core::list
push_back() 35.464 ms, sum: -1474736480
iteration 20.326 ms, sum: -1474736480

std::vector
push_back() 16.166 ms, sum: -1474736480
iteration 2.652 ms, sum: -1474736480

core::vector
push_back() 20.698 ms, sum: -1474736480
iteration 2.188 ms, sum: -1474736480

That said, the test is very primitive: it measures only insertion and
iteration performance, although those are very commonly used operations. The
test pattern doesn't reflect real-world use very well (only push_back and
then iterate, what, 20,000 times in a row? Or was it 200,000 times? :) So I
acknowledge that the test isn't a very good one, but those are the results,
whatever the quality of the test. :)
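
For reference, here is a minimal sketch of the kind of harness described
above (not the actual test code, which was not posted; the element count and
values are guesses):

#include <cstdio>
#include <ctime>
#include <list>
#include <vector>

// Sketch of the push_back + iterate test. Element count and payload
// values are assumptions, not the original figures.
template <typename Container>
void benchmark(const char* name)
{
    const int N = 200000;
    Container c;

    std::clock_t t0 = std::clock();
    for (int i = 0; i < N; ++i)
        c.push_back(i * 37);            // arbitrary payload to checksum
    std::clock_t t1 = std::clock();

    long sum = 0;
    for (typename Container::iterator it = c.begin(); it != c.end(); ++it)
        sum += *it;                     // forces a full traversal
    std::clock_t t2 = std::clock();

    std::printf("%s\n", name);
    std::printf("push_back() %.3f ms, sum: %ld\n",
                (t1 - t0) * 1000.0 / CLOCKS_PER_SEC, sum);
    std::printf("iteration %.3f ms, sum: %ld\n",
                (t2 - t1) * 1000.0 / CLOCKS_PER_SEC, sum);
}

int main()
{
    benchmark< std::list<int> >("std::list");
    benchmark< std::vector<int> >("std::vector");
}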
 
Peter Koch Larsen

"assaarpa" <[email protected]> skrev i en meddelelse
[snip]
I had the same sentiments: the test is pretty trivial, and indeed it doesn't
try splice. Neither did I implement splice or the other features that were
introduced in C++2003 (I could be wrong about the version; all the same,
these methods are not implemented :) they are commented out as a "TODO"
section in the header.

[snip]

The standard container is balanced in order to make all operations
reasonably fast. Thus I would like to have seen more tests (although splice
is perhaps not one of the operations I would care about).
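
For context, splice moves nodes between lists in constant time without
copying the elements; a minimal illustration (not from the thread):

#include <cstdio>
#include <list>

int main()
{
    std::list<int> a, b;
    for (int i = 0; i < 5; ++i)
    {
        a.push_back(i);          // a: 0 1 2 3 4
        b.push_back(10 + i);     // b: 10 11 12 13 14
    }

    a.splice(a.end(), b);        // O(1): relink b's nodes onto a

    // b is now empty; a holds 0..4 followed by 10..14.
    for (std::list<int>::iterator it = a.begin(); it != a.end(); ++it)
        std::printf("%d ", *it);
    std::printf("\n");
}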
Here's output from Windows XP Professional SP1 / Athlon64 3000+ / Visual C++
.NET 2003:

[snip]

Far more interesting is the project settings. Are you running debug code?
What optimizations are involved? Running in debug mode could easily penalise
std::list more than assaarpa::list.

/Peter
 
assaarpa

Far more interesting is the project settings. Are you running debug code?
What optimizations are involved? Running in debug mode could easily penalise
std::list more than assaarpa::list.

I never profile debug builds. For each compiler the settings are the
defaults, with only a few settings tuned to favour speed (/O3, /Ox, or
similar, depending on which compiler was used) and very few furry dice or
other spices thrown in.

Here are the Bill Gates settings:

/Ox /D "WIN32" /D "NDEBUG" /D "_CONSOLE" /D "_MBCS" /FD /EHsc /ML /GS /FAs
/Fa"Release/" /Fo"Release/" /Fd"Release/vc70.pdb" /W3 /nologo /c /Wp64 /Zi
/TP

(That conflicts with what I wrote above because it is copy-pasted from the
Command Line output window of the solution settings; the only change I made
was to switch /O2 to /Ox after creating a fresh solution.)

g++ settings:
-g -Wall -I$(incbase) -O3 $(libs)

So in practice only -O3 directly affects the code generator; if there are
any specific settings you would like me to try, please post them.
 
Ioannis Vranos

assaarpa said:
I never profile debug builds. For each compiler the settings are the
defaults, with only a few settings tuned to favour speed (/O3, /Ox, or
similar, depending on which compiler was used) and very few furry dice or
other spices thrown in.

Here are the Bill Gates settings:

/Ox /D "WIN32" /D "NDEBUG" /D "_CONSOLE" /D "_MBCS" /FD /EHsc /ML /GS /FAs
/Fa"Release/" /Fo"Release/" /Fd"Release/vc70.pdb" /W3 /nologo /c /Wp64 /Zi
/TP


/O2 is the best for speed. Also use /Og /Oi /Ot /Oy /GA.

So in summary use these:

C:\c>cl /O2 /Og /Oi /Ot /Oy /GA /EHsc temp.cpp

And add /clr if you want the code to be IL.



(That conflicts with what I wrote above because it is copy-pasted from the
Command Line output window of the solution settings; the only change I made
was to switch /O2 to /Ox after creating a fresh solution.)

g++ settings:
-g -Wall -I$(incbase) -O3 $(libs)

So in practice only -O3 directly affects the code generator; if there are
any specific settings you would like me to try, please post them.

This is what I use:

C:\MinGW\bin\g++.exe -std=c++98 -pedantic-errors -Wall
-fexpensive-optimizations -O3 -ffloat-store -mcpu=pentiumpro temp.cpp -o
temp.exe
 
assaarpa

C:\MinGW\bin\g++.exe -std=c++98 -pedantic-errors -Wall
-fexpensive-optimizations -O3 -ffloat-store -mcpu=pentiumpro temp.cpp -o
temp.exe

-mcpu is deprecated on g++ 3.4.2; omitting it, the first four timings are:

54.5 ms
2.7 ms
29.5 ms
2.3 ms

The compiler options cannot do very much to close that gap, because I
*cheated*: normally, if you want to allocate internal nodes for a linked
list, you would use the allocator, but core::list has a built-in block
allocator which stores the allocated blocks in this format:

struct foo
{
    foo* next; // link to the next delete[] call when we want to get rid of the memory :)
    bar* data; // information for delete[]
};
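
As a rough sketch of how such an allocator might work (names and details are
my own, not the posted implementation; 16 nodes per block matches the lot
size mentioned below):

// Sketch only: nodes are carved out of arrays, and each array is
// remembered on a chain so it can be handed back to delete[] later.
struct node { unsigned long link; int value; };  // link = next XOR prev

struct block
{
    block* next;   // chain of blocks, walked at destruction time
    node* data;    // the array handed back to delete[]
};

class block_allocator
{
    block* blocks;     // head of the block chain
    node* free_node;   // next unused node in the current block
    int remaining;     // unused nodes left in the current block
public:
    block_allocator() : blocks(0), free_node(0), remaining(0) {}

    node* allocate()
    {
        if (remaining == 0)
        {
            block* b = new block;
            b->data = new node[16];   // one allocation serves 16 nodes
            b->next = blocks;
            blocks = b;
            free_node = b->data;
            remaining = 16;
        }
        --remaining;
        return free_node++;
    }

    ~block_allocator()
    {
        while (blocks)                // one delete[] per 16 nodes
        {
            block* b = blocks;
            blocks = b->next;
            delete[] b->data;
            delete b;
        }
    }
};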

This contributes to a reduced memory footprint. I know this is platform
dependent, but let's look at how things work with Visual C++ .NET 2003 on
Windows. When we allocate memory using malloc or new (assuming they are not
overloaded), the block size for a single memory allocation is 8 bytes; this
is the minimum amount of memory that can be allocated. The overhead per
allocated block is 24 bytes. Our iterator node fits easily within 8 bytes,
since it only stores the XOR of the next and prev pointers; actually both
next and prev would fit into a single allocation block on this platform,
totaling 32 bytes consumed per node.

Since we allocate in lots of at least 16 blocks, the 24 bytes of overhead is
distributed evenly over the nodes, so the memory footprint is the mandatory
4 bytes for the XOR plus 1.5 bytes of allocation overhead: we spend 5.5
bytes of memory per node, which is roughly a 6-to-1 saving in memory
footprint for the linked list management. Additionally, the nodes are
initially placed in sequential memory locations, which is very cache
friendly (it is also very cache friendly that iteration only has to touch
1/6 of the unique memory addresses!)

In this scenario saving merely FOUR BYTES (one pointer) means a 50% saving
in memory footprint: 5.5 vs. 9.5 bytes per node! If we did not have the
built-in block allocation of nodes, the impact of having two separate
pointers would be negligible. Now that we know this, we also realize that a
custom allocator would be only HALF AS EFFICIENT as one integrated into the
list class itself (compare the memory footprints for both cases, the XOR
and the next/prev strategy).
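
For clarity, here is the classic XOR trick in isolation (a sketch, not
core::list's actual code; unsigned long is assumed to be pointer-sized, as
on the 32-bit platform above):

#include <cstdio>

// Each node stores (next XOR prev) in one pointer-sized field;
// iteration carries the previous node's address to recover the next.
struct xnode
{
    unsigned long link;   // next XOR prev, stored as an integer
    int value;
};

inline xnode* step(xnode* prev, xnode* cur)
{
    // XOR with the address we came from yields the other neighbour.
    return reinterpret_cast<xnode*>(cur->link ^
                                    reinterpret_cast<unsigned long>(prev));
}

int main()
{
    xnode n[3] = { { 0, 1 }, { 0, 2 }, { 0, 3 } };
    // Build the links; the list's ends XOR with null (0).
    n[0].link = 0 ^ reinterpret_cast<unsigned long>(&n[1]);
    n[1].link = reinterpret_cast<unsigned long>(&n[0]) ^
                reinterpret_cast<unsigned long>(&n[2]);
    n[2].link = reinterpret_cast<unsigned long>(&n[1]) ^ 0;

    int sum = 0;
    xnode* prev = 0;
    xnode* cur = &n[0];
    while (cur)
    {
        sum += cur->value;
        xnode* next = step(prev, cur);
        prev = cur;
        cur = next;
    }
    std::printf("sum: %d\n", sum);   // prints 6
}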

I just got this idea one day and figured it was worth trying after a little
consideration. Making improvements to vector is not very easy, since my code
seems to be consistently 1% or so slower (I am using specialization to
detect whether the type is POD, et cetera, to give a little extra boost; I
don't do placement new or anything like that, I kept it very simple). I also
notice that the std::map implementation is very good, at least as good as an
RB tree, clearly not an AVL tree (if some std::map implementations do use an
AVL tree, all the more power to them; I generally found the RB tree to be
better than the AVL tree, but I haven't really looked into it very much,
just compared performance a little with very basic tree implementations).
P.S. I am fully aware that std::map doesn't imply a tree, but that is the
implementation that first comes to mind when reading the performance
contract the interface makes with the client. I could be totally wrong of
course, so I'm just thinking "aloud"; the thoughts are not very coherent
anymore, excuse me, I think I will sleep now.
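
For what it's worth, the POD-detection trick can be sketched like this in
C++98, which had no is_pod trait (all names here are illustrative, not the
posted code):

#include <cstddef>
#include <cstring>

// Hand-rolled trait: false by default, specialized per POD type the
// vector should fast-path.
template <typename T> struct is_pod { enum { value = false }; };
template <> struct is_pod<int> { enum { value = true }; };
template <> struct is_pod<double> { enum { value = true }; };
// ...one specialization per POD type of interest

// Dispatch on POD-ness at compile time: memcpy for PODs, an
// element-by-element copy loop otherwise.
template <bool IsPod> struct copier
{
    template <typename T>
    static void copy(T* dst, const T* src, std::size_t n)
    {
        for (std::size_t i = 0; i < n; ++i) dst[i] = src[i]; // safe path
    }
};

template <> struct copier<true>
{
    template <typename T>
    static void copy(T* dst, const T* src, std::size_t n)
    {
        std::memcpy(dst, src, n * sizeof(T)); // raw copy for PODs
    }
};

template <typename T>
void copy_elements(T* dst, const T* src, std::size_t n)
{
    copier< is_pod<T>::value != 0 >::copy(dst, src, n);
}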
 
