Building a Large Container

Mike Copeland

One of my applications is running very slowly, due to a high volume of
data. I'm currently using an STL map to store the data, and I'm trying
to convert to an STL vector for processing.
The basic operation is that I could be loading some of the container
from one source (or not), but then I am definitely loading data from a
second source - and modifying existing information if it came from the
first source. The data is accessed by an integer scalar variable (see
"bibNum", below). Here's the (new) data structure:

typedef struct ChipRec
{
    int    bibNum;              // Bib#
    int    dbeAge, ndfAge, gruAge;
    char   dbeGender, ndfGender, gruGender;
    char   dbeRCode, ndfRCode, gruRCode;
    bool   timeWritten, dbAdd;
    bool   dbeMatched, ndfMatched, gruMatched;
    time_t startDelta;
    time_t clockTime;
    string dbeName, gruName;
    string strAddress, strCity, strDoB;
    [etc.]
} ChipData;
extern ChipData chipWork;
typedef vector<ChipRec> CRVect;
extern CRVect chipVect;
extern CRVect::iterator chipIter;

I planned to sort my container and use a binary search to determine
if I was adding a new object or modifying an existing one. That
technique won't work, as I'd have to sort the container each time I add
an object (very slow!). The binary search won't help me, because it
doesn't return an iterator to the matched object for possible
modification.
It seems using an STL vector isn't a solution...8<{{ The STL map
works, but for 30,000 objects takes about 20 minutes to go through this
process.
Am I missing something (a different container, perhaps) that would
support my application in an efficient way? Thoughts? TIA
 
Daniel

One of my applications is running very slowly, due to a high volume of
data. I'm currently using an STL map to store the data, and I'm trying
to convert to an STL vector for processing.

typedef struct ChipRec
{
    int bibNum;   // Bib#


I planned to sort my container and use a binary search to determine
if I was adding a new object or modifying an existing one.
That technique won't work, as I'd have to sort the container each time I
add an object (very slow!). The binary search won't help me, because it
doesn't return an iterator to the matched object for possible
modification.

You don't need to re-sort. You can use std::lower_bound, like this:

class bibNum_compare
{
public:
    bool operator()(const ChipData& a, int bibNum) const
    {
        return a.bibNum < bibNum;
    }
};

....

auto it = std::lower_bound(v.begin(), v.end(), bibNum, bibNum_compare());
if (it != v.end() && it->bibNum == bibNum)
{
    // update record at location it
}
else
{
    v.insert(it, ...
}
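
For reference, a self-contained version of that pattern; the `upsert`
wrapper, the trimmed-down `ChipData`, and the completed insert call are
illustrative, not code from the post:

#include <algorithm>
#include <vector>

struct ChipData
{
    int bibNum;
    // ... other fields as in Mike's struct ...
};

struct bibNum_compare
{
    bool operator()(const ChipData& a, int bibNum) const
    {
        return a.bibNum < bibNum;   // order records by bibNum
    }
};

// Insert rec into v (kept sorted by bibNum), or overwrite the
// existing record with the same bibNum. No re-sort is needed.
void upsert(std::vector<ChipData>& v, const ChipData& rec)
{
    auto it = std::lower_bound(v.begin(), v.end(), rec.bibNum,
                               bibNum_compare());
    if (it != v.end() && it->bibNum == rec.bibNum)
        *it = rec;              // update in place
    else
        v.insert(it, rec);      // O(n) element shift, but no sort
}

Note that each mid-vector insert still shifts every element after it,
so a bulk load is usually faster done as push_back for everything, one
sort, then the lower_bound lookups.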

Daniel

"A plink a day keeps the plonker at bay." - Alf
 
Öö Tiib

It seems using an STL vector isn't a solution...8<{{ The STL map
works, but for 30,000 objects takes about 20 minutes to go through this
process.

That is the odd thing that you have posted before several times.
No one can imagine how reading those 30K little structures can take
20 minutes. It should go sub-second to populate such a tiny std::map.
Can you post real code?
 
Alf P. Steinbach

One of my applications is running very slowly, due to a high volume of
data. I'm currently using an STL map to store the data, and I'm trying
to convert to an STL vector for processing.

Presumably you mean `std::map`.

A `std::map` is sorted, on the key that you specify.

The basic operation is that I could be loading some of the container
from one source (or not), but then I am definitely loading data from a
second source - and modifying existing information if it came from the
first source. The data is accessed by an integer scalar variable (see
"bibNum", below). Here's the (new) data structure:

typedef struct ChipRec
{
    int    bibNum;              // Bib#
    int    dbeAge, ndfAge, gruAge;
    char   dbeGender, ndfGender, gruGender;
    char   dbeRCode, ndfRCode, gruRCode;
    bool   timeWritten, dbAdd;
    bool   dbeMatched, ndfMatched, gruMatched;
    time_t startDelta;
    time_t clockTime;
    string dbeName, gruName;
    string strAddress, strCity, strDoB;
    [etc.]
} ChipData;

In C++ you can just write

struct ChipRec { ... };

`typedef` isn't necessary -- it's a C-ism.
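
For instance, keeping just the first member:

struct ChipRec
{
    int bibNum;   // Bib#
    // ...
};

ChipRec chipWork;   // ChipRec is already a type name in C++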

extern ChipData chipWork;

Global variable might be a problem, e.g., is it initialized when you
access it?

In particular, a global as a scratch-pad variable, as this one seems to
be, is a practice that causes a lot of problems where it's adopted.

typedef vector<ChipRec> CRVect;
extern CRVect chipVect;
extern CRVect::iterator chipIter;

Ditto comments.

These variables apparently do nothing but create problems.

Ditch them.

I planned to sort my container and use a binary search to determine
if I was adding a new object or modifying an existing one. That
technique won't work, as I'd have to sort the container each time I add
an object (very slow!). The binary search won't help me, because it
doesn't return an iterator to the matched object for possible
modification.
It seems using an STL vector isn't a solution...8<{{ The STL map
works, but for 30,000 objects takes about 20 minutes to go through this
process.

The time you cite seems unreasonable/unbelievable/excessive, even with
some O(n^2) algorithm at work.

I suspect that in the inner code there, the code that is repeated in an
O(n^2) fashion or worse, is some network communication or disk activity.

Am I missing something (a different container, perhaps) that would
support my application in an efficient way? Thoughts? TIA

Well the technical problem you ask about, how to update a collection of
records, presumably identified by some key with some ordering defined,
is easy by using the C++ standard library. You're saying that you're
using `std::map`, and then that problem is already solved since a
`std::map` keeps its contents sorted and provides O(log n) random access
to items. Just use the [] indexing operator to get at an item with a
given key, or create a new one if there isn't one already.
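
A minimal sketch of that [] insert-or-update idiom; the `upsert`
wrapper and the origin flag are illustrative additions, not code from
the thread (the flag anticipates the point below about recording each
item's origin in the item itself):

#include <map>

struct ChipData
{
    int  bibNum;
    bool fromFirstSource;   // origin tag kept in the item itself
    // ... other fields ...
};

std::map<int, ChipData> records;

// records[key] is one O(log n) lookup that value-initializes a new
// entry when the key is absent -- insert and update in one step.
void upsert(int bibNum, const ChipData& incoming)
{
    ChipData& r = records[bibNum];  // existing entry, or a fresh one
    r = incoming;
    r.bibNum = bibNum;
}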

If you don't need the data sorted you can do this even more efficiently
by using a hash table, like C++11 std::unordered_map, but don't rush to
try that out: instead concentrate on ...

* getting rid of O(n^2) or worse behavior, and
* moving network or disk access away from the innermost code.

Again, you can use [] indexing with a `std::map`, and there are also
other operations (see the documentation) for accessing something fast.

To identify the origin of item data, include that in the item data.

Yes?


Cheers & hth.,

- Alf
 
kfrank29.c

Hi Mike!

Some more details about the specific processing you need
to do would be helpful in figuring out what data structures
and algorithms would best fit your use case.

One of my applications is running very slowly, due to a high volume of
data. I'm currently using an STL map to store the data, and I'm trying
to convert to an STL vector for processing.

I don't understand exactly what you mean here. Are you using a
std::map for building the data structure, and then do you feel
you need to convert it to a std::vector for further processing?

Why? I could imagine situations where this could make sense,
but they would be unusual.
The basic operation is that I could be loading some of the container
from one source (or not), but then I am definitely loading data from a
second source - and modifying existing information if it came from the
first source. The data is accessed by an integer scalar variable (see
"bibNum", below). Here's the (new) data structure:

typedef struct ChipRec
{
    int bibNum;                 // Bib#
    int dbeAge, ndfAge, gruAge;
    ...
    [etc.]
} ChipData;
...

I planned to sort my container and use a binary search to determine
if I was adding a new object or modifying an existing one.

You won't want to do this (in most cases). This is what
std::map (and std::set, or unordered_map and unordered_set)
are for.
That
technique won't work, as I'd have to sort the container each time I add
an object (very slow!). The binary search won't help me, because it
doesn't return an iterator to the matched object for possible
modification.
It seems using an STL vector isn't a solution...8<{{ The STL map
works,

Yes, I would think std::map should work.
but for 30,000 objects takes about 20 minutes to go through this
process.

This is the part that doesn't make a lot of sense to me.
It seems to me that 20 minutes is a very long time (on a
modern machine) to do something -- even something relatively
complicated -- with 30,000 objects.

Let's say that your ChipRec is 1000 bytes in size. (That
seems like an overestimate, unless your [etc.] is hiding
a lot of stuff in it.) Even then you would have only 30 MB,
for which 20 minutes seems like a very large amount of
processing.

What is taking that much time? Is it building the data
structure? Is it processing it after it is built?

Again, more detail about what processing you are actually
doing would be helpful.
Am I missing something (a different container, perhaps) that would
support my application in an efficient way? Thoughts? TIA

Based on my guesswork about what you're trying to do, I think
your use of std::map is probably a reasonable choice. I do
think you are missing something (but I don't know what) because
that 20-minute processing time seems very long.

I will make one suggestion about a possible data structure:
*If* you know that your bibNum will be bounded by some not
huge value, you could use a pre-allocated array of ChipRec's
of size bibNum_Max, and just index into it. (You could use
a raw c-style array or a pre-sized std::vector.) Your actual
bibNum's can have gaps in them -- you don't need to use them
all. So, for example, if you knew that bibNum <= 100,000,
you could allocate a contiguous array of 100,000 ChipRec's,
and put your 30,000 actual ChipRec's in it. You'd waste
about two-thirds of the space, but that would be your most
time-efficient data structure for indexing into it. (But
this only works if bibNum has some known, adequately small
upper limit.) Also, this is good if you are indexing into
your data structure, but is not so good if you want to iterate
through it (because you would have to recognize and waste
time skipping over all of the non-existing values).
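
A minimal sketch of that direct-index idea, assuming the bound really
is known; the `valid` flag for marking occupied slots is an
illustrative addition, not code from the thread:

#include <vector>

struct ChipData
{
    int  bibNum;
    bool valid;   // marks slots that actually hold a record
    // ... other fields ...
};

const int bibNum_Max = 100000;   // assumed known upper bound on bibNum

// One contiguous, value-initialized allocation; slot i holds bibNum i.
std::vector<ChipData> table(bibNum_Max);

void upsert(const ChipData& rec)
{
    ChipData& slot = table[rec.bibNum];  // O(1), no search at all
    slot = rec;
    slot.valid = true;
}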


Good luck.


K. Frank
 
red floyd

I will make one suggestion about a possible data structure:
*If* you know that your bibNum will be bounded by some not
huge value, you could use a pre-allocated array of ChipRec's
of size bibNum_Max, and just index into it. (You could use
a raw c-style array or a pre-sized std::vector.)

To heck with std::vector or C style array. Use std::array.
 
Mike Copeland

One of my applications is running very slowly, due to a high volume of
I don't understand exactly what you mean here. Are you using a
std::map for building the data structure, and then do you feel
you need to convert it to a std::vector for further processing?

No, I had only used the map: to load data into it; to read (other)
data to see if the key matched existing object data; and to either load
new data or update existing data in the objects. I had not used a vector
at all and only used the map container to create a "clean" data set
which I wrote to a database file.
The map was used to "easily" determine whether "initial data" exists
when "update data" is read, to either update the former or add new
information from the latter. The problem arose when the volume of data
being read/tested/updated/loaded (~30,000 data sets) ran so slowly that
I've been seeking a different approach to store/test than the map
container.
I've read that storing data into a vector (where a reserve can be
applied - a function a map doesn't offer) is considerably faster than
allocating a large number of individual map objects. And this is what
I'm trying to do here.
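
A minimal sketch of the reserve-then-append pattern being described;
`readNextRecord` is a hypothetical input routine, not code from the
thread:

#include <vector>

struct ChipData { int bibNum; /* ... */ };

bool readNextRecord(ChipData& rec);   // hypothetical input routine

std::vector<ChipData> chipVect;

void bulkLoad()
{
    chipVect.reserve(30000);      // one allocation up front
    ChipData rec;
    while (readNextRecord(rec))
        chipVect.push_back(rec);  // no reallocation below 30,000 items
}
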
Why? I could imagine situations where this could make sense,
but they would be unusual.
The basic operation is that I could be loading some of the container
from one source (or not), but then I am definitely loading data from a
second source - and modifying existing information if it came from the
first source. The data is accessed by an integer scalar variable (see
"bibNum", below). Here's the (new) data structure:

typedef struct ChipRec
{
    int bibNum;                 // Bib#
    int dbeAge, ndfAge, gruAge;
    ...
    [etc.]
} ChipData;
...

I planned to sort my container and use a binary search to determine
if I was adding a new object or modifying an existing one.

You won't want to do this (in most cases). This is what
std::map (and std::set, or unordered_map and unordered_set)
are for.

I wanted to avoid the overhead of a linear search. Hence the
perceived need to have a sorted vector.
Yes, I would think std::map should work.
Yes, it works. However, with data volumes greater than, say, 2500
the program runs very slowly - as I can see from a running counter that
I display during processing. 8<{{
 
Mike Copeland

One of my applications is running very slowly, due to a high volume of
Presumably you mean `std::map`.
Yes.
A `std::map` is sorted, on the key that you specify.

Yes, but that's not the issue. The problem is that when storing many
objects, each object's memory allocation is noticeably slow. I wanted
to see if std::map had a "reserve" function - it doesn't, and I
understand why. So, 30,000+ std::map objects being stored has become a
big problem, and I'm looking for a different approach.
"bibNum", below). Here's the (new) data structure:
typedef struct ChipRec
{
    int    bibNum;              // Bib#
    int    dbeAge, ndfAge, gruAge;
    char   dbeGender, ndfGender, gruGender;
    char   dbeRCode, ndfRCode, gruRCode;
    bool   timeWritten, dbAdd;
    bool   dbeMatched, ndfMatched, gruMatched;
    time_t startDelta;
    time_t clockTime;
    string dbeName, gruName;
    string strAddress, strCity, strDoB;
    [etc.]
} ChipData;

In C++ you can just write
struct ChipRec { ... };
`typedef` isn't necessary -- it's a C-ism.
Understood.
extern ChipData chipWork;
Global variable might be a problem, e.g., is it initialized when you
access it?

Yes, always.
In particular, a global as a scratch-pad variable, as this one seems to
be, is a practice that causes a lot of problems where it's adopted.

True. However, I've been making this "faux pas" for more than 50
years of programming, and I accept the risk. 8<}}
 
Mike Copeland

It seems using an STL vector isn't a solution...8<{{ The STL map
That is the odd thing that you have posted before several times.
No one can imagine how reading those 30K little structures can take
20 minutes. It should go sub-second to populate such a tiny std::map.
Can you post real code?
I thought my original message was already too long and wordy (a fault
of mine): posting the code that does all the work would surely annoy the
regulars here, no?
I'll see what I can do, though...
 
Öö Tiib

I thought my original message was already too long and wordy (a fault
of mine): posting the code that does all the work would surely annoy the
regulars here, no?

Cutting it down to a smaller example/testbed never hurts. If
you can't, then others probably see thousands of lines of code every
day anyway, so ... why would it annoy? Post away.
I'll see what I can do, though...

If you have it in some public repo (like Github) then you can
post a link to it ... I suppose.
 
Ian Collins

Mike said:
Yes, it works. However, with data volumes greater than, say, 2500
the program runs very slowly - as I can see from a running counter that
I display during processing. 8<{{

So where does your profiler tell you the bottleneck is?
 
Jorgen Grahn

Cutting it down to a smaller example/testbed never hurts. If
you can't, then others probably see thousands of lines of code every
day anyway, so ... why would it annoy? Post away.

It does annoy, but it's a lot more annoying to read these "30K objects
in 20 minutes" questions so yeah -- please post!

/Jorgen
 
Mike Copeland

I will make one suggestion about a possible data structure:
To heck with std::vector or C style array. Use std::array.
I'm considering this approach. I'll have to scan the input file(s)
first, to see what the maximum index value is, before I can allocate the
std::array. Might be faster overall...
 
Mike Copeland

Yes, it works. However, with data volumes greater than, say, 2500
So where does your profiler tell you the bottleneck is?
I don't have a profiler (for VS2010 Express). Is there one
available?
 
Mike Copeland

Why? I could imagine situations where this could make sense,
but they would be unusual.

The basic operation is that I could be loading some of the container
from one source (or not), but then I am definitely loading data from a
second source - and modifying existing information if it came from the
first source. The data is accessed by an integer scalar variable (see
"bibNum", below). Here's the (new) data structure:

typedef struct ChipRec
{
    int bibNum;                 // Bib#
    int dbeAge, ndfAge, gruAge;
    ...
    [etc.]
} ChipData;
...

I planned to sort my container and use a binary search to determine
if I was adding a new object or modifying an existing one.

You won't want to do this (in most cases). This is what
std::map (and std::set, or unordered_map and unordered_set)
are for.

The "20 minutes for 30,000 objects" is an exaggeration: it's 14-15
minutes. Still bad, though, and it's a problem that must be corrected.
As has been suggested, I'll try a std::array. 8<}}
 
peter koch

On Tuesday, 3 December 2013 at 18:50:06 UTC+1, Mike Copeland wrote:
The "20 minutes for 30,000 objects" is an exaggeration: it's 14-15
minutes. Still bad, though, and it's a problem that must be corrected.
As has been suggested, I'll try a std::array. 8<}}

I do not know who recommended using std::array, but this is a really
bad idea. Your problem lies elsewhere. I believe it was Juha who
suggested breaking into the code using the debugger (just start the
program then press CTRL-BREAK). If you do this a few times, I am sure
that you will discover that the same piece of code is called
repeatedly. A very poor man's profiler, but that should suffice.

/Peter
 
K. Frank

Hello Mike!

...
The "20 minutes for 30,000 objects" is an exaggeration: it's 14-15
minutes.

Mike, what's the big secret? Several posters have suggested
that "20 minutes for 30,000 objects" (or even merely 14-15
minutes) is truly outlandish.

Either you're doing something substantive that you're not
telling us about (in which case we're in the dark, so we
can't help you), or you're doing something wacky. What are you
actually doing?
Still bad, though, and it's a problem that must be corrected.
As has been suggested, I'll try a std::array. 8<}}

If you're doing something wacky, it's very unlikely that
switching to std::array will help you. You'd just be taking
a shot in the dark that misses the real issue.

But unless you tell us what you're actually doing (e.g., post
some code), we have no idea what the real issue is, so we can
hardly help you.

You've probably got some detail wrong some place, but if you
keep all those details secret you keep us clueless about how
we might help you fix your problem.

As they say -- Details, details ...


Good luck.


K. Frank
 
Öö Tiib

I don't have a profiler (for VS2010 Express). Is there one
available?

If you want a simple free profiler, then Very Sleepy is one. Basically
it gets the job done. http://www.codersnotes.com/sleepy/

If you want a more accurate free profiler, the Windows Performance
Toolkit contains PerfMon.
http://msdn.microsoft.com/en-us/library/hh162945.aspx
It works on Windows Vista or newer. Not too user-friendly.

There are several profilers with better accuracy than Very Sleepy
and better usability than PerfMon, but these cost money (AQTime,
Intel VTune, DevPartner Studio, etc.). They may have free trials.

It is rather hard (sometimes barely possible) to fine-tune performance
without a profiler.
 
Mike Copeland

Either you're doing something substantive that you're not
telling us about (in which case we're in the dark, so we
can't help you), or you're doing something wacky. What are you
actually doing?


If you're doing something wacky, it's very unlikely that
switching to std::array will help you. You'd just be taking
a shot in the dark that misses the real issue.

But unless you tell us what you're actually doing (e.g., post
some code), we have no idea what the real issue is, so we can
hardly help you.

You've probably got some detail wrong some place, but if you
keep all those details secret you keep us clueless about how
we might help you fix your problem.
Well, I've changed the whole process to use a std::array...and it
hasn't helped at all. 8<{{
So, okay, I've got some other problem - somewhere in the data
conversion and preparation process. Here's what I'm doing (without
exposing you all to an enormous amount of detail code):
1. I am reading a CSV file, which I parse into individual fields.
2. For each field (~15 of them), I either trim the alphanumeric data or
convert the numeric data to populate the data object (which has been a
std::map, std::vector, and now is a std::array object).
3. I do some data validation on the data fields, before I post the
object to the container.
4. As each object is stored, I display a running count of objects
processed. This is where I can see that the activity is running slowly
and erratically - the running counter does not update smoothly, and it
is quite slow for the large volume of records (~30,000) I'm processing.
The problem wasn't really noticeable prior to running this large
input file. I had been dealing with ~2000 input records and never
noticed a concern.
I now see that something in the above processing has substantial
overhead, but I don't yet know where or what. For now, I'm going to
comment out various functions to see if I can determine where the
overhead is. <sigh...>
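
A minimal sketch of the kind of parse loop that steps 1-4 describe; the
field handling is elided and the helper names are illustrative, not
Mike's code. One detail worth noting: per-record console output is slow
and erratic, so the sketch updates the counter only every 1,000 records.

#include <fstream>
#include <iostream>
#include <sstream>
#include <string>
#include <vector>

// Split one CSV line into fields (no quoting/escaping handled).
std::vector<std::string> splitCsv(const std::string& line)
{
    std::vector<std::string> fields;
    std::istringstream ss(line);
    std::string field;
    while (std::getline(ss, field, ','))
        fields.push_back(field);
    return fields;
}

void loadFile(const char* path)
{
    std::ifstream in(path);
    std::string line;
    long count = 0;
    while (std::getline(in, line))
    {
        std::vector<std::string> fields = splitCsv(line);
        // ... trim, convert, validate, and store the record ...
        ++count;
        if (count % 1000 == 0)          // throttle the display:
            std::cout << count << '\r' << std::flush;
    }                                   // per-record output can
    std::cout << count << '\n';         // dominate the run time
}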
 
Jorgen Grahn

On Wed, 2013-12-04, Mike Copeland wrote:

[attribution lost]
Well, I've changed the whole process to use a std:array...and it
hasn't helped at all. 8<{{
So, okay, I've got some other problem - somewhere in the data
conversion and preparation process. Here's what I'm doing (without
exposing you all to an enormous amount of detail code):

Like others have explained over and over again, no one
can help you without seeing that code.

/Jorgen
 
