Memory issues with Map

M

mohitanchlia

I have written a program that loads a file having 3 columns into a
map. Here is the declaration:

struct strE
{
char iEID[22+1];
char acEN[6+1];
int iDed;
};

map<string, struct strE> mEro;

struct strE eStr = {0,};
if ( 0 < vDataColumn.size() )
eStr.iDzReceived =
atol(vDataColumn.at(0).c_str());
//vDataColumn has columns from file which is "|" delimiter separated.
We split the line into columns and store it in vDataColumn

mEro.insert(std::pair<std::string,
struct strERO>(key, eStr)
);

I am seeing that for 6MB file the resident memory is around 12 MB. It
looks like either there is a memory leak or limitation in Map. I am
not sure what kind of memory mangement is used by Map. Could somebody
shed some light as what would be the best approach and dos and donts
when dealing with Maps
 
V

Victor Bazarov

I have written a program that loads a file having 3 columns into a
map. Here is the declaration:

struct strE
{
char iEID[22+1];
char acEN[6+1];
int iDed;
};

map<string, struct strE> mEro;

You may omit the keyword 'struct' here. It's not needed. 'strE'
is the name of the *type* (just like 'string'). We're not in C any
more.
struct strE eStr = {0,};

Same here, you may omit 'struct '.
if ( 0 < vDataColumn.size() )
eStr.iDzReceived =

There is no member 'iDzReceived' in 'strE'.
atol(vDataColumn.at(0).c_str());
//vDataColumn has columns from file which is "|" delimiter separated.
We split the line into columns and store it in vDataColumn

mEro.insert(std::pair<std::string,
struct strERO>(key, eStr)

What's 'strERO'?
);

I am seeing that for 6MB file the resident memory is around 12 MB. It
looks like either there is a memory leak or limitation in Map.

Memory leak can be discovered by using the tools that exist for that
particular purpose. Limitation in std::map? Looks a lot more like
"overhead" than "limitation" to me...
I am
not sure what kind of memory mangement is used by Map. Could somebody
shed some light as what would be the best approach and dos and donts
when dealing with Maps

Please find a copy of 'The C++ Standard Library' by Josuttis. It
contains so much information on memory allocation for standard
containers that the margins of this posting are too narrow to hold
it.

V
 
P

Phil Endecott

I have written a program that loads a file having 3 columns into a
map.
I am seeing that for 6MB file the resident memory is around 12 MB.

Each string and each element in the map has some overhead, i.e. pointers
to the child nodes if the map uses a tree implementation internally (as
it probably does), and pointers to the start and the end of the data for
the string. How many elements are there in your 6MB file? I would
expect of the order of 20 bytes per element for your code.

Now, how to reduce the overhead.... that is the interesting question.


Phil.
 
L

LR

I have written a program that loads a file having 3 columns into a
map. Here is the declaration:

struct strE
{
char iEID[22+1];
char acEN[6+1];
int iDed;
};

map<string, struct strE> mEro;

struct strE eStr = {0,};
if ( 0 < vDataColumn.size() )
eStr.iDzReceived =
atol(vDataColumn.at(0).c_str());
//vDataColumn has columns from file which is "|" delimiter separated.
We split the line into columns and store it in vDataColumn

mEro.insert(std::pair<std::string,
struct strERO>(key, eStr)
);

I am seeing that for 6MB file the resident memory is around 12 MB. It
looks like either there is a memory leak or limitation in Map. I am
not sure what kind of memory mangement is used by Map. Could somebody
shed some light as what would be the best approach and dos and donts
when dealing with Maps

Aside from the inconstancies that Victor pointed out in another post,
there are a few things that might be useful to know.

How many entries in the 6MB file.

You say that your file has three columns. What are they? I ask because
your struct strE (or strERO?) has three member variables and the key of
the map would make four values per entry. So if your data has three
values per entry and you're storing per entry (key and three others)
would that account for the change in size?

What types are the three values in the data file? And for each of the
'string' type values what is the largest, smallest, and average size?

Is it possible that you'd be better off rewriting your struct with
std::string instead of char arrays? Or perhaps writing some very
lightweight string class if memory usage is a big concern?

Also, what is sizeof(int) on your system and how many bits is a char?

I think given the information you've provided so far, it's difficult to
give a useful answer.

LR
 
V

Victor Bazarov

LR said:
I have written a program that loads a file having 3 columns into a
map. Here is the declaration:

struct strE
{
char iEID[22+1];
char acEN[6+1];
int iDed;
};

map<string, struct strE> mEro;
[..]

Is it possible that you'd be better off rewriting your struct with
std::string instead of char arrays? Or perhaps writing some very
lightweight string class if memory usage is a big concern?

Considering that the arrays are only 23 bytes and 7 bytes long, it
seems that using 'std::string' (which usually contains at least
a pointer and a size_t, making it 8 bytes, plus it allocates its
buffer somewhere on the heap requiring heap overhead) might not be
beneficial (aside from the convenience point of view). Lightweight
string classes also take memory...
Also, what is sizeof(int) on your system and how many bits is a char?

Wouldn't they be the same in the file?
I think given the information you've provided so far, it's difficult
to give a useful answer.

std::map carries some overhead. With elements that are small, the
overhead is going to be felt stronger than with elements that are
large. Paying for search speed and convenience with as much of the
memory as required to store just the data is not that awful, if you
ask me.

V
 
L

LR

Victor said:
LR said:
I have written a program that loads a file having 3 columns into a
map. Here is the declaration:

struct strE
{
char iEID[22+1];
char acEN[6+1];
int iDed;
};

map<string, struct strE> mEro;
[..]
Is it possible that you'd be better off rewriting your struct with
std::string instead of char arrays? Or perhaps writing some very
lightweight string class if memory usage is a big concern?

Considering that the arrays are only 23 bytes and 7 bytes long, it
seems that using 'std::string' (which usually contains at least
a pointer and a size_t, making it 8 bytes, plus it allocates its
buffer somewhere on the heap requiring heap overhead) might not be
beneficial (aside from the convenience point of view). Lightweight
string classes also take memory...

I guess it depends on how big the fields are in the file.

If they're all fixed length then we can save 1 byte for each char array
at the expense of having to write some class that implements a fixed
length string. Assuming that's important to us.

For the 23 char array, if the lengths vary all over the place, but tend
toward most of them being say, less than ten bytes with the largest
being 23 chars, then maybe a lightweight string class would be useful.
Maybe just a pointer to zero terminated strings.
Wouldn't they be the same in the file?

The number of bits in a char? Yes, almost certainly. Don't know what I
was thinking there. But the OP seems to be storing the int as text in
the file. I suppose I should also have asked about these values as
well. Do these tend to have more than sizeof(int) digits? Or fewer?
std::map carries some overhead. With elements that are small, the
overhead is going to be felt stronger than with elements that are
large.

Yes. But I think that we still don't really know what the keys are like
and what kind of data is being stored in that struct.
> Paying for search speed and convenience with as much of the
> memory as required to store just the data is not that awful, if you
> ask me.

I tend to agree, it's generally not a big concern for me either, but I
don't know what limitations the OP has for memory usage or if it's
possible that he does in fact have a leak somewhere.

I think finding out how many entries he's storing in the map and the
size of the key strings would be useful.

LR
 
J

James Kanze

Victor Bazarov wrote:

[...]
The number of bits in a char? Yes, almost certainly.

What makes you think that?

In practice, all general purpose machines today use 8 bit bytes,
and of course, 8 bits will always be 8 bits. Some more
specialized machines don't, however---there is a mainframe with
9 bit bytes, and a lot of DSP's use 16 or 32 bit bytes. Because
8 bit bytes are so widespread, however, such machines
doubtlessly have facilities for reading files with 8 bit bytes
(that were written on other machines), at least in so far as
they have facilities for reading files.

Depending on context, you may or may not have to take such
machines into consideration.
Don't know what I was thinking there. But the OP seems to be
storing the int as text in the file. I suppose I should also
have asked about these values as well. Do these tend to have
more than sizeof(int) digits? Or fewer?

When storing as text, don't forget the needed separators when
considering size.
 
L

LR

James said:
Victor Bazarov wrote:
[...]
Also, what is sizeof(int) on your system and how many bits is a char?
Wouldn't they be the same in the file?
The number of bits in a char? Yes, almost certainly.

What makes you think that?

Think that it's almost certain? Because as you say...

In practice, all general purpose machines today use 8 bit bytes,
and of course, 8 bits will always be 8 bits.

True for all general purpose machines today, yes. But I've worked on
system with some strange arrangements. 8 bit chars in 12 bits. 8 bit
characters as 6 bit chars with escape sequences. 8 bit chars packed in
60 bit words. 8 bit chars in 9 bits. 12 bit chars stored as 8 bit chars
with escape sequences.
> Some more
specialized machines don't, however---there is a mainframe with
9 bit bytes, and a lot of DSP's use 16 or 32 bit bytes. Because
8 bit bytes are so widespread, however, such machines
doubtlessly have facilities for reading files with 8 bit bytes
(that were written on other machines), at least in so far as
they have facilities for reading files.

In my limited experience with mainframes, these systems may do 8 bit IO
for terminals, but the IO of 8 bit char files on these systems have
often been limited to rather complicated conversion programs.

Depending on context, you may or may not have to take such
machines into consideration.

Ok, I stand corrected again. I was right the first time. ;)

When storing as text, don't forget the needed separators when
considering size.

We haven't heard from the OP, so I don't know for sure, but the OP said
that there were three columns in the file and these were delimited by '|'.

The char arrays in the struct the OP presented were almost certainly
zero terminated. I conclude this, because of the way the OP defined the
arrays [22+1] and [6+1], which seems to be a popular idiom for
indicating the size of the data and adds one for the termination character.

If we assume (Danger!) that two of the fields in the file are char, and
the third is a character representation of an int, then the two
delimiters are accounted for already since we (likely) have two zero
terminated char arrays.

Although, I suppose it's possible that the file contains something like:
|5|hello|world|
instead of
5|hello|world

LR
 
P

Phil Endecott

Well the OP hasn't replied, but I think this is an interesting subject,
having had to solve similar problems myself in the past.

Here's a simpler case: say I have a std::map<int,int>, which I
initialise from a file and subsequently only read from. The memory
required will be substantially more than 2*N*sizeof(int); I'm not sure
exactly what the overhead of a typical map implementation is, but I
guess that each tree node has pointers to two child nodes, a parent
node, and the data; if sizeof(T*)==sizeof(int), then that needs
6*N*sizeof(int) plus any allocator overhead.

So here's one way to improve that: use a std::vector said:
>. Read them in and then sort the array, and then use a binary search
to do lookups. That should have the same complexity as using a map, and
needs only 2*N*sizeof(int) storage.

But it becomes more interesting if you want to be able to modify the
data after initially loading it. With the vector, insert will be O(N)
rather than O(log N). So is there a solution that has O(log N) insert
and lookup complexity and ~ 2*N*sizeof(int) storage? For example, can
you group together M of the pair<int,int>s, and store those groups in a
map? The overhead of the map is then amortised over the M elements.
Example (just the flavour, lots of details ommitted - feel free to
finish it if you like!):

template <typename KEY, typename VALUE>
class dense_map {

typedef std::vector<VALUE> vec_t;
typedef std::map<KEY,vec_t> impl_t;
impl_t impl;

public:
VALUE& operator[](KEY k) {
impl_t::iterator i = impl.upper_bound(k); --i;
vec_t& v = i->second;
vec_t::iterator j = std::find(v.begin(),v.end(),k);
return *j;
}

void insert(std::pair<KEY,VALUE> p) {
KEY& k = p.first;
impl_t::iterator i = impl.upper_bound(k); --i;
vec_t& v = i->second;
if (v.size()>max_size_per_chunk) {
// split v into two chunks, and re-start the insert
} else {
v.push_back(p);
}
}
};


Question: can something like this be implemented that has the same
guarantees (complexity, iterator validity, etc) as a std::map? In
particular, I think there's an unavoidable conflict between the split
(and join, in erase) operations and iterator / reference validity. So
it can't be used as a drop-in replacement for a std::map. Any thoughts?


Cheers, Phil.
 
M

mohitanchlia

Victor said:
LR said:
(e-mail address removed) wrote:
I have written a program that loads a file having 3 columns into a
map. Here is the declaration:
struct strE
{
   char iEID[22+1];
   char acEN[6+1];
   int  iDed;
};
map<string, struct strE>  mEro;
[..]
Is it possible that you'd be better off rewriting your struct with
std::string instead of char arrays?  Or perhaps writing some very
lightweight string class ifmemoryusage is a big concern?
Considering that the arrays are only 23 bytes and 7 bytes long, it
seems that using 'std::string' (which usually contains at least
a pointer and a size_t, making it 8 bytes, plus it allocates its
buffer somewhere on the heap requiring heap overhead) might not be
beneficial (aside from the convenience point of view).  Lightweight
string classes also takememory...

I guess it depends on how big the fields are in the file.

If they're all fixed length then we can save 1 byte for each char array
at the expense of having to write some class that implements a fixed
length string. Assuming that's important to us.

For the 23 char array, if the lengths vary all over the place, but tend
toward most of them being say, less than ten bytes with the largest
being 23 chars, then maybe a lightweight string class would be useful.
Maybe just a pointer to zero terminated strings.


Wouldn't they be the same in the file?

The number of bits in a char?  Yes, almost certainly. Don't know what I
was thinking there.  But the OP seems to be storing the int as text in
the file.  I suppose I should also have asked about these values as
well.  Do these tend to have more than sizeof(int) digits? Or fewer?


std::mapcarries some overhead.  With elements that are small, the
overhead is going to be felt stronger than with elements that are
large.  

Yes.  But I think that we still don't really know what the keys are like
and what kind of data is being stored in that struct.

 > Paying for search speed and convenience with as much of the
 >memoryas required to store just the data is not that awful, if you
 > ask me.

I tend to agree, it's generally not a big concern for me either, but I
don't know what limitations the OP has formemoryusage or if it's
possible that he does in fact have a leak somewhere.

I think finding out how many entries he's storing in themapand the
size of the key strings would be useful.

LR- Hide quoted text -

- Show quoted text -

The file has 3 columns. First 2 columns are the key. That file
probably has 200000 lines that is being stored in map.
 
L

LR

The file has 3 columns. First 2 columns are the key. That file
probably has 200000 lines that is being stored in map.


Didn't you say that the map was something like,

std::map<std::string, strE>

Where strE was some sort of struct?

If the first two columns of your data file are the key, why aren't you
using something like, at least, a std::pair for the key, like,

std::map<std::pair<std::string,std::string>, strE>

or whatever the two types should be for the pair?

I think it's very difficult to figure out exactly what problem you're
having if you can't tell us a little bit more explicitly what data you
have and what you're doing with it.

But ok. You have 200000 entries. As previously pointed out in this
thread, the overhead for each entry might be roughly estimated as
follows, assuming that each of an int, size_t or a pointer is 4 bytes.

Three pointers per map entry plus a flag, call it 16 bytes.
One pointer and one size_t for the string you used as a key, 8 bytes.
34 bytes for struct strE.

Now I have to add a ridiculous estimate for the size of the keys. I'll
guess that it's 20 characters per key.

So that's around 78 bytes per entry, multiply by 200000 and you get a
number that's around 15M.

So I'd say that 12M isn't bad at all, assuming my assumptions are
correct. Maybe I'm off on the size of the keys. Or over a bit on the
number of bytes for each map entry. I also neglected to add in the,
probably 4 bytes, for each memory allocation. Probably two per entry.

I very much doubt that you have a memory leak. Or at least the amount
of memory you say this code uses at run time probably doesn't indicate that.

I think that as I already pointed if you really want to you might save
some space using a lightweight string class. There may be other things
you can do too.

If you're willing to give more specific information you will probably
get more specific advice.

The struct you showed us had three fields. Are they generated from some
other data? From the keys?

LR
 
M

mohitanchlia

Didn't you say that themapwas something like,

std::map<std::string, strE>

Where strE was some sort of struct?

If the first two columns of your data file are the key, why aren't you
using something like, at least, a std::pair for the key, like,

std::map<std::pair<std::string,std::string>, strE>

or whatever the two types should be for the pair?

I think it's very difficult to figure out exactly what problem you're
having if you can't tell us a little bit more explicitly what data you
have and what you're doing with it.

But ok.  You have 200000 entries.  As previously pointed out in this
thread, the overhead for each entry might be roughly estimated as
follows, assuming that each of an int, size_t or a pointer is 4 bytes.

Three pointers permapentry plus a flag, call it 16 bytes.
One pointer and one size_t for the string you used as a key, 8 bytes.
34 bytes for struct strE.

Now I have to add a ridiculous estimate for the size of the keys.  I'll
guess that it's 20 characters per key.

So that's around 78 bytes per entry, multiply by 200000 and you get a
number that's around 15M.

So I'd say that 12M isn't bad at all, assuming my assumptions are
correct.  Maybe I'm off on the size of the keys.  Or over a bit on the
number of bytes for eachmapentry.  I also neglected to add in the,
probably 4 bytes, for eachmemoryallocation. Probably two per entry.

I very much doubt that you have amemoryleak.  Or at least the amount
ofmemoryyou say this code uses at run time probably doesn't indicate that.

I think that as I already pointed if you really want to you might save
some space using a lightweight string class. There may be other things
you can do too.

If you're willing to give more specific information you will probably
get more specific advice.

The struct you showed us had three fields.  Are they generated from some
other data?  From the keys?

LR

I think it's my mistake that I didn't give you complete information:
Here is how data looks like

1234567891234567891.1000|2

So the syntax of the file is "firstkey.secondkey|remaining fields"

strE is struct for the format that I just mentioned.
 
L

LR

Here is how data looks like

1234567891234567891.1000|2

Could the firstKey ever be as long as 22 chars? Could the second key
ever be as long as 6?
So the syntax of the file is "firstkey.secondkey|remaining fields"

Remaining fields? You've only shown one. What could the others be? Is
there only ever one?
strE is struct for the format that I just mentioned.

I'm still struggling to understand. Do you mean that you essentially do
this:

Note it's just a snippet and it's not tested and it won't work well.
struct strE {
char iEID[22+1];
char acEN[6+1];
int iDed;

strE(const char *p1, const char *p2, const int i) {
strcpy(iEID,p1);
strcpy(acEN,p2);
iDed = i;
}
};

std::map<std::string, strE> m;

and then for each data record in your file

const std::string firstKey = extractFirstKeyFromRecord();
const std::string secondKey = extractSecondKeyFromRecord();
const int i = extractIntFromRecord();

m[firstKey+secondKey] = strE(firstKey.c_str(), secondKey.c_str(), i);

Is that what you're doing?

If that's it, then you can save plenty of memory at the price of a
little time and space overhead, by removing the two char fields from
your struct and creating a wrapper for the key from the two extracted
strings from the data. I estimate the data space overhead at something
like 24 more bytes per entry but you'd save 30 from the struct for a
savings of 6 bytes per entry. Around 1.2M. Use a lightweight string
class and it gets a little bit better. Or you can make a struct for the
key that uses char arrays to store the data and do even better.

But I'm still not sure what you're doing with the data that you read in,
or what data ends up in strE.

I'm missing something.

LR
 
M

mohitanchlia

Here is how data looks like
1234567891234567891.1000|2

Could the firstKey ever be as long as 22 chars?  Could the second key
ever be as long as 6?


So the syntax of the file is "firstkey.secondkey|remaining fields"

Remaining fields?  You've only shown one.  What could the others be?  Is
there only ever one?


strE is struct for the format that I just mentioned.

I'm still struggling to understand.  Do you mean that you essentially do
this:

Note it's just a snippet and it's not tested and it won't work well.
struct strE {
   char iEID[22+1];
   char acEN[6+1];
   int  iDed;

   strE(const char *p1, const char *p2, const int i) {
      strcpy(iEID,p1);
      strcpy(acEN,p2);
      iDed = i;
   }

};

std::map<std::string, strE> m;

and then for each data record in your file

const std::string firstKey = extractFirstKeyFromRecord();
const std::string secondKey = extractSecondKeyFromRecord();
const int i = extractIntFromRecord();

m[firstKey+secondKey] = strE(firstKey.c_str(), secondKey.c_str(), i);

Is that what you're doing?

If that's it, then you can save plenty ofmemoryat the price of a
little time and space overhead, by removing the two char fields from
your struct and creating a wrapper for the key from the two extracted
strings from the data.  I estimate the data space overhead at something
like 24 more bytes per entry but you'd save 30 from the struct for a
savings of 6 bytes per entry.  Around 1.2M. Use a lightweight string
class and it gets a little bit better.  Or you can make a struct for the
key that uses char arrays to store the data and do even better.

But I'm still not sure what you're doing with the data that you read in,
or what data ends up in strE.

I'm missing something.

LR

For example here us what I am doing:

1. From File
1234567891234567891.1000|2

2. After substituting data
std::map< std::string, struct strE> mE;

struct strE eStr = {0,};
eStr.iDed = 2 // this is second field in the
file

mE.insert(std::pair<std::string,
struct
strERO>(1234567891234567891.1000, 2)
);
First 2 fields in the struct are not used, but I am it's being set to
0 by using {0,}.

You mentioned lightweight string, I am already using std::string, did
you mean some other class ?
 
L

LR

On Jan 8, 9:33 am, LR <[email protected]> wrote:
For example here us what I am doing:

1. From File
1234567891234567891.1000|2

2. After substituting data
std::map< std::string, struct strE> mE;

struct strE eStr = {0,};
eStr.iDed = 2 // this is second field in the
file

mE.insert(std::pair<std::string,
struct
strERO>(1234567891234567891.1000, 2)
);
First 2 fields in the struct are not used, but I am it's being set to
0 by using {0,}.

If those two fields aren't being used I think you can easily save around
6MB. Just take those fields out of the struct or use a different struct
or just use an int.

If they're not being used, what are they doing in there?


You mentioned lightweight string, I am already using std::string, did
you mean some other class ?

You'd have to roll your own. But you'd likely save around 4 bytes per
entry minus the code overhead.

Start with this:

class LWS {
char *p_;
public:
LWS () : p_(0) {}

.... add more as needed....
};

Add a function to compare them
bool operator<(const LWS &s1, const LWS &s2);

LWS is not a good name for this class.

LR
 
J

Jerry Coffin

Question: can something like this be implemented that has the same
guarantees (complexity, iterator validity, etc) as a std::map? In
particular, I think there's an unavoidable conflict between the split
(and join, in erase) operations and iterator / reference validity. So
it can't be used as a drop-in replacement for a std::map. Any thoughts?

I've never tried to work through it in detail, but I think it should be
possible to create something with implicit pointers to child nodes (a
bit like a heap) to come close. For example, the root item would be at x
[0]. Its children would be at x[1] and x[2]. The children of x[1] would
be at x[3] and x[4], and the children of x[2] at x[5] and x[6], and so
on.

From there it's a matter of math to provide addressing that works
correctly. The tree is exponential -- i.e. the root of the tree has only
one item. The next level has two items, the next four items, and so on.
IOW, the root of the tree has 2^0 items. The next level of the tree has
2^1 items. The next level has 2^2 items, and so on.

That implies that the addressing is also exponential. Level N of the
tree will follow immediately after level N-1, and the data for all the
preceding levels will contain 2^N-1 nodes, so at level N, the nodes
start at x[2^N]. Nicely enough, given a node x[N], we can easily work
the math in the other direction to find x[N]'s parent. IOW, the tree is
also automatically threaded so it's easy to traverse the tree without
recursion.

My immediate guess is that attempting to maintain balance (like a
red/black or AVL tree) probably won't work out for this structure.
Specifically, we can't just shift an entire sub-tree by manipulating
pointers -- such a thing would require moving all the sub-tree's data,
which probably makes it impractical. Leaving it as a plain binary tree
means (among other things) that all insertions and deletions take place
at the leaf nodes -- which will generally be toward the end of the
vector, exactly where we'd like.

The last thing we need is something equivalent to a null pointer to
indicate which nodes have no children. Since we don't have any actual
pointers, however, it's probably more sensible to use a bitmap to
indicate which nodes are valid. This would use the same addressing as
the nodes themselves, except (of course) addressing bits instead of
nodes.
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Members online

No members online now.

Forum statistics

Threads
473,954
Messages
2,570,116
Members
46,704
Latest member
BernadineF

Latest Threads

Top