[Slightly OT] Memory management and custom allocator

Ben Bacarisse

BartC said:
[growing a string from empty, to 10 million chars]
char *p;
int i,len;

p=malloc(1);
*p=0;
len=0;

for (i=1; i<=10000000; ++i) {
    ++len;
    p=realloc(p,len+1);
    p[len-1]='*';
    p[len]=0;
}
On three C compilers, this took about 12 seconds (on DMC, any length
of string resulted in a runtime of 0.4 seconds; a bit odd).

On my system it takes 0.15 seconds
Measuring the number of times the value of p changes, with the slow
compilers, this was respectively 2335, 2336, and 2337. With DMC, it
was just 46.

and changes p 156 times. (Off-topic extra: a C++ program using s += '*'
on a std::string takes 0.07s on the same machine.)
Clearly DMC overallocates more than the others (growing allocations by
sqrt(2) apparently).

I don't see how you can conclude that. The allocated storage might be
able to grow (i.e. it is not initially over-allocated) without changing
the address.

<snip>
 
BartC

Ben Bacarisse said:
On my system it takes 0.15 seconds

and changes p 156 times.

Your implementation must be pretty good then, or maybe it has optimised a few
of the realloc() calls away.

Running the test by calling realloc() with the same argument (so no new
allocations are needed) still takes about 1 second on those three slow
compilers (gcc, lccwin-32 and Pelles C). Seems like two of them have ripped
off the other's inefficient realloc() implementation!
(Off-topic extra: a C++ program using s += '*'
on a std::string takes 0.07s on the same machine.)

Sounds about right when calls to realloc() are minimised. The question I
suppose is how much buffering/allocation logic should be done outside of
malloc()/realloc(), or do you just let those functions do most of the work;
you can afford to do that with fast implementations like yours.
I don't see how you can conclude that.

Just a guess. The 46th root of ten million is around 1.42.
The allocated storage might be
able to grow (i.e. it is not initially over-allocated) without changing
the address.

That's true; with nothing else going on memory-wise, even an over-allocated
block could be further expanded at the same location, depending on how the
allocator works.
 
Kaz Kylheku

for (i=1; i<=10000000; ++i) {
++len;
p=realloc(p,len+1);

Growing linearly, and by only one character at a time, is silly.

You have O(N*N) behavior here.

If you grow exponentially (e.g. doubling the buffer), you get amortized O(N)
behavior.

I.e. this is a poor algorithm. It's the O(N*N) time that's killing this,
not malloc.
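
For comparison, here is a minimal sketch of the same 10-million-character build using geometric growth; the starting capacity of 16 is an arbitrary choice made just for this sketch. It ends up calling realloc() only about 20 times instead of 10 million:

#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    size_t len = 0, cap = 16;               /* assumption: start with 16 bytes */
    char *p = malloc(cap);
    if (p == NULL)
        return 1;
    p[0] = 0;

    for (size_t i = 0; i < 10000000; i++) {
        if (len + 2 > cap) {                /* room for the new char plus the NUL */
            cap *= 2;                       /* grow geometrically, not by one byte */
            char *q = realloc(p, cap);
            if (q == NULL) {
                free(p);
                return 1;
            }
            p = q;
        }
        p[len++] = '*';
        p[len] = 0;
    }

    printf("built a string of %zu characters\n", len);
    free(p);
    return 0;
}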
 
88888 Dihedral

On Saturday, 31 December 2011 at 10:56:53 PM UTC+8, FtM wrote:
[xpost comp.lang.c, comp.unix.programmer, comp.os.ms-
windows.programmer.win32]
Hello NG,
I have a very specific question, but first of all I'll describe what
I'm facing, as I'm open to alternative general solutions.
I'm refactoring some old C code that deals with very basic data
structures (lists, priority queues, hashtables and so on) and that
aims to be very portable. Regarding memory management, I have
an internal wrapper that, on *nix and Windows systems, maps the malloc/
calloc/realloc/free functions one-to-one to the default ones, but it's
possible to switch to a custom allocator that works pretty much
like a standard allocator (a big chunk of memory that gets split/joined,
intended for embedded systems).
I would like to implement my custom allocator for the *nix and Windows
systems too (of course losing some portability for a performance gain),
and maybe implement something like the C++ allocators to improve the
memory management of the different algorithms (rare large
allocations for vectors / frequent small allocations for lists), but
I have some questions:
- On these systems, to get decent performance, I was thinking of
allocating page-sized (and page-aligned) chunks so that small
allocations end up in the same page, but how is it possible to request
memory with these characteristics? Will it help the OS memory manager?
- Does all of this make any sense? :)
Thanks for any suggestion!
Ciao!

P.S. I don't really need a "performance boost", but I'm dealing with
a general, low-level library that is already in use, and as long as
I'm refactoring the code I'd like to do it in the best possible way ;-)

Kind of interesting here. I thought you were describing a customized
heap manager that can support paging to a file on the hard disk.

A graph of buffer nodes would be required.

If you don't have to page out to the secondary storage system, then
the malloc, memcpy, memset and free in the GCC source code are available.
 
FtM

I don't think there's enough information here; what sort of sizes are the
three classes of allocator handling?

16<=x<=128 (little structures, one-shot char buffers, etc.)
128<x<=2048 (array of pointers, bigger structures, etc.)
x>2048 (realloc()atable memory)

Will any of them always be a fixed size?

Yes, the first two types. (Of course, if a realloc() is requested for
first- or second-type memory it should be switched to the third type,
but this is not likely to happen, as the user (me) is aware of that
behaviour!) I didn't mention it before, but obviously I will also have
the possibility of calling the objects directly, choosing the preferred
allocation strategy.
Will they need deallocating?

Sure they will. The first two by marking their own map-vector, the
third in the usual way, by joining adjacent free blocks.
What strategy will be used to grow the
heap from which the blocks will come; in fact does this heap need to
grow?

Each "object" will have some pages to start with; when one of them
fills its pages completely will ask the system for some more
(increasing its allocation vector on the fly too). I don't know the
"some" quantity, but I think no one do before runtime.. Making it a
compile-(or maybe run-)time parameter will help monitoring and testing
the overall performances, but this is just an idea.
Naturally, speaking of the first two "objects", the space of the
allocation vector should be taken into account, but as it is a bit-
masked vector it will occupy a very little space.
Where does the memory come from in the first place?

mmap (which also leaves open the possibility of swapping to persistent
storage or sharing the memory between processes).
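
To make that concrete, here is a minimal sketch of one such fixed-size pool: a single mmap()ed page split into equal slots, with a bit-masked free map. The 4096-byte page size, the 64-byte slot size and the single-page limit are all assumptions made just for this sketch; a real version would query the page size and map more pages on demand.

#include <stdint.h>
#include <sys/mman.h>     /* mmap; MAP_ANONYMOUS may be spelled MAP_ANON on some systems */

#define POOL_PAGE_SIZE 4096u
#define POOL_SLOT_SIZE 64u
#define POOL_NSLOTS    (POOL_PAGE_SIZE / POOL_SLOT_SIZE)    /* 64 slots per page */

struct pool {
    unsigned char *page;    /* one page-aligned chunk obtained from mmap */
    uint64_t       used;    /* bit i set => slot i is allocated */
};

static int pool_init(struct pool *pl)
{
    pl->page = mmap(NULL, POOL_PAGE_SIZE, PROT_READ | PROT_WRITE,
                    MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    pl->used = 0;
    return pl->page == MAP_FAILED ? -1 : 0;
}

static void *pool_alloc(struct pool *pl)
{
    for (unsigned i = 0; i < POOL_NSLOTS; i++) {
        if (!(pl->used & ((uint64_t)1 << i))) {     /* first free slot wins */
            pl->used |= (uint64_t)1 << i;
            return pl->page + i * POOL_SLOT_SIZE;
        }
    }
    return NULL;            /* page full: a real pool would map another page here */
}

static void pool_free(struct pool *pl, void *ptr)
{
    unsigned i = (unsigned)(((unsigned char *)ptr - pl->page) / POOL_SLOT_SIZE);
    pl->used &= ~((uint64_t)1 << i);                /* freeing is just clearing a bit */
}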

Ciao!
 
88888 Dihedral

On Monday, 2 January 2012 at 6:18:45 PM UTC+8, FtM wrote:
<snip>

mmap (which also leaves open the possibility of swapping to persistent
storage or sharing the memory between processes).

This requires customized data compression modules, not just the heap manager.

If a program only uses 20 to 30 percent of the total heap space on a typical
platform, then a lot of people won't bother to write a new heap manager at all.
 
FtM

On Monday, 2 January 2012 at 6:18:45 PM UTC+8, FtM wrote:
<snip>

This requires customized data compression modules, not just the heap manager.

Maybe. Or maybe more disk space? :) I don't think it's relevant
here..
If a program only uses 20 to 30 percent of the total heap space on a typical
platform, then a lot of people won't bother to write a new heap manager at all.

I don't see your point.. In the OSes we were speaking about there's no
fixed "heap space"; the physical memory is "recycled" based on current
needs.. Or were you thinking of something different?

Ciao!
 
88888 Dihedral

On Monday, 2 January 2012 at 6:43:33 PM UTC+8, FtM wrote:
Maybe. Or maybe more disk space? :) I don't think it's relevant
here..


I don't see your point.. In the OSes we were speaking about there's no
fixed "heap space"; the physical memory is "recycled" based on current
needs.. Or were you thinking of something different?

Ciao!

I think the heap space on a 32-bit platform is somewhat limited.
Of course on a 64-bit OS a lot of things will be different.

But as long as the hard disk and the DRAM are both of finite size,
I just model the heap space as being of finite size in my programs.

I am not interested in trivial programs.
 
BartC

FtM said:
Each "object" will have some pages to start with; when one of them
fills its pages completely will ask the system for some more
(increasing its allocation vector on the fly too). I don't know the
"some" quantity, but I think no one do before runtime.. Making it a
compile-(or maybe run-)time parameter will help monitoring and testing
the overall performances, but this is just an idea.
Naturally, speaking of the first two "objects", the space of the
allocation vector should be taken into account, but as it is a bit-
masked vector it will occupy a very little space.


mmap (which also leaves open the possibility of swapping to persistent
storage or sharing the memory between processes).

It sounds like you are more interested in playing with the virtual memory
side of things and doing your own paging.

Be aware that the OS tends to take care of this, and does it rather well
in conjunction with the hardware! (If you look in particular at the memory
management routines on the MSDN website, you'll see it can do it rather
comprehensively too. I usually take one look and then go back to the friendly
little malloc, realloc and free functions of C.)

Also, if you just want to experiment, then possibly malloc() can provide the
initial memory within which you can do your own allocations. Just ask for
the largest block of memory you want to work with, and look for the first
address within it where the bottom 12 bits are zero; that will likely be on
a page boundary. (In a real program, malloc()ed memory may not grow in the
way you might like, so your mmap() method might be needed.)
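
A minimal sketch of that trick, assuming 4096-byte pages (real code would query the page size, e.g. with sysconf(_SC_PAGESIZE) on POSIX or GetSystemInfo() on Windows):

#include <stdint.h>
#include <stdlib.h>

/* Returns the first page-aligned address inside a freshly malloc()ed block.
   Note that free() must later be given the original pointer, not the aligned one. */
void *carve_aligned_pool(size_t want, void **raw_out)
{
    unsigned char *raw = malloc(want + 4096);       /* over-allocate by one assumed page */
    if (raw == NULL)
        return NULL;
    *raw_out = raw;
    uintptr_t a = (uintptr_t)raw;
    a = (a + 4095u) & ~(uintptr_t)4095u;            /* first address whose bottom 12 bits are zero */
    return (void *)a;
}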
 
BartC

Kaz Kylheku said:
Growing linearly, and by only one character at a time, is silly.

Nevertheless, that's what the program wants to do. (As it happens I quite
often have to build strings this way, and without knowing the final size.)

You have O(N*N) behavior here.

If you grow exponentially (e.g. doubling the buffer), you get amortized O(N)
behavior.

But do you worry about this stuff directly in your code, or let the
language's runtime take care of it, or have some intermediate functions deal
with these details?

(And in the case of a simple C string, there is nowhere to put any extra
information about what the current allocated size is; it doesn't even know
its own length!)
I.e. this is a poor algorithm. It's the O(N*N) time that's killing this,
not malloc.

Yet some C implementations coped reasonably enough:
 
Eric Sosman

A) VM pages are powers of two in size, so with objects whose size is not a
power of two you waste space or use more VM pages per object than you would like

B) cache lines are powers of two in size, so with objects whose size is not a
power of two you waste space or use more cache lines per object than you would like

If you have a forty-four-byte struct and allocate sixty-four
bytes to hold it, how have you avoided *either* sort of "waste?"

If you have a thousand such structs, what "waste" do you avoid
by spending 64K instead of 44K to hold them? That is, what gain
repays your expenditure of an extra 20K?
i am numerologically superstitious, but you can also file this into the
'maybe someone has figured out something that we don't understand'
category

I think the "someone" was Agent Mulder.
 
BartC

Eric Sosman said:
If you have a forty-four-byte struct and allocate sixty-four
bytes to hold it, how have you avoided *either* sort of "waste?"

If you have a thousand such structs, what "waste" do you avoid
by spending 64K instead of 44K to hold them? That is, what gain
repays your expenditure of an extra 20K?

This is where you redesign your struct to use only 32 bytes, or find a good
use for another 20 bytes to make 64.

Such structs are easier to index in arrays. And they fit nicely into those 32- or
64-byte blocks that memory seems to be structured as these days, if also aligned
properly, and are less likely to straddle two VM pages (actually I don't know if
that's a big deal now; it certainly has been).

So powers of two are hardware-friendly. But now stick on 8 or 12 bytes of
allocator overhead, and it's a mess. You can't design a struct to be 8 or 12
bytes smaller than a power of two; every system will be different.
 
Eric Sosman

This is where you redesign your struct to use only 32 bytes, or find a
good use for another 20 bytes to make 64.

Ri-i-i-ght. Pretending the extra twenty bytes are in a good cause
doesn't make them so.
Such structs are easier to index in arrays.

Nonsense. The `[]' operator works with equal ease on arrays
of any element size at all.
And fit nicely into those
32- or 64-byte blocks that memory seems to be structured as now, if also
aligned properly, and less likely to straddle two VM pages (actually I
don't know if that's a big deal now; it certainly has been).

The fit is not "nice" if it's attained by wasting space. (Weren't
you the guy who recently complained about malloc() adding overhead to
each block?)

As for straddling VM pages, denser packing wins over bloated "nicer"
packing any day. Consider again the example of a thousand structs: I
want to use 44KB, you're suggesting 64KB would be "nicer." Suppose an
8KB page: You want to use eight pages, I want to use six. Suppose a
64B cache line: You want to use a thousand cache lines, I want to use
six hundred eighty-eight.
So powers-of-two are hardware-friendly. But now stick on 8- or 12-bytes
of overhead, and it's a mess.

Wait, wait, wait: You started with 44 bytes, strained your brain
to imagine a "good use" for another twenty (because you're superstitious
about powers of two), and then complained when malloc() tacked on eight
bytes of its own? Why didn't you just stick with the original 44?
You can't design a struct to be 8- or
12-bytes smaller than a power-of-two; every system will be different.

You can't design it to be exactly 64 bytes, either. Quick: How
many bytes in a `long double'? What's the alignment requirement of
a function pointer? What is the "unit" size of a bit-field?
 
Ben Bacarisse

BartC said:
"Kaz Kylheku" <[email protected]> wrote in message

Yet some C implementations coped reasonably enough:

Though I made no attempt to determine the growth rate of the time. It
may still have been quadratic, but fast quadratic.

However, it is in fact linear. Judging by the number of times the
pointer is changed by realloc it is doing some kind of internal doubling
of the requested space.

The trouble is that you can't rely on this, so to get amortised linear
performance in portable C you have to do the exponential growth
yourself.
 
FtM

On Monday, 2 January 2012 at 6:43:33 PM UTC+8, FtM wrote:
<snip>

I think the heap space on a 32-bit platform is somewhat limited.
Of course on a 64-bit OS a lot of things will be different.

More bits == More space. So?
But as long as the hard disk and the DRAM are both of finite size,
I just model the heap space as being of finite size in my programs.

I don't think anyone deals with infinite spaces.

Ciao!
 
FtM

It sounds like you are more interested in playing with the virtual memory
side of things and doing your own paging.

Be aware that the OS tends to take care of this, and does it rather well
in conjunction with the hardware! (If you look in particular at the memory
management routines on the MSDN website, you'll see it can do it rather
comprehensively too. I usually take one look and then go back to the friendly
little malloc, realloc and free functions of C.)

Yeah, I know.. I have a strangely intrinsic view of memory management
that is more low-level than the C level, probably due to my previous
work.. I don't know if that's a "fortune" or not, but that's why I
started by directly asking for low-level facilities instead of
abstracting away the real problems. Anyway, I think that porting some
of this logic one level up could help somehow..
Also, if you just want to experiment, then possibly malloc() can provide the
initial memory within which you can do your own allocations. Just ask for
the largest block of memory you want to work with, and look for the first
address within it where the bottom 12 bits are zero; that will likely be on
a page boundary. (In a real program, malloc()ed memory may not grow in the
way you might like, so your mmap() method might be needed.)

You picked the right example: I didn't know whether that was true or not!
Technically, a virtual page base address doesn't have to match a
physical page boundary..

Ciao!
 
FtM

This is where you redesign your struct to use only 32 bytes, or find a
good use for another 20 bytes to make 64.

     Ri-i-i-ght.  Pretending the extra twenty bytes are in a good cause
doesn't make them so.
Such structs are easier to index in arrays.

     Nonsense.  The `[]' operator works with equal ease on arrays
of any element size at all.
And fit nicely into those
32- or 64-byte blocks that memory seems to be structured as now, if also
aligned properly, and less likely to straddle two VM pages (actually I
don't know if that's a big deal now; it certainly has been).

     The fit is not "nice" if it's attained by wasting space.  (Weren't
you the guy who recently complained about malloc() adding overhead to
each block?)

     As for straddling VM pages, denser packing wins over bloated "nicer"
packing any day.  Consider again the example of a thousand structs: I
want to use 44KB, you're suggesting 64KB would be "nicer."  Suppose an
8KB page: You want to use eight pages, I want to use six.  Suppose a
64B cache line: You want to use a thousand cache lines, I want to use
six hundred eighty-eight.

Someone previously asked me what if the allocation took no cycles at
all. That question is obviously plain wrong, because I *do* like the
allocator taking cycles if this means that, when the memory is low, my
complex data structure resides in a single page and the system does
not have to swap frantically from disk to deal with pathological
fragmentation.
With your comment, anyway, you are agreeing with me: what if you could
choose? :)
Ciao!
 
Eric Sosman

[...]
Someone previously asked me what if the allocation took no cycles at
all. That question is obviously plain wrong, [...]

Part of the problem is that you have not made your objectives
clear. Kaz apparently believed you were seeking an allocator that
would consume less CPU to perform its job -- indeed, that's exactly
how I understand some of your remarks. In that context his question
is not "plain wrong" at all, but entirely to the point: He's asking
about the upper bound on potential improvement. If you could reduce
the allocator's time to zero -- if you could achieve an infinite
speedup -- how much overall improvement would you attain? If you
cannot answer that question, at least with a well-grounded estimate,
you have *no* business trying to speed up the allocator, nor the sort,
nor the parser, nor the hash table, nor the frammis rebiculator.

If you have not measured, you do not know whether you are in
trouble. If you achieve an X% improvement, you *still* won't know
whether you are out of trouble or still in it.
[...] I *do* like the
allocator taking cycles if this means that, when the memory is low, my
complex data structure resides in a single page and the system does
not have to swap frantically from disk to deal with pathological
fragmentation.

A memory manager with C's semantics, that is, one that cannot
relocate an already-allocated block, is *always* vulnerable to
fragmentation (J.M. Robson, JACM 18, 416-423). The practical issue
is to find strategies that "usually" don't suffer from "severe"
fragmentation in "typical" cases.
With your comment anyway you are agreeing with me: what if you could
choose? :)

Sorry; I don't understand this remark.
 
FtM

[...]
Someone previously asked me what if the allocation took no cycles at
all. That question is obviously plain wrong, [...]

     Part of the problem is that you have not made your objectives
clear.  Kaz apparently believed you were seeking an allocator that
would consume less CPU to perform its job -- indeed, that's exactly
how I understand some of your remarks.  In that context his question
is not "plain wrong" at all, but entirely to the point: He's asking
about the upper bound on potential improvement.  If you could reduce
the allocator's time to zero -- if you could achieve an infinite
speedup -- how much overall improvement would you attain?  If you
cannot answer that question, at least with a well-grounded estimate,
you have *no* business trying to speed up the allocator, nor the sort,
nor the parser, nor the hash table, nor the frammis rebiculator.

     If you have not measured, you do not know whether you are in
trouble.  If you achieve an X% improvement, you *still* won't know
whether you are out of trouble or still in it.

Yup, reading again what I wrote from this point of view, I understand
his question. I was a little bit harsh because, on the one hand, in my
opinion the "speed" was only part of the problem, and on the other
hand I know that I can't measure anything before implementing
something that I think is a good idea..
[...] I *do* like the
allocator taking cycles if this means that, when the memory is low, my
complex data structure resides in a single page and the system does
not have to swap frantically from disk to deal with pathological
fragmentation.

     A memory manager with C's semantics, that is, one that cannot
relocate an already-allocated block, is *always* vulnerable to
fragmentation (J.M. Robson, JACM 18, 416-423).  The practical issue
is to find strategies that "usually" don't suffer from "severe"
fragmentation in "typical" cases.

That's very interesting, I didn't know that it was mathematically
proved! Anyway.. [read below]
     Sorry; I don't understand this remark.

... if for your own use you give up the standard interface and
manually choose which allocation strategy best fits your needs, you
could have an advantage.

A very practical example: someone uses my library just for the linked
lists, implemented with this hypothetical allocator. His code uses the
allocator of his choice, probably the system one, and of course he
will have one or more linked lists accessed from time to time; these
lists will hardly suffer severe fragmentation (if any), no matter how
or when he inserts or deletes their elements, simply because their
memory was already allocated from the start.
Whether all of this is useful or not depends on the specific library
user's needs, but personally I think that offering this possibility
would be a nice-to-have feature.
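
To make the linked-list example concrete, here is a minimal sketch of the idea: every node of a list comes from one block allocated up front, and freed nodes go onto an internal free list for reuse, so list churn never touches the general heap. The node fields and the fixed capacity are invented for this sketch.

#include <stddef.h>
#include <stdlib.h>

struct node { int value; struct node *next; };

struct node_pool {
    struct node *slab;       /* one contiguous allocation holding every node */
    struct node *free_list;  /* chain of currently unused slots */
};

static int node_pool_init(struct node_pool *np, size_t capacity)
{
    np->slab = malloc(capacity * sizeof *np->slab);
    if (np->slab == NULL)
        return -1;
    np->free_list = NULL;
    for (size_t i = 0; i < capacity; i++) {          /* thread every slot onto the free list */
        np->slab[i].next = np->free_list;
        np->free_list = &np->slab[i];
    }
    return 0;
}

static struct node *node_alloc(struct node_pool *np)
{
    struct node *n = np->free_list;
    if (n != NULL)
        np->free_list = n->next;
    return n;                                        /* NULL means the pool is exhausted */
}

static void node_free(struct node_pool *np, struct node *n)
{
    n->next = np->free_list;                         /* O(1), no call into the system allocator */
    np->free_list = n;
}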

Ciao!
 
BartC

Eric Sosman said:
On 1/2/2012 9:07 AM, BartC wrote:
Such structs are easier to index in arrays.

Nonsense. The `[]' operator works with equal ease on arrays
of any element size at all.

Where an address calculation is needed, it's usually a multiply versus a
shift. Although turning 44 bytes/element into 64 just for that reason would
not be worth it; there should be other benefits to using 64.
As for straddling VM pages, denser packing wins over bloated "nicer"
packing any day. Consider again the example of a thousand structs: I
want to use 44KB, you're suggesting 64KB would be "nicer." Suppose an
8KB page: You want to use eight pages, I want to use six. Suppose a
64B cache line: You want to use a thousand cache lines, I want to use
six hundred eighty-eight.

You forgot the bit where I suggested trying to get it down to 32 bytes. So
some data structures could get smaller.

You would only expand one up to the next power-of-two if there was a net
benefit. Keeping 20 bytes unused out of 64 wouldn't be a benefit. Storing
something in there that would have needed its own space anyway might well
be.

(While most structs you'd leave alone, because there aren't going to be
enough of them, or they won't be accessed often enough, to be worth worrying about.)
You can't design it to be exactly 64 bytes, either. Quick: How
many bytes in a `long double'? What's the alignment requirement of
a function pointer? What is the "unit" size of a bit-field?

I'm pretty good at that actually..

(I've just been looking at an old application of mine. There is one struct
that has an 8-byte padding field added at the end to bring the size up to
exactly 128 bytes. Would that really have made a difference or not? Let's
see!

Running the application with a struct size of 128 bytes, the runtime was 48
seconds (this is for a 3D hidden line display). I took out the padding
field, and the runtime went up to 59 seconds, 23% slower. Now that could be
due to a million reasons, but all the same, I think I'll leave that padding
in, thanks!)
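
As a minimal sketch of that padding pattern (with invented fields, and assuming C11 for static_assert), a compile-time check stops the size from silently drifting when someone later edits the struct:

#include <assert.h>      /* static_assert (C11) */
#include <stdint.h>

struct edge {
    double  p1[3], p2[3];       /* 48 bytes of real data (example fields)        */
    int32_t flags;              /*  4 bytes                                      */
    int32_t surface_id;         /*  4 bytes                                      */
    unsigned char pad[8];       /*  8 bytes of explicit padding -> 64 in total   */
};

static_assert(sizeof(struct edge) == 64,
              "struct edge no longer fits exactly one 64-byte block");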
 
