Reusable source code

  • Thread starter James Dow Allen

Ben Pfaff

James Dow Allen said:
Much software has the form
High Level -- application-specific code
Mid Level -- application-specific code
Low Level -- FREQUENT FUNCTION, use library

But often one encounters
High Level -- application-specific code
Mid Level -- FREQUENT FUNCTION
Low Level -- application-specific code

A familiar example where this form is encountered and
dealt with is qsort(), with the low-level handled
simply with a function pointer.

When I run into the latter situation in a performance-critical
area, I try to think of ways that I can invert the dependency,
flipping it around so that it becomes the former situation. One
way to do this is to make the application-specific client code do
more work. This does have the cost of making the client write
more code (but not always much more) and making the client more
tightly bound to the implementation.

One example is the hash table library that I use in PSPP,
available here:
http://git.savannah.gnu.org/cgit/pspp.git/tree/src/libpspp/hmap.h
http://git.savannah.gnu.org/cgit/pspp.git/tree/src/libpspp/hmap.c

Most generic hash table libraries, in my experience, require the
client to provide at least a hashing function and a comparison
function. My library does not use any function pointers.
Instead, the client provides a hash value at the time of
insertion, as an argument to the insertion function, and the
library saves the hash value as part of the hash table node.
Instead of providing a conventional search function, the library
provides only a macro that iterates through hash table nodes that
have a given hash value, which is a suitable building block from
which the client can build a "search" function.
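
For concreteness, here is a minimal sketch of that style of interface
(hypothetical names, not the actual hmap API): an intrusive node embedded
in the client's own struct, a hash saved at insertion, and an iteration
macro in place of a search callback.

#include <stddef.h>

struct hnode {
    struct hnode *next;     /* next node in the same bucket */
    size_t hash;            /* hash value saved at insertion time */
};

struct htable {
    struct hnode **buckets;
    size_t mask;            /* bucket count minus 1 (a power of two) */
};

static inline void htable_insert(struct htable *t, struct hnode *n,
                                 size_t hash)
{
    struct hnode **bucket = &t->buckets[hash & t->mask];
    n->hash = hash;         /* the client computed the hash; just keep it */
    n->next = *bucket;
    *bucket = n;
}

/* Visit every node whose saved hash equals HASH; the client compares
   keys itself, so no comparison callback is needed. */
#define HTABLE_FOR_EACH_WITH_HASH(NODE, HASH, TABLE)                \
    for ((NODE) = (TABLE)->buckets[(HASH) & (TABLE)->mask];         \
         (NODE) != NULL;                                            \
         (NODE) = (NODE)->next)                                     \
        if ((NODE)->hash != (HASH)) ; else

A client "search" is then an ordinary loop: hash the key, iterate with
the macro, recover its own struct from the embedded node (e.g. via
offsetof), and compare keys directly, with full type information.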

The resulting library is slightly harder to use in some ways. On
the other hand, I find that callback functions are often quite
awkward to use--for example, they force auxiliary data to be
inconveniently bundled up--so sometimes it is in fact easier to
use. Its operation is also, in my opinion, more transparent.

I've often idly thought about how the same idea could be applied
to sorting or balanced binary search trees, but it seems to be a
little trickier in those cases.
 
M

Malcolm McLean

It's not easy to knock up a quick example in C, but it is in C++, where
the standard sort is a function template.  Comparing the sort of
10000000 random doubles, qsort is at least as fast as std::sort with no
optimisation, but takes twice as long with optimisation on.  The big
difference is inlining the comparison function.  I'll post the test case
if granted immunity from flames for posting C++!
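
The C side of such a test is easy enough to sketch (my reconstruction,
not the poster's actual test case):

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

static int cmp_double(const void *a, const void *b)
{
    double x = *(const double *)a, y = *(const double *)b;
    return (x > y) - (x < y);
}

int main(void)
{
    size_t n = 10000000;
    double *a = malloc(n * sizeof *a);
    if (a == NULL)
        return EXIT_FAILURE;
    for (size_t i = 0; i < n; i++)
        a[i] = (double)rand() / RAND_MAX;

    clock_t start = clock();
    qsort(a, n, sizeof *a, cmp_double);
    printf("qsort took %.2f s\n", (double)(clock() - start) / CLOCKS_PER_SEC);

    free(a);
    return 0;
}

The std::sort version differs only in the call: the comparison becomes a
boolean predicate that the template can inline.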
Lots of languages provide sorts.

Most of them are very easy to use for sorting lists of integers, which
of course is what the examples in the documentation provide. The
problem is that you seldom need to sort integers, you want to sort
records on a field.
Then you end up messing about with copy constructors and overloaded
"<" operators. I've found very few languages as easy to use as C's
qsort().

Speed, yes, is a factor, particularly for sorting. But most software
projects that fail do so because the interactions between the
different modules become too complicated, not because the processor
isn't fast enough.
 

Ian Collins

Lots of languages provide sorts.

But only two of them allow one to compare the performance of inlined and
not inlined function calls.
Most of them are very easy to use for sorting lists of integers, which
of course is what the examples in the documentation provide. The
problem is that you seldom need to sort integers, you want to sort
records on a field.
Then you end up messing about with copy constructors and overloaded
"<" operators. I've found very few languages as easy to use as C's
qsort().

For most practical uses, C++'s std::sort is no more complex to use than
qsort(); it requires the same inputs (the range and the comparison
function). You can even use the same comparison function (less the
casting from void *). There's no messing about with anything.
Speed, yes, is a factor, particularly for sorting. But most software
projects that fail do so because the interactions between the
different modules become too complicated, not because the processor
isn't fast enough.

Which is a language independent problem.
 

ImpalerCore

Sigh.  It appears I've overestimated your skills as well.
SAVING THE CALLBACK IN QSORT() WAS JUST A TRIVIAL EXAMPLE
to avoid digressing into a discussion of a reusable hash-table
manager.

It seems that in light of the confusion (including my own) on what
exact concept you're trying to get across, perhaps you could construct
an example (in code) that explicitly demonstrates the differences. To
simplify the implementation, instead of a sophisticated qsort, you
could create a bubblesort interface a la

\code
void bubblesort(void *base, size_t num, size_t size,
                int (*comparator)(const void *, const void *));
\endcode

and demonstrate your version with the macro technique. Most macro
techniques are harder to understand implementation-wise because you
need to carry more context like remembering or looking up exactly what
the macros do.
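
To sketch what such a demonstration might look like (my guess at the
technique, with hypothetical file and macro names): the sort body lives
in one file and is specialized by macros the includer defines, so the
comparison and the swap compile inline at a known element size.

\code
/* sortbody.inc -- the includer must define ELEM_TYPE, ELEM_LT(a, b),
   and SORT_NAME, and must #include <stddef.h> first. */
void SORT_NAME(ELEM_TYPE *base, size_t num)
{
    size_t i, j;
    for (i = 0; i + 1 < num; i++)
        for (j = 0; j + 1 < num - i; j++)
            if (ELEM_LT(base[j + 1], base[j])) {
                ELEM_TYPE tmp = base[j];    /* known-size swap */
                base[j] = base[j + 1];
                base[j + 1] = tmp;
            }
}
\endcode

and each application instantiates it:

\code
#include <stddef.h>
#define ELEM_TYPE double
#define ELEM_LT(a, b) ((a) < (b))
#define SORT_NAME bubblesort_double
#include "sortbody.inc"
\endcode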
 The hashtable manager has MANY options.  Do you store
via bits or bytes?  What's your collision/rehash strategy?
We're not trying to avoid one trivial callback, we're avoiding
a largish collection of switches, etc.  Sometimes you need to
waste a bit for a deletion marker, sometimes you don't. Etc. etc.

Writing one hashtable manager is better than writing 3 or 4, right?
Using simple macros makes the code simpler than switches or if/elses,
right?

You quote part of the purposeful discussion:


... But then focus on the trivial example:



This has *nothing* to do with my intent for this thread but:
The reason it makes a difference is that qsort() spends a *huge*
majority of its time calling (*cmp)() (which is often a trivial
function) and doing swaps.
Saving function invocation is a big win.  A known-size swap is
faster than a variable-size swap.


I don't have a number to give you offhand, but calls to (*cmp)
sharply outnumber the recursive calls to the sorting function.


I tried to emphasize that it was the READABILITY advantage of using
one (hashtable-manager, for example) function, where otherwise
multiple versions would be needed, that was the goal, and that qsort()
(where a speed advantage happens to obtain) was introduced solely
because it was well-known and simple, and because going into detail on
a REAL example where the method is applicable would be a diversion.
Alas, it was to no avail.

How can you expect people to understand a hash table technique if they
don't understand what your trivial example is meant to get
across? Plus we don't have access to your hash table source, so we
can't put the code in front of our faces (like Pfaff provided for his
example). We can't all be mind-readers. You say that performance is
not the point, but the focus keeps drifting to performance reasons for
doing these things. Put up a complete example (like bubblesort) that
demonstrates your interpretation of improved readability and more of
us may get what you're trying to say.
Thanks anyway,
James

Best regards,
John D.
 

lawrence.jones

There's also no particular reason a compiler couldn't inline a
comparison function into a specialized qsort function as things stand
now. While qsort is a bit special in that it's a library function, a
compiler that supports link-time code generation could quite plausibly
specialize and inline it so long as it had the semi-compiled qsort
source available.

There's no reason an implementation can't inline the comparison function
to qsort itself, even without link-time code generation. All that's
necessary is to define qsort as a macro that expands to an invocation of
an inline version of qsort:

#define qsort(_b, _n, _s, _c) _inline_qsort((_b), (_n), (_s), (_c))

static inline void _inline_qsort(void *_b, size_t _n, size_t _s,
                                 int (*_c)(const void *, const void *))
{
    ...
}
 

James Dow Allen

I'll enclose a brief excerpt from a specific header for
the "Source Reusable" Hash-table handler. The key point is
that the table element structure itself is variable; in the
example application key and data are PACKED together into
a single short. You can't expect to handle all the
variations of the element type with callbacks or switches.
     Besides, the possible space efficiency seems insufficient
to justify "CAN'T BE USED AT ALL;" it's entirely akin to the
time savings you already don't care about.

     ... The technique of
using the preprocessor to specialize general code to a particular
use is fairly well-known -- yet the plain fact is that it's not
widely and routinely used.  Considering the efficiencies it may
offer, one wonders why not?

I'll try to address these questions, then
go away quietly!

I agree that seeking a mere 10% improvement in
speed or space is often pointless perfectionism.
But in many hash-table applications one seeks a
space savings which is *MUCH* larger than that.

Some "off-the-shelf" hash-tables use 12 bytes
per entry PLUS the key; call it 16 bytes total.
If instead you want 2 bytes per entry, the
off-the-shelf table will waste 87% of its
storage.

Buying eight times as much storage isn't really
the answer: if you had 8x storage, you might
want 8x the table size, not the old table
with 87% of space wasted!

Hash tables with 4 (or even 2) bytes per entry
are hardly unusual. Nodes in typical Connect-4
game-playing software (A topic on which I'm less
than totally ignorant:
http://www.amazon.com/Complete-Book-CONNECT-History-Strategy/dp/1402756216
) are 4 bytes. I've
seen a Rubik's-cube solver with 2-byte nodes,
and used 2-byte nodes myself on similar problems.

In fact, it was such a 2-byte per node hashtable
that led to my original "reusable source" in the
first place. Here's part of the application-specific
header:
typedef unsigned short Te_ktype;  /* key and data packed into one short */
typedef struct {
    Te_ktype te_key;
} Tab_entry;                      /* the entire table element: 2 bytes */

#define EPEND 0x4000                              /* flag bit stored with the key */
#define OPEND 0x8000                              /* flag bit stored with the key */
#define KEY(tekey)   ((tekey) & ~(EPEND | OPEND)) /* strip the flag bits */
#define KEYEQ(a, b)  (!(KEY(a ^ b)))              /* keys equal, ignoring flags */
#define HASHER(s)    ((s) + ((s) >> 1) - ((s) >> 3))
#define EMPTY        (0)
#define MTSLOT(s)    ((s) == EMPTY)               /* is this slot empty? */
#define MTSET(s)     ((s) = EMPTY)                /* mark this slot empty */

Note that the data type of a table element is application
dependent. To imagine a giant switch statement, or
invocation of a callback, on any table-element operation
would be absurd!
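
To make that concrete, a lookup in the generic .c might look roughly
like this (a hypothetical fragment, not James's actual code; open
addressing with linear probing assumed, table assumed never full):

/* Specialized entirely by the macros in the application header above;
   the stride (sizeof(Tab_entry)), the hash, and the comparison are all
   fixed at compile time -- no callbacks, no switches. */
Tab_entry *lookup(Tab_entry *table, size_t nslots, Te_ktype key)
{
    size_t i = HASHER(key) % nslots;
    while (!MTSLOT(table[i].te_key)) {
        if (KEYEQ(table[i].te_key, key))
            return &table[i];       /* match */
        i = (i + 1) % nslots;       /* probe the next slot */
    }
    return NULL;                    /* hit an empty slot: not found */
}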

Some may ask: Why didn't you just "roll your own"
hash-table manager? I *DID* roll my own; but I rolled
in such a fashion that I wouldn't have to KEEP re-rolling
it for every major change!
(I *did* change the .c, refining the definition
of the .h -> .c interface whenever that seemed best.)

I don't claim the method is a panacea. I adopted
a wholly different approach in an image-processing
library, with a high-level routine invoking four
processing routines whose nature varied by link.
(And now, I've switched to different hash-table
methods which bypass the concept!)

But the method seems useful. It's hardly original;
in glibc-2.7 it's used to provide single .c modules
to support several similar functions, e.g.
strtod & friends, wcstol & friends.

Evidently I articulated the idea poorly, and apologize
for that. Kudos to Eric and Ben, who understand me
anyway and made intelligent responses. (A big
raspberry to the "Saint" who was completely baffled
but tried to show off anyway.)

James Dow Allen
 

Ben Bacarisse

James Dow Allen said:
I agree that seeking a mere 10% improvement in
speed or space is often pointless perfectionism.
But in many hash-table applications one seeks a
space savings which is *MUCH* larger than that.

Some "off-the-shelf" hash-tables use 12 bytes
per entry PLUS the key; call it 16 bytes total.
If instead you want 2 bytes per entry, the
off-the-shelf table will waste 87% of its
storage.

Buying eight times as much storage isn't really
the answer: if you had 8x storage, you might
want 8x the table size, not the old table
with 87% of space wasted!
[...] Here's part of the application-specific
header:
typedef unsigned short Te_ktype;
typedef struct {
    Te_ktype te_key;
} Tab_entry;

#define EPEND 0x4000
#define OPEND 0x8000
#define KEY(tekey)   ((tekey) & ~(EPEND | OPEND))
#define KEYEQ(a, b)  (!(KEY(a ^ b)))
#define HASHER(s)    ((s) + ((s) >> 1) - ((s) >> 3))
#define EMPTY        (0)
#define MTSLOT(s)    ((s) == EMPTY)
#define MTSET(s)     ((s) = EMPTY)

Note that the data type of a table element is application
dependent. To imagine a giant switch statement, or
invocation of a callback, on any table-element operation
would be absurd!

I need some help with this bit. What's absurd about a callback?

You've said time and time again that this is not about speed; rather it
is a more general design issue. But from a design point of view I'd be
happy to have four function pointers stored in the hash table, and once
you have that you only need to tell the table how big the elements are,
and not their actual type. It can then be a conventional compiled
module.

[I say four because it looks like key equality, hashing and testing and
setting empty are the ones that aren't helpers for the others. If it's
only a few more, I don't think it alters the argument that much.]
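
Roughly the design I mean (a sketch, my names): the table stores the
element size and the four pointers, and stays an ordinary compiled
module.

#include <stddef.h>

struct table_ops {
    int      (*key_eq)(const void *a, const void *b);
    unsigned (*hash)(const void *elem);
    int      (*is_empty)(const void *elem);
    void     (*set_empty)(void *elem);
};

struct table {
    void *slots;           /* nslots elements of elem_size bytes each */
    size_t nslots;
    size_t elem_size;      /* all the type information the table needs */
    struct table_ops ops;
};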

You say this is part, so maybe some other part is where the absurdity
of a callback becomes clear. Cutting down code can be like that -- you
can simplify to the point of missing a part of what you are trying to
illustrate.

By the way, I don't think the technique is daft, and I've used it myself.
I would try to avoid it on some rather ill-defined grounds of good taste
but it's certainly workable. Nowadays I'd be tempted to use inline
functions in the way that I think Seebs illustrated, but it would still
feel to me like a design of last resort.

<snip>
 

James Dow Allen

I need some help with this bit.  What's absurd about a callback?

The hash-table handler iterates through elements of an (explicit or
implicit) collision list, in which the iteration stride, sizeof
Element, is *variable*; each element encountered needs to be
classified as a Match or Other(*). The test for Match (which also
must include classification of Other) *could* be done by callback,
but note that if quotienting is in use, the callback would need to be
passed state info just so the application could do the match.
Possible to do, but if I try to envision it, it seems horrid.
Admittedly, this might be partly a matter of taste....

(* - Other can include Empty, Deleted, Mismatch-Hard, and
Mismatch-Soft; this last is useful in caching applications where some
low-value elements are eligible for overwriting.)

My code looks *very* ordinary and easy to read, except that, e.g.

    if (KEYEQ(a, b)) /* macro */

might occur where

    if (keyeq(a, b))

would be "ordinary". I personally would find the version full
of callback()'s tedious to look at and annoying. I'd "roll my own"
rather than using it.

James
 

John Kelly

usually when a newb comes into a group [any group] and says things
like "why not we do everything while standing on our heads?" it's
because they don't know what they're talking about.
Yup.  James Dow Allen is a newb here.  Never seen him post before, no
siree!  He's got some nerve comin' in here and stirin' things up!

Thanks for proving my point.
If you had actually read what I wrote, I said we [most regulars]
**DO** read what they write. We just don't have to agree with it
because you say so.
There is no "we".
There is no committee.
There is no organization.
There is no central authority.
There is no funding body.

I don't seek social structure in ngs. Some people do though, and they
want to belong. Thus form cliques.

Ng topicality need not be defined by a dominant clique. Minorities can
use it as they please, clique objections notwithstanding.
 

Chris Noonan

I'm actually totally surprised that this makes a significant difference,
because I wouldn't have expected it to be even measurable, since the
callback idiom is so common.  I've done a ton of things that use a
function pointer for their inner loop processing, and it never occurred
to me to try to figure out a way to inline them.

That actually leads to an interesting question, to me, which is where
the inefficiency is.  Is it in the indirection through a function pointer?
The argument passing?  The lack of optimization because the comparison
argument can't be inlined?  In particular, I'm wondering how much of it
comes down to the extra argument to qsort(), since a lot of qsort()
implementations are presumably recursive, meaning every layer of the
call tree has a function pointer argument.  If you could strip that away,
I wonder whether it would noticeably improve performance.

I guess, to me, it doesn't seem like replacing the callback model with
this #define/inlining stuff model is enough of a conceptual improvement to
be significant -- it doesn't strike me as making things much easier to
understand.  But if it's making a big speed difference, maybe that IS
worth looking at.  But it surprises me that it'd matter much!  It would
be interesting to come up with some reasonable test cases to experiment
with and see where the efficiency is coming from -- whether it's the
inlining or the argument lists or what.

What you're all missing is that when the compiler sees "qsort" it
doesn't *have to* arrange a call to object code in a library.
It can sort the data however it likes. That includes a technique
as crude as textual insertion of COMP or whatever into some hidden
qsort source code.

The mystery is why implementations generally don't treat
"library functions" as built-ins when it would be advantageous to
do so. A good opportunity arises when a malloc() and its corresponding
free() lie within the same function; calls to the heap manager can be
replaced with allocation on the stack.
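
Concretely, the transformation envisaged is something like this (a
conceptual sketch; use() stands for arbitrary code that doesn't let
the pointer escape):

#include <stdlib.h>

void use(char *buf);        /* hypothetical consumer of the buffer */

/* What the programmer wrote: */
void f(void)
{
    char *buf = malloc(100);
    if (buf != NULL) {
        use(buf);
        free(buf);
    }
}

/* What the compiler could emit instead, conceptually: */
void f_as_optimized(void)
{
    char buf[100];          /* a stack adjustment instead of a heap call */
    use(buf);
}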

Must be the weight of C history.

Regards,
Chris Noonan
 

Chris Noonan

     Since the stack often has far less space available than the heap,
this could fail rather badly for large allocations.

I wasn't proposing an unbreakable rule. I envisaged an optimising
compiler making sensible decisions. If size is a problem on a
particular architecture, the compiler may be able to determine
statically that the requested size is bounded above. Otherwise it can
emit code to branch on the size. For malloc the gains are worthwhile,
they go beyond inlining. As well as replacing a call to an intricate
function with as few as two opcodes, you are reducing (in a small way)
heap fragmentation and possibly avoiding a bus-locking instruction.

     Our Foolish Forebears, right?

Indeed. C was designed as a hacking language. In the early days,
replacing a library function with one of your own because the library
was buggy or you had written a better one was routine. Of course in
an engineering context, discovering that isdigit() excludes '8' or
that "while" means "if" is the last thing you want. Hence starting
with C89, it is forbidden to spoof standard library functions. I
have always assumed that one motive for the change was to make it
easier for compilers to optimise around library functions - to treat
them as built-ins if necessary.

     But why stop with the library functions?  Why not in-line all
functions, library-resident and user-written, into one enormous blob
of straight-line code?  Every "call" to printf() would just plop the
entire printf() source right down at the call point, every call to
topFunctionInMyHugeTree() would do similarly (and recursively, for all
the functions it calls in turn), and the result would be a program with
no call/return overhead whatsoever!  I can't see any drawbacks to the
scheme; can you?

You overlooked a few words in my post:
"...when it would be advantageous to do so."

Regards,
Chris Noonan
 

Seebs

Because it's excruciatingly difficult to know exactly when it would be
advantageous to do so for most library functions, and it would likely
never be advantageous for many. The no-brainers like memcpy and strcmp
already are implemented as built-ins in many implementations.

This isn't directly relevant, because it's really fairly system-specific,
but we once encountered a delightful bug caused by a compiler being used
in freestanding mode.

Long story short: In some cases, when given predictable-enough string
arguments, gcc will turn strcmp(foo, "bar") into memcmp(foo, "bar", 4),
and the like. Or strchr("foo", x) into memchr("foo", x, 3). Or, and
this is the kicker, strcpy(foo, "bar") into memcpy(foo, "bar", 4).

Someone wrote something "clever", roughly to the effect of:

strcat(strcpy(foo, "start: "), end);

gcc did its magic.

Unfortunately, it was working in a freestanding environment, where the
str* and mem* functions had been provided by the code it was working on,
and someone else who was *also* being excessively clever had realized that
you could save an ENTIRE INSTRUCTION from the execution of memcpy by
always returning 0 instead of returning the "unused" return value.

Clever programmers, causing things to blow up without warning since the
60s.

-s
 

Ian Collins

abs(i) < abs(j) returns 0 or 1

What about -1 ?

C++ sort only requires "does i come before j", which is why it can be
used with any type with a less-than operator. That's why I said the
sentence you snipped: take the function you would pass to qsort(),
simplify it to return a boolean result, and use it.
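
For instance (a sketch; the boolean form is what std::sort's
comparator expects):

/* qsort-style three-way comparator: */
int cmp_double(const void *a, const void *b)
{
    double x = *(const double *)a, y = *(const double *)b;
    return (x > y) - (x < y);       /* -1, 0, or 1 */
}

/* simplified to the "does x come before y" predicate: */
int before_double(double x, double y)
{
    return x < y;
}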
 

David RF

C++ sort only requires "does i come before j", which is why it can be
used with any type with a less-than operator. That's why I said the
sentence you snipped: take the function you would pass to qsort(),
simplify it to return a boolean result, and use it.

Ok, got it
 

BGB / cr88192

Seebs said:
This isn't directly relevant, because it's really fairly system-specific,
but we once encountered a delightful bug caused by a compiler being used
in freestanding mode.

Long story short: In some cases, when given predictable-enough string
arguments, gcc will turn strcmp(foo, "bar") into memcmp(foo, "bar", 4),
and the like. Or strchr("foo", x) into memchr("foo", x, 3). Or, and
this is the kicker, strcpy(foo, "bar") into memcpy(foo, "bar", 4).

Someone wrote something "clever", roughly to the effect of:

strcat(strcpy(foo, "start: "), end);

gcc did its magic.

Unfortunately, it was working in a freestanding environment, where the
str* and mem* functions had been provided by the code it was working on,
and someone else who was *also* being excessively clever had realized that
you could save an ENTIRE INSTRUCTION from the execution of memcpy by
always returning 0 instead of returning the "unused" return value.

Clever programmers, causing things to blow up without warning since the
60s.

yep.


my compiler handles some cases as special builtins as well.

my compiler had typically expanded fixed-size memcpy into a range of
register-based operations (including SSE).

so, for example:

    memcpy(dst, src, 20);

could become, essentially (src/dst are placeholders):

    movdqu xmm0, [src]
    movdqu [dst], xmm0
    mov    eax, [src+16]
    mov    [dst+16], eax

then another compiler translates this sort of thing to, essentially:

    lea esi, [src]
    lea edi, [dst]
    mov ecx, 20
    rep movsb

with essentially *no* real performance cost vs the prior...

I later realized a trick when writing an x86 interpreter:
one doesn't actually have to execute this instruction verbatim; since all
properties are known, it is possible to short-cut the logic in many cases
and move the bytes in whatever way is most convenient.


in many cases, micro-optimization is a trap...
be careful or be snared.
 

Eric Sosman

I wasn't proposing an unbreakable rule. I envisaged an optimising
compiler making sensible decisions. If size is a problem on a
particular architecture, the compiler may be able to determine
statically that the requested size is bounded above. Otherwise it can
emit code to branch on the size. For malloc the gains are worthwhile,
they go beyond inlining. As well as replacing a call to an intricate
function with as few as two opcodes, you are reducing (in a small way)
heap fragmentation and possibly avoiding a bus-locking instruction.

A compiler that could make the decision reliably would need an
awful lot of smarts. Consider that the function f() containing the
malloc(100) call might be at the bottom of a l-o-n-g call chain of
other functions that have chewed up unknown amounts of stack space;
f() might even be directly or indirectly recursive. I suspect that a
practical compiler would only be able to apply this transformation
safely for static functions in the same translation unit as main(),
which might somewhat restrict the benefits ...

Also, I don't find the benefits all that convincing. If you've
got to make a run-time decision about whether the malloc(size) call
plucks from the heap or from the stack, then you've got to make some
provision for the corresponding free(): it might be an ordinary free(),
or it might be a pop-and-forget, depending on the choice made at the
malloc() point, which needs to be remembered somewhere, and ... The
additional machinery to support the "optimization" is not free.

Finally, if what you really want is an automatically-managed
block-scoped allocation, you've already got it: The variable-length
array is just what you asked for. Personally, I think the VLA is a
snare of Satan -- but you Devil-worshipers mayn't mind that ;-)
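
For reference, the feature in question (a minimal C99 sketch):

#include <stddef.h>

void process(size_t n, const double *in, double *out)
{
    double scratch[n];      /* VLA: size chosen at run time */
    for (size_t i = 0; i < n; i++)
        scratch[i] = in[i] * 2.0;
    for (size_t i = 0; i < n; i++)
        out[i] = scratch[i];
}                           /* scratch is released when the block exits */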
 
