Floating point linkage

S

Seebs

How is that even possible? When you statically link a library, the
parts that you need are added to your executable and the rest is thrown
away, and even the parts that are included are subject to link-time
relocation, so they can't be shared with other programs.

When you statically link a library, the linker resolves symbols to fixed
addresses. All the things about adding or throwing away are possible, but
optional. In BSD/OS, the default static-shared linking for libc meant that
references to libc were resolved statically to fixed addresses, the same
as any other static link would be, but that nothing was put in the
executable that would map to those addresses. Instead, the dynamic linker
would map the library at a fixed address such that the resolved addresses
pointed to the relevant hunks of the library.

Noticeable speed improvement.

-s
 
T

Thomas Jahns

Since (nearly) every program (and other library) on a typical Unix
platform is going to be linked to libc by default, it would be utterly
wasteful to statically link against it. libc is the canonical example
of why dynamic linking is needed.

The reason libc is a mandatory shared library on the platforms in question is
never this one: AIX has no guaranteed system call interface, and Linux libc
needs to dynamically load parts of the name service switch (see
nsswitch.conf).
In contrast, few programs (and other libraries) use libm, so while
static linking would probably still be wasteful, it'd be a relatively
minor problem in practice.

Also, for many programs containing libm calls, speed really matters: the
extra indirection of dynamic libraries is intolerable to some users, because
calls to e.g. exp() occur in tight loops.

Regards, Thomas
 
S

Scott Fluhrer

glen herrmannsfeldt said:
(snip, I wrote)

I am not sure exactly how it was implemented. The IBM OS/360
linker has weak external references that will be resolved if the
called routine is linked, but won't pull a routine from a library.
I don't remember which linkers provide that feature.

That trick wouldn't help this situation.

Here's the problem they were addressing: suppose you have the simple "Hello
world" program; that calls printf; if printf included the logic to handle
the %f (and friends) format, that means that we'd need to pull in a
significant fraction of the math library; that library was (by the standards
of the time) quite lengthy; the result might be that the resulting
executable consisted of 90% math library (which is dead code; the "Hello
world" program will never call it), and 10% code which was actually used.

Now, by today's standards, this executable expansion is fairly trivial;
however, remember that in that era, the permanent storage they had might be
360 kByte (0.00036 GByte) floppy disks or, if they were lucky, a
20 MByte == 0.02 GByte hard disk; that overhead was a real problem.

So, they addressed this by implementing two versions of the printf (and
scanf) library; one with the code which handled the %f formats, and one
without. The one without was much shorter (and didn't bring in the math
library); the one with did everything. The linker decided which one to pull
in depending on whether the program used any floating point within the
program (and, yes, the compiler published that fact somehow). This mostly
worked pretty well... unless, of course, someone wrote a program which used
a %f format, but avoided all other use of floating point...
 
G

glen herrmannsfeldt

(snip on "floating point not loaded")
(I wrote)
That trick wouldn't help this situation.
Here's the problem they were addressing: suppose you have the
simple "Hello world" program; that calls printf; if printf
included the logic to handle the %f (and friends) format,
that means that we'd need to pull in a significant fraction
of the math library; that library was (by the standards of the
time) quite lengthy; the result might be that the resulting
executable consisted of 90% math library (which is dead code;
the "Hello world" program will never call it), and 10% code
which was actually used.

This is exactly the case that WXTRN solves. The library routines
(like printf) WXTRN the floating point routines. If the compiler
determines that they are actually needed, it puts EXTRN entries
in the calling routine. When the linker resolves the EXTRN, it
pulls the routine in, and then also resolves the WXTRN.

If not, the WXTRN is not resolved, zero is stored there, and I
believe the linker prints a warning message. The program can
test for zero before calling the routine.
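
For readers more used to ELF toolchains, a rough analog of that
"test for zero before calling" pattern is a weak reference; a minimal
sketch, assuming GCC (the routine name here is made up):

    /* Rough ELF/GCC analog of the WXTRN scheme glen describes (an analogy,
     * not OS/360 itself): an undefined weak reference resolves to 0 if no
     * linked object defines it, and the caller tests before calling. */
    #include <stdio.h>

    extern void float_support(void) __attribute__((weak)); /* hypothetical */

    int main(void)
    {
        if (float_support)     /* non-null only if some object pulled it in */
            float_support();
        else
            puts("floating-point support not linked; skipping");
        return 0;
    }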

I believe WXTRN was meant to solve a similar problem in the
early PL/I compilers, which also ran (and the generated code ran)
on small computers. PL/I (F) was designed to run in 44K on
a 64K system, leaving 20K for OS/360 PCP. It likely ran pretty
slow that way, though. There is a warning message when the compiler
starts writing the symbol table to disk. (That is before virtual
storage, so the compiler had to do it itself.)

(snip)
So, they addressed this by implementing two versions of the
printf (and scanf) library; one with the code which handled
the %f formats, and one without. The one without was much
shorter (and didn't bring in the math library); the one with
did everything. The linker decided which one to pull in
depending on whether the program used any floating point within the
program (and, yes, the compiler published that fact somehow).

But it is the compiler that knows, not the linker.

The other way is to give the two printf routines different names and
have the compiler call the appropriate one. A little ugly, but
it doesn't require changing the linker.
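
A hedged sketch of that second approach (the names and the macro are
made up; a real compiler would pick the variant itself, based on whether
it saw any floating point in the program):

    /* Sketch of the "two differently-named printfs" scheme.  Both variants
     * just forward to vprintf so the example runs; in the real scheme only
     * the chosen one would be linked, and only the full one would drag in
     * the floating-point conversion code. */
    #include <stdarg.h>
    #include <stdio.h>

    static int printf_nofloat(const char *fmt, va_list ap) { return vprintf(fmt, ap); }
    static int printf_float(const char *fmt, va_list ap)   { return vprintf(fmt, ap); }

    #ifdef PROGRAM_USES_FLOAT       /* the fact the compiler would "publish" */
    #define PRINTF_IMPL printf_float
    #else
    #define PRINTF_IMPL printf_nofloat
    #endif

    static int my_printf(const char *fmt, ...)
    {
        va_list ap;
        va_start(ap, fmt);
        int n = PRINTF_IMPL(fmt, ap);
        va_end(ap);
        return n;
    }

    int main(void)
    {
        my_printf("hello, %s\n", "world");
        return 0;
    }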

-- glen
 
A

Albert van der Horst

I feel I should point out a distinction, at some risk of going off-topic:

Dynamic and static linking are NOT the same question as shared or unshared.

At least one operating system I know of implemented static linking for
shared libraries like libc, so you could get a noticeable speed improvement
over a typical dynamic linker, but still have the massive address space
savings of letting multiple programs use the same blocks of memory for the
shared text segments.

There is another flaw in Stephen's reasoning: static linking, if done well,
only pulls in the exact code that is needed. It is left as an exercise to the
reader to estimate whether on average that would be more or less than the
code needed to access shared libraries (which is not small).

With regard to the math library, you may want to do much compile-time
inlining. The cosine is a single Pentium instruction, so doing a function
call means a huge overhead. This alone warrants the math library's special
treatment. Also, doing math is a strong indication of a compute-bound
program, i.e. it matters more.
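
To make the "single Pentium instruction" point concrete, here is a minimal
sketch, assuming x86 with an x87 FPU and GCC/Clang inline asm (FCOS does no
argument reduction, so this is not a drop-in replacement for cos()):

    /* Illustration only: cosine via the x87 FCOS instruction.  Accurate
     * only for |x| < 2^63 and without libm's error handling. */
    #include <stdio.h>

    static inline double fcos_inline(double x)
    {
        double r;
        __asm__ ("fcos" : "=t" (r) : "0" (x));
        return r;
    }

    int main(void)
    {
        printf("%f\n", fcos_inline(1.0));   /* ~0.540302 */
        return 0;
    }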

Groetjes Albert
 
J

James Kuyper

On 10/25/2013 06:15 PM, Albert van der Horst wrote:
....
treatment. Also, doing math is a strong indication of a compute-bound program,

I think there's lots of programs that do a little math, but are not
"compute bound".

The main program I'm responsible for does several time conversions for
every single frame of data, so it can determine the precise orientation
of the Earth, the satellite, and the mirror, and the precise location of
the satellite, at the time that frame was collected. For each frame it
traces a line of sight from each of 10 different detectors, bouncing off
the mirror, and determining the location of the point where that line
intersects the surface of the Earth ellipsoid, and then uses an
iterative algorithm to determine where it intersects the actual Earth's
surface. Those calculations each involve 5 changes of coordinate systems
with corresponding rotation matrix multiplications, and several of those
matrices require calls to transcendental functions to determine the
elements of the matrix. Despite all of that math, it spends most of its
time writing the results of its calculations, and most of the rest of its
time reading the input data. Only a small fraction of its time is spent
actually calculating the results.

There are many programs in the world that need floating point math, but
spend even less of their time actually doing it than that program does.
 
M

Malcolm McLean

I think there's lots of programs that do a little math, but are not
"compute bound".
I recently wrote four little programs to try out Baby X.

The first was a front-end to my data density function. The function works
by calculating the sum of reciprocal squared distances at every point. So a
problem is that it's rather slow, O(N M P) where N and M are the image
dimensions and P is the number of points. It would have been nice to make the
spinner interactive so the map would change in real time as you changed the
smearing factor, but my machine won't handle it. So a compute-bound program.
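
For concreteness, the inner loop is roughly this (my reconstruction, not
Baby X's actual code; how the smearing factor enters is a guess):

    /* Reconstruction of the data-density idea: for every pixel, sum the
     * reciprocal of the squared distance to each data point.  O(N*M*P),
     * which is why it is too slow to recompute interactively. */
    #include <stddef.h>

    void density_map(float *out, int width, int height,
                     const float *px, const float *py, size_t npoints,
                     float smear)
    {
        for (int y = 0; y < height; y++) {
            for (int x = 0; x < width; x++) {
                float sum = 0.0f;
                for (size_t i = 0; i < npoints; i++) {
                    float dx = (float)x - px[i];
                    float dy = (float)y - py[i];
                    sum += 1.0f / (dx * dx + dy * dy + smear); /* smear avoids /0 */
                }
                out[(size_t)y * width + x] = sum;
            }
        }
    }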

The second was battle life. It's Conway's Game of Life with a twist to make
it competitive. But the arena is small, because the player wants to see his
cells, and you only want to step it every tenth of a second. So not compute
bound, and it doesn't use maths.

The third was a Perlin / simplex noise program. Perlin noise is locally
correlated noise which you can manipulate and colour to get nice textures,
like clouds or flames or water. It's hard to say whether that one is
compute bound or not. If you want a big texture, it's rather slow. But
you can produce a texture which is big enough for most practical purposes
of seeing the result, without stressing a halfway decent machine at all. So
the texture changes interactively with the parameter spinners.

The fourth was baby Mandelbrot, a Mandelbrot browser. Very clearly compute
bound. But it doesn't make any calls to the maths library, except that
there's a call to sprintf("%g") in the spinner code.

So I'd say there is maybe an indication, but it's not strong.
 
S

Stephen Sprunk

When you statically link a library, the linker resolves symbols to
fixed addresses. All the things about adding or throwing away are
possible, but optional. In BSD/OS, the default static-shared linking
for libc meant that references to libc were resolved statically to
fixed addresses, the same as any other static link would be, but that
nothing was put in the executable that would map to those addresses.
Instead, the dynamic linker would map the library at a fixed address
such that the resolved addresses pointed to the relevant hunks of the
library.

Interesting; I'd never heard of that before. It doesn't seem to be used
by any modern OS, at least under that name, which leads me to think that
it wasn't as good an idea in practice as in theory.
Noticeable speed improvement.

Assuming all executables were linked to the same version of the library,
that seems reasonable. Dynamic libraries require run-time relocation
and PIC, both of which have a performance hit.

However, libraries get updated periodically, and static linking means
that you will need to load a copy of each version used into memory,
which can easily result in using _more_ memory and therefore having
_worse_ performance than with dynamic libraries.

S
 
S

Stephen Sprunk

There is another flaw in Stephen's reasoning: static linking, if
done well, only pulls in the exact code that is needed. It is left as
an exercise to the reader to estimate whether on average that would
be more or less than the code needed to access shared libraries
(which is not small).

Well, that depends on where that code lives. For instance, on Unix
platforms that's typically done by ld.so, which is an "ELF program
interpreter" that is separate from your program. True, it does consume
memory, but it's not all that big and can itself be shared across
multiple processes; cutting out even one or two decent-sized library
functions easily pays for it.

More problematic for dynamic shared libraries are the run-time _CPU_
costs of relocation and of PIC.
With regard to the math library, you may want to do much compile-time
inlining. The cosine is a single Pentium instruction, so doing
a function call means a huge overhead. This alone warrants the math
library's special treatment.

I can't figure out any way to get GCC to inline cos(), regardless of how
the program is linked; it always calls into libm.
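
One easy way to check, assuming GCC on a typical Linux/x86-64 box, is to
inspect the generated assembly:

    /* Compile with:  gcc -O2 -S cos_check.c  and look at cos_check.s
     * (or gcc -O2 cos_check.c -lm to run it).  With a runtime argument
     * you normally still see a call or tail-jump to cos; with a constant
     * argument GCC usually folds the call away entirely. */
    #include <math.h>
    #include <stdio.h>

    double runtime_cos(double x) { return cos(x); }   /* expect a libm call */
    double constant_cos(void)    { return cos(1.0); } /* usually folded */

    int main(void)
    {
        printf("%f %f\n", runtime_cos(2.0), constant_cos());
        return 0;
    }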

S
 
S

Seebs

Interesting; I'd never heard of that before. It doesn't seem to be used
by any modern OS, at least under that name, which leads me to think that
it wasn't as good an idea in practice as in theory.

It was a very significant speed improvement, but it imposed a very
significant development cost, and also prevented the use of some kinds of
position-independent execution -- it guaranteed fixed addresses, which
are now often regarded as an attack vector. Also, performance of dynamic
linkers has improved a fair bit, and computers are a LOT faster than they
were then. Something that made a noticeable difference trying to serve
10k pages in a second off a Pentium is not necessarily relevant to a
modern quad-core.
However, libraries get updated periodically, and static linking means
that you will need to load a copy of each version used into memory,
which can easily result in using _more_ memory and therefore having
_worse_ performance than with dynamic libraries.

Not with static-shared. The entire point is that as long as the *interface*
is stable, you are statically linking to call "whatever function is at
0xdeadbeef", and if you replace the library that gets loaded at that
address, everyone gets the new code. So as long as the update wouldn't
have forced you to keep two copies of libc.so on a dynamic-shared system,
it didn't on static-shared either.

-s
 
S

Stephen Sprunk

Not with static-shared. The entire point is that as long as the
*interface* is stable, you are statically linking to call "whatever
function is at 0xdeadbeef", and if you replace the library that gets
loaded at that address, everyone gets the new code. So as long as the
update wouldn't have forced you to keep two copies of libc.so on a
dynamic-shared system, it didn't on static-shared either.

That assumes it is possible for all future versions of said library to
maintain the exact same entry points, which is highly unlikely unless
all the entry points are in a table containing nothing but jumps to the
actual function addresses, which is what shared _dynamic_ libraries
evolve into after run-time relocation anyway.

If you link to the actual function addresses, then either (a) no
function can ever grow larger than its original size, or (b) you need
padding between functions to accommodate growth, which means you're
taking up _more_ memory than shared dynamic libraries would have.
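
A tiny sketch of that "table of jumps" layout (hypothetical names, not any
real libc's ABI): only the forwarding stubs' addresses are frozen, so the
functions behind them can move or grow between library versions.

    /* Sketch of a fixed entry-point table: callers are linked against the
     * stubs, whose addresses never change; each stub just forwards to
     * wherever the real code currently lives. */
    #include <stdio.h>

    static int real_hello(const char *who)   /* may move or grow freely */
    {
        return printf("hello, %s\n", who);
    }

    /* The frozen interface: one forwarding stub per exported function.
     * In a real static-shared library these would sit at fixed, published
     * addresses; here they are ordinary functions. */
    int lib_hello(const char *who) { return real_hello(who); }

    int main(void)
    {
        lib_hello("world");
        return 0;
    }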

S
 
S

Seebs

That assumes it is possible for all future versions of said library to
maintain the exact same entry points, which is highly unlikely unless
all the entry points are in a table containing nothing but jumps to the
actual function addresses, which is what shared _dynamic_ libraries
evolve into after run-time relocation anyway.

My vague recollection -- and it is admittedly pretty vague -- is that
the net effect was that there was such a table, but that you ended up
with one less jump per call into the library post-relocation, and you
also ended up skipping a fairly significant hunk of relocations at
startup. So the cost was quite noticeable.

I found a page with a little more info:
http://www.pix.net/software/bsdos/elf_faq.html

The relevant hunk:

(As a test, I built the cat program as a dynamically linked
program (4777 bytes), as a statically linked program with
shared libraries (2437 bytes), and as a statically linked
program without shared libraries (50736 bytes). I ran `cat
/etc/fstab' and counted the number of instructions executed
using ptrace() since it's hard to measure run-times for
such small runs. The dynamically linked version used 300181
instructions; the static shared version used 2015 instructions;
and the static unshared version used 1855 instructions. To
give you an idea of the impact on programs that use the
library heavily, I ran `cat -v /sys/compile/FOO/bsd.gdb >
/dev/null' and measured the best run time out of 3 for each
flavor of cat. The dynamic version ran 20.59 seconds; the
static shared version ran 17.48 seconds; the static unshared
version ran 17.05 seconds. Since most programs run longer
and don't make the libraries do such a large proportion of
their computes, the impact of dynamic linking is generally
lost in the noise, however.)

For small, commonly-used programs, you can probably see how reducing
the setup cost from 300k to 2k instructions would add up. Think of
how many thousands of times a typical Unix system is running executables
like "echo" or "cat" in any given hour...

-s
 
B

Ben Bacarisse

Stephen Sprunk said:
That assumes it is possible for all future versions of said library to
maintain the exact same entry points, which is highly unlikely unless
all the entry points are in a table containing nothing but jumps to the
actual function addresses, which is what shared _dynamic_ libraries
evolve into after run-time relocation anyway.

If you link to the actual function addresses, then either (a) no
function can ever grow larger than its original size, or (b) you need
padding between functions to accommodate growth, which means you're
taking up _more_ memory than shared dynamic libraries would have.

.... or you need a more complex address space. On some ICL systems (and
that includes one Unix system and thus a C implementation) functions
were compiled into their own sections, and these were just numbered.
Function addresses were therefore 0x00000001, 0x00000002 and so on.
Data and code address spaces were separate.

However, I don't recall shared libraries being used, but I don't see why
they could not be. I *think* that the link-loader allocated the
numbers, but it was a long time ago.
 
G

glen herrmannsfeldt

Interesting; I'd never heard of that before. It doesn't seem to be used
by any modern OS, at least under that name, which leads me to think that
it wasn't as good an idea in practice as in theory.

I hadn't either, but there are lots of fun tricks you can
play with page tables.
Assuming all executables were linked to the same version of the library,
that seems reasonable. Dynamic libraries require run-time relocation
and PIC, both of which have a performance hit.

Presumably that is system dependent, but, yes, that is what many
do require.
However, libraries get updated periodically, and static linking means
that you will need to load a copy of each version used into memory,
which can easily result in using _more_ memory and therefore having
_worse_ performance than with dynamic libraries.

Tradition, at least back to OS/360 21.6 when I first remember, is
that libraries are backward, but not necessarily forward, compatible.
The SunOS dll number convention has a major version, minor version,
and you can also have another (I don't know if it had a name)
number after that. I believe the system knew that major versions
were incompatible, and minor versions were back compatible.

It used to be popular to replace the Sun supplied libc with one
that supported DNS instead of /etc/hosts and NIS. That was easiest
to do by adding an additional .1 onto the name and copying to /usr/lib.
This had to be done with each new OS version.

-- glen
 
S

Stephen Sprunk

Tradition, at least back to OS/360 21.6 when I first remember, is
that libraries are backward, but not necessarily forward,
compatible. The SunOS dll number convention has a major version,
minor version, and you can also have another (I don't know if it had
a name) number after that. I believe the system knew that major
versions were incompatible, and minor versions were back compatible.

In the Unix world, if something is dynamically linked against library
A.B.C, then it can be used with any library X.Y.Z, provided:

1. X is equal to A
2. Y is equal to or higher than B
3. Z is equal to C, higher than C or lower than C

This implements the following interface versioning rules:

1. If you remove something from the interface, you increment X
2. If you add something to the interface, you increment Y
3. Otherwise, you increment Z

A Sun paper names them "major", "minor" and "micro", but I've seen other
names ("patch", "maintenance", etc.) for the last elsewhere.

S
 
G

glen herrmannsfeldt

(snip, I wrote)
In the Unix world, if something is dynamically linked against library
A.B.C, then it can be used with any library X.Y.Z, provided:
1. X is equal to A
2. Y is equal to or higher than B
3. Z is equal to C, higher than C or lower than C
This implements the following interface versioning rules:
1. If you remove something from the interface, you increment X
2. If you add something to the interface, you increment Y
3. Otherwise, you increment Z

If you add that the .Z is optional, it agrees with what I
remember from the SunOS days. But also that ld.so will choose the
larger Z if more than one is available.
A Sun paper names them "major", "minor" and "micro", but I've
seen other names ("patch", "maintenance", etc.) for the last
elsewhere.

-- glen
 
S

Stephen Sprunk

If you add that the .Z is optional, it agrees with what I remember
from the SunOS days.

A missing number is presumed to be 0; that goes for both Y and Z.
But also that ld.so will choose the larger Z if more than one is
available.

IIRC, it will also choose a larger Y if more than one is available.
This rule maximizes the chance that a later program (or library) will be
able to be linked against the version already in memory.

S
 
S

Stephen Sprunk

My vague recollection -- and it is admittedly pretty vague -- is
that the net effect was that there was such a table, but that you
ended up with one less jump per call into the library
post-relocation,
Hmm.

and you also ended up skipping a fairly significant hunk of
relocations at startup. So the cost was quite noticeable.

Well, you don't need much in the way of startup relocations if you use
lazy binding, which AFAIK is the default on all modern systems.
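(For comparison, on glibc-based systems you can force eager binding by
setting LD_BIND_NOW=1 in the environment or by linking with -Wl,-z,now,
which makes the startup relocation cost visible again.)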
I found a page with a little more info:
http://www.pix.net/software/bsdos/elf_faq.html

Thanks; lots of interesting reading.
The relevant hunk:

(As a test, I built the cat program as a dynamically linked program
(4777 bytes), as a statically linked program with shared libraries
(2437 bytes), and as a statically linked program without shared
libraries (50736 bytes). I ran `cat /etc/fstab' and counted the
number of instructions executed using ptrace() since it's hard to
measure run-times for such small runs. The dynamically linked version
used 300181 instructions; the static shared version used 2015
instructions;

I'm not sure how to generate such numbers myself, and I don't have
access to a system with a static libc anyway for comparison, but the
dynamic version seems really, really high to me.
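
(These days a rough equivalent on Linux would be something like
"perf stat -e instructions cat /etc/fstab", assuming perf is available;
valgrind's callgrind tool gives a similar instruction count.)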

S
 
