Good book

Clark Smith

The biggest incorrect assumption I see about assembly is the idea that
any C code could be re-written in assembly to make it faster.
Occasionally it makes sense to do this, but only occasionally.

That's going to depend on the actual platform and compiler. It
may be true for x86, but for IA64 in particular, it is not, even when
using the compilers supplied by Intel and HP.
I think knowledge of assembly is more important for smaller processors -

Not necessarily - see above.
 
Ian Collins

Malcolm said:
It used to be a lot faster to pad a 2D array out to a sum of two powers of 2, then
access via ((y << 8) | (y << 2)) + x. I haven't actually checked, but I wouldn't be at
all surprised if y * (soft) width + x isn't equally fast or even faster.

However it could easily go back again as chip architecture develops. If you
have experience of work at the instruction level, you're aware of these potential
issues.

Just as long as you keep that knowledge up to date. Stale knowledge is
a dangerous thing when it comes to processor architecture!
 
Malcolm McLean

Ian Collins wrote:

Just as long as you keep that knowledge up to date. Stale knowledge is
a dangerous thing when it comes to processor architecture!
No, because if you try to optimise a function by rewriting it in
assembler, the first thing you will do is test it for correctness,
and the second thing you will do is test it for execution speed.
So the worst likely result is that you waste a couple of days'
work. Not good, but a wasted day or two happens all the time
in a software development environment, it's not going to make
or break the success of the project, normally.
 
David Brown

It used to be a lot faster to pad a 2D array out to a sum of two powers of 2, then
access via ((y << 8) | (y << 2)) + x. I haven't actually checked, but I wouldn't be at
all surprised if y * (soft) width +x isn't equally fast or even faster.

However it could easily go back again as chip architecture develops. If you
have experience of work at the instruction level, you're aware of these potential
issues.

If you have a chip in which "((y << 8) + (y << 2)) + x" is faster than
"y * 260 + x", and your compiler doesn't optimise "y * 260 + x" into the
shifts and adds, then you need a better compiler.

And note that your compiler is likely to get the optimisation right -
and use a "+" instead of an "|", while your manual "optimisation" is
incorrect but is likely to pass simple code tests.

Using "y * 260 + x" is also vastly clearer and more flexible than an
explicit shift-and-add version.

All in all, using such shift-and-add expressions when you really want a
multiply is pretty much always a bad idea in many ways. There was a
time when such things were useful (I've used them myself), but not now.
 
Bill Cunningham

No, I think you're exactly right on this. I found the original C memo
written by K&R (that later mutated into a chapter in the book) to be
exactly what I needed for learning the language. If it had been my
first language, it would have been hopeless (I don't even remember how
many languages I'd learned by the time I came across C -- I think I'd
even had my "language of the week" style programming languages course
first).

I did some very brief "coding" with BASIC for a bit, but couldn't really put
much together. There's a difference between "coding" and "programming".

B
 
Malcolm McLean

If you have a chip in which "((y << 8) + (y << 2)) + x" is faster than
"y * 260 + x", and your compiler doesn't optimise "y * 260 + x" into the
shifts and adds, then you need a better compiler.
Say the data is actually sales of 257 varieties of baked bean by
300 shops. So a 300*257 2D array, hard-coded (our products rarely
change, we've been using "257 varieties" as a slogan for donkey's
years). We can pad the array out to 260 and put null values in the empty
slots.
To give a slightly more realistic example, you often used to
pad a sprite to 8x8, even if a row or column was all transparent.
 
David Brown

That's going to depend on the actual platform and compiler. It
may be true for x86, but for IA64 in particular, it is not, even when
using the compilers supplied by Intel and HP.


Not necessarily - see above.

I'm a little confused by what you are saying here, but I believe you are
saying that a knowledge of assembly is important for efficient IA64
programming, and that re-writing IA64 code in hand-written assembly can
often be necessary to get good code. Is that right?

I don't have any direct experience with the IA64, but from what I have
read, you are correct. The theory behind the VLIW architecture was
that the compiler would generate good static scheduling of multiple
parallel instructions. In practice, compilers have never managed it
very well - so a good understanding of the assembly and architecture
would help IA64 programmers, and I can well believe that hand-written
assembly /could/ outperform C code. However, writing assembly for the
IA64 of any significant size and complexity is going to be very hard,
and suffer from the usual maintenance issues and portability issues
(across different IA64 devices), meaning it is unlikely to be a good
solution. The only sane solution, of course, is to drop the IA64
architecture - the theory on which it is founded is simply wrong for
most general programming. No compiler or human assembler can pick good
static scheduling because there is no good static scheduling choice in
most cases - it has to be done dynamically (as is done in most
superscalar cpus).
 
David Brown

Say the data is actually sales of 257 varieties of baked bean by
300 shops. So a 300*257 2D array, hard-coded (our products rarely
change, we've been using "257 varieties" as a slogan for donkey's
years).

It is irrelevant - you write your multiplication using the real numbers
(preferably #define'd values, or static const values). As long as the
compiler can get compile-time access to the fixed values, it should
generate the fastest code.

We can pad the array out to 260 and put null values in the empty
slots.
To give a slightly more realistic example, you often used to
pad a sprite to 8x8, even if a row or column was all transparent.

That's a different matter - and certainly in some cases it is worth
adding padding to make alignments fit well or to make array elements fit
within cache lines (for big cpus) or to let the compiler turn multiplies
into shifts (for small cpus).

But note that I said "to let the /compiler/ turn multiplies into shifts"
- not "to let the /programmer/ turn multiplies into shifts". Work
/with/ your compiler, not /against/ it.
 
Malcolm McLean

It is irrelevant - you write your multiplication using the real numbers
(preferably #define'd values, or static const values). As long as the
compiler can get compile-time access to the fixed values, it should
generate the fastest code.
No, because it's not allowed to replace

/* NBEANS = 257 */
int sales[NSHOPS][NBEANS];

with
int sales_internal[NSHOPS][260];
because 260 can be efficiently backcracked while 257 can't.
 
David Brown

It is irrelevant - you write your multiplication using the real numbers
(preferably #define'd values, or static const values). As long as the
compiler can get compile-time access to the fixed values, it should
generate the fastest code.
No, because it's not allowed to replace

/* NBEANS = 257 */
int sales[NSHOPS][NBEANS];

with
int sales_internal[NSHOPS][260];
because 260 can be efficiently backcracked while 257 can't.

We will leave it up to the language lawyers for a ruling on whether or
not the compiler is allowed to make this change (I think not, but am not
sure). But I don't think any compiler /would/ make that change, even if
it turns out to be legal.

However, no one is asking a compiler to do this change - changing a
multiply by 257 into a multiply by 260 would change the logic of the
code. I am saying that you should let the compiler change the multiply
by 257 into shifts-and-adds if it wants, you should never do it manually.

That is a different thing from manually adding padding to a struct in
order to give the compiler easier numbers to multiply - that can
sometimes be worth doing. But you do it by defining NBEANS to 260 and
using "x * NBEANS" in the code - not by writing shifts.

Incidentally, for most targets it will be at least as efficient, and
often more efficient, to multiply by 257 rather than 260. The fact that
you regularly get your "micro-optimisation" examples wrong shows that this
sort of thing is normally best left up to the compiler.
 
Malcolm McLean

No, because it's not allowed to replace

/* NBEANS = 257 */
int sales[NSHOPS][NBEANS];

int sales_internal[NSHOPS][260];
because 260 can be efficiently backcracked while 257 can't.



We will leave it up to the language lawyers for a ruling on whether or
not the compiler is allowed to make this change (I think not, but am not
sure). But I don't think any compiler /would/ make that change, even if
it turns out to be legal.

However, no one is asking a compiler to do this change - changing a
multiply by 257 into a multiply by 260 would change the logic of the
code. I am saying that you should let the compiler change the multiply
by 257 into shifts-and-adds if it wants, you should never do it manually.
Once you've worked out that 260 can be backcracked and 257 can't be, you can
write C code with 260 hardcoded. But then you've got to look at the assembly
to check that the compiler genuinely made the optimisation. Also, there
are lots of numbers you could potentially pad to. Unless you've got some insight
into the instruction set, it's hard to know which ones to choose. The average
person is not going to think "260, that's only got two set bits".

Then the data table holds sales of beans for a supermarket chain. Let's say each
sale is a record, so typically customers will buy one or two cans of beans. So
we'll be indexing into this table millions of times on a typical run. But the
compiler has no way of knowing that. It can't distinguish a loop that will be
run millions of times from one that will be run half a dozen times, because of
course Ncustomers has to be soft-coded.
 
Clark Smith

I'm a little confused by what you are saying here, but I believe you are
saying that a knowledge of assembly is important for efficient IA64
programming, and that re-writing IA64 code in hand-written assembly can
often be necessary to get good code. Is that right?

That is right.
I don't have any direct experience with the IA64, but from what I have
read, you are correct. The theory behind the VLIW architecture was
that the compiler would generate good static scheduling of multiple
parallel instructions. In practice, compilers have never managed it
very well - so a good understanding of the assembly and architecture
would help IA64 programmers, and I can well believe that hand-written
assembly /could/ outperform C code.

I have seen some spectacular examples, especially in the realm of
cryptography, where carefully scheduled IA64 assembly language code
outperforms its C counterpart by an order of magnitude.
However, writing assembly for the
IA64 of any significant size and complexity is going to be very hard,
and suffer from the usual maintenance issues and portability issues
(across different IA64 devices), meaning it is unlikely to be a good
solution.

True. You can still get a bang for your buck by focussing on the
right areas though. This is probably no different than other processors,
in principle; it's just that under IA64 there are more opportunities for
hand-coded assembly language code to shine vis-a-vis compiler-created
assembly language code.
The only sane solution, of course, is to drop the IA64
architecture - the theory on which it is founded is simply wrong for
most general programming. No compiler or human assembler can pick good
static scheduling because there is no good static scheduling choice in
most cases - it has to be done dynamically (as is done in most
superscalar cpus).

Actually, I partially disagree. Experienced human assemblers can
and do pick up good static scheduling for IA64, and it is certainly worth
the development time in some cases, as mentioned above.

I agree that current compilers most certainly don't do nearly as
well. Current compiler technology does not seem to be sophisticated
enough to do so, and it is not clear to me whether this is so because of
a lack of investment and interest, or because it is just too difficult in
general. The IA64 C compilers from Intel and HP generate far better IA64
code than GCC, but they still make a suboptimal use of the architecture.
 
David Brown

No, because it's not allowed to replace

/* NBEANS = 257 */
int sales[NSHOPS][NBEANS];

int sales_internal[NSHOPS][260];
because 260 can be efficiently backcracked while 257 can't.



We will leave it up to the language lawyers for a ruling on whether or
not the compiler is allowed to make this change (I think not, but am not
sure). But I don't think any compiler /would/ make that change, even if
it turns out to be legal.

However, no one is asking a compiler to do this change - changing a
multiply by 257 into a multiply by 260 would change the logic of the
code. I am saying that you should let the compiler change the multiply
by 257 into shifts-and-adds if it wants, you should never do it manually.
Once you've worked out that 260 can be backcracked and 257 can't be, you can
write C code with 260 hardcoded.

I am close to giving up and killfiling you for my own sanity.
"backcrack" is clearly yet another of your private terms that you think
are known to all speakers of the English language. I had assumed you
were talking about the ease of transforming multiplies into
shifts-and-adds, since that was what you were talking about a few posts
earlier when you incorrectly claimed that it might be worth converting
"y * 260 + x" into "((y << 8) | (y << 2)) + x". As I pointed out
earlier, this is an incorrect transformation (unless you are sure that y
is less than 64 and non-negative). The correct transformation would be
"(y << 8) + (y << 2) + x". And any half-decent compiler will do such
strength reduction as and when appropriate.

/If/ that is what you mean by "backcrack", then y * 257 + x can also be
"backcracked" into (y << 8) + y + x. This is simpler, and could be
faster than the shift+add arrangement for 260.
But then you've got to look at the assembly
to check that the compiler genuinely made the optimisation.

If you care about the speed here, your main task is to measure the
timings. /How/ the compiler optimises the code is secondary to how fast
the code is (compared to how fast you need it to be). Looking at the
assembly is one way to get an idea if you don't want to measure the real
times - it should not be hard in this particular case.
Also, there
are lots of numbers you could potentially pad to. Unless you've got some insight
into the instruction set, it's hard to know which ones to choose. The average
person is not going to think "260, that's only got two set bits".

The average person is not going to care - they will let the compiler
deal with it. The more advanced programmer will know that 257 /also/
has only two bits set, and the bit numbers are more amenable to optimal
speed even on the simplest of processors. And a /really/ advanced
programmer will take a step back and think whether this matters, and if
so ask what the /real/ bottlenecks are - aligning to cache lines is
likely to be more relevant than the multiplication on big systems.
 
Ian Collins

Clark said:
I agree that current compilers most certainly don't do nearly as
well. Current compiler technology does not seem to be sophisticated
enough to do so, and it is not clear to me whether this is so because of
a lack of investment and interest, or because it is just too difficult in
general. The IA64 C compilers from Intel and HP generate far better IA64
code than GCC, but they still make a suboptimal use of the architecture.

Possibly because the underlying architecture shifts every 18 months or
so and even Intel's compiler team can't keep up?
 
David Brown

That is right.


I have seen some spectacular examples, especially in the realm of
cryptography, where carefully scheduled IA64 assembly language code
outperforms its C counterpart by an order of magnitude.

I have no problem believing that example - cryptography is a field that
lends itself well to good assembly language implementations (that
applies to most or all cpus, not just the IA64). It often needs lots of
similar operations, and lots of maths operations that don't map well to
C (such as multiplying with numbers and results of different sizes, or
rotations of large numbers).

There are other types of algorithm that can have similar effect,
especially if the cpu supports particular instructions that don't fit
well with C or the C compiler. DSP algorithms are often in this category.

But these are the exceptions - "general" code is usually best written in C.
True. You can still get a bang for your buck by focussing on the
right areas though. This is probably no different than other processors,
in principle; it's just that under IA64 there are more opportunities for
hand-coded assembly language code to shine vis-a-vis compiler-created
assembly language code.


Actually, I partially disagree. Experienced human assemblers can
and do pick up good static scheduling for IA64, and it is certainly worth
the development time in some cases, as mentioned above.

That can be true for some types of code with a very predictable flow.
But for a lot of general code, full of jumps and conditional branches,
there is no way to get a good static scheduling.
 
