speed it up


Gernot Frisch

Hi,

I have two C code snippets that produce the same result. However, A is 2x
faster than B on my PC (x86) but 1.5x slower on my PDA (StrongARM @
206 MHz).


// Startup conditions + types
pSrc = new unsigned short[320*240];
pDst = new unsigned short[320*240];
register unsigned short x, y, *ldst;
short xptch = 320, yptch = -1;
dst = pDst + 319;
src = pSrc;


// A:
unsigned long* pDisplay = (unsigned long*)dst;
for(x=0; x<240; x++)
{
for(y=0; y<160; y++)
{
*pDisplay++ = (*(src-240)<<16) | *(src); // Process 4 bytes at once
src-=480;
}
src+=76801; // (320*240+1); // Get a row ahead + 320 lines down to the bottom
}

// B:
for (y = 0; y < 320; y++ )
{
ldst = dst; // Get current line address
for (x = 0; x < 240; x++ )
{
*(ldst) = *src++; // one pixel right on src
ldst += xptch; // add a pixel to the right on dest
}
dst += yptch; // add a line to dst buffer
}
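
For reference, here is the plain (slow) version of what B computes - just
the index arithmetic, nothing optimized. The source is treated as 320 rows
of 240 pixels and the destination as 240 rows of 320 pixels, i.e. a
90-degree rotation:

// reference version, not optimized
for (y = 0; y < 320; y++)              // source row
{
    for (x = 0; x < 240; x++)          // source column
    {
        pDst[x*320 + (319 - y)] = pSrc[y*240 + x];
    }
}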

Can someone explain this to me? And better: how can I make this really fast?
Using ASM? I need an optimized version for an ARM processor.
Example B shows most clearly what it does, I think.

Thank you in advance,

--
-Gernot
int main(int argc, char** argv) {printf("%silto%c%cf%cgl%ssic%ccom%c", "ma", 58, 'g', 64, "ba", 46, 10);}

________________________________________
Looking for a good game? Do it yourself!
GLBasic - you can do
www.GLBasic.com
 

Thomas Matthews

Gernot said:
Hi,

I have two C code snippets that produce the same result. However, A is 2x
faster than B on my PC (x86) but 1.5x slower on my PDA (StrongARM @
206 MHz).


[snip]

Looks like you are performing a {matrix or bitmap} rotation,
but I'm not sure.

Anyway, to optimize for the ARM: the ARM processor likes unrolled
for-loops and a reduced number of branches (which is probably true for
most processors). The ARM also has special instructions that can load
many registers at once from memory and store many registers at once to
memory (LDM/STM). My understanding is that both instructions require
sequential memory locations, so here we can use the load but not the
store.
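
As an aside, here is the kind of loop where both halves would apply - a
sketch for illustration only, not our case: when source and destination
are both sequential, the grouped loads and stores below may each collapse
into a single multi-register instruction (LDMIA/STMIA). In the rotation
only the source side is sequential, so only the load half is available.

// Illustration only: contiguous word copy, source/destination assumed
// 32-bit aligned and count a multiple of 4.
void copy_words(unsigned long* dest, const unsigned long* source,
                unsigned int count)
{
    unsigned int i;
    for (i = 0; i < count; i += 4)
    {
        unsigned long w0 = source[i];      // four sequential loads:
        unsigned long w1 = source[i + 1];  // a candidate for LDMIA
        unsigned long w2 = source[i + 2];
        unsigned long w3 = source[i + 3];
        dest[i]     = w0;                  // four sequential stores:
        dest[i + 1] = w1;                  // a candidate for STMIA
        dest[i + 2] = w2;
        dest[i + 3] = w3;
    }
}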

Let us concentrate on algorithm B.
I will optimize it in steps.
B1:
/* The "const" modifiers will allow the compiler to better
* optimize the code.
*/
const short xpitch = 320;
const short ypitch = -1;
const unsigned short * src = pSrc;
for (y = 0; y < 320; ++y)
{
ldst = dst;
for (x = 0; x < 60; ++x)
{
*ldst = *src++;
ldst += xpitch;
*ldst = *src++;
ldst += xpitch;
*ldst = *src++;
ldst += xpitch;
*ldst = *src++;
ldst += xpitch;
}
dst += ypitch;
}
In the above modification, the inner loop is unrolled
so that 4 memory transfers are performed for each
branch, rather than one transfer per branch as in
your original code.
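In numbers, per destination line: the original loop executes 240
increment/compare/branch sequences for 240 transfers, while B1 executes
only 60 of them (240 / 4) for the same 240 transfers.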

B2: Replace inner loop with:
for (x = 0; x < 60; ++x)
{
register unsigned short s1, s2, s3, s4;
register unsigned int index = 0;
s1 = *src++;
s2 = *src++;
s3 = *src++;
s4 = *src++;
ldst[index] = s1;
index += xpitch;
ldst[index] = s2;
index += xpitch;
ldst[index] = s3;
index += xpitch;
ldst[index] = s4;
index += xpitch;
}
The above loop tells the compiler that 4 registers
are being loaded at once and then written to memory.
Hopefully this will trigger that special load instruction.
The "ldst[index]" assignment tells the compiler
to use a store with a register-indexed address.
You can unroll the inner loop further, up to the
number of registers available. At least three registers
are already reserved in a function: the program counter,
the return address (link register), and the stack/frame
pointer. So print the function in assembly language,
see how many registers are left, and then expand the
inner loop accordingly.

If you tell us what the algorithms are doing, perhaps
we can suggest a more optimal method for the processors.

--
Thomas Matthews

C++ newsgroup welcome message:
http://www.slack.net/~shiva/welcome.txt
C++ Faq: http://www.parashift.com/c++-faq-lite
C Faq: http://www.eskimo.com/~scs/c-faq/top.html
alt.comp.lang.learn.c-c++ faq:
http://www.raos.demon.uk/acllc-c++/faq.html
Other sites:
http://www.josuttis.com -- C++ STL Library book
 

Gernot Frisch

Looks like you are performing a {matrix or bitmap} rotation,
but I'm not sure.

Yes, right. I'm transferring my back buffer to the display memory here
(PDA device).
B2: Replace inner loop with:
for (x = 0; x < 60; ++x)
{
register unsigned short s1, s2, s3, s4;
register unsigned int index = 0;
s1 = *src++;
s2 = *src++;
s3 = *src++;
s4 = *src++;
ldst[index] = s1;
index += xpitch;
ldst[index] = s2;
index += xpitch;
ldst[index] = s3;
index += xpitch;
ldst[index] = s4;
index += xpitch;
}

<klonk> (That was my jaw.) You made it almost 5x faster! Unbelievable.

Thank you so much,
Gernot
 

Gernot Frisch

But now it produces errors when it runs. I can't find any mistake (other
than the xptch/xpitch naming).
Really strange...
Try defining/undefining FAST.
Can you spot what's different!?
-Gernot

#include <stdio.h>
#include <string.h>

#define FAST
void Do()
{
unsigned short* src, *dst=NULL, *dst2=NULL, *pDst, *pDst2, *pSrc;
pSrc = new unsigned short[320*240];
pDst = new unsigned short[320*240];
pDst2 = new unsigned short[320*240];
long n;
for (n=0; n<320*240; n++) pSrc[n]=21+n*n;
memset(pDst, 0, 320*240*2);
memset(pDst2, 0, 320*240*2);

register unsigned short x, y, *ldst;
short xpitch=640;
short ypitch=-2;
short xptch = (xpitch >> 1), yptch = (ypitch >> 1);

dst = pDst2 + 319;
src = pSrc;
// Old method
for (y = 0; y < 320; y++ )
{
ldst = dst; // Get current line address
for (x = 0; x < 240; x++ )
{
*(ldst) = *src++; // one pixel right on src
ldst += xptch; // add a pixel to the right on dest
}
dst += yptch; // add a line to dst buffer
}

dst = pDst + 319;
src = pSrc;
// new method
for (y = 0; y < 320; y++ )
{
ldst = dst; // Get current line address
#ifndef FAST
for (x = 0; x < 240; x++ )
{
*(ldst) = *src++; // one pixel right on src
ldst += xptch; // add a pixel to the right on dest
}
#else
for (x = 0; x < 60; x++)
{
register unsigned short s1, s2, s3, s4;
register unsigned int index=0;
s1 = *src++;
s2 = *src++;
s3 = *src++;
s4 = *src++;
ldst[index] = s1;
index += xptch;
ldst[index] = s2;
index += xptch;
ldst[index] = s3;
index += xptch;
ldst[index] = s4;
index += xptch;
}
#endif
dst += yptch; // add a line to dst buffer
}

long q=0;
for (n=0; n<320*240;n++)
{
if (pDst[n] != pDst2[n]) q++;
}
printf("Errors: %ld\n", q);
}
 

Gernot Frisch

Sorry for all that traffic. It was the index variable; it should be
declared outside of the loop...
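I.e., per destination line the fixed version looks roughly like this
(same names as in the test program above):

ldst = dst;                        // current line address
register unsigned int index = 0;   // set up once per line, not once per pass
for (x = 0; x < 60; x++)
{
    register unsigned short s1, s2, s3, s4;
    s1 = *src++;
    s2 = *src++;
    s3 = *src++;
    s4 = *src++;
    ldst[index] = s1; index += xptch;
    ldst[index] = s2; index += xptch;
    ldst[index] = s3; index += xptch;
    ldst[index] = s4; index += xptch;
}
dst += yptch;                      // add a line to dst buffer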
-Gernot
 

Peter van Merkerk

Thomas said:
[snip]

B2: Replace inner loop with:
for (x = 0; x < 60; ++x)
{
register unsigned short s1, s2, s3, s4;
register unsigned int index = 0;
s1 = *src++;
s2 = *src++;
s3 = *src++;
s4 = *src++;
ldst[index] = s1;
index += xpitch;
ldst[index] = s2;
index += xpitch;
ldst[index] = s3;
index += xpitch;
ldst[index] = s4;
index += xpitch;
}
The above loop tells the compiler that 4 registers
are being loaded at once and then written to memory.
Hopefully this will trigger that special load instruction.
The "ldst[index]" assignment tells the compiler
to use a store with a register-indexed address.
You can unroll the inner loop further, up to the
number of registers available. At least three registers
are already reserved in a function: the program counter,
the return address (link register), and the stack/frame
pointer. So print the function in assembly language,
see how many registers are left, and then expand the
inner loop accordingly.

I'm a bit surprised to see that you have to explicitly unroll the loop
and have to use the register keyword. I would expect a good optimizing
compiler to perform these optimizations by itself, especially if the
target processor doesn't like branches.

I have seen compilers for other CPU architectures do these kinds of
optimizations by themselves. Sometimes a minor hint in the code is
required to trigger the optimization, but quite often no
modifications to the code are needed at all.
 

Marcin Kalicinski

Yes, right. I'm transferring my back buffer to the display memory here
(PDA device).

I don't know anything about PDA hardware, but on the PC the display memory
used to be a lot slower than main memory (back in the days of DOS and VGA
cards, when the programmer had direct access to the card). Copying to video
memory was always bottlenecked by the slow video memory, never by the CPU.
It didn't matter at all how you coded the copy operation; it just couldn't
get any faster.

Fortunately, on a PDA this need not be so.

Best regards,
Marcin
B2: Replace inner loop with:
for (x = 0; x < 60; ++x)
{
register unsigned short s1, s2, s3, s4;
register unsigned int index = 0;
s1 = *src++;
s2 = *src++;
s3 = *src++;
s4 = *src++;
ldst[index] = s1;
index += xpitch;
ldst[index] = s2;
index += xpitch;
ldst[index] = s3;
index += xpitch;
ldst[index] = s4;
index += xpitch;
}

<klonk> (That was my jaw.) You made it almost 5x faster! Unbelievable.

Thank you so much,
Gernot
 

Gernot Frisch

I don't know anything about PDA hardware, but on the PC the display memory
used to be a lot slower than main memory (back in the days of DOS and VGA
cards, when the programmer had direct access to the card). Copying to video
memory was always bottlenecked by the slow video memory, never by the CPU.
It didn't matter at all how you coded the copy operation; it just couldn't
get any faster.

Fortunately, on a PDA this need not be so.

Here it does matter. My old version gave me a max of 60 fps; now I'm
getting 72 fps. I found a lib that does a smooth 95 fps, but I think it's
pure hand-optimized asm. Anyway, I'm close enough now to avoid buying it.
-Gernot
 

Thomas Matthews

Peter said:
[snip]

I'm a bit surprised to see that you have to explicitly unroll the loop
and have to use the register keyword. I would expect a good optimizing
compiler to perform these optimizations by itself, especially if the
target processor doesn't like branches.

I have seen compilers for other CPU architectures do these kinds of
optimizations by themselves. Sometimes a minor hint in the code is
required to trigger the optimization, but quite often no
modifications to the code are needed at all.

Hey, pretty good for not testing it!
Actually, I was converting from assembly language into C
on the fly. The "register" keyword may not be necessary, but
I wanted to remind the compiler to use registers. This is almost
exactly what the assembly language would look like.

The reason for the speed up is more that all the data is read
at once and the loop is unrolled. I figured that using the
s* variables would trigger an optimization pattern match
within the compiler. Sometimes, a little hint to the compiler
goes a long way.


--
Thomas Matthews

C++ newsgroup welcome message:
http://www.slack.net/~shiva/welcome.txt
C++ Faq: http://www.parashift.com/c++-faq-lite
C Faq: http://www.eskimo.com/~scs/c-faq/top.html
alt.comp.lang.learn.c-c++ faq:
http://www.raos.demon.uk/acllc-c++/faq.html
Other sites:
http://www.josuttis.com -- C++ STL Library book
 

Thomas Matthews

Gernot said:
Sorry for all that traffic. It was the index variable; it should be
declared outside of the loop...
-Gernot

Mea culpa! The actual mistake is that I forgot
to update the pointer:
for (x = 0; x < 60; ++x)
{
register unsigned short s1, s2, s3, s4;
register unsigned int index = 0;
s1 = *src++;
s2 = *src++;
s3 = *src++;
s4 = *src++;
ldst[index] = s1;
index += xpitch;
ldst[index] = s2;
index += xpitch;
ldst[index] = s3;
index += xpitch;
ldst[index] = s4;
index += xpitch;
/****** insert the line below *****/
ldst += 4 * xpitch;
/* Or: */
ldst += xpitch << 2;
/* add ldst, ldst, xpitch lsl 2 */
}

The index variable is fine within the loop.
You may be able to increase the speed by declaring
the s* and index variables outside of the loops.

Have you tried increasing the number of "s" variables
to 6 or 8?
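
For example, an 8-wide version of the inner loop might look like this
(untested sketch; 240 / 8 = 30 passes, with the s* variables and index
declared outside the loops as suggested above):

register unsigned short s1, s2, s3, s4, s5, s6, s7, s8;
register unsigned int index;
for (x = 0; x < 30; ++x)
{
    s1 = *src++;
    s2 = *src++;
    s3 = *src++;
    s4 = *src++;
    s5 = *src++;
    s6 = *src++;
    s7 = *src++;
    s8 = *src++;
    index = 0;
    ldst[index] = s1;  index += xpitch;
    ldst[index] = s2;  index += xpitch;
    ldst[index] = s3;  index += xpitch;
    ldst[index] = s4;  index += xpitch;
    ldst[index] = s5;  index += xpitch;
    ldst[index] = s6;  index += xpitch;
    ldst[index] = s7;  index += xpitch;
    ldst[index] = s8;  index += xpitch;
    ldst += 8 * xpitch;            // step past the eight rows just written
}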


--
Thomas Matthews

C++ newsgroup welcome message:
http://www.slack.net/~shiva/welcome.txt
C++ Faq: http://www.parashift.com/c++-faq-lite
C Faq: http://www.eskimo.com/~scs/c-faq/top.html
alt.comp.lang.learn.c-c++ faq:
http://www.raos.demon.uk/acllc-c++/faq.html
Other sites:
http://www.josuttis.com -- C++ STL Library book
 

Thomas Matthews

Gernot said:
Yes, right. I'm transferring my back buffer to the display memory here
(PDA device).

Still doesn't look like a straight transfer. Looks like a rotation.
If the transfer is the bottleneck, you may want to change the
layout of the source array to match your PDA (destination) buffer.

You may also get more performance by declaring the arrays the same
bit-width as the PDA. For example, if an integer is 32 bits and
your ARM processor uses 32 bits, declare your arrays as unsigned
ints. This will reduce the number of instructions and cycles as
processors love to use their native integer size and anything
else slows it down.
B2: Replace inner loop with:
for (x = 0; x < 60; ++x)
{
register unsigned short s1, s2, s3, s4;
register unsigned int index = 0;
s1 = *src++;
s2 = *src++;
s3 = *src++;
s4 = *src++;
ldst[index] = s1;
index += xpitch;
ldst[index] = s2;
index += xpitch;
ldst[index] = s3;
index += xpitch;
ldst[index] = s4;
index += xpitch;
}


<klonk> (That was my jaw.) You made it almost 5x faster! Unbelievable.

Thank you so much,
Gernot

I used some optimization techniques, which you should research:
1. Loop unrolling.
2. Load (input) as much data as possible before processing.

Unfortunately, because the increment in the destination is not
contiguous, I could not apply the other two rules:
3. Process as much data as possible
4. Output as much data as possible.

For more fun, obtain a book about the StrongARM processor and
see how it works. Try changing your C code to look more like
the assembly code.

--
Thomas Matthews

C++ newsgroup welcome message:
http://www.slack.net/~shiva/welcome.txt
C++ Faq: http://www.parashift.com/c++-faq-lite
C Faq: http://www.eskimo.com/~scs/c-faq/top.html
alt.comp.lang.learn.c-c++ faq:
http://www.raos.demon.uk/acllc-c++/faq.html
Other sites:
http://www.josuttis.com -- C++ STL Library book
 

Gernot Frisch

Still doesn't look like a straight transfer. Looks like a rotation.
If the transfer is the bottleneck, you may want to change the
layout of the source array to match your PDA (destination) buffer.

In fact, it is a rotation, since some devices have rotated displays and
some don't. Anyway.
You may also get more performance by declaring the arrays the same
bit-width as the PDA. For example, if an integer is 32 bits and
your ARM processor uses 32 bits, declare your arrays as unsigned
ints. This will reduce the number of instructions and cycles as
processors love to use their native integer size and anything
else slows it down.

That is exactly what I tried to do with my other version, which copies to
an unsigned long array rather than the short array. This works very fast
on my PC, but the ARM version takes twice as long!?
I used some optimization techniques, which you should research:
1. Loop unrolling.
2. Load (input) as much data as possible before processing.

I'll google for it.
Unfortunately, because the increment in the destination is not
contiguous, I could not apply the other two rules:
3. Process as much data as possible
4. Output as much data as possible.

Yes, I know. That's pretty much my problem. I'm not even sure the
step-values are positive, since on my device e.g. pitchy = -2...
For more fun, obtain a book about the StrongARM processor and
see how it works. Try changing your C code to look more like
the assembly code.

Yes, thought about that, too. But I'm pretty busy, so I was wondering
if some ASM/C guru would be able to help in the first place - which
you did by speeding it up 5 times!

-Gernot
 

Gernot Frisch

/****** insert the line below *****/
ldst += 4 * xpitch;
/* Or: */
ldst += xpitch << 2;
/* add ldst, ldst, xpitch lsl 2 */
}

Unfortunately I can't be sure that after one line the next line begins;
on my device the previous line begins. Pretty stupid, I know.
So now I'm doing:

ldst = dst; // Get current line address
/*
for (x = 0; x < 240; x++ )
{
*(ldst) = *src++; // one pixel right on src
ldst += xptch; // add a pixel to the right on dest
}
*/
for (x = 0; x < 60; ++x)
{
s1 = *src++;
s2 = *src++;
s3 = *src++;
s4 = *src++;
*ldst = s1;
ldst += xptch;
*ldst = s2;
ldst += xptch;
*ldst = s3;
ldst += xptch;
*ldst = s4;
ldst += xptch;
}
dst += yptch; // add a line to dst buffer

I declared ldst as "register unsigned short*", so it should be faster to
add to the pointer itself instead of keeping an index that gets added to
the base location each time, no?

unsigned short* p1 = p2;
for (...) { p1 += x; *p1 = ...; }           // bump the pointer itself
// instead of
long index = 0;
for (...) { p1[index] = ...; index += x; }  // base pointer + running index

Have you tried increasing the number of "s" variables
to 6 or 8?

I'll do that next.
 

Gernot Frisch

Unfortunately I can't be sure that after one line the next line

Rubbish - I misread your changes. Sorry. Do you know ASM? Can I post
my generated ASM code so you can see if any more speedups are possible?
-Gernot
 

Gernot Frisch

I guess _that_ listing can't be sped up much more now? The only chance
might be to use another approach... Anyway - it's really fast now.


; 11 : int xptch8 =(xptch<<3); // 8 pitches
; 12 : register unsigned short s1, s2, s3, s4, s5, s6, s7, s8;
; 13 : register unsigned int index = 0;
; 14 : for (y = 0; y < 320; y++)

mov r0, #0
mov lr, r0
b |$L42991|
|$L43502|
mov r0, #0
|$L42991|

; 15 : {
; 16 : ldst = dst; // Get current line address

mov r8, r11

; 17 : /*
; 18 : for (x = 0; x < 240; x++ )
; 19 : {
; 20 : *(ldst) = *src++; // one pixel rigth on src
; 21 : ldst += xptch; // add a pixel to the right on dest
; 22 : }
; 23 : */
; 24 : for (x = 0; x < 30; ++x)

mov r9, r0
|$L42994|

; 25 : {
; 26 : s1 = *src++;

ldrh r0, [r10], #2

; 27 : s2 = *src++;

ldrh r1, [r10], #2

; 28 : s3 = *src++;

ldrh r2, [r10], #2

; 29 : s4 = *src++;

ldrh r3, [r10], #2

; 30 : s5 = *src++;

ldrh r4, [r10], #2

; 31 : s6 = *src++;

ldrh r5, [r10], #2

; 32 : s7 = *src++;

ldrh r6, [r10], #2

; 33 : s8 = *src++;

ldrh r7, [r10], #2

; 34 : *ldst=s1; ldst+=xptch;

strh r0, [r8], #-2
add r0, r9, #1

; 35 : *ldst=s2; ldst+=xptch;

strh r1, [r8], #-2
mov r1, r0, lsl #16

; 36 : *ldst=s3; ldst+=xptch;

strh r2, [r8], #-2
mov r2, r1, lsr #16

; 37 : *ldst=s4; ldst+=xptch;

strh r3, [r8], #-2

; 38 : *ldst=s5; ldst+=xptch;

strh r4, [r8], #-2
mov r0, r2, lsl #16

; 39 : *ldst=s6; ldst+=xptch;

strh r5, [r8], #-2
mov r9, r0, lsr #16

; 40 : *ldst=s7; ldst+=xptch;

strh r6, [r8], #-2

; 41 : *ldst=s8; ldst+=xptch;

strh r7, [r8], #-2
cmp r9, #0x1E
bcc |$L42994|
add r0, lr, #1
mov r1, r0, lsl #16

; 42 : }
; 43 : dst += yptch; // add a line to dst buffer

add r11, r11, #0xA, 26
mov r2, r1, lsr #16
mov r0, r2, lsl #16
mov lr, r0, lsr #16
cmp lr, #5, 26
bcc |$L43502|

; 44 : }
; 45 : }
 

Thomas Matthews

Gernot said:
I guess _that_ listing can't be sped up much more now? The only chance
might be to use another approach... Anyway - it's really fast now.


; 11 : int xptch8 =(xptch<<3); // 8 pitches
; 12 : register unsigned short s1, s2, s3, s4, s5, s6, s7, s8;
; 13 : register unsigned int index = 0;
; 14 : for (y = 0; y < 320; y++)

mov r0, #0
mov lr, r0
b |$L42991|
|$L43502|
mov r0, #0
|$L42991|

; 15 : {
; 16 : ldst = dst; // Get current line address

mov r8, r11

; 17 : /*
; 18 : for (x = 0; x < 240; x++ )
; 19 : {
; 20 : *(ldst) = *src++; // one pixel rigth on src
; 21 : ldst += xptch; // add a pixel to the right on dest
; 22 : }
; 23 : */
; 24 : for (x = 0; x < 30; ++x)

mov r9, r0
|$L42994|

; 25 : {
; 26 : s1 = *src++;

ldrh r0, [r10], #2

; 27 : s2 = *src++;

ldrh r1, [r10], #2

; 28 : s3 = *src++;

ldrh r2, [r10], #2

; 29 : s4 = *src++;

ldrh r3, [r10], #2

; 30 : s5 = *src++;

ldrh r4, [r10], #2

; 31 : s6 = *src++;

ldrh r5, [r10], #2

; 32 : s7 = *src++;

ldrh r6, [r10], #2

; 33 : s8 = *src++;

ldrh r7, [r10], #2
[snip]

Although assembly is not topical in this group,
the listing shows that the compiler is not using
the most efficient instruction.

Try converting your arrays from short to integer.
This should speed things up.

Otherwise, the assembly looks rather efficient.
I think the optimization is reaching its limit.
Any more code changes won't realize a drastic
improvement, only subtle ones.

--
Thomas Matthews

C++ newsgroup welcome message:
http://www.slack.net/~shiva/welcome.txt
C++ Faq: http://www.parashift.com/c++-faq-lite
C Faq: http://www.eskimo.com/~scs/c-faq/top.html
alt.comp.lang.learn.c-c++ faq:
http://www.raos.demon.uk/acllc-c++/faq.html
Other sites:
http://www.josuttis.com -- C++ STL Library book
 

Gernot Frisch

Try converting your arrays from short to integer.
This should speed things up.

That's what I tried to do with my other code, which runs much faster on
my PC but not on the ARM processor. The asm looks horrible. I have a
lot of "lsr #16" instructions which I can't get rid of.
Here's my code + the produced asm:

; 401 : {
; 402 : dst++;
; 403 : src+=(319*240);

mov r0, #0x95, 22
orr r0, r0, #0x22, 28

; 404 : unsigned long* pDisplay = (unsigned long*)dst;
; 405 : pDisplay-=160; // One line back

mov r1, #0x9F, 30
add r6, r7, r0
orr r1, r1, #2

; 406 : for(x=0; x<240; x++)

mov r0, #0
sub r5, r5, r1
mov r7, r0

; 414 : }
; 415 : }
; 416 : else // Safe & slow

b |$L43362|
|$L44796|
mov r0, #0
|$L43362|

; 407 : {
; 408 : for(y=0; y<160; y++)

mov r4, r0
|$L43365|

; 409 : {
; 410 : *pDisplay++ = (*(src-240)<<16) | *(src); // Process 4
bytes at once

ldr r12, [pc, #0x714]
ldrh r0, [r6, +r12]
mov r1, r0, lsl #16
ldrh r0, [r6]

; 411 : src-=480;

sub r6, r6, #0xF, 26
mov r2, r1, lsr #16
mov r1, r0, lsl #16
add r0, r4, #1
mov r3, r2, lsl #16
orr r2, r3, r1, lsr #16
mov r1, r0, lsl #16
str r2, [r5], #4
mov r2, r1, lsr #16
mov r0, r2, lsl #16
mov r4, r0, lsr #16
cmp r4, #0xA0
bcc |$L43365|

; 412 : }
; 413 : src+=(320*240+1);

add r0, r6, #0x96, 22
add r1, r7, #1
add r6, r0, #2
mov r0, r1, lsl #16
mov r2, r0, lsr #16
mov r1, r2, lsl #16
mov r7, r1, lsr #16
cmp r7, #0xF0
bcc |$L44796|

; 414 : }
; 415 : }
 

Peter van Merkerk

Thomas said:
The reason for the speed up is more that all the data is read
at once and the loop is unrolled. I figured that using the
s* variables would trigger an optimization pattern match
within the compiler. Sometimes, a little hint to the compiler
goes a long way.

I know, for really performance-critical stuff it is always interesting to
take a look at the assembly output of the compiler to see what it makes
of your code. Sometimes compilers can perform quite amazing
optimizations; at other times they let you down and you have to really
spell out what code you want them to generate. I'm glad that I learned
assembly programming in the good old days. For this kind of stuff it can
really make a difference if you understand what the compiler is doing.
 

Thomas Matthews

Gernot said:
That's what I tried to do with my other code, which runs much faster on
my PC but not on the ARM processor. The asm looks horrible. I have a
lot of "lsr #16" instructions which I can't get rid of.

Sorry, but there is a misunderstanding here.
Use the "B2" code, but with the destination and source arrays
declared as integers:

/*****************************************************
* Use named constants. Get into this habit.
*****************************************************/
static const unsigned int MAX_COLUMNS = 320;
static const unsigned int MAX_ROWS = 240;
static const unsigned int ARRAY_SIZE = MAX_COLUMNS * MAX_ROWS;

/*****************************************************
* Declare arrays as integers even though the data range
* is a short, because int is the processor's native
* size. [1] [2]
*****************************************************/
unsigned int source[ARRAY_SIZE];
unsigned int destination[ARRAY_SIZE];
unsigned int * ldst;
unsigned int * dst;

/*****************************************************
* Since the source array will not be changed, all
* pointers to the source array are declared as a
* pointer to constant data. This is important so
* that the compiler can invoke some optimizations.
*****************************************************/
const unsigned int * src;

dst = destination + MAX_COLUMNS - 1;
/* a.k.a. dst = &destination[MAX_COLUMNS - 1]; */

src = source;

/*****************************************************
* If a variable's value never changes, prefer to use
* a named constant.
*
* When constants are declared in-line or in a function
* prefer to declare them as static.
*
* Again, proper use of "const" allows a compiler to
* perform better optimization "magic".
*****************************************************/
static const int xpitch = 2 * MAX_COLUMNS;
static const int ypitch = -2;

const unsigned int TRANSFERS_IN_LOOP = 4;
const unsigned int INNER_ITERATIONS =
MAX_ROWS / TRANSFERS_IN_LOOP;
unsigned int x;
unsigned int y;

/*****************************************************
* From Scott Meyer's Effective C++ series, prefer
* pre-increment indices in loops.
*****************************************************/
for (y = 0; y < MAX_COLUMNS; ++y ) /* Use Named constant */
{
ldst = dst; // Get current line address
for (x = 0; x < INNER_ITERATIONS; ++x)
{
register unsigned short s1, s2, s3, s4;
register unsigned int index = 0;
s1 = *src++;
s2 = *src++;
s3 = *src++;
s4 = *src++;
ldst[index] = s1;
index += xpitch;
ldst[index] = s2;
index += xpitch;
ldst[index] = s3;
index += xpitch;
ldst[index] = s4;
index += xpitch;
ldst += xpitch << 2;
}
dst += ypitch; // add a line to dst buffer
/* See note [3] */
}

Notes:
[1] One reason _not_ to use "int" is when the destination pointer
refers to a specific hardware memory area which has a
different bit-width than "int".
For example, if the display memory is 16-bits wide and an
int is 32-bits wide.

[2] Don't use dynamic arrays unless the size is not known
until run-time or they are too large to be allocated
as an automatic variable. If the arrays are large
prefer to declare them as static inside blocks and
functions.

[3] You could also try unrolling the outer loop. The performance
gain may not be as much as for the inner loop; the objective
is to do as much processing as possible per branch.
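
As a rough illustration of note [3] - going back to the original
unsigned short buffers for clarity, and untested - two source rows can be
handled per outer pass, because their pixels land in two horizontally
adjacent destination columns:

const unsigned short* row0;
const unsigned short* row1;
unsigned short* col0;
unsigned short* col1;
for (y = 0; y < 320; y += 2)
{
    row0 = src;          // source row y
    row1 = src + 240;    // source row y + 1
    col0 = dst;          // destination column for row y
    col1 = dst - 1;      // destination column for row y + 1
    for (x = 0; x < 240; ++x)
    {
        *col0 = *row0++;  col0 += 320;
        *col1 = *row1++;  col1 += 320;
    }
    src += 2 * 240;      // skip the two rows just copied
    dst -= 2;            // move two columns to the left
}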

--
Thomas Matthews

C++ newsgroup welcome message:
http://www.slack.net/~shiva/welcome.txt
C++ Faq: http://www.parashift.com/c++-faq-lite
C Faq: http://www.eskimo.com/~scs/c-faq/top.html
alt.comp.lang.learn.c-c++ faq:
http://www.raos.demon.uk/acllc-c++/faq.html
Other sites:
http://www.josuttis.com -- C++ STL Library book
 

Gernot Frisch

My source was a short array.
unsigned int source[ARRAY_SIZE];
const unsigned int * src;
unsigned short s1;
src = source;
s1 = *src++;

The src++ will move the src pointer by 4 bytes now (sizeof int), won't
it? That's a problem now, isn't it?
I would have to use:

unsigned long sx;
sx = *src++;
s1 = sx >> 0x10;
s2 = sx & 0xFFFF;

Right?
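
I.e. something like this (just a sketch; I'm assuming the buffers are
32-bit aligned and two pixels are packed per int - and which pixel ends up
in which half depends on the byte order, so on a little-endian device the
two extractions are probably the other way around from what I wrote above):

unsigned long sx;
unsigned short s1, s2;
sx = *src++;                         // reads two packed pixels, advances 4 bytes
s1 = (unsigned short)(sx & 0xFFFF);  // lower-addressed pixel (little-endian)
s2 = (unsigned short)(sx >> 16);     // higher-addressed pixel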

Thank you for your patience. You are helping me a lot in understanding
asm and speed issues.

-Gernot
 
