Re: speed it up
From: Thomas Matthews (Thomas_MatthewsSpitsOnSpamBots_at_sbcglobal.net)
Date: 06/22/04
- Next message: JKop: "Re: Unicode strings"
- Previous message: JKop: "Re: Ultimate Efficiency"
- In reply to: Gernot Frisch: "speed it up"
- Next in thread: Gernot Frisch: "Re: speed it up"
- Reply: Gernot Frisch: "Re: speed it up"
- Reply: Gernot Frisch: "Re: speed it up"
- Reply: Peter van Merkerk: "Re: speed it up"
Date: Tue, 22 Jun 2004 16:56:43 GMT
Gernot Frisch wrote:
> Hi,
>
> I have 2 C code snippets that produce the same result. However, A is 2x
> faster than B on my PC (x86) but 1.5x slower on my PDA (StrongARM @
> 206 MHz).
>
>
> // Startup conditions + types
> pSrc = new unsigned short[320*240];
> pDst = new unsigned short[320*240];
> register unsigned short x, y, *ldst;
> short xptch = 320, yptch = -1;
> dst = pDst + 319;
> src = pSrc;
>
>
> // A:
> (unsigned long*) pDisplay = (unsigned long*)dst;
> for(x=0; x<240; x++)
> {
> for(y=0; y<160; y++)
> {
> *pDisplay++ = (*(src-240)<<16) | *(src); // Process 4 bytes at once
> src-=480;
> }
> src+=76801; // (320*240+1): get a row ahead + 320 lines down to the bottom
> }
>
> // B:
> for (y = 0; y < 320; y++ )
> {
> ldst = dst; // Get current line address
> for (x = 0; x < 240; x++ )
> {
> *(ldst) = *src++; // one pixel right on src
> ldst += xptch; // add a pixel to the right on dest
> }
> dst += yptch; // add a line to dst buffer
> }
>
> Can someone explain this to me? And better: how do I make this really fast?
> Using ASM? I need an optimized version for an ARM processor.
> Example B shows what it does obviously, I think.
>
> Thank you in advance,
>
Looks like you are performing a {matrix or bitmap} rotation,
but I'm not sure.
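For what it's worth, here is how I read loop B, written with plain indices. This is only a sketch: the name rotate90 and the src_w/src_h parameters are mine, not from the original post, and I am assuming the source is 240 pixels wide by 320 lines tall while the destination is 320 wide by 240 tall.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical helper (not from the original post): index form of loop B.
// The source is src_w pixels wide and src_h lines tall; the result is
// src_h pixels wide and src_w lines tall.
std::vector<unsigned short> rotate90(const std::vector<unsigned short>& src,
                                     std::size_t src_w, std::size_t src_h)
{
    std::vector<unsigned short> dst(src.size());
    for (std::size_t y = 0; y < src_h; ++y)
    {
        for (std::size_t x = 0; x < src_w; ++x)
        {
            // Loop B starts at pDst + 319 (src_h - 1), steps by 320 (src_h)
            // for every source pixel and by -1 for every source line.
            dst[x * src_h + (src_h - 1 - y)] = src[y * src_w + x];
        }
    }
    return dst;
}
```

With src_w = 240 and src_h = 320 the destination index becomes x*320 + 319 - y, which matches the pointer arithmetic in B. The first source pixel lands in the top-right corner of the destination, so this would be a quarter turn clockwise if lines are counted from the top.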
Anyway, to optimize for the ARM: the ARM processor likes unrolled
for-loops and a reduced number of branches (which is probably true for
most processors). The ARM processor also has special instructions that
can load many registers at once from memory (LDM) and store many
registers to memory (STM). My information is that both instructions
require sequential memory locations. Your source is read sequentially,
but the destination is written 320 elements apart, so we can use the
multiple load but not the multiple store.
Let us concentrate on algorithm B.
I will optimize it in steps.
B1:
/* The "const" modifiers will allow the compiler to better
* optimize the code.
*/
const short xpitch = 320;
const short ypitch = -1;
const unsigned short * src = pSrc;
for (y = 0; y < 320; ++y)
{
    ldst = dst;
    for (x = 0; x < 60; ++x)
    {
        *ldst = *src++;
        ldst += xpitch;
        *ldst = *src++;
        ldst += xpitch;
        *ldst = *src++;
        ldst += xpitch;
        *ldst = *src++;
        ldst += xpitch;
    }
    dst += ypitch;
}
In the above modification, the inner loop is unrolled
so that 4 memory transfers are performed for each
branch, rather than one transfer per branch as in
your original code.
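B2: Replace the inner loop with:
/* index restarts at 0 for each source line and keeps
 * advancing across all iterations of the inner loop.
 */
register unsigned int index = 0;
for (x = 0; x < 60; ++x)
{
    register unsigned short s1, s2, s3, s4;
    /* Four sequential loads first... */
    s1 = *src++;
    s2 = *src++;
    s3 = *src++;
    s4 = *src++;
    /* ...then four stores, one destination line apart. */
    ldst[index] = s1;
    index += xpitch;
    ldst[index] = s2;
    index += xpitch;
    ldst[index] = s3;
    index += xpitch;
    ldst[index] = s4;
    index += xpitch;
}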
The above loop tells the compiler that 4 registers
are being loaded at once, then written to memory.
Hopefully this will trigger the load-multiple instruction.
The "ldst[index]" assignment tells the compiler
to use a store instruction with a register-indexed address.
You can unroll the inner loop further, up to the
number of registers available. At least three registers
are already spoken for in any function: the program counter,
the return address, and the local variable (stack) pointer. So
have the compiler emit the function in assembly language, see
how many registers are left, then unroll the inner loop accordingly.
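Here is a self-contained sketch of the B2 idea that you can compile and test on the PC before looking at the ARM assembly. The function name and the src_w/src_h parameters are my own invention, and src_w is assumed to be a multiple of 4. Whether a given compiler turns the four loads into a single multiple-load is not guaranteed, so treat this as something to experiment with rather than a promise.

```cpp
#include <cstddef>

// Hypothetical helper (not from the original post): loads grouped before stores.
// Assumes src_w is a multiple of 4.
void rotate_grouped4(const unsigned short* src, unsigned short* dst,
                     std::size_t src_w, std::size_t src_h)
{
    const std::size_t pitch = src_h;                   // 320 in the post
    for (std::size_t y = 0; y < src_h; ++y)
    {
        unsigned short* ldst = dst + (src_h - 1 - y);  // current destination column
        std::size_t index = 0;                         // reset once per source line
        for (std::size_t x = 0; x < src_w / 4; ++x)
        {
            // Four loads from consecutive addresses.
            const unsigned short s1 = src[0];
            const unsigned short s2 = src[1];
            const unsigned short s3 = src[2];
            const unsigned short s4 = src[3];
            src += 4;
            // Four stores, each one destination line further down.
            ldst[index]             = s1;
            ldst[index + pitch]     = s2;
            ldst[index + 2 * pitch] = s3;
            ldst[index + 3 * pitch] = s4;
            index += 4 * pitch;
        }
    }
}
```

Keeping the loads together and the stores together is the whole point of the exercise; the offsets from index are written out explicitly so that the compiler is free to fold them into the addressing if it can.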
If you tell us what the algorithms are doing, perhaps
we can suggest a more optimal method for the processors.
--
Thomas Matthews
C++ newsgroup welcome message:
http://www.slack.net/~shiva/welcome.txt
C++ Faq: http://www.parashift.com/c++-faq-lite
C Faq: http://www.eskimo.com/~scs/c-faq/top.html
alt.comp.lang.learn.c-c++ faq:
http://www.raos.demon.uk/acllc-c++/faq.html
Other sites:
http://www.josuttis.com -- C++ STL Library book