Re: MMX speedup for Floyd Steinberg error diffusion




"rep_movsd" wrote in message
On May 7, 11:35 pm, "Maarten Kronenburg" wrote:
"Maarten Kronenburg" wrote in message

Now I see the data is in bytes. In that case it seems better to load 16
bytes into a 128-bit xmm register, then unpack 8 bytes at a time into
eight 16-bit words by shifting and ANDing, and do the above with
PSLLW/PSRLW and PADDW/PSUBW on the 16-bit words. Then the scaling
mentioned is not needed, because the upper 8 bits of each 16-bit word
should be zero.
Maarten.

Thanks...

The thing is Floyd-Steinberg dithering works serially one pixel at a
time... Each pixel processed affects the pixel to its right and the
pixels on the next scanline.
So the best I can hope for is to handle one pixel's R, G and B byte
values in one go.

Let me clarify with actual code (deliberately unoptimized C++, for
clarity's sake)

Yes, it's always good to first write working C++ code and test it before
translating it to assembler.


/////////////////////////////////////////////////
struct RGBA {unsigned char r, g, b, a; };

int saturateAdd(int a, int b)
{
int ret = a + b;
if(ret < 0) return 0;
if(ret > 255) return 255;
return ret;
}


This is a signed addition with unsigned saturation.
Normally a signed byte goes from -2^7 to 2^7-1,
and the sum of two such values can never exceed 2^8-1 anyway.
See below for a hint on how to do this.

void diffuse(RGBA* pImg, int w, int h)
{
for(int y = 0; y < h; ++y)
{
RGBA* pPix = pImg + (y * w);

Here it seems that w and h are non-negative sizes used as offsets, so
unsigned int seems a better type for them.

for(int x = 0; x < w; ++x, ++pPix)
{
RGBA bestMatch = getNearestPalColor(pPix);
int rDiff = pPix->r - bestMatch.r;
int gDiff = pPix->g - bestMatch.g;
int bDiff = pPix->b - bestMatch.b;


RGBA* pNext = pPix + 1;
pNext->r = saturateAdd(pNext->r, (rDiff * 7) >> 16);
pNext->g = saturateAdd(pNext->g, (gDiff * 7) >> 16);
pNext->b = saturateAdd(pNext->b, (bDiff * 7) >> 16);

Divide by 16 means shift right by 4, so >> 4.


// repeat the 3 lines above for the pixels below, below left and
// below right, with coefficients 5, 3 and 1
}
}
}

///////////////////////////////////////////////
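For reference, here is a minimal compilable sketch of the routine above with the >> 4 correction applied and the three remaining neighbours written out. Several things in it are assumptions, not part of the posted code: getNearestPalColor is a hypothetical black/white stand-in, the quantised colour is written back in place, the helper spread is invented for brevity, and edge pixels simply skip neighbours that fall outside the image.

```cpp
#include <cassert>

struct RGBA { unsigned char r, g, b, a; };

// Clamp a + b to 0..255, as in the posted saturateAdd.
unsigned char saturateAdd(int a, int b)
{
    int ret = a + b;
    if (ret < 0) return 0;
    if (ret > 255) return 255;
    return static_cast<unsigned char>(ret);
}

// Hypothetical stand-in for getNearestPalColor: a black/white palette
// picked by a brightness threshold. Not the original poster's function.
RGBA getNearestPalColor(const RGBA* p)
{
    unsigned char v = (p->r + p->g + p->b >= 3 * 128) ? 255 : 0;
    return RGBA{v, v, v, p->a};
}

// Add coeff/16 of the error to one neighbouring pixel (hypothetical helper).
// ">> 4" divides by 16; on a negative int it is an arithmetic shift on
// common compilers (guaranteed since C++20), rounding toward minus infinity.
void spread(RGBA* q, int rDiff, int gDiff, int bDiff, int coeff)
{
    q->r = saturateAdd(q->r, (rDiff * coeff) >> 4);
    q->g = saturateAdd(q->g, (gDiff * coeff) >> 4);
    q->b = saturateAdd(q->b, (bDiff * coeff) >> 4);
}

void diffuse(RGBA* pImg, int w, int h)
{
    assert(pImg != nullptr && w > 0 && h > 0);
    for (int y = 0; y < h; ++y) {
        RGBA* pPix = pImg + y * w;
        for (int x = 0; x < w; ++x, ++pPix) {
            const RGBA best = getNearestPalColor(pPix);
            const int rDiff = pPix->r - best.r;
            const int gDiff = pPix->g - best.g;
            const int bDiff = pPix->b - best.b;
            *pPix = best;  // assumption: the quantised colour is stored in place

            // Assumption: neighbours outside the image are skipped.
            const bool right = x + 1 < w;
            const bool below = y + 1 < h;
            if (right)          spread(pPix + 1,     rDiff, gDiff, bDiff, 7);
            if (below && x > 0) spread(pPix + w - 1, rDiff, gDiff, bDiff, 3);
            if (below)          spread(pPix + w,     rDiff, gDiff, bDiff, 5);
            if (below && right) spread(pPix + w + 1, rDiff, gDiff, bDiff, 1);
        }
    }
}
```

With >> 16 in place of >> 4, every positive error term would become 0 and every negative one -1, so almost nothing would be diffused. A scalar version like this is also handy as a reference to compare an MMX version against.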

Since logical and multiply operations aren't available for 8-bit
operands, here's what I'm thinking...

The pixel bytes, let's say, are RGBA.
I do a PUNPCKLBW, getting bytes XRXGXBXA in an MMX register.
The X's are unwanted values.
Then I do a PAND with 0x00FF00FF00FF00FF, getting rid of the X's.


This seems to be OK, because here they are still unsigned.

I repeat the same process for the new palette pixel

Then I can do a PSUBW to get 4 signed differences,
then PMULLW with the coefficient value, like 0x0007000700070007,
and PSRAW to shift.

Yes, this seems to be OK: PSUBW, then PMULLW for the signed
multiplication. Each difference lies in -255 .. 255 and the largest
coefficient is 7, so the signed product fits in 16 bits and you need only
the low word of the result.
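The lane arithmetic can be checked in plain C++ before any assembler is written. The sketch below is a scalar model only: the struct Words4 and the functions widen, psubw, pmullw, psraw and errorTerm are names made up here, each loop imitating what the corresponding instruction does to four 16-bit lanes. It is not MMX code and says nothing about register allocation or lane order.

```cpp
#include <cassert>
#include <cstdint>

// One 64-bit MMX register viewed as four 16-bit lanes (a scalar model).
struct Words4 { std::int16_t w[4]; };

// Models PUNPCKLBW followed by PAND 0x00FF00FF00FF00FF:
// each byte ends up zero-extended in its own 16-bit lane.
Words4 widen(const std::uint8_t px[4])
{
    Words4 out{};
    for (int i = 0; i < 4; ++i) out.w[i] = static_cast<std::int16_t>(px[i]);
    return out;
}

// Models PSUBW: lane-wise 16-bit subtraction with wraparound.
Words4 psubw(const Words4& a, const Words4& b)
{
    Words4 out{};
    for (int i = 0; i < 4; ++i)
        out.w[i] = static_cast<std::int16_t>(a.w[i] - b.w[i]);
    return out;
}

// Models PMULLW: keeps the low 16 bits of each lane-wise product.
Words4 pmullw(const Words4& a, const Words4& b)
{
    Words4 out{};
    for (int i = 0; i < 4; ++i)
        out.w[i] = static_cast<std::int16_t>(
            static_cast<std::uint16_t>(a.w[i] * b.w[i]));
    return out;
}

// Models PSRAW: arithmetic right shift of each lane
// (">>" on a negative value is arithmetic on common compilers).
Words4 psraw(const Words4& a, int count)
{
    assert(count >= 0 && count < 16);
    Words4 out{};
    for (int i = 0; i < 4; ++i)
        out.w[i] = static_cast<std::int16_t>(a.w[i] >> count);
    return out;
}

// (orig - pal) * coeff / 16 for all four channels at once.
Words4 errorTerm(const std::uint8_t orig[4], const std::uint8_t pal[4],
                 std::int16_t coeff)
{
    const Words4 k{{coeff, coeff, coeff, coeff}};
    return psraw(pmullw(psubw(widen(orig), widen(pal)), k), 4);
}
```

For the extreme differences +255 and -255 with coefficient 7, this gives 111 and -112, so each scaled error term still fits in a signed byte.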


Now I have 4 signed WORDs in some MMX register, which are the signed
differences between the original and palettized pixel colors...

Now how do I add these to the destination pixel with saturated
addition?

There are instructions for adding signed values with signed saturation
and unsigned values with unsigned saturation.
How do I add signed differences to unsigned values with unsigned
saturation?

Perhaps some sort of tricky bit manipulation can work?


The result A seems to be unsigned, so in the range 0 .. 2^8-1.
You need to add to it, with saturation, a signed B in the range
-2^7 .. 2^7-1.
The trick seems to be to first subtract 2^7 from A normally (that is,
without saturation), so that it becomes a signed value.
That maps 0 to -2^7 and 2^8-1 to 2^7-1, which is exactly the signed range.
Note that for unsaturated add and subtract it doesn't matter whether the
operands are signed or unsigned, because both wrap around in the same way.
Then add the signed B with signed saturation.
Then add 2^7 back to the result normally.
This works because the saturation range of a byte spans 2^8 values,
whether signed or unsigned.
The unsaturated subtraction of 2^7 may equally be an unsaturated addition
of 2^7, because, as mentioned, unsaturated addition and subtraction wrap
around, and 2^7 is exactly half of the wraparound period 2^8.
Perhaps do this in C++ first, as above, with
signed char saturate_signed_add( signed char a, signed char b )
then rewrite diffuse using this function and the trick above, test it,
and only then put it in assembler.
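A minimal C++ sketch of that trick, checked against the original saturateAdd for every possible A and B. The names biasedSaturateAdd and checkTrick are made up for this example; saturate_signed_add is the scalar counterpart of a signed-saturating byte add such as PADDSB.

```cpp
#include <cassert>

// Reference: clamp a + b to 0..255 (the posted saturateAdd).
int saturateAdd(int a, int b)
{
    int ret = a + b;
    if (ret < 0) return 0;
    if (ret > 255) return 255;
    return ret;
}

// Signed addition with signed saturation to -128..127.
signed char saturate_signed_add(signed char a, signed char b)
{
    int sum = static_cast<int>(a) + static_cast<int>(b);
    if (sum < -128) return -128;
    if (sum > 127) return 127;
    return static_cast<signed char>(sum);
}

// Unsigned A plus signed B with unsigned saturation, via the bias trick.
unsigned char biasedSaturateAdd(unsigned char a, signed char b)
{
    // Step 1: subtract 2^7 without saturation; 0..255 becomes -128..127.
    signed char biased = static_cast<signed char>(static_cast<int>(a) - 128);
    // Step 2: signed-saturating add of the signed error term.
    signed char sum = saturate_signed_add(biased, b);
    // Step 3: add 2^7 back; -128..127 becomes 0..255 again.
    return static_cast<unsigned char>(static_cast<int>(sum) + 128);
}

// Exhaustive check of the trick against the reference clamp.
void checkTrick()
{
    for (int a = 0; a <= 255; ++a)
        for (int b = -128; b <= 127; ++b)
            assert(biasedSaturateAdd(static_cast<unsigned char>(a),
                                     static_cast<signed char>(b))
                   == saturateAdd(a, b));
}
```

For example, biasedSaturateAdd(250, 20) clamps to 255 and biasedSaturateAdd(5, -20) clamps to 0, while biasedSaturateAdd(100, -20) gives the plain sum 80.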


Anyhow, even if I get this far and have to do the rest normally without
MMX, it should be much simpler code than the horror that my compiler
generates for the above C++ code.

Further ideas appreciated...


Yes, but assembler code may also be hard to debug.
Maarten.
