Re: MMX speedup for Floyd Steinberg error diffusion



On May 8, 7:00 pm, Terje Mathisen <spamt...@xxxxxxxxxx> wrote:

Interesting algorithm, very serially dependent. :-(

How does it handle the corner/edge cases?

I.e. the last pixel on a scan line has no down-right neighbor, the first
pixel has no down-left and the last scanline has nothing below at all!


First of all, I'm honored to have a reply from Terje :)

The edges are handled by ignoring all pixels which do not have the
neighbours, the simplest way I found was to keep my x and y loops
ranging over [1, width-1) and [0, height-1).


It seems to me like the getNearestPalColor() function could easily
dominate the processing time here!

Well Initially the code used the standard distance matching algorithm
as seen in zillions of color reduction libraries :

For each color in the palette
Compute the RGB colorspace distance between the palette entry and the
target color -> (r - rp)^2 + (g - gp)^2 + (b - bp)^2
Choose the index with the minimum distance.

This was pathetically slow... It has to loop 256 times for each pixel
doing several arithmetic ops!
One optimization is to use abs and get rid of the multiplications.

Then I thought of a simplistic algorithm based on a precomputed
table...

I start with a 3d array 256 * 256 * 256, initialized to -1

I iterate over the 256 colors and each time i "draw" a cube of a
certain size in that array around each color entry. The cube size
grows from size 3 to 33. The cube "color" is the palette index.
So the sequence goes like 1 pixel, 3x3 shell, 5x5 shell etc etc. I
also make sure I do not overwrite an empty slot.
Its kind of like a cubical floodfill.

Now I can use this lookup table to get the nearest color at any given
R, G and B.

Later I changed the box to 32x32x32 and maximal box size to 8 and
still get good results.
The reasoning is that Floyd-Steinberg dithering perturbs the colors
usually by a small amount, so it would be unlikely to look for a color
thats way far away from any given palette entry.
It can happen that the lookup table doesnt have an entry at some given
point, in such cases I fall back on the old linear algorithm.
With some tests I find that about 80% of the cube is usually filled,
and the "cache misses" happen about 5% to 8% of the time.

I guess a 3D space partition would be a good way to go, using an initial
lookup table based on something like 3:3:3 bits to start in the proper
subvolume, or simply use that table to point to lists of possible
palColor candidates for each starting point.

(Due to truncation effects, several pal entries might belong to multiple
lists, depending upon the value of the rounded/truncated RGB bits, but
with a good set of palette colors, the number of values to test against
should be (on average) pretty low.

Anyway, to speed up the dithering stage I would use MMX/SSE and work on
multiple lines in parallel, keeping the intermediate results in
registers until we're finished with them.

How is it possible to do multiple lines in paralell?
One thing I can think of is that :
At each pixel, 2 pixels have to be updated immediately - the right and
bottom-left. The other two have to be updated twice.. So I could hold
3 error values in registers, accumulating them and touching each pixel
only once instead of twice...

My biggest problem now is that the main loop is littered with memory
accesses, and since the getNearestPaletteIndex() function is called on
every pixel, there is some register starvation.
I will next try to avoid looping on x and y indexing and increment the
pointers directly.

But my last measurement shows that this diffusion ( which took more
than half the time before) is now less of the performance hog. So I'll
have to see where to pry out some performance next...

Its been really fun for me to dive into assembler after a period of 6
or 7 years and actually get things working fast!!

Best Regards
Vivek

.



Relevant Pages

  • Re: A new method for the steganography of 8 bit images
    ... > The freeware program Irfanview also uses a very good algorithm!. ... > - Then you double each value in the palette. ... > now has two neighboring entries. ... differing by one bit" to "there is data hidden in the LSB of each pixel." ...
    (sci.crypt)
  • Re: Is this image resizing algorithm already known?
    ... > I searched the Internet for image resizing algorithms, ... > Does someone know if this algorithm already exists? ... E are the pixel colors. ... > 1) each original row pixel contributes with m color parts. ...
    (comp.graphics.algorithms)
  • Is this image resizing algorithm already known?
    ... I searched the Internet for image resizing algorithms, ... Does someone know if this algorithm already exists? ... E are the pixel colors. ... When we have 0 available parts, we take the next original row ...
    (comp.graphics.algorithms)
  • Re: FS-error diffusion (Was:Re: Different Implementations of VLIW .)
    ... |>>> You start with the top left pixel, pick the closest palette match (which ... That is a classic form of a sequential algorithm corresponding to a ... One of the approaches is the standard iterative improvement, ... oscillations where the best fit is ambiguous. ...
    (comp.arch)
  • Re: FS-error diffusion (Was:Re: Different Implementations of VLIW .)
    ... You start with the top left pixel, pick the closest palette match (which ... except that the algorithm depends on it: ... The nice thing about ordered dithering is that it is completely parallizable. ...
    (comp.arch)