Re: Determine the compressibility of a data buffer .. how to do this fast ?

From: gswork (gswork_at_mailcity.com)
Date: 10/15/03


Date: 15 Oct 2003 01:11:26 -0700

SuperFly <mail@mail.com> wrote in message news:<38ooov4mabr0ijprkjfvnt8tm729eqg991@4ax.com>...
> Hi all,
>
> I need a fast way to determine the compressibility of a data buffer in
> memory without using an actual compression scheme. I think the best
> way to do this is to determine the entropy (so i was told), but maybe
> there are faster and/or better ways. I would really appreciate it if
> somebody could point me to some source/pseudo code or a good tutorial
> on how to do this.

You could run through the buffer doing a compression, say an RLE type
count, i.e. count the number of sequences of the same byte value (e.g.
if there's 1000 255's in a row). If you end up with a relatively
small number of long runs the buffer is likely to compress well, if
there'e lots of small runs it probably won't. Not sure that's going
to be much faster than just doing it though!

I made a predictive utility for files once, very crude but
surprisingly predictive of whether the file would compress well.
Without heed to file structure or run-lengths i looped through the
buffer reading the value of each byte and 'counted' that in a 'counter
array'. If, in the end, there was a fairly even distribution of
values then the file would probably not compress well, if there wasn't
(i.e. there were 50,000 zeros and 10 of 255) then it probably would.
I've no doubt there are plenty of reasons not to use such a crude
approach, but it was good for discerning things like jpg files, which
generally won't compress much more than they already are and pdf
files, which sometimes do, sometimes don't, and bmp files, which
normally compress well.

the main part would go, from memory, something like:

for n:=1 to BufferLength do begin
 b:= ByteArray[n] // assign the value of the byte from the buffer to
b, a byte
 bCounter[b]:= bCounter[b] + 1 //increment count of each byte value
end;

bCounter is an array of integers from 0 to 255, they all start at 0
and increment by one each time the corresponding byte value is
encountered. In the end you have knowledge only of how many of each
value there is, and that seems to be be ok, and quite fast. You can
choose a number of ways to analyse that. I'm sure there are much
cleverer ways of predicting compressibility, and faster too. Infact,
i'll gladly benefit from such.



Relevant Pages

  • Re: Streaming XML
    ... This document object grows bigger and bigger while the application ... The document is used as a buffer of data, ... The first is related to memory usage. ... I want to compress it. ...
    (microsoft.public.dotnet.xml)
  • Re: Big Problem/Bug with new matfile command for partial mat file read/writes - creates massivly
    ... stored mat file and does not remove it afterwards. ... buffer when the file is closed. ... RAM and compress and write the contents of that buffer ... I am sure that if competent designers are let loose on ...
    (comp.soft-sys.matlab)
  • Streaming XML
    ... This document object grows bigger and bigger while the application ... The document is used as a buffer of data, ... The first is related to memory usage. ... I want to compress it. ...
    (microsoft.public.dotnet.xml)
  • Re: String to byte[] reloaded
    ... can reuse the same buffer several times. ... I cannot change the compress method's signature (e.g. ... To make sure I handle very careful such a ...
    (microsoft.public.dotnet.languages.csharp)
  • Re: Malcolms new book - Chapter 1 review
    ... he will naturally assume that the reason for the size_t is to allow the ... function to compress an buffer that can be held in memory. ... absolutely no way, with your functions, to specify how long it is. ...
    (comp.lang.c)