Re: Java text compression



Chris wrote:
Bonus question for OP: what is the size of data sets and how are they used? Especially, where are they stored?

Multi-terabyte, split across multiple machines. On a single machine, generally not more than a few hundred GB. One or two SATA disks per machine, no RAID.

At compression time, the data is streamed from an external source, transformed in memory, and written to disk.

At decompression time, the app seeks to the particular block of text of interest and decompresses it. Seek time dominates decompression time, *except* when we do heavy caching, in which case decompression becomes the bottleneck. Storing the decompressed text in memory takes up too much space, so it has to be cached in compressed form.
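To make the caching arrangement concrete, here is a minimal sketch of a cache that keeps blocks compressed and inflates them on every read. The class name, the block-id key, and the choice of BEST_SPEED are assumptions for illustration only; nothing here is taken from the actual application.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

// Hypothetical cache: values stay compressed, so memory use is low
// but every hit pays the cost of inflating the block.
final class CompressedBlockCache {
    private final Map<Long, byte[]> blocks = new HashMap<>();

    /** Compresses the text once and keeps only the compressed bytes. */
    void put(long blockId, String text) {
        byte[] raw = text.getBytes(StandardCharsets.UTF_8);
        Deflater deflater = new Deflater(Deflater.BEST_SPEED); // assumed level
        try {
            deflater.setInput(raw);
            deflater.finish();
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] buf = new byte[4096];
            while (!deflater.finished()) {
                int n = deflater.deflate(buf);
                out.write(buf, 0, n);
            }
            blocks.put(blockId, out.toByteArray());
        } finally {
            deflater.end();
        }
    }

    /** Inflates the block on each call; returns null on a cache miss. */
    String get(long blockId) {
        byte[] packed = blocks.get(blockId);
        if (packed == null) {
            return null;
        }
        Inflater inflater = new Inflater();
        try {
            inflater.setInput(packed);
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] buf = new byte[4096];
            while (!inflater.finished()) {
                int n = inflater.inflate(buf);
                if (n == 0 && (inflater.needsInput() || inflater.needsDictionary())) {
                    throw new IllegalStateException("truncated block " + blockId);
                }
                out.write(buf, 0, n);
            }
            return new String(out.toByteArray(), StandardCharsets.UTF_8);
        } catch (DataFormatException e) {
            throw new IllegalStateException("corrupt block " + blockId, e);
        } finally {
            inflater.end();
        }
    }

    /** Number of bytes the cache holds for this block, or -1 on a miss. */
    int compressedBytes(long blockId) {
        byte[] packed = blocks.get(blockId);
        return packed == null ? -1 : packed.length;
    }

    public static void main(String[] args) {
        CompressedBlockCache cache = new CompressedBlockCache();
        cache.put(1L, "some text, some text, some text, some text");
        System.out.println(cache.get(1L));
    }
}
```

Because get() inflates on every hit, this layout is exactly where decompression speed starts to matter more than seek time.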

Cutting the text into blocks usually makes the compression
less effective. Most compression schemes nowadays (and I believe
DEFLATE is among them) are adaptive, meaning that they adjust to
the characteristics of the data stream as they process it. Thus,
they compress relatively poorly at first, then improve as they
learn more about the statistical profile of the data.

Implications: (1) You'll get better compression if you can
keep the blocks "fairly long." (1a) You might make the blocks
"long" by concatenating multiple sub-blocks, at the expense of
needing to decompress from the start of a block even if you only
need the sub-block at its end. (2) If the blocks simply must be
small, you probably shouldn't waste effort on BEST_COMPRESSION.
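Point (1a) can be sketched roughly as follows: several sub-blocks are compressed as one stream, and reading sub-block i means inflating from the start of the block up to the end of that sub-block. The class and its offset table are hypothetical names made up for this sketch, and it assumes the whole block fits in memory.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

// Hypothetical layout: one DEFLATE stream per block, plus a table of
// uncompressed offsets saying where each sub-block starts.
final class SubBlockStore {
    private final byte[] packed;  // all sub-blocks, compressed together
    private final int[] offsets;  // offsets[i] = start of sub-block i; last entry = total length

    SubBlockStore(String[] subBlocks) {
        ByteArrayOutputStream raw = new ByteArrayOutputStream();
        offsets = new int[subBlocks.length + 1];
        for (int i = 0; i < subBlocks.length; i++) {
            offsets[i] = raw.size();
            byte[] b = subBlocks[i].getBytes(StandardCharsets.UTF_8);
            raw.write(b, 0, b.length);
        }
        offsets[subBlocks.length] = raw.size();
        packed = deflate(raw.toByteArray());
    }

    private static byte[] deflate(byte[] input) {
        Deflater deflater = new Deflater();
        try {
            deflater.setInput(input);
            deflater.finish();
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] buf = new byte[4096];
            while (!deflater.finished()) {
                int n = deflater.deflate(buf);
                out.write(buf, 0, n);
            }
            return out.toByteArray();
        } finally {
            deflater.end();
        }
    }

    /** Inflates from the start of the block and stops once sub-block i is complete. */
    String get(int i) {
        int end = offsets[i + 1];
        byte[] out = new byte[end];
        Inflater inflater = new Inflater();
        try {
            inflater.setInput(packed);
            int filled = 0;
            while (filled < end) {
                int n = inflater.inflate(out, filled, end - filled);
                if (n == 0) {
                    throw new IllegalStateException("stream ended before sub-block " + i);
                }
                filled += n;
            }
            return new String(out, offsets[i], end - offsets[i], StandardCharsets.UTF_8);
        } catch (DataFormatException e) {
            throw new IllegalStateException("corrupt block", e);
        } finally {
            inflater.end();
        }
    }

    public static void main(String[] args) {
        SubBlockStore store = new SubBlockStore(new String[] {"first ", "second ", "third"});
        System.out.println(store.get(2));
    }
}
```

Fetching the first sub-block is cheap and fetching the last one costs a full-block inflate, which is the expense described in (1a).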

Some interesting experiments are in order.
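One such experiment might look like the sketch below: it compresses the same bytes as independent blocks of several sizes, at BEST_SPEED and at BEST_COMPRESSION, and prints the total compressed size for each combination. The sample text is synthetic and made up for this sketch, so the numbers it prints say nothing about real data; substitute a sample of the actual text before drawing conclusions.

```java
import java.nio.charset.StandardCharsets;
import java.util.Random;
import java.util.zip.Deflater;

final class BlockSizeExperiment {

    /** Compressed size in bytes of data[off, off+len) as one independent stream. */
    static int compressedSize(byte[] data, int off, int len, int level) {
        Deflater deflater = new Deflater(level);
        try {
            deflater.setInput(data, off, len);
            deflater.finish();
            byte[] buf = new byte[8192];
            int total = 0;
            while (!deflater.finished()) {
                total += deflater.deflate(buf); // output is discarded, only counted
            }
            return total;
        } finally {
            deflater.end();
        }
    }

    /** Total compressed size when data is cut into independent blocks of blockSize bytes. */
    static long blockedSize(byte[] data, int blockSize, int level) {
        if (blockSize <= 0) {
            throw new IllegalArgumentException("blockSize must be positive");
        }
        long total = 0;
        for (int off = 0; off < data.length; off += blockSize) {
            int len = Math.min(blockSize, data.length - off);
            total += compressedSize(data, off, len, level);
        }
        return total;
    }

    /** Synthetic stand-in for real text: random words from a tiny vocabulary, fixed seed. */
    static byte[] sampleText(int words) {
        String[] vocab = {"compression", "block", "stream", "data", "adaptive",
                          "seek", "cache", "disk", "text", "the", "of", "and"};
        Random rnd = new Random(42);
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < words; i++) {
            sb.append(vocab[rnd.nextInt(vocab.length)]);
            sb.append(i % 12 == 11 ? '\n' : ' ');
        }
        return sb.toString().getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[] data = sampleText(50_000);
        int[] levels = {Deflater.BEST_SPEED, Deflater.BEST_COMPRESSION};
        int[] blockSizes = {256, 4096, 65536, data.length};
        for (int level : levels) {
            for (int size : blockSizes) {
                System.out.printf("level=%d block=%d bytes -> %d compressed%n",
                        level, size, blockedSize(data, size, level));
            }
        }
    }
}
```

Wrapping the blockedSize calls in System.nanoTime() measurements would add the speed side of the comparison, though timings on a warm JVM need several repetitions to be meaningful.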

--
Eric Sosman
esosman@xxxxxxxxxxxxxxxxxxxx