Re: Java text compression
- From: Eric Sosman <esosman@xxxxxxxxxxxxxxxxxxxx>
- Date: Sun, 18 Nov 2007 18:30:19 -0500
Chris wrote:
>> Bonus question for OP: what is the size of data sets and how are they used? Especially, where are they stored?
> Multi-terabyte sized, split across multiple machines. On a single machine, generally not more than a few hundred GB. One or two disks per machine, SATA, no RAID.
> At compression time, the data is streamed from an external source, transformed in memory, and written to disk.
> At decompression time, the app seeks to the particular block of text of interest and decompresses it. Seek time dominates decompression time, *except* when we do heavy caching, in which case decompression becomes the bottleneck. Storing the decompressed text in memory takes up too much space, so it has to be cached in compressed form.
Cutting the text into blocks usually makes the compression
less effective. Most compression schemes nowadays (and I believe
DEFLATE is among them) are adaptive, meaning that they adjust to
the characteristics of the data stream as they process it. Thus,
they compress relatively poorly at first, then improve as they
learn more about the statistical profile of the data.
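
One way to see the effect of block size is to compress the same
text once as a whole and again as independently compressed blocks.
The following is only a sketch: the class name BlockSizeDemo, the
helper names, and the synthetic word-soup text are all made up for
illustration, and real text will give different numbers.

```java
import java.nio.charset.StandardCharsets;
import java.util.Random;
import java.util.zip.Deflater;

public class BlockSizeDemo {

    /** Returns the number of bytes DEFLATE produces for data[off, off + len). */
    public static int compressedSize(byte[] data, int off, int len) {
        Deflater deflater = new Deflater(Deflater.DEFAULT_COMPRESSION);
        try {
            deflater.setInput(data, off, len);
            deflater.finish();
            byte[] buf = new byte[8192];
            int total = 0;
            while (!deflater.finished()) {
                total += deflater.deflate(buf);
            }
            return total;
        } finally {
            deflater.end(); // release native zlib memory
        }
    }

    /** Total size when the data is cut into blocks that are compressed independently. */
    public static long blockedSize(byte[] data, int blockSize) {
        long total = 0;
        for (int off = 0; off < data.length; off += blockSize) {
            total += compressedSize(data, off, Math.min(blockSize, data.length - off));
        }
        return total;
    }

    /** Synthetic stand-in for real text (an assumption): random words from a small vocabulary. */
    public static byte[] sampleText(int wordCount) {
        String[] vocab = {"the", "data", "block", "stream", "compress", "cache",
                          "seek", "disk", "memory", "text", "of", "and", "is", "in"};
        Random rnd = new Random(42); // fixed seed so runs are repeatable
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < wordCount; i++) {
            sb.append(vocab[rnd.nextInt(vocab.length)]).append(' ');
        }
        return sb.toString().getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[] text = sampleText(200_000);
        // Each block starts with a fresh Deflater, so it has to relearn the data.
        for (int blockSize : new int[] {512, 4096, 65536, text.length}) {
            System.out.printf("block size %8d -> %d compressed bytes%n",
                    blockSize, blockedSize(text, blockSize));
        }
    }
}
```

Running it on a sample of the actual data rather than the synthetic
text would give a more honest picture of where the gains level off.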
Implications: (1) You'll get better compression if you can
keep the blocks "fairly long." (1a) You might make the blocks
"long" by concatenating multiple sub-blocks, at the expense of
needing to decompress from the start of a block even if you only
need the sub-block at its end. (2) If the blocks simply must be
small, you probably shouldn't waste effort on BEST_COMPRESSION.
Some interesting experiments are in order.
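
A rough sketch of (1a), for anyone who wants to try it: several
sub-blocks are concatenated and compressed as one block, and a
sub-block is recovered by inflating from the start of the block and
throwing away everything before it. The class SubBlockDemo, the
method names, and the offset bookkeeping are assumptions made for
the example, not anything from the original application. The main
method also prints compressed sizes at two levels as a starting
point for (2).

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

public class SubBlockDemo {

    /** Compresses data as a single DEFLATE stream at the given level. */
    public static byte[] compress(byte[] data, int level) {
        Deflater deflater = new Deflater(level);
        try {
            deflater.setInput(data);
            deflater.finish();
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] buf = new byte[8192];
            while (!deflater.finished()) {
                out.write(buf, 0, deflater.deflate(buf));
            }
            return out.toByteArray();
        } finally {
            deflater.end();
        }
    }

    /**
     * Returns uncompressed bytes [offset, offset + length) of a block.
     * Everything before offset still has to be inflated, then discarded.
     */
    public static byte[] readSubBlock(byte[] compressed, int offset, int length) {
        Inflater inflater = new Inflater();
        try {
            inflater.setInput(compressed);
            byte[] scratch = new byte[8192];
            byte[] result = new byte[length];
            int pos = 0;    // uncompressed bytes produced so far
            int filled = 0; // bytes of the sub-block copied so far
            while (filled < length) {
                int n = inflater.inflate(scratch);
                if (n == 0 && (inflater.finished() || inflater.needsInput()
                        || inflater.needsDictionary())) {
                    throw new IllegalArgumentException("block ended before sub-block was complete");
                }
                // Copy the part of this chunk that overlaps the requested range.
                int from = Math.max(pos, offset);
                int to = Math.min(pos + n, offset + length);
                if (to > from) {
                    System.arraycopy(scratch, from - pos, result, from - offset, to - from);
                    filled += to - from;
                }
                pos += n;
            }
            return result;
        } catch (DataFormatException e) {
            throw new IllegalArgumentException("corrupt compressed block", e);
        } finally {
            inflater.end();
        }
    }

    public static void main(String[] args) {
        String[] docs = {"alpha alpha alpha beta", "beta gamma gamma gamma", "gamma delta delta alpha"};
        ByteArrayOutputStream joined = new ByteArrayOutputStream();
        int[] offsets = new int[docs.length];
        int[] lengths = new int[docs.length];
        for (int i = 0; i < docs.length; i++) {
            byte[] b = docs[i].getBytes(StandardCharsets.UTF_8);
            offsets[i] = joined.size(); // where this sub-block starts in the uncompressed block
            lengths[i] = b.length;
            joined.write(b, 0, b.length);
        }
        byte[] raw = joined.toByteArray();
        byte[] block = compress(raw, Deflater.BEST_SPEED);
        byte[] last = readSubBlock(block, offsets[2], lengths[2]);
        System.out.println(new String(last, StandardCharsets.UTF_8));
        System.out.println("BEST_SPEED: " + block.length + " bytes, BEST_COMPRESSION: "
                + compress(raw, Deflater.BEST_COMPRESSION).length + " bytes");
    }
}
```

The cost of reading the last sub-block grows with the length of the
whole block, so the block length is a trade between compression
ratio and decompression work per lookup.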
--
Eric Sosman
esosman@xxxxxxxxxxxxxxxxxxxx