Re: zlib interface semi-broken



Travis wrote:
On Tue, Feb 10, 2009 at 01:36:21PM -0800, Scott David Daniels wrote:
.... I personally would like it and bz2 to get closer to each other...

Well, I like this idea; perhaps this is a good time to discuss the
equivalent of some "abstract base classes", or "interfaces", for
compression.

As I see it, the fundamental abstractions are the stream-oriented
de/compression routines. Given those, one should easily be able to
implement one-shot de/compression of strings. In fact, that is the
way that zlib is implemented; the base functions are the
stream-oriented ones and there is a layer on top of convenience
functions that do one-shot compression and decompression.
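
To make the layering concrete, here is a minimal sketch of one-shot helpers built purely on zlib's stream objects. The helper names compress_string and decompress_string are hypothetical; zlib.compressobj and zlib.decompressobj are the module's actual stream interfaces.

```python
import zlib

def compress_string(data: bytes, level: int = 6) -> bytes:
    """One-shot compression built on the streaming compressor (hypothetical helper)."""
    c = zlib.compressobj(level)
    # Feed everything in one call, then flush whatever the stream is still holding.
    return c.compress(data) + c.flush()

def decompress_string(data: bytes) -> bytes:
    """One-shot decompression built on the streaming decompressor (hypothetical helper)."""
    d = zlib.decompressobj()
    return d.decompress(data) + d.flush()
```

The output of compress_string should be readable by zlib.decompress, since both sit on the same stream format.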

There are a couple of things here to think about. I've wanted to
do some low-level (C-coded) search w/o bothering to create strings
until a match. I've no idea how to push this down in, but I may be
looking for a nice low-level spot to fit. Characteristics for that
could be read-only access to small expansion parts w/o copying them
out. Also, in case of a match, a (relatively quick) way to mark points
as we proceed and a (possibly slower) way to restore from one or
more marked points.

Also, another programmer wants to parallelize _large_ bzip file
expansion by expanding independent blocks in separate threads (we
know how to find safe start points). To get such code to work, we
need to find big chunks of computation, and (at least optionally)
surround them with GIL release points.
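
A rough sketch of the threading side only, under a simplifying assumption: each element of blocks is already a complete, independently compressed bz2 stream, which sidesteps the problem of locating start points inside a single file. It also assumes the bz2 module releases the GIL while inside the C library, which current CPython does as far as I know. The function name expand_blocks_parallel is hypothetical.

```python
import bz2
from concurrent.futures import ThreadPoolExecutor

def expand_blocks_parallel(blocks, max_workers=4):
    """Decompress independent bz2 streams in worker threads.

    Assumption: every item in `blocks` is a self-contained bz2 stream.
    Threads only help if the GIL is released during decompression.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map returns results in the same order as the inputs.
        return list(pool.map(bz2.decompress, blocks))
```

Joining the returned list with b"".join(...) gives back the original data in order.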

So what I suggest is a common framework of three APIs; a sequential
compression/decompression API for streams, a layer (potentially
generic) on top of those for strings/buffers, and a third API for
file-like access. Presumably the file-like access can be implemented
on top of the sequential API as well.
If we have to be able to start from arbitrary points in bzip files, they
have one nasty characteristic: they are bit-serial, and we'll need to
start them at arbitrary _bit_ points (not simply byte boundaries).
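
As for the file-like layer sitting on top of the sequential API, a minimal read-only sketch might look like the following. The class name ZlibReader and its chunk_size parameter are hypothetical; it uses zlib rather than bz2 only to keep the example short, and it ignores seeking and the bit-alignment issue entirely.

```python
import io
import zlib

class ZlibReader(io.RawIOBase):
    """Hypothetical read-only file-like wrapper over a zlib decompression object."""

    def __init__(self, raw, chunk_size=8192):
        self._raw = raw                    # underlying file-like source of compressed bytes
        self._d = zlib.decompressobj()     # the sequential API doing the real work
        self._chunk_size = chunk_size
        self._buf = b""                    # decompressed bytes not yet handed out
        self._eof = False

    def readable(self):
        return True

    def readinto(self, b):
        # Pull compressed chunks until some output is available or input runs out.
        while not self._buf and not self._eof:
            chunk = self._raw.read(self._chunk_size)
            if chunk:
                self._buf = self._d.decompress(chunk)
            else:
                self._buf = self._d.flush()
                self._eof = True
        n = min(len(b), len(self._buf))
        b[:n] = self._buf[:n]
        self._buf = self._buf[n:]
        return n                           # 0 signals end of file to the io machinery
```

Because it derives from io.RawIOBase, read() and readall() come for free, and it can be wrapped in io.BufferedReader if buffering is wanted.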

One structure I have used for searching is a result iterator fed by
a source iterator, so rather than a read w/ inconvenient boundaries
the input side of the thing calls the 'next' method of the provided
source.
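
A minimal sketch of that shape, using zlib for brevity: a generator (the result iterator) that pulls compressed chunks from whatever source iterator it is given. The name iter_decompress is hypothetical.

```python
import zlib

def iter_decompress(source):
    """Yield decompressed pieces, pulling compressed chunks from `source`.

    `source` is any iterable of bytes objects; chunk boundaries are arbitrary.
    """
    d = zlib.decompressobj()
    for chunk in source:
        out = d.decompress(chunk)
        if out:               # skip empty results when a chunk yields no output yet
            yield out
    tail = d.flush()
    if tail:
        yield tail
```

The consumer never sees the input boundaries; it just iterates over the results.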

... I would rather see a pythonic interface to the libraries than a
simple-as-can-be wrapper around the C functions....
I'm on board with you here.

My further suggestion is that we start with the sequential
de/compression, since it seems like a fundamental primitive.
De/compressing strings will be trivial, and the file-like interface is
already described by Python.
Well, to be explicit, are we talking about decompression and compression
simultaneously, or do we want to start with one of them first?

2) The de/compression object has routines for reading de/compressed
data and states such as end-of-stream or resynchronization points as
exceptions, much like the file class can throw EOFError. My problem
with this is that client code has to be cognizant of the possible
exceptions that might be thrown, and so one cannot easily add new
exceptions should the need arise. For example, if we add an exception
to indicate a possible resynchronization point, client code may not
be capable of handling it as a non-fatal exception.

Seems like we may want to say things like, "synchronization points are
to be silently ignored."
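
For comparison, here is a sketch of the alternative where conditions such as end-of-stream are reported as state the client may inspect or ignore, rather than as exceptions. It relies on the eof and unused_data attributes that zlib decompression objects expose in Python 3; the function name decompress_until_eof is hypothetical, and resynchronization points are not modelled at all.

```python
import zlib

def decompress_until_eof(chunks):
    """Feed chunks to a decompressor, checking state instead of catching exceptions.

    Returns (data, reached_eof, leftover_bytes_from_the_last_chunk_fed).
    """
    d = zlib.decompressobj()
    out = []
    for chunk in chunks:
        out.append(d.decompress(chunk))
        if d.eof:             # end of stream is a flag here, not an exception
            break
    return b"".join(out), d.eof, d.unused_data
```

A client that does not care about a given state simply never looks at it, so adding a new flag later would not break existing callers.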

--Scott David Daniels
Scott.Daniels@xxxxxxx