Re: zlib interface semi-broken



Travis wrote:
On Tue, Feb 10, 2009 at 01:36:21PM -0800, Scott David Daniels wrote:
.... I personally would like it and bz2 to get closer to each other...

Well, I like this idea; perhaps this is a good time to discuss the
equivalent of some "abstract base classes", or "interfaces", for
compression.

As I see it, the fundamental abstractions are the stream-oriented
de/compression routines. Given those, one should easily be able to
implement one-shot de/compression of strings. In fact, that is the
way that zlib is implemented; the base functions are the
stream-oriented ones and there is a layer on top of convenience
functions that do one-shot compression and decompression.
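
To make the layering concrete, here is a minimal sketch of one-shot helpers built purely from the zlib streaming objects. The helper names compress_oneshot and decompress_oneshot are made up for illustration; only zlib.compressobj, zlib.decompressobj and their compress/decompress/flush methods come from the module itself.

```python
import zlib

def compress_oneshot(data, level=6):
    """One-shot compression written only in terms of the streaming object."""
    c = zlib.compressobj(level)
    # Feed everything, then flush whatever the compressor is still holding.
    return c.compress(data) + c.flush()

def decompress_oneshot(blob):
    """One-shot decompression written only in terms of the streaming object."""
    d = zlib.decompressobj()
    return d.decompress(blob) + d.flush()

payload = b"spam and eggs " * 1000
packed = compress_oneshot(payload)
restored = decompress_oneshot(packed)
```

The output of the hand-built helper is an ordinary zlib stream, so zlib.decompress(packed) should accept it as well.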

There are a couple of things here to think about. I've wanted to
do some low-level (C-coded) search w/o bothering to create strings
until a match. I've no idea how to push this down in, but I may be
looking for a nice low-level spot to fit. Characteristics for that
could be read-only access to small expansion parts w/o copying them
out. Also, in case of a match, a (relatively quick) way to mark points
as we proceed and a (possibly slower) way to restore from one or
more marked points.

Also, another programmer wants to parallelize _large_ bzip file
expansion by expanding independent blocks in separate threads (we
know how to find safe start points). To get such code to work, we
need to find big chunks of computation, and (at least optionally)
surround them with GIL release points.
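
A rough sketch of the threading shape, under a big simplifying assumption: each piece below is a complete, independently compressed bz2 stream, not a bit-aligned block cut out of one large file, so the "find a safe start point" problem is sidestepped entirely. The function names are hypothetical. As far as I know, current CPython releases the GIL around the libbz2 calls, which is what would let the threads overlap.

```python
import bz2
from concurrent.futures import ThreadPoolExecutor

def compress_in_pieces(data, piece_size):
    """Stand-in for independent blocks: each slice becomes its own bz2 stream."""
    return [bz2.compress(data[i:i + piece_size])
            for i in range(0, len(data), piece_size)]

def expand_in_threads(pieces, workers=4):
    """Expand every piece in a worker thread and join the results in order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map yields results in input order, whichever thread finishes first.
        return b"".join(pool.map(bz2.decompress, pieces))

original = bytes(range(256)) * 4000
pieces = compress_in_pieces(original, 100000)
expanded = expand_in_threads(pieces)
```

Each bz2.decompress call here is one big chunk of computation, which is the granularity that makes releasing the GIL worthwhile.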

So what I suggest is a common framework of three APIs; a sequential
compression/decompression API for streams, a layer (potentially
generic) on top of those for strings/buffers, and a third API for
file-like access. Presumably the file-like access can be implemented
on top of the sequential API as well.
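
One possible shape for the first two layers, as a sketch only. SequentialCompressor, ZlibCompressor, Bz2Compressor and compress_buffer are hypothetical names, not anything that exists in the standard library; the wrapped zlib.compressobj and bz2.BZ2Compressor objects are real and already share compress/flush.

```python
import bz2
import zlib
from abc import ABC, abstractmethod

class SequentialCompressor(ABC):
    """Hypothetical common interface for stream-oriented compression."""

    @abstractmethod
    def compress(self, data):
        """Accept more input; return any compressed bytes ready so far."""

    @abstractmethod
    def flush(self):
        """Finish the stream; return the remaining compressed bytes."""

class ZlibCompressor(SequentialCompressor):
    def __init__(self):
        self._c = zlib.compressobj()
    def compress(self, data):
        return self._c.compress(data)
    def flush(self):
        return self._c.flush()

class Bz2Compressor(SequentialCompressor):
    def __init__(self):
        self._c = bz2.BZ2Compressor()
    def compress(self, data):
        return self._c.compress(data)
    def flush(self):
        return self._c.flush()

def compress_buffer(compressor, data, chunk_size=4096):
    """Generic string/buffer layer: drives any SequentialCompressor."""
    out = [compressor.compress(data[i:i + chunk_size])
           for i in range(0, len(data), chunk_size)]
    out.append(compressor.flush())
    return b"".join(out)

text = b"a common framework for compression " * 500
via_zlib = compress_buffer(ZlibCompressor(), text)
via_bz2 = compress_buffer(Bz2Compressor(), text)
```

The buffer layer knows nothing about either library, which is the sense in which it could be generic. A file-like layer could presumably be written against the same two methods.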
If we have to be able to start from arbitrary points in bzip files, they
have one nasty characteristic: they are bit-serial, and we'll need to
start them at arbitrary _bit_ points (not simply byte boundaries).

One structure I have used for searching is a result iterator fed by
a source iterator, so rather than a read w/ inconvenient boundaries
the input side of the thing calls the 'next' method of the provided
source.
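
Roughly what I mean, sketched with zlib since its streaming object is simple to drive. The names expand and chunks are made up for the example; the consumer just iterates over results, and the decompressor pulls from the source as it needs more input.

```python
import zlib

def expand(source):
    """Result iterator fed by a source iterator of compressed chunks."""
    d = zlib.decompressobj()
    for chunk in source:          # implicitly calls next() on the source
        out = d.decompress(chunk)
        if out:                   # skip chunks that produced no output yet
            yield out
    tail = d.flush()
    if tail:
        yield tail

def chunks(blob, size):
    """Example source: hand out the blob in awkwardly small pieces."""
    for i in range(0, len(blob), size):
        yield blob[i:i + size]

message = b"iterators all the way down " * 200
result = b"".join(expand(chunks(zlib.compress(message), 7)))
```

The source decides the input boundaries and the consumer never sees them, which is the point of the arrangement.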

... I would rather see a pythonic interface to the libraries than a
simple-as-can-be wrapper around the C functions....
I'm on board with you here.

My further suggestion is that we start with the sequential
de/compression, since it seems like a fundamental primitive.
De/compressing strings will be trivial, and the file-like interface is
already described by Python.
Well, to be explicit, are we talking about Decompression and Compression
simultaneously, or do we want to start with one of them first?

2) The de/compression object has routines for reading de/compressed
data and states such as end-of-stream or resynchronization points as
exceptions, much like the file class can throw EOFError. My problem
with this is that client code has to be cognizant of the possible
exceptions that might be thrown, and so one cannot easily add new
exceptions should the need arise. For example, if we add an exception
to indicate a possible resynchronization point, client code may not
be capable of handling it as a non-fatal exception.

Seems like we may want to say things like, "synchronization points are
to be silently ignored."
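
For comparison, here is a sketch of the non-exception style using what today's zlib decompression object already exposes: end-of-stream shows up as the eof attribute, and bytes past the end land in unused_data. This covers end-of-stream only; zlib has no resynchronization-point notion that I know of, and read_one_stream is a made-up helper name.

```python
import zlib

def read_one_stream(blob):
    """Decompress one stream and report its state as values, not exceptions."""
    d = zlib.decompressobj()
    data = d.decompress(blob)
    return data, d.eof, d.unused_data

stream = zlib.compress(b"first stream")

# Complete stream followed by unrelated bytes: eof is set, the extra is kept.
data, at_end, leftover = read_one_stream(stream + b"TRAILING")

# Truncated stream: nothing is raised, eof simply stays False.
partial, partial_at_end, partial_leftover = read_one_stream(stream[:-3])
```

Client code that never looks at eof or unused_data keeps working, which is the property that is hard to get when new conditions arrive as new exceptions.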

--Scott David Daniels
Scott.Daniels@xxxxxxx