Issue 1625: bz2.BZ2File doesn't support multiple streams [proxy]

Created on 2007-12-14 09:20 by therve, last changed 2012-02-05 12:30 by python-dev. This issue is now closed.

Messages (56) msg58619 - (view) Author: Thomas Herve (therve) * Date: 2007-12-14 09:20

The BZ2File class only supports one stream per file. It possible to have
multiple streams concatenated in one file, it the resulting data should
be the concatenation of all the streams. It's what the bunzip2 program
produces, for example. It's also supported by the gzip module.

Once this done, this would add the ability to open a file for appending,
by adding another stream to the file.

I'll probably try to do this, but the fact it's done in C (unlike gzip)
makes it harder, so if someone beats me to it, etc.

msg59897 - (view) Author: Thomas Lee (thomaslee) Date: 2008-01-14 13:31

If you're referring to an 'append' mode for bz2file objects, it may be a
limitation of the underlying library: my version of bzlib.h only
provides BZ2_bzWriteOpen and BZ2_bzReadOpen - it's not immediately clear
how you would open a BZ2File in append mode looking at this API.

It may be possible to implement r/w/a using the lower-level
bzCompress/bzDecompress functions, but I doubt that's going to happen
unless somebody (such as yourself? :)) cares deeply about this.

msg60236 - (view) Author: A.M. Kuchling (akuchling) * Date: 2008-01-19 22:00

Like gzip, you can concatenate two bzip2 files:

bzip2 -c /etc/passwd >/tmp/pass.bz2

bzip2 -c /etc/passwd >>/tmp/pass.bz2

bunzip2 will output both parts, generating two copies of the file.

So nothing needs to be done on compression, but uncompression needs to
look for another chunk of compressed data after finishing one chunk.

msg60268 - (view) Author: Thomas Herve (therve) * Date: 2008-01-20 09:12

The gzip module supports reopening an existing file to add another
stream. I think the bz2 module should not the same.

msg93323 - (view) Author: David Bonner (dbonner) Date: 2009-09-29 19:36

I've got a patch that fixes this.  It allows BZ2File to read
multi-stream files as generated by pbzip2, allows BZ2File to open files
in append mode, and also updates bz2.decompress to allow it to handle
multi-stream chunks of data.

We originally wrote it against 2.5, but I've updated the patch to py3k
trunk, and attached it here.  If there's interest in a patch against 2.7
trunk, please let me know.

msg93326 - (view) Author: David Bonner (dbonner) Date: 2009-09-29 20:07

sorry, the previous patch was from an old version.  attaching the
correct version now.  apologies for the noise.

msg93405 - (view) Author: Antoine Pitrou (pitrou) * Date: 2009-10-01 13:42

Some notes about posting patches:
- you should post the patch alone, not in an archive
- generally you should post patches against the 2.7 trunk, we take care
of merging them to py3k ourselves (but in this case the difference
should be minimal anyway)
- I'm not sure it's ok to add legal boilerplate at the top of files, we
never do that usually (and if everyone did it would become unreadable).
Does your company require you to do so?

I'll look at the patch itself another day, I don't have the time right
now. But thanks for posting it!

msg93407 - (view) Author: David Bonner (dbonner) Date: 2009-10-01 14:21

Thanks for the reply.

My company's legal dept. told me that we needed to put the boilerplate
into the files as part of releasing it under the apache license.  I used
a tarball because they also recommended including a full copy of the
license with the patch.

I'm reattaching just the patch to the bug now.  I'll check with legal
and see if they'd have a problem with removing the boilerplate.

msg93408 - (view) Author: R. David Murray (r.david.murray) * Date: 2009-10-01 14:29

If the patch is substantial enough that legal boilerplate is even an
issue, then I'm pretty sure a contributor agreement will be required for
patch acceptance, at which point I think the boilerplate won't be
needed.  The Apache license is certainly acceptable.  I'm obviously not
the authority on this, though.  That would be van Lindburg.

msg93721 - (view) Author: David Bonner (dbonner) Date: 2009-10-07 20:36

I can remove the boilerplate from the code as long as I add the
following to the submittal:

VMware, Inc. is providing this bz2 module patch to you under the terms
of the Apache License 2.0 with the understanding that you plan to
re-license this under the terms and conditions of the Python License.
This patch is provided as is, with no warranties or support. VMware
disclaims all liability in connection with the use/inability to use this
patch. Any use of the attached is considered acceptance of the above.

msg93841 - (view) Author: Antoine Pitrou (pitrou) * Date: 2009-10-10 20:11

As far as I can tell, the patch looks mostly good.
I just wonder, in Util_HandleBZStreamEnd(), why you don't set self->mode
to MODE_CLOSED if BZ2_bzReadOpen() fails.

As a sidenote, the bz2 module implementation seems to have changed quite
a bit between trunk and py3k, so if you want it to be backported to
trunk (2.7), you'll have to provide a separate patch.

msg94316 - (view) Author: David Bonner (dbonner) Date: 2009-10-21 18:02

Hrm...yeah, I should probably be setting it to closed as soon as
BZ2_bzReadClose() returns, and then back to open once BZ2_bzReadOpen
succeeds.  Wasn't intentional...thanks for the catch.  You guys need a
new patch with that change in it?

I'll try and get a 2.7 patch done and uploaded in a day or two.

msg94318 - (view) Author: R. David Murray (r.david.murray) * Date: 2009-10-21 18:28

A new patch will make it more likely that it will actually get applied :)

Thanks for your work on this.

msg94321 - (view) Author: David Bonner (dbonner) Date: 2009-10-21 20:09

Understandable.  New patch attached.

msg94434 - (view) Author: Antoine Pitrou (pitrou) * Date: 2009-10-24 18:44

I'm not comfortable with the following change (which appears twice in
the patch):

-			BZ2_bzReadClose(&bzerror, self->fp);
+			if (self->fp)
+				BZ2_bzReadClose(&bzerror, self->fp);
 			break;
 		case MODE_WRITE:
-			BZ2_bzWriteClose(&bzerror, self->fp,
-					 0, NULL, NULL);
+			if (self->fp)
+				BZ2_bzWriteClose(&bzerror, self->fp,
+						 0, NULL, NULL);


If you need to test for the file pointer, perhaps there's a logic flaw
in your patch. Also, it might be dangerous in write mode: could it occur
that the file isn't closed but the problem isn't reported?

msg94446 - (view) Author: David Bonner (dbonner) Date: 2009-10-25 03:37

That was mostly just out of paranoia, since the comments mentioned 
multiple calls to close being legal.  Looking at it again, that particular 
case isn't an issue, since we don't hit that call when the mode is 
MODE_CLOSED.  The testsuite runs happily with those changes reverted.  
Should I upload a new patch?

msg94499 - (view) Author: Antoine Pitrou (pitrou) * Date: 2009-10-26 18:53

> That was mostly just out of paranoia, since the comments mentioned 
> multiple calls to close being legal.  Looking at it again, that particular 
> case isn't an issue, since we don't hit that call when the mode is 
> MODE_CLOSED.  The testsuite runs happily with those changes reverted.  
> Should I upload a new patch?

You don't need to, but on the other hand I forgot to ask you to update
the documentation :-) (see Doc/library/bz2.rst)

msg101521 - (view) Author: David Bonner (dbonner) Date: 2010-03-22 17:02

Picking this back up again.  There's actually no docs changes necessary...the docs never mentioned that the module didn't support multiple logical streams, and I didn't see any other mentions in the docs that seemed to need updating.  I supposed I could add something along the lines of "BZ2File supports multiple logical streams in a single compressed file", if that's what you/re looking for.

Working on a patch for trunk as well.

msg115198 - (view) Author: Oliver Deppert (Kontr-Olli) Date: 2010-08-30 00:16

Dear all,

first of all, thank you for the patch making multiple file-streams in bz2 available in python. Yesterday, I've tried to adapt the recent patch for python 3k to the actual python 2.7. Most of the hunks could be easy adapted by editing just one ore two things concerning the different file-layout between p3k and python 2.7.

Unfortunatelly it wasn't possible for me to completly downgrade this patch to python 2.7. Especially the last hunk in the patch and also the hunks which are related to self->rawfp couldn't be adapted succesfully by me.

Could anybody assist me to make this patch available for python 2.7 or does a patch for this python version already exist?

If you like, I can upload my recent changes in this patch for further investigation.

Thank you!

best regards,
Oliver Deppert

msg116964 - (view) Author: Antoine Pitrou (pitrou) * Date: 2010-09-20 19:31

Here is an update of the patch against current py3k. I've added a copyright mention at the top of Modules/bz2module.c which I hope manages to capture the essence of msg93721. Martin, what do you think?

msg118946 - (view) Author: Oliver Deppert (Kontr-Olli) Date: 2010-10-17 15:49

Thanks for the update....

Like I mentioned before in my previous comment, I'm still searching for a solution/patch for python 2.x able to handle multiple streams of bz2. 

Does anybody know a work-around or have a solution porting the p3k-patch to the good old python 2.x?!

At the moment I try to use this patch with py3k and it seems to work. But I need this multistream option for pythone 2.x because the most of the time I deal with matplotlib....and matplotlib at the moment isn't able to deal with py3k....so, I would be very happy for any suggestion running multiple-streams with python 2.x ! !

Thank you very much, and best regards
Kontr-Olli

msg132842 - (view) Author: Antoine Pitrou (pitrou) * Date: 2011-04-03 15:12

The patch here is totally out of date, following issue5863.

msg136689 - (view) Author: Nir Aides (nirai) Date: 2011-05-23 19:04

Hi, I attach a patch to Python 3.3 Lib/bz2.py with updated tests:
cpython-bz2-streams.patch

msg136690 - (view) Author: Nadeem Vawda (nadeem.vawda) * Date: 2011-05-23 19:30

Thanks for the patch. I'll review it tomorrow.

msg136723 - (view) Author: Nir Aides (nirai) Date: 2011-05-24 06:10

Wait, the tests seem wrong. I'll post an update later today.

msg136753 - (view) Author: Nadeem Vawda (nadeem.vawda) * Date: 2011-05-24 14:49

OK, I'll hold off on doing a detailed review until then.

msg136790 - (view) Author: Nir Aides (nirai) Date: 2011-05-24 20:19

False alarm; go ahead with the review. I took a look too early in the morning before caffeine kicked in.

Note Lib/test/test_bz2.py was directly upgraded from bz2ms.patch.

A note on bz2 behavior: A BZ2Decompressor object is only good for one stream; after that eof is set and it will refuse to continue to the next stream; this seems in line with bzip2 manual:
http://www.bzip.org/1.0.5/bzip2-manual-1.0.5.html#bzDecompress

msg136799 - (view) Author: Nadeem Vawda (nadeem.vawda) * Date: 2011-05-24 22:04

> False alarm; go ahead with the review. I took a look too early in the
> morning before caffeine kicked in.

No worries. I know the feeling.

The tests look fine. The bodies of testRead() and testReadMultiStream()
appear to have been swapped, though. I'm guessing testRead() was
supposed to remain unmodified, with testReadMultiStream() testing the
case of streams=5?

For the change to BZ2File._fill_buffer(), I'm not sure that the check
for end-of-file is correct. It seems that if the end of rawblock lines
up with the end of a stream, the mode will be set to EOF even if there
are more streams waiting to be read. Is this possible?

> A note on bz2 behavior: A BZ2Decompressor object is only good for one
> stream; after that eof is set and it will refuse to continue to the
> next stream; this seems in line with bzip2 manual

I think this is the right way to do things. BZ2Decompressor is a low-
level wrapper around the underlying C API -- it should not grow extra
features unnecessarily. Also, certain use-cases might depend on being
able to detect the end of a logical stream; this would not be possible
if BZ2Decompressor were to be changed.

The difference in behaviour from BZ2File and decompress() should probably
be documented, though.

msg136889 - (view) Author: Nir Aides (nirai) Date: 2011-05-25 20:06

Right! I updated the patch and added a test for the aligned stream/buffer case.

msg137018 - (view) Author: Roundup Robot (python-dev) Date: 2011-05-27 00:02

New changeset 8cebbc6473d9 by Nadeem Vawda in branch 'default':
Issue #1625: BZ2File and bz2.decompress() now support multi-stream files.
http://hg.python.org/cpython/rev/8cebbc6473d9

New changeset 0be55601f948 by Nadeem Vawda in branch 'default':
Update bz2 docs following issue #1625.
http://hg.python.org/cpython/rev/0be55601f948

msg137019 - (view) Author: Nadeem Vawda (nadeem.vawda) * Date: 2011-05-27 00:07

Committed. Once again, thanks for the patch!

msg137074 - (view) Author: Éric Araujo (eric.araujo) * Date: 2011-05-27 16:15

I made a few comments and asked two questions on the review page.  (I should have said so here.)

msg137098 - (view) Author: Nadeem Vawda (nadeem.vawda) * Date: 2011-05-27 21:37

I seem to be unable to log in to rietveld, so I'll reply here.

>> result += decomp.decompress(data)
> Is this efficient?  I understood that other Python implementations
> had poorly performing str.__iadd__, and therefore that using a list
> was the common idiom (using “return b''.join(result)” at the end).

Good point. I hadn't thought about other implementations.

Also, you're right about the superfluous comments in test_bz2; I'll do a
general cleanup of the test code soon.

> Looks good.  I only have one paranoid comment: since the tests use
> self.TEXT * 5 to create multiple streams, the ordering of the files is
> not tested.  IOW, the code could mix-up the order of the files and the
> tests would not catch that.  Is it a concern?

I wouldn't think so. It's not as though there is an index that the code
looks at to find the location of each stream. It just reads the data from
the file and if it reaches the end of one stream, it assumes that the
following data is a new stream, and decompresses it accordingly.

That said, I wouldn't be opposed to adding a test for that sort of thing
(just for paranoia's sake :P) if it doesn't involve adding large amounts
of additional binary data in the file. I'll come back to it once I've
tidied up the existing code.

msg137197 - (view) Author: Éric Araujo (eric.araujo) * Date: 2011-05-29 16:41

If you’re logged into Roundup, you should automatically be logged into our Rietveld instance.  You can file a bug on the meta-tracker (link in the left sidebar) if this does not work.

msg137228 - (view) Author: Roundup Robot (python-dev) Date: 2011-05-29 23:17

New changeset 48e837b2a327 by Nadeem Vawda in branch 'default':
Miscellaneous cleanups to bz2 and test_bz2 following issue #1625.
http://hg.python.org/cpython/rev/48e837b2a327

msg137229 - (view) Author: Nadeem Vawda (nadeem.vawda) * Date: 2011-05-29 23:19

> If you’re logged into Roundup, you should automatically be logged into
> our Rietveld instance.

I thought this was the case, but it isn't working for me. I've filed a
bug on the meta-tracker.

msg137230 - (view) Author: Roundup Robot (python-dev) Date: 2011-05-30 00:01

New changeset 3e5200abf8eb by Nadeem Vawda in branch 'default':
Issue #1625: Add stream ordering test to test_bz2.
http://hg.python.org/cpython/rev/3e5200abf8eb

msg144404 - (view) Author: ozan caglayan (ozan.caglayan) Date: 2011-09-22 07:44

I ported the bz2ms.patch to Python 2.7.2 and it works correctly within the bz2 module.

But when you open a multistream (tar)bz2 with the tarfile module, even the tarfile uses the BZ2File() class, there exists unextracted missing files. I'll now try with the current tip.

msg144406 - (view) Author: ozan caglayan (ozan.caglayan) Date: 2011-09-22 08:17

With the current tip everything works correctly. I think it's because of the complete rewrite of the bz2 module with python and the refactoring of _bz2.so.

msg144443 - (view) Author: ozan caglayan (ozan.caglayan) Date: 2011-09-23 13:06

Attached patch is a revised version of bz2ms.patch against Python 2.7.2. The patch is tested using tarfile and bz2 modules. It also passes the included tests correctly.

It also imports a missing class from BytesIO to fix the tests.

It's up to you to take that into 2.7.x branch or not.

msg144445 - (view) Author: Éric Araujo (eric.araujo) * Date: 2011-09-23 14:28

We don’t add news features in stable releases.  Nadeem has closed this bug as fixed for 3.3 and it can’t go in 2.7, so I think we’re done here.

msg144479 - (view) Author: Nadeem Vawda (nadeem.vawda) * Date: 2011-09-23 22:33

Ozan: Thanks for taking the time to backport the patch. Unfortunately, as
Éric said, 2.7 is in maintenance mode, so it no longer receives new features.

msg144551 - (view) Author: Crispin Wellington (Crispin.Wellington) Date: 2011-09-27 06:35

This is all fine and well, but this is clearly a bug and not a feature. Can we please see this bug fix go into 2.7 at some point?

msg144552 - (view) Author: Gökcen Eraslan (gkcn) * Date: 2011-09-27 12:04

+1. If we think this as a bug, python 2.x users will never be able to extract multiple-stream bz2 files.

msg144553 - (view) Author: Gökcen Eraslan (gkcn) * Date: 2011-09-27 12:05

I mean "as a feature".

msg144782 - (view) Author: Nadeem Vawda (nadeem.vawda) * Date: 2011-10-02 21:46

> This is all fine and well, but this is clearly a bug and not a feature.

No, it is not at all clear that this is a bug. I agree that this is a desirable
capability to have, but nowhere does the module claim to support multi-stream files.
Nor is it an inherent feature of the underlying bzip2 library that we are failing
to expose to users.

> [...] python 2.x users will never be able to extract multiple-stream bz2 files.

Incorrect. It is perfectly possible to extract a multi-stream bz2 file in 2.x -
you just need to open it with open() and decompress the data using
BZ2Decompressor (pretty much the same way that 3.3's BZ2File does it).

If there is really a large demand for these facilities in 2.x, I would be willing
to port 3.3's BZ2File implementation to 2.x and make it available on PyPI. But 
this patch is not going in to the 2.7 stdlib; it is simply not the sort of
behavior change that is acceptable in a bugfix release.

msg152144 - (view) Author: Jesús Cea Avión (jcea) * Date: 2012-01-28 00:49

I think this support should be backported to Python 2.7 and 3.2. Current code can't decompress files generated by "pbzip2", fairly popular. I would consider that a bug, not a feature request.

I am just recompressing a 77GB file because of this :-(.

msg152222 - (view) Author: Nadeem Vawda (nadeem.vawda) * Date: 2012-01-29 11:50

> I am just recompressing a 77GB file because of this :-(.

Sorry to hear that :(

> I would consider that a bug, not a feature request.

Semantic issues aside, my concern here is that the patch for 2.7 is considerably
larger than the one for 3.3, and the code it modifies is more fragile.

An alternative solution I'd like to pursue is to backport 3.3's BZ2File
implementation to run on 2.7, and release it on PyPI. That way there's no risk
of introducing regressions for people who don't need this facility, and those
who do can get it without needing to upgrade to 2.7.3 (or even wait for it to be
released). It shouldn't take much effort to get it working on 2.6 as well.

How does that sound?

msg152257 - (view) Author: Nadeem Vawda (nadeem.vawda) * Date: 2012-01-29 20:40

> An alternative solution I'd like to pursue is to backport 3.3's BZ2File
> implementation to run on 2.7, and release it on PyPI.

Well, that was easier than I expected. It didn't take much work to get it
working under 2.6, 2.7 and 3.2. I've put up this "bz2file" module on
GitHub <https://github.com/nvawda/bz2file>. I'll package it up and upload
it to PyPI sometime soon.

msg152327 - (view) Author: Éric Araujo (eric.araujo) * Date: 2012-01-30 15:58

> I think this support should be backported to Python 2.7 and 3.2.
I think our policy is pretty clear: the module docs did not say multiple streams were supported, so when support for them was added it was clearly a new feature.

> Current code can't decompress files generated by "pbzip2", fairly popular.
That is unfortunate, but I don’t think that the features of one tool, even popular, should make us break the policy.  That’s why I concur with Nadeem and think that a PyPI backport is the only way to provide multi-streams bz2 to previous versions.

Best regards

msg152393 - (view) Author: Jesús Cea Avión (jcea) * Date: 2012-01-31 16:28

Éric, bz2 module in Python is documented as able to manage bz2 files. But that is not true, since it is unable to manage popular bz2 files. That looks like a bug to me. In fact, you can create those files from python, files that python can not unpack later.

Moreover, Python 2.7 is going to live for a long time.

If the refusal of backporting this to 3.2 and 2.7 is firm, I would beg to document this limitation in the 2.7/3.2 docs. It is serious enough.

Please, reconsider backporting this to 2.7, at least.

msg152513 - (view) Author: Éric Araujo (eric.araujo) * Date: 2012-02-03 14:12

As the bug/feature judgment is not easy to make, I think python-dev should be asked.

msg152551 - (view) Author: Nadeem Vawda (nadeem.vawda) * Date: 2012-02-03 19:48

> In fact, you can create those files from python, files that python can not unpack later.

Really? How so? BZ2File only started accepting the "a" mode in 3.3
(thanks to Nir's patch for this issue, actually).

> If the refusal of backporting this to 3.2 and 2.7 is firm, I would beg to document this limitation in the 2.7/3.2 docs. It is serious enough.

Of course. I'll add a note to the docs once I've created the bz2file
package on PyPI, so we can refer readers to it.


> As the bug/feature judgment is not easy to make, I think python-dev should be asked.

Fair enough; I was actually going to suggest consulting the 2.7 release
manager, but I suppose getting more opinions on the mailing list makes
sense.

msg152645 - (view) Author: Roundup Robot (python-dev) Date: 2012-02-04 21:46

New changeset e73d549b7458 by Nadeem Vawda in branch '2.7':
Issue #1625: Document BZ2File's lack of support for multi-stream inputs.
http://hg.python.org/cpython/rev/e73d549b7458

New changeset 190826ee0450 by Nadeem Vawda in branch '3.2':
Issue #1625: Document BZ2File's lack of support for multi-stream inputs.
http://hg.python.org/cpython/rev/190826ee0450

msg152674 - (view) Author: Éric Araujo (eric.araujo) * Date: 2012-02-05 10:19

Just a thought: maybe the doc note should mention that bz2file is a backport of 3.3’s improved class, so that people know that 1) it’s well-supported code 2) a future Python version will remove the need for the external dependency.

msg152681 - (view) Author: Roundup Robot (python-dev) Date: 2012-02-05 12:30

New changeset ad20324229f4 by Nadeem Vawda in branch '2.7':
Clarify note in BZ2File docs about lack of multi-stream support (issue #1625).
http://hg.python.org/cpython/rev/ad20324229f4

New changeset e4c4595033ad by Nadeem Vawda in branch '3.2':
Clarify note in BZ2File docs about lack of multi-stream support (issue #1625).
http://hg.python.org/cpython/rev/e4c4595033ad