Created on 2007-12-14 09:20 by therve, last changed 2012-02-05 12:30 by python-dev. This issue is now closed.
The BZ2File class only supports one stream per file. It possible to have multiple streams concatenated in one file, it the resulting data should be the concatenation of all the streams. It's what the bunzip2 program produces, for example. It's also supported by the gzip module. Once this done, this would add the ability to open a file for appending, by adding another stream to the file. I'll probably try to do this, but the fact it's done in C (unlike gzip) makes it harder, so if someone beats me to it, etc.
If you're referring to an 'append' mode for bz2file objects, it may be a limitation of the underlying library: my version of bzlib.h only provides BZ2_bzWriteOpen and BZ2_bzReadOpen - it's not immediately clear how you would open a BZ2File in append mode looking at this API. It may be possible to implement r/w/a using the lower-level bzCompress/bzDecompress functions, but I doubt that's going to happen unless somebody (such as yourself? :)) cares deeply about this.
Like gzip, you can concatenate two bzip2 files: bzip2 -c /etc/passwd >/tmp/pass.bz2 bzip2 -c /etc/passwd >>/tmp/pass.bz2 bunzip2 will output both parts, generating two copies of the file. So nothing needs to be done on compression, but uncompression needs to look for another chunk of compressed data after finishing one chunk.
The gzip module supports reopening an existing file to add another stream. I think the bz2 module should not the same.
I've got a patch that fixes this. It allows BZ2File to read multi-stream files as generated by pbzip2, allows BZ2File to open files in append mode, and also updates bz2.decompress to allow it to handle multi-stream chunks of data. We originally wrote it against 2.5, but I've updated the patch to py3k trunk, and attached it here. If there's interest in a patch against 2.7 trunk, please let me know.
sorry, the previous patch was from an old version. attaching the correct version now. apologies for the noise.
Some notes about posting patches: - you should post the patch alone, not in an archive - generally you should post patches against the 2.7 trunk, we take care of merging them to py3k ourselves (but in this case the difference should be minimal anyway) - I'm not sure it's ok to add legal boilerplate at the top of files, we never do that usually (and if everyone did it would become unreadable). Does your company require you to do so? I'll look at the patch itself another day, I don't have the time right now. But thanks for posting it!
Thanks for the reply. My company's legal dept. told me that we needed to put the boilerplate into the files as part of releasing it under the apache license. I used a tarball because they also recommended including a full copy of the license with the patch. I'm reattaching just the patch to the bug now. I'll check with legal and see if they'd have a problem with removing the boilerplate.
If the patch is substantial enough that legal boilerplate is even an issue, then I'm pretty sure a contributor agreement will be required for patch acceptance, at which point I think the boilerplate won't be needed. The Apache license is certainly acceptable. I'm obviously not the authority on this, though. That would be van Lindburg.
I can remove the boilerplate from the code as long as I add the following to the submittal: VMware, Inc. is providing this bz2 module patch to you under the terms of the Apache License 2.0 with the understanding that you plan to re-license this under the terms and conditions of the Python License. This patch is provided as is, with no warranties or support. VMware disclaims all liability in connection with the use/inability to use this patch. Any use of the attached is considered acceptance of the above.
As far as I can tell, the patch looks mostly good. I just wonder, in Util_HandleBZStreamEnd(), why you don't set self->mode to MODE_CLOSED if BZ2_bzReadOpen() fails. As a sidenote, the bz2 module implementation seems to have changed quite a bit between trunk and py3k, so if you want it to be backported to trunk (2.7), you'll have to provide a separate patch.
Hrm...yeah, I should probably be setting it to closed as soon as BZ2_bzReadClose() returns, and then back to open once BZ2_bzReadOpen succeeds. Wasn't intentional...thanks for the catch. You guys need a new patch with that change in it? I'll try and get a 2.7 patch done and uploaded in a day or two.
A new patch will make it more likely that it will actually get applied :) Thanks for your work on this.
Understandable. New patch attached.
I'm not comfortable with the following change (which appears twice in the patch): - BZ2_bzReadClose(&bzerror, self->fp); + if (self->fp) + BZ2_bzReadClose(&bzerror, self->fp); break; case MODE_WRITE: - BZ2_bzWriteClose(&bzerror, self->fp, - 0, NULL, NULL); + if (self->fp) + BZ2_bzWriteClose(&bzerror, self->fp, + 0, NULL, NULL); If you need to test for the file pointer, perhaps there's a logic flaw in your patch. Also, it might be dangerous in write mode: could it occur that the file isn't closed but the problem isn't reported?
That was mostly just out of paranoia, since the comments mentioned multiple calls to close being legal. Looking at it again, that particular case isn't an issue, since we don't hit that call when the mode is MODE_CLOSED. The testsuite runs happily with those changes reverted. Should I upload a new patch?
> That was mostly just out of paranoia, since the comments mentioned > multiple calls to close being legal. Looking at it again, that particular > case isn't an issue, since we don't hit that call when the mode is > MODE_CLOSED. The testsuite runs happily with those changes reverted. > Should I upload a new patch? You don't need to, but on the other hand I forgot to ask you to update the documentation :-) (see Doc/library/bz2.rst)
Picking this back up again. There's actually no docs changes necessary...the docs never mentioned that the module didn't support multiple logical streams, and I didn't see any other mentions in the docs that seemed to need updating. I supposed I could add something along the lines of "BZ2File supports multiple logical streams in a single compressed file", if that's what you/re looking for. Working on a patch for trunk as well.
Dear all, first of all, thank you for the patch making multiple file-streams in bz2 available in python. Yesterday, I've tried to adapt the recent patch for python 3k to the actual python 2.7. Most of the hunks could be easy adapted by editing just one ore two things concerning the different file-layout between p3k and python 2.7. Unfortunatelly it wasn't possible for me to completly downgrade this patch to python 2.7. Especially the last hunk in the patch and also the hunks which are related to self->rawfp couldn't be adapted succesfully by me. Could anybody assist me to make this patch available for python 2.7 or does a patch for this python version already exist? If you like, I can upload my recent changes in this patch for further investigation. Thank you! best regards, Oliver Deppert
Here is an update of the patch against current py3k. I've added a copyright mention at the top of Modules/bz2module.c which I hope manages to capture the essence of msg93721. Martin, what do you think?
Thanks for the update.... Like I mentioned before in my previous comment, I'm still searching for a solution/patch for python 2.x able to handle multiple streams of bz2. Does anybody know a work-around or have a solution porting the p3k-patch to the good old python 2.x?! At the moment I try to use this patch with py3k and it seems to work. But I need this multistream option for pythone 2.x because the most of the time I deal with matplotlib....and matplotlib at the moment isn't able to deal with py3k....so, I would be very happy for any suggestion running multiple-streams with python 2.x ! ! Thank you very much, and best regards Kontr-Olli
The patch here is totally out of date, following issue5863.
Hi, I attach a patch to Python 3.3 Lib/bz2.py with updated tests: cpython-bz2-streams.patch
Thanks for the patch. I'll review it tomorrow.
Wait, the tests seem wrong. I'll post an update later today.
OK, I'll hold off on doing a detailed review until then.
False alarm; go ahead with the review. I took a look too early in the morning before caffeine kicked in. Note Lib/test/test_bz2.py was directly upgraded from bz2ms.patch. A note on bz2 behavior: A BZ2Decompressor object is only good for one stream; after that eof is set and it will refuse to continue to the next stream; this seems in line with bzip2 manual: http://www.bzip.org/1.0.5/bzip2-manual-1.0.5.html#bzDecompress
> False alarm; go ahead with the review. I took a look too early in the > morning before caffeine kicked in. No worries. I know the feeling. The tests look fine. The bodies of testRead() and testReadMultiStream() appear to have been swapped, though. I'm guessing testRead() was supposed to remain unmodified, with testReadMultiStream() testing the case of streams=5? For the change to BZ2File._fill_buffer(), I'm not sure that the check for end-of-file is correct. It seems that if the end of rawblock lines up with the end of a stream, the mode will be set to EOF even if there are more streams waiting to be read. Is this possible? > A note on bz2 behavior: A BZ2Decompressor object is only good for one > stream; after that eof is set and it will refuse to continue to the > next stream; this seems in line with bzip2 manual I think this is the right way to do things. BZ2Decompressor is a low- level wrapper around the underlying C API -- it should not grow extra features unnecessarily. Also, certain use-cases might depend on being able to detect the end of a logical stream; this would not be possible if BZ2Decompressor were to be changed. The difference in behaviour from BZ2File and decompress() should probably be documented, though.
Right! I updated the patch and added a test for the aligned stream/buffer case.
New changeset 8cebbc6473d9 by Nadeem Vawda in branch 'default': Issue #1625: BZ2File and bz2.decompress() now support multi-stream files. http://hg.python.org/cpython/rev/8cebbc6473d9 New changeset 0be55601f948 by Nadeem Vawda in branch 'default': Update bz2 docs following issue #1625. http://hg.python.org/cpython/rev/0be55601f948
Committed. Once again, thanks for the patch!
I made a few comments and asked two questions on the review page. (I should have said so here.)
I seem to be unable to log in to rietveld, so I'll reply here. >> result += decomp.decompress(data) > Is this efficient? I understood that other Python implementations > had poorly performing str.__iadd__, and therefore that using a list > was the common idiom (using “return b''.join(result)” at the end). Good point. I hadn't thought about other implementations. Also, you're right about the superfluous comments in test_bz2; I'll do a general cleanup of the test code soon. > Looks good. I only have one paranoid comment: since the tests use > self.TEXT * 5 to create multiple streams, the ordering of the files is > not tested. IOW, the code could mix-up the order of the files and the > tests would not catch that. Is it a concern? I wouldn't think so. It's not as though there is an index that the code looks at to find the location of each stream. It just reads the data from the file and if it reaches the end of one stream, it assumes that the following data is a new stream, and decompresses it accordingly. That said, I wouldn't be opposed to adding a test for that sort of thing (just for paranoia's sake :P) if it doesn't involve adding large amounts of additional binary data in the file. I'll come back to it once I've tidied up the existing code.
If you’re logged into Roundup, you should automatically be logged into our Rietveld instance. You can file a bug on the meta-tracker (link in the left sidebar) if this does not work.
New changeset 48e837b2a327 by Nadeem Vawda in branch 'default': Miscellaneous cleanups to bz2 and test_bz2 following issue #1625. http://hg.python.org/cpython/rev/48e837b2a327
> If you’re logged into Roundup, you should automatically be logged into > our Rietveld instance. I thought this was the case, but it isn't working for me. I've filed a bug on the meta-tracker.
New changeset 3e5200abf8eb by Nadeem Vawda in branch 'default': Issue #1625: Add stream ordering test to test_bz2. http://hg.python.org/cpython/rev/3e5200abf8eb
I ported the bz2ms.patch to Python 2.7.2 and it works correctly within the bz2 module. But when you open a multistream (tar)bz2 with the tarfile module, even the tarfile uses the BZ2File() class, there exists unextracted missing files. I'll now try with the current tip.
With the current tip everything works correctly. I think it's because of the complete rewrite of the bz2 module with python and the refactoring of _bz2.so.
Attached patch is a revised version of bz2ms.patch against Python 2.7.2. The patch is tested using tarfile and bz2 modules. It also passes the included tests correctly. It also imports a missing class from BytesIO to fix the tests. It's up to you to take that into 2.7.x branch or not.
We don’t add news features in stable releases. Nadeem has closed this bug as fixed for 3.3 and it can’t go in 2.7, so I think we’re done here.
Ozan: Thanks for taking the time to backport the patch. Unfortunately, as Éric said, 2.7 is in maintenance mode, so it no longer receives new features.
This is all fine and well, but this is clearly a bug and not a feature. Can we please see this bug fix go into 2.7 at some point?
+1. If we think this as a bug, python 2.x users will never be able to extract multiple-stream bz2 files.
I mean "as a feature".
> This is all fine and well, but this is clearly a bug and not a feature. No, it is not at all clear that this is a bug. I agree that this is a desirable capability to have, but nowhere does the module claim to support multi-stream files. Nor is it an inherent feature of the underlying bzip2 library that we are failing to expose to users. > [...] python 2.x users will never be able to extract multiple-stream bz2 files. Incorrect. It is perfectly possible to extract a multi-stream bz2 file in 2.x - you just need to open it with open() and decompress the data using BZ2Decompressor (pretty much the same way that 3.3's BZ2File does it). If there is really a large demand for these facilities in 2.x, I would be willing to port 3.3's BZ2File implementation to 2.x and make it available on PyPI. But this patch is not going in to the 2.7 stdlib; it is simply not the sort of behavior change that is acceptable in a bugfix release.
I think this support should be backported to Python 2.7 and 3.2. Current code can't decompress files generated by "pbzip2", fairly popular. I would consider that a bug, not a feature request. I am just recompressing a 77GB file because of this :-(.
> I am just recompressing a 77GB file because of this :-(. Sorry to hear that :( > I would consider that a bug, not a feature request. Semantic issues aside, my concern here is that the patch for 2.7 is considerably larger than the one for 3.3, and the code it modifies is more fragile. An alternative solution I'd like to pursue is to backport 3.3's BZ2File implementation to run on 2.7, and release it on PyPI. That way there's no risk of introducing regressions for people who don't need this facility, and those who do can get it without needing to upgrade to 2.7.3 (or even wait for it to be released). It shouldn't take much effort to get it working on 2.6 as well. How does that sound?
> An alternative solution I'd like to pursue is to backport 3.3's BZ2File > implementation to run on 2.7, and release it on PyPI. Well, that was easier than I expected. It didn't take much work to get it working under 2.6, 2.7 and 3.2. I've put up this "bz2file" module on GitHub <https://github.com/nvawda/bz2file>. I'll package it up and upload it to PyPI sometime soon.
> I think this support should be backported to Python 2.7 and 3.2. I think our policy is pretty clear: the module docs did not say multiple streams were supported, so when support for them was added it was clearly a new feature. > Current code can't decompress files generated by "pbzip2", fairly popular. That is unfortunate, but I don’t think that the features of one tool, even popular, should make us break the policy. That’s why I concur with Nadeem and think that a PyPI backport is the only way to provide multi-streams bz2 to previous versions. Best regards
Éric, bz2 module in Python is documented as able to manage bz2 files. But that is not true, since it is unable to manage popular bz2 files. That looks like a bug to me. In fact, you can create those files from python, files that python can not unpack later. Moreover, Python 2.7 is going to live for a long time. If the refusal of backporting this to 3.2 and 2.7 is firm, I would beg to document this limitation in the 2.7/3.2 docs. It is serious enough. Please, reconsider backporting this to 2.7, at least.
As the bug/feature judgment is not easy to make, I think python-dev should be asked.
> In fact, you can create those files from python, files that python can not unpack later. Really? How so? BZ2File only started accepting the "a" mode in 3.3 (thanks to Nir's patch for this issue, actually). > If the refusal of backporting this to 3.2 and 2.7 is firm, I would beg to document this limitation in the 2.7/3.2 docs. It is serious enough. Of course. I'll add a note to the docs once I've created the bz2file package on PyPI, so we can refer readers to it. > As the bug/feature judgment is not easy to make, I think python-dev should be asked. Fair enough; I was actually going to suggest consulting the 2.7 release manager, but I suppose getting more opinions on the mailing list makes sense.
New changeset e73d549b7458 by Nadeem Vawda in branch '2.7': Issue #1625: Document BZ2File's lack of support for multi-stream inputs. http://hg.python.org/cpython/rev/e73d549b7458 New changeset 190826ee0450 by Nadeem Vawda in branch '3.2': Issue #1625: Document BZ2File's lack of support for multi-stream inputs. http://hg.python.org/cpython/rev/190826ee0450
Just a thought: maybe the doc note should mention that bz2file is a backport of 3.3’s improved class, so that people know that 1) it’s well-supported code 2) a future Python version will remove the need for the external dependency.
New changeset ad20324229f4 by Nadeem Vawda in branch '2.7': Clarify note in BZ2File docs about lack of multi-stream support (issue #1625). http://hg.python.org/cpython/rev/ad20324229f4 New changeset e4c4595033ad by Nadeem Vawda in branch '3.2': Clarify note in BZ2File docs about lack of multi-stream support (issue #1625). http://hg.python.org/cpython/rev/e4c4595033ad