Issue34010
Created on 2018-06-30 09:27 by hajoscher, last changed 2022-04-11 14:59 by admin. This issue is now closed.
| Pull Requests | |||
|---|---|---|---|
| URL | Status | Linked | Edit |
| PR 8020 | merged | python-dev, 2018-06-30 09:36 | |
| PR 8082 | merged | miss-islington, 2018-07-04 08:14 | |
| PR 8083 | merged | miss-islington, 2018-07-04 08:15 | |
| Messages (10) | |||
|---|---|---|---|
| msg320763 - (view) | Author: hajoscher (hajoscher) * | Date: 2018-06-30 09:27 | |
Buffer read of large files in a compressed tarfile stream performs poorly. The buffered read in tarfile _Stream is extending a bytes object. It is much more efficient to use a list followed by a join. Using a list can mean seconds instead of minutes. This performance regression was introduced in b506dc32c1a. How to test: # create random tarfile 50Mb dd if=/dev/urandom of=test.bin count=50 bs=1M tar czvf test.tgz test.bin # read with tarfile as stream (note pipe symbol in 'r|gz') import tarfile tfile = tarfile.open("test.tgz", 'r|gz') for t in tfile: file = tfile.extractfile(t) if file: print(len(file.read())) |
|||
| msg320848 - (view) | Author: Inada Naoki (methane) * | Date: 2018-07-01 23:30 | |
Nice catch. I confirmed this is a hard regression of performance. Decompressing a file must be O(n) when n=filesize, but O(n^2) now. While we live with this regression for a long time, I feel it's worth enough to backport. This can be DoS vulnerability. Can you write NEWS entry for it? |
|||
| msg321011 - (view) | Author: hajoscher (hajoscher) * | Date: 2018-07-04 05:22 | |
Yes, it performance is really bad for large files, and memory consumption as well. I will write something for NEWS. |
|||
| msg321018 - (view) | Author: Inada Naoki (methane) * | Date: 2018-07-04 08:13 | |
New changeset 12a08c47601cadea8e7d3808502cdbcca87b2ce2 by INADA Naoki (hajoscher) in branch 'master': bpo-34010: Fix tarfile read performance regression (GH-8020) https://github.com/python/cpython/commit/12a08c47601cadea8e7d3808502cdbcca87b2ce2 |
|||
| msg321020 - (view) | Author: Inada Naoki (methane) * | Date: 2018-07-04 08:25 | |
thanks |
|||
| msg321021 - (view) | Author: STINNER Victor (vstinner) * | Date: 2018-07-04 08:28 | |
https://github.com/python/cpython/pull/8020/files/77a54a39aace1a38794884218abe801b85b54e62#diff-ef64d8b610dda67977a63a9837f46349 - buf = "".join(t) + buf = b"".join(t) @hajoscher: "It never caused a problem, since this line is never called; size is never None in the function call. But still, should be fixed, I guess." Would it be possible to have an unit test for this modified line? Untested code is broken, as you showed :-) |
|||
| msg321022 - (view) | Author: miss-islington (miss-islington) | Date: 2018-07-04 08:32 | |
New changeset c1b75b5fb92fda0ac5b931d7b18c1418557cb7c4 by Miss Islington (bot) in branch '3.7': bpo-34010: Fix tarfile read performance regression (GH-8020) https://github.com/python/cpython/commit/c1b75b5fb92fda0ac5b931d7b18c1418557cb7c4 |
|||
| msg321023 - (view) | Author: miss-islington (miss-islington) | Date: 2018-07-04 08:43 | |
New changeset d7a0ad7dd7bd7dfbdbf6be2c89fde5a71813628a by Miss Islington (bot) in branch '3.6': bpo-34010: Fix tarfile read performance regression (GH-8020) https://github.com/python/cpython/commit/d7a0ad7dd7bd7dfbdbf6be2c89fde5a71813628a |
|||
| msg321040 - (view) | Author: Inada Naoki (methane) * | Date: 2018-07-04 12:22 | |
@Victor I think removing unused code is better than adding test for it and maintain it. So I removed that unused code block in GH-8089. |
|||
| msg321041 - (view) | Author: STINNER Victor (vstinner) * | Date: 2018-07-04 12:27 | |
> @Victor I think removing unused code is better than adding test for it and maintain it. Sure. I reviewed your PR 8089. |
|||
| History | |||
|---|---|---|---|
| Date | User | Action | Args |
| 2022-04-11 14:59:02 | admin | set | github: 78191 |
| 2018-07-04 12:27:59 | vstinner | set | messages: + msg321041 |
| 2018-07-04 12:22:31 | methane | set | messages: + msg321040 |
| 2018-07-04 08:43:52 | miss-islington | set | messages: + msg321023 |
| 2018-07-04 08:32:44 | miss-islington | set | nosy:
+ miss-islington messages: + msg321022 |
| 2018-07-04 08:28:58 | vstinner | set | nosy:
+ vstinner messages: + msg321021 |
| 2018-07-04 08:25:19 | methane | set | status: open -> closed resolution: fixed messages: + msg321020 stage: patch review -> resolved |
| 2018-07-04 08:15:29 | miss-islington | set | pull_requests: + pull_request7684 |
| 2018-07-04 08:14:43 | miss-islington | set | pull_requests: + pull_request7683 |
| 2018-07-04 08:13:21 | methane | set | messages: + msg321018 |
| 2018-07-04 07:49:42 | methane | set | keywords:
+ 3.2regression title: tarfile stream read performance -> tarfile stream read performance regression |
| 2018-07-04 05:22:01 | hajoscher | set | messages: + msg321011 |
| 2018-07-03 11:25:24 | methane | set | versions: - Python 3.4, Python 3.5 |
| 2018-07-01 23:30:23 | methane | set | nosy:
+ methane messages: + msg320848 |
| 2018-06-30 09:36:18 | python-dev | set | keywords:
+ patch stage: patch review pull_requests: + pull_request7628 |
| 2018-06-30 09:27:00 | hajoscher | create | |