Issue26464
Created on 2016-03-01 13:51 by ben.knight, last changed 2016-03-01 21:08 by python-dev. This issue is now closed.
| Files | |
|---|---|
| File name | Uploaded |
| unicode_fast_translate.patch | vstinner, 2016-03-01 19:54 |
| Messages (8) | |||
|---|---|---|---|
| msg261049 - (view) | Author: Ben Knight (ben.knight) | Date: 2016-03-01 13:51 | |
Python 3.5.1 x86-64, Windows 10
I created a translation map that translated some characters to None and others to strings and found that in some cases str.translate() will duplicate one of the untranslated characters in the returned string.
How to reproduce:
table = str.maketrans({'a': None, 'b': 'cd'})
'axb'.translate(table)
Expected result:
'xcd'
Actual result:
'xxcd'
Mapping 'a' to '' instead of None will produce the desired effect.
|
|||
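One way to check the report independently of the interpreter's optimized code is to compare `str.translate()` against a slow character-by-character loop. The sketch below is illustrative only: `reference_translate` is a hypothetical helper, not part of CPython, and it assumes the documented table semantics (a code point maps to a code point, a string, or None; unmapped characters are kept).

```python
def reference_translate(text, table):
    """Hypothetical helper: translate one character at a time."""
    pieces = []
    for ch in text:
        try:
            repl = table[ord(ch)]
        except LookupError:
            pieces.append(ch)   # unmapped characters are kept unchanged
            continue
        if repl is None:
            continue            # mapped to None: the character is deleted
        # A table value is either a code point (int) or a string.
        pieces.append(chr(repl) if isinstance(repl, int) else repl)
    return ''.join(pieces)


table_none = str.maketrans({'a': None, 'b': 'cd'})
table_empty = str.maketrans({'a': '', 'b': 'cd'})   # workaround from the report

expected = reference_translate('axb', table_none)
builtin_none = 'axb'.translate(table_none)
builtin_empty = 'axb'.translate(table_empty)
print(expected, builtin_none, builtin_empty)
```

On an affected interpreter (3.5.1 in the report), `builtin_none` would differ from `expected`, while `builtin_empty` should match it.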
| msg261059 - (view) | Author: Eryk Sun (eryksun) * | Date: 2016-03-01 16:31 | |
It duplicates translated characters as well. For example:
>>> table = str.maketrans('mnopqrb', 'rqponm\u0100', 'a')
>>> 'aaaaaamnopqrb'.translate(table)
'rqponmrqponmĀ'
3.4 returns the correct result:
>>> table = str.maketrans('mnopqrb', 'rqponm\u0100', 'a')
>>> 'aaaaaamnopqrb'.translate(table)
'rqponmĀ'
The problem is that the new fast path for one-to-one ASCII mapping (unicode_fast_translate in Objects/unicodeobject.c) has no way to return the current input position so that the caller can resume the translation from there. _PyUnicode_TranslateCharmap assumes the input position is the same as the current writer position, which is wrong when input characters have been deleted.
|
|||
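The mismatch described above can be modelled in a few lines of Python. This is a simplified sketch, not the C code from Objects/unicodeobject.c: the function names and the `buggy` flag are invented for illustration, and the only assumption carried over is that the fast path handles deletions and one-to-one ASCII replacements and stops at anything else.

```python
def fast_translate(text, table, out):
    """Model of the fast path. Returns the INPUT index where it stopped."""
    i = 0
    while i < len(text):
        ch = text[i]
        if ord(ch) >= 128:
            break                      # non-ASCII input: leave the fast path
        repl = table.get(ord(ch), ch)
        if repl is None:
            i += 1                     # deletion: input advances, output does not
            continue
        if isinstance(repl, int):
            repl = chr(repl)
        if len(repl) != 1 or ord(repl) >= 128:
            break                      # not a 1:1 ASCII mapping: leave the fast path
        out.append(repl)
        i += 1
    return i


def slow_translate(text, table, start, out):
    """Model of the general path, resuming at input index `start`."""
    for ch in text[start:]:
        repl = table.get(ord(ch), ch)
        if repl is None:
            continue
        out.append(chr(repl) if isinstance(repl, int) else repl)


def translate(text, table, buggy):
    out = []
    stopped_at = fast_translate(text, table, out)
    # The buggy variant resumes at the OUTPUT length, like "i = writer.pos".
    resume = len(out) if buggy else stopped_at
    slow_translate(text, table, resume, out)
    return ''.join(out)


table = str.maketrans({'a': None, 'b': 'cd'})
print(translate('axb', table, buggy=True))    # xxcd
print(translate('axb', table, buggy=False))   # xcd
```

In this model the two positions agree as long as nothing has been deleted, which is consistent with the report that mapping 'a' to '' avoids the duplication: an empty string is not a one-character replacement, so the model leaves the fast path before anything is deleted.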
| msg261064 - (view) | Author: STINNER Victor (vstinner) * | Date: 2016-03-01 19:54 | |
Oh... I see. This bug was introduced by unicode_fast_translate(), the optimization for ASCII strings that replaces one ASCII character with another or deletes a character. See change cca6b056236a of issue #21118. The code confuses the input and output positions. The caller uses "i = writer.pos;" to continue when unicode_fast_translate() was interrupted (because a translation uses a non-ASCII character or a string longer than 1 character), but writer.pos is the position in the *output* string, not in the *input* string :-/ I see that I added unit tests on translate, but they lack a test that starts in the fast path by deleting characters and then switches to regular translation. The attached patch should fix the issue. It adds unit tests. |
|||
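A regression test for the missing case (delete characters first, then hit a replacement the fast path cannot handle) might look roughly like the following. This is a sketch built from the two examples in this issue, not the tests from the attached patch; the class and method names are made up.

```python
import unittest


class TranslateFastPathHandoffTest(unittest.TestCase):
    """Hypothetical test case: deletion followed by a non-fast-path mapping."""

    def test_delete_then_multichar_replacement(self):
        # 'a' is deleted, then 'b' maps to a two-character string.
        table = str.maketrans({'a': None, 'b': 'cd'})
        self.assertEqual('axb'.translate(table), 'xcd')

    def test_delete_then_non_ascii_replacement(self):
        # 'a' is deleted, then 'b' maps to a non-ASCII character.
        table = str.maketrans('mnopqrb', 'rqponm\u0100', 'a')
        self.assertEqual('aaaaaamnopqrb'.translate(table), 'rqponm\u0100')


# Run the two tests explicitly so the snippet works when pasted into a script.
suite = unittest.defaultTestLoader.loadTestsFromTestCase(TranslateFastPathHandoffTest)
result = unittest.TextTestRunner(verbosity=0).run(suite)
```

Both tests should fail on an interpreter with the regression and pass once the input position is tracked separately from the writer position.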
| msg261065 - (view) | Author: STINNER Victor (vstinner) * | Date: 2016-03-01 19:55 | |
> See change cca6b056236a of issue #21118.
The bug was introduced in Python v3.5.0a1. |
|||
| msg261069 - (view) | Author: Serhiy Storchaka (serhiy.storchaka) * | Date: 2016-03-01 20:24 | |
LGTM. |
|||
| msg261070 - (view) | Author: Roundup Robot (python-dev) | Date: 2016-03-01 20:31 | |
New changeset 27ba9ba5deb1 by Victor Stinner in branch '3.5': Fix str.translate() https://hg.python.org/cpython/rev/27ba9ba5deb1 |
|||
| msg261071 - (view) | Author: STINNER Victor (vstinner) * | Date: 2016-03-01 20:33 | |
> LGTM.
Thanks for the review. I pushed my fix. Sorry for the regression, I hate being responsible for a regression in a core feature :-/ It may even deserve a release, but Python doesn't have the habit of "release often" yet :-( |
|||
| msg261072 - (view) | Author: Roundup Robot (python-dev) | Date: 2016-03-01 21:08 | |
New changeset 6643c5cc9797 by Victor Stinner in branch '3.5': Issue #26464: Fix unicode_fast_translate() again https://hg.python.org/cpython/rev/6643c5cc9797 |
|||
| History | |||
|---|---|---|---|
| Date | User | Action | Args |
| 2016-03-01 21:08:10 | python-dev | set | messages: + msg261072 |
| 2016-03-01 20:33:24 | vstinner | set | status: open -> closed; priority: high -> release blocker; nosy: + larry; resolution: fixed |
| 2016-03-01 20:31:27 | python-dev | set | nosy: + python-dev; messages: + msg261070 |
| 2016-03-01 20:24:47 | serhiy.storchaka | set | assignee: serhiy.storchaka -> vstinner; messages: + msg261069; stage: needs patch -> commit review |
| 2016-03-01 19:55:30 | vstinner | set | messages: + msg261065 |
| 2016-03-01 19:54:12 | vstinner | set | files: + unicode_fast_translate.patch; keywords: + patch; messages: + msg261064 |
| 2016-03-01 16:36:44 | serhiy.storchaka | set | nosy: + vstinner |
| 2016-03-01 16:31:34 | eryksun | set | versions: + Python 3.6 |
| 2016-03-01 16:31:17 | eryksun | set | nosy: + eryksun; messages: + msg261059 |
| 2016-03-01 16:26:28 | serhiy.storchaka | set | nosy: + serhiy.storchaka; priority: normal -> high; assignee: serhiy.storchaka; components: + Interpreter Core; stage: needs patch |
| 2016-03-01 13:51:44 | ben.knight | create | |