Issue37966
Created on 2019-08-28 03:52 by Greg Price, last changed 2019-09-04 13:56 by vstinner. This issue is now closed.
| Pull Requests | |||
|---|---|---|---|
| URL | Status | Linked | Edit |
| PR 15558 | merged | Greg Price, 2019-08-28 04:57 | |
| PR 15671 | merged | miss-islington, 2019-09-04 02:45 | |
| Messages (6) | |||
|---|---|---|---|
| msg350651 - (view) | Author: Greg Price (Greg Price) * | Date: 2019-08-28 03:52 | |
In 3.8 we add a new function `unicodedata.is_normalized`. The result is equivalent to `str == unicodedata.normalize(form, str)`, but the implementation uses a version of the "quick check" algorithm from UAX #15 as an optimization to try to avoid having to copy the whole string. This was added in issue #32285, commit 2810dd7be. However, it turns out the code doesn't actually implement the same algorithm as UAX #15, and as a result we often miss the optimization and end up having to compute the whole normalized string after all. Here's a quick demo on my desktop. We pass a long string made entirely out of a character for which the quick-check algorithm always says `NO`, it's not normalized: $ build.base/python -m timeit -s 'import unicodedata; s = "\uf900"*500000' -- \ 'unicodedata.is_normalized("NFD", s)' 50 loops, best of 5: 4.39 msec per loop $ build.base/python -m timeit -s 'import unicodedata; s = "\uf900"*500000' -- \ 's == unicodedata.normalize("NFD", s)' 50 loops, best of 5: 4.41 msec per loop That's the same 4.4 ms (for a 1 MB string) with or without the attempted optimization. Here it is after a patch that makes the algorithm run as in the standard: $ build.dev/python -m timeit -s 'import unicodedata; s = "\uf900"*500000' -- \ 'unicodedata.is_normalized("NFD", s)' 5000000 loops, best of 5: 58.2 nsec per loop Nearly 5 orders of magnitude faster -- the difference between O(N) and O(1). The root cause of the issue is that our `is_normalized` static helper, which the new function relies on, was never written as a full implementation of the quick-check algorithm. The full algorithm can return YES, MAYBE, or NO; but originally this helper's only caller was the implementation of `unicodedata.normalize`, which only cares about YES vs. MAYBE-or-NO. So the helper often returns MAYBE when the standard algorithm would say NO. (More precisely, perhaps: it's fine that this helper was never a full implementation... but it didn't say that anywhere, even while referring to the standard algorithm, and as a result set us up for future confusion.) That's exactly what's happening in the example above: the standard quick-check algorithm would say NO, but our helper says MAYBE. Which for `unicodedata.is_normalized` means it has to go compute the whole normalized string. |
|||
| msg350655 - (view) | Author: Greg Price (Greg Price) * | Date: 2019-08-28 05:04 | |
Fix posted, as GH-15558. Adding cc's for the folks in the thread on #32285, where this function was originally added. |
|||
| msg351110 - (view) | Author: Benjamin Peterson (benjamin.peterson) * | Date: 2019-09-04 02:46 | |
New changeset 2f09413947d1ce0043de62ed2346f9a2b4e5880b by Benjamin Peterson (Greg Price) in branch 'master': closes bpo-37966: Fully implement the UAX #15 quick-check algorithm. (GH-15558) https://github.com/python/cpython/commit/2f09413947d1ce0043de62ed2346f9a2b4e5880b |
|||
| msg351111 - (view) | Author: Maxime Belanger (Maxime Belanger) | Date: 2019-09-04 02:47 | |
Thanks for that! |
|||
| msg351112 - (view) | Author: miss-islington (miss-islington) | Date: 2019-09-04 03:03 | |
New changeset 4dd1c9d9c2bca4744c70c9556b7051f4465ede3e by Miss Islington (bot) in branch '3.8': closes bpo-37966: Fully implement the UAX GH-15 quick-check algorithm. (GH-15558) https://github.com/python/cpython/commit/4dd1c9d9c2bca4744c70c9556b7051f4465ede3e |
|||
| msg351127 - (view) | Author: STINNER Victor (vstinner) * | Date: 2019-09-04 13:56 | |
Thanks Greg Price for this nice optimization! |
|||
| History | |||
|---|---|---|---|
| Date | User | Action | Args |
| 2019-09-04 13:56:46 | vstinner | set | messages: + msg351127 |
| 2019-09-04 03:03:43 | miss-islington | set | nosy:
+ miss-islington messages: + msg351112 |
| 2019-09-04 02:47:17 | Maxime Belanger | set | messages: + msg351111 |
| 2019-09-04 02:46:07 | benjamin.peterson | set | status: open -> closed resolution: fixed messages: + msg351110 stage: patch review -> resolved |
| 2019-09-04 02:45:57 | miss-islington | set | pull_requests: + pull_request15335 |
| 2019-08-29 13:19:15 | vstinner | set | nosy:
+ serhiy.storchaka |
| 2019-08-28 05:04:42 | Greg Price | set | nosy:
+ ezio.melotti, steven.daprano, benjamin.peterson, vstinner, Maxime Belanger versions: + Python 3.9 messages: + msg350655 components:
+ Unicode |
| 2019-08-28 04:57:51 | Greg Price | set | keywords:
+ patch stage: patch review pull_requests: + pull_request15231 |
| 2019-08-28 03:52:56 | Greg Price | create | |