Issue 29456: bugs in unicodedata.normalize: u1176, u11a7 and u11c3

classification

Title:	bugs in unicodedata.normalize: u1176, u11a7 and u11c3
Type:	behavior	Stage:	resolved
Components:	Unicode	Versions:	Python 3.8, Python 3.7, Python 3.6, Python 2.7

process

Status:	closed	Resolution:	fixed
Dependencies:		Superseder:
Assigned To:		Nosy List:	ezio.melotti, lemburg, loewis, malin, miss-islington, pusnow, vstinner, xiang.zhang
Priority:	normal	Keywords:	patch

Created on 2017-02-06 04:27 by pusnow, last changed 2018-06-18 14:21 by xiang.zhang. This issue is now closed.

Files
File name	Uploaded	Description
u1176.patch	pusnow, 2017-02-06 04:27	
u11a7u11c3.patch	pusnow, 2017-02-06 05:47	

Pull Requests
URL	Status	Linked
PR 1958	merged	pusnow, 2017-06-05 15:48
PR 7702	merged	miss-islington, 2018-06-15 12:03
PR 7703	merged	miss-islington, 2018-06-15 12:04
PR 7704	merged	xiang.zhang, 2018-06-15 12:23

Messages (23)
msg287077 - (view)	Author: Wonsup Yoon (pusnow) *	Date: 2017-02-06 04:27
unicodedata can't NFC-normalize Hangul strings that contain \u1176 (HANGUL JUNGSEONG A-O).
>>> from unicodedata import normalize
>>> normalize("NFC", "\u1100\u1176\u11a8")
'깍'
The result should be "\u1100\u1176\u11a8", not '깍' (\uae4d). I attached a patch for this issue (it fixes the boundary of the modern medial vowels).
msg287078 - (view)	Author: Xiang Zhang (xiang.zhang) *	Date: 2017-02-06 05:21
How about the third character's range? The code seems to assume it's [11a7..11c3], while the spec says [11a8..11c2].
>>> unicodedata.normalize("NFC", "\u1100\u1175\u11a7")
'기'
Shouldn't it be '기ᆧ'?
msg287079 - (view)	Author: Wonsup Yoon (pusnow) *	Date: 2017-02-06 05:47
I think you are right. The range of modern final consonants is [11a8..11c2]. I attached another patch for this issue.
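The ranges discussed above can be illustrated with a minimal pure-Python sketch of the Hangul part of canonical composition. This is not CPython's C implementation and not the attached patch. The function name `compose_hangul` and the constant names are assumptions chosen for illustration, and the sketch assumes the input is already decomposed and ignores every non-Hangul composition rule.

```python
# Constants for algorithmic Hangul syllable composition.
S_BASE, L_BASE, V_BASE, T_BASE = 0xAC00, 0x1100, 0x1161, 0x11A7
L_COUNT, V_COUNT, T_COUNT = 19, 21, 28
S_COUNT = L_COUNT * V_COUNT * T_COUNT


def compose_hangul(text: str) -> str:
    """Compose modern L+V and LV+T jamo sequences (illustrative sketch only)."""
    if not text:
        return text
    out = [text[0]]
    for ch in text[1:]:
        last = ord(out[-1])
        code = ord(ch)

        # L + V -> LV: the vowel must lie in [0x1161..0x1175], so 0x1176 is excluded.
        l_index = last - L_BASE
        v_index = code - V_BASE
        if 0 <= l_index < L_COUNT and 0 <= v_index < V_COUNT:
            out[-1] = chr(S_BASE + (l_index * V_COUNT + v_index) * T_COUNT)
            continue

        # LV + T -> LVT: the trailing consonant must lie in [0x11a8..0x11c2],
        # which is the strict bound 0 < t_index < T_COUNT.
        s_index = last - S_BASE
        t_index = code - T_BASE
        if 0 <= s_index < S_COUNT and s_index % T_COUNT == 0 and 0 < t_index < T_COUNT:
            out[-1] = chr(last + t_index)
            continue

        out.append(ch)
    return "".join(out)


print(ascii(compose_hangul("\u1100\u1176\u11a8")))  # left uncomposed
print(ascii(compose_hangul("\u1100\u1175\u11a7")))  # only L+V composes
```

With these bounds, \u1176 is never treated as a modern vowel, and \u11a7 and \u11c3 are never folded into a syllable.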
msg295123 - (view)	Author: Wonsup Yoon (pusnow) *	Date: 2017-06-04 11:19
Is there anything more needed?
msg295171 - (view)	Author: Xiang Zhang (xiang.zhang) *	Date: 2017-06-05 07:32
We have moved our code hosting to GitHub. Would you mind turning your patch into a GitHub PR first, Wonsup?
msg295172 - (view)	Author: Wonsup Yoon (pusnow) *	Date: 2017-06-05 08:06
Ok, I'll do it.
msg299214 - (view)	Author: Wonsup Yoon (pusnow) *	Date: 2017-07-26 07:54
Any updates? I need this fix for my project.
msg299657 - (view)	Author: Wonsup Yoon (pusnow) *	Date: 2017-08-02 13:25
I added some test cases for this issue. Could someone please check them?
msg300039 - (view)	Author: Wonsup Yoon (pusnow) *	Date: 2017-08-10 03:46
I think it can be merged. Is there anything I need to do?
msg300046 - (view)	Author: Xiang Zhang (xiang.zhang) *	Date: 2017-08-10 05:00
Hi Wonsup, sorry for the delay. I've been really busy with work these days. If no one else gets involved, I'll try to find time to review your patch this week.
msg300576 - (view)	Author: Wonsup Yoon (pusnow) *	Date: 2017-08-19 09:54
This patch fixes changes in Unicode 4.1.0. I think it has been well reviewed and it is time to merge. Who can commit this patch?
@animalize says: Let me add a supplement.
Before Unicode 4.1.0 (draft), the check was TBase <= code <= TBase+TCount. See: http://www.unicode.org/reports/tr15/tr15-24.html#hangul_composition
Since Unicode 4.1.0, the check is TBase < code < TBase+TCount, which is in line with the latest version (Unicode 10.0). See: http://www.unicode.org/reports/tr15/tr15-25.html#hangul_composition
This change happened in 2005.
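The difference between the two bounds quoted above can be worked out directly. The following is a small sketch, assuming the standard values TBase = 0x11A7 and TCount = 28. The function names `in_range_inclusive` and `in_range_exclusive` are made up for illustration.

```python
T_BASE = 0x11A7
T_COUNT = 28  # so T_BASE + T_COUNT == 0x11C3


def in_range_inclusive(code: int) -> bool:
    """Older check quoted above: TBase <= code <= TBase+TCount."""
    return T_BASE <= code <= T_BASE + T_COUNT


def in_range_exclusive(code: int) -> bool:
    """Check since Unicode 4.1.0: TBase < code < TBase+TCount."""
    return T_BASE < code < T_BASE + T_COUNT


# Code points on which the two checks disagree.
differing = [
    hex(code)
    for code in range(0x11A0, 0x11D0)
    if in_range_inclusive(code) != in_range_exclusive(code)
]
print(differing)  # → ['0x11a7', '0x11c3']
```

The two checks disagree only at the endpoints, which are exactly the two trailing code points named in the issue title.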
msg300933 - (view)	Author: Wonsup Yoon (pusnow) *	Date: 2017-08-28 02:41
Hello?
msg313056 - (view)	Author: Ma Lin (malin) *	Date: 2018-02-28 11:09
ping, this was forgotten.
msg315214 - (view)	Author: Wonsup Yoon (pusnow) *	Date: 2018-04-12 08:18
Hello!
msg319591 - (view)	Author: Xiang Zhang (xiang.zhang) *	Date: 2018-06-15 07:58
Sorry for the absence and late response. I just reviewed it and think it's ready. I think the change in the Unicode standard looks more like a fix for a bug in the implementation than an intentional change. Unicode 3.0 already states that the third character is out of bounds when TIndex <= 0 or TIndex >= TCount. We have a ucd_3_2_0 in unicodedata. I'll merge it after resolving the CI bot issue.
msg319608 - (view)	Author: Xiang Zhang (xiang.zhang) *	Date: 2018-06-15 12:03
New changeset d134809cd3764c6a634eab7bb8995e3e2eff14d5 by Xiang Zhang (Wonsup Yoon) in branch 'master': bpo-29456: Fix bugs in unicodedata.normalize: u1176, u11a7 and u11c3 (GH-1958) https://github.com/python/cpython/commit/d134809cd3764c6a634eab7bb8995e3e2eff14d5
msg319609 - (view)	Author: miss-islington (miss-islington)	Date: 2018-06-15 12:21
New changeset 0e2b76ea4e48d0fc1ca34ae4ffbb2fd6c19664bb by Miss Islington (bot) in branch '3.7': bpo-29456: Fix bugs in unicodedata.normalize: u1176, u11a7 and u11c3 (GH-1958) https://github.com/python/cpython/commit/0e2b76ea4e48d0fc1ca34ae4ffbb2fd6c19664bb
msg319610 - (view)	Author: miss-islington (miss-islington)	Date: 2018-06-15 12:32
New changeset e2e7ff0d0378ba44f10a1aae10e4bee957fb44d2 by Miss Islington (bot) in branch '3.6': bpo-29456: Fix bugs in unicodedata.normalize: u1176, u11a7 and u11c3 (GH-1958) https://github.com/python/cpython/commit/e2e7ff0d0378ba44f10a1aae10e4bee957fb44d2
msg319615 - (view)	Author: Xiang Zhang (xiang.zhang) *	Date: 2018-06-15 13:26
New changeset 1889c4cbd62e200fa4cde3d6219e0aadf9bd8149 by Xiang Zhang in branch '2.7': bpo-29456: Fix bugs in unicodedata.normalize: u1176, u11a7 and u11c3 (GH-1958) (GH-7704) https://github.com/python/cpython/commit/1889c4cbd62e200fa4cde3d6219e0aadf9bd8149
msg319701 - (view)	Author: Ma Lin (malin) *	Date: 2018-06-16 03:18
> We have a ucd_3_2_0 in unicodedata.
This 3.2 unicodedata is probably used for IDNA2003. IDNA2003 includes a step that normalizes the domain_name string to Unicode Normalization Form C. We have now changed the Hangul composition code to follow Unicode Standard 4.1+, and fixed the bug even for Unicode Standard versions before 4.1. Could this (Unicode Standard 4.1+ behavior) cause a security vulnerability for someone who is using IDNA2003 via ucd_3_2_0?
msg319719 - (view)	Author: Xiang Zhang (xiang.zhang) *	Date: 2018-06-16 05:56
As I said, I checked Unicode 3.0 for the Hangul composition algorithm. It looks consistent with Unicode 4.1+. Unicode 3.0 only has a description and no sample implementation. So I think the changed code also applies to Unicode 3.0+.
msg319802 - (view)	Author: Ma Lin (malin) *	Date: 2018-06-17 02:40
You are right. I found a Normalization Test Suite for Unicode 3.2: http://www.unicode.org/Public/3.2-Update/NormalizationTest-3.2.0.txt
\u1176 is not in the range of the second character, and \u11a7 and \u11c3 are not in the range of the third character.
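The three boundary cases from this thread can be checked against both the current database and `ucd_3_2_0`. This is a quick sketch that assumes an interpreter which already includes the fix from this issue. The helper name `failing_cases` and the expected-value table are written for illustration and are not taken from the CPython test suite.

```python
import unicodedata

# Input -> expected NFC result for the three code points named in this issue.
EXPECTED_NFC = {
    "\u1100\u1176\u11a8": "\u1100\u1176\u11a8",  # U+1176 is not a modern vowel
    "\u1100\u1175\u11a7": "\uae30\u11a7",        # U+11A7 is not a trailing consonant
    "\u1100\u1175\u11c3": "\uae30\u11c3",        # U+11C3 is not a modern trailing consonant
}


def failing_cases(normalize):
    """Return the inputs whose NFC form differs from the expected value."""
    return [
        src
        for src, expected in EXPECTED_NFC.items()
        if normalize("NFC", src) != expected
    ]


current_failures = failing_cases(unicodedata.normalize)
ucd_3_2_failures = failing_cases(unicodedata.ucd_3_2_0.normalize)
print("current database:", [ascii(s) for s in current_failures])
print("ucd_3_2_0:", [ascii(s) for s in ucd_3_2_failures])
```

On an interpreter without the fix, the first list would be expected to contain all three inputs, matching the behavior reported at the top of this issue.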
msg319886 - (view)	Author: Xiang Zhang (xiang.zhang) *	Date: 2018-06-18 14:21
Thanks for your confirmation, Ma Lin. Thanks also to Wonsup!

History
Date	User	Action	Args
2018-06-18 14:21:55	xiang.zhang	set	messages: + msg319886 components: + Unicode, - Library (Lib)
2018-06-17 02:40:32	malin	set	messages: + msg319802
2018-06-16 05:56:16	xiang.zhang	set	messages: + msg319719
2018-06-16 03:18:55	malin	set	messages: + msg319701
2018-06-15 13:28:49	xiang.zhang	set	status: open -> closed resolution: fixed components: + Library (Lib), - Unicode stage: patch review -> resolved
2018-06-15 13:26:57	xiang.zhang	set	messages: + msg319615
2018-06-15 12:32:53	miss-islington	set	messages: + msg319610
2018-06-15 12:23:29	xiang.zhang	set	pull_requests: + pull_request7320
2018-06-15 12:21:57	miss-islington	set	nosy: + miss-islington messages: + msg319609
2018-06-15 12:04:26	miss-islington	set	pull_requests: + pull_request7319
2018-06-15 12:03:37	miss-islington	set	pull_requests: + pull_request7318
2018-06-15 12:03:16	xiang.zhang	set	messages: + msg319608
2018-06-15 07:58:37	xiang.zhang	set	messages: + msg319591 versions: + Python 3.8, - Python 3.5
2018-04-12 08:18:58	pusnow	set	messages: + msg315214
2018-02-28 11:09:52	malin	set	nosy: + malin messages: + msg313056
2017-08-28 02:41:24	pusnow	set	messages: + msg300933
2017-08-19 09:54:09	pusnow	set	messages: + msg300576
2017-08-10 05:00:52	xiang.zhang	set	messages: + msg300046
2017-08-10 04:59:30	xiang.zhang	set	files: - 800.jpg
2017-08-10 04:11:28	高可爱	set	files: + 800.jpg
2017-08-10 03:46:54	pusnow	set	messages: + msg300039
2017-08-02 13:25:40	pusnow	set	messages: + msg299657
2017-07-26 07:54:11	pusnow	set	messages: + msg299214
2017-06-05 15:48:57	pusnow	set	pull_requests: + pull_request2029
2017-06-05 15:46:27	pusnow	set	title: bug in unicodedata.normalize: u1176, u11a7 and u11c3 -> bugs in unicodedata.normalize: u1176, u11a7 and u11c3
2017-06-05 08:06:08	pusnow	set	messages: + msg295172
2017-06-05 07:32:39	xiang.zhang	set	messages: + msg295171
2017-06-04 11:19:17	pusnow	set	messages: + msg295123
2017-03-11 12:55:26	serhiy.storchaka	set	nosy: + lemburg, loewis stage: patch review type: behavior versions: + Python 3.5, Python 3.7
2017-03-11 12:33:28	pusnow	set	title: bug in unicodedata.normalize: u1176 -> bug in unicodedata.normalize: u1176, u11a7 and u11c3
2017-02-06 05:47:24	pusnow	set	files: + u11a7u11c3.patch messages: + msg287079
2017-02-06 05:21:48	xiang.zhang	set	nosy: + xiang.zhang messages: + msg287078
2017-02-06 04:27:52	pusnow	create