Issue 36520: Email header folded incorrectly

classification

Title:	Email header folded incorrectly
Type:	behavior	Stage:	resolved
Components:	email	Versions:	Python 3.7

process

Status:	closed	Resolution:	fixed
Dependencies:		Superseder:
Assigned To:		Nosy List:	Jeffrey.Kintscher, Jonathan Horn, barry, iritkatriel, miss-islington, r.david.murray
Priority:	normal	Keywords:	patch

Created on 2019-04-03 23:15 by Jonathan Horn, last changed 2020-11-20 15:09 by iritkatriel. This issue is now closed.

Files
File name	Uploaded	Description
bpo-36520-test.py	Jeffrey.Kintscher, 2019-05-27 10:43	UTF-8 header encoding test cases

Pull Requests
URL	Status	Linked
PR 13608	merged	Jeffrey.Kintscher, 2019-05-28 02:33
PR 13610	closed	Jeffrey.Kintscher, 2019-05-28 03:38
PR 13909	merged	maxking, 2019-06-08 07:50
PR 13910	merged	maxking, 2019-06-08 07:50

Messages (10)
msg339419 - (view)	Author: Jonathan Horn (Jonathan Horn)	Date: 2019-04-03 23:15
I encountered a problem with replacing the 'Subject' header of an email. After serializing it again, the UTF-8 encoding was wrong. It seems to occur when folding the internal header objects.

Example:

>>> email.policy.default.fold_binary('Subject', email.policy.default.header_store_parse('Subject', 'Hello Wörld! Hello Wörld! Hello Wörld! Hello Wörld!Hello Wörld!')[1])

Expected output:

b'Subject: Hello =?utf-8?q?W=C3=B6rld!_Hello_W=C3=B6rld!_Hello_W=C3=B6rld!?=\n Hello =?utf-8?q?W=C3=B6rld!Hello_W=C3=B6rld!?=\n'

(or similar)

Actual output:

b'Subject: Hello =?utf-8?q?W=C3=B6rld!_Hello_W=C3=B6rld!_Hello_W=C3=B6rld!?=\n Hello =?utf-8?=?utf-8?q?q=3FW=3DC3=3DB6rld!Hello=3F=3D_W=C3=B6rld!?=\n'

I'm running Python 3.7.3 on Arch Linux using Linux 5.0.
msg343267 - (view)	Author: R. David Murray (r.david.murray) *	Date: 2019-05-23 01:34
Can you demonstrate the problem with an actual email object? header_store_parse is not meant to be called directly.
msg343268 - (view)	Author: R. David Murray (r.david.murray) *	Date: 2019-05-23 01:39
Never mind, I was testing with the wrong version of Python. This bug was introduced somewhere after 3.4 :(

>>> from email.message import EmailMessage
>>> m = EmailMessage()
>>> m['Subject'] = 'Hello Wörld! Hello Wörld! Hello Wörld! Hello Wörld!Hello Wörld!'
>>> bytes(m)
b'Subject: Hello =?utf-8?q?W=C3=B6rld!_Hello_W=C3=B6rld!_Hello_W=C3=B6rld!?=\n Hello =?utf-8?=?utf-8?q?q=3FW=3DC3=3DB6rld!Hello=3F=3D_W=C3=B6rld!?=\n\n'
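The reproduction in this message can be wrapped in a small check that flags the symptom seen in the actual output: a new '=?utf-8?' marker opening where the encoding field of the previous encoded word should be. This is only a sketch. The helper names (serialize_subject, has_nested_encoded_word) and the regular expression are illustrative assumptions, not part of the email package, and the pattern only detects this particular corruption.

```python
import re
from email.message import EmailMessage

SUBJECT = 'Hello Wörld! Hello Wörld! Hello Wörld! Hello Wörld!Hello Wörld!'

# Hypothetical heuristic: "=?charset?" immediately followed by another "=?"
# means an encoded word was opened inside the header of a previous one.
NESTED_EW_START = re.compile(rb'=\?[^?\s]+\?=\?')


def serialize_subject(subject: str) -> bytes:
    """Build a message with only a Subject header and serialize it."""
    msg = EmailMessage()
    msg['Subject'] = subject
    return bytes(msg)


def has_nested_encoded_word(raw: bytes) -> bool:
    """Return True if the serialized bytes show the corruption from this issue."""
    return NESTED_EW_START.search(raw) is not None


if __name__ == '__main__':
    raw = serialize_subject(SUBJECT)
    print(raw)
    print('corrupted' if has_nested_encoded_word(raw) else 'no nested marker found')
```

On an affected interpreter (3.7.3 in the original report) the check should report corruption; on a release that includes the fix it should not.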
msg343606 - (view)	Author: Jeffrey Kintscher (Jeffrey.Kintscher) *	Date: 2019-05-27 03:43
To aid with debugging the code, the Subject line can be simplified:

>>> from email.message import EmailMessage
>>> m = EmailMessage()
>>> m['Subject'] = 'Hello =?utf-8?q?W=C3=B6rld!_Hello_W=C3=B6rld!_Hello_W=C3=B6rld!?= Hello Wörld!Hello Wörld!'
>>> print(bytes(m))
b'Subject: Hello =?utf-8?q?W=C3=B6rld!_Hello_W=C3=B6rld!_Hello_W=C3=B6rld!?=\n Hello =?utf-8?=?utf-8?q?q=3FW=3DC3=3DB6rld!Hello=3F=3D_W=C3=B6rld!?=\n\n'
msg343612 - (view)	Author: Jeffrey Kintscher (Jeffrey.Kintscher) *	Date: 2019-05-27 10:43
I uploaded a test script with some test cases. The failure mode occurs when:

1. line folding occurs
2. the first folded line has two or more words with UTF-8 characters
3. subsequent lines contain a word with UTF-8 characters located at a different offset than the last encoded substring in the first line

For example, the first folded and encoded line of

'Hello Wörld! Hello Wörld! Hello Wörld! Hello Wörld!Hello Wörld!'

is

b'Subject: Hello =?utf-8?q?W=C3=B6rld!_Hello_W=C3=B6rld!_Hello_W=C3=B6rld!?='

and the second line should be

b' Hello =?utf-8?q?W=C3=B6rld!Hello_W=C3=B6rld!?='

but instead, it is

b' Hello =?utf-8?=?utf-8?q?q=3FW=3DC3=3DB6rld!Hello=3F=3D_W=C3=B6rld!?='

The function at fault is _refold_parse_tree() in Lib/email/_header_value_parser.py. In the first line, it encodes the first UTF-8 word and saves the starting offset in the output string (15). When it encounters the second UTF-8 word, it re-encodes the entire string starting at the saved offset. This reduces the bloat added by multiple '=?utf-8?q?' start-of-encoding tokens.

When it encodes the first UTF-8 word on the second line, it tries to store it at the saved offset in the second-line output string, but that offset is past the end of the string, so the word just gets appended. When it encounters the second UTF-8 word on the second line, it re-encodes the entire second-line string starting at the saved offset (15), which is in the middle of the first encoded UTF-8 string.

The failure mode is not triggered if there is at most one UTF-8 word in each folded line. It is also not triggered when folding occurs in the middle of a word instead of at whitespace, because the code follows a different path.

The solution is to set the saved starting offset to None when starting a new folded line whose fold point is whitespace. I will submit a pull request soon with a fix.
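The stale-offset mechanism described in this message can be illustrated with a toy model. The sketch below is not the CPython implementation: fold_toy, q_encode, q_decode_all and the reset_on_fold flag are hypothetical names, the encoder is a simplified stand-in for the real Q encoder, and the model only folds before ASCII words. It assumes a 78-column limit and keeps a saved offset (last_ew) the way the message describes, so the effect of clearing that offset on a new line can be compared directly.

```python
import re

SUBJECT = 'Hello Wörld! Hello Wörld! Hello Wörld! Hello Wörld!Hello Wörld!'
ENCODED_WORD = re.compile(r'=\?utf-8\?q\?([^?]*)\?=')


def q_encode(text):
    """Simplified Q encoding: space -> '_', unsafe bytes -> '=XX'."""
    out = []
    for byte in text.encode('utf-8'):
        ch = chr(byte)
        if ch == ' ':
            out.append('_')
        elif byte < 128 and (ch.isalnum() or ch in '!*+-/'):
            out.append(ch)
        else:
            out.append('=%02X' % byte)
    return '=?utf-8?q?' + ''.join(out) + '?='


def q_decode_all(text):
    """Replace every well-formed encoded word in text with its decoded form."""
    def decode(match):
        raw = match.group(1).replace('_', ' ')
        raw = re.sub(r'=([0-9A-F]{2})', lambda h: chr(int(h.group(1), 16)), raw)
        return raw.encode('latin-1').decode('utf-8')
    return ENCODED_WORD.sub(decode, text)


def fold_toy(name, text, maxlen=78, reset_on_fold=True):
    """Toy folder; reset_on_fold=False keeps the stale offset (the bug)."""
    lines = [name + ':']
    last_ew = None  # offset in lines[-1] where the current encoded word starts
    for word in text.split(' '):
        piece = ' ' + word
        if word.isascii():
            if len(lines[-1]) + len(piece) > maxlen:
                lines.append('')          # fold at whitespace
                if reset_on_fold:
                    last_ew = None        # the fix: forget the old line's offset
            lines[-1] += piece
            continue
        if last_ew is not None:
            # Merge with the text from the saved offset and re-encode it all.
            piece = q_decode_all(lines[-1][last_ew:]) + piece
            lines[-1] = lines[-1][:last_ew]
        if piece.startswith(' '):
            lines[-1] += ' '              # leading blank stays unencoded
            piece = piece[1:]
        if last_ew is None:
            last_ew = len(lines[-1])
        lines[-1] += q_encode(piece)
    return lines


if __name__ == '__main__':
    for flag in (False, True):
        print('reset_on_fold =', flag)
        for line in fold_toy('Subject', SUBJECT, reset_on_fold=flag):
            print(repr(line))
```

With reset_on_fold=False the second line is cut at offset 15, in the middle of the '=?utf-8?q?' marker, and the remainder is encoded as literal text, which matches the corrupted output quoted above. With reset_on_fold=True the second line starts a fresh encoded word.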
msg343730 - (view)	Author: Jeffrey Kintscher (Jeffrey.Kintscher) *	Date: 2019-05-28 04:49
The pull request has been submitted with both the code fix and tests.
msg344863 - (view)	Author: Barry A. Warsaw (barry) *	Date: 2019-06-06 19:53
New changeset f6713e84afc5addcfa8477dbdf2c027787f711c0 by Barry Warsaw (websurfer5) in branch 'master': bpo-36520: Email header folded incorrectly (#13608) https://github.com/python/cpython/commit/f6713e84afc5addcfa8477dbdf2c027787f711c0
msg345287 - (view)	Author: miss-islington (miss-islington)	Date: 2019-06-11 23:27
New changeset 0745cc66db3acbe7951073071cf063db6337dd10 by Miss Islington (bot) (Abhilash Raj) in branch '3.7': [3.7] bpo-36520: Email header folded incorrectly (GH-13608) (GH-13910) https://github.com/python/cpython/commit/0745cc66db3acbe7951073071cf063db6337dd10
msg345288 - (view)	Author: miss-islington (miss-islington)	Date: 2019-06-11 23:28
New changeset 36eea7af48ca0a1c96b78c82bf95bbd29d2332da by Miss Islington (bot) (Abhilash Raj) in branch '3.8': [3.8] bpo-36520: Email header folded incorrectly (GH-13608) (GH-13909) https://github.com/python/cpython/commit/36eea7af48ca0a1c96b78c82bf95bbd29d2332da
msg378380 - (view)	Author: Irit Katriel (iritkatriel) *	Date: 2020-10-10 10:37
This seems complete, can it be closed?

History
Date	User	Action	Args
2020-11-20 15:09:53	iritkatriel	set	status: open -> closed resolution: fixed stage: patch review -> resolved
2020-10-10 10:37:54	iritkatriel	set	nosy: + iritkatriel messages: + msg378380
2019-06-11 23:28:18	miss-islington	set	messages: + msg345288
2019-06-11 23:27:16	miss-islington	set	nosy: + miss-islington messages: + msg345287
2019-06-08 07:50:36	maxking	set	pull_requests: + pull_request13783
2019-06-08 07:50:27	maxking	set	pull_requests: + pull_request13782
2019-06-06 19:53:49	barry	set	messages: + msg344863
2019-05-28 04:49:18	Jeffrey.Kintscher	set	messages: + msg343730
2019-05-28 03:38:14	Jeffrey.Kintscher	set	pull_requests: + pull_request13515
2019-05-28 02:33:18	Jeffrey.Kintscher	set	keywords: + patch stage: patch review pull_requests: + pull_request13514
2019-05-27 10:43:16	Jeffrey.Kintscher	set	files: + bpo-36520-test.py messages: + msg343612
2019-05-27 03:43:45	Jeffrey.Kintscher	set	messages: + msg343606
2019-05-23 01:39:57	r.david.murray	set	messages: + msg343268
2019-05-23 01:34:59	r.david.murray	set	messages: + msg343267
2019-05-22 09:16:10	Jeffrey.Kintscher	set	nosy: + Jeffrey.Kintscher
2019-04-03 23:15:00	Jonathan Horn	create