Issue39011
Created on 2019-12-09 23:40 by mefistotelis, last changed 2020-04-12 12:52 by scoder. This issue is now closed.
| Files | | |
|---|---|---|
| File name | Uploaded | Description |
| 0001-bpo-39011-Preserve-line-endings-within-attributes.patch | mefistotelis, 2020-02-10 00:24 | Patch v1 |
| 0002-bpo-39011-Test-white-space-preservation-in-attribs.patch | mefistotelis, 2020-02-11 10:39 | |
| Pull Requests | | |
|---|---|---|
| URL | Status | Linked |
| PR 18468 | merged | python-dev, 2020-02-12 01:11 |
| Messages (9) | |||
|---|---|---|---|
| msg358154 - (view) | Author: Mefistotelis (mefistotelis) * | Date: 2019-12-09 23:40 | |
TLDR:
If I place "\r" in an Element attribute, it is handled and idiomized to "&#10;" in the XML file. But wait - \r is not really code 10, right?
Real description:
If I create an ElementTree and read it right after creation, I get what I put there - "\r". But if I save and re-load, it turns into "\n". The character is incorrectly converted before being idiomized, and the saved XML file has an invalid value stored.
Quick repro:
# python3 -i
Python 3.8.0 (default, Oct 25 2019, 06:23:40) [GCC 9.2.0 64 bit (AMD64)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> import xml.etree.ElementTree as ET
>>> elem = ET.Element('TEST')
>>> elem.set("Attt", "a\x0db")
>>> tree = ET.ElementTree(elem)
>>> with open("_test1.xml", "wb") as xml_fh:
... tree.write(xml_fh, encoding='utf-8', xml_declaration=True)
...
>>> tree.getroot().get("Attt")
'a\rb'
>>> tree = ET.parse("_test1.xml")
>>> tree.getroot().get("Attt")
'a\nb'
>>>
Related issue: https://bugs.python.org/issue5752
(keeping this one separate as it seems to be a simple bug, easy to fix outside of the discussion there)
If there's a good workaround - please let me know.
Tested on Windows, v3.8 and v3.6
|
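The repro above can be condensed into an in-memory round trip, which avoids the temporary file and makes the serialized bytes visible. This is only a minimal sketch: `roundtrip_attr` is a hypothetical helper name, not part of ElementTree, and the reparsed value depends on the interpreter version (an affected release would be expected to give back "\n" where "\r" was stored).

```python
import sys
import xml.etree.ElementTree as ET


def roundtrip_attr(value):
    """Serialize an element carrying `value` in an attribute, then parse it back.

    Hypothetical helper for illustration; returns (serialized_bytes, reparsed_value).
    """
    elem = ET.Element("TEST")
    elem.set("Attt", value)
    data = ET.tostring(elem, encoding="utf-8")
    reparsed = ET.fromstring(data).get("Attt")
    return data, reparsed


if __name__ == "__main__":
    data, reparsed = roundtrip_attr("a\rb")
    # Compare the stored value with what survives a save/load cycle.
    print(sys.version_info[:2], data, repr(reparsed))
```

Inspecting `data` shows which character reference the serializer actually wrote, which is the part the report is about.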
|||
| msg358181 - (view) | Author: Serhiy Storchaka (serhiy.storchaka) * | Date: 2019-12-10 11:14 | |
See https://www.w3.org/TR/REC-xml/#sec-line-ends. |
|||
| msg358219 - (view) | Author: Mefistotelis (mefistotelis) * | Date: 2019-12-10 20:04 | |
Disclaimer: I'm not at all an expert in XML specs.
The linked spec chapter, "End-of-Line Handling", says all line endings should behave as if they were converted to "\n" _before_ parsing.
This means:
1. This part of the spec does not apply to the behavior described in the issue, because line endings are converted before the file is saved. The spec describes the loading process, not saving.
2. Before parsing, the line endings within attributes are replaced by idioms - so they are no longer line endings in the meaning assigned by the spec.
The chapter starts with a clear indication that it only applies to line endings which are used to give structure to the physical file. The affected line endings are narrowed by stating: "files [...], for editing convenience, are organized into lines.". Since line endings in attributes are idiomized, they don't take part in organizing the file into lines.
Then again, I'm not an expert. From the various specs I worked with, I know that the affected industry always comes out with a unified interpretation of specs. If it was widely accepted to apply this chapter to values of attributes, I'd understand. |
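The distinction drawn here, between literal line endings in the file and line endings written as character references, can be observed on the parsing side. The following is a small sketch using only `xml.etree.ElementTree.fromstring`; the element and attribute names are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Literal line endings in the document are normalized by the parser:
# "\r\n" and a lone "\r" in character data are both read back as "\n".
text_elem = ET.fromstring(b"<T>a\r\nb\rc</T>")

# A literal newline inside an attribute value is further normalized to a space.
literal_attr = ET.fromstring(b'<T a="x\ny"/>')

# Character references are not subject to line-end normalization, so an
# escaped carriage return survives parsing unchanged.
escaped_attr = ET.fromstring(b'<T a="x&#13;y&#10;z"/>')

if __name__ == "__main__":
    print(repr(text_elem.text))
    print(repr(literal_attr.get("a")))
    print(repr(escaped_attr.get("a")))
```

So a "\r" can only reach the application through an attribute if the writer emitted it as a character reference, which is why the choice made at serialization time matters.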
|||
| msg358831 - (view) | Author: Stefan Behnel (scoder) * | Date: 2019-12-23 19:19 | |
I think we did it wrong in issue 17582. Parser behaviour is not a reason why the *serialisation* should modify the content. Luckily, fixing this does not impact the C14N serialisation (which aims to guarantee byte identical serialisation), but it changes the "normal" serialisation. I would therefore suggest that we remove the newline replacement code in the next release only, Py3.9. @mefistotelis, do you want to submit a PR? |
|||
| msg361664 - (view) | Author: Mefistotelis (mefistotelis) * | Date: 2020-02-10 00:24 | |
Patch attached.
I was thinking about one for() instead, but didn't want to introduce too large a change.
Let me know if you would prefer something like:
for i in (9,10,13,):
    if chr(i) not in text: continue
    text = text.replace(chr(i), "&#{:02d};".format(i))
That would also make it easy to extend for other chars, i.e. if we'd like the parser to always be able to re-read the XML we've created. Currently, placing control chars in attributes will prevent that. But I'm getting out of scope of this issue now.
|
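The loop variant from this message can be tried on its own, outside ElementTree. The sketch below is an assumption-laden illustration, not the attached patch and not ElementTree's internal code: `escape_attr_value` and `roundtrip` are hypothetical names, and the markup escaping is reduced to the minimum needed for a well-formed attribute.

```python
import xml.etree.ElementTree as ET


def escape_attr_value(text):
    """Escape an attribute value so whitespace control characters round-trip.

    Hypothetical helper, not ElementTree's internal implementation.
    """
    # Markup-significant characters first; "&" must be handled before the
    # character references below introduce new ampersands.
    text = text.replace("&", "&amp;").replace("<", "&lt;").replace('"', "&quot;")
    # The loop proposed in the message: tab, line feed, carriage return.
    for code in (9, 10, 13):
        char = chr(code)
        if char in text:
            text = text.replace(char, "&#{:02d};".format(code))
    return text


def roundtrip(value):
    """Embed the escaped value in a one-element document and parse it back."""
    xml = '<T a="{}"/>'.format(escape_attr_value(value))
    return ET.fromstring(xml).get("a")


if __name__ == "__main__":
    original = "a\rb\n\tc"
    print(escape_attr_value(original))
    print(roundtrip(original) == original)
```

Each of the three characters is written as its own character reference, so none of them is turned into another one on the way out. Other control characters are not handled by this sketch.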
|||
| msg361681 - (view) | Author: Stefan Behnel (scoder) * | Date: 2020-02-10 12:13 | |
Your patch looks good to me. Could you please add (or adapt) the tests and then create a PR from it? You also need to write a NEWS entry for this change, and it also seems worth an entry in the "What's new" document. https://devguide.python.org/committing/ |
|||
| msg361682 - (view) | Author: Reece Johnson (nows) | Date: 2020-02-10 12:20 | |
Hope it is fixed now. |
|||
| msg361795 - (view) | Author: Mefistotelis (mefistotelis) * | Date: 2020-02-11 10:39 | |
I'm on it. Test update attached. |
|||
| msg366244 - (view) | Author: Stefan Behnel (scoder) * | Date: 2020-04-12 12:52 | |
New changeset 5fd8123dfdf6df0a9c29363c8327ccfa0c1d41ac by mefistotelis in branch 'master': bpo-39011: Preserve line endings within ElementTree attributes (GH-18468) https://github.com/python/cpython/commit/5fd8123dfdf6df0a9c29363c8327ccfa0c1d41ac |
|||
| History | |||
|---|---|---|---|
| Date | User | Action | Args |
| 2020-04-12 12:52:57 | scoder | set | status: open -> closed resolution: fixed stage: patch review -> resolved |
| 2020-04-12 12:52:18 | scoder | set | messages: + msg366244 |
| 2020-02-12 01:11:10 | python-dev | set | stage: needs patch -> patch review pull_requests: + pull_request17838 |
| 2020-02-11 10:39:02 | mefistotelis | set | files: + 0002-bpo-39011-Test-white-space-preservation-in-attribs.patch messages: + msg361795 |
| 2020-02-10 12:20:52 | nows | set | nosy: + nows messages: + msg361682 |
| 2020-02-10 12:13:40 | scoder | set | messages: + msg361681 |
| 2020-02-10 00:24:28 | mefistotelis | set | files: + 0001-bpo-39011-Preserve-line-endings-within-attributes.patch keywords: + patch messages: + msg361664 |
| 2019-12-23 19:19:53 | scoder | set | stage: needs patch messages: + msg358831 versions: + Python 3.9, - Python 3.6, Python 3.8 |
| 2019-12-10 20:04:34 | mefistotelis | set | messages: + msg358219 |
| 2019-12-10 11:14:09 | serhiy.storchaka | set | nosy: + serhiy.storchaka messages: + msg358181 |
| 2019-12-10 01:40:25 | rhettinger | set | nosy: + scoder |
| 2019-12-09 23:40:50 | mefistotelis | create | |