Issue35079
Created on 2018-10-26 19:00 by Springem Springsbee, last changed 2022-04-11 14:59 by admin. This issue is now closed.
| Pull Requests | |||
|---|---|---|---|
| URL | Status | Linked | Edit |
| PR 10144 | merged | terry.reedy, 2018-10-27 02:56 | |
| PR 10145 | merged | miss-islington, 2018-10-27 03:03 | |
| PR 10146 | merged | miss-islington, 2018-10-27 03:03 | |
| PR 10147 | merged | miss-islington, 2018-10-27 03:04 | |
| Messages (9) | |||
|---|---|---|---|
| msg328593 - (view) | Author: Springem Springsbee (Springem Springsbee) | Date: 2018-10-26 19:00 | |
Hello, I'm using difflib's SequenceMatcher to locate common substrings. It seems like the matcher is missing a common substrings. I'm guessing this is a rather low-level issue in difflib. The autojunk parameter has no effect for obvious reasons. Alternate pairwise comparisons between the following 3 strings omit the 2-character match 'AC' GATTACA TAGACCA ATACA The following Github gist captures the issue, which I'll repeat here for redundancy https://gist.github.com/MatthewRalston/b0ab6ac1dbe322cb12063310ccdbb786 >import difflib >string1 = "TAGACCA" >string2 = "ATACA" >s = difflib.SequenceMatcher(None, string1, string2) >blox = s.get_matching_blocks() >print(blox) [Match(a=0, b=1, size=2), Match(a=5, b=3, size=2), Match(a=7, b=5, size=0)] # Missing Match(a=3, b=2, size=2) >print([string1[x.a:x.a+x.size] for x in blox if x.size > 1]) ['TA', 'CA'] # Missing the substring 'CA' |
|||
| msg328595 - (view) | Author: Tim Peters (tim.peters) * | Date: 2018-10-26 19:19 | |
Sorry, I find this too hard to follow. At the end: > ['TA', 'CA'] # Missing the substring 'CA' the comment claims it's missing 'CA', while the output plainly shows it's _not_ missing 'CA'. If your complaint is that's missing 'AC', yes, it is. It's not intended to find _overlapping_ matches at all (read the docs). The longest matching blocks in TAGACCA ATACA are in fact TA, CA, and AC, but after the leftmost-longest TA is matched first, AC no longer exists _to_ match in the second string. Only CA does. |
|||
| msg328602 - (view) | Author: Terry J. Reedy (terry.reedy) * | Date: 2018-10-26 20:25 | |
We can assume that "substring 'CA'" was meant to be "substring 'AC'", but as explained, missing 'AC' is not a bug. (Tim wrote the module.) I read the doc, and 'non-overlapping' is implied in the SequenceMatcher entry at the top of the file. "The idea is to find the longest contiguous matching subsequence that contains no “junk” elements; ... The same idea is then applied recursively to the pieces of the sequences to the left and to the right of the matching subsequence." However, a user of SequenceMatcher could easily miss that, and its implication, as Springem did. For clarity, I think we should add 'non-overlapping to the first line of the .get_matching_blocks entry, which is in the middle of the page. "Return list of triples describing non-overlapping matching subsequences." I also think "i+n != i' or j+n != j'" should be changed to "i+n < i' or j+n < j'" as '>' would mean overlapping. So != must mean <. I will prepare a doc PR later. |
|||
| msg328606 - (view) | Author: Tim Peters (tim.peters) * | Date: 2018-10-26 21:08 | |
I don't object to spelling it out more, and your (Terry's) suggestions are fine. On the other hand, this module has been around for a loooong time now, and this is the first instance I've seen of someone surprised that it doesn't produce overlapping matches - it's obvious from "The same idea is then applied recursively to the pieces of the sequences to the left and to the right of the matching subsequence" that a matching subsequence is wholly eliminated from further consideration. At some point of ever-more tedious elaboration, the docs risk missing the forest for the trees. I don't think _these_ docs are quite there yet - although the docs for `find_longest_match()` are. Speaking of which, that method _could_ be used to find overlapping matches, one at a time, by passing appropriate slice indices. Which can be horridly inefficient; e.g., find all overlapping matches between 'A' * 100 and 'A' * 150 |
|||
| msg328627 - (view) | Author: Terry J. Reedy (terry.reedy) * | Date: 2018-10-27 03:00 | |
Tim, I share your concern about bloating docs, but think this one word worthwhile. I suspect that people are conditioned to accept 'non-overlapping' because that is ofter (usually?) the default for linear searches and regex matching. |
|||
| msg328628 - (view) | Author: Terry J. Reedy (terry.reedy) * | Date: 2018-10-27 03:03 | |
New changeset d9bff4e81b8ca36fe6c4e90c0b9cf02bc020e713 by Terry Jan Reedy in branch 'master': bpo-35079: Revise difflib.SequenceManager.get_matching_blocks doc (GH-10144) https://github.com/python/cpython/commit/d9bff4e81b8ca36fe6c4e90c0b9cf02bc020e713 |
|||
| msg328629 - (view) | Author: miss-islington (miss-islington) | Date: 2018-10-27 03:07 | |
New changeset cb920c1442bf3b0899fee51915b4bf6614f2c1d7 by Miss Islington (bot) in branch '3.7': bpo-35079: Revise difflib.SequenceManager.get_matching_blocks doc (GH-10144) https://github.com/python/cpython/commit/cb920c1442bf3b0899fee51915b4bf6614f2c1d7 |
|||
| msg328630 - (view) | Author: miss-islington (miss-islington) | Date: 2018-10-27 03:09 | |
New changeset 5282125650a70811f0d64ab231e2a6c8ac20c96b by Miss Islington (bot) in branch '3.6': bpo-35079: Revise difflib.SequenceManager.get_matching_blocks doc (GH-10144) https://github.com/python/cpython/commit/5282125650a70811f0d64ab231e2a6c8ac20c96b |
|||
| msg328631 - (view) | Author: miss-islington (miss-islington) | Date: 2018-10-27 03:09 | |
New changeset e389de8e3e897fa5ba4f71b0f711355fb7158049 by Miss Islington (bot) in branch '2.7': bpo-35079: Revise difflib.SequenceManager.get_matching_blocks doc (GH-10144) https://github.com/python/cpython/commit/e389de8e3e897fa5ba4f71b0f711355fb7158049 |
|||
| History | |||
|---|---|---|---|
| Date | User | Action | Args |
| 2022-04-11 14:59:07 | admin | set | github: 79260 |
| 2018-10-27 03:23:10 | terry.reedy | set | status: open -> closed resolution: fixed stage: patch review -> resolved |
| 2018-10-27 03:09:14 | miss-islington | set | messages: + msg328631 |
| 2018-10-27 03:09:00 | miss-islington | set | messages: + msg328630 |
| 2018-10-27 03:07:46 | miss-islington | set | nosy:
+ miss-islington messages: + msg328629 |
| 2018-10-27 03:04:04 | miss-islington | set | pull_requests: + pull_request9474 |
| 2018-10-27 03:03:52 | miss-islington | set | pull_requests: + pull_request9473 |
| 2018-10-27 03:03:39 | miss-islington | set | stage: needs patch -> patch review pull_requests: + pull_request9472 |
| 2018-10-27 03:03:14 | terry.reedy | set | messages: + msg328628 |
| 2018-10-27 03:00:58 | terry.reedy | set | messages:
+ msg328627 stage: patch review -> needs patch |
| 2018-10-27 02:56:22 | terry.reedy | set | keywords:
+ patch stage: needs patch -> patch review pull_requests: + pull_request9471 |
| 2018-10-26 21:08:46 | tim.peters | set | messages: + msg328606 |
| 2018-10-26 20:25:26 | terry.reedy | set | versions:
+ Python 3.8, - Python 3.4, Python 3.5 messages: + msg328602 assignee: terry.reedy |
| 2018-10-26 19:22:00 | xtreak | set | nosy:
+ xtreak |
| 2018-10-26 19:19:12 | tim.peters | set | nosy:
+ tim.peters messages: + msg328595 |
| 2018-10-26 19:00:41 | Springem Springsbee | create | |