python/cpython

Conversation

Member

lysnikolaou commented Apr 14, 2020

When there is a SyntaxError after reading the last input character from
the tokenizer and if no newline follows it, the error message used to be
`unexpected EOF while parsing`, which is wrong.

https://bugs.python.org/issue40267

Member Author

CC: @gvanrossum @pablogsal

Member

gvanrossum left a comment


This looks straightforward enough. I have one niggling thought. Why is tok->done set to E_EOF in the first place?

Member Author

Take an example where the last character produces a SyntaxError, like `x+@`. When it tokenizes the `@` character, the tokenizer checks for a two- or three-character token and thus reaches EOF. Upon doing so, the tokenizer state gets updated so that `tok->done` is `E_EOF`.

Note that if the tokenizer reaches EOF, it cannot back up, because it would go into an endless loop if it did.

Member

I'm not sure I entirely believe that:

>>> eval('+and ')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "<string>", line 1
    +and 
     ^
SyntaxError: invalid syntax
>>> eval('+and')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "<string>", line 1
    +and
     ^
SyntaxError: unexpected EOF while parsing
>>> 

But it does look like it always has to do with the final operator ending the file, so you're close.

Member Author

lysnikolaou commented Apr 15, 2020

Ohh, you're right! It's not the last character, it's the last token. In your example, `and` just gets tokenized as a NAME, which means all of its characters are consumed until a character is found that is not a valid identifier character. In your first example that's the space; in the second it's EOF. So `tok->done` gets the value `E_EOF` there.
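(Editor's note: the pure-Python `tokenize` module mirrors this behavior and makes it easy to see `and` being consumed as a single NAME token. The trailing newline is added so the example works uniformly across versions.)

```python
import io
import token
import tokenize

# Tokenize "+and\n": the "and" keyword is emitted as one NAME token whose
# characters are consumed up to the first non-identifier character (here
# the newline), matching the description of the C tokenizer above.
toks = [
    (token.tok_name[t.type], t.string)
    for t in tokenize.generate_tokens(io.StringIO("+and\n").readline)
]
print(toks)
```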

I still think this fix catches all these cases (I tested your example and a few more; should I maybe add tests for these?) and does not create any new problems, since `E_SYNTAX` is what gets propagated up anyway if it's not `E_EOF`. Right?

Member

OK, so the `done` field is set to `E_EOF` when the tokenizer sees the EOF after the last token. This is harmless if the program is valid, since then the token just gets processed, but when there's a syntax error on the last token, the EOF state incorrectly modifies the error message.

I'll merge now.

gvanrossum merged commit 9a4b38f into python:master Apr 15, 2020
lysnikolaou deleted the tokenizer-bug branch April 24, 2020 00:30