bpo-35808: Retire pgen and use pgen2 to generate the parser #11814
Conversation
(force-pushed from 23ff976 to 20fcc02, and from d60bfe5 to adc73e6)
It doesn't appear that there are changes to configure.ac so is there a need to churn aclocal.m4 and configure? |
|
Do you need my review? I'm super behind on stuff so it may be a while, and I don't want to block you. |
|
Hi, in the next line: Line 1736 in 5448767 there is a reference to $(PGEN), which does not exist. By the way, that doesn't give any error because it is empty. |
Good catch! |
@gvanrossum Don't worry. I will try to get someone else to review, but the fact that the interpreter builds and passes the test is a good assurance that the patch is not fundamentally wrong. |
|
Hum. I'm not super excited about duplicating Lib/lib2to3/pgen2/ as Parser/pgen2/ "with a few changes". It sounds like a good recipe for duplicated maintenance. Would it be possible to modify Lib/lib2to3/pgen2/ to respect your constraints instead? Maybe even with "#ifdef"-style branching? (Maybe my question doesn't make sense if the code diverges enough to justify having two "copies" of the "same code".) |
@vstinner We could add some branching to the original pgen2, but IMHO I think having the code in … That being said, if we think it is ok to add branching to pgen2 and invoke that from … P.S. Thanks for reviewing this big PR! |
|
Searching for PGEN in the CPython root, I found some issues:
|
There should be a way to modify lib2to3.pgen2 just enough to allow a user to change some constants and functions, no? I compared Parser/pgen2/ and Lib/lib2to3/pgen2/ using meld. Most code is duplicated, but "some" is modified. That's my worry.

The 194 lines of grammar.py are almost all the same, except for the opmap_raw constant, which is different. That's why I talked about "#ifdef". Would it be possible to modify lib2to3 to allow modifying opmap_raw? For this specific case, it looks like lib2to3 doesn't support PEP 572 whereas your change adds support for ":="... It means that we have 2 copies of the same file, but one is outdated? What's the point of having an outdated parser which doesn't support PEP 572 in lib2to3? lib2to3 should also be modified to support ":=", no? Can't Parser/pgen2/ reuse Lib/lib2to3/pgen2/grammar.py?

For a practical reason, I would suggest renaming Parser/pgen2/ to Parser/pgen/: it would avoid a conflict with the lib2to3.pgen2 package and it reuses the "pgen" name :-)

Parser/pgen2/pgen.py: multiple changes are there to add support for older Python versions. Would it be acceptable to modify lib2to3 and reuse its pgen.py in Parser/pgen2/?

Parser/pgen2/tokenize.py: most code is the same, but you replaced " with ' and made other coding style changes. I'm not against coding style changes, except that here they make it harder to keep Parser/pgen2/tokenize.py in sync with Lib/lib2to3/pgen2/tokenize.py. Either revert the coding style changes or also update Lib/lib2to3/pgen2/tokenize.py.

Parser/pgen2/token.py: you added BACKQUOTE, removed ELLIPSIS, etc. That's maybe one of the trickiest differences, which may prevent code reuse? Or should the other modules be made configurable to accept a "custom" token module, rather than being "hardcoded" to use a specific "token" module? I'm talking about … |
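The "custom token module" idea at the end of that comment could look something like this sketch. All names here are illustrative, not the actual PR code: the point is that a generator class can accept any object exposing token attributes, defaulting to the running interpreter's `token` module, so the same code could be reused with a different token numbering.

```python
import token as default_token_module


class TokenAwareGenerator:
    """Illustrative sketch: accept any token namespace instead of
    hard-coding `import token`, so lib2to3-style code could be reused
    with a custom token numbering."""

    def __init__(self, grammar_path, token_module=None):
        # Fall back to the running interpreter's own token module.
        self.token = token_module if token_module is not None else default_token_module
        self.grammar_path = grammar_path

    def op_value(self):
        # Every lookup goes through the injected namespace.
        return self.token.OP
```

With this shape, passing a stand-in namespace (e.g. one generated from a token specification file) overrides the defaults without touching the generator code.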
|
@vstinner I will work on changes that address these issues, trying to reuse pgen2 as much as possible. Will ping you when it is ready :) |
|
Signing off on the changes to the Windows build files. Haven't looked at the rest. |
    **{f'{prefix}"""': double3prog for prefix in _strprefixes},
    **{prefix: None for prefix in _strprefixes}}
    "'''": single3prog, '"""': double3prog}
endprogs.update({"{}'''".format(prefix): single3prog for prefix in _strprefixes})
pablogsal
Feb 11, 2019
Author
Member
These are needed because Travis uses an old version of Python for the regeneration step.
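As a sketch of the kind of downgrade being discussed (stand-in names; the real code builds regex tables in tokenize.py), the Python 3.6+ dict built with f-strings and `**`-unpacking can be rewritten with `.format()` and `dict.update()` so an older interpreter can run the regeneration step:

```python
# Stand-ins for the compiled regex objects used by the real tokenizer.
single3prog, double3prog = object(), object()
_strprefixes = {'r', 'b', 'rb', 'f'}

# Python 3.6+ style: f-strings and dict unpacking.
endprogs_new = {"'''": single3prog, '"""': double3prog,
                **{f"{prefix}'''": single3prog for prefix in _strprefixes},
                **{f'{prefix}"""': double3prog for prefix in _strprefixes}}

# Downgraded style that also runs on older Pythons: str.format + update().
endprogs_old = {"'''": single3prog, '"""': double3prog}
endprogs_old.update({"{}'''".format(prefix): single3prog for prefix in _strprefixes})
endprogs_old.update({'{}"""'.format(prefix): double3prog for prefix in _strprefixes})

# Both constructions produce the same mapping.
assert endprogs_new == endprogs_old
```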
|
I have made some commits trying to use pgen2, but there are some problems that only show up on Travis and Windows... I will try to find out what is happening. |
|
As I have rebased, the old conversations that were attached to commits can be found here: 80c1217#commitcomment-32381091 104b8e7#commitcomment-32380993 @gvanrossum Ok, I have decided to go with the option you initially suggested and leave …
The reason there is a full implementation of the parser generator in …
|
|
(Honestly I think at this point you might as well have started over with a new PR. But alas.) |
|
Getting closer. Maybe we can be completely independent from lib2to3. |
    "grammar", type=str, help="The file with the grammar definition in EBNF format"
)
parser.add_argument(
    "gramminit_h",
gvanrossum
Feb 21, 2019
Member
Normally this is spelled with one ‘m’. (Also below.)
# Add token to the module cache so tokenize.py uses that exact one instead of
# the one in the stdlib of the interpreter executing this file.
sys.modules['token'] = token
tokenize = importlib.machinery.SourceFileLoader('tokenize',
gvanrossum
Feb 21, 2019
Member
This still looks fragile. Why do we need to use the latest tokenize to parse the Grammar file? The “meta” grammar is super simple, it just has NAME, string literals, and some basic punctuation and operators. The tokenize module from Python 2.4 can handle this. :-)
pablogsal
Feb 21, 2019
Author
Member
The tokenize module from Python 2.4 can handle this.
:)
Why do we need to use the latest tokenize to parse the Grammar file?
It's not that tokenize cannot handle the grammar, but that the tokenizer uses different values for the tokens; it fails when constructing the dfas when calling self.parse():
Traceback (most recent call last):
File "/usr/lib/python3.4/runpy.py", line 170, in _run_module_as_main
"__main__", mod_spec)
File "/usr/lib/python3.4/runpy.py", line 85, in _run_code
exec(code, run_globals)
File "/src/Parser/pgen/__main__.py", line 36, in <module>
main()
File "/src/Parser/pgen/__main__.py", line 29, in main
p = ParserGenerator(args.grammar, token_lines, verbose=args.verbose)
File "/src/Parser/pgen/pgen.py", line 20, in __init__
self.dfas, self.startsymbol = self.parse()
File "/src/Parser/pgen/pgen.py", line 173, in parse
self.expect(self.tokens['OP'], ":")
File "/src/Parser/pgen/pgen.py", line 337, in expect
type, value, self.type, self.value)
File "/src/Parser/pgen/pgen.py", line 356, in raise_error
self.end[1], self.line))
File "./Grammar/Grammar", line 13
single_input: NEWLINE | simple_stmt | compound_stmt NEWLINE
^
SyntaxError: expected 54/:, got 52/:
This is because OP has the value of 52 in Python 3.5 (in this example) and 54 in the tokens that we construct from Grammar/Tokens (or in Lib/token.py). This difference is because the value of 52 is yielded by tokenize (from Python 3.5) when calling next(self.generator) in gettoken. Maybe I am missing something here, but that is the problem I found when trying to use the tokenize from the running Python :(
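The approach eventually adopted — deriving token numbers from the Grammar/Tokens file instead of trusting the running interpreter's tokenize — can be sketched as follows. The file-format handling here is a simplified assumption (one token name per line, optional literal, `#` comments), not the PR's exact code; it just shows how a stable numbering falls out of the file rather than the interpreter version:

```python
def parse_token_file(lines):
    """Assign sequential numbers to token names read from a
    Grammar/Tokens-style specification: one 'NAME' or "NAME 'literal'"
    per line; blank lines and '#' comments ignored. Sketch only."""
    tokens = {}
    for line in lines:
        line = line.strip()
        if not line or line.startswith('#'):
            continue
        name = line.split()[0]
        if name not in tokens:
            tokens[name] = len(tokens)  # sequential numbering
    return tokens


spec = """\
# Comments and blank lines are skipped.
ENDMARKER
NAME
NUMBER
COLONEQUAL ':='
"""
tokens = parse_token_file(spec.splitlines())
assert tokens == {'ENDMARKER': 0, 'NAME': 1, 'NUMBER': 2, 'COLONEQUAL': 3}
```

Because the numbering comes from the file, any interpreter running the generator produces the same values, regardless of what its own token module says.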
@@ -0,0 +1,97 @@
from lib2to3.pgen2 import grammar
gvanrossum
Feb 21, 2019
Member
Maybe also copy this, so we’re completely independent from lib2to3?
import collections
import importlib.machinery

# Use Lib/token.py and Lib/tokenize.py to obtain the tokens. To maintain this
gvanrossum
Feb 21, 2019
Member
I would get them directly from Grammar/Tokens
|
OK I will try to play with this myself.
|
|
I think I have the solution, at least for … |
Of course! Yeah, I can confirm that this works perfectly. Thanks for your patience in going through this problem with me. :) With this, if we read from … |
|
Ship it! |
|
@vstinner can you re-review this? |
|
@pablogsal You can remove the other requested reviewers who haven't said anything yet. |
|
@gvanrossum @pablogsal: Sorry, I was busy with other stuff these last days and so I failed to follow this PR. Well, I reviewed an earlier version of the code. The code changed a lot in the meantime. I trust Guido to be able to review such a change properly. Pablo: please don't wait for me if you want to merge this PR. |
|
@pablogsal, feel free to merge. |
|
Whee! |
…-11814) Pgen is the oldest piece of technology in the CPython repository; building it requires various #if[n]def PGEN hacks in other parts of the code, and it also depends more and more on CPython internals. This commit removes the old pgen C code and replaces it with a new version implemented in pure Python. This is a modified and adapted version of lib2to3/pgen2 that can generate grammar files compatible with the current parser. This commit also eliminates all the #ifdef and code branches related to pgen, simplifying the code and making it more maintainable. The regen-grammar step now uses $(PYTHON_FOR_REGEN), which can be any version of the interpreter, so the new pgen code maintains compatibility with older versions of the interpreter (this also allows regenerating the grammar with the current CI solution that uses Python 3.5). The new pgen Python module also makes use of the Grammar/Tokens file that holds the token specification, so it is always kept in sync and avoids having to maintain duplicate token definitions.
This problem turned out to be a bit more complex because pgen2 needed some modifications to make it compatible with the old pgen output (apart from the extra glue to generate graminit.c and graminit.h). I also had to downgrade some f-strings and '**'-unpacking because Travis uses an old Python for the regeneration step.

This PR implements the following:
Removes pgen and all related files from the source tree and Makefile.
Substitutes pgen with a modified and reduced version of Lib/lib2to3/pgen2
(only the pgen functionality is included):
Add new tokens (like COLONEQUAL, TYPE_IGNORE,... etc) to
synchronize with the current Grammar.
All dfa names are included in the label list (this has the effect
of including 'file_input', 'eval_input', ... in the list of
labels to guarantee compatibility with the old pgen generated
files).
The order in which dfas are generated is preserved to maintain
compatibility with old pgen generated files and avoid empty stacks
when the parser processes the rule. If this change is not made,
the parser fails with:
"""' ... It's a token we know
DFA 'and_expr', state 0: Push 'shift_expr'
DFA 'shift_expr', state 0: Push 'arith_expr'
DFA 'arith_expr', state 0: Push 'term'
DFA 'term', state 0: Push 'factor'
DFA 'factor', state 0: Push 'power'
DFA 'power', state 0: Push 'atom_expr'
DFA 'atom_expr', state 0: Push 'atom'
DFA 'atom', state 0: Shift.
Token NEWLINE/'' ... It's a token we know
DFA 'atom', state 5: Pop ...
DFA 'atom_expr', state 2: Pop ...
DFA 'power', state 1: Pop ...
DFA 'factor', state 2: Pop ...
DFA 'term', state 1: Pop ...
DFA 'arith_expr', state 1: Pop ...
DFA 'shift_expr', state 1: Pop ...
DFA 'and_expr', state 1: Pop ...
Error: bottom of stack.
This seems to happen because instead of starting with
'single_input' / 'file_input' ... etc. it starts with
'and_expr', just because it's the second element in the
dfa list in graminit.c. This makes the parser push
tokens from 'and_expr', so before the shift can happen,
it arrives at an empty stack when popping the nonterminals.
First sets are represented with Python sets instead of dictionaries
with 1 as the values.
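A minimal illustration of this representation change (values are made up): membership tests behave identically, only the container type differs, and the dict form converts directly to the set form.

```python
# Old pgen2 style: dict mapping each label to a dummy 1.
first_dict = {'NAME': 1, 'NUMBER': 1, 'STRING': 1}

# New style: a plain set of labels.
first_set = {'NAME', 'NUMBER', 'STRING'}

# Lookups are unchanged, and the dict keys convert directly.
assert ('NAME' in first_dict) == ('NAME' in first_set)
assert set(first_dict) == first_set
```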
Added (optional) verbose output that dumps different elements of
the grammar and displays the first sets of every nonterminal.
pgen2 is now invoked as:
python -m Parser.pgen2 Grammar/Grammar Include/graminit.h Python/graminit.c
(optionally add '-v' for verbose output).
pgen included "<>" in the list of labels but pgen2 does not, as it
thinks it is already in the list of tokens because it collides with
the value of "!=". This is not a problem because this label is a
terminal and is handled manually, and Parser/token.c produces the
token NOTEQUAL anyway. This is the reason there are now 183 labels
where pgen generated 184.
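The label-count difference can be illustrated with a small sketch (the token value here is made up): both operator spellings map to the same NOTEQUAL token, so a generator that deduplicates labels by token value emits only one of them.

```python
NOTEQUAL = 28  # illustrative value, not the real one

# Both spellings map to the same token, as in the grammar's opmap.
opmap = {'!=': NOTEQUAL, '<>': NOTEQUAL}

labels = {}
for text, tok in opmap.items():
    # Deduplicating by token value: the later spelling ('<>') is dropped.
    labels.setdefault(tok, text)

assert labels == {NOTEQUAL: '!='}  # one label where old pgen emitted two
```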
Changes the regen-grammar to use pgen2 instead of pgen.
Regenerates Makefiles using autoreconf.
https://bugs.python.org/issue35808
This is just a (working) first proposal to start discussing the issue.