Improved tokenizing of context sensitive keywords#3484
Improved tokenizing of context sensitive keywords#3484gsherwood merged 2 commits intosquizlabs:masterfrom
Conversation
2a9833f to
1f2580d
Compare
1f2580d to
a51355f
Compare
|
@gsherwood Please, this PR should be merged before #3478 and #3483. It should really help in the future. I've already simplified the support for |
|
Left a couple of comments for minor stuff, but looks good to merge otherwise. |
a51355f to
5383484
Compare
jrfnl
left a comment
There was a problem hiding this comment.
Nice one!, This would be a great step forward, but there's still quite a lot more to do in this regards (though not necessarily all in this PR). Also see: #3336
Note: I locally made a start on a PR to handle this when I opened that issue, but got side-tracked before I could finish it as it is hard to get that one right, especially when trying to account for all situations, including multi-use, group use etc.
Happy to share the code I have though.
Regarding this PR:
I wonder if there aren't more chunks of code in the tokenizer after this new block, which could now be removed, or simplified ?
In particular, I'm looking at the "string-like token after a function keyword" block round line 1740 and the "special case for PHP 5.6 use function and use const" block round line 2168.
More extensive unit tests would also help catch edge cases.
| $preserveKeyword = true; | ||
| } | ||
|
|
||
| // `namespace\` should be preserved |
There was a problem hiding this comment.
What about namespace ... ;, i.e declaration ? Every part of the name should be tokenized as T_STRING.
There was a problem hiding this comment.
There was a problem hiding this comment.
That's not what I meant (please unresolve this).
Think:
namespace foreach;
namespace my\class\extensions;There was a problem hiding this comment.
Noted. That test case still only safeguards the first of the above examples though, the current tokenizer would still fail on the second.
There was a problem hiding this comment.
Added one more test :)
There was a problem hiding this comment.
Thanks. Would be wonderful though if the other parts of the namespace name would also be checked. The test currently only checks the first part, i.e. my.
What I'm trying to point out, is that on PHP < 8.0, this snippet:
<?php
namespace my\class\extensions;
namespace my\foreach\yield;... would be tokenized as - take note of class, yield and foreach:
Ptr | Ln | Col | Cond | ( #) | Token Type | [len]: Content
-------------------------------------------------------------------------
0 | L1 | C 1 | CC 0 | ( 0) | T_OPEN_TAG | [5]: <?php
1 | L2 | C 1 | CC 0 | ( 0) | T_WHITESPACE | [0]:
2 | L3 | C 1 | CC 0 | ( 0) | T_NAMESPACE | [9]: namespace
3 | L3 | C 10 | CC 0 | ( 0) | T_WHITESPACE | [1]:
4 | L3 | C 11 | CC 0 | ( 0) | T_STRING | [2]: my
5 | L3 | C 13 | CC 0 | ( 0) | T_NS_SEPARATOR | [1]: \
6 | L3 | C 14 | CC 0 | ( 0) | T_CLASS | [5]: class
7 | L3 | C 19 | CC 0 | ( 0) | T_NS_SEPARATOR | [1]: \
8 | L3 | C 20 | CC 0 | ( 0) | T_STRING | [10]: extensions
9 | L3 | C 30 | CC 0 | ( 0) | T_SEMICOLON | [1]: ;
10 | L3 | C 31 | CC 0 | ( 0) | T_WHITESPACE | [0]:
11 | L4 | C 1 | CC 0 | ( 0) | T_NAMESPACE | [9]: namespace
12 | L4 | C 10 | CC 0 | ( 0) | T_WHITESPACE | [1]:
13 | L4 | C 11 | CC 0 | ( 0) | T_STRING | [2]: my
14 | L4 | C 13 | CC 0 | ( 0) | T_NS_SEPARATOR | [1]: \
15 | L4 | C 14 | CC 0 | ( 0) | T_FOREACH | [7]: foreach
16 | L4 | C 21 | CC 0 | ( 0) | T_NS_SEPARATOR | [1]: \
17 | L4 | C 22 | CC 0 | ( 0) | T_YIELD | [5]: yield
18 | L4 | C 27 | CC 0 | ( 0) | T_SEMICOLON | [1]: ;
19 | L4 | C 28 | CC 0 | ( 0) | T_WHITESPACE | [0]:
... while with your change, it will be tokenized (better), like so:
Ptr | Ln | Col | Cond | ( #) | Token Type | [len]: Content
-------------------------------------------------------------------------
0 | L1 | C 1 | CC 0 | ( 0) | T_OPEN_TAG | [5]: <?php
1 | L2 | C 1 | CC 0 | ( 0) | T_WHITESPACE | [0]:
2 | L3 | C 1 | CC 0 | ( 0) | T_NAMESPACE | [9]: namespace
3 | L3 | C 10 | CC 0 | ( 0) | T_WHITESPACE | [1]:
4 | L3 | C 11 | CC 0 | ( 0) | T_STRING | [2]: my
5 | L3 | C 13 | CC 0 | ( 0) | T_NS_SEPARATOR | [1]: \
6 | L3 | C 14 | CC 0 | ( 0) | T_STRING | [5]: class
7 | L3 | C 19 | CC 0 | ( 0) | T_NS_SEPARATOR | [1]: \
8 | L3 | C 20 | CC 0 | ( 0) | T_STRING | [10]: extensions
9 | L3 | C 30 | CC 0 | ( 0) | T_SEMICOLON | [1]: ;
10 | L3 | C 31 | CC 0 | ( 0) | T_WHITESPACE | [0]:
11 | L4 | C 1 | CC 0 | ( 0) | T_NAMESPACE | [9]: namespace
12 | L4 | C 10 | CC 0 | ( 0) | T_WHITESPACE | [1]:
13 | L4 | C 11 | CC 0 | ( 0) | T_STRING | [2]: my
14 | L4 | C 13 | CC 0 | ( 0) | T_NS_SEPARATOR | [1]: \
15 | L4 | C 14 | CC 0 | ( 0) | T_STRING | [7]: foreach
16 | L4 | C 21 | CC 0 | ( 0) | T_NS_SEPARATOR | [1]: \
17 | L4 | C 22 | CC 0 | ( 0) | T_STRING | [5]: yield
18 | L4 | C 27 | CC 0 | ( 0) | T_SEMICOLON | [1]: ;
19 | L4 | C 28 | CC 0 | ( 0) | T_WHITESPACE | [0]:
There is no difference on PHP 8.0+ as the "retokenization of namespaced names" already takes care of it in that case.
There was a problem hiding this comment.
Ok, now I finally get it :) Added.
| && $tokenIsArray === true | ||
| && $token[0] === T_STRING | ||
| && strtolower($token[1]) === 'yield' | ||
| && isset($this->tstringContexts[$finalTokens[$lastNotEmptyToken]['code']]) === false |
There was a problem hiding this comment.
Should a similar condition be added for the match retokenization round line 1439 ? And in that case, could the "final check" in PHP::processAdditional() be possibly removed/simplified ?
Similar question for the fn condition round line 1721.
There was a problem hiding this comment.
T_MATCH can probably simplified but I'm not sure about this test case:
/* testLiveCoding */
// Intentional parse error. This has to be the last test in the file.
echo match
The token is currently tokenized as T_STRING but I'm not sure if it's right.
src/Util/Tokens.php
Outdated
| T_MATCH => T_MATCH, | ||
| T_NAMESPACE => T_NAMESPACE, | ||
| T_NEW => T_NEW, | ||
| T_PARENT => T_PARENT, |
There was a problem hiding this comment.
T_PARENT is PHPCS native token and doesn't exist in PHP itself.
There was a problem hiding this comment.
The test if failing when I remove the T_PARENT token.
1) PHP_CodeSniffer\Tests\Core\Tokenizer\ContextSensitiveKeywordsTest::testKeywords with data set #19 ('/* testParentIsKeyword */', 'T_PARENT')
Failed to find test target token for comment string: /* testParentIsKeyword */
Failed asserting that true is false.
There was a problem hiding this comment.
@kukulich That's because it can't find token after the comment (as the custom token is no longer in the list), not because the tokenization is wrong.
There was a problem hiding this comment.
@jrfnl Yes, you're right. However I'm not sure if the tokens should be removed here or not...
There was a problem hiding this comment.
I suppose that depends on where else this token list will be used, but for the current use case, those tokens aren't needed in the list.
2be5165 to
6ce46dc
Compare
"string-like token after a function keyword" is gone. "special case for PHP 5.6 use function and use const" cannot be removed (without other work) but "This is a special case for the PHP 5.5 classname::class syntax" is gone. |
Nice find! Yeah, of course, with the |
e4b1a34 to
8a71b96
Compare
8a71b96 to
c73f456
Compare
|
Just an update that I'm trying to merge this at the moment. I'm needing to resolve a lot of conflicts in the 4.0 branch first, and adjust the namespace changes over there as well. |
|
Finally got this merged and tested after Github had issues today, but all working. Appreciate all the work getting this done. Keen to have anyone else run their eyes over things as well to make sure I haven't missed something obvious. |
If you like, I'll try to have a look at the 4.x branch later in the week ? (on that note: there are still a few PRs open for that branch) |
Other tokens should be added after these PR are merged:
enumkeyword #3478readonlykeyword #3480T_ENUM_CASE#3483