Conversation
|
@jdconrad I wanted to take a crack at regexes "because parsers are fun" (irrelevant ping for @bd808)! It turned out a bit more complicated than I'd liked because |
|
@nik9000 I'll make sure to take a look at this at some point today. Thanks for tackling this! |
|
I like the new constants handling, including the writeConstants method ref :-) I have not closely looked at the antlr code, but the ASM stuff looks fine. I will report once I have done a more close review! |
Thanks! |
|
I think all "synthetic constants" should have a "$" in their name, so they are definitely not accessible outside. Javac does the same (e.g. the assertion constant, or lambda implementation methods). Our handles for the DEF case already have this, maybe make it consistent. A second thing: If the pattern is bullshit, you would only see this in a clinit error. Maybe you should pass the Pattern to Pattern.compile also as a compile-time check (just to try if it compiles). Another question: How to handle pattern arguments like caseinsensitive? currently you have to pass as "(?i)" as part of the regex. PERL users know this as coming after the second slash, e.g. |
|
@nik9000 The constants stuff is awesome. Very clean. I'm wondering if we can use modes instead of custom code for the regexes in the lexer. I just can't help but think we should be able to resolve that divides conflict some other way. |
There was a problem hiding this comment.
Nit: I realize your IDE probably did this, but would you mind switching it back to org.elasticsearch.painless.Definition.Type; and list out the org.objectweb.asm.Type instead? It matches with the other imports already from Definition.
There was a problem hiding this comment.
I did that intentionally but I can revert it.
Sure! I saw that at some point in something else and forgot to adapt. Easy enough to fix.
Yeah. I wonder if I can compile it when I hit is and just pass the compiled Pattern to the initializer.
I was hoping to do that in a followup, but, yes, I think we should hadn't flags after the trailing
I investigated that. We can do it but it makes the lexer much more complex to read though probably simpler to run. Because modes don't inherit from each other you'd need to copy every rule and rename it. It pretty much quadruples the size of the lexer because you need to specify every mode transition and most token's types. Twice. |
|
@nik9000 Yeah, you're right. It would be ridiculous to do this another way. I'm on board with the grammar changes. The only question I have left is should the predicates in the lexer be on the right side of the rules? I think this might have ramifications on how it matches the first and last characters. (In the lexer they can be anywhere.) Edit: The reason I ask this question is right hand predicates seem to be the preferred way in the book. |
That does not work with final variables. Just compiling it while parsing once does not matter in reality, Its just to check that syntax is correct and throw some sytax related exception on parse problem. The compiled pattern is not important. |
I've moved it to the right side and it seems to be being called far fewer times. I just checked by debugging it, but that seems like a good thing? |
|
@uschindler and @jdconrad I believe this is ready for another round. |
|
I'd make the type in addConstant also be a Definition.Type for consistency... unless it is too complicated to setup for the MethodHandles special case... |
|
Ignore that it's too complicated! Maybe only add a second overload for the consistency :) |
That is why I didn't do it in the first place. |
That I'm happy to do. |
Done. |
|
The behavior of the lexer makes sense with the predicate being called fewer times if it's on the right hand side. It will only test the predicate now if it actually matches one of the tokens. |
|
Morning syntactic predicates.... |
|
Ok, from my side all looks fine. I'd just add some tests for more complicated regexes (e.g. those with escaped forward slashes or similar). Also the renamed constants with a $ similar to the others are still missing. Otherwise LGTM from the ASM side. |
There was a problem hiding this comment.
I think you can remove that method now and inline the body above. It was just added to make the code in SSource not rely on internals here.
mourning.... Spelling is hard. |
This is why the method had to die! :-) |
|
OK! Since @uschindler is good with it and we're pretty much done with the review part I've prepared it for merge. @jdconrad, are you good with it? |
|
This looks good to me. +1 @nik9000 Thanks for adding this! |
Adds `/regex/` as a regex constructor. A couple of fun points: 1. This makes generic the idea of arbitrary stuff adding a constant. Both SFunction and LRegex create a statically initialized constant. Both go through Locals to do this because they LRegex isn't directly iterable from SScript. 2. Differentiating `/` as-in-division from `/` as-in-start-of-regex is hard. See: http://www-archive.mozilla.org/js/language/js20-2002-04/rationale/syntax.html#regular-expressions The javascript folks have a way, way tougher time of it then we do because they have semicolon insertion. We have the much simpler delimiter rules. Even with our simpler life we still have to add a hack to get lexing `/regex/` to work properly. I chose to add token-level lookbehind because it seems to be a pretty contained hack. I considered and rejected lexer modes, a lexer member variable, having the parser set variables on the lexer (this is a fairly common solution for js, I believe), and moving regex parsing to the parser level. 3. I've only added a very small subset of java.util.regex to the whitelist because it is the subset I needed to test LRegex sanely. More deserves to be added, and maybe more regex syntax like `=~` and `==~`. Those can probably be added without too much pain.
|
Thanks for reviewing @jdconrad and @uschindler ! |
Adds
/regex/as a regex constructor. A couple of fun points:Both SFunction and LRegex create a statically initialized constant.
Both go through Locals to do this because they LRegex isn't directly
iterable from SScript.
/as-in-division from/as-in-start-of-regexis hard. See:
http://www-archive.mozilla.org/js/language/js20-2002-04/rationale/syntax.html#regular-expressions
The javascript folks have a way, way tougher time of it then we do
because they have semicolon insertion. We have the much simpler
delimiter rules. Even with our simpler life we still have to add
a hack to get lexing
/regex/to work properly. I chose to addtoken-level lookbehind because it seems to be a pretty contained hack.
I considered and rejected lexer modes, a lexer member variable,
having the parser set variables on the lexer (this is a fairly common
solution for js, I believe), and moving regex parsing to the parser
level.
whitelist because it is the subset I needed to test LRegex sanely.
More deserves to be added, and maybe more regex syntax like
=~and==~. Those can probably be added without too much pain.