Subroutine support#230
Conversation
as it fixes a bug with passing a Vec with more slots than is necessary for the number of capture groups
Add detection in analyzer to throw an error when a backreference is used inside a capture group being recursed and refers to the same capture group. This would currently exhibit undefined (or rather, plain wrong) behavior.
| x2("\\g<1>|\\zEND(.a.)", "bac", 0, 3); | ||
|
|
||
| // Compile failed: CompileError(FeatureNotYetSupported("Subroutine Call") | ||
| // Expected group to exist |
There was a problem hiding this comment.
this failure occurs due to Oniguruma by default ignoring numbered capture groups (treating them as non-capture groups) when named groups are defined, and doesn't reflect subroutine call support being lacking in any way. I made a test specifically to prove this.
| x2("(?<n>|\\g<m>\\g<n>)\\z|\\zEND (?<m>a|(b)\\g<m>)", "bbbbabba", 0, 8); | ||
|
|
||
| // Compile failed: CompileError(FeatureNotYetSupported("Subroutine Call")) | ||
| // Match found at start 0 and end 1 (expected 2 and 3) |
There was a problem hiding this comment.
same cause here for the failure
robinst
left a comment
There was a problem hiding this comment.
I've only had time to read the documentation so far, some feedback on that. But in general great to have good docs with lots of examples :).
| @@ -0,0 +1,52 @@ | |||
| # Subroutines: reusable patterns with stable meaning | |||
|
|
|||
| Subroutines in fancy-regex are not "inline expansions" and not "dynamic calls". | |||
There was a problem hiding this comment.
What are "inline expansions" and "dynamic calls"? It would probably make sense to leave that for later, instead of starting with what they aren't.
Also, as an intro to someone who doesn't know subroutines, it's a lot to read without any context. I would start with an example straight away. And I'm not sure that (\d)\g<1>\1 is a good one, because it already brings the complexity of updating the capture. I think (?<word>[a-z]+) ... is better, but some people might think "I can just use [a-z]+ again instead of using a subroutine".
How about something like this: (?<num>\d+\.\d+|\d+) x \g<num>
It's fairly simple and shows how reuse works.
There was a problem hiding this comment.
Nice, updated to your simple example and removed the first line. (I think it is covered by 2_flags.md)
| - executed exactly the way the capture group was originally defined | ||
| - reusable from multiple places | ||
|
|
||
| >> Calling a subroutine does not recompile it in the caller's context. |
There was a problem hiding this comment.
Double blockquote? Also, not sure what this sentence means, sorry :).
There was a problem hiding this comment.
removed it as it doesn't contain any new information not covered in 2_flags.md
| ```rust | ||
| use fancy_regex::Regex; | ||
|
|
||
| let re = Regex::new(r"(\d)\g<1>\1").expect("expected compilation to be successful"); |
There was a problem hiding this comment.
To make this less verbose/distracting, can you turn the expect(...) into plain unwrap() or maybe even ? (not sure if/how possible in examples).
|
|
||
| Let's imagine a pattern which will match a digit and capture it into group 1. Then it will call that capture group as a subroutine. Then it will do a backref to group 1. | ||
|
|
||
| This will match only when a digit is followed by another digit and then itself. |
There was a problem hiding this comment.
"itself" here is not clear. It refers to "another digit", right?
There was a problem hiding this comment.
clarified the wording
| - Makes performance predictable | ||
| - Shifts errors from runtime to compile time (note that this refers to when the regex is compiled to a VM, which is not done by the rust compiler, but when the regex is instanciated/parsed from a string) | ||
|
|
||
| The result is a regex engine that is both expressive and safe - without relying on fragile heuristics or input-dependent behavior. |
There was a problem hiding this comment.
I think this last section here is not necessary, it was pretty clear from the previous sections already. (Is this AI btw?)
There was a problem hiding this comment.
removed the last section. Yes, I used ChatGPT to get me started, as it was quite daunting initially to document it all
| `\g<name>` | ||
| : call the subroutine defined in capture group named *name* \ | ||
| `\g<1>` | ||
| : call the subroutine defined in capture group 1 |
There was a problem hiding this comment.
Can you mention the recursion limit here? Some people might not read on to the subroutines section.
| - Add support for more `\p{...}` and `\P{...}` aliases, for Oniguruma compatibility (#207) | ||
| - Add support for word boundaries and zero-length fancy patterns inside variable lookbehinds (at top level) (#216) | ||
| - Add support for `\R` to mean general newline, matching all common line break characters, treating `\r\n` atomically (#220) | ||
| - Add support for subroutine calls (including recursion) `\g<1>` (#230) |
There was a problem hiding this comment.
Let's also mention the recursion limit here.
- add ordinal numbers, to reflect the same order they are included from `lib.rs` - remove spaces from filenames
- don't start with what subroutines are not - start with a simple example - clarify instead of referring to a digit as `itself`
it just duplicates what was already mentioned
This follows the suggestion documented at https://doc.rust-lang.org/rustdoc/write-documentation/documentation-tests.html#using--in-doc-tests - using `#` to hide lines - explicitly mentioning the error type
robinst
left a comment
There was a problem hiding this comment.
LGTM, awesome work! Just some minor comments/questions.
| if let Some(calls) = self.subroutine_calls.get(&group) { | ||
| if calls.iter().any(|call| call.target_group == group) { | ||
| return Err(Error::CompileError(Box::new( | ||
| CompileError::FeatureNotYetSupported( |
There was a problem hiding this comment.
Is this "not yet" supported or an error that will stay an error?
There was a problem hiding this comment.
this one is apparently not needed for Giallo - none of the TextMate syntaxes seem to use this feature in the regex patterns, but Oniguruma seems to support it, so it is something I would like to eventually tackle, for proper (implicit recursion level aware?) backreference support, so we can support matching palindromes that way etc. Until I work on it, I don't entirely understand how it is supposed to work, but I think it makes sense for it to stay as a not yet supported feature error for now
| // If empty, this is an error in the analysis phase | ||
| if target_info.children.is_empty() { | ||
| return Err(Error::CompileError(Box::new( | ||
| CompileError::FeatureNotYetSupported(format!( |
There was a problem hiding this comment.
Same here, this shouldn't be "not yet", right?
There was a problem hiding this comment.
indeed, thanks, fixed
| } | ||
| self.b.add(Insn::SaveCaptureGroupStart(target_group)); | ||
| self.visit(&target_info.children[0], hard)?; | ||
| self.b.add(Insn::Save(target_group * 2 + 1)); |
There was a problem hiding this comment.
Do we need/want a SaveCaptureGroupEnd?
There was a problem hiding this comment.
not sure, for consistency it makes sense, but at the same time, right now (until we implement recursion level backrefs and see if we need it to behave differently) it would be essentially identical to the Save instruction, except the slot lookup would then be at runtime instead of compile time...
| // The target group doesn't exist (invalid group reference) | ||
| // This should have been caught by analysis, but be defensive | ||
| return Err(Error::CompileError(Box::new( | ||
| CompileError::FeatureNotYetSupported(format!( |
|
|
||
| The main author is Raph Levien, with many contributions from Robin Stocker | ||
| and Keith Hall. | ||
| The main authors are Raph Levien, Robin Stocker and Keith Hall. |
| .iter() | ||
| .filter(|&&g| g == target_group) | ||
| .count(); | ||
| if recursion_count >= MAX_SUBROUTINE_RECURSION_DEPTH { |
There was a problem hiding this comment.
A bit confused by this doing >= 19 but the release notes saying "including recursion up to 20 levels deep". Shouldn't 20 still be ok? Or am I reading it wrong?
There was a problem hiding this comment.
The first execution of the capture group technically isn't a recursion, but should count towards the recursion depth, so the count is off by one. As it stands, we match Oniguruma behavior, which also has a limit of 20, so I think the documentation is correct, even though it is admittedly a little confusing.
There was a problem hiding this comment.
(I think Oniguruma counts like this because the initial visit of a capture group via the normal flow is compiled identically to subroutine call visits.)
There was a problem hiding this comment.
The first execution of the capture group technically isn't a recursion, but should count towards the recursion depth, so the count is off by one.
Aah I see, that's what I was missing.
don't use FeatureNotYetSupported CompileError when it isn't a feature that will ever be supported
Add support for subroutine calls (including recursion) using Oniguruma syntax
\g<1>I have deliberately kept the VM changes minimal for this initial work, and chose to expand the subroutine calls inline in the compiler, up to the recursion depth limit of 20. I also haven't added any unit tests in the compiler to assert how the program looks with recursive calls, to reduce friction and give us more flexibility to change the implementation later when we work on relative recursion level backreferences. Simple subroutine call compilation is tested, and many matching tests have been added, including passing lots of Oniguruma test cases.
Also, huge effort was put into documentation, to explain how subroutines work.