Python: Fix performance issue in charSet#7838
Merged
Merged
Conversation
Observed on `mozilla/bugbug` on the 2.8.0 CLI branch, we had the
following line in the timing report:
```
FullServerSideRequestForgery.ql-17:regex::RegexString::charSet_dispred#fff#antijoin_rhs ............... 1m13s
```
Inspecting the logs, we see the following join:
```
(644s) Tuple counts for regex::RegexString::charSet_dispred#fff#antijoin_rhs/5@f295d1bk after 1m13s:
1 ~0% {1} r1 = CONSTANT(unique string)["]"]
2389 ~4% {3} r2 = JOIN r1 WITH regex::RegexString::nonEscapedCharAt_dispred#fff_201#join_rhs ON FIRST 1 OUTPUT Rhs.1 'arg0', Rhs.2 'arg1', (Rhs.2 'arg1' + 1)
668873 ~0% {6} r3 = JOIN r2 WITH regex::RegexString::char_set_start_dispred#fff ON FIRST 1 OUTPUT Lhs.0 'arg0', "]", Lhs.1 'arg1', Lhs.2 'arg2', Rhs.1 'arg3', Rhs.2 'arg4'
537501371 ~4% {7} r4 = JOIN r3 WITH regex::RegexString::nonEscapedCharAt_dispred#fff_021#join_rhs ON FIRST 2 OUTPUT Lhs.0 'arg0', Lhs.2 'arg1', Lhs.3 'arg2', Lhs.4 'arg3', Lhs.5 'arg4', "]", Rhs.2
269085087 ~0% {7} r5 = SELECT r4 ON In.6 > In.4 'arg4'
89583155 ~3% {7} r6 = SELECT r5 ON In.6 < In.1 'arg1'
89583155 ~26634% {5} r7 = SCAN r6 OUTPUT In.0 'arg0', In.1 'arg1', In.2 'arg2', In.3 'arg3', In.4 'arg4'
return r7
```
Now, this is problematic not just because of the large intermediary join
but also because of the large number of tuples being materialised at the
end. The culprit in this case turns out to be this bit of `charSet`:
```
not exists(int mid | this.nonEscapedCharAt(mid) = "]" | mid > inner_start and mid < inner_end)
```
Rewriting this to instead look for the minimum index at which a `]`
appears resulted in a much nicer join.
I also fixed up a similar issue surrounding the `\N` unicode escape.
Not that I think this will necessarily be relevant, but the `min`-based
solution is more robust either way.
05d8e5b to
22aa4c9
Compare
yoff
requested changes
Feb 4, 2022
yoff
left a comment
Contributor
There was a problem hiding this comment.
Nicer join and nicer code, I would say. Thanks for these fixes.
Also gets rid of `inner_end`, since we're already doing `end - 1 = ...` in the other fix (and so this is more consistent).
Contributor
Author
|
Performance evaluation shows no changes at all, which is to be expected since this was a pathological example. |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Observed on
mozilla/bugbugon the 2.8.0 CLI branch, we had the following line in the timing report:Inspecting the logs, we see the following join:
Now, this is problematic not just because of the large intermediary join but also because of
the large number of tuples being materialised at the end. The culprit in this case turns out to be this bit of
charSet:Rewriting this to instead look for the minimum index at which a
]appears resulted in a much nicer join.I also fixed up a similar issue surrounding the
\Nunicode escape. Not that I think this will necessarily be relevant, but themin-based solution is more robust either way.Performance fix, so no change note required.