Skip to content

Python: Fix performance issue in charSet#7838

Merged
yoff merged 2 commits into
github:mainfrom
tausbn:python-fix-charset-performance-problem
Feb 4, 2022
Merged

Python: Fix performance issue in charSet#7838
yoff merged 2 commits into
github:mainfrom
tausbn:python-fix-charset-performance-problem

Conversation

@tausbn

@tausbn tausbn commented Feb 3, 2022

Copy link
Copy Markdown
Contributor

Observed on mozilla/bugbug on the 2.8.0 CLI branch, we had the following line in the timing report:

FullServerSideRequestForgery.ql-17:regex::RegexString::charSet_dispred#fff#antijoin_rhs ............... 1m13s

Inspecting the logs, we see the following join:

(644s) Tuple counts for regex::RegexString::charSet_dispred#fff#antijoin_rhs/5@f295d1bk after 1m13s:
1         ~0%         {1} r1 = CONSTANT(unique string)["]"]
2389      ~4%         {3} r2 = JOIN r1 WITH regex::RegexString::nonEscapedCharAt_dispred#fff_201#join_rhs ON FIRST 1 OUTPUT Rhs.1 'arg0', Rhs.2 'arg1', (Rhs.2 'arg1' + 1)
668873    ~0%         {6} r3 = JOIN r2 WITH regex::RegexString::char_set_start_dispred#fff ON FIRST 1 OUTPUT Lhs.0 'arg0', "]", Lhs.1 'arg1', Lhs.2 'arg2', Rhs.1 'arg3', Rhs.2 'arg4'
537501371 ~4%         {7} r4 = JOIN r3 WITH regex::RegexString::nonEscapedCharAt_dispred#fff_021#join_rhs ON FIRST 2 OUTPUT Lhs.0 'arg0', Lhs.2 'arg1', Lhs.3 'arg2', Lhs.4 'arg3', Lhs.5 'arg4', "]", Rhs.2
269085087 ~0%         {7} r5 = SELECT r4 ON In.6 > In.4 'arg4'
89583155  ~3%         {7} r6 = SELECT r5 ON In.6 < In.1 'arg1'
89583155  ~26634%     {5} r7 = SCAN r6 OUTPUT In.0 'arg0', In.1 'arg1', In.2 'arg2', In.3 'arg3', In.4 'arg4'
                    return r7

Now, this is problematic not just because of the large intermediary join but also because of
the large number of tuples being materialised at the end. The culprit in this case turns out to be this bit of charSet:

not exists(int mid | this.nonEscapedCharAt(mid) = "]" | mid > inner_start and mid < inner_end)

Rewriting this to instead look for the minimum index at which a ] appears resulted in a much nicer join.

I also fixed up a similar issue surrounding the \N unicode escape. Not that I think this will necessarily be relevant, but the min-based solution is more robust either way.


Performance fix, so no change note required.

@tausbn tausbn added the no-change-note-required This PR does not need a change note label Feb 3, 2022
@tausbn tausbn requested a review from a team as a code owner February 3, 2022 18:28
@github-actions github-actions Bot added the Python label Feb 3, 2022
Observed on `mozilla/bugbug` on the 2.8.0 CLI branch, we had the
following line in the timing report:
```
FullServerSideRequestForgery.ql-17:regex::RegexString::charSet_dispred#fff#antijoin_rhs ............... 1m13s
```

Inspecting the logs, we see the following join:

```
(644s) Tuple counts for regex::RegexString::charSet_dispred#fff#antijoin_rhs/5@f295d1bk after 1m13s:
1         ~0%         {1} r1 = CONSTANT(unique string)["]"]
2389      ~4%         {3} r2 = JOIN r1 WITH regex::RegexString::nonEscapedCharAt_dispred#fff_201#join_rhs ON FIRST 1 OUTPUT Rhs.1 'arg0', Rhs.2 'arg1', (Rhs.2 'arg1' + 1)
668873    ~0%         {6} r3 = JOIN r2 WITH regex::RegexString::char_set_start_dispred#fff ON FIRST 1 OUTPUT Lhs.0 'arg0', "]", Lhs.1 'arg1', Lhs.2 'arg2', Rhs.1 'arg3', Rhs.2 'arg4'
537501371 ~4%         {7} r4 = JOIN r3 WITH regex::RegexString::nonEscapedCharAt_dispred#fff_021#join_rhs ON FIRST 2 OUTPUT Lhs.0 'arg0', Lhs.2 'arg1', Lhs.3 'arg2', Lhs.4 'arg3', Lhs.5 'arg4', "]", Rhs.2
269085087 ~0%         {7} r5 = SELECT r4 ON In.6 > In.4 'arg4'
89583155  ~3%         {7} r6 = SELECT r5 ON In.6 < In.1 'arg1'
89583155  ~26634%     {5} r7 = SCAN r6 OUTPUT In.0 'arg0', In.1 'arg1', In.2 'arg2', In.3 'arg3', In.4 'arg4'
                    return r7
```
Now, this is problematic not just because of the large intermediary join
but also because of the large number of tuples being materialised at the
end. The culprit in this case turns out to be this bit of `charSet`:
```
not exists(int mid | this.nonEscapedCharAt(mid) = "]" | mid > inner_start and mid < inner_end)
```

Rewriting this to instead look for the minimum index at which a `]`
appears resulted in a much nicer join.

I also fixed up a similar issue surrounding the `\N` unicode escape.
Not that I think this will necessarily be relevant, but the `min`-based
solution is more robust either way.
@tausbn tausbn force-pushed the python-fix-charset-performance-problem branch from 05d8e5b to 22aa4c9 Compare February 3, 2022 20:42

@yoff yoff left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nicer join and nicer code, I would say. Thanks for these fixes.

Comment thread python/ql/lib/semmle/python/regex.qll Outdated
Comment thread python/ql/lib/semmle/python/regex.qll
Also gets rid of `inner_end`, since we're already doing `end - 1 = ...`
in the other fix (and so this is more consistent).
@tausbn

tausbn commented Feb 4, 2022

Copy link
Copy Markdown
Contributor Author

Performance evaluation shows no changes at all, which is to be expected since this was a pathological example.

@yoff yoff left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@yoff yoff merged commit 182c62f into github:main Feb 4, 2022
@tausbn tausbn deleted the python-fix-charset-performance-problem branch February 4, 2022 13:34
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

no-change-note-required This PR does not need a change note Python

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants