[ty] Render all changed diagnostics in conformance.py #23613

AlexWaygood merged 20 commits into astral-sh:main
Conversation
Typing conformance results: No changes detected ✅ The percentage of diagnostics emitted that were expected errors held steady at 86.56%. The percentage of expected errors that received a diagnostic held steady at 77.40%.
Force-pushed from a757c4f to 9752926
Memory usage report

Summary — significant changes (detailed breakdown): prefect, sphinx, trio, flake8
Thanks @WillDuke for working on improving this! It seems to me that having the same line(s) show up as both e.g. "true positives added" and "true positives removed" is pretty confusing. I would rather have a separate "diagnostics changed" section where we give the diff for any line which didn't change true/false positive status, but where the diagnostic is now different.

Will this PR also fix the issue observed just now in #23611, where a line went from one false positive to two false positives, resulting in a "1 added false positive" in the summary table, but no details below? I think in that case it could also just show up in "diagnostics changed", since the line did not change status.

(This should maybe be separate, but while I'm wishing for features, it would be awesome if this job were to notify us if a PR moves an entire conformance-suite file into -- or I guess out of -- fully-passing state.)

This job is already a massive improvement over what we had before, thank you so much!
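The "diagnostics changed" behavior requested here, reporting a line whose status is unchanged but whose diagnostics differ, could be sketched roughly like this (hypothetical data shapes, not the actual conformance.py code):

```python
def partition_changes(old, new):
    """Partition test groups into changed/added/removed by comparing
    the diagnostics each ty version emitted for the group."""
    changed = {
        group: (old[group], new[group])
        for group in old.keys() & new.keys()
        if old[group] != new[group]
    }
    added = {group: new[group] for group in new.keys() - old.keys()}
    removed = {group: old[group] for group in old.keys() - new.keys()}
    return changed, added, removed

# A line that went from one false positive to two shows up as
# "changed" rather than as a bare "+1 false positive" in the summary.
old = {"t.py:121": ["error[invalid-argument-type] ..."]}
new = {"t.py:121": ["error[invalid-argument-type] ...",
                    "error[invalid-argument-type] ..."]}
changed, added, removed = partition_changes(old, new)
```

With this shape, the #23611 case lands in `changed` for its line instead of producing an unexplained entry in the added/removed counts.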
Thanks, Carl! Yeah, I wasn't sure how best to render the changed ones. Maybe a wider table that has both versions? Or perhaps just show the new diagnostics? And yes, this would fix that issue. At least, it would now show one row under each of false positives removed and false positives added, with the latter having two grouped diagnostics in the same confusing way as seen here. I can also look at adding a file-level summary!

Edit: I went ahead and tried one approach: changed diagnostics get their own table with rows alternating between the diagnostics removed and added for each test case. The current conformance comment has updated to reflect these changes. The changes I added to demonstrate the fix didn't cause a new file to fully pass, but they will be listed now with this change.

Given the confusion that the other approach was causing, I've changed the metrics to count test cases rather than diagnostics. This means that you won't see spurious regressions appearing in the summary table when in fact ty's performance on the test cases improved. However, you will also not see any change in the summary table if e.g. false positive diagnostics were removed without changing the status of the test case. You would have to expand the false positives changed tab to see that some of the false positive diagnostics had been removed.
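The switch to counting test cases rather than diagnostics can be illustrated with a small sketch (the grouping key below is an assumption for illustration; the real script may group differently):

```python
from collections import defaultdict

def count_test_cases(diagnostics):
    """Group diagnostics by (file, line) so that multiple diagnostics
    on the same expected-error line count as a single test case."""
    cases = defaultdict(list)
    for file, line, message in diagnostics:
        cases[(file, line)].append(message)
    return len(cases)

# Two false positives on the same line count as one test case, so
# removing one of them does not change the summary-table counts.
diags = [
    ("a.py", 10, "error[invalid-argument-type] ..."),
    ("a.py", 10, "error[invalid-argument-type] ..."),
    ("b.py", 3, "error[unresolved-reference] ..."),
]
print(count_test_cases(diags))  # 2 test cases from 3 diagnostics
```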
The latest conformance comment here looks pretty good to me -- curious what others think. @AlexWaygood ? |
AlexWaygood left a comment
Haven't done a detailed review of the code, but the output looks great on this PR's comment! A couple of nits:
scripts/conformance.py (Outdated)

```python
display = f"(unknown):{min_line}:{max_line}"
...
rowspan_attr = f' rowspan="{rowspan}"' if rowspan > 1 else ""
return f'<td{rowspan_attr} align="center" valign="middle"><a href="{url}">{display}</a></td>'
```
I actually don't love the centre-aligned formatting... and it looks like it might be making GitHub render the table slightly smaller than it normally would?
Here's what the "True positives changed" table looks like currently:
I wonder if we actually need such fancy rendering for that table? Just a two-column table that shows the line (with a hyperlink) in the first column and the raw diff for that line in the second column would present the information more concisely, and possibly more readably?
I made that table with a pretty unholy mixture of HTML and MarkDown -- here's the source for it 😆

<table>
<tr>
<th>Test case</th>
<th>Diff</th>
</tr>
<tr>
<td>
[dataclasses_transform_converter.py:121](https://github.com/python/typing/blob/e9fccc9dbbd8f1e8b24b4f88911c3d3155059e2a/conformance/tests/dataclasses_transform_converter.py#L121)
</td>
<td>
```diff
-error[invalid-argument-type] Argument is incorrect: Expected `int`, found `Literal["f1"]`
-error[invalid-argument-type] Argument is incorrect: Expected `int`, found `Literal["f2"]`
-error[invalid-argument-type] Argument is incorrect: Expected `ConverterClass`, found `Literal[b"f3"]`
-error[invalid-argument-type] Argument is incorrect: Expected `int`, found `list[Unknown]`
+error[invalid-argument-type] Argument is incorrect: Expected `str`, found `Literal[1]`
```
</td>
</tr>
</table>
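A small Python sketch of a renderer for that two-column layout (a hypothetical helper for illustration, not the code this PR adds):

```python
FENCE = "`" * 3  # built dynamically so the fence isn't literal in this sketch

def render_diff_table(rows):
    """Render a two-column HTML table: a hyperlinked test case and a
    fenced diff of removed (-) and added (+) diagnostics."""
    parts = ["<table>", "<tr>", "<th>Test case</th>", "<th>Diff</th>", "</tr>"]
    for url, label, removed, added in rows:
        diff_lines = [f"-{d}" for d in removed] + [f"+{d}" for d in added]
        parts += [
            "<tr>",
            "<td>", f"[{label}]({url})", "</td>",
            "<td>", "",
            f"{FENCE}diff", *diff_lines, FENCE,
            "", "</td>",
            "</tr>",
        ]
    parts.append("</table>")
    return "\n".join(parts)
```

The blank lines around the fenced block matter: GitHub only switches back to Markdown processing inside an HTML `<td>` when the Markdown content is separated from the surrounding tags by blank lines.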
Before I started this I didn't know that I could merge cells in an html table in a comment, and I definitely didn't know that the diffs would render in a cell! Dredging the depths of what's possible in GitHub-flavored markdown 😄. I've tried to address each of those items. Let me know what you think.
Awesome, thank you!! I think you'll have to fixup the merge conflicts (probably caused by me...) before CI will run again, unfortunately
I ran the script on this branch locally in lieu of a CI run, and here's what it gave me:

Typing conformance results improved 🎉

Summary

How are test cases classified? Each test case represents one expected error annotation or a group of annotations sharing a tag. Counts are per test case, not per diagnostic — multiple diagnostics on the same line count as one. Required annotations (`E`) are true positives when ty flags the expected location and false negatives when it does not. Optional annotations (`E?`) are true positives when flagged but true negatives (not false negatives) when not. Tagged annotations (`E[tag]`) require ty to flag exactly one of the tagged lines; tagged multi-annotations (`E[tag+]`) allow any number up to the tag count. Flagging unexpected locations counts as a false positive.
Files changed
True positives added (1): 1 diagnostic
False positives removed (4): 4 diagnostics
True positives changed (4): 4 diagnostics
False positives changed (1): 1 diagnostic
Which looks awesome. Happy to merge once the unrelated dataclass_transform changes are dropped!
Force-pushed from 0da638a to b9a6b5a
@AlexWaygood Sorry, it's been a busy week for me. Thanks for taking it home!
* main:
  - Update conformance suite commit hash (#23746)
  - conformance.py: Collapse the summary paragraph when nothing changed (#23745)
  - [ty] Make inferred specializations line up with source types more better (#23715)
  - Bump 0.15.5 (#23743)
  - [ty] Render all changed diagnostics in conformance.py (#23613)
  - [ty] Split deferred checks out of `types/infer/builder.rs` (#23740)
  - Discover markdown files by default in preview mode (#23434)
  - [ty] Use `HasOptionalDefinition` for `except` handlers (#23739)
  - [ty] Fix precedence of `all` selector in TOML configurations (#23723)
  - [ty] Make `all` selector case sensitive (#23713)
  - [ty] Add a diagnostic if a `TypeVar` is used to specialize a `ParamSpec`, or vice versa (#23738)
  - [ty] Override home directory in ty tests (#23724)
  - [ty] More type-variable default validation (#23639)
  - [ty] Validate bare ParamSpec usage in type annotations, and support stringified ParamSpecs as the first argument to `Callable` (#23625)
  - [ty] Add `all` selector to ty.json's `schema` (#23721)
  - [ty] Add quotes to related issues links (#23720)
  - [ty] Fix panic on incomplete except handlers (#23708)
Summary

This is meant to fix astral-sh/ty#2786. Previously, the conformance.py script would only render changed diagnostics if one of the ty versions raised no diagnostics in the test group. This PR updates the change-detection logic and the renderer so that we display both the old and new diagnostics if they are not identical. I've also created separate sections for cases where diagnostics changed without changing the classification, as well as a new "Files changed" section that shows how performance at the file level has changed.

Test Plan
I'll try cherry-picking #23088 and take a look at the comment.
I also ran the following command to see how the output would look against 0.0.18:
Which returned the following:
conformance.md
Typing conformance results improved 🎉
Summary
Each test case represents one expected error annotation or a group of annotations sharing a tag. Counts are per test case, not per diagnostic — multiple diagnostics on the same line count as one. Required annotations (`E`) are true positives when ty flags the expected location and false negatives when it does not. Optional annotations (`E?`) are true positives when flagged but true negatives (not false negatives) when not. Tagged annotations (`E[tag]`) require ty to flag exactly one of the tagged lines; tagged multi-annotations (`E[tag+]`) allow any number up to the tag count. Flagging unexpected locations counts as a false positive.

Files changed
True positives added (1): 1 diagnostic
False positives removed (1): 1 diagnostic
True positives changed (6): 6 diagnostics
False positives changed (1): 1 diagnostic
False positives added (1): 1 diagnostic
Optional diagnostics added (2): 2 diagnostics
Optional diagnostics removed (1): 1 diagnostic
Optional diagnostics changed (1): 1 diagnostic
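The classification rules quoted in the summary can be modeled roughly as follows (a simplified sketch of the stated rules, not the actual conformance.py logic; how over-flagging a tagged annotation is classified is a guess here):

```python
def classify(annotation, flagged, tag_count=1):
    """Classify one test case from how many of its expected
    locations ty flagged, following the rules stated above."""
    if annotation == "E":        # required: must be flagged
        return "true positive" if flagged >= 1 else "false negative"
    if annotation == "E?":       # optional: unflagged is a true negative
        return "true positive" if flagged >= 1 else "true negative"
    if annotation == "E[tag]":   # exactly one of the tagged lines
        return "true positive" if flagged == 1 else "false negative"
    if annotation == "E[tag+]":  # any number up to the tag count
        return "true positive" if 1 <= flagged <= tag_count else "false negative"
    raise ValueError(f"unknown annotation: {annotation}")
```

Diagnostics on lines with no annotation at all would be counted separately as false positives.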