[core] Add Rule.getPostProcessors() method to support per-rule global post-processing steps by aaronhurst-google · Pull Request #3355 · pmd/pmd

aaronhurst-google · 2021-06-23T20:20:52Z

Add Rule.getPostProcessors() abstract method to allow a rule to specify extra rendering steps after analyzing all files.
Add default implementation for existing rules to do nothing.
Update PMD.doPMD to collect these in the list of renderers and to call the start, end, and flush lifecycle methods.
Unit test the new method.

Change-Id: I92327fada337feca74967bb3bb4e6187b3a0d1fd

Describe the PR

Our org is building some custom PMD rules that require gathering and then post-processing global data. This pull request supports with the latter aspect. I think it would be very useful to others.

Architecturally, the Renderer is a natural "post-processor" because it already has the necessary data access and workflow callback (i.e. end) to process cross-file data. We did consider just adding a new top-level renderer... but we want this post-processing to run in addition to the existing renderers that output the violations.

A possible factoring would be to split the workflow methods (start, end, renderFile, ...) out of Renderer and create a new PostProcessor parent interface, leaving behind the output-related methods (setWriter, defaultFileExtension, ...) in Renderer. I'm happy to do this, but it would change a fair number of parameter types, and I can just easily ignore the output-related methods in my custom classes. PLMK.

…fy extra rendering steps after analyzing all files. Add default implementation for all rules to do nothing. Update PMD.doPMD to collect these in the list of renderers and to call the start, end, and flush lifecycle methods. Unit test the new methods. Change-Id: I92327fada337feca74967bb3bb4e6187b3a0d1fd

oowekyala

This looks like an ad-hoc solution and not very well thought through. I don't think rules should be able to specify renderers or post processors or whatnot. Rules find violations. What happens to these violations is not the concern of the rule and should be decided elsewhere.

How is your Renderer rule-specific? If it needs access to the rule state, then it will most likely be broken by using incremental analysis.

jsotuyod · 2021-06-23T23:56:14Z

I agree that the usage of a Renderer here seems forced, and @oowekyala has a great point about the applicability of incremental analysis.

Nonetheless, I'd love @aaronhurst-google if you could share some of the use cases you are considering. I'd love to better understand them and figure out together the best way to cover them. If there were legitimate use cases not currently covered, maybe adding new lifecycle methods to the rule (beyond the current start, end and apply) with proper handling for incremental analysis could be an option.

aaronhurst-google · 2021-06-24T00:08:01Z

"Rules find violations" is trivially true where the Rule class is a single-file AST visitor and its reported violations are all local to a single file.

Consider a global property G that depends on code in multiple files: G(file1, file2). All such properties can be decomposed into some G'( F(file1), F(file2) ), and in practice the output of F can often substantially simpler than the original file itself.

The Rule class, which is primarily a single-file AST visitor, very naturally implements F. The extension I'm proposing is to allow it to return an object to compute G'. I do think that this is very much keeping with the spirit of "rules determine how to report violations", because the overall function is entirely specified by the Rule class even if it delegates part of the computation to another class.

The other supporting part of this proposal is PR #3357, which extends the Report to store (in my case) intermediate summarization or conditional findings (i.e. the result of F), in addition to the set of final violations. To answer your question, my post-processor does not need to access the Rule state; its only input is the (enhanced) Report.

Why does getPostProcessors return (custom) Renderer objects? The current narrow role of a Renderer is to take the Report object(s) with final violations and output them in some format... and this is admittedly generalizing that role. I used it because a Renderer is a nice superset of the G' computation: it already operates on the right inputs [a Report] at the right time [both end-of-file and end-of-analysis]. (And FWIW, the AbstractAccumulatingRenderer performed the right end-of-file operation for my case.)

My offer still stands, to refactor Renderer into a smaller interface for the above and a derived one that matches the current Renderer. I wasn't sure if propagating a bunch of parameter type changes around the codebase was better/worse than one line that doesn't use the purest possible type interface.

Apologies if the original PR didn't give a lot of perspective on the goals of this change. I tried to keep this quite small-- because it's essentially a customization at the moment-- but it would be sufficient to unblock some of the global rules we want to write on top of PMD. It would be nice to have (some) path to custom cross-file analysis, even if that's not a direction the mainline of PMD wants to head at this time.

jsotuyod · 2021-06-24T05:11:59Z

We do care deeply about how we design new features, but not simply because of purity itself.

As maintainers for PMD we vow to keep things backwards compatible during the whole life of a major version so that you as a rule developer can easily upgrade PMD versions without worrying too much about breaking your custom rules (ie: PMD 6.0.0 is now 3.5 years old). In order to do so, we need to be sure of several things:

Is this change backwards binary-compatible? ie: adding a method to the Report interface breaks this, anyone implementing Report directly in a custom extension will start seeing runtime failures if they don't change their code. If this is not, then it should be part of the next major release. We do spend some time every now and then figuring out if something can be done in a such a way.
Is this change properly working in all supported scenarios? ie: working when running with incremental analysis enabled.
Is this change sensible enough for us to want to keep maintaining it for such a long time? ie: a hacky workaround may be a pain later on within the lifetime of the major version

I feel that neither the current approach nor refactoring the Report interface would do.

We do want to help you. But we need to better understand the underlying need in order to do so.

As already mentioned, adding new lifecycle methods to AbstractRule (not the Rule interface, until PMD 7) along with some typechecking the SourceCodeProcesssor before invocation could be an option (and allow the rule to be self-contained with all the logic needed to produce reports), but that alone would not be enough to work with incremental analysis…

Dealing with that would require:

Having a violation being able to reference more than one file (which on it's own would probably have a ripple effect on reporting, and specially for structured reporters such as XML needing an updated XML Schema).
When a new file is being added, how do we know if we still need to look into other files even if not modified since the last run? Which ones? Is that scenario even possible?

So once again, I am compelled to ask, what kind of violations are you considering to look for across multiple files? Maybe some real case scenarios can help us figure out what would work best, or if this is really a use case we want to support in PMD.

aaronhurst-google · 2021-06-24T07:24:28Z

Thanks for sharing the concern about binary backwards-compatibility. It wasn't on my radar. Let's assume that any of this is fine to target for some future major release.

I agree that it would be mostly straightforward to expose additional lifecycle methods on the Rule class. I was more wary about a design that stuffs state into that class, where a Rule is a thing that needs to be merged across files/threads. As you point out, incremental analysis makes this even more awkward: a Rule would become a thing that needs to understand how to cache its state for unchanged files.

The extra state here is an intermediate result in a global computation, but it bears some strong operational similarities to a violation: both are produced by a single-file AST visitor (i.e. a Rule as it exists), both are file-independent, both support a notion of "merging" across files, both could be cached when a file is unchanged, and both are consumed at the end of the scan. This is why it made sense (to me at least) to stash it in the same container as the violations, be part of Report, and have it operated on by a Renderer (or a new interface that shares this relevant parts of its role). Architecturally, does this at least make sense-- even if it's a stretch of the original role of those components?

I admittedly haven't looked deeply at your incremental analysis implementation-- we're running only full scans. But perhaps the formulation above fits naturally with only scanning the changed files: recompute F(each changed file) and store and reload the value of F(unchanged file). The G' function needs to be applied across the union of the two (and so this piece is non-incremental... but often trivial).

To answer your question about the application: I don't want to go too far into the weeds, but to start I am computing a few properties across the transitive callees of a method. A specific example: can an Apex test method ever execute a DML statement? Here, the per-file intermediate result of F is roughly a set of call edges, some type relations for CHA callgraph algorithm, and whether each method body contains any ASTDml*Statement node. Hope that helps.

jsotuyod · 2021-06-24T20:06:11Z

Ok, that's an interesting scenario.

Doing something such as that would require:

for each analyzed class, for each non-private method store either
- if it executes DML either directly or indirectly within the same source file (leaf node)
- or a complete list of called methods from other code units (node and adjacency list)
a list of all test methods (start points for searches)

After looking at all files, the call graph should be reconstructed starting from each test method, and navigated until exhausted or a node is found to execute DML.

This would require keeping in memory quite a lot of data for any medium / large codebase. This could be offset by trimming the data to remove method call lists from intermediate data if we can determine mid-analysis there exists a path that ends up doing DML, but this will trade-off on analysis time, be quite cumbersome, and completely dependant on a per-rule implementation. Moreover, even though the called methods list could be cached, the intermediate trimmed info wouldn't be valid on an incremental analysis.

I'm unsure this would be sane as in the past PMD came across memory issues with larger codebases, although mainly from CPD.

So, to sum it up:

Pros:

potentially cacheable, although it would have a huge impact on cache file size

Cons:

high memory usage
would require a heavy rework of incremental analysis… an unmodified file may need to change the violations reported on them based on metadata that is specific to a rule (current implementation is if file is unchanged => skip and repeat all cached violations)

I don't think we can escape the need to rework how we check for cache staleness, needing to consider not only the file's source, but dependencies / transitive dependencies. Future implementations could check what changed, and invalidate cache results per-rule rather than the current all or nothing approach (currently all rules are implicitly considered as "depending on the file source").

As for the memory usage, we could probably avoid it by doing a directed search… start from the test and follow the call graph through multiple files on-demand. This could of course end up parsing unmodified files, and maybe even re-parsing filles (we could apply an in-memory cache here, but there is a tradeoff between memory and performance); on the bright side, collecting the file dependency list would be straightforward, but the search could not be aborted early when finding a DML as the complete list of dependencies would still be required to be built. The call graph data could be cached, but we would have to think thoroughly if it's worth it (ie: besides the dependencies, we care about the features we are looking for… maybe there is room here for some synergy with the work already done with metrics…). This approach requires no new lifecycle methods nor to keep state in the rules / deal with concurrency and data sharing. Moreover, PMD does have some very limited call graph logic, which is currently limited to single-file Java code, but at some point we will want to revamp.

EDIT: bonus point. There are already some Apex rules that already try to do some call graph analysis, but this is outside the PMD outdated call graph structure and ad-hoc… having a proper call graph transversal API would greatly simplify these rules and provide better results even for single-file scenarios.

Open questions:

would a multi-file call graph suffice for all scenarios?

Caveats:

such an approach would probably require some heavy lifting… call graph code is limited and old, plausibly needing a full rewrite. Also the only current implementation is completely bound to Java.

aaronhurst-google · 2021-07-22T22:01:29Z

Thanks for the great feedback, @jsotuyod.

Your summary accurately reflects the algorithm. In Apex code, all unit test methods (starting points) are annotated with the standard @istest, which is also straightforward to collect from the AST.

I'd be careful about reparsing and reanalyzing transitive callees in an incremental analysis. It has the potential for unbounded effort and breaks the expected correlation between analysis time and the size of the changes.

My straw man for caching callgraph data is to keep a per-file summary of its methods and their callees. There is no non-file-local data here, and so this wouldn't ever have to be regenerated unless the file is modified. However, you'd still need to load (worst case) the whole callgraph to recompute the desired properties. This would be several times faster than having to reparse and retraverse the same files... but its worst-case scaling is still fundamentally the size of the whole program. I'd expect the I/O to actually load the cache to be the bottleneck; computing my simple reachability-based properties would be terribly quick. Caveat: this idea isn't particularly well-informed by what PMD currently caches.

would require a heavy rework of incremental analysis… an unmodified file may need to change the violations reported on
them based on metadata that is specific to a rule (current implementation is if file is unchanged => skip and repeat all
cached violations)

This is true, but it may or may not be desirable behavior. For example, my target workflow-- code reviews-- any new defects in unchanged files are likely out-of-scope anyway (... or would at least be downgraded to a log).

call graph code is limited and old, plausibly needing a full rewrite

Another challenge was that there is limited/non-existent semantic type analysis in the Apex Jorje front-end. I had to significantly under- or over-approximate the call resoluiton; empirically, this performed ok on my codebase but YMMV.

oowekyala requested changes Jun 23, 2021

View reviewed changes

adangel changed the title ~~Add Rule.getPostProcessors() method to support per-rule global post-processing steps~~ [core] Add Rule.getPostProcessors() method to support per-rule global post-processing steps Jun 24, 2021

aaronhurst-google closed this Feb 8, 2022

oowekyala mentioned this pull request Apr 15, 2022

[core] Support global rules that report at the end of analysis #3920

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

[core] Add Rule.getPostProcessors() method to support per-rule global post-processing steps#3355

[core] Add Rule.getPostProcessors() method to support per-rule global post-processing steps#3355
aaronhurst-google wants to merge 1 commit into
pmd:masterfrom
aaronhurst-google:master

aaronhurst-google commented Jun 23, 2021

Uh oh!

oowekyala left a comment

Uh oh!

jsotuyod commented Jun 23, 2021

Uh oh!

aaronhurst-google commented Jun 24, 2021 •

edited by adangel

Loading

Uh oh!

jsotuyod commented Jun 24, 2021

Uh oh!

aaronhurst-google commented Jun 24, 2021 •

edited

Loading

Uh oh!

jsotuyod commented Jun 24, 2021 •

edited

Loading

Uh oh!

aaronhurst-google commented Jul 22, 2021

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Uh oh!

Conversation

aaronhurst-google commented Jun 23, 2021

Describe the PR

Uh oh!

oowekyala left a comment

Choose a reason for hiding this comment

Uh oh!

jsotuyod commented Jun 23, 2021

Uh oh!

aaronhurst-google commented Jun 24, 2021 • edited by adangel Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

jsotuyod commented Jun 24, 2021

Uh oh!

aaronhurst-google commented Jun 24, 2021 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

jsotuyod commented Jun 24, 2021 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

aaronhurst-google commented Jul 22, 2021

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

aaronhurst-google commented Jun 24, 2021 •

edited by adangel

Loading

aaronhurst-google commented Jun 24, 2021 •

edited

Loading

jsotuyod commented Jun 24, 2021 •

edited

Loading