[core] Add Rule.getPostProcessors() method to support per-rule global post-processing steps#3355
[core] Add Rule.getPostProcessors() method to support per-rule global post-processing steps#3355aaronhurst-google wants to merge 1 commit into
Conversation
…fy extra rendering steps after analyzing all files. Add default implementation for all rules to do nothing. Update PMD.doPMD to collect these in the list of renderers and to call the start, end, and flush lifecycle methods. Unit test the new methods. Change-Id: I92327fada337feca74967bb3bb4e6187b3a0d1fd
oowekyala
left a comment
There was a problem hiding this comment.
This looks like an ad-hoc solution and not very well thought through. I don't think rules should be able to specify renderers or post processors or whatnot. Rules find violations. What happens to these violations is not the concern of the rule and should be decided elsewhere.
How is your Renderer rule-specific? If it needs access to the rule state, then it will most likely be broken by using incremental analysis.
|
I agree that the usage of a Nonetheless, I'd love @aaronhurst-google if you could share some of the use cases you are considering. I'd love to better understand them and figure out together the best way to cover them. If there were legitimate use cases not currently covered, maybe adding new lifecycle methods to the rule (beyond the current |
|
"Rules find violations" is trivially true where the Rule class is a single-file AST visitor and its reported violations are all local to a single file. Consider a global property G that depends on code in multiple files: G(file1, file2). All such properties can be decomposed into some G'( F(file1), F(file2) ), and in practice the output of F can often substantially simpler than the original file itself. The Rule class, which is primarily a single-file AST visitor, very naturally implements F. The extension I'm proposing is to allow it to return an object to compute G'. I do think that this is very much keeping with the spirit of "rules determine how to report violations", because the overall function is entirely specified by the Rule class even if it delegates part of the computation to another class. The other supporting part of this proposal is PR #3357, which extends the Report to store (in my case) intermediate summarization or conditional findings (i.e. the result of F), in addition to the set of final violations. To answer your question, my post-processor does not need to access the Rule state; its only input is the (enhanced) Report. Why does getPostProcessors return (custom) Renderer objects? The current narrow role of a Renderer is to take the Report object(s) with final violations and output them in some format... and this is admittedly generalizing that role. I used it because a Renderer is a nice superset of the G' computation: it already operates on the right inputs [a Report] at the right time [both end-of-file and end-of-analysis]. (And FWIW, the AbstractAccumulatingRenderer performed the right end-of-file operation for my case.) My offer still stands, to refactor Renderer into a smaller interface for the above and a derived one that matches the current Renderer. I wasn't sure if propagating a bunch of parameter type changes around the codebase was better/worse than one line that doesn't use the purest possible type interface. Apologies if the original PR didn't give a lot of perspective on the goals of this change. I tried to keep this quite small-- because it's essentially a customization at the moment-- but it would be sufficient to unblock some of the global rules we want to write on top of PMD. It would be nice to have (some) path to custom cross-file analysis, even if that's not a direction the mainline of PMD wants to head at this time. |
|
We do care deeply about how we design new features, but not simply because of purity itself. As maintainers for PMD we vow to keep things backwards compatible during the whole life of a major version so that you as a rule developer can easily upgrade PMD versions without worrying too much about breaking your custom rules (ie: PMD 6.0.0 is now 3.5 years old). In order to do so, we need to be sure of several things:
I feel that neither the current approach nor refactoring the We do want to help you. But we need to better understand the underlying need in order to do so. As already mentioned, adding new lifecycle methods to Dealing with that would require:
So once again, I am compelled to ask, what kind of violations are you considering to look for across multiple files? Maybe some real case scenarios can help us figure out what would work best, or if this is really a use case we want to support in PMD. |
|
Thanks for sharing the concern about binary backwards-compatibility. It wasn't on my radar. Let's assume that any of this is fine to target for some future major release. I agree that it would be mostly straightforward to expose additional lifecycle methods on the The extra state here is an intermediate result in a global computation, but it bears some strong operational similarities to a violation: both are produced by a single-file AST visitor (i.e. a I admittedly haven't looked deeply at your incremental analysis implementation-- we're running only full scans. But perhaps the formulation above fits naturally with only scanning the changed files: recompute To answer your question about the application: I don't want to go too far into the weeds, but to start I am computing a few properties across the transitive callees of a method. A specific example: can an Apex test method ever execute a DML statement? Here, the per-file intermediate result of |
|
Ok, that's an interesting scenario. Doing something such as that would require:
After looking at all files, the call graph should be reconstructed starting from each test method, and navigated until exhausted or a node is found to execute DML. This would require keeping in memory quite a lot of data for any medium / large codebase. This could be offset by trimming the data to remove method call lists from intermediate data if we can determine mid-analysis there exists a path that ends up doing DML, but this will trade-off on analysis time, be quite cumbersome, and completely dependant on a per-rule implementation. Moreover, even though the called methods list could be cached, the intermediate trimmed info wouldn't be valid on an incremental analysis. I'm unsure this would be sane as in the past PMD came across memory issues with larger codebases, although mainly from CPD. So, to sum it up: Pros:
Cons:
I don't think we can escape the need to rework how we check for cache staleness, needing to consider not only the file's source, but dependencies / transitive dependencies. Future implementations could check what changed, and invalidate cache results per-rule rather than the current all or nothing approach (currently all rules are implicitly considered as "depending on the file source"). As for the memory usage, we could probably avoid it by doing a directed search… start from the test and follow the call graph through multiple files on-demand. This could of course end up parsing unmodified files, and maybe even re-parsing filles (we could apply an in-memory cache here, but there is a tradeoff between memory and performance); on the bright side, collecting the file dependency list would be straightforward, but the search could not be aborted early when finding a DML as the complete list of dependencies would still be required to be built. The call graph data could be cached, but we would have to think thoroughly if it's worth it (ie: besides the dependencies, we care about the features we are looking for… maybe there is room here for some synergy with the work already done with metrics…). This approach requires no new lifecycle methods nor to keep state in the rules / deal with concurrency and data sharing. Moreover, PMD does have some very limited call graph logic, which is currently limited to single-file Java code, but at some point we will want to revamp. EDIT: bonus point. There are already some Apex rules that already try to do some call graph analysis, but this is outside the PMD outdated call graph structure and ad-hoc… having a proper call graph transversal API would greatly simplify these rules and provide better results even for single-file scenarios. Open questions:
Caveats:
|
|
Thanks for the great feedback, @jsotuyod. Your summary accurately reflects the algorithm. In Apex code, all unit test methods (starting points) are annotated with the standard @istest, which is also straightforward to collect from the AST. I'd be careful about reparsing and reanalyzing transitive callees in an incremental analysis. It has the potential for unbounded effort and breaks the expected correlation between analysis time and the size of the changes. My straw man for caching callgraph data is to keep a per-file summary of its methods and their callees. There is no non-file-local data here, and so this wouldn't ever have to be regenerated unless the file is modified. However, you'd still need to load (worst case) the whole callgraph to recompute the desired properties. This would be several times faster than having to reparse and retraverse the same files... but its worst-case scaling is still fundamentally the size of the whole program. I'd expect the I/O to actually load the cache to be the bottleneck; computing my simple reachability-based properties would be terribly quick. Caveat: this idea isn't particularly well-informed by what PMD currently caches.
This is true, but it may or may not be desirable behavior. For example, my target workflow-- code reviews-- any new defects in unchanged files are likely out-of-scope anyway (... or would at least be downgraded to a log).
Another challenge was that there is limited/non-existent semantic type analysis in the Apex Jorje front-end. I had to significantly under- or over-approximate the call resoluiton; empirically, this performed ok on my codebase but YMMV. |
Add Rule.getPostProcessors() abstract method to allow a rule to specify extra rendering steps after analyzing all files.
Add default implementation for existing rules to do nothing.
Update PMD.doPMD to collect these in the list of renderers and to call the start, end, and flush lifecycle methods.
Unit test the new method.
Change-Id: I92327fada337feca74967bb3bb4e6187b3a0d1fd
Describe the PR
Our org is building some custom PMD rules that require gathering and then post-processing global data. This pull request supports with the latter aspect. I think it would be very useful to others.
Architecturally, the Renderer is a natural "post-processor" because it already has the necessary data access and workflow callback (i.e. end) to process cross-file data. We did consider just adding a new top-level renderer... but we want this post-processing to run in addition to the existing renderers that output the violations.
A possible factoring would be to split the workflow methods (start, end, renderFile, ...) out of Renderer and create a new PostProcessor parent interface, leaving behind the output-related methods (setWriter, defaultFileExtension, ...) in Renderer. I'm happy to do this, but it would change a fair number of parameter types, and I can just easily ignore the output-related methods in my custom classes. PLMK.