Skip to content

Conversation

@atrosinenko
Copy link
Contributor

@atrosinenko atrosinenko commented Dec 1, 2019

This PR introduces initial draft of Source Code Minimizer feature, see #2103 (hope it would not name-clash with something source code management related).

Now it is already able to execute something like this:

bin/run.sh scm --language \
    --input-file pmd-core/src/main/java/net/sourceforge/pmd/scm/SCMConfiguration.java \
    --output-file test.java \
    --strategy xpath --xpath-expression "//Annotation"

There are no multi-step interactive strategies implemented yet.

@ghost
Copy link

ghost commented Dec 1, 2019

1 Message
📖 This changeset introduces 0 new violations, 0 new errors and 0 new configuration errors,
removes 0 violations, 0 errors and 0 configuration errors.
Full report
This changeset introduces 0 new violations, 0 new errors and 0 new configuration errors,
removes 0 violations, 0 errors and 0 configuration errors.
Full report
This changeset introduces 0 new violations, 0 new errors and 0 new configuration errors,
removes 0 violations, 0 errors and 0 configuration errors.
Full report
This changeset introduces 0 new violations, 0 new errors and 0 new configuration errors,
removes 0 violations, 0 errors and 0 configuration errors.
Full report
This changeset introduces 0 new violations, 0 new errors and 0 new configuration errors,
removes 0 violations, 0 errors and 0 configuration errors.
Full report
This changeset introduces 0 new violations, 0 new errors and 0 new configuration errors,
removes 0 violations, 0 errors and 0 configuration errors.
Full report
This changeset introduces 0 new violations, 0 new errors and 0 new configuration errors,
removes 0 violations, 0 errors and 0 configuration errors.
Full report

Generated by 🚫 Danger

@atrosinenko
Copy link
Contributor Author

Implemented a greedy interactive (w.r.t. to relying on a feedback from the compiler) strategy.

Suppose I copied RuleSet.java and introduced a bug:

$ diff pmd-core/src/main/java/net/sourceforge/pmd/RuleSet.java RuleSet.java 
336c336
<             Objects.requireNonNull(patterns, "Pattern collection was null");
---
>             Objects.requireNonNull(patterns, "Pattern collection was null", 123);

Now let's try to compile it:

$ javac -cp pmd-bin-6.21.0-SNAPSHOT/lib/pmd-core-6.21.0-SNAPSHOT.jar RuleSet.java
RuleSet.java:21: error: package org.apache.commons.lang3 does not exist
import org.apache.commons.lang3.StringUtils;
                               ^
RuleSet.java:223: error: cannot find symbol
            if (StringUtils.isBlank(ruleSetFileName)) {
                ^
  symbol:   variable StringUtils
  location: class RuleSetBuilder
RuleSet.java:287: error: cannot find symbol
            if (StringUtils.isBlank(ruleSet.getFileName())) {
                ^
  symbol:   variable StringUtils
  location: class RuleSetBuilder
RuleSet.java:336: error: no suitable method found for requireNonNull(Collection<CAP#1>,String,int)
            Objects.requireNonNull(patterns, "Pattern collection was null", 123);
                   ^
    method Objects.<T#1>requireNonNull(T#1) is not applicable
      (cannot infer type-variable(s) T#1
        (actual and formal argument lists differ in length))
    method Objects.<T#2>requireNonNull(T#2,String) is not applicable
      (cannot infer type-variable(s) T#2
        (actual and formal argument lists differ in length))
    method Objects.<T#3>requireNonNull(T#3,Supplier<String>) is not applicable
      (cannot infer type-variable(s) T#3
        (actual and formal argument lists differ in length))
  where T#1,T#2,T#3 are type-variables:
    T#1 extends Object declared in method <T#1>requireNonNull(T#1)
    T#2 extends Object declared in method <T#2>requireNonNull(T#2,String)
    T#3 extends Object declared in method <T#3>requireNonNull(T#3,Supplier<String>)
  where CAP#1 is a fresh type-variable:
    CAP#1 extends Pattern from capture of ? extends Pattern
Note: RuleSet.java uses or overrides a deprecated API.
Note: Recompile with -Xlint:deprecation for details.
4 errors

Sure, there are other problems, but let's try to minimize for just introduced one:

$ time ./pmd-bin-6.21.0-SNAPSHOT/bin/run.sh scm --language java --input-file RuleSet.java --output-file test.java --strategy greedy --invariant message --printed-message "error: no suitable method found for requireNonNull" --command-line "javac -cp pmd-bin-6.21.0-SNAPSHOT/lib/pmd-core-6.21.0-SNAPSHOT.jar test.java"
Original file: 26487 bytes, 3825 nodes.
After pass #1: size 26458 bytes (99%), 3823 nodes (99%)
After pass #2: size 26431 bytes (99%), 3821 nodes (99%)
After pass #3: size 26402 bytes (99%), 3819 nodes (99%)
After pass #4: size 26371 bytes (99%), 3817 nodes (99%)
After pass #5: size 26350 bytes (99%), 3815 nodes (99%)
After pass #6: size 26318 bytes (99%), 3813 nodes (99%)
After pass #7: size 26274 bytes (99%), 3811 nodes (99%)
After pass #8: size 26222 bytes (98%), 3809 nodes (99%)
...
After pass #248: size 10465 bytes (39%), 62 nodes (1%)
After pass #249: size 10319 bytes (38%), 43 nodes (1%)
After pass #250: size 10265 bytes (38%), 34 nodes (0%)
After pass #251: size 10204 bytes (38%), 26 nodes (0%)
After pass #252: size 10202 bytes (38%), 26 nodes (0%)

real    8m23,959s
user    18m33,567s
sys     1m0,347s

Finally, I got the following test.java: it has lots of comments and compiles like that:

$ javac -cp pmd-bin-6.21.0-SNAPSHOT/lib/pmd-core-6.21.0-SNAPSHOT.jar test.java 
test.java:40: error: class RuleSet is public, should be declared in a file named RuleSet.java
public class RuleSet  {
       ^
test.java:73: error: illegal combination of modifiers: public and private
    /* package */ static class RuleSetBuilder {
                         ^
test.java:194: error: no suitable method found for requireNonNull(no arguments)
            Objects.requireNonNull();
                   ^
    method Objects.<T#1>requireNonNull(T#1) is not applicable
      (cannot infer type-variable(s) T#1
        (actual and formal argument lists differ in length))
    method Objects.<T#2>requireNonNull(T#2,String) is not applicable
      (cannot infer type-variable(s) T#2
        (actual and formal argument lists differ in length))
    method Objects.<T#3>requireNonNull(T#3,Supplier<String>) is not applicable
      (cannot infer type-variable(s) T#3
        (actual and formal argument lists differ in length))
  where T#1,T#2,T#3 are type-variables:
    T#1 extends Object declared in method <T#1>requireNonNull(T#1)
    T#2 extends Object declared in method <T#2>requireNonNull(T#2,String)
    T#3 extends Object declared in method <T#3>requireNonNull(T#3,Supplier<String>)
3 errors

@atrosinenko
Copy link
Contributor Author

With the same test file and minimization request:

Original file: 26487 bytes, 3825 nodes.
After initial white-space cleanup: size 16376 bytes (61%), 3825 nodes (100%)
After pass #1: size 16348 bytes (61%), 3823 nodes (99%)
After pass #2: size 16321 bytes (61%), 3821 nodes (99%)
After pass #3: size 16292 bytes (61%), 3819 nodes (99%)
After pass #4: size 16261 bytes (61%), 3817 nodes (99%)
After pass #5: size 16240 bytes (61%), 3815 nodes (99%)
After pass #6: size 16208 bytes (61%), 3813 nodes (99%)
After pass #7: size 16164 bytes (61%), 3811 nodes (99%)
After pass #8: size 16112 bytes (60%), 3809 nodes (99%)
After pass #9: size 16065 bytes (60%), 3807 nodes (99%)
After pass #10 (white-space cleanup) : size 16030 bytes (60%), 3807 nodes (99%)
After pass #11: size 16017 bytes (60%), 3805 nodes (99%)
After pass #12: size 15966 bytes (60%), 3803 nodes (99%)
...
After pass #275: size 656 bytes (2%), 62 nodes (1%)
After pass #276: size 510 bytes (1%), 43 nodes (1%)
After pass #277: size 456 bytes (1%), 34 nodes (0%)
After pass #278: size 395 bytes (1%), 26 nodes (0%)
After pass #279: size 393 bytes (1%), 26 nodes (0%)
After final white-space cleanup: size 395 bytes (1%), 26 nodes (0%)

real    8m24,671s
user    18m41,643s
sys     1m1,959s

(Part) of javac output:

test.java:30: error: no suitable method found for requireNonNull(no arguments)
            Objects.requireNonNull();
                   ^

Result: test.java

@atrosinenko
Copy link
Contributor Author

Hmm... When I just started to check that the trimmed source can be parsed again before starting compiler, the minimization time dropped to ~5min - it is quite expected. But when I then changed < to <= in findNodeToRemove, it converged in about 70s! This is probably due to the fact that now, after removing some method altogether, it rewinds right to the next one (and not entering it to remove it piece-wise) - so it can be explained as well - but just to show that heuristics can be quite fragile. Or optimizable - from the optimistic point of view. :)

@oowekyala
Copy link
Member

Not commenting on implementation yet.

I'm really not sure this is something we want in PMD at this moment. It's a completely independent program, and among other things it tries to introduce its own parallel system of Language (while we already have a messy split between CPD and PMD Language).

To be very clear: I don't want to discourage people from using PMD as a generic platform to implement tools working on ASTs. In fact, I think that's something PMD should absolutely strive for. The amount of work that has been put into designing language-independent interfaces, and implementing them, is enormous, and IMO it would be failing PMD's potential to limit it to just "make rules". So in principle I'm totally for basing new programs off PMD's platform.

Yet I think the codebase is not ready yet to be used in that way. Currently anyone wanting to use it needs to publish it to pmd-core. Everything's merged into that module with no real structure, and we have to maintain it all. That's btw just the same for language implementations. Eg someone published a pmd-scala module a few months ago, and now we have to maintain it, yet nobody bothered to add rules.

We're just a few maintainers, and I feel like those plugins should be factored out of the main pmd distribution and maintained by people that are interested in it/ originated the project. This is btw one of the reasons the designer was moved out of this repo. That would also give them more visibility, and avoid stuff getting buried for decades and forgotten (eg DCD, eg the old designer programs, eg data flow). Speaking in my own name: your program looks interesting but I don't want to be responsible for its development, or future lack thereof. Nothing personal.

PMD certainly has the potential to be a great platform for that kind of tools, but I think the core of the app needs to be refactored in a way that is more extension-friendly before we can commit to fully supporting this. This needs a lot of planning and I think should be one of the design goals of the next major version (for which we've already been rethinking APIs for a while).

@atrosinenko
Copy link
Contributor Author

Just to make it clear: I just though it is the preferred way to register one Language system per tool to allow them to evolve independently. In fact, minimizer would not probably work without the full AST parser support, and that seems to be a "full PMD" feature. And its core functionality would work even without almost any intervention into pmd-core.

On the other hand, some language modules may want to interfere with minimizer module (such as describing the dependencies between AST nodes). This is some API, too, but not the entire Language system.

Technically, for now, it seems the minimizer module can be a completely separate tool utilizing the main PMD (with its languages, etc.) as a dependency. The only thing that would be great to handle is to have some extension points for language modules to introduce themselves to separate tools. But for this specific need even this is not absolutely required.

So, I'm not sure right now, but it may turn out that this tool can be designed utilizing only the public PMD APIs from the outside. In that case this PR can be dropped at all.

@atrosinenko
Copy link
Contributor Author

Meanwhile, how it would be better to design the language implementation, so it would require as less maintainance as possible for simply not becoming deprecated? Is the Scala module hard to maintain or I just misunderstood you and it is just not too useful without bundled rules?

@oowekyala
Copy link
Member

I just though it is the preferred way to register one Language system per tool to allow them to evolve independently

I understand why that may be so. Ideally I think the collection of parsers could function independently of even the PMD/CPD/SCM programs, and we'd have a single shared model to represent languages, language versions, etc in all those. But we're stuck with the current design until PMD 7.

The only thing that would be great to handle is to have some extension points for language modules to introduce themselves to separate tools.

This is what I'm talking about, which we should think about for PMD 7.

Is the Scala module hard to maintain or I just misunderstood you and it is just not too useful without bundled rules?

It's not worse per se than other language modules. It's just still quite immature, and nobody's worked on it since the initial PRs, which is a shame. IMO it's just pretty underwhelming to ship features that are half baked and remain so for years. Afterwards when you want to add one generic feature, all modules need to somehow be aligned on a shared contract, and so the ones that have been left behind either need more work, or get even more behind... The thing with PMD is that each language module would ideally have its own community of developers of that language to drive innovation, but some modules are next to abandoned...

Meanwhile, how it would be better to design the language implementation, so it would require as less maintainance as possible for simply not becoming deprecated?

The worst thing about the current implementations of language modules, is the code duplication everywhere, and the fact that most every class is public without reason. Some suboptimal design decisions need to stay for an entire major release cycle because they're public, and may be duplicated tens of times.

A nice thing you can do is keep the published API as small as possible. This helps keep your hands free when you want to refactor down the line.

So eg if you're implementing an interface from PMD core, consider the implementation internal by default and hide it. Clients can use your implementation through the interface. If you're implementing something new, start by writing everything as package-private, and only make public what needs to be so.

Side note: we're not about to deprecate language modules either, so don't worry about modelica (or scala for that matter).

@adangel
Copy link
Member

adangel commented Dec 23, 2019

Hi @atrosinenko ,

thanks for your PoC!
However, I don't see yet, where we would need this for PMD itself. When we have a parser problem, the parser tells us the location, where the error is. When we have a false positive, we know the location of the false positive. It's usually trivial to see, where the issue/the problematic code is. So, this doesn't look like something that is needed inside PMD.
However, it's perfectly valid to develop tools based on PMD.

Therefore I've moved your code to a separate repository: https://github.com/pmd/pmd-scm .
I've invited you there as a collaborator. So, feel free to work in this project there.
It's just depends on PMD rather than changes it. If we need a new API or a bugfix, then we need to implement it upstream (meaning in PMD) - like the bugfix for DocumentFile (#2171).
The integration into the distribution can be improved in the future (e.g. to allow easily to add tools rather than needing to provide a custom run.sh).

Is the Scala module hard to maintain or I just misunderstood you and it is just not too useful without
bundled rules?

It's mostly a chicken-egg-problem: Without provided rules to start with ("batteries included"), there are no users. Without users, there are no improvements/developers.

One side note about moving stuff out of the main repo: The probable thing, that will happen is: it is simply not maintained and will shortly start to be defunctional. This is prevented by keeping it in the main repo, because then, when we refactor etc. we are forced to keep it running (building+unit tests, at least).

So, I'm not sure right now, but it may turn out that this tool can be designed utilizing only the public
PMD APIs from the outside. In that case this PR can be dropped at all.

So, that is indeed possible. And that's all what we need here. So, I'm closing this PR.

Regards,
Andreas

@adangel adangel closed this Dec 23, 2019
@atrosinenko
Copy link
Contributor Author

Thanks @adangel ,

Random thoughts on the design. Previously, I tried to completely separate SCM from PMD. On one hand, it could use PMD as one of the backends (the others may be plain ANTLR grammars or even binary formats based on Kaitai Struct for format-aware replacement for afl-tmin, for example) with some common strategies and some backend-specific (such as querying computed types). On the other hand, technically, handling binary formats may be supported by PMD, too (not sure it would be beneficial from the static analysis point of view but still may be interesting as performing XPath queries on them, for example). So, the backends for SCM enumerated above can be useful for the main PMD as well (AFAIK full ANTLR support is planned for 7.0). Or they still may be implemented in SCM only since it is a separate tool, if not well suited for PMD (such as binary formats).

In fact, some SCM strategies may work with just a plain ANTLR grammars, while others may utilize as much language support as possible (and it may be tricky to abstract it out completely), so it fills the entire range from CPD to PMD w.r.t. required language support. :)

@atrosinenko
Copy link
Contributor Author

Another side-effect of such separation is drastically faster mvnw clean verify :)

@atrosinenko atrosinenko deleted the minimizer branch December 27, 2019 11:09
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants