Skip to content

[core] Avoid parsing rulesets multiple times #724

@jsotuyod

Description

@jsotuyod

We currently parse XMLs in a greedy and not very efficient way.

  1. Upon finding a ruleset reference (ended in .xml), we go and parse that file completely before continuing.
  2. Upon finding a rule reference from another ruleset (ended in .xml/rule-name) we go and parse that file completely, ignoring everything but the one rule we were looking for.

This means that, a ruleset such as:

<?xml version="1.0"?>
<ruleset name="Coupling"
xmlns="http://pmd.sourceforge.net/ruleset/2.0.0"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://pmd.sourceforge.net/ruleset/2.0.0 http://pmd.sourceforge.net/ruleset_2_0_0.xsd">
<description>
Rules which find instances of high or inappropriate coupling between objects and packages.
</description>
<!-- Rules, that have been moved into a category -->
<rule ref="category/java/bestpractices.xml/LooseCoupling" deprecated="true" />
<rule ref="category/java/design.xml/CouplingBetweenObjects" deprecated="true" />
<rule ref="category/java/design.xml/ExcessiveImports" deprecated="true" />
<rule ref="category/java/design.xml/LawOfDemeter" deprecated="true" />
<rule ref="category/java/design.xml/LoosePackageCoupling" deprecated="true" />
</ruleset>

Will parse completely once looking for category/java/bestpractices.xml, and then will proceed to completely parse category/java/design.xml 4 times, to get the other rules.

Having multiple such references is common, and will be far more with the ruleset reorganization, since categories are fewer, larger, and intended to be broken up and referenced rule by rule from rulesets.

This also means, it's possible to produce cycles in the ruleset references and produce infinite loops (which end in Stackoverflows / OutOfMemory errors).

I believe we should rework the rule parsing in order to:

  1. Parse each file exactly once
  2. Avoid parsing beyond what's needed
  3. Detect and report properly cycles

One possible approach to this would be:

  1. Start with an empty set of open and a sorted list of closed RuleSetReferenceId containing all rulesets indicated by user input, with the "include all" flag set (more on sorting later on)
  2. Remove the first RuleSetReferenceId from closed, put it into open, and get a parser for it
    1. If the "include all" flag is set, parse the file completely. Rule definitions are loaded, rule / ruleset references, are simply translated into RuleSetReferenceId and if present in open, an exception is thrown (cycle!) otherwise check if closed has another reference to this xml, and merge it or add it directly (more on merging below)
    2. If the "include all" flag is not set, parse until we find any required rule name and mark it as loaded from RuleSetReferenceId. Rule definitions are loaded, rule / ruleset references, are simply translated into RuleSetReferenceId and if present in open, an exception is thrown (cycle!) otherwise check if closed has another reference to this xml, and merge it or add it directly (more on merging below)
      1. If not all references from RuleSetReferenceId are loaded, go back to step 2.ii, otherwise just stop parsing.
  3. Go back to step 2 until open is empty

Notice that the 2 scenarios (include all, described in 2.i and not, described in 2.ii) can be unified into a single one if the haveAllRequiredRulesBeenLoaded() check for RuleSetReferenceId returns always false in the include all scenario.

On sorting:

  • we should prioritize ruleset references with the "include all" flag. That way we are able to include ALL references to rules within a single ruleset xml before parsing it.
  • For non "include all" references, order is probably irrelevant, but prioritizing those with more rules to be loaded could be beneficial under complex ruleset graphs

On merging:

  • RuleSetReferenceId should be reworked to be able to reference, 1, many or all rules within a ruleset; and if not including all, keep track of which ones have been loaded. This way we can merge 2 references to different rules inside the same xml to be parsed together.

The one scenario that wouldn't be supported, would be a rule within a ruleset that references another rule within the same ruleset. This would be detected as a cycle, which is technically correct since it's a loop in the ruleset graph. We could support it if it was later in the file, but such distinction seems broken to me, I'd rather just throw for cycle.

This refactor would pose a good chance to split rule parsing from ruleset generation, as it has proved to be troublesome as show on https://github.com/pmd/pmd/pull/731/files#r152080633

Metadata

Metadata

Assignees

Labels

for:performanceThe goal of this change is to improve PMD's performance

Type

No type
No fields configured for issues without a type.

Projects

No projects

Milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions