[core] Avoid parsing rulesets multiple times

We currently parse XMLs in a greedy and not very efficient way.

1. Upon finding a ruleset reference (ended in .xml), we go and parse that file completely before continuing.
2. Upon finding a rule reference from another ruleset (ended in .xml/rule-name) we go and parse that file completely, ignoring everything but the one rule we were looking for.

This means that, a ruleset such as:

https://github.com/pmd/pmd/blob/b159143a06cd5c6ec3c6263a9518938ea4f223d2/pmd-java/src/main/resources/rulesets/java/coupling.xml#L1-L20

Will parse completely once looking for `category/java/bestpractices.xml`, and then will proceed to completely parse `category/java/design.xml` **4 times**, to get the other rules.

Having multiple such references is common, and will be far more with the ruleset reorganization, since categories are fewer, larger, and intended to be broken up and referenced rule by rule from rulesets.

This also means, it's possible to produce cycles in the ruleset references and produce infinite loops (which end in Stackoverflows / OutOfMemory errors).

I believe we should rework the rule parsing in order to:

1. Parse each file exactly once
2. Avoid parsing beyond what's needed
3. Detect and report properly cycles

One possible approach to this would be:

1. Start with an empty set of open and a sorted list of closed `RuleSetReferenceId` containing all rulesets indicated by user input, with the "include all" flag set (more on sorting later on)
2. Remove the first `RuleSetReferenceId` from closed, put it into open, and get a parser for it
    1. If the "include all" flag is set, parse the file completely. Rule definitions are loaded, rule / ruleset references, are simply translated into `RuleSetReferenceId` and if present in open, an exception is thrown (cycle!) otherwise check if closed has another reference to this xml, and merge it or add it directly (more on merging below)
    2. If the "include all" flag is not set, parse until we find any required rule name and mark it as loaded from `RuleSetReferenceId`. Rule definitions are loaded, rule / ruleset references, are simply translated into `RuleSetReferenceId` and if present in open, an exception is thrown (cycle!) otherwise check if closed has another reference to this xml, and merge it or add it directly (more on merging below)
        1. If not all references from `RuleSetReferenceId` are loaded, go back to step 2.ii, otherwise just stop parsing.
3. Go back to step 2 until open is empty

Notice that the 2 scenarios (include all, described in 2.i and not, described in 2.ii) can be unified into a single one if the `haveAllRequiredRulesBeenLoaded()` check for `RuleSetReferenceId` returns always false in the `include all` scenario.

On sorting:
- we should prioritize ruleset references with the "include all" flag. That way we are able to include *ALL* references to rules within a single ruleset xml *before* parsing it.
- For non "include all" references, order is probably irrelevant, but prioritizing those with more rules to be loaded could be beneficial under complex ruleset graphs

On merging:
 - `RuleSetReferenceId` should be reworked to be able to reference, 1, many or all rules within a ruleset; and if not including all, keep track of which ones have been loaded. This way we can merge 2 references to different rules inside the same xml to be parsed together.


The one scenario that wouldn't be supported, would be a rule within a ruleset that references another rule within the same ruleset. This would be detected as a cycle, which is technically correct since it's a loop in the ruleset graph. We *could* support it if it was later in the file, but such distinction seems broken to me, I'd rather just throw for cycle.


This refactor would pose a good chance to split rule parsing from ruleset generation, as it has proved to be troublesome as show on https://github.com/pmd/pmd/pull/731/files#r152080633

	<?xml version="1.0"?>

	<ruleset name="Coupling"
	xmlns="http://pmd.sourceforge.net/ruleset/2.0.0"
	xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
	xsi:schemaLocation="http://pmd.sourceforge.net/ruleset/2.0.0 http://pmd.sourceforge.net/ruleset_2_0_0.xsd">

	<description>
	Rules which find instances of high or inappropriate coupling between objects and packages.
	</description>

	<!-- Rules, that have been moved into a category -->
	<rule ref="category/java/bestpractices.xml/LooseCoupling" deprecated="true" />

	<rule ref="category/java/design.xml/CouplingBetweenObjects" deprecated="true" />
	<rule ref="category/java/design.xml/ExcessiveImports" deprecated="true" />
	<rule ref="category/java/design.xml/LawOfDemeter" deprecated="true" />
	<rule ref="category/java/design.xml/LoosePackageCoupling" deprecated="true" />

	</ruleset>

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

[core] Avoid parsing rulesets multiple times #724

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Uh oh!

[core] Avoid parsing rulesets multiple times #724

Description

Metadata

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Issue actions