We currently parse XMLs in a greedy and not very efficient way.
- Upon finding a ruleset reference (ended in .xml), we go and parse that file completely before continuing.
- Upon finding a rule reference from another ruleset (ended in .xml/rule-name) we go and parse that file completely, ignoring everything but the one rule we were looking for.
This means that, a ruleset such as:
|
<?xml version="1.0"?> |
|
|
|
<ruleset name="Coupling" |
|
xmlns="http://pmd.sourceforge.net/ruleset/2.0.0" |
|
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" |
|
xsi:schemaLocation="http://pmd.sourceforge.net/ruleset/2.0.0 http://pmd.sourceforge.net/ruleset_2_0_0.xsd"> |
|
|
|
<description> |
|
Rules which find instances of high or inappropriate coupling between objects and packages. |
|
</description> |
|
|
|
<!-- Rules, that have been moved into a category --> |
|
<rule ref="category/java/bestpractices.xml/LooseCoupling" deprecated="true" /> |
|
|
|
<rule ref="category/java/design.xml/CouplingBetweenObjects" deprecated="true" /> |
|
<rule ref="category/java/design.xml/ExcessiveImports" deprecated="true" /> |
|
<rule ref="category/java/design.xml/LawOfDemeter" deprecated="true" /> |
|
<rule ref="category/java/design.xml/LoosePackageCoupling" deprecated="true" /> |
|
|
|
</ruleset> |
Will parse completely once looking for category/java/bestpractices.xml, and then will proceed to completely parse category/java/design.xml 4 times, to get the other rules.
Having multiple such references is common, and will be far more with the ruleset reorganization, since categories are fewer, larger, and intended to be broken up and referenced rule by rule from rulesets.
This also means, it's possible to produce cycles in the ruleset references and produce infinite loops (which end in Stackoverflows / OutOfMemory errors).
I believe we should rework the rule parsing in order to:
- Parse each file exactly once
- Avoid parsing beyond what's needed
- Detect and report properly cycles
One possible approach to this would be:
- Start with an empty set of open and a sorted list of closed
RuleSetReferenceId containing all rulesets indicated by user input, with the "include all" flag set (more on sorting later on)
- Remove the first
RuleSetReferenceId from closed, put it into open, and get a parser for it
- If the "include all" flag is set, parse the file completely. Rule definitions are loaded, rule / ruleset references, are simply translated into
RuleSetReferenceId and if present in open, an exception is thrown (cycle!) otherwise check if closed has another reference to this xml, and merge it or add it directly (more on merging below)
- If the "include all" flag is not set, parse until we find any required rule name and mark it as loaded from
RuleSetReferenceId. Rule definitions are loaded, rule / ruleset references, are simply translated into RuleSetReferenceId and if present in open, an exception is thrown (cycle!) otherwise check if closed has another reference to this xml, and merge it or add it directly (more on merging below)
- If not all references from
RuleSetReferenceId are loaded, go back to step 2.ii, otherwise just stop parsing.
- Go back to step 2 until open is empty
Notice that the 2 scenarios (include all, described in 2.i and not, described in 2.ii) can be unified into a single one if the haveAllRequiredRulesBeenLoaded() check for RuleSetReferenceId returns always false in the include all scenario.
On sorting:
- we should prioritize ruleset references with the "include all" flag. That way we are able to include ALL references to rules within a single ruleset xml before parsing it.
- For non "include all" references, order is probably irrelevant, but prioritizing those with more rules to be loaded could be beneficial under complex ruleset graphs
On merging:
RuleSetReferenceId should be reworked to be able to reference, 1, many or all rules within a ruleset; and if not including all, keep track of which ones have been loaded. This way we can merge 2 references to different rules inside the same xml to be parsed together.
The one scenario that wouldn't be supported, would be a rule within a ruleset that references another rule within the same ruleset. This would be detected as a cycle, which is technically correct since it's a loop in the ruleset graph. We could support it if it was later in the file, but such distinction seems broken to me, I'd rather just throw for cycle.
This refactor would pose a good chance to split rule parsing from ruleset generation, as it has proved to be troublesome as show on https://github.com/pmd/pmd/pull/731/files#r152080633
We currently parse XMLs in a greedy and not very efficient way.
This means that, a ruleset such as:
pmd/pmd-java/src/main/resources/rulesets/java/coupling.xml
Lines 1 to 20 in b159143
Will parse completely once looking for
category/java/bestpractices.xml, and then will proceed to completely parsecategory/java/design.xml4 times, to get the other rules.Having multiple such references is common, and will be far more with the ruleset reorganization, since categories are fewer, larger, and intended to be broken up and referenced rule by rule from rulesets.
This also means, it's possible to produce cycles in the ruleset references and produce infinite loops (which end in Stackoverflows / OutOfMemory errors).
I believe we should rework the rule parsing in order to:
One possible approach to this would be:
RuleSetReferenceIdcontaining all rulesets indicated by user input, with the "include all" flag set (more on sorting later on)RuleSetReferenceIdfrom closed, put it into open, and get a parser for itRuleSetReferenceIdand if present in open, an exception is thrown (cycle!) otherwise check if closed has another reference to this xml, and merge it or add it directly (more on merging below)RuleSetReferenceId. Rule definitions are loaded, rule / ruleset references, are simply translated intoRuleSetReferenceIdand if present in open, an exception is thrown (cycle!) otherwise check if closed has another reference to this xml, and merge it or add it directly (more on merging below)RuleSetReferenceIdare loaded, go back to step 2.ii, otherwise just stop parsing.Notice that the 2 scenarios (include all, described in 2.i and not, described in 2.ii) can be unified into a single one if the
haveAllRequiredRulesBeenLoaded()check forRuleSetReferenceIdreturns always false in theinclude allscenario.On sorting:
On merging:
RuleSetReferenceIdshould be reworked to be able to reference, 1, many or all rules within a ruleset; and if not including all, keep track of which ones have been loaded. This way we can merge 2 references to different rules inside the same xml to be parsed together.The one scenario that wouldn't be supported, would be a rule within a ruleset that references another rule within the same ruleset. This would be detected as a cycle, which is technically correct since it's a loop in the ruleset graph. We could support it if it was later in the file, but such distinction seems broken to me, I'd rather just throw for cycle.
This refactor would pose a good chance to split rule parsing from ruleset generation, as it has proved to be troublesome as show on https://github.com/pmd/pmd/pull/731/files#r152080633