[core] Support language dialects #5438

jsotuyod · 2025-01-01T19:58:27Z

Motivation

XML is a supported language in PMD. However, there are some specific languages based on XML. One example would be XHTML (but not HTML5...) and XSLT and WSDL. Currently, we have the language XML (https://docs.pmd-code.org/latest/pmd_languages_xml.html) and some related languages (Maven POM, WSDL, XSLT). These are all in the module pmd-xml. They provide their own rules (e.g. https://docs.pmd-code.org/latest/pmd_rules_xsl.html).

In the future, we might want to support other XML-based languages, especially in the Salesforce world, such as XML Metadata (custom fields, etc.) and LWC (Lightning Web Components). Note, that Visualforce (https://docs.pmd-code.org/latest/pmd_languages_visualforce.html) is currently an language of its own with its own grammar (supporting the expression language), but it is actually also XML-based.

These specialized languages, that are based on another one, will be called Language Dialects. They support their own file extensions, own language dialect versions, own rules, own custom XPath functions. At the same time, they "inherit" the features from the base language.

Describe the PR

Provide a way to declare a language as a dialect of another.

When a language is declared as a dialect:

The same parser (and AST) of the base language is used.
All language properties, XPath functions and metrics of the base language are still available

These 2 conditions ensure that all rules that were applicable to the base language are also applicable to the dialect (and do so in a single parser pass).

Additionally, the dialect may extend upon these, adding it's own language properties, XPath functions or metrics for it's own rules to use.

The implementation is completely declarative. This means there is no compile-time dependency on the base language, nor requires the base language classes (language module, property bundles, language version handler, etc.) to provide any special constructors to ensure extensibility. It also means, these may change without breaking the dialect.

Additionally, this introduces a major constraint. Once a LanguageMetadata is declared to be a dialect (asDialectOf(String id)), it forces users to not use SimpleLanguageModuleBase as a base, but the new SimpleDialectLanguageModuleBase, which implements all the above features.

This aims at forcing a correct dialect implementation. If a non-dialect language metadata is passed to SimpleDialectLanguageModuleBase a compile time error occurs rather than a runtime error.
Additionally, by leading developers to extend SimpleDialectLanguageModuleBase, a correct implementation is achieved, which honors all dialect guarantees. Unfortunately, it would still be possible to manually extend LanguageModuleBase in a broken way (ie: a dialect not using the base language parser).

I couldn't think of a better / more comprehensive way to achieve this.

I've been thinking a lot about how to map dialect lang version to base lang version… The mapping are not required to be 1:1 (ie: my xml dialect may accept xml 1.0 or 1.1). Inferring at runtime which is under use (even if possible in XML) is not done by PMD as it can't be done for most languages. I'm therefore thinking of having this be provided by the user (ie: --use-version xml-1.0 --use-version pom-4.0.0) or apply the default if not provided. As such, the only thing a dialect may do is state "compatibility", producing a runtime error if not satisfied… but that would hardly be future-proof / I'm not really sure if it's valuable. Hence, I'm considering leaving that out for this iteration of the API (which is still @Experimental).

Pending work:

I'm still thinking on how to make it self-evident that a language is a dialect. One plausible way would be to force dialect ids to become {baselang}/{dialect}. This would force rule definitions in XMLs to use this id, and would change the CLI's usage of {lang}-{version} convention to {lang}/{dialect}-{version}, so pom-4.0.0 would become xml/pom-4.0.0.

This obviously is a breaking change, and would require some close-gap patchwork with proper deprecation messages when the old convention is used.

RuleSet.applies still needs to be updated to acknowledge version numbers for dialect and base language may be different.

Also missing:

Documentation!
Tests

Related issues

Ready?

Added unit tests for fixed bug/feature
Passing all unit tests
Complete build ./mvnw clean verify passes (checked automatically by github actions)
Added (in-code) documentation (if needed)

- The idea is to force people defining a metadata to follow a different implementation path (and hence, ensuring it behaves as a dialect)

- This takes in DialectLanguageMetadata instances and ensures the implementation for the language: - Uses the same parser as the base language, and therefore the same AST, so rules of the base language are applicable to a dialect. - Keeps all decorators from the base language, and adds the dialect's - Keeps all XPath functions from the base language, and adds the dialect's - Keeps all metrics from the base language, and adds the dialect's - Keeps all violation suppression from the base language, and adds the dialect's - Keeps all language properties from the language, and adds the dialect's

- Now they are proper dialects, without needing to extend or have a compile-time dependency into the base language.

- When checking applicability we should always pass the proper language, not the applied dialect nonetheless, for proper version checking.

- Tests are passing, but we are not doing proper version check for rule application. - Overall, things are still tricky… there are several things that I've found while working on this that are not really sorted out: - If more than one language is applicable (even without dialects), rules get initialized for all processors, even those that don't apply. - Analysis cache is unaware of the chosen lang-version for the analysis, so even if the ruleset is the same, results may vary. - FileVersionDiscoverer configures the forced language, but doesn't apply it… that is done by the FileCollector, and not on all flows… - LanguageVersionHandler has no access to the actual LanguageVersion… at all. These are only available to the LanguageProcessor / LanguagePropertyBundle, and only to the instance created in PmdAnalysis

github-actions · 2025-04-17T21:45:05Z

Documentation Preview

Compared to main:
This changeset changes 0 violations,
introduces 0 new violations, 0 new errors and 0 new configuration errors,
removes 0 violations, 0 errors and 0 configuration errors.

Regression Tester Report

- They are used in collections, but always assuming they are singletons. - In practice this is true, but inherently, we actually expect QNames to be unique (that is how Saxon finds them underneath), so let's make that explicit in the implementation.

- Make it readily available for all tests instead of defined outside of it only for Saxon tests.

- Add a dialect-level xpath fn, ensure it's available. - Add skeletons for other promises to be tested

- Avoid any binary incompatibility. The classes remain, but unused.

jsotuyod · 2025-04-18T05:22:42Z

Sorry for the long time without following up on this…

This is now fully functional, and honors all the promises.

There are only 2 things pending, none of which I consider a blocker at this point in time.

Even though all base language rules applies to a file associated with a dialect (ie: all configured XML rules are applied to XSLT files), there is no obvious way to indicate this is the case. I see 2 possible ways to go around it:
1. We document it in the language docs (which will probably go unnoticed)
2. We change how we set-up the lang ids of dialects (ie: forcing it to be {baselangid}/{dialectid}. I personally prefer this one, but it changes how languages (and rules) are defined and consumed it's a breaking change that requires careful backwards compatibility and somewhat extensive changes to accommodate it while providing proper error messages. So this should be a separate PR.
Right now Ruleset.applies will apply any rule for the base language without any version checks, as the base language version and the dialect version don't need to match (ie: XSLT 3.0 may be on top of an XML 1.0 or 1.1 file). Inferring at runtime which is under use (even if possible in XML) is not done by PMD as it can't be done for most languages. I'm therefore thinking of having this be provided by the user (ie: --use-version xml-1.0 --use-version pom-4.0.0) or apply the default if not provided. However, right now this is not possible. The way we manage language versions within PMD is very clanky, and needs some significant rework before this can be done properly. As this is currently not an issue on any of the existing dialects (only Java actually uses lang versions for rule applicability), this can wait for another PR.

Why do I say our lang version management is clanky?

First and most importantly, the dialect Language / LanguageVersion itself doesn't know which version of the base language is in use. Neither does the parser, that only knows it's own dialect version and only when parsing (through the ParserTask instance), so when RuleSet.applies is invoked, we simply don't have access to the info. Possibly passing around a LanguageVersionDiscoverer and asking for the version of the base language is a way around it; but I am not sure it's up to RuleSet to know how to do this… it seems to me a dialect version should know the base lang version it's being mapped to… but giving it access to LanguageVersionDiscoverer would introduce a tight coupling… we seem to be missing some pre-processing during setup of the the LanguageRegistry to cater for this.
On the topic of LanguageVersionDiscoverer, this class is where we set forced languages… yet it's not where we apply them… that logic resides on FileCollector that uses the LanguageVersionDiscoverer and decides how / when to apply this override… This is calling for a refactor.
Our cache is actually unaware of language versions in use… so even 'though the results may change by changing the language version; the cache will not be invalidated.
Counterintuitively, no implementation of LanguageVersionHandler actually handles a LanguageVersion… in hindsight setting up the parser for the language version for this run at this point would have been probably saner than setting that up per-file on each call to parse.
All LanguagePropertyBundle contain a version property and exposes a getLanguageVersion()method… but the property is never set, it will always return the default version. Thankfully, that is only consumed by BatchLanguageProcessor that exposes that in a method that is never used.

So there is a lot of work to be done with this… but again, none of it seems to be blocking supporting dialects as an experimental feature; nor block in any way the dialects we are already shipping with PMD.

adangel

Thanks!

I've reviewed it and found a couple of typos/nitpicks/etc. I think, there are only 2 questions left, but since this is for now an experimental feature, I don't think we should wait any further. I'm going to leave these comments open and resolve the others I fixed in the below mentioned commits.

I've fixed the typos/nitpicks here: https://github.com/pmd/pmd/tree/pr-5438 since I can't push to your branch on Monits. This also includes a merge from main and release notes.
Specifically - when I'm going to merge this today or early tomorrow - these are the two commits:

3b65a41 (fixes)
62029e8 (release notes)

docs/pages/pmd/devdocs/major_contributions/adding_a_dialect.md

adangel · 2025-04-24T07:18:05Z

docs/pages/pmd/devdocs/major_contributions/adding_a_dialect.md

+## Steps
+
+### 1.  Create a dialect module
+*   Dialects usually reside in the same module of the base language they leverage; but can technically live standalone in a separate module if needed.


Future:

Should we demand a more rigorous structure? While it's technically possible, should we allow separate modules?

Grouped together with the base language - Pro:
🟢 This makes sure, that base language is always available at runtime.
🟢 Easy to see which dialects are existing.
🟢 Clear and unambiguous rule

Con:
🔴 It is more a technical grouping, than logical. E.g. Visualforce would belong to a (future) pmd-salesforce module and not to pmd-xml...
🔴 If you add the dependency to pmd-xml, you might get a whole bunch of languages/dialects, you don't necessarily need/aren't interested in.

Since this is experimental, we can still change this. But it would have an impact of currently existing modules, which would need to be refactored, so that PMD itself serves as a good example.

We also should suggest a base package name for all the classes created for the dialect: net.sourceforge.pmd.lang.<base-lang-id>.<dialect-lang-id>.

+1 to tighter guidelines.

I'm in for using the same module within PMD… but I felt it was worth documenting this is not required. For instance, if I decide to create a dialect that for some reason I don't want to contribute back to PMD, I can create it without needing to fork PMD… not something we necessarily want to promote, but supported by this design.

All-in on the package convention. It is the one we are following either way.

Right, I think we need to differ between modules inside PMD and distributed as part of the "official" PMD binary and a module developed outside. Maybe we can just add a sentence or so like "..." - hm... actually, it is your sentence already 😄 The only problem I see here we use the word "module" for different things: "Dialect module" (in the heading) and "maven module". Maybe this would work?

Dialects usually reside in the same maven module of the base language they leverage; but can technically live standalone in a separate maven module or project if needed (e.g. to develop the dialect module as a project separate from PMD).

But let's change this in a follow-up PR (including mentioning the desired package name explicitly).

sounds good!

docs/pages/pmd/devdocs/major_contributions/adding_a_dialect.md

pmd-xml/src/main/java/net/sourceforge/pmd/lang/xml/wsdl/WsdlDialectModule.java

pmd-xml/src/main/java/net/sourceforge/pmd/lang/xml/wsdl/WsdlLanguageModule.java

pmd-xml/src/main/java/net/sourceforge/pmd/lang/xml/xsl/XslDialectModule.java

pmd-xml/src/main/java/net/sourceforge/pmd/lang/xml/xsl/XslLanguageModule.java

pmd-core/src/main/java/net/sourceforge/pmd/lang/rule/RuleSet.java

jsotuyod · 2025-04-24T16:17:57Z

Thanks for the feedback and review! 🎉

Merge pull request #5438 from Monits:lang-dialects

oowekyala and others added 6 commits December 30, 2024 17:49

Add dialect concept, make POM, WSDL, XSL dialects of XML

1569af3

Make rules of XML apply to its dialects

dc3b762

Model Dialect Language Metadata as standalone

79e46a9

- The idea is to force people defining a metadata to follow a different implementation path (and hence, ensuring it behaves as a dialect)

Allow to combine to XPathHandlers together

619f35e

Update the XML languages' implementation

7761a95

- Now they are proper dialects, without needing to extend or have a compile-time dependency into the base language.

jsotuyod added an:enhancement An improvement on existing features / rules in:pmd-internals Affects PMD's internals labels Jan 1, 2025

jsotuyod added 3 commits January 1, 2025 17:10

Fix PMD warnings

954097a

Fix style violations

892bb30

Style issues

254e803

oowekyala self-requested a review January 2, 2025 00:04

jsotuyod added 4 commits January 19, 2025 01:48

Add LangVersionDiscoverer test

4a09fd4

Lang is no longer a dialect of itself

c04ed84

- When checking applicability we should always pass the proper language, not the applied dialect nonetheless, for proper version checking.

Test dialect promises

2830e1a

jsotuyod added 12 commits April 17, 2025 19:20

Have the DummyLanguage define an xpath fn

423e4b6

- Make it readily available for all tests instead of defined outside of it only for Saxon tests.

Continue testing dialect promises

de1c850

- Add a dialect-level xpath fn, ensure it's available. - Add skeletons for other promises to be tested

Properly test dialect properties

9ec8ee7

Add tests for dialect metric guarantees

97c7b70

Fix style issues and add license headers

a851689

Update the @SInCE docs to reflect current version

0675019

Reorder statements for short-circuit exit

d70a56a

Rename language modules to explicitly call them dialects

c2adfbe

Restore the old language module as deprecated

55336e7

- Avoid any binary incompatibility. The classes remain, but unused.

Add documentation on creating a dialect

11f47df

Review and fix docs

c02dfd0

jsotuyod marked this pull request as ready for review April 18, 2025 05:22

jsotuyod added this to the 7.13.0 milestone Apr 18, 2025

adangel self-requested a review April 22, 2025 17:54

adangel mentioned this pull request Apr 24, 2025

Cleanups for PMD 8 #5703

Open

adangel added a commit that referenced this pull request Apr 24, 2025

[doc] Update release notes (#5438)

62029e8

adangel approved these changes Apr 24, 2025

View reviewed changes

adangel added a commit that referenced this pull request Apr 25, 2025

[doc] Update release notes (#5438)

63bc45a

adangel added a commit that referenced this pull request Apr 25, 2025

[core] Support language dialects (#5438)

df92a0e

Merge pull request #5438 from Monits:lang-dialects

adangel merged commit c02dfd0 into pmd:main Apr 25, 2025
12 checks passed

jsotuyod deleted the lang-dialects branch April 25, 2025 16:30

jsotuyod mentioned this pull request Aug 18, 2025

[core] Improve internal language version management #5993

Open

5 tasks

Uh oh!

[core] Support language dialects #5438

[core] Support language dialects #5438

Uh oh!

Conversation

jsotuyod commented Jan 1, 2025 • edited by adangel Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Motivation

Describe the PR

Related issues

Ready?

Uh oh!

github-actions bot commented Apr 17, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

jsotuyod commented Apr 18, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

adangel left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

adangel Apr 24, 2025

Choose a reason for hiding this comment

Uh oh!

jsotuyod Apr 24, 2025

Choose a reason for hiding this comment

Uh oh!

adangel Apr 25, 2025

Choose a reason for hiding this comment

Uh oh!

jsotuyod Apr 25, 2025

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

jsotuyod commented Apr 24, 2025

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

jsotuyod commented Jan 1, 2025 •

edited by adangel

Loading

github-actions bot commented Apr 17, 2025 •

edited

Loading

jsotuyod commented Apr 18, 2025 •

edited

Loading