Skip to content

Conversation

@jsotuyod
Copy link
Member

@jsotuyod jsotuyod commented Jan 1, 2025

Motivation

XML is a supported language in PMD. However, there are some specific languages based on XML. One example would be XHTML (but not HTML5...) and XSLT and WSDL. Currently, we have the language XML (https://docs.pmd-code.org/latest/pmd_languages_xml.html) and some related languages (Maven POM, WSDL, XSLT). These are all in the module pmd-xml. They provide their own rules (e.g. https://docs.pmd-code.org/latest/pmd_rules_xsl.html).

In the future, we might want to support other XML-based languages, especially in the Salesforce world, such as XML Metadata (custom fields, etc.) and LWC (Lightning Web Components). Note, that Visualforce (https://docs.pmd-code.org/latest/pmd_languages_visualforce.html) is currently an language of its own with its own grammar (supporting the expression language), but it is actually also XML-based.

These specialized languages, that are based on another one, will be called Language Dialects. They support their own file extensions, own language dialect versions, own rules, own custom XPath functions. At the same time, they "inherit" the features from the base language.

Describe the PR

Provide a way to declare a language as a dialect of another.

When a language is declared as a dialect:

  1. The same parser (and AST) of the base language is used.
  2. All language properties, XPath functions and metrics of the base language are still available

These 2 conditions ensure that all rules that were applicable to the base language are also applicable to the dialect (and do so in a single parser pass).

Additionally, the dialect may extend upon these, adding it's own language properties, XPath functions or metrics for it's own rules to use.

The implementation is completely declarative. This means there is no compile-time dependency on the base language, nor requires the base language classes (language module, property bundles, language version handler, etc.) to provide any special constructors to ensure extensibility. It also means, these may change without breaking the dialect.

Additionally, this introduces a major constraint. Once a LanguageMetadata is declared to be a dialect (asDialectOf(String id)), it forces users to not use SimpleLanguageModuleBase as a base, but the new SimpleDialectLanguageModuleBase, which implements all the above features.

This aims at forcing a correct dialect implementation. If a non-dialect language metadata is passed to SimpleDialectLanguageModuleBase a compile time error occurs rather than a runtime error.
Additionally, by leading developers to extend SimpleDialectLanguageModuleBase, a correct implementation is achieved, which honors all dialect guarantees. Unfortunately, it would still be possible to manually extend LanguageModuleBase in a broken way (ie: a dialect not using the base language parser).

I couldn't think of a better / more comprehensive way to achieve this.

I've been thinking a lot about how to map dialect lang version to base lang version… The mapping are not required to be 1:1 (ie: my xml dialect may accept xml 1.0 or 1.1). Inferring at runtime which is under use (even if possible in XML) is not done by PMD as it can't be done for most languages. I'm therefore thinking of having this be provided by the user (ie: --use-version xml-1.0 --use-version pom-4.0.0) or apply the default if not provided. As such, the only thing a dialect may do is state "compatibility", producing a runtime error if not satisfied… but that would hardly be future-proof / I'm not really sure if it's valuable. Hence, I'm considering leaving that out for this iteration of the API (which is still @Experimental).

Pending work:

I'm still thinking on how to make it self-evident that a language is a dialect. One plausible way would be to force dialect ids to become {baselang}/{dialect}. This would force rule definitions in XMLs to use this id, and would change the CLI's usage of {lang}-{version} convention to {lang}/{dialect}-{version}, so pom-4.0.0 would become xml/pom-4.0.0.

This obviously is a breaking change, and would require some close-gap patchwork with proper deprecation messages when the old convention is used.

RuleSet.applies still needs to be updated to acknowledge version numbers for dialect and base language may be different.

Also missing:

  • Documentation!
  • Tests

Related issues

Ready?

  • Added unit tests for fixed bug/feature
  • Passing all unit tests
  • Complete build ./mvnw clean verify passes (checked automatically by github actions)
  • Added (in-code) documentation (if needed)

oowekyala and others added 6 commits December 30, 2024 17:49
 - The idea is to force people defining a metadata to follow a different
   implementation path (and hence, ensuring it behaves as a dialect)
 - This takes in DialectLanguageMetadata instances and ensures the
   implementation for the language:
    - Uses the same parser as the base language, and therefore the same
      AST, so rules of the base language are applicable to a dialect.
    - Keeps all decorators from the base language, and adds the
      dialect's
    - Keeps all XPath functions from the base language, and adds the
      dialect's
    - Keeps all metrics from the base language, and adds the dialect's
    - Keeps all violation suppression from the base language, and adds
      the dialect's
    - Keeps all language properties from the language, and adds the
      dialect's
 - Now they are proper dialects, without needing to extend or have a
   compile-time dependency into the base language.
@jsotuyod jsotuyod added an:enhancement An improvement on existing features / rules in:pmd-internals Affects PMD's internals labels Jan 1, 2025
@oowekyala oowekyala self-requested a review January 2, 2025 00:04
 - When checking applicability we should always pass the proper
   language, not the applied dialect nonetheless, for proper version
   checking.
 - Tests are passing, but we are not doing proper version check for rule
   application.
 - Overall, things are still tricky… there are several things that I've
   found while working on this that are not really sorted out:
    - If more than one language is applicable (even without dialects), rules
      get initialized for all processors, even those that don't apply.
    - Analysis cache is unaware of the chosen lang-version for the
      analysis, so even if the ruleset is the same, results may vary.
    - FileVersionDiscoverer configures the forced language, but doesn't
      apply it… that is done by the FileCollector, and not on all flows…
    - LanguageVersionHandler has no access to the actual
      LanguageVersion… at all. These are only available to the
      LanguageProcessor / LanguagePropertyBundle, and only to the
      instance created in PmdAnalysis
@github-actions
Copy link

github-actions bot commented Apr 17, 2025

Documentation Preview

Compared to main:
This changeset changes 0 violations,
introduces 0 new violations, 0 new errors and 0 new configuration errors,
removes 0 violations, 0 errors and 0 configuration errors.

Regression Tester Report

jsotuyod added 12 commits April 17, 2025 19:20
 - They are used in collections, but always assuming they are
   singletons.
 - In practice this is true, but inherently, we actually expect QNames
   to be unique (that is how Saxon finds them underneath), so let's make
   that explicit in the implementation.
 - Make it readily available for all tests instead of defined outside of
   it only for Saxon tests.
 - Add a dialect-level xpath fn, ensure it's available.
 - Add skeletons for other promises to be tested
 - Avoid any binary incompatibility. The classes remain, but unused.
@jsotuyod
Copy link
Member Author

jsotuyod commented Apr 18, 2025

Sorry for the long time without following up on this…

This is now fully functional, and honors all the promises.

There are only 2 things pending, none of which I consider a blocker at this point in time.

  1. Even though all base language rules applies to a file associated with a dialect (ie: all configured XML rules are applied to XSLT files), there is no obvious way to indicate this is the case. I see 2 possible ways to go around it:
    1. We document it in the language docs (which will probably go unnoticed)
    2. We change how we set-up the lang ids of dialects (ie: forcing it to be {baselangid}/{dialectid}. I personally prefer this one, but it changes how languages (and rules) are defined and consumed it's a breaking change that requires careful backwards compatibility and somewhat extensive changes to accommodate it while providing proper error messages. So this should be a separate PR.
  2. Right now Ruleset.applies will apply any rule for the base language without any version checks, as the base language version and the dialect version don't need to match (ie: XSLT 3.0 may be on top of an XML 1.0 or 1.1 file). Inferring at runtime which is under use (even if possible in XML) is not done by PMD as it can't be done for most languages. I'm therefore thinking of having this be provided by the user (ie: --use-version xml-1.0 --use-version pom-4.0.0) or apply the default if not provided. However, right now this is not possible. The way we manage language versions within PMD is very clanky, and needs some significant rework before this can be done properly. As this is currently not an issue on any of the existing dialects (only Java actually uses lang versions for rule applicability), this can wait for another PR.

Why do I say our lang version management is clanky?

  • First and most importantly, the dialect Language / LanguageVersion itself doesn't know which version of the base language is in use. Neither does the parser, that only knows it's own dialect version and only when parsing (through the ParserTask instance), so when RuleSet.applies is invoked, we simply don't have access to the info. Possibly passing around a LanguageVersionDiscoverer and asking for the version of the base language is a way around it; but I am not sure it's up to RuleSet to know how to do this… it seems to me a dialect version should know the base lang version it's being mapped to… but giving it access to LanguageVersionDiscoverer would introduce a tight coupling… we seem to be missing some pre-processing during setup of the the LanguageRegistry to cater for this.
  • On the topic of LanguageVersionDiscoverer, this class is where we set forced languages… yet it's not where we apply them… that logic resides on FileCollector that uses the LanguageVersionDiscoverer and decides how / when to apply this override… This is calling for a refactor.
  • Our cache is actually unaware of language versions in use… so even 'though the results may change by changing the language version; the cache will not be invalidated.
  • Counterintuitively, no implementation of LanguageVersionHandler actually handles a LanguageVersion… in hindsight setting up the parser for the language version for this run at this point would have been probably saner than setting that up per-file on each call to parse.
  • All LanguagePropertyBundle contain a version property and exposes a getLanguageVersion()method… but the property is never set, it will always return the default version. Thankfully, that is only consumed by BatchLanguageProcessor that exposes that in a method that is never used.

So there is a lot of work to be done with this… but again, none of it seems to be blocking supporting dialects as an experimental feature; nor block in any way the dialects we are already shipping with PMD.

@jsotuyod jsotuyod marked this pull request as ready for review April 18, 2025 05:22
@jsotuyod jsotuyod added this to the 7.13.0 milestone Apr 18, 2025
@adangel adangel self-requested a review April 22, 2025 17:54
@adangel adangel mentioned this pull request Apr 24, 2025
adangel added a commit that referenced this pull request Apr 24, 2025
Copy link
Member

@adangel adangel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks!

I've reviewed it and found a couple of typos/nitpicks/etc. I think, there are only 2 questions left, but since this is for now an experimental feature, I don't think we should wait any further. I'm going to leave these comments open and resolve the others I fixed in the below mentioned commits.

I've fixed the typos/nitpicks here: https://github.com/pmd/pmd/tree/pr-5438 since I can't push to your branch on Monits. This also includes a merge from main and release notes.
Specifically - when I'm going to merge this today or early tomorrow - these are the two commits:

## Steps

### 1. Create a dialect module
* Dialects usually reside in the same module of the base language they leverage; but can technically live standalone in a separate module if needed.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Future:

Should we demand a more rigorous structure? While it's technically possible, should we allow separate modules?

Grouped together with the base language - Pro:
🟢 This makes sure, that base language is always available at runtime.
🟢 Easy to see which dialects are existing.
🟢 Clear and unambiguous rule

Con:
🔴 It is more a technical grouping, than logical. E.g. Visualforce would belong to a (future) pmd-salesforce module and not to pmd-xml...
🔴 If you add the dependency to pmd-xml, you might get a whole bunch of languages/dialects, you don't necessarily need/aren't interested in.

Since this is experimental, we can still change this. But it would have an impact of currently existing modules, which would need to be refactored, so that PMD itself serves as a good example.

We also should suggest a base package name for all the classes created for the dialect: net.sourceforge.pmd.lang.<base-lang-id>.<dialect-lang-id>.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 to tighter guidelines.

I'm in for using the same module within PMD… but I felt it was worth documenting this is not required. For instance, if I decide to create a dialect that for some reason I don't want to contribute back to PMD, I can create it without needing to fork PMD… not something we necessarily want to promote, but supported by this design.

All-in on the package convention. It is the one we are following either way.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right, I think we need to differ between modules inside PMD and distributed as part of the "official" PMD binary and a module developed outside. Maybe we can just add a sentence or so like "..." - hm... actually, it is your sentence already 😄 The only problem I see here we use the word "module" for different things: "Dialect module" (in the heading) and "maven module". Maybe this would work?

Dialects usually reside in the same maven module of the base language they leverage; but can technically live standalone in a separate maven module or project if needed (e.g. to develop the dialect module as a project separate from PMD).

But let's change this in a follow-up PR (including mentioning the desired package name explicitly).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sounds good!

@jsotuyod
Copy link
Member Author

Thanks for the feedback and review! 🎉

adangel added a commit that referenced this pull request Apr 25, 2025
adangel added a commit that referenced this pull request Apr 25, 2025
Merge pull request #5438 from Monits:lang-dialects
@adangel adangel merged commit c02dfd0 into pmd:main Apr 25, 2025
12 checks passed
@jsotuyod jsotuyod deleted the lang-dialects branch April 25, 2025 16:30
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

an:enhancement An improvement on existing features / rules in:pmd-internals Affects PMD's internals

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants