RAT-321: text based configuration#157

Merged
ottlinger merged 50 commits into apache:master from
Claudenw:RAT-321_text_based_configuration
Oct 20, 2023

Claudenw commented Oct 7, 2023

Overview

This is a larger change than I had hoped for. However, the change has been minimized as much as possible. The goal of this change is to switch to a text based configuration and in the process simplify the configuration architecture.

A secondary goal was to attempt to align the configuration options so that the same property name is used across all the user interfaces.

Changes from the user perspective.

For many users there are no changes as the original licenses are maintained in the change. For users that define custom licenses there are changes.

For users that define special or custom licenses the easiest solution is to rewrite the custom licenses into a configuration file and include that when running RAT.

Configuration format

Configuration format is defined in XML only in this change. However, future implementations of other formats are possible and anticipated in the code. The default configuration file is located in /apache-rat-core/src/main/resources/org/apache/rat/default.xml

The configuration file starts with a <rat-config> tag and ends with a closing </rat-config>. Within the configuration there are 3 sections:

  • <licenses> - Contains the definition of licenses.
  • <approved> - An optional list of approved licenses. If not specified all licenses in the <licenses> element are assumed to be approved. The licenses in the list may include licenses defined but not approved in other configuration files.
  • <matchers> - Defines matcher builder implementations. Implementations have names like TextBuilder. The matcher name is derived by removing the trailing Builder and lowercasing the first part (e.g. CopyrightBuilder becomes the copyright matcher).
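
As a rough illustration of the three sections described above, the following sketch embeds a minimal configuration in a Java string and lists its top-level sections with the JDK's DOM parser. The license id EXMPL, its name and the matched text are made-up placeholders, and the class ConfigSkeletonSketch is hypothetical; it is not part of RAT and does not use RAT's own configuration reader.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper for illustration only; not part of RAT.
final class ConfigSkeletonSketch {

    // A minimal configuration with the three sections named in the description.
    // The license id, name and text are placeholders.
    static final String XML = String.join("\n",
        "<rat-config>",
        "  <licenses>",
        "    <license id='EXMPL' name='Example License'>",
        "      <text>Licensed under the Example License</text>",
        "    </license>",
        "  </licenses>",
        "  <approved>",
        "    <family license_ref='EXMPL'/>",
        "  </approved>",
        "  <matchers>",
        "    <matcher class='org.apache.rat.configuration.builders.AllBuilder'/>",
        "  </matchers>",
        "</rat-config>");

    // Returns the element names directly below the root element, in document order.
    static List<String> sectionNames(String xml) {
        Element root;
        try {
            root = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)))
                    .getDocumentElement();
        } catch (Exception e) {
            throw new IllegalArgumentException("configuration is not well-formed XML", e);
        }
        List<String> names = new ArrayList<>();
        NodeList children = root.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            Node child = children.item(i);
            if (child.getNodeType() == Node.ELEMENT_NODE) {
                names.add(child.getNodeName());
            }
        }
        return names;
    }
}
```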

License definition

Each license is enclosed in a <license> tag. The <license> tag has 3 properties:

  • id - The id of the license. Must be unique across all definitions. If more than 5 characters are specified only the first 5 are used, the rest are discarded. This is equivalent to the old LicenseFamilyCategory property.
  • name - The name of the license. Used in display. This is equivalent to the old LicenseFamilyName property.
  • derived_from - (optional) Specifies the id of a license from which the current license is derived. Currently this option is unprocessed but in future may be used to accept licenses from which an accepted license is derived.

Each license has up to two kinds of enclosed tags. The possible enclosed tags are:

  • <note> - Notes about the license. If multiple <note> tags are specified they are merged into a single note.
  • matcher - This is not a tag but one of several tags as defined in the <matchers> section of the configuration. There may be only one matcher. If more than one matcher is specified the last one is selected.
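
The rules above (id cut to 5 characters, multiple notes merged, last matcher wins) can be sketched as follows. This is only an illustration using the JDK's DOM parser, not the code in this PR. The class LicenseElementSketch is hypothetical, and joining notes with a newline is an assumption, since the description does not say how notes are merged.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical stand-in for illustration only; not RAT's license reader.
final class LicenseElementSketch {
    final String id;
    final String name;
    final String note;
    final String matcherTag;

    private LicenseElementSketch(String id, String name, String note, String matcherTag) {
        this.id = id;
        this.name = name;
        this.note = note;
        this.matcherTag = matcherTag;
    }

    // Parses a single <license> element given as an XML string.
    static LicenseElementSketch parse(String licenseXml) {
        Element license;
        try {
            license = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(licenseXml)))
                    .getDocumentElement();
        } catch (Exception e) {
            throw new IllegalArgumentException("license is not well-formed XML", e);
        }
        // Only the first 5 characters of the id are kept.
        String rawId = license.getAttribute("id");
        String id = rawId.length() > 5 ? rawId.substring(0, 5) : rawId;

        StringBuilder note = new StringBuilder();
        String matcherTag = null;
        NodeList children = license.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            Node child = children.item(i);
            if (child.getNodeType() != Node.ELEMENT_NODE) {
                continue;
            }
            if ("note".equals(child.getNodeName())) {
                // Assumption: merged notes are separated by a newline.
                if (note.length() > 0) {
                    note.append('\n');
                }
                note.append(child.getTextContent().trim());
            } else {
                // Any other child is treated as a matcher; a later one replaces an earlier one.
                matcherTag = child.getNodeName();
            }
        }
        return new LicenseElementSketch(id, license.getAttribute("name"), note.toString(), matcherTag);
    }
}
```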

Matcher definition

There are eight (8) matchers defined in the default configuration file. They have varying numbers of parameters and child nodes. All matchers have an id property. If the id property is not specified a default one is generated. The id is used to reference the matcher.

Text matcher

The <text> matcher matches text just like the old text matching did. The text to match can be specified either in a text property or simply by enclosing the text in <text> and </text> tags.

Regex matcher

The <regex> matcher uses a regular expression for matching. This is much slower than the text matching above. The expression is specified in an expr property on the tag.

Spdx matcher

The <spdx> matcher matches the SPDX tags of the form "SPDX-License-Identifier: ". The name attribute of the <spdx> tag specifies the name in the license identifier string.
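
A minimal sketch of the SPDX check, assuming it amounts to looking for the identifier prefix followed by the configured name on a header line. The class SpdxSketch is hypothetical and is not the matcher added in this PR. The check that the name is not followed by further identifier characters is an assumption added here, so that a name like "Apache-2" does not accept "Apache-2.0".

```java
import java.util.regex.Pattern;

// Hypothetical stand-in for illustration only; not RAT's SPDX matcher.
final class SpdxSketch {

    // Returns true if the line carries an SPDX tag for exactly the given name.
    static boolean matches(String name, String line) {
        Pattern pattern = Pattern.compile(
            "SPDX-License-Identifier:\\s*" + Pattern.quote(name)
            // Assumption: the name must not continue with more identifier characters.
            + "(?![\\w.+-])");
        return pattern.matcher(line).find();
    }
}
```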

Copyright matcher

The Copyright matcher is a new matcher that matches the tokens "Copyright", "(C)", "(c)", and "©". The token must be followed by a date, two dates separated by a dash, a copyright holder's name, or a combination of these. The <copyright> tag has 3 properties, all of which are optional:

  • start - the start year for the copyright. If only one year is provided this property should be set.
  • end - the end year for the copyright. This should only be set if start has been set and the copyright is expected to have dates of the form "9999 - 9999".
  • owner - The owner of the copyright.

The copyright matcher accepts either the name first or the date first. If no date or owner is specified it will match the copyright tokens followed by 4 digits. If no dates are specified but the owner is, it will match the token followed by the owner name. If date(s) and owner are specified, it will match the token followed by either the owner and then the date(s) or the date(s) and then the owner.
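
The combinations described above can be sketched as a regular expression built from the three optional properties. This is an illustration only: the class CopyrightSketch is hypothetical, the exact whitespace handling is an assumption, and the matcher in this PR may build its pattern differently.

```java
import java.util.regex.Pattern;

// Hypothetical stand-in for illustration only; not RAT's copyright matcher.
final class CopyrightSketch {

    // The four tokens listed in the description ("\u00A9" is the copyright sign).
    private static final String TOKEN = "(?:Copyright|\\(C\\)|\\(c\\)|\u00A9)";

    // Builds a pattern from the optional start, end and owner properties (null means unset).
    static Pattern build(String start, String end, String owner) {
        String dates = null;
        if (start != null) {
            dates = Pattern.quote(start);
            if (end != null) {
                // Two dates separated by a dash, with optional spaces around it.
                dates += "\\s*-\\s*" + Pattern.quote(end);
            }
        }
        String name = owner == null ? null : Pattern.quote(owner);

        String body;
        if (dates == null && name == null) {
            body = "\\d{4}";                 // token followed by 4 digits
        } else if (dates == null) {
            body = name;                     // token followed by the owner
        } else if (name == null) {
            body = dates;                    // token followed by the date(s)
        } else {
            // date(s) then owner, or owner then date(s)
            body = "(?:" + dates + "\\s+" + name + "|" + name + "\\s+" + dates + ")";
        }
        return Pattern.compile(TOKEN + "\\s+" + body);
    }

    static boolean matches(String start, String end, String owner, String line) {
        return build(start, end, owner).matcher(line).find();
    }
}
```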

MatcherRef matcher

This is a matcher that references another defined matcher. It has one property refId which matches the id property of another matcher. The referenced matcher is used in place of the MatcherRef.

Not matcher

This matcher reverses the meaning of the match. It has no properties and must enclose one and only one other matcher.

Any matcher

This matcher encloses a collection of matchers. For this matcher to match, at least one of the enclosed matchers must match.

All matcher

This matcher encloses a collection of matchers. For this matcher to match, all of the enclosed matchers must match.
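
The logic of the not, any and all matchers can be sketched with plain predicates. This is a deliberate simplification: each predicate here sees the whole header at once and returns a final boolean, whereas the matchers in this PR process headers line by line. The class CombinatorSketch is hypothetical.

```java
import java.util.List;
import java.util.function.Predicate;

// Hypothetical stand-in for illustration only; real matchers work line by line.
final class CombinatorSketch {

    // Matches when at least one enclosed matcher matches.
    static Predicate<String> any(List<Predicate<String>> matchers) {
        return header -> matchers.stream().anyMatch(m -> m.test(header));
    }

    // Matches only when every enclosed matcher matches.
    static Predicate<String> all(List<Predicate<String>> matchers) {
        return header -> matchers.stream().allMatch(m -> m.test(header));
    }

    // Reverses the result of the single enclosed matcher.
    static Predicate<String> not(Predicate<String> matcher) {
        return matcher.negate();
    }
}
```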

Approved section

The <approved> section specifies which licenses are approved. The <approved> section encloses one or more <family> tags, each with a license_ref property that contains the id of the approved license. Licenses defined in other configuration files may be listed for approval. If the approved section is not specified all licenses defined in the file are assumed to be approved.
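
The defaulting rule can be sketched as below. The class ApprovedSketch is hypothetical, and a null argument stands in for "the <approved> section is not specified"; how an empty <approved> section is treated is not stated in the description and is not modelled here.

```java
import java.util.Set;
import java.util.TreeSet;

// Hypothetical stand-in for illustration only; not part of RAT.
final class ApprovedSketch {

    // definedIds:   ids of the licenses defined in this file.
    // approvedRefs: license_ref values from <approved>, or null if the section is absent.
    static Set<String> approvedIds(Set<String> definedIds, Set<String> approvedRefs) {
        // Without an <approved> section every defined license counts as approved.
        // With one, only the listed ids do, including ids defined in other files.
        return new TreeSet<>(approvedRefs == null ? definedIds : approvedRefs);
    }
}
```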

Matchers section

The <matchers> section of the configuration file registers matcher builders for use in the system. Implementations have names like TextBuilder. The matcher name is derived by removing the trailing Builder and lowercasing the first part (e.g. CopyrightBuilder becomes the copyright matcher). The <matchers> section comprises child elements of the form <matcher class="org.apache.rat.configuration.builders.AllBuilder" name='somename'/> where the class attribute specifies the class name of the implementation. If the name attribute is specified it overrides the derived name of the matcher.
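
The naming rule can be sketched as follows. The class MatcherNameSketch is hypothetical and is not the registration code in this PR. Lowercasing the whole remaining part is one reading of "the first part lowercased"; the actual code may only lowercase the first character, which would differ for multi-word names.

```java
import java.util.Locale;

// Hypothetical stand-in for illustration only; not RAT's registration code.
final class MatcherNameSketch {

    private static final String SUFFIX = "Builder";

    // Derives the matcher name from a builder's simple class name.
    static String matcherName(String builderSimpleName) {
        String base = builderSimpleName.endsWith(SUFFIX)
            ? builderSimpleName.substring(0, builderSimpleName.length() - SUFFIX.length())
            : builderSimpleName;
        // Assumption: the whole remaining part is lowercased.
        return base.toLowerCase(Locale.ROOT);
    }

    // A name attribute on the <matcher> element, if present, overrides the derived name.
    static String effectiveName(String builderSimpleName, String nameAttribute) {
        return nameAttribute != null && !nameAttribute.isEmpty()
            ? nameAttribute
            : matcherName(builderSimpleName);
    }
}
```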

CLI interface

The command line interface adds new options to specify files of configurations to read. Executing the --help option will list all the commands and their options.

ANT interface

The Ant interface has changed to utilize the various builders in the system. An example of an ant build.xml can be found at /apache-rat-tasks/src/test/resources/antunit/report-junit.xml.

The <rat:report> tag has several properties to configure the system. In most cases there should be no changes for users. For users that specify custom licenses the <rat:license> tag is the same as the standard configuration file <license> tag.

Maven interface

The Maven interface has changed to utilize the various builders in the system. An example of a Maven pom.xml can be found at /apache-rat-plugin/src/it/it1/pom.xml. The main difference between the Maven implementation and the configuration file is that items that were properties in the configuration file are specified as enclosed text tags.

Changes from developer perspective

Separation of configuration from running report.

The ReportConfiguration class now contains all the information necessary to run the report. The code to run the report has been moved from Report to Reporter. Report is now limited to implementing the CLI interface and setting the properties in the ReportConfiguration properly.

All user interfaces (CLI, ANT and Maven) have been modified to set the ReportConfiguration properties and call the Reporter to execute the report.

As part of this separation the MetaData object has been removed from the configuration and user facing code. It remains in the reporting engine where it belongs.

Migration to builders

Multiple builders were created to assist in the creation of ILicense and IHeaderMatcher implementations. The builders are utilized in the code that configures the ReportConfiguration and provide a standard mechanism to ensure that the objects are properly configured across all client interfaces.

ILicense and ILicenseFamily

The ILicense has been simplified to comprise a LicenseFamily, notes, derivedFrom, and a single matcher. The ILicenseBuilder class implements the builder. The licenseFamilyCategory must be unique across all licenses. Licenses may be sorted by the LicenseFamily. There is a Comparator<ILicense> available from the ILicense class.

The ILicenseFamily now implements Comparable<ILicenseFamily>. Both the ILicense and the ILicenseFamily are ordered by the licenseFamilyCategory property.
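
A small sketch of ordering by category, using a hypothetical stand-in class FamilySketch rather than the real ILicenseFamily interface. The field names and the sample categories are placeholders.

```java
// Hypothetical stand-in for illustration only; not the ILicenseFamily interface.
final class FamilySketch implements Comparable<FamilySketch> {
    final String category; // plays the role of licenseFamilyCategory
    final String name;     // plays the role of the family name

    FamilySketch(String category, String name) {
        this.category = category;
        this.name = name;
    }

    // Ordering depends on the category only, never on the display name.
    @Override
    public int compareTo(FamilySketch other) {
        return category.compareTo(other.category);
    }
}
```

With this ordering, sorting a list of families arranges them by category regardless of their display names.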

IHeaderMatcher (matchers)

The system-provided matchers are implemented by Builders. New matcher types can be created by creating builders that extend org.apache.rat.configuration.builders.AbstractBuilder or otherwise implement IHeaderMatcher.Builder and produce an IHeaderMatcher that extends org.apache.rat.analysis.matchers.AbstractMatcher.

Any properly constructed org.apache.rat.configuration.builders.AbstractBuilder that is added to the <matchers> section of the configuration file (see above) is automatically registered and available for use in the license definition in the configuration file. Additional steps must be taken to access the new matchers from ANT and Maven.

Changes to matching logic

With the addition of the <not> and <all> matchers it became necessary to provide deeper inspection of the state of the match. Prior to this, a matcher could simply return true if a match was detected and false otherwise. However, the code tests the headers line by line, so a matcher cannot declare that it does not match until the last header line has been processed. To handle this situation, the org.apache.rat.analysis.State enum has been added to the code base. Matchers now return State when the IHeaderMatcher.matches(String) method is called and the reset() method now resets the current state to i (indeterminate). IHeaderMatcher also has two new methods:

  • currentState() - returns the current state of the matcher.
  • finalizeState() - sets the current state to either t (true) or f (false).

To simplify the construction of new IHeaderMatcher implementations there is an org.apache.rat.analysis.AbstractSimpleMatcher that handles the tracking of the State and assumes that finalizeState() will convert an i to an f. Classes extending this simply need to implement boolean doMatch(String) to perform the check.
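
The state handling described above can be sketched with stand-in types. StateSketch, SimpleMatcherSketch and NotSketch are hypothetical and only mirror the behaviour stated in this description; they are not the classes added in this PR, and details such as the void return type of finalizeState() are assumptions.

```java
// Hypothetical stand-in for org.apache.rat.analysis.State.
enum StateSketch { t, f, i }

// Hypothetical stand-in for the simple matcher base class described above.
abstract class SimpleMatcherSketch {
    private StateSketch state = StateSketch.i;

    // Subclasses only decide whether a single header line matches.
    protected abstract boolean doMatch(String line);

    // Called once per header line; a match is remembered, a miss stays indeterminate.
    final StateSketch matches(String line) {
        if (state == StateSketch.i && doMatch(line)) {
            state = StateSketch.t;
        }
        return state;
    }

    final StateSketch currentState() {
        return state;
    }

    // Called after the last header line: an indeterminate state becomes f.
    final void finalizeState() {
        if (state == StateSketch.i) {
            state = StateSketch.f;
        }
    }

    final void reset() {
        state = StateSketch.i;
    }

    // Convenience factory for the example: matches lines containing the given text.
    static SimpleMatcherSketch containing(String text) {
        return new SimpleMatcherSketch() {
            @Override
            protected boolean doMatch(String line) {
                return line.contains(text);
            }
        };
    }
}

// A "not" wrapper shows why the third state is needed: it cannot report a
// match until the enclosed matcher has been finalized.
final class NotSketch {
    private final SimpleMatcherSketch inner;

    NotSketch(SimpleMatcherSketch inner) {
        this.inner = inner;
    }

    StateSketch matches(String line) {
        inner.matches(line);
        return currentState();
    }

    StateSketch currentState() {
        switch (inner.currentState()) {
            case t: return StateSketch.f;
            case f: return StateSketch.t;
            default: return StateSketch.i;
        }
    }

    void finalizeState() {
        inner.finalizeState();
    }
}
```

With a plain boolean result, the not wrapper would have to answer true on the first non-matching line even though a later line might still match the enclosed matcher.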

The change to matching logic necessitated a change to the matching engine so that it calls finalizeState() and handles State results instead of the boolean result of matches(String).

Testing

Significant work has been done to improve testing for the various matchers and other classes that were touched by this change.

JavaDoc

Modified files have had their javadocs updated. New files have javadocs as well.

@Claudenw Claudenw marked this pull request as draft October 8, 2023 06:04
@ottlinger ottlinger changed the title Rat 321 text based configuration RAT-321: text based configuration Oct 8, 2023
@Claudenw Claudenw marked this pull request as ready for review October 8, 2023 17:10

Claudenw commented Oct 8, 2023

I recognize that this is a massive change and will take some time to process. If you have any questions or want to see extra tests please let me know and I will endeavour to complete them as quickly as possible.

It may make sense to create a new branch to put these changes on so that we can get an alpha release out to get some feedback before going all in on the changes.

@ottlinger

@Claudenw if I get it correctly there are no functional changes of existing functionality. Thus I'd prefer to merge your changes and prepare a release of RAT in order to collect feedback from existing users. WDYT?

Claudenw commented Oct 16, 2023 via email

@Claudenw

@ottlinger how do we proceed? I do not have access to merge and I am unsure of the process to create a release so I think I need you to guide and/or execute this process. If you need anything from me please let me know.

@ottlinger

@Claudenw I'll start preparations for a new RC during the weekend (as the changelog and Jira need some more attention before a new RC can be created). How would you summarize the feature RAT-321 introduces? (Feel free to add this to this branch's src/main/changes.xml.)

@ottlinger ottlinger merged commit a3fe059 into apache:master Oct 20, 2023
@Claudenw Claudenw deleted the RAT-321_text_based_configuration branch May 22, 2024 13:49