Skip to content

[core] Text documents escapes#4044

Merged
adangel merged 86 commits into
pmd:pmd/7.0.xfrom
oowekyala:text-utils-javacc
Sep 9, 2022
Merged

[core] Text documents escapes#4044
adangel merged 86 commits into
pmd:pmd/7.0.xfrom
oowekyala:text-utils-javacc

Conversation

@oowekyala

Copy link
Copy Markdown
Member

Describe the PR

  • introduce a mechanism to handle text escapes in the document API (like \u00A0). This allows preserving line/columns properly when escapes introduce or remove newlines.
    • the mechanism is create another TextDocument with the translated text (escapes replaced by their value) and parse/lex that instead. There is a mapping between the offsets of the translated document and the offsets of the old document, which is used to recover actual positions.
    • the same mechanism can be used to define sub-documents later (which are used for javadoc parsing and for [core] RFC: Analyzing embedded snippets from other languages #237)
  • finish [core] Merging Javacc build scripts #2239 properly: implement CharStream manually on top of TextDocument instead of using the sources generated by javacc.

Related issues

Ready?

  • Added unit tests for fixed bug/feature
  • Passing all unit tests
  • Complete build ./mvnw clean verify passes (checked automatically by github actions)
  • Added (in-code) documentation (if needed)

Slicing Chars will be enough
This uses "fragmented docs". This
is less efficient when there are
a lot of escapes, but:
- it doesn't require the initial
buffer copy
- it doesn't mutate any buffer, so
Chars can be readonly all the time
- it handles escapes of arbitrary
inLen/outLen ratio and is thus
future proof
- it is much more maintainable

On the downside there is
- efficiency (this data structure is
"pointy")
- clarity of the Reader implementation
(one more state to support, ie reading
not from the input buffer but from an
escape)
This is to profit from
compact string features,
and to make the use of
StringBuilders more
efficient.
@oowekyala oowekyala mentioned this pull request Jul 17, 2022
6 tasks
@ghost

ghost commented Jul 17, 2022

Copy link
Copy Markdown
2 Messages
📖 Compared to pmd/7.0.x:
This changeset changes 147 violations,
introduces 25 new violations, 0 new errors and 0 new configuration errors,
removes 24 violations, 0 errors and 0 configuration errors.
Full report
📖 Compared to master:
This changeset changes 0 violations,
introduces 699657 new violations, 22 new errors and 0 new configuration errors,
removes 827036 violations, 39 errors and 6 configuration errors.
Full report

Generated by 🚫 Danger

@oowekyala oowekyala added this to the 7.0.0 milestone Jul 18, 2022
@oowekyala oowekyala marked this pull request as ready for review July 18, 2022 10:50
@adangel adangel self-requested a review August 30, 2022 19:25

@adangel adangel left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks!

I'm going to merge this now...

*/

package net.sourceforge.pmd.lang.ast;
package net.sourceforge.pmd.lang.ast.impl.javacc;

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder, whether such changes need to be mentioned in release notes or somewhere...

@Test
public void testSeveralEscapes() throws IOException {

String input = "abc\\\\\\u00a0d\\uu00a0ede";

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it correct, that these contiguously backslash characters are gone after translating?

I don't see, that https://docs.oracle.com/javase/specs/jls/se17/html/jls-3.html#jls-3.3 defines what happens to these backslashes, just that the last backslash of an even sequence might start a unicode escape...

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you're right, thanks for spotting this!

jshell> "\u00a0"
$1 ==> " "

jshell> "\\u00a0"
$2 ==> "\\u00a0"

jshell> "\\\u00a0"
$3 ==> "\\ "

adangel added a commit that referenced this pull request Sep 9, 2022
adangel added a commit that referenced this pull request Sep 9, 2022
@adangel adangel merged commit 34a668c into pmd:pmd/7.0.x Sep 9, 2022
@oowekyala oowekyala deleted the text-utils-javacc branch September 9, 2022 14:38
@adangel adangel mentioned this pull request Jan 23, 2023
55 tasks
adangel added a commit to adangel/pmd that referenced this pull request Feb 21, 2025
Not in use anymore since pmd#4044
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants