
feat: add lexer/parser support for Summer '26 multi-line strings (#102) #104

Merged
nawforce merged 2 commits into main from feat/issue-102-multiline-strings on Apr 25, 2026

@kjonescertinia

Contributor

Summary

Adds MultilineStringLiteral token support for Salesforce Summer '26 triple-quoted string syntax ('''<NL>...'''). Closes #102.

String json = '''
{
  "name": "John"
}''';

Empirical findings (verified against Summer '26 pre-release org)

| Input | Platform behaviour |
|---|---|
| '''abc''' (no newline after open) | Compile error: "Unexpected symbol 'a', was expecting '\n'" |
| '''''' (six quotes) | Compile error: same |
| '''<NL>''' (empty body) | Valid, produces "" |
| '''<NL>...<NL>''' | Valid; leading newline stripped at runtime |
| Single ' and double '' in body | Allowed |
| \t \' \\ ç escapes | Processed identically to regular strings |
| Indented body + indented closing ''' | Common leading whitespace stripped at runtime (Java text block semantics) |
| String templates ${var} + .template(map) | ${...} is plain literal text in the parser; no parser work needed |

Indent stripping is a runtime string semantic, not a lexer concern — the token captures raw multi-line text and downstream consumers can resolve as needed.
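
Because the token carries the raw text, a downstream consumer that wants the resolved value has to apply the stripping itself. The following is a minimal Python sketch of one way to do that, assuming the platform follows the Java text block rule (minimum indentation over non-blank lines plus the closing-delimiter line). `strip_indent` is a hypothetical helper, not part of apex-parser; the exact platform algorithm is not specified in this PR, and escape processing is left out.

```python
def strip_indent(raw_body):
    """Resolve the text between the opening and closing ''' delimiters.

    raw_body includes the mandatory line break after the opening delimiter.
    Assumption: Java-text-block-style stripping; not verified against the platform.
    """
    # Drop the line break that must follow the opening delimiter.
    if raw_body.startswith("\r\n"):
        body = raw_body[2:]
    elif raw_body[:1] in ("\r", "\n"):
        body = raw_body[1:]
    else:
        raise ValueError("body must start with a line break")

    lines = body.split("\n")
    last = len(lines) - 1
    # Non-blank lines count towards the common indent, and so does the
    # final line, because that is where the closing delimiter sits.
    significant = [ln for i, ln in enumerate(lines) if ln.strip() or i == last]
    indent = min(len(ln) - len(ln.lstrip(" \t")) for ln in significant)
    return "\n".join(ln[indent:] for ln in lines)


raw = '\n    {\n      "name": "John"\n    }'
print(strip_indent(raw))
```

With the indented JSON body above, the four-space common indent is removed and the relative two-space indent of the inner line is kept.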

Design: strict newline enforcement

The lexer rule requires [\r\n] immediately after the opening ''':

MultilineStringLiteral
    :   '\'\'\'' [\r\n] ( EscapeSequence | . )*? '\'\'\''
    ;

This is deliberately strict to match platform behaviour. The alternative — a permissive lexer accepting '''abc''' — would silently lex it as a single multi-line literal that the parser accepts but the platform later rejects, hiding the user's mistake.
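
The accept/reject behaviour of the strict rule can be approximated with a regular expression. This Python sketch is a stand-in for the lexer rule, not the ANTLR lexer itself; the escape alternative is simplified to "backslash plus any character", and `is_multiline_literal` is a hypothetical name used only for illustration.

```python
import re

# Regex model of: '\'\'\'' [\r\n] ( EscapeSequence | . )*? '\'\'\''
# (?s) lets '.' cross line breaks, like ANTLR's wildcard.
# Assumption: EscapeSequence is approximated as backslash + any character.
MULTILINE = re.compile(r"(?s)'''[\r\n](?:\\.|.)*?'''")


def is_multiline_literal(src):
    """True if the whole input is one multi-line literal under the strict rule."""
    return MULTILINE.fullmatch(src) is not None


for sample in ["'''\n{\n  \"name\": \"John\"\n}'''", "'''\n'''", "'''abc'''", "''''''"]:
    print(repr(sample), is_multiline_literal(sample))
```

The first two samples have a line break directly after the opening delimiter and are accepted; the last two do not, so the rule never starts matching.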

With strict enforcement, ANTLR falls back gracefully on malformed input: '''abc''' lexes as three legacy StringLiteral tokens ('', 'abc', ''). That's a recognisable, unambiguous token pattern — apex-ls#443 will detect it and surface a targeted "did you mean a multi-line string?" diagnostic with quick fix.
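
The fallback can be illustrated with a toy longest-match tokenizer. This is a Python sketch under stated assumptions: only two rules are modelled, escapes are simplified to backslash plus any character, and `looks_like_malformed_multiline` is a hypothetical check, not the detection that apex-ls#443 will implement.

```python
import re

# Two-rule model. The StringLiteral body follows the ~['\\] | EscapeSequence
# shape, with escapes simplified (assumption).
RULES = [
    ("MultilineStringLiteral", re.compile(r"(?s)'''[\r\n](?:\\.|.)*?'''")),
    ("StringLiteral", re.compile(r"'(?:[^'\\]|\\.)*'")),
]


def tokenize(src):
    """Longest match wins; on a tie the earlier rule is kept."""
    tokens, pos = [], 0
    while pos < len(src):
        best = None
        for name, pattern in RULES:
            m = pattern.match(src, pos)
            if m and (best is None or m.end() > best[1].end()):
                best = (name, m)
        if best is None:
            raise ValueError(f"no token at offset {pos}")
        tokens.append((best[0], best[1].group()))
        pos = best[1].end()
    return tokens


def looks_like_malformed_multiline(tokens):
    """Hypothetical check: three consecutive StringLiterals shaped '', <any>, ''."""
    for i in range(len(tokens) - 2):
        a, b, c = tokens[i:i + 3]
        if all(t[0] == "StringLiteral" for t in (a, b, c)) and a[1] == "''" and c[1] == "''":
            return True
    return False


print(tokenize("'''abc'''"))
print(tokenize("'''\nabc'''"))
```

The first call yields three StringLiteral tokens (`''`, `'abc'`, `''`); the second yields a single MultilineStringLiteral because that match is longer than the two-character `''`.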

apex-parser stays minimal: just the grammar, no special-case diagnostics. The helpful UX is layered in the language server where it has user context.

Parser integration

MultilineStringLiteral is accepted alongside StringLiteral at all 9 sites via inline alternation:

  • literal, whenLiteral (Apex)
  • SOQL value, DISTANCE(...)
  • SOSL WITH DIVISION, WITH NETWORK, WITH PRICEBOOKID, WITH METADATA, networkList

This preserves parse-tree shape — no rule restructure, no breaking change for tree-walking consumers.

Test plan

  • npm tests: 88/88 pass (76 existing + 12 new)
  • JVM tests: 87/87 pass (75 existing + 12 new)
  • Lint clean
  • Lexer tests: basic, closing variants, empty body, internal quotes, escape sequences, line/column tracking after literal, fallback for '''abc''', six-quote fallback, unterminated literal
  • Parser tests: literal rule, in class body (JSON example), in concatenation expression
  • Verified against Summer '26 pre-release org via sf apex run

Follow-up

  • apex-ls#443 — Diagnostic & quick fix for malformed multi-line attempts

Add MultilineStringLiteral token for the Summer '26 triple-quoted string
syntax: '''<NL>...'''. The body must start on a new line after the opening
triple quote, matching platform behaviour confirmed against a Summer '26
pre-release org.

Strict newline enforcement is deliberate. Without it, malformed code like
'''abc''' would be silently accepted as a multi-line literal that the
platform later rejects. With strict enforcement, ANTLR falls back to
lexing it as three legacy StringLiteral tokens ('', 'abc', ''), preserving
a recognisable token pattern for downstream tooling. apex-ls#443 will
detect this pattern and surface a targeted "did you mean a multi-line
string?" diagnostic with quick fix.

The new token is accepted alongside StringLiteral at all 9 parser sites
(literal, whenLiteral, SOQL DISTANCE/value, SOSL WITH clauses,
networkList) via inline alternation, preserving parse-tree shape for
existing consumers.

Closes #102
@kjonescertinia

Contributor Author

Cross-reference: outline-parser also affected — filed as outline-parser#27. The hand-written tokenizer in outline-parser pairs quotes greedily two-at-a-time, so multi-line bodies with an odd number of single quotes (e.g. can't) cause it to consume past the literal and throw at EOF. apex-parser#104 itself is unaffected — separate fix needed in outline-parser.
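
To see why an odd quote count is a problem for that style of tokenizer, here is a toy Python model of greedy two-at-a-time quote pairing. It is an illustration only: `pair_quotes` is hypothetical and is not outline-parser's code.

```python
def pair_quotes(src):
    """Toy scanner: every quote opens a string that runs to the next quote."""
    spans, pos = [], 0
    while True:
        start = src.find("'", pos)
        if start == -1:
            return spans
        end = src.find("'", start + 1)
        if end == -1:
            # An unpaired quote is left over: the scanner runs off the end.
            raise ValueError(f"unterminated string starting at offset {start}")
        spans.append(src[start:end + 1])
        pos = end + 1


# Even number of quotes in total: the pairing happens to stay in sync.
print(pair_quotes("'''\ncannot'''"))

# Odd number of quotes (the apostrophe in can't): one quote is left unpaired.
try:
    pair_quotes("'''\ncan't'''")
except ValueError as err:
    print("error:", err)
```

In a real source file the unpaired quote would swallow whatever code follows the literal until the next quote or EOF, which matches the failure described above.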

@nawforce nawforce self-requested a review April 25, 2026 08:49
@nawforce

Contributor

@codex review

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 299f1d06b1


Comment thread on antlr/BaseApexLexer.g4 (outdated)
// Declared before StringLiteral so longest-match prefers '''...''' over the
// degenerate ''+'...'+'' fallback when a newline follows the opening.
MultilineStringLiteral
: '\'\'\'' [\r\n] ( EscapeSequence | . )*? '\'\'\''


P1: Disallow invalid backslash escapes in multiline literals

MultilineStringLiteral currently uses ( EscapeSequence | . )*?, and . matches a raw backslash, so malformed escapes like '''\n\q''' (or any \<invalid>) are accepted as a valid token with no lexer error. That is looser than existing StringLiteral behavior and undermines the commit’s stated goal of matching Apex string semantics, because invalid escapes in multiline strings will now silently parse instead of being flagged. Restrict the fallback branch so backslashes are only consumed via EscapeSequence.


Contributor Author


Good catch — fixed in 8e1958d.

Verified against the Summer '26 pre-release org: the platform rejects \q inside a '''<NL>...<NL>''' body with the same "Illegal character sequence" error it raises for single-quoted strings. So strict matches platform behaviour.

Rule changed from ( EscapeSequence | . )*? to ( EscapeSequence | '\'' | ~['\\] )*?. Backslashes are now only consumable via EscapeSequence (matching StringLiteral semantics); single quotes are handled separately since ~['\\] excludes them; the *? plus '\'\'\'' terminator still allows 1- or 2-quote runs in the body. Added a lexer test for an invalid escape in a multi-line body.
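
The difference between the two rule bodies can be checked with regex stand-ins. This Python sketch makes an assumption about the escape set (`\b \t \n \f \r \" \' \\` and `\uXXXX`); the grammar's EscapeSequence fragment is authoritative, and `accepts` is a hypothetical helper.

```python
import re

# Assumed escape set for illustration only.
ESCAPE = r"\\(?:[btnfr\"'\\]|u[0-9a-fA-F]{4})"

# Before the fix: ( EscapeSequence | . )*?  ('.' also matches a raw backslash)
ORIGINAL = re.compile(r"(?s)'''[\r\n](?:" + ESCAPE + r"|.)*?'''")
# After the fix: ( EscapeSequence | '\'' | ~['\\] )*?
REVISED = re.compile(r"'''[\r\n](?:" + ESCAPE + r"|'|[^'\\])*?'''")


def accepts(rule, src):
    """True if the whole input is a single literal under the given rule."""
    return rule.fullmatch(src) is not None


bad = "'''\n\\q'''"  # body contains the invalid escape \q
good = "'''\nit's a \\t tab and ''two'' quotes'''"

print(accepts(ORIGINAL, bad), accepts(REVISED, bad))
print(accepts(REVISED, good))
```

Under the original body the `\q` input is accepted because the wildcard consumes the backslash; under the revised body nothing can consume a backslash except a valid escape, so the same input is rejected, while one- and two-quote runs in the body still match.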

Contributor


@codex pls review fix


Codex Review: Didn't find any major issues. Delightful!


Codex review on #104 flagged that the rule used `( EscapeSequence | . )*?`,
where `.` matches any char including a raw backslash. A malformed escape
like `\q` inside a `'''<NL>...<NL>'''` body silently consumed the
backslash as a regular char — looser than `StringLiteral`, which rejects
bad escapes via `~['\\] | EscapeSequence`.

Verified against the Summer '26 pre-release org: the platform rejects
`\q` in multi-line bodies with the same "Illegal character sequence"
error it raises for single-quoted strings. Our lexer should match that.

Replace `.` with `'\'' | ~['\\]` so backslashes are only consumable via
EscapeSequence. Single quotes are handled separately because `~['\\]`
excludes them; the surrounding `*?` plus the `'\'\'\''` terminator
ensures 1- or 2-quote runs in the body still parse cleanly while a
3-quote run terminates the literal.

Add a lexer test covering an invalid escape inside a multi-line body.
@nawforce nawforce merged commit 16ebbe0 into main Apr 25, 2026
1 check passed
@nawforce nawforce deleted the feat/issue-102-multiline-strings branch April 25, 2026 12:20