Bug: Markdown emphasis parser fails on overlapping delimiters (*text**more***) - CommonMark compliance issue #8073
Description
(Note: this bug description generated by Claude.ai)
Lexical's $convertFromMarkdownString incorrectly parses markdown emphasis when single and double asterisk delimiters overlap in certain patterns. This violates the CommonMark specification for emphasis delimiter matching.
Environment
- Lexical version: 0.39.0
- Package: @lexical/markdown
- Browser: All browsers (parser issue)
Bug Type
Parser incorrectly handles CommonMark emphasis delimiter rules
Expected Behavior (per CommonMark spec)
According to the CommonMark specification, the markdown string:
ggggg*jhhhh**dsjdhjdj***
Should parse as:
<p>ggggg<em>jhhhh<strong>dsjdhjdj</strong></em></p>
Breakdown:
- First * opens italic
- ** opens bold inside the italic
- First two asterisks of *** close the bold
- Last * of *** closes the italic
Actual Behavior (Lexical v0.39.0)
Lexical parses it as:
<p>ggggg*jhhhh<strong>dsjdhjdj</strong>*</p>
The first and last asterisks are treated as literal characters instead of emphasis delimiters.
Test Cases
Here's a comprehensive test showing which patterns work and which fail:
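import { $convertFromMarkdownString, TRANSFORMERS } from "@lexical/markdown";
import { $getRoot } from "lexical";
// Each call below is assumed to run inside editor.update(), as in the reproduction further down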
// ✅ WORKS: Simple cases
$convertFromMarkdownString("*italic*", TRANSFORMERS);
// Result: <em>italic</em> ✓
$convertFromMarkdownString("**bold**", TRANSFORMERS);
// Result: <strong>bold</strong> ✓
$convertFromMarkdownString("***bold and italic***", TRANSFORMERS);
// Result: <em><strong>bold and italic</strong></em> ✓
// ✅ WORKS: Properly balanced nesting
$convertFromMarkdownString("**bold with *italic* inside**", TRANSFORMERS);
// Result: <strong>bold with <em>italic</em> inside</strong> ✓
$convertFromMarkdownString("*italic with **bold** inside*", TRANSFORMERS);
// Result: <em>italic with <strong>bold</strong> inside</em> ✓
// ❌ FAILS: Overlapping delimiters (opener before opener, closer after closer)
$convertFromMarkdownString("ggggg*jhhhh**dsjdhjdj***", TRANSFORMERS);
// Expected: ggggg<em>jhhhh<strong>dsjdhjdj</strong></em>
// Actual: ggggg*jhhhh<strong>dsjdhjdj</strong>* ✗
Root Cause Analysis
After investigating the @lexical/markdown source code, the issue appears to be in the findOutermostMatch function in LexicalMarkdown.dev.js (lines 308-340).
The Algorithm Flaw
The current algorithm:
- Finds all opening tags using openTagsRegExp in a left-to-right scan
- For each opening tag (in order of appearance), tries to match its full pattern
- Returns the first successful match
This is a greedy left-to-right approach that doesn't respect CommonMark's emphasis delimiter matching rules.
Why It Fails on *jhhhh**dsjdhjdj***
- Scanner finds * at position 5, ** at position 11
- Tries to match *...* for position 5 → NO MATCH (finds *jhhhh** which has ** at the end)
- Tries to match **...** for position 11 → MATCH! (**dsjdhjdj**)
- Applies bold formatting, leaving *jhhhh and * as literal characters
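The positions quoted above can be checked with a few lines of TypeScript. This is only an illustration: findAsteriskRuns is a hypothetical helper, not part of @lexical/markdown, and it reports 0-based indices. It also shows the third run (the closing ***), which the steps above do not list.

```typescript
// Hypothetical helper: lists every run of asterisks with its 0-based start index.
interface AsteriskRun {
  index: number;
  length: number;
}

function findAsteriskRuns(src: string): AsteriskRun[] {
  const runs: AsteriskRun[] = [];
  const pattern = /\*+/g;
  let m: RegExpExecArray | null;
  while ((m = pattern.exec(src)) !== null) {
    runs.push({ index: m.index, length: m[0].length });
  }
  return runs;
}

console.log(findAsteriskRuns("ggggg*jhhhh**dsjdhjdj***"));
// → [{index: 5, length: 1}, {index: 11, length: 2}, {index: 21, length: 3}]
```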
CommonMark's Algorithm
Per the CommonMark spec, emphasis delimiter matching should:
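- Identify delimiter runs and apply the flanking rules to decide which runs can open or close emphasis
- Scan for closers from left to right and, for each closer, look back through the delimiter stack for the nearest compatible opener, so inner pairs are resolved before outer ones
- Produce strong emphasis when both runs still have at least two delimiters, otherwise regular emphasis
- Reject pairs excluded by the "multiple of 3" rule and leave unmatched delimiters as literal text
The CommonMark algorithm is significantly more complex than Lexical's current implementation. See CommonMark spec section 6.2 (Emphasis and strong emphasis) for the rules and the spec's appendix "A parsing strategy" for the full algorithm.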
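As a rough illustration of the delimiter-stack approach the spec describes, here is a minimal TypeScript sketch. It is not Lexical code and not a complete CommonMark implementation: it assumes only * delimiters, ASCII punctuation, no backslash escapes, code spans or links, and no HTML escaping. The names tokenize, blockedByRuleOfThree and renderEmphasis are made up for this example.

```typescript
// Simplified sketch of CommonMark-style emphasis matching for "*" only.
type Token =
  | { kind: "text"; text: string }
  | { kind: "delim"; text: string; orig: number; canOpen: boolean; canClose: boolean };
type Delim = Extract<Token, { kind: "delim" }>;

// Start and end of line ("" here) count as whitespace.
const isSpace = (c: string): boolean => c === "" || /\s/.test(c);
// Assumption: ASCII punctuation only; the spec uses Unicode punctuation.
const isPunct = (c: string): boolean => /[!-\/:-@\[-`{-~]/.test(c);

function tokenize(src: string): Token[] {
  const tokens: Token[] = [];
  const runs = /\*+|[^*]+/g;
  let m: RegExpExecArray | null;
  while ((m = runs.exec(src)) !== null) {
    const text = m[0];
    if (text.charAt(0) !== "*") {
      tokens.push({ kind: "text", text });
      continue;
    }
    const prev = src.charAt(m.index - 1);
    const next = src.charAt(m.index + text.length);
    // Flanking rules: for "*", left-flanking can open and right-flanking can close.
    const canOpen = !isSpace(next) && (!isPunct(next) || isSpace(prev) || isPunct(prev));
    const canClose = !isSpace(prev) && (!isPunct(prev) || isSpace(next) || isPunct(next));
    tokens.push({ kind: "delim", text, orig: text.length, canOpen, canClose });
  }
  return tokens;
}

// "Multiple of 3" rule, based on the original run lengths.
function blockedByRuleOfThree(opener: Delim, closer: Delim): boolean {
  const eitherBoth = opener.canClose || closer.canOpen;
  const sumIsMultiple = (opener.orig + closer.orig) % 3 === 0;
  const bothMultiples = opener.orig % 3 === 0 && closer.orig % 3 === 0;
  return eitherBoth && sumIsMultiple && !bothMultiples;
}

function renderEmphasis(src: string): string {
  const tokens = tokenize(src);
  let i = 0;
  while (i < tokens.length) {
    const closer = tokens[i];
    if (closer.kind !== "delim" || !closer.canClose) {
      i++;
      continue;
    }
    // Look back for the nearest opener that is allowed to pair with this closer.
    let j = i - 1;
    while (j >= 0) {
      const t = tokens[j];
      if (t.kind === "delim" && t.canOpen && !blockedByRuleOfThree(t, closer)) break;
      j--;
    }
    if (j < 0) {
      // No opener: a pure closer becomes literal text, a run that can also open stays.
      if (!closer.canOpen) tokens[i] = { kind: "text", text: closer.text };
      i++;
      continue;
    }
    const opener = tokens[j] as Delim;
    const n = opener.text.length >= 2 && closer.text.length >= 2 ? 2 : 1;
    const tag = n === 2 ? "strong" : "em";
    // Unmatched delimiters between the pair become literal text.
    for (let k = j + 1; k < i; k++) {
      const t = tokens[k];
      if (t.kind === "delim") tokens[k] = { kind: "text", text: t.text };
    }
    // Consume delimiters from the inner ends of both runs.
    opener.text = opener.text.slice(n);
    closer.text = closer.text.slice(n);
    if (opener.text.length === 0) tokens[j] = { kind: "text", text: "" };
    tokens.splice(i, 0, { kind: "text", text: `</${tag}>` });
    tokens.splice(j + 1, 0, { kind: "text", text: `<${tag}>` });
    i += 2; // the closer moved two slots to the right
    if (closer.text.length === 0) {
      tokens[i] = { kind: "text", text: "" };
      i++;
    } // otherwise the same closer is processed again with its remaining delimiters
  }
  return tokens.map((t) => t.text).join("");
}

console.log(renderEmphasis("ggggg*jhhhh**dsjdhjdj***"));
// → ggggg<em>jhhhh<strong>dsjdhjdj</strong></em>
```

For the failing input, the ** run cannot close against the leading * because their lengths sum to 3, so both stay on the stack until the *** closer arrives. That closer first pairs with ** (strong) and then, with its one remaining delimiter, with the leading * (emphasis).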
Reproduction
Minimal Test Case
import { createHeadlessEditor } from '@lexical/headless';
import { $convertFromMarkdownString, TRANSFORMERS } from '@lexical/markdown';
import { $getRoot } from 'lexical';
const editor = createHeadlessEditor({
  nodes: [],
  onError: (error) => console.error(error),
});
editor.update(() => {
  $convertFromMarkdownString('ggggg*jhhhh**dsjdhjdj***', TRANSFORMERS);
  const root = $getRoot();
  console.log(root.getTextContent()); // Shows: ggggg*jhhhhdsjdhjdj*
});
Automated Test
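import { createHeadlessEditor } from '@lexical/headless';
import { $convertFromMarkdownString, TRANSFORMERS } from '@lexical/markdown';
import { $getRoot } from 'lexical';
describe('@lexical/markdown emphasis parsing', () => {
  test('should handle overlapping emphasis delimiters per CommonMark spec', () => {
    const editor = createHeadlessEditor({
      nodes: [],
      // Rethrow so a failed assertion inside update() fails the test instead of only being logged
      onError: (error) => {
        throw error;
      },
    });
    editor.update(() => {
      $convertFromMarkdownString('ggggg*jhhhh**dsjdhjdj***', TRANSFORMERS);
      const root = $getRoot();
      // Lexical stores emphasis as format flags on flat text nodes, so check
      // those flags rather than nested <em>/<strong> markup
      const runs = root.getAllTextNodes().map((node) => ({
        text: node.getTextContent(),
        italic: node.hasFormat('italic'),
        bold: node.hasFormat('bold'),
      }));
      expect(runs).toEqual([
        {text: 'ggggg', italic: false, bold: false},
        {text: 'jhhhh', italic: true, bold: false},
        {text: 'dsjdhjdj', italic: true, bold: true},
      ]);
      // Should NOT have literal asterisks
      expect(root.getTextContent()).not.toContain('*');
    });
  });
});
Related Issues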
- Bug: @lexical/markdown text-format combination #2632 - Similar issue with **Bold *Italic*** (fixed in v0.22.0 via PR #5758, "[lexical-markdown] Bug Fix: preserve the order of markdown tags for markdown combinations, and close the tags when the outmost tag is closed")
- The fix in PR #5758 addressed one pattern but didn't fully implement CommonMark's delimiter matching algorithm
Proposed Solution
Implement CommonMark's full emphasis delimiter algorithm in findOutermostMatch or replace it with a proper delimiter stack-based approach. Key changes needed:
- Apply flanking rules to determine which delimiter runs can open/close
- Prefer strong emphasis when both runs still have at least two delimiters, otherwise use regular emphasis
- Track a delimiter stack to handle overlapping cases
- Match each closer against the nearest eligible opener instead of trying openers in order of appearance
Alternatively, consider using a battle-tested CommonMark parser like commonmark.js or markdown-it for the import path, then convert to Lexical nodes.
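If an external parser were used for the import path, its nested emphasis tree would still need to be flattened, because Lexical text nodes carry formats as flags rather than as nested elements. The sketch below shows one way that flattening could look. The Inline and FormattedRun types and the flattenInlines function are hypothetical, modelled loosely on the kind of tree a CommonMark parser produces; they are not the API of commonmark.js, markdown-it, or Lexical.

```typescript
// Hypothetical inline tree, similar in shape to what a CommonMark parser emits.
type Inline =
  | { type: "text"; value: string }
  | { type: "emph" | "strong"; children: Inline[] };

// Hypothetical flat run: a piece of text plus the formats that apply to it.
interface FormattedRun {
  text: string;
  bold: boolean;
  italic: boolean;
}

// Walks the tree depth-first, carrying the active formats down to each text leaf.
function flattenInlines(
  nodes: Inline[],
  bold: boolean = false,
  italic: boolean = false,
  out: FormattedRun[] = []
): FormattedRun[] {
  for (const node of nodes) {
    if (node.type === "text") {
      out.push({ text: node.value, bold, italic });
    } else {
      flattenInlines(node.children, bold || node.type === "strong", italic || node.type === "emph", out);
    }
  }
  return out;
}

// Tree for: ggggg<em>jhhhh<strong>dsjdhjdj</strong></em>
const tree: Inline[] = [
  { type: "text", value: "ggggg" },
  {
    type: "emph",
    children: [
      { type: "text", value: "jhhhh" },
      { type: "strong", children: [{ type: "text", value: "dsjdhjdj" }] },
    ],
  },
];

console.log(flattenInlines(tree));
// → ggggg (plain), jhhhh (italic), dsjdhjdj (bold + italic)
```

Each run could then become one Lexical text node, for example created with $createTextNode and given its formats with toggleFormat.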
Impact
- Severity: Medium - affects CommonMark compliance
- Frequency: Low - specific pattern, but occurs in real-world markdown (especially LLM-generated content)
- Workaround: Users can avoid this pattern and use properly balanced emphasis like **bold with *italic* inside**
Additional Context
This issue was discovered while working with markdown content in a rich text editor. The pattern appears occasionally in user-generated content and more frequently in LLM-generated markdown.
CommonMark compliance is important for interoperability with other markdown tools and ensuring user expectations are met when importing markdown content.