Bug: Markdown emphasis parser fails on overlapping delimiters (`*text**more***`) - CommonMark compliance issue #8073

(Note: this bug description generated by Claude.ai)

# Bug: Markdown emphasis parser fails on overlapping delimiters (CommonMark compliance issue)

## Description

Lexical's `$convertFromMarkdownString` misparses markdown emphasis when a `*` opener is followed by a `**` opener and both are closed by a single `***` run (for example `*text**more***`). The result does not match the CommonMark specification's rules for emphasis delimiter matching.

## Environment

- **Lexical version**: 0.39.0
- **Package**: `@lexical/markdown`
- **Browser**: All browsers (parser issue)

## Bug Type

Parser incorrectly handles CommonMark emphasis delimiter rules

## Expected Behavior (per CommonMark spec)

According to the [CommonMark specification](https://spec.commonmark.org/0.31.2/#emphasis-and-strong-emphasis), the markdown string:

```markdown
ggggg*jhhhh**dsjdhjdj***
```

Should parse as:

```html
ggggg<em>jhhhh<strong>dsjdhjdj</strong></em>
```

**Breakdown:**
- First `*` opens italic
- `**` opens bold inside the italic
- First two asterisks of `***` close the bold
- Last `*` of `***` closes the italic

## Actual Behavior (Lexical v0.39.0)

Lexical parses it as:

```html
ggggg*jhhhh<strong>dsjdhjdj</strong>*
```

The first and last asterisks are treated as literal characters instead of emphasis delimiters.

## Test Cases

Here's a comprehensive test showing which patterns work and which fail:

```typescript
import { $convertFromMarkdownString, TRANSFORMERS } from "@lexical/markdown";

// Note: each call below must run inside editor.update(() => { ... }).

// ✅ WORKS: Simple cases
$convertFromMarkdownString("*italic*", TRANSFORMERS);
// Result: <em>italic</em> ✓

$convertFromMarkdownString("**bold**", TRANSFORMERS);
// Result: <strong>bold</strong> ✓

$convertFromMarkdownString("***bold and italic***", TRANSFORMERS);
// Result: <em><strong>bold and italic</strong></em> ✓

// ✅ WORKS: Properly balanced nesting
$convertFromMarkdownString("**bold with *italic* inside**", TRANSFORMERS);
// Result: <strong>bold with <em>italic</em> inside</strong> ✓

$convertFromMarkdownString("*italic with **bold** inside*", TRANSFORMERS);
// Result: <em>italic with <strong>bold</strong> inside</em> ✓

// ❌ FAILS: Overlapping delimiters (opener before opener, closer after closer)
$convertFromMarkdownString("ggggg*jhhhh**dsjdhjdj***", TRANSFORMERS);
// Expected: ggggg<em>jhhhh<strong>dsjdhjdj</strong></em>
// Actual: ggggg*jhhhh<strong>dsjdhjdj</strong>* ✗
```

## Root Cause Analysis

After investigating the `@lexical/markdown` source code, the issue appears to be in the `findOutermostMatch` function in `LexicalMarkdown.dev.js` (lines 308-340).

### The Algorithm Flaw

The current algorithm:
1. Finds all opening tags using `openTagsRegExp` in a **left-to-right scan**
2. For each opening tag (in order of appearance), tries to match its full pattern
3. Returns the **first successful match**

This is a greedy left-to-right approach that doesn't respect CommonMark's emphasis delimiter matching rules.

### Why It Fails on `*jhhhh**dsjdhjdj***`

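1. Scanner finds `*` at index 5 and `**` at index 11 (0-based, counted in the full string `ggggg*jhhhh**dsjdhjdj***`)
2. Tries to match `*...*` from index 5 → **NO MATCH** (the first asterisk after `jhhhh` is part of the `**` run, so the candidate `*jhhhh**` ends in `**` rather than a lone `*`)
3. Tries to match `**...**` from index 11 → **MATCH!** (`**dsjdhjdj**`)
4. Applies bold formatting, leaving the `*` before `jhhhh` and the final `*` as literal characters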
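
The steps above can be reproduced with a deliberately simplified model of a first-match scan. This is a sketch, not Lexical's source: `greedyFirstMatch`, `RULES`, and both regular expressions are hypothetical stand-ins chosen only to show how "take the first opener whose full pattern matches" ends up with stray asterisks.

```typescript
// Hypothetical, simplified model of a "first opener that matches wins" scan.
// NOT Lexical's implementation; names and regexes are illustrative assumptions.
type Rule = { tag: string; runLength: number; pattern: RegExp };

const RULES: Rule[] = [
  // `**text**` with no asterisks inside
  { tag: "strong", runLength: 2, pattern: /^\*\*([^*]+)\*\*/ },
  // `*text*` where the closing `*` is not the start of a longer run
  { tag: "em", runLength: 1, pattern: /^\*([^*]+)\*(?!\*)/ },
];

function greedyFirstMatch(input: string): string {
  let i = 0;
  while (i < input.length) {
    if (input[i] !== "*") {
      i++;
      continue;
    }
    // Measure the asterisk run that starts at i.
    let runEnd = i;
    while (input[runEnd] === "*") runEnd++;
    const runLength = runEnd - i;

    // Try the rule for this opener; the first successful match is applied.
    const rule = RULES.find((r) => r.runLength === runLength);
    const match = rule ? rule.pattern.exec(input.slice(i)) : null;
    if (rule && match) {
      return (
        input.slice(0, i) +
        `<${rule.tag}>${match[1]}</${rule.tag}>` +
        input.slice(i + match[0].length)
      );
    }
    // No match for this opener: leave it as literal text and move on.
    i = runEnd;
  }
  return input;
}

console.log(greedyFirstMatch("ggggg*jhhhh**dsjdhjdj***"));
// ggggg*jhhhh<strong>dsjdhjdj</strong>*
```

In this model the `*` opener is abandoned before the `**` opener is tried, so nothing ever revisits it once the bold span has consumed two of the three closing asterisks.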

### CommonMark's Algorithm

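Per the CommonMark spec, emphasis delimiter matching works roughly as follows:
1. Identify delimiter runs and apply the **flanking rules** to decide whether each run can open emphasis, close it, or both
2. Keep the runs on a **delimiter stack** and scan left to right for the first run that can close
3. For each closer, look back for the **nearest compatible opener**, so inner spans are resolved before the spans that enclose them
4. Emit strong emphasis when both runs have at least two delimiters left, otherwise regular emphasis; a partially consumed run (such as the trailing `***`) stays available for further matches

The CommonMark algorithm is significantly more complex than Lexical's current implementation. The rules are defined in [CommonMark spec section 6.2](https://spec.commonmark.org/0.31.2/#emphasis-and-strong-emphasis), and the spec's appendix on parsing strategy describes the delimiter-stack algorithm.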
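
To make the delimiter-stack idea concrete, here is a minimal sketch under strong assumptions: it handles only `*` on a single line, uses ASCII punctuation for the flanking rules, and ignores `_`, backslash escapes, code spans, links, and HTML escaping. All names (`tokenize`, `renderEmphasis`, `blockedByRuleOfThree`, and the types) are hypothetical and not part of Lexical or any CommonMark library.

```typescript
// Simplified sketch of CommonMark-style emphasis matching for `*` only.
type TextPart = { kind: "text"; value: string };
type DelimPart = {
  kind: "delim";
  count: number; // asterisks not yet consumed
  orig: number; // original run length, used by the "multiple of 3" rule
  canOpen: boolean;
  canClose: boolean;
};
type Part = TextPart | DelimPart;

const isSpace = (c: string): boolean => /\s/.test(c);
const isPunct = (c: string): boolean => /[!-\/:-@\[-\x60{-~]/.test(c); // ASCII only

function tokenize(input: string): Part[] {
  const parts: Part[] = [];
  let i = 0;
  while (i < input.length) {
    let j = i;
    if (input[i] === "*") {
      while (input[j] === "*") j++;
      // Start and end of input count as whitespace for the flanking rules.
      const prev = i > 0 ? input[i - 1] : " ";
      const next = j < input.length ? input[j] : " ";
      const left = !isSpace(next) && (!isPunct(next) || isSpace(prev) || isPunct(prev));
      const right = !isSpace(prev) && (!isPunct(prev) || isSpace(next) || isPunct(next));
      parts.push({ kind: "delim", count: j - i, orig: j - i, canOpen: left, canClose: right });
    } else {
      while (j < input.length && input[j] !== "*") j++;
      parts.push({ kind: "text", value: input.slice(i, j) });
    }
    i = j;
  }
  return parts;
}

const render = (p: Part): string => (p.kind === "text" ? p.value : "*".repeat(p.count));

// If either run can both open and close, the original lengths must not sum
// to a multiple of 3 unless both lengths are multiples of 3.
function blockedByRuleOfThree(opener: DelimPart, closer: DelimPart): boolean {
  const eitherIsBoth = opener.canClose || closer.canOpen;
  const sumIsMultiple = (opener.orig + closer.orig) % 3 === 0;
  const bothMultiples = opener.orig % 3 === 0 && closer.orig % 3 === 0;
  return eitherIsBoth && sumIsMultiple && !bothMultiples;
}

function renderEmphasis(input: string): string {
  const parts = tokenize(input);
  let c = 0;
  while (c < parts.length) {
    const closer = parts[c];
    if (closer.kind !== "delim" || !closer.canClose) {
      c++;
      continue;
    }
    // Look back for the nearest run that can open and may pair with this closer.
    let opener: DelimPart | undefined;
    let o = c - 1;
    for (; o >= 0; o--) {
      const p = parts[o];
      if (p.kind === "delim" && p.canOpen && !blockedByRuleOfThree(p, closer)) {
        opener = p;
        break;
      }
    }
    if (!opener) {
      c++;
      continue;
    }
    // Strong if both runs still have two or more asterisks, else regular emphasis.
    const use = opener.count >= 2 && closer.count >= 2 ? 2 : 1;
    const tag = use === 2 ? "strong" : "em";
    // Unmatched runs between opener and closer become literal text.
    const inner = parts.slice(o + 1, c).map(render).join("");
    parts.splice(o + 1, c - o - 1, { kind: "text", value: `<${tag}>${inner}</${tag}>` });
    opener.count -= use;
    closer.count -= use;
    c = o + 2; // the closer's new index
    if (opener.count === 0) {
      parts.splice(o, 1);
      c--;
    }
    // A closer with asterisks left is examined again on the next iteration.
    if (closer.count === 0) parts.splice(c, 1);
  }
  return parts.map(render).join("");
}

console.log(renderEmphasis("ggggg*jhhhh**dsjdhjdj***"));
// ggggg<em>jhhhh<strong>dsjdhjdj</strong></em>
```

The key difference from a first-match scan is that the trailing `***` run is not discarded after closing the bold span: its remaining asterisk is matched again, this time against the earlier `*` opener.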

## Reproduction

### Minimal Test Case

```typescript
import { createHeadlessEditor } from '@lexical/headless';
import { $convertFromMarkdownString, TRANSFORMERS } from '@lexical/markdown';
import { $getRoot } from 'lexical';

const editor = createHeadlessEditor({
  nodes: [],
  onError: (error) => console.error(error),
});

editor.update(() => {
  $convertFromMarkdownString('ggggg*jhhhh**dsjdhjdj***', TRANSFORMERS);
  const root = $getRoot();
  console.log(root.getTextContent()); // Shows: ggggg*jhhhhdsjdhjdj*
});
```

### Automated Test

```typescript
import { createHeadlessEditor } from '@lexical/headless';
import { $convertFromMarkdownString, TRANSFORMERS } from '@lexical/markdown';
import { $getRoot } from 'lexical';

describe('@lexical/markdown emphasis parsing', () => {
  test('should handle overlapping emphasis delimiters per CommonMark spec', () => {
    const editor = createHeadlessEditor({
      nodes: [],
      onError: (error) => console.error(error),
    });

    editor.update(() => {
      $convertFromMarkdownString('ggggg*jhhhh**dsjdhjdj***', TRANSFORMERS);
      const root = $getRoot();
      const html = root.getFirstChild()?.exportDOM().element?.innerHTML;

      // Should have nested em and strong tags
      expect(html).toContain('<em>');
      expect(html).toContain('<strong>');
      expect(html).toMatch(/ggggg<em>jhhhh<strong>dsjdhjdj<\/strong><\/em>/);

      // Should NOT have literal asterisks
      expect(html).not.toContain('*');
    });
  });
});
```

## Related Issues

- #2632 - Similar issue with `**Bold *Italic***` (fixed in v0.22.0 via PR #5758)
- The fix in #5758 addressed one pattern but didn't fully implement CommonMark's delimiter matching algorithm

## Proposed Solution

Implement CommonMark's full emphasis delimiter algorithm in `findOutermostMatch` or replace it with a proper delimiter stack-based approach. Key changes needed:

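1. **Apply flanking rules** to determine which delimiter runs can open/close
2. **Prefer strong over regular emphasis** when both the opener and the closer have at least two delimiters left, so a `***` run can close `**` and then `*`
3. **Track a delimiter stack** to handle overlapping cases
4. **Match each closer with its nearest opener** so inner spans resolve first, instead of taking the first opener that matches left to right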

Alternatively, consider using a battle-tested CommonMark parser like `commonmark.js` or `markdown-it` for the import path, then convert to Lexical nodes.

## Impact

- **Severity**: Medium - affects CommonMark compliance
- **Frequency**: Low - specific pattern, but occurs in real-world markdown (especially LLM-generated content)
- **Workaround**: Users can avoid this pattern and use properly balanced emphasis like `**bold with *italic* inside**`

## Additional Context

This issue was discovered while working with markdown content in a rich text editor. The pattern appears occasionally in user-generated content and more frequently in LLM-generated markdown.

CommonMark compliance is important for interoperability with other markdown tools and ensuring user expectations are met when importing markdown content.
