Bug: Markdown emphasis parser fails on overlapping delimiters (*text**more***) - CommonMark compliance issue #8073

@Yith1

(Note: this bug description generated by Claude.ai)

Description

Lexical's $convertFromMarkdownString incorrectly parses markdown emphasis when single and double asterisk delimiters overlap in certain patterns. This violates the CommonMark specification for emphasis delimiter matching.

Environment

  • Lexical version: 0.39.0
  • Package: @lexical/markdown
  • Browser: All browsers (parser issue)

Bug Type

Parser incorrectly handles CommonMark emphasis delimiter rules

Expected Behavior (per CommonMark spec)

According to the CommonMark specification, the markdown string:

ggggg*jhhhh**dsjdhjdj***

Should parse as:

<p>ggggg<em>jhhhh<strong>dsjdhjdj</strong></em></p>

Breakdown:

  • First * opens italic
  • ** opens bold inside the italic
  • First two asterisks of *** close the bold
  • Last * of *** closes the italic

Actual Behavior (Lexical v0.39.0)

Lexical parses it as:

<p>ggggg*jhhhh<strong>dsjdhjdj</strong>*</p>

The first and last asterisks are treated as literal characters instead of emphasis delimiters.

Test Cases

The following cases show which patterns work and which fail:

import { $convertFromMarkdownString, TRANSFORMERS } from "@lexical/markdown";

// Each call below must run inside editor.update(() => { ... }).

// ✅ WORKS: Simple cases
$convertFromMarkdownString("*italic*", TRANSFORMERS);
// Result: <em>italic</em> ✓

$convertFromMarkdownString("**bold**", TRANSFORMERS);
// Result: <strong>bold</strong> ✓

$convertFromMarkdownString("***bold and italic***", TRANSFORMERS);
// Result: <em><strong>bold and italic</strong></em> ✓

// ✅ WORKS: Properly balanced nesting
$convertFromMarkdownString("**bold with *italic* inside**", TRANSFORMERS);
// Result: <strong>bold with <em>italic</em> inside</strong> ✓

$convertFromMarkdownString("*italic with **bold** inside*", TRANSFORMERS);
// Result: <em>italic with <strong>bold</strong> inside</em> ✓

// ❌ FAILS: Overlapping delimiters (opener before opener, closer after closer)
$convertFromMarkdownString("ggggg*jhhhh**dsjdhjdj***", TRANSFORMERS);
// Expected: ggggg<em>jhhhh<strong>dsjdhjdj</strong></em>
// Actual: ggggg*jhhhh<strong>dsjdhjdj</strong>* ✗

Root Cause Analysis

After investigating the @lexical/markdown source code, the issue appears to be in the findOutermostMatch function in LexicalMarkdown.dev.js (lines 308-340).

The Algorithm Flaw

The current algorithm:

  1. Finds all opening tags using openTagsRegExp in a left-to-right scan
  2. For each opening tag (in order of appearance), tries to match its full pattern
  3. Returns the first successful match

This is a greedy left-to-right approach that doesn't respect CommonMark's emphasis delimiter matching rules.

Why It Fails on *jhhhh**dsjdhjdj***

  1. Scanner finds * at position 5, ** at position 11
  2. Tries to match *...* for position 5 → NO MATCH (every later * belongs to the ** or *** run, so none qualifies as a single-asterisk closer)
  3. Tries to match **...** for position 11 → MATCH! (**dsjdhjdj**)
  4. Applies bold formatting, leaving *jhhhh and * as literal characters
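
The four steps above can be reproduced with a small toy model. This is a sketch only and not Lexical's source: the helper names (fullPattern, findFirstMatch, applyFirstMatch) are hypothetical, and the regex shape is an assumption chosen so that the model behaves the way this report describes. It only distinguishes * and **; longer runs are out of scope.

```typescript
// Toy model of the first-match strategy described above. NOT Lexical's source:
// the regex shape is an assumption chosen to reproduce the reported output.
type FirstMatch = { tag: string; start: number; inner: string; length: number };

// Hypothetical pattern: tag, content that neither starts nor ends with an
// asterisk or whitespace, then the same tag not immediately repeated.
function fullPattern(tag: string): RegExp {
  const t = tag.replace(/\*/g, "\\*");
  return new RegExp(`^${t}(?![*\\s])(.*?[^*\\s])${t}(?!${t})`);
}

function findFirstMatch(text: string): FirstMatch | null {
  const openTags = /\*+/g; // step 1: every asterisk run, left to right
  let open: RegExpExecArray | null;
  while ((open = openTags.exec(text)) !== null) {
    // step 2: try this opener's full pattern from its own position
    const m = fullPattern(open[0]).exec(text.slice(open.index));
    // step 3: the first opener that matches wins
    if (m) return { tag: open[0], start: open.index, inner: m[1], length: m[0].length };
  }
  return null;
}

// Apply a single match and leave everything else as literal text.
function applyFirstMatch(text: string): string {
  const m = findFirstMatch(text);
  if (m === null) return text;
  const name = m.tag.length === 2 ? "strong" : "em"; // toy: only * and ** are modeled
  return text.slice(0, m.start) + `<${name}>${m.inner}</${name}>` + text.slice(m.start + m.length);
}

console.log(applyFirstMatch("ggggg*jhhhh**dsjdhjdj***"));
// prints: ggggg*jhhhh<strong>dsjdhjdj</strong>*
```

Because the single * at position 5 never finds a valid closer, the model falls through to ** and stops there, which leaves both outer asterisks as literal text.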

CommonMark's Algorithm

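Per the CommonMark spec, emphasis delimiter matching should:

  1. Identify delimiter runs and classify each as a potential opener, closer, or both using the left- and right-flanking rules
  2. Scan forward for the next potential closer, then look back through the delimiter stack for the nearest compatible opener, so inner spans are resolved before the spans that enclose them
  3. Produce strong emphasis when both the opener and the closer have at least two delimiter characters left, otherwise regular emphasis, and keep any leftover characters available for further matches
  4. Reject a pairing when either run can both open and close and the two run lengths sum to a multiple of 3 (unless both lengths are multiples of 3)

In the failing example, rule 4 stops the ** run from closing the first * (1 + 2 = 3), so ** stays available as an opener. The *** run then closes it with two characters, and its remaining * closes the first *.

The CommonMark algorithm is significantly more complex than Lexical's current implementation. See CommonMark spec section 6.2 for the emphasis rules and the spec's appendix, "A parsing strategy", for the full algorithm.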
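
A delimiter-stack matcher along these lines can be sketched in a few dozen lines. This is a simplified illustration under stated assumptions, not the spec's reference algorithm and not a proposed patch: it handles only *, treats only ASCII punctuation as punctuation, ignores backslash escapes, code spans, links, and HTML escaping, and all names (tokenize, renderEmphasis, and so on) are hypothetical.

```typescript
// Simplified sketch of a CommonMark-style delimiter stack for `*` only.
type TextItem = { kind: "text"; value: string };
type DelimItem = { kind: "delim"; count: number; orig: number; canOpen: boolean; canClose: boolean };
type Item = TextItem | DelimItem;

// Start and end of input count as whitespace; punctuation is ASCII-only here.
const isSpace = (c: string | undefined): boolean => c === undefined || /\s/.test(c);
const isPunct = (c: string | undefined): boolean => c !== undefined && /[!-\/:-@\[-`{-~]/.test(c);

function tokenize(src: string): Item[] {
  const items: Item[] = [];
  let i = 0;
  while (i < src.length) {
    const star = src[i] === "*";
    let j = i;
    while (j < src.length && (src[j] === "*") === star) j++;
    if (!star) {
      items.push({ kind: "text", value: src.slice(i, j) });
    } else {
      const before = src[i - 1]; // undefined at the start of the input
      const after = src[j]; // undefined at the end of the input
      // For `*`, left-flanking runs can open and right-flanking runs can close.
      const canOpen = !isSpace(after) && (!isPunct(after) || isSpace(before) || isPunct(before));
      const canClose = !isSpace(before) && (!isPunct(before) || isSpace(after) || isPunct(after));
      items.push({ kind: "delim", count: j - i, orig: j - i, canOpen, canClose });
    }
    i = j;
  }
  return items;
}

// Multiple-of-3 rule, evaluated on the original run lengths.
function blockedByRuleOfThree(opener: DelimItem, closer: DelimItem): boolean {
  const sum = opener.orig + closer.orig;
  return (opener.canClose || closer.canOpen) && sum % 3 === 0 &&
    !(opener.orig % 3 === 0 && closer.orig % 3 === 0);
}

function renderEmphasis(src: string): string {
  const items = tokenize(src);
  let i = 0;
  while (i < items.length) {
    const closer = items[i];
    if (closer.kind !== "delim" || !closer.canClose || closer.count === 0) { i++; continue; }
    // Look back for the nearest opener that is still usable.
    let found = -1;
    for (let j = i - 1; j >= 0; j--) {
      const cand = items[j];
      if (cand.kind === "delim" && cand.canOpen && cand.count > 0 && !blockedByRuleOfThree(cand, closer)) {
        found = j;
        break;
      }
    }
    if (found < 0) { i++; continue; }
    const opener = items[found] as DelimItem;
    const used = opener.count >= 2 && closer.count >= 2 ? 2 : 1;
    const tag = used === 2 ? "strong" : "em";
    opener.count -= used;
    closer.count -= used;
    // Delimiters strictly between the pair can no longer match anything.
    for (let k = found + 1; k < i; k++) {
      const mid = items[k];
      if (mid.kind === "delim") { mid.canOpen = false; mid.canClose = false; }
    }
    items.splice(i, 0, { kind: "text", value: `</${tag}>` });
    items.splice(found + 1, 0, { kind: "text", value: `<${tag}>` });
    i += 2; // the closer shifted right by the two inserted tags
    if (closer.count === 0) i++; // otherwise revisit it with its leftover characters
  }
  return items.map((it) => (it.kind === "text" ? it.value : "*".repeat(it.count))).join("");
}

console.log(renderEmphasis("ggggg*jhhhh**dsjdhjdj***"));
// prints: ggggg<em>jhhhh<strong>dsjdhjdj</strong></em>
```

The key difference from a first-match scan is that a run such as *** can be consumed in two steps, first closing ** and then closing the earlier *.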

Reproduction

Minimal Test Case

import { createHeadlessEditor } from '@lexical/headless';
import { $convertFromMarkdownString, TRANSFORMERS } from '@lexical/markdown';
import { $getRoot } from 'lexical';

const editor = createHeadlessEditor({
  nodes: [],
  onError: (error) => console.error(error),
});

editor.update(() => {
  $convertFromMarkdownString('ggggg*jhhhh**dsjdhjdj***', TRANSFORMERS);
  const root = $getRoot();
  console.log(root.getTextContent()); // Shows: ggggg*jhhhhdsjdhjdj*
});

Automated Test

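import { createHeadlessEditor } from '@lexical/headless';
import { $convertFromMarkdownString, TRANSFORMERS } from '@lexical/markdown';
import { $getRoot } from 'lexical';

describe('@lexical/markdown emphasis parsing', () => {
  test('should handle overlapping emphasis delimiters per CommonMark spec', () => {
    const editor = createHeadlessEditor({
      nodes: [],
      // Rethrow so editor errors fail the test instead of only being logged
      onError: (error) => {
        throw error;
      },
    });

    // discrete: true commits the update synchronously, so the assertions
    // below run before the test function returns
    editor.update(
      () => {
        $convertFromMarkdownString('ggggg*jhhhh**dsjdhjdj***', TRANSFORMERS);
      },
      { discrete: true },
    );

    editor.getEditorState().read(() => {
      // Lexical stores emphasis as format flags on flat text nodes, so the
      // test checks formats rather than nested <em>/<strong> markup
      const runs = $getRoot()
        .getAllTextNodes()
        .map((node) => ({
          text: node.getTextContent(),
          italic: node.hasFormat('italic'),
          bold: node.hasFormat('bold'),
        }));

      expect(runs).toEqual([
        { text: 'ggggg', italic: false, bold: false },
        { text: 'jhhhh', italic: true, bold: false },
        { text: 'dsjdhjdj', italic: true, bold: true },
      ]);

      // Should NOT have literal asterisks
      expect($getRoot().getTextContent()).not.toContain('*');
    });
  });
});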

Proposed Solution

Implement CommonMark's full emphasis delimiter algorithm in findOutermostMatch or replace it with a proper delimiter stack-based approach. Key changes needed:

  1. Apply flanking rules to determine which delimiter runs can open/close
  2. Track a delimiter stack so that a run can be partially consumed (for example, *** closing ** and then *)
  3. For each potential closer, match the nearest compatible opener instead of returning the first opener whose full pattern matches
  4. Apply the multiple-of-3 rule when a run can both open and close

Alternatively, consider using a battle-tested CommonMark parser like commonmark.js or markdown-it for the import path, then convert to Lexical nodes.
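
If an external parser were used for the import path, its nested emphasis tree would need to be flattened, because a Lexical text node carries its formats as flags (checked with hasFormat) rather than as nested elements. The sketch below shows that flattening step on a hypothetical Inline tree type. The type and the flatten helper are assumptions for illustration and do not match the actual AST of commonmark.js or markdown-it.

```typescript
// Hypothetical inline tree, standing in for whatever AST an external parser returns.
type Inline =
  | { type: "text"; value: string }
  | { type: "emph" | "strong"; children: Inline[] };

// A flat run: text plus the formats inherited from every enclosing node.
type Run = { text: string; bold: boolean; italic: boolean };

function flatten(nodes: Inline[], bold = false, italic = false, out: Run[] = []): Run[] {
  for (const node of nodes) {
    if (node.type === "text") {
      out.push({ text: node.value, bold, italic });
    } else {
      // Formats accumulate on the way down the tree.
      flatten(node.children, bold || node.type === "strong", italic || node.type === "emph", out);
    }
  }
  return out;
}

// Tree for the expected parse of ggggg*jhhhh**dsjdhjdj***
const tree: Inline[] = [
  { type: "text", value: "ggggg" },
  {
    type: "emph",
    children: [
      { type: "text", value: "jhhhh" },
      { type: "strong", children: [{ type: "text", value: "dsjdhjdj" }] },
    ],
  },
];

const flatRuns = flatten(tree);
console.log(flatRuns);
```

Each run could then be turned into a text node inside editor.update, for example with $createTextNode(run.text) followed by toggleFormat('bold') or toggleFormat('italic') as needed.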

Impact

  • Severity: Medium - affects CommonMark compliance
  • Frequency: Low - specific pattern, but occurs in real-world markdown (especially LLM-generated content)
  • Workaround: Users can avoid this pattern and use properly balanced emphasis like **bold with *italic* inside**

Additional Context

This issue was discovered while working with markdown content in a rich text editor. The pattern appears occasionally in user-generated content and more frequently in LLM-generated markdown.

CommonMark compliance is important for interoperability with other markdown tools and ensuring user expectations are met when importing markdown content.
