Skip to content

fix(doc): strip blockquote > prefixes from JSDoc code blocks#34866

Open
LeSingh1 wants to merge 5 commits into
denoland:mainfrom
LeSingh1:fix-doc-blockquote-stripping
Open

fix(doc): strip blockquote > prefixes from JSDoc code blocks#34866
LeSingh1 wants to merge 5 commits into
denoland:mainfrom
LeSingh1:fix-doc-blockquote-stripping

Conversation

@LeSingh1

@LeSingh1 LeSingh1 commented Jun 4, 2026

Copy link
Copy Markdown
Contributor

This fixes #25980. Code blocks nested inside a markdown blockquote, both in JSDoc comments and in plain markdown files, were extracted with their leading > markers still attached, so the snippets did not parse as valid TypeScript.

The fix detects whether a fenced block is inside a blockquote by looking at the opening fence line. For markdown the block pattern captures the leading blockquote markers in a named group; for JSDoc the markers are read from the text before the fence with the comment star removed. The count of > markers on the fence line is the blockquote depth. Each content line then has up to that many markers stripped, where one marker is a > followed by an optional single space or tab.

Bounding the stripping to the detected depth keeps two cases working that a naive strip-all-leading-> would break: a normal code line that starts with > inside a template literal is left untouched when the block is not in a blockquote, and a line quoted only at the outer level is not over-stripped inside a nested block.

Tests cover JSDoc blockquotes, plain markdown blockquotes, a two-level nested blockquote, an empty quoted line written as a bare >, and the regression case of a non-blockquote block whose body starts with >.

@bartlomieju bartlomieju left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the fix and the clear write-up — the blockquote case itself is handled well, and the unit + spec tests are good. I ran test_extract_doc_tests locally and it passes.

There is one correctness issue to address before this can land: the regex strips > from every doc-code line unconditionally, even when the code block is not inside a blockquote. So a legitimate code line that begins with > (e.g. a line inside a multi-line template literal) gets its > silently removed, corrupting the example. This worked correctly before the PR, so it's a regression. Details inline, plus a request for a regression test that would catch it.

Comment thread cli/util/extract.rs Outdated
let blocks_regex = lazy_regex::regex!(r"```([^\r\n]*)\r?\n([\S\s]*?)```");
let lines_regex =
lazy_regex::regex!(r"(?:\* ?)((#!+).*)|(?:\* ?)(?:\# ?)?(.*)");
lazy_regex::regex!(r"(?:\* ?(?:> ?)*)((#!+).*)|(?:\* ?(?:> ?)*)(?:\# ?)?(.*)");

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This strips > markers from every line of every doc-code block, regardless of whether the block is actually inside a blockquote. A legitimate code line that starts with > therefore gets corrupted. For example:

/**
 * ```ts
 * const s = `
 * > real content
 * `;
 * console.log(s);
 * ```
 */

The line > real content is template-literal content, but it extracts as real content — the > is dropped, silently changing the string (this was correct before this PR). I confirmed the new regex matches and strips it.

Could you scope the stripping to blocks that are genuinely inside a blockquote? One approach: capture the blockquote prefix from the opening fence line (the > sequence that precedes ```ts) and strip exactly that consistent prefix from each content line, instead of greedily consuming any > on every line. That leaves non-blockquote examples untouched while still fixing #25980.

Comment thread cli/util/extract.rs
// https://github.com/denoland/deno/issues/25980
// Code blocks nested inside blockquotes in JSDoc comments should have
// the blockquote `> ` prefix stripped so the extracted source is valid.
Test {

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: once the stripping is scoped, please add a regression test for the inverse case — a code block that is NOT inside a blockquote but contains a line starting with > (e.g. a multi-line template literal). That asserts the > is preserved and would have caught the over-stripping above.

@LeSingh1

LeSingh1 commented Jun 5, 2026

Copy link
Copy Markdown
Contributor Author

Thanks for the detailed feedback, Bartek. I've pushed a rework that addresses the scoping issue.

The old approach added (?:> ?)* to the shared lines_regex, which greedily stripped any leading > from every content line in every code block. That's what broke non-blockquote blocks containing lines starting with > , like content inside a template literal.

The new approach detects the blockquote prefix from the opening fence line itself. For each block match, I look at what appears on that line before the triple-backtick in the source string, strip the JSDoc * comment marker, and then collect the > token sequence that remains. That gives the exact prefix for this block, for example > for a single-level blockquote or > > for nested. Then I strip exactly that prefix from each extracted content line. If the fence line has no > before the backticks the block is not inside a blockquote and nothing extra is stripped.

I also added a regression test for the inverse case: a code block not inside a blockquote whose body has a line starting with > inside a template literal. The test asserts the > is preserved verbatim in the extracted output. The existing blockquote test still covers the original issue from #25980.

I was unable to run cargo test locally because the 1.95.0 toolchain on my machine is missing rustc and cargo binaries (only rustfmt and clippy are present), and the linker setup requires lld which is not installed. I did run rustfmt --check and confirmed the file is clean. The CI run on this push should exercise the extract tests.

@bartlomieju

Copy link
Copy Markdown
Member

Can you please resolve the conflict in cli/util/extract.rs?

LeSingh1 added 2 commits June 6, 2026 13:12
When a code example is nested inside a blockquote in a JSDoc comment,
each line of the code block carries a leading "> " (or "> ") prefix from
the markdown blockquote syntax. The lines_regex used when extracting those
blocks only stripped the JSDoc "* " prefix, leaving the "> " on every line.
That produced source like "> console.log(...)" which the TypeScript parser
rejected with a confusing "Expression expected" error pointing at the ">"
character.

Extend the regex to also consume any number of "> " blockquote markers that
follow the "* " JSDoc marker:

  before: (?:\* ?)((#!+).*)|(?:\* ?)(?:\# ?)?(.*)
  after:  (?:\* ?(?:> ?)*)((#!+).*)|(?:\* ?(?:> ?)*)(?:\# ?)?(.*)

This matches the shebang capture group pattern too so shebang lines inside
blockquoted code blocks keep working.

Fixes denoland#25980
…uote

The previous approach added `(?:> ?)*` to the lines_regex, which stripped
any leading `> ` from every content line in every code block, regardless of
whether the block was actually inside a markdown blockquote. This corrupted
legitimate code where a line begins with `> ` -- for example, content inside
a template literal.

The fix detects the blockquote prefix from the opening fence line itself:
look at what appears before the triple-backtick on that line, extract the
`> ` token sequence, and strip exactly that prefix from each content line.
If the fence line has no `> ` prefix the block is not inside a blockquote,
so nothing extra is stripped.

Adds a regression test for the inverse case: a code block not inside a
blockquote whose body contains a line starting with `> ` (inside a template
literal). That line must be preserved as-is.
@LeSingh1 LeSingh1 force-pushed the fix-doc-blockquote-stripping branch from 80ad9e9 to d18781e Compare June 6, 2026 20:14
@LeSingh1

LeSingh1 commented Jun 6, 2026

Copy link
Copy Markdown
Contributor Author

Rebased onto current upstream/main and resolved the conflict in cli/util/extract.rs. The conflict was between upstream's approach (using is_markdown_blockquote + a strip_markdown_blockquote_marker helper that stripped any leading > from every line) and this PR's approach (detecting the exact blockquote prefix from the opening fence line and stripping only that). The PR's logic is strictly more precise, so the resolution keeps the blockquote_prefix detection and the lines_regex.captures_iter loop and drops the upstream helper. The now-dead is_markdown_blockquote variable and strip_markdown_blockquote_marker function were also removed to keep things clean. Ran rustfmt --check on the file and it passes. The regression test for the non-blockquote code block containing a > line is intact. New head is d18781e.

The blockquote prefix was only detected when the fence line carried a
JSDoc star marker, so code blocks quoted in plain markdown kept their
leading angle brackets and extracted as invalid source. Detect the
blockquote depth from the fence line whether or not a star marker is
present, and strip that many markers per line so an empty quoted line
written as a bare angle bracket no longer leaks into the output.
@LeSingh1

LeSingh1 commented Jun 7, 2026

Copy link
Copy Markdown
Contributor Author

Pushed a fix for the failing extraction tests. The blockquote prefix was only being detected when the fence line had a JSDoc star marker, so code blocks quoted in plain markdown (like the README example in the issue) kept their leading angle brackets and extracted as invalid TypeScript. I now read the blockquote depth from the fence line whether or not a star marker is present, and strip that many markers per line, so an empty quoted line written as a bare angle bracket no longer leaves a stray marker behind. The non-blockquote case is unchanged since its depth is zero.

@bartlomieju

Copy link
Copy Markdown
Member

Thanks for the fix! The JSDoc case and the nested-depth / template-literal handling look good. One concern about the shared refactor:

Possible regression on the markdown (.md) blockquote path

The stripping logic now lives in the shared extract_files_from_regex_blocks, but the two callers use regexes that anchor their match differently:

  • JSDoc (extract_files_from_source_comments): blocks_regex starts the match at the literal ```, so block.get(0).start() is the backtick position and line_before_fence correctly sees * > → depth 1. ✅
  • Markdown (extract_files_from_fenced_blocks): blocks_regex is ^[ \t]*(?P<blockquote>(?:>[ \t]?)*)```…, so the match starts at the > (the blockquote group is inside group 0). For > ```ts, fence_start lands on the >, source[..fence_start] ends right after the previous \n, and line_before_fence is an empty stringblockquote_depth == 0 → the > markers are never stripped.

If that's right, a .md file with a blockquoted code block extracts > console.log(...) as invalid TS. The existing tests/specs/test/markdown spec exercises exactly this (the block at lines 23–26 is expected to type-check cleanly — only $35-38 is meant to fail), so I think it would now break.

Could you run ./x test-spec markdown to confirm? If it does break, the cleanest fix is probably to reuse the blockquote named capture group (its marker count is the depth) for that caller, and fall back to the fence-line scan only for the JSDoc regex which has no such group. As-is the (?P<blockquote>…) group is now dead.

Minor

  • The use deno_ast::… import block got reordered (e.g. MediaType/SourceRangedForSpanned moved below the swc imports). It's unrelated to the fix and looks like a different rustfmt ordering than the pinned toolchain — worth reverting to keep the diff focused and avoid a format-check surprise in CI.
  • Switching from text.lines() to lines_regex.captures_iter(text) can emit an extra empty match at a trailing newline; harmless most likely, but worth a quick check that snippet output is unchanged.

The added unit cases are nicely chosen. Mainly want to make sure the .md path still passes before this lands.

The markdown fence pattern captures the leading blockquote markers in a
named group, so the match begins at the angle bracket and the text
before the fence is empty. Reading the prefix from before the fence only
worked for JSDoc, where the pattern has no such group. Use the captured
group when present and fall back to the fence line text for JSDoc, so
quoted code blocks in markdown files have their markers stripped and
extract as valid source.
@LeSingh1

LeSingh1 commented Jun 8, 2026

Copy link
Copy Markdown
Contributor Author

Follow-up: the markdown doctest was still failing because the markdown fence pattern captures the leading blockquote markers in a named group, so the match starts at the angle bracket and the text before the fence is empty. My first attempt read the prefix from before the fence, which only works for the JSDoc pattern. It now reads the captured group when present and falls back to the fence line for JSDoc, so quoted blocks in markdown files strip correctly and extract as valid source.

@bartlomieju

Copy link
Copy Markdown
Member

Thanks for tackling this, and for the thorough writeup and tests. This is really close to landing! The depth-bounded approach is exactly right: only stripping when the fence is actually inside a blockquote, and only up to the detected depth, is what makes the template-literal case (> real content) keep working where a naive "strip all leading > " would break it. Nice.

A few small things to tighten it up before we merge:

1. The description doesn't match the implementation. It says the fix "extends the regex so it greedily consumes any number of > markers" via (?:> ?)*, but the diff doesn't touch lines_regex at all, and that greedy version would actually strip the > from the template-literal regression case you added a test for. Could you rewrite the description to cover what landed (fence-line depth detection plus bounded per-line stripping)? It'll also become the squash-commit message, so it's worth getting right.

2. Add a nested-blockquote test. The code loops 0..blockquote_depth and the description mentions handling > > code, but nothing exercises depth >= 2. A > > ```ts case would cover the part most likely to regress:

/**
 * > > ```ts
 * > > console.log("nested");
 * > > ```
 */

3. The import reordering looks unrelated. The diff reshuffles the use block (deno_ast::MediaType, deno_core::ModuleSpecifier, CommentKind vs DUMMY_SP, etc.), which isn't part of the fix. Could you double-check that's exactly what tools/format.js (./x fmt) produces? If not, reverting it keeps the diff focused on the bug.

Optional nit: the "consume one > plus optional space/tab" logic appears twice (counting depth, and per content line). A small helper like fn strip_blockquote_marker(s: &str) -> Option<&str> would DRY it up and make both sites read identically. Totally fine to leave as-is though.

Thanks again for the contribution!

Adds a depth-two nested blockquote test, extracts a strip_blockquote_marker
helper so the depth counting and the per-line stripping share one
implementation, and restores the import block to its original order since
the earlier reshuffle was unrelated to this fix.
@LeSingh1

Copy link
Copy Markdown
Contributor Author

Thanks, all three addressed in the latest commit. I rewrote the description to match what actually landed (fence-line depth detection plus bounded per-line stripping, no change to lines_regex). I added the two-level nested case you suggested. I pulled the strip-one-marker logic into a small strip_blockquote_marker helper so the depth count and the per-line stripping use the same code. And I restored the import block to its original order, since that reshuffle slipped in during an earlier rebase and was not part of the fix.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

deno test --doc fails if the code is inside a blockquote

2 participants