Skip to content

Studio: rewrite OpenAI Responses citation markers to markdown links#5713

Merged
danielhanchen merged 14 commits into
mainfrom
fix/openai-citation-markers
May 26, 2026
Merged

Studio: rewrite OpenAI Responses citation markers to markdown links#5713
danielhanchen merged 14 commits into
mainfrom
fix/openai-citation-markers

Conversation

@danielhanchen

Copy link
Copy Markdown
Member

Summary

OpenAI's /v1/responses stream interleaves text deltas with inline citation markers built from private-use codepoints (U+E200 / U+E201 / U+E202) shaped like citeSOURCE_ID. Most fonts render the codepoints as garbled "E202" glyphs or empty boxes, and the markdown layer further strips them, leaving run-on text like citeturn1view0turn1view1turn3view0turn3view1turn4view0... smeared across the assistant message. The Sources panel still received the URLs (via url_citation annotations) but the inline hand-off into the prose was unreadable.

This PR rewrites each marker into a clickable [N](URL) markdown link when the matching url_citation annotation has already been recorded on this stream, and drops the marker silently otherwise. The lookup uses a new source_id field captured on _record_url_citation (accepts source_id / id / locator across Responses API revisions). Annotations are now processed BEFORE the delta text is rewritten so that markers and their resolving annotation arriving in the same SSE event still resolve.

Reference: https://developers.openai.com/api/docs/guides/citation-formatting

Repro before this PR

  1. Connect an OpenAI cloud provider.
  2. Pick a gpt-5.x Responses-family model and enable web_search.
  3. Ask something that pulls multiple sources, e.g. "Find what's the highest grossing film in the last month".
  4. The streamed reply contains citeturn1view0turn1view1turn3view0... instead of citations.

Behavior after this PR

The same query renders inline citations as [[1]](https://...) [[2]](https://...) links resolving to the same URLs the Sources panel already showed.

Test plan

  • cd studio && python -m pytest backend/tests/test_openai_citation_markers.py -v (11 new tests, all passing)
  • cd studio && python -m pytest backend/tests/test_openai_responses_translation.py -v (no regression)
  • Idempotency: text containing no private-use codepoints is returned verbatim, so accidental double-rewriting is safe.
  • Unknown source_ids: marker is silently stripped, surrounding prose preserved, no E202 glyph ever leaks through.

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 42342598f0

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment on lines 3067 to 3075
if any(c["url"] == url for c in all_url_citations):
# Backfill source_id on the existing entry if
# the first annotation event omitted it.
if source_id:
for c in all_url_citations:
if c["url"] == url and not c.get("source_id"):
c["source_id"] = source_id
break
return

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Preserve source-id aliases when deduplicating citations

When url_citation events repeat the same url with a different source_id (for example, multiple inline markers for different spans/locators on one page), this branch returns early and keeps only the first source_id. _replace_openai_citation_markers then cannot resolve markers for later IDs and strips them, so valid inline citations disappear from the streamed text even though the URL is known. Please retain additional source_id aliases for an existing URL instead of dropping them on dedup.

Useful? React with 👍 / 👎.

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces logic to process OpenAI's inline citation markers, converting them from private-use codepoints into markdown links. Key changes include a new regex-based replacement function, updated citation recording to capture and backfill source IDs, and reordering event processing to ensure annotations are available before text deltas. A new test suite is included to verify these transformations. Feedback was provided to optimize the citation recording logic by combining redundant list iterations into a single pass.

Comment on lines 3067 to 3075
# the first annotation event omitted it.
if source_id:
for c in all_url_citations:
if c["url"] == url and not c.get("source_id"):
c["source_id"] = source_id
break
return
title = payload.get("title") or url
snippet = payload.get("snippet") or payload.get("quote") or ""

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The current logic performs two separate iterations over all_url_citations: first with any() to check for existence, and then with a for loop to perform the backfill. These can be combined into a single pass for better efficiency and readability.

for c in all_url_citations:
    if c["url"] == url:
        if source_id and not c.get("source_id"):
            c["source_id"] = source_id
        return
References
  1. To improve efficiency, avoid redundant data iterations. Combine checks and transformations into a single loop and return computed values for callers to reuse.

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: d36be02beb

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

# on this stream, or drop it silently when the source_id is unknown.
# Group 1 is the source id; an optional `<locator>` suffix is
# discarded (we link the whole source rather than per-line locators).
_OPENAI_CITATION_MARKER = re.compile("cite([^]+)(?:[^]*)?")

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Preserve all source IDs in multi-source cite markers

OpenAI citation markers can encode multiple supporting source IDs in one marker (e.g. ...sourceAsourceB...), but this regex captures only the first ID and treats every later ... segment as a locator suffix. In streamed answers that use multi-source markers, the rewriter will emit a single link and silently drop the remaining supports, so citations shown inline become incomplete/misleading even though url_citation annotations are available.

Useful? React with 👍 / 👎.

danielhanchen and others added 3 commits May 23, 2026 14:00
OpenAI's /v1/responses stream interleaves text deltas with inline
citation markers built from private-use codepoints (U+E200 / U+E201 /
U+E202) shaped like `citeSOURCE_ID`. The codepoints render
as garbled "E202" glyphs or empty boxes in most fonts, and the
markdown layer further strips them, leaving run-on text like
"citeturn1view0turn1view1turn3view0...". The url list still arrived in
the Sources panel via url_citation annotations, but the inline cite
hand-off into the prose was unreadable.

Rewrite each marker into `[N](URL)` when the matching url_citation
has already been recorded on this stream, and drop the marker
silently otherwise. The lookup uses a new `source_id` field captured
on `_record_url_citation` (accepts source_id / id / locator across
Responses API revisions). Annotations are now applied BEFORE the
delta text is rewritten so that markers and their resolving
annotation arriving in the same SSE event still resolve.

Reference: https://developers.openai.com/api/docs/guides/citation-formatting
OpenAI's Responses stream cites the same URL under multiple
source_id markers when the model references different spans of the
same page. The previous dedup-by-URL kept only the first alias and
dropped the rest, so subsequent markers for the same URL never
resolved and got stripped from the prose. Switch the citation
record to a ``source_ids`` list and append new aliases on every
duplicate. The rewriter resolves any alias back to the same
citation number so the inline markers all collapse onto one footnote
rather than fanning out into bogus repeats.

Also collapse the two passes over ``all_url_citations`` in
``_record_url_citation`` into a single loop for clarity. Adds two
regression tests covering the alias-collision and mixed-shape cases.
@danielhanchen danielhanchen force-pushed the fix/openai-citation-markers branch from d36be02 to 7dd993e Compare May 23, 2026 14:00

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 7dd993e83d

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment on lines +3183 to +3185
delta_text = _replace_openai_citation_markers(
delta_text, all_url_citations
)

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Buffer partial citation markers across streamed deltas

_replace_openai_citation_markers is invoked on each response.output_text.delta chunk independently, but citation markers are token-level text and can be split across consecutive deltas (e.g., first chunk ends with ````). In that case no full regex match exists in either chunk, so private-use marker bytes leak into output or the citation is lost instead of becoming a link. This regresses the stated goal for streamed responses whenever marker boundaries do not align with delta boundaries; keep a carryover buffer between deltas and only rewrite once a full marker is available.

Useful? React with 👍 / 👎.

danielhanchen and others added 4 commits May 23, 2026 16:28
The original rewriter only handles markers that arrive whole inside a
single response.output_text.delta event. OpenAI's stream chunks text
on byte-buffer boundaries with no awareness of the marker grammar,
so a marker can straddle two deltas (delta-1 ends with
"citetu", delta-2 starts with "rn0view0"). Each delta
was rewritten in isolation, so the half-marker leaked as garbled
"E200/E202" glyphs in the rendered prose.

Buffer the unterminated tail across deltas and concatenate it onto
the front of the next one so the rewriter sees a complete marker.
Flush the held-over tail on response.completed / response.incomplete /
[DONE], stripping any leftover private-use bytes so a never-closed
marker (truncated stream, missing annotation) never leaks.

Also handle the multi-source marker shape from the OpenAI docs --
citeid1id2 should expand to one bracket
link per resolvable id. The previous regex captured only the first
source id and silently dropped id2/id3.

Reference: https://developers.openai.com/api/docs/guides/citation-formatting

Tests: 21 new cases covering multi-source, locator suffix, marker
split across two and three deltas, unterminated marker on truncation,
late annotation resolving a buffered marker, idempotency, and the
head/tail split helper directly.

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 674e9b8dca

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

Comment on lines +3097 to +3098
for ch in ("", "", ""):
rendered = rendered.replace(ch, "")

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Strip actual citation codepoints in tail flush

The tail-flush sanitizer currently loops over ("", "", ""), so rendered.replace(ch, "") is a no-op and never removes OpenAI’s private-use marker bytes. When a stream ends with an unterminated citation marker (e.g., [DONE], response.completed, or response.incomplete while pending_marker_tail is non-empty), _replace_openai_citation_markers cannot rewrite it and these glyph bytes leak into user-visible output—the exact regression this path is meant to prevent.

Useful? React with 👍 / 👎.

danielhanchen and others added 2 commits May 25, 2026 13:30
The split-marker buffer already concatenates a marker that straddles
two response.output_text.delta events. But when the annotation event
for a url_citation arrives AFTER the delta that contains its inline
marker (the typical OpenAI Responses ordering), the rewriter still
saw an empty lookup table at delta time and silently stripped the
marker. The URL kept showing up in the sources panel but the inline
link reference was permanently gone.

Add _rewrite_citation_markers_partial which leaves an unresolved
marker verbatim and reports has_unresolved=True. The streaming loop
buffers any closed segment that contains an unresolved marker into a
pending_citation_segments FIFO and drains the queue on every later
annotation event, on response.completed, on response.incomplete, and
on the [DONE] sentinel. Drain order is preserved so later clean text
does not leapfrog an earlier deferred segment. End-of-stream forces a
strip so no codepoint leaks if the annotation never arrived.

Add six regression tests covering single-pass resolution, the late-
annotation two-pass case, multi-source markers with partial
resolution, mixed known and pending markers in one segment, and
idempotency on marker-free input.

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: ec5e529184

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

Comment on lines +197 to +199
if not rendered:
has_unresolved = True
return match.group(0)

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Keep multi-source markers pending until all IDs resolve

_rewrite_citation_markers_partial only marks a marker as unresolved when no token resolves; if one source ID resolves and another is still pending, it emits the resolved link(s), drops the unresolved token(s), and returns unresolved=False. In streamed Responses events where a multi-source marker’s annotations arrive across separate annotation.added events, this causes later source IDs to be permanently lost because the caller no longer buffers that segment for a retry, so inline citations become incomplete/misleading despite eventually receiving all citation metadata.

Useful? React with 👍 / 👎.

danielhanchen and others added 3 commits May 25, 2026 13:41
`_flush_pending_marker_tail` stripped the three private-use citation
codepoints from the held-over buffer, but left the literal ``cite``
keyword plus the source id behind as plain text. A stream ending
mid-marker therefore emitted user-visible garbage like
``Some text citeturn0view0`` instead of the intended clean prose.

``pending_marker_tail`` is by construction the suffix that starts at
an unclosed ``\\ue200`` opener -- the split helper guarantees there is
no closing ``\\ue201`` byte. Without that close the marker is
meaningless: the source id cannot be resolved to a URL and the user
prose before the opener was already emitted as ``head`` on the
originating delta. Bail out before the strip step and return the
empty string. As a belt-and-braces measure also drop any orphan
``cite<sid>`` literal at the head of the buffer in case a future
caller passes a partially-terminated tail.

Update the matching ``_simulate_delta_stream`` harness in the edge
tests so it mirrors the new flush logic, and add four regression
tests covering unterminated marker with surrounding prose, marker-
only inputs, prefix-only outputs, and the split-then-close path that
still must resolve to a link.
`_rewrite_citation_markers_partial` previously treated a marker as
resolved when even one token in a multi-source marker resolved,
dropping any still-pending source ids. In streamed Responses events
the annotations for a multi-source marker can arrive across separate
`annotation.added` chunks, so the caller no longer buffered that
segment for retry and the late source id was lost from the inline
citation entirely.

Flag the marker unresolved whenever any token misses the lookup so
the streamer keeps the segment pending. End-of-stream force flush
still drops unresolved tokens through `_replace_openai_citation_markers`
so locator-style suffixes (which look like unresolved ids at the token
level but only appear at end-of-stream) render cleanly.

Updated the multi-source test to assert the new pending-then-flush
behavior; locator output now lands at force-flush rather than mid
stream.
@chatgpt-codex-connector

Copy link
Copy Markdown

Codex usage limits have been reached for code reviews. Please check with the admins of this repo to increase the limits by adding credits.

@danielhanchen

Copy link
Copy Markdown
Member Author

Round 4 audit caught the P1 the codex bot flagged on _rewrite_citation_markers_partial. Pushed cd97e29.

The previous logic marked a marker as resolved when at least one token resolved -- the intuition being unresolved tokens are typically locator suffixes (L8-L13, page numbers, etc.) rather than pending source ids. That's right for source+locator markers but wrong for multi-source markers whose annotations arrive across separate annotation.added events: the second source id was dropped permanently because the caller no longer buffered that segment for retry. The streamed inline citation ended up incomplete.

Fix: any unresolved token now flags the whole marker pending so the streamer holds the segment until either all ids resolve or end-of-stream force-flush. End-of-stream still drops unresolved tokens via _replace_openai_citation_markers, which is also the path locator suffixes hit, so locator output renders cleanly at force-flush instead of mid-stream. Updated test_partial_multi_source_partial_resolution to assert the deferred-then-flushed behavior.

Independent round-4 read of the rest of the stream loop (split-tail buffer, pending segments FIFO, _drain_pending_segments, _flush_pending_marker_tail at completed/incomplete/[DONE]) found no further bugs. 44/44 citation marker tests pass.

@chatgpt-codex-connector

Copy link
Copy Markdown

Codex usage limits have been reached for code reviews. Please check with the admins of this repo to increase the limits by adding credits.

@danielhanchen

Copy link
Copy Markdown
Member Author

Round 4 cross-cutting fix: merged origin/main into this branch (no conflicts) to bring in PR #5735 (orphan tool_call XML strip widening + 263-line test_tool_xml_strip.py). All 8 PRs in this audit cohort had been forked off a pre-#5735 main, so a squash-merge of any of them would have silently reverted the widened _TOOL_XML_RE regex and deleted the dedicated test file. Verified: diff against origin/main now shows zero unintended changes to routes/inference.py and test_tool_xml_strip.py outside the actual PR scope.

@chatgpt-codex-connector

Copy link
Copy Markdown

Codex usage limits have been reached for code reviews. Please check with the admins of this repo to increase the limits by adding credits.

@danielhanchen danielhanchen merged commit 063e1e4 into main May 26, 2026
32 checks passed
@danielhanchen danielhanchen deleted the fix/openai-citation-markers branch May 26, 2026 06:37
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant