Studio: rewrite OpenAI Responses citation markers to markdown links by danielhanchen · Pull Request #5713 · unslothai/unsloth

danielhanchen · 2026-05-22T15:12:55Z

Summary

OpenAI's /v1/responses stream interleaves text deltas with inline citation markers built from private-use codepoints (U+E200 / U+E201 / U+E202) shaped like citeSOURCE_ID. Most fonts render the codepoints as garbled "E202" glyphs or empty boxes, and the markdown layer further strips them, leaving run-on text like citeturn1view0turn1view1turn3view0turn3view1turn4view0... smeared across the assistant message. The Sources panel still received the URLs (via url_citation annotations) but the inline hand-off into the prose was unreadable.

This PR rewrites each marker into a clickable [N](URL) markdown link when the matching url_citation annotation has already been recorded on this stream, and drops the marker silently otherwise. The lookup uses a new source_id field captured on _record_url_citation (accepts source_id / id / locator across Responses API revisions). Annotations are now processed BEFORE the delta text is rewritten so that markers and their resolving annotation arriving in the same SSE event still resolve.

Reference: https://developers.openai.com/api/docs/guides/citation-formatting

Repro before this PR

Connect an OpenAI cloud provider.
Pick a gpt-5.x Responses-family model and enable web_search.
Ask something that pulls multiple sources, e.g. "Find what's the highest grossing film in the last month".
The streamed reply contains citeturn1view0turn1view1turn3view0... instead of citations.

Behavior after this PR

The same query renders inline citations as [[1]](https://...) [[2]](https://...) links resolving to the same URLs the Sources panel already showed.

Test plan

cd studio && python -m pytest backend/tests/test_openai_citation_markers.py -v (11 new tests, all passing)
cd studio && python -m pytest backend/tests/test_openai_responses_translation.py -v (no regression)
Idempotency: text containing no private-use codepoints is returned verbatim, so accidental double-rewriting is safe.
Unknown source_ids: marker is silently stripped, surrounding prose preserved, no E202 glyph ever leaks through.

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 42342598f0

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

chatgpt-codex-connector · 2026-05-22T15:15:13Z

                        if any(c["url"] == url for c in all_url_citations):
+                            # Backfill source_id on the existing entry if
+                            # the first annotation event omitted it.
+                            if source_id:
+                                for c in all_url_citations:
+                                    if c["url"] == url and not c.get("source_id"):
+                                        c["source_id"] = source_id
+                                        break
                            return


Preserve source-id aliases when deduplicating citations

When url_citation events repeat the same url with a different source_id (for example, multiple inline markers for different spans/locators on one page), this branch returns early and keeps only the first source_id. _replace_openai_citation_markers then cannot resolve markers for later IDs and strips them, so valid inline citations disappear from the streamed text even though the URL is known. Please retain additional source_id aliases for an existing URL instead of dropping them on dedup.

Useful? React with 👍 / 👎.

gemini-code-assist

Code Review

This pull request introduces logic to process OpenAI's inline citation markers, converting them from private-use codepoints into markdown links. Key changes include a new regex-based replacement function, updated citation recording to capture and backfill source IDs, and reordering event processing to ensure annotations are available before text deltas. A new test suite is included to verify these transformations. Feedback was provided to optimize the citation recording logic by combining redundant list iterations into a single pass.

gemini-code-assist · 2026-05-22T15:18:27Z

+                            # the first annotation event omitted it.
+                            if source_id:
+                                for c in all_url_citations:
+                                    if c["url"] == url and not c.get("source_id"):
+                                        c["source_id"] = source_id
+                                        break
                            return
                        title = payload.get("title") or url
                        snippet = payload.get("snippet") or payload.get("quote") or ""


The current logic performs two separate iterations over all_url_citations: first with any() to check for existence, and then with a for loop to perform the backfill. These can be combined into a single pass for better efficiency and readability.

for c in all_url_citations: if c["url"] == url: if source_id and not c.get("source_id"): c["source_id"] = source_id return

References

To improve efficiency, avoid redundant data iterations. Combine checks and transformations into a single loop and return computed values for callers to reuse.

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: d36be02beb

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

chatgpt-codex-connector · 2026-05-22T15:46:10Z

+# on this stream, or drop it silently when the source_id is unknown.
+# Group 1 is the source id; an optional `<locator>` suffix is
+# discarded (we link the whole source rather than per-line locators).
+_OPENAI_CITATION_MARKER = re.compile("cite([^]+)(?:[^]*)?")


Preserve all source IDs in multi-source cite markers

OpenAI citation markers can encode multiple supporting source IDs in one marker (e.g. ...sourceAsourceB...), but this regex captures only the first ID and treats every later ... segment as a locator suffix. In streamed answers that use multi-source markers, the rewriter will emit a single link and silently drop the remaining supports, so citations shown inline become incomplete/misleading even though url_citation annotations are available.

Useful? React with 👍 / 👎.

OpenAI's /v1/responses stream interleaves text deltas with inline citation markers built from private-use codepoints (U+E200 / U+E201 / U+E202) shaped like `citeSOURCE_ID`. The codepoints render as garbled "E202" glyphs or empty boxes in most fonts, and the markdown layer further strips them, leaving run-on text like "citeturn1view0turn1view1turn3view0...". The url list still arrived in the Sources panel via url_citation annotations, but the inline cite hand-off into the prose was unreadable. Rewrite each marker into `[N](URL)` when the matching url_citation has already been recorded on this stream, and drop the marker silently otherwise. The lookup uses a new `source_id` field captured on `_record_url_citation` (accepts source_id / id / locator across Responses API revisions). Annotations are now applied BEFORE the delta text is rewritten so that markers and their resolving annotation arriving in the same SSE event still resolve. Reference: https://developers.openai.com/api/docs/guides/citation-formatting

for more information, see https://pre-commit.ci

OpenAI's Responses stream cites the same URL under multiple source_id markers when the model references different spans of the same page. The previous dedup-by-URL kept only the first alias and dropped the rest, so subsequent markers for the same URL never resolved and got stripped from the prose. Switch the citation record to a ``source_ids`` list and append new aliases on every duplicate. The rewriter resolves any alias back to the same citation number so the inline markers all collapse onto one footnote rather than fanning out into bogus repeats. Also collapse the two passes over ``all_url_citations`` in ``_record_url_citation`` into a single loop for clarity. Adds two regression tests covering the alias-collision and mixed-shape cases.

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 7dd993e83d

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

chatgpt-codex-connector · 2026-05-23T14:03:16Z

+                                    delta_text = _replace_openai_citation_markers(
+                                        delta_text, all_url_citations
+                                    )


Buffer partial citation markers across streamed deltas

_replace_openai_citation_markers is invoked on each response.output_text.delta chunk independently, but citation markers are token-level text and can be split across consecutive deltas (e.g., first chunk ends with ````). In that case no full regex match exists in either chunk, so private-use marker bytes leak into output or the citation is lost instead of becoming a link. This regresses the stated goal for streamed responses whenever marker boundaries do not align with delta boundaries; keep a carryover buffer between deltas and only rewrite once a full marker is available.

Useful? React with 👍 / 👎.

#5741 already)

The original rewriter only handles markers that arrive whole inside a single response.output_text.delta event. OpenAI's stream chunks text on byte-buffer boundaries with no awareness of the marker grammar, so a marker can straddle two deltas (delta-1 ends with "citetu", delta-2 starts with "rn0view0"). Each delta was rewritten in isolation, so the half-marker leaked as garbled "E200/E202" glyphs in the rendered prose. Buffer the unterminated tail across deltas and concatenate it onto the front of the next one so the rewriter sees a complete marker. Flush the held-over tail on response.completed / response.incomplete / [DONE], stripping any leftover private-use bytes so a never-closed marker (truncated stream, missing annotation) never leaks. Also handle the multi-source marker shape from the OpenAI docs -- citeid1id2 should expand to one bracket link per resolvable id. The previous regex captured only the first source id and silently dropped id2/id3. Reference: https://developers.openai.com/api/docs/guides/citation-formatting Tests: 21 new cases covering multi-source, locator suffix, marker split across two and three deltas, unterminated marker on truncation, late annotation resolving a buffered marker, idempotency, and the head/tail split helper directly.

for more information, see https://pre-commit.ci

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 674e9b8dca

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

chatgpt-codex-connector · 2026-05-24T14:01:38Z

+                        for ch in ("", "", ""):
+                            rendered = rendered.replace(ch, "")


Strip actual citation codepoints in tail flush

The tail-flush sanitizer currently loops over ("", "", ""), so rendered.replace(ch, "") is a no-op and never removes OpenAI’s private-use marker bytes. When a stream ends with an unterminated citation marker (e.g., [DONE], response.completed, or response.incomplete while pending_marker_tail is non-empty), _replace_openai_citation_markers cannot rewrite it and these glyph bytes leak into user-visible output—the exact regression this path is meant to prevent.

Useful? React with 👍 / 👎.

The split-marker buffer already concatenates a marker that straddles two response.output_text.delta events. But when the annotation event for a url_citation arrives AFTER the delta that contains its inline marker (the typical OpenAI Responses ordering), the rewriter still saw an empty lookup table at delta time and silently stripped the marker. The URL kept showing up in the sources panel but the inline link reference was permanently gone. Add _rewrite_citation_markers_partial which leaves an unresolved marker verbatim and reports has_unresolved=True. The streaming loop buffers any closed segment that contains an unresolved marker into a pending_citation_segments FIFO and drains the queue on every later annotation event, on response.completed, on response.incomplete, and on the [DONE] sentinel. Drain order is preserved so later clean text does not leapfrog an earlier deferred segment. End-of-stream forces a strip so no codepoint leaks if the annotation never arrived. Add six regression tests covering single-pass resolution, the late- annotation two-pass case, multi-source markers with partial resolution, mixed known and pending markers in one segment, and idempotency on marker-free input.

for more information, see https://pre-commit.ci

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: ec5e529184

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

chatgpt-codex-connector · 2026-05-25T13:36:04Z

+        if not rendered:
+            has_unresolved = True
+            return match.group(0)


Keep multi-source markers pending until all IDs resolve

_rewrite_citation_markers_partial only marks a marker as unresolved when no token resolves; if one source ID resolves and another is still pending, it emits the resolved link(s), drops the unresolved token(s), and returns unresolved=False. In streamed Responses events where a multi-source marker’s annotations arrive across separate annotation.added events, this causes later source IDs to be permanently lost because the caller no longer buffers that segment for a retry, so inline citations become incomplete/misleading despite eventually receiving all citation metadata.

Useful? React with 👍 / 👎.

`_flush_pending_marker_tail` stripped the three private-use citation codepoints from the held-over buffer, but left the literal ``cite`` keyword plus the source id behind as plain text. A stream ending mid-marker therefore emitted user-visible garbage like ``Some text citeturn0view0`` instead of the intended clean prose. ``pending_marker_tail`` is by construction the suffix that starts at an unclosed ``\\ue200`` opener -- the split helper guarantees there is no closing ``\\ue201`` byte. Without that close the marker is meaningless: the source id cannot be resolved to a URL and the user prose before the opener was already emitted as ``head`` on the originating delta. Bail out before the strip step and return the empty string. As a belt-and-braces measure also drop any orphan ``cite<sid>`` literal at the head of the buffer in case a future caller passes a partially-terminated tail. Update the matching ``_simulate_delta_stream`` harness in the edge tests so it mirrors the new flush logic, and add four regression tests covering unterminated marker with surrounding prose, marker- only inputs, prefix-only outputs, and the split-then-close path that still must resolve to a link.

for more information, see https://pre-commit.ci

`_rewrite_citation_markers_partial` previously treated a marker as resolved when even one token in a multi-source marker resolved, dropping any still-pending source ids. In streamed Responses events the annotations for a multi-source marker can arrive across separate `annotation.added` chunks, so the caller no longer buffered that segment for retry and the late source id was lost from the inline citation entirely. Flag the marker unresolved whenever any token misses the lookup so the streamer keeps the segment pending. End-of-stream force flush still drops unresolved tokens through `_replace_openai_citation_markers` so locator-style suffixes (which look like unresolved ids at the token level but only appear at end-of-stream) render cleanly. Updated the multi-source test to assert the new pending-then-flush behavior; locator output now lands at force-flush rather than mid stream.

chatgpt-codex-connector · 2026-05-26T00:49:59Z

Codex usage limits have been reached for code reviews. Please check with the admins of this repo to increase the limits by adding credits.

danielhanchen · 2026-05-26T01:04:30Z

Round 4 audit caught the P1 the codex bot flagged on _rewrite_citation_markers_partial. Pushed cd97e29.

The previous logic marked a marker as resolved when at least one token resolved -- the intuition being unresolved tokens are typically locator suffixes (L8-L13, page numbers, etc.) rather than pending source ids. That's right for source+locator markers but wrong for multi-source markers whose annotations arrive across separate annotation.added events: the second source id was dropped permanently because the caller no longer buffered that segment for retry. The streamed inline citation ended up incomplete.

Fix: any unresolved token now flags the whole marker pending so the streamer holds the segment until either all ids resolve or end-of-stream force-flush. End-of-stream still drops unresolved tokens via _replace_openai_citation_markers, which is also the path locator suffixes hit, so locator output renders cleanly at force-flush instead of mid-stream. Updated test_partial_multi_source_partial_resolution to assert the deferred-then-flushed behavior.

Independent round-4 read of the rest of the stream loop (split-tail buffer, pending segments FIFO, _drain_pending_segments, _flush_pending_marker_tail at completed/incomplete/[DONE]) found no further bugs. 44/44 citation marker tests pass.

…arkers

chatgpt-codex-connector · 2026-05-26T01:09:24Z

Codex usage limits have been reached for code reviews. Please check with the admins of this repo to increase the limits by adding credits.

danielhanchen · 2026-05-26T01:10:21Z

Round 4 cross-cutting fix: merged origin/main into this branch (no conflicts) to bring in PR #5735 (orphan tool_call XML strip widening + 263-line test_tool_xml_strip.py). All 8 PRs in this audit cohort had been forked off a pre-#5735 main, so a squash-merge of any of them would have silently reverted the widened _TOOL_XML_RE regex and deleted the dedicated test file. Verified: diff against origin/main now shows zero unintended changes to routes/inference.py and test_tool_xml_strip.py outside the actual PR scope.

chatgpt-codex-connector · 2026-05-26T05:55:16Z

Codex usage limits have been reached for code reviews. Please check with the admins of this repo to increase the limits by adding credits.

danielhanchen requested a review from rolandtannous as a code owner May 22, 2026 15:12

chatgpt-codex-connector Bot reviewed May 22, 2026

View reviewed changes

gemini-code-assist Bot reviewed May 22, 2026

View reviewed changes

chatgpt-codex-connector Bot reviewed May 22, 2026

View reviewed changes

danielhanchen and others added 3 commits May 23, 2026 14:00

[pre-commit.ci] auto fixes from pre-commit.com hooks

6fa0f79

for more information, see https://pre-commit.ci

danielhanchen force-pushed the fix/openai-citation-markers branch from d36be02 to 7dd993e Compare May 23, 2026 14:00

chatgpt-codex-connector Bot reviewed May 23, 2026

View reviewed changes

danielhanchen and others added 4 commits May 23, 2026 16:28

ci: re-trigger after flake in Studio GGUF Tool calling (rebased on main

a2416fc

#5741 already)

ci: re-run after transient CodeQL Python checkout auth flake

ab451c2

[pre-commit.ci] auto fixes from pre-commit.com hooks

674e9b8

for more information, see https://pre-commit.ci

chatgpt-codex-connector Bot reviewed May 24, 2026

View reviewed changes

danielhanchen and others added 2 commits May 25, 2026 13:30

[pre-commit.ci] auto fixes from pre-commit.com hooks

ec5e529

for more information, see https://pre-commit.ci

chatgpt-codex-connector Bot reviewed May 25, 2026

View reviewed changes

danielhanchen and others added 3 commits May 25, 2026 13:41

[pre-commit.ci] auto fixes from pre-commit.com hooks

dea5090

for more information, see https://pre-commit.ci

Merge remote-tracking branch 'origin/main' into fix/openai-citation-m…

9ecc154

…arkers

Shorten citation marker comments for PR #5713

431c085

danielhanchen merged commit 063e1e4 into main May 26, 2026
32 checks passed

danielhanchen deleted the fix/openai-citation-markers branch May 26, 2026 06:37

danielhanchen mentioned this pull request May 26, 2026

Studio: agentic web_search action variants + image overlay polish (OpenAI) #5787

Open

4 tasks

Uh oh!

Conversation

danielhanchen commented May 22, 2026

Summary

Repro before this PR

Behavior after this PR

Test plan

Uh oh!

chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

💡 Codex Review

Uh oh!

chatgpt-codex-connector Bot May 22, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

gemini-code-assist Bot May 22, 2026

Choose a reason for hiding this comment

Uh oh!

chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

💡 Codex Review

Uh oh!

chatgpt-codex-connector Bot May 22, 2026

Choose a reason for hiding this comment

Uh oh!

chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

💡 Codex Review

Uh oh!

chatgpt-codex-connector Bot May 23, 2026

Choose a reason for hiding this comment

Uh oh!

chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

💡 Codex Review

Uh oh!

chatgpt-codex-connector Bot May 24, 2026

Choose a reason for hiding this comment

Uh oh!

chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

💡 Codex Review

Uh oh!

chatgpt-codex-connector Bot May 25, 2026

Choose a reason for hiding this comment

Uh oh!

chatgpt-codex-connector Bot commented May 26, 2026

Uh oh!

danielhanchen commented May 26, 2026

Uh oh!

chatgpt-codex-connector Bot commented May 26, 2026

Uh oh!

danielhanchen commented May 26, 2026

Uh oh!

chatgpt-codex-connector Bot commented May 26, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant