Studio: rewrite OpenAI Responses citation markers to markdown links#5713
Conversation
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 42342598f0
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| if any(c["url"] == url for c in all_url_citations): | ||
| # Backfill source_id on the existing entry if | ||
| # the first annotation event omitted it. | ||
| if source_id: | ||
| for c in all_url_citations: | ||
| if c["url"] == url and not c.get("source_id"): | ||
| c["source_id"] = source_id | ||
| break | ||
| return |
There was a problem hiding this comment.
Preserve source-id aliases when deduplicating citations
When url_citation events repeat the same url with a different source_id (for example, multiple inline markers for different spans/locators on one page), this branch returns early and keeps only the first source_id. _replace_openai_citation_markers then cannot resolve markers for later IDs and strips them, so valid inline citations disappear from the streamed text even though the URL is known. Please retain additional source_id aliases for an existing URL instead of dropping them on dedup.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
Code Review
This pull request introduces logic to process OpenAI's inline citation markers, converting them from private-use codepoints into markdown links. Key changes include a new regex-based replacement function, updated citation recording to capture and backfill source IDs, and reordering event processing to ensure annotations are available before text deltas. A new test suite is included to verify these transformations. Feedback was provided to optimize the citation recording logic by combining redundant list iterations into a single pass.
| # the first annotation event omitted it. | ||
| if source_id: | ||
| for c in all_url_citations: | ||
| if c["url"] == url and not c.get("source_id"): | ||
| c["source_id"] = source_id | ||
| break | ||
| return | ||
| title = payload.get("title") or url | ||
| snippet = payload.get("snippet") or payload.get("quote") or "" |
There was a problem hiding this comment.
The current logic performs two separate iterations over all_url_citations: first with any() to check for existence, and then with a for loop to perform the backfill. These can be combined into a single pass for better efficiency and readability.
for c in all_url_citations:
if c["url"] == url:
if source_id and not c.get("source_id"):
c["source_id"] = source_id
returnReferences
- To improve efficiency, avoid redundant data iterations. Combine checks and transformations into a single loop and return computed values for callers to reuse.
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: d36be02beb
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| # on this stream, or drop it silently when the source_id is unknown. | ||
| # Group 1 is the source id; an optional `<locator>` suffix is | ||
| # discarded (we link the whole source rather than per-line locators). | ||
| _OPENAI_CITATION_MARKER = re.compile("cite([^]+)(?:[^]*)?") |
There was a problem hiding this comment.
Preserve all source IDs in multi-source cite markers
OpenAI citation markers can encode multiple supporting source IDs in one marker (e.g. ...sourceAsourceB...), but this regex captures only the first ID and treats every later ... segment as a locator suffix. In streamed answers that use multi-source markers, the rewriter will emit a single link and silently drop the remaining supports, so citations shown inline become incomplete/misleading even though url_citation annotations are available.
Useful? React with 👍 / 👎.
OpenAI's /v1/responses stream interleaves text deltas with inline citation markers built from private-use codepoints (U+E200 / U+E201 / U+E202) shaped like `citeSOURCE_ID`. The codepoints render as garbled "E202" glyphs or empty boxes in most fonts, and the markdown layer further strips them, leaving run-on text like "citeturn1view0turn1view1turn3view0...". The url list still arrived in the Sources panel via url_citation annotations, but the inline cite hand-off into the prose was unreadable. Rewrite each marker into `[N](URL)` when the matching url_citation has already been recorded on this stream, and drop the marker silently otherwise. The lookup uses a new `source_id` field captured on `_record_url_citation` (accepts source_id / id / locator across Responses API revisions). Annotations are now applied BEFORE the delta text is rewritten so that markers and their resolving annotation arriving in the same SSE event still resolve. Reference: https://developers.openai.com/api/docs/guides/citation-formatting
for more information, see https://pre-commit.ci
OpenAI's Responses stream cites the same URL under multiple source_id markers when the model references different spans of the same page. The previous dedup-by-URL kept only the first alias and dropped the rest, so subsequent markers for the same URL never resolved and got stripped from the prose. Switch the citation record to a ``source_ids`` list and append new aliases on every duplicate. The rewriter resolves any alias back to the same citation number so the inline markers all collapse onto one footnote rather than fanning out into bogus repeats. Also collapse the two passes over ``all_url_citations`` in ``_record_url_citation`` into a single loop for clarity. Adds two regression tests covering the alias-collision and mixed-shape cases.
d36be02 to
7dd993e
Compare
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 7dd993e83d
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| delta_text = _replace_openai_citation_markers( | ||
| delta_text, all_url_citations | ||
| ) |
There was a problem hiding this comment.
Buffer partial citation markers across streamed deltas
_replace_openai_citation_markers is invoked on each response.output_text.delta chunk independently, but citation markers are token-level text and can be split across consecutive deltas (e.g., first chunk ends with ````). In that case no full regex match exists in either chunk, so private-use marker bytes leak into output or the citation is lost instead of becoming a link. This regresses the stated goal for streamed responses whenever marker boundaries do not align with delta boundaries; keep a carryover buffer between deltas and only rewrite once a full marker is available.
Useful? React with 👍 / 👎.
The original rewriter only handles markers that arrive whole inside a single response.output_text.delta event. OpenAI's stream chunks text on byte-buffer boundaries with no awareness of the marker grammar, so a marker can straddle two deltas (delta-1 ends with "citetu", delta-2 starts with "rn0view0"). Each delta was rewritten in isolation, so the half-marker leaked as garbled "E200/E202" glyphs in the rendered prose. Buffer the unterminated tail across deltas and concatenate it onto the front of the next one so the rewriter sees a complete marker. Flush the held-over tail on response.completed / response.incomplete / [DONE], stripping any leftover private-use bytes so a never-closed marker (truncated stream, missing annotation) never leaks. Also handle the multi-source marker shape from the OpenAI docs -- citeid1id2 should expand to one bracket link per resolvable id. The previous regex captured only the first source id and silently dropped id2/id3. Reference: https://developers.openai.com/api/docs/guides/citation-formatting Tests: 21 new cases covering multi-source, locator suffix, marker split across two and three deltas, unterminated marker on truncation, late annotation resolving a buffered marker, idempotency, and the head/tail split helper directly.
for more information, see https://pre-commit.ci
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 674e9b8dca
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
| for ch in ("", "", ""): | ||
| rendered = rendered.replace(ch, "") |
There was a problem hiding this comment.
Strip actual citation codepoints in tail flush
The tail-flush sanitizer currently loops over ("", "", ""), so rendered.replace(ch, "") is a no-op and never removes OpenAI’s private-use marker bytes. When a stream ends with an unterminated citation marker (e.g., [DONE], response.completed, or response.incomplete while pending_marker_tail is non-empty), _replace_openai_citation_markers cannot rewrite it and these glyph bytes leak into user-visible output—the exact regression this path is meant to prevent.
Useful? React with 👍 / 👎.
The split-marker buffer already concatenates a marker that straddles two response.output_text.delta events. But when the annotation event for a url_citation arrives AFTER the delta that contains its inline marker (the typical OpenAI Responses ordering), the rewriter still saw an empty lookup table at delta time and silently stripped the marker. The URL kept showing up in the sources panel but the inline link reference was permanently gone. Add _rewrite_citation_markers_partial which leaves an unresolved marker verbatim and reports has_unresolved=True. The streaming loop buffers any closed segment that contains an unresolved marker into a pending_citation_segments FIFO and drains the queue on every later annotation event, on response.completed, on response.incomplete, and on the [DONE] sentinel. Drain order is preserved so later clean text does not leapfrog an earlier deferred segment. End-of-stream forces a strip so no codepoint leaks if the annotation never arrived. Add six regression tests covering single-pass resolution, the late- annotation two-pass case, multi-source markers with partial resolution, mixed known and pending markers in one segment, and idempotency on marker-free input.
for more information, see https://pre-commit.ci
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: ec5e529184
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
| if not rendered: | ||
| has_unresolved = True | ||
| return match.group(0) |
There was a problem hiding this comment.
Keep multi-source markers pending until all IDs resolve
_rewrite_citation_markers_partial only marks a marker as unresolved when no token resolves; if one source ID resolves and another is still pending, it emits the resolved link(s), drops the unresolved token(s), and returns unresolved=False. In streamed Responses events where a multi-source marker’s annotations arrive across separate annotation.added events, this causes later source IDs to be permanently lost because the caller no longer buffers that segment for a retry, so inline citations become incomplete/misleading despite eventually receiving all citation metadata.
Useful? React with 👍 / 👎.
`_flush_pending_marker_tail` stripped the three private-use citation codepoints from the held-over buffer, but left the literal ``cite`` keyword plus the source id behind as plain text. A stream ending mid-marker therefore emitted user-visible garbage like ``Some text citeturn0view0`` instead of the intended clean prose. ``pending_marker_tail`` is by construction the suffix that starts at an unclosed ``\\ue200`` opener -- the split helper guarantees there is no closing ``\\ue201`` byte. Without that close the marker is meaningless: the source id cannot be resolved to a URL and the user prose before the opener was already emitted as ``head`` on the originating delta. Bail out before the strip step and return the empty string. As a belt-and-braces measure also drop any orphan ``cite<sid>`` literal at the head of the buffer in case a future caller passes a partially-terminated tail. Update the matching ``_simulate_delta_stream`` harness in the edge tests so it mirrors the new flush logic, and add four regression tests covering unterminated marker with surrounding prose, marker- only inputs, prefix-only outputs, and the split-then-close path that still must resolve to a link.
for more information, see https://pre-commit.ci
`_rewrite_citation_markers_partial` previously treated a marker as resolved when even one token in a multi-source marker resolved, dropping any still-pending source ids. In streamed Responses events the annotations for a multi-source marker can arrive across separate `annotation.added` chunks, so the caller no longer buffered that segment for retry and the late source id was lost from the inline citation entirely. Flag the marker unresolved whenever any token misses the lookup so the streamer keeps the segment pending. End-of-stream force flush still drops unresolved tokens through `_replace_openai_citation_markers` so locator-style suffixes (which look like unresolved ids at the token level but only appear at end-of-stream) render cleanly. Updated the multi-source test to assert the new pending-then-flush behavior; locator output now lands at force-flush rather than mid stream.
|
Codex usage limits have been reached for code reviews. Please check with the admins of this repo to increase the limits by adding credits. |
|
Round 4 audit caught the P1 the codex bot flagged on _rewrite_citation_markers_partial. Pushed cd97e29. The previous logic marked a marker as resolved when at least one token resolved -- the intuition being unresolved tokens are typically locator suffixes (L8-L13, page numbers, etc.) rather than pending source ids. That's right for source+locator markers but wrong for multi-source markers whose annotations arrive across separate annotation.added events: the second source id was dropped permanently because the caller no longer buffered that segment for retry. The streamed inline citation ended up incomplete. Fix: any unresolved token now flags the whole marker pending so the streamer holds the segment until either all ids resolve or end-of-stream force-flush. End-of-stream still drops unresolved tokens via _replace_openai_citation_markers, which is also the path locator suffixes hit, so locator output renders cleanly at force-flush instead of mid-stream. Updated test_partial_multi_source_partial_resolution to assert the deferred-then-flushed behavior. Independent round-4 read of the rest of the stream loop (split-tail buffer, pending segments FIFO, _drain_pending_segments, _flush_pending_marker_tail at completed/incomplete/[DONE]) found no further bugs. 44/44 citation marker tests pass. |
|
Codex usage limits have been reached for code reviews. Please check with the admins of this repo to increase the limits by adding credits. |
|
Round 4 cross-cutting fix: merged origin/main into this branch (no conflicts) to bring in PR #5735 (orphan tool_call XML strip widening + 263-line test_tool_xml_strip.py). All 8 PRs in this audit cohort had been forked off a pre-#5735 main, so a squash-merge of any of them would have silently reverted the widened _TOOL_XML_RE regex and deleted the dedicated test file. Verified: diff against origin/main now shows zero unintended changes to routes/inference.py and test_tool_xml_strip.py outside the actual PR scope. |
|
Codex usage limits have been reached for code reviews. Please check with the admins of this repo to increase the limits by adding credits. |
Summary
OpenAI's
/v1/responsesstream interleaves text deltas with inline citation markers built from private-use codepoints (U+E200 / U+E201 / U+E202) shaped likeciteSOURCE_ID. Most fonts render the codepoints as garbled "E202" glyphs or empty boxes, and the markdown layer further strips them, leaving run-on text likeciteturn1view0turn1view1turn3view0turn3view1turn4view0...smeared across the assistant message. The Sources panel still received the URLs (viaurl_citationannotations) but the inline hand-off into the prose was unreadable.This PR rewrites each marker into a clickable
[N](URL)markdown link when the matchingurl_citationannotation has already been recorded on this stream, and drops the marker silently otherwise. The lookup uses a newsource_idfield captured on_record_url_citation(acceptssource_id/id/locatoracross Responses API revisions). Annotations are now processed BEFORE the delta text is rewritten so that markers and their resolving annotation arriving in the same SSE event still resolve.Reference: https://developers.openai.com/api/docs/guides/citation-formatting
Repro before this PR
gpt-5.xResponses-family model and enable web_search.citeturn1view0turn1view1turn3view0...instead of citations.Behavior after this PR
The same query renders inline citations as
[[1]](https://...) [[2]](https://...)links resolving to the same URLs the Sources panel already showed.Test plan
cd studio && python -m pytest backend/tests/test_openai_citation_markers.py -v(11 new tests, all passing)cd studio && python -m pytest backend/tests/test_openai_responses_translation.py -v(no regression)E202glyph ever leaks through.