[pull] main from openai:main#153
Merged
Merged
Conversation
Fixes #24186. ## Why When the TUI resumes a thread through the local app-server daemon with a selected workspace, `thread/resume` can hit an already-loaded but idle cached thread. That path previously rejoined the cached `CodexThread`, so cwd/config overrides in `ThreadResumeParams` were ignored and the resumed session kept using the old cwd. ## What changed App-server now treats a loaded-but-idle thread with no subscribers as a cache entry when resume overrides differ: it unloads that cached thread and lets the normal resume path rebuild it with the requested cwd/config. Threads that still have subscribers, or active runtime work, continue to rejoin the existing loaded thread so in-flight state remains observable. The existing thread teardown helper was generalized from archive-specific cleanup to shared unload cleanup for this path.
## Why When Codex calls responsesapi, we currently send `session_id`, `thread_id`, and `turn_id` among other things as `client_metadata["x-codex-turn-metadata"]`. This PR adds `forked_from_thread_id` which helps explain the "lineage" of a forked thread. ## What's changed - Track the immediate history source copied into a forked thread through thread/session creation, including subagent and review turn metadata paths. - Include `forked_from_thread_id` in Codex turn metadata while preventing turn-scoped Responses API client metadata from overwriting Codex-owned lineage fields. - Add coverage for fork lineage in turn metadata and the app-server Responses API request path.
only allow `Direct` callers of the standalone websearch tool because its not supported in codemode
## Summary TUI plugin mention refresh still joined app-server plugin inventory with client-local plugin config, which can diverge once plugin state is owned by the app server. This changes the TUI to mirror the GUI client: `plugin/list` is the autocomplete source, and mention candidates are plugin-level entries filtered to installed, enabled, and not disabled by admin. The TUI no longer reads local plugin config or calls `plugin/read` while refreshing plugin mention candidates. ## API shape and limitations The current app-server API does not expose effective per-session plugin capability summaries for mention autocomplete. As in the GUI, autocomplete now trusts `plugin/list` metadata rather than proving which plugin capabilities are loaded in the active session. That avoids stale client-local reads and the cwd/remote detail gaps in `plugin/read`, but intentionally accepts the same list-level tradeoff as the app: if `plugin/list` reports a remote plugin before its local bundle is materialized, the plugin can still appear as a mention candidate.
Fixes #24249. ## Why Codex already supports discovering marketplaces under both `.agents/plugins/marketplace.json` and `.claude-plugin/marketplace.json`. The Git marketplace auto-upgrade no-op check only looked for the `.agents` layout. That meant an installed `.claude-plugin` marketplace with matching revision metadata still looked absent, so plugin list/startup upgrade work could stage and re-activate the same marketplace again. That matches the failure shape in #24249: the report called out repeated marketplace sync/cache refresh logs and a large recently-touched `.tmp/marketplaces/.staging` directory. This change makes the auto-upgrade path recognize the installed `.claude-plugin` marketplace as already current, which should remove that staging/activation feedback loop. ## What changed `codex-rs/core-plugins/src/marketplace_upgrade.rs` now uses the existing supported marketplace manifest discovery helper when deciding whether an installed Git marketplace is already current. Existing local plugin source validation is unchanged; `source: "./"` still remains invalid. ## Confidence Confidence is high that this fixes the repeated marketplace upgrade path: the old hardcoded layout check was definitely wrong for installed `.claude-plugin` marketplaces, and the reported staging churn points directly at that path. Confidence is not 100% because we do not have a CPU profile or a fully re-run reporter repro. A malformed marketplace entry can still be logged as invalid if another caller repeatedly lists plugins; this PR fixes the staging/upgrade feedback loop that likely made the failure pathological, not every possible source of repeated marketplace resolution.
## Why The Windows sandbox runner still carried the old `SandboxPolicy` compatibility path even though core now computes `PermissionProfile`. That meant Windows command-runner execution could only see the legacy projection, so profile-only filesystem rules such as deny globs were not part of the runner input. ## What Changed - Removed the Windows-local `SandboxPolicy` parser/export and deleted `windows-sandbox-rs/src/policy.rs`. - Changed restricted-token capture/session setup, elevated setup, world-writable audit, read-root grant, and command-runner session APIs to accept `PermissionProfile` plus the profile cwd. - Bumped the elevated command-runner IPC protocol to version 2 because `SpawnRequest` now carries `permission_profile` / `permission_profile_cwd` instead of the legacy `policy_json_or_preset` / `sandbox_policy_cwd` fields. - Updated core exec, unified exec, debug-sandbox, TUI setup/grant flows, and app-server setup to pass the actual effective `PermissionProfile`. - Left regression coverage asserting the old IPC policy fields are absent and the runner serializes tagged `PermissionProfile` JSON. ## Verification - `cargo test -p codex-windows-sandbox` - `cargo test -p codex-core windows_sandbox` - `cargo test -p codex-app-server request_processors::windows_sandbox_processor` - `just fix -p codex-windows-sandbox -p codex-core -p codex-app-server -p codex-cli -p codex-tui` - `just fix -p codex-cli -p codex-tui` - `just fix -p codex-windows-sandbox -p codex-tui` - `rg "\\bSandboxPolicy\\b" codex-rs/windows-sandbox-rs` returned no matches. Note: `cargo test -p codex-cli` was attempted but did not reach crate tests because local disk filled while compiling dependencies (`No space left on device`). The targeted clippy pass compiled the affected CLI/TUI surfaces afterward. --- [//]: # (BEGIN SAPLING FOOTER) Stack created with [Sapling](https://sapling-scm.com). Best reviewed with [ReviewStack](https://reviewstack.dev/openai/codex/pull/23813). * #24108 * __->__ #23813
## Why Remote image submissions currently wrap native `input_image` spans in literal `<image>` and `</image>` text spans. Those extra prompt tokens add structure without providing label or routing information. ## What Changed - Serialize `UserInput::Image` directly as an `input_image` content span. - Preserve named local-image framing and legacy wrapper parsing for labeled attachments and existing histories. - Update existing request-shape expectations for drag-and-drop images, model switching, and compaction. ## Validation - `just test -p codex-protocol` - Focused `codex-core` run covering `drag_drop_image_persists_rollout_request_shape`, `model_change_from_image_to_text_strips_prior_image_content`, and `snapshot_request_shape_pre_turn_compaction_including_incoming_user_message` ## Notes - A broader `just test -p codex-core` run was attempted; the affected tests passed, while the overall run failed in unrelated CLI, MCP, and tooling tests plus a `thread_manager` timeout.
## Why Windows sandbox diagnostics are currently hard to recover from `/feedback` even though they are often the most useful artifact when debugging sandbox behavior. Now that sandbox logging uses daily rolling files, feedback can safely include the current day's sandbox log without uploading the old ever-growing legacy `sandbox.log`. ## What changed - Add a `codex-windows-sandbox` helper that resolves the current daily sandbox log from `codex_home`. - When feedback is submitted with logs enabled on Windows, app-server attaches today's sandbox log if it exists. - Upload the attachment under the stable filename `windows-sandbox.log`, independent of the dated on-disk filename. - Keep existing raw `extra_log_files` behavior unchanged for rollout and desktop log attachments. ## Verification - `cargo fmt -p codex-app-server -p codex-windows-sandbox` - `cargo test -p codex-windows-sandbox current_log_file_path_for_codex_home_uses_sandbox_dir` - `cargo test -p codex-app-server windows_sandbox_log_attachment_uses_current_log` - Manual CLI/TUI `/feedback` test confirmed Sentry received `windows-sandbox.log`.
## Why Older persisted rollouts can contain `input_image.detail` values of `auto` or `low` from before `ImageDetail` was narrowed to `high`/`original`. Current deserialization rejects those values, which can make resume skip later compacted checkpoints and reconstruct an oversized raw suffix before the next compaction attempt. Confirmed Sentry reports fixed by this compatibility path: - [CODEX-1H3F](https://openai.sentry.io/issues/7500642496/) - [CODEX-1H6N](https://openai.sentry.io/issues/7501025347/) - [CODEX-1JDP](https://openai.sentry.io/issues/7504549065/) - [CODEX-1HW6](https://openai.sentry.io/issues/7503407986/) ## Background [#20693](#20693) added image-detail plumbing for app-server `UserInput` so input images could explicitly request `detail: original`. The Slack discussion behind that PR was about ScreenSpot / bridge evals where user input images were resized, while tool output images already had MCP/code-mode ways to request image detail. In review, the intended new API surface was narrowed to `high` and `original`: default to `high`, allow `original` when callers need unchanged image handling, and avoid encouraging new `auto` or `low` usage. That policy still makes sense for newly emitted values. The missing compatibility piece is persisted history. Older rollouts can already contain `auto` and `low`, and resume reconstructs typed history by deserializing those rollout records. Rejecting old values at that boundary causes valid compacted checkpoints to be skipped. This PR restores `auto` and `low` as real variants so old records deserialize and round-trip without being rewritten as `high`, while product paths can continue to default to `high` and avoid emitting `auto` for new behavior. ## What changed - Restored `ImageDetail::Auto` and `ImageDetail::Low` as first-class protocol values. - Preserved `auto`/`low` through rollout deserialization, MCP image metadata, code-mode image output, and schema/type generation. - Kept local image byte handling conservative: only `original` switches to original-resolution loading; `auto`/`low`/`high` continue through the resize-to-fit path while retaining their detail value. - Added regression coverage for enum round-tripping and code-mode `low` detail handling. ## Testing - `just write-app-server-schema` - `just test -p codex-protocol` - `just test -p codex-tools` - `just test -p codex-code-mode` - `just test -p codex-app-server-protocol` - `just test -p codex-core suite::rmcp_client::stdio_image_responses_preserve_original_detail_metadata` - `just test -p codex-core suite::code_mode::code_mode_can_use_mcp_image_result_with_image_helper` - Loaded broken rollouts on local fixed builds, and started/completed new turns. I also attempted `just test -p codex-core`; the local broad run did not finish green: 2559 tests run, 2467 passed, 55 flaky, 91 failed, 1 timed out. The failures were broad timeout/deadline failures across unrelated areas; targeted changed-path core tests above passed.
## Why - Runtime analytics events report `thread_id`, which identifies the individual thread emitting an event - They don't report `session_id`, which identifies the shared session for a root thread and its subagent threads - Emitting both identifiers allows analytics to group related activity ## What Changed - Adds `session_id` to relevant analytics events (thread_initalized, turn, turn_steer, compaction, guardian_review) - Tracks each thread's session ID in the analytics reducer so subsequent thread scoped events emit the same value - Carries the shared session ID through subagent initialization ## Verification - `just test -p codex-analytics` validates event payloads and subagent session grouping. - Focused `codex-app-server` tests validate session IDs for thread, turn, and steer events. - Focused `codex-core` tests validate root and subagent session ID propagation.
## Why `continuation_turn_id` was introduced to distinguish synthetic goal continuation turns for the no-tool continuation suppression heuristic. #20523 removed that heuristic, but left the marker behind. It is still written and cleared without affecting any runtime decision. ## What Changed - Remove `GoalRuntimeState::continuation_turn_id`. - Remove the marker setter/clearer and their now-no-op start, finish, and abort call sites. ## Testing - Not run yet (deferred at request).
add new `parse_tool_input_schema_without_compaction` to bypass the existing compaction/trimming of client-provided tool schemas that are over 4k bytes. we want this for standalone web search to keep field guidance/metadata on certain fields; this keeps us closer to parity with existing hosted tool schema (which didnt go through this 4k byte filter).
## Why When a turn needs a follow-up request after tool output is recorded, Codex can still appear stuck in `Thinking` before the next `/responses` request is opened. The existing local trace showed the last completed response and the absence of a new backend request, but it did not show whether the stall was in tool-router preparation or later request setup. Issue: N/A (internal incident investigation) ## What Changed Added trace spans around the pre-stream tool-router handoff in `core/src/session/turn.rs`, including the `built_tools` phase and the MCP manager read lock. Added per-server MCP tool-listing spans and trace breadcrumbs in `codex-mcp/src/connection_manager.rs` with startup snapshot / startup-complete state so a pending MCP client is visible in feedback logs instead of looking like a silent hang. ## Verification - `just fmt` - `just test -p codex-mcp` - `just test -p codex-core` (prior full rerun fails in this workspace on unrelated integration tests: code-mode output length expectations, one shell timeout formatting assertion, and shell snapshot timeouts; latest review-fix rerun compiled and passed 1160 tests before I stopped the abnormally slow unrelated suite)
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to subscribe to this conversation on GitHub.
Already have an account?
Sign in.
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
See Commits and Changes for more details.
Created by
pull[bot] (v2.0.0-alpha.4)
Can you help keep this open source service alive? 💖 Please sponsor : )
Summary by cubic
Syncs upstream changes improving analytics, sandboxing, and request shapes. Adds session grouping, restores legacy image detail values, attaches Windows sandbox logs to feedback, and fixes resume cwd overrides.
New Features
autoandlow; schemas and TS types updated;image()accepts low; non-original still uses resized loading.windows-sandbox.logto feedback on Windows.Bug Fixes
<image>wrapper spans from user inputs; keep labeled local-image framing.plugin/listand filter by availability; no local config joins..claude-pluginlayout.Written for commit 64e340a. Summary will update on new commits. Review in cubic