Materialize compressed rollouts before append#25088
Conversation
|
@codex review |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 2e9745e52b
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
2e9745e to
831e5e7
Compare
|
@codex review |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 831e5e7e15
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
831e5e7 to
0825bb2
Compare
|
@codex review |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 0825bb2b6d
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| Ok(thread) | ||
| } | ||
|
|
||
| async fn refresh_resolved_rollout_path(resolved: &mut ResolvedRolloutPath) { |
There was a problem hiding this comment.
it's not clear to me what does this function do from the name.
aibrahim-oai
left a comment
There was a problem hiding this comment.
Same feedback as of the first PR. I think we would benefit from clear abstraction here because this part of the code has gotten very complicated.
Function names aren't very clear what they are doing. I can understand once I read carefully, but it's it's hard to grok
66f796c to
acf9e8b
Compare
|
@codex review |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: acf9e8b600
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| match std::fs::remove_file(compressed_path.as_path()) { | ||
| Ok(()) => {} | ||
| Err(err) if err.kind() == io::ErrorKind::NotFound => {} | ||
| Err(err) => return Err(err), |
There was a problem hiding this comment.
Don't fail append when compressed cleanup is blocked
When materializing succeeds but removing the old .jsonl.zst fails, the append/resume path returns an error before opening the newly created plain rollout. This can happen on Windows or with a concurrent scanner/reader holding the compressed file open; the rest of this module already prefers the plain sibling and hides compressed siblings, so a cleanup-only failure leaves a usable .jsonl but still makes metadata updates or resume fail unnecessarily.
Useful? React with 👍 / 👎.
| let counter = TEMP_COUNTER.fetch_add(1, Ordering::Relaxed); | ||
| file_name.push(format!( | ||
| ".{operation}.{}.{counter}{TEMP_SUFFIX}", | ||
| std::process::id() | ||
| )); |
There was a problem hiding this comment.
Use collision-resistant materialization temp names
The decompression temp name is deterministic from the operation, process id, and a counter that resets on every process start. If Codex crashes after creating the temp file but before cleanup, a later process that gets the same PID and tries to append/resume the same compressed rollout starts at counter 0 again, so create_new hits the stale temp path and the rollout cannot be materialized until the temp file is manually removed; using a random NamedTempFile-style name or retrying on AlreadyExists avoids this crash-recovery failure.
Useful? React with 👍 / 👎.
Rollout compression stack
This stack splits #24941 into reviewable steps for local rollout compression. The design is intentionally staged:
local_thread_store_compression.The key invariant is that writers append to plain
.jsonl. A.jsonl.zstfile is a cold/read representation; if a write is needed, the compressed file is materialized back to plain JSONL first. Readers prefer plain.jsonlwhen both forms exist and can fall back to the compressed sibling during transitions.The worker is deliberately the last PR and remains behind an under-development feature flag. It currently scans only
archived_sessions, not activesessions, because active sessions have the highest resume/append race risk. That means this stack does not yet compress most unarchived local history.Known race / follow-up
The remaining unresolved design question is writer/compressor coordination. Even for archived rollouts, a resume or metadata update can append while the worker is replacing the plain file with
.jsonl.zst; the current double-stat checks narrow but do not fully eliminate the window where a writer has opened the plain file before unlink. Do not treat the worker PR as production-ready until we either:The first two PRs are useful independently: they make compressed rollouts readable and make append paths safely recover back to plain JSONL. The third PR isolates the worker behavior so that coordination issue is reviewable separately.
Validation
Focused local validation for the stack includes:
just test -p codex-rolloutjust test -p codex-thread-storewhere thread-store paths were touchedjust test -p codex-featuresfor the feature flag slicejust bazel-lock-checkafter dependency graph changesjust fix -p ...passes for changed cratesCI is still the source of truth for the full platform matrix.
This PR in the stack
This is PR 2/3, based on #25087. It adds the write-side transition: recorder resume, blocking append opens, and raw append helpers materialize
.jsonl.zstback to plain.jsonlbefore writing. It also preserves permissions and uses a no-clobber publish path so fallback materialization does not expose a partially copied plain rollout.Stack order: