Skip to content

Rework simulation pipeline for adaptive concurrency and connection resiliency. #648

Merged
hoytak merged 11 commits intomainfrom
hoytak/260129-simulation-package
Mar 9, 2026
Merged

Rework simulation pipeline for adaptive concurrency and connection resiliency. #648
hoytak merged 11 commits intomainfrom
hoytak/260129-simulation-package

Conversation

@hoytak
Copy link
Collaborator

@hoytak hoytak commented Feb 11, 2026

This PR replaces the previous collection of scripts around setting up docker containers with a much more nimble and lightweight set of rust scripts and a simple, reusable proxy that can limit bandwidth and congestion simulations. The previous scripts are rewritten to be more nimble and use more reusable components.

New tools:

  • cas_client/src/simulation/network_simulation: A lightweight, in-process network congestion simulation proxy that lives between the LocalServer instance and the RemoteClient instance, allowing simulation tests to run on a network with realistic congestion conditions and a gated bandwidth. This can be controlled dynamically through a LocalTestServer instance.
  • simulation/: A new package for collecting simulation scripts and analyzing the results.

To run the new simulation scripts for the adaptive concurrency on upload, compile in release mode and run one of the scripts in simulation/src/adaptive_concurrency/scripts/. Docker is no longer needed to run any of the simulations.

The old cas_client/tests/adaptive_concurrency/ paths were removed.

@hoytak hoytak force-pushed the hoytak/260129-simulation-package branch from 5fa23c2 to 26d735e Compare February 27, 2026 05:42
Copy link
Collaborator

@rajatarya rajatarya left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review: Rework simulation pipeline for adaptive concurrency

Clean architectural improvement — replacing Docker-based simulation with an in-process Rust framework significantly improves developer ergonomics and CI portability. The separation of concerns (bandwidth proxy, server-side latency simulation, scenario orchestration) is well-designed. The CancellationToken parent/child pattern and RAII guards for connection tracking are done correctly. Comprehensive test coverage across unit and integration tests.

The deleted Docker scripts had hardcoded paths (/Users/hoytak/workspace/...), confirming they were developer-local tooling — nothing of value is lost.

Issues

  • unsafe impl Send/Sync for ServerState: unnecessary — should be derived automatically
  • unsafe { std::env::set_var(...) }: unsound in multi-threaded context (Rust 1.83+)
  • to_bytes(body, usize::MAX): no body size limit on set_config handler
  • guard.take().expect(...) in run_accept_loop: panics on double-call instead of returning Result
  • NetworkSimulationProxy fields all pub(crate): internal state unnecessarily exposed
  • pub fn http_client(): exposes internal HTTP client with middleware stack publicly
  • DirectAccessClient: 'static: breaking trait change, should be documented
  • buf[..n].to_vec() in hot path: per-chunk allocation in bandwidth-limited copies
  • Test semantics changed: test_session_resume switched from in-memory to ephemeral disk server

const LATENCY_PIPE_DEPTH: usize = 16;

/// Single struct for the network simulation proxy. Methods take `Arc<Self>` for use in async tasks.
pub struct NetworkSimulationProxy {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All fields of NetworkSimulationProxy are pub(crate), exposing internal state (semaphores, atomics, shutdown flags, listener) to the entire crate. These should be private — the struct already has proper public methods for all needed operations.

@hoytak
Copy link
Collaborator Author

hoytak commented Feb 28, 2026

Addressed the PR feedback -- should be better now.

@rajatarya
Copy link
Collaborator

Code Review Summary

Solid PR — replacing Docker-based simulation with pure Rust tooling is a clear win for developer experience and accessibility. The architecture is clean with good use of builder patterns, cancellation tokens, and separation of concerns.

Given this is internal simulation tooling, most findings are minor style/hygiene items that can be deferred. Three items are worth addressing since they affect the tool's correctness or usability:

  1. _duration_sec parameter ignored in run_added_uploads — parameter is accepted but unused, with hardcoded 600s sleep durations instead. Will confuse future callers.
  2. Unlimited bandwidth reports 0% utilization — when bytes_per_sec is None, the utilization calculation produces 0%, generating misleading reports from the tool whose purpose is accurate reporting.
  3. Double elapsed() capture in stats loopstart_instant.lock().unwrap().elapsed() called twice per iteration can skew the measurements this tool is designed to produce.

See inline comments for exact locations.

Copy link
Collaborator

@rajatarya rajatarya left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Inline comments on items worth addressing.

hoytak added 6 commits March 4, 2026 13:13
Merged main branch changes including DeletionControlableClient,
SimulationControlClient, and simulation handler routes with the
branch's LatencySimulation and LocalTestServerBuilder changes.

Made-with: Cursor
@hoytak
Copy link
Collaborator Author

hoytak commented Mar 7, 2026

All issues should be fixed.

Copy link
Collaborator

@rajatarya rajatarya left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry didn't see everything got addressed, thanks!

@hoytak hoytak merged commit 6a5535b into main Mar 9, 2026
7 checks passed
@hoytak hoytak deleted the hoytak/260129-simulation-package branch March 9, 2026 17:49
rajatarya added a commit that referenced this pull request Mar 9, 2026
…-885)

PR #648 refactored LocalTestServer to use a builder pattern. Update
the shard upload timeout integration tests to use LocalTestServerBuilder
instead of LocalTestServer::start(). Also call set_api_delay_range
directly on the server instead of going through .client().

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants