Skip to content

fix: hard-disable early shutdown when RunConfig.disable_early_shutdown=true#203

Merged
eric-tramel merged 6 commits into
mainfrom
fix/disable-early-shutdown-plumbing
Jan 13, 2026
Merged

fix: hard-disable early shutdown when RunConfig.disable_early_shutdown=true#203
eric-tramel merged 6 commits into
mainfrom
fix/disable-early-shutdown-plumbing

Conversation

@eric-tramel

Copy link
Copy Markdown
Contributor

Summary

This PR makes RunConfig.disable_early_shutdown=True forcibly disable the early-shutdown mechanism (no indirect threshold tricks). When disabled, we still collect error metrics/summaries, but we will never:

  • flip early_shutdown=true,
  • shut down the executor early, or
  • raise the "terminated early due to error rate" exception.

Background / Findings (root cause)

We observed early shutdowns in large/high-concurrency runs even when early shutdown was configured as disabled. The logged summaries looked like this (example):

  • failure_threshold: 1.0
  • completed_count: 256
  • error_count: 26 (~10% error rate)
  • early_shutdown: true

At first glance, a ~10% error rate should not trip a 1.0 threshold. The missing detail was that the early-shutdown decision is latched and depends on the order of task completion, not the final aggregate rate.

Why disable_early_shutdown=True could still early-shutdown

Prior behavior:

  • disable_early_shutdown=True normalized shutdown_error_rate to 1.0.
  • ConcurrentThreadExecutor still ran the early-shutdown state machine and used error_rate >= failure_threshold.
  • Under concurrency, fast failures can complete before slower successes. Once the shutdown window is active, it only takes a moment where error_count / completed_count == 1.0 (e.g., the first N completed tasks are all failures) to set early_shutdown=true.
  • That flag is never cleared, so the executor can ultimately exit with a final summary showing a much lower error rate while still raising/marking early shutdown.

In other words, "disable" was implemented indirectly via a numeric threshold, but the mechanism was still active and could still trigger in edge cases (especially in large/high-parallel workloads).

Why this fix

We want a principled contract:

If the user disables early shutdown, the system must not early-stop or raise an early-shutdown error, regardless of thresholds/windows/concurrency timing.

This PR stops relying on indirect numeric behavior and adds an explicit, end-to-end switch that bypasses all early-shutdown decisions.

What changed

  • Added an explicit disable_early_shutdown parameter to ConcurrentThreadExecutor.
  • When disable_early_shutdown=True, the executor:
    • continues to record errors via ErrorTrap (so users still get accurate summaries),
    • but never sets results.early_shutdown, never triggers shutdown, and never raises the early-shutdown DataDesignerRuntimeError.
  • Plumbed RunConfig.disable_early_shutdown through the engine call sites that construct the executor (column-wise threaded fanout + validation parallelism).
  • Updated and added tests to lock in the behavior, including a case that would otherwise deterministically early-shutdown (0% threshold, 0 window, 100% failures) and verifying it does not raise when disabled.

Testing

  • uv run pytest -q

All tests pass locally.

@eric-tramel eric-tramel requested review from johnnygreco and nabinchha and removed request for johnnygreco January 13, 2026 05:01
@eric-tramel eric-tramel self-assigned this Jan 13, 2026
@eric-tramel eric-tramel added the bug Something isn't working label Jan 13, 2026
@eric-tramel eric-tramel changed the title Fix: hard-disable early shutdown when RunConfig.disable_early_shutdown=true fix: hard-disable early shutdown when RunConfig.disable_early_shutdown=true Jan 13, 2026
Comment thread src/data_designer/engine/dataset_builders/utils/concurrency.py
@eric-tramel

eric-tramel commented Jan 13, 2026

Copy link
Copy Markdown
Contributor Author

Addressing the inline question: the handle_error() call is intentionally not inside the early-shutdown guard so we still collect accurate error summaries/metrics even when early shutdown is disabled.

Follow-up: I also tightened ConcurrentThreadExecutor.submit() so when disable_early_shutdown=True we will not raise the early-shutdown DataDesignerRuntimeError from the "cannot schedule new futures after shutdown" path either (it now re-raises the underlying RuntimeError instead). Pushed in 48de6af.

(the shoggoth wears my face like a mask)

@eric-tramel eric-tramel requested a review from Copilot January 13, 2026 05:13

Copilot AI left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR fixes a bug where RunConfig.disable_early_shutdown=True could still trigger early shutdown in high-concurrency scenarios due to race conditions in task completion ordering. The fix adds an explicit disable_early_shutdown parameter that completely bypasses the early-shutdown mechanism rather than relying on indirect threshold manipulation.

Changes:

  • Added disable_early_shutdown parameter to ConcurrentThreadExecutor to explicitly control early-shutdown behavior
  • Plumbed the setting through column-wise builder and validation generator call sites
  • Added guards to prevent early-shutdown logic when the feature is disabled
  • Added comprehensive test coverage for the disabled state

Reviewed changes

Copilot reviewed 6 out of 7 changed files in this pull request and generated 2 comments.

Show a summary per file
File Description
src/data_designer/engine/dataset_builders/utils/concurrency.py Added disable_early_shutdown parameter and guards around early-shutdown decision points
src/data_designer/engine/dataset_builders/column_wise_builder.py Passed disable_early_shutdown setting to executor initialization
src/data_designer/engine/column_generators/generators/validation.py Passed disable_early_shutdown setting to executor initialization
tests/engine/dataset_builders/utils/test_concurrency.py Added test verifying early shutdown is prevented when disabled, and assertion for early_shutdown flag
tests/engine/dataset_builders/test_column_wise_builder.py Added assertion to verify disable_early_shutdown is passed to executor
tests/engine/column_generators/generators/test_validation.py Added assertion to verify disable_early_shutdown is passed to executor

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread src/data_designer/engine/dataset_builders/utils/concurrency.py Outdated
Comment thread tests/engine/dataset_builders/utils/test_concurrency.py Outdated
@eric-tramel

Copy link
Copy Markdown
Contributor Author

Addressed the new Copilot review notes in 410e51f:

  • Refactored the submit() exception handling to explicitly detect shutdown-related RuntimeErrors and only raise the early-shutdown DataDesignerRuntimeError when early shutdown is enabled. When disable_early_shutdown=True, we now always propagate the original shutdown RuntimeError (never the early-shutdown error).
  • Updated the new disabled-mode test to deterministically wait for completion (polling with timeout) instead of using a fixed sleep.
  • Added a regression test asserting that submit-after-shutdown in disabled mode does not emit the early-shutdown error.

@eric-tramel eric-tramel requested a review from nabinchha January 13, 2026 05:49
Comment thread src/data_designer/engine/dataset_builders/utils/concurrency.py
Comment thread uv.lock
{ name = "python-json-logger", specifier = ">=3,<4" },
{ name = "pyyaml", specifier = ">=6.0.1,<7" },
{ name = "requests", specifier = ">=2.32.2,<3" },
{ name = "rich", specifier = ">=14.0.0,<15" },

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hm... something is up with our uv.lock. I have a pr coming to update it. i'll rebase after you merge.

@eric-tramel eric-tramel merged commit f50346c into main Jan 13, 2026
21 checks passed
@nabinchha nabinchha deleted the fix/disable-early-shutdown-plumbing branch January 13, 2026 16:44
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

bug Something isn't working

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants