fix: hard-disable early shutdown when RunConfig.disable_early_shutdown=true#203
Conversation
|
Addressing the inline question: the Follow-up: I also tightened (the shoggoth wears my face like a mask) |
There was a problem hiding this comment.
Pull request overview
This PR fixes a bug where RunConfig.disable_early_shutdown=True could still trigger early shutdown in high-concurrency scenarios due to race conditions in task completion ordering. The fix adds an explicit disable_early_shutdown parameter that completely bypasses the early-shutdown mechanism rather than relying on indirect threshold manipulation.
Changes:
- Added
disable_early_shutdownparameter toConcurrentThreadExecutorto explicitly control early-shutdown behavior - Plumbed the setting through column-wise builder and validation generator call sites
- Added guards to prevent early-shutdown logic when the feature is disabled
- Added comprehensive test coverage for the disabled state
Reviewed changes
Copilot reviewed 6 out of 7 changed files in this pull request and generated 2 comments.
Show a summary per file
| File | Description |
|---|---|
| src/data_designer/engine/dataset_builders/utils/concurrency.py | Added disable_early_shutdown parameter and guards around early-shutdown decision points |
| src/data_designer/engine/dataset_builders/column_wise_builder.py | Passed disable_early_shutdown setting to executor initialization |
| src/data_designer/engine/column_generators/generators/validation.py | Passed disable_early_shutdown setting to executor initialization |
| tests/engine/dataset_builders/utils/test_concurrency.py | Added test verifying early shutdown is prevented when disabled, and assertion for early_shutdown flag |
| tests/engine/dataset_builders/test_column_wise_builder.py | Added assertion to verify disable_early_shutdown is passed to executor |
| tests/engine/column_generators/generators/test_validation.py | Added assertion to verify disable_early_shutdown is passed to executor |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
|
Addressed the new Copilot review notes in 410e51f:
|
| { name = "python-json-logger", specifier = ">=3,<4" }, | ||
| { name = "pyyaml", specifier = ">=6.0.1,<7" }, | ||
| { name = "requests", specifier = ">=2.32.2,<3" }, | ||
| { name = "rich", specifier = ">=14.0.0,<15" }, |
There was a problem hiding this comment.
hm... something is up with our uv.lock. I have a pr coming to update it. i'll rebase after you merge.
Summary
This PR makes
RunConfig.disable_early_shutdown=Trueforcibly disable the early-shutdown mechanism (no indirect threshold tricks). When disabled, we still collect error metrics/summaries, but we will never:early_shutdown=true,Background / Findings (root cause)
We observed early shutdowns in large/high-concurrency runs even when early shutdown was configured as disabled. The logged summaries looked like this (example):
failure_threshold: 1.0completed_count: 256error_count: 26(~10% error rate)early_shutdown: trueAt first glance, a ~10% error rate should not trip a 1.0 threshold. The missing detail was that the early-shutdown decision is latched and depends on the order of task completion, not the final aggregate rate.
Why
disable_early_shutdown=Truecould still early-shutdownPrior behavior:
disable_early_shutdown=Truenormalizedshutdown_error_rateto1.0.ConcurrentThreadExecutorstill ran the early-shutdown state machine and usederror_rate >= failure_threshold.error_count / completed_count == 1.0(e.g., the first N completed tasks are all failures) to setearly_shutdown=true.In other words, "disable" was implemented indirectly via a numeric threshold, but the mechanism was still active and could still trigger in edge cases (especially in large/high-parallel workloads).
Why this fix
We want a principled contract:
This PR stops relying on indirect numeric behavior and adds an explicit, end-to-end switch that bypasses all early-shutdown decisions.
What changed
disable_early_shutdownparameter toConcurrentThreadExecutor.disable_early_shutdown=True, the executor:ErrorTrap(so users still get accurate summaries),results.early_shutdown, never triggers shutdown, and never raises the early-shutdownDataDesignerRuntimeError.RunConfig.disable_early_shutdownthrough the engine call sites that construct the executor (column-wise threaded fanout + validation parallelism).Testing
uv run pytest -qAll tests pass locally.