[model-gateway] fix: handle workflow deadlock and optimize cycle detection#15000

Merged
slin1237 merged 1 commit into main from wfaas-n/4
Dec 12, 2025

@slin1237 slin1237 commented Dec 12, 2025

  • Fail workflow explicitly on deadlock instead of marking as completed
  • Optimize DAG cycle detection from O(V²) to O(V+E) using HashMap lookup

@gemini-code-assist (Contributor)

Summary of Changes

Hello @slin1237, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request enhances the model-gateway's workflow management by improving both its robustness and efficiency. It introduces explicit failure handling for deadlocked workflows, preventing them from hanging or incorrectly completing. Concurrently, it significantly optimizes the performance of the DAG cycle detection mechanism, leading to faster validation of workflow definitions.

Highlights

  • Workflow Deadlock Resolution: Workflows that encounter a deadlock (no steps ready and none running) will now explicitly fail with an appropriate error message, rather than potentially being marked as completed or stuck. This improves debugging and workflow reliability by providing clear failure states.
  • DAG Cycle Detection Performance Optimization: The Directed Acyclic Graph (DAG) cycle detection algorithm has been optimized from O(V²) to O(V+E) complexity. This was achieved by introducing a HashMap for O(1) step lookups during the Depth-First Search (DFS) traversal, significantly improving performance for workflows with many steps.
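
The cycle-detection change in the second highlight can be illustrated with a minimal sketch. It is not the code from `definition.rs`: the `Step` struct, its `depends_on` field, and the `has_cycle` function are hypothetical names chosen for illustration. The point is that building an id-to-step `HashMap` once makes each dependency lookup O(1), so the DFS touches every step and every dependency edge a constant number of times, giving O(V+E) overall.

```rust
use std::collections::HashMap;

/// Hypothetical step shape (an assumption; the real type may differ).
#[derive(Debug, Clone)]
pub struct Step {
    pub id: String,
    pub depends_on: Vec<String>,
}

/// DFS colouring: a step is either on the current DFS path or fully explored.
#[derive(Clone, Copy)]
enum Mark {
    Visiting,
    Done,
}

/// Returns `true` if the dependency graph of `steps` contains a cycle.
pub fn has_cycle(steps: &[Step]) -> bool {
    // Build the id -> step index once, so each dependency lookup is O(1)
    // instead of a linear scan over `steps` (the source of the O(V²) cost).
    let by_id: HashMap<&str, &Step> = steps.iter().map(|s| (s.id.as_str(), s)).collect();
    let mut marks: HashMap<&str, Mark> = HashMap::with_capacity(steps.len());

    fn visit<'a>(
        id: &'a str,
        by_id: &HashMap<&'a str, &'a Step>,
        marks: &mut HashMap<&'a str, Mark>,
    ) -> bool {
        match marks.get(id) {
            Some(Mark::Visiting) => return true, // back edge: cycle found
            Some(Mark::Done) => return false,    // already explored, no cycle here
            None => {}
        }
        // Dependencies on unknown ids are ignored in this sketch.
        let Some(&step) = by_id.get(id) else {
            return false;
        };
        marks.insert(id, Mark::Visiting);
        for dep in &step.depends_on {
            if visit(dep.as_str(), by_id, marks) {
                return true;
            }
        }
        marks.insert(id, Mark::Done);
        false
    }

    steps.iter().any(|s| visit(s.id.as_str(), &by_id, &mut marks))
}
```

The recursion keeps the sketch short. A very deep dependency chain could exhaust the stack, so an explicit stack may be preferable in production code.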

@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

This pull request introduces two significant improvements: a performance optimization for DAG cycle detection and a critical bug fix for workflow deadlocks. The cycle detection is now O(V+E) instead of O(V²) by using a HashMap for efficient lookups, which is a great enhancement. The deadlock fix correctly identifies and fails workflows that get stuck, preventing them from being erroneously marked as complete. The changes are well-implemented. I have a couple of minor suggestions regarding Rust conventions to improve code style and readability.
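
As a rough illustration of the deadlock fix, the sketch below classifies a workflow's progress after each scheduling pass. It is an assumption-laden example, not the engine's actual code: `StepState`, `StepStatus`, `Progress`, and `evaluate` are hypothetical names. The rule it encodes is the one stated in the PR: if no step is ready and none is running while some steps are still pending, the workflow is deadlocked and should fail rather than be reported as completed.

```rust
use std::collections::HashMap;

/// Hypothetical per-step state (an assumption for this sketch).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum StepState {
    Pending,
    Running,
    Succeeded,
    Failed,
}

/// Hypothetical bookkeeping record for one step.
#[derive(Debug, Clone)]
pub struct StepStatus {
    pub depends_on: Vec<String>,
    pub state: StepState,
}

/// Outcome of one scheduling pass.
#[derive(Debug, PartialEq, Eq)]
pub enum Progress {
    /// Steps whose dependencies have all succeeded and can be launched now.
    Ready(Vec<String>),
    /// Nothing to launch, but some steps are still running.
    Waiting,
    /// No pending or running steps remain.
    Completed,
    /// Pending steps remain, yet none is ready and none is running.
    Deadlocked(Vec<String>),
}

pub fn evaluate(steps: &HashMap<String, StepStatus>) -> Progress {
    // A dependency counts as satisfied only if it exists and has succeeded.
    let succeeded =
        |id: &String| matches!(steps.get(id).map(|s| s.state), Some(StepState::Succeeded));

    let mut ready = Vec::new();
    let mut blocked = Vec::new();
    let mut running = false;

    for (id, step) in steps {
        match step.state {
            StepState::Running => running = true,
            StepState::Pending => {
                if step.depends_on.iter().all(|d| succeeded(d)) {
                    ready.push(id.clone());
                } else {
                    blocked.push(id.clone());
                }
            }
            StepState::Succeeded | StepState::Failed => {}
        }
    }
    // HashMap iteration order is unspecified; sort for deterministic output.
    ready.sort();
    blocked.sort();

    if !ready.is_empty() {
        Progress::Ready(ready)
    } else if running {
        Progress::Waiting
    } else if !blocked.is_empty() {
        // Before a fix of this kind, this case could fall through to "completed".
        Progress::Deadlocked(blocked)
    } else {
        Progress::Completed
    }
}
```

A caller would map `Progress::Deadlocked` to an explicit workflow failure that names the blocked steps, which is what makes the stuck state visible when debugging.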

Comment thread sgl-model-gateway/src/workflow/definition.rs Outdated
Comment thread sgl-model-gateway/src/workflow/definition.rs Outdated
…ction

- Fail workflow explicitly on deadlock instead of marking as completed
- Optimize DAG cycle detection from O(V²) to O(V+E) using HashMap lookup
@slin1237 slin1237 merged commit 56d0ad4 into main Dec 12, 2025
51 of 55 checks passed
@slin1237 slin1237 deleted the wfaas-n/4 branch December 12, 2025 16:53
Liwansi added a commit to iforgetmyname/sglang that referenced this pull request Dec 13, 2025
…n_eagle3_npu

* 'main' of https://github.com/sgl-project/sglang: (121 commits)
  Super tiny add gsp-fast-prepare (sgl-project#14992)
  Super tiny fix confusing slash_command_handler hint (sgl-project#14976)
  Super tiny remove unused argument (sgl-project#14966)
  [registry] Add a strict mode to model registration (sgl-project#14933)
  Feature/Fix multi lora scheduler blocking issue and evict LoRA None lastly (sgl-project#14795)
  Tune triton fused moe for the case of glm-4.6-fp8 b200 tp4 (sgl-project#15020)
  [model-gateway] refactor: unify worker management into modular workflow structure (sgl-project#15010)
  Update ci permission (sgl-project#15014)
  Refactor of http and engine entrypoints to allow custom override  (sgl-project#14869)
  Add KV4-capable backend flashmla and update server args (sgl-project#14989)
  Revert several PRs (sgl-project#14958)
  Super tiny extract route_typed_request_once (sgl-project#14951)
  Fix CI by reverting incorrect metric check logic (sgl-project#15004)
  [model-gateway] refactor: workflow engine cleanup and minor optimization (sgl-project#15001)
  [model-gateway] fix: handle workflow deadlock and optimize cycle detection (sgl-project#15000)
  [model-gateway] feat: add DAG parallel execution support and workflow optimization (sgl-project#14999)
  [model-gateway] refactor: extract workflow engine to src/workflow module (sgl-project#14996)
  Update CODEOWNERS for multimodal_gen (sgl-project#14995)
  [diffusion] docker: Tiny fix Docker Hub link in installation documentation (sgl-project#14987)
  [PD] Add decode PP event loop for PD disaggregation (sgl-project#14945)
  ...

# Conflicts:
#	python/sglang/srt/model_executor/piecewise_cuda_graph_runner.py
Prozac614 pushed a commit to Prozac614/sglang that referenced this pull request Dec 17, 2025
YChange01 pushed a commit to YChange01/sglang that referenced this pull request Jan 13, 2026