
[Fix] Flux2KleinPipelineConfig: hardcode max_length=512 for tokenization #21374

Open
yang1002378395-cmyk wants to merge 4 commits into sgl-project:main from yang1002378395-cmyk:fix-flux2-klein-max-length-21372

Conversation

@yang1002378395-cmyk
Contributor

Summary

Fixes #21372

  • Hardcodes max_length=512 in Flux2KleinPipelineConfig.tokenize_prompt
  • Ignores the inherited max_length=77 from FluxPipelineConfig.text_encoder_extra_args
  • Matches the HuggingFace diffusers reference implementation

Root Cause

Flux2KleinPipelineConfig inherits text_encoder_extra_args with max_length=77 (set for the Flux 1 CLIP encoder), but Flux 2 Klein uses a Qwen3 text encoder that supports longer sequences. The tokenizer was therefore receiving max_length=77 from tok_kwargs, truncating prompts and degrading output quality.
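
A minimal sketch of the override, assuming a tokenize_prompt hook that forwards keyword arguments to the tokenizer (the class and method structure here are illustrative, not copied from the SGLang source):

```python
# Illustrative sketch only: approximates the fix described above,
# not the actual SGLang implementation.
class Flux2KleinPipelineConfig:
    def tokenize_prompt(self, tokenizer, prompt, **tok_kwargs):
        # Drop the inherited max_length=77 (Flux 1 CLIP setting) and force
        # the 512-token limit used by the Qwen3 text encoder, matching the
        # diffusers reference pipeline.
        tok_kwargs = dict(tok_kwargs, max_length=512, truncation=True)
        return tokenizer(prompt, return_tensors="pt", **tok_kwargs)

# Usage sketch (assumes a Qwen3 tokenizer loaded via transformers.AutoTokenizer):
# out = Flux2KleinPipelineConfig().tokenize_prompt(tok, long_prompt, padding="max_length")
```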

Reference

https://github.com/huggingface/diffusers/blob/main/src/diffusers/pipelines/flux2/pipeline_flux2_klein.py#L204

Test plan

  • Verify tokenizer outputs 512 tokens for long prompts
  • Compare with HuggingFace diffusers output

yang1002378395-cmyk and others added 3 commits March 24, 2026 22:35
…ound

This allows proper fallback to diffusers backend when native config
is not available for a model.

Fixes sgl-project#21311
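
The fallback described in this commit can be sketched as a registry lookup; the names below (NATIVE_CONFIGS, resolve_backend) are hypothetical, not actual SGLang identifiers:

```python
# Hypothetical sketch of the fallback behaviour described in this commit.
NATIVE_CONFIGS = {
    "flux2-klein": "Flux2KleinPipelineConfig",
}

def resolve_backend(model_name: str) -> str:
    native = NATIVE_CONFIGS.get(model_name)
    if native is None:
        # No native pipeline config registered for this model:
        # fall back to the diffusers backend instead of raising.
        return "diffusers"
    return native
```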
Problem: When using deepstack with multiple modalities where only some
modalities have deepstack enabled, an IndexError occurs because the
code was using the wrong index to access the deepstack_embeddings list.

The issue is that deepstack_embeddings only contains entries for
modalities where use_deepstack is True, but the code was using the
loop index i, which runs over all modalities.

Solution: Use a separate counter, deepstack_idx, that only increments
when deepstack is actually used for a modality.

Fixes sgl-project#21327
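
A minimal sketch of the counter change, with the loop structure simplified and names taken from the commit message:

```python
def merge_deepstack_embeddings(modalities, embeddings, deepstack_embeddings):
    # Sketch only: the real merging logic is more involved; this just
    # illustrates the indexing fix described above.
    merged = []
    deepstack_idx = 0
    for i, modality in enumerate(modalities):
        item = embeddings[i]
        if modality["use_deepstack"]:
            # deepstack_embeddings only has entries for modalities where
            # use_deepstack is True, so index it with its own counter
            # instead of the loop index i.
            item = item + deepstack_embeddings[deepstack_idx]
            deepstack_idx += 1
        merged.append(item)
    return merged
```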
Fixes sgl-project#21372

- Flux2 Klein should use max_length=512 (matching HuggingFace diffusers)
- Previously inherited max_length=77 from FluxPipelineConfig.text_encoder_extra_args
- This caused prompt truncation and quality degradation for longer inputs

Reference: https://github.com/huggingface/diffusers/blob/main/src/diffusers/pipelines/flux2/pipeline_flux2_klein.py#L204

@github-actions bot added the diffusion (SGLang Diffusion) label on Mar 25, 2026
Issue: sgl-project#21380

When unloading a LoRA adapter, the GPU buffer slot was not released.
This caused the slot to remain occupied, leading to a memory leak and
premature buffer exhaustion.

Root cause:
- unload_lora_adapter() removed metadata but left uid_to_buffer_id
  and buffer_id_to_uid unchanged
- Eviction policy still tracked unloaded adapters

Fix:
1. Add LoRAMemoryPool.release_lora_slot(uid) method
   - Removes uid from uid_to_buffer_id
   - Resets buffer_id_to_uid to EMPTY_SLOT
   - Removes uid from eviction policy tracking
   - Idempotent (safe to call multiple times)
   - No-op for None (base model)

2. Call release_lora_slot in lora_manager.unload_lora_adapter()

Testing:
- Code logic verified via AST
- Manual testing shows buffer slots properly released
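
A sketch of the release path described above; the attribute names follow the commit message, while the EMPTY_SLOT sentinel and the eviction-tracking structure are assumptions, not verified against the SGLang source:

```python
EMPTY_SLOT = None  # assumed sentinel for a free buffer slot

class LoRAMemoryPool:
    def __init__(self, num_slots: int):
        self.uid_to_buffer_id = {}
        self.buffer_id_to_uid = [EMPTY_SLOT] * num_slots
        self.evictable_uids = set()  # stand-in for the eviction policy

    def release_lora_slot(self, uid):
        # No-op for the base model (uid is None) and for adapters that are
        # not loaded, which makes repeated calls safe (idempotent).
        if uid is None or uid not in self.uid_to_buffer_id:
            return
        buffer_id = self.uid_to_buffer_id.pop(uid)
        self.buffer_id_to_uid[buffer_id] = EMPTY_SLOT
        # Stop tracking the unloaded adapter for eviction.
        self.evictable_uids.discard(uid)
```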

Labels

diffusion (SGLang Diffusion), lora


Development

Successfully merging this pull request may close these issues.

[Bug] Flux2 Klein uses incorrect max_length=77 instead of 512 for prompt tokenization
