[Feature] Add support for Kimi K2.6 #2412

@anwithk

Summary

Add NeMo RL support for moonshotai/Kimi-K2.6 on the Megatron Core (MCore) backend. This covers checkpoint conversion or loading, model configuration mapping, example training recipes, and backend consistency validation for RL workflows.

Motivation

Kimi K2.6 is a recent open-source MoE model from Moonshot AI with strong long-horizon coding and agentic-task performance. The Hugging Face model card describes it as a native multimodal agentic model with:

  • 1T total parameters, 32B activated parameters
  • 384 experts, 8 selected experts per token, 1 shared expert
  • 61 layers, 1 dense layer
  • 256K context length
  • MLA attention, SwiGLU activation, 160K vocabulary
  • MoonViT vision encoder

NeMo RL already exposes a Megatron Core path for large models, long context, MoE, sequence packing, and RL training. Adding Kimi K2.6 support would make it possible to post-train and evaluate a frontier-scale MoE coding/agent model in NeMo RL without users hand-rolling the Megatron mapping and validation.

Proposed Scope

MCore model integration

  • Add or reuse the Megatron Core model mapping required for Kimi K2.6's architecture.
  • Map the HF config fields to the appropriate Megatron Bridge / MCore configuration:
    • MoE expert count, selected experts, shared experts, expert hidden size
    • dense layer placement
    • MLA attention settings
    • vocab size and tokenizer settings
    • long-context settings
  • Confirm whether Kimi K2.6 can reuse the Kimi K2.5 architecture path: the Kimi model card states that K2.6 has the same architecture as K2.5 and can reuse its deployment method.
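
To make the mapping task concrete, here is a minimal sketch of a field-by-field translation. It assumes the HF config uses DeepSeek-V3-style field names (`n_routed_experts`, `first_k_dense_replace`, and so on) and that the target uses MCore-style names (`num_moe_experts`, `moe_router_topk`, `moe_layer_freq`); both sets of names are assumptions to check against the real Kimi K2.6 `config.json` and the Megatron Bridge provider. `map_hf_to_mcore` is a hypothetical helper, not an existing API. Values marked as placeholders are not taken from the model card.

```python
from typing import Any, Mapping

# Assumed HF -> MCore key pairs (illustrative; verify both sides before use).
HF_TO_MCORE = {
    "num_hidden_layers": "num_layers",
    "n_routed_experts": "num_moe_experts",
    "num_experts_per_tok": "moe_router_topk",
    "moe_intermediate_size": "moe_ffn_hidden_size",
    "kv_lora_rank": "kv_lora_rank",
    "vocab_size": "vocab_size",
    "max_position_embeddings": "seq_length",
}


def map_hf_to_mcore(hf_cfg: Mapping[str, Any]) -> dict[str, Any]:
    """Translate a plain HF config mapping into a plain MCore-style dict."""
    required = list(HF_TO_MCORE) + ["n_shared_experts", "first_k_dense_replace"]
    missing = [key for key in required if key not in hf_cfg]
    if missing:
        raise KeyError(f"HF config is missing expected fields: {missing}")

    out = {dst: hf_cfg[src] for src, dst in HF_TO_MCORE.items()}

    # Shared experts are expressed here as one combined intermediate size
    # (assumed convention: count * per-expert hidden size).
    out["moe_shared_expert_intermediate_size"] = (
        hf_cfg["n_shared_experts"] * hf_cfg["moe_intermediate_size"]
    )

    # Dense layer placement: the first k layers are dense (0), the rest MoE (1).
    k = hf_cfg["first_k_dense_replace"]
    out["moe_layer_freq"] = [0] * k + [1] * (hf_cfg["num_hidden_layers"] - k)
    return out


example_hf_cfg = {
    "num_hidden_layers": 61,          # from the model card
    "n_routed_experts": 384,          # from the model card
    "num_experts_per_tok": 8,         # from the model card
    "n_shared_experts": 1,            # from the model card
    "first_k_dense_replace": 1,       # "1 dense layer" on the model card
    "moe_intermediate_size": 2048,    # placeholder
    "kv_lora_rank": 512,              # placeholder
    "vocab_size": 160_000,            # placeholder for "160K"
    "max_position_embeddings": 256 * 1024,  # placeholder for "256K"
}
mcore_cfg = map_hf_to_mcore(example_hf_cfg)
```

Failing loudly on missing fields is deliberate: a silently defaulted expert count or dense-layer offset would load without error and only surface later as a log-probability mismatch.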

Checkpoint conversion / loading

  • Support conversion from moonshotai/Kimi-K2.6 Hugging Face checkpoints into the MCore format used by NeMo RL, or document the supported loading path through Megatron Bridge.
  • Preserve tied/untied embedding semantics, MoE tensor layout, tokenizer assets, and chat template behavior.
  • Document any unsupported weight formats. Kimi K2.6 ships with native INT4 quantization; if INT4 is out of scope for training, document the expected BF16/FP8 path.

Training and rollout recipes

  • Add a minimal SFT recipe for Kimi K2.6 on the Megatron backend.
  • Add a minimal GRPO recipe for Kimi K2.6 on the Megatron backend.
  • Include recommended parallelism settings for a smoke test and for a realistic multi-node run where possible:
    • TP / PP / CP / EP / FSDP
    • sequence packing on/off guidance for MoE
    • long-context guidance
  • Clarify supported rollout backends:
    • vLLM and/or SGLang for generation
    • Megatron inference if supported to avoid weight conversion
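
Recommended parallelism settings are easier to review if each recipe passes a basic layout check. The sketch below is a hypothetical helper (`check_parallel_layout` is not an existing NeMo RL function). It assumes the usual Megatron-style decomposition, where the world size is divisible by TP × PP × CP and the remainder is data parallelism, and it assumes the expert-side constraint that the world size is divisible by PP × EP × expert-TP. The numbers in the example are illustrative only and are not a recommended configuration.

```python
def check_parallel_layout(
    world_size: int,
    tp: int,
    pp: int,
    cp: int,
    ep: int,
    num_experts: int,
    expert_tp: int = 1,
) -> dict[str, int]:
    """Validate a parallel layout and return the derived sizes."""
    for name, value in {"tp": tp, "pp": pp, "cp": cp, "ep": ep}.items():
        if value < 1:
            raise ValueError(f"{name} must be >= 1, got {value}")

    model_parallel = tp * pp * cp
    if world_size % model_parallel != 0:
        raise ValueError(
            f"world_size={world_size} is not divisible by "
            f"TP*PP*CP={model_parallel}"
        )

    # Every expert-parallel rank must hold a whole number of experts.
    if num_experts % ep != 0:
        raise ValueError(f"num_experts={num_experts} is not divisible by EP={ep}")

    # Assumed expert-side constraint; confirm against the backend in use.
    expert_parallel = pp * ep * expert_tp
    if world_size % expert_parallel != 0:
        raise ValueError(
            f"world_size={world_size} is not divisible by "
            f"PP*EP*expert_TP={expert_parallel}"
        )

    return {
        "dp": world_size // model_parallel,
        "experts_per_ep_rank": num_experts // ep,
    }


# Illustrative numbers only; 384 experts comes from the model card.
layout = check_parallel_layout(
    world_size=256, tp=8, pp=4, cp=1, ep=32, num_experts=384
)
```

With 384 routed experts, EP values such as 8, 16, 32, or 64 divide evenly, while values such as 5 or 10 are rejected before any job is launched.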

Validation

Follow the NeMo RL "Add New Models" validation workflow:

  • Verify HF vs rollout-backend log probability consistency.
  • Verify Megatron vs rollout-backend log probability consistency.
  • Run across real and synthetic prompts, greedy and sampling modes, multiple batch sizes, and at least short/medium/long sequence lengths.
  • Use the documented 1.05 error threshold for equal-precision log-probability consistency unless Kimi-specific precision caveats require a different threshold.
  • Run the existing model diagnostics where applicable:
    • max_model_len_respected.py
    • long_generation_decode_vs_prefill.py
    • check_hf_model_embeddings_untrained.py
    • vllm_precision_compilation_test.py
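
The consistency checks above can be sketched as follows. This assumes the error metric is the mean over tokens of exp(|log p_train − log p_rollout|), so identical backends score exactly 1.0 and the 1.05 threshold bounds the average per-token probability ratio; confirm the exact definition against the NeMo RL guide before relying on it. The function names, batch sizes, and length buckets are hypothetical placeholders.

```python
import itertools
import math
from typing import Sequence

THRESHOLD = 1.05  # threshold named in this issue for equal-precision runs


def mult_prob_error(
    train_logprobs: Sequence[float], rollout_logprobs: Sequence[float]
) -> float:
    """Mean per-token exp(|delta log-prob|) between two backends."""
    if len(train_logprobs) != len(rollout_logprobs):
        raise ValueError("log-prob sequences must have the same length")
    if not train_logprobs:
        raise ValueError("log-prob sequences must be non-empty")
    total = sum(
        math.exp(abs(a - b)) for a, b in zip(train_logprobs, rollout_logprobs)
    )
    return total / len(train_logprobs)


def is_consistent(
    train_logprobs: Sequence[float],
    rollout_logprobs: Sequence[float],
    threshold: float = THRESHOLD,
) -> bool:
    """True when the two backends agree within the threshold."""
    return mult_prob_error(train_logprobs, rollout_logprobs) < threshold


# Test matrix from the validation list; sizes and buckets are placeholders.
grid = list(
    itertools.product(
        ("real", "synthetic"),        # prompt source
        ("greedy", "sampling"),       # decoding mode
        (1, 8, 32),                   # batch sizes (placeholder values)
        ("short", "medium", "long"),  # sequence-length buckets
    )
)
```

Each entry in `grid` would be run once for HF vs. rollout backend and once for Megatron vs. rollout backend, with the resulting error recorded per cell so that a failure can be traced to a specific mode or length bucket.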

Acceptance Criteria

  • moonshotai/Kimi-K2.6 can be loaded or converted for Megatron Core training in NeMo RL.
  • A Kimi K2.6 Megatron config can complete a small SFT smoke test.
  • A Kimi K2.6 Megatron config can complete a small GRPO smoke test.
  • Training-backend and rollout-backend log probabilities are validated and documented.
  • Example configs are added under examples/configs/.
  • Documentation is updated to list Kimi K2.6 support, known limitations, and recommended precision/parallelism settings.
  • Tests or diagnostics are added so regressions in the model mapping or checkpoint conversion path are caught.
