[Docs] add cookbook for Ling-2.6 family #23947
Conversation
Combined cookbook covering Ling-2.6-flash (104B/7.4B BF16 MoE) and Ling-2.6-1T (~1T FP8 MoE), both built on the 1:7 MLA + Lightning Linear hybrid attention backbone. Includes per-variant deployment selectors:
- flash: single-node TP=4 across H20/H100/H200/B200, optional YaRN 256K context extension, and qwen25 tool-call parser.
- 1T: single GB300/GB200 node with TP=4 (FP8 weights fit on a single node, unlike Ling-2.5-1T BF16) or two-node TP=8 + PP=2 fallback for H200/B200.

Documents the tool-call + reasoning interaction (qwen + qwen3 parsers) requiring chat_template_kwargs.enable_thinking=false to avoid <tool_call> being captured into reasoning_content. Includes GSM8K result on Ling-2.6-1T (96.21%, GB300 x 4). Both Hugging Face repos are private at the time of writing -- noted in the cookbook intro.
Ling-2.6-flash is an instruct-only model -- do not pass --reasoning-parser or chat_template_kwargs.enable_thinking=true. The reasoning/thinking interaction note still applies to Ling-2.6-1T only.
Code Review
This pull request adds documentation for the Ling-2.6 model family, updates the navigation configuration, and introduces interactive React components for generating deployment commands. The review feedback identifies missing React hook imports in the JSX snippets, dead logic in the environment prefix calculation for the 1T model, and an unused hardware selection variable in the flash model's generator.
@@ -0,0 +1,180 @@
export const Ling261TDeployment = () => {
The component uses useState and useEffect hooks but does not import them from react. This will cause a ReferenceError at runtime unless these are provided globally by the build environment, which is not standard for .jsx files.
import React, { useState, useEffect } from 'react';
export const Ling261TDeployment = () => {
  const { hardware, toolcall, reasoning } = values;
  const isSingleNode = hardware === 'gb300' || hardware === 'gb200';
  const isGB = hardware === 'gb300' || hardware === 'gb200';
  const envPrefix = isGB && !isSingleNode ? 'NCCL_IB_DISABLE=1 ' : '';
The logic for envPrefix is currently ineffective because isGB and isSingleNode are derived from the same hardware IDs (gb300, gb200). As a result, isGB && !isSingleNode is always false. If multi-node GB deployments are not supported, this logic should be simplified to avoid confusion.
- const envPrefix = isGB && !isSingleNode ? 'NCCL_IB_DISABLE=1 ' : '';
+ const envPrefix = '';
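Applied to the snippet, the simplification collapses the generator to a single branch on the hardware ID. A minimal sketch of the resulting logic, assuming the parallelism settings from the PR description (the model path and helper name are illustrative, not taken from the actual snippet):

```javascript
// Sketch of the simplified 1T command generator after removing the dead
// envPrefix branch. Hardware IDs and TP/PP settings mirror the PR
// description; the model path is an assumption.
function buildLing261TCommand(hardware) {
  const isGB = hardware === 'gb300' || hardware === 'gb200';
  // GB300/GB200: FP8 weights fit on a single node with TP=4.
  // H200/B200: two-node fallback with TP=8 + PP=2.
  const parallelism = isGB ? '--tp 4' : '--tp 8 --pp 2';
  return `sglang serve inclusionAI/Ling-2.6-1T ${parallelism}`;
}

console.log(buildLing261TCommand('gb300'));
console.log(buildLing261TCommand('h200'));
```

Since `isGB` and `isSingleNode` were always equal, folding them into one predicate preserves behavior exactly while making the single-node assumption explicit.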
@@ -0,0 +1,149 @@
export const Ling26FlashDeployment = () => {
  // Generate command
  const generateCommand = () => {
    const { yarn, toolcall } = values;
The hardware selection is extracted from values but never used to modify the generated command. If the command is indeed identical for all listed hardware, consider removing the selector or using it to display hardware-specific performance expectations to make the UI more meaningful.
- const { yarn, toolcall } = values;
+ const { hardware, yarn, toolcall } = values;
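One way to act on this suggestion: keep `hardware` destructured and use it for a display-only note beside the generated command, since the command itself is identical across accelerators. A hedged sketch (the helper name and note wording are hypothetical, not from the snippet):

```javascript
// Hypothetical helper: the flash launch command is the same on
// H20-3e/H100/H200/B200, so `hardware` feeds a UI note instead of
// changing the command string.
function describeSelection(values) {
  const { hardware, yarn, toolcall } = values;
  const notes = [`selected hardware: ${hardware.toUpperCase()}`];
  if (yarn) notes.push('context extended to 256K via YaRN');
  if (toolcall) notes.push('tool-call parser: qwen25');
  return notes.join('; ');
}

console.log(describeSelection({ hardware: 'h200', yarn: true, toolcall: true }));
```

The alternative the reviewer mentions, dropping the hardware selector entirely, is simpler but loses the place where hardware-specific guidance could later be surfaced.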
Both Ling-2.6-flash and Ling-2.6-1T are controllable-reasoning
models, not pure-instruct or pure-thinking. The chat template
defaults to 'detailed thinking off' (set via {% set thinking_option =
'off' %}) and is toggled by textual directives in the system message
('detailed thinking on' / 'detailed thinking off'), not by the
Qwen3-style enable_thinking template variable.
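A request-level illustration of that toggle, assuming an OpenAI-compatible /v1/chat/completions payload (the model name and builder function are illustrative; only the system-message directive is Ling-specific):

```javascript
// Thinking is toggled by a textual directive in the system message,
// not by a Qwen3-style enable_thinking template variable.
function buildChatPayload(userMessage, thinking) {
  return {
    model: 'inclusionAI/Ling-2.6-1T', // illustrative model name
    messages: [
      {
        role: 'system',
        // Matches the template's textual directives; 'off' is the default.
        content: thinking ? 'detailed thinking on' : 'detailed thinking off',
      },
      { role: 'user', content: userMessage },
    ],
  };
}

console.log(buildChatPayload('Summarize MoE routing.', true).messages[0].content);
```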
Changes:
- Revert previous "flash has no thinking mode" claim -- it does, just
defaults to off.
- Drop --reasoning-parser qwen3 from the recommended baseline launch
command (both variants). The qwen3 reasoning parser assumes
default-thinking semantics and mis-routes Ling-2.6 output into
reasoning_content.
- New section 4.3 "Thinking Mode (detailed thinking on/off)" covers
how to enable thinking via the system message and explains the
  qwen3-parser caveat (per-request
  chat_template_kwargs.enable_thinking=false is consumed by the
  parser, not the template).
- Flash deployment selector gains an optional Reasoning Parser toggle
(default disabled).
- 1T deployment selector default for Reasoning Parser flipped from
enabled -> disabled.
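For deployments that do opt into `--reasoning-parser qwen3`, the caveat above means every tool-call request needs the per-request override. A sketch of such a request body (model name and message content are illustrative; the chat_template_kwargs field is as described in this PR):

```javascript
// With --reasoning-parser qwen3 enabled at launch, each tool-call request
// must carry enable_thinking=false so <tool_call> output is not routed
// into reasoning_content. The flag is consumed by the parser, not the
// chat template (which only reads the textual system-message directive).
const requestBody = {
  model: 'inclusionAI/Ling-2.6-1T', // illustrative
  messages: [{ role: 'user', content: 'What is the weather in Paris?' }],
  tools: [], // tool schemas would go here
  chat_template_kwargs: { enable_thinking: false },
};

console.log(JSON.stringify(requestBody.chat_template_kwargs));
```

Defaulting the selectors to "reasoning parser disabled" avoids this per-request burden entirely, which is why both deployment selectors now ship with it off.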
- Remove the 'private at the time of writing' callout in the intro -- the cookbook should read like a normal product doc, not a release note.
- Reword the MTP tip to drop the 'requires a patched branch / once upstream' framing.
- 'Validated on...' -> 'Reference run on...' in the benchmark section.
Drop the 'nightly PyPI builds required' framing -- the latest stable release works. Match the standard LLaDA-2.1-style pointer to the official install guide.
- Replace 'python3 -m sglang.launch_server' with 'sglang serve' in the cookbook and both deployment snippets, matching the Llama3.1-style pattern used elsewhere in docs_new.
- 1T snippet: remove the dead 'envPrefix = isGB && !isSingleNode ? ...' branch (gemini-bot review). The two predicates are derived from the same hardware IDs, so the branch never fired; multi-node deployment only applies to non-GB hardware here.
The previous heading '### 4.3 Thinking Mode (`detailed thinking on/off`)' produced an unreliable anchor due to backticks, parens, and the slash. Drop the parenthetical so the heading slugs cleanly to '#43-thinking-mode' (matching the convention used by other cookbooks, e.g. Gemma4 '#51-speed-benchmark').
Mintlify generates '#4-3-thinking-mode' (with a dash) for the '### 4.3 Thinking Mode' heading -- not '#43-thinking-mode' as I assumed. Fix both references to use the correct slug.
Summary
Combined cookbook covering the Ling-2.6 family from inclusionAI:
Both share the 1:7 MLA + Lightning Linear hybrid attention backbone introduced in Ling-2.5, refined for token efficiency and agentic workloads (BFCL-V4, TAU2-bench, SWE-bench Verified, Claw-Eval, PinchBench).

What's included
- docs_new/cookbook/autoregressive/InclusionAI/Ling-2.6.mdx -- single combined cookbook page covering both variants, with per-variant deployment selectors, tool-calling examples, and a thinking-mode section.
- docs_new/src/snippets/autoregressive/ling-26-flash-deployment.jsx -- Hardware (H20-3e / H100 / H200 / B200), context length (128K vs 256K via YaRN), tool-call parser, and optional reasoning parser selectors.
- docs_new/src/snippets/autoregressive/ling-26-1t-deployment.jsx -- Hardware (GB300 / GB200 single-node TP=4 vs H200 / B200 two-node TP=8 + PP=2), tool-call parser, and optional reasoning parser selectors.
- docs_new/docs.json -- adds Ling-2.6 to the InclusionAI group (above the existing 2.5 entries).

Notable details
- The 1T FP8 weights fit on a single GB300 / GB200 node with --tp 4. H200 / B200 fall back to a 2-node TP=8 + PP=2 deployment.
- Launch commands use the sglang serve CLI (matches the Llama3.1-style pattern elsewhere in docs_new).
- The chat template defaults to detailed thinking off and is toggled by textual directives in the system message (detailed thinking on / detailed thinking off) -- it does not read the Qwen3-style enable_thinking template variable.
- The baseline commands omit --reasoning-parser qwen3. SGLang's qwen3 reasoning parser assumes default-thinking semantics that mismatch Ling-2.6's default-off template; combining the two requires every tool-call request to include chat_template_kwargs.enable_thinking=false to prevent <tool_call>...</tool_call> from being routed into reasoning_content. The cookbook's §4.3 "Thinking Mode" section walks through both how to enable thinking and the parser caveat.

Test plan
- Structure mirrors the existing Ling-2.5 pages (Ling-2.5-1T.mdx / ling-25-1t-deployment.jsx).
- docs_new/docs.json validates as JSON; new Ling-2.6 page registered in the InclusionAI group.