Nanbeige/Nanbeige4.1-3B
Nanbeige4.1-3B is a 3-billion-parameter model, developed by Chen Yang et al. (the authors of the technical report), that builds upon Nanbeige4-3B-Base and is optimized through supervised fine-tuning (SFT) and reinforcement learning (RL). This compact model is designed to simultaneously achieve robust reasoning, strong preference alignment, and effective agentic behavior. It excels at solving complex, multi-step problems and supports deep-search tasks with extensive tool invocations, filling a gap left by small general models that rarely perform well in both reasoning and agentic scenarios.
Nanbeige4.1-3B: A Small General Model for Reasoning and Agentic Tasks
Nanbeige4.1-3B is an enhanced 3-billion-parameter model, developed by Chen Yang et al., that significantly improves upon its predecessor, Nanbeige4-3B-Thinking-2511, through supervised fine-tuning (SFT) and reinforcement learning (RL). The model demonstrates that compact architectures can deliver high performance across multiple critical areas.
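As a quick-start illustration, the model can presumably be loaded like any Hugging Face causal LM. This is a minimal sketch assuming the standard `transformers` chat-template workflow; the exact prompt format, generation settings, and `generate()` helper below are assumptions, not taken from the model card.

```python
# Hypothetical usage sketch for Nanbeige/Nanbeige4.1-3B via Hugging Face
# transformers. Generation settings and helper names are illustrative.

MODEL_ID = "Nanbeige/Nanbeige4.1-3B"

def build_messages(question: str) -> list[dict]:
    """Build a single-turn conversation in the standard chat messages format."""
    return [{"role": "user", "content": question}]

def generate(question: str, max_new_tokens: int = 512) -> str:
    """Load the model lazily and decode only the newly generated tokens."""
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
    model = AutoModelForCausalLM.from_pretrained(MODEL_ID, device_map="auto")
    inputs = tokenizer.apply_chat_template(
        build_messages(question),
        add_generation_prompt=True,
        return_tensors="pt",
    ).to(model.device)
    output = model.generate(inputs, max_new_tokens=max_new_tokens)
    # Strip the prompt tokens, keep only the model's reply.
    return tokenizer.decode(output[0][inputs.shape[-1]:], skip_special_tokens=True)

if __name__ == "__main__":
    print(generate("Solve step by step: what is 12 * 34?"))
```

The heavy model download is kept behind `__main__` so the helpers can be imported and inspected without loading weights.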
Key Capabilities and Differentiators
- Strong Reasoning: Nanbeige4.1-3B is adept at solving complex, multi-step problems, consistently producing correct answers on challenging benchmarks such as LiveCodeBench-Pro, IMO-Answer-Bench, and AIME 2026 I. It significantly outperforms same-scale models, and even larger ones such as Qwen3-30B-A3B and Qwen3-32B, across a range of code, math, and science reasoning tasks.
- Robust Preference Alignment: The model exhibits solid alignment performance, surpassing not only same-scale competitors (e.g., Qwen3-4B-2507) but also substantially larger models (e.g., Qwen3-30B-A3B) on benchmarks such as Arena-Hard-v2 and Multi-Challenge.
- Advanced Agentic Capability: Uniquely for a small general model, Nanbeige4.1-3B natively supports deep-search tasks and can reliably sustain complex problem-solving involving over 500 rounds of tool invocations. This capability fills a notable gap, as most small models are typically optimized for either general reasoning or agentic scenarios, but rarely both.
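The sustained multi-round tool use described above can be pictured as a generic agent loop: the model repeatedly emits a tool call, reads the result, and only stops when it decides to answer or a round budget runs out. The sketch below is a toy illustration with a mock policy and a mock search tool; none of these names reflect Nanbeige4.1-3B's actual agent framework.

```python
# Toy deep-search agent loop. `mock_model` and `search_tool` are stand-ins
# for a real LLM policy and a real search backend; they exist only to show
# the control flow of many consecutive tool-invocation rounds.

def search_tool(query: str) -> str:
    """Mock tool: a real system would query a search index or the web."""
    return f"results for {query!r}"

def mock_model(history: list[str], round_no: int, budget: int) -> dict:
    """Mock policy: keep issuing tool calls until the final budgeted round."""
    if round_no < budget - 1:
        return {"action": "tool", "query": f"subquestion {round_no}"}
    return {"action": "answer", "text": "final answer"}

def agent_loop(budget: int = 500) -> tuple[str, int]:
    """Run tool rounds until the policy answers; return (answer, tool rounds used)."""
    history: list[str] = []
    for round_no in range(budget):
        step = mock_model(history, round_no, budget)
        if step["action"] == "answer":
            return step["text"], round_no
        # Feed the tool observation back into the context for the next round.
        history.append(search_tool(step["query"]))
    return "budget exhausted", budget

answer, rounds = agent_loop()
```

The point of the sketch is the loop structure: with a 500-round budget, the context accumulates one observation per round, which is why long-horizon agentic tasks stress both the model's consistency and its context handling.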
Performance Highlights
Nanbeige4.1-3B shows superior performance across general reasoning and deep-search benchmarks. For instance, it achieves 76.9 on LiveCodeBench-V6 (versus 66.0 for Qwen3-30B-A3B-2507) and 87.40 on AIME 2026 I (versus 87.30 for Qwen3-30B-A3B). In deep-search tasks, it scores 75 on xBench-DeepSearch-2505, significantly outperforming other small foundation models such as Qwen3-4B-2507 (34) and even competing with specialized small agent models.
Limitations
While safety is emphasized during training, the model's probabilistic nature and small size mean it may occasionally generate unexpected or harmful content. Users are advised to exercise caution and to avoid propagating such outputs.