- [2026.06.12] Release SenseNova-U1-8B-MoT-Infographic-LoRA-8step-V1.0 for faster infographic generation. Please see the example script.
- [2026.06.11] Release SenseNova-U1-8B-MoT-Interleaved 📖, specially optimized for interleaved image-text generation, with notably improved narrative coherence, character and style consistency, and text-image alignment in multi-page content.
- [2026.05.21] Release the full-parameter fine-tuning training code for SenseNova-U1.
- [2026.05.15] Release SenseNova-U1-8B-MoT-Infographic 📊 model for improved infographic generation. See U1 Infographic Model for details, and ✨ Infographic Showcases for 100 generated examples.
- [2026.05.10] Release 🔥SenseNova-U1 Technical Report🔥 and the weights for SenseNova-U1-A3B-MoT-SFT & SenseNova-U1-A3B-MoT.
- [2026.05.08] Add GGUF quantized checkpoints and layer-offload VRAM modes for low-VRAM single-GPU inference. See Memory-efficient inference. GGUF weights for SenseNova-U1-8B-MoT-Merger are available at 🤗 smthem/SenseNova-U1-8B-MoT-Merger-gguf; many thanks to @smthemex for contributing the quantized weights.
- [2026.05.06] Release SenseNova-U1-8B-MoT-LoRA-8step-V1.0. Please see the example script.
- [2026.04.30] Release the preview version of the 8-step inference model SenseNova-U1-8B-MoT-8step-preview. In most cases, the image generation quality of this model closely matches that of the base model (see comparison and existing issues). To test this model, use the inference scripts with the following parameters: `--cfg_scale 1.0 --num_steps 8`.
- [2026.04.27] Initial release of the weights for SenseNova-U1-8B-MoT-SFT and SenseNova-U1-8B-MoT.
- [2026.04.27] Initial release of the inference code for SenseNova-U1.
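As a rough way to see where the 8-step speedup in the 2026.04.30 entry comes from, the sketch below counts denoiser forward passes per image. It assumes standard classifier-free guidance, where a guidance scale other than 1.0 costs one conditional and one unconditional pass per step, while a scale of exactly 1.0 needs only the conditional pass. Whether the SenseNova-U1 scripts actually skip the unconditional pass at 1.0 is an assumption, and `forward_passes` is a hypothetical helper, not part of the repository.

```python
def forward_passes(num_steps: int, cfg_scale: float) -> int:
    """Count denoiser forward passes for one image under standard CFG (assumed)."""
    # With cfg_scale == 1.0 the guided prediction equals the conditional one,
    # so the unconditional pass can be skipped.
    passes_per_step = 1 if cfg_scale == 1.0 else 2
    return num_steps * passes_per_step


base = forward_passes(num_steps=50, cfg_scale=4.0)  # flags used in the T2I example
fast = forward_passes(num_steps=8, cfg_scale=1.0)   # flags listed for the 8-step preview
print(f"base: {base} passes, 8-step: {fast} passes, ratio: {base / fast:.1f}x")
```

This counts passes only; per-pass cost, batching of the two CFG branches, and non-denoising stages are ignored.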
🚀 SenseNova U1 is a new series of native multimodal models that unifies multimodal understanding, reasoning, and generation within a monolithic architecture. It marks a fundamental paradigm shift in multimodal AI: from modality integration to true unification. Rather than relying on adapters to translate between modalities, SenseNova U1 models think-and-act across language and vision natively.
Unifying visual understanding and generation in an end-to-end architecture from pixel to word opens tremendous possibilities, enabling highly efficient and strong understanding, generation, and interleaved reasoning in a natively multimodal manner.
At the core of SenseNova U1 is NEO-unify, a novel architecture designed from first principles for multimodal AI. It eliminates both the Visual Encoder (VE) and the Variational Auto-Encoder (VAE), so that pixel and word information are modeled as inherently and deeply correlated. Its key features are:
- 🔗 Model language and visual information end-to-end as a unified compound.
- 🖼️ Preserve semantic richness while maintaining pixel-level visual fidelity.
- 🧠 Reason across modalities with high efficiency & minimal conflict via native MoTs.
Powered by this new core architecture, SenseNova U1 delivers exceptional efficiency in multimodal learning:
Left: Generation Latency vs. Average Performance on OneIG (EN, ZH), LongText (EN, ZH), BizGenEval (Easy, Hard), CVTG, and IGenBench.
Right: Generation Latency vs. Average Performance on Infographic Benchmarks, i.e., BizGenEval (Easy, Hard) and IGenBench.
- 🏆 Open-source SoTA in both understanding and generation: SenseNova U1 sets a new standard for unified multimodal understanding and generation, achieving state-of-the-art performance among open-source models across a wide range of understanding, reasoning, and generation benchmarks.
- 📖 Native interleaved image-text generation: SenseNova U1 can generate coherent interleaved text and images in a single flow with one model, enabling use cases such as practical guides and travel diaries that combine clear communication with vivid storytelling and transform complex information into intuitive visuals.
- 📰 High-density information rendering: SenseNova U1 demonstrates strong capabilities in dense visual communication, generating richly structured layouts for knowledge illustrations, posters, presentations, comics, resumes, and other information-rich formats.
- 🤖 Vision–Language–Action (VLA)
- 🌐 World Modeling (WM)
In this release, we are open-sourcing the SenseNova U1 Lite series in two sizes:
- SenseNova U1-8B-MoT — dense backbone
- SenseNova U1-A3B-MoT — MoE backbone
| Model | Params | HF Weights |
|---|---|---|
| SenseNova-U1-8B-MoT-Interleaved | 8B MoT | 🤗 link |
| SenseNova-U1-8B-MoT-Infographic | 8B MoT | 🤗 link |
| SenseNova-U1-8B-MoT-Infographic-LoRA-8step-V1.0 | 0.4B | 🤗 link |
| SenseNova-U1-8B-MoT-SFT | 8B MoT | 🤗 link |
| SenseNova-U1-8B-MoT | 8B MoT | 🤗 link |
| SenseNova-U1-8B-MoT-LoRA-8step-V1.0 | 0.4B | 🤗 link |
| SenseNova-U1-A3B-MoT-SFT | A3B MoT | 🤗 link |
| SenseNova-U1-A3B-MoT | A3B MoT | 🤗 link |
Here SFT models (×32 downsampling ratio) are trained via Understanding Warmup, Generation Pre-training, Unified Mid-training, and Unified SFT, with final models obtained after an initial round of T2I RL training.
Although relatively compact by today’s standards, these models already show strong performance across diverse tasks, comparable to commercial models with excellent cost efficiency. Notably, larger-scale versions are planned to further enhance capability and performance in the future.
💡 The `8B-MoT` in `SenseNova-U1-8B-MoT` refers to ~8B understanding parameters and ~8B generation parameters. See parameter breakdown for details.
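To make the note above concrete, here is a back-of-the-envelope sketch of what the two ~8B parameter groups imply for weight storage in bf16 (2 bytes per parameter). The parameter counts are the approximate figures from the note, `weight_memory_gib` is a hypothetical helper, and the estimate covers weights only, not activations or caches.

```python
def weight_memory_gib(num_params: float, bytes_per_param: float = 2.0) -> float:
    """Memory needed to hold the weights alone, in GiB (bf16 = 2 bytes per parameter)."""
    return num_params * bytes_per_param / 2**30


understanding_params = 8e9  # ~8B, approximate figure from the note above
generation_params = 8e9     # ~8B, approximate figure from the note above
total_params = understanding_params + generation_params

print(f"total parameters: ~{total_params / 1e9:.0f}B")
print(f"bf16 weights:     ~{weight_memory_gib(total_params):.1f} GiB")
```

See the parameter breakdown for the exact counts; the figures here are rounded.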
- Training code of SenseNova-U1
- Final weights and technical report of SenseNova-U1
🖼️ Text-to-Image (Reasoning)
📸 More generation samples: see Image Generation Gallery.
✏️ Image Editing (General)
✏️ Image Editing (Reasoning)
📸 More editing samples: see Image Editing Gallery.
📸 More interleaved samples: see Interleaved Generation Gallery.
📸 More understanding samples: see Visual Understanding Gallery.
Evaluation scripts and benchmark reproduction guides are provided in `evaluation`.
Despite strong performance across tasks, several limitations remain:

- Visual Understanding: The current model supports a context length of at most 32K tokens, which may constrain performance in scenarios requiring longer or more complex visual contexts.
- Human-centric Generation: Fine-grained details of human bodies can be challenging, especially when people appear as small elements within a scene or are engaged in complex interactions with surrounding objects.
- Text-based Generation: Text rendering may sometimes produce misspellings, distorted characters, or formatting inconsistencies. These errors are sensitive to how prompts are phrased, especially in text-heavy scenarios (see prompt enhancement for best practice).
- Interleaved Generation: As an experimental feature, interleaved generation is still evolving and may not yet match the performance of dedicated text-to-image (T2I) pipelines.
- Beta status: RL has not been specifically optimized for visual editing, reasoning, and interleaved tasks, and current performance on these tasks is comparable to that of the SFT models.

We view these areas as active directions and expect continued improvements in future iterations.
The fastest way to experience SenseNova-U1 is through SenseNova-Studio — a 🆓 free online playground where you can try the model directly in your browser, no installation or GPU required.
Note: To serve more users, U1-Fast has undergone step and CFG distillation, and is dedicated to infographic generation.
The easiest way to integrate SenseNova-U1 into your own agent or application is through our companion repository SenseNova-Skills (OpenClaw) 🦞, which ships SenseNova-U1 as a ready-to-use skill with a unified tool-calling interface.
Refer to the SenseNova-Skills README for installation and usage details.
Setup: Follow the Installation Guide to clone the repo and install dependencies with uv.
📝 Visual Understanding
python examples/vqa/inference.py \
--model_path sensenova/SenseNova-U1-8B-MoT \
--image examples/vqa/data/images/menu.jpg \
--question "My friend and I are dining together tonight. Looking at this menu, can you recommend a good combination of dishes for 2 people? We want a balanced meal — a mix of mains and maybe a starter or dessert. Budget-conscious but want to try the highlights." \
--output outputs/answer.txt \
--max_new_tokens 8192 \
--do_sample --temperature 0.6 --top_p 0.95 --top_k 20 --repetition_penalty 1.05 \
--profile

See `examples/README.md` for batched inference, generation parameters, and JSONL format.
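For readers unfamiliar with the sampling flags in the command above, the following is a generic illustration of what `--temperature`, `--top_k`, and `--top_p` conventionally do to a next-token distribution. It is not the repository's implementation; `filter_distribution` and the toy logits are hypothetical, and `--repetition_penalty` is omitted.

```python
import math


def filter_distribution(logits, temperature=0.6, top_k=20, top_p=0.95):
    """Return the renormalized distribution left after temperature, top-k, and top-p."""
    # Temperature: divide logits before the softmax (values < 1 sharpen the distribution).
    scaled = {tok: value / temperature for tok, value in logits.items()}
    peak = max(scaled.values())
    weights = {tok: math.exp(value - peak) for tok, value in scaled.items()}
    total = sum(weights.values())
    probs = sorted(((tok, w / total) for tok, w in weights.items()),
                   key=lambda item: item[1], reverse=True)

    # Top-k: keep only the k most likely tokens, then renormalize.
    probs = probs[:top_k]
    kept_mass = sum(p for _, p in probs)
    probs = [(tok, p / kept_mass) for tok, p in probs]

    # Top-p: keep the smallest prefix whose cumulative probability reaches top_p.
    kept, cumulative = [], 0.0
    for tok, p in probs:
        kept.append((tok, p))
        cumulative += p
        if cumulative >= top_p:
            break

    norm = sum(p for _, p in kept)
    return {tok: p / norm for tok, p in kept}


toy_logits = {"pasta": 3.0, "salad": 2.0, "soup": 1.0, "cake": -2.0}
candidates = filter_distribution(toy_logits)
print(candidates)  # sampling would then draw one token from this distribution
```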
🖼️ Text-to-Image
python examples/t2i/inference.py \
--model_path sensenova/SenseNova-U1-8B-MoT \
--prompt "这张信息图的标题是“SenseNova-U1”,采用现代极简科技矩阵风格。整体布局为水平三列网格结构,背景是带有极浅银灰色细密点阵的哑光纯白高级纸张纹理,画面长宽比为16:9。\n\n排版采用严谨的视觉层级:主标题使用粗体无衬线黑体字,正文使用清晰的现代等宽字体。配色方案极其克制,以纯白色为底,深炭黑为主视觉文字和边框,浅石板灰用于背景色块和次要信息区分,图标采用精致的银灰色线框绘制。\n\n在画面正上方居中位置,使用醒目的深炭黑粗体字排布着大标题“SenseNova-U1”。标题正下方是浅石板灰色的等宽字体副标题“新一代端到端统一多模态大模型家族”。\n\n画面主体分为左、中、右三个相等的垂直信息区块,区块之间通过充足的负空间进行物理隔离。\n\n左侧区块的主题是概述。顶部有一个银灰色线框绘制的、由放大镜和齿轮交织的图标,旁边是粗体小标题“Overview”。该区块内从上到下垂直排列着三个要点:第一个要点旁边是一个代表文档与照片重叠的极简图标,紧跟着文字“多模态模型家族,统一文本/图像理解和生成”。向下是由两个相连的同心圆组成的架构图标,配有文字“基于NEO-Unify架构(端到端统一理解和生成)”。最下方是一个带有斜线划掉的眼睛和漏斗形状的图标,明确指示文本“无需视觉编码器(VE)和变分自编码器(VAE)”。\n\n中间区块展示模型矩阵。顶部是一个包含两个分支节点的树状网络图标,旁边是粗体小标题“两个模型规格”。区块内分为上下两个包裹在浅石板灰色极细边框内的卡片。上方的卡片内画着一个代表高密度的实心几何立方体图标,大字标注“SenseNova-U1-8B-MoT”,下方是等宽字体说明“8B MoT 密集主干模型”。下方的卡片内画着一个带有闪电符号的网状发光大脑图标,大字标注“SenseNova-U1-A3B-MoT”,下方是等宽字体说明“A3B MoT 混合专家(MoE)主干模型”。在这两个独立卡片的正下方,左侧放置一个笑脸轮廓图标搭配文字“将在HF等平台公开”,右侧放置一个带有折角的书面报告图标搭配文字“将发布技术报告”。\n\n右侧区块呈现核心优势。顶部是一个代表巅峰的上升阶梯折线图图标,旁边是粗体小标题“Highlights”。该区块内部垂直分布着四个带有浅石板灰底色的长方形色块,每个色块内部左侧对应一个具体的图标,右侧为文字。第一个色块内是一个无缝相连的莫比乌斯环图标,配文“原生统一架构,无VE和VAE”。第二个色块内是一个顶端带有星星的奖杯图标,配文“单一统一模型在理解和生成任务上均达到SOTA性能”。第三个色块内是代表文本行与拍立得照片交替穿插的图标,配文“强大的原生交错推理能力(模型原生生成图像进行推理)”。最后一个色块内是一个被切分出一小块的硬币与详细饼状图结合的图标,配文“能生成复杂信息图表,性价比出色”。" \
--width 2720 --height 1536 \
--cfg_scale 4.0 --cfg_norm none --timestep_shift 3.0 --num_steps 50 \
--output output.png --profile

Default resolution is 2048×2048 (1:1). See supported resolution buckets for other aspect ratios.
For high-quality infographic generation, it is recommended to apply prompt enhancement before generating images.
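To relate the `--width` / `--height` values above to the ×32 downsampling ratio mentioned for the released models, the sketch below computes the resulting grid size. It assumes that the ratio means one grid position per 32×32 pixel patch, which is an interpretation rather than something this README states; `patch_grid` is a hypothetical helper.

```python
def patch_grid(width: int, height: int, downsample: int = 32):
    """Return (columns, rows, positions), assuming one position per downsample x downsample patch."""
    if width % downsample or height % downsample:
        raise ValueError(f"{width}x{height} is not divisible by {downsample}")
    cols, rows = width // downsample, height // downsample
    return cols, rows, cols * rows


for w, h in [(2048, 2048), (2720, 1536)]:
    cols, rows, positions = patch_grid(w, h)
    print(f"{w}x{h}: {cols} x {rows} = {positions} positions")
```

Under this assumption, the default 1:1 resolution and the 16:9 example land on nearly the same number of positions, which is consistent with bucketing resolutions by area rather than by side length.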
✏️ Image Editing
python examples/editing/inference.py \
--model_path sensenova/SenseNova-U1-8B-MoT \
--prompt "Change the animal's fur color to a darker shade." \
--image examples/editing/data/images/1.webp \
--cfg_scale 4.0 --img_cfg_scale 1.0 --cfg_norm none --timestep_shift 3.0 --num_steps 50 \
--output output_edited.png --profile --compare

💡 Pre-resize inputs to ~2048×2048 resolution with the original aspect ratio before inference for best quality (see `examples/editing/resize_inputs.py`).
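The pre-resize tip above amounts to scaling an image to roughly 2048×2048 pixels of area while keeping its aspect ratio. A minimal sketch of that arithmetic follows. It is not the logic of `examples/editing/resize_inputs.py`; `target_size` is a hypothetical helper, and snapping each side to a multiple of 32 is an assumption borrowed from the ×32 downsampling ratio.

```python
import math


def target_size(width: int, height: int, target_area: int = 2048 * 2048, multiple: int = 32):
    """Scale (width, height) to roughly target_area pixels, keeping the aspect ratio."""
    scale = math.sqrt(target_area / (width * height))
    # Snap each side to the nearest multiple (assumed), never below one multiple.
    new_w = max(multiple, round(width * scale / multiple) * multiple)
    new_h = max(multiple, round(height * scale / multiple) * multiple)
    return new_w, new_h


print(target_size(1024, 1024))  # square input
print(target_size(4000, 3000))  # 4:3 input
```

The returned size can then be passed to any image library's resize call before running the editing script.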
♻️ Interleaved Generation
python examples/interleave/inference.py \
--model_path sensenova/SenseNova-U1-8B-MoT \
--prompt "I want to learn how to cook tomato and egg stir-fry. Please give me a beginner-friendly illustrated tutorial." \
--resolution "16:9" \
--output_dir outputs/interleave/ \
--stem demo \
--profile

See `examples/README.md` for batched inference, JSONL format, prompt enhancement, resolution buckets, and full flag reference.
See `docs/gpu_mem_profiler.md` for the GPU memory profiler.
For users running on a single consumer GPU, two complementary features lower the VRAM footprint of the transformers path. They can be combined freely.
Pass --gguf_checkpoint to any of the four inference scripts (t2i, editing, interleave, vqa) to load a quantized .gguf file via the diffusers GGUF Linear layer instead of the bf16 safetensors weights. The base --model_path is still required (for tokenizer / config / non-LM weights).
# install the optional extra once
uv pip install -e ".[gguf]" # or: pip install "gguf>=0.10.0" "diffusers>=0.30.0"
python examples/t2i/inference.py \
--model_path sensenova/SenseNova-U1-8B-MoT \
--gguf_checkpoint /path/to/SenseNova-U1-8B-MoT-Merger-Q4_K_M.gguf \
--prompt "A male peacock trying to attract a female" \
--output output.png

GGUF weights for SenseNova-U1-8B-MoT-Merger (multiple quant levels: Q3 / Q4 / Q5 / Q6 / Q8) are available at:
| Quantized weights | HF link |
|---|---|
| SenseNova-U1-8B-MoT-Merger-gguf | 🤗 smthem/SenseNova-U1-8B-MoT-Merger-gguf |
🙏 Thanks to GitHub user @smthem for contributing the quantized GGUF weights to the community.
Pass --vram_mode to keep the language-model layers resident on CPU pinned memory and stream them onto the GPU on-demand during forward, freeing weight VRAM while keeping activations on-device.
| Mode | Behavior | When to use |
|---|---|---|
| `full` (default) | No offload; whole model on GPU | Plenty of VRAM, best speed |
| `low` | Synchronous per-layer CPU↔GPU swap | Lowest VRAM footprint |
| `balanced` | Async prefetch overlaps H2D copy with compute | Tight on VRAM but want to recover speed |
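To see why `balanced` can recover speed relative to `low`, the toy timing model below compares a synchronous per-layer swap with a prefetch that overlaps the next layer's host-to-device copy with the current layer's compute. The layer count and per-layer timings are hypothetical, both functions are illustrative helpers, and real behavior also depends on synchronization overhead and PCIe bandwidth.

```python
def low_mode_time(copy_ms, compute_ms):
    """Synchronous swap: each layer is copied to the GPU, then computed, with no overlap."""
    return sum(copy_ms) + sum(compute_ms)


def balanced_mode_time(copy_ms, compute_ms):
    """Async prefetch: the copy of layer i+1 runs while layer i computes."""
    total = copy_ms[0]  # the first copy has no compute to hide behind
    for i, compute in enumerate(compute_ms):
        next_copy = copy_ms[i + 1] if i + 1 < len(copy_ms) else 0.0
        total += max(compute, next_copy)  # the slower of the two determines the step
    return total


copy_ms = [4.0] * 32     # hypothetical per-layer H2D copy time
compute_ms = [5.0] * 32  # hypothetical per-layer compute time

print(f"low:      {low_mode_time(copy_ms, compute_ms):.0f} ms")
print(f"balanced: {balanced_mode_time(copy_ms, compute_ms):.0f} ms")
```

In this model the overlap hides the copies completely only while each copy is no slower than the compute it runs behind; otherwise the copy time dominates.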
python examples/t2i/inference.py \
--model_path sensenova/SenseNova-U1-8B-MoT \
--vram_mode balanced \
--prompt "..." --output output.png

`--gguf_checkpoint` and `--vram_mode` compose: a Q4 GGUF + `balanced` is the recommended setup for ~10–12 GB consumer cards.
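A minimal sketch of driving that combined setup from Python is shown below. It only assembles the flags documented in this section; `build_t2i_command` is a hypothetical helper, and the GGUF path is the placeholder from the earlier example.

```python
import shlex


def build_t2i_command(prompt, gguf_checkpoint=None, vram_mode="full", output="output.png"):
    """Assemble the argv list for examples/t2i/inference.py from flags shown in this README."""
    argv = [
        "python", "examples/t2i/inference.py",
        "--model_path", "sensenova/SenseNova-U1-8B-MoT",  # still required with GGUF
    ]
    if gguf_checkpoint is not None:
        argv += ["--gguf_checkpoint", gguf_checkpoint]
    if vram_mode != "full":  # "full" is the documented default, so the flag can be omitted
        argv += ["--vram_mode", vram_mode]
    argv += ["--prompt", prompt, "--output", output]
    return argv


command = build_t2i_command(
    prompt="A male peacock trying to attract a female",
    gguf_checkpoint="/path/to/SenseNova-U1-8B-MoT-Merger-Q4_K_M.gguf",
    vram_mode="balanced",
)
print(shlex.join(command))
```

To run it instead of printing it, pass `command` to `subprocess.run(command, check=True)` from the repository root.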
For production serving, we co-design a dedicated inference stack on top of LightLLM (understanding) and LightX2V (generation). The two engines are disaggregated so that each path can use its own parallelism and resource budget, with a low-overhead transfer channel in between.
On a single node with TP2 + CFG2, this stack delivers ~0.15 s/step and ~9 s end-to-end for a 2048×2048 image on H100 / H200, with a ~2.4–3.2× prefill speedup from our FA3-based hybrid-mask attention over the Triton baseline. Full per-GPU performance numbers are reported in `docs/inference_infra.md`.
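As a rough reading of these figures, the sketch below splits the end-to-end time into denoising and everything else. It assumes the measurement used 50 steps, the `--num_steps` value from the examples above; this paragraph does not state the step count, so treat the split as indicative only.

```python
seconds_per_step = 0.15   # reported above
end_to_end_seconds = 9.0  # reported above
num_steps = 50            # assumption: same --num_steps as the inference examples

denoising_seconds = seconds_per_step * num_steps
other_seconds = end_to_end_seconds - denoising_seconds

print(f"denoising: ~{denoising_seconds:.1f} s")
print(f"other stages (prefill, transfer, decoding, ...): ~{other_seconds:.1f} s")
```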
An official docker image is provided for one-command deployment:
docker pull lightx2v/lightllm_lightx2v:20260407

⚙️ Deployment guide (Docker, launch flags, modes, quantization, API test): see `docs/deployment.md`.

📖 Full design and performance profiling: see `docs/inference_infra.md`.
Join our growing community to share feedback, get support, and stay updated on the latest SenseNova-U1 developments — we'd love to hear from you!
| Discord | WeChat Group |
If this project is helpful for your research, please consider giving it a star ⭐ and citing it 📝:
@misc{sensenova2026neounify,
title = {NEO-unify: Building Native Multimodal Unified Models End to End},
author = {SenseNova},
journal = {Hugging Face blog},
url = {https://huggingface.co/blog/sensenova/neo-unify},
year = {2026}
}
@article{sensenova2026sensenovau1,
title = {SenseNova-U1: Unifying Multimodal Understanding and Generation with NEO-unify Architecture},
author = {Diao, Haiwen and Wu, Penghao and Deng, Hanming and Wang, Jiahao and Bai, Shihao and Wu, Silei and Fan, Weichen and Ye, Wenjie and Tong, Wenwen and Fan, Xiangyu and others},
journal = {arXiv preprint arXiv:2605.12500},
year = {2026}
}

This project is released under the Apache 2.0 License.



















































































