DeepSeek LLM is the model that started everything - the original open-source foundation that proved DeepSeek could train frontier-quality language models from scratch. 2 trillion tokens. Novel scaling laws. Beating LLaMA-2 70B. Surpassing GPT-3.5 in Chinese. Released December 2023, fully open for commercial use.
DeepSeek LLM ships as four open-source models: Base models (raw pre-trained foundation for research and fine-tuning) and Chat models (instruction-following, SFT + DPO aligned) at both 7B and 67B scale.
**DeepSeek LLM 7B Base:** The 7-billion parameter pre-trained foundation model. Ideal for supervised fine-tuning, RLHF research, domain adaptation, and academic studies of scaling behavior.
**DeepSeek LLM 7B Chat:** Fine-tuned chat variant via SFT and DPO alignment. Designed for consumer-grade hardware. GSM8K 0-shot: 63.0%. A strong small model for interactive applications and edge deployment.
**DeepSeek LLM 67B Base:** The flagship pre-trained foundation. 67B parameters with Grouped-Query Attention (GQA). Outperforms LLaMA-2 70B Base across reasoning, coding, math, and Chinese. The best open-source base model of its era.
**DeepSeek LLM 67B Chat:** The flagship chat model. SFT + DPO aligned. HumanEval 73.78%, GSM8K 84.1% zero-shot, MATH 32.6% zero-shot. Surpasses GPT-3.5 in Chinese. Hungarian National High School Exam: 65.
DeepSeek LLM follows the LLaMA auto-regressive transformer decoder architecture but makes deliberate changes — most notably adjusting layer counts for pipeline efficiency and using GQA for the 67B model.
DeepSeek LLM is an auto-regressive transformer decoder closely following LLaMA's architecture — the de facto open-source standard at the time. Key shared elements: RMSNorm for pre-normalization, SwiGLU activation functions, and no bias terms in attention or FFN layers.
The primary deviation from LLaMA is the layer count: the 7B model uses 30 layers and the 67B uses 95 layers — rather than the typical 32/80 counts — specifically to optimize pipeline parallelism partitioning during training and inference.
Positional encoding uses Rotary Position Embedding (RoPE) throughout: positions are injected by rotating query and key vectors, so attention scores depend on the relative offset between tokens rather than on learned absolute position embeddings. The tokenizer is a custom Byte-level BPE implementation built on the Hugging Face Tokenizer library — importantly, not SentencePiece, which has implications for quantization tool compatibility.
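For readers who want the mechanics, below is a minimal sketch of the rotate-half RoPE formulation used by LLaMA-style models — illustrative only, not DeepSeek's exact code.

```python
import torch

def apply_rope(x: torch.Tensor, base: float = 10000.0) -> torch.Tensor:
    """Apply rotary position embedding to x of shape [seq_len, n_heads, head_dim].

    Pairs of channels are rotated by a position-dependent angle, so the dot
    product between a rotated query and key depends on their relative offset."""
    seq_len, _, head_dim = x.shape
    half = head_dim // 2
    # One rotation frequency per channel pair, geometrically spaced.
    inv_freq = base ** (-torch.arange(half, dtype=torch.float32) / half)
    angles = torch.arange(seq_len, dtype=torch.float32)[:, None] * inv_freq   # [seq, half]
    cos, sin = angles.cos()[:, None, :], angles.sin()[:, None, :]             # broadcast over heads
    x1, x2 = x[..., :half], x[..., half:]
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)

q = torch.randn(4096, 32, 128)   # e.g. the 7B config: 32 heads, head_dim 128
q_rotated = apply_rope(q)
```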
| Component | 7B | 67B |
|---|---|---|
| Layers | 30 | 95 |
| Attention heads | 32 | 64 |
| KV heads | 32 (MHA) | 8 (GQA) |
| Hidden dim | 4096 | 8192 |
| Context length | 4096 | 4096 |
| Vocab size | 100,000 | 100,000 |
| Positional enc. | RoPE | RoPE |
| Activation | SwiGLU | SwiGLU |
The 67B model uses Grouped-Query Attention (GQA) instead of standard Multi-Head Attention (MHA). GQA groups multiple query heads to share a single set of key-value heads, dramatically reducing the KV cache memory footprint during inference.
With 64 query heads grouped into 8 KV heads, the 67B model's KV cache is 8× smaller than an equivalent MHA model — enabling practical deployment on 4×A100 80GB or similar hardware that would otherwise be unable to serve the model at reasonable sequence lengths.
This was a forward-looking design choice at the time: GQA had just been published by Ainslie et al. (2023) and the DeepSeek team was among the first to adopt it in a major open-source release at 67B scale.
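To make the 8× figure concrete, here is a back-of-the-envelope KV-cache calculation using the 67B numbers from the table above. It is only a sketch: it assumes 16-bit cache entries and ignores framework overhead.

```python
def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   seq_len: int, batch: int = 1, dtype_bytes: int = 2) -> int:
    # Two cached tensors per layer (K and V), each [batch, n_kv_heads, seq_len, head_dim].
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * batch * dtype_bytes

head_dim = 8192 // 64                                    # 128 for the 67B model
mha = kv_cache_bytes(95, 64, head_dim, seq_len=4096)     # hypothetical 67B with full MHA
gqa = kv_cache_bytes(95, 8, head_dim, seq_len=4096)      # the actual GQA configuration
print(f"MHA: {mha / 2**30:.1f} GiB, GQA: {gqa / 2**30:.1f} GiB, {mha // gqa}x smaller")
# roughly 11.9 GiB vs 1.5 GiB per 4K-token sequence
```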
DeepSeek LLM uses a custom Byte-level BPE tokenizer with a vocabulary of 100,000 tokens — substantially larger than LLaMA's 32K vocabulary. The larger vocabulary improves compression efficiency particularly for Chinese text, where character-level granularity would otherwise inflate token counts.
The tokenizer is implemented via the Hugging Face Tokenizer library with specially designed pre-tokenizers. Importantly, it does not use SentencePiece — a distinction that matters for compatibility with certain GGUF quantization pipelines (llama.cpp) which may require model-specific tokenizer support.
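A quick way to inspect the tokenizer's behaviour on mixed Chinese/English text is sketched below; the repo id is the 7B Base listing under deepseek-ai on the Hugging Face hub, and the files should load through the standard fast tokenizers backend (no SentencePiece involved).

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("deepseek-ai/deepseek-llm-7b-base")
print(tok.vocab_size)  # byte-level BPE vocabulary, ~100K entries

for text in ["DeepSeek trained its first models on 2 trillion tokens.",
             "深度求索在两万亿个词元上训练了它的第一批模型。"]:
    ids = tok.encode(text)
    print(f"{len(text):>3} chars -> {len(ids):>3} tokens")
```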
DeepSeek LLM's most significant research contribution wasn't the model itself — it was the new scaling law analysis that derived optimal hyperparameters beyond what Chinchilla described. This work directly informed every subsequent DeepSeek model.
Kaplan et al. (2020) and Hoffmann et al. (2022, Chinchilla) established that model size and training tokens should scale together optimally. DeepSeek's paper went further: it derived how batch size and learning rate should scale as a function of compute budget. This joint optimization is computationally non-trivial but yields measurable gains over models trained with fixed hyperparameters. The fitted power-law curves were validated across compute budgets from roughly 1e20 FLOPs up to the full 7B/67B training runs.
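The general shape of these rules can be illustrated with a short sketch. The power-law form mirrors the paper's analysis, but the coefficients and exponents below are placeholders for illustration, not the published fit.

```python
def optimal_batch_size(compute_flops: float, coeff: float = 0.29, exp: float = 0.33) -> float:
    """Optimal batch size (in tokens) grows as a power law of the compute budget."""
    return coeff * compute_flops ** exp

def optimal_learning_rate(compute_flops: float, coeff: float = 0.31, exp: float = -0.125) -> float:
    """Optimal peak learning rate shrinks slowly as the compute budget grows."""
    return coeff * compute_flops ** exp

for c in (1e20, 1e21, 1e22):
    print(f"C={c:.0e}  batch ~ {optimal_batch_size(c):,.0f} tokens  lr ~ {optimal_learning_rate(c):.1e}")
```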
The paper studied how scaling laws change when training on different data quality regimes — specifically comparing CommonCrawl-quality data (noisier, cheaper to acquire) versus higher-quality curated corpora. The key finding: scaling curves shift upward with better data, meaning a model trained on higher-quality data achieves the same loss as a larger model on lower-quality data. This insight shaped DeepSeek's subsequent emphasis on data curation pipelines in V2, V3, and beyond.
In an unusual and highly cited decision, DeepSeek made the intermediate training checkpoints of both the 7B and 67B Base models available via AWS S3. This allowed the research community to directly study the training dynamics — how loss decreases, how benchmark performance evolves, and how different capabilities emerge at different training token counts. This transparency accelerated academic research on LLM training dynamics significantly.
All benchmark results from the official paper (arXiv:2401.02954). At the time of release, DeepSeek-LLM-67B represented the best open-source model across coding, mathematics, and Chinese comprehension.
DeepSeek LLM was trained completely from scratch — no initialization from other models. The training infrastructure, data pipeline, and optimization recipe were all developed internally by DeepSeek.
The training corpus consists of 2 trillion tokens in English and Chinese — sourced primarily from Common Crawl with aggressive deduplication, filtering, and domain remixing. The filtering pipeline removes low-quality content, near-duplicates, and content with excessive non-alphabetic characters, following and extending methods from DeepSeek's internal research.
A key design decision was the domain remixing strategy: underrepresented high-value domains (scientific papers, technical documentation, Chinese web content) are upsampled relative to their natural frequency in Common Crawl. This improves downstream performance on reasoning and specialized tasks without requiring more raw data.
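A toy illustration of what domain remixing means in practice is below; the domain names, sizes, and upsampling factors are invented for the example, not DeepSeek's actual mixture.

```python
import random

# Invented domain sizes (in documents) and upsampling factors for high-value domains.
corpus_sizes = {"cc_web_en": 1_000_000, "cc_web_zh": 120_000, "papers": 8_000, "tech_docs": 5_000}
upsample = {"cc_web_en": 1.0, "cc_web_zh": 4.0, "papers": 20.0, "tech_docs": 15.0}

weights = {d: n * upsample[d] for d, n in corpus_sizes.items()}
total = sum(weights.values())
mixture = {d: w / total for d, w in weights.items()}
print(mixture)  # share of sampled training documents per domain after remixing

def sample_domain() -> str:
    # Draw the domain the next training document comes from.
    return random.choices(list(mixture), weights=list(mixture.values()))[0]
```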
Importantly, no instruction-tuning data is included in pre-training. The Base models are pure auto-regressive next-token prediction models. Chat models are created separately via SFT and DPO on the Base checkpoints.
Training uses the AdamW optimizer (β₁=0.9, β₂=0.95, weight_decay=0.1) with a multi-step learning rate schedule — not the standard cosine warmup + decay used by most contemporaneous models. The multi-step schedule allows for more precise control over learning rate reduction points, enabling the team to adapt the schedule based on observed training dynamics.
The batch size and learning rate themselves were set using DeepSeek's own scaling law analysis described above, rather than copied from prior recipes.
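In PyTorch terms, the optimizer and schedule described above look roughly like the sketch below; the learning rate, step count, milestones, and decay factor are illustrative values, not the paper's exact settings.

```python
import torch

model = torch.nn.Linear(4096, 4096)        # stand-in for the transformer
opt = torch.optim.AdamW(model.parameters(), lr=4e-4,
                        betas=(0.9, 0.95), weight_decay=0.1)

total_steps = 10_000                        # illustrative
sched = torch.optim.lr_scheduler.MultiStepLR(
    opt,
    milestones=[int(0.8 * total_steps), int(0.9 * total_steps)],  # where the LR drops
    gamma=0.316,                                                  # decay factor at each drop
)

lrs = []
for step in range(total_steps):
    # forward / backward / gradient clipping elided in this sketch
    opt.step()
    sched.step()
    lrs.append(sched.get_last_lr()[0])
print(lrs[0], lrs[8500], lrs[-1])           # 4e-4 -> ~1.3e-4 -> ~4e-5
```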
Training runs on DeepSeek's Fire-Flyer 2 (萤火二号) cluster — a co-designed hardware/software infrastructure with NVIDIA GPUs connected via 200 Gbps InfiniBand. The cluster uses a two-zone topology with cross-zone task support.
The software stack includes 3FS (Fire-Flyer File System) — DeepSeek's custom distributed file system built for the asynchronous random-read patterns of LLM training, using Direct I/O and RDMA Read for maximum throughput without cache overhead. This infrastructure was later used for every subsequent DeepSeek model through V4.
At launch, December 2023, DeepSeek LLM stood apart from the open-source field in a number of concrete, measurable ways.
**Chinchilla+:** Trained on 2 trillion tokens — double the 1T commonly used at the time for 7B-70B models. More training tokens directly improves general capability, especially on rare knowledge and edge cases.
**Research contribution:** Published new scaling law derivations for batch size and learning rate as functions of compute budget — extending beyond Chinchilla and Kaplan et al. These findings were cited by numerous subsequent papers and influenced training recipe design across the field.
**Chinese SOTA:** One of the first 70B-scale open-source models trained with substantial Chinese data. 67B Chat surpasses GPT-3.5 on Chinese open-ended evaluation — demonstrating that bilingual training at scale produces genuinely multilingual capability.
**Inference efficiency:** The 67B model adopted Grouped-Query Attention before it became standard practice — reducing KV cache by 8× vs MHA. This made 67B inference practical on 4× A100 GPUs rather than requiring 8×.
**Open science:** DeepSeek published training checkpoints at multiple points during the 7B and 67B runs — downloadable via AWS S3. This unprecedented transparency enabled researchers to study how capabilities emerge during pre-training.
**DPO pioneer:** Chat models were aligned using both Supervised Fine-Tuning and Direct Preference Optimization — a combination that was still novel at the time. DPO avoids the instability of RLHF while still producing well-calibrated preference following.
**Commercial ✓:** Released under the DeepSeek License, which permits commercial use. This was significant in late 2023, when many open-source models (including early LLaMA) had non-commercial restrictions that limited industry adoption.
**100K BPE:** 3× larger vocabulary than LLaMA's 32K — dramatically improving Chinese tokenization efficiency and reducing mis-tokenization of non-Latin scripts. The Byte-level BPE design ensures robust handling of any Unicode content.
**Engineering depth:** Unusual layer counts (30 for 7B, 95 for 67B) were chosen specifically to divide evenly across GPU pipeline stages — reducing pipeline bubble overhead during training and inference. A practical engineering insight rarely documented in other papers.
**Generalization:** Scored 65 on the Hungarian National High School Mathematics Exam — a real-world held-out test not included in training. This demonstrated genuine mathematical generalization beyond standard benchmarks like GSM8K and MATH.
**Training innovation:** Used a multi-step learning rate schedule rather than the standard cosine decay — allowing dynamic adaptation based on observed training loss curves. This provided more control over training dynamics than fixed schedules.
**Origin of V4:** Every model in the DeepSeek family traces its lineage here. The Fire-Flyer infrastructure, the scaling law analysis, the tokenizer design, and the bilingual training recipe used in DeepSeek LLM were inherited, refined, and extended through Coder, Math, V2, V3, R1, and V4.

All four model variants are available on Hugging Face. The standard transformers library works out of the box.
DeepSeek LLM established the practices, infrastructure, and research culture that produced R1, V3, and V4. Understanding it is understanding where DeepSeek comes from.
**The foundation — arXiv:2401.02954:** Released as DeepSeek's first public model. 7B and 67B, Base and Chat. 2T token training. Novel scaling laws. Beats LLaMA-2 70B across coding, math, reasoning, and Chinese. Published all intermediate checkpoints. Established: bilingual training recipe, Fire-Flyer infrastructure, custom 100K BPE tokenizer, and the scaling law research program that would guide every subsequent model.
**Coder V2 · DeepSeekMath · Domain expertise:** Using the LLM's tokenizer, data pipeline, and training infrastructure as a template, DeepSeek built domain-specific models: Coder (86 programming languages, fill-in-the-middle training) and Math (initialized from DeepSeek-Coder-Base, GRPO reinforcement learning for mathematical reasoning). Both used lessons from LLM's scaling law analysis.
**MoE + MLA · 236B · V2 breakthrough:** Built on the experience of training DeepSeek LLM at 67B scale, V2 introduced the Mixture-of-Experts framework with Multi-Head Latent Attention (MLA) and DeepSeekMoE — dramatically improving efficiency. The bilingual training recipe, tokenizer design, and infrastructure from LLM carried forward. V2 demonstrated that DeepSeek's engineering could scale to 236B parameters affordably.
**GRPO · Chain-of-thought · Sputnik moment:** R1's training used GRPO — first developed for DeepSeekMath, itself built on LLM's infrastructure. The bilingual corpus approach from LLM enabled R1 to reason fluently in both English and Chinese. The Fire-Flyer 2 cluster trained R1 at 671B scale. LLM's scaling law insights informed R1's training recipe. The distilled models (1.5B–70B) were possible because of the tokenizer and architecture compatibility across the family.
**V3 → V3.2 → V4 · IMO Gold · Frontier AI:** V3 (671B MoE), V3.2 (+ DeepSeek Sparse Attention), and V4 (1M context, Compressed Sparse Attention) — all trace directly to the architectural choices, data curation philosophy, and scaling law research first articulated in DeepSeek LLM. The same tokenizer family, the same bilingual training approach, the same Fire-Flyer infrastructure. DeepSeek LLM's paper is cited in every subsequent DeepSeek technical report.

Four ways to run or use DeepSeek LLM — from cloud API to local inference.
All four models — 7B/67B × Base/Chat — are at huggingface.co/deepseek-ai. Use from_pretrained with device_map="auto" for multi-GPU distribution. 7B requires ~16GB VRAM; 67B requires ~130GB.
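A minimal generation example, assuming the deepseek-ai/deepseek-llm-7b-base listing on the hub:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "deepseek-ai/deepseek-llm-7b-base"
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(
    name, torch_dtype=torch.bfloat16, device_map="auto",  # shard across available GPUs
)

prompt = "The scaling laws of large language models suggest that"
inputs = tok(prompt, return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=64)
print(tok.decode(out[0], skip_special_tokens=True))
```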
The 7B model is available via Ollama: ollama run deepseek-llm:7b. Runs on consumer hardware — 16GB RAM is sufficient for 4-bit quantized inference. No VRAM required for CPU inference.
For domain-specific applications, download the Base model and fine-tune with PEFT/LoRA. The Base model has no instruction-following format to work around — clean slate for supervised fine-tuning on your own data.
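A sketch of what LoRA fine-tuning on the Base checkpoint might look like with the peft library; the target modules follow the LLaMA-style projection names the architecture uses, and everything else (rank, dropout, dataset, trainer setup) is a placeholder to adapt to your own task.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

name = "deepseek-ai/deepseek-llm-7b-base"
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name, torch_dtype=torch.bfloat16, device_map="auto")

lora = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # LLaMA-style attention projections
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora)
model.print_trainable_parameters()  # typically well under 1% of the 7B weights

# From here, plug the model into your usual SFT loop (e.g. transformers.Trainer
# or trl's SFTTrainer) with your own tokenized domain dataset.
```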
Download training checkpoints at multiple steps via aws s3 cp s3://deepseek-ai/DeepSeek-LLM/DeepSeek-LLM-7B-Base <local_path> --recursive --request-payer. Study capability emergence during pre-training.
DeepSeek's API no longer serves the original LLM models — new projects should use deepseek-v4-flash or deepseek-v4-pro at platform.deepseek.com. V4 is 6× stronger on benchmarks.
GGUF quantized versions exist on Hugging Face (Q4_K_M recommended: ~4GB, minimal quality loss). Note: the custom tokenizer requires compatible llama.cpp builds. Check model cards for supported versions before downloading.
DeepSeek LLM is the original foundational model released by DeepSeek in December 2023 — the first model in what would become one of the most influential open-source AI families in the world. At 67B parameters trained on 2 trillion tokens, it was the first open-source model to beat LLaMA-2 70B across coding, math, reasoning, and Chinese. More importantly, it established the research practices (novel scaling laws), infrastructure (Fire-Flyer 2), and data pipeline that directly produced Coder V2, DeepSeekMath, R1, V3, and V4. Every DeepSeek model that came after traces its lineage to this release.
DeepSeek LLM is significantly less capable than the current generation. V4-Flash (April 2026) scores 79% on SWE-bench Verified — DeepSeek LLM didn't report SWE-bench and likely scores well below 15%. MATH-500 for V4-Pro is 97.3% vs LLM 67B Chat's 32.6% zero-shot on the MATH benchmark. For all practical applications today, use V4-Flash via the API. DeepSeek LLM is valuable for: historical research into early LLM training dynamics, fine-tuning experiments on modest hardware, and studying how the scaling law methodology evolved across the family.
DeepSeek LLM is released under the DeepSeek License — a source-available license that permits commercial use. This distinguished it from the original LLaMA models (which had non-commercial restrictions) and made it attractive for businesses who wanted to use or fine-tune it in products. The license does impose some conditions — read the full text in the GitHub repository before deployment. Subsequent DeepSeek models (Coder V2, R1, V3, V4) were released under the more permissive MIT License.
The unusual layer counts (30 for the 7B, 95 for the 67B) are a deliberate engineering optimization for pipeline parallelism. During distributed training and inference across multiple GPUs, the model is split into pipeline stages — each GPU handles a group of consecutive layers. A layer count that divides evenly by the pipeline depth in use reduces "pipeline bubble" overhead — the idle time when GPUs wait for the previous stage to complete. 30 layers divides cleanly by 2, 3, 5, 6, 10, and 15; 95 layers by 5 and 19. The paper explicitly states this choice was made to "facilitate model pipeline partitioning to optimize training and inference." It's one of those small but important engineering details that most model papers don't document.
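A small illustration of the divisibility argument, using hypothetical pipeline depths just to show which layer counts split evenly:

```python
def layers_per_stage(n_layers: int, n_stages: int):
    """Return an even layers-per-stage split, or None if the split is uneven."""
    return n_layers // n_stages if n_layers % n_stages == 0 else None

for n_layers in (30, 32, 80, 95):
    splits = {s: layers_per_stage(n_layers, s) for s in (2, 4, 5, 8, 19)}
    print(n_layers, splits)
# Of the depths tried: 30 splits evenly at 2 and 5 (it also divides by 3, 6, 10, 15),
# 95 at 5 and 19, while the conventional 32 and 80 counts split at 2, 4, 8 (80 also at 5).
```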
DeepSeek released training snapshots at multiple token counts during the 7B and 67B pre-training runs — making it possible to study how capabilities develop over training time. This was unusual and scientifically valuable: most labs only release final weights. Access via AWS CLI: aws s3 cp s3://deepseek-ai/DeepSeek-LLM/DeepSeek-LLM-7B-Base <local_path> --recursive --request-payer. The --request-payer flag means you pay for the data transfer costs (AWS egress charges apply). Total size is roughly 280GB for 7B and 720GB for 67B in BF16.
DeepSeek uses a custom Byte-level BPE tokenizer implemented via the Hugging Face Tokenizer library — not SentencePiece (which powers most LLaMA-family models). The choice was motivated by better handling of edge cases in Byte-level encoding and more flexibility in pre-tokenizer design for mixed Chinese/English text. The practical implication: some quantization tools (especially older versions of llama.cpp and GGUF converters) that assume SentencePiece tokenization may not work with DeepSeek LLM weights without model-specific support. Always check compatibility before running quantized inference.
December 2023. 2 trillion tokens. Novel scaling laws. Beat LLaMA-2 70B. Surpassed GPT-3.5 in Chinese. Open-source, commercial use. The first step of the journey that led to IMO Gold.