December 2023 — DeepSeek's first open-source LLM, where it all began
2 trillion tokens — trained from scratch on English and Chinese
Beats LLaMA-2 70B — at code, math, reasoning, and Chinese comprehension
Novel scaling laws — derived new hyperparameter scaling beyond Chinchilla
7B and 67B — Base and Chat variants, all open-source for commercial use
HumanEval 73.78% — 67B Chat surpasses GPT-3.5 on coding
GSM8K 84.1% — zero-shot mathematics on 67B Chat
Released December 2023 · The Foundation

Where DeepSeek
began.

DeepSeek LLM is the model that started everything: the original open-source foundation that proved DeepSeek could train frontier-quality language models from scratch. 2 trillion tokens. Novel scaling laws. Beating LLaMA-2 70B. Surpassing GPT-3.5 in Chinese. Released December 2023, fully open for commercial use.

Download 67B Chat → 🤗 7B Base Read Paper ↗
Dec 2023 · Released
2T tokens · Training data
7B & 67B · Parameter sizes
73.78% · HumanEval (67B Chat)
84.1% · GSM8K 0-shot (67B Chat)
Commercial · Use permitted
Model Variants

Four Models. Two Sizes, Two Purposes.

DeepSeek LLM ships as four open-source models: Base models (raw pre-trained foundation for research and fine-tuning) and Chat models (instruction-following, SFT + DPO aligned) at both 7B and 67B scale.

BASE
7B
DeepSeek-LLM-7B-Base

The 7-billion parameter pre-trained foundation model. Ideal for supervised fine-tuning, RLHF research, domain adaptation, and academic studies of scaling behavior.

Layers: 30
Attention: MHA
Context: 4096 tokens
License: DeepSeek License
CHAT
7B
DeepSeek-LLM-7B-Chat

Fine-tuned chat variant via SFT and DPO alignment. Designed for consumer-grade hardware. GSM8K 0-shot: 63.0%. A strong small model for interactive applications and edge deployment.

Base: 7B-Base
Training: SFT + DPO
GSM8K: 63.0%
VRAM: ~16 GB (FP16)
BASE
67B
DeepSeek-LLM-67B-Base

The flagship pre-trained foundation. 67B parameters with Grouped-Query Attention (GQA). Outperforms LLaMA-2 70B Base across reasoning, coding, math, and Chinese. The best open-source base model of its era.

Layers: 95
Attention: GQA
Context: 4096 tokens
License: DeepSeek License
CHAT
67B
DeepSeek-LLM-67B-Chat

The flagship chat model. SFT + DPO aligned. HumanEval 73.78%, GSM8K 84.1% zero-shot, MATH 32.6% zero-shot. Surpasses GPT-3.5 in Chinese. Hungarian National High School Exam: 65.

HumanEval: 73.78%
GSM8K (0-shot): 84.1%
MATH (0-shot): 32.6%
vs GPT-3.5: Surpasses (ZH)
Architecture

Built on LLaMA. Improved Everywhere.

DeepSeek LLM follows the LLaMA auto-regressive transformer decoder architecture but makes deliberate changes — most notably adjusting layer counts for pipeline efficiency and using GQA for the 67B model.

🏗️
Core Architecture

DeepSeek LLM is an auto-regressive transformer decoder closely following LLaMA's architecture — the de facto open-source standard at the time. Key shared elements: RMSNorm for pre-normalization, SwiGLU activation functions, and no bias terms in attention or FFN layers.

The primary deviation from LLaMA is the layer count: the 7B model uses 30 layers and the 67B uses 95 layers — rather than the typical 32/80 counts — specifically to optimize pipeline parallelism partitioning during training and inference.

Positional encoding uses Rotary Position Embedding (RoPE) throughout, which encodes relative position information more effectively than traditional absolute position embeddings. The tokenizer is a custom Byte-level BPE implementation via the Hugging Face Tokenizers library — importantly, not SentencePiece, which has implications for quantization tool compatibility.

Component         7B          67B
Layers            30          95
Attention heads   32          64
KV heads          32 (MHA)    8 (GQA)
Hidden dim        4096        8192
Context length    4096        4096
Vocab size        100,000 (both)
Positional enc.   RoPE (both)
Activation        SwiGLU (both)
🔀
Grouped-Query Attention (67B)

The 67B model uses Grouped-Query Attention (GQA) instead of standard Multi-Head Attention (MHA). GQA groups multiple query heads to share a single set of key-value heads, dramatically reducing the KV cache memory footprint during inference.

With 64 query heads grouped into 8 KV heads, the 67B model's KV cache is 8× smaller than an equivalent MHA model — enabling practical deployment on 4×A100 80GB or similar hardware that would otherwise be unable to serve the model at reasonable sequence lengths.

This was a forward-looking design choice at the time: GQA had just been published by Ainslie et al. (2023) and the DeepSeek team was among the first to adopt it in a major open-source release at 67B scale.
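To make the 8× figure concrete, here is a back-of-the-envelope sketch (not from the paper) that computes KV-cache size from the dimensions in the architecture table, assuming FP16 storage and a hypothetical MHA variant of the 67B model for comparison:

```python
# Rough KV-cache arithmetic for the 67B config (illustrative only;
# assumes FP16 KV storage and the dims from the architecture table).
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, bytes_per_el=2):
    # 2 tensors (K and V) per layer, per KV head, per cached token
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_el

HEAD_DIM = 8192 // 64  # hidden dim / query heads = 128

mha = kv_cache_bytes(95, 64, HEAD_DIM, 4096)  # hypothetical MHA variant
gqa = kv_cache_bytes(95, 8, HEAD_DIM, 4096)   # actual GQA config (8 KV heads)

print(f"MHA: {mha / 2**30:.1f} GiB, GQA: {gqa / 2**30:.1f} GiB, ratio {mha // gqa}x")
```

At the full 4096-token context this puts the MHA cache near 12 GiB per sequence versus about 1.5 GiB with GQA, which is why serving on fewer GPUs becomes feasible.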

🔤
Tokenizer

DeepSeek LLM uses a custom Byte-level BPE tokenizer with a vocabulary of 100,000 tokens — substantially larger than LLaMA's 32K vocabulary. The larger vocabulary improves compression efficiency particularly for Chinese text, where character-level granularity would otherwise inflate token counts.

The tokenizer is implemented via the Hugging Face Tokenizer library with specially designed pre-tokenizers. Importantly, it does not use SentencePiece — a distinction that matters for compatibility with certain GGUF quantization pipelines (llama.cpp) which may require model-specific tokenizer support.

Research Contribution

A New Theory of Scaling.

DeepSeek LLM's most significant research contribution wasn't the model itself — it was the new scaling law analysis that derived optimal hyperparameters beyond what Chinchilla described. This work directly informed every subsequent DeepSeek model.

01
Beyond Chinchilla: Hyperparameter Scaling

Kaplan et al. (2020) and Hoffmann et al. (2022, Chinchilla) established that model size and training tokens should scale together optimally. DeepSeek's paper went further: it derived how batch size and learning rate should scale as a function of compute budget. This joint optimization is computationally non-trivial but yields measurable gains over models trained with fixed hyperparameters. The fitted power-law curves were validated on compute budgets ranging from about 1e20 FLOPs up to the full 7B/67B training runs.

02
Data Quality Scaling Curves

The paper studied how scaling laws change when training on different data quality regimes — specifically comparing CommonCrawl-quality data (noisier, cheaper to acquire) versus higher-quality curated corpora. The key finding: scaling curves shift upward with better data, meaning a model trained on higher-quality data achieves the same loss as a larger model on lower-quality data. This insight shaped DeepSeek's subsequent emphasis on data curation pipelines in V2, V3, and beyond.

03
Intermediate Checkpoints Published

In an unusual and highly cited decision, DeepSeek made the intermediate training checkpoints of both the 7B and 67B Base models available via AWS S3. This allowed the research community to directly study the training dynamics — how loss decreases, how benchmark performance evolves, and how different capabilities emerge at different training token counts. This transparency accelerated academic research on LLM training dynamics significantly.

KEY FINDING: Optimal Batch Size Scaling
Power-law relationship derived by DeepSeek (beyond Kaplan & Chinchilla)
Optimal batch size: B* ∝ C^0.24 (C = compute budget in FLOPs)
Optimal learning rate: α* ∝ C^-0.31
Validated on: {1e20, 1e21, ..., 7B, 67B} compute budgets
7B: B=2304, lr=4.2e-4 → achieved near-optimal training efficiency
67B: B=4608, lr=3.2e-4 → near-optimal at 67B scale
The DeepSeek LLM paper validated that their derived hyperparameter formulae produced models centered in the optimal parameter space — meaning neither over- nor under-trained relative to their compute budget. This rigorous validation was missing from most contemporaneous open-source model releases.
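As an illustration of how such power laws are used, the sketch below anchors the quoted exponents on the published 7B operating point and extrapolates to the 67B budget. The constant factors are fitted from that single point, and C ≈ 6·N·D FLOPs is an assumed approximation, so the 67B prediction is directional rather than a reproduction of the paper's fit:

```python
# Illustrative power-law extrapolation (exponents quoted above; the
# coefficient fit and C ≈ 6*N*D are assumptions, not from the paper).
def fit_coeff(value, compute, exponent):
    # Solve value = k * compute**exponent for k using one known point
    return value / compute**exponent

C_7B = 6 * 7e9 * 2e12    # ~8.4e22 FLOPs for the 7B run
C_67B = 6 * 67e9 * 2e12  # ~8.0e23 FLOPs for the 67B run

k_batch = fit_coeff(2304, C_7B, 0.24)    # B* ∝ C^0.24, anchored at 7B
k_lr = fit_coeff(4.2e-4, C_7B, -0.31)    # α* ∝ C^-0.31, anchored at 7B

B_67B = k_batch * C_67B**0.24
lr_67B = k_lr * C_67B**-0.31
print(f"predicted 67B: batch ~ {B_67B:.0f}, lr ~ {lr_67B:.1e}")
```

The direction matches the published hyperparameters: optimal batch size grows with compute while the optimal learning rate shrinks.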
Benchmarks

State of the Art — December 2023

All benchmark results from the official paper (arXiv:2401.02954). At the time of release, DeepSeek-LLM-67B represented the best open-source model across coding, mathematics, and Chinese comprehension.

HumanEval Pass@1
Python code generation from docstrings
DS 67B Chat best open-source: 73.78%
DS-67B-Chat
73.78%
GPT-3.5-Turbo
68.0%
LLaMA-2 70B Chat
32.3%
DS-7B-Chat
45.1%
HumanEval (Base models)
Zero-shot code completion
DS 67B beats LLaMA-2 70B
DS-67B-Base
44.5%
LLaMA-2 70B Base
29.9%
DS-7B-Base
26.2%
GSM8K — Grade School Math (0-shot)
Multi-step arithmetic word problems
DS 67B Chat: 84.1%
DS-67B-Chat
84.1%
GPT-3.5-Turbo
78.9%
LLaMA-2 70B Chat
56.8%
DS-7B-Chat
63.0%
MATH (Competition Math, 0-shot)
AMC to competition-level problems
DS 67B Chat: 32.6%
DS-67B-Chat
32.6%
GPT-3.5-Turbo
34.1%
LLaMA-2 70B Chat
13.5%
Hungarian National High School Exam
Held-out real-world math test — rare generalization signal
DS 67B Chat: 65
DS-67B-Chat
65
GPT-4 (estimated)
~70+
MMLU (5-shot)
57-subject knowledge across STEM and humanities
DS 67B beats LLaMA-2 70B
DS-67B-Base
71.3%
LLaMA-2 70B
68.9%
Mistral 7B
64.0%
DS-7B-Base
49.7%
HellaSwag (10-shot)
Commonsense NLI — activity continuation
DS-67B-Base
87.1%
LLaMA-2 70B
87.3%
ARC-Challenge (25-shot)
Science reasoning — 4th–9th grade difficulty
DS-67B-Base
67.8%
LLaMA-2 70B
67.3%
Chinese Open-Ended Evaluation (MT-Bench equivalent)
Human-rated open-ended Chinese conversation quality
DS 67B Chat > GPT-3.5 in Chinese
DeepSeek LLM 67B Chat was the first open-source model to surpass GPT-3.5-Turbo in Chinese-language open-ended evaluation. This was a significant milestone because Chinese benchmarks historically showed a large gap between open and closed models. The model was trained on a bilingual corpus — 2T tokens split across English and Chinese — with careful domain balancing to ensure Chinese comprehension, tone, and factual accuracy met professional standards.
C-Eval (Chinese Knowledge)
52-subject Chinese academic exam benchmark
DS 67B strong on Chinese STEM
DS-67B-Base
~71%
LLaMA-2 70B
~32%
Training Details

2 Trillion Tokens from Scratch.

DeepSeek LLM was trained completely from scratch — no initialization from other models. The training infrastructure, data pipeline, and optimization recipe were all developed internally by DeepSeek.

📊Training Data

The training corpus consists of 2 trillion tokens in English and Chinese — sourced primarily from Common Crawl with aggressive deduplication, filtering, and domain remixing. The filtering pipeline removes low-quality content, near-duplicates, and content with excessive non-alphabetic characters, following and extending methods from DeepSeek's internal research.

A key design decision was the domain remixing strategy: underrepresented high-value domains (scientific papers, technical documentation, Chinese web content) are upsampled relative to their natural frequency in Common Crawl. This improves downstream performance on reasoning and specialized tasks without requiring more raw data.
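A minimal sketch of what such a remix could look like. The domain weights below are invented for illustration, since the paper does not publish its mixing ratios:

```python
# Hypothetical domain remixing: upsample high-value domains relative to
# their natural Common Crawl frequency, then renormalize into a sampling
# mix. All numbers here are made up for illustration.
natural_freq = {"web": 0.90, "science": 0.03, "tech_docs": 0.03, "zh_web": 0.04}
upsample = {"web": 1.0, "science": 4.0, "tech_docs": 3.0, "zh_web": 2.5}

raw = {d: f * upsample[d] for d, f in natural_freq.items()}
total = sum(raw.values())
mix = {d: w / total for d, w in raw.items()}  # normalized sampling probabilities

for domain, p in mix.items():
    print(f"{domain}: {p:.1%}")
```

The effect is that scientific and technical text appears several times more often in the training stream than in raw Common Crawl, without changing the total token budget.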

Importantly, no instruction-tuning data is included in pre-training. The Base models are pure auto-regressive next-token prediction models. Chat models are created separately via SFT and DPO on the Base checkpoints.

2T
Training tokens
EN+ZH
Languages
CC+
Primary source
DeDup
Heavy dedup applied
⚙️Optimizer & Schedule

Training uses the AdamW optimizer (β₁=0.9, β₂=0.95, weight_decay=0.1) with a multi-step learning rate schedule — not the standard cosine warmup + decay used by most contemporaneous models. The multi-step schedule allows for more precise control over learning rate reduction points, enabling the team to adapt the schedule based on observed training dynamics.

Hyperparameters were derived from DeepSeek's own scaling law analysis:

Optimal training hyperparameters (from scaling law fits)
7B:  batch_size=2304, lr=4.2e-4
67B: batch_size=4608, lr=3.2e-4
Gradient clip: 1.0 · Warmup: 2000 steps
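A sketch of what a multi-step schedule of this kind could look like. The 80%/90% breakpoints and the 0.316×/0.1× drop factors are assumptions based on common descriptions of the paper's schedule, not verbatim from it:

```python
# Sketch of a multi-step LR schedule: linear warmup, then a flat plateau
# with two discrete drops. Breakpoints and factors are assumed, not the
# paper's exact values.
def multi_step_lr(step, total_steps, max_lr=4.2e-4, warmup=2000):
    if step < warmup:
        return max_lr * step / warmup          # linear warmup
    frac = step / total_steps
    if frac < 0.8:
        return max_lr                          # plateau at peak LR
    if frac < 0.9:
        return max_lr * 0.316                  # first drop at 80% of tokens
    return max_lr * 0.1                        # second drop at 90% of tokens

total = 100_000
print(multi_step_lr(1_000, total))   # mid-warmup
print(multi_step_lr(50_000, total))  # plateau
print(multi_step_lr(95_000, total))  # final stage
```

Unlike cosine decay, each breakpoint can be moved mid-run based on observed loss curves, which is the control the section above describes.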
🖥️Infrastructure

Training runs on DeepSeek's Fire-Flyer 2 (萤火二号) cluster — a co-designed hardware/software infrastructure with NVIDIA GPUs connected via 200 Gbps InfiniBand. The cluster uses a two-zone topology with cross-zone task support.

The software stack includes 3FS (Fire-Flyer File System) — DeepSeek's custom distributed file system built for the asynchronous random-read patterns of LLM training, using Direct I/O and RDMA Read for maximum throughput without cache overhead. This infrastructure was later used for every subsequent DeepSeek model through V4.

Key Properties

What Made DeepSeek LLM Different

At launch, December 2023, DeepSeek LLM stood apart from the open-source field in a number of concrete, measurable ways.

📚
2T Token Training

Trained on 2 trillion tokens — double the 1T commonly used at the time for 7B-70B models. More training tokens directly improve general capability, especially on rare knowledge and edge cases.

Chinchilla+
🔢
Novel Scaling Laws

Published new scaling law derivations for batch size and learning rate as functions of compute budget — extending beyond Chinchilla and Kaplan et al. These findings were cited by numerous subsequent papers and influenced training recipe design across the field.

Research contribution
🇨🇳
Bilingual: EN + ZH

One of the first 70B-scale open-source models trained with substantial Chinese data. 67B Chat surpasses GPT-3.5 on Chinese open-ended evaluation — demonstrating that bilingual training at scale produces genuinely multilingual capability.

Chinese SOTA
🔍
GQA on 67B

The 67B model adopted Grouped-Query Attention before it became standard practice — reducing KV cache by 8× vs MHA. This made 67B inference practical on 4× A100 80GB GPUs rather than requiring 8×.

Inference efficiency
📍
Intermediate Checkpoints

DeepSeek published training checkpoints at multiple points during the 7B and 67B runs — downloadable via AWS S3. This unprecedented transparency enabled researchers to study how capabilities emerge during pre-training.

Open science
⚖️
SFT + DPO Alignment

Chat models were aligned using both Supervised Fine-Tuning and Direct Preference Optimization, a combination that was novel at the time. DPO avoids the instability of RLHF while still producing well-calibrated preference following.

DPO pioneer
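For readers unfamiliar with DPO, a minimal sketch of its loss (Rafailov et al., 2023) with toy log-probabilities. This is the general formulation, not DeepSeek's training code:

```python
# Minimal DPO loss on one preference pair, using toy numbers.
import math

def dpo_loss(logp_chosen, logp_rejected, ref_chosen, ref_rejected, beta=0.1):
    # Implicit reward margin: beta * (policy log-ratio minus reference log-ratio)
    margin = beta * ((logp_chosen - ref_chosen) - (logp_rejected - ref_rejected))
    return -math.log(1 / (1 + math.exp(-margin)))  # -log sigmoid(margin)

# Toy case: the policy prefers the chosen answer more than the frozen
# reference model does, so the margin is positive and the loss is modest.
loss = dpo_loss(logp_chosen=-12.0, logp_rejected=-15.0,
                ref_chosen=-13.0, ref_rejected=-14.0, beta=0.1)
print(f"{loss:.4f}")
```

No reward model or on-policy sampling is needed, which is what makes DPO cheaper and more stable than a full RLHF pipeline.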
💻
Commercial License

Released under the DeepSeek License which permits commercial use. This was significant in late 2023 when many open-source models (including early LLaMA) had non-commercial restrictions that limited industry adoption.

Commercial ✓
🧮
100K Vocabulary

3× larger vocabulary than LLaMA's 32K — dramatically improving Chinese tokenization efficiency and text fidelity on non-Latin scripts. The Byte-level BPE design ensures robust handling of any Unicode content.

100K BPE
🏗️
Pipeline-Optimized Layers

Unusual layer counts (30 for 7B, 95 for 67B) were chosen specifically to divide evenly across GPU pipeline stages — reducing pipeline bubble overhead during training and inference. A practical engineering insight rarely documented in other papers.

Engineering depth
🌏
Hungarian Exam: 65

Scored 65 on the Hungarian National High School Mathematics Exam — a real-world held-out test not included in training. This demonstrated genuine mathematical generalization beyond standard benchmarks like GSM8K and MATH.

Generalization
🔁
Multi-Step LR Schedule

Used a multi-step learning rate schedule rather than the standard cosine decay — allowing dynamic adaptation based on observed training loss curves. This provided more control over training dynamics than fixed schedules.

Training innovation
🧬
Foundation for Everything

Every model in the DeepSeek family traces its lineage here. The Fire-Flyer infrastructure, the scaling law analysis, the tokenizer design, and the bilingual training recipe used in DeepSeek LLM were inherited, refined, and extended through Coder, Math, V2, V3, R1, and V4.

Origin of V4
Getting Started

Run DeepSeek LLM Locally

All four model variants are available on Hugging Face. The standard transformers library works out of the box.

# pip install transformers torch accelerate
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

# 67B Chat — flagship model (~130GB in BF16)
model_id = "deepseek-ai/deepseek-llm-67b-chat"
# 7B Chat — consumer GPU version (~14GB in BF16)
# model_id = "deepseek-ai/deepseek-llm-7b-chat"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",  # distributes across GPUs
)

messages = [
    {"role": "user", "content": "Implement binary search in Python."}
]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

output = model.generate(input_ids, max_new_tokens=1024)
print(tokenizer.decode(output[0][input_ids.shape[1]:], skip_special_tokens=True))
# Base model — for research, fine-tuning, or text completion
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model_id = "deepseek-ai/deepseek-llm-67b-base"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

# Raw completion — no chat template for base models
prompt = "def fibonacci(n: int) -> int:\n    '''\n    Returns the nth Fibonacci number.\n    '''\n"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=256)
print(tokenizer.decode(output[0], skip_special_tokens=True))
# Download intermediate training checkpoints from AWS S3
# Requires AWS CLI: pip install awscli
# Requester-pays bucket: AWS credentials required; transfer costs are billed to you

# DeepSeek-LLM-7B-Base checkpoints
aws s3 cp s3://deepseek-ai/DeepSeek-LLM/DeepSeek-LLM-7B-Base <local_path> \
    --recursive --request-payer

# DeepSeek-LLM-67B-Base checkpoints
aws s3 cp s3://deepseek-ai/DeepSeek-LLM/DeepSeek-LLM-67B-Base <local_path> \
    --recursive --request-payer

# These contain snapshots at multiple training steps
# Useful for studying capability emergence during pre-training
# Total size: ~280GB (7B) / ~720GB (67B) in BF16
Legacy

The Root of Every DeepSeek Model.

DeepSeek LLM established the practices, infrastructure, and research culture that produced R1, V3, and V4. Understanding it is understanding where DeepSeek comes from.

December 2023
DeepSeek LLM — The Origin

Released as DeepSeek's first public model. 7B and 67B, Base and Chat. 2T token training. Novel scaling laws. Beats LLaMA-2 70B across coding, math, reasoning, and Chinese. Published all intermediate checkpoints. Established: bilingual training recipe, Fire-Flyer infrastructure, custom 100K BPE tokenizer, and the scaling law research program that would guide every subsequent model.

The foundation — arXiv:2401.02954
Early 2024 (parallel)
DeepSeek Coder & DeepSeek Math — Domain Specialization

Using the LLM's tokenizer, data pipeline, and training infrastructure as a template, DeepSeek built domain-specific models: Coder (trained on code across 86 programming languages with fill-in-the-middle support) and Math (GRPO reinforcement learning for mathematical reasoning). Both used lessons from LLM's scaling law analysis.

Coder V2 · DeepSeekMath · Domain expertise
May 2024
DeepSeek-V2 — MoE Architecture Breakthrough

Built on the experience of training DeepSeek LLM at 67B scale, V2 introduced the Mixture-of-Experts framework with Multi-Head Latent Attention (MLA) and DeepSeekMoE — dramatically improving efficiency. The bilingual training recipe, tokenizer design, and infrastructure from LLM carried forward. V2 demonstrated that DeepSeek's engineering could scale to 236B parameters affordably.

MoE + MLA · 236B · V2 breakthrough
January 2025
DeepSeek-R1 — The Model That Changed Everything

R1's training used GRPO — first developed for DeepSeekMath, itself built on LLM's infrastructure. The bilingual corpus approach from LLM enabled R1 to reason fluently in both English and Chinese. The Fire-Flyer 2 cluster trained R1 at 671B scale. LLM's scaling law insights informed R1's training recipe. The distilled models (1.5B–70B) were possible because of the tokenizer and architecture compatibility across the family.

GRPO · Chain-of-thought · Sputnik moment
December 2024 – April 2026
DeepSeek V3, V3.2, V4 — The Frontier

V3 (671B MoE), V3.2 (+ DeepSeek Sparse Attention), and V4 (1M context, Compressed Sparse Attention) — all trace directly to the architectural choices, data curation philosophy, and scaling law research first articulated in DeepSeek LLM. The same tokenizer family, the same bilingual training approach, the same Fire-Flyer infrastructure. DeepSeek LLM's paper is cited in every subsequent DeepSeek technical report.

V3 → V3.2 → V4 · IMO Gold · Frontier AI
Getting Started

How to Use DeepSeek LLM

Four ways to run or use DeepSeek LLM — from cloud API to local inference.

1
Download from Hugging Face

All four models — 7B/67B × Base/Chat — are at huggingface.co/deepseek-ai. Use from_pretrained with device_map="auto" for multi-GPU distribution. 7B requires ~16GB VRAM; 67B requires ~130GB.

2
Run with Ollama (7B)

The 7B model is available via Ollama: ollama run deepseek-llm:7b. Runs on consumer hardware — 16GB of system RAM is sufficient for 4-bit quantized, CPU-only inference; no GPU required.

3
Fine-tune the Base model

For domain-specific applications, download the Base model and fine-tune with PEFT/LoRA. The Base model has no instruction-following format to work around — clean slate for supervised fine-tuning on your own data.

4
Study intermediate checkpoints

Download training checkpoints at multiple steps via aws s3 cp s3://deepseek-ai/DeepSeek-LLM/DeepSeek-LLM-7B-Base <local_path> --recursive --request-payer. Study capability emergence during pre-training.

5
Use via current API

DeepSeek's API no longer serves the original LLM models — new projects should use deepseek-v4-flash or deepseek-v4-pro at platform.deepseek.com. V4 is 6× stronger on benchmarks.

6
GGUF quantization (llama.cpp)

GGUF quantized versions exist on Hugging Face (Q4_K_M recommended: ~4GB, minimal quality loss). Note: the custom tokenizer requires compatible llama.cpp builds. Check model cards for supported versions before downloading.

FAQ

Frequently Asked Questions

What is DeepSeek LLM and why is it important?

DeepSeek LLM is the original foundational model released by DeepSeek in December 2023 — the first model in what would become one of the most influential open-source AI families in the world. At 67B parameters trained on 2 trillion tokens, it was the first open-source model to beat LLaMA-2 70B across coding, math, reasoning, and Chinese. More importantly, it established the research practices (novel scaling laws), infrastructure (Fire-Flyer 2), and data pipeline that directly produced Coder V2, DeepSeekMath, R1, V3, and V4. Every DeepSeek model that came after traces its lineage to this release.

How does DeepSeek LLM compare to modern DeepSeek models?

DeepSeek LLM is significantly less capable than the current generation. V4-Flash (April 2026) scores 79% on SWE-bench Verified — DeepSeek LLM didn't report SWE-bench and likely scores well below 15%. MATH-500 for V4-Pro is 97.3% vs LLM 67B Chat's 32.6% zero-shot on the MATH benchmark. For all practical applications today, use V4-Flash via the API. DeepSeek LLM is valuable for: historical research into early LLM training dynamics, fine-tuning experiments on modest hardware, and studying how the scaling law methodology evolved across the family.

What is the DeepSeek License and can I use it commercially?

DeepSeek LLM is released under the DeepSeek License — a source-available license that permits commercial use. This distinguished it from the original LLaMA models (which had non-commercial restrictions) and made it attractive for businesses who wanted to use or fine-tune it in products. The license does impose some conditions — read the full text in the GitHub repository before deployment. Subsequent DeepSeek models (Coder V2, R1, V3, V4) were released under the more permissive MIT License.

Why does DeepSeek LLM use 30 layers (7B) and 95 layers (67B) instead of 32 and 80?

This is a deliberate engineering optimization for pipeline parallelism. During distributed training and inference across multiple GPUs, the model is split into pipeline stages — each GPU handles a group of consecutive layers. Layer counts divisible by common pipeline depths (e.g., 6, 10, 15) reduce "pipeline bubble" overhead — the idle time when GPUs wait for the previous stage to complete. 30 layers divides cleanly by 2, 3, 5, 6, 10, 15; 95 layers by 5, 19. The paper explicitly states this choice was made to "facilitate model pipeline partitioning to optimize training and inference." It's one of those small but important engineering details that most model papers don't document.
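The idea can be sketched in a few lines of Python: split the layers into contiguous stages and compare the per-stage balance, since the slowest (largest) stage sets the pipeline's pace:

```python
# Split n_layers into n_stages contiguous pipeline stages and check balance.
def stage_sizes(n_layers, n_stages):
    base, rem = divmod(n_layers, n_stages)
    # First `rem` stages get one extra layer when the split is uneven
    return [base + (1 if i < rem else 0) for i in range(n_stages)]

for n_layers in (30, 32):  # DeepSeek's 7B layer count vs. the typical 32
    sizes = stage_sizes(n_layers, 6)
    label = "balanced" if len(set(sizes)) == 1 else "imbalanced"
    print(n_layers, sizes, label)
```

With 6 pipeline stages, 30 layers split perfectly while 32 layers leave two stages carrying an extra layer, so every pipeline step waits on those stragglers.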

What are the intermediate training checkpoints and how do I access them?

DeepSeek released training snapshots at multiple token counts during the 7B and 67B pre-training runs — making it possible to study how capabilities develop over training time. This was unusual and scientifically valuable: most labs only release final weights. Access via AWS CLI: aws s3 cp s3://deepseek-ai/DeepSeek-LLM/DeepSeek-LLM-7B-Base <local_path> --recursive --request-payer. The --request-payer flag means you pay for the data transfer costs (AWS egress charges apply). Total size is roughly 280GB for 7B and 720GB for 67B in BF16.

Why doesn't DeepSeek LLM use SentencePiece for tokenization?

DeepSeek uses a custom Byte-level BPE tokenizer implemented via the Hugging Face Tokenizer library — not SentencePiece (which powers most LLaMA-family models). The choice was motivated by better handling of edge cases in Byte-level encoding and more flexibility in pre-tokenizer design for mixed Chinese/English text. The practical implication: some quantization tools (especially older versions of llama.cpp and GGUF converters) that assume SentencePiece tokenization may not work with DeepSeek LLM weights without model-specific support. Always check compatibility before running quantized inference.

The Foundation

The model that
started it all.

December 2023. 2 trillion tokens. Novel scaling laws. Beat LLaMA-2 70B. Surpassed GPT-3.5 in Chinese. Open-source, commercial use. The first step of the journey that led to IMO Gold.

🤗 Download 67B Chat → Read Paper ↗ GitHub ↗