In November 2023, a four-month-old startup in Hangzhou released two models that shouldn't have been possible. DeepSeek Coder (Nov 2) matched CodeLlama-34B with a 6.7B model. DeepSeek LLM (Nov 29) beat LLaMA-2 70B and became the first open-source model to surpass GPT-3.5 in Chinese. Both trained from scratch on 2 trillion tokens. Both open-source from day one. The origin of everything that followed.
Both DeepSeek Coder and DeepSeek LLM were trained from scratch on 2 trillion tokens — no LLaMA initialisation, no weight distillation. In 2023, most "open-source" models were LLaMA fine-tunes. DeepSeek chose the harder path, and won on benchmarks.
DeepSeek's debut — and the first open-source code model to seriously challenge CodeLlama. Trained from scratch on 2 trillion tokens at 87% code / 13% natural language across 338 programming languages. The 6.7B Base model matched CodeLlama-34B on HumanEval. The 33B Instruct model surpassed GPT-3.5-Turbo on HumanEval. Four sizes — 1.3B, 5.7B, 6.7B, 33B — with Base and Instruct variants. Supports Fill-in-the-Middle (FIM) for IDE-style code completion using PSM format. 16K token context window and repo-level training for project-level understanding. DeepSeek Coder-Base 6.7B outperforms all open-source models at or below 13B on HumanEval, MultiPL-E, MBPP, and DS-1000.
The first general-purpose DeepSeek model — 7B and 67B trained on 2 trillion bilingual (English + Chinese) tokens from scratch. The 67B model beat LLaMA-2 70B across every major benchmark. The Chat variant beat GPT-3.5-Turbo on Chinese open-ended evaluation — the first open-source model to do so. Aligned via SFT + DPO. Used an unusually large 100,000-token BPE vocabulary (3× LLaMA's 32K) to dramatically improve Chinese token efficiency. Novel scaling laws for batch size and learning rate published alongside the model weights. Training checkpoints available via AWS S3.
DeepSeek Coder and DeepSeek LLM.
Both families shipped with multiple sizes and Base/Chat or Base/Instruct splits — giving developers the full spectrum from 3GB RAM edge devices to research-grade ~134GB flagship inference.
The smallest Coder model — designed for edge devices, CI pipelines, and fast inline completion. Runs on CPU with ~3GB RAM. Quantised GGUF versions available via llama.cpp and Ollama. Still deployed in production by several open-source IDE plugins.
The sweet spot for consumer hardware. Instruction-tuned variant for conversational code help. Runs on a single RTX 3080 Ti / 4070. Strong multilingual performance. Frequently recommended as the best sub-7B code model of 2023.
The flagship code model of V1. Coder-33B-Instruct beats GPT-3.5-Turbo on HumanEval and outperforms CodeLlama-34B by +7.9% Python, +9.3% multilingual, +10.8% MBPP. State-of-the-art open-source code model as of late 2023 by a wide margin.
The compact general-purpose model. 30 layers, MHA, 4K context. Trained on 2T bilingual tokens. Chat variant achieves GSM8K 63.0% zero-shot. Reaches parity with LLaMA-2 7B on English, surpasses it on Chinese. Ideal for fine-tuning experiments.
The flagship general model. 95 layers, GQA (8 KV heads), 100K BPE vocabulary. Beats LLaMA-2 70B on all benchmarks. 67B Chat surpasses GPT-3.5 in Chinese. HumanEval 73.78%, GSM8K 84.1%, MATH 32.6%. Hungarian High School Exam: 65/100.
A multi-query attention (MQA) variant of the Coder family — all key/value heads shared, shrinking the KV cache — exploring efficient attention at code model scale before DeepSeek-V2's MLA. Smaller memory footprint than 6.7B dense at near-equal quality on most benchmarks.
DeepSeek published intermediate pre-training checkpoints for the 7B LLM at multiple token counts via AWS S3 — enabling research into capability emergence during pre-training. Unprecedented transparency at the time of release.
A refined version of the 7B Coder model released as a minor update — better instruction following and improved handling of long-context repository tasks. Uses a DeepSeek-LLM-7B checkpoint as base before code training, producing a stronger general coding assistant.
Both V1 families use the LLaMA-style pre-norm decoder-only Transformer as a starting point, with carefully chosen modifications that anticipated what would become standard practice in later models — including GQA, 16K context, and repo-level training.
The Coder models use a standard auto-regressive Transformer with RoPE positional encoding and Flash Attention 2 throughout. The 16K token context window — 4× larger than the contemporary LLM sibling — is the centrepiece feature, enabling project-level code understanding via repo-level training document construction.
Repo-level training: Rather than training on individual files, DeepSeek assembled entire GitHub repositories as contiguous training documents with topological dependency ordering — lower-level files (utilities, helpers) sorted before higher-level consumers. This teaches the model cross-file import chains, not just isolated function bodies.
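To make the ordering concrete, here is a minimal sketch of dependency-ordered document assembly using Python's standard-library graphlib. The regex-based import detection and the file names are illustrative stand-ins, not DeepSeek's actual pipeline.

```python
import re
from graphlib import TopologicalSorter

def order_repo_files(files: dict[str, str]) -> list[str]:
    """Return file paths so that dependencies precede their consumers."""
    graph = {}
    for path, source in files.items():
        deps = set()
        # Naive Python-only import detection; a stand-in for real dependency parsing.
        for match in re.finditer(r"^\s*(?:from|import)\s+([\w.]+)", source, re.M):
            candidate = match.group(1).replace(".", "/") + ".py"
            if candidate in files:
                deps.add(candidate)
        graph[path] = deps
    # TopologicalSorter emits predecessors first (raises CycleError on cycles).
    return list(TopologicalSorter(graph).static_order())

repo = {
    "utils.py": "def helper(): ...",
    "core.py": "import utils\n\ndef run():\n    return utils.helper()",
    "main.py": "import core\n\ncore.run()",
}
print(order_repo_files(repo))  # ['utils.py', 'core.py', 'main.py']
```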
Fill-in-the-Middle (FIM) — PSM format: ~50% of training examples are reformatted as Prefix-Suffix-Middle triples using the PSM (Prefix, Suffix, Middle) format. Special tokens <|fim▁begin|>, <|fim▁hole|>, <|fim▁end|> mark the regions. This enables IDE-style inline code completion absent from most contemporaneous code LLMs.
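At inference time the same sentinels drive infilling. A minimal sketch with transformers, following the PSM prompt layout described above; the code snippet and generation settings are illustrative, and the exact sentinel spellings should be checked against the released tokenizer's special tokens.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "deepseek-ai/deepseek-coder-1.3b-base"
tok = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True)

prefix = "def quicksort(arr):\n    if len(arr) <= 1:\n        return arr\n    pivot = arr[0]\n"
suffix = "\n    return quicksort(left) + [pivot] + quicksort(right)\n"
# PSM layout: the model reads prefix and suffix, then generates the middle.
prompt = f"<|fim▁begin|>{prefix}<|fim▁hole|>{suffix}<|fim▁end|>"

inputs = tok(prompt, return_tensors="pt")
out = model.generate(**inputs, max_new_tokens=64)
# Decode only the newly generated middle span.
print(tok.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```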
Data pipeline: Collected from GitHub repos created before November 2023 using StarCoder-style filtering: max line length ≤1000 chars, ≥25% alphabetic content, XSLT XML header stripping. Near-deduplication applied before the 87/13 code-to-natural-language mix.
| Component | Value |
|---|---|
| Architecture | Pre-norm decoder-only Transformer |
| Positional enc. | RoPE |
| Attention type | MHA (MQA in 5.7B) + Flash Attention 2 |
| Context window | 16,384 tokens |
| Training objectives | Next-token + FIM (PSM format) |
| FIM fraction | ~50% of training examples |
| Model sizes | 1.3B, 5.7B (MQA), 6.7B, 33B |
| Languages supported | 338 programming languages |
DeepSeek LLM follows LLaMA's decoder-only Transformer with three deliberate departures. First, layer counts chosen for pipeline parallelism: 30 layers for 7B and 95 layers for 67B — once the embedding and output layers are counted, both models partition evenly across standard 8-GPU pipeline stages, reducing pipeline bubble overhead without changing the core architecture.
Second, the 67B model uses Grouped-Query Attention (GQA) — adopted before it was standard — with 8 KV heads vs 64 query heads. The KV cache shrinks 8× compared to standard MHA, making 67B inference tractable on 4× A100 80GB GPUs.
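The 8× figure falls straight out of the dimensions in the table below. A back-of-the-envelope check, assuming BF16 (2 bytes per element) and a full 4K-token sequence:

```python
def kv_cache_bytes(layers: int, kv_heads: int, head_dim: int,
                   seq_len: int, bytes_per_elem: int = 2) -> int:
    # Two cached tensors (K and V) per layer, each [kv_heads, seq_len, head_dim].
    return 2 * layers * kv_heads * head_dim * seq_len * bytes_per_elem

layers, q_heads, kv_heads = 95, 64, 8
head_dim = 8192 // 64  # hidden dim / query heads = 128
ctx = 4096

mha = kv_cache_bytes(layers, q_heads, head_dim, ctx)   # MHA caches all 64 heads
gqa = kv_cache_bytes(layers, kv_heads, head_dim, ctx)  # GQA caches only 8
print(f"MHA: {mha / 1e9:.1f} GB, GQA: {gqa / 1e9:.1f} GB, ratio: {mha // gqa}x")
# MHA: 12.8 GB, GQA: 1.6 GB, ratio: 8x (per 4K-token sequence)
```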
Third, a 100,000-token BPE vocabulary — 3× LLaMA's 32K. This dramatically improves Chinese tokenisation efficiency: a Chinese sentence that takes 32+ LLaMA tokens becomes 10–12 tokens with 100K BPE. This single decision explains much of the Chinese benchmark gap between DeepSeek LLM and LLaMA-2. The 100K vocabulary was inherited by every subsequent DeepSeek model through V4.
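The efficiency gap is easy to measure directly. An illustrative comparison with public tokenizers (the meta-llama repo is gated on the Hub, and any 32K LLaMA tokenizer gives a similar result; exact counts vary by sentence):

```python
from transformers import AutoTokenizer

deepseek = AutoTokenizer.from_pretrained("deepseek-ai/deepseek-llm-7b-base")
# meta-llama repos are gated; any 32K-vocab LLaMA tokenizer shows the same effect.
llama2 = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")

sentence = "深度求索于2023年11月发布了两个从头训练的开源大语言模型。"
print("DeepSeek 100K BPE:", len(deepseek.encode(sentence)))
print("LLaMA-2 32K BPE:  ", len(llama2.encode(sentence)))
```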
New scaling laws published alongside the model: The paper derives optimal batch size B* ∝ C^0.24 and learning rate α* ∝ C^-0.31 as functions of compute budget C — extending Chinchilla. Validated across 1e20 to full-scale FLOPs. Results: 7B trained at batch=2304, lr=4.2e-4; 67B at batch=4608, lr=3.2e-4.
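Turned into code, the power laws become a hyperparameter rule of thumb. The sketch below anchors the prefactors to the quoted 7B run rather than the paper's own fitted constants, and approximates compute as C ≈ 6ND; extrapolating to 67B lands in the right ballpark (batch ≈ 4000, lr ≈ 2e-4) but not exactly on the quoted 4608 / 3.2e-4.

```python
def estimate_hparams(n_params: float, n_tokens: float,
                     ref=(7e9, 2e12, 2304, 4.2e-4)):
    """Extrapolate batch size and learning rate via B* ∝ C^0.24, α* ∝ C^-0.31.

    Prefactors are anchored to the quoted 7B run (batch 2304, lr 4.2e-4),
    not the paper's own fits; compute is approximated as C ≈ 6 * N * D FLOPs.
    """
    n_ref, d_ref, batch_ref, lr_ref = ref
    ratio = (6 * n_params * n_tokens) / (6 * n_ref * d_ref)
    return round(batch_ref * ratio ** 0.24), lr_ref * ratio ** -0.31

# 67B on 2T tokens: ballpark of the quoted batch=4608, lr=3.2e-4.
print(estimate_hparams(67e9, 2e12))  # ≈ (3962, 2.1e-04)
```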
| Component | 7B | 67B |
|---|---|---|
| Layers | 30 | 95 |
| Hidden dim. | 4096 | 8192 |
| Attention heads | 32 | 64 Q / 8 KV |
| Attention type | MHA | GQA |
| Context window | 4096 | 4096 |
| Vocabulary size | 100,000 (BPE) | |
| Positional enc. | RoPE | |
| FFN activation | SwiGLU | |
| Normalisation | RMSNorm (pre-norm) | |
| Alignment | SFT then DPO | |
All results from official papers. At the time of release, DeepSeek Coder-33B and LLM-67B held the state-of-the-art open-source positions on their respective benchmark sets.
DeepSeek LLM 67B Chat scored 65 out of 100 on the Hungarian National High School Mathematics Exam — a real-world exam never seen in training. This provides a contamination-resistant generalisation signal beyond standard benchmark scores, demonstrating that the math improvements are genuine reasoning gains and not memorisation artefacts.
DeepSeek LLM 67B Chat was the first open-source model to surpass GPT-3.5-Turbo on Chinese open-ended evaluation. At the time, Chinese-language open-source models lagged closed-source models by a wide margin. DeepSeek's 100,000-token BPE vocabulary (vs LLaMA's 32K) and a 2T bilingual training corpus closed the gap entirely. A Chinese sentence that requires 30+ LLaMA tokens needs only 10–12 with the 100K vocabulary — more tokens available for reasoning rather than character encoding, which directly explains the quality improvement on long Chinese generation tasks.
Both V1 families were trained entirely from scratch: no distillation, no weight initialisation from existing models. The compute infrastructure, tokeniser, data pipeline, and hyperparameter schedule were all original contributions.
Data composition: 2 trillion tokens at 87% source code / 13% natural language (English + Chinese). GitHub repos collected before November 2023. Filtering follows StarCoder pipeline: average line length ≤100 chars, maximum line length ≤1000, ≥25% alphabetic characters. XML header check for XSLT files. Near-deduplication across all 338 languages.
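The numeric rules translate directly into a per-file filter. A minimal sketch with the thresholds quoted above (the function name and the handling of empty files are our own choices):

```python
def passes_quality_filters(source: str) -> bool:
    """StarCoder-style numeric filters quoted above, applied to one file."""
    lines = source.splitlines()
    if not lines:
        return False
    if max(len(line) for line in lines) > 1000:              # max line length ≤ 1000
        return False
    if sum(len(line) for line in lines) / len(lines) > 100:  # average line length ≤ 100
        return False
    alphabetic = sum(ch.isalpha() for ch in source)
    return alphabetic / len(source) >= 0.25                  # ≥ 25% alphabetic characters
```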
Repo-level document construction: Files assembled in topological dependency order — imports first, dependents later. This lets the model learn cross-file call graphs and API contracts, not isolated snippets.
FIM curriculum: PSM-format fill-in-the-middle applied to ~50% of training examples, using the special tokens <|fim▁begin|>, <|fim▁hole|>, and <|fim▁end|> described above. Enables both code completion and infilling without a separate model.
Infrastructure: Trained on DeepSeek's Fire-Flyer 2 GPU cluster using a multi-node setup with Flash Attention 2 for memory efficiency. AdamW optimiser with cosine learning rate decay.
Data: 2 trillion tokens in English and Chinese from Common Crawl with aggressive deduplication, quality filtering, and domain upsampling for scientific and technical content. High-quality Chinese web data was a primary focus — the paper dedicates substantial analysis to Chinese data quality as the key differentiator from LLaMA-2.
Novel scaling law derivation: The paper derives power-law relationships for optimal batch size (B* ∝ C^0.24) and learning rate (α* ∝ C^-0.31) as functions of compute budget. These were fitted on models at 1B, 7B, and 67B scales. This eliminated expensive grid search: batch and LR were set analytically before the full training run began.
Alignment pipeline: Base models were aligned via Supervised Fine-Tuning (SFT) first, then Direct Preference Optimization (DPO) — an RLHF-free alignment approach that was novel at the time. The DPO step gave the Chat models their notably natural conversational style.
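The DPO step itself reduces to a simple objective. Below is the core loss from Rafailov et al. (2023) written out in PyTorch rather than taken from a library, assuming per-response log-probabilities have already been summed under the policy and the frozen SFT reference; beta is illustrative.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logp: torch.Tensor, policy_rejected_logp: torch.Tensor,
             ref_chosen_logp: torch.Tensor, ref_rejected_logp: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """DPO objective: raise the policy's margin on preferred responses
    relative to the frozen reference model (the SFT checkpoint)."""
    chosen_reward = beta * (policy_chosen_logp - ref_chosen_logp)
    rejected_reward = beta * (policy_rejected_logp - ref_rejected_logp)
    # Maximise the margin between preferred and dispreferred responses.
    return -F.logsigmoid(chosen_reward - rejected_reward).mean()

# Toy batch of three preference pairs (summed per-response log-probs).
loss = dpo_loss(torch.tensor([-12.0, -9.5, -20.1]), torch.tensor([-14.2, -11.0, -19.8]),
                torch.tensor([-13.0, -10.0, -20.0]), torch.tensor([-13.5, -10.5, -20.0]))
print(loss.item())
```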
Intermediate checkpoints: Multiple pre-training checkpoints for both 7B and 67B Base models were published via AWS S3 at standard S3 egress pricing — enabling the research community to study capability emergence during training.
In a landscape of LLaMA fine-tunes, DeepSeek V1 trained from scratch, published scaling laws, and open-sourced every weight. The decisions made here echo through every subsequent DeepSeek model.
Original: No LLaMA initialisation, no weight distillation, no shortcuts. Both families pre-trained entirely from scratch on 2T tokens — an unusual commitment when most open-source models in 2023 were fine-tuned LLaMA derivatives.
Innovation: DeepSeek Coder assembled GitHub repositories as topologically-sorted training documents — files ordered by dependency. The first major code LLM to do this systematically, enabling genuine cross-file understanding rather than snippet-level memorisation.
Foundation: A 100,000-token BPE vocabulary, 3× larger than LLaMA's 32K. Critical for Chinese tokenisation: a Chinese sentence takes 30+ LLaMA tokens but 10–12 with 100K BPE. More context budget for reasoning, less for character encoding. This vocabulary was carried forward through every DeepSeek generation to V4.
Research: Derived optimal batch size B* ∝ C^0.24 and learning rate α* ∝ C^-0.31 as power-law functions of compute budget — extending Chinchilla. Validated across 7B and 67B scale. Published openly; widely cited in subsequent AI research and adopted by later DeepSeek training runs.
Efficiency: The 67B model adopted Grouped-Query Attention with 8 KV heads vs 64 query heads before GQA became commonplace. This reduced the KV cache 8×, making 67B inference practical on 4× A100 80GB GPUs instead of 8. A practical engineering choice ahead of its time.
Code-specific: ~50% of Coder training used PSM-format Fill-in-the-Middle — teaching the model to predict code given both prefix and suffix context. Enabled Copilot-style inline completion in a single checkpoint, without a separate model.
Historic: DeepSeek LLM 67B Chat surpassed GPT-3.5 on Chinese open-ended evaluation — the first open-source model to achieve this. The 100K vocabulary and bilingual 2T training corpus closed a gap that LLaMA-2 could never bridge with its 32K English-biased tokeniser.
Open: LLM 7B and 67B intermediate pre-training checkpoints published via AWS S3 — enabling researchers to study capability emergence across training. Unprecedented transparency in late 2023, when most labs released only final weights.
Engineering: 30 layers (7B) and 95 layers (67B) — counting the embedding and output layers, both partition evenly across 8-GPU pipeline stages, minimising bubble overhead in distributed training. A practical engineering detail rarely documented publicly, but critical for efficient multi-node training at scale.
Coverage: Coder training covered 338 distinct programming languages from Ada to Zig — far wider than CodeLlama's focus. Enables strong performance on rare languages including Agda, Alloy, Bash, COBOL, Forth, Prolog, Racket, Solidity, and many more.
Commercial ✓: Released under the DeepSeek Licence, permitting commercial use immediately — at a time when LLaMA had non-commercial restrictions. This decision accelerated industry adoption and differentiated DeepSeek from both LLaMA-based and proprietary alternatives.
Legacy: Every innovation in V1 — the BPE tokeniser, the Fire-Flyer 2 infrastructure, the scaling law methodology, the bilingual training recipe, the SFT+DPO alignment pipeline — was inherited, extended, and refined in V2 (MoE), V3 (FP8), R1 (RL), and V4. This is where it started.
All V1 models remain available on Hugging Face under the DeepSeek Licence (commercial use permitted). Loadable with the transformers library — no special packages required.
How the V1 models stacked up against the best available open-source and closed-source models at their time of release.
| Model | HumanEval | GSM8K | MMLU | Context | Open Source | Chinese |
|---|---|---|---|---|---|---|
| DS-Coder-33B-Inst | >GPT-3.5 ✓ | — | — | 16K | ✓ Commercial | — |
| DS-LLM-67B-Chat | 73.78% | 84.1% ✓ | 71.3% | 4K | ✓ Commercial | >GPT-3.5 ✓ |
| GPT-3.5-Turbo | ~72% | 78.9% | ~70% | 16K | ✗ Closed | Below DS-67B |
| CodeLlama-34B | 48.8% | — | — | 100K | Non-commercial | — |
| LLaMA-2-70B-Chat | 32.3% | 56.8% | 68.9% | 4K | Non-commercial | Far below DS |
| StarCoder-15.5B | 33.6% | — | — | 8K | OpenRAIL-M | — |
V1 models are superseded by V4 for production use. But there are six compelling reasons to still work with them — particularly the Coder models and intermediate checkpoints.
The 7B Base models (both Coder and LLM) are ideal fine-tuning starting points for researchers who need a documented, well-understood foundation at modest scale. Smaller compute cost, same architectural decisions as V4. Perfect for domain-specific code or language adaptation experiments.
The intermediate training checkpoints at multiple token counts are unique research artifacts. They document capability emergence across the full pre-training trajectory for 7B and 67B models — invaluable for AI safety research, capability elicitation studies, and training dynamics analysis.
DeepSeek-Coder-1.3B runs on CPU with ~3GB RAM via llama.cpp and Ollama — one of the strongest sub-2B code models ever released. Still deployed in production by several open-source IDE plugins and CI systems where latency requirements prevent API calls.
V1 is a precise snapshot of the pre-R1 open-source frontier. Researchers studying the evolution of LLM capabilities, the impact of vocabulary size on multilingual performance, or the history of Chinese language models use V1 as a documented baseline with full architecture transparency.
DeepSeek Coder's Fill-in-the-Middle capability remains competitive for IDE-style inline completion at sub-7B scale. The 6.7B Instruct model offers a good quality-to-compute ratio for local code assistants. Several open-source VS Code extensions still ship it as their default completion backend.
338-language training covers many languages underrepresented in newer models. For niche languages like Agda, Alloy, Coq, Forth, Prolog, or legacy COBOL, Coder V1 models often outperform newer general-purpose models that over-index on Python/Java. A reference tool for polyglot developers.
All V1 models remain on Hugging Face. The DeepSeek Licence permits commercial use. Training checkpoints available via AWS S3.
All variants at huggingface.co/deepseek-ai. Use from_pretrained(..., device_map="auto") for multi-GPU. 7B ≈ 15GB; 67B ≈ 134GB in BF16. 4-bit quantised versions load in ~4GB and ~35GB respectively.
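A minimal loading sketch following that note; the repo ID is official, while the prompt and generation settings are illustrative.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "deepseek-ai/deepseek-llm-7b-chat"  # ~15GB in BF16
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",  # shard across available GPUs
)

messages = [{"role": "user", "content": "Explain grouped-query attention in two sentences."}]
inputs = tok.apply_chat_template(messages, add_generation_prompt=True,
                                 return_tensors="pt").to(model.device)
out = model.generate(inputs, max_new_tokens=128)
print(tok.decode(out[0][inputs.shape[1]:], skip_special_tokens=True))
```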
Run ollama run deepseek-coder:6.7b for the 6.7B Instruct model (~8GB). For the 1.3B CPU version: ollama run deepseek-coder:1.3b. Full list at ollama.com/library/deepseek-coder.
LLM 7B and 67B checkpoints at multiple training steps: aws s3 ls s3://deepseek-ai/DeepSeek-LLM/ --request-payer requester. Standard S3 egress charges apply. Ideal for capability emergence research.
V1 models use the DeepSeek Model Licence — commercial use is permitted with attribution. Note that V3, R1, and V4 use the more permissive MIT Licence. Read the V1 licence carefully before SaaS deployment.
Both LLM and Coder Base variants are clean fine-tuning foundations. Use PEFT/LoRA via the official finetune scripts. The 7B Base is the standard pick for academic experiments under 40GB VRAM.
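A hypothetical LoRA configuration for the 7B Base; the rank, alpha, and target modules below are common defaults for LLaMA-style architectures, not values taken from DeepSeek's finetune scripts.

```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("deepseek-ai/deepseek-llm-7b-base")
lora = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,  # common defaults, not official values
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # LLaMA-style attention projections
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora)
model.print_trainable_parameters()  # a fraction of a percent of the 7B weights
```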
For new production applications use platform.deepseek.com: deepseek-v4-flash ($0.14/1M input) or deepseek-v4-pro. V4 is 10–15× stronger on every benchmark and supports 1M token context — not comparable to V1.
Every DeepSeek model that followed traces its lineage directly to the decisions made in November 2023 — the tokeniser, the infrastructure, the scaling law methodology, the bilingual training recipe. V1 was not a stepping stone. It was the foundation.
DeepSeek Coder (November 2, 2023): Released to the public as the company's debut. 1.3B to 33B parameters. 2 trillion tokens of 87% code across 338 languages. Repo-level training with topological ordering, FIM in PSM format, 16K context window. Coder-6.7B matched CodeLlama-34B. Coder-33B-Instruct surpassed GPT-3.5. Established DeepSeek's Fire-Flyer 2 cluster as a credible pre-training infrastructure. GitHub repo: 22.8K stars.
First open-source model · Code specialist · arXiv:2401.14196
DeepSeek LLM (November 29, 2023): 7B and 67B trained from scratch on 2T bilingual tokens. Novel scaling laws for batch size and learning rate published alongside weights. 67B beats LLaMA-2 70B across all major benchmarks. 67B Chat surpasses GPT-3.5 in Chinese — a historic first for open-source AI. Training checkpoints published via AWS S3. 100K BPE vocabulary and SFT+DPO alignment pipeline established as the DeepSeek standard.
First general LLM · Beats LLaMA-2 · Beats GPT-3.5 in Chinese · arXiv:2401.02954
DeepSeekMoE (January 2024): Built on V1's tokeniser, infrastructure, and BPE vocabulary. Explored MoE architecture at 2.7B and 16B scales before the full-size V2. Proved that sparse MoE is feasible on the Fire-Flyer 2 cluster. Directly prototyped the DeepSeekMoE architecture that powered V2's 236B model.
MoE prototype · V2's architectural foundation
DeepSeek-V2 (May 2024): 236B MoE model with Multi-Head Latent Attention (MLA) compressing KV cache by 93.3%. Used the same 100K BPE tokeniser and Fire-Flyer 2 infrastructure from V1. This was the release that retroactively made V1 "V1": its scaling law analysis was grounded in V1's published findings, and the naming convention "V2" implies a V1 — and V1 is these November 2023 models.
MoE + MLA · Retroactively makes Nov 2023 "V1"
DeepSeek-R1 (January 2025): The model that made DeepSeek globally famous — #1 App Store in 157 countries, 18% drop in Nvidia's stock, 2.6M app downloads in a week. Built on V3-Base using GRPO reinforcement learning — itself descended from DeepSeekMath (which was built on Coder-V1.5). R1's tokeniser, infrastructure, and bilingual training recipe all trace back to the 100K BPE vocabulary and Fire-Flyer 2 cluster decisions made 14 months earlier in November 2023.
Sputnik moment · Global #1 · Built on V1 foundations · 14 months after V1
DeepSeek-V4: V4-Pro (1.6T params) and V4-Flash (284B params), 1M token context, Codeforces #1 (3206 Elo), 80.6% SWE-bench Verified, IMO 2025 Gold Medal. Every aspect — the 100K BPE tokeniser, the bilingual training philosophy, the scaling law methodology, the Fire-Flyer infrastructure, the MoE architecture — is a direct descendant of the decisions made in Hangzhou in November 2023. The distance from V1 to V4: 29 months.
1M context · Codeforces #1 · 29 months from V1 · The current frontier
"DeepSeek V1" is a community shorthand, not an official product name. It refers to two model families DeepSeek released in November 2023: DeepSeek Coder (November 2) and DeepSeek LLM (November 29). The "V1" label is retroactive — it came when DeepSeek released "DeepSeek-V2" in May 2024, making the November 2023 generation implicitly the first version. DeepSeek never officially called them "V1" at launch. The correct official names are DeepSeek Coder and DeepSeek LLM.
DeepSeek Coder (November 2, 2023) was DeepSeek's first publicly released model — the first time the company released model weights. DeepSeek was founded in July 2023 as a spin-off from High-Flyer, the quantitative hedge fund. So the first public model came just 3.5 months after founding. DeepSeek had been training internal models before this, but Coder was the debut. DeepSeek LLM, 27 days later, was the first general-purpose release. Both are therefore the genuine "V1" of the DeepSeek model family.
The 87/13 split was chosen to maximise code domain expertise while preserving the natural language capability needed for conversational code assistance, docstrings, and README generation. 100% code would produce excellent completion but poor explanations. The 13% natural language was curated high-quality content — not random web crawl — which explains why the model's prose quality substantially exceeded expectations for a code specialist. The split became a reference point in the broader code LLM community and was adopted by many subsequent models.
The DeepSeek LLM paper extended Chinchilla's analysis by deriving how batch size and learning rate should scale as functions of compute budget. Specifically: optimal batch size B* ∝ C^0.24 and optimal learning rate α* ∝ C^-0.31, where C is compute in FLOPs. These power laws were fitted on models from 1e20 FLOPs up to the full 67B scale. The practical output: DeepSeek could analytically derive optimal hyperparameters before training started, without expensive grid search. The 7B trained with batch=2304 and lr=4.2e-4; the 67B with batch=4608 and lr=3.2e-4. The paper is still cited in LLM training research today.
V4 comprehensively outperforms V1 across all dimensions: V4-Pro achieves 80.6% SWE-bench Verified (V1 Coder-33B would score in low teens), 97.3% MATH-500 (V1 LLM-67B: 32.6% on MATH), and Codeforces #1 Elo 3206 (V1 had no competitive programming evaluation). V4 supports 1M token context vs V1's 4K (LLM) and 16K (Coder). V4-Pro with Think Max produces chain-of-thought reasoning V1 was architecturally incapable of. V1 remains useful for fine-tuning research and edge deployment — but has no place in production inference alongside V4's API pricing of $0.14/1M tokens.
V1 weights remain on Hugging Face and AWS S3 training checkpoints are still downloadable, but DeepSeek has not released updates since late 2024. They are in archive status — maintained by the community through llama.cpp, Ollama, and transformers, but not receiving active engineering from DeepSeek. The deepseek-chat API endpoint no longer serves V1 models — it now points to V4-Flash. If you need V1 for research, download the weights. They are not at risk of disappearing from Hugging Face, but they are not a priority for DeepSeek's engineering team.
V1 models use the DeepSeek Model Licence — commercial use is permitted with attribution. This was notable in late 2023 when LLaMA-2 had non-commercial restrictions. The licence allows you to build applications, APIs, and products using V1 weights as long as you attribute DeepSeek. Note that subsequent models differ: V3, R1, V3.2, and V4 use the more permissive MIT Licence, which places fewer restrictions. Always read the specific licence file in each model's GitHub/HuggingFace repository before commercial deployment.
November 2023. DeepSeek Coder. DeepSeek LLM. Two models from a four-month-old startup that beat CodeLlama, beat LLaMA-2, and beat GPT-3.5 in Chinese — all trained from scratch. Open-source from day one. The origin of V2, R1, V3, and V4.