In November 2023, a four-month-old startup in Hangzhou released two models that shouldn't have been possible. DeepSeek Coder (Nov 2) matched CodeLlama-34B with a 6.7B model. DeepSeek LLM (Nov 29) beat LLaMA-2 70B and became the first open-source model to surpass GPT-3.5 in Chinese. Both trained from scratch on 2 trillion tokens. Both open-source from day one. The origin of everything that followed.
Both DeepSeek Coder and DeepSeek LLM were trained from scratch on 2 trillion tokens — no LLaMA initialisation, no weight distillation. In 2023, most "open-source" models were LLaMA fine-tunes. DeepSeek chose the harder path, and won on benchmarks.
DeepSeek's debut — and the first open-source code model to seriously challenge CodeLlama. Trained from scratch on 2 trillion tokens at 87% code / 13% natural language across 338 programming languages. The 6.7B Base model matched CodeLlama-34B on HumanEval. The 33B Instruct model surpassed GPT-3.5-Turbo on HumanEval. Four sizes — 1.3B, 5.7B, 6.7B, 33B — with Base and Instruct variants. Supports Fill-in-the-Middle (FIM) for IDE-style code completion using PSM format. 16K token context window and repo-level training for project-level understanding. DeepSeek Coder-Base 6.7B outperforms all open-source models at or below 13B on HumanEval, MultiPL-E, MBPP, and DS-1000.
The first general-purpose DeepSeek model — 7B and 67B trained on 2 trillion bilingual (English + Chinese) tokens from scratch. The 67B model beat LLaMA-2 70B across every major benchmark. The Chat variant beat GPT-3.5-Turbo on Chinese open-ended evaluation — the first open-source model to do so. Aligned via SFT + DPO. Used an unusually large 100,000-token BPE vocabulary (3× LLaMA's 32K) to dramatically improve Chinese token efficiency. Novel scaling laws for batch size and learning rate published alongside the model weights. Training checkpoints available via AWS S3.
DeepSeek Coder and DeepSeek LLM.
Both families shipped with multiple sizes and Base/Chat or Base/Instruct splits — giving developers the full spectrum from 3GB RAM edge devices to research-grade ~134GB flagship inference.
The smallest Coder model — designed for edge devices, CI pipelines, and fast inline completion. Runs on CPU with ~3GB RAM. Quantised GGUF versions available via llama.cpp and Ollama. Still deployed in production by several open-source IDE plugins.
The sweet spot for consumer hardware. Instruction-tuned variant for conversational code help. Runs on a single RTX 3080 Ti / 4070. Strong multilingual performance. Frequently recommended as the best sub-7B code model of 2023.
The flagship code model of V1. Coder-33B-Instruct beats GPT-3.5-Turbo on HumanEval and outperforms CodeLlama-34B by +7.9% Python, +9.3% multilingual, +10.8% MBPP. State-of-the-art open-source code model as of late 2023 by a wide margin.
The compact general-purpose model. 30 layers, MHA, 4K context. Trained on 2T bilingual tokens. Chat variant achieves GSM8K 63.0% zero-shot. Reaches parity with LLaMA-2 7B on English, surpasses it on Chinese. Ideal for fine-tuning experiments.
The flagship general model. 95 layers, GQA (8 KV heads), 100K BPE vocabulary. Beats LLaMA-2 70B on all benchmarks. 67B Chat surpasses GPT-3.5 in Chinese. HumanEval 73.78%, GSM8K 84.1%, MATH 32.6%. Hungarian High School Exam: 65/100.
A multi-query attention (MQA) variant of the Coder family — all key/value heads shared, shrinking the KV cache — exploring efficient attention at code model scale before DeepSeek-V2's MLA. Smaller memory footprint than 6.7B dense at near-equal quality on most benchmarks.
DeepSeek published intermediate pre-training checkpoints for the 7B LLM at multiple token counts via AWS S3 — enabling research into capability emergence during pre-training. Unprecedented transparency at the time of release.
A refined version of the 7B Coder model released as a minor update — better instruction following and improved handling of long-context repository tasks. Uses a DeepSeek-LLM-7B checkpoint as base before code training, producing a stronger general coding assistant.
Both V1 families use the LLaMA-style pre-norm decoder-only Transformer as a starting point, with carefully chosen modifications that anticipated what would become standard practice in later models — including GQA, 16K context, and repo-level training.
The Coder models use a standard auto-regressive Transformer with RoPE positional encoding and Flash Attention 2 throughout. The 16K token context window — 4× larger than the contemporary LLM sibling — is the centrepiece feature, enabling project-level code understanding via repo-level training document construction.
Repo-level training: Rather than training on individual files, DeepSeek assembled entire GitHub repositories as contiguous training documents with topological dependency ordering — lower-level files (utilities, helpers) sorted before higher-level consumers. This teaches the model cross-file import chains, not just isolated function bodies.
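To make the ordering concrete, here is a minimal sketch of dependency-ordered document assembly using Python's standard-library graphlib. The regex-based import detection and the file names are illustrative stand-ins, not DeepSeek's actual pipeline.

```python
import re
from graphlib import TopologicalSorter

def order_repo_files(files: dict[str, str]) -> list[str]:
    """Return file paths so that dependencies precede their consumers."""
    graph = {}
    for path, source in files.items():
        deps = set()
        # Naive Python-only import detection; a stand-in for real dependency parsing.
        for match in re.finditer(r"^\s*(?:from|import)\s+([\w.]+)", source, re.M):
            candidate = match.group(1).replace(".", "/") + ".py"
            if candidate in files:
                deps.add(candidate)
        graph[path] = deps
    # TopologicalSorter emits predecessors first (raises CycleError on cycles).
    return list(TopologicalSorter(graph).static_order())

repo = {
    "utils.py": "def helper(): ...",
    "core.py": "import utils\n\ndef run():\n    return utils.helper()",
    "main.py": "import core\n\ncore.run()",
}
print(order_repo_files(repo))  # ['utils.py', 'core.py', 'main.py']
```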
Fill-in-the-Middle (FIM) — PSM format: ~50% of training examples are reformatted as Prefix-Suffix-Middle triples using the PSM (Prefix, Suffix, Middle) format. Special tokens <|fim▁begin|>, <|fim▁hole|>, <|fim▁end|> mark the regions. This enables IDE-style inline code completion absent from most contemporaneous code LLMs.
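At inference time the same sentinels drive infilling. A minimal sketch with transformers, following the PSM prompt layout described above; the code snippet and generation settings are illustrative, and the exact sentinel spellings should be checked against the released tokenizer's special tokens.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "deepseek-ai/deepseek-coder-1.3b-base"
tok = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True)

prefix = "def quicksort(arr):\n    if len(arr) <= 1:\n        return arr\n    pivot = arr[0]\n"
suffix = "\n    return quicksort(left) + [pivot] + quicksort(right)\n"
# PSM layout: the model reads prefix and suffix, then generates the middle.
prompt = f"<|fim▁begin|>{prefix}<|fim▁hole|>{suffix}<|fim▁end|>"

inputs = tok(prompt, return_tensors="pt")
out = model.generate(**inputs, max_new_tokens=64)
# Decode only the newly generated middle span.
print(tok.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```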
Data pipeline: Collected from GitHub repos created before November 2023 using StarCoder-style filtering: max line length ≤1000 chars, ≥25% alphabetic content, XSLT XML header stripping. Near-deduplication applied before the 87/13 code-to-natural-language mix.
| Component | Value |
|---|---|
| Architecture | Pre-norm decoder-only Transformer |
| Positional enc. | RoPE |
| Attention type | MHA (MQA in 5.7B) + Flash Attention 2 |
| Context window | 16,384 tokens |
| Training objectives | Next-token + FIM (PSM format) |
| FIM fraction | ~50% of training examples |
| Model sizes | 1.3B, 5.7B (MQA), 6.7B, 33B |
| Languages supported | 338 programming languages |
DeepSeek LLM follows LLaMA's decoder-only Transformer with three deliberate departures. First, layer counts chosen for pipeline parallelism: 30 layers for 7B and 95 layers for 67B — once the embedding and output layers are counted, both models partition evenly across standard 8-GPU pipeline stages, reducing pipeline bubble overhead without changing the core architecture.
Second, the 67B model uses Grouped-Query Attention (GQA) — adopted before it was standard — with 8 KV heads vs 64 query heads. The KV cache shrinks 8× compared to standard MHA, making 67B inference tractable on 4× A100 80GB GPUs.
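The 8× figure falls straight out of the dimensions in the table below. A back-of-the-envelope check, assuming BF16 (2 bytes per element) and a full 4K-token sequence:

```python
def kv_cache_bytes(layers: int, kv_heads: int, head_dim: int,
                   seq_len: int, bytes_per_elem: int = 2) -> int:
    # Two cached tensors (K and V) per layer, each [kv_heads, seq_len, head_dim].
    return 2 * layers * kv_heads * head_dim * seq_len * bytes_per_elem

layers, q_heads, kv_heads = 95, 64, 8
head_dim = 8192 // 64  # hidden dim / query heads = 128
ctx = 4096

mha = kv_cache_bytes(layers, q_heads, head_dim, ctx)   # MHA caches all 64 heads
gqa = kv_cache_bytes(layers, kv_heads, head_dim, ctx)  # GQA caches only 8
print(f"MHA: {mha / 1e9:.1f} GB, GQA: {gqa / 1e9:.1f} GB, ratio: {mha // gqa}x")
# MHA: 12.8 GB, GQA: 1.6 GB, ratio: 8x (per 4K-token sequence)
```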
Third, a 100,000-token BPE vocabulary — 3× LLaMA's 32K. This dramatically improves Chinese tokenisation efficiency: a Chinese sentence that takes 32+ LLaMA tokens becomes 10–12 tokens with 100K BPE. This single decision explains much of the Chinese benchmark gap between DeepSeek LLM and LLaMA-2. The 100K vocabulary was inherited by every subsequent DeepSeek model through V4.
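The efficiency gap is easy to measure directly. An illustrative comparison with public tokenizers (the meta-llama repo is gated on the Hub, and any 32K LLaMA tokenizer gives a similar result; exact counts vary by sentence):

```python
from transformers import AutoTokenizer

deepseek = AutoTokenizer.from_pretrained("deepseek-ai/deepseek-llm-7b-base")
# meta-llama repos are gated; any 32K-vocab LLaMA tokenizer shows the same effect.
llama2 = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")

sentence = "深度求索于2023年11月发布了两个从头训练的开源大语言模型。"
print("DeepSeek 100K BPE:", len(deepseek.encode(sentence)))
print("LLaMA-2 32K BPE:  ", len(llama2.encode(sentence)))
```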
New scaling laws published alongside the model: The paper derives optimal batch size B* ∝ C^0.24 and learning rate α* ∝ C^-0.31 as functions of compute budget C — extending Chinchilla. Validated across 1e20 to full-scale FLOPs. Results: 7B trained at batch=2304, lr=4.2e-4; 67B at batch=4608, lr=3.2e-4.
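Turned into code, the power laws become a hyperparameter rule of thumb. The sketch below anchors the prefactors to the quoted 7B run rather than the paper's own fitted constants, and approximates compute as C ≈ 6ND; extrapolating to 67B lands in the right ballpark (batch ≈ 4000, lr ≈ 2e-4) but not exactly on the quoted 4608 / 3.2e-4.

```python
def estimate_hparams(n_params: float, n_tokens: float,
                     ref=(7e9, 2e12, 2304, 4.2e-4)):
    """Extrapolate batch size and learning rate via B* ∝ C^0.24, α* ∝ C^-0.31.

    Prefactors are anchored to the quoted 7B run (batch 2304, lr 4.2e-4),
    not the paper's own fits; compute is approximated as C ≈ 6 * N * D FLOPs.
    """
    n_ref, d_ref, batch_ref, lr_ref = ref
    ratio = (6 * n_params * n_tokens) / (6 * n_ref * d_ref)
    return round(batch_ref * ratio ** 0.24), lr_ref * ratio ** -0.31

# 67B on 2T tokens: ballpark of the quoted batch=4608, lr=3.2e-4.
print(estimate_hparams(67e9, 2e12))  # ≈ (3962, 2.1e-04)
```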
| Component | 7B | 67B |
|---|---|---|
| Layers | 30 | 95 |
| Hidden dim. | 4096 | 8192 |
| Attention heads | 32 | 64 Q / 8 KV |
| Attention type | MHA | GQA |
| Context window | 4096 | 4096 |
| Vocabulary size | 100,000 (BPE) | |
| Positional enc. | RoPE | |
| FFN activation | SwiGLU | |
| Normalisation | RMSNorm (pre-norm) | |
| Alignment | SFT then DPO | |
All results from official papers. At the time of release, DeepSeek Coder-33B and LLM-67B held the state-of-the-art open-source positions on their respective benchmark sets.
DeepSeek LLM 67B Chat scored 65 out of 100 on the Hungarian National High School Mathematics Exam — a real-world exam never seen in training. This provides a contamination-resistant generalisation signal beyond standard benchmark scores, demonstrating that the math improvements are genuine reasoning gains and not memorisation artefacts.
DeepSeek LLM 67B Chat was the first open-source model to surpass GPT-3.5-Turbo on Chinese open-ended evaluation. At the time, Chinese-language open-source models lagged closed-source models by a wide margin. DeepSeek's 100,000-token BPE vocabulary (vs LLaMA's 32K) and a 2T bilingual training corpus closed the gap entirely. A Chinese sentence that requires 30+ LLaMA tokens needs only 10–12 with the 100K vocabulary — more tokens available for reasoning rather than character encoding, which directly explains the quality improvement on long Chinese generation tasks.
Both V1 families were trained entirely from scratch: no distillation, no weight initialisation from existing models. The compute infrastructure, tokeniser, data pipeline, and hyperparameter schedule were all original contributions.
Data composition: 2 trillion tokens at 87% source code / 13% natural language (English + Chinese). GitHub repos collected before November 2023. Filtering follows StarCoder pipeline: average line length ≤100 chars, maximum line length ≤1000, ≥25% alphabetic characters. XML header check for XSLT files. Near-deduplication across all 338 languages.
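The numeric rules translate directly into a per-file filter. A minimal sketch with the thresholds quoted above (the function name and the handling of empty files are our own choices):

```python
def passes_quality_filters(source: str) -> bool:
    """StarCoder-style numeric filters quoted above, applied to one file."""
    lines = source.splitlines()
    if not lines:
        return False
    if max(len(line) for line in lines) > 1000:              # max line length ≤ 1000
        return False
    if sum(len(line) for line in lines) / len(lines) > 100:  # average line length ≤ 100
        return False
    alphabetic = sum(ch.isalpha() for ch in source)
    return alphabetic / len(source) >= 0.25                  # ≥ 25% alphabetic characters
```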
Repo-level document construction: Files assembled in topological dependency order — imports first, dependents later. This lets the model learn cross-file call graphs and API contracts, not isolated snippets.
FIM curriculum: PSM-format fill-in-the-middle applied to ~50% of training examples, using the special tokens <|fim▁begin|>, <|fim▁hole|>, and <|fim▁end|> described above. Enables both code completion and infilling without a separate model.
Infrastructure: Trained on DeepSeek's Fire-Flyer 2 GPU cluster using a multi-node setup with Flash Attention 2 for memory efficiency. AdamW optimiser with cosine learning rate decay.
Data: 2 trillion tokens in English and Chinese from Common Crawl with aggressive deduplication, quality filtering, and domain upsampling for scientific and technical content. High-quality Chinese web data was a primary focus — the paper dedicates substantial analysis to Chinese data quality as the key differentiator from LLaMA-2.
Novel scaling law derivation: The paper derives power-law relationships for optimal batch size (B* ∝ C^0.24) and learning rate (α* ∝ C^-0.31) as functions of compute budget. These were fitted on models at 1B, 7B, and 67B scales. This eliminated expensive grid search: batch and LR were set analytically before the full training run began.
Alignment pipeline: Base models were aligned via Supervised Fine-Tuning (SFT) first, then Direct Preference Optimization (DPO) — an RLHF-free alignment approach that was novel at the time. The DPO step gave the Chat models their notably natural conversational style.
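The DPO step itself reduces to a simple objective. Below is the core loss from Rafailov et al. (2023) written out in PyTorch rather than taken from a library, assuming per-response log-probabilities have already been summed under the policy and the frozen SFT reference; beta is illustrative.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logp: torch.Tensor, policy_rejected_logp: torch.Tensor,
             ref_chosen_logp: torch.Tensor, ref_rejected_logp: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """DPO objective: raise the policy's margin on preferred responses
    relative to the frozen reference model (the SFT checkpoint)."""
    chosen_reward = beta * (policy_chosen_logp - ref_chosen_logp)
    rejected_reward = beta * (policy_rejected_logp - ref_rejected_logp)
    # Maximise the margin between preferred and dispreferred responses.
    return -F.logsigmoid(chosen_reward - rejected_reward).mean()

# Toy batch of three preference pairs (summed per-response log-probs).
loss = dpo_loss(torch.tensor([-12.0, -9.5, -20.1]), torch.tensor([-14.2, -11.0, -19.8]),
                torch.tensor([-13.0, -10.0, -20.0]), torch.tensor([-13.5, -10.5, -20.0]))
print(loss.item())
```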
Intermediate checkpoints: Multiple pre-training checkpoints for both 7B and 67B Base models were published via AWS S3 at standard S3 egress pricing — enabling the research community to study capability emergence during training.
In a landscape of LLaMA fine-tunes, DeepSeek V1 trained from scratch, published scaling laws, and open-sourced every weight. The decisions made here echo through every subsequent DeepSeek model.
Original: No LLaMA initialisation, no weight distillation, no shortcuts. Both families pre-trained entirely from scratch on 2T tokens — an unusual commitment when most open-source models in 2023 were fine-tuned LLaMA derivatives.
Innovation: DeepSeek Coder assembled GitHub repositories as topologically-sorted training documents — files ordered by dependency. The first major code LLM to do this systematically, enabling genuine cross-file understanding rather than snippet-level memorisation.
Foundation: A 100,000-token BPE vocabulary, 3× larger than LLaMA's 32K. Critical for Chinese tokenisation: a Chinese sentence takes 30+ LLaMA tokens but 10–12 with 100K BPE. More context budget for reasoning, less for character encoding. This vocabulary was carried forward through every DeepSeek generation to V4.
Research: Derived optimal batch size B* ∝ C^0.24 and learning rate α* ∝ C^-0.31 as power-law functions of compute budget — extending Chinchilla. Validated across 7B and 67B scale. Published openly; widely cited in subsequent AI research and adopted by later DeepSeek training runs.
Efficiency: The 67B model adopted Grouped-Query Attention with 8 KV heads vs 64 query heads before GQA became commonplace. This reduced the KV cache 8×, making 67B inference practical on 4× A100 80GB GPUs instead of 8. A practical engineering choice ahead of its time.
Code-specific: ~50% of Coder training used PSM-format Fill-in-the-Middle — teaching the model to predict code given both prefix and suffix context. Enabled Copilot-style inline completion in a single checkpoint, without a separate model.
Historic: DeepSeek LLM 67B Chat surpassed GPT-3.5 on Chinese open-ended evaluation — the first open-source model to achieve this. The 100K vocabulary and bilingual 2T training corpus closed a gap that LLaMA-2 could never bridge with its 32K English-biased tokeniser.
Open: LLM 7B and 67B intermediate pre-training checkpoints published via AWS S3 — enabling researchers to study capability emergence across training. Unprecedented transparency in late 2023, when most labs released only final weights.
Engineering: 30 layers (7B) and 95 layers (67B) — counting the embedding and output layers, both partition evenly across 8-GPU pipeline stages, minimising bubble overhead in distributed training. A practical engineering detail rarely documented publicly, but critical for efficient multi-node training at scale.
Coverage: Coder training covered 338 distinct programming languages from Ada to Zig — far wider than CodeLlama's focus. Enables strong performance on rare languages including Agda, Alloy, Bash, COBOL, Forth, Prolog, Racket, Solidity, and many more.
Commercial ✓: Released under the DeepSeek Licence, permitting commercial use immediately — at a time when LLaMA had non-commercial restrictions. This decision accelerated industry adoption and differentiated DeepSeek from both LLaMA-based and proprietary alternatives.
Legacy: Every innovation in V1 — the BPE tokeniser, the Fire-Flyer 2 infrastructure, the scaling law methodology, the bilingual training recipe, the SFT+DPO alignment pipeline — was inherited, extended, and refined in V2 (MoE), V3 (FP8), R1 (RL), and V4. This is where it started.
All V1 models remain available on Hugging Face under the DeepSeek Licence (commercial use permitted). Loadable with the transformers library — no special packages required.
How the V1 models stacked up against the best available open-source and closed-source models at their time of release.
| Model | HumanEval | GSM8K | MMLU | Context | Open Source | Chinese |
|---|---|---|---|---|---|---|
| DS-Coder-33B-Inst | >GPT-3.5 ✓ | — | — | 16K | ✓ Commercial | — |
| DS-LLM-67B-Chat | 73.78% | 84.1% ✓ | 71.3% | 4K | ✓ Commercial | >GPT-3.5 ✓ |
| GPT-3.5-Turbo | ~72% | 78.9% | ~70% | 16K | ✗ Closed | Below DS-67B |
| CodeLlama-34B | 48.8% | — | — | 100K | Non-commercial | — |
| LLaMA-2-70B-Chat | 32.3% | 56.8% | 68.9% | 4K | Non-commercial | Far below DS |
| StarCoder-15.5B | 33.6% | — | — | 8K | OpenRAIL-M | — |
V1 models are superseded by V4 for production use. But there are six compelling reasons to still work with them — particularly the Coder models and intermediate checkpoints.
The 7B Base models (both Coder and LLM) are ideal fine-tuning starting points for researchers who need a documented, well-understood foundation at modest scale. Smaller compute cost, same architectural decisions as V4. Perfect for domain-specific code or language adaptation experiments.
The intermediate training checkpoints at multiple token counts are unique research artifacts. They document capability emergence across the full pre-training trajectory for 7B and 67B models — invaluable for AI safety research, capability elicitation studies, and training dynamics analysis.
DeepSeek-Coder-1.3B runs on CPU with ~3GB RAM via llama.cpp and Ollama — one of the strongest sub-2B code models ever released. Still deployed in production by several open-source IDE plugins and CI systems where latency requirements prevent API calls.
V1 is a precise snapshot of the pre-R1 open-source frontier. Researchers studying the evolution of LLM capabilities, the impact of vocabulary size on multilingual performance, or the history of Chinese language models use V1 as a documented baseline with full architecture transparency.
DeepSeek Coder's Fill-in-the-Middle capability remains competitive for IDE-style inline completion at sub-7B scale. The 6.7B Instruct model offers a good quality-to-compute ratio for local code assistants. Several open-source VS Code extensions still ship it as their default completion backend.
338-language training covers many languages underrepresented in newer models. For niche languages like Agda, Alloy, Coq, Forth, Prolog, or legacy COBOL, Coder V1 models often outperform newer general-purpose models that over-index on Python/Java. A reference tool for polyglot developers.
All V1 models remain on Hugging Face. The DeepSeek Licence permits commercial use. Training checkpoints available via AWS S3.
All variants at huggingface.co/deepseek-ai. Use from_pretrained(..., device_map="auto") for multi-GPU. 7B ≈ 15GB; 67B ≈ 134GB in BF16. 4-bit quantised versions load in ~4GB and ~35GB respectively.
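A minimal loading sketch following that note; the repo ID is official, while the prompt and generation settings are illustrative.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "deepseek-ai/deepseek-llm-7b-chat"  # ~15GB in BF16
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",  # shard across available GPUs
)

messages = [{"role": "user", "content": "Explain grouped-query attention in two sentences."}]
inputs = tok.apply_chat_template(messages, add_generation_prompt=True,
                                 return_tensors="pt").to(model.device)
out = model.generate(inputs, max_new_tokens=128)
print(tok.decode(out[0][inputs.shape[1]:], skip_special_tokens=True))
```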
Run ollama run deepseek-coder:6.7b for the 6.7B Instruct model (~8GB). For the 1.3B CPU version: ollama run deepseek-coder:1.3b. Full list at ollama.com/library/deepseek-coder.
LLM 7B and 67B checkpoints at multiple training steps: aws s3 ls s3://deepseek-ai/DeepSeek-LLM/ --request-payer requester. Standard S3 egress charges apply. Ideal for capability emergence research.
V1 models use the DeepSeek Model Licence — commercial use is permitted with attribution. Note that V3, R1, and V4 use the more permissive MIT Licence. Read the V1 licence carefully before SaaS deployment.
Both LLM and Coder Base variants are clean fine-tuning foundations. Use PEFT/LoRA via the official finetune scripts. The 7B Base is the standard pick for academic experiments under 40GB VRAM.
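A hypothetical LoRA configuration for the 7B Base; the rank, alpha, and target modules below are common defaults for LLaMA-style architectures, not values taken from DeepSeek's finetune scripts.

```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("deepseek-ai/deepseek-llm-7b-base")
lora = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,  # common defaults, not official values
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # LLaMA-style attention projections
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora)
model.print_trainable_parameters()  # a fraction of a percent of the 7B weights
```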
For new production applications use platform.deepseek.com: deepseek-v4-flash ($0.14/1M input) or deepseek-v4-pro. V4 is 10–15× stronger on every benchmark and supports 1M token context — not comparable to V1.
Every DeepSeek model that followed traces its lineage directly to the decisions made in November 2023 — the tokeniser, the infrastructure, the scaling law methodology, the bilingual training recipe. V1 was not a stepping stone. It was the foundation.
DeepSeek Coder (November 2, 2023): Released to the public as the company's debut. 1.3B to 33B parameters. 2 trillion tokens of 87% code across 338 languages. Repo-level training with topological ordering, FIM in PSM format, 16K context window. Coder-6.7B matched CodeLlama-34B. Coder-33B-Instruct surpassed GPT-3.5. Established DeepSeek's Fire-Flyer 2 cluster as a credible pre-training infrastructure. GitHub repo: 22.8K stars.
First open-source model · Code specialist · arXiv:2401.14196
DeepSeek LLM (November 29, 2023): 7B and 67B trained from scratch on 2T bilingual tokens. Novel scaling laws for batch size and learning rate published alongside weights. 67B beats LLaMA-2 70B across all major benchmarks. 67B Chat surpasses GPT-3.5 in Chinese — a historic first for open-source AI. Training checkpoints published via AWS S3. 100K BPE vocabulary and SFT+DPO alignment pipeline established as the DeepSeek standard.
First general LLM · Beats LLaMA-2 · Beats GPT-3.5 in Chinese · arXiv:2401.02954
DeepSeekMoE (January 2024): Built on V1's tokeniser, infrastructure, and BPE vocabulary. Explored MoE architecture at 2.7B and 16B scales before the full-size V2. Proved that sparse MoE is feasible on the Fire-Flyer 2 cluster. Directly prototyped the DeepSeekMoE architecture that powered V2's 236B model.
MoE prototype · V2's architectural foundation
DeepSeek-V2 (May 2024): 236B MoE model with Multi-Head Latent Attention (MLA) compressing KV cache by 93.3%. Used the same 100K BPE tokeniser and Fire-Flyer 2 infrastructure from V1. This was the release that retroactively made V1 "V1": its scaling law analysis was grounded in V1's published findings, and the naming convention "V2" implies a V1 — and V1 is these November 2023 models.
MoE + MLA · Retroactively makes Nov 2023 "V1"
DeepSeek-R1 (January 2025): The model that made DeepSeek globally famous — #1 App Store in 157 countries, 18% drop in Nvidia's stock, 2.6M app downloads in a week. Built on V3-Base using GRPO reinforcement learning — itself descended from DeepSeekMath (which was built on Coder-V1.5). R1's tokeniser, infrastructure, and bilingual training recipe all trace back to the 100K BPE vocabulary and Fire-Flyer 2 cluster decisions made 14 months earlier in November 2023.
Sputnik moment · Global #1 · Built on V1 foundations · 14 months after V1
DeepSeek-V4: V4-Pro (1.6T params) and V4-Flash (284B params), 1M token context, Codeforces #1 (3206 Elo), 80.6% SWE-bench Verified, IMO 2025 Gold Medal. Every aspect — the 100K BPE tokeniser, the bilingual training philosophy, the scaling law methodology, the Fire-Flyer infrastructure, the MoE architecture — is a direct descendant of the decisions made in Hangzhou in November 2023. The distance from V1 to V4: 29 months.
1M context · Codeforces #1 · 29 months from V1 · The current frontier
"DeepSeek V1" is a community shorthand, not an official product name. It refers to two model families DeepSeek released in November 2023: DeepSeek Coder (November 2) and DeepSeek LLM (November 29). The "V1" label is retroactive — it came when DeepSeek released "DeepSeek-V2" in May 2024, making the November 2023 generation implicitly the first version. DeepSeek never officially called them "V1" at launch. The correct official names are DeepSeek Coder and DeepSeek LLM.
DeepSeek Coder (November 2, 2023) was DeepSeek's first publicly released model — the first time the company released model weights. DeepSeek was founded in July 2023 as a spin-off from High-Flyer, the quantitative hedge fund. So the first public model came just 3.5 months after founding. DeepSeek had been training internal models before this, but Coder was the debut. DeepSeek LLM, 27 days later, was the first general-purpose release. Both are therefore the genuine "V1" of the DeepSeek model family.
The 87/13 split was chosen to maximise code domain expertise while preserving the natural language capability needed for conversational code assistance, docstrings, and README generation. 100% code would produce excellent completion but poor explanations. The 13% natural language was curated high-quality content — not random web crawl — which explains why the model's prose quality substantially exceeded expectations for a code specialist. The split became a reference point in the broader code LLM community and was adopted by many subsequent models.
The DeepSeek LLM paper extended Chinchilla's analysis by deriving how batch size and learning rate should scale as functions of compute budget. Specifically: optimal batch size B* ∝ C^0.24 and optimal learning rate α* ∝ C^-0.31, where C is compute in FLOPs. These power laws were fitted on models from 1e20 FLOPs up to the full 67B scale. The practical output: DeepSeek could analytically derive optimal hyperparameters before training started, without expensive grid search. The 7B trained with batch=2304 and lr=4.2e-4; the 67B with batch=4608 and lr=3.2e-4. The paper is still cited in LLM training research today.
V4 comprehensively outperforms V1 across all dimensions: V4-Pro achieves 80.6% SWE-bench Verified (V1 Coder-33B would score in low teens), 97.3% MATH-500 (V1 LLM-67B: 32.6% on MATH), and Codeforces #1 Elo 3206 (V1 had no competitive programming evaluation). V4 supports 1M token context vs V1's 4K (LLM) and 16K (Coder). V4-Pro with Think Max produces chain-of-thought reasoning V1 was architecturally incapable of. V1 remains useful for fine-tuning research and edge deployment — but has no place in production inference alongside V4's API pricing of $0.14/1M tokens.
V1 weights remain on Hugging Face and AWS S3 training checkpoints are still downloadable, but DeepSeek has not released updates since late 2024. They are in archive status — maintained by the community through llama.cpp, Ollama, and transformers, but not receiving active engineering from DeepSeek. The deepseek-chat API endpoint no longer serves V1 models — it now points to V4-Flash. If you need V1 for research, download the weights. They are not at risk of disappearing from Hugging Face, but they are not a priority for DeepSeek's engineering team.
V1 models use the DeepSeek Model Licence — commercial use is permitted with attribution. This was notable in late 2023 when LLaMA-2 had non-commercial restrictions. The licence allows you to build applications, APIs, and products using V1 weights as long as you attribute DeepSeek. Note that subsequent models differ: V3, R1, V3.2, and V4 use the more permissive MIT Licence, which places fewer restrictions. Always read the specific licence file in each model's GitHub/HuggingFace repository before commercial deployment.
November 2023. DeepSeek Coder. DeepSeek LLM. Two models from a four-month-old startup that beat CodeLlama, beat LLaMA-2, and beat GPT-3.5 in Chinese — all trained from scratch. Open-source from day one. The origin of V2, R1, V3, and V4.