How DeepSeek Trained a Frontier Model for $5.5 Million

When DeepSeek released R1 in January 2025, it shocked the industry: a model matching OpenAI o1 in reasoning, trained for a fraction of the cost. Here's the architecture behind it.

AI Mate

June 19, 2026

In January 2025, a Chinese AI lab released DeepSeek-R1, a model that performed comparably to OpenAI’s o1 on reasoning benchmarks. The lab had already disclosed that DeepSeek-V3, the base model R1 is built on, cost approximately $5.57 million to train. For context, Meta’s Llama 3.1 405B required an estimated $123 million in compute. The Nasdaq dropped. Nvidia lost nearly $600 billion in market cap in a single day. The assumption that frontier AI required massive capital had just been broken.

The Architecture That Made It Possible

Mixture of Experts (MoE): DeepSeek-V3 has 671B total parameters, but only about 37B are activated for each token. Per-token compute is therefore closer to that of a 37B dense model, although all 671B parameters must still be held in memory.
Multi-Head Latent Attention (MLA): Compresses keys and values into a small latent vector per token, shrinking the key-value cache that otherwise grows linearly with context length. This substantially cuts memory requirements for long-context inference.
FP8 training precision: Mixed-precision training that runs most matrix multiplications in 8-bit floating point instead of the usual 16-bit, cutting compute and memory further. DeepSeek reports a relative loss error below 0.25% compared with a BF16 baseline.
Reinforcement learning-first: DeepSeek-R1 was trained primarily with RL, using only a small amount of supervised fine-tuning as a cold start. Its precursor, R1-Zero, used no supervised fine-tuning at all. Chain-of-thought reasoning and self-verification behaviours emerged during RL rather than being taught explicitly, a major methodological departure from standard LLM training.
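
To make the MoE idea concrete, here is a minimal sketch of top-k expert routing in plain Python. It uses a generic softmax router with made-up scores and toy scalar "experts"; the function names are hypothetical, and DeepSeek-V3's own router differs in its details. The point is only that each token runs through a small subset of the experts.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def route_top_k(router_scores, k):
    """Pick the k highest-scoring experts and renormalise their weights to sum to 1."""
    probs = softmax(router_scores)
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    chosen = ranked[:k]
    norm = sum(probs[i] for i in chosen)
    return [(i, probs[i] / norm) for i in chosen]

def moe_forward(x, experts, router_scores, k):
    """Run only the chosen experts and mix their outputs by router weight."""
    return sum(weight * experts[i](x) for i, weight in route_top_k(router_scores, k))

# Toy setup: 8 "experts", each a scalar function standing in for a feed-forward block.
experts = [lambda x, scale=scale: scale * x for scale in range(1, 9)]
# Made-up router scores for a single token (an assumption for illustration).
router_scores = [0.1, 2.0, -1.0, 1.5, 0.0, -0.5, 0.3, 0.2]

chosen = route_top_k(router_scores, k=2)
output = moe_forward(3.0, experts, router_scores, k=2)
active_fraction = len(chosen) / len(experts)

print("experts used:", [i for i, _ in chosen])
print("fraction of experts run for this token:", active_fraction)
```

In this toy, only 2 of 8 experts do any work for the token. The same principle is why a 671B-parameter model can activate roughly 37B parameters, about 5.5% of the total, per token.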

What the $5.5M Number Actually Means

The $5.57 million figure covers direct GPU compute for DeepSeek-V3’s training run: approximately 2.78 million H800 GPU hours, priced at an assumed rental rate of $2 per GPU hour. It does not include research and engineering labour, data acquisition, or the years of foundational work that preceded V3. Nor does it cover the additional RL training that turned V3 into R1. So the number is real but also carefully scoped. Even so, the roughly 22x cost gap versus Llama 3.1 is structural, not a rounding error. MoE, MLA, and FP8 training aren’t tricks. They’re design decisions that compound into large efficiency gains, and other labs are now studying and adopting them.
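
The arithmetic behind the headline number is easy to check. The sketch below assumes the figures from DeepSeek's V3 technical report (2.788 million H800 GPU hours at an assumed rental price of $2 per GPU hour) and the $123 million Llama 3.1 estimate quoted earlier; the variable and function names are illustrative.

```python
# Figures as reported by DeepSeek for V3. The $2/hour rate is their stated
# rental assumption, not a measured cost.
H800_GPU_HOURS = 2_788_000
RENTAL_USD_PER_GPU_HOUR = 2.00

# Third-party estimate for Llama 3.1 405B, taken from the article text.
LLAMA_31_405B_ESTIMATE_USD = 123_000_000

def training_cost_usd(gpu_hours, usd_per_gpu_hour):
    """Direct compute cost only: GPU hours multiplied by the hourly rate."""
    return gpu_hours * usd_per_gpu_hour

deepseek_cost = training_cost_usd(H800_GPU_HOURS, RENTAL_USD_PER_GPU_HOUR)
cost_ratio = LLAMA_31_405B_ESTIMATE_USD / deepseek_cost

print(f"DeepSeek-V3 compute cost: ${deepseek_cost / 1e6:.3f}M")
print(f"Llama 3.1 405B estimate vs DeepSeek-V3: {cost_ratio:.1f}x")
```

Because the cost scales linearly with the hourly rate, the ratio is only as good as the rental-price assumption: at $4 per GPU hour the same run would cost twice as much and the gap would halve.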

The Open-Source Angle

DeepSeek-R1 was released under the MIT license: full commercial use permitted, self-hosting allowed, weights available. This is a frontier-class reasoning model that any organisation can deploy on its own infrastructure. That geopolitical and commercial significance is separate from the cost story, and arguably more durable.
