Understanding DeepSeek R1

DeepSeek-R1 is an open-source language model built on DeepSeek-V3-Base that has been making waves in the AI community. Not only does it match, or even surpass, OpenAI's o1 model on many benchmarks, but it also ships with fully MIT-licensed weights. This makes it the first non-OpenAI/Google model to deliver strong reasoning capabilities in an open and accessible way.

What makes DeepSeek-R1 particularly exciting is its openness. Unlike the less-open approaches of some industry leaders, DeepSeek has published a detailed training methodology in their paper. The model is also remarkably cost-efficient, with input tokens costing just $0.14-0.55 per million (vs o1's $15) and output tokens at $2.19 per million (vs o1's $60).

Until roughly GPT-4, the common wisdom was that better models required more data and compute. While that's still true, models like o1 and R1 demonstrate an alternative: inference-time scaling through reasoning.

The Essentials

The DeepSeek-R1 paper presented multiple models, but the main ones are R1 and R1-Zero. Alongside these is a series of distilled models that, while interesting, I won't discuss here.

DeepSeek-R1 builds on two key ideas:

1. A multi-stage pipeline where a small set of cold-start data kickstarts the model, followed by large-scale RL.

2. Group Relative Policy Optimization (GRPO), a reinforcement learning method that compares multiple model outputs for the same prompt to avoid the need for a separate critic (see the sketch after this list).
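
To make the "group-relative" part concrete, here is a minimal sketch of the advantage computation GRPO uses in place of a critic's value estimate. The function name and the toy rewards are my own for illustration; the full GRPO objective also includes a PPO-style clipped ratio and a KL penalty that are not shown here.

```python
import numpy as np

def group_relative_advantages(rewards):
    """GRPO's core trick: score each sampled output relative to the other
    outputs in its own group, so no learned critic/value network is needed."""
    r = np.asarray(rewards, dtype=np.float64)
    return (r - r.mean()) / (r.std() + 1e-8)

# Hypothetical rule-based rewards for four sampled answers to the same prompt:
# 1.0 for a correct answer, 0.0 for an incorrect one.
print(group_relative_advantages([1.0, 1.0, 0.0, 0.0]))
# correct answers get positive advantages, incorrect ones negative
```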

R1 and R1-Zero are both reasoning models. This essentially means they perform Chain-of-Thought before answering. For the R1 series of models, this takes the form of thinking inside a <think> tag before answering with a final summary.
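
To illustrate that output format, here is a small helper that splits a completion into the reasoning trace and the final answer. The helper name and the example completion are invented for illustration; the <think> tag itself is the format the R1 models are trained to emit.

```python
import re

def split_r1_response(text: str):
    """Split an R1-style completion into its reasoning trace and final answer,
    assuming the reasoning is wrapped in <think>...</think> and the final
    summary follows the closing tag."""
    match = re.search(r"<think>(.*?)</think>", text, flags=re.DOTALL)
    if match is None:
        return "", text.strip()
    return match.group(1).strip(), text[match.end():].strip()

reasoning, answer = split_r1_response(
    "<think>15% of 80 is 0.15 * 80 = 12.</think>The answer is 12."
)
print(reasoning)  # 15% of 80 is 0.15 * 80 = 12.
print(answer)     # The answer is 12.
```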

R1-Zero vs R1

R1-Zero applies Reinforcement Learning (RL) directly to DeepSeek-V3-Base without any supervised fine-tuning (SFT). RL is used to optimize the model's policy to maximize reward. R1-Zero achieves excellent accuracy but sometimes produces confusing outputs, such as mixing multiple languages in a single response. R1 fixes that by incorporating limited supervised fine-tuning and multiple RL passes, which improves both correctness and readability.
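
For intuition about what "maximize reward" means here without a learned reward model, below is a minimal sketch of a rule-based reward in the spirit of the accuracy and format rewards the paper describes. The weights, the exact-match check, and the function name are assumptions for illustration, not the paper's actual implementation.

```python
import re

def rule_based_reward(completion: str, reference_answer: str) -> float:
    """Toy rule-based reward: a format check (did the model use a <think>
    block?) plus an exact-match accuracy check on the final answer.
    Weights and checks are illustrative assumptions."""
    has_think_block = bool(re.search(r"<think>.*?</think>", completion, flags=re.DOTALL))
    final_answer = re.sub(r"<think>.*?</think>", "", completion, flags=re.DOTALL).strip()
    accuracy = float(final_answer == reference_answer.strip())
    return 0.5 * float(has_think_block) + 1.0 * accuracy

print(rule_based_reward("<think>3 * 4 = 12</think>12", "12"))  # 1.5
print(rule_based_reward("13", "12"))                           # 0.0
```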

It is interesting that some languages may express certain concepts better, which leads the model to pick the most expressive language for the task.

Training Pipeline

The training pipeline that DeepSeek published in the R1 paper is immensely interesting. It shows how they built such strong reasoning models and what you can expect from each stage, including the issues the resulting model from each stage has and how they fixed them in the next stage.

It's interesting that their training pipeline differs from the usual one:

The usual training approach: pretraining on a large dataset (training to predict the next word) to get the base model → supervised fine-tuning → preference tuning via RLHF.

R1-Zero: pretrained → RL.

R1: pretrained → multistage training pipeline with multiple SFT and RL stages.

Cold-Start Fine-Tuning: Fine-tune DeepSeek-V3-Base on a few thousand Chain-of-Thought (CoT) samples to give the RL process a good starting point.

First RL Stage: Apply GRPO with rule-based rewards to improve reasoning correctness and formatting (such as forcing the chain-of-thought into thinking tags). When they were near convergence in the RL process, they moved on to the next step. The result of this stage is a strong reasoning model, but with weak general abilities, e.g., poor