Exploring DeepSeek-R1's Agentic Capabilities Through Code Actions

I ran a quick experiment investigating how DeepSeek-R1 performs on agentic tasks, despite not supporting tool use natively, and I was quite pleased by the initial results. The experiment runs DeepSeek-R1 in a single-agent setup, where the model not only plans the actions but also generates them as executable Python code. On a subset of the GAIA validation split, DeepSeek-R1 outperforms Claude 3.5 Sonnet by 12.5% absolute, from 53.1% to 65.6% correct, and other models by an even larger margin.

The experiment followed the model usage recommendations from the DeepSeek-R1 paper and the model card: don't use few-shot examples, avoid adding a system prompt, and set the temperature to 0.5-0.7 (0.6 was used). You can find more evaluation details here.
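
For reference, here is a minimal sketch of these generation settings, assuming an OpenAI-compatible endpoint for DeepSeek-R1; the base URL, model name, and message content are assumptions, not details taken from the experiment:

```python
# Minimal sketch of the generation settings, assuming an OpenAI-compatible
# endpoint; base URL and model name are assumptions, adapt to your provider.
from openai import OpenAI

client = OpenAI(base_url="https://api.deepseek.com", api_key="...")

response = client.chat.completions.create(
    model="deepseek-reasoner",  # DeepSeek-R1
    messages=[
        # no system prompt and no few-shot examples, per the usage recommendations
        {"role": "user", "content": "Solve the task by generating Python code actions: ..."},
    ],
    temperature=0.6,  # within the recommended 0.5-0.7 range
)
print(response.choices[0].message.content)
```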

Approach

DeepSeek-R1's strong coding capabilities enable it to act as an agent without being explicitly trained for tool use. By allowing the model to generate actions as Python code, it can flexibly interact with environments through code execution.

Tools are implemented as Python code that is included directly in the prompt. This can be a simple function definition or a module of a larger package - any valid Python code. The model then generates code actions that call these tools.
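
As an illustration, here is a minimal sketch of what such a tool definition and a corresponding code action could look like; the `fetch_page` function and the generated snippet are hypothetical examples, not the actual tools used in the experiment:

```python
# Hypothetical tool definition, included verbatim in the prompt:
def fetch_page(url: str) -> str:
    """Download a web page and return its raw text content."""
    import requests
    return requests.get(url, timeout=10).text

# A code action the model might generate in response, calling the tool above;
# whatever is printed is fed back to the model as an observation.
content = fetch_page("https://example.com")
print(content[:500])
```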

Results from executing these actions are fed back to the model as follow-up messages, driving the next steps until a final answer is reached. The agent framework is a simple iterative coding loop that mediates the conversation between the model and its environment.
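
The following is a minimal sketch of such a loop, assuming a hypothetical `generate(messages)` helper that calls DeepSeek-R1 (e.g. via the chat completion call shown earlier); it is not the actual freeact implementation, and the `exec`-based execution has no sandboxing:

```python
import contextlib
import io
import re


def extract_code(reply: str) -> str | None:
    """Return the first ```python ...``` block in the model reply, if any."""
    match = re.search(r"```python\n(.*?)```", reply, re.DOTALL)
    return match.group(1) if match else None


def execute(code: str) -> str:
    """Run a code action and capture its printed output (no sandboxing here)."""
    buffer = io.StringIO()
    try:
        with contextlib.redirect_stdout(buffer):
            exec(code, {})
    except Exception as e:
        buffer.write(f"Error: {e!r}")
    return buffer.getvalue()


def agent_loop(task: str, max_steps: int = 10) -> str:
    """Iterative coding loop: the model emits code actions, results are fed back."""
    messages = [{"role": "user", "content": task}]
    for _ in range(max_steps):
        reply = generate(messages)      # hypothetical model call (e.g. DeepSeek-R1)
        messages.append({"role": "assistant", "content": reply})
        code = extract_code(reply)
        if code is None:                # no code action -> treat reply as final answer
            return reply
        result = execute(code)          # run the code action
        messages.append({"role": "user", "content": f"Execution result:\n{result}"})
    return "Maximum number of steps reached."
```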

Conversations

In my experiment, DeepSeek-R1 is used as a chat model that autonomously pulls additional context from its environment by using tools, e.g. by querying a search engine or fetching data from websites. This drives the conversation with the environment, which continues until a final answer is reached.

In contrast, o1 models are known to perform poorly when used as chat models, i.e. they don't try to pull missing context during a conversation. According to the linked article, o1 models perform best when they have the full context available, with clear instructions on what to do with it.

Initially, I also tried a full-context-in-a-single-prompt approach at each step (with results from previous steps included), but this led to significantly lower scores on the GAIA subset. Switching to the conversational approach described above, I was able to reach the reported 65.6% performance.
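
To make the difference concrete, here is a schematic comparison of the two message layouts; the task, code actions, and results are placeholders, not taken from the actual experiment:

```python
task = "What is the population of the capital of France?"   # placeholder task
results = ["Paris", "2,102,650 (2023 estimate)"]             # placeholder step outputs

# Variant 1: full context in a single prompt at each step - previous results are
# concatenated into one user message and the prompt is rebuilt every step.
single_prompt = [
    {"role": "user", "content": task + "\n\nResults so far:\n" + "\n".join(results)},
]

# Variant 2 (used for the reported 65.6%): conversational - each code action and
# its execution result remain in the message history as separate turns.
conversation = [
    {"role": "user", "content": task},
    {"role": "assistant", "content": "```python\nprint(search('capital of France'))\n```"},
    {"role": "user", "content": "Execution result:\n" + results[0]},
    {"role": "assistant", "content": "```python\nprint(search('population of Paris'))\n```"},
    {"role": "user", "content": "Execution result:\n" + results[1]},
]
```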

This raises an interesting question about the claim that o1 isn't a chat model - perhaps this observation was more relevant to older o1 models that lacked tool use capabilities? After all, isn't tool use support an essential mechanism for enabling models to pull additional context from their environment? This conversational approach certainly seems effective for DeepSeek-R1, though I still need to run comparable experiments with o1 models.

Generalization

Although DeepSeek-R1 was mainly trained with RL on math and coding tasks, it is remarkable that generalization to agentic tasks with tool use via code actions works so well. This ability to generalize to agentic tasks is reminiscent of recent research by DeepMind showing that RL generalizes whereas SFT memorizes, although generalization to tool use wasn't investigated in that work.

Despite its ability to generalize to tool use, DeepSeek-R1 often produces very long reasoning traces at each step compared to other models in my experiments, which limits the usefulness of this model in a single-agent setup. Even simpler tasks sometimes take a long time to complete. Further RL on agentic tool use, whether via code actions or not, could be one option to improve efficiency.

Underthinking

I also observed the underthinking phenomenon with DeepSeek-R1, where a reasoning model frequently switches between different reasoning thoughts without sufficiently exploring promising paths to reach a correct solution. This was a major cause of the overly long reasoning traces produced by DeepSeek-R1, and it can be seen in the recorded traces that are available for download.

Future experiments

Another common application of reasoning models is to use them for planning only, while other models generate the code actions. This could become a new feature of freeact, if this separation of roles proves useful for more complex tasks.

I'm also curious how reasoning models that already support tool use (like o1, o3, ...) perform in a single-agent setup, with and without generating code actions. Recent developments like OpenAI's Deep Research or Hugging Face's open-source Deep Research, which also uses code actions, look interesting.