
DeepSeek-R1 vs OpenAI o1: A guide to reasoning model setup and evaluation

Discover the capabilities of DeepSeek-R1 and OpenAI o1 models for reasoning and decision-making. Includes setup guides, API usage, local deployment, and Weave-powered comparisons.
The field of reasoning-focused large language models has seen rapid advancements, with DeepSeek-R1 standing out as an amazing open-source contribution. Building on the foundation of DeepSeek-R1-Zero, DeepSeek-R1 addresses key challenges in reinforcement learning-based training while pushing the boundaries of reasoning capabilities.
Unlike its predecessors that relied solely on large-scale reinforcement learning without supervised fine-tuning, DeepSeek-R1 introduces a hybrid training approach that combines reinforcement learning with strategically applied supervised fine-tuning. This method enhances the model's ability to handle complex reasoning tasks while ensuring better alignment with human language preferences.
In this guide, we will explore how to set up and use DeepSeek-R1 to tackle complex reasoning and mathematical problems.
Specifically, we'll look at how to run a distilled model locally using Ollama and Hugging Face, and how to leverage the DeepSeek API to run the full R1. To evaluate its reasoning abilities, we'll use challenging problem sets designed to test mathematical reasoning and decision-making, benchmarking DeepSeek-R1's performance against OpenAI's o1 model as well as o1-mini, accessed via the Azure API.
By the end, you’ll have a solid understanding of how to effectively utilize and assess DeepSeek-R1 for demanding reasoning applications.
Here's a sneak peek (but you'll have to read on to see how we get there)


Table of contents



This guide is structured to provide a practical, hands-on exploration of DeepSeek-R1 and its comparison to other leading models. Each section is designed to help you understand not only how to set up and run these models but also how to evaluate their performance across various scenarios. Whether you’re accessing the R1 API, running it locally, or benchmarking it against OpenAI’s o1 models in Azure AI Foundry, this walkthrough offers actionable steps and insights to help you make the most of these tools.
Let's start with running things locally on a distilled version of DeepSeek-R1.

Running a distilled version of DeepSeek-R1 locally

This section is for experienced users who are comfortable setting up local environments, managing servers, and running system-level commands.
If you’re not familiar with these tasks, you can skip ahead to the DeepSeek API section, which offers a simpler and more accessible approach.
💡
To set up your environment for using DeepSeek-R1 locally, you will need to install several Python packages alongside the Ollama server. For Linux systems, begin by updating your package manager and installing curl. Then, download and install the Ollama server, which will host the DeepSeek-R1 model locally:
apt-get update
apt-get install -y curl
curl https://ollama.ai/install.sh | sh
ollama pull deepseek-r1:14b
ollama serve > server.log 2>&1 &
ollama --version
Once the server is ready, install the required Python packages to integrate the models and evaluate their performance:
pip install torch wandb weave langchain-ollama accelerate transformers
For macOS, the installation process for Ollama differs; refer to the official Ollama documentation for the specific steps.
With the Ollama server running, the DeepSeek-R1 model downloaded, and all necessary packages installed, your environment will be fully prepared to take advantage of the model’s reasoning capabilities across various tasks.
Next, we'll write a basic script to run inference using DeepSeek-R1. The script checks whether the Ollama server is active and uses it for reasoning tasks when available. If Ollama is not running, it seamlessly falls back to a Hugging Face implementation of DeepSeek-R1. Each backend has its own inference function, run_ollama_inference and run_huggingface_inference, both traced with Weave's @weave.op decorator.
This allows us to log and analyze all inputs and outputs of these functions, ensuring full transparency and traceability during execution.

import requests
from langchain_core.prompts import ChatPromptTemplate
from langchain_ollama.llms import OllamaLLM
import transformers
import torch
import weave; weave.init("r1_local_inference")

# Define the prompt template
template = """System: {system}
User: {user}
Response:"""
prompt = ChatPromptTemplate.from_template(template)

# Input messages for inference
system_message = "You are a pirate chatbot who always responds in pirate speak!"
user_message = "Who are you?"


# Function to check if the Ollama server is running
def is_ollama_running():
    try:
        response = requests.get("http://localhost:11434", timeout=2)
        return response.status_code == 200
    except requests.ConnectionError:
        return False

# Hugging Face model setup
@weave.op
def run_huggingface_inference(system_message, user_message):
    print("Using Hugging Face for inference...")
    model_id = "deepseek-ai/DeepSeek-R1-Distill-Qwen-14B"
    pipeline = transformers.pipeline(
        "text-generation",
        model=model_id,
        model_kwargs={"torch_dtype": torch.bfloat16},
        device_map="auto",
    )
    messages = [
        {"role": "system", "content": system_message},
        {"role": "user", "content": user_message},
    ]
    outputs = pipeline(
        messages,
        max_new_tokens=32768,
        temperature=0.6,
        top_p=0.95,
    )
    return outputs[0]["generated_text"]

# Ollama inference setup
@weave.op
def run_ollama_inference(system_message, user_message):
    print("Using Ollama for inference...")
    model = OllamaLLM(
        model="deepseek-r1:14b",
        temperature=0.6,
        top_p=0.95,
        max_tokens=32768,
    )
    chain = prompt | model
    response = chain.invoke({"system": system_message, "user": user_message})
    return response

# Main function to decide which backend to use
if __name__ == "__main__":
    if is_ollama_running():
        response = run_ollama_inference(system_message, user_message)
    else:
        response = run_huggingface_inference(system_message, user_message)
    print("Generated Response:")
    print(response)

This script dynamically selects the backend for running DeepSeek-R1 inference, with Ollama as the preferred option due to its speed and ability to run locally without external API calls. Both inference functions are traced with @weave.op, so their inputs and outputs are automatically logged inside Weave.
If Ollama is unavailable, the script automatically switches to Hugging Face, providing flexibility while maintaining strong performance across different environments.

Running R1 via the DeepSeek API

The base R1 model is a 671B parameter MoE model, so I recommend using the API provided by DeepSeek for accessing the full non-distilled version of R1.
💡
Running such a large model locally requires extensive computational resources, including high-end GPUs with significant memory capacity, which is not feasible for most users. The DeepSeek API offers a convenient and efficient way to leverage the model without the need for local infrastructure, ensuring optimal performance and scalability for your reasoning and problem-solving tasks. By signing up for the DeepSeek platform and adding credits to your account, you can easily integrate the R1 model into your workflows.
Token pricing at the time of writing:
For comparison's sake, here's the o1 pricing:

Next, ensure you have the following libraries installed:
pip install weave openai
And here's the code for running inference:
import os
import asyncio
import weave
from openai import OpenAI

# Initialize Weave
weave.init("deepseek_r1_api")

# Set up your DeepSeek API key
r1_api_key = os.getenv("DEEPSEEK_API_KEY")  # Replace with your API key or set it as an environment variable

# Initialize the DeepSeek API client (OpenAI-compatible)
r1_client = OpenAI(api_key=r1_api_key, base_url="https://api.deepseek.com")

# API inference function with Weave logging
@weave.op
async def r1_api_inference(prompt: str) -> str:
    # Perform inference using the API client
    response = r1_client.chat.completions.create(
        model="deepseek-reasoner",
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content


if __name__ == "__main__":
    example_prompt = "What is the derivative of x^3 + 5x^2?"
    try:
        # Run the inference and log the results with Weave
        api_result = asyncio.run(r1_api_inference(example_prompt))
        print("DeepSeek-R1 API Result:")
        print(api_result)
    except Exception as e:
        print(f"An error occurred: {e}")
While the DeepSeek API provides a robust and scalable solution for accessing the R1 model, it's valuable to explore alternatives like OpenAI's o1 and o1-mini models. These models, available on Azure AI Foundry, are built for complex reasoning and decision-making tasks, including mathematics, and offer a compelling alternative.
In the following section, we’ll examine how to run these models via Azure AI Foundry and use Weave to compare their performance, reasoning capabilities, and pricing against DeepSeek-R1. This side-by-side evaluation will provide a comprehensive view of each platform’s strengths and trade-offs, helping you choose the best option for your needs.

Running o1 and o1-mini on Azure AI Foundry

Azure AI Foundry offers a robust platform for running OpenAI models like o1 and o1-mini, which are designed for complex reasoning and decision-making tasks. With Azure, these models can be accessed via API calls, allowing easy integration into workflows without managing infrastructure, and the platform provides the reliability and scalability needed for consistent performance.
I’ve previously written a guide on setting up Azure AI Foundry, including steps for creating a project, deploying models, and integrating them via API endpoints. If you're interested in more details on how to set up Azure AI Foundry, I encourage you to check out that article.
Here's the script for running inference with o1 and o1-mini on Azure AI Foundry:
import json
import asyncio
from openai import AzureOpenAI
import weave

# Initialize Weave
weave.init("aime_evaluation")

# Define model (deployment) IDs
o1_model_id = "your o1 model id"
o1_mini_model_id = "your o1 mini model id"

# Initialize AzureOpenAI clients
az_o1_client = AzureOpenAI(
    azure_endpoint="https://your-azure-endpoint.openai.azure.com/openai/deployments/o1/chat/completions?api-version=2024-09-01-preview",
    api_key="your-azure-api-key",
    api_version="2024-09-01-preview"
)

az_o1mini_client = AzureOpenAI(
    azure_endpoint="https://your-azure-endpoint.openai.azure.com/openai/deployments/o1-mini/chat/completions?api-version=2024-09-01-preview",
    api_key="your-azure-api-key",
    api_version="2024-09-01-preview"
)

@weave.op
def run_inference(prompt, client, model_id):
    try:
        response = client.chat.completions.create(
            model=model_id,
            messages=[
                {"role": "user", "content": prompt}
            ]
        )
        response_json = json.loads(response.model_dump_json(indent=2))
        choices = response_json.get("choices", [])
        if choices:
            content = choices[0].get("message", {}).get("content", "")
            print("Generated Content:")
            print(content)
            return content
        else:
            print("No content found in response")
            return None
    except Exception as e:
        print(f"Failed to get response: {e}")
        return None

# Main function
async def main():
    prompt = "What is the derivative of x^2?"
    system_message = "Solve the following problem. Put your final answer within \\boxed{}:"

    # Run inference with the o1 model
    print("Running inference with O1 model...")
    o1_response = run_inference(f"{system_message} {prompt}", az_o1_client, o1_model_id)
    print(f"O1 Response: {o1_response}")

    # Run inference with the o1-mini model
    print("Running inference with O1-Mini model...")
    o1mini_response = run_inference(f"{system_message} {prompt}", az_o1mini_client, o1_mini_model_id)
    print(f"O1-Mini Response: {o1mini_response}")

if __name__ == "__main__":
    asyncio.run(main())

Evaluating OpenAI o1 vs. DeepSeek-R1 with Weave Evaluations

Now we will evaluate DeepSeek-R1’s reasoning capabilities by comparing it to OpenAI's o1 and o1-mini models. Using Weave Evaluations, we will track inputs, outputs, and performance metrics, which provides a clear, organized way to analyze each model’s strengths and weaknesses and compare them side by side.
The dataset we will use for this evaluation is the AIME 2024 dataset, a collection of challenging mathematical and logical reasoning problems. Each entry in the dataset contains a problem description as the input (text) and the corresponding solution as the output (label). This dataset is specifically designed to test the reasoning capabilities of large language models across a variety of tasks, including algebra, calculus, and number theory. By using this dataset, we aim to benchmark the models on their ability to understand and solve problems that require both contextual understanding and precise reasoning.
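To get a feel for the data format before running the full evaluation, here is a small sketch that loads the dataset and prints one example. It assumes the Maxwell-Jia/AIME_2024 dataset on Hugging Face (the same one used in the evaluation script below), which ships only a train split with "Problem" and "Answer" fields:

from datasets import load_dataset

# Load the AIME 2024 problems (the dataset provides only a "train" split)
aime = load_dataset("Maxwell-Jia/AIME_2024")["train"]

# Each row pairs a problem statement ("Problem") with its answer ("Answer")
example = aime[0]
print("Number of problems:", len(aime))
print("Problem:", example["Problem"][:200], "...")
print("Answer:", example["Answer"])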
Weave Evaluations from Weights & Biases offers a powerful framework for assessing model performance. It logs and organizes data into a structured, interactive format. Each evaluation consists of three key components:
  • Datasets: Collections of test inputs and expected outputs.
  • Models: The systems being evaluated.
  • Scorers: Functions that compare model predictions to ground truth, producing metrics like accuracy and correctness.
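To make these three pieces concrete before diving into the full script, here is a minimal sketch of how they fit together in Weave. The toy dataset, model, scorer, and project name are made up purely for illustration; the real versions appear in the evaluation script below.

import asyncio
import weave

weave.init("weave_eval_sketch")  # hypothetical project name

# Dataset: a couple of toy input/label pairs
toy_dataset = [
    {"text": "2 + 2", "label": "4"},
    {"text": "3 * 3", "label": "9"},
]

# Model: a trivial weave.Model whose predict stands in for a real LLM call
class ToyModel(weave.Model):
    @weave.op
    async def predict(self, text: str) -> str:
        return "4"  # canned answer for illustration

# Scorer: compares the prediction against the ground-truth label
@weave.op
def exact_match(label: str, model_output: str) -> dict:
    return {"correct": model_output.strip() == label}

evaluation = weave.Evaluation(dataset=toy_dataset, scorers=[exact_match], name="Toy Evaluation")
print(asyncio.run(evaluation.evaluate(ToyModel())))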
Here is the script we’ll use for the evaluation:
import os
import asyncio
import requests
from litellm import acompletion
from datasets import load_dataset
from langchain_core.prompts import ChatPromptTemplate
from langchain_ollama.llms import OllamaLLM
from openai import OpenAI
import json
from openai import AzureOpenAI
import weave; weave.init("aime_evaluation")  # Initialize Weave

# Azure GPT-4o client (used as the LLM judge)
az_4o_client = AzureOpenAI(
    azure_endpoint="your azure 4o endpoint url",
    api_key="your 4o azure api key",
    api_version="2024-09-01-preview"
)


o1_model_id = "your azure o1 model id"
o1_mini_model_id = "your azure o1 mini model id"

# Initialize AzureOpenAI clients for o1 and o1-mini models
az_o1_client = AzureOpenAI(
    azure_endpoint="your o1 endpoint url",
    api_key="your o1 api key",
    api_version="2024-09-01-preview"
)

az_o1mini_client = AzureOpenAI(
    azure_endpoint="your o1 mini endpoint url",
    api_key="your o1 mini api key",
    api_version="2024-09-01-preview"
)


r1_api_key = "your r1 api key"
r1_client = OpenAI(api_key=r1_api_key, base_url="https://api.deepseek.com")


# Check if Ollama is running
def is_ollama_running():
    try:
        response = requests.get("http://localhost:11434", timeout=2)
        return response.status_code == 200
    except requests.ConnectionError:
        print("OLLAMA IS NOT RUNNING!!")
        return False

USE_OLLAMA = is_ollama_running()

# Initialize models based on Ollama availability
r1_model = None
if USE_OLLAMA:
    print("Ollama is running. Using Ollama for inference.")
    r1_model = OllamaLLM(model="deepseek-r1:14b", temperature=0.6, top_p=0.95, max_tokens=32768)

# Consistent system message for all models
system_message = "Solve the following problem. Put your final answer within \\boxed{}: "


# Function to perform inference using an AzureOpenAI client
def run_inference(prompt, client, model_id):
    try:
        response = client.chat.completions.create(
            model=model_id,
            messages=[
                {"role": "user", "content": prompt}
            ]
        )
        # Parse the response
        response_json = json.loads(response.model_dump_json(indent=2))
        choices = response_json.get("choices", [])
        if choices:
            content = choices[0].get("message", {}).get("content", "")
            print("Generated Content:")
            print(content)
            return content
        else:
            print("No content found in response")
            return None
    except Exception as e:
        print(f"Failed to get response: {e}")
        return None


async def o1_inference(prompt: str, model_id: str) -> str:
    """
    Generate a solution using the AzureOpenAI clients for o1 and o1-mini.
    """
    client = az_o1_client if model_id == o1_model_id else az_o1mini_client
    return run_inference(f"{system_message}{prompt}", client, model_id)


# Ollama inference function
async def ollama_r1_inference(prompt: str) -> str:
    # Define the input to the prompt template
    inputs = {
        "system": "",
        "user": system_message + prompt,
    }

    # Define the prompt template
    template = """
User: {user}
Response:"""
    prompt_template = ChatPromptTemplate.from_template(template)

    # Create the chain and invoke it
    chain = prompt_template | r1_model
    response = chain.invoke(inputs)
    return response


# Define the full R1 model (DeepSeek API)
class R1APIModel(weave.Model):
    @weave.op
    async def predict(self, text: str) -> str:
        messages = [{"role": "user", "content": system_message + text}]
        response = r1_client.chat.completions.create(
            model="deepseek-reasoner",
            messages=messages
        )
        return response.choices[0].message.content


# Define the o1-mini model
class O1MiniModel(weave.Model):
    @weave.op
    async def predict(self, text: str) -> str:
        return await o1_inference(text, model_id=o1_mini_model_id)


# Define the o1 model
class O1Model(weave.Model):
    @weave.op
    async def predict(self, text: str) -> str:
        return await o1_inference(text, model_id=o1_model_id)


# Define the R1-14B model (local, via Ollama)
class R1_14B_Model(weave.Model):
    @weave.op
    async def predict(self, text: str) -> str:
        return await ollama_r1_inference(text)


@weave.op
async def gpt4o_scorer(label: str, model_output: str) -> dict:
    """Score the model's output by comparing it with the ground truth."""
    query = f"""
YOU ARE A LLM JUDGE DETERMINING IF THE FOLLOWING MODEL GENERATED ANSWER IS THE SAME AS THE CORRECT ANSWER
I WILL GIVE YOU THE LAST 100 CHARS OF THE MODEL'S REASONING PATH, WHICH WILL CONTAIN THE FINAL ANSWER ->

Model's Answer (last 100 chars): {str(model_output)[-100:]}
Correct Answer: {label}
Your task:
1. State the model's predicted answer (answer only).
2. State the ground truth (answer only).
3. Determine if the model's final answer is correct (ignore formatting differences, etc.). RESPOND with the predicted and ground truth answer, followed with a JSON object containing the correctness encapsulated within the following delimiters:
```json
{{ "correctness": true/false }}
```
"""
    # Perform inference using the AzureOpenAI GPT-4o client
    response = run_inference(query, az_4o_client, "gpt-4o")
    if response is None:
        return {"correctness": False, "reasoning": "Inference failed."}
    try:
        # Extract the correctness JSON object from the response
        json_start = response.index("```json") + 7
        json_end = response.index("```", json_start)
        correctness = json.loads(response[json_start:json_end].strip()).get("correctness", False)
    except (ValueError, IndexError):
        correctness = False

    return {"correctness": correctness, "reasoning": response}


# Load and preprocess the dataset
def load_ds():
    print("Loading AIME dataset...")
    dataset = load_dataset("Maxwell-Jia/AIME_2024")["train"]  # no test split here
    return [{"text": row["Problem"], "label": row["Answer"]} for row in dataset]


# Run evaluations for each model
async def run_evaluations():
    print("Loading dataset...")
    dataset = load_ds()
    print("Initializing models...")

    models = {
        "o1": O1Model(),
        "o1-mini": O1MiniModel(),
        "r1_api": R1APIModel()
    }

    if USE_OLLAMA:
        models["r1-14b"] = R1_14B_Model()

    print("Preparing dataset for evaluation...")
    dataset_prepared = [{"text": row["text"], "label": row["label"]} for row in dataset]

    print("Running evaluations...")
    scorers = [gpt4o_scorer]
    for model_name, model in models.items():
        print(f"\nEvaluating {model_name}...")
        evaluation = weave.Evaluation(
            dataset=dataset_prepared,
            scorers=scorers,
            name=f"{model_name} Evaluation"
        )
        results = await evaluation.evaluate(model)
        print(f"Results for {model_name}: {results}")

if __name__ == "__main__":
    asyncio.run(run_evaluations())

The script begins by loading the AIME dataset using the Hugging Face load_dataset function. This dataset, consisting of problem prompts (text) and corresponding solutions (labels), serves as the foundation for evaluating model accuracy and reasoning capabilities.
The script evaluates the R1 14B model locally using the OllamaLLM class if the Ollama server is running. It also supports the full DeepSeek-R1 via the DeepSeek API and OpenAI’s o1 and o1-mini models through Azure AI Foundry. Each model is evaluated using backend-specific inference methods: run_inference for the Azure models, ollama_r1_inference for the local R1 14B, and a direct DeepSeek API call inside R1APIModel's predict method for the full R1.
Weave’s evaluation framework integrates the dataset, models, and scoring functions into a unified pipeline. Scorers such as gpt4o_scorer compare model outputs to ground truth, generating metrics like accuracy and correctness. The results are logged and visualized in Weave’s interactive tools, enabling detailed analysis and comparisons across models.
This setup provides a robust framework for assessing DeepSeek-R1’s reasoning capabilities alongside Azure-based o1 models, focusing on accuracy, latency, and total tokens generated.



Summary of results

  • o1 Model: Achieved an accuracy of 76.7%.
  • R1 API Model: Matched the o1 model in accuracy at 76.7%.
  • o1-Mini Model: Scored slightly lower with an accuracy of 73.3%.
  • R1 14B Model: Achieved the lowest accuracy at 20.0% when run locally using Ollama.
DeepSeek reports an accuracy of 69.7% for the R1 14B model and 79.8% for the full R1 model. These figures are based on pass@1, where multiple responses (up to 64) are generated per query and the average correctness is reported. In contrast, my evaluation generated a single response per problem, which matches pass@1 in expectation but misses the variance-reducing effect of repeated sampling.
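To make the difference concrete, here is a purely illustrative sketch of averaged pass@1 as I understand DeepSeek's described protocol; the sample counts below are hypothetical.

def pass_at_1(per_sample_correct: list[bool]) -> float:
    """Averaged pass@1: mean correctness over k sampled responses for one problem."""
    return sum(per_sample_correct) / len(per_sample_correct)

# Hypothetical example: 64 samples for one problem, 50 of which are correct
print(pass_at_1([True] * 50 + [False] * 14))  # ~0.78

# A single-sample run (as in this evaluation) scores each problem as exactly 1.0 or 0.0,
# which matches averaged pass@1 in expectation but is noisier problem by problem.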
The large discrepancy in the R1 14B model’s performance remains unexplained. Factors such as parameter configurations (e.g., temperature and top-p) may influence results. DeepSeek’s evaluation notes indicate a temperature of 0.6 and a top-p of 0.95, but these values were not fully confirmed. By comparison, the o1 models use fixed parameters with a temperature of 1 and a top-p of 1.
As DeepSeek has not released their source code for evaluations, there is a possibility of differences in evaluation methods. To better understand these results, I encourage replicating this evaluation or experimenting with alternative prompts, hyper-parameters, and deployment setups.

Debugging with W&B Weave

Weave played a valuable role in helping me uncover and fix issues during the evaluation of DeepSeek-R1. For example, I encountered situations where my OpenAI quota was exceeded, and Weave surfaced these errors directly in the interactive dashboard, saving time that would have been spent debugging elsewhere.

Another issue arose with my LLM judge setup. I realized I was mistakenly passing the first 100 characters of the model's solution to the judge instead of the last 100. This subtle bug led to inaccurate evaluations but was made obvious in Weave’s detailed input-output tracking. By identifying this discrepancy through Weave, I quickly corrected the logic and ensured the scoring function was assessing the correct portion of the solution.
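The fix itself was a one-line change in the slice passed to the judge. As a minimal illustration (the solution string below is made up):

# Hypothetical model output: a long reasoning trace ending in a boxed answer
solution = "First, expand the expression ... (many reasoning steps) ... so the final answer is \\boxed{42}."

judge_input_buggy = solution[:100]   # first 100 chars: usually no final answer here
judge_input_fixed = solution[-100:]  # last 100 chars: contains the boxed final answer

print("Buggy judge input:", judge_input_buggy)
print("Fixed judge input:", judge_input_fixed)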

The Weave comparisons dashboard

Once evaluations are complete, Weave organizes results into an interactive dashboard. This powerful tool enables you to:
  • Compare model outputs side by side,
  • Filter results by specific parameters, and
  • Trace inputs and outputs for every function call.
This structured, visual approach simplifies debugging and provides deep insights into model performance, making Weave an indispensable tool for tracking and refining large language models.
For reasoning-focused tasks like those tackled by DeepSeek-R1, the comparisons view offers a step-by-step trace of each model’s decision-making process. This feature makes it easy to identify logical missteps, errors in interpreting prompts, or areas where one model outperforms another, such as where o1-mini falls short.
By analyzing these granular outputs, you can better understand why models succeed or fail on specific tasks. This insight is invaluable for diagnosing issues, improving models, and tailoring their capabilities for complex reasoning scenarios.
Here’s a screenshot of what the comparison view looks like:


Conclusion

Overall, I was really impressed with R1, especially given how closely the accuracy of the full 671B R1 matched o1's. The lower performance of the R1 14B model compared to its reported benchmarks raises questions, and I'm definitely curious about what is causing such a large discrepancy between my results and DeepSeek's reported results.
If anyone from DeepSeek is reading, I’d welcome the opportunity to collaborate and clarify the cause of the discrepancy observed for the 14B model. Exploring these differences together could provide actionable insights to optimize model configurations and benchmarks. As we continue to push the boundaries of what these models can achieve, evaluations like this remain pivotal in aligning their capabilities with practical, real-world applications.

