DeepSeek V3.1 tutorial: How to use the open-source LLM with Python and W&B Weave
Explore DeepSeek V3.1, an open-source language model with 671B parameters, and learn how to integrate it with Python and W&B Weave for tracking and evaluation.
In this article, you'll learn what DeepSeek V3.1 is, why it's significant, and what you'll build in this tutorial. DeepSeek V3.1 represents a leap forward in open-source language models, combining efficient inference with cost-effective training. It’s built to deliver high performance (even rivaling some closed-source models) while remaining accessible to developers. In this tutorial, we’ll explore DeepSeek V3.1’s key innovations and walk through how to use it with Python, including integration with Weights & Biases (W&B) for experiment tracking. By the end, you’ll know how to leverage DeepSeek V3.1 in your own projects, from understanding its architecture to making API calls and analyzing results.
Overview
DeepSeek V3.1 stands out due to its massive scale and innovative design in the realm of large language models (LLMs). With 671 billion parameters organized in a Mixture-of-Experts (MoE) architecture (where only 37B parameters are active per token), it achieves state-of-the-art accuracy without incurring the full cost of a dense model of similar size. The model was pre-trained on 14.8 trillion high-quality tokens, spanning languages, code, and diverse knowledge domains – giving it a broad and robust understanding of various tasks. Crucially, despite its enormity, DeepSeek V3.1 is optimized for efficient inference, meaning it can generate responses quickly (up to ~60 tokens/second) and handle extremely long contexts (up to 128K tokens of context window) without stumbling. For developers and organizations, the significance is clear: DeepSeek V3.1 offers cutting-edge performance on par with top proprietary models, but as an open-source solution that can be integrated into your own applications and workflows.
💡 Tip: As you follow along, focus on learning by doing. We will not only discuss what makes DeepSeek V3.1 special, but also get hands-on with using the model. Keep an eye out for code examples – they’re meant to be run and experimented with!
✅ What you’ll accomplish in this section: You’ve been introduced to DeepSeek V3.1’s importance and features at a high level. Next, we’ll dive into the technical innovations that make this model both powerful and efficient.
Key architectural innovations in DeepSeek V3.1
In this section, you'll learn about the groundbreaking changes in DeepSeek’s architecture that distinguish V3 from its predecessor (V2). DeepSeek V3 introduces several innovations that boost its performance and efficiency. We’ll explore how Multi-Head Latent Attention, a refined MoE architecture, and new training strategies contribute to these improvements. By understanding these changes, you’ll appreciate why DeepSeek V3.1 performs so well and how those ideas might apply in your own machine learning work. Note that DeepSeek V3.1 is built on top of DeepSeek V3, so many of the architectural details are identical.
Innovations from DeepSeek-V2 to DeepSeek-V3.1
DeepSeek-V3 builds upon the foundation of V2 with key enhancements that improve model capability and inference speed. Below are the major innovations introduced in V3.1, compared to V2:
- Multi-Head Latent Attention (MLA): This is a novel attention mechanism that extends the standard multi-head attention. Instead of having each attention head operate only on the original token space, MLA introduces latent features that heads can attend to. In practice, this means the model can capture more nuanced patterns and long-range dependencies without a proportional increase in computation. Effect: DeepSeek-V3.1 can handle very long contexts (up to 128k tokens) more efficiently by focusing attention through these latent representations, which reduces the memory and compute load compared to naive full attention on extremely long sequences.
- DeepSeek-MoE Architecture: DeepSeek V3.1 uses a Mixture-of-Experts at an unprecedented scale – 671B total parameters – but critically, only a subset of experts (about 37B parameters worth) are active for each input token. In DeepSeek-V2, MoE was already used, but V3’s MoE is more advanced and better optimized. The model is split into many “expert” sub-networks, and a gating mechanism chooses which experts handle each token. Effect: This architecture allows the model to have the capacity of a very large model, while keeping inference compute closer to that of a much smaller model. In other words, it achieves higher accuracy by leveraging many specialized experts, but only a handful are used at a time, so inference remains feasible.
- Auxiliary-Loss-Free Load Balancing: Traditional MoE models often require an auxiliary loss term during training to ensure the tokens are evenly distributed among experts (preventing some experts from being overused or underused). DeepSeek-V3.1 pioneers an approach to load balancing without any auxiliary loss. Instead, it relies on improved gating strategies or architectural tweaks so that expert utilization remains balanced naturally. Effect: By removing the auxiliary loss, training is simplified (one less hyperparameter to tune) and potentially more stable, since the model doesn’t have to juggle an extra objective. The gating mechanism in V3.1 effectively ensures each expert is sufficiently utilized, which was validated during the DeepSeek-V2 experiments and perfected in V3.1.
- Multi-Token Prediction Objective: Unlike the standard language modeling objective (predicting one next token at a time), DeepSeek-V3.1 was trained with a multi-token prediction objective. This means during training, the model sometimes tried to predict multiple future tokens in a single step. Effect: This training strategy leads to stronger predictive modeling – the model learns to generate sequences of tokens more coherently and can partially “plan ahead” in its outputs. As a bonus, it can also make inference more efficient: in some implementations, a model trained for multi-token prediction can generate several tokens per forward pass, boosting throughput. Even if you use it to generate one token at a time normally, the training objective still tends to yield a model that is more context-aware and less prone to errors in long outputs.
Together, these innovations significantly improve DeepSeek-V3’s performance (it achieves higher accuracy on benchmarks than V2 did) and efficiency (it uses computational resources more effectively). For example, thanks to the MoE and MLA, DeepSeek-V3 can outperform dense models that have an order of magnitude more active parameters per token, and handle ultra-long inputs that would choke other models.
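To make the expert-routing idea concrete, here is a toy top-k gating layer in PyTorch. It only illustrates the general pattern, not DeepSeek's implementation: the expert count, layer sizes, and top-k value are invented for the example, and a real system adds load balancing and batched expert dispatch.

import torch
import torch.nn as nn

class ToyMoELayer(nn.Module):
    """Minimal top-k mixture-of-experts layer (illustrative only)."""
    def __init__(self, d_model=64, n_experts=8, top_k=2):
        super().__init__()
        self.experts = nn.ModuleList(
            [nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(), nn.Linear(4 * d_model, d_model))
             for _ in range(n_experts)]
        )
        self.gate = nn.Linear(d_model, n_experts)  # router: scores each expert for each token
        self.top_k = top_k

    def forward(self, x):  # x: (tokens, d_model)
        scores = self.gate(x)                                       # (tokens, n_experts)
        weights, idx = scores.softmax(-1).topk(self.top_k, dim=-1)  # keep only the top-k experts per token
        out = torch.zeros_like(x)
        for k in range(self.top_k):
            for e in range(len(self.experts)):
                mask = idx[:, k] == e                               # tokens routed to expert e in slot k
                if mask.any():
                    out[mask] += weights[mask, k, None] * self.experts[e](x[mask])
        return out

tokens = torch.randn(10, 64)
print(ToyMoELayer()(tokens).shape)  # torch.Size([10, 64])

Even in this toy version you can see the key property: every token only touches top_k of the experts, so the compute per token stays far below what the total parameter count would suggest.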
💡 Tip: If you’re interested in the nitty-gritty, consider reading the DeepSeek-V3 Technical Report. It provides detailed rationale and ablation studies for these innovations. Understanding these concepts can inspire ideas for optimizing your own models (e.g., using MoE for scaling or multi-token objectives for faster generation).
✅ What you accomplished: You’ve learned how DeepSeek-V3’s architecture differs from V2, including MLA, MoE improvements, auxiliary-loss-free training, and multi-token prediction. These advances set the stage for why DeepSeek V3.1 can be both powerful and efficient – something we’ll explore next in terms of training and inference.
Efficient inference and cost-effective training
In this section, you'll learn how DeepSeek-V3.1 achieves its efficiency – both in making predictions quickly and in training cost-effectively. Building a 671B-parameter model might sound computationally prohibitive, but the DeepSeek team employed smart techniques to keep it manageable. We’ll discuss methods like FP8 precision, clever load balancing (without extra losses), and reducing communication bottlenecks in distributed training. By understanding these, you’ll see how large models can be trained and run practically – lessons that apply to training any large-scale model.
How DeepSeek-V3.1 achieves efficiency
DeepSeek-V3 (the backbone of DeepSeek V3.1) was designed from the ground up to maximize performance per compute, making it economical for its size. Here are the key methods it uses to be efficient in both inference and training:
- FP8 Mixed-Precision Training: DeepSeek-V3’s training pipeline leverages 8-bit floating point (FP8) precision (supported on latest GPUs like NVIDIA H100). By representing weights and activations in 8-bit format during certain parts of training, memory usage and communication overhead are drastically reduced compared to the traditional 16-bit or 32-bit precision. Impact: Using FP8 allowed the team to train the model with far fewer GPU-hours for the same number of tokens. It also speeds up computation since smaller numbers mean faster matrix multiplies. The result is a full training run that took roughly 2.8 million H800 GPU hours – an impressive feat for a model of this scale. For reference, without such precision optimizations, the required compute might have been many times higher.
- Optimized Load Balancing (Aux-loss-free): As mentioned earlier, DeepSeek-V3 does away with the usual MoE auxiliary loss by using a better gating mechanism. This directly contributes to efficiency: the model doesn’t waste capacity or get stuck in bad local minima where some experts do all the work. Impact: This means more consistent usage of all experts, which keeps any one expert from becoming a bottleneck. In practice, it yields faster training convergence (since all parts of the model learn in parallel) and smooth inference speed (no single component is swamped with work).
- Reduced Communication Bottlenecks: Training a 671B model involves splitting work across many GPUs (and even multiple nodes). Communication overhead – exchanging activations, gradients, and parameters between devices – can severely slow down large-model training. The DeepSeek team implemented communication-efficient algorithms to tackle this. For example, they likely used optimized all-to-all communication for MoE expert outputs and a sharded model approach (so each GPU only handles a fraction of the model’s weights). They also mention features like get_packed_weights for tensor parallelism in their codebase, indicating efforts to bundle communication steps efficiently. Impact: These optimizations keep the GPUs busy doing useful work (matrix multiplications) rather than idling on data transfer. As a result, scaling the training to many GPUs has near-linear efficiency – adding more GPUs actually helps, up to the huge scale they used, without hitting network bottlenecks too early.
- High Throughput Inference: Efficiency isn’t just about training – inference (generating text) is where many users care about speed. DeepSeek-V3 is optimized to output text quickly. With its multi-token training objective and streamlined architecture, it can generate around 60 tokens per second, about 3× faster than DeepSeek-V2. Internally, this is achieved by using faster transformer implementations (possibly FlashAttention or other fused kernels) and by the MoE structure which keeps the computation per token lower than an equivalent dense model. Impact: For end-users, this means snappy responses even though the model is huge. For instance, a response of a few hundred tokens could be ready in a couple of seconds rather than tens of seconds.
- Stable and Cost-Effective Training Regime: One often overlooked aspect of efficiency is not wasting trials. The DeepSeek-V3 training was remarkably stable – reports indicate no irreversible loss spikes or training crashes that required rollback. That stability comes from careful hyperparameter tuning and possibly techniques like gradual learning rate warmups and monitors to catch divergence early. Impact: Stable training means fewer aborted runs and wasted compute. Additionally, since the model converged in one go, the total cost (those 2.8M GPU hours) was a one-time cost. They also augmented the pre-training with Supervised Fine-Tuning (SFT) and Reinforcement Learning stages to fully unlock the model’s capabilities. These fine-tuning stages were done on top of the pre-trained model to further improve alignment and reasoning, maximizing the value of the initial training investment.
Taken together, these methods make DeepSeek-V3 a cost-effective project despite its size. For people interested in training large models, it’s a case study in combining cutting-edge techniques (like FP8 and MoE) to push the boundaries of scale without breaking the bank. For users of the model, the benefits are directly felt in faster responses and (if you have the hardware) the possibility to run the model yourself with fewer resources than you’d expect.
💡 Tip: Many of these efficiency tricks (like mixed precision) are available in popular frameworks. If you fine-tune or serve models, consider using half-precision (FP16 or bfloat16) or even new FP8 support if your hardware allows — it can dramatically improve speed and reduce memory usage. Also, keep an eye on model parallel libraries (e.g., DeepSpeed, ColossalAI) that let you leverage large models by sharding them across GPUs.
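As a small, serving-side illustration of the mixed-precision idea, here is a sketch that loads a causal LM in bfloat16 with Hugging Face Transformers. The model ID is a stand-in: the full DeepSeek V3.1 checkpoint is far too large for a single GPU, so in practice you would pick a checkpoint that fits your hardware or pair this with a sharded-serving framework.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "deepseek-ai/DeepSeek-V3.1"  # illustrative; substitute a model that fits your hardware

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,   # half-precision weights: roughly 2 bytes per parameter
    device_map="auto",            # spread layers across the available GPUs
    trust_remote_code=True,       # some DeepSeek checkpoints ship custom modeling code
)

inputs = tokenizer("Why is the sky blue?", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

The same torch_dtype and device_map knobs apply to any large open model, so this pattern is worth keeping in your toolbox even if you never load DeepSeek locally.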
✅ What you accomplished: You now understand how DeepSeek-V3 manages to be both powerful and efficient. We covered FP8 training, MoE load balancing without auxiliary loss, reduced GPU communication overhead, and how all this translates to fast inference. This knowledge demystifies how such a large model is usable in practice. Next, we’ll see how DeepSeek-V3.1 actually performs against other models, before moving on to using it hands-on.
Evaluation results of DeepSeek-V3.1
Coding and agentic tasks
On real-world coding challenges, V3.1 shows a major jump. SWE-bench Verified, which measures how often models correctly fix GitHub issues so unit tests pass, gives V3.1 a 66% success rate compared with 45% for V3-0324 and 45% for R1. In the multilingual version, V3.1 solves 54.5% of issues, nearly double the ~30% scores of the others. Terminal-Bench, which evaluates whether a model can successfully complete tasks in a live Linux shell, shows the same pattern: V3.1 succeeds on 31% of tasks versus 13% and 6%. These gains highlight that V3.1 is far more dependable at executing code and operating in real tool environments.

Browsing, search, and QA
Information retrieval benchmarks also favor V3.1. On BrowseComp, which requires navigating and extracting answers from web pages, V3.1 answers 30% of questions correctly versus only 9% for R1. In the Chinese-language version, V3.1 reaches 49% accuracy compared with 36% for R1. On HLE (Humanity's Last Exam, a broad and difficult benchmark), V3.1 is slightly ahead at 30% vs 25%. In deeper search tasks like xbench-DeepSearch, which require synthesizing information across sources, V3.1 scores 71% compared with 55%. Benchmarks like Frames (structured reasoning), SimpleQA (factual Q&A), and Seal0 (domain-specific Q&A) all show smaller but consistent leads. Taken together, V3.1 is more effective at retrieval and lightweight QA than R1.

Reasoning efficiency
The token usage results highlight efficiency. On AIME 2025, a challenging math exam, V3.1-Think matches or slightly surpasses R1 accuracy (88.4% vs 87.5%) while using about 30% fewer tokens. On GPQA Diamond, a graduate-level exam across many domains, the two models are almost tied (80.1% vs 81.0%) but V3.1 achieves it with nearly half the tokens. On LiveCodeBench, which measures reasoning about code, V3.1 is both more accurate (74.8% vs 73.3%) and more concise. This suggests that V3.1-Think can produce detailed reasoning while avoiding verbosity.

Overall
Compared to V3-0324, V3.1 is a clear generational leap. Compared to R1, it achieves higher accuracy on nearly every benchmark and is more efficient in reasoning-heavy tasks. The only test where R1 keeps pace is GPQA, but it does so at almost double the token cost.
Key features of DeepSeek-V3.1 models
In this section, you'll learn about the two modes of DeepSeek V3.1 (Non-thinking vs. Thinking) and when to use each. DeepSeek V3.1 actually offers two variants of its model: one geared towards straightforward responses (we’ll call it “Non-thinking Mode”) and one that engages in a more explicit reasoning process (“Thinking Mode”). We’ll explain what each mode means, their differences, and the use cases they shine in. By understanding this, you can pick the right model behavior for your particular task – whether you need quick answers or deep, step-by-step solutions.
Non-thinking Mode vs. Thinking Mode
DeepSeek V3.1 introduced a concept that not many other LLMs have: two distinct modes of operation. These correspond to different model variants (or settings) you can use via the API:
- Non-thinking Mode: This is the standard chat mode. When you use DeepSeek V3.1 in non-thinking mode, it behaves like a typical conversational AI – you ask a question or give a prompt, and it directly outputs the answer or completion. The model does not explicitly show its chain-of-thought; it just gives you the final response, usually in a concise and direct manner. This mode is optimized for speed and straightforward tasks. It’s similar to how ChatGPT or other chatbots respond: you see only the answer, not the reasoning the model might be doing internally. Under the hood, the model is still capable of complex reasoning, but it doesn’t surface the intermediate steps in the output. You’d use Non-thinking Mode for use-cases like: casual Q&A, summarization, straightforward coding help, translation, etc. – basically whenever you just want the solution or answer as quickly as possible.
- Thinking Mode: This is the advanced reasoning mode. In thinking mode, DeepSeek V3.1 will engage in a slow, multi-step reasoning process and can actually expose that reasoning as part of its output (or via special API fields). Essentially, before giving you a final answer, the model will output a "thinking" process – this might include the model brainstorming, step-by-step calculations, or intermediate conclusions. It’s akin to you solving a math problem on paper: you write down the steps, then the final answer. DeepSeek’s Thinking Mode does something similar. This mode is based on DeepSeek’s R1 research (their reasoning model that was trained with reinforcement learning to improve chain-of-thought). Use cases: whenever you have a complex problem that benefits from step-by-step reasoning. For example, solving a complicated math word problem, performing logical reasoning puzzles, or debugging code with multiple potential issues. Thinking Mode tends to produce more thorough answers and can handle scenarios where a quick answer might be incorrect or incomplete. The trade-off is speed: because the model is effectively doing more work (and outputting more text) to get to an answer, it will respond more slowly and use more tokens. In some internal tests, the thinking-enhanced model’s responses were more accurate on hard tasks but took longer (this matches an earlier note that a reasoning-optimized model had longer response times).
To illustrate the difference, imagine asking both modes a tricky question: “Provide a proof or explanation for why the square root of 2 is irrational.”
- Non-thinking Mode’s approach: It might directly output a concise explanation or well-known proof (like the classic proof by contradiction: assume p/q = sqrt(2) etc.). The answer might be correct if the model recalls it, but if not prompted well, it could also give a shallow or slightly flawed answer because it’s not explicitly working through the logic step by step in output.
- Thinking Mode’s approach: The model might start by saying, “Let’s reason this out,” and then proceed to write the proof step by step: “Assume √2 = p/q is rational... (step 1) ... we reach a contradiction (step 5)... hence √2 is irrational.” Finally, it might conclude, “Therefore, √2 is irrational.” The final answer is the same, but in thinking mode you got to see the whole reasoning trajectory, which can be insightful or more convincing.
It’s important to note that both modes ultimately use the same underlying knowledge and architecture; the difference is in how the model is prompted or set up. In the API, these often correspond to different model endpoints or toggles:
- For instance, the DeepSeek API exposes two model names: "deepseek-chat" (the non-thinking default) and "deepseek-reasoner" (Thinking Mode). You choose which one to call depending on what you need; we use both later in this tutorial.
- When using the Thinking Mode via API, responses often come with structured fields. Specifically, the streaming API can return a reasoning_content alongside the normal content. The reasoning_content holds the intermediate thoughts, which the client (you) can choose to display or not. Typically, you might show a user only the final content (answer) and log the reasoning_content for yourself or for debugging, especially if you want to verify how the model arrived at an answer (useful in sensitive applications).
Now, when should you use which mode?
- Use Non-thinking Mode if you need speed or the task is straightforward. For example, in a real-time chatbot that answers customer queries, you’d prefer this mode to keep responses snappy. Also, if you’re doing something like text completion or generation where the chain-of-thought would just be extraneous, stick to non-thinking.
- Use Thinking Mode for complex tasks where correctness is critical and the problem is non-trivial. If you’re building a tool to solve programming puzzles, or a tutor AI that teaches math with explanations, the thinking mode is invaluable. It might also help in cases where you found the non-thinking mode was giving wrong answers – switching to thinking mode can sometimes fix that by forcing the model to reason more carefully.
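If you want to automate that choice, a tiny routing helper can pick the model name from a crude heuristic, as in the sketch below. The keyword list and length threshold are invented for illustration; the model names match the DeepSeek API identifiers used later in this tutorial.

def pick_deepseek_model(prompt: str) -> str:
    """Crude heuristic: route multi-step or math/code-heavy prompts to Thinking Mode."""
    reasoning_hints = ["prove", "step by step", "debug", "why does", "solve", "derive"]
    if len(prompt) > 500 or any(h in prompt.lower() for h in reasoning_hints):
        return "deepseek-reasoner"   # Thinking Mode: slower, more thorough
    return "deepseek-chat"           # Non-thinking Mode: fast, direct answers

print(pick_deepseek_model("Translate 'hello' to French"))                     # deepseek-chat
print(pick_deepseek_model("Prove that the square root of 2 is irrational"))   # deepseek-reasoner

In a real application you might replace the keyword list with a lightweight classifier, but the idea is the same: reserve the expensive mode for the prompts that need it.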
⚠️ Troubleshooting: If you accidentally use the Thinking Mode model and find that your outputs contain strange tokens or the model’s internal thoughts (“<think>...”), remember you might need to handle that output properly. For example, you might need to filter out or hide the reasoning content before showing the answer to end-users. Conversely, if you expected a detailed reasoning but only got a short answer, double-check that you’re using the correct model variant or that you set the API result_format to allow reasoning (some APIs require a special parameter to include the chain-of-thought). Always refer to the provider’s docs on how to enable or disable the thinking output.
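If you do end up with raw reasoning markers in the text, a small filter like the sketch below can strip them before you show the answer to end-users. It assumes the trace is wrapped in literal <think>...</think> tags, which may not match every provider's output format.

import re

def strip_think_tags(text: str) -> str:
    """Remove any <think>...</think> blocks and surrounding whitespace."""
    return re.sub(r"<think>.*?</think>", "", text, flags=re.DOTALL).strip()

raw = "<think>The user wants a short answer. Rayleigh scattering...</think>The sky is blue because air scatters blue light more than red."
print(strip_think_tags(raw))  # -> "The sky is blue because air scatters blue light more than red."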
💡 Tip: Think of the two modes as tools in your toolbox. You don’t always need a hammer for every screw – likewise, don’t overuse Thinking Mode on simple tasks (it could slow things down unnecessarily). But when you face a challenging problem, that extra “thinking” capability can be the difference between success and failure. It’s a relatively new concept in AI, so as you experiment with DeepSeek V3.1, you’ll develop an intuition for which mode to employ.
✅ What you accomplished: You now understand DeepSeek V3.1’s Non-thinking vs Thinking modes. Non-thinking mode gives quick, to-the-point answers, whereas Thinking mode provides a step-by-step reasoning process for tough problems. With this knowledge, you’re better equipped to choose how you interact with the model for different tasks. Now it’s time for the fun part: a hands-on tutorial to set up DeepSeek V3.1 in Python, make some API calls, and even integrate W&B for monitoring our experiments.
Step-by-step tutorial: Using DeepSeek V3.1 with Python
In this section, you'll set up the DeepSeek V3.1 API environment and make your first API call, step by step. By following these instructions, you’ll be able to connect to the DeepSeek model using Python code (leveraging an OpenAI-compatible API). We’ll go through obtaining credentials, installing required libraries, and executing a query. We’ll also explore how to use streaming responses (so you can get the answer token-by-token) and ensure everything is working properly. Let’s dive in and get your first DeepSeek-powered Python script running!
Setting up and making your first API call
In this subsection, we’ll configure access to the DeepSeek V3.1 API and send a simple query to the model.
Step 1: Obtain a DeepSeek API Key. To use DeepSeek V3.1 via API, you need an API key (just like you would for services like OpenAI). If you haven’t already, sign up on the DeepSeek open platform, or on a provider that hosts the model, such as Alibaba Cloud Model Studio, which offers free trial quotas. Once signed up, create your API key – it will be a string starting with something like sk-... (very similar to OpenAI keys).
- Security Best Practice: Never hard-code your API keys in scripts that you might publish. Instead, store it in an environment variable or a configuration file. We’ll assume you’ve stored your key in an environment variable called DEEPSEEK_API_KEY for this tutorial.
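For example, after exporting the key in your shell (export DEEPSEEK_API_KEY=sk-...), you can read it in Python like this. Later snippets in this tutorial use a placeholder string for brevity; in real code, prefer this pattern:

import os

api_key = os.environ.get("DEEPSEEK_API_KEY")  # set beforehand, e.g. export DEEPSEEK_API_KEY=sk-...
if not api_key:
    raise RuntimeError("DEEPSEEK_API_KEY environment variable is not set")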
Step 2: Install the OpenAI Python library. DeepSeek’s API is designed to be compatible with the OpenAI API format, meaning we can use OpenAI’s official Python client to call DeepSeek’s models by just pointing it to the right base URL. If you don’t have the openai library installed, install it via pip:
pip install openai weave
This library will make it easy to craft chat completion requests.
Step 3: Set up your Python script or environment. We’ll start by importing the necessary modules and configuring the API key and endpoint. In your Python environment (could be a script or a notebook), do the following:
import weave
from openai import OpenAI

weave.init("deepseek_examples")

client = OpenAI(api_key="your_deepseek_api_key", base_url="https://api.deepseek.com")
Above, I import Weave and OpenAI, then initialize Weave so every inference call you decorate with @weave.op is automatically logged to your project. I also create a DeepSeek client with my API key and endpoint.
Step 4: Compose a message and call the model. Now we’ll craft a chat prompt and send it to DeepSeek V3.1. We’ll use the ChatCompletion API (as DeepSeek is a chat-style model). Here’s an example:
@weave.op()
def deepseek_inference(prompt: str, model: str):
    response = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": "you are a helpful assistant."},
            {"role": "user", "content": prompt}
        ]
    )
    reasoning = getattr(response.choices[0].message, "reasoning_content", None)
    final_answer = response.choices[0].message.content
    return {"model": model, "answer": final_answer, "reasoning": reasoning}

# Extract and print the assistant's reply
assistant_reply = deepseek_inference("why is the sky blue?", "deepseek-chat")["answer"]
print("Assistant:", assistant_reply)
Here, we define an inference function. The function takes a prompt and a model id ("deepseek-chat" for fast responses, or "deepseek-reasoner" for thinking mode). Inside, we call the Chat Completions API with a conversation history made of two messages: a system prompt to set the assistant’s behavior, and a user prompt with the actual question.
The API returns both the final answer and, if the model is reasoning-enabled, a reasoning trace. We grab those from the response object and return them in a dictionary.
Finally, we call the function with a simple query ("why is the sky blue?") using the "deepseek-chat" model and print the assistant’s reply. This setup demonstrates how to send prompts to DeepSeek, capture its responses, and log everything to Weave for later inspection.
If everything is set up correctly, running this code will yield a response from DeepSeek V3.1. For example, you might see:
### A Helpful Analogy
Imagine a busy plaza with a large fountain in the middle (the Sun). If you start spraying a fine mist of water (the atmosphere) into the air around the plaza, the smaller, lighter droplets (blue light) will be scattered everywhere, misting everyone in the plaza. The larger, heavier droplets (red light) will tend to travel in a straighter line from the fountain and not get scattered as much.

### Bonus: Why are Sunsets Red?
This is the perfect reverse of the effect! At sunset, sunlight has to travel through a much thicker slice of the atmosphere to reach your eyes. All that efficient blue light is scattered *away* from your line of sight long before it reaches you. The longer wavelengths of light (reds and oranges) are scattered less, so they pass straight through this thick layer of atmosphere and dominate the light that finally reaches your eyes, creating those brilliant sunset colors.

**In summary: The sky is blue because air molecules scatter short-wavelength blue light from the Sun much more than they scatter red light, filling our field of view with blue light coming from all directions.**

reasoning: Hmm, the user is asking about the color of the sky, which is a classic science question. They probably want a clear, straightforward explanation without too much technical jargon. I should start with the basic concept of sunlight being made of different colors, then explain how Rayleigh scattering works in simple terms. The key is to emphasize why blue light scatters more and how that affects what we see.
As you can see, the assistant (DeepSeek V3.1) provided a clear, factual explanation of why the sky is blue. This confirms our API call is working and the model understands the query.
After running the code, you can navigate to Weave to see the result:

Step 5: Using streaming for large responses (optional). When asking very open-ended or long questions, you might anticipate a lengthy answer. Enabling streaming lets you start receiving parts of the answer as they’re generated, rather than waiting for the whole answer to finish. The OpenAI-compatible API supports this via stream=True. Let’s modify our call to use streaming and print the response incrementally:
import weave
from openai import OpenAI

# weave.init("deepseek_examples")
client = OpenAI(api_key="your_deepseek_api_key", base_url="https://api.deepseek.com")

@weave.op()
def deepseek_inference(prompt: str, model: str, stream: bool = False):
    if stream:
        answer = ""
        s = client.chat.completions.create(
            model=model,
            messages=[
                {"role": "system", "content": "you are a helpful assistant."},
                {"role": "user", "content": prompt}
            ],
            stream=True
        )
        print("=== streaming start ===")
        for chunk in s:
            delta = chunk.choices[0].delta
            if delta.content:
                answer += delta.content
                print(delta.content, end="", flush=True)
        print("\n=== streaming done ===")
        # reasoning trace is not collected in this streaming example
        return {"model": model, "answer": answer, "reasoning": None}
    else:
        response = client.chat.completions.create(
            model=model,
            messages=[
                {"role": "system", "content": "you are a helpful assistant."},
                {"role": "user", "content": prompt}
            ]
        )
        reasoning = getattr(response.choices[0].message, "reasoning_content", None)
        final_answer = response.choices[0].message.content
        return {"model": model, "answer": final_answer, "reasoning": reasoning}

if __name__ == "__main__":
    r1 = deepseek_inference("why is the sky blue?", "deepseek-reasoner", stream=True)
    print("\n=== deepseek-reasoner finished ===")
    print("answer:", r1["answer"])
    print("reasoning:", r1["reasoning"])

    r2 = deepseek_inference("why is the sky blue?", "deepseek-chat", stream=True)
    print("\n=== deepseek-chat finished ===")
    print("answer:", r2["answer"])
    print("reasoning:", r2["reasoning"])
We set stream=True in the request. Instead of a single response object, we get a stream (here, the variable s) that yields chunks. Each chunk contains a piece of the message in chunk.choices[0].delta.content. We loop over these chunks and print them out as they arrive. The end="" and flush=True ensure that we print continuously on the same line, so the text appears as if the assistant is “typing” the answer.
With streaming enabled, you would see the answer start to appear token by token (or phrase by phrase) in your console. For example, it might print “The sky appears blue because sunlight is scattered by air molecules...” gradually rather than all at once. This is particularly useful for very large answers or when you want to show intermediate progress to a user in an application.
💡 Tip: Streaming is recommended for user-facing applications where response time matters. Even if the total time to generate the answer is the same, getting partial output can improve the experience. Just remember to handle the chunks properly – as we did by concatenating content. Also, note that when streaming, you won’t have the full response with usage info until the end. You may need to accumulate tokens if you want to count them.
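If you want a rough sense of output size while streaming, you can accumulate chunks as they arrive and check the finish reason on the final chunk. The sketch below only counts chunks and characters; for exact token counts, use the usage field of a non-streamed response (or a tokenizer), so treat these numbers as approximations.

import os
from openai import OpenAI

client = OpenAI(api_key=os.environ.get("DEEPSEEK_API_KEY"), base_url="https://api.deepseek.com")

answer = ""
chunk_count = 0
finish_reason = None

stream = client.chat.completions.create(
    model="deepseek-chat",
    messages=[{"role": "user", "content": "Give me three facts about Mars."}],
    stream=True,
)

for chunk in stream:
    choice = chunk.choices[0] if chunk.choices else None
    if choice and choice.delta.content:
        chunk_count += 1
        answer += choice.delta.content
    if choice and choice.finish_reason is not None:
        finish_reason = choice.finish_reason  # "stop" means finished cleanly, "length" means truncated

print(f"received {chunk_count} content chunks, {len(answer)} characters, finish_reason={finish_reason}")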
💡 Tip: The model name in the API call is critical. If you get an error like “model not found,” double-check the exact string you should use (for the DeepSeek API, the chat model is "deepseek-chat" and the reasoning model is "deepseek-reasoner").
⚠️ Troubleshooting:
- If you encounter an authentication error (e.g., “Invalid API key” or similar), ensure that the api_key you pass to the OpenAI client is correct and that your key is valid (not expired or out of quota). Remember that some platforms have region-specific keys; ensure you’re using the correct region or endpoint as mandated by the provider.
- If you get a connection error or timeout, it could be due to firewall issues or hitting a rate limit. Check your internet connection and possibly try a smaller prompt to see if it’s a size issue. DeepSeek’s endpoint might also require specific network access (for example, the Alibaba Cloud endpoint might be slower from certain regions).
- If the response is empty or seems truncated, a few things could be wrong: (1) The prompt might violate usage policies (some endpoints filter certain content, though for an open model like DeepSeek, this is less likely unless the provider has a filter). (2) The model might not know how to respond (though a simple factual question like ours should be fine). (3) If using streaming, maybe we didn’t handle the completion properly. Always finalize by printing a newline after streaming, and look for a chunk that has a finish reason (in OpenAI-style streaming, the last chunk typically has finish_reason: "stop").
- If you see an output but it includes something like a reasoning trace (e.g., some hidden tokens or an internal thought process when you weren’t expecting it), you might have accidentally called the “thinking” model without handling the reasoning_content. In such a case, you’ll notice the output containing what looks like a chain-of-thought (perhaps delineated by special tokens or just as text). To fix it, either switch to the non-thinking model, or adjust your code to separate and ignore the reasoning content in the final output.
✅ What you accomplished: You successfully set up Python to call DeepSeek V3.1’s API and received a completion! You’ve learned how to send chat prompts, retrieve the assistant’s answer, and even how to enable streaming for real-time token output. This means you’re now capable of integrating DeepSeek into a Python application or experiment. Next, we’ll explore some alternative ways to use DeepSeek (like different tasks and modes) and how to harness W&B tools to get the most out of your DeepSeek experiments.
Alternative use cases and tools
In this section, you'll explore various ways to apply DeepSeek V3.1 and learn how to leverage Weights & Biases (W&B) for experiment tracking and model optimization. Now that you have the basic API call working, it’s time to think bigger: what can you do with DeepSeek V3.1? We’ll discuss a few compelling use cases. Then, we’ll introduce W&B’s tools – specifically W&B Weave and W&B Models – which can elevate your workflow by helping you track results, compare model variants, and visualize outcomes. This will be more of a guided tour rather than strict step-by-step code, but there will be example snippets and clear guidance on how to integrate these tools.
Use cases for DeepSeek V3.1
DeepSeek V3.1 is a versatile language model, thanks to its huge knowledge base and reasoning ability. Here are some ways you can use it:
1. Advanced Language Understanding and Generation: Because of its training on diverse data, you can use DeepSeek for tasks like summarization of long documents (it can handle extremely long inputs), translation between languages, and answering questions on complex topics (science, history, etc.). It will often produce detailed, context-aware results.
2. Complex Reasoning and Problem Solving: With the Thinking Mode capability, DeepSeek excels at tasks like mathematical problem solving, logical reasoning puzzles, or multi-hop question answering (where it needs to combine information from multiple sources or steps). It can break down problems and attempt solutions step-by-step.
3. Code Generation and Debugging: The model has shown strong performance in code-related tasks. You can prompt it to write code snippets (in languages like Python, Java, etc.), or to explain and fix code. For example, you might give it a piece of code and ask for optimization suggestions, and it could reason through the code (especially in thinking mode) and provide improvements.
4. Long-form Content Creation: With a 128K token context, DeepSeek can potentially ingest a huge amount of text. This means you could feed it an entire book and ask for an analysis or continuation. It could also be used for long conversations without hitting context length limits. This is ideal for use cases like analyzing lengthy legal contracts, summarizing extensive logs or transcripts, or conducting research where you dump a lot of reference text into the prompt.
5. Multilingual Applications: If you need an AI that works in languages beyond English (Chinese especially, given DeepSeek’s origin, but also others), DeepSeek V3.1 is a strong choice. You could build a chatbot that converses in multiple languages or a translator that also explains nuances.
For any of these applications, it’s often useful to measure how the model is performing – e.g., is it accurate? is it fast enough? how do different settings or modes compare? This is where Weights & Biases can significantly help you.
Using W&B for DeepSeek experiments
Weights & Biases offers tools to track experiments, visualize data, and manage models. Here are two specific W&B features that pair well with DeepSeek usage:
- W&B Weave: Weave is W&B’s interactive dashboard and analysis tool. You can stream your results (model outputs, metrics, etc.) to Weave and build custom dashboards to visualize them. For example, you could create a panel that shows response times for different prompt lengths, or a table comparing answers from Non-thinking vs Thinking modes side by side. Weave is great for slicing and dicing your results to gain insights.
- W&B Models: The Models feature is like a model registry. It lets you version your models and store them (or references to them) along with metadata. While DeepSeek V3.1 itself is a large public model (which you wouldn’t upload to W&B), you might fine-tune smaller distilled versions or have checkpoints of interactions.
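As a quick sketch of that registry idea, here is how you might log a (hypothetical) fine-tuned checkpoint directory as a W&B model artifact. The project name, metadata, and checkpoint path are placeholders for illustration.

import wandb

run = wandb.init(project="deepseek_v3_eval", job_type="register-model")

artifact = wandb.Artifact(
    name="deepseek-distill-finetune",
    type="model",
    metadata={"base_model": "DeepSeek-V3.1", "notes": "LoRA fine-tune on support tickets"},
)
artifact.add_dir("./checkpoints/deepseek-distill")  # hypothetical local checkpoint directory
run.log_artifact(artifact)
run.finish()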
Let’s consider an example scenario to illustrate using these tools: comparing Non-thinking vs Thinking mode on a set of tasks. Suppose we have a list of questions or tasks that we want to evaluate DeepSeek on – some are straightforward, some are complicated. We want to see how well each mode does and also measure the time and tokens used. We could do the following:
Step 1: Set up Weave tracking. At the start of your script or notebook, initialize Weave:

import weave

weave.init("deepseek_v3_eval")

This points all subsequent logging at the deepseek_v3_eval project, so every decorated call and evaluation that follows is recorded there.
Step 2: Define your evaluation set and loop through it. For demonstration, let’s say:
questions = [{"prompt": "What is 5 + 7?", "mode": "non-thinking"},{"prompt": "Explain the significance of the number zero in mathematics.", "mode": "non-thinking"},{"prompt": "Explain step by step how to solve 5 + 7, and then give the answer.", "mode": "thinking"},{"prompt": "Explain the significance of the number zero in mathematics.", "mode": "thinking"}]
In this list, we have two basic tasks (a simple addition and a conceptual question) asked in non-thinking mode, and the same or similar tasks asked with thinking mode. For thinking mode, we might expect the model to give a more detailed stepwise explanation. In practice, you’d probably separate the logic of calling the different model endpoints, but here we annotate them conceptually.
Step 3: Compare our models with Weave Evaluations
For this toy example, we don’t actually want to judge correctness. Instead, we only want to capture and compare the outputs of the two modes (thinking vs non-thinking). To do that, we define a trivial scorer that always returns True. This way, the evaluation framework will still log each prediction and its metadata, but without imposing a real metric.
In practice, you would replace this dummy scorer with something meaningful — for example, exact-match scoring for factual tasks, BLEU/ROUGE for summarization, or even another LLM-based judge that evaluates reasoning quality. That would let you compare whether thinking mode is genuinely better for complex tasks. Here's the full code:
import asyncio
import weave
from openai import AsyncOpenAI

weave.init("thinking-vs-nonthinking")

# DeepSeek client
client = AsyncOpenAI(api_key="your_deepseek_api_key", base_url="https://api.deepseek.com")

# dataset
questions = [
    {"id": "0", "prompt": "What is 5 + 7?", "mode": "non-thinking"},
    {"id": "1", "prompt": "Explain the significance of the number zero in mathematics.", "mode": "non-thinking"},
    {"id": "2", "prompt": "Explain step by step how to solve 5 + 7, and then give the answer.", "mode": "thinking"},
    {"id": "3", "prompt": "Explain the significance of the number zero in mathematics.", "mode": "thinking"},
]

# model wrapper
class DeepSeekModel(weave.Model):
    model_name: str

    @weave.op()
    async def predict(self, prompt: str) -> dict:
        resp = await client.chat.completions.create(
            model=self.model_name,
            messages=[
                {"role": "system", "content": "you are a helpful assistant."},
                {"role": "user", "content": prompt},
            ],
        )
        answer = resp.choices[0].message.content
        reasoning = getattr(resp.choices[0].message, "reasoning_content", None)
        return {"answer": answer, "reasoning": reasoning}

# trivial scorer
@weave.op()
def always_true(prompt: str, mode: str, output: dict) -> dict:
    return {"correct": True}

# pick models: one thinking, one non-thinking
nonthinking_model = DeepSeekModel(model_name="deepseek-chat")
thinking_model = DeepSeekModel(model_name="deepseek-reasoner")

# run eval with both
evaluation = weave.Evaluation(
    name="thinking-vs-nonthinking-eval",
    dataset=questions,
    scorers=[always_true],
)

print("=== Non-thinking eval ===")
print(asyncio.run(evaluation.evaluate(nonthinking_model)))

print("=== Thinking eval ===")
print(asyncio.run(evaluation.evaluate(thinking_model)))
This uploads all results to W&B Weave. On the evaluation pages in the Weave UI, you’ll be able to view and compare the two models’ responses side by side!
Step 4: Analyze with Weave. Once the data is logged, you can harness Weave to create plots or comparisons. For instance:
- You could plot a bar chart of TimeTaken for Non-thinking vs Thinking for the same prompt (expect Thinking mode to take longer).
- Or show token usage differences.
- Or simply view the table to qualitatively compare the Response content side by side.
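To get those timing and token numbers in the first place, you can record them in the dictionary your op returns so they appear as columns in Weave. Below is a minimal sketch that times each call and reads the OpenAI-style usage object, assuming your endpoint returns one:

import os
import time
import weave
from openai import OpenAI

weave.init("deepseek_v3_eval")
client = OpenAI(api_key=os.environ.get("DEEPSEEK_API_KEY"), base_url="https://api.deepseek.com")

@weave.op()
def timed_inference(prompt: str, model: str) -> dict:
    start = time.perf_counter()
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    elapsed = time.perf_counter() - start
    usage = response.usage  # prompt_tokens, completion_tokens, total_tokens (if provided)
    return {
        "model": model,
        "answer": response.choices[0].message.content,
        "time_taken_s": round(elapsed, 2),
        "completion_tokens": usage.completion_tokens if usage else None,
        "total_tokens": usage.total_tokens if usage else None,
    }

for model in ["deepseek-chat", "deepseek-reasoner"]:
    print(timed_inference("Explain the significance of the number zero in mathematics.", model))

With time_taken_s and the token counts logged per call, the comparisons described above become simple column-versus-column views in Weave.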

💡 Tip: Visualize and iterate. The strength of logging everything to W&B is that you can identify patterns. For example, you might find Thinking mode greatly improves accuracy on certain classes of questions but is overkill for others. You could then adjust your application to toggle modes based on question type (maybe auto-detect if a question is complex). Or if you see that responses are too slow, you might try reducing the max_tokens or using a smaller distilled model for less critical queries. Without tracking, these decisions would be guesswork – with tracking, you have data to back them up.
By using W&B Weave alongside DeepSeek V3.1, you essentially create a feedback loop for yourself: run model → get results → visualize/improve → run again. This is especially useful when fine-tuning models or evaluating prompt strategies.
✅ What you accomplished: You’ve explored how to use DeepSeek V3.1 in various scenarios, from Q&A to coding assistance. More importantly, you learned how to integrate W&B tools to track and analyze your model’s performance. We built a conceptual framework for comparing the two modes of DeepSeek and discussed logging outputs, timing, and token usage to W&B. With these skills, you’re not just calling the model blindly – you’re gathering insights and data to use it more effectively. Give yourself a pat on the back: you’ve gone beyond the basics into more advanced, robust usage of a state-of-the-art model!
Conclusion
In this section, you'll recap the key points and look ahead to future developments. Congratulations on making it this far! Let’s summarize what we’ve covered and learned about DeepSeek V3.1, and discuss where things might go from here.
Recap and future directions
Throughout this tutorial, we started by understanding what DeepSeek V3.1 is – a cutting-edge open-source LLM that combines massive scale with efficient design. We explored its architectural innovations, like Multi-Head Latent Attention and a gargantuan MoE setup without auxiliary losses, which together allow it to deliver top-tier performance (rivaling models like GPT-4) at a fraction of the inference cost. We saw how those innovations translate into efficient inference and training, leveraging FP8 precision and clever parallelism to train a 671B model stably and use it quickly.
You learned about the model’s two flavors: Non-thinking vs Thinking mode. This is a concept that sets DeepSeek V3.1 apart – having a “fast intuitive mode” and a “slow reasoning mode” in the same system. Knowing when and how to use each can make a big difference in building effective AI applications.
On the practical side, we set up a Python environment to call DeepSeek V3.1. You obtained an API key, configured the OpenAI-compatible API, and made your first request to the model. You saw how easy it is to integrate (the openai library did most of the heavy lifting) and got an actual response from the model, confirming it’s working. We also enabled streaming to handle long answers gracefully.
We didn’t stop there – we went into advanced usage by discussing real-world use cases and crucially, how to use Weights & Biases to track and improve our interactions with DeepSeek. By logging data and using W&B Weave, you can systematically analyze things like performance differences between modes, response times, and more. This transforms your usage from ad-hoc testing into a robust, data-driven workflow.
Now, looking to the future, what can we expect or hope for with DeepSeek and beyond?
- Continued Refinement: DeepSeek V3.1 itself might receive minor updates or improvements (for example, a V3.2) especially in the area of licensing (hopefully clarifying the model’s usage rights) and maybe model safety or alignment.
- DeepSeek V4? It’s natural to wonder if an even larger or more advanced model is on the horizon. Perhaps a DeepSeek V4 could push the parameters further or integrate reasoning even more deeply such that the line between “thinking mode” and normal operation blurs. Future models might also focus on efficiency, maybe using techniques like knowledge distillation to pack V3.1’s punch into smaller packages that are easier to deploy.
- Community and Ecosystem: As an open model, a lot of future development could come from the community. We might see fine-tuned versions of DeepSeek V3.1 on specific domains (imagine a biomedical version, or a legal-assistant version), which use the base model and adapt it to specialized tasks. We already saw hints of that with the DeepSeek-R1 distilled models based on LLaMA and Qwen.
For you as a reader and practitioner, the journey doesn’t end here. You now have the knowledge and tools to:
- Use DeepSeek V3.1 in your projects, taking advantage of its strengths.
- Experiment and iterate with W&B, which will help you fine-tune prompts or decide on the best configurations for your use case.
- Stay curious and keep learning: The field of AI moves fast. DeepSeek V3.1 is state-of-the-art today, but there will be new papers, new models, and new techniques tomorrow. Many of the techniques you learned (like prompt engineering, analyzing outputs, using chain-of-thought, and tracking experiments) are transferable to whatever comes next.
We encourage you to build something with DeepSeek V3.1 – whether it’s a smart chatbot that can handle long conversations, an assistant that can solve complex puzzles, or a tool that helps you analyze code. Share your results with the community, and don’t forget to utilize tools like Weights & Biases to make your development process smoother and more insightful.
Thanks for following along with this tutorial.