
Fine-tuning LLMs for function-calling

Learn how to fine-tune LLMs for function calling, cut inference costs, and boost accuracy with step-by-step guidance on dataset prep and tokenizer tweaks.
The rise of agentic frameworks - such as the OpenAI Agents SDK, LangGraph, and others - has driven a surge of interest in more autonomous, goal-driven generative AI systems. These frameworks enable dynamic planning, tool usage, and multi-step reasoning. However, with greater autonomy comes a new challenge: managing inference cost and keeping behavior predictable.
At the same time, a growing number of practitioners are exploring fine-tuning LLMs for function calling to address these challenges.
What is function calling? Function calling enables an LLM to call external code—retrieving real-time data, performing calculations, or triggering actions. The model inspects user input, determines whether a call is needed, emits a structured JSON request, then uses the result to compose its response. For a deeper dive, see the “What is function calling?” section in Hugging Face’s Agents Course.
This report begins by examining when and where fine-tuning models for function-calling is truly beneficial, and then outlines the concrete steps involved in the fine-tuning process.

AI agent vs. workflow: A key distinction

What is an AI agent? Broadly, you can think of an agent as a dynamic, autonomous system capable of breaking out of the rigid workflows we often associate with early chatbot tech. Many agentic systems, however, combine workflows and agents. The table below (from the smolagents conceptual guide) places workflows and agents on a spectrum of autonomy, with Levels 0–2 generally considered workflows and higher levels representing true agents.
Table adapted from the smolagents conceptual guide, with the AI workflow and AI agent categories added

Where is fine-tuning LLMs for function calling useful?

Higher-autonomy agents introduce complexities that lightweight models often can’t handle reliably. Although it’s possible (as of May 2025) to swap out OpenAI’s agent backend for LiteLLM, only high-capacity models with built-in function-calling support—such as Qwen—consistently produce valid outputs at those agency levels.
When your use case fits within an AI workflow (Levels 0–2), however, fine-tuning for function calling is usually enough. It aligns the model with your function schema, cuts inference costs, boosts accuracy in a controlled environment, and can also reduce latency.

Fine-tuning LLMs for function calling

The following sections describe the core techniques for fine-tuning a model to generate structured function calls reliably.

Terminology

First, let's clarify some terminology:
  • Function calling refers to an LLM’s ability to generate a structured string - typically JSON - that represents a call to an external function.
  • Tools are the actual functions the model can invoke.

How function calling works

Function calling lets the model decide when to call an external API, format the call according to a schema, and incorporate the result into its response.
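To make the mechanics concrete, here is a minimal sketch of that loop on the application side. The get_weather tool, its schema, and the example model output are all hypothetical:

import json

# Hypothetical tool schema the application exposes to the model (JSON Schema style).
get_weather_schema = {
    "name": "get_weather",
    "description": "Get the current weather for a city",
    "parameters": {
        "type": "object",
        "properties": {"city": {"type": "string"}},
        "required": ["city"],
    },
}

def get_weather(city: str) -> dict:
    # Placeholder implementation; a real tool would call a weather API.
    return {"city": city, "temp_c": 18, "condition": "cloudy"}

# The model only generates a structured string like this; it never executes anything itself.
model_output = '{"name": "get_weather", "arguments": {"city": "Paris"}}'

# The application parses the call, runs the matching function,
# and feeds the result back to the model to compose the final answer.
call = json.loads(model_output)
registry = {"get_weather": get_weather}
result = registry[call["name"]](**call["arguments"])
print(result)  # {'city': 'Paris', 'temp_c': 18, 'condition': 'cloudy'}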

Types of function calling

Inspired by the deeplearning.ai taxonomy, here are the main patterns you’ll encounter:
Types of function calling

Single call

A single call is the simplest type of function usage - the LLM receives a request, selects the appropriate function, and makes a single call before responding. It’s ideal for straightforward, one-step tasks where no extra reasoning or chaining is needed. The main benefit is clarity: minimal overhead, minimal room for failure.

Parallel calls

Parallel calls let the LLM trigger multiple tools simultaneously - ideal when the sub-tasks are independent of one another. It’s faster than running them one by one, but requires care: you need async infrastructure, and the model must reliably emit multiple well-formed JSON calls in a single turn. Best for batched queries.
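On the orchestration side, the application parses each JSON call from the same model turn and executes them concurrently. A rough sketch, with hypothetical fetch_weather and fetch_news tools standing in for real async APIs:

import asyncio
import json

# Hypothetical async tools; real ones would hit external APIs.
async def fetch_weather(city: str) -> dict:
    await asyncio.sleep(0.1)
    return {"city": city, "temp_c": 18}

async def fetch_news(country: str) -> dict:
    await asyncio.sleep(0.1)
    return {"country": country, "headlines": ["..."]}

TOOLS = {"fetch_weather": fetch_weather, "fetch_news": fetch_news}

async def run_parallel(tool_calls: list[str]) -> list[dict]:
    # Each element is one JSON call emitted by the model in the same turn.
    calls = [json.loads(c) for c in tool_calls]
    tasks = [TOOLS[c["name"]](**c["arguments"]) for c in calls]
    return await asyncio.gather(*tasks)

results = asyncio.run(run_parallel([
    '{"name": "fetch_weather", "arguments": {"city": "Paris"}}',
    '{"name": "fetch_news", "arguments": {"country": "France"}}',
]))
print(results)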

No call

The model relies on its knowledge instead of calling a function. This reduces latency but risks outdated or incorrect answers if a necessary call is omitted.
Tips to make no-call work well:
  • Train or prompt the model with examples where not using a function is the right move.
  • Include instructions like: “Only use a function if the answer cannot be reliably generated from internal knowledge.”
  • Monitor tool usage: a high rate of unnecessary calls usually signals a prompt-tuning issue.

Multiple calls (sequential)

In this pattern, the LLM makes several separate function calls across a session, one after another. Each function handles a different part of the overall task. These calls can span multiple turns or be triggered by user follow-ups.
An example:
  • User: “Translate the latest news on electric vehicles into French.”
    • → LLM calls get_latest_news(topic="EVs")
    • → Then calls summarize_text(news_result)
    • → Then calls translate(text=summary, lang="fr")
Each call is logically linked, but they are handled step-by-step, with the LLM deciding what to do after seeing each result. This approach is flexible and works well when user intent unfolds progressively. It’s commonly used in agent frameworks or multi-turn chat workflows.
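A simplified orchestration loop for this pattern might look like the sketch below; generate_turn, parse_tool_call, and the tools registry are hypothetical stand-ins for your model call, output parsing, and tool implementations:

def run_sequential(user_message, generate_turn, parse_tool_call, tools, max_steps=5):
    """Let the model decide the next call only after seeing each tool result."""
    messages = [{"role": "user", "content": user_message}]
    reply = ""
    for _ in range(max_steps):
        reply = generate_turn(messages)      # one model turn, returned as text
        messages.append({"role": "assistant", "content": reply})
        call = parse_tool_call(reply)        # None when no <tool_call> is present
        if call is None:
            break                            # final natural-language answer
        result = tools[call["name"]](**call["arguments"])
        messages.append({"role": "tool", "content": str(result)})
    return reply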

Nested calls (dependent)

Nested calls happen when the output of one function is immediately required as the input for the next - in a tightly coupled, often single-turn process. The LLM must reason about the full pipeline and plan it in advance.
Example:
  • User: “Plot the average stock price of Nvidia over the last month.”
    • → LLM must call get_stock_data("NVDA", range="1mo")
    • → Feed that into calculate_average(data)
    • → Then feed the result into plot(value)
This is effectively a chain of dependent operations, like nested function calls in code: plot(calculate_average(get_stock_data(...))). Often, the model doesn’t make these all at once, but instead steps through them across turns - still, the logical dependency between steps is tight.
Nested function calls are fragile because each step depends entirely on the previous one being correct. If the model makes even a small mistake - like passing the wrong format, misnaming a parameter, or misunderstanding an intermediate result - the entire chain breaks. Unlike sequential calls where each function can stand somewhat on its own, nested calls require perfect alignment across multiple steps. One broken link and the whole process fails.

How can you fine-tune LLMs for function calling?

For a practical introduction to fine-tuning LLMs for function calling, "Bonus Unit 1. Fine-tuning an LLM for Function-calling" from Hugging Face’s Agents Course is a helpful reference. Although the tutorial is educational and not intended for production use, it provides a clear and accessible starting point. Let’s take a closer look at the key aspects it covers.

Dataset preparation

For function calling tasks, a language model must do more than generate fluent text: it must produce structured, schema-compliant outputs that align with a predefined interface. Typically, this interface is defined using OpenAPI or JSON Schema, which describe the valid structure, parameters, and constraints for each callable function.
Given a user prompt, a function-calling model must:
  • Understand intent in chat format: Track multi-turn dialogue and context, inserting structured outputs at the appropriate point in the conversation.
  • Adhere strictly to the schema: Emit function calls as well-formed JSON objects that conform exactly to the input/output definitions. Any deviation can cause runtime errors or failed execution.
Standard instruction-tuned models don’t meet these requirements out of the box. The dataset and preprocessing steps below teach the model when and how to invoke each tool correctly—and can optionally nudge it to “think” before acting.
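For illustration, a single tool definition in JSON Schema form might look like the following (translate is a hypothetical function, echoing the examples earlier in this report):

translate_tool = {
    "type": "function",
    "function": {
        "name": "translate",
        "description": "Translate text into a target language",
        "parameters": {
            "type": "object",
            "properties": {
                "text": {"type": "string", "description": "The text to translate"},
                "lang": {"type": "string", "description": "Target language code, e.g. 'fr'"},
            },
            "required": ["text", "lang"],
        },
    },
}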

Example dataset: Jofthomas/hermes-function-calling-thinking-V1

This dataset builds on NousResearch/hermes-function-calling-v1 by inserting an intermediate reasoning phase. Here’s the flow:
  • User message: The available tools are listed inside <tools>…</tools> tags, followed by the user’s query.
    <tools>[{…get_news_headlines…}, {…search_recipes…}]</tools>
    Can you get me the latest news headlines for the United States?
  • Model “think” and “act”: The assistant (labeled “model”) first exposes its reasoning in <think>…</think>, then emits a structured function call in <tool_call>…</tool_call>.
    <think>Reviewing tools… get_news_headlines fits best…</think>
    <tool_call>{"name":"get_news_headlines","arguments":{"country":"United States"}}</tool_call>
  • Tool response: If a <tool_call> occurs, the dataset appends a <tool_response>…</tool_response> message containing the actual function output.
    <tool_response>{"headlines":["…"]}</tool_response>
By merging any system instructions into the first user message and treating <think>, <tool_call>, and <tool_response> as indivisible tokens, this format teaches the model both the rationale and the mechanics of reliable function calling.
To include the tool’s actual output in your dataset, insert a “tool” turn immediately after the <tool_call> and wrap the JSON in <tool_response> tags. For example, here is a full rendered training example:
<bos><start_of_turn>human
You are a function calling AI model. You are provided with function signatures within <tools></tools> XML tags.You may call one or more functions to assist with the user query. Don't make assumptions about what values to plug into functions.Here are the available tools:<tools> [{'type': 'function', 'function': {'name': 'get_news_headlines', 'description': 'Get the latest news headlines', 'parameters': {'type': 'object', 'properties': {'country': {'type': 'string', 'description': 'The country for which headlines are needed'}}, 'required': ['country']}}}, {'type': 'function', 'function': {'name': 'search_recipes', 'description': 'Search for recipes based on ingredients', 'parameters': {'type': 'object', 'properties': {'ingredients': {'type': 'array', 'items': {'type': 'string'}, 'description': 'The list of ingredients'}}, 'required': ['ingredients']}}}] </tools>Use the following pydantic model json schema for each tool call you will make: {'title': 'FunctionCall', 'type': 'object', 'properties': {'arguments': {'title': 'Arguments', 'type': 'object'}, 'name': {'title': 'Name', 'type': 'string'}}, 'required': ['arguments', 'name']}For each function call return a json object with function name and arguments within <tool_call></tool_call> XML tags as follows:
<tool_call>
{tool_call}
</tool_call>Also, before making a call to a function take the time to plan the function to take. Make that thinking process between <think>{your thoughts}</think>

Can you get me the latest news headlines for the United States?<end_of_turn><eos>
<start_of_turn>model
<think>Okay, so the user asked for the latest news headlines in the United States. I need to figure out how to respond appropriately using the tools available.

Looking at the tools provided, there's a function called get_news_headlines. That seems perfect for this situation. It requires a country code, which the user specified as the United States.

Since the user is asking specifically about the US, I'll use the country parameter as 'United States' in the function call. This should fetch the relevant news for them.

I should make sure to format the response correctly as a JSON object with the function name and arguments.
</think><tool_call>
{'name': 'get_news_headlines', 'arguments': {'country': 'United States'}}
</tool_call><end_of_turn><eos>
<start_of_turn>tool
<tool_response>
{'headlines': ['US President signs new bill into law', 'Major earthquake hits California', 'NASA announces new Mars mission', 'Tech giant releases new smartphone', 'US economy shows signs of recovery']}
</tool_response><end_of_turn><eos>
<start_of_turn>model
Here are the latest news headlines for the United States:
1. US President signs new bill into law
2. Major earthquake hits California
3. NASA announces new Mars mission
4. Tech giant releases new smartphone
5. US economy shows signs of recovery<end_of_turn><eos>
<start_of_turn>human
That's interesting. Can you get me the news for France now?<end_of_turn><eos>
<start_of_turn>model
<think>Okay, the user initially asked for the latest US news, so I used get_news_headlines with 'United States'. Now the user is asking for France, so I'll call the same function with 'France' as the country. That makes sense because the function is designed for any country.
</think><tool_call>
{'name': 'get_news_headlines', 'arguments': {'country': 'France'}}
</tool_call><end_of_turn><eos>
<start_of_turn>tool
<tool_response>
{'headlines': ['French President announces new environmental policy', 'Paris Fashion Week highlights', 'France wins World Cup qualifier', 'New culinary trend sweeps across France', 'French tech startup raises millions in funding']}
</tool_response><end_of_turn><eos>
<start_of_turn>model
Here are the latest news headlines for France:
1. French President announces new environmental policy
2. Paris Fashion Week highlights
3. France wins World Cup qualifier
4. New culinary trend sweeps across France
5. French tech startup raises millions in funding<end_of_turn><eos>
This structure aims to model not only tool selection but also the cognitive process behind it.
If the model is expected to call multiple tools, the dataset must include examples where multiple function calls are made within a single interaction.
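For instance, a turn that exercises parallel calls could contain two <tool_call> blocks back to back. The sample below is a hypothetical illustration written in the same format as the transcript above, not a row taken from the dataset:

<start_of_turn>model
<think>The user wants headlines for both the United States and France, so I can call get_news_headlines once per country.</think><tool_call>
{'name': 'get_news_headlines', 'arguments': {'country': 'United States'}}
</tool_call><tool_call>
{'name': 'get_news_headlines', 'arguments': {'country': 'France'}}
</tool_call><end_of_turn><eos>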

Tokenizer adjustments

If the base model (such as google/gemma-2-2b-it) does not natively support function-calling prompts or custom XML-like tags, then tags like <think>, <tool_call>, and <tool_response> must be registered as special tokens so they are treated as indivisible units during tokenization. Without this registration, they would be broken into subwords, degrading both training and inference quality.
from enum import Enum
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "google/gemma-2-2b-it"

class ChatmlSpecialTokens(str, Enum):
    """Special tags used in the function-calling chat format."""
    tools = "<tools>"
    eotools = "</tools>"
    think = "<think>"
    eothink = "</think>"
    tool_call = "<tool_call>"
    eotool_call = "</tool_call>"
    tool_response = "<tool_response>"
    eotool_response = "</tool_response>"
    pad_token = "<pad>"
    eos_token = "<eos>"

    @classmethod
    def list(cls):
        return [c.value for c in cls]

# Register the tags as special tokens so they are never split into subwords.
tokenizer = AutoTokenizer.from_pretrained(
    model_name,
    pad_token=ChatmlSpecialTokens.pad_token.value,
    additional_special_tokens=ChatmlSpecialTokens.list(),
)

model = AutoModelForCausalLM.from_pretrained(model_name)
# Grow the embedding matrix so each new special token gets its own embedding.
model.resize_token_embeddings(len(tokenizer))
This ensures that the new special tokens are properly mapped to unique embeddings and handled during training.
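As a quick sanity check (a small sketch reusing the tokenizer built above), you can confirm that each tag now maps to a single token id:

# Each special tag should tokenize to exactly one id rather than several subwords.
for tag in ChatmlSpecialTokens.list():
    ids = tokenizer.encode(tag, add_special_tokens=False)
    print(tag, ids)
    assert len(ids) == 1, f"{tag} was split into multiple tokens"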

Fine-tuning

With your dataset and tokenizer ready, proceed with the standard Hugging Face Transformers fine-tuning workflow. Train the model on examples that mix natural language turns, <think> reasoning steps, and <tool_call> actions. Remember that the LLM itself only generates text - actual execution of function calls must be orchestrated by your external application.
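The Hugging Face course pairs this workflow with LoRA adapters via trl and peft. The sketch below is a condensed, non-authoritative outline: train_dataset is assumed to already hold examples rendered into the chat format shown above, the hyperparameters are placeholders, and exact argument names (for example processing_class vs. tokenizer) vary across trl versions:

from peft import LoraConfig
from trl import SFTConfig, SFTTrainer

# LoRA keeps the fine-tune lightweight; target modules here are typical attention projections.
peft_config = LoraConfig(
    r=16,
    lora_alpha=64,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)

training_args = SFTConfig(
    output_dir="gemma-2-2b-it-function-calling",
    num_train_epochs=1,
    per_device_train_batch_size=1,
    gradient_accumulation_steps=4,
    learning_rate=1e-4,
    logging_steps=10,
)

trainer = SFTTrainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,     # examples rendered into the chat format above
    processing_class=tokenizer,      # `tokenizer=` in older trl releases
    peft_config=peft_config,
)
trainer.train()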

Tips to improve accuracy

Add a thinking process

  • The <think> phase is not present in traditional function-calling datasets. In typical setups, the model directly transitions from user input to function invocation without exposing the underlying rationale. However, recent research, such as DeepMind’s Scaling LLM Test-Time Compute Optimally can be More Effective than Scaling Model Parameters, suggests that explicitly allowing the model to "think before acting" can lead to improvements in planning accuracy and task reliability.
  • In the above modified dataset, intermediate reasoning segments were generated by running selected samples through deepseek-ai/DeepSeek-R1-Distill-Qwen-32B. These model-generated thoughts are inserted before any function call, enclosed in <think> tags.
  • This step is technically optional but potentially valuable. By making the reasoning process explicit, the model is nudged to justify the choice of tool and parameters. This can help avoid hallucinated calls, misuse of arguments, or brittle outputs - problems that often surface in zero-shot setups.

Include no-call cases

  • It’s also important to include examples where no suitable tool is available. Without these, the model may try to force a tool selection even when the user’s request is unrelated or outside the scope of the available tools.
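For instance, a no-call sample in the same transcript format (a hypothetical example, not taken from the dataset) lists the tools but lets the model answer directly:

<start_of_turn>human
<tools>[{…get_news_headlines…}, {…search_recipes…}]</tools>
What is the capital of France?<end_of_turn><eos>
<start_of_turn>model
<think>Neither get_news_headlines nor search_recipes applies here; this is general knowledge I can answer directly.</think>
The capital of France is Paris.<end_of_turn><eos>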

Evaluation

Build an evaluation dataset in parallel with your training data to track performance over time. For a public benchmark, consider BFCL (the Berkeley Function-Calling Leaderboard). BFCL evaluates:
  • Correct tool selection based on user intent
  • Strict adherence to the function schema
  • Robust handling of edge cases (missing arguments, ambiguous prompts, irrelevant requests)
BFCL provides a quantitative framework for comparing models on their ability to reason about, select, and format calls correctly. While not a replacement for domain-specific evaluation, it serves as a useful baseline for model capability and a helpful reference point when building internal evaluation pipelines.
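Alongside a public benchmark, a small in-house check of call validity is easy to maintain. The sketch below is not BFCL code; the jsonschema dependency and the expected_schemas structure are assumptions. It checks that a predicted call parses as JSON, names a known tool, and matches that tool’s parameter schema:

import json

from jsonschema import ValidationError, validate

def score_call(predicted: str, expected_schemas: dict) -> dict:
    """Score one predicted function-call string against the known tool schemas."""
    result = {"valid_json": False, "known_tool": False, "schema_ok": False}
    try:
        call = json.loads(predicted)
        result["valid_json"] = True
    except json.JSONDecodeError:
        return result
    name = call.get("name")
    if name in expected_schemas:
        result["known_tool"] = True
        try:
            # expected_schemas maps each tool name to its JSON Schema "parameters" block.
            validate(call.get("arguments", {}), expected_schemas[name])
            result["schema_ok"] = True
        except ValidationError:
            pass
    return result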


Final thoughts

As AI agents continue to advance, cost-efficiency will become an increasingly practical concern. As of May 2025, fully productionized AI agents remain relatively rare - but that is likely to change. With broader deployment, issues such as latency, cost, and infrastructure complexity will become more visible and harder to ignore. When those challenges arise, hopefully this report can serve as a useful reference.
Iterate on AI agents and models faster. Try Weights & Biases today.