
Using LLMs to Extract Structured Data: OpenAI Function Calling in Action

Harnessing GPT-4 and ChatGPT for Efficient Processing of Unstructured Documents
This article accompanies the free W&B online course Building LLM-Powered Applications. Sign up today and start learning!

Motivation

Motivated by an analytics project aimed at understanding the strategies for winning Kaggle competitions, we face the challenge of extracting structured data from a vast collection of unstructured writeups.
These writeups, collected from Kaggle Solutions, contain valuable insights into the methods employed by successful Kaggle participants. However, the lack of a predefined structure poses a hurdle in comprehending what it truly takes to win. To overcome this obstacle, we turn to large language models (LLMs). We will use them to comprehend and parse the unstructured text, enabling us to extract the key elements that contribute to success in these data science competitions.

Dataset

To uncover the secrets of winning Kaggle competitions, we dive into the dataset presented in the W&B Table below. While this structured data provides a starting point, the true gems of knowledge lie within the writeups sourced from a Kaggle dataset.
Unfortunately, these writeups are unstructured, which means extracting valuable insights would require an incredibly long processing time if done manually. Fortunately, we can use LLMs to efficiently extract structured data from these writeups, enabling us to expedite the analysis and uncover what it truly takes to succeed in these challenging competitions.

[W&B Table: Kaggle competition solutions dataset, from run hopeful-glade-79]


Using OpenAI Function Calling

To extract structured data with the OpenAI API, we can define our own function and make it available to the LLM. By instructing the model to call this function on the provided prompt data, we constrain its output to the function's JSON schema, ensuring structured output.
You can follow along and replicate the results in this Kaggle notebook.
Let's take a look at the function we'll define, called "MethodsList." It collects a list of machine learning methods via a single parameter, methods: an array of strings representing the machine learning methods identified in a solution description. By passing this function definition to the GPT model, we can extract structured data efficiently.
functions = [
    {
        'name': 'MethodsList',
        'description': 'List of machine learning methods.',
        'parameters': {
            'type': 'object',
            'properties': {
                'methods': {
                    'title': 'Methods',
                    'type': 'array',
                    'items': {'type': 'string'},
                },
            },
            'required': ['methods'],
        },
    },
]

# Force the model to call MethodsList rather than reply in free text.
function_call = {"name": "MethodsList"}

Let's also design a prompt to facilitate the extraction:
system_prompt = """You are a machine learning engineer analyzing Kaggle competition solutions.
Your goal is to come up with the ultimate set of proven machine learning methods.
Review the winning solution description provided by the user, and identify methods that worked.
Omit methods that didn't work.
Don't add any methods not mentioned in the solution description.
Call the MethodsList function to save a list of methods that worked."""
We should note that some solution descriptions may be longer than our model's context window. In these cases, we'll just take the beginning of the document:
import tiktoken

GPT_MODEL = "gpt-3.5-turbo"  # assumed model name; adjust to the model you use
MAX_TOKENS = 3500
encoding = tiktoken.encoding_for_model(GPT_MODEL)

def shorten(method_description):
    # Keep only the first MAX_TOKENS tokens of the description.
    tokens = encoding.encode(method_description)
    return encoding.decode(tokens[:MAX_TOKENS])
To call the language model's API endpoint for extracting structured data, we need to handle the API rate limits, especially when working with multiple documents.
To address this, we employ a retry mechanism using the tenacity library. This ensures that even if we hit rate limits, the function retries a few times, with random exponential waits, before giving up. The chat_completion_request function takes the system_prompt and method_description and constructs a chat-based interaction with the LLM.
The function also passes in the function definitions and a specific function_call so the model returns structured output. If a function call is present in the response, its arguments are parsed and returned as a list of methods. If anything goes wrong, the function returns the raw response message for further analysis.
This approach lets us handle rate limits gracefully while parsing structured data from the model's output.
from tenacity import retry, wait_random_exponential, stop_after_attempt
import json
import openai

@retry(wait=wait_random_exponential(min=1, max=40), stop=stop_after_attempt(3))
def chat_completion_request(system_prompt, method_description, functions=None, function_call=None, model=GPT_MODEL):
    short_prompt = shorten(method_description)
    messages = [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": short_prompt},
    ]
    response = openai.ChatCompletion.create(
        model=model,
        messages=messages,
        functions=functions,
        function_call=function_call,
    )
    try:
        # Parse the JSON arguments of the function call from the response.
        res = response.choices[0].message.function_call.arguments
        data = json.loads(res)
        return data["methods"]
    except (AttributeError, KeyError, json.JSONDecodeError):
        # Fall back to the raw message so we can inspect failures later.
        return response.choices[0].message
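With the helper in place, extraction is just a loop over the writeups. Here's a minimal sketch of that loop, assuming the writeups live in a pandas DataFrame called solutions_df with a text column named writeup (both names are assumptions, not from the original notebook):
all_methods = []
for writeup in solutions_df["writeup"]:
    methods = chat_completion_request(
        system_prompt,
        writeup,
        functions=functions,
        function_call=function_call,
    )
    # chat_completion_request returns a list on success and the raw
    # message object on failure, so keep only parsed lists.
    all_methods.append(methods if isinstance(methods, list) else [])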

Cleaning Up Data with Our LLM

Now that we have obtained a list of methods for each writeup, we need to address the issue of messy and duplicated data. The list contains duplicates, synonyms, and sometimes overly specific entries, making it challenging to work with.
To clean it up, we once again turn to LLMs. This time, instead of using function calling, we directly ask the LLM (specifically GPT-4) to output the data in a structured format. We construct a prompt that instructs the model to restructure the list of methods, avoiding duplicates and messy categories. The prompt provides clear guidelines, including mapping each dirty method to a cleaned-up category, and asks the model to return the mappings in CSV format.
restructure_prompt = """You are a machine learning engineer analyzing Kaggle competition solutions.
You are provided with a granular, unstructured list of machine learning methods that worked in various competitions.
Your goal is to restructure the list of methods to avoid duplicates or messy categories.
Do NOT use "Other" as a category. Always come up with a specific category. You can also decide to keep the original category as is.
Return a mapping of dirty method to cleaned up category, in csv format, for example:
XGBoost,XGBoost
xgb,XGBoost
Random Forest,Random Forest
CNN,CNNs
EfficientNet-B3,CNNs
SAINT,Transformers
DistillationLoss,DistillationLoss
BCE(smoothing=0.01),BCE
Here's the list of methods. Start your response with the mapping, don't add any extra description or text. Begin!
"""
To parse the model's output as CSV, we use the StringIO module, which wraps a string in a file-like object. After generating the completion with the GPT-4 model and retrieving the response, we extract the content of the message, which contains the structured mapping of methods to categories. We then read this CSV-formatted string into a pandas DataFrame named df with columns 'method' and 'category'. Finally, we build a dictionary called category_mapping, which we'll use to clean the data, keeping our analysis consistent and organized.
from io import StringIO
import pandas as pd

completion = openai.ChatCompletion.create(
    model='gpt-4',
    messages=[
        {"role": "system", "content": restructure_prompt},
        {"role": "user", "content": methods_string},
    ],
)
csv_file = StringIO(completion.choices[0].message.content)

# Read the CSV-formatted string into a DataFrame
df = pd.read_csv(csv_file, header=None, names=['method', 'category'])

# Map each dirty method name to its cleaned-up category.
category_mapping = dict(zip(df.method, df.category))
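With category_mapping in hand, normalizing the extracted data is a dictionary lookup. A minimal sketch, again assuming the per-writeup all_methods lists from the extraction loop:
# Replace each raw method name with its cleaned-up category;
# names missing from the mapping fall back to themselves.
cleaned_methods = [
    [category_mapping.get(m, m) for m in methods]
    for methods in all_methods
]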

Results

Once we have successfully cleaned up the data and obtained the structured mapping of methods to categories, we can load the resulting DataFrame into a table within the W&B platform for further analysis.
This table provides a convenient interface to explore and manipulate the data. We can leverage its functionalities such as grouping by categories, sorting, and applying filters to gain insights into the dataset.
For instance, we can examine the most frequently used methods in winning solutions by grouping the data, sorting it accordingly, and applying relevant filters. This allows us to uncover patterns and understand the key methods employed by successful Kaggle participants. By utilizing the capabilities of the W&B table, we can delve deeper into the data and extract valuable insights for our analysis.
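Logging the cleaned results to W&B takes only a few lines once they're in a DataFrame. A minimal sketch, assuming a results_df DataFrame with one row per extracted method (the DataFrame and project names are assumptions):
import wandb

# Log the cleaned results as an interactive W&B Table.
run = wandb.init(project="kaggle-solutions-analysis")
run.log({"winning_methods": wandb.Table(dataframe=results_df)})
run.finish()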


[W&B Table: cleaned methods grouped by category, from run dashing-night-80]



Conclusion

Leveraging large language models (LLMs) such as OpenAI's GPT-4, together with function calling, lets us extract structured data from unstructured sources like Kaggle solution writeups. By cleaning the extracted data with a second LLM pass, parsing it as CSV, and loading it into W&B Tables, we uncovered valuable insights, such as the most frequently used methods in winning solutions, deepening our understanding of what it takes to succeed in Kaggle competitions.
Interested in diving deeper into the world of LLMs? Weights & Biases is offering a free online course: Building LLM-powered Applications. This course will guide you through the entire process of designing, experimenting, and evaluating LLM-based apps.
Sign up for the course today and start your journey toward mastering the art of building applications powered by large language models.