
Attribute-Value Extraction With GPT-3 and Weights & Biases

In this article, we learn how to fine-tune OpenAI's GPT-3 for attribute-value extraction from products, looking at the challenges and how to overcome them.
Extracting product attribute values is essential in many e-commerce scenarios, such as search and retrieval, product ranking, and recommendations.
In this article, we'll explore how to fine-tune OpenAI's GPT-3 to accomplish exactly these tasks and more through attribute extraction and product classification. We'll not only explore the challenges but also look at the issues specific to applying machine learning and deep learning algorithms in this domain, and how to overcome them.

What We'll Be Covering

What Is Attribute Value Extraction?
The MAVE Dataset
Preprocessing The Data
Fine-tuning the GPT-3 Model for Attribute Value Extraction
Evaluating Our Model's Performance
Hyperparameter Tuning
Conclusion

Let's dive in.

What Is Attribute Value Extraction?

Attribute value extraction refers to the task of identifying the values of an attribute of interest from product information. A common approach to solving this problem is Named Entity Recognition (NER), which poses its own problems.
Product attributes and their corresponding values often vary over time and are usually incomplete, making many existing deep learning solutions practically infeasible in real-world scenarios.
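To make the task concrete: given a product title, we want structured attribute-value pairs. Below is a tiny, made-up illustration of the input and the desired output.

# A hypothetical illustration of attribute-value extraction from a product title.
title = "Nike Men's Air Zoom Pegasus 38 Running Shoe, Black/White, Size 10"
extracted = {"Brand": "Nike", "Color": "Black/White", "Size": "10", "Type": "Running"}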

The MAVE Dataset

In this article, we use MAVE, a large, multi-sourced, and diverse dataset for product attribute extraction released by Google Research.
The dataset comprises a curated set of 2.2 million products from Amazon pages, with 3 million attribute-value annotations across 1257 unique categories. The authors open-sourced the labels and the code to deterministically clean the original product metadata in the Amazon Review Data (2018) and join it with the labels to generate the full version of the dataset.
The following table contains the overall statistics of the dataset:
| Counts | Positives | Negatives |
| --- | --- | --- |
| # products | 2,226,509 | 1,248,009 |
| # product-attribute pairs | 2,987,151 | 1,780,428 |
| # products with 1-2 attributes | 2,102,927 | 1,140,561 |
| # products with 3-5 attributes | 121,897 | 99,896 |
| # products with >=6 attributes | 1,685 | 7,552 |
| # unique categories | 1,257 | 1,114 |
| # unique attributes | 705 | 693 |
| # unique category-attribute pairs | 2,535 | 2,305 |

For convenience, we have created a Weights & Biases Artifact with a subset of the dataset. The Artifact contains product titles with the corresponding attributes and values for the 20 most frequent categories in the dataset. We only make use of the positive samples in our subset.
The Artifact can be downloaded with the following code.
import wandb
run = wandb.init()
artifact = run.use_artifact('parambharat/mave/raw_dataset:v0', type='dataset')
artifact_dir = artifact.download()

The following table shows a preview of the subset we created.

(Table preview: the columns are category, paragraphs.*.source, paragraphs.*.text, attributes.*.evidences.*.value, and attributes.*.key.)
Each record consists of a product's metadata and a list of attributes and their corresponding values in JSON format. An example record looks as follows.

Raw Data Sample
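The expanded sample isn't reproduced here, but the sketch below shows the general shape of a record, with made-up values and only the fields used by the preprocessing code later in this article.

# Illustrative record (made-up values); pid in each evidence indexes into paragraphs.
example_record = {
    "category": "Shoes",
    "paragraphs": [
        {"source": "title", "text": "Nike Men's Air Zoom Pegasus 38 Running Shoe"},
        {"source": "description", "text": "Lightweight mesh upper with responsive cushioning."},
    ],
    "attributes": [
        {"key": "Brand", "evidences": [{"pid": 0, "value": "Nike"}]},
        {"key": "Type", "evidences": [{"pid": 0, "value": "Running"}]},
    ],
}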

Preprocessing The Data

The dataset contains multi-source representations, i.e., attribute-value pairs extracted from multiple product metadata sources such as the product title, description, and features.
For our task, we preprocess the data and retain only attributes that are present in the product title. Our pre-processed dataset contains product titles and the corresponding annotations for attribute-value pairs and the category of the product. Each example is formatted so that the product title becomes the prompt and the JSON representation of the attribute-value pairs becomes the completion, along with the required suffixes for the prompt and the completion.

(Table preview: the prompt and completion columns of the prepared examples.)
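For instance, a single prepared example might look like this (a made-up product; the "==>" prompt suffix and the completion suffix match the preprocessing code below):

# Illustrative prompt/completion pair produced by the preprocessing step.
example = {
    "prompt": "Nike Men's Air Zoom Pegasus 38 Running Shoe\n==>\n",
    "completion": ' {"Brand": "Nike", "Type": "Running", "category": "Shoes"}\n\n###\n\n',
}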

The Code to Prepare the Dataset

import json

import pandas as pd
import wandb
from sklearn.model_selection import train_test_split


# utility to preprocess a record into a prompt/completion pair.
def prepare_dataset(row):
    paragraphs = row["paragraphs"]
    attributes = row["attributes"]

    completion = {}
    pids = []
    for attribute in attributes:
        key = attribute["key"]
        for evidence in attribute["evidences"]:
            pid = evidence["pid"]
            source = paragraphs[pid].get("source", pid)
            # only keep attribute values found in the product title
            if source in ["title"]:
                current = {key: evidence["value"]}
                if current[key].lower() not in map(lambda x: x.lower(), completion.values()):
                    completion[key] = current[key]
                    pids.append(pid)
    completion["category"] = row["category"]
    completion = " " + json.dumps(completion) + "\n\n###\n\n"
    prompt = ""
    for pid in set(pids):
        source = paragraphs[pid]
        prompt += f"{source.get('text', '')}\n"
    prompt += "==>\n"
    return pd.Series({"prompt": prompt, "completion": completion})


# reuse the artifact for the subset we created earlier
wandb.init(project="mave", entity="parambharat")
artifact = wandb.use_artifact('raw_dataset:latest', type="dataset")
subset = artifact.get("raw_dataset")
subset = pd.DataFrame(subset.data, columns=subset.columns)

# split the dataset into train, validation, and test splits.
train_df, test_df = train_test_split(subset, stratify=subset.category, test_size=0.25)
val_df, test_df = train_test_split(test_df, stratify=test_df.category, test_size=0.5)

train_df = train_df.apply(prepare_dataset, axis=1)
train_df.to_json("prompts_dataset_train.jsonl", lines=True, orient="records")

val_df = val_df.apply(prepare_dataset, axis=1)
val_df.to_json("prompts_dataset_val.jsonl", lines=True, orient="records")

test_df = test_df.apply(prepare_dataset, axis=1)
test_df.to_json("prompts_dataset_test.jsonl", lines=True, orient="records")

# run the openai dataset preparation tool on each split.
!openai tools fine_tunes.prepare_data -f prompts_dataset_train.jsonl -q
!openai tools fine_tunes.prepare_data -f prompts_dataset_val.jsonl -q
!openai tools fine_tunes.prepare_data -f prompts_dataset_test.jsonl -q

We split the dataset into train, validation, and test splits and store the resulting prompt-completion pairs, to be fed to GPT-3 for fine-tuning, as an Artifact.
Finally, we run the OpenAI dataset preparation tool to format each split for the GPT-3 models.
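The exact logging code isn't shown above, but the prepared splits can be logged as an Artifact along these lines (a minimal sketch):

# Minimal sketch of logging the prepared splits as a W&B Artifact.
run = wandb.init(project="mave", entity="parambharat")
prepared = wandb.Artifact("prepared_dataset", type="dataset")
for split in ["train", "val", "test"]:
    prepared.add_file(f"prompts_dataset_{split}_prepared.jsonl")
run.log_artifact(prepared)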
The final prepared dataset is stored in the following Artifact.

prepared_dataset — parambharat/mave/prepared_dataset:v0 (alias: latest), 3 files, 680.1 KB, created October 11th, 2022.

Fine-tuning the GPT-3 Model for Attribute Value Extraction

Fine-tuning the GPT-3 model is quite simple once the dataset is prepared. We only need to run the following CLI command.
openai api fine_tunes.create -t "prompts_dataset_train_prepared.jsonl" -v "prompts_dataset_val_prepared.jsonl" -m ada --suffix "mave attribute recognition"
Here, we fine-tune the "ada" model with the default hyperparameters for the learning rate, number of epochs, and batch size.
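If you want to override these defaults, the same command accepts explicit hyperparameter flags, for example (illustrative values; flag names from the legacy OpenAI CLI used here):
openai api fine_tunes.create -t "prompts_dataset_train_prepared.jsonl" -v "prompts_dataset_val_prepared.jsonl" -m ada --n_epochs 4 --batch_size 8 --learning_rate_multiplier 0.1 --prompt_loss_weight 0.01 --suffix "mave attribute recognition"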
The command streams the progress of the fine-tune in the terminal. The screenshot below shows what this looks like.

Notice that it costs only $0.23 to fine-tune the ada model on our dataset containing 3000 product titles and attributes.
Weights & Biases provides an integration that can be run through the OpenAI CLI. This makes visualizing the results of our fine-tuning as easy as running the following command, providing the project, entity, and the ID of the fine-tune from the console log above.
openai wandb sync --id "ft-hPQvti4QpFiLP2bHvrJsSdah"
The training progress and validation metrics of the model are visualized below.


As we can see, the model's validation loss stabilizes within about 500 steps of fine-tuning. However, this is not indicative of the model's performance on our custom attribute-value extraction and classification tasks.
The next section dives into how we evaluate our model's performance on the validation set for the task.

Evaluating Our Model's Performance

To evaluate our model's performance, we need to verify the validity of the predictions against the reference JSON: the keys and values of the prediction should match those of the reference.
We do this by calculating the exact match of the key-value pairs in the two dictionaries.

The Evaluation Code Sample

import json

import evaluate
import pandas as pd
from sklearn.metrics import classification_report


def score_dict_similar(row):
    # fraction of reference key-value pairs exactly matched by the prediction
    reference = row["reference"]
    prediction = row["prediction"]
    try:
        reference = json.loads(reference)
        prediction = json.loads(prediction)
        common = len(set(reference.items()) & set(prediction.items()))
        actual = len(reference.items())
        return common / actual
    except Exception:
        return 0.0


def prompt_to_bio(row, label_key="target"):
    # convert a prompt and its JSON attribute-value pairs into BIO tags
    prompt = row["prompt"]
    target = row[label_key]
    try:
        target = json.loads(target)
        prompt = prompt.split()
        labels = ["O"] * len(prompt)
    except Exception:
        return ["O"]

    for attribute, value in target.items():
        values = value.split()
        start_ent = False
        for idx, word in enumerate(values):
            try:
                first_idx = prompt.index(word)
                if idx == 0:
                    labels[first_idx] = f"B-{attribute}"
                    start_ent = True
                elif start_ent:
                    labels[first_idx] = f"I-{attribute}"
            except ValueError:
                pass
    return labels


def to_category(row):
    # extract the reference and predicted categories for the classification task
    reference = json.loads(row["reference"])["category"]
    try:
        prediction = json.loads(row["prediction"])["category"]
    except Exception:
        return pd.Series({"reference_category": reference, "predicted_category": ""})
    return pd.Series({"reference_category": reference, "predicted_category": prediction})


metric = evaluate.load("seqeval")


def evaluate_results(results_df):
    # keep only rows where the reference and predicted BIO sequences align
    results_df = results_df[
        results_df.reference_labels.map(len) == results_df.predicted_labels.map(len)
    ]
    results_df["exact_match_score"] = results_df.apply(score_dict_similar, axis=1)
    seq_results = metric.compute(
        predictions=results_df["predicted_labels"].tolist(),
        references=results_df["reference_labels"].tolist(),
    )
    seq_results = (
        pd.DataFrame(seq_results)
        .T
        .reset_index()
        .rename({"index": "label"}, axis=1)
    )
    clf_results = classification_report(
        y_true=results_df["reference_category"],
        y_pred=results_df["predicted_category"],
        output_dict=True,
    )
    clf_results = (
        pd.DataFrame(clf_results)
        .T
        .reset_index()
        .rename({"index": "label"}, axis=1)
    )
    return results_df, seq_results, clf_results
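The evaluation code above expects a results_df with prompt, reference, and prediction columns plus the derived BIO labels and categories. Below is a minimal sketch of building it from the validation split and the fine-tuned model; the model name is a placeholder, and the legacy openai-python client is assumed.

# Sketch: generate predictions and assemble the results_df used above.
import openai
import pandas as pd

FINE_TUNED_MODEL = "ada:ft-personal-2022-10-13"  # placeholder fine-tuned model name

def predict(prompt):
    response = openai.Completion.create(
        model=FINE_TUNED_MODEL,
        prompt=prompt,
        max_tokens=128,
        temperature=0,
        stop=["\n\n###\n\n"],  # the completion suffix used during training
    )
    return response["choices"][0]["text"]

results_df = pd.DataFrame({
    "prompt": val_df["prompt"],
    # strip the training suffix so the reference parses as JSON
    "reference": val_df["completion"].str.replace("\n\n###\n\n", "", regex=False),
})
results_df["prediction"] = results_df["prompt"].map(predict)
results_df["reference_labels"] = results_df.apply(prompt_to_bio, axis=1, label_key="reference")
results_df["predicted_labels"] = results_df.apply(prompt_to_bio, axis=1, label_key="prediction")
results_df[["reference_category", "predicted_category"]] = results_df.apply(to_category, axis=1)

results_df, seq_results, clf_results = evaluate_results(results_df)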




Additionally, we measure the precision, recall, and F1-score for each attribute by treating the problem as a sequence-labeling evaluation task.


Finally, we measure performance on the classification task by computing the precision, recall, and F1-score for the predicted category.



Hyperparameter Tuning

We use wandb.sweep to find the best hyperparameters for our task, optimizing for the exact_match_score metric defined above. This can be achieved with a little instrumentation; the custom code and configuration used to perform the sweeps are shown below. We sweep over the model, the learning_rate_multiplier, and the prompt loss weight.

Sweeps Config and Code
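The exact code behind this panel isn't reproduced here, but a sweep of this kind can be wired up roughly as follows. This is an illustrative sketch: the hyperparameter values are examples, the file IDs are placeholders for IDs returned by openai.File.create, and the legacy openai-python client (<1.0) is assumed.

# Illustrative sweep setup (not the exact code used in this report).
import openai
import wandb

TRAIN_FILE_ID = "file-train-placeholder"  # placeholder file IDs from openai.File.create
VAL_FILE_ID = "file-val-placeholder"

sweep_config = {
    "method": "grid",
    "metric": {"name": "exact_match_score", "goal": "maximize"},
    "parameters": {
        "model": {"values": ["ada", "babbage", "curie"]},
        "learning_rate_multiplier": {"values": [0.02, 0.1, 0.2]},
        "prompt_loss_weight": {"values": [0.01, 0.1]},
    },
}

def train():
    run = wandb.init(project="mave", entity="parambharat")
    config = run.config
    # launch a fine-tune with the sampled hyperparameters (legacy client)
    openai.FineTune.create(
        training_file=TRAIN_FILE_ID,
        validation_file=VAL_FILE_ID,
        model=config.model,
        learning_rate_multiplier=config.learning_rate_multiplier,
        prompt_loss_weight=config.prompt_loss_weight,
        suffix="mave attribute recognition",
    )
    # ...wait for the job to finish, score the validation split with
    # evaluate_results, then log the metric the sweep optimizes:
    # run.log({"exact_match_score": score})

sweep_id = wandb.sweep(sweep_config, project="mave", entity="parambharat")
wandb.agent(sweep_id, function=train)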

Conclusion

In this article, we explored how to use OpenAI's GPT-3 for product attribute-value extraction and classification in the e-commerce domain. We demonstrated how to frame the problem as a text completion task and fine-tune GPT-3 models for it. Finally, we showed how to conduct hyperparameter tuning and task-specific evaluation using Weights & Biases Sweeps.
If there's anything you think we missed, or something you'd like to see us explore next, feel free to comment below.
