
Attribute-Value Extraction With GPT-3 and Weights & Biases

In this article, we learn how to fine-tune OpenAI's GPT-3 for attribute-value extraction from products, looking at the challenges and how to overcome them.
Extracting product attribute values is essential in many e-commerce scenarios, such as search and retrieval, product ranking, and recommendations.
In this article, we'll explore how to fine-tune OpenAI's GPT-3 to accomplish exactly these tasks and more through attribute extraction and product classification. We'll not only explore the challenges but also look at the issues specific to applying machine learning and deep learning algorithms in this domain, and how to overcome them.

What We'll Be Covering

What Is Attribute Value Extraction?
The MAVE Dataset
Preprocessing The Data
Fine-tuning the GPT-3 Model for Attribute Value Extraction
Evaluating Our Model's Performance
Hyperparameter Tuning
Conclusion

Let's dive in.

What Is Attribute Value Extraction?

Attribute value extraction refers to the task of identifying the values of an attribute of interest from product information. A common approach to solving this problem is Named Entity Recognition (NER), which poses its own problems.
Product attributes and their corresponding values often vary over time and are usually incomplete, making many existing deep learning solutions practically infeasible in real-world scenarios.
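To make the task concrete: given a product title, we want structured attribute-value pairs. Below is a tiny, made-up illustration of the input and the desired output.

# A hypothetical illustration of attribute-value extraction from a product title.
title = "Nike Men's Air Zoom Pegasus 38 Running Shoe, Black/White, Size 10"
extracted = {"Brand": "Nike", "Color": "Black/White", "Size": "10", "Type": "Running"}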

The MAVE Dataset

In this article, we use MAVE, a large, multi-sourced, and diverse dataset for product attribute extraction released by Google Research.
The dataset comprises a curated set of 2.2 million products from Amazon pages, with 3 million attribute-value annotations across 1257 unique categories. The authors open-sourced the labels and the code to deterministically clean the original product metadata in the Amazon Review Data (2018) and join it with the labels to generate the full version of the dataset.
The following table contains the overall statistics of the dataset:
| Counts | Positives | Negatives |
| --- | --- | --- |
| # products | 2,226,509 | 1,248,009 |
| # product-attribute pairs | 2,987,151 | 1,780,428 |
| # products with 1-2 attributes | 2,102,927 | 1,140,561 |
| # products with 3-5 attributes | 121,897 | 99,896 |
| # products with >=6 attributes | 1,685 | 7,552 |
| # unique categories | 1,257 | 1,114 |
| # unique attributes | 705 | 693 |
| # unique category-attribute pairs | 2,535 | 2,305 |

For convenience, we have created a Weights & Biases Artifact with a subset of the dataset. The Artifact contains product titles with the corresponding attributes and values for the 20 most frequent categories in the dataset. We only make use of the positive samples in our subset.
The Artifact can be downloaded with the following code.
import wandb
run = wandb.init()
artifact = run.use_artifact('parambharat/mave/raw_dataset:v0', type='dataset')
artifact_dir = artifact.download()

The following table shows a preview of the subset we created.

(Table preview: the columns are category, paragraphs.*.source, paragraphs.*.text, attributes.*.evidences.*.value, and attributes.*.key.)
Each record consists of a product's metadata and a list of attributes and their corresponding values in JSON format. An example record looks as follows.

Raw Data Sample
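The expanded sample isn't reproduced here, but the sketch below shows the general shape of a record, with made-up values and only the fields used by the preprocessing code later in this article.

# Illustrative record (made-up values); pid in each evidence indexes into paragraphs.
example_record = {
    "category": "Shoes",
    "paragraphs": [
        {"source": "title", "text": "Nike Men's Air Zoom Pegasus 38 Running Shoe"},
        {"source": "description", "text": "Lightweight mesh upper with responsive cushioning."},
    ],
    "attributes": [
        {"key": "Brand", "evidences": [{"pid": 0, "value": "Nike"}]},
        {"key": "Type", "evidences": [{"pid": 0, "value": "Running"}]},
    ],
}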

Preprocessing The Data

The dataset contains multi-source representations, i.e., attribute-value pairs extracted from multiple product metadata sources such as the product title, description, and features.
For our task, we preprocess the data and retain only attributes that are present in the product title. Our pre-processed dataset contains product titles and the corresponding annotations for attribute-value pairs and the category of the product. Each example is formatted so that the product title becomes the prompt and the JSON representation of the attribute-value pairs becomes the completion, along with the required suffixes for the prompt and the completion.

(Table preview: the prompt and completion columns of the prepared examples.)
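For instance, a single prepared example might look like this (a made-up product; the "==>" prompt suffix and the completion suffix match the preprocessing code below):

# Illustrative prompt/completion pair produced by the preprocessing step.
example = {
    "prompt": "Nike Men's Air Zoom Pegasus 38 Running Shoe\n==>\n",
    "completion": ' {"Brand": "Nike", "Type": "Running", "category": "Shoes"}\n\n###\n\n',
}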

The Code to Prepare the Dataset

import json

import pandas as pd
import wandb
from sklearn.model_selection import train_test_split


# utility to preprocess a record into a prompt/completion pair.
def prepare_dataset(row):
    paragraphs = row["paragraphs"]
    attributes = row["attributes"]

    completion = {}
    pids = []
    for attribute in attributes:
        key = attribute["key"]
        for evidence in attribute["evidences"]:
            pid = evidence["pid"]
            source = paragraphs[pid].get("source", pid)
            # only keep attribute values found in the product title
            if source in ["title"]:
                current = {key: evidence["value"]}
                if current[key].lower() not in map(lambda x: x.lower(), completion.values()):
                    completion[key] = current[key]
                    pids.append(pid)
    completion["category"] = row["category"]
    completion = " " + json.dumps(completion) + "\n\n###\n\n"
    prompt = ""
    for pid in set(pids):
        source = paragraphs[pid]
        prompt += f"{source.get('text', '')}\n"
    prompt += "==>\n"
    return pd.Series({"prompt": prompt, "completion": completion})


# reuse the artifact for the subset we created earlier
wandb.init(project="mave", entity="parambharat")
artifact = wandb.use_artifact('raw_dataset:latest', type="dataset")
subset = artifact.get("raw_dataset")
subset = pd.DataFrame(subset.data, columns=subset.columns)

# split the dataset into train, validation, and test splits.
train_df, test_df = train_test_split(subset, stratify=subset.category, test_size=0.25)
val_df, test_df = train_test_split(test_df, stratify=test_df.category, test_size=0.5)

train_df = train_df.apply(prepare_dataset, axis=1)
train_df.to_json("prompts_dataset_train.jsonl", lines=True, orient="records")

val_df = val_df.apply(prepare_dataset, axis=1)
val_df.to_json("prompts_dataset_val.jsonl", lines=True, orient="records")

test_df = test_df.apply(prepare_dataset, axis=1)
test_df.to_json("prompts_dataset_test.jsonl", lines=True, orient="records")

# run the openai dataset preparation tool on each split.
!openai tools fine_tunes.prepare_data -f prompts_dataset_train.jsonl -q
!openai tools fine_tunes.prepare_data -f prompts_dataset_val.jsonl -q
!openai tools fine_tunes.prepare_data -f prompts_dataset_test.jsonl -q

We split the dataset into train, validation, and test splits and store the resulting prompt-completion pairs, to be fed to GPT-3 for fine-tuning, as an Artifact.
Finally, we run the OpenAI dataset preparation tool to format each split for the GPT-3 models.
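The exact logging code isn't shown above, but the prepared splits can be logged as an Artifact along these lines (a minimal sketch):

# Minimal sketch of logging the prepared splits as a W&B Artifact.
run = wandb.init(project="mave", entity="parambharat")
prepared = wandb.Artifact("prepared_dataset", type="dataset")
for split in ["train", "val", "test"]:
    prepared.add_file(f"prompts_dataset_{split}_prepared.jsonl")
run.log_artifact(prepared)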
The final prepared dataset is stored in the following Artifact.

prepared_dataset — parambharat/mave/prepared_dataset:v0 (alias: latest), 3 files, 680.1 KB, created October 11th, 2022.

Fine-tuning the GPT-3 Model for Attribute Value Extraction

Fine-tuning the GPT-3 model is quite simple once the dataset is prepared. We only need to run the following CLI command.
openai api fine_tunes.create -t "prompts_dataset_train_prepared.jsonl" -v "prompts_dataset_val_prepared.jsonl" -m ada --suffix "mave attribute recognition"
Here, we fine-tune the "ada" model with the default hyperparameters for the learning rate, number of epochs, and batch size.
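If you want to override these defaults, the same command accepts explicit hyperparameter flags, for example (illustrative values; flag names from the legacy OpenAI CLI used here):
openai api fine_tunes.create -t "prompts_dataset_train_prepared.jsonl" -v "prompts_dataset_val_prepared.jsonl" -m ada --n_epochs 4 --batch_size 8 --learning_rate_multiplier 0.1 --prompt_loss_weight 0.01 --suffix "mave attribute recognition"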
The command streams the progress of the fine-tune in the terminal. The screenshot below shows what this looks like.

Notice that it costs only $0.23 to fine-tune the ada model on our dataset containing 3000 product titles and attributes.
Weights & Biases provides an integration that can be run through the OpenAI CLI. This makes visualizing the results of our fine-tuning as easy as running the following command, providing the project, entity, and the ID of the fine-tune from the console log above.
openai wandb sync --id "ft-hPQvti4QpFiLP2bHvrJsSdah"
The training progress and validation metrics of the model are visualized below.


As we can see, the model's validation loss stabilizes within about 500 steps of fine-tuning. However, this is not indicative of the model's performance on our custom attribute-value extraction and classification tasks.
The next section dives into how we evaluate our model's performance on the validation set for the task.

Evaluating Our Model's Performance

To evaluate our model's performance, we need to verify the validity of the predictions against the reference JSON: the keys and values of the prediction should match those of the reference.
We do this by calculating the exact match of the key-value pairs in the two dictionaries.

The Evaluation Code Sample

import json

import evaluate
import pandas as pd
from sklearn.metrics import classification_report


def score_dict_similar(row):
    # fraction of reference key-value pairs exactly matched by the prediction
    reference = row["reference"]
    prediction = row["prediction"]
    try:
        reference = json.loads(reference)
        prediction = json.loads(prediction)
        common = len(set(reference.items()) & set(prediction.items()))
        actual = len(reference.items())
        return common / actual
    except Exception:
        return 0.0


def prompt_to_bio(row, label_key="target"):
    # convert a prompt and its JSON attribute-value pairs into BIO tags
    prompt = row["prompt"]
    target = row[label_key]
    try:
        target = json.loads(target)
        prompt = prompt.split()
        labels = ["O"] * len(prompt)
    except Exception:
        return ["O"]

    for attribute, value in target.items():
        values = value.split()
        start_ent = False
        for idx, word in enumerate(values):
            try:
                first_idx = prompt.index(word)
                if idx == 0:
                    labels[first_idx] = f"B-{attribute}"
                    start_ent = True
                elif start_ent:
                    labels[first_idx] = f"I-{attribute}"
            except ValueError:
                pass
    return labels


def to_category(row):
    # extract the reference and predicted categories for the classification task
    reference = json.loads(row["reference"])["category"]
    try:
        prediction = json.loads(row["prediction"])["category"]
    except Exception:
        return pd.Series({"reference_category": reference, "predicted_category": ""})
    return pd.Series({"reference_category": reference, "predicted_category": prediction})


metric = evaluate.load("seqeval")


def evaluate_results(results_df):
    # keep only rows where the reference and predicted BIO sequences align
    results_df = results_df[
        results_df.reference_labels.map(len) == results_df.predicted_labels.map(len)
    ]
    results_df["exact_match_score"] = results_df.apply(score_dict_similar, axis=1)
    seq_results = metric.compute(
        predictions=results_df["predicted_labels"].tolist(),
        references=results_df["reference_labels"].tolist(),
    )
    seq_results = (
        pd.DataFrame(seq_results)
        .T
        .reset_index()
        .rename({"index": "label"}, axis=1)
    )
    clf_results = classification_report(
        y_true=results_df["reference_category"],
        y_pred=results_df["predicted_category"],
        output_dict=True,
    )
    clf_results = (
        pd.DataFrame(clf_results)
        .T
        .reset_index()
        .rename({"index": "label"}, axis=1)
    )
    return results_df, seq_results, clf_results
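The evaluation code above expects a results_df with prompt, reference, and prediction columns plus the derived BIO labels and categories. Below is a minimal sketch of building it from the validation split and the fine-tuned model; the model name is a placeholder, and the legacy openai-python client is assumed.

# Sketch: generate predictions and assemble the results_df used above.
import openai
import pandas as pd

FINE_TUNED_MODEL = "ada:ft-personal-2022-10-13"  # placeholder fine-tuned model name

def predict(prompt):
    response = openai.Completion.create(
        model=FINE_TUNED_MODEL,
        prompt=prompt,
        max_tokens=128,
        temperature=0,
        stop=["\n\n###\n\n"],  # the completion suffix used during training
    )
    return response["choices"][0]["text"]

results_df = pd.DataFrame({
    "prompt": val_df["prompt"],
    # strip the training suffix so the reference parses as JSON
    "reference": val_df["completion"].str.replace("\n\n###\n\n", "", regex=False),
})
results_df["prediction"] = results_df["prompt"].map(predict)
results_df["reference_labels"] = results_df.apply(prompt_to_bio, axis=1, label_key="reference")
results_df["predicted_labels"] = results_df.apply(prompt_to_bio, axis=1, label_key="prediction")
results_df[["reference_category", "predicted_category"]] = results_df.apply(to_category, axis=1)

results_df, seq_results, clf_results = evaluate_results(results_df)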




Additionally, we measure the precision, recall, and F1-score for each attribute by treating the problem as a sequence-labeling evaluation task.


Finally, we measure performance on the classification task by computing the precision, recall, and F1-score for the predicted category.



Hyperparameter Tuning

We use wandb.sweep to find the best hyperparameters for our task, optimizing for the exact_match_score metric defined above. This can be achieved with a little instrumentation; the custom code and configuration used to perform the sweeps are shown below. We sweep over the model, the learning_rate_multiplier, and the prompt loss weight.

Sweeps Config and Code
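The exact code behind this panel isn't reproduced here, but a sweep of this kind can be wired up roughly as follows. This is an illustrative sketch: the hyperparameter values are examples, the file IDs are placeholders for IDs returned by openai.File.create, and the legacy openai-python client (<1.0) is assumed.

# Illustrative sweep setup (not the exact code used in this report).
import openai
import wandb

TRAIN_FILE_ID = "file-train-placeholder"  # placeholder file IDs from openai.File.create
VAL_FILE_ID = "file-val-placeholder"

sweep_config = {
    "method": "grid",
    "metric": {"name": "exact_match_score", "goal": "maximize"},
    "parameters": {
        "model": {"values": ["ada", "babbage", "curie"]},
        "learning_rate_multiplier": {"values": [0.02, 0.1, 0.2]},
        "prompt_loss_weight": {"values": [0.01, 0.1]},
    },
}

def train():
    run = wandb.init(project="mave", entity="parambharat")
    config = run.config
    # launch a fine-tune with the sampled hyperparameters (legacy client)
    openai.FineTune.create(
        training_file=TRAIN_FILE_ID,
        validation_file=VAL_FILE_ID,
        model=config.model,
        learning_rate_multiplier=config.learning_rate_multiplier,
        prompt_loss_weight=config.prompt_loss_weight,
        suffix="mave attribute recognition",
    )
    # ...wait for the job to finish, score the validation split with
    # evaluate_results, then log the metric the sweep optimizes:
    # run.log({"exact_match_score": score})

sweep_id = wandb.sweep(sweep_config, project="mave", entity="parambharat")
wandb.agent(sweep_id, function=train)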

Conclusion

In this article, we explored how to use OpenAI's GPT-3 for product attribute-value extraction and classification in the e-commerce domain. We demonstrated how to frame the problem as a text completion task and fine-tune GPT-3 models for it. Finally, we showed how to conduct hyperparameter tuning and task-specific evaluation using Weights & Biases Sweeps.
If there's anything you think we missed, or something you'd like to see us explore next, feel free to comment below.
