Extractive Question Answering With HuggingFace Using PyTorch and W&B
This article explores extractive question answering using HuggingFace Transformers, PyTorch, and W&B. Learn how to build a SOTA question-answering model.

Introduction
In this article, we will explore the exciting world of extractive question answering by leveraging the power of HuggingFace Transformers, PyTorch, and Weights & Biases (W&B). We will begin with a brief introduction to extractive question answering, followed by an overview of the HuggingFace Transformers library and the role of pre-trained models, such as BERT, in this context.
Next, we will discuss how to fine-tune a BERT-based model using PyTorch for a question-answering task on a specific dataset. Furthermore, we will delve into the integration of Weights & Biases with our training pipeline to enable efficient experiment tracking, model comparison, and hyperparameter optimization.
By the end of this article, you will have gained a solid understanding of extractive question answering using HuggingFace, PyTorch, and W&B, and you will be equipped with the knowledge to build your own state-of-the-art question-answering models.
Here's what we'll be covering:
Table Of Contents
- Introduction
- What Is Extractive Question Answering?
- How Does Extractive Question Answering Work?
- Extractive Question Answering Tutorial Using HuggingFace
- What Are We Trying To Achieve?
- What Is BERT?
- What Is the Benefit of Training Our Model on the COVID QA Dataset?
- Dataset Used for Fine-Tuning the Model
- Step-by-Step Tutorial
- Step 1: Importing the Required Libraries
- Step 2: Load the Dataset and Create a Pandas DataFrame
- Step 3: Prepare the Dataset in the Hugging Face Dataset Format
- Step 4: Prepare Training Features and Filter Invalid Examples
- Step 5: Split the Dataset Into Training and Evaluation Sets
- Step 6: Initialize the Model and Set Up the Training Configuration
- Step 7: Train the Model
- Step 8: Save and Load the Trained Model
- Step 9: Test the Model With Example Questions
- Output
- Weights & Biases Monitoring
- Conclusion
Let's dig in!
What Is Extractive Question Answering?
Question answering is a fascinating field in natural language processing (NLP) and artificial intelligence that aims to enable machines to understand and respond to human inquiries effectively.
Question answering can unlock a wealth of potential applications, ranging from virtual assistants and chatbots to advanced search engines and knowledge management systems. Question answering can be broadly divided into two categories: extractive and generative.
Extractive question answering involves identifying and extracting the answer to a given question directly from a given text. This approach is akin to using a highlighter to mark relevant parts of the text and then extracting the required information.
On the other hand, generative question answering entails producing a new, coherent answer based on the given context rather than merely selecting existing phrases. It is more like a creative writing exercise where the model synthesizes an original response using the information it has learned.
Both extractive and generative question answering have their own advantages and limitations. Extractive question answering excels in situations where the answer is explicitly stated in the text, while generative question answering thrives when dealing with multiple correct answers or when the answer isn't directly mentioned.
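To see the extractive behavior in action before we build anything ourselves, here is a minimal sketch using HuggingFace's question-answering pipeline with a publicly available SQuAD-fine-tuned checkpoint (the checkpoint name here is just one convenient choice, not the model we fine-tune later):

from transformers import pipeline

# Extractive QA returns a span copied verbatim from the context,
# along with its character offsets and a confidence score.
qa = pipeline("question-answering", model="distilbert-base-cased-distilled-squad")
result = qa(
    question="What does BERT stand for?",
    context="BERT stands for Bidirectional Encoder Representations from Transformers.",
)
print(result["answer"], result["start"], result["end"], result["score"])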
In this article, we will primarily focus on extractive question answering, delving into both the theoretical and practical aspects of building such a QA model.
How Does Extractive Question Answering Work?
While it may seem simple at first, the process behind question answering in machine learning can be quite involved. That said, it is not hard to grasp. Let's start by dividing the entire process into five main steps:
- Question processing: The model leverages natural language processing techniques to break down the question into smaller components, extracting and identifying the most relevant keywords to help generate the final answer. Techniques such as named entity recognition and part-of-speech tagging are employed in this step.
- Text segmentation: The model dissects the text document into smaller units, such as sentences or paragraphs. This allows for the analysis of each unit separately, identifying those that are more likely to contain the answer. Segmenting the text also reduces the computational complexity, making the process more efficient.
- Data analysis: Once the text is segmented, the model analyzes each unit using various techniques, including named entity recognition, part-of-speech tagging, and syntactic parsing. These techniques help identify essential entities, grammatical structures, and relationships between words and phrases in the text.
- Answer scoring: The model scores each segmented unit on how likely it is to contain the answer, based on the entities, grammatical structures, and relationships identified in the previous step.
- Answer extraction: Finally, the model selects the highest-scoring unit and extracts the answer from it. The answer can be a single word, a phrase, or a complete sentence, depending on the question and the information available in the text. The extracted answer is then presented to the user as the model's response to the question.
Extractive Question Answering Tutorial Using HuggingFace
To make these concepts concrete, in this part of the article we will walk through building a simple yet effective QA model, providing each step with its code and an explanation of what each snippet is trying to achieve.
What Are We Trying To Achieve?
We are building a Python machine learning Q&A model that takes a COVID-19-related question as input and returns the most appropriate extractive answer. We will be using the BERT model as the base for our own model. To better explain the structure of the code, let's start with what BERT is and why we use HuggingFace to build our model.
What Is BERT?
BERT, short for Bidirectional Encoder Representations from Transformers, is an advanced language model created by Google AI. It has revolutionized the field of natural language processing by excelling in various tasks, such as sentiment analysis, named entity recognition, and question-answering.
BERT is based on the Transformer architecture and learns from a vast amount of text using unsupervised techniques like masked language modeling and next-sentence prediction. Its unique bidirectional training approach helps BERT understand the context from both sides—left and right—resulting in a more comprehensive understanding of the text. When fine-tuned for specific tasks, BERT can achieve top-notch performance even with a small amount of labeled data.
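As a quick illustration of masked language modeling, one of BERT's pre-training objectives, here is a small sketch using the fill-mask pipeline (the example sentence is arbitrary):

from transformers import pipeline

# BERT predicts the most likely token for the [MASK] position,
# using context from both the left and the right of the mask.
fill_mask = pipeline("fill-mask", model="bert-base-uncased")
for prediction in fill_mask("The virus spreads through respiratory [MASK]."):
    print(prediction["token_str"], round(prediction["score"], 3))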
With that said, we will be utilizing the pre-trained BERT model in building our QA model. BERT is already pre-trained on massive amounts of text, and variants fine-tuned on datasets such as SQuAD can answer a reasonable range of general questions. To further improve our model's performance on domain-specific questions, we will go the extra mile and fine-tune BERT on the Covid QA dataset.
What Is the Benefit of Training Our Model on the COVID QA Dataset?
- Improved performance: BERT, when fine-tuned on domain-specific data, can better understand the nuances, terms, and concepts related to that domain. In this case, the model will be more familiar with COVID-19-related information, leading to better answers.
- Focused knowledge: The CovidQA dataset has many questions and answers about COVID-19, covering different aspects of the virus, such as how it spreads, symptoms, treatments, and prevention. Fine-tuning the model on this dataset helps it gain deeper knowledge in this particular area.
Dataset Used for Fine-Tuning the Model
The dataset of choice is the CovidQA dataset from Kaggle, a collection of questions and answers related to COVID-19. It covers various aspects of the virus, such as transmission, symptoms, treatments, and prevention measures. This dataset is used to train and test machine learning models, specifically for extractive question-answering tasks, to help them better understand and answer questions about COVID-19.
Step-by-Step Tutorial
Step 1: Importing the Required Libraries
In this part of the code, we will import the classes needed to load our pre-trained BERT model, along with some necessary libraries, such as pandas, torch, and sklearn, which will come in handy in later parts of the code.
import json
import pandas as pd
import torch
from transformers import (
    BertTokenizerFast,
    BertForQuestionAnswering,
    TrainingArguments,
    Trainer,
)
from sklearn.model_selection import train_test_split
from datasets import Dataset, DatasetDict
Step 2: Load the Dataset and Create a Pandas DataFrame
data_path = "/kaggle/input/covidqa-dataset/COVID-QA.json"
with open(data_path, "r") as f:
    data = json.load(f)

questions = []
answers = []
contexts = []
for entry in data['data']:
    for paragraph in entry['paragraphs']:
        context = paragraph['context']
        for qa in paragraph['qas']:
            questions.append(qa['question'])
            answers.append(qa['answers'][0]['text'])
            contexts.append(context)

df = pd.DataFrame({
    'question': questions,
    'answer': answers,
    'context': contexts
})
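Before moving on, a quick sanity check of the resulting DataFrame helps confirm the JSON was parsed as expected (the exact row count will depend on the dataset version you downloaded):

# Inspect the shape and the first few rows of the parsed dataset
print(df.shape)
print(df.head(3))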
Step 3: Prepare the Dataset in the Hugging Face Dataset Format
The tokenize(batch) function converts the questions and the texts containing answers into token IDs using the loaded tokenizer, limits the length to 512 tokens, and finds where the answers start and end in the text. Finally, it saves this information (start and end positions) for each answer.
dataset = Dataset.from_pandas(df)
tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")

def tokenize(batch):
    tokenized_batch = tokenizer(
        batch["question"],
        batch["context"],
        max_length=512,
        padding="max_length",
        truncation=True,
        return_offsets_mapping=True,
        return_token_type_ids=True,
    )
    answer_starts = []
    answer_ends = []
    for i, context in enumerate(batch["context"]):
        answer_start = context.find(batch["answer"][i])
        answer_end = answer_start + len(batch["answer"][i])
        answer_starts.append(answer_start)
        answer_ends.append(answer_end)
    tokenized_batch["answer_start"] = answer_starts
    tokenized_batch["answer_end"] = answer_ends
    return tokenized_batch

tokenized_dataset = dataset.map(tokenize, batched=True)
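If the offset mapping is unfamiliar, a small example makes it concrete: each token is paired with its (start, end) character positions in the original string, which is exactly what the next step uses to align character-level answer spans with token positions. The sentence below is just an illustration:

# Each tuple is a token's (start, end) character span in its source text;
# special tokens such as [CLS] and [SEP] map to (0, 0).
sample = tokenizer("Who is affected?", "COVID-19 affects the lungs.",
                   return_offsets_mapping=True)
print(sample["offset_mapping"][:10])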
Step 4: Prepare Training Features and Filter Invalid Examples
The prepare_train_features(example) function finds the positions of the answer in the tokenized text.
Inside this function, we set the default start and end positions to the beginning and end of the text, then loop through the offset mapping to find where the answer starts and ends. If the answer's start or end is not found, we set both positions to -1 (an invalid position). Finally, we save the start and end positions of the answer for each example.
The filter_invalid_examples(example) function removes examples with invalid answer positions (i.e., -1). We will apply this function to the prepared dataset to filter out invalid examples.
def prepare_train_features(example):
    start_position = example["input_ids"].index(tokenizer.cls_token_id)
    end_position = example["input_ids"].index(tokenizer.sep_token_id)
    found_start = False
    found_end = False
    for i, (offset_start, offset_end) in enumerate(example["offset_mapping"]):
        if not found_start and offset_start == example["answer_start"]:
            start_position = i
            found_start = True
        if not found_end and offset_end == example["answer_end"]:
            end_position = i
            found_end = True
        if found_start and found_end:
            break
    if not found_start or not found_end:
        start_position = -1
        end_position = -1
    example["start_positions"] = start_position
    example["end_positions"] = end_position
    return example

prepared_dataset = tokenized_dataset.map(prepare_train_features, batched=False)

def filter_invalid_examples(example):
    return example["start_positions"] != -1 and example["end_positions"] != -1

filtered_dataset = prepared_dataset.filter(filter_invalid_examples, batched=False)
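It can be informative to check how many examples survive the filter; answers whose character positions could not be matched to token offsets (for example, because they were truncated away by the 512-token limit) are the usual casualties. The counts will vary with the dataset:

# Compare dataset sizes before and after filtering out unmappable answers
print("Before filtering:", len(prepared_dataset))
print("After filtering:", len(filtered_dataset))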
Step 5: Split the Dataset Into Training and Evaluation Sets
In this code snippet, the filtered dataset is split into training and evaluation sets using a 90-10 ratio. The train_dataset contains 90% of the examples for training, while the eval_dataset has the remaining 10% for evaluation purposes. The convert_to_tensors function converts input_ids and attention_mask into PyTorch tensors, ensuring they have the correct data type (long). This function is then applied to both the train_dataset and eval_dataset. Finally, a DatasetDict is created, which holds both the train and eval datasets, making it convenient to work with during training and evaluation.
train_indices, eval_indices = train_test_split(
    list(range(len(filtered_dataset))), test_size=0.1, random_state=42
)
train_dataset = filtered_dataset.select(train_indices)
eval_dataset = filtered_dataset.select(eval_indices)

def convert_to_tensors(example):
    example["input_ids"] = torch.tensor(example["input_ids"], dtype=torch.long)
    example["attention_mask"] = torch.tensor(example["attention_mask"], dtype=torch.long)
    return example

train_dataset = train_dataset.map(convert_to_tensors)
eval_dataset = eval_dataset.map(convert_to_tensors)

dataset_dict = DatasetDict({"train": train_dataset, "eval": eval_dataset})
Step 6: Initialize the Model and Set Up the Training Configuration
In this code snippet, a pre-trained BERT model is loaded to serve as the base model. Next, various training arguments are defined using the TrainingArguments class. These arguments include the output directory for saving results, the number of training epochs, batch sizes for both training and evaluation, warmup steps, weight decay, and logging settings. Additionally, the Trainer is set to evaluate and save a checkpoint after each epoch, with a limit of two saved checkpoints. Training will also use mixed precision (fp16) for speed and load the best model at the end of the training process.
Moreover, by setting report_to="wandb", we link our training process to our Weights & Biases account. This enables us to monitor our model's performance across multiple dimensions and stages of the training process in real time, making it easier to track progress and identify potential areas for improvement.
model = BertForQuestionAnswering.from_pretrained("bert-base-uncased")

training_args = TrainingArguments(
    output_dir='./results',
    num_train_epochs=3,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=64,
    warmup_steps=500,
    weight_decay=0.01,
    logging_dir='./logs',
    logging_steps=10,
    evaluation_strategy="epoch",
    save_strategy="epoch",
    save_total_limit=2,
    fp16=True,
    load_best_model_at_end=True,
    report_to="wandb",  # Enable logging to W&B
)
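By default, the Trainer's W&B integration logs runs to a project named huggingface. If you would rather group runs under a project of your own, you can set an environment variable before training starts (the project name below is a hypothetical choice):

import os

# Direct the Trainer's W&B logging to a custom project
# ("covid-qa-bert" is an illustrative name; use your own).
os.environ["WANDB_PROJECT"] = "covid-qa-bert"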
Step 7: Train the Model
In this part of the code, a Trainer object is created by passing the previously initialized BERT model, training arguments, and the train and eval datasets from the dataset_dict.
Finally, the trainer.train() function is called to start the training process.
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=dataset_dict["train"],
    eval_dataset=dataset_dict["eval"],
)
trainer.train()
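Once training finishes, you can optionally run a final evaluation pass over the held-out split; the Trainer returns a dictionary of metrics, such as the evaluation loss:

# Evaluate the best loaded checkpoint on the eval split
metrics = trainer.evaluate()
print(metrics)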
Step 8: Save and Load the Trained Model
We will save our trained model for future use.
model.save_pretrained("trained_model")
model = BertForQuestionAnswering.from_pretrained("trained_model")
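If you plan to reload the model in a fresh session, it is also worth saving the tokenizer alongside the weights so the two stay in sync (this assumes the same trained_model directory):

# Persist the tokenizer next to the model weights
tokenizer.save_pretrained("trained_model")
tokenizer = BertTokenizerFast.from_pretrained("trained_model")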
Step 9: Test the Model With Example Questions
We will create a get_answer(question, context) function, which takes a single question and a context passage as input and returns the most appropriate extractive answer to the question using the given context.
def get_answer(question, context):
    inputs = tokenizer.encode_plus(question, context, return_tensors="pt")
    start_logits, end_logits = model(**inputs).values()
    # Keep both the argmax index and its logit value for each boundary
    start_index_and_logits = torch.argmax(start_logits, dim=1).item(), start_logits[0].max().item()
    end_index_and_logits = torch.argmax(end_logits, dim=1).item(), end_logits[0].max().item()
    if end_index_and_logits[0] >= start_index_and_logits[0]:
        # Normal case: the predicted end token comes after the start token
        start_index, end_index = start_index_and_logits[0], end_index_and_logits[0]
    else:
        # Degenerate case: fall back to the single token with the stronger logit
        if start_index_and_logits[1] > end_index_and_logits[1]:
            start_index, end_index = start_index_and_logits[0], start_index_and_logits[0]
        else:
            start_index, end_index = end_index_and_logits[0], end_index_and_logits[0]
    answer = tokenizer.convert_tokens_to_string(
        tokenizer.convert_ids_to_tokens(inputs["input_ids"][0][start_index:end_index + 1])
    )
    return answer
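One small addition worth making before inference, not shown above, is switching the model to evaluation mode and disabling gradient tracking, which makes predictions deterministic and saves memory (the question and context here are illustrative):

model.eval()  # disable dropout for deterministic inference
with torch.no_grad():
    print(get_answer("What causes COVID-19?",
                     "COVID-19 is a disease caused by the SARS-CoV-2 virus."))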
Output
We tested our final model on six examples, each time passing a question along with its context and printing the extracted answer.
Example 1
question1 = "What is the incubation period of the novel coronavirus?"
context1 = "The novel coronavirus, also known as COVID-19, has an incubation period ranging from 1 to 14 days, with the majority of cases showing symptoms around 5 days after exposure."
answer1 = get_answer(question1, context1)
print("Answer 1:", answer1)
Output:
Answer 1: what is the incubation period of the novel coronavirus? [SEP] The novel coronavirus, also known as covid - 19, has an incubation period ranging from 1 to 14 days, with the majority of cases showing symptoms around 5 days after exposure. [SEP]
Example 2
question2 = "What are the common symptoms of COVID-19?"
context2 = "COVID-19 symptoms can vary widely and may include fever, cough, shortness of breath, fatigue, body aches, and loss of taste or smell. Some people may also experience gastrointestinal symptoms like nausea, vomiting, and diarrhea."
answer2 = get_answer(question2, context2)
print("Answer 2:", answer2)
Output:
Answer 2: ##estinal symptoms like nausea, vomiting, and diarrhea. [SEP]
Example 3
question3 = "How is the COVID-19 virus primarily transmitted?"
context3 = "COVID-19 is primarily transmitted through respiratory droplets produced when an infected person coughs, sneezes, or talks. These droplets can land in the mouths or noses of people who are nearby or be inhaled into their lungs. The virus can also spread through close personal contact, such as touching or shaking hands, or by touching a surface or object that has the virus on it and then touching their own mouth, nose, or eyes."
answer3 = get_answer(question3, context3)
print("Answer 3:", answer3)
Output:
Answer 3: transmitted? [SEP] covid - 19 is primarily transmitted through respiratory droplets produced when an infected person coughs, sneezes, or talks. these droplets can land in the mouths or noses of people who are nearby or be inhaled into their lungs. the virus can also spread through close personal contact, such as touching or shaking hands, or by touching a surface or object that has the virus on it and then touching their own mouth, nose, or eyes. [SEP]
Example 4
question4 = "Is a vaccine for COVID-19 available?"
context4 = "As of September 2021, several COVID-19 vaccines have been developed and are being administered worldwide. These include the Pfizer-BioNTech, Moderna, AstraZeneca, and Johnson & Johnson vaccines. The availability and distribution of these vaccines vary depending on the country and region."
answer4 = get_answer(question4, context4)
print("Answer 4:", answer4)
Output:
Answer 4: 2021, several covid - 19 vaccines have been developed and are being administered worldwide. these include the pfizer - biontech, moderna, astrazeneca, and johnson & johnson vaccines. the availability and distribution of these vaccines vary depending on the country and region. [SEP]
Example 5
question5 = "What is the role of ACE2 in COVID-19 infection?"
context5 = "ACE2, or angiotensin-converting enzyme 2, is a protein found on the surface of various human cells, including those in the lungs, heart, kidneys, and intestines. The novel coronavirus, SARS-CoV-2, responsible for COVID-19, uses ACE2 as a receptor to enter and infect human cells. The spike protein on the virus's surface binds to ACE2, allowing the virus to enter the cell and replicate."
answer5 = get_answer(question5, context5)
print("Answer 5:", answer5)
Output:
Answer 5: infection? [SEP] ace2, or angiotensin - converting enzyme 2, is a protein found on the surface of various human cells, including those in the lungs, heart, kidneys, and intestines. the novel coronavirus, sars - cov - 2, responsible for covid - 19, uses ace2 as a receptor to enter and infect human cells. the spike protein on the virus ' s surface binds to ace2, allowing the virus to enter the cell and replicate. [SEP]
Example 6
question6 = "Can COVID-19 be transmitted through food?"
context6 = "According to the World Health Organization (WHO) and the U.S. Food and Drug Administration (FDA), there is currently no evidence to suggest that COVID-19 can be transmitted through food or food packaging. The primary mode of transmission is through respiratory droplets from person to person. However, it is still essential to practice good food hygiene and wash your hands before preparing or consuming food."
answer6 = get_answer(question6, context6)
print("Answer 6:", answer6)
Output:
Answer 6: food or food packaging. the primary mode of transmission is through respiratory droplets from person to person. however, it is still essential to practice good food hygiene and wash your hands before preparing or consuming food. [SEP]
Weights & Biases Monitoring
Note that when running the code, you will be asked to enter your Weights & Biases API key. To get the key, simply follow the authorization link provided.
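If you prefer to authenticate ahead of time rather than when logging first starts, you can log in explicitly before training (optional; the Trainer will prompt you otherwise):

import wandb

# Prompts for (or reuses) your W&B API key
wandb.login()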


After completing the training process, click the View Project At link, which will take you to the model's training statistics, including the learning_rate, train_runtime, and much more.

Conclusion
Extractive question answering is an exciting and vital area of natural language processing that paves the way for a wide array of applications. By harnessing the power of HuggingFace Transformers, PyTorch, and Weights & Biases, we can build state-of-the-art models that efficiently and effectively process questions and extract answers from textual data. As technology continues to advance, the potential of question-answering systems is bound to grow, further enhancing our ability to obtain accurate and relevant information in a human-like manner.