Named Entity Recognition With HuggingFace Using PyTorch and W&B
This article explores Named Entity Recognition (NER) using HuggingFace, PyTorch, and W&B. It covers the process of training a model on the CoNLL2003 dataset and performing NER on example sentences.
Introduction
In this article, we will dive into the world of NER and explore how to build a powerful NER model using state-of-the-art techniques. We will focus on utilizing the HuggingFace library, which provides a wide range of pre-trained models and tools for NLP tasks. Specifically, we will leverage PyTorch, a popular deep-learning framework, to fine-tune a pre-trained BERT model on the CoNLL2003 dataset.
Throughout this tutorial, we will cover various important steps, including data preparation, tokenization, model initialization, training, and evaluation. We will also demonstrate how to perform NER on example sentences using the trained model, allowing us to extract named entities accurately.
By the end of this article, you will have a solid understanding of Named Entity Recognition and the practical implementation using HuggingFace, PyTorch, and W&B. Let’s get started!
Table of Contents
Introduction
Table of Contents
What Is Named Entity Recognition
How Does Named Entity Recognition Work?
Named Entity Recognition Tutorial Using HuggingFace
What Are We Trying To Achieve
A Bit More Info About Our Dataset
Step-by-Step Tutorial
Step 1: Install Packages and Import Dependencies
Step 2: Check CUDA Availability and Device Information
Step 3: Read and Prepare Data
Step 4: Initialize Tokenizer and Model
Step 5: Define Metrics and Tokenization Function
Step 6: Tokenize Datasets and Set Training Arguments
Step 7: Define Data Collator and Initialize Trainer
Step 8: Train the Model
Step 9: Perform Named Entity Recognition on Examples
Example 1
Example 2
Example 3
Example 4
Example 5
Weights & Biases Monitoring
Conclusion
What Is Named Entity Recognition
Named entity recognition (NER) is a field of natural language processing (NLP) that involves the identification and extraction of a variety of named entities from text. These entities are specific objects, people, places, organizations, and other entities that are referred to by proper nouns.
Essentially, NER aims to identify the specific entity "Michael Jordan," not the generic category "basketball player" or "athlete."
When you read a news article, you might come across names of people, places, or organizations that are mentioned. NER algorithms can automatically detect these named entities and categorize them into their respective groups, making it easier to extract key information from large volumes of text.
Why is this important? NER has many real-world applications. For example, in the field of information extraction, NER can be used to extract important information from large amounts of unstructured text, such as identifying the names of people, organizations, and locations mentioned in news articles or social media posts. NER can also support better summarization, classification, search, recommendations, business intelligence, question answering, sentiment analysis, and machine translation, among other areas.
How Does Named Entity Recognition Work?
Broadly speaking, you can break NER into the following steps:
Tokenization: In the initial step of named entity recognition, the input text undergoes tokenization. Tokenization involves breaking down the text into individual units, which can be words, subwords, or even characters. This process enables the NER model to process and understand the text at a granular level.
Pretrained Model: NER models often rely on powerful deep learning architectures like BERT (Bidirectional Encoder Representations from Transformers). These models are pre-trained on extensive datasets to grasp the contextualized meanings and relationships between words. By leveraging this pre-training, the models gain a deep understanding of language and its nuances.
Token Classification: During the training phase, the pre-trained model is fine-tuned using labeled data containing text sequences and their corresponding entity labels. The model learns to classify each token in the input sequence into specific entity categories, or to assign a special "outside" label to tokens that are not part of any entity (a short tagging illustration follows this list). This training enables the model to recognize and categorize entities accurately.
Inference: Once the NER model is trained, it can make predictions on new, unseen text. To make predictions, the input text is tokenized and the model processes the tokenized sequence. By analyzing the contextual information within the tokens, the model generates predictions for each token, indicating its predicted entity category.
Post-processing: The predicted labels may include additional information, such as confidence scores or probability distributions. To obtain the final named entities, post-processing steps are applied to the predicted labels. These steps involve refining the predictions by filtering out unwanted labels, resolving overlapping entities, addressing ambiguities, and applying language-specific rules. This ensures the output reflects accurate and meaningful named entities.
Evaluation: Evaluating the performance of NER models involves comparing the predicted named entities with the ground truth labels. Metrics like precision, recall, and F1 score are commonly used to measure the accuracy and completeness of the NER system in identifying the correct entities. These evaluations help assess the model's effectiveness and guide improvements in its performance.
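To make the token-classification step concrete, most NER corpora encode entity labels with a BIO/IOB-style scheme: B- marks the beginning of an entity span, I- marks its continuation, and O marks tokens outside any entity. The sketch below is illustrative only; the exact tag inventory and variant differ between datasets (the raw CoNLL2003 files used later, for instance, follow a slightly different IOB convention).
# Illustrative only: a pre-tokenized sentence with BIO-style NER tags.
# PER, ORG, and LOC follow CoNLL2003 naming; other datasets use other names.
tokens   = ["John", "Smith", "works", "at", "Google", "in", "Paris", "."]
ner_tags = ["B-PER", "I-PER", "O", "O", "B-ORG", "O", "B-LOC", "O"]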
Named Entity Recognition Tutorial Using HuggingFace
What Are We Trying To Achieve
In this tutorial, our goal is to build a Python machine-learning model capable of performing named entity recognition (NER). Specifically, we aim to develop a model that can accurately identify and extract named entities from text.
To accomplish this, we'll leverage the power of BERT (Bidirectional Encoder Representations from Transformers), one of the most influential transformer language models, developed by Google AI. BERT has been pretrained on a massive amount of text data and has demonstrated impressive performance in various NLP tasks, including NER.
By fine-tuning the pretrained BERT model on the CoNLL2003 dataset, which contains labeled examples of named entities, we can train our model to recognize and classify different types of named entities, such as persons, locations, organizations, and more.
A Bit More Info About Our Dataset
The CoNLL2003 dataset consists of English news stories from the Reuters Corpus, annotated with named entity labels for four entity types: persons, locations, organizations, and miscellaneous names. It is commonly used as a benchmark dataset for named entity recognition (NER) tasks in natural language processing.
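In this tutorial we read the raw eng.train, eng.testa, and eng.testb files directly, but if you don't have those files at hand, a copy of the corpus is also hosted on the Hugging Face Hub. A minimal sketch (the exact Hub dataset name and format can change over time):
from datasets import load_dataset

# Alternative to reading the raw CoNLL files: load the Hub copy of the corpus.
conll = load_dataset("conll2003")
print(conll["train"][0]["tokens"])    # list of words in the first sentence
print(conll["train"][0]["ner_tags"])  # integer-encoded NER labels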
Step-by-Step Tutorial
Step 1: Install Packages and Import Dependencies
First, we install the necessary packages (seqeval, transformers, and datasets) using the pip package manager. We then import the dependencies we need, such as torch, numpy, and the tokenizer and model classes from transformers.
!pip install seqeval
!pip install transformers
!pip install datasets

import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification, Trainer, TrainingArguments
from datasets import load_dataset, load_metric, Dataset, DatasetDict
import numpy as np
from seqeval.metrics import f1_score, precision_score, recall_score, classification_report
Step 2: Check CUDA Availability and Device Information
Next, let's check if CUDA is available for GPU acceleration.
print("CUDA available:", torch.cuda.is_available())print("Current device index:", torch.cuda.current_device())print("Device name:", torch.cuda.get_device_name(torch.cuda.current_device()))
Step 3: Read and Prepare Data
This step defines two helper functions. read_conll_file reads a raw CoNLL file from the provided path and returns a list of tokenized sentences, where each token is paired with its annotation columns. convert_to_dataset then maps each token's named entity tag to an integer label (via label_map) and converts the result into a Hugging Face Dataset. Finally, the train, validation, and test splits are combined into a DatasetDict.
def read_conll_file(file_path):
    with open(file_path, "r") as f:
        content = f.read().strip()
    sentences = content.split("\n\n")
    data = []
    for sentence in sentences:
        tokens = sentence.split("\n")
        token_data = []
        for token in tokens:
            token_data.append(token.split())
        data.append(token_data)
    return data

train_data = read_conll_file("/kaggle/input/conll2003-dataset/conll2003/eng.train")
validation_data = read_conll_file("/kaggle/input/conll2003-dataset/conll2003/eng.testa")
test_data = read_conll_file("/kaggle/input/conll2003-dataset/conll2003/eng.testb")

def convert_to_dataset(data, label_map):
    formatted_data = {"tokens": [], "ner_tags": []}
    for sentence in data:
        tokens = [token_data[0] for token_data in sentence]
        ner_tags = [label_map[token_data[3]] for token_data in sentence]
        formatted_data["tokens"].append(tokens)
        formatted_data["ner_tags"].append(ner_tags)
    return Dataset.from_dict(formatted_data)

label_list = sorted(list(set([token_data[3] for sentence in train_data for token_data in sentence])))
label_map = {label: i for i, label in enumerate(label_list)}

train_dataset = convert_to_dataset(train_data, label_map)
validation_dataset = convert_to_dataset(validation_data, label_map)
test_dataset = convert_to_dataset(test_data, label_map)

datasets = DatasetDict({
    "train": train_dataset,
    "validation": validation_dataset,
    "test": test_dataset,
})
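For reference, each non-blank line in the raw CoNLL files holds four whitespace-separated columns (the token, its part-of-speech tag, its syntactic chunk tag, and its NER tag), and blank lines separate sentences. That is why the code splits on "\n\n" and reads the NER label from token_data[3]. An illustrative excerpt:

EU NNP I-NP I-ORG
rejects VBZ I-VP O
German JJ I-NP I-MISC
call NN I-NP O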
Step 4: Initialize Tokenizer and Model
In this step, we load the pre-trained bert-base-cased checkpoint. AutoTokenizer.from_pretrained returns the matching WordPiece tokenizer, and AutoModelForTokenClassification.from_pretrained adds a token-classification head on top of BERT, sized to the number of labels in label_list.
model_name = "bert-base-cased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(model_name, num_labels=len(label_list))
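Optionally, you can also attach the label mappings to the model config so that saved checkpoints and downstream pipelines report human-readable tags (e.g., I-PER) instead of generic LABEL_i names. A minimal sketch, assuming the label_map built in Step 3:
id2label = {i: label for label, i in label_map.items()}
label2id = dict(label_map)

model = AutoModelForTokenClassification.from_pretrained(
    model_name,
    num_labels=len(label_list),
    id2label=id2label,   # lets the model report tag names instead of LABEL_i
    label2id=label2id,
)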
Step 5: Define Metrics and Tokenization Function
Here, you define the evaluation metrics and the tokenization function. The metrics include precision, recall, F1 score, and classification report, which are imported from seqeval.metrics. The compute_metrics() function calculates these metrics by comparing the predicted labels with the true labels. The tokenize_and_align_labels() function is responsible for tokenizing the input examples and aligning the corresponding labels.
def compute_metrics(eval_prediction):
    predictions, labels = eval_prediction
    predictions = np.argmax(predictions, axis=2)

    # Remove ignored index (special tokens)
    true_predictions = [
        [label_list[p] for (p, l) in zip(prediction, label) if l != -100]
        for prediction, label in zip(predictions, labels)
    ]
    true_labels = [
        [label_list[l] for (p, l) in zip(prediction, label) if l != -100]
        for prediction, label in zip(predictions, labels)
    ]

    return {
        "precision": precision_score(true_labels, true_predictions),
        "recall": recall_score(true_labels, true_predictions),
        "f1": f1_score(true_labels, true_predictions),
        "classification_report": classification_report(true_labels, true_predictions),
    }

def tokenize_and_align_labels(examples):
    tokenized_inputs = tokenizer(examples["tokens"], truncation=True, is_split_into_words=True, padding=True)
    labels = []
    for i, label in enumerate(examples["ner_tags"]):
        word_ids = tokenized_inputs.word_ids(batch_index=i)
        previous_word_idx = None
        label_ids = []
        for word_idx in word_ids:
            if word_idx is None:
                label_ids.append(-100)
            elif word_idx != previous_word_idx:
                label_ids.append(label[word_idx])
            else:
                label_ids.append(-100)
            previous_word_idx = word_idx
        labels.append(label_ids)
    tokenized_inputs["labels"] = labels
    return tokenized_inputs
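To see why this alignment is needed, recall that WordPiece splits some words into several pieces; word_ids() maps each piece back to its source word, so only the first piece keeps the word's label while the remaining pieces (and the special [CLS]/[SEP] tokens) receive -100, which the loss function ignores. A quick check using the tokenizer from Step 4 (the exact split shown in the comments is illustrative):
enc = tokenizer(["Forrest", "Gump", "ran"], is_split_into_words=True)
print(enc.tokens())    # e.g. ['[CLS]', 'Forrest', 'G', '##ump', 'ran', '[SEP]']
print(enc.word_ids())  # e.g. [None, 0, 1, 1, 2, None]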
Step 6: Tokenize Datasets and Set Training Arguments
In this step, we'll tokenize the datasets using the map() function from the datasets library. The tokenize_and_align_labels() function is applied to each dataset in a batched manner. It tokenizes the examples and aligns the labels with the tokenized inputs. The resulting tokenized datasets are stored in the tokenized_datasets variable.
We'll also define the training arguments, such as the output directory, evaluation strategy, batch size, learning rate, and number of training epochs, using the TrainingArguments class.
tokenized_datasets = datasets.map(tokenize_and_align_labels, batched=True)

training_args = TrainingArguments(
    output_dir="./results",
    evaluation_strategy="steps",
    eval_steps=500,
    save_steps=500,
    num_train_epochs=1,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    logging_steps=100,
    learning_rate=5e-5,
    load_best_model_at_end=True,
    metric_for_best_model="f1",
)
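Since we want Weights & Biases to track this run (see the Weights & Biases Monitoring section below), you can make the logging explicit. The Trainer already reports to W&B automatically when the wandb package is installed, but setting report_to and run_name keeps the behavior unambiguous; a hedged variant of the arguments above (the run name is just an illustrative choice):
training_args = TrainingArguments(
    output_dir="./results",
    evaluation_strategy="steps",
    eval_steps=500,
    save_steps=500,
    num_train_epochs=1,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    logging_steps=100,
    learning_rate=5e-5,
    load_best_model_at_end=True,
    metric_for_best_model="f1",
    report_to="wandb",               # send training metrics to Weights & Biases
    run_name="bert-conll2003-ner",   # hypothetical run name; choose your own
)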
Step 7: Define Data Collator and Initialize Trainer
Here, you define the data collator function, which prepares the input data for training. The data_collator() function takes a batch of data and converts it into PyTorch tensors, padding the sequences as necessary. Then, you initialize the Trainer object with the model, training arguments, datasets, data collator, tokenizer, and the compute_metrics() function for evaluation during training.
def data_collator(data):
    input_ids = [torch.tensor(item["input_ids"]) for item in data]
    attention_mask = [torch.tensor(item["attention_mask"]) for item in data]
    labels = [torch.tensor(item["labels"]) for item in data]

    input_ids = torch.nn.utils.rnn.pad_sequence(input_ids, batch_first=True, padding_value=tokenizer.pad_token_id)
    attention_mask = torch.nn.utils.rnn.pad_sequence(attention_mask, batch_first=True, padding_value=0)
    labels = torch.nn.utils.rnn.pad_sequence(labels, batch_first=True, padding_value=-100)

    return {
        "input_ids": input_ids,
        "attention_mask": attention_mask,
        "labels": labels,
    }

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["validation"],
    data_collator=data_collator,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics,
)
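As an aside, transformers ships a ready-made collator for this exact task, DataCollatorForTokenClassification, which pads input IDs, attention masks, and labels (with -100) in the same way as the hand-written function above. If you prefer it, pass it to the Trainer in place of the manual collator:
from transformers import DataCollatorForTokenClassification

# Built-in alternative to the hand-written data_collator above.
hf_collator = DataCollatorForTokenClassification(tokenizer)
# trainer = Trainer(..., data_collator=hf_collator, ...)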
Step 8: Train the Model
We start the training process by calling the trainer.train() method. The model is trained on the specified datasets using the training arguments defined earlier.
trainer.train()
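After training, it is worth evaluating on the held-out test split and saving the fine-tuned weights so they can be reloaded later. A minimal sketch (the output path is a hypothetical choice):
# Evaluate on the test split and save the fine-tuned model (and tokenizer).
test_metrics = trainer.evaluate(tokenized_datasets["test"])
print("Test F1:", test_metrics["eval_f1"])
trainer.save_model("./ner-bert-conll2003")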
Step 9: Perform Named Entity Recognition on Examples
In this step, we will demonstrate how to perform named entity recognition (NER) on several example sentences using the trained model and tokenizer. For each example, we feed the sentence through the model and print the tokens it tags as entities.
Example 1
sentence = "John Smith is a software engineer who works at Google."tokenized_input = tokenizer(sentence, return_tensors="pt").to(model.device)outputs = model(**tokenized_input)predicted_labels = outputs.logits.argmax(-1)[0]named_entities = [tokenizer.decode([token]) for token, label in zip(tokenized_input["input_ids"][0], predicted_labels) if label != 0 and label != label_map['O']]print("Named Entities - Example 1:", named_entities)
Output: Named Entities - Example 1: ['John', 'Smith', 'Google']
Example 2
sentence2 = "The company Apple Inc. announced its new product, the iPhone 12, at a press conference held in San Francisco."
tokenized_input2 = tokenizer(sentence2, return_tensors="pt").to(model.device)
outputs2 = model(**tokenized_input2)
predicted_labels2 = outputs2.logits.argmax(-1)[0]
named_entities2 = [
    tokenizer.decode([token])
    for token, label in zip(tokenized_input2["input_ids"][0], predicted_labels2)
    if label != 0 and label != label_map['O']
]
print("Named Entities - Example 2:", named_entities2)
Output: Named Entities - Example 2: ['Apple', 'Inc', 'iPhone', '12', 'Francisco']
Example 3
sentence3 = "The actor Tom Hanks starred in the movie Forrest Gump."
tokenized_input3 = tokenizer(sentence3, return_tensors="pt").to(model.device)
outputs3 = model(**tokenized_input3)
predicted_labels3 = outputs3.logits.argmax(-1)[0]
named_entities3 = [
    tokenizer.decode([token])
    for token, label in zip(tokenized_input3["input_ids"][0], predicted_labels3)
    if label != 0 and label != label_map['O']
]
print("Named Entities - Example 3:", named_entities3)
Output: Named Entities - Example 3: ['Tom', 'Hank', '##s', 'Forrest', 'G', '##ump']
Example 4
sentence4 = "Paris is the capital city of France."
tokenized_input4 = tokenizer(sentence4, return_tensors="pt").to(model.device)
outputs4 = model(**tokenized_input4)
predicted_labels4 = outputs4.logits.argmax(-1)[0]
named_entities4 = [
    tokenizer.decode([token])
    for token, label in zip(tokenized_input4["input_ids"][0], predicted_labels4)
    if label != 0 and label != label_map['O']
]
print("Named Entities - Example 4:", named_entities4)
Output: Named Entities - Example 4: []
As shown above, the model failed to return any named entities for this sentence, even though "Paris" and "France" are clearly locations. Failures like this can have several causes; with only one training epoch, the model may simply be under-trained. One subtle issue worth checking is the filter label != 0 in the code above: because label_list is sorted alphabetically, index 0 corresponds to an entity tag rather than 'O', so tokens predicted with that tag are silently dropped. To enhance the model's performance, consider the following approaches: (1) Increase training data diversity and fine-tune the model on your specific dataset. (2) Preprocess and normalize data to handle variations. (3) Experiment with different models and architectures. (4) Analyze errors and manually correct annotations.
Example 5
sentence5 = "The scientist Marie Curie won the Nobel Prize in Physics and Chemistry."
tokenized_input5 = tokenizer(sentence5, return_tensors="pt").to(model.device)
outputs5 = model(**tokenized_input5)
predicted_labels5 = outputs5.logits.argmax(-1)[0]
named_entities5 = [
    tokenizer.decode([token])
    for token, label in zip(tokenized_input5["input_ids"][0], predicted_labels5)
    if label != 0 and label != label_map['O']
]
print("Named Entities - Example 5:", named_entities5)
Output: Named Entities - Example 5: ['Marie', 'C', '##uri', '##e', 'Nobel', 'Prize', 'in', 'Physics', 'and', 'Chemistry']
Similar to Example 4, this output contains some errors, though they are less severe: the entities are found, but they appear as raw WordPiece fragments ('C', '##uri', '##e') and include stray tokens such as 'in' and 'and'. To improve the model itself, follow the same steps suggested above; a tidier way to present the predictions is sketched below.
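The split subword pieces in Examples 3 and 5 come from printing raw WordPiece tokens one by one. A common way to get cleaner, grouped entity spans is the transformers token-classification pipeline with an aggregation strategy, which merges subwords back into whole words. This sketch assumes the model was created with the id2label/label2id mappings shown in Step 4, since grouping relies on readable B-/I- tag names:
from transformers import pipeline

# Groups subword pieces back into whole entity spans.
# Pass device=0 if you want to keep inference on the GPU.
ner_pipe = pipeline(
    "token-classification",
    model=model,
    tokenizer=tokenizer,
    aggregation_strategy="simple",
)
print(ner_pipe("The actor Tom Hanks starred in the movie Forrest Gump."))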
Weights & Biases Monitoring
Note that when running the code, you will be asked to enter your wandb API key. To get the key, follow the authorization link shown in the prompt (wandb.ai/authorize).
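If you prefer to authenticate before training rather than paste the key at the interactive prompt, you can log in programmatically; both approaches work:
import wandb

# Log in ahead of time so the Trainer can stream metrics to W&B without prompting.
wandb.login()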
After completing the training process, click the "View Project" link, which will provide you with the model’s training statistics, including the learning_rate, train_runtime, and much more.
Conclusion
In this tutorial, we delved into the world of named entity recognition (NER) using HuggingFace, PyTorch, and W&B. We learned about the importance of NER in extracting valuable information from text and how NER models based on deep learning architectures like BERT can achieve impressive results.
By following the step-by-step tutorial, we obtained a trained NER model capable of accurately identifying and categorizing named entities. This model can be applied to various real-world applications, such as information extraction, question answering, sentiment analysis, and machine translation. With the power of HuggingFace, PyTorch, and W&B, we can continue exploring and enhancing NER systems to extract meaningful insights from text data.