
Fine-tuning Llama 2 for Named Entity Recognition

A brief overview of training a model from the Llama family for Named Entity Recognition



Model: dalle-3-xl-v2, Prompt: A LLaMA Wearing a spiderman T Shirt sipping coffee
In this article, we'll look at using the newly released Llama 2 models for Named Entity Recognition (an approach that generalizes to any token classification task).

Code

At the time of writing, the 🤗 transformers library doesn't have a Llama implementation for token classification (although there is an open PR). However, community member @KoichiYasuoka has kindly contributed a simple implementation:
from typing import List, Optional, Tuple, Union

import torch
from torch import nn
from transformers.modeling_outputs import TokenClassifierOutput
from transformers.file_utils import add_start_docstrings_to_model_forward
from transformers.models.llama.modeling_llama import LlamaModel, LlamaPreTrainedModel, LLAMA_INPUTS_DOCSTRING


class LlamaForTokenClassification(LlamaPreTrainedModel):
    def __init__(self, config):
        super().__init__(config)
        self.num_labels = config.num_labels
        self.model = LlamaModel(config)
        if hasattr(config, "classifier_dropout") and config.classifier_dropout is not None:
            classifier_dropout = config.classifier_dropout
        elif hasattr(config, "hidden_dropout") and config.hidden_dropout is not None:
            classifier_dropout = config.hidden_dropout
        else:
            classifier_dropout = 0.1
        self.dropout = nn.Dropout(classifier_dropout)
        self.classifier = nn.Linear(config.hidden_size, config.num_labels)

        # Initialize weights and apply final processing
        self.post_init()

    def get_input_embeddings(self):
        return self.model.embed_tokens

    def set_input_embeddings(self, value):
        self.model.embed_tokens = value

    @add_start_docstrings_to_model_forward(LLAMA_INPUTS_DOCSTRING)
    def forward(
        self,
        input_ids: Optional[torch.LongTensor] = None,
        attention_mask: Optional[torch.Tensor] = None,
        position_ids: Optional[torch.LongTensor] = None,
        past_key_values: Optional[List[torch.FloatTensor]] = None,
        inputs_embeds: Optional[torch.FloatTensor] = None,
        labels: Optional[torch.LongTensor] = None,
        use_cache: Optional[bool] = None,
        output_attentions: Optional[bool] = None,
        output_hidden_states: Optional[bool] = None,
        return_dict: Optional[bool] = None,
    ) -> Union[Tuple, TokenClassifierOutput]:

        return_dict = return_dict if return_dict is not None else self.config.use_return_dict

        transformer_outputs = self.model(
            input_ids,
            attention_mask=attention_mask,
            position_ids=position_ids,
            past_key_values=past_key_values,
            inputs_embeds=inputs_embeds,
            use_cache=use_cache,
            output_attentions=output_attentions,
            output_hidden_states=output_hidden_states,
            return_dict=return_dict,
        )

        hidden_states = transformer_outputs[0]
        hidden_states = self.dropout(hidden_states)
        logits = self.classifier(hidden_states)

        loss = None
        if labels is not None:
            labels = labels.to(logits.device)
            loss_fct = nn.CrossEntropyLoss()
            loss = loss_fct(logits.view(-1, self.num_labels), labels.view(-1))

        if not return_dict:
            output = (logits,) + transformer_outputs[2:]
            return ((loss,) + output) if loss is not None else output

        return TokenClassifierOutput(
            loss=loss,
            logits=logits,
            hidden_states=transformer_outputs.hidden_states,
            attentions=transformer_outputs.attentions,
        )
With this implementation, we can simply plug the model into a standard transformers + peft training pipeline and start fine-tuning a 7B-parameter model!
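Before training, the word-level NER labels have to be aligned with the tokenizer's sub-word pieces, just like for any other token classification model. Below is a minimal sketch of that step, assuming the CoNLL-2003 dataset from 🤗 datasets and the common convention of labelling only the first sub-token of each word (every other position gets -100, which nn.CrossEntropyLoss ignores by default):

from datasets import load_dataset
from transformers import AutoTokenizer

# CoNLL-2003 ships word-level NER tags in the "ner_tags" column
dataset = load_dataset("conll2003")
label_list = dataset["train"].features["ner_tags"].feature.names

tokenizer = AutoTokenizer.from_pretrained("unsloth/llama-2-7b-bnb-4bit")
tokenizer.pad_token = tokenizer.eos_token  # Llama has no pad token by default

def tokenize_and_align_labels(examples):
    tokenized = tokenizer(examples["tokens"], truncation=True, is_split_into_words=True)
    all_labels = []
    for i, word_labels in enumerate(examples["ner_tags"]):
        word_ids = tokenized.word_ids(batch_index=i)
        previous_word_id = None
        labels = []
        for word_id in word_ids:
            if word_id is None or word_id == previous_word_id:
                labels.append(-100)  # special tokens and continuation sub-tokens are ignored by the loss
            else:
                labels.append(word_labels[word_id])  # label only the first sub-token of each word
            previous_word_id = word_id
        all_labels.append(labels)
    tokenized["labels"] = all_labels
    return tokenized

tokenized_dataset = dataset.map(tokenize_and_align_labels, batched=True)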
However, we'll need a few tricks to fit the 7B model on the free Colab GPUs (a combined setup sketch follows the list):
  • Instead of fine-tuning the entire model, we'll use LoRA to train only a small fraction of the weights. (For more context, refer to the article on LoRA.)
  • Instead of the original LLaMA weights, which are too big to fit on the free instance, we use a quantized 4-bit model made available via unsloth.ai: unsloth/llama-2-7b-bnb-4bit.
  • We use the paged_adamw_8bit optimizer, originally introduced with QLoRA.
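Putting those pieces together, a training setup could look roughly like the sketch below. This is not the exact script behind the runs in the next section: the LoRA hyperparameters, batch sizes and learning rate are illustrative, and label_list / tokenized_dataset come from the data-preparation sketch above.

import torch
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
from transformers import DataCollatorForTokenClassification, Trainer, TrainingArguments

model_name = "unsloth/llama-2-7b-bnb-4bit"  # pre-quantized 4-bit checkpoint

# LlamaForTokenClassification is the class defined above; the bitsandbytes 4-bit
# quantization config ships with the checkpoint, so it is picked up automatically
model = LlamaForTokenClassification.from_pretrained(
    model_name,
    num_labels=len(label_list),
    torch_dtype=torch.float16,
    device_map="auto",
)
model = prepare_model_for_kbit_training(model)

# LoRA: train only small low-rank adapters on the attention projections,
# plus the freshly initialized classification head
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.1,
    target_modules=["q_proj", "v_proj"],
    modules_to_save=["classifier"],
    task_type="TOKEN_CLS",
)
model = get_peft_model(model, lora_config)

training_args = TrainingArguments(
    output_dir="llama2-ner",
    per_device_train_batch_size=8,
    gradient_accumulation_steps=4,
    learning_rate=2e-4,
    num_train_epochs=3,
    optim="paged_adamw_8bit",  # the paged optimizer introduced with QLoRA
    fp16=True,
    report_to="wandb",         # log training/eval metrics to Weights & Biases
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset["train"],
    eval_dataset=tokenized_dataset["validation"],
    data_collator=DataCollatorForTokenClassification(tokenizer),
)
trainer.train()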


Results




Let's look at the results of training two such models with varying LoRA rank on the CoNLL-2003 benchmark dataset for Named Entity Recognition.
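Entity-level scores on CoNLL-2003 are typically computed with seqeval. As a sketch (not necessarily the exact evaluation code behind these runs), a compute_metrics function like the one below, reusing the label_list from the data-preparation step, can be passed to the Trainer so that precision, recall and F1 are logged to the W&B run alongside the loss:

import numpy as np
import evaluate

seqeval = evaluate.load("seqeval")

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    # drop positions labelled -100 (special tokens and continuation sub-tokens)
    true_labels = [
        [label_list[l] for l in label_row if l != -100]
        for label_row in labels
    ]
    true_predictions = [
        [label_list[p] for p, l in zip(pred_row, label_row) if l != -100]
        for pred_row, label_row in zip(predictions, labels)
    ]
    results = seqeval.compute(predictions=true_predictions, references=true_labels)
    return {
        "precision": results["overall_precision"],
        "recall": results["overall_recall"],
        "f1": results["overall_f1"],
        "accuracy": results["overall_accuracy"],
    }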


Conclusion

In this article, you read through a brief overview of training a model from the Llama family for Named Entity Recognition, and saw how Weights & Biases can be used to explore the training process and surface valuable insights.
To see the full suite of W&B features, please check out this short 5-minute guide. If you want more reports covering the math and "from-scratch" code implementations, let us know in the comments down below or on our forum ✨!
Check out these other reports on Fully Connected covering other LLM-related topics like Audio Transformers and hyperparameter optimization.