Fine-Tuning Llama 2 for Named Entity Recognition
A brief overview of training a model from the Llama family for Named Entity Recognition

Model: dalle-3-xl-v2, Prompt: A LLaMA Wearing a spiderman T Shirt sipping coffee
In this article, we'll look at using the newly released Llama 2 models for Named Entity Recognition (generalisable to any kind of token classification task).
Code
At the time of writing, the 🤗 Transformers library doesn't have a Llama implementation for token classification (although there is an open PR). However, community member @KoichiYasuoka has kindly contributed a simple implementation:
from typing import List, Optional, Tuple, Union

import torch
from torch import nn
from transformers.modeling_outputs import TokenClassifierOutput
from transformers.file_utils import add_start_docstrings_to_model_forward
from transformers.models.llama.modeling_llama import LlamaModel, LlamaPreTrainedModel, LLAMA_INPUTS_DOCSTRING


class LlamaForTokenClassification(LlamaPreTrainedModel):
    """LLaMA backbone with a dropout + linear token-classification head on top."""

    def __init__(self, config):
        super().__init__(config)
        self.num_labels = config.num_labels
        self.model = LlamaModel(config)
        if hasattr(config, "classifier_dropout") and config.classifier_dropout is not None:
            classifier_dropout = config.classifier_dropout
        elif hasattr(config, "hidden_dropout") and config.hidden_dropout is not None:
            classifier_dropout = config.hidden_dropout
        else:
            classifier_dropout = 0.1
        self.dropout = nn.Dropout(classifier_dropout)
        self.classifier = nn.Linear(config.hidden_size, config.num_labels)
        # Initialize weights and apply final processing
        self.post_init()

    def get_input_embeddings(self):
        return self.model.embed_tokens

    def set_input_embeddings(self, value):
        self.model.embed_tokens = value

    @add_start_docstrings_to_model_forward(LLAMA_INPUTS_DOCSTRING)
    def forward(
        self,
        input_ids: Optional[torch.LongTensor] = None,
        attention_mask: Optional[torch.Tensor] = None,
        position_ids: Optional[torch.LongTensor] = None,
        past_key_values: Optional[List[torch.FloatTensor]] = None,
        inputs_embeds: Optional[torch.FloatTensor] = None,
        labels: Optional[torch.LongTensor] = None,
        use_cache: Optional[bool] = None,
        output_attentions: Optional[bool] = None,
        output_hidden_states: Optional[bool] = None,
        return_dict: Optional[bool] = None,
    ) -> Union[Tuple, TokenClassifierOutput]:
        return_dict = return_dict if return_dict is not None else self.config.use_return_dict

        transformer_outputs = self.model(
            input_ids,
            attention_mask=attention_mask,
            position_ids=position_ids,
            past_key_values=past_key_values,
            inputs_embeds=inputs_embeds,
            use_cache=use_cache,
            output_attentions=output_attentions,
            output_hidden_states=output_hidden_states,
            return_dict=return_dict,
        )
        hidden_states = transformer_outputs[0]
        hidden_states = self.dropout(hidden_states)
        logits = self.classifier(hidden_states)

        loss = None
        if labels is not None:
            labels = labels.to(logits.device)
            # Flatten batch and sequence dimensions for the per-token cross-entropy loss
            loss_fct = nn.CrossEntropyLoss()
            loss = loss_fct(logits.view(-1, self.num_labels), labels.view(-1))

        if not return_dict:
            output = (logits,) + transformer_outputs[2:]
            return ((loss,) + output) if loss is not None else output

        return TokenClassifierOutput(
            loss=loss,
            logits=logits,
            hidden_states=transformer_outputs.hidden_states,
            attentions=transformer_outputs.attentions,
        )
With this implementation, we can simply plug the model into a standard transformers + peft training pipeline and start fine-tuning a 7B-parameter model! (A sketch of the full setup follows the list below.)
However, we'll need a few tricks to make this work on the free Colab GPUs:
- Instead of fine-tuning the entire model, we'll use LoRA to train only a small fraction of the weights. (For more context, refer to the article on LoRA.)
- Instead of the original Llama 2 weights, which are too big to fit on the free instance, we use a 4-bit quantized model made available by unsloth.ai, namely unsloth/llama-2-7b-bnb-4bit.
- We use the paged_adamw_8bit optimizer, originally introduced with QLoRA.
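Putting these pieces together, a minimal training sketch looks roughly like the following. Note that the hyperparameters, the target_modules choice, and the tokenized_dataset variable (CoNLL-2003 tokenized with labels aligned to sub-tokens) are illustrative assumptions rather than the exact settings behind the runs shown later.

from peft import LoraConfig, TaskType, get_peft_model, prepare_model_for_kbit_training
from transformers import (
    AutoTokenizer,
    DataCollatorForTokenClassification,
    Trainer,
    TrainingArguments,
)

model_id = "unsloth/llama-2-7b-bnb-4bit"  # pre-quantized 4-bit checkpoint
num_labels = 9  # CoNLL-2003 uses 9 BIO tags (O plus B-/I- for PER, ORG, LOC, MISC)

tokenizer = AutoTokenizer.from_pretrained(model_id)
tokenizer.pad_token = tokenizer.eos_token  # LLaMA ships without a pad token

# Base model in 4 bit, with the token-classification head defined above
model = LlamaForTokenClassification.from_pretrained(
    model_id,
    num_labels=num_labels,
    device_map="auto",
)
model.config.pad_token_id = tokenizer.pad_token_id
model = prepare_model_for_kbit_training(model)

# LoRA: train low-rank adapters (plus the freshly initialised classifier head) only
peft_config = LoraConfig(
    task_type=TaskType.TOKEN_CLS,
    r=16,                                 # LoRA rank -- the knob varied between the two runs below
    lora_alpha=32,
    lora_dropout=0.1,
    target_modules=["q_proj", "v_proj"],  # illustrative choice of attention projections
    modules_to_save=["classifier"],
)
model = get_peft_model(model, peft_config)

training_args = TrainingArguments(
    output_dir="llama2-ner",
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,
    learning_rate=2e-4,
    num_train_epochs=3,
    optim="paged_adamw_8bit",  # the paged 8-bit AdamW introduced with QLoRA
    report_to="wandb",         # stream metrics to Weights & Biases
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset["train"],      # assumed: tokenized CoNLL-2003 with aligned labels
    eval_dataset=tokenized_dataset["validation"],
    tokenizer=tokenizer,
    data_collator=DataCollatorForTokenClassification(tokenizer),
)
trainer.train()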
Results
Let's look at the results of training two such models with varying LoRA rank on the CoNLL-2003 benchmark dataset for Named Entity Recognition.
[W&B panel: run set of 2 runs]
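A typical way to get entity-level precision, recall, and F1 on CoNLL-2003 into charts like these is to pass a seqeval-based compute_metrics function to the Trainer, roughly as sketched below. Here, label_list is assumed to hold the dataset's BIO tag names, and positions labelled -100 (special and continuation sub-tokens) are skipped; this is the standard recipe, not necessarily the exact evaluation code behind these runs.

import numpy as np
import evaluate

seqeval = evaluate.load("seqeval")

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    # Keep only positions with real labels (-100 marks special / continuation sub-tokens)
    true_labels = [
        [label_list[l] for l in label_row if l != -100]
        for label_row in labels
    ]
    true_predictions = [
        [label_list[p] for p, l in zip(pred_row, label_row) if l != -100]
        for pred_row, label_row in zip(predictions, labels)
    ]
    results = seqeval.compute(predictions=true_predictions, references=true_labels)
    return {
        "precision": results["overall_precision"],
        "recall": results["overall_recall"],
        "f1": results["overall_f1"],
        "accuracy": results["overall_accuracy"],
    }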
Conclusion
In this article, we walked through a brief overview of training a model from the Llama family for Named Entity Recognition and saw how Weights & Biases can be used to explore the training process and surface valuable insights.
To see the full suite of W&B features, please check out this short 5-minute guide. If you want more reports covering the math and "from-scratch" code implementations, let us know in the comments down below or on our forum ✨!
Check out these other reports on Fully Connected covering related LLM topics like LoRA, QLoRA, and instruction tuning.
What Are Intrinsic Dimensions? The Secret Behind LoRA
This article provides a brief overview of intrinsic dimensions and how they enable Low-Rank Adaptation (LoRA). We also provide code samples which use Weights & Biases for interactive visualizations.
A Brief Introduction to LoRA
This article gives an overview of LoRA (Low-Rank Adaptation) of Large Language Models, using W&B for interactive visualizations. It includes code samples for you to follow.
AdaLoRA: Adaptive Budget Allocation for LoRA
This article provides an overview of "Adaptive Budget Allocation for Parameter Efficient Fine-Tuning" using W&B for interactive visualizations. It includes code samples for you to follow!
What is QLoRA?
This article provides an overview of "QLoRA: Efficient Finetuning of Quantized LLMs" using W&B for interactive visualizations. It includes code samples for you to follow!
How to Fine-Tune an LLM Part 2: Instruction Tuning Llama 2
In part 1, we prepped our dataset. In part 2, we train our model.
Scaling Llama 2 to 32k Tokens With LongLora
The need for LLMs that can digest long content is becoming increasingly important. Go beyond 4096 tokens with LongLoRA!