Fine-Tuning Llama 2 for Named Entity Recognition
A brief overview of training a model from the Llama family for Named Entity Recognition

Model: dalle-3-xl-v2, Prompt: A LLaMA Wearing a spiderman T Shirt sipping coffee
In this article, we'll look at using the newly released Llama 2 models for Named Entity Recognition (generalisable to any kind of token classification task).
Code
At the time of writing, the 🤗 Transformers library doesn't have a Llama implementation for token classification (although there is an open PR). However, community member @KoichiYasuoka has kindly contributed a simple implementation:
from typing import List, Optional, Tuple, Union

import torch
from torch import nn
from transformers.modeling_outputs import TokenClassifierOutput
from transformers.file_utils import add_start_docstrings_to_model_forward
from transformers.models.llama.modeling_llama import LlamaModel, LlamaPreTrainedModel, LLAMA_INPUTS_DOCSTRING


class LlamaForTokenClassification(LlamaPreTrainedModel):
    """LLaMA backbone with a dropout + linear token-classification head on top."""

    def __init__(self, config):
        super().__init__(config)
        self.num_labels = config.num_labels
        self.model = LlamaModel(config)
        if hasattr(config, "classifier_dropout") and config.classifier_dropout is not None:
            classifier_dropout = config.classifier_dropout
        elif hasattr(config, "hidden_dropout") and config.hidden_dropout is not None:
            classifier_dropout = config.hidden_dropout
        else:
            classifier_dropout = 0.1
        self.dropout = nn.Dropout(classifier_dropout)
        self.classifier = nn.Linear(config.hidden_size, config.num_labels)
        # Initialize weights and apply final processing
        self.post_init()

    def get_input_embeddings(self):
        return self.model.embed_tokens

    def set_input_embeddings(self, value):
        self.model.embed_tokens = value

    @add_start_docstrings_to_model_forward(LLAMA_INPUTS_DOCSTRING)
    def forward(
        self,
        input_ids: Optional[torch.LongTensor] = None,
        attention_mask: Optional[torch.Tensor] = None,
        position_ids: Optional[torch.LongTensor] = None,
        past_key_values: Optional[List[torch.FloatTensor]] = None,
        inputs_embeds: Optional[torch.FloatTensor] = None,
        labels: Optional[torch.LongTensor] = None,
        use_cache: Optional[bool] = None,
        output_attentions: Optional[bool] = None,
        output_hidden_states: Optional[bool] = None,
        return_dict: Optional[bool] = None,
    ) -> Union[Tuple, TokenClassifierOutput]:
        return_dict = return_dict if return_dict is not None else self.config.use_return_dict

        transformer_outputs = self.model(
            input_ids,
            attention_mask=attention_mask,
            position_ids=position_ids,
            past_key_values=past_key_values,
            inputs_embeds=inputs_embeds,
            use_cache=use_cache,
            output_attentions=output_attentions,
            output_hidden_states=output_hidden_states,
            return_dict=return_dict,
        )
        hidden_states = transformer_outputs[0]
        hidden_states = self.dropout(hidden_states)
        logits = self.classifier(hidden_states)

        loss = None
        if labels is not None:
            labels = labels.to(logits.device)
            # Flatten batch and sequence dimensions for the per-token cross-entropy loss
            loss_fct = nn.CrossEntropyLoss()
            loss = loss_fct(logits.view(-1, self.num_labels), labels.view(-1))

        if not return_dict:
            output = (logits,) + transformer_outputs[2:]
            return ((loss,) + output) if loss is not None else output

        return TokenClassifierOutput(
            loss=loss,
            logits=logits,
            hidden_states=transformer_outputs.hidden_states,
            attentions=transformer_outputs.attentions,
        )
With this implementation, we can simply plug the model into a standard transformers + peft training pipeline and start fine-tuning a 7B-parameter model! (A sketch of the full setup follows the list below.)
However, we'll need a few tricks to make this work on the free Colab GPUs:
- Instead of fine-tuning the entire model, we'll use LoRA to train only a small fraction of the weights. (For more context, refer to the article on LoRA.)
- Instead of the original Llama 2 weights, which are too big to fit on the free instance, we use a 4-bit quantized model made available by unsloth.ai, namely unsloth/llama-2-7b-bnb-4bit.
- We use the paged_adamw_8bit optimizer, originally introduced with QLoRA.
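Putting these pieces together, a minimal training sketch looks roughly like the following. Note that the hyperparameters, the target_modules choice, and the tokenized_dataset variable (CoNLL-2003 tokenized with labels aligned to sub-tokens) are illustrative assumptions rather than the exact settings behind the runs shown later.

from peft import LoraConfig, TaskType, get_peft_model, prepare_model_for_kbit_training
from transformers import (
    AutoTokenizer,
    DataCollatorForTokenClassification,
    Trainer,
    TrainingArguments,
)

model_id = "unsloth/llama-2-7b-bnb-4bit"  # pre-quantized 4-bit checkpoint
num_labels = 9  # CoNLL-2003 uses 9 BIO tags (O plus B-/I- for PER, ORG, LOC, MISC)

tokenizer = AutoTokenizer.from_pretrained(model_id)
tokenizer.pad_token = tokenizer.eos_token  # LLaMA ships without a pad token

# Base model in 4 bit, with the token-classification head defined above
model = LlamaForTokenClassification.from_pretrained(
    model_id,
    num_labels=num_labels,
    device_map="auto",
)
model.config.pad_token_id = tokenizer.pad_token_id
model = prepare_model_for_kbit_training(model)

# LoRA: train low-rank adapters (plus the freshly initialised classifier head) only
peft_config = LoraConfig(
    task_type=TaskType.TOKEN_CLS,
    r=16,                                 # LoRA rank -- the knob varied between the two runs below
    lora_alpha=32,
    lora_dropout=0.1,
    target_modules=["q_proj", "v_proj"],  # illustrative choice of attention projections
    modules_to_save=["classifier"],
)
model = get_peft_model(model, peft_config)

training_args = TrainingArguments(
    output_dir="llama2-ner",
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,
    learning_rate=2e-4,
    num_train_epochs=3,
    optim="paged_adamw_8bit",  # the paged 8-bit AdamW introduced with QLoRA
    report_to="wandb",         # stream metrics to Weights & Biases
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset["train"],      # assumed: tokenized CoNLL-2003 with aligned labels
    eval_dataset=tokenized_dataset["validation"],
    tokenizer=tokenizer,
    data_collator=DataCollatorForTokenClassification(tokenizer),
)
trainer.train()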
Results
Let's look at the results of training two such models with varying LoRA rank on the CoNLL-2003 benchmark dataset for Named Entity Recognition.
[W&B panel: run set of 2 runs]
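A typical way to get entity-level precision, recall, and F1 on CoNLL-2003 into charts like these is to pass a seqeval-based compute_metrics function to the Trainer, roughly as sketched below. Here, label_list is assumed to hold the dataset's BIO tag names, and positions labelled -100 (special and continuation sub-tokens) are skipped; this is the standard recipe, not necessarily the exact evaluation code behind these runs.

import numpy as np
import evaluate

seqeval = evaluate.load("seqeval")

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    # Keep only positions with real labels (-100 marks special / continuation sub-tokens)
    true_labels = [
        [label_list[l] for l in label_row if l != -100]
        for label_row in labels
    ]
    true_predictions = [
        [label_list[p] for p, l in zip(pred_row, label_row) if l != -100]
        for pred_row, label_row in zip(predictions, labels)
    ]
    results = seqeval.compute(predictions=true_predictions, references=true_labels)
    return {
        "precision": results["overall_precision"],
        "recall": results["overall_recall"],
        "f1": results["overall_f1"],
        "accuracy": results["overall_accuracy"],
    }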
Conclusion
In this article, we walked through a brief overview of training a model from the Llama family for Named Entity Recognition and saw how Weights & Biases can be used to explore the training process and surface valuable insights.
To see the full suite of W&B features, please check out this short 5-minute guide. If you want more reports covering the math and "from-scratch" code implementations, let us know in the comments down below or on our forum ✨!
Check out these other reports on Fully Connected covering related LLM topics like LoRA, QLoRA, and instruction tuning.
What Are Intrinsic Dimensions? The Secret Behind LoRA
This article provides a brief overview of intrinsic dimensions and how they enable Low-Rank Adaptation (LoRA). We also provide code samples which use Weights & Biases for interactive visualizations.
A Brief Introduction to LoRA
This article gives an overview of LoRA (Low-Rank Adaptation) of Large Language Models, using W&B for interactive visualizations. It includes code samples for you to follow.
AdaLoRA: Adaptive Budget Allocation for LoRA
This article provides an overview of "Adaptive Budget Allocation for Parameter Efficient Fine-Tuning" using W&B for interactive visualizations. It includes code samples for you to follow!
What is QLoRA?
This article provides an overview of "QLoRA: Efficient Finetuning of Quantized LLMs" using W&B for interactive visualizations. It includes code samples for you to follow!
How to Fine-Tune an LLM Part 2: Instruction Tuning Llama 2
In part 1, we prepped our dataset. In part 2, we train our model.
Scaling Llama 2 to 32k Tokens With LongLora
The need for LLMs that can digest long content is becoming increasingly important. Go beyond 4096 tokens with LongLoRA!