Language Modeling: A Beginner's Guide
This article is a beginner's guide to Language Modeling and covers how to use pre-trained models for natural language processing tasks using HuggingFace Transformers.
The rise of the internet in the 21st century has made it easier for people to communicate and share knowledge. Additionally, advancements in Natural Language Processing (NLP), combined with the vast amount of training data the internet provides, have made it possible for machines to generate convincing human language.
This is where the term language modeling comes in. Many NLP tasks require understanding or generating a sequence of words, and language models provide the necessary probability distributions that make these tasks possible for us.
In this article, we will explore language modeling tasks, learn how to fine-tune a language model, and build a question-answering system using HuggingFace, W&B, and Gradio.
Here's what we'll cover:
- What Is Language Modeling?
- Types of Language Models
- Statistical Language Models
- N-Gram Language Model
- Neural Language Models
- Transformer Language Models
- Causal Language Modeling
- Masked Language Modeling
- Evaluation Metrics for Language Models
- Generic Metrics
- Task-Specific Metrics
- Dataset-Specific Metrics
- Applications of Language Modeling
- Token Classification
- Text Classification
- Question-Answering
- Translation
- Summarization
- Example: Language Modeling and Question-Answering using HuggingFace and W&B
- Causal and Masked Language Modeling
- Question-Answering with pre-trained models using HuggingFace and W&B
- Summary
- References
- Recommended Reads
What Is Language Modeling?
Language modeling is the process of training models to predict the probability of a sequence of words occurring. The probability is then used to glean meaning and context in the case of natural language understanding (NLU) and generate grammatically correct text in natural language generation (NLG).
We usually train language models using a large corpus of text data that is preferably in the form of sentences. During the training process, the model learns patterns in the text, such as the probability of a particular word appearing after others. Once trained, we can then use the model to generate new text by predicting the probabilities of each word in a sentence given the previous words and choosing the word that has the highest probability.
For example, if we train our language model on English text and present it with the incomplete sentence "I am going to the," let's see how a language model might complete it by assigning probabilities to different words:
Based on the training data we used, the model would assign high probabilities to certain words that usually follow this phrase, such as "store" (0.8), "park" (0.6), "beach" (0.4), and "movies" (0.2).
The model would also assign probabilities to verbs that are likely to appear later in the sentence, such as "buy" (0.7), "visit" (0.5), "enjoy" (0.3), and "watch" (0.1).
The model then selects the word with the highest probability, "store," and appends it to the sentence.
It then repeats the process with the updated context "I am going to the store," assigning probabilities to the words and phrases that might come next, such as "to buy" (0.8), "to look" (0.6), "to grab" (0.4), and "to eat" (0.2).
The model again selects the option with the highest probability, in our case "to buy," and appends it to the sentence.
The final sentence constructed by the model would then be, "I am going to the store to buy."
Note: This is just a simplified example used to form a better understanding of how a language model works inside the black box. In reality, the model would also use the probabilities of different word sequences, not just the individual words, to construct the sequence.
💡
Now that we have a basic idea of how a language model works, let's look at the various types of language models.
Types of Language Models
Like every other field, language modeling has also seen an evolution in methods over time. Initially, traditional statistical models like n-grams were used for language modeling, which focused on modeling the probability of a word given the previous n-1 words (more on this later).
However, these models had limitations when it came to capturing long-term dependencies and handling large vocabulary sizes. When neural networks started overtaking the machine learning field, recurrent models such as vanilla RNNs, LSTMs, and GRUs were able to capture longer-term dependencies and handle much larger vocabularies.
When the paper Attention Is All You Need introduced the attention mechanism and the transformer architecture in 2017, it quickly took over the world of machine learning. Transformers use a self-attention mechanism to weigh the importance of different words in a sentence, which allows the model to understand context far more efficiently than models like LSTMs. The maximum sequence length a transformer-based model can handle depends on the input window it was trained with. Most transformer-based models like BERT and T5 can handle sequence lengths of 512 to 1024 tokens, while models like Longformer and LED can handle up to 4096 tokens!
What are Tokens?
In NLP, a token is a sequence of characters that represents a single semantic unit for processing. Tokenization is an important step in NLP, as most models work on the assumption that each token in a text carries meaning, and that the relationships between these tokens can be used to understand the text as a whole.
💡
Statistical Language Models
Statistical models were the OGs in the language modeling field. They use basic statistics and probability estimation techniques to learn the probability distribution of words in the training data. A few commonly used methods include:
- Maximum Likelihood Estimation: In this technique, we count the number of occurrences of each word in the training data. Using this data (number of occurrences) we can estimate the probability of each word.
- Conditional Probability: Conditional probability is the probability of an event occurring (in our case, the event occurring could be a word or a sequence of words), given that another event has already occurred.
- Probability Distributions: These statistical models use simple probability distributions, such as the Multinomial Distribution or the Discrete Uniform Distribution, to model the probability of certain words in a given sequence, and more advanced distributions, such as the Gaussian Distribution and the Dirichlet Distribution, to account for the variability in our data.
In addition, these models may use various statistical techniques such as smoothing, which avoids assigning zero probability to unseen words, and the Markov property, which states that the probability of the next word depends only on a fixed number of preceding words rather than on the entire history.
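To make the counting idea concrete, here's a minimal sketch of maximum likelihood estimation with add-one (Laplace) smoothing on a toy corpus (the corpus and numbers are purely illustrative):

from collections import Counter

corpus = "the cat sat on the mat the cat ran".split()
counts = Counter(corpus)
total = len(corpus)          # 9 tokens
vocab_size = len(counts)     # 6 unique words

# Maximum likelihood estimate: relative frequency of each word.
p_mle = {w: c / total for w, c in counts.items()}

# Add-one (Laplace) smoothing: unseen words get a small non-zero probability.
def p_smoothed(word):
    return (counts[word] + 1) / (total + vocab_size)

print(p_mle["cat"])        # 2/9 ≈ 0.22
print(p_smoothed("dog"))   # (0 + 1) / (9 + 6) ≈ 0.067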
N-Gram Language Model
N-gram models are statistical language models that calculate the probability of a word occurring, given the preceding n-1 words. They are based on the idea that the probability of a word depends on the context in which it appears.
The model divides the text into a sequence of n-word chunks (Unigram, Bigram, Trigram, and so on) and then uses statistical techniques to determine the probability of each n-gram occurring in the text.
The basic idea behind the n-gram model is that the probability of a word depends on the previous (n-1) words in the sequence. For example, in a bigram (2-gram) model, the probability of a word depends only on the previous word. In a trigram (3-gram) model, the probability of a word depends on the previous two words. The higher the value of n, the more context the model takes into account when determining the probability of a word.

Behind the scenes, the n-gram model uses the chain rule (the probability of a sequence of events happening is the product of the probabilities of each event occurring given the previous events), Bayes' Rule (a formula used to update the probability of an event based on new information using the concept of conditional probability) and the Markov assumption (the probability of a word depends only on the previous n-1 words) to calculate the probability of a sequence of words.
Let's have a look at how the model does this using a Bigram example:
Step 1: Create our text corpus
Before we start our n-grams language modeling process, we need a text corpus for our training data, and I'll be using my friends' names for the same.
Logesh read a novel
Depak played tennis
Joshua read a book while travelling
Mukilan went to the store
Using this text corpus, let's try to find the probability of "book" given "a" i.e., Conditional probability of p(book|a):
p(book|a) = count("a book") / count("a") = 1/2
Now, using this data, let's try to predict the probability of a sequence of words:
P(W) = P(w1, w2, ..., wn)

where W is the sequence and w1, w2, ..., wn are the words.
Step 2: Apply Chain Rule
We take the probability of the first word, then condition each following word on all the words that precede it, until we reach the last word:

P(w1, w2, ..., wn) = P(w1) × P(w2 | w1) × P(w3 | w1, w2) × ... × P(wn | w1, ..., wn-1)
This makes the condition in the last term lengthy and complicated, and we need to simplify this to make the computation less painful. This is where Markov's assumption saves us from the pain of calculating this chunky chain equation.
Step 3: Enter Markov Assumption
In short, the Markov assumption says the probability of a word depends only on the previous n-1 words, so we condition on just the last n-1 terms:

P(wi | w1, ..., wi-1) ≈ P(wi | wi-(n-1), ..., wi-1)
PS: The "n" in "n-grams" refers to the "n" in (n-1)
The Markov assumption also leaves the choice of n up to us: the higher the value of n, the more context the model takes into account when determining the probability of a word, which can produce better results.
Step 4: Bigram Language Model (n=2)
If we apply n = 2 in the previous equation, we get:

P(wi | w1, ..., wi-1) ≈ P(wi | wi-1)

or, for the whole sequence,

P(W) ≈ P(w1) × P(w2 | w1) × P(w3 | w2) × ... × P(wn | wn-1)
Using this function, let's try to estimate the probability of words in a sentence:
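As a quick worked example with our toy corpus, conditioning on the first word "Logesh":

P(read a book | Logesh) ≈ P(read | Logesh) × P(a | read) × P(book | a) = 1/1 × 2/2 × 1/2 = 0.5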

If we use this model to predict the next sequence or word, there are two main challenges that it will encounter: determining the end of a sequence and identifying the start of a new sequence.
To solve this problem, we introduce special start and end tokens into our data so that the model learns where a sequence begins and when it should stop.
<start> Logesh read a novel <end>
<start> Depak played tennis <end>
<start> Joshua read a book while travelling <end>
<start> Mukilan went to the store <end>
With the <start> and <end> tokens, our probability-of-a-sequence equation

P(W) ≈ P(w1) × P(w2 | w1) × ... × P(wn | wn-1)

becomes

P(W) ≈ P(w1 | <start>) × P(w2 | w1) × ... × P(wn | wn-1) × P(<end> | wn)
This is how a Bigram model would construct a sentence.
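If you'd like to see this in code, here is a minimal (and deliberately naive) bigram model sketch in plain Python, trained on our toy corpus; it greedily picks the most probable next word at each step, just like the walkthrough above:

from collections import Counter, defaultdict

corpus = [
    "<start> Logesh read a novel <end>",
    "<start> Depak played tennis <end>",
    "<start> Joshua read a book while travelling <end>",
    "<start> Mukilan went to the store <end>",
]

# Count bigrams (previous word -> next word) over the whole corpus.
bigram_counts = defaultdict(Counter)
for sentence in corpus:
    words = sentence.split()
    for prev, nxt in zip(words, words[1:]):
        bigram_counts[prev][nxt] += 1

def next_word_probs(prev):
    # p(next | prev) via maximum likelihood estimation
    counts = bigram_counts[prev]
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

def generate(start="<start>", max_len=10):
    word, sentence = start, []
    for _ in range(max_len):
        probs = next_word_probs(word)
        if not probs:
            break
        word = max(probs, key=probs.get)  # greedily pick the most likely next word
        if word == "<end>":
            break
        sentence.append(word)
    return " ".join(sentence)

print(next_word_probs("a"))   # {'novel': 0.5, 'book': 0.5}
print(generate())             # e.g. "Logesh read a novel" (ties broken arbitrarily)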
If you're looking to put N-Grams into practice using Python, this video by Douglas Starnes is a great resource. He walks through the whole process step-by-step, and you can follow along using this notebook here.
Note: N-grams aren't the only statistical language model; there are many more, such as the Hidden Markov Model (HMM) and Latent Dirichlet Allocation (LDA), that use similar statistical foundations to learn patterns in sequential data.
Neural Language Models
Neural networks handle language modeling by predicting the likelihood of a sequence of words, just like statistical language models. However, unlike statistical models, neural language models use artificial neural networks, which are machine learning models inspired by the structure and function of the human brain.
Neural networks are trained on large amounts of text data to predict the probability of the next word in a sequence based on the previous words. The neural network consists of multiple interconnected nodes, or neurons, that process and transmit information through multiple layers. The input to the network is a sequence of words represented as numerical vectors, and the output is a prediction of the next word in the sequence, represented as a probability distribution over all possible words in the vocabulary.
The network is trained through a process called backpropagation, where the error between the predicted and actual output is propagated back through the network to update the weights and biases of the neurons and improve the prediction accuracy.
Now let's look at how Long Short-Term Memory (LSTM), a type of Recurrent Neural Network (RNN) usually used to handle sequential data, can be used as a language model.
LSTM Language Model
The Long Short-Term Memory (LSTM) model was introduced in the paper "Long Short-Term Memory" by Hochreiter and Schmidhuber in 1997 to address the exploding/vanishing gradient problem (EVGP) that RNNs face.
The occurrence of the Exploding/Vanishing Gradient Problem (EVGP) is due to the nature of the backpropagation through time (BPTT) algorithm for training RNNs. This algorithm requires computing the product of hidden state gradients over a large number of steps, which can lead to either an exponentially small or large result as the number of recurrent interactions increases.
LSTMs are designed to overcome this shortcoming. The LSTM model works by using gates (input, forget, and output gates ) to control the flow of information and preserve the memory of the network over a long sequence of inputs. The LSTM network also has a memory cell that stores the information, and the gates control how much information is allowed to pass through.
- The input gate controls the new information to add to the memory cell. It uses a sigmoid function to determine which values from the current input should be added to the memory cell.
- The forget gate controls the amount of old information to forget. It also uses a sigmoid function to determine this.
- The memory cell stores the information from the current and previous inputs, as processed by the input and forget gates.
- The output gate controls the amount of information to output as the prediction. It uses a sigmoid function to determine this, followed by a tanh activation function to squish the values between -1 and 1.
This allows LSTM to effectively handle long sequences of data and make predictions based on the context of the entire sequence.

The cell state here is a memory cell that maintains information across time steps in a sequence, and the hidden state is the output of the LSTM unit at each time step that is passed to the next time step in a sequence.
LSTMs can be used as a language model to predict the next word or sequence of words in a sentence, given the previous context. This is achieved by training an LSTM network on a large text corpus, where the input to the network is a sequence of words, and the output is the next word in the sequence.
During training, the network learns the relationships between words and their probabilities of occurring in a specific order. Once trained, the network can generate new sentences by starting with a seed word and predicting the next word based on the learned relationships and probabilities. This process is repeated until the end of the sequence is reached, or a desired length is achieved.
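As a rough illustration (a minimal PyTorch sketch, not the exact setup from the blog linked below), an LSTM language model boils down to an embedding layer, an LSTM, and a linear layer that maps hidden states to a distribution over the vocabulary, trained with a next-token cross-entropy loss:

import torch
import torch.nn as nn
import torch.nn.functional as F

class LSTMLanguageModel(nn.Module):
    def __init__(self, vocab_size, embed_dim=128, hidden_dim=256):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.fc = nn.Linear(hidden_dim, vocab_size)

    def forward(self, token_ids, state=None):
        # token_ids: (batch, seq_len) integer word indices
        embedded = self.embedding(token_ids)        # (batch, seq_len, embed_dim)
        output, state = self.lstm(embedded, state)  # (batch, seq_len, hidden_dim)
        return self.fc(output), state               # logits over the vocabulary

# Training objective: predict the next token at every position.
model = LSTMLanguageModel(vocab_size=10_000)
tokens = torch.randint(0, 10_000, (8, 32))           # a dummy batch of token ids
logits, _ = model(tokens[:, :-1])                    # inputs: all but the last token
loss = F.cross_entropy(
    logits.reshape(-1, 10_000), tokens[:, 1:].reshape(-1)  # targets shifted by one
)
loss.backward()                                      # backpropagation through time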
Although RNN-based language models can handle sequential data better than statistical language models, they still struggle with EVGP (including LSTMs and GRUs, even though they can handle it far better than vanilla RNNs), especially when lengthy sequences are used. Also, since RNNs are sequential models (sentences must be processed token by token), it's harder to parallelize the training/inference process.
If you want to learn how to implement LSTMs for text generation, check out this blog by Shivam Bansal and follow along using this notebook.
Transformer Language Models
A Transformer is a neural network architecture designed by researchers at the University of Toronto and Google in 2017. Transformers address many of the limitations that we saw in RNN-based language models.
Transformer architectures achieve this by introducing the Attention mechanism, which allows the model to process the entire sequence in parallel and avoid the sequential processing and gradient flow of RNNs, which can result in the EVGP. Additionally, transformer models use layer normalization, which helps to prevent the exploding and vanishing gradient problem by normalizing the activations of each layer in the network.

Image from arXiv:1706.03762 Attention Is All You Need
Attention can be further classified into self-attention and cross-attention.
Self-attention is a mechanism that allows each token in a sequence to weigh the contribution of all other tokens in the sequence in a computationally efficient manner. It works in the following steps:
- Calculate the Key, Query, and Value vectors for each token in the input sequence. These vectors are created by multiplying the token embedding vector by the weight vectors that we train during the training process.
- Calculate a score for each token by taking the dot product of the query vector with the key vector of the word we are scoring against, and divide it by the square root of the dimension of the key vector (scaled dot-product attention). A softmax over these scores gives the attention weights.
- The attention weights are then used to aggregate the value representations to produce an attention context vector.
- The attention context vector is combined with the token representation (via a residual connection) and fed into a feed-forward network to produce the final output representation.
- The process is repeated for each token in the input sequence.
The final output representations from each token are used for prediction. In the actual implementation of Transformers, we use matrix form instead of vectors for this calculation as it speeds up the process.
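To ground these steps, here's a minimal NumPy sketch of scaled dot-product self-attention in matrix form (the W_q, W_k, W_v matrices below stand in for the learned projection weights):

import numpy as np

def scaled_dot_product_attention(Q, K, V):
    # Q, K, V: (seq_len, d_k) matrices of query, key, and value vectors
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                  # (seq_len, seq_len) scores
    scores -= scores.max(axis=-1, keepdims=True)     # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax over the keys
    return weights @ V                               # attention context vectors

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))                          # 4 token embeddings of dim 8
W_q, W_k, W_v = (rng.normal(size=(8, 8)) for _ in range(3))  # learned in practice
context = scaled_dot_product_attention(X @ W_q, X @ W_k, X @ W_v)
print(context.shape)                                 # (4, 8)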
The only difference between self-attention and cross-attention is where the inputs come from: in self-attention, the queries, keys, and values all come from the same sequence, whereas in cross-attention the queries come from one sequence while the keys and values come from another. If you're interested in implementing self-attention and cross-attention, check out this blog.
The paper took self-attention to the next level by introducing multi-headed attention, which allows the network to attend to multiple "representation subspaces" and improves the performance of the attention layers.
The input sequence is first linearly transformed into multiple queries, keys, and values, and then these representations are independently scored against each other to produce multiple attention maps. These attention maps are then concatenated and linearly transformed to produce the final attention representation, which is used to compute the final model output.
Since transformers are not recurrent (sequential) models, they aren't good at learning the order of the words in the input sequence. To solve this problem, the paper introduces Positional Encoding, which adds a vector to each embedding in the input sequence, which the model can use to learn the position of each token in a sequence.
In short, positional encoders are used to learn the order of words in a sequence. For example, if we use integers for positional encoding, "The age of transformers" would be positionally encoded as [("The",1),("age",2),("of",3),("transformers",4)] . The actual implementation of transformers uses sine and cosine functions for positional encoding as it allowed the model to learn to "attend" by relative positions easily.
PE(pos, 2i) = sin(pos / 10000^(2i / d_model))
PE(pos, 2i+1) = cos(pos / 10000^(2i / d_model))

where pos is the position of the token in the sequence, d_model is the dimension of the embedding layer's output, and i indexes the dimensions of the encoding vector.
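Here's a small NumPy sketch of the sinusoidal encoding described by these formulas (assuming d_model is even):

import numpy as np

def positional_encoding(seq_len, d_model):
    # assumes d_model is even
    positions = np.arange(seq_len)[:, None]               # (seq_len, 1)
    dims = np.arange(0, d_model, 2)[None, :]               # even dimension indices
    angles = positions / np.power(10000, dims / d_model)   # (seq_len, d_model / 2)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)                            # even dims: sine
    pe[:, 1::2] = np.cos(angles)                            # odd dims: cosine
    return pe

print(positional_encoding(seq_len=4, d_model=8).shape)      # (4, 8)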
This was a brief introduction to how a transformer works. I've concentrated only on the parts of the Transformer architecture that matter for our purposes here in exploring language modeling, and this by no means covers its full scope. The aim was to give a general understanding of the self-attention mechanism, multi-head attention, and the use of positional encoding in transformers.
Check out Jay Alammar's Illustrated Transformer blog to get a better insight into how a Transformer model works.
Since transformer models are parallelizable (i.e., they can process all the tokens in a sequence at the same time), we were able to train Large Language Models such as BERT, T5, and the GPT models. These models were trained on massive amounts of text data, allowing them to generate high-quality, human-like text. To put things in perspective, the GPT-3 model by OpenAI (used to create ChatGPT) was trained on nearly 45 TB of text data!
What are Large Language Models?
Large Language Models are trained on massive amounts of data and they have a large number of parameters, often in the order of billions. They have been shown to perform well on a wide range of NLP tasks and are considered state-of-the-art in many cases.
Some examples of large language models include OpenAI's GPT-3, Google's BERT, and Baidu's ERNIE. Large language models typically use the transformer architecture since it's easier to parallelize and more efficient.
💡
Before entering the following sections, you should know the two main steps in training Large Language Models (LLMs):
- Pre-training: Pre-training in language models refers to the initial training stage where a language model is trained on a large generic corpus of text data, usually in an unsupervised or self-supervised manner, to learn the relationships and patterns in the language. Some common pre-training techniques for language models include Causal Language Modeling and Masked Language Modeling.
- Fine-tuning: Fine-tuning in language models is the process of adjusting the parameters of a pre-trained language model to a specific task. Training the model with labeled data from downstream tasks like Text Classification, Question-Answering, and Summarization, allows the model to learn task-specific information while retaining information from the pre-training step.
We will now look at the two main pre-training language modeling methods.
Causal Language Modeling
Causal language modeling (CLM) is a language modeling method in which we train the model to predict the next word in a sequence given the context of previous words (the model attends only to the tokens on the left).
Because of this, causal language models cannot see future words. Since CLMs are good at predicting the next token in a sequence, they are used for downstream tasks like text completion and language generation. Causal language models are also known as auto-regressive models. GPT-2 was trained with a causal language modeling objective and is powerful at predicting the next word in a sequence.
Here's a quick example of CLM using the HuggingFace pipeline. The pipeline function takes the name of the task you want to perform as an argument and automatically downloads the appropriate pre-trained model and tokenizer from the HuggingFace model hub. You can then pass in your text data to the pipeline and receive the output of the pre-trained model for that task.
In this example, we use the text-generation pipeline, a prompt as the input, and we then generate text using generator(prompt) function.
from transformers import pipeline

prompt = "Writer's block is a condition, in which an author is either unable to produce new work or experiences a creative slowdown."
generator = pipeline(task="text-generation")
generator(prompt)
Output:
[{'generated_text': "Writer's block is a condition, in which an author is either unable to produce new work or experiences a creative slowdown. It's not necessarily a bad thing to have a block being placed on you when writing a book. In fact, the"}]
Masked Language Modeling
Masked Language Modeling (MLM) is a language modeling method where a model is trained to predict a masked or hidden word in a sentence or a text corpus. The goal of MLM is to enable the model to learn contextual relationships between words and to generate more realistic language.
In masked language modeling, a certain percentage of tokens in the input text is randomly masked or replaced with a special token, such as [MASK], during the pre-training stage. The model is then trained to predict the original token based on the context of the surrounding words. For example, given the sentence "The cat sat on the [MASK]," the model must predict that the masked word is "mat," "chair," or some other appropriate word based on the context of the sentence.
Masked language modeling is commonly used in pre-training large language models such as BERT (Bidirectional Encoder Representations from Transformers). BERT is trained using a masked language modeling objective where 15% of the input tokens are randomly masked. The model is trained to predict the original token based on the context of the surrounding words.
Here's an example of MLM using the HuggingFace pipeline. We use the fill-mask pipeline here.
from transformers import pipeline

text = "Writer's block is a condition, primarily associated with <mask>, in which an author is either unable to produce new work or experiences a creative slowdown."
fill_mask = pipeline(task="fill-mask")
preds = fill_mask(text, top_k=1)
preds = [
    {
        "score": round(pred["score"], 4),
        "token": pred["token"],
        "token_str": pred["token_str"],
        "sequence": pred["sequence"],
    }
    for pred in preds
]
preds
Output:
[{'score': 0.0907,'token': 6943,'token_str': ' depression','sequence': "Writer's block is a condition, primarily associated with depression, in which an author is either unable to produce new work or experiences a creative slowdown."}]
What's HuggingFace?
HuggingFace is a popular NLP library that provides a high-level API for performing various NLP tasks. They offer a large collection of pre-trained models for different NLP tasks such as sentiment analysis, text classification, named entity recognition, and question-answering, among others.
Additionally, HuggingFace provides an easy-to-use pipeline API for applying these pre-trained models to new data, as well as an open-source library called Transformers that allows developers to easily build and fine-tune their own models. I highly recommend their course, which covers almost everything you need to know about transformers and their various other libraries.
💡
Evaluation Metrics for Language Models
In general, language model evaluation metrics can be categorized into three types: generic metrics, task-specific metrics, and dataset-specific metrics.
Generic Metrics
These metrics are used to evaluate the overall performance of a language model independent of a specific task or dataset. Some common generic metrics for language models include:
- Cross-entropy: A measure of the difference between the predicted probability distribution and the true distribution of the target sequence.
- Perplexity: Perplexity measures how well a language model predicts a sequence of tokens. It is calculated as the inverse probability of the test set, normalized by the number of words; the lower the perplexity, the better the language model is performing. The formula for perplexity is:

  PP(W) = P(w1, w2, ..., wN)^(-1/N)

  which is equivalent to the exponential of the average per-token cross-entropy.
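As a quick sanity check of that relationship, perplexity is simply the exponential of the average per-token cross-entropy, which is why the fine-tuning examples later in this article report math.exp(eval_loss):

import math

# Perplexity is the exponential of the average per-token cross-entropy loss.
def perplexity(avg_cross_entropy: float) -> float:
    return math.exp(avg_cross_entropy)

print(perplexity(3.5))  # ~33.1: roughly as uncertain as a uniform choice over ~33 tokens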
Task-Specific Metrics
These metrics are used to evaluate the performance of a language model on a specific task. For example, tasks like Machine Translation and Named Entity Recognition have specific metrics to evaluate models, such as BLEU and its variations, ROUGE, MAUVE, and other metrics for text generation.
Dataset-Specific Metrics
These metrics are used to evaluate a language model's performance on a particular dataset. For example, in the case of the GLUE benchmark, the evaluation metrics include accuracy, F1 score, and Matthews correlation coefficient.
Applications of Language Modeling
Language modeling has numerous applications in NLP, such as text classification, named entity recognition, machine translation, and question-answering. With the rise of transformers and open-source pre-trained models, fine-tuning a model for a downstream task with accurate results is not that difficult.
By fine-tuning a pre-trained model on a specific downstream task, we can leverage the pre-trained model's knowledge and transfer its learned representations to the task at hand. This often results in better performance on the downstream task, as the pre-trained model has already learned useful patterns and features from the vast amounts of data it was trained on.
I won't spend much time on how these tasks work, as that is not our main objective, and using the HuggingFace pipeline is pretty simple (all you need to do is change the task in the pipeline API). If you want to learn how these pipelines work, check out this doc from HuggingFace.
We will learn to create a custom pipeline and deploy it using Gradio for question-answering in the Examples section below!
Here are some popular downstream applications of language modeling:
Token Classification
Token classification is a natural language understanding task that assigns labels to specific tokens in a text. Named entity recognition and part-of-speech tagging are two common subtasks of token classification.
NER models can be trained to recognize entities like names, dates, and locations in a given text, while PoS tagging is used to identify parts of speech such as verbs, nouns, and punctuation. Check out this task page of HuggingFace for an in-depth understanding of token classification.
from transformers import pipeline

classifier = pipeline(task="ner")
preds = classifier("Language Modeling is amazing.")
preds = [
    {
        "entity": pred["entity"],
        "score": round(pred["score"], 4),
        "index": pred["index"],
        "word": pred["word"],
        "start": pred["start"],
        "end": pred["end"],
    }
    for pred in preds
]
print(*preds, sep="\n")
Text Classification
Text classification is a natural language processing task in which a piece of text is assigned to one or more predefined categories (classes or labels). The goal of text classification is to automatically classify text documents into classes based on their content.
Examples of text classification tasks include sentiment analysis, topic categorization, spam detection, and language identification. The output of a text classification model is a label or a set of labels that represents the category or categories to which the input text belongs.
from transformers import pipeline

classifier = pipeline(task="sentiment-analysis")
preds = classifier("I love this movie, loved every part of it!")
preds = [{"score": round(pred["score"], 4), "label": pred["label"]} for pred in preds]
preds
Question-Answering
Question answering models aim to answer a question based on the input text provided, which can be helpful for tasks like document search and retrieval. These models can retrieve the answer to a given question from a given text (Extractive QA and Open Generative QA) and can even generate answers without any contextual information (Closed Generative QA).
Models like BERT (Encoder only) are great at handling Extractive QA and encoder-decoder models like T5 and BART are good at Generative QA.
from transformers import pipeline

question_answerer = pipeline(task="question-answering")
preds = question_answerer(
    question="What is my name?",
    context="I am Madhana and I come from India",
)
print(
    f"score: {round(preds['score'], 4)}, start: {preds['start']}, end: {preds['end']}, answer: {preds['answer']}"
)
Translation
Translation in NLP refers to the process of translating text from one language to another using the encoder and decoder architecture. We first encode the source sentence into a fixed-length vector representation and then use a decoder to generate the target sentence word-by-word based on that representation.
from transformers import pipeline

text = "translate English to French: My name is Madhana and I love learning new things."
translator = pipeline(task="translation", model="t5-small")
translator(text)
Summarization
Summarization refers to the process of reducing a lengthy text document to a shorter version while retaining the most important information. The goal of text summarization is to produce a condensed and coherent summary that captures the key ideas and information in the original document.
from transformers import pipeline

summarizer = pipeline(task="summarization", model="t5-small")
summarizer(
    "Language is a powerful tool that allows us to communicate, collaborate and connect with each other. It shapes how we perceive and understand the world around us. Historians like Yuval Noah Harari, suggest that language is not just a tool for communication but also a tool for creating shared realities. Early humans created shared myths, stories, and beliefs that allowed us to form societies as we know them. Over the years as technology has advanced, it has had a profound impact on language. The rise of the internet in the 21st century has made it easier for people to communicate and share knowledge, later the advancements in technologies such as Natural Language Processing (NLP) has made it possible for machines to understand and generate human language, which is used in various applications such as machine translation, summarization, chatbots, sentiment analysis and many more.",
    max_length=50,
)
Note: I have used a separate notebook for the example below, please use this notebook to follow along!
💡
Example: Language Modeling and Question-Answering using HuggingFace and W&B
In this example, we will learn how to use a HuggingFace pre-trained model for both the Causal Language Modeling and Masked Language Modeling tasks, and how the Masked Language Model can be further fine-tuned for a Question-Answering downstream task.
Causal and Masked Language Modeling
In this sub-section, we'll see how to load and pre-process the data for language modeling tasks using HuggingFace datasets and AutoTokenizer (from HuggingFace transformers). Then we will train our model using the Trainer API and finally evaluate our model using Perplexity and push our fine-tuned model to the HuggingFace Model Hub.
Step One: Install Necessary Libraries
As always, we will start our script by installing all the necessary packages. This includes datasets for data collection, transformers for tokenizers, trainers, and pipeline for inference. Then install git-lfs (git for large file storage).
# Install Hugging Face Transformers, Datasets, Evaluate and Pipeline
!pip install transformers datasets pipeline

# Install git-lfs (git for large file storage)
!apt install git-lfs
I recommend logging into your HuggingFace account and creating a new token for your notebook with the WRITE role, as this is an important step for sharing our models with the HuggingFace community and generating results with inference API. If you don't have a HuggingFace account, you can start here :)
# *NOTE*: Create a new token for notebook with 'WRITE' role
from huggingface_hub import notebook_login

notebook_login()
Step Two: Load the Wikitext Dataset
The WikiText dataset for language modeling consists of more than 100 million tokens extracted from a group of verified articles on Wikipedia. We'll be using this data for both causal and masked language modeling. I've used the HF datasets library for downloading this dataset, and if you want to learn more about HF datasets, check out this doc.
from datasets import load_dataset

datasets = load_dataset('wikitext', 'wikitext-2-raw-v1')
Now that we have downloaded our dataset let's start our data preprocessing steps for Causal Language Modeling.
Step Three: Perform Causal Language Modeling
To perform CLM, we need to tokenize the texts in our dataset and then concatenate them. Then we need to split this tokenized data into smaller chunks of a certain length, i.e., block_size (so that our model receives fixed length inputs with padding if necessary). Then we can start our model-building process.
Step 1: Tokenize our dataset
We will be using the distilgpt2 model as our pre-trained model for CLM; this is going to be our model_checkpoint (think of checkpoints as gateways to using any pre-trained work, be it a tokenizer or a model). You can use the model checkpoints in this list as well.
Use the AutoTokenizer class to download a pre-trained tokenizer to tokenize all the texts using the same vocabulary that was used during the training of the pre-trained model.
Note: Whenever you're using a pre-trained model, make sure to use the same tokenizer that was used to train the pre-trained model.
💡
Then we create a function for tokenizing our text and use the map method from the datasets library to apply the tokenizer function to all our texts.
from transformers import AutoTokenizer

model_checkpoint = "distilgpt2"
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint, use_fast=True)

def tokenize_function(examples):
    return tokenizer(examples["text"])

tokenized_datasets = datasets.map(tokenize_function, batched=True, num_proc=4, remove_columns=["text"])
Then we define a function to split our tokenized data into chunks (our block size here is 128). We duplicate the inputs for our label data; we don't need to manually shift the labels (in CLM we will be predicting the next token in the sentence, so our labels are the same as the inputs shifted to the right) because the transformers library takes care of this for us.
Then again, we use the datasets map method to create our tokenized dataset.
# block_size = tokenizer.model_max_length
block_size = 128

def group_texts(examples):
    # Concatenate all texts.
    concatenated_examples = {k: sum(examples[k], []) for k in examples.keys()}
    total_length = len(concatenated_examples[list(examples.keys())[0]])
    # We drop the small remainder.
    total_length = (total_length // block_size) * block_size
    # Split by chunks of max_len.
    result = {
        k: [t[i : i + block_size] for i in range(0, total_length, block_size)]
        for k, t in concatenated_examples.items()
    }
    result["labels"] = result["input_ids"].copy()
    return result

lm_datasets = tokenized_datasets.map(
    group_texts,
    batched=True,
    batch_size=1000,
    num_proc=4,
)
Step 2: Model building using Trainer and Training Arguments class
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(model_checkpoint)
Now we need to instantiate our Trainer class, and we use training_args to pass our hyperparameters. Then using trainer.train(), we start our model training.
from transformers import Trainer, TrainingArguments

model_name = model_checkpoint.split("/")[-1]

training_args = TrainingArguments(
    f"{model_name}-finetuned-wikitext2",
    evaluation_strategy="epoch",
    learning_rate=2e-5,
    weight_decay=0.01,
    push_to_hub=True,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=lm_datasets["train"],
    eval_dataset=lm_datasets["validation"],
)

trainer.train()
Step 3: Evaluate and Save
Then we use the trainer.evaluate() method to calculate the perplexity of our language model.
import math

eval_results = trainer.evaluate()
print(f"Perplexity: {math.exp(eval_results['eval_loss']):.2f}")
Finally, we can share our model with the HF community and save our model using the method below.
trainer.push_to_hub()
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("Madhana/distilgpt2-finetuned-wikitext2")
from transformers import pipeline

model_checkpoint = "Madhana/distilgpt2-finetuned-wikitext2"
clm_model = AutoModelForCausalLM.from_pretrained(model_checkpoint)
tokenizer = AutoTokenizer.from_pretrained("distilgpt2", use_fast=True)

# Build the custom text-generation pipeline (assumed step; the example below refers to it as txt_gen)
txt_gen = pipeline("text-generation", model=clm_model, tokenizer=tokenizer)
We can now use the custom pipeline we created to generate text by using a prompt as the input to the pipeline.
prompt = "Language Modeling is a task in NLP"preds = txt_gen(prompt)print(preds)#Output: [{'generated_text': 'Language Modeling is a task in NLP literature, and was originally developed by a number of researchers, especially in the field of artificial intelligence and AI, in the late 1970s and early 1980s, or in the 1960s and early 1990s'}]
Masked Language Modeling
For training a Masked Language Model, we'll be using the same preprocessing steps as causal language models and then randomly mask some tokens in our data by replacing them with [MASK] tokens and adjust our labels to only include the masked tokens, as we don't need to predict the unmasked tokens.
I've used distilroberta-base model for this example. You can use the model checkpoints in this list as well.
from transformers import AutoTokenizer

model_checkpoint = "distilroberta-base"
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint, use_fast=True)

tokenized_datasets = datasets.map(tokenize_function, batched=True, num_proc=4, remove_columns=["text"])

lm_datasets = tokenized_datasets.map(
    group_texts,
    batched=True,
    batch_size=1000,
    num_proc=4,
)
The last step is to use a special data_collator function that batches the samples into tensors. We need to perform random masking, which we could do as a pre-processing step, but this would always mask the tokens in the same way at each epoch.
To ensure that the masking is randomized for each epoch, we do this step inside the data_collator function. The library offers a DataCollatorForLanguageModeling that performs the random masking for us, and we can adjust the masking probability by setting a parameter.
from transformers import AutoModelForMaskedLM

model = AutoModelForMaskedLM.from_pretrained(model_checkpoint)

model_name = model_checkpoint.split("/")[-1]
training_args = TrainingArguments(
    f"{model_name}-finetuned-wikitext2",
    evaluation_strategy="epoch",
    learning_rate=2e-5,
    weight_decay=0.01,
    push_to_hub=True,
)

from transformers import DataCollatorForLanguageModeling

data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)
Instantiate our Trainer class and use trainer.train() to train our model.
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=lm_datasets["train"],
    eval_dataset=lm_datasets["validation"],
    data_collator=data_collator,
)

trainer.train()
Once again, we evaluate our model using perplexity and push it to the hub so that we can use it again.
eval_results = trainer.evaluate()
print(f"Perplexity: {math.exp(eval_results['eval_loss']):.2f}")

trainer.push_to_hub()
Question-Answering with pre-trained models using HuggingFace and W&B
Next, we'll explore how to use the saved Masked Language Model to build an extractive question-answering system. Wait, what's an extractive QA system?
As we saw earlier, Question-Answering systems can be of two types:
- Extractive: The answers will be extracted from a given context. For example, if we provide the context: I'm Madhana and I'm from India and ask the question: Who am I?, the QA system should extract the answer from the context we provided and answer Madhana.
- Abstractive: Abstractive question-answering is used to produce answers for open-ended questions. Unlike extractive question answering, which involves selecting an answer from a given set of text passages, abstractive question answering creates answers by synthesizing information from multiple sources like document stores.
Step 1: W&B Set-Up
First, we need to integrate W&B with our language modeling workflow. W&B can help us track our experiment, version our datasets and models, and perform hyperparameter optimization.
If you're new to Weights & Biases, you will have to create a W&B account first. I recommend going through this short Quickstart guide as it takes only 5 mins, and honestly, using tools like these to track any ML workflow saves a lot of time as we don't have to worry about losing our data or model as we iterate upon our workflow.
After setting up our account, you'll have to log in to the W&B library in your notebook. This step requires your API key, which you can find here.
# install weights and biases
!pip install -qq wandb

import wandb
wandb.login()
Step 2: Load the SQuAD Dataset
The Stanford Question Answering Dataset (SQuAD) is a reading comprehension dataset. The dataset consists of a series of questions that were asked about various Wikipedia articles. The best part? The answer to each question is a specific segment of text that can be found in the corresponding reading passage. We will be using this dataset for our QA system.
from datasets import load_dataset

squad = load_dataset("squad")
# Create a wandb run
run = wandb.init(project='Language-Models')

# Convert the SQuAD train split to a pandas DataFrame for logging
# (assumed step; the original notebook defines SQuAD_train)
SQuAD_train = squad["train"].to_pandas()

# Create a W&B Table and log 10 random rows of the dataset to explore
table = wandb.Table(dataframe=SQuAD_train.sample(10))

# Log the Table to your W&B workspace
wandb.log({'train_dataset': table})

# Close the wandb run
wandb.finish()
Step 3: Data Preprocessing
Use the same Tokenizer that we used for our Masked Language Model.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("distilroberta-base", use_fast=True)
We know that the answer to our question must be a part of the context, and as you can see in our dataset, some contexts are extremely long and exceed the model's maximum input length. The tokenizer would truncate them, and we could lose the part of the context that contains the answer, leaving us with no labels for our model. How do we solve this issue?
Keep the truncated part as a separate feature instead of discarding it. We do this by setting return_overflowing_tokens = True inside the Tokenizer function.
With the offset mapping (return_offsets_mapping=True), we can determine the start and end tokens of the answer. We can use the sequence_ids method to determine which part of the offsets corresponds to the question and which corresponds to the context.
def preprocess_function(examples):
    questions = [q.strip() for q in examples["question"]]
    inputs = tokenizer(
        questions,
        examples["context"],
        max_length=384,
        truncation="only_second",
        return_offsets_mapping=True,
        padding="max_length",
    )

    offset_mapping = inputs.pop("offset_mapping")
    answers = examples["answers"]
    start_positions = []
    end_positions = []

    for i, offset in enumerate(offset_mapping):
        answer = answers[i]
        start_char = answer["answer_start"][0]
        end_char = answer["answer_start"][0] + len(answer["text"][0])
        sequence_ids = inputs.sequence_ids(i)

        # Find the start and end of the context
        idx = 0
        while sequence_ids[idx] != 1:
            idx += 1
        context_start = idx
        while sequence_ids[idx] == 1:
            idx += 1
        context_end = idx - 1

        # If the answer is not fully inside the context, label it (0, 0)
        if offset[context_start][0] > end_char or offset[context_end][1] < start_char:
            start_positions.append(0)
            end_positions.append(0)
        else:
            # Otherwise it's the start and end token positions
            idx = context_start
            while idx <= context_end and offset[idx][0] <= start_char:
                idx += 1
            start_positions.append(idx - 1)

            idx = context_end
            while idx >= context_start and offset[idx][1] >= end_char:
                idx -= 1
            end_positions.append(idx + 1)

    inputs["start_positions"] = start_positions
    inputs["end_positions"] = end_positions
    return inputs

tokenized_squad = squad.map(preprocess_function, batched=True, remove_columns=squad["train"].column_names)
from transformers import DefaultDataCollator

data_collator = DefaultDataCollator()
Step 4: Model Building
Use the Masked Language Model checkpoint that is pushed to the HuggingFace hub.
from transformers import AutoModelForQuestionAnswering, TrainingArguments, Trainer

model_checkpoint = "Madhana/distilroberta-base-finetuned-wikitext2"
model_name = model_checkpoint.split("/")[-1]
model = AutoModelForQuestionAnswering.from_pretrained(model_checkpoint)
Instantiate our model Trainer.
training_args = TrainingArguments(
    f"{model_name}-SQuAD-qa-WandB2",
    evaluation_strategy="epoch",
    learning_rate=2e-5,
    num_train_epochs=2,
    weight_decay=0.01,
    push_to_hub=True,
    report_to='wandb',
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_squad["train"],
    eval_dataset=tokenized_squad["validation"],
    tokenizer=tokenizer,
    data_collator=data_collator,
)

trainer.train()
Step 5: Evaluate and Save
As usual, let's evaluate our model using perplexity and push it to the hub so that we can reuse it.
eval_results = trainer.evaluate()
print(f"Perplexity: {math.exp(eval_results['eval_loss']):.2f}")

trainer.push_to_hub()
Create a custom QA pipeline for our Gradio demo below:
from transformers import pipeline

model_checkpoint = "Madhana/distilroberta-base-finetuned-wikitext2-SQuAD-qa-WandB2"
new_model = AutoModelForQuestionAnswering.from_pretrained(model_checkpoint)
tokenizer = AutoTokenizer.from_pretrained("distilroberta-base", use_fast=True)

# Pass the model and tokenizer as keyword arguments so they are bound correctly
qa = pipeline("question-answering", model=new_model, tokenizer=tokenizer)
Step 6: Use Gradio for Inference and Sharing
With Gradio, we can build user-friendly web interfaces to demonstrate our ML models and use the HF Spaces to host and share our Gradio web apps for free permanently!
import gradio as gr

demo = gr.Blocks()

with demo:
    gr.Markdown("Language Model QA Demo")
    with gr.Tabs():
        with gr.TabItem("Question Answering"):
            with gr.Row():
                qa_input = gr.Textbox(label="Input Text")
                qa_context = gr.Textbox(label="Input Context")
            qa_output = gr.Textbox(label="Output")
            qa_button = gr.Button("Answer")
            qa_button.click(qa, inputs=[qa_input, qa_context], outputs=qa_output)

demo.launch()  # share=True
Hyperparameter Search using Sweeps (additional)
If you want to find the best hyperparameters for your language model, Weights & Biases Sweeps make the search easy. In this section, we'll see how to do this.
Disclaimer: Since I have limited compute resources, I limited my parameter search window a lot, too. Due to this, the model that I trained without any Sweeps performs better than the ones I trained using Sweeps (don't judge me, it's super frustrating to use Google Colab's free tier for training LLMs). Nevertheless, using W&B Sweeps is very easy, and it's worth giving it a shot if you're trying to optimize your language models!
💡
Step 1: Create Sweep Config
The sweep_config is a nested dictionary where we set our hyperparameter search strategy (grid or random search) and range of values for our hyperparameters.
sweep_config = {"name" : "Language-Models",'method': 'random',"parameters": {'epochs': {'value': 1},'batch_size': {'values': [8, 16]},'learning_rate': {'distribution': 'log_uniform_values','min': 1e-4,'max': 1e-3},'weight_decay': {'values': [0.1, 0.2]},}}sweep_id = wandb.sweep(sweep_config, project='Language-Models')
Step 2: Train Function
We then define a train function that contains our Trainer and training arguments which sets the hyperparameters based on our sweep config.
from transformers import TrainingArguments, Trainer

def train(config=None):
    with wandb.init(config=config):
        # set sweep configuration
        config = wandb.config

        # set training arguments
        training_args = TrainingArguments(
            f"{model_name}-SQuAD-QA-WandB2",
            report_to='wandb',
            num_train_epochs=config.epochs,
            learning_rate=config.learning_rate,
            weight_decay=config.weight_decay,
            save_strategy='epoch',
            evaluation_strategy='epoch',
            logging_strategy='epoch',
            load_best_model_at_end=True,
            push_to_hub=True,
        )

        trainer = Trainer(
            model=model,
            args=training_args,
            train_dataset=tokenized_squad["train"],
            eval_dataset=tokenized_squad["validation"],
            tokenizer=tokenizer,
            data_collator=data_collator,
        )

        # start training loop
        trainer.train()
Finally, we call wandb.agent to start our hyperparameter search.
wandb.agent(sweep_id, train, count=10)
Step 3: Visualize
The best part of using Weights & Biases for hyperparameter search is their amazing dashboards which offer a variety of visualization tools and custom charts. Below, you can find a Parallel Coordinates plot, which is very useful in visualizing the best pair of hyperparameter values. The second chart is a Parameter Importance chart, which shows the importance of a feature with respect to a certain metric (evaluation loss in our case).
Summary
In this blog, we saw how language models have evolved over time and how Transformer-based models are able to produce state-of-the-art (SOTA) results on language modeling tasks. We also learned how to implement language models with the help of HuggingFace and W&B.
I personally wanted to make sure to justify the title of this report (to help beginners master language models). Although I'm pretty satisfied with my work, I was still not able to showcase many other works and current research topics that revolve around language models, such as language model hallucination and research papers like DetectGPT and Watermark for LLMs, among many more.
Since I can't condense all these topics within this blog, and practically can't update it every time something amazing happens in the field of NLP, I want to share some great resources that might help our readers stay up to date in the NLP field. Happy learning!
References
- Application of Long Short-Term Memory (LSTM) Neural Network for Flood Forecasting - Scientific Figure on ResearchGate. Available from: https://www.researchgate.net/figure/The-structure-of-the-Long-Short-Term-Memory-LSTM-neural-network-Reproduced-from-Yan_fig8_334268507 [accessed 29 Jan, 2023]
Recommended Reads