An Introduction To HuggingFace Transformers for NLP
In this article, we learn all about the history and utility of HuggingFace, the transformer models that made them a household name, and how you can use them with W&B
In this article, we will be covering what HuggingFace is, how and why it came to exist, and how to best utilize it for your machine learning workflows. We'll also touch on the different use cases covered by the various HuggingFace packages in the ecosystem.
Here's what we'll cover:
- What Is HuggingFace?
- A Brief History of HuggingFace
- What Are Transformers in Machine Learning?
- What Is HuggingFace Used For?
- Common NLP Tasks on HuggingFace
- Sequence Classification
- Question Answering
- Language modeling
- Summarization
- Translation
- An Interview with HuggingFace CEO
- What Does Weights & Biases Have To Do With HuggingFace?
- Logged Model as Artifact
- Conclusion
Want to get coding right away? Follow our integrations link to productionize your HF models: https://docs.wandb.ai/guides/integrations/huggingface
What Is HuggingFace?
HuggingFace is a large open-source community that builds tools enabling users to build, train, and deploy machine learning models based on open-source code and technologies. HuggingFace makes it easy to share tools, models, model weights, and datasets with other practitioners via its toolkit.
It's best known for its transformers library, which exposes an intuitively designed Python API for leveraging state-of-the-art deep learning architectures for common natural language processing (NLP) tasks.
A Brief History of HuggingFace
Founded in 2016, HuggingFace (named after the popular emoji 🤗) started as a chatbot company and later transformed into an open-source provider of NLP technologies. The company's chatbot, aimed at a teenage demographic, was focused on:
(...) building an AI so that you’re having fun talking with it. When you’re chatting with it, you’re going to laugh and smile — it’s going to be entertaining - Clem Delangue, CEO & Co-founder
Like a Tamagotchi, the chatbot could talk coherently about a wide range of topics, detect emotions in text, and adapt its tone accordingly.
Underlying this chatbot, however, were HuggingFace's main strengths: in-house NLP models (one of which was called Hierarchical Multi-Task Learning, or HMTL) and a managed library of pre-trained NLP models. These would serve as the early backbone of the transformers library we know today.
The early PyTorch transformers library established compatibility between PyTorch and TensorFlow 2.0, which enabled users to move easily from one framework to the other during the life of a model. Following the release of Google's “Attention Is All You Need” paper and the resulting shift to transformers in the NLP space, HuggingFace, which had already released parts of the powerful library powering its chatbot as an open-source project on GitHub, began to focus on bringing popular large language models such as BERT and GPT to PyTorch.
With its most recent Series C funding round valuing the company at $2 billion, HuggingFace currently offers an ecosystem of models and datasets spread across its various tools like HuggingFace Hub, transformers, diffusers, and more.
What Are Transformers in Machine Learning?
In terms of NLP, transformers are language models that have been trained on large amounts of text in a self-supervised fashion. Self-supervised learning is a type of training in which the learning objective is derived automatically from the input data itself, so no human-labeled data is needed.
The famous paper “Attention Is All You Need” proposed the transformer neural network with a novel architecture that aims to solve sequence-to-sequence tasks, which are complex language problems like translation, question answering, and chatbot creation, all while managing long-range dependencies.
Long-range dependencies were a recurring issue with the popular pre-existing models of the time, which were based on RNN and LSTM architectures. A transformer model instead works by tracking the relations between the different words found in a given sentence.
A transformer model uses an encoder-decoder architecture to achieve this (a minimal code sketch follows the list below):

- The encoder receives the inputs and iteratively processes them to build information about which parts of the input are relevant to each other. The model is optimized to obtain the best possible understanding of the input.
- The decoder generates a target sequence using the representation from the encoder, using that contextual information to produce the outputs.
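To make that split concrete, here is a minimal sketch using PyTorch's built-in nn.Transformer module. Note this is plain PyTorch rather than HuggingFace code, and the shapes and hyperparameters are purely illustrative:

import torch
import torch.nn as nn

# A bare encoder-decoder transformer: 6 encoder and 6 decoder layers,
# 512-dimensional representations, 8 attention heads (the "Attention Is
# All You Need" defaults).
model = nn.Transformer(d_model=512, nhead=8, num_encoder_layers=6, num_decoder_layers=6)

# Dummy, already-embedded source and target sequences with shape
# (sequence_length, batch_size, d_model).
src = torch.rand(10, 32, 512)  # e.g. the input sentence
tgt = torch.rand(9, 32, 512)   # e.g. the target tokens generated so far

# The encoder builds a representation of src; the decoder attends to that
# representation while producing the target sequence.
out = model(src, tgt)          # shape: (9, 32, 512)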
The key to this structure for transformers is the addition of the "attention" and "self-attention" mechanisms.

As computers can only understand numerical values, a data transformation step that converts text into numbers must be applied at the beginning of most transformer models. In the deep learning world, this process is called word embedding.
Word embeddings represent each word as a real-valued vector that encodes the word's meaning, such that words closer together in the vector space are expected to be similar in meaning.
One thing missing from the model input, as we have described it so far, is a way to account for the order of the words in the input sequence. To address this, the transformer adds a vector to each input embedding. These vectors follow a specific pattern that the model learns, which helps it determine the position of each word or the distance between different words in the sequence.
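As a rough illustration of these two steps, here is a sketch that combines learned token embeddings with the sinusoidal positional encodings from the original paper. The vocabulary size, dimensions, and token ids are made up for the example; this is not HuggingFace internals:

import math
import torch
import torch.nn as nn

vocab_size, d_model, max_len = 10_000, 512, 128  # illustrative sizes

# Word embeddings: one learned d_model-dimensional vector per token id.
embedding = nn.Embedding(vocab_size, d_model)

# Sinusoidal positional encodings from "Attention Is All You Need":
# PE(pos, 2i) = sin(pos / 10000^(2i/d_model)), PE(pos, 2i+1) = cos(...).
position = torch.arange(max_len).unsqueeze(1)  # (max_len, 1)
div_term = torch.exp(torch.arange(0, d_model, 2) * (-math.log(10000.0) / d_model))
pos_encoding = torch.zeros(max_len, d_model)
pos_encoding[:, 0::2] = torch.sin(position * div_term)
pos_encoding[:, 1::2] = torch.cos(position * div_term)

# A toy "sentence" of 6 token ids; the model input is the sum of each
# token's embedding and the positional encoding for its position.
token_ids = torch.tensor([[17, 4, 256, 891, 3, 42]])  # (batch, seq_len)
model_input = embedding(token_ids) + pos_encoding[: token_ids.size(1)]
print(model_input.shape)  # torch.Size([1, 6, 512])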

After converting our data into a more understandable format, the embedded data is passed into the next layer, known as the self-attention layer.
By utilizing self-attention, a transformer can detect distant relations in the data and mitigate vanishing gradients, meaning a transformer model can still learn the relationship between two related words even when they are far apart in a given context.
The self-attention process captures how relevant a specific word is in relation to its neighboring words in a given sentence. This relation is then expressed as what we call an attention vector.

In the self-attention layer, three additional types of vectors are created for each input: query, key, and value vectors, each produced by multiplying the input vector by a learned weight matrix. The queries are scored against the keys, and those scores are used to compute a weighted combination of the values.
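Here is a minimal sketch of that idea for a single attention head. This is illustrative PyTorch, not the exact implementation inside any HuggingFace model:

import math
import torch
import torch.nn as nn

d_model = 512
x = torch.rand(1, 6, d_model)  # a batch of one 6-token embedded input

# Learned projections that turn each input vector into query, key, and value vectors.
to_q = nn.Linear(d_model, d_model)
to_k = nn.Linear(d_model, d_model)
to_v = nn.Linear(d_model, d_model)

q, k, v = to_q(x), to_k(x), to_v(x)

# Scaled dot-product attention: each token's query is scored against every key,
# softmax turns the scores into weights, and the output is a weighted sum of values.
scores = q @ k.transpose(-2, -1) / math.sqrt(d_model)  # (1, 6, 6)
weights = torch.softmax(scores, dim=-1)                 # the "attention vectors"
output = weights @ v                                    # (1, 6, 512)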

This process is then followed by a feed-forward neural network that takes in each attention output and transforms it into a form that is easier for the next layer to consume. The result is then passed to the decoder, which predicts the output of the model.
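A similarly minimal, standalone sketch of that feed-forward step. The 2048 hidden size is the value used in the original paper, and the attention output here is a random stand-in:

import torch
import torch.nn as nn

d_model = 512
attention_output = torch.rand(1, 6, d_model)  # stand-in for the self-attention output above

# Position-wise feed-forward network: expand each position to a larger hidden
# size, apply a non-linearity, then project back down to d_model.
feed_forward = nn.Sequential(
    nn.Linear(d_model, 2048),
    nn.ReLU(),
    nn.Linear(2048, d_model),
)
ffn_output = feed_forward(attention_output)   # same shape as its input: (1, 6, 512)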


This architecture has shown promising results across a variety of tasks, not only NLP. We'll save that for another article, though.
What Is HuggingFace Used For?
The HuggingFace transformers library was created to make working with complex models, with architectures reminiscent of the one above, easy, flexible, and simple through a single API. Models can be loaded, trained, and saved without any hassle.
At first, HuggingFace was used primarily for NLP use cases but has since evolved to capture use cases in the audio and visual domains.
A typical deep learning solution consists of multiple steps, from getting the data to fine-tuning a model, and this workflow is largely reusable from domain to domain.

To make this process easier, HuggingFace Transformers offers pipelines, the most basic objects in the transformers library, which perform all the pre- and post-processing steps on the given input text data. A pipeline connects a model with its required pre- and post-processing steps, so we only have to provide the input text.

Behind the scenes, the pipeline prepares the data properly without all the additional explicit steps.
The relevant code:
from transformers import pipeline

task_pipe = pipeline(f"{SELECTED_TASK}")
text = "Hello world! Hugging Face + Weights and Biases = [MASK]"
output = task_pipe(text)
However, in the case where your task is not available by default within HuggingFace or the Hub, or you want to fine-tune a model, you can easily develop your own pipelines by running:
# AutoModelForTASK should be replaced with any desired task type
from transformers import AutoTokenizer, AutoModelForTASK

tokenizer = AutoTokenizer.from_pretrained(f"{SELECTED_MODEL_TOKENIZER}")
model = AutoModelForTASK.from_pretrained(f"{SELECTED_MODEL}")

text = "Hello my friends! How are you doing today?"
# Tokenize and return PyTorch tensors so the encodings can be passed to the model
tokenized_text = tokenizer(text, return_tensors="pt")
output = model.generate(**tokenized_text)
Fine-tuning this loaded model becomes as simple as:
from transformers import TrainingArguments, Trainer

training_args = TrainingArguments(
    output_dir="test_trainer",
    evaluation_strategy="epoch",
    report_to="wandb",  # 🪄🐝
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=small_train_dataset,
    eval_dataset=small_eval_dataset,
    compute_metrics=compute_metrics,
)
trainer.train()
Below is a non-exhaustive list of popular tasks solved by models developed and hosted on HuggingFace, with example code for each task adapted from the HuggingFace documentation:

Common NLP Tasks on HuggingFace
Sequence Classification
Sequence classification is the task of classifying sequences according to a given number of classes, where a sequence can be something like a sentence of text data.
Code using pipelines
Example: identifying if a sentence is positive or negative
Input: I hate you
Output: label: NEGATIVE, with score: 0.9991
SELECTED_TASK = "sentiment_analysis"INPUT = "I hate you"## From Above ##from transformers import pipelinetask_pipe = pipeline(f"{SELECTED_TASK}")OUTPUT = task_pipe(INPUT)print(f"label: {OUTPUT['label']}, with score: {round(OUTPUT['score'], 4)}")
Example code using customization
Example: Identifying if a sentence is a paraphrase of another
Input:
- The company HuggingFace is based in New York City
- Apples are especially bad for your health
- HuggingFace's headquarters are situated in Manhattan
Output:
not paraphrase: 10%
is paraphrase: 90%
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased-finetuned-mrpc")
model = AutoModelForSequenceClassification.from_pretrained("bert-base-cased-finetuned-mrpc")

classes = ["not paraphrase", "is paraphrase"]

sequence_0 = "The company HuggingFace is based in New York City"
sequence_1 = "Apples are especially bad for your health"
sequence_2 = "HuggingFace's headquarters are situated in Manhattan"

# The tokenizer will automatically add any model specific separators (i.e. <CLS> and <SEP>) and tokens to
# the sequence, as well as compute the attention masks.
paraphrase = tokenizer(sequence_0, sequence_2, return_tensors="pt")
not_paraphrase = tokenizer(sequence_0, sequence_1, return_tensors="pt")

paraphrase_classification_logits = model(**paraphrase).logits
not_paraphrase_classification_logits = model(**not_paraphrase).logits

paraphrase_results = torch.softmax(paraphrase_classification_logits, dim=1).tolist()[0]
not_paraphrase_results = torch.softmax(not_paraphrase_classification_logits, dim=1).tolist()[0]

# Should be paraphrase
for i in range(len(classes)):
    print(f"{classes[i]}: {int(round(paraphrase_results[i] * 100))}%")

# Should not be paraphrase
for i in range(len(classes)):
    print(f"{classes[i]}: {int(round(not_paraphrase_results[i] * 100))}%")
Question Answering
Code using pipelines
Example: asking the question of what question answering is
Input:
- Context: Extractive Question Answering is the task of extracting an answer from a text given a question.
- Query: What is extractive question answering?
Output: Answer: 'the task of extracting an answer from a text given a question', score: 0.6177, start: 34, end: 95
SELECTED_TASK = "question-answering"CONTEXT = "Extractive Question Answering is the task of extracting an answer from a text given a question."QUERY = "What is extractive question answering?"## From Above ##from transformers import pipelinetask_pipe = pipeline(f"{SELECTED_TASK}")OUTPUT = task_pipe(question=QUERY, context=CONTEXT)print(f"Answer: '{result['answer']}', score: {round(result['score'], 4)}, start: {result['start']}, end: {result['end']}")
You get the gist hopefully! Try out the ones below!
Language modeling
Language modeling is the task of fitting a model to a corpus, which can be domain specific. All popular transformer-based models are trained using a variant of language modeling, e.g., BERT with masked language modeling, and GPT-2 with causal language modeling.
The popular available options that are exposed by HuggingFace currently include:
- Masked language modeling: the task of masking tokens in a sequence with a masking token and prompting the model to fill that mask with an appropriate token (a fill-mask example is sketched after this list).
- Causal language modeling: the task of predicting the token that follows a sequence of tokens. In this setting, the model only attends to the left context, i.e., the tokens to the left of the token being predicted.
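For example, masked language modeling maps directly onto the fill-mask pipeline. Here is a minimal sketch in the spirit of the snippets above; the default model picked by the pipeline and the top_k value are illustrative choices, and the mask token is taken from the pipeline's own tokenizer since it differs between models:

from transformers import pipeline
from pprint import pprint

SELECTED_TASK = "fill-mask"
task_pipe = pipeline(f"{SELECTED_TASK}")

# Use the mask token expected by whichever model the pipeline loaded
# (e.g. [MASK] for BERT-style models, <mask> for RoBERTa-style models).
INPUT = f"Hello world! Hugging Face + Weights and Biases = {task_pipe.tokenizer.mask_token}"
OUTPUT = task_pipe(INPUT, top_k=3)  # the three most likely fillers for the mask
pprint(OUTPUT)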
Text generation code using pipelines
Example: Generating additional text following some initial text
Input: Hello, I'm a language model
Output: Hello, I'm a language modeler. I write and maintain the software in Python. I love to code, and that includes coding things that require writing
## From Above ##
from transformers import pipeline
from pprint import pprint

SELECTED_TASK = "text-generation"
MODEL = "gpt2"

task_pipe = pipeline(f"{SELECTED_TASK}", model=MODEL)
INPUT = "Hello, I'm a language model"
OUTPUT = task_pipe(INPUT, max_length=30, num_return_sequences=3)
pprint(OUTPUT)
Summarization
Summarization is the task of producing a shorter version of a document while preserving its important information. Some models can extract text from the original input, while other models can generate entirely new text.
Code using pipelines
Example: reducing this long fact about France into a simple statement
Input: Paris is the capital and most populous city of France, with an estimated population of 2,175,601 residents as of 2018, in an area of more than 105 square kilometers (41 square miles). The City of Paris is the center and seat of government of the region and province of Île-de-France, or Paris Region, which has an estimated population of 12,174,880, or about 18 percent of the population of France as of 2017.
Output: "Paris is the capital and most populous city of France..."
## From Above ##
from transformers import pipeline
from pprint import pprint

SELECTED_TASK = "summarization"
task_pipe = pipeline(f"{SELECTED_TASK}")

INPUT = "Paris is the capital and most populous city of France, with an estimated population of 2,175,601 residents as of 2018, in an area of more than 105 square kilometres (41 square miles). The City of Paris is the centre and seat of government of the region and province of Île-de-France, or Paris Region, which has an estimated population of 12,174,880, or about 18 percent of the population of France as of 2017."
OUTPUT = task_pipe(INPUT)
pprint(OUTPUT)
Translation
Code using pipelines
Example: converting English text to French text
Input: How old are you?
Output: quel âge êtes-vous?
## From Above ##
from transformers import pipeline
from pprint import pprint

lang1 = "en"
lang2 = "fr"
SELECTED_TASK = f"translation_{lang1}_to_{lang2}"

task_pipe = pipeline(f"{SELECTED_TASK}")
INPUT = "How old are you?"
OUTPUT = task_pipe(INPUT)
pprint(OUTPUT)
An Interview with HuggingFace CEO
If you'd like to hear about the story of HuggingFace from a man who's seen it firsthand, we've got you covered. Clément Delangue, Co-founder and CEO of HuggingFace, joined our Gradient Dissent podcast to talk about that experience and more. Clem explains the virtuous cycles behind the creation and success of HuggingFace, and shares his thoughts on where NLP is heading.
What Does Weights & Biases Have To Do With HuggingFace?
While HuggingFace makes it straightforward to load and fine-tune models, Weights & Biases makes it easy to scale the volume and richness of your experiments.
To load and fine-tune a model with HuggingFace, it's recommended you use the provided HuggingFace trainer, which takes in your model, datasets, training arguments, and important metric computations:
from transformers import TrainingArguments, Trainer

# Load model similar to what was shown above
# model = AutoModelForTASK.from_pretrained("{SELECTED_MODEL}")
# (...)

training_args = TrainingArguments(
    output_dir="test_trainer",
    evaluation_strategy="epoch",
    report_to="wandb",  # 🪄🐝
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=small_train_dataset,
    eval_dataset=small_eval_dataset,
    compute_metrics=compute_metrics,
)
trainer.train()
Instead of needing to instrument your code with complicated logging logic to organize and store experiment details and assets properly, you can just set the report_to="wandb" flag in the TrainingArguments, and you have access to automatically logged metrics and models.
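If W&B hasn't been set up in your environment yet, the only extra steps are authenticating and, optionally, telling the integration which project to log to. A minimal sketch follows; the project name is just an example, and the environment variables are the ones the HuggingFace integration reads:

import os
import wandb

# Authenticate once per environment (you'll be prompted for an API key).
wandb.login()

# Optional: name the W&B project the Trainer's runs will be logged to
# ("hf-intro-demo" is just an illustrative name).
os.environ["WANDB_PROJECT"] = "hf-intro-demo"

# Optional: also upload the trained model as a W&B Artifact at the end of training.
os.environ["WANDB_LOG_MODEL"] = "end"

Setting WANDB_LOG_MODEL is how a trained model ends up logged as an artifact, like the one shown in the section below.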
Example
Training Dashboard
Evaluation Dashboard
Logged Model as Artifact
Logged model artifact: model-3r8axeaf, shown with its direct lineage view.
Conclusion
That's it for our brief introduction to HuggingFace! If you'd like to read more about how you can use W&B alongside HuggingFace, check out these other reports below.
How To Fine-Tune Hugging Face Transformers on a Custom Dataset
In this article, we will learn how to easily fine-tune a HuggingFace Transformer on a custom dataset with Weights & Biases.
Compare Methods for Converting and Optimizing HuggingFace Models for Deployment
In this article, we'll walk through how to convert trained HuggingFace models to slimmer, leaner models for deployment with code examples.
Recommender Systems Using Hugging Face & NVIDIA
Learn to implement a recommender system with Hugging Face and NVIDIA in this short Transformers4Rec tutorial. Includes an easy-to-follow Colab.
Unconditional Image Generation Using Hugging Face Diffusers
Training unconditional image generation models using HuggingFace Diffusers and Weights & Biases