
Transfer learning versus fine-tuning

Learn the key differences between transfer learning and fine-tuning
Transfer learning and fine-tuning are two related methods that let you make the most of pretrained models. A network trained on a large dataset, such as ImageNet, has already learned general features like edges, textures, and shapes. Transfer learning reuses those features by freezing most of the model and training only a new classification layer for your specific task. Fine-tuning goes further by unfreezing some of the pretrained layers and updating their weights so the model adapts more closely to the new data.
These approaches are valuable when you do not have massive datasets or the time to train a deep model from scratch. They can reduce computation and often improve accuracy by building on the knowledge the model already has. For example, you can take an ImageNet model and transfer it to bird classification. The frozen features provide a strong base, and fine-tuning can refine the representation to capture the details of bird species.
In this tutorial, we will show both strategies in practice. You'll see how to build a model with a frozen backbone for transfer learning and how to unfreeze a portion of layers for fine-tuning. Weights & Biases will be used to log training and validation results, allowing you to compare performance and understand when each method works best.
Now, let’s dive into the concepts of transfer learning and fine-tuning, starting with the basics of transfer learning. It’s worth noting that these two terms are often used differently depending on the source. Some people treat fine-tuning as a distinct method, while others see it as a specific form of transfer learning. The boundaries aren’t rigid, and in practice, the two approaches overlap quite a bit. What matters most is understanding how much of the pretrained model you keep fixed versus how much you allow to adapt. That is the real lever you control when reusing pretrained models.
If you're new to Weights & Biases, it's a platform that helps track your machine learning experiments. In this tutorial, we'll integrate W&B to monitor our training process and visualize the results. This will make it easier to see the benefits of transfer learning vs. fine-tuning.
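If you haven't used W&B before, the basic pattern is only three calls: wandb.init to start a run, wandb.log to record metrics, and wandb.finish to close the run. Here is a minimal sketch; the project name, config values, and logged metric are just placeholders:

import wandb

# Start a run; the project name and config values here are placeholders
wandb.init(project="my-first-project", config={"lr": 1e-3, "epochs": 5})
for epoch in range(5):
    # ... your training step would go here ...
    wandb.log({"epoch": epoch, "train_loss": 0.0})  # log real metrics instead of 0.0
wandb.finish()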

What is transfer learning?

Transfer learning is a machine learning technique where a model trained on one task is reused on a second related task. Instead of starting from scratch, you begin with a model that has learned to recognize patterns from a large dataset. For example, a model trained on ImageNet (a large collection of diverse images) learns general features like edges, textures, and shapes. Transfer learning allows us to take those learned features and apply them to a new problem — say, classifying a smaller set of animal images — which often leads to faster training and better performance than training a new model from nothing.

How does transfer learning work?

The typical transfer learning process involves the following steps:
  1. Choose a pre-trained model: Select a model that was trained on a broad task similar to your target task. Common choices include models like ResNet, VGG, or MobileNet trained on ImageNet for image problems, or BERT/GPT for language problems.
  2. Freeze the pre-trained layers: The idea is to keep the learned features intact. "Freezing" means you make the layers of the pre-trained model non-trainable, so their weights won't get updated during training on the new task.
  3. Add new layers (the head): The pre-trained model likely solved a different problem (different number of classes or outputs). So you remove its original output layer and add your own classifier/regressor layers on top, configured for your new task.
  4. Train the new layers on your dataset: Now train the model, but only the new layers' weights will adjust (since the base is frozen). The model will use the pre-learned features to extract information from your new data, and the new head will learn to map those features to the correct outputs for your task.
I will start by showing how to create a PyTorch model for transfer learning. We will build on a pre-trained ResNet, which has already learned to recognize general features from ImageNet. The backbone of the model will be frozen so that its weights do not update, and we will replace the final classification layer with a new linear head that matches the number of classes in our dataset. This way, the network acts as a feature extractor while the new head learns to map those features to the task at hand.
Here's the code for our model:
import torch
import torch.nn as nn
import torch.optim as optim
from torchvision import datasets, transforms, models
from torch.utils.data import DataLoader
import random
import numpy as np
import wandb

class TransferResNet18(nn.Module):
    def __init__(self, num_classes=10, pretrained=True):
        super().__init__()
        self.net = models.resnet18(weights=models.ResNet18_Weights.IMAGENET1K_V1 if pretrained else None)
        in_feats = self.net.fc.in_features
        self.net.fc = nn.Linear(in_feats, num_classes)
        for name, p in self.net.named_parameters():
            p.requires_grad = name.startswith("fc")

    def forward(self, x):
        return self.net(x)
Here, we freeze the entire ResNet backbone so that its convolutional layers do not change during training. Only the final fully connected classification layer is trainable. This is the standard setup for transfer learning: the pretrained model acts as a fixed feature extractor, and the new head learns to classify based on those extracted features.
With this approach, training is much faster and requires less data, since most of the parameters remain untouched. The pretrained weights carry over the general image features learned from ImageNet, and the new classification layer adapts those features to the classes in our dataset.
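As a quick sanity check (not part of the training script), you can count trainable versus total parameters to confirm that only the new head will be updated:

model = TransferResNet18(num_classes=10)
trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
print(f"trainable: {trainable:,} / total: {total:,}")
# Only the new linear head is trainable: 512 * 10 weights + 10 biases = 5,130 parameters,
# out of roughly 11 million parameters in ResNet18.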
Now, before training our model, I will discuss an alternative approach called fine-tuning.

What is fine-tuning?

Fine-tuning is like taking transfer learning a step further. After you have a model that is partially trained on your new task (usually via transfer learning, where the pre-trained base was frozen), you unfreeze some layers of the pre-trained model and continue training. This means the model can adjust the previously learned features to better fit the new data. Fine-tuning is typically done after an initial phase of training the new top layers, to avoid large, unneeded updates to the pre-trained weights at the start.

When should you fine-tune?

Fine-tuning is most beneficial when:
  • You have a larger dataset for your new task (so the model can afford to adjust more parameters without overfitting).
  • The new task is similar in nature to the original task the model was trained on (so the base features are a good starting point and just need slight refinement).
  • You need that extra boost in accuracy or performance that freezing layers alone can't achieve.
If your dataset is very small or quite different from the original dataset, fine-tuning too many layers can lead to overfitting or even degrade performance (a phenomenon known as negative transfer). In those cases, using transfer learning without fine-tuning (keeping most layers frozen) might actually work better.
How to fine-tune a model (steps):
  1. Start with a trained model from transfer learning: You should already have a model that was trained with frozen layers (as we prepared in the previous section). This model has learned to make decent predictions on your new task using the pre-trained features.
  2. Decide which layers to unfreeze: Often, you unfreeze the last few layers of the base model (the layers closest to the output). These higher-level feature layers are more specialized to the original task, so adjusting them can help them become more relevant to your new task. The lower layers (closer to input) usually capture very general features (edges, textures) and may not need much change.
  3. Unfreeze the chosen layers: Mark those layers' parameters as trainable again (in PyTorch, set requires_grad = True on them). Keep the rest of the base model frozen to avoid over-adjusting everything.
  4. Rebuild the optimizer with a lower learning rate: Fine-tuning works best with a smaller learning rate, because you want to make only slight adjustments to the pre-trained weights (which are already pretty good). A high learning rate could wreck those learned features with large gradient updates.
  5. Continue training the model: Train for a few more epochs on your dataset. Now the model will update both the new head layers and the unfrozen base layers. Monitor performance on a validation set to avoid overfitting — it's common to use early stopping or to fine-tune for only a few epochs. A minimal sketch of steps 2 through 4 follows this list.
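The sketch below reuses the TransferResNet18 class defined earlier and assumes its head has already been trained with the backbone frozen; the choice of layer4 and the specific learning rates are illustrative defaults, not fixed rules:

# Steps 2-3: unfreeze only the deepest residual block; the rest of the backbone stays frozen
model = TransferResNet18(num_classes=10)
for p in model.net.layer4.parameters():
    p.requires_grad = True

# Step 4: rebuild the optimizer with a lower learning rate for the unfrozen backbone
optimizer = optim.Adam([
    {"params": model.net.fc.parameters(), "lr": 1e-3},
    {"params": model.net.layer4.parameters(), "lr": 1e-4},
])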

Now, I will show how to define a PyTorch model for fine-tuning. Just like before, we start with a pre-trained ResNet and replace the final fully connected layer with a new one sized for our dataset. The key difference is that instead of freezing the entire backbone, we allow part of it to be trainable. In practice, this means unfreezing the deepest layers, such as layer4 in ResNet18, which contain more task-specific representations. By combining a new classification head with a partially unfrozen backbone, we let the model adapt its higher-level features to the new data while keeping the general low-level features intact. This balance gives the model flexibility to learn from your dataset without discarding the powerful features gained from large-scale pretraining.
Here's the code for our model:
class FineTuneResNet18(nn.Module):
    def __init__(self, num_classes=10, pretrained=True, pct_unfreeze=0.25):
        super().__init__()
        pct_unfreeze = float(max(0.0, min(1.0, pct_unfreeze)))
        self.net = models.resnet18(weights=models.ResNet18_Weights.IMAGENET1K_V1 if pretrained else None)
        in_feats = self.net.fc.in_features
        self.net.fc = nn.Linear(in_feats, num_classes)

        for p in self.net.parameters():
            p.requires_grad = False
        for p in self.net.fc.parameters():
            p.requires_grad = True

        group_names = ["layer4", "layer3", "layer2", "layer1", "bn1", "conv1"]
        total_groups = len(group_names)
        k = int(round(pct_unfreeze * total_groups))
        to_unfreeze = group_names[:k]

        for name, module in self.net.named_modules():
            if any(name == g or name.startswith(g + ".") for g in to_unfreeze):
                for p in module.parameters(recurse=True):
                    p.requires_grad = True

    def forward(self, x):
        return self.net(x)
This class gives us fine control over how much of the backbone we unfreeze. The code first freezes everything except the new classification layer, which ensures stability. The pct_unfreeze argument then decides how many of the six backbone groups to unlock, starting from the deepest layers and moving toward the input. For example, pct_unfreeze=0.5 unfreezes the three deepest groups (layer4, layer3, and layer2), the default of 0.25 rounds to two groups (layer4 and layer3), and at 1.0 the entire network is trainable.
This setup allows you to experiment with different amounts of fine tuning depending on your dataset size and similarity to ImageNet. Smaller datasets or very different domains often benefit from only unfreezing the last block, while larger or more related datasets can support deeper unfreezing. The idea is to strike a balance between preserving the robust pretrained features and giving the network enough flexibility to adapt to your specific task.
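If you want to see exactly what a given setting unlocks, a quick check like this (separate from the training script) prints the trainable parameter count and the unfrozen groups for a few values of pct_unfreeze:

# pretrained=False skips loading the ImageNet weights, which keeps this check fast
for pct in (0.0, 0.25, 0.5, 1.0):
    m = FineTuneResNet18(num_classes=10, pretrained=False, pct_unfreeze=pct)
    trainable = sum(p.numel() for p in m.parameters() if p.requires_grad)
    groups = sorted({n.split(".")[0] for n, p in m.net.named_parameters() if p.requires_grad})
    print(f"pct_unfreeze={pct}: {trainable:,} trainable params in {groups}")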

Recap of fine-tuning and transfer learning

Fine tuning and transfer learning sit on a spectrum of approaches for reusing pretrained models. Training from scratch means learning all weights on your dataset alone, transfer learning means freezing the base and training only a new head, and fine tuning goes further by unfreezing part of the backbone so it can adapt to your data. Each method has different tradeoffs in data needs, training time, and performance.
Let's define each approach briefly:
  • Transfer learning (feature extraction): You take a pre-trained model and freeze its weights, then add a new head to perform your task. Only the new head's weights are trained.
  • Fine-tuning: This builds on transfer learning. After initially training the new head with the base frozen, you then unfreeze some (or all) of the base model’s layers and continue training, allowing the pre-trained weights to adjust slightly.
Now, let's compare these approaches across a few important dimensions:
  1. Data requirements:
  • Transfer Learning: Works well with smaller datasets. The pre-trained model has already learned general features from a big dataset, so it can successfully apply those to a smaller new dataset. You can get good results with far less data than you'd need from scratch.
  • Fine-tuning: Requires moderate data. Fine-tuning more parameters means there's a risk of overfitting if your dataset is not sufficiently large or representative. Ideally, you have enough data to justify updating some pre-trained weights. If not, you might stick to pure transfer learning.
  2. Training time and computational cost:
  • Transfer Learning: Shorter training time. With most of the model frozen, there are far fewer parameters to update. Often, you'll train only for a few epochs to get a good result. This is computationally cheaper because the forward pass through the pre-trained layers can be done quickly and backpropagation only happens in the small head network.
  • Fine-tuning: Moderate training time. Longer than pure transfer learning (because now more layers/parameters are being updated and you might train for more epochs) but still typically faster than training from scratch. You usually do fewer fine-tuning epochs after an initial training phase.
  3. Performance and accuracy:
  • Transfer Learning: Often yields good performance quickly. It might not reach the absolute best possible accuracy that fine-tuning can achieve, but it will be far better than a scratch model when data is limited. For many practical tasks, a well-chosen pre-trained model with a new head can get you surprisingly high accuracy.
  • Fine-tuning: Tends to give the best performance for your task without needing the enormous data of a from-scratch approach. By selectively retraining some pre-trained layers, the model can better fit your data than a frozen feature extractor, often leading to higher accuracy. However, fine-tuning improperly can also degrade performance (for instance, if you fine-tune with not enough data or too high a learning rate, it might overfit or destroy useful features).
  4. Risk of overfitting:
  • Transfer Learning: Low risk generally, because the pre-trained layers are fixed and already generalized from a large dataset. Only the new layers (which are relatively few parameters) must be trained, so overfitting is easier to manage. That said, if the pre-trained model's features are not relevant to your task, the new layers might struggle.
  • Fine-tuning: Medium risk. Fine-tuning can overfit if you open up too many parameters with not enough data to constrain them. There's a careful balance: unfreeze just enough layers that your model can adapt, but not so many that it starts overfitting your training data. Using validation checks and early stopping is important here.
  5. Use cases:
  • Transfer Learning: Use when you have a moderate or small dataset and a pre-trained model exists in a related domain. It's a great default choice for many applications (e.g., using a pre-trained language model for a new NLP classification task, or a pre-trained image model for a new set of image categories).
  • Fine-tuning: Use when transfer learning has given you a solid start, and you have enough data and compute to justify training some extra layers. Fine-tuning is common in competitions and production when you need to squeeze out extra accuracy. For instance, in computer vision, you might train a new head on your data (feature extraction stage) and then fine-tune the last few convolutional layers to boost performance.

Comparing fine-tuning and transfer learning

Now, we will train and evaluate the two approaches step by step. The first run will use transfer learning, where the ResNet backbone is frozen and only the final classification head is updated. This keeps training focused on a small number of parameters and makes convergence fast. We will track training loss, validation loss, and accuracy at each epoch, sending all results to Weights and Biases for visualization.
Once this run is complete, we will switch to fine tuning. In this version, the backbone is not fully frozen. Instead, a percentage of the deeper layers is unfrozen, which allows the model to adapt its higher-level features to the CIFAR-10 dataset. The classifier head is still trained from scratch, but now the backbone parameters we unfroze are also updated. To prevent large weight shifts from destroying useful pre-trained features, we use a smaller learning rate for the backbone and a larger one for the new head.
Both approaches will run under the same setup, using identical data loading and logging pipelines. With W&B keeping track of the metrics, we will be able to compare training curves directly and observe how transfer learning and fine tuning behave differently on the same dataset.
Here's my code:

import torch
import torch.nn as nn
import torch.optim as optim
from torchvision import datasets, transforms, models
from torch.utils.data import DataLoader
import random
import numpy as np
import wandb

# ---------------- utils ----------------
def set_seed(seed=1337):
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False

# ---------------- data ----------------
def get_loaders(batch_size=128, num_workers=2):
    mean = (0.4914, 0.4822, 0.4465)
    std = (0.2470, 0.2435, 0.2616)

    train_tf = transforms.Compose([
        transforms.RandomCrop(32, padding=4),
        transforms.RandomHorizontalFlip(),
        transforms.ToTensor(),
        transforms.Normalize(mean, std),
    ])
    test_tf = transforms.Compose([
        transforms.ToTensor(),
        transforms.Normalize(mean, std),
    ])

    train_ds = datasets.CIFAR10("./data", train=True, download=True, transform=train_tf)
    test_ds = datasets.CIFAR10("./data", train=False, download=True, transform=test_tf)
    train_loader = DataLoader(train_ds, batch_size=batch_size, shuffle=True, num_workers=num_workers, pin_memory=True)
    test_loader = DataLoader(test_ds, batch_size=batch_size, shuffle=False, num_workers=num_workers, pin_memory=True)
    return train_loader, test_loader

# ---------------- models ----------------
class TransferResNet18(nn.Module):
    def __init__(self, num_classes=10, pretrained=True):
        super().__init__()
        self.net = models.resnet18(weights=models.ResNet18_Weights.IMAGENET1K_V1 if pretrained else None)
        in_feats = self.net.fc.in_features
        self.net.fc = nn.Linear(in_feats, num_classes)
        for name, p in self.net.named_parameters():
            p.requires_grad = name.startswith("fc")

    def forward(self, x):
        return self.net(x)

class FineTuneResNet18(nn.Module):
    def __init__(self, num_classes=10, pretrained=True, pct_unfreeze=0.25):
        super().__init__()
        pct_unfreeze = float(max(0.0, min(1.0, pct_unfreeze)))
        self.net = models.resnet18(weights=models.ResNet18_Weights.IMAGENET1K_V1 if pretrained else None)
        in_feats = self.net.fc.in_features
        self.net.fc = nn.Linear(in_feats, num_classes)

        for p in self.net.parameters():
            p.requires_grad = False
        for p in self.net.fc.parameters():
            p.requires_grad = True

        group_names = ["layer4", "layer3", "layer2", "layer1", "bn1", "conv1"]
        total_groups = len(group_names)
        k = int(round(pct_unfreeze * total_groups))
        to_unfreeze = group_names[:k]

        for name, module in self.net.named_modules():
            if any(name == g or name.startswith(g + ".") for g in to_unfreeze):
                for p in module.parameters(recurse=True):
                    p.requires_grad = True

    def forward(self, x):
        return self.net(x)

# ---------------- train/eval loops ----------------
def train_one_epoch(model, loader, optimizer, device):
    model.train()
    criterion = nn.CrossEntropyLoss()
    total, correct, loss_sum = 0, 0, 0.0
    for x, y in loader:
        x, y = x.to(device), y.to(device)
        optimizer.zero_grad(set_to_none=True)
        out = model(x)
        loss = criterion(out, y)
        loss.backward()
        optimizer.step()
        loss_sum += loss.item() * y.size(0)
        correct += (out.argmax(1) == y).sum().item()
        total += y.size(0)
    return loss_sum / total, correct / total

@torch.no_grad()
def evaluate(model, loader, device):
    model.eval()
    criterion = nn.CrossEntropyLoss()
    total, correct, loss_sum = 0, 0, 0.0
    for x, y in loader:
        x, y = x.to(device), y.to(device)
        out = model(x)
        loss_sum += criterion(out, y).item() * y.size(0)
        correct += (out.argmax(1) == y).sum().item()
        total += y.size(0)
    return loss_sum / total, correct / total

# ---------------- runs ----------------
def run_transfer(train_loader, test_loader, device, epochs=50):
    wandb.init(project="transfer-vs-finetune", name="transfer-learning", config={"epochs": epochs})
    model = TransferResNet18(num_classes=10).to(device)
    optimizer = optim.Adam(model.net.fc.parameters(), lr=1e-3, weight_decay=1e-4)

    for ep in range(1, epochs + 1):
        tr_loss, tr_acc = train_one_epoch(model, train_loader, optimizer, device)
        vl_loss, vl_acc = evaluate(model, test_loader, device)
        wandb.log({
            "epoch": ep,
            "train_loss": tr_loss, "train_acc": tr_acc,
            "val_loss": vl_loss, "val_acc": vl_acc
        })
        print(f"[Transfer] epoch {ep} train_loss {tr_loss:.4f} train_acc {tr_acc:.3f} val_loss {vl_loss:.4f} val_acc {vl_acc:.3f}")
    wandb.finish()

def run_finetune(train_loader, test_loader, device, epochs=50, pct_unfreeze=0.5):
    wandb.init(project="transfer-vs-finetune", name=f"fine-tuning-{pct_unfreeze}", config={"epochs": epochs, "pct_unfreeze": pct_unfreeze})
    model = FineTuneResNet18(num_classes=10, pct_unfreeze=pct_unfreeze).to(device)

    head_params = list(model.net.fc.parameters())
    base_params = [p for n, p in model.net.named_parameters() if p.requires_grad and not n.startswith("fc")]
    optimizer = optim.Adam([
        {"params": head_params, "lr": 1e-3},
        {"params": base_params, "lr": 1e-4},
    ], weight_decay=1e-4)

    for ep in range(1, epochs + 1):
        tr_loss, tr_acc = train_one_epoch(model, train_loader, optimizer, device)
        vl_loss, vl_acc = evaluate(model, test_loader, device)
        wandb.log({
            "epoch": ep,
            "train_loss": tr_loss, "train_acc": tr_acc,
            "val_loss": vl_loss, "val_acc": vl_acc
        })
        print(f"[FineTune] epoch {ep} train_loss {tr_loss:.4f} train_acc {tr_acc:.3f} val_loss {vl_loss:.4f} val_acc {vl_acc:.3f}")
    wandb.finish()

# ---------------- main ----------------
def main():
    set_seed(1337)
    device = "cuda" if torch.cuda.is_available() else "cpu"
    train_loader, test_loader = get_loaders()

    # run transfer learning separately
    run_transfer(train_loader, test_loader, device, epochs=30)

    # run fine-tuning separately
    run_finetune(train_loader, test_loader, device, epochs=30, pct_unfreeze=0.5)

if __name__ == "__main__":
    main()
This script sets up a clean comparison between transfer learning and fine tuning using ResNet18 on CIFAR-10. Both models are wrapped in their own classes so the logic for freezing and unfreezing layers is explicit. The training and evaluation loops are kept outside the model definitions, making it easy to reuse them across both experiments.
In the transfer learning case, the ResNet backbone is completely frozen and acts as a fixed feature extractor. Only the new fully connected layer is trained, which means the optimizer only updates a small set of parameters. This makes training fast, memory-efficient, and less prone to overfitting, especially when the dataset is relatively small. However, because the backbone never adapts, the model’s ability to specialize to CIFAR-10 is limited to what the new head can capture.
The fine tuning class, on the other hand, introduces a flexible way to unfreeze part of the backbone with the pct_unfreeze argument. By default, everything is frozen except the head, but as you increase the percentage, groups of layers starting from the deepest blocks are gradually unfrozen. This allows the model to adjust its high-level features to the new dataset while preserving the lower-level feature extractors. To keep training stable, the script also uses two learning rates: a higher rate for the new head, which needs to learn quickly, and a lower rate for the unfrozen backbone, so that pretrained weights shift only slightly.
Weights and Biases is integrated into both runs for logging, making it straightforward to track training and validation loss, accuracy, and to compare the dynamics of transfer learning versus fine tuning side by side. This setup lets you observe how quickly transfer learning converges and how fine tuning can squeeze out additional accuracy when more flexibility is given to the model. In practice, the choice between the two depends on data size, similarity to the pretraining domain, and how much compute you want to spend.

Comparing the results

After running the previous script, you will see your results inside W&B. Here are the results for my run:

The transfer learning run quickly levels off, reaching about 41 to 42 percent validation accuracy with losses hovering around 1.7. This is expected because the frozen ImageNet backbone cannot adapt to the small 32×32 CIFAR-10 images, and only the new classification head is learning. The model acts as a reliable feature extractor, but its performance is capped by the mismatch between the pretraining domain and the target task.
The fine tuning run, in contrast, shows a much steeper improvement. Validation accuracy rises past 80 percent within the first 10 epochs and eventually stabilizes around 83 to 84 percent while train accuracy pushes past 90 percent. Validation loss bottoms out near 0.54 to 0.56 before starting to creep up, indicating the model has reached its adaptation sweet spot and is beginning to overfit slightly.
This clear gap illustrates the trade-off: transfer learning is fast, efficient, and stable but limited in accuracy, while fine tuning unlocks more potential by allowing the backbone to reshape its higher-level features for the new task. With careful control of learning rates and early stopping, fine tuning generally yields the best results when you have enough data to support the extra flexibility.

Conclusion

By exploring transfer learning and fine tuning side by side, we can clearly see how each method offers different strengths. Transfer learning gives you a fast and stable baseline by reusing powerful pretrained features, while fine tuning adds flexibility that can push accuracy much higher when you have enough data and compute. The experiments on CIFAR-10 showed this contrast directly: a frozen backbone plateaued early, while partial unfreezing allowed the model to adapt and deliver significantly better results.
The broader takeaway is that modern deep learning rarely starts from scratch anymore. Leveraging pretrained models is now the norm, not just for computer vision but across domains like natural language processing and speech. Your choice between transfer learning and fine tuning depends on your dataset size, how closely it matches the pretraining data, and your tolerance for extra training time.
As you move forward, think of transfer learning as a strong starting point and fine tuning as a way to squeeze out the last gains in performance. Pair these techniques with experiment tracking in W&B, and you’ll not only get better results but also understand why one approach works better than the other in your specific context.