
Pretraining & Finetuning of Protein Language Model with BioNeMO & WandB

If you have questions about this report or about wandb, please contact us at contact-jp@wandb.com.

Chatbots are the most visible use case of generative AI, but the underlying technology extends well beyond chat and is being applied in many fields. Drug discovery is a prime example: developing a new drug from a vast space of candidate molecules takes enormous time and money, and efforts are made daily to shorten the drug discovery cycle by using AI.

Since the advent of the Transformer, the predictive performance of protein language models and biochemical models trained on amino acid and molecular sequences has improved dramatically, drawing attention to the potential of generative AI in drug discovery. In fact, companies such as Amgen and Genentech have reported drug discovery efforts that use generative AI.

However, drug discovery with generative AI requires handling vast amounts of data, which calls for distributed processing on multi-node GPUs and computation that makes efficient use of the model architecture. This demands not only data analysis expertise but also strong engineering skills, making the field one with high barriers to entry. Against this background, NVIDIA developed and released BioNeMo, a framework that makes it easy to use foundation models in the drug discovery domain.

This report introduces the implementation steps from pre-training to fine-tuning of a protein language model, targeting those who are using BioNeMo for the first time. The implementation examples also introduce convenient ways to use Weights & Biases, which many global pharmaceutical companies use alongside BioNeMo. The general flow is as follows.



Environment Setup

For detailed environment settings for BioNeMo, please refer to the official documentation of the BioNeMo Framework. Below are the steps to set up an environment for implementing the content described in this report.

Prerequisites

  • x86 Linux systems
  • Docker (with GPU support, docker engine >= 19.03)
  • Python 3.10 or above
  • Pytorch 1.13.1 or above
  • NeMo pinned to version 1.20
  • NVIDIA GPU, if you intend to do model training
  • Tested GPUs: DGX-H100, A100, V100, RTX A6000, A8000, Tesla T4, GeForce RTX 2080 Ti (GPUs with known issues: Tesla K80)
  • bfloat16 precision requires an Ampere generation GPU or higher.

Request access / Account setup

  • Weights & Biases Account
If you do not have a Weights & Biases account, please create a trial account from the Weights & Biases homepage.
  • NGC account
Sign in to NGC, then download and install the NGC CLI and set up your API key.
  • (Optional) Access to NVIDIA DGX compute infrastructure (DGX-Cloud or DGX-Pod).

Set up

(1) Docker set up - image file (Image: nvcr.io/nvidia/clara/bionemo-framework:1.5)
The following command is just an example. Please launch Docker according to your own GPU environment.
Log in:
docker login nvcr.io
Username: $oauthtoken
Password: <insert NGC API token here>
Pull the Docker image:
docker pull nvcr.io/nvidia/clara/bionemo-framework:1.5
Launch the container:
docker run -it --rm --gpus all nvcr.io/nvidia/clara/bionemo-framework:1.5 bash
(2) NGC CLI download
wget --content-disposition https://api.ngc.nvidia.com/v2/resources/nvidia/ngc-apps/ngc_cli/versions/3.42.0/files/ngccli_linux.zip -O ngccli_linux.zip && unzip ngccli_linux.zip
find ngc-cli/ -type f -exec md5sum {} + | LC_ALL=C sort | md5sum -c ngc-cli.md5
sha256sum ngccli_linux.zip
chmod u+x ngc-cli/ngc
echo "export PATH=\"\$PATH:$(pwd)/ngc-cli\"" >> ~/.bash_profile && source ~/.bash_profile
ngc config set
  • Please enter the NGC API key for the API key.
  • Choose something other than 'no-org' for Org.
  • Select 'json' for format_type.
  • The rest can be left as default values.
(3) Set up of environment variable
export BIONEMO_HOME="/workspace/bionemo"
export WANDB_API_KEY=<your wandb API key> 

# If you are using a dedicated cloud or on-premises version of wandb, please set the 'WANDB_BASE_URL' appropriately. Your company's wandb administrator will know the 'WANDB_BASE_URL'.
[optional] export WANDB_BASE_URL=<your wandb base url>
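To confirm that these variables are picked up correctly, you can log in to wandb from Python before running anything heavier. The following is a minimal sketch, assuming the exports above were made in the same session:

import wandb

# Reads WANDB_API_KEY (and WANDB_BASE_URL, if set) from the environment
# and verifies that authentication succeeds.
wandb.login()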

Implementation of used code & Test code

You will mainly use the sample scripts and functions inside the Docker image, but some steps in this example require modified scripts, including changes for the wandb integration. These scripts are stored in the following GitHub repository; please download them and place them in your environment (support for git pull and integration into the Docker image itself is being considered for the future).
For those participating in the hands-on session on this topic:
Place test_code.ipynb from the GitHub repository into the workspace directory as shown below, run it, and make sure it completes without errors. The final cells log in to wandb and save results; a wandb URL will be displayed, and as long as no errors such as 'Failed' appear, you are ready to proceed.


Pre-Training (The procedure for the hands-on will be updated later)

This time, pre-training is conducted using the UniRef50 and UniRef90 datasets included in the BioNeMo Docker image. UniRef50 is split into training, validation, and test sets. During training, minibatches are sampled from UniRef50, but each sequence in the batch is replaced with a sequence from the corresponding UniRef90 cluster to increase data size and diversity (UniRef50 clusters sequences at 50% identity and keeps one representative per cluster, while UniRef90 clusters at 90% identity, so UniRef90 contains more sequences). For details, please refer to 'Language models of protein sequences at the scale of evolution enable accurate structure prediction.' UniRef90 is used only during training, not for validation or testing.
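Conceptually, the UniRef90 resampling works as follows. The snippet below is only an illustrative sketch of the idea; the cluster IDs, sequences, and mapping are made up, and this is not BioNeMo's actual implementation.

import random

# Hypothetical mapping from UniRef50 cluster IDs to their UniRef90 member sequences.
uf90_members = {
    "UniRef50_A": ["MKTAYIAKQR", "MKTAYIAKQK", "MKTVYIAKQR"],
    "UniRef50_B": ["GAVLIPFWMS"],
}

# A minibatch is sampled from UniRef50, then each representative is swapped
# for a randomly chosen member of its UniRef90 cluster.
uf50_batch = ["UniRef50_A", "UniRef50_B"]
train_batch = [random.choice(uf90_members[cluster_id]) for cluster_id in uf50_batch]
print(train_batch)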

Preparation of script & wandb

(1) As a base, we will use the data and scripts in nvcr.io/nvidia/clara/bionemo-framework:1.5, but for some scripts, we will use the code from https://github.com/olachinkei/BioNeMo_WandB/tree/main.
Although we will prepare a custom Docker image and a GitHub repository in the future, due to the rapid updates in BioNeMo, we will proceed by manually adding scripts to the official Docker image, which might look a bit messy.
(2) Place the Protein/01_protein_LLM.ipynb from the GitHub repository under the workspace, and mainly use this Jupyter notebook!
(3) Replace workspace/bionemo/examples/protein/esm2nv/pretrain.py with Protein/pretrain.py from the GitHub repository.
(4) Replace workspace/bionemo/examples/protein/esm2nv/conf/base_config.yaml with Protein/base_config.yaml from the GitHub repository.
(5) Open 01_protein_LLM.ipynb and start executing. First, set the environment variables for wandb and decide where to save within wandb.
The save destination in wandb is organized as shown below. If you work in the same entity (team) as your teammates, the results are automatically shared in real time.
import os
os.environ["WANDB_ENTITY"]="<your team where you want to log>"
os.environ["WANDB_PROJECT"]="BioNeMo_protein_LLM_pretraining"


Data pre-processing

We will proceed to 'Data Preprocessing' in 01_protein_LLM.ipynb.
Here, we prepare the data for pre-training, using the datasets Uniref50 and Uniref90 included in the BioNeMo Docker image.
(1) Execute the script in order and unzip uniref202104_esm2_qc_test200_val200.zip.
(2) For version control of the data, save the raw data in wandb's Artifacts. Artifacts is a tool that enables version control of object files such as data and models.
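As a reference, this upload can be done with a few lines of the wandb SDK. The following is a sketch that assumes the zip has been extracted to the test-data directory used later in this section; adjust the path if your layout differs.

import wandb

# Log the raw (unprocessed) dataset directory as a versioned dataset artifact.
with wandb.init(name="raw_data_upload", job_type="data_upload") as run:
    artifact = wandb.Artifact(
        name="uniref202104_esm2_qc_test200_val200",
        type="dataset",
        description="raw UniRef50/UniRef90 sample data shipped with the BioNeMo image",
    )
    artifact.add_dir(
        "/workspace/bionemo/examples/tests/test_data/uniref202104_esm2_qc_test200_val200"
    )
    run.log_artifact(artifact)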

(3) Next, we perform the data preprocessing. The pretrain.py invoked below has been modified to take the raw data from the wandb artifact as input and to register the preprocessed output as a new artifact in wandb.
!python examples/protein/esm2nv/pretrain.py\
--config-path=conf\
--config-name=pretrain_esm2_650M\
++do_training=False\
++exp_manager.wandb_logger_kwargs.name='preproceed_data_upload'\
++exp_manager.wandb_logger_kwargs.job_type='data_upload'\
++wandb_artifacts.wandb_use_artifact_path='${oc.env:WANDB_ENTITY}/${oc.env:WANDB_PROJECT}/uniref202104_esm2_qc_test200_val200:v0'\
++wandb_artifacts.wandb_log_artifact_name='uniref202104_esm2_qc_test200_val200_preprocessed'\
++model.data.val_size=500\
++model.data.test_size=100\
++model.data.uf50_datapath=/uniref50_train_filt.fasta\
++model.data.uf90_datapath=/ur90_ur50_sampler.fasta\
++model.data.cluster_mapping_tsv=/mapping.tsv\
++model.data.dataset_path=/workspace/bionemo/examples/tests/test_data/uniref202104_esm2_qc_test200_val200/uf50\
++model.data.uf90.uniref90_path=/workspace/bionemo/examples/tests/test_data/uniref202104_esm2_qc_test200_val200/uf90\
++model.data.train.uf50_datapath=/workspace/bionemo/examples/tests/test_data/uniref202104_esm2_qc_test200_val200/uniref50_train_filt.fasta\
++model.data.train.uf90_datapath=/workspace/bionemo/examples/tests/test_data/uniref202104_esm2_qc_test200_val200/ur90_ur50_sampler.fasta\
++model.data.train.cluster_mapping_tsv=/workspace/bionemo/examples/tests/test_data/uniref202104_esm2_qc_test200_val200/mapping.tsv\
++model.data.val.uf50_datapath=/workspace/bionemo/examples/tests/test_data/uniref202104_esm2_qc_test200_val200/uniref50_train_filt.fasta\
++model.data.test.uf50_datapath=/workspace/bionemo/examples/tests/test_data/uniref202104_esm2_qc_test200_val200/uniref50_train_filt.fasta

Parameters that start with '--' are passed as command-line arguments to pretrain.py. For example, 'config-path' and 'config-name' specify the folder of the configuration YAML file and the YAML file name, respectively. This path is relative to pretrain.py. 'conf' refers to examples/protein/esm2nv/conf, and 'pretrain_esm2_650M' refers to examples/protein/esm2nv/conf/pretrain_esm2_650M.yaml.

Parameters starting with '++' override values that can be set in the YAML file. For example, in pretrain_esm2_650M.yaml, which inherits from base_config.yaml, you can find the following parameters:
  • do_training: Set to False to perform only data preprocessing and not training.
  • exp_manager.wandb_logger_kwargs.name: Specifies the wandb run name.
  • exp_manager.wandb_logger_kwargs.job_type: Specifies the wandb job type (job type becomes useful metadata when organizing run information later; it does not change the implementation).
  • wandb_artifacts.wandb_use_artifact_path: Sets the path for the wandb artifacts of the data to be preprocessed.
  • wandb_artifacts.wandb_log_artifact_name: Specifies the name of the wandb artifacts to save the data after preprocessing.
  • model.data.val_size and model.data.test_size: Specify the sizes of the validation and test datasets.
  • model.data.uf50_datapath: Specifies the path to the uniref50 fasta file.
  • model.data.uf90_datapath: Specifies the path to the uniref90 fasta file.
  • model.data.cluster_mapping_tsv: Specifies the path to the file that maps uniref50 clusters to uniref90 sequences.
  • model.data.dataset_path: Specifies the path to the output directory of the preprocessed uniref50 data, which will include train, validation, and test splits.
  • model.data.uf90.uniref90_path: Specifies the path to the output directory of the preprocessed uniref90 data, which will only contain the folder u90_csvs, as uniref90 is used only in training and does not have train/test/validation splits.
  • model.data.train.uf50_datapath: Specifies the path to the uniref50 fasta file.
  • model.data.train.uf90_datapath: Specifies the path to the uniref90 fasta file.
  • model.data.train.cluster_mapping_tsv: Specifies the path to the file that maps uniref50 clusters to uniref90 sequences.
  • model.data.val.uf50_datapath: Specifies the path to the uniref50 fasta file.
  • model.data.test.uf50_datapath: Specifies the path to the uniref50 fasta file.
Instead of overriding arguments through the command line as above, you can also directly modify the YAML file. Once processing is complete, the preprocessed data will be located at /workspace/bionemo/examples/tests/test_data/uniref202104_esm2_qc_test200_val200/uf50/uf50/. If you want to use your own data for pre-training, fine-tuning, or inference, specify the path as /workspace/bionemo/mydata/. However, you need to align the structure and format of your data with the sample data.

[Artifact lineage view: the raw dataset artifact uniref202104_esm2_qc_test200_val200:v0, uploaded by the preproceed_data_upload run, is preprocessed into uniref202104_esm2_qc_test200_val200_preprocessed:v0, which is then consumed by several pretraining runs (e.g., pretraining_240529-084130, pretraining_240603-153108) that in turn log model checkpoint artifacts such as model-41vz8yh8 and model-dprgx89l.]

Pre-training

Now that the data is prepared, execute the following code sequentially, and proceed with the pre-training.
!python examples/protein/esm2nv/pretrain.py \
--config-path=conf \
--config-name=pretrain_esm2_650M \
++do_training=True \
++do_testing=True \
++wandb_artifacts.wandb_use_artifact_path='${oc.env:WANDB_ENTITY}/${oc.env:WANDB_PROJECT}/uniref202104_esm2_qc_test200_val200_preprocessed:v0'\
++model.data.dataset_path=/ \
++model.data.uf90.uniref90_path=/uf90 \
++trainer.devices=1 \
++model.tensor_model_parallel_size=1 \
++model.micro_batch_size=4 \
++trainer.max_steps=1 \
++trainer.val_check_interval=1 \
++exp_manager.create_wandb_logger=True \
++exp_manager.checkpoint_callback_params.save_top_k=10
Below is an explanation of the main parameters set above.
  • do_training: Set to True to train the model. This assumes that the data has been preprocessed.
  • do_testing: Whether to run testing after training; set it to False to skip testing (here it is set to True).
  • wandb_artifacts.wandb_use_artifact_path: Enter the path of the dataset used for training.
  • model.data.dataset_path: Specify the path to the preprocessed uniref50 data folder that includes training/validation/test splits.
  • model.data.uf90.uniref90_path: Specify the path to the preprocessed uniref90 data. This folder must contain another folder named u90_csvs, which should have files from x000.csv to x049.csv.
  • trainer.devices: Specify the number of GPUs to use.
  • model.tensor_model_parallel_size: Set the tensor model parallel size.
  • model.micro_batch_size: Set the batch size. Increase this as much as possible unless a memory error occurs.
  • trainer.max_steps: Specify the maximum number of training steps (set to 1 here purely for demo purposes; 1 step = processing 1 batch). If you want to train for N epochs, first calculate total_batches = total number of training samples / batch size, then set max_steps to N * total_batches (see the short calculation sketch after this list).
  • trainer.val_check_interval: Specify the interval at which to run the validation set.
  • exp_manager.create_wandb_logger: Set to False to disable logging to wandb. If set to True, you must provide the wandb API key.
  • exp_manager.checkpoint_callback_params.save_top_k: Specify the number of best checkpoints to save. The trained results will be saved in /workspace/bionemo/results/nemo_experiments/.
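As a concrete illustration of the max_steps arithmetic mentioned in the list above (the numbers are placeholders, not the actual size of the demo dataset):

# Hypothetical numbers purely to illustrate the calculation.
num_train_samples = 200_000        # size of your training split
global_batch_size = 4              # micro_batch_size * devices * gradient accumulation
target_epochs = 3

total_batches = num_train_samples // global_batch_size
max_steps = target_epochs * total_batches
print(max_steps)                   # pass this as ++trainer.max_steps=<value>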
Additionally, if you want to perform continual pre-training instead of training from scratch, you can do so by setting 'restore_from_path' in base_config.yaml to the path of the model you want to start from.
The learning process and settings are saved in wandb, and the trained model is also stored in wandb's artifacts. Below is an example of what is tracked in wandb.

[Embedded wandb run panel: pretraining_240603-153108, showing the logged pre-training metrics and configuration]


[Artifact lineage view: the model artifact model-41vz8yh8:v16 was logged by the pretraining_240603-153108 run, which consumed uniref202104_esm2_qc_test200_val200_preprocessed:v0, itself derived from the raw dataset artifact uniref202104_esm2_qc_test200_val200:v0 via the preproceed_data_upload run.]

Register to Model registry

When models live only in Artifacts, it can be difficult for other team members to tell which model is currently the best. To facilitate team collaboration, wandb provides the Model Registry. Let's register the pre-trained model in the Model Registry.

The procedure is simple. Click 'Link to registry' from the model you want to register.

By doing so, you can confirm that the model is now managed in the organization's Model Registry. If you improve the pre-training and register the model again, new versions such as v1 and v2 are created automatically. Even when many versions accumulate, you can tag them, which makes it easy to track which model is used in production or which has the best accuracy.
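If you prefer to script this step instead of clicking through the UI, the wandb API can also link an already-logged model artifact to the registry. The artifact name/version and the registered model name below are assumptions; replace them with your own.

import os
import wandb

api = wandb.Api()
# Fetch the logged pre-training checkpoint artifact (name and version are examples).
artifact = api.artifact(
    f"{os.environ['WANDB_ENTITY']}/{os.environ['WANDB_PROJECT']}/model-41vz8yh8:v16"
)
# Link it to a registered model in the team's Model Registry.
artifact.link(f"{os.environ['WANDB_ENTITY']}/model-registry/BioNeMo_ESM2_pretrained")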


Finetuning

The BioNeMo Framework provides sample code for three downstream tasks: predicting the subcellular localization of proteins (10 classes), predicting the melting temperature of proteins, and predicting the secondary structure of proteins. This report explains the third task as an example, specifically predicting whether each amino acid in a sequence belongs to a helix, sheet, or coil.
Supplement: 3-state prediction
In 3-state prediction, the protein's secondary structure is classified into the following three types:
  • Helix (H) - The part of the amino acid chain that forms a helical shape.
  • Sheet (E) - The part of the amino acid chain that forms parallel or antiparallel sheet structures.
  • Coil (C) - The part that does not fall into the categories of helix or sheet, taking a disordered or other arbitrary structure.
Supplement: 8-state prediction
In 8-state prediction, the protein's secondary structure is classified in more detail as follows:
  • H: Alpha helix
  • E: Extended structure (beta sheet)
  • G: 3-10 helix
  • I: Pi helix
  • T: Turn
  • S: Bend
  • B: Beta bridge
  • C: Coil

Preparation of script & wandb

(0) We will use the Protein/01_protein_LLM.ipynb from the same GitHub repository used for pre-training.
(1) Replace workspace/bionemo/examples/protein/downstream/downstream_flip.py with Protein/downstream_flip.py from the GitHub repository.
(2) Replace bionemo/examples/protein/esm2nv/conf/downstream_sec_str_LORA.yaml with Protein/downstream_sec_str_LORA.yaml from the GitHub repository.
(3) Open 01_protein_LLM.ipynb and start executing. First, set up the environment variables for wandb and decide where to save within wandb.
import os
os.environ["WANDB_ENTITY"]="<your team where you want to log>"
os.environ["WANDB_PROJECT"]="BioNeMo_protein_LLM_finetuning"

Data preprocessing & Model preparation

(1) Use the following code to download data.
!wget -q -O /tmp/ngccli_linux.zip --content-disposition https://api.ngc.nvidia.com/v2/resources/nvidia/ngc-apps/ngc_cli/versions/3.38.0/files/ngccli_linux.zip && unzip -o /tmp/ngccli_linux.zip -d /tmp && chmod u+x /tmp/ngc-cli/ngc && rm /tmp/ngccli_linux.zip
(2) Use the following code to download the model.
!python download_models.py --download_dir /workspace/bionemo/models esm2nv_650m
For this hands-on, we will base it on esm2nv_650m, but it is also possible to use a model registered in the Model Registry during the pre-training process described above.
(3) Upload the model and data used for finetuning to wandb and manage their versions.
# save model
import wandb

with wandb.init(name="model_upload") as run:
    artifact = wandb.Artifact(
        name="esm2nv_650m",
        type="model",
        description="original esm2nv_650m",
        metadata={"path": "bionemo/models/esm2nv_650M_converted.nemo"},
    )
    artifact.add_file("/workspace/bionemo/models/esm2nv_650M_converted.nemo")
    run.log_artifact(artifact)

# save dataset
with wandb.init(name="data_upload") as run:
    artifact = wandb.Artifact(
        name="downstream_taskdataset",
        type="dataset",
        description="bionemo/examples/tests/test_data/protein/downstream",
        metadata={"path": "/workspace/bionemo/examples/tests/test_data/protein/downstream"},
    )
    artifact.add_dir("/workspace/bionemo/examples/tests/test_data/protein/downstream")
    run.log_artifact(artifact)

Finetuning

Finally, execute the following code. The downstream_flip.py mentioned below has been modified to download and utilize the data and model registered in wandb's Artifacts.
Furthermore, detailed settings can be made in the following yaml file:
/workspace/bionemo/examples/protein/esm2nv/conf/downstream_sec_str_LORA.yaml
  • restore_from_path: Set to the path of the .nemo file of the pre-trained model's checkpoint.
  • trainer.devices, trainer.num_nodes: Set to the number of GPUs and nodes to be used.
  • trainer.max_epochs: Set to the number of epochs you want to train.
  • trainer.val_check_interval: Set to the number of steps at which validation is performed.
  • model.micro_batch_size: Set to the micro batch size for training.
  • data.task_name: Set an arbitrary name.
  • data.task_type: Current options include token-level-classification, classification (sequence level), and regression (sequence level).
  • preprocessed_data_path: Set to the path of the parent folder of dataset_path.
  • dataset_path: Set to the path of the folder containing the train/val/test folders. For example, set to the path specified as path/to/data.
  • dataset.train, dataset.val, dataset.test: Set to the CSV name or range.
  • sequence_column: Set to the name of the column containing the sequences. Example: sequence
  • target_column: Set to the name of the column containing the targets. Example: scl_label
  • target_size: The number of classes for each label for classification.
  • num_classes: Set to target_size.
In the following command, frequently changed variables are set by overriding the yaml values from the command line.
wandb_artifacts.wandb_use_artifact_data_path is the path to the wandb artifacts of the data you are using.
wandb_artifacts.wandb_use_artifact_model_path is the path to the wandb artifacts of the model you are using.
!python examples/protein/downstream/downstream_flip.py\
--config-path="../esm2nv/conf"\
--config-name=downstream_sec_str_LORA\
++trainer.max_epochs=15\
++wandb_artifacts.wandb_use_artifact_data_path='${oc.env:WANDB_ENTITY}/${oc.env:WANDB_PROJECT}/downstream_taskdataset:v0'\
++wandb_artifacts.wandb_use_artifact_model_path='${oc.env:WANDB_ENTITY}/${oc.env:WANDB_PROJECT}/esm2nv_650m:v0'\
++model.data.dataset_path=/\
++model.restore_encoder_path=/esm2nv_650M_converted.nemo

The learning process and settings are saved in wandb, and the trained model is also stored in wandb's artifacts. Below is an example of what is tracked in wandb.


[Embedded wandb run panel: esm2nv_flip_secondary_structure_finetuning_encoder_frozen_False, showing the logged finetuning metrics and configuration]

Additionally, the finetuned model can be version-controlled using Artifacts.

[Artifact lineage view: the finetuned model artifact model-mo32gf3o:v29 was logged by the esm2nv_flip_secondary_structure_finetuning_encoder_frozen_False run, which consumed the dataset artifact downstream_taskdataset:v0 and the base model artifact esm2nv_650m:v0 uploaded by the data_upload and model_upload runs.]
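To reuse a finetuned checkpoint elsewhere, the artifact can be pulled back down with the wandb API. This is a sketch; the artifact name and version are taken from the lineage example above and may differ in your project.

import os
import wandb

api = wandb.Api()
# Download the finetuned model artifact to a local directory.
artifact = api.artifact(
    f"{os.environ['WANDB_ENTITY']}/{os.environ['WANDB_PROJECT']}/model-mo32gf3o:v29"
)
local_dir = artifact.download()    # directory containing the saved checkpoint files
print(local_dir)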
In this instance, evaluation on the test data was also performed after finetuning, but with an accuracy of 35.72% after about 10 epochs the model is still close to chance level, indicating that further finetuning is necessary. With wandb's Table feature, each sample can be saved and visualized individually, enabling deeper analysis, as sketched below.
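The following is a sketch of what such per-sample logging could look like; the column names and prediction strings are made up for illustration and are not output of the scripts above.

import wandb

with wandb.init(name="per_sample_analysis", job_type="evaluation") as run:
    table = wandb.Table(columns=["sequence", "true_structure", "predicted_structure"])
    # Hypothetical per-residue 3-state labels (H/E/C) for a few sequences.
    predictions = [
        ("MKTAYIAKQR", "HHHHEEECCC", "HHHEEEECCC"),
        ("GAVLIPFWMS", "CCHHHHHHCC", "CCHHHHHCCC"),
    ]
    for seq, y_true, y_pred in predictions:
        table.add_data(seq, y_true, y_pred)
    run.log({"secondary_structure_predictions": table})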

Further updates on this method will be provided in this report in the future.

Conclusion

In this report, we introduced the concrete process of pre-training and finetuning protein language models with BioNeMo, aimed at those who are starting to study BioNeMo. Additionally, while BioNeMo already provides integration with wandb, we proposed code that makes fuller use of wandb and walked through its implementation. We hope this report contributes to further advancing the use of AI in drug discovery.
If you have questions about this report or about wandb, please contact contact-jp@wandb.com.