Cell Discovery Catalog

Created on January 5|Last edited on November 15



Background: Variational Inference

Cell Type Labeling

Model Training



As mentioned above, once the likelihood and the variational distribution are specified, training consists of gradient-based optimization of the ELBO. We train and evaluate the model on two datasets of cells from two different tissue types: blood and bone marrow. In the peripheral blood mononuclear cell (PBMC) dataset, only 200 cells are labeled, covering 4 cell types, out of 20k samples with 20k+ genes. The marrow dataset has about 8000 samples with 300 genes and 12 unique cell types.
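To make "gradient-based optimization of the ELBO" concrete, here is a minimal sketch on a toy model that is not the model used in this report. It assumes a prior p(z) = N(0, 1), a likelihood p(x|z) = N(z, 1), and a Gaussian variational distribution q(z) = N(m, s^2), for which the ELBO and its gradients have closed forms. The function names `elbo` and `fit` are hypothetical.

```python
import math


def elbo(x, m, log_s):
    """Closed-form ELBO for p(z)=N(0,1), p(x|z)=N(z,1), q(z)=N(m, s^2)."""
    s2 = math.exp(2.0 * log_s)
    # E_q[log p(x|z)]
    expected_log_lik = -0.5 * math.log(2 * math.pi) - 0.5 * ((x - m) ** 2 + s2)
    # E_q[log p(z)]
    expected_log_prior = -0.5 * math.log(2 * math.pi) - 0.5 * (m ** 2 + s2)
    # Entropy of q
    entropy = 0.5 * math.log(2 * math.pi * math.e * s2)
    return expected_log_lik + expected_log_prior + entropy


def fit(x, lr=0.1, steps=500):
    """Gradient descent on the negative ELBO (equivalently, ascent on the ELBO)."""
    m, log_s = 0.0, 0.0  # optimize log(s) so that s stays positive
    for _ in range(steps):
        s2 = math.exp(2.0 * log_s)
        grad_m = x - 2.0 * m          # d(ELBO)/dm
        grad_log_s = 1.0 - 2.0 * s2   # d(ELBO)/d(log s)
        m += lr * grad_m
        log_s += lr * grad_log_s
    return m, log_s


m, log_s = fit(3.0)
```

For this toy model the exact posterior is N(x/2, 1/2), so the fitted `m` and `exp(2 * log_s)` should approach those values, and the ELBO at the optimum should match the log evidence. In a real variational autoencoder the gradients come from automatic differentiation and Monte Carlo samples rather than closed forms.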



There are many hyperparameters we could tune, but variational autoencoders are known to be fairly insensitive to hyperparameters in the unsupervised setting, where the only goal is to maximize the log probability of the data. Even so, Weights & Biases sweeps can help us find good settings for the hyperparameters that deserve more care, such as:

  • The dimensionality of the latent spaces for $z_1$ and $z_2$
  • $\alpha$: the weighting of the classification loss in the objective
  • batch size
  • learning rate decay
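The hyperparameters listed above could be expressed as a Weights & Biases sweep configuration along the following lines. This is a sketch only: the parameter names (`z1_dim`, `z2_dim`, `alpha`, `batch_size`, `lr_decay`), the value ranges, the metric name `val_accuracy`, and the project name are all assumptions, not the settings used for the sweep in this report.

```python
# Hypothetical sweep configuration; names and ranges are illustrative only.
sweep_config = {
    "method": "random",  # W&B also supports "grid" and "bayes"
    "metric": {"name": "val_accuracy", "goal": "maximize"},
    "parameters": {
        "z1_dim": {"values": [10, 20, 50]},      # latent dimensionality of z_1
        "z2_dim": {"values": [5, 10, 20]},       # latent dimensionality of z_2
        "alpha": {"min": 0.1, "max": 100.0},     # classification loss weight
        "batch_size": {"values": [64, 128, 256]},
        "lr_decay": {"min": 0.9, "max": 1.0},    # learning rate decay factor
    },
}


def launch_sweep(train_fn, project="cell-discovery", count=20):
    """Register the sweep and run `count` trials of `train_fn` in this process."""
    import wandb  # imported lazily so the config can be inspected offline

    sweep_id = wandb.sweep(sweep_config, project=project)
    wandb.agent(sweep_id, function=train_fn, count=count)
    return sweep_id
```

Here `train_fn` stands for a user-supplied training function that calls `wandb.init()` and reads the sampled values from `wandb.config`.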

In the semi-supervised setting, we optimize the ELBO together with a classification loss. The supervised component predicts the class $y$ of a sample from its latent representation $z_2$. If $z_2$ is simply normally distributed, as in the mean-field variational distribution, it may not be a representation in which the classes are well separated. A normalizing flow lets $z_2$ follow a more flexible distribution, so that the classifier network of the encoder, $q(y \mid z_2)$, can discriminate between the classes more easily. Indeed, we see better classification performance when a normalizing flow is employed. A normalizing flow is sometimes overkill, but Weights & Biases makes it quick to check whether one offers any benefit.
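As a rough illustration of how the ELBO and the classification loss can be combined, the sketch below assumes the common form "negative ELBO plus $\alpha$ times cross-entropy", with the classification term applied only to labeled cells. The report does not show its exact objective, so the function names and this particular form are assumptions.

```python
import math


def cross_entropy(class_probs, label):
    """Negative log probability that the classifier assigns to the true class."""
    return -math.log(class_probs[label])


def semi_supervised_loss(elbo_value, class_probs=None, label=None, alpha=1.0):
    """Loss to minimize for one cell: -ELBO, plus alpha * cross-entropy if labeled."""
    loss = -elbo_value
    if label is not None:
        # class_probs stands in for the classifier output q(y | z_2)
        loss += alpha * cross_entropy(class_probs, label)
    return loss


unlabeled = semi_supervised_loss(-10.0)
labeled = semi_supervised_loss(-10.0, class_probs=[0.5, 0.5], label=0, alpha=2.0)
```

A larger `alpha` pushes the encoder toward latent representations that separate the labeled classes, at the possible expense of the reconstruction terms in the ELBO.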




Sweep: scanvi-sweep 1 (19 runs)



Datasets

The two datasets come from bone marrow and peripheral blood mononuclear cells (PBMC). The bone marrow dataset has about 8000 cells and 400 genes, with a little under half of the cells labeled into 14 different cell types. The PBMC dataset is much larger, with 20k cells and 20k+ genes, but only 200 cells are labeled, into four different cell types.

marrow_genes
Artifact lineage (dataset artifact marrow_genes:v1): 58 runs, of types marrow_dataset_split, marrow_training, and prediction_insights.

pbmc_genes
Artifact lineage (dataset artifact pbmc_genes:v1):
  • Artifacts: pbmc_train_x:v0, pbmc_train_y:v0, pbmc_test_x:v0, pbmc_test_y:v0, pbmc_scanvi_model:v0, pbmc_scanvi_model:v1
  • Runs: avid-plant-32 and magic-wood-54 (pbmc_dataset_split), confused-planet-33 and comic-butterfly-55 (pbmc_training), dandy-pine-51 and azure-water-57 (prediction_insights)

Model Evaluation

We trained models on two separate genetic datasets from two tissue types: bone marrow and peripheral blood mononuclear cells (PBMC). The bone marrow dataset has about 8000 cells and 400 genes, with a little under half of the cells labeled into 14 different cell types. The PBMC dataset is much larger, with 20k cells and 20k+ genes, but only 200 cells are labeled, into four different cell types.

Run sets: Bone Marrow (1 run), PBMC (1 run)


Model Interpretability

We can also inspect the latent variable $z_2$ to see how the model separates the classes for the labeled data and whether it groups unlabeled points appropriately. We make this comparison with W&B's 2D Projection Plot, which projects $z_2$ down to two dimensions so that the labels can be compared with the model's predicted probabilities.
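For readers who want to reproduce this kind of view outside the W&B interface, the following sketch projects latent vectors to two dimensions with PCA. PCA is only one possible projection method (the report does not say which one was used), and the synthetic `z2` array below is a stand-in for real encoder outputs.

```python
import numpy as np


def pca_2d(z):
    """Project the rows of z onto their first two principal components."""
    z = np.asarray(z, dtype=float)
    centered = z - z.mean(axis=0)
    # Rows of vt are principal directions, ordered by decreasing singular value.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:2].T


# Synthetic stand-in for z_2: two well-separated classes in 8 dimensions.
rng = np.random.default_rng(0)
z2 = np.vstack([
    rng.normal(-3.0, 1.0, size=(50, 8)),
    rng.normal(3.0, 1.0, size=(50, 8)),
])
labels = np.array([0] * 50 + [1] * 50)

coords = pca_2d(z2)  # shape (100, 2), one 2D point per cell
```

Plotting `coords` colored by `labels`, and again colored by predicted class probability, gives the same kind of side-by-side comparison described above.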




Run sets: Bone Marrow (1 run), PBMC (1 run)

