Exploring ClimateBERT with W&B Tables
A quick story of the challenges and opportunities in reproducing language models
I was excited to explore climate-specific language models following Webersinke et al., 2021 and to see how far I could get without needing to ask the authors any questions. Here are some of the challenges I encountered and how I addressed them. If you'd like to play with the dataset itself—and see if you can tell which statements about climate change are logically supported by the accompanying evidence—please see the Appendix.
Challenge 0: Replicating the exact model as published
Model artifacts without matching training code
From the paper, I found the ClimateBert training repo, which includes an excellent example of fine-tuning from a transformer base. From climatebert.ai, I found a number of pre-trained ClimateBert models (hereafter CB) on the HuggingFace Model Hub. However, it's not obvious which version of the code—i.e. which training recipe and hyperparameter settings—produced which of the several CB versions described in the paper.
No inference mode for the task(s) of interest
The models in the paper were fine-tuned for and evaluated on three climate-specific tasks: text classification, sentiment analysis, and fact-checking. None of the models available on HF are configured for inference on these tasks—the hosted inference endpoint only allows mask-filling, i.e. completing blanked-out words in a sentence. The closest model on which I could quickly run inference was amandakonet/climatebert-fact-checking (hereafter AKCFC).
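For context, here is a minimal sketch of the mask-filling mode that is available out of the box. The exact model id is my assumption—any of the published CB checkpoints on the Hub could be substituted:

```python
from transformers import pipeline

# Mask-filling is the only inference mode exposed for the pre-trained CB checkpoints.
# The model id below is an assumption; swap in whichever CB checkpoint you're exploring.
fill = pipeline("fill-mask", model="climatebert/distilroberta-base-climate-f")

# RoBERTa-style models use the <mask> token
for prediction in fill("Sea levels are <mask> due to climate change."):
    print(prediction["token_str"], round(prediction["score"], 3))
```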
Challenge 1: Using the exact dataset as published
Conveniently, both CB and AKCFC fine-tune and present results on the Climate-Fever dataset (CF) from Diggelmann et al., 2020: 7,675 pairs of claims and evidence which humans have judged to support (0), refute (1), or insufficiently relate (2) to the claim. I downloaded the dataset to explore model performance. A 20% sample is shown below. Note that the final "vote" is aggregated across the 5 pieces of evidence for each claim and may not apply to the single (claim, evidence) pair shown in a given row of this table.
[W&B Table: CF 20% Sample]
Lack of required or traceable data splits/partitions
Unfortunately, the whole dataset available on HF is assigned to the "test" partition—it's unclear which of the examples in the Climate-Fever dataset, or even what approximate fraction of it, may already have been used to fine-tune AKCFC or CB. Perhaps I can treat my evaluation as an "upper bound", i.e. an overestimate, of performance.
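A quick way to see this for yourself (a sketch, using the climate_fever dataset id on the HuggingFace Hub at the time of writing):

```python
from datasets import load_dataset

# Load Climate-Fever from the HuggingFace Hub and inspect the available partitions.
cf = load_dataset("climate_fever")
print(cf)             # a DatasetDict with a single "test" split
print(cf["test"][0])  # one row per claim, with nested evidence, votes, and entropy
```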
Customizing the dataloader to parse the right fields
The CF data is nested at multiple levels:
- each piece of evidence yields 2-5 votes (human evaluations of whether this evidence supports the claim) and a single summary label for the overall outcome across the individual votes
- each claim maps to five different pieces of evidence and a single summary label across this evidence (aggregating over 2-5 votes per evidence piece, or 10-25 votes total)
These two layers of aggregation mean individual labels frequently disagree with the final overall label, making the dataset tricky to understand as one scans row by row or reads a few random samples. It also requires some parsing effort to extract and format correctly with pandas.json_normalize()—see the sketch below. Associating example dataloader(s) with the dataset would help folks get started faster.
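Here's a minimal sketch of the flattening step, with field names taken from the CF dataset card (claim_id, claim, claim_label, and a nested evidences list). How the nested records come back may vary by loader version, so the helper below normalizes them defensively:

```python
import pandas as pd
from datasets import load_dataset

cf = load_dataset("climate_fever")["test"]

def as_records(evidences):
    # Depending on the loader, the nested evidence may arrive as a list of dicts
    # or as a dict of lists; normalize to a list of dicts.
    if isinstance(evidences, dict):
        keys = list(evidences)
        return [dict(zip(keys, values)) for values in zip(*evidences.values())]
    return evidences

# One dict per claim, with its evidence list normalized, then one row per
# (claim, evidence) pair via pandas.json_normalize().
records = [{**example, "evidences": as_records(example["evidences"])} for example in cf]
flat = pd.json_normalize(records, record_path="evidences",
                         meta=["claim_id", "claim", "claim_label"])
print(flat.shape)            # expect roughly 5 rows per claim
print(flat.columns.tolist())
```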
Challenge 2: Evaluating the model as published
Understanding label meanings & patterns of errors
I ran a quick evaluation on 1024 samples from CF and logged these to a wandb.Table() for exploration. I grouped these predictions by the correct answer (truth) and, on the right-hand side, filtered for incorrect classifications only (where guess != truth). The possible labels are:
- contradiction (the evidence refutes the claim)
- entailment (the evidence supports the claim)
- neutral (inconclusive / claim and evidence are unrelated)
The distributions of guesses for a given true label are very surprising. Based on this sample, the model gets most of the "neutral" class correct but confuses the vast majority of the "contradiction" and "entailment" examples with each other. Manual verification—writing new pairs of statements and running real-time inference on HF—was inconclusive: it's surprisingly hard to come up with an unambiguous pair of statements for any of these classes :)
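For reference, here's a rough sketch of the evaluation loop behind this table. The column names are my own, flat is the flattened dataframe from the earlier sketch, and the label_mapping order is copied from the model's sample inference code (which turns out to matter—see the P.S. at the end):

```python
import torch
import wandb
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_name = "amandakonet/climatebert-fact-checking"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

# Output label order, copied from the model's sample inference code
label_mapping = ["contradiction", "entailment", "neutral"]
# CF evidence_label integers: 0 = supports, 1 = refutes, 2 = not enough info
truth_names = {0: "entailment", 1: "contradiction", 2: "neutral"}

run = wandb.init(project="climatebert-cf", name="cf_eval_1024_entropy")
table = wandb.Table(columns=["claim", "evidence", "truth", "guess", "entropy"])

# `flat` is the flattened (claim, evidence) dataframe from the earlier sketch
for _, row in flat.head(1024).iterrows():
    inputs = tokenizer(row["claim"], row["evidence"], truncation=True, return_tensors="pt")
    with torch.no_grad():
        pred = int(model(**inputs).logits.argmax(dim=-1))
    table.add_data(row["claim"], row["evidence"],
                   truth_names[row["evidence_label"]], label_mapping[pred], row["entropy"])

run.log({"cf_eval_1024": table})
run.finish()
```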
[W&B panels: run cf_eval_1024_entropy]
Exploring prediction distributions
As a quick sanity check, I focused on the "entropy" field in CF. This is a measure of how much different voters disagreed on evaluating a specific (claim, evidence) pair.
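My working assumption is that this field is (roughly) the Shannon entropy of the empirical vote distribution; a quick sketch of that interpretation:

```python
import math
from collections import Counter

def vote_entropy(votes):
    """Shannon entropy of the empirical vote distribution (my interpretation of
    the CF `entropy` field; the exact base/normalization in CF may differ)."""
    counts = Counter(vote for vote in votes if vote is not None)
    total = sum(counts.values())
    return -sum((c / total) * math.log(c / total) for c in counts.values())

# Example vote strings for illustration
print(vote_entropy(["SUPPORTS"] * 5))                                    # unanimous -> 0.0
print(vote_entropy(["SUPPORTS", "SUPPORTS", "REFUTES", "NOT_ENOUGH_INFO", "REFUTES"]))  # mixed -> higher
```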
The "neutral" class has the highest entropy, and the distribution of entropy matches across the other two classes. Plotting the entropy across guessed (bottom left) and actual (bottom right) classes further confirms this—entropy across true labels is very slightly higher for the neutral class, but it is not obviously correlated with prediction accuracy (orange/correct and blue/incorrect predictions are mixed across the classes). You can hover your mouse over the dots in the charts below to see corresponding examples of the claim (C:) and evidence (E: ) pairs.
[W&B panels: run cf_eval_1024_entropy]
Getting a larger sample size vs questioning premises
The reported validation accuracy of AKCFC is 67%, which roughly matched my evaluations. Since I hadn't randomized my samples, I ran a longer evaluation of AKCFC on 4/5 of the CF data, partitioned by claim. Again, this is an upper bound on performance, since the model has almost certainly seen some of these (claim, evidence) pairs during fine-tuning. The upper panel shows the distribution of all model predictions, and the lower panel filters for errors (only the examples where the model was incorrect). The last panel shows some examples for exploration: you can sort, filter, or group by various columns, including the "N" ("contradiction"), "Y" ("entailment"), and "?" ("neutral") logits (the confidence scores for each of the three classes which the model outputs in inference mode). These tables entirely confirm my earlier suspicion—perhaps the "entailment" and "contradiction" labels are simply flipped somewhere?
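A sketch of how such a by-claim split might look (the actual partition I used isn't reproducible from this snippet; flat is the flattened dataframe from earlier):

```python
import numpy as np

# Assign each *claim* (not each row) to one of five partitions so that all five
# evidence pairs for a claim land in the same partition, then hold one partition out.
rng = np.random.default_rng(seed=0)
claim_ids = flat["claim_id"].unique()
partition = dict(zip(claim_ids, rng.integers(0, 5, size=len(claim_ids))))

eval_rows = flat[flat["claim_id"].map(partition) != 0]  # ~4/5 of claims for evaluation
held_out = flat[flat["claim_id"].map(partition) == 0]   # remaining ~1/5 of claims
```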
[W&B panels: CF Evaluation]
Resolution: swap "contradiction" and "entailment"
As a last check, I visualized the model's predictions on 80% of CF across the two logit scores representing entailment (Y, y-axis) and contradiction (N, x-axis). Correct predictions are shown in orange and incorrect in blue. The neutral pairs are mostly classified correctly, while the fraction correct in the contradiction and entailment regions is very small. You can hover your mouse over any point to see the associated claim and evidence.
[W&B panels: CF Evaluation]
Next, I compared the true labels and the model's guesses across the same set of examples. The red cluster of neutral pairs matches across the two charts, while the orange and blue clusters are clearly flipped. This visual finally convinced me to try inverting the labels: swapping all of the model's "entailment" guesses for "contradiction" and "contradiction" for "entailment". I did this with a quick Tables column operation:
row["guess"].replace("entailment", "tmp").replace("contradiction", "entailment").replace("tmp", "contradiction")
to get my new results immediately, without needing to rerun any scripts or redo the evaluation. While the visual change itself is unsurprising—it's equivalent to swapping the colors in the legend—seeing it was a crucial step in acknowledging that the evaluation workflow was incorrect and moving on to new experiments and explorations.
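For anyone working outside of Tables, the same swap is easy to sanity-check offline—a self-contained sketch in pandas, with toy columns standing in for the logged evaluation table:

```python
import pandas as pd

# Toy stand-in for the logged evaluation table
df = pd.DataFrame({
    "truth": ["entailment", "contradiction", "neutral", "entailment"],
    "guess": ["contradiction", "entailment", "neutral", "contradiction"],
})

# Swap "entailment" <-> "contradiction" in the model's guesses; leave "neutral" alone
swap = {"entailment": "contradiction", "contradiction": "entailment"}
df["guess_swapped"] = df["guess"].map(lambda guess: swap.get(guess, guess))

print((df["guess"] == df["truth"]).mean())          # accuracy before the swap
print((df["guess_swapped"] == df["truth"]).mean())  # accuracy after the swap
```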
[W&B panels: CF Evaluation]
P.S. So how did this happen?
If you've made it this far, you're probably wondering what the issue was. To the best of my understanding, the sample inference code for the existing HF model I was using contains a different label mapping:
label_mapping = ['contradiction', 'entailment', 'neutral']
Meanwhile, the Climate-Fever dataset card describes the label field as:
- claim_label: an int feature, the overall label assigned to the claim (based on evidence majority vote). The labels correspond to 0: "supports", 1: "refutes", 2: "not enough info", and 3: "disputed".
"supports" and "entailment" match semantically, but the integer labels are different across the dataset and the model (0 in the original dataset, 1 according to the model inference code provided). "refutes" is 1 in the original dataset, but 0 in the model inference code. I hope that versioning datasets, models, and metadata in W&B Artifacts can help minimize the chances of this kind of confusion in the future :)
Appendix
Browsing the Climate-Fever Dataset
Below I show a 20% sample of the Climate-Fever dataset (CF): each claim is paired with the first piece of evidence (of five total per claim in the full dataset). For each claim-evidence pair, you can see the
- overall vote: does the evidence support (0), refute (1), or not sufficiently relate (2) to the claim, with (3) reserved for disputed pairs.
- source article for the claim
- entropy: one of 7 possible values measuring the distribution of the 5 votes per pair
- all votes: up to 5 individual votes per claim-evidence pair
Some of these claim-evidence pairs are very tricky to evaluate! Try sorting by descending entropy to see some of the most challenging examples (hover over the "entropy" column heading > click on the three dots > select "Sort Desc").
A grouped version of the same Table shows the distribution by overall vote. There are more than twice as many "supported" as "refuted" pairings in this 20% sample.
Vote annotation key
- 0: evidence supports the claim
- 1: evidence refutes the claim
- 2: unclear, not enough information
- 3: disputed
[W&B Table: CF 20% sample]
Exploring claims by length and article
Here are two saved views of this dataset sample that may be of interest: (1) sorted by shortest evidence text first, and (2) grouped by article title and sorted so that the articles yielding the most claims in the dataset appear first.
[W&B Table: Climate-Fever 20% sample]