
[longer draft] tackling climate

In the US, new federal rules will soon require all public companies to disclose their impact on the climate: how they identify, address, and mitigate relevant risks and manage carbon emissions. Parsing this pile of new paperwork will be challenging, and the recent ClimateBert language model could help. Webersinke et al. (2021) benchmark its performance on text classification, sentiment analysis, and fact-checking for climate-specific text. For this last task, Diggelmann et al. (2021) compile and share Climate-Fever, a dataset of 1,535 real-world climate claims and associated evidence. In this report, I visualize the process of fine-tuning the ClimateBert model and explore performance on the Climate-Fever dataset to find patterns and opportunities for further improvement.

Browsing the Climate-Fever Dataset

Below I show a 20% sample of the Climate-Fever dataset (CF): each claim is paired with the first piece of evidence (of five total per claim in the full dataset). For each claim-evidence pair, you can see:
  • overall vote: does the evidence support (0), refute (1), or not sufficiently relate (2) to the claim, with (3) reserved for disputed pairs
  • article: the Wikipedia article from which the evidence sentence was retrieved
  • entropy: the entropy of the 5 votes for the pair (one of 7 possible values; higher means more annotator disagreement)
  • all votes: up to 5 individual votes per claim-evidence pair
Some of these claim-evidence pairs are very tricky to evaluate! Try sorting by descending entropy to see some of the most challenging examples (hover over the "entropy" column heading > click on the three dots > select "Sort Desc").
A grouped version of the same Table shows the distribution by overall vote. There are more than twice as many "supported" as "refuted" pairings in this 20% sample.

Vote annotation key

  • 0: evidence supports the claim
  • 1: evidence refutes the claim
  • 2: unclear, not enough information
  • 3: disputed
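
To make the Table above reproducible, here is a minimal sketch that loads Climate-Fever from the Hugging Face Hub, flattens each claim with its first evidence sentence, and logs a 20% sample as a W&B Table. The field names follow my reading of the climate_fever dataset card, and the project name is a placeholder.

```python
import pandas as pd
import wandb
from datasets import load_dataset

# Climate-Fever ships as a single "test" split on the Hugging Face Hub
ds = load_dataset("climate_fever", split="test")

rows = []
for example in ds:
    ev = example["evidences"][0]  # first of up to five evidence sentences per claim
    rows.append({
        "claim": example["claim"],
        "overall_vote": example["claim_label"],  # 0 supports, 1 refutes, 2 not enough info, 3 disputed
        "evidence": ev["evidence"],
        "evidence_label": ev["evidence_label"],  # per-evidence verdict (0/1/2)
        "article": ev["article"],                # Wikipedia article the evidence was drawn from
        "votes": ", ".join(str(v) for v in ev["votes"]),  # up to 5 annotator votes, joined for display
        "entropy": ev["entropy"],                # entropy of those votes
    })
df = pd.DataFrame(rows)

# Log the 20% sample shown above as a sortable W&B Table
sample = df.sample(frac=0.2, random_state=0)
run = wandb.init(project="climate-fever-report")  # placeholder project name
run.log({"cf_20pct_sample": wandb.Table(dataframe=sample)})
```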

[Panel: CF 20% sample (1 run)]


Exploring claims


[Panel: Run set (27 runs)]


Resources

Available models:
  • ClimateBert base + variants: pretrained only, not finetuned for any task, so they can be trained further but not evaluated directly
  • a fact-checking ClimateBert variant on Hugging Face: 61% accuracy
Available datasets:
  • Climate-Fever: only a test split (no official train/validation partition)

Entailment is most often mistaken for contradiction

Neutral statements have the highest entropy

[Panel: run cf_eval_1024_entropy (1 run)]
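
As a quick numerical check on this claim, the flattened frame from the loading sketch above can be grouped by evidence label; the readable label names in the mapping are mine, matching the vote key (0 supports, 1 refutes, 2 not enough info).

```python
# mean annotator entropy per evidence label, using `df` from the loading sketch
label_names = {0: "supports (entailment)", 1: "refutes (contradiction)", 2: "not enough info (neutral)"}
summary = (
    df.assign(label=df["evidence_label"].map(label_names))
      .groupby("label")["entropy"]
      .describe()[["count", "mean", "50%", "max"]]
)
print(summary)
```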



Confusion of contradiction and entailment

As a point of reference, I found an existing model on Hugging Face finetuned on CF: amandakonet/climatebert-fact-checking. Below I evaluate this model on partitions of CF and analyze the pattern of predictions. Since no official train/val/test partition of CF is available, note that I may be evaluating on examples the model saw during training. An interesting pattern emerges as I evaluate on more examples: "contradiction" and "entailment" are confused so often that the model would perform much better if those two labels were simply flipped. Perhaps the labels actually were flipped somewhere in the pipeline?

[Panel: CF Evaluation (4 runs)]
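
A minimal version of that evaluation, reusing the flattened sample from the loading sketch: it scores each claim-evidence pair with the Hugging Face model, maps CF evidence labels onto the model's label ids by name (read from the model config rather than hard-coded, and assuming the names are "entailment", "contradiction", and "neutral"), and then checks how much accuracy would improve if entailment and contradiction were swapped.

```python
import torch
from sklearn.metrics import accuracy_score, confusion_matrix
from transformers import AutoModelForSequenceClassification, AutoTokenizer

name = "amandakonet/climatebert-fact-checking"
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForSequenceClassification.from_pretrained(name)
model.eval()

claims = sample["claim"].tolist()
evidence = sample["evidence"].tolist()

with torch.no_grad():
    enc = tokenizer(claims, evidence, padding=True, truncation=True, return_tensors="pt")
    logits = model(**enc).logits
preds = logits.argmax(dim=-1).tolist()

# Map CF evidence labels onto the model's ids by label name (assumed names below)
label2id = {k.lower(): v for k, v in model.config.label2id.items()}
cf_to_model = {0: label2id["entailment"],     # evidence supports the claim
               1: label2id["contradiction"],  # evidence refutes the claim
               2: label2id["neutral"]}        # not enough information
y_true = [cf_to_model[v] for v in sample["evidence_label"]]

print("accuracy:", accuracy_score(y_true, preds))
print(confusion_matrix(y_true, preds))

# How much better would the model do if entailment and contradiction were flipped?
flip = {label2id["entailment"]: label2id["contradiction"],
        label2id["contradiction"]: label2id["entailment"]}
flipped = [flip.get(p, p) for p in preds]
print("accuracy with flipped labels:", accuracy_score(y_true, flipped))
```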


Can we do better with calibration?

High-scoring "no"s and weak, low-scoring "yes"es are consistently misclassified

[Panel: run cf_eval_1024_scores (4 runs)]


Observations

  • the model has trouble when the same proper noun appears in different grammatical roles in a sentence (e.g. as a subject vs. a direct or indirect object); it gets confused and predicts "contradiction"
  • are all those high-confidence contradictions shorter, more generic, and less content-ful?
  • some entailment pairs have very high contradiction scores while neutral pairs have lower ones; these score distributions are hard to compare directly, and entailment actually has the lowest confusion scores. Can we calibrate the scores somehow? (see the sketch after this list)
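
Comparing raw scores across classes is shaky if the model is miscalibrated, so one standard first step is temperature scaling (Guo et al., 2017): fit a single temperature that softens overconfident softmax scores before comparing or thresholding them. A minimal sketch, reusing logits and y_true from the evaluation snippet; note that without an official CF split there is no truly held-out set to fit the temperature on.

```python
import torch
import torch.nn.functional as F

def fit_temperature(logits, labels, lr=0.01, steps=500):
    """Fit a single softmax temperature by minimizing NLL on (ideally held-out) data."""
    logits = torch.as_tensor(logits, dtype=torch.float32)
    labels = torch.as_tensor(labels, dtype=torch.long)
    log_t = torch.zeros(1, requires_grad=True)  # optimize log T so T stays positive
    optimizer = torch.optim.Adam([log_t], lr=lr)
    for _ in range(steps):
        optimizer.zero_grad()
        loss = F.cross_entropy(logits / log_t.exp(), labels)
        loss.backward()
        optimizer.step()
    return log_t.exp().item()

T = fit_temperature(logits, y_true)
calibrated = F.softmax(torch.as_tensor(logits, dtype=torch.float32) / T, dim=-1)
print("fitted temperature:", T)  # T > 1 means the raw scores were overconfident
```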

[Panel: run cf_eval_full_N0 (1 run)]


Evaluating performance: which votes are hardest to predict?

Both panels show ClimateBert (CB) performance across 4/5 partitions of the CF data, grouped by the correct label: entailment (the evidence supports the claim), neutral (no conclusion possible; the claim and evidence are unrelated), or contradiction (the evidence contradicts the claim). The upper panel shows the distribution of all model predictions, while the lower panel filters for errors, where the model predicted an incorrect answer.

[Panel: Run set (4 runs)]
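
Both panels can be produced from a single logged prediction table, then grouped by the true label and filtered on the correct column in the report UI. A sketch along those lines, reusing names from the evaluation snippet; the run and column names here are illustrative, not the ones used for the runs above.

```python
import wandb

id2label = model.config.id2label
run = wandb.init(project="climate-fever-report", name="cf_eval_example")  # placeholder names

table = wandb.Table(columns=["claim", "evidence", "true_label", "pred_label", "correct"])
for c, e, t, p in zip(claims, evidence, y_true, preds):
    table.add_data(c, e, id2label[t], id2label[p], t == p)

run.log({"cf_predictions": table})
run.finish()
```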


Possible next steps

  • clean up everything so far, consolidate feedback, and make an interesting report
  • log training of a CB variant & look at predictions over time
  • figure out how to finetune CB variants for CF
    • apply the same finetuning procedure to multiple variants?
    • unclear why there's no official split of CF
  • email the CF & CB authors to figure out my questions
    • can I get other datasets?
    • can I get details of training?
    • would they want to collaborate on any of this?
  • figure out what actual disclosures will look like & whether this is useful
  • find other sentiment analysis / classification benchmark models or datasets
    • fact-checking outside of climate
    • try my own finetuning, but ideally I'd need their topic dataset


References

  • climatebert.ai: corporate disclosure analytics for climate
  • the proposed rules will meet resistance: Scope 3 emissions (from a company's partners, suppliers, and customers) are especially hard to quantify and standardize