
[longer draft] tackling climate

In the US, new federal rules will soon require all public companies to disclose their impact on the climate: how they identify, address, and mitigate relevant risks and manage carbon emissions. Parsing this pile of new paperwork will be challenging, and the recent ClimateBert language model could help. Webersinke et al. (2021) benchmark its performance on text classification, sentiment analysis, and fact-checking for climate-specific text. For this last task, Diggelmann et al. (2021) compile and share Climate-Fever, a dataset of 1,535 real-world climate claims and associated evidence. In this report, I visualize the process of fine-tuning the ClimateBert model and explore performance on the Climate-Fever dataset to find patterns and opportunities for further improvement.

Browsing the Climate-Fever Dataset

Below I show a 20% sample of the Climate-Fever dataset (CF): each claim is paired with the first piece of evidence (of five total per claim in the full dataset). For each claim-evidence pair, you can see:
  • overall vote: does the evidence support (0), refute (1), or not sufficiently relate (2) to the claim, with (3) reserved for disputed pairs
  • article: the Wikipedia article from which the evidence sentence was retrieved
  • entropy: the entropy of the 5 votes for the pair (one of 7 possible values; higher means more annotator disagreement)
  • all votes: up to 5 individual votes per claim-evidence pair
Some of these claim-evidence pairs are very tricky to evaluate! Try sorting by descending entropy to see some of the most challenging examples (hover over the "entropy" column heading > click on the three dots > select "Sort Desc").
A grouped version of the same Table shows the distribution by overall vote. There are more than twice as many "supported" as "refuted" pairings in this 20% sample.

Vote annotation key

  • 0: evidence supports the claim
  • 1: evidence refutes the claim
  • 2: unclear, not enough information
  • 3: disputed
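
To make the Table above reproducible, here is a minimal sketch that loads Climate-Fever from the Hugging Face Hub, flattens each claim with its first evidence sentence, and logs a 20% sample as a W&B Table. The field names follow my reading of the climate_fever dataset card, and the project name is a placeholder.

```python
import pandas as pd
import wandb
from datasets import load_dataset

# Climate-Fever ships as a single "test" split on the Hugging Face Hub
ds = load_dataset("climate_fever", split="test")

rows = []
for example in ds:
    ev = example["evidences"][0]  # first of up to five evidence sentences per claim
    rows.append({
        "claim": example["claim"],
        "overall_vote": example["claim_label"],  # 0 supports, 1 refutes, 2 not enough info, 3 disputed
        "evidence": ev["evidence"],
        "evidence_label": ev["evidence_label"],  # per-evidence verdict (0/1/2)
        "article": ev["article"],                # Wikipedia article the evidence was drawn from
        "votes": ", ".join(str(v) for v in ev["votes"]),  # up to 5 annotator votes, joined for display
        "entropy": ev["entropy"],                # entropy of those votes
    })
df = pd.DataFrame(rows)

# Log the 20% sample shown above as a sortable W&B Table
sample = df.sample(frac=0.2, random_state=0)
run = wandb.init(project="climate-fever-report")  # placeholder project name
run.log({"cf_20pct_sample": wandb.Table(dataframe=sample)})
```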

[Panel: CF 20% sample (1 run)]


Exploring claims


[Panel: Run set (27 runs)]


Resources

Available models:
  • ClimateBert base + variants: pretrained only, not finetuned for any task, so they can be trained further but not evaluated directly
  • a fact-checking ClimateBert variant on Hugging Face: 61% accuracy
Available datasets:
  • Climate-Fever: only a test split (no official train/validation partition)

Entailment is most often mistaken for contradiction

Neutral statements have the highest entropy

[Panel: run cf_eval_1024_entropy (1 run)]
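
As a quick numerical check on this claim, the flattened frame from the loading sketch above can be grouped by evidence label; the readable label names in the mapping are mine, matching the vote key (0 supports, 1 refutes, 2 not enough info).

```python
# mean annotator entropy per evidence label, using `df` from the loading sketch
label_names = {0: "supports (entailment)", 1: "refutes (contradiction)", 2: "not enough info (neutral)"}
summary = (
    df.assign(label=df["evidence_label"].map(label_names))
      .groupby("label")["entropy"]
      .describe()[["count", "mean", "50%", "max"]]
)
print(summary)
```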



Confusion of contradiction and entailment

As a point of reference, I found an existing model on Hugging Face finetuned on CF: amandakonet/climatebert-fact-checking. Below I evaluate this model on partitions of CF and analyze the pattern of predictions. Since no official train/val/test partition of CF is available, note that I may be evaluating on examples the model saw during training. An interesting pattern emerges as I evaluate on more examples: "contradiction" and "entailment" are confused so often that the model would perform much better if those two labels were simply flipped. Perhaps the labels actually were flipped somewhere in the pipeline?

[Panel: CF Evaluation (4 runs)]
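
A minimal version of that evaluation, reusing the flattened sample from the loading sketch: it scores each claim-evidence pair with the Hugging Face model, maps CF evidence labels onto the model's label ids by name (read from the model config rather than hard-coded, and assuming the names are "entailment", "contradiction", and "neutral"), and then checks how much accuracy would improve if entailment and contradiction were swapped.

```python
import torch
from sklearn.metrics import accuracy_score, confusion_matrix
from transformers import AutoModelForSequenceClassification, AutoTokenizer

name = "amandakonet/climatebert-fact-checking"
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForSequenceClassification.from_pretrained(name)
model.eval()

claims = sample["claim"].tolist()
evidence = sample["evidence"].tolist()

with torch.no_grad():
    enc = tokenizer(claims, evidence, padding=True, truncation=True, return_tensors="pt")
    logits = model(**enc).logits
preds = logits.argmax(dim=-1).tolist()

# Map CF evidence labels onto the model's ids by label name (assumed names below)
label2id = {k.lower(): v for k, v in model.config.label2id.items()}
cf_to_model = {0: label2id["entailment"],     # evidence supports the claim
               1: label2id["contradiction"],  # evidence refutes the claim
               2: label2id["neutral"]}        # not enough information
y_true = [cf_to_model[v] for v in sample["evidence_label"]]

print("accuracy:", accuracy_score(y_true, preds))
print(confusion_matrix(y_true, preds))

# How much better would the model do if entailment and contradiction were flipped?
flip = {label2id["entailment"]: label2id["contradiction"],
        label2id["contradiction"]: label2id["entailment"]}
flipped = [flip.get(p, p) for p in preds]
print("accuracy with flipped labels:", accuracy_score(y_true, flipped))
```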


Can we do better with calibration?

High-scoring "no"s and weak, low-scoring "yes"es are consistently misclassified

[Panel: run cf_eval_1024_scores (4 runs)]


Observations

  • the model has trouble when the same proper noun appears in different grammatical roles in a sentence (e.g. as a subject vs. a direct or indirect object); it gets confused and predicts "contradiction"
  • are all those high-confidence contradictions shorter, more generic, and less content-ful?
  • some entailment pairs have very high contradiction scores while neutral pairs have lower ones; these score distributions are hard to compare directly, and entailment actually has the lowest confusion scores. Can we calibrate the scores somehow? (see the sketch after this list)
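
Comparing raw scores across classes is shaky if the model is miscalibrated, so one standard first step is temperature scaling (Guo et al., 2017): fit a single temperature that softens overconfident softmax scores before comparing or thresholding them. A minimal sketch, reusing logits and y_true from the evaluation snippet; note that without an official CF split there is no truly held-out set to fit the temperature on.

```python
import torch
import torch.nn.functional as F

def fit_temperature(logits, labels, lr=0.01, steps=500):
    """Fit a single softmax temperature by minimizing NLL on (ideally held-out) data."""
    logits = torch.as_tensor(logits, dtype=torch.float32)
    labels = torch.as_tensor(labels, dtype=torch.long)
    log_t = torch.zeros(1, requires_grad=True)  # optimize log T so T stays positive
    optimizer = torch.optim.Adam([log_t], lr=lr)
    for _ in range(steps):
        optimizer.zero_grad()
        loss = F.cross_entropy(logits / log_t.exp(), labels)
        loss.backward()
        optimizer.step()
    return log_t.exp().item()

T = fit_temperature(logits, y_true)
calibrated = F.softmax(torch.as_tensor(logits, dtype=torch.float32) / T, dim=-1)
print("fitted temperature:", T)  # T > 1 means the raw scores were overconfident
```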

[Panel: run cf_eval_full_N0 (1 run)]


Evaluating performance: which votes are hardest to predict?

Both panels show ClimateBert (CB) performance across 4/5 partitions of the CF data, grouped by the correct label: entailment (the evidence supports the claim), neutral (no conclusion possible; the claim and evidence are unrelated), or contradiction (the evidence contradicts the claim). The upper panel shows the distribution of all model predictions, while the lower panel filters for errors, where the model predicted an incorrect answer.

[Panel: Run set (4 runs)]
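
Both panels can be produced from a single logged prediction table, then grouped by the true label and filtered on the correct column in the report UI. A sketch along those lines, reusing names from the evaluation snippet; the run and column names here are illustrative, not the ones used for the runs above.

```python
import wandb

id2label = model.config.id2label
run = wandb.init(project="climate-fever-report", name="cf_eval_example")  # placeholder names

table = wandb.Table(columns=["claim", "evidence", "true_label", "pred_label", "correct"])
for c, e, t, p in zip(claims, evidence, y_true, preds):
    table.add_data(c, e, id2label[t], id2label[p], t == p)

run.log({"cf_predictions": table})
run.finish()
```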


Possible next steps

  • clean up everything so far, consolidate feedback, and make an interesting report
  • log training of a CB variant & look at predictions over time
  • figure out how to finetune CB variants for CF
    • apply the same finetuning procedure to multiple variants?
    • unclear why there's no official split of CF
  • email the CF & CB authors to figure out my questions
    • can I get other datasets?
    • can I get details of training?
    • would they want to collaborate on any of this?
  • figure out what actual disclosures will look like & whether this is useful
  • find other sentiment analysis / classification benchmark models or datasets
    • fact-checking outside of climate
    • try my own finetuning, but ideally I'd need their topic dataset


References

  • climatebert.ai: corporate disclosure analytics for climate
  • the proposed rules will meet resistance: Scope 3 emissions (from a company's partners, suppliers, and customers) are especially hard to quantify and standardize