
Using Language Models for Fact-Checking & Claim Assessment

In this article, we review an automated solution for fact-checking that fine-tunes recently published language models on claim and fake-news datasets.
As false information and fake news continue to propagate across the internet and social networks, fact-checking operations become necessary to maintain a truthful digital environment. The reasons are fairly well established: disinformation and fakery have had deleterious effects on politics (such as the 2016 U.S. elections) and public health (COVID-19).
A number of solutions, both manual and automatic, have been proposed to deal with this problem and limit the spread of false information. The manual approaches used by websites such as PolitiFact.com, FactCheck.org, and Snopes.com aren't a viable long-term solution: put simply, disinformation is increasing, and human fact-checkers simply don't scale at the same rate.
Here, we present our contribution in this regard: an automated solution for fact-checking that uses state-of-the-art language models for NLP tasks (BERT, RoBERTa, XLNet, etc.) and five well-known datasets of annotated claims and tweets (FEVER, MultiFC, Liar, COVID-19, and ANTiVax) to fine-tune each LM and classify a given claim.
We show that fine-tuning an LM with the right settings can achieve 98% accuracy and a 98% F1-score on the COVID-19 and ANTiVax datasets, as well as 64% accuracy and a 63% F1-score on the FEVER dataset, outperforming the majority of fact-checking methods that exist today.


Background

From a social and psychological perspective, humans have proven to be irrational and vulnerable when differentiating between real and fake news (typical accuracy ranges between 55% and 58%). In other words, fake news can gain broad public trust more easily than truthful news because individuals tend to believe fake news after constant exposure (the validity effect), when it confirms their pre-existing beliefs (confirmation bias), or simply out of the obligation to participate socially and assert a social identity (peer pressure). The social sciences are still trying to comprehend the biological motivations that make fake news appealing to humans.
On the other hand, the rise of social media platforms has hugely accelerated news consumption in general (whether that news is real or fake). As of August 2017, 67% of Americans got their news from social media. These platforms let users share, forward, vote, and participate in online discussions, and these factors also contribute to the spread of fake news.
And to be clear: some fake news has a real cost, be it economic, political, or health-related. For example, a fake tweet claiming that Barack Obama was injured in an explosion caused a $130 billion drop in the stock market. Dubious claims about COVID vaccines, or about the companies that manufactured them, directly reduced vaccination rates and deepened the pandemic globally.
With all that being said, is there a way to monitor the spread of fake news through social media? Or, more specifically, how can we differentiate fake news from real news, and at what level of confidence can we do so?
From a computer engineering perspective, we examined various approaches:
  • Knowledge-based Fake News Detection: a method that aims to assess news authenticity by comparing the knowledge extracted from to-be-verified news content with known facts. Broadly: you can think of this as fact-checking.
  • Style-based Fake News Detection: focuses on the style of writing, i.e., the form of a text rather than its meaning.
  • Propagation-based Fake News Detection: a principled way to characterize and understand hierarchical propagation network features, based on a statistical comparison of micro-level and macro-level propagation features of fake and real news.
  • Credibility-based Fake News Detection: the information about authors of news articles can indicate news credibility and help detect fake news.
In this article, we will focus on a modern approach that utilizes pre-trained Language Models (LMs) to solve the classification task of fact-checking claims. The aim is not to implement an algorithm that scans social networks for real-time fake news detection; rather, we design a model that can assess, with a degree of confidence, the truthfulness or falseness of a claim given by a user as input, by exploiting LMs that were already trained on large-scale textual databases such as Wikipedia.

Datasets: Splits & Samples

In this study, we experimented with five well-known datasets in the fact-checking field. Some of these datasets don't contain many records, as it is difficult for human fact-checkers to manually annotate each claim in order to construct a large dataset. On the other hand, we rely on the fact that the Language Models to be fine-tuned are already pre-trained on gigabytes of general text and corpora, so they require less novel data during fine-tuning.
The following sections show a description of each dataset with tables of splits and input/output samples.

FEVER Dataset

FEVER (standing for Fact Extraction and VERification) consists of 185,445 claims generated by altering sentences extracted from Wikipedia and subsequently verified without knowledge of the sentence they were derived from.
Dataset splits
Claim samples from the FEVER dataset
(NEI means Not Enough Information to make an assessment.)
More details about the FEVER dataset can be found in this paper.

Liar Dataset

Liar is a publicly available dataset for fake news detection. It consists of 12.8K manually labeled short statements collected in various contexts from PolitiFact.com, which provides a detailed analysis report and links to source documents for each case.
Dataset splits
  • TRUE – The statement is accurate, and there’s nothing significant missing.
  • MOSTLY TRUE – The statement is accurate but needs clarification or additional information.
  • HALF TRUE – The statement is partially accurate but leaves out important details or takes things out of context.
  • BARELY TRUE – The statement contains an element of truth but ignores critical facts that would give a different impression.
  • FALSE – The statement is not accurate.
  • PANTS FIRE – The statement is not accurate and makes a ridiculous claim (a.k.a. "Liar, Liar, Pants on Fire!").
Claim samples from the Liar dataset

More details about the Liar dataset can be found in this paper.

MultiFC Dataset

MultiFC is the largest publicly available dataset of naturally occurring factual claims for automatic claim verification. It consists of 34,918 claims collected from 26 English fact-checking websites, paired with textual sources and rich metadata, and labeled for veracity by human expert journalists.
Originally, this dataset had over 40 different labels, since the data was fetched from different sources. After a thorough investigation, we observed that some labels are semantically identical; for example, True is the same as truth and correct. We therefore decided to map the output space to a smaller one with only 5 classes, as shown in the table below.

Dataset splits
Claim samples from the MultiFC dataset

More details about the MultiFC dataset can be found in this paper.

COVID-19 Rumor Dataset

A collection of more than 6,000 annotated rumors about COVID-19 extracted from various social media platforms such as Twitter, Facebook, and Instagram.

Dataset splits
Claim samples from the COVID-19 rumor dataset


More details about the COVID-19 rumor dataset can be found in this paper.

ANTi-Vax Dataset

This dataset contains more than 15K tweets that were annotated as misinformation or general COVID-19 vaccine tweets using reliable sources and validated by medical experts.

Dataset splits
Claim samples from the ANTi-Vax dataset


More details about the ANTi-Vax dataset can be found in this paper.

Transformer-Based Language Models

2018 was an inflection point for the NLP field, in large part due to a Google model called BERT (Bidirectional Encoder Representations from Transformers). At the time, it was described as a state-of-the-art model that solves the most difficult tasks in NLP, and it is still used today in Google's search engine to better understand queries. The BERT model can be fine-tuned with just one additional output layer to create advanced models for a wide range of tasks, such as question answering and language inference, without substantial task-specific architecture modifications.
From there, many language models were introduced that used the same architecture as BERT with small changes, such as the number of parameters and the data on which they were pre-trained.

Transformer Architecture

If you aren't familiar: in 2017, a research team at Google published a paper entitled "Attention Is All You Need," which introduced a new deep neural network architecture called the Transformer.
The Transformer - model architecture.

What is special about this architecture is that it's based on the attention mechanism. In short, the idea behind this mechanism is to build a matrix of how important each word is in defining the context of every other word in the same sentence. Each word attends strongly to itself and with varying strength to the other words, and that is what gives us context. This was a game changer in NLP because sentences can bear different meanings if the order of the words changes. (Think "I want a bagel so bad" vs. "this bagel is so bad.")
Another advantage of this architecture is that we don't need to pass the data sequentially; we can process it in parallel, which drastically improves training time. And since we look at the words all at once, without any directional constraint, the overall understanding of language and context is much better than with other architectures.
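To make the idea concrete, here is a minimal NumPy sketch of single-head scaled dot-product attention, the building block described above. The shapes and toy inputs are purely illustrative.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Single-head attention: Q, K, V are (seq_len, d_k) matrices."""
    d_k = Q.shape[-1]
    # Pairwise "importance" of every word with respect to every other word.
    scores = Q @ K.T / np.sqrt(d_k)                      # (seq_len, seq_len)
    # Row-wise softmax so each word's attention weights sum to 1.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    # Each output vector is a context-aware mixture of the value vectors.
    return weights @ V                                   # (seq_len, d_k)

# Toy example: a "sentence" of 4 words, each embedded in 8 dimensions.
x = np.random.randn(4, 8)
context = scaled_dot_product_attention(x, x, x)          # self-attention
print(context.shape)                                     # (4, 8)
```

In the full Transformer, this operation is repeated across multiple heads and layers, with learned projection matrices producing Q, K, and V from the input embeddings.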

Selected Language Models

For this research project, we chose the following set of seven pre-trained language models to solve this classification problem:
  • BERT-base-uncased
  • RoBERTa-base
  • DistilBERT-base-uncased
  • XLNET-base-cased
  • ALBERT-base-v2
  • BigBird-RoBERTa-base
  • ConvBERT-base
All of these LMs are based on the Transformer architecture. They have subtle differences in the architecture, such as the number of layers, number of attention heads, type of layers (convolutional vs. normal), and more.
We can see a brief comparison of each LM in the following table:

Brief comparison of the different LM architectures used in this study

They are also pre-trained differently on large-scale textual databases. The conjecture behind our method is therefore that, by fine-tuning these LMs and exploiting the knowledge already stored within them, we can produce a new model capable of differentiating between real and fake news.
Needless to say, other LMs exist today, but as they are all based on the Transformer architecture, it is unnecessary to experiment with all of them. Furthermore, the goal of our research is not to find the best existing LM, but to investigate the capability of LMs to be used as fact-checkers.
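As a rough sketch of how such models are instantiated, the snippet below loads any of the seven checkpoints with a fresh classification head using the Hugging Face Transformers library. The Hub identifiers are the public checkpoints we assume correspond to the models listed above; the number of labels depends on the dataset.

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Public Hub checkpoints assumed to match the seven models above.
CHECKPOINTS = [
    "bert-base-uncased",
    "roberta-base",
    "distilbert-base-uncased",
    "xlnet-base-cased",
    "albert-base-v2",
    "google/bigbird-roberta-base",
    "YituTech/conv-bert-base",
]

def load_for_classification(checkpoint: str, num_labels: int):
    """Load a pre-trained encoder and attach an untrained classification head."""
    tokenizer = AutoTokenizer.from_pretrained(checkpoint)
    model = AutoModelForSequenceClassification.from_pretrained(
        checkpoint, num_labels=num_labels
    )
    return tokenizer, model

# Example: a binary True/False claim classifier on top of BERT.
tokenizer, model = load_for_classification("bert-base-uncased", num_labels=2)
```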

Experimental Protocol & Results

Now that our datasets are pre-processed, cleaned, and transformed, let's move to the experimental phase where we fine-tune each model using each dataset separately.

Experimental Setup

As mentioned before, we conduct our experiments by fine-tuning each LM using the training/validation/testing splits associated with each annotated dataset. In addition, we experiment with the original number of labels as well as with two labels (True and False) by mapping the other labels into a space of only two; for example, Mostly True becomes True, and Barely True becomes False. The intention behind this label-space reduction is to evaluate the strength of LMs when the boundaries are well defined and separate (True vs. False) versus when they are overlapping and vague (True vs. Mostly True or Half True).
It goes without saying that the label-space reduction was not performed for the COVID-19 rumor and ANTi-Vax datasets because they only have two labels. As for the FEVER dataset, the reduction was performed by excluding the Not Enough Information (NEI) records: there is no logical way to assign them to one space or the other, since such a claim could be either True or False and we have no way to make that assessment.
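For illustration, a hypothetical version of this label-space reduction might look like the following. The FEVER and Liar label names are the standard ones shipped with those datasets, while the assignment of the intermediate Liar labels (such as half-true) is an assumption of this sketch rather than something stated above.

```python
# Hypothetical 6-to-2 mapping for the Liar dataset. Only mostly-true -> True and
# barely-true -> False are stated in the text; the other assignments, notably
# "half-true", are assumptions of this sketch.
LIAR_TO_BINARY = {
    "true": "True",
    "mostly-true": "True",
    "half-true": "True",       # assumption
    "barely-true": "False",
    "false": "False",
    "pants-fire": "False",
}

def reduce_fever_labels(records):
    """Map SUPPORTS/REFUTES to True/False and drop NOT ENOUGH INFO records."""
    mapping = {"SUPPORTS": "True", "REFUTES": "False"}
    return [
        {**rec, "label": mapping[rec["label"]]}
        for rec in records
        if rec["label"] in mapping
    ]
```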
Furthermore, the hyperparameters were carefully selected after numerous repeated runs. They may not be the optimal hyperparameters, but optimizing the models remains a separate task that can be tackled in future work. A log file containing the hyperparameters of each successful run is provided alongside the code, the metrics, and the fine-tuned LMs.

The general architecture for the proposed solution

The LMs were implemented using the Hugging Face API, and the experimental phase was monitored closely using the Weights & Biases API. It is also worth mentioning that this task was performed on an NVIDIA Quadro RTX 8000 GPU with 48 GB of GDDR6 memory.
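To give a sense of what a single run looks like, here is a minimal sketch of fine-tuning one checkpoint with the Transformers Trainer while streaming metrics to Weights & Biases. The tiny inline dataset, project name, and hyperparameters are placeholders, not the actual configuration used in this study (those are in the run logs).

```python
import wandb
from datasets import Dataset
from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
                          Trainer, TrainingArguments)

checkpoint = "roberta-base"  # any of the seven checkpoints above

tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)

# Toy stand-in for one of the claim datasets: a "claim" text and a 0/1 label.
train_ds = Dataset.from_dict({
    "claim": ["The vaccine alters human DNA.", "Masks reduce droplet transmission."],
    "label": [0, 1],
})
val_ds = train_ds

def tokenize(batch):
    return tokenizer(batch["claim"], truncation=True,
                     padding="max_length", max_length=128)

train_ds = train_ds.map(tokenize, batched=True)
val_ds = val_ds.map(tokenize, batched=True)

wandb.init(project="lm-fact-checking", name=f"{checkpoint}-2-labels")  # placeholder names

args = TrainingArguments(
    output_dir="./checkpoints",
    num_train_epochs=3,                # illustrative hyperparameters only
    per_device_train_batch_size=32,
    learning_rate=2e-5,
    evaluation_strategy="epoch",
    report_to="wandb",                 # stream training/eval metrics to W&B
)

Trainer(model=model, args=args,
        train_dataset=train_ds, eval_dataset=val_ds).train()
```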

Evaluation Metrics

In all our experiments, we rely on the usual classification metrics: accuracy, recall, precision, and F1-score. Accuracy tells us immediately whether a model is being trained correctly and how it may perform in general. Precision matters when the cost of false positives is high: a model with very low precision may classify false claims as true, which can be problematic. In contrast, recall matters when the cost of false negatives is high: a model with low recall may classify true claims as false. Finally, the F1-score is an overall measure of a model's performance that combines precision and recall; a good F1-score means we have few false positives and few false negatives, so the model is correctly identifying false claims and fake news.
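As a sketch, these metrics can be computed with scikit-learn in a function compatible with the Trainer's compute_metrics hook shown earlier; the macro-averaging choice here is an assumption.

```python
import numpy as np
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

def compute_metrics(eval_pred):
    """Turn Trainer predictions into accuracy, precision, recall, and F1."""
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=-1)
    precision, recall, f1, _ = precision_recall_fscore_support(
        labels, preds, average="macro", zero_division=0
    )
    return {"accuracy": accuracy_score(labels, preds),
            "precision": precision, "recall": recall, "f1": f1}
```

Passing this function to the Trainer via its compute_metrics argument logs all four scores at every evaluation step.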
In addition, we track other aspects of each LM, such as training time, model size, and memory usage during the fine-tuning step in order to make an overall evaluation.

Results & Discussion

As reported in the table below, which only considers the best-performing LM for each experiment (dataset and number of labels), the overall results prove the validity of our method in some cases and its ineffectiveness in others.

Concise performance report of the best LM for each dataset

For instance, we see good results for 2-label classification, where the boundaries are well defined and there is no overlap between claims, so the LMs are expected to perform well. It is important to note that because the COVID-19 and ANTi-Vax datasets are topic-specific and don't contain high variance in language and topics, the LMs performed proficiently. In this case, we can confidently deploy our LMs and rely on their claim assessments.
For the FEVER dataset, the results are intriguing: this dataset has high topic variance, and yet the LMs performed impressively in comparison to MultiFC and Liar. The difference in performance is difficult to explain, but it may be due to the fact that all the LMs were pre-trained on Wikipedia, and the FEVER claims were extracted directly from Wikipedia, which makes them easier to classify. In addition, the language used in news reporting has evolved a specific style of writing that sometimes contains only keywords or an unspecific spatiotemporal context, which can make classification extremely hard. For example, the claim "About 99% of rape allegations are fabricated." is vague, as it doesn't contain sufficient information to deliver a reliable assessment. This style of language is common in the MultiFC and Liar datasets, which can explain the poor performance of the LMs there.
Furthermore, we observe that increasing the number of labels makes the classification task considerably harder. For the FEVER dataset with its original labels, our approach yields better results than most of the traditional methods that rely on external information retrieval modules, as well as Facebook AI's model from the paper Language Models as Fact Checkers?, which used BERT-large (340 million parameters and 24 encoding layers) and achieved 0.57 in accuracy and macro F1-score. This makes our LM more efficient to train, store, and deploy. Nevertheless, once we include the third class of claims, Not Enough Information, the scores drop considerably and the classification task becomes much harder for the LMs. The same goes for the MultiFC and Liar datasets.


[Interactive W&B panels, 7 runs each (one per LM): ANTiVax-2-labels, COVID19-2-labels, FEVER-2-labels, FEVER-3-labels, MultiFC-2-labels, MultiFC-5-labels, Liar-2-labels, Liar-6-labels]


It's noteworthy that all LMs performed relatively close to one another across all experiments: the margin of difference is 0.01 to 0.07 in accuracy and 0.01 to 0.14 in F1-score. The lowest scores are attributed to ALBERT, presumably because this LM has the lowest number of parameters (12 million) compared to the other LMs.
On the other hand, ConvBERT had the highest scores in most runs, especially when the number of labels increased, which may be explained by a subtle difference in its architecture (span-based dynamic convolution). Alongside ConvBERT, RoBERTa and BERT outperform it in some cases by a margin of 0.01 in both accuracy and F1-score. We should keep in mind, however, that a large number of variables control the outcome of these experiments; changing the hyperparameters, for instance, can cause minor or highly noticeable differences in the performance of our LMs. When the networks are large, deep, and complex, the explainability and interpretability of the results become exceedingly hard.

Note that a complete document reporting detailed scores, the fine-tuned models, and the configuration of each experiment is provided at the end of this article. Additionally, a link to the GitHub repository containing the entire code implemented in this study is publicly available.

Conclusion & Future Work

In this study, we explored the capability of Language Models to be fine-tuned and utilized for the downstream task of fact-checking claims. We have successfully demonstrated that pre-trained LMs can act as entities that store knowledge, rather than relying on the external information retrieval modules adopted by traditional approaches. Our experiments yield results that surpass most existing fact-checking methods, both traditional and LM-based. Nevertheless, our approach does not beat state-of-the-art traditional architectures, leaving us with more paths to explore in order to produce a reliable fact-checking engine.
In future work, we plan to investigate solutions for the high topic variance with which LMs struggle. In addition, we will attempt to combine other models with our approach, such as the credibility-based method or even a network-based method that exploits the structural features of complex networks.

Acknowledgments

This study was conducted under the supervision and guidance of Pr. Dimitris Kotzinos, whom I would like to thank for giving me the opportunity to work on this research project at the ETIS laboratory at CY Cergy Paris University.



Resources



Bibliography


  • Xinyi Zhou, Reza Zafarani, Kai Shu, and Huan Liu. Fake news: Fundamental theories, detection strategies and challenges. In Proceedings of the twelfth ACM international conference on web search and data mining, pages 836–837, 2019.
  • Soroush Vosoughi, Deb Roy, and Sinan Aral. The spread of true and false news online. Science, 359(6380):1146– 1151, 2018.
  • Anton Chernyavskiy, Dmitry Ilvovsky, and Preslav Nakov. Whatthewikifact: Fact-checking claims against wikipedia. arXiv preprint arXiv:2105.00826, 2021.
  • Piotr Przybyla. Capturing the style of fake news. In Proceedings of the AAAI Conference on Artificial Intelligence, 2020.
  • Nayeon Lee, Belinda Z Li, Sinong Wang, Wen-tau Yih, Hao Ma, and Madian Khabsa. Language models as fact checkers? arXiv preprint arXiv:2006.04102, 2020.
  • Fabio Petroni, Tim Rocktäschel, Patrick Lewis, Anton Bakhtin, Yuxiang Wu, Alexander H Miller, and Sebastian Riedel. Language models as knowledge bases? arXiv preprint arXiv:1909.01066, 2019.
  • Thorne, James, et al. "Fever: a large-scale dataset for fact extraction and verification." arXiv preprint arXiv:1803.05355 (2018).
  • Wang, William Yang. ""Liar, Liar Pants on Fire": A new benchmark dataset for fake news detection." arXiv preprint arXiv:1705.00648 (2017).
  • Augenstein, Isabelle, et al. "MultiFC: A real-world multi-domain dataset for evidence-based fact checking of claims." arXiv preprint arXiv:1909.03242 (2019).
  • Patwa, Parth, et al. "Fighting an infodemic: COVID-19 fake news dataset." International Workshop on Combating Online Hostile Posts in Regional Languages during Emergency Situation. Springer, Cham, 2021.
  • Hayawi, Kadhim, et al. "ANTi-Vax: a novel Twitter dataset for COVID-19 vaccine misinformation detection." Public health 203 (2022): 23-30.
  • Vaswani, Ashish, et al. "Attention is all you need." Advances in neural information processing systems 30 (2017).
  • Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018.
  • Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. Roberta: A robustly optimized bert pretraining approach. arXiv preprint arXiv:1907.11692, 2019.
  • Victor Sanh, Lysandre Debut, Julien Chaumond, and Thomas Wolf. Distilbert, a distilled version of bert: smaller, faster, cheaper and lighter. arXiv preprint arXiv:1910.01108, 2019.
  • Zhilin Yang, Zihang Dai, Yiming Yang, Jaime Carbonell, Russ R Salakhutdinov, and Quoc V Le. Xlnet: Generalized autoregressive pretraining for language understanding. Advances in Neural Information Processing Systems, 32, 2019.
  • Zhenzhong Lan, Mingda Chen, Sebastian Goodman, Kevin Gimpel, Piyush Sharma, and Radu Soricut. Albert: A lite bert for self-supervised learning of language representations. arXiv preprint arXiv:1909.11942, 2019.
  • Manzil Zaheer, Guru Guruganesh, Kumar Avinava Dubey, Joshua Ainslie, Chris Alberti, Santiago Ontanon, Philip Pham, Anirudh Ravula, Qifan Wang, Li Yang, et al. Big bird: Transformers for longer sequences. Advances in Neural Information Processing Systems, 33:17283–17297, 2020.
  • Jiang, Zi-Hang, et al. "Convbert: Improving bert with span-based dynamic convolution." Advances in Neural Information Processing Systems 33 (2020): 12837-12848.
  • Hunt Allcott and Matthew Gentzkow. Social media and fake news in the 2016 election. Journal of economic perspectives, 31(2):211–36, 2017.
  • Vian Bakir and Andrew McStay. Fake news and the economy of emotions: Problems, causes, solutions. Digital Journalism, pp. 1-22, 2017.
  • Carlos Castillo, Marcelo Mendoza, and Barbara Poblete. Information credibility on twitter. In Proceedings of the 20th international conference on World wide web, pages 675–684, 2011.
