An Introduction to the F1 Score in Machine Learning
What the F1 score is and why it matters in machine learning
What Is the F1 Score?
In the landscape of machine learning and data science, evaluation metrics hold the key to deciphering the effectiveness and performance of predictive models. Each metric (think accuracy, precision, or recall) offers unique insights into a model's performance, helping us fine-tune predictions and achieve better outcomes.
The metric we're covering today, the F1 score, is a vital tool, particularly in the context of classification problems. It serves as a single metric that encapsulates the performance of a classifier in terms of both precision and recall, ensuring that neither is ignored at the expense of the other.
The mathematical formula for the F1 Score is:
F1 = 2 × (Precision × Recall) / (Precision + Recall)
Why is the F1 Score Important in Machine Learning?
When working with machine learning models, datasets can often be imbalanced, meaning there may be a disproportionate number of instances in one class compared to another. This imbalance can skew traditional evaluation metrics, leading to an inaccurate representation of the model's true performance.
For example, in a dataset where 95% of the instances are of one class (say, a cat), a model could naively predict that class (that's a cat!) for all instances and still achieve 95% accuracy, despite not having learned anything meaningful. This phenomenon, known as the "accuracy paradox," highlights how traditional metrics can be misleading in the presence of class imbalance.
Meanwhile, precision and recall reveal different aspects of the model's performance that are not captured by accuracy alone. Precision measures the proportion of correct positive predictions among all positive predictions made, while recall measures the proportion of correct positive predictions among all actual positives. However, each metric on its own tells only part of the story: a model with high precision might simply be conservative in predicting positive instances, whereas a model with high recall might over-predict the positive class.
In such scenarios, the F1 score is especially useful. It provides a balance between precision and recall by taking their harmonic mean. This balance is crucial because it ensures that an increase in one metric does not disproportionately boost the overall score without a corresponding improvement in the other. In scenarios where both false positives and false negatives carry real costs, the F1 score captures the trade-off between them: improving precision typically reduces false positives, while improving recall reduces false negatives.
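To make this concrete, here is a minimal sketch using scikit-learn on an invented 95/5 toy label set, with the rare class treated as the positive class; every number is purely illustrative. It shows both the accuracy paradox and how the harmonic mean punishes a large gap between precision and recall.

import numpy as np
from sklearn.metrics import accuracy_score, f1_score

# Toy imbalanced labels: 95 majority-class (0) and 5 minority-class (1) instances
y_true = np.array([0] * 95 + [1] * 5)
y_naive = np.zeros(100, dtype=int)  # always predict the majority class

print(accuracy_score(y_true, y_naive))  # 0.95 -- the accuracy paradox
print(f1_score(y_true, y_naive))        # 0.0  -- no true positives at all

# The harmonic mean penalizes a precision/recall gap far more than a simple average:
precision, recall = 0.9, 0.1
print((precision + recall) / 2)                       # arithmetic mean: 0.5
print(2 * precision * recall / (precision + recall))  # F1 (harmonic mean): 0.18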
What Are the Applications of the F1 Score in Real-World Machine Learning Projects?
The F1 Score is particularly valuable in various real-world use cases where the balance between precision and recall is critical. Here are some examples:
Medical Diagnosis
In medical testing, such as predicting whether patients have a certain disease, it's crucial to minimize false negatives (failing to identify a sick patient) and false positives (incorrectly diagnosing a healthy person). The F1 Score helps in evaluating the model's ability to correctly identify actual cases of the disease while minimizing false alarms.
Fraud Detection
In banking and finance, detecting fraudulent transactions is vital. A high recall (identifying most fraud cases) is important, but precision is also crucial to avoid falsely flagging legitimate transactions as fraudulent. The F1 Score provides a balanced measure of a fraud detection model’s performance.
Spam Filtering
In email filtering systems, it’s important to correctly classify spam emails without incorrectly filtering out important, legitimate emails. The F1 Score can help optimize spam filters by ensuring they effectively catch spam while preserving user trust by not over-blocking genuine messages.
Moreover, when we train machine learning models for these classification problems, it is important to identify features that contribute to a more balanced classification performance. During feature selection, the F1 score can therefore be used to evaluate the impact of adding or removing features from the model. It is also useful when comparing multiple machine learning models, where one model may have high precision but low recall, or vice versa.
By considering the harmonic mean of the two, the F1 score helps in selecting models that achieve an optimal balance, rather than those skewed towards one metric at the expense of the other. The F1 score is also instrumental in tuning the decision threshold of classification models: by adjusting the cut-off that determines a positive prediction, you shift the balance between precision and recall.
What Challenges Might You Encounter When Using the F1 Score?
While the F1 score is a valuable metric, it's not without its potential misunderstandings and misapplications:
Class Imbalance Ignorance
Although the F1 score is better than accuracy in imbalanced settings, it does not fully solve the imbalance problem, and over-relying on it without considering the underlying distribution can still lead to misleading conclusions. For example, suppose a model correctly flags 4 out of 5 actual positive cases (a recall of 0.8) but also incorrectly labels 20 out of 95 negative cases as positive (a precision of roughly 0.17, giving an F1 score of about 0.28). The single number signals that something is wrong, but it cannot tell you whether the problem is the 20 false positives or the missed positive case, and it weights both error types equally even when their real-world costs differ sharply. If the cost of false positives (e.g., unnecessary treatment) is high, this model could be impractical regardless of what the F1 score alone suggests.
Misinterpretation
Misunderstanding the F1 Score as a measure of accuracy rather than a balance between precision and recall can lead to incorrect conclusions about the model's effectiveness.
Single Metric Fallacy
Relying solely on the F1 Score for model evaluation can be misleading. It should be used in conjunction with other metrics to provide a comprehensive view of model performance.
Overemphasis on Balance
The F1 Score assumes equal importance of precision and recall, which may not align with all business or clinical objectives. In some contexts, one might be significantly more critical than the other.
It must also be noted that F1 scores on extremely imbalanced datasets require careful interpretation. If the score is computed with the majority class as the positive label, or averaged across classes in a way that weights by class size, a model could ignore or misclassify almost all minority-class instances and still achieve a reasonable F1 score, simply because of the overwhelming presence of the majority class. Precision can also become less informative under extreme imbalance: if minority-class instances are very rare, a model could predict very few instances as positive (trying to be 'safe') and still maintain high precision as long as those few predictions are correct, despite missing most of the actual positive cases (low recall). Careful consideration must therefore be given when using the F1 score to evaluate models on imbalanced datasets.
How Is the F1 Score Calculated and Interpreted?
To calculate the F1 score, we first need to compute precision and recall separately and then combine them.
Calculating F1 Score
- Calculate Precision: Precision = TP / (TP + FP), where TP is the number of true positives and FP is the number of false positives.
- Calculate Recall: Recall = TP / (TP + FN), where FN is the number of false negatives.
- Calculate F1 Score: F1 = 2 × (Precision × Recall) / (Precision + Recall), the harmonic mean of precision and recall.
Interpretation of F1 Score Values:
- F1 Score = 1: Perfect precision and recall.
- F1 Score = 0: Worst possible score; the model failed to identify any true positives.
- 0 < F1 Score < 1: Indicates a balance between precision and recall. The closer to 1, the better the balance.
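Putting the three calculation steps above into code, here is a minimal from-scratch sketch; the confusion-matrix counts are made up purely for illustration:

# Hypothetical counts from a binary classifier's confusion matrix
TP, FP, FN = 40, 10, 20

precision = TP / (TP + FP)                          # 40 / 50 = 0.80
recall = TP / (TP + FN)                             # 40 / 60 ≈ 0.67
f1 = 2 * precision * recall / (precision + recall)  # ≈ 0.73

print(f"Precision: {precision:.2f}, Recall: {recall:.2f}, F1: {f1:.2f}")

The resulting F1 of roughly 0.73 sits between precision and recall but is pulled toward the lower of the two, which is exactly the balancing behavior described above.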
Step-by-Step Guide to Calculating the F1 Score using Python
In this guide, we will use an XGBoost classifier to predict whether a breast tumor is benign or malignant. The results of the prediction will be logged to Weights & Biases (W&B), a machine learning platform designed to help track experiments by providing tools for logging hyperparameters, outputs, and results from ML models, making it easier to monitor and compare different experiments and models. Because the classes are imbalanced, the model will be evaluated using the F1 score alongside other metrics.
1. Importing the Libraries
# Install the libraries that are not preinstalled in Colab
!pip install wandb kaggle

import os
import zipfile

import pandas as pd
import matplotlib.pyplot as plt  # used later to plot the confusion matrix
import wandb
from wandb.xgboost import WandbCallback

from google.colab import drive
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import (accuracy_score, precision_score, recall_score, f1_score,
                             confusion_matrix, ConfusionMatrixDisplay, classification_report)
from xgboost import XGBClassifier

# Mount Google Drive and point the Kaggle client at the stored API credentials
drive.mount('/content/drive')
os.environ['KAGGLE_CONFIG_DIR'] = '/content/drive/MyDrive/kaggle'
2. Download and Extract the Dataset
We use the Breast Cancer Wisconsin (Diagnostic) dataset to demonstrate how to calculate and evaluate the F1 score. The dataset is used to predict whether a tumor is benign or malignant. It is downloaded from Kaggle as a zip file, which is then extracted.
!kaggle datasets download -d uciml/breast-cancer-wisconsin-data

file_path = '/content/breast-cancer-wisconsin-data.zip'
with zipfile.ZipFile(file_path, 'r') as zip_ref:
    zip_ref.extractall('/content/kaggle/')
3. Preprocess the Dataset
We load the dataset and set the diagnosis column as the label that the model will predict. We drop the irrelevant 'id' column and check the class distribution, which is somewhat imbalanced, with more samples in the benign class.
df = pd.read_csv('/content/kaggle/data.csv')
y = df['diagnosis']
x = df.drop(columns=['diagnosis', 'id'])
df['diagnosis'].value_counts().reset_index()
[Table: class distribution of the diagnosis column, with more benign (B) than malignant (M) samples]
We next encode the diagnosis column with 0 for benign and 1 for malignant. The features are then standardized using StandardScaler from scikit-learn.
encoding_dict = {'B': 0, 'M': 1}
y = y.map(encoding_dict)

scaler = StandardScaler()
x = pd.DataFrame(scaler.fit_transform(x), columns=x.columns)
4. Train-Test Split
features_train, features_test, target_train, target_test = train_test_split(x, y, test_size=0.2, stratify=y, random_state=0)
5. Setup Machine Learning Model
We instantiate the XGBoost model and define a dictionary of hyperparameters to search over with cross-validated grid search, along with the scoring metrics. Next, we instantiate a GridSearchCV object using the model and the evaluation metrics, with the F1 score as the refit target.
xgboost_model = XGBClassifier(objective='binary:logistic', random_state=0)

cv_hyperparameters = {
    'max_depth': [3, 5],
    'min_child_weight': [3, 5],
    'learning_rate': [0.001, 0.1, 0.2],
    'n_estimators': [5, 10, 15],
    'subsample': [0.7]
}

evaluation_metrics_dict = {'accuracy', 'precision', 'recall', 'f1'}

xgboost_cv = GridSearchCV(xgboost_model, cv_hyperparameters, scoring=evaluation_metrics_dict, cv=5, refit='f1')
6. Train the Machine Learning Model
We now train the model and log the best parameters from the grid search to W&B for easier evaluation and analysis. Logging requires an active W&B run, so we initialize one first.
# Start a W&B run before logging (the project name here is illustrative)
wandb.init(project="f1-score-tutorial")

# Passing callbacks through fit() assumes an xgboost version that still accepts
# `callbacks` as a fit parameter; newer releases expect callbacks in the constructor.
xgboost_cv.fit(features_train, target_train, callbacks=[WandbCallback(log_model=True)])
wandb.log({"best_params": xgboost_cv.best_params_, "best_score": xgboost_cv.best_score_})

cv_results_df = pd.DataFrame(xgboost_cv.cv_results_)
wandb.log({"cv_results": wandb.Table(dataframe=cv_results_df)})
The results below, logged to W&B, show the best parameters selected from the grid search, which yield the highest cross-validated F1 score of 93%.
[W&B table: grid-search results with the best hyperparameters and their cross-validated F1 scores]
Additionally, the feature importance is also logged, as depicted below. Features like "concave points_mean", "perimeter_worst", and "radius_worst" are shown to be the most influential in the model's decisions, as indicated by their longer bars. These importances come from the trained XGBoost model itself and help us understand which features are driving the model's predictions, allowing for better interpretation and potentially guiding feature selection and engineering.
[W&B chart: XGBoost feature importance, with concave points_mean, perimeter_worst, and radius_worst ranked highest]
7. Test the Machine Learning Model
We obtain predictions for the test set and log the resulting scores to W&B. Since the data is imbalanced, it is important to log several metrics for a fuller evaluation. All of the resulting scores are above 90%, indicating that the model predicts both classes accurately.
target_predictions = xgboost_cv.predict(features_test)

# Use distinct variable names so the sklearn metric functions are not shadowed
test_accuracy = accuracy_score(target_test, target_predictions)
wandb.log({"Test accuracy": test_accuracy})

test_precision = precision_score(target_test, target_predictions)
wandb.log({"Test Precision": test_precision})

test_recall = recall_score(target_test, target_predictions)
wandb.log({"Test Recall": test_recall})

test_f1 = f1_score(target_test, target_predictions)
wandb.log({"Test F1 Score": test_f1})

xgboost_confusion_matrix = confusion_matrix(target_test, target_predictions)
confusion_matrix_display = ConfusionMatrixDisplay(
    confusion_matrix=xgboost_confusion_matrix,
    display_labels=xgboost_cv.classes_
)

We also plot the confusion matrix and log it to W&B. The results show that precision and recall are balanced, and hence the model is accurate in its predictions.
fig, ax = plt.subplots()
confusion_matrix_display.plot(ax=ax)
plt.grid(False)
wandb.log({"confusion_matrix": wandb.Image(fig)})
plt.close(fig)
[Confusion matrix for the test set, logged to W&B]
How Can You Improve Your Model's F1 Score?
After understanding the F1 score and walking through its implementation, it is also important to understand the strategies that can help improve a model's F1 score. Let's dive into a few techniques for doing so:
- Resampling Techniques
Apply undersampling to reduce the size of the majority class, oversampling to increase the size of the minority class, or use synthetic data generation techniques like SMOTE (Synthetic Minority Over-sampling Technique) to create a more balanced dataset.
- Data Augmentation
Augment the minority class through techniques like rotation, flipping, or cropping (for images) or by paraphrasing sentences (for text) to increase the diversity and quantity of the training data.
- Weighted Classes
Assign higher weights to the minority class during model training. Many machine learning algorithms allow you to set class weights, which can help in adjusting the model’s focus towards underrepresented classes.
- Threshold Adjustment
By default, many classifiers use a decision threshold of 0.5 to differentiate between classes. Adjusting this threshold can help rebalance precision and recall, and thereby improve the F1 score: you might lower the threshold to increase recall or raise it to enhance precision, depending on your specific needs (see the sketch after this list).
- Cost-sensitive Learning
Implement a cost function that penalizes false negatives more than false positives, or vice versa, depending on the application's requirements, and optimize the decision threshold based on this cost.
- Ensemble Methods
Combine the predictions of several models to improve the overall performance. Techniques like bagging, boosting, and stacking can lead to better model robustness and improved F1 Scores.
- Hyper-parameter Tuning
Optimize the model's hyperparameters through grid search, random search, or Bayesian optimization to find the combination that yields the best F1 Score.
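As a concrete illustration of the threshold-adjustment and class-weighting ideas above, here is a minimal sketch. The helper function, the validation-split variable names, and the scale_pos_weight example are assumptions for illustration rather than part of the guide above, and the threshold should be tuned on a validation split, never the test set.

import numpy as np
from sklearn.metrics import f1_score
from xgboost import XGBClassifier

def best_f1_threshold(model, features, targets, thresholds=np.linspace(0.05, 0.95, 91)):
    """Scan candidate decision thresholds and return the one that maximizes F1."""
    positive_probs = model.predict_proba(features)[:, 1]
    scores = [f1_score(targets, (positive_probs >= t).astype(int)) for t in thresholds]
    best = int(np.argmax(scores))
    return thresholds[best], scores[best]

# Class weighting: XGBoost's scale_pos_weight up-weights the positive (minority) class.
# A common starting point is the ratio of negative to positive training samples.
# weighted_model = XGBClassifier(scale_pos_weight=negative_count / positive_count)

# Hypothetical usage with a fitted model and a held-out validation split:
# threshold, best_f1 = best_f1_threshold(fitted_model, features_val, target_val)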
While the F1 score is an important part of model evaluation, it should be used as one component of a broader evaluation strategy; relying on it alone can lead to an incomplete understanding of a model's performance. It is crucial to understand how and why a model makes certain predictions, especially in high-stakes domains, so evaluation should also include measures of interpretability and the ability to explain decisions to stakeholders, which go beyond numerical performance metrics. It is equally important to consider computational cost, latency, and scalability when evaluating a model's feasibility in production environments, as these factors affect the overall utility and sustainability of a machine learning solution.
Are There Alternatives or Exceptions to the F1 Score?
A common misconception about the F1 score is that a higher score always means a good model. That is not necessarily the case, especially under severe class imbalance, where a high score may simply reflect good performance on whichever class is treated as positive. Additionally, the F1 score does not consider true negatives in its calculation, which can be a significant oversight in contexts where negative predictions are crucial to the task at hand. To overcome these drawbacks, alternative metrics can be calculated alongside the F1 score for a better evaluation:
- Matthews Correlation Coefficient (MCC)
The MCC is a more comprehensive measure that takes true positives, true negatives, false positives, and false negatives into account. It is generally regarded as a balanced measure that can be used even when the classes are of very different sizes. It produces a value between -1 and +1, where +1 indicates a perfect prediction, 0 indicates performance no better than random, and -1 indicates total disagreement between prediction and observation. It is well suited to datasets with severe class imbalance. MCC can be calculated using:
MCC = (TP × TN − FP × FN) / √((TP + FP)(TP + FN)(TN + FP)(TN + FN))
where TP = true positives, TN = true negatives, FP = false positives, and FN = false negatives.
- Area Under the ROC Curve (AUC-ROC)
This metric evaluates the model's ability to distinguish between classes. The ROC curve plots the true positive rate against the false positive rate at various threshold settings. AUC-ROC values range from 0 to 1, where 1 represents a perfect model and 0.5 represents a random one, and it is particularly useful for evaluating models on imbalanced datasets. It is ideal for binary classification problems where you want to evaluate the model's performance across all possible classification thresholds, or when you need to compare several models. On an AUC-ROC plot, a perfect classifier reaches a true positive rate of 1 at a false positive rate of 0.
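Both metrics are available in scikit-learn and can be logged alongside the F1 score. Here is a minimal sketch reusing the variable names from the guide above, assuming the fitted xgboost_cv model and the test split are still in scope:

from sklearn.metrics import matthews_corrcoef, roc_auc_score

mcc = matthews_corrcoef(target_test, target_predictions)        # uses hard class predictions
positive_probs = xgboost_cv.predict_proba(features_test)[:, 1]  # AUC-ROC needs scores, not labels
auc = roc_auc_score(target_test, positive_probs)

wandb.log({"Test MCC": mcc, "Test AUC-ROC": auc})
print(f"MCC: {mcc:.3f}, AUC-ROC: {auc:.3f}")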

Conclusion
In conclusion, understanding the F1 score is essential for effectively evaluating classification models, particularly when dealing with imbalanced datasets. In this exploration, we discussed not only how to calculate this metric but also how to recognize common misconceptions about its application and interpretation. We also demonstrated how to implement it in Python, walking through a practical example and listing the factors that help improve a model's resulting F1 score. By carefully selecting and applying the appropriate metrics, practitioners can enhance model evaluation, leading to more reliable and robust machine learning solutions.