
Predict The LLM (Kaggle Community Competition)

Created on October 28 | Last edited on November 12
Hello, and welcome to my report on the "Predict the LLM" Kaggle competition. In this challenge, the task is to predict which of seven different Large Language Models (LLMs) generated a given piece of text, making it a multi-class classification problem. This report's table of contents is below:


Introduction

In this report, I will take you through the steps we followed to develop a text classification model using Transformers. The aim is to provide a beginner-friendly explanation of the entire process so that you can easily follow along.
Report Sections:
  • Exploratory Data Analysis (EDA): We'll start by exploring the data provided in the competition. We will look at basic statistics such as word counts in the text, perform text analysis to understand its structure, and delve into topic analysis to identify common themes.
  • Data Splitting (StratifiedGroupKFold): We will explain how we divided the data into training and validation folds using StratifiedGroupKFold, grouped by question, to ensure that each class is well represented in every fold.
  • Baseline Model (distilroberta-base): Our initial model will be based on the distilroberta-base architecture. We will discuss this choice and how it performed as a starting point.
  • Preprocessing: Text data often requires some cleaning and preparation before it can be fed into the model. We will explain the text cleaning steps and the coverage of embeddings used in this task.
  • Hyperparameter Tuning (Sweeps): One crucial aspect of building a powerful model is hyperparameter tuning. We'll discuss the sweeps used to find the best hyperparameters for our model.
  • Retraining: After finding the optimal hyperparameters, we will go through the retraining process to fine-tune our model for improved performance.
  • Evaluation: We will use logloss as the metric to measure how well our model performs in classifying text into the seven different categories. We'll discuss the model's effectiveness and any insights gained from the evaluation process.
  • Checklist: We'll provide a checklist to summarize the key steps and decisions we made throughout the competition.
  • Results: Finally, we will present the results of our efforts, showcasing how our LLM-based model performs in classifying text into the seven different categories.
I hope you find this report both informative and accessible.

EDA

Word Counts

You can find the EDA notebook in the GitHub repository.
In our analysis of text sequences generated by seven different Large Language Models (LLMs), we discovered some interesting insights about the lengths of these sequences for different classes.
  • Class 3: On average, sequences generated by this class were the longest, with approximately 250 words.
  • Class 1: Sequences in this class had an average length of about 200 words.
  • Class 5: The average length of sequences from this class was around 150 words.
  • Class 2: Sequences in this category had an average length of about 100 words.
  • Class 0, Class 4, and Class 6: These classes had the shortest sequences, with an average length of approximately 75 words.
These findings help us understand the characteristics of text sequences generated by different LLM models for each class. For instance, Class 3 tends to produce longer sequences, while Classes 0, 4, and 6 generate the shortest ones. This information is valuable in designing our text classification model, as it gives us a glimpse into the diversity of text lengths across the classes.
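
A minimal sketch of how these per-class word counts can be computed with pandas is shown below; the file name "train.csv" and the column names "response" and "target" are assumptions and should be adjusted to match the actual competition files.

import pandas as pd

# Assumed layout: train.csv with a response-text column "response" and a label column "target".
train = pd.read_csv("train.csv")

# Word count per row: split on whitespace and count the tokens.
train["word_count"] = train["response"].fillna("").str.split().str.len()

# Average sequence length per class, sorted from longest to shortest.
print(train.groupby("target")["word_count"].mean().sort_values(ascending=False))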




Text Analysis

You can find the EDA notebook in the GitHub repository.

The summaries below describe statistics and common words for two columns: 'Question' and 'Response'. Each question appears once for each of the seven LLMs, which is why the top question has a frequency of 7. The summaries can be described as follows:

For 'Question':

  • There are a total of 3,976 questions.
  • There are 568 unique questions.
  • The most frequently occurring question is "Explain the concept of coevolution."
  • Each question is posed to all seven LLMs, so every question appears seven times.
  • Train: 3,976 / 7 = 568 unique questions.
  • Test: 1,001 / 7 = 143 unique questions.

Most Common Words in 'Question':

  • The word "what" occurs 1,603 times.
  • Other common words in the questions include "who," "how," "explain," "describe," "can," "I," "concept," "could," and "cause."

For 'Response':

  • There are a total of 3,976 responses.
  • There are 3,969 unique responses.
  • The most frequent response is empty (blank).
  • The empty response appears 7 times.
  • Train empty-response indexes: [196, 339, 401, 798, 2279, 2725, 3046]
  • Test empty-response indexes: [446, 969]

Most Common Words in 'Response':

  • The word "the" occurs 4,557 times.
  • Other common words in the responses include "this," "data," "it," "a," "I," "also," "in," "process," and "like."

These summaries provide insights into the data, including data count, uniqueness, and the most common words found in the 'Question' and 'Response' columns. It appears that the 'Response' data contains empty responses, and the 'Question' data focuses on common introductory words and phrases used in questions.
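
For reference, here is a hedged sketch of how these summaries can be reproduced with pandas' describe() and a simple word counter; the column names "question" and "response" are assumptions.

from collections import Counter
import pandas as pd

train = pd.read_csv("train.csv")

# describe() on text columns reports count, unique, top, and freq --
# the figures quoted above (3,976 rows, 568 unique questions, top frequency 7).
print(train[["question", "response"]].describe())

# Most common words per column (naive whitespace tokenization, lowercased).
for col in ["question", "response"]:
    words = " ".join(train[col].fillna("")).lower().split()
    print(col, Counter(words).most_common(10))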






Topic Analysis

Topic 0:

  • Words: "data," "information," "the," "model," "learning," "used," "this," "process," "ai," "language"
  • Category: Machine Learning and Data Analysis - This topic appears to be related to machine learning, data analysis, and artificial intelligence, focusing on data, information, models, and learning.

Topic 1:

  • Words: "i," "help," "this," "answer," "what," "time," "use," "question," "skills," "make"
  • Category: Personal Assistance and Problem-Solving - This topic seems to revolve around personal assistance and problem-solving, where "i," "help," and "answer" suggest assistance and addressing questions or challenges.

Topic 2:

  • Words: "the," "x," "this," "cell," "process," "cells," "dna," "behavior," "animals," "in"
  • Category: Biology and Cellular Processes - This topic appears to be related to biology and cellular processes, with words like "cell," "process," "dna," and "animals" indicating a focus on biological subjects.

Topic 3:

  • Words: "the," "cause," "war," "symptoms," "pain," "it," "world," "what," "this," "also"
  • Category: Historical and Health-related Topics - This topic suggests a mix of historical and health-related discussions, with mentions of "war," "symptoms," and "pain."

Topic 4:

  • Words: "the," "this," "species," "one," "it," "what," "code," "in," "system," "different"
  • Category: Species and Systems - This topic may relate to discussions about species, systems, and differences between various elements or entities.

These are the identified topics from the combined questions and responses. Each topic represents a distinct theme or category, and the assigned category is based on the prominent words within each topic. The model seems to capture a range of topics, including machine learning, personal assistance, biology, history, and systems.
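
The topics above can be reproduced with a standard LDA pipeline along the lines of the sketch below; the five-topic count matches the list above, while the vectorizer settings and column names are assumptions rather than the exact notebook configuration.

import pandas as pd
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

train = pd.read_csv("train.csv")
texts = train["question"].fillna("") + " " + train["response"].fillna("")

# Bag-of-words representation of the combined question + response text.
vectorizer = CountVectorizer(max_features=5000)
doc_term = vectorizer.fit_transform(texts)

# Five LDA topics, matching the Topic 0-4 summaries above.
lda = LatentDirichletAllocation(n_components=5, random_state=42)
lda.fit(doc_term)

# Print the ten highest-weight words per topic.
vocab = vectorizer.get_feature_names_out()
for topic_idx, weights in enumerate(lda.components_):
    top_words = [vocab[i] for i in weights.argsort()[::-1][:10]]
    print(f"Topic {topic_idx}:", ", ".join(top_words))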


You can find the EDA notebook in the GitHub repository.



Embeddings






Splits




To split the dataset, we used the StratifiedGroupKFold technique, grouping by question. Each question is answered by all seven LLMs, so the training set's 3,976 rows contain only 568 distinct questions; the test set follows the same pattern, with 1,001 rows and 143 unique questions. Grouping by question keeps all seven answers to a question in the same fold, which prevents leakage between training and validation and gives us a more reliable estimate of performance.
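
A minimal sketch of this split with scikit-learn is shown below; the five-fold setting and the column names "question" and "target" are assumptions.

import pandas as pd
from sklearn.model_selection import StratifiedGroupKFold

train = pd.read_csv("train.csv")  # assumes a default RangeIndex

# Group by question so all seven LLM answers to a question share a fold;
# stratify on the target so every class is represented in each fold.
sgkf = StratifiedGroupKFold(n_splits=5, shuffle=True, random_state=42)
train["fold"] = -1
for fold, (_, val_idx) in enumerate(sgkf.split(train, y=train["target"], groups=train["question"])):
    train.loc[val_idx, "fold"] = fold

print(train.groupby("fold")["target"].value_counts())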

You can find the split and baseline notebook in the GitHub repository.





Baseline

To get started quickly, we selected the "distilroberta-base" model from the Transformers library. We ran a short training experiment for 16 epochs with a batch size of 32, a linear learning-rate scheduler, and early stopping with a patience of 2 epochs without improvement in validation loss.
With these settings, we achieved the following results:
  • Training loss: 0.5
  • Validation loss: 1.28
  • Public test loss: 1.181
These results give us a starting point for measuring the model's performance. The training loss of 0.5 suggests that the model learned the training data well, while the higher validation and test losses (1.28 and 1.181, respectively) indicate that it is overfitting and does not generalize as well to unseen data.
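
The sketch below shows one way to set up this baseline with the Hugging Face Trainer; it mirrors the settings above (distilroberta-base, 16 epochs, batch size 32, linear schedule, early stopping with patience 2), while the file name "train_folds.csv" and its "text", "label", and "fold" columns are assumptions (the text could, for example, be the question and response concatenated).

import pandas as pd
from datasets import Dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          EarlyStoppingCallback, Trainer, TrainingArguments)

MODEL_NAME = "distilroberta-base"
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)

# Hypothetical file produced by the split step; fold 0 is held out for validation.
df = pd.read_csv("train_folds.csv")
train_ds = Dataset.from_pandas(df[df.fold != 0]).map(tokenize, batched=True)
val_ds = Dataset.from_pandas(df[df.fold == 0]).map(tokenize, batched=True)
train_ds = train_ds.rename_column("label", "labels")
val_ds = val_ds.rename_column("label", "labels")

model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME, num_labels=7)

args = TrainingArguments(
    output_dir="baseline",
    num_train_epochs=16,
    per_device_train_batch_size=32,
    lr_scheduler_type="linear",
    evaluation_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
    metric_for_best_model="eval_loss",
    report_to="wandb",
)

trainer = Trainer(
    model=model,
    args=args,
    tokenizer=tokenizer,  # enables dynamic padding via the default collator
    train_dataset=train_ds,
    eval_dataset=val_ds,
    callbacks=[EarlyStoppingCallback(early_stopping_patience=2)],
)
trainer.train()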



Preprocess

For training, standard tokenization is applied to the text data: the transformers package's AutoTokenizer is used with padding, truncation, and max_length = 512.
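
As a minimal illustration of these settings (padding to max_length here, although dynamic padding via a data collator works equally well):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("distilroberta-base")
encoded = tokenizer(
    ["Explain the concept of coevolution."],  # example question from the dataset
    padding="max_length",
    truncation=True,
    max_length=512,
    return_tensors="pt",
)
print(encoded["input_ids"].shape)  # torch.Size([1, 512])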

Evaluation of Preprocessing and Embedding Coverage

In this section, we assess the impact of various preprocessing steps on embedding coverage for both the question (q) and response (r) text in the training and test datasets. The purpose of measuring embedding coverage is to understand how well our text data is represented by pre-trained embeddings, which can significantly affect the performance of our language model.
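
A hedged sketch of how this coverage can be computed is shown below: "vocab" coverage is the share of unique words found in the pretrained embeddings, and "all text" coverage is the share of all word occurrences. The gensim-based loading of the GoogleNews vectors and the column names are assumptions.

from collections import Counter
import pandas as pd
from gensim.models import KeyedVectors

embeddings = KeyedVectors.load_word2vec_format("GoogleNews-vectors-negative300.bin", binary=True)
train = pd.read_csv("train.csv")

def embedding_coverage(texts, embeddings):
    # Count every whitespace-separated token in the corpus.
    vocab = Counter(word for text in texts for word in str(text).split())
    known = {w: c for w, c in vocab.items() if w in embeddings.key_to_index}
    vocab_cov = len(known) / len(vocab)                    # unique-word coverage
    text_cov = sum(known.values()) / sum(vocab.values())   # all-text coverage
    return vocab_cov, text_cov

print("q:", embedding_coverage(train["question"], embeddings))
print("r:", embedding_coverage(train["response"], embeddings))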

Google News Embeddings (First 5 Rows):

To provide context, the first five rows of the table show the embedding coverage percentages when using Google News embeddings as a reference point. These embeddings are based on a vast vocabulary of general text.

No Preprocessing (Baseline):

When no preprocessing functions were applied, we observed varying levels of embedding coverage. The training question (q) text achieved 64.18%, while the response (r) text achieved 72.61%. In the test dataset, q had 69.10% coverage, and r had 71.55% coverage.

Clean Markdown and Emojis:

After applying the "clean_markdown_and_emojis" function, we saw improvements in embedding coverage, with the effect most pronounced on the test set: vocabulary coverage for the test response text (r) rose from 49.58% to 56.23%, a 13.41% relative increase, indicating that this step helped more of the text map onto known embeddings.

Clean Text:

Using the "clean_text" function, we observed the largest improvements in embedding coverage for both training and test data. For the test response text (r), vocabulary coverage jumped from 56.23% to 97.89% (a 74.08% relative increase) and all-text coverage rose from 73.95% to 87.81% (an 18.74% relative increase); the question text improved by a similar margin.

Clean Numbers:

The "clean_numbers" function had a negligible impact on embedding coverage, with a 0% change. This suggests that removing numbers from the text did not significantly affect how well the embeddings represented the data.

Correct Misspelling:

The "correct_misspelling" function resulted in a slight increase in coverage, with 1% and -1% changes in the training and test datasets, respectively. While it had a limited effect, it still improved embedding coverage.

Before Preprocess Bert:

The "before preprocess bert" represents the embedding coverage before applying any preprocessing steps using BERT embeddings. The coverage values were noticeably lower compared to the Google News embeddings, indicating the importance of preprocessing.

All Preprocessing Functions for Bert:

Finally, when all preprocessing functions were applied to BERT embeddings, we observed substantial improvements in embedding coverage. Training and test datasets both saw notable increases in coverage, highlighting the effectiveness of the preprocessing steps.

In summary, the choice of preprocessing steps has a significant impact on embedding coverage. By selecting appropriate preprocessing functions, we can improve the representation of text data, which is crucial for building accurate language models. These findings inform our decision-making process as we fine-tune our model for better performance.




Preprocess Results Table

Each cell shows two coverage figures: vocabulary coverage - all-text coverage. The first five rows are measured against the Google News embeddings and the last two against the BERT vocabulary; the final column reports the relative change per step for the test response (r) coverage.

| preprocess_fn | train q | train r | test q | test r | effect % per step (test r) |
|---|---|---|---|---|---|
| no fn (baseline) | 64.18% - 72.61% | 40.96% - 73.15% | 69.10% - 71.55% | 49.58% - 73.04% | --- |
| clean_markdown_and_emojis | 66.25% - 72.99% | 48.55% - 74.01% | 70.34% - 71.69% | 56.23% - 73.95% | 13.41% - 1.2% |
| clean_text | 99.03% - 87.10% | 95.97% - 88.13% | 98.28% - 86.12% | 97.89% - 87.81% | 74.08% - 18.74% |
| clean_numbers | 99.03% - 87.10% | 95.97% - 88.13% | 98.28% - 86.12% | 97.89% - 87.81% | 0% - 0% |
| correct_misspelling | 99.09% - 87.10% | 96.23% - 88.17% | 98.74% - 86.33% | 97.84% - 87.81% | 1% - -1% |
| before preprocess bert coverage | 51.25% - 69.74% | 20.14% - 71.52% | 57.37% - 70.21% | 30.73% - 71.50% | no_fns |
| fns result for bert | 76.81% - 81.28% | 42.90% - 83.93% | 80.09% - 82.24% | 56.53% - 83.48% | all_fns |

These preprocessing techniques improved log loss for a simple model (an Embedding layer, Conv1D, and GlobalAveragePooling1D), yet they did not lead to improvements for encoder transformers such as BERT and DistilRoBERTa.
Possible reasons why preprocessing did not improve transformer performance:
  • Embedding Coverage vs. Performance: The increase in embedding coverage doesn't necessarily translate to better model performance. The preprocessing methods might have increased the coverage of embeddings, but they might not have addressed the specific challenges or issues that transformers like Bert and DistilRoberta face. These models are already highly optimized and might not benefit significantly from changes in input embeddings.
  • Complexity of Transformer Models: Transformer models like Bert and DistilRoberta are highly complex and capable of capturing and learning intricate patterns in text data. Preprocessing methods that focus on basic text cleaning might not provide a significant boost in performance because these models may handle noisy and uncleaned text to some extent.
  • Overfitting and Data Distribution: It's possible that the transformer models were already well-fitted to the original data distribution and introducing extensive preprocessing could lead to overfitting or model deterioration.
  • Different Preprocessing Needs: Transformers may have specific preprocessing requirements that differ from more traditional models. What improves performance for a simpler model (e.g., Conv1D) might not necessarily apply to transformers. Transformers often require careful tokenization, handling of special tokens, and other unique considerations.
  • Diminishing Returns: After a certain point, further preprocessing may not lead to substantial gains in performance. The preprocessing methods we've applied might have already achieved most of the possible improvements.



Preprocess Result Data Frame Heads






Sweeps

Sweeps for Hyperparameter Tuning

In our quest to build the best possible model, we embarked on a journey called "sweeps." Sweeps are a process where we systematically explore different combinations of hyperparameters to find the ones that work best for our task.

Here's what we did (a sketch of the corresponding sweep configuration follows this list):

  • Epochs: We conducted each sweep over a short period of 2 epochs. This helped us quickly evaluate different hyperparameters without committing to a lengthy training process.

  • Random Seed: To ensure the reliability of our experiments, we set a random seed of 42. This means that we started each sweep from the same initial conditions, making our results consistent and comparable.

  • Batch Size: We tested three different batch sizes: 8, 16, and 32. Batch size affects how many examples are processed at once during training.

  • Learning Rate: We explored a range of learning rates. We used a distribution of values that were logarithmically spaced between 1e-5 and 1e-3. Learning rate plays a crucial role in determining how quickly our model learns.

  • Weight Decay: We considered two options for weight decay: 0.0 and 0.2. Weight decay is a regularization technique that helps prevent overfitting.

  • Learning Schedule: We tested three different learning schedules: linear, polynomial, and cosine. The learning schedule determines how the learning rate changes during training.

  • Model Architecture: We tested three different model architectures: bert-base-uncased, distilbert-base-uncased, and distilroberta-base.
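
A sketch of a W&B sweep configuration covering this search space is shown below; the Bayesian search method, the metric name, the run count, and the train() entry point are assumptions rather than the exact setup.

import wandb

sweep_config = {
    "method": "bayes",                                   # search strategy is an assumption
    "metric": {"name": "eval/loss", "goal": "minimize"},
    "parameters": {
        "batch_size": {"values": [8, 16, 32]},
        "learning_rate": {"distribution": "log_uniform_values", "min": 1e-5, "max": 1e-3},
        "weight_decay": {"values": [0.0, 0.2]},
        "lr_scheduler_type": {"values": ["linear", "polynomial", "cosine"]},
        "model_name": {"values": ["bert-base-uncased", "distilbert-base-uncased", "distilroberta-base"]},
        "num_train_epochs": {"value": 2},
        "seed": {"value": 42},
    },
}

def train():
    """Hypothetical entry point: reads wandb.config, builds the model and
    Trainer as in the baseline section, trains for 2 epochs, and logs eval/loss."""
    run = wandb.init()
    config = run.config
    # ... build tokenizer, datasets, and Trainer from config, then trainer.train()

sweep_id = wandb.sweep(sweep_config, project="predict-the-llm")
wandb.agent(sweep_id, function=train, count=30)          # run count is an assumption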

Best Hyperparameters:

After conducting our sweeps, we found that the following hyperparameters worked best for our model:

  • Batch Size: 16
  • Learning Rate: 0.0001215
  • Weight Decay: 0.2
  • Learning Schedule: Polynomial
  • Architecture: bert-base-uncased

Evaluation:

These hyperparameters were selected based on their ability to provide the best validation performance for our specific task. The relatively low learning rate and the polynomial learning schedule indicate that our model benefits from slower, more gradual learning, while the selected weight decay of 0.2 suggests that some regularization helps prevent overfitting.

By using sweeps, we were able to fine-tune our model effectively and improve its performance. This process demonstrates the importance of optimizing hyperparameters to achieve the best results in our text classification task. It also reaffirms the value of thorough experimentation in the model development process.








Retrain

For retraining, I switched the evaluation strategy to steps because the model was overfitting and we could not add new data under the competition rules. With a batch size of 32, a linear learning-rate scheduler, and early stopping with a patience of 2, the evaluation loss decreased to 1.251 on the validation data and 1.14 on the public test data.
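
A hedged sketch of this retraining configuration with the Hugging Face Trainer is shown below; model, train_ds, and val_ds are assumed to be the objects from the baseline sketch, and the evaluation interval of 50 steps is an assumption.

from transformers import EarlyStoppingCallback, Trainer, TrainingArguments

args = TrainingArguments(
    output_dir="retrain",
    per_device_train_batch_size=32,
    lr_scheduler_type="linear",
    evaluation_strategy="steps",         # evaluate every eval_steps instead of every epoch
    eval_steps=50,                       # evaluation interval is an assumption
    save_strategy="steps",
    save_steps=50,
    load_best_model_at_end=True,
    metric_for_best_model="eval_loss",
    report_to="wandb",
)

trainer = Trainer(
    model=model,                         # assumed from the baseline/sweep setup
    args=args,
    train_dataset=train_ds,              # assumed from the baseline sketch
    eval_dataset=val_ds,
    callbacks=[EarlyStoppingCallback(early_stopping_patience=2)],  # patience counts evaluations here
)
trainer.train()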





Evaluation

Examining all the confusion matrices, the model performs best at detecting target 3 and worst at target 2. Most of its false predictions land in target 4, with the largest number of transitions into that class coming from targets 0 and 2. In general, the model is weak at detecting targets 4, 5, and 6. Notably, the EDA showed that the sequence lengths and character counts of targets 0, 4, and 6 are similar to each other, which may explain some of this confusion. Compared to the baseline, there is a slight improvement in validation loss.
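
For reference, a small sketch of how this evaluation can be produced, assuming trainer and val_ds are the objects from the retraining step:

import numpy as np
from scipy.special import softmax
from sklearn.metrics import confusion_matrix, log_loss

output = trainer.predict(val_ds)                 # returns raw logits and the true labels
val_probs = softmax(output.predictions, axis=1)

# Rows are the true targets, columns the predicted targets.
print(confusion_matrix(output.label_ids, val_probs.argmax(axis=1)))
print("validation log loss:", log_loss(output.label_ids, val_probs, labels=np.arange(7)))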





Postprocess

Post-processing for Class Weight Equalization

In our text classification task with seven distinct classes, it's important to ensure that the predictions made by our model are equally weighted across all classes. This equalization of class weights can help prevent biases towards any particular class and improve the overall balance and fairness of the classification.

Here's how we achieved this using the provided code:

  1. Predictions: The variable predictions contains the raw predictions made by our model. These predictions are not normalized and may have varying weights for each class.

  2. Desired Equal Weight: We start by defining our desired equal weight for each class. In our case, we aim for a uniform distribution, so we set the desired_weight to be equal to 1 divided by the total number of classes (in our case, 7). This results in each class having a desired weight of approximately 0.14285714.

  3. Correction Factor: To equalize the weights, we calculate a correction factor for each class. This correction factor is determined by dividing the desired_weight by the mean prediction value for each class. In essence, it helps adjust the predictions for each class so that they contribute equally to the final result.

  4. Applying the Correction Factor: We then apply this correction factor to the raw predictions. By multiplying each prediction by its corresponding correction factor, we ensure that the overall weight for each class is approximately 0.14285714, resulting in equal class weights.

  5. Normalization: The normalized_predictions variable now contains the rescaled predictions. After this column-wise correction, the average predicted weight is equal across all seven classes; note that individual rows may no longer sum exactly to 1 unless they are re-normalized.

  6. Result: After applying this post-processing, we check the average weight of each class in the normalized_predictions array. The result confirms that all classes now have nearly identical weights of approximately 0.14285714, achieving the desired balance.

This post-processing step ensures that our model's predictions are not skewed towards any particular class and helps maintain fairness and consistency in our classification results. It's a valuable technique in scenarios where class imbalances exist, ensuring that no class is favored over the others.


import numpy as np

# preds: raw class-probability predictions from the model, shape (n_samples, 7)
predictions = np.asarray(preds)
# Define the desired equal weight for each class
desired_weight = 1 / 7
# Calculate the correction factor for each class (shape (1, 7))
correction_factor = desired_weight / predictions.mean(axis=0, keepdims=True)
# Apply the correction factor to rescale the predictions
normalized_predictions = predictions * correction_factor
# The mean predicted probability of each class (column) is now 1 / 7
print(normalized_predictions.mean(axis=0))
Output:
array([0.14285714, 0.14285714, 0.14285714, 0.14285714, 0.14285714,
0.14285714, 0.14285714])
The results on the competition's private test set indicate that this post-processing was effective: applying it to the DistilRoberta and Mistral7B results decreased the log loss from 0.81 to 0.789, which moved us up 10 places on the private leaderboard, from 17th to 7th.