Predict The LLM (Kaggle Community Competition)
Introduction
- Exploratory Data Analysis (EDA): We'll start by exploring the data provided in the competition. We will look at basic statistics such as word counts in the text, perform text analysis to understand its structure, and delve into topic analysis to identify common themes.
- Data Splitting (StratifiedKFold): We will explain how we divided the data into training and validation sets using StratifiedKFold to ensure that each class is well-represented in both sets.
- Baseline Model (distilroberta-base): Our initial model will be based on the distilroberta-base architecture. We will discuss this choice and how it performed as a starting point.
- Preprocessing: Text data often requires some cleaning and preparation before it can be fed into the model. We will explain the text cleaning steps and the coverage of embeddings used in this task.
- Hyperparameter Tuning (Sweeps): One crucial aspect of building a powerful model is hyperparameter tuning. We'll discuss the sweeps used to find the best hyperparameters for our model.
- Retraining: After finding the optimal hyperparameters, we will go through the retraining process to fine-tune our model for improved performance.
- Evaluation: We will use logloss as the metric to measure how well our model performs in classifying text into the seven different categories. We'll discuss the model's effectiveness and any insights gained from the evaluation process.
- Checklist: We'll provide a checklist to summarize the key steps and decisions we made throughout the competition.
- Results: Finally, we will present the results of our efforts, showcasing how our LLM-based model performs in classifying text into the seven different categories.
EDA
Word Counts
- Class 3: On average, sequences generated by this class were the longest, with approximately 250 words.
- Class 1: Sequences in this class had an average length of about 200 words.
- Class 5: The average length of sequences from this class was around 150 words.
- Class 2: Sequences in this category had an average length of about 100 words.
- Class 0, Class 4, and Class 6: These classes had the shortest sequences, with an average length of approximately 75 words.
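For reference, a minimal sketch of this per-class word-count check is shown below; the column names ("Response" for the text and "target" for the class id) are assumptions, since the exact schema isn't reproduced in this post.

```python
import pandas as pd

# Assumed schema: "Response" holds the generated text, "target" the class id (0-6).
train = pd.read_csv("train.csv")
train["word_count"] = train["Response"].fillna("").str.split().str.len()

# Average response length per class
print(train.groupby("target")["word_count"].mean().round(1))
```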
Text Analysis
The summaries below describe statistics and common words for the two text columns: 'Question' and 'Response'. Each question was posed to seven different LLMs, so every question appears seven times, as indicated by the most frequent ("top") question having a frequency of 7. The summaries can be described as follows:
For 'Question':
- There are a total of 3,976 questions.
- There are 568 unique questions.
- The most frequent question is "Explain the concept of coevolution."
- Each question appears seven times, once for each of the seven LLMs.
- Train: 3,976 / 7 = 568 unique questions.
- Test: 1,001 / 7 = 143 unique questions.
Most Common Words in 'Question':
- The word "what" occurs 1,603 times.
- Other common words in the questions include "who," "how," "explain," "describe," "can," "I," "concept," "could," and "cause."
For 'Response':
- There are a total of 3,976 responses.
- There are 3,969 unique responses.
- The most frequent response is empty (blank).
- The empty response appears 7 times.
- Indexes of empty responses in train: [196, 339, 401, 798, 2279, 2725, 3046]
- Indexes of empty responses in test: [446, 969]
Most Common Words in 'Response':
- The word "the" occurs 4,557 times.
- Other common words in the responses include "this," "data," "it," "a," "I," "also," "in," "process," and "like."
These summaries provide insights into the data, including data count, uniqueness, and the most common words found in the 'Question' and 'Response' columns. It appears that the 'Response' data contains empty responses, and the 'Question' data focuses on common introductory words and phrases used in questions.
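These summary statistics come straight out of pandas' describe() on the text columns; a sketch (reusing the train DataFrame from the word-count snippet above, column names assumed) might look like this:

```python
from collections import Counter

# describe() on an object-dtype column returns count, unique, top, and freq,
# which is where the numbers above come from.
print(train["Question"].describe())
print(train["Response"].describe())

# Naive most-common-word count over the questions (lowercased whitespace split)
word_counts = Counter(
    word
    for text in train["Question"].fillna("")
    for word in text.lower().split()
)
print(word_counts.most_common(10))
```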
Topic Analysis
Topic 0:
- Words: "data," "information," "the," "model," "learning," "used," "this," "process," "ai," "language"
- Category: Machine Learning and Data Analysis - This topic appears to be related to machine learning, data analysis, and artificial intelligence, focusing on data, information, models, and learning.
Topic 1:
- Words: "i," "help," "this," "answer," "what," "time," "use," "question," "skills," "make"
- Category: Personal Assistance and Problem-Solving - This topic seems to revolve around personal assistance and problem-solving, where "i," "help," and "answer" suggest assistance and addressing questions or challenges.
Topic 2:
- Words: "the," "x," "this," "cell," "process," "cells," "dna," "behavior," "animals," "in"
- Category: Biology and Cellular Processes - This topic appears to be related to biology and cellular processes, with words like "cell," "process," "dna," and "animals" indicating a focus on biological subjects.
Topic 3:
- Words: "the," "cause," "war," "symptoms," "pain," "it," "world," "what," "this," "also"
- Category: Historical and Health-related Topics - This topic suggests a mix of historical and health-related discussions, with mentions of "war," "symptoms," and "pain."
Topic 4:
- Words: "the," "this," "species," "one," "it," "what," "code," "in," "system," "different"
- Category: Species and Systems - This topic may relate to discussions about species, systems, and differences between various elements or entities.
These are the identified topics from the combined questions and responses. Each topic represents a distinct theme or category, and the assigned category is based on the prominent words within each topic. The model seems to capture a range of topics, including machine learning, personal assistance, biology, history, and systems.
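The writeup doesn't name the topic model that produced these topics; a minimal sketch with scikit-learn's LDA over the combined question and response text (column names assumed) would look roughly like this:

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Combine question and response text into one document per row.
docs = train["Question"].fillna("") + " " + train["Response"].fillna("")

# Stop words are kept, which is consistent with "the"/"this" appearing above.
vectorizer = CountVectorizer(max_features=5000)
doc_term = vectorizer.fit_transform(docs)

lda = LatentDirichletAllocation(n_components=5, random_state=42)
lda.fit(doc_term)

terms = vectorizer.get_feature_names_out()
for topic_id, weights in enumerate(lda.components_):
    top_words = [terms[i] for i in weights.argsort()[::-1][:10]]
    print(f"Topic {topic_id}: {', '.join(top_words)}")
```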
Embeddings
Splits
You can find the split and baseline notebooks in the GitHub repository.
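As a rough illustration of the approach, a StratifiedKFold split along these lines assigns a fold id to every row; the fold count, the seed, and the "target" column name are assumptions rather than values taken from the notebook.

```python
from sklearn.model_selection import StratifiedKFold

# `train` is the DataFrame from the EDA section; "target" is the assumed label column.
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
train["fold"] = -1
for fold, (_, valid_idx) in enumerate(skf.split(train, train["target"])):
    train.loc[valid_idx, "fold"] = fold

# Each fold should contain roughly the same class distribution.
print(train.groupby("fold")["target"].value_counts())
```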
Baseline
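The baseline itself lives in the notebook linked above; as a sketch of the setup, loading distilroberta-base with a 7-class head via Hugging Face Transformers looks something like this (the question/response pairing and the maximum sequence length are assumptions).

```python
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_name = "distilroberta-base"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=7)

# Encode question/response pairs (column names assumed, as in the EDA section).
encodings = tokenizer(
    train["Question"].fillna("").tolist(),
    train["Response"].fillna("").tolist(),
    truncation=True,
    padding=True,
    max_length=512,
)
```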
Preprocess
Evaluation of Preprocessing and Embedding Coverage
In this section, we assess the impact of various preprocessing steps on embedding coverage for both the question (q) and response (r) text in the training and test datasets. The purpose of measuring embedding coverage is to understand how well our text data is represented by pre-trained embeddings, which can significantly affect the performance of our language model.
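The helper used in the notebook isn't reproduced here; a common way to compute the two coverage numbers reported below (the share of unique vocabulary covered and the share of all tokens covered) against the Google News word2vec vectors is sketched here, with the gensim loading path as an assumption.

```python
from collections import Counter

from gensim.models import KeyedVectors

# Path to the pre-trained Google News vectors is an assumption.
w2v = KeyedVectors.load_word2vec_format(
    "GoogleNews-vectors-negative300.bin", binary=True
)

def embedding_coverage(texts, embeddings):
    """Return (vocab coverage, all-text coverage) as fractions."""
    vocab = Counter(word for text in texts for word in str(text).split())
    known = {word: count for word, count in vocab.items() if word in embeddings}
    vocab_coverage = len(known) / len(vocab)
    text_coverage = sum(known.values()) / sum(vocab.values())
    return vocab_coverage, text_coverage

# `train` is the DataFrame loaded in the EDA section.
print(embedding_coverage(train["Question"], w2v))
```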
Google News Embeddings (First 5 Rows):
To provide context, the first five rows of the table report coverage against the Google News embeddings, which are built from a vast vocabulary of general text. Each cell lists two numbers: the share of unique vocabulary covered and the share of all text covered ("vocab - all text").
No Preprocessing (Baseline):
When no preprocessing functions were applied, coverage varied considerably. The training question (q) text had 64.18% vocabulary coverage and 72.61% all-text coverage, while the training response (r) text had 40.96% and 73.15%. In the test set, q reached 69.10% / 71.55% and r reached 49.58% / 73.04%.
Clean Markdown and Emojis:
After applying the "clean_markdown_and_emojis" function, we saw improvements in embedding coverage, and the effect was more pronounced in the test dataset. For instance, the vocabulary coverage of the test response text (r) rose by 13.41% relative to the previous step, indicating that this preprocessing step helped the embeddings represent the text better.
Clean Text:
Using the "clean_text" function, we observed significant improvements in embedding coverage for both training and test data. Training q and r saw coverage increases of 34.85% and 15.99%, respectively. In the test dataset, the question text (q) showed a 29.24% coverage increase, and response text (r) had a 15.24% increase.
Clean Numbers:
The "clean_numbers" function had a negligible impact on embedding coverage, with a 0% change. This suggests that removing numbers from the text did not significantly affect how well the embeddings represented the data.
Correct Misspelling:
The "correct_misspelling" function resulted in a slight increase in coverage, with 1% and -1% changes in the training and test datasets, respectively. While it had a limited effect, it still improved embedding coverage.
Before Preprocess Bert:
The "before preprocess bert" represents the embedding coverage before applying any preprocessing steps using BERT embeddings. The coverage values were noticeably lower compared to the Google News embeddings, indicating the importance of preprocessing.
All Preprocessing Functions for Bert:
Finally, when all preprocessing functions were applied to BERT embeddings, we observed substantial improvements in embedding coverage. Training and test datasets both saw notable increases in coverage, highlighting the effectiveness of the preprocessing steps.
In summary, the choice of preprocessing steps has a significant impact on embedding coverage. By selecting appropriate preprocessing functions, we can improve the representation of text data, which is crucial for building accurate language models. These findings inform our decision-making process as we fine-tune our model for better performance.
Preprocess Results Table
preprocess_fn | train q | train r | test q | test r | effect % vs previous step (test r) |
---|---|---|---|---|---|
google news emb coverage (first 5 rows) | vocab - all text | vocab - all text | vocab - all text | vocab - all text | ------ |
no fn(function) | 64.18% - 72.61% | 40.96% - 73.15% | 69.10% - 71.55% | 49.58% - 73.04% | ------ |
clean_markdown_and_emojis | 66.25% - 72.99% | 48.55% - 74.01% | 70.34% - 71.69% | 56.23% - 73.95% | 13.41%- 1.2% |
clean_text | 99.03% - 87.10% | 95.97% - 88.13% | 98.28% - 86.12% | 97.89% - 87.81% | 74.08% - 18.74% |
clean_numbers | 99.03% - 87.10% | 95.97% - 88.13% | 98.28% - 86.12% | 97.89% - 87.81% | 0% - 0% |
correct_misspelling | 99.09% - 87.10% | 96.23% - 88.17% | 98.74% - 86.33% | 97.84% - 87.81% | 1% - -1% |
before preprocess bert coverage | 51.25% - 69.74% | 20.14% - 71.52% | 57.37% - 70.21% | 30.73% - 71.50% | no_fns |
fns result for bert | 76.81% - 81.28% | 42.90% - 83.93% | 80.09% - 82.24% | 56.53% - 83.48% | all_fns |
- Embedding Coverage vs. Performance: An increase in embedding coverage doesn't necessarily translate into better model performance. The preprocessing methods raised coverage, but they may not address the specific challenges that transformers like BERT and DistilRoBERTa face; these models are already highly optimized and may not benefit much from changes to the input text.
- Complexity of Transformer Models: Transformer models like BERT and DistilRoBERTa are highly complex and capable of learning intricate patterns in text. Preprocessing focused on basic text cleaning may not provide a significant boost because these models already handle noisy, uncleaned text to some extent.
- Overfitting and Data Distribution: It's possible that the transformer models were already well-fitted to the original data distribution and introducing extensive preprocessing could lead to overfitting or model deterioration.
- Different Preprocessing Needs: Transformers may have specific preprocessing requirements that differ from more traditional models. What improves performance for a simpler model (e.g., Conv1D) might not necessarily apply to transformers. Transformers often require careful tokenization, handling of special tokens, and other unique considerations.
- Diminishing Returns: After a certain point, further preprocessing may not lead to substantial gains in performance. The preprocessing methods we've applied might have already achieved most of the possible improvements.
Preprocess Result Data Frame Heads
Sweeps
Sweeps for Hyperparameter Tuning
In our quest to build the best possible model, we embarked on a journey called "sweeps." Sweeps are a process where we systematically explore different combinations of hyperparameters to find the ones that work best for our task.
Here's what we did (the full search space is sketched as a sweep config after this list):
- Epochs: Each sweep run was limited to 2 epochs, which let us evaluate hyperparameter combinations quickly without committing to a lengthy training process.
- Random Seed: To keep experiments reliable and comparable, we fixed the random seed at 42, so every run started from the same initial conditions.
- Batch Size: We tested three batch sizes: 8, 16, and 32. Batch size controls how many examples are processed at once during training.
- Learning Rate: We explored learning rates drawn from a log-uniform distribution between 1e-5 and 1e-3. The learning rate plays a crucial role in how quickly the model learns.
- Weight Decay: We considered two options: 0.0 and 0.2. Weight decay is a regularization technique that helps prevent overfitting.
- Learning Schedule: We tested three schedules: linear, polynomial, and cosine. The schedule determines how the learning rate changes during training.
- Model Architecture: We tested three architectures: bert-base-uncased, distilbert-base-uncased, and distilroberta-base.
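Expressed as a sweep configuration, the search space above looks roughly like the following. This is a sketch assuming Weights & Biases sweeps; the tool, the Bayesian search method, the metric name, and the project name are assumptions rather than details from the original notebook.

```python
import wandb

# Hypothetical W&B sweep config mirroring the search space described above.
sweep_config = {
    "method": "bayes",  # search strategy is an assumption
    "metric": {"name": "eval/log_loss", "goal": "minimize"},
    "parameters": {
        "model_name": {
            "values": [
                "bert-base-uncased",
                "distilbert-base-uncased",
                "distilroberta-base",
            ]
        },
        "batch_size": {"values": [8, 16, 32]},
        "learning_rate": {
            "distribution": "log_uniform_values",
            "min": 1e-5,
            "max": 1e-3,
        },
        "weight_decay": {"values": [0.0, 0.2]},
        "lr_scheduler_type": {"values": ["linear", "polynomial", "cosine"]},
        "num_train_epochs": {"value": 2},
        "seed": {"value": 42},
    },
}

# Register the sweep (project name is a placeholder).
sweep_id = wandb.sweep(sweep_config, project="predict-the-llm")
```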
Best Hyperparameters:
After conducting our sweeps, we found that the following hyperparameters worked best for our model:
- Batch Size: 16
- Learning Rate: 0.0001215
- Weight Decay: 0.2
- Learning Schedule: Polynomial
- Architecture: bert-base-uncased
Evaluation:
These hyperparameters were selected because they gave the best validation performance for our task. The relatively low learning rate and the polynomial decay schedule indicate that the model benefits from slower, more gradual learning, and the weight decay of 0.2 shows that a moderate amount of regularization helped prevent overfitting.
By using sweeps, we were able to fine-tune our model effectively and improve its performance. This process demonstrates the importance of optimizing hyperparameters to achieve the best results in our text classification task. It also reaffirms the value of thorough experimentation in the model development process.
Retrain
Evaluation
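The competition metric is multi-class log loss over the seven classes. Computing it on a validation fold with scikit-learn is straightforward; the arrays below are placeholders standing in for real fold data.

```python
import numpy as np
from sklearn.metrics import log_loss

# Placeholder data: y_valid holds true class ids (0-6), valid_probs the model's
# predicted probabilities with shape (n, 7). Replace with real fold outputs.
rng = np.random.default_rng(42)
valid_probs = rng.dirichlet(np.ones(7), size=100)
y_valid = rng.integers(0, 7, size=100)

score = log_loss(y_valid, valid_probs, labels=list(range(7)))
print(f"Validation log loss: {score:.4f}")
```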
Postprocess
Post-processing for Class Weight Equalization
In our text classification task with seven distinct classes, it's important to ensure that the predictions made by our model are equally weighted across all classes. This equalization of class weights can help prevent biases towards any particular class and improve the overall balance and fairness of the classification.
Here's how we achieved this using the code below:
- Predictions: The `predictions` variable contains the raw predictions made by our model. These predictions are not normalized and may weight each class differently.
- Desired Equal Weight: We start by defining the desired equal weight for each class. Aiming for a uniform distribution over the 7 classes, we set `desired_weight = 1 / 7`, giving each class a desired weight of approximately 0.14285714.
- Correction Factor: To equalize the weights, we calculate a correction factor for each class by dividing `desired_weight` by the mean prediction value of that class. This adjusts the predictions so that every class contributes equally to the final result.
- Applying the Correction Factor: We multiply each prediction by its class's correction factor, so that the average weight of each class becomes approximately 0.14285714.
- Normalization: The `normalized_predictions` array now contains predictions whose per-class averages are equal across all seven classes.
- Result: As a check, we compute the mean of each class column in `normalized_predictions`. All classes now have nearly identical average weights of approximately 0.14285714, confirming the desired balance.
This post-processing step ensures that our model's predictions are not skewed towards any particular class and helps maintain fairness and consistency in our classification results. It's a valuable technique in scenarios where class imbalances exist, ensuring that no class is favored over the others.
```python
predictions = preds

# Define the desired equal weight for each class
desired_weight = 1 / 7

# Calculate the per-class correction factor
correction_factor = desired_weight / predictions.mean(axis=0, keepdims=True)

# Apply the correction factor to normalize the predictions
normalized_predictions = predictions * correction_factor

# The mean prediction for each of the 7 classes now equals the desired weight
print(normalized_predictions.mean(axis=0))
```

Output:

```
array([0.14285714, 0.14285714, 0.14285714, 0.14285714, 0.14285714,
       0.14285714, 0.14285714])
```