Kaggle's MMLM 2021-NCAAW-Spread
EDA and Baseline Model Comparison

🏀 Introduction
March Machine Learning Mania 2021 - NCAAW - Spread is a Kaggle competition that provides historical data on NCAA women's basketball games and asks us to predict the point spread, i.e. the margin of victory, for each matchup.
During my first glance at the competition data, I noticed the huge number of files in the dataset. Learning more about the game was essential to understanding the data, and the NCAA Tournament Glossary came in handy for this purpose ✨
This intrigued me because I've always wanted to try real-time prediction for sports, and this competition gave me the perfect opportunity to train my models on historical data and predict the outcomes of ongoing games.
📄Competition Data
🎯Goal:
- Stage 1 - Predict and submit a point spread for every possible matchup in the past 5 NCAA tournaments (seasons 2015-2019).
- Stage 2 - Predict and submit a point spread for every possible matchup before the 2021 tournament begins (a sketch of the submission format is shown below).
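To make the task concrete, here is a minimal sketch of how a submission file can be assembled. The sample-submission file name, the Season_Team1_Team2 ID layout (lower TeamID first), and the zero-spread baseline are my assumptions about the usual March Mania format, not code from this report.

import pandas as pd

# Assumed Stage 1 sample submission; IDs are assumed to look like
# "2015_3104_3112", i.e. Season_Team1_Team2 with the lower TeamID first.
sub = pd.read_csv("WSampleSubmissionStage1.csv")
ids = sub["ID"].str.split("_", expand=True).astype(int)
sub["Season"], sub["Team1"], sub["Team2"] = ids[0], ids[1], ids[2]

# Naive baseline: predict a point spread of 0 for every matchup
sub["Pred"] = 0
sub[["ID", "Pred"]].to_csv("submission.csv", index=False)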

[Run set panel: 237 runs]
📊Exploratory Data Analysis
📌Terminology alert!
🏀 A seed in basketball is a number that corresponds to a team's ranking.
Seeding is done to prevent high-ranking teams from playing each other at the beginning of the tournament and to encourage healthy competition!
🏀 The season year is the year in which the season ends, not the year in which it starts.
Thus the current season is:
- the 2021 season ✅
- and not the 2020 season / 2020-2021 season / 2020-21 season ❌
I wanted to check for patterns and anomalies in the given data, and I've included a handful of plots here to give a gist of them. For the visualizations,
- I plotted graphs to see which teams received the highest and lowest seeds across past seasons.
- Next, I checked the distribution of teams' winning and losing scores.
- After this, I checked how the average winning and losing scores were distributed over the seasons (a sketch of this plotting code follows the list).
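Roughly how such plots can be produced, shown as a minimal sketch; it assumes the WNCAATourneyCompactResults.csv file with its Season, WScore, and LScore columns (the seed plots would use WNCAATourneySeeds.csv in the same way).

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Compact results of past tournament games
results = pd.read_csv("WNCAATourneyCompactResults.csv")

# Distribution of winning and losing scores
fig, ax = plt.subplots(figsize=(8, 4))
sns.kdeplot(results["WScore"], label="Winning score", ax=ax)
sns.kdeplot(results["LScore"], label="Losing score", ax=ax)
ax.legend()
plt.show()

# Average winning and losing score per season
season_avg = results.groupby("Season")[["WScore", "LScore"]].mean()
season_avg.plot(marker="o", figsize=(8, 4), title="Average scores per season")
plt.show()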
[Run set panel: 10 runs]
🏠Home advantage – describes the benefit that the home team is said to gain over the visiting team. This benefit has been attributed to psychological effects supporting fans have on the competitors or referees; to psychological or physiological advantages of playing near home in familiar situations; to the disadvantages away teams suffer from changing time zones or climates, or from the rigors of travel; and in some sports, to specific rules that favour the home team directly or indirectly.
It is widely believed, and often observed, that home advantage affects winning across sports. Here's a pie chart of the winning-location distribution and its effect on the difference in scores, with a sketch of the underlying computation below.
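A minimal sketch of how the location breakdown and its effect on the score difference can be computed; the regular-season file name is an assumption, and WLoc records where the winning team played.

import pandas as pd
import matplotlib.pyplot as plt

games = pd.read_csv("WRegularSeasonCompactResults.csv")
games["ScoreDiff"] = games["WScore"] - games["LScore"]

# WLoc is the winner's location: H = home, A = away, N = neutral court
games["WLoc"].value_counts().plot.pie(autopct="%1.1f%%", title="Winning location")
plt.ylabel("")
plt.show()

# Average margin of victory by winning location
print(games.groupby("WLoc")["ScoreDiff"].mean())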


Consequently, I merged the various data frames to bring all the necessary columns together in one place (a sketch of this merge follows the list of models below).
Following this, I created a few baseline models:
- Ridge
- K Neighbors Regressor
- Random Forest Regressor
- LGBM Regressor
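To illustrate what the merge step can look like, here is a minimal sketch that attaches each team's tournament seed to the compact results and derives a seed-difference feature. The file names, columns, and the SeedDiff/ScoreDiff features are illustrative assumptions, not necessarily the exact feature set used in my notebook.

import pandas as pd

seeds = pd.read_csv("WNCAATourneySeeds.csv")            # Season, Seed, TeamID
results = pd.read_csv("WNCAATourneyCompactResults.csv") # Season, WTeamID, WScore, LTeamID, LScore, ...

# Convert seeds like "W01" into the integer 1
seeds["SeedNum"] = seeds["Seed"].str.extract(r"(\d+)", expand=False).astype(int)

# Attach the winner's and loser's seeds to every tournament game
df = (results
      .merge(seeds[["Season", "TeamID", "SeedNum"]],
             left_on=["Season", "WTeamID"], right_on=["Season", "TeamID"])
      .rename(columns={"SeedNum": "WSeed"}).drop(columns="TeamID")
      .merge(seeds[["Season", "TeamID", "SeedNum"]],
             left_on=["Season", "LTeamID"], right_on=["Season", "TeamID"])
      .rename(columns={"SeedNum": "LSeed"}).drop(columns="TeamID"))

df["SeedDiff"] = df["WSeed"] - df["LSeed"]     # example engineered feature
df["ScoreDiff"] = df["WScore"] - df["LScore"]  # target: the point spread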
📝Tracking Model Performance
Learning Curve
# Imports assumed by the snippet below
import numpy as np
import lightgbm as lgb
import wandb
from tqdm import tqdm
from sklearn.model_selection import KFold
from sklearn.metrics import mean_squared_error
from sklearn.linear_model import Ridge
from sklearn.neighbors import KNeighborsRegressor
from sklearn.ensemble import RandomForestRegressor

def train_plot_regressor(model, name, v_mse_l, v_mean_mse):
    # One W&B run per model so the fold-level plots stay grouped
    run = wandb.init(project='ncaaw', name=name + " with KFold")
    mse_l = []
    NUM_FOLDS = 10
    kf = KFold(n_splits=NUM_FOLDS, shuffle=True, random_state=0)

    train_oof = np.zeros(len(X))   # out-of-fold predictions on the training set
    test_preds = 0                 # fold-averaged predictions on the test set

    for f, (trn_idx, val_idx) in tqdm(enumerate(kf.split(X, y))):
        print('\nFold {}'.format(f))
        X_train, X_val = X[trn_idx], X[val_idx]
        y_train, y_val = y.iloc[trn_idx], y.iloc[val_idx]

        model.fit(X_train, y_train)
        temp_oof = model.predict(X_val)
        temp_test = model.predict(test_df)

        # Log the regressor diagnostics (learning curve, residuals, ...) to W&B
        wandb.sklearn.plot_regressor(model, X_train, X_val, y_train, y_val)

        train_oof[val_idx] = temp_oof
        test_preds += temp_test / NUM_FOLDS

        # Note: squared=False returns the root of the mean squared error
        mse = mean_squared_error(y_val, temp_oof, squared=False)
        mse_l.append(mse)
        print("MSE: ", mse)

    mean_mse = np.mean(mse_l)
    v_mse_l.append(mse_l)
    v_mean_mse.append(mean_mse)
    print("\nMean MSE of ", name, mean_mse)
    run.finish()
    return v_mse_l, v_mean_mse

v_mse_l = []
v_mean_mse = []
v_mse_l, v_mean_mse = train_plot_regressor(Ridge(), 'Ridge', v_mse_l, v_mean_mse)
v_mse_l, v_mean_mse = train_plot_regressor(KNeighborsRegressor(n_neighbors=99), 'K Neighbors Regressor', v_mse_l, v_mean_mse)
v_mse_l, v_mean_mse = train_plot_regressor(RandomForestRegressor(random_state=21), 'Random Forest Regressor', v_mse_l, v_mean_mse)
v_mse_l, v_mean_mse = train_plot_regressor(lgb.LGBMRegressor(), 'LGBM Regressor', v_mse_l, v_mean_mse)
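The same 10-fold split (fixed random_state) is reused for every model, so the per-fold and mean errors are directly comparable across the four regressors. The out-of-fold predictions estimate each model's error on data it has not seen, while the test predictions are averaged over the folds.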
[Run set panel: 11 runs]
Average Mean Squared Error of models over 10 folds
# Log a bar chart comparing the mean error of the four models in a separate W&B run
run = wandb.init(project='ncaaw', job_type='image-visualization', name='Mean MSE')

values = v_mean_mse
labels = ["Ridge", "K Neighbors Regressor", "Random Forest Regressor", "LGBM Regressor"]

dt = [[label, val] for (label, val) in zip(labels, values)]
table = wandb.Table(data=dt, columns=["Model name", "Mean MSE"])
wandb.log({"Mean MSE": wandb.plot.bar(table, "Model name", "Mean MSE", title="Mean MSE of models")})

run.finish()
[Run set panel: 11 runs]
Time taken for each process
[Run set panel: 11 runs]
🏁Conclusion
After this initial analysis, I came to the conclusion that the LightGBM Regressor looks the most promising for the data at hand, so I'll be building on it and trying hyperparameter tuning to get the best results.
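As a pointer for that next step, here is a minimal sketch of the kind of hyperparameter search I have in mind for the LGBM Regressor; the parameter ranges are illustrative assumptions rather than tuned values.

import lightgbm as lgb
from sklearn.model_selection import RandomizedSearchCV

param_distributions = {
    "num_leaves": [15, 31, 63, 127],
    "learning_rate": [0.01, 0.05, 0.1],
    "n_estimators": [100, 300, 500],
    "min_child_samples": [10, 20, 40],
}

search = RandomizedSearchCV(
    lgb.LGBMRegressor(random_state=21),
    param_distributions=param_distributions,
    n_iter=20,
    scoring="neg_mean_squared_error",
    cv=5,
    random_state=0,
)
search.fit(X, y)
print(search.best_params_, search.best_score_)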