Baseline XGB Regression training
Created on August 27 | Last edited on September 3
Project
The project's aim is to predict bike ride duration given the start and end station, start time, bike type, and type of membership.
Potential use case
A customer takes a bike from a station and wants to know how long it will take to get to the destination station. They enter the destination station, and the rest of the features are logged automatically. The request is sent to the web service that returns the predicted duration, and the customer can decide if they want to take the bike or not.
Data
The data is provided by Capital Bikeshare and contains information about bike rides in Washington, DC. Downloadable files are available at https://s3.amazonaws.com/capitalbikeshare-data/index.html
Data Processing
Download raw data
The original data is published as monthly archives. In the first stage, we download the individual archives and combine them into one big zip. Some typos in the original file names are fixed along the way.
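As a rough illustration of this stage, the sketch below merges several monthly zip archives into one and renames members on the way. The function names, the `renames` mapping, and the idea of passing archives as in-memory bytes are assumptions made for the example, not the project's actual code.

```python
import io
import urllib.request
import zipfile
from typing import Dict, Optional


def download(url: str) -> bytes:
    """Fetch one monthly archive as raw bytes (hypothetical helper)."""
    with urllib.request.urlopen(url) as response:
        return response.read()


def combine_archives(
    archives: Dict[str, bytes],
    renames: Optional[Dict[str, str]] = None,
) -> bytes:
    """Merge the members of several zip archives into one in-memory zip.

    `archives` maps an archive name to its bytes. `renames` maps a
    misspelled member name to its corrected name (assumed mechanism
    for fixing the file-name typos).
    """
    renames = renames or {}
    out = io.BytesIO()
    with zipfile.ZipFile(out, "w", compression=zipfile.ZIP_DEFLATED) as combined:
        # Process archives in name order so the result is deterministic.
        for archive_name in sorted(archives):
            with zipfile.ZipFile(io.BytesIO(archives[archive_name])) as monthly:
                for member in monthly.namelist():
                    if member.endswith("/"):
                        continue  # skip directory entries
                    fixed_name = renames.get(member, member)
                    combined.writestr(fixed_name, monthly.read(member))
    return out.getvalue()
```

The bytes for each archive could come from `download` applied to the individual file URLs listed on the index page above.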
Combine raw data
In this stage, we clean up the raw data, calculate the target (ride duration), filter out outliers, and combine everything into one big artifact that is later split into training, validation, and test sets.
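A minimal sketch of this stage with pandas is shown below. The column names `started_at` and `ended_at`, the choice of minutes as the unit, and the outlier bounds of 1 and 120 minutes are all assumptions for illustration; the report does not state the thresholds actually used.

```python
import pandas as pd


def combine_months(frames):
    """Stack the monthly DataFrames into a single DataFrame."""
    return pd.concat(frames, ignore_index=True)


def add_duration_and_filter(rides, min_minutes=1.0, max_minutes=120.0):
    """Compute ride duration in minutes and drop rides outside the bounds.

    The bounds are hypothetical placeholders for the outlier filter.
    """
    rides = rides.copy()
    rides["started_at"] = pd.to_datetime(rides["started_at"])
    rides["ended_at"] = pd.to_datetime(rides["ended_at"])
    # Target: elapsed time between ride start and ride end.
    elapsed = rides["ended_at"] - rides["started_at"]
    rides["duration"] = elapsed.dt.total_seconds() / 60
    # Keep only rides whose duration lies within [min_minutes, max_minutes].
    keep = rides["duration"].between(min_minutes, max_minutes)
    return rides[keep].reset_index(drop=True)
```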
The following table shows a subset of rides from April 2020 to the present, combined in one file.
Add features and encode categorical ones
Add hour, month, and year features of start ride time
Encode categorical features (all the existing features apart from the added hour, month, and year) using scikit-learn DictVectorizer
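The two steps above could look roughly like the sketch below. The column names (`started_at`, `start_station_id`, `end_station_id`, `rideable_type`, `member_casual`) and the helper names are assumptions for the example. Categorical values are cast to strings so that DictVectorizer one-hot encodes them, while the numeric hour, month, and year values pass through unchanged.

```python
import pandas as pd
from sklearn.feature_extraction import DictVectorizer

# Assumed column names; adjust to the actual schema.
CATEGORICAL = ["start_station_id", "end_station_id", "rideable_type", "member_casual"]
NUMERICAL = ["hour", "month", "year"]


def add_time_features(rides):
    """Add hour, month, and year of the ride start time."""
    rides = rides.copy()
    start = pd.to_datetime(rides["started_at"])
    rides["hour"] = start.dt.hour
    rides["month"] = start.dt.month
    rides["year"] = start.dt.year
    return rides


def to_records(rides):
    """Build one dict per ride: string categoricals, numeric time features."""
    categorical = rides[CATEGORICAL].astype(str)
    return pd.concat([categorical, rides[NUMERICAL]], axis=1).to_dict(orient="records")


def encode(rides, vectorizer=None):
    """Fit a new DictVectorizer, or reuse a fitted one for validation/test data."""
    records = to_records(add_time_features(rides))
    if vectorizer is None:
        vectorizer = DictVectorizer()
        return vectorizer.fit_transform(records), vectorizer
    return vectorizer.transform(records), vectorizer
```

Fitting the vectorizer on the training data only and reusing it for the other sets keeps the feature columns aligned across all three.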
Split into training, validation, and test
Training data: 01 April 2020 - 31 March 2023
Validation data: 01 April 2023 - 30 April 2023
Test data: 01 May 2023 - 31 May 2023
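The date ranges above can be applied with a simple time-based filter. The sketch below assumes the split is made on the ride start time in a `started_at` column, which the report does not state explicitly; each range is treated as inclusive of its last day.

```python
import pandas as pd


def split_by_date(
    rides,
    train=("2020-04-01", "2023-03-31"),
    val=("2023-04-01", "2023-04-30"),
    test=("2023-05-01", "2023-05-31"),
):
    """Split rides into training, validation, and test sets by start time."""
    start = pd.to_datetime(rides["started_at"])

    def window(first_day, last_day):
        lower = pd.Timestamp(first_day)
        # Exclusive upper bound one day later, so the whole last day is included.
        upper = pd.Timestamp(last_day) + pd.Timedelta(days=1)
        return rides[(start >= lower) & (start < upper)]

    return window(*train), window(*val), window(*test)
```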
Training
Training Baseline
Train an XGBoost model with the reg:squarederror objective and default parameters on different feature sets. The run with the worse RMSE is the one that does not include the end station id feature.
Plot predictions vs true values for validation set
The plot uses the model built on the feature set that includes end station id.
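One common way to draw such a plot is a scatter of predicted against true durations with a diagonal reference line, sketched below with matplotlib. The plot type, styling, and function name are assumptions; the report does not specify how the original figure was produced.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend, so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np


def plot_predictions(y_true, y_pred):
    """Scatter predicted vs. true durations with a y = x reference line."""
    fig, ax = plt.subplots(figsize=(6, 6))
    ax.scatter(y_true, y_pred, s=5, alpha=0.3)
    # Points on the dashed diagonal are perfect predictions.
    limit = float(max(np.max(y_true), np.max(y_pred)))
    ax.plot([0, limit], [0, limit], color="red", linestyle="--", label="perfect prediction")
    ax.set_xlabel("True duration")
    ax.set_ylabel("Predicted duration")
    ax.legend()
    return fig
```

The returned figure can be saved with `fig.savefig(...)` or logged to an experiment tracker.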