Baseline XGB Regression training

Created on August 27|Last edited on September 3


Project

The project's aim is to predict bike ride duration given the start and end station, start time, bike type, and type of membership.

Potential use case

A customer takes a bike from a station and wants to know how long it will take to get to the destination station. They enter the destination station, and the rest of the features are logged automatically. The request is sent to the web service that returns the predicted duration, and the customer can decide if they want to take the bike or not.

Data

The data is provided by Capital Bikeshare and contains information about bike rides in Washington, DC. Downloadable files are available at https://s3.amazonaws.com/capitalbikeshare-data/index.html

Data Processing

Download raw data

The original data is published as monthly archives. In the first stage, we download the individual archives and combine them into one large zip file, fixing some typos in the original file names along the way.
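
As a rough illustration of the combining step, the sketch below merges several monthly zip archives into a single in-memory zip and renames members on the way. It assumes the archives have already been downloaded as bytes (for example with `urllib.request`). The function names and the specific misspelling handled in `fix_name` are hypothetical, since the report does not list the actual typos.

```python
import io
import zipfile


def fix_name(name):
    # Hypothetical typo fix: the real misspellings are not listed in the report.
    return name.replace("captialbikeshare", "capitalbikeshare")


def combine_archives(archives, rename=fix_name):
    """Merge the members of several zip archives (given as bytes) into one zip."""
    out = io.BytesIO()
    seen = set()
    with zipfile.ZipFile(out, "w", zipfile.ZIP_DEFLATED) as combined:
        for raw in archives:
            with zipfile.ZipFile(io.BytesIO(raw)) as monthly:
                for member in monthly.namelist():
                    if member.endswith("/"):
                        continue  # skip directory entries
                    fixed = rename(member)
                    if fixed in seen:
                        continue  # keep the first copy of a duplicated file
                    seen.add(fixed)
                    combined.writestr(fixed, monthly.read(member))
    return out.getvalue()


def make_zip(member, text):
    # Helper that builds a tiny archive so the example runs offline.
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as z:
        z.writestr(member, text)
    return buf.getvalue()


monthly_archives = [
    make_zip("202004-capitalbikeshare-tripdata.csv", "ride_id\nA\n"),
    make_zip("202005-captialbikeshare-tripdata.csv", "ride_id\nB\n"),
]
combined_bytes = combine_archives(monthly_archives)
combined_names = zipfile.ZipFile(io.BytesIO(combined_bytes)).namelist()
```

Keeping the rename logic in a separate function makes it easy to swap in whatever corrections the real file names need.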

Combine raw data

Clean up the raw data, calculate the target (ride duration), filter out outliers, and combine everything into one large artifact that is later split into training, validation, and test sets.
The following table shows a subset of the combined file, which covers rides from April 2020 onward.

start_station_id | end_station_id | rideable_type | member_casual | duration | started_at
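
The clean-up step described above could look roughly like the following pandas sketch. Several details are assumptions rather than facts from the report: that the raw files contain an `ended_at` column, that duration is measured in minutes, that rides with missing station ids are dropped, and the outlier bounds `MIN_MINUTES` and `MAX_MINUTES`.

```python
import pandas as pd

MIN_MINUTES = 1.0    # assumed lower bound for a plausible ride
MAX_MINUTES = 120.0  # assumed upper bound for a plausible ride

OUTPUT_COLUMNS = [
    "start_station_id", "end_station_id", "rideable_type",
    "member_casual", "duration", "started_at",
]


def clean_rides(raw: pd.DataFrame) -> pd.DataFrame:
    df = raw.copy()
    df["started_at"] = pd.to_datetime(df["started_at"])
    df["ended_at"] = pd.to_datetime(df["ended_at"])  # assumed raw column
    # Target: ride duration in minutes (the unit is an assumption).
    df["duration"] = (df["ended_at"] - df["started_at"]).dt.total_seconds() / 60.0
    # Drop rides without a start or end station.
    df = df.dropna(subset=["start_station_id", "end_station_id"])
    # Filter out outliers with simple fixed bounds.
    df = df[df["duration"].between(MIN_MINUTES, MAX_MINUTES)]
    return df[OUTPUT_COLUMNS].reset_index(drop=True)


raw = pd.DataFrame({
    "start_station_id": ["31623", "31623", "31200", "31101"],
    "end_station_id": ["31631", "31631", None, "31200"],
    "rideable_type": ["classic_bike", "classic_bike", "electric_bike", "docked_bike"],
    "member_casual": ["member", "casual", "member", "casual"],
    "started_at": ["2023-04-01 08:00:00", "2023-04-01 09:00:00",
                   "2023-04-01 10:00:00", "2023-04-01 11:00:00"],
    "ended_at": ["2023-04-01 08:10:00", "2023-04-01 09:00:20",
                 "2023-04-01 10:15:00", "2023-04-01 16:00:00"],
})
clean = clean_rides(raw)
```

In this toy input only the first ride survives: the second is too short, the third has no end station, and the fourth is too long.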

Add features and encode categorical ones

Add hour, month, and year features derived from the ride start time
Encode categorical features (all the existing features apart from the added hour, month, and year) using scikit-learn's DictVectorizer
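
The two steps above can be sketched with scikit-learn's `DictVectorizer`, which one-hot encodes string values and passes numeric values through unchanged. The helper `to_feature_dict`, the `CATEGORICAL` list, and the sample rides are illustrative assumptions, not the project's actual code.

```python
from datetime import datetime

from sklearn.feature_extraction import DictVectorizer

# Assumed set of categorical input features.
CATEGORICAL = ["start_station_id", "end_station_id", "rideable_type", "member_casual"]


def to_feature_dict(ride):
    started = ride["started_at"]
    # Cast categorical values to str so DictVectorizer one-hot encodes them.
    features = {name: str(ride[name]) for name in CATEGORICAL}
    # Numeric time features are passed through as plain numbers.
    features["hour"] = started.hour
    features["month"] = started.month
    features["year"] = started.year
    return features


rides = [
    {"start_station_id": 31623, "end_station_id": 31631,
     "rideable_type": "classic_bike", "member_casual": "member",
     "started_at": datetime(2023, 4, 1, 8, 15)},
    {"start_station_id": 31623, "end_station_id": 31200,
     "rideable_type": "electric_bike", "member_casual": "casual",
     "started_at": datetime(2023, 4, 2, 17, 40)},
]

vectorizer = DictVectorizer(sparse=True)
X = vectorizer.fit_transform([to_feature_dict(r) for r in rides])
feature_names = vectorizer.feature_names_
```

Casting station ids to strings matters here: left as integers, they would be treated as a single numeric column instead of one column per station.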

Split into training, validation, and test

Training data: 01 April 2020 - 31 March 2023
Validation data: 01 April 2023 - 30 April 2023
Test data: 01 May 2023 - 31 May 2023
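
A time-based split with these date ranges could be written as below. This is a sketch that assumes the combined data is a pandas DataFrame with a datetime `started_at` column. The function name is hypothetical, and half-open intervals are used so that rides late on the last day of each range are still included.

```python
import pandas as pd


def split_by_date(df):
    started = df["started_at"]

    def between(lo, hi):
        # Half-open interval [lo, hi) on the ride start time.
        return df[(started >= pd.Timestamp(lo)) & (started < pd.Timestamp(hi))]

    train = between("2020-04-01", "2023-04-01")
    val = between("2023-04-01", "2023-05-01")
    test = between("2023-05-01", "2023-06-01")
    return train, val, test


rides_df = pd.DataFrame({
    "started_at": pd.to_datetime([
        "2020-04-01 00:00:00",
        "2023-03-31 23:59:00",
        "2023-04-15 12:00:00",
        "2023-05-31 23:00:00",
        "2023-06-01 00:00:00",
    ]),
    "duration": [10.0, 12.5, 8.0, 20.0, 15.0],
})
train_df, val_df, test_df = split_by_date(rides_df)
```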

Training

Training Baseline

Train an XGBoost model with the reg:squarederror objective and default parameters on different feature sets. The run with the worse RMSE is the one trained without the end station id feature.

Plot predictions vs true values for validation set

The plot uses the model built on the feature set that includes the end station id.
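
One possible way to draw such a comparison is a scatter plot of predicted against true durations with a diagonal reference line, as in the matplotlib sketch below. The plot type, the function name, and the sample values are assumptions, since the original panel is not reproduced here.

```python
import matplotlib

matplotlib.use("Agg")  # render without a display
import matplotlib.pyplot as plt
import numpy as np


def plot_predictions(y_true, y_pred, path=None):
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    fig, ax = plt.subplots(figsize=(5, 5))
    ax.scatter(y_true, y_pred, s=8, alpha=0.4)
    # Points on the diagonal are perfect predictions.
    limit = float(max(y_true.max(), y_pred.max()))
    ax.plot([0.0, limit], [0.0, limit], color="black", linewidth=1)
    ax.set_xlabel("True duration")
    ax.set_ylabel("Predicted duration")
    ax.set_title("Validation set: predictions vs true values")
    if path is not None:
        fig.savefig(path)
    return fig


# Made-up values purely to exercise the function.
fig = plot_predictions([5.0, 10.0, 20.0, 35.0], [6.0, 9.0, 22.0, 30.0])
```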
