Healthy Ride Pittsburgh
Community Detection on PGH Bike Share Data
Data
The data for this experiment is composed of two datasets. The first contains all the Healthy Ride bike stations in the city of Pittsburgh (100 in total), and the second contains the bike rental information. See the tables below for a glimpse of the data. The station data contains the station name, the number of racks available at the station, and the station's coordinates.
The rental data has been aggregated to the total number of trips between stations: src represents the starting station, dst represents the destination station, and count is the total number of trips starting at src and ending at dst during 2020.
[W&B panel (run set, 41 runs): station and rental data tables]
Putting this data together, we get a nice visual of all Pittsburgh bike rentals over the course of the year.
[W&B panel (run set, 41 runs): map of Pittsburgh bike rentals over the year]
Approach
It should be clear that our data is best described as a graph (or network): we have bike stations that are connected to one another by bike rentals. In what follows, we will eliminate loops, i.e., rentals that start and end at the same station, because we want to focus on traffic between stations, not within them.
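To make this concrete, here is a minimal sketch of building that graph with networkx; the `rentals` DataFrame name is an assumption, but its src, dst, and count columns are the ones shown above.

```python
import networkx as nx

# rentals: pandas DataFrame with the src, dst, and count columns shown above.
G = nx.from_pandas_edgelist(
    rentals, source="src", target="dst",
    edge_attr="count", create_using=nx.DiGraph,
)
# Eliminate loops: rentals that start and end at the same station.
G.remove_edges_from(list(nx.selfloop_edges(G)))
```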
As we look to detect communities within our network, the main task is to cluster the bike stations (the vertices). But this raises the question: how exactly do you cluster vertices based on the structure of a graph? As with most machine learning problems, we'll find a way to represent our data numerically, in our case the vertices of our graph, and then run a clustering algorithm on that representation. While there are neural network methods that can help with this problem, we'll focus on simple matrix decompositions to embed our vertices into a vector space. Once we've computed the embedding, we'll run our old friend K-Means, and we'll make a decision based on evaluation metrics and what "looks" good.
Representing the Data
As mentioned earlier, we'll use matrix decompositions to aid in our community detection problem, and we'll represent the graph by its adjacency matrix and analyze the structure of the graph through it. The adjacency matrix is an $n \times n$ matrix $A$ in which entry $A_{ij}$ is the number of trips that start at station $i$ and end at station $j$. With respect to the second table above, just think of it as a pivot table with src as the rows and dst as the columns. In the visualization below, the rows have been normalized to sum to one, so that row $i$ represents the distribution of rides that start at station $i$, broken down by destination $j$; this is referred to as the transition matrix. It doesn't reveal too much, but it is a less busy visualization of our data.
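As a sketch, the adjacency matrix and its row-normalized transition matrix can be built with a pandas pivot table (again assuming the hypothetical `rentals` DataFrame):

```python
import numpy as np

# One row/column per station, whether it appears as src, dst, or both.
stations = sorted(set(rentals["src"]) | set(rentals["dst"]))

# Pivot the aggregated rentals into the adjacency matrix A:
# entry (i, j) is the number of trips from station i to station j.
A = (
    rentals.pivot_table(index="src", columns="dst", values="count",
                        aggfunc="sum", fill_value=0)
    .reindex(index=stations, columns=stations, fill_value=0)
    .to_numpy(dtype=float)
)
np.fill_diagonal(A, 0)  # eliminate loops, as above

# Row-normalize: row i becomes the distribution of rides starting at station i.
row_sums = A.sum(axis=1, keepdims=True)
T = np.divide(A, row_sums, out=np.zeros_like(A), where=row_sums > 0)
```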
[W&B panel (run set, 1 run): row-normalized transition matrix]
Modeling
Our problem really comes down to three decisions:
- which embedding to use
- the dimension of the embedding
- how many clusters to use
For embedding, we'll look at two options: the Spectral Decomposition and the Generalized Singular Value Decomposition (GSVD). Both methods operate on the adjacency matrix of the graph, and with both, we need to pick the dimension of the embedding space.
For our particular problem, the maximum dimension of the embedding space we can use is 100 (the number of stations), but we probably won't go above 10; in fact, we'll try dimensions between 3 and 10.
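As a rough sketch of what these embeddings can look like in plain numpy (the exact decompositions used in the sweep may differ; in particular, the SVD-based helper below is only a stand-in for the GSVD):

```python
import numpy as np

def spectral_embedding(A: np.ndarray, d: int) -> np.ndarray:
    """Embed vertices via the top-d eigenpairs of the symmetrized adjacency."""
    S = (A + A.T) / 2                         # eigh requires a symmetric matrix
    vals, vecs = np.linalg.eigh(S)            # eigenvalues in ascending order
    top = np.argsort(np.abs(vals))[::-1][:d]  # d largest in magnitude
    return vecs[:, top] * np.sqrt(np.abs(vals[top]))

def svd_embedding(A: np.ndarray, d: int) -> np.ndarray:
    """A stand-in for the GSVD: scaled left singular vectors of a rank-d SVD.
    (For directed structure one could also concatenate the right side.)"""
    U, s, _ = np.linalg.svd(A)
    return U[:, :d] * np.sqrt(s[:d])
```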
Moreover, in this type of setup, it is usual to set the number of dimensions used for the embedding equal to the number of clusters you ask K-Means to find, so our problem simplifies to figuring out just two things (see the sketch after this list):
- which embedding
- how many clusters
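A minimal sketch tying the two knobs together, using the hypothetical embedding helpers above:

```python
from sklearn.cluster import KMeans

def fit_communities(A, k, embed=spectral_embedding):
    """Embed into k dimensions, then find k clusters."""
    X = embed(A, k)  # embedding dimension = number of clusters
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    return X, labels
```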
Our search space will be the two embeddings crossed with cluster counts from 3 through 10, and we'll utilize W&B Sweeps to efficiently evaluate and curate the details of our search. For each run the sweep creates, we log:
- model evaluation metrics
- the embedding and predictions as a W&B Table
- some plots to aid in the decision
This once again helps with transparency and reproducibility of the results. A sketch of such a sweep follows.
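Here is what the sweep could look like; the project name, the metric names, and the choice of grid search are assumptions, and `A` plus the embedding helpers come from the earlier sketches.

```python
import wandb
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Grid over both knobs: embedding method and cluster count (3 through 10).
sweep_config = {
    "method": "grid",
    "metric": {"name": "silhouette", "goal": "maximize"},
    "parameters": {
        "embedding": {"values": ["spectral", "gsvd"]},
        "n_clusters": {"values": list(range(3, 11))},
    },
}

def train():
    run = wandb.init()
    k = run.config.n_clusters
    embed = spectral_embedding if run.config.embedding == "spectral" else svd_embedding
    X = embed(A, k)  # A: adjacency matrix from the earlier sketch
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    run.log({"silhouette": silhouette_score(X, labels)})
    run.finish()

sweep_id = wandb.sweep(sweep_config, project="healthy-ride")
wandb.agent(sweep_id, function=train)
```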
Evaluation
The silhouette coefficient takes a value between -1 and 1. A value near +1 indicates that the sample is far away from neighboring clusters, a value of 0 indicates that the sample is on or very close to the decision boundary between two neighboring clusters, and negative values indicate that the sample may have been assigned to the wrong cluster. The results below show the average silhouette coefficient for our node embeddings.
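Computing the average silhouette coefficient is a one-liner with scikit-learn, given the embedding `X` and K-Means `labels` from the modeling sketch above:

```python
from sklearn.metrics import silhouette_score

# Average silhouette coefficient over all embedded stations.
avg_silhouette = silhouette_score(X, labels)
```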
Modularity is a measure of the structure of a network that quantifies the strength of its division into modules (also called groups, clusters, or communities). Networks with high modularity have dense connections between the vertices within modules but sparse connections between vertices in different modules. Modularity is often used in optimization methods for detecting community structure in networks.
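Modularity can be computed with networkx, for example as below; `G`, `stations`, and `labels` come from the earlier sketches, and the trip counts stored on the edges serve as weights.

```python
from networkx.algorithms.community import modularity

# Turn the K-Means labels into a partition of the graph's nodes.
communities = [
    {s for s, lab in zip(stations, labels) if lab == c}
    for c in sorted(set(labels))
]
Q = modularity(G, communities, weight="count")
```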
[W&B panel (run set, 41 runs): evaluation metrics]
Some Results
An argument could be made for several different numbers of clusters. We provide results for GSVD with 5 and 6 clusters below.
GSVD with 5 clusters
[W&B panel (run: charmed-sweep-11)]
GSVD with 6 clusters
[W&B panel (run: absurd-sweep-12)]