Indoor Navigation: Complete Data Understanding

This notebook contains EDA and a baseline model for the Indoor Location & Navigation competition. Made by Andrada Olteanu using Weights & Biases.

📍 Introduction

The Indoor Location and Navigation Kaggle competition had the purpose of identifying the position of a smartphone within a shopping mall. Data from 200 buildings in China was made available for this competition, containing around 30,000 traces in total. A trace is a path that a smartphone - or rather its user - takes through the mall.


The purpose: predict the position (x and y coordinates) and the floor on which a smartphone will be at a given moment in time "t".
The data was made available as follows:
A schema of the directories can be seen below as well:
You can find the full notebook (with datasets and code) here.

🔎 Exploratory Data Analysis

Now that we know how our data is structured, we can start to explore the floor plans, in other words the paths and different features that could be used to predict the waypoint, a particular point in the trace.
The organizers of this competition made a GitHub repo with custom functions designed to help the teams better visualize the features. Hence, it would be a shame not to take advantage of the fruitful work they have already put in there.
Ok, let's take the EDA step by step!

🏢 Site plan

First, let's look at a random shopping mall and visualize the plan of each floor. For this I've created a custom function, which takes the site_no of a site as input and returns a matplotlib plot of the floor plans in that site.
Below you can see that for this site example there are 8 floors:
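A minimal sketch of such a plotting helper is shown below. It assumes the competition's metadata layout of metadata/<site_id>/<floor>/floor_image.png; the function name and signature are my own, not the competition repo's.

```python
from pathlib import Path
import matplotlib.pyplot as plt

def plot_site_floors(site_id, metadata_dir="metadata"):
    """Plot the floor-plan image of every floor in a site, one subplot each.

    Assumes metadata/<site_id>/<floor>/floor_image.png (hedged layout).
    """
    floors = sorted(p for p in Path(metadata_dir, site_id).iterdir() if p.is_dir())
    # squeeze=False keeps axes 2-D even when the site has a single floor
    fig, axes = plt.subplots(1, len(floors), figsize=(4 * len(floors), 4), squeeze=False)
    for ax, floor in zip(axes[0], floors):
        ax.imshow(plt.imread(floor / "floor_image.png"))
        ax.set_title(floor.name)
        ax.axis("off")
    fig.tight_layout()
    return fig
```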

🗽 How many floors?

Ok, the site plan looks very interesting, but how many floors do we really have on average for each building?
This information is extremely important: first to observe the variance between sites, and second because the floor variable is also one of our target variables.
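One quick way to get this number is to count the floor sub-directories of each site; the sketch below assumes the unzipped train/<site_id>/<floor>/ directory layout described above.

```python
from pathlib import Path

def floors_per_site(train_dir="train"):
    """Map each site_id to its number of floor sub-directories.

    Assumes train/<site_id>/<floor>/ (hedged reading of the data layout).
    """
    return {
        site.name: sum(1 for floor in site.iterdir() if floor.is_dir())
        for site in Path(train_dir).iterdir() if site.is_dir()
    }

# counts = floors_per_site("train")
# print(sum(counts.values()) / len(counts))  # average floors per building
```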

🎯 Waypoint

The waypoint holds our second and third target variables. A waypoint is "a stopping place on a journey"; hence, that point has x and y coordinates and belongs to a bigger trace.
Let's visualize a random trace in a random building in our data. Each point in the figure below is a waypoint and represents a location on the map that we would need to predict.
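Waypoints can be pulled straight out of a trace file. The sketch below assumes the tab-separated TYPE_WAYPOINT rows (timestamp, type, x, y) that I read the competition's trace format to use; hedge accordingly.

```python
def read_waypoints(trace_path):
    """Return [(timestamp, x, y), ...] for every TYPE_WAYPOINT row.

    Assumes tab-separated rows: timestamp, TYPE_WAYPOINT, x, y (hedged format).
    """
    waypoints = []
    with open(trace_path) as f:
        for line in f:
            parts = line.strip().split("\t")
            if len(parts) >= 4 and parts[1] == "TYPE_WAYPOINT":
                waypoints.append((int(parts[0]), float(parts[2]), float(parts[3])))
    return waypoints
```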

🧲 Magnetic Strength

📌 Magnetic Strength: Any point inside a building is subject to unique magnetic forces. Floors, walls, and objects around the room create a four-dimensional map of three-dimensional space and magnetic magnitudes. The magnetic magnitude at any point in space can be measured by reading the x, y, and z magnetic vectors at that point.
Hence, the phone detects fluctuations in the magnetic field as the user moves (and as the phone rotates). In the example below, the magnetic field is noticeably stronger at the beginning of the trajectory and weaker on the left side of the floor.
Note: 📐 microtesla (µT) - a derived unit of magnetic induction; 1 T = 1×10⁴ G, so 1 µT = 0.01 G 📐
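The magnitude at any point is simply the Euclidean norm of the three axis readings:

```python
import math

def magnetic_magnitude(mx, my, mz):
    """Magnitude (in µT) of the field from its x, y, z components."""
    return math.sqrt(mx**2 + my**2 + mz**2)

# e.g. a plausible indoor reading
magnetic_magnitude(20.0, -30.0, 40.0)  # ≈ 53.85 µT
```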

📶 WiFi

📌 WiFi Access Points: The floors of a site can have many WiFi Access Points. Hence, the signals and their strengths will vary widely in different areas.
In the example below you can see that the route is filled with such access points:
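To see which access points dominate a trace, the strongest RSSI per BSSID can be extracted from the TYPE_WIFI rows. The field order assumed below (timestamp, type, ssid, bssid, rssi, ...) is my reading of the trace format, and the function name is illustrative.

```python
from collections import defaultdict

def strongest_wifi(trace_path, top=5):
    """Strongest observed RSSI per access point (BSSID), best first.

    Assumes tab-separated TYPE_WIFI rows: ts, type, ssid, bssid, rssi, ...
    """
    best = defaultdict(lambda: -1000)  # RSSI floor well below any real reading
    with open(trace_path) as f:
        for line in f:
            parts = line.strip().split("\t")
            if len(parts) >= 5 and parts[1] == "TYPE_WIFI":
                bssid, rssi = parts[3], int(parts[4])
                best[bssid] = max(best[bssid], rssi)
    return sorted(best.items(), key=lambda kv: -kv[1])[:top]
```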

🔵🦷 iBeacon (Bluetooth)

📌 iBeacon: Beacons work in tandem with mobile apps to trigger particular messages or actions based on rules, such as triggering a push notification when a user is within a certain distance from a beacon.
There are many beacons along the example path; however, there is a large portion with no iBeacons at all (circled in orange), so the accuracy in that section of this particular example might be lower.
Note: RSSI (Received Signal Strength Indicator) is a measurement of the power present in a received radio signal; it is how the strength of each beacon reading is reported here.
Note: 📐 dBm (decibel milliwatts) - a measure of absolute power; the closer the number is to 0, the better your signal strength is 📐

💻 LightGBM - Baseline Model

There were many approaches in this competition, from machine learning techniques such as LightGBM, XGBoost, and K-Means clustering, to more advanced deep learning methodologies, such as LSTMs.
As my baseline model, I decided to follow a code tutorial by Jiwei Liu, which used competition data processed by Devin Anzelmo on the WiFi features.
You can find the entire training loop in my notebook, but below is a schema of how the training function works:
After it ran, I got the following scores: (not bad, right? 😅)
Note: MPE is the Mean Position Error, which is the Evaluation Metric of choice for this competition. The lower the error, the better.
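For completeness, the metric can be computed directly: Euclidean position error plus a fixed penalty of 15 for each floor the prediction is off by (the function name below is mine).

```python
import numpy as np

def mean_position_error(xhat, yhat, fhat, x, y, f, floor_penalty=15):
    """Competition-style MPE: position error plus a per-floor miss penalty."""
    xhat, yhat, fhat = map(np.asarray, (xhat, yhat, fhat))
    x, y, f = map(np.asarray, (x, y, f))
    pos_err = np.sqrt((xhat - x) ** 2 + (yhat - y) ** 2)
    return float(np.mean(pos_err + floor_penalty * np.abs(fhat - f)))
```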
I've also saved my submissions and training logs for this competition as Artifacts, so I can easily access and compare them afterwards.

📝 Final Tries and Ending Notes

I also tested the XGBoost approach for this competition. Although training was much faster, as I implemented the model using RAPIDS, the Mean Position Error (MPE) was much, much higher than with LightGBM.
Future tries could involve LSTMs, or maybe other Machine Learning libraries, or ... maybe a combination of them all?


💜 Thank you lots for reading and Happy Data Sciencin'! 💜