
Working with tabular data in Python

In this tutorial we'll explore how to work with tabular data in Python, using it to predict earthquakes.
Working with tabular data is a core skill for many data science and machine learning projects. Whether you're analyzing trends, making predictions, or uncovering hidden patterns, structured data is often the foundation. Python, with its powerful libraries and tools, makes it easier to manipulate and extract insights from this type of data.
In this article, we’ll explore how to use Python to work with tabular data, with a specific focus on applying these techniques in real-world scenarios like earthquake prediction. By the end, you’ll have a solid understanding of how to use structured data and machine learning to solve complex problems and build predictive models.


Understanding tabular data in Python

Tabular data is the structured format that powers many real-world datasets. Organized into rows and columns, much like a spreadsheet, it allows for easy manipulation, analysis, and visualization—making it an essential tool for data scientists and machine learning engineers.
In Python, tabular data comes in various formats such as CSV files, Excel spreadsheets, or even data stored in SQL databases. With the help of libraries like pandas and NumPy, Python makes working with tabular data efficient and straightforward. These tools provide everything you need to clean, transform, and explore structured datasets with ease.
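As a quick illustration, each of these formats can be loaded into a pandas DataFrame with a single call. The file names, database, and table below are placeholders rather than files from this project:

import pandas as pd
import sqlite3

# CSV file
df_csv = pd.read_csv("measurements.csv")

# Excel spreadsheet (the .xlsx format needs the openpyxl engine installed)
df_xlsx = pd.read_excel("measurements.xlsx")

# SQL database (here, a local SQLite file)
conn = sqlite3.connect("measurements.db")
df_sql = pd.read_sql("SELECT * FROM earthquakes", conn)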
For this project, we will be using Significant Earthquakes, 1965-2016 from Kaggle, as our tabular dataset.


Machine learning with tabular data for earthquake prediction

Now that we understand tabular data, let's explore how machine learning can help us predict earthquakes using this structured information. Our dataset includes key details like earthquake magnitude, location, depth, and timestamps. By analyzing these factors, machine learning models can help identify patterns that inform future predictions.
Neural networks, for example, can learn from historical data to forecast seismic events, while other techniques like linear regression, random forests, and gradient boosting provide different approaches to analyzing the data. Using a combination of these methods can enhance our ability to uncover valuable insights.
Throughout this project, we'll use both supervised and unsupervised learning methods to explore how machine learning can transform tabular earthquake data into meaningful predictions. With Python and its powerful libraries, we'll work through the process of building and refining these models to make data-driven forecasts.
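To make that concrete, here is a minimal sketch of what one of those classical baselines might look like. It assumes a feature matrix X and target table y prepared the way we do later in this tutorial, so treat it as a preview rather than part of the pipeline:

from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

# Hold out 20% of the rows for evaluation
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42)

# A random forest handles multi-output regression (magnitude and depth) out of the box
rf = RandomForestRegressor(n_estimators=200, random_state=42)
rf.fit(X_tr, y_tr)

print("Baseline MSE:", mean_squared_error(y_te, rf.predict(X_te)))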

Step-by-step tutorial for building our earthquake prediction model in Python

Now that we've covered the theory, it's time to get practical. In this tutorial, we'll build an AI model to predict earthquakes using Python and tabular data.
We'll be working with a feedforward neural network (FNN) as our model. Don’t worry—I'll guide you through each step, ensuring the process is both clear and engaging. Let’s get started!

Step 1: Setting up the Python environment

Before we begin, we need to set up our Python environment with the necessary tools and libraries. This includes installing and importing packages like TensorFlow, Keras, pandas, and NumPy. Don't worry, we’ll provide clear instructions to ensure your environment is properly configured for success.
!pip install basemap
!pip install scikeras
!pip install tensorflow

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import mean_squared_error, mean_absolute_error
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout
from tensorflow.keras.callbacks import EarlyStopping, ReduceLROnPlateau

Step 2: Data collection and preprocessing

Our next step is to gather the seismic data for training our model. For this tutorial, we’ll use the Significant Earthquakes, 1965-2016 dataset, but you’re welcome to explore other datasets that interest you.
# Loading Dataset
data = pd.read_csv("database.csv")
print(data.isnull().sum())  # count missing values in each column
This code loads the earthquake data into Python and checks for missing values. Once loaded, we'll clean the data by fixing errors, filling in missing values, and formatting it for our machine learning model.
# Handling Missing Values
data = data.interpolate(method='linear', limit_direction='forward')
print(data)
And this code uses linear interpolation to fill in any missing values, ensuring the data is ready for analysis.
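One small check worth adding (not part of the original pipeline): interpolation only fills numeric gaps between observed values, so anything missing before a column's first observation stays NaN. We can verify that the columns we'll actually use are complete and drop any stragglers:

# Check the columns used later in the tutorial for leftover missing values
cols = ['Latitude', 'Longitude', 'Depth', 'Magnitude']
print(data[cols].isnull().sum())

# Drop any rows that still lack these values, if any remain
data = data.dropna(subset=cols)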

Step 3: Exploratory data analysis (EDA)

Before building our model, it’s important to understand the data through Exploratory Data Analysis (EDA). EDA helps us uncover trends, outliers, and patterns in the data by using visualizations and statistical summaries. These insights will guide us in building a more effective model.
# Let's look at the dataset statistics
data.describe()
This command provides a statistical summary of the dataset, giving us a first look at key figures like mean, standard deviation, and range.
# Scatter plot for Latitude and Longitude
plt.figure(figsize=(10, 6))
plt.scatter(data['Longitude'], data['Latitude'], c=data['Magnitude'], cmap='viridis', s=50, alpha=0.7)
plt.colorbar(label='Magnitude')
plt.xlabel('Longitude')
plt.ylabel('Latitude')
plt.title('Earthquake Locations')
plt.show()

# Scatter plot for Magnitude and Depth
plt.figure(figsize=(10, 6))
plt.scatter(data['Magnitude'], data['Depth'], alpha=0.7)
plt.xlabel('Magnitude')
plt.ylabel('Depth')
plt.title('Magnitude vs Depth')
plt.show()

from mpl_toolkits.basemap import Basemap


m = Basemap(projection='mill', llcrnrlat=-80, urcrnrlat=80,
            llcrnrlon=-180, urcrnrlon=180, lat_ts=20, resolution='c')


longitudes = data["Longitude"].tolist()
latitudes = data["Latitude"].tolist()
x,y = m(longitudes,latitudes)


fig = plt.figure(figsize=(12,10))
plt.title("All affected areas")
m.plot(x, y, "o", markersize = 2, color = 'green')
m.drawcoastlines()
m.fillcontinents(color='coral',lake_color='aqua')
m.drawmapboundary()
m.drawcountries()
plt.show()
The first scatter plot maps every earthquake by longitude and latitude, colored by magnitude, while the second reveals the relationship between magnitude and depth, giving us further insight into seismic activity. To enhance the geographical view, the final block uses Python's Basemap toolkit, which makes it easy to plot all the affected areas on a world map.


Step 4: Feature engineering

We need to carefully select and prepare the features that will be inputs for our model. This involves leveraging our understanding of earthquakes and applying statistical methods. From our dataset, key features include Date, Time, Latitude, Longitude, Depth, and Magnitude, but there’s also room to explore additional variables.
data = data[['Date', 'Time', 'Latitude', 'Longitude', 'Depth', 'Magnitude']]
data.head()
However, Date and Time are currently in string format. To make these features more effective, we need to convert them into a datetime format that our model can easily work with.
import datetime
import time


timestamp = []
for d, t in zip(data['Date'], data['Time']):
    try:
        ts = datetime.datetime.strptime(d + ' ' + t, '%m/%d/%Y %H:%M:%S')
        timestamp.append(time.mktime(ts.timetuple()))
    except ValueError:
        # Flag rows with unparseable dates; they are filtered out below
        timestamp.append('ValueError')


timeStamp = pd.Series(timestamp)
data['Timestamp'] = timeStamp.values


final_data = data.drop(['Date', 'Time'], axis=1)
final_data = final_data[final_data.Timestamp != 'ValueError']
final_data.head()
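As an aside, the same conversion can be done without an explicit loop. The sketch below (it assumes the same Date and Time formats) uses pd.to_datetime with errors='coerce', so unparseable rows become NaT and are dropped. Note that it counts seconds from the UTC epoch, whereas time.mktime uses local time, a constant offset that doesn't affect the model:

# Vectorized alternative: parse Date and Time in one pass
parsed = pd.to_datetime(data['Date'] + ' ' + data['Time'],
                        format='%m/%d/%Y %H:%M:%S', errors='coerce')

alt = data.loc[parsed.notna()].copy()
alt['Timestamp'] = parsed.loc[parsed.notna()].astype('int64') // 10**9  # seconds since epoch
alt = alt.drop(['Date', 'Time'], axis=1)
alt.head()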

Step 5: Building the machine learning model

Now that our data is prepared and the features are ready, it's time to build the earthquake prediction model. We'll use a neural network to uncover patterns in the seismic data, walking through how to define the model architecture, train it, and tune its parameters.
X = final_data[['Timestamp', 'Latitude', 'Longitude']]
y = final_data[['Magnitude', 'Depth']]
We begin by splitting our data into training and testing sets, ensuring the model is evaluated on unseen data to simulate real-world conditions.
from sklearn.model_selection import train_test_split
import wandb
from wandb.integration.keras import WandbEvalCallback, WandbMetricsLogger


X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)


# Normalize the data
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
Next, we normalize the data to ensure it’s in a format the model can effectively learn from.
# Define the model
model = Sequential()
model.add(Dense(128, input_dim=X_train.shape[1], activation='relu'))
model.add(Dropout(0.2))
model.add(Dense(64, activation='relu'))
model.add(Dropout(0.2))
model.add(Dense(32, activation='relu'))
model.add(Dense(2)) # Output layer with 2 neurons for 'Magnitude' and 'Depth'


# Compile the model
model.compile(optimizer='adam', loss='mean_squared_error')


# Summary of the model
model.summary()


We define our neural network with layers tailored to uncover patterns in the seismic data. The model is compiled with an optimizer and loss function to guide the learning process.
history = model.fit(X_train, y_train, epochs=200, batch_size=32, validation_split=0.2, verbose=1)


test_loss = model.evaluate(X_test, y_test, verbose=1)
print(f'Test Loss: {test_loss}')


# Make predictions
y_pred = model.predict(X_test)


# Calculate the Mean Squared Error
mse = mean_squared_error(y_test, y_pred)
print(f'Mean Squared Error: {mse}')
Finally, we train the model through multiple epochs, refining its ability to predict seismic patterns. After training, we evaluate its performance on unseen data and calculate metrics such as the mean squared error to assess accuracy.
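Because the network predicts two quantities at once, it's also worth breaking the error down per target. Here is a small sketch using the metrics we imported earlier; it assumes y_test kept its Magnitude and Depth column order:

# Per-target error breakdown for the two outputs
mse_mag = mean_squared_error(y_test['Magnitude'], y_pred[:, 0])
mae_mag = mean_absolute_error(y_test['Magnitude'], y_pred[:, 0])
mse_depth = mean_squared_error(y_test['Depth'], y_pred[:, 1])
mae_depth = mean_absolute_error(y_test['Depth'], y_pred[:, 1])

print(f"Magnitude - MSE: {mse_mag:.4f}, MAE: {mae_mag:.4f}")
print(f"Depth     - MSE: {mse_depth:.4f}, MAE: {mae_depth:.4f}")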


Step 6: Evaluating the model using Weights & Biases

Our journey doesn't end with building the model. We need to rigorously evaluate its performance to ensure that our predictions are as precise as possible.
This is where Weights & Biases comes into play. With its powerful tracking and visualization tools, we'll log key metrics like training and validation loss, enabling us to fine-tune and improve the model iteratively.
from tensorflow.keras.callbacks import EarlyStopping, ReduceLROnPlateau
wandb.init(config={"hyper": "parameter"})

early_stopping = EarlyStopping(monitor='val_loss', patience=20, restore_best_weights=True)
reduce_lr = ReduceLROnPlateau(monitor='val_loss', factor=0.5, patience=10, min_lr=0.0001)



class WandbClfEvalCallback(WandbEvalCallback):
    def __init__(self, validation_data, data_table_columns, pred_table_columns, num_samples=100):
        super().__init__(data_table_columns, pred_table_columns)
        # Store plain arrays so row-wise slicing works for DataFrames and ndarrays alike
        self.x = np.asarray(validation_data[0])
        self.y = np.asarray(validation_data[1])
        self.num_samples = num_samples

    def add_ground_truth(self, logs=None):
        # Log a sample of the validation set as the ground-truth table
        for idx, (features, label) in enumerate(zip(self.x[:self.num_samples], self.y[:self.num_samples])):
            self.data_table.add_data(idx, features.tolist(), label[0], label[1])

    def add_model_predictions(self, epoch, logs=None):
        preds = self._inference()
        table_idxs = self.data_table_ref.get_index()

        for idx in table_idxs:
            pred = preds[idx]
            self.pred_table.add_data(
                epoch,
                self.data_table_ref.data[idx][1],  # features
                self.data_table_ref.data[idx][2],  # true magnitude
                self.data_table_ref.data[idx][3],  # true depth
                pred[0],  # predicted magnitude
                pred[1]   # predicted depth
            )

    def _inference(self):
        # Predict one sample at a time for the logged subset
        preds = []
        for features in self.x[:self.num_samples]:
            pred = self.model.predict(tf.expand_dims(features, axis=0), verbose=0)
            preds.append(pred[0])
        return preds

history = model.fit(
    X_train, y_train,
    validation_data=(X_test, y_test),
    epochs=300,
    batch_size=32,
    verbose=1,
    callbacks=[
        early_stopping,
        reduce_lr,
        WandbMetricsLogger(),
        WandbClfEvalCallback(
            validation_data=(X_test, y_test),
            data_table_columns=["idx", "features", "true_magnitude", "true_depth"],
            pred_table_columns=["epoch", "features", "true_magnitude", "true_depth", "pred_magnitude", "pred_depth"],
        ),
    ]
)



Weights & Biases also offers additional features like model versioning, experiment tracking, and collaboration tools, making it easy to compare models, share results, and even deploy our model for real-world earthquake prediction. These capabilities help us continually optimize and refine our approach.
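For instance, once training finishes we can log the final test metrics to the same run and close it out. A minimal sketch (the metric name is just illustrative):

# Log the final evaluation to the active W&B run, then finish it
final_test_loss = model.evaluate(X_test, y_test, verbose=0)
wandb.log({"final_test_loss": final_test_loss})
wandb.finish()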

Challenges and future directions

Throughout our project, we encountered several challenges while learning how to effectively use tabular data in Python for earthquake prediction. Initially, training the model for 100 epochs didn’t yield strong results, and increasing it to 300 epochs led to overfitting. To address this, we used early stopping, which allowed the model to train effectively without overfitting.
We also refined the model by increasing the number of neurons in the hidden layers to better capture patterns from our tabular data. However, this added complexity risked overfitting, so we introduced dropout layers to improve model regularization.
Looking forward, real-time seismic sensor data could enhance our predictions by enabling dynamic, continuous updates. Exploring more advanced architectures, like Transformers or attention-based models, could further help us uncover long-range dependencies and complex patterns within the tabular data.
Additionally, combining multiple techniques and data sources—such as geological and tectonic data—could lead to more comprehensive earthquake prediction models. Collaboration between experts across different fields will be crucial in advancing our ability to harness tabular data and machine learning for seismic forecasting.

Conclusion

In this project, we explored how Python and machine learning can be used to work with tabular data for earthquake prediction. Along the way, we addressed challenges like overfitting, refined our model architecture, and saw how structured data can uncover meaningful insights.
The lessons you've learned here—about cleaning, preparing, and modeling tabular data—can now be applied to a wide range of other use cases. Whether it’s financial forecasting, healthcare analytics, or any other field that relies on structured data, the techniques and tools you've mastered in this tutorial will help you tackle complex datasets and make impactful predictions. The possibilities with tabular data are vast, and with continued practice, you can leverage these skills to solve a variety of real-world problems.


