Data Preprocessing
Overview of Steps involved in data cleaning and logging datasets using Artifacts
Created on March 5 | Last edited on March 5
Raw Dataset
This is a preview of the initial raw dataset.
Data Engineering
Let's check for any inconsistencies in the dataset.
Null Value analysis
The first thing we will look for is null values. The following command checks each column for them:
df.isnull().sum()
The output shows that the 'Number_Weeks_Used' column has 9,000 missing values.
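As a minimal sketch of this check (using a toy stand-in DataFrame rather than the actual dataset), the null count per column looks like:

```python
import numpy as np
import pandas as pd

# Toy stand-in for the raw dataset; 'Number_Weeks_Used' has missing entries
df = pd.DataFrame({
    "Insect_Count": [10, 25, 7, 40],
    "Number_Weeks_Used": [20.0, np.nan, 35.0, np.nan],
})

# isnull() marks missing cells; sum() counts them per column
null_counts = df.isnull().sum()
print(null_counts)  # Number_Weeks_Used shows 2 nulls in this toy example
```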
Since the rows containing these null values also carry other data that is important for the analysis, we replace the nulls rather than drop the rows. The null values in the "Number_Weeks_Used" column are replaced with the mode of that column.
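The mode imputation described above can be sketched as follows (again on a toy column, not the actual dataset):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"Number_Weeks_Used": [20.0, np.nan, 35.0, 20.0, np.nan]})

# mode() returns the most frequent value(s); take the first
mode_value = df["Number_Weeks_Used"].mode()[0]

# fillna() replaces every null with that mode
df["Number_Weeks_Used"] = df["Number_Weeks_Used"].fillna(mode_value)

assert df["Number_Weeks_Used"].isnull().sum() == 0
```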
Outlier Analysis
The descriptive statistics of the dataset show outliers in several columns: "Insect_Count", "doses_week", and "number_weeks_quit".
Final Clean Dataset
Once all of the steps described above are performed, the dataset is ready for model building. Here is a preview of the cleaned data.
Tracking Steps with Artifacts
All of the operations performed above can be tracked with Artifacts, so we can always go back and find the exact steps involved in the data preparation process.
The left side panel under the Artifacts tab shows the various records logged to W&B.

Artifacts also provide a DAG view that shows the lineage of every step in the process, as highlighted below.
