
Data Preprocessing

Overview of the steps involved in cleaning a dataset and logging it using Artifacts
Created on March 5 | Last edited on March 5

Raw Dataset

This is a preview of the initial raw dataset.

| ID | Estimated_Insects_Count | Crop_Type | Soil_Type | Pesticide_Use_Category | Number_Doses_Week | Number_Weeks_Used | Number_Weeks_Quit | Season | Crop_Damage |
|---|---|---|---|---|---|---|---|---|---|
| F00000001 | 188 | 1 | 0 | 1 | 0 | 0 | 0 | 1 | 0 |
| F00000003 | 209 | 1 | 0 | 1 | 0 | 0 | 0 | 2 | 1 |
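As an illustration, the two preview rows above can be reproduced in pandas (the full dataset would normally be loaded with `pd.read_csv`; the file name is not shown here, so this sketch constructs the frame directly):

```python
import pandas as pd

# Two sample rows taken from the raw-dataset preview above
raw = pd.DataFrame(
    {
        "ID": ["F00000001", "F00000003"],
        "Estimated_Insects_Count": [188, 209],
        "Crop_Type": [1, 1],
        "Soil_Type": [0, 0],
        "Pesticide_Use_Category": [1, 1],
        "Number_Doses_Week": [0, 0],
        "Number_Weeks_Used": [0, 0],
        "Number_Weeks_Quit": [0, 0],
        "Season": [1, 2],
        "Crop_Damage": [0, 1],
    }
)
print(raw.head())
```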


Data Engineering

Let's check for any inconsistencies in the dataset.

Null Value Analysis

The first thing we will look for is null values. The following command counts the null values in each column:

```python
df.isnull().sum()
```

From the output it is evident that the 'Number_Weeks_Used' column has 9000 missing values.


Since the rows containing null values also hold other data that is important for the analysis, we will not drop them; instead, we replace the null values with the mode of the column.
💡 Null values in the "Number_Weeks_Used" column are replaced by the mode of the data.
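A minimal sketch of the mode imputation, assuming the data sits in a pandas DataFrame named `df` (the values below are toy data, not the real dataset):

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the real dataset
df = pd.DataFrame({"Number_Weeks_Used": [20.0, 25.0, np.nan, 25.0, np.nan]})

# mode() can return several values; take the first (most common) one
mode_value = df["Number_Weeks_Used"].mode()[0]
df["Number_Weeks_Used"] = df["Number_Weeks_Used"].fillna(mode_value)

print(df["Number_Weeks_Used"].isnull().sum())  # 0 nulls remain
```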


Outlier Analysis

The descriptive statistics of the dataset show that outliers are present in some columns.
💡 "Insect_Count", "doses_week" and "number_weeks_quit" have outliers.
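One common way to flag such outliers is the interquartile-range (IQR) rule; the report does not show the exact method used, so the sketch below uses toy values rather than the real columns:

```python
import pandas as pd

# Toy series standing in for an outlier-prone column such as the insect count
s = pd.Series([10, 12, 11, 13, 12, 11, 95])

q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Values outside [lower, upper] are treated as outliers
outliers = s[(s < lower) | (s > upper)]
print(outliers.tolist())  # [95]
```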



Final Clean Dataset

Once all the steps described above are performed, the dataset is ready for model building. Here is a preview of the data.
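The cleaning steps above can be sketched as a single function, under the assumptions already noted: mode imputation for the nulls and IQR-based clipping for the outliers (column names and toy values here are illustrative):

```python
import numpy as np
import pandas as pd

def clean(df: pd.DataFrame) -> pd.DataFrame:
    """Impute nulls with the column mode, then clip outliers to IQR bounds."""
    out = df.copy()
    # Mode imputation for the column with missing values
    out["Number_Weeks_Used"] = out["Number_Weeks_Used"].fillna(
        out["Number_Weeks_Used"].mode()[0]
    )
    # Clip every numeric column to the 1.5 * IQR range
    for col in out.select_dtypes(include="number").columns:
        q1, q3 = out[col].quantile(0.25), out[col].quantile(0.75)
        iqr = q3 - q1
        out[col] = out[col].clip(q1 - 1.5 * iqr, q3 + 1.5 * iqr)
    return out

toy = pd.DataFrame({"Number_Weeks_Used": [20.0, 22.0, np.nan, 24.0, 26.0, 400.0]})
cleaned = clean(toy)
print(cleaned)
```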





Tracking Steps with Artifacts

All of the operations performed above can be tracked using Artifacts, so we can always go back and find the exact steps involved in the entire data preparation process.
The left-hand panel under the Artifacts tab shows the various records logged to W&B.
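A minimal sketch of logging the cleaned dataset as an artifact (the project name and file path are illustrative, and running this requires a W&B account, so no output is shown):

```python
import wandb

# Start a run that represents the preprocessing step
run = wandb.init(project="crop-damage", job_type="data-preprocessing")

# Log the cleaned dataset as a versioned "dataset" artifact
artifact = wandb.Artifact("clean-dataset", type="dataset")
artifact.add_file("clean_data.csv")  # illustrative path
run.log_artifact(artifact)

run.finish()
```

Downstream runs can then call `run.use_artifact("clean-dataset:latest")` to pull the exact version they trained on, which is what makes the steps reproducible.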



Artifacts also provide a DAG view that shows the lineage of every step in the process, as highlighted below.

