Data-Centric AI Competition Approach
This report is for the Data-Centric AI Competition hackathon by DeepLearning.AI and describes the steps taken to improve accuracy on the Roman MNIST dataset by improving the data instead of improving the model.
Created on September 5 | Last edited on September 6
See the Data-Centric AI hackathon page here: https://https-deeplearning-ai.github.io/data-centric-comp/?utm_source=linkedin&utm_medium=social&utm_campaign=dc-ai-competition&utm_content=dl-ai
Initial observations from the data
My initial observations on the data were as follows:
1. Identify incorrect labeling (e.g., an "I" labeled as a "III")
2. Remove noisy pictures (some examples shown below)

A few noisy examples are shown above
3. There are different types of images for a single label, e.g., "i" vs. "I"
Initial strategy
1. Consistent labeling: there were a few images that could be considered ambiguous, as shown below

Is this example a 2 or a 6?
2. Delete noisy data: remove the noisy images described above
3. Define a correct split: since we have different types of images for the same class (as shown below), I used a labeling tool to assign metadata to these images, such as Type 1, Type 2, Type 3, and Delete. We can use this metadata for a stratified split of the data between training and validation.

The number 1 can be written in 3 different styles; I used a labeling tool to define those types
4. Log different versions of the dataset: I used W&B to log different versions of the dataset.
5. Error analysis: I used W&B to log images, ground truth, and predictions to identify wrong labels and analyze them for further improvement.
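The stratified split from step 3 can be sketched with scikit-learn, stratifying on the type metadata assigned in the labeling tool (the file names and type labels below are placeholders, not the actual dataset):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical image paths and the Type 1/2/3 metadata from the labeling tool
paths = np.array([f"img_{i}.png" for i in range(12)])
types = np.array([1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3])

# Stratify on the type metadata so every writing style appears in both splits
train, val = train_test_split(paths, test_size=0.25, stratify=types, random_state=0)
```

With `stratify=types`, each of the three styles contributes proportionally to both the training and validation sets, which is what makes the validation score representative.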
First submission:
I made two submissions to check whether splitting the different types of data across training and validation has any effect. Below are the results:

The stratified split has a higher score both locally and on the leaderboard
Augmentation Idea:
I observed that the original images are much larger than the 32 × 32 input size, and compressing them down may degrade image quality. The idea is to crop each image to remove the extra space around the digit.
I used the OpenCV snippet below to crop away the extra white space (it assumes the image has already been loaded into `img`):

```python
import cv2
import numpy as np

gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
gray = 255 * (gray < 128).astype(np.uint8)  # Invert so the dark text becomes white
coords = cv2.findNonZero(gray)              # Find all non-zero points (text)
x, y, w, h = cv2.boundingRect(coords)       # Minimum bounding box around the text
rect = img[y:y + h, x:x + w]                # Crop the original image to that box
cv2.imshow("Cropped", rect)                 # Show it
cv2.waitKey(0)
```
I merged all the cropped images with the existing training and validation sets and found the score below after submission.

Use of Augmentor library
I then used the Augmentor library with augmentations such as random rotation, random distortion, and random erasing; I also tried normalizing the images combined with augmentation. I received the score below:

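As a rough illustration of one of these augmentations, random erasing can be sketched in plain NumPy (a minimal version for intuition only; Augmentor's actual implementation differs in details such as how the rectangle is sampled):

```python
import numpy as np

def random_erase(img, area_frac=0.1, rng=None):
    """Blank out a random rectangle covering roughly area_frac of the image."""
    rng = rng or np.random.default_rng()
    h, w = img.shape[:2]
    # Side lengths so the rectangle area is about area_frac of the image
    eh = max(1, int(h * area_frac ** 0.5))
    ew = max(1, int(w * area_frac ** 0.5))
    y = rng.integers(0, h - eh + 1)  # Random top-left corner
    x = rng.integers(0, w - ew + 1)
    out = img.copy()
    out[y:y + eh, x:x + ew] = 0      # Fill the rectangle with black
    return out

# Erase ~10% of a blank white 32x32 image
aug = random_erase(np.full((32, 32), 255, dtype=np.uint8), area_frac=0.1)
```

Erasing a random patch forces the model to classify from partial strokes, which is useful when digits are occluded or noisy.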
I had an idea to invert the images and found this useful explanation: https://stats.stackexchange.com/questions/220164/impact-of-inverting-grayscale-values-on-mnist-dataset
Inverting images
I then inverted all the images (black pixels become white and vice versa) and added them to the mixed dataset.

The last augmentation worked well and achieved 82% accuracy on the test set.
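On 8-bit grayscale arrays, the inversion itself is a one-liner (a minimal NumPy sketch with a toy input):

```python
import numpy as np

img = np.array([[0, 128, 255]], dtype=np.uint8)  # Toy grayscale row
inverted = 255 - img  # Black (0) becomes white (255) and vice versa
# inverted now holds the values 255, 127, 0
```

Subtracting from 255 flips the intensity scale while keeping the dtype and shape unchanged, so the inverted copies can be concatenated directly with the original training set.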
Further experiments
- Based on error analysis, I found that the scores for images with labels 2, 3, 7, and 8 were very low, so I augmented the data for those classes.
- I experimented with the augmented data (score of 0.7987) and alternately inverted the images.
- I experimented with adding more data for classes 2, 3, 7, and 8 to the best-scoring dataset.
- I experimented with inverting one out of every three images.
I am still awaiting the scores for the above experiments.
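The class-targeted augmentation above can be sketched as a simple oversampling step, duplicating the indices of the weak classes so they receive extra augmented copies (the labels and counts below are illustrative, not the actual dataset):

```python
import numpy as np

# Toy label array; "ii", "iii", "vii", "viii" stand in for classes 2, 3, 7, 8
labels = np.array(["i", "ii", "ii", "iii", "vii", "viii"])
weak = {"ii", "iii", "vii", "viii"}  # Classes with low per-class scores

# Duplicate the indices of the weak classes so they get extra augmented copies
extra = [i for i, lab in enumerate(labels) if lab in weak]
augmented_order = np.concatenate([np.arange(len(labels)), extra])
# 6 original indices plus 5 duplicates for the weak classes
```

Feeding the duplicated indices through an augmentation pipeline yields more varied examples precisely for the classes where error analysis showed the lowest scores.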

Different experiments tracked in W&B