
I. Shopee Competition: EDA and Preprocessing

This notebook contains the EDA and preprocessing parts for the competition Shopee - Price Match Guarantee.
Created on March 18 | Last edited on June 2


Introduction

The Shopee - Price Match Guarantee Kaggle competition had the purpose of identifying whether two products are the same by looking at their images and descriptions.
Two different images of similar wares may represent the same product or two completely different items. Retailers want to avoid misrepresentations and other issues that could come from conflating two dissimilar products. - Shopee
🟢 Goal: To build a model that can identify which images contain the same product(s).
🔴 Challenges:
  • Finding near-duplicates of the product (and NOT the image)
  • Erasing the impact of the background (the area surrounding the product)
  • Using the description of the image (or the title)

Data

The data was structured as follows:
  • train:
    • train_images: the product photos (~32,400 files).
    • train.csv: the corresponding metadata; each product is assigned a label_group that marks the images with identical products.
  • test:
    • test_images: the product photos to be predicted (~70,000 files).
    • test.csv: the corresponding metadata.



Image exploration

I. Duplicated Images

There were a few image_ids that appeared more than once.


Hence, I first wanted to check what these "duplicated" images look like and to investigate what exactly differentiates them. I then noticed that the images had either different title values or different values for their label_group.
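To surface these rows, a quick pandas check along these lines works (a sketch, assuming train.csv is loaded as train with the competition's image, title, and label_group columns):

import pandas as pd

# Sketch: load the metadata (column names as in the competition data)
train = pd.read_csv("train.csv")

# Image ids that appear more than once
dup_mask = train["image"].duplicated(keep=False)
duplicates = train[dup_mask]

# For each duplicated image, count distinct titles and label groups
diffs = duplicates.groupby("image").agg(
    n_titles=("title", "nunique"),
    n_label_groups=("label_group", "nunique"),
)
print(diffs[(diffs["n_titles"] > 1) | (diffs["n_label_groups"] > 1)].head())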
You can see a few examples below:








II. "Label Group" variable

The majority of the groups had 5 or fewer images containing the same product.
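This can be checked with a quick value_counts over the groups (a sketch, reusing the hypothetical train DataFrame from the earlier snippet):

# Sketch: distribution of label_group sizes
group_sizes = train["label_group"].value_counts()
print(group_sizes.describe())

# Fraction of groups with 5 or fewer images
print((group_sizes <= 5).mean())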



We can also look within a few groups to see how the products labeled as being the same appear in their images.







III. Image Augmentation

Another aspect I wanted to explore was the different kinds of augmentation that might be performed on the images so that the model can better pick up unique patterns.
From my research, the best-performing augmentations for this type of problem were flips (vertical flip, horizontal flip, etc.), crops (center crop, random crop, etc.), and rotations, as they "display" the product in different positions without changing its color or texture attributes.
Below you can see an example of an image and 11 different applied augmentations.

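For illustration, here is a minimal sketch of these augmentations using torchvision; the library choice and the file path are my assumptions, not necessarily what the original notebook used:

from PIL import Image
from torchvision import transforms

# Hypothetical example image, resized so the crops below always fit
img = Image.open("train_images/example.jpg").resize((256, 256))

augmentations = {
    "horizontal_flip": transforms.RandomHorizontalFlip(p=1.0),
    "vertical_flip": transforms.RandomVerticalFlip(p=1.0),
    "center_crop": transforms.CenterCrop(224),
    "random_crop": transforms.RandomCrop(224),
    "rotation": transforms.RandomRotation(degrees=45),
}

# Apply each augmentation to the same image
augmented = {name: aug(img) for name, aug in augmentations.items()}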


Title exploration

I. Text preprocessing

The images were also accompanied by a title, which served as a description of the image, usually written by the user who posted the picture on the platform.
As the text wasn't clean, a preprocessing pipeline had to be put in place before taking the exploration any further.
import string

from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from textblob import TextBlob

# NLTK data needed beforehand: nltk.download('punkt'), nltk.download('stopwords'), nltk.download('wordnet')

def preprocess_title(title):
    '''Preprocesses a raw title string.
    title: the string to be prepped.'''
    # Lower case
    title = title.lower()
    # Remove punctuation
    title = title.translate(str.maketrans('', '', string.punctuation))
    # Remove surrounding whitespace
    title = title.strip()
    # Tokenize
    tokens_title = word_tokenize(title)
    # Remove stopwords
    tokens_title = [word for word in tokens_title if word not in stopwords.words()]
    # Lemmatization
    lemmatizer = WordNetLemmatizer()
    lemm_text = [lemmatizer.lemmatize(word) for word in tokens_title]
    prepped_title = ' '.join(lemm_text)

    return prepped_title

def get_POS(prepped_title):
    '''Gets the part-of-speech tags.
    prepped_title: the already prepped text.'''
    # Part-of-speech tagging
    pos_text = TextBlob(prepped_title)
    pos_text = ' '.join([j for (i, j) in pos_text.tags])

    return pos_text
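Applying the pipeline to the metadata could then look like this (a sketch, reusing the hypothetical train DataFrame from earlier):

# Sketch: add prepped-title and POS columns to the metadata
train["title_prep"] = train["title"].apply(preprocess_title)
train["title_pos"] = train["title_prep"].apply(get_POS)
print(train[["title", "title_prep", "title_pos"]].head())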

II. Text Feature Extraction

Another method was to extract features from the title column, in an attempt to feed more useful information into the final model. You can check out this article to find out more about the textfeatures library.
The extractions were:
  • word_count : counts how many words are in a sentence.
  • char_count : counts how many characters are in a sentence.
  • avg_word_length : computes the average word length in a sentence.
  • stopwords_count : counts how many stopwords are in a sentence.
  • numerics_count : counts how many numbers are in a sentence.
import textfeatures

def extract_title_features(df_prep):
    '''Extracts features from the unprocessed title column.'''
    # Extract features
    df_prep = textfeatures.word_count(df_prep, "title", "word_count")
    df_prep = textfeatures.char_count(df_prep, "title", "char_count")
    df_prep = textfeatures.avg_word_length(df_prep, "title", "avg_word_length")
    df_prep = textfeatures.stopwords_count(df_prep, "title", "stopwords_count")
    df_prep = textfeatures.numerics_count(df_prep, "title", "numerics_count")

    return df_prep
Within our title variable, the texts are usually ~10 words long, with ~50 characters, and contain 1 to 2 numerics.
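These figures can be checked with a describe() over the extracted columns (a sketch, again assuming the train DataFrame):

# Sketch: summary statistics of the extracted title features
df_feats = extract_title_features(train.copy())
cols = ["word_count", "char_count", "avg_word_length",
        "stopwords_count", "numerics_count"]
print(df_feats[cols].describe())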



III. Title and POS Frequencies

Now that the data is cleaned, we can start exploring the most frequent words and POS (part-of-speech) tags within the title; a small counting sketch follows the list below.
Part-of-speech abbreviation meanings (you can find the full list here):
  • WRB: wh- adverb (how).
  • WP: wh- pronoun (who).
  • VBZ: verb, present tense with 3rd person singular (bases).
  • VBP: verb, present tense not 3rd person singular (wrap).
  • RP: particle (about).
  • etc.
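A minimal way to compute these frequencies (a sketch, assuming the title_prep and title_pos columns created earlier):

from collections import Counter

# Sketch: most frequent words and POS tags across all prepped titles
word_counts = Counter(" ".join(train["title_prep"]).split())
pos_counts = Counter(" ".join(train["title_pos"]).split())

print(word_counts.most_common(20))
print(pos_counts.most_common(20))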


Another trick we can do is to create a word cloud from the bag of words:
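A sketch with the wordcloud package, joining the prepped titles into one bag of words:

import matplotlib.pyplot as plt
from wordcloud import WordCloud

# Sketch: word cloud over all prepped titles
text = " ".join(train["title_prep"])
cloud = WordCloud(width=800, height=400, background_color="white").generate(text)

plt.imshow(cloud, interpolation="bilinear")
plt.axis("off")
plt.show()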



Ending Notes

This blog contains the first part of my solution to this competition. Performing data cleaning and EDA is of tremendous value within a project, as it helps us better understand the data, how we should work with it, and how to use it moving forward.
You can check out the second part of my solution, containing models and prediction here.

You can find the full notebook (with datasets and code) here.

💜 Thank you lots for reading and Happy Data Sciencin'! 💜