I. Shopee Competition: EDA and Preprocessing
This notebook contains the EDA and preprocessing parts for the competition Shopee - Price Match Guarantee.
Created on March 18 | Last edited on June 2

Introduction
The Shopee - Price Match Guarantee Kaggle competition had the purpose of identifying if two products are the same by looking at their images and descriptions.
Two different images of similar wares may represent the same product or two completely different items. Retailers want to avoid misrepresentations and other issues that could come from conflating two dissimilar products. - Shopee
🟢 Goal: To build a model that can identify which images contain the same product(s).
🔴 Challenges:
- Finding near-duplicates of the product (and NOT the image)
- Erasing the impact of the background (the area surrounding the product)
- Using the description of the image (or the title)
Data
The data was structured as follows:
- train:
- train_images: the product photos (~ 32,400 files).
- train.csv: the corresponding metadata; each product is assigned a label_group that marks the images with identical products.
- test:
- test_images: the product photos to be predicted (~ 70,000 files)
- test.csv: the corresponding metadata
Image exploration
I. Duplicated Images
There were a few image_ids that appeared more than once.
Hence, I first wanted to check how these "duplicated" images look and to investigate what exactly differentiates them. I then noticed that the images had either different title values or different values for their label_group.
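A duplicate check like this can be sketched in pandas; the miniature DataFrame and column names below are hypothetical stand-ins for the real train.csv:

```python
import pandas as pd

# Hypothetical miniature version of train.csv: one image_id repeats with
# differing title / label_group values, as observed in the real data.
train = pd.DataFrame({
    "image_id": ["img_01", "img_01", "img_02", "img_03"],
    "title": ["red mug 350ml", "red ceramic mug", "usb cable 1m", "phone case"],
    "label_group": [111, 222, 333, 444],
})

# Keep every row whose image_id appears more than once.
dupes = train[train.duplicated("image_id", keep=False)]

# For each duplicated image_id, count how many distinct titles
# and label_groups it spans.
diff = dupes.groupby("image_id").agg(
    n_titles=("title", "nunique"),
    n_groups=("label_group", "nunique"),
)
print(diff)
```

Here `img_01` shows up with two different titles and two different label groups, which is exactly the pattern observed among the duplicated images.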
You can see a few examples below:



II. "Label Group" variable
The majority of the groups had 5 or fewer images containing the same product.
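The group-size distribution can be computed with a simple value count; the toy label assignments below are invented for illustration:

```python
import pandas as pd

# Hypothetical sample of label_group assignments.
train = pd.DataFrame({
    "label_group": [1, 1, 1, 2, 2, 3, 3, 3, 3, 3, 3],
})

# Number of images per label_group.
group_sizes = train["label_group"].value_counts()

# How many groups have 5 or fewer images?
small_groups = (group_sizes <= 5).sum()
print(group_sizes.to_dict())
print(small_groups)
```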
We can also look within a few groups to see how the products labeled as the same appear in the images.



III. Image Augmentation
Another aspect I wanted to explore was the different kinds of augmentation that might be performed on the images so that the model can better pick up unique patterns.
From my research, the best performing augmentations for this type of problem were flips (vertical flip, horizontal flip, etc.), crops (center crop, random crop, etc.), and rotations, as they "display" the product in different positions without changing its color or texture attributes.
Below you can see an example of an image and 11 different applied augmentations.
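In practice these augmentations would be applied with a library such as albumentations or torchvision; as a minimal, dependency-light sketch, the flip, rotation, and crop operations can be expressed directly in NumPy on a toy "image":

```python
import numpy as np

# Toy 4x4 "image" standing in for a product photo.
img = np.arange(16).reshape(4, 4)

# Flips: mirror the product without changing its color or texture.
h_flip = np.fliplr(img)   # horizontal flip
v_flip = np.flipud(img)   # vertical flip

# Rotation by 90 degrees (counterclockwise).
rot90 = np.rot90(img)

# Center crop to 2x2 (a real pipeline would resize back afterwards).
center_crop = img[1:3, 1:3]

# Random crop: pick a random top-left corner for a 2x2 window.
rng = np.random.default_rng(0)
y, x = rng.integers(0, 3, size=2)
random_crop = img[y:y + 2, x:x + 2]
```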
Title exploration
I. Text preprocessing
The images were also accompanied by a title, which served as a description of the image, usually made by the user who posted the picture on the platform.
As the text wasn't clean, a preprocessing pipeline had to be put in place before going further with the exploration.
import string

from nltk import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from textblob import TextBlob


def preprocess_title(title):
    '''Preprocesses the raw title text.
    title: the string that needs prepping.'''
    # Lower case
    title = title.lower()
    # Remove punctuation
    title = title.translate(str.maketrans('', '', string.punctuation))
    # Remove whitespace
    title = title.strip()
    # Tokenize
    tokens_title = word_tokenize(title)
    # Remove stopwords
    tokens_title = [word for word in tokens_title if word not in stopwords.words()]
    # Lemmatization
    lemmatizer = WordNetLemmatizer()
    lemm_text = [lemmatizer.lemmatize(word) for word in tokens_title]
    prepped_title = ' '.join(lemm_text)
    return prepped_title


def get_POS(prepped_title):
    '''Gets the part of speech for each token.
    prepped_title: the already prepped text.'''
    # Part-of-speech tagging
    pos_text = TextBlob(prepped_title)
    pos_text = ' '.join([j for (i, j) in pos_text.tags])
    return pos_text
II. Text Feature Extraction
Another method was to extract features from the title column, in an attempt to feed more useful information into the final model. You can check out this article to find out more about the textfeatures library.
The extractions were:
- word_count : counts how many words are in a sentence.
- char_count : counts how many characters are in a sentence.
- avg_word_length : computes the average word length in a sentence.
- stopwords_count : counts how many stopwords are in a sentence.
- numerics_count : counts how many numbers are in a sentence.
import textfeatures


def extract_title_features(df_prep):
    '''Extracts features from the unprocessed title column.'''
    # Extract features
    df_prep = textfeatures.word_count(df_prep, "title", "word_count")
    df_prep = textfeatures.char_count(df_prep, "title", "char_count")
    df_prep = textfeatures.avg_word_length(df_prep, "title", "avg_word_length")
    df_prep = textfeatures.stopwords_count(df_prep, "title", "stopwords_count")
    df_prep = textfeatures.numerics_count(df_prep, "title", "numerics_count")
    return df_prep
Within our title variable, the texts are usually ~10 words long, with ~50 characters, and contain 1 to 2 numerics.

III. Title and POS Frequencies
Now that the data is cleaned, we can start exploring the most frequent words and POS (part of speech) occurrences within the title.
- WRB: wh- adverb (how).
- WP: wh- pronoun (who).
- VBZ: verb, present tense with 3rd person singular (bases).
- VBP: verb, present tense not 3rd person singular (wrap).
- RP: particle (about).
- etc.
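The word frequencies behind such a plot can be sketched with a simple Counter over the preprocessed titles; the example titles below are invented:

```python
from collections import Counter

# Hypothetical already-preprocessed titles.
prepped_titles = [
    "red ceramic mug 350ml",
    "red mug gift box",
    "usb cable fast charging",
]

# Bag of words across all titles, then the most frequent tokens.
bag_of_words = " ".join(prepped_titles).split()
word_freq = Counter(bag_of_words)
print(word_freq.most_common(3))
```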
Another trick we can do is to create a word cloud from the bag of words:

Ending Notes
This blog contains the first part of my solution to this competition. Performing data cleaning and EDA is of tremendous value within a project, as it helps us better understand the data, how we should work with it, and how to use it moving forward.
You can check out the second part of my solution, containing models and prediction here.
You can find the full notebook (with datasets and code) here.
💜 Thank you lots for reading and Happy Data Sciencin'! 💜