
I. Shopee Competition: EDA and Preprocessing

This notebook contains the EDA and preprocessing parts for the competition Shopee - Price Match Guarantee.
Created on March 18 | Last edited on June 2


Introduction

The Shopee - Price Match Guarantee Kaggle competition had the purpose of identifying whether two products are the same by looking at their images and descriptions.
Two different images of similar wares may represent the same product or two completely different items. Retailers want to avoid misrepresentations and other issues that could come from conflating two dissimilar products. - Shopee
🟢 Goal: To build a model that can identify which images contain the same product(s).
🔴 Challenges:
  • Finding near-duplicates of the product (and NOT the image)
  • Erasing the impact of the background (the area surrounding the product)
  • Using the description of the image (or the title)

Data

The data was structured as follows:
  • train:
    • train_images: the product photos (~32,400 files).
    • train.csv: the corresponding metadata; each product is assigned a label_group that marks the images with identical products.
  • test:
    • test_images: the product photos to be predicted (~70,000 files).
    • test.csv: the corresponding metadata.



Image exploration

I. Duplicated Images

There were a few image_ids that appeared more than once.


Hence, I first wanted to check what these "duplicated" images look like and to investigate what exactly differentiates them. I then noticed that the images had either different title values or different values for their label_group.
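To surface these rows, a quick pandas check along these lines works (a sketch, assuming train.csv is loaded as train with the competition's image, title, and label_group columns):

import pandas as pd

# Sketch: load the metadata (column names as in the competition data)
train = pd.read_csv("train.csv")

# Image ids that appear more than once
dup_mask = train["image"].duplicated(keep=False)
duplicates = train[dup_mask]

# For each duplicated image, count distinct titles and label groups
diffs = duplicates.groupby("image").agg(
    n_titles=("title", "nunique"),
    n_label_groups=("label_group", "nunique"),
)
print(diffs[(diffs["n_titles"] > 1) | (diffs["n_label_groups"] > 1)].head())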
You can see a few examples below:








II. "Label Group" variable

The majority of the groups had 5 or fewer images containing the same product.
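This can be checked with a quick value_counts over the groups (a sketch, reusing the hypothetical train DataFrame from the earlier snippet):

# Sketch: distribution of label_group sizes
group_sizes = train["label_group"].value_counts()
print(group_sizes.describe())

# Fraction of groups with 5 or fewer images
print((group_sizes <= 5).mean())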



We can also look within a few groups to see how the products labeled as being the same appear in their images.







III. Image Augmentation

Another aspect I wanted to explore was the different kinds of augmentation that might be performed on the images so that the model can better pick up unique patterns.
From my research, the best-performing augmentations for this type of problem were flips (vertical flip, horizontal flip, etc.), crops (center crop, random crop, etc.), and rotations, as they "display" the product in different positions without changing its color or texture attributes.
Below you can see an example of an image and 11 different applied augmentations.

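For illustration, here is a minimal sketch of these augmentations using torchvision; the library choice and the file path are my assumptions, not necessarily what the original notebook used:

from PIL import Image
from torchvision import transforms

# Hypothetical example image, resized so the crops below always fit
img = Image.open("train_images/example.jpg").resize((256, 256))

augmentations = {
    "horizontal_flip": transforms.RandomHorizontalFlip(p=1.0),
    "vertical_flip": transforms.RandomVerticalFlip(p=1.0),
    "center_crop": transforms.CenterCrop(224),
    "random_crop": transforms.RandomCrop(224),
    "rotation": transforms.RandomRotation(degrees=45),
}

# Apply each augmentation to the same image
augmented = {name: aug(img) for name, aug in augmentations.items()}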


Title exploration

I. Text preprocessing

The images were also accompanied by a title, which served as a description of the image, usually written by the user who posted the picture on the platform.
As the text wasn't clean, a preprocessing pipeline had to be put in place before taking the exploration any further.
import string

from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from textblob import TextBlob

# NLTK data needed beforehand: nltk.download('punkt'), nltk.download('stopwords'), nltk.download('wordnet')

def preprocess_title(title):
    '''Preprocesses a raw title string.
    title: the string to be prepped.'''
    # Lower case
    title = title.lower()
    # Remove punctuation
    title = title.translate(str.maketrans('', '', string.punctuation))
    # Remove surrounding whitespace
    title = title.strip()
    # Tokenize
    tokens_title = word_tokenize(title)
    # Remove stopwords
    tokens_title = [word for word in tokens_title if word not in stopwords.words()]
    # Lemmatization
    lemmatizer = WordNetLemmatizer()
    lemm_text = [lemmatizer.lemmatize(word) for word in tokens_title]
    prepped_title = ' '.join(lemm_text)

    return prepped_title

def get_POS(prepped_title):
    '''Gets the part-of-speech tags.
    prepped_title: the already prepped text.'''
    # Part-of-speech tagging
    pos_text = TextBlob(prepped_title)
    pos_text = ' '.join([j for (i, j) in pos_text.tags])

    return pos_text
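Applying the pipeline to the metadata could then look like this (a sketch, reusing the hypothetical train DataFrame from earlier):

# Sketch: add prepped-title and POS columns to the metadata
train["title_prep"] = train["title"].apply(preprocess_title)
train["title_pos"] = train["title_prep"].apply(get_POS)
print(train[["title", "title_prep", "title_pos"]].head())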

II. Text Feature Extraction

Another method was to extract features from the title column, in an attempt to feed more useful information into the final model. You can check out this article to find out more about the textfeatures library.
The extractions were:
  • word_count : counts how many words are in a sentence.
  • char_count : counts how many characters are in a sentence.
  • avg_word_length : computes the average word length in a sentence.
  • stopwords_count : counts how many stopwords are in a sentence.
  • numerics_count : counts how many numbers are in a sentence.
import textfeatures

def extract_title_features(df_prep):
    '''Extracts features from the unprocessed title column.'''
    # Extract features
    df_prep = textfeatures.word_count(df_prep, "title", "word_count")
    df_prep = textfeatures.char_count(df_prep, "title", "char_count")
    df_prep = textfeatures.avg_word_length(df_prep, "title", "avg_word_length")
    df_prep = textfeatures.stopwords_count(df_prep, "title", "stopwords_count")
    df_prep = textfeatures.numerics_count(df_prep, "title", "numerics_count")

    return df_prep
Within our title variable, the texts are usually ~10 words long, with ~50 characters, and contain 1 to 2 numerics.
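These figures can be checked with a describe() over the extracted columns (a sketch, again assuming the train DataFrame):

# Sketch: summary statistics of the extracted title features
df_feats = extract_title_features(train.copy())
cols = ["word_count", "char_count", "avg_word_length",
        "stopwords_count", "numerics_count"]
print(df_feats[cols].describe())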



III. Title and POS Frequencies

Now that the data is cleaned, we can start exploring the most frequent words and POS (part-of-speech) tags within the title; a small counting sketch follows the list below.
Part-of-speech abbreviation meanings (you can find the full list here):
  • WRB: wh- adverb (how).
  • WP: wh- pronoun (who).
  • VBZ: verb, present tense with 3rd person singular (bases).
  • VBP: verb, present tense not 3rd person singular (wrap).
  • RP: particle (about).
  • etc.
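A minimal way to compute these frequencies (a sketch, assuming the title_prep and title_pos columns created earlier):

from collections import Counter

# Sketch: most frequent words and POS tags across all prepped titles
word_counts = Counter(" ".join(train["title_prep"]).split())
pos_counts = Counter(" ".join(train["title_pos"]).split())

print(word_counts.most_common(20))
print(pos_counts.most_common(20))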


Another trick we can do is to create a word cloud from the bag of words:
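A sketch with the wordcloud package, joining the prepped titles into one bag of words:

import matplotlib.pyplot as plt
from wordcloud import WordCloud

# Sketch: word cloud over all prepped titles
text = " ".join(train["title_prep"])
cloud = WordCloud(width=800, height=400, background_color="white").generate(text)

plt.imshow(cloud, interpolation="bilinear")
plt.axis("off")
plt.show()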



Ending Notes

This blog contains the first part of my solution to this competition. Performing data cleaning and EDA is of tremendous value within a project, as it helps us better understand the data, how we should work with it, and how to use it moving forward.
You can check out the second part of my solution, containing models and prediction here.

You can find the full notebook (with datasets and code) here.

💜 Thank you lots for reading and Happy Data Sciencin'! 💜