II. Shopee Competition: Model Training with Pytorch x RAPIDS
This notebook contains the modeling and clustering steps for the Shopee - Price Match Guarantee competition.

Introduction
This is the second notebook on the Shopee - Price Match Guarantee Kaggle competition (if you want to see the first one, which contains the EDA and preprocessing steps, head here).
🟢 Goal: To build a model that can identify which images contain the same product(s).
🔴 To consider:
- This competition is different from what we're used to, as it relies on unsupervised machine learning techniques.
- The goal is to group similar products. Although we have a "target variable" (named label_group) in the train dataset, the test dataset can contain entirely different groups (completely unseen during training). Hence, we can't use label_group as our target (y) feature. Instead, we have to train a model that learns to cluster similar products together. We can check its performance afterwards by comparing the predictions with the actual groups.
Inspiration: Huge thanks to Chris Deotte for creating a trendsetting baseline notebook and to zzy990106 for his PyTorch version of Chris's work.
Loading the Data
This is a notebook competition, meaning that the test data is hidden and is revealed only when the notebook is submitted. Only 3 observations of the test set are visible; the hidden data contains roughly 70,000 more rows.
Hence, a first step was to create a variable that reads the data by looking at the state of the notebook:
- When COMMIT: we create the clusters and compute the CV score using the training data.
- When SUBMIT: we create the clusters using the test data and submit (we cannot compute a CV score).
```python
# ---- Set COMPUTE_CV value ----
COMPUTE_CV = True  # Switch to False if test.csv has more than 3 values
### check out Chris's notebook for more info on this
test = pd.read_csv('../input/shopee-product-matching/test.csv')
if len(test) > 3:
    COMPUTE_CV = False
```
After establishing COMPUTE_CV, we can go ahead and read in the files. Be mindful that we compute the target column only when using the train dataset. When we submit the notebook (meaning we use the hidden test data), no target variable is available, so we skip it.
```python
if COMPUTE_CV == True:
    # === CPU data ===
    # Read in data
    data = pd.read_csv("../input/shopee-product-matching/train.csv")
    # Set a "filepath" column
    data["filepath"] = train_base + data["image"]
    # For each product, map all `posting_id`s that share the same label
    ### this way we create a "target" column (ONLY FOR TRAIN)
    group_dicts = data.groupby('label_group')["posting_id"].unique().to_dict()
    data['target'] = data["label_group"].map(group_dicts)

    # === GPU data ===
    data_gpu = cudf.read_csv("../input/shopee-product-matching/train.csv")
    data_gpu["filepath"] = train_base + data_gpu["image"]
else:
    # === CPU data ===
    data = pd.read_csv("../input/shopee-product-matching/test.csv")
    data["filepath"] = test_base + data["image"]
    # No target here

    # === GPU data ===
    data_gpu = cudf.read_csv("../input/shopee-product-matching/test.csv")
    data_gpu["filepath"] = test_base + data_gpu["image"]
```
Competition Metric
Let's now understand the competition metric. It is established by the competition organizers and can be seen here. It is the mean of the F1 score computed on each posting's predictions.
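For a single posting with target set $T$ and predicted set $P$, the per-row score reduces to the Dice form of F1:

$$F_1 = \frac{2\,\lvert T \cap P \rvert}{\lvert T \rvert + \lvert P \rvert}$$

which is exactly what the helper function below computes.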

```python
def F1_score(target_column, pred_column):
    '''
    Returns the F1 score for each row in the data.
    Remember: the final score is the mean F1 score.
    target_column: the name of the column that contains the target
    pred_column: the name of the column that contains the prediction
    '''
    def get_f1(row):
        # Find the common values in the target and prediction arrays
        intersection = len(np.intersect1d(row[target_column], row[pred_column]))
        # Compute the score by following the formula
        f1_score = 2 * intersection / (len(row[target_column]) + len(row[pred_column]))
        return f1_score
    return get_f1
```
Now we can compute a baseline CV score. This means that for each image we predict all the images that share the same image_phash (a perceptual hash of the image).
Note: if 2 or more images have the same image_phash, they are visually near-identical.
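As a minimal sketch, the baseline can be computed like this, assuming the data frame and the F1_score helper from above (the duplic_pred column name matches the one used later in this report):

```python
import numpy as np

# Group postings by perceptual hash: identical phash -> predicted match
phash_groups = data.groupby("image_phash")["posting_id"].unique().to_dict()
data["duplic_pred"] = data["image_phash"].map(phash_groups)

# Score the baseline with the row-wise F1 helper defined above
data["f1"] = data.apply(F1_score("target", "duplic_pred"), axis=1)
print("Baseline CV score:", data["f1"].mean())
```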
PyTorch Dataset
We'll create a Dataset class called ShopeeDataset that will:
- Receive the metadata (stored in .csv).
- Read in the image and title.
- Perform image augmentation and tokenization.
- Return the necessary information to feed into the model afterward.
```python
class ShopeeDataset(Dataset):
    def __init__(self, csv, train):
        self.csv = csv.reset_index()
        self.train = train
        # Instantiate the BERT tokenizer from the pretrained files
        self.tokenizer = AutoTokenizer.from_pretrained('../input/bert-base-uncased')
        # Image augmentation
        self.transform = Compose([
            VerticalFlip(p=0.5),
            HorizontalFlip(p=0.5),
            Resize(256, 256),
            Normalize(),
        ])

    def __len__(self):
        return len(self.csv)

    def __getitem__(self, index):
        '''
        Read in one image & title.
        Return the transformed image and the text ids and mask.
        '''
        # Read in image and text data
        image = cv2.imread(self.csv["filepath"][index])
        text = self.csv["title"][index]

        # Transform image & transpose channels to [color, height, width]
        image = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)
        image_transf = self.transform(image=image)["image"].astype(np.float32)
        image_transf = torch.tensor(image_transf.transpose(2, 0, 1))

        # Tokenize the text using BERT
        text_token = self.tokenizer(text, padding="max_length",
                                    truncation=True, max_length=16,
                                    return_tensors="pt")
        input_ids = text_token["input_ids"][0]
        attention_mask = text_token["attention_mask"][0]

        # Return dataset info
        ### if "test", we won't have label_group available
        if self.train == True:
            label_group = torch.tensor(self.csv["label_group"][index])
            return image_transf, input_ids, attention_mask, label_group
        else:
            return image_transf, input_ids, attention_mask
```
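For example (a sketch; the batch size and worker count are illustrative), the dataset can be wrapped in a standard DataLoader:

```python
from torch.utils.data import DataLoader

dataset = ShopeeDataset(data, train=COMPUTE_CV)
loader = DataLoader(dataset, batch_size=32, shuffle=False, num_workers=2)
```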
Predicting using Image Embeddings
Step 1: Retrieving the embeddings
After loading the images through ShopeeDataset, we can extract the embeddings using EfficientNet (more on PyTorch EfficientNet here).
The embeddings are an abstract vector representation of the images:
- input: an image of [3, 256, 256] (3 channels, of size 256x256).
- output: an array of 1,000 items that is the abstract representation of the input (see image below for more details).

After applying the EffNet to retrieve the embeddings, we end up with an object with 70,501 rows and 1,000 features.
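As a sketch of how this extraction might look (using the timm library here for illustration; the model name and loop details are assumptions, not the exact original setup):

```python
import numpy as np
import timm
import torch

# Pretrained EfficientNet; its 1000-dim ImageNet head doubles as an embedding
image_model = timm.create_model("efficientnet_b0", pretrained=True).cuda().eval()

embeds = []
with torch.no_grad():
    for batch in loader:  # the DataLoader built above
        image = batch[0]  # [B, 3, 256, 256]
        embeds.append(image_model(image.cuda()).cpu().numpy())

image_embeddings = np.concatenate(embeds)  # one 1000-dim vector per posting
```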
Step 2: Creating the predictions
We used the NearestNeighbors method to compute the clusters, using the image embeddings as features. The competition says that "group sizes are capped at 50, so there is no benefit to predict more than 50 matches". Hence, we create clusters with a maximum size of 50.
The predictions are saved in a new variable called img_pred.
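A hedged sketch of this step (shown with scikit-learn; RAPIDS cuml.neighbors.NearestNeighbors exposes the same fit/kneighbors interface, and the distance threshold below is illustrative, not the exact value used):

```python
from sklearn.neighbors import NearestNeighbors

knn = NearestNeighbors(n_neighbors=50)  # group sizes are capped at 50
knn.fit(image_embeddings)
distances, indices = knn.kneighbors(image_embeddings)

img_pred = []
for k in range(len(indices)):
    # Keep only neighbors closer than a (tunable) distance threshold
    close = np.where(distances[k] < 6.0)[0]
    img_pred.append(data["posting_id"].iloc[indices[k, close]].values)
data["img_pred"] = img_pred
```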
Creating clusters using Text Embeddings
As the title of each image is also available, it would be a shame not to use it for prediction as well. In this part, we'll fit a TF-IDF vectorizer to extract text embeddings.
Step 1: Retrieving the embeddings
A TF-IDF process looks like the example below:

After applying the TfIdf vectorizer to retrieve the embeddings, we end up with an object with 70,501 rows and 24,939 features.
```python
# Extract the Tf-Idf matrix
tf_idf = TfidfVectorizer(stop_words='english', binary=True, max_features=25000)
text_embeddings = tf_idf.fit_transform(data_gpu["title"]).toarray()
```
Step 2: Creating the predictions
We used the NearestNeighbors method as well, with the same hyperparameters as for the image embeddings. The predictions are saved in a new variable called title_pred.
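The pattern mirrors the image step (a sketch reusing the imports above; only the fitted matrix changes, and the distance threshold is illustrative):

```python
knn_text = NearestNeighbors(n_neighbors=50)
knn_text.fit(text_embeddings)
distances, indices = knn_text.kneighbors(text_embeddings)

title_pred = []
for k in range(len(indices)):
    close = np.where(distances[k] < 0.6)[0]  # illustrative threshold
    title_pred.append(data["posting_id"].iloc[indices[k, close]].values)
data["title_pred"] = title_pred
```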
Final Prediction
Now that we have predictions linked to both image and title embeddings, we can combine them and create the final prediction, which will be submitted to the leaderboard.
Remember, duplic_pred contains all the images that have the same phash id.
```python
def combine_predictions(row, cv=True):
    '''Combine all predictions together.'''
    # Concatenate all predictions
    all_preds = np.concatenate([row["img_pred"], row["title_pred"], row["duplic_pred"]])
    all_preds = np.unique(all_preds)
    # Return combined unique preds
    if cv == True:
        return all_preds
    else:
        return ' '.join(all_preds)
```
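For example (a sketch following the COMPUTE_CV switch from the beginning; the pred and matches column names are illustrative, with matches following the competition's submission format):

```python
if COMPUTE_CV:
    # Combine, then score against the true groups
    data["pred"] = data.apply(combine_predictions, axis=1, cv=True)
    data["f1"] = data.apply(F1_score("target", "pred"), axis=1)
    print("CV score:", data["f1"].mean())
else:
    # Combine into space-separated strings and write the submission
    data["matches"] = data.apply(combine_predictions, axis=1, cv=False)
    data[["posting_id", "matches"]].to_csv("submission.csv", index=False)
```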
By combining these simple methods, we achieved a mean F1 (CV) score of 0.67 and a leaderboard score of 0.66.
We can also observe the F1 score distribution for each product below:
Ending Notes
There are many more experiments we could try. Some ideas are:
- Tweaking the EffNet or using another algorithm altogether (like ResNet).
- Adding more information to the text embeddings (features like number of words in a sentence or average length of words).
- Trying other clustering methods.
- etc.
You can check out the first part of my solution, containing the EDA and preprocessing steps, here.
You can find the full notebook (with datasets and code) here.
💜 Thank you lots for reading and Happy Data Sciencin'! 💜