Baseline Report: Text Classification on user complaints
Exploratory Data Analysis and execution of a baseline model, CNN 1D, to classify user complaints into categories
Created on September 6 | Last edited on September 25
Contents:
- Problem Description
- Our Approach
- Exploratory Data Analysis
- Dataset
- Data Processing
- Raw Dataset
- Word Clouds
- Analyze the balance in our dataset
- Split the dataset
- Training a baseline model
- Confusion Matrix
- Make Predictions
Problem Description
Our goal is to identify which product category a user complaint refers to.
In many situations, user complaints must be analyzed and grouped into the right category: this label is not always provided, and when it is, the user often sets it incorrectly.
Our Approach
Our first step is an EDA (Exploratory Data Analysis) of the dataset to build intuition about relevant features and pain points. Since our input data is text, we will probably have to clean it and standardize some formats.
The second step is to create our datasets.
The third step is to build a baseline model, our first approach to the problem. Later we can compare more advanced models against it, trying to achieve better performance.
This initial model is a 1D convolutional model in Keras applied to a text classification task. It is a simple, small model that we can easily train on a low-performance compute instance.
Exploratory Data Analysis
Dataset
This dataset is available on Kaggle and it is described as:
The Consumer Financial Protection Bureau (CFPB) is a federal U.S. agency that acts as a mediator when disputes arise between financial institutions and consumers. Via a web form, consumers can send the agency a narrative of their dispute. An NLP model would make the classification of complaints and their routing to the appropriate teams more efficient than manually tagged complaints.
The dataset consisted of around 162,400 consumer submissions containing narratives from the CFPB website for training and testing the model. It included one year's worth of data (March 2020 to March 2021).
Data Processing
Once we download the dataset and load it into a Pandas DataFrame, we can check that the text is already clean: no uppercase, no punctuation, no stopwords. It is very uncommon to be able to work with a dataset like that; you usually have to remove URLs, HTML tags, punctuation, and so on.
We want to enrich our dataset with features we can use to identify different distributions in the input data. For that reason, we calculate the count of words, characters, and sentences per row.
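These per-row counts can be computed with vectorized pandas string operations. A minimal sketch follows; the text column name `narrative` is an assumption, and the sentence count is a rough period-based heuristic (the corpus keeps periods even though other punctuation is stripped):

```python
import pandas as pd

# Toy frame standing in for the complaints dataset; "narrative" is an assumed column name.
df = pd.DataFrame({"narrative": [
    "i paid my credit card on time. the charge was not removed.",
    "the bank closed my account without notice",
]})

df["word_count"] = df["narrative"].str.split().str.len()   # whitespace-separated tokens
df["char_count"] = df["narrative"].str.len()               # raw character length
# Rough sentence count: number of periods, with a floor of one sentence per row.
df["sent_count"] = df["narrative"].str.count(r"\.").clip(lower=1)
```

Plotting the distributions of these columns per category is then a one-liner with the usual pandas/matplotlib tooling.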
Another group of features we can inspect in text data is the Part-Of-Speech tagging:
The process of classifying words into their parts of speech and labeling them accordingly is known as part-of-speech tagging, POS-tagging, or simply tagging. Parts of speech are also known as word classes or lexical categories... The collection of tags used for a particular task is known as a tagset. Our emphasis in the next section is on exploiting tags, and tagging text automatically.
Natural Language Processing with Python, by S. Bird, E. Klein and E. Loper [1]
We will enrich our dataset with counts of the main POS tags, such as ADJ, VERB, and NOUN.
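Counting tags per row reduces to tallying the output of a tagger. A sketch, assuming the (token, tag) pairs come from a tagger such as `nltk.pos_tag(tokens, tagset="universal")` (which requires downloading the NLTK tagger models); here we hand-tag a toy example to keep the snippet self-contained:

```python
from collections import Counter

# Hand-tagged toy example; in practice these pairs would come from an
# automatic tagger, e.g. nltk.pos_tag(tokens, tagset="universal").
tagged = [("late", "ADJ"), ("fee", "NOUN"), ("charged", "VERB"),
          ("wrong", "ADJ"), ("account", "NOUN")]

MAIN_TAGS = {"ADJ", "VERB", "NOUN"}
counts = Counter(tag for _, tag in tagged if tag in MAIN_TAGS)
# counts -> ADJ: 2, NOUN: 2, VERB: 1; each tag becomes one new column per row.
```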
Raw Dataset
We upload our raw dataset to W&B Artifacts:
Word Clouds
A word cloud (or tag cloud) is a visual representation of text data. It displays a list of words, with the importance of each shown through font size or color (the bigger, the more frequent). This format is useful for quickly perceiving the most relevant terms in a document or set of documents.
As an example, we show the word cloud for the credit_card category:
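Under the hood, a word cloud is driven by term frequencies. A minimal sketch of that frequency step over a toy credit_card corpus; rendering would then use a library such as the third-party `wordcloud` package (e.g. `WordCloud().generate_from_frequencies(freqs)`), which is an assumption about tooling, not part of this document's pipeline:

```python
from collections import Counter

# Toy stand-in for the credit_card complaints.
docs = [
    "credit card charge dispute",
    "card payment late fee",
    "credit card annual fee",
]

# Term frequencies: these determine the font size of each word in the cloud.
freqs = Counter(word for doc in docs for word in doc.split())
top = freqs.most_common(2)  # the biggest words in the rendered cloud
```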

Analyze the balance in our dataset
When dealing with a classification task, it is very important to analyze whether the dataset is balanced. Let's check our raw dataset:
As we can observe, the category "credit_reporting" is predominant. To avoid issues later, we will remove examples of this category to obtain the following final distribution:
```
product
credit_reporting       15000
debt_collection        15000
mortgages_and_loans    15000
credit_card            14983
retail_banking         13469
Name: count, dtype: int64
```
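One way to get that distribution is to cap every category at a maximum number of rows. A sketch with pandas, where with the real dataset the cap would be 15,000; a tiny frame and a cap of 2 keep the snippet runnable:

```python
import pandas as pd

CAP = 2  # would be 15_000 on the real dataset

# Imbalanced toy frame: 5 credit_reporting rows vs. 1 retail_banking row.
df = pd.DataFrame({"product": ["credit_reporting"] * 5 + ["retail_banking"],
                   "narrative": list("abcdef")})

# Randomly keep at most CAP rows per category; smaller categories are untouched.
balanced = pd.concat(
    g.sample(n=min(len(g), CAP), random_state=42)
    for _, g in df.groupby("product"))
```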
Split the dataset
Finally, we build the following datasets and upload this distribution to a table in W&B:
- Train dataset: 70%
- Validation dataset: 15%
- Test dataset: 15%
Training a baseline model
Our first approach creates a convolutional neural network with a single 1D convolution layer for the text classification task. The network has three layers: an embedding layer, a convolution layer, and a dense layer. The embedding layer maps token indices to embeddings, which are passed to the Conv1D layer for the convolution operations. The output of the convolution layer is fed to the dense layer to generate probabilities for the target classes.
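The architecture described above can be sketched in Keras as follows. The vocabulary size, sequence length, embedding dimension, and filter count are illustrative assumptions, and a global max-pooling layer is inserted to collapse the sequence dimension before the dense head:

```python
import tensorflow as tf
from tensorflow.keras import layers

VOCAB, SEQ_LEN, N_CLASSES = 20_000, 200, 5  # assumed hyperparameters

model = tf.keras.Sequential([
    layers.Embedding(VOCAB, 64),               # token index -> embedding vector
    layers.Conv1D(128, 5, activation="relu"),  # 1D convolution over the token axis
    layers.GlobalMaxPooling1D(),               # collapse the sequence dimension
    layers.Dense(N_CLASSES, activation="softmax"),  # per-class probabilities
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
```

With integer-encoded, padded sequences of length `SEQ_LEN`, `model.fit(X_train, y_train, validation_data=(X_val, y_val))` trains the baseline.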
In our first experiment, we got the following results:
Confusion Matrix

Make Predictions
We also log the predictions on the validation dataset, including the predicted probability for every class. With these probabilities we can take a closer look at the errors and detect points worth analyzing.
Here we can observe that the errors look roughly equally distributed. Our validation dataset contains 11,000 rows, with about 2,000 rows in every category.
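Given the per-class probability matrix, flagging the errors and ranking them by how confident the model was is a few lines of NumPy. A sketch with a hand-made probability matrix standing in for the logged validation predictions:

```python
import numpy as np

classes = ["credit_card", "debt_collection", "retail_banking"]

# Toy stand-in for the logged per-class probabilities (rows sum to 1).
probs = np.array([[0.7, 0.2, 0.1],    # predicted credit_card
                  [0.1, 0.3, 0.6],    # predicted retail_banking
                  [0.2, 0.5, 0.3]])   # predicted debt_collection
y_true = np.array([0, 2, 0])          # true class indices

y_pred = probs.argmax(axis=1)
errors = np.flatnonzero(y_pred != y_true)    # rows the model got wrong
confidence = probs[errors, y_pred[errors]]   # how sure it was when wrong
worst_first = errors[np.argsort(-confidence)]  # confidently-wrong rows first
```

Inspecting the narratives behind `worst_first` is usually the quickest way to spot labeling problems or systematically confused category pairs.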