Baseline Report: Text Classification on user complaints
Exploratory Data Analysis and execution of a baseline model, CNN 1D, to classify user complaints into categories
Created on September 6 | Last edited on September 25
Contents:
- Problem Description
- Our Approach
- Exploratory Data Analysis
- Dataset
- Data Processing
- Raw Dataset
- Word Clouds
- Analyze the balance in our dataset
- Split the dataset
- Training a baseline model
- Confusion Matrix
- Make Predictions
Problem Description
Our goal is to identify which product category a user complaint refers to.
In many situations, user complaints must be analyzed and grouped into the right category: this label is not always provided, and when it is, the user often sets it incorrectly.
Our Approach
Our first step is an EDA (Exploratory Data Analysis) of the dataset to build intuition about relevant features and pain points. Since our input data is text, we will probably have to clean it and standardize some formats.
The second step is to create our datasets.
The third step is to build a baseline model, our first approach to the problem. Later we can compare more advanced models against it, trying to achieve better performance.
This initial model is a 1D convolutional model in Keras applied to a text classification task. It is a simple, small model that we can easily train on a low-performance compute instance.
Exploratory Data Analysis
Dataset
This dataset is available on Kaggle and it is described as:
The Consumer Financial Protection Bureau (CFPB) is a federal U.S. agency that acts as a mediator when disputes arise between financial institutions and consumers. Via a web form, consumers can send the agency a narrative of their dispute. An NLP model would make the classification of complaints and their routing to the appropriate teams more efficient than manually tagged complaints.
The dataset consisted of around 162,400 consumer submissions containing narratives from the CFPB website for training and testing the model. It included one year's worth of data (March 2020 to March 2021).
Data Processing
Once we download the dataset and load it into a Pandas DataFrame, we can check that the text is already clean: no uppercase, no punctuation, no stopwords. It is very uncommon to be able to work with a dataset like that; you usually have to remove URLs, HTML tags, punctuation, and so on.
We want to enrich our dataset with features we can use to identify different distributions in the input data. For that reason, we calculate the count of words, characters, and sentences per row.
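These per-row counts can be computed with vectorized pandas string operations. A minimal sketch follows; the text column name `narrative` is an assumption, and the sentence count is a rough period-based heuristic (the corpus keeps periods even though other punctuation is stripped):

```python
import pandas as pd

# Toy frame standing in for the complaints dataset; "narrative" is an assumed column name.
df = pd.DataFrame({"narrative": [
    "i paid my credit card on time. the charge was not removed.",
    "the bank closed my account without notice",
]})

df["word_count"] = df["narrative"].str.split().str.len()   # whitespace-separated tokens
df["char_count"] = df["narrative"].str.len()               # raw character length
# Rough sentence count: number of periods, with a floor of one sentence per row.
df["sent_count"] = df["narrative"].str.count(r"\.").clip(lower=1)
```

Plotting the distributions of these columns per category is then a one-liner with the usual pandas/matplotlib tooling.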
Another group of features we can inspect in text data is the Part-Of-Speech tagging:
The process of classifying words into their parts of speech and labeling them accordingly is known as part-of-speech tagging, POS-tagging, or simply tagging. Parts of speech are also known as word classes or lexical categories... The collection of tags used for a particular task is known as a tagset. Our emphasis in the next section is on exploiting tags, and tagging text automatically.
Natural Language Processing with Python, by S. Bird, E. Klein and E. Loper [1]
We will enrich our dataset with counts of the main POS tags, such as ADJ, VERB, and NOUN.
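Counting tags per row reduces to tallying the output of a tagger. A sketch, assuming the (token, tag) pairs come from a tagger such as `nltk.pos_tag(tokens, tagset="universal")` (which requires downloading the NLTK tagger models); here we hand-tag a toy example to keep the snippet self-contained:

```python
from collections import Counter

# Hand-tagged toy example; in practice these pairs would come from an
# automatic tagger, e.g. nltk.pos_tag(tokens, tagset="universal").
tagged = [("late", "ADJ"), ("fee", "NOUN"), ("charged", "VERB"),
          ("wrong", "ADJ"), ("account", "NOUN")]

MAIN_TAGS = {"ADJ", "VERB", "NOUN"}
counts = Counter(tag for _, tag in tagged if tag in MAIN_TAGS)
# counts -> ADJ: 2, NOUN: 2, VERB: 1; each tag becomes one new column per row.
```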
Raw Dataset
We upload our raw dataset to W&B Artifacts:
Word Clouds
A word cloud (or tag cloud) is a visual representation of text data. It displays a list of words, with the importance of each shown through font size or color (the bigger, the more frequent). This format is useful for quickly perceiving the most relevant terms in a document or set of documents.
As an example, we show the word cloud for the credit_card category:
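Under the hood, a word cloud is driven by term frequencies. A minimal sketch of that frequency step over a toy credit_card corpus; rendering would then use a library such as the third-party `wordcloud` package (e.g. `WordCloud().generate_from_frequencies(freqs)`), which is an assumption about tooling, not part of this document's pipeline:

```python
from collections import Counter

# Toy stand-in for the credit_card complaints.
docs = [
    "credit card charge dispute",
    "card payment late fee",
    "credit card annual fee",
]

# Term frequencies: these determine the font size of each word in the cloud.
freqs = Counter(word for doc in docs for word in doc.split())
top = freqs.most_common(2)  # the biggest words in the rendered cloud
```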

Analyze the balance in our dataset
When dealing with a classification task, it is very important to analyze whether the dataset is balanced. Let's check our raw dataset:
As we can observe, the category "credit_reporting" is predominant. To avoid issues later, we will remove examples of this category to obtain the following final distribution:
```
product
credit_reporting       15000
debt_collection        15000
mortgages_and_loans    15000
credit_card            14983
retail_banking         13469
Name: count, dtype: int64
```
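One way to get that distribution is to cap every category at a maximum number of rows. A sketch with pandas, where with the real dataset the cap would be 15,000; a tiny frame and a cap of 2 keep the snippet runnable:

```python
import pandas as pd

CAP = 2  # would be 15_000 on the real dataset

# Imbalanced toy frame: 5 credit_reporting rows vs. 1 retail_banking row.
df = pd.DataFrame({"product": ["credit_reporting"] * 5 + ["retail_banking"],
                   "narrative": list("abcdef")})

# Randomly keep at most CAP rows per category; smaller categories are untouched.
balanced = pd.concat(
    g.sample(n=min(len(g), CAP), random_state=42)
    for _, g in df.groupby("product"))
```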
Split the dataset
Finally, we build the following datasets and upload this distribution to a table in W&B:
- Train dataset: 70%
- Validation dataset: 15%
- Test dataset: 15%
Training a baseline model
Our first approach creates a convolutional neural network with a single 1D convolution layer for the text classification task. The network has three layers: an embedding layer, a convolution layer, and a dense layer. The embedding layer maps token indices to embeddings, which are passed to the Conv1D layer for the convolution operations. The output of the convolution layer is fed to the dense layer to generate probabilities for the target classes.
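The architecture described above can be sketched in Keras as follows. The vocabulary size, sequence length, embedding dimension, and filter count are illustrative assumptions, and a global max-pooling layer is inserted to collapse the sequence dimension before the dense head:

```python
import tensorflow as tf
from tensorflow.keras import layers

VOCAB, SEQ_LEN, N_CLASSES = 20_000, 200, 5  # assumed hyperparameters

model = tf.keras.Sequential([
    layers.Embedding(VOCAB, 64),               # token index -> embedding vector
    layers.Conv1D(128, 5, activation="relu"),  # 1D convolution over the token axis
    layers.GlobalMaxPooling1D(),               # collapse the sequence dimension
    layers.Dense(N_CLASSES, activation="softmax"),  # per-class probabilities
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
```

With integer-encoded, padded sequences of length `SEQ_LEN`, `model.fit(X_train, y_train, validation_data=(X_val, y_val))` trains the baseline.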
In our first experiment, we got the following results:
Confusion Matrix

Make Predictions
We also log the predictions on the validation dataset, including the predicted probability for every class. With these probabilities we can take a closer look at the errors and detect points worth analyzing.
Here we can observe that the errors look roughly equally distributed. Our validation dataset contains 11,000 rows, with about 2,000 rows in every category.
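Given the per-class probability matrix, flagging the errors and ranking them by how confident the model was is a few lines of NumPy. A sketch with a hand-made probability matrix standing in for the logged validation predictions:

```python
import numpy as np

classes = ["credit_card", "debt_collection", "retail_banking"]

# Toy stand-in for the logged per-class probabilities (rows sum to 1).
probs = np.array([[0.7, 0.2, 0.1],    # predicted credit_card
                  [0.1, 0.3, 0.6],    # predicted retail_banking
                  [0.2, 0.5, 0.3]])   # predicted debt_collection
y_true = np.array([0, 2, 0])          # true class indices

y_pred = probs.argmax(axis=1)
errors = np.flatnonzero(y_pred != y_true)    # rows the model got wrong
confidence = probs[errors, y_pred[errors]]   # how sure it was when wrong
worst_first = errors[np.argsort(-confidence)]  # confidently-wrong rows first
```

Inspecting the narratives behind `worst_first` is usually the quickest way to spot labeling problems or systematically confused category pairs.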