A Comprehensive Guide to Automating Article Categorization with Machine Learning
This article demonstrates the use of machine learning for automating article and document classification using PyTorch.

Introduction
In the fast-moving world of the internet, digital content is being generated at an exponential rate, so managing and categorizing articles efficiently is crucial for enhancing user experience and optimizing information retrieval. Document classification, a fundamental task in natural language processing and machine learning, plays a pivotal role in this domain. It involves automatically assigning predefined categories or labels to text documents based on their content, structure, or context.
Document classification is significant because it organizes vast amounts of information into manageable and accessible categories. This process facilitates efficient search and retrieval, enables personalized content recommendations, and enhances content organization for both users and content managers.
In this article, we delve into automated article categorization using deep neural networks built with the PyTorch framework. We will begin with an article categorization dataset, on which we perform data preprocessing. Next, we implement a simple neural network that learns to classify articles. We will also integrate Weights & Biases into our training pipeline to enable easy experiment tracking, model comparison, and more. By the end of this article, you will have a good understanding of article categorization with PyTorch.
How Do ML Models Perform Document Classification?
In the realm of document classification, machine learning (ML) models play an important role in automating the categorization of digital content, ranging from articles and news pieces to research papers and blog posts. Let's dive deep into how these models perform document classification.
Document Classification and Its Role
Document classification involves categorizing text documents into predefined categories or labels based on their content, structure, or context. This process is essential for organizing and managing digital content, enabling efficient information retrieval and content recommendation systems. ML models are leveraged to automate this classification process, making it scalable and adaptable to diverse content types and domains.
How Document Classification Works

Document classification combines machine learning and natural language processing (NLP) techniques, which together form the backbone of document classification systems. These systems typically follow these steps:
- Data Collection: The first step is to obtain a labeled dataset, in which each document or piece of digital content has an associated category or label. Ensure the dataset is representative and balanced across categories.
- Data Preprocessing: This stage involves several cleaning steps. The text is cleaned by removing noise such as special characters, HTML tags, punctuation, and stop words, and is then tokenized into words or subwords. Techniques like stemming or lemmatization can be applied to further normalize words.
- Feature Extraction: The preprocessed text is converted into numerical features that machine learning algorithms can work with. Commonly used text representation techniques include TF-IDF (Term Frequency-Inverse Document Frequency) and word embeddings such as Word2Vec, GloVe, or BERT, which capture semantic meaning.
- Split Dataset: Divide the dataset into training, validation, and testing sets to train the model and evaluate its performance. For simple tasks, the dataset is sometimes split into only training and testing sets, with no separate validation set.
- Model Selection: We must select a suitable machine learning model for document classification. Commonly used models include Logistic Regression, Naive Bayes, Support Vector Machines (SVM), and neural networks.
- Model Training & Testing: The selected model is trained on the training dataset and evaluated on a separate testing dataset. Evaluation metrics such as accuracy, precision, recall, and F1 score are used to assess the model's effectiveness. A minimal sketch of such a pipeline appears right after this list.
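To make these steps concrete, here is a minimal sketch of such a pipeline using scikit-learn with a TF-IDF representation (the tutorial later in this article uses a different vectorizer and a neural network). The file name 'labeled_articles.csv' and its 'text' and 'label' columns are hypothetical placeholders for whatever labeled dataset you have.

# Minimal sketch of the classic pipeline: load, split, vectorize, train, evaluate.
# 'labeled_articles.csv' and its 'text'/'label' columns are hypothetical placeholders.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report

df = pd.read_csv('labeled_articles.csv')  # Data collection
X_train, X_test, y_train, y_test = train_test_split(
    df['text'], df['label'], test_size=0.2, random_state=42)  # Split dataset

vectorizer = TfidfVectorizer(stop_words='english', lowercase=True)  # Preprocessing + feature extraction
X_train_vec = vectorizer.fit_transform(X_train)
X_test_vec = vectorizer.transform(X_test)

clf = LogisticRegression(max_iter=1000)  # Model selection
clf.fit(X_train_vec, y_train)  # Model training
print(classification_report(y_test, clf.predict(X_test_vec)))  # Model testing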
Supervised vs Unsupervised Learning for Text Classification

Supervised Learning: In supervised learning, models are trained on a labeled dataset, where each document is associated with a known category or label. The model learns to map input features (textual content) to the corresponding output labels during training. Supervised learning is effective for precise categorization but requires labeled data for training.
Unsupervised Learning: Unsupervised learning approaches, such as clustering algorithms (e.g., K-means clustering, hierarchical clustering), do not require labeled data. These algorithms group documents based on similarities in their features without predefined categories. Unsupervised learning can discover hidden patterns and structures in data but may result in less granular categorization than supervised methods.
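As a quick illustration of the unsupervised route, the sketch below groups a handful of unlabeled snippets with K-means over TF-IDF features; the 'texts' list and the number of clusters are assumptions made purely for this example.

# Minimal sketch: group unlabeled documents with K-means over TF-IDF features.
# The 'texts' list and n_clusters=2 are illustrative assumptions.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

texts = ["stocks rally as markets open", "team wins the championship final",
         "central bank holds interest rates", "star striker signs new contract"]

features = TfidfVectorizer(stop_words='english').fit_transform(texts)
kmeans = KMeans(n_clusters=2, random_state=42, n_init=10).fit(features)
print(kmeans.labels_)  # Cluster assignment for each document (no predefined labels)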
The Newsroom Challenge - Organizing Information
The advent of digital media platforms and online publishing has led to an explosion of news content across various topics, ranging from politics and business to technology and entertainment. This exponential growth poses a challenge regarding information overload, making it increasingly difficult for users to navigate and access relevant news articles. Daily news content has skyrocketed, presenting a significant challenge for newsrooms and content managers to efficiently organize and categorize this vast amount of information.
Manual vs. Automated Classification
Both manual and automated classification have their own benefits and limitations. Let’s study them both.
Manual Classification:
- Costs: Manual categorization of news articles involves human effort, time, and resources. Hiring skilled personnel or assigning existing staff to categorize articles incurs costs.
- Benefits: Human judgment and expertise can lead to nuanced categorization, especially for complex or ambiguous topics. It allows for fine-tuning categories based on editorial guidelines and audience preferences.
- Limitations: Manual classification is labor-intensive, prone to errors, and may not scale well with the growing volume of news content. It can also be subjective, leading to inconsistencies in categorization.
Automated Classification:
- Costs: Implementing automated categorization using machine learning models requires an initial investment in technology infrastructure, model development, and training data. However, the long-term operational costs can be significantly lower than those associated with manual efforts.
- Benefits: Automated classification leverages ML algorithms to process large volumes of text data rapidly and consistently. It reduces human errors, improves scalability, and enables real-time categorization of incoming news articles.
- Limitations: Automated systems may face challenges in accurately categorizing nuanced or context-dependent topics. They require regular monitoring and updates to maintain classification accuracy as news trends and topics evolve.
How Automated Categorization Enhances News Discovery, Reading, and Archiving
Automated categorization improves user experience by increasing the overall time spent on relevant content, reducing the time spent searching for specific information, and personalizing content recommendations based on user preferences and behavior.
- News Discovery: Automated categorization enables users to discover relevant news articles more efficiently by grouping content into meaningful categories or topics. This improves user experience by facilitating targeted content recommendations and personalized news feeds based on user preferences.
- Reading Experience: Automated categorization can enhance the reading experience by presenting articles in organized sections or topic clusters, allowing users to seamlessly navigate and explore related content.
- Archiving and Retrieval: Automated categorization aids in systematically archiving news content, making it easier to retrieve historical articles based on specific categories, dates, or keywords. This archival system supports research, analysis, and content repurposing efforts within news organizations.
By embracing automated categorization technologies powered by machine learning, newsrooms can overcome the challenges posed by the sheer volume of digital news content, improve content organization and accessibility, and ultimately enhance their audience's overall news consumption experience.
Practical Guide For Building Your Personal Document Classifier Using Weights and Biases
We will explore the practical aspects of developing, training, and deploying a machine learning model for news categorization. The model will receive an article or document and predict the category to which it belongs.
Dataset Selection
For article categorization, we will use the News Articles dataset, which contains 2,584 unique news articles. Each article has an associated category, such as business or sports.
The dataset was scraped from the https://www.thenews.com.pk website. It contains business and sports news articles from 2015 onward, and each record includes the article's heading, its content, and its date. The content also mentions the place from which the statement or article was published.
Step 1: Importing the required libraries
We will import some necessary libraries, such as pandas, torch, and sklearn, which will be used later in the code.
import os
import pandas as pd
import numpy as np
import random
import torch
import torch.nn as nn
import torch.optim as optim
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
import re
import string
import nltk
from nltk.corpus import stopwords
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
import wandb
Step 2: Initialize Weights & Biases
Now, we will initialize Weights & Biases for a new project called article_classifier and add the configuration data to it. This step is important as it sets up the environment for our project.
config = {
    "seed": 42,
    "lr": 0.001,
    "epochs": 5
}
wandb.init(project='article_classifier', config=config)

Step 3: Seeding the Environment
Seeding the environment in deep learning ensures consistent random number generation, aiding reproducibility for debugging, testing, hyperparameter tuning, and result validation in research or production settings.
random.seed(config["seed"])
os.environ["PYTHONHASHSEED"] = str(config["seed"])
np.random.seed(config["seed"])
torch.manual_seed(config["seed"])
torch.cuda.manual_seed(config["seed"])
torch.backends.cudnn.deterministic = True
Step 4: Downloading the Stopwords
Download all English stopwords using the NLTK library for future use in dataset cleaning.
nltk.download('stopwords')
Step 5: Data Loading & Preprocessing
First, we read the data from the CSV file using pandas. Then, we define a function called clean_text to preprocess the textual data by:
- Converting the text to lowercase.
- Removing digits.
- Removing punctuation.
- Removing stopwords.
- Removing leading and trailing whitespace.
This process filters out unnecessary elements from the article, preserving essential contextual information.
After cleaning, we store the cleaned text in a new 'Cleaned_Article' column of the DataFrame and load both the cleaned article and its category for further analysis.
# Load dataset from CSV
data = pd.read_csv('Articles.csv', encoding='ISO-8859-1')
# Preprocessing
def clean_text(text):
    text = text.lower()  # Convert text to lowercase
    text = re.sub(r'\d+', '', text)  # Remove digits
    text = text.translate(str.maketrans('', '', string.punctuation))  # Remove punctuation
    text = ' '.join([word for word in text.split() if word not in stopwords.words('english')])  # Remove stop words
    text = text.strip()  # Remove leading and trailing whitespaces
    return text
data['Cleaned_Article'] = data['Article'].apply(clean_text)
X = data['Cleaned_Article']
y = data['NewsType']
Here is an example of the news article before and after cleaning.
Before cleaning:

After cleaning:

Step 6: Label Encoding and Vectorization
Now that we have preprocessed our text data, we'll proceed with label encoding and vectorization to prepare our features and labels for model training.
# Label Encoding
label_encoder = LabelEncoder()
y = label_encoder.fit_transform(y)
class_labels = label_encoder.classes_
# Vectorization
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(X)
# Convert sparse matrix to dense tensor
X = torch.tensor(X.toarray(), dtype=torch.float32)
y = torch.tensor(y, dtype=torch.long)
Step 7: Splitting the Dataset
Next, we'll split our dataset into two equal parts: a training set and a testing set, which will be used for model training and evaluation.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=42)
print(f"Training samples: {len(X_train)} - Testing samples: {len(X_test)}")
Step 8: Defining the Model Architecture
Now, let's define the architecture of our multilayer perceptron for article classification using PyTorch.
class NewsClassifier(nn.Module):
    def __init__(self, input_dim, output_dim):
        super(NewsClassifier, self).__init__()
        self.fc = nn.Sequential(
            nn.Linear(input_dim, input_dim // 8),
            nn.ReLU(),
            nn.Linear(input_dim // 8, output_dim)
        )

    def forward(self, x):
        x = self.fc(x)
        return x
input_dim = X_train.shape[1]
output_dim = len(y_train.unique())
model = NewsClassifier(input_dim, output_dim)
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=0.001)
Step 9: Logging Initial Results
We will take ten samples from the test dataset and make predictions with the untrained model. We will also log those results to Weights & Biases.
# Logging articles to the table
pred_table = wandb.Table(columns=["Name", "Sample 1", "Sample 2", "Sample 3", "Sample 4", "Sample 5",
                                  "Sample 6", "Sample 7", "Sample 8", "Sample 9", "Sample 10"])
Now, we split the raw cleaned articles with the same random seed, so the samples match X_test, and log them to the Weights & Biases table.
# Split
_, test_x, _, test_y = train_test_split(data['Cleaned_Article'].values, data['NewsType'].values,
                                        test_size=0.5, random_state=42)
pred_table.add_data("Articles", test_x[0], test_x[1], test_x[2], test_x[3], test_x[4],
                    test_x[5], test_x[6], test_x[7], test_x[8], test_x[9])
Next, we make predictions with the untrained model and log those results to the Weights & Biases table.
# Initial results
initial_samples = X_test[:10]  # Get 10 initial samples
initial_predictions = torch.argmax(torch.softmax(model(initial_samples), dim=1), dim=1)
initial_predictions_labels = [class_labels[i] for i in initial_predictions.tolist()]
print("Initial Predictions: \t", initial_predictions_labels)
pred_table.add_data("Initial Predictions", initial_predictions_labels[0], initial_predictions_labels[1],
                    initial_predictions_labels[2], initial_predictions_labels[3], initial_predictions_labels[4],
                    initial_predictions_labels[5], initial_predictions_labels[6], initial_predictions_labels[7],
                    initial_predictions_labels[8], initial_predictions_labels[9])
Step 10: Training and Logging Predictions
Now, we will train our model on the training dataset for just five epochs. After each epoch, we will use the above-mentioned test samples to make predictions and add them to the Weights & Biases table.
# Training loop
for epoch in range(config["epochs"]):
    model.train()
    optimizer.zero_grad()
    outputs = model(X_train)
    loss = criterion(outputs, y_train)
    print(f"Epoch: {epoch+1:2d} - Loss: {loss:1.4f}")
    wandb.log({"Epoch": epoch+1, "Loss": loss})
    loss.backward()
    optimizer.step()

    # Logging this epoch's predictions on the held-out samples
    final_predictions = torch.argmax(torch.softmax(model(initial_samples), dim=1), dim=1)
    final_predictions_labels = [class_labels[i] for i in final_predictions.tolist()]
    pred_table.add_data(f"Epoch {epoch+1}", final_predictions_labels[0], final_predictions_labels[1],
                        final_predictions_labels[2], final_predictions_labels[3], final_predictions_labels[4],
                        final_predictions_labels[5], final_predictions_labels[6], final_predictions_labels[7],
                        final_predictions_labels[8], final_predictions_labels[9])

print("Final Predictions: \t", final_predictions_labels)
wandb.log({"Predictions": pred_table})  # Log the prediction table to W&B

The table shows the predictions from the initial, untrained model and then from each epoch, illustrating the improvement.
Here, a curve from Weights & Biases shows the training loss at each epoch. The loss decreases with each epoch, indicating that the gap between the predicted and true categories shrinks during training.

A curve from Weights & Biases showing the training loss at each epoch
Step 11: Testing & Evaluation
After the model is trained, we will use the test dataset to evaluate its performance.
model.eval()
with torch.no_grad():
    test_outputs = model(X_test)
    predicted = torch.argmax(torch.softmax(test_outputs, dim=1), dim=1)
    accuracy = accuracy_score(y_test, predicted)
    confusion_mat = confusion_matrix(y_test, predicted)
    classification_rep = classification_report(y_test, predicted)
# Logging evaluation metrics to W&B
wandb.log({"Accuracy": accuracy})
wandb.log({"Confusion Matrix": confusion_mat.tolist()})
wandb.log({"Classification Report": classification_rep})
print("Evaluation ->")print("Accuracy: ", accuracy)print("Confusion Matrix:\n", confusion_mat)print("Classification Report:\n", classification_rep)

The graph from Weights & Biases shows the accuracy on the test dataset.

The classification report logged to Weights & Biases.
Step 12: Saving the Model
Now, we will save our model so that it can be used later for article or document classification. Additionally, we will call wandb.finish(), which marks the run as finished and finishes uploading all data.
torch.save(model.state_dict(), 'article_categorization_model.pth')
wandb.finish()
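As a follow-up, here is a hedged sketch of how the saved weights might later be reloaded for inference. It assumes the NewsClassifier class, clean_text, vectorizer, label_encoder, input_dim, and output_dim from this training session are still available; in practice you would also persist the vectorizer and label encoder (for example, with pickle or joblib).

# Hedged sketch: reload the saved weights and classify a new article.
# Assumes NewsClassifier, clean_text, vectorizer, label_encoder, input_dim,
# and output_dim from the training session are available.
loaded_model = NewsClassifier(input_dim, output_dim)
loaded_model.load_state_dict(torch.load('article_categorization_model.pth'))
loaded_model.eval()

new_article = "The stock market closed higher after strong quarterly earnings."
features = vectorizer.transform([clean_text(new_article)])
features = torch.tensor(features.toarray(), dtype=torch.float32)

with torch.no_grad():
    prediction = torch.argmax(loaded_model(features), dim=1)
print(label_encoder.inverse_transform(prediction.numpy()))  # e.g., the predicted category name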
Output

The figure shows the output from the terminal.

The figure shows the complete data logged to Weights & Biases.
Publishing the Results - Model Evaluation and Implementation
To evaluate the performance of the article classifier model, we have used the following:
- Accuracy
- Confusion Matrix
- Classification Report
The model achieved 99.47% accuracy on the test dataset. The classification report shows the precision, recall, F1-score, and other details, indicating that the model performs strongly on all the evaluation metrics.
This confusion matrix represents the performance of a binary classification model on a dataset. Each cell in the matrix corresponds to the count of instances that fall into specific categories: true positives (TP), false positives (FP), true negatives (TN), and false negatives (FN).
Here, you can see a visual representation of the confusion matrix.

Here's an analysis based on the provided confusion matrix:
- True Positives (TP): 648
- False Positives (FP): 1
- False Negatives (FN): 6
- True Negatives (TN): 691
The confusion matrix suggests that the model performed exceptionally well, with high accuracy and precision for both classes. The low false positive and false negative rates indicate that the model made very few incorrect predictions, which is crucial for tasks like article categorization, where accuracy is paramount.
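For reference, the precision and recall implied by these counts can be checked directly with a few lines of arithmetic; this small sanity-check snippet uses only the values quoted above.

# Sanity check using the confusion-matrix counts quoted above.
TP, FP, FN, TN = 648, 1, 6, 691

precision = TP / (TP + FP)  # ≈ 0.998
recall = TP / (TP + FN)     # ≈ 0.991
f1 = 2 * precision * recall / (precision + recall)

print(f"Precision: {precision:.3f}, Recall: {recall:.3f}, F1: {f1:.3f}")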
Challenges Faced Throughout the Guide
While building this simple article classifier, no major challenges were encountered. A few minor ones did come up, and they are as follows:
- Dataset: Finding a good enough dataset is a minor challenge. Sometimes you may not find the right kind of dataset and need to annotate one yourself.
- Class Imbalance: Unequal distribution of classes in the dataset can result in biased models that perform well on majority classes but poorly on minority ones. Techniques like class weighting may be needed to address this issue (see the sketch after this list).
- Text Preprocessing: Text preprocessing is an essential data-cleaning step, but with the present dataset it has no major effect on performance, so it is worth comparing performance with and without it. It is also time-consuming, since cleaning the text takes time.
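As an illustration of the class-weighting idea mentioned above, the hedged sketch below derives per-class weights from the label frequencies and passes them to CrossEntropyLoss; it assumes the integer-encoded y_train tensor from this tutorial is available.

# Hedged sketch: weight the loss by inverse class frequency to counter imbalance.
# Assumes the integer-encoded y_train tensor from the tutorial is available.
class_counts = torch.bincount(y_train).float()
class_weights = class_counts.sum() / (len(class_counts) * class_counts)

weighted_criterion = nn.CrossEntropyLoss(weight=class_weights)
# Use weighted_criterion in place of criterion inside the training loop.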
Conclusion
Automatic article categorization is an exciting field at the intersection of natural language processing and machine learning. It has a wide range of applications, including efficient search and retrieval, personalized content recommendations, and enhanced content organization. By using a simple multilayer perceptron, we were able to build an article/document classifier with 99% accuracy. As we continue to gather more data, refine algorithms, and scale our models, we can unlock new possibilities for efficient information management, personalized content delivery, and enhanced user experiences across various digital platforms and applications.