A Comprehensive Guide to Automating Article Categorization with Machine Learning
This article demonstrates the use of machine learning for automating article and document classification using PyTorch.

Introduction
In the fast-moving world of the internet, digital content is being generated at an exponential rate, so managing and categorizing articles efficiently is crucial for enhancing user experience and optimizing information retrieval. Document classification, a fundamental task in natural language processing and machine learning, plays a pivotal role in this domain. It involves automatically assigning predefined categories or labels to text documents based on their content, structure, or context.
Document classification is significant because it organizes vast amounts of information into manageable and accessible categories. This process facilitates efficient search and retrieval, enables personalized content recommendations, and enhances content organization for both users and content managers.
In this article, we delve into automated article categorization using deep neural networks built with the PyTorch framework. We will begin with an article categorization dataset, on which we perform data preprocessing. Next, we implement a simple neural network that learns to classify articles. We will also integrate Weights & Biases into our training pipeline to enable easy experiment tracking, model comparison, and more. By the end of this article, you will have a good understanding of article categorization with PyTorch.
How Do ML Models Perform Document Classification?
In the realm of document classification, machine learning (ML) models play an important role in automating the categorization of digital content, ranging from articles and news pieces to research papers and blog posts. Let's dive deep into how these models perform document classification.
Document Classification and Its Role
Document classification involves categorizing text documents into predefined categories or labels based on their content, structure, or context. This process is essential for organizing and managing digital content, enabling efficient information retrieval and content recommendation systems. ML models are leveraged to automate this classification process, making it scalable and adaptable to diverse content types and domains.
How Document Classification Works

Document classification combines machine learning and natural language processing (NLP) techniques, which together form the backbone of document classification systems. These systems typically follow these steps:
- Data Collection: The first step is to obtain a labeled dataset, in which each document or piece of digital content has an associated category or label. Ensure the dataset is representative and balanced across categories.
- Data Preprocessing: This stage involves several cleaning steps. The text is cleaned by removing noise such as special characters, HTML tags, punctuation, and stop words, and is then tokenized into words or subwords. Techniques like stemming or lemmatization can be applied to further normalize words.
- Feature Extraction: The preprocessed text is converted into numerical features that machine learning algorithms can work with. Commonly used text representation techniques include TF-IDF (Term Frequency-Inverse Document Frequency) and word embeddings such as Word2Vec, GloVe, or BERT, which capture semantic meaning.
- Split Dataset: Divide the dataset into training, validation, and testing sets to train the model and evaluate its performance. For simple tasks, the dataset is sometimes split into only training and testing sets, with no separate validation set.
- Model Selection: We must select a suitable machine learning model for document classification. Commonly used models include Logistic Regression, Naive Bayes, Support Vector Machines (SVM), and neural networks.
- Model Training & Testing: The selected model is trained on the training dataset and evaluated on a separate testing dataset. Evaluation metrics such as accuracy, precision, recall, and F1 score are used to assess the model's effectiveness. A minimal sketch of such a pipeline appears right after this list.
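To make these steps concrete, here is a minimal sketch of such a pipeline using scikit-learn with a TF-IDF representation (the tutorial later in this article uses a different vectorizer and a neural network). The file name 'labeled_articles.csv' and its 'text' and 'label' columns are hypothetical placeholders for whatever labeled dataset you have.

# Minimal sketch of the classic pipeline: load, split, vectorize, train, evaluate.
# 'labeled_articles.csv' and its 'text'/'label' columns are hypothetical placeholders.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report

df = pd.read_csv('labeled_articles.csv')  # Data collection
X_train, X_test, y_train, y_test = train_test_split(
    df['text'], df['label'], test_size=0.2, random_state=42)  # Split dataset

vectorizer = TfidfVectorizer(stop_words='english', lowercase=True)  # Preprocessing + feature extraction
X_train_vec = vectorizer.fit_transform(X_train)
X_test_vec = vectorizer.transform(X_test)

clf = LogisticRegression(max_iter=1000)  # Model selection
clf.fit(X_train_vec, y_train)  # Model training
print(classification_report(y_test, clf.predict(X_test_vec)))  # Model testing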
Supervised vs Unsupervised Learning for Text Classification

Supervised Learning: In supervised learning, models are trained on a labeled dataset, where each document is associated with a known category or label. The model learns to map input features (textual content) to the corresponding output labels during training. Supervised learning is effective for precise categorization but requires labeled data for training.
Unsupervised Learning: Unsupervised learning approaches, such as clustering algorithms (e.g., K-means clustering, hierarchical clustering), do not require labeled data. These algorithms group documents based on similarities in their features without predefined categories. Unsupervised learning can discover hidden patterns and structures in data but may result in less granular categorization than supervised methods.
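As a quick illustration of the unsupervised route, the sketch below groups a handful of unlabeled snippets with K-means over TF-IDF features; the 'texts' list and the number of clusters are assumptions made purely for this example.

# Minimal sketch: group unlabeled documents with K-means over TF-IDF features.
# The 'texts' list and n_clusters=2 are illustrative assumptions.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

texts = ["stocks rally as markets open", "team wins the championship final",
         "central bank holds interest rates", "star striker signs new contract"]

features = TfidfVectorizer(stop_words='english').fit_transform(texts)
kmeans = KMeans(n_clusters=2, random_state=42, n_init=10).fit(features)
print(kmeans.labels_)  # Cluster assignment for each document (no predefined labels)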
The Newsroom Challenge - Organizing Information
The advent of digital media platforms and online publishing has led to an explosion of news content across various topics, ranging from politics and business to technology and entertainment. This exponential growth poses a challenge regarding information overload, making it increasingly difficult for users to navigate and access relevant news articles. Daily news content has skyrocketed, presenting a significant challenge for newsrooms and content managers to efficiently organize and categorize this vast amount of information.
Manual vs. Automated Classification
Both manual and automated classification have their own benefits and limitations. Let’s study them both.
Manual Classification:
- Costs: Manual categorization of news articles involves human effort, time, and resources. Hiring skilled personnel or assigning existing staff to categorize articles incurs costs.
- Benefits: Human judgment and expertise can lead to nuanced categorization, especially for complex or ambiguous topics. It allows for fine-tuning categories based on editorial guidelines and audience preferences.
- Limitations: Manual classification is labor-intensive, prone to errors, and may not scale well with the growing volume of news content. It can also be subjective, leading to inconsistencies in categorization.
Automated Classification:
- Costs: Implementing automated categorization using machine learning models requires an initial investment in technology infrastructure, model development, and training data. However, the long-term operational costs can be significantly lower than those associated with manual efforts.
- Benefits: Automated classification leverages ML algorithms to process large volumes of text data rapidly and consistently. It reduces human errors, improves scalability, and enables real-time categorization of incoming news articles.
- Limitations: Automated systems may face challenges in accurately categorizing nuanced or context-dependent topics. They require regular monitoring and updates to maintain classification accuracy as news trends and topics evolve.
How Automated Categorization Enhances News Discovery, Reading, and Archiving
Automated categorization improves user experience by increasing the overall time spent on relevant content, reducing the time spent searching for specific information, and personalizing content recommendations based on user preferences and behavior.
- News Discovery: Automated categorization enables users to discover relevant news articles more efficiently by grouping content into meaningful categories or topics. This improves user experience by facilitating targeted content recommendations and personalized news feeds based on user preferences.
- Reading Experience: Automated categorization can enhance the reading experience by presenting articles in organized sections or topic clusters, allowing users to seamlessly navigate and explore related content.
- Archiving and Retrieval: Automated categorization aids in systematically archiving news content, making it easier to retrieve historical articles based on specific categories, dates, or keywords. This archival system supports research, analysis, and content repurposing efforts within news organizations.
By embracing automated categorization technologies powered by machine learning, newsrooms can overcome the challenges posed by the sheer volume of digital news content, improve content organization and accessibility, and ultimately enhance their audience's overall news consumption experience.
Practical Guide For Building Your Personal Document Classifier Using Weights and Biases
We will explore the practical aspects of developing, training, and deploying a machine learning model for news categorization. The model will receive an article or document and predict the category to which it belongs.
Dataset Selection
For article categorization, we will use the News Articles dataset, which contains 2,584 unique news articles. Each article has an associated category, such as business or sports.
The dataset was scraped from the https://www.thenews.com.pk website. It contains business and sports news articles from 2015 onward, and each record includes the article's heading, its content, and its date. The content also mentions the place from which the statement or article was published.
Step 1: Importing the required libraries
We will import some necessary libraries, such as pandas, torch, and sklearn, which will be used later in the code.
import os
import pandas as pd
import numpy as np
import random
import torch
import torch.nn as nn
import torch.optim as optim
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
import re
import string
import nltk
from nltk.corpus import stopwords
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
import wandb
Step 2: Initialize Weights & Biases
Now, we will initialize Weights & Biases for a new project called article_classifier and add the configuration data to it. This step is important as it sets up the environment for our project.
config = {
    "seed": 42,
    "lr": 0.001,
    "epochs": 5
}
wandb.init(project='article_classifier', config=config)

Step 3: Seeding the Environment
Seeding the environment in deep learning ensures consistent random number generation, aiding reproducibility for debugging, testing, hyperparameter tuning, and result validation in research or production settings.
random.seed(config["seed"])
os.environ["PYTHONHASHSEED"] = str(config["seed"])
np.random.seed(config["seed"])
torch.manual_seed(config["seed"])
torch.cuda.manual_seed(config["seed"])
torch.backends.cudnn.deterministic = True
Step 4: Downloading the Stopwords
Download all English stopwords using the NLTK library for future use in dataset cleaning.
nltk.download('stopwords')
Step 5: Data Loading & Preprocessing
First, we read the data from the CSV file using pandas. Then, we define a function called clean_text to preprocess the textual data by:
- Converting the text to lowercase.
- Removing digits.
- Removing punctuation.
- Removing stopwords.
- Removing leading and trailing whitespace.
This process filters out unnecessary elements from the article, preserving essential contextual information.
After cleaning, we store the cleaned text in a new 'Cleaned_Article' column of the DataFrame and load both the cleaned article and its category for further analysis.
# Load dataset from CSV
data = pd.read_csv('Articles.csv', encoding='ISO-8859-1')
# Preprocessing
def clean_text(text):
    text = text.lower()  # Convert text to lowercase
    text = re.sub(r'\d+', '', text)  # Remove digits
    text = text.translate(str.maketrans('', '', string.punctuation))  # Remove punctuation
    text = ' '.join([word for word in text.split() if word not in stopwords.words('english')])  # Remove stop words
    text = text.strip()  # Remove leading and trailing whitespaces
    return text
data['Cleaned_Article'] = data['Article'].apply(clean_text)
X = data['Cleaned_Article']
y = data['NewsType']
Here is an example of the news article before and after cleaning.
Before cleaning:

After cleaning:

Step 6: Label Encoding and Vectorization
Now that we have preprocessed our text data, we'll proceed with label encoding and vectorization to prepare our features and labels for model training.
# Label Encoding
label_encoder = LabelEncoder()
y = label_encoder.fit_transform(y)
class_labels = label_encoder.classes_
# Vectorization
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(X)
# Convert sparse matrix to dense tensor
X = torch.tensor(X.toarray(), dtype=torch.float32)
y = torch.tensor(y, dtype=torch.long)
Step 7: Splitting the Dataset
Next, we'll split our dataset into two equal parts: a training set and a testing set, which will be used for model training and evaluation.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=42)
print(f"Training samples: {len(X_train)} - Testing samples: {len(X_test)}")
Step 8: Defining the Model Architecture
Now, let's define the architecture of our multilayer perceptron for article classification using PyTorch.
class NewsClassifier(nn.Module):
    def __init__(self, input_dim, output_dim):
        super(NewsClassifier, self).__init__()
        self.fc = nn.Sequential(
            nn.Linear(input_dim, input_dim // 8),
            nn.ReLU(),
            nn.Linear(input_dim // 8, output_dim)
        )

    def forward(self, x):
        x = self.fc(x)
        return x
input_dim = X_train.shape[1]
output_dim = len(y_train.unique())
model = NewsClassifier(input_dim, output_dim)
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=0.001)
Step 9: Logging Initial Results
We will take ten samples from the test dataset and make predictions with the untrained model. We will also log those results to Weights & Biases.
# Logging articles to the table
pred_table = wandb.Table(columns=["Name", "Sample 1", "Sample 2", "Sample 3", "Sample 4", "Sample 5",
                                  "Sample 6", "Sample 7", "Sample 8", "Sample 9", "Sample 10"])
Now, we split the raw cleaned articles with the same random seed, so the samples match X_test, and log them to the Weights & Biases table.
# Split
_, test_x, _, test_y = train_test_split(data['Cleaned_Article'].values, data['NewsType'].values,
                                        test_size=0.5, random_state=42)
pred_table.add_data("Articles", test_x[0], test_x[1], test_x[2], test_x[3], test_x[4],
                    test_x[5], test_x[6], test_x[7], test_x[8], test_x[9])
Next, we make predictions with the untrained model and log those results to the Weights & Biases table.
# Initial results
initial_samples = X_test[:10]  # Get 10 initial samples
initial_predictions = torch.argmax(torch.softmax(model(initial_samples), dim=1), dim=1)
initial_predictions_labels = [class_labels[i] for i in initial_predictions.tolist()]
print("Initial Predictions: \t", initial_predictions_labels)
pred_table.add_data("Initial Predictions", initial_predictions_labels[0], initial_predictions_labels[1],
                    initial_predictions_labels[2], initial_predictions_labels[3], initial_predictions_labels[4],
                    initial_predictions_labels[5], initial_predictions_labels[6], initial_predictions_labels[7],
                    initial_predictions_labels[8], initial_predictions_labels[9])
Step 10: Training and Logging Predictions
Now, we will train our model on the training dataset for just five epochs. After each epoch, we will use the above-mentioned test samples to make predictions and add them to the Weights & Biases table.
# Training loop
for epoch in range(config["epochs"]):
    model.train()
    optimizer.zero_grad()
    outputs = model(X_train)
    loss = criterion(outputs, y_train)
    print(f"Epoch: {epoch+1:2d} - Loss: {loss:1.4f}")
    wandb.log({"Epoch": epoch+1, "Loss": loss})
    loss.backward()
    optimizer.step()

    # Logging this epoch's predictions on the held-out samples
    final_predictions = torch.argmax(torch.softmax(model(initial_samples), dim=1), dim=1)
    final_predictions_labels = [class_labels[i] for i in final_predictions.tolist()]
    pred_table.add_data(f"Epoch {epoch+1}", final_predictions_labels[0], final_predictions_labels[1],
                        final_predictions_labels[2], final_predictions_labels[3], final_predictions_labels[4],
                        final_predictions_labels[5], final_predictions_labels[6], final_predictions_labels[7],
                        final_predictions_labels[8], final_predictions_labels[9])

print("Final Predictions: \t", final_predictions_labels)
wandb.log({"Predictions": pred_table})  # Log the prediction table to W&B

The table shows the predictions from the initial, untrained model and then from each epoch, illustrating the improvement.
Here, a curve from Weights & Biases shows the training loss at each epoch. The loss decreases with each epoch, indicating that the gap between the predicted and true categories shrinks during training.

A curve from Weights & Biases showing the training loss at each epoch
Step 11: Testing & Evaluation
After the model is trained, we will use the test dataset to evaluate its performance.
model.eval()
with torch.no_grad():
    test_outputs = model(X_test)
    predicted = torch.argmax(torch.softmax(test_outputs, dim=1), dim=1)
    accuracy = accuracy_score(y_test, predicted)
    confusion_mat = confusion_matrix(y_test, predicted)
    classification_rep = classification_report(y_test, predicted)
# Logging evaluation metrics to W&B
wandb.log({"Accuracy": accuracy})
wandb.log({"Confusion Matrix": confusion_mat.tolist()})
wandb.log({"Classification Report": classification_rep})
print("Evaluation ->")print("Accuracy: ", accuracy)print("Confusion Matrix:\n", confusion_mat)print("Classification Report:\n", classification_rep)

The graph from Weights & Biases shows the accuracy on the test dataset.

The classification report logged to Weights & Biases.
Step 12: Saving the Model
Now, we will save our model so that it can be used later for article or document classification. Additionally, we will call wandb.finish(), which marks the run as finished and finishes uploading all data.
torch.save(model.state_dict(), 'article_categorization_model.pth')
wandb.finish()
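As a follow-up, here is a hedged sketch of how the saved weights might later be reloaded for inference. It assumes the NewsClassifier class, clean_text, vectorizer, label_encoder, input_dim, and output_dim from this training session are still available; in practice you would also persist the vectorizer and label encoder (for example, with pickle or joblib).

# Hedged sketch: reload the saved weights and classify a new article.
# Assumes NewsClassifier, clean_text, vectorizer, label_encoder, input_dim,
# and output_dim from the training session are available.
loaded_model = NewsClassifier(input_dim, output_dim)
loaded_model.load_state_dict(torch.load('article_categorization_model.pth'))
loaded_model.eval()

new_article = "The stock market closed higher after strong quarterly earnings."
features = vectorizer.transform([clean_text(new_article)])
features = torch.tensor(features.toarray(), dtype=torch.float32)

with torch.no_grad():
    prediction = torch.argmax(loaded_model(features), dim=1)
print(label_encoder.inverse_transform(prediction.numpy()))  # e.g., the predicted category name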
Output

The figure shows the output from the terminal.

The figure shows the complete data logged to Weights & Biases.
Publishing the Results - Model Evaluation and Implementation
To evaluate the performance of the article classifier model, we have used the following:
- Accuracy
- Confusion Matrix
- Classification Report
The model achieved 99.47% accuracy on the test dataset. The classification report shows the precision, recall, F1-score, and other details, indicating that the model performs strongly on all the evaluation metrics.
This confusion matrix represents the performance of a binary classification model on a dataset. Each cell in the matrix corresponds to the count of instances that fall into specific categories: true positives (TP), false positives (FP), true negatives (TN), and false negatives (FN).
Here, you can see a visual representation of the confusion matrix.

Here's an analysis based on the provided confusion matrix:
- True Positives (TP): 648
- False Positives (FP): 1
- False Negatives (FN): 6
- True Negatives (TN): 691
The confusion matrix suggests that the model performed exceptionally well, with high accuracy and precision for both classes. The low false positive and false negative rates indicate that the model made very few incorrect predictions, which is crucial for tasks like article categorization, where accuracy is paramount.
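For reference, the precision and recall implied by these counts can be checked directly with a few lines of arithmetic; this small sanity-check snippet uses only the values quoted above.

# Sanity check using the confusion-matrix counts quoted above.
TP, FP, FN, TN = 648, 1, 6, 691

precision = TP / (TP + FP)  # ≈ 0.998
recall = TP / (TP + FN)     # ≈ 0.991
f1 = 2 * precision * recall / (precision + recall)

print(f"Precision: {precision:.3f}, Recall: {recall:.3f}, F1: {f1:.3f}")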
Challenges Faced Throughout the Guide
While building this simple article classifier, no major challenges were encountered. A few minor ones did come up, and they are as follows:
- Dataset: Finding a good enough dataset is a minor challenge. Sometimes you may not find the right kind of dataset and need to annotate one yourself.
- Class Imbalance: Unequal distribution of classes in the dataset can result in biased models that perform well on majority classes but poorly on minority ones. Techniques like class weighting may be needed to address this issue (see the sketch after this list).
- Text Preprocessing: Text preprocessing is an essential data-cleaning step, but with the present dataset it has no major effect on performance, so it is worth comparing performance with and without it. It is also time-consuming, since cleaning the text takes time.
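As an illustration of the class-weighting idea mentioned above, the hedged sketch below derives per-class weights from the label frequencies and passes them to CrossEntropyLoss; it assumes the integer-encoded y_train tensor from this tutorial is available.

# Hedged sketch: weight the loss by inverse class frequency to counter imbalance.
# Assumes the integer-encoded y_train tensor from the tutorial is available.
class_counts = torch.bincount(y_train).float()
class_weights = class_counts.sum() / (len(class_counts) * class_counts)

weighted_criterion = nn.CrossEntropyLoss(weight=class_weights)
# Use weighted_criterion in place of criterion inside the training loop.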
Conclusion
Automatic article categorization is an exciting field at the intersection of natural language processing and machine learning. It has a wide range of applications, including efficient search and retrieval, personalized content recommendations, and enhanced content organization. By using a simple multilayer perceptron, we were able to build an article/document classifier with 99% accuracy. As we continue to gather more data, refine algorithms, and scale our models, we can unlock new possibilities for efficient information management, personalized content delivery, and enhanced user experiences across various digital platforms and applications.