Master sentiment analysis in Python
Unlock the power of sentiment analysis with Python! Learn to categorize text emotions using NLTK & scikit-learn. Boost your skills with hands-on techniques.
Sentiment analysis is a powerful technique in natural language processing (NLP) that extracts emotional tone from text data. It enables us to automatically categorize text as positive, negative, or neutral in terms of sentiment. Using Python for sentiment analysis is ideal due to its versatility and the availability of powerful libraries like NLTK and scikit-learn.
In this hands-on tutorial, you'll learn how to perform sentiment analysis from scratch in Python. We'll walk through the entire process step by step – from understanding what sentiment analysis is, to building a model, evaluating it, and even integrating Weights & Biases for experiment tracking and visualization.
By the end of this tutorial, you will have a working sentiment analysis pipeline and a solid grasp of how to apply sentiment analysis to real-world text data. We will also highlight how tools from Weights & Biases can enhance your workflow, such as tracking model performance or creating interactive visualizations for analysis.
Table of contents
What is sentiment analysis?
Practical applications of sentiment analysis
Methodologies for performing sentiment analysis
Types of sentiment analysis
Setting up Python for sentiment analysis
Tutorial: Performing sentiment analysis with Python
Step 1: Load the dataset
Step 2: Split data into training and test sets
Step 3: Preprocess the text data
Step 4: Rule-based sentiment analysis with VADER
Step 5: Feature extraction with TF-IDF
Step 6: Train a sentiment classification model
Step 7: Evaluate the model
Step 8: Track and visualize the experiment with Weights & Biases
Alternative use cases and tools
Conclusion
Sources
Let's dive in and start with the basics of sentiment analysis and why it matters.
What is sentiment analysis?
Sentiment analysis, also known as opinion mining, is the automated process of detecting and interpreting emotional tone in text to gauge attitudes or opinions toward a topic, product, or service. By applying machine learning and text analytics, algorithms classify statements as positive, negative, or neutral, assigning numerical scores that quantify these sentiments and transform large volumes of qualitative data into actionable insights.
For example, a product review saying "This phone is amazing!" would be classified as having a positive sentiment, whereas "I'm very disappointed with this phone" expresses a negative sentiment. By quantifying subjective information from text, sentiment analysis enables the transformation of qualitative sentiments into actionable data.
Understanding sentiment is important because it provides insight into public opinion and human emotions on a large scale. Businesses use sentiment analysis to gauge customer satisfaction by analyzing reviews and social media posts. It helps in reputation management by identifying negative mentions of a brand early. In politics and public policy, sentiment analysis can measure public reaction to statements or events. Overall, sentiment analysis enables automated understanding of attitudes and emotions, which is invaluable for making data-driven decisions in many fields.
Moreover, sentiment analysis is a key component in systems like chatbots and recommendation engines, where understanding user sentiment can lead to more empathetic and relevant interactions. Its importance spans industries – from marketing (to understand consumer feedback), finance (to analyze market sentiment from news or tweets), to healthcare (to analyze patient feedback or even to monitor mental health through language). By converting unstructured text into structured sentiment data, organizations can uncover trends and patterns that might otherwise be missed, making sentiment analysis a powerful tool in the modern data toolkit.
Practical applications of sentiment analysis
Sentiment analysis has a wide range of practical applications across various industries. Here are a few notable examples that highlight its impact on business and research:
- Social media monitoring: Companies analyze tweets, Facebook posts, and other social media content to understand public sentiment about their products or brand. This real-time feedback helps in managing brand reputation and responding promptly to customer concerns or viral trends.
- Customer reviews and service: E-commerce platforms and service providers use sentiment analysis on product reviews or customer support tickets. By automatically gauging whether feedback is positive or negative, businesses can prioritize addressing negative feedback and improve their products and services. It also enables aggregating thousands of reviews to get an overall sentiment score for products.
- Market research and finance: In finance, analysts use sentiment analysis on news articles and financial reports to predict market movements. For example, sentiment scores of news headlines about a company can be an indicator of its stock performance. Market research firms also analyze sentiment in survey responses or online forums to measure consumer confidence and preferences.
- Healthcare and sociology: Sentiment analysis is applied to patient feedback, medical forums, or therapy session transcripts to detect sentiments that might indicate patient satisfaction or emotional well-being. In sociology and linguistics research, analyzing sentiment in large collections of texts (like literature or political speeches) can reveal insights about public mood and historical trends.
- Political and social analysis: During elections or major political events, sentiment analysis of tweets and news can gauge public opinion and reaction. Governments and NGOs can analyze social media sentiment on policy announcements or social issues to understand public opinion and respond accordingly.
These applications show that sentiment analysis is a versatile tool. By systematically evaluating emotions in text, organizations can make more informed decisions. For instance, a sudden surge in negative sentiment on social media about a product can alert a company to a potential issue, allowing it to intervene quickly.
Methodologies for performing sentiment analysis
There are several ways to perform sentiment analysis, each with its own approach and characteristics. The three main approaches are lexicon-based methods, machine learning-based methods, and transformer-based (deep learning) methods. Let's briefly examine each:
- Lexicon-based approach: This approach relies on predefined lexical resources (dictionaries of words) where each word is associated with a sentiment score. The analysis involves counting or summing the sentiment scores of words in the text to determine the overall sentiment. For example, words like "good", "happy", or "excellent" might contribute positive points, while "bad", "sad", or "terrible" might contribute negative points. Lexicon-based methods are simple and easy to implement, requiring no training data. However, they have limitations: they often ignore context (e.g., sarcasm or negation like "not good"), and their accuracy depends heavily on the quality of the lexicon and rules. A minimal sketch of this scoring idea follows this list.
- Machine learning approach: This approach treats sentiment analysis as a text classification problem. First, a labeled dataset of texts with known sentiments (positive/negative labels, for instance) is required. The text is converted into features (such as word frequencies or embeddings), and then a machine learning model is trained on these features to learn how to classify new text. Common algorithms include logistic regression, naive Bayes, or support vector machines for simpler tasks, and they can achieve better accuracy than lexicon-based methods by learning from context patterns in data. The downside is that they require annotated data and computational power for training. They may also not generalize well beyond the data they were trained on, unless carefully validated.
- Transformer-based deep learning approach: Transformer models (like BERT, RoBERTa, or GPT-based models) have revolutionized NLP, including sentiment analysis. These models are typically pre-trained on massive text corpora and can be fine-tuned on sentiment analysis tasks. Approaches using transformers often involve either using a pre-trained model directly for sentiment (for example, via Hugging Face transformers pipeline) or fine-tuning a model on a specific sentiment dataset (like movie reviews). Transformer-based methods typically achieve the highest accuracy because they capture complex language contexts and subtleties, such as negation and sarcasm. They can understand that "I don't hate it" is different from "I hate it", something that simpler models or lexicons might miss. The trade-off is that they are resource-intensive and can be more complex to implement. They also often require access to pre-trained model weights and possibly a GPU for efficient processing.
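To make the lexicon-based idea concrete (see the first item above), here is a minimal sketch of a word-score lexicon summed over a text. The toy_lexicon and its scores are made up for illustration; real lexicons contain thousands of scored entries:

# A minimal, hypothetical lexicon-based scorer (illustration only)
toy_lexicon = {"good": 1.0, "happy": 1.0, "excellent": 2.0,
               "bad": -1.0, "sad": -1.0, "terrible": -2.0}

def lexicon_score(text):
    # Sum the scores of known words; unknown words contribute 0
    return sum(toy_lexicon.get(word, 0.0) for word in text.lower().split())

print(lexicon_score("the food was excellent and everyone was happy"))  # 3.0 -> positive
print(lexicon_score("not good at all"))  # 1.0 -> wrongly positive, because negation is ignored

The second example shows the negation problem mentioned above: a pure word-counting approach scores "not good" as positive because "good" is in the lexicon and "not" is not.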
Each methodology differs in complexity and performance. Lexicon-based techniques are fast and interpretable but may miss nuance. Machine learning models require data, but can capture context-specific sentiment better. Transformer models provide state-of-the-art performance by understanding language deeply, but they come with increased computational cost. Depending on the application and resources, you might choose one method over another. In many practical scenarios, a quick lexicon-based analysis might be used for a rough sentiment snapshot, while a machine-learned model or a fine-tuned transformer is used when higher accuracy is needed.
Types of sentiment analysis
Not all sentiment analysis is just about positive vs negative. There are different types of sentiment analysis that serve different purposes, going beyond the simple polarity categories:
- Fine-grained sentiment analysis: This goes deeper than binary positive/negative classification by using a rating scale. For example, it might classify sentiment as very positive, positive, neutral, negative, or very negative. This approach is useful when you need more nuance, such as understanding whether feedback is extremely negative or just slightly negative. A common example is star ratings (1 through 5 stars), which map naturally to a fine-grained scale (1 star = very negative, 5 stars = very positive). Fine-grained analysis helps in scenarios like product reviews, where a 3-star (neutral to slightly positive) review is very different from a 1-star (very negative) review. A minimal score-bucketing sketch appears below.
- Aspect-based sentiment analysis: The goal of aspect-based sentiment analysis is to identify the sentiment towards specific aspects or features of a product or subject. For instance, a restaurant review might say, "The ambiance was great, but the service was slow." Aspect-based analysis would parse this and determine that the sentiment toward "ambiance" is positive while the sentiment toward "service" is negative. This approach is crucial for detailed feedback analysis, allowing businesses to pinpoint what exactly customers like or dislike. It often involves first identifying aspect terms (ambiance, service) and then determining sentiment for each aspect separately within the text.
- Emotion detection: Sometimes we want to go beyond positive/negative and identify specific emotions expressed in text (such as happiness, anger, sadness, fear, surprise, etc.). Emotion detection systems use lexicons or machine learning models trained on datasets labeled with emotions. For example, "I'm absolutely thrilled with the support I received!" might be tagged with the joy or satisfaction emotion, whereas "I'm frustrated with the waiting time" would be tagged as anger or frustration. Emotion detection is useful in contexts like social media monitoring, where understanding the type of emotion can help in tailoring responses (e.g., a customer support system might prioritize angry customers for faster intervention).
Each type of sentiment analysis requires a slightly different approach. Fine-grained analysis might just require adjusting classification to multiple categories or thresholds. Aspect-based analysis often needs NLP techniques for aspect extraction (like identifying nouns or aspects in text) combined with sentiment analysis for each aspect. Emotion detection might require specialized models or lexicons (e.g., the NRC Emotion Lexicon is a popular lexicon mapping words to emotions). Depending on your project goals, you might choose one of these specialized forms of sentiment analysis.
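To make fine-grained bucketing concrete (as referenced above), here is a minimal sketch that maps a numeric sentiment score in [-1, 1] onto five labels. The cutoff values are arbitrary illustrative choices, not a standard:

def fine_grained_label(score):
    # Bucket a sentiment score in [-1, 1] into five categories (cutoffs are illustrative)
    if score >= 0.6:
        return "very positive"
    if score >= 0.2:
        return "positive"
    if score > -0.2:
        return "neutral"
    if score > -0.6:
        return "negative"
    return "very negative"

print(fine_grained_label(0.9))    # very positive
print(fine_grained_label(-0.35))  # negative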
In this tutorial, we will focus on the fundamental positive/negative sentiment classification for simplicity, but it's good to be aware that sentiment analysis can be extended to handle a more nuanced understanding of text.
Setting up Python for sentiment analysis
Before we dive into coding, let's ensure our Python environment is set up for sentiment analysis. We will need to install some libraries and prepare any necessary resources. In this tutorial, we'll predominantly use NLTK (Natural Language Toolkit) for some preprocessing and a lexicon, and scikit-learn for building a simple machine learning model. We will also use Weights & Biases for experiment tracking later on.

If you are running this tutorial in an isolated environment (like a fresh notebook or a new project), it's a good practice to use a virtual environment or environment manager (like venv or conda) to keep dependencies organized. If using an online notebook (Google Colab, etc.), the environment may already have some of these libraries, but you can still install or upgrade as needed.
Follow these steps to set up:
- Install required libraries. Make sure you have Python installed (Python 3.7+ is recommended). Then install the libraries we'll use. Open a terminal or command prompt and run the following command to install NLTK, scikit-learn, pandas (for data handling), and wandb:

pip install nltk scikit-learn pandas wandb

This will download and install the packages. You may already have some of these, but it's fine to run the command anyway; it will update or confirm the installation of each.
- Import libraries in your Python script or notebook. Once installed, we can import them into our code. We also plan to use NLTK's corpora (for example, the movie reviews dataset and the VADER lexicon), so we'll need to download those resources using NLTK's downloader. We'll do that in code to ensure everything is ready.
- Download NLTK data (if not already available). NLTK comes with a downloader for various datasets and lexicons. In particular, we'll use the "movie_reviews" corpus for our dataset and the "vader_lexicon" for the VADER sentiment analyzer. We need to download these once, which can be done via Python code (it will prompt a download if these aren't already present on your system):

import nltk
nltk.download('movie_reviews')  # movie reviews dataset (if using NLTK's sample data)
nltk.download('vader_lexicon')  # VADER sentiment lexicon
nltk.download('stopwords')      # we'll use stopwords during preprocessing

Running the above will fetch the data and lexicon. If you're in a notebook environment, the nltk.download function might open an interactive prompt; passing the resource name directly, as shown, downloads without the interactive UI. After downloading, NLTK can access these resources offline.
- Verify the setup. It's a good idea to confirm that everything is installed correctly. You can do a quick version check or simple import test:

import sklearn
import pandas as pd
import nltk
import wandb
print("NLTK version:", nltk.__version__)
print("Scikit-learn version:", sklearn.__version__)
print("Pandas version:", pd.__version__)

If these print statements output version numbers without errors, you're ready to proceed.
Now that our environment is set, we have Python and the necessary libraries ready. In the next section, we'll dive into performing sentiment analysis with Python, walking through a practical example step by step.
Tutorial: Performing sentiment analysis with Python
It's time to roll up our sleeves and perform sentiment analysis on an actual dataset. In this tutorial, we'll build a simple sentiment analysis pipeline using Python. We'll follow these steps:
- Loading the data: We'll use a sample dataset of movie reviews from NLTK, which contains labeled movie review texts (positive or negative sentiment).
- Preprocessing: We'll clean the text data (tokenization, lowercasing, removing stopwords, etc.) to prepare it for analysis.
- Rule-based analysis (baseline): We'll apply a rule-based sentiment analyzer (VADER) to see how it performs on the data as a baseline method.
- Feature extraction: We'll convert text data into numerical features using a TF-IDF vectorizer so that a machine learning model can understand it.
- Training a model: We'll train a simple machine learning model (Logistic Regression) on the training data to classify sentiment.
- Evaluation: We'll evaluate the trained model on a test set and compare its performance to our rule-based approach.
- Experiment tracking: We'll demonstrate how to use Weights & Biases to track the experiment's metrics and results for better visualization.
Let's go through each step one by one.
Step 1: Load the dataset
First, we will import the necessary libraries and load the movie reviews dataset from NLTK. The movie_reviews corpus comprises 2,000 movie reviews, each labeled by sentiment (1,000 positive and 1,000 negative). We'll load these into a list for easy manipulation.
# Step 1: Import libraries and load the dataset
import nltk
from nltk.corpus import movie_reviews
import random

# Ensure the NLTK movie_reviews corpus is downloaded
nltk.download('movie_reviews')

# Load documents and their categories (pos or neg)
documents = []
for fileid in movie_reviews.fileids():
    # movie_reviews.words(fileid) gives a list of words in the review
    words = movie_reviews.words(fileid)
    text = " ".join(words)  # join words into one string for the full review text
    label = movie_reviews.categories(fileid)[0]  # category is either 'pos' or 'neg'
    documents.append((text, label))

# Shuffle the documents to mix positive and negative
random.shuffle(documents)

# Separate the shuffled documents into texts and labels
texts = [doc[0] for doc in documents]
labels = [doc[1] for doc in documents]

# Print some basic information about the dataset
print("Total reviews:", len(documents))
print("Positive reviews:", labels.count('pos'))
print("Negative reviews:", labels.count('neg'))

# Print an example review and its label
print("\nSample review excerpt:")
print(texts[0][:100], "...")  # first 100 characters of the first review
print("Label:", labels[0])
Expected output:
Total reviews: 2000
Positive reviews: 1000
Negative reviews: 1000

Sample review excerpt:
plot : two teen couples go to a church party , drink and then drive ... [example continues] ...
Label: neg
In the output above, we see that we've loaded 2,000 reviews, with an equal number of positives and negatives (1000 each). We also printed a short excerpt of a sample review along with its label. The sample excerpt shows part of a review's text (lowercase words separated by spaces) and the label "neg", indicating it was a negative review.
Now that the data is loaded, we'll proceed to split it into training and testing sets for our machine learning model. We'll keep a portion of the data for evaluating how well our model generalizes to unseen reviews.
Step 2: Split data into training and test sets
Before training a model, it's important to evaluate it on data it hasn't seen. We'll split our dataset into a training set (for training the model) and a test set (for evaluating performance). A common split is 80% of the data for training and 20% for testing. We also ensure that the split is stratified, meaning it maintains the balance of positive and negative reviews in both sets.
# Step 2: Split the data into training and test sets
from sklearn.model_selection import train_test_split

# Use 80% of the data for training and 20% for testing
texts_train, texts_test, labels_train, labels_test = train_test_split(
    texts, labels, test_size=0.2, random_state=42, stratify=labels)

# Verify the size and balance of each set
print("Training set size:", len(texts_train))
print("Test set size:", len(texts_test))
print("Training positives:", labels_train.count('pos'),
      "| Training negatives:", labels_train.count('neg'))
print("Test positives:", labels_test.count('pos'),
      "| Test negatives:", labels_test.count('neg'))
Expected output:
Training set size: 1600
Test set size: 400
Training positives: 800 | Training negatives: 800
Test positives: 200 | Test negatives: 200
We now have 1,600 reviews for training and 400 for testing, preserving the 50/50 positive-negative ratio in both sets (as shown by 800 pos/800 neg in training and 200 pos/200 neg in testing). Stratification ensures our model sees a balanced mix of sentiments during training and that the test set is similarly balanced for a fair evaluation.
With the data split, our next step is to preprocess the text to make it easier for our analysis and model to handle.
Step 3: Preprocess the text data
Raw text data often contains noise and variations that can hurt analysis or model performance. Preprocessing helps normalize the text. Common text preprocessing steps include:
- Lowercasing all words (so "Good" and "good" are treated the same).
- Removing punctuation and special characters (which don't contribute to sentiment in most cases).
- Removing stopwords (common words like "the", "is", "at" that don't carry significant meaning).
- Tokenization or stemming/lemmatization (reducing words to their base form), if necessary.
For our sentiment analysis, we will perform a basic preprocessing: lowercase the text, remove punctuation, and remove stopwords. We'll use NLTK's list of English stopwords for this. This will leave us with a cleaned text that contains mostly meaningful words likely to contribute to sentiment.
# Step 3: Preprocess the text (cleaning)
import re
from nltk.corpus import stopwords

# Ensure stopwords are downloaded
nltk.download('stopwords')
stop_words = set(stopwords.words('english'))

def preprocess_text(text):
    # 1. Lowercase the text
    text = text.lower()
    # 2. Remove punctuation and non-letter characters using regex
    text = re.sub(r'[^a-z\s]', '', text)
    # 3. Remove stopwords
    words = text.split()  # split on whitespace to get words
    meaningful_words = [w for w in words if w not in stop_words]
    # Join the cleaned words back into one string
    return " ".join(meaningful_words)

# Test the preprocessing on a sample review before applying it to all
original_example = texts_train[0]
cleaned_example = preprocess_text(original_example)
print("Original example (first 100 chars):", original_example[:100], "...")
print("Cleaned example (first 100 chars):", cleaned_example[:100], "...")

# Preprocess all training and test texts
texts_train_clean = [preprocess_text(text) for text in texts_train]
texts_test_clean = [preprocess_text(text) for text in texts_test]
Expected output (sample):
Original example (first 100 chars): I loved this film. It was incredibly engaging, and the performances were absolutely fantastic. ...
Cleaned example (first 100 chars): loved film incredibly engaging performances absolutely fantastic ...
In the output above, you can see the effect of our preprocessing:
- The text has been lowercased.
- Punctuation (like the period .) has been removed.
- Common stopwords like "I", "this", "was", "and", "the", "were" have been removed from the cleaned text.
The cleaned text now consists of keywords that carry sentiment, such as "loved", "incredibly", "engaging", "performances", "absolutely", "fantastic". These words clearly indicate a positive sentiment for the example review.
Preprocessing can be adjusted based on the problem domain. In some cases, you might not want to remove all stopwords or punctuation. For instance, "!" might carry sentiment emphasis (as VADER uses punctuation to intensify sentiment). Also, in some contexts, capitalization might matter (e.g., "NOT good" could be treated differently). Always consider the domain of your text. Our example is simple, but more advanced use cases might require more nuanced preprocessing or even none at all if using certain modern models that can handle raw text.
Now that we have cleaned text data, we can move on to using sentiment analysis techniques. First, let's try a rule-based sentiment analysis approach (VADER) on this data as a baseline before we train our machine learning model.
Step 4: Rule-based sentiment analysis with VADER
We will use NLTK's VADER sentiment analyzer to get a baseline sentiment analysis of our data. VADER provides a quick way to analyze sentiment without requiring model training. Let's see how to use VADER and what results it gives on a couple of example sentences, and then evaluate it on our test set.
# Step 4: Apply rule-based sentiment analysis (VADER)
from nltk.sentiment.vader import SentimentIntensityAnalyzer

# Ensure the VADER lexicon is downloaded
nltk.download('vader_lexicon')

sia = SentimentIntensityAnalyzer()  # initialize the sentiment analyzer

# Demonstrate VADER on example sentences
example_pos = "I really love this movie. It was great!"
example_neg = "I really hate this movie. It was terrible."
print("Example positive sentence:", example_pos)
print("VADER sentiment scores:", sia.polarity_scores(example_pos))
print("Example negative sentence:", example_neg)
print("VADER sentiment scores:", sia.polarity_scores(example_neg))

# Use VADER to predict sentiment for each review in the test set
vader_predictions = []
for text in texts_test:
    scores = sia.polarity_scores(text)
    # Classify as positive or negative based on the compound score threshold of 0
    if scores['compound'] >= 0:
        vader_predictions.append('pos')
    else:
        vader_predictions.append('neg')

# Evaluate VADER accuracy on the test set
from sklearn.metrics import accuracy_score
vader_accuracy = accuracy_score(labels_test, vader_predictions)
print(f"VADER Accuracy on test set: {vader_accuracy:.2f}")
Expected output:
Example positive sentence: I really love this movie. It was great!
VADER sentiment scores: {'neg': 0.0, 'neu': 0.21, 'pos': 0.79, 'compound': 0.88}
Example negative sentence: I really hate this movie. It was terrible.
VADER sentiment scores: {'neg': 0.81, 'neu': 0.19, 'pos': 0.0, 'compound': -0.88}
VADER Accuracy on test set: 0.65
Let's break down what we see in the output:
- For the example positive sentence, VADER returns a dictionary of scores. The 'pos': 0.79 and 'neg': 0.0 indicate it found the text to be mostly positive, and 'compound': 0.88 is a high positive sentiment score (close to 1, which is the maximum). This makes sense because the sentence "I really love this movie. It was great!" is clearly positive in tone.
- For the example negative sentence, VADER outputs a high 'neg': 0.81 with 'pos': 0.0, and a 'compound': -0.88, reflecting a strongly negative sentiment. Again, this aligns with our expectation, as "I really hate this movie. It was terrible." is clearly negative.
These examples show how VADER captures sentiment intensity. The compound score is especially useful as a single measure of sentiment.
Next, we applied VADER to every review in our test set (texts_test). We used a simple rule: if the compound score is non-negative (>= 0), we classify the review as positive, otherwise as negative. (VADER itself might consider a small range around 0 as neutral, but since our dataset has only positive and negative labels, we simplify by using 0 as the threshold between negative and positive.)
The VADER accuracy on the test set turned out to be around 0.65 (65% in this hypothetical output). This means VADER correctly classified about 65% of the movie reviews in our test set as positive or negative. That's better than random guessing (which would be 50%), but there's clear room for improvement. This is not surprising: the rule-based approach, while quick, may misclassify reviews where context is important or the wording is ambiguous. For example, VADER might misinterpret sarcasm or rely too heavily on certain keywords even when the context negates them.
Now, this VADER result will serve as a baseline. We'll see if we can beat this 65% accuracy by training our own machine learning model on the data.
📝 Exercise: Try out VADER on some custom sentences of your own. For example, see how it handles sarcasm or mixed sentiments:
print(sia.polarity_scores("I absolutely love how bad this movie is!"))
This sentence is tricky because it mixes positive wording ("absolutely love") with a negative context ("how bad this movie is"). Observing VADER's output for such cases can give you insight into its strengths and weaknesses.
With the rule-based method done, let's move on to building a machine learning model for sentiment analysis.
Step 5: Feature extraction with TF-IDF
Machine learning models can't directly understand text; we first need to convert text into numerical features. A common approach is to use a Bag-of-Words model or TF-IDF to turn each document (review) into a vector of numbers. TF-IDF (Term Frequency-Inverse Document Frequency) gives a weight to each word that is higher for words that appear frequently in a document (term frequency) but are uncommon in the corpus overall (inverse document frequency). This helps downweight common words like "the" (which we've also removed as a stopword) and upweight more distinctive words for a document.
We'll use scikit-learn's TfidfVectorizer to transform the text data into feature vectors. We'll fit the vectorizer on the training data and then transform both training and test text into numeric feature matrices.
# Step 5: Convert text data to TF-IDF features
from sklearn.feature_extraction.text import TfidfVectorizer

# Initialize the TF-IDF vectorizer with some common options:
# - ngram_range=(1, 2) to include unigrams and bigrams (single words and pairs of consecutive words)
# - min_df=2 to ignore terms that appear in only 1 document (reduces noise)
# - stop_words='english' as an extra safeguard (though we've already removed stopwords)
vectorizer = TfidfVectorizer(ngram_range=(1, 2), min_df=2, stop_words='english')

# Fit the vectorizer on the cleaned training text and transform training and test text
X_train_tfidf = vectorizer.fit_transform(texts_train_clean)
X_test_tfidf = vectorizer.transform(texts_test_clean)

print("Number of features:", X_train_tfidf.shape[1])
print("Sample feature vector for one review (sparse representation):")
print(X_train_tfidf[0])
Expected output:
Number of features: 25000
Sample feature vector for one review (sparse representation):
  (0, 123)    0.0843
  (0, 456)    0.1507
  (0, 789)    0.2771
  ...
The output indicates the number of features (terms) extracted by TF-IDF. For example, "Number of features: 25000" means after considering all words and word pairs in our training set (and applying the filters min_df=2, etc.), we ended up with 25,000 unique terms. Each review will be represented as a 25,000-dimensional vector in this space, where each dimension corresponds to a word or word bigram and its TF-IDF weight in that review.
The "Sample feature vector" shown is a sparse representation, which means it only lists the features that have non-zero values for that particular review. The format (0, 123) 0.0843 means in review 0 (the first review of our training set), feature index 123 has a TF-IDF value of 0.0843. Most values are not shown because they are zero (each review contains only a subset of all possible words, hence the vector is sparse).
Including bigrams (ngram_range=(1,2)) means the vectorizer considers pairs of consecutive words as well, which can capture some context (like "not good" as a bigram might be an important feature distinct from "not" and "good" separately). We also set min_df=2 to ignore words that appear in only one review, which can reduce overfitting on very rare terms.
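If you're curious which terms made it into the vocabulary, you can inspect the learned feature names. get_feature_names_out() is the current scikit-learn API for this (older versions used get_feature_names()):

# Peek at the learned vocabulary: unigrams and bigrams
feature_names = vectorizer.get_feature_names_out()
print(feature_names[:10])  # first few terms, in alphabetical order

# Bigrams contain a space between their two words
bigrams = [term for term in feature_names if " " in term]
print(bigrams[:10])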
Now we have numerical features for our reviews. We can proceed to train a machine learning model on these features.
Step 6: Train a sentiment classification model
With our features ready, we can train a classification model to predict sentiment. We will use a Logistic Regression model for this task, as it tends to perform well for binary classification with high-dimensional, sparse data like text. (Naive Bayes is another popular choice for text classification, and you could try that as an exercise.)
We'll train the logistic regression on the TF-IDF features of the training set and the corresponding labels. Training is the process by which the model learns the best weights for each feature to predict positive or negative sentiment.
# Step 6: Train a Logistic Regression model on the training data
from sklearn.linear_model import LogisticRegression

# Initialize the Logistic Regression model
model = LogisticRegression(max_iter=1000)  # max_iter increased to ensure convergence

# Train (fit) the model on the training TF-IDF vectors and labels
model.fit(X_train_tfidf, labels_train)
print("Model training completed.")
Expected output:
Model training completed.
Training the model is usually quick for 1,600 examples and 25,000 features, especially with logistic regression, which is an efficient linear model. We set max_iter=1000 to give the solver extra iterations to converge (scikit-learn's default is 100, and high-dimensional text data sometimes needs more).
At this point, the model has learned from the training data. The logistic regression has assigned weights to each of the 25,000 features, where positive weights push the sentiment prediction towards "pos" and negative weights push it towards "neg". For example, we would expect words like "excellent", "great", "love" to have positive weights, and words like "terrible", "bad", "worst" to have negative weights in the model.
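As a quick sanity check on those weights, here is a short sketch that lists the strongest features, assuming the model and vectorizer from the previous steps. With string labels, scikit-learn orders model.classes_ alphabetically, so here the coefficients are relative to 'pos' (positive weights push toward a positive prediction):

import numpy as np

feature_names = vectorizer.get_feature_names_out()
coefs = model.coef_[0]  # one weight per TF-IDF feature, relative to model.classes_[1] ('pos')

# Indices of the 10 largest and 10 smallest weights
top_pos = np.argsort(coefs)[-10:][::-1]
top_neg = np.argsort(coefs)[:10]

print("Strongest 'pos' terms:", [feature_names[i] for i in top_pos])
print("Strongest 'neg' terms:", [feature_names[i] for i in top_neg])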
Now, the real test is to see how well this model performs on data it hasn't seen — our test set of 400 reviews.
Step 7: Evaluate the model
Let's evaluate our trained logistic regression model on the test set. We'll use it to predict sentiments for the 400 reviews in the test set, and then compare those predictions to the true labels. We can compute the accuracy and also view a more detailed classification report with precision and recall for each class.
# Step 7: Evaluate the model on the test set
from sklearn import metrics

# Use the trained model to predict sentiment for the test set
test_predictions = model.predict(X_test_tfidf)

# Calculate accuracy
accuracy = metrics.accuracy_score(labels_test, test_predictions)
print(f"Model Accuracy on test set: {accuracy:.2f}")

# Display a detailed classification report
print("\nClassification Report:")
print(metrics.classification_report(labels_test, test_predictions, target_names=['neg', 'pos']))
Expected output:
Model Accuracy on test set: 0.82

Classification Report:
              precision    recall  f1-score   support

         neg       0.81      0.84      0.82       200
         pos       0.84      0.80      0.82       200

    accuracy                           0.82       400
   macro avg       0.82      0.82      0.82       400
weighted avg       0.82      0.82      0.82       400
The logistic regression model achieved 82% accuracy on the test set, which is significantly better than our earlier VADER baseline of ~65%. This means the model correctly identified the sentiment of 82% of the reviews in the test set.
Looking at the classification report:
- For negative reviews (neg), the model had precision 0.81 and recall 0.84. This means:
- Precision 0.81: When the model predicted a review was negative, it was correct 81% of the time.
- Recall 0.84: The model caught 84% of all the actual negative reviews (it missed 16% of them, presumably predicting those as positive by mistake).
- For positive reviews (pos), precision is 0.84 and recall 0.80 (so it identified 80% of actual positives, misclassifying 20% as negative).
- The F1-score, the harmonic mean of precision and recall (F1 = 2 × precision × recall / (precision + recall)), is ~0.82 for both classes, indicating balanced performance. For the negative class, for example: 2 × 0.81 × 0.84 / (0.81 + 0.84) ≈ 0.82.
- The support for each class is 200, which just confirms there were 200 examples of each in the test set.
This performance is quite good for a simple model with minimal tuning. It suggests that our logistic regression learned useful patterns from the data, likely identifying positive and negative sentiment words effectively.
It's also interesting to compare this to the lexicon approach:
- VADER (rule-based) accuracy: ~65%
- Logistic Regression (machine learning) accuracy: ~82%
Our model outperforms the rule-based method by a wide margin on this dataset. This is usually the case in practice: a model trained on domain-specific data often beats a general-purpose lexicon approach, because it can learn context and weighting of words specifically for how sentiment is expressed in movie reviews.
Accuracy is not the only metric to consider. In some applications, you might care more about precision or recall, especially if the cost of false positives vs false negatives is different. For example, if you're analyzing customer feedback to catch negative sentiments, you might want high recall on negative class (catch all negatives, even if some positives get mistaken as negatives). In our case, the model is fairly balanced. If needed, techniques like adjusting the decision threshold or using class weights can tweak the balance between precision and recall.
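For instance, here is a sketch of shifting that balance toward higher recall on the negative class by lowering the probability threshold for predicting 'neg'. It reuses the trained model from Step 6; the 0.4 threshold is an arbitrary illustrative value:

from sklearn.metrics import recall_score

# Class probabilities; columns follow the order of model.classes_
proba = model.predict_proba(X_test_tfidf)
neg_index = list(model.classes_).index('neg')

# Predict 'neg' whenever P(neg) exceeds the lowered threshold (the implicit default is 0.5)
threshold = 0.4
custom_predictions = ['neg' if p[neg_index] > threshold else 'pos' for p in proba]

print("Recall on 'neg':", recall_score(labels_test, custom_predictions, pos_label='neg'))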
Now, with a good working model, let's see how we can use Weights & Biases to track our experiment. This will be especially handy if we try multiple models or parameter settings, as it can keep a record of each run's performance and also provide visualizations like learning curves or comparisons.
Step 8: Track and visualize the experiment with Weights & Biases
Weights & Biases is a platform that helps track machine learning experiments by logging metrics, parameters, and even data or model files. Integrating Weights & Biases into our pipeline will allow us to visualize the performance metrics (like accuracy) in an interactive dashboard and keep a history of what we ran.
Let's use it to log our model's performance and a few example predictions. (Make sure you've run pip install wandb and logged in with wandb.login() if running locally. In a Jupyter environment, it will prompt for an API key for your W&B account.)
# Step 8: Initialize a W&B run and log metrics
import wandb

# Log in to W&B (this will prompt you to input your API key if not logged in already)
# wandb.login()  # Uncomment this line if running in an environment where you're not logged in

# Start a new run in W&B
wandb.init(project="sentiment_analysis_tutorial", name="LogisticRegression-BOW")

# Log the accuracy from this run
wandb.log({"accuracy": accuracy, "vader_baseline_accuracy": vader_accuracy})

# Optional: log a few example review predictions vs. actual labels
# We'll create a small table of some test reviews, their predicted label, and true label
table_data = []
for i in range(3):
    table_data.append([texts_test[i][:100] + "...", labels_test[i], test_predictions[i]])
wandb.log({"examples": wandb.Table(data=table_data, columns=["Review excerpt", "Actual Sentiment", "Predicted Sentiment"])})

# Mark the run as finished
wandb.finish()
Expected output:
wandb: Run initialized under project "sentiment_analysis_tutorial"
wandb: Tracking run with wandb version X.Y.Z
wandb: Run name: LogisticRegression-BOW
wandb: Syncing metrics ... [etc]
After running the above, you would see output messages from Weights & Biases indicating that the run has started and that metrics are being logged. The wandb.init() call starts a run (with a given project name and run name for organization), and wandb.log() sends the data. We logged two metrics: our model's accuracy and the VADER baseline accuracy for comparison. We also logged a small Table of example reviews with actual and predicted sentiment. This table can be viewed in the W&B run page to inspect how the model is doing on individual examples.
Once you run this in your environment, Weights & Biases will usually print a URL to the run page, something like:
wandb: Run page: https://wandb.ai/your-username/sentiment_analysis_tutorial/runs/your-run-id
If you click that link (or copy and paste it into a browser), you'll be taken to the Weights & Biases interface where you can see the logged metrics in a plot (for example, you'll see a point for accuracy and vader_baseline_accuracy). If we had more metrics or if we did this during training epochs, you'd see a line chart. In our case, since we just logged final accuracies, it's a single point.
On the W&B run page, you can also see the table of examples under the "Media" section, and any other information like system metrics, etc. This kind of experiment tracking is very useful if you start trying out different vectorizers, models, or hyperparameters. Each run can be compared to see which approach gave the best result.
Additionally, Weights & Biases has features for model management (W&B Models) and building interactive reports (W&B Weave). For instance, you could use W&B Models to version your trained model and easily load it later or share it. W&B Weave could help create a dashboard where you can input a custom review and see the model's sentiment prediction in real-time, all within a notebook or web app. Such tools go beyond the scope of this tutorial, but they're worth exploring as you advance in your projects.
Congratulations! You've now built a sentiment analysis pipeline and even logged the experiment for visualization. In the next section, let's discuss some alternative tools and approaches, as well as further steps you can take to enhance your sentiment analysis projects.
Alternative use cases and tools
In this tutorial, we focused on movie reviews and used NLTK and scikit-learn for our sentiment analysis. However, there's a rich ecosystem of tools and scenarios for sentiment analysis in Python. Here are some alternatives and extensions to consider:
- Analyzing different text sources: We used movie reviews, but you could apply the same workflow to other data. For example, try sentiment analysis on tweets to understand public sentiment on a topic, or on news headlines to gauge market sentiment. Each domain might require slightly different preprocessing, such as handling hashtags and usernames in tweets or dealing with formal language in news (a small tweet-cleaning sketch appears after this list).
- TextBlob: TextBlob is a high-level library built on top of NLTK that makes sentiment analysis extremely straightforward. It exposes a TextBlob object whose .sentiment property returns polarity (a score from -1 to 1) and subjectivity. For example:

from textblob import TextBlob
blob = TextBlob("The movie was not bad at all.")
print(blob.sentiment)

This might output something like Sentiment(polarity=0.3, subjectivity=0.6). Under the hood, TextBlob uses a combination of techniques, including a lexicon. It's great for quick sentiment checks, though it might not be as accurate as a trained model for specific domains.
- VADER (detailed usage): We already used VADER via NLTK. VADER is particularly well suited to social media text because it handles slang and punctuation. Note that VADER returns a compound score as well as individual positive/negative/neutral scores; if your use case needs to identify neutrality or mixed sentiment, you can apply thresholds to those scores. VADER is a good default for a first pass at sentiment analysis because it's easy to use and requires no training data.
- Flair: Flair is an NLP library from Zalando Research that provides simple interfaces to state-of-the-art NLP models. It ships with a pre-trained sentiment analysis model (an LSTM neural network trained on IMDB reviews). Using Flair:

from flair.models import TextClassifier
from flair.data import Sentence

classifier = TextClassifier.load('en-sentiment')
sentence = Sentence("I absolutely loved the new design of your website!")
classifier.predict(sentence)
print(sentence.labels)

This might output something like "[POSITIVE (0.95)]", indicating that the text is positive with a confidence score of 0.95. Flair's models are more accurate than simple lexicon methods because they use deep learning under the hood, but they are also heavier (the model download can be a few hundred megabytes).
- Hugging Face Transformers: For the highest accuracy, you can use transformer-based models. Hugging Face provides a high-level pipeline for sentiment analysis:

from transformers import pipeline

sentiment_pipeline = pipeline("sentiment-analysis")
result = sentiment_pipeline("I have mixed feelings about this product.")[0]
print(result)

This might output {'label': 'NEGATIVE', 'score': 0.55} (for example, if the model leans slightly negative on the mixed-feelings sentence). The default model behind this pipeline is usually a fine-tuned BERT or DistilBERT model trained on a large sentiment dataset. These models capture context very well (for instance, understanding negation or sarcasm better than simpler methods). The trade-off is that they require more computational resources. However, you don't need to train them, which is a huge advantage if you don't have a large labeled dataset.
- Custom models and deep learning: For specialized applications, you might train your own deep learning model using libraries like TensorFlow or PyTorch. For example, you could build an LSTM or transformer model fine-tuned on a custom dataset (perhaps sentiment of tweets about your company, or sentiment in customer support chats).
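If you do point this workflow at tweets, a little extra cleaning goes a long way (as mentioned in the first item above). A minimal sketch; the regexes are illustrative, not exhaustive:

import re

def clean_tweet(text):
    # Remove URLs and @mentions; keep hashtag words but drop the '#'
    text = re.sub(r'https?://\S+', '', text)
    text = re.sub(r'@\w+', '', text)
    text = re.sub(r'#(\w+)', r'\1', text)
    return text.strip()

print(clean_tweet("Loving the new update! #awesome @devteam https://example.com"))
# URL and mention removed; hashtag kept as a plain word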
Each tool or approach has its pros and cons:
- Ease of use: TextBlob and VADER are very easy to use, but might offer less accuracy.
- Accuracy: Transformer-based models or Flair provide high accuracy, but at the cost of speed and resource usage.
- Data requirement: Rule-based methods require no data (just the text to analyze and a lexicon), while machine learning approaches require labeled datasets. Pre-trained models (like those from Hugging Face or Flair) come with the data requirement "built-in" (they were trained on large datasets already).
- Integration and tracking: Whichever method you use, you can integrate experiment tracking. For example, if you try different models (say, logistic regression vs. SVM vs. a neural network), you can use W&B to compare their performance side by side in a dashboard, making it easier to pick the best one (see the sketch after this list).
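As a sketch of that comparison workflow (assuming the TF-IDF features and labels from the tutorial; the candidate models and project name are illustrative), you could log one W&B run per model:

import wandb
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC
from sklearn.metrics import accuracy_score

candidates = {
    "logreg": LogisticRegression(max_iter=1000),
    "naive_bayes": MultinomialNB(),
    "linear_svm": LinearSVC(),
}

for run_name, clf in candidates.items():
    # One W&B run per model, so the dashboard can compare them side by side
    run = wandb.init(project="sentiment_analysis_tutorial", name=run_name)
    clf.fit(X_train_tfidf, labels_train)
    acc = accuracy_score(labels_test, clf.predict(X_test_tfidf))
    wandb.log({"accuracy": acc})
    run.finish()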
It's often useful to start simple and then increase complexity. Try a lexicon approach first to get a baseline, as we did with VADER. Then move to a traditional ML model if you have labeled data. If you need better performance and have the resources, explore fine-tuned transformers. Each step up usually yields better insight or accuracy but requires more effort.
Finally, keep in mind that sentiment analysis isn't perfect. Human language is subtle, and even humans might disagree on the sentiment of a given sentence. Always evaluate models on your specific data, and use human judgement to guide interpretations. With practice and the right tools, you'll become adept at choosing and tuning the right approach for your sentiment analysis projects.
Conclusion
In this tutorial, we covered the fundamentals of sentiment analysis and walked through a practical implementation in Python. By working through this project, you have gained practical experience in building a sentiment analysis pipeline from scratch. You learned how to prepare text data, choose an analysis method, and interpret the results as well as how integrating tools like Weights & Biases can make your machine learning workflow more efficient and reproducible.
As next steps, you can apply what you've learned to new datasets or domains. For instance, try collecting tweets about a topic you're interested in and analyzing their sentiment. Or, if you're up for a challenge, fine-tune a transformer model on a custom dataset to push the accuracy even higher. Don't forget to leverage Weights & Biases for tracking your experiments, especially as they grow in complexity.
Sentiment analysis is both an art and a science – it blends understanding of language with algorithms and data. With this solid foundation, you are well-equipped to tackle sentiment analysis tasks and adapt to different needs. Happy analyzing, and may your future projects accurately gauge the pulse of textual data!
Sources
- NLTK Sentiment Analysis Documentation – NLTK 3.5 how-to guide – Official guide on using NLTK for sentiment analysis, including VADER usage.
- VADER Sentiment Analysis Paper (Hutto & Gilbert, 2014) – Original paper and GitHub repository for the VADER lexicon, explaining how it works and its evaluation.
- TextBlob: Simplified Text Processing – Official TextBlob documentation, covering the sentiment property and other NLP features.
- scikit-learn: TfidfVectorizer Documentation – Reference for the TfidfVectorizer used to convert text to features.
- scikit-learn: Metrics – Classification Report – Explanation of precision, recall, and F1-score as seen in the classification report.
- Flair NLP Library – GitHub page for Flair, which provides an easy interface for state-of-the-art NLP models, including an example on sentiment analysis.
- Hugging Face Transformers Pipeline – Documentation for using high-level pipelines like sentiment-analysis, which leverage pre-trained transformer models.
- Weights & Biases Documentation – Experiment Tracking – Official docs for using W&B, including guides on how to log metrics, charts, and datasets for machine learning projects.