Master sentiment analysis in Python
Unlock the power of sentiment analysis with Python! Learn to categorize text emotions using NLTK & scikit-learn. Boost your skills with hands-on techniques.
Sentiment analysis is a powerful technique in natural language processing that extracts emotional tone from text data. It allows us to automatically categorize text as positive, negative, or neutral in sentiment. Using Python for sentiment analysis is ideal due to its versatility and the availability of powerful libraries like NLTK and scikit-learn. In this hands-on tutorial, you'll learn how to perform sentiment analysis from scratch in Python. We'll walk through the entire process step by step – from understanding what sentiment analysis is, to building a model, evaluating it, and even integrating Weights & Biases (W&B) for experiment tracking and visualization.

By the end of this guide, you will have a working sentiment analysis pipeline and a solid grasp of how to apply sentiment analysis to real-world text data. We will also highlight how tools from Weights & Biases can enhance your workflow, such as tracking model performance or creating interactive visualizations for analysis. Let's dive in and start with the basics of sentiment analysis and why it matters.
What is sentiment analysis and why is it important?
Sentiment analysis is a natural language processing (NLP) technique used to determine the emotional tone behind a body of text. In practice, it involves classifying text (such as a sentence or review) into sentiment categories like positive, negative, or neutral. For example, a product review saying "This phone is amazing!" would be classified as having a positive sentiment, whereas "I'm very disappointed with this phone" expresses a negative sentiment. By quantifying subjective information from text, sentiment analysis helps transform qualitative sentiments into actionable data.
Understanding sentiment is important because it provides insight into public opinion and human emotions on a large scale. Businesses use sentiment analysis to gauge customer satisfaction by analyzing reviews and social media posts. It helps in reputation management by identifying negative mentions of a brand early. In politics and public policy, sentiment analysis can measure public reaction to statements or events. Overall, sentiment analysis enables automated understanding of attitudes and emotions, which is invaluable for making data-driven decisions in many fields.
Moreover, sentiment analysis is a key component in systems like chatbots and recommendation engines, where understanding user sentiment can lead to more empathetic and relevant interactions. Its importance spans industries – from marketing (to understand consumer feedback), finance (to analyze market sentiment from news or tweets), to healthcare (to analyze patient feedback or even to monitor mental health through language). By converting unstructured text into structured sentiment data, organizations can uncover trends and patterns that might otherwise be missed, making sentiment analysis a powerful tool in the modern data toolkit.
Practical applications of sentiment analysis
Sentiment analysis has a wide range of practical applications across various industries. Here are a few notable examples that highlight its impact on business and research:
- Social media monitoring: Companies analyze tweets, Facebook posts, and other social media content to understand public sentiment about their products or brand. This real-time feedback helps in managing brand reputation and responding promptly to customer concerns or viral trends.
- Customer reviews and service: E-commerce platforms and service providers use sentiment analysis on product reviews or customer support tickets. By automatically gauging whether feedback is positive or negative, businesses can prioritize addressing negative feedback and improve their products and services. It also enables aggregating thousands of reviews to get an overall sentiment score for products.
- Market research and finance: In finance, analysts use sentiment analysis on news articles and financial reports to predict market movements. For example, sentiment scores of news headlines about a company can be an indicator of its stock performance. Market research firms also analyze sentiment in survey responses or online forums to measure consumer confidence and preferences.
- Healthcare and sociology: Sentiment analysis is applied to patient feedback, medical forums, or therapy session transcripts to detect sentiments that might indicate patient satisfaction or emotional well-being. In sociology and linguistics research, analyzing sentiment in large collections of texts (like literature or political speeches) can reveal insights about public mood and historical trends.
- Political and social analysis: During elections or major political events, sentiment analysis of tweets and news can gauge public opinion and reaction. Governments and NGOs may analyze social media sentiment on policy announcements or social issues to understand how people feel and respond accordingly.
These applications show that sentiment analysis is a versatile tool. By systematically evaluating emotions in text, organizations can make more informed decisions. For instance, a sudden surge in negative sentiment on social media about a product can alert a company to a potential issue, allowing them to intervene quickly. In summary, sentiment analysis turns qualitative text feedback into quantitative insight, impacting strategies in marketing, customer service, product development, and beyond.
Methodologies for performing sentiment analysis
There are several methodologies to perform sentiment analysis, each with its own approach and characteristics. The three main approaches are lexicon-based methods, machine learning-based methods, and transformer-based (deep learning) methods. Let's briefly examine each:
- Lexicon-based approach: This method relies on predefined lexical resources (dictionaries of words) where each word is associated with a sentiment score. The analysis involves counting or summing sentiment scores of words in the text to determine the overall sentiment. For example, words like "good", "happy", or "excellent" might contribute positive points, while "bad", "sad", or "terrible" contribute negative points. Lexicon-based methods are simple and easy to implement, requiring no training data. However, they have limitations: they often ignore context (e.g., sarcasm or negation like "not good"), and their accuracy depends heavily on the quality of the lexicon and rules.
- Machine learning approach: This approach treats sentiment analysis as a text classification problem. First, a labeled dataset of texts with known sentiments (positive/negative labels, for instance) is required. The text is converted into features (such as word frequencies or embeddings), and then a machine learning model is trained on these features to learn how to classify new texts. Common algorithms include logistic regression, naive Bayes, or support vector machines for simpler tasks, and they can achieve better accuracy than lexicon-based methods by learning from context patterns in data. The downside is that they require annotated data and computational power for training. They also may not generalize well beyond the data they were trained on unless carefully validated.
- Transformer-based deep learning approach: In recent years, transformer models (like BERT, RoBERTa, or GPT-based models) have revolutionized NLP, including sentiment analysis. These models are typically pre-trained on massive text corpora and can be fine-tuned on sentiment analysis tasks. Approaches using transformers often involve either using a pre-trained model directly for sentiment (for example, via Hugging Face transformers pipeline) or fine-tuning a model on a specific sentiment dataset (like movie reviews). Transformer-based methods usually achieve the highest accuracy because they capture complex language context and subtleties such as negation and sarcasm. They can understand that "I don't hate it" is different from "I hate it", something that simpler models or lexicons might miss. The trade-off is that they are resource-intensive and can be more complex to implement. They also often require access to pre-trained model weights and possibly a GPU for efficient processing.
Each methodology differs in complexity and performance. Lexicon-based techniques are fast and interpretable but may miss nuance. Machine learning models require data but can capture context-specific sentiment better. Transformer models provide state-of-the-art performance by understanding language deeply, but they come with increased computational cost. Depending on the application and resources, you might choose one method over another. In many practical scenarios, a quick lexicon-based analysis might be used for a rough sentiment snapshot, while a machine-learned model or a fine-tuned transformer is used when higher accuracy is needed.
Rule-based sentiment analysis
Rule-based sentiment analysis is a lexicon-based approach where we define a set of rules to compute sentiment from text. This often involves using a sentiment lexicon: a dictionary of words associated with predetermined sentiment scores. For example, a lexicon might assign +3 to "excellent", -2 to "poor", and so on. The simplest rule-based sentiment analyzer might sum up scores of all sentiment-bearing words in a sentence. If the total score is positive, the sentiment is positive; if negative, the sentiment is negative.
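To make the rule concrete, here is a minimal sketch of a toy lexicon scorer. The tiny dictionary and the summing rule are illustrative only; real lexicons like VADER's contain thousands of scored entries:

```python
# Toy sentiment lexicon: word -> score (illustrative values, not a real resource)
LEXICON = {"excellent": 3, "amazing": 2, "good": 1, "poor": -2, "bad": -2, "terrible": -3}

def lexicon_sentiment(text: str) -> str:
    # Sum the scores of every lexicon word that appears in the text
    score = sum(LEXICON.get(word, 0) for word in text.lower().split())
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"

print(lexicon_sentiment("The acting was excellent but the plot was poor"))  # positive (3 - 2 = 1)
```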
One popular tool for rule-based sentiment analysis is VADER (Valence Aware Dictionary and sEntiment Reasoner). VADER is tailored for social media texts and considers aspects like capitalization, degree modifiers (e.g., "very"), and punctuation (e.g., "!!!") to adjust sentiment intensity. It provides a compound sentiment score between -1 (most negative) and +1 (most positive), along with separate scores for positive, negative, and neutral components of text. Another user-friendly library is TextBlob, which under the hood uses a lexicon and rule-based approach (as well as some machine learning for certain tasks) to yield sentiment polarity and subjectivity.
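For a quick look at how VADER is called in practice (we will use it again in the tutorial below), NLTK exposes it through SentimentIntensityAnalyzer; the exact scores you see may vary slightly with the lexicon version:

```python
import nltk
from nltk.sentiment.vader import SentimentIntensityAnalyzer

nltk.download("vader_lexicon", quiet=True)  # fetch the lexicon once
sia = SentimentIntensityAnalyzer()

# polarity_scores returns neg/neu/pos components plus a compound score in [-1, 1]
print(sia.polarity_scores("This phone is AMAZING!!!"))
print(sia.polarity_scores("This phone is amazing."))  # lower intensity without caps and '!!!'
```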
Rule-based methods are straightforward and interpretable. They don't need any training data and can be set up quickly. For instance, a company might use a simple keyword-based sentiment rule to scan incoming customer emails for extremely negative words to flag urgent issues. However, these methods can struggle with the complexity of language:
- They often fail to understand context or sarcasm (for example, "I love waiting in long lines 🙄" would likely be misclassified as positive due to the word "love").
- They treat each word independently, so phrases like "not good" could be misinterpreted if "not" isn't properly accounted for as a negation rule.
- Slang and context-specific meanings can be missed unless the lexicon is comprehensive.
Despite these limitations, rule-based sentiment analysis remains useful for quick analyses or as a component in more complex systems (for instance, generating features for a machine learning model). Many practitioners start with rule-based tools like VADER or TextBlob to get a baseline sentiment analysis before moving to more sophisticated approaches. In the following sections, we'll get hands-on and actually use a rule-based analyzer (VADER) as part of our tutorial to see how it works in practice.
Types of sentiment analysis
Not all sentiment analysis is just about positive vs negative. There are different types of sentiment analysis that serve different purposes, going beyond the simple polarity categories:
- Fine-grained sentiment analysis: This type goes deeper than binary positive/negative and often uses a rating scale. For example, it might classify sentiment as very positive, positive, neutral, negative, or very negative. This fine-grained approach is useful when you need more nuance, such as understanding if feedback is extremely negative or just slightly negative. A common example is star ratings (1 through 5 stars) which can be mapped to a fine-grained sentiment (1 star = very negative, 5 stars = very positive). Fine-grained analysis helps in scenarios like product reviews where a 3-star (neutral to slightly positive) review is very different from a 1-star (very negative) review.
- Aspect-based sentiment analysis: In aspect-based sentiment analysis, the goal is to identify the sentiment towards specific aspects or features of a product or subject. For instance, a restaurant review might say, "The ambiance was great but the service was slow." Aspect-based analysis would parse this and determine that the sentiment toward "ambiance" is positive while the sentiment toward "service" is negative. This approach is crucial for detailed feedback analysis, allowing businesses to pinpoint what exactly customers like or dislike. It often involves first identifying aspect terms (ambiance, service) and then determining sentiment for each aspect separately within the text.
- Emotion detection: Sometimes we want to go beyond positive/negative and identify specific emotions expressed in text (such as happiness, anger, sadness, fear, surprise, etc.). Emotion detection systems use lexicons or machine learning models trained on datasets labeled with emotions. For example, "I'm absolutely thrilled with the support I received!" might be tagged with the joy or satisfaction emotion, whereas "I'm frustrated with the waiting time" would be tagged as anger or frustration. Emotion detection is useful in contexts like social media monitoring, where understanding the type of emotion can help in tailoring responses (e.g., a customer support system might prioritize angry customers for faster intervention).
Each type of sentiment analysis requires a slightly different approach. Fine-grained analysis might just require adjusting classification to multiple categories or thresholds. Aspect-based analysis often needs NLP techniques for aspect extraction (like identifying nouns or aspects in text) combined with sentiment analysis for each aspect. Emotion detection might require specialized models or lexicons (e.g., NRC Emotion Lexicon is a popular lexicon mapping words to emotions). Depending on your project goals, you might choose one of these specialized forms of sentiment analysis. In this tutorial, we will focus on the fundamental positive/negative sentiment classification for simplicity, but it's good to be aware that sentiment analysis can be extended to handle more nuanced understanding of text.
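To illustrate the fine-grained idea, a common shortcut is to bucket a continuous score (such as VADER's compound value) into five categories. The thresholds below are arbitrary choices for this sketch and should be tuned on real data:

```python
def fine_grained(compound: float) -> str:
    # Map a compound score in [-1, 1] onto a five-point sentiment scale.
    # Threshold values here are illustrative assumptions, not standard constants.
    if compound >= 0.6:
        return "very positive"
    if compound >= 0.2:
        return "positive"
    if compound > -0.2:
        return "neutral"
    if compound > -0.6:
        return "negative"
    return "very negative"

print(fine_grained(0.69))   # very positive
print(fine_grained(-0.35))  # negative
```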
Setting up Python for sentiment analysis
Before we dive into coding, let's ensure our Python environment is set up for sentiment analysis. We will need to install some libraries and prepare any necessary resources. In this tutorial, we'll predominantly use NLTK (Natural Language Toolkit) for some preprocessing and a lexicon, and scikit-learn for building a simple machine learning model. We will also use Weights & Biases for experiment tracking later on.

💡 Tip: If you are running this tutorial in an isolated environment (like a fresh notebook or a new project), it's a good practice to use a virtual environment or environment manager (like venv or conda) to keep dependencies organized. If using an online notebook (Google Colab, etc.), the environment may already have some of these libraries, but you can still install or upgrade as needed.
Follow these steps to set up:
- Install required libraries. Make sure you have Python installed (Python 3.7+ is recommended). Then install the libraries we'll use. Open a terminal or command prompt and run the following command to install NLTK, scikit-learn, pandas (for data handling), and wandb:

```bash
pip install nltk scikit-learn pandas wandb
```
- This will download and install the packages. You may already have some of these, but it's okay to run the command — it will update or confirm the installation of each.
- Import libraries in your Python script or notebook. Once installed, we can import them in our code. We also plan to use NLTK's corpora (for example, the movie reviews dataset and the VADER lexicon), so we'll need to download those resources using NLTK's downloader. We'll do that in code to ensure everything is ready.
- Download NLTK data (if not already available). NLTK comes with a downloader for various datasets and lexicons. In particular, we'll use the "movie_reviews" corpus for our dataset and the "vader_lexicon" for the VADER sentiment analyzer. We need to download these once. This can be done via Python code (it will prompt a download if these aren't already present on your system):

```python
import nltk

nltk.download('movie_reviews')  # movie reviews dataset (if using NLTK's sample data)
nltk.download('vader_lexicon')  # VADER sentiment lexicon
nltk.download('stopwords')      # we'll use stopwords during preprocessing
```
- Running the above will fetch the data and lexicon. If you're in a notebook environment, the nltk.download function might open an interactive prompt. Passing the resource name directly as shown ensures it downloads without the interactive UI. After downloading, NLTK can access these resources offline.
- Verify the setup. It's a good idea to verify that everything is installed correctly. You can do a quick version check or simple import test:

```python
import sklearn
import pandas as pd
import nltk
import wandb

print("NLTK version:", nltk.__version__)
print("Scikit-learn version:", sklearn.__version__)
print("Pandas version:", pd.__version__)
```
- This is just to ensure there are no import errors and to check the versions. If these print statements output version numbers without errors, you're ready to proceed.
Now that our environment is set, we have Python and the necessary libraries ready. In the next section, we'll dive into performing sentiment analysis with Python, walking through a practical example step by step.
Performing sentiment analysis with Python: A tutorial
It's time to roll up our sleeves and perform sentiment analysis on an actual dataset. In this project we are going to build a complete sentiment analysis pipeline from raw movie reviews to a working neural network classifier. The goal is to take plain text reviews and automatically decide whether the overall opinion is positive or negative. To do this, we combine classical natural language processing techniques with a simple deep learning model.
We begin with the NLTK movie reviews dataset, which provides 2,000 labeled examples evenly split between positive and negative sentiment. Each review is cleaned and normalized so that the text is easier to work with. After preprocessing, we transform the words into numerical vectors using TF-IDF, a method that assigns weights to terms based on how important they are within a document compared to the whole collection. This produces high-dimensional feature vectors that capture the presence of both single words and short phrases.
With the features in hand, we train a multilayer perceptron (MLP) in PyTorch. The MLP is a straightforward feedforward neural network with hidden layers, ReLU activations, and dropout for regularization. It takes the TF-IDF vectors as input and learns to output a score that represents positive or negative sentiment. We optimize the model with binary cross-entropy loss and the AdamW optimizer, while also applying gradient clipping and other stability techniques.
To track progress, we log training and validation metrics to Weights & Biases so we can watch the model’s performance evolve over epochs. Once training is complete, we evaluate the network with accuracy, a classification report, and a confusion matrix. Finally, we test the trained classifier on a few custom sentences to demonstrate how it generalizes beyond the training data.
Here's the code:
```python
import random
import re
import sys
import time
from typing import List, Tuple

import numpy as np
import scipy.sparse as sp
import nltk
from nltk.corpus import movie_reviews, stopwords
from nltk.sentiment.vader import SentimentIntensityAnalyzer
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
import torch
from torch import nn
from torch.utils.data import Dataset, DataLoader

# ---------------------- setup ----------------------
SEED = 42

def set_seed(seed=SEED):
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    if torch.cuda.is_available():
        torch.cuda.manual_seed_all(seed)

USE_WANDB = True
try:
    import wandb
except Exception:
    USE_WANDB = False
    wandb = None

def setup_nltk():
    nltk.download("movie_reviews", quiet=True)
    nltk.download("vader_lexicon", quiet=True)
    nltk.download("stopwords", quiet=True)

# ---------------------- data ----------------------
def load_movie_reviews() -> Tuple[List[str], List[str]]:
    docs = []
    for fileid in movie_reviews.fileids():
        text = " ".join(movie_reviews.words(fileid))
        label = movie_reviews.categories(fileid)[0]
        docs.append((text, label))
    random.shuffle(docs)
    texts = [t for t, _ in docs]
    labels = [y for _, y in docs]
    return texts, labels

def preprocess_texts(texts: List[str]) -> List[str]:
    sw = set(stopwords.words("english"))
    for keep in ["no", "not", "nor"]:
        if keep in sw:
            sw.remove(keep)

    def clean(t: str) -> str:
        t = t.lower()
        t = re.sub(r"[^a-z\s]", " ", t)
        words = [w for w in t.split() if w not in sw]
        return " ".join(words)

    return [clean(t) for t in texts]

def vader_baseline(texts_test: List[str], labels_test: List[str]) -> float:
    sia = SentimentIntensityAnalyzer()
    preds = []
    for t in texts_test:
        c = sia.polarity_scores(t)["compound"]
        preds.append("pos" if c >= 0 else "neg")
    return accuracy_score(labels_test, preds)

def vectorize(train_clean: List[str], test_clean: List[str]):
    # sublinear_tf and dtype=float32 keep values small/stable
    vec = TfidfVectorizer(
        ngram_range=(1, 2),
        min_df=2,
        stop_words="english",
        sublinear_tf=True,
        norm="l2",
        dtype=np.float32,
    )
    X_train = vec.fit_transform(train_clean)
    X_test = vec.transform(test_clean)
    return vec, X_train, X_test

def to_binary(labels: List[str]) -> np.ndarray:
    return np.array([1 if y == "pos" else 0 for y in labels], dtype=np.int64)

# ---------------------- dataset ----------------------
class TFIDFDataset(Dataset):
    def __init__(self, X_csr: sp.csr_matrix, y: np.ndarray):
        self.X = X_csr
        self.y = y

    def __len__(self):
        return self.X.shape[0]

    def __getitem__(self, idx):
        row = self.X[idx]
        x = torch.from_numpy(row.toarray().astype(np.float32).ravel())
        y = torch.tensor(self.y[idx], dtype=torch.float32)
        return x, y

# ---------------------- model ----------------------
class MLP(nn.Module):
    def __init__(self, in_dim: int, hidden: int = 256, dropout: float = 0.3):
        super().__init__()
        self.fc1 = nn.Linear(in_dim, hidden)
        self.fc2 = nn.Linear(hidden, hidden)
        self.out = nn.Linear(hidden, 1)
        self.drop = nn.Dropout(dropout)
        self.act = nn.ReLU()
        self._init_weights()

    def _init_weights(self):
        for m in [self.fc1, self.fc2, self.out]:
            nn.init.xavier_uniform_(m.weight)
            nn.init.zeros_(m.bias)

    def forward(self, x):
        x = self.act(self.fc1(x))
        x = self.drop(x)
        x = self.act(self.fc2(x))
        x = self.drop(x)
        return self.out(x).squeeze(1)  # logits

# ---------------------- train/eval ----------------------
def epoch_run(model, loader, device, criterion, optimizer=None, max_grad_norm=1.0):
    is_train = optimizer is not None
    model.train() if is_train else model.eval()
    total_loss = 0.0
    total_correct = 0
    total = 0
    with torch.set_grad_enabled(is_train):
        for xb, yb in loader:
            xb = xb.to(device, non_blocking=True)
            yb = yb.to(device, non_blocking=True)
            logits = model(xb)
            # clamp logits to avoid inf in loss (rare but safe)
            logits = torch.clamp(logits, -20.0, 20.0)
            loss = criterion(logits, yb)
            if torch.isnan(loss) or torch.isinf(loss):
                # skip pathological batch
                continue
            if is_train:
                optimizer.zero_grad(set_to_none=True)
                loss.backward()
                nn.utils.clip_grad_norm_(model.parameters(), max_grad_norm)
                optimizer.step()
            total_loss += loss.item() * xb.size(0)
            preds = (torch.sigmoid(logits) >= 0.5).long()
            total_correct += (preds == yb.long()).sum().item()
            total += xb.size(0)
    avg_loss = total_loss / max(total, 1)
    acc = total_correct / max(total, 1)
    return avg_loss, acc

def evaluate_final(model, loader, device, y_true_labels: List[str]):
    model.eval()
    preds_all = []
    with torch.no_grad():
        for xb, _ in loader:
            xb = xb.to(device)
            logits = model(xb)
            preds = (torch.sigmoid(logits) >= 0.5).long().cpu().numpy().tolist()
            preds_all.extend(preds)
    pred_labels = ["pos" if p == 1 else "neg" for p in preds_all]
    acc = accuracy_score(y_true_labels, pred_labels)
    rep = classification_report(y_true_labels, pred_labels, target_names=["neg", "pos"])
    cm = confusion_matrix(y_true_labels, pred_labels, labels=["neg", "pos"])
    return acc, rep, cm

def demo_predictions(vec, model, device):
    examples = [
        "I absolutely loved this movie. Brilliant performances!",
        "I really hate this film. It was a waste of time.",
        "I don't hate it, but it's not good either.",
        "The ambiance was great, but the service was slow.",
        "I love waiting in long lines",
    ]
    sw = set(stopwords.words("english"))
    for keep in ["no", "not", "nor"]:
        if keep in sw:
            sw.remove(keep)

    def clean(t: str) -> str:
        t = t.lower()
        t = re.sub(r"[^a-z\s]", " ", t)
        return " ".join([w for w in t.split() if w not in sw])

    X = vec.transform([clean(t) for t in examples])
    X = torch.from_numpy(X.toarray().astype(np.float32)).to(device)
    model.eval()
    with torch.no_grad():
        logits = model(X)
        preds = (torch.sigmoid(logits) >= 0.5).long().cpu().numpy().tolist()
    labels = ["pos" if p == 1 else "neg" for p in preds]
    print("\nDemo predictions:")
    for t, p in zip(examples, labels):
        print(f"[{p}] {t}")

# ---------------------- main ----------------------
def main():
    set_seed()
    setup_nltk()
    texts, labels = load_movie_reviews()
    print(f"Total reviews: {len(texts)}")
    print(f"Positive reviews: {labels.count('pos')}")
    print(f"Negative reviews: {labels.count('neg')}")
    X_train_raw, X_test_raw, y_train_labels, y_test_labels = train_test_split(
        texts, labels, test_size=0.2, random_state=SEED, stratify=labels
    )
    # quick baseline (print only)
    vader_acc = vader_baseline(X_test_raw, y_test_labels)
    print(f"\nVADER baseline accuracy: {vader_acc:.3f}")
    # tf-idf
    X_train_clean = preprocess_texts(X_train_raw)
    X_test_clean = preprocess_texts(X_test_raw)
    vec, X_train_tfidf, X_test_tfidf = vectorize(X_train_clean, X_test_clean)
    y_train_bin = to_binary(y_train_labels)
    y_test_bin = to_binary(y_test_labels)
    # loaders
    train_ds = TFIDFDataset(X_train_tfidf, y_train_bin)
    val_ds = TFIDFDataset(X_test_tfidf, y_test_bin)
    train_loader = DataLoader(train_ds, batch_size=128, shuffle=True, num_workers=0)
    val_loader = DataLoader(val_ds, batch_size=256, shuffle=False, num_workers=0)
    # device
    # device = torch.device("cuda" if torch.cuda.is_available()
    #                       else ("mps" if torch.backends.mps.is_available() else "cpu"))
    device = 'cpu'
    # model/opt/loss with safer settings
    model = MLP(in_dim=X_train_tfidf.shape[1], hidden=256, dropout=0.3).to(device)
    criterion = nn.BCEWithLogitsLoss()
    optimizer = torch.optim.AdamW(model.parameters(), lr=5e-4, weight_decay=1e-2)
    # wandb
    run = None
    if USE_WANDB:
        try:
            run = wandb.init(
                project="sentiment_analysis_tutorial",
                name="MLP_TFIDF_Stable",
                mode="online",
                config={
                    "model": "MLP",
                    "hidden": 256,
                    "dropout": 0.3,
                    "optimizer": "AdamW",
                    "lr": 5e-4,
                    "weight_decay": 1e-2,
                    "batch_size": 128,
                    "epochs": 12,
                    "num_features": int(X_train_tfidf.shape[1]),
                    "num_train_samples": int(X_train_tfidf.shape[0]),
                    "num_val_samples": int(X_test_tfidf.shape[0]),
                },
            )
        except Exception:
            try:
                run = wandb.init(project="sentiment_analysis_tutorial", name="MLP_TFIDF_Stable", mode="offline")
            except Exception:
                run = None
    epochs = 12
    for epoch in range(1, epochs + 1):
        t0 = time.time()
        train_loss, train_acc = epoch_run(model, train_loader, device, criterion, optimizer, max_grad_norm=1.0)
        val_loss, val_acc = epoch_run(model, val_loader, device, criterion, optimizer=None)
        dt = time.time() - t0
        if run is not None:
            wandb.log({
                "epoch": epoch,
                "train_loss": float(train_loss),
                "train_accuracy": float(train_acc),
                "val_loss": float(val_loss),
                "val_accuracy": float(val_acc),
                "epoch_time_sec": float(dt),
            })
        print(f"epoch {epoch:02d} train_loss {train_loss:.4f} train_acc {train_acc:.3f} "
              f"val_loss {val_loss:.4f} val_acc {val_acc:.3f} time {dt:.1f}s")
    if run is not None:
        wandb.finish()
    acc, report, cm = evaluate_final(model, val_loader, device, y_test_labels)
    print(f"\nMLP TF-IDF accuracy: {acc:.3f}")
    print("\nClassification report:\n" + report)
    print("Confusion matrix [rows true, cols pred] order neg pos:")
    print(cm)
    demo_predictions(vec, model, device)

if __name__ == "__main__":
    try:
        main()
    except Exception as e:
        print("Error:", e, file=sys.stderr)
        raise
```
After running the code, you’ll see output showing both training and validation accuracy for each epoch, along with the corresponding loss values. This loop is the heart of the model: the MLP gradually learns to adjust its weights so that positive reviews push the output score upward and negative reviews push it downward. Because we used BCEWithLogitsLoss, the network works directly with raw logits and the sigmoid function is applied internally in a numerically stable way. We also clamp the logits and clip gradients to prevent the kinds of runaway values that can destabilize training.
One important detail is the TF-IDF vectorization step. Raw text cannot be fed into a neural network directly, so we convert each review into a fixed-length numeric vector. With unigrams and bigrams, the model doesn’t just look at single words but also short phrases like “not good” or “very bad,” which carry strong sentiment signals. The sublinear_tf and l2 normalization options keep feature values well-scaled, which makes the optimization process smoother.
The architecture itself is deliberately simple: two hidden layers with ReLU activations and dropout regularization. This strikes a balance between expressive power and training stability. A much deeper network would likely overfit given the small size of the dataset, while a purely linear model like logistic regression would lack the capacity to capture subtle patterns. The MLP sits in the middle, providing enough nonlinearity to improve performance without becoming unmanageable.
Finally, logging to Weights & Biases makes the training process transparent. Each epoch’s metrics are recorded, producing clear plots of accuracy and loss over time. This helps diagnose issues such as overfitting or divergence and makes it easy to compare experiments later. Once training finishes, we validate the model using standard scikit-learn tools like the classification report and confusion matrix, and then demonstrate predictions on fresh examples to show the model in action.
Here are the logs from my training run:
Run: MLP_TFIDF_Stable
LLM-based sentiment classification
Now we will test a small local LLM for sentiment without training anything. The idea is simple: feed raw reviews to Gemma 3 270M running in Ollama and ask it to answer with one word, positive or negative. We use NLTK to pull the movie_reviews dataset and split it into train and test, but the model never sees the train split because we are not fine-tuning. We add five in-context examples to the prompt so the model sees the task format and a few demonstrations before it judges each review. Those examples cover clear positive, clear negative, and mild positivity to bias the model toward short, unambiguous outputs. The system prompt tells the model to reply with exactly one word and nothing else. The user message injects either a demo review with its target label or the real review text when evaluating. We set temperature to zero to reduce variance and scrub the response with a regex in case the model adds extra tokens. Finally, we map the model's word to the dataset labels, compute accuracy, print a classification report and a confusion matrix, and show a few example predictions next to gold labels. As a quick reference point, we also run VADER on the same test set to see how a lexicon baseline compares to the LLM approach.
Here's the code:
```python
import time
import re
from typing import List, Tuple

import nltk
from nltk.corpus import movie_reviews
from nltk.sentiment.vader import SentimentIntensityAnalyzer
from sklearn.model_selection import train_test_split
from sklearn import metrics
from openai import OpenAI

# 1. NLTK setup
def setup_nltk():
    nltk.download("movie_reviews", quiet=True)
    nltk.download("vader_lexicon", quiet=True)

# 2. Data loading and split
def load_data() -> Tuple[List[str], List[str]]:
    texts, labels = [], []
    for fid in movie_reviews.fileids():
        texts.append(" ".join(movie_reviews.words(fid)))
        labels.append(movie_reviews.categories(fid)[0])
    return texts, labels

# 3. Ollama OpenAI-compatible client
OLLAMA_BASE_URL = "http://localhost:11434/v1"
MODEL_NAME = "gemma3:270m"
client = OpenAI(base_url=OLLAMA_BASE_URL, api_key="ollama")  # required by client, unused by Ollama

SYSTEM_PROMPT = (
    "You are a rigorous sentiment classifier.\n"
    "Output ONLY one word: positive or negative.\n"
    "No punctuation, no extra words, no JSON."
)

# Five in-context examples
FEW_SHOT = [
    ("I absolutely loved this movie. Brilliant pacing and acting.", "positive"),
    ("This was a complete waste of time. Boring and poorly written.", "negative"),
    ("A delightful surprise with charming performances.", "positive"),
    ("Awful editing and the story made no sense at all.", "negative"),
    ("Not perfect, but I left the theater smiling.", "positive"),
]

# Final user instruction template
USER_TEMPLATE = (
    "Classify the overall sentiment of this text.\n"
    "Respond with exactly one word: positive or negative.\n"
    "Text:\n"
    "{text}"
)

def build_messages_with_fewshot(text: str):
    msgs = [{"role": "system", "content": SYSTEM_PROMPT}]
    for ex_text, ex_label in FEW_SHOT:
        msgs.append({
            "role": "user",
            "content": (
                "Classify the overall sentiment of this text.\n"
                "Respond with exactly one word: positive or negative.\n"
                "Text:\n"
                f"{ex_text}"
            ),
        })
        msgs.append({"role": "assistant", "content": ex_label})
    msgs.append({"role": "user", "content": USER_TEMPLATE.format(text=text)})
    return msgs

def gemma_label(text: str) -> str:
    resp = client.chat.completions.create(
        model=MODEL_NAME,
        messages=build_messages_with_fewshot(text),
        temperature=0.0,
    )
    content = resp.choices[0].message.content.strip().lower()
    # Be robust to accidental extra tokens
    word = re.findall(r"[a-z]+", content)
    lbl = word[0] if word else "positive"
    if lbl not in {"positive", "negative"}:
        if "negative" in content:
            lbl = "negative"
        elif "positive" in content:
            lbl = "positive"
        else:
            lbl = "positive"
    return lbl

# 4. Map Gemma labels to dataset labels
def map_label(lbl: str) -> str:
    return "pos" if lbl == "positive" else "neg"

# Optional baseline with VADER
def vader_baseline(texts: List[str], gold: List[str]) -> float:
    sia = SentimentIntensityAnalyzer()
    preds = []
    for t in texts:
        c = sia.polarity_scores(t)["compound"]
        preds.append("pos" if c >= 0 else "neg")
    return metrics.accuracy_score(gold, preds)

def evaluate_gemma(texts: List[str], sleep_sec: float = 0.0) -> List[str]:
    preds = []
    start = time.time()
    for i, t in enumerate(texts, 1):
        gem = gemma_label(t)
        preds.append(map_label(gem))
        if i % 25 == 0:
            elapsed = time.time() - start
            print(f"Processed {i} samples in {elapsed:.1f}s")
        if sleep_sec > 0:
            time.sleep(sleep_sec)
    return preds

def main():
    setup_nltk()
    texts, labels = load_data()
    X_train, X_test, y_train, y_test = train_test_split(
        texts, labels, test_size=0.2, random_state=42, stratify=labels
    )
    print("Evaluating Gemma 270M via Ollama on 400 test reviews with 5-shot prompts...")
    gemma_preds = evaluate_gemma(X_test)
    acc = metrics.accuracy_score(y_test, gemma_preds)
    report = metrics.classification_report(y_test, gemma_preds, target_names=["neg", "pos"])
    cm = metrics.confusion_matrix(y_test, gemma_preds, labels=["neg", "pos"])
    print(f"\nGemma accuracy: {acc:.3f}")
    print("\nClassification report:")
    print(report)
    print("Confusion matrix [rows true, cols pred] order neg pos:")
    print(cm)
    print("\nSample predictions:")
    for i in range(5):
        excerpt = X_test[i][:120].replace("\n", " ") + "..."
        print(f"Gold={y_test[i]} Pred={gemma_preds[i]} | {excerpt}")
    try:
        v_acc = vader_baseline(X_test, y_test)
        print(f"\nVADER baseline accuracy: {v_acc:.3f}")
    except Exception as e:
        print(f"\nVADER baseline skipped: {e}")

if __name__ == "__main__":
    main()
```
The few-shot prompt works as the core mechanism here. By stacking five labeled pairs before each test review, the model learns the output style and the decision boundary from context. This matters because small local models can drift into explanations or hedging if the instruction is weak. The system message forces a single word, while the user template repeats that constraint to keep the model on track. Temperature zero limits randomness so the same review produces the same label across runs. The evaluator treats the model as a black-box classifier. For each review it builds the message list with the five demos plus the target review, calls the OpenAI-compatible endpoint exposed by Ollama, and normalizes the reply to lowercase text. A small regex captures only the first alphabetic token, and we fall back to positive if the response is empty to avoid crashes. Neutral is never used in this setup, so mapping is straightforward.
Using W&B Inference
Now we will evaluate a hosted LLM on sentiment without training. The code pulls the NLTK movie_reviews corpus, which contains two thousand labeled reviews split evenly between positive and negative. Text and labels are loaded, then a stratified split creates a fixed test set of four hundred reviews. Instead of running a local model, the script builds an OpenAI-compatible client that points at the Weights & Biases inference endpoint. Weave is initialized so requests are tied to a project, and the client authenticates with your WANDB_API_KEY. The chosen model is meta-llama/Llama-3.1-8B-Instruct.
Here's the code:
```python
import os
import time
import re
from typing import List, Tuple

import nltk
from nltk.corpus import movie_reviews
from nltk.sentiment.vader import SentimentIntensityAnalyzer
from sklearn.model_selection import train_test_split
from sklearn import metrics
import openai
import weave

# ----------------------------------
# Config
# ----------------------------------
PROJECT = "wandb_inference"
WEAVE_PROJECT = PROJECT
BASE_URL = "https://api.inference.wandb.ai/v1"
MODEL_NAME = "meta-llama/Llama-3.1-8B-Instruct"

# Optional: set your key inline (or export WANDB_API_KEY beforehand)
# os.environ["WANDB_API_KEY"] = "YOUR_WANDB_API_KEY"

# ----------------------------------
# NLTK setup
# ----------------------------------
def setup_nltk():
    nltk.download("movie_reviews", quiet=True)
    nltk.download("vader_lexicon", quiet=True)

def load_data() -> Tuple[List[str], List[str]]:
    texts, labels = [], []
    for fid in movie_reviews.fileids():
        texts.append(" ".join(movie_reviews.words(fid)))
        labels.append(movie_reviews.categories(fid)[0])
    return texts, labels

# ----------------------------------
# W&B Inference client (OpenAI-compatible)
# ----------------------------------
def make_client():
    weave.init(WEAVE_PROJECT)
    api_key = os.getenv("WANDB_API_KEY")
    if not api_key:
        raise RuntimeError("WANDB_API_KEY not set. Export it or set it inline above.")
    client = openai.OpenAI(
        base_url=BASE_URL,
        api_key=api_key,
        project=PROJECT,
        default_headers={
            "OpenAI-Project": "wandb_fc/quickstart_playground"  # replace with your team/project if needed
        },
    )
    return client

SYSTEM_PROMPT = (
    "You are a rigorous sentiment classifier.\n"
    "Output ONLY one word: positive or negative.\n"
    "No punctuation, no extra words, no JSON."
)

FEW_SHOT = [
    ("I absolutely loved this movie. Brilliant pacing and acting.", "positive"),
    ("This was a complete waste of time. Boring and poorly written.", "negative"),
    ("A delightful surprise with charming performances.", "positive"),
    ("Awful editing and the story made no sense at all.", "negative"),
    ("Not perfect, but I left the theater smiling.", "positive"),
]

USER_TEMPLATE = (
    "Classify the overall sentiment of this text.\n"
    "Respond with exactly one word: positive or negative.\n"
    "Text:\n"
    "{text}"
)

def build_messages_with_fewshot(text: str):
    msgs = [{"role": "system", "content": SYSTEM_PROMPT}]
    for ex_text, ex_label in FEW_SHOT:
        msgs.append({
            "role": "user",
            "content": (
                "Classify the overall sentiment of this text.\n"
                "Respond with exactly one word: positive or negative.\n"
                "Text:\n"
                f"{ex_text}"
            ),
        })
        msgs.append({"role": "assistant", "content": ex_label})
    msgs.append({"role": "user", "content": USER_TEMPLATE.format(text=text)})
    return msgs

alpha_re = re.compile(r"[a-z]+")

def llm_label(client: openai.OpenAI, text: str) -> str:
    resp = client.chat.completions.create(
        model=MODEL_NAME,
        messages=build_messages_with_fewshot(text),
        temperature=0.0,
        max_tokens=4,
    )
    content = (resp.choices[0].message.content or "").strip().lower()
    toks = alpha_re.findall(content)
    lbl = toks[0] if toks else "positive"
    if lbl not in {"positive", "negative"}:
        if "negative" in content:
            lbl = "negative"
        elif "positive" in content:
            lbl = "positive"
        else:
            lbl = "positive"
    return lbl

def map_label(lbl: str) -> str:
    return "pos" if lbl == "positive" else "neg"

# ----------------------------------
# Baseline (VADER)
# ----------------------------------
def vader_baseline(texts: List[str], gold: List[str]) -> float:
    sia = SentimentIntensityAnalyzer()
    preds = []
    for t in texts:
        c = sia.polarity_scores(t)["compound"]
        preds.append("pos" if c >= 0 else "neg")
    return metrics.accuracy_score(gold, preds)

# ----------------------------------
# Evaluation
# ----------------------------------
def evaluate_llm(client: openai.OpenAI, texts: List[str], sleep_sec: float = 0.0) -> List[str]:
    preds = []
    start = time.time()
    for i, t in enumerate(texts, 1):
        out = llm_label(client, t)
        preds.append(map_label(out))
        if i % 25 == 0:
            elapsed = time.time() - start
            print(f"Processed {i} samples in {elapsed:.1f}s")
        if sleep_sec > 0:
            time.sleep(sleep_sec)
    return preds

def main():
    setup_nltk()
    texts, labels = load_data()
    X_train, X_test, y_train, y_test = train_test_split(
        texts, labels, test_size=0.2, random_state=42, stratify=labels
    )
    client = make_client()
    print(f"Evaluating {MODEL_NAME} via W&B Inference on 400 test reviews with 5-shot prompts...")
    llm_preds = evaluate_llm(client, X_test)
    acc = metrics.accuracy_score(y_test, llm_preds)
    report = metrics.classification_report(y_test, llm_preds, target_names=["neg", "pos"])
    cm = metrics.confusion_matrix(y_test, llm_preds, labels=["neg", "pos"])
    print(f"\nLLM accuracy: {acc:.3f}")
    print("\nClassification report:")
    print(report)
    print("Confusion matrix [rows true, cols pred] order neg pos:")
    print(cm)
    print("\nSample predictions:")
    for i in range(5):
        excerpt = X_test[i][:120].replace("\n", " ") + "..."
        print(f"Gold={y_test[i]} Pred={llm_preds[i]} | {excerpt}")
    try:
        v_acc = vader_baseline(X_test, y_test)
        print(f"\nVADER baseline accuracy: {v_acc:.3f}")
    except Exception as e:
        print(f"\nVADER baseline skipped: {e}")

if __name__ == "__main__":
    main()
```
After running the code, you can navigate to Weave and open your project to see a timeline of calls. Each API request to the W&B inference endpoint appears as a call with inputs, outputs, latency, token usage, and any exceptions.

Alternative use cases and tools
In this tutorial, we focused on movie reviews and used NLTK and scikit-learn for our sentiment analysis. However, there's a rich ecosystem of tools and scenarios for sentiment analysis in Python. Here are some alternatives and extensions to consider:
- Analyzing different text sources: We used movie reviews, but you could apply the same workflow to other data. For example, try sentiment analysis on tweets (Twitter data) to understand public sentiment on a topic, or on news headlines to gauge market sentiment. Each domain might require slightly different preprocessing (e.g., handling hashtags or usernames in tweets, or dealing with formal language in news); see the tweet-cleaning sketch after this list.
- TextBlob: TextBlob is a high-level library built on top of NLTK that makes sentiment analysis extremely straightforward. It has a TextBlob object that you can call .sentiment on to get polarity (a score from -1 to 1) and subjectivity. For example:

```python
from textblob import TextBlob

blob = TextBlob("The movie was not bad at all.")
print(blob.sentiment)
```
- This might output something like Sentiment(polarity=0.3, subjectivity=0.6). Under the hood, TextBlob uses a combination of techniques including a lexicon. It's great for quick sentiment checks, though it might not be as accurate as a trained model for specific domains.
- VADER (detailed usage): We already used VADER via NLTK. VADER is particularly well-suited for social media texts due to handling of slang and punctuation. It's worth noting that VADER returns a compound score as well as individual positive/negative/neutral scores. If your use case needs identifying neutrality or mixed sentiment, you can use thresholds on those scores. VADER is a good default for any initial sentiment analysis because of its ease of use and no requirement for training data.
- Flair: Flair is an NLP library from the Zalando Research team that provides simple interfaces for state-of-the-art NLP models. It has a pre-trained sentiment analysis model (an LSTM neural network trained on IMDB reviews). Using Flair:

```python
from flair.models import TextClassifier
from flair.data import Sentence

classifier = TextClassifier.load('en-sentiment')
sentence = Sentence("I absolutely loved the new design of your website!")
classifier.predict(sentence)
print(sentence.labels)
```
- This might output something like "[POSITIVE (0.95)]", indicating the text is positive with a confidence score. Flair's models are more accurate than simple lexicon methods because they use deep learning under the hood, but they are also heavier (loading the model can be a few hundred megabytes).
- Hugging Face Transformers: For the ultimate in accuracy, you can use transformer-based models. Hugging Face provides a high-level pipeline for sentiment analysis:

```python
from transformers import pipeline

sentiment_pipeline = pipeline("sentiment-analysis")
result = sentiment_pipeline("I have mixed feelings about this product.")[0]
print(result)
```
- This might output {'label': 'NEGATIVE', 'score': 0.55} (for example, if the model leans slightly negative on the mixed feelings sentence). The default model behind this pipeline is usually a fine-tuned BERT or DistilBERT model on a large sentiment dataset. These models capture context very well (for instance, understanding negation or sarcasm better than simpler methods). The trade-off is that they require more computational resources. However, you don't need to train them (since they are pre-trained and fine-tuned), which is a huge advantage if you don't have a large labeled dataset.
- Custom models and deep learning: For specialized applications, you might train your own deep learning model using libraries like TensorFlow or PyTorch. For example, you could build an LSTM or transformer model fine-tuned on a custom dataset (perhaps sentiment of tweets about your company, or sentiment in customer support chats).
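As an example of the domain-specific preprocessing mentioned in the first bullet, here is a minimal tweet-cleaning function. The patterns are a starting point under simple assumptions, not a complete Twitter normalizer:

```python
import re

def clean_tweet(text: str) -> str:
    text = re.sub(r"@\w+", "", text)          # strip @usernames
    text = re.sub(r"https?://\S+", "", text)  # strip links
    text = text.replace("#", "")              # keep the hashtag word, drop the '#'
    return " ".join(text.split())             # collapse extra whitespace

print(clean_tweet("@support my order arrived late AGAIN #frustrated https://t.co/xyz"))
# -> "my order arrived late AGAIN frustrated"
```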
Each tool or approach has its pros and cons:
- Ease of use: TextBlob and VADER are very easy to use but might offer less accuracy.
- Accuracy: Transformer-based models or Flair provide high accuracy but at the cost of speed and resource usage.
- Data requirement: Rule-based methods require no data (just the text to analyze and a lexicon), while machine learning approaches require labeled datasets. Pre-trained models (like those from Hugging Face or Flair) come with the data requirement "built-in" (they were trained on large datasets already).
- Integration and tracking: Whichever method you use, you can integrate experiment tracking. For example, if you try different models (say, logistic regression vs SVM vs a neural network), you can use W&B to compare their performance side by side in a dashboard, making it easier to pick the best one.
💡 Tip: It's often useful to start simple and then increase complexity. Try a lexicon approach first to get a baseline, as we did with VADER. Then move to a traditional ML model if you have labeled data. If you need better performance and have the resources, explore fine-tuned transformers. Each step up usually yields better insight or accuracy but requires more effort.
Finally, keep in mind that sentiment analysis isn't perfect. Human language is subtle, and even humans might disagree on the sentiment of a given sentence. Always evaluate models on your specific data, and use human judgement to guide interpretations. With practice and the right tools, you'll become adept at choosing and tuning the right approach for your sentiment analysis projects.
Conclusion
In this comprehensive guide, we covered the fundamentals of sentiment analysis and walked through a practical implementation in Python. Let's recap the journey and key takeaways:
- Understanding sentiment analysis: We began by defining sentiment analysis and highlighting its importance in extracting subjective information (positive, negative, neutral sentiments) from text. It's a crucial tool for understanding opinions and emotions at scale, with applications in business, finance, social media, and more.
- Exploring applications: We discussed how various industries use sentiment analysis, from social media monitoring to customer feedback and market analysis, showcasing the versatility of this technique.
- Different approaches: We broke down the main methodologies – lexicon-based rules, machine learning classifiers, and advanced transformer-based models – each with its advantages and scenarios where it shines. We also noted specialized types of sentiment analysis like aspect-based and emotion detection for more granular insights.
- Hands-on tutorial: The core of our guide was a step-by-step tutorial. We:
- Set up a Python environment with NLTK, scikit-learn, and other libraries.
- Loaded and preprocessed a dataset of movie reviews.
- Implemented a rule-based sentiment analysis using VADER and observed its output and accuracy.
- Converted text to features with TF-IDF and trained a small neural network (a PyTorch MLP), which significantly outperformed the rule-based baseline on our test data.
- Evaluated the model with metrics and even used Weights & Biases to log and visualize the results, demonstrating good practices in experiment tracking.
- Alternative tools: We reviewed other tools like TextBlob, Flair, and Hugging Face pipelines that can be used for sentiment analysis with minimal code, as well as discussing trade-offs between simplicity and accuracy.
By working through this project, you have gained practical experience in building a sentiment analysis pipeline from scratch. You learned how to prepare text data, choose an analysis method, and interpret the results. You also saw how integrating tools like W&B can make your machine learning workflow more efficient and reproducible.
As next steps, you can apply what you've learned to new datasets or domains. For instance, try collecting tweets about a topic you're interested in and analyze their sentiment. Or, if you're up for a challenge, fine-tune a transformer model on a custom dataset to push the accuracy even higher. Don't forget to leverage W&B for tracking your experiments, especially as they grow in complexity.
Sentiment analysis is both an art and a science – it blends understanding of language with algorithms and data. With this solid foundation, you are well-equipped to tackle sentiment analysis tasks and adapt to different needs. Happy analyzing, and may your future projects accurately gauge the pulse of textual data!
Sources
- NLTK Sentiment Analysis Documentation – NLTK 3.5 how-to guide – Official guide on using NLTK for sentiment analysis, including VADER usage.
- VADER Sentiment Analysis Paper (Hutto & Gilbert, 2014) – Original paper and GitHub repository for the VADER lexicon, explaining how it works and its evaluation.
- TextBlob: Simplified Text Processing – Official TextBlob documentation, covering the sentiment property and other NLP features.
- scikit-learn: TfidfVectorizer Documentation – Reference for the TfidfVectorizer used to convert text to features.
- scikit-learn: Metrics – Classification Report – Explanation of precision, recall, and F1-score as seen in the classification report.
- Flair NLP Library – GitHub page for Flair, which provides an easy interface for state-of-the-art NLP models, including an example on sentiment analysis.
- Hugging Face Transformers Pipeline – Documentation for using high-level pipelines like sentiment-analysis, which leverage pre-trained transformer models.
- Weights & Biases Documentation – Experiment Tracking – Official docs for using W&B, including guides on how to log metrics, charts, and datasets for machine learning projects.