Module 3
Natural Language Processing (NLP) & Transformers
⏱️ 10–14 hours · 📊 Intermediate · 🧩 2 Code Blocks · 🏗️ 1 Project
🎯 Learning Objectives
- ✓ Understand how computers process and understand human language — from basic text cleaning to modern AI.
- ✓ Learn word embeddings, attention mechanisms, and transformer architecture — explained intuitively.
- ✓ Fine-tune a pre-trained BERT model for sentiment analysis (even with limited data!).
- ✓ Build a complete AI-Powered Sentiment Analysis API for product reviews.
📋 Prerequisites
- Modules 1 & 2 completed
- Comfortable with Python and neural network basics
- No NLP experience needed — we build from scratch
📐 Technical Theory
The Evolution of NLP
How do you teach a computer to understand that "I love this movie" is positive and "This movie is terrible" is negative? That's Natural Language Processing (NLP) — teaching machines to read, understand, and generate human language.
NLP has evolved dramatically over the decades, and you're about to learn the cutting-edge techniques that power ChatGPT, Google Search, and Alexa.
| Era | Method | Limitation |
|---|---|---|
| 1990s | Bag of Words, TF-IDF | No semantic understanding, ignores word order |
| 2013 | Word2Vec, GloVe | Static embeddings (same vector for "bank" regardless of context) |
| 2018+ | BERT, GPT, ELMo | Dynamic, contextual embeddings — the modern standard |
TF-IDF — Classic Text Vectorization
TF-IDF assigns importance scores to words. Common words like "the" get low scores; rare but important domain words get high scores.
TF-IDF(t, d) = TF(t, d) × log(N / (1 + df(t)))
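To make the formula concrete, here is a minimal hand-rolled sketch on a toy three-document corpus. Note this implements exactly the formula above; scikit-learn's `TfidfVectorizer` (used later in this module) applies a slightly different smoothed IDF, so treat this as an illustration rather than a drop-in replacement.

```python
import math

def tf_idf(term, doc, corpus):
    """TF-IDF as defined above: TF(t, d) * log(N / (1 + df(t)))."""
    tf = doc.count(term) / len(doc)          # relative frequency of t in d
    df = sum(term in d for d in corpus)      # number of documents containing t
    return tf * math.log(len(corpus) / (1 + df))

corpus = [
    "the movie was great".split(),
    "the movie was terrible".split(),
    "great acting great story".split(),
]
print(tf_idf("the", corpus[0], corpus))       # common word -> 0.0
print(tf_idf("terrible", corpus[1], corpus))  # rare, distinctive word -> positive score
```

Notice how "the", which appears in most documents, scores zero (or even negative with the +1 smoothing), while the rare word "terrible" gets a positive weight — exactly the behavior described above.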
Word Embeddings — Semantic Vector Space
Word2Vec (2013) represents each word as a dense vector in d-dimensional space (typically d=300), where semantically similar words cluster together.
vector("king") - vector("man") + vector("woman") ≈ vector("queen")
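The analogy arithmetic can be sketched with nearest-neighbor search over cosine similarity. The 4-dimensional vectors below are hand-picked toys chosen so the analogy works — real Word2Vec embeddings are ~300-dimensional and learned from billions of words:

```python
import numpy as np

# Toy "embeddings" (illustrative only; not real Word2Vec vectors)
emb = {
    "king":  np.array([0.9, 0.8, 0.1, 0.0]),
    "man":   np.array([0.1, 0.9, 0.1, 0.0]),
    "woman": np.array([0.1, 0.1, 0.9, 0.0]),
    "queen": np.array([0.9, 0.0, 0.9, 0.0]),
    "apple": np.array([0.0, 0.0, 0.1, 0.9]),
}

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# king - man + woman, then find the closest word in the vocabulary
target = emb["king"] - emb["man"] + emb["woman"]
best = max(emb, key=lambda w: cosine(emb[w], target))
print(best)  # -> queen
```

In practice, libraries like gensim also exclude the query words ("king", "man", "woman") from the candidates before returning the nearest neighbor.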
The Attention Mechanism
The key insight: not all words are equally relevant when processing a given token. Scaled Dot-Product Attention computes a weighted sum of Values, where the weights are determined by the compatibility of Queries and Keys.
Attention(Q, K, V) = softmax(QKᵀ / √dₖ) · V
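The formula above translates almost line-for-line into NumPy. This is a minimal single-head sketch with random toy matrices, ignoring batching and masking:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """softmax(Q K^T / sqrt(d_k)) V, exactly as in the formula above."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                  # query-key compatibility
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax over the keys
    return weights @ V, weights                      # weighted sum of Values

rng = np.random.default_rng(0)
Q = rng.normal(size=(3, 4))   # 3 query tokens, d_k = 4
K = rng.normal(size=(5, 4))   # 5 key tokens
V = rng.normal(size=(5, 8))   # 5 value vectors, d_v = 8
out, w = scaled_dot_product_attention(Q, K, V)
print(out.shape)              # (3, 8): one weighted sum of Values per query
print(w.sum(axis=-1))         # each row of attention weights sums to 1
```

The √dₖ scaling keeps the dot products from growing with dimension, which would otherwise push the softmax into regions with vanishing gradients.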
Multi-Head Attention
Run attention h times in parallel with different learned projections. This allows the model to attend to different positions from different representational subspaces simultaneously.
MultiHead(Q,K,V) = Concat(head₁, ..., headₕ) · Wᴼ
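A self-attention version of multi-head attention can be sketched in a few lines. Here the projection matrices are random stand-ins for the learned weights W_Q, W_K, W_V, and W_O:

```python
import numpy as np

def attention(Q, K, V):
    """Scaled dot-product attention (single head)."""
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return w @ V

def multi_head_self_attention(X, h, rng):
    """Concat(head_1, ..., head_h) @ W_O with per-head projections of X."""
    d_model = X.shape[-1]
    d_k = d_model // h                     # each head works in a smaller subspace
    heads = []
    for _ in range(h):
        Wq, Wk, Wv = [rng.normal(size=(d_model, d_k)) for _ in range(3)]
        heads.append(attention(X @ Wq, X @ Wk, X @ Wv))
    W_o = rng.normal(size=(h * d_k, d_model))
    return np.concatenate(heads, axis=-1) @ W_o

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 16))               # 6 tokens, d_model = 16
out = multi_head_self_attention(X, h=4, rng=rng)
print(out.shape)                           # (6, 16): same shape as the input
```

Because each head projects into a d_model/h-dimensional subspace, the total cost is comparable to one full-dimension head, yet each head can specialize in a different relation between tokens.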
BERT Architecture
BERT (Bidirectional Encoder Representations from Transformers) is pre-trained on two tasks:
1. Masked Language Model (MLM): 15% of tokens are masked; the model must predict them using both left and right context.
2. Next Sentence Prediction (NSP): The model predicts whether sentence B follows sentence A.
Fine-tuning: Replace the pre-training head with a task-specific layer and train the whole network with a very small learning rate (2e-5 to 5e-5), so the pre-trained weights are only gently adjusted.
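The MLM masking in task 1 can be sketched in plain Python. This is deliberately simplified: it assumes a whitespace "tokenizer", while real BERT uses WordPiece subwords and, of the selected 15% of tokens, replaces 80% with [MASK], 10% with a random token, and leaves 10% unchanged:

```python
import random

def mask_tokens(tokens, rng, mask_rate=0.15):
    """Randomly hide ~15% of tokens; the model must predict the hidden ones."""
    masked, targets = [], {}
    for i, tok in enumerate(tokens):
        if rng.random() < mask_rate:
            targets[i] = tok          # ground truth the model must recover
            masked.append("[MASK]")
        else:
            masked.append(tok)
    return masked, targets

tokens = "the movie was absolutely fantastic and the acting was superb".split()
masked, targets = mask_tokens(tokens, rng=random.Random(1))
print(" ".join(masked))   # a few tokens are replaced by [MASK]
```

Because the model sees the full sentence on both sides of each [MASK], it is forced to learn bidirectional context — the key difference from left-to-right language models like GPT.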
💻 Code Implementation
Step 1: Classical NLP — TF-IDF + Logistic Regression
python
import numpy as np
import pandas as pd
import nltk
import re
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
nltk.download("stopwords", quiet=True)
nltk.download("punkt", quiet=True)
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
STOP_WORDS = set(stopwords.words("english"))
stemmer = PorterStemmer()
def clean_text(text: str) -> str:
    """Lowercase, remove URLs/HTML/special chars, stem."""
    text = text.lower()
    text = re.sub(r"http\S+|www\S+", "", text)
    text = re.sub(r"<[^>]+>", "", text)
    text = re.sub(r"[^a-z\s]", "", text)
    tokens = text.split()
    tokens = [t for t in tokens if t not in STOP_WORDS and len(t) > 2]
    tokens = [stemmer.stem(t) for t in tokens]
    return " ".join(tokens)
# ── Movie Review Sentiment Dataset ────────────────────────
reviews = [
    "This movie was absolutely fantastic! The acting was superb.",
    "Terrible film. Completely boring and a waste of time.",
    "An incredible cinematic experience. Beautifully crafted story.",
    "I fell asleep halfway through. Very disappointing.",
    "The special effects were breathtaking! Loved every minute.",
    "Poorly written script with one-dimensional characters.",
    "A masterpiece. One of the best films I have seen.",
    "Complete garbage. The director should be ashamed.",
    "Heartwarming and emotionally powerful. A must-watch.",
    "Predictable plot, cliched dialogue. Nothing new here.",
] * 100
labels = ([1, 0, 1, 0, 1, 0, 1, 0, 1, 0] * 100)
df_nlp = pd.DataFrame({"review": reviews, "sentiment": labels})
df_nlp["cleaned"] = df_nlp["review"].apply(clean_text)
X_train_nlp, X_test_nlp, y_train_nlp, y_test_nlp = train_test_split(
    df_nlp["cleaned"], df_nlp["sentiment"],
    test_size=0.2, random_state=42, stratify=df_nlp["sentiment"]
)
# ── TF-IDF + LR Pipeline ─────────────────────────────────
text_pipeline = Pipeline([
    ("tfidf", TfidfVectorizer(
        max_features=15000, ngram_range=(1, 2),
        min_df=2, max_df=0.95, sublinear_tf=True
    )),
    ("clf", LogisticRegression(C=5.0, max_iter=1000, random_state=42))
])
text_pipeline.fit(X_train_nlp, y_train_nlp)
y_pred_nlp = text_pipeline.predict(X_test_nlp)
print("📊 TF-IDF + Logistic Regression Results:")
print(classification_report(y_test_nlp, y_pred_nlp,
                            target_names=["Negative", "Positive"]))
🔧 Troubleshooting
❌ Error: LookupError: nltk resource not found
🔍 Cause: NLTK data not downloaded
✅ Fix: Run nltk.download("stopwords") and nltk.download("punkt")
Step 2: Fine-Tuning BERT for Sentiment Analysis
python
import torch
from transformers import (
    BertTokenizer, BertForSequenceClassification,
    Trainer, TrainingArguments, DataCollatorWithPadding,
)
from datasets import Dataset
import numpy as np
from sklearn.metrics import accuracy_score, f1_score
device = "cuda" if torch.cuda.is_available() else "cpu"
print(f"Using device: {device.upper()}")
MODEL_NAME = "bert-base-uncased"
tokenizer = BertTokenizer.from_pretrained(MODEL_NAME)
model = BertForSequenceClassification.from_pretrained(
    MODEL_NAME, num_labels=2,
    hidden_dropout_prob=0.1, attention_probs_dropout_prob=0.1,
)
def tokenize_function(examples):
    return tokenizer(
        examples["text"], truncation=True,
        max_length=128, padding=False,   # dynamic padding happens in the collator
    )
raw_dataset = Dataset.from_dict({"text": reviews, "label": labels})
raw_dataset = raw_dataset.train_test_split(test_size=0.2, seed=42)
tokenized_datasets = raw_dataset.map(
    tokenize_function, batched=True, remove_columns=["text"],
)
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)
def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    return {
        "accuracy": accuracy_score(labels, predictions),
        "f1": f1_score(labels, predictions, average="weighted"),
    }
training_args = TrainingArguments(
    output_dir="./bert_sentiment",
    num_train_epochs=3,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=32,
    learning_rate=2e-5,
    weight_decay=0.01,
    warmup_ratio=0.1,
    evaluation_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
    metric_for_best_model="f1",
    fp16=(device == "cuda"),
    report_to="none",
)
trainer = Trainer(
    model=model, args=training_args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["test"],
    tokenizer=tokenizer, data_collator=data_collator,
    compute_metrics=compute_metrics,
)
print("\n🚀 Starting BERT Fine-Tuning...")
trainer.train()
results = trainer.evaluate()
print(f"\n✅ Eval Accuracy: {results['eval_accuracy']:.4f}")
print(f"✅ Eval F1 Score: {results['eval_f1']:.4f}")
trainer.save_model("./bert_sentiment_final")
tokenizer.save_pretrained("./bert_sentiment_final")
print("💾 Model saved to ./bert_sentiment_final")
🔧 Troubleshooting
❌ Error: OSError: Can't load tokenizer
🔍 Cause: Model name typo or no internet access
✅ Fix: Check the spelling of the model name; ensure internet access; run transformers-cli login if the model requires authentication

❌ Error: Token indices sequence length > 512
🔍 Cause: Input too long for BERT (512-token limit)
✅ Fix: Set truncation=True, max_length=512

❌ Error: OOM during fine-tuning
🔍 Cause: Batch size too large for GPU memory
✅ Fix: Reduce per_device_train_batch_size to 8 or 4; use gradient_accumulation_steps=4
🏗️ Practical Project
AI Sentiment Analysis API
A dual-mode sentiment analysis engine supporting both fast (TF-IDF) and accurate (BERT) inference modes. Production-ready with latency tracking and batch processing.
python
from transformers import pipeline as hf_pipeline
import time, torch, pandas as pd
# ── Load Models ───────────────────────────────────────────
bert_sentiment = hf_pipeline(
    "sentiment-analysis",
    model="distilbert-base-uncased-finetuned-sst-2-english",
    device=0 if torch.cuda.is_available() else -1,
)
class SentimentAnalysisEngine:
    """Dual-mode sentiment analyzer: fast (TF-IDF) or accurate (BERT)."""

    def __init__(self, fast_model, bert_model):
        self.fast_model = fast_model
        self.bert_model = bert_model

    def analyze(self, text: str, fast_mode: bool = False) -> dict:
        start = time.time()
        if fast_mode:
            cleaned = clean_text(text)
            pred = self.fast_model.predict([cleaned])[0]
            proba = self.fast_model.predict_proba([cleaned])[0]
            label = "POSITIVE" if pred == 1 else "NEGATIVE"
            score = float(max(proba))
            model_used = "TF-IDF + LR"
        else:
            result = self.bert_model(text[:512])[0]
            label = result["label"]
            score = result["score"]
            model_used = "DistilBERT"
        elapsed = (time.time() - start) * 1000
        return {
            "text": text[:100] + ("..." if len(text) > 100 else ""),
            "sentiment": label,
            "confidence": f"{score:.1%}",
            "model": model_used,
            "latency_ms": f"{elapsed:.1f}ms",
        }

    def batch_analyze(self, texts: list, fast_mode: bool = True) -> pd.DataFrame:
        return pd.DataFrame([self.analyze(t, fast_mode) for t in texts])
engine = SentimentAnalysisEngine(text_pipeline, bert_sentiment)
# ── Demo ──────────────────────────────────────────────────
test_reviews = [
    "Absolutely love this product! Best purchase this year.",
    "Terrible quality. Broke after 2 days. Total waste.",
    "It's okay. Does the job I suppose.",
    "Outstanding customer service! Resolved in minutes.",
    "Packaging damaged, product missing parts. Disappointed.",
]
for review in test_reviews:
    fast = engine.analyze(review, fast_mode=True)
    bert = engine.analyze(review, fast_mode=False)
    print(f"\n📝 {fast['text']}")
    print(f"   ⚡ Fast: {fast['sentiment']} ({fast['confidence']})")
    print(f"   🧠 BERT: {bert['sentiment']} ({bert['confidence']})")
🔧 Troubleshooting
❌ Error: BERT training loss stuck at 0.693 after 3 epochs
🔍 Cause: 0.693 ≈ ln(2), i.e. the model is guessing randomly between 2 classes and not learning
✅ Fix: Tune the learning rate; try 2e-5, 3e-5, or 5e-5

❌ Error: Accuracy stuck at the class distribution
🔍 Cause: Model always predicts the majority class
✅ Fix: Balance the dataset or use a class-weighted loss
Topics Covered
#NLP #BERT #Transformers #TF-IDF #Attention #HuggingFace