Module 3

Natural Language Processing (NLP) & Transformers

⏱️ 10–14 hours · 📊 Intermediate · 🧩 2 Code Blocks · 🏗️ 1 Project

🎯 Learning Objectives

  • Understand how computers process and understand human language — from basic text cleaning to modern AI.
  • Learn word embeddings, attention mechanisms, and transformer architecture — explained intuitively.
  • Fine-tune a pre-trained BERT model for sentiment analysis (even with limited data!).
  • Build a complete AI-Powered Sentiment Analysis API for product reviews.

📋 Prerequisites

  • Module 1 & 2 completed
  • Comfortable with Python and neural network basics
  • No NLP experience needed — we build from scratch

📐 Technical Theory

The Evolution of NLP

How do you teach a computer to understand that "I love this movie" is positive while "This movie is terrible" is negative? That's Natural Language Processing (NLP) — teaching machines to read, understand, and generate human language. NLP has evolved dramatically over the decades, and you're about to learn the cutting-edge techniques that power ChatGPT, Google Search, and Alexa.
| Era   | Method               | Limitation                                                        |
|-------|----------------------|-------------------------------------------------------------------|
| 1990s | Bag of Words, TF-IDF | No semantic understanding, ignores word order                     |
| 2013  | Word2Vec, GloVe      | Static embeddings (same vector for "bank" regardless of context)  |
| 2018+ | BERT, GPT, ELMo      | Dynamic, contextual embeddings — the modern standard              |

TF-IDF — Classic Text Vectorization

TF-IDF assigns importance scores to words. Common words like "the" get low scores; rare but important domain words get high scores.
TF-IDF(t, d) = TF(t, d) × log(N / (1 + df(t)))
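The formula can be traced by hand on a toy corpus. The sketch below (the three example documents and helper names are illustrative, not from the lesson dataset) computes TF as a normalized count and applies the smoothed IDF exactly as written above:

```python
import math

# Toy corpus: N = 3 documents (illustrative example only)
docs = [
    "the movie was great".split(),
    "the movie was terrible".split(),
    "great acting great story".split(),
]
N = len(docs)

def tf(term, doc):
    """Term frequency: count of `term` in `doc`, normalized by doc length."""
    return doc.count(term) / len(doc)

def df(term):
    """Document frequency: number of documents containing `term`."""
    return sum(1 for d in docs if term in d)

def tfidf(term, doc):
    """TF-IDF with the smoothed IDF from the formula above."""
    return tf(term, doc) * math.log(N / (1 + df(term)))

# "the" appears in 2 of 3 docs -> log(3/3) = 0, so its weight vanishes;
# "acting" appears in only 1 doc -> it keeps a positive weight.
print(f'tfidf("the", doc0)    = {tfidf("the", docs[0]):.4f}')    # -> 0.0000
print(f'tfidf("acting", doc2) = {tfidf("acting", docs[2]):.4f}')
```

Note that scikit-learn's `TfidfVectorizer` (used in Step 1 below) uses a slightly different IDF variant and L2-normalizes each row, but the intuition is identical.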

Word Embeddings — Semantic Vector Space

Word2Vec (2013) represents each word as a dense vector in d-dimensional space (typically d=300), where semantically similar words cluster together. vector("king") - vector("man") + vector("woman") ≈ vector("queen")
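The famous analogy can be demonstrated with cosine similarity. The vectors below are hand-crafted 3-dimensional toys chosen to make the arithmetic visible (real Word2Vec embeddings are learned from corpora and have hundreds of dimensions):

```python
import numpy as np

# Hand-crafted toy vectors; dimensions roughly encode
# [royalty, masculinity, femininity] for illustration only.
vec = {
    "king":  np.array([0.9, 0.8, 0.1]),
    "man":   np.array([0.1, 0.9, 0.1]),
    "woman": np.array([0.1, 0.1, 0.9]),
    "queen": np.array([0.9, 0.1, 0.9]),
}

def cosine(a, b):
    """Cosine similarity: 1.0 means same direction in vector space."""
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

analogy = vec["king"] - vec["man"] + vec["woman"]
best = max(vec, key=lambda w: cosine(analogy, vec[w]))
print("king - man + woman ≈", best)  # -> queen
```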

The Attention Mechanism

The key insight: not all words are equally relevant when processing a given token. Scaled Dot-Product Attention computes a weighted sum of Values, where the weights are determined by the compatibility of Queries and Keys.
Attention(Q, K, V) = softmax(QKᵀ / √dₖ) · V
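The formula translates almost line for line into NumPy. A minimal sketch with random toy matrices (dimensions chosen arbitrarily for illustration):

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """Scaled dot-product attention, exactly as in the formula above."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # (n_q, n_k) compatibility of Queries and Keys
    weights = softmax(scores, axis=-1)   # each row is a distribution over Keys
    return weights @ V, weights          # weighted sum of Values

rng = np.random.default_rng(0)
n, d_k, d_v = 4, 8, 8                    # 4 tokens, toy dimensions
Q = rng.normal(size=(n, d_k))
K = rng.normal(size=(n, d_k))
V = rng.normal(size=(n, d_v))
out, w = attention(Q, K, V)
print(out.shape)          # (4, 8): one output vector per token
print(w.sum(axis=-1))     # each row of weights sums to 1
```

The √dₖ scaling keeps the dot products from growing with dimension, which would otherwise push the softmax into near-one-hot saturation.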

Multi-Head Attention

Run attention h times in parallel with different learned projections. This allows the model to attend to different positions from different representational subspaces simultaneously.
MultiHead(Q,K,V) = Concat(head₁, ..., headₕ) · Wᴼ

BERT Architecture

BERT (Bidirectional Encoder Representations from Transformers) is pre-trained on two tasks:

  1. Masked Language Model (MLM): 15% of tokens are masked; the model must predict them using both left and right context.
  2. Next Sentence Prediction (NSP): The model predicts whether sentence B follows sentence A.

Fine-tuning: Replace the pre-training head with a task-specific layer and train with a very small learning rate (2e-5 to 5e-5).
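The MLM input corruption is easy to sketch in plain Python. This is deliberately simplified (real BERT operates on subword tokens, and of the chosen positions it replaces 80% with [MASK], 10% with a random token, and leaves 10% unchanged):

```python
import random

# Simplified MLM corruption: mask 15% of (whitespace) token positions.
tokens = "the movie was absolutely fantastic and the acting was superb".split()
random.seed(0)
n_mask = max(1, round(0.15 * len(tokens)))
mask_idx = set(random.sample(range(len(tokens)), n_mask))
corrupted = ["[MASK]" if i in mask_idx else t for i, t in enumerate(tokens)]
print(" ".join(corrupted))
# The model is trained to recover the original tokens at the masked positions,
# using context from BOTH sides of each mask -- hence "bidirectional".
```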

💻 Code Implementation

Step 1: Classical NLP — TF-IDF + Logistic Regression

python
import numpy as np
import pandas as pd
import nltk
import re
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

nltk.download("stopwords", quiet=True)
nltk.download("punkt", quiet=True)
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

STOP_WORDS = set(stopwords.words("english"))
stemmer = PorterStemmer()

def clean_text(text: str) -> str:
    """Lowercase, remove URLs/HTML/special chars, stem."""
    text = text.lower()
    text = re.sub(r"http\S+|www\S+", "", text)
    text = re.sub(r"<[^>]+>", "", text)
    text = re.sub(r"[^a-z\s]", "", text)
    tokens = text.split()
    tokens = [t for t in tokens if t not in STOP_WORDS and len(t) > 2]
    tokens = [stemmer.stem(t) for t in tokens]
    return " ".join(tokens)

# ── Movie Review Sentiment Dataset ────────────────────────
reviews = [
    "This movie was absolutely fantastic! The acting was superb.",
    "Terrible film. Completely boring and a waste of time.",
    "An incredible cinematic experience. Beautifully crafted story.",
    "I fell asleep halfway through. Very disappointing.",
    "The special effects were breathtaking! Loved every minute.",
    "Poorly written script with one-dimensional characters.",
    "A masterpiece. One of the best films I have seen.",
    "Complete garbage. The director should be ashamed.",
    "Heartwarming and emotionally powerful. A must-watch.",
    "Predictable plot, cliched dialogue. Nothing new here.",
] * 100

labels = ([1, 0, 1, 0, 1, 0, 1, 0, 1, 0] * 100)

df_nlp = pd.DataFrame({"review": reviews, "sentiment": labels})
df_nlp["cleaned"] = df_nlp["review"].apply(clean_text)

X_train_nlp, X_test_nlp, y_train_nlp, y_test_nlp = train_test_split(
    df_nlp["cleaned"], df_nlp["sentiment"],
    test_size=0.2, random_state=42, stratify=df_nlp["sentiment"]
)

# ── TF-IDF + LR Pipeline ─────────────────────────────────
text_pipeline = Pipeline([
    ("tfidf", TfidfVectorizer(
        max_features=15000, ngram_range=(1, 2),
        min_df=2, max_df=0.95, sublinear_tf=True
    )),
    ("clf", LogisticRegression(C=5.0, max_iter=1000, random_state=42))
])

text_pipeline.fit(X_train_nlp, y_train_nlp)
y_pred_nlp = text_pipeline.predict(X_test_nlp)

print("📊 TF-IDF + Logistic Regression Results:")
print(classification_report(y_test_nlp, y_pred_nlp,
    target_names=["Negative", "Positive"]))

🔧 Troubleshooting

❌ Error: LookupError: nltk resource not found
🔍 Cause: NLTK data not downloaded
✅ Fix: Run nltk.download("stopwords") and nltk.download("punkt")

Step 2: Fine-Tuning BERT for Sentiment Analysis

python
import torch
from transformers import (
    BertTokenizer, BertForSequenceClassification,
    Trainer, TrainingArguments, DataCollatorWithPadding,
)
from datasets import Dataset
import numpy as np
from sklearn.metrics import accuracy_score, f1_score

device = "cuda" if torch.cuda.is_available() else "cpu"
print(f"Using device: {device.upper()}")

MODEL_NAME = "bert-base-uncased"
tokenizer = BertTokenizer.from_pretrained(MODEL_NAME)
model = BertForSequenceClassification.from_pretrained(
    MODEL_NAME, num_labels=2,
    hidden_dropout_prob=0.1, attention_probs_dropout_prob=0.1,
)

def tokenize_function(examples):
    return tokenizer(
        examples["text"], truncation=True,
        max_length=128, padding=False,
    )

raw_dataset = Dataset.from_dict({"text": reviews, "label": labels})
raw_dataset = raw_dataset.train_test_split(test_size=0.2, seed=42)
tokenized_datasets = raw_dataset.map(
    tokenize_function, batched=True, remove_columns=["text"],
)

data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    return {
        "accuracy": accuracy_score(labels, predictions),
        "f1": f1_score(labels, predictions, average="weighted"),
    }

training_args = TrainingArguments(
    output_dir="./bert_sentiment",
    num_train_epochs=3,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=32,
    learning_rate=2e-5,
    weight_decay=0.01,
    warmup_ratio=0.1,
    evaluation_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
    metric_for_best_model="f1",
    fp16=(device == "cuda"),
    report_to="none",
)

trainer = Trainer(
    model=model, args=training_args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["test"],
    tokenizer=tokenizer, data_collator=data_collator,
    compute_metrics=compute_metrics,
)

print("\n🚀 Starting BERT Fine-Tuning...")
trainer.train()

results = trainer.evaluate()
print(f"\n✅ Eval Accuracy: {results['eval_accuracy']:.4f}")
print(f"✅ Eval F1 Score: {results['eval_f1']:.4f}")

trainer.save_model("./bert_sentiment_final")
tokenizer.save_pretrained("./bert_sentiment_final")
print("💾 Model saved to ./bert_sentiment_final")

🔧 Troubleshooting

❌ Error: OSError: Can't load tokenizer
🔍 Cause: Model name typo or no internet access
✅ Fix: Check the spelling of the model name; ensure internet access; for gated models, run huggingface-cli login

❌ Error: Token indices sequence length > 512
🔍 Cause: Input too long for BERT
✅ Fix: Set truncation=True, max_length=512

❌ Error: OOM during fine-tuning
🔍 Cause: Batch size too large
✅ Fix: Reduce the batch size to 8 or 4; use gradient_accumulation_steps=4 to keep the effective batch size

🏗️ Practical Project

AI Sentiment Analysis API

A dual-mode sentiment analysis engine supporting both fast (TF-IDF) and accurate (BERT) inference modes. Production-ready with latency tracking and batch processing.

python
from transformers import pipeline as hf_pipeline
import time, torch, pandas as pd

# ── Load Models ───────────────────────────────────────────
bert_sentiment = hf_pipeline(
    "sentiment-analysis",
    model="distilbert-base-uncased-finetuned-sst-2-english",
    device=0 if torch.cuda.is_available() else -1,
)

class SentimentAnalysisEngine:
    """Dual-mode sentiment analyzer: fast (TF-IDF) or accurate (BERT)."""

    def __init__(self, fast_model, bert_model):
        self.fast_model = fast_model
        self.bert_model = bert_model

    def analyze(self, text: str, fast_mode: bool = False) -> dict:
        start = time.time()
        if fast_mode:
            cleaned = clean_text(text)
            pred = self.fast_model.predict([cleaned])[0]
            proba = self.fast_model.predict_proba([cleaned])[0]
            label = "POSITIVE" if pred == 1 else "NEGATIVE"
            score = float(max(proba))
            model_used = "TF-IDF + LR"
        else:
            result = self.bert_model(text[:512])[0]
            label = result["label"]
            score = result["score"]
            model_used = "DistilBERT"
        elapsed = (time.time() - start) * 1000
        return {
            "text": text[:100] + ("..." if len(text) > 100 else ""),
            "sentiment": label,
            "confidence": f"{score:.1%}",
            "model": model_used,
            "latency_ms": f"{elapsed:.1f}ms"
        }

    def batch_analyze(self, texts: list, fast_mode: bool = True) -> pd.DataFrame:
        return pd.DataFrame([self.analyze(t, fast_mode) for t in texts])

engine = SentimentAnalysisEngine(text_pipeline, bert_sentiment)

# ── Demo ──────────────────────────────────────────────────
test_reviews = [
    "Absolutely love this product! Best purchase this year.",
    "Terrible quality. Broke after 2 days. Total waste.",
    "It's okay. Does the job I suppose.",
    "Outstanding customer service! Resolved in minutes.",
    "Packaging damaged, product missing parts. Disappointed.",
]

for review in test_reviews:
    fast = engine.analyze(review, fast_mode=True)
    bert = engine.analyze(review, fast_mode=False)
    print(f"\n📝 {fast['text']}")
    print(f"   ⚡ Fast:  {fast['sentiment']} ({fast['confidence']})")
    print(f"   🧠 BERT:  {bert['sentiment']} ({bert['confidence']})")

🔧 Troubleshooting

❌ Error: BERT training loss stuck at 0.693 after 3 epochs
🔍 Cause: Loss = ln(2) ≈ 0.693 means the model is guessing randomly between 2 classes; it is not learning
✅ Fix: The learning rate is likely too high or too low; try 2e-5, 3e-5, or 5e-5

❌ Error: Accuracy stuck at the class distribution
🔍 Cause: The model predicts only the majority class
✅ Fix: Balance the dataset or use a class-weighted loss

Topics Covered

#NLP #BERT #Transformers #TF-IDF #Attention #HuggingFace