📊 Module 1
Supervised Learning: Foundations, Regression & Classification
⏱️ 12–18 hours · 📊 Beginner · 🧩 5 Code Blocks · 🏗️ 1 Project
🎯 Learning Objectives
- ✓ Understand what Machine Learning is and how it differs from traditional programming.
- ✓ Get comfortable with Python libraries essential for ML: NumPy, Pandas, and Matplotlib.
- ✓ Learn data preprocessing — handling missing values, scaling, and splitting data correctly.
- ✓ Understand the math behind Linear Regression, Logistic Regression, and SVMs — explained step by step.
- ✓ Implement, tune, and evaluate supervised learning models using Scikit-Learn.
- ✓ Build a real-world House Price Prediction system as your first ML project.
📋 Prerequisites
- Python installed on your computer
- Basic Python knowledge (variables, loops, functions)
- No ML experience needed — we start from zero!
📐 Technical Theory
🌟 What is Machine Learning? (Start Here!)
Imagine you want to build a program that recognises cats in photos. With traditional programming, you would write thousands of rules: "if the image has pointy ears AND whiskers AND fur..." — this approach breaks quickly.
Machine Learning flips this on its head. Instead of writing rules, you show the computer thousands of examples (photos of cats and not-cats), and it automatically discovers the patterns. That's it — ML is just learning from examples.
Think of it like teaching a child to recognise fruits. You don't explain the biology of an apple — you just show them many apples until they "get it."
| Traditional Programming | Machine Learning |
|---|---|
| You write the rules | The computer discovers the rules |
| Input: data + rules → Output | Input: data + output → Rules |
| Breaks with complexity | Improves with more data |
| Example: calculator | Example: spam filter, Netflix recommendations |
🐍 Python for ML — A Quick Crash Course
Before we dive into ML, let's make sure you're comfortable with the key Python tools. You only need 3 libraries to start:
1. NumPy — handles numbers and arrays (think: math on steroids)
2. Pandas — handles tables of data (think: Excel in Python)
3. Matplotlib — creates charts and graphs (think: visualisation)
Don't worry if these are new to you — we'll use them hands-on and you'll pick them up naturally. The code examples below include detailed comments explaining every line.
| Library | What It Does | Analogy |
|---|---|---|
| NumPy | Fast math with arrays of numbers | A powerful calculator |
| Pandas | Load, clean, and explore tabular data | Excel / Google Sheets |
| Matplotlib | Create plots and visualisations | A charting tool |
| Scikit-Learn | Build and train ML models | The ML toolkit |
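To give you a first taste, here's a tiny script that touches the first three libraries. The house sizes and the price formula are invented purely for illustration:

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# NumPy: math on whole arrays at once, no loops needed
sizes = np.array([600, 850, 1100, 1500])        # house sizes (sq ft)
prices = sizes * 150 + 50_000                   # vectorised arithmetic

# Pandas: the same numbers as a labelled table
df = pd.DataFrame({"size_sqft": sizes, "price": prices})
print(df.describe())                            # summary stats in one call

# Matplotlib: a quick look at the relationship
plt.scatter(df["size_sqft"], df["price"])
plt.xlabel("Size (sq ft)")
plt.ylabel("Price")
plt.title("Size vs Price")
plt.show()
```

Notice there isn't a single `for` loop: NumPy applies the arithmetic to every element, and Pandas summarises the whole table in one call.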
📦 Data Preprocessing 101 — Why Clean Data Matters
Here's a secret most beginners miss: 80% of a data scientist's time is spent cleaning and preparing data, not building models. A perfect algorithm on garbage data will produce garbage results.
Key preprocessing steps we'll cover:
• Handling missing values — What do you do when some rows have blank cells?
• Feature scaling — Making sure all numbers are on the same scale (e.g., age 0-100 vs salary 0-1,000,000)
• Train/Test split — Never test on the same data you trained on (that's like memorising the answer key!)
• Encoding categories — Converting text labels ("red", "blue") into numbers the model understands
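Here's a minimal sketch of all four steps using pandas and scikit-learn; the four-row toy table is made up for illustration:

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# A four-row toy table with a blank cell and a text column
df = pd.DataFrame({
    "age": [25, 32, None, 41],
    "salary": [30_000, 52_000, 48_000, 75_000],
    "color": ["red", "blue", "red", "blue"],
    "target": [0, 1, 1, 1],
})

# 1. Missing values: fill the blank age with the column median
df["age"] = df["age"].fillna(df["age"].median())

# 2. Encoding categories: text labels become 0/1 columns
df = pd.get_dummies(df, columns=["color"])

# 3. Train/test split comes BEFORE scaling
X = df.drop("target", axis=1)
y = df["target"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# 4. Feature scaling: fit on train only, transform both
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
print(X_train_scaled.shape, X_test_scaled.shape)  # (3, 4) (1, 4)
```

The ordering matters: splitting before scaling means the scaler never sees the test rows, which is exactly the leakage rule we'll enforce in the real code below.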
What is Supervised Learning?
Now that you understand ML basics, let's get specific. Supervised Learning is the most common type of ML. Think of "supervised" as "learning with a teacher" — you give the model both the question AND the correct answer, and it learns the pattern.
Formally: the model learns a mapping function f(X) → y from labeled data, where X is the input and y is the correct output.
| Task | Output Type | Example |
|---|---|---|
| Regression | A number (continuous) | Predicting house price (₹ 45,00,000) |
| Classification | A category (discrete) | Is this email Spam or Not Spam? |
Linear Regression — Your First Algorithm
Linear Regression is the "Hello World" of Machine Learning. It draws a straight line (or plane) through your data to make predictions.
Imagine plotting house size (x-axis) vs price (y-axis) on a graph. Linear Regression finds the best-fit line through those dots. Once you have that line, you can predict the price of ANY house by finding where its size falls on the line.
Mathematically, it's just a weighted sum: each feature (like size, location, age) gets a weight that says "how important is this?"
ŷ = θ₀ + θ₁x₁ + θ₂x₂ + ... + θₙxₙ
Translation: prediction = bias + (weight₁ × feature₁) + (weight₂ × feature₂) + ...
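In NumPy, that weighted sum is a single dot product. The bias and weights below are made-up numbers, just to show the arithmetic:

```python
import numpy as np

theta0 = 50_000                       # bias: base price
theta = np.array([150.0, 20_000.0])   # weights for [size_sqft, num_rooms]
x = np.array([1_000.0, 3.0])          # one house: 1000 sq ft, 3 rooms

y_hat = theta0 + theta @ x            # the weighted sum as a dot product
print(y_hat)                          # 50000 + 150*1000 + 20000*3 = 260000.0
```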
Cost Function — How Do We Know If the Model is Good?
How does the model know if its predictions are good or bad? We need a "scorecard" — that's the Cost Function.
The most common one is Mean Squared Error (MSE). It works like this:
1. For each prediction, calculate the error: (predicted - actual)
2. Square it (so negative errors don't cancel out positive ones)
3. Average all the squared errors
A lower MSE = better predictions. The model's job is to find weights that make MSE as small as possible.
J(θ) = (1 / 2m) × Σ (ŷᵢ - yᵢ)²
In plain English: average of (prediction - actual)² across all examples
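The three steps above, computed directly in NumPy on invented toy numbers. (The extra ½ in J(θ) just makes the derivative tidier; it doesn't change which weights are best.)

```python
import numpy as np

actual = np.array([3.0, 5.0, 7.0])
predicted = np.array([2.5, 5.0, 8.0])

errors = predicted - actual   # step 1: per-example error
squared = errors ** 2         # step 2: square (no cancelling)
mse = squared.mean()          # step 3: average
print(mse)                    # 1.25 / 3 ≈ 0.4167
```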
Gradient Descent — How the Model Learns
Imagine you're blindfolded on a hilly landscape and need to find the lowest valley. What would you do? Feel the slope under your feet and take a step downhill. Repeat until you can't go lower. That's Gradient Descent!
The model starts with random weights (random position on the hill). It calculates the slope (gradient) of the error, then takes a small step in the direction that reduces the error. After thousands of tiny steps, it reaches the minimum.
The learning rate (α) controls step size:
• Too large → you overshoot and bounce around
• Too small → training takes forever
• Just right → smooth convergence (typically α = 0.001 to 0.01)
θⱼ := θⱼ - α × (∂J/∂θⱼ)
Translation: new_weight = old_weight - learning_rate × slope_of_error
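Here's the full loop in miniature: a bare-bones gradient descent fitting a one-feature line to toy data generated from y = 2x + 1:

```python
import numpy as np

# Toy data generated from the line y = 2x + 1
x = np.array([0.0, 1.0, 2.0, 3.0])
y = 2 * x + 1

theta0, theta1 = 0.0, 0.0   # start at an arbitrary point on the "hill"
alpha = 0.1                 # learning rate
m = len(x)

for _ in range(2000):
    y_hat = theta0 + theta1 * x           # current predictions
    grad0 = (y_hat - y).sum() / m         # ∂J/∂θ₀ for J = (1/2m) Σ (ŷ - y)²
    grad1 = ((y_hat - y) * x).sum() / m   # ∂J/∂θ₁
    theta0 -= alpha * grad0               # step downhill
    theta1 -= alpha * grad1

print(round(theta0, 3), round(theta1, 3))  # converges close to 1 and 2
```

Try setting `alpha = 1.0` to see the "too large" case: the weights bounce around instead of settling.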
Logistic Regression — Yes or No Questions
What if instead of predicting a number (like price), you want to predict a category (like "Spam" or "Not Spam")?
Logistic Regression takes the Linear Regression formula and squeezes it through a special S-shaped function called Sigmoid. This converts any number into a probability between 0 and 1:
• Output = 0.92 → "92% chance this is spam" → classify as Spam
• Output = 0.15 → "15% chance this is spam" → classify as Not Spam
The decision boundary is usually 0.5 — above it means "yes", below means "no".
σ(z) = 1 / (1 + e⁻ᶻ) → squishes any number into the range (0, 1)
Loss: J(θ) = -(1/m) × Σ [yᵢ log(ŷᵢ) + (1 - yᵢ) log(1 - ŷᵢ)]
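The sigmoid squash in code; the raw scores here are invented for the example:

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

scores = np.array([-3.0, 0.0, 2.44])   # raw outputs of the linear part
probs = sigmoid(scores)                # squashed into (0, 1)
print(np.round(probs, 2))              # values: 0.05, 0.5, 0.92

labels = (probs >= 0.5).astype(int)    # apply the 0.5 decision boundary
print(labels)                          # [0 1 1]
```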
The Bias-Variance Tradeoff — The Golden Rule of ML
This is the most important concept in all of ML. Every model you build faces this tension:
• Underfitting (High Bias): Your model is too simple — like trying to draw a straight line through a curvy pattern. It misses the real signal. Example: predicting house prices using only the number of rooms.
• Overfitting (High Variance): Your model is too complex — it memorises the training data perfectly (including noise and outliers) but fails miserably on new data. Example: a model that learns "house #47 costs ₹50 lakhs" instead of learning general patterns.
The sweet spot is a model complex enough to capture real patterns, but simple enough to generalise to unseen data.
| Problem | Fix |
|---|---|
| Underfitting | More features, more complex model, reduce regularization |
| Overfitting | More data, regularization (L1/L2), cross-validation |
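You can watch the tradeoff happen by fitting polynomials of increasing degree to noisy data. This toy experiment is our own illustration, not part of the module's project:

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(60, 1))
y = np.sin(X).ravel() + rng.normal(0, 0.2, 60)   # curvy signal + noise

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.5, random_state=0)

results = {}
for degree in (1, 4, 15):
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(X_tr, y_tr)
    train_mse = mean_squared_error(y_tr, model.predict(X_tr))
    test_mse = mean_squared_error(y_te, model.predict(X_te))
    results[degree] = (train_mse, test_mse)
    print(f"degree {degree:2d}: train MSE {train_mse:.3f} | test MSE {test_mse:.3f}")
```

Degree 1 underfits (both errors high, a straight line through a sine curve), degree 15 drives training error down while test error typically climbs, and a middle degree sits in the sweet spot.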
Regularization — Preventing Overfitting
Regularization is like telling your model: "Keep it simple!" It adds a penalty for having large, complex weights.
Two common types:
• Ridge (L2): Shrinks all weights towards zero, but never exactly to zero. Good when all features are somewhat useful.
• Lasso (L1): Can shrink some weights to exactly zero — effectively removing useless features. Great for feature selection.
The strength of the penalty is controlled by λ (lambda). Higher λ = simpler model.
Ridge: Cost = Error + λ × (sum of weights²)
Lasso: Cost = Error + λ × (sum of |weights|)
λ = 0 → no regularization; λ large → very simple model
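To see the difference in action, here's a small experiment on our own synthetic data (not from the module), where only 2 of 10 features actually matter:

```python
import numpy as np
from sklearn.linear_model import Ridge, Lasso

rng = np.random.default_rng(42)
X = rng.normal(size=(200, 10))
# Only features 0 and 3 truly influence y; the other 8 are pure noise
y = 3 * X[:, 0] - 2 * X[:, 3] + rng.normal(0, 0.1, 200)

ridge = Ridge(alpha=1.0).fit(X, y)
lasso = Lasso(alpha=0.1).fit(X, y)

print("Ridge coefficients at exactly zero:", int(np.sum(ridge.coef_ == 0)))
print("Lasso coefficients at exactly zero:", int(np.sum(lasso.coef_ == 0)))
```

Lasso typically zeros out most of the eight useless weights, while Ridge only shrinks them, which is why Lasso doubles as a feature-selection tool. (Note: scikit-learn calls the penalty strength `alpha`, the λ from the formulas above.)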
💻 Code Implementation
Step 1: Data Loading & Exploratory Analysis
```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import fetch_california_housing

# ── Load Dataset ──────────────────────────────────────────
housing = fetch_california_housing(as_frame=True)
df = housing.frame

print("Dataset Shape:", df.shape)
print("\nFirst 5 rows:")
print(df.head())
print("\nStatistical Summary:")
print(df.describe())

# ── Check for Missing Values ──────────────────────────────
print("\nMissing Values:")
print(df.isnull().sum())

# ── Visualize Feature Correlations ───────────────────────
plt.figure(figsize=(12, 8))
sns.heatmap(df.corr(), annot=True, fmt=".2f", cmap="coolwarm", linewidths=0.5)
plt.title("Feature Correlation Heatmap", fontsize=16, pad=20)
plt.tight_layout()
plt.savefig("correlation_heatmap.png", dpi=150)
plt.show()

# ── Distribution of Target Variable ──────────────────────
plt.figure(figsize=(8, 5))
sns.histplot(df["MedHouseVal"], bins=50, kde=True, color="#4C72B0")
plt.title("Distribution of Median House Value")
plt.xlabel("Median House Value (×$100,000)")
plt.tight_layout()
plt.show()
```

🔧 Troubleshooting
- ❌ Error: `ModuleNotFoundError: No module named 'sklearn'` 🔍 Cause: scikit-learn not installed ✅ Fix: `pip install scikit-learn`
- ❌ Error: `ModuleNotFoundError: No module named 'seaborn'` 🔍 Cause: Seaborn not installed ✅ Fix: `pip install seaborn`
Step 2: Data Preprocessing — Scaling & Split
```python
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# ── Separate Features and Target ─────────────────────────
X = df.drop("MedHouseVal", axis=1)
y = df["MedHouseVal"]

# ── Train/Validation/Test Split ──────────────────────────
# 70% train, 15% validation, 15% test
X_train, X_temp, y_train, y_temp = train_test_split(
    X, y, test_size=0.30, random_state=42
)
X_val, X_test, y_val, y_test = train_test_split(
    X_temp, y_temp, test_size=0.50, random_state=42
)

print(f"Training set:   {X_train.shape[0]} samples")
print(f"Validation set: {X_val.shape[0]} samples")
print(f"Test set:       {X_test.shape[0]} samples")

# ── Feature Scaling (StandardScaler) ─────────────────────
# Formula: z = (x - μ) / σ (zero mean, unit variance)
scaler = StandardScaler()

# IMPORTANT: Fit ONLY on training data to prevent data leakage!
X_train_scaled = scaler.fit_transform(X_train)
X_val_scaled = scaler.transform(X_val)    # Only transform, never fit
X_test_scaled = scaler.transform(X_test)  # Only transform, never fit

print("\nScaling complete. Feature means (should be ~0):")
print(np.round(X_train_scaled.mean(axis=0), 3))
```

🔧 Troubleshooting
- ❌ Error: `ValueError: Input contains NaN` 🔍 Cause: missing values in the dataset ✅ Fix: use `df.fillna(df.median())` or `SimpleImputer` before scaling
Step 3: Model Training & Evaluation
```python
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.metrics import mean_squared_error, r2_score, mean_absolute_error

def evaluate_model(name, model, X_tr, y_tr, X_v, y_v):
    """Train a model and print key regression metrics."""
    model.fit(X_tr, y_tr)
    preds = model.predict(X_v)
    rmse = np.sqrt(mean_squared_error(y_v, preds))
    mae = mean_absolute_error(y_v, preds)
    r2 = r2_score(y_v, preds)
    print(f"\n── {name} ──")
    print(f"  RMSE : {rmse:.4f}")
    print(f"  MAE  : {mae:.4f}")
    print(f"  R²   : {r2:.4f}")
    return model, preds

# ── Define Models ─────────────────────────────────────────
models = [
    ("Linear Regression", LinearRegression()),
    ("Ridge Regression (L2)", Ridge(alpha=1.0)),
    ("Lasso Regression (L1)", Lasso(alpha=0.01)),
    ("Random Forest", RandomForestRegressor(n_estimators=100, random_state=42)),
    ("Gradient Boosting", GradientBoostingRegressor(
        n_estimators=200, learning_rate=0.1, random_state=42
    )),
]

results = {}
for name, model in models:
    trained_model, preds = evaluate_model(
        name, model,
        X_train_scaled, y_train,
        X_val_scaled, y_val
    )
    results[name] = (trained_model, preds)
```

🔧 Troubleshooting
- ❌ Error: R² score is negative 🔍 Cause: the model performs worse than always predicting the mean ✅ Fix: check for data leakage or an incorrectly fitted scaler
- ❌ Error: `ConvergenceWarning` (Lasso/Ridge) 🔍 Cause: the solver didn't converge ✅ Fix: pass `max_iter=10000` to the estimator
Step 4: Hyperparameter Tuning with Cross-Validation
```python
from sklearn.model_selection import GridSearchCV

# ── Define the search space ───────────────────────────────
param_grid = {
    "n_estimators": [100, 200, 300],
    "max_depth": [3, 5, 7, None],
    "min_samples_split": [2, 5, 10],
}

rf = RandomForestRegressor(random_state=42)

# 5-fold cross-validation with negative MSE as scoring
grid_search = GridSearchCV(
    estimator=rf,
    param_grid=param_grid,
    cv=5,
    scoring="neg_mean_squared_error",
    n_jobs=-1,  # Use all available CPU cores
    verbose=1,
)
grid_search.fit(X_train_scaled, y_train)

print("\n✅ Best Parameters Found:")
print(grid_search.best_params_)

best_rf = grid_search.best_estimator_
test_preds = best_rf.predict(X_test_scaled)
final_r2 = r2_score(y_test, test_preds)
final_rmse = np.sqrt(mean_squared_error(y_test, test_preds))
print(f"\n🏆 Final Test Set Performance (Best Random Forest):")
print(f"  RMSE : {final_rmse:.4f}")
print(f"  R²   : {final_r2:.4f}")

# ── Feature Importance Plot ───────────────────────────────
importances = pd.Series(best_rf.feature_importances_, index=X.columns)
importances.sort_values(ascending=True).plot(
    kind="barh", figsize=(10, 6),
    color="#2C7BB6", title="Feature Importances — Random Forest"
)
plt.tight_layout()
plt.savefig("feature_importances.png", dpi=150)
plt.show()
```

🔧 Troubleshooting
- ❌ Error: GridSearch takes hours 🔍 Cause: too many hyperparameter combinations ✅ Fix: reduce the grid size or use `RandomizedSearchCV` instead
- ❌ Error: `MemoryError` during fit 🔍 Cause: dataset too large for RAM ✅ Fix: use `partial_fit()` or `SGDRegressor`
Step 5: Classification — SVM with RBF Kernel
```python
from sklearn.datasets import load_iris
from sklearn.svm import SVC
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline

# ── Load Iris Dataset ──────────────────────────────────────
iris = load_iris(as_frame=True)
X_iris, y_iris = iris.data, iris.target

X_tr, X_te, y_tr, y_te = train_test_split(
    X_iris, y_iris, test_size=0.20, random_state=42, stratify=y_iris
)

sc = StandardScaler()
X_tr_sc = sc.fit_transform(X_tr)
X_te_sc = sc.transform(X_te)

# ── Train SVM with RBF Kernel ─────────────────────────────
svm = SVC(kernel="rbf", C=10.0, gamma="scale", probability=True)
svm.fit(X_tr_sc, y_tr)
y_pred = svm.predict(X_te_sc)

print("📊 Classification Report:")
print(classification_report(y_te, y_pred, target_names=iris.target_names))

# ── Confusion Matrix ──────────────────────────────────────
cm = confusion_matrix(y_te, y_pred)
plt.figure(figsize=(6, 5))
sns.heatmap(cm, annot=True, fmt="d", cmap="Blues",
            xticklabels=iris.target_names, yticklabels=iris.target_names)
plt.title("Confusion Matrix — SVM Classifier")
plt.ylabel("Actual"); plt.xlabel("Predicted")
plt.tight_layout()
plt.savefig("confusion_matrix.png", dpi=150)
plt.show()

# ── 10-Fold Stratified Cross-Validation ──────────────────
# Wrap scaler + SVM in a Pipeline so each fold fits the scaler
# on its own training portion only (no data leakage)
svm_pipe = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=10.0, gamma="scale"))
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)
cv_scores = cross_val_score(svm_pipe, X_iris, y_iris, cv=cv, scoring="accuracy")
print(f"\n10-Fold CV Accuracy: {cv_scores.mean():.4f} ± {cv_scores.std():.4f}")
```

🔧 Troubleshooting
- ❌ Error: overfitting (train accuracy ≈ 100%, test ≈ 60%) 🔍 Cause: model too complex for the data ✅ Fix: reduce the `C` value, use regularization, get more data
🏗️ Practical Project
House Price Prediction System
Build a production-style prediction pipeline with persistence. This project uses a scikit-learn Pipeline to chain preprocessing and modeling, so the exact same scaling is applied at training and prediction time and the scaler is never fitted on test data.
```python
import pickle
import numpy as np
import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score, mean_squared_error

# ── 1. Build a Reusable Pipeline ─────────────────────────
pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("model", GradientBoostingRegressor(
        n_estimators=300, learning_rate=0.08,
        max_depth=5, random_state=42
    ))
])

# ── 2. Load and Prepare Data ─────────────────────────────
housing = fetch_california_housing(as_frame=True)
X, y = housing.data, housing.target
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# ── 3. Train the Pipeline ─────────────────────────────────
pipeline.fit(X_train, y_train)

# ── 4. Evaluate ───────────────────────────────────────────
preds = pipeline.predict(X_test)
r2 = r2_score(y_test, preds)
rmse = np.sqrt(mean_squared_error(y_test, preds))
print(f"✅ Model Trained! R² = {r2:.4f} | RMSE = {rmse:.4f}")

# ── 5. Persist Model to Disk ──────────────────────────────
with open("house_price_model.pkl", "wb") as f:
    pickle.dump(pipeline, f)
print("💾 Model saved to house_price_model.pkl")

# ── 6. Production Prediction Function ─────────────────────
FEATURE_ORDER = [
    "MedInc", "HouseAge", "AveRooms", "AveBedrms",
    "Population", "AveOccup", "Latitude", "Longitude"
]

def predict_house_price(features: dict) -> float:
    """Accepts a dict of housing features, returns predicted price in dollars."""
    with open("house_price_model.pkl", "rb") as f:
        loaded_model = pickle.load(f)
    # Build a one-row DataFrame so column names match the training data
    X_new = pd.DataFrame(
        [[features[k] for k in FEATURE_ORDER]], columns=FEATURE_ORDER
    )
    prediction = loaded_model.predict(X_new)[0]
    return round(prediction * 100_000, 2)  # target is in units of $100,000

# ── Example Usage ─────────────────────────────────────────
sample_house = {
    "MedInc": 8.3252, "HouseAge": 41.0, "AveRooms": 6.984,
    "AveBedrms": 1.023, "Population": 322.0, "AveOccup": 2.555,
    "Latitude": 37.88, "Longitude": -122.23,
}
price = predict_house_price(sample_house)
print(f"\n🏠 Predicted House Price: ${price:,.2f}")
```

🔧 Troubleshooting
- ❌ Error: `FileNotFoundError: house_price_model.pkl` 🔍 Cause: model file not saved or wrong path ✅ Fix: run the training step first to generate the `.pkl` file
- ❌ Error: `KeyError` in `predict_house_price` 🔍 Cause: missing feature in the input dict ✅ Fix: ensure all 8 features are provided with the correct key names
Topics Covered
#Python Basics · #NumPy · #Pandas · #Data Preprocessing · #Regression · #Classification · #SVM · #Scikit-Learn