📊 Module 1
Supervised Learning: Foundations, Regression & Classification
⏱️ 12–18 hours · 📊 Beginner · 🧩 5 Code Blocks · 🏗️ 1 Project
🎯 Learning Objectives
- ✓ Understand what Machine Learning is and how it differs from traditional programming.
- ✓ Get comfortable with Python libraries essential for ML: NumPy, Pandas, and Matplotlib.
- ✓ Learn data preprocessing — handling missing values, scaling, and splitting data correctly.
- ✓ Understand the math behind Linear Regression, Logistic Regression, and SVMs — explained step by step.
- ✓ Implement, tune, and evaluate supervised learning models using Scikit-Learn.
- ✓ Build a real-world House Price Prediction system as your first ML project.
📋 Prerequisites
- Python installed on your computer
- Basic Python knowledge (variables, loops, functions)
- No ML experience needed — we start from zero!
📐 Technical Theory
🌟 What is Machine Learning? (Start Here!)
Imagine you want to build a program that recognises cats in photos. With traditional programming, you would write thousands of rules: "if the image has pointy ears AND whiskers AND fur..." — this approach breaks quickly.
Machine Learning flips this on its head. Instead of writing rules, you show the computer thousands of examples (photos of cats and not-cats), and it automatically discovers the patterns. That's it — ML is just learning from examples.
Think of it like teaching a child to recognise fruits. You don't explain the biology of an apple — you just show them many apples until they "get it."
| Traditional Programming | Machine Learning |
|---|---|
| You write the rules | The computer discovers the rules |
| Input: data + rules → Output | Input: data + output → Rules |
| Breaks with complexity | Improves with more data |
| Example: calculator | Example: spam filter, Netflix recommendations |
🐍 Python for ML — A Quick Crash Course
Before we dive into ML, let's make sure you're comfortable with the key Python tools. You only need 3 libraries to start:
1. NumPy — handles numbers and arrays (think: math on steroids)
2. Pandas — handles tables of data (think: Excel in Python)
3. Matplotlib — creates charts and graphs (think: visualisation)
Don't worry if these are new to you — we'll use them hands-on and you'll pick them up naturally. The code examples below include detailed comments explaining every line.
| Library | What It Does | Analogy |
|---|---|---|
| NumPy | Fast math with arrays of numbers | A powerful calculator |
| Pandas | Load, clean, and explore tabular data | Excel / Google Sheets |
| Matplotlib | Create plots and visualisations | A charting tool |
| Scikit-Learn | Build and train ML models | The ML toolkit |
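To give you a first taste, here's a tiny script that touches the first three libraries. The house sizes and the price formula are invented purely for illustration:

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# NumPy: math on whole arrays at once, no loops needed
sizes = np.array([600, 850, 1100, 1500])        # house sizes (sq ft)
prices = sizes * 150 + 50_000                   # vectorised arithmetic

# Pandas: the same numbers as a labelled table
df = pd.DataFrame({"size_sqft": sizes, "price": prices})
print(df.describe())                            # summary stats in one call

# Matplotlib: a quick look at the relationship
plt.scatter(df["size_sqft"], df["price"])
plt.xlabel("Size (sq ft)")
plt.ylabel("Price")
plt.title("Size vs Price")
plt.show()
```

Notice there isn't a single `for` loop: NumPy applies the arithmetic to every element, and Pandas summarises the whole table in one call.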
📦 Data Preprocessing 101 — Why Clean Data Matters
Here's a secret most beginners miss: 80% of a data scientist's time is spent cleaning and preparing data, not building models. A perfect algorithm on garbage data will produce garbage results.
Key preprocessing steps we'll cover:
• Handling missing values — What do you do when some rows have blank cells?
• Feature scaling — Making sure all numbers are on the same scale (e.g., age 0-100 vs salary 0-1,000,000)
• Train/Test split — Never test on the same data you trained on (that's like memorising the answer key!)
• Encoding categories — Converting text labels ("red", "blue") into numbers the model understands
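Here's a minimal sketch of all four steps using pandas and scikit-learn; the four-row toy table is made up for illustration:

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# A four-row toy table with a blank cell and a text column
df = pd.DataFrame({
    "age": [25, 32, None, 41],
    "salary": [30_000, 52_000, 48_000, 75_000],
    "color": ["red", "blue", "red", "blue"],
    "target": [0, 1, 1, 1],
})

# 1. Missing values: fill the blank age with the column median
df["age"] = df["age"].fillna(df["age"].median())

# 2. Encoding categories: text labels become 0/1 columns
df = pd.get_dummies(df, columns=["color"])

# 3. Train/test split comes BEFORE scaling
X = df.drop("target", axis=1)
y = df["target"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# 4. Feature scaling: fit on train only, transform both
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
print(X_train_scaled.shape, X_test_scaled.shape)  # (3, 4) (1, 4)
```

The ordering matters: splitting before scaling means the scaler never sees the test rows, which is exactly the leakage rule we'll enforce in the real code below.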
What is Supervised Learning?
Now that you understand ML basics, let's get specific. Supervised Learning is the most common type of ML. Think of "supervised" as "learning with a teacher" — you give the model both the question AND the correct answer, and it learns the pattern.
Formally: the model learns a mapping function f(X) → y from labeled data, where X is the input and y is the correct output.
| Task | Output Type | Example |
|---|---|---|
| Regression | A number (continuous) | Predicting house price (₹ 45,00,000) |
| Classification | A category (discrete) | Is this email Spam or Not Spam? |
Linear Regression — Your First Algorithm
Linear Regression is the "Hello World" of Machine Learning. It draws a straight line (or plane) through your data to make predictions.
Imagine plotting house size (x-axis) vs price (y-axis) on a graph. Linear Regression finds the best-fit line through those dots. Once you have that line, you can predict the price of ANY house by finding where its size falls on the line.
Mathematically, it's just a weighted sum: each feature (like size, location, age) gets a weight that says "how important is this?"
ŷ = θ₀ + θ₁x₁ + θ₂x₂ + ... + θₙxₙ
Translation: prediction = bias + (weight₁ × feature₁) + (weight₂ × feature₂) + ...
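In NumPy, that weighted sum is a single dot product. The bias and weights below are made-up numbers, just to show the arithmetic:

```python
import numpy as np

theta0 = 50_000                       # bias: base price
theta = np.array([150.0, 20_000.0])   # weights for [size_sqft, num_rooms]
x = np.array([1_000.0, 3.0])          # one house: 1000 sq ft, 3 rooms

y_hat = theta0 + theta @ x            # the weighted sum as a dot product
print(y_hat)                          # 50000 + 150*1000 + 20000*3 = 260000.0
```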
Cost Function — How Do We Know If the Model is Good?
How does the model know if its predictions are good or bad? We need a "scorecard" — that's the Cost Function.
The most common one is Mean Squared Error (MSE). It works like this:
1. For each prediction, calculate the error: (predicted - actual)
2. Square it (so negative errors don't cancel out positive ones)
3. Average all the squared errors
A lower MSE = better predictions. The model's job is to find weights that make MSE as small as possible.
J(θ) = (1 / 2m) × Σ (ŷᵢ - yᵢ)²
In plain English: average of (prediction - actual)² across all examples
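The three steps above, computed directly in NumPy on invented toy numbers. (The extra ½ in J(θ) just makes the derivative tidier; it doesn't change which weights are best.)

```python
import numpy as np

actual = np.array([3.0, 5.0, 7.0])
predicted = np.array([2.5, 5.0, 8.0])

errors = predicted - actual   # step 1: per-example error
squared = errors ** 2         # step 2: square (no cancelling)
mse = squared.mean()          # step 3: average
print(mse)                    # 1.25 / 3 ≈ 0.4167
```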
Gradient Descent — How the Model Learns
Imagine you're blindfolded on a hilly landscape and need to find the lowest valley. What would you do? Feel the slope under your feet and take a step downhill. Repeat until you can't go lower. That's Gradient Descent!
The model starts with random weights (random position on the hill). It calculates the slope (gradient) of the error, then takes a small step in the direction that reduces the error. After thousands of tiny steps, it reaches the minimum.
The learning rate (α) controls step size:
• Too large → you overshoot and bounce around
• Too small → training takes forever
• Just right → smooth convergence (typically α = 0.001 to 0.01)
θⱼ := θⱼ - α × (∂J/∂θⱼ)
Translation: new_weight = old_weight - learning_rate × slope_of_error
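Here's the full loop in miniature: a bare-bones gradient descent fitting a one-feature line to toy data generated from y = 2x + 1:

```python
import numpy as np

# Toy data generated from the line y = 2x + 1
x = np.array([0.0, 1.0, 2.0, 3.0])
y = 2 * x + 1

theta0, theta1 = 0.0, 0.0   # start at an arbitrary point on the "hill"
alpha = 0.1                 # learning rate
m = len(x)

for _ in range(2000):
    y_hat = theta0 + theta1 * x           # current predictions
    grad0 = (y_hat - y).sum() / m         # ∂J/∂θ₀ for J = (1/2m) Σ (ŷ - y)²
    grad1 = ((y_hat - y) * x).sum() / m   # ∂J/∂θ₁
    theta0 -= alpha * grad0               # step downhill
    theta1 -= alpha * grad1

print(round(theta0, 3), round(theta1, 3))  # converges close to 1 and 2
```

Try setting `alpha = 1.0` to see the "too large" case: the weights bounce around instead of settling.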
Logistic Regression — Yes or No Questions
What if instead of predicting a number (like price), you want to predict a category (like "Spam" or "Not Spam")?
Logistic Regression takes the Linear Regression formula and squeezes it through a special S-shaped function called Sigmoid. This converts any number into a probability between 0 and 1:
• Output = 0.92 → "92% chance this is spam" → classify as Spam
• Output = 0.15 → "15% chance this is spam" → classify as Not Spam
The decision boundary is usually 0.5 — above it means "yes", below means "no".
σ(z) = 1 / (1 + e⁻ᶻ) → squishes any number into the range (0, 1)
Loss: J(θ) = -(1/m) × Σ [yᵢ log(ŷᵢ) + (1 - yᵢ) log(1 - ŷᵢ)]
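The sigmoid squash in code; the raw scores here are invented for the example:

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

scores = np.array([-3.0, 0.0, 2.44])   # raw outputs of the linear part
probs = sigmoid(scores)                # squashed into (0, 1)
print(np.round(probs, 2))              # values: 0.05, 0.5, 0.92

labels = (probs >= 0.5).astype(int)    # apply the 0.5 decision boundary
print(labels)                          # [0 1 1]
```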
The Bias-Variance Tradeoff — The Golden Rule of ML
This is the most important concept in all of ML. Every model you build faces this tension:
• Underfitting (High Bias): Your model is too simple — like trying to draw a straight line through a curvy pattern. It misses the real signal. Example: predicting house prices using only the number of rooms.
• Overfitting (High Variance): Your model is too complex — it memorises the training data perfectly (including noise and outliers) but fails miserably on new data. Example: a model that learns "house #47 costs ₹50 lakhs" instead of learning general patterns.
The sweet spot is a model complex enough to capture real patterns, but simple enough to generalise to unseen data.
| Problem | Fix |
|---|---|
| Underfitting | More features, more complex model, reduce regularization |
| Overfitting | More data, regularization (L1/L2), cross-validation |
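You can watch the tradeoff happen by fitting polynomials of increasing degree to noisy data. This toy experiment is our own illustration, not part of the module's project:

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(60, 1))
y = np.sin(X).ravel() + rng.normal(0, 0.2, 60)   # curvy signal + noise

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.5, random_state=0)

results = {}
for degree in (1, 4, 15):
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(X_tr, y_tr)
    train_mse = mean_squared_error(y_tr, model.predict(X_tr))
    test_mse = mean_squared_error(y_te, model.predict(X_te))
    results[degree] = (train_mse, test_mse)
    print(f"degree {degree:2d}: train MSE {train_mse:.3f} | test MSE {test_mse:.3f}")
```

Degree 1 underfits (both errors high, a straight line through a sine curve), degree 15 drives training error down while test error typically climbs, and a middle degree sits in the sweet spot.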
Regularization — Preventing Overfitting
Regularization is like telling your model: "Keep it simple!" It adds a penalty for having large, complex weights.
Two common types:
• Ridge (L2): Shrinks all weights towards zero, but never exactly to zero. Good when all features are somewhat useful.
• Lasso (L1): Can shrink some weights to exactly zero — effectively removing useless features. Great for feature selection.
The strength of the penalty is controlled by λ (lambda). Higher λ = simpler model.
Ridge: Cost = Error + λ × (sum of weights²)
Lasso: Cost = Error + λ × (sum of |weights|)
λ = 0 → no regularization; λ large → very simple model
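To see the difference in action, here's a small experiment on our own synthetic data (not from the module), where only 2 of 10 features actually matter:

```python
import numpy as np
from sklearn.linear_model import Ridge, Lasso

rng = np.random.default_rng(42)
X = rng.normal(size=(200, 10))
# Only features 0 and 3 truly influence y; the other 8 are pure noise
y = 3 * X[:, 0] - 2 * X[:, 3] + rng.normal(0, 0.1, 200)

ridge = Ridge(alpha=1.0).fit(X, y)
lasso = Lasso(alpha=0.1).fit(X, y)

print("Ridge coefficients at exactly zero:", int(np.sum(ridge.coef_ == 0)))
print("Lasso coefficients at exactly zero:", int(np.sum(lasso.coef_ == 0)))
```

Lasso typically zeros out most of the eight useless weights, while Ridge only shrinks them, which is why Lasso doubles as a feature-selection tool. (Note: scikit-learn calls the penalty strength `alpha`, the λ from the formulas above.)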
💻 Code Implementation
Step 1: Data Loading & Exploratory Analysis
```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import fetch_california_housing

# ── Load Dataset ──────────────────────────────────────────
housing = fetch_california_housing(as_frame=True)
df = housing.frame

print("Dataset Shape:", df.shape)
print("\nFirst 5 rows:")
print(df.head())
print("\nStatistical Summary:")
print(df.describe())

# ── Check for Missing Values ──────────────────────────────
print("\nMissing Values:")
print(df.isnull().sum())

# ── Visualize Feature Correlations ───────────────────────
plt.figure(figsize=(12, 8))
sns.heatmap(df.corr(), annot=True, fmt=".2f", cmap="coolwarm", linewidths=0.5)
plt.title("Feature Correlation Heatmap", fontsize=16, pad=20)
plt.tight_layout()
plt.savefig("correlation_heatmap.png", dpi=150)
plt.show()

# ── Distribution of Target Variable ──────────────────────
plt.figure(figsize=(8, 5))
sns.histplot(df["MedHouseVal"], bins=50, kde=True, color="#4C72B0")
plt.title("Distribution of Median House Value")
plt.xlabel("Median House Value (×$100,000)")
plt.tight_layout()
plt.show()
```

🔧 Troubleshooting
- ❌ Error: `ModuleNotFoundError: No module named 'sklearn'` 🔍 Cause: scikit-learn not installed ✅ Fix: `pip install scikit-learn`
- ❌ Error: `ModuleNotFoundError: No module named 'seaborn'` 🔍 Cause: Seaborn not installed ✅ Fix: `pip install seaborn`
Step 2: Data Preprocessing — Scaling & Split
```python
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# ── Separate Features and Target ─────────────────────────
X = df.drop("MedHouseVal", axis=1)
y = df["MedHouseVal"]

# ── Train/Validation/Test Split ──────────────────────────
# 70% train, 15% validation, 15% test
X_train, X_temp, y_train, y_temp = train_test_split(
    X, y, test_size=0.30, random_state=42
)
X_val, X_test, y_val, y_test = train_test_split(
    X_temp, y_temp, test_size=0.50, random_state=42
)

print(f"Training set:   {X_train.shape[0]} samples")
print(f"Validation set: {X_val.shape[0]} samples")
print(f"Test set:       {X_test.shape[0]} samples")

# ── Feature Scaling (StandardScaler) ─────────────────────
# Formula: z = (x - μ) / σ (zero mean, unit variance)
scaler = StandardScaler()

# IMPORTANT: Fit ONLY on training data to prevent data leakage!
X_train_scaled = scaler.fit_transform(X_train)
X_val_scaled = scaler.transform(X_val)    # Only transform, never fit
X_test_scaled = scaler.transform(X_test)  # Only transform, never fit

print("\nScaling complete. Feature means (should be ~0):")
print(np.round(X_train_scaled.mean(axis=0), 3))
```

🔧 Troubleshooting
- ❌ Error: `ValueError: Input contains NaN` 🔍 Cause: missing values in the dataset ✅ Fix: use `df.fillna(df.median())` or `SimpleImputer` before scaling
Step 3: Model Training & Evaluation
```python
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.metrics import mean_squared_error, r2_score, mean_absolute_error

def evaluate_model(name, model, X_tr, y_tr, X_v, y_v):
    """Train a model and print key regression metrics."""
    model.fit(X_tr, y_tr)
    preds = model.predict(X_v)
    rmse = np.sqrt(mean_squared_error(y_v, preds))
    mae = mean_absolute_error(y_v, preds)
    r2 = r2_score(y_v, preds)
    print(f"\n── {name} ──")
    print(f"  RMSE : {rmse:.4f}")
    print(f"  MAE  : {mae:.4f}")
    print(f"  R²   : {r2:.4f}")
    return model, preds

# ── Define Models ─────────────────────────────────────────
models = [
    ("Linear Regression", LinearRegression()),
    ("Ridge Regression (L2)", Ridge(alpha=1.0)),
    ("Lasso Regression (L1)", Lasso(alpha=0.01)),
    ("Random Forest", RandomForestRegressor(n_estimators=100, random_state=42)),
    ("Gradient Boosting", GradientBoostingRegressor(
        n_estimators=200, learning_rate=0.1, random_state=42
    )),
]

results = {}
for name, model in models:
    trained_model, preds = evaluate_model(
        name, model,
        X_train_scaled, y_train,
        X_val_scaled, y_val
    )
    results[name] = (trained_model, preds)
```

🔧 Troubleshooting
- ❌ Error: R² score is negative 🔍 Cause: the model performs worse than always predicting the mean ✅ Fix: check for data leakage or an incorrectly fitted scaler
- ❌ Error: `ConvergenceWarning` (Lasso/Ridge) 🔍 Cause: the solver didn't converge ✅ Fix: pass `max_iter=10000` to the estimator
Step 4: Hyperparameter Tuning with Cross-Validation
```python
from sklearn.model_selection import GridSearchCV

# ── Define the search space ───────────────────────────────
param_grid = {
    "n_estimators": [100, 200, 300],
    "max_depth": [3, 5, 7, None],
    "min_samples_split": [2, 5, 10],
}

rf = RandomForestRegressor(random_state=42)

# 5-fold cross-validation with negative MSE as scoring
grid_search = GridSearchCV(
    estimator=rf,
    param_grid=param_grid,
    cv=5,
    scoring="neg_mean_squared_error",
    n_jobs=-1,  # Use all available CPU cores
    verbose=1,
)
grid_search.fit(X_train_scaled, y_train)

print("\n✅ Best Parameters Found:")
print(grid_search.best_params_)

best_rf = grid_search.best_estimator_
test_preds = best_rf.predict(X_test_scaled)
final_r2 = r2_score(y_test, test_preds)
final_rmse = np.sqrt(mean_squared_error(y_test, test_preds))
print(f"\n🏆 Final Test Set Performance (Best Random Forest):")
print(f"  RMSE : {final_rmse:.4f}")
print(f"  R²   : {final_r2:.4f}")

# ── Feature Importance Plot ───────────────────────────────
importances = pd.Series(best_rf.feature_importances_, index=X.columns)
importances.sort_values(ascending=True).plot(
    kind="barh", figsize=(10, 6),
    color="#2C7BB6", title="Feature Importances — Random Forest"
)
plt.tight_layout()
plt.savefig("feature_importances.png", dpi=150)
plt.show()
```

🔧 Troubleshooting
- ❌ Error: GridSearch takes hours 🔍 Cause: too many hyperparameter combinations ✅ Fix: reduce the grid size or use `RandomizedSearchCV` instead
- ❌ Error: `MemoryError` during fit 🔍 Cause: dataset too large for RAM ✅ Fix: use `partial_fit()` or `SGDRegressor`
Step 5: Classification — SVM with RBF Kernel
```python
from sklearn.datasets import load_iris
from sklearn.svm import SVC
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline

# ── Load Iris Dataset ──────────────────────────────────────
iris = load_iris(as_frame=True)
X_iris, y_iris = iris.data, iris.target

X_tr, X_te, y_tr, y_te = train_test_split(
    X_iris, y_iris, test_size=0.20, random_state=42, stratify=y_iris
)

sc = StandardScaler()
X_tr_sc = sc.fit_transform(X_tr)
X_te_sc = sc.transform(X_te)

# ── Train SVM with RBF Kernel ─────────────────────────────
svm = SVC(kernel="rbf", C=10.0, gamma="scale", probability=True)
svm.fit(X_tr_sc, y_tr)
y_pred = svm.predict(X_te_sc)

print("📊 Classification Report:")
print(classification_report(y_te, y_pred, target_names=iris.target_names))

# ── Confusion Matrix ──────────────────────────────────────
cm = confusion_matrix(y_te, y_pred)
plt.figure(figsize=(6, 5))
sns.heatmap(cm, annot=True, fmt="d", cmap="Blues",
            xticklabels=iris.target_names, yticklabels=iris.target_names)
plt.title("Confusion Matrix — SVM Classifier")
plt.ylabel("Actual"); plt.xlabel("Predicted")
plt.tight_layout()
plt.savefig("confusion_matrix.png", dpi=150)
plt.show()

# ── 10-Fold Stratified Cross-Validation ──────────────────
# Wrap scaler + SVM in a Pipeline so each fold fits the scaler
# on its own training portion only (no data leakage)
svm_pipe = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=10.0, gamma="scale"))
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)
cv_scores = cross_val_score(svm_pipe, X_iris, y_iris, cv=cv, scoring="accuracy")
print(f"\n10-Fold CV Accuracy: {cv_scores.mean():.4f} ± {cv_scores.std():.4f}")
```

🔧 Troubleshooting
- ❌ Error: overfitting (train accuracy ≈ 100%, test ≈ 60%) 🔍 Cause: model too complex for the data ✅ Fix: reduce the `C` value, use regularization, get more data
🏗️ Practical Project
House Price Prediction System
Build a production-style prediction pipeline with persistence. This project uses a scikit-learn Pipeline to chain preprocessing and modeling, so the exact same scaling is applied at training and prediction time and the scaler is never fitted on test data.
```python
import pickle
import numpy as np
import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score, mean_squared_error

# ── 1. Build a Reusable Pipeline ─────────────────────────
pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("model", GradientBoostingRegressor(
        n_estimators=300, learning_rate=0.08,
        max_depth=5, random_state=42
    ))
])

# ── 2. Load and Prepare Data ─────────────────────────────
housing = fetch_california_housing(as_frame=True)
X, y = housing.data, housing.target
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# ── 3. Train the Pipeline ─────────────────────────────────
pipeline.fit(X_train, y_train)

# ── 4. Evaluate ───────────────────────────────────────────
preds = pipeline.predict(X_test)
r2 = r2_score(y_test, preds)
rmse = np.sqrt(mean_squared_error(y_test, preds))
print(f"✅ Model Trained! R² = {r2:.4f} | RMSE = {rmse:.4f}")

# ── 5. Persist Model to Disk ──────────────────────────────
with open("house_price_model.pkl", "wb") as f:
    pickle.dump(pipeline, f)
print("💾 Model saved to house_price_model.pkl")

# ── 6. Production Prediction Function ─────────────────────
FEATURE_ORDER = [
    "MedInc", "HouseAge", "AveRooms", "AveBedrms",
    "Population", "AveOccup", "Latitude", "Longitude"
]

def predict_house_price(features: dict) -> float:
    """Accepts a dict of housing features, returns predicted price in dollars."""
    with open("house_price_model.pkl", "rb") as f:
        loaded_model = pickle.load(f)
    # Build a one-row DataFrame so column names match the training data
    X_new = pd.DataFrame(
        [[features[k] for k in FEATURE_ORDER]], columns=FEATURE_ORDER
    )
    prediction = loaded_model.predict(X_new)[0]
    return round(prediction * 100_000, 2)  # target is in units of $100,000

# ── Example Usage ─────────────────────────────────────────
sample_house = {
    "MedInc": 8.3252, "HouseAge": 41.0, "AveRooms": 6.984,
    "AveBedrms": 1.023, "Population": 322.0, "AveOccup": 2.555,
    "Latitude": 37.88, "Longitude": -122.23,
}
price = predict_house_price(sample_house)
print(f"\n🏠 Predicted House Price: ${price:,.2f}")
```

🔧 Troubleshooting
- ❌ Error: `FileNotFoundError: house_price_model.pkl` 🔍 Cause: model file not saved or wrong path ✅ Fix: run the training step first to generate the `.pkl` file
- ❌ Error: `KeyError` in `predict_house_price` 🔍 Cause: missing feature in the input dict ✅ Fix: ensure all 8 features are provided with the correct key names
Topics Covered
#Python Basics · #NumPy · #Pandas · #Data Preprocessing · #Regression · #Classification · #SVM · #Scikit-Learn