Linear Regression With Example

code, graphs, guidelines, and practice tasks


Linear Regression Code and Description

** Please type the code into your code window instead of copying and pasting; this can help you understand the process better. **

Section 1: Imports

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

We import the dataset loader, the train/test split utility, the linear model, and the core regression metrics.

Section 2: Load Data

data = fetch_california_housing()
X = pd.DataFrame(data.data, columns=data.feature_names)
y = pd.Series(data.target, name="MedHouseVal")

print("Shape:", X.shape)
print("Target name:", y.name)
print(X.head())

California Housing suits regression because its target, MedHouseVal, is continuous: the median house value of a California district, expressed in units of $100,000.

Section 3: Split Data

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

We keep 80% of the data for training and 20% for testing; random_state=42 makes the split reproducible.
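To see what test_size and random_state control, here is a minimal sketch on a stand-in array. The 100-row toy array and the second seed (7) are assumptions for illustration only, not part of the housing workflow.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Stand-in data: 100 row ids instead of the real housing table (illustrative only).
toy = np.arange(100)

# Same random_state twice -> identical shuffles, so identical splits.
train_a, test_a = train_test_split(toy, test_size=0.2, random_state=42)
train_b, test_b = train_test_split(toy, test_size=0.2, random_state=42)

# A different seed gives the same sizes but, in general, different rows.
train_c, test_c = train_test_split(toy, test_size=0.2, random_state=7)

print("train/test sizes:", len(train_a), len(test_a))
print("same seed, same test rows:", np.array_equal(test_a, test_b))
print("different seed, same test rows:", np.array_equal(test_a, test_c))
```

The sizes follow test_size, and the seed only decides which rows land on each side.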

Section 4: Train Model

lin_model = LinearRegression()
lin_model.fit(X_train, y_train)

The model learns one coefficient per feature plus an intercept, which is the prediction when every feature equals zero.
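The role of the coefficients and the intercept can be checked by rebuilding a prediction by hand. This is a sketch on synthetic data, so it runs without downloading anything; the names X_toy, y_toy, and toy_model are hypothetical and not part of the workflow above.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic stand-in data (assumption): 200 rows, 3 features, a known linear rule plus noise.
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(200, 3))
y_toy = 1.5 + X_toy @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=200)

toy_model = LinearRegression().fit(X_toy, y_toy)

# A prediction is the intercept plus the coefficient-weighted sum of the features.
manual_pred = toy_model.intercept_ + X_toy @ toy_model.coef_

print("coefficients:", toy_model.coef_.round(2))
print("intercept:", round(float(toy_model.intercept_), 2))
print("matches predict():", np.allclose(manual_pred, toy_model.predict(X_toy)))
```

The same identity holds for lin_model: lin_model.intercept_ plus the dot product of a row with lin_model.coef_ reproduces lin_model.predict for that row.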

Section 5: Predict and Evaluate

y_pred = lin_model.predict(X_test)

mae = mean_absolute_error(y_test, y_pred)
mse = mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse)
r2 = r2_score(y_test, y_pred)

print("MAE:", round(mae, 4))
print("MSE:", round(mse, 4))
print("RMSE:", round(rmse, 4))
print("R2:", round(r2, 4))

MAE and RMSE measure error magnitude in the target's own units (RMSE penalizes large errors more heavily); R2 is the fraction of target variance the model explains.
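To make the metric definitions concrete, the sketch below computes them by hand with NumPy on four made-up values. The arrays y_true and y_hat are invented for illustration and are not model output.

```python
import numpy as np

# Tiny made-up example (not model output): four actual values and four predictions.
y_true = np.array([3.0, 5.0, 2.0, 7.0])
y_hat = np.array([2.5, 5.0, 3.0, 8.0])

errors = y_true - y_hat

mae_manual = float(np.mean(np.abs(errors)))   # average absolute error
mse_manual = float(np.mean(errors ** 2))      # average squared error
rmse_manual = float(np.sqrt(mse_manual))      # back in the target's units

ss_res = float(np.sum(errors ** 2))                     # variation the predictions miss
ss_tot = float(np.sum((y_true - y_true.mean()) ** 2))   # total variation around the mean
r2_manual = 1 - ss_res / ss_tot

print(mae_manual, mse_manual, rmse_manual, round(r2_manual, 4))
# prints: 0.625 0.5625 0.75 0.8475
```

RMSE (0.75) is larger than MAE (0.625) here because squaring gives the two errors of size 1 more weight than the error of size 0.5.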

Section 6: Coefficients Table

coef_table = pd.DataFrame({
    "feature": X.columns,
    "coefficient": lin_model.coef_
}).sort_values("coefficient", key=np.abs, ascending=False)

print("Intercept:", round(lin_model.intercept_, 4))
print(coef_table)

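Each coefficient's sign shows the direction of a feature's effect, and its value is the change in prediction per one-unit increase in that feature with the others held fixed. Because the features are measured on different scales, raw magnitudes are not directly comparable as a measure of strength.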
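One common way to compare coefficient strength across features is to standardize the features before fitting. The sketch below uses synthetic data with two features on very different scales; the data, the variable names, and the choice of StandardScaler are assumptions for illustration, not a step of the workflow above.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption): two features on very different scales.
rng = np.random.default_rng(1)
small_scale = rng.normal(scale=1.0, size=500)      # values roughly in [-3, 3]
large_scale = rng.normal(scale=1000.0, size=500)   # values in the thousands
X_toy = np.column_stack([small_scale, large_scale])
y_toy = 2.0 * small_scale + 0.01 * large_scale + rng.normal(scale=0.5, size=500)

# Raw fit: each coefficient is "per one unit" of its feature.
raw_coef = LinearRegression().fit(X_toy, y_toy).coef_

# Scaled fit: each coefficient is "per one standard deviation" of its feature.
scaled_pipe = make_pipeline(StandardScaler(), LinearRegression()).fit(X_toy, y_toy)
scaled_coef = scaled_pipe.named_steps["linearregression"].coef_

print("raw coefficients:   ", raw_coef.round(3))
print("scaled coefficients:", scaled_coef.round(3))
```

The second feature has the tiny raw coefficient but the larger standardized one, so ranking by raw magnitude would understate its contribution.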

Section 7: Quick Feature-Set Comparison

feature_sets = {
    "1 feature": ["MedInc"],
    "3 features": ["MedInc", "AveRooms", "HouseAge"],
    "all features": list(X.columns)
}

for label, cols in feature_sets.items():
    X_sub = X[cols]
    Xtr, Xte, ytr, yte = train_test_split(X_sub, y, test_size=0.2, random_state=42)
    model = LinearRegression()
    model.fit(Xtr, ytr)
    pred = model.predict(Xte)
    print(f"{label}: R2={r2_score(yte, pred):.4f}, RMSE={np.sqrt(mean_squared_error(yte, pred)):.4f}")

This gives a quick sense of how model quality changes as the feature set grows from one feature to all of them.

Guidelines for different datasets

# 1) Replace dataset loader, keep same workflow pattern
# data = load_diabetes() / your csv / your database extract
# X = features dataframe
# y = continuous numeric target

# 2) Confirm the problem is regression
# - target must be numeric and continuous (not class labels)

# 3) Check missing values and leakage before split
# - count nulls per column; if you impute, fit the imputer on training data only (see 4)
# - remove leakage columns (post-outcome fields)

# 4) Use train/test split first, then preprocessing
# - never fit transforms on full data before split

# 5) Validate with multiple regression metrics
# - MAE: average absolute error
# - RMSE: penalizes large errors more
# - R2: explained variance

# 6) Inspect residual behavior
# - clear pattern in residuals indicates model mismatch

# 7) Re-check assumptions when switching datasets
# - linearity, outlier impact, multicollinearity

# 8) Keep reproducibility fixed
# - set random_state for split/comparisons
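Guideline 4 can be sketched with a scikit-learn pipeline, which fits every preprocessing step on the training rows only. The synthetic data, the 5% missing rate, and the median imputation strategy below are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption): 300 rows, 4 features, about 5% missing values.
rng = np.random.default_rng(2)
X_toy = rng.normal(loc=10.0, scale=3.0, size=(300, 4))
y_toy = X_toy @ np.array([1.0, -2.0, 0.5, 0.0]) + rng.normal(scale=0.5, size=300)
X_toy[rng.random(X_toy.shape) < 0.05] = np.nan

# Split first, so the test rows never influence any fitted statistic.
Xtr, Xte, ytr, yte = train_test_split(X_toy, y_toy, test_size=0.2, random_state=42)

# The pipeline fits the imputer and scaler on the training rows only,
# then reuses those fitted statistics when transforming the test rows.
safe_pipe = make_pipeline(SimpleImputer(strategy="median"), StandardScaler(), LinearRegression())
safe_pipe.fit(Xtr, ytr)

train_medians = safe_pipe.named_steps["simpleimputer"].statistics_
test_r2 = safe_pipe.score(Xte, yte)
print("imputer medians (from training rows):", train_medians.round(3))
print("test R2:", round(test_r2, 4))
```

Because the transforms live inside the pipeline, the same object can be passed to cross_val_score and each fold refits them on its own training portion.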

Treat this page as a reusable regression template: swap dataset and target, then re-run metrics, coefficient review, and residual diagnostics.
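For the multicollinearity check in guideline 7, one possible diagnostic is the variance inflation factor (VIF), which equals the diagonal of the inverse correlation matrix. The sketch below uses only NumPy on synthetic features; the data and the rule of thumb that a VIF above about 10 deserves attention are assumptions, not part of the template.

```python
import numpy as np

# Synthetic stand-in features (assumption): c is almost a copy of a, b is independent.
rng = np.random.default_rng(5)
a = rng.normal(size=500)
b = rng.normal(size=500)
c = a + rng.normal(scale=0.1, size=500)
features = np.column_stack([a, b, c])

# VIF_j = 1 / (1 - R_j^2), where R_j^2 comes from regressing feature j on the others.
# The same numbers appear on the diagonal of the inverse correlation matrix.
corr = np.corrcoef(features, rowvar=False)
vif = np.diag(np.linalg.inv(corr))

for name, value in zip(["a", "b", "c"], vif):
    print(f"VIF({name}) = {value:.1f}")
```

Features a and c get large VIFs because each is nearly predictable from the other, which is the situation where individual coefficients become unstable.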

Second Practical Example (Single Block): Diabetes Dataset

from sklearn.datasets import load_diabetes

# 1) Load
db = load_diabetes()
X_db = pd.DataFrame(db.data, columns=db.feature_names)
y_db = pd.Series(db.target, name="disease_progression")

# 2) Split
Xtr_db, Xte_db, ytr_db, yte_db = train_test_split(
    X_db, y_db, test_size=0.2, random_state=42
)

# 3) Train
db_model = LinearRegression()
db_model.fit(Xtr_db, ytr_db)

# 4) Predict + Evaluate
pred_db = db_model.predict(Xte_db)
db_mae = mean_absolute_error(yte_db, pred_db)
db_rmse = np.sqrt(mean_squared_error(yte_db, pred_db))
db_r2 = r2_score(yte_db, pred_db)

print("Diabetes MAE:", round(db_mae, 4))
print("Diabetes RMSE:", round(db_rmse, 4))
print("Diabetes R2:", round(db_r2, 4))

# 5) Quick coefficient interpretation support
db_coef = pd.DataFrame({
    "feature": X_db.columns,
    "coefficient": db_model.coef_
})
db_coef["abs_coef"] = db_coef["coefficient"].abs()
print("\nTop coefficients by magnitude:")
print(db_coef.sort_values("abs_coef", ascending=False).head(6)[["feature", "coefficient"]])

This compact example shows transferability: the same regression workflow works on a different domain. Do not compare MAE or RMSE directly with the California Housing values, because the targets differ in meaning and scale; R2 is unitless but still dataset-specific. Use the top coefficients to see which medical factors are most strongly associated with predicted disease progression. If this block performs worse than expected, check the linearity assumption and consider the Ridge/Lasso comparison from Graph 4.

Graphs and Analysis

Graph 1: Actual vs Predicted Values (Clear Comparison)

fig, axes = plt.subplots(1, 2, figsize=(13, 4.8))

# Plot A: Actual vs Predicted scatter
axes[0].scatter(y_test, y_pred, alpha=0.35, color="#1f77b4", edgecolor="k", s=22, label="Predictions")
min_v = min(y_test.min(), y_pred.min())
max_v = max(y_test.max(), y_pred.max())
axes[0].plot([min_v, max_v], [min_v, max_v], "r--", linewidth=1.2, label="Ideal line (y=x)")
axes[0].set_xlabel("Actual MedHouseVal")
axes[0].set_ylabel("Predicted MedHouseVal")
axes[0].set_title("Actual vs Predicted (Scatter)")
axes[0].legend()

# Plot B: Side-by-side line comparison on a sample
n = min(120, len(y_test))
y_true_sample = y_test.iloc[:n].reset_index(drop=True)
y_pred_sample = pd.Series(y_pred[:n])

axes[1].plot(y_true_sample.index, y_true_sample.values, color="#2e7d32", linewidth=1.8, label="Actual")
axes[1].plot(y_pred_sample.index, y_pred_sample.values, color="#ef6c00", linewidth=1.8, label="Predicted")
axes[1].set_xlabel("Sample Index")
axes[1].set_ylabel("MedHouseVal")
axes[1].set_title("Actual vs Predicted (Line View)")
axes[1].legend()

plt.tight_layout()
plt.show()

The left plot checks overall fit against the ideal line. The right plot uses two colors and a legend so that actual and predicted values are easy to distinguish point by point.

Graph 2: Residuals (Side-by-Side View)

residuals = y_test - y_pred

fig, axes = plt.subplots(1, 2, figsize=(13, 4.8))

# Plot A: Residuals vs predicted values
axes[0].scatter(y_pred, residuals, alpha=0.35, color="#8e24aa", edgecolor="k", s=22)
axes[0].axhline(0, color="red", linestyle="--", linewidth=1.2)
axes[0].set_xlabel("Predicted MedHouseVal")
axes[0].set_ylabel("Residual (Actual - Predicted)")
axes[0].set_title("Residual Scatter")

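# Plot B: Residual trend across sample index
n = min(120, len(residuals))  # same sample size as Graph 1, redefined so this block runs on its own
residual_sample = residuals.iloc[:n].reset_index(drop=True)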
axes[1].plot(residual_sample.index, residual_sample.values, color="#6d4c41", linewidth=1.6, label="Residual")
axes[1].axhline(0, color="red", linestyle="--", linewidth=1.2, label="Zero Error")
axes[1].set_xlabel("Sample Index")
axes[1].set_ylabel("Residual")
axes[1].set_title("Residual Line View")
axes[1].legend()

plt.tight_layout()
plt.show()

The left plot checks error spread against predictions; the right plot shows how the error moves across samples. In both, residuals should stay centered around zero with no strong pattern.
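A pattern in the residuals can also be checked numerically. The sketch below fits a straight line to deliberately curved synthetic data and averages the residuals within thirds of the predictions; the data and the three-bin choice are assumptions for illustration.

```python
import numpy as np

# Synthetic stand-in data (assumption): a curved relationship, y = x^2 plus small noise.
rng = np.random.default_rng(3)
x = np.linspace(0.0, 4.0, 300)
y_curved = x ** 2 + rng.normal(scale=0.2, size=x.size)

# Fit a straight line with least squares (np.polyfit returns slope, then intercept).
slope, intercept = np.polyfit(x, y_curved, deg=1)
pred_line = slope * x + intercept
resid = y_curved - pred_line

# The overall mean is about zero for any least-squares line, so it cannot reveal the problem.
print("overall residual mean:", round(float(resid.mean()), 6))

# Mean residual in the low, middle, and high thirds of the predictions.
order = np.argsort(pred_line)
low, mid, high = (float(resid[idx].mean()) for idx in np.array_split(order, 3))
print("binned residual means:", round(low, 3), round(mid, 3), round(high, 3))
```

Positive means at both ends and a negative mean in the middle form the U shape that signals a missing nonlinear term, even though the overall mean looks fine.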

Graph 3: Top Coefficients by Magnitude

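coef_sorted = coef_table.copy()
coef_sorted["abs_coef"] = coef_sorted["coefficient"].abs()
# Sort by magnitude explicitly (do not rely on coef_table's order), keep the top 8,
# then reverse so the largest bar is drawn at the top of the horizontal chart.
top_coef = (
    coef_sorted.sort_values("abs_coef", ascending=False)
    .head(8)
    .sort_values("abs_coef", ascending=True)
)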

colors = ["#ef5350" if c < 0 else "#66bb6a" for c in top_coef["coefficient"]]

plt.figure(figsize=(8.5, 5))
plt.barh(top_coef["feature"], top_coef["coefficient"], color=colors)
plt.xlabel("Coefficient Value")
plt.title("Top Linear Regression Coefficients")
plt.axvline(0, color="black", linewidth=1)
plt.show()

This chart makes feature influence easy to read: positive coefficients push prediction up, negative coefficients pull it down.

Graph 4: Linear vs Ridge vs Lasso (CV R2)

from sklearn.linear_model import Ridge, Lasso
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

models = {
    "LinearRegression": LinearRegression(),
    "Ridge(alpha=1.0)": Ridge(alpha=1.0),
    "Lasso(alpha=0.01)": Lasso(alpha=0.01, max_iter=10000)
}

rows = []
for name, model in models.items():
    pipe = make_pipeline(StandardScaler(), model)
    score = cross_val_score(pipe, X, y, cv=5, scoring="r2").mean()
    rows.append((name, score))

comp = pd.DataFrame(rows, columns=["model", "cv_r2"]).sort_values("cv_r2", ascending=False)
print(comp)

plt.figure(figsize=(7.5, 4.5))
plt.bar(comp["model"], comp["cv_r2"], color=["#42a5f5", "#66bb6a", "#ffa726"])
plt.ylabel("Mean CV R2")
plt.title("Regression Model Comparison")
plt.ylim(comp["cv_r2"].min() - 0.02, comp["cv_r2"].max() + 0.02)
plt.show()

This comparison shows whether regularized variants (Ridge/Lasso) improve generalization for your dataset.
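What the regularized variants do to the coefficients can be seen in a small sketch. The synthetic data and the alpha values below are assumptions picked to make the effect visible; they are not tuned recommendations.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

# Synthetic stand-in data (assumption): 5 features, only the first two affect the target.
rng = np.random.default_rng(4)
X_toy = rng.normal(size=(200, 5))
y_toy = 3.0 * X_toy[:, 0] - 2.0 * X_toy[:, 1] + rng.normal(scale=0.5, size=200)

# Ridge: the overall size (L2 norm) of the coefficient vector shrinks as alpha grows.
ridge_norms = []
for alpha in [0.1, 10.0, 1000.0]:
    coef = Ridge(alpha=alpha).fit(X_toy, y_toy).coef_
    ridge_norms.append(float(np.linalg.norm(coef)))
    print(f"Ridge alpha={alpha}: coefficient norm = {ridge_norms[-1]:.3f}")

# Lasso: a large enough alpha sets weak coefficients to exactly zero.
lasso_coef = Lasso(alpha=0.5).fit(X_toy, y_toy).coef_
print("Lasso coefficients:", lasso_coef.round(3))
print("exactly zero:", int(np.sum(lasso_coef == 0)))
```

Ridge shrinks all coefficients smoothly, while Lasso removes the three irrelevant features entirely, which is why Lasso is often used as a rough feature selector.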

Exercises for Practice

Exercise 1: Replace target with another continuous column (if using a custom dataset) and compare MAE, RMSE, and R2.

Exercise 2: Train with only MedInc vs all features and explain the R2 difference.

Exercise 3: Add a residual histogram and check whether errors are centered near zero.

Exercise 4: Use 5-fold CV and 10-fold CV for Linear Regression and compare score stability.

Exercise 5: Try Ridge with alphas [0.1, 1, 10] and report best CV R2.

Exercise 6: Try Lasso with alphas [0.001, 0.01, 0.1] and note which coefficients shrink near zero.

Exercise 7: Remove one high-impact feature from Graph 3 and observe RMSE change.

Exercise 8: Increase test_size from 0.2 to 0.3 and compare metric drift.

Exercise 9: Apply the same template to load_diabetes() and summarize how coefficient behavior differs.

Exercise 8 Solution: test_size 0.2 vs 0.3

def eval_split(test_size):
    Xtr, Xte, ytr, yte = train_test_split(X, y, test_size=test_size, random_state=42)
    model = LinearRegression()
    model.fit(Xtr, ytr)
    pred = model.predict(Xte)
    return {
        "test_size": test_size,
        "MAE": mean_absolute_error(yte, pred),
        "RMSE": np.sqrt(mean_squared_error(yte, pred)),
        "R2": r2_score(yte, pred)
    }

result_02 = eval_split(0.2)
result_03 = eval_split(0.3)

compare = pd.DataFrame([result_02, result_03])
print(compare.round(4))

A larger test split (0.3) leaves less data for training and usually moves the metrics only slightly. If RMSE rises and R2 drops a little, that is normal drift from having fewer training samples. Use a fixed random_state so that differences come from the split size rather than from a different random shuffle.
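To judge how much of the metric movement is due to the shuffle rather than the split size, the comparison can be repeated over several seeds. This sketch uses load_diabetes as a stand-in because it ships with scikit-learn and needs no download; the helper r2_for_split, the spread dictionary, and the choice of 10 seeds are hypothetical.

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Stand-in dataset (assumption): bundled with scikit-learn, so no download is needed.
X_demo, y_demo = load_diabetes(return_X_y=True)

def r2_for_split(test_size, seed):
    """Fit on one random split and return the test-set R2."""
    Xtr, Xte, ytr, yte = train_test_split(X_demo, y_demo, test_size=test_size, random_state=seed)
    model = LinearRegression().fit(Xtr, ytr)
    return r2_score(yte, model.predict(Xte))

# Repeat each test_size over several seeds to see how much the shuffle alone moves R2.
spread = {}
for test_size in (0.2, 0.3):
    scores = np.array([r2_for_split(test_size, seed) for seed in range(10)])
    spread[test_size] = scores
    print(f"test_size={test_size}: mean R2={scores.mean():.4f}, std={scores.std():.4f}")
```

If the standard deviation across seeds is as large as the gap between the two test sizes, a single-seed comparison says little about the effect of the split size itself.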

Exercise 9 Solution: Same Workflow on load_diabetes()

from sklearn.datasets import load_diabetes

db = load_diabetes()
X_db = pd.DataFrame(db.data, columns=db.feature_names)
y_db = pd.Series(db.target, name="disease_progression")

Xtr_db, Xte_db, ytr_db, yte_db = train_test_split(
    X_db, y_db, test_size=0.2, random_state=42
)

db_model = LinearRegression()
db_model.fit(Xtr_db, ytr_db)
pred_db = db_model.predict(Xte_db)

print("MAE:", round(mean_absolute_error(yte_db, pred_db), 4))
print("RMSE:", round(np.sqrt(mean_squared_error(yte_db, pred_db)), 4))
print("R2:", round(r2_score(yte_db, pred_db), 4))

db_coef = pd.DataFrame({
    "feature": X_db.columns,
    "coefficient": db_model.coef_
})
db_coef["abs_coef"] = db_coef["coefficient"].abs()
db_coef = db_coef.sort_values("abs_coef", ascending=False)
print(db_coef[["feature", "coefficient"]])

Coefficient behavior differs because the diabetes features are already mean-centered and scaled to a common scale, and the target has a different scale and relationship pattern. You will often see a few features dominate while several coefficients stay closer to zero, which suggests a weaker direct linear effect in this dataset; correlated features (such as the blood serum measurements s1 to s6) can also produce large coefficients with opposite signs. Compare the top features and their signs against California Housing to see how the domain changes feature impact.

A practical Linear Regression workflow is: split the data, train a baseline model, validate with MAE/RMSE/R2, inspect residuals and coefficients, and compare with Ridge/Lasso to check whether regularization generalizes better.