Logistic Regression With Example

code, graphs, guidelines, and practice tasks


Logistic Regression Code and Description

** Please type the code into your own code window instead of copying and pasting.
Typing it out helps you understand each step of the process. **

Section 1: Imports

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

We import the dataset loader, the train/test split helper, the scaler, the Logistic Regression model, and the classification metrics.

Section 2: Load Data

data = load_breast_cancer()
X = pd.DataFrame(data.data, columns=data.feature_names)
y = data.target

print("Shape:", X.shape)
print("Target classes:", np.unique(y))
print("Target names:", data.target_names)

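The breast cancer dataset is a binary classification problem (569 samples, 30 numeric features), which makes it a good fit for learning Logistic Regression. In scikit-learn's encoding, class 0 is malignant and class 1 is benign.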
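A quick way to see how the two classes are encoded and how balanced they are is to count samples per class name. The sketch below is a small sanity check; the variable name class_counts is illustrative and not part of the tutorial code.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer

data = load_breast_cancer()
y = data.target

# Count how many samples fall in each class, in label order (0, then 1).
counts = np.bincount(y)
class_counts = dict(zip(data.target_names.tolist(), counts.tolist()))

for name, count in class_counts.items():
    print(f"{name}: {count} samples ({count / len(y):.1%})")
```

The first name printed corresponds to label 0 and the second to label 1, which is worth keeping in mind whenever "positive" predictions are discussed.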

Section 3: Split and Scale

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

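Logistic Regression benefits from scaling: standardized features help the solver converge in fewer iterations, keep regularization from penalizing features unevenly, and make coefficients comparable. The scaler is fit on the training data only and then applied to the test data, so no test-set information leaks into training.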
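To make the split-and-scale step concrete, here is a minimal sketch that checks two things: that stratification keeps the class share similar in each subset, and that the scaled test data equals (x - training mean) / training standard deviation. The names share_full, share_train, share_test, and manual_test are illustrative.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Stratification keeps the share of class 1 nearly identical in every subset.
share_full, share_train, share_test = y.mean(), y_train.mean(), y_test.mean()
print(f"class-1 share: full={share_full:.3f} train={share_train:.3f} test={share_test:.3f}")

# The scaler learns mean and standard deviation from the training rows only.
scaler = StandardScaler().fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Same result by hand: z = (x - training mean) / training std.
manual_test = (X_test - X_train.mean(axis=0)) / X_train.std(axis=0)
print("matches manual formula:", np.allclose(X_test_scaled, manual_test))
```

Because the statistics come from the training rows, the scaled test columns will generally not have a mean of exactly zero, and that is expected.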

Section 4: Train Logistic Model

log_model = LogisticRegression(max_iter=5000, random_state=42)
log_model.fit(X_train_scaled, y_train)

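We raise max_iter above the default of 100 to give the solver enough iterations to converge on this 30-feature dataset. A higher limit does not guarantee convergence, but it avoids stopping too early.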

Section 5: Predict and Evaluate

y_pred = log_model.predict(X_test_scaled)

y_prob = log_model.predict_proba(X_test_scaled)[:, 1]
acc = accuracy_score(y_test, y_pred)
cm = confusion_matrix(y_test, y_pred)

print("Accuracy:", round(acc, 4))
print("Confusion Matrix:\n", cm)
print("Classification Report:\n", classification_report(y_test, y_pred, target_names=data.target_names))

We compute both class predictions (y_pred) and class-1 probabilities (y_prob). Accuracy is a single summary number; the confusion matrix and classification report show per-class behavior.
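The link between probabilities and class predictions can be checked by hand. The sketch below assumes the default binary setup used in this page and rebuilds the class-1 probability as a sigmoid of the linear score w . x + b; the names z, manual_prob, and manual_pred are illustrative.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
scaler = StandardScaler().fit(X_train)
X_train_scaled, X_test_scaled = scaler.transform(X_train), scaler.transform(X_test)

model = LogisticRegression(max_iter=5000, random_state=42).fit(X_train_scaled, y_train)

# Linear score z = w . x + b, then the sigmoid maps it to a probability.
z = X_test_scaled @ model.coef_[0] + model.intercept_[0]
manual_prob = 1.0 / (1.0 + np.exp(-z))

sklearn_prob = model.predict_proba(X_test_scaled)[:, 1]
print("probabilities match:", np.allclose(manual_prob, sklearn_prob))

# A positive score corresponds to a class-1 probability above 0.5.
manual_pred = (z > 0).astype(int)
print("predictions match:", np.array_equal(manual_pred, model.predict(X_test_scaled)))
```

This is why predict() behaves like a fixed 0.5 threshold on the class-1 probability, and why other thresholds have to be applied manually to y_prob.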

Section 6: Threshold Comparison

for t in [0.30, 0.50, 0.70]:
    pred_t = (y_prob >= t).astype(int)
    print(f"threshold={t}: accuracy={accuracy_score(y_test, pred_t):.4f}")
    print(confusion_matrix(y_test, pred_t))

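Threshold tuning is practical and important: a lower threshold labels more samples as class 1, which raises recall for class 1 but can add false positives. Note that class 1 is benign in this dataset, so "positive" here means benign.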
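The trade-off is easiest to follow on a tiny example where every count can be verified by eye. The labels and probabilities below are hypothetical and chosen only for illustration; they are not taken from the breast cancer data.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Hypothetical labels and class-1 probabilities, for illustration only.
y_true = np.array([0, 0, 1, 1, 1, 0])
y_prob = np.array([0.10, 0.40, 0.35, 0.80, 0.65, 0.75])

results = {}
for t in [0.30, 0.50, 0.70]:
    pred_t = (y_prob >= t).astype(int)
    # For binary labels, ravel() yields the counts in the order tn, fp, fn, tp.
    tn, fp, fn, tp = confusion_matrix(y_true, pred_t, labels=[0, 1]).ravel()
    results[t] = (int(tn), int(fp), int(fn), int(tp))
    print(f"threshold={t:.2f}: tp={tp} fp={fp} fn={fn} tn={tn}")
```

On this toy data, moving the threshold from 0.70 down to 0.30 raises true positives from 1 to 3 and false positives from 1 to 2.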

Section 7: Visualizing Classification Regions (2 Features)

from sklearn.inspection import DecisionBoundaryDisplay

feature_names = list(data.feature_names)

# Select two informative features by name (safe matching)
def find_feature_index(name, all_names):
    key = name.strip().lower().replace("_", " ")
    norm = [n.strip().lower().replace("_", " ") for n in all_names]
    if key in norm:
        return norm.index(key)
    raise ValueError(f"Feature '{name}' not found.")

ix = find_feature_index("mean radius", feature_names)
iy = find_feature_index("mean texture", feature_names)

X2 = data.data[:, [ix, iy]]
y2 = data.target

X2_train, X2_test, y2_train, y2_test = train_test_split(
    X2, y2, test_size=0.2, random_state=42, stratify=y2
)

scaler2 = StandardScaler()
X2_train_scaled = scaler2.fit_transform(X2_train)

log2 = LogisticRegression(max_iter=5000, random_state=42)
log2.fit(X2_train_scaled, y2_train)

fig, ax = plt.subplots(figsize=(7, 5))
DecisionBoundaryDisplay.from_estimator(
    log2, X2_train_scaled, response_method="predict", cmap="coolwarm", alpha=0.18, ax=ax
)
ax.scatter(X2_train_scaled[:, 0], X2_train_scaled[:, 1], c=y2_train, cmap="coolwarm", s=24, edgecolor="k")
ax.set_xlabel(feature_names[ix] + " (scaled)")
ax.set_ylabel(feature_names[iy] + " (scaled)")
ax.set_title("Breast Cancer: Logistic Regression Regions")
plt.show()

This gives a visual feel for the linear decision boundary that Logistic Regression learns on two selected features.
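The boundary in such a plot is the set of points where the linear score is zero, w1*x1 + w2*x2 + b = 0. The sketch below solves that equation for x2 and checks the result. For brevity it fits on all rows rather than the training split, so its coefficients will differ slightly from log2; the names model2, w1, w2, and b are illustrative.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

data = load_breast_cancer()
names = list(data.feature_names)
cols = [names.index("mean radius"), names.index("mean texture")]

# Simplification: fit on all rows, not only on the training split.
X2_scaled = StandardScaler().fit_transform(data.data[:, cols])
model2 = LogisticRegression(max_iter=5000, random_state=42).fit(X2_scaled, data.target)

(w1, w2), b = model2.coef_[0], model2.intercept_[0]

# Solve w1*x1 + w2*x2 + b = 0 for x2 to get points on the boundary line.
x1 = np.linspace(-2.0, 2.0, 5)
x2 = -(w1 * x1 + b) / w2

# Points on that line should have decision scores of (numerically) zero.
scores = model2.decision_function(np.column_stack([x1, x2]))
print("max |score| on the line:", np.abs(scores).max())
```

Plotting x1 against x2 with ax.plot would draw the same straight line that separates the two shaded regions.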

Guidelines for different datasets

# 1) Replace dataset loader and keep common pattern
# X = feature matrix, y = class labels

# 2) Confirm this is classification
# - y should represent discrete classes, not continuous values

# 3) Scale features for Logistic Regression
# - standardization improves convergence and stability

# 4) Always stratify train/test split
# train_test_split(..., stratify=y)

# 5) Tune threshold based on problem cost
# - high recall needs lower threshold
# - high precision may need higher threshold

# 6) Use class-wise metrics, not only accuracy
# - check precision, recall, f1 for each class

# 7) Handle imbalance when needed
# - try class_weight="balanced" and compare results

# 8) Validate model settings
# - test solver, C (inverse of regularization strength; smaller C means stronger regularization), and penalty

# 9) Keep workflow reproducible
# - set random_state in split and model where possible

Treat this page as a reusable Logistic Regression template: update dataset, then re-check scaling, threshold, class-wise metrics, and confusion behavior.
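Two of the guidelines above can be checked directly in code. The sketch below uses type_of_target to confirm that labels are discrete (guideline 2) and compute_class_weight to show what class_weight="balanced" computes (guideline 7). The label arrays are hypothetical and chosen only for illustration.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight
from sklearn.utils.multiclass import type_of_target

# Guideline 2: inspect what kind of target the labels represent.
print(type_of_target(np.array([0, 1, 1, 0])))      # binary
print(type_of_target(np.array([0, 1, 2, 1])))      # multiclass
print(type_of_target(np.array([0.5, 1.2, 3.7])))   # continuous

# Guideline 7: hypothetical imbalanced labels, 8 of class 0 and 2 of class 1.
y_imbalanced = np.array([0] * 8 + [1] * 2)
classes = np.unique(y_imbalanced)
weights = compute_class_weight(class_weight="balanced", classes=classes, y=y_imbalanced)

# "balanced" uses n_samples / (n_classes * count_per_class).
manual = len(y_imbalanced) / (len(classes) * np.bincount(y_imbalanced))
print(dict(zip(classes.tolist(), weights.tolist())))   # {0: 0.625, 1: 2.5}
```

A "continuous" result is a signal that the problem is regression rather than classification, and Logistic Regression is then the wrong tool.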

Graphs and Analysis

Graph 1: Why Scaling Matters (Logistic Regression)

from sklearn.pipeline import make_pipeline

# Without scaling
raw_model = LogisticRegression(max_iter=5000, random_state=42)
raw_model.fit(X_train, y_train)
raw_pred = raw_model.predict(X_test)
raw_acc = accuracy_score(y_test, raw_pred)

# With scaling
scaled_model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=5000, random_state=42))
scaled_model.fit(X_train, y_train)
scaled_pred = scaled_model.predict(X_test)
scaled_acc = accuracy_score(y_test, scaled_pred)

plt.figure(figsize=(6.5, 4.2))
plt.bar(["Without Scaling", "With Scaling"], [raw_acc, scaled_acc], color=["#ef5350", "#66bb6a"])
plt.ylim(0.85, 1.00)
plt.ylabel("Test Accuracy")
plt.title("Scaling Impact on Logistic Regression")
for i, v in enumerate([raw_acc, scaled_acc]):
    plt.text(i, v + 0.003, f"{v:.4f}", ha="center")
plt.show()

This graph compares test accuracy with and without scaling. Scaling usually makes training more stable and often improves final classification quality, though the size of the gap depends on the dataset.
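Accuracy alone does not show the effect on training stability. One hedged way to look at that is to compare how many solver iterations each fit used, via the n_iter_ attribute. The sketch below repeats the same split and settings; the names raw_iters and scaled_iters are illustrative.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Same model settings, with and without standardization.
# The unscaled fit may emit a ConvergenceWarning if it reaches max_iter.
raw_model = LogisticRegression(max_iter=5000, random_state=42).fit(X_train, y_train)
scaled_model = make_pipeline(
    StandardScaler(), LogisticRegression(max_iter=5000, random_state=42)
).fit(X_train, y_train)

# n_iter_ holds the number of solver iterations actually used.
raw_iters = int(raw_model.n_iter_[0])
scaled_iters = int(scaled_model[-1].n_iter_[0])
print(f"iterations without scaling: {raw_iters}")
print(f"iterations with scaling:    {scaled_iters}")
```

Exact counts depend on the scikit-learn and SciPy versions, so treat the numbers as a relative comparison rather than fixed values.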

Graph 2: ROC Curve and AUC

from sklearn.metrics import roc_curve, auc

fpr, tpr, thresholds = roc_curve(y_test, y_prob)
roc_auc = auc(fpr, tpr)

plt.figure(figsize=(6.8, 5))
plt.plot(fpr, tpr, color="#1e88e5", linewidth=2, label=f"AUC = {roc_auc:.4f}")
plt.plot([0, 1], [0, 1], "k--", linewidth=1)
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.title("ROC Curve")
plt.legend(loc="lower right")
plt.grid(alpha=0.25)
plt.show()

ROC-AUC measures ranking quality across all thresholds: a curve closer to the top-left corner and a higher AUC indicate better class separation. The dashed diagonal marks chance level (AUC = 0.5).
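"Ranking quality" has a concrete meaning: AUC equals the fraction of (positive, negative) pairs in which the positive sample receives the higher score. The sketch below verifies this on four hypothetical samples; the data is for illustration only.

```python
from itertools import product
from sklearn.metrics import roc_auc_score

# Hypothetical labels and scores, for illustration only.
y_true = [0, 0, 1, 1]
scores = [0.10, 0.40, 0.35, 0.80]

pos = [s for s, label in zip(scores, y_true) if label == 1]
neg = [s for s, label in zip(scores, y_true) if label == 0]

# Fraction of (positive, negative) pairs ranked correctly; ties count as half.
pairs = list(product(pos, neg))
pairwise_auc = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p, n in pairs) / len(pairs)

print("pairwise AUC:", pairwise_auc)                     # 0.75
print("roc_auc_score:", roc_auc_score(y_true, scores))   # 0.75
```

Three of the four pairs are ordered correctly here (only 0.35 versus 0.40 is wrong), which gives 0.75.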

Graph 3: Confusion Matrices at Different Thresholds

from sklearn.metrics import ConfusionMatrixDisplay

thresholds_to_plot = [0.30, 0.50, 0.70]
fig, axes = plt.subplots(1, 3, figsize=(13.5, 4.2))

for i, t in enumerate(thresholds_to_plot):
    pred_t = (y_prob >= t).astype(int)
    ConfusionMatrixDisplay.from_predictions(
        y_test, pred_t, display_labels=data.target_names, cmap="Blues", values_format="d", ax=axes[i]
    )
    axes[i].set_title(f"Threshold = {t}")

plt.tight_layout()
plt.show()

This makes the threshold trade-off visual: a lower threshold catches more positives but may increase false alarms.

Graph 4: Coefficient Importance (Signed)

coef_df = pd.DataFrame({
    "feature": X.columns,
    "coef": log_model.coef_[0]
})
coef_df["abs_coef"] = coef_df["coef"].abs()
top = coef_df.sort_values("abs_coef", ascending=False).head(10).sort_values("coef")

colors = ["#ef5350" if v < 0 else "#66bb6a" for v in top["coef"]]

plt.figure(figsize=(8.5, 5))
plt.barh(top["feature"], top["coef"], color=colors)
plt.axvline(0, color="black", linewidth=1)
plt.xlabel("Coefficient Value")
plt.title("Top Logistic Regression Coefficients")
plt.show()

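A positive coefficient pushes the prediction toward class 1 (benign); a negative one pushes it toward class 0 (malignant). Because the model was trained on standardized features, a larger magnitude indicates a stronger influence per standard deviation of that feature.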
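One way to read a coefficient is as a multiplicative change in the odds: raising a feature by one unit (one standard deviation, when features are standardized) multiplies the odds of class 1 by exp(coefficient). The sketch below checks this with a hypothetical one-feature model; w, b, and x are made-up values.

```python
import math

def sigmoid(z):
    """Map a linear score to a probability."""
    return 1.0 / (1.0 + math.exp(-z))

def odds(p):
    """Convert a probability to odds p / (1 - p)."""
    return p / (1.0 - p)

# Hypothetical one-feature model: coefficient w and intercept b.
w, b = 0.7, -0.2

x = 1.0
p_before = sigmoid(w * x + b)
p_after = sigmoid(w * (x + 1.0) + b)   # feature raised by one unit

odds_ratio = odds(p_after) / odds(p_before)
print(f"odds ratio: {odds_ratio:.4f}   exp(w): {math.exp(w):.4f}")
```

The same reading applies to each entry of log_model.coef_[0], with np.exp(coef) giving the per-feature odds ratio while the other features are held fixed.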

Graph 5: Precision-Recall Curve and Average Precision

from sklearn.metrics import precision_recall_curve, average_precision_score

precision, recall, pr_thresholds = precision_recall_curve(y_test, y_prob)
ap = average_precision_score(y_test, y_prob)

plt.figure(figsize=(6.8, 5))
plt.plot(recall, precision, color="#5e35b1", linewidth=2, label=f"AP = {ap:.4f}")
plt.xlabel("Recall")
plt.ylabel("Precision")
plt.title("Precision-Recall Curve")
plt.legend(loc="lower left")
plt.grid(alpha=0.25)
plt.show()

The PR curve is especially useful when the positive class is rare or when false positives and false negatives carry different costs. Use it with threshold tuning to choose a precision-recall balance that fits your decision context.
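The arrays returned by precision_recall_curve can also be used to pick a threshold programmatically. The sketch below applies one simple rule, "the highest threshold that still reaches a target recall", to hypothetical data; the rule and the name target_recall are assumptions for illustration, not the only reasonable choice.

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

# Hypothetical labels and class-1 probabilities, for illustration only.
y_true = np.array([0, 0, 1, 1])
y_prob = np.array([0.10, 0.40, 0.35, 0.80])

precision, recall, thresholds = precision_recall_curve(y_true, y_prob)

# precision and recall have one more entry than thresholds; drop the final point.
target_recall = 0.9
meets_target = recall[:-1] >= target_recall

# Among thresholds that reach the target recall, take the highest one.
chosen_threshold = thresholds[meets_target].max()
print("chosen threshold:", chosen_threshold)
```

Precision does not always rise with the threshold, so another option is to select the qualifying threshold with the highest precision instead.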

Graph 6: Predicted Probability Distribution by True Class

prob_df = pd.DataFrame({
    "y_true": y_test,
    "y_prob": y_prob
})

plt.figure(figsize=(7.2, 4.8))
plt.hist(
    prob_df.loc[prob_df["y_true"] == 0, "y_prob"],
    bins=18,
    alpha=0.65,
    color="#ef5350",
    label=f"True class 0 ({data.target_names[0]})"
)
plt.hist(
    prob_df.loc[prob_df["y_true"] == 1, "y_prob"],
    bins=18,
    alpha=0.65,
    color="#66bb6a",
    label=f"True class 1 ({data.target_names[1]})"
)
plt.axvline(0.5, color="black", linestyle="--", linewidth=1.2, label="Threshold 0.50")
plt.xlabel("Predicted Probability for Class 1")
plt.ylabel("Count")
plt.title("Probability Separation by True Class")
plt.legend()
plt.tight_layout()
plt.show()

This histogram shows how much the predicted probabilities of the two classes overlap. Heavy overlap suggests that threshold tuning alone may be insufficient and that feature or model improvements are needed.

Exercises for Practice

Exercise 1: Compare threshold=0.3, 0.5, and 0.7; report precision and recall for each.

Exercise 2: Add class_weight="balanced" and check changes in minority-class recall.

Exercise 3: Compare solvers lbfgs and liblinear with the same data split.

Exercise 4: Tune regularization C over [0.1, 1, 10] and report best validation score.

Exercise 5: Plot the Precision-Recall curve and compare what it shows with the ROC curve.

Exercise 6: Use only top 5 coefficient features and compare test performance with all features.

Exercise 7: Switch to load_iris() and run multiclass Logistic Regression with confusion matrix.

Exercise 8: Create an intentionally imbalanced subset and observe threshold sensitivity changes.

Exercise 9: Build a full pipeline with make_pipeline(StandardScaler(), LogisticRegression(...)) and evaluate it with cross-validation.

A practical Logistic Regression workflow is: split and scale the data, train the model, evaluate with the confusion matrix and classification report, tune the threshold and regularization, and verify behavior with ROC and coefficient interpretation.