Logistic Regression Code and Description
** Please type the code into your code window instead of copying and pasting; this can help you understand the process better. **
Section 1: Imports
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
We import the dataset loader, the train/test split helper, the scaler, the Logistic Regression model, and the classification metrics.
Section 2: Load Data
data = load_breast_cancer()
X = pd.DataFrame(data.data, columns=data.feature_names)
y = data.target
print("Shape:", X.shape)
print("Target classes:", np.unique(y))
print("Target names:", data.target_names)
The breast cancer dataset is a binary classification dataset (class 0 = malignant, class 1 = benign), which makes it a good fit for learning Logistic Regression.
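Before modeling, it can help to confirm how the integer labels map to class names and how many samples each class has. A minimal sketch using only the loader from Section 2 (the name `counts_by_name` is illustrative, not part of the original code):

```python
import numpy as np
from sklearn.datasets import load_breast_cancer

data = load_breast_cancer()
y = data.target

# Count samples per integer label, then attach the readable class names.
labels, counts = np.unique(y, return_counts=True)
counts_by_name = {
    str(data.target_names[label]): int(count)
    for label, count in zip(labels, counts)
}

for label in labels:
    print(f"class {label} = {data.target_names[label]}")
print("Samples per class:", counts_by_name)
```

Because label 1 is benign in scikit-learn's copy of this dataset, "positive" in the default binary metrics refers to the benign class unless a different `pos_label` is passed.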
Section 3: Split and Scale
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
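Logistic Regression benefits from scaling: the optimizer converges more stably, and the default L2 penalty treats all features on a comparable footing. The scaler is fit on the training split only and then applied to the test split, so no test-set statistics leak into training.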
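The effect of fitting the scaler on the training split can be checked directly. The sketch below repeats the split from this section and inspects the column statistics afterwards; the tolerance of 1e-8 is an arbitrary choice for illustration.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # learns mean/std from train only
X_test_scaled = scaler.transform(X_test)        # reuses the train statistics

train_means = X_train_scaled.mean(axis=0)
train_stds = X_train_scaled.std(axis=0)
test_means = X_test_scaled.mean(axis=0)

print("Train means ~ 0:", np.allclose(train_means, 0.0, atol=1e-8))
print("Train stds  ~ 1:", np.allclose(train_stds, 1.0, atol=1e-8))
print("Largest |test mean|:", float(np.abs(test_means).max()))
```

The test columns are usually close to, but not exactly, zero mean. That is expected, because the test split is transformed with statistics it did not contribute to.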
Section 4: Train Logistic Model
log_model = LogisticRegression(max_iter=5000, random_state=42)
log_model.fit(X_train_scaled, y_train)
We raise max_iter well above the default of 100 so the solver has room to converge on this 30-feature dataset.
Section 5: Predict and Evaluate
y_pred = log_model.predict(X_test_scaled)
y_prob = log_model.predict_proba(X_test_scaled)[:, 1]
acc = accuracy_score(y_test, y_pred)
cm = confusion_matrix(y_test, y_pred)
print("Accuracy:", round(acc, 4))
print("Confusion Matrix:\n", cm)
print("Classification Report:\n", classification_report(y_test, y_pred, target_names=data.target_names))
We compute both class predictions and class-1 probabilities. Accuracy is a single summary number; the confusion matrix and classification report show per-class behavior.
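To see how the per-class numbers in the report relate to the confusion matrix, the following sketch derives accuracy, precision, and recall by hand. The labels are toy values chosen for illustration, not taken from the dataset.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Toy labels (illustrative only).
y_true = np.array([0, 0, 0, 1, 1, 1, 1, 1])
y_pred = np.array([0, 0, 1, 1, 1, 1, 1, 0])

# For binary labels {0, 1}, ravel() yields tn, fp, fn, tp in that order.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)  # of the predicted positives, how many are correct
recall = tp / (tp + fn)     # of the actual positives, how many were found

print(f"tn={tn}, fp={fp}, fn={fn}, tp={tp}")
print(f"accuracy={accuracy:.2f}, precision={precision:.2f}, recall={recall:.2f}")
```

The hand-computed precision and recall match what `precision_score` and `recall_score` return for class 1, which is the class the report lists second.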
Section 6: Threshold Comparison
for t in [0.30, 0.50, 0.70]:
    pred_t = (y_prob >= t).astype(int)
    print(f"threshold={t}: accuracy={accuracy_score(y_test, pred_t):.4f}")
    print(confusion_matrix(y_test, pred_t))
Threshold tuning is practical and important: a lower threshold labels more samples as class 1 (the positive class, which is benign in this dataset), which raises recall for that class but can also raise false positives.
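The trade-off is easiest to see on a handful of made-up probabilities. In this sketch the arrays and the helper `error_counts` are illustrative assumptions, not part of the original code:

```python
import numpy as np

# Toy data: four true negatives followed by four true positives.
y_true = np.array([0, 0, 0, 0, 1, 1, 1, 1])
y_prob = np.array([0.10, 0.35, 0.45, 0.60, 0.40, 0.55, 0.80, 0.95])

def error_counts(threshold):
    """Return (false_positives, false_negatives) at the given threshold."""
    pred = (y_prob >= threshold).astype(int)
    false_pos = int(np.sum((pred == 1) & (y_true == 0)))
    false_neg = int(np.sum((pred == 0) & (y_true == 1)))
    return false_pos, false_neg

for t in [0.30, 0.50, 0.70]:
    fp, fn = error_counts(t)
    print(f"threshold={t:.2f}: false positives={fp}, false negatives={fn}")
```

On this toy data, false positives fall from 3 to 1 to 0 as the threshold rises, while false negatives climb from 0 to 1 to 2.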
Section 7: Visualizing Classification Regions (2 Features)
from sklearn.inspection import DecisionBoundaryDisplay
feature_names = list(data.feature_names)
# Select two informative features by name (safe matching)
def find_feature_index(name, all_names):
    key = name.strip().lower().replace("_", " ")
    norm = [n.strip().lower().replace("_", " ") for n in all_names]
    if key in norm:
        return norm.index(key)
    raise ValueError(f"Feature '{name}' not found.")
ix = find_feature_index("mean radius", feature_names)
iy = find_feature_index("mean texture", feature_names)
X2 = data.data[:, [ix, iy]]
y2 = data.target
X2_train, X2_test, y2_train, y2_test = train_test_split(
    X2, y2, test_size=0.2, random_state=42, stratify=y2
)
scaler2 = StandardScaler()
X2_train_scaled = scaler2.fit_transform(X2_train)
log2 = LogisticRegression(max_iter=5000, random_state=42)
log2.fit(X2_train_scaled, y2_train)
fig, ax = plt.subplots(figsize=(7, 5))
DecisionBoundaryDisplay.from_estimator(
    log2, X2_train_scaled, response_method="predict", cmap="coolwarm", alpha=0.18, ax=ax
)
ax.scatter(X2_train_scaled[:, 0], X2_train_scaled[:, 1], c=y2_train, cmap="coolwarm", s=24, edgecolor="k")
ax.set_xlabel(feature_names[ix] + " (scaled)")
ax.set_ylabel(feature_names[iy] + " (scaled)")
ax.set_title("Breast Cancer: Logistic Regression Regions")
plt.show()
This gives a visual feel for the linear decision boundary that Logistic Regression learns on two selected features.
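The straight boundary in the plot can also be recovered from the fitted coefficients. The sketch below refits the same two-feature model, solves w0*x0 + w1*x1 + b = 0 for x1, and checks that points on that line receive a probability of 0.5. The range of x0 values is an arbitrary choice for illustration.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

data = load_breast_cancer()
names = list(data.feature_names)
cols = [names.index("mean radius"), names.index("mean texture")]
X2 = data.data[:, cols]
y2 = data.target

X2_train, X2_test, y2_train, y2_test = train_test_split(
    X2, y2, test_size=0.2, random_state=42, stratify=y2
)
X2_train_scaled = StandardScaler().fit_transform(X2_train)

log2 = LogisticRegression(max_iter=5000, random_state=42)
log2.fit(X2_train_scaled, y2_train)

# The boundary is where w0*x0 + w1*x1 + b = 0, i.e. P(class 1) = 0.5.
w0, w1 = log2.coef_[0]
b = log2.intercept_[0]

x0 = np.linspace(-2.0, 2.0, 5)   # scaled "mean radius" values
x1 = -(w0 * x0 + b) / w1         # matching scaled "mean texture" values
boundary_points = np.column_stack([x0, x1])

probs = log2.predict_proba(boundary_points)[:, 1]
print("P(class 1) on the boundary:", np.round(probs, 6))
```

Plotting `x0` against `x1` on the same axes as the scatter would draw the line that separates the two shaded regions.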
Guidelines for different datasets
# 1) Replace dataset loader and keep common pattern
# X = feature matrix, y = class labels
# 2) Confirm this is classification
# - y should represent discrete classes, not continuous values
# 3) Scale features for Logistic Regression
# - standardization improves convergence and stability
# 4) Always stratify train/test split
# train_test_split(..., stratify=y)
# 5) Tune threshold based on problem cost
# - high recall needs lower threshold
# - high precision may need higher threshold
# 6) Use class-wise metrics, not only accuracy
# - check precision, recall, f1 for each class
# 7) Handle imbalance when needed
# - try class_weight="balanced" and compare results
# 8) Validate model settings
# - test solver, C (inverse of regularization strength: smaller C = stronger regularization), and penalty
# 9) Keep workflow reproducible
# - set random_state in split and model where possible
Treat this page as a reusable Logistic Regression template: update dataset, then re-check scaling, threshold, class-wise metrics, and confusion behavior.
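Guideline 2 (confirming that the target is a classification target) can be automated before any model is trained. A minimal sketch, assuming scikit-learn's `type_of_target` utility; the helper `describe_target` and its return keys are hypothetical names introduced here:

```python
import numpy as np
from sklearn.utils.multiclass import type_of_target

def describe_target(y):
    """Hypothetical helper: summarize whether y looks like class labels."""
    y = np.asarray(y)
    kind = type_of_target(y)  # e.g. "binary", "multiclass", "continuous"
    is_classification = kind in ("binary", "multiclass")
    minority_share = None
    if is_classification:
        # Share of the rarest class; a small value hints at imbalance.
        _, counts = np.unique(y, return_counts=True)
        minority_share = float(counts.min() / counts.sum())
    return {
        "kind": kind,
        "is_classification": is_classification,
        "minority_share": minority_share,
    }

print(describe_target([0, 1, 1, 0, 1, 1, 1, 1]))  # discrete labels
print(describe_target([0.5, 1.7, 3.2, 2.4]))      # continuous values
```

A low `minority_share` is one signal that guideline 7 (handling imbalance) deserves attention for the new dataset.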
Graphs and Analysis
Graph 1: Why Scaling Matters (Logistic Regression)
from sklearn.pipeline import make_pipeline
# Without scaling
raw_model = LogisticRegression(max_iter=5000, random_state=42)
raw_model.fit(X_train, y_train)
raw_pred = raw_model.predict(X_test)
raw_acc = accuracy_score(y_test, raw_pred)
# With scaling
scaled_model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=5000, random_state=42))
scaled_model.fit(X_train, y_train)
scaled_pred = scaled_model.predict(X_test)
scaled_acc = accuracy_score(y_test, scaled_pred)
plt.figure(figsize=(6.5, 4.2))
plt.bar(["Without Scaling", "With Scaling"], [raw_acc, scaled_acc], color=["#ef5350", "#66bb6a"])
plt.ylim(0.85, 1.00)
plt.ylabel("Test Accuracy")
plt.title("Scaling Impact on Logistic Regression")
for i, v in enumerate([raw_acc, scaled_acc]):
    plt.text(i, v + 0.003, f"{v:.4f}", ha="center")
plt.show()
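This graph compares test accuracy with and without scaling. Scaling stabilizes training and often, though not always, improves final classification quality. Note that the y-axis starts at 0.85, so small differences look larger than they are.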
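Test accuracy alone does not show the training-stability side of scaling. One way to look at it, sketched here under the assumption that the solver's iteration count is a reasonable proxy for optimization effort, is to compare `n_iter_` for the same model on raw and standardized features (the helper `iterations_to_fit` is an illustrative name):

```python
import warnings
from sklearn.datasets import load_breast_cancer
from sklearn.exceptions import ConvergenceWarning
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

def iterations_to_fit(features):
    """Fit the same model and return how many solver iterations it used."""
    model = LogisticRegression(max_iter=5000, random_state=42)
    with warnings.catch_warnings():
        # Silence the warning in case the raw fit reaches the iteration cap.
        warnings.simplefilter("ignore", ConvergenceWarning)
        model.fit(features, y_train)
    return int(model.n_iter_[0])

raw_iters = iterations_to_fit(X_train)
scaled_iters = iterations_to_fit(StandardScaler().fit_transform(X_train))

print("Iterations without scaling:", raw_iters)
print("Iterations with scaling:   ", scaled_iters)
```

If the unscaled count equals 5000, the solver stopped at the cap rather than converging, which is itself a sign that the raw features are hard to optimize over.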
Graph 2: ROC Curve and AUC
from sklearn.metrics import roc_curve, auc
fpr, tpr, thresholds = roc_curve(y_test, y_prob)
roc_auc = auc(fpr, tpr)
plt.figure(figsize=(6.8, 5))
plt.plot(fpr, tpr, color="#1e88e5", linewidth=2, label=f"AUC = {roc_auc:.4f}")
plt.plot([0, 1], [0, 1], "k--", linewidth=1)
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.title("ROC Curve")
plt.legend(loc="lower right")
plt.grid(alpha=0.25)
plt.show()
ROC-AUC measures ranking quality across all thresholds. A curve closer to the top-left corner and a higher AUC indicate better class separation.
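"Ranking quality" has a concrete meaning: AUC equals the fraction of (positive, negative) pairs in which the positive sample receives the higher score, with ties counted as half. The sketch below checks this on toy scores that are illustrative and not taken from the dataset.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Toy data: four true negatives followed by four true positives.
y_true = np.array([0, 0, 0, 0, 1, 1, 1, 1])
y_prob = np.array([0.10, 0.35, 0.45, 0.60, 0.40, 0.55, 0.80, 0.95])

pos = y_prob[y_true == 1]
neg = y_prob[y_true == 0]

# Compare every positive score with every negative score; ties count as half.
wins = (pos[:, None] > neg[None, :]).sum()
ties = (pos[:, None] == neg[None, :]).sum()
pairwise_auc = (wins + 0.5 * ties) / (len(pos) * len(neg))

print("Pairwise ranking estimate:", pairwise_auc)
print("roc_auc_score:            ", roc_auc_score(y_true, y_prob))
```

Here 13 of the 16 pairs are ordered correctly, so both values are 0.8125. No threshold appears anywhere in the computation, which is why AUC is described as threshold-independent.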
Graph 3: Confusion Matrices at Different Thresholds
from sklearn.metrics import ConfusionMatrixDisplay
thresholds_to_plot = [0.30, 0.50, 0.70]
fig, axes = plt.subplots(1, 3, figsize=(13.5, 4.2))
for i, t in enumerate(thresholds_to_plot):
    pred_t = (y_prob >= t).astype(int)
    ConfusionMatrixDisplay.from_predictions(
        y_test, pred_t, display_labels=data.target_names, cmap="Blues", values_format="d", ax=axes[i]
    )
    axes[i].set_title(f"Threshold = {t}")
plt.tight_layout()
plt.show()
This makes the threshold trade-off visible: a lower threshold catches more positives but may increase false alarms.
Graph 4: Coefficient Importance (Signed)
coef_df = pd.DataFrame({
    "feature": X.columns,
    "coef": log_model.coef_[0]
})
coef_df["abs_coef"] = coef_df["coef"].abs()
top = coef_df.sort_values("abs_coef", ascending=False).head(10).sort_values("coef")
colors = ["#ef5350" if v < 0 else "#66bb6a" for v in top["coef"]]
plt.figure(figsize=(8.5, 5))
plt.barh(top["feature"], top["coef"], color=colors)
plt.axvline(0, color="black", linewidth=1)
plt.xlabel("Coefficient Value")
plt.title("Top Logistic Regression Coefficients")
plt.show()
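A positive coefficient pushes the prediction toward class 1 (benign); a negative one pushes toward class 0 (malignant). Because the features were standardized, a larger absolute value indicates a stronger influence per standard deviation of that feature.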
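The way coefficients turn into probabilities can be reproduced by hand: the model computes a linear score z = X·w + b and passes it through the sigmoid function. The sketch below refits the scaled model and compares the manual result with `predict_proba`; it uses `scipy.special.expit` as the sigmoid, which is an implementation choice made here.

```python
import numpy as np
from scipy.special import expit  # numerically stable sigmoid
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

log_model = LogisticRegression(max_iter=5000, random_state=42)
log_model.fit(X_train_scaled, y_train)

# Linear score: one weight per feature plus the intercept.
z = X_test_scaled @ log_model.coef_[0] + log_model.intercept_[0]
manual_prob = expit(z)  # sigmoid(z) = 1 / (1 + exp(-z))
sklearn_prob = log_model.predict_proba(X_test_scaled)[:, 1]

print("Manual sigmoid matches predict_proba:", np.allclose(manual_prob, sklearn_prob))

# exp(coef) is the factor applied to the odds of class 1 when a feature
# rises by one standard deviation, holding the others fixed.
odds_factors = np.exp(log_model.coef_[0])
print("Smallest / largest odds factor:", odds_factors.min(), odds_factors.max())
```

Reading coefficients as odds factors only makes sense for the scaled features used here; on raw features the units of each column would distort the comparison.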
Graph 5: Precision-Recall Curve and Average Precision
from sklearn.metrics import precision_recall_curve, average_precision_score
precision, recall, pr_thresholds = precision_recall_curve(y_test, y_prob)
ap = average_precision_score(y_test, y_prob)
plt.figure(figsize=(6.8, 5))
plt.plot(recall, precision, color="#5e35b1", linewidth=2, label=f"AP = {ap:.4f}")
plt.xlabel("Recall")
plt.ylabel("Precision")
plt.title("Precision-Recall Curve")
plt.legend(loc="lower left")
plt.grid(alpha=0.25)
plt.show()
The PR curve is especially useful when the classes are imbalanced or when errors on the positive class carry asymmetric costs. Use it together with threshold tuning to choose a precision-recall balance that fits your decision context.
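One way to combine the PR curve with threshold tuning is to pick the highest threshold that still meets a recall requirement. The sketch below assumes a target recall of 0.95 for class 1, a value chosen purely for illustration; the right target depends on the decision context.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_recall_curve, precision_score, recall_score

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
scaler = StandardScaler()
model = LogisticRegression(max_iter=5000, random_state=42)
model.fit(scaler.fit_transform(X_train), y_train)
y_prob = model.predict_proba(scaler.transform(X_test))[:, 1]

precision, recall, pr_thresholds = precision_recall_curve(y_test, y_prob)

target_recall = 0.95  # illustrative requirement, not from the original text

# precision and recall have one more entry than pr_thresholds, so the final
# point (recall = 0) is dropped before masking.
meets_target = recall[:-1] >= target_recall
chosen_threshold = float(pr_thresholds[meets_target].max())

pred = (y_prob >= chosen_threshold).astype(int)
print(f"Chosen threshold: {chosen_threshold:.4f}")
print(f"Recall:    {recall_score(y_test, pred):.4f}")
print(f"Precision: {precision_score(y_test, pred):.4f}")
```

Choosing the threshold on the test split, as done here for brevity, gives an optimistic picture; in practice a separate validation split or cross-validation would be the safer place to tune it.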
Graph 6: Predicted Probability Distribution by True Class
prob_df = pd.DataFrame({
    "y_true": y_test,
    "y_prob": y_prob
})
plt.figure(figsize=(7.2, 4.8))
plt.hist(
    prob_df.loc[prob_df["y_true"] == 0, "y_prob"],
    bins=18,
    alpha=0.65,
    color="#ef5350",
    label=f"True class 0 ({data.target_names[0]})"
)
plt.hist(
    prob_df.loc[prob_df["y_true"] == 1, "y_prob"],
    bins=18,
    alpha=0.65,
    color="#66bb6a",
    label=f"True class 1 ({data.target_names[1]})"
)
plt.axvline(0.5, color="black", linestyle="--", linewidth=1.2, label="Threshold 0.50")
plt.xlabel("Predicted Probability for Class 1")
plt.ylabel("Count")
plt.title("Probability Separation by True Class")
plt.legend()
plt.tight_layout()
plt.show()
This histogram shows confidence overlap between classes. Heavy overlap suggests threshold-only tuning may be insufficient and feature/model improvements are needed.
Exercises for Practice
Exercise 1: Compare threshold=0.3, 0.5, and 0.7; report precision and recall for each.
Exercise 2: Add class_weight="balanced" and check changes in minority-class recall.
Exercise 3: Compare solvers lbfgs and liblinear with the same data split.
Exercise 4: Tune regularization C over [0.1, 1, 10] and report best validation score.
Exercise 5: Plot Precision-Recall curve and compare it with ROC insight.
Exercise 6: Use only top 5 coefficient features and compare test performance with all features.
Exercise 7: Switch to load_iris() and run multiclass Logistic Regression with confusion matrix.
Exercise 8: Create an intentionally imbalanced subset and observe threshold sensitivity changes.
Exercise 9: Build a full pipeline with make_pipeline(StandardScaler(), LogisticRegression(...)) and evaluate it with cross-validation.