Principal Component Analysis With Example

code, graphs, guidelines, and practice tasks


PCA Code and Description (Breast Cancer)

** Please type the code into your own code window
instead of copying and pasting.
This can help you understand the process better. **

Section 1: Imports

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

We import the PCA tools plus a classifier so we can measure whether dimensionality reduction actually helps model performance.

Section 2: Load Data

data = load_breast_cancer()
X = pd.DataFrame(data.data, columns=data.feature_names)
y = pd.Series(data.target, name="target")

print("Shape:", X.shape)
print("Target classes:", np.unique(y))
print(X.head())

This dataset has 569 samples and 30 numeric medical features, many of them strongly correlated (radius, perimeter, and area, for example), which is the kind of data where PCA is useful.

Section 3: Split and Scale

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

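Scaling is effectively mandatory here because PCA is variance-based: the features are measured on very different numeric scales, and without scaling the largest-scale features would dominate the components regardless of how informative they are.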
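To make the scaling point concrete, here is a minimal sketch on synthetic data. The two feature names and their scales are made up for illustration; they are not taken from the dataset above.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic, illustrative data: two independent features on very different scales
rng = np.random.default_rng(0)
area = rng.normal(loc=650.0, scale=350.0, size=500)       # values in the hundreds
smoothness = rng.normal(loc=0.10, scale=0.015, size=500)  # values near 0.1
X_demo = np.column_stack([area, smoothness])

# Fit PCA once on the raw values and once on standardized values
raw_pca = PCA(n_components=2).fit(X_demo)
scaled_pca = PCA(n_components=2).fit(StandardScaler().fit_transform(X_demo))

print("Unscaled variance ratios:", np.round(raw_pca.explained_variance_ratio_, 4))
print("Unscaled PC1 weights:    ", np.round(raw_pca.components_[0], 4))
print("Scaled variance ratios:  ", np.round(scaled_pca.explained_variance_ratio_, 4))
```

Without scaling, PC1 is almost entirely the large-scale feature. After scaling, the two independent features share the variance roughly evenly.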

Section 4: Fit PCA and Inspect Variance

pca_full = PCA()
X_train_pca_full = pca_full.fit_transform(X_train_scaled)

explained = pca_full.explained_variance_ratio_
cum_explained = np.cumsum(explained)

print("First 5 explained variance ratios:", np.round(explained[:5], 4))
print("Cumulative variance by first 5:", round(cum_explained[4], 4))
print("Components needed for 95% variance:", np.argmax(cum_explained >= 0.95) + 1)

This tells you what share of the total variance each component carries and how many components you need to reach your retention target.
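One way to see where these ratios come from: each ratio is one eigenvalue of the covariance matrix divided by the sum of all eigenvalues. The NumPy-only sketch below uses synthetic data, and names such as `X_demo` and `n_for_95` are illustrative rather than part of the tutorial code.

```python
import numpy as np

rng = np.random.default_rng(42)
# Synthetic data: 3 correlated columns built from 2 hidden factors plus noise
factors = rng.normal(size=(300, 2))
mixing = np.array([[1.0, 0.8, 0.1],
                   [0.0, 0.3, 1.0]])
X_demo = factors @ mixing + 0.1 * rng.normal(size=(300, 3))

# Standardize, then take the eigenvalues of the covariance matrix
Z = (X_demo - X_demo.mean(axis=0)) / X_demo.std(axis=0)
cov = np.cov(Z, rowvar=False)
eigvals = np.linalg.eigvalsh(cov)[::-1]          # reorder to largest first

ratios = eigvals / eigvals.sum()                 # share of total variance
cum = np.cumsum(ratios)
n_for_95 = int(np.searchsorted(cum, 0.95) + 1)   # first count reaching 95%

print("Explained variance ratios:", np.round(ratios, 4))
print("Components for 95%:", n_for_95)
```

`np.searchsorted` is used here as an alternative to `np.argmax(cum >= 0.95)`; both return the first position where the cumulative curve reaches the target.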

Section 5: Build PCA + Logistic Pipeline

pca_model = make_pipeline(
    StandardScaler(),
    PCA(n_components=0.95),
    LogisticRegression(max_iter=5000)
)

pca_model.fit(X_train, y_train)
y_pred = pca_model.predict(X_test)

print("Accuracy:", round(accuracy_score(y_test, y_pred), 4))
print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred))
print("Classification Report:\n", classification_report(y_test, y_pred, target_names=data.target_names))

pca_step = pca_model.named_steps["pca"]
print("PCA components retained:", pca_step.n_components_)

A pipeline keeps preprocessing safe and reproducible: the scaler and PCA are fit only on the training data and then applied unchanged to the test data.
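As a rough illustration of what the pipeline does internally, the sketch below repeats the same steps by hand and checks that the predictions match. It reloads the data and redoes the split so it runs on its own; the `manual_` names are illustrative.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X_all, y_all = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_all, y_all, test_size=0.2, random_state=42, stratify=y_all
)

# Pipeline version
pipe = make_pipeline(StandardScaler(), PCA(n_components=0.95), LogisticRegression(max_iter=5000))
pipe.fit(X_tr, y_tr)
pipe_pred = pipe.predict(X_te)

# Manual version: every fit uses training rows only; test rows are only transformed
manual_scaler = StandardScaler().fit(X_tr)
manual_pca = PCA(n_components=0.95).fit(manual_scaler.transform(X_tr))
manual_clf = LogisticRegression(max_iter=5000).fit(
    manual_pca.transform(manual_scaler.transform(X_tr)), y_tr
)
manual_pred = manual_clf.predict(manual_pca.transform(manual_scaler.transform(X_te)))

print("Predictions identical:", np.array_equal(pipe_pred, manual_pred))
```

The manual version is easy to get wrong (for example, by calling `fit_transform` on the test rows), which is the main reason to prefer the pipeline.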

Section 6: Baseline vs PCA Comparison

baseline_model = make_pipeline(
    StandardScaler(),
    LogisticRegression(max_iter=5000)
)

baseline_cv = cross_val_score(baseline_model, X, y, cv=5, scoring="accuracy").mean()
pca_cv = cross_val_score(pca_model, X, y, cv=5, scoring="accuracy").mean()

print(f"Baseline (no PCA) mean CV accuracy: {baseline_cv:.4f}")
print(f"With PCA (95% variance) mean CV accuracy: {pca_cv:.4f}")

This check prevents blind PCA usage. Keep PCA when it simplifies features with similar or better validation performance.
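When the two means are close, it can help to look at the fold-to-fold spread as well. This is a minimal sketch, assuming a shared `StratifiedKFold` splitter (not used in the code above) so that both pipelines are scored on identical folds.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X_all, y_all = load_breast_cancer(return_X_y=True)

# Same shuffled folds for both models, so the per-fold scores are paired
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

candidates = {
    "no PCA": make_pipeline(StandardScaler(), LogisticRegression(max_iter=5000)),
    "PCA 95%": make_pipeline(StandardScaler(), PCA(n_components=0.95), LogisticRegression(max_iter=5000)),
}

cv_scores = {}
for name, model in candidates.items():
    cv_scores[name] = cross_val_score(model, X_all, y_all, cv=cv, scoring="accuracy")
    print(f"{name}: mean={cv_scores[name].mean():.4f}, std={cv_scores[name].std():.4f}")

# Positive values mean the PCA pipeline scored higher on that fold
diff = cv_scores["PCA 95%"] - cv_scores["no PCA"]
print("Per-fold difference (PCA - baseline):", diff.round(4))
```

If the per-fold differences are small compared with the standard deviations, the two pipelines are performing about the same and the simpler feature set is a reasonable choice.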

Section 7: Inspect PCA Loadings

pca_2 = PCA(n_components=2)
X_scaled = StandardScaler().fit_transform(X)
pca_2.fit(X_scaled)

loading_df = pd.DataFrame(
    pca_2.components_.T,
    columns=["PC1_loading", "PC2_loading"],
    index=X.columns
)

top_pc1 = loading_df["PC1_loading"].abs().sort_values(ascending=False).head(8)
print("Top contributors to PC1:")
print(loading_df.loc[top_pc1.index].sort_values("PC1_loading", key=np.abs, ascending=False))

Loadings show which original features drive each principal component and make PCA interpretation more practical.
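A short sketch of what "drive" means in practice: each row of `components_` is a unit-length weight vector, and a sample's component score is the weighted sum of its standardized features. The variable names below are illustrative, and the data is reloaded so the block runs on its own.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X_all, _ = load_breast_cancer(return_X_y=True)
Z = StandardScaler().fit_transform(X_all)

pca_demo = PCA(n_components=2)
scores = pca_demo.fit_transform(Z)

# Each row of components_ is a unit-length weight vector over the 30 features
lengths = np.linalg.norm(pca_demo.components_, axis=1)
print("Component lengths:", lengths.round(6))

# A sample's PC score is the weighted sum of its standardized feature values
manual_scores = Z @ pca_demo.components_.T
print("Manual scores match PCA output:", np.allclose(manual_scores, scores))
```

Because the weights are applied to standardized features, a feature with a large absolute weight moves the score more per standard deviation of change than a feature with a small one.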

Guidelines for different datasets

# 1) Use numeric features only for PCA
# - Encode categories separately and evaluate if PCA is still suitable

# 2) Always scale before PCA
# - StandardScaler is the default safe choice for most tabular datasets

# 3) Pick n_components from a variance target
# - Start with 0.90 or 0.95 and validate downstream model metrics

# 4) Compare against a no-PCA baseline
# - Keep PCA only if it helps speed, stability, or generalization

# 5) Use loadings for practical interpretation
# - Report top positive/negative contributors per component

# 6) Avoid leakage in workflow
# - Split first, then fit scaler/PCA only on train data

# 7) Re-check dimensionality needs by use case
# - 2 components for visualization
# - more components for production prediction quality

# 8) Keep reproducibility fixed
# - use random_state in split and any randomized model steps

Treat this as a reusable PCA template: scale, inspect explained variance, validate with and without PCA, then interpret loadings.
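For guideline 1, one possible way to keep PCA on the numeric columns only is scikit-learn's `ColumnTransformer`. The table below is hypothetical: the column names and values are invented for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical mixed-type table; column names are made up for illustration
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "radius": rng.normal(14, 3, 200),
    "perimeter": rng.normal(90, 20, 200),
    "texture": rng.normal(19, 4, 200),
    "clinic": rng.choice(["north", "south", "east"], 200),
})
numeric_cols = ["radius", "perimeter", "texture"]
categorical_cols = ["clinic"]

preprocess = ColumnTransformer([
    # Scaling and PCA touch only the numeric columns
    ("num", make_pipeline(StandardScaler(), PCA(n_components=2)), numeric_cols),
    # Categories are one-hot encoded and bypass PCA
    ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_cols),
])

features = preprocess.fit_transform(df)
print("Output shape:", features.shape)  # 2 PCA columns + 3 one-hot columns
```

In a real workflow this transformer would sit inside a pipeline in front of the model, so that it is fit on training rows only, as guideline 6 requires.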

Graphs and Analysis

Graph 1: Explained Variance and Cumulative Curve

pca_vis = PCA()
X_scaled = StandardScaler().fit_transform(X)
pca_vis.fit(X_scaled)

exp = pca_vis.explained_variance_ratio_
cum = np.cumsum(exp)

fig, axes = plt.subplots(1, 2, figsize=(12, 4.6))

axes[0].bar(range(1, 11), exp[:10], color="#42a5f5")
axes[0].set_xlabel("Principal Component")
axes[0].set_ylabel("Explained Variance Ratio")
axes[0].set_title("First 10 Components")

axes[1].plot(range(1, len(cum) + 1), cum, marker="o", linewidth=1.5, color="#2e7d32")
axes[1].axhline(0.95, color="red", linestyle="--", label="95% threshold")
axes[1].set_xlabel("Number of Components")
axes[1].set_ylabel("Cumulative Explained Variance")
axes[1].set_title("Cumulative Variance Curve")
axes[1].legend()

plt.tight_layout()
plt.show()

Use this chart to select a component count that keeps enough information while reducing dimensionality.

Graph 2: 2D PCA Projection by Class

X_scaled = StandardScaler().fit_transform(X)
X_2d = PCA(n_components=2).fit_transform(X_scaled)

plt.figure(figsize=(7, 5))
plt.scatter(X_2d[y == 0, 0], X_2d[y == 0, 1], s=22, alpha=0.7, label=data.target_names[0], color="#ef5350")
plt.scatter(X_2d[y == 1, 0], X_2d[y == 1, 1], s=22, alpha=0.7, label=data.target_names[1], color="#42a5f5")
plt.xlabel("PC1")
plt.ylabel("PC2")
plt.title("PCA Projection: Breast Cancer Classes")
plt.legend()
plt.show()

This plot is a useful visual sanity check. Partial separation of the classes suggests that the first two components capture class-relevant structure, even though the classes still overlap.

Graph 3: Baseline vs PCA Accuracy

models = {
    "No PCA": make_pipeline(StandardScaler(), LogisticRegression(max_iter=5000)),
    "PCA 95%": make_pipeline(StandardScaler(), PCA(n_components=0.95), LogisticRegression(max_iter=5000)),
    "PCA 2 comps": make_pipeline(StandardScaler(), PCA(n_components=2), LogisticRegression(max_iter=5000))
}

rows = []
for name, model in models.items():
    score = cross_val_score(model, X, y, cv=5, scoring="accuracy").mean()
    rows.append((name, score))

comp = pd.DataFrame(rows, columns=["model", "cv_accuracy"]).sort_values("cv_accuracy", ascending=False)
print(comp)

plt.figure(figsize=(7.2, 4.5))
plt.bar(comp["model"], comp["cv_accuracy"], color=["#66bb6a", "#42a5f5", "#ffa726"])
plt.ylabel("Mean CV Accuracy")
plt.title("Logistic Regression: Baseline vs PCA")
plt.ylim(comp["cv_accuracy"].min() - 0.02, comp["cv_accuracy"].max() + 0.02)
plt.show()

This gives decision-level evidence: keep aggressive reduction (2 PCs) only if accuracy stays acceptable for your business target.

Graph 4: Top Feature Loadings in PC1

pca_2 = PCA(n_components=2)
X_scaled = StandardScaler().fit_transform(X)
pca_2.fit(X_scaled)

load = pd.DataFrame({
    "feature": X.columns,
    "pc1_loading": pca_2.components_[0]
})
load["abs_loading"] = load["pc1_loading"].abs()
top = load.sort_values("abs_loading", ascending=False).head(10).sort_values("pc1_loading")

colors = ["#ef5350" if v < 0 else "#66bb6a" for v in top["pc1_loading"]]

plt.figure(figsize=(8.3, 5))
plt.barh(top["feature"], top["pc1_loading"], color=colors)
plt.axvline(0, color="black", linewidth=1)
plt.xlabel("PC1 Loading")
plt.title("Top PC1 Contributors")
plt.tight_layout()
plt.show()

Features with large absolute loadings drive the first component most strongly; this helps you narrate what PC1 represents.

Graph 5: Reconstruction Error vs Number of Components

X_scaled = StandardScaler().fit_transform(X)
component_grid = [2, 3, 5, 8, 10, 15, 20, X.shape[1]]
errors = []

for n in component_grid:
    pca_n = PCA(n_components=n)
    X_proj = pca_n.fit_transform(X_scaled)
    X_recon = pca_n.inverse_transform(X_proj)
    mse = ((X_scaled - X_recon) ** 2).mean()
    errors.append(mse)

plt.figure(figsize=(7.5, 4.6))
plt.plot(component_grid, errors, marker="o", color="#5e35b1", linewidth=1.8)
plt.xlabel("Number of Components")
plt.ylabel("Reconstruction MSE")
plt.title("Information Loss vs PCA Components")
plt.grid(alpha=0.3)
plt.show()

This graph directly shows information loss. If error drops sharply then flattens, that elbow region is a practical component range.
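For standardized data, the reconstruction MSE and the explained variance curve carry the same information: the MSE equals one minus the cumulative explained variance ratio of the retained components. The sketch below checks this numerically. It assumes `StandardScaler` scaling (unit variance per feature) and sets `svd_solver="full"` explicitly; the variable names are illustrative.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X_all, _ = load_breast_cancer(return_X_y=True)
Z = StandardScaler().fit_transform(X_all)

results = []
for k in [2, 5, 10]:
    pca_k = PCA(n_components=k, svd_solver="full")
    Z_recon = pca_k.inverse_transform(pca_k.fit_transform(Z))
    mse = ((Z - Z_recon) ** 2).mean()                    # mean over all cells
    lost = 1.0 - pca_k.explained_variance_ratio_.sum()   # variance not retained
    results.append((k, mse, lost))
    print(f"k={k:2d}  MSE={mse:.6f}  1 - retained variance={lost:.6f}")
```

The two columns agree because every standardized feature has variance 1, so the average squared error per cell is exactly the discarded share of the total variance. On unscaled data the two quantities would differ by a constant factor.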

Graph 6: Class Centroid Distance in PCA Space

X_scaled = StandardScaler().fit_transform(X)
X_2d = PCA(n_components=2).fit_transform(X_scaled)

pc_df = pd.DataFrame(X_2d, columns=["PC1", "PC2"])
pc_df["target"] = y.values

centroids = pc_df.groupby("target")[["PC1", "PC2"]].mean()
c0 = centroids.loc[0].values
c1 = centroids.loc[1].values
centroid_distance = np.linalg.norm(c0 - c1)
print("Centroid distance (2D PCA space):", round(centroid_distance, 4))

plt.figure(figsize=(7, 5))
plt.scatter(pc_df.loc[pc_df["target"] == 0, "PC1"], pc_df.loc[pc_df["target"] == 0, "PC2"],
            s=20, alpha=0.55, color="#ef5350", label=data.target_names[0])
plt.scatter(pc_df.loc[pc_df["target"] == 1, "PC1"], pc_df.loc[pc_df["target"] == 1, "PC2"],
            s=20, alpha=0.55, color="#42a5f5", label=data.target_names[1])
plt.scatter([c0[0], c1[0]], [c0[1], c1[1]], s=180, marker="X", color=["#b71c1c", "#0d47a1"], label="Class centroids")
plt.plot([c0[0], c1[0]], [c0[1], c1[1]], "k--", linewidth=1.2, label=f"Distance = {centroid_distance:.2f}")
plt.xlabel("PC1")
plt.ylabel("PC2")
plt.title("Centroid Separation in 2D PCA Space")
plt.legend()
plt.show()

This gives a compact separability signal. A larger centroid distance usually means easier class separation in the reduced space, but only relative to how widely each class is spread around its centroid.
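The word "usually" matters: the same centroid distance can go with very different amounts of overlap. A NumPy-only sketch on synthetic 2D points (not the breast cancer data), where the helper `centroid_report` and the spread values are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
labels = np.repeat([0, 1], 500)
centers = np.array([[-2.0, 0.0], [2.0, 0.0]])

# Unit-variance noise, centered per class so both datasets share the same centroids
noise = rng.normal(size=(1000, 2))
for c in (0, 1):
    noise[labels == c] -= noise[labels == c].mean(axis=0)

def centroid_report(points):
    """Return (centroid distance, nearest-centroid accuracy) for labeled points."""
    cents = np.array([points[labels == c].mean(axis=0) for c in (0, 1)])
    dists = np.linalg.norm(points[:, None, :] - cents[None, :, :], axis=2)
    accuracy = (dists.argmin(axis=1) == labels).mean()
    return np.linalg.norm(cents[0] - cents[1]), accuracy

report = {}
for spread in (0.5, 4.0):
    pts = centers[labels] + spread * noise
    report[spread] = centroid_report(pts)
    print(f"spread={spread}: distance={report[spread][0]:.2f}, accuracy={report[spread][1]:.3f}")
```

Both datasets have the same centroid distance, yet the wide-spread one is much harder to separate, so the distance is best read alongside the scatter plot rather than on its own.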

Exercises for Practice

Exercise 1: Change n_components to 0.90, 0.95, and 0.99; compare retained components and CV accuracy.

Exercise 2: Force PCA to 2, 5, and 10 components; compare confusion matrices for each.

Exercise 3: Skip scaling intentionally and observe variance/loadings distortion.

Exercise 4: Compare top loadings for PC1 and PC2; write one interpretation sentence for each.

Exercise 5: Run 5-fold and 10-fold CV for baseline vs PCA-95%; compare stability.

Exercise 6: Replace Logistic Regression with SVC and test whether PCA helps speed and accuracy.

Exercise 7: Keep only top 5 raw features (no PCA) and compare with PCA-95% pipeline.

Exercise 8: Increase test_size from 0.2 to 0.3 and compare how much the test scores change with and without PCA.

Exercise 9: Repeat the same template on load_wine() and compare how many components are needed for 95% variance.

A practical PCA workflow is: scale first, inspect explained variance, pick components by retention target, validate against no-PCA baseline, and interpret loadings before final model decisions.