Principal Component Analysis Wine Example

code, graphs, guidelines, and practice tasks


PCA Code and Description (Wine)

** Please type the code into your code window instead of copying and pasting.
Typing it out can help you understand the process better. **

Section 1: Imports

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

We use the same PCA workflow as other pages so the method is reusable across domains.

Section 2: Load Data

data = load_wine()
X = pd.DataFrame(data.data, columns=data.feature_names)
y = pd.Series(data.target, name="target")

print("Shape:", X.shape)
print("Target classes:", np.unique(y))
print(X.head())

Wine data has 178 samples, 13 numeric chemistry features, and 3 classes, which makes it a convenient example for PCA followed by multiclass classification.

Section 3: Split and Scale

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

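Feature scales differ a lot in this dataset (proline is measured in the hundreds to thousands, while hue stays near 1), so scaling before PCA is non-negotiable. Without it, the largest-scale features would dominate the first components.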
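To see the scale gap directly, here is a minimal sketch that compares the raw per-feature standard deviations. The names raw_std and spread_ratio are illustrative and not part of the main workflow.

```python
import pandas as pd
from sklearn.datasets import load_wine

data = load_wine()
X = pd.DataFrame(data.data, columns=data.feature_names)

# Per-feature standard deviation on the raw (unscaled) data, largest first
raw_std = X.std().sort_values(ascending=False)
print(raw_std.round(3))

# Ratio between the widest and the narrowest feature spread
spread_ratio = raw_std.iloc[0] / raw_std.iloc[-1]
print("Largest / smallest std:", round(spread_ratio, 1))
```

A large ratio here means unscaled PCA would mostly rediscover the widest feature instead of shared structure.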

Section 4: Fit PCA and Inspect Variance

pca_full = PCA()
X_train_pca_full = pca_full.fit_transform(X_train_scaled)

explained = pca_full.explained_variance_ratio_
cum_explained = np.cumsum(explained)

print("Explained variance ratios:", np.round(explained, 4))
print("Cumulative variance:", np.round(cum_explained, 4))
print("Components needed for 95% variance:", np.argmax(cum_explained >= 0.95) + 1)

In wine data, the first few components usually carry most of the variation, so a handful of components can stand in for all 13 features in later modeling.
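The manual count above and the float form of n_components used in the Section 5 pipeline should pick the same number of components. The sketch below checks that on the full standardized data (an assumption made to keep it short; Section 4 fits on the training split). The names manual_count and auto_count are illustrative.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X_scaled = StandardScaler().fit_transform(load_wine().data)

# Manual count: first position where cumulative variance reaches 95%
cum = np.cumsum(PCA().fit(X_scaled).explained_variance_ratio_)
manual_count = int(np.argmax(cum >= 0.95) + 1)

# Automatic count: a float n_components asks PCA to choose the count itself
auto_count = int(PCA(n_components=0.95).fit(X_scaled).n_components_)

print("Manual:", manual_count, "| Automatic:", auto_count)
```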

Section 5: Build PCA + Logistic Pipeline

pca_model = make_pipeline(
    StandardScaler(),
    PCA(n_components=0.95),
    LogisticRegression(max_iter=5000)
)

pca_model.fit(X_train, y_train)
y_pred = pca_model.predict(X_test)

print("Accuracy:", round(accuracy_score(y_test, y_pred), 4))
print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred))
print("Classification Report:\n", classification_report(y_test, y_pred, target_names=data.target_names))

pca_step = pca_model.named_steps["pca"]
print("PCA components retained:", pca_step.n_components_)

Because the scaler and PCA sit inside the pipeline, they are fitted on the training data only. This keeps leakage risk low and makes the end-to-end workflow reusable on real multiclass datasets.
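One way to confirm the leakage claim is to inspect what the fitted scaler inside the pipeline actually learned. This is a small sketch under the same split settings as Section 3; the variable names pipe, fitted_scaler, uses_train_only, and uses_full_data are illustrative.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

pipe = make_pipeline(
    StandardScaler(), PCA(n_components=0.95), LogisticRegression(max_iter=5000)
)
pipe.fit(X_train, y_train)

# make_pipeline names each step after its lowercased class name
fitted_scaler = pipe.named_steps["standardscaler"]

# The learned means should match the training rows, not the full dataset
uses_train_only = bool(np.allclose(fitted_scaler.mean_, X_train.mean(axis=0)))
uses_full_data = bool(np.allclose(fitted_scaler.mean_, X.mean(axis=0)))
print("Scaler fitted on training rows only:", uses_train_only)
print("Scaler statistics equal full-data statistics:", uses_full_data)
```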

Section 6: Baseline vs PCA Comparison

baseline_model = make_pipeline(
    StandardScaler(),
    LogisticRegression(max_iter=5000)
)

baseline_cv = cross_val_score(baseline_model, X, y, cv=5, scoring="accuracy").mean()
pca_cv = cross_val_score(pca_model, X, y, cv=5, scoring="accuracy").mean()

print(f"Baseline (no PCA) mean CV accuracy: {baseline_cv:.4f}")
print(f"With PCA (95% variance) mean CV accuracy: {pca_cv:.4f}")

This side-by-side comparison tells you whether PCA keeps accuracy close enough to the baseline to justify the extra transformation step.

Section 7: Inspect PCA Loadings

# Fit on all rows: this step is for interpretation only, not for model evaluation
pca_2 = PCA(n_components=2)
X_scaled = StandardScaler().fit_transform(X)
pca_2.fit(X_scaled)

loading_df = pd.DataFrame(
    pca_2.components_.T,
    columns=["PC1_loading", "PC2_loading"],
    index=X.columns
)

print("Top absolute contributors to PC1:")
print(loading_df["PC1_loading"].abs().sort_values(ascending=False).head(8))

Loadings help connect components back to chemistry features so the output stays interpretable for practical analysis.
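When reading loadings, it helps to know that each row of components_ is a unit-length direction and that different components are orthogonal. The sketch below checks both properties; squared_sums and dot_pc1_pc2 are illustrative names.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X_scaled = StandardScaler().fit_transform(load_wine().data)
pca_2 = PCA(n_components=2).fit(X_scaled)

# Each row of components_ holds one component's loadings across the 13 features
squared_sums = (pca_2.components_ ** 2).sum(axis=1)
print("Sum of squared loadings per component:", squared_sums.round(6))

# Different components are orthogonal, so their dot product is numerically zero
dot_pc1_pc2 = float(pca_2.components_[0] @ pca_2.components_[1])
print("PC1 . PC2 is numerically zero:", abs(dot_pc1_pc2) < 1e-8)
```

Since the squared loadings of a component sum to 1, a feature's squared loading can be read as its share of that component's direction.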

Guidelines for different datasets

# 1) Keep PCA for numeric correlated features
# - for mixed-type data, preprocess categories separately first

# 2) Standardize features before PCA
# - avoid scale-dominance distortion

# 3) Choose components by purpose
# - 2 components for visualization
# - 90%/95% variance target for production models

# 4) Validate with baseline model
# - compare accuracy/F1 and training speed with and without PCA

# 5) Use loadings for interpretability
# - report top features influencing PC1/PC2

# 6) Keep leakage-safe workflow
# - split first, then fit scaler/PCA on train only

# 7) Use stratified split for classification datasets
# - preserve class balance in train/test

# 8) Keep results reproducible
# - use random_state and consistent CV settings

Use this page as a practical PCA template for multiclass problems where you need both dimensionality reduction and stable predictive quality.
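Guideline 4 mentions comparing accuracy, F1, and training speed. One possible way to collect all three in a single pass is scikit-learn's cross_validate, sketched below. The candidates dictionary and the summary column names are illustrative, and fit times on a dataset this small are noisy, so treat them as a rough indication only.

```python
import pandas as pd
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_wine(return_X_y=True)

candidates = {
    "No PCA": make_pipeline(StandardScaler(), LogisticRegression(max_iter=5000)),
    "PCA 95%": make_pipeline(
        StandardScaler(), PCA(n_components=0.95), LogisticRegression(max_iter=5000)
    ),
}

rows = []
for name, model in candidates.items():
    # cross_validate returns per-fold scores and fit times in one call
    cv = cross_validate(model, X, y, cv=5, scoring=["accuracy", "f1_macro"])
    rows.append({
        "model": name,
        "accuracy": cv["test_accuracy"].mean(),
        "f1_macro": cv["test_f1_macro"].mean(),
        "fit_time_s": cv["fit_time"].mean(),
    })

summary = pd.DataFrame(rows).set_index("model")
print(summary.round(4))
```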

Graphs and Analysis

Graph 1: Explained Variance and Cumulative Curve

pca_vis = PCA()
X_scaled = StandardScaler().fit_transform(X)
pca_vis.fit(X_scaled)

exp = pca_vis.explained_variance_ratio_
cum = np.cumsum(exp)

fig, axes = plt.subplots(1, 2, figsize=(12, 4.6))

axes[0].bar(range(1, len(exp) + 1), exp, color="#42a5f5")
axes[0].set_xlabel("Principal Component")
axes[0].set_ylabel("Explained Variance Ratio")
axes[0].set_title("Variance by Component")

axes[1].plot(range(1, len(cum) + 1), cum, marker="o", linewidth=1.5, color="#2e7d32")
axes[1].axhline(0.95, color="red", linestyle="--", label="95% threshold")
axes[1].set_xlabel("Number of Components")
axes[1].set_ylabel("Cumulative Explained Variance")
axes[1].set_title("Cumulative Variance Curve")
axes[1].legend()

plt.tight_layout()
plt.show()

This graph lets you choose the component count from the variance curve instead of guessing how much to reduce.

Graph 2: 2D PCA Projection for 3 Wine Classes

X_scaled = StandardScaler().fit_transform(X)
X_2d = PCA(n_components=2).fit_transform(X_scaled)

plt.figure(figsize=(7.4, 5.2))
colors = ["#ef5350", "#42a5f5", "#66bb6a"]
for cls, color, name in zip(sorted(np.unique(y)), colors, data.target_names):
    mask = (y == cls).to_numpy()  # plain boolean array for indexing the NumPy projection
    plt.scatter(X_2d[mask, 0], X_2d[mask, 1], s=28, alpha=0.75, label=name, color=color)

plt.xlabel("PC1")
plt.ylabel("PC2")
plt.title("PCA Projection: Wine Classes")
plt.legend()
plt.show()

Separation quality here gives a quick view of class structure and whether reduced dimensions still preserve useful discrimination.

Graph 3: Baseline vs PCA Accuracy

models = {
    "No PCA": make_pipeline(StandardScaler(), LogisticRegression(max_iter=5000)),
    "PCA 95%": make_pipeline(StandardScaler(), PCA(n_components=0.95), LogisticRegression(max_iter=5000)),
    "PCA 2 comps": make_pipeline(StandardScaler(), PCA(n_components=2), LogisticRegression(max_iter=5000))
}

rows = []
for name, model in models.items():
    score = cross_val_score(model, X, y, cv=5, scoring="accuracy").mean()
    rows.append((name, score))

comp = pd.DataFrame(rows, columns=["model", "cv_accuracy"]).sort_values("cv_accuracy", ascending=False)
print(comp)

plt.figure(figsize=(7.2, 4.5))
plt.bar(comp["model"], comp["cv_accuracy"], color=["#66bb6a", "#42a5f5", "#ffa726"])
plt.ylabel("Mean CV Accuracy")
plt.title("Wine Classification: Baseline vs PCA")
plt.ylim(comp["cv_accuracy"].min() - 0.03, comp["cv_accuracy"].max() + 0.03)
plt.show()

This comparison is operationally useful: it tells you whether the accuracy cost of PCA compression is acceptable for your target accuracy.

Graph 4: Top Feature Loadings in PC1

pca_2 = PCA(n_components=2)
X_scaled = StandardScaler().fit_transform(X)
pca_2.fit(X_scaled)

load = pd.DataFrame({
    "feature": X.columns,
    "pc1_loading": pca_2.components_[0]
})
load["abs_loading"] = load["pc1_loading"].abs()
top = load.sort_values("abs_loading", ascending=False).head(10).sort_values("pc1_loading")

colors = ["#ef5350" if v < 0 else "#66bb6a" for v in top["pc1_loading"]]

plt.figure(figsize=(8.4, 5))
plt.barh(top["feature"], top["pc1_loading"], color=colors)
plt.axvline(0, color="black", linewidth=1)
plt.xlabel("PC1 Loading")
plt.title("Top PC1 Contributors (Wine)")
plt.tight_layout()
plt.show()

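This plot gives a practical interpretation bridge from abstract components back to concrete feature groups. Read the signs relative to each other: features with the same sign move together along PC1, while the overall sign of a component is arbitrary.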

Graph 5: Reconstruction Error vs Number of Components

X_scaled = StandardScaler().fit_transform(X)
component_grid = [2, 3, 5, 8, 10, 12, X.shape[1]]
errors = []

for n in component_grid:
    pca_n = PCA(n_components=n)
    X_proj = pca_n.fit_transform(X_scaled)
    X_recon = pca_n.inverse_transform(X_proj)
    mse = ((X_scaled - X_recon) ** 2).mean()
    errors.append(mse)

plt.figure(figsize=(7.5, 4.6))
plt.plot(component_grid, errors, marker="o", color="#5e35b1", linewidth=1.8)
plt.xlabel("Number of Components")
plt.ylabel("Reconstruction MSE")
plt.title("Wine PCA: Information Loss vs Components")
plt.grid(alpha=0.3)
plt.show()

This curve quantifies compression cost. Pick components around the flattening zone where extra dimensions reduce error only marginally.
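On standardized data the reconstruction MSE is tied directly to explained variance: every feature has variance 1, so the error equals the share of variance carried by the dropped components. The sketch below checks this for a few component counts; mse_values and unexplained are illustrative names.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X_scaled = StandardScaler().fit_transform(load_wine().data)
cum = np.cumsum(PCA().fit(X_scaled).explained_variance_ratio_)

mse_values, unexplained = [], []
for n in [2, 5, 10]:
    pca_n = PCA(n_components=n)
    X_recon = pca_n.inverse_transform(pca_n.fit_transform(X_scaled))
    # Mean squared error over all entries of the standardized matrix
    mse_values.append(((X_scaled - X_recon) ** 2).mean())
    # Share of variance not captured by the first n components
    unexplained.append(1 - cum[n - 1])

for n, m, u in zip([2, 5, 10], mse_values, unexplained):
    print(f"n={n}: MSE={m:.4f}, 1 - cumulative variance={u:.4f}")
```

This means Graph 5 and the cumulative curve in Graph 1 carry the same information, viewed from opposite sides.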

Graph 6: Pairwise Class-Centroid Distances in 2D PCA Space

X_scaled = StandardScaler().fit_transform(X)
X_2d = PCA(n_components=2).fit_transform(X_scaled)

pc_df = pd.DataFrame(X_2d, columns=["PC1", "PC2"])
pc_df["target"] = y.values
centroids = pc_df.groupby("target")[["PC1", "PC2"]].mean().sort_index()

dist_01 = np.linalg.norm(centroids.loc[0].values - centroids.loc[1].values)
dist_02 = np.linalg.norm(centroids.loc[0].values - centroids.loc[2].values)
dist_12 = np.linalg.norm(centroids.loc[1].values - centroids.loc[2].values)
print(f"Centroid distances: 0-1={dist_01:.3f}, 0-2={dist_02:.3f}, 1-2={dist_12:.3f}")

plt.figure(figsize=(7.3, 5.3))
colors = ["#ef5350", "#42a5f5", "#66bb6a"]
for cls, color, name in zip(sorted(np.unique(y)), colors, data.target_names):
    mask = (pc_df["target"] == cls)
    plt.scatter(pc_df.loc[mask, "PC1"], pc_df.loc[mask, "PC2"], s=22, alpha=0.6, color=color, label=name)

for cls, color in zip([0, 1, 2], ["#b71c1c", "#0d47a1", "#1b5e20"]):
    cx, cy = centroids.loc[cls, "PC1"], centroids.loc[cls, "PC2"]
    plt.scatter(cx, cy, s=190, marker="X", color=color)

c0 = centroids.loc[0].values
c1 = centroids.loc[1].values
c2 = centroids.loc[2].values
# Dashed lines joining each pair of class centroids
plt.plot([c0[0], c1[0]], [c0[1], c1[1]], "k--", linewidth=1)  # class 0 to class 1
plt.plot([c0[0], c2[0]], [c0[1], c2[1]], "k--", linewidth=1)  # class 0 to class 2
plt.plot([c1[0], c2[0]], [c1[1], c2[1]], "k--", linewidth=1)  # class 1 to class 2

plt.xlabel("PC1")
plt.ylabel("PC2")
plt.title("Wine Class Separation via PCA Centroids")
plt.legend()
plt.show()

This view adds a compact separability measure for multiclass structure. Larger pairwise centroid distances usually indicate cleaner class separation, provided the spread within each class is comparable.
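Centroid distances ignore how spread out each class is. As a complementary check (a different technique from the centroid view above, not a replacement for it), the silhouette score of the true class labels in the 2D projection accounts for both separation and within-class spread. This is a minimal sketch; sil_2d is an illustrative name.

```python
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

X, y = load_wine(return_X_y=True)
X_2d = PCA(n_components=2).fit_transform(StandardScaler().fit_transform(X))

# Silhouette is computed with the known class labels, so it measures how tight
# and well separated those classes are in the projection (range -1 to 1)
sil_2d = silhouette_score(X_2d, y)
print("Silhouette of true classes in 2D PCA space:", round(sil_2d, 3))
```

Values near 1 suggest tight, well-separated classes; values near 0 suggest overlapping ones.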

Exercises for Practice

Exercise 1: Compare 2, 3, and 5 PCA components and evaluate multiclass confusion matrices.

Exercise 2: Run PCA with n_components=0.90, 0.95, and 0.99; compare retained dimensions.

Exercise 3: Skip scaling intentionally and observe change in explained variance ordering.

Exercise 4: Train SVC and KNN with/without PCA and compare CV accuracy.

Exercise 5: Use 5-fold and 10-fold CV and check if best setup changes.

Exercise 6: Extract top contributors for PC1 and PC2 and write interpretation notes.

Exercise 7: Remove two high-loading features and re-run PCA; observe component count for 95% variance.

Exercise 8: Increase test_size from 0.2 to 0.3 and compare drift in baseline vs PCA pipelines.

Exercise 9: Reuse the same workflow on breast cancer data and compare separation quality in 2D PCA plot.

A practical PCA workflow for multiclass data is: standardize features, inspect variance retention, validate with baseline and PCA pipelines, and interpret loadings before final deployment.