PCA Code and Description (Wine)
** Please type the code into your code window instead of copying and pasting; this can help you understand the process better. **
Section 1: Imports
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
We use the same PCA workflow as other pages so the method is reusable across domains.
Section 2: Load Data
data = load_wine()
X = pd.DataFrame(data.data, columns=data.feature_names)
y = pd.Series(data.target, name="target")
print("Shape:", X.shape)
print("Target classes:", np.unique(y))
print(X.head())
Wine data has 13 numeric chemistry features and 3 classes, which is useful for PCA + multiclass classification.
Section 3: Split and Scale
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
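Feature scales differ a lot in this dataset (proline runs from the hundreds into the thousands, while hue stays near 1), so standardize before PCA; otherwise the largest-scale features dominate the components.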
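To see why scaling matters here, the following minimal sketch fits PCA once on the raw features and once on standardized features, then compares the share of variance taken by PC1. The names X_raw, pca_raw, and pca_std are illustrative and are not used elsewhere on this page.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

data = load_wine()
X_raw = pd.DataFrame(data.data, columns=data.feature_names)

# Per-feature standard deviation shows how unequal the raw scales are
print(X_raw.std().sort_values(ascending=False).head(3))

# PCA on raw values: the largest-scale feature dominates PC1
pca_raw = PCA(n_components=2).fit(X_raw)
raw_top_feature = X_raw.columns[np.argmax(np.abs(pca_raw.components_[0]))]
raw_pc1_ratio = pca_raw.explained_variance_ratio_[0]

# PCA on standardized values: every feature starts with unit variance
pca_std = PCA(n_components=2).fit(StandardScaler().fit_transform(X_raw))
std_pc1_ratio = pca_std.explained_variance_ratio_[0]

print("Top PC1 feature without scaling:", raw_top_feature)
print(f"PC1 variance ratio without scaling: {raw_pc1_ratio:.4f}")
print(f"PC1 variance ratio with scaling:    {std_pc1_ratio:.4f}")
```

Without scaling, PC1 is essentially a copy of the single largest-scale feature, which is why the unscaled explained-variance numbers are misleading.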
Section 4: Fit PCA and Inspect Variance
pca_full = PCA()
X_train_pca_full = pca_full.fit_transform(X_train_scaled)
explained = pca_full.explained_variance_ratio_
cum_explained = np.cumsum(explained)
print("Explained variance ratios:", np.round(explained, 4))
print("Cumulative variance:", np.round(cum_explained, 4))
print("Components needed for 95% variance:", np.argmax(cum_explained >= 0.95) + 1)
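In the Wine data, the leading components carry a disproportionate share of the variance. Use the cumulative ratios printed above to decide how many to keep, since fewer components also mean faster downstream modeling.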
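One caveat with np.argmax(cum_explained >= threshold) is that it returns 0 when no entry satisfies the condition, so an unreachable threshold would silently be reported as 1 component. A small helper can make that case explicit. This is a sketch: components_for_variance is a hypothetical name, and demo_cum is made-up data for illustration, not Wine output.

```python
import numpy as np

def components_for_variance(cum_ratios, threshold):
    """Return the smallest number of components whose cumulative ratio reaches threshold."""
    cum_ratios = np.asarray(cum_ratios, dtype=float)
    # searchsorted gives the first index where cum_ratios >= threshold
    idx = int(np.searchsorted(cum_ratios, threshold, side="left"))
    if idx >= len(cum_ratios):
        raise ValueError("threshold is not reached by any number of components")
    return idx + 1

# Hypothetical cumulative ratios, for illustration only
demo_cum = [0.40, 0.65, 0.80, 0.90, 0.96, 1.00]
for t in (0.80, 0.90, 0.95):
    print(f"threshold {t:.2f} -> {components_for_variance(demo_cum, t)} components")
```

With the real data, pass cum_explained from the code above in place of demo_cum.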
Section 5: Build PCA + Logistic Pipeline
pca_model = make_pipeline(
    StandardScaler(),
    PCA(n_components=0.95),
    LogisticRegression(max_iter=5000)
)
pca_model.fit(X_train, y_train)
y_pred = pca_model.predict(X_test)
print("Accuracy:", round(accuracy_score(y_test, y_pred), 4))
print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred))
print("Classification Report:\n", classification_report(y_test, y_pred, target_names=data.target_names))
pca_step = pca_model.named_steps["pca"]
print("PCA components retained:", pca_step.n_components_)
This gives an end-to-end workflow that can be used on real multiclass datasets. Because the scaler and PCA live inside the pipeline, they are fit on the training data only, which keeps leakage risk to a minimum.
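The leakage point can be checked directly. The sketch below refits the same kind of pipeline and confirms that the scaler inside it learned its means from the training rows only. The step name "standardscaler" is the lowercase class name that make_pipeline assigns automatically; pipe and fitted_scaler are illustrative names.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

data = load_wine()
X = pd.DataFrame(data.data, columns=data.feature_names)
y = pd.Series(data.target, name="target")
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

pipe = make_pipeline(
    StandardScaler(),
    PCA(n_components=0.95),
    LogisticRegression(max_iter=5000),
)
pipe.fit(X_train, y_train)

# The scaler's learned means should match the training rows, not the full data
fitted_scaler = pipe.named_steps["standardscaler"]
uses_train_only = np.allclose(fitted_scaler.mean_, X_train.mean().to_numpy())
matches_full_data = np.allclose(fitted_scaler.mean_, X.mean().to_numpy())

print("Scaler means match training set:", uses_train_only)
print("Scaler means match full dataset:", matches_full_data)
```

When predict is called on X_test, the pipeline reuses these training statistics rather than re-estimating them from the test rows.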
Section 6: Baseline vs PCA Comparison
baseline_model = make_pipeline(
    StandardScaler(),
    LogisticRegression(max_iter=5000)
)
baseline_cv = cross_val_score(baseline_model, X, y, cv=5, scoring="accuracy").mean()
pca_cv = cross_val_score(pca_model, X, y, cv=5, scoring="accuracy").mean()
print(f"Baseline (no PCA) mean CV accuracy: {baseline_cv:.4f}")
print(f"With PCA (95% variance) mean CV accuracy: {pca_cv:.4f}")
This side-by-side comparison tells you whether PCA compression is helping enough to justify transformation complexity.
Section 7: Inspect PCA Loadings
pca_2 = PCA(n_components=2)
X_scaled = StandardScaler().fit_transform(X)
pca_2.fit(X_scaled)
loading_df = pd.DataFrame(
    pca_2.components_.T,
    columns=["PC1_loading", "PC2_loading"],
    index=X.columns
)
print("Top absolute contributors to PC1:")
print(loading_df["PC1_loading"].abs().sort_values(ascending=False).head(8))
Loadings help connect components back to chemistry features so the output stays interpretable for practical analysis.
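When reading these numbers, it helps to know that each row of components_ is a unit-length direction, so the squared weights of one component sum to 1 and can be read as each feature's share of that direction. The sketch below checks this and also builds variance-scaled loadings, which on standardized data behave approximately like feature-component correlations. The names share_df and scaled_loadings are illustrative.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

data = load_wine()
X = pd.DataFrame(data.data, columns=data.feature_names)
X_scaled = StandardScaler().fit_transform(X)
pca_2 = PCA(n_components=2).fit(X_scaled)

# Each row of components_ has Euclidean length 1
row_norms = np.linalg.norm(pca_2.components_, axis=1)
print("Component norms:", np.round(row_norms, 6))

# Squared weights: each feature's share of a component (columns sum to 1)
share_df = pd.DataFrame(
    pca_2.components_.T ** 2,
    index=X.columns,
    columns=["PC1_share", "PC2_share"],
)
print(share_df.sort_values("PC1_share", ascending=False).head(5).round(3))

# Variance-scaled loadings: weight times the component's standard deviation
scaled_loadings = pca_2.components_.T * np.sqrt(pca_2.explained_variance_)
```

The sign of a loading is arbitrary up to flipping the whole component, so compare magnitudes and relative signs rather than absolute direction.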
Guidelines for different datasets
# 1) Keep PCA for numeric correlated features
# - for mixed-type data, preprocess categories separately first
# 2) Standardize features before PCA
# - avoid scale-dominance distortion
# 3) Choose components by purpose
# - 2 components for visualization
# - 90%/95% variance target for production models
# 4) Validate with baseline model
# - compare accuracy/F1 and training speed with and without PCA
# 5) Use loadings for interpretability
# - report top features influencing PC1/PC2
# 6) Keep leakage-safe workflow
# - split first, then fit scaler/PCA on train only
# 7) Use stratified split for classification datasets
# - preserve class balance in train/test
# 8) Keep reproducibility fixed
# - use random_state and consistent CV settings
Use this page as a practical PCA template for multiclass problems where you need both dimensionality reduction and stable predictive quality.
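Guideline 4 mentions comparing accuracy, F1, and training speed. One way to collect all three in a single pass is cross_validate, which returns per-fold scores for several metrics along with fit times. This is a minimal sketch; the candidates dictionary and the summary table layout are illustrative choices, and fit times on a dataset this small are noisy.

```python
import pandas as pd
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

data = load_wine()
X = pd.DataFrame(data.data, columns=data.feature_names)
y = pd.Series(data.target, name="target")

candidates = {
    "No PCA": make_pipeline(StandardScaler(), LogisticRegression(max_iter=5000)),
    "PCA 95%": make_pipeline(
        StandardScaler(), PCA(n_components=0.95), LogisticRegression(max_iter=5000)
    ),
}

summary_rows = []
for name, model in candidates.items():
    # One call returns per-fold accuracy, macro F1, and fit time
    cv_out = cross_validate(model, X, y, cv=5, scoring=["accuracy", "f1_macro"])
    summary_rows.append({
        "model": name,
        "accuracy": cv_out["test_accuracy"].mean(),
        "f1_macro": cv_out["test_f1_macro"].mean(),
        "fit_time_s": cv_out["fit_time"].mean(),
    })

summary = pd.DataFrame(summary_rows)
print(summary.round(4))
```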
Graphs and Analysis
Graph 1: Explained Variance and Cumulative Curve
pca_vis = PCA()
X_scaled = StandardScaler().fit_transform(X)
pca_vis.fit(X_scaled)
exp = pca_vis.explained_variance_ratio_
cum = np.cumsum(exp)
fig, axes = plt.subplots(1, 2, figsize=(12, 4.6))
axes[0].bar(range(1, len(exp) + 1), exp, color="#42a5f5")
axes[0].set_xlabel("Principal Component")
axes[0].set_ylabel("Explained Variance Ratio")
axes[0].set_title("Variance by Component")
axes[1].plot(range(1, len(cum) + 1), cum, marker="o", linewidth=1.5, color="#2e7d32")
axes[1].axhline(0.95, color="red", linestyle="--", label="95% threshold")
axes[1].set_xlabel("Number of Components")
axes[1].set_ylabel("Cumulative Explained Variance")
axes[1].set_title("Cumulative Variance Curve")
axes[1].legend()
plt.tight_layout()
plt.show()
This graph helps you pick a justified component count instead of guessing how aggressively to reduce dimensions.
Graph 2: 2D PCA Projection for 3 Wine Classes
X_scaled = StandardScaler().fit_transform(X)
X_2d = PCA(n_components=2).fit_transform(X_scaled)
plt.figure(figsize=(7.4, 5.2))
colors = ["#ef5350", "#42a5f5", "#66bb6a"]
for cls, color, name in zip(sorted(np.unique(y)), colors, data.target_names):
    mask = (y == cls)
    plt.scatter(X_2d[mask, 0], X_2d[mask, 1], s=28, alpha=0.75, label=name, color=color)
plt.xlabel("PC1")
plt.ylabel("PC2")
plt.title("PCA Projection: Wine Classes")
plt.legend()
plt.show()
Separation quality here gives a quick view of class structure and whether reduced dimensions still preserve useful discrimination.
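A 2D scatter only shows the part of the data captured by PC1 and PC2, so it is worth checking how much variance that projection keeps before drawing conclusions from it. A minimal sketch, with retained_2d as an illustrative name:

```python
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X_scaled = StandardScaler().fit_transform(load_wine().data)
pca_2d = PCA(n_components=2).fit(X_scaled)

# Share of total variance that the 2D picture actually represents
retained_2d = pca_2d.explained_variance_ratio_.sum()
print(f"Variance retained by PC1 + PC2: {retained_2d:.3f}")
print(f"Variance not shown in the plot: {1 - retained_2d:.3f}")
```

If a large share is missing, classes that overlap in the plot may still be separable in the remaining components.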
Graph 3: Baseline vs PCA Accuracy
models = {
    "No PCA": make_pipeline(StandardScaler(), LogisticRegression(max_iter=5000)),
    "PCA 95%": make_pipeline(StandardScaler(), PCA(n_components=0.95), LogisticRegression(max_iter=5000)),
    "PCA 2 comps": make_pipeline(StandardScaler(), PCA(n_components=2), LogisticRegression(max_iter=5000))
}
rows = []
for name, model in models.items():
    score = cross_val_score(model, X, y, cv=5, scoring="accuracy").mean()
    rows.append((name, score))
comp = pd.DataFrame(rows, columns=["model", "cv_accuracy"]).sort_values("cv_accuracy", ascending=False)
print(comp)
plt.figure(figsize=(7.2, 4.5))
plt.bar(comp["model"], comp["cv_accuracy"], color=["#66bb6a", "#42a5f5", "#ffa726"])
plt.ylabel("Mean CV Accuracy")
plt.title("Wine Classification: Baseline vs PCA")
plt.ylim(comp["cv_accuracy"].min() - 0.03, comp["cv_accuracy"].max() + 0.03)
plt.show()
This comparison is operationally useful: it tells you whether PCA compression is acceptable for your expected accuracy level.
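Mean accuracy alone can hide fold-to-fold variation, and cv=5 on a classifier uses stratified folds without shuffling. The sketch below passes an explicit StratifiedKFold with shuffling and a fixed random_state, then reports mean and standard deviation. The splitter settings are an assumption chosen for illustration, not the configuration used for the chart above.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X_arr, y_arr = load_wine(return_X_y=True)

# Shuffled, stratified folds with a fixed seed so reruns give the same splits
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

pca_pipe = make_pipeline(
    StandardScaler(), PCA(n_components=0.95), LogisticRegression(max_iter=5000)
)

scores_first = cross_val_score(pca_pipe, X_arr, y_arr, cv=cv, scoring="accuracy")
scores_second = cross_val_score(pca_pipe, X_arr, y_arr, cv=cv, scoring="accuracy")

print("Fold accuracies:", np.round(scores_first, 4))
print(f"Mean +/- std: {scores_first.mean():.4f} +/- {scores_first.std():.4f}")
print("Identical on rerun:", np.allclose(scores_first, scores_second))
```

If the gap between two models is smaller than the fold-to-fold standard deviation, treat them as roughly equivalent on this dataset.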
Graph 4: Top Feature Loadings in PC1
pca_2 = PCA(n_components=2)
X_scaled = StandardScaler().fit_transform(X)
pca_2.fit(X_scaled)
load = pd.DataFrame({
    "feature": X.columns,
    "pc1_loading": pca_2.components_[0]
})
load["abs_loading"] = load["pc1_loading"].abs()
top = load.sort_values("abs_loading", ascending=False).head(10).sort_values("pc1_loading")
colors = ["#ef5350" if v < 0 else "#66bb6a" for v in top["pc1_loading"]]
plt.figure(figsize=(8.4, 5))
plt.barh(top["feature"], top["pc1_loading"], color=colors)
plt.axvline(0, color="black", linewidth=1)
plt.xlabel("PC1 Loading")
plt.title("Top PC1 Contributors (Wine)")
plt.tight_layout()
plt.show()
This plot gives a practical interpretation bridge from abstract components back to concrete feature groups.
Graph 5: Reconstruction Error vs Number of Components
X_scaled = StandardScaler().fit_transform(X)
component_grid = [2, 3, 5, 8, 10, 12, X.shape[1]]
errors = []
for n in component_grid:
    pca_n = PCA(n_components=n)
    X_proj = pca_n.fit_transform(X_scaled)
    X_recon = pca_n.inverse_transform(X_proj)
    mse = ((X_scaled - X_recon) ** 2).mean()
    errors.append(mse)
plt.figure(figsize=(7.5, 4.6))
plt.plot(component_grid, errors, marker="o", color="#5e35b1", linewidth=1.8)
plt.xlabel("Number of Components")
plt.ylabel("Reconstruction MSE")
plt.title("Wine PCA: Information Loss vs Components")
plt.grid(alpha=0.3)
plt.show()
This curve quantifies compression cost. Pick components around the flattening zone where extra dimensions reduce error only marginally.
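On data standardized with StandardScaler, every feature has unit variance, so the reconstruction MSE for k components should equal one minus the cumulative explained variance ratio at k. The sketch below checks that relationship numerically for a few values of k. The name checks and the chosen k values are illustrative.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X_scaled = StandardScaler().fit_transform(load_wine().data)

# Cumulative explained variance ratio from a full PCA fit
cum_ratio = np.cumsum(PCA().fit(X_scaled).explained_variance_ratio_)

checks = []
for n in [2, 5, 10]:
    pca_n = PCA(n_components=n)
    X_recon = pca_n.inverse_transform(pca_n.fit_transform(X_scaled))
    mse = ((X_scaled - X_recon) ** 2).mean()
    expected = 1.0 - cum_ratio[n - 1]  # variance left out by the first n components
    checks.append((n, mse, expected))
    print(f"n={n:2d}  MSE={mse:.6f}  1 - cumulative ratio={expected:.6f}")
```

In other words, this curve and the cumulative variance curve in Graph 1 carry the same information, viewed as loss rather than retention.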
Graph 6: Pairwise Class-Centroid Distances in 2D PCA Space
X_scaled = StandardScaler().fit_transform(X)
X_2d = PCA(n_components=2).fit_transform(X_scaled)
pc_df = pd.DataFrame(X_2d, columns=["PC1", "PC2"])
pc_df["target"] = y.values
centroids = pc_df.groupby("target")[["PC1", "PC2"]].mean().sort_index()
dist_01 = np.linalg.norm(centroids.loc[0].values - centroids.loc[1].values)
dist_02 = np.linalg.norm(centroids.loc[0].values - centroids.loc[2].values)
dist_12 = np.linalg.norm(centroids.loc[1].values - centroids.loc[2].values)
print(f"Centroid distances: 0-1={dist_01:.3f}, 0-2={dist_02:.3f}, 1-2={dist_12:.3f}")
plt.figure(figsize=(7.3, 5.3))
colors = ["#ef5350", "#42a5f5", "#66bb6a"]
for cls, color, name in zip(sorted(np.unique(y)), colors, data.target_names):
    mask = (pc_df["target"] == cls)
    plt.scatter(pc_df.loc[mask, "PC1"], pc_df.loc[mask, "PC2"], s=22, alpha=0.6, color=color, label=name)
for cls, color in zip([0, 1, 2], ["#b71c1c", "#0d47a1", "#1b5e20"]):
    cx, cy = centroids.loc[cls, "PC1"], centroids.loc[cls, "PC2"]
    plt.scatter(cx, cy, s=190, marker="X", color=color)
# Centroid coordinates for each class, used to draw the connecting lines
c0 = centroids.loc[0].values
c1 = centroids.loc[1].values
c2 = centroids.loc[2].values
plt.plot([c0[0], c1[0]], [c0[1], c1[1]], "k--", linewidth=1)
plt.plot([c0[0], c2[0]], [c0[1], c2[1]], "k--", linewidth=1)
plt.plot([c1[0], c2[0]], [c1[1], c2[1]], "k--", linewidth=1)
plt.xlabel("PC1")
plt.ylabel("PC2")
plt.title("Wine Class Separation via PCA Centroids")
plt.legend()
plt.show()
This view adds a compact separability measure for multiclass structure. Larger pairwise centroid distances usually indicate cleaner class separation, although they ignore how widely each class spreads around its centroid.
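Centroid distance is easier to interpret next to within-class spread. The sketch below computes, for every pair of classes, the centroid distance divided by the sum of the two classes' average point-to-centroid distances. This ratio is an informal index introduced here for illustration, not a standard metric, and the names spread and separation are hypothetical. It uses itertools.combinations so it works for any number of classes.

```python
from itertools import combinations

import numpy as np
import pandas as pd
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

data = load_wine()
X_2d = PCA(n_components=2).fit_transform(StandardScaler().fit_transform(data.data))
pc_df = pd.DataFrame(X_2d, columns=["PC1", "PC2"])
pc_df["target"] = data.target

centroids = pc_df.groupby("target")[["PC1", "PC2"]].mean()

# Average distance of each class's points to its own centroid
spread = {}
for cls, grp in pc_df.groupby("target"):
    offsets = grp[["PC1", "PC2"]].to_numpy() - centroids.loc[cls].to_numpy()
    spread[cls] = np.linalg.norm(offsets, axis=1).mean()

# Informal index: centroid distance relative to combined within-class spread
separation = {}
for a, b in combinations(centroids.index, 2):
    dist = np.linalg.norm(centroids.loc[a].to_numpy() - centroids.loc[b].to_numpy())
    separation[(a, b)] = dist / (spread[a] + spread[b])
    print(f"classes {a}-{b}: distance={dist:.3f}, ratio={separation[(a, b)]:.3f}")
```

A ratio well above 1 suggests the two centroids sit further apart than the classes spread; values near or below 1 suggest visible overlap in the 2D plot.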
Exercises for Practice
Exercise 1: Compare 2, 3, and 5 PCA components and evaluate multiclass confusion matrices.
Exercise 2: Run PCA with n_components=0.90, 0.95, and 0.99; compare retained dimensions.
Exercise 3: Skip scaling intentionally and observe change in explained variance ordering.
Exercise 4: Train SVC and KNN with/without PCA and compare CV accuracy.
Exercise 5: Use 5-fold and 10-fold CV and check if best setup changes.
Exercise 6: Extract top contributors for PC1 and PC2 and write interpretation notes.
Exercise 7: Remove two high-loading features and re-run PCA; observe component count for 95% variance.
Exercise 8: Increase test_size from 0.2 to 0.3 and compare drift in baseline vs PCA pipelines.
Exercise 9: Reuse the same workflow on breast cancer data and compare separation quality in 2D PCA plot.