Principal Component Analysis

Reduce dimensions while retaining variance

Machine Learning View.

Please type the code into your code window instead of copying and pasting. This can help you understand the process better.

In machine learning, Principal Component Analysis (PCA) is an unsupervised technique used for dimensionality reduction. It transforms many original features into a smaller set of new features called principal components.

These new components are linear combinations of the original variables and are arranged in descending order of explained variance. The first component captures the most variation, the second captures the next most, and so on.

Why PCA is useful.
Real datasets often have many correlated columns. PCA removes redundancy and compresses the information into fewer dimensions, which helps with faster training, easier visualization, and noise reduction.

How PCA works conceptually.
PCA rotates the coordinate system. Instead of measuring data along old axes, it measures along new axes that capture maximum spread in the data. These axes are mutually orthogonal.
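
The rotation idea can be checked with plain NumPy. This is a minimal sketch on synthetic two-column data (the numbers are made up); it finds the new axes as eigenvectors of the covariance matrix, which is one standard way to compute PCA and not necessarily how any particular library implements it.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic 2-D data with correlated columns (illustrative values only).
x = rng.normal(size=500)
X = np.column_stack([x, 0.6 * x + 0.3 * rng.normal(size=500)])

Xc = X - X.mean(axis=0)                 # center the data
cov = np.cov(Xc, rowvar=False)          # 2x2 covariance matrix
eigvals, eigvecs = np.linalg.eigh(cov)  # eigh returns eigenvalues in ascending order

# Reorder so the axis with the largest spread comes first.
order = np.argsort(eigvals)[::-1]
eigvals = eigvals[order]
axes = eigvecs[:, order]                # columns are the new axes

rotated = Xc @ axes                     # coordinates measured along the new axes

print("Axes are orthonormal:", np.allclose(axes.T @ axes, np.eye(2)))
print("Variance along new axes:", rotated.var(axis=0, ddof=1))
print("Eigenvalues:", eigvals)
```

The variance along each new axis equals the corresponding eigenvalue, and the rotated columns are uncorrelated with each other.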

Centering and scaling.
PCA is sensitive to feature scale. scikit-learn's `PCA` centers each feature automatically but does not rescale it, so variables should usually be standardized before applying PCA, especially when units differ (for example age, income, and percentage).
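
A small experiment shows why scale matters. The sketch below uses made-up, independent columns named after the example above (`age`, `income`, `percentage`); the distributions are assumptions chosen only to put the columns on very different scales.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n = 300

# Synthetic, independent columns on very different scales (made-up values).
age = rng.normal(40, 10, n)
income = rng.normal(50_000, 15_000, n)
percentage = rng.uniform(0, 100, n)
X = np.column_stack([age, income, percentage])

# PCA on raw values: the large-scale column dominates the first component.
raw_ratio = PCA().fit(X).explained_variance_ratio_

# PCA after standardization: each column contributes on equal footing.
X_scaled = StandardScaler().fit_transform(X)
scaled_ratio = PCA().fit(X_scaled).explained_variance_ratio_

print("Without scaling:", raw_ratio.round(4))
print("With scaling:   ", scaled_ratio.round(4))
```

Without scaling, nearly all the variance lands on the first component simply because income has the largest numbers, not because it carries the most structure.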

Core idea.
PCA does not select existing columns. It creates new columns (components) that summarize the data with minimal information loss.

Explained variance ratio.
Each component has an explained variance ratio: the fraction of the total variance in the data that it captures. A cumulative ratio such as 90% or 95% is often used to decide how many components to keep.
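
One way to apply the cumulative rule is sketched below. It assumes scikit-learn's bundled breast cancer dataset and a 95% threshold; the names `threshold` and `n_keep` are illustrative.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

X_scaled = StandardScaler().fit_transform(load_breast_cancer().data)

# Keep all components so the full variance spectrum can be inspected.
pca = PCA().fit(X_scaled)
cumulative = np.cumsum(pca.explained_variance_ratio_)

# Smallest number of components whose cumulative ratio reaches the threshold.
threshold = 0.95
n_keep = int(np.searchsorted(cumulative, threshold) + 1)

print("Cumulative ratio:", cumulative.round(3))
print("Components needed for 95%:", n_keep)
```

Passing a float such as `PCA(n_components=0.95)` asks scikit-learn to make a similar choice automatically, as the starter code later on this page does.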

PCA and prediction models.
PCA itself is not a prediction algorithm. It is a preprocessing step. After transforming data, the reduced components can be passed to models like logistic regression, SVM, or clustering algorithms.

Interpretability trade-off.
Components are combinations of many variables, so interpretation can be harder than using original features. PCA improves compression and numerical behavior but may reduce direct business explainability.

When to avoid PCA.
If features are already few, or direct interpretability is mandatory, PCA may not add value. Also, PCA is linear; if structure is highly non-linear, other techniques may be better.

Important caution.
Fit PCA only on training data. Then apply the same fitted transform to validation and test data to avoid data leakage.

A simple mental picture.
Think of PCA as turning the camera angle to see the direction where data spreads most. Keep the strongest directions and drop weaker ones.

PCA is a dimensionality reduction method that builds new orthogonal components to preserve maximum variance with fewer features.

Examples for Understanding.

Copy-paste starter code.
This quick example standardizes built-in data, applies PCA, prints explained variance, and trains a classifier on reduced features.

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

data = load_breast_cancer()
X = data.data
y = data.target

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

pca = PCA(n_components=0.95)  # keep enough components to explain 95% of the variance
X_train_pca = pca.fit_transform(X_train_scaled)
X_test_pca = pca.transform(X_test_scaled)

model = LogisticRegression(max_iter=5000)
model.fit(X_train_pca, y_train)

predictions = model.predict(X_test_pca)
print("Original feature count:", X_train.shape[1])
print("Reduced feature count:", X_train_pca.shape[1])
print("Explained variance ratio:", pca.explained_variance_ratio_)
print("Accuracy:", accuracy_score(y_test, predictions))

Try changing the same code.
Change `n_components=0.95` to `0.90` or `0.99` and compare reduced dimensions and accuracy. You can also use `n_components=2` for visualization experiments.

1. Face image compression.
Pixel columns are highly correlated. PCA can reduce thousands of pixel features to a smaller set while still preserving useful visual structure.
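
A rough sketch of the compression idea follows. It swaps in scikit-learn's bundled handwritten digits images (8x8 pixels) in place of face images so that nothing needs to be downloaded; the choice of 16 components is arbitrary.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

# Stand-in for face images: 1797 digit images with 64 pixel features each.
X = load_digits().data

pca = PCA(n_components=16)  # 16 is an arbitrary choice for this illustration
X_compressed = pca.fit_transform(X)               # 64 -> 16 numbers per image
X_restored = pca.inverse_transform(X_compressed)  # approximate 64-pixel images

mse = np.mean((X - X_restored) ** 2)
print("Compressed shape:", X_compressed.shape)
print("Variance retained:", pca.explained_variance_ratio_.sum().round(3))
print("Reconstruction MSE:", mse.round(3))
```

Each image is stored as 16 numbers instead of 64, and `inverse_transform` maps it back to an approximation of the original pixels.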

2. Customer analytics.
Behavioral metrics may overlap. PCA can compress them before clustering to reduce noise and improve group separation.

3. Credit risk modeling.
Financial indicators often move together. PCA can reduce multicollinearity before applying downstream classification models.

4. Sensor data.
IoT systems collect many related sensor signals. PCA can retain the strongest operational patterns with fewer variables for faster monitoring pipelines.

5. Two-dimensional visualization.
With many features, plotting is hard. PCA with two components allows scatter plotting of high-dimensional data for quick visual checks.

6. Noise filtering.
Lower-variance components may capture noise. Dropping some of them can improve robustness in downstream learning tasks.
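
The effect can be seen on synthetic data where the clean signal is known. In this sketch the data, the noise level, and the choice of two components are all assumptions made for illustration; on real data the clean signal is not available for comparison.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# Synthetic data: a 2-dimensional signal spread over 20 features, plus noise.
latent = rng.normal(size=(500, 2))
mixing = rng.normal(size=(2, 20))
clean = latent @ mixing
noisy = clean + 0.5 * rng.normal(size=clean.shape)

# Keep only the two strongest components, then map back to the original space.
pca = PCA(n_components=2).fit(noisy)
denoised = pca.inverse_transform(pca.transform(noisy))

err_noisy = np.mean((noisy - clean) ** 2)
err_denoised = np.mean((denoised - clean) ** 2)
print("Error before filtering:", err_noisy.round(4))
print("Error after filtering: ", err_denoised.round(4))
```

Dropping the low-variance components discards most of the noise here because the true signal lives in only two directions.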

7. Pipeline discipline.
Standardization and PCA should be in one consistent training pipeline so transformations remain identical during inference.
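
A minimal sketch of this discipline using scikit-learn's `make_pipeline`, with the same dataset and settings as the starter code. The pipeline fits the scaler and PCA on the training data only and reuses the fitted steps at prediction time.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)

# Scaler, PCA, and classifier are fitted together on the training data only.
pipeline = make_pipeline(
    StandardScaler(),
    PCA(n_components=0.95),
    LogisticRegression(max_iter=5000),
)
pipeline.fit(X_train, y_train)

# score() applies the already-fitted scaler and PCA to the test data.
print("Components kept:", pipeline.named_steps["pca"].n_components_)
print("Test accuracy:", pipeline.score(X_test, y_test))
```

Because the whole chain is one object, it can also be passed to cross-validation helpers so that each fold refits the scaler and PCA on its own training portion.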

8. Business trade-off.
Fewer components improve speed and storage, but too much reduction may hurt model performance. The right balance depends on your target metric and interpretability needs.