Principal Component Analysis

Reducing many columns into fewer useful directions

Theory, understanding, uses, and advantages

Machine Learning View

Principal Component Analysis, commonly called PCA, is a technique used to simplify data.

In real-world datasets, we often have many columns. For example, a customer dataset may contain age, salary, spending score, number of visits, online activity, purchase frequency, complaints, loyalty points, and many more details. Some of these columns may carry similar information. Some may be strongly related to each other. Some may add very little extra value.

PCA helps us reduce many columns into fewer new columns while trying to keep most of the important information.

These new columns are called principal components.

A simple mental picture is this:

Imagine you are looking at the shadow of a large object. If the light is placed correctly, the shadow still shows the main shape of the object. PCA does something similar. It tries to project the data into fewer dimensions while keeping the main pattern visible.

Why PCA is needed

As the number of columns increases, data becomes harder to understand, harder to visualize, and sometimes harder for machine learning models to handle.

This problem is often called the curse of dimensionality.

In simple words, when there are too many features, the data space becomes very large and the available records are spread thinly across it. The model may struggle to find meaningful patterns. It may also become slower, more complex, and sometimes less reliable.

PCA is useful because it can reduce the number of features while keeping the strongest patterns in the data.

For example, if a dataset has 30 numeric columns, PCA may show that most of the useful variation can be represented using only 2, 3, 5, or 10 principal components.

This does not mean the original columns were useless. It means PCA has combined them into new directions that summarize the data more efficiently.

What PCA actually does

PCA looks for the directions where the data varies the most.

The first principal component captures the largest amount of variation in the data. The second principal component captures the next largest amount of variation, in a direction perpendicular to (and therefore uncorrelated with) the first. The third captures the next, perpendicular to both, and so on.

A principal component is not usually one original column. It is a combination of original columns.

For example, suppose we have three columns:

height
weight
body mass index

These columns may be related. PCA may combine them into one principal component that represents the general “body size” pattern.

This new component may not have a simple business name like “height” or “weight,” but mathematically it may carry much of the same information.
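The idea can be sketched with plain NumPy. This is a minimal illustration, assuming synthetic height and weight values invented here (they are not from any real dataset). It finds the principal directions as eigenvectors of the covariance matrix of the standardized data, which is one standard way to obtain them; library implementations such as scikit-learn typically use a singular value decomposition instead.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: weight rises with height, plus some noise
height = rng.normal(170, 10, size=500)  # cm
weight = 70 + 0.9 * (height - 170) + rng.normal(0, 5, size=500)  # kg
X = np.column_stack([height, weight])

# Standardize each column (mean 0, standard deviation 1)
X_std = (X - X.mean(axis=0)) / X.std(axis=0)

# Eigen-decomposition of the covariance matrix
cov = np.cov(X_std, rowvar=False)
eigenvalues, eigenvectors = np.linalg.eigh(cov)

# eigh returns eigenvalues in ascending order; reverse so PC1 comes first
order = np.argsort(eigenvalues)[::-1]
eigenvalues = eigenvalues[order]
eigenvectors = eigenvectors[:, order]

pc1 = eigenvectors[:, 0]
explained_ratio = eigenvalues / eigenvalues.sum()

print("PC1 weights for (height, weight):", pc1)
print("Share of variance per component:", explained_ratio)
```

Both original columns receive a non-zero weight in PC1, which is why a component is described as a combination of columns rather than a single column.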

A small everyday example

Assume we are studying students.

We collect marks in:

Mathematics
Physics
Chemistry
Computer Science
English

The science-related subjects may move together. A student strong in Mathematics may often be strong in Physics and Computer Science also. PCA may combine these related marks into a new component that represents “technical academic strength.”

Another component may represent language ability or general variation not captured by the first component.

Instead of analysing five marks separately, PCA may help us understand the main learning patterns using fewer components.

PCA and dimensionality reduction

The main purpose of PCA is dimensionality reduction.

Dimensionality reduction means reducing the number of input columns while keeping as much useful information as possible.

If we reduce 20 columns to 3 principal components, the dataset becomes easier to visualize and sometimes easier to process.

This is especially useful before applying other machine learning algorithms.

For example, PCA is often used before, or as part of:

clustering
classification
visualization
anomaly detection
noise reduction
exploratory data analysis

PCA itself is not a prediction model. It does not predict a target like sales, churn, disease, or approval status by itself.

Instead, PCA transforms the input data into a simpler form.

PCA is unsupervised

PCA is an unsupervised learning technique.

This means PCA does not need a target column.

In supervised learning, we usually have input features and a target output. For example, we may use customer details to predict whether the customer will leave.

But PCA does not ask what we are trying to predict. It only looks at the input features and studies their structure.

It asks:

“Where is the strongest variation in this data?”

This is both a strength and a weakness.

It is a strength because PCA can be used even when we do not have labelled data.

It is a weakness because the direction with the highest variation may not always be the direction most useful for business or prediction.

Why scaling is important

PCA is strongly affected by the scale of the data.

Suppose one column is salary and another column is age.

Salary may range from 20,000 to 200,000. Age may range from 18 to 65. Since salary has much larger numbers, PCA may give it more importance unless we scale the data.

This is why we usually apply standard scaling before PCA.

Standard scaling changes the columns so that they are placed on a comparable scale.

A common formula is:

scaled value = (value - mean) / standard deviation

After scaling, PCA can focus more fairly on patterns instead of being dominated by columns with large numeric ranges.
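The effect of scale can be seen in a small experiment. This is a sketch under stated assumptions: the salary and age values are randomly generated here using the ranges mentioned above, and the two columns are deliberately made independent so that neither one is truly more important.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)

# Hypothetical, independent columns on very different scales
salary = rng.uniform(20_000, 200_000, size=1_000)
age = rng.uniform(18, 65, size=1_000)
X = np.column_stack([salary, age])

# PCA on raw values: salary's large numeric range dominates PC1
pca_raw = PCA(n_components=2).fit(X)
raw_ratio = pca_raw.explained_variance_ratio_

# PCA after standard scaling: both columns start on equal footing
X_scaled = StandardScaler().fit_transform(X)
pca_scaled = PCA(n_components=2).fit(X_scaled)
scaled_ratio = pca_scaled.explained_variance_ratio_

print("Unscaled PC1 share of variance:", raw_ratio[0])
print("Unscaled PC1 weights (salary, age):", pca_raw.components_[0])
print("Scaled PC1 share of variance:", scaled_ratio[0])
```

Without scaling, PC1 is essentially the salary column. After scaling, the variance is split roughly evenly, which is what we would expect from two unrelated columns.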

Explained variance

One important PCA concept is explained variance.

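Explained variance tells us what share of the total variance in the data each principal component captures. It is often used as a rough measure of how much information a component carries.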

For example:

PC1 explains 45% of the variance
PC2 explains 25% of the variance
PC3 explains 15% of the variance

Together, the first three components explain:

45% + 25% + 15% = 85%

This means that instead of using all original columns, we may use only the first three components and still keep 85% of the total variance in the data.

There is no universal rule that says we must keep exactly 90%, 95%, or 99%. It depends on the purpose.

For visualization, even two components may be enough.

For modelling, we may keep enough components to retain 90% or 95% of the variance.
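The arithmetic above is a cumulative sum, and the same idea can be used to pick the number of components for a chosen threshold. In this sketch the first three ratios come from the example, while the last two (10% and 5%) are assumed here only so that the ratios add up to 100%. The helper name components_needed is hypothetical.

```python
import numpy as np

# First three values come from the example above; the last two are
# assumed here only so that the ratios add up to 100%
explained_variance_ratio = np.array([0.45, 0.25, 0.15, 0.10, 0.05])

cumulative = np.cumsum(explained_variance_ratio)


def components_needed(cumulative_ratio, threshold):
    """Smallest number of leading components whose cumulative share reaches threshold."""
    # Small tolerance guards against floating-point rounding in the cumulative sum
    return int(np.searchsorted(cumulative_ratio, threshold - 1e-12) + 1)


print("Cumulative share:", cumulative)
print("Components for 85%:", components_needed(cumulative, 0.85))
print("Components for 95%:", components_needed(cumulative, 0.95))
```

With a fitted scikit-learn PCA object, the same function could be applied to the cumulative sum of its explained_variance_ratio_ attribute.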

PCA for visualization

One of the most useful applications of PCA is converting high-dimensional data into two dimensions.

If we have 10, 20, or 100 columns, we cannot easily draw them on a simple chart.

But PCA can reduce them to two components:

Principal Component 1
Principal Component 2

Then we can plot the data points on a 2D chart.

This helps us see whether groups are separated, mixed, clustered, or unusual.

For example, in a customer dataset, PCA may help show whether high-value customers are naturally separated from low-value customers.

In a medical dataset, PCA may show whether healthy and unhealthy records form different groups.

In a manufacturing dataset, PCA may reveal whether defective items are separated from normal items.

Advantages of PCA

PCA has several important advantages.

It reduces complexity. A dataset with many columns can become easier to understand after PCA.

It helps visualization. PCA can convert many features into two or three components so we can plot them.

It can reduce noise. Less important variation may be removed when we keep only the strongest components.

It may improve model speed. Fewer input columns can make some machine learning models run faster.

It can reduce multicollinearity. When original columns are strongly related to each other, PCA can convert them into components that are uncorrelated with each other.

It helps exploratory analysis. PCA can reveal the main structure of the data before deeper modelling.

Important caution

PCA is powerful, but it should not be used blindly.

The principal components are mathematical combinations of the original columns. They may not be easy to explain in business language.

For example, a model using original columns like salary, age, and experience is easier to explain. A model using PC1, PC2, and PC3 may perform well, but explaining what PC1 means may be difficult.

So PCA is useful when simplification is more important than direct interpretation.

But when explanation is very important, PCA must be used carefully.

A simple mental picture

Think of PCA as a smart camera angle.

If you take a photo of a long object from the wrong side, it may look small. If you take the photo from the best angle, you capture its full length and shape.

PCA finds the best angle from which the data shows its maximum spread.

The first angle captures the strongest pattern. The second angle captures the next strongest pattern. Together, these angles help us understand the data with fewer dimensions.

Ready-to-run code, examples, unbiased analysis

Copy-paste starter code

This example uses the built-in breast cancer dataset from scikit-learn. The dataset has 569 records and 30 numeric features. We will reduce them using PCA and check how much variance is retained.

from sklearn.datasets import load_breast_cancer
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
import pandas as pd

# Load sample data
data = load_breast_cancer()
X = data.data
feature_names = data.feature_names

# Convert to DataFrame for easier understanding
df = pd.DataFrame(X, columns=feature_names)
print("Original shape:")
print(df.shape)

# Scale the data before PCA
scaler = StandardScaler()
X_scaled = scaler.fit_transform(df)

# Apply PCA
pca = PCA()
X_pca = pca.fit_transform(X_scaled)

# Explained variance
explained_variance = pca.explained_variance_ratio_

# Show explained variance of first 10 components
for i, var in enumerate(explained_variance[:10], start=1):
    print(f"PC{i}: {var:.4f}")

print("\nTotal variance explained by first 2 components:")
print(explained_variance[:2].sum())

print("\nTotal variance explained by first 5 components:")
print(explained_variance[:5].sum())

What this code does

The dataset is first loaded into Python. Then the values are scaled using StandardScaler.

After scaling, PCA is applied. The code prints how much variance is explained by each principal component. The output helps us decide how many components may be useful. If the first few components explain a large percentage of the variance, PCA has successfully summarized the data.

PCA with only two components

The following code reduces the dataset into only two principal components.

This is useful for plotting and visualization.

from sklearn.datasets import load_breast_cancer
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
import pandas as pd

# Load data
data = load_breast_cancer()
X = data.data
y = data.target

# Scale data
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Reduce to 2 components
pca_2 = PCA(n_components=2)
X_pca_2 = pca_2.fit_transform(X_scaled)

# Create DataFrame
pca_df = pd.DataFrame(X_pca_2, columns=["PC1", "PC2"])
pca_df["target"] = y

print(pca_df.head())
print("\nExplained variance by 2 components:")
print(pca_2.explained_variance_ratio_)
print("\nTotal explained variance:")
print(pca_2.explained_variance_ratio_.sum())

Plot PCA output

import matplotlib.pyplot as plt
from sklearn.datasets import load_breast_cancer
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
import pandas as pd

# Load data
data = load_breast_cancer()
X = data.data
y = data.target

# Scale data
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# PCA with 2 components
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_scaled)

# DataFrame
pca_df = pd.DataFrame(X_pca, columns=["PC1", "PC2"])
pca_df["target"] = y

# Plot
plt.figure(figsize=(8, 6))
plt.scatter(pca_df["PC1"], pca_df["PC2"], c=pca_df["target"])
plt.xlabel("Principal Component 1")
plt.ylabel("Principal Component 2")
plt.title("PCA: Data reduced to 2 components")
plt.colorbar(label="Target (0 = malignant, 1 = benign)")
plt.show()

How to read the chart

Each point represents one record. The x-axis is the first principal component. The y-axis is the second principal component.

If the points form visible groups, PCA has helped reveal structure in the data.

If the groups overlap heavily, PCA may still have captured variance, but the first two components may not be enough to clearly separate the classes.

This is important because PCA does not know the target labels while creating components. It only looks at feature variation.

Choosing number of components

Instead of guessing the number of components, we can ask PCA to keep enough components to explain a required amount of variance.

For example, this code keeps enough components to explain 95% of the variance.

from sklearn.datasets import load_breast_cancer
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

# Load data
data = load_breast_cancer()
X = data.data

# Scale data
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Keep 95% variance
pca = PCA(n_components=0.95)
X_pca = pca.fit_transform(X_scaled)

print("Original number of features:")
print(X.shape[1])
print("\nReduced number of components:")
print(X_pca.shape[1])
print("\nTotal explained variance:")
print(pca.explained_variance_ratio_.sum())

PCA before Logistic Regression

PCA is often used before another machine learning model.

Here is a simple example using PCA before Logistic Regression.

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.metrics import accuracy_score, confusion_matrix

# Load data
data = load_breast_cancer()
X = data.data
y = data.target

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)

# Pipeline: scaling + PCA + model
model = Pipeline([
    ("scaler", StandardScaler()),
    ("pca", PCA(n_components=0.95)),
    ("logistic", LogisticRegression(max_iter=5000))
])

# Train
model.fit(X_train, y_train)

# Predict
predictions = model.predict(X_test)
print("Accuracy:")
print(accuracy_score(y_test, predictions))
print("\nConfusion matrix:")
print(confusion_matrix(y_test, predictions))

# Check how many PCA components were used
pca_step = model.named_steps["pca"]
print("\nNumber of PCA components used:")
print(pca_step.n_components_)

Why Pipeline is used

The pipeline keeps the steps together.
First, the data is scaled.
Second, PCA is applied.
Third, Logistic Regression is trained.

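This is a safer and cleaner method because the scaler and PCA are fitted on the training data only, and the same fitted transformations are then applied to the test data. Information from the test set does not leak into the preprocessing.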
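Keeping the steps in one pipeline also makes cross-validation straightforward. The following is a minimal sketch, assuming 5 folds (the fold count is a choice made here, not something fixed by the text above). It uses scikit-learn's cross_val_score with the same three pipeline steps.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)

model = Pipeline([
    ("scaler", StandardScaler()),
    ("pca", PCA(n_components=0.95)),
    ("logistic", LogisticRegression(max_iter=5000)),
])

# In each of the 5 folds, the scaler and PCA are refitted on that fold's
# training portion only, then applied to the held-out portion
scores = cross_val_score(model, X, y, cv=5)

print("Accuracy per fold:", scores)
print("Mean accuracy:", scores.mean())
```

If scaling and PCA were applied to the full dataset before splitting, every fold would have already seen statistics from its own held-out records.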

Try changing the same code

Change:

PCA(n_components=0.95)

to:

PCA(n_components=0.90)

or:

PCA(n_components=2)

Then compare the accuracy and confusion matrix. You may notice that reducing too much can hurt performance. You may also notice that keeping too many components may not simplify the data enough.

This is why PCA should be tested, not assumed.
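The comparison can also be run in one loop instead of editing the code by hand. This is a sketch that reuses the same dataset, split, and pipeline steps as the example above; the results dictionary is a name introduced here for illustration.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)

# Maps each n_components setting to (components kept, test accuracy)
results = {}
for n in [0.95, 0.90, 2]:
    model = Pipeline([
        ("scaler", StandardScaler()),
        ("pca", PCA(n_components=n)),
        ("logistic", LogisticRegression(max_iter=5000)),
    ])
    model.fit(X_train, y_train)
    kept = int(model.named_steps["pca"].n_components_)
    accuracy = accuracy_score(y_test, model.predict(X_test))
    results[n] = (kept, accuracy)
    print(f"n_components={n}: kept {kept} components, accuracy {accuracy:.4f}")
```

A single train-test split gives only a rough comparison, so small differences in accuracy between settings should not be over-interpreted.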

PCA: Where it succeeds

1. When many numeric columns are related

PCA works well when many columns are correlated. For example, in finance data, income, spending, credit limit, and loan amount may be related. In medical data, different measurements may move together. In manufacturing data, machine readings may be connected.

PCA can summarize these related features into fewer components.

2. When visualization is needed

PCA is very useful when we want to see high-dimensional data on a simple 2D plot. It may reveal clusters, separation, outliers, or unusual records. This makes PCA useful in the early stages of analysis.

3. When models are slow due to too many features

Some models become slower when there are many columns. PCA can reduce the number of columns before modelling. This may improve speed and reduce unnecessary complexity.

4. When multicollinearity is a problem

Multicollinearity means some input columns are strongly related to each other. This can confuse some models and make interpretation unstable. PCA transforms the original columns into new components that are not correlated with each other. This can help workflows such as linear or logistic regression, where strongly correlated inputs make the fitted coefficients unstable.
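This property can be checked directly. The sketch below assumes three finance-style columns generated here from one shared income signal (the values are synthetic), and compares the largest correlation between columns before and after PCA.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)

# Hypothetical finance-style columns that all follow a shared income signal
income = rng.normal(50_000, 12_000, size=800)
spending = 0.6 * income + rng.normal(0, 4_000, size=800)
credit_limit = 2.0 * income + rng.normal(0, 15_000, size=800)
X = np.column_stack([income, spending, credit_limit])

X_scaled = StandardScaler().fit_transform(X)
scores = PCA().fit_transform(X_scaled)

corr_before = np.corrcoef(X_scaled, rowvar=False)
corr_after = np.corrcoef(scores, rowvar=False)

# Largest absolute off-diagonal correlation, before and after PCA
off_diag = ~np.eye(3, dtype=bool)
print("Max |correlation| between original columns:", np.abs(corr_before[off_diag]).max())
print("Max |correlation| between components:", np.abs(corr_after[off_diag]).max())
```

The components come out uncorrelated on the data they were fitted on. Uncorrelated is a weaker property than statistically independent, so this should not be read as a guarantee of independence.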

5. When noise reduction is useful

Sometimes small variations in data are not useful. By keeping only the main components, PCA may remove weaker noise and preserve stronger patterns. This can help in image processing, sensor data, and exploratory analysis.
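A small experiment can illustrate this. It is a sketch under strong assumptions: the "sensor" data is synthetic, built here from 2 hidden signals plus random noise, and the number of components to keep (2) is known in advance, which is rarely true in practice. Scaling is skipped because all columns are assumed to share the same units.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(7)

# Hypothetical sensor data: 20 readings driven by only 2 hidden signals
hidden = rng.normal(size=(500, 2))
mixing = rng.normal(size=(2, 20))
clean = hidden @ mixing
noisy = clean + rng.normal(scale=0.5, size=clean.shape)

# Keep the 2 strongest components, then map back to the original 20 columns
pca = PCA(n_components=2)
denoised = pca.inverse_transform(pca.fit_transform(noisy))

# Compare both versions against the noise-free signal
error_noisy = np.mean((noisy - clean) ** 2)
error_denoised = np.mean((denoised - clean) ** 2)

print("Mean squared error before PCA:", error_noisy)
print("Mean squared error after PCA: ", error_denoised)
```

The reconstruction is closer to the clean signal because most of the noise lies in the discarded directions. If real signal also lives in those directions, it is discarded along with the noise.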

PCA: Where it fails or becomes risky

1. When the data is not scaled

PCA can fail badly if scaling is ignored. Columns with large numeric values may dominate the result. For example, salary may overpower age only because salary has larger numbers. This does not mean salary is more important. It may simply be larger in scale.

So scaling is usually required before PCA.

2. When business interpretation is important

PCA components are combinations of original columns. A component may contain parts of many features. This can make the result difficult to explain. If a manager asks, “Which exact factor caused this result?”, PCA may not give a simple answer. In such cases, using original columns may be more understandable.

3. When the target depends on low-variance information

PCA keeps directions with high variance. But sometimes the most useful predictive signal may be in a low-variance feature. Since PCA does not look at the target variable, it may remove information that is important for prediction. This is one of the biggest cautions in supervised machine learning.

High variance does not always mean high business importance.
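The risk can be reproduced on synthetic data. In this sketch, all columns and the target are invented here: two strongly correlated columns are unrelated to the target, while a third, unrelated column fully determines it. Because the two correlated columns reinforce each other, they form the highest-variance direction even after scaling.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(3)
n = 1_000

# Two correlated columns that carry no information about the target
shared = rng.normal(size=n)
irrelevant_a = shared
irrelevant_b = shared + 0.2 * rng.normal(size=n)

# One independent column that fully determines the target
signal = rng.normal(size=n)
y = (signal > 0).astype(int)

X = np.column_stack([irrelevant_a, irrelevant_b, signal])
X_scaled = StandardScaler().fit_transform(X)

# Keep only the single highest-variance direction
pca = PCA(n_components=1)
X_pc1 = pca.fit_transform(X_scaled)

acc_all = LogisticRegression().fit(X_scaled, y).score(X_scaled, y)
acc_pc1 = LogisticRegression().fit(X_pc1, y).score(X_pc1, y)

print("PC1 weights (irrelevant_a, irrelevant_b, signal):", pca.components_[0])
print("Training accuracy with all columns:", acc_all)
print("Training accuracy with PC1 only:", acc_pc1)
```

PC1 is built almost entirely from the two irrelevant columns, so a model that sees only PC1 has very little to work with.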

4. When relationships are highly non-linear

PCA is a linear method. It looks for straight-line directions in the data. If the real pattern is curved or complex, PCA may not capture it well.

In such cases, non-linear methods such as kernel PCA, t-SNE, or UMAP may work better.
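One non-linear alternative available in scikit-learn is KernelPCA. The sketch below compares it with ordinary PCA on two concentric rings generated with make_circles. The dataset, the RBF kernel, and gamma=10 are choices made here for illustration, not a general recommendation; kernel settings usually need tuning.

```python
from sklearn.datasets import make_circles
from sklearn.decomposition import PCA, KernelPCA
from sklearn.linear_model import LogisticRegression

# Two concentric rings: a pattern that no straight-line direction can separate
X, y = make_circles(n_samples=1_000, factor=0.3, noise=0.05, random_state=0)

# Ordinary PCA only rotates the rings; kernel PCA maps them non-linearly
X_pca = PCA(n_components=2).fit_transform(X)
X_kpca = KernelPCA(n_components=2, kernel="rbf", gamma=10).fit_transform(X)

# A linear classifier on each representation shows how well the classes untangle
acc_pca = LogisticRegression().fit(X_pca, y).score(X_pca, y)
acc_kpca = LogisticRegression().fit(X_kpca, y).score(X_kpca, y)

print("Linear classifier accuracy on PCA output:", acc_pca)
print("Linear classifier accuracy on kernel PCA output:", acc_kpca)
```

The labels are used here only to measure separation afterwards. Neither PCA nor kernel PCA sees them while building the components.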

5. When categorical data is not handled properly

PCA is designed mainly for numeric data. If categorical columns like city, gender, department, or product type are used directly, PCA will not understand them properly. Categorical variables must be encoded carefully before PCA.

Even after encoding, interpretation may become difficult.
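One simple encoding option is one-hot encoding, sketched below with pandas get_dummies. The six records, the column names, and the city values are invented here for illustration. Whether PCA on 0/1 indicator columns is appropriate depends on the use case, so this should be read as a starting point rather than a recommended recipe.

```python
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Hypothetical records mixing numeric and categorical columns
df = pd.DataFrame({
    "age": [25, 32, 47, 51, 38, 29],
    "salary": [30_000, 45_000, 80_000, 95_000, 60_000, 40_000],
    "city": ["Pune", "Delhi", "Pune", "Mumbai", "Delhi", "Mumbai"],
})

# One-hot encode the categorical column into 0/1 indicator columns
encoded = pd.get_dummies(df, columns=["city"], dtype=float)

# Scale all columns, then reduce to 2 components
X_scaled = StandardScaler().fit_transform(encoded)
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_scaled)

print("Columns after encoding:", list(encoded.columns))
print("Shape after PCA:", X_pca.shape)
print("PC1 weights per encoded column:", dict(zip(encoded.columns, pca.components_[0])))
```

Each component now mixes numeric columns with city indicators, which shows why interpretation becomes harder after encoding.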

6. When too many components are removed

If we reduce the data too aggressively, we may lose important information. For example, reducing 50 columns to 2 components may be useful for a chart, but it may not be enough for prediction.

A 2D PCA plot is helpful for understanding, but not always enough for final modelling.

A practical summary, in my view

PCA is not magic.

It is a useful mathematical technique for reducing many numeric columns into fewer new components. It succeeds when the dataset has many related numeric features and when the main patterns can be captured through linear directions.

It is especially useful for visualization, simplification, noise reduction, and preprocessing. But PCA can fail when the data is not scaled, when interpretation is more important than compression, when the target signal is hidden in low-variance features, or when the true pattern is non-linear.

A good analyst should not ask only:

Can I apply PCA?

A better question is:

Does PCA preserve the information I need for this specific purpose?

Use PCA as a tool, not as a compulsory step.

Final takeaway

Principal Component Analysis reduces many numeric features into fewer principal components.

It helps simplify data, visualize patterns, reduce noise, and support machine learning workflows. But the new components may be harder to explain than the original columns. PCA is most useful when we need structure and simplification. It is less useful when we need direct business explanation from original variables.

In simple words:

PCA helps us see the main shape of the data without carrying every small detail.

PCA is best understood as a method that reduces many numeric features into fewer new components, while trying to preserve the strongest patterns in the data.