K-Nearest Neighbors With Example

code, graphs, guidelines, and practice tasks


KNN Code and Description

** Please type the code into your code window instead of copying and pasting; this can help you understand the process better. **

Section 1: Imports

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

We import the dataset loader, the train/test split helper, the scaler, the KNN classifier, and the evaluation metrics.

Section 2: Load Data

data = load_breast_cancer()
X = pd.DataFrame(data.data, columns=data.feature_names)
y = data.target

print("Shape:", X.shape)
print("Target classes:", np.unique(y))

The breast cancer dataset is a binary classification problem (malignant vs. benign) with 569 samples and 30 numeric features, which makes it suitable for KNN practice.
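
Before modelling, it can help to check how the two classes are distributed. The following is a minimal sketch using the same load_breast_cancer() loader; the variable names counts and class_counts are illustrative and not part of the original code.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer

data = load_breast_cancer()
y = data.target

# Count samples per class label (0 and 1) and pair them with readable names
counts = np.bincount(y)
class_counts = {str(name): int(n) for name, n in zip(data.target_names, counts)}

print(class_counts)
print("Share of each class:", np.round(counts / counts.sum(), 3))
```

In scikit-learn's copy of this dataset, label 0 is malignant and label 1 is benign, so the classes are moderately imbalanced rather than evenly split.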

Section 3: Split and Scale

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

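KNN predictions depend on distances between points, so features should be standardized to keep large-scale features from dominating. The scaler is fit on the training data only and then applied to the test data, which avoids leaking test-set statistics into training.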
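
To see the effect of scale on distances, here is a small sketch on made-up numbers (X_toy and the helper nearest_to_first are hypothetical, chosen only for illustration). Column 0 is measured in the thousands and column 1 in single units, and the nearest neighbor of the first row changes once both columns are standardized.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data: column 0 is on a scale of thousands, column 1 on a scale of units
X_toy = np.array([
    [1000.0, 1.0],
    [1005.0, 9.0],
    [1050.0, 1.1],
    [1200.0, 9.5],
])

def nearest_to_first(X):
    """Return the index of the row closest to row 0 (Euclidean), excluding row 0 itself."""
    dists = np.linalg.norm(X[1:] - X[0], axis=1)
    return int(np.argmin(dists)) + 1

raw_neighbor = nearest_to_first(X_toy)
scaled_neighbor = nearest_to_first(StandardScaler().fit_transform(X_toy))

print("Nearest to row 0 (raw):   ", raw_neighbor)
print("Nearest to row 0 (scaled):", scaled_neighbor)
```

On the raw values, a gap of 50 units in column 0 outweighs almost the full range of column 1, so row 1 looks closest. After scaling, row 2 becomes the nearest neighbor because it is nearly identical in column 1 and only moderately different in column 0.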

Section 4: Train KNN Model

knn_model = KNeighborsClassifier(n_neighbors=5, metric="minkowski", p=2)
knn_model.fit(X_train_scaled, y_train)

We start with k=5 and Euclidean distance (Minkowski with p=2), which is a common baseline and also matches scikit-learn's default settings.
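
To make the roles of k and p concrete, the sketch below reproduces by hand what a KNN prediction does for one query point: compute Minkowski distances, keep the k smallest, and take a majority vote. The helpers minkowski_distance and knn_predict_one and the toy arrays are hypothetical, written for illustration only; scikit-learn's own implementation is more efficient and may break ties differently.

```python
from collections import Counter
import numpy as np

def minkowski_distance(a, b, p=2):
    """Minkowski distance between two vectors; p=2 is Euclidean, p=1 is Manhattan."""
    return float(np.sum(np.abs(a - b) ** p) ** (1.0 / p))

def knn_predict_one(X_train, y_train, query, k=5, p=2):
    """Predict the label of one query point by majority vote among its k nearest neighbors."""
    dists = np.array([minkowski_distance(row, query, p) for row in X_train])
    nearest_idx = np.argsort(dists)[:k]          # indices of the k smallest distances
    votes = Counter(y_train[nearest_idx].tolist())
    return votes.most_common(1)[0][0]

# Hypothetical toy data: class 0 near the origin, class 1 near (5, 5)
X_toy = np.array([[0.0, 0.0], [0.5, 0.2], [0.1, 0.8],
                  [5.0, 5.0], [5.2, 4.9], [4.8, 5.3]])
y_toy = np.array([0, 0, 0, 1, 1, 1])

origin, corner = np.array([0.0, 0.0]), np.array([3.0, 4.0])
print("Euclidean (p=2):", minkowski_distance(origin, corner, p=2))
print("Manhattan (p=1):", minkowski_distance(origin, corner, p=1))
print("Query near origin ->", knn_predict_one(X_toy, y_toy, np.array([0.3, 0.3]), k=3))
print("Query near (5, 5) ->", knn_predict_one(X_toy, y_toy, np.array([4.9, 5.1]), k=3))
```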

Section 5: Predict and Evaluate

y_pred = knn_model.predict(X_test_scaled)

acc = accuracy_score(y_test, y_pred)
cm = confusion_matrix(y_test, y_pred)

print("Accuracy:", round(acc, 4))
print("Confusion Matrix:\n", cm)
print("Classification Report:\n", classification_report(y_test, y_pred))

Accuracy gives an overall score; the confusion matrix and classification report show class-level behavior.
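
The link between the confusion matrix and the per-class numbers in the report can be checked by hand. This sketch uses made-up labels (y_true and y_hat are hypothetical, not results from the model above) and treats class 1 as the positive class.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Hypothetical labels for illustration (not results from the model above)
y_true = np.array([0, 0, 0, 0, 1, 1, 1, 1, 1, 1])
y_hat = np.array([0, 0, 0, 1, 1, 1, 1, 1, 0, 1])

cm = confusion_matrix(y_true, y_hat)   # rows = true class, columns = predicted class
tn, fp, fn, tp = cm.ravel()            # binary case, with class 1 treated as "positive"

accuracy = (tp + tn) / cm.sum()
precision = tp / (tp + fp)             # of the predicted 1s, how many were truly 1
recall = tp / (tp + fn)                # of the true 1s, how many were found

print(cm)
print(f"accuracy={accuracy:.3f} precision={precision:.3f} recall={recall:.3f}")
```

The hand-computed precision and recall agree with precision_score and recall_score, which is what the classification report prints for class 1.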

Section 6: Quick k Comparison

for k in [1, 3, 5, 7, 11, 15]:
    model = KNeighborsClassifier(n_neighbors=k, metric="minkowski", p=2)
    model.fit(X_train_scaled, y_train)
    pred = model.predict(X_test_scaled)
    print(f"k={k}: accuracy={accuracy_score(y_test, pred):.4f}")

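This shows how accuracy changes with k: a very small k tends to overfit (noisy, jagged boundaries), while a very large k tends to underfit (over-smoothed boundaries). Because these scores come from the test set, treat them as a quick look rather than as a way to select k.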
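
One way to make overfitting and underfitting visible is to print training accuracy next to test accuracy. The sketch below assumes the same split settings as above; the wider k list (up to 151) and the results list are illustrative choices, not part of the original page.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

results = []  # (k, train accuracy, test accuracy)
for k in [1, 5, 15, 51, 151]:
    model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=k))
    model.fit(X_train, y_train)
    results.append((k, model.score(X_train, y_train), model.score(X_test, y_test)))

for k, train_acc, test_acc in results:
    # A large train/test gap hints at overfitting; low scores on both hint at underfitting
    print(f"k={k:>3}: train={train_acc:.4f}  test={test_acc:.4f}")
```

With k=1 each training point is its own nearest neighbor, so training accuracy is typically at or near its maximum and says little about generalization.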

Section 7: Visualizing the Breast Cancer Dataset

# Classification visual using two breast-cancer features
from sklearn.inspection import DecisionBoundaryDisplay

feature_names = list(data.feature_names)

def find_feature_index(name, all_names):
    # Robust match for variations like spaces vs underscores or case changes
    key = name.strip().lower().replace("_", " ")
    normalized = [n.strip().lower().replace("_", " ") for n in all_names]
    if key in normalized:
        return normalized.index(key)
    raise ValueError(f"Feature '{name}' not found. Available: {all_names[:5]} ...")

ix = find_feature_index("mean radius", feature_names)
iy = find_feature_index("mean texture", feature_names)

feat_x = feature_names[ix]
feat_y = feature_names[iy]

X2 = data.data[:, [ix, iy]]
y2 = data.target

X2_train, X2_test, y2_train, y2_test = train_test_split(
    X2, y2, test_size=0.2, random_state=42, stratify=y2
)

scaler2 = StandardScaler()
X2_train_scaled = scaler2.fit_transform(X2_train)
X2_test_scaled = scaler2.transform(X2_test)

knn2 = KNeighborsClassifier(n_neighbors=9, metric="minkowski", p=2)
knn2.fit(X2_train_scaled, y2_train)

fig, ax = plt.subplots(figsize=(7, 5))
DecisionBoundaryDisplay.from_estimator(
    knn2, X2_train_scaled, response_method="predict", cmap="coolwarm", alpha=0.18, ax=ax
)
ax.scatter(
    X2_train_scaled[:, 0],
    X2_train_scaled[:, 1],
    c=y2_train,
    cmap="coolwarm",
    s=24,
    edgecolor="k",
    alpha=0.85
)
ax.set_xlabel(feat_x + " (scaled)")
ax.set_ylabel(feat_y + " (scaled)")
ax.set_title("Breast Cancer: KNN Classification Regions")
plt.show()

This view directly shows KNN class regions and how benign/malignant points distribute in feature space.

Guidelines for Different Datasets

# 1) Replace dataset loader and keep common pattern
# Example:
# data = load_iris() or load_wine() or your custom DataFrame
# X = data.data (or df[feature_cols].values)
# y = data.target (or df[target_col].values)

# 2) Re-check target type before training
# - Classification only: y should represent class labels
# - For text labels, encode if needed

# 3) Update 2D visualization feature selection
# - Choose two meaningful feature indices for boundary plots
# - Do not assume names like "mean radius" exist

# 4) Always scale for KNN distance calculations
# scaler = StandardScaler()
# X_train_scaled = scaler.fit_transform(X_train)
# X_test_scaled = scaler.transform(X_test)

# 5) Re-tune k for every new dataset
# - Run CV over a k range (for example odd k from 1 to 31)
# - Do not reuse old best_k blindly

# 6) Review class balance and split strategy
# - Use stratify=y in train_test_split
# - If classes are imbalanced, focus on per-class metrics, not only accuracy

# 7) Validate metric choices by problem type
# - Binary and multiclass both work in KNN
# - For multiclass summaries, inspect macro/weighted averages

# 8) Check data quality before fit
# - Handle missing values
# - Remove leakage columns (IDs, post-outcome signals)
# - Keep train/test preprocessing separation strict

# 9) Keep reproducibility fixed
# - Use random_state in split and CV where applicable

Treat this page as a reusable KNN template: change dataset and feature selection first, then re-run scaling, k tuning, and error analysis before finalizing results.
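
As one possible way to apply the guidelines above, the sketch below adapts the template to a DataFrame with text labels. The iris data stands in for "your custom DataFrame", and the column name species and the variables feature_cols and target_col are hypothetical. scikit-learn classifiers also accept string labels directly, so the LabelEncoder step is only needed when integer labels are required.

```python
from sklearn.datasets import load_iris
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import LabelEncoder, StandardScaler

# Stand-in for "your custom DataFrame": iris features plus a text label column
iris = load_iris(as_frame=True)
df = iris.frame.copy()
df["species"] = df["target"].map(lambda i: iris.target_names[i])  # text labels
feature_cols = list(iris.feature_names)
target_col = "species"

X = df[feature_cols].values
encoder = LabelEncoder()
y = encoder.fit_transform(df[target_col])  # text labels -> integers 0..n_classes-1

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# The pipeline fits the scaler on training data only, then applies it to test data
model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

print(classification_report(y_test, y_pred, target_names=encoder.classes_))
```

The value k=5 here is only a starting point; per guideline 5, it should be re-tuned for each new dataset.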

Graphs and Analysis

Graph 1: Why Scaling Matters for KNN (Breast Cancer)

import numpy as np
import matplotlib.pyplot as plt
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score
from sklearn.datasets import load_breast_cancer

data = load_breast_cancer()
X = data.data
y = data.target

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Model without scaling
knn_raw = KNeighborsClassifier(n_neighbors=5)
knn_raw.fit(X_train, y_train)
pred_raw = knn_raw.predict(X_test)
acc_raw = accuracy_score(y_test, pred_raw)

# Model with scaling
knn_scaled = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))
knn_scaled.fit(X_train, y_train)
pred_scaled = knn_scaled.predict(X_test)
acc_scaled = accuracy_score(y_test, pred_scaled)

plt.figure(figsize=(6.5, 4.2))
plt.bar(["Without Scaling", "With Scaling"], [acc_raw, acc_scaled], color=["#ef5350", "#66bb6a"])
plt.ylim(0.85, 1.00)
plt.ylabel("Test Accuracy")
plt.title("KNN Performance: Impact of Feature Scaling")
for i, v in enumerate([acc_raw, acc_scaled]):
    plt.text(i, v + 0.003, f"{v:.4f}", ha="center")
plt.show()

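KNN uses distance directly. This graph usually shows a clear improvement after scaling, which is why scaling should not be treated as optional for KNN. Note that the y-axis starts at 0.85, so the bars visually exaggerate the gap; read the printed values for the actual difference.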

Graph 2: Selecting k with Cross-Validation

from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline

k_values = list(range(1, 32, 2))  # odd k values
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
cv_scores = []

for k in k_values:
    # Scale inside the pipeline so each fold fits its scaler on that fold's training part only
    model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=k))
    scores = cross_val_score(model, X, y, cv=cv, scoring="accuracy")
    cv_scores.append(scores.mean())

best_idx = int(np.argmax(cv_scores))
best_k = k_values[best_idx]
best_score = cv_scores[best_idx]

plt.figure(figsize=(8, 4.5))
plt.plot(k_values, cv_scores, marker="o")
plt.axvline(best_k, color="green", linestyle="--", label=f"Best k = {best_k}")
plt.xlabel("k (number of neighbors)")
plt.ylabel("Mean CV Accuracy")
plt.title("KNN Hyperparameter Tuning with 5-Fold CV")
plt.legend()
plt.grid(alpha=0.3)
plt.show()

Instead of guessing `k`, this gives a repeatable selection method. Choose a `k` where CV accuracy peaks and stays roughly flat for neighboring values, rather than one that sits on an isolated spike.
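
The same search can also be expressed with GridSearchCV, which runs the loop and records the best setting. This is a sketch of an alternative, not the method used for the plot above; it assumes the same odd k range and 5-fold stratified CV, and the names pipe, param_grid, and search are illustrative.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)

pipe = make_pipeline(StandardScaler(), KNeighborsClassifier())
# make_pipeline names each step after its lowercased class name
param_grid = {"kneighborsclassifier__n_neighbors": list(range(1, 32, 2))}

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
search = GridSearchCV(pipe, param_grid, cv=cv, scoring="accuracy")
search.fit(X, y)

print("Best k:", search.best_params_["kneighborsclassifier__n_neighbors"])
print("Best mean CV accuracy:", round(search.best_score_, 4))
```

Because the scaler sits inside the pipeline, it is refit on the training part of every fold, so no fold's validation data influences the scaling.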

Graph 3: Confusion Matrix for Best k

from sklearn.metrics import ConfusionMatrixDisplay, classification_report

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

k_for_eval = best_k if "best_k" in globals() else 9
knn_best = KNeighborsClassifier(n_neighbors=k_for_eval)
knn_best.fit(X_train_scaled, y_train)
y_pred = knn_best.predict(X_test_scaled)

ConfusionMatrixDisplay.from_predictions(
    y_test, y_pred, display_labels=data.target_names, cmap="Blues", values_format="d"
)
plt.title(f"KNN Confusion Matrix (k={k_for_eval})")
plt.show()

print(classification_report(y_test, y_pred, target_names=data.target_names))

This plot breaks accuracy down into specific errors: which class is being misclassified, and where a different decision threshold or better features might help.
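
As a sketch of what a threshold change could look like, the example below uses predict_proba, which for uniform weights returns the share of neighbors in each class. The 0.7 cutoff and the variable names are assumptions chosen for illustration, not recommended values; in this dataset label 0 is malignant and label 1 is benign.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

data = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, test_size=0.2, random_state=42, stratify=data.target
)

model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=9))
model.fit(X_train, y_train)

# Column 1 of predict_proba is the share of the 9 neighbors labelled 1 (benign)
proba_benign = model.predict_proba(X_test)[:, 1]

default_pred = model.predict(X_test)
# Assumed, illustrative threshold: call a case benign only if at least 70% of neighbors agree
strict_pred = (proba_benign >= 0.7).astype(int)

missed_default = int(np.sum((y_test == 0) & (default_pred == 1)))
missed_strict = int(np.sum((y_test == 0) & (strict_pred == 1)))
print("Malignant cases predicted benign (default):", missed_default)
print("Malignant cases predicted benign (strict): ", missed_strict)
```

A stricter threshold can only move predictions from benign to malignant, so it cannot increase the number of missed malignant cases, but it may raise the number of benign cases flagged as malignant.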

Graph 4: Compare Weights and Distance Metric

import pandas as pd

configs = [
    {"weights": "uniform", "p": 2, "label": "uniform + euclidean"},
    {"weights": "distance", "p": 2, "label": "distance + euclidean"},
    {"weights": "uniform", "p": 1, "label": "uniform + manhattan"},
    {"weights": "distance", "p": 1, "label": "distance + manhattan"},
]

rows = []
k_for_compare = best_k if "best_k" in globals() else 9
for cfg in configs:
    pipe = make_pipeline(
        StandardScaler(),
        KNeighborsClassifier(
            n_neighbors=k_for_compare,
            weights=cfg["weights"],
            metric="minkowski",
            p=cfg["p"]
        )
    )
    score = cross_val_score(pipe, X, y, cv=5, scoring="accuracy").mean()
    rows.append({"setting": cfg["label"], "cv_accuracy": score})

result = pd.DataFrame(rows).sort_values("cv_accuracy", ascending=False)
print(result)

plt.figure(figsize=(8.5, 4.5))
plt.barh(result["setting"], result["cv_accuracy"], color="#42a5f5")
plt.xlabel("Mean CV Accuracy")
plt.title(f"KNN Setting Comparison at k={k_for_compare}")
plt.gca().invert_yaxis()
plt.xlim(result["cv_accuracy"].min() - 0.01, result["cv_accuracy"].max() + 0.01)
plt.show()

After `k` is selected, this compares practical setting choices. Use this to confirm whether distance-weighted voting or Manhattan distance improves reliability.
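
To see how distance-weighted voting can change a prediction, here is a tiny sketch on made-up 1-D data (X_toy, y_toy, and query are hypothetical). With weights="distance", each neighbor's vote is weighted by the inverse of its distance to the query.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical 1-D toy data: one class-0 point close to the query, two class-1 points farther away
X_toy = np.array([[0.0], [2.0], [2.1]])
y_toy = np.array([0, 1, 1])
query = np.array([[0.5]])

uniform = KNeighborsClassifier(n_neighbors=3, weights="uniform").fit(X_toy, y_toy)
weighted = KNeighborsClassifier(n_neighbors=3, weights="distance").fit(X_toy, y_toy)

# Uniform: each neighbor counts once, so class 1 wins 2 votes to 1
# Distance: votes are weighted by 1/distance, so 1/0.5 = 2.0 beats 1/1.5 + 1/1.6 (about 1.29)
uniform_pred = int(uniform.predict(query)[0])
weighted_pred = int(weighted.predict(query)[0])
print("uniform:", uniform_pred, "| distance:", weighted_pred)
```

Uniform voting predicts class 1 here, while distance weighting predicts class 0 because the single close neighbor outweighs the two distant ones.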

Exercises for Practice

Exercise 1: Compare weights="uniform" and weights="distance" on your dataset.

Exercise 2: Plot accuracy for k from 1 to 40 and choose a robust value.

Exercise 3: Try p=1 and p=2 with the same k and compare confusion matrices.

Exercise 4: Use make_classification() to generate synthetic data and visualize decision boundaries.

Exercise 5: Run 5-fold CV and 10-fold CV for the same k-range; compare whether the selected best_k changes.

Exercise 6: Add one intentionally high-scale feature (for example multiply a column by 1000) and measure KNN accuracy before and after scaling.

Exercise 7: Build a full pipeline with make_pipeline(StandardScaler(), KNeighborsClassifier(...)) and evaluate it with cross-validation in one step.

Exercise 8: Compare KNN results using only top 2 features vs all features and report accuracy + confusion matrix differences.

Exercise 9: On an imbalanced dataset, report precision/recall/F1 per class and explain why accuracy alone is misleading.

A practical KNN workflow is: split data, scale features, pick k thoughtfully, validate with confusion matrix, and tune k, weights, and distance metric for stable performance.