KNN Code and Description
** Please type the code into your code window instead of copying and pasting.
Typing it out helps you understand the process better. **
Section 1: Imports
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
We import dataset tools, split helpers, scaling, KNN classifier, and evaluation metrics.
Section 2: Load Data
data = load_breast_cancer()
X = pd.DataFrame(data.data, columns=data.feature_names)
y = data.target
print("Shape:", X.shape)
print("Target classes:", np.unique(y))
The breast cancer dataset is a binary classification dataset (569 samples, 30 numeric features; target 0 = malignant, 1 = benign), which makes it suitable for KNN practice.
Section 3: Split and Scale
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=42, stratify=y
)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
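KNN classifies by distance, so features are standardized to prevent large-scale features from dominating. The scaler is fit on the training split only and then applied to the test split, so no test-set statistics leak into training.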
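To make the scaling point concrete, here is a minimal sketch using made-up numbers (the toy array and the query point are illustrative assumptions, not values from the breast cancer data). It standardizes by hand with the training mean and standard deviation, which is the same idea StandardScaler applies.

```python
import numpy as np

# Hypothetical toy data: column 0 has a small range, column 1 a large range.
train = np.array([
    [0.1, 100.0],
    [0.2, 900.0],
    [0.9, 150.0],
    [0.8, 800.0],
    [0.5, 500.0],
])
query = np.array([0.15, 400.0])

def nearest_index(points, q):
    """Index of the row in `points` closest to `q` by Euclidean distance."""
    return int(np.argmin(np.linalg.norm(points - q, axis=1)))

# Standardize with statistics computed from the training rows only.
mean = train.mean(axis=0)
std = train.std(axis=0)
train_scaled = (train - mean) / std
query_scaled = (query - mean) / std

raw_nearest = nearest_index(train, query)
scaled_nearest = nearest_index(train_scaled, query_scaled)
print("Nearest neighbor on raw data:   ", raw_nearest)
print("Nearest neighbor on scaled data:", scaled_nearest)
```

On the raw values, the large-range column decides the nearest neighbor almost by itself; after standardizing, both columns contribute and a different row becomes the nearest one.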
Section 4: Train KNN Model
knn_model = KNeighborsClassifier(n_neighbors=5, metric="minkowski", p=2)
knn_model.fit(X_train_scaled, y_train)
We start with k=5 and Euclidean distance (p=2), which is a common baseline.
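As a small illustration of what the p parameter controls, the sketch below computes the Minkowski distance by hand with NumPy. The helper name minkowski_distance and the two points are assumptions made for this example; scikit-learn computes distances internally.

```python
import numpy as np

def minkowski_distance(a, b, p):
    """Minkowski distance: (sum of |a_i - b_i|^p) raised to the power 1/p."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(np.sum(np.abs(a - b) ** p) ** (1.0 / p))

a = [0.0, 0.0]
b = [3.0, 4.0]

d_manhattan = minkowski_distance(a, b, p=1)  # p=1: sum of absolute differences
d_euclidean = minkowski_distance(a, b, p=2)  # p=2: straight-line distance
print("p=1 (Manhattan):", d_manhattan)  # 7.0
print("p=2 (Euclidean):", d_euclidean)  # 5.0
```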
Section 5: Predict and Evaluate
y_pred = knn_model.predict(X_test_scaled)
acc = accuracy_score(y_test, y_pred)
cm = confusion_matrix(y_test, y_pred)
print("Accuracy:", round(acc, 4))
print("Confusion Matrix:\n", cm)
print("Classification Report:\n", classification_report(y_test, y_pred))
Accuracy gives an overall score; the confusion matrix and classification report show class-level behavior.
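To show how the class-level numbers relate to the confusion matrix, here is a short sketch that derives accuracy, recall, and precision by hand. The counts in cm_example are invented for illustration and are not the output of the model above; the layout follows scikit-learn's convention of rows for true classes and columns for predicted classes.

```python
import numpy as np

# Hypothetical confusion matrix (rows = true class, columns = predicted class).
cm_example = np.array([
    [40, 2],   # true class 0: 40 predicted as 0, 2 predicted as 1
    [1, 71],   # true class 1: 1 predicted as 0, 71 predicted as 1
])

correct = np.diag(cm_example)                  # correctly classified count per class
accuracy = correct.sum() / cm_example.sum()    # overall share of correct predictions
recall = correct / cm_example.sum(axis=1)      # per class: correct / all true members
precision = correct / cm_example.sum(axis=0)   # per class: correct / all predicted members

print("Accuracy:", round(float(accuracy), 4))
print("Recall per class:", np.round(recall, 4))
print("Precision per class:", np.round(precision, 4))
```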
Section 6: Quick k Comparison
for k in [1, 3, 5, 7, 11, 15]:
    model = KNeighborsClassifier(n_neighbors=k, metric="minkowski", p=2)
    model.fit(X_train_scaled, y_train)
    pred = model.predict(X_test_scaled)
    print(f"k={k}: accuracy={accuracy_score(y_test, pred):.4f}")
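This helps you see how accuracy shifts as k changes: a very small k tends to overfit (the model follows individual noisy points), while a very large k tends to underfit (the vote is smoothed over too many neighbors). Because these scores come from the test set, treat them as a quick look only and choose k with cross-validation (see Graph 2).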
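One way to look at overfitting and underfitting more directly is to compare training accuracy with test accuracy for each k. The sketch below is an illustration under assumptions: the k values and the variable names (X_all, X_tr, results, and so on) are chosen for this example, and the exact numbers will depend on the split.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X_all, y_all = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_all, y_all, test_size=0.2, random_state=42, stratify=y_all
)

results = {}  # k -> (training accuracy, test accuracy)
for k in [1, 5, 15, 51, 151]:
    pipe = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=k))
    pipe.fit(X_tr, y_tr)
    results[k] = (pipe.score(X_tr, y_tr), pipe.score(X_te, y_te))
    print(f"k={k:>3}: train={results[k][0]:.4f}  test={results[k][1]:.4f}")
```

A large gap between training and test accuracy at small k points toward overfitting; both scores dropping together at very large k points toward underfitting.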
Section 7: Visualizing the Breast Cancer Dataset
# Classification visual using two breast-cancer features
from sklearn.inspection import DecisionBoundaryDisplay
feature_names = list(data.feature_names)
def find_feature_index(name, all_names):
    # Robust match for variations like spaces vs underscores or case changes
    key = name.strip().lower().replace("_", " ")
    normalized = [n.strip().lower().replace("_", " ") for n in all_names]
    if key in normalized:
        return normalized.index(key)
    raise ValueError(f"Feature '{name}' not found. Available: {all_names[:5]} ...")
ix = find_feature_index("mean radius", feature_names)
iy = find_feature_index("mean texture", feature_names)
feat_x = feature_names[ix]
feat_y = feature_names[iy]
X2 = data.data[:, [ix, iy]]
y2 = data.target
X2_train, X2_test, y2_train, y2_test = train_test_split(
X2, y2, test_size=0.2, random_state=42, stratify=y2
)
scaler2 = StandardScaler()
X2_train_scaled = scaler2.fit_transform(X2_train)
X2_test_scaled = scaler2.transform(X2_test)
knn2 = KNeighborsClassifier(n_neighbors=9, metric="minkowski", p=2)
knn2.fit(X2_train_scaled, y2_train)
fig, ax = plt.subplots(figsize=(7, 5))
DecisionBoundaryDisplay.from_estimator(
knn2, X2_train_scaled, response_method="predict", cmap="coolwarm", alpha=0.18, ax=ax
)
ax.scatter(
X2_train_scaled[:, 0],
X2_train_scaled[:, 1],
c=y2_train,
cmap="coolwarm",
s=24,
edgecolor="k",
alpha=0.85
)
ax.set_xlabel(feat_x + " (scaled)")
ax.set_ylabel(feat_y + " (scaled)")
ax.set_title("Breast Cancer: KNN Classification Regions")
plt.show()
This view directly shows KNN class regions and how benign/malignant points distribute in feature space.
Guidelines for different datasets
# 1) Replace dataset loader and keep common pattern
# Example:
# data = load_iris() or load_wine() or your custom DataFrame
# X = data.data (or df[feature_cols].values)
# y = data.target (or df[target_col].values)
# 2) Re-check target type before training
# - Classification only: y should represent class labels
# - For text labels, encode if needed
# 3) Update 2D visualization feature selection
# - Choose two meaningful feature indices for boundary plots
# - Do not assume names like "mean radius" exist
# 4) Always scale for KNN distance calculations
# scaler = StandardScaler()
# X_train_scaled = scaler.fit_transform(X_train)
# X_test_scaled = scaler.transform(X_test)
# 5) Re-tune k for every new dataset
# - Run CV over a k range (for example odd k from 1 to 31)
# - Do not reuse old best_k blindly
# 6) Review class balance and split strategy
# - Use stratify=y in train_test_split
# - If classes are imbalanced, focus on per-class metrics, not only accuracy
# 7) Validate metric choices by problem type
# - Binary and multiclass both work in KNN
# - For multiclass summaries, inspect macro/weighted averages
# 8) Check data quality before fit
# - Handle missing values
# - Remove leakage columns (IDs, post-outcome signals)
# - Keep train/test preprocessing separation strict
# 9) Keep reproducibility fixed
# - Use random_state in split and CV where applicable
Treat this page as a reusable KNN template: change dataset and feature selection first, then re-run scaling, k tuning, and error analysis before finalizing results.
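As one possible way to apply the checklist to a different dataset, the sketch below swaps in load_wine (a three-class dataset bundled with scikit-learn) and tunes k with GridSearchCV over a pipeline. The step names "scaler" and "knn" and the variable names ending in _wine are choices made for this example, so nothing here overwrites the breast cancer variables used elsewhere on this page.

```python
from sklearn.datasets import load_wine
from sklearn.metrics import classification_report
from sklearn.model_selection import GridSearchCV, StratifiedKFold, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# 1) Swap the loader, keep the pattern
X_wine, y_wine = load_wine(return_X_y=True)

# 6) Stratified split, 9) fixed random_state
Xw_train, Xw_test, yw_train, yw_test = train_test_split(
    X_wine, y_wine, test_size=0.2, random_state=42, stratify=y_wine
)

# 4) Scaling lives inside the pipeline, so it is refit on each CV training fold
pipe_wine = Pipeline([
    ("scaler", StandardScaler()),
    ("knn", KNeighborsClassifier()),
])

# 5) Re-tune k for the new dataset (odd k from 1 to 31)
param_grid = {"knn__n_neighbors": list(range(1, 32, 2))}
cv_wine = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
search = GridSearchCV(pipe_wine, param_grid, cv=cv_wine, scoring="accuracy")
search.fit(Xw_train, yw_train)

best_k_wine = search.best_params_["knn__n_neighbors"]
test_accuracy_wine = search.score(Xw_test, yw_test)
print("Best k:", best_k_wine)
print("Test accuracy:", round(test_accuracy_wine, 4))

# 7) Per-class and macro/weighted summaries for the multiclass case
print(classification_report(yw_test, search.predict(Xw_test)))
```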
Graphs and Analysis
Graph 1: Why Scaling Matters for KNN (Breast Cancer)
import numpy as np
import matplotlib.pyplot as plt
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score
from sklearn.datasets import load_breast_cancer
data = load_breast_cancer()
X = data.data
y = data.target
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=42, stratify=y
)
# Model without scaling
knn_raw = KNeighborsClassifier(n_neighbors=5)
knn_raw.fit(X_train, y_train)
pred_raw = knn_raw.predict(X_test)
acc_raw = accuracy_score(y_test, pred_raw)
# Model with scaling
knn_scaled = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))
knn_scaled.fit(X_train, y_train)
pred_scaled = knn_scaled.predict(X_test)
acc_scaled = accuracy_score(y_test, pred_scaled)
plt.figure(figsize=(6.5, 4.2))
plt.bar(["Without Scaling", "With Scaling"], [acc_raw, acc_scaled], color=["#ef5350", "#66bb6a"])
plt.ylim(0.85, 1.00)
plt.ylabel("Test Accuracy")
plt.title("KNN Performance: Impact of Feature Scaling")
for i, v in enumerate([acc_raw, acc_scaled]):
    plt.text(i, v + 0.003, f"{v:.4f}", ha="center")
plt.show()
KNN uses distance directly. This graph usually shows clear improvement after scaling and explains why scaling is not optional for KNN.
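To see what "uses distance directly" means in practice, here is a from-scratch sketch of a single KNN prediction with NumPy. It is a simplified illustration, not how scikit-learn implements KNeighborsClassifier; the function name knn_predict and the toy points are assumptions for this example.

```python
from collections import Counter

import numpy as np

def knn_predict(X_train, y_train, x_query, k=3):
    """Predict one label: find the k closest training points and take a majority vote."""
    dists = np.linalg.norm(X_train - x_query, axis=1)  # Euclidean distance to every row
    nearest = np.argsort(dists)[:k]                    # indices of the k smallest distances
    votes = Counter(y_train[nearest].tolist())
    return votes.most_common(1)[0][0]

# Hypothetical toy data: two small clusters, one per class.
X_toy = np.array([
    [0.0, 0.0], [0.1, 0.2], [0.2, 0.1],   # class 0
    [1.0, 1.0], [0.9, 1.1], [1.1, 0.9],   # class 1
])
y_toy = np.array([0, 0, 0, 1, 1, 1])

pred_a = knn_predict(X_toy, y_toy, np.array([0.05, 0.05]))
pred_b = knn_predict(X_toy, y_toy, np.array([1.0, 0.95]))
print(pred_a, pred_b)  # 0 1
```

Since the prediction is driven entirely by the distance computation, any feature with a much larger numeric range will control which points count as "nearest".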
Graph 2: Selecting k with Cross-Validation
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
k_values = list(range(1, 32, 2))  # odd k values
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
cv_scores = []
for k in k_values:
    # Scale inside the pipeline so each fold fits the scaler on its own training part only
    model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=k))
    scores = cross_val_score(model, X, y, cv=cv, scoring="accuracy")
    cv_scores.append(scores.mean())
best_idx = int(np.argmax(cv_scores))
best_k = k_values[best_idx]
best_score = cv_scores[best_idx]
plt.figure(figsize=(8, 4.5))
plt.plot(k_values, cv_scores, marker="o")
plt.axvline(best_k, color="green", linestyle="--", label=f"Best k = {best_k}")
plt.xlabel("k (number of neighbors)")
plt.ylabel("Mean CV Accuracy")
plt.title("KNN Hyperparameter Tuning with 5-Fold CV")
plt.legend()
plt.grid(alpha=0.3)
plt.show()
Instead of guessing `k`, this gives a repeatable selection method. Choose `k` where CV accuracy peaks and remains stable.
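To judge whether a peak is also stable, it can help to look at the spread of the fold scores and not only their mean. The sketch below is one possible heuristic, not the method used above: it lists every k whose mean CV accuracy is within one standard deviation of the best mean. The names k_grid, means, stds, and stable_ks are chosen for this example.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X_all, y_all = load_breast_cancer(return_X_y=True)
k_grid = list(range(1, 32, 2))
folds = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

means, stds = [], []
for k in k_grid:
    pipe = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=k))
    fold_scores = cross_val_score(pipe, X_all, y_all, cv=folds, scoring="accuracy")
    means.append(fold_scores.mean())
    stds.append(fold_scores.std())

means = np.array(means)
stds = np.array(stds)
top_idx = int(np.argmax(means))

# Heuristic (an assumption of this sketch): accept any k whose mean score
# is within one standard deviation of the best mean score.
cutoff = means[top_idx] - stds[top_idx]
stable_ks = [k for k, m in zip(k_grid, means) if m >= cutoff]

print("Best mean accuracy at k =", k_grid[top_idx])
print("k values within one std of the best:", stable_ks)
```

If many neighboring k values fall inside that band, the choice of k is not very sensitive, and picking a value in the middle of the band is a reasonable option.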
Graph 3: Confusion Matrix for Best k
from sklearn.metrics import ConfusionMatrixDisplay, classification_report
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=42, stratify=y
)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
k_for_eval = best_k if "best_k" in globals() else 9
knn_best = KNeighborsClassifier(n_neighbors=k_for_eval)
knn_best.fit(X_train_scaled, y_train)
y_pred = knn_best.predict(X_test_scaled)
ConfusionMatrixDisplay.from_predictions(
y_test, y_pred, display_labels=data.target_names, cmap="Blues", values_format="d"
)
plt.title(f"KNN Confusion Matrix (k={k_for_eval})")
plt.show()
print(classification_report(y_test, y_pred, target_names=data.target_names))
This graph converts accuracy into actionable errors: which class is getting misclassified and where you may need threshold or feature improvements.
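If the confusion matrix shows too many missed malignant cases, one option is to change the decision threshold. The sketch below is an illustration under assumptions: it uses predict_proba (for KNN with uniform weights, the share of neighbors voting for each class) and flags a sample as malignant when that share reaches a chosen threshold. The thresholds 0.5 and 0.3, k=9, and the variable names are picked for this example only.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

bc = load_breast_cancer()
X_tr, X_te, y_tr, y_te = train_test_split(
    bc.data, bc.target, test_size=0.2, random_state=42, stratify=bc.target
)

pipe = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=9))
pipe.fit(X_tr, y_tr)

# Column 0 of predict_proba corresponds to class 0, which is "malignant" here.
proba_malignant = pipe.predict_proba(X_te)[:, 0]

recalls = {}  # threshold -> recall on the malignant class
for threshold in (0.5, 0.3):
    pred = np.where(proba_malignant >= threshold, 0, 1)
    recalls[threshold] = recall_score(y_te, pred, pos_label=0)
    print(f"threshold={threshold}: malignant recall={recalls[threshold]:.4f}")
```

Lowering the threshold can only keep or raise malignant recall, usually at the cost of more benign cases being flagged, so the confusion matrix should be checked again after any change.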
Graph 4: Compare Weights and Distance Metric
import pandas as pd
configs = [
{"weights": "uniform", "p": 2, "label": "uniform + euclidean"},
{"weights": "distance", "p": 2, "label": "distance + euclidean"},
{"weights": "uniform", "p": 1, "label": "uniform + manhattan"},
{"weights": "distance", "p": 1, "label": "distance + manhattan"},
]
rows = []
k_for_compare = best_k if "best_k" in globals() else 9
for cfg in configs:
    pipe = make_pipeline(
        StandardScaler(),
        KNeighborsClassifier(
            n_neighbors=k_for_compare,
            weights=cfg["weights"],
            metric="minkowski",
            p=cfg["p"]
        )
    )
    score = cross_val_score(pipe, X, y, cv=5, scoring="accuracy").mean()
    rows.append({"setting": cfg["label"], "cv_accuracy": score})
result = pd.DataFrame(rows).sort_values("cv_accuracy", ascending=False)
print(result)
plt.figure(figsize=(8.5, 4.5))
plt.barh(result["setting"], result["cv_accuracy"], color="#42a5f5")
plt.xlabel("Mean CV Accuracy")
plt.title(f"KNN Setting Comparison at k={k_for_compare}")
plt.gca().invert_yaxis()
plt.xlim(result["cv_accuracy"].min() - 0.01, result["cv_accuracy"].max() + 0.01)
plt.show()
After `k` is selected, this compares practical setting choices. Use this to confirm whether distance-weighted voting or Manhattan distance improves reliability.
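To illustrate what distance-weighted voting changes, here is a tiny hand-computed sketch. The neighbor labels and distances are made up for this example; the weighting shown (each neighbor votes with weight 1/distance) mirrors what weights="distance" does in scikit-learn, while the helper vote is a name invented here.

```python
import numpy as np

# Hypothetical k=3 neighborhood: one very close neighbor of class 1,
# two farther neighbors of class 0.
neighbor_labels = np.array([1, 0, 0])
neighbor_dists = np.array([0.1, 0.9, 1.0])

def vote(labels, weights, n_classes=2):
    """Sum the weights per class and return (winning class, per-class totals)."""
    totals = np.bincount(labels, weights=weights, minlength=n_classes)
    return int(np.argmax(totals)), totals

uniform_pred, uniform_totals = vote(neighbor_labels, np.ones_like(neighbor_dists))
distance_pred, distance_totals = vote(neighbor_labels, 1.0 / neighbor_dists)

print("uniform :", uniform_pred, uniform_totals)    # class 0 wins by count (2 vs 1)
print("distance:", distance_pred, distance_totals)  # class 1 wins: the close neighbor outweighs the rest
```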
Exercises for Practice
Exercise 1: Compare weights="uniform" and weights="distance" on your dataset.
Exercise 2: Plot accuracy for k from 1 to 40 and choose a robust value.
Exercise 3: Try p=1 and p=2 with the same k and compare confusion matrices.
Exercise 4: Use make_classification() to generate synthetic data and visualize decision boundaries.
Exercise 5: Run 5-fold CV and 10-fold CV for the same k-range; compare whether the selected best_k changes.
Exercise 6: Add one intentionally high-scale feature (for example multiply a column by 1000) and measure KNN accuracy before and after scaling.
Exercise 7: Build a full pipeline with make_pipeline(StandardScaler(), KNeighborsClassifier(...)) and evaluate it with cross-validation in one step.
Exercise 8: Compare KNN results using only top 2 features vs all features and report accuracy + confusion matrix differences.
Exercise 9: On an imbalanced dataset, report precision/recall/F1 per class and explain why accuracy alone is misleading.