04 K-Nearest Neighbors

Today’s Agenda: Credit Risk Assessment

Imagine you are a credit manager, and a new customer applies for a loan.

Your core task is:

  • Assess Risk: How likely is this customer to default in the future?
  • Make a Decision: Should you approve or reject the loan?

This is a classic business classification problem.


The Data-Driven Solution

You are not guessing blindly. Your bank possesses valuable historical data.

  • Existing Data: Complete records of thousands of past customers.
    • Features: Income, debt, age, occupation…
    • Outcome: Paid on time ✅ or Defaulted ❌
  • Our Approach: Use this data to profile the new customer’s risk. This is where the K-Nearest Neighbors (KNN) algorithm comes into play.

KNN’s Core Idea: ‘Birds of a feather flock together’

The philosophy of the KNN algorithm is very simple: To determine the class of a new sample, just look at which class its nearest ‘neighbors’ belong to.

It’s like trying to figure out a stranger’s hobbies; the easiest way is to see what kind of people their closest friends are.

Figure: The KNN concept. A new data point (marked ‘?’) is classified by the majority class within its k=3 neighborhood: 2 neighbors belong to Class A and 1 to Class B, so it is classified as Class A.

The Business Analogy for KNN

| Machine Learning Concept | Business Analogy (Credit Assessment) |
|---|---|
| New Sample (Unknown Class) | A new loan applicant |
| Neighbors (Known Class) | Historical customers in the database who are most similar to the new applicant |
| Class | ‘Will Default’ or ‘Will Not Default’ |
| Measure of ‘Closeness’ | Similarity of customer features (e.g., income, age, debt) |
| The value ‘K’ | How many similar customers do we consult to make a decision? |

Our Learning Roadmap Today

  1. Fundamental Theory: A deep dive into the three core elements of KNN (Distance, K-Value, Decision Rule).
  2. Code in Action: Building a breast cancer diagnostic classifier in Python with scikit-learn.
  3. Model Optimization: Learning how to find the optimal K value and understanding Weighted KNN.
  4. Expanded Applications: Exploring how KNN can be used for regression problems (e.g., predicting house prices).
  5. Efficiency & Challenges: Discussing the challenges posed by data scale and high dimensionality, and their solutions.
  6. Summary: A comprehensive analysis of KNN’s pros and cons and its applications in business.

4.1 Formal Definition of the k-NN Rule

Given a sample \(x_n\) to be classified, the KNN decision process involves two steps:

  1. Find Neighbors: In the entire dataset, identify the \(k\) samples that are closest in distance to \(x_n\).
  2. Majority Vote: Among these \(k\) neighbors, use a majority vote to assign the most frequent class as the predicted class for \(x_n\).

\[ \large{y_n = \underset{c \in \mathcal{Y}}{\arg\max} \sum_{i=1}^k I(y_i = c)} \qquad(1)\]

Here, \(I(\cdot)\) is an indicator function. It is 1 if the class \(y_i\) of neighbor \(i\) is equal to class \(c\), and 0 otherwise. This formula is essentially counting votes.
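
To make the voting rule concrete, here is a minimal NumPy sketch of the two-step procedure; the tiny dataset and the helper name knn_predict are illustrative, not part of any library:

import numpy as np

def knn_predict(X_train, y_train, x_new, k=3):
    """Classify x_new by majority vote among its k nearest training samples."""
    # Step 1: Euclidean distance from x_new to every training sample
    distances = np.sqrt(np.sum((X_train - x_new) ** 2, axis=1))
    nearest = np.argsort(distances)[:k]          # indices of the k closest samples
    # Step 2: majority vote among the neighbors' labels
    labels, counts = np.unique(y_train[nearest], return_counts=True)
    return labels[np.argmax(counts)]

# Toy example: two features, two classes (0 and 1)
X_toy = np.array([[1.0, 1.0], [1.2, 0.8], [5.0, 5.0], [5.2, 4.8]])
y_toy = np.array([0, 0, 1, 1])
print(knn_predict(X_toy, y_toy, np.array([1.1, 0.9]), k=3))  # prints 0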

Understanding the Three Key Elements of KNN

To successfully apply the KNN algorithm, we must clearly define three core components. They are the building blocks of any KNN model.

  • Distance Metric
    • How do we quantify the ‘similarity’ between samples?
  • The Choice of \(k\)
    • How many neighbors should we look at? One or ten?
  • Decision Rule
    • How do we make the final judgment based on the neighbors? (e.g., simple majority vote)

Key Element 1: The Distance Metric

How do we define ‘close’? This depends on the distance metric we choose.

Figure: The distance metric concept. Two samples A and B are connected by a line whose length is the distance d(A, B) to be measured.

Key Element 2: The Choice of k

How many neighbors should we consult? This is the critical dial for model complexity.

Figure: The choice of k. A query point is shown with two concentric neighborhoods, a smaller one containing its k=3 nearest neighbors and a larger one containing its k=8 nearest neighbors.

Key Element 3: The Decision Rule

How do we aggregate the ‘opinions’ of the neighbors? The most common method is ‘majority rules’.

Figure: The decision rule as a majority vote. Among k=5 neighbors (3 of Class A, 2 of Class B), the majority class wins, so the final decision is Class A.

Deep Dive: Distance Metrics

The similarity between samples is measured by a distance function. For two \(d\)-dimensional samples \(x_i\) and \(x_j\), common distance metrics include:

  • Euclidean Distance: The familiar straight-line distance, and the most commonly used in KNN. \[ \large{L_2(x_i, x_j) = \sqrt{\sum_{l=1}^d (x_{il} - x_{jl})^2}} \]

  • Manhattan Distance: Imagine driving in a city where you can only travel along a grid. \[ \large{L_1(x_i, x_j) = \sum_{l=1}^d |x_{il} - x_{jl}|} \]
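
Both metrics are a single NumPy expression; the two sample vectors below are made up purely for illustration:

import numpy as np

x_i = np.array([30.0, 50.0])   # e.g., age 30, income 50 (in thousands)
x_j = np.array([40.0, 80.0])   # e.g., age 40, income 80 (in thousands)

euclidean = np.sqrt(np.sum((x_i - x_j) ** 2))   # L2: straight-line distance
manhattan = np.sum(np.abs(x_i - x_j))           # L1: sum of absolute differences

print(f'Euclidean (L2): {euclidean:.2f}')   # sqrt(10^2 + 30^2) ≈ 31.62
print(f'Manhattan (L1): {manhattan:.2f}')   # 10 + 30 = 40.00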

Visualizing Distance Metrics

Euclidean distance is ‘as the crow flies’, while Manhattan distance is ‘walking the blocks’.

Figure: Euclidean vs. Manhattan distance between two points A and B on a grid. Manhattan distance (L1) is restricted to movement along the axes, like walking city blocks, and equals the sum of the absolute coordinate differences. Euclidean distance (L2) is the shortest straight-line path, the hypotenuse length given by the Pythagorean theorem.

Critical Prerequisite: Data Standardization

Important Note: Distance metrics are highly sensitive to the scale of features.

  • The Problem: If you directly calculate distance using ‘Annual Income’ (in thousands) and ‘Age’ (in years), the large numerical values of income will completely dominate the calculation, making the effect of age negligible.
  • The Solution: Data Standardization. Transform all features to a similar scale (e.g., a mean of 0 and a standard deviation of 1).

Before applying KNN, data standardization is almost always a mandatory step.
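
A minimal sketch of the problem and the fix, using two made-up customers and scikit-learn's StandardScaler:

import numpy as np
from sklearn.preprocessing import StandardScaler

# Two customers: [annual income in dollars, age in years] -- illustrative values
customers = np.array([[50_000.0, 25.0],
                      [52_000.0, 60.0]])

# Raw distance: the $2,000 income gap completely swamps the 35-year age gap
raw_dist = np.sqrt(np.sum((customers[0] - customers[1]) ** 2))
print(f'Raw distance:          {raw_dist:,.1f}')    # ≈ 2,000.3, driven almost entirely by income

# After standardization, both features contribute on a comparable scale
scaled = StandardScaler().fit_transform(customers)
scaled_dist = np.sqrt(np.sum((scaled[0] - scaled[1]) ** 2))
print(f'Standardized distance: {scaled_dist:.2f}')  # ≈ 2.83, income and age contribute equally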

Why Standardization Is Crucial

Figure: Why standardization matters. Before standardization, the distance between customers A and B is dominated by the income difference (range 10k-100k), which drowns out the age difference (range 20-60). After converting both features to z-scores, income and age contribute to the distance equally and fairly.

Deep Dive: The Choice of k

The choice of \(k\) directly impacts the model’s performance and represents a Bias-Variance Tradeoff.

  • Small \(k\) values:
    • The model is more complex, with a more irregular decision boundary.
    • It is susceptible to noise, leading to Overfitting.
    • Low Bias, High Variance.
  • Large \(k\) values:
    • The model is simpler, with a smoother decision boundary.
    • It may ignore local, subtle structures in the data, leading to Underfitting.
    • High Bias, Low Variance.

Small k: High Flexibility, Prone to Overfitting

When k is small (e.g., k=1), the model’s decision boundary becomes highly convoluted, trying to perfectly match every training point, including noise.

Figure: How k=1 leads to overfitting. The new point's single nearest neighbor happens to be a noise/outlier point from the other class, so the new point is misclassified; the model is too sensitive to noise.

Large k: High Stability, Prone to Underfitting

When k is large (e.g., k=N), the model ignores local data structures, and the decision boundary becomes overly simple, failing to capture class differences.

Figure: How a large k leads to underfitting. The neighborhood covers most of the dataset, so the globally dominant class (blue) wins the vote and a nearby local cluster of the other class (red) is ignored, producing an overly smooth decision boundary.

Finding the Optimal k Value

So, how do we find that ‘just right’ value of \(k\)?

  • It’s not a guessing game: we can’t reliably choose \(k\) by intuition alone.
  • A systematic approach: We typically use Cross-Validation. We split the data into several folds, iteratively use one part as a ‘mini test set’ to evaluate the performance of different \(k\) values, and finally select the \(k\) that performs the best on average.

We will demonstrate this process in the upcoming Python practical session.

Let’s Visualize KNN with an Example

Suppose we have a dataset of tumors, classified as ‘Benign’ or ‘Malignant’. A new patient arrives (represented by ?), and we need to determine if their tumor is benign or malignant.

import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import make_blobs

# --- Generate Data ---
X, y = make_blobs(n_samples=50, centers=2, random_state=4, cluster_std=1.5)
new_point = np.array([[-3, 8]])

# --- Visualization ---
plt.style.use('seaborn-v0_8-whitegrid')
plt.figure(figsize=(10, 6))
sns.scatterplot(x=X[y==0, 0], y=X[y==0, 1], s=120, label='Benign', marker='o', color='royalblue', ec='black')
sns.scatterplot(x=X[y==1, 0], y=X[y==1, 1], s=120, label='Malignant', marker='X', color='darkorange', ec='black')
plt.scatter(new_point[:, 0], new_point[:, 1], s=300, c='red', marker='P', label='New Patient', ec='black', linewidth=1.5)

plt.title('A New Sample to be Classified', fontsize=18, pad=15)
plt.xlabel('Feature 1 (Texture)', fontsize=14)
plt.ylabel('Feature 2 (Radius)', fontsize=14)
plt.legend(loc='upper right', fontsize=12)
plt.gca().set_aspect('equal', adjustable='box')
plt.tight_layout()
plt.show()
Figure 1: A new sample with labeled tumor data

When k=1: The Closest Neighbor Decides Everything

1-NN is the simplest form of KNN. We only look at the single closest neighbor.

In this example, the sample closest to the new patient belongs to the ‘Malignant’ class, so 1-NN would classify it as Malignant. This could be a flawed judgment influenced by noise.

# --- Recalculate from previous cell ---
X, y = make_blobs(n_samples=50, centers=2, random_state=4, cluster_std=1.5)
new_point = np.array([[-3, 8]])
# --- Calculate Distances ---
distances_k1 = np.sqrt(np.sum((X - new_point)**2, axis=1))
nearest_neighbor_idx = np.argmin(distances_k1)

# --- Visualization ---
plt.style.use('seaborn-v0_8-whitegrid')
plt.figure(figsize=(10, 6))
sns.scatterplot(x=X[y==0, 0], y=X[y==0, 1], s=100, label='Benign', marker='o', color='royalblue', alpha=0.4)
sns.scatterplot(x=X[y==1, 0], y=X[y==1, 1], s=100, label='Malignant', marker='X', color='darkorange', alpha=0.4)
plt.scatter(new_point[:, 0], new_point[:, 1], s=300, c='red', marker='P', label='New Patient', ec='black', linewidth=1.5)

# Highlight the nearest neighbor
plt.scatter(X[nearest_neighbor_idx, 0], X[nearest_neighbor_idx, 1], s=350, 
            facecolors='none', edgecolors='green', linewidth=3, label='Nearest Neighbor (k=1)')

plt.title('For k=1, the Closest Neighbor is "Malignant"', fontsize=18, pad=15)
plt.xlabel('Feature 1 (Texture)', fontsize=14)
plt.ylabel('Feature 2 (Radius)', fontsize=14)
plt.legend(loc='upper right', fontsize=12)
plt.gca().set_aspect('equal', adjustable='box')
plt.tight_layout()
plt.show()
Figure 2: The decision process for 1-NN

When k=5: Majority Vote Reverses the Outcome

Now, we increase \(k\) to 5 and examine the 5 nearest neighbors.

  • Among these 5 neighbors, 4 are ‘Benign’ and 1 is ‘Malignant’.
  • According to the majority vote rule (4 > 1), 5-NN will classify the new sample as Benign.

This example clearly shows that the choice of \(k\) has a decisive impact on the final result.

# --- Recalculate from previous cell ---
X, y = make_blobs(n_samples=50, centers=2, random_state=4, cluster_std=1.5)
new_point = np.array([[-3, 8]])
# --- Calculate Distances ---
distances_k5 = np.sqrt(np.sum((X - new_point)**2, axis=1))
k = 5
nearest_k_indices = np.argsort(distances_k5)[:k]

# --- Visualization ---
plt.style.use('seaborn-v0_8-whitegrid')
plt.figure(figsize=(10, 6))
sns.scatterplot(x=X[y==0, 0], y=X[y==0, 1], s=100, label='Benign', marker='o', color='royalblue', alpha=0.3)
sns.scatterplot(x=X[y==1, 0], y=X[y==1, 1], s=100, label='Malignant', marker='X', color='darkorange', alpha=0.3)
plt.scatter(new_point[:, 0], new_point[:, 1], s=300, c='red', marker='P', label='New Patient', ec='black', linewidth=1.5)

# Highlight the 5 nearest neighbors
plt.scatter(X[nearest_k_indices, 0], X[nearest_k_indices, 1], s=350, 
            facecolors='none', edgecolors='green', linewidth=3, label='5 Nearest Neighbors (k=5)')

# Draw a circle enclosing the neighbors
center = new_point.flatten()
radius = distances_k5[nearest_k_indices[-1]]
circle = plt.Circle(center, radius, color='green', fill=False, linestyle='--', linewidth=2)
plt.gca().add_patch(circle)

plt.title('For k=5, the Majority of Neighbors (4/5) are "Benign"', fontsize=18, pad=15)
plt.xlabel('Feature 1 (Texture)', fontsize=14)
plt.ylabel('Feature 2 (Radius)', fontsize=14)
plt.legend(loc='upper right', fontsize=12)
plt.gca().set_aspect('equal', adjustable='box')
plt.tight_layout()
plt.show()
Figure 3: The decision process for 5-NN

Hands-On: Implementing KNN for Breast Cancer Diagnosis with Python

With the theory covered, it’s time for practice. We’ll use a classic real-world dataset: the Wisconsin Breast Cancer dataset.

  • Objective: Predict whether a tumor is benign or malignant based on various physiological indicators (e.g., radius, texture, perimeter).
  • Tool: scikit-learn, a powerful Python machine learning library.
  • Workflow:
    1. Data Loading and Exploration
    2. Data Preprocessing and Splitting
    3. Model Training and Prediction
    4. Finding the Optimal k Value

Step 1: Load and Explore the Data

First, we load the dataset built into scikit-learn and examine its basic information. The dataset contains 569 samples and 30 numerical features.

from sklearn.datasets import load_breast_cancer
import pandas as pd

# Load the dataset
cancer = load_breast_cancer()
X_cancer = pd.DataFrame(cancer.data, columns=cancer.feature_names)
y_cancer = pd.Series(cancer.target).map({0: 'Malignant', 1: 'Benign'})

print('Dataset dimensions (samples, features):', X_cancer.shape)
print('\nPreview of the first 5 rows:')
# Display the head of the DataFrame for a quick look at the features
with pd.option_context('display.max_columns', 5, 'display.width', 100):
    print(X_cancer.head())
Dataset dimensions (samples, features): (569, 30)

Preview of the first 5 rows:
   mean radius  mean texture  ...  worst symmetry  worst fractal dimension
0        17.99         10.38  ...          0.4601                  0.11890
1        20.57         17.77  ...          0.2750                  0.08902
2        19.69         21.25  ...          0.3613                  0.08758
3        11.42         20.38  ...          0.6638                  0.17300
4        20.29         14.34  ...          0.2364                  0.07678

[5 rows x 30 columns]

Step 2: Data Preprocessing and Splitting

The KNN algorithm relies on distance calculations, so feature scaling is crucial. We will use StandardScaler for standardization.

Then, we split the data into a training set (to train the model) and a test set (to evaluate its performance).

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# 1. Standardize the features to have mean=0 and variance=1
#    (for simplicity we scale the full dataset here; in practice, fit the scaler
#    on the training split only to avoid leaking test-set information)
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X_cancer)

# 2. Split data into training and testing sets (70% train, 30% test)
# stratify=y_cancer ensures the class proportions are the same in train and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X_scaled, y_cancer, test_size=0.3, random_state=42, stratify=y_cancer
)

print(f'Training set size: {X_train.shape} samples')
print(f'Test set size: {X_test.shape} samples')
Training set size: (398, 30) samples
Test set size: (171, 30) samples

Step 3: Train the KNN Model and Make Predictions

Training a KNN classifier with scikit-learn is straightforward. We’ll start with a common choice of \(k=5\).

from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

# Initialize the classifier with k=5
knn_clf = KNeighborsClassifier(n_neighbors=5)

# Train the model on the training data
knn_clf.fit(X_train, y_train)

# Make predictions on the unseen test data
y_pred = knn_clf.predict(X_test)

# Calculate accuracy: (Correct Predictions) / (Total Predictions)
accuracy = accuracy_score(y_test, y_pred)
print(f'Model accuracy on the test set with k=5 is: {accuracy:.4f}')
Model accuracy on the test set with k=5 is: 0.9649

Step 4: How to Find the Optimal k Value?

As mentioned earlier, the choice of \(k\) is critical. We can find the optimal \(k\) by iterating through a range of values and calculating the test set accuracy for each one.

import matplotlib.pyplot as plt
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

# --- Recalculate from previous cell ---
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
import pandas as pd  # needed for pd.Series below
cancer = load_breast_cancer()
y_cancer = pd.Series(cancer.target)
scaler = StandardScaler()
X_scaled = scaler.fit_transform(cancer.data)
X_train, X_test, y_train, y_test = train_test_split(
    X_scaled, y_cancer, test_size=0.3, random_state=42, stratify=y_cancer
)
# --- End recalculate ---

# We will test k values from 1 to 30
k_range = range(1, 31)
test_scores = []

# Loop through each k value
for k_val in k_range:
    knn_loop = KNeighborsClassifier(n_neighbors=k_val)
    knn_loop.fit(X_train, y_train)
    y_pred_loop = knn_loop.predict(X_test)
    test_scores.append(accuracy_score(y_test, y_pred_loop))

# Find the k value with the highest accuracy
best_k = k_range[np.argmax(test_scores)]
best_accuracy = max(test_scores)

# --- Visualization ---
plt.style.use('seaborn-v0_8-whitegrid')
plt.figure(figsize=(10, 6))
plt.plot(k_range, test_scores, marker='o', linestyle='--', color='royalblue', mfc='skyblue')
plt.title('Model Accuracy vs. k Value', fontsize=18, pad=15)
plt.xlabel('k (Number of Neighbors)', fontsize=14)
plt.ylabel('Test Set Accuracy', fontsize=14)
plt.axvline(best_k, color='red', linestyle=':', linewidth=2, label=f'Optimal k = {best_k}\nAccuracy = {best_accuracy:.4f}')
plt.xticks(np.arange(0, 31, 2))
plt.legend(fontsize=12)
plt.tight_layout()
plt.show()
Figure 4: The impact of different k values on model accuracy

Result Analysis: k=7 Provides the Best Performance

The analysis from the chart above tells us:

  • When \(k\) is very small (e.g., 1 or 2), the model might be overfitting, resulting in lower accuracy.
  • As \(k\) increases, the accuracy first rises and then falls.
  • For this specific problem, the model performs best on the test set when \(k=7\), achieving an accuracy of approximately 97.08%.

This is a very strong result, indicating that KNN can be effectively used in practical applications like medical diagnosis.
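
One caveat: the loop above selects \(k\) using test-set accuracy, which slightly biases the reported score in our favor. A more rigorous alternative, sketched below assuming X_train and y_train from the previous cell are still in scope, is to choose \(k\) by cross-validation on the training data alone and keep the test set for a final, untouched evaluation:

import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

k_range = range(1, 31)
cv_scores = []

for k_val in k_range:
    knn_cv = KNeighborsClassifier(n_neighbors=k_val)
    # 5-fold cross-validation accuracy computed on the training set only
    scores = cross_val_score(knn_cv, X_train, y_train, cv=5, scoring='accuracy')
    cv_scores.append(scores.mean())

best_k_cv = k_range[np.argmax(cv_scores)]
print(f'Best k by 5-fold cross-validation: {best_k_cv}')
print(f'Mean CV accuracy at that k: {max(cv_scores):.4f}')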

4.2 Weighted k-Nearest Neighbors (Weighted k-NN)

In the standard KNN model, each neighbor’s ‘vote’ has equal weight. But is this reasonable?

  • Intuition: Neighbors that are closer should have a greater say than those that are farther away.
  • Solution: Weighted k-NN. Assign voting weights based on the neighbor’s distance.

A common weighting method is to use the inverse of the distance as the weight.

\[ \large{y_n = \underset{c \in \mathcal{Y}}{\arg\max} \sum_{i=1}^k w_i \cdot I(y_i = c) \quad \text{, where } w_i = \frac{1}{\text{distance}(x_n, x_i)}} \]
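
A tiny worked example of the inverse-distance vote; the distances and labels below are invented for illustration:

import numpy as np

# The k=3 nearest neighbors of a query point: their distances and class labels
distances = np.array([0.5, 2.0, 2.5])                        # neighbor 1 is much closer
labels    = np.array(['Default', 'No Default', 'No Default'])

weights = 1.0 / distances                                    # w_i = 1 / distance
for c in np.unique(labels):
    print(c, 'weighted votes:', weights[labels == c].sum())
# Default: 2.0 vs. No Default: 0.5 + 0.4 = 0.9  ->  'Default' wins the weighted vote,
# whereas a plain majority vote (2 vs. 1) would have chosen 'No Default'.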

The Intuition Behind Weighted KNN

Closer neighbors have a ‘louder voice’, while distant neighbors have a ‘quieter voice’.

Figure: The intuition behind weighted KNN. The point to classify has one very close neighbor of the purple class (weight 0.8) and two more distant neighbors of the pink class (weights 0.3 and 0.2). Standard KNN with k=3 would choose pink by a 2-to-1 vote, but the weighted vote is 0.8 for purple versus 0.5 for pink, so the point is classified as purple.

Implementing Weighted KNN in scikit-learn

Implementing weighted KNN in scikit-learn is simple. Just set the weights='distance' parameter when initializing the classifier.

# --- Recalculate from previous cells to ensure context ---
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score
import pandas as pd
cancer = load_breast_cancer()
y_cancer = pd.Series(cancer.target)
scaler = StandardScaler()
X_scaled = scaler.fit_transform(cancer.data)
X_train, X_test, y_train, y_test = train_test_split(
    X_scaled, y_cancer, test_size=0.3, random_state=42, stratify=y_cancer
)
# --- End recalculate ---

# We found the optimal k to be 7
best_k = 7 

# Train a standard KNN model (uniform weights)
knn_standard = KNeighborsClassifier(n_neighbors=best_k, weights='uniform')
knn_standard.fit(X_train, y_train)
accuracy_standard = accuracy_score(y_test, knn_standard.predict(X_test))

# Train a weighted KNN model (weights based on distance)
knn_weighted = KNeighborsClassifier(n_neighbors=best_k, weights='distance')
knn_weighted.fit(X_train, y_train)
accuracy_weighted = accuracy_score(y_test, knn_weighted.predict(X_test))

print(f'Standard KNN (k={best_k}) Accuracy: {accuracy_standard:.4f}')
print(f'Weighted KNN (k={best_k}) Accuracy: {accuracy_weighted:.4f}')
Standard KNN (k=7) Accuracy: 0.9649
Weighted KNN (k=7) Accuracy: 0.9649

In this case, the weighted model's accuracy is identical to that of standard KNN, but on other, more complex datasets, weighted KNN can often provide a performance boost.

4.3 KNN for More Than Just Classification: Regression

The core idea of KNN also applies to regression problems, where the goal is to predict a continuous value (like a stock price or house value).

Classification

  • Goal: Predict a discrete class
  • Method: Neighbors vote
  • Example: Is an email ‘Spam’ or ‘Not Spam’?

Regression

  • Goal: Predict a continuous value
  • Method: Neighbors are averaged
  • Example: Predict the sale price of a house

The Mathematics of KNN Regression

For a new sample, the predicted value from a KNN regression model is the average (or weighted average) of the target values of its \(k\) nearest neighbors.

\[ \large{\hat{y}_n = \frac{1}{k} \sum_{i=1}^k y_i} \]

Here, \(y_i\) is the actual numerical value of the \(i\)-th neighbor.
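
In code the prediction is just an average; a minimal sketch with invented neighbor prices, including the distance-weighted variant discussed earlier:

import numpy as np

# Sale prices (in $1,000s) of the k=3 most similar houses -- illustrative values
neighbor_prices = np.array([310.0, 295.0, 330.0])

# Plain KNN regression: the unweighted average of the neighbors' targets
print('Predicted price:', round(neighbor_prices.mean(), 2))          # 311.67

# Distance-weighted variant: closer neighbors pull the prediction harder
distances = np.array([1.0, 2.0, 4.0])
weights = 1.0 / distances
print('Weighted prediction:', round(np.average(neighbor_prices, weights=weights), 2))  # 308.57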

Hands-On: Predicting California House Prices with KNN

We’ll use the California Housing dataset to predict a house’s price based on its features (e.g., median income in the block, house age).

import pandas as pd
import numpy as np
from sklearn.datasets import fetch_california_housing
from sklearn.neighbors import KNeighborsRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Load and prepare the data
housing = fetch_california_housing()
X_h = pd.DataFrame(housing.data, columns=housing.feature_names)
y_h = pd.Series(housing.target) # Target: median house value, in units of $100,000

# Feature standardization and data splitting
scaler_h = StandardScaler()
X_h_scaled = scaler_h.fit_transform(X_h)
X_h_train, X_h_test, y_h_train, y_h_test = train_test_split(X_h_scaled, y_h, test_size=0.3, random_state=42)

# Train a KNN Regressor model with k=7
knn_reg = KNeighborsRegressor(n_neighbors=7)
knn_reg.fit(X_h_train, y_h_train)
y_h_pred = knn_reg.predict(X_h_test)

# Evaluate using Root Mean Squared Error (RMSE)
rmse = np.sqrt(mean_squared_error(y_h_test, y_h_pred))
print(f'KNN Regressor RMSE on the test set: ${rmse * 100000:,.2f}')
KNN Regressor RMSE on the test set: $64,419.94

Visualizing KNN Regression: A Step-like Prediction Function

The prediction function of a KNN regressor is not a smooth curve but a step-like function. This is because every query point that shares the same set of \(k\) nearest neighbors receives the same prediction (the average of those neighbors' targets), so the prediction only changes where the neighbor set changes.

import numpy as np
import matplotlib.pyplot as plt
from sklearn.neighbors import KNeighborsRegressor

# Generate synthetic data
np.random.seed(42)
X_syn = np.sort(5 * np.random.rand(80, 1), axis=0)
y_syn = np.sin(X_syn).ravel()
y_syn[::5] += 3 * (0.5 - np.random.rand(16)) # Add some noise

# Fit the model
knn_reg_viz = KNeighborsRegressor(n_neighbors=5)
knn_reg_viz.fit(X_syn, y_syn)

# Generate test points for plotting
T = np.linspace(0, 5, 500)[:, np.newaxis]
y_pred_viz = knn_reg_viz.predict(T)

# --- Plotting ---
plt.style.use('seaborn-v0_8-whitegrid')
plt.figure(figsize=(10, 6))
plt.scatter(X_syn, y_syn, c='navy', s=50, label='Data Points', ec='black', alpha=0.8)
plt.plot(T, y_pred_viz, c='red', label='KNN Regression Prediction (k=5)', linewidth=3)
plt.xlabel('Feature', fontsize=14)
plt.ylabel('Target Value', fontsize=14)
plt.title('The Prediction Function of KNN Regression', fontsize=18, pad=15)
plt.legend(fontsize=12)
plt.tight_layout()
plt.show()
Figure 5: The step-like prediction function of KNN Regression (1D Example)

4.4 Real-World Challenges and Solutions

So far, we’ve dealt with idealized numerical data. But in the business world, data is often more complex.

  • Challenge 1: How do we handle non-numerical features (like ‘City’ or ‘Product Category’)?
  • Challenge 2: What if the dataset is massive (millions of records)? KNN predictions become extremely slow.

Challenge 1: Handling Categorical Features

The Problem: How do you calculate the ‘distance’ between samples that include text features like ‘Beijing’ and ‘Shanghai’?

Solutions:

  1. One-Hot Encoding: Convert the categorical feature into multiple binary (0/1) features. This is the most common approach.
  2. Use a specific distance metric: For example, the Hamming Distance, which counts the number of positions at which two strings of equal length are different.

Visualizing One-Hot Encoding

Figure: One-hot encoding. A single ‘City’ column taking the values Beijing, Shanghai, and Guangzhou is transformed into three binary columns (City_BJ, City_SH, City_GZ), with exactly one 1 per row.
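
The same transformation takes one line in pandas; the ‘City’ column below is a hypothetical example:

import pandas as pd

# A hypothetical categorical feature
df = pd.DataFrame({'City': ['Beijing', 'Shanghai', 'Guangzhou']})

# One-hot encoding: one binary (0/1) column per category
encoded = pd.get_dummies(df, columns=['City'], dtype=int)
print(encoded)
#    City_Beijing  City_Guangzhou  City_Shanghai
# 0             1               0              0
# 1             0               0              1
# 2             0               1              0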

Challenge 2: Data Scale and Prediction Speed

A major drawback of KNN is its high computational cost.

  • Training Phase: Extremely fast. There’s almost no computation, just storing the data.
  • Prediction Phase: Extremely slow. For every new sample to be predicted, it must calculate its distance to every single training sample. If the training set has millions of records, this is unacceptable.

The Question: How can we speed up the process of finding nearest neighbors without sacrificing too much accuracy?

Acceleration Strategy 1: KD-Tree (k-dimensional tree)

A KD-Tree is a classic space-partitioning data structure (here the ‘k’ refers to the number of dimensions of the feature space, not the number of neighbors). It recursively divides that space into a nested series of hyperrectangles.

  • Construction Process:
    1. Select a coordinate axis (e.g., the one with the largest variance).
    2. Find the median of all data points along that axis.
    3. Use a hyperplane perpendicular to that axis to split the space in two.
    4. Repeat this process recursively for the two subspaces until each region contains only a few data points.
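
In practice you rarely build the tree by hand: scikit-learn provides a KDTree class, and KNeighborsClassifier can be told to use it via algorithm='kd_tree' (its default 'auto' setting picks a suitable structure on its own). A brief sketch on random illustrative data:

import numpy as np
from sklearn.neighbors import KDTree

rng = np.random.default_rng(0)
points = rng.random((10_000, 2))            # 10,000 random points in 2D

tree = KDTree(points, leaf_size=40)         # build once: this is the 'training' cost
query = np.array([[0.5, 0.5]])
dist, idx = tree.query(query, k=5)          # fast lookup of the 5 nearest neighbors
print('Indices of the 5 nearest neighbors:', idx[0])
print('Their distances:', np.round(dist[0], 4))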

How a KD-Tree Partitions 2D Space

import numpy as np
import matplotlib.pyplot as plt

np.random.seed(42)
data_kd = np.random.rand(25, 2) * 10

# --- Visualization ---
plt.style.use('seaborn-v0_8-whitegrid')
fig, ax = plt.subplots(figsize=(8, 6.5))
ax.scatter(data_kd[:, 0], data_kd[:, 1], c='navy', s=50, ec='black', alpha=0.8)

# 1st split (vertical)
data_sorted_x = data_kd[data_kd[:, 0].argsort()]
median_idx_x = len(data_sorted_x) // 2
median_val_x = data_sorted_x[median_idx_x, 0]
ax.plot([median_val_x, median_val_x], [-1, 11], color='red', linestyle='--', linewidth=2, label='1st Split (x-axis)')

# 2nd split (horizontal) on left side
left_data = data_sorted_x[:median_idx_x]
left_data_sorted_y = left_data[left_data[:, 1].argsort()]
median_idx_ly = len(left_data_sorted_y) // 2
median_val_ly = left_data_sorted_y[median_idx_ly, 1]
ax.plot([-1, median_val_x], [median_val_ly, median_val_ly], color='green', linestyle='--', linewidth=2, label='2nd Split (y-axis)')

# 2nd split (horizontal) on right side
right_data = data_sorted_x[median_idx_x + 1:]
right_data_sorted_y = right_data[right_data[:, 1].argsort()]
median_idx_ry = len(right_data_sorted_y) // 2
median_val_ry = right_data_sorted_y[median_idx_ry, 1]
ax.plot([median_val_x, 11], [median_val_ry, median_val_ry], color='green', linestyle='--')


ax.set_xlim(-1, 11)
ax.set_ylim(-1, 11)
ax.set_title('KD-Tree Spatial Partitioning Process', fontsize=18, pad=15)
ax.set_xlabel('Feature 1', fontsize=14)
ax.set_ylabel('Feature 2', fontsize=14)
ax.legend(fontsize=12)
ax.set_aspect('equal', adjustable='box')
plt.tight_layout()
plt.show()
Figure 6: How a KD-Tree recursively partitions a 2D space

The Limitation of KD-Trees: The Curse of Dimensionality

While KD-Trees are very effective in low-dimensional spaces (e.g., \(d < 20\)), their performance degrades rapidly as the number of dimensions \(d\) increases.

  • Reason: In high-dimensional space, the distances between all pairs of points tend to become nearly equal, the space becomes sparse, and the concept of a ‘neighbor’ itself becomes ambiguous.
  • Consequence: When the dimensionality is very high, the search efficiency of a KD-Tree degenerates to be close to that of a brute-force search.
Figure: The curse of dimensionality. In a dense low-dimensional space (d=2), nearby neighbors are meaningful; in a sparse high-dimensional space (d >> 20), all ‘neighbors’ are almost equally distant.

The Curse of Dimensionality: Why Distance Fails

The Core Question: What is ‘Distance’ in High-Dimensional Space?

In high-dimensional spaces, the geometric intuition we’ve built from our low-dimensional (2D or 3D) experience often fails.

  • Our Intuition: Points are either near or far; you can always find a ‘nearest neighbor’.
  • High-Dimensional Reality: As dimensions increase, the Euclidean distances between pairs of points tend to become very similar to each other.

This lecture aims to reveal this counter-intuitive phenomenon—the failure of distance metrics—through intuitive examples and mathematical derivation.

A Simplified Model: Slicing a Unit Hypercube

Let’s begin with a simple thought experiment: slicing a unit hypercube $[0, 1]^d$ at the midpoint 0.5 along each dimension.

This partitions the entire space into $2^d$ identical ‘small hypercubes’.

By observing where a random point lands within these ‘small cubes’, we can build an initial intuition about high-dimensional space.

Low-Dimensional Intuition: The Case of d=1 and d=2

In low dimensions, the region near the origin (0,0,...) (the ‘inner core’) and the regions far from it (the ‘outer shell’) appear balanced.

Figure 7: Partitioning low-dimensional space into an ‘inner core’ near the origin and an ‘outer shell’. For d=1, the unit interval splits into [0, 0.5] and (0.5, 1]; for d=2, out of the $2^2 = 4$ small squares, only one ([0, 1/2] × [0, 1/2]) touches the origin.

The High-Dimensional Phenomenon: Data Points Almost Always Land in the ‘Outer Shell’

As the dimension d increases, the proportion of the volume in the ‘inner core’ hypercube, $1/2^d$, rapidly approaches zero.

| Dimension (d) | Total Small Cubes (\(2^d\)) | Prob. of Inner Core (\(1/2^d\)) | Prob. of Outer Shell (\(1 - 1/2^d\)) |
|---|---|---|---|
| 1 | 2 | 50% | 50% |
| 3 | 8 | 12.5% | 87.5% |
| 10 | 1,024 | ~0.1% | 99.9% |
| 100 | ~\(1.27 \times 10^{30}\) | ~\(7.9 \times 10^{-31}\) | ~100% |

Conclusion: If we randomly scatter points in a hypercube, when the dimension is high enough, almost all points will land in the ‘outer shell’ regions, far from the origin.
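
A quick Monte Carlo check of the table above (the sample size is arbitrary): a point lands in the ‘inner core’ only if every one of its d coordinates is below 0.5.

import numpy as np

rng = np.random.default_rng(0)
for d in [1, 3, 10]:
    points = rng.random((100_000, d))                    # uniform points in [0, 1]^d
    in_core = np.all(points < 0.5, axis=1).mean()        # fraction with all coordinates < 0.5
    print(f'd={d:3d}  empirical inner-core fraction: {in_core:.4f}  (theory: {0.5 ** d:.4f})')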

From Discrete to Continuous: Distance Concentration Under a Uniform Distribution

The slicing model was a simplification. Now consider a more general case: n data points $X_1, ..., X_n$ are independently and identically distributed in the unit hypercube $[0, 1]^d$.

The Core Finding: As dimension d approaches infinity, the relative difference between the maximum and minimum pairwise distances approaches zero.

\[ \large{ \lim_{d \to \infty} \frac{dist_{max} - dist_{min}}{dist_{min}} = 0 } \]

This means that from any given point’s perspective, the distances to all other points become almost identical!

Mathematical Derivation 1: Setting up the Random Variable for Distance

Let’s analyze the distance between any two points $X_i$ and $X_j$.

  1. Point Representation: Each point $X_j$ is a d-dimensional vector $(X_{j1}, ..., X_{jd})^T$.
  2. Component Distribution: Each component $X_{jk}$ independently follows a U(0, 1) distribution.
    • Expectation: $E(X_{jk}) = \frac{1}{2}$
    • Variance: $Var(X_{jk}) = \frac{1}{12}$
  3. Squared Distance: We analyze the squared Euclidean distance $S_{ij} = ||X_i - X_j||_2^2$ for mathematical convenience.

\[ \large{ S_{ij} = \sum_{k=1}^{d} (X_{ik} - X_{jk})^2 } \]

Mathematical Derivation 2: Calculating the Expected Squared Distance

Using the linearity of expectation, we can compute the expected value of $S_{ij}$.

First, consider the expected squared difference in a single dimension: $E[(X_{ik} - X_{jk})^2]$.

Using the variance definition $Var(Y) = E(Y^2) - (E(Y))^2$, we have: $E[(X_{ik} - X_{jk})^2] = Var(X_{ik} - X_{jk}) + (E[X_{ik} - X_{jk}])^2$

Since $X_{ik}$ and $X_{jk}$ are i.i.d.:

  • $E[X_{ik} - X_{jk}] = E[X_{ik}] - E[X_{jk}] = \frac{1}{2} - \frac{1}{2} = 0$
  • $Var(X_{ik} - X_{jk}) = Var(X_{ik}) + Var(X_{jk}) = \frac{1}{12} + \frac{1}{12} = \frac{1}{6}$

So, $E[(X_{ik} - X_{jk})^2] = \frac{1}{6} + 0^2 = \frac{1}{6}$.

Finally, summing over all dimensions: \[ \large{ E(S_{ij}) = E[\sum_{k=1}^{d} (X_{ik} - X_{jk})^2] = \sum_{k=1}^{d} E[(X_{ik} - X_{jk})^2] = \frac{d}{6} } \]

Mathematical Derivation 3: Applying the Law of Large Numbers

We’ve established that the squared distance $S_{ij}$ is a sum of d i.i.d. random variables $(X_{ik} - X_{jk})^2$.

By the Law of Large Numbers, as d approaches infinity, the sample mean converges in probability to the expected value:

\[ \large{ \frac{S_{ij}}{d} = \frac{1}{d} \sum_{k=1}^{d} (X_{ik} - X_{jk})^2 \xrightarrow{p} E[(X_{ik} - X_{jk})^2] = \frac{1}{6} } \]

This implies that for a large d, the value of $S_{ij}$ will be highly concentrated around its expected value $\frac{d}{6}$. Since this holds for any pair of points i and j, the squared distances for all pairs of points will converge to the same value!
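
This concentration is easy to verify empirically. The short simulation below (point counts and dimensions chosen arbitrarily) draws uniform points and compares the spread of pairwise Euclidean distances across dimensions; it assumes SciPy is available for pdist:

import numpy as np
from scipy.spatial.distance import pdist

rng = np.random.default_rng(42)
n = 200                                      # number of random points per experiment

for d in [2, 10, 100, 1000]:
    X = rng.random((n, d))                   # uniform points in [0, 1]^d
    dists = pdist(X)                         # all pairwise Euclidean distances
    spread = (dists.max() - dists.min()) / dists.min()
    print(f'd={d:5d}  mean distance={dists.mean():6.2f}  (max-min)/min={spread:.3f}')
# The mean distance grows roughly like sqrt(d/6), while the relative spread
# (max - min) / min shrinks toward zero as d increases.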

Conceptual Visualization: Concentration of the Distance Distribution

This phenomenon can be understood visually through the changing distribution of distances.

Figure 8: The probability density of pairwise distances for increasing dimension (low d=2, medium d=10, high d → ∞). As d increases, the distribution becomes increasingly narrow, ultimately concentrating like a spike around the expected distance.

Generalization: Concentration of the Norm

The phenomenon of distance concentration is not limited to uniform distributions or Euclidean distance; it’s a more general principle.

Concentration of the Norm Theorem (Intuitive Version): For a random vector $X \in \mathbb{R}^d$ satisfying certain mild conditions (e.g., independent components with unit variance), its L2-norm (or length) $||X||_2$ will be highly concentrated around its expected value, which is close to $\sqrt{d}$.

A stronger result (Vershynin, 2018) shows that for sub-gaussian vectors, the probability of deviating from $\sqrt{d}$ decays exponentially with the dimension d. This means that in high-dimensional space, the length of a random vector is almost a deterministic, not random, quantity.

Practical Implications: Why ‘Nearest Neighbor’ Fails

The failure of distance metrics poses a severe challenge to many algorithms that rely on distance calculations, especially k-Nearest Neighbors (k-NN).

  • Goal of k-NN: To find the k neighbors in the feature space that are ‘most similar’ to a query point.
  • High-Dimensional Dilemma: If the distances from a query point to all its neighbors are nearly equal, there’s no meaningful difference between the ‘nearest’ and ‘farthest’ neighbors.
  • Result: The algorithm’s classification or regression output becomes unstable and meaningless, as the selected ‘neighbors’ may be effectively random.

Summary

  1. High-Dimensional Geometry is Counter-intuitive: We cannot directly extend our low-dimensional intuitions to high-dimensional spaces.
  2. Mass is in the Shell: Data points in a high-dimensional space are likely to be found in the ‘edges’ or ‘outer shell’ of the space.
  3. Distance Concentration: As dimensions increase, the distances between any two random points tend to become equal. This is known as distance concentration.
  4. Algorithm Failure: This phenomenon directly causes the performance of distance-based algorithms (like k-NN and clustering) to degrade severely on high-dimensional data, which is a core aspect of the ‘Curse of Dimensionality’.

4.5 Summary: The Pros and Cons of KNN

| Advantages (Pros) | Disadvantages (Cons) |
|---|---|
| ✅ Simple principle, easy to implement | ❌ High computational cost, slow predictions |
| ✅ No training required, highly adaptive (non-parametric) | ❌ Large memory requirement, needs all training data |
| ✅ Makes no assumptions about data distribution | ❌ Sensitive to imbalanced data (majority class can dominate) |
| ✅ Can be used for classification and regression | ❌ Suffers badly from the Curse of Dimensionality |
| ✅ Decision boundary can be very flexible | ❌ Requires feature standardization |

KNN’s Applications in Business Decision-Making

Thanks to its simplicity and flexibility, KNN is widely used in various business domains:

  • Finance: Customer credit scoring, fraudulent transaction detection.
  • Marketing: Customer segmentation, identifying high-value customer groups.
  • Recommendation Systems: ‘Customers who bought product X also bought product Y’ is a classic ‘find the nearest neighbor’ problem.
  • Healthcare: Disease diagnosis, genetic pattern recognition.
  • Retail: Predicting customer purchasing behavior, inventory optimization.

Review of Core Concepts from This Chapter

| Concept | Core Idea | Key Parameters/Decisions |
|---|---|---|
| KNN Classification | Majority rules | k, distance metric |
| KNN Regression | Average of neighbors | k, distance metric |
| Weighted KNN | Proximity matters | weights='distance' |
| KD-Tree | Space-for-time tradeoff | Suitable for low-dimensional data |
| Curse of Dimensionality | Distance loses meaning in high dimensions | Feature selection/reduction is key |
| Standardization | Treat all features equally | Must be done before calculating distance |

Final Thoughts: The Keys to Success with KNN

To apply KNN successfully, economists and data scientists need to focus on:

  1. Feature Engineering: Selecting the most relevant features for the problem is crucial. Garbage in, garbage out.
  2. Distance Metric: Choose a distance metric appropriate for your specific problem. Euclidean distance is not always the best choice.
  3. Parameter Tuning: Systematically find the optimal value of \(k\) using techniques like cross-validation.
  4. Efficiency Considerations: For large datasets, it’s essential to consider acceleration algorithms like KD-Trees or Ball Trees, or to use approximate nearest neighbor search methods.

KNN is an excellent gateway into the world of machine learning—it’s simple, intuitive, and powerful.

Thank You!

Q & A