About Machine Learning (Part 6: KNN vs. K-means)

In machine learning, K-Nearest Neighbors (KNN) and K-means Clustering are two commonly used algorithms. Despite their similar names, they serve different purposes and have distinct working principles.

KNN (K-Nearest Neighbors)

KNN is a supervised learning algorithm used for classification and regression tasks.

The core idea of KNN is:

Given a new data point, find the K most similar instances in the training dataset (neighbors) and use them to predict the output.

KNN is a lazy learning algorithm: it has no explicit training phase. Instead, it stores the training data and defers all computation until a prediction is requested.

KNN follows these steps:

  1. Compute the distance between the new data point and all training samples (e.g., using Euclidean distance).
  2. Select the K nearest neighbors.
  3. Predict the result:
    • For classification: Use a majority vote among the K neighbors.
    • For regression: Take the average of the K neighbors.
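The steps above can be sketched in plain Python. This is a minimal illustration, not a production implementation; the function name `knn_predict` and the toy dataset are my own:

```python
import math
from collections import Counter

def knn_predict(train_X, train_y, query, k=3):
    """Classify `query` by majority vote among its k nearest training points."""
    # 1. Compute the Euclidean distance from the query to every training sample.
    distances = [(math.dist(x, query), label) for x, label in zip(train_X, train_y)]
    # 2. Select the k nearest neighbors.
    neighbors = sorted(distances, key=lambda pair: pair[0])[:k]
    # 3. Majority vote among the neighbors' labels.
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

# Toy dataset: two well-separated classes.
train_X = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8)]
train_y = ["A", "A", "A", "B", "B", "B"]
print(knn_predict(train_X, train_y, (2, 2), k=3))  # → A
print(knn_predict(train_X, train_y, (8, 7), k=3))  # → B
```

For regression, step 3 would instead return the mean of the neighbors' numeric targets.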

Mathematical Formulation

For two points $A(x_1, y_1)$ and $B(x_2, y_2)$, the Euclidean distance is:

$$
d(A, B) = \sqrt{(x_2 - x_1)^2 + (y_2 - y_1)^2}
$$

In n-dimensional space, it generalizes to:

$$
d(A, B) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}
$$

where:

  • $x_i$ and $y_i$ are the coordinates of points $A$ and $B$ in dimension $i$.
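The n-dimensional formula is available directly in Python's standard library as `math.dist`, so the general case needs no hand-rolled loop (a quick numeric check of the formula):

```python
import math

a = (1.0, 2.0, 3.0)
b = (4.0, 6.0, 3.0)

# math.dist computes the n-dimensional Euclidean distance defined above.
d = math.dist(a, b)
# The explicit sum-of-squares form, for comparison.
manual = math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
print(d)  # 5.0
```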

If the K nearest neighbors have labels $y_1, y_2, …, y_K$, then the predicted label $\hat{y}$ is:

$$
\hat{y} = \arg\max_{c} \sum_{i=1}^{K} \mathbb{I}(y_i = c)
$$

where:

  • $\mathbb{I}(\cdot)$ is an indicator function, returning 1 if $y_i = c$ and 0 otherwise.
  • The class with the most occurrences is chosen.

Pros and Cons of KNN

Advantages:

  • Simple and easy to implement.
  • Effective for non-linear decision boundaries.
  • No need for training (lazy learning).

Disadvantages:

  • Computationally expensive at prediction time (every query must scan the training set).
  • Sensitive to noise and irrelevant features.
  • Suffers from the curse of dimensionality.

Applications of KNN

  • Text classification (e.g., spam detection)
  • Recommendation systems (e.g., movie or product recommendations)
  • Medical diagnosis (e.g., predicting diseases based on similar cases)

K-means Clustering

K-means is an unsupervised learning algorithm used for clustering.

The core idea of K-means is:

Partition the dataset into K clusters such that data points in the same cluster are similar to each other.

It is an iterative optimization algorithm that minimizes intra-cluster distances.

How K-means Works:

  1. Initialize K cluster centroids (randomly selected).
  2. Assign each data point to the nearest centroid.
  3. Update centroids by computing the mean of all points in each cluster.
  4. Repeat until centroids no longer change or a stopping criterion is met.
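The four steps above can be sketched in plain Python. This is a minimal illustration (Lloyd's algorithm); the function name `kmeans` is my own, and for reproducibility the "random" initialization of step 1 is simplified to taking the first K points, where a real implementation would sample randomly or use k-means++:

```python
import math

def kmeans(points, k, max_iter=100):
    """Alternate assignment and centroid-update steps until convergence."""
    # 1. Initialize centroids (simplified: first k points instead of random).
    centroids = list(points[:k])
    clusters = [[] for _ in range(k)]
    for _ in range(max_iter):
        # 2. Assign each point to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # 3. Update each centroid to the mean of its cluster
        #    (keep the old centroid if a cluster ends up empty).
        new_centroids = [
            tuple(sum(coord) / len(cluster) for coord in zip(*cluster))
            if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
        # 4. Stop when centroids no longer change.
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids, clusters

# Two well-separated blobs converge to one centroid each.
points = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8)]
centroids, clusters = kmeans(points, k=2)
```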

Mathematical Formulation

Objective Function (Loss Function):

K-means minimizes the sum of squared distances between points and their assigned cluster centers:

$$
J = \sum_{i=1}^{K} \sum_{x_j \in C_i} ||x_j - \mu_i||^2
$$

where:

  • $K$ = number of clusters
  • $C_i$ = the $i$-th cluster
  • $x_j$ = a data point in cluster $C_i$
  • $\mu_i$ = centroid of cluster $C_i$

Updating Cluster Centers: The new centroid $\mu_i$ is computed as:

$$
\mu_i = \frac{1}{|C_i|} \sum_{x_j \in C_i} x_j
$$

i.e., the mean of all points in the cluster.
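Using the notation above, $J$ is straightforward to evaluate, and a small numeric check (not a proof) shows why the mean is the right update: for a fixed assignment, the cluster mean gives a smaller summed squared distance than other candidate centers. The helper name `objective` is my own:

```python
import math

def objective(clusters, centroids):
    """J = sum of squared distances from each point to its cluster centroid."""
    return sum(
        math.dist(p, mu) ** 2
        for cluster, mu in zip(clusters, centroids)
        for p in cluster
    )

cluster = [(1, 1), (1, 2), (2, 1)]
# Centroid update: the per-coordinate mean of the cluster, here (4/3, 4/3).
mean = tuple(sum(coord) / len(cluster) for coord in zip(*cluster))
j_at_mean = objective([cluster], [mean])
j_shifted = objective([cluster], [(2.0, 2.0)])  # any other center does worse
print(j_at_mean < j_shifted)  # True
```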

Pros and Cons of K-means

Advantages:

  • Simple and computationally efficient.
  • Works well on large datasets.
  • Produces interpretable results.

Disadvantages:

  • Requires manually setting K.
  • Sensitive to initialization (may converge to local optima).
  • Struggles with non-convex cluster shapes.

Applications of K-means

  • Customer segmentation (e.g., marketing analytics)
  • Image segmentation (e.g., clustering colors in an image)
  • Anomaly detection (e.g., identifying outliers)

KNN vs. K-means: Key Differences

| Feature | KNN (K-Nearest Neighbors) | K-means Clustering |
| --- | --- | --- |
| Type | Supervised learning | Unsupervised learning |
| Purpose | Classification & regression | Clustering |
| Training | No training required (lazy learning) | Requires iterative training |
| Prediction | Based on the K nearest neighbors | Based on cluster centroids |
| Distance metric | Used to find nearest neighbors | Used to compute cluster assignments |
| Computation cost | High for large datasets | Lower after convergence |
| Applications | Spam detection, recommendation systems | Market segmentation, image compression |

https://kongchenglc.github.io/blog/2025/02/02/Machine-Learning-6/

Author: Lich
Posted on: 2025-02-02
Updated on: 2026-03-11
