About Machine Learning (Part 6: KNN vs. K-means)
In machine learning, K-Nearest Neighbors (KNN) and K-means Clustering are two commonly used algorithms. Despite their similar names, they serve different purposes and have distinct working principles.
KNN (K-Nearest Neighbors)
KNN is a supervised learning algorithm used for classification and regression tasks.
The core idea of KNN is:
Given a new data point, find the K most similar instances in the training dataset (neighbors) and use them to predict the output.
KNN is a lazy learning algorithm: it has no explicit training phase. Instead, it simply stores the training data and defers all computation until a prediction is requested.
KNN follows these steps:
- Compute the distance between the new data point and all training samples (e.g., using Euclidean distance).
- Select the K nearest neighbors.
- Predict the result:
  - For classification: Use a majority vote among the K neighbors' labels.
  - For regression: Take the average of the K neighbors' target values.
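To make these steps concrete, here is a minimal from-scratch sketch in Python with NumPy. The function name `knn_predict` and the toy data are made up for illustration, not taken from any library:

```python
import numpy as np

def knn_predict(X_train, y_train, x_new, k=3):
    """Classify x_new by majority vote among its k nearest training points."""
    # Step 1: Euclidean distance from x_new to every training sample.
    distances = np.sqrt(((X_train - x_new) ** 2).sum(axis=1))
    # Step 2: indices of the k smallest distances.
    nearest = np.argsort(distances)[:k]
    # Step 3: majority vote among the neighbors' labels.
    labels, counts = np.unique(y_train[nearest], return_counts=True)
    return labels[np.argmax(counts)]

# Toy example: two groups of 2-D points with labels 0 and 1.
X_train = np.array([[1.0, 1.0], [1.5, 2.0], [8.0, 8.0], [8.5, 9.0]])
y_train = np.array([0, 0, 1, 1])
print(knn_predict(X_train, y_train, np.array([1.2, 1.5]), k=3))  # -> 0
```

For regression, the last step would instead return `y_train[nearest].mean()`, i.e., the average of the neighbors' target values.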
Mathematical Formulation
For two points $A(x_1, y_1)$ and $B(x_2, y_2)$, the Euclidean distance is:
$$
d(A, B) = \sqrt{(x_2 - x_1)^2 + (y_2 - y_1)^2}
$$
In n-dimensional space, it generalizes to:
$$
d(A, B) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}
$$
where:
- $x_i$ and $y_i$ are the coordinates of points $A$ and $B$ in dimension $i$.
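As a quick sanity check with made-up points, say $A = (1, 2, 3)$ and $B = (4, 6, 3)$:

$$
d(A, B) = \sqrt{(4-1)^2 + (6-2)^2 + (3-3)^2} = \sqrt{9 + 16 + 0} = 5
$$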
If the K nearest neighbors have labels $y_1, y_2, \dots, y_K$, then the predicted label $\hat{y}$ is:
$$
\hat{y} = \arg\max_{c} \sum_{i=1}^{K} \mathbb{I}(y_i = c)
$$
where:
- $\mathbb{I}(\cdot)$ is an indicator function, returning 1 if $y_i = c$ and 0 otherwise.
- The class with the most occurrences is chosen.
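In code, this argmax over indicator sums is just a vote tally. A tiny sketch (the label values are made up):

```python
from collections import Counter

neighbor_labels = ["spam", "ham", "spam"]  # labels of the K = 3 nearest neighbors
y_hat = Counter(neighbor_labels).most_common(1)[0][0]
print(y_hat)  # -> "spam"
```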
Pros and Cons of KNN
Advantages:
- Simple and easy to implement.
- Effective for non-linear decision boundaries.
- No need for training (lazy learning).
Disadvantages:
- Computationally expensive at prediction time (every query is compared against the entire training set).
- Sensitive to noise and irrelevant features.
- Suffers from the curse of dimensionality.
Applications of KNN
- Text classification (e.g., spam detection)
- Recommendation systems (e.g., movie or product recommendations)
- Medical diagnosis (e.g., predicting diseases based on similar cases)
K-means Clustering
K-means is an unsupervised learning algorithm used for clustering.
The core idea of K-means is:
Partition the dataset into K clusters such that data points in the same cluster are similar to each other.
It is an iterative optimization algorithm that minimizes intra-cluster distances.
How K-means Works:
- Initialize K cluster centroids (e.g., by picking K random data points).
- Assign each data point to the nearest centroid.
- Update centroids by computing the mean of all points in each cluster.
- Repeat until centroids no longer change or a stopping criterion is met.
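The loop below is a minimal NumPy sketch of these four steps; the seed, iteration cap, and toy data are arbitrary choices for illustration:

```python
import numpy as np

def kmeans(X, k=2, max_iters=100, seed=0):
    """Cluster X into k groups by iterative centroid refinement."""
    rng = np.random.default_rng(seed)
    # Step 1: pick k distinct data points as the initial centroids.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iters):
        # Step 2: assign each point to its nearest centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 3: move each centroid to the mean of its assigned points
        # (assumes no cluster goes empty, which holds for this toy data).
        new_centroids = np.array([X[labels == i].mean(axis=0) for i in range(k)])
        # Step 4: stop once the centroids no longer move.
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

X = np.array([[1.0, 1.0], [1.2, 0.8], [8.0, 8.0], [8.2, 8.4]])
labels, centroids = kmeans(X, k=2)
print(labels)     # e.g. [0 0 1 1] -- cluster ids depend on initialization
print(centroids)  # e.g. [[1.1 0.9], [8.1 8.2]]
```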
Mathematical Formulation
Objective Function (Loss Function):
K-means minimizes the sum of squared distances between points and their assigned cluster centers:
$$
J = \sum_{i=1}^{K} \sum_{x_j \in C_i} \|x_j - \mu_i\|^2
$$
where:
- $K$ = number of clusters
- $C_i$ = the $i$-th cluster
- $x_j$ = a data point in cluster $C_i$
- $\mu_i$ = centroid of cluster $C_i$
Updating Cluster Centers: The new centroid $\mu_i$ is computed as:
$$
\mu_i = \frac{1}{|C_i|} \sum_{x_j \in C_i} x_j
$$
i.e., the mean of all points in the cluster.
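Computing $J$ for a given assignment is a one-liner. A small check with made-up numbers (the centroids below are exactly the cluster means, consistent with the update rule):

```python
import numpy as np

# Illustrative points, assignments, and centroids (chosen by hand, not fitted).
X = np.array([[1.0, 1.0], [1.2, 0.8], [8.0, 8.0], [8.2, 8.4]])
labels = np.array([0, 0, 1, 1])
centroids = np.array([[1.1, 0.9], [8.1, 8.2]])

# J = sum over clusters of squared distances to the assigned centroid.
J = sum(((X[labels == i] - c) ** 2).sum() for i, c in enumerate(centroids))
print(J)  # -> approx. 0.14
```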
Pros and Cons of K-means
Advantages:
- Simple and computationally efficient.
- Works well on large datasets.
- Produces interpretable results.
Disadvantages:
- Requires manually setting K.
- Sensitive to initialization (may converge to local optima).
- Struggles with non-convex cluster shapes.
Applications of K-means
- Customer segmentation (e.g., marketing analytics)
- Image segmentation (e.g., clustering colors in an image)
- Anomaly detection (e.g., identifying outliers)
KNN vs. K-means: Key Differences
| Feature | KNN (K-Nearest Neighbors) | K-means Clustering |
|---|---|---|
| Type | Supervised learning | Unsupervised learning |
| Purpose | Classification & Regression | Clustering |
| Training | No training required (lazy learning) | Requires iterative training |
| Prediction | Based on K nearest neighbors | Based on cluster centroids |
| Distance metric | Used to find nearest neighbors | Used to compute cluster assignments |
| Computation cost | Expensive at prediction time (scales with training-set size) | Expensive during training; assigning new points is cheap |
| Applications | Spam detection, recommendation systems | Market segmentation, image compression |
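The API contrast mirrors the table: a classifier must be fit on labels, while a clusterer is fit on features alone. A short sketch assuming scikit-learn is installed (the toy data is illustrative):

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.cluster import KMeans

X = np.array([[1.0, 1.0], [1.5, 2.0], [8.0, 8.0], [8.5, 9.0]])
y = np.array([0, 0, 1, 1])

# KNN: supervised -- fit() takes features AND labels.
knn = KNeighborsClassifier(n_neighbors=3).fit(X, y)
print(knn.predict([[1.2, 1.5]]))  # -> [0]

# K-means: unsupervised -- fit() takes features only.
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(km.labels_)           # cluster id per point (ids are arbitrary)
print(km.cluster_centers_)  # learned centroids
```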