About Machine Learning (Part 5: Support Vector Machine)

Support Vector Machine (SVM)

Support Vector Machines (SVMs) are among the most powerful supervised learning algorithms, used for both classification and regression tasks.

Problem Setup

In a binary classification problem, the goal of SVM is to find a hyperplane that best separates two classes. Given a training dataset:

$$
D = \{ (\mathbf{x}_1, y_1), (\mathbf{x}_2, y_2), \dots, (\mathbf{x}_n, y_n) \}, \quad \mathbf{x}_i \in \mathbb{R}^d, \quad y_i \in \{-1, +1\}
$$

  • $\mathbf{x}_i$: $d$-dimensional feature vector (e.g., pixel values in an image).
  • $y_i$: Class label ($+1$ for “cat”, $-1$ for “dog”); a toy example of this setup in code follows below.
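As a concrete illustration of this notation, here is a minimal sketch in NumPy (the points and labels are made up for demonstration):

```python
import numpy as np

# Hypothetical toy dataset D = {(x_i, y_i)}: four points in R^2
# with labels y_i in {-1, +1} (e.g., +1 = "cat", -1 = "dog").
X = np.array([[3.0, 3.0],
              [4.0, 3.0],
              [1.0, 1.0],
              [0.0, 2.0]])
y = np.array([+1, +1, -1, -1])

print(X.shape)  # (4, 2): n = 4 samples, d = 2 features
```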

The Hyperplane

A hyperplane in $\mathbb{R}^d$ is defined as:

$$
\mathbf{w} \cdot \mathbf{x} + b = 0, \quad \mathbf{x} \in \mathbb{R}^d
$$

  • $\mathbf{w}$: Weight vector (normal to the hyperplane).
  • $b$: Bias term (shifts the hyperplane away from the origin).
  • $\mathbf{x}$: Any point in the feature space.

Example: 2D Hyperplane

Consider the hyperplane $2x_1 + 3x_2 - 12 = 0$ in $\mathbb{R}^2$ (a line):

  • Normal Vector: $\mathbf{w} = [2, 3]$.
  • Bias: $b = -12$.

The weight vector $\mathbf{w}$ is perpendicular to the hyperplane.

(Figure: the line $2x_1 + 3x_2 - 12 = 0$ with its normal vector $\mathbf{w} = [2, 3]$.)
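A quick numerical check of this example (a minimal sketch; the test points are arbitrary):

```python
import numpy as np

w = np.array([2.0, 3.0])   # normal vector
b = -12.0                  # bias

# The sign of w·x + b tells which side of the line a point is on.
for x in [np.array([5.0, 5.0]), np.array([1.0, 1.0])]:
    print(x, "->", np.sign(w @ x + b))   # +1 and -1

# Two points on the line; a direction vector along the line
# is perpendicular to w (their dot product is 0).
p1, p2 = np.array([6.0, 0.0]), np.array([0.0, 4.0])
print(w @ (p2 - p1))  # 0.0
```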

Margin Maximization

The distance from a sample $\mathbf{x}_i$ to the hyperplane is:

$$
\text{Distance} = \frac{|\mathbf{w} \cdot \mathbf{x}_i + b|}{\|\mathbf{w}\|}.
$$
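In code, this distance is one line (a minimal sketch reusing the 2D example above):

```python
import numpy as np

def distance_to_hyperplane(x, w, b):
    """|w·x + b| / ||w||: distance from point x to the hyperplane."""
    return abs(w @ x + b) / np.linalg.norm(w)

w, b = np.array([2.0, 3.0]), -12.0
print(distance_to_hyperplane(np.array([5.0, 5.0]), w, b))
# |13| / sqrt(13) = sqrt(13) ≈ 3.606
```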

SVM seeks the hyperplane that maximizes the minimum margin between classes:

$$
\max_{\mathbf{w}, b} \frac{2}{\|\mathbf{w}\|} \quad \text{subject to} \quad y_i(\mathbf{w} \cdot \mathbf{x}_i + b) \geq 1, \quad \forall i.
$$

This is equivalent to minimizing $\|\mathbf{w}\|^2$, a convex optimization problem solvable via quadratic programming.


Linear Separation with Hard Margin

When data is perfectly linearly separable, SVM seeks the hyperplane with maximum margin:

$$
\min_{\mathbf{w}, b} \frac{1}{2}\|\mathbf{w}\|^2 \quad \text{subject to} \quad y_i(\mathbf{w}^T\mathbf{x}_i + b) \geq 1, \quad \forall i
$$

Geometric Interpretation:
The margin width is $\frac{2}{\|\mathbf{w}\|}$, so maximizing the margin is equivalent to minimizing $\|\mathbf{w}\|$.
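To make this concrete, here is a sketch that solves the hard-margin primal with SciPy's general-purpose SLSQP optimizer on a made-up separable dataset (in practice a dedicated QP solver or a library SVM would be used):

```python
import numpy as np
from scipy.optimize import minimize

# Hypothetical linearly separable toy data
X = np.array([[3.0, 3.0], [4.0, 3.0], [1.0, 1.0], [0.0, 2.0]])
y = np.array([+1.0, +1.0, -1.0, -1.0])

# Decision variables z = [w1, w2, b]; objective (1/2)||w||^2
objective = lambda z: 0.5 * (z[0] ** 2 + z[1] ** 2)

# One constraint per sample: y_i (w·x_i + b) - 1 >= 0
constraints = [
    {"type": "ineq", "fun": lambda z, xi=xi, yi=yi: yi * (z[:2] @ xi + z[2]) - 1.0}
    for xi, yi in zip(X, y)
]

res = minimize(objective, x0=np.zeros(3), method="SLSQP", constraints=constraints)
w, b = res.x[:2], res.x[2]
print("w =", w, "b =", b, "margin width =", 2 / np.linalg.norm(w))
```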

The Limitation:

Hard-margin SVM fails catastrophically when:

  • the data contains noise or outliers
  • the classes are inherently non-separable

Soft Margin SVM

Allow controlled violations using slack variables $\xi_i$:

$$
\begin{aligned}
\min_{\mathbf{w}, b, \xi} &\quad \frac{1}{2}\|\mathbf{w}\|^2 + C\sum_{i=1}^n \xi_i \\
\text{s.t.} &\quad y_i(\mathbf{w}^T\mathbf{x}_i + b) \geq 1 - \xi_i \\
&\quad \xi_i \geq 0, \quad \forall i
\end{aligned}
$$

  • $C$: Penalty weight (a large $C$ approximates the hard margin; see the sketch below)
  • $\xi_i$: How much the $i$-th sample violates the margin
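The effect of $C$ is easy to observe with scikit-learn's `SVC` (a sketch on synthetic overlapping blobs; the specific values of $C$ are arbitrary):

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Overlapping classes, so some slack is unavoidable
X, y = make_blobs(n_samples=100, centers=2, cluster_std=2.5, random_state=0)

for C in [0.01, 1.0, 100.0]:
    clf = SVC(kernel="linear", C=C).fit(X, y)
    print(f"C={C:>6}: ||w|| = {np.linalg.norm(clf.coef_[0]):.3f}, "
          f"support vectors = {clf.n_support_.sum()}")
# Small C tolerates many violations (wider margin, more support vectors);
# large C approaches hard-margin behavior.
```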

Lagrangian Formulation

Convert constraints into the objective function:

$$ \mathcal{L} = \frac{1}{2}\|\mathbf{w}\|^2 + C\sum_{i=1}^n \xi_i - \sum_{i=1}^n \alpha_i[y_i(\mathbf{w}^T\mathbf{x}_i + b) - 1 + \xi_i] - \sum_{i=1}^n \mu_i\xi_i $$

Key Derivations:

  1. Primal Variables

    • $\frac{\partial \mathcal{L}}{\partial \mathbf{w}} = 0 \Rightarrow \mathbf{w} = \sum \alpha_i y_i \mathbf{x}_i$
    • $\frac{\partial \mathcal{L}}{\partial b} = 0 \Rightarrow \sum \alpha_i y_i = 0$
    • $\frac{\partial \mathcal{L}}{\partial \xi_i} = 0 \Rightarrow \alpha_i + \mu_i = C$
  2. Dual Problem
    Substitute back to get:

$$
\max_{\alpha} \sum_{i=1}^n \alpha_i - \frac{1}{2}\sum_{i,j} \alpha_i \alpha_j y_i y_j \mathbf{x}_i^T \mathbf{x}_j \\
\text{s.t.} \quad 0 \leq \alpha_i \leq C, \quad \sum \alpha_i y_i = 0
$$
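The stationarity condition $\mathbf{w} = \sum \alpha_i y_i \mathbf{x}_i$ can be verified empirically with scikit-learn, whose `dual_coef_` attribute stores $\alpha_i y_i$ for the support vectors (a sketch on synthetic data):

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

X, y = make_blobs(n_samples=50, centers=2, random_state=1)
clf = SVC(kernel="linear", C=1.0).fit(X, y)

# dual_coef_ holds alpha_i * y_i for the support vectors, so
# w = sum_i alpha_i y_i x_i = dual_coef_ @ support_vectors_
w_from_dual = clf.dual_coef_ @ clf.support_vectors_
print(np.allclose(w_from_dual, clf.coef_))  # True
```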

Interpretation of $\alpha_i$

| $\alpha_i$ Range | Sample Status | $\xi_i$ Value |
| --- | --- | --- |
| $\alpha_i = 0$ | Outside margin | $0$ |
| $\alpha_i \in (0, C)$ | On margin | $0$ |
| $\alpha_i = C$ | Inside margin | $> 0$ |
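This classification of samples by $\alpha_i$ can be checked numerically (a sketch; for support vectors with $\alpha_i < C$ the functional margin $y_i f(\mathbf{x}_i)$ should be $\approx 1$):

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

X, y = make_blobs(n_samples=40, centers=2, cluster_std=2.0, random_state=0)
y = 2 * y - 1                                # relabel to {-1, +1}
C = 1.0
clf = SVC(kernel="linear", C=C).fit(X, y)

alpha = np.abs(clf.dual_coef_[0])            # alpha_i of the support vectors
margin = y[clf.support_] * clf.decision_function(clf.support_vectors_)
for a, m in zip(alpha, margin):
    status = "on margin" if a < C else "inside margin / violating"
    print(f"alpha = {a:.3f} ({status}), y_i f(x_i) = {m:.3f}")
```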

Decision Function:

$$
f(\mathbf{x}) = \text{sign}( \sum_{\alpha_i > 0} \alpha_i y_i \mathbf{x}_i^T \mathbf{x} + b)
$$
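Because only support vectors have $\alpha_i > 0$, the decision function can be evaluated from them alone. A sketch reproducing scikit-learn's `decision_function` by hand:

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

X, y = make_blobs(n_samples=50, centers=2, random_state=1)
clf = SVC(kernel="linear", C=1.0).fit(X, y)

def decision(x):
    # f(x) = sum_{alpha_i > 0} alpha_i y_i x_i^T x + b
    return clf.dual_coef_[0] @ (clf.support_vectors_ @ x) + clf.intercept_[0]

x_new = np.array([0.0, 0.0])
print(decision(x_new), clf.decision_function([x_new])[0])  # same value
print(np.sign(decision(x_new)))  # class under the {-1, +1} coding
```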

Nonlinear Classification with Kernel Trick

The Fundamental Idea

Problem: Many datasets require nonlinear boundaries.
Solution: Map data to higher dimension $\phi(\mathbf{x})$ where linear separation becomes possible.

Example Transformation:
For $\mathbf{x} = [x_1, x_2]$, use $\phi(\mathbf{x}) = [x_1, x_2, x_1^2 + x_2^2]$
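A quick demonstration that this particular map works: points inside the unit circle become linearly separable from points outside it once the third coordinate $x_1^2 + x_2^2$ is added (synthetic data for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.uniform(-2, 2, size=(200, 2))
y = np.where((X ** 2).sum(axis=1) < 1.0, 1, -1)  # +1 inside unit circle

# phi(x) = [x1, x2, x1^2 + x2^2]: in 3D the plane z3 = 1 separates the classes
phi = np.column_stack([X, (X ** 2).sum(axis=1)])
print(np.all((phi[:, 2] < 1.0) == (y == 1)))  # True: linearly separable in 3D
```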

The Computational Challenge

Explicitly computing $\phi(\mathbf{x})$ and the inner products $\phi(\mathbf{x}_i)^T \phi(\mathbf{x}_j)$ becomes intractable when the mapped space is very high-dimensional (or infinite-dimensional, as with the RBF kernel).

Key Insight: Many ML algorithms (like SVM) only need inner products, not explicit coordinates.

Kernel Functions to the Rescue

Replace every inner product $\phi(\mathbf{x}_i)^T \phi(\mathbf{x}_j)$ with a kernel function $K(\mathbf{x}_i, \mathbf{x}_j)$ that evaluates it directly in the original input space:

Updated Dual Problem:

$$
\max_{\alpha} \sum_{i=1}^n \alpha_i - \frac{1}{2}\sum_{i,j} \alpha_i \alpha_j y_i y_j K(\mathbf{x}_i, \mathbf{x}_j)
$$

Common Kernels:

| Kernel | Formula | Characteristics |
| --- | --- | --- |
| Linear | $K(\mathbf{x}, \mathbf{z}) = \mathbf{x}^T\mathbf{z}$ | No transformation |
| Polynomial | $K(\mathbf{x}, \mathbf{z}) = (\mathbf{x}^T\mathbf{z} + c)^d$ | Captures polynomial interactions |
| RBF | $K(\mathbf{x}, \mathbf{z}) = \exp(-\gamma \|\mathbf{x}-\mathbf{z}\|^2)$ | Infinite-dimensional mapping |
| Sigmoid | $K(\mathbf{x}, \mathbf{z}) = \tanh(\alpha \mathbf{x}^T\mathbf{z} + c)$ | Mimics neural networks |
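The identity behind the trick is easy to verify for the degree-2 homogeneous polynomial kernel ($c = 0$, $d = 2$), whose feature map in 2D is $\phi(\mathbf{x}) = [x_1^2, \sqrt{2}\,x_1 x_2, x_2^2]$ (a sketch with random vectors):

```python
import numpy as np

rng = np.random.default_rng(0)
x, z = rng.normal(size=2), rng.normal(size=2)

# K(x, z) = (x^T z)^2 equals the inner product in the lifted space,
# which the kernel never has to construct explicitly.
phi = lambda v: np.array([v[0] ** 2, np.sqrt(2) * v[0] * v[1], v[1] ** 2])

print((x @ z) ** 2)      # kernel value, computed in the original space
print(phi(x) @ phi(z))   # identical value via the explicit feature map
```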

Why Kernels Work: Mercer’s Theorem

A valid kernel must:

  1. Be symmetric: $K(\mathbf{x}, \mathbf{z}) = K(\mathbf{z}, \mathbf{x})$
  2. Produce a positive semi-definite Gram matrix for any finite set of inputs

Practical Check: A kernel that violates these conditions makes the dual problem non-concave; solvers may fail to converge or return meaningless solutions. A direct test is to verify that the Gram matrix on your data is symmetric and has no negative eigenvalues, as sketched below.
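A sketch of such a check for the RBF kernel: compute the Gram matrix on sample data and confirm symmetry and non-negative eigenvalues (numerical tolerance aside):

```python
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 2))

K = rbf_kernel(X, gamma=0.5)          # Gram matrix K_ij = K(x_i, x_j)
eigvals = np.linalg.eigvalsh(K)       # eigenvalues of a symmetric matrix
print(np.allclose(K, K.T))            # symmetric: True
print(eigvals.min() >= -1e-10)        # positive semi-definite: True
```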
