Support Vector Machine idea: maximises the margin and is more robust to "perturbations"
Euclidean distance between a point $x$ and the hyperplane parametrised by $W$ is:

$$\frac{|W^T x + b|}{\|W\|_2}$$

Assuming $\|W\|_2 = 1$, the distance is simply $|W^T x + b|$
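A quick numerical check of this formula, as a minimal sketch (the vectors below are made-up example values, not from the notes):

```python
# distance from a point x to the hyperplane {z : W^T z + b = 0}
# using |W^T x + b| / ||W||_2  (made-up example values)
import numpy as np

W = np.array([3.0, 4.0])   # ||W||_2 = 5
b = -1.0
x = np.array([2.0, 1.0])

distance = np.abs(W @ x + b) / np.linalg.norm(W)
print(distance)  # |3*2 + 4*1 - 1| / 5 = 1.8
```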
SVMs are good for high-dimensional data
We can solve this with a convex (QP) solver or with (sub)gradient descent
maximum margin hyperplane
$W$ has $\gamma$ margin if

$$\begin{aligned}
W^T x + b \ge \gamma \quad &\forall \text{ blue } x \\
W^T x + b \le -\gamma \quad &\forall \text{ red } x
\end{aligned}$$
Margin:
$Z = \{(x^{i}, y^{i})\}_{i=1}^{n}$, $y \in \{-1, 1\}$, $\|W\|_2 = 1$
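Written out explicitly (standard formulation, spelled out here for completeness), the margin of $(W, b)$ on $Z$ and the max-margin problem are

$$\operatorname{margin}(W, b) = \min_{i \in [n]} y^{i}\,(W^T x^{i} + b), \qquad \max_{\|W\|_2 = 1,\ b}\ \min_{i \in [n]} y^{i}\,(W^T x^{i} + b)$$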
hard-margin SVM
this is the version with bias
"\\begin{algorithm}\n\\caption{Hard-SVM}\n\\begin{algorithmic}\n\\REQUIRE Training set $(\\mathbf{x}_1, y_1),\\ldots,(\\mathbf{x}_m, y_m)$\n\\STATE \\textbf{solve:} $(w_{0},b_{0}) = \\argmin\\limits_{(w,b)} \\|w\\|^2 \\text{ s.t } \\forall i, y_{i}(\\langle{w,x_i} \\rangle + b) \\ge 1$\n\\STATE \\textbf{output:} $\\hat{w} = \\frac{w_0}{\\|w_0\\|}, \\hat{b} = \\frac{b_0}{\\|w_0\\|}$\n\\end{algorithmic}\n\\end{algorithm}"
Algorithm 2 Hard-SVM

Require: Training set $(\mathbf{x}_1, y_1),\ldots,(\mathbf{x}_m, y_m)$

1. solve: $(w_{0},b_{0}) = \argmin_{(w,b)} \|w\|^2 \text{ s.t. } \forall i,\ y_{i}(\langle w, x_i \rangle + b) \ge 1$
2. output: $\hat{w} = \frac{w_0}{\|w_0\|}, \hat{b} = \frac{b_0}{\|w_0\|}$
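A minimal sketch of solving this QP with a generic convex solver (cvxpy here; the function name `hard_svm` and the inputs `X`, `y` are illustrative assumptions, and the data must be linearly separable for the problem to be feasible):

```python
# Hard-SVM as a quadratic program: min ||w||^2 s.t. y_i(<w, x_i> + b) >= 1
import numpy as np
import cvxpy as cp

def hard_svm(X: np.ndarray, y: np.ndarray):
    m, d = X.shape
    w = cp.Variable(d)
    b = cp.Variable()
    constraints = [cp.multiply(y, X @ w + b) >= 1]
    cp.Problem(cp.Minimize(cp.sum_squares(w)), constraints).solve()
    w0, b0 = w.value, b.value
    # output the normalised solution, as in the algorithm above
    return w0 / np.linalg.norm(w0), b0 / np.linalg.norm(w0)
```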
Note that this version is sensitive to outliers
it assumes that the training set is linearly separable
soft-margin SVM
can be applied even if the training set is not linearly separable
"\\begin{algorithm}\n\\caption{Soft-SVM}\n\\begin{algorithmic}\n\\REQUIRE Input $(\\mathbf{x}_1, y_1),\\ldots,(\\mathbf{x}_m, y_m)$\n\\STATE \\textbf{parameter:} $\\lambda > 0$\n\\STATE \\textbf{solve:} $\\min_{\\mathbf{w}, b, \\boldsymbol{\\xi}} \\left( \\lambda \\|\\mathbf{w}\\|^2 + \\frac{1}{m} \\sum_{i=1}^m \\xi_i \\right)$\n\\STATE \\textbf{s.t: } $\\forall i, \\quad y_i (\\langle \\mathbf{w}, \\mathbf{x}_i \\rangle + b) \\geq 1 - \\xi_i \\quad \\text{and} \\quad \\xi_i \\geq 0$\n\\STATE \\textbf{output:} $\\mathbf{w}, b$\n\\end{algorithmic}\n\\end{algorithm}"
Algorithm 3 Soft-SVM

Require: Input $(\mathbf{x}_1, y_1),\ldots,(\mathbf{x}_m, y_m)$

1. parameter: $\lambda > 0$
2. solve: $\min_{\mathbf{w}, b, \boldsymbol{\xi}} \left( \lambda \|\mathbf{w}\|^2 + \frac{1}{m} \sum_{i=1}^m \xi_i \right)$
3. s.t.: $\forall i,\ y_i (\langle \mathbf{w}, \mathbf{x}_i \rangle + b) \geq 1 - \xi_i$ and $\xi_i \geq 0$
4. output: $\mathbf{w}, b$
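The same sketch extended with the slack variables $\xi_i$ (again via cvxpy; `lam` plays the role of $\lambda$ and its default value is an arbitrary choice):

```python
# Soft-SVM: min lambda*||w||^2 + (1/m) sum xi_i
# s.t. y_i(<w, x_i> + b) >= 1 - xi_i and xi_i >= 0
import numpy as np
import cvxpy as cp

def soft_svm(X: np.ndarray, y: np.ndarray, lam: float = 0.1):
    m, d = X.shape
    w, b, xi = cp.Variable(d), cp.Variable(), cp.Variable(m)
    objective = cp.Minimize(lam * cp.sum_squares(w) + cp.sum(xi) / m)
    constraints = [cp.multiply(y, X @ w + b) >= 1 - xi, xi >= 0]
    cp.Problem(objective, constraints).solve()
    return w.value, b.value
```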
Equivalent form of soft-margin SVM:
$$\begin{aligned}
\min_{w} \ &\left(\lambda \|w\|^2 + L_S^{\text{hinge}}(w)\right) \\
L_{S}^{\text{hinge}}(w) &= \frac{1}{m} \sum_{i=1}^{m} \max\{0, 1 - y_i \langle w, x_i \rangle\}
\end{aligned}$$
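Since this form is an unconstrained objective, it can also be minimised by subgradient descent, along the lines of the sketch below (step size, epoch count, and $\lambda$ are arbitrary placeholder values; the bias is dropped, as in the formula above):

```python
# regularised hinge loss: lambda*||w||^2 + (1/m) sum max{0, 1 - y_i <w, x_i>}
import numpy as np

def hinge_svm_subgradient(X, y, lam=0.01, lr=0.1, epochs=200):
    m, d = X.shape
    w = np.zeros(d)
    for _ in range(epochs):
        margins = y * (X @ w)          # y_i <w, x_i>
        active = margins < 1           # points with non-zero hinge loss
        # subgradient: 2*lam*w - (1/m) * sum_{active} y_i x_i
        grad = 2 * lam * w - (y[active][:, None] * X[active]).sum(axis=0) / m
        w -= lr * grad
    return w
```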
SVM with basis functions
$$\min_{w} \frac{1}{n} \sum_{i=1}^{n} \max \{0, 1 - y^i \langle w, \phi(x^i) \rangle\} + \lambda \|w\|^2_2$$

$\phi(x^i)$ can be high-dimensional
representer theorem
$$W^{*} = \argmin_{w} \frac{1}{n} \sum_{i=1}^{n} \max \{0, 1 - y^i \langle w, \phi(x^i) \rangle\} + \lambda \|w\|^2_2$$

There are real values $a_{1},\ldots,a_{n}$ such that[^1]

$$W^{*} = \sum_{i=1}^{n} a_i \phi(x^i)$$
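Plugging $W^{*} = \sum_j a_j \phi(x^j)$ back into the objective (a standard step, spelled out here for completeness) shows that only inner products of features are ever needed:

$$\langle W^{*}, \phi(x^i) \rangle = \sum_{j} a_j \langle \phi(x^j), \phi(x^i) \rangle, \qquad \|W^{*}\|_2^2 = \sum_{i,j} a_i a_j \langle \phi(x^i), \phi(x^j) \rangle$$

so with $\mathbf{K}_{ij} = \langle \phi(x^i), \phi(x^j) \rangle$ the problem becomes a minimisation over $a \in \mathbb{R}^n$:

$$\min_{a} \frac{1}{n} \sum_{i=1}^{n} \max\{0, 1 - y^i (\mathbf{K} a)_i\} + \lambda\, a^T \mathbf{K} a$$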
kernelized SVM
from the representer theorem, we have the kernel:
$$K(x,z) = \langle \phi(x), \phi(z) \rangle$$
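A small sketch of working with the kernel directly, never forming $\phi(x)$ (the RBF kernel and the helper names are illustrative choices, not from the notes):

```python
# kernel evaluations and the Gram matrix K_{n x n}
import numpy as np

def rbf_kernel(x, z, gamma=1.0):
    # K(x, z) = <phi(x), phi(z)> computed implicitly
    return np.exp(-gamma * np.sum((x - z) ** 2))

def gram_matrix(X, kernel=rbf_kernel):
    n = X.shape[0]
    return np.array([[kernel(X[i], X[j]) for j in range(n)] for i in range(n)])

def predict(a, X_train, x, kernel=rbf_kernel):
    # f(x) = <W*, phi(x)> = sum_j a_j K(x^j, x)
    return sum(a_j * kernel(x_j, x) for a_j, x_j in zip(a, X_train))
```

This also makes the drawbacks below concrete: prediction touches every training point, and the $n \times n$ Gram matrix has to be built and stored.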
drawbacks
prediction-time complexity
need to store all training data
Dealing with $\mathbf{K}_{n \times n}$
choice of kernel, which is tricky and often fairly heuristic
[^1]: note that we can also write $a^T \phi$ where $\phi = [\phi(x^1),\ldots,\phi(x^n)]^T$