MultiMarginLoss
Creates a criterion that optimizes a multi-class classification hinge loss (margin-based loss) between input $x$ (a 2D mini-batch Tensor) and output $y$ (a 1D tensor of target class indices, $0 \le y \le \text{x}.\text{size}(1) - 1$):
For each mini-batch sample, the loss in terms of the 1D input $x$ and scalar output $y$ is:

$$
\text{loss}(x, y) = \frac{\sum_{i} \max(0, \text{margin} - x[y] + x[i])^p}{x.\text{size}(0)}
$$

where $i \in \{0, \ldots, x.\text{size}(0) - 1\}$ and $i \neq y$.
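To make the per-sample formula concrete, here is a minimal plain-Python sketch of it. The function name `multi_margin_loss` and the defaults `p=1` and `margin=1.0` are assumptions chosen for illustration. This is not PyTorch's implementation, and it handles a single sample rather than a mini-batch.

```python
def multi_margin_loss(x, y, p=1, margin=1.0):
    """Per-sample multi-class hinge loss for a 1D list of scores x and a target index y."""
    total = 0.0
    for i, score in enumerate(x):
        if i == y:
            continue  # the target class is excluded from the sum (i != y)
        # Penalise any class whose score comes within `margin` of the target score.
        total += max(0.0, margin - x[y] + score) ** p
    return total / len(x)  # divide by the number of classes, x.size(0)


scores = [0.1, 0.2, 0.4, 0.8]  # one sample, four classes
print(multi_margin_loss(scores, y=3))
```

With these scores and target class 3, the three non-target terms are 0.3, 0.4 and 0.6, so the loss is approximately 1.3 / 4 = 0.325.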
SGD
Nesterov momentum is based on the formula from the paper "On the importance of initialization and momentum in deep learning".

\begin{algorithm}
\caption{SGD in PyTorch}
\begin{algorithmic}
\State \textbf{input:} $\gamma$ (lr), $\theta_0$ (params), $f(\theta)$ (objective), $\lambda$ (weight decay),
\State $\mu$ (momentum), $\tau$ (dampening), nesterov, maximize
\For{$t = 1$ to $...$}
    \State $g_t \gets \nabla_\theta f_t(\theta_{t-1})$
    \If{$\lambda \neq 0$}
        \State $g_t \gets g_t + \lambda\theta_{t-1}$
    \EndIf
    \If{$\mu \neq 0$}
        \If{$t > 1$}
            \State $b_t \gets \mu b_{t-1} + (1-\tau)g_t$
        \Else
            \State $b_t \gets g_t$
        \EndIf
        \If{$\text{nesterov}$}
            \State $g_t \gets g_t + \mu b_t$
        \Else
            \State $g_t \gets b_t$
        \EndIf
    \EndIf
    \If{$\text{maximize}$}
        \State $\theta_t \gets \theta_{t-1} + \gamma g_t$
    \Else
        \State $\theta_t \gets \theta_{t-1} - \gamma g_t$
    \EndIf
\EndFor
\State \textbf{return} $\theta_t$
\end{algorithmic}
\end{algorithm}
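The update rule in the algorithm above can be traced by hand with a short plain-Python sketch. The function `sgd_step`, its argument names, and the use of flat lists of floats in place of tensors are all assumptions made for illustration. It follows the pseudocode as written and is not the code of `torch.optim.SGD`.

```python
def sgd_step(params, grads, bufs, lr, momentum=0.0, dampening=0.0,
             weight_decay=0.0, nesterov=False, maximize=False):
    """Apply one iteration of the algorithm to flat lists of floats.

    bufs holds the momentum buffer b_{t-1} per parameter, or None before the first step.
    Returns the updated parameters and momentum buffers.
    """
    new_params, new_bufs = [], []
    for theta, g, b in zip(params, grads, bufs):
        if weight_decay != 0:
            g = g + weight_decay * theta              # g_t <- g_t + lambda * theta_{t-1}
        if momentum != 0:
            if b is None:
                b = g                                 # first step: b_t <- g_t
            else:
                b = momentum * b + (1 - dampening) * g
            g = g + momentum * b if nesterov else b
        theta = theta + lr * g if maximize else theta - lr * g
        new_params.append(theta)
        new_bufs.append(b)
    return new_params, new_bufs


# Minimise f(theta) = theta^2, whose gradient is 2 * theta.
params, bufs = [1.0], [None]
for _ in range(3):
    grads = [2.0 * t for t in params]
    params, bufs = sgd_step(params, grads, bufs, lr=0.1, momentum=0.9)
    print(params[0])
```

Starting from 1.0 with lr 0.1 and momentum 0.9, the parameter moves to approximately 0.8, then 0.46, then 0.062. The momentum buffer grows from 2.0 to 3.4 to 3.98 over the three steps, so the effective step size increases even though the raw gradient shrinks.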