Logistic regression 😄 fun fact: despite the name, it is used for classification, not regression
Assume there is a hyperplane in $\mathbb{R}^d$ parameterized by $W$:
$$
\begin{aligned}
P(Y = 1 \mid x, W) &= \phi (W^T x) \\
P(Y = 0 \mid x, W) &= 1 - \phi (W^T x) \\[12pt]
&\because \phi (a) = \frac{1}{1+e^{-a}}
\end{aligned}
$$
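A minimal numpy sketch of this model (the function names are mine, and the bias term is assumed to be folded into $W$ via a constant feature):

```python
import numpy as np

def sigmoid(a):
    """Logistic function phi(a) = 1 / (1 + exp(-a))."""
    return 1.0 / (1.0 + np.exp(-a))

def p_y1_given_x(x, W):
    """P(Y = 1 | x, W) = phi(W^T x); P(Y = 0 | x, W) is its complement."""
    return sigmoid(W @ x)
```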
maximum likelihood
Useful identity: $1 - \phi (a) = \phi (-a)$, so with labels $y \in \{-1, +1\}$ both cases collapse to $P(y \mid x, W) = \phi(y\, W^T x)$.
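A quick numerical check of the identity (pure numpy, `phi` defined inline):

```python
import numpy as np

phi = lambda a: 1.0 / (1.0 + np.exp(-a))
a = np.linspace(-5.0, 5.0, 11)
# 1 - phi(a) == phi(-a), which is why P(y | x, W) = phi(y * W^T x) for y in {-1, +1}
assert np.allclose(1.0 - phi(a), phi(-a))
```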
$$
\begin{aligned}
W^{\text{ML}} &= \argmax_{W} \prod_i P(x^i, y^i \mid W) \\
&= \argmax_{W} \prod_i \frac{P(x^i, y^i, W)}{P(W)} \\
&= \argmax_{W} \prod_i P(y^i \mid x^i, W)\, P(x^i) \\
&= \argmax_{W} \Big[ \prod_i P(x^i) \Big] \Big[ \prod_i P(y^i \mid x^i, W) \Big] \\
&= \argmax_{W} \sum_{i=1}^{n} \log \phi (y^i W^T x^i)
\end{aligned}
$$

The third line assumes $x^i$ is independent of $W$, so $P(x^i \mid W) = P(x^i)$; the last line drops $\prod_i P(x^i)$ (it does not depend on $W$), takes the log, and applies the identity above with $y^i \in \{-1, +1\}$.
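A sketch of the resulting objective and its gradient, assuming labels $y^i \in \{-1, +1\}$ and examples stacked as rows of `X` (all names here are mine):

```python
import numpy as np

def log_likelihood(W, X, y):
    """sum_i log phi(y_i * W^T x_i) with y_i in {-1, +1}.

    X: (n, d) matrix, one example per row; y: (n,); W: (d,).
    """
    margins = y * (X @ W)
    # log phi(a) = -log(1 + exp(-a)); logaddexp keeps this numerically stable
    return -np.logaddexp(0.0, -margins).sum()

def grad_log_likelihood(W, X, y):
    """Gradient: sum_i (1 - phi(y_i W^T x_i)) * y_i * x_i."""
    margins = y * (X @ W)
    coeff = y * (1.0 - 1.0 / (1.0 + np.exp(-margins)))
    return X.T @ coeff
```

There is no closed-form maximizer, so $W^{\text{ML}}$ is found by gradient ascent on this sum (equivalently, gradient descent on its negation).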
Equivalently, with labels $y^i \in \{0, 1\}$ and $p^i = \phi(W^T x^i)$, maximize the following:

$$
\sum_{i=1}^{n} \left( y^i \log p^i + (1-y^i) \log (1-p^i) \right)
$$
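A small check that this $\{0,1\}$-encoded form agrees with the $\pm 1$ form from the derivation above (random data, mapping labels via $y^{01} = (y+1)/2$):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
W = rng.normal(size=3)
y = rng.choice([-1.0, 1.0], size=50)   # labels in {-1, +1}
y01 = (y + 1.0) / 2.0                  # same labels re-encoded in {0, 1}

p = 1.0 / (1.0 + np.exp(-(X @ W)))     # p_i = phi(W^T x_i)
cross_entropy = np.sum(y01 * np.log(p) + (1.0 - y01) * np.log(1.0 - p))
pm_form = -np.logaddexp(0.0, -y * (X @ W)).sum()   # sum_i log phi(y_i W^T x_i)
assert np.isclose(cross_entropy, pm_form)
```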
softmax
$$
\text{softmax}(y)_i = \frac{e^{y_i}}{\sum_{j=1}^{k} e^{y_j}}
$$

where $y \in \mathbb{R}^k$
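A sketch of a numerically stable implementation; subtracting the max before exponentiating changes nothing because softmax is invariant to adding a constant to every coordinate:

```python
import numpy as np

def softmax(y):
    """softmax(y)_i = exp(y_i) / sum_j exp(y_j) for y in R^k."""
    z = y - np.max(y)        # shift-invariance: avoids overflow in exp
    e = np.exp(z)
    return e / e.sum()

probs = softmax(np.array([1.0, 2.0, 3.0]))
assert np.isclose(probs.sum(), 1.0)    # outputs form a probability distribution
```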