probabilistic modeling example: assume each class is a Gaussian
discriminant analysis
$$P(x \mid y = 1, \mu_0, \mu_1, \beta) = \frac{1}{a_0} e^{-\|x - \mu_1\|^2_2}$$
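A minimal numpy sketch of this class-conditional density, assuming an isotropic Gaussian; the function name and the constant `a0` are illustrative, mirroring the normalizer in the formula above:

```python
import numpy as np

def class_conditional(x, mu_k, a0):
    """Isotropic Gaussian P(x | y = k, ...) = exp(-||x - mu_k||^2) / a0.

    x, mu_k: arrays of shape (d,); a0 is the normalizing constant above."""
    return np.exp(-np.sum((x - mu_k) ** 2)) / a0
```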
maximum likelihood estimate
see also prior and posterior distributions
given $\Theta = \{\mu_1, \mu_2, \beta\}$:
$$\argmax_{\Theta} P(Z \mid \Theta) = \argmax_{\Theta} \prod_{i=1}^{n} P(x^i, y^i \mid \Theta)$$
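A sketch of computing these estimates, assuming the standard result that for Gaussian class-conditionals the maximum likelihood class means are the per-class sample means (the function name is hypothetical):

```python
import numpy as np

def fit_gaussian_classes(X, y):
    """ML estimates under the model above: per-class sample means,
    plus class priors from label frequencies.

    X: (n, d) data matrix; y: (n,) integer class labels."""
    classes = np.unique(y)
    mus = {k: X[y == k].mean(axis=0) for k in classes}
    priors = {k: float(np.mean(y == k)) for k in classes}
    return mus, priors
```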
How can we predict the label of a new test point?
Or in other words, how can we run inference?
Check whether
$$\frac{P(y=0 \mid X, \Theta)}{P(y=1 \mid X, \Theta)} \ge 1$$
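A sketch of this inference rule under the isotropic-Gaussian model above: by Bayes' rule the posterior ratio reduces to comparing class log-scores, since the shared evidence term $P(x)$ cancels (the helper names are illustrative):

```python
import numpy as np

def predict_label(x, mus, priors):
    """Return 0 if P(y=0 | x, Theta) / P(y=1 | x, Theta) >= 1, else 1.

    Compares log P(x | y=k) + log P(y=k); constants cancel in the ratio."""
    def log_score(k):
        return -np.sum((x - mus[k]) ** 2) + np.log(priors[k])
    return 0 if log_score(0) >= log_score(1) else 1
```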
Generalization for correlated features
Gaussian for correlated features:
$$\mathcal{N}(x \mid \mu, \Sigma) = \frac{1}{(2 \pi)^{d/2} |\Sigma|^{1/2}} \exp\left(-\frac{1}{2}(x-\mu)^T \Sigma^{-1} (x-\mu)\right)$$
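A direct numpy translation of this density (a sketch; it assumes $\Sigma$ is positive definite and does not guard against singular covariances):

```python
import numpy as np

def mvn_density(x, mu, Sigma):
    """N(x | mu, Sigma) for a d-dimensional Gaussian with correlated features."""
    d = mu.shape[0]
    diff = x - mu
    quad = diff @ np.linalg.solve(Sigma, diff)            # (x - mu)^T Sigma^{-1} (x - mu)
    norm = (2 * np.pi) ** (d / 2) * np.linalg.det(Sigma) ** 0.5
    return np.exp(-0.5 * quad) / norm
```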
Naive Bayes Classifier
Given the label, the coordinates are statistically independent
$$P(x \mid y = k, \Theta) = \prod_j P(x_j \mid y = k, \Theta)$$
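A sketch of this factorized likelihood, assuming each coordinate $x_j$ given the class is a 1-D Gaussian (one common instantiation of naive Bayes; the per-class means and variances are assumed already estimated):

```python
import numpy as np

def naive_bayes_likelihood(x, mu_k, var_k):
    """P(x | y = k, Theta) = prod_j P(x_j | y = k, Theta),
    with each factor a univariate Gaussian N(x_j | mu_kj, var_kj)."""
    per_coord = np.exp(-0.5 * (x - mu_k) ** 2 / var_k) / np.sqrt(2 * np.pi * var_k)
    return np.prod(per_coord)
```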
idea: comparison between discriminative and generative models
😄 fun fact: actually better suited to classification than to regression problems
Assume there is a plane in $\mathbb{R}^d$ parameterized by $W$
$$\begin{aligned}
P(Y = 1 \mid x, W) &= \phi(W^T x) \\
P(Y = 0 \mid x, W) &= 1 - \phi(W^T x) \\[12pt]
&\because \phi(a) = \frac{1}{1 + e^{-a}}
\end{aligned}$$
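A minimal sketch of these two conditionals (the function names are illustrative):

```python
import numpy as np

def phi(a):
    """Logistic sigmoid phi(a) = 1 / (1 + exp(-a))."""
    return 1.0 / (1.0 + np.exp(-a))

def predict_proba(x, W):
    """Return [P(Y=0 | x, W), P(Y=1 | x, W)] = [1 - phi(W^T x), phi(W^T x)]."""
    p1 = phi(W @ x)
    return np.array([1.0 - p1, p1])
```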
maximum likelihood
$$1 - \phi(a) = \phi(-a)$$
$$\begin{aligned}
W^{\text{ML}} &= \argmax_{W} \prod P(x^i, y^i \mid W) \\
&= \argmax_{W} \prod \frac{P(x^i, y^i, W)}{P(W)} \\
&= \argmax_{W} \prod P(y^i \mid x^i, W) P(x^i) \\
&= \argmax_{W} \left[\prod P(x^i)\right] \left[\prod P(y^i \mid x^i, W)\right] \\
&= \argmax_{W} \sum_{i=1}^{n} \log\left(\phi(y^i W^T x^i)\right)
\end{aligned}$$

The $\prod P(x^i)$ factor does not depend on $W$, so it drops out; the last line takes labels $y^i \in \{-1, +1\}$, so that $P(y^i \mid x^i, W) = \phi(y^i W^T x^i)$ by the identity $1 - \phi(a) = \phi(-a)$.
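Since there is no closed form for $W^{\text{ML}}$, a common approach is gradient ascent; a sketch assuming labels $y^i \in \{-1, +1\}$ (the step size and iteration count are arbitrary choices, not from the notes):

```python
import numpy as np

def fit_logistic_mle(X, y, lr=0.1, steps=1000):
    """Gradient ascent on sum_i log phi(y^i W^T x^i) for labels y^i in {-1, +1}.

    Uses d/dW log phi(z_i) = (1 - phi(z_i)) * y^i * x^i with z_i = y^i W^T x^i."""
    phi = lambda a: 1.0 / (1.0 + np.exp(-a))
    n, d = X.shape
    W = np.zeros(d)
    for _ in range(steps):
        z = y * (X @ W)                      # z_i = y^i W^T x^i
        grad = X.T @ (y * (1.0 - phi(z)))    # gradient of the log-likelihood
        W += lr * grad / n
    return W
```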
maximize the following (where $y^i \in \{0, 1\}$ and $p^i = \phi(W^T x^i)$):
$$\sum_{i=1}^{n} \left(y^i \log p^i + (1 - y^i) \log(1 - p^i)\right)$$
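The same objective in code, a sketch assuming $y^i \in \{0, 1\}$ and predicted probabilities $p^i$ (the clipping epsilon is a numerical guard, not part of the formula):

```python
import numpy as np

def log_likelihood_01(y, p, eps=1e-12):
    """sum_i (y^i log p^i + (1 - y^i) log(1 - p^i)), clipped for stability."""
    p = np.clip(p, eps, 1.0 - eps)
    return np.sum(y * np.log(p) + (1.0 - y) * np.log(1.0 - p))
```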
softmax
$$\text{softmax}(y)_i = \frac{e^{y_i}}{\sum_{j} e^{y_j}}$$
where $y \in \mathbb{R}^k$
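A sketch of the softmax, with the usual max-subtraction trick for numerical stability (which does not change the output):

```python
import numpy as np

def softmax(y):
    """softmax(y)_i = exp(y_i) / sum_j exp(y_j) for y in R^k."""
    z = np.exp(y - np.max(y))   # shift by max(y) for numerical stability
    return z / np.sum(z)
```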