
Think of using autoencoders to extract representations.

sparsity allows us to interpret the hidden layers and internal representations of Transformer models.

```mermaid
graph TD
    subgraph Encoder
        A[Input X] --> B[Layer 1]
        B --> C[Layer 2]
    end
    C --> D[Latent Features Z]
    D --> E[Layer 3]
    subgraph Decoder
        E --> F[Layer 4]
    end
    F --> G[Output X']

    style D fill:#c9a2d8,stroke:#000,stroke-width:2px,color:#fff
    style A fill:#98FB98,stroke:#000,stroke-width:2px
    style G fill:#F4A460,stroke:#000,stroke-width:2px
```

see also latent space

definition

$$\begin{aligned} \text{Enc}_{\Theta_1}&: \mathbb{R}^d \to \mathbb{R}^q \\ \text{Dec}_{\Theta_2}&: \mathbb{R}^q \to \mathbb{R}^d \\[12pt] &\because q \ll d \end{aligned}$$

loss function: $l(x) = \|\text{Dec}_{\Theta_2}(\text{Enc}_{\Theta_1}(x)) - x\|$
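
A minimal sketch of this encoder/decoder pair and its reconstruction loss, assuming PyTorch; the class name, layer widths, and the choice of $d$ and $q$ are illustrative, not from the original:

```python
import torch
import torch.nn as nn

class Autoencoder(nn.Module):
    """Maps R^d -> R^q -> R^d with q << d (widths are illustrative)."""
    def __init__(self, d: int = 784, q: int = 32):
        super().__init__()
        # Enc_{Theta_1}: R^d -> R^q
        self.enc = nn.Sequential(nn.Linear(d, 256), nn.ReLU(), nn.Linear(256, q))
        # Dec_{Theta_2}: R^q -> R^d
        self.dec = nn.Sequential(nn.Linear(q, 256), nn.ReLU(), nn.Linear(256, d))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.dec(self.enc(x))

x = torch.randn(16, 784)                       # a batch of inputs
model = Autoencoder()
loss = torch.norm(model(x) - x, dim=1).mean()  # l(x) = ||Dec(Enc(x)) - x||, averaged over the batch
```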

contrastive representation learning

The goal of contrastive representation learning is to learn such an embedding space in which similar sample pairs stay close to each other while dissimilar ones are far apart. Contrastive learning can be applied to both supervised and unsupervised settings. article

intuition: provide positive and negative pairs, and optimize a loss that pulls positives together and pushes negatives apart.
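
One common way to turn this intuition into a loss is InfoNCE, where each anchor's matching positive is contrasted against every other sample in the batch; a minimal sketch assuming PyTorch (the function name and temperature value are mine):

```python
import torch
import torch.nn.functional as F

def info_nce(anchor: torch.Tensor, positive: torch.Tensor, temperature: float = 0.1) -> torch.Tensor:
    """Anchor i's positive is positive[i]; every other row in the batch acts as a negative."""
    a = F.normalize(anchor, dim=1)
    p = F.normalize(positive, dim=1)
    logits = a @ p.t() / temperature        # pairwise cosine similarities
    labels = torch.arange(a.size(0))        # the matching index is the positive pair
    return F.cross_entropy(logits, labels)  # pull positives close, push negatives apart

z1 = torch.randn(8, 32)   # embeddings of one augmented view
z2 = torch.randn(8, 32)   # embeddings of the other view
loss = info_nce(z1, z2)
```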


training objective

we want a smaller reconstruction error, i.e. to minimize

$$\|\text{Dec}(\text{Sampler}(\text{Enc}(x))) - x\|_2^2$$

we also want the latent-space distribution to look similar to an isotropic Gaussian!
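
In practice the Sampler is usually implemented with the reparameterization trick, drawing $z = \mu + \sigma \odot \epsilon$ with $\epsilon \sim \mathcal{N}(0, I)$ so gradients can flow through the sampling step; a sketch with illustrative names:

```python
import torch

def sampler(mu: torch.Tensor, log_var: torch.Tensor) -> torch.Tensor:
    """Reparameterization trick: z = mu + sigma * eps, eps ~ N(0, I)."""
    eps = torch.randn_like(mu)
    return mu + torch.exp(0.5 * log_var) * eps

mu, log_var = torch.zeros(4, 8), torch.zeros(4, 8)  # toy encoder outputs
z = sampler(mu, log_var)                            # here z is approximately isotropic Gaussian
```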

Kullback-Leibler divergence

denoted as $D_{\text{KL}}(P \parallel Q)$

definition

The statistical distance measuring how a model probability distribution $Q$ differs from a true probability distribution $P$:

$$D_{\text{KL}}(P \parallel Q) = \sum_{x \in \mathcal{X}} P(x) \log \left(\frac{P(x)}{Q(x)}\right)$$
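
A quick numeric check of the discrete form, using only NumPy (the distributions are made up):

```python
import numpy as np

p = np.array([0.5, 0.3, 0.2])     # true distribution P
q = np.array([0.4, 0.4, 0.2])     # model distribution Q

d_kl = np.sum(p * np.log(p / q))  # D_KL(P || Q)
print(d_kl)                       # ~0.025 nats; note D_KL(Q || P) differs, KL is asymmetric
```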

Alternative form 1:

$$\begin{aligned} \text{KL}(p \parallel q) &= E_{x \sim p}\left[\log \frac{p(x)}{q(x)}\right] \\ &= \int_x p(x) \log \frac{p(x)}{q(x)}\, dx \end{aligned}$$

Relative entropy is only defined if $\forall x,\ Q(x) = 0 \implies P(x) = 0$ (absolute continuity).

For distributions $P$ and $Q$ of a continuous random variable, the KL divergence is:

$$D_{\text{KL}}(P \parallel Q) = \int_{-\infty}^{+\infty} p(x) \log \frac{p(x)}{q(x)}\, dx$$

where $p$ and $q$ denote the probability densities of $P$ and $Q$.
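
For the variational autoencoder below, the relevant special case is the KL divergence between a univariate Gaussian and the standard normal, which has a closed form (a standard identity, stated per latent coordinate):

$$D_{\text{KL}}\big(\mathcal{N}(\mu, \sigma^2) \parallel \mathcal{N}(0, 1)\big) = \frac{1}{2}\left(\sigma^2 + \mu^2 - 1 - \log \sigma^2\right)$$

Summed over the $q$ latent coordinates, this is exactly the regularizer in the objective below, up to the constant and the factor $\tfrac{1}{2}$ absorbed into $\lambda$.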


variational autoencoders

idea: add a Gaussian sampler after computing the latent code.

objective function:

$$\min \left(\sum_{x} \|\text{Dec}(\text{Sampler}(\text{Enc}(x))) - x\|^2_2 + \lambda \sum_{i=1}^{q}\left(-\log (\sigma_i^2) + \sigma_i^2 + \mu_i^2\right)\right)$$
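
A sketch of this objective in PyTorch, assuming the encoder outputs $(\mu, \log \sigma^2)$ and the reparameterized sampler from above; the single-layer encoder/decoder and all names are illustrative:

```python
import torch
import torch.nn as nn

class VAE(nn.Module):
    def __init__(self, d: int = 784, q: int = 32):
        super().__init__()
        self.enc = nn.Linear(d, 2 * q)   # outputs (mu, log_var); single layer for brevity
        self.dec = nn.Linear(q, d)

    def forward(self, x):
        mu, log_var = self.enc(x).chunk(2, dim=1)
        z = mu + torch.exp(0.5 * log_var) * torch.randn_like(mu)  # Sampler
        return self.dec(z), mu, log_var

def vae_loss(x, x_hat, mu, log_var, lam: float = 1.0):
    recon = ((x_hat - x) ** 2).sum(dim=1)                  # ||Dec(Sampler(Enc(x))) - x||_2^2
    reg = (-log_var + log_var.exp() + mu ** 2).sum(dim=1)  # sum_i (-log sigma_i^2 + sigma_i^2 + mu_i^2)
    return (recon + lam * reg).mean()                      # batch mean in place of the sum over x

x = torch.randn(16, 784)
model = VAE()
x_hat, mu, log_var = model(x)
loss = vae_loss(x, x_hat, mu, log_var)
```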