
Think of using autoencoders to extract representations.

sparsity allows us to interpret the hidden layers and internal representations of Transformer models.

```mermaid
graph TD
    subgraph Encoder
        A[Input X] --> B[Layer 1]
        B --> C[Layer 2]
    end
    C --> D[Latent Features Z]
    D --> E[Layer 3]
    subgraph Decoder
        E --> F[Layer 4]
    end
    F --> G[Output X']

    style D fill:#c9a2d8,stroke:#000,stroke-width:2px,color:#fff
    style A fill:#98FB98,stroke:#000,stroke-width:2px
    style G fill:#F4A460,stroke:#000,stroke-width:2px
```

see also latent space

definition

$$\begin{aligned} \text{Enc}_{\Theta_1}&: \mathbb{R}^d \to \mathbb{R}^q \\ \text{Dec}_{\Theta_2}&: \mathbb{R}^q \to \mathbb{R}^d \\[12pt] &\because q \ll d \end{aligned}$$

loss function: $l(x) = \|\text{Dec}_{\Theta_2}(\text{Enc}_{\Theta_1}(x)) - x\|$
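
A minimal sketch of this encoder/decoder pair and its reconstruction loss, assuming PyTorch; the class name, layer widths, and the choice of $d$ and $q$ are illustrative, not from the original:

```python
import torch
import torch.nn as nn

class Autoencoder(nn.Module):
    """Maps R^d -> R^q -> R^d with q << d (widths are illustrative)."""
    def __init__(self, d: int = 784, q: int = 32):
        super().__init__()
        # Enc_{Theta_1}: R^d -> R^q
        self.enc = nn.Sequential(nn.Linear(d, 256), nn.ReLU(), nn.Linear(256, q))
        # Dec_{Theta_2}: R^q -> R^d
        self.dec = nn.Sequential(nn.Linear(q, 256), nn.ReLU(), nn.Linear(256, d))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.dec(self.enc(x))

x = torch.randn(16, 784)                       # a batch of inputs
model = Autoencoder()
loss = torch.norm(model(x) - x, dim=1).mean()  # l(x) = ||Dec(Enc(x)) - x||, averaged over the batch
```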

contrastive representation learning

The goal of contrastive representation learning is to learn such an embedding space in which similar sample pairs stay close to each other while dissimilar ones are far apart. Contrastive learning can be applied to both supervised and unsupervised settings. article

intuition: provide positive and negative pairs, and optimize a loss that pulls positives together and pushes negatives apart.
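
One common way to turn this intuition into a loss is InfoNCE, where each anchor's matching positive is contrasted against every other sample in the batch; a minimal sketch assuming PyTorch (the function name and temperature value are mine):

```python
import torch
import torch.nn.functional as F

def info_nce(anchor: torch.Tensor, positive: torch.Tensor, temperature: float = 0.1) -> torch.Tensor:
    """Anchor i's positive is positive[i]; every other row in the batch acts as a negative."""
    a = F.normalize(anchor, dim=1)
    p = F.normalize(positive, dim=1)
    logits = a @ p.t() / temperature        # pairwise cosine similarities
    labels = torch.arange(a.size(0))        # the matching index is the positive pair
    return F.cross_entropy(logits, labels)  # pull positives close, push negatives apart

z1 = torch.randn(8, 32)   # embeddings of one augmented view
z2 = torch.randn(8, 32)   # embeddings of the other view
loss = info_nce(z1, z2)
```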


training objective

we want a smaller reconstruction error, i.e. to minimize

$$\|\text{Dec}(\text{Sampler}(\text{Enc}(x))) - x\|_2^2$$

we also want the latent-space distribution to look similar to an isotropic Gaussian!
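
In practice the Sampler is usually implemented with the reparameterization trick, drawing $z = \mu + \sigma \odot \epsilon$ with $\epsilon \sim \mathcal{N}(0, I)$ so gradients can flow through the sampling step; a sketch with illustrative names:

```python
import torch

def sampler(mu: torch.Tensor, log_var: torch.Tensor) -> torch.Tensor:
    """Reparameterization trick: z = mu + sigma * eps, eps ~ N(0, I)."""
    eps = torch.randn_like(mu)
    return mu + torch.exp(0.5 * log_var) * eps

mu, log_var = torch.zeros(4, 8), torch.zeros(4, 8)  # toy encoder outputs
z = sampler(mu, log_var)                            # here z is approximately isotropic Gaussian
```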

Kullback-Leibler divergence

denoted as $D_{\text{KL}}(P \parallel Q)$

definition

The statistical distance measuring how a model probability distribution $Q$ differs from a true probability distribution $P$:

$$D_{\text{KL}}(P \parallel Q) = \sum_{x \in \mathcal{X}} P(x) \log \left(\frac{P(x)}{Q(x)}\right)$$
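
A quick numeric check of the discrete form, using only NumPy (the distributions are made up):

```python
import numpy as np

p = np.array([0.5, 0.3, 0.2])     # true distribution P
q = np.array([0.4, 0.4, 0.2])     # model distribution Q

d_kl = np.sum(p * np.log(p / q))  # D_KL(P || Q)
print(d_kl)                       # ~0.025 nats; note D_KL(Q || P) differs, KL is asymmetric
```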

Alternative form 1:

$$\begin{aligned} \text{KL}(p \parallel q) &= E_{x \sim p}\left[\log \frac{p(x)}{q(x)}\right] \\ &= \int_x p(x) \log \frac{p(x)}{q(x)}\, dx \end{aligned}$$

Relative entropy is only defined if $\forall x,\ Q(x) = 0 \implies P(x) = 0$ (absolute continuity).

For distributions $P$ and $Q$ of a continuous random variable, the KL divergence is:

$$D_{\text{KL}}(P \parallel Q) = \int_{-\infty}^{+\infty} p(x) \log \frac{p(x)}{q(x)}\, dx$$

where $p$ and $q$ denote the probability densities of $P$ and $Q$.
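
For the variational autoencoder below, the relevant special case is the KL divergence between a univariate Gaussian and the standard normal, which has a closed form (a standard identity, stated per latent coordinate):

$$D_{\text{KL}}\big(\mathcal{N}(\mu, \sigma^2) \parallel \mathcal{N}(0, 1)\big) = \frac{1}{2}\left(\sigma^2 + \mu^2 - 1 - \log \sigma^2\right)$$

Summed over the $q$ latent coordinates, this is exactly the regularizer in the objective below, up to the constant and the factor $\tfrac{1}{2}$ absorbed into $\lambda$.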


variational autoencoders

idea: add a Gaussian sampler after computing the latent code.

objective function:

$$\min \left(\sum_{x} \|\text{Dec}(\text{Sampler}(\text{Enc}(x))) - x\|^2_2 + \lambda \sum_{i=1}^{q}\left(-\log (\sigma_i^2) + \sigma_i^2 + \mu_i^2\right)\right)$$
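
A sketch of this objective in PyTorch, assuming the encoder outputs $(\mu, \log \sigma^2)$ and the reparameterized sampler from above; the single-layer encoder/decoder and all names are illustrative:

```python
import torch
import torch.nn as nn

class VAE(nn.Module):
    def __init__(self, d: int = 784, q: int = 32):
        super().__init__()
        self.enc = nn.Linear(d, 2 * q)   # outputs (mu, log_var); single layer for brevity
        self.dec = nn.Linear(q, d)

    def forward(self, x):
        mu, log_var = self.enc(x).chunk(2, dim=1)
        z = mu + torch.exp(0.5 * log_var) * torch.randn_like(mu)  # Sampler
        return self.dec(z), mu, log_var

def vae_loss(x, x_hat, mu, log_var, lam: float = 1.0):
    recon = ((x_hat - x) ** 2).sum(dim=1)                  # ||Dec(Sampler(Enc(x))) - x||_2^2
    reg = (-log_var + log_var.exp() + mu ** 2).sum(dim=1)  # sum_i (-log sigma_i^2 + sigma_i^2 + mu_i^2)
    return (recon + lam * reg).mean()                      # batch mean in place of the sum over x

x = torch.randn(16, 784)
model = VAE()
x_hat, mu, log_var = model(x)
loss = vae_loss(x, x_hat, mu, log_var)
```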