[Ch4] Variational AutoEncoders (VAE, CVAE, AAE)

jeong_reneer 2022. 1. 31. 09:43

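AE and VAE have opposite goals

- AE : the back end (Decoder) is added in order to train the front end (Encoder) → Manifold learning

- VAE : the front end (Encoder) is added in order to train the back end (Decoder) → Generative model learning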

[Keyword] : Generative model learning

Generative model : Latent variable model

Variational AE (VAE)

1. Generative model

1) Sample Generation

2) Latent Variable Model

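Compute the probability that a data point x in the training DB is generated

→ The goal is to find the probability distribution p(x) that maximizes this probability over the entire training DB !!

z~p(z) : Latent variable / acts as a controller (use an easy-to-handle distribution : Normal or Uniform distribution)

g_Θ( ) : Generator (Deterministic function parameterized by Θ)

g_Θ(z) : Network output = Parameters of the probability distribution model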

3) Prior distribution p(z)

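Because the generator is a DNN, even if the manifold to be learned is complex,

the first one or two layers can map the simple z onto the complex latent space (i.e. find the manifold well)

→ So it is fine to use a simple prior distribution such as a Normal or Uniform distribution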

4) P_Θ(x|z)

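Q : Why not simply estimate p(x) by summing p(x|z) over samples z ~ p(z)? Why is a VAE needed?

Semantically, (a) and (c) are closer to each other than (a) and (b),

but the MSE between (a) and (c) is larger than the MSE between (a) and (b)

That is, if the conditional probability model is assumed to be a Gaussian distribution,

the MSE determines the likelihood, so the likelihood is higher when the predicted mean is (b)

Sampling under this likelihood therefore produces images that are semantically (visually) quite different from the target

→ Sampling z directly from the prior distribution p(z) does not train the model properly

→ Introduce an ideal sampling function z ~ p(z|x) (x is given as evidence so that at least the training sample x can be generated)

→ To estimate an unknown distribution like this one, the *Variational inference method is used

*Variational inference : pick a well-understood distribution q_Φ(z|x) and adjust its parameters so that it becomes similar to p(z|x)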
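The MSE comparison among (a), (b), and (c) discussed above can be sketched with toy arrays. This is only an illustration: the 8x8 "images" below are hypothetical stand-ins for the figures in the original post, and `mse` is a helper defined here.

```python
import numpy as np

def mse(u, v):
    """Mean squared error between two images of equal shape."""
    return float(np.mean((u - v) ** 2))

# (a) hypothetical 8x8 image: a vertical bar in column 3
a = np.zeros((8, 8))
a[1:7, 3] = 1.0

# (b) the same bar with two pixels erased (content changed)
b = a.copy()
b[1:3, 3] = 0.0

# (c) the same bar shifted one pixel to the right (content preserved)
c = np.roll(a, 1, axis=1)

mse_ab = mse(a, b)
mse_ac = mse(a, c)
print("MSE(a, b) =", mse_ab)
print("MSE(a, c) =", mse_ac)
```

The shifted copy differs from (a) in more pixels than the damaged copy does, so a Gaussian likelihood built on MSE scores (b) above (c) even though (c) looks more like (a).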

2. Variational Inference

1) Sampling z from p(z|x) using q_Φ(z|x)

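True posterior p(z|x) : the distribution we do not know but want to estimate

Approximation class q_Φ(z|x) : a tractable distribution (ex. Gaussian) made as similar as possible to the true posterior

When training the back end (Generator), we want an ideal sampling function that produces good z samples

→ The ideal function is unknown, so build an approximation class and sample from it !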

2) ELBO (Evidence LowerBOund) : Relationship among p(x), p(z), p(z|x), q_Φ(z|x)

(1) Derivation 1

Target quantity : log(p(x))

Expanding with Jensen's Inequality gives log(p(x)) ≥ ELBO

This derivation is not very intuitive
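For reference, a sketch of how Derivation 1 usually proceeds, written in the notation of this post (q_Φ(z|x) as the approximation, p(z) as the prior). The steps are the standard ones and are not copied from the original figure.

```latex
\begin{aligned}
\log p(x) &= \log \int p(x|z)\,p(z)\,dz
           = \log \int q_\phi(z|x)\,\frac{p(x|z)\,p(z)}{q_\phi(z|x)}\,dz \\
          &\ge \int q_\phi(z|x)\,\log \frac{p(x|z)\,p(z)}{q_\phi(z|x)}\,dz
           \qquad \text{(Jensen's inequality, since } \log \text{ is concave)} \\
          &= \mathbb{E}_{q_\phi(z|x)}\big[\log p(x|z)\big]
             - \mathrm{KL}\big(q_\phi(z|x)\,\|\,p(z)\big)
           = \mathrm{ELBO}(\phi)
\end{aligned}
```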

(2) Derivation 2 ★

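KL(q_Φ(z|x)||p(z|x)) : a measure of how far apart two distributions are, always ≥ 0 (it is 0 when the two distributions are identical; it is not symmetric, so it is not a true distance)

log(p(x)) = ELBO(φ) + KL(q_Φ(z|x)||p(z|x)), where KL(q_Φ(z|x)||p(z|x)) ≥ 0

→ ELBO(φ) = log(p(x)) - KL(q_Φ(z|x)||p(z|x)) ≤ log(p(x))

Finding the φ that minimizes KL(q_Φ(z|x)||p(z|x)) gives the optimized, ideal sampling function

= Equivalently, q_Φ(z|x) can be found by finding the φ that maximizes ELBO(φ) (because log(p(x)) does not depend on φ)

Goal : find the φ that maximizes ELBO(φ) = E_q_Φ(z|x)[logp(x|z)] - KL(q_Φ(z|x)||p(z)) and thereby obtain the ideal sampling function !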
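The identity log(p(x)) = ELBO(φ) + KL(q_Φ(z|x)||p(z|x)) can be checked numerically on a toy model. The sketch below assumes a discrete latent z with three states and made-up probabilities, chosen only so that every term can be computed exactly.

```python
import numpy as np

# Toy model: discrete latent z in {0, 1, 2} and one fixed observation x.
p_z = np.array([0.5, 0.3, 0.2])             # prior p(z) (made-up numbers)
p_x_given_z = np.array([0.10, 0.40, 0.70])  # likelihood p(x|z) at the observed x

p_x = np.sum(p_x_given_z * p_z)             # evidence p(x)
posterior = p_x_given_z * p_z / p_x         # true posterior p(z|x)

def kl(q, p):
    """KL(q || p) for discrete distributions with full support."""
    return float(np.sum(q * np.log(q / p)))

def elbo(q):
    """E_q[log p(x|z)] - KL(q || p(z))."""
    return float(np.sum(q * np.log(p_x_given_z))) - kl(q, p_z)

q = np.array([0.2, 0.5, 0.3])               # an arbitrary approximation q(z|x)
gap = float(np.log(p_x)) - elbo(q)

print("log p(x)        =", float(np.log(p_x)))
print("ELBO(q)         =", elbo(q))
print("gap             =", gap)
print("KL(q || p(z|x)) =", kl(q, posterior))
```

The gap between log p(x) and the ELBO equals the KL to the true posterior, and it closes when q is set to the true posterior.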

3. Loss function

1) Derivation

Two optimization problems must be solved : maximize with respect to φ and θ

(1) Variational Inference Optimization Problem : maximize the ELBO with respect to φ

: find the φ that maximizes the ELBO, which gives the ideal sampling function

ELBO(φ) = E_q_Φ(z|x)[logp(x|z)] - KL(q_Φ(z|x)||p(z))

(2) Maximum Likelihood Optimization Problem : maximize the ELBO with respect to θ

: from the Generator's point of view, train so that the conditional probability attains maximum likelihood

In the formula, this corresponds to the first part of the ELBO term (E_q_Φ(z|x)[logp(x|z)])

2) NN Perspective

In order to train the back end of (2) (Generator = Decoder = Generation Network),

the front end of (1) (Posterior = Encoder = Inference Network) is attached!

ELBO ?

ELBO(φ) = E_q_Φ(z|x)[logp(x|z)] - KL(q_Φ(z|x)||p(z)) → ELBO를 Maximize

Loss = - ELBO(φ)

= - E_q_Φ(z|x)[logp(x|z)] + KL(q_Φ(z|x)||p(z)) → Loss를 Minimize

= Reconstruction Error term + Regularization term

Reconstruction Error term : - E_q_Φ(z|x)[logp(x|z)]

→ Minimize Negative log likelihood = Maximize log likelihood

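- Optimizes from the ML (maximum likelihood) perspective (does x come out when x is put in?)

- MSE if a Gaussian is assumed, CE Loss if a Bernoulli is assumed

- Same form as the error term of a plain AE (except that sampling is involved)

Regularization term : KL(q_Φ(z|x)||p(z))

→ Minimize KL = make q_Φ(z|x) as close to p(z) as possible (if several approximate posteriors q_Φ(z|x) reconstruct equally well, prefer the one closest to p(z))

- An additional condition that keeps the sampling function tractable and similar to the prior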

(0) Assumptions for Variational Inference

[Encoder] q_Φ(z|x_i) ~ N(μ_i, σ_i^2I) : Gaussian distribution assumed → μ and σ must both be estimated

[Prior] p(z) ~ N(0, I) : standard Normal distribution assumed

(1) Regularization Term : KL divergence

The KL divergence between two Gaussian distributions has a known formula

Using this formula, the KL divergence term is computed in closed form (Easy to compute)

# encoding
mu, sigma = gaussian_MLP_encoder(x_hat, n_hidden, dim_z, keep_prob)

# loss
KL_divergence = 0.5 * tf.reduce_sum(tf.square(mu) + tf.square(sigma) - tf.log(1e-8 + tf.square(sigma)) - 1, 1)
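As a sanity check on the closed form used in the snippet above, the sketch below compares it with a Monte Carlo estimate of E_q[log q(z) - log p(z)] in NumPy. The values of `mu` and `sigma` and the helper name `kl_to_standard_normal` are assumptions made for this example.

```python
import numpy as np

def kl_to_standard_normal(mu, sigma):
    """Closed-form KL( N(mu, diag(sigma^2)) || N(0, I) ), summed over dimensions."""
    return 0.5 * np.sum(mu**2 + sigma**2 - np.log(sigma**2) - 1.0, axis=-1)

rng = np.random.default_rng(0)
mu = np.array([0.5, -1.0])      # illustrative encoder mean
sigma = np.array([0.8, 1.5])    # illustrative encoder stddev

# Monte Carlo estimate of the same KL: average of log q(z) - log p(z) over z ~ q
z = mu + sigma * rng.standard_normal((200_000, 2))
log_q = -0.5 * np.sum(((z - mu) / sigma) ** 2 + np.log(2 * np.pi * sigma**2), axis=1)
log_p = -0.5 * np.sum(z**2 + np.log(2 * np.pi), axis=1)

kl_exact = float(kl_to_standard_normal(mu, sigma))
kl_mc = float(np.mean(log_q - log_p))
print("closed form :", kl_exact)
print("Monte Carlo :", kl_mc)
```

The two values should agree up to sampling noise, which is why the closed form is preferred: it is exact and needs no samples.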

(2) Reconstruction Error Term

Computing the expectation strictly requires an integral, but with the Monte-carlo technique we draw L samples and average

Draw L=1 sample from q_Φ(z|x_i), whose μ and σ are determined under the Gaussian assumption,

feed it to the Decoder to obtain the parameters of the Decoder's conditional probability,

which fixes the likelihood value, so the mean can be computed → Reconstruction Error

Reparameterization Trick

- Why? Plain sampling is a random node, so the chain rule cannot be applied and the backprop algorithm cannot be used

- Reparameterization Trick

: sample ε from the standard Normal distribution, multiply it elementwise by σ, then add μ to produce z

= Identical to sampling from the original z~N(μ,σ^2) distribution & backprop becomes possible!

# sampling by re-parameterization technique
z = mu + sigma * tf.random_normal(tf.shape(mu), 0, 1, dtype=tf.float32)
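A small NumPy sketch of the same trick, under assumed values for `mu` and `sigma`: it checks that z = μ + σ·ε has the intended mean and standard deviation, and writes out the partial derivatives that backprop would use.

```python
import numpy as np

rng = np.random.default_rng(0)
mu = np.array([1.0, -2.0])     # illustrative encoder mean
sigma = np.array([0.5, 2.0])   # illustrative encoder stddev

# eps carries all the randomness; mu and sigma enter deterministically
eps = rng.standard_normal((100_000, 2))
z = mu + sigma * eps

# Given eps, z is a deterministic function of (mu, sigma), so
# dz/dmu = 1 and dz/dsigma = eps, and gradients can flow to the encoder.
dz_dmu = np.ones_like(eps)
dz_dsigma = eps

sample_mean = z.mean(axis=0)
sample_std = z.std(axis=0)
print("sample mean:", sample_mean)
print("sample std :", sample_std)
```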

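◾ For images, a Bernoulli distribution is usually assumed → Negative log likelihood = CE Loss

- Reconstruction error = CE Loss between the network output (p_i) and the input (x_i)

◾ Gaussian distribution assumed → log p(x|z) = -Σ_j [ (x_j - μ_j)^2 / (2σ_j^2) + (1/2) log(2πσ_j^2) ]

◾ Gaussian distribution assumed & only μ estimated (σ fixed) → Negative log likelihood = MSE Loss (up to a constant scale and offset)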
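The two correspondences above can be written out in a few lines of NumPy. The pixel values and decoder outputs below are made up for illustration, and σ is fixed to 1 as an assumption.

```python
import numpy as np

x = np.array([0.0, 1.0, 1.0, 0.0])   # target pixels (illustrative)
p = np.array([0.1, 0.8, 0.6, 0.3])   # decoder output (illustrative)

# Bernoulli decoder: the negative log-likelihood is the cross-entropy
bernoulli_nll = -np.sum(x * np.log(p) + (1 - x) * np.log(1 - p))

# Gaussian decoder with fixed sigma: NLL = SSE / (2 sigma^2) + constant,
# so minimizing it is the same as minimizing the squared error
sigma = 1.0
gaussian_nll = np.sum((x - p) ** 2 / (2 * sigma**2) + 0.5 * np.log(2 * np.pi * sigma**2))
sse = np.sum((x - p) ** 2)
const = 0.5 * x.size * np.log(2 * np.pi * sigma**2)

print("Bernoulli NLL (CE):", bernoulli_nll)
print("Gaussian NLL      :", gaussian_nll)
print("SSE/(2s^2) + const:", sse / (2 * sigma**2) + const)
```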

4. Structure

1) Gaussian Encoder + Bernoulli Decoder (Default)

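(1) Gaussian Encoder → Regularization term (relation between q and the prior) is computed in closed form : KL = (1/2) Σ_j (μ_j^2 + σ_j^2 - log σ_j^2 - 1)

(2) Bernoulli Decoder → Reconstruction Error term = CE Loss between the network output (p_i) and the input (x_i)

◾ For images, a Bernoulli distribution is usually assumed → Negative log likelihood = CE Loss

2) Gaussian Encoder + Gaussian Decoder

(1) Gaussian Encoder → Regularization term (relation between q and the prior) is computed in closed form : KL = (1/2) Σ_j (μ_j^2 + σ_j^2 - log σ_j^2 - 1)

(2) Gaussian Decoder → Reconstruction Error term = MSE Loss between the network output (μ_i) and the input (x_i)

◾ Gaussian distribution assumed → log p(x|z) = -Σ_j [ (x_j - μ_j)^2 / (2σ_j^2) + (1/2) log(2πσ_j^2) ]

◾ Gaussian distribution assumed & only μ estimated (σ fixed) → Negative log likelihood = MSE Loss (up to a constant scale and offset)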

5. Result - MNIST

1) Architecture

2) Implementation

import tensorflow as tf

# Gaussian MLP as encoder
def gaussian_MLP_encoder(x, n_hidden, n_output, keep_prob):
    with tf.variable_scope("gaussian_MLP_encoder"):
        # initializers
        w_init = tf.contrib.layers.variance_scaling_initializer()
        b_init = tf.constant_initializer(0.)

        # 1st hidden layer
        w0 = tf.get_variable('w0', [x.get_shape()[1], n_hidden], initializer=w_init)
        b0 = tf.get_variable('b0', [n_hidden], initializer=b_init)
        h0 = tf.matmul(x, w0) + b0
        h0 = tf.nn.elu(h0)
        h0 = tf.nn.dropout(h0, keep_prob)

        # 2nd hidden layer
        w1 = tf.get_variable('w1', [h0.get_shape()[1], n_hidden], initializer=w_init)
        b1 = tf.get_variable('b1', [n_hidden], initializer=b_init)
        h1 = tf.matmul(h0, w1) + b1
        h1 = tf.nn.tanh(h1)
        h1 = tf.nn.dropout(h1, keep_prob)

        # output layer !!
        # n_output = z-dimension; the layer emits n_output * 2 values:
        # the first n_output for the mean, the last n_output for the stddev
        # borrowed from https://github.com/altosaar/vae/blob/master/vae.py
        wo = tf.get_variable('wo', [h1.get_shape()[1], n_output * 2], initializer=w_init)
        bo = tf.get_variable('bo', [n_output * 2], initializer=b_init)
        gaussian_params = tf.matmul(h1, wo) + bo

        # The mean parameter is unconstrained
        mean = gaussian_params[:, :n_output]
        # The standard deviation must be positive. Parametrize with a softplus and
        # add a small epsilon for numerical stability
        stddev = 1e-6 + tf.nn.softplus(gaussian_params[:, n_output:])

    return mean, stddev

# Bernoulli MLP as decoder
def bernoulli_MLP_decoder(z, n_hidden, n_output, keep_prob, reuse=False):

    with tf.variable_scope("bernoulli_MLP_decoder", reuse=reuse):
        # initializers
        w_init = tf.contrib.layers.variance_scaling_initializer()
        b_init = tf.constant_initializer(0.)

        # 1st hidden layer
        w0 = tf.get_variable('w0', [z.get_shape()[1], n_hidden], initializer=w_init)
        b0 = tf.get_variable('b0', [n_hidden], initializer=b_init)
        h0 = tf.matmul(z, w0) + b0
        h0 = tf.nn.tanh(h0)
        h0 = tf.nn.dropout(h0, keep_prob)

        # 2nd hidden layer
        w1 = tf.get_variable('w1', [h0.get_shape()[1], n_hidden], initializer=w_init)
        b1 = tf.get_variable('b1', [n_hidden], initializer=b_init)
        h1 = tf.matmul(h0, w1) + b1
        h1 = tf.nn.elu(h1)
        h1 = tf.nn.dropout(h1, keep_prob)

        # output layer-mean
        wo = tf.get_variable('wo', [h1.get_shape()[1], n_output], initializer=w_init)
        bo = tf.get_variable('bo', [n_output], initializer=b_init)
        y = tf.sigmoid(tf.matmul(h1, wo) + bo) # sigmoid -> 0~1

    return y

# Gateway
def autoencoder(x_hat, x, dim_img, dim_z, n_hidden, keep_prob):

    # encoding
    mu, sigma = gaussian_MLP_encoder(x_hat, n_hidden, dim_z, keep_prob)

    # sampling by re-parameterization technique
    z = mu + sigma * tf.random_normal(tf.shape(mu), 0, 1, dtype=tf.float32)

    # decoding
    y = bernoulli_MLP_decoder(z, n_hidden, dim_img, keep_prob)
    y = tf.clip_by_value(y, 1e-8, 1 - 1e-8)

    # loss
    marginal_likelihood = tf.reduce_sum(x * tf.log(y) + (1 - x) * tf.log(1 - y), 1) # Bernoulli log-likelihood (= negative CE Loss)
    KL_divergence = 0.5 * tf.reduce_sum(tf.square(mu) + tf.square(sigma) - tf.log(1e-8 + tf.square(sigma)) - 1, 1)

    marginal_likelihood = tf.reduce_mean(marginal_likelihood)
    KL_divergence = tf.reduce_mean(KL_divergence)

    ELBO = marginal_likelihood - KL_divergence
    loss = -ELBO

    return y, z, loss, -marginal_likelihood, KL_divergence

def decoder(z, dim_img, n_hidden):
    y = bernoulli_MLP_decoder(z, n_hidden, dim_img, 1.0, reuse=True)
    return y

3) Performance

(1) Reproduce

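- The 784-dimensional input is compressed to a z of dimension 2, 5, or 20 and then reconstructed (x in, x out)

- The stronger the compression (smaller dimension), the larger the Reconstruction Loss tends to be

- The larger the dimension, or the better the training, the smaller the Reconstruction Loss tends to be

(2) Denoising

- Train so that a noisy image at the network input yields a noise-free image at the network output (Restored)

(3) Learned Manifold (role of the KL term)

Difference between VAE and AE : whether the loss function has a KL term

AE : aimed at data compression

- From the Generator's point of view, the range of z values that yield meaningful images is unknown

- Result : the region of the space that yields meaningful images keeps changing

- In practice : there is no way to know how to sample z

VAE : aimed at data generation

- The Generator is trained together while searching for the ideal sampling function (the KL term pushes q_Φ toward p(z))

- Result : the learned sampling function ends up close to the prior distribution p(z) = Normal distribution

- In practice : z can simply be sampled from the prior distribution = Normal distribution

VAE Learned Manifold

- The better the training, the more the z values that generate the same digit cluster together in the 2D space, while z values that generate different digits lie apart

- Taken as a whole, the distribution is trained to follow a Normal distribution

- Images generated by sampling z from the rectangular region (A-B-C-D) : bottom right

→ Features such as rotation and stroke thickness are learned automatically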

Conditional VAE (CVAE)

1) Architecture

(1) CVAE (M2) : Supervised version

- When? The Label information is known (the usual case)

- Add (concat) the Label information as Condition y to both the Encoder and the Decoder of the VAE

# encoder
# concatenate condition and image
input = tf.concat(axis=1, values=[x, y])

# decoder
# concatenate condition and latent vectors
input = tf.concat(axis=1, values=[z, y])

Loss function : the ELBO is derived in exactly the same way as for the Vanilla VAE
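The conditioning step amounts to a one-hot encoding followed by a concatenation. A NumPy sketch with assumed MNIST-like sizes (784-dimensional image, 2-dimensional z, 10 classes); the random arrays and the `one_hot` helper are placeholders for this example only.

```python
import numpy as np

def one_hot(labels, n_classes):
    """Encode integer labels as one-hot rows."""
    out = np.zeros((len(labels), n_classes), dtype=np.float32)
    out[np.arange(len(labels)), labels] = 1.0
    return out

batch, dim_img, dim_z, n_classes = 4, 784, 2, 10
x = np.random.rand(batch, dim_img).astype(np.float32)   # stand-in images
z = np.random.randn(batch, dim_z).astype(np.float32)    # stand-in latent codes
y = one_hot(np.array([3, 1, 4, 1]), n_classes)          # condition y

encoder_input = np.concatenate([x, y], axis=1)   # image + condition
decoder_input = np.concatenate([z, y], axis=1)   # latent code + condition
print(encoder_input.shape, decoder_input.shape)
```

Because y is supplied to both networks, z no longer has to encode the class and can be spent on the remaining variation.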

(2) CVAE (M2) : Unsupervised version

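- When? The Label information is unknown

- Use a separate network that estimates condition y for unlabeled data, and feed in the y estimated by that network

(3) CVAE (M2) : Semi-supervised version

- When? Only a small part of the Label information is known

- For known Labels, concat y directly as in the Supervised version,

- For unknown Labels, feed in the y estimated by a separate network as in the Unsupervised version

(4) CVAE (M3) : Unsupervised or Semi-supervised

- Instead of using a separate network to estimate the unknown Labels,

first train with the VAE M1 structure, then attach one more layer on top and estimate y with M2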

2) Implementation

import tensorflow as tf

# Gaussian MLP as conditional encoder
def gaussian_MLP_conditional_encoder(x, y, n_hidden, n_output, keep_prob):
    with tf.variable_scope("gaussian_MLP_encoder"):
        # concatenate condition and image
        dim_y = int(y.get_shape()[1])
        input = tf.concat(axis=1, values=[x, y])

        # initializers
        w_init = tf.contrib.layers.variance_scaling_initializer()
        b_init = tf.constant_initializer(0.)

        # 1st hidden layer
        w0 = tf.get_variable('w0', [input.get_shape()[1], n_hidden+dim_y], initializer=w_init)
        b0 = tf.get_variable('b0', [n_hidden+dim_y], initializer=b_init)
        h0 = tf.matmul(input, w0) + b0
        h0 = tf.nn.elu(h0)
        h0 = tf.nn.dropout(h0, keep_prob)

        # 2nd hidden layer
        w1 = tf.get_variable('w1', [h0.get_shape()[1], n_hidden], initializer=w_init)
        b1 = tf.get_variable('b1', [n_hidden], initializer=b_init)
        h1 = tf.matmul(h0, w1) + b1
        h1 = tf.nn.tanh(h1)
        h1 = tf.nn.dropout(h1, keep_prob)

        # output layer
        # borrowed from https://github.com/altosaar/vae/blob/master/vae.py
        wo = tf.get_variable('wo', [h1.get_shape()[1], n_output * 2], initializer=w_init)
        bo = tf.get_variable('bo', [n_output * 2], initializer=b_init)
        gaussian_params = tf.matmul(h1, wo) + bo

        # The mean parameter is unconstrained
        mean = gaussian_params[:, :n_output]
        # The standard deviation must be positive. Parametrize with a softplus and
        # add a small epsilon for numerical stability
        stddev = 1e-6 + tf.nn.softplus(gaussian_params[:, n_output:])

    return mean, stddev

# Bernoulli MLP as conditional decoder
def bernoulli_MLP_conditional_decoder(z, y, n_hidden, n_output, keep_prob, reuse=False):

    with tf.variable_scope("bernoulli_MLP_decoder", reuse=reuse):
        # concatenate condition and latent vectors
        input = tf.concat(axis=1, values=[z, y])

        # initializers
        w_init = tf.contrib.layers.variance_scaling_initializer()
        b_init = tf.constant_initializer(0.)

        # 1st hidden layer
        w0 = tf.get_variable('w0', [input.get_shape()[1], n_hidden], initializer=w_init)
        b0 = tf.get_variable('b0', [n_hidden], initializer=b_init)
        h0 = tf.matmul(input, w0) + b0
        h0 = tf.nn.tanh(h0)
        h0 = tf.nn.dropout(h0, keep_prob)

        # 2nd hidden layer
        w1 = tf.get_variable('w1', [h0.get_shape()[1], n_hidden], initializer=w_init)
        b1 = tf.get_variable('b1', [n_hidden], initializer=b_init)
        h1 = tf.matmul(h0, w1) + b1
        h1 = tf.nn.elu(h1)
        h1 = tf.nn.dropout(h1, keep_prob)

        # output layer-mean
        wo = tf.get_variable('wo', [h1.get_shape()[1], n_output], initializer=w_init)
        bo = tf.get_variable('bo', [n_output], initializer=b_init)
        y = tf.sigmoid(tf.matmul(h1, wo) + bo)

    return y

# Gateway
def autoencoder(x_hat, x, y, dim_img, dim_z, n_hidden, keep_prob):

    # encoding
    mu, sigma = gaussian_MLP_conditional_encoder(x_hat, y, n_hidden, dim_z, keep_prob)

    # sampling by re-parameterization technique
    z = mu + sigma * tf.random_normal(tf.shape(mu), 0, 1, dtype=tf.float32)

    # decoding
    x_ = bernoulli_MLP_conditional_decoder(z, y, n_hidden, dim_img, keep_prob)
    x_ = tf.clip_by_value(x_, 1e-8, 1 - 1e-8)

    # ELBO
    marginal_likelihood = tf.reduce_sum(x * tf.log(x_) + (1 - x) * tf.log(1 - x_), 1)
    KL_divergence = 0.5 * tf.reduce_sum(tf.square(mu) + tf.square(sigma) - tf.log(1e-8 + tf.square(sigma)) - 1, 1)

    marginal_likelihood = tf.reduce_mean(marginal_likelihood)
    KL_divergence = tf.reduce_mean(KL_divergence)

    ELBO = marginal_likelihood - KL_divergence

    # minimize loss instead of maximizing ELBO
    loss = -ELBO

    return x_, z, loss, -marginal_likelihood, KL_divergence

# Conditional Decoder (Generator)
def decoder(z, y, dim_img, n_hidden):

    x_ = bernoulli_MLP_conditional_decoder(z, y, n_hidden, dim_img, 1.0, reuse=True)

    return x_

3) Performance

(1) Reproduce

(2) Denoising

(3) Handwriting styles obtained by fixing the class label and varying z

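- After training is finished, the Decoder part, which acts as the Generator, can be detached and used on its own

Latent vector z learned by VAE : the digit class (Label), slant, thickness and other features are learned automatically and contained in z

Latent vector z learned by CVAE : the digit class (Label) is supplied directly as Condition y, so z automatically learns the remaining features, of which the two dominant ones are slant (Rotation) and thickness

(4) Analogies : Result in paper

- Keep the style, change only the digit : in each row, z is fixed and only the Label given as the condition is changed to generate images

→ Produces varied handwriting in which each style is kept and only the digit Label changes

- When a real handwritten '3' is fed to the CVAE together with its Label, the resulting latent vector z is used as a fixed Decoder input and only the Label given as the condition is changed :

(5) Learned Manifold

At first glance it looks entangled,

but this is the result of plotting all digits at once (i.e. 10 layers overlaid), so it is actually the desired result!

Given a condition on the digit, the z space looks like a Normal distribution (so generation can be controlled to produce only that digit)

(6) Classification : Result in paper

Of the 50000 Labels in the MNIST dataset, only 100 are used and the remaining 49900 are left unused; a simple classifier is attached and trained

Training with the Semi-supervised CVAE gave a classification accuracy of 95% (it is very good at learning features automatically)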

Adversarial AE (AAE)

🎶(~AAE) Regularization Conditions for q_Φ(z|x_i) and p(z)

① Sampling is possible

② KL Divergence can be calculated

→ To satisfy these conditions, the regularization has almost always assumed a Gaussian distribution

[Encoder] q_Φ(z|x_i) ~ N(μ_i, σ_i^2I) : Gaussian distribution assumed

[Prior] p(z) ~ N(0, I) : standard Normal distribution assumed

🎶 AAE : KL Divergence term is replaced by Discriminator in GAN!

🎶(AAE) Regularization Conditions for q_Φ(z|x_i) and p(z)

① Sampling is possible ( O )

② KL Divergence can be calculated ( X )

VAE KL Divergence : the goal is to make the distribution q_Φ(z|x_i) equal to p(z)

GAN Loss : the goal is to make the sample distribution equal to the target distribution

1) Architecture

GAN : Implicit density (starts without fixing a probability density model) - Direct

VAE : Explicit density (starts by fixing a probability density model) - Approximate

(1) GAN

◾ Goal : G(z) ~ p_data(x)

(2) AAE

◾ Goal : q_Φ(z|x_i) ~ p(z)

x → q(z|x) → a z sampled from q(z) is treated as a Fake sample (-) ▶ the Encoder plays the Generator role

a z sampled from p(z) is treated as a Real sample (+)

▶ Add a Discriminator and train it to tell Fake samples from Real samples

2) Implementation

(1) Loss function

∴ AAE : q_Φ(z|x_i) can be trained to match p(z) without using the KL Divergence term of the VAE loss function!

# Reconstruction loss
marginal_likelihood = -tf.reduce_mean(tf.reduce_mean(tf.squared_difference(x,y)))

## GAN Loss
z_real = tf.concat([z_sample, z_id],1) # also feed the id of z_sample so that the mapping can be controlled
z_fake = tf.concat([z, x_id],1)
D_real, D_real_logits = discriminator(z_real, (int)(n_hidden), 1, keep_prob)
D_fake, D_fake_logits = discriminator(z_fake, (int)(n_hidden), 1, keep_prob, reuse=True)

# discriminator loss
D_loss_real = tf.reduce_mean(
    tf.nn.sigmoid_cross_entropy_with_logits(logits=D_real_logits, labels=tf.ones_like(D_real_logits)))
D_loss_fake = tf.reduce_mean(
    tf.nn.sigmoid_cross_entropy_with_logits(logits=D_fake_logits, labels=tf.zeros_like(D_fake_logits)))
D_loss = D_loss_real+D_loss_fake

# generator loss
G_loss = tf.reduce_mean(
    tf.nn.sigmoid_cross_entropy_with_logits(logits=D_fake_logits, labels=tf.ones_like(D_fake_logits)))

marginal_likelihood = tf.reduce_mean(marginal_likelihood)
D_loss = tf.reduce_mean(D_loss)
G_loss = tf.reduce_mean(G_loss)
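To make the two GAN losses above concrete, here is a NumPy sketch that evaluates them on made-up discriminator logits. The helper reproduces the numerically stable formula documented for `tf.nn.sigmoid_cross_entropy_with_logits`; the logit values are assumptions for illustration.

```python
import numpy as np

def sigmoid_cross_entropy_with_logits(logits, labels):
    """Stable form: max(l, 0) - l * y + log(1 + exp(-|l|))."""
    return np.maximum(logits, 0) - logits * labels + np.log1p(np.exp(-np.abs(logits)))

# Hypothetical discriminator logits
d_real_logits = np.array([2.0, 1.5, 3.0])    # for z drawn from the prior p(z)
d_fake_logits = np.array([-1.0, -2.5, 0.5])  # for z produced by the encoder

# Discriminator: label prior samples 1 and encoder codes 0
d_loss = (sigmoid_cross_entropy_with_logits(d_real_logits, np.ones(3)).mean()
          + sigmoid_cross_entropy_with_logits(d_fake_logits, np.zeros(3)).mean())

# Generator (encoder): wants its codes to be labeled 1
g_loss = sigmoid_cross_entropy_with_logits(d_fake_logits, np.ones(3)).mean()
print("D_loss =", d_loss)
print("G_loss =", g_loss)
```

Both losses reuse the same fake logits with opposite labels, which is what sets the encoder and the discriminator against each other.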

(2) Training Procedure

◾ 3 Steps : update once on the Reconstruction Loss for the AE, then update D and G in turn on the GAN Loss

# optimization -> 3 steps
t_vars = tf.trainable_variables()
d_vars = [var for var in t_vars if "discriminator" in var.name]
g_vars = [var for var in t_vars if "MLP_encoder" in var.name]
ae_vars = [var for var in t_vars if "MLP_encoder" in var.name or "MLP_decoder" in var.name] # Reconstruction

train_op_ae = tf.train.AdamOptimizer(learn_rate).minimize(neg_marginal_likelihood, var_list=ae_vars)
train_op_d = tf.train.AdamOptimizer(learn_rate/5).minimize(D_loss, var_list=d_vars)
# learn_rate/5 : Trick 1 for stable GAN training (train D more slowly than G)
train_op_g = tf.train.AdamOptimizer(learn_rate).minimize(G_loss, var_list=g_vars)


# Training
# reconstruction loss
_, loss_likelihood = sess.run(
    (train_op_ae, neg_marginal_likelihood),
    feed_dict={x_hat: batch_xs_input, x: batch_xs_target, x_id: batch_ids_input, z_sample: samples,
               z_id: z_id_one_hot_vector, keep_prob: 0.9})

# discriminator loss
_, d_loss = sess.run(
    (train_op_d, D_loss),
    feed_dict={x_hat: batch_xs_input, x: batch_xs_target, x_id: batch_ids_input, z_sample: samples,
               z_id: z_id_one_hot_vector, keep_prob: 0.9})

# generator loss
for _ in range(2): # Trick 2 for stable GAN training (run G twice per step so it learns faster)
    _, g_loss = sess.run(
        (train_op_g, G_loss),
        feed_dict={x_hat: batch_xs_input, x: batch_xs_target, x_id: batch_ids_input, z_sample: samples,
                   z_id: z_id_one_hot_vector, keep_prob: 0.9})

tot_loss = loss_likelihood + d_loss + g_loss
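When variables are selected by scope name, each substring test has to be written out explicitly. The sketch below uses hypothetical variable names that follow the scope names in this post and shows how a chained `or` behaves compared with two explicit `in` tests.

```python
# Hypothetical variable names, following the scope names used in this post
names = ["MLP_encoder/w0:0", "MLP_decoder/w0:0", "discriminator/w0:0"]

# `"MLP_encoder" or "MLP_decoder" in n` parses as `"MLP_encoder" or (...)`.
# A non-empty string is truthy, so every name passes this filter.
loose = [n for n in names if "MLP_encoder" or "MLP_decoder" in n]

# Testing each substring explicitly keeps only encoder and decoder variables.
strict = [n for n in names if "MLP_encoder" in n or "MLP_decoder" in n]

print("loose :", loose)
print("strict:", strict)
```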

(3) Code

Prior_factory : a collection of sampling functions for various distributions

"""
Most codes from https://github.com/musyoku/adversarial-autoencoder/blob/master/aae/sampler.py
"""
import numpy as np
from math import sin,cos,sqrt

def uniform(batch_size, n_dim, n_labels=10, minv=-1, maxv=1, label_indices=None):
    if label_indices is not None:
        if n_dim != 2 or n_labels != 10:
            raise Exception("n_dim must be 2 and n_labels must be 10.")

        def sample(label, n_labels):
            num = int(np.ceil(np.sqrt(n_labels)))
            size = (maxv-minv)*1.0/num
            x, y = np.random.uniform(-size/2, size/2, (2,))
            i = label // num
            j = label % num
            x += j*size+minv+0.5*size
            y += i*size+minv+0.5*size
            return np.array([x, y]).reshape((2,))

        z = np.empty((batch_size, n_dim), dtype=np.float32)
        for batch in range(batch_size):
            for zi in range((int)(n_dim/2)):
                    z[batch, zi*2:zi*2+2] = sample(label_indices[batch], n_labels)
    else:
        z = np.random.uniform(minv, maxv, (batch_size, n_dim)).astype(np.float32)
    return z

def gaussian(batch_size, n_dim, mean=0, var=1, n_labels=10, use_label_info=False):
    if use_label_info:
        if n_dim != 2 or n_labels != 10:
            raise Exception("n_dim must be 2 and n_labels must be 10.")

        def sample(n_labels):
            x, y = np.random.normal(mean, var, (2,))
            angle = np.angle((x-mean) + 1j*(y-mean), deg=True)
            dist = np.sqrt((x-mean)**2+(y-mean)**2)

            # label 0
            if dist <1.0:
                label = 0
            else:
                label = ((int)((n_labels-1)*angle))//360

                if label<0:
                    label+=n_labels-1

                label += 1

            return np.array([x, y]).reshape((2,)), label

        z = np.empty((batch_size, n_dim), dtype=np.float32)
        z_id = np.empty((batch_size), dtype=np.int32)
        for batch in range(batch_size):
            for zi in range((int)(n_dim/2)):
                    a_sample, a_label = sample(n_labels)
                    z[batch, zi*2:zi*2+2] = a_sample
                    z_id[batch] = a_label
        return z, z_id
    else:
        z = np.random.normal(mean, var, (batch_size, n_dim)).astype(np.float32)
        return z

def gaussian_mixture(batch_size, n_dim=2, n_labels=10, x_var=0.5, y_var=0.1, label_indices=None):
    if n_dim != 2:
        raise Exception("n_dim must be 2.")

    def sample(x, y, label, n_labels):
        shift = 1.4
        r = 2.0 * np.pi / float(n_labels) * float(label)
        new_x = x * cos(r) - y * sin(r)
        new_y = x * sin(r) + y * cos(r)
        new_x += shift * cos(r)
        new_y += shift * sin(r)
        return np.array([new_x, new_y]).reshape((2,))

    x = np.random.normal(0, x_var, (batch_size, (int)(n_dim/2)))
    y = np.random.normal(0, y_var, (batch_size, (int)(n_dim/2)))
    z = np.empty((batch_size, n_dim), dtype=np.float32)
    for batch in range(batch_size):
        for zi in range((int)(n_dim/2)):
            if label_indices is not None:
                z[batch, zi*2:zi*2+2] = sample(x[batch, zi], y[batch, zi], label_indices[batch], n_labels)
            else:
                z[batch, zi*2:zi*2+2] = sample(x[batch, zi], y[batch, zi], np.random.randint(0, n_labels), n_labels)

    return z

def swiss_roll(batch_size, n_dim=2, n_labels=10, label_indices=None):
    if n_dim != 2:
        raise Exception("n_dim must be 2.")

    def sample(label, n_labels):
        uni = np.random.uniform(0.0, 1.0) / float(n_labels) + float(label) / float(n_labels)
        r = sqrt(uni) * 3.0
        rad = np.pi * 4.0 * sqrt(uni)
        x = r * cos(rad)
        y = r * sin(rad)
        return np.array([x, y]).reshape((2,))

    z = np.zeros((batch_size, n_dim), dtype=np.float32)
    for batch in range(batch_size):
        for zi in range((int)(n_dim/2)):
            if label_indices is not None:
                z[batch, zi*2:zi*2+2] = sample(label_indices[batch], n_labels)
            else:
                z[batch, zi*2:zi*2+2] = sample(np.random.randint(0, n_labels), n_labels)
    return z
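For comparison, a vectorized sketch of the mixture-of-Gaussians prior, following the same rotate-and-shift idea as `gaussian_mixture` above. The function name `gaussian_mixture_2d`, its parameter names, and the use of `np.random.default_rng` are choices made for this example and are not part of the referenced repository.

```python
import numpy as np

def gaussian_mixture_2d(labels, n_labels=10, x_std=0.5, y_std=0.1, shift=1.4, rng=None):
    """Sample one 2-D point per label from a ring of rotated Gaussians."""
    rng = np.random.default_rng() if rng is None else rng
    labels = np.asarray(labels)
    x = rng.normal(0.0, x_std, labels.shape)
    y = rng.normal(0.0, y_std, labels.shape)
    r = 2.0 * np.pi * labels / n_labels            # rotation angle per label
    new_x = x * np.cos(r) - y * np.sin(r) + shift * np.cos(r)
    new_y = x * np.sin(r) + y * np.cos(r) + shift * np.sin(r)
    return np.stack([new_x, new_y], axis=1).astype(np.float32)

labels = np.repeat(np.arange(10), 500)             # 500 points per component
z = gaussian_mixture_2d(labels, rng=np.random.default_rng(0))
centers = np.array([z[labels == k].mean(axis=0) for k in range(10)])
print(z.shape)
print(np.round(centers, 2))
```

Each label owns one petal of the star-shaped prior, centered at distance `shift` from the origin at its own angle, which is what makes label-to-region mapping possible later.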

AAE

import tensorflow as tf

# MLP as encoder
def MLP_encoder(x, n_hidden, n_output, keep_prob):
    with tf.variable_scope("MLP_encoder"):
        # initializers
        w_init = tf.contrib.layers.xavier_initializer()
        b_init = tf.constant_initializer(0.)

        # 1st hidden layer
        w0 = tf.get_variable('w0', [x.get_shape()[1], n_hidden], initializer=w_init)
        b0 = tf.get_variable('b0', [n_hidden], initializer=b_init)
        h0 = tf.matmul(x, w0) + b0
        h0 = tf.nn.relu(h0)
        h0 = tf.nn.dropout(h0, keep_prob)

        # 2nd hidden layer
        w1 = tf.get_variable('w1', [h0.get_shape()[1], n_hidden], initializer=w_init)
        b1 = tf.get_variable('b1', [n_hidden], initializer=b_init)
        h1 = tf.matmul(h0, w1) + b1
        h1 = tf.nn.relu(h1)
        h1 = tf.nn.dropout(h1, keep_prob)

        # output layer
        wo = tf.get_variable('wo', [h1.get_shape()[1], n_output], initializer=w_init)
        bo = tf.get_variable('bo', [n_output], initializer=b_init)
        output = tf.matmul(h1, wo) + bo

    return output

# MLP as decoder
def MLP_decoder(z, n_hidden, n_output, keep_prob, reuse=False):

    with tf.variable_scope("MLP_decoder", reuse=reuse):
        # initializers
        w_init = tf.contrib.layers.xavier_initializer()
        b_init = tf.constant_initializer(0.)

        # 1st hidden layer
        w0 = tf.get_variable('w0', [z.get_shape()[1], n_hidden], initializer=w_init)
        b0 = tf.get_variable('b0', [n_hidden], initializer=b_init)
        h0 = tf.matmul(z, w0) + b0
        h0 = tf.nn.relu(h0)
        h0 = tf.nn.dropout(h0, keep_prob)

        # 2nd hidden layer
        w1 = tf.get_variable('w1', [h0.get_shape()[1], n_hidden], initializer=w_init)
        b1 = tf.get_variable('b1', [n_hidden], initializer=b_init)
        h1 = tf.matmul(h0, w1) + b1
        h1 = tf.nn.relu(h1)
        h1 = tf.nn.dropout(h1, keep_prob)

        # output layer
        wo = tf.get_variable('wo', [h1.get_shape()[1], n_output], initializer=w_init)
        bo = tf.get_variable('bo', [n_output], initializer=b_init)
        y = tf.sigmoid(tf.matmul(h1, wo) + bo)

    return y

# Discriminator
def discriminator(z, n_hidden, n_output, keep_prob, reuse=False):

    with tf.variable_scope("discriminator", reuse=reuse):
        # initializers
        w_init = tf.contrib.layers.xavier_initializer()
        b_init = tf.constant_initializer(0.)

        # 1st hidden layer
        w0 = tf.get_variable('w0', [z.get_shape()[1], n_hidden], initializer=w_init)
        b0 = tf.get_variable('b0', [n_hidden], initializer=b_init)
        h0 = tf.matmul(z, w0) + b0
        h0 = tf.nn.relu(h0)
        h0 = tf.nn.dropout(h0, keep_prob)

        # 2nd hidden layer
        w1 = tf.get_variable('w1', [h0.get_shape()[1], n_hidden], initializer=w_init)
        b1 = tf.get_variable('b1', [n_hidden], initializer=b_init)
        h1 = tf.matmul(h0, w1) + b1
        h1 = tf.nn.relu(h1)
        h1 = tf.nn.dropout(h1, keep_prob)

        # output layer
        wo = tf.get_variable('wo', [h1.get_shape()[1], n_output], initializer=w_init)
        bo = tf.get_variable('bo', [n_output], initializer=b_init)
        y = tf.matmul(h1, wo) + bo

    return tf.sigmoid(y), y

# Gateway
def adversarial_autoencoder(x_hat, x, x_id, z_sample, z_id, dim_img, dim_z, n_hidden, keep_prob):
    ## Reconstruction Loss
    # encoding
    z = MLP_encoder(x_hat, n_hidden, dim_z, keep_prob)

    # decoding
    y = MLP_decoder(z, n_hidden, dim_img, keep_prob)

    # Reconstruction loss
    marginal_likelihood = -tf.reduce_mean(tf.reduce_mean(tf.squared_difference(x,y)))

    ## GAN Loss
    z_real = tf.concat([z_sample, z_id],1) # also feed the id of z_sample so that the mapping can be controlled
    z_fake = tf.concat([z, x_id],1)
    D_real, D_real_logits = discriminator(z_real, (int)(n_hidden), 1, keep_prob)
    D_fake, D_fake_logits = discriminator(z_fake, (int)(n_hidden), 1, keep_prob, reuse=True)

    # discriminator loss
    D_loss_real = tf.reduce_mean(
        tf.nn.sigmoid_cross_entropy_with_logits(logits=D_real_logits, labels=tf.ones_like(D_real_logits)))
    D_loss_fake = tf.reduce_mean(
        tf.nn.sigmoid_cross_entropy_with_logits(logits=D_fake_logits, labels=tf.zeros_like(D_fake_logits)))
    D_loss = D_loss_real+D_loss_fake

    # generator loss
    G_loss = tf.reduce_mean(
        tf.nn.sigmoid_cross_entropy_with_logits(logits=D_fake_logits, labels=tf.ones_like(D_fake_logits)))

    marginal_likelihood = tf.reduce_mean(marginal_likelihood)
    D_loss = tf.reduce_mean(D_loss)
    G_loss = tf.reduce_mean(G_loss)

    return y, z, -marginal_likelihood, D_loss, G_loss

def decoder(z, dim_img, n_hidden):

    y = MLP_decoder(z, n_hidden, dim_img, 1.0, reuse=True)

    return y

3) Performance

(1) MNIST Results - VAE VS AAE

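◾ p(z) : N(0, 5^2I) = Gaussian shape

If training went well, the codes take a shape close to the prior distribution, i.e. the Normal distribution

- A : because of the GAN loss, AAE gives a cleaner shape that is closer to the Normal distribution than VAE

- C : VAE is trained from the maximum likelihood perspective, so it gives more weight to the distribution of the samples

◾ p(z) : mixture of 10 Gaussians = star shape

By building a mixture of 10 Gaussian distributions (star shape), one can choose which component to sample from

- B : AAE is trained with p(z) given as the prior → the codes follow the star shape of the prior distribution cleanly

- C : VAE cannot take this p(z) as its prior, so it is trained assuming a Normal distribution → because of the maximum likelihood objective the positions of the samples are reflected and a rough shape appears, but the prior shape is not reproduced as faithfully as with AAE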

(2) Incorporating Label information in the Adversarial Regularization

◾ Goal : map each MNIST Label to one of the 10 petals of the star-shaped distribution

◾ How : feed the Label information, following the desired mapping, as a condition to the Discriminator only

① When Real data, i.e. a sample drawn from the prior distribution p(z), is fed to the Discriminator

: also feed the Discriminator a condition saying which Label this sample should carry (any desired mapping is possible)

② When Fake data, i.e. a sample drawn from the posterior distribution, is fed to the Discriminator

: also feed the Discriminator the Label of the corresponding image

◾ Result

(A) : images of a given Label can be mapped to the intended region of the latent space

(C) : the same position within each Gaussian component carries the same style

(B) : the case where the prior is a Swiss roll instead of Gaussians and the condition specifies which Label to map where

(3) Supervised AAE

(4) Semi-Supervised AAE

(5) Incorporating Label information in the Adversarial Regularization

Main Contribution of AAE : the manifold can be shaped as desired so that it is easy to handle !! (previously only the Normal distribution was practical)
