Un (Suthee)’s Official Site

Ladder VAE (NIPS’2016)

2018-06-20T00:00:00+00:00

Ladder VAE is a VAE architecture that can effectively train several stochastic layers. The trick is to couple the distribution parameters from the inference and generative networks to better estimate the model’s parameters.

The generative network is:

\[P(\textbf{z}) = P(\textbf{z}_L)\prod_{i=1}^{L-1}P(\textbf{z}_i|\textbf{z}_{i+1})\]

Where:

\[P(\textbf{z}_i|\textbf{z}_{i+1}) = Normal(\textbf{z}_i|\mu_{p,i}(\textbf{z}_{i+1}), \sigma^2_{p,i}(\textbf{z}_{i+1}))\]

and

\[P(\textbf{z}_L) = Normal(\textbf{z}_L|0, I)\]

The prior \( P(\textbf{z}_L) \) is a standard Gaussian.

The inference network:

The key difference is that each stochastic layer of the inference network computes distribution parameters (a mean \( \hat \mu_{q,i} \) and a variance \( \hat \sigma^2_{q,i} \)) but does not sample the latent vector from that distribution directly. A standard VAE simply draws a sample using the parameters computed by the inference network. The Ladder VAE instead draws the latent vector from a Gaussian whose parameters are a precision-weighted combination of the inference and generative parameters:

\[\sigma^2_{q,i} = \frac{1}{\hat \sigma^{-2}_{q,i} + \sigma^{-2}_{p,i}}\]

\[\mu_{q,i} = \frac{\hat \mu_{q,i} \hat \sigma^{-2}_{q,i} + \mu_{p,i}\sigma^{-2}_{p,i}}{\hat \sigma^{-2}_{q,i} + \sigma^{-2}_{p,i}}\]

The approximate distribution at layer i is:

\[q(\textbf{z}_i|\cdot) = Normal(\textbf{z}_i|\mu_{q,i},\sigma^2_{q,i})\]

By coupling the distribution parameters from both the inference and generative networks, the Ladder VAE utilizes information from both networks for parameter learning.
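To make the coupling concrete, here is a minimal sketch of the combination step for one layer. It assumes diagonal Gaussians and takes the combined variance to be the inverse of the summed precisions, with the mean weighted by those precisions. The function name `precision_weighted` and the toy numbers are mine, not from the paper.

```python
def precision_weighted(mu_hat_q, var_hat_q, mu_p, var_p):
    """Combine inference (hat q) and generative (p) Gaussian parameters
    of one layer, element-wise, by precision weighting."""
    mu_q, var_q = [], []
    for mq, vq, mp, vp in zip(mu_hat_q, var_hat_q, mu_p, var_p):
        prec_q, prec_p = 1.0 / vq, 1.0 / vp   # precision = inverse variance
        var = 1.0 / (prec_q + prec_p)         # combined variance
        mu_q.append((mq * prec_q + mp * prec_p) * var)
        var_q.append(var)
    return mu_q, var_q


# Toy values (assumptions): two latent dimensions.
mu_q, var_q = precision_weighted(
    mu_hat_q=[2.0, 0.0], var_hat_q=[1.0, 0.25],
    mu_p=[0.0, 4.0], var_p=[1.0, 1.0],
)
```

In the first dimension both networks are equally confident, so the mean lands halfway between them. In the second dimension the inference network has the smaller variance, so the combined mean stays closer to its estimate.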

Reference

Sønderby, Casper Kaae, et al. “Ladder variational autoencoders.” Advances in neural information processing systems. 2016.

Towards a Neural Statistician (ICLR’17)

2017-09-28T00:00:00+00:00

One extension of the VAE is to add a hierarchical structure. In contrast to the classical VAE, whose prior is a standard Gaussian distribution, the Hierarchical VAE (HVAE) learns the prior distribution from the dataset.

The generative process is:

Draw a dataset prior \( \boldsymbol{c} \sim N(\boldsymbol{0}, \boldsymbol{I}) \)
For each data point in the dataset
- Draw a latent vector \( \boldsymbol{z} \sim P(\cdot \mid \boldsymbol{c}) \)
- Draw a sample \( \boldsymbol{x} \sim P(\cdot \mid \boldsymbol{z}) \)
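The generative process can be sketched as ancestral sampling. The conditionals below are toy assumptions of mine (Gaussians centered on the parent variable with fixed standard deviations); in the paper they are parameterized by neural networks.

```python
import random


def sample_dataset(n_points, dim=2, seed=0):
    rng = random.Random(seed)
    # Dataset-level context: c ~ N(0, I)
    c = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    data = []
    for _ in range(n_points):
        # Toy conditionals (assumptions): z ~ N(c, 0.5^2 I), x ~ N(z, 0.1^2 I)
        z = [rng.gauss(ci, 0.5) for ci in c]
        x = [rng.gauss(zi, 0.1) for zi in z]
        data.append(x)
    return c, data


c, data = sample_dataset(n_points=5)
```

Every point in one dataset shares the same `c`, which is what lets the context capture dataset-level statistics.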

The likelihood of the dataset is:

\[p(D) = \int p(c) ( \prod_{x \in D} \int p(x \mid z;\theta)p(z \mid c;\theta)dz ) dc\]

The paper defines the approximate inference networks \( q(z \mid x,c;\phi) \) and \( q(c \mid D; \phi) \) to optimize a variational lower bound. The lower bound on the log likelihood of a single dataset is:

\[\mathcal{L}_D = E_{q(c \mid D;\phi)}\Big[ \sum_{x \in D} \big( E_{q(z \mid c, x; \phi)}[ \log p(x \mid z;\theta)] - D_{KL}(q(z \mid c,x;\phi)||p(z \mid c;\theta)) \big) \Big] - D_{KL}(q(c \mid D;\phi)||p(c))\]

The statistic network \( q(c \mid D; \phi) \) approximates the posterior distribution over the context c given the dataset D. Basically, this inference network has an encoder that maps each datapoint to a vector \( e_i = E(x_i) \). A pooling layer then aggregates the \( e_i \) into a single vector; an element-wise mean is used. The pooled vector is used to generate the parameters of a diagonal Gaussian.
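The encode, pool, and parameterize steps can be sketched as follows. The single tanh layer, the linear output maps, and all weight values are placeholders of mine; only the three-step structure comes from the description above.

```python
import math


def encode(x, W):
    # Toy encoder E(x): one linear layer followed by tanh (an assumption).
    return [math.tanh(sum(w * xi for w, xi in zip(row, x))) for row in W]


def statistic_network(dataset, W_enc, W_mu, W_logvar):
    # 1) Encode every datapoint independently.
    encoded = [encode(x, W_enc) for x in dataset]
    # 2) Element-wise mean pooling over the dataset.
    n = len(encoded)
    pooled = [sum(e[k] for e in encoded) / n for k in range(len(encoded[0]))]
    # 3) Map the pooled vector to the parameters of a diagonal Gaussian.
    mu = [sum(w * p for w, p in zip(row, pooled)) for row in W_mu]
    logvar = [sum(w * p for w, p in zip(row, pooled)) for row in W_logvar]
    return mu, logvar


W_enc = [[0.5, -0.2], [0.1, 0.3], [-0.4, 0.7]]   # 2 inputs -> 3 features
W_mu = [[1.0, 0.0, -1.0], [0.5, 0.5, 0.5]]       # 3 features -> 2 context dims
W_logvar = [[0.1, 0.2, 0.3], [-0.3, 0.0, 0.3]]
D = [[1.0, 2.0], [0.0, -1.0], [3.0, 0.5]]
mu, logvar = statistic_network(D, W_enc, W_mu, W_logvar)
```

Because the pooling is a mean, the output does not depend on the order of the datapoints, which is the property one wants from a summary of a set.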

This model works surprisingly well for many tasks such as topic modeling, transfer learning, and one-shot learning.

Reference

Towards a Neural Statistician

IRGAN (SIGIR’17)

2017-06-28T00:00:00+00:00

This paper uses the GAN framework to combine generative and discriminative information retrieval models. It shows promising results on web search, item recommendation, and Q/A tasks.

Typically, retrieval models are classified into 2 types:

Generative retrieval model: It generates a document given a query and a relevance score. The model is \( p(d \mid q,r) \).
Discriminative retrieval model: It computes a relevance score for a given query and document pair. The model is \( p(r \mid d, q) \).

The generative model tries to find a connection between document and query. On the other hand, the discriminative model attempts to model the interaction between query and document based on relevance scores.

Both models have their shortcomings. Many generative models require a predefined data-generating story, and a wrong assumption will lead to poor performance. The generative model usually tries to fit the data to its model without external guidance. Meanwhile, the discriminative model requires a lot of labeled data to be effective, especially when it is a deep neural network.

By training both models within the GAN framework, it is now possible to address their shortcomings. The generative model becomes adaptive because the discriminator rewards it when it creates or selects good samples. This adaptive guidance from the discriminator is unique to the GAN framework and helps the generator learn to pick good samples from the data distribution. At the same time, the discriminator receives even more training data from the generative model. This is similar to semi-supervised learning, where unlabeled data are utilized. Adversarial training allows us to improve both generative and discriminative models by learning them jointly through minimax training. Traditional training based on maximum likelihood has no principled way to let the two models give each other feedback.

The proposed framework seems promising and the results on 3 information retrieval tasks are really good. But I notice that their training procedure requires pretraining. This made me wonder if pretraining accounts for part of the performance boost during testing. I did not find the part of the paper that explains the benefit of pretraining in their settings.

The discriminative model is straightforward. It is a sigmoid function. The discriminator basically gives a high probability when the given document-query pair is relevant. The generative model is more interesting. In standard GANs, the generator creates a sample from a simple distribution, but IRGAN does not generate a new document-query pair. Instead, the authors chose to let the generator select samples from the document pool. In my opinion, this approach is simpler than creating new data because the samples are realistic. Also, IRGAN cares about finding a function to compute a relevance score, so it is unnecessary to generate completely new data.

However, the cost function for the generator is an expectation over all documents in the corpus, and a Monte Carlo approximation of it will have high variance. Thus, they use policy gradient to reduce the variance so that the model can learn a useful representation. Although \( p(d \mid q,r) \) is a discrete distribution, backprop is applicable because we pre-sample all documents from \( p(d \mid q,r) \) beforehand. Thus, Eq. 5 in the paper is differentiable. Extra care may be needed to reduce the variance further, and the authors use an advantage function for this (please look at the reference on reinforcement learning [2]).
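Here is a minimal sketch of a policy-gradient (REINFORCE) estimate with a baseline, to illustrate the idea rather than reproduce the paper's exact update. It assumes the generator is a softmax over per-document scores and that a reward for each document is already available (in IRGAN the reward comes from the discriminator). The function names, the reward values, and the choice of the mean sampled reward as baseline are my assumptions.

```python
import math
import random


def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def reinforce_gradient(scores, rewards, n_samples=2000, seed=0):
    """Monte Carlo estimate of d E_{d ~ p}[reward(d)] / d scores."""
    rng = random.Random(seed)
    probs = softmax(scores)
    n_docs = len(scores)
    # Sample documents from the generator's current distribution.
    sampled = rng.choices(range(n_docs), weights=probs, k=n_samples)
    # Baseline: mean sampled reward, so reward - baseline acts as an advantage.
    baseline = sum(rewards[d] for d in sampled) / n_samples
    grad = [0.0] * n_docs
    for d in sampled:
        advantage = rewards[d] - baseline
        for i in range(n_docs):
            # d log p(d) / d score_i = 1[i == d] - p_i for a softmax policy
            indicator = 1.0 if i == d else 0.0
            grad[i] += advantage * (indicator - probs[i]) / n_samples
    return grad


# Toy pool of 4 documents; document 2 gets the highest (hypothetical) reward.
grad = reinforce_gradient(scores=[0.0, 0.0, 0.0, 0.0],
                          rewards=[0.1, 0.2, 0.9, 0.3])
```

The gradient is largest for the document with the highest reward, so a gradient-ascent step shifts the generator's probability mass toward it.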

How positive and negative samples are generated is still confusing in this paper. It seems to be application specific. The authors mention using a softmax with a temperature hyper-parameter to put more or less focus on the top documents. My guess is that when we put less focus on the top documents, the generator has more chance to pick up negative samples. After I read the paper again, it seems that all samples selected by the generator are treated as negative samples. This part remains unclear and I need to ask the authors for more details.
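The effect of the temperature can be seen in a small sketch. It assumes the scores are divided by the temperature before the softmax; the scores themselves are made up.

```python
import math


def softmax_with_temperature(scores, temperature):
    scaled = [s / temperature for s in scores]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


scores = [2.0, 1.0, 0.0]                       # hypothetical document scores
sharp = softmax_with_temperature(scores, 0.5)  # low temperature
flat = softmax_with_temperature(scores, 2.0)   # high temperature
```

A low temperature concentrates the probability on the top-scored document, while a high temperature spreads it out so that lower-ranked documents are sampled more often.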

In conclusion, I like this paper because it tries to combine generative and discriminative retrieval models via the GAN framework. The paper is well motivated and discusses the advantages of jointly training both models. It seems adversarial training is useful for IR tasks as well.

References:

[1] Jun Wang, Lantao Yu, Weinan Zhang, Yu Gong, Yinghui Xu, Benyou Wang, Peng Zhang, and Dell Zhang. 2017. IRGAN: A Minimax Game for Unifying Generative and Discriminative Information Retrieval Models. In Proceedings of SIGIR’17, Shinjuku, Tokyo, Japan, August 7-11, 2017, 10 pages.
[2] Richard S Sutton, David A McAllester, Satinder P Singh, Yishay Mansour, and others. 1999. Policy Gradient Methods for Reinforcement Learning with Function Approximation. In NIPS.
[3] IRGAN code

LDA2Vec

2017-06-28T00:00:00+00:00

A topic is interpreted by collecting the word vectors nearest to the selected topic vector. This work augments word2vec with topic modeling, and the model is trained in a similar fashion to word2vec.

The key difference of LDA2Vec is its loss function. There are 2 loss functions: the first one is the Skipgram Negative Sampling loss, which is similar to Word2Vec, and the second is a sparsity loss on the document topic proportions, introduced below. The first loss maximizes the probability of predicting a target word \( \vec w_i \) and minimizes that of non-target words (negative samples) given a context vector \( \vec c_j \). It pushes the model to distinguish a positive word (one that is related to the given context) from negatively sampled words.

The innovation is the context vector \( \vec c_j \). The intuition is that predicting a nearby word given a pivot word also depends on the theme of the document. For example, if the document is about an airline and we want to predict nearby words given the word “Germany”, we will likely want to see words related to airlines rather than country names. Thus, a context vector is the sum of a word vector and a document vector: \( \vec c_j = \vec w_j + \vec d_j \). In this way, LDA2vec attempts to capture both document-wide relationships and local interactions between words within a context window.
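A small sketch of the negative-sampling objective with the document-aware context vector may help. It assumes the usual skip-gram form: the log-sigmoid of the context-target dot product plus the log-sigmoid of the negated dot products with the negative samples. The vectors are toy values and the function names are mine.

```python
import math


def log_sigmoid(x):
    # Numerically stable log(sigmoid(x)).
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def sgns_objective(pivot_vec, doc_vec, target_vec, negative_vecs):
    # Context vector = pivot word vector + document vector.
    context = [w + d for w, d in zip(pivot_vec, doc_vec)]
    obj = log_sigmoid(dot(context, target_vec))
    obj += sum(log_sigmoid(-dot(context, n)) for n in negative_vecs)
    return obj  # to be maximized; always <= 0


pivot, doc = [1.0, 0.0], [0.0, 1.0]
good = sgns_objective(pivot, doc, target_vec=[1.0, 1.0],
                      negative_vecs=[[-1.0, -1.0]])
bad = sgns_objective(pivot, doc, target_vec=[-1.0, -1.0],
                     negative_vecs=[[1.0, 1.0]])
```

The objective is higher when the target aligns with the context and the negatives point away from it. Changing `doc` shifts the context, which is how the document theme influences which nearby words are predicted.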

In order to learn topic vectors, the document vector is further decomposed as a linear combination of topic vectors: \( \vec d_j = \sum_{k} p_{jk} \cdot \vec t_k \), where \( p_{jk} \) is the probability that document j belongs to topic k. Finally, the interpretability comes from the sparsity of the topic assignment vector \( p_j \). One way to enforce sparsity is to design the loss function as:

\[L^{d} = \lambda \sum_{jk} (\alpha - 1)\log p_{jk}\]

When \( \alpha < 1 \), we encourage a topic assignment probability to put more mass on a small set of topics.
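The decomposition and the sparsity term can be checked numerically with a short sketch. The topic vectors and proportions are made up. The code evaluates the term \( \lambda \sum_k (\alpha - 1) \log p_{jk} \) for a single document exactly as written above; I treat it as an unnormalized Dirichlet log-density, so larger values mean more preferred, which is my reading of the sign convention.

```python
import math


def document_vector(topic_probs, topic_vectors):
    # d_j = sum_k p_jk * t_k
    dim = len(topic_vectors[0])
    return [sum(p * t[i] for p, t in zip(topic_probs, topic_vectors))
            for i in range(dim)]


def dirichlet_term(topic_probs, alpha, lam=1.0):
    # lambda * sum_k (alpha - 1) * log p_jk, for a single document j
    return lam * sum((alpha - 1.0) * math.log(p) for p in topic_probs)


topics = [[1.0, 0.0], [0.0, 1.0], [5.0, 5.0]]    # 3 toy topic vectors
d = document_vector([0.5, 0.5, 0.0], topics)

sparse = dirichlet_term([0.98, 0.01, 0.01], alpha=0.5)
uniform = dirichlet_term([1 / 3, 1 / 3, 1 / 3], alpha=0.5)
```

With \( \alpha = 0.5 \), the sparse proportions score higher than the uniform ones, which matches the claim that \( \alpha < 1 \) favors putting mass on a few topics.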

The results look interesting. This paper shows a simple way to combine topic modeling with word embedding. By embedding document vectors and topic vectors into the same semantic space as word vectors, we can learn a global semantic structure as well as word-level local interaction.

RBMs for Collaborative Filtering (ICML’2007)

2017-04-06T00:00:00+00:00

If we are going to write a paper on deep learning for the Collaborative Filtering problem, then RBM-CF [1] must be cited! Although it was not the algorithm that yielded the lowest RMSE during the Netflix contest (SVD++ was the best algorithm at that time), RBM-CF was used as one of many algorithms for ensemble learning. One of the main reasons is that its approach was unique at the time, so its predictions were slightly different from those of matrix factorization approaches.

It turns out that one of the state-of-the-art models for collaborative filtering right now (as of April 2017) that I am aware of is NADE-CF [2], which is based on a similar idea to RBM-CF. This post will summarize the RBM-CF model, its architecture, and extensions.

RBM-CF models the joint probability between visible and hidden units. In the CF setting, the visible units are the observed ratings, which are represented as binary values. Each rating is one-hot encoded as a binary vector of length K, where K is the maximum rating. The goal is to infer the hidden units from these observed ratings. This means we need to learn a non-linear function that maps the visible units to a probability distribution over the hidden units. In an RBM, this non-linear function is a sigmoid function.
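As a rough sketch of this mapping, the snippet below one-hot encodes a user's ratings and computes the activation probability of each hidden unit with a sigmoid. The weight layout `W[movie][rating_index][hidden]`, the bias `b`, and the toy values are assumptions for illustration.

```python
import math


def one_hot(rating, K):
    # Rating r in 1..K becomes a binary vector of length K.
    return [1.0 if k == rating else 0.0 for k in range(1, K + 1)]


def hidden_probs(ratings, W, b, K):
    """ratings: {movie_index: rating}. Returns p(h_j = 1 | observed ratings)."""
    probs = []
    for j in range(len(b)):
        activation = b[j]
        for movie, rating in ratings.items():
            v = one_hot(rating, K)
            activation += sum(v[k] * W[movie][k][j] for k in range(K))
        probs.append(1.0 / (1.0 + math.exp(-activation)))
    return probs


K = 2
W = [[[1.0], [-1.0]]]   # 1 movie, K = 2 rating values, 1 hidden unit
b = [0.0]
p_low = hidden_probs({0: 1}, W, b, K)[0]    # user rated movie 0 with 1
p_high = hidden_probs({0: 2}, W, b, K)[0]   # user rated movie 0 with 2
```

Only the movies a user actually rated contribute to the activation, so each user effectively gets an RBM whose visible layer contains just the rated movies.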

RBM-CF uses a softmax function to model the visible units, and the hidden units are modeled by a Bernoulli distribution. The model is trained with contrastive divergence, which approximates the gradient of the log-likelihood. To make a prediction, the authors suggest first computing all \( p(v_q = k \mid V) \) and normalizing them using a softmax, then computing the expected rating.
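The prediction step can be sketched as a softmax over the K rating values followed by an expectation. The input scores stand in for the unnormalized log-probabilities of each rating and are made up here.

```python
import math


def expected_rating(scores):
    """scores[k-1] is an unnormalized log-probability for rating k."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Expectation over rating values 1..K.
    return sum(k * p for k, p in enumerate(probs, start=1)), probs


flat_pred, _ = expected_rating([0.0, 0.0, 0.0, 0.0, 0.0])
peaked_pred, _ = expected_rating([0.0, 0.0, 0.0, 0.0, 10.0])
```

Taking the expectation gives a real-valued prediction, which suits an RMSE metric better than picking the single most likely rating.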

One possible variant is to model the hidden units as Gaussian latent variables. This variant increases the capacity of the model. Another variant utilizes the missing ratings as extra information. The authors observed that every item in the test set can be treated as an item that the user viewed but whose rating is unknown. This viewing information is represented as a binary random variable that influences the hidden units. The resulting conditional RBM significantly improves performance. Imputing the missing values is a heuristic that slightly improves the model further.

The most interesting contribution is the architecture the authors propose to reduce the number of free parameters. This is done by factoring the weight matrix into a product of two lower-rank matrices. Having fewer parameters helps avoid overfitting and makes convergence faster.
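To get a feel for the savings, here is a small sketch that counts parameters for a full weight matrix versus its low-rank factorization and builds a tiny factored matrix. The sizes are arbitrary examples, not the ones used in the paper.

```python
def param_counts(n_visible, n_hidden, rank):
    full = n_visible * n_hidden                     # W: n_visible x n_hidden
    factored = n_visible * rank + rank * n_hidden   # A: n_visible x rank, B: rank x n_hidden
    return full, factored


def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]


full, factored = param_counts(n_visible=1000, n_hidden=100, rank=10)

A = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]           # 3 x 2
B = [[1.0, 2.0, 3.0, 4.0], [5.0, 6.0, 7.0, 8.0]]   # 2 x 4
W = matmul(A, B)                                   # 3 x 4, rank at most 2
```

With these sizes the factored form needs about a ninth of the parameters, at the cost of restricting the weight matrix to rank 10.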

RBM-CF has a slightly lower RMSE than a standard SVD. Modern approaches such as autoencoders or PMF are much more scalable than RBM-CF. The model can be extended to a deeper model, with the RBM used for parameter pretraining.

References:

[1] Salakhutdinov, Ruslan, Andriy Mnih, and Geoffrey Hinton. “Restricted Boltzmann machines for collaborative filtering.” Proceedings of the 24th international conference on Machine learning. ACM, 2007.

[2] Zheng, Yin, et al. “A neural autoregressive approach to collaborative filtering.” Proceedings of the 33rd International Conference on Machine Learning. 2016.