<?xml version="1.0" encoding="utf-8" standalone="yes" ?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom">
  <channel>
    <title>Paperwhy</title>
    <link>https://paperwhy.8027.org/</link>
    <description>Recent content on Paperwhy</description>
    <generator>Hugo -- gohugo.io</generator>
    <language>en-gb</language>
    <lastBuildDate>Sat, 22 Dec 2018 00:00:00 +0000</lastBuildDate>
    
	<atom:link href="https://paperwhy.8027.org/index.xml" rel="self" type="application/rss+xml" />
    
    
    <item>
      <title>Recurrent models of visual attention</title>
      <link>https://paperwhy.8027.org/2018/12/22/recurrent-models-of-visual-attention/</link>
      <pubDate>Sat, 22 Dec 2018 00:00:00 +0000</pubDate>
      
      <guid>https://paperwhy.8027.org/2018/12/22/recurrent-models-of-visual-attention/</guid>
      <description>tl;dr: Training a network to classify images (with a single label) is modeled as a sequential decision problem where actions are salient locations in the image and tentative labels. The state (full image) is partially observed through a fixed-size subimage around each location. The policy takes the full history into account, compressed into a hidden vector via an RNN. REINFORCE is used to compute the policy gradient.
Although the paper targets several applications, to fix ideas, say we want to classify images with one label.</description>
    </item>
    
    <item>
      <title>Extrapolation and learning equations</title>
      <link>https://paperwhy.8027.org/2017/10/21/extrapolation-and-learning-equations/</link>
      <pubDate>Sat, 21 Oct 2017 00:00:00 +0000</pubDate>
      
      <guid>https://paperwhy.8027.org/2017/10/21/extrapolation-and-learning-equations/</guid>
      <description>tl;dr: Starting from the intuition that many physical dynamical systems are typically well modeled by first-order systems of ODEs with governing equations expressed in terms of a few elementary functions, the authors propose a fully connected architecture with multiple non-linearities with the purpose of learning the formulae for these systems of equations. The network effectively performs a kind of hierarchical, non-linear regression with the given nonlinearities as basis functions and is able to learn the governing equations for several examples like a compound pendulum or the forward kinematics of a robotic arm.</description>
    </item>
    
    <item>
      <title>Deep Residual Learning for image recognition</title>
      <link>https://paperwhy.8027.org/2017/07/06/deep-residual-learning-for-image-recognition/</link>
      <pubDate>Thu, 06 Jul 2017 00:00:00 +0000</pubDate>
      
      <guid>https://paperwhy.8027.org/2017/07/06/deep-residual-learning-for-image-recognition/</guid>
      <description>tl;dr: Deeper models for visual tasks have been proven to greatly outperform shallow ones, but after some point simply adding more layers decreases performance even if the networks are in principle more expressive. Adding skip-connections overcomes these difficulties and vastly improves performance, while keeping the number of parameters under control.
This post is a prequel to previous ones where we went over work studying the theoretical properties of Residual Networks, introduced in the current paper.</description>
    </item>
    
    <item>
      <title>Batch normalization: accelerating deep network training by reducing internal covariate shift</title>
      <link>https://paperwhy.8027.org/2017/06/26/batch-normalization-accelerating-deep-network-training-by-reducing-internal-covariate-shift/</link>
      <pubDate>Mon, 26 Jun 2017 00:00:00 +0000</pubDate>
      
      <guid>https://paperwhy.8027.org/2017/06/26/batch-normalization-accelerating-deep-network-training-by-reducing-internal-covariate-shift/</guid>
      <description>tl;dr: Normalization to zero mean and unit variance of layer outputs in a deep model vastly improves learning rates and yields improvements in generalization performance. Approximating the full sample statistics by mini-batch ones is effective and computationally manageable. You should be doing it too.
Covariate shift and whitening. For any procedure learning a function $f$ from random data $X \sim \mathbb{P}_{X}$, it is essential that the distribution itself does not vary during the learning process.</description>
    </item>
    
    <item>
      <title>On the number of linear regions of deep neural networks</title>
      <link>https://paperwhy.8027.org/2017/06/20/on-the-number-of-linear-regions-of-deep-neural-networks/</link>
      <pubDate>Tue, 20 Jun 2017 00:00:00 +0000</pubDate>
      
      <guid>https://paperwhy.8027.org/2017/06/20/on-the-number-of-linear-regions-of-deep-neural-networks/</guid>
      <description>tl;dr: To increase the complexity of the piecewise linear functions computed by feed-forward neural networks with rectifier or maxout units, adding layers to build a deep model is exponentially better than just increasing the number of parameters in a shallow one.
Consider a feed-forward neural network with linear layers $f_{l} (x) = W^l x + b^l$ followed by ReLUs $g_{l} (z) = \max \lbrace 0, z \rbrace $:</description>
    </item>
    
    <item>
      <title>Training with noise is equivalent to Tikhonov regularization</title>
      <link>https://paperwhy.8027.org/2017/06/12/training-with-noise-is-equivalent-to-tikhonov-regularization/</link>
      <pubDate>Mon, 12 Jun 2017 00:00:00 +0000</pubDate>
      
      <guid>https://paperwhy.8027.org/2017/06/12/training-with-noise-is-equivalent-to-tikhonov-regularization/</guid>
      <description>tl;dr: Adding noise to training inputs changes the risk function. A Taylor expansion shows that up to a term quadratic in the noise amplitude, the empirical risk is the same as without noise but with an additional term involving 1st derivatives of the estimator.
In our quest to understand all things regularization, today we review an old piece by Christopher Bishop no less!
The bias-variance tradeoff. We begin with a classical observation: for any statistical model we develop (i.</description>
    </item>
    
    <item>
      <title>Identity matters in Deep Learning</title>
      <link>https://paperwhy.8027.org/2017/06/07/identity-matters-in-deep-learning/</link>
      <pubDate>Wed, 07 Jun 2017 00:00:00 +0000</pubDate>
      
      <guid>https://paperwhy.8027.org/2017/06/07/identity-matters-in-deep-learning/</guid>
      <description>tl;dr: vanilla residual networks are very good approximators of functions which can be represented as linear perturbations of the identity. In the linear setting, optimization is aided by a benevolent landscape having only minima in certain (interesting) regions. Finally, very simple ResNets can completely learn datasets with $\mathcal{O} (n \log n + \ldots)$ parameters. All this seems to indicate that deep and simple architectures might be enough to achieve great performance.</description>
    </item>
    
    <item>
      <title>On gradient-based optimization: accelerated, stochastic, asynchronous, distributed</title>
      <link>https://paperwhy.8027.org/2017/06/04/on-gradient-based-optimization-accelerated-stochastic-asynchronous-distributed/</link>
      <pubDate>Sun, 04 Jun 2017 00:00:00 +0000</pubDate>
      
      <guid>https://paperwhy.8027.org/2017/06/04/on-gradient-based-optimization-accelerated-stochastic-asynchronous-distributed/</guid>
      <description>Today&amp;rsquo;s post is about another great talk given at the Simons Institute for the Theory of Computing in the context of their currently ongoing series Computational Challenges in Machine Learning.
  Part 1: Variational, Hamiltonian and Symplectic Perspectives on Acceleration. For convex functions, the Nesterov accelerated gradient method attains the optimal rate of $\mathcal{O} (1 / k^2)$.1
\begin{equation} \label{eq:nesterov}\tag{1} \left \lbrace \begin{array}{lll} y_{k + 1} &amp;amp; = &amp;amp; x_{k} - \beta \nabla f (x_{k})\\</description>
    </item>
    
    <item>
      <title>Dropout training as adaptive regularization</title>
      <link>https://paperwhy.8027.org/2017/05/31/dropout-training-as-adaptive-regularization/</link>
      <pubDate>Wed, 31 May 2017 00:00:00 +0000</pubDate>
      
      <guid>https://paperwhy.8027.org/2017/05/31/dropout-training-as-adaptive-regularization/</guid>
      <description>tl;dr: dropout (of features) for GLMs is a noising procedure equivalent to Tikhonov regularization. A first order approximation of the regularizer actually scales the parameters with the Fisher information matrix, adapting the objective function to the dataset, independently of the labels. This makes dropout useful in the context of semi-supervised learning: regularizers can be adapted to the unlabeled data yielding better generalization. For logistic regression the adaptation amounts to favoring features on which the estimator is confident.</description>
    </item>
    
    <item>
      <title>Why and when can deep – but not shallow – networks avoid the curse of dimensionality: a review</title>
      <link>https://paperwhy.8027.org/2017/05/29/why-and-when-can-deep--but-not-shallow--networks-avoid-the-curse-of-dimensionality-a-review/</link>
      <pubDate>Mon, 29 May 2017 00:00:00 +0000</pubDate>
      
      <guid>https://paperwhy.8027.org/2017/05/29/why-and-when-can-deep--but-not-shallow--networks-avoid-the-curse-of-dimensionality-a-review/</guid>
      <description>tl;dr:1 deep convnets avoid the curse of dimensionality for the approximation of certain classes of functions (hierarchical compositions): complexity bounds (for the number of units) are polynomial instead of exponential in the dimension of the input as is the case for shallow networks. This is true for smooth and non-smooth activations like ReLUs. For the latter, insight into how they approximate (hierarchical) Lipschitz functions is provided. It is conjectured that many target functions relevant to current machine learning problems are in these classes due either to physical grounds2 or biological ones.</description>
    </item>
    
    <item>
      <title>Maxout Networks</title>
      <link>https://paperwhy.8027.org/2017/05/23/maxout-networks/</link>
      <pubDate>Tue, 23 May 2017 22:45:50 +0200</pubDate>
      
      <guid>https://paperwhy.8027.org/2017/05/23/maxout-networks/</guid>
      <description>tl;dr: this paper introduced an activation function for deep convolutional networks which specifically benefits from regularization with dropout1 and still has a universal approximation property for continuous functions. It is hypothesized that, analogously to ReLUs, the locally linear character of these units makes the averaging of the dropout ensemble more accurate than with fully non-linear units. Although sparsity of representation is lost w.r.t. ReLUs, backpropagation of errors is improved by not clamping to 0, resulting in significant performance gains.</description>
    </item>
    
    <item>
      <title>Deep sparse rectifier neural networks</title>
      <link>https://paperwhy.8027.org/2017/05/18/deep-sparse-rectifier-neural-networks/</link>
      <pubDate>Thu, 18 May 2017 00:00:00 +0000</pubDate>
      
      <guid>https://paperwhy.8027.org/2017/05/18/deep-sparse-rectifier-neural-networks/</guid>
      <description>tl;dr: use ReLUs by default. Don&amp;rsquo;t pretrain if you have lots of labeled training data, but do in unsupervised settings. Use regularisation on weights / activations. $L_1$ might promote sparsity; ReLUs already do, and this seems good if the data itself is sparse.
This seminal paper settled the introduction of ReLUs1 into the neural network community (they had already been used in other contexts, e.g. in RBMs).2
 rectifying neurons (…) yield equal or better performance than hyperbolic tangent networks in spite of the hard non-linearity and non-differentiability at zero, creating sparse representations with true zeros, which seem remarkably suitable for naturally sparse data</description>
    </item>
    
    <item>
      <title>Deep Learning using linear Support Vector Machines</title>
      <link>https://paperwhy.8027.org/2017/05/15/deep-learning-using-linear-support-vector-machines/</link>
      <pubDate>Mon, 15 May 2017 00:00:00 +0000</pubDate>
      
      <guid>https://paperwhy.8027.org/2017/05/15/deep-learning-using-linear-support-vector-machines/</guid>
      <description>The author substitutes a linear SVM for the softmax atop some architectures, then backpropagates the error of the primal problem to the whole network. This idea had already been proposed in the literature, but with a standard hinge loss instead of the $L^2$-loss that the author uses.1 Because an $L^2$ loss penalizes mistakes more heavily than the standard hinge loss, the author believes that:
 the performance gain is largely due to the superior regularization effects of the SVM loss function, rather than an advantage from better parameter optimization.</description>
    </item>
    
    <item>
      <title>Why does deep and cheap learning work so well?</title>
      <link>https://paperwhy.8027.org/2017/05/13/why-does-deep-and-cheap-learning-work-so-well/</link>
      <pubDate>Sat, 13 May 2017 00:00:00 +0000</pubDate>
      
      <guid>https://paperwhy.8027.org/2017/05/13/why-does-deep-and-cheap-learning-work-so-well/</guid>
      <description>tl;dr: There is (hard to judge) physical motivation for the success of shallow networks as approximators of Hamiltonians. Proof that fixed-size networks can approximate polynomials arbitrarily well, and implications for typical Hamiltonians. Proof that the inference (reconstruction of initial parameters) of hierarchical / sequential Markovian processes (argued to be pervasive in nature) is learnable by deep architectures but not by shallower ones (no-flattening theorem).
This paper addresses two fundamental questions for deep networks.</description>
    </item>
    
    <item>
      <title>Greedy layer-wise training of Deep Networks</title>
      <link>https://paperwhy.8027.org/2017/05/10/greedy-layer-wise-training-of-deep-networks/</link>
      <pubDate>Wed, 10 May 2017 00:00:00 +0000</pubDate>
      
      <guid>https://paperwhy.8027.org/2017/05/10/greedy-layer-wise-training-of-deep-networks/</guid>
      <description>Back in the dark days of 2006, neural networks were not properly initialised (no batchnorm1), not properly regularised (no dropout,2 no maxout3), mostly still using sigmoids4, not properly trained (no momentum,5 no Adam, no Hogwild!). Random initialisation of weights often led to poor local minima. This paper took an idea of Hinton, Osindero, and Teh (2006) for pre-training of Deep Belief Networks: greedily (one layer at a time) pre-training a network in unsupervised fashion kicks its weights to regions closer to better local minima,</description>
    </item>
    
    <item>
      <title>Local minima in training of neural networks</title>
      <link>https://paperwhy.8027.org/2017/05/09/local-minima-in-training-of-neural-networks/</link>
      <pubDate>Tue, 09 May 2017 00:00:00 +0000</pubDate>
      
      <guid>https://paperwhy.8027.org/2017/05/09/local-minima-in-training-of-neural-networks/</guid>
      <description>tl;dr: The goal is to construct elementary examples of datasets such that some neural network architectures get stuck in very bad local minima. The purpose is to better understand why NNs seem to work so well for many problems and what it is that makes them fail when they do. The authors conjecture that their examples can be generalized to higher dimensional problems and therefore that the good learning properties of deep networks rely heavily on the structure of the data.</description>
    </item>
    
    <item>
      <title>Understanding Dropout</title>
      <link>https://paperwhy.8027.org/2017/05/05/understanding-dropout/</link>
      <pubDate>Fri, 05 May 2017 00:00:00 +0000</pubDate>
      
      <guid>https://paperwhy.8027.org/2017/05/05/understanding-dropout/</guid>
      <description>The authors set out to study the &amp;ldquo;averaging&amp;rdquo; properties of dropout in a quantitative manner in the context of fully connected, feed-forward networks understood as DAGs. In particular, architectures other than sequential are included, cf. Figure 1. In the linear case with no activations, the output of some layer $h$ (no dropout yet) is:
$$ S^h_i = \sum_{l &amp;lt; h} \sum_j w^{h l}_{i j} S^l_j . $$
And if activations are included:</description>
    </item>
    
    <item>
      <title>Representational and optimization properties of Deep Residual Networks</title>
      <link>https://paperwhy.8027.org/2017/05/02/representational-and-optimization-properties-of-deep-residual-networks/</link>
      <pubDate>Tue, 02 May 2017 00:00:00 +0000</pubDate>
      
      <guid>https://paperwhy.8027.org/2017/05/02/representational-and-optimization-properties-of-deep-residual-networks/</guid>
      <description>Today&amp;rsquo;s post reviews a recent talk given at the Simons Institute for the Theory of Computing in their current workshop series Computational Challenges in Machine Learning.
tl;dr: Sufficiently regular functions (roughly: having Lipschitz, invertible derivatives) can be represented as compositions of decreasing, small perturbations of the identity. Furthermore, critical points of the quadratic loss for these target functions are proven to be always minima, thus ensuring loss-reducing gradient descent steps. This makes this class of functions &amp;ldquo;easily&amp;rdquo; approximable by Deep Residual Networks.</description>
    </item>
    
    <item>
      <title>Improving neural networks by preventing co-adaptation of feature detectors</title>
      <link>https://paperwhy.8027.org/2017/04/29/improving-neural-networks-by-preventing-co-adaptation-of-feature-detectors/</link>
      <pubDate>Sat, 29 Apr 2017 00:00:00 +0000</pubDate>
      
      <guid>https://paperwhy.8027.org/2017/04/29/improving-neural-networks-by-preventing-co-adaptation-of-feature-detectors/</guid>
      <description>This paper introduced the now pervasive dropout regularisation technique. The basic idea is that
 On each presentation of each training case, each hidden unit is randomly omitted from the network with a probability of 0.5 (…)
 The intuition behind this is that silencing random units at each iteration (about 50% of them), effectively training many different networks, prevents the neurons from &amp;ldquo;co-adapting&amp;rdquo;, i.e. from relying too much on each other for their outputs.</description>
    </item>
    
    <item>
      <title>Identifying a minimal class of models for high–dimensional data</title>
      <link>https://paperwhy.8027.org/2017/04/27/identifying-a-minimal-class-of-models-for-highdimensional-data/</link>
      <pubDate>Thu, 27 Apr 2017 00:00:00 +0000</pubDate>
      
      <guid>https://paperwhy.8027.org/2017/04/27/identifying-a-minimal-class-of-models-for-highdimensional-data/</guid>
      <description>tl;dr: a technique for feature selection in regression which might be useful for exploratory analysis and which can provide guidelines for designing subsequent costly experiments by hinting at which features need not be collected. The main weaknesses are multiple non-discoverable hyperparameters, a blind random search for optimization, and a not so easily actionable output of the algorithm.
Consider sparse regression with a number of features/predictors $p$ greater than the number of datapoints $n$.</description>
    </item>
    
    <item>
      <title>Spectral Clustering based on Local PCA</title>
      <link>https://paperwhy.8027.org/2017/04/25/spectral-clustering-based-on-local-pca/</link>
      <pubDate>Tue, 25 Apr 2017 00:00:00 +0000</pubDate>
      
      <guid>https://paperwhy.8027.org/2017/04/25/spectral-clustering-based-on-local-pca/</guid>
      <description>Actually appeared in 2011.
tl;dr: This paper develops an algorithm in manifold clustering1, Connected Component Extraction, which attempts to resolve the issue of intersecting manifolds. The idea is to use a local version of PCA at each point to determine the &amp;ldquo;principal&amp;rdquo; or &amp;ldquo;approximate tangent space&amp;rdquo; at that point in order to compute a set of weights for neighboring points. Then these weights are used to build a graph and Spectral Graph Partitioning2 is applied to compute its connected components.</description>
    </item>
    
  </channel>
</rss>