1 Journal of Machine Learning Research 11 (2010) 3371-3408 Submitted 5/10; Published 12/10 in Stacked Denoising Autoencoders: Learning Useful Representations a Deep Network with a Local Denoising Criterion . @ UMONTREAL . CA PASCAL VINCENT Pascal Vincent epartement d’informatique et de recherche op D erationnelle ́ ́ Universit e de Montr eal ́ ́ 2920, chemin de la Tour eal, Qu ebec, H3T 1J8, Canada Montr ́ ́ Hugo Larochelle @ CS . TORONTO . EDU LAROCHEH Department of Computer Science University of Toronto 10 King’s College Road Toronto, Ontario, M5S 3G4, Canada ISABELLE . LAJOIE [email protected] UMONTREAL . Isabelle Lajoie CA Yoshua Bengio YOSHUA . BENGIO @ UMONTREAL . CA Pierre-Antoine Manzagol PIERRE - ANTOINE . MANZAGOL @ UMONTREAL . CA D erationnelle epartement d’informatique et de recherche op ́ ́ e de Montr eal Universit ́ ́ 2920, chemin de la Tour Montr ebec, H3T 1J8, Canada eal, Qu ́ ́ ́ Editor: L eon Bottou Abstract based on stacking layers of denoising We explore an original strategy for building deep networks, autoencoders which are trained locally to denoise corrupted versions of t heir inputs. The resulting algorithm is a straightforward variation on the stacking of ordinary autoencoders. It is however shown on a benchmark of classification problems to yield sign ificantly lower classification error, thus bridging the performance gap with deep belief networks (DBN), and in several cases surpass- ing it. Higher level representations learnt in this purely u nsupervised fashion also help boost the eriments show that, contrary to ordi- performance of subsequent SVM classifiers. Qualitative exp n Gabor-like edge detectors from natural nary autoencoders, denoising autoencoders are able to lear image patches and larger stroke detectors from digit images . This work clearly establishes the value of using a denoising criterion as a tractable unsupervised o bjective to guide the learning of useful higher level representations. Keywords: deep learning, unsupervised feature learning, deep belief networks, autoencoders, denoising 1. Introduction the composition of several It has been a long held belief in the field of neural network research that levels of nonlinearity would be key to efficiently model complex relationships between variables and to achieve better generalization performance on difficult recognition ta sks (McClelland et al., 1986; Hinton, 1989; Utgoff and Stracuzzi, 2002). This viewpoint is motiv ated in part by knowledge c © 2010 Pascal Vincent, Hugo Larochelle, Isabelle Lajoie, Yos hua Bengio and Pierre-Antoine Manzagol.

2 V INCENT AROCHELLE , L AJOIE , B ENGIO AND M ANZAGOL , L l cortex, and in part by a of the layered architecture of regions of the human brain such as the visua ̊ ̊ astad and Goldmann, 1991; Bengio and astad, 1986; H body of theoretical arguments in its favor (H LeCun, 2007; Bengio, 2009). Yet, looking back at the history of multi-laye r neural networks, their problematic non-convex optimization has for a long time prevented reaping the e xpected benefits 1 Consequently (Bengio et al., 2007; Bengio, 2009) of going beyond one or two hidden la yers. tures allowing for convex much of machine learning research has seen progress in shallow architec ormant. optimization, while the difficult problem of learning in deep networks was left d deep architectures is due to the discovery of novel ap- The recent revival of interest in such t al., 2007; Ranzato et al., proaches (Hinton et al., 2006; Hinton and Salakhutdinov, 2006; Bengio e meters. Several alternative tech- 2007; Lee et al., 2008) that proved successful at learning their para ep belief networks (DBN) niques and refinements have been suggested since the seminal work on de however to build on the by Hinton et al. (2006) and Hinton and Salakhutdinov (2006). All appear same principle that we may summarize as follows: • Training a deep network to directly optimize only the supervised objective of in terest (for ex- ample the log probability of correct classification) by gradient descent, sta rting from random initialized parameters, does not work very well. • What works much better is to initially use a local unsupervised criterion to (pre)train each layer in turn, with the goal of learning to produce a useful higher-level representation from the g point on, gradient lower-level representation output by the previous layer. From this startin generalization descent on the supervised objective leads to much better solutions in terms of performance. Deep layered networks trained in this fashion have been shown empirically to avoid getting stuck in the kind of poor solutions one typically reaches with only random initializ ations. See ding possible explanations Erhan et al. (2010) for an in depth empirical study and discussion regar for the phenomenon. In addition to the supervised criterion relevant to the task, what appears to be key is using an unsupervised criterion to guide the learning at each layer. In this sense, these techniques additional t they are useful even bear much in common with the semi-supervised learning approach, except tha in the scenario where all examples are labeled, exploiting the input part of th e data to regularize, thus approaching better minima of generalization error (Erhan et al., 2010). There is yet no clear understanding of what constitutes “good” repres entations for initializing deep architectures or what explicit unsupervised criteria may best guide their learning. We know d Boltzmann machines but a few algorithms that work well for this purpose, beginning with restricte (RBMs) (Hinton et al., 2006; Hinton and Salakhutdinov, 2006; Lee et al., 2 008), and autoencoders edding (Weston et al., (Bengio et al., 2007; Ranzato et al., 2007), but also semi-supervised emb 2008) and kernel PCA (Cho and Saul, 2010). It is worth mentioning here that RBMs (Hinton, 2002; Smolensky, 1986) and basic classical autoencoders are very similar in their functional form, although their interpr etation and the pro- cedures used for training them are quite different. More specifically, the deterministic function that maps from input to mean hidden representation , detailed below in Section 2.2, is the same for both models. One important difference is that deterministic autoencoders con sider that real valued 1. There is a notable exception to this in the more specialized convolutional ne twork architecture of LeCun et al. (1989). 3372

3 S TACKED A UTOENCODERS D ENOISING binary hidden representa- mean as their hidden representation whereas stochastic RBMs sample a tion from that mean. However, after their initial pretraining, the way layers o f RBMs are typically these real-valued means used in practice when stacked in a deep neural network is by propagating e deterministic (Hinton et al., 2006; Hinton and Salakhutdinov, 2006). This is more in line with th autoencoder interpretation. Note also that reconstruction error of an au toencoder can be seen as an approxima- approximation of the log-likelihood gradient in an RBM, in a way that is similar to the and Delalleau, 2009). tion made by using the Contrastive Divergence updates for RBMs (Bengio ders yields almost as It is thus not surprising that initializing a deep network by stacking autoenco good a classification performance as when stacking RBMs (Bengio et al., 2 007; Larochelle et al., almost as good? An initial motivation of the research presented here was 2009a). But why is it only to find a way to bridge that performance gap. what can With the autoencoder paradigm in mind, we began an inquiry into the question of ised learning principles likely to shape a good, useful representation. We were looking for unsuperv lead to the learning of feature detectors that detect important structure in the input patterns. Section 2 walks the reader along the lines of our reasoning. Starting from th e simple intuitive notion of preserving information, we present a generalized formulation of the classical autoencoder, denoising before highlighting its limitations. This leads us in Section 3 to motivate an alternative criterion, and derive the denoising autoencoder model, for which we also give a possible intuitive geometric interpretation. A closer look at the considered noise types will then allow us to derive a g works and approaches. further extension of the base model. Section 4 discusses related preexistin learnt by a single-layer Section 5 presents experiments that qualitatively study the feature detectors xperiments with multi-layer denoising autoencoder under various conditions. Section 6 describes e architectures obtained by stacking denoising autoencoders and compare s their classification perfor- mance with other state-of-the-art models. Section 7 is an attempt at turning stac ked (denoising) autoencoders into practical generative models, to allow for a qualitative co mparison of generated samples with DBNs. Section 8 summarizes our findings and concludes our wor k. 1.1 Notation We will be using the following notation throughout the article: • Random variables are written in upper case, for example, X . th X is a random vector, then its j component • will be noted X If . j • f a random vector Ordinary vectors are written in lowercase bold. For example, a realization o X may be written x . Vectors are considered column vectors. Matrices are written in uppercase bold (e.g., W ). I denotes the identity matrix. • T T ′ ′ W is written x or • W The transpose of a vector ( not x x or W or a matrix which may be used to refer to an entirely different vector or matrix). • p and q to denote both probability density functions or probability mass We use lower case functions according to context. • X and Y two random variables with marginal probability p ( X ) and p ( Y ) . Their joint Let p p X , Y ) and the conditional ( ( X | Y ) . probability is written • We may use the following common shorthands when unambiguous: p ( x ) for p ( X = x ) ; ( p X | y ) for p ( X | Y = y ) (denoting a conditional distribution) and p ( x | y ) for p ( X = x | Y = y ) . 3373

4 V INCENT AROCHELLE , L AJOIE , B ENGIO AND M ANZAGOL , L f , , h , will be used for ordinary functions. • g p is probability mass): Expectation (discrete case, • [ f ( X )] = E ( ) x p . X = x ) f ( ∑ ) X ( p x ∫ is probability density): E [ . x d • f ( X Expectation (continuous case, p p ( x ) f ( x ) )] = ) ( p X ( IH ) = IH ( p ) = E Entropy or differential entropy: X [ − log p ( X )] . • ( p ) X Conditional entropy: IH ( X | Y ) = E . [ )] Y | X • − log p ( ) ( , X p Y ) X p ( . ] log E • Kullback-Leibler divergence: ( p [ ‖ q ) = ID KL ) X ( p q ( X ) Cross-entropy: IH ( p ‖ q ) = E . ) q ‖ [ − log q ( • )] = IH ( p )+ ID p ( X KL X p ) ( Mutual information: I ( X ; Y ) = IH ( X ) − • ( X | Y ) . IH 1 T • Sigmoid: s ( x ) = x )) x ( and s ( x ) = ( s ( . s ) ,..., − x 1 d + 1 e B : ( μ ) . By extension for vector variables: • ∼ B ( μ ) means μ Bernoulli distribution with mean X , X ∀ ∼ B ( μ . ) i i i 1.2 General setup We consider the typical supervised learning setup with a training set of n (input, target) pairs D = n ) n ( 1 ) ( ) ( 1 ) n ( ( ( ) t , t , x ..., ) } , that we suppose to be an i.i.d. sample from an unknown distribution x { 0 0 , T ) with corresponding marginals q ( X ) and q ( T ) . We denote q q ( X , T ) and q ( ( X ) the empirical X d D or . X is a d -dimensional random vector (typically in IR distributions defined by the samples in n d 0 , 1 ] in ). [ Y presentation X . of In this work we are primarily concerned with finding a new, higher-level re ′ ′ d ′ ′ d d we will talk of an > d ). If 1 [ -dimensional random vector (typically in IR Y is a or in d 0 , ] ′ over-complete representation, whereas it will be termed an under-complete representation if d < d . Y may be linked to X by a deterministic or stochastic mapping q ( Y | X ; θ ) parameterized by a vector of parameters θ . o Autoencoders 2. What Makes a Good Representation? From Mutual Information t operational definition of a “good” representation as one that will From the outset we can give an useful for addressing tasks of interest, in the sense that it will help the system quic kly eventually be d to form the representa- achieve higher performance on those tasks than if it hadn’t first learne tion. Based on the objective measure typically used to assess algorithm perf ormance, this might be phrased as “A good representation is one that will yield a better performing classifier”. Final classi- fication performance will indeed typically be used to objectively compare algo rithms. However, if a lesson is to be learnt from the recent breakthroughs in deep network tr aining techniques, it is that the error signal from a single narrowly defined classification task should not be the only nor primary criterion used to guide the learning of representations. First because it has been shown expe rimen- tally that beginning by optimizing an unsupervised criterion, oblivious of the s pecific classification problem, can actually greatly help in eventually achieving superior performa nce for that classifica- tion problem. Second it can be argued that the capacity of humans to quickly b ecome proficient in new tasks builds on much of what they have learnt prior to being faced with that task. In this section, we begin with the simple notion of retaining information and progre ss to formally introduce the traditional autoencoder paradigm from this more general vantage point. 3374

5 S TACKED ENOISING A UTOENCODERS D 2.1 Retaining Information about the Input We are interested in learning a (possibly stochastic) mapping from input X to a novel representa- . To make this more precise, let us restrict ourselves to parameterized mappin tion ( Y | X ) = gs Y q | X θ ) with parameters θ that we want to learn. ; ( q Y to meet, at least to some representation One natural criterion that we may expect any good degree, is to retain a significant amount of information about the input . It can be expressed in X ( Y ) between an input random information-theoretic terms as maximizing the mutual information I ; and its higher level representation . This is the infomax principle put forward by Linsker X variable Y (1989). Mutual information can be decomposed into an entropy and a conditional entr opy term in two ( different ways. A first possible decomposition is ; Y ) = IH ( Y ) − IH ( Y | X ) which lead Bell and I X Analysis. Here we will Sejnowski (1995) to their infomax approach to Independent Component − X ; Y ) = IH ( X ) I IH ( X | Y ) . Since observed input X comes from start from another decomposition: ( q ( X ) on which θ has no influence, this makes IH an unknown distribution X ) an unknown constant. ( Thus the infomax principle reduces to: IH ( ; Y ) = arg max arg max − X ( X | Y ) I θ θ )] E = . arg max Y | X [ log q ( ) Y X ( q , θ ( p | Y ) we will have Now for any distribution X E [ log p ( X | Y )] ≤ E [ log q ( X | Y )] (1) , ) ( q , Y ) q ( X , Y X ︸ ︸ ︷︷ ( | Y ) − IH X ns p and q we have as can easily be shown starting from the property that for any two distributio ID 0. ( q ‖ p ) ≥ 0, and in particular ID ≥ ( q ( X | Y = y ) ‖ p ( X | Y = y )) KL KL ′ ′ p Y ; θ ( ) , parameterized by θ | , and the following Let us consider a parametric distribution X optimization: ′ . E max )] log [ ( X | Y ; θ p X , Y ; θ ) q ( ′ , θ θ − IH ( From Equation 1, we see that this corresponds to maximizing a lower bound on | Y ) and thus X on the mutual information. We would end up maximizing the mutual information provided exact ′ ′ s.t. q ( X | Y ) = p ( X | Y ; θ ∃ ) θ . If, as is done in infomax ICA, we further restrict ourselves to a deterministic X to mapping from , that is, representation Y is to be computed by a parameterized function Y = f Y ( X ) or equivalently θ q ( Y | X ; θ ) = δ ( Y − f denotes Dirac-delta), then this optimization can be written: ( X )) (where δ θ ′ . )] X E θ ; ) ( [ log p ( X | Y = f max θ X ( q ) ′ θ θ , This again corresponds to maximizing a lower bound on the mutual information. q ( X ) is unknown, but we have samples from it, the empirical average over the tra ining Since E samples can be used instead as an unbiased estimate (i.e., replacing by E ): 0 q ( X ) ( ) q X ′ (2) . )] p E θ ; ) X ( [ log max ( X | Y = f 0 θ X ( q ) ′ θ , θ 3375

6 V INCENT AROCHELLE , L AJOIE , B ENGIO AND M ANZAGOL , L reconstruction error We will see in the next section that this equation corresponds to the criterion . autoencoders used to train 2.2 Traditional Autoencoders (AE) 2 (AE) Here we briefly specify the traditional autoencoder framework and its terminology, based on ′ ( | p ; θ X ) introduced above. and f Y θ that transforms an input vector Encoder: x The deterministic mapping f into hidden represen- θ encoder . Its typical form is an affine mapping followed by a nonlinearity: y tation is called the b x s ( Wx + ) = ) . f ( θ ′ Its parameter set is W , b } , where W is a d θ × d weight matrix and b is an offset vector of = { ′ . dimensionality d y is then mapped back to a reconstructed - Decoder: The resulting hidden representation d ′ ′ is called the . Its typical decoder g ( y ) . This mapping g = z in input space, z dimensional vector θ θ form is again an affine mapping optionally followed by a squashing non-linea rity, that is, either ′ ′ ′ ( ) = W y y + b or g θ ′ ′ ′ g , s ) = y + b y ) W (3) ( ( θ ′ ′ ′ W with appropriately sized parameters , b = } . θ { z is not to be interpreted as an exact reconstruction of x , but rather in probabilistic In general p ( X | Z = z ) that may generate x with terms as the parameters (typically the mean) of a distribution ′ p X | Y ; θ ( ) from the previous section high probability. We have thus completed the specification of ′ p X | Y = y ) = p ( X | Z = as ( g ( )) y . This yields an associated reconstruction error to be optimized: θ p x , z ) ∝ L log ( ( x | z ) . (4) − Common choices for p ( x | z ) and associated loss function L ( x , z ) include: d 2 2 , that is, IR • : X | z ∼ N ( z , σ x I ) , that is, X x | z ∼ N ( z For real-valued , σ ∈ ) . j j 2 2 2 , z ) = L ‖ ( x , z ) = C ( σ This yields ) L x − z ‖ ( where C ( σ x ) denotes a constant that depends 2 2 only on squared error objective σ and that can be ignored for the optimization. This is the found in most traditional autoencoders. In this setting, due to the Gaussian in terpretation, it is more natural not to use a squashing nonlinearity in the decoder. d B x x ∈ { 0 , 1 } For binary : X | z ∼ , that is, ( z ) , that is, X • | z ∼ B ( z ) . j j d In this case, the decoder needs to produce a [ 0 , 1 ] z . So a squashing nonlinearity such as a ∈ L s This yields L sigmoid x , z ) = will typically be used in the decoder. ) = ( x , z ( IH − cross-entropy loss which is termed the [ x )) log z z +( 1 − x ( ) log ( 1 − z B )] = IH ( B ( x ) ‖ ∑ j j j j j because it is seen as the cross-entropy between two independent multiva riate Bernoullis, the x and the other with mean z . This loss can also be used when x is not strictly first with mean d x ∈ [ 0 , 1 ] binary but rather . 2. Note: (AE) are also often called AutoAssociators (AA) in the literature. The shorter autoencoder term AutoEncoders was preferred in this work, as we believe encoding better conveys the idea of producing a novel useful representation. Similarly, what we call Stacked Auto Encoders (SAE) has also been called Stacked AutoAssociators (SAA). 3376

7 S TACKED UTOENCODERS D ENOISING A Note that in the general autoencoder framework, we may use other forms o f parameterized func- unction (corresponding to a tions for the encoder or decoder, and other suitable choices of the loss f | z p ). In particular, we investigated the usefulness of a more complex encoding function different ( X ) esent work however, we in Larochelle, Erhan, and Vincent (2009b). For the experiments in the pr affine+sigmoid encoder will restrict ourselves to the two usual forms detailed above, that is, an affine+sigmoid decoder with cross-entropy affine decoder with squared error loss or and either allels the workings of . A further constraint that can optionally be imposed, and that further par loss ′ ′ ′ T RBMs, is having , in effect defining W tied weights as W between = W W . and W Autoencoder training consists in minimizing the reconstruction error, that is, c arrying the fol- lowing optimization: [ , X E Z arg min ))] L ( X , ( 0 q ) ( X ′ , θ θ Z ) Z is a deterministic function of Z , since to emphasize the fact that is ( where we wrote X X obtained by composition of deterministic encoding and decoding. Making this explicit and using our definition of loss L from Equation 4 this can be rewritten as: ′ arg max [ log p ( E | Z = g X )))] f X ( ( , 0 θ θ ) q ( X ′ θ , θ or equivalently ′ . )] arg max θ E ; ) X ( f [ log p ( X | Y = 0 θ ( ) q X ′ θ , θ lower bound on We see that this last line corresponds to Equation 2, that is, the maximization of a the mutual information between Y . and X It can thus be said that training an autoencoder to minimize reconstruction error amounts to maximizing a lower bound on the mutual information between input X and learnt repre- Y . Intuitively, if a representation allows a good reconstruction of its input, it me ans that sentation it has retained much of the information that was present in that input. 2.3 Merely Retaining Information is Not Enough Y should retain information about input X is not by itself sufficient The criterion that representation ximized by setting to yield a useful representation. Indeed mutual information can be trivially ma = X . Similarly, an ordinary autoencoder where Y is of the same dimensionality as X Y (or larger) 3 can achieve perfect reconstruction simply by learning an identity mapping. Without any other constraints, this criterion alone is unlikely to lead to the discovery of a more use ful representation than the input. Thus further constraints need to be applied to attempt to separate useful inf ormation (to be re- tained) from noise (to be discarded). This will naturally translate to non-ze ro reconstruction error. bottleneck to produce an under-complete repre- The traditional approach to autoencoders uses a ′ Y < d . The resulting lower-dimensional sentation where can thus be seen as a lossy compressed d of X . When using affine encoder and decoder without any nonlinearity and a squared representation error loss, the autoencoder essentially performs principal component a nalysis (PCA) as showed by ′ g ◦ f be the identity to obtain zero reconstruction error. For d = d 3. More precisely, it suffices that if we had a linear − ′ 1 W encoder and decoder this would be achieved for any invertible matrix = W W by setting . Now there is a sigmoid nonlinearity in the encoder, but it is possible to stay in the linear part o f the sigmoid with small enough W . 3377

8 V INCENT AROCHELLE , L AJOIE , B ENGIO AND M ANZAGOL , L 4 When a nonlinearity such as a sigmoid is used in the encoder, things Baldi and Hornik (1989). become a little more complicated: obtaining the PCA subspace is a likely possibility (Bo urlard and Kamp, 1988) since it is possible to stay in the linear regime of the sigmoid, but arg uably not the only one (Japkowicz et al., 2000). Also when using a cross-entropy loss ra ther than a squared error the optimization objective is no longer the same as that of PCA and will likely learn diff erent features. nd decoder matrices to The use of “tied weights” can also change the solution: forcing encoder a be symmetric and thus have the same scale can make it harder for the encoder to stay in the linear regime of its nonlinearity without paying a high price in reconstruction error. Y Alternatively it is also conceivable to impose on different constraints than that of a lower over-complete (i.e., higher dimensional than dimensionality. In particular the possibility of using sparse pre- the input) but representations has received much attention lately. Interest in sparse re sentations is inspired in part by evidence that neural activity in the brain se ems to be sparse and on sparse coding has burgeoned following the seminal work of Olshausen and Field (1996) . Other motivations for sparse representations include the ability to handle effectiv ely variable-size repre- sentations (counting only the non-zeros), and the fact that dense compr essed representations tend to entangle information (i.e., changing a single aspect of the input yields signifi cant changes in all components of the representation) whereas sparse ones can be expec ted to be easier to interpret and to use for a subsequent classifier. Various modifications of the traditional autoencoder framework to et al., 2007, 2008). These have been proposed in order to learn sparse representations (Ranza were shown to extract very useful representations, from which it is po ssible to build top performing tions can be viewed as an alter- deep neural network classifiers. A sparse over-complete representa native “compressed” representation: it has straightforward compressibility due to the large implicit number of zeros rather than an explicit lower dimensionality. 3. Using a Denoising Criterion We have seen that the reconstruction criterion alone is unable to guarantee the extraction of useful features as it can lead to the obvious solution “simply copy the input” or similarly u ninteresting ones that trivially maximizes mutual information. One strategy to avoid this phenomenon is to constrain the representation: the traditional bottleneck and the more recent interest o n sparse representations both follow this strategy. Here we propose and explore a very different strategy. Rather than c onstrain the representation, we change the reconstruction criterion for a both more challenging and more interesting objec- tive: cleaning partially corrupted input, or in short . In doing so we modify the implicit denoising definition of a good representation into the following: “a good representation is one that can be obtained robustly from a corrupted input and that will be useful for recove ring the corresponding clean input” . Two underlying ideas are implicit in this approach: • First it is expected that a higher level representation should be rather sta ble and robust under corruptions of the input. Second, it is expected that performing the denoising task well requires ex tracting features that • capture useful structure in the input distribution. 4. More specifically it will find the same subspace as PCA, but the specific projection directions found will in general not correspond to the actual principal directions and need not be ortho normal. 3378

9 S TACKED ENOISING A UTOENCODERS D not the task of denoising per se. Rather We emphasize here that our goal is denoising is ad- for learning to extract useful features training criterion that will vocated and investigated as a representation can then be constitute better higher level representation. The usefulness of a learnt s it as input. assessed objectively by measuring the accuracy of a classifier that use 3.1 The Denoising Autoencoder Algorithm This approach leads to a very simple variant of the basic autoencoder des denoising cribed above. A version of autoencoder (DAE) is trained to reconstruct a clean “repaired” input from a corrupted . This is done by first it (the specific types of corruptions we consider will be discussed below) ̃ ̃ ̃ by means of a stochastic mapping corrupting the initial input x ∼ q x ( x x | x ) . into D ̃ x Corrupted input is then mapped, as with the basic autoencoder, to a hidden representation ′ ̃ ̃ ) . See Figure 1 for a schematic ( y s ( W ( x + b ) from which we reconstruct a z = g y = x ) = f θ θ ′ θ θ are trained to minimize the average recon- representation of the procedure. Parameters and z as close as possible to the uncorrupted input struction error over a training set, that is, to have . x ̃ The key difference is that z is now a deterministic function of x rather than x . As previously, the considered reconstruction error is either the cross-entropy loss L , with an ( x , z ) = IH ( B ( x ) ‖ B ( z )) IH 2 L ( x , z ) = ‖ x − z ‖ affine+sigmoid decoder, or the squared error loss , with an affine decoder. Pa- 2 rameters are initialized at random and then optimized by stochastic gradient de scent. Note that each ̃ time a training example x is presented, a different corrupted version x of it is generated according ̃ q x | x ) . to ( D s between a Note that denoising autoencoders are still minimizing the same reconstruction los X . So this still amounts to maximizing a lower bound on the Y clean and its reconstruction from X Y Y . The difference is that mutual information between clean input is now and representation f to a corrupted input. It thus forces the learning of a obtained by applying deterministic mapping θ far more clever mapping than the identity: one that extracts features useful for denoising . y x ( L ) z , H ′ g θ f θ q D ̃ xx z x Figure 1: The denoising autoencoder architecture. An example x is stochastically corrupted (via ̃ ) and attempts to reconstruct x . The autoencoder then maps it to y (via encoder f q ) to θ D ′ , producing reconstruction z . Reconstruction error is measured by loss via decoder g x θ L ( x , z ) . H 3379

10 V INCENT AROCHELLE , L AJOIE , B ENGIO AND M ANZAGOL , L 3.2 Geometric Interpretation The process of denoising, that is, mapping a corrupted example back to an uncorrupted one, can manifold assumption (Chapelle be given an intuitive geometric interpretation under the so-called et al., 2006), which states that natural high dimensional data concentrates close to a non-linear low-dimensional manifold. This is illustrated in Figure 2. During denoising trainin g, we learn a ̃ ̃ X | ) that maps a corrupted ( X back to its uncorrupted X , for example, in the stochastic operator X p case of binary data, ̃ ̃ ′ ∼ B ( g X | X f . ( ( X ))) θ θ ifold than the Corrupted examples are much more likely to be outside and farther from the man ̃ X | uncorrupted ones. Thus stochastic operator p ) learns a map that tends to go from lower prob- ( X ̃ ̃ to nearby high probability points X , on or near the manifold. Note that when ability points X is X ̃ p X | farther from the manifold, X ) should learn to make bigger steps, to reach the manifold. Suc- ( cessful denoising implies that the operator maps even far away points to a sma ll region close to the manifold. The denoising autoencoder can thus be seen as a way to define and learn a manifold. In particu- lar, if we constrain the dimension of Y to be smaller than the dimension of X , then the intermediate representation Y = f ( X ) may be interpreted as a coordinate system for points on the manifold. More generally, one can think of Y f ( X ) as a representation of X which is well suited to capture the = main variations in the data, that is, those along the manifold. 3.3 Types of Corruption Considered The above principle and technique can potentially be used with any type of co rruption process. Also ailable, could be easily in- the corruption process is an obvious place where prior knowledge, if av corporated. But in the present study we set to investigate a technique that is generally applicable. In x ̃ ′ f ( ( ̃ x )) g ! ! x ) x q | ( ̃ x D x ̃ x Figure 2: Manifold learning perspective. Suppose training data ( × ) concentrate near a low- dimensional manifold. Corrupted examples ( . ) obtained by applying corruption process ̃ ̃ p X | X q will generally lie farther from the manifold. The model learns with ( ( X | ) X ) D ′ ( f ( · )) ) onto the manifold. Intermediate rep- g to “project them back” (via autoencoder θ θ resentation Y = f on the ( X ) may be interpreted as a coordinate system for points X θ manifold. 3380

11 S TACKED ENOISING A UTOENCODERS D ntations by stacking particular we want it to be usable for learning ever higher level represe denois- ocesses may be available in ing autoencoders. Now while prior knowledge on relevant corruption pr a particular input space (such as images), such prior knowledge will not be available for the space of intermediate-level representations. We will thus restrict our discussion and experiments to the following simple corr uption pro- cesses: 2 ̃ N x | x ∼ • ( Additive isotropic , σ Gaussian noise I ) ; (GS): x Masking noise (MN): a fraction ν of the elements of x (chosen at random for each example) • is forced to 0; Salt-and-pepper noise (SP): a fraction • of the elements of x (chosen at random for each ν example) is set to their minimum or maximum possible value (typically 0 or 1) accordin g to a fair coin flip. Additive Gaussian noise is a very common noise model, and is a natural choice for real val- will also be considered, as it is a natural choice for input ued inputs. The salt-and-pepper noise d white images or the domains which are interpretable as binary or near binary such as black an representations produced at the hidden layer after a sigmoid squashing f unction. masking Much of our work however, both for its inspiration and in experiments, foc uses on which can be viewed as turning off components considered missing or repla cing their value noise ll information about by a default value—that is, a common technique for handling missing values. A and we can view the these masked components is thus removed from that particular input pattern, fill-in denoising autoencoder as trained to these artificially introduced “blanks”. Also, numerically, forcing components to zero means that they are totally ignored in the computation s of downstream neurons. sking noise drasti- We draw the reader’s attention to the fact that both salt-and-pepper and ma d. Denoising, that is, cally corrupt but a fraction of the elements while leaving the others untouche recovering the values of the corrupted elements, will only be possible thank s to dependencies be- tween dimensions in high dimensional distributions. Denoising training is thus exp ected to capture these dependencies. The approach probably makes less sense for ve ry low dimensional problems, at least with these types of corruption. 3.4 Extension: Putting an Emphasis on Corrupted Dimensions and Noise types such as salt-and-pepper masking noise that erase only a changing subset of the rward extension of the input’s components while leaving the others untouched suggest a straightfo econstruction of all com- denoising autoencoder criterion. Rather than giving equal weight to the r emphasis on the corrupted dimensions. To achieve this we give ponents of the input, we can put an for the reconstruction error on components that were corrupted, and β a different weight α for those α β are considered hyperparameters. that were left untouched. and For the squared loss this yields ( ( ) ) 2 2 L ) = α z ( x , ( − z x ) β + ( z − ) x , , 2 α j j j j ∑ ∑ ̃ ̃ ∈ J ( x j ) ) ( J ∈ / j x ̃ J ( where x ) denotes the indexes of the components of x that were corrupted. 3381

12 V INCENT AROCHELLE , L AJOIE , B ENGIO AND M ANZAGOL , L And for the cross-entropy loss this yields ) ( log z − 1 ( )] x ) x [ − 1 log z +( L − ( x , z ) = α j j j j IH α , ∑ ̃ j x ) J ( ∈ ) ( . )] z − 1 ( log ) x − 1 [ x +( log z + β − j j j j ∑ ̃ x J ( ∈ j ) / emphasized denoising autoencoder . A special case that we call full We call this extension is obtained for α = 1 , β emphasis 0 where we only take into account the error on the prediction = of corrupted elements. 3.5 Stacking Denoising Autoencoders to Build Deep Architectures Stacking denoising autoencoders to initialize a deep network works in much the s ame way as stack- ing RBMs in deep belief networks (Hinton et al., 2006; Hinton and Salakhutdino v, 2006) or ordinary autoencoders (Bengio et al., 2007; Ranzato et al., 2007; Larochelle et al., 2009a). Let us specify dividual layer, so that that input corruption is only used for the initial denoising-training of each in f has thus been learnt, it will henceforth it may learn useful feature extractors. Once the mapping θ be used on uncorrupted inputs. In particular no corruption is applied to produce the representation that will serve as clean input for training the next layer. The complete proc edure for learning and stacking several layers of denoising autoencoders is shown in Figure 3 . resentation can be used Once a stack of encoders has thus been built, its highest level output rep ort Vector Machine as input to a stand-alone supervised learning algorithm, for example a Supp classifier or a (multi-class) logistic regression. Alternatively, as illustrated in Figure 4, a logistic regression layer can be added on top of the encoders, yielding a deep neural network amenable usly to supervised learning. The parameters of all layers can then be simultaneo using a fine-tuned gradient-based procedure such as stochastic gradient descent. 4. Related Approaches in the Literature In this section, we briefly review and discuss related prior work along thre e different axes. 4.1 Previous Work on Training Neural Networks for Denoising The idea of training a multi-layer perceptron using error backpropagation on a denoising task is not new. The approach was first introduced by LeCun (1987) and Gallinar i et al. (1987) as an alternative method to learn an (auto-)associative memory similar to how Hopfield Networks (Hopfield, 1982) patterns, corrupted by were understood. The networks were trained and tested on binary input flipping a fraction of input bits chosen at random. Both the model and training procedure in this 5 precursory work are very similar to the denoising autoencoder we descr ibe. Our motivation and goal are however quite different. The objective of LeCun (1987) was to study the capacity of such a network for memorization tasks, that is, counting how many training patte rns it was able to 5. There are a few minor differences; for example, the use of a squa red error after sigmoid for binary data, while we tend to use a cross-entropy loss. Also their denoising procedure consid ers doing several recurrent passes through the autoencoder network, as in a recurrent net. 3382

13 S TACKED ENOISING A UTOENCODERS D L H 2 ) ( g ′ 2 ) ( θ f θ ( ) 2 f θ q D f f f θ θ θ xx xx xx Figure 3: Stacking denoising autoencoders. After training a first level d enoising autoencoder (see Figure 1) its learnt encoding function f is used on clean input (left). The resulting θ representation is used to train a second level denoising autoencoder (midd le) to learn a ( 2 ) second level encoding function f . From there, the procedure can be repeated (right). θ supervised cost su p f θ ) 3 ( f θ ( 2 ) f θ f θ Target x Figure 4: Fine-tuning of a deep network for classification. After training a stack of encoders as explained in the previous figure, an output layer is added on top of the stac k. The param- eters of the whole system are fine-tuned to minimize the error in predicting the su pervised target (e.g., class), by performing gradient descent on a supervised cost. 3383

14 V INCENT AROCHELLE , L AJOIE , B ENGIO AND M ANZAGOL , L d the usefulness of a non- correctly recall under these conditions. The work also clearly establishe linear hidden layer for this. By contrast, our work is motivated by the searc h and understanding interest is thus in of unsupervised pretraining criteria to initialize deep networks. Our primary investigating the ability of the denoising criterion to learn good feature extracto rs, with which to . We focus on analyzing initialize a deep network by stacking and composing these feature extractors the learnt higher-level representations and their effect on the classific ation performance of resulting deep networks. d here is the research Another insightful work that is very much related to the approach advocate lete corrupted input patterns of Seung (1998), in which a recurrent neural network is trained to comp at of LeCun (1987) and using backpropagation through time. Both the work of Seung (1998) and th e memories (Hopfield, 1982) Gallinari et al. (1987) appear to be inspired by Hopfield-type associativ urrent network dynamic. in which learnt patterns are conceived as attractive fixed points of a rec attractors, points out the Seung (1998) contributes a very interesting analysis in terms of continuous limitations of regular autoencoding, and advocates the pattern completion task a s an alternative to tudy mainly by its focus density estimation for unsupervised learning. Again, it differs form our s 6 and b) on the image denoising task per se. The latter justifies their use of a) on recurrent networks prior knowledge of the 2D topology of images, both in the architectural choic e of local 2D receptive field connectivity, and in the corruption process that consists in zeroing- out a square image patch at a random position. This occlusion by a 2D patch is a special form of a structured masking noise, where the a-priori known 2D topological structure of images is taken into ac count. In our research r in our corruption we deliberately chose not to use topological prior knowledge in our model no same generic procedure may be applied to learn higher levels of representation process, so that the from lower ones, or to other domains for which we have no such topologica l prior knowledge. More recently Jain and Seung (2008) presented a very interesting and s uccessful approach for image denoising, that consists in layer-wise building of a deep convolutional neural network. Their algorithm yields comparable or better performance than state-of-the-art M arkov Random Field and wavelet methods developed for image denoising. The approach clearly ha s roots in their earlier work (Seung, 1998) and appears also inspired by more recent resea rch on deep network pretraining, including our own group’s (Bengio et al., 2007). But apparently, neithe r of us was initially aware of the other group’s relevant work on denoising (Vincent et al., 2008; Ja in and Seung, 2008). Again the focus of Seung (1998) on image denoising per se differs from our own focus on studying deep net- work pretraining for classification tasks and results in marked difference s in the actual algorithms. Specifically, in Jain and Seung (2008) each layer in the stack is trained to re construct the original clean image a little better, which makes sense for image denoising. This can be co ntrasted with our approach, in which upper layers are trained to denoise-and-reco nstruct whatever representation they receive from the layer immediately below, rather than to restore the origin al input image in one operation. This logically follows from our search for a generic featu re extraction algorithm for pretraining, where upper level representations will eventually be us ed for a totally different task such as classification. 6. Note however that a recurrent network can be seen as deep networ k, with the additional property that all layers share the same weights. 3384

15 S TACKED ENOISING A UTOENCODERS D 4.2 Training Classifiers with Noisy Inputs The idea of training a neural network with noisy input (Scalettar and Zee, 1 988; von Lehman et al., as it is sometimes called—has been proposed to enhance generaliza- 1988)—or training with jitter Holmstrm and Koistinen, tion performance for supervised learning tasks (Sietsma and Dow, 1991; 1992; An, 1996). This thread of research is less directly related to autoe ncoders and denoising than the studies discussed in the previous section. It is nevertheless relevant. After all, denoising amounts albeit a rather high di- to using noisy patterns as input with the clean pattern as a supervised target, mensional one. It has been argued that training with noise is equivalent to applying generalized Tikhonov regularization (Bishop, 1995). On the surface, this may seem to suggest that training with penalizing the sum noisy inputs has a similar effect to training with an L2 weight decay penalty (i.e., of squared weights), but this view is incorrect. Tikhonov regularization a pplied to linear regression . But for a non-linear map- is indeed equivalent to a L2 weight decay penalty (i.e., ridge regression) imple (Bishop, 1995). More ping such as a neural network, Tikhonov regularization is no longer so s importantly, in the non-linear case, the equivalence of noisy training with Tikh onov regularization is derived from a Taylor series expansion, and is thus only valid in the limit of very small additive noise. See Grandvalet et al. (1997) for a theoretical study and discus sion regarding the limitations of validity for this equivalence. Last but not least, our experimental res ults in Section 5.1 clearly show qualitatively very different results when using denoising autoenco ders (i.e., noisy inputs) than when using regular autoencoders with a L2 weight decay. Here, we must also mention a well-known technique to improve the generalization performance of a classifier (neural network or other), which consists in augmenting the original training set with additional distorted inputs, either explicitly (Baird, 1990; Poggio and Vette r, 1992) or virtually ̈ through a modified algorithm (Simard et al., 1992; Sch olkopf et al., 1996). For character images for instance such distortions may include small translations, rotations, scaling s and shearings of the image, or even applying a scanner noise model. This technique can thus b e seen as training s based on much prior with noisy corrupted inputs, but with a highly structured corruption proces 7 As already explained and motivated above, our intent in this work is to develo p and knowledge. investigate a generally applicable technique, that should also be applicable to intermediate higher level representations. Thus we intentionally restrict our study to very simple generic corruption processes that do not incorporate explicit prior knowledge. We also stress the difference between the approaches we just discusse d, that consist in training a neural network by optimizing a global supervised criterion using noisy input , and the approach we local unsupervised denoising criterion to pretrain investigate in the present work, that is, using a each layer of the network with the goal to learn useful intermediate representations. We shall see in experimental Section 6.4 that the latter applied to a deep network yields better c lassification performance than the former. 4.3 Pseudo-Likelihood and Dependency Networks The view of denoising training as “filling in the blanks” that motivated the masking noise and the extension that puts an emphasis on corrupted dimensions presented in Sectio n 3.4, can also be re- lated to the pseudo-likelihood (Besag, 1975) and Dependency Network ( Heckerman et al., 2000) 7. Clearly, simple Tikhonov regularization cannot achieve the same as tra ining with such prior knowledge based cor- ruption process. This further illustrates the limitation of the equivalence betw een training with noise and Tikhonov regularization. 3385

16 V INCENT AROCHELLE , L AJOIE , B ENGIO AND M ANZAGOL , L e likelihood paradigms. Maximizing pseudo-likelihood instead of likelihood implies replacing th d th ) by the product of conditionals term ( X p ) denotes the X | X i ( . Here X p component of ∏ i i ¬ i = i 1 th X and X input vector variable denotes all components but the i . Similarly, in the Dependency i ¬ d Network approach of Heckerman et al. (2000) one learns conditional distributions, each trained to i given (a subset of) all other components. This is in effect what an predict component emphasized 1 ν with a masking noise that masks but one input component ( denoising autoencoder = ), and a d ( α = 1 , β full emphasis 0), is trained to do. More specifically, for binary variables it will learn = to predict ( X ; and when using squared error for real-valued variables it will learn to = 1 p X ) | i ¬ i 2 E [ X d | X . Note that with denoising autoencoders, all ) ] assuming X ] | X σ , ∼ N ( E [ X predict | X ¬ i i ¬ i i ¬ i i share common parameters , which define the mapping to conditional distributions are constrained to . Also when the emphasis is not fully put on the corrupted com- and from hidden representation Y > 0) some of the network’s capacity will be devoted to encoding/decoding unc orrupted ponents ( β components. g scenario: What A more important difference can be appreciated by considering the followin happens if components of the input come in identical pairs? In that case, co nditional distribution ( X g any other | X can simply learn to replicate the other component of the pair, thus not capturin p ) i i ¬ ct scenario is forbidden by potentially useful dependency. Now for dependency networks this exa ility. But a similar problem the formal requirement that no input configuration may have zero probab may occur in practice if the components in a pair are not identical but very hig hly correlated. By er fraction ν of cor- contrast, denoising autoencoders can and will typically be trained with a larg rupted components, so that reliable prediction of a component cannot rely exclusively on a single other component. 5. Experiments on Single Denoising Autoencoders: Qualitati ve Evaluation of Learned Feature Detectors A first series of experiments was carried out using single denoising autoe ncoders, that is, without any stacking nor supervised fine tuning. The goal was to examine qualitativ ely the kind of feature detectors learnt by a denoising criterion, for various noise types, and c ompare these to what ordinary autoencoders yield. Feature detectors that correspond to the first hidden layer of a network trained on image data y has an associated vector of weights W that are straightforward to visualize. Each hidden neuron j j it uses to compute a dot product with an input example (before applying its no n-linearity). These filters vectors, the W , have the same dimensionality as the input, and can thus be displayed as little j to. images showing what aspects of the input each hidden neuron is sensitive 5.1 Feature Detectors Learnt from Natural Image Patches 2 × We trained both regular autoencoders and denoising autoencoders on 1 12 patches from whitened 8 natural scene images, made available by Olshausen (Olshausen and Field, 1996). A few of these patches are shown in Figure 5 (left). For these natural image patches, we used a linear decoder 9 and a squared reconstruction cost. Network parameters were trained fr using om a random start, × 12 patches were extracted from the 20 8. More specifically randomly positioned 12 20 patches made available by × https://redwood.berkeley.edu/bruno/sparsenet/ . Olshausen at the following URL: 9. We applied the usual random initialization heuristic in which weights are samp led independently form a uniform in 1 1 √ √ , where fanin in this case is the input dimension. ] − range [ fanin fanin 3386

17 S TACKED ENOISING A UTOENCODERS D ed learning rate of 0.05. All stochastic gradient descent to perform 500000 weight updates with a fix filters shown were from experiments with tied weights, but untied weights yielde d similar results. autoencoder that used a bottleneck under-complete Figure 5 displays filters learnt by a regular over-complete autoencoder using 200 hidden units. of 50 hidden units, as well as those learnt by an Filters obtained in the under-complete case look like very local blob detectors . No clear structure is apparent in the filters learnt in the over-complete case. Left: some of the 12 × 12 image Figure 5: Regular autoencoder trained on natural image patches. Middle: filters learnt by a regular under-complete patches used for training. autoencoder (50 hidden units) using tied weights and L2 reconstruction error. filters learnt by a Right: over-complete regular autoencoder (200 hidden units). The under-complete autoencoder appears to learn rather uninteresting local blob detectors. Filters obtained in the over- complete case have no recognizable structure, looking entirely random. gularized with L2 We then trained 200 hidden units over-complete noiseless autoencoders re weight decay, as well as 200 hidden units denoising autoencoders with iso tropic Gaussian noise (but no weight decay). Resulting filters are shown in Figure 6. Note that a denoising autoencoder with a noise level of 0 is identical to a regular autoencoder. So, naturally, fi lters learnt by a denoising autoencoder at small noise levels (not shown) look like those obtained with a regular autoencoder previously shown in Figure 5. With a sufficiently large noise level however ( σ = 0 . 5), the denoising autoencoder learns Gabor-like local oriented edge detectors (see Figu re 6). This is similar to what (Bell and Sejnowski, 1997) is learnt by sparse coding (Olshausen and Field, 1996, 1997), or ICA and resembles simple cell receptive fields from the primary visual cortex fir st studied by Hubel and nt nothing interesting beyond Wiesel (1959). The L2 regularized autoencoder on the other hand lear restoring some of the local blob detectors found in the under-complete case . Note that we did try a 10 wide range of values for the regularization hyperparameter, but were never able to get Gabor-like filters. From this experiment, we see clearly that training with sufficiently large noise yields a qualitatively very different outcome than training with a weight dec ay regularization , which confirms experimentally that the two are not equivalent for a non-linear autoencoder, as discussed earlier in Section 4.2. Figure 7 shows some of the results obtained with the other two noise types cons idered, that is, salt-and-pepper noise, and masking-noise. We experimented with 3 corru ption levels ν : 10% , 25% , 55%. The filters shown were obtained using 100 hidden units, but similar filters wer e found with 50 or 200 hidden units. Salt-and-pepper noise yielded Gabor-like edge detecto rs, whereas masking noise 10. Attempted weight decays values were the following: λ ∈ { 0 . 0001, 0.0005, 0.001, 0.005, 0.01, 0.05, 0.1, 0.25, 0.5, 1 . 0 } . 3387

18 V INCENT AROCHELLE , L AJOIE , B ENGIO AND M ANZAGOL , L t from natural image Figure 6: Weight decay vs. Gaussian noise. We show typical filters learn Left: regular autoencoder with patches in the over-complete case (200 hidden units). weight decay. We tried a wide range of weight-decay values and learning rates: filters never appeared to capture a more interesting structure than what is shown here. Note decay at all that some local blob detectors are recovered compared to using no weight Right: σ (Figure 5 right). = 0 . 5) a denoising autoencoder with additive Gaussian noise ( learns Gabor-like local oriented edge detectors. Clearly the filters learnt are qualitatively very different in the two cases. yielded a mixture of edge detectors and grating filters. Clearly different co rruption types and levels can yield qualitatively different filters. But it is interesting to note that all thre e noise types we experimented with were able to yield some potentially useful edge detectors. 5.2 Feature Detectors Learnt from Handwritten Digits We also trained denoising autoencoders on the 28 × 28 gray-scale images of handwritten digits ers with tied weights, from the MNIST data set. For this experiment, we used denoising autoencod oal was to better understand the cross-entropy reconstruction error, and zero-masking noise. The g qualitative effect of the noise level. So we trained several denoising auto encoders, all starting from , but the same initial random point in weight space Figure 8 shows some with different noise levels. of the resulting filters learnt and how they are affected as we increase the level of corruption. With 0% corruption, the majority of the filters appear totally random, with only a few tha t specialize as little ink blob detectors. With increased noise levels, a much larger proportion o f interesting (visibly non random and with a clear structure) feature detectors are learnt. The se include local oriented stroke detectors and detectors of digit parts such as loops. It was to be e xpected that denoising a more corrupted input requires detecting bigger, less local structures: th e denoising auto-encoder must rely on longer range statistical dependencies and pool evidence fr om a larger subset of pixels. Interestingly, filters that start from the same initial random weight vector of ten look like they “grow” from random, to local blob detector, to slightly bigger structure detectors su ch as a stroke detector, as we use increased noise levels. By “grow” we mean that the slightly larger structure learnt at a higher noise level often appears related to the smaller structure obtained at lower noise levels, in that they share about the same position and orientation. 3388

19 S TACKED ENOISING A UTOENCODERS D ers using other noise Figure 7: Filters obtained on natural image patches by denoising autoencod with 10% salt-and-pepper noise, we obtain oriented Gabor-like filters. The y types. Left: appear slightly less localized than when using Gaussian noise (contrast with Figure 6 right). Right: with 55% zero-masking noise we obtain filters that look like oriented rs to learn filters gratings. For the three considered noise types, denoising training appea that capture meaningful natural image statistics structure. 6. Experiments on Stacked Denoising Autoencoders gy for building deep net- In this section, we evaluate denoising autoencoders as a pretraining strate shall mainly compare the works, using the stacking procedure that we described in Section 3.5. We autoencoders (SDAE), ver- classification performance of networks pretrained by stacking denoising Boltzmann machines (DBN), sus stacking regular autoencoders (SAE), versus stacking restricted on a benchmark of classification problems. 6.1 Considered Classification Problems and Experimental Methodolog y We considered 10 classification problems, the details of which are listed in Tab le 1. They consist of: • . The standard MNIST digit classification problem with 60000 training examples The eight benchmark image classification problems used in Larochelle et al. ( • 2007) which in- ll with 10000 clude more challenging variations of the MNIST digit classification problem (a 11 training examples), as well as three artificial 28 28 binary image classification tasks. × These problems were designed to be particularly challenging to current ge neric learning al- gorithms (Larochelle et al., 2007). They are illustrated in Figure 9. • tzanetakis audio genre classification data set (Bergstra, 2006) which con- A variation of the tains 10000 three-second audio clips, equally distributed among 10 musical g enres: blues, classical, country, disco, hiphop, pop, jazz, metal, reggae and rock. E ach example in the set 11. The data sets for this benchmark are available at http://www.iro.umontreal.ca/ lisa/icml2007 . ̃ 3389

20 V INCENT AROCHELLE , L AJOIE , B ENGIO AND M ANZAGOL , L (b) 25% corruption (a) No corruption (c) 50% corruption (d) Neuron A (0%, 10%, 20%, 50% corruption) (e) Neuron B (0%, 10%, 20%, 50% corruption) Figure 8: Filters learnt by denoising autoencoder on MNIST digits, using z ero-masking noise. (a-c) show some of the filters learnt by denoising autoencoders trained with vario us corruption ν . Filters at the same position in the three images are related only by the fact levels parameter that the autoencoders were started from the same random initialization point in (d) and (e) zoom in on the filters obtained for two of the neurons. As can be seen, space. with no noise, many filters remain similarly uninteresting (undistinctive almost unifo rm random grey patches). As we increase the noise level, denoising training forces the filters to differentiate more, and capture more distinctive features. Higher noise le vels tend to induce less local filters, as expected. One can distinguish different kind s of filters, from local blob detectors, to stroke detectors, and character parts detectors at the higher noise levels. was represented by 592 (MPC) features. These are a simplified for- Mel Phon Coefficient Mel-frequency cepstral coefficients mulation of the (MFCCs) that were shown to yield better classification performance (Bergstra, 2006). tzanetakis had their data split into training set, validation set and test set. All problems except . The training set is used We kept the same standard splits that were used in Larochelle et al. (2007) for both pretraining and fine tuning of the models. Classification performance on the validation set is used for choosing the best configuration of hyperparameters (model s election). The corresponding classification performance on the test set is then reported together with a 95 % confidence interval. For we used a slightly different procedure, since there was no predefined standard tzanetakis split and fewer examples. We used 10-fold cross validation, where each fold consisted of 8000 training examples, 1000 test and 1000 validation examples. For each fold, h yperparameters were chosen based on the performance on the validation set, and the retained mod el was used for com- puting the corresponding test error. We report the average test erro r and standard deviation across the 10 folds. We were thus able to compare the classification performance of deep neura l networks using different unsupervised initialization strategies for their parameters: 3390

21 S TACKED A UTOENCODERS D ENOISING MLP random: multilayer perceptron with usual random initialization; • • DBN (deep belief networks) corresponds to stacked RBM pretraining; • SAE stacked autoencoder pretraining; • SDAE stacked denoising autoencoder pretraining. In all cases, the same supervised fine-tuning procedure was then used , that is, simple stochastic gradient descent with early stopping based on validation set performanc e. Description input Train-Valid-Test Data Set m Standard MNIST digit classi- 10 50000-10000-10000 MNIST 784 gray-scale fication problem. values scaled to [0,1] Smaller subset of MNIST. 10 10000-2000-50000 basic MNIST digits with added 10 10000-2000-50000 rot random rotation. bg-rand MNIST digits with random 10 10000-2000-50000 noise background. MNIST digits with random 10000-2000-50000 10 bg-img image background. bg-img-rot 10 10000-2000-50000 MNIST digits with rotation and image background. 2 10000-2000-50000 784 values rect Discriminate between tall and wide rectangles (white 0,1 } ∈{ on black). rect-img 10000-2000-50000 Discriminate between tall ∈ 784 values 2 0 , 1 ] and wide rectangular image [ overlayed on a different background image. convex 784 values 2 6000-2000-50000 Discriminate between con- vex and concave shape. ∈{ 0,1 } tzanetakis Classify genre of thirty sec- 10-fold cross valida- 592 MPC 10 coefficients, tion on 10000 training ond audio-clip. samples. standardized. Table 1: Data sets. Characteristics of the 10 different problems consider tzane- ed. Except for takis whose input is made of 592 MPC features extracted from short audio seq uences, all other problems are 28 × 28 gray-scale image classification tasks (i.e., input dimension- ality is 28 28 = 784). See Larochelle et al. (2007) and Bergstra (2006) for further details × on these data sets. The table gives, for each task, its shorthand name, a d escription of the problem, a description of input preprocessing, the number of classes ( m ), and the number of examples used for the training, validation and test sets respectively. 3391

22 V INCENT AROCHELLE , L AJOIE , B ENGIO AND M ANZAGOL , L rot rect bg-rand , bg-img , bg-img-rot (b) (a) , rect-img , convex , er variations on the Figure 9: Samples form the various image classification problems. (a): hard blems. MNIST digit classification problems. (b): artificial binary classification pro × 28 gray-scale image problems, SAE and SDAE used linear+sigmoid decoder layers On the 28 hoice for this kind of (near) and were trained using a cross-entropy loss, due to this being a natural c binary images, as well as being functionally closer to RBM pretraining we wan ted to compare against. However for training the first layer on the tzanetakis problem, that is, for reconstructing MPC coefficients , a linear decoder and a squared reconstruction cost were deemed more appropriate (sub- layer RBM used in pre- sequent layers used sigmoid cross entropy as before). Similarly the first tzanetakis was defined with a Gaussian visible layer. training a DBN on tion (based on Table 2 lists the hyperparameters that had to be tuned by proper model selec validation set performance). Note that to reduce the choice space, we re stricted ourselves to the same number of hidden units, the same noise level, and the same learning rate fo r all hidden layers. 6.2 Empirical Comparison of Deep Network Training Strategies Table 3 reports the classification performance obtained on all data sets usin g a 3 hidden layer neural network pretrained with the three different strategies: by stacking denois ing autoencoders (SDAE- 3), stacking restricted Boltzmann machines (DBN-3), and stacking regular autoencoders (SAE-3). For reference the table also contains the performance obtained by a single hidden-layer DBN-1 and 12 ). by a Support Vector Machine with a RBF kernel (SVM rbf 12. SVMs were trained using the libsvm implementation. Their hyperparame ters ( C and kernel width) were tuned semi- automatically (i.e., by human guided grid-search), searching for the b est performer on the validation set. 3392

23 S TACKED ENOISING A UTOENCODERS D description hyperparameter considered values number of hidden layers { 1,2,3 } nHLay 1000,2000,3000 { nHUnit } number of units per hidden layer (same for all layers) − 6 − 5 − 4 lRate fixed learning rate for unsuper- , 5 × 10 10 { , 5 × 5 × 10 , − 3 − 2 − 1 × 10 10 vised pretraining , 10 × , 5 } 5 lRateSup fixed learning rate for supervised { 0.0005,0.005,0.05,0.1,0.15,0.2 } fine-tuning 5,10,50,100,125,150,200,300 } nEpoq { number of pretraining epochs (passages through the whole training set) ν corrupting noise level fraction of corrupted inputs (0 , 0 . 10 , 0 . 25 , 0 . 40) or standard deviation Gaussian for noise 0 , 0 . 05 , 0 . 10 , 0 . 15 , ( . 30 , 0 . 50 ) 0 Table 2: List of hyperparameters for deep networks. These hyperpa rameters are common to all considered deep network training strategies, except for noise level ν which is specific to SDAE (for which we must also choose the kind of corruption). We list the ty pical values we considered in the majority of our experiments. Best performing co nfiguration on the validation set was always searched for in a semi-automatic fashion, th at is, running experiments in parallel on a large computation cluster, but with manual guidanc e to avoid Some experiments wasting resources on unnecessary parts of the configuration space. sionally used a meant to study more closely the influence of specific hyperparameters occa finer search grid for them, as will be specified in the description of these ex periments. In these experiments, SDAE used a zero-masking corruption noise for all problems except for tzanetakis of the input. , for which a Gaussian noise was deemed more appropriate due to the nature We see that SDAE-3 systematically outperforms the baseline SVM, as well as S AE-3 (except for convex for which the difference is not statistically significant). This shows clearly th at de- noising pretraining with a non-zero noise level is a better strategy than pretr aining with regular autoencoders. For all but one problem, SDAE-3 is either the best perfo rming algorithm or has its confidence interval overlap with that of the winning algorithm (i.e., differen ce cannot be considered statistically significant). In most cases, stacking 3 layers of denoising autoe ncoder seems to be on par or better than stacking 3 layers of RBMs in DBN-3. In the following subsections, we will be conducting further detailed experime nts to shed light on particular aspects of the denoising autoencoder pretraining strategy. 6.3 Influence of Number of Layers, Hidden Units per Layer, and Noise Level Next we wanted to study more closely the influence of important architectural hyperparameters, namely the number of layers, the number of hidden units per layer, and the no ise level. For this finer 3393

24 V INCENT AROCHELLE , L AJOIE , B ENGIO AND M ANZAGOL , L SVM SAE-3 DBN-3 SDAE-3 ( ν ) Data Set DBN-1 rb f ± . . 23 1.21 ± 0 . 21 1.40 ± 0 . 23 1.24 ± 0 . 22 1.28 ± 0 1.40 22 (25%) MNIST 0 16 3.03 . 15 3.94 ± 0 . 17 3.46 ± 0 . 0 3.11 ± 0 . 15 2.84 ± 0 . 15 (10%) basic ± 11.11 ± 0 . 28 14.69 ± 0 . 31 10.30 ± 0 . 27 10.30 ± 0 . 27 9.53 ± rot . 26 (25%) 0 bg-rand ± 0 . 31 9.80 ± 0 . 26 11.28 ± 0 . 28 6.73 ± 0 . 22 10.30 ± 0 . 27 (40%) 14.58 22.61 . 0 . 37 16.15 ± 0 . 32 23.00 ± 0 bg-img 37 16.31 ± 0 . 32 16.68 ± 0 . 33 (25%) ± . bg-img-rot 0 . 44 52.21 ± 0 . 44 51.93 ± 0 ± 44 47.39 ± 0 . 44 43.76 ± 0 . 43 (25%) 55.18 rect 2.15 ± 0 . 13 4.71 ± 0 . 19 2.41 ± 0 . 13 2.60 ± 0 . 14 1.99 ± 0 . 12 (10%) rect-img 24.04 0 . 37 23.69 ± 0 . 37 24.05 ± 0 . 37 22.50 ± 0 . 37 21.59 ± 0 . 36 (25%) ± 19.13 18.63 0 . 34 19.92 ± 0 . 35 18.41 ± 0 . 34 convex ± 0 . 34 19.06 ± 0 . 34 (10%) ± 18.38 14.41 tzanetakis . 18 18.07 ± 1 . 31 16.15 ± 1 . 95 ± ± 1 . 64 16.02 ± 1 . 04 (0.05) 2 Table 3: Comparison of stacked denoising autoencoders (SDAE-3) with o ther models. Test error rate on all considered classification problems is reported together with a 95% confidence interval. Best performer is in bold, as well as those for which confidence intervals overlap. SDAE-3 appears to achieve performance superior or equivalent to the best other model on bg-rand . For SDAE-3, we also indicate the fraction of corrupted all problems except ν , the standard deviation of the Gaussian noise, as tzanetakis input components, or in case of ν = 0%. chosen by proper model selection. Note that SAE-3 is equivalent to SDAE -3 with f the considered problems, grained series of experiments, we chose to concentrate on the hardest o that is, the one with the most factors of variation: bg-img-rot . e increase the capacity We first examine how the proposed network training strategy behaves as w umber of hidden layers). of the model both in breadth (number of neurons per layer) and in depth (n r of hidden layers from Figure 10 shows the evolution of the performance as we increase the numbe ining (standard MLP), 1 to 3, for three different network training strategies: without any pretra with ordinary autoencoder pretraining (SAE) and with denoising autoenco der pretraining (SDAE). We clearly see a strict ordering: denoising pretraining being better than au toencoder pretraining being better than no pretraining. The advantage appears to increase with th e number of layers (note that without pretraining it seems impossible to successfully train a 3 hidden laye r network) and with the number of hidden units. This general behavior is a typical illustration o f what is gained by pretraining deep networks with a good unsupervised criterion, and ap pears to be common to several pretraining strategies. We refer the reader to Erhan et al. (20 10) for an empirical study and discussion regarding possible explanations for the phenomenon, ce ntered on the observation of effects (we exploit the hypothesis that features of X that help to capture regularization ( X ) also P help to capture P ( Y | X ) ) and optimization effects (unsupervised pre-training initializes parameters near a better of generalization error). local minimum Notice that in tuning the hyperparameters for all classification performance s so far reported, we considered only a coarse choice of noise levels ν (namely 0%, 10%, 25%, or 40% of zero-masking corruption for the image classification problems). Clearly it was not necess ary to pick the noise level very precisely to obtain good performances. In Figure 11 we examin e in more details the influence of the level of corruption ν using a more fine-grained grid for problem bg-img-rot . We 3394

25 S TACKED ENOISING A UTOENCODERS D bg-img-rot Figure 10: Classification performance on for standard MLP with random initialization (NoPreTrain, left), SAE (middle), and SDAE (right), as we increase the n umber of hid- den layers and the number of neurons per layer. Error bars show 95% confidence in- aphic, the tervals. Note that without pretraining the curve for 3 layers is outside the gr classification error being above 89%. r wide range of noise notice that SDAE appears to perform better than SAE (0 noise) for a rathe levels, regardless of the number of hidden layers. The following section reports an experiment that was conducted on three o ther data sets. The ex- periment had a different goal and thus used a coarser ν grid, but the resulting curves (see Figure 12) appear to follow a similar pattern to the one seen here (Figure 11). 6.4 Denoising Pretraining vs. Training with Noisy Input We discussed in Section 4.2 the important distinction between denoising pretrain ing as it is done rn good in SDAE and simply training with noisy inputs. SDAE uses a denoising criterion to lea initial feature extractors at each layer that will be used as initialization for a noiseless supervised training. This is very different from training with noisy inputs, which amounts to training with a (virtually) expanded data set. This latter approach can in principle be applie d to any classifier, such 13 as an SVM training or a SAE. Note that in the case of the SAE, since there are two phases (pre and fine-tuning), it is possible to use noisy inputs for only the pretraining or for both the pretraining 13. For SAE, input examples can cheaply be corrupted on the fly, but th is is not an option with standard SVM algorithms. So for SVM training we first augmented the training set by generating 9 extr a variations of each original training example thus yielding a training set 10 times bigger than the original. Alternativ ely, we could instead have used the 3395

26 V INCENT AROCHELLE , L AJOIE , B ENGIO AND M ANZAGOL , L st error rate for SDAE Figure 11: Sensitivity to the level of corruption. The curves report the te bg-img-rot as a function of the fraction ν of corrupted input compo- trained on problem rval. Note that nents (using zero masking noise). Error bars show 95% confidence inte 0% corruption corresponds to a SAE (regular autoencoder). aches on three data sets in and fine-tuning phase. We experimentally compare these different appro Figure 12. We see that denoising pretraining with SDAE, for a large range of noise levels, yields significantly improved performance, whereas training with noisy inputs sometime s degrades the performance, and sometimes improves it slightly but is clearly less beneficial th an SDAE. 6.5 Variations on the Denoising Autoencoder: Alternate Corruption Types and Emphasizing In the next series of experiments, we wanted to evaluate quantitatively the ef fect of using the various emphasized denoising autoencoder variant corrupting noises described in Section 3.3 as well as the described in Section 3.4. e previous section. Extensive experiments were carried out on the same 3 problems we used in th Besides zero-masking noise (MN) we trained 3 hidden layer SDAE using sa lt-and-pepper noise (SP) 14 ized variant. For each and additive Gaussian noise (GS). For MN and SP, we also tried the emphas considered variant, hyperparameters were selected as usual to yield the best performance on the ̈ virtual SV technique (Sch olkopf et al., 1996), which may or may not have yielded better perform ance, but since our main focus here is comparing noisy SAE with SDAE, SVM only serves as a s imple baseline. 14. As already mentioned previously, since Gaussian noise corrupts ev ery dimension, emphasized denoising does not make sense for this type of corruption. 3396

27 S TACKED ENOISING A UTOENCODERS D basic (a) (b) (c) bg-rand rot Figure 12: SDAE vs. training with noisy input. The test error of a SDAE with 3 hidden layers is compared to other algorithms trained with noisy inputs: a SVM with RBF kernel ( SVM only ), a 3-hidden-layers SAE where noisy inputs were used for pretraining rbf SAE(1) ) and one where noisy inputs were used both for pretraining and superv ised ( fine-tuning ( SAE(2) ). Hidden layers have 1000 neurons each. Zero-masking noise was used. Note that at 0% noise, the three stacked models correspond to an or dinary SAE. Error bars show 95% confidence interval. Denoising pretraining with SDA E appears to always yield significantly better performance, unlike training with noisy inputs . 3397

28 V , L , L AJOIE , B ENGIO AND M ANZAGOL INCENT AROCHELLE rot Model bg-rand basic 28 ± . 15 11.11 ± SVM . 0 14.58 ± 0 . 31 3.03 0 rb f ± 0 . 16 10.30 SAE-3 0 . 27 11.28 ± 0 . 28 3.46 ± 3.11 0 . 15 10.30 ± 0 ± 27 6.73 ± 0 . 22 DBN-3 . (25%) ( ν ) 2.84 ± 0 . 15 (10%) 9.53 ± 0 . 26 SDAE-3 10.30 ± 0 . 27 (40%) MN ± ν ) + emph 2.76 ± 0 . 14 (25%) 10.36 ( 0 . 27 (25%) 9.69 ± 0 . 26 (40%) SDAE-3 MN 9.33 ( ν ) 2.66 ± 0 . 14 (25%) SDAE-3 ± 0 . 25 (25%) 10.03 ± 0 . 26 (25%) SP 8.76 ( ν ) + emph 2.48 ± 0 . 14 (25%) SDAE-3 ± 0 . 29 (25%) 8.52 ± 0 . 24 (10%) SP 0 SDAE-3 ) 2.61 ± 0 . 14 (0.1) 8.86 ± ν . 28 (0.3) 11.73 ± 0 . 28 (0.1) ( GS (SDAE-3): alternative noise Table 4: Variations on 3-hidden-layer stacked denoising autoencoders (MN), salt-and- types and effect of emphasis. Considered noise types are masking noise d double emphasis pepper (SP) and Gaussian noise (GS). Emphasized version considere ison, the table and full emphasis (see main text for detailed explanation). For easy compar , SAE-3, and DBN-3. Test error rate also reproduces previously shown results for SVM rbf is reported together with a 95% confidence interval. Best performer is in bo ld, as well as those for which confidence intervals overlap. Corruption level ν (fraction of corrupted el selection on input components or Gaussian standard deviation) that was retained by mod the validation set is specified in parenthesis. SDAE-3 with emphasis on reconstruction of SP ets, significantly corrupted dimension appears to be the best SDAE variant for these data s improving performance on rot bg-rand . and validation set. These included the number of units per layer (same for all laye rs), the corruption level ν (fraction of corrupted dimensions for MN and SP, or standard deviation f or GS), with the usual considered values (listed previously in Table 2). For the emphasize d version, a further hy- double emphasis perparameter was the degree of emphasis. We considered both , where the weight on the reconstruction of the corrupted components is twice that on the uncor rupted components α = 1 , β = 0 ( 5), and full emphasis where all the weight is on reconstructing the corrupted compo- . nents and none on the uncorrupted dimensions ( α = 1 , β = 0). Table 4 reports the corresponding classification performance on the held-out test set. For the three conside red data sets, an empha- sized SDAE with salt-and-pepper noise appears to be the winning SDAE var iant. It thus appears etter performance. that a judicious choice of noise type and added emphasis may often buy us a b However we had hoped, with these variants, to catch up with the performanc e of DBN-3 on the 15 bg-rand but DBN-3 still performs significantly better than the best SDAE variant on th is problem, particular problem. 6.6 Are Features Learnt in an Unsupervised Fashion by SDAE Useful for SVMs? In the following series of experiments, we wanted to verify whether the highe r level representations extracted using SDAE could improve the performance of learning algorithms o ther than a neural network, such as SVMs. 15. As discussed in Larochelle et al. (2007), bg-rand is particularly favorable to RBMs because the pixel-wise indepen- dent noise perfectly matches what an RBM expects and will naturally not b e represented in the hidden units. 3398

29 S TACKED ENOISING A UTOENCODERS D d phase of SDAE, at To this end, we fed the representations learnt by the purely unsupervise ar SVM and a Kernel increasing higher levels (first, second and third hidden layer) to both a line SVM (using a RBF kernel). The hyperparameters of the SVM and its kerne l were tuned on the validation set as usual. For computational reasons, we did not re-tune SD AE hyperparameters. Instead, we identified the best performing SDAE-pretrained neural netw orks with 1000 units per ious experiments, but used layer, based on their validation performance after fine-tuning from prev their saved weights prior to fine-tuning (i.e., after unsupervised denoising training only). 3 highlights performance Results for all considered data sets are reported in Table 5, and Figure 1 curves for two of them. Clearly, SVM performance can benefit significan tly from using the higher 16 level representation learnt by SDAE. On all problems we see improved performance compared to using the original input (SVM ). More interestingly, on most problems, SVM performance im- 0 proves steadily as we use ever higher level representations. While it is no t too surprising that linear SVMs can benefit from having the original input processed non-linear ly, it is noteworthy that RBF m to benefit greatly from the kernel SVMs, which are high-capacity non-linear classifiers, also see non-linear mapping learned by SDAE. Figure 13: SVM on representations learnt by SDAE. The curves show e volution, on two data sets, higher level of the test performance of linear and RBF kernel SVMs as we train them on ce of SDAE after representations learnt in the unsupervised phase of SDAE. Performan supervised fine-tuning is also shown as SDAE (mn stands for masking noise). Hidden mn layer 0 corresponds to original input representation. 7. Generating Samples from Stacked Denoising Autoencoder N etworks Besides the classification performance comparisons and qualitative visual inspection of learnt filters, it is also customary in studies of deep generative models such as DBNs, to sh ow samples generated 16. To verify that the learnt representation was responsible for the impr oved performance, rather than a random non- linear transformation, we also trained SVMs on the representation of the sa me neural network architecture but using randomly initialized weights: the performance degraded as we used the hig her level representations. 3399

30 V INCENT AROCHELLE , L AJOIE , B ENGIO AND M ANZAGOL , L SVM SVM SVM SVM Data Set SVM 0 1 2 3 kernel . ± 44 1.49 ± 0 . 24 1.24 ± 0 . 22 1.2 ± 0 . 21 linear 5.33 0 MNIST ± 0 . 23 1.04 ± 0 . 20 0.94 ± 0 . 19 0.95 ± 0 . 19 rbf 1.40 7.32 ± . 23 3.43 ± 0 . 16 2.71 ± 0 . 14 2.63 ± 0 . 14 linear 0 basic ± 0 . 15 2.59 ± 0 . 14 2.55 ± 0 . 14 2.57 ± 0 . 14 rbf 3.03 43.47 ± 0 . 43 21.74 ± 0 . 36 15.15 ± 0 . 31 10.00 linear 0 . 26 ± rot 11.11 ± . 28 8.45 ± 0 . 24 8.27 ± 0 . 24 8.64 ± 0 . 25 rbf 0 0 24.14 . 38 13.58 ± 0 . 30 13.00 ± 0 . 29 11.32 ± 0 . 28 linear ± bg-rand 14.58 ± 0 . 31 11.00 ± 0 . 27 rbf ± 0 . 26 10.16 ± 0 . 26 10.08 linear ± 0 . 38 16.72 ± 0 . 33 20.73 ± 0 . 36 14.55 ± 0 . 31 25.08 bg-img rbf ± 0 . 37 15.91 ± 0 . 32 22.61 ± 0 . 32 14.06 ± 0 . 30 16.36 linear 63.53 ± 0 . 42 50.44 ± 0 . 44 50.26 ± 0 . 44 42.07 ± 0 . 43 bg-img-rot rbf ± 0 . 44 44.09 ± 0 . 44 42.28 ± 0 . 43 39.07 ± 0 . 43 55.18 0 29.04 . 40 6.43 ± 0 . 22 2.31 ± 0 . 13 1.80 ± 0 . 12 linear ± rect 2.15 ± 0 . 13 2.19 ± 0 . 13 1.46 ± 0 . 11 1.22 ± 0 . 10 rbf linear 49.64 0 . 44 23.12 ± 0 . 37 23.01 ± 0 . 37 21.43 ± 0 . 36 ± rect-img rbf ± 0 . 37 22.27 ± 0 . 36 24.04 ± 0 . 36 20.98 ± 0 . 36 21.56 linear 45.75 ± 0 . 44 24.10 ± 0 . 37 18.40 ± 0 . 34 18.06 ± 0 . 34 convex rbf ± 0 . 34 18.09 ± 0 . 34 17.39 ± 0 . 33 17.53 ± 0 . 33 19.13 ± linear 2 . 51 12.51 ± 2 . 05 7.95 ± 1 . 68 5.04 ± 1 . 36 20.72 tzanetakis rbf 14.41 ± 2 . 18 7.54 ± 1 . 64 5.20 ± 1 . 38 4.13 ± 1 . 23 Table 5: SVM performance on higher level representations learnt by SD AE. Performance of both er original linear SVM, and SVM with RBF kernel is reported, as they are trained on eith ), or on the representation learnt by a SDAE at the level of its first (SVM ), input (SVM 0 1 ) hidden layer. The representations used for the SVMs ), or third (SVM second (SVM 2 3 were those obtained prior to fine-tuning. Test error rate on all consider ed classification problems is reported together with a 95% confidence interval. Best perfor mer is in bold, as well as those for which confidence intervals overlap. Clearly both linea r and kernel SVM performance benefit from using the higher level representations le arnt by SDAE. For most problems the performance increases steadily as we use represe ntations from ever higher levels of the architecture. from the trained models. This can yield another qualitative visual assessmen t of whether they were able to capture the input distribution well. 7.1 Top-Down Generation of a Visible Sample Given a Top-Layer Represe ntation Given a top-layer representation , a deep belief network (Hinton et al., 2006) is a directed graphical model, and it is easy to do a top down sampling pass, that is, sampling each layer conditioned on the layer above, to eventually produce a sample in the bottom layer that can be displayed. More precisely, in sigmoid deep belief networks (DBN), the representation at a lo wer layer X given the 3400

31 S TACKED ENOISING A UTOENCODERS D Y is distributed according to a product of independent Bernoullis whose mea layer above n is a ′ ′ Y ∼ B ( | Y , that is, X deterministic function of g )) , where g ( Y has the exact same form as that θ θ 17 part of an autoencoder. From a trained SAE or SDAE given in Equation 3 for the it is decoder e layer above in the exact thus possible to generate samples at one layer from the representation of th same way as in a DBN. 7.2 Bottom-Up Inferring of the Top-Layer Representation Corre sponding to a Given Input Pattern In SAE/SDAE, given an input representation at the bottom layer , the corresponding representation bottom-up pass using encoding functions f in the top layer is computed in a deterministic . The θ same procedure is used in DBNs and, in the graphical model perspective , can be viewed as an approximate inference of a factorial Bernoulli top-layer distribution given the low level input. This top-layer representation is to be understood as the parameters (the mean) o f a factorial Bernoulli distribution for the actual binary units. 7.3 Generating Samples with SAE, SDAE, and DBN Using the Same Proced ure The deep belief network of Hinton et al. (2006) is a fully specified genera tive model. In particular 18 the joint distribution of its top two layers is defined by an RBM model, undirected , that is, an s sampling (Hinton graphical model from which one can efficiently sample using alternating Gibb et al., 2006). So to sample from a DBN model, one would first sample from the to p-layer RBM presentation, perform using alternating Gibbs sampling. Then, given the thus obtained top-layer re the single top down sampling pass previously described to produce a visible p attern at the bottom layer. By contrast, SAE/SDAE training does not attempt to model the distribution of the to p-layer can use the exact same top representation. So even though—given a top-layer representation—we or a DBN, SAE/SDAE down sampling procedure to generate input patterns from a SAE/SDAE as f . They lack a model of cannot by themselves alone be treated as fully specified generative models the marginal distribution of their top layer. We can easily fix this by modeling that top-layer distribution non-parametrically b y the simple empirical distribution of the encoded top-layer representations of the n training set memory-based patterns. A visible sample can then be generated by simply taking the top-layer e ncoded represen- tation of a randomly picked training set input, and carrying out the top-down sampling procedure o be used as an alter- explained previously, as illustrated in Figure 14. This same technique can als native sample-generation procedure for DBNs built by stacking RBMs. If we keep the same fixed input pattern, and hence the same correspondin g higher level rep- resentation, and perform several top-down samplings, we can thus obs erve what kind of pattern variations the deep multilayer part of a deep network has learnt to model (or abstract away in ex- tracting the top-layer representation). Figure 15 shows the resulting varia bility in the regenerated 19 patterns, for models pretrained respectively as SAE, SDAE and DBN on MNIST without any su- ′ 17. SAE and SDAE train such g ven the to perform reconstruction, that is, predicting the mean value of a layer gi θ representation in the layer immediately above it. 18. This RBM was trained, using the training set, to model the representation s obtained at the layer just below the top one, produced by the bottom-up pass we just described. 19. Both were pretrained with salt-and-pepper noise. 3401

32 V INCENT AROCHELLE , L AJOIE , B ENGIO AND M ANZAGOL , L . A randomly picked input Figure 14: Non-parametric sampling procedure for pretrained networks form the original data set is provided as input. Its top level representation is obtained by . A visible pattern is generated f deterministic bottom-up encoding using functions k ) ( θ eterministic given this high level representation, by alternating Bernoulli sampling and d ) . This same previouslayer ( B ( g decoding, that is, by successively sampling from ) ( ′ k θ lity and procedure can be applied with SAE, SDAE and DBN. It allows to see the qua variability of patterns one obtains given a high-level representation. pervised fine-tuning. It appears that SDAE and DBN are able to resynth esize a variety of similarly good quality digits, whereas the SAE trained model regenerates patterns with much visible degra- dation in quality. This is further evidence of the qualitative difference resu lting from optimizing a denoising criterion instead of mere reconstruction criterion. Note how SDAE puts back the miss- r stroke of the 7, ing hole in the loop of the regenerated 6, and sometimes straightens up the uppe t appears that, when using suggesting that it did indeed capture interesting specific characteristics. I this same sample generation procedure, SDAE and DBN yield a similar degree o f variability in the regenerated patterns (with DBN patterns looking slightly fatter and SDAE patte rns slightly thinner). 20 Neither DBN nor SDAE guarantee that class boundaries will not be cross for example DBN ed, closes a loop in a 7 making it look closer to a 9, whereas SDAE sometimes breaks open the loop of an 8 making it look like a 3. But in all cases, and contrary to SAE, the rege nerated patterns look like they could be samples from the same unknown input distribution that yielded the training set. 8. Conclusion and Future Work The present work was inspired by recent successful approaches to training deep networks, specif- ically by their use of a local unsupervised criterion, and led by the question of what that crite- rion should be. At the same time we were motivated by a desire to bridge a remainin g perfor- mance gap between deep belief networks and the stacking of ordinary auto encoders (Bengio et al., 2007; Larochelle et al., 2009a). This led us to address a theoretical sho rtcoming of traditional autoencoders—namely their inability in principle to learn useful over-complete representations—in a simple yet original way: by changing the objective from one involving mere reconstruction to the more challenging task of denoising . The resulting Stacked Denoising Autoencoder algorithm 20. The reader should however keep in mind that this results from unsup ervised training only. 3402

33 S TACKED ENOISING A UTOENCODERS D (b) SDAE (a) SAE (c) DBN AE and DBN pre- Figure 15: Variability of the samples generated with 3-hidden-layer SAE, SD trained models. Each sub-figure is to be read row-wise: the leftmost pattern in each row is a training set pattern. Following the sample generation depicted in Figure 14, it was provided as input to the network and its top-layer representation was compu ted by de- terministic bottom up encoding. Patterns to its right were then generated indepe ndently given that top level representation. Clearly, SDAE trained networks, like DBNs, are able to regenerate high quality samples from their high level representation, con trary to SAE. SDAE and DBNs also appear to give rise to a similar level of variability in the botto m-up generated patterns (DBN patterns tending to be somewhat fatter). Note how SDAE puts back the missing hole in the loop of the regenerated 6, and sometimes straightens up the upper stroke of the last 7, suggesting that it did indeed capture meaning ful specific characteristics. DBN and SDAE generated patterns can easily pass for s amples from the unknown input distribution being modeled, unlike patterns generated by SAE . 3403

34 V INCENT AROCHELLE , L AJOIE , B ENGIO AND M ANZAGOL , L e gap with DBNs, yielding for training deep networks, proved indeed able to bridge the performanc equivalent or better classification performance on all but one of the con sidered benchmark prob- coders yielded in most lems. As a deep network pretraining strategy, stacking of denoising autoen cases a significant improvement in performance over the stacking of ordin ary autoencoders. The ised local denoising criterion, representations thus extracted layer by layer, using a purely unsuperv appear to make subsequent classification tasks much easier. This is furthe r evidenced by the fact ear able to greatly benefit from that state-of-the-art shallow classifiers such as kernel SVMs also app ders showed that they it. Close examination of the feature extractors learnt by denoising autoenco ge detectors on natural were able to zero in on useful structure in the data (such as Gabor-like ed image patches) that regular autoencoders seemed unable to learn. n the well- The algorithm we developed is a straightforward, easy to implement, variation o ind and level of corrupting understood ordinary autoencoders. All that remains to be chosen is the k noise. It is likely that a careful choice, possibly guided by prior domain kn owledge, may further showed that high perfor- boost application-specific performance. Nevertheless our experiments nd with little tuning of mance can already be achieved using very simple and generic noise types a the noise level. In addition, we were able to show that, contrary to what it may s eem on the sur- face based on popular myths, the denoising training we advocate is not equivalent to using a mere weight decay regularization, nor is it the same as direct supervised trainin g with corrupted (jittered) examples. Beyond the specificities and practical usefulness of the simple algorithm we d our re- eveloped, denoising criterion sults clearly establish the value of using a as an unsupervised objective to guide the learning of useful higher level representations. This is in our view the most important contribution of our work, as it offers an interesting alternative to more usu al (and often intractable) e measured and directly op- likelihood derived criteria. Indeed, denoising performance can easily b tive divergence training timized. The use of a denoising criterion is very different from the contras of RBMs or the direct enforcing of sparsity in autoencoders. We hope th at our very encouraging results will inspire further research in this direction, both theoretical (to be tter understand the rela- tionship between denoising and representation learning), and practical ( to develop better learning algorithms based on this understanding). There are certainly better ways to use denoising-based training signals in th e learning of a deep network than the simple local approach we explored here. In particular, w hile stacking denoising autoencoders allows us to build a deep network, the denoising autoencode rs we used here were shallow. It would thus be interesting to investigate deep denoising autoencod ers with several hidden layers, and their ability to form useful representations. The choice and r ole of the corruption process also deserves further inquiry. If more involved corruption processes than those explored here prove beneficial, it would be most useful if they could be parameterized and learn t directly from the data, rather than having to be hand-engineered based on prior-knowledge. Acknowledgments This research was supported by funding from NSERC, MITACS, FQRN T, CIFAR, and the Canada Research Chairs, and partly carried out on computation resources made available by RQCHP. 3404

35 S TACKED ENOISING A UTOENCODERS D References G. An. The effects of adding noise during backpropagation training on a generalization perfor- , 8(3):643–674, 1996. mance. Neural Computation IAPR Workshop on Syntactic and Structural Pattern H. Baird. Document image defect models. In Recognition , pages 38–46, Murray Hill, NJ., 1990. P. Baldi and K. Hornik. Neural networks and principal component ana lysis: Learning from examples without local minima. Neural Networks , 2:53–58, 1989. nes are edge filters. Vision A. Bell and T.J. Sejnowski. The independent components of natural sce Research , 37:3327–3338, 1997. ration and blind A.J. Bell and T.J. Sejnowski. An information maximisation approach to blind sepa Neural Computation , 7(6):1129–1159, 1995. deconvolution. Y. Bengio. Learning deep architectures for AI. Foundations and Trends in Machine Learning , 2(1): 1–127, 2009. Also published as a book. Now Publishers, 2009. rgence. Neural Computa- Y. Bengio and O. Delalleau. Justifying and generalizing contrastive dive tion , 21(6):1601–1621, June 2009. Y. Bengio and Y. LeCun. Scaling learning algorithms towards AI. In L. Botto u, O. Chapelle, D. DeCoste, and J. Weston, editors, Large Scale Kernel Machines . MIT Press, 2007. Y. Bengio, P. Lamblin, D. Popovici, and H. Larochelle. Greedy layer-wis e training of deep net- ̈ Advances in Neural works. In Bernhard Sch olkopf, John Platt, and Thomas Hoffman, editors, Information Processing Systems 19 (NIPS’06) , pages 153–160. MIT Press, 2007. ́ J. Bergstra. Algorithms for classifying recorded music by genre. Master ’s thesis, Universit e de Montreal, 2006. The Statistician , 24(3):179–195, 1975. J. Besag. Statistical analysis of non-lattice data. C.M. Bishop. Training with noise is equivalent to Tikhonov regularization. Neural Computation , 7 (1):108–116, 1995. H. Bourlard and Y. Kamp. Auto-association by multilayer perceptrons and s ingular value decom- position. Biological Cybernetics , 59:291–294, 1988. ̈ O. Chapelle, B. Sch olkopf, and A. Zien, editors. Semi-Supervised Learning . MIT Press, Cambridge, MA, 2006. Y. Cho and L. Saul. Kernel methods for deep learning. In Y. Bengio, D. Schuurmans, C. Williams, J. Lafferty, and A. Culotta, editors, Advances in Neural Information Processing Systems 22 (NIPS’09) , pages 342–350. NIPS Foundation, 2010. D. Erhan, Y. Bengio, A. Courville, P.A. Manzagol, P. Vincent, and S. Be ngio. Why does unsu- pervised pre-training help deep learning? Journal of Machine Learning Research , 11:625–660, February 2010. 3405

36 V INCENT AROCHELLE , L AJOIE , B ENGIO AND M ANZAGOL , L iatives distribuees. In P. Gallinari, Y. LeCun, S. Thiria, and F. Fogelman-Soulie. Memoires assoc Proceedings of COGNITIVA 87 , Paris, La Villette, 1987. Neural Compu- l prospects. Y. Grandvalet, S. Canu, and S. Boucheron. Noise injection: Theoretica tation , 9(5):1093–1108, 1997. ̊ J. H astad. Almost optimal lower bounds for small depth circuits. In Proceedings of the 18th annual , pages 6–20, Berkeley, California, 1986. ACM Press. ACM Symposium on Theory of Computing ̊ Computational Com- J. H astad and M. Goldmann. On the power of small-depth threshold circuits. , 1:113–129, 1991. plexity Dependency networks D. Heckerman, D.M. Chickering, C. Meek, R. Rounthwaite, and C. Kadie. Journal of Machine Learning Re- for inference, collaborative filtering, and data visualization. search , 1:49–75, 2000. Artificial Intelligence , 40:185–234, 1989. G.E. Hinton. Connectionist learning procedures. nce. Neural Computa- G.E. Hinton. Training products of experts by minimizing contrastive diverge tion , 14:1771–1800, 2002. G.E. Hinton and R. Salakhutdinov. Reducing the dimensionality of data with neur al networks. Science , 313(5786):504–507, July 2006. G.E. Hinton, S. Osindero, and Y.W. Teh. A fast learning algorithm for dee p belief nets. Neural , 18:1527–1554, 2006. Computation ing. IEEE Transac- L. Holmstrm and P. Koistinen. Using additive noise in back-propagation train tions on Neural Networks , 3(1):24–38, 1992. J.J. Hopfield. Neural networks and physical systems with emergent collec tive computational abili- ties. , 79, 1982. Proceedings of the National Academy of Sciences, USA D.H. Hubel and T.N. Wiesel. Receptive fields of single neurons in the cat’s striate cortex. Journal of Physiology , 148:574–591, 1959. V. Jain and S.H. Seung. Natural image denoising with convolutional network s. In Daphne Koller, Dale Schuurmans, Yoshua Bengio, and Leon Bottou, editors, Advances in Neural Information Processing Systems 21 (NIPS’08) , 2008. N. Japkowicz, S.J. Hanson, and M.A. Gluck. Nonlinear autoassociation is not equivalent to PCA. Neural Computation , 12(3):531–545, 2000. H. Larochelle, D. Erhan, A. Courville, J. Bergstra, and Y. Bengio. An empirical evaluation of deep architectures on problems with many factors of variation. In Z. Ghahramani, editor, Proceedings of the Twenty-fourth International Conference on Machine Learning (IC ML’07) , pages 473–480. ACM, 2007. H. Larochelle, Y. Bengio, J. Louradour, and P. Lamblin. Exploring stra tegies for training deep neural networks. Journal of Machine Learning Research , 10:1–40, January 2009a. 3406

37 S TACKED ENOISING A UTOENCODERS D terdependent codes. In Pro- H. Larochelle, D. Erhan, and P. Vincent. Deep learning using robust in nd Statistics (AIS- ceedings of the Twelfth International Conference on Artificial Intelligence a TATS 2009) , pages 312–319, April 2009b. ́ Y. LeCun. Mod eles connexionistes de l’apprentissage e de Paris VI, 1987. . PhD thesis, Universit ` Y. LeCun, B. Boser, J.S. Denker, D. Henderson, R.E. Howard, W. H ubbard, and L.D. Jackel. Back- Neural Computation , 1(4):541–551, propagation applied to handwritten zip code recognition. 1989. H. Lee, C. Ekanadham, and A. Ng. Sparse deep belief net model for vis ual area V2. In J.C. Platt, Advances in Neural Information Processing Systems D. Koller, Y. Singer, and S. Roweis, editors, , pages 873–880, Cambridge, MA, 2008. MIT Press. 20 (NIPS’07) R. Linsker. An application of the principle of maximum information preservation to linear systems. In D.S. Touretzky, editor, Advances in Neural Information Processing Systems 1 (NIPS’88) . Mor- gan Kaufmann, 1989. Parallel Distributed Processing: J.L. McClelland, D.E. Rumelhart, and the PDP Research Group. Explorations in the Microstructure of Cognition , volume 2. MIT Press, Cambridge, 1986. B.A. Olshausen and D.J. Field. Emergence of simple-cell receptive field pr operties by learning a sparse code for natural images. Nature , 381:607–609, 1996. B.A. Olshausen and D.J. Field. Sparse coding with an overcomplete basis se t: a strategy employed by V1? Vision Research , 37:3311–3325, December 1997. : Observations on T. Poggio and T. Vetter. Recognition and structure from one 2d model view o. 1347, Artificial prototypes, object classes and symmetries. Technical Report A.I. Memo N Intelligence Laboratory, Massachusetts Institute of Technology, 1992. M. Ranzato, C.S. Poultney, S. Chopra, and Y. LeCun. Efficient learnin g of sparse representations ̈ with an energy-based model. In B. Sch olkopf, J. Platt, and T. Hoffman, editors, Advances in Neural Information Processing Systems 19 (NIPS’06) , pages 1137–1144. MIT Press, 2007. M. Ranzato, Y. Boureau, and Y. LeCun. Sparse feature learning for deep belief networks. In J.C. Platt, D. Koller, Y. Singer, and S. Roweis, editors, Advances in Neural Information Processing Systems 20 (NIPS’07) , pages 1185–1192, Cambridge, MA, 2008. MIT Press. R. Scalettar and A. Zee. Emergence of grandmother memory in feed forwar d networks: Learning with noise and forgetfulness. In D. Waltz and J. A. Feldman, editors, Connectionist Models and Their Implications: Readings from Cognitive Science , pages 309–332. Ablex, Norwood, 1988. ̈ B. Sch olkopf, C.J.C. Burges, and V. Vapnik. Incorporating invariances in su pport vector learning machines. In C. von der Malsburg, W. von Seelen, J. C. Vorbrggen, a nd B. Sendhoff, editors, Lecture Notes in Computer Science (Vol 112), Artificial Neural Netowrk s ICANN’96 , pages 47– 52. Springer, 1996. S.H. Seung. Learning continuous attractors in recurrent networks. In M.I. Jordan, M.J. Kearns, and S.A. Solla, editors, Advances in Neural Information Processing Systems 10 (NIPS’97) , pages 654–660. MIT Press, 1998. 3407

38 V INCENT AROCHELLE , L AJOIE , B ENGIO AND M ANZAGOL , L . , 4(1): J. Sietsma and R. Dow. Creating artificial neural networks that generalize Neural Networks 67–79, 1991. malism for specifying selected P. Simard, B. Victorri, Y. LeCun, and J. Denker. Tangent prop - A for Lippmann, editors, invariances in an adaptive network. In J.E. Moody S.J. Hanson and R.P. , pages 895–903, San Mateo, Advances in Neural Information Processing Systems 4 (NIPS’91) CA, 1992. Morgan Kaufmann. f harmony theory. In P. Smolensky. Information processing in dynamical systems: Foundations o D.E. Rumelhart and J.L. McClelland, editors, Parallel Distributed Processing , volume 1, chap- ter 6, pages 194–281. MIT Press, Cambridge, 1986. P.E. Utgoff and D.J. Stracuzzi. Many-layered learning. Neural Computation , 14:2497–2539, 2002. composing robust features P. Vincent, H. Larochelle, Y. Bengio, and P.A. Manzagol. Extracting and with denoising autoencoders. In W.W. Cohen, A. McCallum, and S.T. Roweis , editors, Pro- ceedings of the Twenty-fifth International Conference on Machine Learn ing (ICML’08) , pages 1096–1103. ACM, 2008. A. von Lehman, E.G. Paek, P.F. Liao, A. Marrakchi, and J.S. Patel. Facto rs influencing learning by back-propagation. In , volume 1, pages IEEE International Conference on Neural Networks 335–341, San Diego 1988, 1988. IEEE, New York. J. Weston, F. Ratle, and R. Collobert. Deep learning via semi-supervised e mbedding. In William W. Cohen, Andrew McCallum, and Sam T. Roweis, editors, Proceedings of the Twenty-fifth Inter- national Conference on Machine Learning (ICML’08) , pages 1168–1175, New York, NY, USA, 2008. ACM. ISBN 978-1-60558-205-4. doi: 10.1145/1390156.1390 303. 3408