A Tutorial on Energy-Based Learning

Yann LeCun, Sumit Chopra, Raia Hadsell, Marc'Aurelio Ranzato, and Fu Jie Huang
The Courant Institute of Mathematical Sciences, New York University
{yann,sumit,raia,ranzato,jhuangfu}@cs.nyu.edu
http://yann.lecun.com

v1.0, August 19, 2006
To appear in "Predicting Structured Data",
G. Bakir, T. Hofmann, B. Schölkopf, A. Smola, B. Taskar (eds)
MIT Press, 2006

Abstract

Energy-Based Models (EBMs) capture dependencies between variables by associating a scalar energy to each configuration of the variables. Inference consists in clamping the value of observed variables and finding configurations of the remaining variables that minimize the energy. Learning consists in finding an energy function in which observed configurations of the variables are given lower energies than unobserved ones. The EBM approach provides a common theoretical framework for many learning models, including traditional discriminative and generative approaches, as well as graph-transformer networks, conditional random fields, maximum margin Markov networks, and several manifold learning methods. Probabilistic models must be properly normalized, which sometimes requires evaluating intractable integrals over the space of all possible variable configurations. Since EBMs have no requirement for proper normalization, this problem is naturally circumvented. EBMs can be viewed as a form of non-probabilistic factor graphs, and they provide considerably more flexibility in the design of architectures and training criteria than probabilistic approaches.

1 Introduction: Energy-Based Models

The main purpose of statistical modeling and machine learning is to encode dependencies between variables. By capturing those dependencies, a model can be used to answer questions about the values of unknown variables given the values of known variables.
Energy-Based Models (EBMs) capture dependencies by associating a scalar energy (a measure of compatibility) to each configuration of the variables. Inference, i.e., making a prediction or decision, consists in setting the value of observed variables

and finding values of the remaining variables that minimize the energy. Learning consists in finding an energy function that associates low energies to correct values of the remaining variables, and higher energies to incorrect values. A loss functional, minimized during learning, is used to measure the quality of the available energy functions. Within this common inference/learning framework, the wide choice of energy functions and loss functionals allows for the design of many types of statistical models, both probabilistic and non-probabilistic.

Energy-based learning provides a unified framework for many probabilistic and non-probabilistic approaches to learning, particularly for non-probabilistic training of graphical models and other structured models. Energy-based learning can be seen as an alternative to probabilistic estimation for prediction, classification, or decision-making tasks. Because there is no requirement for proper normalization, energy-based approaches avoid the problems associated with estimating the normalization constant in probabilistic models. Furthermore, the absence of the normalization condition allows for much more flexibility in the design of learning machines. Most probabilistic models can be viewed as special types of energy-based models in which the energy function satisfies certain normalizability conditions, and in which the loss function, optimized by learning, has a particular form.

This chapter presents a tutorial on energy-based models, with an emphasis on their use for structured output problems and sequence labeling problems. Section 1 introduces energy-based models and describes deterministic inference through energy minimization. Section 2 introduces energy-based learning and the concept of the loss function. A number of standard and non-standard loss functions are described, including
the perceptron loss, several margin-based losses, and the negative log-likelihood loss. The negative log-likelihood loss can be used to train a model to produce conditional probability estimates. Section 3 shows how simple regression and classification models can be formulated in the EBM framework. Section 4 concerns models that contain latent variables. Section 5 analyzes the various loss functions in detail and gives sufficient conditions that a loss function must satisfy so that its minimization will cause the model to approach the desired behavior. A list of "good" and "bad" loss functions is given. Section 6 introduces the concept of non-probabilistic factor graphs and informally discusses efficient inference algorithms. Section 7 focuses on sequence labeling and structured output models. Linear models such as max-margin Markov networks and conditional random fields are re-formulated in the EBM framework. The literature on discriminative learning for speech and handwriting recognition, going back to the late 80's and early 90's, is reviewed. This includes globally trained systems that integrate non-linear discriminant functions, such as neural networks, and sequence alignment methods, such as dynamic time warping and hidden Markov models. Hierarchical models such as the graph transformer network architecture are also reviewed. Finally, the differences, commonalities, and relative advantages of energy-based approaches, probabilistic approaches, and sampling-based approximate methods such as contrastive divergence are discussed in Section 8.

Figure 1: A model measures the compatibility between observed variables X and variables to be predicted Y using an energy function E(Y, X). For example, X could be the pixels of an image, and Y a discrete label describing the object in the image. Given X, the model produces the answer Y that minimizes the energy E.

1.1 Energy-Based Inference

Let us consider a model with two sets of variables, X and Y, as represented in Figure 1. Variable X could be a vector containing the pixels from an image of an object. Variable Y could be a discrete variable that represents the possible category of the object. For example, Y could take six possible values: animal, human figure, airplane, truck, car, and "none of the above". The model is viewed as an energy function which measures the "goodness" (or badness) of each possible configuration of X and Y. The output number can be interpreted as the degree of compatibility between the values of X and Y. In the following, we use the convention that small energy values correspond to highly compatible configurations of the variables, while large energy values correspond to highly incompatible configurations of the variables. Functions of this type are given different names in different technical communities; they may be called contrast functions, value functions, or negative log-likelihood functions. In the following, we will use the term energy function and denote it E(Y, X). A distinction should be made between the energy function, which is minimized by the inference process, and the loss
functional (introduced in Section 2), which is minimized by the learning process.

In the most common use of a model, the input X is given (observed from the world), and the model produces the answer Y that is most compatible with the observed X. More precisely, the model must produce the value Y*, chosen from a set 𝒴, for which E(Y, X) is the smallest:

    Y* = argmin_{Y ∈ 𝒴} E(Y, X).   (1)

When the size of the set 𝒴 is small, we can simply compute E(Y, X) for all possible values of Y ∈ 𝒴 and pick the smallest.
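When 𝒴 is small enough to enumerate, Equation 1 can be implemented directly as a loop over the candidate answers. The sketch below is illustrative only: the `prototypes` table and `toy_energy` function are invented stand-ins, not part of the text.

```python
def infer(energy, X, answers):
    """Equation 1: return the answer Y* in the set that minimizes E(Y, X)."""
    return min(answers, key=lambda Y: energy(Y, X))

# Toy energy: squared distance between the input and a per-label prototype.
prototypes = {"animal": 0.0, "human": 1.0, "car": 2.0}

def toy_energy(Y, X):
    return (prototypes[Y] - X) ** 2

best = infer(toy_energy, 1.9, list(prototypes))  # -> "car"
```

Any energy function with the same signature can be plugged in; only the enumeration of 𝒴 is essential here.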

Figure 2: Several applications of EBMs: (a) face recognition: Y is a high-cardinality discrete variable; (b) face detection and pose estimation: Y is a collection of vectors with location and pose of each possible face; (c) image segmentation: Y is an image in which each pixel is a discrete label; (d-e) handwriting recognition and sequence labeling: Y is a sequence of symbols from a highly structured but potentially infinite set (the set of English sentences). The situation is similar for many applications in natural language processing and computational biology; (f) image restoration: Y is a high-dimensional continuous variable (an image).

In general, however, picking the best Y may not be simple. Figure 2 depicts several situations in which 𝒴 may be too large to make exhaustive search practical. In Figure 2(a), the model is used to recognize a face. In this case, the set 𝒴 is discrete and finite, but its cardinality may be tens of thousands [Chopra et al., 2005]. In Figure 2(b), the model is used to find the faces in an image and estimate their poses. The set 𝒴 contains a binary variable for each location indicating whether a face is present at that location, and a set of continuous variables representing the size and orientation of the face [Osadchy et al., 2005]. In Figure 2(c), the model is used to segment a biological image: each pixel must be classified into one of five categories (cell nucleus, nuclear membrane, cytoplasm, cell membrane, external medium). In this case, 𝒴 contains all the consistent label images, i.e. the ones for which the nuclear membranes are encircling the nuclei, the nuclei and cytoplasm are inside the cell walls, etc. The set is discrete, but intractably large. More importantly, members of the set must satisfy complicated consistency constraints [Ning et al., 2005]. In Figure 2(d), the model is used to recognize a handwritten sentence. Here 𝒴 contains all possible sentences of the English language, which is a discrete but infinite set of sequences of symbols [LeCun et al., 1998a]. In Figure 2(f), the model is used to restore an image (by cleaning the noise, enhancing the resolution, or removing scratches). The set 𝒴 contains all possible images (all possible pixel combinations). It is a continuous and high-dimensional set.

For each of the above situations, a specific strategy, called the inference procedure, must be employed to find the Y that minimizes E(Y, X). In many real situations, the inference procedure will produce an approximate result, which may or may not be the global minimum of E(Y, X) for a given X.
In fact, there may be situations where E(Y, X) has several equivalent minima. The best inference procedure to use often depends on the internal structure of the model. For example, if 𝒴 is continuous and E(Y, X) is smooth and well-behaved with respect to Y, one may use a gradient-based optimization algorithm. If Y is a collection of discrete variables and the energy function can be expressed as a factor graph, i.e. a sum of energy functions (factors) that depend on different subsets of variables, efficient inference procedures for factor graphs can be used (see Section 6) [Kschischang et al., 2001, MacKay, 2003]. A popular example of such a procedure is the min-sum algorithm. When each element of 𝒴 can be represented as a path in a weighted directed acyclic graph, then the energy for a particular Y is the sum of values on the edges and nodes along that particular path. In this case, the best Y can be found efficiently using dynamic programming (e.g. with the Viterbi algorithm or A*). This situation often occurs in sequence labeling problems such as speech recognition, handwriting recognition, natural language processing, and biological sequence analysis (e.g. gene finding, protein folding prediction, etc). Different situations may call for the use of other optimization procedures, including continuous optimization methods such as linear programming, quadratic programming, and non-linear optimization methods, or discrete optimization methods such as simulated annealing, graph cuts, or graph matching. In many cases, exact optimization is impractical, and one must resort to approximate methods, including methods that use surrogate energy functions (such as variational methods).
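For the chain-structured case just described, min-sum dynamic programming can be sketched in a few lines. The node and edge energy tables below are invented toy values, and the function is a bare-bones illustration rather than a production Viterbi implementation:

```python
def viterbi_min_sum(node_E, edge_E, labels):
    """Minimize sum_t node_E[t][y_t] + sum_{t>0} edge_E[y_{t-1}][y_t]
    over label sequences (min-sum dynamic programming on a chain)."""
    best = {y: node_E[0][y] for y in labels}   # lowest prefix energy ending in y
    back = []                                  # backpointers, one dict per step
    for t in range(1, len(node_E)):
        new_best, ptr = {}, {}
        for y in labels:
            prev = min(labels, key=lambda p: best[p] + edge_E[p][y])
            new_best[y] = best[prev] + edge_E[prev][y] + node_E[t][y]
            ptr[y] = prev
        best = new_best
        back.append(ptr)
    y = min(labels, key=lambda l: best[l])
    energy, path = best[y], [y]
    for ptr in reversed(back):                 # trace the minimizing path backwards
        y = ptr[y]
        path.append(y)
    return path[::-1], energy

node_E = [{"a": 0.0, "b": 1.0}, {"a": 2.0, "b": 0.0}, {"a": 0.0, "b": 2.0}]
edge_E = {"a": {"a": 0.0, "b": 1.0}, "b": {"a": 1.0, "b": 0.0}}
path, energy = viterbi_min_sum(node_E, edge_E, ["a", "b"])
```

The cost is O(T · |labels|²) rather than the |labels|^T of exhaustive enumeration, which is the whole point of exploiting the additive structure of the energy.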

1.2 What Questions Can a Model Answer?

In the preceding discussion, we have implied that the question to be answered by the model is "What is the Y that is most compatible with this X?", a situation that occurs in prediction, classification or decision-making tasks. However, a model may be used to answer questions of several types:

1. Prediction, classification, and decision-making: "Which value of Y is most compatible with this X?" This situation occurs when the model is used to make hard decisions or to produce an action. For example, if the model is used to drive a robot and avoid obstacles, it must produce a single best decision such as "steer left", "steer right", or "go straight".

2. Ranking: "Is Y1 or Y2 more compatible with this X?" This is a more complex task than classification because the system must be trained to produce a complete ranking of all the answers, instead of merely producing the best one. This situation occurs in many data mining applications where the model is used to select multiple samples that best satisfy a given criterion.

3. Detection: "Is this value of Y compatible with X?" Typically, detection tasks, such as detecting faces in images, are performed by comparing the energy of a face label with a threshold. Since the threshold is generally unknown when the system is built, the system must be trained to produce energy values that increase as the image looks less like a face.

4. Conditional density estimation: "What is the conditional probability distribution over Y given X?" This case occurs when the output of the system is not used directly to produce actions, but is given to a human decision maker or is fed to the input of another, separately built system.

We often think of X as a high-dimensional variable (e.g. an image) and Y as a discrete variable (e.g. a label), but the converse case is also common.
This occurs when the model is used for such applications as image restoration, computer graphics, speech and language production, etc. The most complex case is when both X and Y are high-dimensional.

1.3 Decision Making versus Probabilistic Modeling

For decision-making tasks, such as steering a robot, it is merely necessary that the system give the lowest energy to the correct answer. The energies of other answers are irrelevant, as long as they are larger. However, the output of a system must sometimes be combined with that of another system, or fed to the input of another system (or to a human decision maker). Because energies are uncalibrated (i.e. measured in arbitrary units), combining two, separately trained energy-based models is not straightforward: there is no a priori guarantee that their energy scales are commensurate. Calibrating energies so as to permit such combinations can be done in a number of ways. However, the only consistent way involves turning the collection of energies for all possible outputs into a normalized probability distribution. The simplest and most common method

for turning a collection of arbitrary energies into a collection of numbers between 0 and 1 whose sum (or integral) is 1 is through the Gibbs distribution:

    P(Y | X) = e^{−βE(Y,X)} / ∫_{y∈𝒴} e^{−βE(y,X)},   (2)

where β is an arbitrary positive constant akin to an inverse temperature, and the denominator is called the partition function (by analogy with similar concepts in statistical physics). The choice of the Gibbs distribution may seem arbitrary, but other probability distributions can be obtained (or approximated) through a suitable re-definition of the energy function. Whether the numbers obtained this way are good probability estimates does not depend on how energies are turned into probabilities, but on how E(Y, X) is estimated from data.

It should be noted that the above transformation of energies into probabilities is only possible if the integral ∫_{y∈𝒴} e^{−βE(y,X)} converges. This somewhat restricts the energy functions and domains 𝒴 that can be used. More importantly, there are many practical situations where computing the partition function is intractable (e.g. when 𝒴 has high cardinality), or outright impossible (e.g. when 𝒴 is a high dimensional variable and the integral has no analytical solution). Hence probabilistic modeling comes with a high price, and should be avoided when the application does not require it.

2 Energy-Based Training: Architecture and Loss Function

Training an EBM consists in finding an energy function that produces the best Y for any X. The search for the best energy function is performed within a family of energy functions ℰ indexed by a parameter W:

    ℰ = {E(W, Y, X) : W ∈ 𝒲}.   (3)

The architecture of the EBM is the internal structure of the parameterized energy function E(W, Y, X). At this point, we put no particular restriction on the nature of X, Y, W, and E.
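Looking back at Equation 2, the Gibbs conversion is a direct computation when 𝒴 is finite. The sketch below uses toy energy values; the subtraction of the minimum energy before exponentiating is a standard numerical-stability trick, not something prescribed by the text:

```python
import math

def gibbs(energies, beta=1.0):
    """Equation 2: P(Y|X) = exp(-beta*E(Y,X)) / sum_y exp(-beta*E(y,X))."""
    m = min(energies.values())                 # shift energies to avoid overflow
    w = {y: math.exp(-beta * (e - m)) for y, e in energies.items()}
    Z = sum(w.values())                        # partition function (up to a factor)
    return {y: v / Z for y, v in w.items()}

p = gibbs({"cat": 1.0, "dog": 3.0}, beta=1.0)
# lower energy -> higher probability, and the values sum to 1
```

The shift by the minimum cancels in the ratio, so the returned distribution is exactly the one defined by Equation 2.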
When X and Y are real vectors, E could be as simple as a linear combination of basis functions (as in the case of kernel methods), or a set of neural net architectures and weight values. Section 3 gives examples of simple architectures for common applications to classification and regression. When X and Y are variable-size images, sequences of symbols or vectors, or more complex structured objects, E may represent a considerably richer class of functions. Sections 4, 6 and 7 discuss several examples of such architectures. One advantage of the energy-based approach is that it puts very few restrictions on the nature of E.

To train the model for prediction, classification, or decision-making, we are given a set of training samples S = {(X^i, Y^i) : i = 1...P}, where X^i is the input for the i-th training sample, and Y^i is the corresponding desired answer. In order to find the best energy function in the family ℰ, we need a way to assess the quality of any particular energy function, based solely on two elements: the training set, and our prior knowledge about the task.

This quality measure is called the loss functional (i.e. a function of function) and denoted L(E, S). For simplicity, we often denote it L(W, S) and simply call it the loss function. The learning problem is simply to find the W that minimizes the loss:

    W* = argmin_{W ∈ 𝒲} L(W, S).   (4)

For most cases, the loss functional is defined as follows:

    L(E, S) = (1/P) Σ_{i=1}^P L(Y^i, E(W, 𝒴, X^i)) + R(W).   (5)

It is an average taken over the training set of a per-sample loss functional, denoted L(Y^i, E(W, 𝒴, X^i)), which depends on the desired answer Y^i and on the energies obtained by keeping the input sample fixed and varying the answer Y. Thus, for each sample, we evaluate a "slice" of the energy surface. The term R(W) is the regularizer, and can be used to embed our prior knowledge about which energy functions in our family are preferable to others (in the absence of training data). With this definition, the loss is invariant under permutations of the training samples and under multiple repetitions of the training set.

Naturally, the ultimate purpose of learning is to produce a model that will give good answers for new input samples that are not seen during training. We can rely on general results from statistical learning theory which guarantee that, under simple interchangeability conditions on the samples and general conditions on the family of energy functions (finite VC dimension), the deviation between the value of the loss after minimization on the training set, and the loss on a large, separate set of test samples, is bounded by a quantity that converges to zero as the size of the training set increases [Vapnik, 1995].
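Equation 5 translates almost literally into code. In the sketch below, the concrete choices (a squared-error energy, the energy of the correct answer as the per-sample loss, and a quadratic regularizer) are illustrative assumptions, not prescribed by the text:

```python
def total_loss(per_sample_loss, energy, W, samples, regularizer=lambda W: 0.0):
    """Equation 5: average per-sample loss over the training set, plus R(W)."""
    data_term = sum(per_sample_loss(Y, energy, W, X) for X, Y in samples) / len(samples)
    return data_term + regularizer(W)

# Toy instantiation: E(W, Y, X) = (Y - W*X)^2, per-sample loss = energy of the
# correct answer (the "energy loss" discussed later in Section 2.2.1).
energy = lambda W, Y, X: (Y - W * X) ** 2
loss = lambda Y, energy, W, X: energy(W, Y, X)
L = total_loss(loss, energy, 2.0, [(1.0, 2.0), (2.0, 4.0)],
               regularizer=lambda W: 0.01 * W ** 2)
# perfect fit: the data term is 0, so L equals the regularizer, 0.04
```

Swapping in a different `per_sample_loss` changes the training criterion without touching the averaging or the regularizer, which mirrors how the chapter treats loss functions as interchangeable components.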
2.1 Designing a Loss Functional

Intuitively, the per-sample loss functional should be designed in such a way that it assigns a low loss to well-behaved energy functions: energy functions that give the lowest energy to the correct answer and higher energy to all other (incorrect) answers. Conversely, energy functions that do not assign the lowest energy to the correct answers would have a high loss. Characterizing the appropriateness of loss functions (the ones that select the best energy functions) is further discussed in following sections.

Considering only the task of training a model to answer questions of type 1 (prediction, classification and decision-making), the main intuition of the energy-based approach is as follows. Training an EBM consists in shaping the energy function, so that for any given X, the inference algorithm will produce the desired value for Y. Since the inference algorithm selects the Y with the lowest energy, the learning procedure must shape the energy surface so that the desired value of Y has lower energy than all other (undesired) values. Figures 3 and 4 show examples of energy as a function of Y for a given input sample X^i in cases where Y is a discrete variable and a continuous scalar variable. We note three types of answers:

Figure 3: How training affects the energies of the possible answers in the discrete case: the energy of the correct answer is decreased, and the energies of incorrect answers are increased, particularly if they are lower than that of the correct answer.

Figure 4: The effect of training on the energy surface as a function of the answer Y in the continuous case. After training, the energy of the correct answer Y^i is lower than that of incorrect answers.

• Y^i: the correct answer.

• Y^{*i}: the answer produced by the model, i.e. the answer with the lowest energy.

• Ȳ^i: the most offending incorrect answer, i.e. the answer that has the lowest energy among all the incorrect answers. To define this answer in the continuous case, we can simply view all answers within a distance ǫ of Y^i as correct, and all answers beyond that distance as incorrect.

With a properly designed loss function, the learning process should have the effect of "pushing down" on E(W, Y^i, X^i), and "pulling up" on the incorrect energies, particularly on E(W, Ȳ^i, X^i). Different loss functions do this in different ways. Section 5 gives sufficient conditions that the loss function must satisfy in order to be guaranteed to shape the energy surface correctly. We show that some widely used loss functions do not satisfy the conditions, while others do.

To summarize: given a training set S, building and training an energy-based model involves designing four components:

1. The architecture: the internal structure of E(W, Y, X).

2. The inference algorithm: the method for finding a value of Y that minimizes E(W, Y, X) for any given X.

3. The loss function: L(W, S) measures the quality of an energy function using the training set.

4. The learning algorithm: the method for finding a W that minimizes the loss functional over the family of energy functions ℰ, given the training set.

Properly designing the architecture and the loss function is critical. Any prior knowledge we may have about the task at hand is embedded into the architecture and into the loss function (particularly the regularizer). Unfortunately, not all combinations of architectures and loss functions are allowed. With some combinations, minimizing the loss will not make the model produce the best answers. Choosing the combinations of architecture and loss functions that can learn effectively and efficiently is critical to the energy-based approach, and thus is a central theme of this tutorial.
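The four components listed above can be wired together in a few lines. In the sketch below, every concrete choice is an illustrative assumption rather than something the text prescribes: a toy linear energy as the architecture, exhaustive search as inference, and a perceptron-style push-down/pull-up update standing in for the loss and learning algorithm:

```python
labels = ["pos", "neg"]

def energy(W, Y, X):
    return -W[Y] * X              # 1. architecture: toy linear energy

def infer(W, X):
    return min(labels, key=lambda Y: energy(W, Y, X))   # 2. inference

def update(W, X, Y, lr=0.1):
    """3.-4. loss + learning step: push down the correct answer's energy,
    pull up the energy of the model's current (possibly wrong) answer."""
    Y_star = infer(W, X)
    if Y_star != Y:
        W = dict(W)
        W[Y] += lr * X            # lowers E(W, Y, X)
        W[Y_star] -= lr * X       # raises E(W, Y_star, X)
    return W

W = {"pos": 0.0, "neg": 0.1}
for _ in range(20):
    for X, Y in [(1.0, "pos"), (2.0, "pos")]:
        W = update(W, X, Y)
# after training, the correct label has the lowest energy
```

The point of the skeleton is the separation of concerns: any of the four components can be swapped independently, which is exactly the design freedom the energy-based framework advertises.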
2.2 Examples of Loss Functions

We now describe a number of standard loss functions that have been proposed and used in the machine learning literature. We shall discuss them and classify them as "good" or "bad" in an energy-based setting. For the time being, we set aside the regularization term, and concentrate on the data-dependent part of the loss function.

2.2.1 Energy Loss

The simplest and most straightforward of all the loss functions is the energy loss. For a training sample (X^i, Y^i), the per-sample loss is defined simply as:

    L_energy(Y^i, E(W, 𝒴, X^i)) = E(W, Y^i, X^i).   (6)

Figure 5: The hinge loss (left) and log loss (center) penalize E(W, Y^i, X^i) − E(W, Ȳ^i, X^i) linearly and logarithmically, respectively. The square-square loss (right) separately penalizes large values of E(W, Y^i, X^i) (solid line) and small values of E(W, Ȳ^i, X^i) (dashed line) quadratically.

This loss function, although very popular for things like regression and neural network training, cannot be used to train most architectures: while this loss will push down on the energy of the desired answer, it will not pull up on any other energy. With some architectures, this can lead to a collapsed solution in which the energy is constant and equal to zero. The energy loss will only work with architectures that are designed in such a way that pushing down on E(W, Y^i, X^i) will automatically make the energies of the other answers larger. A simple example of such an architecture is E(W, Y^i, X^i) = ||Y^i − G(W, X^i)||², which corresponds to regression with mean-squared error, with G being the regression function.

2.2.2 Generalized Perceptron Loss

The generalized perceptron loss for a training sample (X^i, Y^i) is defined as

    L_perceptron(Y^i, E(W, 𝒴, X^i)) = E(W, Y^i, X^i) − min_{Y∈𝒴} E(W, Y, X^i).   (7)

This loss is always positive, since the second term is a lower bound on the first term. Minimizing this loss has the effect of pushing down on E(W, Y^i, X^i), while pulling up on the energy of the answer produced by the model.
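As a minimal sketch of Equations 6 and 7 (the table-valued energy `E` is a toy stand-in that ignores W and X):

```python
def energy_loss(energy, W, X, Y):
    """Equation 6: the per-sample loss is just the energy of the desired answer."""
    return energy(W, Y, X)

def perceptron_loss(energy, W, X, Y, answers):
    """Equation 7: energy of the desired answer minus the lowest energy overall."""
    return energy(W, Y, X) - min(energy(W, y, X) for y in answers)

# Toy table energy, ignoring W and X for brevity.
E = lambda W, Y, X: {"a": 0.5, "b": 0.2}[Y]
# perceptron_loss is 0.3 when "a" is desired, and 0 when the model already prefers "b"
```

Note how `perceptron_loss` is zero exactly when the desired answer already has the lowest energy, which is why nothing forces an energy *gap* to appear.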
While the perceptron loss has been widely used in many settings, including for models with structured outputs such as handwriting recognition [LeCun et al., 1998a] and parts of speech tagging [Collins, 2002], it has a major deficiency: there is no mechanism for creating an energy gap between the correct answer and the incorrect ones. Hence, as with the energy loss, the perceptron loss may produce flat (or almost flat) energy surfaces if the architecture allows it. Consequently, a meaningful, uncollapsed result is only guaranteed with this loss if a model is used that cannot produce a flat energy surface. For other models, one cannot guarantee anything.

2.2.3 Generalized Margin Losses

Several loss functions can be described as margin losses; the hinge loss, log loss, LVQ2 loss, minimum classification error loss, square-square loss, and square-exponential loss all use some form of margin to create an energy gap between the correct answer and the incorrect answers.

Before discussing the generalized margin loss, we give the following definitions.

Definition 1. Let Y be a discrete variable. Then for a training sample (X^i, Y^i), the most offending incorrect answer Ȳ^i is the answer that has the lowest energy among all answers that are incorrect:

    Ȳ^i = argmin_{Y∈𝒴, Y≠Y^i} E(W, Y, X^i).   (8)

If Y is a continuous variable, then the most offending incorrect answer can be defined in a number of ways. The simplest definition is as follows.

Definition 2. Let Y be a continuous variable. Then for a training sample (X^i, Y^i), the most offending incorrect answer Ȳ^i is the answer that has the lowest energy among all answers that are at least ǫ away from the correct answer:

    Ȳ^i = argmin_{Y∈𝒴, ‖Y−Y^i‖>ǫ} E(W, Y, X^i).   (9)

The generalized margin loss is a more robust version of the generalized perceptron loss. It directly uses the energy of the most offending incorrect answer in the contrastive term:

    L_margin(W, Y^i, X^i) = Q_m( E(W, Y^i, X^i), E(W, Ȳ^i, X^i) ).   (10)

Here m is a positive parameter called the margin, and Q_m(e1, e2) is a convex function whose gradient has a positive dot product with the vector [1, −1] in the region where E(W, Y^i, X^i) + m > E(W, Ȳ^i, X^i). In other words, the loss surface is slanted toward low values of E(W, Y^i, X^i) and high values of E(W, Ȳ^i, X^i) wherever E(W, Ȳ^i, X^i) is not smaller than E(W, Y^i, X^i) by at least m.
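Definition 1 (Equation 8) translates directly into code for a finite 𝒴; the toy energy table is illustrative:

```python
def most_offending_incorrect(energy, W, X, Y_correct, answers):
    """Equation 8: the lowest-energy answer among all incorrect ones."""
    return min((y for y in answers if y != Y_correct),
               key=lambda y: energy(W, y, X))

E = lambda W, Y, X: {"a": 0.1, "b": 0.2, "c": 0.5}[Y]
# with "a" correct, the most offending incorrect answer is "b"
```

For the continuous case of Definition 2, the filter `y != Y_correct` would simply be replaced by a distance test `norm(y - Y_correct) > eps`, together with a suitable continuous minimizer.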
Two special cases of the generalized margin loss are given below.

Hinge Loss: A particularly popular example of a generalized margin loss is the hinge loss, which is used in combination with linearly parameterized energies and a quadratic regularizer in support vector machines, support vector Markov models [Altun and Hofmann, 2003], and maximum-margin Markov networks [Taskar et al., 2003]:

    L_hinge(W, Y^i, X^i) = max( 0, m + E(W, Y^i, X^i) − E(W, Ȳ^i, X^i) ),   (11)

where m is the positive margin. The shape of this loss function is given in Figure 5. The difference between the energies of the correct answer and the most offending incorrect answer is penalized linearly when larger than −m. The hinge loss only depends on energy differences, hence individual energies are not constrained to take any particular value.

Log Loss: a common variation of the hinge loss is the log loss, which can be seen as a "soft" version of the hinge loss with an infinite margin (see Figure 5, center):

    L_log(W, Y^i, X^i) = log( 1 + e^{E(W,Y^i,X^i) − E(W,Ȳ^i,X^i)} ).   (12)
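Given the two precomputed energies, Equations 11 and 12 are one-liners; the following is a sketch assuming scalar energy values:

```python
import math

def hinge_loss(e_correct, e_incorrect, m):
    """Equation 11: linear penalty whenever the energy gap is below the margin m."""
    return max(0.0, m + e_correct - e_incorrect)

def log_loss(e_correct, e_incorrect):
    """Equation 12: a 'soft' hinge with an infinite margin."""
    return math.log(1.0 + math.exp(e_correct - e_incorrect))

# hinge_loss(0.2, 1.5, 1.0) == 0.0 (the gap of 1.3 exceeds the margin)
```

Both losses depend only on the energy *difference*, which is exactly the property noted in the text for the hinge loss.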

LVQ2 Loss: One of the very first proposals for discriminatively training sequence labeling systems (particularly speech recognition systems) is a version of Kohonen's LVQ2 loss. This loss has been advocated by Driancourt and Bottou since the early 90's [Driancourt et al., 1991a, Driancourt and Gallinari, 1992a, Driancourt and Gallinari, 1992b, Driancourt, 1994, McDermott, 1997, McDermott and Katagiri, 1992]:

    L_lvq2(W, Y^i, X^i) = min( 1, max( 0, ( E(W, Y^i, X^i) − E(W, Ȳ^i, X^i) ) / ( δ E(W, Ȳ^i, X^i) ) ) ),   (13)

where δ is a positive parameter. LVQ2 is a zero-margin loss, but it has the peculiarity of saturating the ratio between E(W, Y^i, X^i) and E(W, Ȳ^i, X^i) to 1 + δ. This mitigates the effect of outliers by making them contribute a nominal cost to the total loss. This loss function is a continuous approximation of the number of classification errors. Unlike generalized margin losses, the LVQ2 loss is non-convex in E(W, Y^i, X^i) and E(W, Ȳ^i, X^i).

MCE Loss: The Minimum Classification Error loss was originally proposed by Juang et al. in the context of discriminative training for speech recognition systems [Juang et al., 1997]. The motivation was to build a loss function that also approximately counts the number of classification errors, while being smooth and differentiable. The number of classification errors can be written as:

    θ( E(W, Y^i, X^i) − E(W, Ȳ^i, X^i) ),   (14)

where θ is the step function (equal to zero for negative arguments, and 1 for positive arguments). However, this function is not differentiable, and therefore very difficult to optimize. The MCE loss "softens" it with a sigmoid:

    L_mce(W, Y^i, X^i) = σ( E(W, Y^i, X^i) − E(W, Ȳ^i, X^i) ),   (15)

where σ is the logistic function σ(x) = (1 + e^{−x})^{−1}. As with the LVQ2 loss, the saturation ensures that mistakes contribute a nominal cost to the overall loss. Although the MCE loss does not have an explicit margin, it does create a gap between E(W, Y^i, X^i) and E(W, Ȳ^i, X^i). The MCE loss is non-convex.
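Equations 13 and 15 can be sketched as below, again assuming precomputed scalar energies; the saturation behavior of both losses is visible in the toy values:

```python
import math

def lvq2_loss(e_correct, e_incorrect, delta):
    """Equation 13: energy difference scaled by delta * E(W, Ybar, X),
    clipped to [0, 1] so that outliers contribute only a nominal cost."""
    return min(1.0, max(0.0, (e_correct - e_incorrect) / (delta * e_incorrect)))

def mce_loss(e_correct, e_incorrect):
    """Equation 15: logistic sigmoid of the energy difference,
    a smooth approximation of the 0/1 error count of Equation 14."""
    return 1.0 / (1.0 + math.exp(-(e_correct - e_incorrect)))
```

Both functions saturate for large energy differences, which is the outlier-robustness property the text emphasizes, and both are non-convex in the two energies.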
and E(W, Ȳ^i, X^i). The MCE loss is non-convex.

Square-Square Loss: Unlike the hinge loss, the square-square loss treats the energy of the correct answer and the most offending answer separately [LeCun and Huang, 2005, Hadsell et al., 2006]:

    L_sq-sq(W, Y^i, X^i) = E(W, Y^i, X^i)^2 + ( max(0, m - E(W, Ȳ^i, X^i)) )^2.   (16)

Large values of E(W, Y^i, X^i) and small values of E(W, Ȳ^i, X^i) below the margin m are both penalized quadratically (see Figure 5). Unlike the margin loss, the square-square loss "pins down" the correct answer energy at zero and "pins down" the incorrect answer energies above m. Therefore, it is only suitable for energy functions that are bounded below by zero, notably in architectures whose output module measures some sort of distance.

Square-Exponential Loss [LeCun and Huang, 2005, Chopra et al., 2005, Osadchy et al., 2005]: The square-exponential loss is similar to the square-square loss. It only differs in the contrastive term: instead of a quadratic term it has the exponential of the negative energy of the most offending incorrect answer:

    L_sq-exp(W, Y^i, X^i) = E(W, Y^i, X^i)^2 + γ e^{-E(W, Ȳ^i, X^i)},   (17)

where γ is a positive constant. Unlike the square-square loss, this loss has an infinite margin and pushes the energy of the incorrect answers to infinity with exponentially decreasing force.

2.2.4 Negative Log-Likelihood Loss

The motivation for the negative log-likelihood loss comes from probabilistic modeling. It is defined as:

    L_nll(W, Y^i, X^i) = E(W, Y^i, X^i) + F_β(W, Y, X^i),   (18)

where F_β is the free energy of the ensemble {E(W, y, X^i), y ∈ Y}:

    F_β(W, Y, X^i) = (1/β) log( ∫_{y∈Y} exp(-β E(W, y, X^i)) ),   (19)

where β is a positive constant akin to an inverse temperature. This loss can only be used if the exponential of the negative energy is integrable over Y, which may not be the case for some choices of energy function or Y.

The form of the negative log-likelihood loss stems from a probabilistic formulation of the learning problem in terms of the maximum conditional probability principle. Given the training set S, we must find the value of the parameter that maximizes the conditional probability of all the answers given all the inputs in the training set. Assuming that the samples are independent, and denoting by P(Y^i | X^i, W) the conditional probability of Y^i given X^i that is produced by our model with parameter W, the conditional probability of the training set under the model is a simple product over samples:

    P(Y^1, ..., Y^P | X^1, ..., X^P, W) = ∏_{i=1}^P P(Y^i | X^i, W).   (20)

Applying the maximum likelihood estimation principle, we seek the value of W that maximizes the above product, or the one that minimizes the negative log of the above product:

    -log ∏_{i=1}^P P(Y^i | X^i, W) = ∑_{i=1}^P -log P(Y^i | X^i, W).   (21)

Using the Gibbs distribution (Equation 2), we get:

    -log ∏_{i=1}^P P(Y^i | X^i, W) = ∑_{i=1}^P β E(W, Y^i, X^i) + log ∫_{y∈Y} e^{-β E(W, y, X^i)}.   (22)

The final form of the negative log-likelihood loss is obtained by dividing the above expression by P and β (which has no effect on the position of the minimum):

    L_nll(W, S) = (1/P) ∑_{i=1}^P ( E(W, Y^i, X^i) + (1/β) log ∫_{y∈Y} e^{-β E(W, y, X^i)} ).   (23)
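For a finite answer set, the integral in Eq. 23 becomes a sum, and the per-sample loss can be computed with a numerically stable log-sum-exp. The sketch below is illustrative only; the helper names and toy energies are ours, not the paper's:

```python
import numpy as np

def free_energy(energies, beta=1.0):
    # F_beta = (1/beta) * log sum_y exp(-beta * E_y)  (Eq. 19, finite Y),
    # computed stably by shifting all energies by their minimum.
    e = np.asarray(energies, dtype=float)
    m = e.min()
    return np.log(np.exp(-beta * (e - m)).sum()) / beta - m

def nll_loss(energies, correct, beta=1.0):
    # Per-sample NLL loss (Eqs. 18 and 23, without the 1/P average):
    # energy of the correct answer plus the log-partition term.
    return energies[correct] + free_energy(energies, beta)
```

With two possible answers this reduces to the log loss of Eq. 12, and letting beta grow recovers the zero-temperature behavior discussed below.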

While many of the previous loss functions involved only E(W, Ȳ^i, X^i) in their contrastive term, the negative log-likelihood loss combines all the energies for all values of Y in its contrastive term F_β(W, Y, X^i). This term can be interpreted as the Helmholtz free energy (log partition function) of the ensemble of systems with energies E(W, Y, X^i), Y ∈ Y. This contrastive term causes the energies of all the answers to be pulled up. The energy of the correct answer is also pulled up, but not as hard as it is pushed down by the first term. This can be seen in the expression of the gradient for a single sample:

    ∂L_nll(W, Y^i, X^i)/∂W = ∂E(W, Y^i, X^i)/∂W - ∫_{Y∈Y} ∂E(W, Y, X^i)/∂W P(Y | X^i, W),   (24)

where P(Y | X^i, W) is obtained through the Gibbs distribution:

    P(Y | X^i, W) = e^{-β E(W, Y, X^i)} / ∫_{y∈Y} e^{-β E(W, y, X^i)}.   (25)

Hence, the contrastive term pulls up on the energy of each answer with a force proportional to the likelihood of that answer under the model. Unfortunately, there are many interesting models for which computing the integral over Y is intractable. Evaluating this integral is a major topic of research. Considerable efforts have been devoted to approximation methods, including clever organization of the calculations, Monte-Carlo sampling methods, and variational methods. While these methods have been devised as approximate ways of minimizing the NLL loss, they can be viewed in the energy-based framework as different strategies for choosing the Y's whose energies will be pulled up.

Interestingly, the NLL loss reduces to the generalized perceptron loss when β → ∞ (zero temperature), and reduces to the log loss (Eq. 12) when Y has two elements (e.g. binary classification).

The NLL loss has been used extensively by many authors under various names. In the neural network classification literature, it is known as the cross-entropy loss [Solla et al., 1988].
It was also used by Bengio et al. to train an energy-based language model [Bengio et al., 2003]. It has been widely used under the name maximum mutual information estimation for discriminatively training speech recognition systems since the late 80's, including hidden Markov models with mixtures of Gaussians [Bahl et al., 1986], and HMM-neural net hybrids [Bengio et al., 1990, Bengio et al., 1992, Haffner, 1993, Bengio, 1996]. It has also been used extensively for global discriminative training of handwriting recognition systems that integrate neural nets and hidden Markov models under the names maximum mutual information [Bengio et al., 1993, LeCun and Bengio, 1994, Bengio et al., 1995, LeCun et al., 1997, Bottou et al., 1997] and discriminative forward training [LeCun et al., 1998a]. Finally, it is the loss function of choice for training other probabilistic discriminative sequence labeling models such as input/output HMM [Bengio and Frasconi, 1996], conditional random fields [Lafferty et al., 2001], and discriminative random fields [Kumar and Hebert, 2004].

Minimum Empirical Error Loss: Some authors have argued that the negative log-likelihood loss puts too much emphasis on mistakes: Eq. 20 is a product whose value

is dominated by its smallest term. Hence, Ljolje et al. [Ljolje et al., 1990] proposed the minimum empirical error loss, which combines the conditional probabilities of the samples additively instead of multiplicatively:

    L_mee(W, Y^i, X^i) = 1 - P(Y^i | X^i, W).   (26)

Substituting Equation 2 we get:

    L_mee(W, Y^i, X^i) = 1 - e^{-β E(W, Y^i, X^i)} / ∫_{y∈Y} e^{-β E(W, y, X^i)}.   (27)

As with the MCE loss and the LVQ2 loss, the MEE loss saturates the contribution of any single error. This makes the system more robust to label noise and outliers, which is of particular importance in applications such as speech recognition, but it makes the loss non-convex. As with the NLL loss, MEE requires evaluating the partition function.

3 Simple Architectures

To substantiate the ideas presented thus far, this section demonstrates how simple models of classification and regression can be formulated as energy-based models. This sets the stage for the discussion of good and bad loss functions, as well as for the discussion of advanced architectures for structured prediction.

Figure 6: Simple learning models viewed as EBMs: (a) a regressor: the energy is the discrepancy between the output of the regression function G_W(X) and the answer Y. The best inference is simply Y* = G_W(X); (b) a simple two-class classifier: the set of possible answers is {-1, +1}. The best inference is Y* = sign(G_W(X)); (c) a multiclass classifier: the discriminant function produces one value for each of the three categories. The answer, which can take three values, controls the position of a "switch", which connects one output of the discriminant function to the energy function. The best inference is the index of the smallest output component of G_W(X).
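The contrastive losses of the previous section all operate on the same pair of numbers: the energy of the correct answer and the energy of the most offending incorrect answer. A minimal sketch (the helper names are ours; `ec` and `ei` denote those two energies):

```python
import math

def hinge(ec, ei, m=1.0):            # Eq. 11
    return max(0.0, m + ec - ei)

def log_loss(ec, ei):                # Eq. 12: "soft" hinge, infinite margin
    return math.log(1.0 + math.exp(ec - ei))

def mce(ec, ei):                     # Eq. 15: sigmoid of the energy gap
    return 1.0 / (1.0 + math.exp(-(ec - ei)))

def square_square(ec, ei, m=1.0):    # Eq. 16: assumes energies >= 0
    return ec**2 + max(0.0, m - ei)**2

def square_exp(ec, ei, gamma=1.0):   # Eq. 17
    return ec**2 + gamma * math.exp(-ei)
```

All of them are small when the correct answer has low energy and the most offending incorrect answer has high energy, which is exactly the push-down/pull-up behavior the text describes.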

3.1 Regression

Figure 6(a) shows a simple architecture for regression or function approximation. The energy function is the squared error between the output of a regression function G_W(X) and the variable to be predicted Y, which may be a scalar or a vector:

    E(W, Y, X) = (1/2) ||G_W(X) - Y||^2.   (28)

The inference problem is trivial: the value of Y that minimizes E is equal to G_W(X). The minimum energy is always equal to zero. When used with this architecture, the energy loss, perceptron loss, and negative log-likelihood loss are all equivalent because the contrastive term of the perceptron loss is zero, and that of the NLL loss is constant (it is a Gaussian integral with a constant variance):

    L_energy(W, S) = (1/P) ∑_{i=1}^P E(W, Y^i, X^i) = (1/(2P)) ∑_{i=1}^P ||G_W(X^i) - Y^i||^2.   (29)

This corresponds to standard regression with mean-squared error.

A popular form of regression occurs when G is a linear function of the parameters:

    G_W(X) = ∑_{k=1}^N w_k φ_k(X) = W^T Φ(X).   (30)

The φ_k(X) are a set of N features, and the w_k are the components of an N-dimensional parameter vector W. For concision, we use the vector notation W^T Φ(X), where W^T denotes the transpose of W, and Φ(X) denotes the vector formed by each φ_k(X). With this linear parameterization, training with the energy loss reduces to an easily solvable least-squares minimization problem, which is convex:

    W* = argmin_W [ (1/(2P)) ∑_{i=1}^P ||W^T Φ(X^i) - Y^i||^2 ].   (31)

In simple models, the feature functions are hand-crafted by the designer, or separately trained from unlabeled data. In the dual form of kernel methods, they are defined as φ_k(X) = K(X, X^k), k = 1 ... P, where K is the kernel function. In more complex models such as multilayer neural networks and others, the φ's may themselves be parameterized and subject to learning, in which case the regression function is no longer a linear function of the parameters and hence the loss function may not be convex in the parameters.

3.2 Two-Class Classifier

Figure 6(b) shows a simple two-class classifier architecture. The variable to be predicted is binary: Y ∈ {-1, +1}. The energy function can be defined as:

    E(W, Y, X) = -Y G_W(X),   (32)

where G_W(X) is a scalar-valued discriminant function parameterized by W. Inference is trivial:

    Y* = argmin_{Y∈{-1,1}} -Y G_W(X) = sign(G_W(X)).   (33)

Learning can be done using a number of different loss functions, which include the perceptron loss, hinge loss, and negative log-likelihood loss. Substituting Equations 32 and 33 into the perceptron loss (Eq. 7), we get:

    L_perceptron(W, S) = (1/P) ∑_{i=1}^P ( sign(G_W(X^i)) - Y^i ) G_W(X^i).   (34)

The stochastic gradient descent update rule to minimize this loss is:

    W ← W + η ( Y^i - sign(G_W(X^i)) ) ∂G_W(X^i)/∂W,   (35)

where η is a positive step size. If we choose G_W(X) in the family of linear models, the energy function becomes E(W, Y, X) = -Y W^T Φ(X) and the perceptron loss becomes:

    L_perceptron(W, S) = (1/P) ∑_{i=1}^P ( sign(W^T Φ(X^i)) - Y^i ) W^T Φ(X^i),   (36)

and the stochastic gradient descent update rule becomes the familiar perceptron learning rule: W ← W + η ( Y^i - sign(W^T Φ(X^i)) ) Φ(X^i).

The hinge loss (Eq. 11) with the two-class classifier energy (Eq. 32) yields:

    L_hinge(W, S) = (1/P) ∑_{i=1}^P max(0, m - 2 Y^i G_W(X^i)).   (37)

Using this loss with G_W(X) = W^T Φ(X) and a regularizer of the form ||W||^2 gives the familiar linear support vector machine.

The negative log-likelihood loss (Eq. 23) with Equation 32 yields:

    L_nll(W, S) = (1/P) ∑_{i=1}^P [ -Y^i G_W(X^i) + log( e^{Y^i G_W(X^i)} + e^{-Y^i G_W(X^i)} ) ].   (38)

Using the fact that Y ∈ {-1, +1}, we obtain:

    L_nll(W, S) = (1/P) ∑_{i=1}^P log( 1 + e^{-2 Y^i G_W(X^i)} ),   (39)

which is equivalent to the log loss (Eq. 12). Using a linear model as described above, the loss function becomes:

    L_nll(W, S) = (1/P) ∑_{i=1}^P log( 1 + e^{-2 Y^i W^T Φ(X^i)} ).   (40)

This particular combination of architecture and loss is the familiar logistic regression method.
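The perceptron learning rule below Eq. 36 is easy to run on toy data. A sketch with made-up, linearly separable features (the data and function names are ours):

```python
import numpy as np

def perceptron_train(phi, y, eta=0.1, epochs=10):
    # Stochastic updates W <- W + eta * (Y^i - sign(W . Phi(X^i))) * Phi(X^i),
    # i.e. the perceptron rule below Eq. 36; labels are in {-1, +1}.
    w = np.zeros(phi.shape[1])
    for _ in range(epochs):
        for x, t in zip(phi, y):
            w += eta * (t - np.sign(w @ x)) * x
    return w

Phi = np.array([[1.0, 2.0], [2.0, 1.0], [-1.0, -2.0], [-2.0, -1.0]])
Y = np.array([1.0, 1.0, -1.0, -1.0])
w = perceptron_train(Phi, Y)
```

On separable data such as this, the rule stops updating once every sample satisfies sign(W^T Φ(X^i)) = Y^i.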

3.3 Multiclass Classifier

Figure 6(c) shows an example of architecture for multiclass classification for 3 classes. A discriminant function G_W(X) produces an output vector [g_1, g_2, ..., g_C], with one component for each of the C categories. Each component g_j can be interpreted as a "penalty" for assigning X to the j-th category. A discrete switch module selects which of the components is connected to the output energy. The position of the switch is controlled by the discrete variable Y ∈ {1, 2, ..., C}, which is interpreted as the category. The output energy is equal to E(W, Y, X) = ∑_{j=1}^C δ(Y - j) g_j, where δ(u) is the Kronecker delta function: δ(u) = 1 for u = 0; δ(u) = 0 otherwise. Inference consists in setting Y to the index of the smallest component of G_W(X).

The perceptron loss, hinge loss, and negative log-likelihood loss can be directly translated to the multiclass case.

3.4 Implicit Regression

Figure 7: The implicit regression architecture. X and Y are passed through two functions G_{W1} and G_{W2}. This architecture allows multiple values of Y to have low energies for a given X.

The architectures described in the previous section are simple functions of Y with a single minimum within the set Y. However, there are tasks for which multiple answers are equally good. Examples include robot navigation, where turning left or right may get around an obstacle equally well, or a language model in which the sentence segment "the cat ate the" can be followed equally well by "mouse" or "bird".

More generally, the dependency between X and Y sometimes cannot be expressed as a function that maps X to Y (e.g., consider the constraint X^2 + Y^2 = 1). In this case, which we call implicit regression, we model the constraint that X and Y must satisfy and design the energy function such that it measures the violation of the constraint. Both X and Y can be passed through functions, and the energy is a function of their outputs. A simple example is:

    E(W, Y, X) = (1/2) ||G_X(W_X, X) - G_Y(W_Y, Y)||^2.   (41)
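The X^2 + Y^2 = 1 example can be made concrete: an energy that scores the violation of the constraint has, for most inputs X, two distinct low-energy answers. A small illustrative sketch (the quadratic penalty form is our choice, in the spirit of Eq. 41):

```python
import numpy as np

def circle_energy(x, y):
    # Zero whenever (x, y) lies on the unit circle, positive elsewhere.
    return 0.5 * (x**2 + y**2 - 1.0)**2

# For X = 0.6 there are two equally good answers, Y = +0.8 and Y = -0.8.
ys = np.linspace(-2.0, 2.0, 4001)
e = circle_energy(0.6, ys)
minima = ys[np.isclose(e, e.min(), atol=1e-9)]
```

No function from X to Y can represent this dependency, but the energy surface captures it directly: both answers receive (near-)zero energy.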

For some problems, the function G_X must be different from the function G_Y. In other cases, G_X and G_Y must be instances of the same function G. An interesting example is the Siamese architecture [Bromley et al., 1993]: variables X_1 and X_2 are passed through two instances of a function G_W. A binary label Y determines the constraint on G_W(X_1) and G_W(X_2): if Y = 0, G_W(X_1) and G_W(X_2) should be equal, and if Y = 1, G_W(X_1) and G_W(X_2) should be different. In this way, the regression on X_1 and X_2 is implicitly learned through the constraint Y rather than explicitly learned through supervision. Siamese architectures are used to learn similarity metrics with labeled examples. When two input samples X_1 and X_2 are known to be similar (e.g. two pictures of the same person), Y = 0; when they are different, Y = 1.

Siamese architectures were originally designed for signature verification [Bromley et al., 1993]. More recently they have been used with the square-exponential loss (Eq. 17) to learn a similarity metric with application to face recognition [Chopra et al., 2005]. They have also been used with the square-square loss (Eq. 16) for unsupervised learning of manifolds [Hadsell et al., 2006].

In other applications, a single non-linear function combines X and Y. An example of such an architecture is the trainable language model of Bengio et al. [Bengio et al., 2003]. Under this model, the input X is a sequence of several successive words in a text, and the answer Y is the next word in the text. Since many different words can follow a particular word sequence, the architecture must allow multiple values of Y to have low energy. The authors used a multilayer neural net as the function G(W, X, Y), and chose to train it with the negative log-likelihood loss.
Because of the high cardinality of Y (equal to the size of the English dictionary), they had to use approximations (importance sampling) and had to train the system on a cluster machine.

The current section often referred to architectures in which the energy was linear or quadratic in W, and the loss function was convex in W, but it is important to keep in mind that much of the discussion applies equally well to more complex architectures, as we will see later.

4 Latent Variable Architectures

Energy minimization is a convenient way to represent the general process of reasoning and inference. In the usual scenario, the energy is minimized with respect to the variables to be predicted Y, given the observed variables X. During training, the correct value of Y is given for each training sample. However there are numerous applications where it is convenient to use energy functions that depend on a set of hidden variables Z whose correct value is never (or rarely) given to us, even during training. For example, we could imagine training the face detection system depicted in Figure 2(b) with data for which the scale and pose information of the faces is not available. For these architectures, the inference process for a given set of variables X and Y involves minimizing over these unseen variables Z:

    E(Y, X) = min_{Z∈Z} E(Z, Y, X).   (42)

Such hidden variables are called latent variables, by analogy with a similar concept in probabilistic modeling. The fact that the evaluation of E(Y, X) involves a minimization over Z does not significantly impact the approach described so far, but the use of latent variables is so ubiquitous that it deserves special treatment.

In particular, some insight can be gained by viewing the inference process in the presence of latent variables as a simultaneous minimization over Y and Z:

    Y* = argmin_{Y∈Y, Z∈Z} E(Z, Y, X).   (43)

Latent variables can be viewed as intermediate results on the way to finding the best output Y. At this point, one could argue that there is no conceptual difference between the Z and Y variables: Z could simply be folded into Y. The distinction arises during training: we are given the correct value of Y for a number of training samples, but we are never given the correct value of Z.

Latent variables are very useful in situations where a hidden characteristic of the process being modeled can be inferred from observations, but cannot be predicted directly. One such example is in recognition problems. For example, in face recognition the gender of a person or the orientation of the face could be a latent variable. Knowing these values would make the recognition task much easier. Likewise in invariant object recognition the pose parameters of the object (location, orientation, scale) or the illumination could be latent variables. They play a crucial role in problems where segmentation of the sequential data must be performed simultaneously with the recognition task. A good example is speech recognition, in which the segmentation of sentences into words and words into phonemes must take place simultaneously with recognition, yet the correct segmentation into phonemes is rarely available during training. Similarly, in handwriting recognition, the segmentation of words into characters should take place simultaneously with the recognition.
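With finite sets, Eqs. 42 and 43 are just minimizations over a table of energies. A toy sketch (the energy values are invented for illustration):

```python
import numpy as np

# Toy energy table E[z, y] for one fixed input X, over a finite latent
# set Z = {0, 1, 2} and answer set Y = {0, 1, 2}.
E = np.array([[3.0, 1.0, 4.0],
              [2.0, 0.5, 3.5],
              [5.0, 2.0, 0.8]])

# Eq. 42: eliminate the latent variable by minimizing over it.
E_y = E.min(axis=0)            # E(Y, X) = min_z E(z, Y, X)

# Eq. 43: equivalently, minimize jointly over (Z, Y).
z_star, y_star = np.unravel_index(E.argmin(), E.shape)
```

Minimizing E_y over Y and minimizing the full table jointly over (Z, Y) pick out the same answer, which is the equivalence the two equations express.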
The use of latent variables in face recognition is discussed in this section, and Section 7.3 describes a latent variable architecture for handwriting recognition.

4.1 An Example of Latent Variable Architecture

To illustrate the concept of latent variables, we consider the task of face detection, beginning with the simple problem of determining whether a face is present or not in a small image. Imagine that we are provided with a face detecting function G_face(X) which takes a small image window as input and produces a scalar output. It outputs a small value when a human face fills the input image, and a large value if no face is present (or if only a piece of a face or a tiny face is present). An energy-based face detector built around this function is shown in Figure 8(a). The variable Y controls the position of a binary switch (1 = "face", 0 = "non-face"). The output energy is equal to G_face(X) when Y = 1, and to a fixed threshold value T when Y = 0:

    E(Y, X) = Y G_face(X) + (1 - Y) T.

The value of Y that minimizes this energy function is 1 (face) if G_face(X) < T and 0 (non-face) otherwise.

Let us now consider the more complex task of detecting and locating a single face in a large image. We can apply our G_face(X) function to multiple windows in the large image, compute which window produces the lowest value of G_face(X), and detect a

Figure 8: (a): Architecture of an energy-based face detector. Given an image, it outputs a small value when the image is filled with a human face, and a high value equal to the threshold T when there is no face in the image. (b): Architecture of an energy-based face detector that simultaneously locates and detects a face in an input image by using the location of the face as a latent variable.

face at that location if the value is lower than T. This process is implemented by the energy-based architecture shown in Figure 8(b). The latent "location" variable Z selects which of the K copies of the G_face function is routed to the output energy. The energy function can be written as

    E(Z, Y, X) = Y [ ∑_{k=1}^K δ(Z - k) G_face(X_k) ] + (1 - Y) T,   (44)

where the X_k's are the image windows. Locating the best-scoring location in the image consists in minimizing the energy with respect to Y and Z. The resulting value of Y will indicate whether a face was found, and the resulting value of Z will indicate the location.

4.2 Probabilistic Latent Variables

When the best value of the latent variable for a given X and Y is ambiguous, one may consider combining the contributions of the various possible values by marginalizing over the latent variables instead of minimizing with respect to those variables.

When latent variables are present, the joint conditional distribution over Y and Z

given by the Gibbs distribution is:

    P(Z, Y | X) = e^{-β E(Z, Y, X)} / ∫_{y∈Y, z∈Z} e^{-β E(z, y, X)}.   (45)

Marginalizing over Z gives:

    P(Y | X) = ∫_{z∈Z} e^{-β E(z, Y, X)} / ∫_{y∈Y, z∈Z} e^{-β E(z, y, X)}.   (46)

Finding the best Y after marginalizing over Z reduces to:

    Y* = argmin_{Y∈Y} -(1/β) log ∫_{z∈Z} e^{-β E(z, Y, X)}.   (47)

This is actually a conventional energy-based inference in which the energy function has merely been redefined from E(Z, Y, X) to F(Z) = -(1/β) log ∫_{z∈Z} e^{-β E(z, Y, X)}, which is the free energy of the ensemble {E(z, Y, X), z ∈ Z}. The above inference formula by marginalization reduces to the previous inference formula by minimization when β → ∞ (zero temperature).

5 Analysis of Loss Functions for Energy-Based Models

This section discusses the conditions that a loss function must satisfy so that its minimization will result in a model that produces the correct answers. To give an intuition of the problem, we first describe simple experiments in which certain combinations of architectures and loss functions are used to learn a simple dataset, with varying results. A more formal treatment follows in Section 5.2.

5.1 "Good" and "Bad" Loss Functions

Consider the problem of learning a function that computes the square of a number: Y = f(X), where f(X) = X^2. Though this is a trivial problem for a learning machine, it is useful for demonstrating the issues involved in the design of an energy function and loss function that work together. For the following experiments, we use a training set of 200 samples (X^i, Y^i) where Y^i = (X^i)^2, randomly sampled with a uniform distribution between -1 and +1.

First, we use the architecture shown in Figure 9(a). The input X is passed through a parametric function G_W, which produces a scalar output. The output is compared with the desired answer using the absolute value of the difference (L1 norm):

    E(W, Y, X) = ||G_W(X) - Y||_1.   (48)

Any reasonable parameterized family of functions could be used for G_W. For these experiments, we chose a two-layer neural network with 1 input unit, 20 hidden units (with sigmoids) and 1 output unit. Figure 10(a) shows the initial shape of the energy

function in the space of the variables X and Y, using a set of random initial parameters W. The dark spheres mark the location of a few training samples.

Figure 9: Simple learning machines for the experiments. (a): A simple architecture that can be trained with the energy loss. (b): An implicit regression architecture where X and Y are passed through functions G_{W1} and G_{W2} respectively. Training this architecture with the energy loss causes a collapse (a flat energy surface). A loss function with a contrastive term corrects the problem.

First, the simple architecture is trained with the energy loss (Eq. 6):

    L_energy(W, S) = (1/P) ∑_{i=1}^P E(W, Y^i, X^i) = (1/P) ∑_{i=1}^P ||G_W(X^i) - Y^i||_1.   (49)

This corresponds to a classical form of robust regression. The learning process can be viewed as pulling down on the energy surface at the location of the training samples (the spheres in Figure 10), without considering the rest of the points on the energy surface. The energy surface as a function of Y for any X has the shape of a V with fixed slopes. By changing the function G_W(X), the apex of that V can move around for different X^i. The loss is minimized by placing the apex of the V at the position Y = X^2 for any value of X, and this has the effect of making the energies of all other answers larger, because the V has a single minimum. Figure 10 shows the shape of the energy surface at fixed intervals during training with simple stochastic gradient descent. The energy surface takes the proper shape after a few iterations through the training set. Using more sophisticated loss functions such as the NLL loss or the perceptron loss would produce exactly the same result as the energy loss because, with this simple architecture, their contrastive term is constant.
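The first experiment can be reproduced in miniature by replacing the small neural net with a linear-in-features model (our simplification): stochastic subgradient descent on the L1 energy of Eqs. 48-49.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.uniform(-1.0, 1.0, 200)
Y = X**2                                   # the target function of the text

Phi = np.stack([np.ones_like(X), X, X**2], axis=1)  # hand-crafted features

# Energy loss (Eq. 49) with E = |W . Phi(X) - Y|; the subgradient of |e|
# with respect to W is sign(e) * Phi(X).
W = np.zeros(3)
eta = 0.01
for _ in range(200):                       # epochs of stochastic updates
    for phi, y in zip(Phi, Y):
        W -= eta * np.sign(W @ phi - y) * phi
```

After training, W should end up near (0, 0, 1): the apex of the V-shaped energy sits close to Y = X^2 for every X, exactly as in Figure 10.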
Consider a slightly more complicated architecture, shown in Figure 9(b), to learn the same dataset. In this architecture X is passed through function G_{W1} and Y is passed through function G_{W2}. For the experiment, both functions were two-layer neural networks with 1 input unit, 10 hidden units and 10 output units. The energy is

Figure 10: The shape of the energy surface at four intervals while training the system in Figure 9(a) with stochastic gradient descent to minimize the energy loss. The X axis is the input, and the Y axis the output. The energy surface is shown (a) at the start of training, (b) after 10 epochs through the training set, (c) after 25 epochs, and (d) after 39 epochs. The energy surface has attained the desired shape where the energy around training samples (dark spheres) is low and the energy at all other points is high.

the L1 norm of the difference between their 10-dimensional outputs:

    E(W, X, Y) = ||G_{W1}(X) - G_{W2}(Y)||_1,   (50)

where W = [W1 W2]. Training this architecture with the energy loss results in a collapse of the energy surface. Figure 11 shows the shape of the energy surface during training; the energy surface becomes essentially flat. What has happened? The shape of the energy as a function of Y for a given X is no longer fixed. With the energy loss, there is no mechanism to prevent G_{W1} and G_{W2} from ignoring their inputs and producing identical output values. This results in the collapsed solution: the energy surface is flat and equal to zero everywhere.

Figure 11: The shape of the energy surface at four intervals while training the system in Figure 9(b) using the energy loss. Along the X axis is the input variable and along the Y axis is the answer. The shape of the surface (a) at the start of the training, (b) after 3 epochs through the training set, (c) after 6 epochs, and (d) after 9 epochs. Clearly the energy is collapsing to a flat surface.

Now consider the same architecture, but trained with the square-square loss:

    L(W, Y^i, X^i) = E(W, Y^i, X^i)^2 + ( max(0, m - E(W, Ȳ^i, X^i)) )^2.   (51)

Here m is a positive margin, and Ȳ^i is the most offending incorrect answer.
The second term in the loss explicitly prevents the collapse of the energy by pushing up on points

whose energy threatens to go below that of the desired answer. Figure 12 shows the shape of the energy function during training; the surface successfully attains the desired shape.

Figure 12: The shape of the energy surface at four intervals while training the system in Figure 9(b) using the square-square loss. Along the x-axis is the variable X and along the y-axis is the variable Y. The shape of the surface at (a) the start of the training, (b) after 15 epochs over the training set, (c) after 25 epochs, and (d) after 34 epochs. The energy surface has attained the desired shape: the energies around the training samples are low and energies at all other points are high.

Figure 13: The shape of the energy surface at four intervals while training the system in Figure 9(b) using the negative log-likelihood loss. Along the X axis is the input variable and along the Y axis is the answer. The shape of the surface at (a) the start of the training, (b) after 3 epochs over the training set, (c) after 6 epochs, and (d) after 11 epochs. The energy surface has quickly attained the desired shape.

Another loss function that works well with this architecture is the negative log-likelihood loss:

    L_nll(W, Y^i, X^i) = E(W, Y^i, X^i) + (1/β) log( ∫_{y∈Y} e^{-β E(W, y, X^i)} ).   (52)

The first term pulls down on the energy of the desired answer, while the second term pushes up on all answers, particularly those that have the lowest energy. Note that the energy corresponding to the desired answer also appears in the second term. The shape of the energy function at various intervals using the negative log-likelihood loss is shown in Figure 13. The learning is much faster than the square-square loss. The minimum is deeper because, unlike with the square-square loss, the energies of the incorrect answers are pushed up to infinity (although with a decreasing force). However,

each iteration of negative log-likelihood loss involves considerably more work because pushing up every incorrect answer is computationally expensive when no analytical expression for the derivative of the second term exists. In this experiment, a simple sampling method was used: the integral is approximated by a sum of 20 points regularly spaced between -1 and +1 in the Y direction. Each learning iteration thus requires computing the gradient of the energy at 20 locations, versus 2 locations in the case of the square-square loss. However, the cost of locating the most offending incorrect answer must be taken into account for the square-square loss.

An important aspect of the NLL loss is that it is invariant to global shifts of energy values, and only depends on differences between the energies of the Y's for a given X. Hence, the desired answer may have different energies for different X's, and may not be zero. This has an important consequence: the quality of an answer cannot be measured by the energy of that answer without considering the energies of all other answers.

In this section we have seen the results of training four combinations of architectures and loss functions. In the first case we used a simple architecture along with a simple energy loss, which was satisfactory. The constraints in the architecture of the system automatically lead to the increase in energy of undesired answers while decreasing the energies of the desired answers. In the second case, a more complicated architecture was used with the simple energy loss and the machine collapsed for lack of a contrastive term in the loss. In the third and the fourth case the same architecture was used as in the second case but with loss functions containing explicit contrastive terms. In these cases the machine performed as expected and did not collapse.
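The collapse can also be checked numerically without any training at all: with two linear stand-ins for G_{W1} and G_{W2} (a deliberate simplification of the networks used above), the constant solution has zero energy loss everywhere, while the square-square loss of Eq. 51 charges it the full contrastive cost m^2 on every sample.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.uniform(-1.0, 1.0, 100)
Y = X**2

def energy(p, x, y):
    # E = |g1(x) - g2(y)| with tiny linear models g1, g2 standing in for
    # the two networks of Figure 9(b).
    a, b, c, d = p
    return abs((a * x + b) - (c * y + d))

# Collapsed parameters: both functions ignore their inputs and output the
# same constant, so the energy surface is identically zero.
collapsed = (0.0, 1.0, 0.0, 1.0)
energy_loss = np.mean([energy(collapsed, x, y) for x, y in zip(X, Y)])

# Square-square loss at the same parameters: every incorrect answer (here
# y = 2.0) also has zero energy, so each sample pays the full cost m^2.
m = 0.5
contrastive = np.mean([max(0.0, m - energy(collapsed, x, 2.0))**2 for x in X])
```

The energy loss cannot distinguish the collapsed solution from a good one, whereas the contrastive term makes the collapse expensive, which is the qualitative content of Figures 11 and 12.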
5.2 Sufficient Conditions for Good Loss Functions

In the previous section we offered some intuitions about which loss functions are good and which ones are bad with the help of illustrative experiments. In this section a more formal treatment of the topic is given. First, a set of sufficient conditions is stated. The energy function and the loss function must satisfy these conditions in order to be guaranteed to work in an energy-based setting. Then we discuss the quality of the loss functions introduced previously from the point of view of these conditions.

5.3 Conditions on the Energy

Generally in energy-based learning, the inference method chooses the answer with minimum energy. Thus the condition for correct inference on a sample (X^i, Y^i) is as follows.

Condition 1 For sample (X^i, Y^i), the machine will give the correct answer for X^i if

    E(W, Y^i, X^i) < E(W, Y, X^i),  ∀Y ∈ Y and Y ≠ Y^i.    (53)

In other words, the inference algorithm will give the correct answer if the energy of the desired answer Y^i is less than the energies of all the other answers Y.
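Condition 1 can be stated operationally in a few lines. This is a minimal sketch (not from the paper), with the finite answer set represented as a hypothetical dictionary mapping each answer to its energy:

```python
# Energy-based inference picks the answer with minimum energy; Condition 1
# holds when the desired answer has strictly the lowest energy of all answers.

def infer(energies):
    """Return argmin_Y E(W, Y, X) over a dict answer -> energy."""
    return min(energies, key=energies.get)

def satisfies_condition_1(energies, y_desired):
    e_star = energies[y_desired]
    return all(e_star < e for y, e in energies.items() if y != y_desired)
```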

To ensure that the correct answer is robustly stable, we may choose to impose that its energy be lower than the energies of incorrect answers by a positive margin m. If Ȳ^i denotes the most offending incorrect answer, then the condition for the answer to be correct by a margin m is as follows.

Condition 2 For a variable Y and sample (X^i, Y^i) and positive margin m, the inference algorithm will give the correct answer for X^i if

    E(W, Y^i, X^i) < E(W, Ȳ^i, X^i) − m.    (54)

5.4 Sufficient Conditions on the Loss Functional

If the system is to produce the correct answers, the loss functional should be designed in such a way that minimizing it will cause E(W, Y^i, X^i) to be lower than E(W, Ȳ^i, X^i) by some margin m. Since only the relative values of those two energies matter, we only need to consider the shape of a slice of the loss functional in the 2D space of those two energies. For example, in the case where Y is the set of integers from 1 to k, the loss functional can be written as:

    L(W, Y^i, X^i) = L(Y^i, E(W, 1, X^i), . . . , E(W, k, X^i)).    (55)

The projection of this loss in the space of E(W, Y^i, X^i) and E(W, Ȳ^i, X^i) can be viewed as a function Q parameterized by the other k − 2 energies:

    L(W, Y^i, X^i) = Q_{[E_y]}(E(W, Y^i, X^i), E(W, Ȳ^i, X^i)),    (56)

where the parameter [E_y] contains the vector of energies for all values of Y except Y^i and Ȳ^i.

We assume the existence of at least one set of parameters W for which condition 2 is satisfied for a single training sample (X^i, Y^i). Clearly, if such a W does not exist, there cannot exist any loss function whose minimization would lead to condition 2. For the purpose of notational simplicity let us denote the energy E(W, Y^i, X^i) associated with the training sample (X^i, Y^i) by E_C (as in "correct energy") and E(W, Ȳ^i, X^i) by E_I (as in "incorrect energy").
Consider the plane formed by E_C and E_I. As an illustration, Figure 17(a) shows a 3-dimensional plot of the square-square loss function, in which the abscissa is E_C and the ordinate is E_I. The third axis gives the value of the loss for the corresponding values of E_C and E_I. In general, the loss function is a family of 2D surfaces in this 3D space, where each surface corresponds to one particular configuration of all the energies except E_C and E_I. The solid red line in the figure corresponds to the points in the 2D plane for which E_C = E_I. The dashed blue line corresponds to the margin line E_C + m = E_I. Let the two half planes E_C + m < E_I and E_C + m ≥ E_I be denoted by HP_1 and HP_2 respectively.

Let R be the feasible region, defined as the set of values (E_C, E_I) corresponding to all possible values of W ∈ W. This region may be non-convex, discontinuous, open, or one-dimensional and could lie anywhere in the plane. It is shown shaded in

Figure 14: Figure showing the various regions in the plane of the two energies E_C and E_I. E_C are the (correct answer) energies associated with (X^i, Y^i), and E_I are the (incorrect answer) energies associated with (X^i, Ȳ^i).

Figure 14. As a consequence of our assumption that a solution exists which satisfies condition 2, R must intersect the half plane HP_1.

Let two points (e_1, e_2) and (e'_1, e'_2) belong to the feasible region R, such that (e_1, e_2) ∈ HP_1 (that is, e_1 + m < e_2) and (e'_1, e'_2) ∈ HP_2 (that is, e'_1 + m ≥ e'_2). We are now ready to present the sufficient conditions on the loss function.

Condition 3 Let (X^i, Y^i) be the i-th training example and m be a positive margin. Minimizing the loss function L will satisfy conditions 1 or 2 if there exists at least one point (e_1, e_2) with e_1 + m < e_2 such that for all points (e'_1, e'_2) with e'_1 + m ≥ e'_2, we have

    Q_{[E_y]}(e_1, e_2) < Q_{[E_y]}(e'_1, e'_2),    (57)

where Q_{[E_y]} is given by

    L(W, Y^i, X^i) = Q_{[E_y]}(E(W, Y^i, X^i), E(W, Ȳ^i, X^i)).    (58)

In other words, the surface of the loss function in the space of E_C and E_I should be such that there exists at least one point in the part of the feasible region R intersecting the half plane HP_1 such that the value of the loss function at this point is less than its value at all other points in the part of R intersecting the half plane HP_2.

Note that this is only a sufficient condition and not a necessary condition. There may be loss functions that do not satisfy this condition but whose minimization still satisfies condition 2.

Table 1: A list of loss functions, together with the margin which allows them to satisfy condition 3. A margin > 0 indicates that the loss satisfies the condition for any strictly positive margin, and "none" indicates that the loss does not satisfy the condition.

    Loss (equation #)     Formula                                                   Margin
    energy loss (6)       E(W, Y^i, X^i)                                            none
    perceptron (7)        E(W, Y^i, X^i) − min_{Y∈Y} E(W, Y, X^i)                   0
    hinge (11)            max(0, m + E(W, Y^i, X^i) − E(W, Ȳ^i, X^i))               m
    log (12)              log(1 + e^{E(W,Y^i,X^i) − E(W,Ȳ^i,X^i)})                  > 0
    LVQ2 (13)             min(M, max(0, E(W, Y^i, X^i) − E(W, Ȳ^i, X^i)))           0
    MCE (15)              (1 + e^{−(E(W,Y^i,X^i) − E(W,Ȳ^i,X^i))})^{−1}             > 0
    square-square (16)    E(W, Y^i, X^i)^2 + (max(0, m − E(W, Ȳ^i, X^i)))^2         m
    square-exp (17)       E(W, Y^i, X^i)^2 + β e^{−E(W, Ȳ^i, X^i)}                  > 0
    NLL/MMI (23)          E(W, Y^i, X^i) + (1/β) log ∫_{y∈Y} e^{−βE(W,y,X^i)}       > 0
    MEE (27)              1 − e^{−βE(W,Y^i,X^i)} / ∫_{y∈Y} e^{−βE(W,y,X^i)}         > 0

5.5 Which Loss Functions are Good or Bad

Table 1 lists several loss functions, together with the value of the margin with which they satisfy condition 3. The energy loss is marked "none" because it does not satisfy condition 3 for a general architecture. The perceptron loss and the LVQ2 loss satisfy it with a margin of zero. All others satisfy condition 3 with a strictly positive value of the margin.

5.5.1 Energy Loss

The energy loss is a bad loss function in general, but there are certain forms of energies for which it is a good loss function. For example, consider an energy function of the form

    E(W, Y, X^i) = ∑_{k=1}^{K} δ(Y − k) ||U^k − G_W(X^i)||^2.    (59)

This energy passes the output of the function G_W through K radial basis functions (one corresponding to each class) whose centers are the vectors U^k. If the centers U^k are fixed and distinct then the energy loss satisfies condition 3 and hence is a good loss function.
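For concreteness, a few of the per-sample losses of Table 1 can be written directly as functions of the two energies E_C and E_I (a hedged sketch; for these particular losses the remaining energies in [E_y] play no role):

```python
import math

# Per-sample losses from Table 1 as functions of the correct-answer energy e_c
# and the most offending incorrect energy e_i; `m` is the margin.

def hinge(e_c, e_i, m):
    return max(0.0, m + e_c - e_i)

def square_square(e_c, e_i, m):
    return e_c ** 2 + max(0.0, m - e_i) ** 2

def square_exp(e_c, e_i, beta=1.0):
    return e_c ** 2 + beta * math.exp(-e_i)

def log_loss(e_c, e_i):
    return math.log(1.0 + math.exp(e_c - e_i))
```

Plotting any of these over the (E_C, E_I) plane reproduces the surfaces discussed below in connection with Figure 17.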
To see this, consider the two-class classification case (the reasoning for K > 2 follows along the same lines). The architecture of the system is shown in Figure 15.

Figure 15: The architecture of a system where two RBF units with centers U^1 and U^2 are placed on top of the machine G_W, to produce distances d_1 and d_2, with d_i = ||G_W(X) − U^i||^2. The energy is E(W, Y, X) = ∑_{k=1}^{2} δ(Y − k) · ||G_W(X) − U^k||^2.

Figure 16: (a): When using the RBF architecture with fixed and distinct RBF centers, only the shaded region of the (E_C, E_I) plane is allowed. The non-shaded region is unattainable because the energies of the two outputs cannot be small at the same time. The minimum of the energy loss is at the intersection of the shaded region and vertical axis. (b): The 3-dimensional plot of the energy loss when using the RBF architecture with fixed and distinct centers. Lighter shades indicate higher loss values and darker shades indicate lower values.

Let d = ||U^1 − U^2||^2, d_1 = ||U^1 − G_W(X^i)||^2, and d_2 = ||U^2 − G_W(X^i)||^2. Since U^1 and U^2 are fixed and distinct, there is a strictly positive lower bound on d_1 + d_2 for all G_W. Being only a two-class problem, E_C and E_I correspond directly to the energies of the two classes. In the (E_C, E_I) plane no part of the loss function exists where E_C + E_I ≤ d. The region where the loss function is defined is shaded in Figure 16(a). The exact shape of the loss function is shown in Figure 16(b). One can see from the figure that as long as d ≥ m, the loss function satisfies condition 3. We conclude that this is a good loss function.

However, when the RBF centers U^1 and U^2 are not fixed and are allowed to be learned, then there is no guarantee that d_1 + d_2 ≥ d. Then the RBF centers could become equal and the energy could become zero for all inputs, resulting in a collapsed energy surface. Such a situation can be avoided by having a contrastive term in the loss function.

5.5.2 Generalized Perceptron Loss

The generalized perceptron loss has a margin of zero. Therefore, it could lead to a collapsed energy surface and is not generally suitable for training energy-based models. However, the absence of a margin is not always fatal [LeCun et al., 1998a, Collins, 2002]. First, the set of collapsed solutions is a small piece of the parameter space. Second, although nothing prevents the system from reaching the collapsed solutions, nothing drives the system toward them either. Thus the probability of hitting a collapsed solution is quite small.

5.5.3 Generalized Margin Loss
Figure 17: (a) The square-square loss in the space of energies E_C and E_I. The value of the loss monotonically decreases as we move from HP_2 into HP_1, indicating that it satisfies condition 3. (b) The square-exponential loss in the space of energies E_C and E_I. The value of the loss monotonically decreases as we move from HP_2 into HP_1, indicating that it satisfies condition 3.

We now consider the square-square and the square-exponential losses. For the two-class case, the shape of the surface of the losses in the space of E_C and E_I is shown in Figure 17. One can clearly see that there exists at least one point (e_1, e_2) in HP_1 such that

    Q_{[E_y]}(e_1, e_2) < Q_{[E_y]}(e'_1, e'_2),    (60)

for all points (e'_1, e'_2) in HP_2. These loss functions satisfy condition 3.

5.5.4 Negative Log-Likelihood Loss

It is not obvious that the negative log-likelihood loss satisfies condition 3. The proof follows.

Figure 18: Figure showing the direction of the gradient of the negative log-likelihood loss in the feasible region R in the space defined by the two energies E_C and E_I.

For any fixed parameter W and a sample (X^i, Y^i), consider the gradient of the loss with respect to the energy E_C of the correct answer Y^i and the energy E_I of the most offending incorrect answer Ȳ^i. We have

    g_C = ∂L(W, Y^i, X^i)/∂E_C = 1 − e^{−E(W, Y^i, X^i)} / ∑_{Y∈Y} e^{−E(W, Y, X^i)},    (61)

and

    g_I = ∂L(W, Y^i, X^i)/∂E_I = − e^{−E(W, Ȳ^i, X^i)} / ∑_{Y∈Y} e^{−E(W, Y, X^i)}.    (62)

Clearly, for any value of the energies, g_C > 0 and g_I < 0. The overall direction of the gradient at any point in the space of E_C and E_I is shown in Figure 18. One can conclude that when going from HP_2 to HP_1, the loss decreases monotonically.

Now we need to show that there exists at least one point in HP_1 at which the loss is less than at all the points in HP_2. Let A = (E_C^*, E_C^* + m) be a point on the margin

line for which the loss is minimum. E_C^* is the value of the correct energy at this point. That is,

    E_C^* = argmin_{E_C} { Q_{[E_y]}(E_C, E_C + m) }.    (63)

Since, from the above discussion, the negative of the gradient of the loss Q_{[E_y]} at all points (and in particular on the margin line) is in a direction which points inside HP_1, by monotonicity of the loss we can conclude that

    Q_{[E_y]}(E_C^*, E_C^* + m) ≤ Q_{[E_y]}(E_C, E_I),    (64)

where E_C + m > E_I.

Consider a point B at a distance ε away from the point (E_C^*, E_C^* + m), and inside HP_1 (see Figure 18). That is the point

    (E_C^* − ε, E_C^* + m + ε).    (65)

Using the first order Taylor's expansion on the value of the loss at this point, we get

    Q_{[E_y]}(E_C^* − ε, E_C^* + m + ε)
      = Q_{[E_y]}(E_C^*, E_C^* + m) − ε ∂Q_{[E_y]}/∂E_C + ε ∂Q_{[E_y]}/∂E_I + O(ε^2)
      = Q_{[E_y]}(E_C^*, E_C^* + m) + ε [∂Q_{[E_y]}/∂E_C, ∂Q_{[E_y]}/∂E_I] · [−1, 1]^T + O(ε^2).    (66)

From the previous discussion the second term on the right hand side is negative. So for sufficiently small ε we have

    Q_{[E_y]}(E_C^* − ε, E_C^* + m + ε) < Q_{[E_y]}(E_C^*, E_C^* + m).    (67)

Thus we conclude that there exists at least one point in HP_1 at which the loss is less than at all points in HP_2.

Note that the energy of the most offending incorrect answer E_I is bounded above by the value of the energy of the next most offending incorrect answer. Thus we only need to consider a finite range of E_I's, and the point B cannot be at infinity.

6 Efficient Inference: Non-Probabilistic Factor Graphs

This section addresses the important issue of efficient energy-based inference. Sequence labeling problems and other learning problems with structured outputs can often be modeled using energy functions whose structure can be exploited for efficient inference algorithms.

Learning and inference with EBMs involves a minimization of the energy over the set of answers Y and latent variables Z.
When the cardinality of Y × Z is large, this minimization can become intractable. One approach to the problem is to exploit the structure of the energy function in order to perform the minimization efficiently. One case where the structure can be exploited occurs when the energy can be expressed as a

Figure 19: Top: A log domain factor graph. The energy is a sum of factors that take different subsets of variables as inputs. Bottom: Each possible configuration of Z and Y can be represented by a path in a trellis. Here Z_1, Z_2, and Y_1 are binary variables, while Y_2 is ternary.

sum of individual functions (called factors) that each depend on different subsets of the variables in Y and Z. These dependencies are best expressed in the form of a factor graph [Kschischang et al., 2001, MacKay, 2003]. Factor graphs are a general form of graphical models, or belief networks.

Graphical models are normally used to represent probability distributions over variables by directly encoding the dependency relationships between variables. At first glance, it is difficult to dissociate graphical models from probabilistic modeling (witness their original name: "Bayesian networks"). However, factor graphs can be studied outside the context of probabilistic modeling, and EBM learning applies to them.

A simple example of a factor graph is shown in Figure 19 (top). The energy function is the sum of four factors:

    E(Y, Z, X) = E_a(X, Z_1) + E_b(X, Z_1, Z_2) + E_c(Z_2, Y_1) + E_d(Y_1, Y_2),    (68)

where Y = [Y_1, Y_2] are the output variables and Z = [Z_1, Z_2] are the latent variables. Each factor can be seen as representing soft constraints between the values of its input variables. The inference problem consists in finding:

    (Ȳ, Z̄) = argmin_{y∈Y, z∈Z} ( E_a(X, z_1) + E_b(X, z_1, z_2) + E_c(z_2, y_1) + E_d(y_1, y_2) ).    (69)

This factor graph represents a structured output problem, because the factor E_d encodes dependencies between Y_1 and Y_2 (perhaps by forbidding certain combinations of values).

Let's assume that Z_1, Z_2, and Y_1 are discrete binary variables, and Y_2 is a ternary variable. The cardinality of the domain of X is immaterial since X is always observed. The number of possible configurations of Z and Y given X is 2 × 2 × 2 × 3 = 24. A naive minimization algorithm through exhaustive search would evaluate the entire energy function 24 times (96 single factor evaluations). However, we notice that for a given X, the factor E_a
only has two possible input configurations: Z_1 = 0 and Z_1 = 1. Similarly, E_b and E_c only have 4 possible input configurations each, and E_d has 6. Hence, there is no need for more than 2 + 4 + 4 + 6 = 16 single factor evaluations. The set of possible configurations can be represented by a graph (a trellis) as shown in Figure 19 (bottom). The nodes in each column represent the possible values of a single variable. Each edge is weighted by the output energy of the factor for the corresponding values of its input variables. With this representation, a single path from the start node to the end node represents one possible configuration of all the variables. The sum of the weights along a path is equal to the total energy for the corresponding configuration. Hence, the inference problem can be reduced to searching for the shortest path in this graph. This can be performed using a dynamic programming method such as the Viterbi algorithm, or the A* algorithm. The cost is proportional to the number of edges (16), which is exponentially smaller than the number of paths in general. To compute E(Y, X) = min_{z∈Z} E(Y, z, X), we follow the same procedure, but we restrict the graph to the subset of arcs that are compatible with the prescribed value of Y.

The above procedure is sometimes called the min-sum algorithm, and it is the log domain version of the traditional max-product for graphical models. The procedure can easily be generalized to factor graphs where the factors take more than two variables
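The shortest-path search over the trellis can be sketched as a few lines of dynamic programming. This is a minimal illustration, not code from the paper: the trellis is encoded as a hypothetical list `edges`, where `edges[t]` maps an arc `(u, v)` between consecutive node columns to its factor energy:

```python
# Min-sum (Viterbi-style) search: relax each edge once, keeping for every node
# the energy of the best path reaching it, then trace back the minimizing path.

def min_sum(edges, start_states):
    alpha = {s: 0.0 for s in start_states}   # best path energy reaching state s
    back = []
    for layer in edges:
        new_alpha, ptr = {}, {}
        for (u, v), e in layer.items():
            if u in alpha:
                cand = alpha[u] + e
                if v not in new_alpha or cand < new_alpha[v]:
                    new_alpha[v], ptr[v] = cand, u
        back.append(ptr)
        alpha = new_alpha
    best = min(alpha, key=alpha.get)
    path = [best]
    for ptr in reversed(back):               # trace back the configuration
        path.append(ptr[path[-1]])
    path.reverse()
    return path, alpha[best]
```

The cost is one relaxation per edge, which matches the edge-proportional cost discussed above.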

as inputs, and to factor graphs that have a tree structure instead of a chain structure. However, it only applies to factor graphs that are bipartite trees (with no loops). When loops are present in the graph, the min-sum algorithm may give an approximate solution when iterated, or may not converge at all. In this case, a descent algorithm such as simulated annealing could be used.

As mentioned in Section 4, variables can be handled through minimization or through marginalization. The computation is identical to the one required for computing the contrastive term of the negative log-likelihood loss (the log partition function), hence we will make no distinctions. The contrastive term in the negative log-likelihood loss function is:

    −(1/β) log ∫_{Y∈Y, z∈Z} e^{−βE(Z, Y, X)},    (70)

or simply

    −(1/β) log ∫_{Y∈Y} e^{−βE(Y, X)},    (71)

when no latent variables are present.

At first, this seems intractable, but the computation can be factorized just like with the min-sum algorithm. The result is the so-called forward algorithm in the log domain. Values are propagated forward, starting at the start node on the left, and following the arrows in the trellis. Each node k computes a quantity α_k:

    α_k = −(1/β) log ∑_j e^{−β(E_{jk} + α_j)},    (72)

where E_{jk} is the energy attached to the edge linking node j to node k. The final α at the end node is the quantity in Eq. 70. The procedure reduces to the min-sum algorithm for large values of β.

In a more complex factor graph with factors that take more than two variables as input, or that have a tree structure, this procedure generalizes to a non-probabilistic form of belief propagation in the log domain. For loopy graphs, the procedure can be iterated, and may lead to an approximate value for Eq. 70, if it converges at all [Yedidia et al., 2005].

The above procedures are an essential component for constructing models with structures and/or sequential output.
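The forward recursion of Eq. 72 can be sketched under the same hypothetical trellis encoding (a list of dictionaries mapping arcs `(j, k)` to edge energies, one per column transition). This is an illustrative sketch, not the paper's code; as β grows, the result approaches the min-sum path energy:

```python
import math

# Log-domain forward algorithm: each node k accumulates
# alpha_k = -(1/beta) log sum_j exp(-beta (E_jk + alpha_j)),
# and the terminal alphas are combined into the free energy of Eq. 70.

def forward_free_energy(edges, start_states, beta):
    alpha = {s: 0.0 for s in start_states}
    for layer in edges:
        terms = {}
        for (j, k), e in layer.items():
            if j in alpha:
                terms.setdefault(k, []).append(-beta * (e + alpha[j]))
        alpha = {k: -math.log(sum(math.exp(t) for t in ts)) / beta
                 for k, ts in terms.items()}
    return -math.log(sum(math.exp(-beta * a) for a in alpha.values())) / beta
```

For large β the returned value is numerically close to the minimum path energy found by min-sum; for small β it is strictly below it, since all paths contribute to the sum.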
6.1 EBMs versus Internally Normalized Models

It is important to note that at no point in the above discussion did we need to manipulate normalized probability distributions. The only quantities that are manipulated are energies. This is in contrast with hidden Markov models and traditional Bayesian nets. In HMMs, the outgoing transition probabilities of a node must sum to 1, and the emission probabilities must be properly normalized. This ensures that the overall distribution over sequences is normalized. Similarly, in directed Bayesian nets, the rows of the conditional probability tables are normalized.

EBMs manipulate energies, so no normalization is necessary. When energies are transformed into probabilities, the normalization over Y occurs as the very last step

in the process. This idea of late normalization solves several problems associated with the internal normalization of HMMs and Bayesian nets. The first problem is the so-called label bias problem, first pointed out by Bottou [Bottou, 1991]: transitions leaving a given state compete with each other, but not with other transitions in the model. Hence, paths whose states have few outgoing transitions tend to have higher probability than paths whose states have many outgoing transitions. This seems like an artificial constraint. To circumvent this problem, a late normalization scheme was first proposed by Denker and Burges in the context of handwriting and speech recognition [Denker and Burges, 1995]. Another flavor of the label bias problem is the missing probability mass problem discussed by LeCun et al. in [LeCun et al., 1998a]. They also make use of a late normalization scheme to solve this problem. Normalized models distribute the probability mass among all the answers that are explicitly modeled by the system. To cope with "junk" or other unforeseen and un-modeled inputs, designers must often add a so-called background model that takes some probability mass away from the set of explicitly-modeled answers. This could be construed as a thinly disguised way of removing the normalization constraint. To put it another way, since every explicit normalization is another opportunity for mishandling unforeseen events, one should strive to minimize the number of explicit normalizations in a model. A recent demonstration of successful handling of the label bias problem through normalization removal is the comparison between maximum entropy Markov models by McCallum, Freitag and Pereira [McCallum et al., 2000], and conditional random fields by Lafferty, McCallum and Pereira [Lafferty et al., 2001].

The second problem is controlling the relative importance of probability distributions of different natures.
In HMMs, emission probabilities are often Gaussian mixtures in high dimensional spaces (typically 10 to 100), while transition probabilities are discrete probabilities over a few transitions. The dynamic range of the former is considerably larger than that of the latter. Hence transition probabilities count for almost nothing in the overall likelihood. Practitioners often raise the transition probabilities to some power in order to increase their influence. This trick is difficult to justify in a probabilistic framework because it breaks the normalization. In the energy-based framework, there is no need to make excuses for breaking the rules. Arbitrary coefficients can be applied to any subset of energies in the model. The normalization can always be performed at the end.

The third problem concerns discriminative learning. Discriminative training often uses iterative gradient-based methods to optimize the loss. It is often complicated, expensive, and inefficient to perform a normalization step after each parameter update by the gradient method. The EBM approach eliminates the problem [LeCun et al., 1998a]. More importantly, the very reason for internally normalizing HMMs and Bayesian nets is somewhat contradictory with the idea of training them discriminatively. The normalization is only necessary for generative models.

7 EBMs for Sequence Labeling and Structured Outputs

The problem of classifying or labeling sequences of symbols or sequences of vectors has long been a topic of great interest in several technical communities. The earliest and most notable example is speech recognition. Discriminative learning methods were proposed to train HMM-based speech recognition systems in the late 1980's [Bahl et al., 1986, Ljolje et al., 1990]. These methods for HMMs brought about a considerable improvement in the accuracy of speech recognition systems, and the topic remains an active area of research to this day.

With the appearance of multi-layer neural network training procedures, several groups proposed combining neural networks and time alignment methods for speech recognition. The time alignment was implemented either through elastic template matching (Dynamic Time Warping) with a set of reference words, or using a hidden Markov model. One of the main challenges was to design an integrated training method for simultaneously training the neural network and the time alignment module. In the early 1990's, several authors proposed such methods for combining neural nets and dynamic time warping [Driancourt et al., 1991a, Driancourt et al., 1991b, Driancourt and Gallinari, 1992b, Driancourt and Gallinari, 1992a, Driancourt, 1994], as well as for combining neural nets and HMMs [Bengio et al., 1990, Bourlard and Morgan, 1990, Bottou, 1991, Haffner et al., 1991, Haffner and Waibel, 1991, Bengio et al., 1992, Haffner and Waibel, 1992, Haffner, 1993, Driancourt, 1994, Morgan and Bourlard, 1995, Konig et al., 1996]. Extensive lists of references on the topic are available in [McDermott, 1997, Bengio, 1996]. Most approaches used one dimensional convolutional networks (time-delay neural networks) to build robustness to variations of pitch, voice timbre, and speed of speech. Earlier models combined discriminative
classifiers with time alignment, but without integrated sequence-level training [Sakoe et al., 1988, McDermott and Katagiri, 1992, Franzini et al., 1990].

Applying similar ideas to handwriting recognition proved more challenging, because the 2D nature of the signal made the segmentation problem considerably more complicated. This task required the integration of image segmentation heuristics in order to generate segmentation hypotheses. To classify the segments with robustness to geometric distortions, 2D convolutional nets were used [Bengio et al., 1993, LeCun and Bengio, 1994, Bengio et al., 1995]. A general formulation of integrated learning of segmentation and recognition with late normalization resulted in the graph transformer network architecture [LeCun et al., 1997, LeCun et al., 1998a].

Detailed descriptions of several sequence labeling models in the framework of energy-based models are presented in the next three sections.

7.1 Linear Structured Models: CRF, SVMM, and MMMN

Outside of the discriminative training tradition in speech and handwriting recognition, graphical models have traditionally been seen as probabilistic generative models, and trained as such. However, in recent years, a resurgence of interest for discriminative

Figure 20: A log domain factor graph for linear structured models, which include conditional random fields, support vector Markov models, and maximum margin Markov networks.

training has emerged, largely motivated by sequence labeling problems in natural language processing, notably conditional random fields [Lafferty et al., 2001], perceptron-like models [Collins, 2002], support vector Markov models [Altun et al., 2003], and maximum margin Markov networks [Taskar et al., 2003].

These models can be easily described in an EBM setting. The energy function in these models is assumed to be a linear function of the parameters W:

    E(W, Y, X) = W^T F(X, Y),    (73)

where F(X, Y) is a vector of feature functions that depend on X and Y. The answer Y is a sequence of l individual labels (Y_1, . . . , Y_l), often interpreted as a temporal sequence. The dependencies between individual labels in the sequence are captured by a factor graph, such as the one represented in Figure 20. Each factor is a linear function of the trainable parameters. It depends on the input X and on a pair of individual labels (Y_m, Y_n). In general, each factor could depend on more than two individual labels, but we will limit the discussion to pairwise factors to simplify the notation:

    E(W, Y, X) = ∑_{(m,n)∈F} W_{mn}^T f_{mn}(X, Y_m, Y_n).    (74)

Here F denotes the set of factors (the set of pairs of individual labels that have a direct inter-dependency), W_{mn} is the parameter vector for factor (m, n), and f_{mn}(X, Y_m, Y_n) is a (fixed) feature vector. The global parameter vector W is the concatenation of all the W_{mn}. It is sometimes assumed that all the factors encode the same kind of interaction between input and label pairs: the model is then called a homogeneous field. The

factors share the same parameter vector and features, and the energy can be simplified as:

    E(W, Y, X) = ∑_{(m,n)∈F} W^T f(X, Y_m, Y_n).    (75)

The linear parameterization of the energy ensures that the corresponding probability distribution over W is in the exponential family:

    P(W | Y, X) = e^{−W^T F(X,Y)} / ∫_{w'∈W} e^{−w'^T F(X,Y)}.    (76)

This model is called the linear structured model.

We now describe various versions of linear structured models that use different loss functions. Sections 7.2 and 7.3 will describe non-linear and hierarchical models.

7.1.1 Perceptron Loss

The simplest way to train the linear structured model is with the perceptron loss. LeCun et al. [LeCun et al., 1998a] proposed its use for general, non-linear energy functions in sequence labeling (particularly handwriting recognition), calling it discriminative Viterbi training. More recently, Collins [Collins, 2000, Collins, 2002] has advocated its use for linear structured models in the context of NLP:

    L_perceptron(W) = (1/P) ∑_{i=1}^{P} ( E(W, Y^i, X^i) − E(W, Y^{*i}, X^i) ),    (77)

where Y^{*i} = argmin_{y∈Y} E(W, y, X^i) is the answer produced by the system. The linear property gives a particularly simple expression for the loss:

    L_perceptron(W) = (1/P) ∑_{i=1}^{P} W^T ( F(X^i, Y^i) − F(X^i, Y^{*i}) ).    (78)

Optimizing this loss with stochastic gradient descent leads to a simple form of the perceptron learning rule:

    W ← W − η ( F(X^i, Y^i) − F(X^i, Y^{*i}) ).    (79)

As stated before, the main problem with the perceptron loss is the absence of margin, although this problem is not fatal when the energy is a linear function of the parameters, as in Collins' model. The lack of a margin, which theoretically may lead to stability problems, was overlooked in [LeCun et al., 1998a].
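One step of the resulting perceptron rule (Eq. 79) can be sketched as follows. This is an illustrative sketch, not the authors' code: `feature` is an assumed feature map F(X, Y) returning a NumPy vector, and `answers` an assumed finite answer set over which inference is done by exhaustive enumeration (a trellis search would replace this in a real sequence model):

```python
import numpy as np

# Structured perceptron step: infer y* = argmin_y W.F(x, y), then move W along
# -(F(x, y_true) - F(x, y*)); no change occurs when y* equals y_true.

def perceptron_step(W, feature, x, y_true, answers, eta=0.1):
    energies = {y: W @ feature(x, y) for y in answers}
    y_star = min(energies, key=energies.get)     # inference: lowest energy
    return W - eta * (feature(x, y_true) - feature(x, y_star))
```

Each update lowers the energy of the desired answer and raises that of the current minimizer, exactly the push-pull behavior of Eq. 79.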
7.1.2 Margin Loss: Max-Margin Markov Networks

The main idea behind margin-based Markov networks [Altun et al., 2003, Altun and Hofmann, 2003, Taskar et al., 2003] is to use a margin loss to train the linearly parameterized factor graph of Figure 20, with the energy function of Equation 73. The loss function is the simple hinge loss with an L2 regularizer:

L_hinge(W) = (1/P) \sum_{i=1}^P max(0, m + E(W, Y^i, X^i) - E(W, \bar{Y}^i, X^i)) + \gamma ||W||^2.    (80)

Because the energy is linear in W, the loss becomes particularly simple:

L_hinge(W) = (1/P) \sum_{i=1}^P max(0, m + W^T \Delta F^i) + \gamma ||W||^2,    (81)

where \Delta F^i = F(X^i, Y^i) - F(X^i, \bar{Y}^i). This loss function can be optimized with a variety of techniques. The simplest method is stochastic gradient descent. However, the hinge loss and linear parameterization allow for the use of a dual formulation, as in the case of conventional support vector machines. The question of which optimization method is most suitable is not settled. As with neural net training, it is not clear whether second order methods bring a significant speed improvement over well tuned stochastic gradient methods. To our knowledge, no systematic experimental study of this issue has been published.

Altun, Johnson, and Hofmann [Altun et al., 2003] have studied several versions of this model that use other loss functions, such as the exponential margin loss proposed by Collins [Collins, 2000]:

L_exp(W) = (1/P) \sum_{i=1}^P exp( E(W, Y^i, X^i) - E(W, \bar{Y}^i, X^i) ) + \gamma ||W||^2.    (82)

This loss function tends to push the energies E(W, Y^i, X^i) and E(W, \bar{Y}^i, X^i) as far apart as possible, an effect which is moderated only by regularization.

7.1.3 Negative Log-Likelihood Loss: Conditional Random Fields (CRF)

Conditional random fields [Lafferty et al., 2001] use the negative log-likelihood loss function to train a linear structured model:

L_nll(W) = (1/P) \sum_{i=1}^P E(W, Y^i, X^i) + (1/\beta) log \sum_{y \in Y} e^{-\beta E(W, y, X^i)}.    (83)

The linear form of the energy (Eq. 75) gives the following expression:

L_nll(W) = (1/P) \sum_{i=1}^P W^T F(X^i, Y^i) + (1/\beta) log \sum_{y \in Y} e^{-\beta W^T F(X^i, y)}.    (84)

Following Equation 24, the derivative of this loss with respect to W is:

\partial L_nll(W) / \partial W = (1/P) \sum_{i=1}^P F(X^i, Y^i) - \sum_{y \in Y} F(X^i, y) P(y | X^i, W).    (85)
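For a label space small enough to enumerate, the gradient of Eq. 85 can be sketched directly. The toy feature map is hypothetical, and the brute-force sum over Y stands in for the efficient inference algorithms of Section 6.

```python
import itertools
import numpy as np

def F(X, Y):
    # hypothetical feature map for a sequence of binary labels
    return np.array([sum(Y), Y[0] * X[0], Y[-1] * X[-1]], float)

def nll_gradient(W, X, Y_true, beta=1.0):
    labels = list(itertools.product([0, 1], repeat=len(Y_true)))
    energies = np.array([W @ F(X, y) for y in labels])   # E = W^T F (Eq. 73)
    p = np.exp(-beta * energies)
    p /= p.sum()                                          # P(y | X, W)
    expected_F = sum(pi * F(X, y) for pi, y in zip(p, labels))
    return F(X, Y_true) - expected_F                      # one term of Eq. 85

W = np.zeros(3)
X = (1.0, 2.0)
g = nll_gradient(W, X, (1, 1))
print(g)   # with W = 0 the model is uniform over the 4 sequences: [1.0, 0.5, 1.0]
```

A gradient step W \leftarrow W - \eta g then lowers the energy of the desired sequence relative to the model's expectation over all sequences.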

Here P(y | X^i, W) is given by:

P(y | X^i, W) = e^{-\beta W^T F(X^i, y)} / \sum_{y' \in Y} e^{-\beta W^T F(X^i, y')}.    (86)

The problem with this loss function is the need to sum over all possible label combinations, as there are an exponentially large number of such combinations (2^l for a sequence of l binary labels). However, one of the efficient inference algorithms mentioned in Section 6 can be used.

One of the alleged advantages of CRFs is that the loss function is convex with respect to W. However, the convexity of the loss function, while mathematically satisfying, does not seem to be a significant practical advantage. Although the original optimization algorithm for CRFs was based on iterative scaling, recent work indicates that stochastic gradient methods may be more efficient [Vishwanathan et al., 2006].

7.2 Non-Linear Graph Based EBMs

The discriminative learning methods for graphical models developed in the speech and handwriting communities in the 90's allowed for non-linear parameterizations of the factors, mainly mixtures of Gaussians and multi-layer neural nets. Non-linear factors allow the modeling of highly complex dependencies between inputs and labels (such as mapping the pixels of a handwritten word to the corresponding character labels). One particularly important aspect is the use of architectures that are invariant (or robust) to irrelevant transformations of the inputs, such as time dilation or pitch variation in speech, and geometric variations in handwriting. This is best handled by hierarchical, multi-layer architectures that can learn low level features and higher level representations in an integrated fashion.
Most authors have used one dimensional convolutional nets (time-delay neural networks) for speech and pen-based handwriting [Bengio et al., 1990, Bottou, 1991, Haffner et al., 1991, Haffner and Waibel, 1991, Driancourt et al., 1991a, Driancourt et al., 1991b, Driancourt and Gallinari, 1992b, Bengio et al., 1992, Haffner and Waibel, 1992, Driancourt and Gallinari, 1992a, Haffner, 1993, Driancourt, 1994, Bengio, 1996], and 2D convolutional nets for image-based handwriting [Bengio et al., 1993, LeCun and Bengio, 1994, Bengio et al., 1995, LeCun et al., 1997, LeCun et al., 1998a].

To some observers, the recent interest in the linear structured model looks like somewhat of a throw-back to the past, and a regression on the complexity scale. One apparent advantage of linearly parameterized energies is that they make the perceptron loss, hinge loss, and NLL loss convex. It is often argued that convex loss functions are inherently better because they allow the use of efficient optimization algorithms with guaranteed convergence to the global minimum. However, several authors have recently argued that convex loss functions are no guarantee of good performance, and that non-convex losses may in fact be easier to optimize than convex ones in practice, even in the absence of theoretical guarantees [Huang and LeCun, 2006, Collobert et al., 2006].

Furthermore, it has been argued that convex loss functions can be efficiently optimized using sophisticated second-order optimization methods. However, it is a well-known but often overlooked fact that a carefully tuned stochastic gradient descent method is often considerably faster in practice than even the most sophisticated second-order optimization methods (which appear better on paper). This is because stochastic gradients can take advantage of the redundancy between the samples by updating the parameters on the basis of a single sample, whereas "batch" optimization methods waste considerable resources to compute exact descent directions, often nullifying the theoretical speed advantage [Becker and LeCun, 1989, LeCun et al., 1998a, LeCun et al., 1998b, Bottou, 2004, Bottou and LeCun, 2004, Vishwanathan et al., 2006].

Figure 21: The architecture of a speech recognition system using latent variables. An acoustic signal is passed through a time-delay neural network (TDNN) to produce a high level feature vector. The feature vector is then compared to the word templates. Dynamic time warping (DTW) aligns the feature vector with the word templates so as to reduce the sensitivity of the matching to variations in pronunciation.

Figure 21 shows an example of a speech recognition system that integrates a time-delay neural network (TDNN) and word matching using dynamic time warping (DTW). The raw speech signal is first transformed into a sequence of acoustic vectors (typically 10 to 50 spectral or cepstral coefficients, every 10ms). The acoustic vector sequence is fed to a TDNN that transforms it into a sequence of high level features. Temporal subsampling in the TDNN can be used to reduce the temporal resolution of the feature vectors [Bottou, 1991]. The sequence of feature vectors is then compared to word templates. In order to reduce the sensitivity of the matching to variations in speed of pronunciation, dynamic time warping aligns the feature sequence with the template sequences.
Intuitively, DTW consists in finding the best "elastic" warping that maps a sequence of vectors (or symbols) to another. The solution can be found efficiently with dynamic programming (e.g. the Viterbi algorithm or the A* algorithm).

DTW can be reduced to a search for the shortest path in a directed acyclic graph in which the cost of each node is the mismatch between two items in the two input sequences. Hence, the overall system can be seen as a latent variable EBM in which Y is the set of words in the lexicon, and Z represents the set of templates for each word, and the set of paths for each alignment graph. The earliest proposal for integrated training of neural nets and time alignment is by Driancourt and Bottou [Driancourt et al., 1991a], who proposed using the LVQ2 loss (Eq. 13) to train this system. It is a simple matter to back-propagate gradients through the DTW module and further back-propagate gradients into the TDNN in order to update the weights. Similarly, gradients can be back-propagated to the word templates in order to update them as well. Excellent results were obtained for isolated word recognition, despite the zero margin of the LVQ2 loss. A similar scheme was later used by McDermott [McDermott, 1997].

A slightly more general method consists in combining neural networks (e.g. TDNN) with hidden Markov models instead of DTW. Several authors proposed integrated training procedures for such combinations during the 90's. The first proposals were by Bengio et al. [Bengio et al., 1991, Bengio et al., 1992, Bengio, 1996], who used the NLL/MMI loss optimized with stochastic gradient descent, and Bottou [Bottou, 1991], who proposed various loss functions. A similar method was subsequently proposed by Haffner et al. in his multi-state TDNN model [Haffner and Waibel, 1992, Haffner, 1993]. Similar training methods were devised for handwriting recognition. Bengio and LeCun described a neural net/HMM hybrid with global training using the NLL/MMI loss optimized with stochastic gradient descent [Bengio et al., 1993, LeCun and Bengio, 1994]. Shortly thereafter, Konig et al. proposed the REMAP method, which applies the expectation maximization algorithm to the HMM in order to produce target outputs for a neural net based acoustic model [Konig et al., 1996].
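The shortest-path view of DTW described above can be written as a small dynamic program. The Euclidean local mismatch and the three allowed moves below are conventional choices shown only for illustration, not the specific cost used in the cited systems.

```python
import numpy as np

def dtw(seq_a, seq_b):
    """Minimum-cost alignment of two vector sequences by dynamic programming."""
    n, m = len(seq_a), len(seq_b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = np.linalg.norm(seq_a[i - 1] - seq_b[j - 1])  # local mismatch
            # allowed moves: diagonal match, or stretching either sequence
            D[i, j] = cost + min(D[i - 1, j - 1], D[i - 1, j], D[i, j - 1])
    return D[n, m]

a = np.array([[0.0], [1.0], [2.0]])   # toy 1-D "feature" sequence
b = np.array([[0.0], [2.0]])          # toy template
print(dtw(a, b))                      # 1.0: only the middle frame pays a cost
```

Each cell D[i, j] is the cost of the cheapest path through the alignment DAG ending at node (i, j), which is exactly the shortest-path formulation in the text.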
The basic architecture of neural net/HMM hybrid systems is similar to the one in Figure 21, except that the word (or language) models are probabilistic finite-state machines instead of sequences. The emission probabilities at each node are generally simple Gaussians operating on the output vector sequences produced by the neural net. The only challenge is to compute the gradient of the loss with respect to the neural net outputs by backpropagating gradients through the HMM trellis. Since the procedure is very similar to the one used in graph transformer networks, we refer to the next section for a description.

It should be noted that many authors had previously proposed methods that combined a separately trained discriminative classifier and an alignment method for speech and handwriting, but they did not use integrated training methods.

7.3 Hierarchical Graph-Based EBMs: Graph Transformer Networks

Sections 7.2 and 7.1 discussed models in which inference and learning involved marginalizing or minimizing over all configurations of variables of a dynamic factor graph. These operations are performed efficiently by building a trellis in which each path corresponds to a particular configuration of those variables. Section 7.2 concentrated on models where the factors are non-linear functions of the parameters, while Section 7.1 focused on simpler models where the factors are linearly parameterized.

Figure 22: The architecture of a graph transformer network for handwritten word recognition. (a) The segmentation graph Gr_seg is generated from the input image, (b) the hierarchical multi-modular architecture takes a set of graphs and outputs another set of graphs.

The present section discusses a class of models called graph transformer networks (GTN) [LeCun et al., 1998a]. GTNs are designed for situations where the sequential structure is so complicated that the corresponding dynamical factor graph cannot be explicitly represented, but must be represented procedurally. For example, the factor

graph that must be built on-the-fly in order to recognize an entire handwritten sentence in English is extremely large. The corresponding trellis contains a path for every grammatically correct transcription of the sentence, for every possible segmentation of the sentence into characters. Generating this trellis (or its associated factor graph) in its entirety is impractical, hence the trellis must be represented procedurally. Instead of representing the factor graph, the GTN approach views the trellis as the main data structure being manipulated by the machine. A GTN can be seen as a multilayer architecture in which the states are trellises, just as a neural net is a multilayer architecture in which the states are fixed-size vectors. A GTN can be viewed as a network of modules, called graph transformers, that take one or more graphs as input and produce another graph as output. The operation of most modules can be expressed as the composition of the input graph with another graph, called a transducer, associated with the module [Mohri, 1997]. The objects attached to the edges of the input graphs, which can be numbers, labels, images, sequences, or any other entity, are fed to trainable functions whose outputs are attached to edges of the output graphs. The resulting architecture can be seen as a compositional hierarchy in which low level features and parts are combined into higher level objects through graph composition.

For speech recognition, acoustic vectors are assembled into phones, phones into triphones, triphones into words, and words into sentences. Similarly in handwriting recognition, ink segments are assembled into characters, characters into words, and words into sentences.

Figure 22 shows an example of a GTN architecture for simultaneously segmenting and recognizing handwritten words [LeCun et al., 1998a]. The first step involves over-segmenting the image and generating a segmentation graph out of it (see Figure 22(a)). The segmentation graph Gr_seg is a directed acyclic graph (DAG) in which each path from the start node to the end node represents a particular way of segmenting the input image into character candidates. Each internal node is associated with a candidate cut produced by the segmentation. Every arc between a source and a destination node is associated with the part of the image that lies between the two cuts. Hence every piece of ink appears once and only once along each path. The next stage passes the segmentation graph Gr_seg through the recognition transformer, which produces the interpretation graph Gr_int with the same number of nodes as Gr_seg. The recognition transformer contains as many identical copies of the discriminant function G_W(X) as there are arcs in the interpretation graph (this number changes for every new input). Each copy of G_W takes the image associated with one arc in the segmentation graph and produces several arcs between the corresponding nodes in the interpretation graph. Each output arc is labeled by a character category, and weighted by the energy of assigning the image to that category. Hence, each path in the interpretation graph represents one possible interpretation of the input for one possible segmentation, with the sum of the weights along the path representing the combined energy of that interpretation. The interpretation graph is then passed through a path selector module that selects only those paths from the interpretation graph that have the same sequence of labels as given by Y (the answer). The output of this module is another graph called Gr_sel. Finally, a so-called Viterbi transformer selects a single path in Gr_sel indexed by the latent variable Z. Each value of Z corresponds to a different path in Gr_sel, and can be interpreted as a particular segmentation of the input. The output energy is obtained

either by minimizing or by marginalizing over Z. Minimizing over Z is achieved by running a shortest path algorithm on Gr_sel (e.g., the Viterbi algorithm, hence the name Viterbi transformer). The output energy is then the sum of the arc energies along the shortest path. Marginalizing over Z is achieved by running the forward algorithm on Gr_sel, as indicated in Section 6, Equation 72. The path selector and Viterbi transformer can be seen as particular types of "switch" modules that select paths in their input graph.

In the handwriting recognition systems described in [LeCun et al., 1998a], the discriminant function G_W(X) was a 2D convolutional network. This class of function is designed to learn low level features and high level representations in an integrated manner, and is therefore highly non-linear in W. Hence the loss function is not convex in W. The optimization method proposed is a refined version of stochastic gradient descent.

In [LeCun et al., 1998a], two primary methods for training GTNs are proposed: discriminative Viterbi training, which is equivalent to using the generalized perceptron loss (Eq. 7), and discriminative forward training, which is equivalent to using the negative log-likelihood loss (Eq. 23). Any of the good losses in Table 1 could also be used.

Training by minimizing the perceptron loss with stochastic gradient descent is performed by applying the following update rule:

W \leftarrow W - \eta ( \partial E(W, Y^i, X^i)/\partial W - \partial E(W, Y^{*i}, X^i)/\partial W ).    (87)

How can the gradients of E(W, Y^i, X^i) and E(W, Y^{*i}, X^i) be computed? The answer is simply to back-propagate gradients through the entire structure, all the way back to the discriminant functions G_W(X). The overall energy can be written in the following form:

E(W, Y, X) = \sum_{kl} \delta_{kl}(Y) G_{kl}(W, X),    (88)

where the sum runs over all arcs in Gr_int, G_{kl}(W, X) is the l-th component of the k-th copy of the discriminant function, and \delta_{kl}(Y) is a binary value equal to 1 if the arc containing G_{kl}(W, X) is present in the final graph, and 0 otherwise. Hence, the gradient is simply:

\partial E(W, Y, X)/\partial W = \sum_{kl} \delta_{kl}(Y) \partial G_{kl}(W, X)/\partial W.    (89)

One must simply keep track of the \delta_{kl}(Y).

In Section 5 we concluded that the generalized perceptron loss is not a good loss function. While the zero margin may limit the robustness of the solution, the perceptron loss seems appropriate as a way to refine a system that was pre-trained on segmented characters, as suggested in [LeCun et al., 1998a]. Nevertheless, the GTN-based bank check reading system described in [LeCun et al., 1998a] that was deployed commercially was trained with the negative log-likelihood loss.
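The bookkeeping of Eqs. 88–89 can be sketched as follows. The linear arc energies G_k = W^T f_k are a hypothetical toy so that the arc gradients have a closed form; the indicator vector delta plays the role of \delta_{kl}(Y), so only arcs on the selected path contribute to the energy and its gradient.

```python
import numpy as np

def path_energy_and_grad(W, arc_feats, delta):
    """E(W, Y, X) = sum_k delta_k G_k with G_k = W . f_k (Eqs. 88-89)."""
    E = sum(d * (W @ f) for d, f in zip(delta, arc_feats))
    grad = sum(d * f for d, f in zip(delta, arc_feats))   # dE/dW, Eq. 89
    return E, grad

W = np.array([1.0, -1.0])
arc_feats = [np.array([1.0, 0.0]),    # hypothetical per-arc feature vectors
             np.array([0.0, 2.0]),
             np.array([1.0, 1.0])]
delta = [1, 0, 1]                     # arcs 0 and 2 lie on the selected path
E, grad = path_energy_and_grad(W, arc_feats, delta)
print(E, grad)                        # 1.0 [2. 1.]
```

In a real GTN the G_{kl} would be outputs of a convolutional network, and the same indicator bookkeeping routes the back-propagated gradients to the copies of the network that produced arcs on the path.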

The second method for training GTNs uses the NLL loss function, with a marginalization over Z using the forward algorithm of Equation 72 over Gr_sel, instead of a minimization.

Training by minimizing the NLL loss with stochastic gradient descent is performed by applying the following update rule:

W \leftarrow W - \eta ( \partial F_Z(W, Y^i, X^i)/\partial W - \partial F_{Y,Z}(W, X^i)/\partial W ),    (90)

where

F_Z(W, Y^i, X^i) = -(1/\beta) log \sum_{z \in Z} e^{-\beta E(W, Y^i, z, X^i)}    (91)

is the free energy obtained by marginalizing over Z, keeping X^i and Y^i fixed, and

F_{Y,Z}(W, X^i) = -(1/\beta) log \sum_{y \in Y, z \in Z} e^{-\beta E(W, y, z, X^i)}    (92)

is the free energy obtained by marginalizing over Y and Z, keeping X^i fixed. Computing those gradients is slightly more complicated than in the minimization case. By the chain rule the gradients can be expressed as:

\partial F_{Y,Z}(W, X^i)/\partial W = \sum_{kl} ( \partial F_{Y,Z}(W, X^i)/\partial G_{kl} ) ( \partial G_{kl}(W, X^i)/\partial W ),    (93)

where the sum runs over all edges in the interpretation graph. The first factor is the derivative of the quantity obtained through the forward algorithm (Eq. 72) with respect to one particular edge in the interpretation graph. These quantities can be computed by back-propagating gradients through the trellis, viewed as a feed-forward network with node functions given by Equation 72. We refer to [LeCun et al., 1998a] for details.

Contrary to the claim in [Lafferty et al., 2001], the GTN system trained with the NLL loss as described in [LeCun et al., 1998a] does assign a well-defined probability distribution over possible label sequences. The probability of a particular interpretation is given by Equation 46:

P(Y | X) = \int_{z \in Z} e^{-\beta E(z, Y, X)} / \int_{y \in Y, z \in Z} e^{-\beta E(y, z, X)}.    (94)

It would seem natural to train GTNs with one of the generalized margin losses. To our knowledge, this has never been done.

8 Discussion

There are still outstanding questions to be answered about energy-based and probabilistic models.
This section offers a relatively philosophical discussion of these questions, including an energy-based discussion of approximate methods for inference and learning. Finally, a summary of the main ideas of this chapter is given.

8.1 EBMs and Probabilistic Models

In Section 1.3, the transformation of energies to probabilities through the Gibbs distribution was introduced:

P(Y | X, W) = e^{-\beta E(W, Y, X)} / \int_{y \in Y} e^{-\beta E(W, y, X)}.    (95)

Any probability distribution over Y can be approximated arbitrarily closely by a distribution of that form. With finite energy values, distributions where the probability of some Y is exactly zero can only be approximated. Estimating the parameters of a probabilistic model can be performed in a number of different ways, including maximum likelihood estimation with Bayes inversion, maximum conditional likelihood estimation, and (when possible) Bayesian averaging (possibly with variational approximations). Maximizing the conditional likelihood of the training samples is equivalent to minimizing what we called the negative log-likelihood loss.

Hence, at a high level, discriminative probabilistic models can be seen as a special case of EBMs in which:

• The energy is such that the integral \int_{y \in Y} e^{-\beta E(W, y, X)} (the partition function) converges.

• The model is trained by minimizing the negative log-likelihood loss.

An important question concerns the relative advantages and disadvantages of probabilistic models versus energy-based models. Probabilistic models have two major disadvantages. First, the normalization requirement limits the choice of energy functions we can use. For example, there is no reason to believe that the model in Figure 7 is normalizable over Y. In fact, if the function G_{W2}(Y) is upper bounded, the integral \int_{-\infty}^{+\infty} e^{-\beta E(W, y, X)} dy does not converge. A common fix is to include an additive term R(Y) in the energy, interpreted as a log prior on Y, whose negative exponential is integrable. Second, computing the contrastive term in the negative log-likelihood loss function (or its gradient with respect to W) may be very complicated, expensive, or even intractable.
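For a discrete Y with a handful of values — the "trivial" case discussed below — the Gibbs distribution of Eq. 95 reduces to a softmax over energies; a minimal sketch:

```python
import numpy as np

def gibbs(energies, beta):
    # P(Y) proportional to exp(-beta * E(Y)); the sum is the partition function
    p = np.exp(-beta * np.asarray(energies))
    return p / p.sum()

E = [1.0, 2.0, 4.0]
p_soft = gibbs(E, beta=1.0)    # mass spread over the three answers
p_hard = gibbs(E, beta=10.0)   # nearly all mass on the lowest-energy answer
print(p_soft, p_hard)
```

Larger beta concentrates the distribution on the minimum-energy answer, which is why, as beta grows, probabilistic inference coincides with energy minimization.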
The various types of models can be divided into five rough categories of increasing complexity:

• Trivial: When Y is discrete with a small cardinality, the partition function is a sum with a small number of terms that can be computed simply. Another trivial case is when the partition function does not depend on W, and hence can be ignored for the purpose of learning. For example, this is the case when the energy is a quadratic form in Y with a fixed matrix. These are cases where the energy loss can be used without fear of collapse.

• Analytical: When the partition function and its derivative can be computed analytically. For example, when the energy is a quadratic form in Y in which the matrix depends on trainable parameters, the partition function is a Gaussian integral (with variable covariance matrix) and its derivative is an expectation under a Gaussian distribution, both of which have closed-form expressions.

• Computable: When the partition function is a sum over an exponential number of terms, but the computation can be factorized in such a way as to make it tractable. The most notable case of this is when the partition function is a sum over configurations of output variables and latent variables of a tree-type graphical model. In this case, belief propagation can be used to compute the partition function. When the graphical model is a simple chain graph (as in the case of HMMs), the set of configurations can be represented by the paths of a weighted trellis. Running the forward algorithm through this trellis yields the partition function. A simple backpropagation-like procedure can be used to compute its gradient (e.g., see [LeCun et al., 1998a] and references therein).

• Approachable: When the partition function cannot be computed exactly, but can be approximated reasonably well using various methods. One notable example is when the partition function is a sum over configurations of a loopy graphical model. The sum cannot be computed exactly, but loopy belief propagation or other variational methods may yield a suitable approximation. With those approximations, the energies of the various answers will still be pulled up, although not as systematically and with the same force as if using the full partition function. In a sense, variational methods could be interpreted in the context of EBMs as a way to choose a subset of energies to pull up.

• Intractable: When the partition function is truly intractable with no satisfactory variational approximation. In this case, one is reduced to using sampling methods. A sampling method is a policy for choosing suitable candidate answers whose energy will be pulled up. The probabilistic approach to this is to sample answers according to their probability under the model, and to pull up their energy.
On average, each answer will be pulled up by the appropriate amount according to the partition function. In this context, using an energy-based loss function other than the negative log-likelihood can be seen as a sampling method with a particular policy for picking the answers whose energy will be pulled up. For example, the hinge loss systematically chooses the most offending incorrect answer as the one whose energy should be pulled up. In the end, using such strategies will produce energy surfaces with which differences of energies cannot be interpreted as likelihood ratios (the same is true with variational methods). We should emphasize again that this is inconsequential if the model is to be used for prediction, classification, or decision-making.

Variational approximation methods can be interpreted in the EBM framework as a particular choice of contrastive term for the loss function. A common approach is to view variational methods and energy-based loss functions as approximations to the probabilistic method. What we propose here is to view the probabilistic approach as a special case of a much larger family of energy-based methods. Energy-based methods are equally well justified as probabilistic methods. They are merely designed for training models to answer a different kind of question than probabilistic models.

An important open question is whether the variational methods that are commonly used (e.g., mean field approximations with popular architectures) actually satisfy condition 3 (see Section 5.2).
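The "computable" chain case mentioned above can be made concrete: for a linear-chain model with unary and pairwise energies over K labels, the forward recursion in the log domain returns the log partition function in time O(T K^2) instead of summing over K^T label sequences. The energy decomposition below is a generic hypothetical chain, not a specific model from the text, and the brute-force check only serves to validate the recursion on a toy instance.

```python
import itertools
import numpy as np

def logsumexp(v):
    m = v.max()                 # numerically stable log of a sum of exponentials
    return m + np.log(np.exp(v - m).sum())

def log_partition(unary, pairwise, beta=1.0):
    """unary: (T, K) energies; pairwise: (T-1, K, K) energies between steps."""
    alpha = -beta * unary[0]
    for t in range(1, len(unary)):
        # alpha_t[j] = logsumexp_i ( alpha_{t-1}[i] - beta * (e(i, j) + e_t(j)) )
        alpha = np.array([logsumexp(alpha - beta * (pairwise[t - 1][:, j] + unary[t][j]))
                          for j in range(unary.shape[1])])
    return logsumexp(alpha)

rng = np.random.default_rng(0)
unary, pairwise = rng.normal(size=(4, 3)), rng.normal(size=(3, 3, 3))

def seq_energy(s):              # brute-force energy of one label sequence
    return (sum(unary[t, s[t]] for t in range(4))
            + sum(pairwise[t, s[t], s[t + 1]] for t in range(3)))

brute = logsumexp(np.array([-seq_energy(s)
                            for s in itertools.product(range(3), repeat=4)]))
print(np.isclose(log_partition(unary, pairwise), brute))   # True
```

The same trellis, traversed with min instead of logsumexp, yields the Viterbi energy, which is the minimization counterpart discussed in Section 7.3.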

8.2 Efficiency in Learning

The most important question that affects the efficiency of learning is: "How many energies of incorrect answers must be explicitly pulled up before the energy surface takes the right shape?". Energy-based loss functions that pull up the most offending incorrect answer only pull up on a single energy at each learning iteration. By contrast, the negative log-likelihood loss pulls up on all incorrect answers at each iteration, including those that are unlikely to produce a lower energy than the correct answer. Hence, unless the NLL computation can be done at very low cost (as in the case of "trivial" and "analytical" models), the energy-based approach is bound to be more efficient.

An important open question is whether alternative loss functions exist whose contrastive term and its derivative are considerably simpler to compute than those of the negative log-likelihood loss, while preserving the nice property that they pull up a large volume of incorrect answers whose energies are "threateningly low". Perhaps a figure of merit for architectures and loss functions could be defined which would compare the amount of computation required to evaluate the loss and its derivative relative to the volume of incorrect answers whose energy is pulled up as a result.

For models in the "intractable" category, each individual energy that needs to be pulled up or pushed down requires an evaluation of the energy and of its gradient (if a gradient-based optimization method is used). Hence, finding parameterizations of the energy surface that will cause the energy surface to take the right shape with the minimum amount of pushing or pulling is of crucial importance. If Y is high-dimensional, and the energy surface is infinitely malleable, then the energy surface will have to be pulled up in many places to make it take a suitable shape. Conversely, more "rigid" energy surfaces may take a suitable shape with less pulling, but are less likely to approach the correct shape. There seems to be a bias-variance dilemma similar to the one that influences generalization performance.

8.3 Learning with Approximate Inference

Very often, the inference algorithm can only give us an approximate answer, or is not guaranteed to give us the global minimum of the energy. Can energy-based learning work in this case? The theory for this does not yet exist, but a few intuitions may shed light on the issue.

There may be certain answers in Y that our inference algorithm never finds, perhaps because they reside in far-away regions of the space that the algorithm can never reach. Our model may give low energy to wrong answers in these regions, but since the inference algorithm cannot find them, they will never appear in the contrastive term, and their energies will never be pulled up. Fortunately, since those answers are not found by the inference algorithm, we do not need to worry about their energies.

It is an interesting advantage of energy-based learning that only the incorrect answers whose energies are pulled up actually matter. This is in contrast with probabilistic loss functions (e.g. NLL) in which the contrastive term must pull up the energy of every single answer, including the ones that our inference algorithm will never find, which can be wasteful.

8.4 Approximate Contrastive Samples, Contrastive Divergence

Loss functions differ in how the contrastive sample is selected, and how hard its energy is pulled up. One interesting suggestion is to pull up on answers that are always near the correct answer, so as to make the correct answer a local minimum, but not necessarily a global one. This idea is the basis of the contrastive divergence algorithm proposed by Hinton [Hinton, 2002, Teh et al., 2003]. Contrastive divergence learning can be seen as an approximation of NLL learning with two shortcuts. First, the contrastive term in Equation 24 is approximated by drawing samples from the distribution P(Y | X^i, W) using a Markov chain Monte Carlo method. Second, the samples are picked by starting the Markov chain at the desired answer, and by running only a few steps of the chain. This produces a sample \tilde{Y}^i that is near the desired answer. Then, a simple gradient update of the parameters is performed:

W \leftarrow W - \eta ( \partial E(W, Y^i, X^i)/\partial W - \partial E(W, \tilde{Y}^i, X^i)/\partial W ).    (96)

Since the contrastive sample is always near the desired answer, one can hope that the desired answer will become a local minimum of the energy. Running MCMC for just a few steps limits computational expense. However, there is no guarantee that all incorrect answers with low energy will be pulled up.

8.5 Conclusion

This tutorial was written to introduce and explicate the following major ideas:

• Many existing learning models can be expressed simply in the framework of energy-based learning.

• Among the many loss functions proposed in the literature, some are good (with a non-zero margin), and some can be bad.

• Probabilistic learning is a special case of energy-based learning where the loss function is the negative log-likelihood, a.k.a. the maximum mutual information criterion.

• Optimizing the loss function with stochastic gradient methods is often more efficient than black box convex optimization methods.
• Stochastic gradient methods can be applied to any loss function including non-convex ones. Local minima are rarely a problem in practice because of the high dimensionality of the space.

• Support vector Markov models, max-margin Markov networks, and conditional random fields are all sequence modeling systems that use linearly parameterized energy factors. Sequence modeling systems with non-linear parameterization for speech and handwriting recognition have been a very active research area since the early 1990's.

• Graph transformer networks are hierarchical sequence modeling systems in which the objects that are manipulated are trellises containing all the alternative interpretations at a given level. Global training can be performed using stochastic gradient by using a form of back-propagation algorithm to compute the gradients of the loss with respect to all the parameters in the system.

Acknowledgments

The authors wish to thank Geoffrey Hinton, Leon Bottou, Yoshua Bengio, Sebastian Seung, and Brendan Frey for helpful discussions. This work was supported in part by NSF ITR grant 0325463 "New directions in predictive learning".

References

[Altun and Hofmann, 2003] Altun, Y. and Hofmann, T. (2003). Large margin methods for label sequence learning. In Proc. of 8th European Conference on Speech Communication and Technology (EuroSpeech).

[Altun et al., 2003] Altun, Y., Johnson, M., and Hofmann, T. (2003). Loss functions and optimization methods for discriminative learning of label sequences. In Proc. EMNLP.

[Bahl et al., 1986] Bahl, L., Brown, P., de Souza, P., and Mercer, R. (1986). Maximum mutual information estimation of hidden Markov model parameters for speech recognition. In Proceedings of Acoustics, Speech, and Signal Processing Conference, pages 49–52.

[Becker and LeCun, 1989] Becker, S. and LeCun, Y. (1989). Improving the convergence of back-propagation learning with second-order methods. In Touretzky, D., Hinton, G., and Sejnowski, T., editors, Proc. of the 1988 Connectionist Models Summer School, pages 29–37, San Mateo. Morgan Kaufman.

[Bengio, 1996] Bengio, Y. (1996). Neural Networks for Speech and Sequence Recognition. International Thompson Computer Press, London, UK.

[Bengio et al., 1990] Bengio, Y., Cardin, R., De Mori, R., and Normandin, Y. (1990). A hybrid coder for hidden Markov models using a recurrent network. In Proceeding of ICASSP, pages 537–540.
[Bengio et al., 1992] Bengio, Y., De Mori, R., Flammia, G., and Kompe, R. (1992). Global optimization of a neural network-hidden Markov model hybrid. IEEE Transactions on Neural Networks, 3(2):252–259.

[Bengio et al., 1991] Bengio, Y., DeMori, R., Flammia, G., and Kompe, R. (1991). Global optimization of a neural network - hidden Markov model hybrid. In Proceedings of EuroSpeech'91.

[Bengio et al., 2003] Bengio, Y., Ducharme, R., Vincent, P., and Jauvin, C. (2003). A neural probabilistic language model. Journal of Machine Learning Research, 3:1137–1155.

[Bengio and Frasconi, 1996] Bengio, Y. and Frasconi, P. (1996). An input/output HMM architecture. In Tesauro, G., Touretzky, D., and Leen, T., editors, Advances in Neural Information Processing Systems, volume 7, pages 427–434. MIT Press, Cambridge, MA.

[Bengio et al., 1993] Bengio, Y., LeCun, Y., and Henderson, D. (1993). Globally trained handwritten word recognizer using spatial representation, space displacement neural networks and hidden Markov models. In Cowan, J. and Tesauro, G., editors, Advances in Neural Information Processing Systems, volume 6. Morgan Kaufmann.

[Bengio et al., 1995] Bengio, Y., LeCun, Y., Nohl, C., and Burges, C. (1995). Lerec: A NN/HMM hybrid for on-line handwriting recognition. Neural Computation, 7(6):1289–1303.

[Bottou, 1991] Bottou, L. (1991). Une Approche théorique de l'Apprentissage Connexionniste: Applications à la Reconnaissance de la Parole. PhD thesis, Université de Paris XI, 91405 Orsay cedex, France.

[Bottou, 2004] Bottou, L. (2004). Stochastic learning. In Bousquet, O. and von Luxburg, U., editors, Advanced Lectures on Machine Learning, number LNAI 3176 in Lecture Notes in Artificial Intelligence, pages 146–168. Springer Verlag, Berlin.

[Bottou and LeCun, 2004] Bottou, L. and LeCun, Y. (2004). Large-scale on-line learning. In Advances in Neural Information Processing Systems 15. MIT Press.

[Bottou et al., 1997] Bottou, L., LeCun, Y., and Bengio, Y. (1997). Global training of document processing systems using graph transformer networks. In Proc. of Computer Vision and Pattern Recognition, pages 490–494, Puerto-Rico. IEEE.

[Bourlard and Morgan, 1990] Bourlard, H. and Morgan, N. (1990). A continuous speech recognition system embedding MLP into HMM.
In Touretzky, D., editor, Advances in Neural Information Processing Systems 2, pages 186–193. Morgan Kaufmann.

[Bromley et al., 1993] Bromley, J., Guyon, I., LeCun, Y., Sackinger, E., and Shah, R. (1993). Signature verification using a siamese time delay neural network. In Cowan, J. and Tesauro, G., editors, Advances in Neural Information Processing Systems, volume 6. Morgan Kaufmann.

[Chopra et al., 2005] Chopra, S., Hadsell, R., and LeCun, Y. (2005). Learning a similarity metric discriminatively, with application to face verification. In Proc. of Computer Vision and Pattern Recognition Conference. IEEE Press.

[Collins, 2000] Collins, M. (2000). Discriminative reranking for natural language parsing. In Proceedings of ICML 2000.

[Collins, 2002] Collins, M. (2002). Discriminative training methods for hidden Markov models: Theory and experiments with perceptron algorithms. In Proc. EMNLP.

[Collobert et al., 2006] Collobert, R., Weston, J., and Bottou, L. (2006). Trading convexity for scalability. In Proceedings of the Twenty-third International Conference on Machine Learning (ICML 2006). IMLS/ICML. ACM Digital Library.

[Denker and Burges, 1995] Denker, J. S. and Burges, C. J. (1995). Image segmentation and recognition. In The Mathematics of Induction. Addison Wesley.

[Driancourt, 1994] Driancourt, X. (1994). Optimisation par descente de gradient stochastique de systèmes modulaires combinant réseaux de neurones et programmation dynamique. Application à la reconnaissance de la parole. (Optimization through stochastic gradient of modular systems that combine neural networks and dynamic programming, with applications to speech recognition). PhD thesis, Université de Paris XI, 91405 Orsay cedex, France.

[Driancourt et al., 1991a] Driancourt, X., Bottou, L., and Gallinari, P. (1991a). MLP, LVQ and DP: Comparison & cooperation. In Proceedings of the International Joint Conference on Neural Networks, volume 2, pages 815–819, Seattle.

[Driancourt et al., 1991b] Driancourt, X., Bottou, L., and Gallinari, P. (1991b). Comparison and cooperation of several classifiers. In Proceedings of the International Conference on Artificial Neural Networks (ICANN).

[Driancourt and Gallinari, 1992a] Driancourt, X. and Gallinari, P. (1992a). Empirical risk optimisation: neural networks and dynamic programming. In Proceedings of Neural Networks for Signal Processing (NNSP).

[Driancourt and Gallinari, 1992b] Driancourt, X. and Gallinari, P. (1992b). A speech recognizer optimally combining learning vector quantization, dynamic programming and multi-layer perceptron. In Proceedings of ICASSP.

[Franzini et al., 1990] Franzini, M., Lee, K. F., and Waibel, A. (1990).
Connectionist Viterbi training: A new hybrid method for continuous speech recognition. In Proceedings of ICASSP, page 245.

[Hadsell et al., 2006] Hadsell, R., Chopra, S., and LeCun, Y. (2006). Dimensionality reduction by learning an invariant mapping. In Proc. Computer Vision and Pattern Recognition Conference (CVPR'06). IEEE Press.

[Haffner, 1993] Haffner, P. (1993). Connectionist speech recognition with a global MMI algorithm. In EUROSPEECH'93, 3rd European Conference on Speech Communication and Technology, Berlin.

[Haffner et al., 1991] Haffner, P., Franzini, M., and Waibel, A. H. (1991). Integrating time-alignment and neural networks for high performance continuous speech recognition. In Proceeding of ICASSP, pages 105–108. IEEE.

[Haffner and Waibel, 1991] Haffner, P. and Waibel, A. H. (1991). Time-delay neural networks embedding time alignment: a performance analysis. In EUROSPEECH'91, 2nd European Conference on Speech Communication and Technology, Genova, Italy.

[Haffner and Waibel, 1992] Haffner, P. and Waibel, A. H. (1992). Multi-state time-delay neural networks for continuous speech recognition. In Advances in Neural Information Processing Systems, volume 4, pages 579–588. Morgan Kaufmann, San Mateo.

[Hinton, 2002] Hinton, G. E. (2002). Training products of experts by minimizing contrastive divergence. Neural Computation, 14:1771–1800.

[Huang and LeCun, 2006] Huang, F.-J. and LeCun, Y. (2006). Large-scale learning with SVM and convolutional nets for generic object categorization. In Proc. Computer Vision and Pattern Recognition Conference (CVPR'06). IEEE Press.

[Juang et al., 1997] Juang, B.-H., Chou, W., and Lee, C.-H. (1997). Minimum classification error rate methods for speech recognition. IEEE Transactions on Speech and Audio Processing, 5(3):257–265.

[Konig et al., 1996] Konig, Y., Bourlard, H., and Morgan, N. (1996). REMAP: Recursive estimation and maximization of a posteriori probabilities, application to transition-based connectionist speech recognition. In Touretzky, D. S., Mozer, M. C., and Hasselmo, M. E., editors, Advances in Neural Information Processing Systems, volume 8, pages 388–394. The MIT Press.

[Kschischang et al., 2001] Kschischang, F., Frey, B., and Loeliger, H.-A. (2001). Factor graphs and the sum-product algorithm. IEEE Trans. Information Theory, 47(2):498–519.

[Kumar and Hebert, 2004] Kumar, S. and Hebert, M. (2004). Discriminative fields for modeling spatial dependencies in natural images. In Thrun, S., Saul, L., and Schölkopf, B., editors, Advances in Neural Information Processing Systems 16. MIT Press, Cambridge, MA.

[Lafferty et al., 2001] Lafferty, J., McCallum, A., and Pereira, F. (2001).
Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In Proc. International Conference on Machine Learning (ICML).

[LeCun and Bengio, 1994] LeCun, Y. and Bengio, Y. (1994). Word-level training of a handwritten word recognizer based on convolutional neural networks. In IAPR, editor, Proc. of the International Conference on Pattern Recognition, volume II, pages 88–92, Jerusalem. IEEE.

[LeCun et al., 1997] LeCun, Y., Bottou, L., and Bengio, Y. (1997). Reading checks with graph transformer networks. In International Conference on Acoustics, Speech, and Signal Processing, volume 1, pages 151–154, Munich. IEEE.

[LeCun et al., 1998a] LeCun, Y., Bottou, L., Bengio, Y., and Haffner, P. (1998a). Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324.

[LeCun et al., 1998b] LeCun, Y., Bottou, L., Orr, G., and Muller, K. (1998b). Efficient backprop. In Orr, G. and Muller, K., editors, Neural Networks: Tricks of the Trade. Springer.

[LeCun and Huang, 2005] LeCun, Y. and Huang, F. (2005). Loss functions for discriminative training of energy-based models. In Proc. of the 10-th International Workshop on Artificial Intelligence and Statistics (AIStats'05).

[Ljolje et al., 1990] Ljolje, A., Ephraim, Y., and Rabiner, L. R. (1990). Estimation of hidden Markov model parameters by minimizing empirical error rate. In Proc. of International Conference on Acoustics, Speech, and Signal Processing, pages 709–712.

[MacKay, 2003] MacKay, D. J. C. (2003). Information Theory, Inference, and Learning Algorithms. Cambridge University Press. Available from http://www.inference.phy.cam.ac.uk/mackay/itila/.

[McCallum et al., 2000] McCallum, A., Freitag, D., and Pereira, F. (2000). Maximum entropy Markov models for information extraction and segmentation. In Proc. International Conference on Machine Learning (ICML), pages 591–598.

[McDermott, 1997] McDermott, E. (1997). Discriminative Training for Speech Recognition. PhD thesis, Waseda University.

[McDermott and Katagiri, 1992] McDermott, E. and Katagiri, S. (1992). Prototype-based discriminative training for various speech units. In Proceedings of ICASSP-92, San Francisco, CA, USA, pages 417–420.

[Mohri, 1997] Mohri, M. (1997). Finite-state transducers in language and speech processing. Computational Linguistics, 23(2):269–311.

[Morgan and Bourlard, 1995] Morgan, N. and Bourlard, H. (1995). Continuous speech recognition: An introduction to the hybrid HMM/connectionist approach. IEEE Signal Processing Magazine, 12(3):25–42.
[Ning et al., 2005] Ning, F., Delhomme, D., LeCun, Y., Piano, F., Bottou, L., and Barbano, P. (2005). Toward automatic phenotyping of developing embryos from videos. IEEE Transactions on Image Processing, 14(9):1360–1371. Special issue on Molecular and Cellular Bioimaging, to appear.

[Osadchy et al., 2005] Osadchy, R., Miller, M., and LeCun, Y. (2005). Synergistic face detection and pose estimation with energy-based model. In Advances in Neural Information Processing Systems (NIPS 2004). MIT Press.

[Sakoe et al., 1988] Sakoe, H., Isotani, R., Yoshida, K., Iso, K., and Watanabe, T. (1988). Speaker-independent word recognition using dynamic programming neural networks. In Proceedings of ICASSP-88, New York, pages 107–110.

[Solla et al., 1988] Solla, S., Levin, E., and Fleisher, M. (1988). Accelerated learning in layered neural networks. Complex Systems, 2(6):625–639.

[Taskar et al., 2003] Taskar, B., Guestrin, C., and Koller, D. (2003). Max-margin Markov networks. In Proc. NIPS.

[Teh et al., 2003] Teh, Y. W., Welling, M., Osindero, S., and Hinton, G. E. (2003). Energy-based models for sparse overcomplete representations. Journal of Machine Learning Research, 4:1235–1260.

[Vapnik, 1995] Vapnik, V. (1995). The Nature of Statistical Learning Theory. Springer Verlag.

[Vishwanathan et al., 2006] Vishwanathan, S. V. N., Schraudolph, N. N., Schmidt, M. W., and Murphy, K. P. (2006). Accelerated training of conditional random fields with stochastic gradient methods. In Proceedings of the Twenty-third International Conference on Machine Learning (ICML 2006). IMLS/ICML.

[Yedidia et al., 2005] Yedidia, J., Freeman, W., and Weiss, Y. (2005). Constructing free-energy approximations and generalized belief propagation algorithms. IEEE Transactions on Information Theory, 51(7):2282–2312.
