A Scalable Hierarchical Distributed Language Model

Andriy Mnih and Geoffrey Hinton
Department of Computer Science, University of Toronto
[email protected], [email protected]

Abstract

Neural probabilistic language models (NPLMs) have been shown to be competitive with and occasionally superior to the widely-used n-gram language models. The main drawback of NPLMs is their extremely long training and testing times. Morin and Bengio have proposed a hierarchical language model built around a binary tree of words, which was two orders of magnitude faster than the non-hierarchical model it was based on. However, it performed considerably worse than its non-hierarchical counterpart in spite of using a word tree created using expert knowledge. We introduce a fast hierarchical language model along with a simple feature-based algorithm for automatic construction of word trees from the data. We then show that the resulting models can outperform non-hierarchical neural models as well as the best n-gram models.

1 Introduction

Statistical language modelling is concerned with building probabilistic models of word sequences. Such models can be used to discriminate probable sequences from improbable ones, a task important for performing speech recognition, information retrieval, and machine translation. The vast majority of statistical language models are based on the Markov assumption, which states that the distribution of a word depends only on some fixed number of words that immediately precede it. While this assumption is clearly false, it is very convenient because it reduces the problem of modelling the probability distribution of word sequences of arbitrary length to the problem of modelling the distribution on the next word given some fixed number of preceding words, called the context.
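As a toy illustration of the Markov assumption, consider a bigram (n = 2) model estimated from raw counts. This is an invented example for exposition only; the corpus and names below are not from the paper:

```python
from collections import Counter

corpus = "the cat sat on the mat the cat ate".split()

# Raw bigram counts and context (previous-word) counts.
bigram = Counter(zip(corpus, corpus[1:]))
context = Counter(corpus[:-1])

def p(next_word, prev_word):
    """Unsmoothed MLE estimate of P(next_word | prev_word)."""
    return bigram[(prev_word, next_word)] / context[prev_word]

# Under the Markov assumption, a sequence's probability factorizes into
# per-word conditionals: P(w_1..w_T) ~ prod_t P(w_t | w_{t-1}).
seq = ["the", "cat", "sat"]
prob = 1.0
for prev, nxt in zip(seq, seq[1:]):
    prob *= p(nxt, prev)
```

Note that any n-tuple unseen in training receives probability zero under such raw counts, which is exactly why smoothing is essential for count-based models.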
We will denote this distribution by P(w_n | w_{1:n-1}), where w_n is the next word and w_{1:n-1} is the context (w_1, ..., w_{n-1}).

n-gram language models are the most popular statistical language models due to their simplicity and surprisingly good performance. These models are simply conditional probability tables for P(w_n | w_{1:n-1}), estimated by counting the n-tuples in the training data and normalizing the counts appropriately. Since the number of n-tuples is exponential in n, smoothing the raw counts is essential for achieving good performance. There is a large number of smoothing methods available for n-gram models [4]. In spite of the sophisticated smoothing methods developed for them, n-gram models are unable to take advantage of large contexts since the data sparsity problem becomes extreme. The main reason for this behavior is the fact that classical n-gram models are essentially conditional probability tables where different entries are estimated independently of each other. These models do not take advantage of the fact that similar words occur in similar contexts, because they have no concept of similarity. Class-based n-gram models [3] aim to address this issue by clustering words and/or contexts into classes based on their usage patterns and then using this class information to improve generalization. While it can improve n-gram performance, this approach introduces a very rigid kind of similarity, since each word typically belongs to exactly one class.

An alternative and much more flexible approach to counteracting the data sparsity problem is to represent each word using a real-valued feature vector that captures its properties, so that words used in similar contexts will have similar feature vectors. Then the conditional probability of the next word can be modelled as a smooth function of the feature vectors of the context words and the next word. This approach provides automatic smoothing, since for a given context similar words are now guaranteed to be assigned similar probabilities. Similarly, similar contexts are now likely to have similar representations, resulting in similar predictions for the next word. Most models based on this approach use a feed-forward neural network to map the feature vectors of the context words to the distribution for the next word (e.g. [12], [5], [9]). Perhaps the best known model of this type is the Neural Probabilistic Language Model [1], which has been shown to outperform n-gram models on a dataset of about one million words.

2 The hierarchical neural network language model

The main drawback of the NPLM and other similar models is that they are very slow to train and test [10]. Since computing the probability of the next word requires explicitly normalizing over all words in the vocabulary, the cost of computing the probability of the given next word and the cost of computing the full distribution over the next word are virtually the same – they take time linear in the vocabulary size. Since computing the exact gradient in such models requires repeatedly computing the probability of the next word given its context and updating the model parameters to increase that probability, training time is also linear in the vocabulary size. Typical natural language datasets have vocabularies containing tens of thousands of words, which means that training NPLM-like models the straightforward way is usually too computationally expensive in practice. One way to speed up the process is to use a specialized importance sampling procedure to approximate the gradients required for learning [2]. However, while this method can speed up training substantially, testing remains computationally expensive.

The hierarchical NPLM introduced in [10] provides an exponential reduction in time complexity of learning and testing as compared to the NPLM. It achieves this reduction by replacing the unstructured vocabulary of the NPLM by a binary tree that represents a hierarchical clustering of words in the vocabulary. Each word corresponds to a leaf in the tree and can be uniquely specified by the path from the root to that leaf. If N is the number of words in the vocabulary and the tree is balanced, any word can be specified by a sequence of O(log N) binary decisions indicating which of the two children of the current node is to be visited next. This setup replaces one N-way choice by a sequence of O(log N) binary choices. In probabilistic terms, one N-way normalization is replaced by a sequence of O(log N) local (binary) normalizations. As a result, a distribution over words in the vocabulary can be specified by providing the probability of visiting the left child at each of the nodes. In the hierarchical NPLM, these local probabilities are computed by giving a version of the NPLM the feature vectors for the context words as well as a feature vector for the current node as inputs. The probability of the next word is then given by the probability of making the sequence of binary decisions that corresponds to the path to that word.

When applied to a dataset of about one million words, this model outperformed class-based trigrams, but performed considerably worse than the NPLM [10]. The hierarchical model however was more than two orders of magnitude faster than the NPLM. The main limitation of this work was the procedure used to construct the tree of words for the model. The tree was obtained by starting with the WordNet IS-A taxonomy and converting it into a binary tree through a combination of manual and data-driven processing. Our goal is to replace this procedure by an automated method for building trees from the training data without requiring expert knowledge of any kind. We will also explore the performance benefits of using trees where each word can occur more than once.

3 The log-bilinear model

We will use the log-bilinear language model (LBL) [9] as the foundation of our hierarchical model because of its excellent performance and simplicity. Like virtually all neural language models, the LBL model represents each word with a real-valued feature vector. We will denote the feature vector for word w by r_w and refer to the matrix containing all these feature vectors as R. To predict the next word w_n given the context w_{1:n-1}, the model computes the predicted feature vector r̂ for the next word by linearly combining the context word feature vectors:

r̂ = Σ_{i=1}^{n-1} C_i r_{w_i},    (1)

where C_i is the weight matrix associated with context position i. Then the similarity between the predicted feature vector and the feature vector for each word in the vocabulary is computed using the inner product. The similarities are then exponentiated and normalized to obtain the distribution over the next word:

P(w_n = w | w_{1:n-1}) = exp(r̂^T r_w + b_w) / Σ_j exp(r̂^T r_j + b_j).    (2)

Here b_w is the bias for word w, which is used to capture the context-independent word frequency.

Note that the LBL model can be interpreted as a special kind of feed-forward neural network with one linear hidden layer and a softmax output layer. The inputs to the network are the feature vectors for the context words, while the matrix of weights from the hidden layer to the output layer is simply the feature vector matrix R. The vector of activities of the hidden units corresponds to the predicted feature vector for the next word. Unlike the NPLM, the LBL model needs to compute the hidden activities only once per prediction and has no nonlinearities in its hidden layer. In spite of its simplicity the LBL model performs very well, outperforming both the NPLM and the n-gram models on a fairly large dataset [9].

4 The hierarchical log-bilinear model

Our hierarchical language model is based on the hierarchical model from [10]. The distinguishing features of our model are the use of the log-bilinear language model for computing the probabilities at each node and the ability to handle multiple occurrences of each word in the tree. Note that the idea of using multiple word occurrences in a tree was proposed in [10], but it was not implemented.

The first component of the hierarchical log-bilinear model (HLBL) is a binary tree with words at its leaves. For now, we will assume that each word in the vocabulary is at exactly one leaf. Then each word can be uniquely specified by a path from the root of the tree to the leaf node the word is at. The path itself can be encoded as a binary string of decisions made at each node, so that d_i = 1 corresponds to the decision to visit the left child of the current node. For example, the string “10” corresponds to a path that starts at the root, visits its left child, and then visits the right child of that child. This allows each word to be represented by a binary string which we will call a code.

The second component of the HLBL model is the probabilistic model for making the decisions at each node, which in our case is a modified version of the LBL model. In the HLBL model, just like in its non-hierarchical counterpart, context words are represented using real-valued feature vectors. Each of the non-leaf nodes in the tree also has a feature vector associated with it that is used for discriminating the words in the left subtree from the words in the right subtree of the node. Unlike the context words, the words being predicted are represented using their binary codes, which are determined by the word tree. However, this representation is still quite flexible, since each binary digit in the code encodes a decision made at a node, which depends on that node's feature vector.

In the HLBL model, the probability of the next word being w is the probability of making the sequence of binary decisions specified by the word's code, given the context. Since the probability of making a decision at a node depends only on the predicted feature vector, determined by the context, and the feature vector for that node, we can express the probability of the next word as a product of probabilities of the binary decisions:

P(w_n = w | w_{1:n-1}) = Π_i P(d_i | q_i, w_{1:n-1}),    (3)

where d_i is the i-th digit in the code for word w, and q_i is the feature vector for the i-th node in the path corresponding to that code. The probability of each decision is given by

P(d_i = 1 | q_i, w_{1:n-1}) = σ(r̂^T q_i + b_i),    (4)

where σ(x) is the logistic function and r̂ is the predicted feature vector computed using Eq. 1. The bias b_i in the equation is the node's bias, which captures the context-independent tendency to visit the left child when leaving this node.

The definition of P(w_n = w | w_{1:n-1}) can be extended to multiple codes per word by including a summation over all codes for w as follows:

P(w_n = w | w_{1:n-1}) = Σ_{d ∈ D(w)} Π_i P(d_i | q_i, w_{1:n-1}),    (5)

where D(w) is the set of codes corresponding to word w. Allowing multiple codes per word allows better prediction of words that have multiple senses or multiple usage patterns. Using multiple codes per word also makes it easy to combine several separate word hierarchies into a single one, to reflect the fact that no single hierarchy can express all the relationships between words.

Using the LBL model instead of the NPLM for computing the local probabilities allows us to avoid computing the nonlinearities in the hidden layer, which makes our hierarchical model faster at making predictions than the hierarchical NPLM. More importantly, the hierarchical NPLM needs to compute the hidden activities once for each of the O(log N) decisions, while the HLBL model computes the predicted feature vector just once per prediction. However, the time complexity of computing the probability for a single binary decision in an LBL model is still quadratic in the feature vector dimensionality D, which might make the use of high-dimensional feature vectors too computationally expensive. We make the time complexity linear in D by restricting the weight matrices C_i to be diagonal.¹ Note that for a context of size 1, this restriction does not reduce the representational power of the model, because the context weight matrix C_1 can be absorbed into the word feature vectors. And while this restriction does make the models with larger contexts slightly less powerful, we believe that this loss is more than compensated for by the much faster training times, which allow using more complex trees.

HLBL models can be trained by maximizing the (penalized) log-likelihood.
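The diagonal restriction on the C_i matrices (footnote 1) replaces each matrix-vector product in Eq. 1 with an elementwise product. A minimal sketch, with invented names:

```python
import numpy as np

# With diagonal context matrices, Eq. 1 reduces to elementwise products:
# r_hat = sum_i c_i * r_{w_i}, costing O(D) per context word instead of
# O(D^2) for a full matrix-vector product.
def predict_diagonal(context_ids, R, c):
    """R: word feature vectors; c: one weight vector per context position."""
    return sum(ci * R[w] for ci, w in zip(c, context_ids))
```

This is equivalent to using np.diag(c_i) as the full matrix C_i, but never materializes a D × D matrix.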
Since the probability of the next word depends only on the context weights, the feature vectors of the context words, and the feature vectors of the nodes on the paths from the root to the leaves containing the word in question, only a (logarithmically) small fraction of the parameters need to be updated for each training case.

5 Hierarchical clustering of words

The first step in training a hierarchical language model is constructing a binary tree of words for the model to use. This can be done by using expert knowledge, data-driven methods, or a combination of the two. For example, in [10] the tree was constructed from the IS-A taxonomy DAG from WordNet [6]. After preprocessing the taxonomy by hand to ensure that each node had only one parent, data-driven hierarchical binary clustering was performed on the children of the nodes in the taxonomy that had more than two children, resulting in a binary tree.

We are interested in using a pure learning approach applicable in situations where expert knowledge is unavailable. It is also not clear that using expert knowledge, even when it is available, will lead to superior performance. Hierarchical binary clustering of words based on their usage statistics is a natural choice for generating binary trees of words automatically. This task is similar to the task of clustering words into classes for training class-based n-gram models, for which a large number of algorithms has been proposed. We considered several of these algorithms before deciding to use our own algorithm, which turned out to be surprisingly effective in spite of its simplicity. However, we will mention two existing algorithms that might be suitable for producing binary word hierarchies. Since we wanted an algorithm that scaled well to large vocabularies, we restricted our attention to top-down hierarchical clustering algorithms, as they tend to scale better than their agglomerative counterparts [7]. The algorithm from [8] produces exactly the kind of binary trees we need, except that its time complexity is cubic in the vocabulary size.² We also considered the distributional clustering algorithm [11] but decided not to use it because of the difficulties involved in using contexts of more than one word for clustering. This problem is shared by most clustering algorithms for n-gram models, so we will describe it in some detail. Since we would like to cluster words for easy prediction of the next word based on its context, it is natural to describe each word in terms of the contexts that can precede it. For example, for a single-word context one such description is the distribution of words that precede the word of interest in the training data. The problem becomes apparent when we consider using larger contexts: the number of contexts that can potentially precede a word grows exponentially in the context size. This is the very same data sparsity problem that affects n-gram models, which is not surprising, since we are trying to describe words in terms of exponentially large (normalized) count vectors. Thus, clustering words based on such large-context representations becomes non-trivial due to the computational cost involved as well as the statistical difficulties caused by the sparsity of the data.

We avoid these difficulties by operating on low-dimensional real-valued word representations in our tree-building procedure. Since we need to train a model to obtain word feature vectors, we perform the following bootstrapping procedure: we generate a random binary tree of words, train an HLBL model based on it, and use the distributed representations it learns to represent words when building the word tree.

Since each word is represented by a distribution over contexts it appears in, we need a way of compressing such a collection of contexts down to a low-dimensional vector. After training the HLBL model, we summarize each context w_{1:n-1} with the predicted feature vector produced from it using Eq. 1. Then, we condense the distribution of contexts that precede a given word into a feature vector by computing the expectation of the predicted representation w.r.t. that distribution. Thus, for the purposes of clustering, each word is represented by its average predicted feature vector.

After computing the low-dimensional real-valued feature vectors for words, we recursively apply a very simple clustering algorithm to them. At each step, we fit a mixture of two Gaussians to the feature vectors and then partition them into two subsets based on the responsibilities of the two mixture components for them. We then partition each of the subsets using the same procedure, and so on. The recursion stops when the current set contains only two words. We fit the mixtures by running the EM algorithm for 10 steps.³ The algorithm updates both the means and the spherical covariances of the components. Since the means of the components are initialized based on a random partitioning of the feature vectors, the algorithm is not deterministic and will produce somewhat different clusterings on different runs. One appealing property of this algorithm is that the running time of each iteration is linear in the vocabulary size, which is a consequence of representing words using feature vectors of fixed dimensionality. In our experiments, the algorithm took only a few minutes to build a hierarchy for a vocabulary of nearly 18000 words based on 100-dimensional feature vectors.

The goal of an algorithm for generating trees for hierarchical language models is to produce trees that are well-supported by the data and are reasonably well-balanced, so that the resulting models generalize well and are fast to train and test. To explore the trade-off between these two requirements, we tried several splitting rules in our tree-building algorithm. The rules are based on the observation that the responsibility of a component for a datapoint can be used as a measure of confidence about the assignment of the datapoint to the component. Thus, when the responsibilities of both components for a datapoint are close to 0.5, we cannot be sure that the datapoint should be in one component but not the other.

Our simplest rule aims to produce a balanced tree at any cost. It sorts the responsibilities and splits the words into two disjoint subsets of equal size based on the sorted order.

¹ Thus the feature vector for the next word can now be computed as r̂ = Σ_{i=1}^{n-1} c_i ◦ r_{w_i}, where c_i is a vector of context weights for position i and ◦ denotes the elementwise product of two vectors.
² More precisely, the time complexity of the algorithm is cubic in the number of frequent words, but that is still too slow for our purposes.
The second rule makes splits well-supported by the data even if that results in an unbalanced tree. It achieves that by assigning each word to the component with the higher responsibility for the word. The third and most sophisticated rule is an extension of the second, modified to assign a point to both components whenever both responsibilities are within ε of 0.5, for some pre-specified ε. This rule is designed to produce multiple codes for words that are difficult to cluster. We will refer to the algorithms that use these rules as BALANCED, ADAPTIVE, and ADAPTIVE(ε) respectively.

Finally, as a baseline for comparison with the above algorithms, we will use an algorithm that generates random balanced trees. It starts with a random permutation of the words and recursively builds the left subtree based on the first half of the words and the right subtree based on the second half of the words. We will call this algorithm RANDOM.

³ Running EM for more than 10 steps did not make a significant difference in the quality of the resulting trees.
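The clustering step and the splitting rules can be sketched as follows. This is an illustrative sketch, not the authors' code; the function names and the exact initialization scheme are assumptions:

```python
import numpy as np

def two_gaussian_responsibilities(X, n_steps=10, seed=0):
    """Fit a 2-component spherical Gaussian mixture to the rows of X with
    a few EM steps; return the responsibilities of component 0."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    # Initialize the means from a random partition of the vectors.
    perm = rng.permutation(n)
    mu = np.stack([X[perm[: n // 2]].mean(0), X[perm[n // 2 :]].mean(0)])
    var = np.ones(2)           # spherical covariances
    pi = np.array([0.5, 0.5])  # mixing proportions
    for _ in range(n_steps):
        # E-step: posterior responsibilities, computed via log-densities.
        sq = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(-1)      # (n, 2)
        log_p = np.log(pi) - 0.5 * (sq / var + d * np.log(2 * np.pi * var))
        log_p -= log_p.max(1, keepdims=True)
        r = np.exp(log_p)
        r /= r.sum(1, keepdims=True)
        # M-step: update means, spherical variances, and proportions.
        nk = r.sum(0)
        mu = (r.T @ X) / nk[:, None]
        sq = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(-1)
        var = (r * sq).sum(0) / (nk * d)
        pi = nk / n
    return r[:, 0]

def balanced_split(r):
    # BALANCED: sort by responsibility and cut into two equal halves.
    order = np.argsort(r)
    return order[len(r) // 2 :], order[: len(r) // 2]

def adaptive_split(r):
    # ADAPTIVE: each word goes to the component with higher responsibility.
    return np.flatnonzero(r >= 0.5), np.flatnonzero(r < 0.5)

def adaptive_eps_split(r, eps):
    # ADAPTIVE(eps): as ADAPTIVE, but words whose responsibilities are
    # within eps of 0.5 go to BOTH subsets, yielding multiple codes.
    return np.flatnonzero(r >= 0.5 - eps), np.flatnonzero(r < 0.5 + eps)
```

Recursing on each returned subset until it contains only two words yields the tree; RANDOM simply splits a shuffled word list in half instead of fitting a mixture.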

Table 1: Trees of words generated by the feature-based algorithm. The mean code length is the sum of lengths of codes associated with a word, averaged over the distribution of the words in the training data. The run-time complexity of the hierarchical model is linear in the mean code length of the tree used. The mean number of codes per word refers to the number of codes per word averaged over the training data distribution. Since each non-leaf node in a tree has its own feature vector, the number of free parameters associated with the tree is linear in this quantity.

Tree   Generating           Mean code   Mean number of   Number of
label  algorithm            length      codes per word   non-leaf nodes
T1     RANDOM               14.2        1.0              17963
T2     BALANCED             14.3        1.0              17963
T3     ADAPTIVE             16.1        1.0              17963
T4     ADAPTIVE(0.25)       24.2        1.3              22995
T5     ADAPTIVE(0.4)        29.0        1.7              30296
T6     ADAPTIVE(0.4) × 2    69.1        3.4              61014
T7     ADAPTIVE(0.4) × 4    143.2       6.8              121980

Table 2: The effect of the feature dimensionality and the word tree used on the test set perplexity of the model.

Feature          Perplexity using   Perplexity using    Reduction
dimensionality   a random tree      a non-random tree   in perplexity
25               191.6              162.4               29.2
50               166.4              141.7               24.7
75               156.4              134.8               21.6
100              151.2              131.3               19.9

6 Experimental results

We compared the performance of our models on the APNews dataset containing the Associated Press news stories from 1995 and 1996. The dataset consists of a 14 million word training set, a 1 million word validation set, and a 1 million word test set. The vocabulary size for this dataset is 17964. We chose this dataset because it had already been used to compare the performance of neural models to that of n-gram models in [1] and [9], which allowed us to compare our results to the results in those papers. Except where stated otherwise, the models used for the experiments used 100-dimensional feature vectors and a context size of 5. The details of the training procedure we used are given in the appendix.
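The evaluation metric used throughout, perplexity, is the exponential of the average negative log-probability that the model assigns to the test words. A minimal sketch:

```python
import math

def perplexity(probs):
    """Perplexity of a model over a test stream, given the probability the
    model assigned to each word in the stream."""
    avg_nll = -sum(math.log(p) for p in probs) / len(probs)
    return math.exp(avg_nll)
```

For example, a model that assigns every word probability 1/k has perplexity exactly k, so lower perplexity indicates a sharper, better-calibrated model.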
All models were compared based on their perplexity score on the test set.

We started by training a model that used a tree generated by the RANDOM algorithm (tree T1 in Table 1). The feature vectors learned by this model were used to build a tree using the BALANCED algorithm (tree T2). We then trained models of various feature vector dimensionality on each of these trees to see whether a highly expressive model can compensate for using a poorly constructed tree. The test scores for the resulting models are given in Table 2. As can be seen from the scores, using a non-random tree results in much better model performance. Though the gap in performance can be reduced by increasing the dimensionality of the feature vectors, using a non-random tree drastically improves performance even for the model with 100-dimensional feature vectors. It should be noted, however, that models that use the random tree are not entirely hopeless. For example, they outperform the unigram model, which achieved a perplexity of 602.0, by a very large margin. This suggests that the HLBL architecture is sufficiently flexible to make effective use of a random tree over words.

Since increasing the feature dimensionality beyond 100 did not result in a substantial reduction in perplexity, we used 100-dimensional feature vectors for all of our models in the following experiments. Next we explored the effect of the tree building algorithm on the performance of the resulting HLBL model. To do that, we used the RANDOM, BALANCED, and ADAPTIVE algorithms to generate one tree each. The ADAPTIVE(ε) algorithm was used to generate two trees: one with ε set to 0.25 and the other with ε set to 0.4. We then generated a 2× overcomplete tree by running the ADAPTIVE(ε = 0.4) algorithm twice and creating a tree with a root node that had the two generated trees as its subtrees. Since the ADAPTIVE(ε) algorithm involves some randomization, we tried to improve the model performance by allowing the model to choose dynamically between two possible clusterings. Finally, we generated a 4× overcomplete tree using the same approach. Table 1 lists the generated trees as well as some statistics for them. Note that trees generated using ADAPTIVE(ε) with ε > 0 result in models with more parameters due to the greater number of tree nodes, and thus tree-node feature vectors, as compared to trees generated using methods producing one code/leaf per word.

Table 3: Test set perplexity results for the hierarchical LBL models. All the distributed models in the comparison used 100-dimensional feature vectors and a context size of 5. LBL is the non-hierarchical log-bilinear model. KNn is a Kneser-Ney n-gram model. The scores for LBL, KN3, and KN5 are from [9]. The timing for LBL is based on our implementation of the model.

Model   Tree   Tree generating      Perplexity   Minutes
        used   algorithm                         per epoch
HLBL    T1     RANDOM               151.2        4
HLBL    T2     BALANCED             131.3        4
HLBL    T3     ADAPTIVE             127.0        4
HLBL    T4     ADAPTIVE(0.25)       124.4        6
HLBL    T5     ADAPTIVE(0.4)        123.3        7
HLBL    T6     ADAPTIVE(0.4) × 2    115.7        16
HLBL    T7     ADAPTIVE(0.4) × 4    112.1        32
LBL     –      –                    117.0        6420
KN3     –      –                    129.8        –
KN5     –      –                    123.2        –

Table 3 shows the test set perplexities and time per epoch for the resulting models, along with the perplexities for models from [9]. The results show that the performance of the HLBL models based on non-random trees is comparable to that of the n-gram models. As expected, building word trees adaptively improves model performance. The general trend that emerges is that bigger trees tend to lead to better performing models.
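The overcomplete trees above are formed by placing independently generated trees under a new root node, which prefixes every existing code with one extra decision. A sketch with an invented code representation, where each word maps to a list of binary-string codes:

```python
# Combine two word trees into one 2x overcomplete tree: a new root gets
# the two trees as subtrees, so every word keeps the codes from both,
# each prefixed by the decision made at the new root.
def combine_trees(codes_a, codes_b):
    """codes_a, codes_b: dicts mapping each word to a list of codes."""
    combined = {}
    for w in codes_a:
        combined[w] = (["1" + c for c in codes_a[w]]    # "1" = left subtree
                       + ["0" + c for c in codes_b[w]])  # "0" = right subtree
    return combined
```

Each word ends up with the union of its codes from both trees, each one bit longer, which is consistent with the mean number of codes per word in Table 1 roughly doubling from T5 to T6.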
For example, a model based on a single tree produced using the ADAPTIVE(0.4) algorithm performs as well as the 5-gram but not as well as the non-hierarchical LBL model. However, using a 2× overcomplete tree generated using the same algorithm results in a model that outperforms both the n-gram models and the LBL model, and using a 4× overcomplete tree leads to a further reduction in perplexity. The time-per-epoch statistics reported for the neural models in Table 3 show the great speed advantage of the HLBL models over the LBL model. Indeed, the slowest of our HLBL models is over 200 times faster than the LBL model.

7 Discussion and future work

We have demonstrated that a hierarchical neural language model can actually outperform its non-hierarchical counterparts and achieve state-of-the-art performance. The key to making a hierarchical model perform well is using a carefully constructed hierarchy over words. We have presented a simple and fast feature-based algorithm for automatic construction of such hierarchies. Creating hierarchies in which every word occurred more than once was essential to getting the models to perform better.

An inspection of trees generated by our adaptive algorithm showed that the words with the largest numbers of codes (i.e. the words that were replicated the most) were not the words with multiple distinct senses. Instead, the algorithm appeared to replicate the words that occurred relatively infrequently in the data and were therefore difficult to cluster. The failure to use multiple codes for words with several very different senses is probably a consequence of summarizing the distribution over contexts with a single mean feature vector when clustering words. The “sense multimodality” of context distributions would be better captured by using a small set of feature vectors found by clustering the contexts.

Finally, since our tree building algorithm is based on the feature vectors learned by the model, it is possible to periodically interrupt training of such a model to rebuild the word tree based on the feature vectors provided by the model being trained. This modified training procedure might produce better models by allowing the word hierarchy to adapt to the probabilistic component of the model and vice versa.

Appendix: Details of the training procedure

The models were trained by maximizing the log-likelihood using stochastic gradient ascent. All model parameters other than the biases were initialized by sampling from a Gaussian of small variance. The biases for the tree nodes were initialized so that the distribution produced by the model with all the non-bias parameters set to zero matched the base rates of the words in the training set. Models were trained using a learning rate of 10⁻³ until the perplexity on the validation set started to increase. Then the learning rate was reduced to 3 × 10⁻⁵ and training was resumed until the validation perplexity started increasing again. All model parameters were regularized using a small L2 penalty.

Acknowledgments

We thank Martin Szummer for his comments on a draft of this paper. This research was supported by NSERC and CFI. GEH is a fellow of the Canadian Institute for Advanced Research.

References

[1] Yoshua Bengio, Réjean Ducharme, Pascal Vincent, and Christian Jauvin. A neural probabilistic language model. Journal of Machine Learning Research, 3:1137–1155, 2003.
[2] Yoshua Bengio and Jean-Sébastien Senécal. Quick training of probabilistic neural nets by importance sampling. In AISTATS'03, 2003.
[3] P. F. Brown, R. L. Mercer, V. J. Della Pietra, and J. C. Lai. Class-based n-gram models of natural language. Computational Linguistics, 18(4):467–479, 1992.
[4] Stanley F. Chen and Joshua Goodman. An empirical study of smoothing techniques for language modeling. In Proceedings of the Thirty-Fourth Annual Meeting of the Association for Computational Linguistics, pages 310–318, San Francisco, 1996.
[5] Ahmad Emami, Peng Xu, and Frederick Jelinek. Using a connectionist model in a syntactical based language model. In Proceedings of ICASSP, volume 1, pages 372–375, 2003.
[6] C. Fellbaum et al. WordNet: an electronic lexical database. Cambridge, Mass: MIT Press, 1998.
[7] J. Goodman. A bit of progress in language modeling. Technical report, Microsoft Research, 2000.
[8] John G. McMahon and Francis J. Smith. Improving statistical language model performance with automatically generated word hierarchies. Computational Linguistics, 22(2):217–247, 1996.
[9] A. Mnih and G. Hinton. Three new graphical models for statistical language modelling. In Proceedings of the 24th International Conference on Machine Learning, pages 641–648, 2007.
[10] Frederic Morin and Yoshua Bengio. Hierarchical probabilistic neural network language model. In Robert G. Cowell and Zoubin Ghahramani, editors, AISTATS'05, pages 246–252, 2005.
[11] F. Pereira, N. Tishby, and L. Lee. Distributional clustering of English words. In Proceedings of the 31st Annual Meeting of the Association for Computational Linguistics, pages 183–190, 1993.
[12] Holger Schwenk and Jean-Luc Gauvain. Connectionist language modeling for large vocabulary continuous speech recognition. In Proceedings of the International Conference on Acoustics, Speech and Signal Processing, pages 765–768, 2002.