Are All Languages Equally Hard to Language-Model?

Ryan Cotterell¹ and Sebastian J. Mielke¹ and Jason Eisner¹ and Brian Roark²
¹ Department of Computer Science, Johns Hopkins University ² Google
{[email protected], [email protected], [email protected]} [email protected]

Abstract

For general modeling methods applied to diverse languages, a natural question is: how well should we expect our models to work on languages with differing typological profiles? In this work, we develop an evaluation framework for fair cross-linguistic comparison of language models, using translated text so that all models are asked to predict approximately the same information. We then conduct a study on 21 languages, demonstrating that in some languages, the textual expression of the information is harder to predict with both n-gram and LSTM language models. We show complex inflectional morphology to be a cause of performance differences among languages.

1 Introduction

Modern natural language processing practitioners strive to create modeling techniques that work well on all of the world’s languages. Indeed, most methods are portable in the following sense: Given appropriately annotated data, they should, in principle, be trainable on any language. However, despite this crude cross-linguistic compatibility, it is unlikely that all languages are equally easy, or that our methods are equally good at all languages.

In this work, we probe the issue, focusing on language modeling. A fair comparison is tricky. Training corpora in different languages have different sizes, and reflect the disparate topics of discussion in different linguistic communities, some of which may be harder to predict than others. Moreover, bits per character, a standard metric for language modeling, depends on the vagaries of a given orthographic system. We argue for a fairer metric based on the bits per utterance using utterance-aligned multi-text. That is, we train and test on “the same” set of utterances in each language, modulo translation. To avoid discrepancies in out-of-vocabulary handling, we evaluate open-vocabulary models.

We find that under standard approaches, text tends to be harder to predict in languages with fine-grained inflectional morphology. Specifically, language models perform worse on these languages, in our controlled comparison. Furthermore, this performance difference essentially vanishes when we remove the inflectional markings.¹

¹ One might have expected a priori that some difference would remain, because most highly inflected languages can also vary word order to mark a topic-focus distinction, and this (occasional) marking is preserved in our experiment.

Thus, in highly inflected languages, either the utterances have more content or the models are worse. (1) Text in highly inflected languages may be inherently harder to predict (higher entropy per utterance) if its extra morphemes carry additional, unpredictable information. (2) Alternatively, perhaps the extra morphemes are predictable in principle (for example, redundant marking of grammatical number on both subjects and verbs, or marking of object case even when it is predictable from semantics or word order) and yet our current language modeling technology fails to predict them. This might happen because (2a) the technology is biased toward modeling words or characters and fails to discover intermediate morphemes, or because (2b) it fails to capture the syntactic and semantic predictors that govern the appearance of the extra morphemes. We leave it to future work to tease apart these hypotheses.

2 Language Modeling

A traditional closed-vocabulary, word-level language model operates as follows: Given a fixed set of words $V$, the model provides a probability distribution over sequences of words with parameters to be estimated from data. Most fixed-vocabulary language models employ a distinguished symbol UNK that represents all words not present in $V$; these words are termed out-of-vocabulary (OOV).

Choosing the set $V$ is something of a black art: Some practitioners choose the set of $k$ most common words (e.g., Mikolov et al. (2010) choose $k = 10000$) and others use all those words that appear at least twice in the training corpus. In general, replacing more words with UNK artificially improves the perplexity measure but produces a less useful model. OOVs present something of a challenge for the cross-linguistic comparison of language models, especially in morphologically rich languages, which simply have more word forms.

2.1 The Role of Inflectional Morphology

Inflectional morphology can explode the base vocabulary of a language. Compare, for instance, English and Turkish. The nominal inflectional system of English distinguishes two forms: a singular and a plural. The English lexeme BOOK has the singular form book and the plural form books. In contrast, Turkish distinguishes at least 12: kitap, kitablar, kitabı, kitabın, etc.

To compare the degree of morphological inflection in our evaluation languages, we use counting complexity (Sagot, 2013). This crude metric counts the number of inflectional categories distinguished by a language (e.g., English includes a category of 3rd-person singular present-tense verbs). We count the categories annotated in the language’s UniMorph (Kirov et al., 2018) lexicon. See Table 1 for the counting complexity of evaluated languages.

2.2 Open-Vocabulary Language Models

To ensure comparability across languages, we require our language models to predict every character in an utterance, rather than skipping some characters because they appear in words that were (arbitrarily) designated as OOV in that language. Such models are known as “open-vocabulary” LMs.

Notation. Let $\dot\cup$ denote disjoint union, i.e., $A \mathbin{\dot\cup} B = C$ iff $A \cup B = C$ and $A \cap B = \emptyset$. Let $\Sigma$ be a discrete alphabet of characters, including a distinguished unknown-character symbol ?.² A character LM then defines $p(\mathbf{c}) = \prod_{i=1}^{|\mathbf{c}|+1} p(c_i \mid \mathbf{c}_{<i})$, ...

... from a vocabulary $\Sigma$ with mutually disjoint subsets: $\Sigma = W \mathbin{\dot\cup} C \mathbin{\dot\cup} S$, where single characters $c \in C$ are distinguished in the model from single-character full words $w \in W$, e.g., the character a versus the word a. Special symbols $S = \{\text{EOW}, \text{EOS}\}$ are end-of-word and end-of-string, respectively. N-gram histories in $H$ are either word-boundary or word-internal (corresponding to a whitespace tokenization), i.e., $H = H_b \mathbin{\dot\cup} H_i$. String-internal word boundaries are always separated by a single whitespace character.³ For example, if foo, baz $\in W$ but bar $\notin W$, then the string foo bar baz would be generated as: foo b a r EOW baz EOS. Possible 3-gram histories in this string would be, e.g., [foo b] $\in H_i$, [r EOW] $\in H_b$, and [EOW baz] $\in H_b$.

Symbols are generated from a multinomial given the history $h$, leading to a new history $h'$ that now includes the symbol and is truncated to the Markov order. Histories $h \in H_b$ can generate symbols $s \in W \cup C \cup \{\text{EOS}\}$. If $s = \text{EOS}$, the string is ended. If $s \in W$, it has an implicit EOW and the model transitions to history $h' \in H_b$. If $s \in C$, it transitions to $h' \in H_i$. Histories $h \in H_i$ can generate symbols $s \in C \cup \{\text{EOW}\}$ and transition to $h' \in H_b$ if $s = \text{EOW}$, otherwise to $h' \in H_i$.

We use standard Kneser and Ney (1995) model training, with distributions at word-internal histories $h \in H_i$ constrained so as to only provide probability mass for symbols $s \in C \cup \{\text{EOW}\}$. We train 7-gram models, but prune n-grams $hs$ where the history $h \in W^k$, for $k > 4$, i.e., 6- and 7-gram histories must include at least one $s \notin W$. To establish the vocabularies $W$ and $C$, we replace exactly one instance of each word type with its spelled-out version. Singleton words are thus excluded from $W$, and character sequence observations from all types are included in training. Note that any word $w \in W$ can also be generated as a character sequence. For perplexity calculation, we sum the probabilities for each way of generating the word.
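The generation scheme for the hybrid word/character n-gram model can be illustrated with a small sketch. The code below is only an illustration under stated assumptions, not the authors' implementation: the function names `to_hybrid_symbols` and `history_types`, the tuple encoding of symbols, and the literal strings used for EOW and EOS are all hypothetical. It spells out a whitespace-tokenized string as symbols from W, C and S, and labels the kind of history reached after each symbol (word-boundary or word-internal).

```python
# Hypothetical encodings of the special symbols EOW and EOS (not from the paper).
EOW, EOS = "<EOW>", "<EOS>"


def to_hybrid_symbols(utterance, word_vocab):
    """Spell out an utterance as a sequence of (kind, symbol) pairs.

    kind is "W" for a full word, "C" for a single character, "S" for a
    special symbol. Tagging the kind keeps the character "a" distinct
    from the single-character word "a".
    """
    symbols = []
    for token in utterance.split():
        if token in word_vocab:
            # In-vocabulary word: one symbol, with an implicit EOW.
            symbols.append(("W", token))
        else:
            # Out-of-vocabulary word: characters followed by an explicit EOW.
            symbols.extend(("C", ch) for ch in token)
            symbols.append(("S", EOW))
    symbols.append(("S", EOS))
    return symbols


def history_types(symbols):
    """Label the history reached after each symbol.

    "b" = word-boundary history, "i" = word-internal history,
    "end" = the string has ended (after EOS).
    """
    labels = []
    for kind, symbol in symbols:
        if kind == "C":
            labels.append("i")      # inside a spelled-out word
        elif symbol == EOS:
            labels.append("end")    # generation stops
        else:
            labels.append("b")      # after a full word or after EOW
    return labels


symbols = to_hybrid_symbols("foo bar baz", {"foo", "baz"})
print([s for _, s in symbols])
print(history_types(symbols))
```

With foo and baz in the word vocabulary and bar outside it, the symbol sequence matches the example in the text: foo, then b, a, r, EOW, then baz, then EOS.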

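The procedure for establishing the word and character vocabularies (spelling out exactly one instance of each word type) can be sketched as follows. This is a minimal sketch under an assumption the text does not state: here the first occurrence of each type is the one that gets spelled out. The function name `build_vocabularies` is hypothetical.

```python
def build_vocabularies(tokens):
    """Return (word_vocab, char_vocab) from a list of training tokens.

    Assumption: the single spelled-out instance of each word type is its
    first occurrence. Every later occurrence is kept as a full word.
    """
    seen_types = set()
    word_vocab = set()
    char_vocab = set()
    for token in tokens:
        if token in seen_types:
            # Second or later occurrence: observed as a full word.
            word_vocab.add(token)
        else:
            # First occurrence: observed only as a character sequence.
            seen_types.add(token)
            char_vocab.update(token)
    return word_vocab, char_vocab


words, chars = build_vocabularies("the cat saw the dog".split())
print(sorted(words))  # only types seen at least twice
print(sorted(chars))  # characters from every type
```

Under this scheme a type that occurs only once never appears as a full word, so singletons are excluded from the word vocabulary, while every type contributes character observations.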
[Figure 1: three scatter plots, one point per language. The x-axis of each panel is Morphological Counting Complexity; the y-axes are Cost of Modeling Words (BPEC), Cost of Modeling Lemmata (BPEC), and Cost of Modeling Inflection (BPEC).]

(a) BPEC performance of n-gram (blue) and LSTM (green) LMs over word sequences. Lower is better.
(b) BPEC performance of n-gram (blue) and LSTM (green) LMs over lemma sequences. Lower is better.
(c) Difference in BPEC performance of n-gram (blue) and LSTM (green) LMs between words and lemmata.

Figure 1: The primary findings of our paper are evinced in these plots. Each point is a language. While the LSTM outperforms the hybrid n-gram model, the relative performance on the highly inflected languages compared to the more modestly inflected languages is almost constant; to see this point, note that the regression lines in Fig. 1c are almost identical. Also, comparing Fig. 1a and Fig. 1b shows that the correlation between LM performance and morphological richness disappears after lemmatization of the corpus, indicating that inflectional morphology is the origin for the lower BPEC.

... how the distribution $p(c_i \mid \mathbf{c}_{<i})$ depends on $\mathbf{c}_{<i}$.

What’s wrong with bits per character? Open-