sigmod15 jliu


1 Mining Quality Phrases from Massive Text Corpora ‡ † †∗ † †∗ Chi Wang Jialu Liu Jingbo Shang Jiawei Han Xiang Ren † Department of Computer Science, University of Illinois at Urbana-Champaign, Urbana, IL, USA ‡ Microsoft Research, Redmond, WA, USA ‡ † [email protected] {jliu64, shang7, xren7, hanj} data that deviate from rigorous language rules. Query logs, ABSTRACT social media messages, and textual transaction records are Text data are ubiquitous and play an essential role in big data just a few examples. Therefore, researchers have sought more applications. However, text data are mostly unstructured. general data-driven approaches, primarily based on the fre- e.g. , Transforming unstructured text into structured units ( quent pattern mining principle [ 2 , 34 ]. The early work focuses semantically meaningful phrases) will substantially reduce on efficiently retrieving recurring word sequences, but many semantic ambiguity and enhance the power and efficiency such sequences do not form meaningful phrases. More re- at manipulating such data using database technology. Thus cent work filters or ranks them according to frequency-based mining quality phrases is a critical research problem in the statistics. However, the raw frequency from the data tends field of databases. In this paper, we propose a new framework to produce misleading quality assessment, and the outcome that extracts quality phrases from text corpora integrated is unsatisfactory, as the following example demonstrates. with phrasal segmentation. The framework requires only Example 1 (Raw Frequency-based Phrase Mining). limited training but the quality of phrases so generated is Consider a set of scientific publications and the raw fre- close to human judgment. Moreover, the method is scalable: quency counts of two phrases ‘relational database system’ both computation time and required space grow linearly as and ‘support vector machine’ and their subsequences in the corpus size increases. Our experiments on large text corpora frequency column of Table 1. The numbers are hypothetical demonstrate the quality and efficiency of the new method. but manifest several key observations: (i) the frequency gen- erally decreases with the phrase length; (ii) both good and 1. INTRODUCTION e.g. bad phrases can possess high frequency ( , ‘support vector’ Mining quality phrases refers to automatically extracting and ‘vector machine’); and (iii) the frequency of one sequence salient phrases from a given corpus. It is a fundamental , ‘relational database system’) and its subsequences can e.g. ( task for text analytics of various domains, such as science, have a similar scale of another sequence ( e.g. , ‘support vector news, social media and enterprise documents. In these large, machine’) and its counterparts.  dynamic collections of documents, analysts are often inter- Obviously, a method that ranks the word sequences solely ested in variable-length phrases, including scientific concepts, according to the frequency will output many false phrases events, organizations, products, slogans and so on. Efficient such as ‘vector machine’. In order to address this problem, extraction of quality phrases enable a large body of applica- different heuristics have been proposed based on comparison tions to transform from word granularity to phrase granular- of a sequence’s frequency and its sub-(or super-)sequences, ity. Examples of such applications include topic tracking [ 21 ], assuming that a good phrase should have high enough (nor- 40 OLAP on multi-dimensional text collections [ ], and doc- malized) frequency compared with its sub-sequences and/or ument categorization. For keyword search, the extracted , 12 29 ]. However, such heuristics can hardly super-sequences [ phrases facilitate selective indexing, query suggestion and , ‘support vector’ and ‘vector differentiate the quality of, e.g. other tasks. Also, extraction of phrases is critical towards machine’ because their frequency are so close. Finally, even information extraction because many concepts, entities and if the heuristics can indeed draw a line between ‘support vec- relations are manifested in phrases. tor’ and ‘vector machine’ by discriminating their frequency Though the study of this task originates from the natural (between 160 and 150), the same separation could fail for language processing (NLP) community, the challenge has another case like ‘relational database’ and ‘database system’. been recognized of applying NLP tools in the emerging big Using the frequency in Table 1, all heuristics will produce ∗ Equal Contribution identical predictions for ‘relational database’ and ‘vector ma- chine’, guaranteeing one of them wrong. This example sug- Permission to make digital or hard copies of all or part of this work for personal or gests the intrinsic limitations of using raw frequency counts, classroom use is granted without fee provided that copies are not made or distributed especially in judging whether a sequence is too long (longer for profit or commercial advantage and that copies bear this notice and the full cita- than a minimum semantic unit), too short (broken and not tion on the first page. Copyrights for components of this work owned by others than informative), or right in length. It is a critical bottleneck for ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or re- all frequency-based quality assessment. publish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected] In this work, we address this bottleneck, proposing to May 31–June 4, 2015, Melbourne, Victoria, Australia. SIGMOD’15, rectify the decisive raw frequency that hinders discovering c 2015 ACM 978-1-4503-2758-9/15/05 ...$15.00. Copyright © the true quality of a phrase. The goal of the rectification is .

2 Table 1: A hypothetical example of word sequence raw frequency frequency rectified sequence frequency phrase? rectified sequence phrase? yes 70 100 yes 80 100 relational database system support vector machine yes support vector 160 yes 50 150 relational database 40 yes 35 vector machine 150 no 6 160 database system 500 20 support 500 N/A 150 N/A relational N/A 1000 vector 1000 N/A 200 database 200 10000 1000 machine 1000 N/A 150 system N/A integrate it with the phrase quality assessment in a unified to estimate how many times each word sequence should be interpreted in whole as a phrase in its occurrence context. framework with linear complexity (w.r.t the corpus size). The following example illustrates this idea. First, the segmentation assigns every word occurrence to only one phrase. In the first instance of Example 2, ‘relational Example 2 (Rectification). Consider the following occur- database system’ are bundled as a single phrase. Therefore, rences of the 6 multi-word sequences listed in Table 1. it automatically avoids double counting ‘relational database’ for images... c relational database system d 1. A and ‘database system’ within this instance. Similarly, the empowers everyone in your organiza- d Database system c 2. segmentation of the fifth instance contributes to the count of tion... ‘feature vector’ and ‘machine learning’ instead of ‘feature’, ‘vector machine’ and ‘learning’. This strategy condenses support vector machine constructs a c d More formally, a 3. the individual tests for each word sequence and reduces the hyperplane... overall complexity while ensures correctness. Second, though 4. support vector c d method is a new general method of The there are an exponential number of possible partitions of ... c function estimation d the documents, we are concerned with those relevant to the 5. c setup is machine learning c d feature vector d A standard phrase extraction task only. Therefore, we can integrate used to describe... the segmentation with the phrase quality assessment, such c has an identical d functional 6. d Relevance vector machine that (i) only frequent phrases with reasonable quality are d form c to the support vector machine c ... taken into consideration when enumerating partitions; and (ii) the phrase quality guides the segmentation, and the The basic goal for d object-oriented relational database c 7. segmentation rectifies the phrase quality estimation. Such an bridge the gap c between...  is to d integrated framework benefits from mutual enhancement, and The first 4 instances should provide positive counts to achieves both high quality and high efficiency. Finally, both these sequences, while the last three instances should not the phrase quality and the segmentation results are useful provide positive counts to ‘vector machine’ or ‘relational from an application point of view. The segmentation results database’ because they should not be interpreted as a whole are especially desirable for tasks like document indexing, phrase (instead, sequences like ‘feature vector’ and ‘relevance categorization or retrieval. vector machine’ can). Suppose one can correctly count true The main contributions lie in the following aspects: occurrences of the sequences, and collect rectified frequency Realizing the limitation of raw frequency-based phrase • as shown in the rectified column of Table 1. The rectified mining, we propose a segmentation-integrated framework frequency now clearly distinguishes ‘vector machine’ from to rectify the raw frequency. To the best of our knowledge, the other phrases, since ‘vector machine’ rarely occurs as a it is the first work to integrate phrase extraction and whole phrase. phrasal segmentation and mutually benefit each other. The success of this approach relies on reasonably accurate The proposed method is scalable: both computation time • rectification. Simple arithmetics of the raw frequency, such and required space grow linearly as corpus size increases. as subtracting one sequence’s count with its quality super It is easy to parallelize as well. sequence, are prone to error. First, which super sequences are quality phrases is a question itself. Second, it is context- Experimental results demonstrate that our method is effi- • dependent to decide whether a sequence should be deemed a cient, generic, and highly accurate. Case studies indicate whole phrase. For example, the fifth instance in Example 2 that the proposed method significantly improves applica- prefers ‘feature vector’ and ‘machine learning’ over ‘vector 17 tions like interesting phrase mining ] and 28 , relevant , 5 [ machine’, even though neither ‘feature vector machine’ nor [27]. word/phrase search ‘vector machine learning’ is a quality phrase. The context information is lost when we only collect the frequency counts. 2. RELATED WORK In order to recover the true frequency with best effort, we ought to examine the context of every occurrence of each 2.1 Quality Phrase Mining word sequence and decide whether to count it as a phrase. , multiword i.e. Automatic extraction of quality phrases ( The examination for one occurrence may involve enumeration semantic units) from massive, dynamically growing corpora of alternative possibilities, such as extending the sequence or gains increasing attention due to its value in text analytics breaking the sequence, and comparison among them. The of various domains. As the origin, the NLP community has test for word sequence occurrences could be expensive, losing , conducted extensive studies [ 38 , 16 30 ]. With pre- , 41 , 4 the advantage in efficiency of the frequent pattern mining defined POS rules, one can generate noun phrases from each 1 approaches. document . However, such rule-based methods usually suffer Facing the challenge of accuracy and efficiency, we propose 1 , and phrasal segmentation a segmentation approach named

3 in domain adaptation. Supervised noun phrase chunking articles, titles, queries, tags, memos, messages and records. techniques [ 39 , , A phrase is a sequence of words that appear contiguously in ] leverage annotated documents to 31 10 automatically learn rules based on POS-tagged corpus. These the text, and serves as a whole (non-composible) semantic supervised methods may utilize more sophisticated NLP unit in certain context of the given documents. Its raw features such as dependency parser to further enhance the frequency is the total count of its occurrences. There is no , ]. The various kinds of linguistic processing, 25 phrase quality universally accepted definition of 20 precision [ . However, it is useful to quantify phrase quality based on certain criteria. domain-dependent language rules, and expensive human We use a value between 0 and 1 to indicate the quality of labeling make it challenging to apply to emerging big corpora. each phrase, and specify four requirements of a good phrase, Another kind of approaches explore frequency statistics , 32 ]. Most of them 15 , 12 , in document collections [ 13 , which conform with previous work. 29 leverage a variety of statistical measures derived from a Since many phrases are invented and adopted Popularity: • corpus to estimate phrase quality. Therefore, they do not rely by people, it could change over time or occasions whether a on linguistic feature generation, domain-specific rules or large sequence of words should be regarded as a non-composible training sets, and can process massive corpora efficiently. In semantic unit. When relational database was first intro- 29 ], several indicators, including frequency and comparison [ duced in 1970 [ 11 ], ‘data base’ was a simple composition n to super/sub-sequences, were proposed to extract -grams of two words, and then with its gained popularity people that are not only popular but also concise as concepts. Deane even invented a new word ‘database’, clearly as a whole 13 [ ] proposed a heuristic metric over frequency distribution semantic unit. ‘vector machine’ is not a meaningful phrase based on Zipfian ranks, to measure lexical association for in machine learning community, but it is a phrase in hard- phrase candidates. ware design. Quality phrases should occur with sufficient Different from aforementioned methods, this work con- frequency in a given document collection. siders integrating phrasal segmentation with phrase quality Concordance: • Concordance refers to the collocation of estimation to further rectify the inaccurate phrase quality tokens in such frequency that is significantly higher than initially estimated, based on local occurrence context. Mean- what is expected due to chance. A commonly-used exam- while, our phrase quality estimation explores statistical fea- ple of a phraseological-concordance is the two candidate tures using rectified phrase frequencies, which provides a phrases ‘strong tea’ and ‘powerful tea’ [ ]. One would 18 principled way to refine the phrase quality assessment. assume that the two phrases appear in similar frequency, in terms Our work is also related to keyphrase extraction yet in the English language, the phrase ‘strong tea’ is con- of deriving a variety of statistical measures for finding qual- sidered more proper and appears in much higher frequency. 24 ]. However, keyphrase extraction fo- 26 ity phrases [ 19 , , Because a concordant phrase’s frequency deviates from cuses on deriving from each single document most prominent what is expected, we consider them belonging to a whole ], in- 28 , , 5 phrases, instead of from the entire corpus. In [ 17 semantic unit. teresting phrases can be queried efficiently for ad-hoc subsets • Informativeness: A phrase is informative if it is indica- of a corpus, while the phrases are based on simple frequent tive of a specific topic. ‘This paper’ is a popular and pattern mining methods. concordant phrase, but not informative in research publi- 2.2 Word Sequence Segmentation cation corpus. In our solution, phrasal segmentation is integrated with Completeness: • Long frequent phrases and their subsets phrase quality assessment, as a critical component for rec- may both satisfy the above criteria. A complete phrase tifying phrase frequency. Formally, phrasal segmentation should be interpreted as a whole semantic unit in certain aims to partition a sequence into disjoint subsequences each context. In the previous discussion of Example 2, the mapping to a semantic unit, i.e. , word or phrase. In terms of sequence ‘vector machine’ does not appear as a complete identifying semantic units, existing work includes query seg- phrase. Note that a phrase and its subphrase can both 37 14 , 7 , ], and Chinese ], phrase chunking [ 23 , 36 mentation [ be valid in appropriate context. For example, ‘relational 9 ], following either supervised setting , word segmentation [ 35 database system’, ‘relational database’ and ‘database sys- on labeled data, or unsupervised setting on large corpus. In tem’ can all be valid in certain context. 36 ], Tan and Pang proposed a generative model in unsuper- [ Efficiently and accurately extracting quality phrases is vised setting which adopted n -gram frequency from a large the main goal of this study. For generality, we allow users corpus and used expectation maximization for computing to provide a few examples of quality phrases and inferior et al. segment scores. Li ] leveraged query click-through 23 [ ones. The estimated quality should therefore align with data based on a bigram language model and further refined these labeled examples. Previous work has overlooked some retrieval model with query segmentation. of the requirements and made assumptions against them. The phrasal segmentation step in our proposed framework For example, most work assumes a phrase candidate should incorporates the estimated phrase quality as guidance on either be included as a phrase, or excluded entirely, without partition of sequences, which does not rely on external sources [ et al. ] 29 analyzing the context it appears. Parameswaran such as Wikipedia and click-through data and achieves good assumed that if a phrase candidate with length n is a good efficiency. It mainly serves the purpose of rectifying phrase phrase, its length n − 1 prefix and suffix cannot be a good frequencies. phrase simultaneously. We do not make such assumptions. Instead, we take a context-dependent analysis approach – 3. PRELIMINARIES phrasal segmentation. defines a partition of a sequence A phrasal segmentation This paper deals with quality phrase extraction from a into subsequences, such that every subsequence corresponds large collection of documents. The input documents can be any textual word sequences with arbitrary lengths, such as to either a single word or a phrase. Example 2 shows instances

4 of such partitions, where all phrases with high quality are sequence of document d,d = 1 ... D . Each document can marked by brackets be further partitioned into smaller pieces based on different dc . The phrasal segmentation is distinct from word, sentence or topic segmentation tasks in natural properties of the corpus, such as sentences according to punctuation. The output is a ranked list of phrases with language processing. It is also different from the syntactic decreasing quality, together with a segmented corpus. or semantic parsing which relies on grammar to decompose the sentences with rich structures like parse trees. Phrasal segmentation provides the necessary granularity we need 4. QUALITY PHRASE MINING to extract quality phrases. The total count of times for a We first present the full procedure of phrase mining. Then rectified phrase to appear in the segmented corpus is called we introduce each of them in following subsections. . frequency 1. Generate frequent phrase candidates according to popu- It is beneficial to acknowledge that a sequence’s segmen- larity requirement (Sec. 4.1). tation may not be unique, due to two reasons. First, as we mentioned above, a word sequence may be regarded as a Estimate phrase quality based on features about concor- 2. phrase or not, depending on the adoption customs. Some dance and informativeness requirements (Sec. 4.2). phrases, like ‘bridge the gap’ in the last instance of Exam- Estimate rectified frequency via phrasal segmentation 3. ple 2, are subject to a user’s requirement. Therefore, we (Sec. 4.3). seek for segmentation that accommodates the phrase quality, 4. Add segmentation-based features derived from rectified which is learned from user-provided examples. Second, a frequency into the feature set of phrase quality classifier sequence could be ambiguous and have different interpreta- (Sec. 4.4). Repeat step 2 and 3. tions. Nevertheless, in most cases, it does not require perfect segmentation, no matter if such a segmentation exists, to Filter phrases with low rectified frequencies to satisfy the 5. extract quality phrases. In a large document collection, the completeness requirement as post-processing step. popularly adopted phrases appear many times in a variety of An complexity analysis for this framework is given at Sec 4.5 context. Even with a few mistakes or debatable partitions, a to show that both of its computation time and required space reasonably high quality segmentation ( , yielding no parti- e.g. grow linearly as the corpus size increases. tion like ‘support vector machine c d ’) would retain sufficient support ( , rectified frequency) for these quality phrases, i.e. 4.1 Frequent Phrase Detection albeit not for false phrases with high raw frequency. With the above discussions, we have formalizations: Algorithm 1: Frequent Phrase Detection Phrase quality is defined Definition 1 (Phrase Quality). . τ Input : Corpus C , minimum support threshold to be the possibility of a multi-word sequence being a coherent of frequent : Raw frequency dictionary Output f semantic unit, according to the above four criteria. Given a phrases and words. , its phrase quality is: phrase v f ← an empty dictionary index ← an empty dictionary p ) = v , Q 1] [0 ∈ ) v c| v d ( ( for i ← 1 to |C| do ∪ ]] i [ C [ index ← ]] i [ C [ index i refers to the event that the words in v compose c v d where while do is not empty index w a phrase. For a single word . For ) = 1 w ( Q , we define ′ ← index an empty dictionary Q phrases,  is to be learned from data. do index.keys ∈ u for For example, a good quality estimator is able to return index | then τ | ≥ ] u [ if ≈ vector machine ( Q 1 and ) ) relational database system ( Q ≈ ] u [ f index [ u ] | ←| 0. for j ∈ index [ u ] do ′ Given a word Definition 2 (Phrasal Segmentation). [ u ← u + 1] j ⊕C ′ ′ ′ ′ w , a segmentation n of length ... sequence w w = C 1 n 2 + 1 ← j ∪{ u [ ] index ] u [ } index C ... S = for s s s is induced by a boundary index se- 2 m 1 ′ { b ,b = , ... ,b quence } satisfying 1 = b B < b < ... < ← index index 2 +1 1 1 m 2 f return w ... w . b w = s +1, where a segment n = +1 t m b b +1 1 | |− + s b t t t t | refers to the number of words in segment s s | Here . Since t t The task of detecting frequent phrases can be defined as | , for clearness we use s + to denote word b w b = | t +1 t t [ b ,b ) t t +1 collecting aggregate counts for all phrases in a corpus that .  w ··· w w sequence b +1 b b 1 + s | |− t t t t satisfy a certain minimum support threshold τ , accordding Example 3 . Continuing our previous Example 2 and specif- to the popularity requirement. In practice, one can also set ically for the first instance, the word sequence and marked ω a maximum phrase length to restrict the phrase length. segmentation are is typically a small ω Even if no explicit restriction is added, constant. For efficiently mining these frequent phrases, we = a relational database system for images C draw upon two properties: = / a / relational database system / for / images / S Downward Closure property: If a phrase is not frequent, 1. B 1 { = indicating with a boundary index sequence } 7 , 6 , 5 , 2 , then any its super-phrase is guaranteed to be not frequent.  the location of segmentation symbol /. Therefore, those longer phrases will be filtered and never expanded. Based on these definitions, the main input of quality phrase L mining task is a corpus with a small set Prefix property: If a phrase is frequent, any its prefix of labeled quality 2. ̄ units should be frequent too. In this way, all the frequent L of inferior ones. The corpus can be represented phrases and C = C by a giant word sequence , where is the word C ... C phrases can be generated by expanding their prefixes. 1 D d

5 The algorithm for detecting frequent phrases is given in which requires a dictionary of stopwords. Phrases that begin · ] to index a word in the corpus string Alg. 1. We use C or end with stopwords, such as ‘I am’, are often functional [ operator is for rather than informative. to denote the corpus size. The |C| and ⊕ A more generic feature is to measure the informativeness concatecating two words or phrases. Alg. 1 returns a key- value dictionary containing all U . Its keys are vocabulary f based on corpus statistics: , and words U \P . Its values are their frequent phrases P • Average inverse document frequency (IDF) computed over raw frequency. words; 4.2 Phrase Quality Estimation is computed as where IDF for a word w Estimating phrase quality from only a few training labels is |C| challenging since a huge number of phrase candidates might ) = log w IDF( ] : ∈ |{ d [ D C w ∈ }| d be generated from the first step and they are messy. Instead 16 of using one or two statistical measures [ 30 15 ], we choose , , It is a traditional information retrieval measure of how much P . A to compute multiple features for each candidate in information a word provides in order to retrieve a small subset for Q classifier is trained on these features to predict quality of documents from a corpus. In general, quality phrases are , their quality is P all unlabeled phrases. For phrases not in expected to have not too small average IDF. simply 0. In addition to word-based features, punctuation is fre- We divide the features into two categories according to quently used in text to aid interpretations of specific concept concordance and informativeness requirements in the fol- or idea. This information is helpful for our task. Specifically, lowing two subsections. Only representative features are we adopt the following feature: introduced for clearness. We then discuss about the classifier • Punctuation: probabilities of a phrase in quotes, brackets in Sec. 4.2.3. or capitalized; 4.2.1 Concordance Features higher probability usually indicates more likely a phrase is This set of features is designed to measure concordance informative. among sub-units of a phrase. To make phrases with different Besides these features, many other signals like knowledge- lengths comparable, we partition each phrase candidate into base entities and part-of-speech tagging can be plugged into two disjoint parts in all possible ways and derive effective the feature set. They are less generic quality estimators and features measuring their concordance. require more training or external resources. It is easy to Suppose for each word or phrase ∈U u , we have its raw incorporate these features in our framework when they are p [ f ]. Its probability u ( u ) is defined as: frequency available. [ u ] f ∑ p ( ) = u 4.2.3 Classifier ′ [ u ] f ′ ∈U u Our framework can work with arbitrary classifiers that can ∈P , we split it into two most-likely sub- Given a phrase v be effectively trained with small labeled data and output a is min- u 〈 units such that 〉 pointwise mutual information ,u r l probabilistic score between 0 and 1. For instance, we can imized. Pointwise mutual information quantifies the discrep- adopt random forest [ 8 ] which is efficient to train with a ancy between the probability of their true collocation and small number of labels. The ratio of positive predictions the presumed collocation under independence assumption. among all decision trees can be interpreted as a phrase’s Mathematically, quality estimation. In experiments we will see that 200–300 labels are enough to train a satisfactory classifier. ( v ) p 〈 log u 〉 = arg min ,u r l Just as we have mentioned, both quality phrases and in- u ⊕ u = v ( p u ) p ( u ) r l r l ferior ones are required as labels for training. To further With 〉 u , we directly use the pointwise mutual informa- 〈 ,u r l reduce the labeling effort, some machine learning ideas like tion as one of the concordance features. ] can be applied to automatically retrieve 22 PU-learning [ v ) p ( ] is another popular mech- 33 negative labels. Active learning [ ,u ( ) = log u PMI r l anism to substantially reduce the number of labels required ( ( ) p ) u p u r l via iterative training. Moreover, it is possible to train a Another feature is also from information theory, called point- transferable model on one document collection and adapt it wise Kullback-Leibler divergence: to the target corpus. We plan to explore these directions in ) v ( p future work. ,u PKL ) log v ( p ) = 〉 ‖〈 ( u v r l Finally, it’s noteworthy that in order to extract features p ( ) p ( u ) u r l efficiently we have designed an adapted Aho-Corasick Au- ( v ) is multiplied with pointwise mutual in- The additional p tomaton to rapidly locate occurrences of phrases in the corpus. formation, leading to less bias towards rare-occurred phrases. Please refer to the appendix for details about this part. Both features are positively correlated with concordance. 4.3 Phrasal Segmentation 4.2.2 Informativeness Features The discussion in Example 1 points out the limitations Some candidates are unlikely to be informative because of using only raw frequency counts. Instead, we ought to they are functional or stopwords. We incorporate the follow- examine the context of every word sequence’s occurrence and ing stopword-based features into the classification process: decide whether to count it as a phrase, as introduced in Ex- • ample 2. The segmentation directly addresses the complete- Whether stopwords are located at the beginning or the end of the phrase candidate; ness requirement, and indirectly helps with the concordance

6 Algorithm 2: Dynamic Programming (DP) requirement via rectified frequency. Here we propose an efficient phrasal segmentation method to compute rectified w Input : Word sequence C = w , phrase quality ...w 2 n 1 frequency of each phrase. We will see that combined with Q , normalized frequency θ , segment length penalty α . aforementioned phrase quality estimation, bad phrases with : Optimal segmentation S . Output high raw frequency get removed as their rectified frequencies h ← 0 (0 ) n ≤ < i h ← 1, 0 i approach zero. ω as the maximum phrase length denote Furthermore, rectified phrase frequencies can be fed back to = 1 do to n for i generate additional features and improve the phrase quality ω do to = 1 δ for ∣ estimation. This will be discussed in the next subsection. ∣ c > h = b ) + δ, d w p · ( b h b if ∣ t t +1 i t i + δ δ + ,i +1 i [ +1) We now propose the phrasal segmentation model integrated then with the aforementioned phrase quality Q . Given a word ∣ ∣ , and a segmentation = C induced by S s sequence s ... 1 m c ← p ) w d δ, + b = b ( b · h h ∣ t i t t +1 + δ i [ i +1 ,i + δ +1) boundary index sequence B = { b } , ... ,b = , where s t 1 +1 m ← i g i δ + w , the joint probability is factorized as: ) b ,b [ t +1 t i n ← m ∣ ( ) ∏ m ← 0 ∣ ) = S,C ( p w d b , p b c ∣ (1) t t +1 b ) [ ,b t +1 t while i > do 0 =1 t m ← m + 1 w w ...w ← s +2 g i g +1 m i i ) is the probability of observing c| p ( b w where d b , +1 t t [ ) b ,b t +1 t g i ← i a word sequence -th quality segment. As w t as the ,b [ b ) t +1 t ...s return ← s S s 1 m − 1 m segments of a word sequence usually have weak dependence on each other, we assume they are generated one by one for the sake of both efficiency and simplicity. Viterbi Training (VT) Algorithm 3: We now describe the generative model for each segment. , we first gener- Given the start index b of a segment s α , length penalty Q , phrase quality C : Corpus Input . t t b , according to a prior distribution ate the end index : Output θ . t +1 p ) over phrase lengths. Then we generate the b s | b | ( = − with normalized raw frequencies in the corpus θ initialize t t t +1 w word sequence according to a multinomial distribu- not converge while do ) ,b b [ t +1 t ′ − ). Finally, we gener- b tion over all segments of length ( b ← ∈U u ∀ θ , 0 +1 t t u ate an indicator whether w forms a quality segment ac- D to = 1 d for do [ ) ,b b t t +1 ← ) via Alg. 2 C ( DP S ,Q,θ,α p cording to its quality Q ( d w ( ). c| w ) = w d d [ b ,b ,b ) ,b [ b ) [ b t t t t +1 +1 t +1 t ( d ) d ) ( We formulate its probabilistic factorization as follows: assume s S ··· s = m d 1 = 1 to m do for t b | b ) p ( d ) = w c| p b ( b ,b ) b , d w c| p ( ( d ) t t t +1 +1 t t +1 t ,b b ) ) [ b ,b [ t t +1 t +1 t w ← u ∣ ,b [ ) b ) ( ( ) t +1 t ∣ ′ ′ b − ( b b = p = | Q ) w b − s p | w + 1 θ ← θ ∣ t t t +1 t +1 t b [ ) [ b ,b ) ,b u u t t +1 t +1 t ′ normalize w.r.t. different length as in Eq. (4) θ ′ ) is explicitly modeled b − = | b s The length prior p ( | t +1 t t θ θ ← to counter the bias to longer segments as they result in fewer return θ | ) we pick is: | s ( p segments. The particular form of t To find the best segmentation to maximize Eq. (3) , one s 1 | −| t θ can use efficient dynamic programming (DP) if is known. (2) p ( | s α | ) ∝ t The algorithm is shown in Alg. 2. + To learn , we employ an optimization strategy called θ Here α ∈ R is a factor called segment length penalty . If 3 ]. Gen- Viterbi Training (VT) or Hard-EM in the literature [ 1, phrases with longer length have larger value of p ( | s | ). α < t erally speaking, VT is an efficient and iterative way of param- If α > | 1, the mass of p ( | s ) moves towards shorter phrases. t eter learning for probabilistic models with hidden variables. favors longer phrases and results in fewer segments. α Smaller C In our case, given corpus , it searches for a segmentation Tuning its value turns out to be a trade-off between precision C| ) followed by coordinate ascent ( , Q,θ,α that maximizes p S and recall for recognizing quality phrases. At the end of . Such a procedure is iterated until a station- on parameters θ this subsection we will discuss how to estimate its value by ary point has been reached. The corresponding algorithm is reusing labels in Sec. 4.2. It is worth mentioning that such given in Alg. 3. [ segment length penalty is also discussed by Li et al. 23 ]. Our θ The hard E-step is performed by DP with fixed, and formulation differs from theirs by posing a weaker penalty the M-step is based on the segmentation obtained from DP. on long phrases. ∣ ∣ Once the segmentation S is fixed, the closed-form solution ) with for convenience. θ w p We denote | s ( | ∣ t w ) b ,b [ t t +1 ,b [ ) b t +1 t can be derived as: θ of u documents, we need to estimate C D with For a given corpus ∣ ∣ ∑ ∑ p u | ) for each frequent word and phrase u ∈U , and ( u θ | = D m ∣ u d 1 ( d ) =1 d =1 t u = s t infer segmentation S . We employ the maximum a posteriori (4) θ = ∑ ∑ u m D d 1 d ) ( principle and maximize the joint probability of the corpus: =1 t d =1 | u s | = | | t m D D d ∣ ) ( ∑ ∑ ∑ ∣ d ) ( ) d ( ) d ( 1 where denotes the identity indicator. We can see that (3) log c ( S w ,C d ) = , p p b log b ∣ d d t +1 t b ,b [ t t +1 u is the rectified frequency of θ normalized by the total u =1 t =1 =1 d d

7 Table 2: Effects of segmentation feedback on phrase quality estimation phrase problem fixed by feedback quality before feedback quality after feedback 0.93 np hard in the strong sense slight underestimate 0.78 overestimate 0.23 0.70 np hard in the strong N/A false positives and false negatives 0.97 0.90 0.87 overestimate positives and false negatives 0.29 underestimate 0.82 0.60 data base management system data stream management system 0.64 underestimate 0.26 Algorithm 4: Penalty Learning 4.4 Feedback as Segmentation Features , labeled quality phrases Input L C , phrase : Corpus Rectified frequencies can help refine the feature generation r . , non-segmented ratio Q quality 0 process in Sec. 4.2 and improve the quality assessment. The Output : Desired segment length penalty α . motivation behind this feedback idea is explained with the low ← 0 ← up 200, examples shown in Table 2. ‘Quality before feedback’ listed do not converge while in the table is computed based on phrase quality estimation α ← up + ) / 2 ( low introduced in Sec. 4.2. For example, the quality of ‘np hard V T ( C ,Q,α ) via Alg. 3 θ ← in the strong’ is significantly overestimated according to the r ← r ×| L | 0 raw frequency. Once we correctly segment the documents, | do = 1 for | L i to its frequency will be largely reduced, and we can use it L ← DP ( S ,Q,θ,α ) via Alg. 2 i to guide the quality estimator. For another example, The | S if | then = 1 quality of phrases like ‘data stream management system’ ← r − 1 r were originally underestimated due to its relatively lower frequency and smaller concordance feature values. Suppose if 0 r then ≥ up ← α after the segmentation, this phrase is not broken into smaller else units in most cases. Then we can feed that information back low α ← to the quality estimator and boost the score. Based on this intuition, we design two new features named + low ) / 2 return ( up segmentation features and plug them into the feature set , these two introduced in Sec. 4.2. Given a phrase v ∈ P | frequencies of the segments with length u . For this reason, | segmentation features are defined as: . we name θ normalized rectified frequency i.e. ]) can 6 , Bawm-Welch algorithm [ Note that Soft-EM ( ( S,v ) | p | | S =1 also be applied to find a maximum likelihood estimator of θ . log max p ( S,v ) S | > 1 | Nevertheless, VT is more suitable in our case because: p | S,v ) ( | S | =1 S,v ) p | log ( =1 | S | VT uses DP for the segmentation step, which is signifi- 1. max ) S,v ( p | 1 > | S cantly faster than Bawm-Welch using forward-backward algorithm for the E-step; where ) is computed by Eq. (1) . Instead of splitting ( p S,v a phrase into two parts like the concordance features, we 2. Majority of the phrases get removed as their θ approaches now find the best segmentation with dynamic programming 0 during iterations, which further speeds up our algorithm. introduced in the phrasal segmentation, which better models the concordance criterion. In addition, normalized rectified It has also been reported in [ ] that VT converges faster and 3 frequencies are used to compute these new features. This results in sparser and simpler models for Hidden Markov addresses the context-dependent completeness requirements. Model-like tasks. Meanwhile, VT is capable of correctly As a result, misclassified phrase candidates in the above ex- recovering most of the parameters. ample can get mostly corrected after retraining the classifier, (2) Previously in Eq. we have defined the formula of seg- as shown in Table 2. that α ment length penalty. There is a hyper-parameter A better phrase quality estimator can guide a better seg- needs to be determined outside the VT iterations. An over- mentation as well. In this way, the loop between the quality estimate α will segment quality phrases into shorter parts, estimation and phrasal segmentation is closed and such an tends to keep low-quality phrases. α while an underestimate of integrated framework is expected to leverage mutual enhance- Thus an appropriate α reflects the user’s trade-off between ment and address all the four phrase quality requirements value is reasonable, α precision and recall. To judge what organically. we propose to reuse the labeled phrases used in the phrase Note that we do not need to run quality estimation and quality estimation. Specifically, we try to search for the phrasal segmentation for many iterations. In our experiments, maximum value of such that VT does not segment positive α the benefits brought by rectified frequency can penetrate after non-segmented ratio controls r phrases. A parameter named 0 the first iteration, leaving performance curves over the next the trade-off mentioned above. It is the expected ratio of several iterations similar. It will be shown in the appendix. phrases in L not partitioned by dynamic programming. The detailed searching process is described in Alg. 4 where we 4.5 Complexity Analysis initially set upper and lower bounds of and then perform a α Frequent Phrases Detection: | binary search. In Alg. 4, denotes the number of segments S | Since the operation of Hash refers to the number of positive labels. L | and | S in table is O (1) , both the time and space complexities are

8 O ). is a small constant indicating the maximum |C| ω ( ω Table 3: Statistics about datasets #docs #labels #words dataset phrase length, so this step is linear to the size of corpus . |C| Academia 91.6M 2.77M 300 Feature Extraction: When extracting features, the most Yelp 4.75M 145.1M 300 challenging problem is how to efficiently locate these phrase candidates in the original corpus, because the original texts 5.2 Compared Methods are crucial for finding the punctuation and capitalization To demonstrate the effectiveness of the proposed approach, information. Instead of using some dictionaries to store all we compared the following phrase extraction methods. the occurrences, we take the advantage of the Aho-Corasick Automaton algorithm and tailor it to find all the occurrences ranks phrases by the product of their raw fre- • TF-IDF of phrase candidates (some prefix issues are addressed by the quencies and inverse document frequencies; adapted version in the appendix). The time complexity is proposes a ranking measure based on frequencies • C-Value ), where O refers to |P| + |P| |P| ) and space complexity O ( ( |C| of a phrase used as parts of their super-phrases following a the total number of frequent phrase candidates. As the length top-down scheme; ( ( |P| of each candidate is limited by a constant O O |C| ω ), , ) = O ) in both time and space. ( |C| so the complexity is • ConExtr approaches phrase extraction as a market-baskets problem based on an assumption about relationship be- Phrase Quality Estimation: As we only labeled a very 1)-gram; − n tween n -gram and prefix/suffix ( small set of phrase candidates, as long as the number and 4 depth of decision trees in the random forest are some constant, is a supervised keyphrase extraction method for • KEA the training time for the classifier is very small compared to long documents. To apply this method in our setting, we other parts. For the prediction stage, it is proportional to consider the whole corpus as a single document; the size of phrase candidates and the dimensions of features. 5 • is a topical phrase extraction method. We use TopMine ( ) in both time and space, although |C| Therefore, it could be O its phrase mining module for comparison; the actual magnitude might be smaller. ranks phrases based on their estimated qual- • ClassPhrase ( ), nω O It is easy to observe that Alg. 2 is Viterbi Training: ities (removing step 3–5 from our framework); ω which is linear to the number of words. is treated as a SegPhrase • combines ClassPhrase with phrasal segmenta- O ( |C| ) considering constant, and thus the VT process is also tion to filter overestimated phrases based on normalized Alg. 3 ususally finishes in a few iterations. rectified frequency (removing step 4 from our framework);  Suppose we only require a constant Penalty Learning: SegPhrase + is similar to SegPhrase but adds segmenta- • to check the convergence of the binary search. Then after 200 tion features to refine quality estimation. It contains the rounds, the algorithm converges. So the number log 2  full procedures presented in Sec. 4. of loops could be treated as a constant. Because VT takes O ( |C| ) time, Penalty learning also takes O ( |C| ) time. The first two methods utilize NLP chunking to obtain Summary. Because the time and space complexities of 6 implementation of phrase candidates. We use the JATE ), our proposed O ( |C| all components in our framework are , TF-IDF and C-Value. Both of i.e. the first two methods, framework has a linear time and space complexities and is 7 them rely on OpenNLP as the linguistic processor to de- thus very efficient. Furthermore, the most time consuming tect phrase candidates in the corpus. The rest methods are parts, including penalty learning and VT, could be easily -grams and the runtime is dramati- n all based on frequent parallelized because of the nature of independence between cally reduced. The last three methods are variations of our documents and sentences. proposed method. It is also worth mentioning that JATE contains several more implemented methods including Weirdness [ ]. They 1 5. EXPERIMENTAL STUDY are not reported here due to their unsatisfactory performance In this section, experiments demonstrate the effectiveness compared to the baselines listed above. and efficiency of the proposed methods in mining quality phrases and generating accurate segmentation. More com- 5.3 Experimental Setting plete experiments regarding parameter selection and case τ We set minimum phrase support as 30 and maximum studies are reported in the appendix. We begin with the phrase length ω as 6, which are two parameters required by description of datasets. all methods. Other parameters required by baselines were set according to the open source tools or the original papers. 5.1 Datasets For our proposed methods, training labels for phrases were Two real-world data sets were used in the experiments and collected by sampling representative phrase candidates from detailed statistics are summarized in Table 3. groups of phrases pre-clustered on the normalized feature k space by -means. We labeled research areas, tasks, algo- 2 dataset is a collection of major computer • The Academia rithms and other scientific terms in the Academia dataset science journals and proceedings. We use both titles and as quality phrases. Some examples are ‘divide and conquer’, abstracts in our experiments. ‘np complete’ and ‘relational database’. For the Yelp dataset, 3 The • provides reviews of 250 businesses. Yelp dataset restaurants, dishes, cities and other related concepts are Each individual review is considered as a document. 4 5 6 2 7 3

9 Precision-Recall Curves on Academia Dataset (Wiki Phrases) Precision-Recall Curves on Yelp Dataset (Wiki Phrases) 1 0.8 0.8 1 TF-IDF TF-IDF TF-IDF TF-IDF 0.9 0.9 ClassPhrase ClassPhrase C-Value C-Value 0.7 0.7 SegPhrase SegPhrase ConExtr ConExtr 0.8 SegPhrase+ SegPhrase+ KEA KEA 0.8 0.6 ToPMine ToPMine 0.6 0.7 SegPhrase+ SegPhrase+ 0.7 0.5 0.6 0.5 0.6 0.5 0.4 0.5 0.4 Precision Precision Precision Precision 0.4 0.4 0.3 0.3 0.3 0.3 0.2 0.2 0.2 0.2 0.1 0.1 0.1 0.6 0.8 0 0.2 0.4 0.6 0.2 0.8 0.6 0.8 0 0.2 0.4 0.4 0 0.2 0.4 0.6 0.8 0 Recall Recall Recall Recall Precision-Recall Curves on Yelp Dataset (Pooling) Precision-Recall Curves on Academia Dataset (Pooling) TF-IDF TF-IDF TF-IDF TF-IDF 0.95 C-Value ClassPhrase ClassPhrase C-Value 1 1 1 ConExtr SegPhrase SegPhrase ConExtr KEA SegPhrase+ 0.9 SegPhrase+ KEA ToPMine 0.9 0.95 0.95 ToPMine SegPhrase+ SegPhrase+ 0.85 0.9 0.8 0.9 0.8 0.85 0.7 0.85 0.75 0.8 0.6 Precision Precision Precision Precision 0.8 0.75 0.7 0.5 0.75 0.7 0.65 0.4 0.65 0.7 0.6 0.3 0.8 1 0 0.2 0.4 0.6 1 0 0.2 0.4 0.6 0.8 0.6 0.2 0 0.8 0.8 0.6 0.4 0.2 0 0.4 Recall Recall Recall Recall × Figure 1: Precision-recall in 4 groups of experiments: (Academia, Yelp) (Wiki phrase, pooling). Within each group, two figures are shown on par: the left compares SegPhrase+ with existing methods, and the right compares among the 3 proposed methods in this paper together with TF-IDF baseline labeled to be positive. In contrast, phrases like ‘under cer- to obtain meaningful insights regarding the performance tain assumptions’, ‘many restaurants’ and ‘last night’ were difference between the proposed and baselines. labeled as negative. We down-sample low quality phrases Besides Wiki phrases, we rely on human evalua- Pooling: because they are dominant over quality phrases. The num- tors to judge whether the rest of the candidates are good. ber of training labels in our experiments are reported in We randomly sampled Wiki-uncovered phrases from the k Table 3. To automatically learn the value of segment length returned candidates of each method mentioned at Sec. 5.2. r in Alg. 4 as 1.0 for penalty, we set the non-segmented ratio 0 These sampled phrases formed a pool and each of them was Academia dataset and 0.95 for Yelp dataset. The selection of then evaluated by 3 reviewers independently. The reviewers this parameter will be discussed in detail later in this section. could use a popular search engine for the candidates (thus To make outputs returned by different methods compa- helping reviewers judge the quality of phrases that they were rable, we converted all the phrase candidates to lower case not familiar with). We took the majority of the opinions and and merged plural with singular phrases. The phrase lists used these results to evaluate the methods on how precise the generated by these methods have different size, and the tail returned quality phrases are. Throughout the experiments of the lists are low quality. For the simplicity of comparison, k = 500. we set we discarded low-ranked phrases based on the minimum size among all phrase lists except ConExtr. ConExtr returns all 5.5 Results phrases without ranking. Thus we did not remove its phrases. Precision-recall curves of different methods evaluated by The remaining size of each list is still reasonably large ( > both Wiki phrases and pooling phrases are shown in Fig. 1. 40,000). The trends on both datasets are similar. Among the existing work, the chunking-based methods, 5.4 Evaluation such as TF-IDF and C-Value, have the best performance; The goal of our experiments is to study how well our ConExtr reduces to a dot in the figure since its output methods perform in terms of “precision” and “recall” and does not provide the ranking information. Our proposed compare with baselines. Precision is defined as the number method, SegPhrase+, outperforms them significantly. More of quality phrases divided by the number of phrase candidates. specifically, SegPhrase+ can achieve a higher recall while Recall is defined as the number of quality phrases divided its precision is maintained at a satisfactory level. That is, by the total number of quality phrases. many more quality phrases can be found by SegPhrase+ than The first set of experiments were conducted baselines. Under a given recall, precision of our method is Wiki Phrases: higher in most of the time. by using Wikipedia phrases as ground truth labels. Wiki phrases refer to popular mentions of entities by crawling intra- For variant methods within our framework, it is surpris- Wiki citations within Wiki content. To compute precision, ing that ClassPhrase could perform competitively to the only the Wiki phrases are considered to be positive. For recall, chunking-based methods like TF-IDF. Note that the latter we combine Wiki phrases returned by different methods requires large amounts of pre-training for good phrase chunk- ing. However, ClassPhrase’s precision at the tail is slightly altogether and view them as all quality phrases. Precision worse than TF-IDF on Academia dataset evaluated by Wiki and recall are biased in this case because positive labels are restricted to Wiki phrases. However, we still expect phrases. We also observe a significant difference between

10 SegPhrase and ClassPhrase, indicating phrasal segmentation monotonically increasing, which indicates that more labels plays a crucial role to address the completeness requirement. may result in better performance. 300 labels are enough to train a satisfactory classifier. In fact, SegPhrase already beats ClassPhrase and baselines. Moreover, SegPhrase+ improves the performance of Seg- Table 4: Impact of training data size on ClassPhrase Phrase, because of the use of phrasal segmentation results (Top: Academia, Bottom: Yelp) as additional features. #wiki phr. f1 recall prec. #labels #total An interesting observation is that the advantage of our 6,179 24,603 50 0.881 0.372 0.523 method is more significant on the pooling evaluations. The 100 0.859 0.430 0.573 6,834 30,234 phrases in the pool are not covered by Wiki, indicating 0.558 0.676 8,196 40,355 200 0.856 0.785 11,535 300 0.760 0.811 95,070 that Wikipedia is not a complete source of quality phrases. However, our proposed methods, including SegPhrase+, Seg- #total f1 #labels prec. recall #wiki phr. Phrase, and ClassPhrase, can mine out most of them (more 0.948 50 79,091 6,985 0.647 0.491 than 80%) and keep a very high level of precision, especially 0.540 0.688 6,692 57,018 0.948 100 6,786 0.554 0.700 200 53,613 0.948 on the Academia dataset. Therefore, the evaluation results 0.702 53,442 6,777 300 0.944 0.559 on the pooling phrases suggest that our methods not only detect the well-known Wiki phrases, but also work properly for the long tail phrases which might occur not so frequently. 5.6.2 Non-segmented Ratio From the result on Yelp dataset evaluated by pooling is designed for learning seg- r The non-segmented ratio 0 phrases, we notice that SegPhrase+ is a little weaker than ment length penalty, which further controls the precision and SegPhrase at the head. As we know, SegPhrase+ has tried to , r recall phrasal segmentation. Empirically, under higher 0 utilize phrasal segmentation results from SegPhrase to refine the segmentation process will favor longer phrases, and vice the phrase quality estimator. However, segmentation features versa. We show experimental results in Table 5 for models do not add new information for bigrams. If there are not . The evaluation measures are with different values of r 0 many quality phrases with more than two words, SegPhrase+ similar to the previous setting but they are computed based might not have significant improvement and even can perform on the results of SegPhrase. One can observe that the preci- slightly worse due to the overfitting problem by reusing the , while the recall decreases. It is sion increases with lower r 0 same set of labeled phrases. This overfitting problem has been because phrases are more likely to be segmented into words verified in the appendix regarding the iterative SegPhrase+ r r . High by lower is generally preferred because we should 0 0 experiments. In fact, on Academia dataset, the ratios of preserve most positive phrases in training data. We select quality phrases with more than 2 words are 24% among all 95 for Academia and Yelp datasets respec- . 00 and 0 . r = 1 0 Wiki phrases and 17% among pooling phrases. In contrast, tively, because quality phrases are shorter in Yelp dataset these statistics go down to to 13% and 10% on Yelp dataset, than in Academia dataset. which verifies our conjecture and explains why SegPhrase+ r on Seg- Table 5: Impact of non-segmented ratio has slightly lower precision than SegPhrase at the head. 0 Phrase (Top: Academia, Bottom: Yelp) 5.6 Model Selection f1 recall prec. r #total #wiki phr. 0 1.00 0.816 0.756 0.785 10,607 57,668 The goal of model selection is to study how well our meth- 0.741 43,554 9,226 0.95 0.625 0.909 ods perform in terms of “precision” and “recall” on various 0.90 30,550 7,262 0.617 0.457 0.949 candidate models with different parameters. We specifically 0.85 0.422 0.584 7,107 29,826 0.948 want to study two potentially interesting questions: 6,208 0.525 0.364 0.944 0.80 25,374 • How many labels do we need to achieve good results of #total #wiki phr. f1 recall prec. r 0 1.00 0.606 7,155 48,684 0.739 0.948 phrase quality estimation? 0.95 42,933 6,916 0.749 0.921 0.631 for deciding segment How to choose non-segmented ratio • r 0 0.90 0.749 6,467 0.846 0.673 34,632 length penalty? 0.85 0.766 0.739 5,947 28,462 0.714 0.80 26,245 5,729 0.727 0.728 0.725 Experiments for other parameters are provided in the ap- pendix due to the space limit. 5.7 Computational Complexity Study 5.6.1 Number of Labels The following execution time experiments were all con- To evaluate the impact of training data size on the phrase ducted on a machine with 20 cores of Intel(R) Xeon(R) CPU quality estimation, we focus on studying the classification E5-2680 v2 @ 2.80GHz. Our framework is mainly imple- performance of ClassPhrase. Table 4 shows the results eval- mented in C ++ while a small part of preprocessing is in { v ∈P : uated among phrases with positive predictions ( i.e. , Python. As shown in Fig. 3, the linear curves of total run- 5). With different numbers of labels, we report the . 0 ≥ Q v time of SegPhrase+ on different proportions of data verifies precision, recall and F1 score judged by human evaluators our linear time complexity analyzed in Sec. 4.5. (Pooling). The number of correctly predicted Wiki phrases dataset file size #words time is also provided together with the total number of positive Academia 613MB 91.6M 0.595h phrases predicted by the classifier. From these results, we ob- 145.1M 0.917h Yelp 750MB serve that the performance of the classifier becomes better as the number of labels increases. Specifically, on both datasets, Besides, the pies in Fig. 4 show the ratios of different com- the recall rises up as the number of labels increases, while the ponents of our framework. One can observe that Feature precision goes down. The reason is the down-sampling of low quality phrases in the training data. Overall, the F1 score is Extraction and Phrasal Segmentation occupy most of the

11 runtime. Fortunately, almost all components of our frame- 3500 Academia works can be parallelized, such as Feature Extraction, Phrasal Yelp 3000 Segmentation and Quality Estimation, which are the most 2500 expensive parts of execution time. It is because sentences can 2000 be proceeded one by one without any impact on each other. 1500 Therefore, our methods could be very efficient for massive corpus using parallel and distributed techniques. Here we do 1000 not compare the runtime with other baselines because they Running Time (seconds) 500 are implemented by different programming languages and 0 some of them further rely on various third-party packages. 0.4 0.2 1 0.8 0.6 Proportion Among existing implementations, our method is empirically one of the fastest. Figure 3: Runtime on different proportions of data 5.8 Feature Contribution Yelp Academia We conducted experiments to show the contributions of Frequent Phrase Detection Frequent Phrase Detection 2nd Phrasal Segmentation 2nd Phrasal Segmentation different features in the classifier of SegPhrase+ in Fig. 2. Feature contributions are determined by aggregating mean squared errors for nodes in decision trees of the random forest 2nd Qualtiy Estimation 2nd Qualtiy Estimation classifier. Generally speaking, we can observe that the contri- Segmentation Feature Extraction Segmentation Feature Extraction butions of features in different aspects vary for two datasets. Feature Extraction Feature Extraction 1st Phrasal Segmentation 1st Phrasal Segmentation As the Yelp dataset is noisier and less formal compared to the Academia dataset, the stopword and IDF features are 1st Quality Estimation 1st Quality Estimation more important in Yelp, whereas the concordance and seg- mentation features are more important in Academia. Also, Figure 4: Runtime of different modules in our frame- due to the subjectivity in Yelp review data, its punctuation work on Academia and Yelp dataset feature is less significant. Fig. 2 indicates that supervision ences. Each series of proceedings form a subset of the whole is necessary for phrase quality assessment. Meanwhile, it Academia corpus. Two segmentation methods are compared. suggests our proposed criteria for judging phrase quality is The first one relies on dynamic programming using phrase reasonable since each feature can be a strong signal of the quality estimated by SegPhrase+. The other is based on the final phrase quality in certain datasets. phrase chunking method adopted in JATE, which is further 40 Academia used to detect phrase candidates for TF-IDF and C-Value 35 Yelp ) methods. To be fair, we only show phrases extracted by 30 SegPhrase+, TF-IDF and C-Value methods in the table. 25 Because TF-IDF and C-Value perform similarly and they 20 both rely on the chunking method, we merge their phrases 15 and report mining results in one column named ‘Chunking’. 10 Relative Weight (% Phrases in SegPhrase+ but missing in the chunking results 5 are highlighted in purple (red vice versa). One can observe 0 IDF Punct. Seg-PMI Seg-PKL Stopword PKL PMI that the interesting phrases mined by SegPhrase+ based on Features the segmentation result are more meaningful and the im- Figure 2: Contributions of features for phrase qual- provement is significant. Relatively speaking, phrases mined ity estimation of SegPhrase+ from the chunking method are of inferior quality. Therefore, many of them are not covered by SegPhrase+. For more experimental results regarding this application, readers can 5.9 Case Study refer to the appendix. Previous experiments are focused on evaluating phrase quality quantitatively. In this subsection, we show two case 5.9.2 Word/Phrase Similarity Search studies based on applications taking segmented corpora as input. Recall that the segmented corpus is the other output With a segmented corpus, one could train a model to learn of the proposed phrase mining framework. In the appendix, 27 distributed vector representations of words and phrases [ ]. we provide a fraction of the ranked phrase list produced by Using this technique, words and phrases are mapped into a our framework. vector space such that semantically similar words and phrases have similar vector representations. It helps other text min- 5.9.1 Interesting Phrase Mining ing algorithms to achieve better performance by grouping The first application is to mine interesting phrases in a similar units. The quality of the learned vector representa- tion is closely related to the quality of the input segmented subset of given corpus. Interesting phrases are defined to be ′ corpus. Accurate segmentation results in good vector rep- phrases frequent in the subset but relatively infrequent C ]. Given a phrase 28 , 17 resentation and this performance gain is usually evaluated 5 [ C in the overall corpus , , its v ′ ′ v, C by comparing similarity scores between word/phrase pairs. ) interestingness is measured by purity ( v, C freq , C ) = ( · 2 ′ ( v, C ), which considers both phrase frequency similar words or k To be specific, one could compute top- v, C freq ) ( /freq and purity in the subset. phrases given a query and compare the ranked lists. We use We list a fraction of interesting phrases in Table 6 mined this to verify the utility of both quality phrase mining and from papers published in SIGMOD and SIGKDD confer- quality segmentation.

12 Table 6: Interesting phrases mined from papers published in SIGMOD and SIGKDD conferences Conference SIGKDD SIGMOD SegPhrase+ Chunking Chunking Method SegPhrase+ data base data mining 1 data base data mining data set database system association rule 2 database system association rule knowledge discovery 3 relational database query processing knowledge discovery frequent itemset query optimization query optimization 4 relational database time series decision tree 5 query processing ... ... ... ... ... database technology search space sql server 51 association rule mining database server rule set domain knowledge 52 relational data data structure concept drift important problem 53 large volume 54 join query performance study knowledge acquisition concurrency control web service web service 55 conceptual graph gene expression data ... ... ... ... ... high dimensional data efficient implementation optimal solution 201 web content sensor network semantic relationship location based service frequent subgraph 202 large collection intrusion detection 203 xml schema effective way two phase locking important issue categorical attribute space complexity 204 deep web frequent itemset user preference small set 205 ... ... ... ... ... Table 7: Top-5 similar words or phrases for representative queries (Top: Academia, Bottom: Yelp) Query olap data mining SegPhrase+ Chunking SegPhrase+ Chunking Method 1 driven methodologies data warehouse warehouses knowledge discovery 2 text mining text mining online analytical processing clustcube 3 web mining financial investment data cube rolap 4 knowledge discovery online analytical processing machine learning olap queries building knowledge 5 multidimensional databases data mining techniques analytical processing valet parking noodle blu-ray Query SegPhrase+ Chunking SegPhrase+ SegPhrase+ Chunking Method Chunking dvd ramen huge lot new microwave noodle soup valet 1 vhs private lot noodle soup asian noodle 2 self-parking lifetime warranty rice noodle cd self-parking recliner beef noodle 3 valet service new release egg noodle valet stir fry free valet parking battery 4 front lot sony pasta fish ball covered parking new battery 5 We show the results in Table 7 from SegPhrase+ and A number of open problems need to be solved to allow the chunking method mentioned in the previous interesting further development of the proposed phrase mining frame- phrase mining application. Queries were chosen to be capa- work. One direction is to investigate reducing the number of training labels. The current framework requires the model to ble of showing the difference between the two methods for be trained by 300 labeled phrases. It would be preferable if both Academia and Yelp datasets. Distributed representa- ] and ranking a transferable model can be pretrained on general document 27 tions were learned through an existing tool [ scores were computed based on cosine similarity. Additional collection and adapt itself to the target corpus. An alterna- comparisons are provided in the appendix. tive is to use a domain-independent phrase collection as the training source. Overlapped phrases are first identified and From the table, one can easily tell that the rank list from then used to serve as the positive labels. Another direction SegPhrase+ carries more sense than that from phrase chunk- is to replace Viterbi Training by other parameter estimation ing. One of the possible reasons is that chunking method only approaches to further improve the phrasal segmentation. For detects noun phrases in the corpus, providing less accurate instance, one can find top- information of phrase occurrences than SegPhrase+ to the best segmentations instead of k vector representation learning algorithm. one for each text snippet. This is less deterministic than the current design but consumes more computational power. Finally, it is now possible to reconsider many text data man- 6. CONCLUSIONS agement problems with this technique. For example, text can In this paper, we introduced a novel framework for ex- be represented by bag of phrases instead of words. The best tracting quality phrases from text corpora. The framework way to exploit this representation is open for exploration. is data-driven and requires only limited training to achieve outstanding performance. The key idea is to rectify the 7. ACKNOWLEDGEMENTS raw frequency of phrases which misleads quality estimation. Research was sponsored in part by the U.S. Army Re- We develop a segmentation-integrated approach for this pur- search Lab. under Cooperative Agreement No. W911NF-09-2- 0053 (NSCTA), the Army Research Office under Cooperative pose, which significantly boosts the final quality of extracted Agreement No. W911NF-13-1-0193, National Science Foun- phrases. It is the first work to explore the mutual benefit of dation IIS-1017362, IIS-1320617, and IIS-1354329, HDTRA1- phrase extraction and phrasal segmentation. By intergrating 10-1-0120, trans-NIH Big Data to Knowledge (BD2K) initia- them, this work addresses a fundamental limitation of phrase tive grant 1U54GM114838 awarded by NIGMS, and MIAS, mining and empowers widespread applications. Meanwhile, a DHS-IDS Center for Multimodal Information Access and the method is scalable: both computation time and required space grow linearly as corpus size increases. Synthesis at UIUC.

13 Z. Liu, X. Chen, Y. Zheng, and M. Sun. Automatic References [24] keyphrase extraction by bridging vocabulary gap. In Pro- [1] K. Ahmad, L. Gillam, and L. Tostevin. University of ceedings of the Fifteenth Conference on Computational surrey participation in trec8: Weirdness indexing for Natural Language Learning , pages 135–144. Association logical document extrapolation and retrieval (wilder). for Computational Linguistics, 2011. In , The Eighth Text REtrieval Conference (TREC-8) [25] R. McDonald, F. Pereira, K. Ribarov, and J. Hajiˇc. Gaithersburg, Maryland. Non-projective dependency parsing using spanning tree [2] H. Ahonen. Knowledge discovery in documents by ex- EMNLP , 2005. algorithms. In tracting frequent word sequences. , 48(1), Library Trends [26] R. Mihalcea and P. Tarau. Textrank: Bringing order 1999. into texts. In ACL , 2004. [3] A. Allahverdyan and A. Galstyan. Comparative analysis [27] T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and of viterbi training and maximum likelihood estimation J. Dean. Distributed representations of words and NIPS , pages 1674–1682, 2011. for hmms. In phrases and their compositionality. In , pages NIPS [4] T. Baldwin and S. N. Kim. Multiword expressions. 3111–3119, 2013. Handbook of Natural Language Processing, second edi- D. P, A. Dey, and D. Majumdar. Fast mining of inter- [28] , 2010. tion. Morgan and Claypool , EDBT esting phrases from subsets of text corpora. In [5] S. Bedathur, K. Berberich, J. Dittrich, N. Mamoulis, 2014. and G. Weikum. Interesting-phrase mining for ad-hoc [29] A. Parameswaran, H. Garcia-Molina, and A. Rajaraman. VLDB text analytics. , 3(1-2):1348–1357, 2010. Towards the web of concepts: Extracting concepts from [6] Pattern recognition and machine learning , C. M. Bishop. Proceedings of the Very Large Data Bases large datasets. volume 1. springer New York, 2006. , 3((1-2)), September 2010. Conference (VLDB) [7] G. Blackwood, A. De Gispert, and W. Byrne. Phrasal Y. Park, R. J. Byrd, and B. K. Boguraev. Automatic [30] segmentation models for statistical machine translation. glossary extraction: beyond terminology identification. , 2008. COLING In , 2002. COLING In , 45(1):5– Machine learning L. Breiman. Random forests. [8] [31] V. Punyakanok and D. Roth. The use of classifiers in 32, 2001. sequential inference. In NIPS , 2001. [9] P.-C. Chang, M. Galley, and C. D. Manning. Optimiz- [32] C. Ramisch, A. Villavicencio, and C. Boitet. Multiword ing chinese word segmentation for machine translation expressions in the wild? the mwetoolkit comes in handy. ACL Workshop on Statistical Machine performance. In COLING In , pages 57–60, 2010. , 2008. Translation [33] . Synthesis Lectures on B. Settles. Active Learning [10] K.-h. Chen and H.-H. Chen. Extracting noun phrases Artificial Intelligence and Machine Learning. Morgan & from large-scale texts: A hybrid approach and its auto- Claypool Publishers, 2012. matic evaluation. In , 1994. ACL [34] A. Simitsis, A. Baid, Y. Sismanis, and B. Reinwald. E. F. Codd. A Relational Model for Large Shared Data [11] Proc. VLDB Multidimensional content exploration. Communications of The ACM , 13:377–387, 1970. Banks. Endow. , 1(1):660–671, Aug. 2008. [12] M. Danilevsky, C. Wang, N. Desai, X. Ren, J. Guo, and [35] R. Sproat, W. Gale, C. Shih, and N. Chang. A stochas- J. Han. Automatic construction and ranking of topical tic finite-state word-segmentation algorithm for chinese. SDM keyphrases on collections of short documents. In , Computational linguistics , 22(3):377–404, 1996. 2014. B. Tan and F. Peng. Unsupervised query segmentation [36] P. Deane. A nonparametric method for extraction of [13] using generative language models and wikipedia. In , 2005. candidate phrasal terms. In ACL , 2008. WWW H. Echizen-ya and K. Araki. Automatic evaluation [14] E. F. Tjong Kim Sang and S. Buchholz. Introduction [37] method for machine translation using noun-phrase CONLL to the conll-2000 shared task: Chunking. In , , 2010. ACL chunking. In 2000. A. El-Kishky, Y. Song, C. Wang, C. R. Voss, and J. Han. [15] [38] I. H. Witten, G. W. Paynter, E. Frank, C. Gutwin, VLDB , Scalable topical phrase mining from text corpora. and C. G. Nevill-Manning. Kea: Practical automatic 8(3), Aug. 2015. Proceedings of the fourth ACM keyphrase extraction. In [16] K. Frantzi, S. Ananiadou, and H. Mima. Automatic , pages 254–255. ACM, conference on Digital libraries recognition of multi-word terms:. the c-value/nc-value 1999. , 3(2):115–130, 2000. method. JODL [39] E. Xun, C. Huang, and M. Zhou. A unified statistical [17] C. Gao and S. Michel. Top-k interesting phrase mining model for the identification of english basenp. In ACL , in ad-hoc collections using sequence pattern indexing. 2000. In EDBT , 2012. [40] D. Zhang, C. Zhai, and J. Han. Topic Cube: Topic Mod- In memory M. A. Halliday. Lexis as a linguistic level. [18] eling for OLAP on Multidimensional Text Databases. of JR Firth , pages 148–162, 1966. , pages 1123–1134, 2009. SDM In [19] K. S. Hasan and V. Ng. Conundrums in unsupervised Z. Zhang, J. Iria, C. A. Brewster, and F. Ciravegna. A [41] keyphrase extraction: making sense of the state-of-the- comparative evaluation of term recognition algorithms. COLING , 2010. art. In , 2008. LREC [20] T. Koo, X. Carreras, and M. Collins. Simple semi- supervised dependency parsing. ACL-HLT , 2008. 8. APPENDIX [21] J. Leskovec, L. Backstrom, and J. Kleinberg. Meme- 8.1 Adapted Aho-Corasick Automaton KDD , tracking and the dynamics of the news cycle. In KDD ’09, pages 497–506, 2009. The Aho-Corasick Automaton is similar to the data struc- ture Trie, which could utilize the common prefix to save the [22] X. Li and B. Liu. Learning to classify texts using positive , volume 3, pages 587–592, and unlabeled data. In IJCAI memory usage and also make the process more efficient. It 2003. also computes the field “failed” referring to the node which Y. Li, B.-J. P. Hsu, C. Zhai, and K. Wang. Unsupervised [23] could continue the matching process. In this paper, we adopt query segmentation using clickthrough for information standard Aho-Corasick Automaton definition and construc- , 2011. SIGIR retrieval. In tion process. Algorithm 5 introduced a “while” loop to fix the

14 F1 Score on Yelp dataset (Pooling) F1 Score on Academia dataset (Pooling) Algorithm 5: Locating using Aho-Corasick Automaton 0.95 : The Aho-Corasick Automaton root, the corpus Input 0.9 0.95 C string . 0.85 0.9 Output : All occurrences of phrases in the automaton O . 0.8 0.85 F1 Score F1 Score O ←∅ 0.75 0.8 ← root u [email protected] [email protected] 0.7 [email protected] [email protected] do for i 1 to ← |C| 0.75 3 2 1 4 1 2 3 4 = do root and .next u C [i] not in 6 u while #Iterations #Iterations .failed ← u u Figure 5: Performance variations of SegPhrase+ .next then if [i] in C u and ClassPhrase+ with increasing iterations .next[ u ← u [i]] C observe a slight performance decline over the next two itera- u p ← tions especially for the top-1000 phrases. Recall that we are p p 6 = root and while .isEnd do reusing training labels for each iteration. Then this decline [i - p.depth + 1, i] O ←O∪ can be well explained by overfitting because segmentation p ← p .failed features added by later iterations become less meaningful. return O Meanwhile, more meaningless features will undermine the Table 8: Runtime of locating phrase candidates classification power of random forest. Based on this, we Yelp Academia can conclude that there is no need to do the phrase quality Aho-Corasick Automaton 154.25s 198.39s re-estimation multiple times. 192.71s Unordered Map (Hash Table) 366.67s 8.4 Experiments on Larger Datasets , there might be some phrase can- i.e. issues brought by prefix ( Since both Yelp and Academia dataset are relatively small didates which are the prefix of the others), which is slightly in file size, we additionally tested our framework’s running different from the traditional matching process and could time on a larger data set composed of all Wikipedia articles help us find all occurrences of the phrase candidates in the as well as its performance considering all Wiki phrases to be corpus in a linear time. ground truth. The statistics of Wikipedia dataset is shown in An alternative way is to adopt the hash table. However, the following table and one can verify the linearality of com- one should carefully choose the hash function for hash tale putational complexity compared with previous experiments and the theoretical time complexity of hash table is not ex- on Academia and Yelp reported in Table 10. The column actly linear. For the comparison, we implemented a hash shows the number of correctly predicted Wiki phrases # wiki table approach using the unordered map in C , while the ++ of SegPhrase+. The precision-recall curves are reported in Aho-Corasick Automaton was coded in C ++ too. The re- Fig. 6. Similar patterns can be observed compared to Fig. 1 sults can be found in Table 8 We can see that Aho-Corasick that SegPhrase+ beats the other two proposed baselines Automaton is slightly better because of its exact linear com- and integration of phrasal segmentation with phrase mining plexity and less memory overhead. boosts performance a lot. 8.2 Convergence Study of Viterbi Training Table 10: Statistics of experiments on Wiki dataset Our time complexity analysis in Sec. 4.5 assumes Viterbi #wiki phr. time #words file size # total Training in Alg. 3 converges in few iterations. Here we verify 1,390,632 718,840 28.08h 3.26G 20.23GB this through empirical studies. From Table 9, VT converges Precision-Recall Curves on Wiki extremely fast on both datasets. This owes to the good 1 ClassPhrase initialization based on raw frequency as well as the particular SegPhrase SegPhrase+ 0.9 property of Viterbi Training discussed in Sec. 4.3. 0.8 0.7 8.3 Iterations of SegPhrase+ 0.6 SegPhrase+ involves only one iteration of re-estimating Precision 0.5 phrase quality using normalized rectified frequency from 0.4 phrasal segmentation. Here we show the performance of 0.3 SegPhrase+ with more iterations in Fig. 5 based on human- 0.2 0.4 0.6 0.8 0 Recall labeled phrases. For comparison, we also report performance Figure 6: Precision-recall curves of the 3 proposed of ClassPhrase+ which is similar with ClassPhrase but con- methods tested on Wikipedia dataset tains segmentation feature generated by results of phrasal segmentation from the last iteration. 8.5 Interesting Phrase Mining (More) We can see that the benefits brought by rectified frequency can be fully digested within the first iteration, leaving F1 We show comparisons between SegPhrase+ and TopMine scores over the next several iterations close. One can also in Table 11. TopMine approaches phrase mining based on raw frequency as well as document context following a bottom-up Table 9: Objective function values of Viterbi Train- fashion. Such formulation can be interpreted as a greedy ing for SegPhrase and SegPhrase+ segmentation by iteratively merging most correlated units Academia Dataset Yelp in a sentence. Such a greedy segmentation is not optimal SegPhrase SegPhrase+ Method SegPhrase SegPhrase+ -9.27055E+08 Iter.1 -6.39453E+08 -6.33064E+08 -9.33899E+08 compared to our segmentation framework using dynamic -9.12082E+08 -6.17419E+08 -6.23699E+08 Iter.2 -9.06041E+08 -9.05946E+08 -9.11835E+08 -6.17214E+08 -6.23383E+08 Iter.3 programming. Moreover, TopMine removes stopwords be- -9.05940E+08 Iter.4 -6.23354E+08 -6.17196E+08 -9.11819E+08 Iter.5 -9.05940E+08 -9.11818E+08 -6.17195E+08 -6.23351E+08 fore greedy segmentation. This loses important information

15 Table 11: Interesting phrases mined from SIGMOD and SIGKDD conferences (SegPhrase+ v.s. TopMine) Conference SIGKDD SIGMOD SegPhrase+ TopMine TopMine Method SegPhrase+ data base data mining 1 data base data mining data set query optimization data set 2 database system association rule knowledge discovery 3 relational database database system ... ... ... ... ... aggregation function association rule mining data item 51 sql server relational data query answering user profile 52 rule set query rewriting user preference data structure 53 concept drift data access knowledge acquisition rule discovery 54 join query web service gene expression data sequential pattern 55 hash join ... ... ... ... ... high dimensional data 201 active learning logical and physical 202 transitive closure frequent subgraph algorithm is efficient location based service xml schema performance study approach is proposed 203 intrusion detection disk access data cube two phase locking categorical attribute 204 object oriented user preference distance metric 205 deep web ... ... ... ... ... Table 12: Interesting phrases mined from SIGMOD and SIGKDD conferences (SegPhrase+ v.s. SegPhrase) SIGMOD SIGKDD Conference Method SegPhrase+ SegPhrase+ SegPhrase SegPhrase data base data mining data mining 1 data base database system database system data set data set 2 3 query processing association rule association rule relational database ... ... ... ... ... 51 sql server subsequence matching association rule mining association rule mining 52 relational data rule set sql server rule set relational data concept drift data structure concept drift 53 54 join query data structure knowledge acquisition knowledge acquisition web service gene expression data 55 web service gene expression data ... ... ... ... ... query containment web content web search engine 201 high dimensional data location based service buffer management frequent subgraph better than 202 xml schema 203 previously proposed garbage collection intrusion detection moving object categorical attribute two phase locking 204 application domain 205 deep web on demand user preference wide range ... ... ... ... ... because stopwords can be naturally serving as boundaries be- False positive (reflected in precision): Through our studies, • tween meaningful phrases. From the table we can exactly tell phrases with poor quality are mistakenly predicted as good that interesting phrases mined from our segmented corpus ones due to the fixed collocation among words. For example, are better. ‘experiments are conducted’ gets 0.88 quality score and ‘at the network layer’ gets 0.77. Such cases become more Comparisons between SegPhrase+ and SegPhrase are in frequent when the phrases are close to the bottom of the Table 12. The results are basically similar but we can notice that phrases generated based on SegPhrase+’s segmentation ranking list. are better in quality than SegPhrase’s. False negative (reflected in recall): Quality phrases are • underestimated because their frequencies in the corpus is 8.6 Relevant Word/Phrase Mining (More) significantly lower than those of their sub-components. For We show comparisons between SegPhrase+ and TopMine example, ‘topic model’s score is 0.44, compared to ‘topic in Table 13, SegPhrase+ and SegPhrase in Table 14. From modeling’ which gets 0.72. As ‘topic’ and ‘model’ are both these two tables one can tell that SegPhrase+ beats TopMine very frequent in the corpus, it is more difficult for ‘topic while behaves similar to SegPhrase. This is reasonable be- model’ to be predicted as quality phrase. cause we are only listing the most similar words/phrases. As SegPhrase+ and SegPhrase are sharing the same phrasal seg- Based on the above error analysis, there are three inspired mentation module, the differences among top-ranked phrases directions to improve the model. The first is to modify will not be significant. the quality estimator by incorporating syntactic features obtained from a pre-trained POS tagger. The second is to 8.7 Case Study of Quality Phrases improve the viterbi training module in phrasal segmentation. We show some phrases from ranking lists generated by k One alternative is to find top- segmentations instead of ClassPhrase, SegPhrase and SegPhrase+ in Table 15. In gen- the best one given a text snippet. This is less deterministic eral, phrase quality drops with number goes up. ClassPhrase in performing segmentation than the current design but always performs the worst among the three methods. Seg- consumes more computational power. Meanwhile, segment Phrase+ is slightly better than SegPhrase, which is noticeable length penalty can be designed in a more sophisticated way for phrases ranked after 20,000. It’s worth mentioning that by dynamically changing its value based on the local context the minimum sizes of phrase lists are 50,577 and 42,989 for instead of setting a global value. However, this will inevitably two datasets respectively. require more labels or predefined rules. The third aspect is to do normalization on words in the corpus such as stemming. 8.8 Error Analysis This may merge ‘topic modeling’ and ‘topic model’ and make We analyze of errors of SegPhrase+ from the aspects of their predictions consistent. false positive and false negative.

16 Table 13: Top-5 similar words/phrases for queries from Academia/Yelp datasets (SegPhrase+ v.s. TopMine) Query olap data mining SegPhrase+ TopMine TopMine Method SegPhrase+ knowledge discovery knowledge discovery data warehouse 1 olap system online analytical processing integrating data mining online analytical processing olap 2 text mining 3 web mining data mining tools data cube olap queries olap queries data cube 4 machine learning mining multidimensional databases queries in olap data mining dm data mining techniques 5 Query blu-ray noodle valet parking TopMine SegPhrase+ TopMine SegPhrase+ SegPhrase+ TopMine Method movie ramen dvd valet valet noodles the noodles 1 recordable noodle soup valet service vhs noodles weren self-parking 2 rice noodle parking and valet adobe’s cd noodle dish 3 valet service itunes new release egg noodle free valet free valet parking rice noodles 4 bootable pasta sony valet your car covered parking 5 noodle texture Table 14: Top-5 similar words/phrases for queries from Academia/Yelp datasets (SegPhrase+ v.s. SegPhrase) data mining Query olap Method SegPhrase SegPhrase+ SegPhrase SegPhrase+ knowledge discovery knowledge discovery data warehouse 1 data warehouse text mining online analytical processing text mining 2 online analytical processing 3 web mining machine learning data cube data cube data mining techniques olap queries multidimensional database 4 machine learning mining data mining techniques olap operations 5 multidimensional databases noodle valet parking blu-ray Query TopMine SegPhrase+ TopMine TopMine SegPhrase+ Method SegPhrase+ dvd ramen noodle soup valet dvd 1 valet vhs new release egg noodle noodle soup valet service self-parking 2 cd rice noodle ramen self parking on netflix 3 valet service new release egg noodle rice noodles complimentary valet vhs 4 free valet parking cd pasta noodle and parking sony 5 covered parking Table 15: Sampled quality phrases from Academia and Yelp datasets SegPhrase+ ClassPhrase SegPhrase Method virtual reality self organization virtual reality 1 variable bit rate variable bit rate polynomial time approx. scheme 2 shortest path shortest path least squares 3 ... ... ... ... health care finite state frequency offset estimation 501 collaborative filtering gene expression air traffic 502 ultra wide band finite state transducers long term 503 ... ... ... ... search terms test plan airline crew scheduling 10001 high dimensional space automatic text integer programming 10002 delay variation adaptive bandwidth web log data 10003 ... ... ... ... test coverage implementation costs experience sampling 20001 error bounded virtual execution environments adaptive sliding mode control 20002 random graph models free market nonlinear time delay systems 20003 ... ... ... ... svm method harmony search algorithm asymptotic theory 50001 interface adaptation physical mapping integer variables 50002 diagnostic fault simulation nonlinear oscillators distinct patterns 50003 ... ... ... ... SegPhrase+ ClassPhrase SegPhrase Method taco bell tour guide taco bell 1 wet republic wet republic yellow tail 2 pizzeria bianco vanilla bean pizzeria bianco 3 ... ... ... ... panoramic view art museum rm seafood 501 pretzel bun ice cream parlor pecan pie 502 spa pedicure master bedroom pho kim long 503 ... ... ... ... seated promptly carrot soup gary danko 10001 benny benassi leisurely stroll veggie soup 10002 pork burrito big eaters flavored water 10003 ... ... ... ... buttery toast late night specials cilantro hummus 20001 quick breakfast older women lv convention center 20002 iced vanilla slightly higher worth noting 20003 ... ... ... ... friday morning conveniently placed coupled with 40001 way too high start feeling cant remember 40002 immediately start stereo system almost guaranteed 40003 ... ... ... ...

Related documents