
Search Engines
Information Retrieval in Practice

W. Bruce Croft
Donald Metzler
Trevor Strohman

© W.B. Croft, D. Metzler, T. Strohman, 2015

This book was previously published by Pearson Education, Inc.


Preface

This book provides an overview of the important issues in information retrieval, and how those issues affect the design and implementation of search engines. Not every topic is covered at the same level of detail. We focus instead on what we consider to be the most important alternatives to implementing search engine components and the information retrieval models underlying them. Web search engines are obviously a major topic, and we base our coverage primarily on the technology we all use on the Web,[1] but search engines are also used in many other applications. That is the reason for the strong emphasis on the information retrieval theories and concepts that underlie all search engines.

The target audience for the book is primarily undergraduates in computer science or computer engineering, but graduate students should also find this useful. We also consider the book to be suitable for most students in information science programs. Finally, practicing search engineers should benefit from the book, whatever their background.

There is mathematics in the book, but nothing too esoteric. There are also code and programming exercises in the book, but nothing beyond the capabilities of someone who has taken some basic computer science and programming classes.

The exercises at the end of each chapter make extensive use of a Java™-based open source search engine called Galago. Galago was designed both for this book and to incorporate lessons learned from experience with the Lemur and Indri projects. In other words, this is a fully functional search engine that can be used to support real applications. Many of the programming exercises require the use, modification, and extension of Galago components.

[1] In keeping with common usage, most uses of the word "web" in this book are not capitalized, except when we refer to the World Wide Web as a separate entity.

Contents

In the first chapter, we provide a high-level review of the field of information retrieval and its relationship to search engines. In the second chapter, we describe the architecture of a search engine. This is done to introduce the entire range of search engine components without getting stuck in the details of any particular aspect. In Chapter 3, we focus on crawling, document feeds, and other techniques for acquiring the information that will be searched. Chapter 4 describes the statistical nature of text and the techniques that are used to process it, recognize important features, and prepare it for indexing. Chapter 5 describes how to create indexes for efficient search and how those indexes are used to process queries. In Chapter 6, we describe the techniques that are used to process queries and transform them into better representations of the user's information need.

Ranking algorithms and the retrieval models they are based on are covered in Chapter 7. This chapter also includes an overview of machine learning techniques and how they relate to information retrieval and search engines. Chapter 8 describes the evaluation and performance metrics that are used to compare and tune search engines. Chapter 9 covers the important classes of techniques used for classification, filtering, clustering, and dealing with spam. Social search is a term used to describe search applications that involve communities of people in tagging content or answering questions. Search techniques for these applications and peer-to-peer search are described in Chapter 10. Finally, in Chapter 11, we give an overview of advanced techniques that capture more of the content of documents than simple word-based approaches. This includes techniques that use linguistic features, the document structure, and the content of nontextual media, such as images or music.

Information retrieval theory and the design, implementation, evaluation, and use of search engines cover too many topics to describe them all in depth in one book. We have tried to focus on the most important topics while giving some coverage to all aspects of this challenging and rewarding subject.

Supplements

A range of supplementary material is provided for the book. This material is designed both for those taking a course based on the book and for those giving the course. Specifically, this includes:

• Extensive lecture slides (in PDF and PPT format)

• Solutions to selected end-of-chapter problems (instructors only)
• Test collections for exercises
• Galago search engine

The supplements are available at www.search-engines-book.com.

Acknowledgments

First and foremost, this book would not have happened without the tremendous support and encouragement from our wives, Pam Aselton, Anne-Marie Strohman, and Shelley Wang. The University of Massachusetts Amherst provided material support for the preparation of the book and awarded a Conti Faculty Fellowship to Croft, which sped up our progress significantly. The staff at the Center for Intelligent Information Retrieval (Jean Joyce, Kate Moruzzi, Glenn Stowell, and Andre Gauthier) made our lives easier in many ways, and our colleagues and students in the Center provided the stimulating environment that makes working in this area so rewarding. A number of people reviewed parts of the book and we appreciated their comments. Finally, we have to mention our children, Doug, Eric, Evan, and Natalie, or they would never forgive us.

W.B.C.
D.M.
T.S.

2015 Update

This version of the book is being made available for free download. It has been edited to correct the minor errors noted in the 5 years since the book's publication. The authors, meanwhile, are working on a second edition.


Contents

1 Search Engines and Information Retrieval
  1.1 What Is Information Retrieval?
  1.2 The Big Issues
  1.3 Search Engines
  1.4 Search Engineers

2 Architecture of a Search Engine
  2.1 What Is an Architecture?
  2.2 Basic Building Blocks
  2.3 Breaking It Down
    2.3.1 Text Acquisition
    2.3.2 Text Transformation
    2.3.3 Index Creation
    2.3.4 User Interaction
    2.3.5 Ranking
    2.3.6 Evaluation
  2.4 How Does It Really Work?

3 Crawls and Feeds
  3.1 Deciding What to Search
  3.2 Crawling the Web
    3.2.1 Retrieving Web Pages
    3.2.2 The Web Crawler
    3.2.3 Freshness
    3.2.4 Focused Crawling
    3.2.5 Deep Web
    3.2.6 Sitemaps
    3.2.7 Distributed Crawling
  3.3 Crawling Documents and Email
  3.4 Document Feeds
  3.5 The Conversion Problem
    3.5.1 Character Encodings
  3.6 Storing the Documents
    3.6.1 Using a Database System
    3.6.2 Random Access
    3.6.3 Compression and Large Files
    3.6.4 Update
    3.6.5 BigTable
  3.7 Detecting Duplicates
  3.8 Removing Noise

4 Processing Text
  4.1 From Words to Terms
  4.2 Text Statistics
    4.2.1 Vocabulary Growth
    4.2.2 Estimating Collection and Result Set Sizes
  4.3 Document Parsing
    4.3.1 Overview
    4.3.2 Tokenizing
    4.3.3 Stopping
    4.3.4 Stemming
    4.3.5 Phrases and N-grams
  4.4 Document Structure and Markup
  4.5 Link Analysis
    4.5.1 Anchor Text
    4.5.2 PageRank
    4.5.3 Link Quality
  4.6 Information Extraction
    4.6.1 Hidden Markov Models for Extraction
  4.7 Internationalization

5 Ranking with Indexes
  5.1 Overview
  5.2 Abstract Model of Ranking
  5.3 Inverted Indexes
    5.3.1 Documents
    5.3.2 Counts
    5.3.3 Positions
    5.3.4 Fields and Extents
    5.3.5 Scores
    5.3.6 Ordering
  5.4 Compression
    5.4.1 Entropy and Ambiguity
    5.4.2 Delta Encoding
    5.4.3 Bit-Aligned Codes
    5.4.4 Byte-Aligned Codes
    5.4.5 Compression in Practice
    5.4.6 Looking Ahead
    5.4.7 Skipping and Skip Pointers
  5.5 Auxiliary Structures
  5.6 Index Construction
    5.6.1 Simple Construction
    5.6.2 Merging
    5.6.3 Parallelism and Distribution
    5.6.4 Update
  5.7 Query Processing
    5.7.1 Document-at-a-time Evaluation
    5.7.2 Term-at-a-time Evaluation
    5.7.3 Optimization Techniques
    5.7.4 Structured Queries
    5.7.5 Distributed Evaluation
    5.7.6 Caching

6 Queries and Interfaces
  6.1 Information Needs and Queries
  6.2 Query Transformation and Refinement
    6.2.1 Stopping and Stemming Revisited
    6.2.2 Spell Checking and Suggestions
    6.2.3 Query Expansion
    6.2.4 Relevance Feedback
    6.2.5 Context and Personalization
  6.3 Showing the Results
    6.3.1 Result Pages and Snippets
    6.3.2 Advertising and Search
    6.3.3 Clustering the Results
  6.4 Cross-Language Search

7 Retrieval Models
  7.1 Overview of Retrieval Models
    7.1.1 Boolean Retrieval
    7.1.2 The Vector Space Model
  7.2 Probabilistic Models
    7.2.1 Information Retrieval as Classification
    7.2.2 The BM25 Ranking Algorithm
  7.3 Ranking Based on Language Models
    7.3.1 Query Likelihood Ranking
    7.3.2 Relevance Models and Pseudo-Relevance Feedback
  7.4 Complex Queries and Combining Evidence
    7.4.1 The Inference Network Model
    7.4.2 The Galago Query Language
  7.5 Web Search
  7.6 Machine Learning and Information Retrieval
    7.6.1 Learning to Rank
    7.6.2 Topic Models and Vocabulary Mismatch
  7.7 Application-Based Models

8 Evaluating Search Engines
  8.1 Why Evaluate?
  8.2 The Evaluation Corpus
  8.3 Logging
  8.4 Effectiveness Metrics
    8.4.1 Recall and Precision
    8.4.2 Averaging and Interpolation
    8.4.3 Focusing on the Top Documents
    8.4.4 Using Preferences
  8.5 Efficiency Metrics
  8.6 Training, Testing, and Statistics
    8.6.1 Significance Tests
    8.6.2 Setting Parameter Values
    8.6.3 Online Testing
  8.7 The Bottom Line

9 Classification and Clustering
  9.1 Classification and Categorization
    9.1.1 Naïve Bayes
    9.1.2 Support Vector Machines
    9.1.3 Evaluation
    9.1.4 Classifier and Feature Selection
    9.1.5 Spam, Sentiment, and Online Advertising
  9.2 Clustering
    9.2.1 Hierarchical and K-Means Clustering
    9.2.2 K Nearest Neighbor Clustering
    9.2.3 Evaluation
    9.2.4 How to Choose K
    9.2.5 Clustering and Search

10 Social Search
  10.1 What Is Social Search?
  10.2 User Tags and Manual Indexing
    10.2.1 Searching Tags
    10.2.2 Inferring Missing Tags
    10.2.3 Browsing and Tag Clouds
  10.3 Searching with Communities
    10.3.1 What Is a Community?
    10.3.2 Finding Communities
    10.3.3 Community-Based Question Answering
    10.3.4 Collaborative Searching
  10.4 Filtering and Recommending
    10.4.1 Document Filtering
    10.4.2 Collaborative Filtering
  10.5 Peer-to-Peer and Metasearch
    10.5.1 Distributed Search
    10.5.2 P2P Networks

11 Beyond Bag of Words
  11.1 Overview
  11.2 Feature-Based Retrieval Models
  11.3 Term Dependence Models
  11.4 Structure Revisited
    11.4.1 XML Retrieval
    11.4.2 Entity Search
  11.5 Longer Questions, Better Answers
  11.6 Words, Pictures, and Music
  11.7 One Search Fits All?

References

Index

List of Figures

1.1 Search engine design and the core information retrieval issues
2.1 The indexing process
2.2 The query process
3.1 A uniform resource locator (URL), split into three parts
3.2 Crawling the Web. The web crawler connects to web servers to find pages. Pages may link to other pages on the same server or on different servers.
3.3 An example robots.txt file
3.4 A simple crawling thread implementation
3.5 An HTTP HEAD request and server response
3.6 Age and freshness of a single page over time
3.7 Expected age of a page with mean change frequency λ = 1/7 (one week)
3.8 An example sitemap file
3.9 An example RSS 2.0 feed
3.10 An example of text in the TREC Web compound document format
3.11 An example link with anchor text
3.12 BigTable stores data in a single logical table, which is split into many smaller tablets
3.13 A BigTable row
3.14 Example of fingerprinting process
3.15 Example of simhash fingerprinting process
3.16 Main content block in a web page
3.17 Tag counts used to identify text blocks in a web page
3.18 Part of the DOM structure for the example web page
4.1 Rank versus probability of occurrence for words assuming Zipf's law (rank × probability = 0.1)
4.2 A log-log plot of Zipf's law compared to real data from AP89. The predicted relationship between probability of occurrence and rank breaks down badly at high ranks.
4.3 Vocabulary growth for the TREC AP89 collection compared to Heaps' law
4.4 Vocabulary growth for the TREC GOV2 collection compared to Heaps' law
4.5 Result size estimate for web search
4.6 Comparison of stemmer output for a TREC query. Stopwords have also been removed.
4.7 Output of a POS tagger for a TREC query
4.8 Part of a web page from Wikipedia
4.9 HTML source for example Wikipedia page
4.10 A sample "Internet" consisting of just three web pages. The arrows denote links between the pages.
4.11 Pseudocode for the iterative PageRank algorithm
4.12 Trackback links in blog postings
4.13 Text tagged by information extraction
4.14 Sentence model for statistical entity extractor
4.15 Chinese segmentation and bigrams
5.1 The components of the abstract model of ranking: documents, features, queries, the retrieval function, and document scores
5.2 A more concrete model of ranking. Notice how both the query and the document have feature functions in this model.
5.3 An inverted index for the documents (sentences) in Table 5.1
5.4 An inverted index, with word counts, for the documents in Table 5.1
5.5 An inverted index, with word positions, for the documents in Table 5.1
5.6 Aligning posting lists for "tropical" and "fish" to find the phrase "tropical fish"
5.7 Aligning posting lists for "fish" and title to find matches of the word "fish" in the title field of a document
5.8 Pseudocode for a simple indexer
5.9 An example of index merging. The first and second indexes are merged together to produce the combined index.
5.10 MapReduce
5.11 Mapper for a credit card summing algorithm
5.12 Reducer for a credit card summing algorithm
5.13 Mapper for documents
5.14 Reducer for word postings
5.15 Document-at-a-time query evaluation. The numbers (x:y) represent a document number (x) and a word count (y).
5.16 A simple document-at-a-time retrieval algorithm
5.17 Term-at-a-time query evaluation
5.18 A simple term-at-a-time retrieval algorithm
5.19 Skip pointers in an inverted list. The gray boxes show skip pointers, which point into the white boxes, which are inverted list postings.
5.20 A term-at-a-time retrieval algorithm with conjunctive processing
5.21 A document-at-a-time retrieval algorithm with conjunctive processing
5.22 MaxScore retrieval with the query "eucalyptus tree". The gray boxes indicate postings that can be safely ignored during scoring.
5.23 Evaluation tree for the structured query #combine(#od:1(tropical fish) #od:1(aquarium fish) fish)
6.1 Top ten results for the query "tropical fish"
6.2 Geographic representation of Cape Cod using bounding rectangles
6.3 Typical document summary for a web search
6.4 An example of a text span of words (w) bracketed by significant words (s) using Luhn's algorithm
6.5 Advertisements displayed by a search engine for the query "fish tanks"
6.6 Clusters formed by a search engine from top-ranked documents for the query "tropical fish". Numbers in brackets are the number of documents in the cluster.
6.7 Categories returned for the query "tropical fish" in a popular online retailer
6.8 Subcategories and facets for the "Home & Garden" category
6.9 Cross-language search
6.10 A French web page in the results list for the query "pecheur france"
7.1 Term-document matrix for a collection of four documents
7.2 Vector representation of documents and queries
7.3 Classifying a document as relevant or non-relevant
7.4 Example inference network model
7.5 Inference network with three nodes
7.6 Galago query for the dependence model
7.7 Galago query for web data
8.1 Example of a TREC topic
8.2 Recall and precision values for two rankings of six relevant documents
8.3 Recall and precision values for rankings from two different queries
8.4 Recall-precision graphs for two queries
8.5 Interpolated recall-precision graphs for two queries
8.6 Average recall-precision graph using standard recall levels
8.7 Typical recall-precision graph for 50 queries from TREC
8.8 Probability distribution for test statistic values assuming the null hypothesis. The shaded area is the region of rejection for a one-sided test.
8.9 Example distribution of query effectiveness improvements
9.1 Illustration of how documents are represented in the multiple-Bernoulli event space. In this example, there are 10 documents (each with a unique id), two classes (spam and not spam), and a vocabulary that consists of the terms "cheap", "buy", "banking", "dinner", and "the".
9.2 Illustration of how documents are represented in the multinomial event space. In this example, there are 10 documents (each with a unique id), two classes (spam and not spam), and a vocabulary that consists of the terms "cheap", "buy", "banking", "dinner", and "the".
9.3 Data set that consists of two classes (pluses and minuses). The data set on the left is linearly separable, whereas the one on the right is not.
9.4 Graphical illustration of Support Vector Machines for the linearly separable case. Here, the hyperplane defined by w is shown, as well as the margin, the decision regions, and the support vectors, which are indicated by circles.
9.5 Generative process used by the Naïve Bayes model. First, a class is chosen according to P(c), and then a document is chosen according to P(d|c).
9.6 Example data set where non-parametric learning algorithms, such as a nearest neighbor classifier, may outperform parametric algorithms. The pluses and minuses indicate positive and negative training examples, respectively. The solid gray line shows the actual decision boundary, which is highly non-linear.
9.7 Example output of SpamAssassin email spam filter
9.8 Example of web page spam, showing the main page and some of the associated term and link spam
9.9 Example product review incorporating sentiment
9.10 Example semantic class match between a web page about rainbow fish (a type of tropical fish) and an advertisement for tropical fish food. The nodes "Aquariums", "Fish", and "Supplies" are example nodes within a semantic hierarchy. The web page is classified as "Aquariums - Fish" and the ad is classified as "Supplies - Fish". Here, "Aquariums" is the least common ancestor. Although the web page and ad do not share any terms in common, they can be matched because of their semantic similarity.
9.11 Example of divisive clustering with K = 4. The clustering proceeds from left to right and top to bottom, resulting in four clusters.
9.12 Example of agglomerative clustering with K = 4. The clustering proceeds from left to right and top to bottom, resulting in four clusters.
9.13 Dendrogram that illustrates the agglomerative clustering of the points from Figure 9.12
9.14 Examples of clusters in a graph formed by connecting nodes representing instances. A link represents a distance between the two instances that is less than some threshold value.
9.15 Illustration of how various clustering cost functions are computed
9.16 Example of overlapping clustering using nearest neighbor clustering with K = 5. The overlapping clusters for the black points (A, B, C, and D) are shown. The five nearest neighbors for each black point are shaded gray and labeled accordingly.
9.17 Example of overlapping clustering using Parzen windows. The clusters for the black points (A, B, C, and D) are shown. The shaded circles indicate the windows used to determine cluster membership. The neighbors for each black point are shaded gray and labeled accordingly.
9.18 Cluster hypothesis tests on two TREC collections. The top two compare the distributions of similarity values between relevant-relevant and relevant-nonrelevant pairs (light gray) of documents. The bottom two show the local precision of the relevant documents.
10.1 Search results used to enrich a tag representation. In this example, the tag being expanded is "tropical fish". The query "tropical fish" is run against a search engine, and the snippets returned are then used to generate a distribution over related terms.
10.2 Example of a tag cloud in the form of a weighted list. The tags are in alphabetical order and weighted according to some criteria, such as popularity.
10.3 Illustration of the HITS algorithm. Each row corresponds to a single iteration of the algorithm and each column corresponds to a specific step of the algorithm.
10.4 Example of how nodes within a directed graph can be represented as vectors. For a given node p, its vector representation has component q set to 1 if p → q.
10.5 Overview of the two common collaborative search scenarios. On the left is co-located collaborative search, which involves multiple participants in the same location at the same time. On the right is remote collaborative search, where participants are in different locations and not necessarily all online and searching at the same time.
10.6 Example of a static filtering system. Documents arrive over time and are compared against each profile. Arrows from documents to profiles indicate the document matches the profile and is retrieved.
10.7 Example of an adaptive filtering system. Documents arrive over time and are compared against each profile. Arrows from documents to profiles indicate the document matches the profile and is retrieved. Unlike static filtering, where profiles are static over time, profiles are updated dynamically (e.g., when a new match occurs).
10.8 A set of users within a recommender system. Users and their ratings for some item are given. Users with question marks above their heads have not yet rated the item. It is the goal of the recommender system to fill in these question marks.
10.9 Illustration of collaborative filtering using clustering. Groups of similar users are outlined with dashed lines. Users and their ratings for some item are given. In each group, there is a single user who has not judged the item. For these users, the unjudged item is assigned an automatic rating based on the ratings of similar users.
10.10 Metasearch engine architecture. The query is broadcast to multiple web search engines and result lists are merged.
10.11 Network architectures for distributed search: (a) central hub; (b) pure P2P; and (c) hierarchical P2P. Dark circles are hub or superpeer nodes, gray circles are provider nodes, and white circles are consumer nodes.
10.12 Neighborhoods (Ni) of a hub node (H) in a hierarchical P2P network
11.1 Example Markov Random Field model assumptions, including full independence (top left), sequential dependence (top right), full dependence (bottom left), and general dependence (bottom right)
11.2 Graphical model representations of the relevance model technique (top) and latent concept expansion (bottom) used for pseudo-relevance feedback with the query "hubble telescope achievements"
11.3 Functions provided by a search engine interacting with a simple database system
11.4 Example of an entity search for organizations using the TREC Wall Street Journal 1987 Collection
11.5 Question answering system architecture
11.6 Examples of OCR errors
11.7 Examples of speech recognizer errors
11.8 Two images (a fish and a flower bed) with color histograms. The horizontal axis is hue value.
11.9 Three examples of content-based image retrieval. The collection for the first two consists of 1,560 images of cars, faces, apes, and other miscellaneous subjects. The last example is from a collection of 2,048 trademark images. In each case, the leftmost image is the query.
11.10 Key frames extracted from a TREC video clip
11.11 Examples of automatic text annotation of images
11.12 Three representations of Bach's "Fugue #10": audio, MIDI, and conventional music notation

List of Tables

1.1 Some dimensions of information retrieval
3.1 UTF-8 encoding
4.1 Statistics for the AP89 collection
4.2 Most frequent 50 words from AP89
4.3 Low-frequency words from AP89
4.4 Example word frequency ranking
4.5 Proportions of words occurring n times in 336,310 documents from the TREC Volume 3 corpus. The total vocabulary size (number of unique words) is 508,209.
4.6 Document frequencies and estimated frequencies for word combinations (assuming independence) in the GOV2 Web collection. Collection size (N) is 25,205,179.
4.7 Examples of errors made by the original Porter stemmer. False positives are pairs of words that have the same stem. False negatives are pairs that have different stems.
4.8 Examples of words with the Arabic root ktb
4.9 High-frequency noun phrases from a TREC collection and U.S. patents from 1996
4.10 Statistics for the Google n-gram sample
5.1 Four sentences from the Wikipedia entry for tropical fish
5.2 Elias-γ code examples
5.3 Elias-δ code examples
5.4 Space requirements for numbers encoded in v-byte
5.5 Sample encodings for v-byte
5.6 Skip lengths (k) and expected processing steps
6.1 Partial entry for the Medical Subject (MeSH) Heading "Neck Pain"
6.2 Term association measures
6.3 Most strongly associated words for "tropical" in a collection of TREC news stories. Co-occurrence counts are measured at the document level.
6.4 Most strongly associated words for "fish" in a collection of TREC news stories. Co-occurrence counts are measured at the document level.
6.5 Most strongly associated words for "fish" in a collection of TREC news stories. Co-occurrence counts are measured in windows of five words.
7.1 Contingency table of term occurrences for a particular query
7.2 BM25 scores for an example document
7.3 Query likelihood scores for an example document
7.4 Highest-probability terms from relevance model for four example queries (estimated using top 10 documents)
7.5 Highest-probability terms from relevance model for four example queries (estimated using top 50 documents)
7.6 Conditional probabilities for example network
7.7 Highest-probability terms from four topics in LDA model
8.1 Statistics for three example text collections. The average number of words per document is calculated without stemming.
8.2 Statistics for queries from example text collections
8.3 Sets of documents defined by a simple search with binary relevance
8.4 Precision values at standard recall levels calculated using interpolation
8.5 Definitions of some important efficiency metrics
8.6 Artificial effectiveness data for two retrieval algorithms (A and B) over 10 queries. The column B – A gives the difference in effectiveness.
9.1 A list of kernels that are typically used with SVMs. For each kernel, the name, value, and implicit dimensionality are given.
10.1 Example questions submitted to Yahoo! Answers
10.2 Translations automatically learned from a set of question and answer pairs. The 10 most likely translations for the terms "everest", "xp", and "search" are given.
10.3 Summary of static and adaptive filtering models. For each, the profile representation and profile updating algorithm are given.
10.4 Contingency table for the possible outcomes of a filtering system. Here, TP (true positive) is the number of relevant documents retrieved, FN (false negative) is the number of relevant documents not retrieved, FP (false positive) is the number of non-relevant documents retrieved, and TN (true negative) is the number of non-relevant documents not retrieved.
11.1 Most likely one- and two-word concepts produced using latent concept expansion with the top 25 documents retrieved for the query "hubble telescope achievements" on the TREC ROBUST collection
11.2 Example TREC QA questions and their corresponding question categories


1 Search Engines and Information Retrieval

"Mr. Helpmann, I'm keen to get into Information Retrieval."
Sam Lowry, Brazil

1.1 What Is Information Retrieval?

This book is designed to help people understand search engines, evaluate and compare them, and modify them for specific applications. Searching for information on the Web is, for most people, a daily activity. Search and communication are by far the most popular uses of the computer. Not surprisingly, many people in companies and universities are trying to improve search by coming up with easier and faster ways to find the right information. These people, whether they call themselves computer scientists, software engineers, information scientists, search engine optimizers, or something else, are working in the field of Information Retrieval.[1]

So, before we launch into a detailed journey through the internals of search engines, we will take a few pages to provide a context for the rest of the book. Gerard Salton, a pioneer in information retrieval and one of the leading figures from the 1960s to the 1990s, proposed the following definition in his classic 1968 textbook (Salton, 1968):

Information retrieval is a field concerned with the structure, analysis, organization, storage, searching, and retrieval of information.

Despite the huge advances in the understanding and technology of search in the past 40 years, this definition is still appropriate and accurate. The term "information" is very general, and information retrieval includes work on a wide range of types of information and a variety of applications related to search.

[1] Information retrieval is often abbreviated as IR. In this book, we mostly use the full term. This has nothing to do with the fact that many people think IR means "infrared" or something else.

The primary focus of the field since the 1950s has been on text and text documents. Web pages, email, scholarly papers, books, and news stories are just a few of the many examples of documents. All of these documents have some amount of structure, such as the title, author, date, and abstract information associated with the content of papers that appear in scientific journals. The elements of this structure are called attributes, or fields, when referring to database records.

The important distinction between a document and a typical database record, such as a bank account record or a flight reservation, is that most of the information in the document is in the form of text, which is relatively unstructured. To illustrate this difference, consider the information contained in two typical attributes of an account record, the account number and current balance. Both are very well defined, both in terms of their format (for example, a six-digit integer for an account number and a real number with two decimal places for balance) and their meaning. It is very easy to compare values of these attributes, and consequently it is straightforward to implement algorithms to identify the records that satisfy queries such as "Find account number 321456" or "Find accounts with balances greater than $50,000.00".

Now consider a news story about the merger of two banks. The story will have some attributes, such as the headline and source of the story, but the primary content is the story itself. In a database system, this critical piece of information would typically be stored as a single large attribute with no internal structure. Most of the queries submitted to a web search engine such as Google[2] that relate to this story will be of the form "bank merger" or "bank takeover". To do this search, we must design algorithms that can compare the text of the queries with the text of the story and decide whether the story contains the information that is being sought. Defining the meaning of a word, a sentence, a paragraph, or a whole news story is much more difficult than defining an account number, and consequently comparing text is not easy. Understanding and modeling how people compare texts, and designing computer algorithms to accurately perform this comparison, is at the core of information retrieval.

Increasingly, applications of information retrieval involve multimedia documents with structure, significant text content, and other media. Popular information media include pictures, video, and audio, including music and speech.

[2] http://www.google.com
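To make the contrast in the bank example above concrete, the following sketch (not from the book; the class and method names, such as AccountRecord and overlapScore, and the story text are invented for illustration) compares an exact-match test over well-defined attributes with a crude word-overlap score between a query and the text of a news story. Real retrieval models, such as those described in Chapter 7, score documents far more carefully than by counting shared words.

    import java.util.*;

    // A hypothetical sketch contrasting exact attribute matching with rough text matching.
    // All names and data are invented for illustration only.
    public class MatchingSketch {

        // A structured record: attributes have precise formats and meanings,
        // so a query such as "balances greater than $50,000.00" is trivial to evaluate.
        record AccountRecord(int accountNumber, double balance) {}

        static boolean matchesBalanceQuery(AccountRecord r, double threshold) {
            return r.balance() > threshold;  // exact, unambiguous comparison
        }

        // An unstructured document: the only evidence is the text itself,
        // so we fall back to comparing the words of the query with the words of the story.
        static long overlapScore(String query, String document) {
            Set<String> queryWords = new HashSet<>(Arrays.asList(query.toLowerCase().split("\\W+")));
            return Arrays.stream(document.toLowerCase().split("\\W+"))
                         .filter(queryWords::contains)
                         .count();  // crude proxy for a topical match
        }

        public static void main(String[] args) {
            System.out.println(matchesBalanceQuery(new AccountRecord(321456, 75000.0), 50000.0)); // true

            String story = "The merger of the two banks was announced today";
            System.out.println(overlapScore("bank merger", story));   // 1: only "merger" matches
            System.out.println(overlapScore("bank takeover", story)); // 0, although the story is relevant
        }
    }

Even this tiny example hints at why naive matching produces poor results: the story never contains the word "takeover", and "banks" does not match "bank", which is exactly the kind of vocabulary mismatch discussed in Section 1.2.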

In some applications, such as in legal support, scanned document images are also important. These media have content that, like text, is difficult to describe and compare. The current technology for searching non-text documents relies on text descriptions of their content rather than the contents themselves, but progress is being made on techniques for direct comparison of images, for example.

In addition to a range of media, information retrieval involves a range of tasks and applications. The usual search scenario involves someone typing in a query to a search engine and receiving answers in the form of a list of documents in ranked order. Although searching the World Wide Web (web search) is by far the most common application involving information retrieval, search is also a crucial part of applications in corporations, government, and many other domains. Vertical search is a specialized form of web search where the domain of the search is restricted to a particular topic. Enterprise search involves finding the required information in the huge variety of computer files scattered across a corporate intranet. Web pages are certainly a part of that distributed information store, but most information will be found in sources such as email, reports, presentations, spreadsheets, and structured data in corporate databases. Desktop search is the personal version of enterprise search, where the information sources are the files stored on an individual computer, including email messages and web pages that have recently been browsed. Peer-to-peer search involves finding information in networks of nodes or computers without any centralized control. This type of search began as a file sharing tool for music but can be used in any community based on shared interests, or even shared locality in the case of mobile devices. Search and related information retrieval techniques are used for advertising, for intelligence analysis, for scientific discovery, for health care, for customer support, for real estate, and so on. Any application that involves a collection[3] of text or other unstructured information will need to organize and search that information.

Search based on a user query (sometimes called ad hoc search because the range of possible queries is huge and not prespecified) is not the only text-based task that is studied in information retrieval. Other tasks include filtering, classification, and question answering. Filtering or tracking involves detecting stories of interest based on a person's interests and providing an alert using email or some other mechanism.

[3] The term database is often used to refer to a collection of either structured or unstructured data. To avoid confusion, we mostly use the term document collection (or just collection) for text. However, the terms web database and search engine database are so common that we occasionally use them in this book.

(such as the categories listed in the Yahoo! Directory) and automatically assigns those labels to documents. Question answering is similar to search but is aimed at more specific questions, such as “What is the height of Mt. Everest?”. The goal of question answering is to return a specific answer found in the text, rather than a list of documents. Table 1.1 summarizes some of these aspects or dimensions of the field of information retrieval.

Examples of Content     Examples of Applications   Examples of Tasks
Text                    Web search                 Ad hoc search
Images                  Vertical search            Filtering
Video                   Enterprise search          Classification
Scanned documents       Desktop search             Question answering
Audio                   Peer-to-peer search
Music

Table 1.1. Some dimensions of information retrieval

1.2 The Big Issues

Information retrieval researchers have focused on a few key issues that remain just as important in the era of commercial web search engines working with billions of web pages as they were when tests were done in the 1960s on document collections containing about 1.5 megabytes of text. One of these issues is relevance.

Relevance is a fundamental concept in information retrieval. Loosely speaking, a relevant document contains the information that a person was looking for when she submitted a query to the search engine. Although this sounds simple, there are many factors that go into a person’s decision as to whether a particular document is relevant. These factors must be taken into account when designing algorithms for comparing text and ranking documents. Simply comparing the text of a query with the text of a document and looking for an exact match, as might be done in a database system or using the grep utility in Unix, produces very poor results in terms of relevance. One obvious reason for this is that language can be used to express

4 http://dir.yahoo.com/

the same concepts in many different ways, often with very different words. This is referred to as the vocabulary mismatch problem in information retrieval.

It is also important to distinguish between topical relevance and user relevance. A text document is topically relevant to a query if it is on the same topic. For example, a news story about a tornado in Kansas would be topically relevant to the query “severe weather events”. The person who asked the question (often called the user) may not consider the story relevant, however, if she has seen that story before, or if the story is five years old, or if the story is in Chinese from a Chinese news agency. User relevance takes these additional features of the story into account.

To address the issue of relevance, researchers propose retrieval models and test how well they work. A retrieval model is a formal representation of the process of matching a query and a document. It is the basis of the ranking algorithm that is used in a search engine to produce the ranked list of documents. A good retrieval model will find documents that are likely to be considered relevant by the person who submitted the query. Some retrieval models focus on topical relevance, but a search engine deployed in a real environment must use ranking algorithms that incorporate user relevance.

An interesting feature of the retrieval models used in information retrieval is that they typically model the statistical properties of text rather than the linguistic structure. This means, for example, that the ranking algorithms are typically far more concerned with the counts of word occurrences than whether the word is a noun or an adjective. More advanced models do incorporate linguistic features, but they tend to be of secondary importance. The use of word frequency information to represent text started with another information retrieval pioneer, H.P. Luhn, in the 1950s. This view of text did not become popular in other fields of computer science, such as natural language processing, until the 1990s.

Another core issue for information retrieval is evaluation. Since the quality of a document ranking depends on how well it matches a person’s expectations, it was necessary early on to develop evaluation measures and experimental procedures for acquiring this data and using it to compare ranking algorithms. Cyril Cleverdon led the way in developing evaluation methods in the early 1960s, and two of the measures he used, precision and recall, are still popular. Precision is a very intuitive measure, and is the proportion of retrieved documents that are relevant. Recall is the proportion of relevant documents that are retrieved.
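As a concrete illustration (a sketch written for this transcript, not code from Galago), the short Java program below computes precision and recall for a single query, given a ranked result list and a set of known relevant documents; the document identifiers are made up for the example.

import java.util.*;

public class PrecisionRecall {
    public static void main(String[] args) {
        // Hypothetical ranked results returned by a search engine
        List<String> retrieved = Arrays.asList("d3", "d7", "d1", "d9", "d4");
        // Hypothetical relevance judgments for the query
        Set<String> relevant = new HashSet<>(Arrays.asList("d1", "d3", "d5", "d8"));

        int relevantRetrieved = 0;
        for (String doc : retrieved) {
            if (relevant.contains(doc)) {
                relevantRetrieved++;
            }
        }

        // Precision: fraction of retrieved documents that are relevant
        double precision = (double) relevantRetrieved / retrieved.size();
        // Recall: fraction of relevant documents that are retrieved
        double recall = (double) relevantRetrieved / relevant.size();

        System.out.println("Precision = " + precision);  // 2 of 5 retrieved are relevant: 0.4
        System.out.println("Recall    = " + recall);     // 2 of 4 relevant are retrieved: 0.5
    }
}

Note that both measures treat the result list as an unordered set; measures that take the ranking into account are described in Chapter 8.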

When the recall measure is used, there is an assumption that all the relevant documents for a given query are known. Such an assumption is clearly problematic in a web search environment, but with smaller test collections of documents, this measure can be useful. A test collection for information retrieval experiments consists of a collection of text documents, a sample of typical queries, and a list of relevant documents for each query (the relevance judgments). The best-known test collections are those associated with the TREC evaluation forum.

Evaluation of retrieval models and search engines is a very active area, with much of the current focus on using large volumes of log data from user interactions, such as clickthrough data, which records the documents that were clicked on during a search session. Clickthrough and other log data is strongly correlated with relevance so it can be used to evaluate search, but search engine companies still use relevance judgments in addition to log data to ensure the validity of their results.

The third core issue for information retrieval is the emphasis on users and their information needs. This should be clear given that the evaluation of search is user-centered. That is, the users of a search engine are the ultimate judges of quality. This has led to numerous studies on how people interact with search engines and, in particular, to the development of techniques to help people express their information needs. An information need is the underlying cause of the query that a person submits to a search engine. In contrast to a request to a database system, such as for the balance of a bank account, text queries are often poor descriptions of what the user actually wants. A one-word query such as “cats” could be a request for information on where to buy cats or for a description of the Broadway musical. Despite their lack of specificity, however, one-word queries are very common in web search. Techniques such as query suggestion, query expansion, and relevance feedback use interaction and context to refine the initial query in order to produce better ranked lists.

These issues will come up throughout this book, and will be discussed in considerably more detail. We now have sufficient background to start talking about the main product of research in information retrieval—namely, search engines.

1.3 Search Engines

A search engine is the practical application of information retrieval techniques to large-scale text collections. A web search engine is the obvious example, but as

5 Also known as an evaluation corpus (plural corpora).
6 Text REtrieval Conference—http://trec.nist.gov/

has been mentioned, search engines can be found in many different applications, such as desktop search or enterprise search. Search engines have been around for many years. For example, MEDLINE, the online medical literature search system, started in the 1970s. The term “search engine” was originally used to refer to specialized hardware for text search. From the mid-1980s onward, however, it gradually came to be used in preference to “information retrieval system” as the name for the software system that compares queries to documents and produces ranked result lists of documents. There is much more to a search engine than the ranking algorithm, of course, and we will discuss the general architecture of these systems in the next chapter.

Search engines come in a number of configurations that reflect the applications they are designed for. Web search engines, such as Google and Yahoo!, must be able to capture, or crawl, many terabytes of data, and then provide subsecond response times to millions of queries submitted every day from around the world. Enterprise search engines—for example, Autonomy—must be able to process the large variety of information sources in a company and use company-specific knowledge as part of search and related tasks, such as data mining. Data mining refers to the automatic discovery of interesting structure in data and includes techniques such as clustering. Desktop search engines, such as the Microsoft Vista™ search feature, must be able to rapidly incorporate new documents, web pages, and email as the person creates or looks at them, as well as provide an intuitive interface for searching this very heterogeneous mix of information. There is overlap between these categories with systems such as Google, for example, which is available in configurations for enterprise and desktop search.

Open source search engines are another important class of systems that have somewhat different design goals than the commercial search engines. There are a number of these systems, and the Wikipedia page for information retrieval provides links to many of them. Three systems of particular interest are Lucene, Lemur, and the system provided with this book, Galago. Lucene is a popular Java-based search engine that has been used for a wide range of commercial applications. The information retrieval techniques that it uses are relatively simple.

7 http://www.yahoo.com
8 http://www.autonomy.com
9 http://en.wikipedia.org/wiki/Information_retrieval
10 http://lucene.apache.org
11 http://www.lemurproject.org
12 http://www.search-engines-book.com

Lemur is an open source toolkit that includes the Indri C++-based search engine. Lemur has primarily been used by information retrieval researchers to compare advanced search techniques. Galago is a Java-based search engine that is based on the Lemur and Indri projects. The assignments in this book make extensive use of Galago. It is designed to be fast, adaptable, and easy to understand, and incorporates very effective information retrieval techniques.

The “big issues” in the design of search engines include the ones identified for information retrieval: effective ranking algorithms, evaluation, and user interaction. There are, however, a number of additional critical features of search engines that result from their deployment in large-scale, operational environments. Foremost among these features is the performance of the search engine in terms of measures such as response time, query throughput, and indexing speed. Response time is the delay between submitting a query and receiving the result list, throughput measures the number of queries that can be processed in a given time, and indexing speed is the rate at which text documents can be transformed into indexes for searching. An index is a data structure that improves the speed of search. The design of indexes for search engines is one of the major topics in this book.

Another important performance measure is how fast new data can be incorporated into the indexes. Search applications typically deal with dynamic, constantly changing information. Coverage measures how much of the existing information in, say, a corporate information environment has been indexed and stored in the search engine, and recency or freshness measures the “age” of the stored information.

Search engines can be used with small collections, such as a few hundred emails and documents on a desktop, or extremely large collections, such as the entire Web. There may be only a few users of a given application, or many thousands. Scalability is clearly an important issue for search engine design. Designs that work for a given application should continue to work as the amount of data and the number of users grow. In section 1.1, we described how search engines are used in many applications and for many tasks. To do this, they have to be customizable or adaptable. This means that many different aspects of the search engine, such as the ranking algorithm, the interface, or the indexing strategy, must be able to be tuned and adapted to the requirements of the application.

Practical issues that impact search engine design also occur for specific applications. The best example of this is spam in web search. Spam is generally thought of as unwanted email, but more generally it could be defined as misleading, inappropriate, or non-relevant information in a document that is designed for some

commercial benefit. There are many kinds of spam, but one type that search engines must deal with is spam words put into a document to cause it to be retrieved in response to popular queries. The practice of “spamdexing” can significantly degrade the quality of a search engine’s ranking, and web search engine designers have to develop techniques to identify the spam and remove those documents. Figure 1.1 summarizes the major issues involved in search engine design.

[Figure 1.1 pairs the core information retrieval issues with the corresponding search engine issues: on the information retrieval side, Relevance (effective ranking), Evaluation (testing and measuring), and Information needs (user interaction); on the search engine side, Performance (efficient search and indexing), Incorporating new data (coverage and freshness), Scalability (growing with data and users), Adaptability (tuning for applications), and Specific problems (e.g., spam).]

Fig. 1.1. Search engine design and the core information retrieval issues

Based on this discussion of the relationship between information retrieval and search engines, we now consider what roles computer scientists and others play in the design and use of search engines.

1.4 Search Engineers

Information retrieval research involves the development of mathematical models of text and language, large-scale experiments with test collections or users, and a lot of scholarly paper writing. For these reasons, it tends to be done by academics or people in research laboratories. These people are primarily trained in computer science, although information science, mathematics, and, occasionally, social science and computational linguistics are also represented. So who works

34 10 1 Search Engines and Information Retrieval with search engines? To a large extent, it is the same sort of people but with a more practical emphasis. The computing industry has started to use the term search engineer to describe this type of person. Search engineers are primarily people trained in computer science, mostly with a systems or database background. Sur- prisingly few of them have training in information retrieval, which is one of the major motivations for this book. What is the role of a search engineer? Certainly the people who work in the major web search companies designing and implementing new search engines are search engineers, but the majority of search engineers are the people who modify, extend, maintain, or tune existing search engines for a wide range of commercial applications. People who design or “optimize” content for search engines are also search engineers, as are people who implement techniques to deal with spam. The search engines that search engineers work with cover the entire range mentioned in the last section: they primarily use open source and enterprise search engines for application development, but also get the most out of desktop and web search engines. The importance and pervasiveness of search in modern computer applications has meant that search engineering has become a crucial profession in the com- puter industry. There are, however, very few courses being taught in computer science departments that give students an appreciation of the variety of issues that are involved, especially from the information retrieval perspective. This book is in- tended to give potential search engineers the understanding and tools they need. References and Further Reading In each chapter, we provide some pointers to papers and books that give more detail on the topics that have been covered. This additional reading should not be necessary to understand material that has been presented, but instead will give more background, more depth in some cases, and, for advanced topics, will de- scribe techniques and research results that are not covered in this book. The classic references on information retrieval, in our opinion, are the books by Salton (1968; 1983) and van Rijsbergen (1979). Van Rijsbergen’s book remains 13 popular, since it is available on the Web. All three books provide excellent de- scriptions of the research done in the early years of information retrieval, up to the late 1970s. Salton’s early book was particularly important in terms of defining 13 http://www.dcs.gla.ac.uk/Keith/Preface.html

35 1.4 Search Engineers 11 the field of information retrieval for computer science. More recent books include Baeza-Yates and Ribeiro-Neto (1999) and Manning et al. (2008). Research papers on all the topics covered in this book can be found in the Proceedings of the Association for Computing Machinery (ACM) Special In- terest Group on Information Retrieval (SIGIR) Conference. These proceedings 14 are available on the Web as part of the ACM Digital Library. Good papers on information retrieval and search also appear in the European Conference on Information Retrieval (ECIR), the Conference on Information and Knowl- edge Management (CIKM), and the Web Search and Data Mining Conference (WSDM). The WSDM conference is a spin-off of the World Wide Web Confer- ence (WWW), which has included some important papers on web search. The proceedings from the TREC workshops are available online and contain useful descriptions of new research techniques from many different academic and indus- try groups. An overview of the TREC experiments can be found in Voorhees and Harman (2005). An increasing number of search-related papers are beginning to appear in database conferences, such as VLDB and SIGMOD. Occasional papers also show up in language technology conferences, such as ACL and HLT (As- sociation for Computational Linguistics and Human Language Technologies), machine learning conferences, and others. Exercises 1.1. Think up and write down a small number of queries for a web search engine. Make sure that the queries vary in length (i.e., they are not all one word). Try to specify exactly what information you are looking for in some of the queries. Run these queries on two commercial web search engines and compare the top 10 results for each query by doing relevance judgments. Write a report that an- swers at least the following questions: What is the precision of the results? What is the overlap between the results for the two search engines? Is one search engine clearly better than the other? If so, by how much? How do short queries perform compared to long queries? 1.2. Site search is another common application of search engines. In this case, search is restricted to the web pages at a given website. Compare site search to web search, vertical search, and enterprise search. 14 http://www.acm.org/dl

36 12 1 Search Engines and Information Retrieval 1.3. Use the Web to find as many examples as you can of open source search en- gines, information retrieval systems, or related technology. Give a brief descrip- tion of each search engine and summarize the similarities and differences between them. 1.4. List five web services or sites that you use that appear to use search, not includ- ing web search engines. Describe the role of search for that service. Also describe whether the search is based on a database or grep style of matching, or if the search is using some type of ranking.

37 2 Architecture of a Search Engine “While your first question may be the most per- tinent, you may or may not realize it is also the most irrelevant.” The Architect, Matrix Reloaded 2.1 What Is an Architecture? In this chapter, we describe the basic software architecture of a search engine. Al- though there is no universal agreement on the definition, a software architecture generally consists of software components, the interfaces provided by those com- ponents, and the relationships between them. An architecture is used to describe a system at a particular level of abstraction. An example of an architecture used to provide a standard for integrating search and related language technology compo- 1 UIMA nents is UIMA (Unstructured Information Management Architecture). defines interfaces for components in order to simplify the addition of new tech- nologies into systems that handle text and other unstructured data. Our search engine architecture is used to present high-level descriptions of the important components of the system and the relationships between them. It is not a code-level description, although some of the components do correspond to software modules in the Galago search engine and other systems. We use this architecture in this chapter and throughout the book to provide context to the discussion of specific techniques. An architecture is designed to ensure that a system will satisfy the application requirements or goals. The two primary goals of a search engine are: • Effectiveness (quality): We want to be able to retrieve the most relevant set of documents possible for a query. • Efficiency (speed): We want to process queries from users as quickly as possi- ble. 1 http://www.research.ibm.com/UIMA

We may have more specific goals, too, but usually these fall into the categories of effectiveness or efficiency (or both). For instance, the collection of documents we want to search may be changing; making sure that the search engine immediately reacts to changes in documents is both an effectiveness issue and an efficiency issue.

The architecture of a search engine is determined by these two requirements. Because we want an efficient system, search engines employ specialized data structures that are optimized for fast retrieval. Because we want high-quality results, search engines carefully process text and store text statistics that help improve the relevance of results. Many of the components we discuss in the following sections have been used for decades, and this general design has been shown to be a useful compromise between the competing goals of effective and efficient retrieval. In later chapters, we will discuss these components in more detail.

2.2 Basic Building Blocks

Search engine components support two major functions, which we call the indexing process and the query process. The indexing process builds the structures that enable searching, and the query process uses those structures and a person’s query to produce a ranked list of documents. Figure 2.1 shows the high-level “building blocks” of the indexing process. These major components are text acquisition, text transformation, and index creation.

The task of the text acquisition component is to identify and make available the documents that will be searched. Although in some cases this will involve simply using an existing collection, text acquisition will more often require building a collection by crawling or scanning the Web, a corporate intranet, a desktop, or other sources of information. In addition to passing documents to the next component in the indexing process, the text acquisition component creates a document data store, which contains the text and metadata for all the documents. Metadata is information about a document that is not part of the text content, such as the document type (e.g., email or web page), document structure, and other features, such as document length.

The text transformation component transforms documents into index terms or features. Index terms, as the name implies, are the parts of a document that are stored in the index and used in searching. The simplest index term is a word, but not every word may be used for searching. A “feature” is more often used in the field of machine learning to refer to a part of a text document that is used to represent its content, which also describes an index term.

[Figure 2.1 shows the indexing process: the text acquisition component gathers documents (email, web pages, news articles, memos, letters) and stores them in the document data store; text transformation converts them into index terms; and index creation builds the index.]

Fig. 2.1. The indexing process

Examples of other types of index terms or features are phrases, names of people, dates, and links in a web page. Index terms are sometimes simply referred to as “terms.” The set of all the terms that are indexed for a document collection is called the index vocabulary.

The index creation component takes the output of the text transformation component and creates the indexes or data structures that enable fast searching. Given the large number of documents in many search applications, index creation must be efficient, both in terms of time and space. Indexes must also be able to be efficiently updated when new documents are acquired. Inverted indexes, or sometimes inverted files, are by far the most common form of index used by search engines. An inverted index, very simply, contains a list for every index term of the documents that contain that index term. It is inverted in the sense of being the opposite of a document file that lists, for every document, the index terms they contain. There are many variations of inverted indexes, and the particular form of index used is one of the most important aspects of a search engine.
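To make the idea concrete, here is a minimal sketch (in Java, and not taken from Galago) of building an in-memory inverted index that maps each term to the list of document identifiers containing it; the example documents and the whitespace tokenization are simplifications.

import java.util.*;

public class SimpleInvertedIndex {
    public static void main(String[] args) {
        // Hypothetical document collection: id -> text
        Map<Integer, String> docs = new LinkedHashMap<>();
        docs.put(1, "bank merger announced");
        docs.put(2, "severe weather events");
        docs.put(3, "bank takeover rumors");

        // term -> list of document ids containing the term
        Map<String, List<Integer>> index = new TreeMap<>();
        for (Map.Entry<Integer, String> doc : docs.entrySet()) {
            // Naive tokenization: lowercase and split on whitespace
            for (String term : doc.getValue().toLowerCase().split("\\s+")) {
                List<Integer> postings = index.computeIfAbsent(term, t -> new ArrayList<>());
                // Avoid a duplicate entry when a term repeats within one document
                if (postings.isEmpty() || !postings.get(postings.size() - 1).equals(doc.getKey())) {
                    postings.add(doc.getKey());
                }
            }
        }

        // The list for "bank" is [1, 3]: the documents that contain the term
        System.out.println(index.get("bank"));
        System.out.println(index);
    }
}

A real search engine stores these lists on disk in compressed form and records additional information such as term counts and positions, as described in Chapter 5.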

Figure 2.2 shows the building blocks of the query process. The major components are user interaction, ranking, and evaluation.

[Figure 2.2 shows the query process: the user interaction component accepts the query and, using the document data store, presents the results; the ranking component uses the index to produce the ranked documents; and the evaluation component uses log data to monitor the other components.]

Fig. 2.2. The query process

The user interaction component provides the interface between the person doing the searching and the search engine. One task for this component is accepting the user’s query and transforming it into index terms. Another task is to take the ranked list of documents from the search engine and organize it into the results shown to the user. This includes, for example, generating the snippets used to summarize documents. The document data store is one of the sources of information used in generating the results. Finally, this component also provides a range of techniques for refining the query so that it better represents the information need.

The ranking component is the core of the search engine. It takes the transformed query from the user interaction component and generates a ranked list of documents using scores based on a retrieval model. Ranking must be both efficient, since many queries may need to be processed in a short time, and effective, since the quality of the ranking determines whether the search engine accomplishes the goal of finding relevant information. The efficiency of ranking depends on the indexes, and the effectiveness depends on the retrieval model.

The task of the evaluation component is to measure and monitor effectiveness and efficiency. An important part of that is to record and analyze user behavior using log data. The results of evaluation are used to tune and improve the ranking component. Most of the evaluation component is not part of the online search engine, apart from logging user and system data. Evaluation is primarily an offline activity, but it is a critical part of any search application.

2.3 Breaking It Down

We now look in more detail at the components of each of the basic building blocks. Not all of these components will be part of every search engine, but together they cover what we consider to be the most important functions for a broad range of search applications.

2.3.1 Text Acquisition

Crawler

In many applications, the crawler component has the primary responsibility for identifying and acquiring documents for the search engine. There are a number of different types of crawlers, but the most common is the general web crawler. A web crawler is designed to follow the links on web pages to discover and download new pages. Although this sounds deceptively simple, there are significant challenges in designing a web crawler that can efficiently handle the huge volume of new pages on the Web, while at the same time ensuring that pages that may have changed since the last time a crawler visited a site are kept “fresh” for the search engine. A web crawler can be restricted to a single site, such as a university, as the basis for site search. Focused, or topical, web crawlers use classification techniques to restrict the pages that are visited to those that are likely to be about a specific topic. This type of crawler may be used by a vertical or topical search application, such as a search engine that provides access to medical information on web pages.

For enterprise search, the crawler is adapted to discover and update all documents and web pages related to a company’s operation. An enterprise document crawler follows links to discover both external and internal (i.e., restricted to the corporate intranet) pages, but also must scan both corporate and personal directories to identify email, word processing documents, presentations, database records, and other company information. Document crawlers are also used for desktop search, although in this case only the user’s personal directories need to be scanned.

Feeds

Document feeds are a mechanism for accessing a real-time stream of documents. For example, a news feed is a constant stream of news stories and updates. In contrast to a crawler, which must discover new documents, a search engine acquires

new documents from a feed simply by monitoring it. RSS is a common standard used for web feeds for content such as news, blogs, or video. An RSS “reader” is used to subscribe to RSS feeds, which are formatted using XML. XML is a language for describing data formats, similar to HTML. The reader monitors those feeds and provides new content when it arrives. Radio and television feeds are also used in some search applications, where the “documents” contain automatically segmented audio and video streams, together with associated text from closed captions or speech recognition.

Conversion

The documents found by a crawler or provided by a feed are rarely in plain text. Instead, they come in a variety of formats, such as HTML, XML, Adobe PDF, Microsoft Word™, Microsoft PowerPoint®, and so on. Most search engines require that these documents be converted into a consistent text plus metadata format. In this conversion, the control sequences and non-content data associated with a particular format are either removed or recorded as metadata. In the case of HTML and XML, much of this process can be described as part of the text transformation component. For other formats, the conversion process is a basic step that prepares the document for further processing. PDF documents, for example, must be converted to text. Various utilities are available that perform this conversion, with varying degrees of accuracy. Similarly, utilities are available to convert the various Microsoft Office® formats into text.

Another common conversion problem comes from the way text is encoded in a document. ASCII is a common standard single-byte character encoding scheme used for text. ASCII uses either 7 or 8 bits (extended ASCII) to represent either 128 or 256 possible characters. Some languages, however, such as Chinese, have many more characters than English and use a number of other encoding schemes. Unicode is a standard encoding scheme that uses 16 bits (typically) to represent most of the world’s languages. Any application that deals with documents in different languages has to ensure that they are converted into a consistent encoding scheme before further processing.

2 RSS actually refers to a family of standards with similar names (and the same initials), such as Really Simple Syndication or Rich Site Summary.
3 eXtensible Markup Language
4 HyperText Markup Language
5 American Standard Code for Information Interchange
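As a small illustration of this conversion step (a sketch, not code from any particular search engine), the following Java fragment decodes bytes that arrive in a known legacy encoding into Java's internal Unicode representation and then writes them back out as UTF-8. The source encoding is assumed to be known here; in practice it is detected or read from document metadata, and other encodings (for example, GBK for Chinese text) are handled the same way by name.

import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class EncodingConversion {
    // Decode raw bytes using the encoding they were written in,
    // then re-encode the resulting Unicode string as UTF-8.
    public static byte[] toUtf8(byte[] rawBytes, Charset sourceEncoding) {
        String decoded = new String(rawBytes, sourceEncoding);
        return decoded.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // Example: the word "café" stored in the single-byte Latin-1 encoding
        byte[] latin1Bytes = "café".getBytes(StandardCharsets.ISO_8859_1);
        byte[] utf8Bytes = toUtf8(latin1Bytes, StandardCharsets.ISO_8859_1);
        // "é" is one byte in Latin-1 but two bytes in UTF-8
        System.out.println(latin1Bytes.length + " -> " + utf8Bytes.length);  // 4 -> 5
        System.out.println(new String(utf8Bytes, StandardCharsets.UTF_8));   // café
    }
}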

Document data store

The document data store is a database used to manage large numbers of documents and the structured data that is associated with them. The document contents are typically stored in compressed form for efficiency. The structured data consists of document metadata and other information extracted from the documents, such as links and anchor text (the text associated with a link). A relational database system can be used to store the documents and metadata. Some applications, however, use a simpler, more efficient storage system to provide very fast retrieval times for very large document stores.

Although the original documents are available elsewhere (on the Web or in the enterprise database), the document data store is necessary to provide fast access to the document contents for a range of search engine components. Generating summaries of retrieved documents, for example, would take far too long if the search engine had to access the original documents and reprocess them.

2.3.2 Text Transformation

Parser

The parsing component is responsible for processing the sequence of text tokens in the document to recognize structural elements such as titles, figures, links, and headings. Tokenizing the text is an important first step in this process. In many cases, tokens are the same as words. Both document and query text must be transformed into tokens in the same manner so that they can be easily compared. There are a number of decisions that potentially affect retrieval that make tokenizing non-trivial. For example, a simple definition for tokens could be strings of alphanumeric characters that are separated by spaces. This does not tell us, however, how to deal with special characters such as capital letters, hyphens, and apostrophes. Should we treat “apple” the same as “Apple”? Is “on-line” two words or one word? Should the apostrophe in “O’Connor” be treated the same as the one in “owner’s”? In some languages, tokenizing gets even more interesting. Chinese, for example, has no obvious word separator like a space in English.

Document structure is often specified by a markup language such as HTML or XML. HTML is the default language used for specifying the structure of web pages. XML has much more flexibility and is used as a data interchange format for many applications. The document parser uses knowledge of the syntax of the markup language to identify the structure.

Both HTML and XML use tags to define document elements. For example, <h2>Search</h2> defines “Search” as a second-level heading in HTML. Tags and


other control sequences must be treated appropriately when tokenizing. Other types of documents, such as email and presentations, have a specific syntax and methods for specifying structure, but much of this may be removed or simplified by the conversion component.

Stopping

The stopping component has the simple task of removing common words from the stream of tokens that become index terms. The most common words are typically function words that help form sentence structure but contribute little on their own to the description of the topics covered by the text. Examples are “the”, “of ”, “to”, and “for”. Because they are so common, removing them can reduce the size of the indexes considerably. Depending on the retrieval model that is used as the basis of the ranking, removing these words usually has no impact on the search engine’s effectiveness, and may even improve it somewhat. Despite these potential advantages, it can be difficult to decide how many words to include on the stopword list. Some stopword lists used in research contain hundreds of words. The problem with using such lists is that it becomes impossible to search with queries like “to be or not to be” or “down under”. To avoid this, search applications may use very small stopword lists (perhaps just containing “the”) when processing document text, but then use longer lists for the default processing of query text.

Stemming

Stemming is another word-level transformation. The task of the stemming component (or stemmer) is to group words that are derived from a common stem. Grouping “fish”, “fishes”, and “fishing” is one example. By replacing each member of a group with one designated word (for example, the shortest, which in this case is “fish”), we increase the likelihood that words used in queries and documents will match. Stemming, in fact, generally produces small improvements in ranking effectiveness. Similar to stopping, stemming can be done aggressively, conservatively, or not at all. Aggressive stemming can cause search problems. It may not be appropriate, for example, to retrieve documents about different varieties of fish in response to the query “fishing”.
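To show how these text transformation steps fit together, here is a minimal Java sketch (not Galago code) that tokenizes a string, removes words found in a tiny example stopword list, and applies a deliberately crude stemmer that only strips a trailing “s”. A real system would use a much larger stopword list and a proper stemming algorithm.

import java.util.*;

public class TextTransformation {
    // A very small example stopword list
    private static final Set<String> STOPWORDS =
        new HashSet<>(Arrays.asList("the", "of", "to", "for"));

    // Crude stemmer for illustration only: remove a single trailing "s"
    private static String stem(String token) {
        if (token.length() > 3 && token.endsWith("s")) {
            return token.substring(0, token.length() - 1);
        }
        return token;
    }

    public static List<String> toIndexTerms(String text) {
        List<String> terms = new ArrayList<>();
        // Tokenize: lowercase, split on anything that is not a letter or digit
        for (String token : text.toLowerCase().split("[^a-z0-9]+")) {
            if (token.isEmpty() || STOPWORDS.contains(token)) {
                continue;  // stopping
            }
            terms.add(stem(token));  // stemming
        }
        return terms;
    }

    public static void main(String[] args) {
        System.out.println(toIndexTerms("The owners of the fish"));
        // prints [owner, fish]
    }
}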

Some search applications use more conservative stemming, such as simply identifying plural forms using the letter “s”, or they may do no stemming when processing document text and focus on adding appropriate word variants to the query.

Some languages, such as Arabic, have more complicated morphology than English, and stemming is consequently more important. An effective stemming component in Arabic has a huge impact on search effectiveness. In contrast, there is little word variation in other languages, such as Chinese, and for these languages stemming is not effective.

Link extraction and analysis

Links and the corresponding anchor text in web pages can readily be identified and extracted during document parsing. Extraction means that this information is recorded in the document data store, and can be indexed separately from the general text content. Web search engines make extensive use of this information through link analysis algorithms such as PageRank (Brin & Page, 1998). Link analysis provides the search engine with a rating of the popularity, and to some extent, the authority of a page (in other words, how important it is). Anchor text, which is the clickable text of a web link, can be used to enhance the text content of a page that the link points to. These two factors can significantly improve the effectiveness of web search for some types of queries.

Information extraction

Information extraction is used to identify index terms that are more complex than single words. This may be as simple as words in bold or words in headings, but in general may require significant additional computation. Extracting syntactic features such as noun phrases, for example, requires some form of syntactic analysis or part-of-speech tagging. Research in this area has focused on techniques for extracting features with specific semantic content, such as named entity recognizers, which can reliably identify information such as person names, company names, dates, and locations.

Classifier

The classifier component identifies class-related metadata for documents or parts of documents. This covers a range of functions that are often described separately. Classification techniques assign predefined class labels to documents. These labels typically represent topical categories such as “sports”, “politics”, or “business”.

Two important examples of other types of classification are identifying documents as spam, and identifying the non-content parts of documents, such as advertising. Clustering techniques are used to group related documents without predefined categories. These document groups can be used in a variety of ways during ranking or user interaction.

2.3.3 Index Creation

Document statistics

The task of the document statistics component is simply to gather and record statistical information about words, features, and documents. This information is used by the ranking component to compute scores for documents. The types of data generally required are the counts of index term occurrences (both words and more complex features) in individual documents, the positions in the documents where the index terms occurred, the counts of occurrences over groups of documents (such as all documents labeled “sports” or the entire collection of documents), and the lengths of documents in terms of the number of tokens. The actual data required is determined by the retrieval model and associated ranking algorithm. The document statistics are stored in lookup tables, which are data structures designed for fast retrieval.

Weighting

Index term weights reflect the relative importance of words in documents, and are used in computing scores for ranking. The specific form of a weight is determined by the retrieval model. The weighting component calculates weights using the document statistics and stores them in lookup tables. Weights could be calculated as part of the query process, and some types of weights require information about the query, but by doing as much calculation as possible during the indexing process, the efficiency of the query process will be improved.

One of the most common types used in older retrieval models is known as tf.idf weighting. There are many variations of these weights, but they are all based on a combination of the frequency or count of index term occurrences in a document (the term frequency, or tf) and the frequency of index term occurrence over the entire collection of documents (inverse document frequency, or idf). The idf weight is called inverse document frequency because it gives high weights to terms that occur in very few documents. A typical formula for idf is $\log N/n$, where $N$ is the total number of documents indexed by the search engine and $n$ is the number of documents that contain a particular term.
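As a concrete (and deliberately simplified) illustration of this kind of weight, the Java sketch below computes a basic tf.idf value from a term's count in a document and its document frequency in the collection. The statistics in main are made up, and real retrieval models use smoothed or otherwise modified variants of this formula.

public class TfIdf {
    /**
     * Basic tf.idf weight: term frequency multiplied by log(N / n).
     *
     * @param tf number of times the term occurs in the document
     * @param N  total number of documents in the collection
     * @param n  number of documents that contain the term
     */
    public static double weight(int tf, long N, long n) {
        if (tf == 0 || n == 0) {
            return 0.0;
        }
        double idf = Math.log((double) N / n);
        return tf * idf;
    }

    public static void main(String[] args) {
        // Hypothetical statistics for a collection of 1,000,000 documents
        long N = 1_000_000;
        // "merger" occurs 5 times in the document and in 10,000 documents overall
        System.out.println(weight(5, N, 10_000));   // 5 * log(100): a high weight
        // "the" occurs 20 times but appears in 900,000 documents
        System.out.println(weight(20, N, 900_000)); // 20 * log(1.11): a low weight
    }
}

The example shows why idf matters: a frequent but uninformative word like “the” ends up with a much lower weight than a rarer, more specific word, even though it occurs more often in the document.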

Inversion

The inversion component is the core of the indexing process. Its task is to change the stream of document-term information coming from the text transformation component into term-document information for the creation of inverted indexes. The challenge is to do this efficiently, not only for large numbers of documents when the inverted indexes are initially created, but also when the indexes are updated with new documents from feeds or crawls. The format of the inverted indexes is designed for fast query processing and depends to some extent on the ranking algorithm used. The indexes are also compressed to further enhance efficiency.

Index distribution

The index distribution component distributes indexes across multiple computers and potentially across multiple sites on a network. Distribution is essential for efficient performance with web search engines. By distributing the indexes for a subset of the documents (document distribution), both indexing and query processing can be done in parallel. Distributing the indexes for a subset of terms (term distribution) can also support parallel processing of queries. Replication is a form of distribution where copies of indexes or parts of indexes are stored in multiple sites so that query processing can be made more efficient by reducing communication delays. Peer-to-peer search involves a less organized form of distribution where each node in a network maintains its own indexes and collection of documents.

2.3.4 User Interaction

Query input

The query input component provides an interface and a parser for a query language. The simplest query languages, such as those used in most web search interfaces, have only a small number of operators. An operator is a command in the query language that is used to indicate text that should be treated in a special way. In general, operators help to clarify the meaning of the query by constraining how

text in the document can match text in the query. An example of an operator in a simple query language is the use of quotes to indicate that the enclosed words should occur as a phrase in the document, rather than as individual words with no relationship.

A typical web query, however, consists of a small number of keywords with no operators. A keyword is simply a word that is important for specifying the topic of a query. Because the ranking algorithms for most web search engines are designed for keyword queries, longer queries that may contain a lower proportion of keywords typically do not work well. For example, the query “search engines” may produce a better result with a web search engine than the query “what are typical implementation techniques and data structures used in search engines”. One of the challenges for search engine design is to give good results for a range of queries, and better results for more specific queries.

More complex query languages are available, either for people who want to have a lot of control over the search results or for applications using a search engine. In the same way that the SQL query language (Elmasri & Navathe, 2006) is not designed for the typical user of a database application (the end user), these query languages are not designed for the end users of search applications. Boolean query languages have a long history in information retrieval. The operators in this language include Boolean AND, OR, and NOT, and some form of proximity operator that specifies that words must occur together within a specific distance (usually in terms of word count). Other query languages include these and other operators in a probabilistic framework designed to allow specification of features related to both document structure and content.

Query transformation

The query transformation component includes a range of techniques that are designed to improve the initial query, both before and after producing a document ranking. The simplest processing involves some of the same text transformation techniques used on document text. Tokenizing, stopping, and stemming must be done on the query text to produce index terms that are comparable to the document terms.

Spell checking and query suggestion are query transformation techniques that produce similar output. In both cases, the user is presented with alternatives to the initial query that are likely to either correct spelling errors or be more specific descriptions of their information needs. These techniques often leverage the extensive query logs collected for web applications. Query expansion techniques

also suggest or add additional terms to the query, but usually based on an analysis of term occurrences in documents. This analysis may use different sources of information, such as the whole document collection, the retrieved documents, or documents on the user’s computer. Relevance feedback is a technique that expands queries based on term occurrences in documents that are identified as relevant by the user.

Results output

The results output component is responsible for constructing the display of ranked documents coming from the ranking component. This may include tasks such as generating snippets to summarize the retrieved documents, highlighting important words and passages in documents, clustering the output to identify related groups of documents, and finding appropriate advertising to add to the results display. In applications that involve documents in multiple languages, the results may be translated into a common language.

2.3.5 Ranking

Scoring

The scoring component, also called query processing, calculates scores for documents using the ranking algorithm, which is based on a retrieval model. The designers of some search engines explicitly state the retrieval model they use. For other search engines, only the ranking algorithm is discussed (if any details at all are revealed), but all ranking algorithms are based implicitly on a retrieval model. The features and weights used in a ranking algorithm, which may have been derived empirically (by testing and evaluation), must be related to topical and user relevance, or the search engine would not work.

Many different retrieval models and methods of deriving ranking algorithms have been proposed. The basic form of the document score calculated by many of these models is

$$\sum_i q_i \cdot d_i$$

where the summation is over all of the terms in the vocabulary of the collection, $q_i$ is the query term weight of the $i$th term, and $d_i$ is the document term weight. The term weights depend on the particular retrieval model being used, but are generally similar to tf.idf weights.
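The following Java sketch (an illustration written for this transcript, not the Galago implementation) computes this kind of score for a few documents. Each document is represented by a map from terms to term weights, the query by a map from terms to query term weights, and the score is the sum of the products for the terms they share; the weights themselves are invented for the example.

import java.util.*;

public class DocumentScoring {
    // Score = sum over terms of (query term weight * document term weight)
    public static double score(Map<String, Double> query, Map<String, Double> doc) {
        double sum = 0.0;
        for (Map.Entry<String, Double> q : query.entrySet()) {
            Double docWeight = doc.get(q.getKey());
            if (docWeight != null) {   // terms missing from the document contribute zero
                sum += q.getValue() * docWeight;
            }
        }
        return sum;
    }

    public static void main(String[] args) {
        Map<String, Double> query = Map.of("bank", 1.0, "merger", 1.0);

        // Hypothetical tf.idf-style document term weights
        Map<String, Double> doc1 = Map.of("bank", 4.2, "merger", 6.8, "rates", 1.5);
        Map<String, Double> doc2 = Map.of("bank", 3.1, "loan", 2.2);

        System.out.println("doc1: " + score(query, doc1));  // 1.0*4.2 + 1.0*6.8
        System.out.println("doc2: " + score(query, doc2));  // 1.0*3.1
    }
}

In practice these sums are computed from inverted lists rather than per-document maps, using the term-at-a-time or document-at-a-time strategies described in the performance optimization discussion that follows.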

In Chapter 7, we discuss the ranking algorithms based on the BM25 and query likelihood retrieval models (as well as others) in more detail.

The document scores must be calculated and compared very rapidly in order to determine the ranked order of the documents that are given to the results output component. This is the task of the performance optimization component.

Performance optimization

Performance optimization involves the design of ranking algorithms and the associated indexes to decrease response time and increase query throughput. Given a particular form of document scoring, there are a number of ways to calculate those scores and produce the ranked document output. For example, scores can be computed by accessing the index for a query term, computing the contribution for that term to a document’s score, adding this contribution to a score accumulator, and then accessing the next index. This is referred to as term-at-a-time scoring. Another alternative is to access all the indexes for the query terms simultaneously, and compute scores by moving pointers through the indexes to find the terms present in a document. In this document-at-a-time scoring, the final document score is calculated immediately instead of being accumulated one term at a time. In both cases, further optimizations are possible that significantly decrease the time required to compute the top-ranked documents. Safe optimizations guarantee that the scores calculated will be the same as the scores without optimization. Unsafe optimizations, which do not have this property, can in some cases be faster, so it is important to carefully evaluate the impact of the optimization.

Distribution

Given some form of index distribution, ranking can also be distributed. A query broker decides how to allocate queries to processors in a network and is responsible for assembling the final ranked list for the query. The operation of the broker depends on the form of index distribution. Caching is another form of distribution where indexes or even ranked document lists from previous queries are left in local memory. If the query or index term is popular, there is a significant chance that this information can be reused with substantial time savings.

2.3.6 Evaluation

Logging

Logs of the users’ queries and their interactions with the search engine are one of the most valuable sources of information for tuning and improving search effectiveness and efficiency. Query logs can be used for spell checking, query suggestions, query caching, and other tasks, such as helping to match advertising to searches. Documents in a result list that are clicked on and browsed tend to be relevant. This means that logs of user clicks on documents (clickthrough data) and information such as the dwell time (time spent looking at a document) can be used to evaluate and train ranking algorithms.

Ranking analysis

Given either log data or explicit relevance judgments for a large number of (query, document) pairs, the effectiveness of a ranking algorithm can be measured and compared to alternatives. This is a critical part of improving a search engine and selecting values for parameters that are appropriate for the application. A variety of evaluation measures are commonly used, and these should also be selected to measure outcomes that make sense for the application. Measures that emphasize the quality of the top-ranked documents, rather than the whole list, for example, are appropriate for many types of web queries.

Performance analysis

The performance analysis component involves monitoring and improving overall system performance, in the same way that the ranking analysis component monitors effectiveness. A variety of performance measures are used, such as response time and throughput, but the measures used also depend on the application. For example, a distributed search application should monitor network usage and efficiency in addition to other measures. For ranking analysis, test collections are often used to provide a controlled experimental environment. The equivalent for performance analysis is simulations, where actual networks, processors, storage devices, and data are replaced with mathematical models that can be adjusted using parameters.

2.4 How Does It Really Work?

Now you know the names and the basic functions of the components of a search engine, but we haven’t said much yet about how these components actually perform these functions. That’s what the rest of the book is about. Each chapter describes, in depth, how one or more components work. If you still don’t understand a component after finishing the appropriate chapter, you can study the Galago code, which is one implementation of the ideas presented, or the references described at the end of each chapter.

References and Further Reading

Detailed references on the techniques and models mentioned in the component descriptions will be given in the appropriate chapters. There are a few general references for search architectures. A database textbook, such as Elmasri and Navathe (2006), provides descriptions of database system architecture and the associated query languages that are interesting to compare with the search engine architecture discussed here. There are some similarities at the high level, but database systems focus on structured data and exact match rather than on text and ranking, so most of the components are very different.

The classic research paper on web search engine architecture, which gives an overview of an early version of Google, is Brin and Page (1998). Another system overview for an earlier general-purpose search engine (Inquery) is found in Callan et al. (1992). A comprehensive description of the Lucene architecture and components can be found in Hatcher and Gospodnetic (2004).

Exercises

2.1. Find some examples of the search engine components described in this chapter in the Galago code.

2.2. A more-like-this query occurs when the user can click on a particular document in the result list and tell the search engine to find documents that are similar to this one. Describe which low-level components are used to answer this type of query and the sequence in which they are used.

53 2.4 How Does It Really Work? 29 2.3. Document filtering is an application that stores a large number of queries or user profiles and compares these profiles to every incoming document on a feed. Documents that are sufficiently similar to the profile are forwarded to that person via email or some other mechanism. Describe the architecture of a filtering engine and how it may differ from a search engine.

54

55 3 Crawls and Feeds “You’ve stuck your webs into my business for the last time.” Spider Man 2 Doc Ock, 3.1 Deciding What to Search This book is about the details of building a search engine, from the mathematics behind ranking to the algorithms of query processing. Although we focus heav- ily on the technology that makes search engines work, and great technology can make a good search engine even better, it is the information in the document col- lection that makes search engines useful. In other words, if the right documents are not stored in the search engine, no search technique will be able to find rele- vant information. The title of this section implies the question, “What should we search?” The simple answer is everything you possibly can . Every document answers at least one question (i.e., “Now where was that document again?”), although the best doc- uments answer many more. Every time a search engine adds another document, the number of questions it can answer increases. On the other hand, adding many poor-quality documents increases the burden on the ranking process to find only the best documents to show to the user. Web search engines, however, show how successful search engines can be, even when they contain billions of low-quality documents with little useful content. Even useful documents can become less useful over time. This is especially true of news and financial information where, for example, many people want to know about today’s stock market report, but only a few care about what happened yes- terday. The frustration of finding out-of-date web pages and links in a search re- sult list is, unfortunately, a common experience. Search engines are most effective when they contain the most recent information in addition to archives of older material.

This chapter introduces techniques for finding documents to search, whether on the Web, on a file server, on a computer's hard disk, or in an email program. We will discuss strategies for storing documents and keeping those documents up-to-date. Along the way, we will discuss how to pull data out of files, navigating through issues of character encodings, obsolete file formats, duplicate documents, and textual noise. By the end of this chapter you will have a solid grasp on how to get document data into a search engine, ready to be indexed.

3.2 Crawling the Web

To build a search engine that searches web pages, you first need a copy of the pages that you want to search. Unlike some of the other sources of text we will consider later, web pages are particularly easy to copy, since they are meant to be retrieved over the Internet by browsers. This instantly solves one of the major problems of getting information to search, which is how to get the data from the place it is stored to the search engine.

Finding and downloading web pages automatically is called crawling, and a program that downloads pages is called a web crawler. (Crawling is also occasionally referred to as spidering, and a crawler is sometimes called a spider.) There are some unique challenges to crawling web pages. The biggest problem is the sheer scale of the Web. There are at least tens of billions of pages on the Internet. The "at least" in the last sentence is there because nobody is sure how many pages there are. Even if the number of pages in existence today could be measured exactly, that number would be immediately wrong, because pages are constantly being created. Every time a user adds a new blog post or uploads a photo, another web page is created. Most organizations do not have enough storage space to store even a large fraction of the Web, but web search providers with plenty of resources must still constantly download new content to keep their collections current.

Another problem is that web pages are usually not under the control of the people building the search engine database. Even if you know that you want to copy all the pages from www.company.com, there is no easy way to find out how many pages there are on the site. The owners of that site may not want you to copy some of the data, and will probably be angry if you try to copy it too quickly or too frequently. Some of the data you want to copy may be available only by typing a request into a form, which is a difficult process to automate.

3.2.1 Retrieving Web Pages

Each web page on the Internet has its own unique uniform resource locator, or URL. Any URL used to describe a web page has three parts: the scheme, the hostname, and the resource name (Figure 3.1). Web pages are stored on web servers, which use a protocol called Hypertext Transfer Protocol, or HTTP, to exchange information with client software. Therefore, most URLs used on the Web start with the scheme http, indicating that the URL represents a resource that can be retrieved using HTTP. The hostname follows, which is the name of the computer that is running the web server that holds this web page. In the figure, the computer's name is www.cs.umass.edu, which is a computer in the University of Massachusetts Computer Science department. This URL refers to a page on that computer called /csinfo/people.html.

Fig. 3.1. A uniform resource locator (URL), split into three parts: http://www.cs.umass.edu/csinfo/people.html consists of the scheme (http), the hostname (www.cs.umass.edu), and the resource name (/csinfo/people.html).

Web browsers and web crawlers are two different kinds of web clients, but both fetch web pages in the same way. First, the client program connects to a domain name system (DNS) server. The DNS server translates the hostname into an internet protocol (IP) address. This IP address is a number that is typically 32 bits long, but some networks now use 128-bit IP addresses. The program then attempts to connect to a server computer with that IP address. Since that server might have many different programs running on it, with each one listening to the network for new connections, each program listens on a different port. A port is just a 16-bit number that identifies a particular service. By convention, requests for web pages are sent to port 80 unless specified otherwise in the URL.

Once the connection is established, the client program sends an HTTP request to the web server to request a page. The most common HTTP request type is a GET request, for example:

GET /csinfo/people.html HTTP/1.0

This simple request asks the server to send the page called /csinfo/people.html back to the client, using version 1.0 of the HTTP protocol specification.
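As an illustration, the following Java sketch (not from the book; the class name SimpleFetch is ours) splits a URL into its scheme, hostname, and resource name and then sends a GET request over a plain socket, following the steps described above.

import java.io.*;
import java.net.*;

// A minimal sketch: split a URL into its three parts and issue an HTTP GET
// request over a plain socket. Connecting the socket resolves the hostname
// to an IP address via DNS.
public class SimpleFetch {
    public static void main(String[] args) throws Exception {
        URI uri = new URI("http://www.cs.umass.edu/csinfo/people.html");
        String scheme = uri.getScheme();              // "http"
        String host = uri.getHost();                  // "www.cs.umass.edu"
        String resource = uri.getPath();              // "/csinfo/people.html"
        int port = uri.getPort() == -1 ? 80 : uri.getPort();  // default HTTP port

        System.out.println("scheme=" + scheme + " host=" + host + " resource=" + resource);

        try (Socket socket = new Socket(host, port);
             PrintWriter out = new PrintWriter(socket.getOutputStream());
             BufferedReader in = new BufferedReader(
                     new InputStreamReader(socket.getInputStream()))) {
            // Send the request line, a Host header, and a blank line to end the header.
            out.print("GET " + resource + " HTTP/1.0\r\n");
            out.print("Host: " + host + "\r\n\r\n");
            out.flush();

            // Print the response: a short header followed by the page contents.
            String line;
            while ((line = in.readLine()) != null) {
                System.out.println(line);
            }
        }
    }
}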

After sending a short header, the server sends the contents of that file back to the client. If the client wants more pages, it can send additional requests; otherwise, the client closes the connection.

A client can also fetch web pages using POST requests. A POST request is like a GET request, except that it can send additional request information to the server. By convention, GET requests are used for retrieving data that already exists on the server, whereas POST requests are used to tell the server something. A POST request might be used when you click a button to purchase something or to edit a web page. This convention is useful if you are running a web crawler, since sending only GET requests helps make sure your crawler does not inadvertently order a product.

Fig. 3.2. Crawling the Web. The web crawler (crawler.searchengine.com) connects to web servers, such as www.cs.umass.edu, www.cnn.com, www.bbc.co.uk, and www.whitehouse.gov, to find pages. Pages may link to other pages on the same server or on different servers.

3.2.2 The Web Crawler

Figure 3.2 shows a diagram of the Web from a simple web crawler's perspective. The web crawler has two jobs: downloading pages and finding URLs. The crawler starts with a set of seeds, which are a set of URLs given to it as parameters. These seeds are added to a URL request queue. The crawler starts fetching pages from the request queue. Once a page is downloaded, it is parsed to find link tags that might contain other useful URLs to fetch. If the crawler finds a new URL that it has not seen before, it is added to the crawler's request queue, or frontier. The frontier may be a standard queue, or it may be ordered so that important pages move to the front of the list. This process continues until the crawler either runs out of disk space to store pages or runs out of useful links to add to the request queue.

If a crawler used only a single thread, it would not be very efficient. Notice that the web crawler spends a lot of its time waiting for responses: it waits for the DNS server response, then it waits for the connection to the web server to be acknowledged, and then it waits for the web page data to be sent from the server. During this waiting time, the CPU of the web crawler machine is idle and the network connection is unused. To reduce this inefficiency, web crawlers use threads and fetch hundreds of pages at once.

Fetching hundreds of pages at once is good for the person running the web crawler, but not necessarily good for the person running the web server on the other end. Just imagine how the request queue works in practice. When a web page like www.company.com is fetched, it is parsed and all of the links on that page are added to the request queue. The crawler will then attempt to fetch all of those pages at once. If the web server for www.company.com is not very powerful, it might spend all of its time handling requests from the crawler instead of handling requests from real users. This kind of behavior from web crawlers tends to make web server administrators very angry.

To avoid this problem, web crawlers use politeness policies. Reasonable web crawlers do not fetch more than one page at a time from a particular web server. In addition, web crawlers wait at least a few seconds, and sometimes minutes, between requests to the same web server. This allows web servers to spend the bulk of their time processing real user requests. To support this, the request queue is logically split into a single queue per web server. At any one time, most of these per-server queues are off-limits for crawling, because the crawler has fetched a page from that server recently. The crawler is free to read page requests only from queues that haven't been accessed within the specified politeness window.
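The politeness window can be made concrete with a small Java sketch (not from the book or from Galago; the class name PoliteFrontier and its methods are ours). It keeps one queue per web server and refuses to return a URL for a host that was fetched too recently.

import java.net.URI;
import java.util.*;

// A minimal sketch of a frontier with one queue per web server and a
// politeness window enforced per host.
public class PoliteFrontier {
    private final Map<String, Deque<String>> queues = new HashMap<>();
    private final Map<String, Long> nextAllowedTime = new HashMap<>();
    private final long politenessMillis;

    public PoliteFrontier(long politenessMillis) {
        this.politenessMillis = politenessMillis;
    }

    public void addURL(String url) {
        String host = URI.create(url).getHost();
        queues.computeIfAbsent(host, h -> new ArrayDeque<>()).add(url);
    }

    // Returns a URL from some host whose politeness window has expired, or null
    // if every non-empty queue is still inside its window.
    public String nextURL() {
        long now = System.currentTimeMillis();
        for (Map.Entry<String, Deque<String>> entry : queues.entrySet()) {
            String host = entry.getKey();
            if (!entry.getValue().isEmpty()
                    && nextAllowedTime.getOrDefault(host, 0L) <= now) {
                // Block this host until the politeness window has passed again.
                nextAllowedTime.put(host, now + politenessMillis);
                return entry.getValue().poll();
            }
        }
        return null;
    }
}

A real frontier would also restart the politeness timer when the page download finishes, as discussed below, but the basic bookkeeping is the same.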

When using a politeness window, the request queue must be very large in order to achieve good performance. Suppose a web crawler can fetch 100 pages each second, and that its politeness policy dictates that it cannot fetch more than one page each 30 seconds from a particular web server. The web crawler needs to have URLs from at least 3,000 different web servers in its request queue in order to achieve high throughput. Since many URLs will come from the same servers, the request queue needs to have tens of thousands of URLs in it before a crawler can reach its peak throughput.

User-agent: *
Disallow: /private/
Disallow: /confidential/
Disallow: /other/
Allow: /other/public/

User-agent: FavoredCrawler
Disallow:

Sitemap: http://mysite.com/sitemap.xml.gz

Fig. 3.3. An example robots.txt file

Even crawling a site slowly will anger some web server administrators who object to any copying of their data. Web server administrators who feel this way can store a file called /robots.txt on their web servers. Figure 3.3 contains an example robots.txt file. The file is split into blocks of commands that start with a User-agent: specification. The User-agent: line identifies a crawler, or group of crawlers, affected by the following rules. Following this line are Allow and Disallow rules that dictate which resources the crawler is allowed to access. In the figure, the first block indicates that all crawlers need to ignore resources that begin with /private/, /confidential/, or /other/, except for those that begin with /other/public/. The second block indicates that a crawler named FavoredCrawler gets its own set of rules: it is allowed to copy everything. The final block of the example is an optional Sitemap: directive, which will be discussed later in this section.
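The rules in Figure 3.3 can be checked with a few string comparisons. The following Java sketch is a deliberately simplified illustration, not a full robots.txt parser (real crawlers also handle user-agent matching, wildcards, and rule precedence); the class name RobotsCheck and its methods are ours.

import java.util.*;

// A minimal sketch: given the Allow and Disallow prefixes from one User-agent
// block, decide whether a path may be crawled. Allow rules are checked first,
// which is how /other/public/ is carved out of the disallowed /other/ above.
public class RobotsCheck {
    private final List<String> allows = new ArrayList<>();
    private final List<String> disallows = new ArrayList<>();

    public void addAllow(String prefix) { allows.add(prefix); }
    public void addDisallow(String prefix) { disallows.add(prefix); }

    public boolean permitsCrawl(String path) {
        for (String prefix : allows) {
            if (path.startsWith(prefix)) return true;
        }
        for (String prefix : disallows) {
            if (path.startsWith(prefix)) return false;
        }
        return true;  // not mentioned, so crawling is permitted
    }

    public static void main(String[] args) {
        RobotsCheck rules = new RobotsCheck();
        rules.addDisallow("/private/");
        rules.addDisallow("/confidential/");
        rules.addDisallow("/other/");
        rules.addAllow("/other/public/");
        System.out.println(rules.permitsCrawl("/other/public/page.html"));  // true
        System.out.println(rules.permitsCrawl("/private/notes.html"));      // false
    }
}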

Figure 3.4 shows an implementation of a crawling thread, using the crawler building blocks we have seen so far. Assume that the frontier has been initialized with a few URLs that act as seeds for the crawl.

procedure CrawlerThread(frontier)
    while not frontier.done() do
        website ← frontier.nextSite()
        url ← website.nextURL()
        if website.permitsCrawl(url) then
            text ← retrieveURL(url)
            storeDocument(url, text)
            for each url in parse(text) do
                frontier.addURL(url)
            end for
        end if
        frontier.releaseSite(website)
    end while
end procedure

Fig. 3.4. A simple crawling thread implementation

The crawling thread first retrieves a website from the frontier. The crawler then identifies the next URL in the website's queue. In permitsCrawl, the crawler checks to see if the URL is okay to crawl according to the website's robots.txt file. If it can be crawled, the crawler uses retrieveURL to fetch the document contents. This is the most expensive part of the loop, and the crawler thread may block here for many seconds. Once the text has been retrieved, storeDocument stores the document text in a document database (discussed later in this chapter). The document text is then parsed so that other URLs can be found. These URLs are added to the frontier, which adds them to the appropriate website queues. When all this is finished, the website object is returned to the frontier, which takes care to enforce its politeness policy by not giving the website to another crawler thread until an appropriate amount of time has passed. In a real crawler, the timer would start immediately after the document was retrieved, since parsing and storing the document could take a long time.

3.2.3 Freshness

Web pages are constantly being added, deleted, and modified. To keep an accurate view of the Web, a web crawler must continually revisit pages it has already crawled to see if they have changed in order to maintain the freshness of the document collection. The opposite of a fresh copy is a stale copy, which means a copy that no longer reflects the real content of the web page.

Client request:
HEAD /csinfo/people.html HTTP/1.1
Host: www.cs.umass.edu

Server response:
HTTP/1.1 200 OK
Date: Thu, 03 Apr 2008 05:17:54 GMT
Server: Apache/2.0.52 (CentOS)
Last-Modified: Fri, 04 Jan 2008 15:28:39 GMT
ETag: "239c33-2576-2a2837c0"
Accept-Ranges: bytes
Content-Length: 9590
Connection: close
Content-Type: text/html; charset=ISO-8859-1

Fig. 3.5. An HTTP HEAD request and server response

The HTTP protocol has a special request type called HEAD that makes it easy to check for page changes. The HEAD request returns only header information about the page, but not the page itself. Figure 3.5 contains an example HEAD request and response. The Last-Modified value indicates the last time the page content was changed. Notice that the date is also sent along with the response, as well as in response to a GET request. This allows the web crawler to compare the date it received from a previous GET request with the Last-Modified value from a HEAD request.

A HEAD request reduces the cost of checking on a page, but does not eliminate it. It simply is not possible to check every page every minute. Not only would that attract more negative reactions from web server administrators, but it would cause enormous load on the web crawler and the incoming network connection.

Thankfully, most web pages are not updated every few minutes. Some of them, like news websites, do change frequently. Others, like a person's home page, change much less often. Even within a page type there can be huge variations in the modification rate. For example, some blogs are updated many times a day, whereas others go months between updates. It does little good to continuously check sites that are rarely updated. Therefore, one of the crawler's jobs is to measure the rate at which each page changes. Over time, this data can be used to estimate how frequently each page changes.

Given that a web crawler can't update every page immediately as it changes, the crawler needs to have some metric for measuring crawl freshness. In this chapter, we've used freshness as a general term, but freshness is also the name of a metric.
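A crawler can issue HEAD requests with standard library calls. The Java sketch below (not from the book; the class name FreshnessCheck and the storedLastModified variable are ours) compares the Last-Modified value returned by a HEAD request with a date saved from an earlier GET request.

import java.net.HttpURLConnection;
import java.net.URL;

// A minimal sketch: send an HTTP HEAD request and read the Last-Modified header
// so the crawler can decide whether its stored copy of the page is stale.
public class FreshnessCheck {
    public static void main(String[] args) throws Exception {
        long storedLastModified = 0L;  // value saved from an earlier GET request

        URL url = new URL("http://www.cs.umass.edu/csinfo/people.html");
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setRequestMethod("HEAD");   // headers only, no page body

        long lastModified = conn.getLastModified();  // 0 if the header is missing
        conn.disconnect();

        if (lastModified > storedLastModified) {
            System.out.println("Page has changed; schedule a new GET request.");
        } else {
            System.out.println("Stored copy is still fresh.");
        }
    }
}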

Fig. 3.6. Age and freshness of a single page over time

Under the freshness metric, a page is fresh if the crawl has the most recent copy of a web page, but stale otherwise. Freshness is then the fraction of the crawled pages that are currently fresh.

Keeping freshness high seems like exactly what you'd want to do, but optimizing for freshness can have unintended consequences. Suppose that http://www.example.com is a popular website that changes its front page slightly every minute. Unless your crawler continually polls http://www.example.com, you will almost always have a stale copy of that page. Notice that if you want to optimize for freshness, the appropriate strategy is to stop crawling this site completely! If it will never be fresh, it can't help your freshness value. Instead, you should allocate your crawler's resources to pages that change less frequently. Of course, users will revolt if you decide to optimize your crawler for freshness. They will look at http://www.example.com and wonder why your indexed copy is months out of date.

Age is a better metric to use. You can see the difference between age and freshness in Figure 3.6. In the top part of the figure, you can see that pages become fresh immediately when they are crawled, but once the page changes, the crawled page becomes stale. Under the age metric, the page has age 0 until it is changed, and then its age grows until the page is crawled again.

Suppose we have a page with change frequency λ, meaning that we expect it to change λ times in a one-day period. We can calculate the expected age of a page t days after it was last crawled:

Age(λ, t) = ∫_0^t P(page changed at time x) (t − x) dx

Fig. 3.7. Expected age of a page with mean change frequency λ = 1/7 (one week)

The (t − x) expression is an age: we assume the page is crawled at time t, but that it changed at time x. We multiply that age by the probability that the page actually changed at time x. Studies have shown that, on average, web page updates follow the Poisson distribution, meaning that the time until the next update is governed by an exponential distribution (Cho & Garcia-Molina, 2003). This gives us a formula to plug in for the P(page changed at time x) expression:

Age(λ, t) = ∫_0^t λe^(−λx) (t − x) dx

Figure 3.7 shows the result of plotting this expression for a fixed λ = 1/7, indicating roughly one change a week. Notice how the expected age starts at zero, and rises slowly at first. This is because the page is unlikely to have changed in the first day. As the days go by, the probability that the page has changed increases. By the end of the week, the expected age of the page is about 2.6 days. This means that if your crawler crawls each page once a week, and each page in your collection has a mean update time of once a week, the pages in your index will be 2.6 days old on average just before the crawler runs again.

Notice that the second derivative of the Age function is always positive. That is, the graph is not only increasing, but its rate of increase is always increasing. This positive second derivative means that the older a page gets, the more it costs you to not crawl it. Optimizing this metric will never result in the conclusion that optimizing for freshness does, where sometimes it is economical to not crawl a page at all.
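The expected age formula can be checked numerically. The Java sketch below (ours, not the book's) approximates the integral with a simple midpoint rule; with λ = 1/7 and t = 7 it prints a value close to the 2.6 days quoted above.

// A minimal sketch: numerically evaluate
//   Age(lambda, t) = integral from 0 to t of lambda * e^(-lambda*x) * (t - x) dx
// using the midpoint rule.
public class ExpectedAge {
    static double age(double lambda, double t) {
        int steps = 100000;
        double dx = t / steps, sum = 0.0;
        for (int i = 0; i < steps; i++) {
            double x = (i + 0.5) * dx;   // midpoint of the interval
            sum += lambda * Math.exp(-lambda * x) * (t - x) * dx;
        }
        return sum;
    }

    public static void main(String[] args) {
        // Prints roughly 2.6 days for a page that changes about once a week.
        System.out.printf("Expected age after one week: %.2f days%n", age(1.0 / 7.0, 7.0));
    }
}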

65 3.2 Crawling the Web 41 3.2.4 Focused Crawling Some users would like a search engine that focuses on a specific topic of informa- tion. For instance, at a website about movies, users might want access to a search engine that leads to more information about movies. If built correctly, this type of vertical search can provide higher accuracy than general search because of the lack of extraneous information in the document collection. The computational cost of running a vertical search will also be much less than a full web search, simply because the collection will be much smaller. The most accurate way to get web pages for this kind of engine would be to crawl a full copy of the Web and then throw out all unrelated pages. This strategy requires a huge amount of disk space and bandwidth, and most of the web pages will be discarded at the end. A less expensive approach is focused , or topical , crawling. A focused crawler attempts to download only those pages that are about a particular topic. Focused crawlers rely on the fact that pages about a topic tend to have links to other pages on the same topic. If this were perfectly true, it would be possible to start a crawl at one on-topic page, then crawl all pages on that topic just by following links from a single root page. In practice, a number of popular pages for a specific topic are typically used as seeds. Focused crawlers require some automatic means for determining whether a page is about a particular topic. Chapter 9 will introduce text classifiers, which are tools that can make this kind of distinction. Once a page is downloaded, the crawler uses the classifier to decide whether the page is on topic. If it is, the page is kept, and links from the page are used to find other related sites. The anchor text in the outgoing links is an important clue of topicality. Also, some pages have more on-topic links than others. As links from a particular web page are visited, the crawler can keep track of the topicality of the downloaded pages and use this to determine whether to download other similar pages. Anchor text data and page link topicality data can be combined together in order to determine which pages should be crawled next. 3.2.5 Deep Web Not all parts of the Web are easy for a crawler to navigate. Sites that are difficult for a crawler to find are collectively referred to as the deep Web (also called the hidden Web ). Some studies have estimated that the deep Web is over a hundred

times larger than the traditionally indexed Web, although it is very difficult to measure this accurately. Most sites that are a part of the deep Web fall into three broad categories:

• Private sites are intentionally private. They may have no incoming links, or may require you to log in with a valid account before using the rest of the site. These sites generally want to block access from crawlers, although some news publishers may still want their content indexed by major search engines.
• Form results are sites that can be reached only after entering some data into a form. For example, websites selling airline tickets typically ask for trip information on the site's entry page. You are shown flight information only after submitting this trip information. Even though you might want to use a search engine to find flight timetables, most crawlers will not be able to get through this form to get to the timetable information.
• Scripted pages are pages that use JavaScript™, Flash®, or another client-side language in the web page. If a link is not in the raw HTML source of the web page, but is instead generated by JavaScript code running on the browser, the crawler will need to execute the JavaScript on the page in order to find the link. Although this is technically possible, executing JavaScript can slow down the crawler significantly and adds complexity to the system.

Sometimes people make a distinction between static pages and dynamic pages. Static pages are files stored on a web server and displayed in a web browser unmodified, whereas dynamic pages may be the result of code executing on the web server or the client. Typically it is assumed that static pages are easy to crawl, while dynamic pages are hard. This is not quite true, however. Many websites have dynamically generated web pages that are easy to crawl; wikis are a good example of this. Other websites have static pages that are impossible to crawl because they can be accessed only through web forms.

Web administrators of sites with form results and scripted pages often want their sites to be indexed, unlike the owners of private sites. Of these two categories, scripted pages are easiest to deal with. The site owner can usually modify the pages slightly so that links are generated by code on the server instead of by code in the browser. The crawler can also run page JavaScript, or perhaps Flash as well, although these can take a lot of time.

The most difficult problems come with form results. Usually these sites are repositories of changing data, and the form submits a query to a database system. In the case where the database contains millions of records, the site would need to

expose millions of links to a search engine's crawler. Adding a million links to the front page of such a site is clearly infeasible. Another option is to let the crawler guess what to enter into forms, but it is difficult to choose good form input. Even with good guesses, this approach is unlikely to expose all of the hidden data.

3.2.6 Sitemaps

As you can see from the last two sections, the biggest problems in crawling arise because site owners cannot adequately tell crawlers about their sites. In section 3.2.3, we saw how crawlers have to make guesses about when pages will be updated because polling is costly. In section 3.2.5, we saw that site owners sometimes have data that they would like to expose to a search engine, but they can't because there is no reasonable place to store the links. Sitemaps solve both of these problems.

<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>http://www.company.com/</loc>
    <lastmod>2008-01-15</lastmod>
    <changefreq>monthly</changefreq>
    <priority>0.7</priority>
  </url>
  <url>
    <loc>http://www.company.com/items?item=truck</loc>
    <changefreq>weekly</changefreq>
  </url>
  <url>
    <loc>http://www.company.com/items?item=bicycle</loc>
    <changefreq>daily</changefreq>
  </url>
</urlset>

Fig. 3.8. An example sitemap file

A robots.txt file can contain a reference to a sitemap, like the one shown in Figure 3.8. A sitemap contains a list of URLs and data about those URLs, such as modification time and modification frequency.

There are three URL entries shown in the example sitemap. Each one contains a URL in a loc tag. The changefreq tag indicates how often this resource is likely to change. The first entry includes a lastmod tag, which indicates the last time it was changed. The first entry also includes a priority tag with a value of 0.7, which is higher than the default of 0.5. This tells crawlers that this page is more important than other pages on this site.

Why would a web server administrator go to the trouble to create a sitemap? One reason is that it tells search engines about pages it might not otherwise find. Look at the second and third URLs in the sitemap. Suppose these are two product pages. There may not be any links on the website to these pages; instead, the user may have to use a search form to get to them. A simple web crawler will not attempt to enter anything into a form (although some advanced crawlers do), and so these pages would be invisible to search engines. A sitemap allows crawlers to find this hidden content.

The sitemap also exposes modification times. In the discussion of page freshness, we mentioned that a crawler usually has to guess when pages are likely to change. The changefreq tag gives the crawler a hint about when to check a page again for changes, and the lastmod tag tells the crawler when a page has changed. This helps reduce the number of requests that the crawler sends to a website without sacrificing page freshness.

3.2.7 Distributed Crawling

For crawling individual websites, a single computer is sufficient. However, crawling the entire Web requires many computers devoted to crawling. Why would a single crawling computer not be enough? We will consider three reasons.

One reason to use multiple computers is to put the crawler closer to the sites it crawls. Long-distance network connections tend to have lower throughput (fewer bytes copied per second) and higher latency (bytes take longer to cross the network). Decreased throughput and increased latency work together to make each page request take longer. As throughput drops and latency rises, the crawler has to open more connections to copy pages at the same rate.

For example, suppose a crawler has a network connection that can transfer 1MB each second. With an average web page size of 20K, it can copy 50 pages each second. If the sites that are being crawled are close, the data transfer rate from them may be 1MB a second. However, it can take 80ms for the site to start sending data, because there is some transmission delay in opening the connection

and sending the request. Let's assume each request takes 100ms (80ms of latency and 20ms of data transfer). Multiplying 50 by 100ms, we see that there is 5 seconds of waiting involved in transferring 50 pages. This means that five connections will be needed to transfer 50 pages in one second. If the sites are farther away, with an average throughput of 100K per second and 500ms of latency, then each request would now take 600ms. Since 50 × 600ms = 30 seconds, the crawler would need to keep 30 connections open to transfer pages at the same rate.

Another reason for multiple crawling computers is to reduce the number of sites the crawler has to remember. A crawler has to remember all of the URLs it has already crawled, and all of the URLs that it has queued to crawl. These URLs must be easy to access, because every page that is crawled contains new links that need to be added to the crawl queue. Since the crawler's queue should not contain duplicates or sites that have already been crawled, each new URL must be checked against everything in the queue and everything that has been crawled. The data structure for this lookup needs to be in RAM; otherwise, the computer's crawl speed will be severely limited. Spreading crawling duties among many computers reduces this bookkeeping load.

Yet another reason is that crawling can use a lot of computing resources, including CPU resources for parsing and network bandwidth for crawling pages. Crawling a large portion of the Web is too much work for a single computer to handle.

A distributed crawler is much like a crawler on a single computer, except instead of a single queue of URLs, there are many queues. The distributed crawler uses a hash function to assign URLs to crawling computers. When a crawler sees a new URL, it computes a hash function on that URL to decide which crawling computer is responsible for it. These URLs are gathered in batches, then sent periodically to reduce the network overhead of sending a single URL at a time.

The hash function should be computed on just the host part of each URL. This assigns all the URLs for a particular host to a single crawler. Although this may promote imbalance since some hosts have more pages than others, politeness rules require a time delay between URL fetches to the same host. It is easier to maintain that kind of delay by using the same crawling computer for all URLs from the same host. In addition, we would expect that sites from domain.com will have lots of links to other pages on domain.com. By assigning domain.com to a single crawl host, we minimize the number of URLs that need to be exchanged between crawling computers.
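Assigning URLs to crawling computers by hashing only the host part takes just a few lines. In the Java sketch below (ours; the class name CrawlerAssignment is illustrative), both example URLs map to the same crawler because they share a host.

import java.net.URI;

// A minimal sketch: assign a URL to one of N crawling computers by hashing
// only the host part of the URL, so all pages from one host go to one machine.
public class CrawlerAssignment {
    public static int assign(String url, int numCrawlers) {
        String host = URI.create(url).getHost();
        // Math.floorMod keeps the result non-negative even if hashCode() is negative.
        return Math.floorMod(host.hashCode(), numCrawlers);
    }

    public static void main(String[] args) {
        int crawlers = 16;
        // Both URLs print the same crawler number because they share a host.
        System.out.println(assign("http://domain.com/page1.html", crawlers));
        System.out.println(assign("http://domain.com/page2.html", crawlers));
    }
}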

70 46 3 Crawls and Feeds 3.3 Crawling Documents and Email Even though the Web is a tremendous information resource, a huge amount of digital information is not stored on websites. In this section, we will consider in- formation that you might find on a normal desktop computer, such as email, word processing documents, presentations, or spreadsheets. This information can be searched using a desktop search tool. In companies and organizations, enterprise search will make use of documents on file servers, or even on employee desktop computers, in addition to local web pages. Many of the problems of web crawling change when we look at desktop data. In web crawling, just finding the data can be a struggle. On a desktop computer, the interesting data is stored in a file system with familiar semantics. Finding all the files on a hard disk is not particularly difficult, since file systems have directo- ries that are easy to discover. In some ways, a file system is like a web server, but with an automatically generated sitemap. There are unique challenges in crawling desktop data, however. The first con- cerns update speed. In desktop search applications, users demand search results based on the current content of their files. This means, for example, being able to search for an email the instant it is received, and being able to search for a docu- ment as soon as it has been saved. Notice that this is a much different expectation than with web search, where users can tolerate crawling delays of hours or days. Crawling the file system every second is impractical, but modern file systems can send change notifications directly to the crawler process so that it can copy new files immediately. Remote file systems from file servers usually do not provide this kind of change notification, and so they must be crawled just like a web server. Disk space is another concern. With a web crawler, we assume that we need to keep a copy of every document that is found. This is less true on a desktop system, where the documents are already stored locally, and where users will be unhappy if a large proportion of the hard disk is taken by the indexer. A desktop crawler instead may need to read documents into memory and send them directly to the indexer. We will discuss indexing more in Chapter 5. Since websites are meant to be viewed with web browsers, most web content is stored in HTML. On the other hand, each desktop program—the word pro- cessor, presentation tool, email program, etc.—has its own file format. So, just finding these files is not enough; eventually they will need to be converted into a format that the indexer can understand. In section 3.5 we will revisit this conver- sion issue.

Finally, and perhaps most importantly, crawling desktop data requires a focus on data privacy. Desktop systems can have multiple users with different accounts, and user A should not be able to find emails from user B's account through the search feature. This is especially important when we consider crawling shared network file systems, as in a corporate network. The file access permissions of each file must be recorded along with the crawled data, and must be kept up-to-date.

3.4 Document Feeds

In general Web or desktop crawling, we assume that any document can be created or modified at any time. However, many documents are published, meaning that they are created at a fixed time and rarely updated again. News articles, blog posts, press releases, and email are some of the documents that fit this publishing model. Most information that is time-sensitive is published.

Since each published document has an associated time, published documents from a single source can be ordered in a sequence called a document feed. A document feed is particularly interesting for crawlers, since the crawler can easily find all the new documents by examining only the end of the feed.

We can distinguish two kinds of document feeds, push and pull. A push feed alerts the subscriber to new documents. This is like a telephone, which alerts you to an incoming phone call; you don't need to continually check the phone to see if someone is calling. A pull feed requires the subscriber to check periodically for new documents; this is like checking your mailbox for new mail to arrive. News feeds from commercial news agencies are often push feeds, but pull feeds are overwhelmingly popular for free services. We will focus primarily on pull feeds in this section.

The most common format for pull feeds is called RSS. RSS has at least three definitions: Really Simple Syndication, RDF Site Summary, or Rich Site Summary. Not surprisingly, RSS also has a number of slightly incompatible implementations, and a similar competing format exists called the Atom Syndication Format. The proliferation of standards is the result of an idea that gained popularity too quickly for developers to agree on a single standard.

Figure 3.9 shows an RSS 2.0 feed from an example site called http://www.search-engine-news.org. This feed contains two articles: one is about an upcoming SIGIR conference, and the other is about a textbook. Notice that each entry contains a time indicating when it was published. In addition, near the top of the RSS feed there is a tag named ttl, which means time to live, measured in minutes. This

<?xml version="1.0"?>
<rss version="2.0">
  <channel>
    <title>Search Engine News</title>
    <link>http://www.search-engine-news.org/</link>
    <description>News about search engines.</description>
    <language>en-us</language>
    <pubDate>Tue, 19 Jun 2008 05:17:00 GMT</pubDate>
    <ttl>60</ttl>

    <item>
      <title>Upcoming SIGIR Conference</title>
      <link>http://www.sigir.org/conference</link>
      <description>The annual SIGIR conference is coming!
        Mark your calendars and check for cheap flights.</description>
      <pubDate>Tue, 05 Jun 2008 09:50:11 GMT</pubDate>
      <guid>http://search-engine-news.org#500</guid>
    </item>

    <item>
      <title>New Search Engine Textbook</title>
      <link>http://www.cs.umass.edu/search-book</link>
      <description>A new textbook about search engines will be published soon.</description>
      <pubDate>Tue, 05 Jun 2008 09:33:01 GMT</pubDate>
      <guid>http://search-engine-news.org#499</guid>
    </item>
  </channel>
</rss>

Fig. 3.9. An example RSS 2.0 feed

feed states that its contents should be cached only for 60 minutes, and information more than an hour old should be considered stale. This gives a crawler an indication of how often this feed file should be crawled.

RSS feeds are accessed just like a traditional web page, using HTTP GET requests to web servers that host them. Therefore, some of the crawling techniques we discussed before apply here as well, such as using HTTP HEAD requests to detect when RSS feeds change.

From a crawling perspective, document feeds have a number of advantages over traditional pages. Feeds give a natural structure to data; even more than with a sitemap, a web feed implies some relationship between the data items. Feeds are easy to parse and contain detailed time information, like a sitemap, but also include a description field about each page (and this description field sometimes contains the entire text of the page referenced in the URL). Most importantly, like a sitemap, feeds provide a single location to look for new data, instead of having to crawl an entire site to find a few new documents.

3.5 The Conversion Problem

Search engines are built to search through text. Unfortunately, text is stored on computers in hundreds of incompatible file formats. Standard text file formats include raw text, RTF, HTML, XML, Microsoft Word, ODF (Open Document Format), and PDF (Portable Document Format). There are tens of other less common word processors with their own file formats. But text documents aren't the only kind of document that needs to be searched; other kinds of files also contain important text, such as PowerPoint slides and Excel® spreadsheets. In addition to all of these formats, people often want to search old documents, which means that search engines may need to support obsolete file formats. It is not uncommon for a commercial search engine to support more than a hundred file types.

The most common way to handle a new file format is to use a conversion tool that converts the document content into a tagged text format such as HTML or XML. These formats are easy to parse, and they retain some of the important formatting information (font size, for example). You can see this on any major web search engine. Search for a PDF document, but then click on the "Cached" link at the bottom of a search result. You will be taken to the search engine's view of the page, which is usually an HTML rendition of the original document. For some document types, such as PowerPoint, this cached version can be nearly unreadable. Fortunately, readability isn't the primary concern of the search engine.

74 50 3 Crawls and Feeds The point is to copy this data into the search engine so that it can be indexed and retrieved. However, translating the data into HTML has an advantage: the user does not need to have an application that can read the document’s file format in order to view it. This is critical for obsolete file formats. Documents could be converted to plain text instead of HTML or XML. However, doing this would strip the file of important information about head- ings and font sizes that could be useful to the indexer. As we will see later, headings and bold text tend to contain words that describe the document content well, so we want to give these words preferential treatment during scoring. Accurate con- version of formatting information allows the indexer to extract these important features. 3.5.1 Character Encodings Even HTML files are not necessarily compatible with each other because of char- acter encoding issues. The text that you see on this page is a series of little pictures we call letters or glyphs . Of course, a computer file is a stream of bits, not a collec- tion of pictures. A character encoding is a mapping between bits and glyphs. For English, the basic character encoding that has been around since 1963 is ASCII. ASCII encodes 128 letters, numbers, special characters, and control characters in 7 bits, extended with an extra bit for storage in bytes. This scheme is fine for the English alphabet of 26 letters, but there are many other languages, and some of those have many more glyphs. The Chinese language, for example, has more than 40,000 characters, with over 3,000 in common use. For the CJK (Chinese- Japanese-Korean) family of East Asian languages, this led to the development of a number of different 2-byte standards. Other languages, such as Hindi or Arabic, also have a range of different encodings. Note that not all encodings even agree on English. The EBCDIC encoding used on mainframes, for example, is completely different than the ASCII encoding used by personal computers. The computer industry has moved slowly in handling complicated character sets such as Chinese and Arabic. Until recently, the typical approach was to use different language-specific encodings, sometimes called code pages . The first 128 values of each encoding are reserved for typical English characters, punctuation, and numbers. Numbers above 128 are mapped to glyphs in the target language, from Hebrew to Arabic. However, if you use a different encoding for each lan- guage, you can’t write in Hebrew and Japanese in the same document. Addition- ally, the text itself is no longer self-describing. It’s not enough to just store data in a text file; you must also record what encoding was used.

To solve this mess of encoding issues, Unicode was developed. Unicode is a single mapping from numbers to glyphs that attempts to include all glyphs in common use in all known languages. This solves the problem of using multiple languages in a single file. Unfortunately, it does not fully solve the problems of binary encodings, because Unicode is a mapping between numbers and glyphs, not bits and glyphs. It turns out that there are many ways to translate Unicode numbers into bits! Some of the most popular include UTF-8, UTF-16, UTF-32, and UCS-2 (which is deprecated).

The proliferation of encodings comes from a need for compatibility and to save space. Encoding English text in UTF-8 is identical to the ASCII encoding. Each ASCII letter requires just one byte. However, some traditional Chinese characters can require as many as 4 bytes. The trade-off for compactness for Western languages is that each character requires a variable number of bytes, which makes it difficult to quickly compute the number of characters in a string or to jump to a random location in a string. By contrast, UTF-32 (also known as UCS-4) uses exactly 4 bytes for every character. Jumping to the twentieth character in a UTF-32 string is easy: just jump to the eightieth byte and start reading. Unfortunately, UTF-32 strings are incompatible with all old ASCII software, and UTF-32 files require four times as much space as UTF-8. Because of this, many applications use UTF-32 as their internal text encoding (where random access is important), but use UTF-8 to store text on disk.

Decimal          Hexadecimal     Encoding
0–127            0–7F            0xxxxxxx
128–2047         80–7FF          110xxxxx 10xxxxxx
2048–55295       800–D7FF        1110xxxx 10xxxxxx 10xxxxxx
55296–57343      D800–DFFF       Undefined
57344–65535      E000–FFFF       1110xxxx 10xxxxxx 10xxxxxx
65536–1114111    10000–10FFFF    11110xxx 10xxxxxx 10xxxxxx 10xxxxxx

Table 3.1. UTF-8 encoding

Table 3.1 shows an encoding table for UTF-8. The left columns represent ranges of decimal and hexadecimal values, and the rightmost column shows how these values are encoded in binary. The x characters represent binary digits. For example, the Greek letter pi (π) is Unicode symbol number 960. In binary, that number is 00000011 11000000 (3C0 in hexadecimal). The second row of the table tells us that this letter will require 2 bytes to encode in UTF-8. The high 5 bits of the character go in the first byte, and the next 6 bits go in the second byte. The final encoding is 110 01111 10 000000 (CF80 in hexadecimal). The prefix digits 110 and 10 come directly from the table, while the x positions from the table have been filled in by binary digits from the Unicode number.
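Standard libraries already implement the UTF-8 rules in Table 3.1. The Java sketch below (ours; the class name Utf8Pi is illustrative) encodes the Greek letter pi and prints the bytes CF 80, matching the hand-worked encoding above.

import java.nio.charset.StandardCharsets;

// A minimal sketch: let Java's built-in UTF-8 support encode the Greek letter pi
// (Unicode number 960) and print the resulting bytes.
public class Utf8Pi {
    public static void main(String[] args) {
        String pi = "\u03C0";  // Unicode code point 960, which is 3C0 in hexadecimal
        byte[] bytes = pi.getBytes(StandardCharsets.UTF_8);
        for (byte b : bytes) {
            System.out.printf("%02X ", b);  // prints: CF 80
        }
        System.out.println();
    }
}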

3.6 Storing the Documents

After documents have been converted to some common format, they need to be stored in preparation for indexing. The simplest document storage is no document storage, and for some applications this is preferable. In desktop search, for example, the documents are already stored in the file system and do not need to be copied elsewhere. As the crawling process runs, it can send converted documents immediately to an indexing process. By not storing the intermediate converted documents, desktop search systems can save disk space and improve indexing latency.

Most other kinds of search engines need to store documents somewhere. Fast access to the document text is required in order to build document snippets for each search result (snippet generation is discussed in Chapter 6). These snippets of text give the user an idea of what is inside the retrieved document without actually needing to click on a link.

Even if snippets are not necessary, there are other reasons to keep a copy of each document. Crawling for documents can be expensive in terms of both CPU and network load. It makes sense to keep copies of the documents around instead of trying to fetch them again the next time you want to build an index. Keeping old documents allows you to use HEAD requests in your crawler to save on bandwidth, or to crawl only a subset of the pages in your index.

Finally, document storage systems can be a starting point for information extraction (described in Chapter 4). The most pervasive kind of information extraction happens in web search engines, which extract anchor text from links to store with target web documents. Other kinds of extraction are possible, such as identifying names of people or places in documents. Notice that if information extraction is used in the search application, the document storage system should support modification of the document data.

We now discuss some of the basic requirements for a document storage system, including random access, compression, and updating, and consider the relative

benefits of using a database system or a customized storage system such as Google's BigTable.

3.6.1 Using a Database System

If you have used a relational database before, you might be thinking that a database would be a good place to store document data. For many applications, in fact, a database is an excellent place to store documents. A database takes care of the difficult details of storing small pieces of data, such as web pages, and makes it easy to update them later. Most databases also run as a network server, so that the documents are easily available on the network. This could support, for example, a single computer serving documents for snippets while many other computers handle queries. Databases also tend to come with useful import and analysis tools that can make it easier to manage the document collection.

Many companies that run web search engines are reluctant to talk about their internal technologies. However, it appears that few, if any, of the major search engines use conventional relational databases to store documents. One problem is the sheer volume of document data, which can overwhelm traditional database systems. Database vendors also tend to expect that database servers will use the most expensive disk systems, which is impractical given the collection size. We discuss an alternative to a relational database at the end of this section that addresses some of these concerns.

3.6.2 Random Access

To retrieve documents quickly in order to compute a snippet for a search result, the document store needs to support random access. Compared to a full relational database, however, only a relatively simple lookup criterion is needed. We want a data store such that we can request the content of a document based on its URL.

The easiest way to handle this kind of lookup is with hashing. Using a hash function on the URL gives us a number we can use to find the data. For small installations, the hash function can tell us which file contains the document. For larger installations, the hash function tells us which server contains the document. Once the document location has been narrowed down to a single file, a B-Tree or sorted data structure can be used to find the offset of the document data within the file.
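The two-step lookup described here, a hash to pick a file and a sorted structure to find an offset within it, can be sketched in a few lines. The Java code below is illustrative only (the class DocumentLookup and its methods are ours), with a TreeMap standing in for the on-disk B-Tree.

import java.util.*;

// A minimal sketch: find a document by URL. A hash of the URL selects the file
// that holds the document, and a sorted map for that file maps the URL to the
// byte offset of the document data within the file.
public class DocumentLookup {
    private final int numFiles = 64;
    // One sorted offset index per file; TreeMap plays the role of the B-Tree.
    private final List<TreeMap<String, Long>> offsetIndexes = new ArrayList<>();

    public DocumentLookup() {
        for (int i = 0; i < numFiles; i++) offsetIndexes.add(new TreeMap<>());
    }

    public void addDocument(String url, long offset) {
        offsetIndexes.get(fileFor(url)).put(url, offset);
    }

    // Returns the (file number, byte offset) pair for a URL, or null if unknown.
    public long[] locate(String url) {
        int file = fileFor(url);
        Long offset = offsetIndexes.get(file).get(url);
        return offset == null ? null : new long[] { file, offset };
    }

    private int fileFor(String url) {
        return Math.floorMod(url.hashCode(), numFiles);
    }
}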

3.6.3 Compression and Large Files

Regardless of whether the application requires random access to documents, the document storage system should make use of large files and compression. Even a document that seems long to a person is small by modern computer standards. For example, this chapter is approximately 10,000 words, and those words require about 70K of disk space to store. That is far bigger than the average web page, but a modern hard disk can transfer 70K of data in about a millisecond. However, the hard disk might require 10 milliseconds to seek to that file in order to start reading. This is why storing each document in its own file is not a very good idea; reading these small files requires a substantial overhead to open them.

A better solution is to store many documents in a single file, and for that file to be large enough that transferring the file contents takes much more time than seeking to the beginning. A good size choice might be in the hundreds of megabytes. By storing documents close together, the indexer can spend most of its time reading data instead of seeking for it.

The Galago search engine includes parsers for three compound document formats: ARC, TREC Text, and TREC Web. In each format, many text documents are stored in the same file, with short regions of document metadata separating the documents. Figure 3.10 shows an example of the TREC Web format. Notice that each document block begins with a <DOC> tag and ends with a </DOC> tag. At the beginning of the document, the <DOCHDR> tag marks a section containing the information about the page request, such as its URL, the date it was crawled, and the HTTP headers returned by the web server. Each document record also contains a <DOCNO> field that includes a unique identifier for the document.

Even though large files make sense for data transfer from disk, reducing the total storage requirements for document collections has obvious advantages. Fortunately, text written by people is highly redundant. For instance, the letter q is almost always followed by the letter u. Shannon (1951) showed that native English speakers are able to guess the next letter of a passage of English text with 69% accuracy. HTML and XML tags are even more redundant. Compression techniques exploit this redundancy to make files smaller without losing any of the content.

We will cover compression as it is used for document indexing in Chapter 5, in part because compression for indexing is rather specialized. While research continues into text compression, popular algorithms like DEFLATE (Deutsch, 1996) and LZW (Welch, 1984) can compress HTML and XML text by 80%.

<DOC>
<DOCNO>WTX001-B01-10</DOCNO>
<DOCHDR>
http://www.example.com/test.html 204.244.59.33 19970101013145 text/html 440
HTTP/1.0 200 OK
Date: Wed, 01 Jan 1997 01:21:13 GMT
Server: Apache/1.0.3
Content-type: text/html
Content-length: 270
Last-modified: Mon, 25 Nov 1996 05:31:24 GMT
</DOCHDR>
Tropical Fish Store
Coming soon!
</DOC>
<DOC>
<DOCNO>WTX001-B01-109</DOCNO>
<DOCHDR>
http://www.example.com/fish.html 204.244.59.33 19970101013149 text/html 440
HTTP/1.0 200 OK
Date: Wed, 01 Jan 1997 01:21:19 GMT
Server: Apache/1.0.3
Content-type: text/html
Content-length: 270
Last-modified: Mon, 25 Nov 1996 05:31:24 GMT
</DOCHDR>
Fish Information
This page will soon contain interesting information about tropical fish.
</DOC>

Fig. 3.10. An example of text in the TREC Web compound document format

This space savings reduces the cost of storing a lot of documents, and also reduces the amount of time it takes to read a document from the disk since there are fewer bytes to read.

Compression works best with large blocks of data, which makes it a good fit for big files with many documents in them. However, it is not necessarily a good idea to compress the entire file as a single block. Most compression methods do not allow random access, so each block can only be decompressed sequentially. If you want random access to the data, it is better to consider compressing in smaller blocks, perhaps one block per document, or one block for a few documents. Small blocks reduce compression ratios (the amount of space saved) but improve request latency.

3.6.4 Update

As new versions of documents come in from the crawler, it makes sense to update the document store. The alternative is to create an entirely new document store by merging the new, changed documents from the crawler with document data from the old document store for documents that did not change. If the document data does not change very much, this merging process will be much more expensive than updating the data in place.

<a href="http://example.com">Example website</a>

Fig. 3.11. An example link with anchor text

Another important reason to support update is to handle anchor text. Figure 3.11 shows an example of anchor text in an HTML link tag. The HTML code in the figure will render in the web browser as a link, with the text Example website that, when clicked, will direct the user to http://example.com. Anchor text is an important feature because it provides a concise summary of what the target page is about. If the link comes from a different website, we may also believe that the summary is unbiased, which also helps us rank documents (see Chapters 4 and 7).

Collecting anchor text properly is difficult because the anchor text needs to be associated with the target page. A simple way to approach this is to use a data store that supports update. When a document is found that contains anchor text, we find the record for the target page and update the anchor text portion of the record. When it is time to index the document, the anchor text is all together and ready for indexing.

3.6.5 BigTable

Although a database can perform the duties of a document data store, the very largest document collections demand custom document storage systems. BigTable is the most well known of these systems (Chang et al., 2006). BigTable is a working system in use internally at Google, although at least two open source projects are taking a similar approach. In the next few paragraphs, we will look at the BigTable architecture to see how the problem of document storage influenced its design.

BigTable is a distributed database system originally built for the task of storing web pages. A BigTable instance really is a big table; it can be over a petabyte in size, but each database contains only one table. The table is split into small pieces, called tablets, which are served by thousands of machines (Figure 3.12).

Fig. 3.12. BigTable stores data in a single logical table, which is split into many smaller tablets

If you are familiar with relational databases, you will have encountered SQL (Structured Query Language). SQL allows users to write complex and computationally expensive queries, and one of the tasks of the database system is to optimize the processing of these queries to make them as fast as possible. Because some of these queries could take a very long time to complete, a large relational database requires a complex locking system to ensure that the many users of the database do not corrupt it by reading or writing data simultaneously. Isolating users from each other is a difficult job, and many papers and books have been written about how to do it well.

The BigTable approach is quite different. There is no query language, and therefore no complex queries, and it includes only row-level transactions, which would be considered rather simple by relational database standards. However, the simplicity of the model allows BigTable to scale up to very large database sizes while using inexpensive computers, even though they may be prone to failure.

Most of the engineering in BigTable involves failure recovery. The tablets, which are the small sections of the table, are stored in a replicated file system that is accessible by all BigTable tablet servers. Any changes to a BigTable tablet are recorded to a transaction log, which is also stored in a shared file system. If any tablet server crashes, another server can immediately read the tablet data and transaction log from the file system and take over.

Most relational databases store their data in files that are constantly modified. In contrast, BigTable stores its data in immutable (unchangeable) files. Once file data is written to a BigTable file, it is never changed. This also helps in failure recovery. In relational database systems, failure recovery requires a complex series of operations to make sure that files were not corrupted because only some of the outstanding writes completed before the computer crashed. In BigTable, a file is either incomplete (in which case it can be thrown away and re-created from other BigTable files and the transaction log), or it is complete and therefore is not corrupt. To allow for table updates, the newest data is stored in RAM, whereas older data is stored in a series of files. Periodically the files are merged together to reduce the total number of disk files.

Fig. 3.13. A BigTable row for the key www.example.com, with columns for the document text and title, and two anchor text columns, anchor:other.com ("example") and anchor:null.com ("click here")

BigTables are logically organized by rows (Figure 3.13). In the figure, the row stores the data for a single web page. The URL, www.example.com, is the row key, which can be used to find this row. The row has many columns, each with a unique name. Each column can have many different timestamps, although that is not shown in the figure.

83 3.6 Storing the Documents 59 The cell holds a series of bytes, which might be a number, a string, or some other kind of data. In the figure, notice that there is a text column for the full text of the document as well as a title column, which makes it easy to quickly find the document title without parsing the full document text. There are two columns for anchor text. One, called anchor:other.com, includes anchor text from a link from the site other.com to example.com; the text of the link is "example", as shown in the cell. The anchor:null.com column describes a link from null.com to example.com with anchor text "click here". Both of these columns are in the anchor column group. Other columns could be added to this column group to add information about more links. BigTable can have a huge number of columns per row, and while all rows have the same column groups, not all rows have the same columns. This is a major departure from traditional database systems, but this flexibility is important, in part because of the lack of tables. In a relational database system, the anchor columns would be stored in one table and the document text in another. Because BigTable has just one table, all the anchor information needs to be packed into a single record. With all the anchor data stored together, only a single disk read is necessary to read all of the document data. In a two-table relational database, at least two reads would be necessary to retrieve this data. Rows are partitioned into tablets based on their row keys. For instance, all URLs beginning with a could be located in one tablet, while all those starting with b could be in another tablet. Using this kind of range-based partitioning makes it easy for a client of BigTable to determine which server is serving each row. To look up a particular row, the client consults a list of row key ranges to determine which tablet would hold the desired row. The client then contacts the appropriate tablet server to fetch the row. The row key ranges are cached in the client, so that most of the network traffic is between clients and tablet servers. BigTable's architecture is designed for speed and scale through massive numbers of servers, and for economy by using inexpensive computers that are expected to fail. In order to achieve these goals, BigTable sacrifices some key relational database features, such as a complex query language and multiple-table databases. However, this architecture is well suited for the task of storing and finding web pages, where the primary task is efficient lookups and updates on individual rows.
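To make the range-based lookup concrete, the following is a minimal sketch, not the actual BigTable client API: the class, tablet boundaries, and server names are hypothetical. It keeps the start key of each tablet range in a sorted map, so finding the tablet for a row key is a single floor lookup, which is essentially what a client-side cache of row key ranges allows.

import java.util.Map;
import java.util.TreeMap;

public class TabletLocator {
    // Maps the first row key of each tablet range to the server that holds it (hypothetical data).
    private final TreeMap<String, String> tabletStarts = new TreeMap<>();

    public void addTablet(String startKey, String server) {
        tabletStarts.put(startKey, server);
    }

    // Finds the server for a row key by locating the tablet whose start key
    // is the largest key less than or equal to the row key (range-based partitioning).
    public String serverFor(String rowKey) {
        Map.Entry<String, String> e = tabletStarts.floorEntry(rowKey);
        return e == null ? null : e.getValue();
    }

    public static void main(String[] args) {
        TabletLocator locator = new TabletLocator();
        locator.addTablet("a", "tabletserver-1");   // rows with keys starting at "a"
        locator.addTablet("b", "tabletserver-2");   // rows with keys starting at "b"
        locator.addTablet("m", "tabletserver-3");
        System.out.println(locator.serverFor("www.example.com")); // prints tabletserver-3
    }
}

Because the map is consulted locally, the client only needs to contact the tablet server that actually holds the row, which is why most network traffic is between clients and tablet servers.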

84 60 3 Crawls and Feeds 3.7 Detecting Duplicates Duplicate and near-duplicate documents occur in many situations. Making copies and creating new versions of documents is a constant activity in offices, and keeping track of these is an important part of information management. On the Web, however, the situation is more extreme. In addition to the normal sources of duplication, plagiarism and spam are common, and the use of multiple URLs to point to the same web page and mirror sites can cause a crawler to generate large numbers of duplicate pages. Studies have shown that about 30% of the web pages in a large crawl are exact or near duplicates of pages in the other 70% (e.g., Fetterly et al., 2003). Documents with very similar content generally provide little or no new information to the user, but consume significant resources during crawling, indexing, and search. In response to this problem, algorithms for detecting duplicate documents have been developed so that they can be removed or treated as a group during indexing and ranking. Detecting exact duplicates is a relatively simple task that can be done using checksumming techniques. A checksum is a value that is computed based on the content of the document. The most straightforward checksum is a sum of the bytes in the document file. For example, the checksum for a file containing the text "Tropical fish" would be computed as follows (in hex):

T   r   o   p   i   c   a   l   (space)  f   i   s   h    Sum
54  72  6F  70  69  63  61  6C  20       66  69  73  68   508

Any document file containing the same text would have the same checksum. Of course, any document file containing text that happened to have the same checksum would also be treated as a duplicate. A file containing the same characters in a different order would have the same checksum, for example. More sophisticated functions, such as a cyclic redundancy check (CRC), have been developed that consider the positions of the bytes. The detection of near-duplicate documents is more difficult. Even defining a near-duplicate is challenging. Web pages, for example, could have the same text content but differ in the advertisements, dates, or formatting. Other pages could have small differences in their content from revisions or updates. In general, a near-duplicate is defined using a threshold value for some similarity measure between pairs of documents. For example, a document D1 could be defined as a near-duplicate of document D2 if more than 90% of the words in the documents were the same.
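Returning to the exact-duplicate case for a moment, the byte-sum checksum described above is simple to compute. The sketch below is a minimal illustration, not a production checksum; it reproduces the 508 (hex) value for "Tropical fish" and shows why reordering the characters leaves the sum unchanged, which is exactly the weakness that motivates CRC-style functions.

import java.nio.charset.StandardCharsets;

public class SimpleChecksum {
    // Sums the byte values of the input; the order of the bytes does not affect the result.
    public static int checksum(String text) {
        int sum = 0;
        for (byte b : text.getBytes(StandardCharsets.US_ASCII)) {
            sum += b & 0xFF;
        }
        return sum;
    }

    public static void main(String[] args) {
        System.out.println(Integer.toHexString(checksum("Tropical fish"))); // 508
        System.out.println(Integer.toHexString(checksum("fish Tropical"))); // 508 again, same bytes reordered
    }
}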

85 3.7 Detecting Duplicates 61 There are two scenarios for near-duplicate detection. One is the search scenario, where the goal is to find near-duplicates of a given document D. This, like all search problems, conceptually involves the comparison of the query document to all other documents. For a collection containing N documents, the number of comparisons required will be O(N). The other scenario, discovery, involves finding all pairs of near-duplicate documents in the collection. This process requires O(N^2) comparisons. Although information retrieval techniques that measure similarity using word-based representations of documents have been shown to be effective for identifying near-duplicates in the search scenario, the computational requirements of the discovery scenario have meant that new techniques have been developed for deriving compact representations of documents. These compact representations are known as fingerprints. The basic process of generating fingerprints is as follows:
1. The document is parsed into words. Non-word content, such as punctuation, HTML tags, and additional whitespace, is removed (see section 4.3).
2. The words are grouped into contiguous n-grams for some n. These are usually overlapping sequences of words (see section 4.3.5), although some techniques use non-overlapping sequences.
3. Some of the n-grams are selected to represent the document.
4. The selected n-grams are hashed to improve retrieval efficiency and further reduce the size of the representation.
5. The hash values are stored, typically in an inverted index.
There are a number of fingerprinting algorithms that use this general approach, and they differ mainly in how subsets of the n-grams are selected. Selecting a fixed number of n-grams at random does not lead to good performance in terms of finding near-duplicates. Consider two near-identical documents, D1 and D2. The fingerprints generated from n-grams selected randomly from document D1 are unlikely to have a high overlap with the fingerprints generated from a different set of n-grams selected randomly from D2. A more effective technique uses pre-specified combinations of characters, and selects n-grams that begin with those characters. Another popular technique, called 0 mod p, is to select all n-grams whose hash value modulo p is zero, where p is a parameter. Figure 3.14 illustrates the fingerprinting process using overlapping 3-grams, hypothetical hash values, and the 0 mod p selection method with a p value of 4. Note that after the selection process, the document (or sentence in this case) is represented by fingerprints for the n-grams "fish include fish", "found in tropical",

86 62 3 Crawls and Feeds
Fig. 3.14. Example of the fingerprinting process: (a) the original text ("Tropical fish include fish found in tropical environments around the world, including both freshwater and salt water species"), (b) the overlapping 3-grams, (c) hypothetical hash values for the 3-grams, and (d) the hash values selected using 0 mod 4 (908, 664, 492, 236).
"the world including", and "including both freshwater". In large-scale applications, such as finding near-duplicates on the Web, the n-grams are typically 5–10 words long and the hash values are 64 bits.3 Near-duplicate documents are found by comparing the fingerprints that represent them. Near-duplicate pairs are defined by the number of shared fingerprints or the ratio of shared fingerprints to the total number of fingerprints used to represent the pair of documents. Fingerprints do not capture all of the information in the document, however, and consequently this leads to errors in the detection of near-duplicates. Appropriate selection techniques can reduce these errors, but not eliminate them. As we mentioned, evaluations have shown that comparing word-based representations using a similarity measure such as the cosine correlation (see section 7.1.2) is generally significantly more effective than fingerprinting methods for finding near-duplicates. The problem with these methods is their efficiency.
3 The hash values are usually generated using Rabin fingerprinting (Broder et al., 1997), named after the Israeli computer scientist Michael Rabin.
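The sketch below is a minimal illustration of the general fingerprinting procedure with overlapping 3-grams and 0 mod p selection. It uses Java's built-in String hash rather than Rabin fingerprinting, so the selected n-grams and hash values will not match those in Figure 3.14; the example documents and parameter values are chosen only for illustration.

import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class NGramFingerprint {
    // Builds overlapping n-grams, hashes them, and keeps the hashes where (hash mod p) == 0.
    public static Set<Integer> fingerprint(String text, int n, int p) {
        String[] words = text.toLowerCase().replaceAll("[^a-z0-9 ]", " ").trim().split("\\s+");
        List<String> ngrams = new ArrayList<>();
        for (int i = 0; i + n <= words.length; i++) {
            ngrams.add(String.join(" ", Arrays.copyOfRange(words, i, i + n)));
        }
        Set<Integer> selected = new HashSet<>();
        for (String gram : ngrams) {
            int h = gram.hashCode() & 0x7fffffff;   // non-negative hash value
            if (h % p == 0) {
                selected.add(h);                    // 0 mod p selection
            }
        }
        return selected;
    }

    public static void main(String[] args) {
        String d1 = "Tropical fish include fish found in tropical environments around the world";
        String d2 = "Tropical fish are found in tropical environments around the world";
        Set<Integer> f1 = fingerprint(d1, 3, 4);
        Set<Integer> f2 = fingerprint(d2, 3, 4);
        f1.retainAll(f2);                           // fingerprints shared by both documents
        System.out.println("shared fingerprints: " + f1.size());
    }
}

Comparing the selected hash sets for two documents, as in main above, is exactly the shared-fingerprint comparison described in the previous paragraph.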

87 3.8 Removing Noise 63 A recently developed fingerprinting technique called simhash (Charikar, 2002) combines the advantages of the word-based similarity measures with the efficiency of fingerprints based on hashing. It has the unusual property for a hashing function that similar documents have similar hash values. More precisely, the similarity of two pages as measured by the cosine correlation measure is proportional to the number of bits that are the same in the fingerprints generated by simhash. The procedure for calculating a simhash fingerprint is as follows:
1. Process the document into a set of features with associated weights. We will assume the simple case where the features are words weighted by their frequency. Other weighting schemes are discussed in Chapter 7.
2. Generate a hash value with b bits (the desired size of the fingerprint) for each word. The hash value should be unique for each word.
3. In a b-dimensional vector V, update the components of the vector by adding the weight for a word to every component for which the corresponding bit in the word's hash value is 1, and subtracting the weight if the value is 0.
4. After all words have been processed, generate a b-bit fingerprint by setting the i-th bit to 1 if the i-th component of V is positive, or 0 otherwise.
Figure 3.15 shows an example of this process for an 8-bit fingerprint, and a short code sketch of the procedure is given after the figure. Note that common words (stopwords) are removed as part of the text processing. In practice, much larger values of b are used. Henzinger (2006) describes a large-scale Web-based evaluation where the fingerprints had 384 bits. A web page is defined as a near-duplicate of another page if the simhash fingerprints agree on more than 372 bits. This study showed significant effectiveness advantages for the simhash approach compared to fingerprints based on n-grams.
Fig. 3.15. Example of the simhash fingerprinting process: (a) the original text, (b) the words with their frequency weights, (c) 8-bit hash values for each word, (d) the vector V formed by summing weights, and (e) the 8-bit fingerprint formed from V.
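The following is a minimal sketch of the four-step procedure above, using term frequencies as weights and an 8-bit fingerprint as in Figure 3.15. The word hash function here is an assumption (Java's String hash rather than a dedicated per-word hash), and no stopword removal is done, so the exact bit patterns will differ from the figure.

import java.util.HashMap;
import java.util.Map;

public class SimHash {
    // Computes a b-bit simhash fingerprint from term frequency weights (b <= 32 assumed here).
    public static long simhash(String text, int b) {
        Map<String, Integer> weights = new HashMap<>();
        for (String w : text.toLowerCase().split("\\W+")) {
            if (!w.isEmpty()) {
                weights.merge(w, 1, Integer::sum);            // step 1: words weighted by frequency
            }
        }
        int[] v = new int[b];
        for (Map.Entry<String, Integer> e : weights.entrySet()) {
            long hash = e.getKey().hashCode() & 0xffffffffL;  // step 2: a hash value per word
            for (int i = 0; i < b; i++) {
                // Step 3: add the weight if bit i of the word's hash is 1, subtract it if the bit is 0.
                v[i] += ((hash >> i) & 1) == 1 ? e.getValue() : -e.getValue();
            }
        }
        long fingerprint = 0;
        for (int i = 0; i < b; i++) {
            if (v[i] > 0) {
                fingerprint |= (1L << i);                     // step 4: bit i is 1 if component i is positive
            }
        }
        return fingerprint;
    }

    // Number of bits on which two fingerprints agree; more agreement means more similar pages.
    public static int bitsInCommon(long f1, long f2, int b) {
        return b - Long.bitCount(f1 ^ f2);
    }

    public static void main(String[] args) {
        long f1 = simhash("tropical fish include fish found in tropical environments", 8);
        long f2 = simhash("tropical fish are found in tropical environments", 8);
        System.out.println(bitsInCommon(f1, f2, 8) + " of 8 bits agree");
    }
}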

88 64 3 Crawls and Feeds 3.8 Removing Noise Many web pages contain text, links, and pictures that are not directly related to the main content of the page. For example, Figure 3.16 shows a web page containing a news story. The main content of the page (the story) is outlined in black. This content block takes up less than 20% of the display area of the page, and the rest is made up of banners, advertisements, images, general navigation links, services (such as search and alerts), and miscellaneous information, such as copyright. From the perspective of the search engine, this additional material in the web page is mostly noise that could negatively affect the ranking of the page. A major component of the representation of a page used in a search engine is based on word counts, and the presence of a large number of words unrelated to the main topic can be a problem. For this reason, techniques have been developed to detect the content blocks in a web page and either ignore the other material or reduce its importance in the indexing process. Finn et al. (2001) describe a relatively simple technique based on the observation that there are fewer HTML tags in the text of the main content of typical web pages than in the additional material. Figure 3.17 (also known as a document slope curve) shows the cumulative distribution of tags in the example web page from Figure 3.16, as a function of the total number of tokens (words or other non-tag strings) in the page. The main text content of the page corresponds to the "plateau" in the middle of the distribution. This flat area is relatively small because of the large amount of formatting and presentation information in the HTML source for the page.

89 3.8 Removing Noise 65
Fig. 3.16. Main content block in a web page
One way to detect the largest flat area of the distribution is to represent a web page as a sequence of bits, where b_n = 1 indicates that the n-th token is a tag, and b_n = 0 otherwise. Certain tags that are mostly used to format text, such as font changes, headings, and table tags, are ignored (i.e., are represented by a 0 bit). The detection of the main content can then be viewed as an optimization problem where we find values of i and j to maximize both the number of tags below i and above j and the number of non-tag tokens between i and j. This corresponds to maximizing the corresponding objective function:

sum_{n=0}^{i-1} b_n + sum_{n=i}^{j} (1 - b_n) + sum_{n=j+1}^{N-1} b_n

where N is the number of tokens in the page. This can be done simply by scanning the possible values for i and j and computing the objective function.
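A minimal sketch of that scan follows, assuming the page has already been converted into the bit sequence described above (1 for a tag token, 0 for a text token); the example bit pattern is invented for illustration. The prefix-sum array is just bookkeeping so that each (i, j) evaluation of the objective function takes constant time.

public class ContentBlockFinder {
    // Returns {i, j} maximizing: tags before i + non-tag tokens in [i, j] + tags after j.
    public static int[] findContentBlock(int[] bits) {
        int n = bits.length;
        int[] prefix = new int[n + 1];                 // prefix[k] = number of tag tokens before position k
        for (int k = 0; k < n; k++) {
            prefix[k + 1] = prefix[k] + bits[k];
        }
        int bestScore = -1;
        int[] best = {0, n - 1};
        for (int i = 0; i < n; i++) {
            for (int j = i; j < n; j++) {
                int tagsBefore = prefix[i];
                int textInside = (j - i + 1) - (prefix[j + 1] - prefix[i]);
                int tagsAfter = prefix[n] - prefix[j + 1];
                int score = tagsBefore + textInside + tagsAfter;
                if (score > bestScore) {
                    bestScore = score;
                    best = new int[] {i, j};
                }
            }
        }
        return best;
    }

    public static void main(String[] args) {
        // 1 = tag token, 0 = text token; the text "plateau" is in the middle of this made-up page.
        int[] bits = {1, 1, 1, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 1, 1, 1};
        int[] block = findContentBlock(bits);
        System.out.println("content block from token " + block[0] + " to token " + block[1]);
    }
}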

90 66 3 Crawls and Feeds
Fig. 3.17. Tag counts used to identify text blocks in a web page (cumulative tag count plotted against token count, with the text area appearing as a plateau).
Note that this procedure will only work when the proportion of text tokens in the non-content section is lower than the proportion of tags, which is not the case for the web page in Figure 3.17. Pinto et al. (2002) modified this approach to use a text window to search for low-slope sections of the document slope curve. The structure of the web page can also be used more directly to identify the content blocks in the page. To display a web page using a browser, an HTML parser interprets the structure of the page specified using the tags, and creates a Document Object Model (DOM) representation. The tree-like structure represented by the DOM can be used to identify the major components of the web page.4 Figure 3.18 shows part of the DOM structure for the example web page in Figure 3.16. The part of the structure that contains the text of the story is indicated by the comment cnnArticleContent. Gupta et al. (2003) describe an approach that
4 This was generated using the DOM Inspector tool in the Firefox browser.

91 3.8 Removing Noise 67 navigates the DOM tree recursively, using a variety of filtering techniques to re- move and modify nodes in the tree and leave only content. HTML elements such as images and scripts are removed by simple filters. More complex filters remove advertisements, lists of links, and tables that do not have “substantive” content. Fig. 3.18. Part of the DOM structure for the example web page The DOM structure provides useful information about the components of a web page, but it is complex and is a mixture of logical and layout components. In Figure 3.18, for example, the content of the article is buried in a table cell ( TD tag) in a row ( TR tag) of an HTML table ( TABLE tag). The table is being used in this case to specify layout rather than semantically related data. Another approach to

92 68 3 Crawls and Feeds identifying the content blocks in a page focuses on the layout and presentation of the web page. In other words, visual features—such as the position of the block, the size of the font used, the background and font colors, and the presence of separators (such as lines and spaces)—are used to define blocks of information that would be apparent to the user in the displayed web page. Yu et al. (2003) describe an algorithm that constructs a hierarchy of visual blocks from the DOM tree and visual features. The first algorithm we discussed, based on the distribution of tags, is quite effective for web pages with a single content block. Algorithms that use the DOM structure and visual analysis can deal with pages that may have several content blocks. In the case where there are several content blocks, the relative importance of each block can be used by the indexing process to produce a more effective representation. One approach to judging the importance of the blocks in a web page is to train a classifier that will assign an importance category based on visual and content features (R. Song et al., 2004). References and Further Reading Cho and Garcia-Molina (2002, 2003) wrote a series of influential papers on web crawler design. Our discussion of page refresh policies is based heavily on Cho and Garcia-Molina (2003), and section 3.2.7 draws from Cho and Garcia-Molina (2002). 5 There are many open source web crawlers. The Heritrix crawler, developed for the Internet Archive project, is a capable and scalable example. The system is de- veloped in modules that are highly configurable at runtime, making it particularly suitable for experimentation. Focused crawling attracted much attention in the early days of web search. Menczer and Belew (1998) and Chakrabarti et al. (1999) wrote two of the most influential papers. Menczer and Belew (1998) envision a focused crawler made of autonomous software agents, principally for a single user. The user enters a list of both URLs and keywords. The agent then attempts to find web pages that would be useful to the user, and the user can rate those pages to give feedback to the system. Chakrabarti et al. (1999) focus on crawling for specialized topical indexes. Their crawler uses a classifier to determine the topicality of crawled pages, as well as a distiller, which judges the quality of a page as a source of links to other topical 5 http://crawler.archive.org

93 3.8 Removing Noise 69 pages. They evaluate their system against a traditional, unfocused crawler to show that an unfocused crawler seeded with topical links is not sufficient to achieve a topical crawl. The broad link structure of the Web causes the unfocused crawler to quickly drift to other topics, while the focused crawler successfully stays on topic. The Unicode specification is an incredibly detailed work, covering tens of thousands of characters (Unicode Consortium, 2006). Because of the nature of some non-Western scripts, many glyphs are formed from grouping a number of Unicode characters together, so the specification must detail not just what the characters are, but how they can be joined together. Characters are still being added to Unicode periodically. Bergman (2001) is an extensive study of the deep Web. Even though this study is old by web standards, it shows how sampling through search engines can be used to help estimate the amount of unindexed content on the Web. This study esti- mated that 550 billion web pages existed in the deep Web, compared to 1 billion in the accessible Web. He et al. (2007) describe a more recent survey that shows that the deep Web has continued to expand rapidly in recent years. An example of a technique for generating searchable representations of deep Web databases, called query probing, is described by Ipeirotis and Gravano (2004). Sitemaps, robots.txt files, RSS feeds, and Atom feeds each have their own spec- 6 ifications, which are available on the Web. These formats show that successful web standards are often quite simple. As we mentioned, database systems can be used to store documents from a web crawl for some applications. Our discussion of database systems was, how- ever, limited mostly to a comparison with BigTable. There are a number of text- books, such as Garcia-Molina et al. (2008), that provide much more informa- tion on how databases work, including details about important features such as query languages, locking, and recovery. BigTable, which we referenced frequently, was described in Chang et al. (2006). Other large Internet companies have built their own database systems with similar goals: large-scale distribution and high throughput, but without an expressive query language or detailed transaction sup- port. The Dynamo system from Amazon has low latency guarantees (DeCandia et al., 2007), and Yahoo! uses their UDB system to store large datasets (Baeza- Yates & Ramakrishnan, 2008). 6 http://www.sitemaps.org http://www.robotstxt.org http://www.rssboard.org/rss-specification http://www.rfc-editor.org/rfc/rfc5023.txt

94 70 3 Crawls and Feeds We mentioned DEFLATE (Deutsch, 1996) and LZW (Welch, 1984) as specific document compression algorithms in the text. DEFLATE is the basis for the popular Zip, gzip, and zlib compression tools. LZW is the basis of the Unix compress command, and is also found in file formats such as GIF, PostScript, and PDF. The text by Witten et al. (1999) provides detailed discussions about text and image compression algorithms. Hoad and Zobel (2003) provide both a review of fingerprinting techniques and a comparison to word-based similarity measures for near-duplicate detection. Their evaluation focused on finding versions of documents and plagiarized documents. Bernstein and Zobel (2006) describe a technique for using full fingerprinting (no selection) for the task of finding co-derivatives, which are documents derived from the same source. Bernstein and Zobel (2005) examined the impact of duplication on evaluations of retrieval effectiveness. They showed that about 15% of the relevant documents for one of the TREC tracks were redundant, which could significantly affect the impact of the results from a user's perspective. Henzinger (2006) describes a large-scale evaluation of near-duplicate detection on the Web. The two techniques compared were a version of Broder's "shingling" algorithm (Broder et al., 1997; Fetterly et al., 2003) and simhash (Charikar, 2002). Henzinger's study, which used 1.6 billion pages, showed that neither method worked well for detecting redundant documents on the same site because of the frequent use of "boilerplate" text that makes different pages look similar. For pages on different sites, the simhash algorithm achieved a precision of 50% (meaning that of those pages that were declared "near-duplicate" based on the similarity threshold, 50% were correct), whereas the Broder algorithm produced a precision of 38%. A number of papers have been written about techniques for extracting content from web pages. Yu et al. (2003) and Gupta et al. (2003) are good sources for references to these papers. Exercises 3.1. Suppose you have two collections of documents. The smaller collection is full of useful, accurate, high-quality information. The larger collection contains a few high-quality documents, but also contains lower-quality text that is old, out-of-date, or poorly written. What are some reasons for building a search engine for only the small collection? What are some reasons for building a search engine that covers both collections?

95 3.8 Removing Noise 71
3.2. Suppose you have a network connection that can transfer 10MB per second. If each web page is 10K and requires 500 milliseconds to transfer, how many threads does your web crawler need to fully utilize the network connection? If your crawler needs to wait 10 seconds between requests to the same web server, what is the minimum number of distinct web servers the system needs to contact each minute to keep the network connection fully utilized?
3.3. What is the advantage of using HEAD requests instead of GET requests during crawling? When would a crawler use a GET request instead of a HEAD request?
3.4. Why do crawlers not use POST requests?
3.5. Name the three types of sites mentioned in the chapter that compose the deep Web.
3.6. How would you design a system to automatically enter data into web forms in order to crawl deep Web pages? What measures would you use to make sure your crawler's actions were not destructive (for instance, so that it doesn't add random blog comments)?
3.7. Write a program that can create a valid sitemap based on the contents of a directory on your computer's hard disk. Assume that the files are accessible from a website at the URL http://www.example.com. For instance, if there is a file in your directory called homework.pdf, this would be available at http://www.example.com/homework.pdf. Use the real modification date on the file as the last modified time in the sitemap, and to help estimate the change frequency.
3.8. Suppose that, in an effort to crawl web pages faster, you set up two crawling machines with different starting seed URLs. Is this an effective strategy for distributed crawling? Why or why not?
3.9. Write a simple single-threaded web crawler. Starting from a single input URL (perhaps a professor's web page), the crawler should download a page and then wait at least five seconds before downloading the next page. Your program should find other pages to crawl by parsing link tags found in previously crawled documents.
3.10. UTF-16 is used in Java and Windows®. Compare it to UTF-8.
3.11. How does BigTable handle hardware failure?

96 72 3 Crawls and Feeds
3.12. Design a compression algorithm that compresses HTML tags. Your algorithm should detect tags in an HTML file and replace them with a code of your own design that is smaller than the tag itself. Write an encoder and decoder program.
3.13. Generate checksums for a document by adding the bytes of the document and by using the Unix command cksum. Edit the document and see if both checksums change. Can you change the document so that the simple checksum does not change?
3.14. Write a program to generate simhash fingerprints for documents. You can use any reasonable hash function for the words. Use the program to detect duplicates on your home computer. Report on the accuracy of the detection. How does the detection accuracy vary with fingerprint size?
3.15. Plot the document slope curves for a sample of web pages. The sample should include at least one page containing a news article. Test the accuracy of the simple optimization algorithm for detecting the main content block. Write your own program or use the code from http://www.aidanf.net/software/bte-body-text-extraction. Describe the cases where the algorithm fails. Would an algorithm that searched explicitly for low-slope areas of the document slope curve be successful in these cases?
3.16. Give a high-level outline of an algorithm that would use the DOM structure to identify content information in a web page. In particular, describe heuristics you would use to identify content and non-content elements of the structure.

97 4 Processing Text
"I was trying to comprehend the meaning of the words."
Spock, Star Trek: The Final Frontier
4.1 From Words to Terms After gathering the text we want to search, the next step is to decide whether it should be modified or restructured in some way to simplify searching. The types of changes that are made at this stage are called text transformation or, more often, text processing. The goal of text processing is to convert the many forms in which words can occur into more consistent index terms. Index terms are the representation of the content of a document that are used for searching. The simplest decision about text processing would be to not do it at all. A good example of this is the "find" feature in your favorite word processor. By the time you use the find command, the text you wish to search has already been gathered: it's on the screen. After you type the word you want to find, the word processor scans the document and tries to find the exact sequence of letters that you just typed. This feature is extremely useful, and nearly every text editing program can do this because users demand it. The trouble is that exact text search is rather restrictive. The most annoying restriction is case-sensitivity: suppose you want to find "computer hardware", and there is a sentence in the document that begins with "Computer hardware". Your search query does not exactly match the text in the sentence, because the first letter of the sentence is capitalized. Fortunately, most word processors have an option for ignoring case during searches. You can think of this as a very rudimentary form of online text processing. Like most text processing techniques, ignoring case increases the probability that you will find a match for your query in the document. Many search engines do not distinguish between uppercase and lowercase letters. However, they go much further. As we will see in this chapter, search engines

98 74 4 Processing Text can strip punctuation from words to make them easier to find. Words are split apart in a process called tokenization. Some words may be ignored entirely in order to make query processing more effective and efficient; this is called stopping. The system may use stemming to allow similar words (like "run" and "running") to match each other. Some documents, such as web pages, may have formatting changes (like bold or large text), or explicit structure (like titles, chapters, and captions) that can also be used by the system. Web pages also contain links to other web pages, which can be used to improve document ranking. All of these techniques are discussed in this chapter. These text processing techniques are fairly simple, even though their effects on search results can be profound. None of these techniques involves the computer doing any kind of complex reasoning or understanding of the text. Search engines work because much of the meaning of text is captured by counts of word occurrences and co-occurrences,1 especially when that data is gathered from the huge text collections available on the Web. Understanding the statistical nature of text is fundamental to understanding retrieval models and ranking algorithms, so we begin this chapter with a discussion of text statistics. More sophisticated techniques for natural language processing that involve syntactic and semantic analysis of text have been studied for decades, including their application to information retrieval, but to date have had little impact on ranking algorithms for search engines. These techniques are, however, being used for the task of question answering, which is described in Chapter 11. In addition, techniques involving more complex text processing are being used to identify additional index terms or features for search. Information extraction techniques for identifying people's names, organization names, addresses, and many other special types of features are discussed here, and classification, which can be used to identify semantic categories, is discussed in Chapter 9. Finally, even though this book focuses on retrieving English documents, information retrieval techniques can be used with text in many different languages. In this chapter, we show how different languages require different types of text representation and processing.
1 Word co-occurrence measures the number of times groups of words (usually pairs) occur together in documents. A collocation is the name given to a pair, group, or sequence of words that occur together more often than would be expected by chance. The term association measures that are used to find collocations are discussed in Chapter 6.

99 4.2 Text Statistics 75 4.2 Text Statistics Although language is incredibly rich and varied, it is also very predictable. There are many ways to describe a particular topic or event, but if the words that occur in many descriptions of an event are counted, then some words will occur much more frequently than others. Some of these frequent words, such as "and" or "the," will be common in the description of any event, but others will be characteristic of that particular event. This was observed as early as 1958 by Luhn, when he proposed that the significance of a word depended on its frequency in the document. Statistical models of word occurrences are very important in information retrieval, and are used in many of the core components of search engines, such as the ranking algorithms, query transformation, and indexing techniques. These models will be discussed in later chapters, but we start here with some of the basic models of word occurrence. One of the most obvious features of text from a statistical point of view is that the distribution of word frequencies is very skewed. There are a few words that have very high frequencies and many words that have low frequencies. In fact, the two most frequent words in English ("the" and "of") account for about 10% of all word occurrences. The most frequent six words account for 20% of occurrences, and the most frequent 50 words are about 40% of all text! On the other hand, given a large sample of text, typically about one half of all the unique words in that sample occur only once. This distribution is described by Zipf's law,2 which states that the frequency of the r-th most common word is inversely proportional to r or, alternatively, the rank of a word times its frequency (f) is approximately a constant (k):

r · f = k

We often want to talk about the probability of occurrence of a word, which is just the frequency of the word divided by the total number of word occurrences in the text. In this case, Zipf's law is:

r · P_r = c

where P_r is the probability of occurrence for the r-th ranked word, and c is a constant. For English, c ≈ 0.1. Figure 4.1 shows the graph of Zipf's law with this constant. This clearly shows how the frequency of word occurrence falls rapidly after the first few most common words.
2 Named after the American linguist George Kingsley Zipf.
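The rank-times-probability product is easy to compute for any text. The sketch below counts words in a small string, ranks them by frequency, and prints r · P_r for each word; the toy input is far too small to exhibit Zipf's law, but run over a collection the size of AP89 the product stays close to 0.1, as the next section shows.

import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class ZipfCheck {
    public static void main(String[] args) {
        String text = "the tropical fish are the fish found in the tropical environments of the world";
        // Count word frequencies.
        Map<String, Long> counts = Arrays.stream(text.toLowerCase().split("\\s+"))
                .collect(Collectors.groupingBy(w -> w, Collectors.counting()));
        long total = counts.values().stream().mapToLong(Long::longValue).sum();
        // Sort by decreasing frequency and print rank * probability for each word.
        List<Map.Entry<String, Long>> ranked = counts.entrySet().stream()
                .sorted((a, b) -> Long.compare(b.getValue(), a.getValue()))
                .collect(Collectors.toList());
        for (int r = 1; r <= ranked.size(); r++) {
            double p = (double) ranked.get(r - 1).getValue() / total;
            System.out.printf("%-14s r=%d  r*Pr=%.3f%n", ranked.get(r - 1).getKey(), r, r * p);
        }
    }
}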

100 76 4 Processing Text
Fig. 4.1. Rank versus probability of occurrence for words assuming Zipf's law (rank × probability = 0.1)
To see how well Zipf's law predicts word occurrences in actual text collections, we will use the Associated Press collection of news stories from 1989 (called AP89) as an example. This collection was used in TREC evaluations for several years. Table 4.1 shows some statistics for the word occurrences in AP89. The vocabulary size is the number of unique words in the collection. Even in this relatively small collection, the vocabulary size is quite large (nearly 200,000 unique words). A large proportion of these words (70,000) occur only once. Words that occur once in a text corpus or book have long been regarded as important in text analysis, and have been given the special name of Hapax Legomena.3 Table 4.2 shows the 50 most frequent words from the AP89 collection, together with their frequencies, ranks, probability of occurrence (converted to a percentage of total occurrences), and the r · P_r value. From this table, we can see
3 The name was created by scholars studying the Bible. Since the 13th century, people have studied the word occurrences in the Bible and, of particular interest, created concordances, which are indexes of where words occur in the text. Concordances are the ancestors of the inverted files that are used in modern search engines. The first concordance was said to have required 500 monks to create.

101 4.2 Text Statistics 77
Table 4.1. Statistics for the AP89 collection:
Total documents                  84,678
Total word occurrences       39,749,179
Vocabulary size                 198,763
Words occurring > 1000 times      4,169
Words occurring once             70,064
that Zipf's law is quite accurate, in that the value of r · P_r is approximately constant, and close to 0.1. The biggest variations are for some of the most frequent words. In fact, it is generally observed that Zipf's law is inaccurate for low and high ranks (high-frequency and low-frequency words). Table 4.3 gives some examples for lower-frequency words from AP89. Figure 4.2 shows a log-log plot4 of the r · P_r values for all words in the AP89 collection. Zipf's law is shown as a straight line on this graph since log P_r = log(c · r^(-1)) = log c − log r. This figure clearly shows how the predicted relationship breaks down at high ranks (approximately rank 10,000 and above). A number of modifications to Zipf's law have been proposed,5 some of which have interesting connections to cognitive models of language. It is possible to derive a simple formula for predicting the proportion of words with a given frequency from Zipf's law. A word that occurs n times has rank r_n = k/n. In general, more than one word may have the same frequency. We assume that the rank r_n is associated with the last of the group of words with the same frequency. In that case, the number of words with the same frequency n will be given by r_n − r_{n+1}, which is the rank of the last word in the group minus the rank of the last word of the previous group of words with a higher frequency (remember that higher-frequency words have lower ranks). For example, Table 4.4 has an example of a ranking of words in decreasing order of their frequency. The number of words with frequency 5,099 is the rank of the last member of that
4 The x and y axes of a log-log plot show the logarithm of the values of x and y, not the values themselves.
5 The most well-known is the derivation by the mathematician Benoit Mandelbrot (the same person who developed fractal geometry), which is (r + β)^α · P_r = γ, where α, β, and γ are parameters that can be tuned for a particular text. In the case of the AP89 collection, however, the fit for the frequency data is not noticeably better than the Zipf distribution.

102 78 4 Processing Text
Table 4.2. Most frequent 50 words from AP89 (Word, Freq., r, P_r (%), r · P_r):
the 2,420,778 1 6.49 0.065
of 1,045,733 2 2.80 0.056
to 968,882 3 2.60 0.078
a 892,429 4 2.39 0.096
and 865,644 5 2.32 0.120
in 847,825 6 2.27 0.140
said 504,593 7 1.35 0.095
for 363,865 8 0.98 0.078
that 347,072 9 0.93 0.084
was 293,027 10 0.79 0.079
on 291,947 11 0.78 0.086
he 250,919 12 0.67 0.081
is 245,843 13 0.65 0.086
with 223,846 14 0.60 0.084
at 210,064 15 0.56 0.085
by 209,586 16 0.56 0.090
it 195,621 17 0.52 0.089
from 189,451 18 0.51 0.091
as 181,714 19 0.49 0.093
be 157,300 20 0.42 0.084
were 153,913 21 0.41 0.087
an 152,576 22 0.41 0.090
have 149,749 23 0.40 0.092
his 142,285 24 0.38 0.092
but 140,880 25 0.38 0.094
has 136,007 26 0.37 0.095
are 130,322 27 0.35 0.094
not 127,493 28 0.34 0.096
who 116,364 29 0.31 0.090
they 111,024 30 0.30 0.089
its 111,021 31 0.30 0.092
had 103,943 32 0.28 0.089
will 102,949 33 0.28 0.091
would 99,503 34 0.27 0.091
about 92,983 35 0.25 0.087
i 92,005 36 0.25 0.089
been 88,786 37 0.24 0.088
this 87,286 38 0.23 0.089
their 84,638 39 0.23 0.089
new 83,449 40 0.22 0.090
or 81,796 41 0.22 0.090
which 80,385 42 0.22 0.091
we 80,245 43 0.22 0.093
more 76,388 44 0.21 0.090
after 75,165 45 0.20 0.091
us 72,045 46 0.19 0.089
percent 71,956 47 0.19 0.091
up 71,082 48 0.19 0.092
one 70,266 49 0.19 0.092
people 68,988 50 0.19 0.093
Table 4.3. Low-frequency words from AP89 (Word, Freq., r, P_r (%), r · P_r):
assistant 5,095 1,021 .013 0.13
sewers 100 17,110 .000256 0.04
toothbrush 10 51,555 .000025 0.01
hazmat 1 166,945 .000002 0.04

103 4.2 Text Statistics 79
Fig. 4.2. A log-log plot of Zipf's law compared to real data from AP89. The predicted relationship between probability of occurrence and rank breaks down badly at high ranks.
group ("chemical") minus the rank of the last member of the previous group with higher frequency ("summit"), which is 1006 − 1002 = 4.
Table 4.4. Example word frequency ranking (Rank, Word, Frequency):
1000 concern 5,100
1001 spoke 5,100
1002 summit 5,100
1003 bring 5,099
1004 star 5,099
1005 immediate 5,099
1006 chemical 5,099
1007 african 5,098

104 80 4 Processing Text Given that the number of words with frequency n is r_n − r_{n+1} = k/n − k/(n+1) = k/(n(n+1)), the proportion of words with this frequency can be found by dividing this number by the total number of words, which will be the rank of the last word with frequency 1. The rank of the last word in the vocabulary is k/1 = k. The proportion of words with frequency n, therefore, is given by 1/(n(n+1)). This formula predicts, for example, that 1/2 of the words in the vocabulary will occur once. Table 4.5 compares the predictions of this formula with real data from a different TREC collection.
Table 4.5. Proportions of words occurring n times in 336,310 documents from the TREC Volume 3 corpus. The total vocabulary size (number of unique words) is 508,209. (Number of Occurrences (n), Predicted Proportion (1/n(n+1)), Actual Proportion, Actual Number of Words):
1   0.500  0.402  204,357
2   0.167  0.132   67,082
3   0.083  0.069   35,083
4   0.050  0.046   23,271
5   0.033  0.032   16,332
6   0.024  0.024   12,421
7   0.018  0.019    9,766
8   0.014  0.016    8,200
9   0.011  0.014    6,907
10  0.009  0.012    5,893
4.2.1 Vocabulary Growth Another useful prediction related to word occurrence is vocabulary growth. As the size of the corpus grows, new words occur. Based on the assumption of a Zipf distribution for words, we would expect that the number of new words that occur in a given amount of new text would decrease as the size of the corpus increases. New words will, however, always occur due to sources such as invented words (think of all those drug names and start-up company names), spelling errors, product numbers, people's names, email addresses, and many others. The relationship between the size of the corpus and the size of the vocabulary was found empirically by Heaps (1978) to be:

105 4.2 Text Statistics 81

v = k · n^β

where v is the vocabulary size for a corpus of size n words, and k and β are parameters that vary for each collection. This is sometimes referred to as Heaps' law. Typical values for k and β are often stated to be 10 ≤ k ≤ 100 and β ≈ 0.5. Heaps' law predicts that the number of new words will increase very rapidly when the corpus is small and will continue to increase indefinitely, but at a slower rate for larger corpora. Figure 4.3 shows a plot of vocabulary growth for the AP89 collection compared to a graph of Heaps' law with k = 62.95 and β = 0.455. Clearly, Heaps' law is a good fit. The parameter values are similar for many of the other TREC news collections. As an example of the accuracy of this prediction, if the first 10,879,522 words of the AP89 collection are scanned, Heaps' law predicts that the number of unique words will be 100,151, whereas the actual value is 100,024. Predictions are much less accurate for small numbers of words (< 1,000).
Fig. 4.3. Vocabulary growth for the TREC AP89 collection compared to Heaps' law (words in vocabulary plotted against words in collection).
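Evaluating the Heaps' law estimate is a one-line calculation. The sketch below uses the AP89 parameters quoted above (k = 62.95, β = 0.455) and reproduces, up to rounding, the prediction for the first 10,879,522 words of the collection.

public class HeapsLaw {
    // Heaps' law: predicted vocabulary size for a corpus of n word occurrences.
    public static double vocabularySize(double n, double k, double beta) {
        return k * Math.pow(n, beta);
    }

    public static void main(String[] args) {
        // Parameters fit to the AP89 collection in the text: k = 62.95, beta = 0.455.
        double predicted = vocabularySize(10879522, 62.95, 0.455);
        System.out.printf("predicted vocabulary size: %.0f%n", predicted);  // roughly 100,150
    }
}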

106 82 4 Processing Text Web-scale collections are considerably larger than the AP89 collection. The AP89 collection contains about 40 million words, but the (relatively small) TREC Web collection GOV26 contains more than 20 billion words. With that many words, it seems likely that the number of new words would eventually drop to near zero and Heaps' law would not be applicable. It turns out this is not the case. Figure 4.4 shows a plot of vocabulary growth for GOV2 together with a graph of Heaps' law with k = 7.34 and β = 0.648. This data indicates that the number of unique words continues to grow steadily even after reaching 30 million. This has significant implications for the design of search engines, which will be discussed in Chapter 5. Heaps' law provides a good fit for this data, although the parameter values are very different than those for other TREC collections and outside the boundaries established as typical with these and other smaller collections.
Fig. 4.4. Vocabulary growth for the TREC GOV2 collection compared to Heaps' law (words in vocabulary plotted against words in collection).
6 Web pages crawled from websites in the .gov domain during early 2004. See section 8.2 for more details.

107 4.2 Text Statistics 83 4.2.2 Estimating Collection and Result Set Sizes Word occurrence statistics can also be used to estimate the size of the results from a web search. All web search engines have some version of the query interface shown in Figure 4.5, where immediately after the query ("tropical fish aquarium" in this case) and before the ranked list of results, an estimate of the total number of results is given. This is typically a very large number, and descriptions of these systems always point out that it is just an estimate. Nevertheless, it is always included.
Fig. 4.5. Result size estimate for web search (the interface shows the query "tropical fish aquarium" and "Page 1 of 3,880,000 results").
To estimate the size of a result set, we first need to define "results." For the purposes of this estimation, a result is any document (or web page) that contains all of the query words. Some search applications will rank documents that do not contain all the query words, but given the huge size of the Web, this is usually not necessary. If we assume that words occur independently of each other, then the probability of a document containing all the words in the query is simply the product of the probabilities of the individual words occurring in a document. For example, if there are three query words a, b, and c, then:

P(a ∩ b ∩ c) = P(a) · P(b) · P(c)

where P(a ∩ b ∩ c) is the joint probability, or the probability that all three words occur in a document, and P(a), P(b), and P(c) are the probabilities of each word occurring in a document. A search engine will always have access to the number of documents that a word occurs in (f_a, f_b, and f_c),7 and the number of documents in the collection (N), so these probabilities can easily be estimated as P(a) = f_a/N, P(b) = f_b/N, and P(c) = f_c/N. This gives us

f_abc = N · f_a/N · f_b/N · f_c/N = (f_a · f_b · f_c)/N^2

where f_abc is the estimated size of the result set.
7 Note that these are document occurrence frequencies, not the total number of word occurrences (there may be many occurrences of a word in a document).
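A minimal sketch of this independence-based estimate is given below, using the GOV2 document frequencies for "tropical", "fish", and "aquarium" that appear in Table 4.6 on the next page; as that table shows, the estimate of about 6 is far below the true count of 1,529, which is exactly the failure of the independence assumption discussed next.

public class ResultSizeEstimate {
    // Estimated number of documents containing all three words, assuming word independence.
    public static long estimate(long fa, long fb, long fc, long n) {
        double pa = (double) fa / n;
        double pb = (double) fb / n;
        double pc = (double) fc / n;
        return Math.round(n * pa * pb * pc);   // N * P(a) * P(b) * P(c)
    }

    public static void main(String[] args) {
        long n = 25205179L;                    // GOV2 collection size
        long tropical = 120990, fish = 1131855, aquarium = 26480;
        System.out.println(estimate(tropical, fish, aquarium, n));  // prints 6
    }
}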

108 84 4 Processing Text
Table 4.6. Document frequencies and estimated frequencies for word combinations (assuming independence) in the GOV2 Web collection. Collection size (N) is 25,205,179. (Word(s), Document Frequency, Estimated Frequency):
tropical 120,990
fish 1,131,855
aquarium 26,480
breeding 81,885
tropical fish 18,472 5,433
tropical aquarium 1,921 127
tropical breeding 5,510 393
fish aquarium 9,722 1,189
fish breeding 36,427 3,677
aquarium breeding 1,848 86
tropical fish aquarium 1,529 6
tropical fish breeding 3,629 18
Table 4.6 gives document occurrence frequencies for the words "tropical", "fish", "aquarium", and "breeding", and for combinations of those words in the TREC GOV2 Web collection. It also gives the estimated size of the frequencies of the combinations based on the independence assumption. Clearly, this assumption does not lead to good estimates for result size, especially for combinations of three words. The problem is that the words in these combinations do not occur independently of each other. If we see the word "fish" in a document, for example, then the word "aquarium" is more likely to occur in this document than in one that does not contain "fish". Better estimates are possible if word co-occurrence information is also available from the search engine. Obviously, this would give exact answers for two-word queries. For longer queries, we can improve the estimate by not assuming independence. In general, for three words

P(a ∩ b ∩ c) = P(a ∩ b) · P(c | (a ∩ b))

where P(a ∩ b) is the probability that the words a and b co-occur in a document, and P(c | (a ∩ b)) is the probability that the word c occurs in a document given that the words a and b occur in the document.8 If we have co-occurrence information,
8 This is called a conditional probability.

109 4.2 Text Statistics 85 we can approximate this probability using either P(c | a) or P(c | b), whichever is the largest. For the example query "tropical fish aquarium" in Table 4.6, this means we estimate the result set size by multiplying the number of documents containing both "tropical" and "aquarium" by the probability that a document contains "fish" given that it contains "aquarium", or:

f_{tropical ∩ aquarium ∩ fish} = f_{tropical ∩ aquarium} · f_{aquarium ∩ fish}/f_{aquarium}
= 1921 · 9722/26480 = 705

Similarly, for the query "tropical fish breeding":

f_{tropical ∩ breeding ∩ fish} = f_{tropical ∩ breeding} · f_{breeding ∩ fish}/f_{breeding}
= 5510 · 36427/81885 = 2451

These estimates are much better than the ones produced assuming independence, but they are still too low. Rather than storing even more information, such as the number of occurrences of word triples, it turns out that reasonable estimates of result size can be made using just word frequency and the size of the current result set. Search engines estimate the result size because they do not rank all the documents that contain the query words. Instead, they rank a much smaller subset of the documents that are likely to be the most relevant. If we know the proportion of the total documents that have been ranked (s) and the number of documents found that contain all the query words (C), we can simply estimate the result size as C/s, which assumes that the documents containing all the words are distributed uniformly.9 The proportion of documents processed is measured by the proportion of the documents containing the least frequent word that have been processed, since all results must contain that word. For example, if the query "tropical fish aquarium" is used to rank GOV2 documents in the Galago search engine, after processing 3,000 out of the 26,480 documents that contain "aquarium", the number of documents containing all three words is 258. This gives a result size estimate of 258/(3,000 ÷ 26,480) = 2,277. After processing just over 20% of the documents, the estimate is 1,778 (compared to the actual figure of 1,529). For the query "tropical fish breeding", the estimates after processing 10% and 20% of the documents that contain "breeding" are 4,076
9 We are also assuming document-at-a-time processing, where the inverted lists for all query words are processed at the same time, giving complete document scores (see Chapter 5).

110 86 4 Processing Text and 3,762 (compared to 3,629). These estimates, as well as being quite accurate, do not require knowledge of the total number of documents in the collection. Estimating the total number of documents stored in a search engine is, in fact, of significant interest to both academia (how big is the Web?) and business (which search engine has better coverage of the Web?). A number of papers have been written about techniques to do this, and one of these is based on the concept of word independence that we used before. If a and b are two words that occur independently, then

f_ab/N = f_a/N · f_b/N
and
N = (f_a · f_b)/f_ab

To get a reasonable estimate of N, the two words should be independent and, as we have seen from the examples in Table 4.6, this is often not the case. We can be more careful about the choice of query words, however. For example, if we use the word "lincoln" (document frequency 771,326 in GOV2), we would expect the words in the query "tropical lincoln" to be more independent than the word pairs in Table 4.6 (since the former are less semantically related). The document frequency of "tropical lincoln" in GOV2 is 3,018, which means we can estimate the size of the collection as N = (120,990 · 771,326)/3,018 = 30,922,045. This is quite close to the actual number of 25,205,179.
4.3 Document Parsing
4.3.1 Overview Document parsing involves the recognition of the content and structure of text documents. The primary content of most documents is the words that we were counting and modeling using the Zipf distribution in the previous section. Recognizing each word occurrence in the sequence of characters in a document is called tokenizing or lexical analysis. Apart from these words, there can be many other types of content in a document, such as metadata, images, graphics, code, and tables. As mentioned in Chapter 2, metadata is information about a document that is not part of the text content. Metadata content includes document attributes such as date and author, and, most importantly, the tags that are used by markup languages to identify document components. The most popular

111 4.3 Document Parsing 87 markup languages are HTML (Hypertext Markup Language) and XML (Exten- sible Markup Language). The parser uses the tags and other metadata recognized in the document to interpret the document’s structure based on the syntax of the markup language ( syntacticanalysis ) and to produce a representation of the document that includes both the structure and content. For example, an HTML parser interprets the structure of a web page as specified using HTML tags, and creates a Document Object Model (DOM) representation of the page that is used by a web browser. In a search engine, the output of a document parser is a representation of the con- tent and structure that will be used for creating indexes. Since it is important for a search index to represent every document in a collection, a document parser for a search engine is often more tolerant of syntax errors than parsers used in other applications. In the first part of our discussion of document parsing, we focus on the recog- nition of the tokens, words, and phrases that make up the content of the docu- ments. In later sections, we discuss separately the important topics related to doc- ument structure, namely markup, links, and extraction of structure from the text content. 4.3.2 Tokenizing Tokenizing is the process of forming words from the sequence of characters in a document. In English text, this appears to be simple. In many early systems, a “word” was defined as any sequence of alphanumeric characters of length 3 or more, terminated by a space or other special character. All uppercase letters were 10 also converted to lowercase. This means, for example, that the text Bigcorp’s 2007 bi-annual report showed profits rose 10%. would produce the following sequence of tokens: bigcorp 2007 annual report showed profits rose Although this simple tokenizing process was adequate for experiments with small test collections, it does not seem appropriate for most search applications or even experiments with TREC collections, because too much information is discarded. Some examples of issues involving tokenizing that can have significant impact on the effectiveness of search are: 10 This is sometimes referred to as case folding , case normalization , or downcasing .

112 88 4 Processing Text
• Small words (one or two characters) can be important in some queries, usually in combinations with other words. For example, ma, pm, ben e king, el paso, xp, gm, j lo, master p, world war II.11
• Both hyphenated and non-hyphenated forms of many words are common. In some cases the hyphen is not needed. For example, e-bay, wal-mart, active-x, cd-rom, t-shirts. At other times, hyphens should be considered either as part of the word or a word separator. For example, winston-salem, mazda rx-7, e-cards, pre-diabetes, t-mobile, spanish-speaking.
• Special characters are an important part of the tags, URLs, code, and other important parts of documents that must be correctly tokenized.
• Capitalized words can have different meaning from lowercase words. For example, "Bush" and "Apple".
• Apostrophes can be a part of a word, a part of a possessive, or just a mistake. For example, rosie o'donnell, can't, don't, 80's, 1890's, men's straw hats, master's degree, england's ten largest cities, shriner's.
• Numbers can be important, including decimals. For example, nokia 3250, top 10 courses, united 93, quicktime 6.5 pro, 92.3 the beat, 288358 (yes, this was a real query; it's a patent number).
• Periods can occur in numbers, abbreviations (e.g., "I.B.M.", "Ph.D."), URLs, ends of sentences, and other situations.
From these examples, tokenizing seems more complicated than it first appears. The fact that these examples come from queries also emphasizes that the text processing for queries must be the same as that used for documents. If different tokenizing processes are used for queries and documents, many of the index terms used for documents will simply not match the corresponding terms from queries. Mistakes in tokenizing become obvious very quickly through retrieval failures. To be able to incorporate the range of language processing required to make matching effective, the tokenizing process should be both simple and flexible. One approach to doing this is for the first pass of tokenizing to focus entirely on identifying markup or tags in the document. This could be done using a tokenizer and parser designed for the specific markup language used (e.g., HTML), but it should accommodate syntax errors in the structure, as mentioned previously. A second pass of tokenizing can then be done on the appropriate parts of the document structure. Some parts that are not used for searching, such as those containing HTML code, will be ignored in this pass.
11 These and other examples were taken from a small sample of web queries.
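As a concrete reference point, here is a minimal sketch of the early-style tokenizer described at the start of this section (alphanumeric runs of length three or more, lowercased). It reproduces the token sequence shown for the Bigcorp example, and also makes it easy to see why small words, hyphens, and numbers in the list above get lost.

import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class SimpleTokenizer {
    private static final Pattern TOKEN = Pattern.compile("[a-zA-Z0-9]{3,}");

    // Keeps alphanumeric sequences of length 3 or more and converts them to lowercase.
    public static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(text);
        while (m.find()) {
            tokens.add(m.group().toLowerCase());
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("Bigcorp's 2007 bi-annual report showed profits rose 10%."));
        // [bigcorp, 2007, annual, report, showed, profits, rose]
    }
}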

Given that nearly everything in the text of a document can be important for some query, the tokenizing rules have to convert most of the content to searchable tokens. Instead of trying to do everything in the tokenizer, some of the more difficult issues, such as identifying word variants or recognizing that a string is a name or a date, can be handled by separate processes, including stemming, information extraction, and query transformation. Information extraction usually requires the full form of the text as input, including capitalization and punctuation, so this information must be retained until extraction has been done. Apart from this restriction, capitalization is rarely important for searching, and text can be reduced to lowercase for indexing. This does not mean that capitalized words are not used in queries. They are, in fact, used quite often, but in queries where the capitalization does not reduce ambiguity and so does not impact effectiveness. Words such as "Apple" that are often used in examples (but not so often in real queries) can be handled by query reformulation techniques (Chapter 6) or simply by relying on the most popular pages (section 4.5).

If we take the view that complicated issues are handled by other processes, the most general strategy for hyphens, apostrophes, and periods would be to treat them as word terminators (like spaces). It is important that all the tokens produced are indexed, including single characters such as "s" and "o". This will mean, for example, that the query "o'connor" is equivalent to "o connor", "bob's" is equivalent to "bob s", and "rx-7" is equivalent to "rx 7". Note that this will also mean that a word such as "rx7" will be a different token than "rx-7" and therefore will be indexed separately. The task of relating the queries rx7, rx 7, and rx-7 will then be handled by the query transformation component of the search engine.

On the other hand, if we rely entirely on the query transformation component to make the appropriate connections or inferences between words, there is the risk that effectiveness could be lowered, particularly in applications where there is not enough data for reliable query expansion. In these cases, more rules can be incorporated into the tokenizer to ensure that the tokens produced by the query text will match the tokens produced from document text. For example, in the case of TREC collections, a rule that tokenizes all words containing apostrophes by the string without the apostrophe is very effective. With this rule, "O'Connor" would be tokenized as "oconnor" and "Bob's" would produce the token "bobs". Another effective rule for TREC collections is to tokenize all abbreviations containing periods as the string without periods. An abbreviation in this case is any string of alphabetic single characters separated by periods. This rule would tokenize "I.B.M." as "ibm", but "Ph.D." would still be tokenized as "ph d".

12 We assume the common syntax for web queries where "" means match exactly the phrase contained in the quotes.
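A minimal sketch of these two rules is shown below. The method names are illustrative only; a real tokenizer would apply many such rules during the second tokenizing pass.

```java
// A sketch of the two TREC-style rules described above: words containing apostrophes
// are tokenized as the string without the apostrophe, and abbreviations (single letters
// separated by periods) are tokenized as the string without periods.
public class TokenRules {
    // "O'Connor" -> "oconnor", "Bob's" -> "bobs"
    static String applyApostropheRule(String token) {
        return token.replace("'", "").toLowerCase();
    }

    // "I.B.M." -> "ibm"; "Ph.D." is left alone because "Ph" is not a single letter
    static String applyAbbreviationRule(String token) {
        if (token.matches("([A-Za-z]\\.)+")) {
            return token.replace(".", "").toLowerCase();
        }
        return token;
    }

    public static void main(String[] args) {
        System.out.println(applyApostropheRule("O'Connor"));   // oconnor
        System.out.println(applyApostropheRule("Bob's"));      // bobs
        System.out.println(applyAbbreviationRule("I.B.M."));   // ibm
        System.out.println(applyAbbreviationRule("Ph.D."));    // Ph.D. (unchanged)
    }
}
```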

In summary, the most general tokenizing process will involve first identifying the document structure and then identifying words in text as any sequence of alphanumeric characters, terminated by a space or special character, with everything converted to lowercase. This is not much more complicated than the simple process we described at the start of the section, but it relies on information extraction and query transformation to handle the difficult issues. In many cases, additional rules are added to the tokenizer to handle some of the special characters, to ensure that query and document tokens will match.

4.3.3 Stopping

Human language is filled with function words: words that have little meaning apart from other words. The most common of these, such as "the," "a," "an," "that," and "those," are determiners. These words are part of how we describe nouns in text, and express concepts like location or quantity. Prepositions, such as "over," "under," "above," and "below," represent relative position between two nouns.

Two properties of these function words cause us to want to treat them in a special way in text processing. First, these function words are extremely common. Table 4.2 shows that nearly all of the most frequent words in the AP89 collection fall into this category. Keeping track of the quantity of these words in each document requires a lot of disk space. Second, both because of their commonness and their function, these words rarely indicate anything about document relevance on their own. If we are considering individual words in the retrieval process and not phrases, these function words will help us very little.

In information retrieval, these function words have a second name: stopwords. We call them stopwords because text processing stops when one is seen, and it is thrown out. Throwing out these words decreases index size, increases retrieval efficiency, and generally improves retrieval effectiveness. Constructing a stopword list must be done with caution. Removing too many words will hurt retrieval effectiveness in particularly frustrating ways for the user. For instance, the query "to be or not to be" consists entirely of words that are usually considered stopwords. Although not removing stopwords may cause some problems in ranking, removing stopwords can cause perfectly valid queries to return no results.

A stopword list can be constructed by simply using the top n (e.g., 50) most frequent words in a collection. This can, however, lead to words being included that are important for some queries. More typically, either a standard stopword list is used, or a list of frequent words and standard stopwords is manually edited to remove any words that may be significant for a particular application. It is also possible to create stopword lists that are customized for specific parts of the document structure (also called fields). For example, the words "click", "here", and "privacy" may be reasonable stopwords to use when processing anchor text.

If storage space requirements allow, it is best to at least index all words in the documents. If stopping is required, the stopwords can always be removed from queries. By keeping the stopwords in the index, there will be a number of possible ways to execute a query with stopwords in it. For instance, many systems will remove stopwords from a query unless the word is preceded by a plus sign (+). If keeping stopwords in an index is not possible because of space requirements, as few as possible should be removed in order to maintain maximum flexibility.

13 Such as the one distributed with the Lemur toolkit and included with Galago.
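The query-time strategy just described is small enough to sketch directly. The stopword list below is a tiny stand-in for a real list such as the one included with Galago, and the class name is ours.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// A minimal sketch of query-time stopping: stopwords stay in the index and are removed
// only from queries, unless the user marks a word with a leading "+".
public class QueryStopper {
    private final Set<String> stopwords;

    public QueryStopper(Set<String> stopwords) {
        this.stopwords = stopwords;
    }

    public List<String> process(List<String> queryTokens) {
        List<String> result = new ArrayList<>();
        for (String token : queryTokens) {
            if (token.startsWith("+")) {
                result.add(token.substring(1));        // "+" forces the word to be kept
            } else if (!stopwords.contains(token)) {
                result.add(token);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        QueryStopper stopper =
            new QueryStopper(new HashSet<>(Arrays.asList("to", "be", "or", "not", "the")));
        System.out.println(stopper.process(Arrays.asList("to", "be", "or", "not", "to", "be")));
        // [] : the whole query disappears, the problem described above
        System.out.println(stopper.process(Arrays.asList("+to", "+be", "or", "not", "+to", "+be")));
        // [to, be, to, be]
    }
}
```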

4.3.4 Stemming

Part of the expressiveness of natural language comes from the huge number of ways to convey a single idea. This can be a problem for search engines, which rely on matching words to find relevant documents. Instead of restricting matches to words that are identical, a number of techniques have been developed to allow a search engine to match words that are semantically related. Stemming, also called conflation, is a component of text processing that captures the relationships between different variations of a word. More precisely, stemming reduces the different forms of a word that occur because of inflection (e.g., plurals, tenses) or derivation (e.g., making a verb into a noun by adding the suffix -ation) to a common stem.

Suppose you want to search for news articles about Mark Spitz's Olympic swimming career. You might type "mark spitz swimming" into a search engine. However, news articles are usually summaries of events that have already happened, so they are likely to contain the word "swam" instead of "swimming." It is the job of the stemmer to reduce "swimming" and "swam" to the same stem (probably "swim") and thereby allow the search engine to determine that there is a match between these two words.

In general, using a stemmer for search applications with English text produces a small but noticeable improvement in the quality of results. In applications involving highly inflected languages, such as Arabic or Russian, stemming is a crucial part of effective search.

There are two basic types of stemmers: algorithmic and dictionary-based. An algorithmic stemmer uses a small program to decide whether two words are related, usually based on knowledge of word suffixes for a particular language. By contrast, a dictionary-based stemmer has no logic of its own, but instead relies on pre-created dictionaries of related terms to store term relationships.

The simplest kind of English algorithmic stemmer is the suffix-s stemmer. This kind of stemmer assumes that any word ending in the letter "s" is plural, so cakes → cake, dogs → dog. Of course, this rule is not perfect. It cannot detect many plural relationships, like "century" and "centuries". In very rare cases, it detects a relationship where it does not exist, such as with "I" and "is". The first kind of error is called a false negative, and the second kind of error is called a false positive.

More complicated algorithmic stemmers reduce the number of false negatives by considering more kinds of suffixes, such as -ing or -ed. By handling more suffix types, the stemmer can find more term relationships; in other words, the false negative rate is reduced. However, the false positive rate (finding a relationship where none exists) generally increases.

14 These terms are used in any binary decision process to describe the two types of errors. This includes evaluation (Chapter 8) and classification (Chapter 9).
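A suffix-s stemmer fits in a few lines, and the sketch below also shows both kinds of errors mentioned above. The class name is illustrative.

```java
// A minimal sketch of the suffix-s stemmer: strip a trailing "s". It illustrates a false
// negative ("centuries" is not conflated with "century") and a false positive ("is" is
// reduced to "i", matching the lowercased pronoun "I").
public class SuffixSStemmer {
    static String stem(String word) {
        if (word.length() > 1 && word.endsWith("s")) {
            return word.substring(0, word.length() - 1);
        }
        return word;
    }

    public static void main(String[] args) {
        System.out.println(stem("cakes"));     // cake
        System.out.println(stem("dogs"));      // dog
        System.out.println(stem("centuries")); // centurie  (false negative: not "century")
        System.out.println(stem("is"));        // i         (false positive)
    }
}
```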

The most popular algorithmic stemmer is the Porter stemmer. This has been used in many information retrieval experiments and systems since the 1970s, and a number of implementations are available. The stemmer consists of a number of steps, each containing a set of rules for removing suffixes. At each step, the rule for the longest applicable suffix is executed. Some of the rules are obvious, whereas others require some thought to work out what they are doing. As an example, here are the first two parts of step 1 (of 5 steps):

Step 1a:
- Replace sses by ss (e.g., stresses → stress).
- Delete s if the preceding word part contains a vowel not immediately before the s (e.g., gaps → gap but gas → gas).
- Replace ied or ies by i if preceded by more than one letter, otherwise by ie (e.g., ties → tie, cries → cri).
- If suffix is us or ss do nothing (e.g., stress → stress).

Step 1b:
- Replace eed or eedly by ee if it is in the part of the word after the first non-vowel following a vowel (e.g., agreed → agree, feed → feed).
- Delete ed, edly, ing, or ingly if the preceding word part contains a vowel, and then if the word ends in at, bl, or iz add e (e.g., fished → fish, pirating → pirate), or if the word ends with a double letter that is not ll, ss, or zz, remove the last letter (e.g., falling → fall, dripping → drip), or if the word is short, add e (e.g., hoping → hope).
- Whew!

The Porter stemmer has been shown to be effective in a number of TREC evaluations and search applications. It is difficult, however, to capture all the subtleties of a language in a relatively simple algorithm. The original version of the Porter stemmer made a number of errors, both false positives and false negatives. Table 4.7 shows some of these errors. It is easy to imagine how confusing "execute" with "executive" or "organization" with "organ" could cause significant problems in the ranking. A more recent form of the stemmer (called Porter2) fixes some of these problems and provides a mechanism to specify exceptions.

False positives                False negatives
organization/organ             european/europe
generalization/generic         cylinder/cylindrical
numerical/numerous             matrices/matrix
policy/police                  urgency/urgent
university/universe            create/creation
addition/additive              analysis/analyses
negligible/negligent           useful/usefully
execute/executive              noise/noisy
past/paste                     decompose/decomposition
ignore/ignorant                sparse/sparsity
special/specialized            resolve/resolution
head/heading                   triangle/triangular

Table 4.7. Examples of errors made by the original Porter stemmer. False positives are pairs of words that have the same stem. False negatives are pairs that have different stems.

15 http://tartarus.org/martin/PorterStemmer/
16 http://snowball.tartarus.org

A dictionary-based stemmer provides a different approach to the problem of stemming errors. Instead of trying to detect word relationships from letter patterns, we can store lists of related words in a large dictionary. Since these word lists can be created by humans, we can expect that the false positive rate will be very low for these words. Related words do not even need to look similar; a dictionary stemmer can recognize that "is," "be," and "was" are all forms of the same verb. Unfortunately, the dictionary cannot be infinitely long, so it cannot react automatically to new words. This is an important problem since language is constantly evolving.

It is possible to build stem dictionaries automatically by statistical analysis of a text corpus. Since this is particularly useful when stemming is used for query expansion, we discuss this technique in section 6.2.1.

Another strategy is to combine an algorithmic stemmer with a dictionary-based stemmer. Typically, irregular words such as the verb "to be" are the oldest in the language, while new words follow more regular grammatical conventions. This means that newly invented words are likely to work well with an algorithmic stemmer. A dictionary can be used to detect relationships between common words, and the algorithmic stemmer can be used for unrecognized words.

A well-known example of this hybrid approach is the Krovetz stemmer (Krovetz, 1993). This stemmer makes constant use of a dictionary to check whether the word is valid. The dictionary in the Krovetz stemmer is based on a general English dictionary but also uses exceptions that are generated manually. Before a word is stemmed, the dictionary is checked to see whether it is present; if it is, it is either left alone (if it is in the general dictionary) or stemmed based on the exception entry. If the word is not in the dictionary, it is checked against a list of common inflectional and derivational suffixes. If one is found, it is removed and the dictionary is again checked to see whether the word is present. If it is not found, the ending of the word may be modified based on the ending that was removed. For example, if the ending -ies is found, it is replaced by -ie and checked in the dictionary. If it is found in the dictionary, the stem is accepted; otherwise the ending is replaced by y. This will result in calories → calorie, for example. The suffixes are checked in a sequence (for example, plurals before -ion endings), so multiple suffixes may be removed.

The Krovetz stemmer has a lower false positive rate than the Porter stemmer, but also tends to have a higher false negative rate, depending on the size of the exception dictionaries. Overall, the effectiveness of the two stemmers is comparable when used in search evaluations. The Krovetz stemmer has the additional advantage of producing stems that, in most cases, are full words, whereas the Porter stemmer often produces stems that are word fragments. This is a concern if the stems are used in the search interface.
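The sketch below shows the dictionary-checking flow just described for a single suffix (-ies). It is only in the spirit of the Krovetz stemmer: the real stemmer handles many more suffixes and a large exception dictionary, and the class and method names here are ours.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// A simplified sketch of a dictionary-plus-suffix stemmer for the -ies ending only.
public class DictionaryStemmerSketch {
    private final Set<String> dictionary;          // general English dictionary
    private final Map<String, String> exceptions;  // manually generated exceptions

    public DictionaryStemmerSketch(Set<String> dictionary, Map<String, String> exceptions) {
        this.dictionary = dictionary;
        this.exceptions = exceptions;
    }

    public String stem(String word) {
        if (exceptions.containsKey(word)) {
            return exceptions.get(word);           // stem based on the exception entry
        }
        if (dictionary.contains(word)) {
            return word;                           // valid words are left alone
        }
        if (word.endsWith("ies")) {
            String root = word.substring(0, word.length() - 3);
            if (dictionary.contains(root + "ie")) {
                return root + "ie";                // calories -> calorie
            }
            return root + "y";                     // otherwise replace the ending by y
        }
        return word;                               // other suffixes omitted in this sketch
    }

    public static void main(String[] args) {
        Set<String> dict = new HashSet<>(Arrays.asList("calorie", "city"));
        DictionaryStemmerSketch stemmer = new DictionaryStemmerSketch(dict, new HashMap<>());
        System.out.println(stemmer.stem("calories")); // calorie
        System.out.println(stemmer.stem("cities"));   // city
    }
}
```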

Original text:
Document will describe marketing strategies carried out by U.S. companies for their agricultural chemicals, report predictions for market share of such chemicals, or report market statistics for agrochemicals, pesticide, herbicide, fungicide, insecticide, fertilizer, predicted sales, market share, stimulate demand, price cut, volume of sales.

Porter stemmer:
document describ market strategi carri compani agricultur chemic report predict market share chemic report market statist agrochem pesticid herbicid fungicid insecticid fertil predict sale market share stimul demand price cut volum sale

Krovetz stemmer:
document describe marketing strategy carry company agriculture chemical report prediction market share chemical report market statistic agrochemic pesticide herbicide fungicide insecticide fertilizer predict sale market share stimulate demand price cut volume sale

Fig. 4.6. Comparison of stemmer output for a TREC query. Stopwords have also been removed.

Figure 4.6 compares the output of the Porter and Krovetz stemmers on the text of a TREC query. The output of the Krovetz stemmer is similar in terms of which words are reduced to the same stems, although "marketing" is not reduced to "market" because it was in the dictionary. The stems produced by the Krovetz stemmer are mostly words. The exception is the stem "agrochemic", which occurred because "agrochemical" was not in the dictionary. Note that text processing in this example has removed stopwords, including single characters. This resulted in the removal of "U.S." from the text, which could have significant consequences for some queries. This can be handled by better tokenization or information extraction, as we discuss in section 4.6.

As in the case of stopwords, the search engine will have more flexibility to answer a broad range of queries if the document words are not stemmed but instead are indexed in their original form. Stemming can then be done as a type of query expansion, as explained in section 6.2.1. In some applications, both the full words and their stems are indexed, in order to provide both flexibility and efficient query processing times.

We mentioned earlier that stemming can be particularly important for some languages, and have virtually no impact in others.

Incorporating language-specific stemming algorithms is one of the most important aspects of customizing, or internationalizing, a search engine for multiple languages. We discuss other aspects of internationalization in section 4.7, but focus on the stemming issues here.

As an example, Table 4.8 shows some of the Arabic words derived from the same root. A stemming algorithm that reduced Arabic words to their roots would clearly not work (there are fewer than 2,000 roots in Arabic), but a broad range of prefixes and suffixes must be considered. Highly inflectional languages like Arabic have many word variants, and stemming can make a large difference in the accuracy of the ranking. An Arabic search engine with high-quality stemming can be more than 50% more effective, on average, at finding relevant documents than a system without stemming. In contrast, improvements for an English search engine vary from less than 5% on average for large collections to about 10% for small, domain-specific collections.

kitab       a book
kitabi      my book
alkitab     the book
kitabuki    your book (f)
kitabuka    your book (m)
kitabuhu    his book
kataba      to write
maktaba     library, bookstore
maktab      office

Table 4.8. Examples of words with the Arabic root ktb

Fortunately, stemmers for a number of languages have already been developed and are available as open source software. For example, the Porter stemmer is available in French, Spanish, Portuguese, Italian, Romanian, German, Dutch, Swedish, Norwegian, Danish, Russian, Finnish, Hungarian, and Turkish. In addition, the statistical approach to building a stemmer that is described in section 6.2.1 can be used when only a text corpus is available.

17 http://snowball.tartarus.org/

4.3.5 Phrases and N-grams

Phrases are clearly important in information retrieval. Many of the two- and three-word queries submitted to search engines are phrases, and finding documents that contain those phrases will be part of any effective ranking algorithm. For example, given the query "black sea", documents that contain that phrase are much more likely to be relevant than documents containing text such as "the sea turned black". Phrases are more precise than single words as topic descriptions (e.g., "tropical fish" versus "fish") and usually less ambiguous (e.g., "rotten apple" versus "apple"). The impact of phrases on retrieval can be complex, however. Given a query such as "fishing supplies", should the retrieved documents contain exactly that phrase, or should they get credit for containing the words "fish", "fishing", and "supplies" in the same paragraph, or even the same document? The details of how phrases affect ranking will depend on the specific retrieval model that is incorporated into the search engine, so we will defer this discussion until Chapter 7. From the perspective of text processing, the issue is whether phrases should be identified at the same time as tokenizing and stemming, so that they can be indexed for faster query processing.

There are a number of possible definitions of a phrase, and most of them have been studied in retrieval experiments over the years. Since a phrase has a grammatical definition, it seems reasonable to identify phrases using the syntactic structure of sentences. The definition that has been used most frequently in information retrieval research is that a phrase is equivalent to a simple noun phrase. This is often restricted even further to include just sequences of nouns, or adjectives followed by nouns. Phrases defined by these criteria can be identified using a part-of-speech (POS) tagger. A POS tagger marks the words in a text with labels corresponding to the part-of-speech of the word in that context. Taggers are based on statistical or rule-based approaches and are trained using large corpora that have been manually labeled. Typical tags that are used to label the words include NN (singular noun), NNS (plural noun), VB (verb), VBD (verb, past tense), VBN (verb, past participle), IN (preposition), JJ (adjective), CC (conjunction, e.g., "and", "or"), PRP (pronoun), and MD (modal auxiliary, e.g., "can", "will").
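As a sketch of how this definition can be applied, the code below takes tokens in "word/TAG" form and extracts sequences of nouns, optionally preceded by adjectives. It is an illustration under our own simplifying assumptions, not the extractor used for the experiments in this chapter.

```java
import java.util.ArrayList;
import java.util.List;

// A minimal sketch of identifying simple noun phrases from POS-tagged tokens:
// adjectives (JJ) followed by nouns (tags starting with NN), at least two words long.
public class NounPhraseExtractor {
    public static List<String> extract(String[] taggedTokens) {
        List<String> phrases = new ArrayList<>();
        List<String> current = new ArrayList<>();
        boolean sawNoun = false;
        for (String tagged : taggedTokens) {
            int slash = tagged.lastIndexOf('/');
            String word = slash < 0 ? tagged : tagged.substring(0, slash);
            String tag = slash < 0 ? "" : tagged.substring(slash + 1);
            if (tag.startsWith("NN")) {
                current.add(word);                     // nouns extend the current phrase
                sawNoun = true;
            } else if (tag.equals("JJ")) {
                if (sawNoun) {                         // an adjective after nouns starts a new candidate
                    flush(current, phrases, sawNoun);
                    sawNoun = false;
                }
                current.add(word);
            } else {
                flush(current, phrases, sawNoun);      // any other tag ends the candidate
                sawNoun = false;
            }
        }
        flush(current, phrases, sawNoun);
        return phrases;
    }

    private static void flush(List<String> current, List<String> phrases, boolean sawNoun) {
        if (sawNoun && current.size() > 1) {
            phrases.add(String.join(" ", current));
        }
        current.clear();
    }

    public static void main(String[] args) {
        String[] tagged = {"marketing/NN", "strategies/NNS", "carried/VBD", "agricultural/JJ",
                           "chemicals/NNS", "predicted/VBN", "sales/NNS"};
        System.out.println(extract(tagged));
        // [marketing strategies, agricultural chemicals]
        // Note that "predicted sales" is missed because "predicted" is tagged as a verb.
    }
}
```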

Figure 4.7 shows the output of a POS tagger for the TREC query text used in Figure 4.6. This example shows that the tagger can identify phrases that are sequences of nouns, such as "marketing/NN strategies/NNS", or adjectives followed by nouns, such as "agricultural/JJ chemicals/NNS". Taggers do, however, make mistakes. The words "predicted/VBN sales/NNS" would not be identified as a noun phrase, because "predicted" is tagged as a verb.

Original text:
Document will describe marketing strategies carried out by U.S. companies for their agricultural chemicals, report predictions for market share of such chemicals, or report market statistics for agrochemicals, pesticide, herbicide, fungicide, insecticide, fertilizer, predicted sales, market share, stimulate demand, price cut, volume of sales.

Brill tagger:
Document/NN will/MD describe/VB marketing/NN strategies/NNS carried/VBD out/IN by/IN U.S./NNP companies/NNS for/IN their/PRP agricultural/JJ chemicals/NNS ,/, report/NN predictions/NNS for/IN market/NN share/NN of/IN such/JJ chemicals/NNS ,/, or/CC report/NN market/NN statistics/NNS for/IN agrochemicals/NNS ,/, pesticide/NN ,/, herbicide/NN ,/, fungicide/NN ,/, insecticide/NN ,/, fertilizer/NN ,/, predicted/VBN sales/NNS ,/, market/NN share/NN ,/, stimulate/VB demand/NN ,/, price/NN cut/NN ,/, volume/NN of/IN sales/NNS ./.

Fig. 4.7. Output of a POS tagger for a TREC query

Table 4.9 shows the high-frequency simple noun phrases from a TREC corpus consisting mainly of news stories and a corpus of comparable size consisting of all the 1996 patents issued by the United States Patent and Trademark Office (PTO). The phrases were identified by POS tagging. The frequencies of the example phrases indicate that phrases are used more frequently in the PTO collection, because patents are written in a formal, legal style with considerable repetition. There were 1,100,000 phrases in the TREC collection that occurred more than five times, and 3,700,000 phrases in the PTO collection. Many of the TREC phrases are proper nouns, such as "los angeles" or "european union", or are topics that will be important for retrieval, such as "peace process" and "human rights". Two phrases are associated with the format of the documents ("article type", "end recording"). On the other hand, most of the high-frequency phrases in the PTO collection are standard terms used to describe all patents, such as "present invention" and "preferred embodiment", and relatively few are related to the content of the patents, such as "carbon atoms" and "ethyl acetate". One of the phrases, "group consisting", was the result of a frequent tagging error.

TREC data
Frequency  Phrase
65824      united states
61327      article type
33864      los angeles
18062      hong kong
17788      north korea
17308      new york
15513      san diego
15009      orange county
12869      prime minister
12799      first time
12067      soviet union
10811      russian federation
9912       united nations
8127       southern california
7640       south korea
7620       end recording
7524       european union
7436       south africa
7362       san francisco
7086       news conference
6792       city council
6348       middle east
6157       peace process
5955       human rights
5837       white house

Patent data
Frequency  Phrase
975362     present invention
191625     u.s. pat
147352     preferred embodiment
95097      carbon atoms
87903      group consisting
81809      room temperature
78458      seq id
75850      brief description
66407      prior art
59828      perspective view
58724      first embodiment
56715      reaction mixture
54619      detailed description
54117      ethyl acetate
52195      example 1
52003      block diagram
46299      second embodiment
41694      accompanying drawings
40554      output signal
37911      first end
35827      second end
34881      appended claims
33947      distal end
32338      cross-sectional view
30193      outer surface

Table 4.9. High-frequency noun phrases from a TREC collection and U.S. patents from 1996

Although POS tagging produces reasonable phrases and is used in a number of applications, in general it is too slow to be used as the basis for phrase indexing of large collections. There are simpler and faster alternatives that are just as effective. One approach is to store word position information in the indexes and use this information to identify phrases only when a query is processed. This provides considerable flexibility in that phrases can be identified by the user or by using POS tagging on the query, and they are not restricted to adjacent groups of words. The identification of syntactic phrases is replaced by testing word proximity constraints, such as whether two words occur within a specified text window. We describe position indexing in Chapter 5 and retrieval models that exploit word proximity in Chapter 7.
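A proximity constraint of this kind reduces to a simple test over the position lists stored in the index. The sketch below assumes sorted position lists; the names are illustrative and not Galago's API.

```java
import java.util.List;

// A minimal sketch of a word proximity test: do two terms occur within a window of a
// given size in the same document, based on their (sorted) position lists?
public class ProximityCheck {
    public static boolean withinWindow(List<Integer> positionsA, List<Integer> positionsB, int window) {
        int i = 0, j = 0;
        while (i < positionsA.size() && j < positionsB.size()) {
            int a = positionsA.get(i), b = positionsB.get(j);
            if (Math.abs(a - b) <= window) {
                return true;
            }
            if (a < b) i++; else j++;     // advance the list with the smaller position
        }
        return false;
    }

    public static void main(String[] args) {
        // "tropical" at positions 3 and 57, "fish" at positions 4 and 120
        System.out.println(withinWindow(List.of(3, 57), List.of(4, 120), 1));  // true
        System.out.println(withinWindow(List.of(3), List.of(120), 5));         // false
    }
}
```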

In applications with large collections and tight constraints on response time, such as web search, testing word proximities at query time is also likely to be too slow. In that case, we can go back to identifying phrases in the documents during text processing, but use a much simpler definition of a phrase: any sequence of n words. This is also known as an n-gram. Sequences of two words are called bigrams, and sequences of three words are called trigrams. Single words are called unigrams. N-grams have been used in many text applications and we will mention them again frequently in this book, particularly in association with language models (section 7.3). In this discussion, we are focusing on word n-grams, but character n-grams are also used in applications such as OCR, where the text is "noisy" and word matching can be difficult (section 11.6). Character n-grams are also used for indexing languages such as Chinese that have no word breaks (section 4.7).

N-grams, both character and word, are generated by choosing a particular value for n and then moving that "window" forward one unit (character or word) at a time. In other words, n-grams overlap. For example, the word "tropical" contains the following character bigrams: tr, ro, op, pi, ic, ca, and al. Indexes based on n-grams are obviously larger than word indexes.

The more frequently a word n-gram occurs, the more likely it is to correspond to a meaningful phrase in the language. N-grams of all lengths form a Zipf distribution, with a few common phrases occurring very frequently and a large number occurring with frequency 1. In fact, the rank-frequency data for n-grams (which includes single words) fits the Zipf distribution better than words alone. Some of the most common n-grams will be made up of stopwords (e.g., "and the", "there is") and could be ignored, although as with words, we should be cautious about discarding information. Our previous example query "to be or not to be" could certainly make use of n-grams. We could potentially index all n-grams in a document text up to a specific length and make them available to the ranking algorithm. This would seem to be an extravagant use of indexing time and disk space because of the large number of possible n-grams. A document containing 1,000 words, for example, would contain 3,990 instances of word n-grams of length 2 ≤ n ≤ 5. Many web search engines, however, use n-gram indexing because it provides a fast method of incorporating phrase features in the ranking.
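Generating overlapping n-grams is a one-line sliding window, sketched below for both words and characters; the class name is ours.

```java
import java.util.ArrayList;
import java.util.List;

// A minimal sketch of generating overlapping n-grams by sliding a window forward one
// unit (word or character) at a time.
public class NGramGenerator {
    public static List<String> wordNGrams(List<String> words, int n) {
        List<String> grams = new ArrayList<>();
        for (int i = 0; i + n <= words.size(); i++) {
            grams.add(String.join(" ", words.subList(i, i + n)));
        }
        return grams;
    }

    public static List<String> charNGrams(String word, int n) {
        List<String> grams = new ArrayList<>();
        for (int i = 0; i + n <= word.length(); i++) {
            grams.add(word.substring(i, i + n));
        }
        return grams;
    }

    public static void main(String[] args) {
        System.out.println(charNGrams("tropical", 2));
        // [tr, ro, op, pi, ic, ca, al]
        System.out.println(wordNGrams(List.of("to", "be", "or", "not", "to", "be"), 2));
        // [to be, be or, or not, not to, to be]
    }
}
```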

Google recently made available a file of n-grams derived from web pages. The statistics for this sample are shown in Table 4.10. An analysis of n-grams on the Web (Yang et al., 2007) found that "all rights reserved" was the most frequent trigram in English, whereas "limited liability corporation" was the most frequent in Chinese. In both cases, this was due to the large number of corporate sites, but it also indicates that n-grams are not dominated by common patterns of speech such as "and will be".

Number of tokens:     1,024,908,267,229
Number of sentences:  95,119,665,584
Number of unigrams:   13,588,391
Number of bigrams:    314,843,401
Number of trigrams:   977,069,902
Number of fourgrams:  1,313,818,354
Number of fivegrams:  1,176,470,663

Table 4.10. Statistics for the Google n-gram sample

18 http://googleresearch.blogspot.com/2006/08/all-our-n-gram-are-belong-to-you.html

4.4 Document Structure and Markup

In database applications, the fields or attributes of database records are a critical part of searching. Queries are specified in terms of the required values of these fields. In some text applications, such as email or literature search, fields such as author and date will have similar importance and will be part of the query specification. In the case of web search, queries usually do not refer to document structure or fields, but that does not mean that this structure is unimportant. Some parts of the structure of web pages, indicated by HTML markup, are very significant features used by the ranking algorithm. The document parser must recognize this structure and make it available for indexing.

As an example, Figure 4.8 shows part of a web page for a Wikipedia entry. The page has some obvious structure that could be used in a ranking algorithm. The main heading for the page, "tropical fish", indicates that this phrase is particularly important. The same phrase is also in bold and italics in the body of the text, which is further evidence of its importance. Other words and phrases are used as the anchor text for links and are likely to be good terms to represent the content of the page.

Tropical fish
From Wikipedia, the free encyclopedia

Tropical fish include fish found in tropical environments around the world, including both freshwater and salt water species. Fishkeepers often use the term tropical fish to refer only those requiring fresh water, with saltwater tropical fish referred to as marine fish.

Tropical fish are popular aquarium fish, due to their often bright coloration. In freshwater fish, this coloration typically derives from iridescence, while salt water fish are generally pigmented.

Fig. 4.8. Part of a web page from Wikipedia

The HTML source for this web page (Figure 4.9) shows that there is even more structure that should be represented for search. Each field or element in HTML is indicated by a start tag (such as <h1>) and an optional end tag (e.g., </h1>). Elements can also have attributes (with values), given by attribute_name="value" pairs. The <head> element of an HTML document contains metadata that is not displayed by a browser. The metadata element for keywords (<meta name="keywords" ...>) lists words and phrases associated with the page, in this example including terms from the main heading. The <body> element of the document contains the content that is displayed.

The main heading is indicated by the <h1> tag. Other headings, of different sizes and potentially different importance, would be indicated by <h2> through <h6> tags. Terms that should be displayed in bold or italic are indicated by <b> and <i> tags. Unlike typical database fields, these tags are primarily used for formatting and can occur many times in a document. They can also, as we have said, be interpreted as a tag indicating a word or phrase of some importance. Links, such as <a href="/wiki/Fish" title="Fish">fish</a>, are very common. They are the basis of link analysis algorithms such as PageRank (Brin & Page, 1998), but also define the anchor text. Links and anchor text are of particular importance to web search and will be described in the next section. The title attribute for a link is used to provide extra information about that link, although in our example it is the words in the last part of the URL for the associated Wikipedia page. Web search engines also make use of the URL of a page as a source of additional metadata. The URL for this page is:

http://en.wikipedia.org/wiki/Tropical_fish

19 The Web encyclopedia, http://en.wikipedia.org/.
20 In XML the end tag is not optional.
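As a sketch of how a document parser can expose this structure for indexing, the example below uses the open-source jsoup HTML parser. The choice of jsoup and the printed field names are our own assumptions, not something prescribed by the book or Galago; jsoup is used here because, like the parsers discussed in section 4.3, it is tolerant of HTML syntax errors.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

// A sketch of extracting indexable structure (title, headings, emphasized terms,
// anchor text, and link destinations) from an HTML page using jsoup.
public class HtmlStructureExtractor {
    public static void main(String[] args) {
        String html = "<html><head><title>Tropical fish - Wikipedia</title></head>"
            + "<body><h1>Tropical fish</h1>"
            + "<p><b>Tropical fish</b> include <a href=\"/wiki/Fish\" title=\"Fish\">fish</a> ...</p>"
            + "</body></html>";

        Document doc = Jsoup.parse(html);
        System.out.println("title:    " + doc.title());
        for (Element h1 : doc.select("h1")) {
            System.out.println("heading:  " + h1.text());
        }
        for (Element em : doc.select("b, i")) {
            System.out.println("emphasis: " + em.text());
        }
        for (Element link : doc.select("a[href]")) {
            System.out.println("anchor:   " + link.text() + " -> " + link.attr("href"));
        }
    }
}
```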

The fact that the words "tropical" and "fish" occur in the URL will increase the importance of those words for this page. The depth of a URL (i.e., the number of directories deep the page is) can also be important. For example, the URL www.ibm.com is more likely to be the home page for IBM than a page with the URL:

www.pcworld.com/businesscenter/article/698/ibm_buys_apt

<head>
<meta name="keywords" content="... Albinism, Algae eater, ..." />
<title>Tropical fish - Wikipedia, the free encyclopedia</title>
</head>
<body>
...
<h1 class="firstHeading">Tropical fish</h1>
...
<p><b>Tropical fish</b> include <a href="/wiki/Fish" title="Fish">fish</a> found in <a href="/wiki/Tropics" title="Tropics">tropical</a> environments around the world, including both <a href="/wiki/Fresh_water" title="Fresh water">freshwater</a> and <a href="/wiki/Sea_water" title="Sea water">salt water</a> species. <a href="/wiki/Fishkeeping" title="Fishkeeping">Fishkeepers</a> often use the term <i>tropical fish</i> to refer only those requiring fresh water, with saltwater tropical fish referred to as <i><a href="/wiki/List_of_marine_aquarium_fish_species" title="List of marine aquarium fish species">marine fish</a></i>.</p>
<p>Tropical fish are popular <a href="/wiki/Aquarium" title="Aquarium">aquarium</a> fish, due to their often bright coloration. In freshwater fish, this coloration typically derives from <a href="/wiki/Iridescence" title="Iridescence">iridescence</a>, while salt water fish are generally <a href="/wiki/Pigment" title="Pigment">pigmented</a>.</p>
...
</body></html>

Fig. 4.9. HTML source for example Wikipedia page

In HTML, the element types are predefined and are the same for all documents. XML, in contrast, allows each application to define what the element types are and what tags are used to represent them. XML documents can be described by a schema, similar to a database schema. XML elements, consequently, are more closely tied to the semantics of the data than HTML elements. Search applications often use XML to record semantic annotations in the documents that are produced by information extraction techniques, as described in section 4.6. A document parser for these applications would record the annotations, along with the other document structure, and make them available for indexing.

The query language XQuery has been defined by the database community for searching structured data described using XML. XQuery supports queries that specify both structural and content constraints, which raises the issue of whether a database or information retrieval approach is better for building a search engine for XML data. We discuss this topic in more detail in section 11.4, but the general answer is that it will depend on the data, the application, and the user needs. For XML data that contains a substantial proportion of text, the information retrieval approach is superior. In Chapter 7, we will describe retrieval models that are designed for text documents that contain both structure and metadata.

21 http://www.w3.org/XML/Query/

4.5 Link Analysis

Links connecting pages are a key component of the Web. Links are a powerful navigational aid for people browsing the Web, but they also help search engines understand the relationships between the pages. These detected relationships help search engines rank web pages more effectively.
It should be remembered, however, that many document collections used in search applications such as desktop or enterprise search either do not have links or have very little link structure. For these collections, link analysis will have no impact on search performance.

As we saw in the last section, a link in a web page is encoded in HTML with a statement such as:

For more information on this topic, please go to <a href="http://www.somewhere.com">the somewhere page</a>.

When this page appears in your web browser, the words "the somewhere page" will be displayed differently than regular text, usually underlined or in a different color (or both). When you click on that link, your browser will then load the web page http://www.somewhere.com. In this link, "the somewhere page" is called the anchor text, and http://www.somewhere.com is the destination. Both components are useful in the ranking process.

4.5.1 Anchor Text

Anchor text has two properties that make it particularly useful for ranking web pages. First, it tends to be very short, perhaps two or three words, and those words often succinctly describe the topic of the linked page. For instance, links to www.ebay.com are highly likely to contain the word "eBay" in the anchor text. Many queries are very similar to anchor text in that they are also short topical descriptions of web pages. This suggests a very simple algorithm for ranking pages: search through all links in the collection, looking for anchor text that is an exact match for the user's query. Each time there is a match, add 1 to the score of the destination page. Pages would then be ranked in decreasing order of this score. This algorithm has some glaring faults, not the least of which is how to handle the query "click here". More generally, the collection of all the anchor text in links pointing to a page can be used as an additional text field for that page, and incorporated into the ranking algorithm.

Second, anchor text is usually written by people who are not the authors of the destination page. This means that the anchor text can describe a destination page from a different perspective, or emphasize the most important aspect of the page from a community viewpoint. The fact that the link exists at all is a vote of importance for the destination page. Although anchor text is not mentioned as often as link analysis algorithms (for example, PageRank) in discussions of web search engines, TREC evaluations have shown that it is the most important part of the representation of a page for some types of web search. In particular, it is essential for searches where the user is trying to find a home page for a particular topic, person, or organization.
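The simple exact-match scoring idea described above fits in a few lines. The Link record, field names, and example URLs below are illustrative; as the text notes, a real system would instead treat the collected anchor text as an additional field in the ranking algorithm.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// A minimal sketch of anchor-text scoring: every link whose anchor text exactly matches
// the query adds 1 to the destination page's score.
public class AnchorTextScorer {
    record Link(String anchorText, String destination) {}

    public static Map<String, Integer> score(List<Link> links, String query) {
        Map<String, Integer> scores = new HashMap<>();
        for (Link link : links) {
            if (link.anchorText().equals(query)) {
                scores.merge(link.destination(), 1, Integer::sum);
            }
        }
        return scores;   // rank pages in decreasing order of this score
    }

    public static void main(String[] args) {
        List<Link> links = List.of(
            new Link("eBay", "www.ebay.com"),
            new Link("eBay", "www.ebay.com"),
            new Link("eBay auctions", "www.ebay.com"),
            new Link("eBay", "www.example.com/auction-tips"));
        System.out.println(score(links, "eBay"));
        // www.ebay.com scores 2, www.example.com/auction-tips scores 1
    }
}
```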
4.5.2 PageRank

There are tens of billions of web pages, but most of them are not very interesting. Many of those pages are spam and contain no useful content at all. Other pages are personal blogs, wedding announcements, or family picture albums. These pages are interesting to a small audience, but probably not broadly. On the other hand, there are a few pages that are popular and useful to many people, including news sites and the websites of popular companies. The huge size of the Web makes this a difficult problem for search engines. Suppose a friend had told you to visit the site for eBay, and you didn't know that www.ebay.com was the URL to use. You could type "eBay" into a search engine, but there are millions of web pages that contain the word "eBay". How can the search engine choose the most popular (and probably the correct) one? One very effective approach is to use the links between web pages as a way to measure popularity. The most obvious measure is to count the number of inlinks (links pointing to a page) for each page and use this as a feature or piece of evidence in the ranking algorithm. Although this has been shown to be quite effective, it is very susceptible to spam. Measures based on link analysis algorithms are designed to provide more reliable ratings of web pages. Of these measures, PageRank, which is associated with the Google search engine, is most often mentioned.

PageRank is based on the idea of a random surfer (as in web surfer). Imagine a person named Alice who is using her web browser. Alice is extremely bored, so she wanders aimlessly between web pages. Her browser has a special "surprise me" button at the top that will jump to a random web page when she clicks it. Each time a web page loads, she chooses whether to click the "surprise me" button or whether to click one of the links on the web page. If she clicks a link on the page, she has no preference for any particular link; instead, she just picks one randomly. Alice is sufficiently bored that she intends to keep browsing the Web like this forever.

To put this in a more structured form, Alice browses the Web using this algorithm:
1. Choose a random number r between 0 and 1.
2. If r < λ: click the "surprise me" button.
3. If r ≥ λ: click a link at random on the current page.
4. Start again.

Typically we assume that λ is fairly small, so Alice is much more likely to click a link than to pick the "surprise me" button. Even though Alice's path through the web pages is random, Alice will still see popular pages more often than unpopular ones. That's because Alice often follows links, and links tend to point to popular pages. So, we expect that Alice will end up at a university website, for example, more often than a personal website, but less often than the CNN website.

Suppose that CNN has posted a story that contains a link to a professor's web page. Alice now becomes much more likely to visit that professor's page, because Alice visits the CNN website frequently. A single link at CNN might influence Alice's activity more than hundreds of links at less popular sites, because Alice visits CNN far more often than those less popular sites.

Because of Alice's special "surprise me" button, we can be guaranteed that eventually she will reach every page on the Internet. Since she plans to browse the Web for a very long time, and since the number of web pages is finite, she will visit every page a very large number of times. It is likely, however, that she will visit a popular site thousands of times more often than an unpopular one.

22 The PageRank calculation corresponds to finding what is known as the stationary probability distribution of a random walk on the graph of the Web. A random walk is a special case of a Markov chain in which the next state (the next page visited) depends solely on the current state (current page). The transitions that are allowed between states are all equally probable and are given by the links.
23 The "surprise me" button makes the random surfer model an ergodic Markov chain, which guarantees that the iterative calculation of PageRank will converge.
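The random surfer can be simulated directly, as a sketch of the model rather than of how PageRank is actually computed in practice (see Figure 4.11 below). The fraction of visits to each page estimates its PageRank; the graph, class, and method names are illustrative.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Random;

// A minimal sketch of the random surfer: with probability lambda jump to a random page,
// otherwise follow a random outgoing link, and count how often each page is visited.
public class RandomSurfer {
    public static Map<String, Integer> surf(Map<String, List<String>> links, double lambda, int steps) {
        List<String> pages = List.copyOf(links.keySet());
        Map<String, Integer> visits = new HashMap<>();
        Random random = new Random();
        String current = pages.get(random.nextInt(pages.size()));
        for (int i = 0; i < steps; i++) {
            visits.merge(current, 1, Integer::sum);
            List<String> outgoing = links.get(current);
            if (random.nextDouble() < lambda || outgoing.isEmpty()) {
                current = pages.get(random.nextInt(pages.size()));        // "surprise me" button
            } else {
                current = outgoing.get(random.nextInt(outgoing.size()));  // random link click
            }
        }
        return visits;
    }

    public static void main(String[] args) {
        // A small three-page graph: A links to B and C, B links to C, C links to A.
        Map<String, List<String>> links = Map.of(
            "A", List.of("B", "C"),
            "B", List.of("C"),
            "C", List.of("A"));
        System.out.println(surf(links, 0.15, 1_000_000));
    }
}
```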
Note that if she did not have the "surprise me" button, she would get stuck on pages that did not have links, pages whose links no longer pointed to any page, or pages that formed a loop. Links that point to the first two types of pages, or to pages that have not yet been crawled, are called dangling links.

Now suppose that while Alice is browsing, you happened to walk into her room and glance at the web page on her screen. What is the probability that she will be looking at the CNN web page when you walk in? That probability is CNN's PageRank. Every web page on the Internet has a PageRank, and it is uniquely determined by the link structure of web pages. As this example shows, PageRank has the ability to distinguish between popular pages (those with many incoming links, or those that have links from popular pages) and unpopular ones. The PageRank value can help search engines sift through the millions of pages that contain the word "eBay" to find the one that is most popular (www.ebay.com).

Alice would have to click on many billions of links in order for us to get a reasonable estimate of PageRank, so we can't expect to compute it by using actual people. Fortunately, we can compute PageRank in a much more efficient way. Suppose for the moment that the Web consists of just three pages, A, B, and C. We will suppose that page A links to pages B and C, page B links to page C, and page C links to page A, as shown in Figure 4.10.

Fig. 4.10. A sample "Internet" consisting of just three web pages. The arrows denote links between the pages.

The PageRank of page C, which is the probability that Alice will be looking at this page, will depend on the PageRank of pages A and B. Since Alice chooses randomly between links on a given page, if she starts in page A, there is a 50% chance that she will go to page C (because there are two outgoing links). Another way of saying this is that the PageRank for a page is divided evenly between all the outgoing links. If we ignore the "surprise me" button, this means that the PageRank of page C, represented as PR(C), can be calculated as:

PR(C) = PR(A)/2 + PR(B)/1

More generally, we could calculate the PageRank for any page u as:

PR(u) = Σ_{v ∈ B_u} PR(v)/L_v

where B_u is the set of pages that point to u, and L_v is the number of outgoing links from page v (not counting duplicate links).

There is an obvious problem here: we don't know the PageRank values for the pages, because that is what we are trying to calculate. If we start by assuming that the PageRank values for all pages are the same (1/3 in this case), then it is easy to see that we could perform multiple iterations of the calculation. For example, in the first iteration, PR(C) = 0.33/2 + 0.33 = 0.5, PR(A) = 0.33, and PR(B) = 0.17. In the next iteration, PR(C) = 0.33/2 + 0.17 = 0.33, PR(A) = 0.5, and PR(B) = 0.17. In the third iteration, PR(C) = 0.42, PR(A) = 0.33, and PR(B) = 0.25. After a few more iterations, the PageRank values converge to the final values of PR(C) = 0.4, PR(A) = 0.4, and PR(B) = 0.2.
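The iteration just described can be written out directly for this three-page graph. The sketch below ignores the "surprise me" button, starts every page at 1/3, and converges to the same values given above.

```java
// A minimal sketch of the iterative calculation for the three-page graph of Figure 4.10,
// ignoring the "surprise me" button. The values converge to PR(A)=0.4, PR(B)=0.2, PR(C)=0.4.
public class ThreePagePageRank {
    public static void main(String[] args) {
        double prA = 1.0 / 3, prB = 1.0 / 3, prC = 1.0 / 3;
        for (int iteration = 1; iteration <= 20; iteration++) {
            // A's only inlink is from C; B's only inlink is from A (which has two outlinks);
            // C receives PR(A)/2 + PR(B)/1.
            double newA = prC;
            double newB = prA / 2;
            double newC = prA / 2 + prB;
            prA = newA; prB = newB; prC = newC;
            System.out.printf("iteration %2d: PR(A)=%.2f PR(B)=%.2f PR(C)=%.2f%n",
                              iteration, prA, prB, prC);
        }
    }
}
```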
If we take the "surprise me" button into account, part of the PageRank for page C will be due to the probability of coming to that page by pushing the button. Given that there is a 1/3 chance of going to any page when the button is pushed, the contribution to the PageRank for C from the button will be λ/3. This means that the total PageRank for C is now:

PR(C) = λ/3 + (1 − λ) · (PR(A)/2 + PR(B)/1)

Similarly, the general formula for PageRank is:

PR(u) = λ/N + (1 − λ) · Σ_{v ∈ B_u} PR(v)/L_v

where N is the number of pages being considered. The typical value for λ is 0.15. This can also be expressed as a matrix equation:

R = TR

where R is the vector of PageRank values and T is the matrix representing the transition probabilities for the random surfer model. The element T_ij represents the probability of going from page i to page j, and:

T_ij = λ/N + (1 − λ) · 1/L_i

Those of you familiar with linear algebra may recognize that the solution R is an eigenvector of the matrix T.

Figure 4.11 shows some pseudocode for computing PageRank. The algorithm takes a graph G as input. Graphs are composed of vertices and edges, so G = (V, E). In this case, the vertices are web pages and the edges are links, so the pseudocode uses the letters P and L instead. A link is represented as a pair (p, q), where p and q are the source and destination pages. Dangling links, which are links where the page q does not exist, are assumed to be removed. Pages with no outbound links are rank sinks, in that they accumulate PageRank but do not distribute it. In this algorithm, we assume that these pages link to all other pages in the collection.

The first step is to make a guess at the PageRank value for each page. Without any better information, we assume that the PageRank is the same for each page. Since PageRank values need to sum to 1 for all pages, we assign a PageRank of 1/|P| to each page in the input vector I. An alternative that may produce faster convergence would be to use a value based on the number of inlinks.

procedure PageRank(G)
  ◃ G is the web graph, consisting of vertices (pages) and edges (links).
  (P, L) ← G                          ◃ Split graph into pages and links
  I ← a vector of length |P|          ◃ The current PageRank estimate
  R ← a vector of length |P|          ◃ The resulting better PageRank estimate
  for all entries I_i ∈ I do
    I_i ← 1/|P|                       ◃ Start with each page being equally likely
  end for
  while R has not converged do
    for all entries R_i ∈ R do
      R_i ← λ/|P|                     ◃ Each page has a λ/|P| chance of random selection
    end for
    for all pages p ∈ P do
      Q ← the set of pages q such that (p, q) ∈ L and q ∈ P
      if |Q| > 0 then
        for all pages q ∈ Q do
          R_q ← R_q + (1 − λ)I_p/|Q|  ◃ Probability I_p of being at page p
        end for
      else
        for all pages q ∈ P do
          R_q ← R_q + (1 − λ)I_p/|P|
        end for
      end if
    end for
    I ← R                             ◃ Update our current PageRank estimate
  end while
  return R
end procedure

Fig. 4.11. Pseudocode for the iterative PageRank algorithm

In each iteration, we start by creating a result vector, R, and storing λ/|P| in each entry. This is the probability of landing at any particular page because of a random jump. The next step is to compute the probability of landing on a page because of a clicked link. We do that by iterating over each web page in P. At each page, we retrieve the estimated probability of being at that page, I_p.
From that page, the user has a λ chance of jumping randomly, or 1 − λ of clicking on a link. There are |Q| links to choose from, so the probability of jumping to a page q ∈ Q is (1 − λ)I_p/|Q|. We add this quantity to each entry R_q. In the event that there are no usable outgoing links, we assume that the user jumps randomly, and therefore the probability (1 − λ)I_p is spread evenly among all |P| pages.

To summarize, PageRank is an important example of query-independent metadata that can improve ranking for web search. Web pages have the same PageRank value regardless of what query is being processed. Search engines that use PageRank will prefer pages with high PageRank values instead of assuming that all web pages are equally likely to satisfy a query. PageRank is not, however, as important in web search as the conventional wisdom holds. It is just one of many features used in ranking. It does, however, tend to have the most impact on popular queries, which is a useful property.

The HITS algorithm (Kleinberg, 1999) for link analysis was developed at about the same time as PageRank and has also been very influential. This algorithm estimates the value of the content of a page (the authority value) and the value of the links to other pages (the hub value). Both values are computed using an iterative algorithm based solely on the link structure, similar to PageRank. The HITS algorithm, unlike PageRank, calculates authority and hub values for a subset of pages retrieved by a given query. This can be an advantage in terms of the impact of the HITS metadata on ranking, but may be computationally infeasible for search engines with high query traffic. In Chapter 10, we discuss the application of the HITS algorithm to finding web communities.

4.5.3 Link Quality

It is well known that techniques such as PageRank and anchor text extraction are used in commercial search engines, so unscrupulous web page designers may try to create useless links just to improve the search engine placement of one of their web pages. This is called link spam. Even typical users, however, can unwittingly fool simple search engine techniques. A good example of this is with blogs. Many blog posts are comments about other blog posts. Suppose author A reads a post called b in author B's blog. Author A might write a new blog post, called a, which contains a link to post b. In the process of posting, author A may post a trackback to post b in author B's blog. A trackback is a special kind of comment that alerts author B that a reply has been posted in author A's blog.

(Diagram: post a in Blog A links to post b in Blog B, and post b contains a trackback link back to post a.)
Fig. 4.12. Trackback links in blog postings

As Figure 4.12 shows, a cycle has developed between post a and post b. Post a links to post b, and post b contains a trackback link to post a. Intuitively we would say that post b is influential, because author A has decided to write about it. However, from the PageRank perspective, a and b have links to each other, and therefore neither is more influential than the other. The trouble here is that a trackback is a fundamentally different kind of link than one that appears in a post.

24 Hypertext Induced Topic Search
25 Query-independent versions of HITS and topic-dependent versions of PageRank have also been defined.
The comments section of a blog can also be a source of link spam. Page authors may try to promote their own websites by posting links to them in the comments section of popular blogs. Based on our discussion of PageRank, we know that a link from a popular website can make another website seem much more important. Therefore, this comments section is an attractive target for spammers.

In this case, one solution is for search engine companies to automatically detect these comment sections and effectively ignore the links during indexing. An even easier way to do this is to ask website owners to alter the unimportant links so that search engines can detect them. This is the purpose behind the rel=nofollow link attribute. Most blog software is now designed to modify any link in a blog comment to contain the rel=nofollow attribute. Therefore, a post like this:

Come visit my <a href="http://www.page.com">web page</a>.

becomes something like this:

Come visit my <a rel=nofollow href="http://www.page.com">web page</a>.

The link still appears on the blog, but search engines are designed to ignore all links marked rel=nofollow. This helps preserve the integrity of PageRank calculation and anchor text harvesting.

4.6 Information Extraction

Information extraction is a language technology that focuses on extracting structure from text. Information extraction is used in a number of applications, and particularly for text data mining. For search applications, the primary use of information extraction is to identify features that can be used by the search engine to improve ranking. Some people have speculated that information extraction techniques could eventually transform text search into a database problem by extracting all of the important information from text and storing it in structured form, but current applications of these techniques are a very long way from achieving that goal.

Some of the text processing steps we have already discussed could be considered information extraction. Identifying noun phrases, titles, or even bolded text are examples. In each of these cases, a part of the text has been recognized as having some special property, and that property can be described using a markup language, such as XML. If a document is already described using HTML or XML, the recognition of some of the structural features (such as titles) is straightforward, but others, such as phrases, require additional processing before the feature can be annotated using the markup language. In some applications, such as when the documents in the collection are input through OCR, the document has no markup and even simple structures such as titles must be recognized and annotated.

These types of features are very general, but most of the recent research in information extraction has been concerned with features that have specific semantic content, such as named entities, relationships, and events. Although all of these features contain important information, named entity recognition has been used most often in search applications. A named entity is a word or sequence of words that is used to refer to something of interest in a particular application. The most common examples are people's names, company or organization names, locations, time and date expressions, quantities, and monetary values.
It is easy to come up with other "entities" that would be important for specific applications. For an e-commerce application, for example, the recognition of product names and model numbers in web pages and reviews would be essential. In a pharmaceutical application, the recognition of drug names, dosages, and medical conditions may be important. Given the more specific nature of these features, the process of recognizing them and tagging them in text is sometimes called semantic annotation. Some of these recognized entities would be incorporated directly into the search using, for example, facets (see Chapter 6), whereas others may be used as part of browsing the search results. An example of the latter is the search engine feature that recognizes addresses in pages and provides links to the appropriate map.

Fred Smith, who lives at 10 Water Street, Springfield, MA, is a long-time collector of tropical fish.

<p><PersonName><GivenName>Fred</GivenName> <Sn>Smith</Sn></PersonName>, who lives at <address><Street>10 Water Street</Street>, <City>Springfield</City>, <State>MA</State></address>, is a long-time collector of <b>tropical fish.</b></p>

Fig. 4.13. Text tagged by information extraction

Figure 4.13 shows a sentence and the corresponding XML markup after using information extraction. In this case, the extraction was done by a well-known word processing program. 26 In addition to the usual structure markup (<p> and <b>), a number of tags have been added that indicate which words are part of named entities. It shows, for example, that an address consisting of a street ("10 Water Street"), a city ("Springfield"), and a state ("MA") was recognized in the text.

26 Microsoft Word

Two main approaches have been used to build named entity recognizers: rule-based and statistical. A rule-based recognizer uses one or more lexicons (lists of words and phrases) that categorize names. Some example categories would be locations (e.g., towns, cities, states, countries, places of interest), people's names (given names, family names), and organizations (e.g., companies, government agencies, international groups). If these lists are sufficiently comprehensive, much of the extraction can be done simply by lookup. In many cases, however, rules or patterns are used to verify an entity name or to find new entities that are not in the lists. For example, a pattern such as "<number> <word> street" could be used to identify street addresses. Patterns such as "<street address>, <city>" or "in <city>" could be used to verify that the name found in the location lexicon as a city was indeed a city. Similarly, a pattern such as "<street address>, <city>, <state>" could also be used to identify new cities or towns that were not in the lexicon. New person names could be recognized by rules such as "<title> <name>", where <title> would include words such as "President", "Mr.", and "CEO". Names are generally easier to extract in mixed-case text, because capitalization often indicates a name, but many patterns will apply to all lower- or uppercase text as well. Rules incorporating patterns are developed manually, often by trial and error, although an initial set of rules can also be used as seeds in an automated learning process that can discover new rules. 27

27 GATE (http://gate.ac.uk) is an example of an open source toolkit that provides both an information extraction component and an environment for customizing extraction for a specific application.
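The pattern-based rules above map naturally onto regular expressions. What follows is a minimal sketch, in the spirit of the "<number> <word> street" rule; the class name, regular expression, and example are our own illustration and are not taken from GATE or any other toolkit.

import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Minimal sketch of one rule from a rule-based recognizer:
// "<number> <word> street" identifies candidate street addresses.
public class StreetAddressRule {
    // A number, then one capitalized word, then "Street" or "St."
    private static final Pattern STREET =
        Pattern.compile("\\b(\\d+)\\s+([A-Z][a-z]+)\\s+(Street|St\\.)");

    public static List<String> findAddresses(String text) {
        List<String> matches = new ArrayList<>();
        Matcher m = STREET.matcher(text);
        while (m.find()) {
            matches.add(m.group());   // e.g., "10 Water Street"
        }
        return matches;
    }

    public static void main(String[] args) {
        String text = "Fred Smith, who lives at 10 Water Street, Springfield, MA, "
                    + "is a long-time collector of tropical fish.";
        System.out.println(findAddresses(text));  // prints [10 Water Street]
    }
}

A real rule-based recognizer would combine many such patterns with lexicon lookups and verification rules, as described above.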
A statistical entity recognizer uses a probabilistic model of the words in and around an entity. A number of different approaches have been used to build these models, but because of its importance, we will briefly describe the Hidden Markov Model (HMM) approach. HMMs are used for many applications in speech and language. For example, POS taggers can be implemented using this approach.

4.6.1 Hidden Markov Models for Extraction

One of the most difficult parts of entity extraction is that words can have many different meanings. The word "Bush", for example, can describe a plant or a person. Similarly, "Marathon" could be the name of a race or a location in Greece. People tell the difference between these different meanings based on the context of the word, meaning the words that surround it. For instance, if "Marathon" is preceded by "Boston", the text is almost certainly describing a race. We can describe the context of a word mathematically by modeling the generation of the sequence of words in a text as a process with the Markov property, meaning that the next word in the sequence depends on only a small number of the previous words. 28

28 We discuss generative models in more detail in Chapter 7.

More formally, a Markov Model describes a process as a collection of states with transitions between them. Each of the transitions has an associated probability. The next state in the process depends solely on the current state and the transition probabilities. In a Hidden Markov Model, each state has a set of possible outputs that can be generated. As with the transitions, each output also has a probability associated with it.

Figure 4.14 shows a state diagram representing a very simple model for sentence generation that could be used by a named entity recognizer. In this model, the words in a sentence are assumed to be either part of an entity name (in this case, either a person, organization, or location) or not part of one. Each of these entity categories is represented by a state, and after every word the system may stay in that state (represented by the arrow loops) or transition to another state. There are two special states representing the start and end of the sentence. Associated with each state representing an entity category, there is a probability distribution of the likely sequences of words for that category.

Fig. 4.14. Sentence model for statistical entity extractor (states for start, end, person, organization, location, not-an-entity, and <every entity category>)

One possible use of this model is to construct new sentences. Suppose that we begin in the start state, and then the next state is randomly chosen according to the start state's transition probability table. For example, we may transition to the person state. Once we have entered the person state, we complete the transition by choosing an output according to the person state's output probability distribution. An example output may be the word "Thomas". This process would continue, with a new state being transitioned to and an output being generated during each step of the process. The final result is a set of states and their associated outputs.
Although such models can be used to generate new sentences, they are more commonly used to recognize entities in a sentence. To do this for a given sentence, a sequence of entity categories is found that gives the highest probability for the words in that sentence. Only the outputs generated by the state transitions are visible (i.e., can be observed); the underlying states are "hidden." For the sentence in Figure 4.13, for example, the recognizer would find the sequence of states

<start><name><not-an-entity><location><not-an-entity><end>

to have the highest probability for that model. The words that were associated with the entity categories in this sequence would then be tagged. The problem of finding the most likely sequence of states in an HMM is solved by the Viterbi algorithm, 29 which is a dynamic programming algorithm.

29 Named after the electrical engineer Andrew Viterbi.

The key aspect of this approach to entity recognition is that the probabilities in the sentence model must be estimated from training data. To estimate the transition and output probabilities, we generate training data that consists of text manually annotated with the correct entity tags. From this training data, we can directly estimate the probability of words associated with a given category (i.e., output probabilities), and the probability of transitions between categories. To build a more accurate recognizer, features that are highly associated with named entities, such as capitalized words and words that are all digits, would be included in the model. In addition, the transition probabilities could depend on the previous word as well as the previous category. 30 For example, the occurrence of the word "Mr." increases the probability that the next category is Person.

30 Bikel et al. (1997) describe one of the first named entity recognizers based on the HMM approach.

Although such training data can be useful for constructing accurate HMMs, collecting it requires a great deal of human effort. To generate approximately one million words of annotated text, which is the approximate size of training data required for accurate estimates, people would have to annotate the equivalent of more than 1,500 news stories. This may require considerably more effort than developing rules for a simple set of features. Both the rule-based and statistical approaches have recognition effectiveness of about 90% 31 for entities such as name, organization, and location, although the statistical recognizers are generally the best. Other entity categories, such as product names, are considerably more difficult. The choice of which entity recognition approach to use will depend on the application, the availability of software, and the availability of annotators.

31 By this we mean that about 9 out of 10 of the entities found are accurately identified, and 9 out of 10 of the existing entities are found. See Chapter 8 for details on evaluation measures.

Interestingly, there is little evidence that named entities are useful features for general search applications. Named entity recognition is a critical part of question-answering systems (section 11.5), and can be important in domain-specific or vertical search engines for accurately recognizing and indexing domain terms. Named entity recognition can also be useful for query analysis in applications such as local search, and as a tool for understanding and browsing search results.
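To make the decoding step concrete, here is a minimal sketch of the Viterbi dynamic program over a toy two-state model. The states, transition probabilities, and output probabilities are invented for illustration; in a real recognizer they would be estimated from annotated training data as described above, and the tag set would include all of the entity categories in Figure 4.14.

import java.util.*;

// Minimal Viterbi decoder for a toy entity HMM.
// States are entity categories; all probabilities here are made up for illustration.
public class ViterbiSketch {
    static String[] states = {"person", "not-an-entity"};
    // transition[i][j] = P(state j follows state i)
    static double[][] transition = {{0.3, 0.7}, {0.2, 0.8}};
    static double[] start = {0.4, 0.6};   // P(first state)

    // P(word | state); unseen words get a small smoothing probability.
    static double output(int state, String word) {
        boolean nameLike = word.equals("Fred") || word.equals("Smith");
        if (state == 0) return nameLike ? 0.2 : 0.001;
        return nameLike ? 0.001 : 0.05;
    }

    static String[] decode(String[] words) {
        int n = words.length, k = states.length;
        double[][] best = new double[n][k];   // best log-probability ending in state j at word i
        int[][] back = new int[n][k];         // backpointers for recovering the best path
        for (int j = 0; j < k; j++)
            best[0][j] = Math.log(start[j]) + Math.log(output(j, words[0]));
        for (int i = 1; i < n; i++) {
            for (int j = 0; j < k; j++) {
                best[i][j] = Double.NEGATIVE_INFINITY;
                for (int p = 0; p < k; p++) {
                    double score = best[i - 1][p] + Math.log(transition[p][j])
                                 + Math.log(output(j, words[i]));
                    if (score > best[i][j]) { best[i][j] = score; back[i][j] = p; }
                }
            }
        }
        String[] tags = new String[n];
        int j = best[n - 1][0] > best[n - 1][1] ? 0 : 1;   // best final state (two states here)
        for (int i = n - 1; i >= 0; i--) { tags[i] = states[j]; j = back[i][j]; }
        return tags;
    }

    public static void main(String[] args) {
        String[] words = {"Fred", "Smith", "collects", "tropical", "fish"};
        System.out.println(Arrays.toString(decode(words)));
    }
}

Log probabilities are used so that long sentences do not underflow; the backpointers recover the most likely state sequence, which is then used to tag the words.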
4.7 Internationalization

The Web is used all over the world, and not just by English speakers. Although 65–70% of the Web is written in English, that percentage is continuing to decrease. More than half of the people who use the Web, and therefore search the Web, do not use English as their primary language. Other search applications, such as desktop search and corporate search, are being used in many different languages every day. Even an application designed for an environment that has mostly English-speaking users can have many non-English documents in the collection. Try using "poissons tropicaux" (tropical fish) as a query for your favorite web search engine and see how many French web pages are retrieved. 32

32 You would find many more French web pages, of course, if you used a French version of the search engine, such as http://fr.yahoo.com.

A monolingual search engine is, as the name suggests, a search engine designed for a particular language. 33 Many of the indexing techniques and retrieval models we discuss in this book work well for any language. The differences between languages that have the most impact on search engine design are related to the text processing steps that produce the index terms for searching.

33 We discuss cross-language search engines in section 6.4.

As we mentioned in the previous chapter, character encoding is a crucial issue for search engines dealing with non-English languages, and Unicode has become the predominant character encoding standard for the internationalization of software.

Other text processing steps also need to be customized for different languages. The importance of stemming for highly inflected languages has already been mentioned, but each language requires a customized stemmer. Tokenizing is also important for many languages, especially for the CJK family of languages. For these languages, the key problem is word segmentation, where the breaks corresponding to words or index terms must be identified in the continuous sequence of characters (spaces are generally not used). One alternative to segmenting is to index overlapping character bigrams (pairs of characters, see section 4.3.5). Figure 4.15 shows word segmentation and bigrams for the text "impact of droughts in China". Although the ranking effectiveness of search based on bigrams is quite good, word segmentation is preferred in many applications because many of the bigrams do not correspond to actual words. A segmentation technique can be implemented based on statistical approaches, such as a Hidden Markov Model, with sufficient training data. Segmentation can also be an issue in other languages. German, for example, has many compound words (such as "fischzuchttechniken" for "fish farming techniques") that should be segmented for indexing.

Fig. 4.15. Chinese segmentation and bigrams for the text 旱灾在中国造成的影响 (the impact of droughts in China): (1) the original, unsegmented text; (2) word segmentation into 旱灾 / 在 / 中国 / 造成 / 的 / 影响 (drought / at / China / make / [particle] / impact); (3) the nine overlapping character bigrams 旱灾 灾在 在中 中国 国造 造成 成的 的影 影响.

In general, given the tools that are available, it is not difficult to build a search engine for the major languages. The same statement holds for any language that has a significant amount of online text available on the Web, since this can be used as a resource to build and test the search engine components. There are, however, a large number of other so-called "low-density" languages that may have many speakers but few online resources. Building effective search engines for these languages is more of a challenge.
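As a small illustration of the bigram alternative to segmentation mentioned above, the following sketch generates overlapping character bigrams from an unsegmented string; it is our own example rather than code from Galago or any segmentation toolkit.

import java.util.ArrayList;
import java.util.List;

// Index terms for unsegmented CJK text: overlapping character bigrams.
public class CharacterBigrams {
    public static List<String> bigrams(String text) {
        List<String> terms = new ArrayList<>();
        // codePoints() handles characters outside the basic multilingual plane correctly.
        int[] cp = text.codePoints().toArray();
        for (int i = 0; i + 1 < cp.length; i++) {
            terms.add(new String(cp, i, 2));  // two consecutive characters
        }
        return terms;
    }

    public static void main(String[] args) {
        System.out.println(bigrams("旱灾在中国造成的影响"));
        // [旱灾, 灾在, 在中, 中国, 国造, 造成, 成的, 的影, 影响]
    }
}

Each bigram would then simply be treated as an index term, just like a word produced by an English tokenizer.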
References and Further Reading

The properties and statistics of text and document collections have been studied for some time under the heading of bibliometrics, which is part of the field of library and information science. Information science journals such as the Journal of the American Society of Information Science and Technology (JASIST) or Information Processing and Management (IPM) contain many papers in this general area. Information retrieval has, from the beginning, emphasized a statistical view of text, and researchers from IR and information science have always worked closely together. Belew (2000) contains a good discussion of the cognitive aspects of Zipf's law and other properties of text in relationship to IR.

With the shift to statistical methods in natural language processing in the 1990s, researchers also became interested in studying the statistical properties of text. Manning and Schütze (1999) is a good summary of text statistics from this perspective. Ha et al. (2002) give an interesting result showing that phrases (or n-grams) also generally follow Zipf's law, and that combining the phrases and words results in better predictions for frequencies at low ranks.

The paper by Anagnostopoulos et al. (2005) describes a technique for estimating query result size and also points to much of the relevant literature in this area. Similarly, Broder et al. (2006) show how to estimate corpus size and compare their estimation with previous techniques.

Not much is written about tokenizing or stopping. Both are considered sufficiently "well known" that they are hardly mentioned in papers. As we have pointed out, however, getting these basic steps right is crucial for the overall system's effectiveness. For many years, researchers used the stopword list published in van Rijsbergen (1979). When it became clear that this was not sufficient for the larger TREC collections, a stopword list developed at the University of Massachusetts and distributed with the Lemur toolkit has frequently been used. As mentioned previously, this list contains over 400 words, which will be too long for many applications.

The original paper describing the Porter stemmer was written in 1979, but was reprinted in Porter (1997). The paper by Krovetz (1993) describes his stemming algorithm but also takes a more detailed approach to studying the role of morphology in a stemmer. 34 The Krovetz stemmer is available on the Lemur website. Stemmers for other languages are available from various websites (including the Lemur website and the Porter stemmer website). A description of Arabic stemming techniques can be found in Larkey et al. (2002).

34 Morphology is the study of the internal structure of words, and stemming is a form of morphological processing.

Research on the use of phrases in searching has a long history. Croft et al. (1991) describe retrieval experiments with phrases derived by both syntactic and statistical processing of the query, and showed that effectiveness was similar to phrases selected manually. Many groups that have participated in the TREC evaluations have used phrases as part of their search algorithms (Voorhees & Harman, 2005).

Church (1988) described an approach to building a statistical (or stochastic) part-of-speech tagger that is the basis for many current taggers.
This approach uses manually tagged training data to train a probabilistic model of sequences of parts of speech, as well as the probability of a part of speech for a specific word. For a given sentence, the part-of-speech tagging that gives the highest probability for the whole sentence is used. This method is essentially the same as that used by a statistical entity extractor, with the states being parts of speech instead of entity categories. The Brill tagger (Brill, 1994) is a popular alternative approach that uses rules that are learned automatically from tagged data. Manning and Schütze (1999) provide a good overview of part-of-speech tagging methods.

Many variations of PageRank can be found in the literature. Many of these variations are designed to be more efficient to compute or are used in different applications. The topic-dependent version of PageRank is described in Haveliwala (2002). Both PageRank and HITS have their roots in the citation analysis algorithms developed in the field of bibliometrics.

The idea of enhancing the representation of a hypertext document (i.e., a web page) using the content of the documents that point to it has been around for some time. For example, Croft and Turtle (1989) describe a retrieval model based on incorporating text from related hypertext documents, and Dunlop and van Rijsbergen (1993) describe how documents with little text content (such as those containing images) could be retrieved using the text in linked documents. Restricting the text that is incorporated to the anchor text associated with inlinks was first mentioned by McBryan (1994). Anchor text has been shown to be essential for some categories of web search in TREC evaluations, such as in Ogilvie and Callan (2003).

Techniques have been developed for applying link analysis in collections without explicit link structure (Kurland & Lee, 2005). In this case, the links are based on similarities between the content of the documents, calculated by a similarity measure such as the cosine correlation (see Chapter 7).

Information extraction techniques were developed primarily in research programs such as TIPSTER and MUC (Cowie & Lehnert, 1996). Using named entity extraction to provide additional features for search was also studied early in the TREC evaluations (Callan et al., 1992, 1995). One of the best-known rule-based information extraction systems is FASTUS (Hobbs et al., 1997). The BBN system Identifinder (Bikel et al., 1999), which is based on an HMM, has been used in many projects.

A detailed description of HMMs and the Viterbi algorithm can be found in Manning and Schütze (1999). McCallum (2005) provides an overview of information extraction, with references to more recent advances in the field. Statistical models that incorporate more complex features than HMMs, such as conditional random fields, have become increasingly popular for extraction (Sutton & McCallum, 2007).

Detailed descriptions of all the major encoding schemes can be found in Wikipedia. Fujii and Croft (1993) was one of the early papers that discussed the problems of text processing for search with CJK languages. An entire journal, ACM Transactions on Asian Language Information Processing, 35 has now been devoted to this issue.

35 http://talip.acm.org/
Peng et al. (2004) describe a statistical model for Chinese word segmentation and give references to other approaches.

Exercises

4.1. Plot rank-frequency curves (using a log-log graph) for words and bigrams in the Wikipedia collection available through the book website (http://www.search-engines-book.com). Plot a curve for the combination of the two. What are the best values for the parameter c for each curve?

4.2. Plot vocabulary growth for the Wikipedia collection and estimate the parameters for Heaps' law. Should the order in which the documents are processed make any difference?

4.3. Try to estimate the number of web pages indexed by two different search engines using the technique described in this chapter. Compare the size estimates from a range of queries and discuss the consistency (or lack of it) of these estimates.

4.4. Modify the Galago tokenizer to handle apostrophes or periods in a different way. Describe the new rules your tokenizer implements. Give examples of where the new tokenizer does a better job (in your opinion) and examples where it does not.

4.5. Examine the Lemur stopword list and list 10 words that you think would cause problems for some queries. Give examples of these problems.

4.6. Process five Wikipedia documents using the Porter stemmer and the Krovetz stemmer. Compare the number of stems produced and find 10 examples of differences in the stemming that could have an impact on ranking.

4.7. Use the GATE POS tagger to tag a Wikipedia document. Define a rule or rules to identify phrases and show the top 10 most frequent phrases. Now use the POS tagger on the Wikipedia queries. Are there any problems with the phrases identified?

4.8. Find the 10 Wikipedia documents with the most inlinks. Show the collection of anchor text for those pages.

4.9. Compute PageRank for the Wikipedia documents. List the 20 documents with the highest PageRank values together with the values.

4.10. Figure 4.11 shows an algorithm for computing PageRank. Prove that the entries of the vector I sum to 1 every time the algorithm enters the loop on line 9.

4.11. Implement a rule-based recognizer for cities (you can choose a subset of cities to make this easier). Create a test collection that you can scan manually to find cities mentioned in the text and evaluate your recognizer. Summarize the performance of the recognizer and discuss examples of failures.

4.12. Create a small test collection in some non-English language using web pages. Do the basic text processing steps of tokenizing, stemming, and stopping using tools from the book website and from other websites. Show examples of the index term representation of the documents.

5 Ranking with Indexes

"Must go faster."
David Levinson, Independence Day

5.1 Overview

As this is a fairly technical book, if you have read this far, you probably understand something about data structures and how they are used in programs. If you want to store a list of items, linked lists and arrays are good choices. If you want to quickly find an item based on an attribute, a hash table is a better choice. More complicated tasks require more complicated structures, such as B-trees or priority queues. Why are all these data structures necessary? Strictly speaking, they aren't.
Most things you want to do with a computer can be done with arrays alone. However, arrays have drawbacks: unsorted arrays are slow to search, and sorted arrays are slow at insertion. By contrast, hash tables and trees are fast for both search and insertion. These structures are more complicated than arrays, but the speed difference is compelling.

Text search is very different from traditional computing tasks, so it calls for its own kind of data structure, the inverted index. The name "inverted index" is really an umbrella term for many different kinds of structures that share the same general philosophy. As you will see shortly, the specific kind of data structure used depends on the ranking function. However, since the ranking functions that rank documents well have a similar form, the most useful kinds of inverted indexes are found in nearly every search engine.

This chapter is about how search engine queries are actually processed by a computer, so this whole chapter could arguably be called query processing. The last section of this chapter is called that, and the query processing algorithms presented there are based on the data structures presented earlier in the chapter.

Efficient query processing is a particularly important problem in web search, as it has reached a scale that would have been hard to imagine just 10 years ago. People all over the world type in over half a billion queries every day, searching indexes containing billions of web pages. Inverted indexes are at the core of all modern web search engines.

There are strong dependencies between the separate components of a search engine. The query processing algorithm depends on the retrieval model, and dictates the contents of the index. This works in reverse, too, since we are unlikely to choose a retrieval model that has no efficient query processing algorithm. Since we will not be discussing retrieval models in detail until Chapter 7, we start this chapter by describing an abstract model of ranking that motivates our choice of indexes.

After that, there are four main parts to the chapter. In the first part, we discuss the different types of inverted index and what information about documents is captured in each index. The second part gives an overview of compression techniques, which are a critical part of the efficient implementation of inverted indexes for text retrieval. The third part of the chapter describes how indexes are constructed, including a discussion of the MapReduce framework that can be used for very large document collections. The final part of the chapter focuses on how the indexes are used to generate document rankings in response to queries.

5.2 Abstract Model of Ranking

Before we begin to look at how to build indexes for a search system, we will start by considering an abstract model of ranking. All of the techniques we will consider in this chapter can be seen as implementations of this model.

Figure 5.1 shows the basic components of our model. On the left side of the figure is a sample document. Documents are written in natural human languages, which are difficult for computers to analyze directly. So, as we saw in Chapter 4, the text is transformed into index terms or document features. For the purposes of this chapter, a document feature is some attribute of the document we can express numerically. In the figure, we show two kinds of features.
On top, we have topical features, which estimate the degree to which the document is about a particular subject. On the bottom of the figure, we see two possible document quality features. One feature is the number of web pages that link to this document, and another is the number of days since this page was last updated. These features don't address whether the document is a good topical match for a query, but they do address its quality: a page with no incoming links that hasn't been edited in years is probably a poor match for any query.

Fig. 5.1. The components of the abstract model of ranking: documents, features, queries, the retrieval function, and document scores

Each of these feature values is generated using a feature function, which is just a mathematical expression that generates numbers from document text. In Chapter 4 we discussed some of the important topical and quality features, and in Chapter 7 you will learn about the techniques for creating good feature functions. In this chapter, we assume that reasonable feature values have already been created.

On the right side of the figure, we see a cloud representing the ranking function. The ranking function takes data from document features combined with the query and produces a score. For now, the contents of that cloud are unimportant, except for the fact that most reasonable ranking functions ignore many of the document features, and focus only on the small subset that relate to the query. This fact makes the inverted index an appealing data structure for search.

The final output of the ranking function is a score, which we assume is some real number. If a document gets a high score, this means that the system thinks that document is a good match for the query, whereas lower numbers mean that the system thinks the document is a poor match for the query. To build a ranked list of results, the documents are sorted by score so that the highest-scoring documents come first.

Suppose that you are a human search engine, trying to sort documents in an appropriate order in response to a user query. Perhaps you would place the documents in piles, like "good," "not so good," and "bad." The computer is doing essentially the same thing with scoring. However, you might also break ties by looking carefully at each document to decide which one is more relevant. Unfortunately, finding deep meaning in documents is difficult for computers to do, so search engines focus on identifying good features and scoring based on those features.

A more concrete ranking model

Later in the chapter we will look at query evaluation techniques that assume something stronger about what happens in the ranking function.
Specifically, we assume that the ranking function R takes the following form:

R(Q, D) = Σ_i g_i(Q) f_i(D)

Here, f_i is some feature function that extracts a number from the document text, and g_i is a similar feature function that extracts a value from the query. These two functions form a pair of feature functions. Each pair of functions is multiplied together, and the results from all pairs are added to create a final document score.

Fig. 5.2. A more concrete model of ranking. Notice how both the query and the document have feature functions in this model.

Figure 5.2 shows an example of this model. Just as in the abstract model of ranking, various features are extracted from the document. This picture shows only a few features, but in reality there will be many more. These correspond to the f_i(D) functions in the equation just shown. We could easily name these f_tropical(D) or f_fish(D); these values will be larger for documents that contain the words "tropical" or "fish" more often or more prominently.

The document has some features that are not topical. For this example document, we see that the search engine notices that this document has been updated three times, and that it has 14 incoming links. Although these features don't tell us anything about whether this document would match the subject of a query, they do give us some hints about the quality of the document. We know that it wasn't just posted to the Web and then abandoned, since it gets updated occasionally. We also know that there are 14 other pages that have links pointing to it, which might mean that it has some useful information on it.

Notice that there are also feature functions that act on the query. The feature function g_tropical(Q) evaluates to a large value because "tropical" is in the query. However, g_barbs(Q) also has a small non-zero value because it is related to other terms in the query. These values from the query feature functions are multiplied by the document feature functions, then summed to create a document score.

The query also has some feature values that aren't topical, such as the update count feature. Of course, this doesn't mean that the query has been updated. The value of this feature indicates how important document updates are to relevance for this query. For instance, if the query was "today's weather in london", we would prefer documents that are updated frequently, since a document that isn't updated at least daily is unlikely to say anything interesting about today's weather. This query should have a high value for the update count feature. By contrast, a document that never changed could be very relevant for the query "full text of moby dick". This query could have a low feature value for update count.

If a retrieval system had to perform a sum over millions of features for every document, text search systems would not be practical. In practice, the query features (the g_i(Q) values) are mostly zero. This means that the sum for each document is only over the non-zero g_i(Q) values.
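A minimal sketch of this scoring form, with invented feature values loosely modeled on Figure 5.2, shows why sparse query features keep the sum cheap; nothing here is taken from Galago.

import java.util.HashMap;
import java.util.Map;

// Sketch of R(Q, D) = sum_i g_i(Q) * f_i(D) with sparse query features.
// Feature values are invented for illustration.
public class LinearRanker {
    public static double score(Map<String, Double> queryFeatures,
                               Map<String, Double> docFeatures) {
        double total = 0.0;
        // Iterate only over non-zero query features; everything else contributes nothing.
        for (Map.Entry<String, Double> g : queryFeatures.entrySet()) {
            total += g.getValue() * docFeatures.getOrDefault(g.getKey(), 0.0);
        }
        return total;
    }

    public static void main(String[] args) {
        Map<String, Double> query = new HashMap<>();
        query.put("tropical", 9.9);
        query.put("fish", 9.9);
        query.put("update count", 1.2);   // how much freshness matters for this query

        Map<String, Double> doc = new HashMap<>();
        doc.put("tropical", 4.2);
        doc.put("fish", 9.7);
        doc.put("update count", 3.0);
        doc.put("incoming links", 14.0);  // ignored: the query gives it no weight

        System.out.println(score(query, doc));  // 4.2*9.9 + 9.7*9.9 + 3.0*1.2 = 141.21
    }
}

Only the features the query weights non-zero contribute, which is exactly what makes an index organized by term a good fit for evaluating this sum.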
5.3 Inverted Indexes

All modern search engine indexes are based on inverted indexes. Other index structures have been used in the past, most notably signature files, 1 but inverted indexes are considered the most efficient and flexible index structure.

1 A signature is a concise representation of a block of text (or document) as a sequence of bits, similar to the fingerprints discussed in Chapter 3. A hash function is used for each word in the text block to set bits in specific positions in the signature to one.

An inverted index is the computational equivalent of the index found in the back of this textbook. You might want to look over the index in this book as an example. The book index is arranged in alphabetical order by index term. Each index term is followed by a list of pages about that word. If you want to know more about stemming, for example, you would look through the index until you found words starting with "s". Then, you would scan the entries until you came to the word "stemming." The list of page numbers there would lead you to Chapter 4.

Similarly, an inverted index is organized by index term. The index is inverted because usually we think of words being a part of documents, but if we invert this idea, documents are associated with words. Index terms are often alphabetized like a traditional book index, but they need not be, since they are often found directly using a hash table. Each index term has its own inverted list that holds the relevant data for that term. In an index for a book, the relevant data is a list of page numbers. In a search engine, the data might be a list of documents or a list of word occurrences. Each list entry is called a posting, and the part of the posting that refers to a specific document or location is often called a pointer. Each document in the collection is given a unique number to make it efficient for storing document pointers.

Indexes in books store more than just location information. For important words, often one of the page numbers is marked in boldface, indicating that this page contains a definition or extended discussion about the term. Inverted files can also have extended information, where postings can contain a range of information other than just locations. By storing the right information along with each posting, the feature functions we saw in the last section can be computed efficiently.

Finally, by convention, the page numbers in a book index are printed in ascending order, so that the smallest page numbers come first. Traditionally, inverted lists are stored the same way. These document-ordered lists are ordered by document number, which makes certain kinds of query processing more efficient and also improves list compression. However, some inverted files we will consider have other kinds of orderings.

Alternatives to inverted files generally have one or more disadvantages.
The signature file, for example, represents each document in the collection as a small set of bits. To search a signature file, the query is converted to a signature and the bit patterns are compared. In general, all signatures must be scanned for every search. Even if the index is encoded compactly, this is a lot of processing. The inverted file's advantage is that only a small fraction of the index needs to be considered to process most queries. Also, matches in signature files are noisy, so a signature match is not guaranteed to be a match in the document text. Most importantly, it is difficult to generalize signature file techniques for ranked search (Zobel et al., 1998).

Another approach is to use spatial data structures, such as k-d trees. In this approach, each document is encoded as a point in some high-dimensional space, and the query is as well. The spatial data structure can then be used to find the closest documents to the query. Although many ranking approaches are fundamentally spatial, most spatial data structures are not designed for the number of dimensions associated with text applications. 2 As a result, it tends to be much faster to use an inverted list to rank documents than to use a typical spatial data structure.

2 Every term in a document corresponds to a dimension, so there are tens of thousands of dimensions in effect. This is in comparison to a typical database application with tens of dimensions at most.

In the next few sections, we will look at some different kinds of inverted files. In each case, the inverted file organization is dictated by the ranking function. More complex ranking functions require more information in the index. These more complicated indexes take additional space and computational power to process, but can be used to generate more effective document rankings. Index organization is by no means a solved problem, and research is ongoing into the best way to create indexes that can more efficiently produce effective document rankings.

5.3.1 Documents

The simplest form of an inverted list stores just the documents that contain each word, and no additional information. This kind of list is similar to the kind of index you would find at the back of this textbook. Figure 5.3 shows an index of this type built from the four sentences in Table 5.1 (so in this case, the "documents" are sentences). The index contains every word found in all four sentences. Next to each word there is a list of boxes, and each one contains the number of a sentence. Each one of these boxes is a posting.

For example, look at the word "fish". You can quickly see that this word appears in all four sentences, because the numbers 1, 2, 3, and 4 appear by it. You can also quickly determine that "fish" is the only word that appears in all the sentences. Two words come close: "tropical" appears in every sentence but S4, and "water" is not in S3.

S1: Tropical fish include fish found in tropical environments around the world, including both freshwater and salt water species.
S2: Fishkeepers often use the term tropical fish to refer only those requiring fresh water, with saltwater tropical fish referred to as marine fish.
S3: Tropical fish are popular aquarium fish, due to their often bright coloration.
S4: In freshwater fish, this coloration typically derives from iridescence, while salt water fish are generally pigmented.

Table 5.1. Four sentences from the Wikipedia entry for tropical fish

Fig. 5.3. An inverted index for the documents (sentences) in Table 5.1

Notice that this index does not record the number of times each word appears; it only records the documents in which each word appears. For instance, S2 contains the word "fish" three times, whereas S1 contains "fish" only twice. The inverted list for "fish" shows no distinction between sentences 1 and 2; both are listed in the same way. In the next few sections, we will look at indexes that include information about word frequencies.

Inverted lists become more interesting when we consider their intersection. Suppose we want to find the sentence that contains the words "coloration" and "freshwater". The inverted index tells us that "coloration" appears in S3 and S4, while "freshwater" appears in S1 and S4. We can quickly tell that only S4 contains both "coloration" and "freshwater". Since each list is sorted by sentence number, finding the intersection of these lists takes O(max(m, n)) time, where m and n are the lengths of the two lists. The algorithm is the same as in merge sort. With list skipping, which we will see later in the chapter, this cost drops to O(min(m, n)).

5.3.2 Counts

Remember that our abstract model of ranking considers each document to be composed of features. With an inverted index, each word in the index corresponds to a document feature. This feature data can be processed by a ranking function into a document score. In an inverted index that contains only document information, the features are binary, meaning they are 1 if the document contains a term, 0 otherwise. This is important information, but it is too coarse to find the best few documents when there are a lot of possible matches.

For instance, consider the query "tropical fish". Three sentences match this query: S1, S2, and S3. The data in the document-based index (Figure 5.3) gives us no reason to prefer any of these sentences over any other. Now look at the index in Figure 5.4. This index looks similar to the previous one. We still have the same words and the same number of postings, and the first number in each posting is the same as in the previous index. However, each posting now has a second number. This second number is the number of times the word appears in the document. This small amount of additional data allows us to prefer S2 over S1 and S3 for the query "tropical fish", since S2 contains "tropical" twice and "fish" three times.

Fig. 5.4. An inverted index, with word counts, for the documents in Table 5.1

In this example, it may not be obvious that S2 is much better than S1 or S3, but in general, word counts can be a powerful predictor of document relevance. In particular, word counts can help distinguish documents that are about a particular subject from those that discuss that subject in passing. Imagine two documents, one about tropical fish and another about tropical islands. The document about tropical islands would probably contain the word "fish", but only a few times. On the other hand, the document about tropical fish would contain the word "fish" many times. Using word occurrence counts helps us rank the most relevant document highest in this example.
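A minimal sketch of building this kind of count index from the sentences in Table 5.1, together with the merge-style intersection described above, might look as follows; the tokenization and data structures are our own simplification, not Galago's index format.

import java.util.*;

// Build word -> (document -> count) postings and intersect two sorted document lists.
public class CountIndexSketch {
    public static Map<String, TreeMap<Integer, Integer>> build(String[] docs) {
        Map<String, TreeMap<Integer, Integer>> index = new TreeMap<>();
        for (int d = 0; d < docs.length; d++) {
            for (String word : docs[d].toLowerCase().split("[^a-z]+")) {
                if (word.isEmpty()) continue;
                index.computeIfAbsent(word, w -> new TreeMap<>())
                     .merge(d + 1, 1, Integer::sum);   // document numbers start at 1
            }
        }
        return index;
    }

    // Merge-style intersection of two sorted lists of document numbers.
    public static List<Integer> intersect(List<Integer> a, List<Integer> b) {
        List<Integer> result = new ArrayList<>();
        int i = 0, j = 0;
        while (i < a.size() && j < b.size()) {
            int cmp = a.get(i).compareTo(b.get(j));
            if (cmp == 0) { result.add(a.get(i)); i++; j++; }
            else if (cmp < 0) i++;
            else j++;
        }
        return result;
    }

    public static void main(String[] args) {
        String[] sentences = {
            "Tropical fish include fish found in tropical environments around the world, including both freshwater and salt water species.",
            "Fishkeepers often use the term tropical fish to refer only those requiring fresh water, with saltwater tropical fish referred to as marine fish.",
            "Tropical fish are popular aquarium fish, due to their often bright coloration.",
            "In freshwater fish, this coloration typically derives from iridescence, while salt water fish are generally pigmented."
        };
        Map<String, TreeMap<Integer, Integer>> index = build(sentences);
        System.out.println(index.get("fish"));        // {1=2, 2=3, 3=2, 4=2}
        System.out.println(intersect(
            new ArrayList<>(index.get("coloration").keySet()),
            new ArrayList<>(index.get("freshwater").keySet())));  // [4]
    }
}

The intersection walks both sorted lists once, which is the merge-style O(max(m, n)) scan mentioned above.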
5.3.3 Positions

When looking for matches for a query like "tropical fish", the location of the words in the document is an important predictor of relevance. Imagine a document about food that included a section on tropical fruits followed by a section on saltwater fish. So far, none of the indexes we have considered contain enough information to tell us that this document is not relevant. Although a document that contains the words "tropical" and "fish" is likely to be relevant, we really want to know if the document contains the exact phrase "tropical fish".

To determine this, we can add position information to our index, as in Figure 5.5. This index shares some structural characteristics with the previous indexes, in that it has the same index terms and each list contains some postings. These postings, however, are different. Each posting contains two numbers: a document number first, followed by a word position. In the previous indexes, there was just one posting per document. Now there is one posting per word occurrence.

Fig. 5.5. An inverted index, with word positions, for the documents in Table 5.1

Look at the long list for the word "fish". In the other indexes, this list contained just four postings. Now the list contains nine postings. The first two postings tell us that the word "fish" is the second word and fourth word in S1. The next three postings tell us that "fish" is the seventh, eighteenth, and twenty-third word in S2.
This information is most interesting when we look at intersections with other posting lists. Using an intersection with the list for "tropical", we find where the phrase "tropical fish" occurs. In Figure 5.6, the two inverted lists are lined up next to each other:

tropical: 1,1 1,7 2,6 2,17 3,1
fish: 1,2 1,4 2,7 2,18 2,23 3,2 3,6 4,3 4,13

Fig. 5.6. Aligning posting lists for "tropical" and "fish" to find the phrase "tropical fish"

We see that the word "tropical" is the first word in S1, and "fish" is the second word in S1, which means that S1 must start with the phrase "tropical fish". The word "tropical" appears again as the seventh word in S1, but "fish" does not appear as the eighth word, so this is not a phrase match. In all, there are four occurrences of the phrase "tropical fish" in the four sentences. The phrase matches are easy to see in the figure; they happen at the points where the postings are lined up in columns.

This same technique can be extended to find longer phrases or more general proximity expressions, such as "find tropical within 5 words of fish." Suppose that the word "tropical" appears at position p. We can then look in the inverted list for "fish" for any occurrences between position p − 5 and p + 5. Any of those occurrences would constitute a match.
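The alignment in Figure 5.6 corresponds to a two-pointer walk over positional postings. The sketch below, using the same postings, reports each document and position where the phrase "tropical fish" begins; the array representation is our own choice for illustration.

import java.util.*;

// Find phrase matches by aligning positional postings: a match is a posting
// (doc, p) for the first word and a posting (doc, p+1) for the second word.
public class PhraseMatchSketch {
    // Postings as {document, position} pairs, sorted by document, then position.
    static int[][] tropical = {{1,1},{1,7},{2,6},{2,17},{3,1}};
    static int[][] fish = {{1,2},{1,4},{2,7},{2,18},{2,23},{3,2},{3,6},{4,3},{4,13}};

    public static List<int[]> phraseMatches(int[][] first, int[][] second) {
        List<int[]> matches = new ArrayList<>();
        int i = 0, j = 0;
        while (i < first.length && j < second.length) {
            int doc1 = first[i][0], pos1 = first[i][1];
            int doc2 = second[j][0], pos2 = second[j][1];
            // Advance whichever posting is behind in (document, position) order.
            if (doc1 < doc2 || (doc1 == doc2 && pos1 + 1 < pos2)) i++;
            else if (doc2 < doc1 || (doc1 == doc2 && pos2 < pos1 + 1)) j++;
            else { matches.add(new int[]{doc1, pos1}); i++; j++; }  // pos2 == pos1 + 1
        }
        return matches;
    }

    public static void main(String[] args) {
        for (int[] m : phraseMatches(tropical, fish))
            System.out.println("document " + m[0] + ", starting at position " + m[1]);
        // documents 1, 2, 2, and 3: the four occurrences of "tropical fish"
    }
}

Proximity matching works the same way, except that the test pos2 == pos1 + 1 is relaxed to a window such as |pos2 - pos1| <= 5.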
5.3.4 Fields and Extents

Real documents are not just lists of words. They have sentences and paragraphs that separate concepts into logical units. Some documents have titles and headings that provide short summaries of the rest of the content. Special types of documents have their own sections; for example, every email contains sender information and a subject line. All of these are instances of what we will call document fields, which are sections of documents that carry some kind of semantic meaning.

It makes sense to include information about fields in the index. For example, suppose you have a professor named Dr. Brown. Dr. Brown sent you an email about when course projects are due, but you can't find it. You can type "brown" into your email program's search box, but the result you want will be mixed in with other uses of the word "brown", such as Brown University or brown socks. A search for "brown" in the From: line of the email will focus your search on exactly what you want.

Field information is useful even when it is not used explicitly in the query. Titles and headings tend to be good summaries of the rest of a document. Therefore, if a user searches for "tropical fish", it makes sense to prefer documents with the title "Tropical Fish," even if a document entitled "Mauritius" mentions the words "tropical" and "fish" more often. This kind of preference for certain document fields can be integrated into the ranking function.

In order to handle these kinds of searches, the search engine needs to be able to determine whether a word is in a particular field. One option is to make separate inverted lists for each kind of document field. Essentially, you could build one index for document titles, one for document headings, and one for body text. Searching for words in the title is as simple as searching the title index. However, finding a word in any section of the document is trickier, since you need to fetch inverted lists from many different indexes to make that determination.

Another option is to store information in each word posting about where the word occurred. For instance, we could specify that the number 0 indicates a title and 1 indicates body text. Each inverted list posting would then contain a 0 or a 1 at the end. This data could be used to quickly determine whether a posting was in a title, and it would require only one bit per posting. However, if you have more fields than just a title, the representation will grow.

Both of these suggestions have problems when faced with more complicated kinds of document structure. For instance, suppose we want to index books. Some books, like this one, have more than one author. Somewhere in the XML description of this book, you might find:

<author>W. Bruce Croft</author>, <author>Donald Metzler</author>, and <author>Trevor Strohman</author>

Suppose you would like to find books by an author named Croft Donald. If you type the phrase query "croft donald" into a search engine, should this book match? The words "croft" and "donald" appear in it, and in fact, they appear next to each other. However, they are in two distinct author fields. This probably is not a good match for the query "croft donald", but the previous two methods for dealing with fields (bits in the posting list, separate indexes) cannot make this kind of distinction.

This is where extent lists come in. An extent is a contiguous region of a document. We can represent these extents using word positions. For example, if the title of a book started on the fifth word and ended just before the ninth word, we could encode that as (5,9). For the author text shown earlier, we could write author: (1,4), (4,6), (7,9). The (1,4) means that the first three words ("W. Bruce Croft") constitute the first author, followed by the second author ("Donald Metzler"), which is two words. The word "and" is not in an author field, but the next two words are, so the last posting is (7,9).

fish: 1,2 1,4 2,7 2,18 2,23 3,2 3,6 4,3 4,13
title: 1:(1,3) 2:(1,5) 4:(9,15)

Fig. 5.7. Aligning posting lists for "fish" and title to find matches of the word "fish" in the title field of a document

Figure 5.7 shows how this works in practice. Here we have the same positions posting list for "fish" that we used in the previous example. We also have an extent list for the title field. For clarity, there are gaps in the posting lists in the figure so that the appropriate postings line up next to each other. At the very beginning of both lists, we see that document 1 has a title that contains the first two words (1 and 2, ending just before the third word). We know that this title includes the word "fish", because the inverted list for "fish" tells us that "fish" is the second word in document 1. If the user wants to find documents with the word "fish" in the title, document 1 is a match. Document 2 does not match, because its title ends just before the fifth word, but "fish" doesn't appear until the seventh word. Document 3 apparently has no title at all, so no matches are possible. Document 4 has a title that starts at the ninth word (perhaps the document begins with a date or an author declaration), and it does contain the word "fish". In all, this example shows two matching documents: 1 and 4.

This concept can be extended to all kinds of fields, such as headings, paragraphs, or sentences. It can also be used to identify smaller pieces of text with specific meaning, such as addresses or names, or even just to record which words are verbs.
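A minimal sketch of the extent check illustrated in Figure 5.7: a word posting (document, position) matches a field if that document has an extent (start, end) with start <= position < end. The nested-loop form below is for clarity only and uses the same postings as the figure.

import java.util.*;

// Check word postings against field extents: (doc, pos) is in the field
// if that document has an extent (start, end) with start <= pos < end.
public class ExtentMatchSketch {
    public static void main(String[] args) {
        int[][] fish = {{1,2},{1,4},{2,7},{2,18},{2,23},{3,2},{3,6},{4,3},{4,13}};
        // title extents per document: {doc, start, end} (end is exclusive)
        int[][] title = {{1,1,3},{2,1,5},{4,9,15}};

        Set<Integer> matches = new TreeSet<>();
        for (int[] posting : fish) {
            for (int[] extent : title) {
                if (posting[0] == extent[0]
                        && posting[1] >= extent[1] && posting[1] < extent[2]) {
                    matches.add(posting[0]);
                }
            }
        }
        System.out.println("documents with \"fish\" in the title: " + matches);  // [1, 4]
    }
}

A production implementation would walk both sorted lists with two pointers, as in the phrase-matching example, rather than using nested loops.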
5.3.5 Scores

If the inverted lists are going to be used to generate feature function values, why not just store the value of the feature function? This is certainly possible, and some very efficient search engines do just this. This approach makes it possible to store feature function values that would be too computationally intensive to compute during the query processing phase. It also moves complexity out of the query processing engine and into the indexing code, where it may be more tolerable.

Let's make this more concrete. In the last section, there was an example about how a document with the title "Tropical Fish" should be preferred over a document entitled "Mauritius" for the query "tropical fish", even if the Mauritius document contains the words "tropical" and "fish" many times. Computing the scores that reflect this preference requires some complexity at query evaluation time. The postings for "tropical" and "fish" have to be segregated into groups, so we know which ones are in the title and which ones aren't. Then, we have to define some score for the title postings and the non-title postings and mix those numbers together, and this needs to be done for every document.

An alternate approach is to store the final value right in the inverted list. We could make a list for "fish" that has postings like [(1:3.6), (3:2.2)], meaning that the total feature value for "fish" in document 1 is 3.6, and in document 3 it is 2.2. Presumably the number 3.6 came from taking into account how many times "fish" appeared in the title, in the headings, in large fonts, in bold, and in links to the document. Maybe the document doesn't contain the word "fish" at all, but instead many names of fish, such as "carp" or "trout". The value 3.6 is then some indicator of how much this document is about fish.

Storing scores like this both increases and decreases the system's flexibility. It increases flexibility because computationally expensive scoring becomes possible, since much of the hard work of scoring documents is moved into the index. However, flexibility is lost, since we can no longer change the scoring mechanism once the index is built. More importantly, information about word proximity is gone in this model, meaning that we can't include phrase information in scoring unless we build inverted lists for phrases, too. These precomputed phrase lists require considerable additional space.

5.3.6 Ordering

So far, we have assumed that the postings of each inverted list would be ordered by document number. Although this is the most popular option, it is not the only way to order an inverted list. An inverted list can also be ordered by score, so that the highest-scoring documents come first. This makes sense only when the lists already store the score, or when only one kind of score is likely to be computed from the inverted list. By ordering based on scores instead of document numbers, the query processing engine can focus only on the top part of each inverted list, where the highest-scoring documents are recorded. This is especially useful for queries consisting of a single word. In a traditional document-ordered inverted list, the query processing engine would need to scan the whole list to find the top k scoring documents, whereas it would only need to read the first k postings in a score-sorted list.
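As a small illustration of that difference, the sketch below retrieves the top k documents from a score-sorted list by reading only a prefix, while the document-ordered version of the same postings must be scanned completely; the postings and scores are made up for the example.

import java.util.*;

// Compare top-k retrieval from a score-sorted list and a document-ordered list.
public class TopKSketch {
    public static void main(String[] args) {
        int k = 2;
        // Score-sorted postings for one term: {document, score}, highest score first.
        double[][] scoreSorted = {{9, 4.1}, {2, 3.6}, {7, 1.9}, {1, 0.4}};
        // The same postings ordered by document number.
        double[][] docSorted = {{1, 0.4}, {2, 3.6}, {7, 1.9}, {9, 4.1}};

        // Score-sorted: read only the first k postings.
        System.out.println("top " + k + " from score-sorted list:");
        for (int i = 0; i < k; i++)
            System.out.println("  doc " + (int) scoreSorted[i][0] + " score " + scoreSorted[i][1]);

        // Document-sorted: every posting must be examined; keep the best k in a min-heap.
        PriorityQueue<double[]> heap =
            new PriorityQueue<>(Comparator.comparingDouble((double[] p) -> p[1]));
        for (double[] posting : docSorted) {
            heap.offer(posting);
            if (heap.size() > k) heap.poll();   // drop the current lowest score
        }
        System.out.println("top " + k + " from document-sorted list (after a full scan):");
        for (double[] p : heap)
            System.out.println("  doc " + (int) p[0] + " score " + p[1]);
    }
}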
In a traditional document-ordered inverted list, the query k scoring doc- processing engine would need to scan the whole list to find the top uments, whereas it would only need to read the first k postings in a score-sorted list. 5.4 Compression There are many different ways to store digital information. Usually we make a sim- ple distinction between persistent and transient storage. We use persistent stor- age to store things in files and directories that we want to keep until we choose to delete them. Disks, CDs, DVDs, flash memory, and magnetic tape are commonly used for this purpose. Dynamic RAM (Random Access Memory), on the other hand, is used to store transient information, which is information we need only while the computer is running. We expect that when we turn off the computer, all of that information will vanish. We can make finer distinctions between types of storage based on speed and capacity. Magnetic tape is slow, disks are faster, but dynamic RAM is much faster. Modern computers are so fast that even dynamic RAM isn’t fast enough to keep up, so microprocessors contain at least two levels of cache memory. The very fastest kind of memory makes up the processor registers. In a perfect world, we could use registers or cache memory for all transient storage, but it is too expen- sive to be practical. The reality, then, is that modern computers contain a memory hierarchy . At the top of the hierarchy we have memory that is tiny, but fast. The base consists of memory that is huge, but slow. The performance of a search engine strongly depends on how it makes use of the properties of each type of memory. Compression techniques are the most powerful tool for managing the mem- ory hierarchy. The inverted lists for a large collection are themselves very large. In fact, when it includes information about word position and document extents, 3 the index can be comparable in size to the document collection. Compression allows the same inverted list data to be stored in less space. The obvious ben- efit is that this could reduce disk or memory requirements, which would save 3 As an example, indexes for TREC collections built using the Indri open source search engine range from 25–50% of the size of the collection. The lower figure is for a col- lection of web pages.</p> <p><span class="badge badge-info text-white mr-2">165</span> 5.4 Compression 141 money. More importantly, compression allows data to move up the memory hi- erarchy. If index data is compressed by a factor of four, we can store four times more useful data in the processor cache, and we can feed data to the processor four times faster. On disk, compression also squeezes data closer together, which reduces seek times. In multicore and multiprocessor systems, where many proces- sors share one memory system, compressing data allows the processors to share memory bandwidth more efficiently. Unfortunately, nothing is free. The space savings of compression comes at a cost: the processor must decompress the data in order to use it. Therefore, it isn’t enough to pick the compression technique that can store the most data in the smallest amount of space. In order to increase overall performance, we need to choose a compression technique that reduces space and is easy to decompress. inverted list To see this mathematically, suppose some processor can process p postings per second. This processor is attached to a memory system that can sup- ply the processor with m postings each second. 
The number of postings processed min ( m, p each second is then . If p > m , then the processor will spend some of ) its time waiting for postings to arrive from memory. If , the memory sys- m > p tem will sometimes be idle. Suppose we introduce compression into the system. Our compression system r , meaning that we can now store r postings in the has a compression ratio of same amount of space as one uncompressed posting. This lets the processor read postings each second. However, the processor first needs to decompress each mr posting before processing it. This slows processing by a decompression factor, d , and lets the processor process dp postings each second. Now we can process min ( mr, dp ) postings each second. When we use no compression at all, r and d = 1 . Any reasonable com- = 1 r > 1 , but d < pression technique gives us . We can see that compression is a 1 p > m useful performance technique only when the , that is, when the processor can process inverted list data faster than the memory system can supply it. A very simple compression scheme will raise r a little bit and reduce d a little bit. A com- plicated compression scheme will raise r a lot, while reducing d a lot. Ideally we would like to pick a compression scheme such that ( mr, dp ) is maximized, min which should happen when mr = dp . In this section, we consider only lossless compression techniques. Lossless tech- niques store data in less space, but without losing information. There are also lossy data compression techniques, which are often used for video, images, and audio. These techniques achieve very high compression ratios ( r in our previous discus-</p> <p><span class="badge badge-info text-white mr-2">166</span> 142 5 Ranking with Indexes sion), but do this by throwing away the least important data. Inverted list prun- ing techniques, which we discuss later, could be considered a lossy compression technique, but typically when we talk about compression we mean only lossless methods. In particular, our goal with these compression techniques is to reduce the size of the inverted lists we discussed previously. The compression techniques in this section are particularly well suited for document numbers, word counts, and doc- ument position information. 5.4.1 Entropy and Ambiguity By this point in the book, you have already seen many examples of probability distributions. Compression techniques are based on probabilities, too. The fun- damental idea behind compression is to represent common data elements with short codes while representing uncommon data elements with longer codes. The inverted lists that we have discussed are essentially lists of numbers, and with- out compression, each number takes up the same amount of space. Since some of those numbers are more frequent than others, if we encode the frequent numbers with short codes and the infrequent numbers with longer codes, we can end up with space savings. For example, consider the numbers 0, 1, 2, and 3. We can encode these num- bers using two binary bits. A sequence of numbers, like: 0 , 1 , 0 , 3 , 0 , 2 , 0 can be encoded in a sequence of binary digits: 00 01 00 10 00 11 00 Note that the spaces in the binary sequence are there to make it clear where each number starts and stops, and are not actually part of the encoding. In our example sequence, the number 0 occurs four times, whereas each of the other numbers occurs just once. We may decide to save space by encoding 0 using just a single 0 bit. 
Our first attempt at an encoding might be: 0 01 0 10 0 11 0 This looks very successful because this encoding uses just 10 bits instead of the 14 bits used previously. This encoding is, however, ambiguous , meaning that it is not</p> <p><span class="badge badge-info text-white mr-2">167</span> 5.4 Compression 143 clear how to decode it. Remember that the spaces in the code are only there for our convenience and are not actually stored. If we add some different spaces, we arrive at a perfectly valid interpretation of this encoding: 0 01 01 0 0 11 0 which, when decoded, becomes: , 1 , 1 , 0 , 0 , 3 0 0 , Unfortunately, this isn’t the data we encoded. The trouble is that when we see 010 (0 , 2) in the encoded data, we can’t be sure whether (1 , 0) was encoded. or The uncompressed encoding was not ambiguous. We knew exactly where to put the spaces because we knew that each number took exactly 2 bits. In our com- pressed code, encoded numbers consume either 1 or 2 bits, so it is not clear where to put the spaces. To solve this problem, we need to restrict ourselves to unam- codes, which are confusingly called both and prefix-free codes . biguous prefix codes An unambiguous code is one where there is only one valid way to place spaces in encoded data. Let’s fix our code so that it is unambiguous: Number Code 0 0 1 101 2 110 3 111 This results in the following encoding: 0 101 0 111 0 110 0 This encoding requires 13 bits instead of the 14 bits required by the uncom- pressed version, so we are still saving some space. However, unlike the last code we considered, this one is unambiguous. Notice that if a code starts with 0, it con- sumes 1 bit; if a code starts with 1, it is 3 bits long. This gives us a deterministic algorithm for placing spaces in the encoded stream. In the “Exercises” section, you will prove that there is no such thing as an un- ambiguous code that can compress every possible input; some inputs will get big- ger. This is why it is so important to know something about what kind of data we</p> <p><span class="badge badge-info text-white mr-2">168</span> 144 5 Ranking with Indexes want to encode. In our example, we notice that the number 0 appears frequently, and we can use that fact to reduce the amount of space that the encoded version measures the predictability of the input. In our case, the input requires. Entropy seems somewhat predictable, because the number 0 is more likely to appear than other numbers. We leverage this entropy to produce a usable code for our pur- poses. 5.4.2 Delta Encoding All of the coding techniques we will consider in this chapter assume that small numbers are more likely to occur than large ones. This is an excellent assump- tion for word count data; many words appear just once in a document, and some appear two or three times. Only a small number of words appear more than 10 times. Therefore, it makes sense to encode small numbers with small codes and large numbers with large codes. However, document numbers do not share this property. We expect that a typical inverted list will contain some small document numbers and some very large document numbers. It is true that some documents contain more words, and therefore will appear more times in the inverted lists, but otherwise there is not a lot of entropy in the distribution of document numbers in inverted lists. The situation is different if we consider the differences between document numbers instead of the document numbers themselves. 
Remember that inverted list postings are typically ordered by document number. An inverted list without counts, for example, is just a list of document numbers, like these: 1 , 5 , 9 , 18 , 23 , 24 , 30 , 44 , 45 , 48 Since these document numbers are ordered, we know that each document number in the sequence is more than the one before it and less than the one after it. This fact allows us to encode the list of numbers by the differences between adjacent document numbers: 1 , 4 , 4 , 9 , 5 , 1 , 6 , 14 , 1 , 3 This encoded list starts with 1, indicating that 1 is the first document number. The next entry is 4, indicating that the second document number is 4 more than the first: 1 + 4 = 5 . The third number, 4, indicates that the third document number is 4 more than the second: 5 + 4 = 9 . This process is called deltaencoding , and the differences are often called d-gaps . Notice that delta encoding does not define the bit patterns that are used to store</p> <p><span class="badge badge-info text-white mr-2">169</span> 5.4 Compression 145 the data, and so it does not save any space on its own. However, delta encoding is particularly successful at changing an ordered list of numbers into a list of small numbers. Since we are about to discuss methods for compressing lists of small numbers, this is a useful property. Before we move on, consider the inverted lists for the words “entropy” and “who.” The word “who” is very common, so we expect that most documents will contain it. When we use delta encoding on the inverted list for “who,” we would expect to see many small d-gaps, such as: , 1 , 2 , 1 , 1 , 1 , 4 , 1 , 1 , 3 , ... 5 By contrast, the word “entropy” rarely appears in text, so only a few documents will contain it. Therefore, we would expect to see larger d-gaps, such as: 109 , 3766 , 453 , 1867 , 992 , ... However, since “entropy” is a rare word, this list of large numbers will not be very long. In general, we will find that inverted lists for frequent terms compress very well, whereas infrequent terms compress less well. 5.4.3 Bit-Aligned Codes The code we invented in section 5.4.1 is a bit-aligned code, meaning that the breaks between the coded regions (the spaces) can happen after any bit posi- tion. In this section we will describe some popular bit-aligned codes. In the next section, we will discuss methods where code words are restricted to end on byte boundaries. In all of the techniques we’ll discuss, we are looking at ways to store small numbers in inverted lists (such as word counts, word positions, and delta- encoded document numbers) in as little space as possible. One of the simplest codes is the unary code. You are probably familiar with bi- nary, which encodes numbers with two symbols, typically 0 and 1. A unary num- ber system is a base-1 encoding, which means it uses a single symbol to encode numbers. Here are some examples: Number Code 0 0 1 10 110 2 3 1110 11110 4 5 111110</p> <p><span class="badge badge-info text-white mr-2">170</span> 146 5 Ranking with Indexes k in unary, we output 1s, followed by a 0. We In general, to encode a number k need the 0 at the end to make the code unambiguous. This code is very efficient for small numbers such as 0 and 1, but quickly be- comes very expensive. For instance, the number 1023 can be represented in 10 binary bits, but requires 1024 bits to represent in unary code. Now we know about two kinds of numeric encodings. Unary is convenient because it is compact for small numbers and is inherently unambiguous. 
Binary is a better choice for large numbers, but it is not inherently unambiguous. A rea- sonable compression scheme needs to encode frequent numbers with fewer bits than infrequent numbers, which means binary encoding is not useful on its own for compression. codes Elias- The Elias- γ (Elias gamma) code combines the strengths of unary and binary codes. To encode a number k using this code, we compute two quantities: k ⌋ = ⌊ • k log d 2 ⌋ ⌊ log k 2 2 k = k − • r Suppose you wrote in binary form. The first value, k k , is the number of binary d k > 0 , the leftmost binary digit of k is digits you would need to write. Assuming 1. If you erase that digit, the remaining binary digits are k . r If we encode k in unary and k binary digits), we get the in binary (in k d d r Elias- γ code. Some examples are shown in Table 5.2. k ) k Number ( k Code r d 1 0 0 0 2 1 0 10 0 3 1 10 1 1 2 2 6 110 10 3 7 1110 111 15 16 4 0 11110 0000 255 127 11111110 1111111 7 1023 9 511 1111111110 111111111 Table 5.2. Elias- γ code examples</p> <p><span class="badge badge-info text-white mr-2">171</span> 5.4 Compression 147 The trick with this code is that the unary part of the code tells us how many bits to expect in the binary part. We end up with a code that uses no more bits than the unary code for any number, and for numbers larger than 2, it uses fewer bits. The savings for large numbers is substantial. We can, for example, now encode 1023 in 19 bits, instead of 1024 using just unary code. γ code requires ⌊ log k k , the Elias- + 1 bits for k For any number in unary ⌋ d 2 log k code and ⌋ bits for k ⌊ in binary. Therefore, 2 ⌊ log bits are required k ⌋ + 1 2 r 2 in all. Elias-  codes γ Although the Elias- code is a major improvement on the unary code, it is not ideal for inputs that might contain large numbers. We know that a number can k log binary digits, but the Elias- γ code requires twice as many be expressed in k 2 bits in order to make the encoding unambiguous. δ code attempts to solve this problem by changing the way that k The Elias- d k . In in unary, we can encode k in Elias- + 1 is encoded. Instead of encoding γ d d k into: particular, we split d k ⌋ = ⌊ • + 1) ( k log d dd 2 ⌋ +1) k ( log ⌊ d 2 + 1) k − = ( • 2 k dr d Notice that we use + 1 here, since k is undefined. k log 0 may be zero, but d d 2 k k in unary, k is in binary, and We then encode k in binary. The value of r dd dr dd k k , and k the length of is the length of , which makes this code unambiguous. r dr dr Table 5.3 gives some examples of Elias- δ encodings. Number ( k ) k Code k k k r dd d dr 0 0 0 0 1 0 1 0 1 0 10 0 0 2 1 1 0 10 0 1 1 3 6 2 1 1 10 1 10 2 15 3 7 2 0 110 00 111 16 0 2 1 110 01 0000 4 255 7 127 3 0 1110 000 1111111 1023 9 511 3 2 1110 010 111111111 Table 5.3. Elias- δ code examples</p> <p><span class="badge badge-info text-white mr-2">172</span> 148 5 Ranking with Indexes δ sacrifices some efficiency for small numbers in order to gain efficiency Elias- at encoding larger numbers. Notice that the code for the number 2 has increased γ code. However, for numbers to 4 bits instead of the 3 bits required by the Elias- code requires no more space than the Elias- γ larger than 16, the Elias- δ code, and δ requires less space. for numbers larger than 32, the Elias- γ code requires Specifically, the Elias- log in ( ⌊ log k k ⌋ +1) ⌋ +1 bits for ⌊ dd 2 2 unary, followed by ⌊ log bits ( ⌊ log ⌋ k ⌋ + 1) ⌋ bits for k k in binary, and ⌊ log dr 2 2 2 log k in binary. The total cost is approximately 2 log . 
log k for + k r 2 2 2 5.4.4 Byte-Aligned Codes Even though a few tricks can help us decode bit-aligned codes quickly, codes of variable bit length are cumbersome on processors that process bytes. The proces- sor is built to handle bytes efficiently, not bits, so it stands to reason that byte- aligned codes would be faster in practice. There are many examples of byte-aligned compression schemes, but we con- sider only one popular method here. This is the code commonly known as v-byte , which is an abbreviation for “variable byte length.” The v-byte method is very similar to UTF-8 encoding, which is a popular way to represent text (see section 3.5.1). Like the other codes we have studied so far, the v-byte method uses short codes for small numbers and longer codes for longer numbers. However, each code is a series of bytes, not bits. So, the shortest v-byte code for a single integer is one byte. In some circumstances, this could be very space-inefficient; encoding the number 1 takes eight times as much space in v-byte as in Elias- γ . Typically, the difference in space usage is not quite so dramatic. The v-byte code is really quite simple. The low seven bits of each byte con- tain numeric data in binary. The high bit is a terminator bit. The last byte of each code has its high bit set to 1; otherwise, it is set to 0. Any number that can be represented in seven binary digits requires one byte to encode. More information about space usage is shown in Table 5.4. Some example encodings are shown in Table 5.5. Numbers less than 128 are stored in a single byte in traditional binary form, except that the high bit is set. For larger numbers, the least significant seven bits are stored in the first byte. The next seven bits are stored in the next byte until all of the non-zero bits have been stored. Storing compressed data with a byte-aligned code has many advantages over a bit-aligned code. Byte-aligned codes compress and decompress faster, since pro-</p> <p><span class="badge badge-info text-white mr-2">173</span> 5.4 Compression 149 Number of bytes k 7 1 k < 2 7 14 2 2 k < ≤ 2 14 21 2 2 3 ≤ k < 21 28 2 k < ≤ 4 2 Table 5.4. Space requirements for numbers encoded in v-byte Binary Code k Hexadecimal 1 1 0000001 81 1 0000110 86 6 127 FF 1 1111111 0 0000001 1 0000000 01 80 128 130 0 0000001 1 0000010 01 82 20000 0 0000001 0 0011100 1 0100000 01 1C A0 Table 5.5. Sample encodings for v-byte cessors (and programming languages) are designed to process bytes instead of bits. For these reasons, the Galago search engine associated with this book uses v-byte exclusively for compression. 5.4.5 Compression in Practice The compression techniques we have covered are used to encode inverted lists in real retrieval systems. In this section, we’ll look at how Galago uses compression to encode inverted lists in the class. PositionListWriter Figure 5.5 illustrates how position information can be stored in inverted lists. Consider just the inverted list for tropical : (1 , 1)(1 , 7)(2 , 6)(2 , 17)(3 , 1) In each pair, the first number represents the document and the second number represents the word position. For instance, the third entry in this list states that the word tropical is the sixth word in document 2. 
Because it helps the example, we’ll add (2 , 197) to the list: (1 1) 1)(1 , 7)(2 , 6)(2 , 17)(2 , 197)(3 , ,</p> <p><span class="badge badge-info text-white mr-2">174</span> 150 5 Ranking with Indexes We can group the positions for each document together so that each document has its own entry, (document, count, [positions]), where count is the number of occurrences in the document. Our example data now looks like this: , 2 [1 , 7])(2 , 3 , [6 , (1 , 197])(3 , 1 , [1]) , 17 The word count is important because it makes this list decipherable even with- out the parentheses and brackets. The count tells us how many positions lie within the brackets, and we can interpret these numbers unambiguously, even if they were printed as follows: 1 2 , 1 , 7 , 2 , 3 , 6 , 17 , 197 , 3 , 1 , 1 , However, we will leave the brackets in place for now for clarity. These are small numbers, but with delta encoding we can make them smaller. Notice that the document numbers are sorted in ascending order, so we can safely use delta encoding to encode them: (1 , 2 , [1 , 7])(1 , 3 , [6 , 17 , 197])(1 , 1 , [1]) The second entry now starts with a 1 instead of a 2, but this 1 means “this document number is one more than the last document number.” Since position information is also sorted in ascending order, we can delta-encode the positions as well: , 2 , [1 , 6])(1 , 3 , [6 (1 11 , 180])(1 , 1 , [1]) , We can’t delta-encode the word counts, because they’re not in ascending order. If we did delta-encode them, some of the deltas might be negative, and the com- pression techniques we have discussed do not handle negative numbers without some extra work. Now we can remove the brackets and consider this inverted list as just a list of numbers: 6 , 2 , 1 , 6 , 1 , 3 , 1 , 11 , 180 , 1 , 1 , 1 Since most of these numbers are small, we can compress them with v-byte to save space: 81 82 81 86 81 83 86 8B 01 B4 81 81 81</p> <p><span class="badge badge-info text-white mr-2">175</span> 5.4 Compression 151 01 B4 is 180, which is encoded in two bytes. The rest of the numbers were The encoded as single bytes, giving a total of 13 bytes for the entire list. 5.4.6 Looking Ahead This section described three compression schemes for inverted lists, and there are many others in common use. Even though compression is one of the older areas of computer science, new compression schemes are developed every year. Why are these new schemes necessary? Remember at the beginning of this sec- tion we talked about how compression allows us to trade processor computation for data throughput. This means that the best choice for a compression algorithm is tightly coupled with the state of modern CPUs and memory systems. For a long time, CPU speed was increasing much faster than memory throughput, so com- pression schemes with higher compression ratios became more attractive. How- ever, the dominant hardware trend now is toward many CPU cores with lower clock speeds. Depending on the memory throughput of these systems, lower com- pression ratios may be attractive. More importantly, modern CPUs owe much of their speed to clever tricks such as branch prediction, which helps the processor guess about how code will execute. Code that is more predictable can run much faster than unpredictable code. Many of the newest compression schemes are designed to make the decode phase more predictable, and therefore faster. 5.4.7 Skipping and Skip Pointers For many queries, we don’t need all of the information stored in a particular in- verted list. 
Instead, it would be more efficient to read just the small portion of the data that is relevant to the query. Skip pointers help us achieve that goal. Consider the Boolean query “ galago AND animal ”. The word “animal” occurs in about 300 million documents on the Web versus approximately 1 million for “galago.” If we assume that the inverted lists for “galago” and “animal” are in doc- ument order, there is a very simple algorithm for processing this query: • Let d be the first document number in the inverted list for “galago.” g d be the first document number in the inverted list for “animal.” • Let a • While there are still documents in the lists for “galago” and “animal,” loop: – If d to the next document number in the “galago” list. < d d , set a g g – If d d to the next document number in the “animal” list. , set < d g a a</p> <p><span class="badge badge-info text-white mr-2">176</span> 152 5 Ranking with Indexes d – If d = , the document d contains both “galago” and “animal”. Move a a g and d d both to the next documents in the inverted lists for “galago” and a g “animal,” respectively. Unfortunately, this algorithm is very expensive. It processes almost all docu- ments in both inverted lists, so we expect the computer to process this loop about 300 million times. Over 99% of the processing time will be spent processing the 299 million documents that contain “animal” but do not contain “galago.” We can change this algorithm slightly by skipping forward in the “animal” list. d documents in the “animal” list < d , we skip ahead Every time we find that k g a s s < d . If , we skip ahead by another k documents. We to a new document, a a g s . At this point, we have narrowed our search down to a range do this until d ≥ g a k documents that might contain of , which we can search linearly. d g How much time does the modified algorithm take? Since the word “galago” appears 1 million times, we know that the algorithm will perform 1 million lin- ear searches of length k , giving an expected cost of 500 , 000 × k steps. We also expect to skip forward 300 000 , 000/ k times. This algorithm then takes about , k , × k + 300 500 000 , 000/ 000 steps in total. , k Steps 5 62.5 million 10 35 million 25 million 20 25 24.5 million 27.5 million 40 31 million 50 53 million 100 Skip lengths ( k ) and expected processing steps Table 5.6. Table 5.6 shows the number of processing steps required for some example values of k . We get the best expected performance when we skip 25 documents at a time. Notice that at this value of , we expect to have to skip forward 12 times k in the “animal” list for each occurrence of “galago.” This is because of the cost of linear search: a larger value of k means more elements to check in the linear search. If we choose a binary search instead, the best value of k rises to about 208, with about 9.2 million expected steps.</p> <p><span class="badge badge-info text-white mr-2">177</span> 5.4 Compression 153 If binary search combined with skipping is so much more efficient, why even consider linear search at all? The problem is compression. For binary search to work, we need to be able to jump directly to elements in the list, but after com- pression, every element could take a different amount of space. In addition, delta encoding may be used on the document numbers, meaning that even if we could jump to a particular location in the compressed sequence, we would need to de- compress everything up to that point in order to decode the document numbers. 
This is discouraging because our goal is to reduce the amount of the list we need to process, and it seems that compression forces us to decompress the whole list. We can solve the compression problem with a list of skip pointers. Skip pointer lists are small additional data structures built into the index to allow us to skip through the inverted lists efficiently. d, p ) contains two parts, a document number d and a byte A skip pointer ( (or bit) position . This means that there is an inverted list posting that starts at p position , and that the posting immediately before it is for document d . Notice p that this definition of the skip pointer solves both of our compression problems: we can start decoding at position , and since we know that d is the document p p , we can use it for decoding. immediately preceding As a simple example, consider the following list of document numbers, un- compressed: 5 , 11 , 17 , 21 , 26 , 34 , 36 , 37 , 45 , 48 , 51 , 52 , 57 , 80 , 89 , 91 , 94 , 101 , 104 , 119 If we delta-encode this list, we end up with a list of d-gaps like this: 7 , 6 , 6 , 4 , 5 , 9 , 2 , 1 , 8 , 3 , 3 , 1 , 5 , 23 , 9 , 2 , 3 , 5 , 3 , 15 We can then add some skip pointers for this list, using 0-based positions (that is, the number 5 is at position 0 in the list): (17 , 3) , (34 , 6) , (45 , 9) , (52 , 12) , (89 , 15) , (101 , 18) Suppose we try decoding using the skip pointer (34, 6). We move to position 6 in the d-gaps list, which is the number 2. We add 34 to 2, to decode document number 36. More generally, if we want to find document number 80 in the list, we scan the list of skip pointers until we find (52, 12) and (89, 15). 80 is larger than 52 but less than 89, so we start decoding at position 12. We find:</p> <p><span class="badge badge-info text-white mr-2">178</span> 154 5 Ranking with Indexes • 52 + 5 = 57 • 57 + 23 = 80 At this point, we have successfully found 80 in the list. If instead we were searching for 85, we would again start at skip pointer (52, 12): • 52 + 5 = 57 • 57 + 23 = 80 • 80 + 9 = 89 At this point, since 85 < 89, we would know that 85 is not in the list. galago AND animal In the analysis of skip pointers for the “ ” example, the effec- tiveness of the skip pointers depended on the fact that “animal” was much more common than “galago.” We found that 25 was a good value for k given this query, but we only get to choose one value for k for all queries. The best way to choose k is to find the best possible k for some realistic sample set of queries. For most collections and query loads, the optimal skip distance is around 100 bytes. 5.5 Auxiliary Structures The inverted file is the primary data structure in a search engine, but usually other structures are necessary for a fully functional system. Vocabulary and statistics An inverted file, as described in this chapter, is just a collection of inverted lists. To search the index, some kind of data structure is necessary to find the inverted list for a particular term. The simplest way to solve this problem is to store each inverted list as a separate file, where each file is named after the corresponding search term. To find the inverted list for “dog,” the system can simply open the file named dog and read the contents. However, as we saw in Chapter 4, document collections can have millions of unique words, and most of these words will occur only once or twice in the collection. This means that an index, if stored in files, would consist of millions of files, most of which are very small. 
Unfortunately, modern file systems are not optimized for this kind of stor- age. A file system typically will reserve a few kilobytes of space for each file, even though most files will contain just a few bytes of data. The result is a huge amount of wasted space. As an example, in the AP89 collection, over 70,000 words occur</p> <p><span class="badge badge-info text-white mr-2">179</span> 5.5 Auxiliary Structures 155 just once (see Table 4.1). These inverted lists would require about 20 bytes each, for a total of about 2MB of space. However, if the file system requires 1KB for each file, the result is 70MB of space used to store 2MB of data. In addition, many file systems still store directory information in unsorted arrays, meaning that file lookups can be very slow for large file directories. To fix these problems, inverted lists are usually stored together in a single file, which explains the name inverted file . An additional directory structure, called vocabulary the lexicon , contains a lookup table from index terms to the byte or offset of the inverted list in the inverted file. In many cases, this vocabulary lookup table will be small enough to fit into memory. In this case, the vocabulary data can be stored in any reasonable way on disk and loaded into a hash table at search engine startup. If the search engine needs to handle larger vocabularies, some kind of tree-based data structure, such as a B-tree, should be used to minimize disk accesses during the search process. Galago uses a hybrid strategy for its vocabulary structure. A small file in each index, called vocabulary , stores an abbreviated lookup table from vocabulary terms to offsets in the inverted file. This file contains just one vocabulary entry for each 32K of data in the inverted file. Therefore, a 32TB inverted file would require less than 1GB of vocabulary space, meaning that it can always be stored in memory for collections of a reasonable size. The lists in the inverted file are stored in alphabetical order. To find an inverted list, the search engine uses bi- nary search to find the nearest entry in the vocabulary table, and reads the offset from that entry. The engine then reads 32KB of the inverted file, starting at the offset. This approach finds each inverted list with just one disk seek. To compute some feature functions, the index needs to contain certain vo- cabulary statistics, such as the term frequency or document frequency (discussed in Chapter 4). When these statistics pertain to a specific term, they can be eas- ily stored at the start of the inverted list. Some of these statistics pertain to the corpus, such as the total number of documents stored. When there are just a few of these kinds of statistics, efficient storage considerations can be safely ignored. Galago stores these collection-wide statistics in an XML file called manifest . Documents, snippets, and external systems The search engine, as described so far, returns a list of document numbers and scores. However, a real user-focused search engine needs to display textual infor- mation about each document, such as a document title, URL, or text summary</p> <p><span class="badge badge-info text-white mr-2">180</span> 156 5 Ranking with Indexes (Chapter 6 explains this in more detail). In order to get this kind of information, the text of the document needs to be retrieved. In Chapter 3, we saw some ways that documents can be stored for fast access. 
There are many ways to approach this problem, but in the end, a separate system is necessary to convert search engine results from numbers into something readable by people. 5.6 Index Construction Before an index can be used for query processing, it has to be created from the text collection. Building a small index is not particularly difficult, but as input sizes grow, some index construction tricks can be useful. In this section, we will look at simple in-memory index construction first, and then consider the case where the input data does not fit in memory. Finally, we will consider how to build indexes using more than one computer. 5.6.1 Simple Construction Pseudocode for a simple indexer is shown in Figure 5.8. The process involves only a few steps. A list of documents is passed to the BuildIndex function, and the function parses each document into tokens, as discussed in Chapter 4. These to- kens are words, perhaps with some additional processing, such as downcasing or stemming. The function removes duplicate tokens, using, for example, a hash ta- ble. Then, for each token, the function determines whether a new inverted list needs to be created in I , and creates one if necessary. Finally, the current docu- ment number, n , is added to the inverted list. The result is a hash table of tokens and inverted lists. The inverted lists are just lists of integer document numbers and contain no special information. This is enough to do very simple kinds of retrieval, as we saw in section 5.3.1. As described, this indexer can be used for many small tasks—for example, in- dexing less than a few thousand documents. However, it is limited in two ways. First, it requires that all of the inverted lists be stored in memory, which may not be practical for larger collections. Second, this algorithm is sequential, with no obvious way to parallelize it. The primary barrier to parallelizing this algorithm is the hash table, which is accessed constantly in the inner loop. Adding locks to the hash table would allow parallelism for parsing, but that improvement alone will</p> <p><span class="badge badge-info text-white mr-2">181</span> 5.6 Index Construction 157 BI ( ) ◃ D is a set of text documents procedure D HashTable ◃ Inverted list storage ← () I 0 ◃ n ← Document numbering documents d ∈ D do for all ← n + 1 n T Parse ( d ) ◃ Parse document into tokens ← T Remove duplicates from tokens t ∈ T do for all I ̸∈ I then if t I () ← Array t end if . ( n ) I append t end for end for return I end procedure Pseudocode for a simple indexer Fig. 5.8. not be enough to make use of more than a handful of CPU cores. Handling large collections will require less reliance on memory and improved parallelism. 5.6.2 Merging The classic way to solve the memory problem in the previous example is by merg- ing . We can build the inverted list structure until memory runs out. When that I happens, we write the partial index I to disk, then start making a new one. At the end of this process, the disk is filled with many partial indexes, I . , I , ..., I , I 3 2 1 n The system then merges these files into a single result. By definition, it is not possible to hold even two of the partial index files in memory at one time, so the input files need to be carefully designed so that they can be merged in small pieces. One way to do this is to store the partial indexes in alphabetical order. It is then possible for a merge algorithm to merge the partial indexes using very little memory. 
Figure 5.9 shows an example of this kind of merging procedure. Even though this figure shows only two indexes, it is possible to merge many at once. The algo- rithm is essentially the same as the standard merge sort algorithm. Since both I 1 and I are sorted, at least one of them points to the next piece of data necessary 2 to write to I . The data from the two files is interleaved to produce a sorted result.</p> <p><span class="badge badge-info text-white mr-2">182</span> 158 5 Ranking with Indexes aardv 4 2 5 ark 3 apple 2 4 Index A 15 actor ark aardv 6 9 68 42 Index B aardv ark 2 5 4 3 4 apple 2 Index A 42 15 actor ark aardv 6 9 68 Index B 3 4 5 2 4 6 9 ark actor 15 aardv 42 68 apple 2 Combined index Fig. 5.9. An example of index merging. The first and second indexes are merged together to produce the combined index. Since I may have used the same document numbers, the merge function and I 2 1 I . renumbers documents in 2 This merging process can succeed even if there is only enough memory to store w ), a single inverted list posting, and a few file pointers. In and w two words ( 1 2 practice, a real merge function would read large chunks of I and I , and then 1 2 write large chunks to I in order to use the disk most efficiently. This merging strategy also shows a possible parallel indexing strategy. If many machines build their own partial indexes, a single machine can combine all of those indexes together into a single, final index. However, in the next section, we will explore more recent distributed indexing frameworks that are becoming popular. 5.6.3 Parallelism and Distribution The traditional model for search engines has been to use a single, fast machine to create the index and process queries. This is still the appropriate choice for a large number of applications, but it is no longer a good choice for the largest systems. Instead, for these large systems, it is increasingly popular to use many inexpen- sive servers together and use distributed processing software to coordinate their activities. MapReduce is a distributed processing tool that makes this possible. Two factors have forced this shift. First, the amount of data to index in the largest systems is exploding. Modern web search engines already index tens of bil- lions of pages, but even larger indexes are coming. Consider that if each person on earth wrote one blog post each day, the Web would increase in size by over two trillion pages every year. Optimistically, one typical modern computer can handle a few hundred million pages, although not with the kind of response times that</p> <p><span class="badge badge-info text-white mr-2">183</span> 5.6 Index Construction 159 most users expect. This leaves a huge gulf between the size of the Web and what we can handle with current single-computer technology. Note that this problem is not restricted to a few major web search companies; many more companies want to analyze the content of the Web instead of making it available for public search. These companies have the same scalability problem. The second factor is simple economics. The incredible popularity of personal computers has made them very powerful and inexpensive. In contrast, large com- puters serve a very small market, and therefore have fewer opportunities to de- velop economies of scale. Over time, this difference in scale has made it difficult to make a computer that is much more powerful than a personal computer that is still sold for a reasonable amount of money. 
Many large information retrieval systems ran on mainframes in the past, but today’s platform of choice consists of many inexpensive commodity servers. Inexpensive servers have a few disadvantages when compared to mainframes. First, they are more likely to break, and the likelihood of at least one server fail- ure goes up as you add more servers. Second, they are difficult to program. Most programmers are well trained for single-threaded programming, less well trained for threaded or multi-process programming, and not well trained at all for coop- erative network programming. Many programming toolkits have been developed to help address this kind of problem. RPC, CORBA, Java RMI, and SOAP have been developed to allow function calls across machine boundaries. MPI provides a different abstraction, called messagepassing , which is popular for many scientific tasks. None of these techniques are particularly robust against system failures, and the programming models can be complex. In particular, these systems do not help distribute data evenly among machines; that is the programmer’s job. Data placement Before diving into the mechanics of distributed processing, consider the problems of handling huge amounts of data on a single computer. Distributed processing and large-scale data processing have one major aspect in common, which is that not all of the input data is available at once. In distributed processing, the data might be scattered among many machines. In large-scale data processing, most of the data is on the disk. In both cases, the key to efficient data processing is placing the data correctly. Let’s take a simple example. Suppose you have a text file that contains data about credit card transactions. Each line of the file contains a credit card number</p> <p><span class="badge badge-info text-white mr-2">184</span> 160 5 Ranking with Indexes and an amount of money. How might you determine the number of unique credit card numbers in the file? If the file is not very big, you could read each line, parse the credit card num- ber, and store the credit card number in a hash table. Once the entire file had been read, the hash table would contain one entry for each unique credit card number. Counting the number of entries in the hash table would give you the answer. Un- fortunately, for a big file, the hash table would be too large to store in memory. Now suppose you had the very same credit card data, but the transactions in the file were ordered by credit card number. Counting the number of unique credit card numbers in this case is very simple. Each line in the file is read and the credit card number on the line is parsed. If the credit card number found is different than the one on the line before it, a counter is incremented. When the end of the file is reached, the counter contains a count of the unique credit card numbers in the file. No hash table is necessary for this to work. Now, back to distributed computation. Suppose you have more than one com- puter to use for this counting task. You can split the big file of transactions into small batches of transactions. Each computer can count its fraction, and then the results can be merged together to produce a final result. Initially, we start with an unordered file of transactions. We split that file into small batches of transactions and count the unique credit card numbers in each batch. How do we combine the results? 
We could add the number of credit card numbers found in each batch, but this is incorrect, since the same credit card num- ber might appear in more than one batch, and therefore would be counted more than once in the final total. Instead, we would need to keep a list of the unique credit card numbers found in each batch, and then merge those lists together to make a final result list. The size of this final list is the number of unique credit card numbers in the whole set. In contrast, suppose the transactions are split into batches with more care, so that all transactions made with the same credit card end up in the same batch. With this extra restriction, each batch can be counted individually, and then the counts from each batch can be added to make a final result. No merge is necessary, because there is no possibility of double-counting. Each credit card number will appear in precisely one batch. These examples might be a little bit tedious, but the point is that proper data grouping can radically change the performance characteristics of a task. Using a sorted input file made the counting task easy, reduced the amount of memory needed to nearly zero, and made it possible to distribute the computation easily.</p> <p><span class="badge badge-info text-white mr-2">185</span> 5.6 Index Construction 161 MapReduce MapReduce is a distributed programming framework that focuses on data place- ment and distribution. As we saw in the last few examples, proper data placement can make some problems very simple to compute. By focusing on data placement, MapReduce can unlock the parallelism in some common tasks and make it easier to process large amounts of data. MapReduce gets its name from the two pieces of code that a user needs to write in order to use the framework: the and the Reducer . The MapReduce Mapper library automatically launches many Mapper and Reducer tasks on a cluster of machines. The interesting part about MapReduce, though, is the path the data takes between the Mapper and the Reducer. Before we look at how the Mapper and Reducer work, let’s look at the founda- tions of the MapReduce idea. The functions map and reduce are commonly found in functional languages. In very simple terms, the function transforms a list map of items into another list of items of the same length. The reduce function trans- forms a list of items into a single item. The MapReduce framework isn’t quite so strict with its definitions: both Mappers and Reducers can return an arbitrary number of items. However, the general idea is the same. Map Input Shuff le R educe Output Fig. 5.10. MapReduce</p> <p><span class="badge badge-info text-white mr-2">186</span> 162 5 Ranking with Indexes We assume that the data comes in a set of records. The records are sent to the Mapper, which transforms these records into pairs, each with a key and a value. The next step is the shuffle, which the library performs by itself. This operation uses a hash function so that all pairs with the same key end up next to each other and on the same machine. The final step is the reduce stage, where the records are processed again, but this time in batches, meaning all pairs with the same key are processed at once. The MapReduce steps are summarized in Figure 5.10. procedure MCC (input) not input.done() do while ← record input.next() ← record.card card amount record.amount ← Emit(card, amount) end while end procedure Fig. 5.11. 
Mapper for a credit card summing algorithm procedure RCC (key, values) ← 0 total card key ← while not values.done() do amount ← values.next() total total + amount ← end while Emit(card, total) end procedure Fig. 5.12. Reducer for a credit card summing algorithm The credit card data example we saw in the previous section works well as a MapReduce task. In the Mapper (Figure 5.11), each record is split into a key (the credit card number) and a value (the money amount in the transaction). The shuffle stage sorts the data so that the records with the same credit card number end up next to each other. The reduce stage emits a record for each unique credit</p> <p><span class="badge badge-info text-white mr-2">187</span> 5.6 Index Construction 163 card number, so the total number of unique credit card numbers is the number of records emitted by the reducer (Figure 5.12). . By Typically, we assume that both the Mapper and Reducer are idempotent idempotent, we mean that if the Mapper or Reducer is called multiple times on the same input, the output will always be the same. This idempotence allows the MapReduce library to be fault tolerant. If any part of the computation fails, per- haps because of a hardware machine failure, the MapReduce library can just pro- cess that part of the input again on a different machine. Even when machines don’t fail, sometimes machines can be slow because of misconfiguration or slowly failing parts. In this case, a machine that appears to be normal could return re- sults much more slowly than other machines in a cluster. To guard against this, as the computation nears completion, the MapReduce library issues backup Map- pers and Reducers that duplicate the processing done on the slowest machines. This ensures that slow machines don’t become the bottleneck of a computation. The idempotence of the Mapper and Reducer are what make this possible. If the Mapper or Reducer modified files directly, for example, multiple copies of them could not be run simultaneously. Let’s look at the problem of indexing a corpus with MapReduce. In our simple indexer, we will store inverted lists with word positions. MDTP (input) procedure not input.done() do while document ← input.next() number ← document.number position ← 0 ← Parse(document) tokens each word w in tokens do for Emit( w , document : position ) position = position + 1 end for end while end procedure Fig. 5.13. Mapper for documents MapDocumentsToPostings (Figure 5.13) parses each document in the input. At each word position, it emits a key/value pair: the key is the word itself, and the value is document : position , which is the document number and the position</p> <p><span class="badge badge-info text-white mr-2">188</span> 164 5 Ranking with Indexes RPTL (key, values) procedure key word ← WriteWord(word) while do not input.done() EncodePosting(values.next()) end while end procedure Fig. 5.14. Reducer for word postings concatenated together. When ReducePostingsToLists (Figure 5.14) is called, the emitted postings have been shuffled so that all postings for the same word are together. The Reducer calls WriteWord to start writing an inverted list and then uses EncodePosting to write each posting. 5.6.4 Update So far, we have assumed that indexing is a batch process. This means that a set of documents is given to the indexer as input, the indexer builds the index, and then the system allows users to run queries. 
In practice, most interesting document col- lections are constantly changing. At the very least, collections tend to get bigger over time; every day there is more news and more email. In other cases, such as web search or file system search, the contents of documents can change over time as well. A useful search engine needs to be able to respond to dynamic collections. We can solve the problem of update with two techniques: index merging and result merging. If the index is stored in memory, there are many options for quick index update. However, even if the search engine is evaluating queries in mem- ory, typically the index is stored on a disk. Inserting data in the middle of a file is not supported by any common file system, so direct disk-based update is not straightforward. We do know how to merge indexes together, though, as we saw in section 5.6.2. This gives us a simple approach for adding data to the index: make a new, smaller index ( I ) to make a new in- ) and merge it with the old index ( I 1 2 dex containing all of the data ( I ). Postings in I for any deleted documents can 1 be ignored during the merge phase so they do not appear in I . Index merging is a reasonable update strategy when index updates come in large batches, perhaps many thousands of documents at a time. For single docu- ment updates, it isn’t a very good strategy, since it is time-consuming to write the entire index to disk. For these small updates, it is better to just build a small index</p> <p><span class="badge badge-info text-white mr-2">189</span> 5.7 Query Processing 165 for the new data, but not merge it into the larger index. Queries are evaluated sep- arately against the small index and the big index, and the result lists are merged to find the top results. k Result merging solves the problem of how to handle new documents: just put them in a new index. But how do we delete documents from the index? The com- mon solution is to use a deleted document list. During query processing, the sys- tem checks the deleted document list to make sure that no deleted documents enter the list of results shown to the user. If the contents of a document change, we can delete the old version from the index by using a deleted document list and then add a new version to the recent documents index. Results merging allows us to consider a small, in-memory index structure to hold new documents. This in-memory structure could be a hash table of arrays, as shown in Figure 5.8, and therefore would be simple and quick to update, even with only a single document. To gain even more performance from the system, instead of using just two indexes (an in-memory index and a disk-based index), we can use many indexes. Using too many indexes is a bad idea, since each new index slows down query processing. However, using too few indexes results in slow index build throughput because of excessive disk traffic. A particularly elegant solution to this problem is geometric partitioning I , contains . In geometric partitioning, the smallest index, 0 I r , contains about about as much data as would fit in memory. The next index, 1 times as much data as I . If m is the amount of bytes of memory in the machine, 1 n n I then contains between mr index and ( m + 1) r I bytes of data. If index n n n ( m + 1) r r , it is merged into index I , the = 2 . If ever contains more than +1 n system can hold 1000 m bytes of index data using just 10 indexes. 
5.7 Query Processing Once an index is built, we need to process the data in it to produce query results. Even with simple algorithms, processing queries using an index is much faster than it is without one. However, clever algorithms can boost query processing speed by ten to a hundred times over the simplest versions. We will explore the simplest two query processing techniques first, called document-at-a-time and term-at-a-time, and then move on to faster and more flexible variants.</p> <p><span class="badge badge-info text-white mr-2">190</span> 166 5 Ranking with Indexes 5.7.1 Document-at-a-time Evaluation Document-at-a-time retrieval is the simplest way, at least conceptually, to per- form retrieval with an inverted file. Figure 5.15 is a picture of document-at-a-time retrieval for the query “salt water tropical”. The inverted lists are shown horizon- tally, although the postings have been aligned so that each column represents a different document. The inverted lists in this example hold word counts, and the score, for this example, is just the sum of the word counts in each document. The vertical gray lines indicate the different steps of retrieval. In the first step, all the counts for the first document are added to produce the score for that document. Once the scoring for the first document has completed, the second document is scored, then the third, and then the fourth. salt 4:1 1:1 water 1:1 2:1 4:1 tropical 1:2 3:1 2:2 scor e 1:4 3:1 4:2 2:3 Fig. 5.15. x : y ) represent a docu- Document-at-a-time query evaluation. The numbers ( ment number ( x ) and a word count ( y ). Figure 5.16 shows a pseudocode implementation of this strategy. The param- eters are , the query; I , the index; f and Q , the sets of feature functions; and k , g the number of documents to retrieve. This algorithm scores documents using the abstract model of ranking described in section 5.2. However, in this simplified example, we assume that the only non-zero feature values for g ( Q ) correspond to the words in the query. This gives us a simple correspondence between inverted lists and features: there is one list for each query term, and one feature for each list. Later in this chapter we will explore structured queries, which are a standard way of moving beyond this simple model. For each word w in the query, an inverted list is fetched from the index. These i inverted lists are assumed to be sorted in order by document number. The Invert- edList object starts by pointing at the first posting in each list. All of the fetched inverted lists are stored in an array, L .</p> <p><span class="badge badge-info text-white mr-2">191</span> 5.7 Query Processing 167 DAATR ( , I , f , g , k ) procedure Q Array() L ← PriorityQueue( k ) R ← w in Q do for all terms i ) ← InvertedList( w , l I i i .add( L ) l i end for for all documents ∈ I do d ← s 0 d inverted lists do l in L for all i l .getCurrentDocument() = d then if i s Update the document score ← s ◃ + g ) ( Q ) f l ( i i i d d end if l .movePastDocument( d ) i end for .add( s ) , d R d end for the top k results from R return end procedure Fig. 5.16. A simple document-at-a-time retrieval algorithm In the main loop, the function loops once for each document in the collection. At each document, all of the inverted lists are checked. If the document appears in one of the inverted lists, the feature function f is evaluated, and the docu- i ment score s is computed by adding up the weighted function values. 
For clarity, the pseudocode in Figure 5.16 is free of even simple performance-enhancing changes. Realistically, however, the priority queue R only needs to hold the top k results at any one time. If the priority queue ever contains more than k results, the lowest-scoring documents can be removed until only k remain, in order to save memory. Also, looping over all documents in the collection is unnecessary; we can change the algorithm to score only documents that appear in at least one of the inverted lists.

The primary benefit of this method is its frugal use of memory. The only major use of memory comes from the priority queue, which only needs to store k entries at a time. However, in a realistic implementation, large portions of the inverted lists would also be buffered in memory during evaluation.

5.7.2 Term-at-a-time Evaluation

Figure 5.17 shows term-at-a-time retrieval, using the same query, scoring function, and inverted list data as in the document-at-a-time example (Figure 5.15). Notice that the computed scores are exactly the same in both figures, although the structure of each figure is different.

    salt             1:1       4:1
    partial scores   1:1       4:1
    water            1:1  2:1  4:1
    partial scores   1:2  2:1  4:2
    tropical         1:2  2:2  3:1
    final scores     1:4  2:3  3:1  4:2

Fig. 5.17. Term-at-a-time query evaluation

As before, the gray lines indicate the boundaries between each step. In the first step, the inverted list for "salt" is decoded, and partial scores are stored in accumulators. These scores are called partial scores because they are only a part of the final document score. The accumulators, which get their name from their job, accumulate score information for each document. In the second step, partial scores from the accumulators are combined with data from the inverted list for "water" to produce a new set of partial scores. After the data from the list for "tropical" is added in the third step, the scoring process is complete.

The figure implies that a new set of accumulators is created for each list. Although this is one possible implementation technique, in practice accumulators are stored in a hash table. The information for each document is updated as the inverted list data is processed. The hash table contains the final document scores after all inverted lists have been processed.

procedure TAATR(Q, I, f, g, k)
    A ← HashTable()
    L ← Array()
    R ← PriorityQueue(k)
    for all terms w_i in Q do
        l_i ← InvertedList(w_i, I)
        L.add(l_i)
    end for
    for all lists l_i ∈ L do
        while l_i is not finished do
            d ← l_i.getCurrentDocument()
            A_d ← A_d + g_i(Q) f_i(l_i)
            l_i.moveToNextDocument()
        end while
    end for
    for all accumulators A_d in A do
        s_d ← A_d   ◃ Accumulator contains the document score
        R.add(s_d, d)
    end for
    return the top k results from R
end procedure

Fig. 5.18. A simple term-at-a-time retrieval algorithm

The term-at-a-time retrieval algorithm for the abstract ranking model (Figure 5.18) is similar to the document-at-a-time version at the start. It creates a priority queue and fetches one inverted list for each term in the query, just like the document-at-a-time algorithm. However, the next step is different. Instead of a loop over each document in the index, the outer loop is over each list. The inner loop then reads each posting of the list, computing the feature functions f_i and g_i and adding its weighted contribution to the accumulator A_d. After the main loop completes, the accumulators are scanned and added to a priority queue, which determines the top k results to be returned.

The primary disadvantage of the term-at-a-time algorithm is the memory usage required by the accumulator table A. Remember that the document-at-a-time strategy requires only the small priority queue R, which holds a limited number of results. However, the term-at-a-time algorithm makes up for this because of its more efficient disk access. Since it reads each inverted list from start to finish, it requires minimal disk seeking, and it needs very little list buffering to achieve high speeds. In contrast, the document-at-a-time algorithm switches between lists and requires large list buffers to help reduce the cost of seeking.
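The term-at-a-time strategy can be sketched just as briefly. Again, this is only an illustration, using the same toy postings and scoring as the document-at-a-time sketch above rather than Galago's code; the accumulator table A of Figure 5.18 becomes a Python dictionary.

import heapq
from collections import defaultdict

def term_at_a_time(query_terms, index, k):
    """Accumulate partial scores one inverted list at a time.

    index maps a term to a postings list of (doc, count) pairs; the score of a
    document is the sum of its term counts, matching Figure 5.17.
    """
    accumulators = defaultdict(int)          # the accumulator table A
    for term in query_terms:
        for doc, count in index.get(term, []):
            accumulators[doc] += count       # partial score update
    # Scan the accumulators and keep the top k results.
    return heapq.nlargest(k, ((score, doc) for doc, score in accumulators.items()))

index = {"salt":     [(1, 1), (4, 1)],
         "water":    [(1, 1), (2, 1), (4, 1)],
         "tropical": [(1, 2), (2, 2), (3, 1)]}
print(term_at_a_time(["salt", "water", "tropical"], index, k=3))
# [(4, 1), (3, 2), (2, 4)]  -> the same ranking as the document-at-a-time sketch

Both sketches produce the same ranking; they differ only in the order in which postings are visited, which is the source of the memory and disk-access trade-offs just described.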
In practice, neither the document-at-a-time nor the term-at-a-time algorithm is used without additional optimizations. These optimizations dramatically improve the running speed of the algorithms, and can have a large effect on the memory footprint.

5.7.3 Optimization Techniques

There are two main classes of optimizations for query processing. The first is to read less data from the index, and the second is to process fewer documents. The two are related, since it would be hard to score the same number of documents while reading less data. When using feature functions that are particularly complex, focusing on scoring fewer documents should be the main concern. For simple feature functions, the best speed comes from ignoring as much of the inverted list data as possible.

List skipping

In section 5.4.7, we covered skip pointers in inverted lists. This kind of forward skipping is by far the most popular way to ignore portions of inverted lists (Figure 5.19). More complex approaches (for example, tree structures) are also possible but not frequently used.

Fig. 5.19. Skip pointers in an inverted list. The gray boxes show skip pointers, which point into the white boxes, which are inverted list postings.

Skip pointers do not improve the asymptotic running time of reading an inverted list. Suppose we have an inverted list that is n bytes long, but we add skip pointers after each c bytes, and the pointers are k bytes long. Reading the entire list requires reading O(n) bytes, but jumping through the list using the skip pointers requires O(kn/c) time, which is equivalent to O(n). Even though there is no asymptotic gain in runtime, the factor of c can be huge. For typical values of c = 100 and k = 4, skipping through a list results in reading just 2.5% of the total data.

Notice that as c gets bigger, the amount of data you need to read to skip through the list drops. So, why not make c as big as possible? The problem is that if c gets too large, the average performance drops. Let's look at this problem in more detail.
Suppose you want to find p particular postings in an inverted list, and the list is n bytes long, with k-byte skip pointers located at c-byte intervals. Therefore, there are n/c total intervals in the list. To find those p postings, we need to read kn/c bytes in skip pointers, but we also need to read data in p intervals. On average, we assume that the postings we want are about halfway between two skip pointers, so we read an additional pc/2 bytes to find those postings. The total number of bytes read is then:

    kn/c + pc/2

Notice that this analysis assumes that p is much smaller than n/c; that's what allows us to assume that each posting lies in its own interval. As p grows closer to n/c, it becomes likely that some of the postings we want will lie in the same intervals. However, notice that once p gets close to n/c, we need to read almost all of the inverted list, so the skip pointers aren't very helpful.

Coming back to the formula, you can see that while a larger value of c makes the first term smaller, it also makes the second term bigger. Therefore, picking the perfect value for c depends on the value of p, and we don't know what p is until a query is executed. However, it is possible to use previous queries to simulate skipping behavior and to get a good estimate for c. In the exercises, you will be asked to plot some graphs of this formula and to solve for the equilibrium point.

Although it might seem that list skipping could save on disk accesses, in practice it rarely does. Modern disks are much better at reading sequential data than they are at skipping to random locations. Because of this, most disks require a skip of about 100,000 postings before any speedup is seen. Even so, skipping is still useful because it reduces the amount of time spent decoding compressed data that has been read from disk, and it dramatically reduces processing time for lists that are cached in memory.

Conjunctive processing

The simplest kind of query optimization is conjunctive processing. By conjunctive processing, we just mean that every document returned to the user needs to contain all of the query terms. Conjunctive processing is the default mode for many web search engines, in part because of speed and in part because users have come to expect it. With short queries, conjunctive processing can actually improve effectiveness and efficiency simultaneously. In contrast, search engines that use longer queries, such as entire paragraphs, will not be good candidates for conjunctive processing.

Conjunctive processing works best when one of the query terms is rare, as in the query "fish locomotion". The word "fish" occurs about 100 times as often as the word "locomotion". Since we are only interested in documents that contain both words, the system can skip over most of the inverted list for "fish" in order to find only the postings in documents that also contain the word "locomotion".

Conjunctive processing can be employed with both term-at-a-time and document-at-a-time systems. Figure 5.20 shows the updated term-at-a-time algorithm for conjunctive processing. When processing the first term (i = 0), the algorithm proceeds normally. However, for the remaining terms (i > 0), the algorithm processes postings differently (the else branch in Figure 5.20). It checks the accumulator table for the next document that contains all of the previous query terms (getNextAccumulator), and instructs list l_i to skip forward to that document if there is a posting for it (skipForwardToDocument). If there is a posting, the accumulator is updated. If the posting does not exist, the accumulator is deleted (removeAccumulatorsBetween).

The document-at-a-time version (Figure 5.21) is similar to the old document-at-a-time version, except in the inner loop. It begins by finding the largest document currently pointed to by an inverted list (line 13). This document d is not guaranteed to contain all the query terms, but it is a reasonable candidate. The next loop tries to skip all lists forward to point at d (line 16). If this is not successful, the loop terminates and another document is chosen. If it is successful, the document d is scored and added to the priority queue.

In both algorithms, the system runs fastest when the first list (l_0) is the shortest and the last list (l_n) is the longest. This results in the biggest possible skip distances in the last list, which is where skipping will help most.

 1: procedure TAATR(Q, I, f, g, k)
 2:     A ← Map()
 3:     L ← Array()
 4:     R ← PriorityQueue(k)
 5:     for all terms w_i in Q do
 6:         l_i ← InvertedList(w_i, I)
 7:         L.add(l_i)
 8:     end for
 9:     for all lists l_i ∈ L do
10:         d_0 ← −1
11:         while l_i is not finished do
12:             if i = 0 then
13:                 d ← l_i.getCurrentDocument()
14:                 A_d ← A_d + g_i(Q) f_i(l_i)
15:                 l_i.moveToNextDocument()
16:             else
17:                 d ← l_i.getCurrentDocument()
18:                 d′ ← A.getNextAccumulator(d)
19:                 A.removeAccumulatorsBetween(d_0, d′)
20:                 if d = d′ then
21:                     A_d ← A_d + g_i(Q) f_i(l_i)
22:                     l_i.moveToNextDocument()
23:                 else
24:                     l_i.skipForwardToDocument(d′)
25:                 end if
26:                 d_0 ← d′
27:             end if
28:         end while
29:     end for
30:     for all accumulators A_d in A do
31:         s_d ← A_d   ◃ Accumulator contains the document score
32:         R.add(s_d, d)
33:     end for
34:     return the top k results from R
35: end procedure

Fig. 5.20. A term-at-a-time retrieval algorithm with conjunctive processing

 1: procedure DAATR(Q, I, f, g, k)
 2:     L ← Array()
 3:     R ← PriorityQueue(k)
 4:     for all terms w_i in Q do
 5:         l_i ← InvertedList(w_i, I)
 6:         L.add(l_i)
 7:     end for
 8:     d ← −1
 9:     while all lists in L are not finished do
10:         s_d ← 0
11:         for all inverted lists l_i in L do
12:             if l_i.getCurrentDocument() > d then
13:                 d ← l_i.getCurrentDocument()
14:             end if
15:         end for
16:         for all inverted lists l_i in L do
17:             l_i.skipForwardToDocument(d)
18:             if l_i.getCurrentDocument() = d then
19:                 s_d ← s_d + g_i(Q) f_i(l_i)   ◃ Update the document score
20:                 l_i.movePastDocument(d)
21:             else
22:                 d ← −1
23:                 break
24:             end if
25:         end for
26:         if d > −1 then
27:             R.add(s_d, d)
28:         end if
29:     end while
30:     return the top k results from R
31: end procedure

Fig. 5.21. A document-at-a-time retrieval algorithm with conjunctive processing
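The document-at-a-time conjunctive strategy of Figure 5.21 can be sketched compactly in Python. This is an illustration only: postings are in-memory lists, the score is a sum of counts, and a binary search stands in for skipForwardToDocument. It is not the book's or Galago's implementation.

from bisect import bisect_left
import heapq

def skip_forward(docs, target, pos):
    """Move pos forward until docs[pos] >= target (a stand-in for skip pointers)."""
    return bisect_left(docs, target, pos)

def conjunctive_daat(query_terms, index, k):
    # Sort lists from shortest to longest, so the rare term drives the skipping.
    lists = sorted((index.get(t, []) for t in query_terms), key=len)
    docs = [[d for d, _ in plist] for plist in lists]
    pos = [0] * len(lists)
    top_k = []

    while all(p < len(dl) for p, dl in zip(pos, docs)):
        # Candidate: the largest document currently pointed to by any list.
        d = max(dl[p] for p, dl in zip(pos, docs))
        score, match = 0, True
        for i in range(len(lists)):
            pos[i] = skip_forward(docs[i], d, pos[i])
            if pos[i] < len(docs[i]) and docs[i][pos[i]] == d:
                score += lists[i][pos[i]][1]
            else:
                match = False
                break
        if match:
            heapq.heappush(top_k, (score, d))
            if len(top_k) > k:
                heapq.heappop(top_k)
            pos = [p + 1 for p in pos]          # move past d in every list
    return sorted(top_k, reverse=True)

index = {"fish":       [(1, 2), (3, 1), (5, 4), (9, 1), (12, 2)],
         "locomotion": [(5, 1), (12, 1)]}
print(conjunctive_daat(["fish", "locomotion"], index, k=10))
# [(5, 5), (3, 12)]  -> documents 5 and 12, the only ones containing both terms

Sorting the lists from shortest to longest mirrors the advice above: the rare term supplies the candidate documents, so most of the long list is skipped rather than read.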
Threshold methods

So far, the algorithms we have considered do not do much with the parameter k until the very last statement. Remember that k is the number of results requested by the user, and for many search applications this number is something small, such as 10 or 20. Because of this small value of k, most documents in the inverted lists will never be shown to the user. Threshold methods focus on this k parameter in order to score fewer documents.

In particular, notice that for every query, there is some minimum score that each document needs to reach before it can be shown to the user. This minimum score is the score of the kth-highest-scoring document. Any document that does not score at least this highly will never be shown to the user. In this section, we will use the Greek letter tau (τ) to represent this value, which we call the threshold.

If we could know the appropriate value for τ before processing the query, many query optimizations would be possible. For instance, since a document needs a score of at least τ in order to be useful to the user, we could avoid adding documents to the priority queue (in the document-at-a-time case) that did not achieve a score of at least τ. In general, we could safely ignore any document with a score less than τ.

Unfortunately, we don't know how to compute the true value of τ without evaluating the query, but we can approximate it. These approximations will be called τ′. We want τ′ ≤ τ, so that we can safely ignore any document with a score less than τ′. Of course, the closer our estimate τ′ gets to τ, the faster our algorithm will run, since it can ignore more documents.

Coming up with an estimate for τ′ is easy with a document-at-a-time strategy. Remember that R maintains a list of the top k highest-scoring documents seen so far in the evaluation process. We can set τ′ to the score of the lowest-scoring document currently in R, assuming R already has k documents in it. With term-at-a-time evaluation, we don't have full document scores until the query evaluation is almost finished. However, we can still set τ′ to be the kth-largest score in the accumulator table.
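This running estimate is exactly what a bounded priority queue provides. The following Python sketch is illustrative only (the scores are made-up numbers); it shows τ′ tightening as documents are scored in a document-at-a-time run.

import heapq

class TopK:
    """Keep the k best (score, doc) pairs; the smallest kept score is tau'."""
    def __init__(self, k):
        self.k = k
        self.heap = []            # min-heap of (score, doc)

    def add(self, score, doc):
        if len(self.heap) < self.k:
            heapq.heappush(self.heap, (score, doc))
        elif score > self.heap[0][0]:
            heapq.heapreplace(self.heap, (score, doc))

    def tau_prime(self):
        # A safe lower bound on the final threshold tau: 0 until k docs are seen.
        return self.heap[0][0] if len(self.heap) == self.k else 0.0

results = TopK(k=3)
for doc, score in enumerate([1.2, 0.4, 2.5, 0.9, 3.1, 0.2, 1.7], start=1):
    if score > results.tau_prime():   # documents below tau' can be skipped entirely
        results.add(score, doc)
    print(f"after doc {doc}: tau' = {results.tau_prime():.1f}")

As τ′ rises, more and more candidate documents can be rejected without being fully scored, which is what the MaxScore optimization described next exploits.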
MaxScore

With reasonable estimates for τ′, it is possible to start ignoring some of the data in the inverted lists. This estimate, τ′, represents a lower bound on the score a document needs in order to enter the final ranked list. Therefore, with a little bit of clever math, we can ignore parts of the inverted lists that will not generate document scores above τ′.

Let's look more closely at how this might happen with a simple example. Consider the query "eucalyptus tree". The word "tree" is about 100 times more common than the word "eucalyptus", so we expect that most of the time we spend evaluating this query will be spent scoring documents that contain the word "tree" and not "eucalyptus". This is a poor use of time, since we're almost certain to find a set of top k documents that contain both words.

Fig. 5.22. MaxScore retrieval with the query "eucalyptus tree". The gray boxes indicate postings that can be safely ignored during scoring.

Figure 5.22 shows this effect in action. We see the inverted lists for "eucalyptus" and "tree" extending across the page, with the postings lined up by document, as in previous figures in this chapter. This figure shows that there are many documents that contain the word "tree" and don't contain the word "eucalyptus". Suppose that the indexer computed the largest partial score in the "tree" list, and call that value μ_tree. This is the maximum score (hence MaxScore) that any document that contains just this word could have.

Suppose that we are interested only in the top three documents in the ranked list (i.e., k is 3). The first scored document contains just the word "tree". The next three documents contain both "eucalyptus" and "tree". We will use τ′ to represent the lowest score from these three documents. At this point, it is highly likely that τ′ > μ_tree, because τ′ is the score of a document that contains both query terms, whereas μ_tree is a score for a document that contains just one of the query terms. This is where the gray boxes come into the story. Once τ′ > μ_tree, we can safely skip over all of the gray postings, since we have proven that these documents will not enter the final ranked list.

The postings data in the figure is fabricated, but for real inverted lists for "eucalyptus" and "tree", 99% of the postings for "tree" would be gray boxes, and therefore would be safe to ignore. This kind of skipping can dramatically reduce query times without affecting the quality of the query results.

Early termination

The MaxScore approach guarantees that the result of query processing will be exactly the same in the optimized version as it is without optimization. In some cases, however, it may be interesting to take some risks with quality and process queries in a way that might lead to different results than the same queries in an unoptimized system.

Why might we choose to do this? One reason is that some queries are much, much more expensive than others. Consider the phrase query "to be or not to be". This query uses very common terms that would have very long inverted lists. Running this query to completion could severely reduce the amount of system resources available to serve other queries. Truncating query processing for this expensive query can help ensure fairness for others using the same system.

Another reason is that MaxScore is necessarily conservative. It will not skip over regions of the inverted list that might have a usable candidate document. Because of this, MaxScore can spend a lot of time looking for a document that might not exist. Taking a calculated risk to ignore these improbable documents can pay off in decreased system resource consumption.

How might early termination work? In term-at-a-time systems, we can terminate processing by simply ignoring some of the very frequent query terms. This is not so different from using a stopword list, except that in this case we would be ignoring words that usually would not be considered stopwords. Alternatively, we might decide that after some constant number of postings have been read, no other terms will be considered. The reasoning here is that, after processing a substantial number of postings, the ranking should be fairly well established. Reading more information will only change the rankings a little. This is especially true for queries with many terms (e.g., hundreds), which can happen when query expansion techniques are used.

In document-at-a-time systems, early termination means ignoring the documents at the very end of the inverted lists. This is a poor idea if the documents are sorted in random order, but this does not have to be the case. Instead, documents could be sorted in order by some quality metric, such as PageRank.
Terminating early in that case would mean ignoring documents that are considered lower qual- ity than the documents that have already been scored. List ordering So far, all the examples in this chapter assume that the inverted lists are stored in the same order, by document number. If the document numbers are assigned randomly, this means that the document sort order is random. The net effect is that the best documents for a query can easily be at the very end of the lists. With good documents scattered throughout the list, any reasonable query processing algorithm must read or skip through the whole list to make sure that no good documents are missed. Since these lists can be long, it makes sense to consider a more intelligent ordering.</p> <p><span class="badge badge-info text-white mr-2">202</span> 178 5 Ranking with Indexes One way to improve document ordering is to order documents based on doc- ument quality, as we discussed in the last section. There are plenty of quality met- rics that could be used, such as PageRank or the total number of user clicks. If the smallest document numbers are assigned to the highest-quality documents, it be- comes reasonable to consider stopping the search early if many good documents have been found. The threshold techniques from the MaxScore section can be used here. If we know that documents in the lists are decreasing in quality, we can compute an upper bound on the scores of the documents remaining in the lists at ′ every point during retrieval. When rises above the highest possible remaining τ document score, retrieval can be stopped safely without harming effectiveness. Another option is to order each list by partial score. For instance, for the “food” list, we could store documents that contain many instances of the word “food” first. In a web application, this may correspond to putting restaurant pages early in the inverted list. For a “dog” list, we could store pages about dogs (i.e., containing many instances of “dog”) first. Evaluating a query about food or dogs then becomes very easy. Other queries, however, can be more difficult. For ex- ample, how do we evaluate the query “dog food”? The best way to do it is to use an accumulator table, as in term-at-a-time retrieval. However, instead of reading a whole list at once, we read just small pieces of each list. Once the accumulator table shows that many good documents have been found, we can stop looking. As you can imagine, retrieval works fastest with terms that are likely to appear together, such as “tortilla guacamole”. When the terms are not likely to appear together—for example, “dirt cheese”—it is likely to take much longer to find the top documents. 5.7.4 Structured Queries In the query evaluation examples we have seen so far, our assumption is that each inverted list corresponds to a single feature, and that we add those features to- gether to create a final document score. Although this works in simple cases, we might want a more interesting kind of scoring function. For instance, in Figure 5.2 the query had plenty of interesting features, including a phrase (“tropical fish”), a synonym (“chichlids”), and some non-topical features (e.g., incoming links). One way to do this is to write specialized ranking code in the retrieval system that detects these extra features and uses inverted list data directly to compute scores, but in a way that is more complicated than just a linear combination of features. 
This approach greatly increases the kinds of scoring that you can use, and is very efficient. Unfortunately, it isn’t very flexible.</p> <p><span class="badge badge-info text-white mr-2">203</span> 5.7 Query Processing 179 . Struc- structured queries Another option is to build a system that supports tured queries are queries written in a query language, which allows you to change the features used in a query and the way those features are combined. The query language is not used by normal users of the system. Instead, a query translator converts the user’s input into a structured query representation. This translation process is where the intelligence of the system goes, including how to weight word features and what synonyms to use. Once this structured query has been created, it is passed to the retrieval system for execution. You may already be familiar with this kind of model, since database systems work this way. Relational databases are controlled using Structured Query Lan- guage (SQL). Many important applications consist of a user interface and a struc- tured query generator, with the rest of the logic controlled by a database. This sep- aration of the application logic from the database logic allows the database to be both highly optimized and highly general. Galago contains a structured query processing system that is described in de- tail in Chapter 7. This query language is also used in the exercises. #combine featur e combinations ximity expr o essions pr #od:1 #od:1 aquarium fish tropical list data Evaluation tree for the structured query #combine(#od:1(tropical fish) Fig. 5.23. #od:1(aquarium fish) fish) Figure 5.23 shows a tree representation of a structured query written in the Galago structured query language: #combine(#od:1(tropical fish) #od:1(aquarium fish) fish) . This query indicates that the document score should be a combination of the scores from three subqueries. The first query is #od:1(tropical fish) . In the Galago query language, the #od:1 operator means that the terms inside it need to appear next to each other, in that order, in a matching document. The same is true of</p> <p><span class="badge badge-info text-white mr-2">204</span> 180 5 Ranking with Indexes . The final query term is fish #od:1(aquarium fish) . Each of these subqueries acts as a operator. #combine document feature that is combined using the This query contains examples of the main types of structured query expres- sions. At the bottom of the tree, we have index terms. These are terms that corre- spond to inverted lists in the index. Above that level, we have proximity expres- sions. These expressions combine inverted lists to create more complex features, such as a feature for “fish” occurring in a document title, or “tropical fish” occur- ring as a phrase. At the top level, the feature data computed from the inverted lists is combined into a document score. At this level, the position information from the inverted lists is ignored. Galago evaluates structured queries by making a tree of iterator objects that looks just like the tree shown in Figure 5.23. For instance, an iterator is created that returns the matching documents for #od:1(tropical fish) . The iterator finds these matching documents by using data from inverted list iterators for “tropical” and “fish”. The #combine operator is an iterator of document scores, which uses iterators for #od:1(tropical fish) , #od:1(aquarium fish) , and fish . 
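As a rough illustration of this idea, the evaluation tree for this query could be assembled from small scoring objects like the ones below. They only mirror the shape of the tree in Figure 5.23 and are not Galago's iterator API; the postings and positions are hypothetical.

# A toy evaluation tree in the spirit of Figure 5.23 (not Galago's classes).
# TermIterator reads a postings list; OrderedWindow stands in for #od:1;
# Combine stands in for #combine and sums its children's scores.

class TermIterator:
    def __init__(self, postings):            # postings: {doc: [positions]}
        self.postings = postings
    def score(self, doc):
        return len(self.postings.get(doc, []))

class OrderedWindow:
    # For simplicity, the window's children are assumed to be TermIterators.
    def __init__(self, left, right):
        self.left, self.right = left, right
    def score(self, doc):
        lpos = self.left.postings.get(doc, [])
        rpos = self.right.postings.get(doc, [])
        return sum(1 for p in lpos if p + 1 in rpos)   # terms must be adjacent

class Combine:
    def __init__(self, children):
        self.children = children
    def score(self, doc):
        return sum(child.score(doc) for child in self.children)

tropical = TermIterator({1: [4], 2: [7]})
aquarium = TermIterator({2: [2]})
fish     = TermIterator({1: [5, 9], 2: [3, 8]})
query = Combine([OrderedWindow(tropical, fish),
                 OrderedWindow(aquarium, fish),
                 fish])
print([(d, query.score(d)) for d in (1, 2)])   # [(1, 3), (2, 4)]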
Once a tree of itera- tors like this is made, scoring documents is just a matter of using the root iterator to step through the documents. 5.7.5 Distributed Evaluation A single modern machine can handle a surprising load, and is probably enough for most tasks. However, dealing with a large corpus or a large number of users may require using more than one machine. The general approach to using more than one machine is to send all queries to director a indexservers , which machine. The director then sends messages to many do some portion of the query processing. The director then organizes the results of this process and returns them to the user. The easiest distribution strategy is called document distribution . In this strat- egy, each index server acts as a search engine for a small fraction of the total doc- ument collection. The director sends a copy of the query to each of the index servers, each of which returns the top k results, including the document scores for these results. These results are merged into a single ranked list by the director, which then returns the results to the user. Some ranking algorithms rely on collection statistics, such as the number of occurrences of a term in the collection or the number of documents containing a term. These statistics need to be shared among the index servers in order to pro- duce comparable scores that can be merged effectively. In very large clusters of</p> <p><span class="badge badge-info text-white mr-2">205</span> 5.7 Query Processing 181 machines, the term statistics at the index server level can vary wildly. If each in- dex server uses only its own term statistics, the same document could receive very different kinds of scores, depending on which index server is used. term distribution . In term distribution, Another distribution method is called a single index is built for the whole cluster of machines. Each inverted list in that index is then assigned to one index server. For instance, the word “dog” might be handled by the third server, while “cat” is handled by the fifth server. For a index servers and a k term query, the probability that all of the system with n k − 1 query terms would be on the same server is 1/ . For a cluster of 10 machines, n this probability is just 1% for a three-term query. Therefore, in most cases the data to process a query is not stored all on one machine. One of the index servers, usually the one holding the longest inverted list, is chosen to process the query. If other index servers have relevant data, that data is sent over the network to the index server processing the query. When query processing is complete, the results are sent to a director machine. The term distribution approach is more complex than document distribution because of the need to send inverted list data between machines. Given the size of inverted lists, the messages involved in shipping this data can saturate a network. In addition, each query is processed using just one processor instead of many, which increases overall query latency versus document distribution. The main ad- vantage of term distribution is seek time. If we have a -term query and n index k servers, the total number of disk seeks necessary to process a query is ( kn ) for O a document-distributed system, but just O ( k ) in a term-distributed system. For a system that is disk-bound, and especially one that is seek-bound, term distribu- tion might be attractive. However, recent research shows that term distribution is rarely worth the effort. 
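Under document distribution, the director's job is essentially a k-way merge of the per-server result lists. A minimal Python sketch follows; the server results are hypothetical, and it assumes the servers computed their scores from shared collection statistics, so that scores are directly comparable across servers.

import heapq

def merge_results(per_server_results, k):
    """Merge top-k lists of (score, doc_id) pairs returned by index servers."""
    merged = heapq.merge(*[sorted(r, reverse=True) for r in per_server_results],
                         reverse=True)
    return list(merged)[:k]

# Hypothetical top-3 lists from three index servers (score, doc_id).
server_results = [
    [(12.1, "s1-doc4"), (9.8, "s1-doc9"), (7.2, "s1-doc1")],
    [(11.4, "s2-doc3"), (8.9, "s2-doc7"), (8.1, "s2-doc2")],
    [(13.0, "s3-doc5"), (6.5, "s3-doc8"), (6.1, "s3-doc6")],
]
print(merge_results(server_results, k=3))
# [(13.0, 's3-doc5'), (12.1, 's1-doc4'), (11.4, 's2-doc3')]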
5.7.6 Caching We saw in Chapter 4 how word frequencies in text follow a Zipfian distribution: a few words occur very often, but a huge number of words occur very infrequently. It turns out that query distributions are similar. Some queries, such as those about popular celebrities or current events, tend to be very popular with public search engines. However, about half of the queries that a search engine receives each day are unique. This leads us into a discussion of caching . Broadly speaking, caching means storing something you might want to use later. With search engines, we usually</p> <p><span class="badge badge-info text-white mr-2">206</span> 182 5 Ranking with Indexes want to cache ranked result lists for queries, but systems can also cache inverted lists from disk. Caching is perfectly suited for search engines. Queries and ranked lists are small, meaning it doesn’t take much space in a cache to store them. By contrast, processing a query against a large corpus can be very computationally intensive. This means that once a ranked list is computed, it usually makes sense to keep it around. However, caching does not solve all of our performance problems, because about half of all queries received each day are unique. Therefore, the search en- gine itself must be built to handle query traffic very quickly. This leads to com- petition for resources between the search engine and the caching system. Recent research suggests that when memory space is tight, caching should focus on the most popular queries, leaving plenty of room to cache the index. Unique queries with multiple terms may still share a term and use the same inverted list. This explains why inverted list caching can have higher hit rates than query caching. Once the whole index is cached, all remaining resources can be directed toward caching query results. When using caching systems, it is important to guard against stale data. Cach- ing works because we assume that query results will not change over time, but eventually they do. Cache entries need acceptable timeouts that allow for fresh results. This is easier when dealing with partitioned indexes like the ones we dis- cussed in section 5.6.4. Each cache can be associated with a particular index par- tition, and when that partition is deleted, the cache can also be deleted. Keep in mind that a system that is built to handle a certain peak throughput with caching enabled will handle a much smaller throughput with caching off. This means that if your system ever needs to destroy its cache, be prepared to have a slow system until the cache becomes suitably populated. If possible, cache flushes should hap- pen at off-peak load times. References and Further Reading This chapter contains information about many topics: indexing, query process- ing, compression, index update, caching, and distribution just to name a few. All these topics are in one chapter to highlight how these components work together. Because of how interconnected these components are, it is useful to look at studies of real, working systems. Brin and Page (1998) wrote a paper about the early Google system that is an instructive overview of what it takes to build a fully</p> <p><span class="badge badge-info text-white mr-2">207</span> 5.7 Query Processing 183 working system. Later papers show how the Google architecture has changed over time—for example Barroso et al. (2003). 
The MapReduce paper, by Dean and Ghemawat (2008), gives more detail than this chapter does about how MapRe- duce was developed and how it works in practice. The inner workings of commercial search engines are often considered trade secrets, so the exact details of how they work is not often published. One im- portant exception is the TodoBR engine, a popular Brazilian web search engine. Before TodoBR was acquired by Google, their engineers frequently published papers about its workings. One example is their paper on a two-level caching scheme (Saraiva et al., 2001), but there are many others. Managing Gigabytes (Witten et al., 1999) is the standard reference The book for index construction, and is particularly detailed in its discussion of compres- sion techniques. Work on compression for inverted lists continues to be an active area of research. One of the recent highlights of this research is the PFOR se- ries of compressors from Zukowski et al. (2006), which exploit the performance characteristics of modern processors to make a scheme that is particularly fast for decompressing small integers. Büttcher and Clarke (2007) did a recent study on how compression schemes compare on the latest hardware. Zobel and Moffat (2006) wrote a review article that outlines all of the impor- tant recent research in inverted indexes, both in index construction and in query processing. This article is the best place to look for an understanding of how this research fits together. Turtle and Flood (1995) developed the MaxScore series of algorithms. Fagin et al. (2003) took a similar approach with score-sorted inputs, although they did not initially apply their ideas to information retrieval. Anh and Moffat (2006) refined these ideas to make a particularly efficient retrieval system. Anh and Moffat (2005) and Metzler et al. (2008) cover methods for comput- ing scores that can be stored in inverted lists. In particular, these papers describe how to compute scores that are both useful in retrieval and can be stored com- pactly in the list. Strohman (2007) explores the entire process of building scored indexes and processing queries efficiently with them. Many of the algorithms from this chapter are based on merging two sorted in- puts; index construction relies on this, as does any kind of document-at-a-time retrieval process. Knuth wrote an entire volume on just sorting and searching, which includes large amounts of material on merging, including disk-based merg- ing (Knuth, 1998). If the Knuth book is too daunting, any standard algorithms textbook should be able to give you more detail about how merging works.</p> <p><span class="badge badge-info text-white mr-2">208</span> 184 5 Ranking with Indexes Lester et al. (2005) developed the geometric partitioning method for index update. Büttcher et al. (2006) added some extensions to this model, focusing on how very common terms should be handled during update. Strohman and Croft (2006) show how to update the index without halting query processing. Exercises 5.1. Section 5.2 introduced an abstract model of ranking, where documents and queries are represented by features. What are some advantages of representing documents and queries by features? What are some disadvantages? R ( 5.2. ) , which com- Our model of ranking contains a ranking function Q, D pares each document with the query and computes a score. Those scores are then used to determine the final ranked list. 
An alternate ranking model might contain a different kind of ranking func- f ( A, B, Q ) , where A tion, B are two different documents in the collection and and is the query. When A should be ranked higher than B , f ( A, B, Q ) Q eval- uates to 1. When should be ranked below B , A ( A, B, Q ) evaluates to –1. f If you have a ranking function R ( Q, D ) , show how you can use it in a system that requires one of the form f A, B, Q ) . Why can you not go the other way (use ( ( Q, D ) in a system that requires R ( f ) )? A, B, Q Suppose you build a search engine that uses one hundred computers with a 5.3. million documents stored on each one, so that you can search a collection of 100 million documents. Would you prefer a ranking function like R ( Q, D ) or one like f ( A, B, Q ) (from the previous problem). Why? 5.4. Suppose your search engine has just retrieved the top 50 documents from your collection based on scores from a ranking function R ( Q, D ) . Your user in- terface can show only 10 results, but you can pick any of the top 50 documents to show. Why might you choose to show the user something other than the top 10 documents from the retrieved document set? 5.5. Documents can easily contain thousands of non-zero features. Why is it im- portant that queries have only a few non-zero features? 5.6. Indexes are not necessary to search documents. Your web browser, for in- stance, has a Find function in it that searches text without using an index. When should you use an inverted index to search text? What are some advantages to using an inverted index? What are some disadvantages?</p> <p><span class="badge badge-info text-white mr-2">209</span> 5.7 Query Processing 185 Section 5.3 explains many different ways to store document information in 5.7. inverted lists. What kind of inverted lists might you build if you needed a very small index? What kind would you build if you needed to find mentions of cities, such as Kansas City or São Paulo? 5.8. Write a program that can build a simple inverted index of a set of text docu- ments. Each inverted list will contain the file names of the documents that contain that word. A contains the text “the quick brown fox”, and file B Suppose the file contains “the slow blue fox”. The output of your program would be: % ./your-program A B blue B brown A fox A B quick A slow B the A B 5.9. In section 5.4.1, we created an unambiguous compression scheme for 2-bit binary numbers. Find a sequence of numbers that takes up more space when it is “compressed” using our scheme than when it is “uncompressed.” 5.10. Suppose a company develops a new unambiguous lossless compression SuperShrink . Its developers claim that it will re- scheme for 2-bit numbers called duce the size of any sequence of 2-bit numbers by at least 1 bit. Prove that the developers are lying. More specifically, prove that either: • SuperShrink never uses less space than an uncompressed encoding, or • There is an input to SuperShrink such that the compressed version is larger than the uncompressed input You can assume that each 2-bit input number is encoded separately. 5.11. Why do we need to know something about the kind of data we will com- press before choosing a compression algorithm? Focus specifically on the result from Exercise 5.10. 5.12. Develop an encoder for the Elias- γ code. Verify that your program produces the same codes as in Table 5.2.</p> <p><span class="badge badge-info text-white mr-2">210</span> 186 5 Ranking with Indexes Identify the optimal skip distance k 5.13. 
when performing a two-term Boolean AND query where one term occurs 1 million times and the other term appears 100 million times. Assume that a linear search will be used once an appropriate region is found to search in. In section 5.7.3, we saw that the optimal skip distance c can be determined 5.14. kn / c + pc /2 , where by minimizing the quantity is the skip pointer length, n k is the total inverted list size, is the skip interval, and p is the number of postings c to find. Plot this function using k n = 1,000,000, and p = 1,000, but varying c . = 4, Then, plot the same function, but set p = 10,000. Notice how the optimal value for c changes. kn / c Finally, take the derivative of the function pc /2 in terms of c to find + the optimum value for c for a given set of other parameters ( k , n , and p ). 5.15. In Chapter 4, you learned about Zipf ’s law, and how approximately 50% of words found in a collection of documents will occur only once. Your job is to design a program that will verify Zipf ’s law using MapReduce. Your program will output a list of number pairs, like this: 195840,1 70944,2 34039,3 ... 1,333807 This sample output indicates that 195,840 words appeared once in the collection, 70,944 appeared twice, and 34,039 appeared three times, but one word appeared 333,807 times. Your program will print this kind of list for a document collection. Your program will use MapReduce twice (two Map phases and two Reduce phases) to produce this output. 5.16. Write the program described in Exercise 5.15 using the Galago search toolkit. Verify that it works by indexing the Wikipedia collection provided on the book website.</p> <p><span class="badge badge-info text-white mr-2">211</span> 6 Queries and Interfaces “This is Information Retrieval, not Information Dispersal.” Brazil Jack Lint, 6.1 Information Needs and Queries Although the index structures and ranking algorithms are key components of a search engine, from the user’s point of view the search engine is primarily an in- terface for specifying queries and examining results. People can’t change the way the ranking algorithm works, but they can interact with the system during query formulation and reformulation, and while they are browsing the results. These in- teractions are a crucial part of the process of information retrieval, and can deter- mine whether the search engine is viewed as providing an effective service. In this chapter, we discuss techniques for query transformation and refinement , and for assembling and displaying the search results . We also discuss cross-language search engines here because they rely heavily on the transformation of queries and results. In Chapter 1, we described an information need as the motivation for a person using a search engine. There are many types of information needs, and researchers have categorized them using dimensions such as the number of relevant docu- ments being sought, the type of information that is needed, and the tasks that led to the requirement for information. It has also been pointed out that in some cases it can be difficult for people to define exactly what their information need 1 From the point of view is, because that information is a gap in their knowledge. of the search engine designer, there are two important consequences of these ob- servations about information needs: • Queries can represent very different information needs and may require dif- ferent search techniques and ranking algorithms to produce the best rankings. 
1 This is Belkin’s well-known Anomalous State of Knowledge (ASK) hypothesis (Belkin et al., 1982/1997).</p> <p><span class="badge badge-info text-white mr-2">212</span> 188 6 Queries and Interfaces • A query can be a poor representation of the information need. This can happen because the user finds it difficult to express the information need. More often, however, it happens because the user is encouraged to enter short queries, both by the search engine interface and by the fact that long queries often fail. The first point is discussed further in Chapter 7. The second point is a major theme in this chapter. We present techniques—such as , query expan- spelling correction relevance feedback —that are designed to refine the query, either auto- sion , and matically or through user interaction. The goal of this refinement process is to produce a query that is a better representation of the information need, and con- sequently to retrieve better documents. On the output side, the way that results are displayed is an important part of helping the user understand whether his in- snippet generation formation need has been met. We discuss techniques such as , , and document highlighting result clustering , that are designed to help this process of understanding the results. Short queries consisting of a small number of keywords (between two and three on average in most studies of web search) are by far the most popular form of query currently used in search engines. Given that such short queries can be am- 2 biguous and imprecise, why don’t people use longer queries? There are a number querylanguages of reasons for this. In the past, for search engines were designed to be used by expert users, or search intermediaries . They were called intermediaries because they acted as the interface between the person looking for information and the search engine. These query languages were quite complex. For example, here is a query made up by an intermediary for a search engine that provides legal information: User query: Are there any cases that discuss negligent maintenance or fail- ure to maintain aids to navigation such as lights, buoys, or channel mark- ers? Intermediaryquery: NEGLECT! FAIL! NEGLIG! /5 MAINT! REPAIR! /P NAV- IGAT! /5 AID EQUIP! LIGHT BUOY ”CHANNEL MARKER” wildcard operators and various forms of proximity op- This query language uses erators to specify the information need. A wildcard operator is used to define the minimum string match required for a word to match the query. For example, NE- GLECT! will match “neglected”, “neglects”, or just “neglect”. A proximity operator 2 Would you go up to a person and say, “Tropical fish?”, or even worse, “Fish?”, if you wanted to ask what types of tropical fish were easiest to care for?</p> <p><span class="badge badge-info text-white mr-2">213</span> 6.1 Information Needs and Queries 189 is used to define constraints on the distance between words that are required for them to match the query. One type of proximity constraint is adjacency. For ex- ample, the quotes around ”CHANNEL MARKER” specify that the two words must occur next to each other. The more general window operator specifies a width (in /5 specifies words) of a text window that is allowed for the match. For example, that the words must occur within five words of each other. Other typical prox- /P specifies imity operators are sentence and paragraph proximity. For example, that the words must occur in the same paragraph. 
In this query language, if no constraint is specified, it is assumed to be a Boolean OR . Some of these query language operators are still available in search engine in- terfaces, such as using quotes for a phrase or a “+” to indicate a mandatory term, but in general there is an emphasis on simple keyword queries (sometimes called “natural language” queries) in order to make it possible for most people to do 3 their own searches. But if we want to make querying as natural as possible, why not encourage people to type in better descriptions of what they are looking for instead of just a couple of keywords? Indeed, in applications where people expect other people to answer their questions, such as the community-based question answering systems described in section 10.3, the average query length goes up to around 30 words. The problem is that current search technology does not do a good job with long queries. Most web search engines, for example, only rank doc- uments that contain all the query terms. If a person enters a query with 30 words in it, the most likely result is that nothing will be found. Even if documents con- taining all the words can be found, the subtle distinctions of language used in a long, grammatically correct query will often be lost in the results. Search engines use ranking algorithms based primarily on a statistical view of text as a collection of words, not on syntactic and semantic features. Given what happens to long queries, people have quickly learned that they will get the most reliable results by thinking of a few keywords that are likely to be as- sociated with the information they are looking for, and using these as the query. This places quite a burden on the user, and the query refinement techniques de- scribed here are designed to reduce this burden and compensate for poor queries. 3 Note that the search engine may still be using a complex query language (such as that described in section 7.4.2) internally, but not in the interface.</p> <p><span class="badge badge-info text-white mr-2">214</span> 190 6 Queries and Interfaces 6.2 Query Transformation and Refinement 6.2.1 Stopping and Stemming Revisited As mentioned in the last section, the most common form of query used in cur- rent search engines consists of a small number of keywords. Some of these queries use quotes to indicate a phrase, or a “+” to indicate that a word must be present, but for the remainder of this chapter we will make the simplifying assumption 4 that the query is simply text. The initial stages of processing a text query should mirror the processing steps that are used for documents. Words in the query text must be transformed into the same terms that were produced by document texts, or there will be errors in the ranking. This sounds obvious, but it has been a source of problems in a number of search projects. Despite this restriction, there is scope for some useful differences between query and document transformation, par- ticularly in stopping and stemming. Other steps, such as parsing the structure or tokenizing, will either not be needed (keyword queries have no structure) or will be essentially the same. We mentioned in section 4.3.3 that stopword removal can be done at query time instead of during document indexing. Retaining the stopwords in the in- dexes increases the flexibility of the system to deal with queries that contain stop- words. 
Stopwords can be treated as normal words (by leaving them in the query), removed, or removed except under certain conditions (such as being used with quote or “+” operators). Query-basedstemming is another technique for increasing the flexibility of the search engine. If the words in documents are stemmed during indexing, the words in the queries must also be stemmed. There are circumstances, however, where stemming the query words will reduce the accuracy of the results. The query “fish village” will, for example, produce very different results from the query “fishing village”, but many stemming algorithms would reduce “fishing” to “fish”. By not stemming during document indexing, we are able to make the decision at query time whether or not to stem “fishing”. This decision could be based on a number of factors, such as whether the word was part of a quoted phrase. For query-based stemming to work, we must expand the query using the ap- propriate word variants, rather than reducing the query word to a word stem. This is because documents have not been stemmed. If the query word “fishing” was re- 4 Based on a recent sample of web queries, about 1.5% of queries used quotes, and less than 0.5% used a “+” operator.</p> <p><span class="badge badge-info text-white mr-2">215</span> 6.2 Query Transformation and Refinement 191 placed with the stem “fish”, the query would no longer match documents that contained “fishing”. Instead, the query should be expanded to include the word “fish”. This expansion is done by the system (not the user) using some form of synonym operator, such as that described in section 7.4.2. Alternatively, we could index the documents using stems aswellas words. This will make query execution more efficient, but increases the size of the indexes. stem classes . A stem class is the Every stemming algorithm implicitly generates group of words that will be transformed into the same stem by the stemming al- gorithm. They are created by simply running the stemming algorithm on a large collection of text and recording which words map to a given stem. Stem classes can be quite large. For example, here are three stem classes created with the Porter Stemmer on TREC news collections (the first entry in each list is the stem): /bank banked banking bankings banks /ocean oceaneering oceanic oceanics oceanization oceans /polic polical polically police policeable policed -policement policer policers polices policial -policically policier policiers policies policing -policization policize policly policy policying policys These classes are not only long (the “polic” class has 22 entries), but they also contain a number of errors. The words relating to “police” and “policy” should not be in the same class, and this will cause a loss in ranking accuracy. Other words are not errors, but may be used in different contexts. For example, “banked” is more often used in discussions of flying and pool, but this stem class will add words that are more common in financial discussions. The length of the lists is an issue if the stem classes are used to expand the query. Adding 22 words to a simple query will certainly negatively impact response time and, if not done properly using a synonym operator, could cause the search to fail. Both of these issues can be addressed using an analysis of word co-occurrence in the collection of text. The assumption behind this analysis is that word variants that could be substitutes for each other should co-occur often in documents. 
More specifically, we do the following steps: 1. For all pairs of words in the stem classes, count how often they co-occur in text windows of W words. W is typically in the range 50–100. 2. Compute a co-occurrence or association metric for each pair. This measures how strong the association is between the words.</p> <p><span class="badge badge-info text-white mr-2">216</span> 192 6 Queries and Interfaces 3. Construct a graph where the vertices represent words and the edges are be- tween words whose co-occurrence metric is above a threshold T . of this graph. These are the new stem classes. connected components 4. Find the used in TREC experiments was based on Dice’s term association measure The coefficient . This measure has been used since the earliest studies of term similarity and automatic thesaurus construction in the 1960s and 1970s. If n is the num- a a ber of windows (or documents) containing word n is the number of windows , b and b n , is the number of windows containing both words containing word , b a ab and N is the number of text windows in the collection, then Dice’s coefficient is defined as 2 · n . This is simply the proportion of term occurrences /( n ) + n b a ab that are co-occurrences. There are other possible association measures, which will be discussed later in section 6.2.3. Two vertices are in the same connected component of a graph if there is a path between them. In the case of the graph representing word associations, the con- clusters or groups of words, where each word has an nected components will be T with at least one other member of the cluster. association above the threshold The parameter T is set empirically. We will discuss this and other clustering tech- niques in section 9.2. Applying this technique to the three example stem classes, and using TREC data to do the co-occurrence analysis, results in the following connected compo- nents: /policies policy /police policed policing /bank banking banks The new stem classes are smaller, and the inappropriate groupings (e.g., pol- icy/police) have been split up. In general, experiments show that this technique produces good ranking effectiveness with a moderate level of query expansion. What about the “fishing village” query? The relevant stem class produced by the co-occurrence analysis is /fish fished fishing which means that we have not solved that problem. As mentioned before, the query context determines whether stemming is appropriate. It would be reason- able to expand the query “fishing in Alaska” with the words “fish” and “fished”, but not the query “fishing village”. The co-occurrence analysis described earlier</p> <p><span class="badge badge-info text-white mr-2">217</span> 6.2 Query Transformation and Refinement 193 uses context in a general way, but not at the level of co-occurrence with specific query words. With the recent availability of large query logs in applications such as web search, the concept of validating or even generating stem classes through statistical analysis can be extended to these resources. In this case, the analysis would look for word variants that tended to co-occur with the same words in queries. This could be a solution to the fish/fishing problem, in that “fish” is unlikely to co-occur with “village” in queries. Comparing this stemming technique to those described in section 4.3.4, it could be described as a dictionary-based approach, where the dictionary is gen- erated automatically based on input from an algorithmic stemmer (i.e., the stem classes). 
This technique can also be used for stemming with languages that do not have algorithmic stemmers available. In that case, the stem classes are based on very simple criteria, such as grouping all words that have similar n-grams. A simple example would be to generate classes from words that have the same first three characters. These initial classes are much larger than those generated by an algorithmic stemmer, but the co-occurrence analysis reduces the final classes to similar sizes. Retrieval experiments confirm that typically there is little difference in ranking effectiveness between an algorithmic stemmer and a stemmer based on n-gram classes.

6.2.2 Spell Checking and Suggestions

Spell checking is an extremely important part of query processing. Approximately 10–15% of queries submitted to web search engines contain spelling errors, and people have come to rely on the "Did you mean: ..." feature to correct these errors. Query logs contain plenty of examples of simple errors such as the following (taken from a recent sample of web queries):

poiner sisters
brimingham news
catamarn sailing
hair extenssions
marshmellow world
miniture golf courses
psyhics
home doceration

These errors are similar to those that may be found in a word processing document. In addition, however, there will be many queries containing words related to websites, products, companies, and people that are unlikely to be found in any standard spelling dictionary. Some examples from the same query log are:

realstateisting.bc.com
akia 1080i manunal
ultimatwarcade
mainscourcebank
dellottitouche

The wide variety in the type and severity of possible spelling errors in queries presents a significant challenge. In order to discuss which spelling correction techniques are the most effective for search engine queries, we first have to review how spelling correction is done for general text. The basic approach used in many spelling checkers is to suggest corrections for words that are not found in the spelling dictionary. Suggestions are found by comparing the word that was not found in the dictionary to words that are in the dictionary using a similarity measure. The most common measure for comparing words (or more generally, strings) is the edit distance, which is the number of operations required to transform one of the words into the other. The Damerau-Levenshtein distance metric counts the minimum number of insertions, deletions, substitutions, or transpositions of single characters required to do the transformation. Studies have shown that 80% or more of spelling errors are caused by an instance of one of these types of single-character errors.
As an example, the following transformations (shown with the type of error involved) all have Damerau-Levenshtein distance 1, since only a single operation or edit is required to produce the correct word:

extenssions → extensions (insertion error)
poiner → pointer (deletion error)
marshmellow → marshmallow (substitution error)
brimingham → birmingham (transposition error)

The transformation doceration → decoration, on the other hand, has edit distance 2 since it requires two edit operations:

doceration → deceration
deceration → decoration

5 The Levenshtein distance is similar but does not include transposition as a basic operation.

A variety of techniques and data structures have been used to speed up the calculation of edit distances between the misspelled word and the words in the dictionary. These include restricting the comparison to words that start with the same letter (since spelling errors rarely change the first letter), words that are of the same or similar length (since spelling errors rarely change the length of the word), and words that sound the same. In the latter case, phonetic rules are used to map words to codes. Words with the same codes are considered as possible corrections. The Soundex code is a simple type of phonetic encoding that was originally used for the problem of matching names in medical records. The rules for this encoding are:

1. Keep the first letter (in uppercase).
2. Replace these letters with hyphens: a, e, i, o, u, y, h, w.
3. Replace the other letters by numbers as follows:
   1: b, f, p, v
   2: c, g, j, k, q, s, x, z
   3: d, t
   4: l
   5: m, n
   6: r
4. Delete adjacent repeats of a number.
5. Delete the hyphens.
6. Keep the first three numbers or pad out with zeros.

Some examples of this code are:

extenssions → E235; extensions → E235
marshmellow → M625; marshmallow → M625
brimingham → B655; birmingham → B655
poiner → P560; pointer → P536

The last example shows that the correct word may not always have the same Soundex code. More elaborate phonetic encodings have been developed specifically for spelling correction (e.g., the GNU Aspell checker uses a phonetic code). These encodings can be designed so that the edit distance for the codes can be used to narrow the search for corrections.

6 A word that is pronounced the same as another word but differs in meaning is called a homophone.
7 http://aspell.net/

A given spelling error may have many possible corrections. For example, the spelling error "lawers" has the following possible corrections (among others) at edit distance 1: lawers → lowers, lawyers, layers, lasers, lagers. The spelling corrector has to decide whether to present all of these to the user, and in what order to present them. A typical policy would be to present them in decreasing order of their frequency in the language. Note that this process ignores the context of the spelling error. For example, if the error occurred in the query "trial lawers", this would have no impact on the presentation order of the suggested corrections. The lack of context in the spelling correction process also means that errors involving valid words will be missed. For example, the query "miniature golf curses" is clearly an example of a single-character deletion error, but this error has produced the valid word "curses" and so would not be detected.
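The six Soundex rules translate directly into code. The sketch below is illustrative rather than a production phonetic encoder; it assumes its input is a single lowercase alphabetic word, and it reproduces the examples in the text.

```java
// A direct transcription of the six Soundex rules listed above.
public class Soundex {
    static String encode(String word) {
        String w = word.toLowerCase();
        StringBuilder code = new StringBuilder();
        code.append(Character.toUpperCase(w.charAt(0)));         // rule 1: keep first letter
        StringBuilder digits = new StringBuilder();
        for (int i = 1; i < w.length(); i++) {
            char c = w.charAt(i);
            char d;
            if ("aeiouyhw".indexOf(c) >= 0) d = '-';              // rule 2: replace with hyphen
            else if ("bfpv".indexOf(c) >= 0) d = '1';             // rule 3: letter-to-number map
            else if ("cgjkqsxz".indexOf(c) >= 0) d = '2';
            else if ("dt".indexOf(c) >= 0) d = '3';
            else if (c == 'l') d = '4';
            else if ("mn".indexOf(c) >= 0) d = '5';
            else if (c == 'r') d = '6';
            else continue;                                        // ignore non-alphabetic input
            digits.append(d);
        }
        StringBuilder dedup = new StringBuilder();                // rule 4: delete adjacent repeats
        for (int i = 0; i < digits.length(); i++)
            if (i == 0 || digits.charAt(i) != digits.charAt(i - 1))
                dedup.append(digits.charAt(i));
        String numbers = dedup.toString().replace("-", "");       // rule 5: delete the hyphens
        return code.append((numbers + "000").substring(0, 3))     // rule 6: three numbers, zero-padded
                   .toString();
    }

    public static void main(String[] args) {
        System.out.println(encode("extenssions")); // E235
        System.out.println(encode("poiner"));      // P560
        System.out.println(encode("pointer"));     // P536
    }
}
```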
The typical interface for the "Did you mean:..." feature requires the spell checker to produce the single best suggestion. This means that ranking the suggestions using context and frequency information is very important for query spell checking compared to spell checking in a word processor, where suggestions can be made available on a pull-down list. In addition, queries contain a large number of run-on errors, where word boundaries are skipped or mistyped. The two queries "ultimatwarcade" and "mainscourcebank" are examples of run-on errors that also contain single-character errors. With the appropriate framework, leaving out a separator such as a blank can be treated as just another class of single-character error.

The noisy channel model for spelling correction is a general framework that can address the issues of ranking, context, and run-on errors. The model is called a "noisy channel" because it is based on Shannon's theory of communication (Shannon & Weaver, 1963). The intuition is that a person chooses a word w to output (i.e., write), based on a probability distribution P(w). The person then tries to write the word w, but the noisy channel (presumably the person's brain) causes the person to write the word e instead, with probability P(e|w).

The probabilities P(w), called the language model, capture information about the frequency of occurrence of a word in text (e.g., what is the probability of the word "lawyer" occurring in a document or query?), and contextual information such as the probability of observing a word given that another word has just been observed (e.g., what is the probability of "lawyer" following the word "trial"?). We will have more to say about language models in Chapter 7, but for now we can just assume it is a description of word occurrences in terms of probabilities.

The probabilities P(e|w), called the error model, represent information about the frequency of different types of spelling errors. The probabilities for words (or strings) that are edit distance 1 away from the word w will be quite high, for example. Words with higher edit distances will generally have lower probabilities, although homophones will have high probabilities. Note that the error model will have probabilities for writing the correct word (P(w|w)) as well as probabilities for spelling errors. This enables the spelling corrector to suggest a correction for all words, even if the original word was correctly spelled. If the highest-probability correction is the same word, then no correction is suggested to the user. If, however, the context (i.e., the language model) suggests that another word may be more appropriate, then it can be suggested. This, in broad terms, is how a spelling corrector can suggest "course" instead of "curse" for the query "golf curse".

So how do we estimate the probability of a correction? What the person writes is e, so we need to calculate P(w|e), which is the probability that the correct word is w given that we can see the person wrote e. If we are interested in finding the correction with the maximum value of this probability, or if we just want to rank the corrections, it turns out we can use P(e|w)P(w), which is the product of the error model probability and the language model probability.

8 Bayes' Rule, which is discussed in Chapter 7, is used to express P(w|e) in terms of the component probabilities.
To handle run-on errors and context, the language model needs to have information about pairs of words in addition to single words. The language model probability for a word is then calculated as a mixture of the probability that the word occurs in text and the probability that it occurs following the previous word, or

λP(w) + (1 − λ)P(w|w_p)

where λ is a parameter that specifies the relative importance of the two probabilities, and P(w|w_p) is the probability of a word w following the previous word w_p. As an example, consider the spelling error "fish tink". To rank the possible corrections for "tink", we multiply the error model probabilities for possible corrections by the language model probabilities for those corrections. The words "tank" and "think" will both have high error-model probabilities since they require only a single character correction. In addition, both words will have similar probabilities for P(w) since both are quite common. The probability P(tank|fish), however, will be much higher than P(think|fish), and this will result in "tank" being a more likely correction than "think".

Where does the information for the language model probabilities come from? In many applications, the best source for statistics about word occurrence in text will be the collection of documents that is being searched. In the case of web search (and some other applications), there will also be a query log containing millions of queries submitted to the search engine. Since our task is to correct the spelling of queries, the query log is likely to be the best source of information. It also reduces the number of pairs of words that need to be recorded in the language model, compared to analyzing all possible pairs in a large collection of documents. In addition to these sources, if a trusted dictionary is available for the application, it should be used.

The estimation of the P(e|w) probabilities in the error model can be relatively simple or quite complex. The simple approach is to assume that all errors with the same edit distance have equal probability. Additionally, only strings within a certain edit distance (usually 1 or 2) are considered. More sophisticated approaches have been suggested that base the probability estimates on the likelihood of making certain types of errors, such as typing an "a" when the intention was to type an "e". These estimates are derived from large collections of text by finding many pairs of correctly and incorrectly spelled words.

Cucerzan and Brill (2004) describe an iterative process for spell checking queries using information from a query log and dictionary. The steps, in simplified form, are as follows:

1. Tokenize the query.
2. For each token, a set of alternative words and pairs of words is found using an edit distance modified by weighting certain types of errors, as described earlier. The data structure that is searched for the alternatives contains words and pairs from both the query log and the trusted dictionary.
3. The noisy channel model is then used to select the best correction.
4. The process of looking for alternatives and finding the best correction is repeated until no better correction is found.
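The scoring at the heart of step 3 can be sketched as follows. This is a minimal illustration of ranking candidate corrections by P(e|w) multiplied by the interpolated language model λP(w) + (1 − λ)P(w|w_p); it assumes the candidate set, the unigram and bigram probabilities, and the error model probabilities have already been estimated (for example, from a query log). The class and map names are assumptions for illustration, not Galago or the Cucerzan and Brill system.

```java
import java.util.*;

// Sketch of noisy channel scoring for one query token.
public class NoisyChannelScorer {

    Map<String, Double> unigram;   // P(w)
    Map<String, Double> bigram;    // P(w | previous), keyed as "previous w"
    double lambda = 0.5;           // relative weight of the unigram probability

    NoisyChannelScorer(Map<String, Double> unigram, Map<String, Double> bigram) {
        this.unigram = unigram;
        this.bigram = bigram;
    }

    // Interpolated language model: lambda * P(w) + (1 - lambda) * P(w | previous)
    double languageModel(String w, String previous) {
        double pw = unigram.getOrDefault(w, 0.0);
        double pwGivenPrev = bigram.getOrDefault(previous + " " + w, 0.0);
        return lambda * pw + (1 - lambda) * pwGivenPrev;
    }

    // Rank candidate corrections by error model probability times language model
    // probability; errorModel maps each candidate w to P(e | w) for the observed
    // (possibly misspelled) token e.
    List<String> rank(String previous, Map<String, Double> errorModel, Set<String> candidates) {
        List<String> ranked = new ArrayList<>(candidates);
        ranked.sort(Comparator.comparingDouble(
                (String w) -> errorModel.getOrDefault(w, 0.0) * languageModel(w, previous))
                .reversed());
        return ranked;
    }
}
```

For the query "fish tink", the candidates for "tink" would include "tank" and "think", and it is the bigram probability P(tank|fish) that pushes "tank" to the top of the ranked list.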
By having multiple iterations, the spelling corrector can potentially make suggestions that are quite far (in terms of edit distance) from the original query. As an example, given the query "miniture golfcurses", the spelling corrector would go through the following iterations:

miniture golfcurses
miniature golfcourses
miniature golf courses

Experiments with this spelling corrector show that the language model from the query log is the most important component in terms of correction accuracy. In addition, using context in the form of word pairs in the language model is critical. Having at least two iterations in the correction process also makes a significant difference. The error model was less important, however, and using a simple model where all errors have the same probability was nearly as effective as the more sophisticated model. Other studies have shown that the error model is more important when the language model is based just on the collection of documents, rather than the query log.

The best approach to building a spelling corrector for queries in a search application obviously depends on the data available. If a large amount of query log information is available, then this should be incorporated. Otherwise, the sources available will be the collection of documents for the application and, in some cases, a trusted dictionary. One approach would be to use a general-purpose spelling corrector, such as Aspell, and create an application-specific dictionary. Building a spelling corrector based on the noisy channel model, however, is likely to be more effective and more adaptable, even if query log data is not available.

6.2.3 Query Expansion

In the early development of search engines, starting in the 1960s, an online thesaurus was considered an essential tool for the users of the system. The thesaurus described the indexing vocabulary that had been used for the document collection, and included information about synonyms and related words or phrases. This was particularly important because the collection had typically been manually indexed (tagged) using the terms in the thesaurus. Because the terms in the thesaurus were carefully chosen and subject to quality control, the thesaurus was also referred to as a controlled vocabulary. Using the thesaurus, users could determine what words and phrases could be used in queries, and could expand an initial query using synonyms and related words. Table 6.1 shows part of an entry in the Medical Subject Headings (MeSH) thesaurus that is used in the National Library of Medicine search applications. The "tree number" entries indicate, using a numbering scheme, where this term is found in the tree of broader and narrower terms. An "entry term" is a synonym or related phrase for the term.

9 In information retrieval, indexing is often used to refer to the process of representing a document using an index term, in addition to the process of creating the indexes to search the collection. More recently, the process of manual indexing has been called tagging, particularly in the context of social search applications (see Chapter 10).
MeSH Heading: Neck Pain
Tree Number: C10.597.617.576
Tree Number: C23.888.592.612.553
Tree Number: C23.888.646.501
Entry Term: Cervical Pain
Entry Term: Neckache
Entry Term: Anterior Cervical Pain
Entry Term: Anterior Neck Pain
Entry Term: Cervicalgia
Entry Term: Cervicodynia
Entry Term: Neck Ache
Entry Term: Posterior Cervical Pain
Entry Term: Posterior Neck Pain

Table 6.1. Partial entry for the Medical Subject Heading "Neck Pain"

10 http://www.nlm.nih.gov/mesh/meshhome.html
11 http://wordnet.princeton.edu/

Although the use of an explicit thesaurus is less common in current search applications, a number of techniques have been proposed for automatic and semi-automatic query expansion. A semi-automatic technique requires user interaction, such as selecting the expansion terms from a list suggested by the expansion technique. Web search engines, for example, provide query suggestions to the user in the form of the original query words expanded with one or more additional words, or replaced with alternative words.

Query expansion techniques are usually based on an analysis of word or term co-occurrence, in either the entire document collection, a large collection of queries, or the top-ranked documents in a result list. From this perspective, query-based stemming can also be regarded as a query expansion technique, with the expansion terms limited to word variants. Automatic expansion techniques that use a general thesaurus, such as Wordnet, have not been shown to be effective. The key to effective expansion is to choose words that are appropriate for the context, or topic, of the query. For example, "aquarium" may be a good expansion term for "tank" in the query "tropical fish tanks", but not appropriate for the query "armor for tanks". A general thesaurus lists related terms for many different contexts, which is why it is difficult to use automatically. The techniques we will describe use a variety of approaches to address this problem, such as using all the words in a query to find related words rather than expanding each word separately. One well-known expansion technique, called pseudo-relevance feedback, is discussed in the next section along with techniques that are based on user feedback about the relevance of documents in the results list.

Term association measures are an important part of many approaches to query expansion, and consequently a number of alternatives have been suggested. One of these, Dice's coefficient, was mentioned in section 6.2.1. The formula for this measure is 2·n_ab/(n_a + n_b), which is rank equivalent to n_ab/(n_a + n_b), where rank equivalent means that the formula produces the same ranking of terms.

Another measure, mutual information, has been used in a number of studies of word collocation. For two words (or terms) a and b, it is defined as

log ( P(a,b) / (P(a)·P(b)) )

and measures the extent to which the words occur independently. P(a) is the probability that word a occurs in a text window of a given size, P(b) is the probability that word b occurs in a text window, and P(a,b) is the probability that a and b occur in the same text window. If the occurrences of the words are independent, P(a,b) = P(a)P(b) and the mutual information will be 0. As an example, we might expect that the two words "fishing" and "automobile" would occur relatively independently of one another.

12 More formally, two functions are defined to be rank equivalent if they produce the same ordering of items when sorted according to function value. Monotonic transforms (such as log), scaling (multiplying by a constant), and translation (adding a constant) are all examples of rank-preserving operations.
13 This is actually the pointwise mutual information measure, just to be completely accurate.
If two words tend to co-occur, for example "fishing" and "boat", P(a,b) will be greater than P(a)P(b) and the mutual information will be higher.

To calculate mutual information, we use the following simple normalized frequency estimates for the probabilities: P(a) = n_a/N, P(b) = n_b/N, and P(a,b) = n_ab/N, where n_a is the number of windows (or documents) containing word a, n_b is the number of windows containing word b, n_ab is the number of windows containing both words a and b, and N is the number of text windows in the collection. This gives the formula:

log ( P(a,b) / (P(a)·P(b)) ) = log ( N·n_ab / (n_a·n_b) )

which is rank equivalent to n_ab/(n_a·n_b).

A problem that has been observed with this measure is that it tends to favor low-frequency terms. For example, consider two words with frequency 10 (i.e., n_a = n_b = 10) that co-occur half the time (n_ab = 5). The association measure for these two terms is 5 × 10⁻². For two terms with frequency 1,000 that co-occur half the time (n_ab = 500), the association measure is 5 × 10⁻⁴. The expected mutual information measure addresses this problem by weighting the mutual information value using the probability P(a,b). Although the expected mutual information in general is calculated over all combinations of the events of word occurrence and non-occurrence, we are primarily interested in the case where both terms occur, giving the formula:

P(a,b)·log ( P(a,b) / (P(a)·P(b)) ) = (n_ab/N)·log ( N·n_ab / (n_a·n_b) )

which is rank equivalent to n_ab·log ( N·n_ab / (n_a·n_b) ).

If we take the same example as before and assume N = 10⁶, this gives an association measure of 23.5 for the low-frequency terms, and 1,350 for the high-frequency terms, clearly favoring the latter case. In fact, the bias toward high-frequency terms can be a problem for this measure.

Another popular association measure that has been used in a variety of applications is Pearson's Chi-squared (χ²) measure. This measure compares the number of co-occurrences of two words with the expected number of co-occurrences if the two words were independent, and normalizes this comparison by the expected number. Using our estimates, this gives the formula:

( n_ab − N·(n_a/N)·(n_b/N) )² / ( N·(n_a/N)·(n_b/N) )

which is rank equivalent to ( n_ab − (1/N)·n_a·n_b )² / ( n_a·n_b ).

The term N·(n_a/N)·(n_b/N) = N·P(a)·P(b) is the expected number of co-occurrences if the two terms occur independently. The χ² test is usually calculated over all combinations of the events of word occurrence and non-occurrence, similar to the expected mutual information measure, but our focus is on the case where both terms co-occur. In fact, when N is large, this restricted form of χ² produces the same term rankings as the full form. It should also be noted that χ² is very similar to the mutual information measure and may be expected to favor low-frequency terms.
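The rank-equivalent forms of these measures are straightforward to compute once the window counts are available. The sketch below is illustrative only (it is not Galago code); it uses a base-10 logarithm, which reproduces the numbers in the example above, and since changing the base only scales the values it does not affect the ranking.

```java
// Rank-equivalent forms of the four association measures, using the window
// counts defined in the text: na, nb, nab, and the total number of windows N.
public class TermAssociation {

    static double mim(double nab, double na, double nb) {
        return nab / (na * nb);
    }

    static double emim(double nab, double na, double nb, double N) {
        return nab <= 0 ? 0.0 : nab * Math.log10(N * nab / (na * nb));
    }

    static double chiSquared(double nab, double na, double nb, double N) {
        double expected = na * nb / N;              // expected co-occurrences under independence
        // Dividing by na*nb rather than by the expected count gives the
        // rank-equivalent form used in the text.
        return Math.pow(nab - expected, 2) / (na * nb);
    }

    static double dice(double nab, double na, double nb) {
        return nab / (na + nb);
    }

    public static void main(String[] args) {
        double N = 1e6;                              // the example in the text assumes N = 10^6
        System.out.println(emim(5, 10, 10, N));      // roughly 23.5 (low-frequency terms)
        System.out.println(emim(500, 1000, 1000, N));// roughly 1,350 (high-frequency terms)
    }
}
```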
Table 6.2 summarizes the association measures we have discussed. To see how they work in practice, we have calculated the top-ranked terms in a TREC news collection using each measure for the words in the sample query "tropical fish".

Measure | Formula
Mutual information (MIM) | n_ab / (n_a·n_b)
Expected Mutual Information (EMIM) | n_ab · log( N·n_ab / (n_a·n_b) )
Chi-square (χ²) | ( n_ab − (1/N)·n_a·n_b )² / (n_a·n_b)
Dice's coefficient (Dice) | n_ab / (n_a + n_b)

Table 6.2. Term association measures

Table 6.3 shows the strongly associated words for "tropical" assuming an unlimited window size (in other words, co-occurrences are counted at the document level). There are two obvious features to note. The first is that the ranking for χ² is identical to the one for MIM. The second is that MIM and χ² favor low-frequency words, as expected. These words are not unreasonable ("itto", for example, is the International Tropical Timber Organization, and "xishuangbanna" is a Chinese tropical botanic garden), but they are so specialized that they are unlikely to be much use for many queries. The top terms for the EMIM and Dice measures are much more general and, in the case of EMIM, sometimes too general (e.g., "most").

MIM | EMIM | χ² | Dice
trmm | forest | trmm | forest
itto | tree | itto | exotic
ortuno | rain | ortuno | timber
kuroshio | island | kuroshio | rain
ivirgarzama | like | ivirgarzama | banana
biofunction | fish | biofunction | deforestation
kapiolani | most | kapiolani | plantation
bstilla | water | bstilla | coconut
almagreb | jungle | almagreb | fruit
jackfruit | area | jackfruit | tree
adeo | world | adeo | rainforest
xishuangbanna | america | xishuangbanna | palm
frangipani | some | frangipani | hardwood
yuca | live | yuca | greenhouse
anthurium | plant | anthurium | logging

Table 6.3. Most strongly associated words for "tropical" in a collection of TREC news stories. Co-occurrence counts are measured at the document level.

Table 6.4 shows the top-ranked words for "fish". Note that because this is a higher-frequency term, the rankings for MIM and χ² are no longer identical, although both still favor low-frequency terms. The top-ranked words for EMIM and Dice are quite similar, although in a different ordering.

MIM | EMIM | χ² | Dice
arlsq | water | zoologico | species
zapanta | species | happyman | wildlife
wrint | wildlife | outerlimit | fishery
sportk | fishery | wpfmc | water
lingcod | sea | weighout | fisherman
longfin | fisherman | waterdog | boat
bontadelli | boat | longfin | sea
veracruzana | area | sportfisher | habitat
ungutt | habitat | billfish | vessel
ulocentra | vessel | needlefish | marine
needlefish | marine | damaliscu | endanger
bontebok | land | tunaboat | conservation
taucher | river | tsolwana | river
olivacea | food | orangemouth | catch
motoroller | endanger | sheepshead | island

Table 6.4. Most strongly associated words for "fish" in a collection of TREC news stories. Co-occurrence counts are measured at the document level.

To show the effect of changing the window size, Table 6.5 gives the top-ranked words found using a window of five words. The small window size has an effect on the results, although both MIM and χ² still find low-frequency terms. The words for EMIM are somewhat improved, being more specific. Overall, it appears that the simple Dice's coefficient is the most stable and reliable over a range of window sizes.

MIM | EMIM | χ² | Dice
gefilte | wildlife | zapanta | wildlife
mbmo | vessel | plar | vessel
zapanta | boat | mbmo | boat
plar | fishery | gefilte | fishery
hapc | species | hapc | species
odfw | catch | odfw | tuna
southpoint | trout | southpoint | water
anadromous | fisherman | anadromous | sea
taiffe | meat | taiffe | salmon
mollie | interior | mollie | catch
frampton | nmf | frampton | fisherman
idfg | trawl | idfg | game
billingsgate | salmon | billingsgate | halibut
sealord | tuna | sealord | meat
longline | shellfish | longline | caught

Table 6.5. Most strongly associated words for "fish" in a collection of TREC news stories. Co-occurrence counts are measured in windows of five words.

The most significant feature of these tables, however, is that even the best rankings contain virtually nothing that would be useful to expand the query "tropical fish"! Instead, the words are associated with other contexts, such as tropical forests and fruits, or fishing conservation. One way to address this problem would be to find words that are strongly associated with the phrase "tropical fish". Using Dice's coefficient with the same collection of TREC documents as the previous tables, this produces the following 10 words at the top of the ranking:

goldfish, reptile, aquarium, coral, frog, exotic, stripe, regent, pet, wet

Clearly, this is doing much better at finding words associated with the right context. To use this technique, however, we would have to find associations for every group of words that could be used in a query. This is obviously impractical, but there are other approaches that accomplish the same thing. One alternative would be to analyze the word occurrences in the retrieved documents for a query. This is the basis of pseudo-relevance feedback, which is discussed in the next section. Another approach that has been suggested is to index every word in the collection by the words that co-occur with it, creating a virtual document representing that word. For example, the following list is the top 35 most strongly associated words for "aquarium" (using Dice's coefficient):

zoology, cranmore, jouett, zoo, goldfish, fish, cannery, urchin, reptile, coral, animal, mollusk, marine, underwater, plankton, mussel, oceanography, mammal, species, exhibit, swim, biologist, cabrillo, saltwater, creature, reef, whale, oceanic, scuba, kelp, invertebrate, park, crustacean, wild, tropical

These words would form the index terms for the document representing "aquarium". To find expansion words for a query, these virtual documents are ranked in the same way as regular documents, giving a ranking for the corresponding words. In our example, the document for "aquarium" contains the words "tropical" and "fish" with high weights, so it is likely that it would be highly ranked for the query "tropical fish". This means that "aquarium" would be a highly ranked expansion term. The document for a word such as "jungle", on the other hand, would contain "tropical" with a high weight but is unlikely to contain "fish". This document, and the corresponding word, would be much further down the ranking than "aquarium".

All of the techniques that rely on analyzing the document collection face both computational and accuracy challenges due to the huge size and variability in quality of the collections in search applications. At the start of this section, it was mentioned that instead of analyzing the document collection, either the result list or a large collection of queries could be used.

14 Sometimes called a context vector.
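Before turning to query logs, here is a sketch of how the virtual documents just described could be used at query time: each word's context vector (its most strongly associated words, weighted by, say, Dice scores) is scored against the query terms, and the best-scoring words become expansion candidates. The data structures, the additive scoring, and the names here are assumptions for illustration; a real system would build the vectors offline and rank them with its normal retrieval model.

```java
import java.util.*;
import java.util.stream.Collectors;

// Sketch of ranking "virtual documents" (context vectors) to find expansion terms.
public class VirtualDocumentExpansion {

    // contextVectors.get("aquarium") might map "tropical" and "fish" to high weights.
    Map<String, Map<String, Double>> contextVectors;

    VirtualDocumentExpansion(Map<String, Map<String, Double>> contextVectors) {
        this.contextVectors = contextVectors;
    }

    // Score each candidate word by summing the weights of the query terms in its
    // virtual document, and return the k best-scoring words.
    List<String> expansionTerms(List<String> queryTerms, int k) {
        Map<String, Double> scores = new HashMap<>();
        for (Map.Entry<String, Map<String, Double>> entry : contextVectors.entrySet()) {
            double score = 0.0;
            for (String q : queryTerms)
                score += entry.getValue().getOrDefault(q, 0.0);
            if (score > 0) scores.put(entry.getKey(), score);
        }
        return scores.entrySet().stream()
                .sorted(Map.Entry.<String, Double>comparingByValue().reversed())
                .limit(k)
                .map(Map.Entry::getKey)
                .collect(Collectors.toList());
    }
}
```

For the query "tropical fish", the virtual document for "aquarium" contains both query terms with high weights, so "aquarium" would be returned near the top, while "jungle" would fall much further down.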
Recent studies and experience indicate that a large query log is probably the best resource for query expansion. Not only do these logs contain many short pieces of text that are easier to analyze than full text documents, they also contain other data, such as information on which documents were clicked on during the search (i.e., clickthrough data). As an example of how the query log can be used for expansion, the following list shows the 10 most frequent words associated with queries that contain "tropical fish" in a recent query log sample from a popular web search engine:

stores, pictures, live, sale, types, clipart, blue, freshwater, aquarium, supplies

These words indicate the types of queries that people tend to submit about tropical fish (sales, supplies, pictures), and most would be good words to suggest for query expansion. In current systems, suggestions are usually made in the form of whole queries rather than expansion words, and here again the query log will be extremely useful in producing the best suggestions. For example, "tropical fish supplies" will be a much more common query than "supplies tropical fish" and would be a better suggestion for this expansion.

From this perspective, query expansion can be reformulated as a problem of finding similar queries, rather than expansion terms. Similar queries may not always contain the same words. For example, the query "pet fish sales" may be a reasonable suggestion as an alternative to "tropical fish", even though it doesn't contain the word "tropical". It has long been recognized that semantically similar queries can be found by grouping them based on the relevant documents they have in common, rather than just the words. Clickthrough data is very similar to relevance data, and recent studies have shown that queries can be successfully grouped or clustered based on the similarity of their clickthrough data. This means that every query is represented using the set of pages that are clicked on for that query, and the similarity between the queries is calculated using a measure such as Dice's coefficient, except that in this case n_ab will be the number of clicked-on pages the two queries have in common, and n_a, n_b are the number of pages clicked on for each query.

In summary, both automatic and semi-automatic query expansion methods have been proposed, although the default in many search applications is to suggest alternative queries to the user. Some term association measures are better than others, but term association based on single words does not produce good expansion terms, because it does not capture the context of the query. The best way to capture query context is to use a query log, both to analyze word associations and to find similar queries based on clickthrough data. If there is no query log available, the best alternative would be to use pseudo-relevance feedback, as described in the next section. Of the methods described for constructing an automatic thesaurus based on the document collection, the best alternative is to create virtual documents for each word and rank them for each query.

6.2.4 Relevance Feedback

Relevance feedback is a query expansion and refinement technique with a long history.
First proposed in the 1960s, it relies on user interaction to identify relevant documents in a ranking based on the initial query. Other semi-automatic techniques were discussed in the last section, but instead of choosing from lists of terms or alternative queries, in relevance feedback the user indicates which documents are interesting (i.e., relevant) and possibly which documents are completely off-topic (i.e., non-relevant). Based on this information, the system automatically reformulates the query by adding terms and reweighting the original terms, and a new ranking is generated using this modified query.

This process is a simple example of using machine learning in information retrieval, where training data (the identified relevant and non-relevant documents) is used to improve the system's performance. Modifying the query is in fact equivalent to learning a classifier that distinguishes between relevant and non-relevant documents. We discuss classification and classification techniques further in Chapters 7 and 9. Relative to many other applications of machine learning, however, the amount of training data generated in relevance feedback is extremely limited since it is based on the user's input for this query session only, and not on historical data such as clickthrough.

The specific method for modifying the query depends on the underlying retrieval model. In the next chapter, we describe how relevance feedback works in the vector space model and the probabilistic model. In general, however, words that occur more frequently in the relevant documents than in the non-relevant documents, or in the collection as a whole, are added to the query or increased in weight. The same general idea is used in the technique of pseudo-relevance feedback, where instead of asking the user to identify relevant documents, the system simply assumes that the top-ranked documents are relevant. Words that occur frequently in these documents may then be used to expand the initial query. Once again, the specifics of how this is done depend on the underlying retrieval model. We describe pseudo-relevance feedback based on the language model approach to retrieval in the next chapter. The expansion terms generated by pseudo-relevance feedback will depend on the whole query, since they are extracted from documents ranked highly for that query, but the quality of the expansion will be determined by how many of the top-ranked documents in the initial ranking are in fact relevant.

As a simple example of how this process works, consider the ranking shown in Figure 6.1, which was generated using a popular search engine with the query "tropical fish".

1. Badmans Tropical Fish
A freshwater aquarium page covering all aspects of the tropical fish hobby. ... to the world of aquariology with Badman's Tropical Fish. ...
2. Tropical Fish
Notes on a few species and a gallery of photos of African cichlids.
3. The Tropical Tank Homepage - Tropical Fish and Aquariums
Info on tropical fish and tropical aquariums, large species index with ... Here you will find lots of information on Tropical Fish and Aquariums. ...
4. Tropical Fish Centre
Offers a range of aquarium products, advice on choosing species, feeding, and health care, and a discussion board.
5. Tropical fish - Wikipedia, the free encyclopedia
Tropical fish are popular aquarium fish, due to their often bright coloration. ...
Practical Fishkeeping • Tropical Fish Hobbyist • Koi. Aquarium related companies: ...
6. Tropical Fish Find
Home page for Tropical Fish Internet Directory ... stores, forums, clubs, fish facts, tropical fish compatibility and aquarium ...
7. Breeding tropical fish
... intrested in keeping and/or breeding Tropical, Marine, Pond and Coldwater fish. ... Breeding Tropical Fish ... breeding tropical, marine, coldwater & pond fish. ...
8. FishLore
Includes tropical freshwater aquarium how-to guides, FAQs, fish profiles, articles, and forums.
9. Cathy's Tropical Fish Keeping
Information on setting up and maintaining a successful freshwater aquarium.
10. Tropical Fish Place
Tropical Fish information for your freshwater fish tank ... great amount of information about a great hobby, a freshwater tropical fish tank. ...

Fig. 6.1. Top ten results for the query "tropical fish"

To expand this query using pseudo-relevance feedback, we might assume that all these top 10 documents were relevant. By analyzing the full text of these documents, the most frequent terms, with their frequencies, can be identified as:

a (926), td (535), href (495), http (357), width (345), com (343), nbsp (316), www (260), tr (239), htm (233), class (225), jpg (221)

Clearly, these words are not appropriate to use as expansion terms, because they consist of stopwords and HTML expressions that will be common in the whole collection. In other words, they do not represent the topics covered in the top-ranked documents. A simple way to refine this process is to count words in the snippets of the documents and ignore stopwords. This analysis produces the following list of frequent words:

tropical (26), fish (28), aquarium (8), freshwater (5), breeding (4), information (3), species (3), tank (2), Badman's (2), page (2), hobby (2), forums (2)

These words are much better candidates for query expansion, and do not have the problem of inadequate context that occurs when we try to expand "tropical" and "fish" separately. If the user was, however, specifically interested in breeding tropical fish, the expansion terms could be improved using true relevance feedback, where the document ranked seventh would be explicitly tagged as relevant. In this case, the most frequent terms are:

breeding (4), fish (4), tropical (4), marine (2), pond (2), coldwater (2), keeping (1), interested (1)

The major effect of using this list would be to increase the weight of the expansion term "breeding". The specific weighting, as we have said, depends on the underlying retrieval model.
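The snippet-based refinement used in this example can be sketched in a few lines: count the terms in the snippets of the top-ranked results, drop stopwords, and keep the most frequent terms as candidate expansion words. The tokenizer and stopword list below are placeholders, not the ones an actual engine would use.

```java
import java.util.*;
import java.util.stream.Collectors;

// Sketch of collecting candidate expansion terms from result snippets.
public class SnippetExpansion {

    static List<String> candidateTerms(List<String> snippets, Set<String> stopwords, int k) {
        Map<String, Integer> counts = new HashMap<>();
        for (String snippet : snippets)
            for (String token : snippet.toLowerCase().split("\\W+"))  // crude tokenization
                if (!token.isEmpty() && !stopwords.contains(token))
                    counts.merge(token, 1, Integer::sum);
        return counts.entrySet().stream()                              // most frequent first
                .sorted(Map.Entry.<String, Integer>comparingByValue().reversed())
                .limit(k)
                .map(Map.Entry::getKey)
                .collect(Collectors.toList());
    }
}
```

Applied to the ten snippets in Figure 6.1, this kind of counting is what produces the "tropical", "fish", "aquarium", "freshwater", "breeding" list shown above.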
Both relevance feedback and pseudo-relevance feedback have been extensively investigated in the research literature, and have been shown to be effective techniques for improving ranking. They are, however, seldom incorporated into operational search applications. In the case of pseudo-relevance feedback, this appears to be primarily because the results of this automatic process can be unpredictable. If the initial ranking does not contain many relevant documents, the expansion terms found by pseudo-relevance feedback are unlikely to be helpful and, for some queries, can make the ranking significantly worse. To avoid this, the candidate expansion terms could be shown to the user, but studies have shown that this is not particularly effective. Suggesting alternative queries based on an analysis of query logs is a more reliable alternative for semi-automatic query expansion.

Relevance feedback, on the other hand, has been used in some applications, such as document filtering. Filtering involves tracking a person's interests over time, and some applications allow people to modify their profiles using relevance feedback. Another simple use of relevance feedback is the "more like this" feature in some early web search engines. This feature allowed users to click on a link associated with each document in a result list in order to generate a ranked list of other documents similar to the clicked-on document. The new ranked list of documents was based on a query formed by extracting and weighting important words from the clicked-on document. This is exactly the relevance feedback process, but limited to a single relevant document for training data.

Although these applications have had some success, the alternative approach of asking users to choose a different query from a list of suggested queries is currently more popular. There is no guarantee, of course, that the suggested queries will contain exactly what the user is looking for, and in that sense relevance feedback supports more precise query reformulation. There is an assumption, however, underlying the use of relevance feedback: that the user is looking for many relevant documents, not just the one or two that may be in the initial ranked list. For some queries, such as looking for background information on a topic, this may be true, but for many queries in the web environment, the user will be satisfied with the initial ranking and will not need relevance feedback. Lists of suggested queries will be helpful when the initial query fails, whereas relevance feedback is unlikely to help in that case.

6.2.5 Context and Personalization

One characteristic of most current search engines is that the results of a query will be the same regardless of who submitted the query, why the query was submitted, where the query was submitted, or what other queries were submitted in the same session. All that matters is what words were used to describe the query. The other factors, known collectively as the query context, will affect the relevance of retrieved documents and could potentially have a significant impact on the ranking algorithm. Most contextual information, however, has proved to be difficult to capture and represent in a way that provides consistent effectiveness improvements.

Much research has been done, in particular, on learning user models or profiles to represent a person's interests so that a search can be personalized. If the system knew that a person was interested in sports, for example, the documents retrieved for the query "vikings" may be different than those retrieved by the same query for a person interested in history. Although this idea is appealing, there are a number of problems with actually making it work. The first is the accuracy of the user models. The most common proposal is to create the profiles based on the documents that the person looks at, such as web pages visited, email messages, or word processing documents on the desktop. This type of profile represents a person using words weighted by their importance.
Words that occur frequently in the documents associated with that person, but are not common words in general, will have the highest weights. Given that documents contain hundreds or even thousands of words, and the documents visited by the person represent only a snapshot of their interests, these models are not very specific. Experiments have shown that using such models does not improve the effectiveness of ranking on average.

An alternative approach would be to ask the user to describe herself using predefined categories. In addition to requiring additional (and optional) interactions that most people tend to avoid, there is still the fundamental problem that someone with a general interest in sports may still want to ask a question about history. This suggests that a category of interest could be specified for each query, such as specifying the "history" category for the query "vikings", but this is no different than simply entering a less ambiguous query. It is much more effective for a person to enter an extra word or two in her query to clarify it (such as "vikings quarterbacks" or "vikings exploration", for example) than to try to classify a query into a limited set of categories.

Another issue that is raised by any approach to personalization based on user models is privacy. People have understandable concerns about personal details being recorded in corporate and government databases. In response, techniques for maintaining anonymity while searching and browsing on the Web are becoming an increasingly popular area for research and development. Given this, a search engine that creates profiles based on web activity may be viewed negatively, especially since the benefit of doing this is currently not clear.

Problems with user modeling and privacy do not mean that contextual information is not useful, but rather that the benefits of any approach based on context need to be examined carefully. There are examples of applications where the use of contextual information is clearly effective. One of these is the use of query logs and clickthrough data to improve web search. The context in this case is the history of previous searches and search sessions that are the same or very similar. In general, this history is based on the entire user population. A particular person's search history may be useful for "caching" results for common search queries, but learning from a large number of queries across the population appears to be much more effective.

Another effective application of context is local search, which uses geographic information derived from the query, or from the location of the device that the query comes from, to modify the ranking of search results. For example, the query "fishing supplies" will generate a long list of web pages for suppliers from all over the country (or the world). The query "fishing supplies Cape Cod", however, should use the context provided by the location "Cape Cod" to rank suppliers in that region higher than any others. Similarly, if the query "fishing supplies" came from a mobile device in a town in Cape Cod, then this information could be used to rank suppliers by their proximity to that town.

Local search based on queries involves the following steps:

1. Identify the geographic region associated with web pages.
This is done either by using location metadata that has been manually added to the document, or by automatically identifying locations, such as place names, city names, or country names, in the document text.
2. Identify the geographic region associated with the query using automatic techniques. Analysis of query logs has shown that 10–15% of queries contain some location reference.
3. Rank web pages using a comparison of the query and document location information in addition to the usual text- and link-based features.

Automatically identifying the location information in text is a specific example of the information extraction techniques mentioned in Chapter 4. Location names are mapped to specific regions and coordinates using a geographic ontology and algorithms developed for spatial reasoning in geographic information systems. For example, the location "Cape Cod" in a document might be mapped to bounding rectangles based on latitude and longitude, as shown in Figure 6.2, whereas a town location would be mapped to more specific coordinates (or a smaller bounding rectangle). Although this sounds straightforward, there are many issues involved in identifying location names (for example, there are more than 35 places named Springfield in the United States), deciding which locations are significant (if a web page discusses the "problems with Washington lobbyists", should "Washington" be used as location metadata?), and combining multiple location references in a document.

15 An ontology is essentially the same thing as a thesaurus. It is a representation of the concepts in a domain and the relationships between them, whereas a thesaurus describes words, phrases, and relationships between them. Ontologies usually have a richer set of relationships than a thesaurus. A taxonomy is another term used to describe categories of concepts.

The geographic comparison used in the ranking could involve inclusion (for example, the location metadata for a supplier's web page indicates that the supplier is located in the bounding box that represents Cape Cod), distance (for example, the supplier is within 10 miles of the town that the query mentioned), or other spatial relationships. From both an efficiency and effectiveness perspective, there will be implications for exactly how and when the geographic information is incorporated into the ranking process.

Fig. 6.2. Geographic representation of Cape Cod using bounding rectangles
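As a sketch of the comparison used in step 3, the illustrative class below checks whether a page's location falls inside the bounding rectangle identified for the query and boosts its score if so. Real systems combine several rectangles per location, distance measures, and other spatial relationships; the class names and the additive boost parameter here are assumptions, not a description of any particular engine.

```java
// Sketch of a geographic inclusion test used to adjust a ranking score.
public class GeoFilter {

    static class BoundingBox {
        double minLat, maxLat, minLon, maxLon;

        BoundingBox(double minLat, double maxLat, double minLon, double maxLon) {
            this.minLat = minLat; this.maxLat = maxLat;
            this.minLon = minLon; this.maxLon = maxLon;
        }

        boolean contains(double lat, double lon) {
            return lat >= minLat && lat <= maxLat && lon >= minLon && lon <= maxLon;
        }
    }

    // Boost the text- and link-based score of a page whose location falls inside
    // the region identified for the query.
    static double localScore(double textScore, BoundingBox queryRegion,
                             double pageLat, double pageLon, double boost) {
        return queryRegion.contains(pageLat, pageLon) ? textScore + boost : textScore;
    }
}
```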
To summarize, the most useful contextual information for improving search quality is based on past interactions with the search engine (i.e., the query log and session history). Local search based on geographic context can also produce substantial improvements for a subset of queries. In both cases, context is used to provide additional features to enhance the original query (query expansion provides additional words, and local search provides geographic distance). To understand the context for a specific query, however, there is no substitute for the user providing a more specific query. Indeed, local search in most cases relies on the location being specified in the query. Typically, more specific queries come from users examining the results and then reformulating the query. The results display, which we discuss next, must convey the context of the query term matches so that the user can decide which documents to look at in detail or how to reformulate the query.

6.3 Showing the Results

6.3.1 Result Pages and Snippets

Successful interactions with a search engine depend on the user understanding the results. Many different visualization techniques have been proposed for displaying search output (Hearst, 1999), but for most search engines the result pages consist of a ranked list of document summaries that are linked to the actual documents or web pages. A document summary for a web search typically contains the title and URL of the web page, links to live and cached versions of the page, and, most importantly, a short text summary, or snippet, that is used to convey the content of the page. In addition, most result pages contain advertisements consisting of short descriptions and links. Query words that occur in the title, URL, snippet, or advertisements are usually highlighted to make them easier to identify, by displaying them in a bold font.

Figure 6.3 gives an example of a document summary from a result page for a web search. In this case, the snippet consists of two partial sentences. Figure 6.1 gives more examples of snippets that are sometimes full sentences, but often text fragments, extracted from the web page. Some of the snippets do not even contain the query words. In this section, we describe some of the basic features of the algorithms used for snippet generation.

Tropical Fish
One of the U.K.s Leading suppliers of Tropical, Coldwater, Marine Fish and Invertebrates plus ... next day delivery service ...
www.tropicalfish.org.uk/tropical_fish.htm   Cached page

Fig. 6.3. Typical document summary for a web search

Snippet generation is an example of text summarization. Summarization techniques have been developed for a number of applications, but primarily have been tested using news stories from the TREC collections. A basic distinction is made between techniques that produce query-independent summaries and those that produce query-dependent summaries. Snippets in web search engine result pages are clearly query-dependent summaries, since the snippet that is generated for a page will depend on the query that retrieved it, but some query-independent features, such as the position of the text in the page and whether the text is in a heading, are also used.

The development of text summarization techniques started with H. P. Luhn in the 1950s (Luhn, 1958). Luhn's approach was to rank each sentence in a document using a significance factor and to select the top sentences for the summary. The significance factor for a sentence is calculated based on the occurrence of significant words. Significant words are defined in his work as words of medium frequency in the document, where "medium" means that the frequency is between predefined high-frequency and low-frequency cutoff values. Given the significant words, portions of the sentence that are "bracketed" by these words are considered, with a limit set for the number of non-significant words that can be between two significant words (typically four).
The significance factor for these bracketed text spans is computed by dividing the square of the number of significant words in the span by the total number of words in the span. Figure 6.4 gives an example of a text span for which the significance factor is 4²/7 = 2.3. The significance factor for a sentence is the maximum calculated for any text span in the sentence.

w w w w w w w w w w w. (initial sentence)
w w s w s s w w s w w. (identify significant words)
w w [s w s s w w s] w w. (text span bracketed by significant words)

Fig. 6.4. An example of a text span of words (w) bracketed by significant words (s) using Luhn's algorithm

To be more specific about the definition of a significant word, the following is a frequency-based criterion that has been used successfully in more recent research. If f_{d,w} is the frequency of word w in document d, and s_d is the number of sentences in document d, then w is a significant word if it is not a stopword (which eliminates the high-frequency words), and

f_{d,w} ≥ 7 − 0.1 × (25 − s_d)   if s_d < 25,
f_{d,w} ≥ 7                     if 25 ≤ s_d ≤ 40,
f_{d,w} ≥ 7 + 0.1 × (s_d − 40)   otherwise.

As an example, the second page of Chapter 1 of this book contains less than 25 sentences (roughly 20), and so the significant words will be non-stopwords with a frequency greater than or equal to 6.5. The only words that satisfy this criterion are "information" (frequency 9), "story" (frequency 8), and "text" (frequency 7).
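The following is a minimal sketch of Luhn's sentence scoring, assuming the set of significant words for the document has already been chosen (for example, with the frequency criterion above, or with query terms when generating snippets). The method and parameter names are illustrative only.

```java
import java.util.*;

// Sketch of Luhn's significance-factor computation for a sentence.
public class LuhnScorer {

    // Frequency threshold for significant words as a function of the number of
    // sentences in the document, following the criterion given above.
    static double significanceThreshold(int numSentences) {
        if (numSentences < 25) return 7 - 0.1 * (25 - numSentences);
        if (numSentences <= 40) return 7;
        return 7 + 0.1 * (numSentences - 40);
    }

    // Significance factor: the maximum over all text spans bracketed by
    // significant words (with at most `gap` non-significant words between
    // consecutive significant words) of (significant words)^2 / span length.
    static double significanceFactor(List<String> sentence, Set<String> significant, int gap) {
        double best = 0.0;
        int n = sentence.size();
        for (int start = 0; start < n; start++) {
            if (!significant.contains(sentence.get(start))) continue;
            int count = 0, end = start, sinceLast = 0;
            for (int i = start; i < n && sinceLast <= gap; i++) {
                if (significant.contains(sentence.get(i))) {
                    count++;
                    end = i;          // span always ends at a significant word
                    sinceLast = 0;
                } else {
                    sinceLast++;
                }
            }
            int length = end - start + 1;
            best = Math.max(best, (double) (count * count) / length);
        }
        return best;
    }
}
```

For the bracketed span in Figure 6.4 (four significant words in a span of seven), significanceFactor returns 16/7 ≈ 2.3, matching the example in the text.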
Although many variations are possible for snippet generation and document summaries in result pages, some basic guidelines for effective summaries have been derived from an analysis of clickthrough data (Clarke et al., 2007). The most important is that whenever possible, all of the query terms should appear in the summary, showing their relationship to the retrieved page. When query terms are present in the title, however, they need not be repeated in the snippet. This allows for the possibility of using sentences from metadata or external descriptions that may not have query terms in them. Another guideline is that URLs should be selected and displayed in a manner that emphasizes their relationship to the query by, for example, highlighting the query terms present in the URL. Finally, search engine users appear to prefer readable prose in snippets (such as complete or near-complete sentences) rather than lists of keywords and phrases. A feature that measures readability should be included in the computation of the ranking for snippet selection.

16 For example, the Open Directory Project, http://www.dmoz.org.

The efficient implementation of snippet generation will be an important part of the search engine architecture, since the obvious approach of finding, opening, and scanning document files would lead to unacceptable overheads in an environment requiring high query throughput. Instead, documents must be fetched from a local document store or cache at query time and decompressed. The documents that are processed for snippet generation should have all HTML tags and other "noise" (such as Javascript) removed, although metadata must still be distinguished from text content. In addition, sentence boundaries should be identified and marked at indexing time, to avoid this potentially time-consuming operation when selecting snippets.

6.3.2 Advertising and Search

Advertising is a key component of web search engines since that is how companies generate revenue. In the case of advertising presented with search results (sponsored search), the goal is to find advertisements that are appropriate for the query context. When browsing web pages, advertisements are selected for display based on the contents of pages. Contextual advertising is thought to lead to more user clicks on advertisements (clickthrough), which is how payments for advertising are determined. Search engine companies maintain a database of advertisements, which is searched to find the most relevant advertisements for a given query or web page. An advertisement in this database usually consists of a short text description and a link to a web page describing the product or service in more detail. Searching the advertisement database can therefore be considered a special case of general text search.

Nothing is ever that simple, however. Advertisements are not selected solely based on their ranking in a simple text search. Instead, advertisers bid for keywords that describe topics associated with their product. The amount bid for a keyword that matches a query is an important factor in determining which advertisement is selected. In addition, some advertisements generate more clickthrough because they are more appealing to the user population. The popularity of an advertisement, as measured by the clickthrough over time that is captured in the query log, is another significant factor in the selection process. The popularity of an advertisement can be measured over all queries or on a query-specific basis. Query-specific popularity can be used only for queries that occur on a regular basis. For the large number of queries that occur infrequently (so-called long-tail queries17), the general popularity of advertisements can be used.
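As a rough illustration of how these factors might interact, the sketch below combines a text relevance score, the advertiser's bid, and an observed clickthrough rate into a single selection score. The linear combination and the weights are invented for illustration only; real sponsored-search systems use auction mechanisms and more sophisticated expected-revenue models.

    // A minimal sketch (with made-up weights) of combining relevance, bid, and
    // popularity when selecting an advertisement for a query.
    public class AdScorer {
        static double score(double relevance, double bid, double clickthroughRate) {
            // Expected revenue is roughly bid * probability of a click; the
            // relevance term keeps ads that do not match the query from winning.
            double expectedRevenue = bid * clickthroughRate;
            return 0.5 * relevance + 0.5 * expectedRevenue;
        }

        public static void main(String[] args) {
            // Two hypothetical ads for the query "tropical fish"
            System.out.println(score(0.9, 0.50, 0.08));  // highly relevant, high bid
            System.out.println(score(0.4, 0.75, 0.15));  // popular ad, weaker text match
        }
    }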
By taking all of these factors into account, namely relevance, bids, and popularity, the search engine company can devise strategies to maximize their expected profit. As an example, a pet supplies company that specializes in tropical fish may place the highest bid for the keywords "aquarium" and "tropical fish". Given the query "tropical fish", this keyword is certainly relevant. The content of the advertisement for that company should also contain words that match the query. Given that, this company's advertisement will receive a high score for relevance and a high score based on the bid. Even though it has made the highest bid, however, there is still some chance that another advertisement will be chosen if it is very popular and has a moderately high bid for the same keywords.

Much ongoing research is directed at developing algorithms to maximize the advertiser's profit, drawing on fields such as economics and game theory. From the information retrieval perspective, the key issues are techniques for matching short pieces of text (the query and the advertisement) and selecting keywords to represent the content of web pages.

When searching the Web, there are usually many pages that contain all of the query terms. This is not the case, however, when queries are compared to advertisements. Advertisements contain a small number of words or keywords relative to a typical page, and the database of advertisements will be several orders of magnitude smaller than the Web. It is also important that variations of advertisement keywords that occur in queries are matched. For example, if a pet supply company has placed a high bid for "aquarium", they would expect to receive some traffic from queries about "fish tanks". This, of course, is the classic vocabulary mismatch problem, and many techniques have been proposed to address this, such as stemming and query expansion. Since advertisements are short, techniques for expanding the documents as well as the queries have been considered.

Two techniques that have performed well in experiments are query reformulation based on user sessions in query logs (Jones et al., 2006) and expansion of queries and documents using external sources, such as the Web (Metzler et al., 2007).

Studies have shown that about 50% of the queries in a single session are reformulations, where the user modifies the original query through word replacements, insertions, and deletions. Given a large number of candidate associations between queries and phrases in those queries, statistical tests, such as those described in section 6.2.3, can be used to determine which associations are significant. For example, the association between the phrases "fish tank" and "aquarium" may occur often in search sessions as users reformulate their original query to find more web pages. If this happens often enough relative to the frequency of these phrases, it will be considered significant.

17 The term "long-tail" comes from the long tail of the Zipf distribution described in Chapter 4. Assuming that a query refers to a specific combination of words, most queries occur infrequently, and a relatively small number account for the majority of the query instances that are processed by search engines.
The significant associations can be used as potential substitutions, so that, given an initial query, a ranked list of query reformulations can be generated, with the emphasis on generating queries that contain matches for advertising keywords.

The expansion technique consists of using the Web to expand either the query, the advertisement text, or both. A form of pseudo-relevance feedback is used where the advertisement text or keywords are used as a query for a web search, and expansion words are selected from the highest-ranking web pages. Experiments have shown that the most effective relevance ranking of advertisements is when exact matches of the whole query are ranked first, followed by exact matches of the whole query with words replaced by stems, followed by a probabilistic similarity match of the expanded query with the expanded advertisement. The type of similarity match used is described in section 7.3.

Fig. 6.5. Advertisements displayed by a search engine for the query "fish tanks" (the list includes "fish tanks at Target - Find fish tanks Online. Shop & Save at Target.com Today. www.target.com", "Aquariums - 540+ Aquariums at Great Prices. fishbowls.pronto.com", "Freshwater Fish Species - Everything you need to know to keep your setup clean and beautiful. www.FishChannel.com", "Pet Supplies at Shop.com - Shop millions of products and buy from our trusted merchants. shop.com", and "Custom Fish Tanks - Choose From 6,500+ Pet Supplies. Save On Custom Fish Tanks! shopzilla.com")

As an example, Figure 6.5 shows the list of advertisements generated by a search engine for the query "fish tanks". Two of the advertisements are obvious matches, in that "fish tanks" occurs in the titles. Two of the others (the second and fourth) have no words in common with the query, although they are clearly relevant. Using the simple pseudo-relevance feedback technique described in section 6.2.4 would produce both "aquarium" (frequency 10) and "acrylic" (frequency 7) as expansion terms based on the top 10 results. This would give advertisements containing "aquarium", such as the second one, a higher relevance score in the selection process. The fourth advertisement has presumably been selected because the pet supplier has bid on the keyword "aquarium", and potentially because many people have clicked on this advertisement. The third advertisement is similar and matches one of the query words.

In the case of contextual advertising for web pages, keywords typically are extracted from the contents of the page and then used to search the advertising database to select advertisements for display along with the contents of the page. Keyword selection techniques are similar to the summarization techniques described in the last section, with the focus on keywords rather than sentences. A simple approach would be to select the top words ranked by a significance weight based on relative frequencies in the document and the collection of documents.
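The simple approach just described can be sketched as follows; the particular significance weight (a ratio of relative frequencies, smoothed to avoid division by zero) and the example counts are assumptions for illustration rather than a standard formula.

    import java.util.*;

    // A minimal sketch of selecting keywords from a web page by comparing the
    // relative frequency of each word in the page to its relative frequency in
    // the collection.
    public class KeywordSelector {
        public static List<String> topKeywords(Map<String, Integer> pageCounts, int pageLength,
                                               Map<String, Long> collectionCounts, long collectionLength,
                                               int k) {
            List<String> words = new ArrayList<>(pageCounts.keySet());
            words.sort((a, b) -> Double.compare(
                    weight(b, pageCounts, pageLength, collectionCounts, collectionLength),
                    weight(a, pageCounts, pageLength, collectionCounts, collectionLength)));
            return words.subList(0, Math.min(k, words.size()));
        }

        static double weight(String w, Map<String, Integer> pageCounts, int pageLength,
                             Map<String, Long> collectionCounts, long collectionLength) {
            double pPage = (double) pageCounts.get(w) / pageLength;
            // add one to the collection count so unseen words do not divide by zero
            double pColl = (collectionCounts.getOrDefault(w, 0L) + 1.0) / collectionLength;
            return pPage / pColl;
        }

        public static void main(String[] args) {
            // Hypothetical counts for a page about aquarium equipment
            Map<String, Integer> page = Map.of("aquarium", 12, "fish", 20, "the", 40, "heater", 6);
            Map<String, Long> coll = Map.of("aquarium", 5_000L, "fish", 40_000L,
                                            "the", 9_000_000L, "heater", 2_000L);
            // prints the two words that are most unusual for this page
            System.out.println(topKeywords(page, 500, coll, 100_000_000L, 2));
        }
    }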
A more effective approach is to use a classifier based on machine learning techniques, as described in Chapter 9. A classifier uses a weighted combination of features to determine which words and phrases are significant. Typical features include the frequency in the document, the number of documents in which the word or phrase occurs, functions of those frequencies (such as taking the log or normalizing), frequency of occurrence in the query log, location of the word or phrase in the document (e.g., the title, body, anchor text, metadata, URL), and whether the word or phrase was capitalized or highlighted in some way. The most useful features are the document and query log frequency information (Yih et al., 2006).

6.3.3 Clustering the Results

The results returned by a search engine are often related to different aspects of the query topic. In the case of an ambiguous query, these groups of documents can represent very different interpretations of the query. For example, we have seen how the query "tropical fish" retrieves documents related to aquariums, pet supplies, images, and other subtopics. An even simpler query, such as "fish", is likely to retrieve a heterogeneous mix of documents about the sea, software products, a rock singer, and anything else that happens to use the name "fish". If a user is interested in a particular aspect of a query topic, scanning through many pages on different aspects could be frustrating. This is the motivation for the use of clustering techniques on search results. Clustering groups documents that are similar in content and labels the clusters so they can be quickly scanned for relevance.

Fig. 6.6. Clusters formed by a search engine from top-ranked documents for the query "tropical fish". Numbers in brackets are the number of documents in the cluster. (The labels shown include Pictures; Aquarium Fish; Tropical Fish Aquarium; Exporter; Supplies; Plants, Aquatic; Fish Tank; Breeding; Marine Fish; and Aquaria, with cluster sizes between 9 and 38 documents.)

Figure 6.6 shows a list of clusters formed by a web search engine from the top-ranked documents for the query "tropical fish". This list, where each cluster is described or labeled using a single word or phrase and includes a number indicating the size of the cluster, is displayed to the side of the usual search results. Users that are interested in one of these clusters can click on the cluster label to see a list of those documents, rather than scanning the ranked list to find documents related to that aspect of the query. In this example, the clusters are clearly related to the subtopics we mentioned previously, such as supplies and pictures.

Clustering techniques are discussed in detail in Chapter 9. In this section, we focus on the specific requirements for the task of clustering search results. The first of these requirements is efficiency. The clusters that are generated must be specific to each query and are based on the top-ranked documents for that query. The clusters for popular queries could be cached, but clusters will still need to be generated online for most queries, and this process has to be efficient. One consequence of this is that cluster generation is usually based on the text of document snippets, rather than the full text. Snippets typically contain many fewer words than the full text, which will substantially speed up calculations that involve comparing word overlap. Snippet text is also designed to be focused on the query topic, whereas documents can contain many text passages that are only partially relevant.
The second important requirement for result clusters is that they are easy to understand. In the example in Figure 6.6, each cluster is labeled by a single word or phrase, and the user will assume that every document in that cluster will be described by that concept. In the cluster labeled "Pictures", for example, it is reasonable to expect that every document would contain some pictures of fish. This is an example of a monothetic classification, where every member of a class has the property that defines the class.18 This may sound obvious, but in fact it is not the type of classification produced by most clustering algorithms. Membership of a class or cluster produced by an algorithm such as K-means19 is based on word overlap. In other words, members of clusters share many properties, but there is no single defining property. This is known as a polythetic classification. For result clustering, techniques that produce monothetic classifications (or, at least, those that appear to be monothetic) are preferred because they are easier to understand.

As an example, consider documents D1, D2, D3, and D4 in the search results that contain the terms (i.e., words or phrases) {a, b, c, d, e, f, g}. The sets of terms representing each document are:

D1 = {a, b, c}
D2 = {a, d, e}
D3 = {d, e, f, g}
D4 = {f, g}

A monothetic algorithm may decide that a and e are significant terms and produce the two clusters {D1, D2} (labeled a) and {D2, D3} (with cluster label e). Note that these clusters are overlapping, in that a document may belong to more than one cluster. A polythetic algorithm may decide that, based on term overlap, the only significant cluster is {D2, D3, D4}: D2 has two terms in common with D3, and D3 has two terms in common with D4. Note that these three documents have no single term in common, and it is not clear how this cluster would be labeled.

18 This is also the definition of a class proposed by Aristotle over 2,400 years ago.
19 K-means clustering is described in Chapter 9, but basically a document is compared to representatives of the existing clusters and added to the most similar cluster.

If we consider the list of snippets shown in Figure 6.1, a simple clustering based on the non-stopwords that occur in more than one document would give us:

aquarium (5)    (Documents 1, 3, 4, 5, 8)
freshwater (4)  (1, 8, 9, 10)
species (3)     (2, 3, 4)
hobby (3)       (1, 5, 10)
forums (2)      (6, 8)

In an actual implementation of this technique, both words and phrases would be considered and many more of the top-ranking snippets (say, 200) would be used. Additional features of the words and phrases, such as whether they occurred in titles or snippets, the length of the phrase, and the collection frequency of the phrase, as well as the overlap of the resulting clusters, would be considered in choosing the final set of clusters.
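A minimal sketch of this kind of term-based, monothetic-style grouping is shown below, using the small D1-D4 example from above. The rule that a cluster label must occur in at least two documents stands in for the richer term-selection criteria just described; it is an illustrative assumption, not a standard algorithm.

    import java.util.*;

    // A minimal sketch of monothetic result grouping: each candidate cluster is
    // labeled by a single term and contains every document whose term set
    // includes that term. A term is kept as a label only if it appears in at
    // least two documents.
    public class MonotheticClusters {
        public static void main(String[] args) {
            Map<String, Set<String>> docs = new LinkedHashMap<>();
            docs.put("D1", Set.of("a", "b", "c"));
            docs.put("D2", Set.of("a", "d", "e"));
            docs.put("D3", Set.of("d", "e", "f", "g"));
            docs.put("D4", Set.of("f", "g"));

            Map<String, List<String>> clusters = new TreeMap<>();
            for (Map.Entry<String, Set<String>> doc : docs.entrySet())
                for (String term : doc.getValue())
                    clusters.computeIfAbsent(term, t -> new ArrayList<>()).add(doc.getKey());

            // Keep only clusters whose label occurs in two or more documents.
            clusters.entrySet().removeIf(e -> e.getValue().size() < 2);
            System.out.println(clusters);
            // {a=[D1, D2], d=[D2, D3], e=[D2, D3], f=[D3, D4], g=[D3, D4]}
        }
    }

Note that every document in a cluster contains the label term, which is exactly the monothetic property that makes these groupings easy for users to interpret; overlapping clusters (such as d and e above) would normally be merged or pruned.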
An alternative approach for organizing results into meaningful groups is to use faceted classification or, more simply, facets. A faceted classification consists of a set of categories, usually organized into a hierarchy, together with a set of facets that describe the important properties associated with the category. A product described by a faceted classification, for example, could be labeled by more than one category and will have values for every facet. Faceted classifications are primarily manually defined, although it is possible to support faceted browsing for data that has been structured using a database schema, and techniques for constructing faceted classifications automatically are being studied. The major advantage of manually defining facets is that the categories are in general easier for the user to understand than automatically generated cluster labels. The disadvantages are that a classification has to be defined for each new application and domain, and manual classifications tend to be static and not as responsive to new data as dynamically constructed clusters.

Facets are very common in e-commerce sites. Figure 6.7 shows the set of categories returned for the query "tropical fish" for a search on a popular retailer's site. The numbers refer to the number of products in each category that match the query. These categories are displayed to the side of the search results, similar to the clusters discussed earlier. If the "Home & Garden" category is selected, Figure 6.8 shows that what is displayed is a list of subcategories, such as "pet supplies", together with facets for this category, which include the brand name, supplier or vendor name, discount level, and price. A given product, such as an aquarium, can be found under "pet supplies" and in the appropriate price level, discount level, etc. This type of organization provides the user with both guidance and flexibility in browsing the search results.

Fig. 6.7. Categories returned for the query "tropical fish" in a popular online retailer. (The original figure lists about twenty categories with matching product counts, ranging from Books (7,845) and Home & Garden (2,477) down to categories such as Baby (25) and Video Games (1).)

Fig. 6.8. Subcategories and facets for the "Home & Garden" category. (The subcategories include Kitchen & Dining, Furniture & Décor, Pet Supplies, Bath & Bedding, Patio & Garden, Art & Craft Supplies, Home Appliances, and Vacuums, Storage & Cleaning; the facets include discount levels such as up to 25% off through 70% off or more, price ranges from $0-$24 up to $5000-$9999, brand names, and seller or vendor names.)

6.4 Cross-Language Search

By translating queries for one or more monolingual search engines covering different languages, it is possible to do cross-language search20 (see Figure 6.9). A cross-language search engine receives a query in one language (e.g., English) and retrieves documents in a variety of other languages (e.g., French and Chinese). Users typically will not be familiar with a wide range of languages, so a cross-language search system must do the query translation automatically. Since the system also retrieves documents in multiple languages, some systems also translate these for the user.

20 Also called cross-language information retrieval (CLIR), cross-lingual search, and multilingual search.

Fig. 6.9. Cross-language search (the user's query is translated into queries for search engines covering other languages, and the retrieved documents in those languages may also be translated for the user)
The most obvious approach to automatic translation would be to use a large bilingual dictionary that contained the translation of a word in the source language (e.g., English) to the target language (e.g., French). Sentences would then be translated by looking up each word in the dictionary. The main issue is how to deal with ambiguity, since many words have multiple translations. Simple dictionary-based translations are generally poor, but a number of techniques have been developed, such as query expansion (section 6.2.3), that reduce ambiguity and increase the ranking effectiveness of a cross-language system to be comparable to a monolingual system.

The most effective and general methods for automatic translation are based on statistical machine translation models (Manning & Schütze, 1999). When translating a document or a web page, in contrast to a query, not only is ambiguity a problem, but the translated sentences should also be grammatically correct. Words can change order, disappear, or become multiple words when a sentence is translated. Statistical translation models represent each of these changes with a probability. This means that the model describes the probability that a word is translated into another word, the probability that words change order, and the probability that words disappear or become multiple words. These probabilities are used to calculate the most likely translation for a sentence.21

Although a model that is based on word-to-word translation probabilities has some similarities to a dictionary-based approach, if the translation probabilities are accurate, they can make a large difference to the quality of the translation. Unusual translations for an ambiguous word can then be easily distinguished from more typical translations. More recent versions of these models, called phrase-based translation models, further improve the use of context in the translation by calculating the probabilities of translating sequences of words, rather than just individual words. A word such as "flight", for example, could be more accurately translated as the phrase "commercial flight", instead of being interpreted as "bird flight".

The probabilities in statistical machine translation models are estimated primarily by using parallel corpora. These are collections of documents in one language together with the translations into one or more other languages. The corpora are obtained primarily from government organizations (such as the United Nations), news organizations, and by mining the Web, since there are hundreds of thousands of translated pages. The sentences in the parallel corpora are aligned either manually or automatically, which means that sentences are paired with their translations. The aligned sentences are then used for training the translation model.

21 The simplest form of a machine translation model is actually very similar to the query likelihood model described in section 7.3.1. The main difference is the incorporation of a translation probability P(w_i | w_j), which is the probability that a word w_j can be translated into the word w_i, in the estimation of P(Q | D). P(Q | D) is the probability of generating a query from a document, which in the translation model becomes the probability that a query is a translation of the document.
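At the word level, the simplest use of such translation probabilities is to replace each query word with its most probable translation. The sketch below illustrates this with an invented probability table; a real system would estimate the probabilities from aligned parallel corpora and use phrase-based models rather than independent word choices.

    import java.util.*;

    // A minimal sketch of word-by-word query translation using a table of
    // translation probabilities P(target | source). The probabilities below are
    // invented for illustration.
    public class WordTranslator {
        public static void main(String[] args) {
            Map<String, Map<String, Double>> table = new HashMap<>();
            table.put("fish", Map.of("poisson", 0.7, "pêcher", 0.3));
            table.put("tank", Map.of("aquarium", 0.6, "réservoir", 0.3, "char", 0.1));

            String[] query = {"fish", "tank"};
            List<String> translated = new ArrayList<>();
            for (String word : query) {
                // keep the word unchanged if it has no known translation
                Map<String, Double> options = table.getOrDefault(word, Map.of(word, 1.0));
                translated.add(Collections.max(options.entrySet(),
                        Map.Entry.comparingByValue()).getKey());
            }
            System.out.println(translated);  // [poisson, aquarium]
        }
    }

The ambiguity problem is visible even in this toy example: "fish" the noun and "fish" the verb have different French translations, and choosing purely by the highest probability ignores the rest of the query, which is exactly what the context-sensitive models described above are designed to fix.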
Special attention has to be paid to the translation of unusual words, especially proper nouns such as people's names. For these words in particular, the Web is a rich resource. Automatic transliteration techniques are also used to address the problem of people's names. Proper names are not usually translated into another language, but instead are transliterated, meaning that the name is written in the characters of another language according to certain rules or based on similar sounds. This can lead to many alternative spellings for the same name. For example, the Libyan leader Muammar Qaddafi's name can be found in many different transliterated variants on web pages, such as Qathafi, Kaddafi, Qadafi, Gadafi, Gaddafi, Kathafi, Kadhafi, Qadhafi, Qazzafi, Kazafi, Qaddafy, Qadafy, Quadhaffi, Gadhdhafi, al-Qaddafi, Al-Qaddafi, and Al Qaddafi. Similarly, there are a number of variants of "Bill Clinton" on Arabic web pages.

Although they are not generally regarded as cross-language search systems, web search engines can often retrieve pages in a variety of languages. For that reason, many search engines have made translation available on the result pages. Figure 6.10 shows an example of a page retrieved for the query "pecheur france", where the translation option is shown as a hyperlink. Clicking on this link produces a translation of the page (not the snippet), which makes it clear that the page contains links to archives of the sports magazine Le pêcheur de France, which is translated as "The fisherman of France". Although the translation provided is not perfect, it typically provides enough information for someone to understand the contents and relevance of the page. These translations are generated automatically using machine translation techniques, since any human intervention would be prohibitively expensive.

Fig. 6.10. A French web page in the results list for the query "pecheur france" (the result shows the title "Le pêcheur de France archives @ peche poissons" with a "Translate this page" link, and a French snippet listing issues of the magazine Le pêcheur de France)

References and Further Reading

This chapter has covered a wide range of topics that have been studied for a number of years. Consequently, there are many references that are relevant and provide more detail than we are able to cover here. The following papers and books represent some of the more significant contributions, but each contains pointers to other work for people interested in gaining a deeper understanding of a specific topic.

The advantages and disadvantages of Boolean queries relative to "natural language" or keyword queries have been discussed for more than 30 years. This debate has been particularly active in the legal retrieval field, which saw the introduction of the first search engines using ranking and simple queries on large collections in the early 1990s. Turtle (1994) describes one of the few quantitative comparisons of expert Boolean searching to ranking based on simple queries, and found that simple queries are surprisingly effective, even in this professional environment.
The next chapter contains more discussion of the Boolean retrieval model.

A more detailed description of query-based stemming based on corpus analysis can be found in J. Xu and Croft (1998). A good source for the earlier history of association measures such as Dice's coefficient that have been used for information retrieval is van Rijsbergen (1979). Peng et al. (2007) describe a more recent version of corpus-based stemming for web search.

Kukich (1992) provides an overview of spelling correction techniques. For a more detailed introduction to minimum edit distance and the noisy channel model for spelling correction, see Jurafsky and Martin (2006). Guo et al. (2008) describe an approach that combines query refinement steps, such as spelling correction, stemming, and identification of phrases, into a single model. Their results indicate that the unified model can potentially improve effectiveness relative to carrying out these steps as separate processes.

Query expansion has been the subject of much research. Efthimiadis (1996) gives a general overview and history of query expansion techniques, including thesaurus-based expansion. As mentioned before, van Rijsbergen (1979) describes the development of association measures for information retrieval, including the mutual information measure. In computational linguistics, the paper by Church and Hanks (1989) is often referred to for the use of the mutual information measure in constructing lexicons (dictionaries). Manning and Schütze (1999) give a good overview of these and the other association measures mentioned in this chapter. Jing and Croft (1994) describe a technique for constructing an "association thesaurus" from virtual documents consisting of the words that co-occur with other words. The use of query log data to support expansion is described in Beeferman and Berger (2000) and Cui et al. (2003).

Rocchio (1971) pioneered the work on relevance feedback, which was then followed up by a large amount of work that is reviewed in Salton and McGill (1983) and van Rijsbergen (1979). J. Xu and Croft (2000) is a frequently cited paper on pseudo-relevance feedback that compared "local" techniques based on top-ranked documents to "global" techniques based on the term associations in the collection. The book based on 10 years of TREC experiments (Voorhees & Harman, 2005) contains many descriptions of both relevance feedback and pseudo-relevance feedback techniques.

Context and personalization is a popular topic. Many publications can be found in workshops and conferences, such as the Information Interaction in Context Symposium (IIiX).22 Wei and Croft (2007) describe an experiment that raises questions about the potential benefit of user profiles. Chen et al. (2006) and Zhou et al. (2005) both discuss index structures for efficiently processing local search queries, but also provide general overviews of local search. V. Zhang et al. (2006) discusses local search with an emphasis on analyzing query logs.

The original work on text summarization was done by Luhn (1958). Goldstein et al. (1999) describe more recent work on summarization based on sentence selection. The work of Berger and Mittal (2000), in contrast, generates summaries based on statistical models of the document. Sun et al. (2005) describe a technique based on clickthrough data. The papers of Clarke et al. (2007) and Turpin et al.
(2007) focus specifically on snippet generation.

Feng et al. (2007) give a general overview of the issues in sponsored search. Metzler et al. (2007) and Jones et al. (2006) discuss specific techniques for matching queries to short advertisements. A discussion of the issues in contextual advertising (providing advertisements while browsing), as well as a specific technique for selecting keywords from a web page, can be found in Yih et al. (2006).

As mentioned earlier, many visualization techniques have been proposed over the years for search results, and we have ignored most of these in this book. Hearst (1999) provides a good overview of the range of techniques. Leouski and Croft (1996) presented one of the first evaluations of techniques for result clustering. Hearst and Pedersen (1996) show the potential benefits of this technique, and Zamir and Etzioni (1999) emphasize the importance of clusters that made sense to the user and were easy to label. Lawrie and Croft (2003) discuss a technique for building a hierarchical summary of the results, and Zeng et al. (2004) focus on the selection of phrases from the results as the basis of clusters. The relative advantages and disadvantages of clustering and facets are discussed in Hearst (2006).

22 This conference grew out of the Information Retrieval in Context (IRiX) workshops, whose proceedings can also be found on the Web.

More generally, there is a whole community of HCI23 (Human-Computer Interaction) researchers concerned with the design and evaluation of interfaces for information systems. Shneiderman et al. (1998) is an example of this type of research, and Marchionini (2006) gives a good overview of the importance of the interface for interactive, or exploratory, search.

23 Sometimes referred to as CHI.

Cross-language search has been studied at TREC (Voorhees & Harman, 2005) and at a European evaluation forum called CLEF24 for a number of years. The first collection of papers in this area was in Grefenstette (1998). Issues that arise in specific CLIR systems, such as transliteration (AbdulJaleel & Larkey, 2003), are discussed in many papers in the literature. Manning and Schütze (1999) and Jurafsky and Martin (2006) give overviews of statistical machine translation models.

24 http://clef.isti.cnr.it/

Finally, there has been a large body of work in the information science literature that has looked at how people actually search and interact with search engines. This research is complementary to the more systems-oriented approach taken in this chapter, and is a crucial part of understanding the process of looking for information and relevance. The Journal of the American Society of Information Science and Technology (JASIST) is the best source for these types of papers, and Ingwersen and Järvelin (2005) provide an interesting comparison of the computer science and information science perspectives on search.

Exercises

6.1. Using the Wikipedia collection provided at the book website, create a sample of stem clusters by the following process:
1. Index the collection without stemming.
2. Identify the first 1,000 words (in alphabetical order) in the index.
3. Create stem classes by stemming these 1,000 words and recording which words become the same stem.
4. Compute association measures (Dice's coefficient) between all pairs of stems in each stem class. Compute co-occurrence at the document level.
5. Create stem clusters by thresholding the association measure. All terms that are still connected to each other form the clusters.
Compare the stem clusters to the stem classes in terms of size and the quality (in your opinion) of the groupings.

6.2. Create a simple spelling corrector based on the noisy channel model. Use a single-word language model, and an error model where all errors with the same edit distance have the same probability. Only consider edit distances of 1 or 2. Implement your own edit distance calculator (example code can easily be found on the Web).

6.3. Implement a simple pseudo-relevance feedback algorithm for the Galago search engine. Provide examples of the query expansions that your algorithm does, and summarize the problems and successes of your approach.

6.4. Assuming you had a gazetteer of place names available, sketch out an algorithm for detecting place names or locations in queries. Show examples of the types of queries where your algorithm would succeed and where it would fail.

6.5. Describe the snippet generation algorithm in Galago. Would this algorithm work well for pages with little text content? Describe in detail how you would modify the algorithm to improve it.

6.6. Pick a commercial web search engine and describe how you think the query is matched to the advertisements for sponsored search. Use examples as evidence for your ideas. Do the same thing for advertisements shown with web pages.

6.7. Implement a simple algorithm that selects phrases from the top-ranked pages as the basis for result clusters. Phrases should be considered as any two-word sequence. Your algorithm should take into account phrase frequency in the results, phrase frequency in the collection, and overlap in the clusters associated with the phrases.

6.8. Find four different types of websites that use facets, and describe them with examples.

6.9. Give five examples of web page translation that you think is poor. Why do you think the translation failed?

7 Retrieval Models

"There is no certainty, only opportunity."
V, V for Vendetta

7.1 Overview of Retrieval Models

During the last 45 years of information retrieval research, one of the primary goals has been to understand and formalize the processes that underlie a person making the decision that a piece of text is relevant to his information need. To develop a complete understanding would probably require understanding how language is represented and processed in the human brain, and we are a long way short of that. We can, however, propose theories about relevance in the form of mathematical retrieval models and test those theories by comparing them to human actions. Good models should produce outputs that correlate well with human decisions on relevance. To put it another way, ranking algorithms based on good retrieval models will retrieve relevant documents near the top of the ranking (and consequently will have high effectiveness).

How successful has modeling been? As an example, ranking algorithms for general search improved in effectiveness by over 100% in the 1990s, as measured using the TREC test collections. These changes in effectiveness corresponded to improvements in the associated retrieval models. Web search effectiveness has also improved substantially over the past 10 years.
In experiments with TREC web collections, the most effective ranking algorithms come from well-defined retrieval models. In the case of commercial web search engines, it is less clear what the retrieval models are, but there is no doubt that the ranking algorithms rely on solid mathematical foundations.

It is possible to develop ranking algorithms without an explicit retrieval model through trial and error. Using a retrieval model, however, has generally proved to be the best approach. Retrieval models, like all mathematical models, provide a framework for defining new tasks and explaining assumptions. When problems are observed with a ranking algorithm, the retrieval model provides a structure for testing alternatives that will be much more efficient than a brute force (try everything) approach.

In this discussion, we must not overlook the fact that relevance is a complex concept. It is quite difficult for a person to explain why one document is more relevant than another, and when people are asked to judge the relevance of documents for a given query, they can often disagree. Information scientists have written volumes about the nature of relevance, but we will not dive into that material here. Instead, we discuss two key aspects of relevance that are important for both retrieval models and evaluation measures.

The first aspect is the difference between topical and user relevance, which was mentioned in section 1.1. A document is topically relevant to a query if it is judged to be on the same topic. In other words, the query and the document are about the same thing. A web page containing a biography of Abraham Lincoln would certainly be topically relevant to the query "Abraham Lincoln", and would also be topically relevant to the queries "U.S. presidents" and "Civil War". User relevance takes into account all the other factors that go into a user's judgment of relevance. This may include the age of the document, the language of the document, the intended target audience, the novelty of the document, and so on. A document containing just a list of all the U.S. presidents, for example, would be topically relevant to the query "Abraham Lincoln" but may not be considered relevant to the person who submitted the query because they were looking for more detail on Lincoln's life. Retrieval models cannot incorporate all the additional factors involved in user relevance, but some do take these factors into consideration.

The second aspect of relevance that we consider is whether it is binary or multivalued. Binary relevance simply means that a document is either relevant or not relevant. It seems obvious that some documents are less relevant than others, but still more relevant than documents that are completely off-topic. For example, we may consider the document containing a list of U.S. presidents to be less topically relevant than the Lincoln biography, but certainly more relevant than an advertisement for a Lincoln automobile. Based on this observation, some retrieval models and evaluation measures explicitly introduce relevance as a multivalued variable. Multiple levels of relevance are certainly important in evaluation, when people are asked to judge relevance. Having just three levels (relevant, non-relevant, unsure) has been shown to make the judges' task much easier. In the case of retrieval models, however, the advantages of multiple levels are less clear.
This is because most ranking algorithms calculate a probability of relevance and can represent the uncertainty involved.

Many retrieval models have been proposed over the years. Two of the oldest are the Boolean and vector space models. Although these models have been largely superseded by probabilistic approaches, they are often mentioned in discussions about information retrieval, and so we describe them briefly before going into the details of other models.

7.1.1 Boolean Retrieval

The Boolean retrieval model was used by the earliest search engines and is still in use today. It is also known as exact-match retrieval since documents are retrieved if they exactly match the query specification, and otherwise are not retrieved. Although this defines a very simple form of ranking, Boolean retrieval is not generally described as a ranking algorithm. This is because the Boolean retrieval model assumes that all documents in the retrieved set are equivalent in terms of relevance, in addition to the assumption that relevance is binary. The name Boolean comes from the fact that there are only two possible outcomes for query evaluation (TRUE and FALSE) and because the query is usually specified using operators from Boolean logic (AND, OR, NOT). As mentioned in Chapter 6, proximity operators and wildcard characters are also commonly used in Boolean queries. Searching with a regular expression utility such as grep is another example of exact-match retrieval.

There are some advantages to Boolean retrieval. The results of the model are very predictable and easy to explain to users. The operands of a Boolean query can be any document feature, not just words, so it is straightforward to incorporate metadata such as a document date or document type in the query specification. From an implementation point of view, Boolean retrieval is usually more efficient than ranked retrieval because documents can be rapidly eliminated from consideration in the scoring process.

Despite these positive aspects, the major drawback of this approach to search is that the effectiveness depends entirely on the user. Because of the lack of a sophisticated ranking algorithm, simple queries will not work well. All documents containing the specified query words will be retrieved, and this retrieved set will be presented to the user in some order, such as by publication date, that has little to do with relevance. It is possible to construct complex Boolean queries that narrow the retrieved set to mostly relevant documents, but this is a difficult task that requires considerable experience. In response to the difficulty of formulating queries, a class of users known as search intermediaries (mentioned in the last chapter) became associated with Boolean search systems. The task of an intermediary is to translate a user's information need into a complex Boolean query for a particular search engine. Intermediaries are still used in some specialized areas, such as in legal offices. The simplicity and effectiveness of modern search engines, however, has enabled most people to do their own searches.

As an example of Boolean query formulation, consider the following queries for a search engine that has indexed a collection of news stories.
The simple query:

lincoln

would retrieve a large number of documents that mention Lincoln cars and places named Lincoln in addition to stories about President Lincoln. All of these documents would be equivalent in terms of ranking in the Boolean retrieval model, regardless of how many times the word "lincoln" occurs or in what context it occurs. Given this, the user may attempt to narrow the scope of the search with the following query:

president AND lincoln

This query will retrieve a set of documents that contain both words, occurring anywhere in the document. If there are a number of stories involving the management of the Ford Motor Company and Lincoln cars, these will be retrieved in the same set as stories about President Lincoln, for example:

Ford Motor Company today announced that Darryl Hazel will succeed Brian Kelley as president of Lincoln Mercury.

If enough of these types of documents were retrieved, the user may try to eliminate documents about cars by using the NOT operator, as follows:

president AND lincoln AND NOT (automobile OR car)

This would remove any document that contains even a single mention of the words "automobile" or "car" anywhere in the document. The use of the NOT operator, in general, removes too many relevant documents along with non-relevant documents and is not recommended. For example, one of the top-ranked documents in a web search for "President Lincoln" was a biography containing the sentence:

Lincoln's body departs Washington in a nine-car funeral train.

Using NOT (automobile OR car) in the query would have removed this document. If the retrieved set is still too large, the user may try to further narrow the query by adding in additional words that should occur in biographies:

president AND lincoln AND biography AND life AND birthplace AND gettysburg AND NOT (automobile OR car)

Unfortunately, in a Boolean search engine, putting too many search terms into the query with the AND operator often results in nothing being retrieved. To avoid this, the user may try using an OR instead:

president AND lincoln AND (biography OR life OR birthplace OR gettysburg) AND NOT (automobile OR car)

This will retrieve any document containing the words "president" and "lincoln", along with any one of the words "biography", "life", "birthplace", or "gettysburg" (and does not mention "automobile" or "car").

After all this, we have a query that may do a reasonable job at retrieving a set containing some relevant documents, but we still can't specify which words are more important or that having more of the associated words is better than any one of them. For example, a document containing the following text was retrieved at rank 500 by a web search (which does use measures of word importance):

President's Day - Holiday activities - crafts, mazes, word searches, ... "The Life of Washington" Read the entire book online! Abraham Lincoln Research Site ...

A Boolean retrieval system would make no distinction between this document and the other 499 that are ranked higher by the web search engine. It could, for example, be the first document in the result list. The process of developing queries with a focus on the size of the retrieved set has been called searching by numbers, and is a consequence of the limitations of the Boolean retrieval model. To address these limitations, researchers developed models, such as the vector space model, that incorporate ranking.
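A minimal sketch of exact-match evaluation for queries like those above is shown below: each term is associated with the set of documents that contain it (an inverted list), and AND, OR, and NOT become set intersection, union, and difference. This is only an illustration of the model, not the implementation of any particular system, and the document identifiers are invented.

    import java.util.*;

    // A minimal sketch of Boolean (exact-match) retrieval over term sets.
    // All documents that satisfy the query are retrieved, with no ranking.
    public class BooleanRetrieval {
        static Set<String> and(Set<String> a, Set<String> b) {
            Set<String> r = new TreeSet<>(a); r.retainAll(b); return r;
        }
        static Set<String> or(Set<String> a, Set<String> b) {
            Set<String> r = new TreeSet<>(a); r.addAll(b); return r;
        }
        static Set<String> not(Set<String> all, Set<String> a) {
            Set<String> r = new TreeSet<>(all); r.removeAll(a); return r;
        }

        public static void main(String[] args) {
            // Inverted lists: for each term, the set of documents containing it
            Map<String, Set<String>> index = new HashMap<>();
            index.put("president", Set.of("d1", "d2", "d3"));
            index.put("lincoln",   Set.of("d1", "d3", "d4"));
            index.put("car",       Set.of("d3", "d5"));
            Set<String> allDocs = Set.of("d1", "d2", "d3", "d4", "d5");

            // president AND lincoln AND NOT car
            Set<String> result = and(and(index.get("president"), index.get("lincoln")),
                                     not(allDocs, index.get("car")));
            System.out.println(result);  // [d1]
        }
    }

Note how the NOT clause silently discards d3, which might well be a relevant story that merely mentions a car; this is exactly the over-pruning behavior discussed above.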
7.1.2 The Vector Space Model

The vector space model was the basis for most of the research in information retrieval in the 1960s and 1970s, and papers using this model continue to appear at conferences. It has the advantage of being a simple and intuitively appealing framework for implementing term weighting, ranking, and relevance feedback. Historically, it was very important in introducing these concepts, and effective techniques have been developed through years of experimentation. As a retrieval model, however, it has major flaws. Although it provides a convenient computational framework, it provides little guidance on the details of how weighting and ranking algorithms are related to relevance.

In this model, documents and queries are assumed to be part of a t-dimensional vector space, where t is the number of index terms (words, stems, phrases, etc.). A document D_i is represented by a vector of index terms:

D_i = (d_i1, d_i2, . . . , d_it),

where d_ij represents the weight of the jth term. A document collection containing n documents can be represented as a matrix of term weights, where each row represents a document and each column describes weights that were assigned to a term for a particular document:

          Term_1   Term_2   . . .   Term_t
Doc_1     d_11     d_12     . . .   d_1t
Doc_2     d_21     d_22     . . .   d_2t
. . .
Doc_n     d_n1     d_n2     . . .   d_nt

Figure 7.1 gives a simple example of the vector representation for four documents. The term-document matrix has been rotated so that now the terms are the rows and the documents are the columns. The term weights are simply the count of the terms in the document. Stopwords are not indexed in this example, and the words have been stemmed. Document D_3, for example, is represented by the vector (1, 1, 0, 2, 0, 1, 0, 1, 0, 0, 1).

D_1  Tropical Freshwater Aquarium Fish.
D_2  Tropical Fish, Aquarium Care, Tank Setup.
D_3  Keeping Tropical Fish and Goldfish in Aquariums, and Fish Bowls.
D_4  The Tropical Tank Homepage - Tropical Fish and Aquariums.

Terms         Documents
              D_1   D_2   D_3   D_4
aquarium       1     1     1     1
bowl           0     0     1     0
care           0     1     0     0
fish           1     1     2     1
freshwater     1     0     0     0
goldfish       0     0     1     0
homepage       0     0     0     1
keep           0     0     1     0
setup          0     1     0     0
tank           0     1     0     1
tropical       1     1     1     2

Fig. 7.1. Term-document matrix for a collection of four documents

Queries are represented the same way as documents. That is, a query Q is represented by a vector of t weights:

Q = (q_1, q_2, . . . , q_t),

where q_j is the weight of the jth term in the query. If, for example, the query was "tropical fish", then using the vector representation in Figure 7.1, the query would be (0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1). One of the appealing aspects of the vector space model is the use of simple diagrams to visualize the documents and queries. Typically, they are shown as points or vectors in a three-dimensional picture, as in Figure 7.2. Although this can be helpful for teaching, it is misleading to think that an intuition developed using three dimensions can be applied to the actual high-dimensional document space. Remember that the t terms represent all the document features that are indexed. In enterprise and web applications, this corresponds to hundreds of thousands or even millions of dimensions.
Fig. 7.2. Vector representation of documents and queries (a three-dimensional sketch showing two document vectors and a query vector plotted against term axes)

Given this representation, documents could be ranked by computing the distance between the points representing the documents and the query. More commonly, a similarity measure is used (rather than a distance or dissimilarity measure), so that the documents with the highest scores are the most similar to the query. A number of similarity measures have been proposed and tested for this purpose. The most successful of these is the cosine correlation similarity measure. The cosine correlation measures the cosine of the angle between the query and the document vectors. When the vectors are normalized so that all documents and queries are represented by vectors of equal length, the cosine of the angle between two identical vectors will be 1 (the angle is zero), and for two vectors that do not share any non-zero terms, the cosine will be 0. The cosine measure is defined as:

Cosine(D_i, Q) = ( Σ_{j=1..t} d_ij · q_j ) / sqrt( Σ_{j=1..t} d_ij² · Σ_{j=1..t} q_j² )

The numerator of this measure is the sum of the products of the term weights for the matching query and document terms (known as the dot product or inner product). The denominator normalizes this score by dividing by the product of the lengths of the two vectors. There is no theoretical reason why the cosine correlation should be preferred to other similarity measures, but it does perform somewhat better in evaluations of search quality.

As an example, consider two documents D_1 = (0.5, 0.8, 0.3) and D_2 = (0.9, 0.4, 0.2) indexed by three terms, where the numbers represent term weights. Given the query Q = (1.5, 1.0, 0) indexed by the same terms, the cosine measures for the two documents are:

Cosine(D_1, Q) = ((0.5 × 1.5) + (0.8 × 1.0)) / sqrt((0.5² + 0.8² + 0.3²) × (1.5² + 1.0²))
              = 1.55 / sqrt(0.98 × 3.25)
              = 0.87

Cosine(D_2, Q) = ((0.9 × 1.5) + (0.4 × 1.0)) / sqrt((0.9² + 0.4² + 0.2²) × (1.5² + 1.0²))
              = 1.75 / sqrt(1.01 × 3.25)
              = 0.97

The second document has a higher score because it has a high weight for the first term, which also has a high weight in the query. Even this simple example shows that ranking based on the vector space model is able to reflect term importance and the number of matching terms, which is not possible in Boolean retrieval.
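The calculation above can be reproduced with a few lines of code; this is a minimal sketch for the small example, not the weighting and scoring pipeline of a real search engine, which works with sparse index structures rather than dense arrays.

    // A minimal sketch of the cosine correlation for the small example above.
    public class CosineExample {
        static double cosine(double[] d, double[] q) {
            double dot = 0, dNorm = 0, qNorm = 0;
            for (int j = 0; j < d.length; j++) {
                dot += d[j] * q[j];
                dNorm += d[j] * d[j];
                qNorm += q[j] * q[j];
            }
            return dot / Math.sqrt(dNorm * qNorm);
        }

        public static void main(String[] args) {
            double[] d1 = {0.5, 0.8, 0.3}, d2 = {0.9, 0.4, 0.2}, q = {1.5, 1.0, 0.0};
            System.out.printf("%.2f %.2f%n", cosine(d1, q), cosine(d2, q));  // 0.87 0.97
        }
    }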
In this discussion, we have yet to say anything about the form of the term weighting used in the vector space model. In fact, many different weighting schemes have been tried over the years. Most of these are variations on tf.idf weighting, which was described briefly in Chapter 2. The term frequency component, tf, reflects the importance of a term in a document (or query) D_i. This is usually computed as a normalized count of the term occurrences in a document, for example by

tf_ik = f_ik / Σ_{j=1..t} f_ij

where tf_ik is the term frequency weight of term k in document D_i, and f_ik is the number of occurrences of term k in the document. In the vector space model, normalization is part of the cosine measure. A document collection can contain documents of many different lengths. Although normalization accounts for this to some degree, long documents can have many terms occurring once and others occurring hundreds of times. Retrieval experiments have shown that to reduce the impact of these frequent terms, it is effective to use the logarithm of the number of term occurrences in tf weights rather than the raw count.

The inverse document frequency component (idf) reflects the importance of the term in the collection of documents. The more documents that a term occurs in, the less discriminating the term is between documents and, consequently, the less useful it will be in retrieval. The typical form of this weight is

idf_k = log (N / n_k)

where idf_k is the inverse document frequency weight for term k, N is the number of documents in the collection, and n_k is the number of documents in which term k occurs. The form of this weight was developed by intuition and experiment, although an argument can be made that idf measures the amount of information carried by the term, as defined in information theory (Robertson, 2004).

The effects of these two weights are combined by multiplying them (hence the name tf.idf). The reason for combining them this way is, once again, mostly empirical. Given this, the typical form of document term weighting in the vector space model is:

d_ik = ( (log(f_ik) + 1) · log(N / n_k) ) / sqrt( Σ_{k=1..t} [ (log(f_ik) + 1.0) · log(N / n_k) ]² )

The form of query term weighting is essentially the same. Adding 1 to the term frequency component ensures that terms with frequency 1 have a non-zero weight. Note that, in this model, term weights are computed only for terms that occur in the document (or query). Given that the cosine measure normalization is incorporated into the weights, the score for a document is computed using simply the dot product of the document and query vectors.

Although there is no explicit definition of relevance in the vector space model, there is an implicit assumption that relevance is related to the similarity of query and document vectors. In other words, documents "closer" to the query are more likely to be relevant. This is primarily a model of topical relevance, although features related to user relevance could be incorporated into the vector representation. No assumption is made about whether relevance is binary or multivalued.
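Returning to the weighting formulas above, the following minimal sketch computes normalized tf.idf weights for the terms of a single document. The term counts and document frequencies are invented for illustration, and the logarithm base is left as the natural log since the text does not fix one.

    import java.util.Map;

    // A minimal sketch of the tf.idf term weighting described above: the weight
    // for term k in document i is (log f_ik + 1) * log(N / n_k), normalized by
    // the length of the document's weight vector (the cosine normalization).
    public class TfIdfWeights {
        public static void main(String[] args) {
            long N = 1000;                              // documents in the collection
            Map<String, Integer> termCounts = Map.of(   // f_ik for one document
                    "tropical", 4, "fish", 9, "aquarium", 2);
            Map<String, Long> docFreqs = Map.of(        // n_k, invented for illustration
                    "tropical", 100L, "fish", 300L, "aquarium", 50L);

            double norm = 0;
            for (String t : termCounts.keySet()) {
                double w = raw(termCounts.get(t), N, docFreqs.get(t));
                norm += w * w;
            }
            norm = Math.sqrt(norm);
            for (String t : termCounts.keySet())
                System.out.printf("%s %.3f%n", t,
                        raw(termCounts.get(t), N, docFreqs.get(t)) / norm);
        }

        static double raw(int f, long N, long n) {
            return (Math.log(f) + 1.0) * Math.log((double) N / n);
        }
    }

With weights normalized this way, scoring a document against a query reduces to the dot product of the two weight vectors, as noted above.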
7.2 Probabilistic Models

One of the features that a retrieval model should provide is a clear statement about the assumptions upon which it is based. The Boolean and vector space approaches make implicit assumptions about relevance and text representation that impact the design and effectiveness of ranking algorithms. The ideal situation would be to show that, given the assumptions, a ranking algorithm based on the retrieval model will achieve better effectiveness than any other approach. Such proofs are actually very hard to come by in information retrieval, since we are trying to formalize a complex human activity. The validity of a retrieval model generally has to be established empirically, rather than theoretically.

One early theoretical statement about effectiveness, known as the Probability Ranking Principle (Robertson, 1977/1997), encouraged the development of probabilistic retrieval models, which are the dominant paradigm today. These models have achieved this status because probability theory is a strong foundation for representing and manipulating the uncertainty that is an inherent part of the information retrieval process. The Probability Ranking Principle, as originally stated, is as follows:

If a reference retrieval system's response to each request is a ranking of the documents in the collection in order of decreasing probability of relevance to the user who submitted the request, where the probabilities are estimated as accurately as possible on the basis of whatever data have been made available to the system for this purpose, the overall effectiveness of the system to its user will be the best that is obtainable on the basis of those data.

Given some assumptions, such as that the relevance of a document to a query is independent of other documents, it is possible to show that this statement is true, in the sense that ranking by probability of relevance will maximize precision, which is the proportion of relevant documents, at any given rank (for example, in the top 10 documents). Unfortunately, the Probability Ranking Principle doesn't tell us how to calculate or estimate the probability of relevance. There are many probabilistic retrieval models, and each one proposes a different method for estimating this probability. Most of the rest of this chapter discusses some of the most important probabilistic models.

In this section, we start with a simple probabilistic model based on treating information retrieval as a classification problem.
We then describe a popular and effective ranking algorithm that is based on this model.

7.2.1 Information Retrieval as Classification

In any retrieval model that assumes relevance is binary, there will be two sets of documents, the relevant documents and the non-relevant documents, for each query. Given a new document, the task of a search engine could be described as deciding whether the document belongs in the relevant set or the non-relevant set. That is, the system should classify the document as relevant or non-relevant, and retrieve it if it is relevant.

Given some way of calculating the probability that the document is relevant and the probability that it is non-relevant, it would seem reasonable to classify the document into the set that has the highest probability. In other words, we would decide that a document D is relevant if P(R|D) > P(NR|D), where P(R|D) is a conditional probability representing the probability of relevance given the representation of that document, and P(NR|D) is the conditional probability of non-relevance (Figure 7.3). This is known as the Bayes Decision Rule, and a system that classifies documents this way is called a Bayes classifier. In Chapter 9, we discuss other applications of classification (such as spam filtering) and other classification techniques, but here we focus on the ranking algorithm that results from this probabilistic retrieval model based on classification.

1 A "reference retrieval system" would now be called a search engine.
2 Note that we never talk about "irrelevant" documents in information retrieval; instead they are "non-relevant."

Fig. 7.3. Classifying a document as relevant or non-relevant

The question that faces us now is how to compute these probabilities. To start with, let's focus on P(R|D). It's not clear how we would go about calculating this, but given information about the relevant set, we should be able to calculate P(D|R). For example, if we had information about how often specific words occurred in the relevant set, then, given a new document, it would be relatively straightforward to calculate how likely it would be to see the combination of words in the document occurring in the relevant set. Let's assume that the probability of the word "president" in the relevant set is 0.02, and the probability of "lincoln" is 0.03. If a new document contains the words "president" and "lincoln", we could say that the probability of observing that combination of words in the relevant set is 0.02 × 0.03 = 0.0006, assuming that the two words occur independently.
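A small Java sketch of this calculation is given below. The relevant-set probabilities for "president" and "lincoln" are the assumed values from the example; the independence assumption means that they are simply multiplied together.

import java.util.Map;

// The probability of a combination of words occurring in the relevant set,
// under the independence assumption, is the product of the individual word
// probabilities. The values below are the assumed ones from the text.
public class RelevantSetProbability {
    public static void main(String[] args) {
        Map<String, Double> pRelevant = Map.of("president", 0.02, "lincoln", 0.03);
        String[] documentWords = {"president", "lincoln"};

        double pDocGivenRelevant = 1.0;
        for (String word : documentWords) {
            pDocGivenRelevant *= pRelevant.get(word); // P(D|R) under independence
        }
        System.out.println(pDocGivenRelevant);        // ≈ 0.0006
    }
}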
So how does calculating P(D|R) get us to the probability of relevance? It turns out there is a relationship between P(R|D) and P(D|R) that is expressed by Bayes' Rule:

P(R|D) = \frac{P(D|R)P(R)}{P(D)}

where P(R) is the a priori probability of relevance (in other words, how likely any document is to be relevant), and P(D) acts as a normalizing constant. Given this, we can express our decision rule in the following way: classify a document as relevant if P(D|R)P(R) > P(D|NR)P(NR). This is the same as classifying a document as relevant if:

\frac{P(D|R)}{P(D|NR)} > \frac{P(NR)}{P(R)}

The left-hand side of this equation is known as the likelihood ratio. In most classification applications, such as spam filtering, the system must decide which class the document belongs to in order to take the appropriate action. For information retrieval, a search engine only needs to rank documents, rather than make that decision (which is hard). If we use the likelihood ratio as a score, the highly ranked documents will be those that have a high likelihood of belonging to the relevant set.

To calculate the document scores, we still need to decide how to come up with values for P(D|R) and P(D|NR). The simplest approach is to make the same assumptions that we made in our earlier example; that is, we represent documents as a combination of words and the relevant and non-relevant sets using word probabilities. In this model, documents are represented as a vector of binary features, D = (d_1, d_2, . . . , d_t), where d_i = 1 if term i is present in the document, and 0 otherwise. The other major assumption we make is term independence (also known as the Naïve Bayes assumption). This means we can estimate P(D|R) by the product of the individual term probabilities \prod_{i=1}^{t} P(d_i|R) (and similarly for P(D|NR)). Because this model makes the assumptions of term independence and binary features in documents, it is known as the binary independence model.

3 Given two events A and B, the joint probability P(A ∩ B) is the probability of both events occurring together. In general, P(A ∩ B) = P(A|B)P(B). If A and B are independent, this means that P(A ∩ B) = P(A)P(B).
4 Named after Thomas Bayes, a British mathematician.

Words obviously do not occur independently in text. If the word "Microsoft" occurs in a document, it is very likely that the word "Windows" will also occur. The assumption of term independence, however, is a common one since it usually simplifies the mathematics involved in the model. Models that allow some form of dependence between terms will be discussed later in this chapter.

Recall that a document in this model is a vector of 1s and 0s representing the presence and absence of terms. For example, if there were five terms indexed, one of the document representations might be (1, 0, 0, 1, 1), meaning that the document contains terms 1, 4, and 5. To calculate the probability of this document occurring in the relevant set, we need the probabilities that the terms are 1 or 0 in the relevant set. If p_i is the probability that term i occurs (has the value 1) in a document from the relevant set, then the probability of our example document occurring in the relevant set is p_1 × (1 − p_2) × (1 − p_3) × p_4 × p_5. The probability (1 − p_2) is the probability of term 2 not occurring in the relevant set. For the non-relevant set, we use s_i to represent the probability of term i occurring.

Going back to the likelihood ratio, using p_i and s_i gives us a score of

\frac{P(D|R)}{P(D|NR)} = \prod_{i:d_i=1} \frac{p_i}{s_i} \cdot \prod_{i:d_i=0} \frac{1-p_i}{1-s_i}

where \prod_{i:d_i=1} means that it is a product over the terms that have the value 1 in the document.
We can now do a bit of mathematical manipulation to get:

\prod_{i:d_i=1} \frac{p_i}{s_i} \cdot \left( \prod_{i:d_i=1} \frac{1-s_i}{1-p_i} \cdot \prod_{i:d_i=1} \frac{1-p_i}{1-s_i} \right) \cdot \prod_{i:d_i=0} \frac{1-p_i}{1-s_i}

= \prod_{i:d_i=1} \frac{p_i(1-s_i)}{s_i(1-p_i)} \cdot \prod_{i} \frac{1-p_i}{1-s_i}

The second product is over all terms and is therefore the same for all documents, so we can ignore it for ranking. Since multiplying lots of small numbers can lead to problems with the accuracy of the result, we can equivalently use the logarithm of the product, which means that the scoring function is:

\sum_{i:d_i=1} \log \frac{p_i(1-s_i)}{s_i(1-p_i)}

5 In many descriptions of this model, p_i and q_i are used for these probabilities. We use s_i to avoid confusion with the q_i used to represent query terms.

You might be wondering where the query has gone, given that this is a document ranking algorithm for a specific query. In many cases, the query provides us with the only information we have about the relevant set. We can assume that, in the absence of other information, terms that are not in the query will have the same probability of occurrence in the relevant and non-relevant documents (i.e., p_i = s_i). In that case, the summation will only be over terms that are both in the query and in the document. This means that, given a query, the score for a document is simply the sum of the term weights for all matching terms.

If we have no other information about the relevant set, we could make the additional assumptions that p_i is a constant and that s_i could be estimated by using the term occurrences in the whole collection as an approximation. We make the second assumption based on the fact that the number of relevant documents is much smaller than the total number of documents in the collection. With a value of 0.5 for p_i in the scoring function described earlier, this gives a term weight for term i of

\log \frac{0.5(1 - \frac{n_i}{N})}{\frac{n_i}{N}(1 - 0.5)} = \log \frac{N - n_i}{n_i}

where N is the number of documents in the collection, and n_i is the number of documents that contain term i. This shows that, in the absence of information about the relevant documents, the term weight derived from the binary independence model is very similar to an idf weight. There is no tf component, because the documents were assumed to have binary features.

If we do have information about term occurrences in the relevant and non-relevant sets, it can be summarized in a contingency table, shown in Table 7.1. This information could be obtained through relevance feedback, where users identify relevant documents in initial rankings. In this table, r_i is the number of relevant documents containing term i, n_i is the number of documents containing term i, N is the total number of documents in the collection, and R is the number of relevant documents for this query.

            Relevant     Non-relevant          Total
d_i = 1     r_i          n_i - r_i             n_i
d_i = 0     R - r_i      N - n_i - R + r_i     N - n_i
Total       R            N - R                 N

Table 7.1. Contingency table of term occurrences for a particular query

Given this table, the obvious estimates would be p_i = r_i/R (the number of relevant documents that contain a term divided by the total number of relevant documents) and s_i = (n_i - r_i)/(N - R) (the number of non-relevant documents that contain a term divided by the total number of non-relevant documents).
Using these estimates could cause a problem, however, if some of the entries in the contingency table were zeros. If r_i was zero, for example, the term weight would be log 0. To avoid this, a standard solution is to add 0.5 to each count (and 1 to the totals), which gives us estimates of p_i = (r_i + 0.5)/(R + 1) and s_i = (n_i - r_i + 0.5)/(N - R + 1). Putting these estimates into the scoring function gives us:

\sum_{i:d_i=q_i=1} \log \frac{(r_i + 0.5)/(R - r_i + 0.5)}{(n_i - r_i + 0.5)/(N - n_i - R + r_i + 0.5)}

Although this document score sums term weights for just the matching query terms, with relevance feedback the query can be expanded to include other important terms from the relevant set. Note that if we have no relevance information, we can set r_i and R to 0, which would give a p_i value of 0.5, and would produce the idf-like term weight discussed before.

So how good is this document score when used for ranking? Not very good, it turns out. Although it does provide a method of incorporating relevance information, in most cases we don't have this information and instead would be using term weights that are similar to idf weights. The absence of a tf component makes a significant difference to the effectiveness of the ranking, and most effectiveness measures will drop by about 50% if the ranking ignores this information. This means, for example, that we might see 50% fewer relevant documents in the top ranks if we used the binary independence model ranking instead of the best tf.idf ranking.

It turns out, however, that the binary independence model is the basis for one of the most effective and popular ranking algorithms, known as BM25.

6 We use the term estimate for a probability value calculated using data such as a contingency table because this value is only an estimate for the true value of the probability and would change if more data were available.
7 BM stands for Best Match, and 25 is just a numbering scheme used by Robertson and his co-workers to keep track of weighting variants (Robertson & Walker, 1994).

7.2.2 The BM25 Ranking Algorithm

BM25 extends the scoring function for the binary independence model to include document and query term weights. The extension is based on probabilistic arguments and experimental validation, but it is not a formal model. BM25 has performed very well in TREC retrieval experiments and has influenced the ranking algorithms of commercial search engines, including web search engines. There are some variations of the scoring function for BM25, but the most common form is:

\sum_{i \in Q} \log \frac{(r_i + 0.5)/(R - r_i + 0.5)}{(n_i - r_i + 0.5)/(N - n_i - R + r_i + 0.5)} \cdot \frac{(k_1 + 1)f_i}{K + f_i} \cdot \frac{(k_2 + 1)qf_i}{k_2 + qf_i}

where the summation is now over all terms in the query; N, R, n_i, and r_i are the same as described in the last section, with the additional condition that r_i and R are set to zero if there is no relevance information; f_i is the frequency of term i in the document; qf_i is the frequency of term i in the query; and k_1, k_2, and K are parameters whose values are set empirically.

The constant k_1 determines how the tf component of the term weight changes as f_i increases. If k_1 = 0, the term frequency component would be ignored and only term presence or absence would matter. If k_1 is large, the term weight component would increase nearly linearly with f_i.
In TREC experiments, a typical value for k_1 is 1.2, which causes the effect of f_i to be very non-linear, similar to the use of log f in the term weights discussed in section 7.1.2. This means that after three or four occurrences of a term, additional occurrences will have little impact. The constant k_2 has a similar role in the query term weight. Typical values for this parameter are in the range 0 to 1,000, meaning that performance is less sensitive to k_2 than it is to k_1. This is because query term frequencies are much lower and less variable than document term frequencies.

K is a more complicated parameter that normalizes the tf component by document length. Specifically,

K = k_1((1 - b) + b \cdot \frac{dl}{avdl})

where b is a parameter, dl is the length of the document, and avdl is the average length of a document in the collection. The constant b regulates the impact of the length normalization, where b = 0 corresponds to no length normalization, and b = 1 is full normalization. In TREC experiments, a value of b = 0.75 was found to be effective.

As an example calculation, let's consider a query with two terms, "president" and "lincoln", each of which occurs only once in the query (qf_i = 1). We will consider the typical case where we have no relevance information (R and r_i are zero). Let's assume that we are searching a collection of 500,000 documents (N), and that in this collection, "president" occurs in 40,000 documents (n_1 = 40,000) and "lincoln" occurs in 300 documents (n_2 = 300). In the document we are scoring (which is about President Lincoln), "president" occurs 15 times (f_1 = 15) and "lincoln" occurs 25 times (f_2 = 25). The document length is 90% of the average length (dl/avdl = 0.9). The parameter values we use are k_1 = 1.2, b = 0.75, and k_2 = 100. With these values, K = 1.2 · (0.25 + 0.75 · 0.9) = 1.11, and the document score is:

BM25(Q, D) = \log \frac{(0 + 0.5)/(0 - 0 + 0.5)}{(40000 - 0 + 0.5)/(500000 - 40000 - 0 + 0 + 0.5)} \times \frac{(1.2 + 1) \times 15}{1.11 + 15} \times \frac{(100 + 1) \times 1}{100 + 1}

+ \log \frac{(0 + 0.5)/(0 - 0 + 0.5)}{(300 - 0 + 0.5)/(500000 - 300 - 0 + 0 + 0.5)} \times \frac{(1.2 + 1) \times 25}{1.11 + 25} \times \frac{(100 + 1) \times 1}{100 + 1}

= \log 460000.5/40000.5 \cdot 33/16.11 \cdot 101/101 + \log 499700.5/300.5 \cdot 55/26.11 \cdot 101/101

= 2.44 \cdot 2.05 \cdot 1 + 7.42 \cdot 2.11 \cdot 1

= 5.00 + 15.66 = 20.66

Notice the impact from the first part of the weight that, without relevance information, is nearly the same as an idf weight (as we discussed in section 7.2.1). Because the term "lincoln" is much less frequent in the collection, it has a much higher idf component (7.42 versus 2.44). Table 7.2 gives scores for different numbers of term occurrences. This shows the importance of the "lincoln" term and that even one occurrence of a term can make a large difference in the score. Reducing the number of term occurrences from 25 or 15 to 1 makes a significant but not dramatic difference. This example also demonstrates that it is possible for a document containing a large number of occurrences of a single important term to score higher than a document containing both query terms (15.66 versus 12.74).

Frequency of    Frequency of    BM25
"president"     "lincoln"       score
15              25              20.66
15               1              12.74
15               0               5.00
 1              25              18.2
 0              25              15.66

Table 7.2. BM25 scores for an example document
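The same calculation can be written as a short program. The Java sketch below reproduces the worked example with no relevance information (r_i = R = 0); because the text rounds intermediate values to two decimal places it reports 20.66, while the unrounded computation gives approximately 20.63.

public class BM25Example {
    static double k1 = 1.2, k2 = 100, b = 0.75;

    // One term's contribution, with r = R = 0 (no relevance information).
    static double termScore(double n, double N, double f, double qf, double K) {
        double idfPart = Math.log(((0 + 0.5) / (0 - 0 + 0.5)) /
                                  ((n - 0 + 0.5) / (N - n - 0 + 0 + 0.5)));
        double tfPart  = ((k1 + 1) * f) / (K + f);
        double qtfPart = ((k2 + 1) * qf) / (k2 + qf);
        return idfPart * tfPart * qtfPart;
    }

    public static void main(String[] args) {
        double N = 500000;            // documents in the collection
        double dlOverAvdl = 0.9;      // document length / average length
        double K = k1 * ((1 - b) + b * dlOverAvdl);   // = 1.11
        double score = termScore(40000, N, 15, 1, K)  // "president"
                     + termScore(300,   N, 25, 1, K); // "lincoln"
        System.out.printf("BM25(Q, D) = %.2f%n", score); // ≈ 20.63
    }
}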
The score calculation may seem complicated, but remember that some of the calculation of term weights can occur at indexing time, before processing any query. If there is no relevance information, scoring a document simply involves adding the weights for matching query terms, with a small additional calculation if query terms occur more than once (i.e., if qf_i > 1). Another important point is that the parameter values for the BM25 ranking algorithm can be tuned (i.e., adjusted to obtain the best effectiveness) for each application. The process of tuning is described further in section 7.7 and Chapter 8.

To summarize, BM25 is an effective ranking algorithm derived from a model of information retrieval viewed as classification. This model focuses on topical relevance and makes an explicit assumption that relevance is binary. In the next section, we discuss another probabilistic model that incorporates term frequency directly in the model, rather than being added in as an extension to improve performance.

7.3 Ranking Based on Language Models

Language models are used to represent text in a variety of language technologies, such as speech recognition, machine translation, and handwriting recognition. The simplest form of language model, known as a unigram language model, is a probability distribution over the words in the language. This means that the language model associates a probability of occurrence with every word in the index vocabulary for a collection. For example, if the documents in a collection contained just five different words, a possible language model for that collection might be (0.2, 0.1, 0.35, 0.25, 0.1), where each number is the probability of a word occurring. If we treat each document as a sequence of words, then the probabilities in the language model predict what the next word in the sequence will be. For example, if the five words in our language were "girl", "cat", "the", "boy", and "touched", then the probabilities predict which of these words will be next. These words cover all the possibilities, so the probabilities must add to 1. Because this is a unigram model, the previous words have no impact on the prediction. With this model, for example, it is just as likely to get the sequence "girl cat" (probability 0.2 × 0.1) as "girl touched" (probability 0.2 × 0.1).

In applications such as speech recognition, n-gram language models that predict words based on longer sequences are used. An n-gram model predicts a word based on the previous n − 1 words. The most common n-gram models are bigram (predicting based on the previous word) and trigram (predicting based on the previous two words) models. Although bigram models have been used in information retrieval to represent two-word phrases (see section 4.3.5), we focus our discussion on unigram models because they are simpler and have proven to be very effective as the basis for ranking algorithms.
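As a quick check of this arithmetic, the tiny Java sketch below encodes the five-word unigram model from the example and multiplies word probabilities for a sequence. Word order does not matter to the model, so "girl cat" and "girl touched" get the same probability.

import java.util.Map;

public class UnigramExample {
    // P(w1 ... wn) = P(w1) * ... * P(wn) under a unigram model.
    static double sequenceProbability(Map<String, Double> lm, String... words) {
        double p = 1.0;
        for (String w : words) p *= lm.get(w);
        return p;
    }

    public static void main(String[] args) {
        Map<String, Double> lm = Map.of(
            "girl", 0.2, "cat", 0.1, "the", 0.35, "boy", 0.25, "touched", 0.1);
        System.out.println(sequenceProbability(lm, "girl", "cat"));     // ≈ 0.02
        System.out.println(sequenceProbability(lm, "girl", "touched")); // ≈ 0.02
    }
}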
For search applications, we use language models to represent the topical content of a document. A topic is something that is talked about often but rarely defined in information retrieval discussions. In this approach, we define a topic as a probability distribution over words (in other words, a language model). For example, if a document is about fishing in Alaska, we would expect to see words associated with fishing and locations in Alaska with high probabilities in the language model. If it is about fishing in Florida, some of the high-probability words will be the same, but there will be more high-probability words associated with locations in Florida. If instead the document is about fishing games for computers, most of the high-probability words will be associated with game manufacturers and computer use, although there will still be some important words about fishing. Note that a topic language model, or topic model for short, contains probabilities for all words, not just the most important. Most of the words will have "default" probabilities that will be the same for any text, but the words that are important for the topic will have unusually high probabilities.

A language model representation of a document can be used to "generate" new text by sampling words according to the probability distribution. If we imagine the language model as a big bucket of words, where the probabilities determine how many instances of a word are in the bucket, then we can generate text by reaching in (without looking), drawing out a word, writing it down, putting the word back in the bucket, and drawing again. Note that we are not saying that we can generate the original document by this process. In fact, because we are only using a unigram model, the generated text is going to look pretty bad, with no syntactic structure. Important words for the topic of the document will, however, appear often. Intuitively, we are using the language model as a very approximate model for the topic the author of the document was thinking about when he was writing it.

When text is modeled as a finite sequence of words, where at each point in the sequence there are t different possible words, this corresponds to assuming a multinomial distribution over words. Although there are alternatives, multinomial language models are the most common in information retrieval. One of the limitations of multinomial models that has been pointed out is that they do not describe text burstiness well, which is the observation that once a word is "pulled out of the bucket," it tends to be pulled out repeatedly.

8 We discuss the multinomial model in the context of classification in Chapter 9.

In addition to representing documents as language models, we can also represent the topic of the query as a language model. In this case, the intuition is that the language model is a representation of the topic that the information seeker had in mind when she was writing the query. This leads to three obvious possibilities for retrieval models based on language models: one based on the probability of generating the query text from a document language model, one based on generating the document text from a query language model, and one based on comparing the language models representing the query and document topics. In the next two sections, we describe these retrieval models in more detail.

7.3.1 Query Likelihood Ranking

In the query likelihood retrieval model, we rank documents by the probability that the query text could be generated by the document language model. In other words, we calculate the probability that we could pull the query words out of the "bucket" of words representing the document.
This is a model of topical relevance, in the sense that the probability of query generation is the measure of how likely it is that a document is about the same topic as the query.

Since we start with a query, we would in general like to calculate P(D|Q) to rank the documents. Using Bayes' Rule, we can calculate this by

P(D|Q) \overset{rank}{=} P(Q|D)P(D)

where the symbol \overset{rank}{=}, as we mentioned previously, means that the right-hand side is rank equivalent to the left-hand side (i.e., we can ignore the normalizing constant P(Q)), P(D) is the prior probability of a document, and P(Q|D) is the query likelihood given the document. In most cases, P(D) is assumed to be uniform (the same for all documents), and so will not affect the ranking. Models that assign non-uniform prior probabilities based on, for example, document date or document length can be useful in some applications, but we will make the simpler uniform assumption here. Given that assumption, the retrieval model specifies ranking documents by P(Q|D), which we calculate using the unigram language model for the document

P(Q|D) = \prod_{i=1}^{n} P(q_i|D)

where q_i is a query word, and there are n words in the query. To calculate this score, we need to have estimates for the language model probabilities P(q_i|D). The obvious estimate would be

P(q_i|D) = \frac{f_{q_i,D}}{|D|}

where f_{q_i,D} is the number of times word q_i occurs in document D, and |D| is the number of words in D. For a multinomial distribution, this is the maximum likelihood estimate, which means this is the estimate that makes the observed value of f_{q_i,D} most likely. The major problem with this estimate is that if any of the query words are missing from the document, the score given by the query likelihood model for P(Q|D) will be zero. This is clearly not appropriate for longer queries. For example, missing one word out of six should not produce a score of zero. We will also not be able to distinguish between documents that have different numbers of query words missing. Additionally, because we are building a topic model for a document, words associated with that topic should have some probability of occurring, even if they were not mentioned in the document. For example, a language model representing a document about computer games should have some non-zero probability for the word "RPG" even if that word was not mentioned in the document. A small probability for that word will enable the document to receive a non-zero score for the query "RPG computer games", although it will be lower than the score for a document that contains all three words.

Smoothing is a technique for avoiding this estimation problem and overcoming data sparsity, which means that we typically do not have large amounts of text to use for the language model probability estimates. The general approach to smoothing is to lower (or discount) the probability estimates for words that are seen in the document text, and assign that "leftover" probability to the estimates for the words that are not seen in the text. The estimates for unseen words are usually based on the frequency of occurrence of words in the whole document collection.
If P(q_i|C) is the probability for query word i in the collection language model for document collection C, then the estimate we use for an unseen word in a document is α_D P(q_i|C), where α_D is a coefficient controlling the probability assigned to unseen words. In general, α_D can depend on the document. In order that the probabilities sum to one, the probability estimate for a word that is seen in a document is (1 − α_D)P(q_i|D) + α_D P(q_i|C).

To make this clear, consider a simple example where there are only three words, w_1, w_2, and w_3, in our index vocabulary. If the collection probabilities for these three words, based on maximum likelihood estimates, are 0.3, 0.5, and 0.2, and the document probabilities based on maximum likelihood estimates are 0.5, 0.5, and 0.0, then the smoothed probability estimates for the document language model are:

P(w_1|D) = (1 − α_D) · 0.5 + α_D · 0.3
P(w_2|D) = (1 − α_D) · 0.5 + α_D · 0.5
P(w_3|D) = (1 − α_D) · 0.0 + α_D · 0.2 = α_D · 0.2

Note that term w_3 has a non-zero probability estimate, even though it did not occur in the document text. If we add these three probabilities, we get

P(w_1|D) + P(w_2|D) + P(w_3|D) = (1 − α_D) · (0.5 + 0.5) + α_D · (0.3 + 0.5 + 0.2) = 1 − α_D + α_D = 1

which confirms that the probabilities are consistent.

9 The collection language model probability is also known as the background language model probability, or just the background probability.

Different forms of estimation result from specifying the value of α_D. The simplest choice would be to set it to a constant, i.e., α_D = λ. The collection language model probability estimate we use for word q_i is c_{q_i}/|C|, where c_{q_i} is the number of times a query word occurs in the collection of documents, and |C| is the total number of word occurrences in the collection. This gives us an estimate for P(q_i|D) of:

p(q_i|D) = (1 − \lambda)\frac{f_{q_i,D}}{|D|} + \lambda\frac{c_{q_i}}{|C|}

This form of smoothing is known as the Jelinek-Mercer method. Substituting this estimate in the document score for the query likelihood model gives:

P(Q|D) = \prod_{i=1}^{n} ((1 − \lambda)\frac{f_{q_i,D}}{|D|} + \lambda\frac{c_{q_i}}{|C|})

As we have said before, since multiplying many small numbers together can lead to accuracy problems, we can use logarithms to turn this score into a rank-equivalent sum as follows:

\log P(Q|D) = \sum_{i=1}^{n} \log((1 − \lambda)\frac{f_{q_i,D}}{|D|} + \lambda\frac{c_{q_i}}{|C|})

Small values of λ produce less smoothing, and consequently the query tends to act more like a Boolean AND since the absence of any query word will penalize the score substantially. In addition, the relative weighting of words, as measured by the maximum likelihood estimates, will be important in determining the score. As λ approaches 1, the relative weighting will be less important, and the query acts more like a Boolean OR or a coordination level match. In TREC evaluations, it has been shown that values of λ around 0.1 work well for short queries, whereas values around 0.7 are better for much longer queries. Short queries tend to contain only significant words, and a low λ value will favor documents that contain all the query words. With much longer queries, missing a word is much less important, and a high λ places more emphasis on documents that contain a number of the high-probability words.

10 A coordination level match simply ranks documents by the number of matching query terms.
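The logarithmic form of the Jelinek-Mercer score is easy to implement. The Java sketch below is a minimal version; the document and collection statistics in main are assumed for illustration, and λ = 0.1 is used since this is a short query.

import java.util.Map;

// Query likelihood scoring with Jelinek-Mercer smoothing (log form).
public class JelinekMercerQL {
    static double score(String[] query, Map<String, Integer> docCounts, int docLen,
                        Map<String, Long> collCounts, long collLen, double lambda) {
        double logScore = 0.0;
        for (String q : query) {
            double fqD = docCounts.getOrDefault(q, 0);   // f_{q_i,D}
            double cq  = collCounts.getOrDefault(q, 0L); // c_{q_i}
            double p = (1 - lambda) * fqD / docLen + lambda * cq / collLen;
            logScore += Math.log(p);
        }
        return logScore;
    }

    public static void main(String[] args) {
        String[] query = {"president", "lincoln"};
        Map<String, Integer> doc = Map.of("president", 15, "lincoln", 25);
        Map<String, Long> coll = Map.of("president", 160_000L, "lincoln", 2_400L);
        System.out.println(score(query, doc, 1800, coll, 1_000_000_000L, 0.1));
        // ≈ -9.27 with these assumed statistics
    }
}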
At this point, it may occur to you that the query likelihood retrieval model doesn't have anything that looks like a tf.idf weight, and yet experiments show that it is at least as effective as the BM25 ranking algorithm. We can, however, demonstrate a relationship to tf.idf weights by manipulating the query likelihood score in the following way:

\log P(Q|D) = \sum_{i=1}^{n} \log((1-\lambda)\frac{f_{q_i,D}}{|D|} + \lambda\frac{c_{q_i}}{|C|})

= \sum_{i: f_{q_i,D} > 0} \log((1-\lambda)\frac{f_{q_i,D}}{|D|} + \lambda\frac{c_{q_i}}{|C|}) + \sum_{i: f_{q_i,D} = 0} \log(\lambda\frac{c_{q_i}}{|C|})

= \sum_{i: f_{q_i,D} > 0} \log\left(\frac{(1-\lambda)\frac{f_{q_i,D}}{|D|} + \lambda\frac{c_{q_i}}{|C|}}{\lambda\frac{c_{q_i}}{|C|}}\right) + \sum_{i=1}^{n} \log(\lambda\frac{c_{q_i}}{|C|})

\overset{rank}{=} \sum_{i: f_{q_i,D} > 0} \log\left(\frac{(1-\lambda)\frac{f_{q_i,D}}{|D|}}{\lambda\frac{c_{q_i}}{|C|}} + 1\right)

In the second line, we split the score into the words that occur in the document and those that don't occur (f_{q_i,D} = 0). In the third line, we add

\sum_{i: f_{q_i,D} > 0} \log(\lambda\frac{c_{q_i}}{|C|})

to the last term and subtract it from the first (where it ends up in the denominator), so there is no net effect. The last term is now the same for all documents and can be ignored for ranking. The final expression gives the document score in terms of a "weight" for matching query terms. Although this weight is not identical to a tf.idf weight, there are clear similarities in that it is directly proportional to the document term frequency and inversely proportional to the collection frequency.

A different form of estimation, and one that is generally more effective, comes from using a value of α_D that is dependent on document length. This approach is known as Dirichlet smoothing, for reasons we will discuss later, and uses

\alpha_D = \frac{\mu}{|D| + \mu}

where μ is a parameter whose value is set empirically. Substituting this expression for α_D in (1 − α_D)P(q_i|D) + α_D P(q_i|C) results in the probability estimation formula

p(q_i|D) = \frac{f_{q_i,D} + \mu\frac{c_{q_i}}{|C|}}{|D| + \mu}

which in turn leads to the following document score:

\log P(Q|D) = \sum_{i=1}^{n} \log\frac{f_{q_i,D} + \mu\frac{c_{q_i}}{|C|}}{|D| + \mu}

Similar to Jelinek-Mercer smoothing, small values of the parameter (μ in this case) give more importance to the relative weighting of words, and large values favor the number of matching terms. Typical values of μ that achieve the best results in TREC experiments are in the range 1,000 to 2,000 (remember that collection probabilities are very small), and Dirichlet smoothing is generally more effective than Jelinek-Mercer, especially for the short queries that are common in most search applications.

So where does Dirichlet smoothing come from? It turns out that a Dirichlet distribution is the natural way to specify prior knowledge when estimating the probabilities in a multinomial distribution. The process of Bayesian estimation determines probability estimates based on this prior knowledge and the observed text. The resulting probability estimate can be viewed as combining actual word counts from the text with pseudo-counts from the Dirichlet distribution. If we had no text, the probability estimate for term q_i would be μ(c_{q_i}/|C|)/μ, which is a reasonable guess based on the collection. The more text we have (i.e., for longer documents), the less influence the prior knowledge will have.

11 Named after the German mathematician Johann Peter Gustav Lejeune Dirichlet (the first name used seems to vary).
We can demonstrate the calculation of query likelihood document scores using the example given in section 7.2.2. The two query terms are "president" and "lincoln". For the term "president", f_{q_i,D} = 15, and let's assume that c_{q_i} = 160,000. For the term "lincoln", f_{q_i,D} = 25, and we will assume that c_{q_i} = 2,400. The number of word occurrences in the document |D| is assumed to be 1,800, and the number of word occurrences in the collection is 10^9 (500,000 documents times an average of 2,000 words). The value of μ used is 2,000. Given these numbers, the score for the document is:

QL(Q, D) = \log\frac{15 + 2000 \times (1.6 \times 10^5 / 10^9)}{1800 + 2000} + \log\frac{25 + 2000 \times (2400/10^9)}{1800 + 2000}

= \log(15.32/3800) + \log(25.005/3800)

= -5.51 + -5.02 = -10.53

A negative number? Remember that we are taking logarithms of probabilities in this scoring function, and the probabilities of word occurrence are small. The important issue is the effectiveness of the rankings produced using these scores. Table 7.3 shows the query likelihood scores for the same variations of term occurrences that were used in Table 7.2. Although the scores look very different for BM25 and QL, the rankings are similar, with the exception that the document containing 15 occurrences of "president" and 1 of "lincoln" is ranked higher than the document containing 0 occurrences of "president" and 25 occurrences of "lincoln" in the QL scores, whereas the reverse is true for BM25.

Frequency of    Frequency of    QL
"president"     "lincoln"       score
15              25              -10.53
15               1              -13.75
15               0              -19.05
 1              25              -12.99
 0              25              -14.40

Table 7.3. Query likelihood scores for an example document

To summarize, query likelihood is a simple probabilistic retrieval model that directly incorporates term frequency. The problem of coming up with effective term weights is replaced by probability estimation, which is better understood and has a formal basis. The basic query likelihood score with Dirichlet smoothing has similar effectiveness to BM25, although it does do better on most TREC collections. If more sophisticated smoothing based on topic models is used (described further in section 7.6), query likelihood consistently outperforms BM25. This means that instead of smoothing using the collection probabilities for words, we instead use word probabilities from similar documents.

The simplicity of the language model framework, combined with the ability to describe a variety of retrieval applications and the effectiveness of the associated ranking algorithms, make this approach a good choice for a retrieval model based on topical relevance.
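The following Java sketch reproduces this Dirichlet-smoothed calculation. Because the text rounds the two log terms to −5.51 and −5.02 it reports −10.53; the unrounded sum is approximately −10.54.

public class DirichletQL {
    public static void main(String[] args) {
        double mu = 2000;
        double docLen = 1800;
        double collLen = 1e9;   // total word occurrences in the collection

        // {term frequency in document, frequency in collection}
        double[][] queryTerms = { {15, 160_000},   // "president"
                                  {25, 2_400} };   // "lincoln"
        double score = 0.0;
        for (double[] term : queryTerms) {
            double f = term[0], c = term[1];
            score += Math.log((f + mu * c / collLen) / (docLen + mu));
        }
        System.out.printf("QL(Q, D) = %.2f%n", score); // ≈ -10.54
    }
}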
7.3.2 Relevance Models and Pseudo-Relevance Feedback

Although the basic query likelihood model has a number of advantages, it is limited in terms of how it models information needs and queries. It is difficult, for example, to incorporate information about relevant documents into the ranking algorithm, or to represent the fact that a query is just one of many possible queries that could be used to describe a particular information need. In this section, we show how this can be done by extending the basic model.

In the introduction to section 7.3, we mentioned that it is possible to represent the topic of a query as a language model. Instead of calling this the query language model, we use the name relevance model since it represents the topic covered by relevant documents. The query can be viewed as a very small sample of text generated from the relevance model, and relevant documents are much larger samples of text from the same model. Given some examples of relevant documents for a query, we could estimate the probabilities in the relevance model and then use this model to predict the relevance of new documents. In fact, this is a version of the classification model presented in section 7.2.1, where we interpret P(D|R) as the probability of generating the text in a document given a relevance model. This is also called the document likelihood model. Although this model, unlike the binary independence model, directly incorporates term frequency, it turns out that P(D|R) is difficult to calculate and compare across documents. This is because documents contain a large and extremely variable number of words compared to a query. Consider two documents D_a and D_b, for example, containing 5 and 500 words respectively. Because of the large difference in the number of words involved, the comparison of P(D_a|R) and P(D_b|R) for ranking will be more difficult than comparing P(Q|D_a) and P(Q|D_b), which use the same query and smoothed representations for the documents. In addition, we still have the problem of obtaining examples of relevant documents.

There is, however, another alternative. If we can estimate a relevance model from a query, we can compare this language model directly with the model for a document. Documents would then be ranked by the similarity of the document model to the relevance model. A document with a model that is very similar to the relevance model is likely to be on the same topic. The obvious next question is how to compare two language models. A well-known measure from probability theory and information theory, the Kullback-Leibler divergence (referred to as KL-divergence in this book), measures the difference between two probability distributions. Given the true probability distribution P and another distribution Q that is an approximation to P, the KL-divergence is defined as:

KL(P||Q) = \sum_{x} P(x) \log\frac{P(x)}{Q(x)}

Since KL-divergence is always positive and is larger for distributions that are further apart, we use the negative KL-divergence as the basis for the ranking function (i.e., smaller differences mean higher scores). In addition, KL-divergence is not symmetric, and it matters which distribution we pick as the true distribution. If we assume the true distribution to be the relevance model for the query (R) and the approximation to be the document language model (D), then the negative KL-divergence can be expressed as

\sum_{w \in V} P(w|R) \log P(w|D) - \sum_{w \in V} P(w|R) \log P(w|R)

where the summation is over all words w in the vocabulary V. The second term on the right-hand side of this equation does not depend on the document, and can be ignored for ranking.
Given a simple maximum likelihood estimate for P(w|R), based on the frequency in the query text (f_{w,Q}) and the number of words in the query (|Q|), the score for a document will be:

\sum_{w \in V} \frac{f_{w,Q}}{|Q|} \log P(w|D)

Although this summation is over all words in the vocabulary, words that do not occur in the query have a zero maximum likelihood estimate and will not contribute to the score. Also, query words with frequency k will contribute k × log P(w|D) to the score. This means that this score is rank equivalent to the query likelihood score described in the previous section. In other words, query likelihood is a special case of a retrieval model that ranks by comparing a relevance model based on a query to a document language model.

12 KL-divergence is also called information divergence, information gain, or relative entropy.

The advantage of the more general model is that it is not restricted to the simple method of estimating the relevance model using query term frequencies. If we regard the query words as a sample from the relevance model, then it seems reasonable to base the probability of a new sample word on the query words we have seen. In other words, the probability of pulling a word w out of the "bucket" representing the relevance model should depend on the n query words we have just pulled out. More formally, we can relate the probability of w to the conditional probability of observing w given that we just observed the query words q_1 . . . q_n by the approximation:

P(w|R) \approx P(w|q_1 . . . q_n)

By definition, we can express the conditional probability in terms of the joint probability of observing w with the query words:

P(w|R) \approx \frac{P(w, q_1 . . . q_n)}{P(q_1 . . . q_n)}

P(q_1 . . . q_n) is a normalizing constant and is calculated as:

P(q_1 . . . q_n) = \sum_{w \in V} P(w, q_1 . . . q_n)

Now the question is how to estimate the joint probability P(w, q_1 . . . q_n). Given a set of documents C represented by language models, we can calculate the joint probability as follows:

P(w, q_1 . . . q_n) = \sum_{D \in C} P(D) P(w, q_1 . . . q_n|D)

We can also make the assumption that:

P(w, q_1 . . . q_n|D) = P(w|D) \prod_{i=1}^{n} P(q_i|D)

When we substitute this expression for P(w, q_1 . . . q_n|D) into the previous equation, we get the following estimate for the joint probability:

P(w, q_1 . . . q_n) = \sum_{D \in C} P(D) P(w|D) \prod_{i=1}^{n} P(q_i|D)

How do we interpret this formula? The prior probability P(D) is usually assumed to be uniform and can be ignored. The expression \prod_{i=1}^{n} P(q_i|D) is, in fact, the query likelihood score for the document D. This means that the estimate for P(w, q_1 . . . q_n) is simply a weighted average of the language model probabilities for w in a set of documents, where the weights are the query likelihood scores for those documents.

Ranking based on relevance models actually requires two passes. The first pass ranks documents using query likelihood to obtain the weights that are needed for relevance model estimation. In the second pass, we use KL-divergence to rank documents by comparing the relevance model and the document model. Note also that we are in effect adding words to the query by smoothing the relevance model using documents that are similar to the query.
Many words that had zero probabilities in the relevance model based on query frequency estimates will now have non-zero values. What we are describing here is exactly the pseudo-relevance feedback process described in section 6.2.4. In other words, relevance models provide a formal retrieval model for pseudo-relevance feedback and query expansion. The following is a summary of the steps involved in ranking using relevance models:

1. Rank documents using the query likelihood score for query Q.
2. Select some number of the top-ranked documents to be the set C.
3. Calculate the relevance model probabilities P(w|R) using the estimate for P(w, q_1 . . . q_n).
4. Rank documents again using the KL-divergence score:

\sum_{w} P(w|R) \log P(w|D)

13 More accurately, this score is the negative cross entropy because we removed the term \sum_{w \in V} P(w|R) \log P(w|R).

Some of these steps require further explanation. In steps 1 and 4, the document language model probabilities P(w|D) should be estimated using Dirichlet smoothing. In step 2, the model allows the set C to be the whole collection, but because low-ranked documents have little effect on the estimation of P(w|R), usually only 10–50 of the top-ranked documents are used. This also makes the computation of P(w|R) substantially faster.

For similar reasons, the summation in step 4 is not done over all words in the vocabulary. Typically only a small number (10–25) of the highest-probability words are used. In addition, the importance of the original query words is emphasized by combining the original query frequency estimates with the relevance model estimates using a similar approach to Jelinek-Mercer, i.e., λP(w|Q) + (1 − λ)P(w|R), where λ is a mixture parameter whose value is determined empirically (0.5 is a typical value for TREC experiments). This combination makes it clear that estimating relevance models is basically a process for query expansion and smoothing.

The next important question, as for all retrieval models, is how well it works. Based on TREC experiments, ranking using relevance models is one of the best pseudo-relevance feedback techniques. In addition, relevance models produce a significant improvement in effectiveness compared to query likelihood ranking averaged over a number of queries. Like all current pseudo-relevance feedback techniques, however, the improvements are not consistent, and some queries can produce worse rankings or strange results.
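To make the two-pass process concrete, the following compact Java sketch (not the Galago implementation) puts the steps together. It assumes that each document's language model is available as a precomputed, Dirichlet-smoothed probability map over the full vocabulary, and it leaves out the truncation to the top 10–50 documents and 10–25 highest-probability words, as well as the mixing with the original query model.

import java.util.*;

public class RelevanceModelRanking {

    // Step 1: log P(Q|D), the query likelihood score for one document.
    static double queryLikelihood(List<String> query, Map<String, Double> docLM) {
        double logScore = 0.0;
        for (String q : query) logScore += Math.log(docLM.get(q));
        return logScore;
    }

    // Steps 2 and 3: estimate P(w|R) from the selected top-ranked documents,
    // using P(w, q1..qn) = sum_D P(w|D) prod_i P(q_i|D) with uniform P(D).
    static Map<String, Double> relevanceModel(List<String> query,
                                              List<Map<String, Double>> topDocs) {
        Map<String, Double> joint = new HashMap<>();
        for (Map<String, Double> docLM : topDocs) {
            double queryWeight = Math.exp(queryLikelihood(query, docLM));
            for (Map.Entry<String, Double> e : docLM.entrySet())
                joint.merge(e.getKey(), queryWeight * e.getValue(), Double::sum);
        }
        double norm = joint.values().stream().mapToDouble(Double::doubleValue).sum();
        joint.replaceAll((w, p) -> p / norm);   // normalize to obtain P(w|R)
        return joint;
    }

    // Step 4: rank documents by sum_w P(w|R) log P(w|D).
    static double klScore(Map<String, Double> relModel, Map<String, Double> docLM) {
        double score = 0.0;
        for (Map.Entry<String, Double> e : relModel.entrySet())
            score += e.getValue() * Math.log(docLM.get(e.getKey()));
        return score;
    }
}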
Tables 7.4 and 7.5 show the 16 highest-probability words from relevance models estimated using this technique with some example queries and a large collection of TREC news stories from the 1990s. Table 7.4 uses the top 10 documents from the query likelihood ranking to construct the relevance model, whereas Table 7.5 uses the top 50 documents. The first thing to notice is that, although the words are reasonable, they are very dependent on the collection of documents that is used. In the TREC news collection, for example, many of the stories that mention Abraham Lincoln are on the topic of the Lincoln Bedroom in the White House, which President Clinton used for guests and President Lincoln used as an office during the Civil War. These types of stories are reflected in the top probability words for the queries "president lincoln" and "abraham lincoln". Expanding the query using these words would clearly favor the retrieval of this type of story rather than more general biographies of Lincoln. The second observation is that there is not much difference between the words based on 10 documents and the words based on 50 documents. The words based on 50 documents are, however, somewhat more general because the larger set of documents contains a greater variety of topics. In the case of the query "tropical fish", the relevance model words based on 10 documents are clearly more related to the topic.

14 This is a considerably larger collection than was used to generate the term association tables in Chapter 6. Those tables were based on the ROBUST track data, which consists of just over half a million documents. These tables were generated using all the TREC news collections, which total more than six million documents.

president lincoln   abraham lincoln   fishing      tropical fish
lincoln             lincoln           fish         fish
president           america           farm         tropic
room                president         salmon       japan
bedroom             faith             new          aquarium
house               guest             wild         water
white               abraham           water        species
america             new               caught       aquatic
guest               room              catch        fair
serve               christian         tag          china
bed                 history           time         coral
washington          public            eat          source
old                 bedroom           raise        tank
office              war               city         reef
war                 politics          people       animal
long                old               fishermen    tarpon
abraham             national          boat         fishery

Table 7.4. Highest-probability terms from relevance model for four example queries (estimated using top 10 documents)

president lincoln   abraham lincoln   fishing      tropical fish
lincoln             lincoln           fish         fish
president           president         water        tropic
america             america           catch        water
new                 abraham           storm        reef
national            war               fishermen    species
man                 great             boat         river
white               civil             sea          new
war                 new               river        year
history             washington        time         country
clinton             two               bass         tuna
house               room              world        boat
booth               history           million      world
state               time              farm         time
time                politics          angle        center
kennedy             public            fly          japan
room                guest             trout        mile

Table 7.5. Highest-probability terms from relevance model for four example queries (estimated using top 50 documents)

In summary, ranking by comparing a model of the query to a model of the document using KL-divergence is a generalization of query likelihood scoring. This generalization allows for more accurate queries that reflect the relative importance of the words in the topic that the information seeker had in mind when he was writing the query. Relevance model estimation is an effective pseudo-relevance feedback technique based on the formal framework of language models, but as with all these techniques, caution must be used in applying relevance model–based query expansion to a specific retrieval application.

Language models provide a formal but straightforward method of describing retrieval models based on topical relevance. Even more sophisticated models can be developed by incorporating term dependence and phrases, for example. Topical relevance is, however, only part of what is needed for effective search. In the next section, we focus on a retrieval model for combining all the pieces of evidence that contribute to user relevance, which is what people who use a search engine really care about.

7.4 Complex Queries and Combining Evidence

Effective retrieval requires the combination of many pieces of evidence about a document's potential relevance.
In the case of the retrieval models described in previous sections, the evidence consists of word occurrences that reflect topical content. In general, however, there can be many other types of evidence that should be considered. Even considering words, we may want to take into account whether certain words occur near each other, whether words occur in particular document structures, such as section headings or titles, or whether words are related to each other. In addition, evidence such as the date of publication, the document type, or, in the case of web search, the PageRank number will also be important. Although a retrieval algorithm such as query likelihood or BM25 could be extended to include some of these types of evidence, it is difficult not to resort to heuristic "fixes" that make the retrieval algorithm difficult to tune and adapt to new retrieval applications. Instead, what we really need is a framework where we can describe the different types of evidence, their relative importance, and how they should be combined. The inference network retrieval model, which has been used in both commercial and open source search engines (and is incorporated in Galago), is one approach to doing this.

The inference network model is based on the formalism of Bayesian networks and is a probabilistic model. The model provides a mechanism for defining and evaluating operators in a query language. Some of these operators are used to specify types of evidence, and others describe how it should be combined. The version of the inference network we will describe uses language models to estimate the probabilities that are needed to evaluate the queries.

In this section, we first give an overview of the inference network model, and then show how that model is used as the basis of a powerful query language for search applications. In the next section, we describe web search and explain how the inference network model would be used to combine the many sources of evidence required for effective ranking.

Queries described using the inference network query language appear to be much more complicated than a simple text query with two or three words. Most users will not understand this language, just as most relational database users do not understand Structured Query Language (SQL). Instead, applications translate simple user queries into more complex inference network versions. The more complex query incorporates additional features and weights that reflect the best combination of evidence for effective ranking. This point will become clearer as we discuss examples in the next two sections.

7.4.1 The Inference Network Model

A Bayesian network is a probabilistic model that is used to specify a set of events and the dependencies between them. The networks are directed, acyclic graphs (DAGs), where the nodes in the graph represent events with a set of possible outcomes and arcs represent probabilistic dependencies between the events. The probability, or belief, of a particular event outcome can be determined given the probabilities of the parent events (or a prior probability in the case of a root node). When used as a retrieval model, the nodes represent events such as observing a particular document, or a particular piece of evidence, or some combination of pieces of evidence. These events are all binary, meaning that TRUE and FALSE are the only possible outcomes.
15 Belief network is the name for a range of techniques used to model uncertainty. A Bayesian network is a probabilistic belief network.

Fig. 7.4. Example inference network model (document node D; representation nodes r_1, ..., r_N estimated from the title, body, and h1 language models; query nodes q_1, q_2; information need node I)

Figure 7.4 shows an inference net where the evidence being combined are words in a web page's title, body, and <h1> headings. In this figure, D is a document node. This node corresponds to the event that a document (the web page) is observed. There is one document node for every document in the collection, and we assume that only one document is observed at any time. The r_i, or representation nodes, are document features (evidence), and the probabilities associated with those features are based on language models θ estimated using the parameters μ. There is one language model for each significant document structure (title, body, or headings). In addition to features based on word occurrence, r_i nodes also represent proximity features. Proximity features take a number of different forms, such as requiring words to co-occur within a certain "window" (length) of text, and will be described in detail in the next section. Features that are not based on language models, such as document date, are allowed but not shown in this example.

The query nodes q_i are used to combine evidence from representation nodes and other query nodes. These nodes represent the occurrence of more complex evidence and document features. A number of forms of combination are available, with Boolean AND and OR being two of the simplest. The network as a whole computes P(I | D, μ), which is the probability that an information need is met given the document and the parameters μ. The information need node I is a special query node that combines all of the evidence from the other query nodes into a single probability or belief score. This score is used to rank documents. Conceptually, this means we must evaluate an inference network for every document in the collection, but as with every other ranking algorithm, indexes are used to speed up the computation. In general, representation nodes are indexed, whereas query nodes are specified for each query by the user or search application. This means that indexes for a variety of proximity features, in addition to words, will be created (as described in Chapter 5), significantly expanding the size of the indexes. In some applications, the probabilities associated with proximity features are computed at query time in order to provide more flexibility in specifying queries.

The connections in the inference network graph are defined by the query and the representation nodes connected to every document in the collection. The probabilities for the representation nodes are estimated using language models for each document. Note that these nodes do not represent the occurrence of a particular feature in a document, but instead capture the probability that the feature is characteristic of the document, in the sense that the language model could generate it. For example, a node for the word "lincoln" represents the binary event that a document is about that topic (or not), and the language model for the document is used to calculate the probability of that event being TRUE.
Since all the events in the inference network are binary, we cannot really use a multinomial model of a document as a sequence of words. Instead, we use a multiple-Bernoulli model,16 which is the basis for the binary independence model in section 7.2.1. In that case, a document is represented as a binary feature vector, which simply records whether a feature is present or not. In order to capture term frequency information, a different multiple-Bernoulli model is used where the document is represented by a multiset17 of vectors, with one vector for each term occurrence (Metzler, Lavrenko, & Croft, 2004). It turns out that, with the appropriate choice of parameters, the probability estimate based on the multiple-Bernoulli distribution is the same as the estimate for the multinomial distribution with Dirichlet smoothing, which is

P(r_i | D, μ) = (f_{i,D} + μ · P(r_i | C)) / (|D| + μ)

where f_{i,D} is the number of times feature r_i occurs in document D, P(r_i | C) is the collection probability for feature r_i, and μ is the Dirichlet smoothing parameter. To be more precise, for the model shown in Figure 7.4 we would use f_{i,D} counts, collection probabilities, and a value for μ that are specific to the document structure of interest. For example, if f_{i,D} was the number of times feature r_i occurs in a document title, the collection probabilities would be estimated from the collection of all title texts, and the μ parameter would be specific to titles. Also note that the same estimation formula is used for proximity-based features as for words. For example, for a feature such as "New York" where the words must occur next to each other, f_{i,D} is the number of times "New York" occurs in the text.

16 Named after the Swiss mathematician Jakob Bernoulli (also known as James or Jacques, and one of eight famous mathematicians in the same family). The multiple-Bernoulli model is discussed further in Chapter 9.
17 A multiset (also called a bag) is a set where each member has an associated number recording the number of times it occurs.
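A small Python sketch of the estimation formula above, assuming the counts are already available as plain numbers; the function and parameter names are ours, not Galago's.

```python
def feature_prob(f_d, ctx_len, coll_count, coll_len, mu=1500.0):
    """Dirichlet-smoothed P(r_i | D, mu) for one feature in one context.

    f_d        -- times the feature occurs in this document's context
                  (e.g., its title text)
    ctx_len    -- total number of terms in that context for this document
    coll_count -- times the feature occurs in contexts of the same type
                  over the whole collection (e.g., all title texts)
    coll_len   -- total number of terms in those collection contexts
    mu         -- smoothing parameter, chosen per context type
    """
    p_coll = coll_count / coll_len
    return (f_d + mu * p_coll) / (ctx_len + mu)

# Example: a word occurring twice in a title of length 8, with a
# title-specific (small) value of mu.
print(feature_prob(f_d=2, ctx_len=8, coll_count=500, coll_len=2_000_000, mu=50.0))
```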
The query nodes, which specify how to combine evidence, are the basis of the operators in the query language. Although Bayesian networks permit arbitrary combinations (constrained by the laws of probability), the inference network retrieval model is based on operators that can be computed efficiently. At each node in the network, we need to specify the probability of each outcome given all possible states of the parent nodes. When the number of parent nodes is large, this could clearly get expensive. Fortunately, many of the interesting combinations can be expressed as simple formulas.

As an example of the combination process and how it can be done efficiently, consider Boolean AND. Given a simple network for a query node q with two parent nodes a and b, as shown in Figure 7.5, we can describe the conditional probabilities as shown in Table 7.6.

Fig. 7.5. Inference network with three nodes

P(q = TRUE | a, b)   a       b
0                    FALSE   FALSE
0                    TRUE    FALSE
0                    FALSE   TRUE
1                    TRUE    TRUE
Table 7.6. Conditional probabilities for example network

We can refer to the values in the first column of Table 7.6 using p_{ij}, where i and j refer to the states of the parents. For example, p_{10} refers to the probability that q is TRUE given that a is TRUE and b is FALSE. To compute the probability that q is TRUE, we use this table and the probabilities of the parent nodes (which come from the representation nodes) as follows:

bel_and(q) = p_{00} P(a = FALSE) P(b = FALSE) + p_{01} P(a = FALSE) P(b = TRUE)
           + p_{10} P(a = TRUE) P(b = FALSE) + p_{11} P(a = TRUE) P(b = TRUE)
           = 0 · (1 − p_a)(1 − p_b) + 0 · (1 − p_a) p_b + 0 · p_a (1 − p_b) + 1 · p_a p_b
           = p_a p_b

where p_a is the probability that a is true, and p_b is the probability that b is true. We use the name bel_and(q) to indicate that this is the belief value (probability) that results from an AND combination.

This means that the AND combination of evidence is computed by simply multiplying the probabilities of the parent nodes. If one of the parent probabilities is low (or zero if smoothing is not used), then the combination will have a low probability. This seems reasonable for this type of combination. We can define a number of other combination operators in the same way. If a query node q has n parents, with p_i the probability that the i-th parent is true, then the following list defines the common operators:

bel_not(q) = 1 − p_1
bel_or(q) = 1 − ∏_{i=1}^{n} (1 − p_i)
bel_and(q) = ∏_{i=1}^{n} p_i
bel_wand(q) = ∏_{i=1}^{n} p_i^{wt_i}
bel_max(q) = max{p_1, p_2, ..., p_n}
bel_sum(q) = (∑_{i=1}^{n} p_i) / n
bel_wsum(q) = (∑_{i=1}^{n} wt_i p_i) / (∑_{i=1}^{n} wt_i)

where wt_i is a weight associated with the i-th parent, which indicates the relative importance of that evidence. Note that NOT is a unary operator (i.e., it has only one parent).

The weighted AND operator is very important and one of the most commonly used in the query language described in the next section. Using this form of combination and restricting the evidence (representation nodes) to individual words gives the same ranking as query likelihood. Given this description of the underlying model and combination operators, we can now define a query language that can be used in a search engine to produce rankings based on complex combinations of evidence.
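These closed forms are simple to compute directly, without enumerating a full conditional probability table as in the earlier sketch. A minimal Python illustration of the operators, assuming the parent beliefs are given as a list of probabilities (our own sketch, not the Galago implementation):

```python
from math import prod

def bel_not(p):         return 1.0 - p
def bel_or(ps):         return 1.0 - prod(1.0 - p for p in ps)
def bel_and(ps):        return prod(ps)
def bel_wand(ps, wts):  return prod(p ** w for p, w in zip(ps, wts))
def bel_max(ps):        return max(ps)
def bel_sum(ps):        return sum(ps) / len(ps)
def bel_wsum(ps, wts):  return sum(w * p for p, w in zip(ps, wts)) / sum(wts)

beliefs = [0.7, 0.4]                  # parent probabilities p_a, p_b
print(bel_and(beliefs))               # 0.28 -- multiply, as derived above
print(bel_wand(beliefs, [2.0, 1.0]))  # weighted AND: 0.7**2 * 0.4 = 0.196
```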
7.4.2 The Galago Query Language

The Galago query language presented here is similar to query languages used in open source search engines that are based on the inference network retrieval model.18 This version focuses on the most useful aspects of those languages for a variety of search applications, and adds the ability to use arbitrary features. Note that the Galago search engine is not based on a specific retrieval model, but instead provides an efficient framework for implementing retrieval models.

18 Such as Inquery and Indri.

Although the query language can easily handle simple unstructured text documents, many of the more interesting features make use of evidence based on document structure. We assume that structure is specified using tag pairs, as in HTML or XML. Consider the following document:

<html>
<head>
<title>Department Descriptions</title>
</head>
<body>
The following list describes ...
<h1>Agriculture</h1> ... <h1>Chemistry</h1> ...
<h1>Computer Science</h1> ... <h1>Electrical Engineering</h1> ...
</body>
</html>

In the Galago query language, a document is viewed as a sequence of text that may contain arbitrary tags. In the example just shown, the document consists of text marked up with HTML tags.

For each tag type T within a document (e.g., title, body, h1, etc.), we define the context19 of T to be all of the text and tags that appear within tags of type T. In the example, all of the text and tags appearing between <body> and </body> tags define the body context. A single context is generated for each unique tag name. Therefore, a context defines a subdocument. Note that because of nested tags, certain word occurrences may appear in many contexts. It is also the case that there may be nested contexts. For example, within the <body> context there is a nested <h1> context made up of all of the text and tags that appear within <h1> and </h1> tags inside the body context. Here are the title, h1, and body contexts for this example document:

19 Contexts are sometimes referred to as fields.

title context:
<title>Department Descriptions</title>

h1 context:
<h1>Agriculture</h1>
<h1>Chemistry</h1>
<h1>Computer Science</h1>
<h1>Electrical Engineering</h1>

body context:
<body>The following list describes ... <h1>Agriculture</h1> ... <h1>Chemistry</h1> ...
<h1>Computer Science</h1> ... <h1>Electrical Engineering</h1> ... </body>

Each context is made up of one or more extents. An extent is a sequence of text that appears within a single begin/end tag pair of the same type as the context. For this example, in the <h1> context, the extents are <h1>Agriculture</h1>, <h1>Chemistry</h1>, etc. Both the title and body contexts contain only a single extent because there is only a single pair of <title> and <body> tags, respectively. The number of extents for a given tag type is determined by the number of tag pairs of that type that occur within the document.

In addition to the structure defined when a document is created, contexts are also used to represent structure added by feature extraction tools. For example, dates, people's names, and addresses can be identified in text and tagged by a feature extraction tool. As long as this information is represented using tag pairs, it can be referred to in the query language in the same way as other document structures.

Terms are the basic building blocks of the query language, and correspond to representation nodes in the inference network model. A variety of types of terms can be defined, such as simple terms, ordered and unordered phrases, synonyms, and others. In addition, there are a number of options that can be used to specify that a term should appear within a certain context, or that it should be scored using a language model that is estimated using a given context.

Simple terms:
term – term that will be normalized and stemmed.
"term" – term is not normalized or stemmed.
Examples:
presidents
"NASA"

Proximity terms:
#od:N( ... ) – ordered window – terms must appear ordered, with at most N-1 terms between each.
#od( ... ) – unlimited ordered window – all terms must appear ordered anywhere within the current context.
#uw:N( ... ) – unordered window – all terms must appear within a window of length N in any order.
#uw( ... ) – unlimited unordered window – all terms must appear within the current context in any order.
Examples:
#od:1(white house) – matches "white house" as an exact phrase.
#od:2(white house) – matches "white * house" (where * is any word or null).
#uw:2(white house) – matches "white house" and "house white".

Synonyms:
#syn( ... )
#wsyn( ... )
These operators treat all of the terms listed as synonyms. The #wsyn operator, in addition, allows weights to be assigned to each term to indicate their relative importance. The arguments given to these operators can only be simple terms or proximity terms.
Examples:
#syn(dog canine) – simple synonym based on two terms.
#syn( #od:1(united states) #od:1(united states of america) ) – creates a synonym from two proximity terms.
#wsyn( 1.0 donald 0.8 don 0.5 donnie ) – weighted synonym indicating relative importance of terms.

Anonymous terms:
#any:.() – used to match extent types.
Examples:
#any:person() – matches any occurrence of a person extent.
#od:1(lincoln died in #any:date()) – matches exact phrases of the form: "lincoln died in <date> ... </date>".

Context restriction and evaluation:
expression.C1,...,CN – matches when the expression appears in all contexts C1 through CN.
expression.(C1,...,CN) – evaluates the expression using the language model defined by the concatenation of contexts C1...CN within the document.
Examples:
dog.title – matches the term "dog" appearing in a title extent.
#uw(smith jones).author – matches when the two names "smith" and "jones" appear in an author extent.
dog.(title) – evaluates the term based on the title language model for the document.
This means that the estimate of the probability of occurrence for for a given document will be based on the number of times that the dog word occurs in the title field for that document and will be normalized us- ing the number of words in the title rather than the document. Similarly, smoothing is done using the probabilities of occurrence in the title field over the whole collection. #od:1(abraham lincoln).person.(header) – builds a language model from all #od:1(abraham lin- of the “header” text in the document and evaluates in that context (i.e., matches only the exact phrase appearing coln).person within a person extent within the header context). Belief operators are used to combine evidence about terms, phrases, etc. There are both unweighted and weighted belief operators. With the weighted operator, the relative importance of the evidence can be specified. This allows control over how much each expression within the query impacts the final score. The filter operator is used to screen out documents that do not contain required evidence. All belief operators can be nested. Belief operators : #combine(...) – this operator is a normalized version of the bel op- ( q ) and erator in the inference network model. See the discussion later for more details. operator. bel – this is a normalized version of the ( q #weight(...) ) wand #filter(...) – this operator is similar to #combine , but with the difference that the document must contain at least one instance of all terms (simple, proximity, synonym, etc.). The evaluation of nested belief operators is not changed. : Examples #combine( #syn(dog canine) training ) – rank by two terms, one of which is a synonym. #combine( biography #syn(#od:1(president lincoln) #od:1(abraham lincoln)) ) – rank using two terms, one of which is a synonym of “president lincoln” and “abraham lincoln”. #weight( 1.0 #od:1(civil war) 3.0 lincoln 2.0 speech ) – rank using three terms, and weight the term “lincoln” as most important, followed by “speech”, then “civil war”. #filter( aquarium #combine(tropical fish) ) – consider only those documents containing the word “aquarium” and “tropical” or “fish”, and rank them</p> <p><span class="badge badge-info text-white mr-2">302</span> 278 7 Retrieval Models according to the query #combine(aquarium #combine(tropical fish)) . – rank #filter( #od:1(john smith).author) #weight( 2.0 europe 1.0 travel ) documents about “europe” or “travel” that have “John Smith” in the au- thor context. As we just described, the and #weight operators are normalized ver- #combine sions of the and bel bel operators, respectively. The beliefs of these oper- wand and ators are computed as follows: n ∏ n 1/ bel p = combine i i n ∑ ∏ n wt wt / ′ i ′ i i p bel = weight i i This normalization is done in order to make the operators behave more like the bel operators, which are both normalized. One advan- and bel original wsum sum tage of the normalization is that it allows us to describe the belief computa- tion of these operators in terms of various types of means (averages). For exam- bel computes the arithmetic mean over the beliefs of the parent nodes, ple, sum whereas bel computes a weighted arithmetic mean. Similarly, bel and combine wsum bel compute a geometric mean and weighted geometric mean, respectively. wand The filter operator also could be used with numeric and date field operators so that non-textual evidence can be combined into the score. 
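A small Python sketch of the normalized #combine and #weight computations just described, assuming the term beliefs are plain probabilities (an illustration only, not the Galago implementation):

```python
from math import prod

def bel_combine(ps):
    """#combine: geometric mean of the parent beliefs."""
    return prod(ps) ** (1.0 / len(ps))

def bel_weight(ps, wts):
    """#weight: weighted geometric mean, with weights normalized to sum to 1."""
    total = sum(wts)
    return prod(p ** (w / total) for p, w in zip(ps, wts))

print(bel_combine([0.2, 0.8]))              # (0.2 * 0.8) ** 0.5 = 0.4
print(bel_weight([0.2, 0.8], [1.0, 3.0]))   # 0.2**0.25 * 0.8**0.75 ~= 0.57
```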
For example, the query #filter(news.doctype #dateafter(12/31/1999).docdate #uw:20( brown.person #any:company() #syn( money cash payment ) ) ranks documents that are news stories, that appeared after 1999, and that con- tained at least one text segment of length 20 that mentioned a person named “brown”, a company name, and at least one of the three words dealing with money. The inference network model can easily deal with the combination of this type of evidence, but for simplicity, we have not implemented these operators in Galago. Another part of the inference network model that we do support in the Galago query language is document priors. Document priors allow the specification of a prior probability over the documents in a collection. These prior probabilities influence the rankings by preferring documents with certain characteristics, such as those that were written recently or are short.</p> <p><span class="badge badge-info text-white mr-2">303</span> 7.5 Web Search 279 : Prior – uses the document prior specified by the name given. Pri- #prior:name() ors are files or functions that provide prior probabilities for each docu- ment. Example : – uses a prior named recent to #combine(#prior:recent() global warming) give greater weight to documents that were published more recently. As a more detailed example of the use of this query language, in the next sec- tion we discuss web search and the types of evidence that have to be combined for effective ranking. The use of the #feature operator to define arbitrary features (new evidence) is discussed in Chapter 11. 7.5 Web Search Measured in terms of popularity, web search is clearly the most important search application. Millions of people use web search engines every day to carry out an enormous variety of tasks, from shopping to research. Given its importance, web search is the obvious example to use for explaining how the retrieval models we have discussed are applied in practice. There are some major differences between web search and an application that provides search for a collection of news stories, for example. The primary ones are the size of the collection (billions of documents), the connections between doc- uments (i.e., links), the range of document types, the volume of queries (tens of millions per day), and the types of queries. Some of these issues we have discussed in previous chapters, and others, such as the impact of spam, will be discussed later. In this section, we will focus on the features of the queries and documents that are most important for the ranking algorithm. There are a number of different types of search in a web environment. One popular way of describing searches was suggested by Broder (2002). In this tax- onomy, searches are either informational , navigational , or transactional . An in- formational search has the goal of finding information about some topic that may be on one or more web pages. Since every search is looking for some type of in- formation, we call these topical searches in this book. A navigational search has the goal of finding a particular web page that the user has either seen before or</p> <p><span class="badge badge-info text-white mr-2">304</span> 280 7 Retrieval Models 20 A transactional search has the goal of finding a site where assumes must exist. a task such as shopping or downloading music can be performed. Each type of search has an information need associated with it, but a different type of infor- mation need. 
Retrieval models based on topical relevance have focused primarily on the first type of information need (and search). To produce effective rankings for the other types of searches, a retrieval model that can combine evidence related to user relevance is required. of features (types of ev- Commercial web search engines incorporate hundreds idence) in their ranking algorithms, many derived from the huge collection of user interaction data in the query logs. These can be broadly categorized into features relating to page content, page metadata, anchor text, links (e.g., PageRank), and user behavior. Although anchor text is derived from the links in a page, it is used in a different way than features that come from an analysis of the link structure of pages, and so is put into a separate category. Page metadata refers to informa- tion about a page that is not part of the content of the page, such as its “age,” how often it is updated, the URL of the page, the domain name of its site, and the amount of text content in the page relative to other material, such as images and advertisements. It is interesting to note that understanding the relative importance of these features and how they can be manipulated to obtain better search rankings for a web page is the basis of search engine optimization (SEO). A search engine op- timizer may, for example, improve the text used in the title tag of the web page, improve the text in heading tags, make sure that the domain name and URL con- tain important keywords, and try to improve the anchor text and link structure related to the page. Some of these techniques are not viewed as appropriate by the web search engine companies, and will be discussed further in section 9.1.5. In the TREC environment, retrieval models have been compared using test collections of web pages and a mixture of query types. The features related to user behavior and some of the page metadata features, such as frequency of update, are not available in the TREC data. Of the other features, the most important for navigational searches are the text in the title, body, and heading ( h1 , h2 , h3 , and h4 ) parts of the document; the anchor text of all links pointing to the document; the PageRank number; and the inlink count (number of links pointing to the page). 20 In the TREC world, navigational searches are called home-page and named-page searches. Topical searches are called adhoc searches. Navigational searches are similar to known-item searches, which have been discussed in the information retrieval literature for many years.</p> <p><span class="badge badge-info text-white mr-2">305</span> 7.5 Web Search 281 Note that we are not saying that other features do not affect the ranking in web search engines, just that these were the ones that had the most significant impact in TREC experiments. Given the size of the Web, many pages will contain all the query terms. Some ranking algorithms rank only those pages which, in effect, filters the results us- AND . This can cause problems if only a subset of the Web is used ing a Boolean (such as in a site search application) and is particularly risky with topical searches. For example, only about 50% of the pages judged relevant in the TREC topical web searches contain all the query terms. Instead of filtering, the ranking algo- rithm should strongly favor pages that contain all query terms. In addition, term will be important. 
The additional evidence of terms occurring near each proximity other will significantly improve the effectiveness of the ranking. A number of re- trieval models incorporating term proximity have been developed. The following approach is designed to work in the inference network model, and produces good 21 results. dependencemodel is based on the assumption that query terms are likely to The appear in close proximity to each other within relevant documents. For example, given the query “Green party political views”, relevant documents will likely con- tain the phrases “green party” and “political views” within relatively close prox- imity to one another. If the query is treated as a set of terms Q , we can define Q as the set of all non-empty subsets of S . A Galago query attempts to capture Q dependencies between query terms as follows: 1. Every s ∈ S that consists of contiguous query terms is likely to appear as an Q exact phrase in a relevant document (i.e., represented using the #od:1 opera- tor). s ∈ S is likely to appear (ordered or unordered) 2. Every | s | > 1 such that Q within a reasonably sized window of text in a relevant document (i.e., in a window represented as #uw:8 for | s | = 2 and #uw:12 for | s | = 3 ). As an example, this model produces the Galago query language representation shown in Figure 7.6 for the TREC query “embryonic stem cells”, where the weights were determined empirically to produce the best results. Given the important pieces of evidence for web search ranking, we can now give an example of a Galago query that combines this evidence into an effective ranking. For the TREC query “pet therapy”, we would produce the Galago query shown in Figure 7.7. The first thing to note about this query is that it clearly shows 21 The formal model is described in Metzler and Croft (2005b).</p> <p><span class="badge badge-info text-white mr-2">306</span> 282 7 Retrieval Models #weight( 0.8 #combine(embryonic stem cells) 0.1 #combine( #od:1(stem cells) #od:1(embryonic stem) #od:1(embryonic stem cells)) 0.1 #combine( #uw:8(stem cells) #uw:8(embryonic cells) #uw:8(embryonic stem) #uw:12(embryonic stem cells))) Fig. 7.6. Galago query for the dependence model how a complex query expression can be generated from a simple user query. A number of proximity terms have been added, and all terms are evaluated using contexts based on anchor text, title text, body text, and heading text. From an efficiency perspective, the proximity terms may be indexed, even though this will increase the index size substantially. The benefit is that these relatively large query expressions will be able to be evaluated very efficiently at query time. #weight( 0.1 #weight( 0.6 #prior(pagerank) 0.4 #prior(inlinks)) 1.0 #weight( 0.9 #combine( #weight( 1.0 pet.(anchor) 1.0 pet.(title) 3.0 pet.(body) 1.0 pet.(heading)) #weight( 1.0 therapy.(anchor) 1.0 therapy.(title) 3.0 therapy.(body) 1.0 therapy.(heading))) 0.1 #weight( 1.0 #od:1(pet therapy).(anchor) 1.0 #od:1(pet therapy).(title) 3.0 #od:1(pet therapy).(body) 1.0 #od:1(pet therapy).(heading)) 0.1 #weight( 1.0 #uw:8(pet therapy).(anchor) 1.0 #uw:8(pet therapy).(title) 3.0 #uw:8(pet therapy).(body) 1.0 #uw:8(pet therapy).(heading))) ) Fig. 7.7. Galago query for web data The PageRank and inlink evidence is incorporated into this query as prior probabilities. In other words, this evidence is independent of specific queries and can be calculated at indexing time. 
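A small sketch of how an application might generate the dependence-model part of such a query from a plain list of query terms; the default weights and window sizes follow Figure 7.6, and the Python function itself is our illustration rather than part of Galago:

```python
from itertools import combinations

def dependence_model_query(terms, w_t=0.8, w_od=0.1, w_uw=0.1):
    """Build a Galago #weight query from a list of query terms: the single
    terms, exact phrases (#od:1) for contiguous subsets, and unordered
    windows (#uw:8 for pairs, #uw:12 for triples). Assumes >= 2 terms."""
    singles = "#combine(" + " ".join(terms) + ")"
    n = len(terms)

    ordered = []      # contiguous subsets of length >= 2, as exact phrases
    for size in range(2, n + 1):
        for start in range(n - size + 1):
            ordered.append("#od:1(" + " ".join(terms[start:start + size]) + ")")

    unordered = []    # all subsets of size 2 or 3, in an unordered window
    for size in (2, 3):
        for subset in combinations(terms, size):
            window = 8 if size == 2 else 12
            unordered.append(f"#uw:{window}({' '.join(subset)})")

    return (f"#weight( {w_t} {singles} {w_od} #combine( {' '.join(ordered)} ) "
            f"{w_uw} #combine( {' '.join(unordered)} ) )")

print(dependence_model_query(["embryonic", "stem", "cells"]))
```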
The weights in the query were determined by</p> <p><span class="badge badge-info text-white mr-2">307</span> 7.6 Machine Learning and Information Retrieval 283 experiments with TREC Web page collections, which are based on a crawl of the .gov domain. The relative importance of the evidence could be different for the full Web or for other collections. The text in the main body of the page was found to be more important than the other parts of the document and anchor text, and this is reflected in the weights. Experiments with the TREC data have also shown that much of the evi- dence that is crucial for effective navigational search is not important for top- ical searches. In fact, the only features needed for topical search are the simple terms and proximity terms for the body part of the document. The other fea- tures do not improve effectiveness, but they also do not reduce it. Another dif- ference between topical and navigational searches is that query expansion using pseudo-relevance feedback was found to help topical searches, but made naviga- tional searches worse. Navigational searches are looking for a specific page, so it is not surprising that smoothing the query by adding a number of extra terms may increase the “noise” in the results. If a search was known to be in the topical cat- egory, query expansion could be used, but this is difficult to determine reliably, and since the potential effectiveness benefits of expansion are variable and some- what unpredictable, this technique is generally not used. Given that the evidence needed to identify good sites for transaction searches seems to be similar to that needed for navigational searches, this means that the same ranking algorithm can be used for the different categories of web search. Other research has shown that user behavior information, such as clickthrough data (e.g., which documents have been clicked on in the past, which rank posi- tions were clicked) and browsing data (e.g., dwell time on page, links followed), can have a significant impact on the effectiveness of the ranking. This type of ev- idence can be added into the inference network framework using additional op- erators, but as the number of pieces of evidence grows, the issue of how to deter- mine the most effective way of combining and weighting the evidence becomes more important. In the next section, we discuss techniques for learning both the weights and the ranking algorithm using explicit and implicit feedback data from the users. 7.6 Machine Learning and Information Retrieval There has been considerable overlap between the fields of information retrieval and machine learning. In the 1960s, relevance feedback was introduced as a tech- nique to improve ranking based on user feedback about the relevance of docu-</p> <p><span class="badge badge-info text-white mr-2">308</span> 284 7 Retrieval Models ments in an initial ranking. This was an example of a simple machine-learning algorithm that built a classifier to separate relevant from non-relevant documents based on training data. In the 1980s and 1990s, information retrieval researchers used machine learning approaches to learn ranking algorithms based on user feed- back. In the last 10 years, there has been a lot of research on machine-learning ap- proaches to text categorization. Many of the applications of machine learning to information retrieval, however, have been limited by the amount of training data available. 
If the system is trying to build a separate classifier for every query, there is very little data about relevant documents available, whereas other machine- learning applications may have hundreds or even thousands of training examples. Even the approaches that tried to learn ranking algorithms by using training data from all the queries were limited by the small number of queries and relevance judgments in typical information retrieval test collections. With the advent of web search engines and the huge query logs that are col- lected from user interactions, the amount of potential training data is enormous. This has led to the development of new techniques that are having a significant impact in the field of information retrieval and on the design of search engines. In the next section, we describe techniques for learning ranking algorithms that can combine and weight the many pieces of evidence that are important for web search. Another very active area of machine learning has been the development of so- phisticated statistical models of text. In section 7.6.2, we describe how these mod- els can be used to improve ranking based on language models. 7.6.1 Learning to Rank All of the probabilistic retrieval models presented so far fall into the category of generative models. A generative model for text classification assumes that docu- ments were generated from some underlying model (in this case, usually a multi- nomial distribution) and uses training data to estimate the parameters of the model. The probability of belonging to a class (i.e., the relevant documents for a query) is then estimated using Bayes’ Rule and the document model. A discrim- inative model, in contrast, estimates the probability of belonging to a class directly 22 from the observed features of the document based on the training data. In gen- eral classification problems, a generative model performs better with low num- bers of training examples, but the discriminative model usually has the advantage 22 We revisit the discussion of generative versus discriminative classifiers in Chapter 9.</p> <p><span class="badge badge-info text-white mr-2">309</span> 7.6 Machine Learning and Information Retrieval 285 given enough data. Given the amount of potential training data available to web search engines, discriminative models may be expected to have some advantages in this application. It is also easier to incorporate new features into a discrimina- tive model and, as we have mentioned, there can be hundreds of features that are considered for web ranking. discriminativelearning ) Early applications of learning a discriminative model ( in information retrieval used logistic regression to predict whether a document belonged to the relevant class. The problem was that the amount of training data and, consequently, the effectiveness of the technique depended on explicit rele- vance judgments obtained from people. Even given the resources of a commer- cial web search company, explicit relevance judgments are costly to obtain. On the other hand, query logs contain a large amount of implicit relevance informa- tion in the form of clickthroughs and other user interactions. In response to this, discriminative learning techniques based on this form of training data have been developed. The best-known of the approaches used to learn a ranking function for search is based on the Support Vector Machine (SVM) classifier. 
This technique will be discussed in more detail in Chapter 9, so in this section we will just give a brief 23 Ranking SVM can learn to rank. description of how a The input to the Ranking SVM is a training set consisting of partial rank in- formation for a set of queries q ) , r , r ) , ( q q ( ( ) , . . . , , r 2 2 n 1 n 1 where q ranking, or rele- is a query and r desired is partial information about the i i vance level, of documents for that query. This means that if document d should a be ranked higher than d . Where do , then ( d r , d ∈ ) ∈ r ) / ; otherwise, ( d , d a b b a i b i these rankings come from? If relevance judgments are available, the desired rank- ing would put all documents judged to be at a higher relevance level above those at a lower level. Note that this accommodates multiple levels of relevance, which are often used in evaluations of web search engines. If relevance judgments are not available, however, the ranking can be based on clickthrough and other user data. For example, if a person clicks on the third document in a ranking for a query and not on the first two, we can assume that it should be ranked higher in r . If d are the documents in the first, , d d , and 2 3 1 23 This description is based on Joachims’ paper on learning to rank using clickthrough data (Joachims, 2002b).</p> <p><span class="badge badge-info text-white mr-2">310</span> 286 7 Retrieval Models second, and third rank of the search output, the clickthrough data will result in pairs ( , d d ) and ( d being in the desired ranking for this query. This ranking , d ) 2 1 3 3 data will be noisy (because clicks are not relevance judgments) and incomplete, but there will be a lot of it, and experiments have shown that this type of training data can be used effectively. ⃗ , where Let’s assume that we are learning a linear ranking function ⃗w. ⃗w is a d a ⃗ d is the vector representation of weight vector that is adjusted by learning, and a the features of document . These features are, as we described in the last sec- d a tion, based on page content, page metadata, anchor text, links, and user behavior. Instead of language model probabilities, however, the features used in this model that depend on the match between the query and the document content are usu- ally simpler and less formal. For example, there may be a feature for the number of words in common between the query and the document body, and similar fea- tures for the title, header, and anchor text. The weights in the ⃗w vector determine the relative importance of these features, similar to the weights in the inference network operators. If a document is represented by three features with integer val- ⃗ d = (2 ues , , 1) and the weights ⃗w = (2 , 1 , 2) , then the score computed by the 4 ranking function is just: ⃗ d ⃗w. = (2 , 1 , 2) . (2 , 4 , 1) = 2 . 2 + 1 . 4 + 2 . 1 = 10 Given the training set of queries and rank information, we would like to find a that would satisfy as many of the following conditions as possible: weight vector ⃗w ⃗ ⃗ : , d d ) ∈ r ∀ ( ⃗w. d d > ⃗w. i 1 j j i . . . ⃗ ⃗ d d , d ∀ ) ∈ r > ⃗w. : ⃗w. ( d i j i j n This simply means that for all document pairs in the rank data, we would like the score for the document with the higher relevance rating (or rank) to be greater than the score for the document with the lower relevance rating. Unfortunately, there is no efficient algorithm to find the exact solution for ⃗w . 
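As a concrete illustration of the linear ranking function and the pairwise conditions just described, here is a small Python sketch with made-up feature vectors and preference pairs (the feature meanings in the comments are only examples):

```python
def score(w, d):
    """Linear ranking function: dot product of weight and feature vectors."""
    return sum(wi * di for wi, di in zip(w, d))

w = (2.0, 1.0, 2.0)                  # feature weights
docs = {
    "d1": (2.0, 4.0, 1.0),           # e.g., title matches, body matches, ...
    "d2": (1.0, 1.0, 0.0),
    "d3": (3.0, 2.0, 2.0),
}
# Partial rank information: (higher, lower) means the first document
# should score above the second for this query.
preferences = [("d3", "d1"), ("d3", "d2"), ("d1", "d2")]

print(score(w, docs["d1"]))          # 2*2 + 1*4 + 2*1 = 10, as in the text
satisfied = sum(score(w, docs[a]) > score(w, docs[b]) for a, b in preferences)
print(f"{satisfied}/{len(preferences)} preference pairs satisfied")
```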
We can, however, reformulate this problem as a standard SVM optimization as follows:</p> <p><span class="badge badge-info text-white mr-2">311</span> 7.6 Machine Learning and Information Retrieval 287 ∑ 1 : ⃗w. ⃗w + C ξ minimize i;j;k 2 : subject to ⃗ ⃗ ξ , d − ) ∈ r + 1 : ⃗w. ∀ d ( > ⃗w. d d i j j i i;j; 1 1 . . . ⃗ ⃗ ξ , d − ) ∈ ∀ + 1 : ⃗w. ( d d > ⃗w. d r i j n j i i;j;n ∀ i ∀ j ∀ k : ξ 0 , j, k ≥ i where slackvariable , allows for misclassification of difficult or noisy ξ , known as a is a parameter that is used to prevent . Overfit- C training examples, and overfitting ting happens when the learning algorithm produces a ranking function that does very well at ranking the training data, but does not do well at ranking documents 24 that do this optimization and for a new query. Software packages are available produce a classifier. Where did this optimization come from? The impatient reader will have to jump ahead to the explanation for a general SVM classifier in Chapter 9. For the time being, we can say that the SVM algorithm will find a classifier (i.e., the vector ) that has the following property. Each pair of documents in our training data ⃗w ⃗ ⃗ can be represented by the vector ( − d d ). If we compute the score for this pair as j i ⃗ ⃗ ⃗w. d that makes the smallest score as large − ( d ⃗w ) , the SVM classifier will find a j i as possible. The same thing is true for negative examples (pairs of documents that are not in the rank data). This means that the classifier will make the differences in scores as large as possible for the pairs of documents that are hardest to rank. Note that this model does not specify the features that should be used. It could even be used to learn the weights for features corresponding to scores from com- pletely different retrieval models, such as BM25 and language models. Combin- ing multiple searches for a given query has been shown to be effective in a number of experiments, and is discussed further in section 10.5.1. It should also be noted that the weights learned by Ranking SVM (or some other discriminative tech- nique) can be used directly in the inference network query language. Although linear discriminative classifiers such as Ranking SVM may have an advantage for web search, there are other search applications where there will be less training data and less features available. For these applications, the generative models of topical relevance may be more effective, especially as the models con- tinue to improve through better estimation techniques. The next section discusses 24 light Such as SV M ; see http://svmlight.joachims.org.</p> <p><span class="badge badge-info text-white mr-2">312</span> 288 7 Retrieval Models how estimation can be improved by modeling a document as a mixture of topic models. 7.6.2 Topic Models and Vocabulary Mismatch One of the important issues in general information retrieval is vocabulary mis- match. This refers to a situation where relevant documents do not match a query, because they are using different words to describe the same topic. In the web en- vironment, many documents will contain all the query words, so this may not ap- pear to be an issue. In search applications with smaller collections, however, it will be important, and even in web search, TREC experiments have shown that topi- cal queries produce better results using query expansion. 
Query expansion (using, for example, pseudo-relevance feedback) is the standard technique for reducing vocabulary mismatch, although stemming also addresses this issue to some extent. documents by adding related terms. A different approach would be to expand the For documents represented as language models, this is equivalent to smoothing the probabilities in the language model so that words that did not occur in the text have non-zero probabilities. Note that this is different from smoothing us- ing the collection probabilities, which are the same for all documents. Instead, we need some way of increasing the probabilities of words that are associated with the topic of the document. A number of techniques have been proposed to do this. If a document is known to belong to a category or cluster of documents, then the probabilities of words in that cluster can be used to smooth the document language model. We describe the details of this in Chapter 9. A technique known as Latent Seman- 25 , or LSI , tic Indexing maps documents and terms into a reduced dimensionality space, so that documents that were previously indexed using a vocabulary of hun- dreds of thousands of words are now represented using just a few hundred fea- tures. Each feature in this new space is a mixture or cluster of many words, and it is this mixing that in effect smooths the document representation. The ( LDA ) model, which comes from the machine LatentDirichletAllocation learning community, models documents as a mixture of topics. A topic is a lan- guage model, just as we defined previously. In a retrieval model such as query like- lihood, each document is assumed to be associated with a single topic. There are, 25 This technique is also called Latent Semantic Analysis or LSA (Deerwester et al., 1990). Note that “latent” is being used in the sense of “hidden.”</p> <p><span class="badge badge-info text-white mr-2">313</span> 7.6 Machine Learning and Information Retrieval 289 in effect, as many topics as there are documents in the collection. In the LDA ap- proach, in contrast, the assumption is that there is a fixed number of underlying (or latent) topics that can be used to describe the contents of documents. Each document is represented as a mixture of these topics, which achieves a smoothing effect that is similar to LSI. In the LDA model, a document is generated by first picking a distribution over topics, and then, for the next word in the document, we choose a topic and generate a word from that topic. Using our “bucket” analogy for language models, we would need multiple buckets to describe this process. For each document, we would have one bucket of topics, with the number of instances of each topic depending on the distribution of topics we had picked. For each topic, there would be another bucket containing words, with the number of instances of the words depending on the probabilities in the topic language model. Then, to generate a document, we first select a topic from the topic bucket (still without looking), then go to the bucket of words for the topic that had been selected and pick out a word. The process is then repeated for the next word. More formally, the LDA process for generating a document is: , pick a multinomial distribution θ 1. For each document D from a Dirichlet D distribution with parameter α . D : 2. For each word position in document z θ . from the multinomial distribution a) Pick a topic D w from P ( w | z, β ) , a multinomial probability condi- b) Choose a word z β . 
with parameter tioned on the topic θ A variety of techniques are available for learning the topic models and the distributions using the collection of documents as the training data, but all of these methods tend to be quite slow. Once we have these distributions, we can produce language model probabilities for the words in documents: ∑ ) ( w | D ) = P ( w | θ θ , β ) = P | P ( w | z, β ) P ( z lda D D z These probabilities can then be used to smooth the document representation by mixing them with the query likelihood probability as follows: ( ) c w μ f + w;D C | | P λ D | w ( ) = ) ) D | w + (1 − λ ( P lda | | μ + D</p> <p><span class="badge badge-info text-white mr-2">314</span> 290 7 Retrieval Models So the final language model probabilities are, in effect, a mixture of the maximum likelihood probabilities, collection probabilities, and the LDA probabilities. If the LDA probabilities are used directly as the document representation, the effectiveness of the ranking will be significantly reduced because the features are smoothed. In TREC experiments, (the number of topics) has a value of K too around 400. This means that all documents in the collection are represented as mixtures of just 400 topics. Given that there can be millions of words in a col- lection vocabulary, matching on topics alone will lose some of the precision of matching individual words. When used to smooth the document language model, however, the LDA probabilities can significantly improve the effectiveness of query likelihood ranking. Table 7.7 shows the high-probability words from four 26 LDA topics (out of 100) generated from a sample of TREC news stories. Note that the names of the topics were not automatically generated. Budgets Children Education Arts million new school children tax students women film program people schools show music child education budget movie billion years teachers play federal families high year public work musical spending teacher best parents new says bennett actor first family manigat state york plan welfare namphy state money men opera programs percent president theater actress government care elementary love congress life haiti Table 7.7. Highest-probability terms from four topics in LDA model The main problem with using LDA for search applications is that estimating the probabilities in the model is expensive. Until faster methods are developed, 26 This table is from Blei et al. (2003).</p> <p><span class="badge badge-info text-white mr-2">315</span> 7.7 Application-Based Models 291 this technique will be limited to smaller collections (hundreds of thousands of documents, but not millions). 7.7 Application-Based Models In this chapter we have described a wide variety of retrieval models and ranking algorithms. From the point of view of someone involved in designing and imple- menting a search application, the question is which of these techniques should be used and when? The answer depends on the application and the tools available. Most search applications involve much smaller collections than the Web and a lot less connectivity in terms of links and anchor text. Ranking algorithms that work well in web search engines often do not produce the best rankings in other appli- cations. Customizing a ranking algorithm for the application will nearly always produce the best results. The first step in doing this is to construct a test collection of queries, docu- ments, and relevance judgments so that different versions of the ranking algo- rithm can be compared quantitatively. 
Evaluation is discussed in detail in Chapter 8, and it is the key to an effective search engine. The next step is to identify what evidence or features might be used to rep- resent documents. Simple terms and proximity terms are almost always useful. Significant document structure—such as titles, authors, and date fields—are also nearly always important for search. In some applications, numeric fields may be important. Text processing techniques such as stemming and stopwords also must be considered. Another important source of information that can be used for query expan- sion is an application-specific thesaurus. These are surprisingly common since of- ten an attempt will have been made to build them either manually or automati- cally for a previous information system. Although they are often very incomplete, the synonyms and related words they contain can make a significant difference to ranking effectiveness. Having identified the various document features and other evidence, the next task is to decide how to combine it to calculate a document score. An open source search engine such as Galago makes this relatively easy since the combination and weighting of evidence can be expressed in the query language and many variations can be tested quickly. Other search engines do not have this degree of flexibility. If a search engine based on a simple retrieval model is being used for the search application, the descriptions of how scores are calculated in the BM25 or query</p> <p><span class="badge badge-info text-white mr-2">316</span> 292 7 Retrieval Models likelihood models and how they are combined in the inference network model can be used as a guide to achieve similar effects by appropriate query transforma- tions and additional code for scoring. For example, the synonym and related word information in a thesaurus should not be used to simply add words to a query. Un- less some version of the #syn operator is used, the effectiveness of the ranking will #syn in Galago can be used as an example of be reduced. The implementation of how to add this operator to a search engine. Much of the time spent in developing a search application will be spent on tuning the retrieval effectiveness of the ranking algorithm. Doing this without some concept of the underlying retrieval model can be very unrewarding. The re- trieval models described in this chapter (namely BM25, query likelihood, rele- vance models, inference network, and Ranking SVM) provide the best possible blueprints for a successful ranking algorithm. For these models, good parame- ter values and weights are already known from extensive published experiments. These values can be used as a starting point for the process of determining whether modifications are needed for an application. If enough training data is available, a discriminative technique such as Ranking SVM will learn the best weights di- rectly. References and Further Reading Since retrieval models are one of the most important topics in information re- trieval, there are many papers describing research in this area, starting in the 1950s. One of the most valuable aspects of van Rijsbergen’s book (van Rijsbergen, 1979) is the coverage of the older research in this area. In this book, we will focus on some of the major papers, rather than attempting to be comprehensive. These ref- erences will be discussed in the order of the topics presented in this chapter. 
The discussion of the nature of relevance has, understandably, been going on in information retrieval for a long time. One of the earlier papers that is often cited is Saracevic (1975). A more recent article gives a review of work in this area (Mizzaro, 1997). On the topic of Boolean versus ranked search, Turtle (1994) carried out an experiment comparing the performance of professional searchers using the best Boolean queries they could generate against keyword searches using ranked out- put and found no advantage for the Boolean search. When simple Boolean queries are compared against ranking, as in Turtle and Croft (1991), the effectiveness of ranking is much higher.</p> <p><span class="badge badge-info text-white mr-2">317</span> 7.7 Application-Based Models 293 The vector space model was first mentioned in Salton et al. (1975), and is de- scribed in detail in Salton and McGill (1983). The most comprehensive paper in weighting experiments with this model is Salton and Buckley (1988), although the term-weighting techniques described in section 7.1.2 are a later improvement on those described in the paper. The description of information retrieval as a classification problem appears in van Rijsbergen (1979). The best paper on the application of the binary indepen- dence model and its development into the BM25 ranking function is Sparck Jones et al. (2000). The use of language models in information retrieval started with Ponte and Croft (1998), who described a retrieval model based on multiple-Bernoulli lan- guage models. This was quickly followed by a number of papers that developed the multinomial version of the retrieval model (Hiemstra, 1998; F. Song & Croft, 1999). Miller et al. (1999) described the same approach using a Hidden Markov Model. Berger and Lafferty (1999) showed how translation probabilities for words could be incorporated into the language model approach. We will refer to this translation model again in section 10.3. The use of non-uniform prior probabili- ties was studied by Kraaij et al. (2002). A collection of papers relating to language models and information retrieval appears in Croft and Lafferty (2003). Zhai and Lafferty (2004) give an excellent description of smoothing tech- niques for language modeling in information retrieval. Smoothing using clusters and nearest neighbors is described in Liu and Croft (2004) and Kurland and Lee (2004). An early term-dependency model was described in van Rijsbergen (1979). A bigram language model for information retrieval was described in F. Song and Croft (1999), but the more general models in Gao et al. (2004) and Metzler and Croft (2005b) produced significantly better retrieval results, especially with larger collections. The relevance model approach to query expansion appeared in Lavrenko and Croft (2001). Lafferty and Zhai (2001) proposed a related approach that built a query model and compared it to document models. There have been many experiments reported in the information retrieval liter- ature showing that the combination of evidence significantly improves the rank- ing effectiveness. Croft (2000) reviews these results and shows that this is not surprising, given that information retrieval can be viewed as a classification prob- lem with a huge choice of features. Turtle and Croft (1991) describe the infer- ence network model. 
This model was used as the basis for the Inquery search en-</p> <p><span class="badge badge-info text-white mr-2">318</span> 294 7 Retrieval Models gine (Callan et al., 1992) and the WIN version of the commercial search engine WESTLAW (Pritchard-Schoch, 1993). The extension of this model to include language model probabilities is described in Metzler and Croft (2004). This ex- tension was implemented as the Indri search engine (Strohman et al., 2005; Met- zler, Strohman, et al., 2004). The Galago query language is based on the query language for Indri. The approach to web search described in section 7.5, which scores documents based on a combination or mixture of language models representing different parts of the document structure, is based on Ogilvie and Callan (2003). The BM25F ranking function (Robertson et al., 2004) is an extension of BM25 that is also designed to effectively combine information from different document fields. Spam is of such importance in web search that an entire subfield, called ad- versarial information retrieval , has developed to deal with search techniques for document collections that are being manipulated by parties with different inter- ests (such as spammers and search engine optimizers). We discuss the topic of spam in Chapter 9. The early work on learning ranking functions includes the use of logistic re- gression (Cooper et al., 1992). Fuhr and Buckley (1991) were the first to de- scribe clearly how using features that are independent of the actual query words (e.g., using a feature like the number of matching terms rather than which terms matched) enable the learning of ranking functions across queries. The use of Ranking SVM for information retrieval was described by Joachims (2002b). Cao et al. (2006) describe modifications of this approach that improve ranking effec- tiveness. RankNet (C. Burges et al., 2005) is a neural network approach to learn- ing a ranking function that is used in the Microsoft web search engine. Agichtein, Brill, and Dumais (2006) describe how user behavior features can be incorporated effectively into ranking based on RankNet. Both Ranking SVMs and RankNet learn using partial rank information (i.e., pairwise preferences). Another class of learning models, called listwise models, use the entire ranked list for learning. Ex- amples of these models include the linear discriminative model proposed by Gao et al. (2005), which learns weights for features that are based on language models. This approach has some similarities to the inference network model being used to combine language model and other features. Another listwise approach is the term dependence model proposed by Metzler and Croft (2005b), which is also based on a linear combination of features. Both the Gao and Metzler models pro- vide a learning technique that maximizes average precision (an important infor-</p> <p><span class="badge badge-info text-white mr-2">319</span> 7.7 Application-Based Models 295 mation retrieval metric) directly. More information about listwise learning mod- els can be found in Xia et al. (2008). Hofmann (1999) described a probabilistic version of LSI (pLSI) that intro- duced the modeling of documents as a mixture of topics. The LDA model was described by Blei et al. (2003). A number of extensions of this model have been proposed since then, but they have not been applied to information retrieval. The application of LDA to information retrieval was described in Wei and Croft (2006). Exercises 7.1. 
7.1. Use the "advanced search" feature of a web search engine to come up with three examples of searches using the Boolean operators AND, OR, and NOT that work better than using the same query in the regular search box. Do you think the search engine is using a strict Boolean model of retrieval for the advanced search?

7.2. Can you think of another measure of similarity that could be used in the vector space model? Compare your measure with the cosine correlation using some example documents and queries with made-up weights. Browse the IR literature on the Web and see whether your measure has been studied (start with van Rijsbergen's book).

7.3. If each term represents a dimension in a t-dimensional space, the vector space model is making an assumption that the terms are orthogonal. Explain this assumption and discuss whether you think it is reasonable.

7.4. Derive Bayes' Rule from the definition of a conditional probability:

P(A|B) = P(A ∩ B) / P(B)

Give an example of a conditional and a joint probability using the occurrence of words in documents as the events.

7.5. Implement a BM25 module for Galago. Show that it works and document it.

7.6. Show the effect of changing parameter values in your BM25 implementation.

7.7. What is the "bucket" analogy for a bigram language model? Give examples.

7.8. Using the Galago implementation of query likelihood, study the impact of short queries and long queries on effectiveness. Do the parameter settings make a difference?

7.9. Implement the relevance model approach to pseudo-relevance feedback in Galago. Show it works by generating some expansion terms for queries and document it.

7.10. Show that the bel_wand operator computes the query likelihood score with simple terms. What does the bel_wsum operator compute?

7.11. Implement a #not operator for the inference network query language in Galago. Show some examples of how it works.

7.12. Do a detailed design for numeric operators for the inference network query language in Galago.

7.13. Write an interface program that will take a user's query as text and transform it into an inference network query. Make sure you use proximity operators. Compare the performance of the simple queries and the transformed queries.

8 Evaluating Search Engines

"Evaluation, Mr. Spock." Captain Kirk, Star Trek: The Motion Picture

8.1 Why Evaluate?

Evaluation is the key to making progress in building better search engines. It is also essential to understanding whether a search engine is being used effectively in a specific application. Engineers don't make decisions about a new design for a commercial aircraft based on whether it feels better than another design. Instead, they test the performance of the design with simulations and experiments, evaluate everything again when a prototype is built, and then continue to monitor and tune the performance of the aircraft after it goes into service. Experience has shown us that ideas that we intuitively feel must improve search quality, or models that have appealing formal properties, often have little or no impact when tested using quantitative experiments.

One of the primary distinctions made in the evaluation of search engines is between effectiveness and efficiency.
Effectiveness, loosely speaking, measures the ability of the search engine to find the right information, and efficiency measures how quickly this is done. For a given query, and a specific definition of relevance, we can more precisely define effectiveness as a measure of how well the ranking produced by the search engine corresponds to a ranking based on user relevance judgments. Efficiency is defined in terms of the time and space requirements for the algorithm that produces the ranking. Viewed more generally, however, search is an interactive process involving different types of users with different information problems. In this environment, effectiveness and efficiency will be affected by many factors, such as the interface used to display search results and query refinement techniques, such as query suggestion and relevance feedback. Carrying out this type of holistic evaluation of effectiveness and efficiency, while important, is very difficult because of the many factors that must be controlled. For this reason, evaluation is more typically done in tightly defined experimental settings, and this is the type of evaluation we focus on here.

Effectiveness and efficiency are related in that techniques that give a small boost to effectiveness may not be included in a search engine implementation if they have a significant adverse effect on an efficiency measure such as query throughput. Generally speaking, however, information retrieval research focuses on improving the effectiveness of search, and when a technique has been established as being potentially useful, the focus shifts to finding efficient implementations. This is not to say that research on system architecture and efficiency is not important. The techniques described in Chapter 5 are a critical part of building a scalable and usable search engine and were primarily developed by research groups. The focus on effectiveness is based on the underlying goal of a search engine, which is to find the relevant information. A search engine that is extremely fast is of no use unless it produces good results.

So is there a trade-off between efficiency and effectiveness? Some search engine designers discuss having "knobs," or parameters, on their system that can be turned to favor either high-quality results or improved efficiency. The current situation, however, is that there is no reliable technique that significantly improves effectiveness that cannot be incorporated into a search engine due to efficiency considerations. This may change in the future.

In addition to efficiency and effectiveness, the other significant consideration in search engine design is cost. We may know how to implement a particular search technique efficiently, but to do so may require a huge investment in processors, memory, disk space, and networking. In general, if we pick targets for any two of these three factors, the third will be determined. For example, if we want a particular level of effectiveness and efficiency, this will determine the cost of the system configuration. Alternatively, if we decide on efficiency and cost targets, it may have an impact on effectiveness. Two extreme cases of choices for these factors are searching using a pattern-matching utility such as grep, or searching using an organization such as the Library of Congress.
Searching a large text collection using grep will have poor effectiveness and poor efficiency, but will be very cheap. Searching using the staff analysts at the Library of Congress will produce excellent results (high effectiveness) due to the manual effort involved, will be efficient in terms of the user's time (although it will involve a delay waiting for a response from the analysts), and will be very expensive. Searching using an effective search engine is designed to be a reasonable compromise between these extremes.

An important point about terminology is the meaning of "optimization" as it is discussed in the context of evaluation. The retrieval and indexing techniques in a search engine have many parameters that can be adjusted to optimize performance, both in terms of effectiveness and efficiency. Typically the best values for these parameters are determined using training data and a cost function. Training data is a sample of the real data, and the cost function is the quantity based on the data that is being maximized (or minimized). For example, the training data could be samples of queries with relevance judgments, and the cost function for a ranking algorithm would be a particular effectiveness measure. The optimization process would use the training data to learn parameter settings for the ranking algorithm that maximized the effectiveness measure. This use of optimization is very different from "search engine optimization", which is the process of tailoring web pages to ensure high rankings from search engines.

In the remainder of this chapter, we will discuss the most important evaluation measures, both for effectiveness and efficiency. We will also describe how experiments are carried out in controlled environments to ensure that the results are meaningful.

8.2 The Evaluation Corpus

One of the basic requirements for evaluation is that the results from different techniques can be compared. To do this comparison fairly and to ensure that experiments are repeatable, the experimental settings and data used must be fixed. Starting with the earliest large-scale evaluations of search performance in the 1960s, generally referred to as the Cranfield experiments¹ (Cleverdon, 1970), researchers assembled test collections consisting of documents, queries, and relevance judgments to address this requirement. In other language-related research fields, such as linguistics, machine translation, or speech recognition, a text corpus is a large amount of text, usually in the form of many documents, that is used for statistical analysis of various kinds. The test collection, or evaluation corpus, in information retrieval is unique in that the queries and relevance judgments for a particular search task are gathered in addition to the documents.

1 Named after the place in the United Kingdom where the experiments were done.

Test collections have changed over the years to reflect the changes in data and user communities for typical search applications. As an example of these changes, the following three test collections were created at intervals of about 10 years, starting in the 1980s:

• CACM: Titles and abstracts from the Communications of the ACM from 1958–1979. Queries and relevance judgments generated by computer scientists.
• AP: Associated Press newswire documents from 1988–1990 (from TREC disks 1–3). Queries are the title fields from TREC topics 51–150. Topics and relevance judgments generated by government information analysts.

• GOV2: Web pages crawled from websites in the .gov domain during early 2004. Queries are the title fields from TREC topics 701–850. Topics and relevance judgments generated by government analysts.

The CACM collection was created when most search applications focused on bibliographic records containing titles and abstracts, rather than the full text of documents. Table 8.1 shows that the number of documents in the collection (3,204) and the average number of words per document (64) are both quite small. The total size of the document collection is only 2.2 megabytes, which is considerably less than the size of a single typical music file for an MP3 player. The queries for this collection of abstracts of computer science papers were generated by students and faculty of a computer science department, and are supposed to represent actual information needs. An example of a CACM query is:

Security considerations in local networks, network operating systems, and distributed systems.

Relevance judgments for each query were done by the same people, and were relatively exhaustive in the sense that most relevant documents were identified. This was possible since the collection is small and the people who generated the questions were very familiar with the documents. Table 8.2 shows that the CACM queries are quite long (13 words on average) and that there are an average of 16 relevant documents per query.

Collection   Number of documents   Size     Average number of words/doc.
CACM         3,204                 2.2 MB   64
AP           242,918               0.7 GB   474
GOV2         25,205,179            426 GB   1073

Table 8.1. Statistics for three example text collections. The average number of words per document is calculated without stemming.

The AP and GOV2 collections were created as part of the TREC conference series sponsored by the National Institute of Standards and Technology (NIST). The AP collection is typical of the full-text collections that were first used in the early 1990s. The availability of cheap magnetic disk technology and online text entry led to a number of search applications for full-text documents such as news stories, legal documents, and encyclopedia articles. The AP collection is much bigger (by two orders of magnitude) than the CACM collection, both in terms of the number of documents and the total size. The average document is also considerably longer (474 versus 64 words) since they contain the full text of a news story. The GOV2 collection, which is another two orders of magnitude larger, was designed to be a testbed for web search applications and was created by a crawl of the .gov domain. Many of these government web pages contain lengthy policy descriptions or tables, and consequently the average document length is the largest of the three collections.

Collection   Number of queries   Average number of words/query   Average number of relevant docs/query
CACM         64                  13.0                             16
AP           100                 4.3                              220
GOV2         150                 3.1                              180

Table 8.2. Statistics for queries from example text collections

The queries for the AP and GOV2 collections are based on TREC topics. The topics were created by government information analysts employed by NIST.
The early TREC topics were designed to reflect the needs of professional analysts in government and industry and were quite complex. Later TREC topics were supposed to represent more general information needs, but they retained the TREC topic format. An example is shown in Figure 8.1. TREC topics contain three fields indicated by the tags. The title field is supposed to be a short query, more typical of a web application. The description field is a longer version of the query, which as this example shows, can sometimes be more precise than the short query. The narrative field describes the criteria for relevance, which is used by the people doing relevance judgments to increase consistency, and should not be considered as a query. Most recent TREC evaluations have focused on using the title field of the topic as the query, and our statistics in Table 8.2 are based on that field.

<top>
<num> Number: 794
<title> pet therapy
<desc> Description:
How are pets or animals used in therapy for humans and what are the benefits?
<narr> Narrative:
Relevant documents must include details of how pet- or animal-assisted therapy is or has been used. Relevant details include information about therapy programs, descriptions of the circumstances in which pet therapy is used, the benefits of this type of therapy, the degree of success of this therapy, and any laws or regulations governing it.
</top>

Fig. 8.1. Example of a TREC topic

The relevance judgments in TREC depend on the task that is being evaluated. For the queries in these tables, the task emphasized high recall, where it is important not to miss information. Given the context of that task, TREC analysts judged a document as relevant if it contained information that could be used to help write a report on the query topic. In Chapter 7, we discussed the difference between user relevance and topical relevance. Although the TREC relevance definition does refer to the usefulness of the information found, analysts are instructed to judge all documents containing the same useful information as relevant. This is not something a real user is likely to do, and shows that TREC is primarily focused on topical relevance.

Relevance judgments for the CACM collections are binary, meaning that a document is either relevant or it is not. This is also true of most of the TREC collections. For some tasks, multiple levels of relevance may be appropriate. Some TREC collections, including GOV2, were judged using three levels of relevance (not relevant, relevant, and highly relevant). We discuss effectiveness measures for both binary and graded relevance in section 8.4. Different retrieval tasks can affect the number of relevance judgments required, as well as the type of judgments and the effectiveness measure. For example, in Chapter 7 we described navigational searches, where the user is looking for a particular page. In this case, there is only one relevant document for the query.

Creating a new test collection can be a time-consuming task. Relevance judgments in particular require a considerable investment of manual effort for the high-recall search task. When collections were very small, most of the documents in a collection could be evaluated for relevance. In a collection such as GOV2, however, this would clearly be impossible.
Instead, a technique called pooling is used. In this technique, the top k results (for TREC, k varied between 50 and 200) from the rankings obtained by different search engines (or retrieval algorithms) are merged into a pool, duplicates are removed, and the documents are presented in some random order to the people doing the relevance judgments. Pooling produces a large number of relevance judgments for each query, as shown in Table 8.2. However, this list is incomplete and, for a new retrieval algorithm that had not contributed documents to the original pool, this could potentially be a problem. Specifically, if a new algorithm found many relevant documents that were not part of the pool, they would be treated as being not relevant, and consequently the effectiveness of that algorithm could be significantly underestimated. Studies with the TREC data, however, have shown that the relevance judgments are complete enough to produce accurate comparisons for new search techniques.

TREC corpora have been extremely useful for evaluating new search techniques, but they have limitations. A high-recall search task and collections of news articles are clearly not appropriate for evaluating product search on an e-commerce site, for example. New TREC "tracks" can be created to address important new applications, but this process can take months or years. On the other hand, new search applications and new data types such as blogs, forums, and annotated videos are constantly being developed. Fortunately, it is not that difficult to develop an evaluation corpus for any given application using the following basic guidelines:

• Use a document collection that is representative for the application in terms of the number, size, and type of documents. In some cases, this may be the actual collection for the application; in others it will be a sample of the actual collection, or even a similar collection. If the target application is very general, then more than one collection should be used to ensure that results are not corpus-specific. For example, in the case of the high-recall TREC task, a number of different news and government collections were used for evaluation.

• The queries that are used for the test collection should also be representative of the queries submitted by users of the target application. These may be acquired either from a query log from a similar application or by asking potential users for examples of queries. Although it may be possible to gather tens of thousands of queries in some applications, the need for relevance judgments is a major constraint. The number of queries must be sufficient to establish that a new technique makes a significant difference. An analysis of TREC experiments has shown that with 25 queries, a difference in the effectiveness measure MAP (section 8.4.2) of 0.05 will result in the wrong conclusion about which system is better in about 13% of the comparisons. With 50 queries, this error rate falls below 4%. A difference of 0.05 in MAP is quite large. If a significance test, such as those discussed in section 8.6.1, is used in the evaluation, a relative difference of 10% in MAP is sufficient to guarantee a low error rate with 50 queries.
If resources or the application make more relevance judgments possible, it will be more productive, in terms of generating reliable results, to judge more queries rather than to judge more documents from existing queries (i.e., increasing k). Strategies such as judging a small number (e.g., 10) of the top-ranked documents from many queries or selecting documents to judge that will make the most difference in the comparison (Carterette et al., 2006) have been shown to be effective. If a small number of queries are used, the results should be considered indicative, not conclusive. In that case, it is important that the queries should be at least representative and have good coverage in terms of the goals of the application. For example, if algorithms for local search were being tested, the queries in the test collection should include many different types of location information.

• Relevance judgments should be done either by the people who asked the questions or by independent judges who have been instructed in how to determine relevance for the application being evaluated. Relevance may seem to be a very subjective concept, and it is known that relevance judgments can vary depending on the person making the judgments, or even vary for the same person at different times. Despite this variation, analysis of TREC experiments has shown that conclusions about the relative performance of systems are very stable. In other words, differences in relevance judgments do not have a significant effect on the error rate for comparisons. The number of documents that are evaluated for each query and the type of relevance judgments will depend on the effectiveness measures that are chosen. For most applications, it is generally easier for people to decide between at least three levels of relevance: definitely relevant, definitely not relevant, and possibly relevant. These can be converted into binary judgments by assigning the "possibly relevant" level to either one of the other levels, if that is required for an effectiveness measure. Some applications and effectiveness measures, however, may support more than three levels of relevance.

As a final point, it is worth emphasizing that many user actions can be considered implicit relevance judgments, and that if these can be exploited, this can substantially reduce the effort of constructing a test collection. For example, actions such as clicking on a document in a result list, moving it to a folder, or sending it to a printer may indicate that it is relevant. In previous chapters, we have described how query logs and clickthrough can be used to support operations such as query expansion and spelling correction. In the next section, we discuss the role of query logs in search engine evaluation.

8.3 Logging

Query logs that capture user interactions with a search engine have become an extremely important resource for web search engine development. From an evaluation perspective, these logs provide large amounts of data showing how users browse the results that a search engine provides for a query. In a general web search application, the number of users and queries represented can number in the tens of millions. Compared to the hundreds of queries used in typical TREC collections, query log data can potentially support a much more extensive and realistic evaluation.
The main drawback with this data is that it is not as precise as explicit relevance judgments.

An additional concern is maintaining the privacy of the users. This is particularly an issue when query logs are shared, distributed for research, or used to construct user profiles (see section 6.2.5). Various techniques can be used to anonymize the logged data, such as removing identifying information or queries that may contain personal data, although this can reduce the utility of the log for some purposes.

A typical query log will contain the following data for each query:

• User identifier or user session identifier. This can be obtained in a number of ways. If a user logs onto a service, uses a search toolbar, or even allows cookies, this information allows the search engine to identify the user. A session is a series of queries submitted to a search engine over a limited amount of time. In some circumstances, it may be possible to identify a user only in the context of a session.

• Query terms. The query is stored exactly as the user entered it.

• List of URLs of results, their ranks on the result list, and whether they were clicked on.²

• Timestamp(s). The timestamp records the time that the query was submitted. Additional timestamps may also record the times that specific results were clicked on.

2 In some logs, only the clicked-on URLs are recorded. Logging all the results enables the generation of preferences and provides a source of "negative" examples for various tasks.

The clickthrough data in the log (the third item) has been shown to be highly correlated with explicit judgments of relevance when interpreted appropriately, and has been used for both training and evaluating search engines. More detailed information about user interaction can be obtained through a client-side application, such as a search toolbar in a web browser. Although this information is not always available, some user actions other than clickthroughs have been shown to be good predictors of relevance. Two of the best predictors are page dwell time and search exit action. The page dwell time is the amount of time the user spends on a clicked result, measured from the initial click to the time when the user comes back to the results page or exits the search application. The search exit action is the way the user exits the search application, such as entering another URL, closing the browser window, or timing out. Other actions, such as printing a page, are very predictive but much less frequent.

Although clicks on result pages are highly correlated with relevance, they cannot be used directly in place of explicit relevance judgments, because they are very biased toward pages that are highly ranked or have other features such as being popular or having a good snippet on the result page. This means, for example, that pages at the top rank are clicked on much more frequently than lower-ranked pages, even when the relevant pages are at the lower ranks. One approach to removing this bias is to use clickthrough data to predict user preferences between pairs of documents rather than relevance judgments. User preferences were first mentioned in section 7.6, where they were used to train a ranking function. A preference for document d1 compared to document d2 means that d1 is more relevant or, equivalently, that it should be ranked higher.
Preferences are most appropriate for search tasks where documents can have multiple levels of relevance, and are focused more on user relevance than purely topical relevance. Relevance judgments (either multi-level or binary) can be used to generate preferences, but preferences do not imply specific relevance levels.

The bias in clickthrough data is addressed by "strategies," or policies that generate preferences. These strategies are based on observations of user behavior and verified by experiments. One strategy that is similar to that described in section 7.6 is known as Skip Above and Skip Next (Agichtein, Brill, Dumais, & Ragno, 2006). This strategy assumes that, given a set of results for a query and a clicked result at rank position p, all unclicked results ranked above p are predicted to be less relevant than the result at p. In addition, unclicked results immediately following a clicked result are less relevant than the clicked result. For example, given a result list of ranked documents together with click data as follows:

d1
d2
d3 (clicked)
d4

this strategy will generate the following preferences:

d3 > d2
d3 > d1
d3 > d4

Since preferences are generated only when higher-ranked documents are ignored, a major source of bias is removed.

The "Skip" strategy uses the clickthrough patterns of individual users to generate preferences. This data can be noisy and inconsistent because of the variability in users' behavior. Since query logs typically contain many instances of the same query submitted by different users, clickthrough data can be aggregated to remove potential noise from individual differences. Specifically, click distribution information can be used to identify clicks that have a higher frequency than would be expected based on typical click patterns. These clicks have been shown to correlate well with relevance judgments. For a given query, we can use all the instances of that query in the log to compute the observed click frequency O(d, p) for the result d in rank position p. We can also compute the expected click frequency E(p) at rank p by averaging across all queries. The click deviation CD(d, p) for a result d in position p is computed as:

CD(d, p) = O(d, p) − E(p)

We can then use the value of CD(d, p) to "filter" clicks and provide more reliable click information to the Skip strategy.
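As an illustration (not code from the book or from Galago), the following minimal Java sketch generates Skip Above and Skip Next preferences from a clicked result list and computes the click deviation just defined; the class and method names, and the numbers in the click-deviation call, are assumptions made for this example.

import java.util.ArrayList;
import java.util.List;

public class ClickPreferences {

    // Skip Above and Skip Next: for each clicked result at position p, prefer it to
    // every unclicked result above p and to an unclicked result immediately after p.
    public static List<String> skipAboveAndNext(String[] ranking, boolean[] clicked) {
        List<String> prefs = new ArrayList<>();
        for (int p = 0; p < ranking.length; p++) {
            if (!clicked[p]) continue;
            for (int i = 0; i < p; i++)                      // Skip Above
                if (!clicked[i]) prefs.add(ranking[p] + " > " + ranking[i]);
            if (p + 1 < ranking.length && !clicked[p + 1])   // Skip Next
                prefs.add(ranking[p] + " > " + ranking[p + 1]);
        }
        return prefs;
    }

    // Click deviation: observed click frequency minus the expected frequency
    // at that rank, CD(d, p) = O(d, p) - E(p).
    public static double clickDeviation(double observed, double expectedAtRank) {
        return observed - expectedAtRank;
    }

    public static void main(String[] args) {
        String[] ranking = {"d1", "d2", "d3", "d4"};
        boolean[] clicked = {false, false, true, false};
        // Prints d3 > d1, d3 > d2, d3 > d4 for the example above
        System.out.println(skipAboveAndNext(ranking, clicked));
        // Hypothetical frequencies: d clicked 45% of the time at a rank where 30% is expected
        System.out.println(clickDeviation(0.45, 0.30));
    }
}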
A typical evaluation scenario involves the comparison of the result lists for two or more systems for a given set of queries. Preferences are an alternate method of specifying which documents should be retrieved for a given query (relevance judgments being the typical method). The quality of the result lists for each system is then summarized using an effectiveness measure that is based on either preferences or relevance judgments. The following section describes the measures that are most commonly used in research and system development.

8.4 Effectiveness Metrics

8.4.1 Recall and Precision

The two most common effectiveness measures, recall and precision, were introduced in the Cranfield studies to summarize and compare search results. Intuitively, recall measures how well the search engine is doing at finding all the relevant documents for a query, and precision measures how well it is doing at rejecting non-relevant documents.

The definition of these measures assumes that, for a given query, there is a set of documents that is retrieved and a set that is not retrieved (the rest of the documents). This obviously applies to the results of a Boolean search, but the same definition can also be used with a ranked search, as we will see later. If, in addition, relevance is assumed to be binary, then the results for a query can be summarized as shown in Table 8.3. In this table, A is the relevant set of documents for the query, Ā is the non-relevant set, B is the set of retrieved documents, and B̄ is the set of documents that are not retrieved. The operator ∩ gives the intersection of two sets. For example, A ∩ B is the set of documents that are both relevant and retrieved.

                Relevant   Non-Relevant
Retrieved       A ∩ B      Ā ∩ B
Not Retrieved   A ∩ B̄      Ā ∩ B̄

Table 8.3. Sets of documents defined by a simple search with binary relevance

A number of effectiveness measures can be defined using this table. The two we are particularly interested in are:

Recall = |A ∩ B| / |A|

Precision = |A ∩ B| / |B|

where |.| gives the size of the set. In other words, recall is the proportion of relevant documents that are retrieved, and precision is the proportion of retrieved documents that are relevant. There is an implicit assumption in using these measures that the task involves retrieving as many of the relevant documents as possible and minimizing the number of non-relevant documents retrieved. In other words, even if there are 500 relevant documents for a query, the user is interested in finding them all.

We can also view the search results summarized in Table 8.3 as the output of a binary classifier, as was mentioned in section 7.2.1. When a document is retrieved, it is the same as making a prediction that the document is relevant. From this perspective, there are two types of errors that can be made in prediction (or retrieval). These errors are called false positives (a non-relevant document is retrieved) and false negatives (a relevant document is not retrieved). Recall is related to one type of error (the false negatives), but precision is not related directly to the other type of error. Instead, another measure known as fallout,³ which is the proportion of non-relevant documents that are retrieved, is related to the false positive errors:

Fallout = |Ā ∩ B| / |Ā|

3 In the classification and signal detection literature, the errors are known as Type I and Type II errors. Recall is often called the true positive rate, or sensitivity. Fallout is called the false positive rate, or the false alarm rate. Another measure, specificity, is 1 − fallout. Precision is known as the positive predictive value, and is often used in medical diagnostic tests where the probability that a positive test is correct is particularly important. The true positive rate and the false positive rate are used to draw ROC (receiver operating characteristic) curves that show the trade-off between these two quantities as the discrimination threshold varies. This threshold is the value at which the classifier makes a positive prediction. In the case of search, the threshold would correspond to a position in the document ranking. In information retrieval, recall-precision graphs are generally used instead of ROC curves.
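A small, self-contained Java sketch of these set-based definitions, using made-up document identifiers rather than any real collection, might look like this:

import java.util.HashSet;
import java.util.Set;

public class SetMeasures {

    // |A ∩ B|: size of the intersection of two document sets
    static <T> long intersectionSize(Set<T> a, Set<T> b) {
        return a.stream().filter(b::contains).count();
    }

    public static void main(String[] args) {
        // Hypothetical identifiers: A = relevant, B = retrieved, 10 documents in total
        Set<String> collection = new HashSet<>(Set.of("d1","d2","d3","d4","d5","d6","d7","d8","d9","d10"));
        Set<String> relevant  = new HashSet<>(Set.of("d1", "d3", "d4", "d8"));
        Set<String> retrieved = new HashSet<>(Set.of("d1", "d2", "d3", "d7"));

        long relRetrieved = intersectionSize(relevant, retrieved);
        double recall    = (double) relRetrieved / relevant.size();    // |A ∩ B| / |A|
        double precision = (double) relRetrieved / retrieved.size();   // |A ∩ B| / |B|

        // Fallout uses the non-relevant set: the collection minus A
        Set<String> nonRelevant = new HashSet<>(collection);
        nonRelevant.removeAll(relevant);
        double fallout = (double) intersectionSize(nonRelevant, retrieved) / nonRelevant.size();

        System.out.printf("recall=%.2f precision=%.2f fallout=%.2f%n", recall, precision, fallout);
        // recall = 2/4 = 0.50, precision = 2/4 = 0.50, fallout = 2/6 = 0.33
    }
}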
Given that fallout and recall together characterize the effectiveness of a search as a classifier, why do we use precision instead? The answer is simply that precision is more meaningful to the user of a search engine. If 20 documents were retrieved for a query, a precision value of 0.7 means that 14 out of the 20 retrieved documents would be relevant. Fallout, on the other hand, will always be very small because there are so many non-relevant documents. If there were 1,000,000 non-relevant documents for the query used in the precision example, fallout would be 6/1000000 = 0.000006. If precision fell to 0.5, which would be noticeable to the user, fallout would be 0.00001. The skewed nature of the search task, where most of the corpus is not relevant to any given query, also means that evaluating a search engine as a classifier can lead to counterintuitive results. A search engine trained to minimize classification errors would tend to retrieve nothing, since classifying a document as non-relevant is always a good decision!

The F measure is an effectiveness measure based on recall and precision that is used for evaluating classification performance and also in some search applications. It has the advantage of summarizing effectiveness in a single number. It is defined as the harmonic mean of recall and precision, which is:

F = 1 / ((1/R + 1/P) / 2) = 2RP / (R + P)

Why use the harmonic mean instead of the usual arithmetic mean or average? The harmonic mean emphasizes the importance of small values, whereas the arithmetic mean is affected more by values that are unusually large (outliers). A search result that returned nearly the entire document collection, for example, would have a recall of 1.0 and a precision near 0. The arithmetic mean of these values is 0.5, but the harmonic mean will be close to 0. The harmonic mean is clearly a better summary of the effectiveness of this retrieved set.⁴

4 The more general form of the F measure is the weighted harmonic mean, which allows weights reflecting the relative importance of recall and precision to be used. This measure is F = RP / (αR + (1 − α)P), where α is a weight. This is often transformed using α = 1/(β² + 1), which gives F = (β² + 1)RP / (R + β²P). The common F measure is in fact F₁, where recall and precision have equal importance. In some evaluations, precision or recall is emphasized by varying the value of β. Values of β > 1 emphasize recall.
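The effect of the harmonic mean on such an extreme case can be checked with a few lines of Java; the recall and precision values below are hypothetical, chosen to mimic the "retrieve nearly everything" example just described.

public class FMeasure {
    // Harmonic mean of recall and precision: F = 2RP / (R + P)
    static double f(double recall, double precision) {
        return 2 * recall * precision / (recall + precision);
    }

    public static void main(String[] args) {
        double recall = 1.0, precision = 0.001;   // "retrieve everything" extreme
        double arithmetic = (recall + precision) / 2;
        System.out.printf("arithmetic=%.4f harmonic=%.4f%n", arithmetic, f(recall, precision));
        // arithmetic = 0.5005, harmonic ≈ 0.0020: the F measure stays near 0
    }
}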
Most of the retrieval models we have discussed produce ranked output. To use recall and precision measures, retrieved sets of documents must be defined based on the ranking. One possibility is to calculate recall and precision values at every rank position. Figure 8.2 shows the top ten documents of two possible rankings, together with the recall and precision values calculated at every rank position for a query that has six relevant documents. These rankings might correspond to, for example, the output of different retrieval algorithms or search engines.

At rank position 10 (i.e., when ten documents are retrieved), the two rankings have the same effectiveness as measured by recall and precision. Recall is 1.0 because all the relevant documents have been retrieved, and precision is 0.6 because both rankings contain six relevant documents in the retrieved set of ten documents. At higher rank positions, however, the first ranking is clearly better. For example, at rank position 4 (four documents retrieved), the first ranking has a recall of 0.5 (three out of six relevant documents retrieved) and a precision of 0.75 (three out of four retrieved documents are relevant). The second ranking has a recall of 0.17 (1/6) and a precision of 0.25 (1/4).

Ranking #1 (R = relevant, N = non-relevant): R N R R R R N N N R
Recall:    0.17 0.17 0.33 0.5  0.67 0.83 0.83 0.83 0.83 1.0
Precision: 1.0  0.5  0.67 0.75 0.8  0.83 0.71 0.63 0.56 0.6

Ranking #2: N R N N R R R N R R
Recall:    0.0  0.17 0.17 0.17 0.33 0.5  0.67 0.67 0.83 1.0
Precision: 0.0  0.5  0.33 0.25 0.4  0.5  0.57 0.5  0.56 0.6

Fig. 8.2. Recall and precision values for two rankings of six relevant documents

If there are a large number of relevant documents for a query, or if the relevant documents are widely distributed in the ranking, a list of recall-precision values for every rank position will be long and unwieldy. Instead, a number of techniques have been developed to summarize the effectiveness of a ranking. The first of these is simply to calculate recall-precision values at a small number of predefined rank positions. In fact, to compare two or more rankings for a given query, only the precision at the predefined rank positions needs to be calculated. If the precision for a ranking at rank position p is higher than the precision for another ranking, the recall will be higher as well. This can be seen by comparing the corresponding recall-precision values in Figure 8.2. This effectiveness measure is known as precision at rank p. There are many possible values for the rank position p, but this measure is typically used to compare search output at the top of the ranking, since that is what many users care about. Consequently, the most common versions are precision at 10 and precision at 20. Note that if these measures are used, the implicit search task has changed to finding the most relevant documents at a given rank, rather than finding as many relevant documents as possible. Differences in search output further down the ranking than position 20 will not be considered. This measure also does not distinguish between differences in the rankings at positions 1 to p, which may be considered important for some tasks. For example, the two rankings in Figure 8.2 will be the same when measured using precision at 10.

Another method of summarizing the effectiveness of a ranking is to calculate precision at fixed or standard recall levels from 0.0 to 1.0 in increments of 0.1. Each ranking is then represented using 11 numbers. This method has the advantage of summarizing the effectiveness of the ranking of all relevant documents, rather than just those in the top ranks. Using the recall-precision values in Figure 8.2 as an example, however, it is clear that values of precision at these standard recall levels are often not available. In this example, only the precision values at the standard recall levels of 0.5 and 1.0 have been calculated. To obtain the precision values at all of the standard recall levels will require interpolation.⁵ Since standard recall levels are used as the basis for averaging effectiveness across queries and generating recall-precision graphs, we will discuss interpolation in the next section.
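The per-rank values in Figure 8.2 can be reproduced by a simple loop over a binary relevance vector; the following Java sketch (illustrative, not part of Galago) prints the values for ranking #1.

public class RankMeasures {

    // Recall and precision at every rank position for a binary relevance vector
    static void recallPrecisionAtEachRank(boolean[] relevantAtRank, int totalRelevant) {
        int found = 0;
        for (int i = 0; i < relevantAtRank.length; i++) {
            if (relevantAtRank[i]) found++;
            double recall = (double) found / totalRelevant;
            double precision = (double) found / (i + 1);
            System.out.printf("rank %2d  recall %.2f  precision %.2f%n", i + 1, recall, precision);
        }
    }

    public static void main(String[] args) {
        // Ranking #1 from Figure 8.2: relevant documents at ranks 1, 3, 4, 5, 6, 10
        boolean[] ranking1 = {true, false, true, true, true, true, false, false, false, true};
        recallPrecisionAtEachRank(ranking1, 6);
        // e.g., rank 4: recall 0.50, precision 0.75; rank 10: recall 1.00, precision 0.60
    }
}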
The third method, and the most popular, is to summarize the ranking by averaging the precision values from the rank positions where a relevant document was retrieved (i.e., when recall increases). If a relevant document is not retrieved for some reason,⁶ the contribution of this document to the average is 0.0. For the first ranking in Figure 8.2, the average precision is calculated as:

(1.0 + 0.67 + 0.75 + 0.8 + 0.83 + 0.6)/6 = 0.78

5 Interpolation refers to any technique for calculating a new point between two existing data points.
6 One common reason is that only a limited number of the top-ranked documents (e.g., 1,000) are considered.

For the second ranking, it is:

(0.5 + 0.4 + 0.5 + 0.57 + 0.56 + 0.6)/6 = 0.52

Average precision has a number of advantages. It is a single number that is based on the ranking of all the relevant documents, but the value depends heavily on the highly ranked relevant documents. This means it is an appropriate measure for evaluating the task of finding as many relevant documents as possible while still reflecting the intuition that the top-ranked documents are the most important.

All three of these methods summarize the effectiveness of a ranking for a single query. To provide a realistic assessment of the effectiveness of a retrieval algorithm, it must be tested on a number of queries. Given the potentially large set of results from these queries, we will need a method of summarizing the performance of the retrieval algorithm by calculating the average effectiveness for the entire set of queries. In the next section, we discuss the averaging techniques that are used in most evaluations.

8.4.2 Averaging and Interpolation

In the following discussion of averaging techniques, the two rankings shown in Figure 8.3 are used as a running example. These rankings come from using the same ranking algorithm on two different queries. The aim of an averaging technique is to summarize the effectiveness of a specific ranking algorithm across a collection of queries. Different queries will often have different numbers of relevant documents, as is the case in this example. Figure 8.3 also gives the recall-precision values calculated for the top 10 rank positions.

Given that the average precision provides a number for each ranking, the simplest way to summarize the effectiveness of rankings from multiple queries would be to average these numbers. This effectiveness measure, mean average precision,⁷ or MAP,⁸ is used in most research papers and some system evaluations.

7 This sounds a lot better than average average precision!
8 In some evaluations the geometric mean of the average precision (GMAP) is used instead of the arithmetic mean. This measure, because it multiplies average precision values, emphasizes the impact of queries with low performance. It is defined as

GMAP = exp( (1/n) ∑_{i=1}^{n} log AP_i )

where n is the number of queries, and AP_i is the average precision for query i.

Query 1 ranking (R = relevant, N = non-relevant): R N R N N R N N R R
Recall:    0.2  0.2  0.4  0.4  0.4  0.6  0.6  0.6  0.8  1.0
Precision: 1.0  0.5  0.67 0.5  0.4  0.5  0.43 0.38 0.44 0.5

Query 2 ranking: N R N N R N R N N N
Recall:    0.0  0.33 0.33 0.33 0.67 0.67 1.0  1.0  1.0  1.0
Precision: 0.0  0.5  0.33 0.25 0.4  0.33 0.43 0.38 0.33 0.3

Fig. 8.3. Recall and precision values for rankings from two different queries
Since it is based on average precision, it assumes that the user is interested in finding many relevant documents for each query. Consequently, using this measure for comparison of retrieval algorithms or systems can require a considerable effort to acquire the relevance judgments, although methods for reducing the number of judgments required have been suggested (e.g., Carterette et al., 2006).

For the example in Figure 8.3, the mean average precision is calculated as follows:

average precision query 1 = (1.0 + 0.67 + 0.5 + 0.44 + 0.5)/5 = 0.62
average precision query 2 = (0.5 + 0.4 + 0.43)/3 = 0.44
mean average precision = (0.62 + 0.44)/2 = 0.53

The MAP measure provides a very succinct summary of the effectiveness of a ranking algorithm over many queries. Although this is often useful, sometimes too much information is lost in this process. Recall-precision graphs, and the tables of recall-precision values they are based on, give more detail on the effectiveness of the ranking algorithm at different recall levels. Figure 8.4 shows the recall-precision graph for the two queries in the example from Figure 8.3. Graphs for individual queries have very different shapes and are difficult to compare.

Fig. 8.4. Recall-precision graphs for two queries

To generate a recall-precision graph that summarizes effectiveness over all the queries, the recall-precision values in Figure 8.3 should be averaged. To simplify the averaging process, the recall-precision values for each query are converted to precision values at standard recall levels, as mentioned in the last section. The precision values for all queries at each standard recall level can then be averaged.⁹

9 This is called a macroaverage in the literature. A macroaverage computes the measure of interest for each query and then averages these measures. A microaverage combines all the applicable data points from every query and computes the measure from the combined data. For example, a microaverage precision at rank 5 would be calculated as ∑_{i=1}^{n} r_i / 5n, where r_i is the number of relevant documents retrieved in the top five documents by query i, and n is the number of queries. Macroaveraging is used in most retrieval evaluations.

The standard recall levels are 0.0 to 1.0 in increments of 0.1. To obtain precision values for each query at these recall levels, the recall-precision data points, such as those in Figure 8.3, must be interpolated. That is, we have to define a function based on those data points that has a value at each standard recall level. There are many ways of doing interpolation, but only one method has been used in information retrieval evaluations since the 1970s. In this method, we define the precision P at any standard recall level R as

P(R) = max { P′ : R′ ≥ R ∧ (R′, P′) ∈ S }

where S is the set of observed (R, P) points. This interpolation, which defines the precision at any recall level as the maximum precision observed in any recall-precision point at a higher recall level, produces a step function, as shown in Figure 8.5.

Fig. 8.5. Interpolated recall-precision graphs for two queries
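The average precision, MAP, and interpolated precision calculations for the running example can be sketched in Java as follows; the method names are illustrative, and the printed Average row can differ by 0.01 from Table 8.4 at some recall levels because the table averages precision values that were already rounded to two decimals.

public class AveragePrecision {

    // Average precision: mean of the precision values at the ranks where a
    // relevant document was retrieved (relevant documents never retrieved add 0).
    static double averagePrecision(boolean[] relevantAtRank, int totalRelevant) {
        double sum = 0.0;
        int found = 0;
        for (int i = 0; i < relevantAtRank.length; i++) {
            if (relevantAtRank[i]) {
                found++;
                sum += (double) found / (i + 1);
            }
        }
        return sum / totalRelevant;
    }

    // Interpolated precision at a standard recall level: the maximum precision
    // observed at any rank whose recall is greater than or equal to that level.
    static double interpolatedPrecision(boolean[] relevantAtRank, int totalRelevant, double level) {
        double best = 0.0;
        int found = 0;
        for (int i = 0; i < relevantAtRank.length; i++) {
            if (relevantAtRank[i]) found++;
            double recall = (double) found / totalRelevant;
            double precision = (double) found / (i + 1);
            if (recall >= level) best = Math.max(best, precision);
        }
        return best;
    }

    public static void main(String[] args) {
        // The two queries of Figure 8.3: query 1 has 5 relevant documents, query 2 has 3
        boolean[] q1 = {true, false, true, false, false, true, false, false, true, true};
        boolean[] q2 = {false, true, false, false, true, false, true, false, false, false};

        double ap1 = averagePrecision(q1, 5);   // 0.62
        double ap2 = averagePrecision(q2, 3);   // 0.44
        System.out.printf("MAP = %.2f%n", (ap1 + ap2) / 2);   // 0.53

        // Average of the two queries' interpolated precision at the 11 standard recall levels
        StringBuilder averageRow = new StringBuilder();
        for (int r = 0; r <= 10; r++) {
            double level = r / 10.0;
            double avg = (interpolatedPrecision(q1, 5, level) + interpolatedPrecision(q2, 3, level)) / 2;
            averageRow.append(String.format("%.2f ", avg));
        }
        // Prints 0.75 0.75 0.75 0.58 0.55 0.46 ...; compare with the Average row of Table 8.4
        System.out.println(averageRow);
    }
}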
Because search engines are imperfect and nearly always retrieve some non-relevant documents, precision tends to decrease with increasing recall (although this is not always true, as is shown in Figure 8.4). This interpolation method is consistent with this observation in that it produces a function that is monotonically decreasing. This means that precision values always go down (or stay the same) with increasing recall. The interpolation also defines a precision value for the recall level of 0.0, which would not be obvious otherwise! The general intuition behind this interpolation is that the recall-precision values are defined by the sets of documents in the ranking with the best possible precision values. In query 1, for example, there are three sets of documents that would be the best possible for the user to look at in terms of finding the highest proportion of relevant documents.

The average precision values at the standard recall levels are calculated by simply averaging the precision values for each query. Table 8.4 shows the interpolated precision values for the two example queries, along with the average precision values. The resulting average recall-precision graph is shown in Figure 8.6.

Recall      0.0   0.1   0.2   0.3   0.4   0.5   0.6   0.7   0.8   0.9   1.0
Ranking 1   1.0   1.0   1.0   0.67  0.67  0.5   0.5   0.5   0.5   0.5   0.5
Ranking 2   0.5   0.5   0.5   0.5   0.43  0.43  0.43  0.43  0.43  0.43  0.43
Average     0.75  0.75  0.75  0.59  0.55  0.47  0.47  0.47  0.47  0.47  0.47

Table 8.4. Precision values at standard recall levels calculated using interpolation

Fig. 8.6. Average recall-precision graph using standard recall levels

The average recall-precision graph is plotted by simply joining the average precision points at the standard recall levels, rather than using another step function. Although this is somewhat inconsistent with the interpolation method, the intermediate recall levels are never used in evaluation. When graphs are averaged over many queries, they tend to become smoother. Figure 8.7 shows a typical recall-precision graph from a TREC evaluation using 50 queries.

Fig. 8.7. Typical recall-precision graph for 50 queries from TREC

8.4.3 Focusing on the Top Documents

In many search applications, users tend to look at only the top part of the ranked result list to find relevant documents. In the case of web search, this means that many users look at just the first page or two of results. In addition, tasks such as navigational search (Chapter 7) or question answering (Chapter 1) have just a single relevant document. In these situations, recall is not an appropriate measure. Instead, the focus of an effectiveness measure should be on how well the search engine does at retrieving relevant documents at very high ranks (i.e., close to the top of the ranking).

One measure with this property that has already been mentioned is precision at rank p, where p in this case will typically be 10. This measure is easy to compute, can be averaged over queries to produce a single summary number, and is readily understandable.
The major disadvantage is that it does not distinguish between different rankings of a given number of relevant documents. For example, if only one relevant document was retrieved in the top 10, according to the precision measure a ranking where that document is in the top position would be the same as one where it was at rank 10. Other measures have been proposed that are more sensitive to the rank position.

The reciprocal rank measure has been used for applications where there is typically a single relevant document. It is defined as the reciprocal of the rank at which the first relevant document is retrieved. The mean reciprocal rank (MRR) is the average of the reciprocal ranks over a set of queries. For example, if the top five documents retrieved for a query were dn, dr, dn, dn, dn, where dn is a non-relevant document and dr is a relevant document, the reciprocal rank would be 1/2 = 0.5. Even if more relevant documents had been retrieved, as in the ranking dn, dr, dr, dn, dn, the reciprocal rank would still be 0.5. The reciprocal rank is very sensitive to the rank position. It falls from 1.0 to 0.5 from rank 1 to 2, and the ranking dn, dn, dn, dn, dr would have a reciprocal rank of 1/5 = 0.2. The MRR for these two rankings would be (0.5 + 0.2)/2 = 0.35.
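A reciprocal rank computation matching the two example rankings above can be written in a few lines of Java; this is an illustrative sketch, not part of any evaluation toolkit.

public class ReciprocalRank {

    // Reciprocal of the rank of the first relevant document (0 if none is retrieved)
    static double reciprocalRank(boolean[] relevantAtRank) {
        for (int i = 0; i < relevantAtRank.length; i++)
            if (relevantAtRank[i]) return 1.0 / (i + 1);
        return 0.0;
    }

    public static void main(String[] args) {
        boolean[] q1 = {false, true, false, false, false};   // dn, dr, dn, dn, dn -> 0.5
        boolean[] q2 = {false, false, false, false, true};   // dn, dn, dn, dn, dr -> 0.2
        double mrr = (reciprocalRank(q1) + reciprocalRank(q2)) / 2;
        System.out.printf("MRR = %.2f%n", mrr);   // 0.35
    }
}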
The discounted cumulative gain (DCG) has become a popular measure for evaluating web search and related applications (Järvelin & Kekäläinen, 2002). It is based on two assumptions:

• Highly relevant documents are more useful than marginally relevant documents.

• The lower the ranked position of a relevant document (i.e., further down the ranked list), the less useful it is for the user, since it is less likely to be examined.

These two assumptions lead to an evaluation that uses graded relevance as a measure of the usefulness, or gain, from examining a document. The gain is accumulated starting at the top of the ranking and may be reduced, or discounted, at lower ranks. The DCG is the total gain accumulated at a particular rank p. Specifically, it is defined as:

DCG_p = rel_1 + ∑_{i=2}^{p} rel_i / log_2 i

where rel_i is the graded relevance level of the document retrieved at rank i. For example, web search evaluations have been reported that used manual relevance judgments on a six-point scale ranging from "Bad" to "Perfect" (i.e., 0 ≤ rel_i ≤ 5). Binary relevance judgments can also be used, in which case rel_i would be either 0 or 1. The denominator log_2 i is the discount or reduction factor that is applied to the gain. There is no theoretical justification for using this particular discount factor, although it does provide a relatively smooth (gradual) reduction.¹⁰ By varying the base of the logarithm, the discount can be made sharper or smoother. With base 2, the discount at rank 4 is 1/2, and at rank 8 it is 1/3.

10 In some publications, DCG is defined as:

DCG_p = ∑_{i=1}^{p} (2^{rel_i} − 1) / log(1 + i)

For binary relevance judgments, the two definitions are the same, but for graded relevance this definition puts a strong emphasis on retrieving highly relevant documents. This version of the measure is used by some search engine companies and, because of this, may become the standard.

As an example, consider the following ranking where each number is a relevance level on the scale 0–3 (not relevant–highly relevant):

3, 2, 3, 0, 0, 1, 2, 2, 3, 0

These numbers represent the gain at each rank. The discounted gain would be:

3, 2/1, 3/1.59, 0, 0, 1/2.59, 2/2.81, 2/3, 3/3.17, 0
= 3, 2, 1.89, 0, 0, 0.39, 0.71, 0.67, 0.95, 0

The DCG at each rank is formed by accumulating these numbers, giving:

3, 5, 6.89, 6.89, 6.89, 7.28, 7.99, 8.66, 9.61, 9.61

Similar to precision at rank p, specific values of p are chosen for the evaluation, and the DCG numbers are averaged across a set of queries. Since the focus of this measure is on the top ranks, these values are typically small, such as 5 and 10. For this example, DCG at rank 5 is 6.89 and at rank 10 is 9.61.

To facilitate averaging across queries with different numbers of relevant documents, these numbers can be normalized by comparing the DCG at each rank with the DCG value for the perfect ranking for that query. For example, if the previous ranking contained all the relevant documents for that query, the perfect ranking would have gain values at each rank of:

3, 3, 3, 2, 2, 2, 1, 0, 0, 0

which would give ideal DCG values of:

3, 6, 7.89, 8.89, 9.75, 10.52, 10.88, 10.88, 10.88, 10.88

Normalizing the actual DCG values by dividing by the ideal values gives us the normalized discounted cumulative gain (NDCG) values:

1, 0.83, 0.87, 0.76, 0.71, 0.69, 0.73, 0.8, 0.88, 0.88

Note that the NDCG measure is ≤ 1 at any rank position. To summarize, the NDCG for a given query can be defined as:

NDCG_p = DCG_p / IDCG_p

where IDCG is the ideal DCG value for that query.
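The DCG and NDCG numbers in this example can be checked with a short Java sketch; the helper below is illustrative and uses the log_2 discount definition given above (small differences from the listed NDCG values come from rounding).

public class DCG {

    // DCG_p = rel_1 + sum over i = 2..p of rel_i / log2(i)
    static double dcgAt(int[] gains, int p) {
        double dcg = gains[0];
        for (int i = 2; i <= p; i++)
            dcg += gains[i - 1] / (Math.log(i) / Math.log(2));
        return dcg;
    }

    public static void main(String[] args) {
        int[] ranking = {3, 2, 3, 0, 0, 1, 2, 2, 3, 0};   // graded relevance from the example
        int[] ideal   = {3, 3, 3, 2, 2, 2, 1, 0, 0, 0};   // the same judgments sorted by gain

        System.out.printf("DCG@5 = %.2f, DCG@10 = %.2f%n", dcgAt(ranking, 5), dcgAt(ranking, 10));
        // DCG@5 = 6.89, DCG@10 = 9.61

        // NDCG_p = DCG_p / IDCG_p; compare with the NDCG list in the text
        for (int p = 1; p <= 10; p++)
            System.out.printf("NDCG@%d = %.2f%n", p, dcgAt(ranking, p) / dcgAt(ideal, p));
    }
}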
For preferences derived from binary relevance judgments, the BPREF^11 measure has been shown to be robust with partial information and to give similar results (in terms of system comparisons) to recall-precision measures such as MAP. In this measure, the number of relevant and non-relevant documents is balanced to facilitate averaging across queries. For a query with R relevant documents, only the first R non-relevant documents are considered. This is equivalent to using R × R preferences (all relevant documents are preferred to all non-relevant documents). Given this, the measure is defined as:

BPREF = \frac{1}{R} \sum_{d_r} \left(1 - \frac{N_{d_r}}{R}\right)

11 Binary Preference

where d_r is a relevant document and N_{d_r} gives the number of non-relevant documents (from the set of R non-relevant documents that are considered) that are ranked higher than d_r. If this is expressed in terms of preferences, N_{d_r} is actually a method for counting the number of preferences that disagree (for binary relevance judgments). Since R × R is the number of preferences being considered, an alternative definition of BPREF is:

BPREF = \frac{P}{P + Q}

which means it is very similar to Kendall's τ. The main difference is that BPREF varies between 0 and 1. Given that BPREF is a useful effectiveness measure, this suggests that the same measure or τ could be used with preferences associated with graded relevance.
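The first formulation can be computed directly from a ranked list and the judged document sets. The following is a minimal sketch rather than a reference implementation; in this sketch, relevant documents that are not retrieved simply contribute nothing to the sum:

def bpref(ranking, relevant, nonrelevant):
    """BPREF for one query. `ranking` is the ranked list of document ids;
    `relevant` and `nonrelevant` are the sets of judged relevant and non-relevant ids."""
    R = len(relevant)
    # only the first R judged non-relevant documents in the ranking are considered
    considered = set([d for d in ranking if d in nonrelevant][:R])
    score, nonrel_above = 0.0, 0
    for d in ranking:
        if d in considered:
            nonrel_above += 1
        elif d in relevant:
            score += 1.0 - nonrel_above / R   # 1 - N_dr / R for this relevant document
    return score / R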
8.5 Efficiency Metrics

Compared to effectiveness, the efficiency of a search system seems like it should be easier to quantify. Most of what we care about can be measured automatically with a timer instead of with costly relevance judgments. However, like effectiveness, it is important to determine exactly what aspects of efficiency we want to measure. Table 8.5 shows some of the measures that are used.

Elapsed indexing time: Measures the amount of time necessary to build a document index on a particular system.
Indexing processor time: Measures the CPU seconds used in building a document index. This is similar to elapsed time, but does not count time waiting for I/O or speed gains from parallelism.
Query throughput: Number of queries processed per second.
Query latency: The amount of time a user must wait after issuing a query before receiving a response, measured in milliseconds. This can be measured using the mean, but is often more instructive when used with the median or a percentile bound.
Indexing temporary space: Amount of temporary disk space used while creating an index.
Index size: Amount of storage necessary to store the index files.

Table 8.5. Definitions of some important efficiency metrics

The most commonly quoted efficiency metric is query throughput, measured in queries processed per second. Throughput numbers are comparable only for the same collection and queries processed on the same hardware, although rough comparisons can be made between runs on similar hardware. As a single-number metric of efficiency, throughput is good because it is intuitive and mirrors the common problems we want to solve with efficiency numbers. A real system user will want to use throughput numbers for capacity planning, to help determine whether more hardware is necessary to handle a particular query load. Since it is simple to measure the number of queries per second currently being issued to a service, it is easy to determine whether a system's query throughput is adequate to handle the needs of an existing service.

The trouble with using throughput alone is that it does not capture latency. Latency measures the elapsed time the system takes between when the user issues a query and when the system delivers its response. Psychology research suggests that users consider any operation that takes less than about 150 milliseconds to be instantaneous. Above that level, users react very negatively to the delay they perceive.

This brings us back to throughput, because latency and throughput are not orthogonal: generally we can improve throughput by increasing latency, and reducing latency leads to poorer throughput. To see why this is so, think of the difference between having a personal chef and ordering food at a restaurant. The personal chef prepares your food with the lowest possible latency, since she has no other demands on her time and focuses completely on preparing your food. Unfortunately, the personal chef has low throughput, since her focus on you leads to idle time when she is not completely occupied. The restaurant is a high-throughput operation with lots of chefs working on many different orders simultaneously. Having many orders and many chefs leads to certain economies of scale; for instance, a single chef may prepare many identical orders at the same time. Note that the chef is able to process these orders simultaneously precisely because some latency has been added to some orders: instead of starting to cook immediately upon receiving an order, the chef may decide to wait a few minutes to see if anyone else orders the same thing. The result is that the chefs are able to cook food with high throughput but at some cost in latency.

Query processing works the same way. It is possible to build a system that handles just one query at a time, devoting all resources to the current query, just like the personal chef devotes all her time to a single customer. This kind of system is low throughput, because only one query is processed at a time, which leads to idle resources. The radical opposite approach is to process queries in large batches. The system can then reorder the incoming queries so that queries that use common subexpressions are evaluated at the same time, saving valuable execution time. However, interactive users will hate waiting for their query batch to complete.

Like recall and precision in effectiveness, low latency and high throughput are both desirable properties of a retrieval system, but they are in conflict with each other and cannot be maximized at the same time. In a real system, query throughput is not a variable but a requirement: the system needs to handle every query the users submit. The two remaining variables are latency (how long the users will have to wait for a response) and hardware cost (how many processors will be applied to the search problem). A common way to talk about latency is with percentile bounds, such as "99% of all queries will complete in under 100 milliseconds." System designers can then add hardware until this requirement is met.
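A percentile bound like this can be read directly off a sorted list of measured query times. The following is a small sketch of our own (the observed latencies are assumed to be in milliseconds):

import math

def latency_percentile(latencies_ms, pct):
    # smallest observed latency such that pct percent of the queries finished at or below it
    ordered = sorted(latencies_ms)
    k = math.ceil(pct / 100.0 * len(ordered))
    return ordered[max(k - 1, 0)]

# e.g., checking the "99% of all queries in under 100 milliseconds" requirement:
# assert latency_percentile(observed_latencies, 99) < 100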
Query throughput and latency are the most visible system efficiency metrics, but we should also consider the costs of indexing. For instance, given enough time and space, it is possible to cache every possible query of a particular length. A system that did this would have excellent query throughput and query latency, but at the cost of enormous storage and indexing costs. Therefore, we also need to measure the size of the index structures and the time necessary to create them. Because indexing is often a distributed process, we need to know both the total amount of processor time used during indexing and the elapsed time. Since the process of inversion often requires temporary storage, it is interesting to measure the amount of temporary storage used.

8.6 Training, Testing, and Statistics

8.6.1 Significance Tests

Retrieval experiments generate data, such as average precision values or NDCG values. In order to decide whether this data shows that there is a meaningful difference between two retrieval algorithms or search engines, significance tests are needed. Every significance test is based on a null hypothesis. In the case of a typical retrieval experiment, we are comparing the value of an effectiveness measure for rankings produced by two retrieval algorithms. The null hypothesis is that there is no difference in effectiveness between the two retrieval algorithms. The alternative hypothesis is that there is a difference. In fact, given two retrieval algorithms A and B, where A is a baseline algorithm and B is a new algorithm, we are usually trying to show that the effectiveness of B is better than A, rather than simply finding a difference. Since the rankings that are compared are based on the same set of queries for both retrieval algorithms, this is known as a matched pair experiment.

We obviously cannot conclude that B is better than A on the basis of the results of a single query, since A may be better than B on all other queries. So how many queries do we have to look at to make a decision about which is better? If, for example, B is better than A for 90% of 200 queries in a test collection, we should be more confident that B is better for that effectiveness measure, but how confident? Significance tests allow us to quantify the confidence we have in any judgment about effectiveness.

More formally, a significance test enables us to reject the null hypothesis in favor of the alternative hypothesis (i.e., show that B is better than A) on the basis of the data from the retrieval experiments. Otherwise, we say that the null hypothesis cannot be rejected (i.e., B might not be better than A). As with any binary decision process, a significance test can make two types of error. A Type I error is when the null hypothesis is rejected when it is in fact true. A Type II error is when the null hypothesis is accepted when it is in fact false.^12 Significance tests are often described by their power, which is the probability that the test will reject the null hypothesis correctly (i.e., decide that B is better than A). In other words, a test with high power will reduce the chance of a Type II error. The power of a test can also be increased by increasing the sample size, which in this case is the number of queries in the experiment. Increasing the number of queries will also reduce the chance of a Type I error.

12 Compare to the discussion of errors in section 8.4.1.

The procedure for comparing two retrieval algorithms using a particular set of queries and a significance test is as follows:

1. Compute the effectiveness measure for every query for both rankings.
2. Compute a test statistic based on a comparison of the effectiveness measures for each query. The test statistic depends on the significance test, and is simply a quantity calculated from the sample data that is used to decide whether or not the null hypothesis should be rejected.
3. The test statistic is used to compute a P-value, which is the probability that a test statistic value at least that extreme could be observed if the null hypothesis were true. Small P-values suggest that the null hypothesis may be false.
4. The null hypothesis (no difference) is rejected in favor of the alternate hypothesis (i.e., B is more effective than A) if the P-value is ≤ α, the significance level. Values for α are small, typically 0.05 and 0.1, to reduce the chance of a Type I error.

In other words, if the probability of getting a specific test statistic value is very small assuming the null hypothesis is true, we reject that hypothesis and conclude that ranking algorithm B is more effective than the baseline algorithm A. The computation of the test statistic and the corresponding P-value is usually done using tables or standard statistical software. The significance tests discussed here are also provided in Galago.

The procedure just described is known as a one-sided or one-tailed test, since we want to establish that B is better than A. If we were just trying to establish that there is a difference between A and B, it would be a two-sided or two-tailed test, and the P-value would be doubled. The "side" or "tail" referred to is the tail of a probability distribution. For example, Figure 8.8 shows a distribution for the possible values of a test statistic assuming the null hypothesis. The shaded part of the distribution is the region of rejection for a one-sided test. If a test yielded the test statistic value x, the null hypothesis would be rejected since the probability of getting that value or higher (the P-value) is less than the significance level of 0.05.

Fig. 8.8. Probability distribution for test statistic values assuming the null hypothesis. The shaded area is the region of rejection for a one-sided test.

The significance tests most commonly used in the evaluation of search engines are the t-test,^13 the Wilcoxon signed-rank test, and the sign test. To explain these tests, we will use the data shown in Table 8.6, which shows the effectiveness values of the rankings produced by two retrieval algorithms for 10 queries. The values in the table are artificial and could be average precision or NDCG, for example, on a scale of 0–100 (instead of 0–1). The table also shows the difference in the effectiveness measure between algorithm B and the baseline algorithm A. The small number of queries in this example data is not typical of a retrieval experiment.

In general, the t-test assumes that data values are sampled from normal distributions. In the case of a matched pair experiment, the assumption is that the difference between the effectiveness values is a sample from a normal distribution. The null hypothesis in this case is that the mean of the distribution of differences is zero. The test statistic for the paired t-test is:

t = \frac{\overline{B - A}}{\sigma_{B-A}} \cdot \sqrt{N}

where \overline{B - A} is the mean of the differences, σ_{B-A} is the standard deviation^14 of the differences, and N is the size of the sample (the number of queries).
13 Also known as Student's t-test, where "student" was the pen name of the inventor, William Gosset, not the type of person who should use it.
14 For a set of data values x_i, the standard deviation can be calculated by σ = \sqrt{\sum_{i=1}^{N} (x_i - \overline{x})^2 / N}, where \overline{x} is the mean.

Query | A | B | B - A
1 | 25 | 35 | 10
2 | 43 | 84 | 41
3 | 39 | 15 | -24
4 | 75 | 75 | 0
5 | 43 | 68 | 25
6 | 15 | 85 | 70
7 | 20 | 80 | 60
8 | 52 | 50 | -2
9 | 49 | 58 | 9
10 | 50 | 75 | 25

Table 8.6. Artificial effectiveness data for two retrieval algorithms (A and B) over 10 queries. The column B - A gives the difference in effectiveness.

For the data in Table 8.6, \overline{B - A} = 21.4, σ_{B-A} = 29.1, and t = 2.33. For a one-tailed test, this gives a P-value of 0.02, which would be significant at a level of α = 0.05. Therefore, for this data, the t-test enables us to reject the null hypothesis and conclude that ranking algorithm B is more effective than A.
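These values can be reproduced with a few lines of Python. This is a sketch for illustration only; the P-value itself would come from statistical tables or software:

import math
import statistics

diffs = [10, 41, -24, 0, 25, 70, 60, -2, 9, 25]   # column B - A from Table 8.6
mean = statistics.mean(diffs)                     # 21.4
sd = statistics.stdev(diffs)                      # about 29.1
t = (mean / sd) * math.sqrt(len(diffs))           # about 2.33
# The one-tailed P-value (about 0.02) comes from the t distribution with N - 1
# degrees of freedom, e.g. scipy.stats.t.sf(t, df=len(diffs) - 1) if SciPy is available.
print(round(mean, 1), round(sd, 1), round(t, 2))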
There are two objections that could be made to using the t-test in search evaluations. The first is that the assumption that the data is sampled from normal distributions is generally not appropriate for effectiveness measures, although the distribution of differences can resemble a normal distribution for large N. Recent experimental results have supported the validity of the t-test by showing that it produces very similar results to the randomization test on TREC data (Smucker et al., 2007). The randomization test does not assume the data comes from normal distributions, and is the most powerful of the nonparametric tests.^15 The randomization test, however, is much more expensive to compute than the t-test.

The second objection that could be made is concerned with the level of measurement associated with effectiveness measures. The t-test (and the randomization test) assume that the evaluation data is measured on an interval scale. This means that the values can be ordered (e.g., an effectiveness of 54 is greater than an effectiveness of 53), and that differences between values are meaningful (e.g., the difference between 80 and 70 is the same as the difference between 20 and 10). Some people have argued that effectiveness measures are an ordinal scale, which means that the magnitude of the differences is not significant. The Wilcoxon signed-rank test and the sign test, which are both nonparametric, make fewer assumptions about the effectiveness measure. As a consequence, they do not use all the information in the data, and it can be more difficult to show a significant difference. In other words, if the effectiveness measure did satisfy the conditions for using the t-test, the Wilcoxon and sign tests have less power.

15 A nonparametric test makes fewer assumptions about the data and the underlying distribution than parametric tests.

The Wilcoxon signed-rank test assumes that the differences between the effectiveness values for algorithms A and B can be ranked, but the magnitude is not important. This means, for example, that the difference for query 8 in Table 8.6 will be ranked first because it is the smallest non-zero absolute value, but the magnitude of 2 is not used directly in the test. The test statistic is:

w = \sum_{i=1}^{N} R_i

where R_i is a signed-rank, and N is the number of differences where d_i ≠ 0. To compute the signed-ranks, the differences are ordered by their absolute values (increasing), and then assigned rank values (ties are assigned the average rank). The rank values are then given the sign of the original difference. The null hypothesis for this test is that the sum of the positive ranks will be the same as the sum of the negative ranks. For example, the nine non-zero differences from Table 8.6, in rank order of absolute value, are:

2, 9, 10, 24, 25, 25, 41, 60, 70

The corresponding signed-ranks are:

-1, +2, +3, -4, +5.5, +5.5, +7, +8, +9

Summing these signed-ranks gives a value of w = 35. For a one-tailed test, this gives a P-value of approximately 0.025, which means the null hypothesis can be rejected at a significance level of α = 0.05.

The sign test goes further than the Wilcoxon signed-rank test, and completely ignores the magnitude of the differences. The null hypothesis for this test is that P(B > A) = P(A > B) = 1/2. In other words, over a large sample we would expect that the number of pairs where B is "better" than A would be the same as the number of pairs where A is "better" than B. The test statistic is simply the number of pairs where B > A. The issue for a search evaluation is deciding what difference in the effectiveness measure is "better." We could assume that even small differences in average precision or NDCG (such as 0.51 compared to 0.5) are significant. This has the risk of leading to a decision that algorithm B is more effective than A when the difference is, in fact, not noticeable to the users. Instead, an appropriate threshold for the effectiveness measure should be chosen. For example, an old rule of thumb in information retrieval is that there has to be at least a 5% difference in average precision to be noticeable (10% for a more conservative threshold). This would mean that a difference of 0.51 - 0.5 = 0.01 would be considered a tie for the sign test. If the effectiveness measure was precision at rank 10, on the other hand, any difference might be considered significant since it would correspond directly to additional relevant documents in the top 10.

For the data in Table 8.6, we will consider any difference to be significant. This means there are seven pairs out of ten where B is better than A. The corresponding P-value is 0.17, which is the chance of observing seven "successes" in ten trials where the probability of success is 0.5 (just like flipping a coin). Using the sign test, we cannot reject the null hypothesis. Because so much information from the effectiveness measure is discarded in the sign test, it is more difficult to show a difference, and more queries are needed to increase the power of the test. On the other hand, it can be used in addition to the t-test to provide a more user-focused perspective. An algorithm that is significantly more effective according to both the t-test and the sign test, perhaps using different effectiveness measures, is more likely to be noticeably better.
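Both nonparametric statistics are easy to compute for the Table 8.6 data. The following sketch reproduces the values above; as before, it is an illustration rather than a full test implementation, and P-values would normally come from tables or statistical software:

import math
from collections import defaultdict

diffs = [10, 41, -24, 0, 25, 70, 60, -2, 9, 25]   # B - A from Table 8.6
nonzero = [d for d in diffs if d != 0]

# Wilcoxon signed-rank statistic: rank the differences by absolute value,
# average the ranks of ties, attach the sign of each difference, and sum.
positions = defaultdict(list)
for pos, d in enumerate(sorted(nonzero, key=abs), start=1):
    positions[abs(d)].append(pos)
w = sum((1 if d > 0 else -1) * sum(positions[abs(d)]) / len(positions[abs(d)])
        for d in nonzero)
print(w)                                          # 35.0

# Sign test: count the pairs where B > A (treating any difference as "better")
# and compute the binomial probability of at least that many successes.
successes = sum(1 for d in diffs if d > 0)        # 7
p_value = sum(math.comb(len(diffs), k)
              for k in range(successes, len(diffs) + 1)) / 2 ** len(diffs)
print(successes, round(p_value, 2))               # 7 0.17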
8.6.2 Setting Parameter Values

Nearly every ranking algorithm has parameters that can be tuned to improve the effectiveness of the results. For example, BM25 has the parameters k_1, k_2, and b used in term weighting, and query likelihood with Dirichlet smoothing has the parameter μ. Ranking algorithms for web search can have hundreds of parameters that give the weights for the associated features. The values of these parameters can have a major impact on retrieval effectiveness, and values that give the best effectiveness for one application may not be appropriate for another application, or even for a different document collection. Not only is choosing the right parameter values important for the performance of a search engine when it is deployed, it is an important part of comparing the effectiveness of two retrieval algorithms. An algorithm that has had its parameters tuned for optimal performance for the test collection may appear to be much more effective than it really is when compared to a baseline algorithm with poor parameter values.

The appropriate method of setting parameters for both maximizing effectiveness and making fair comparisons of algorithms is to use a training set and a test set of data. The training set is used to learn the best parameter values, and the test set is used for validating these parameter values and comparing ranking algorithms. The training and test sets are two separate test collections of documents, queries, and relevance judgments, although they may be created by splitting a single collection. In TREC experiments, for example, the training set is usually documents, queries, and relevance judgments from previous years. When there is not a large amount of data available, cross-validation can be done by partitioning the data into K subsets. One subset is used for testing, and K - 1 are used for training. This is repeated using each of the subsets as a test set, and the best parameter values are averaged across the K runs.

Using training and test sets helps to avoid the problem of overfitting (mentioned in Chapter 7), which occurs when the parameter values are tuned to fit a particular set of data too well. If this was the only data that needed to be searched in an application, that would be appropriate, but a much more common situation is that the training data is only a sample of the data that will be encountered when the search engine is deployed. Overfitting will result in a choice of parameter values that do not generalize well to this other data. A symptom of overfitting is that effectiveness on the training set improves but effectiveness on the test set gets worse.

A fair comparison of two retrieval algorithms would involve getting the best parameter values for both algorithms using the training set, and then using those values with the test set. The effectiveness measures are used to tune the parameter values in multiple retrieval runs on the training data, and for the final comparison, which is a single retrieval run, on the test data. The "cardinal sin" of retrieval experiments, which should be avoided in nearly all situations, is testing on the training data. This typically will artificially boost the measured effectiveness of a retrieval algorithm. It is particularly problematic when one algorithm has been trained in some way using the testing data and the other has not. Although it sounds like an easy problem to avoid, it can sometimes occur in subtle ways in more complex experiments.
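Schematically, the tuning loop runs on the training queries only, and the chosen values are then used for a single run on the test queries. In the sketch below, evaluate_map, train_queries, and test_queries are hypothetical placeholders for an evaluation function (returning, say, MAP for a given parameter value) and the two query sets:

def best_parameter(candidate_values, queries, evaluate):
    # return the candidate value with the highest effectiveness on the given queries
    return max(candidate_values, key=lambda value: evaluate(value, queries))

# Tune on the training set only, then evaluate once on the test set:
# mu = best_parameter([500, 1000, 1500, 2500], train_queries, evaluate_map)
# final_map = evaluate_map(mu, test_queries)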
Given a training set of data, there are a number of techniques for finding the best parameter settings for a particular effectiveness measure. The most common method is simply to explore the space of possible parameter values by brute force. This requires a large number of retrieval runs with small variations in parameter values (a parameter sweep). Although this could be computationally infeasible for large numbers of parameters, it is guaranteed to find the parameter settings that give the best effectiveness for any given effectiveness measure.

The Ranking SVM method described in section 7.6 is an example of a more sophisticated procedure for learning good parameter values efficiently with large numbers of parameters. This method, as well as similar optimization techniques, will find the best possible parameter values if the function being optimized meets certain conditions.^16 Because many of the effectiveness measures we have described do not meet these conditions, different functions are used for the optimization, and the parameter values are not guaranteed to be optimal. This is, however, a very active area of research, and new methods for learning parameters are constantly becoming available.

16 Specifically, the function should be convex (or concave; a function f(x) is concave if and only if -f(x) is convex). A convex function is a continuous function that satisfies the following constraint for all λ in [0,1]: f(λx_1 + (1 - λ)x_2) ≤ λf(x_1) + (1 - λ)f(x_2)

8.6.3 Online Testing

All of the evaluation strategies described thus far have assumed that training and testing are done offline. That is, we have assumed that all of the training and test data are fixed ahead of time. However, with real search engines, it may be possible to test (or even train) using live traffic. This is often called online testing. For example, suppose that you just developed a new sponsored-search advertising algorithm. Rather than evaluating your system using human relevance judgments, it is possible to deploy the new ranking algorithm and observe the amount of revenue generated using the new algorithm versus some baseline algorithm. This makes it possible to test various search engine components, such as ranking algorithms, query suggestion algorithms, and snippet generation algorithms, using live traffic and real users. Notice that this is similar to logging, which was discussed earlier in this chapter. With logging, evaluations are typically done retrospectively on "stale" data, whereas online testing uses live data.

There are several benefits to online testing. First, it allows real users to interact with the system. These interactions provide information, such as click data, that can be used for various kinds of evaluation. Second, online testing is less biased, since the evaluation is being done over a real sample of users and traffic. This is valuable because it is often difficult to build test collections that accurately reflect real search engine users and traffic. Finally, online testing can produce a large amount of data very cheaply, since it does not require paying any humans to do relevance judgments.

Unfortunately, online testing also has its fair share of drawbacks. The primary drawback is that the data collected is typically very noisy. There are many different reasons why users behave the way they do in an online environment.
For example, if a user does not click on a search result, it does not necessarily mean the result is bad. The user may have clicked on an advertisement instead, lost interest, or simply gone to eat dinner. Therefore, typically a very large amount of online testing data is required to eliminate noise and produce meaningful conclusions. Another drawback to online testing is that it requires live traffic to be altered in potentially harmful ways. If the algorithm being tested is highly experimental, then it may significantly degrade retrieval effectiveness and drive users away. For this reason, online testing must be done very carefully, so as not to negatively affect the user experience. One way of minimizing the effect of an online test on the general user population is to use the experimental algorithm only for a small percentage, such as 1% to 5%, of the live traffic. Finally, online tests typically provide only a very specific type of data: click data. As we described earlier in this section, click data is not always ideal for evaluating search engines, since the data is noisy and highly biased. However, for certain search engine evaluation metrics, such as clickthrough rate^17 and revenue, online testing can be very useful. Therefore, online testing can be a useful, inexpensive way of training or testing new algorithms, especially those that can be evaluated using click data. Special care must be taken to ensure that the data collected is analyzed properly and that the overall user experience is not degraded.

17 The percentage of times that some item is clicked on.

8.7 The Bottom Line

In this chapter, we have presented a number of effectiveness and efficiency measures. At this point, it would be reasonable to ask which of them is the right measure to use. The answer, especially with regard to effectiveness, is that no single measure is the correct one for any search application. Instead, a search engine should be evaluated through a combination of measures that show different aspects of the system's performance. In many settings, all of the following measures and tests could be carried out with little additional effort:

• Mean average precision - single number summary, popular measure, pooled relevance judgments.
• Average NDCG - single number summary for each rank level, emphasizes top ranked documents, relevance judgments needed only to a specific rank depth (typically to 10).
• Recall-precision graph - conveys more information than a single number measure, pooled relevance judgments.
• Average precision at rank 10 - emphasizes top ranked documents, easy to understand, relevance judgments limited to top 10.

Using MAP and a recall-precision graph could require more effort in relevance judgments, but this analysis could also be limited to the relevant documents found in the top 10 for the NDCG and precision at 10 measures. All these evaluations should be done relative to one or more baseline searches. It generally does not make sense to do an effectiveness evaluation without a good baseline, since the effectiveness numbers depend strongly on the particular mix of queries and documents in the test collection. The t-test can be used as the significance test for the average precision, NDCG, and precision at 10 measures. All of the standard evaluation measures and significance tests are available using the evaluation program provided as part of Galago.
In addition to these evaluations, it is also very useful to present a summary of the number of queries that were improved and the number that were degraded, relative to a baseline. Figure 8.9 gives an example of this summary for a TREC run, where the query numbers are shown as a distribution over various percentage levels of improvement for a specific evaluation measure (usually MAP). Each bar represents the number of queries that were better (or worse) than the baseline by the given percentage. This provides a simple visual summary showing that many more queries were improved than were degraded, and that the improvements were sometimes quite substantial. By setting a threshold on the level of improvement that constitutes "noticeable," the sign test can be used with this data to establish significance.

Fig. 8.9. Example distribution of query effectiveness improvements (number of queries versus percentage gain or loss relative to the baseline)

Given this range of measures, both developers and users will get a better picture of where the search engine is performing well and where it may need improvement. It is often necessary to look at individual queries to get a better understanding of what is causing the ranking behavior of a particular algorithm. Query data such as Figure 8.9 can be helpful in identifying interesting queries.

References and Further Reading

Despite being discussed for more than 40 years, the measurement of effectiveness in search engines is still a hot topic, with many papers being published in the major conferences every year. The chapter on evaluation in van Rijsbergen (1979) gives a good historical perspective on effectiveness measurement in information retrieval. Another useful general source is the TREC book (Voorhees & Harman, 2005), which describes the test collections and evaluation procedures used and how they evolved. Saracevic (1975) and Mizzaro (1997) are the best papers for general reviews of the critical topic of relevance. The process of obtaining relevance judgments and the reliability of retrieval experiments are discussed in the TREC book.

Zobel (1998) shows that some incompleteness of relevance judgments does not affect experiments, although Buckley and Voorhees (2004) suggest that substantial incompleteness can be a problem. Voorhees and Buckley (2002) discuss the error rates associated with different numbers of queries. Sanderson and Zobel (2005) show how using a significance test can affect the reliability of comparisons and also compare shallow versus in-depth relevance judgments. Carterette et al. (2006) describe a technique for reducing the number of relevance judgments required for reliable comparisons of search engines. Kelly and Teevan (2003) review approaches to acquiring and using implicit relevance information. Fox et al. (2005) studied implicit measures of relevance in the context of web search, and Joachims et al. (2005) introduced strategies for deriving preferences based on clickthrough data. Agichtein, Brill, Dumais, and Ragno (2006) extended this approach and carried out more experiments introducing click distributions and deviation, and showing that a number of features related to user behavior are useful for predicting relevance.
The F measure was originally proposed by van Rijsbergen (1979) in the form F = 1 - E. He also provided a justification for the E measure in terms of measurement theory, raised the issue of whether effectiveness measures were interval or ordinal measures, and suggested that the sign and Wilcoxon tests would be appropriate for significance. Cooper (1968) wrote an important early paper that introduced the expected search length (ESL) measure, which was the expected number of documents that a user would have to look at to find a specified number of relevant documents. Although this measure has not been widely used, it was the ancestor of measures such as NDCG (Järvelin & Kekäläinen, 2002) that focus on the top-ranked documents. Another measure of this type that has recently been introduced is rank-biased precision (Moffat et al., 2007).

Yao (1995) provides one of the first discussions of preferences and how they could be used to evaluate a search engine. The paper by Joachims (2002b) that showed how to train a linear feature-based retrieval model using preferences also used Kendall's τ as the effectiveness measure for defining the best ranking. The recent paper by Carterette and Jones (2007) shows how search engines can be evaluated using relevance information directly derived from clickthrough data, rather than converting clickthrough to preferences.

A number of recent studies have focused on interactive information retrieval. These studies involve a different style of evaluation than the methods described in this chapter, but are more formal than online testing. Belkin (2008) describes the challenges of evaluating interactive experiments and points to some interesting papers on this topic.

Another area related to effectiveness evaluation is the prediction of query effectiveness. Cronen-Townsend et al. (2006) describe the Clarity measure, which is used to predict whether a ranked list for a query has good or bad precision. Other measures have been suggested that have even better correlations with average precision.

There are very few papers that discuss guidelines for efficiency evaluations of search engines. Zobel et al. (1996) is an example from the database literature.

Exercises

8.1. Find three other examples of test collections in the information retrieval literature. Describe them and compare their statistics in a table.

8.2. Imagine that you were going to study the effectiveness of a search engine for blogs. Specify the retrieval task(s) for this application, and then describe the test collection you would construct and how you would evaluate your ranking algorithms.

8.3. For one query in the CACM collection (provided at the book website), generate a ranking using Galago, and then calculate average precision, NDCG at 5 and 10, precision at 10, and the reciprocal rank by hand.

8.4. For two queries in the CACM collection, generate two uninterpolated recall-precision graphs, a table of interpolated precision values at standard recall levels, and the average interpolated recall-precision graph.

8.5. Generate the mean average precision, recall-precision graph, average NDCG at 5 and 10, and precision at 10 for the entire CACM query set.

8.6. Compare the MAP value calculated in the previous problem to the GMAP value. Which queries have the most impact on this value?

8.7. Another measure that has been used in a number of evaluations is R-precision.
This is defined as the precision at R documents, where R is the number of relevant documents for a query. It is used in situations where there is a large variation in the number of relevant documents per query. Calculate the average R-precision for the CACM query set and compare it to the other measures.

8.8. Generate another set of rankings for 10 CACM queries by adding structure to the queries manually. Compare the effectiveness of these queries to the simple queries using MAP, NDCG, and precision at 10. Check for significance using the t-test, Wilcoxon test, and the sign test.

8.9. For one query in the CACM collection, generate a ranking and calculate BPREF. Show that the two formulations of BPREF give the same value.

8.10. Consider a test collection that contains judgments for a large number of time-sensitive queries, such as "olympics" and "miss universe". Suppose that the judgments for these queries were made in 2002. Why is this a problem? How can online testing be used to alleviate the problem?

9 Classification and Clustering

"What kind of thing? I need a clear definition."
Ripley, Alien

We now take a slight detour from search to look at classification and clustering. Classification and clustering have many things in common with document retrieval. In fact, many of the techniques that proved to be useful for ranking documents can also be used for these tasks. Classification and clustering algorithms are heavily used in most modern search engines, and thus it is important to have a basic understanding of how these techniques work and how they can be applied to real-world problems. We focus here on providing general background knowledge and a broad overview of these tasks. In addition, we provide examples of how they can be applied in practice. It is not our goal to dive too deeply into the details or the theory, since there are many other excellent references devoted entirely to these subjects, some of which are described in the "References and Further Reading" section at the end of this chapter. Instead, at the end of this chapter, you should know what classification and clustering are, the most commonly used algorithms, examples of how they are applied in practice, and how they are evaluated. On that note, we begin with a brief description of classification and clustering.

Classification, also referred to as categorization, is the task of automatically applying labels to data, such as emails, web pages, or images. People classify items throughout their daily lives. It would be infeasible, however, to manually label every page on the Web according to some criteria, such as "spam" or "not spam." Therefore, there is a need for automatic classification and categorization techniques. In this chapter, we describe several classification algorithms that are applicable to a wide range of tasks, including spam detection, sentiment analysis, and applying semantic labels to web advertisements.

Clustering, the other topic covered in this chapter, can be broadly defined as the task of grouping related items together.
In classification, each item is assigned a label, such as "spam" or "not spam." In clustering, however, each item is assigned to one or more clusters, where the cluster does not necessarily correspond to a meaningful concept, such as "spam" or "not spam." Instead, as we will describe later in this chapter, items are grouped together according to their similarity. Therefore, rather than mapping items onto a predefined set of labels, clustering allows the data to "speak for itself" by uncovering the implicit structure that relates the items.

Both classification and clustering have been studied for many years by information retrieval researchers, with the aim of improving the effectiveness, or in some cases the efficiency, of search applications. From another perspective, these two tasks are classic machine learning problems. In machine learning, the learning algorithms are typically characterized as supervised or unsupervised. In supervised learning, a model is learned using a set of fully labeled items, which is often called the training set. Once a model is learned, it can be applied to a set of unlabeled items, called the test set, in order to automatically apply labels. Classification is often cast as a supervised learning problem. For example, given a set of emails that have been labeled as "spam" or "not spam" (the training set), a classification model can be learned. The model then can be applied to incoming emails in order to classify them as "spam" or "not spam".

Unsupervised learning algorithms, on the other hand, learn entirely based on unlabeled data. Unsupervised learning tasks are often posed differently than supervised learning tasks, since the input data is not mapped to a predefined set of labels. Clustering is the most common example of unsupervised learning. As we will show, clustering algorithms take a set of unlabeled data as input and then group the items using some notion of similarity.

There are many other types of learning paradigms beyond supervised and unsupervised, such as semi-supervised learning, active learning, and online learning. However, these subjects are well beyond the scope of this book. Instead, in this chapter, we provide an overview of basic yet effective classification and clustering algorithms and methods for evaluating them.

9.1 Classification and Categorization

Applying labels to observations is a very natural task, and something that most of us do, often without much thought, in our everyday lives. For example, consider a trip to the local grocery store. We often implicitly assign labels such as "ripe" or "not ripe," "healthy" or "not healthy," and "cheap" or "expensive" to the groceries that we see. These are examples of binary labels, since there are only two options for each. It is also possible to apply multivalued labels to foods, such as "starch," "meat," "vegetable," or "fruit." Another possible labeling scheme would arrange categories into a hierarchy, in which the "vegetable" category would be split by color into subcategories, such as "green," "red," and "yellow." Under this scheme, foods would be labeled according to their position within the hierarchy. These different labeling or categorization schemes, which include binary, multivalued, and hierarchical, are called ontologies (see Chapter 6).
It is important to choose an ontology that is appropriate for the underlying task. For example, for detecting whether or not an email is spam, it is perfectly reasonable to choose a label set that consists of "spam" and "not spam". However, if one were to design a classifier to automatically detect what language a web page is written in, then the set of all possible languages would be a more reasonable ontology. Typically, the correct choice of ontology is dictated by the problem, but in cases when it is not, it is important to choose a set of labels that is expressive enough to be useful for the underlying task. However, since classification is a supervised learning task, it is important not to construct an overly complex ontology, since most learning algorithms will fail (i.e., not generalize well to unseen data) when there is little or no data associated with one or more of the labels. In the web page language classifier example, if we had only one example page for each of the Asian languages, then, rather than having separate labels for each of the languages, such as "Chinese", "Korean", etc., it would be better to combine all of the languages into a single label called "Asian languages". The classifier will then be more likely to classify things as "Asian languages" correctly, since it has more training examples.

In order to understand how machine learning algorithms work, we must first take a look at how people classify items. Returning to the grocery store example, consider how we would classify a food as "healthy" or "not healthy." In order to make this classification, we would probably look at the amount of saturated fat, cholesterol, sugar, and sodium in the food. If these values, either separately or in combination, are above some threshold, then we would label the food "healthy" or "unhealthy." To summarize, as humans we classify items by first identifying a number of important features that will help us distinguish between the possible labels. We then extract these features from each item. We then combine evidence from the extracted features in some way. Finally, we classify the item using some decision mechanism based on the combined evidence.

In our example, the features are things such as the amount of saturated fat and the amount of cholesterol. The features are extracted by reading the nutritional information printed on the packaging or by performing laboratory tests. There are various ways to combine the evidence in order to quantify the "healthiness" (denoted H) of the food, but one simple way is to weight the importance of each feature and then add the weighted feature values together, such as:

H(food) ≈ w_{fat} · fat(food) + w_{chol} · chol(food) + w_{sugar} · sugar(food) + w_{sodium} · sodium(food)

where w_{fat}, w_{chol}, etc., are the weights associated with each feature. Of course, in this case, it is likely that each of the weights would be negative.

Once we have a healthiness score H for a given food, we must apply some decision mechanism in order to apply a "healthy" or "not healthy" label to the food. Again, there are various ways of doing this, but one of the most simple is to apply a simple threshold rule that says "a food is healthy if H(food) ≥ t" for some threshold value t.

Although this is an idealized model of how people classify items, it provides valuable insights into how a computer can be used to automatically classify items.
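A computer can apply exactly the same recipe. The following sketch combines weighted feature values and applies a threshold rule; the feature names, weights, and threshold here are made up purely for illustration:

def healthiness(features, weights):
    # combine evidence by summing the weighted feature values
    return sum(weights[name] * value for name, value in features.items())

def classify(features, weights, threshold):
    return "healthy" if healthiness(features, weights) >= threshold else "not healthy"

# hypothetical weights: each unhealthy ingredient lowers the score
weights = {"fat": -2.0, "chol": -1.5, "sugar": -1.0, "sodium": -0.5}
print(classify({"fat": 1.0, "chol": 0.5, "sugar": 3.0, "sodium": 0.2}, weights, threshold=-5.0))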
Indeed, the two classification algorithms that we will now describe follow the same steps as we outlined earlier. The only difference between the two algorithms is in the details of how each step is actually implemented.

9.1.1 Naïve Bayes

We are now ready to describe how items can be automatically classified. One of the most straightforward yet effective classification techniques is called Naïve Bayes. We introduced the Bayes classifier in Chapter 7 as a framework for a probabilistic retrieval model. In that case, there were just two classes of interest, the relevant class and the non-relevant class. In general, classification tasks can involve more than two labels or classes. In that situation, Bayes' Rule, which is the basis of a Bayes classifier, states that:

P(C | D) = \frac{P(D | C) P(C)}{P(D)} = \frac{P(D | C) P(C)}{\sum_{c \in \mathcal{C}} P(D | C = c) P(C = c)}

where C and D are random variables. Random variables are commonly used when modeling uncertainty. Such variables do not have a fixed (deterministic) value. Instead, the value of the variable is random. Every random variable has a set of possible outcomes associated with it, as well as a probability distribution over the outcomes. As an example, the outcome of a coin toss can be modeled as a random variable X. The possible outcomes of the random variable are "heads" (h) and "tails" (t). Given a fair coin, the probability associated with both the heads outcome and the tails outcome is 0.5. Therefore, P(X = h) = P(X = t) = 0.5.

Consider another example, where you have the algebraic expression Y = 10 + 2X. If X was a deterministic variable, then Y would be deterministic as well. That is, for a fixed X, Y would always evaluate to the same value. However, if X is a random variable, then Y is also a random variable. Suppose that X had possible outcomes -1 (with probability 0.1), 0 (with probability 0.25), and 1 (with probability 0.65). The possible outcomes for Y would then be 8, 10, and 12, with P(Y = 8) = 0.1, P(Y = 10) = 0.25, and P(Y = 12) = 0.65.

In this chapter, we denote random variables with capital letters (e.g., C, D) and outcomes of random variables as lowercase letters (e.g., c, d). Furthermore, we denote the entire set of outcomes with calligraphic letters (e.g., \mathcal{C}, \mathcal{D}). Finally, for notational convenience, instead of writing P(X = x), we write P(x). Similarly for conditional probabilities, rather than writing P(X = x | Y = y), we write P(x | y).

Bayes' Rule is important because it allows us to write a conditional probability (such as P(C | D)) in terms of the "reverse" conditional (P(D | C)). This is a very powerful theorem, because it is often easy to estimate or compute the conditional probability in one direction but not the other. For example, consider spam classification, where D represents a document's text and C represents the class label (e.g., "spam" or "not spam"). It is not immediately clear how to write a program that detects whether a document is spam; that program is represented by P(C | D). However, it is easy to find examples of documents that are and are not spam. It is possible to come up with estimates for P(D | C) given examples or training data. The magic of Bayes' Rule is that it tells us how to get what we want (P(C | D)), but may not immediately know how to estimate, from something we do know how to estimate (P(D | C)).
It is straightforward to use this rule to classify items if we let C be the random variable associated with observing a class label and let D be the random variable associated with observing a document, as in our spam example. Given a document d (an outcome of random variable D) and a set of classes \mathcal{C} = c_1, . . . , c_N (outcomes of the random variable C), we can use Bayes' Rule to compute P(c_1 | d), . . . , P(c_N | d), which computes the likelihood of observing class label c_i given that document d was observed. Document d can then be labeled with the class with the highest probability of being observed given the document. That is, Naïve Bayes classifies a document d as follows:

Class(d) = \arg\max_{c \in \mathcal{C}} P(c | d) = \arg\max_{c \in \mathcal{C}} \frac{P(d | c) P(c)}{\sum_{c \in \mathcal{C}} P(d | c) P(c)}

where \arg\max_{c \in \mathcal{C}} P(c | d) means "return the class c, out of the set of all possible classes \mathcal{C}, that maximizes P(c | d)." This is a mathematical way of saying that we are trying to find the most likely class c given the document d.

Instead of computing P(c | d) directly, we can compute P(d | c) and P(c) instead and then apply Bayes' Rule to obtain P(c | d). As we explained before, one reason for using Bayes' Rule is when it is easier to estimate the probabilities of one conditional, but not the other. We now explain how these values are typically estimated in practice.

We first describe how to estimate the class prior, P(c). The estimation is straightforward. It is estimated according to:

P(c) = \frac{N_c}{N}

where N_c is the number of training instances that have label c, and N is the total number of training instances. Therefore, P(c) is simply the proportion of training instances that have label c.

Estimating P(d | c) is a little more complicated because the same "counting" estimate that we were able to use for estimating P(c) would not work. (Why? See exercise 9.3.) In order to make the estimation feasible, we must impose the simplifying assumption that d can be represented as d = w_1, . . . , w_n and that w_i is independent of w_j for every i ≠ j. Simply stated, this says that document d can be factored into a set of elements (terms) and that the elements (terms) are independent of each other.^2 This assumption is the reason for calling the classifier naïve, because it requires documents to be represented in an overly simplified way. In reality, terms are not independent of each other. However, as we will show in Chapter 11, properly modeling term dependencies is possible, but typically more difficult. Despite the independence assumption, the Naïve Bayes classifier has been shown to be robust and highly effective for various classification tasks.

1 Throughout most of this chapter, we assume that the items being classified are textual documents. However, it is important to note that the techniques described here can be used in a more general setting and applied to non-textual items such as images and videos.

This naïve independence assumption allows us to invoke a classic result from probability that states that the joint probability of a set of (conditionally) independent random variables can be written as the product of the individual conditional probabilities.
That means that P(d | c) can be written as:

P(d | c) = \prod_{i=1}^{n} P(w_i | c)

Therefore, we must estimate P(w | c) for every possible term w in the vocabulary \mathcal{V} and class c in the ontology \mathcal{C}. It turns out that this is a much easier task than estimating P(d | c), since there is a finite number of terms in the vocabulary and a finite number of classes, but an infinite number of possible documents. The independence assumption allows us to write the probability P(c | d) as:

P(c | d) = \frac{P(d | c) P(c)}{\sum_{c \in \mathcal{C}} P(d | c) P(c)} = \frac{\prod_{i=1}^{V} P(w_i | c) P(c)}{\sum_{c \in \mathcal{C}} \prod_{i=1}^{V} P(w_i | c) P(c)}

The only thing left to describe is how to estimate P(w | c). Before we can estimate the probability, we must first decide on what the probability actually means. For example, P(w | c) could be interpreted as "the probability that term w is related to class c," "the probability that w has nothing to do with class c," or any number of other things. In order to make the meaning concrete, we must explicitly define the event space that the probability is defined over. An event space is the set of possible events (or outcomes) from some process. A probability is assigned to each event in the event space, and the sum of the probabilities over all of the events in the event space must equal one.

2 This is the same assumption that lies at the heart of most of the retrieval models described in Chapter 7. It is also equivalent to the bag of words assumption discussed in Chapter 11.

The probability estimates and the resulting classification will vary depending on the choice of event space. We will now briefly describe two of the more popular event spaces and show how P(w | c) is estimated in each.

Multiple-Bernoulli model

The first event space that we describe is very simple. Given a class c, we define a binary random variable w_i for every term in the vocabulary. The outcome for the binary event is either 0 or 1. The probability P(w_i = 1 | c) can then be interpreted as "the probability that term w_i is generated by class c." Conversely, P(w_i = 0 | c) can be interpreted as "the probability that term w_i is not generated by class c." This is exactly the event space used by the binary independence model (see Chapter 7), and is known as the multiple-Bernoulli event space. Under this event space, for each term in some class c, we estimate the probability that the term is generated by the class. For example, in a spam classifier, P(cheap = 1 | spam) is likely to have a high probability, whereas P(dinner = 1 | spam) is going to have a much lower probability.

id | cheap | buy | banking | dinner | the | class
1 | 0 | 0 | 0 | 0 | 1 | not spam
2 | 1 | 0 | 1 | 0 | 1 | spam
3 | 0 | 0 | 0 | 0 | 1 | not spam
4 | 1 | 0 | 1 | 0 | 1 | spam
5 | 1 | 1 | 0 | 0 | 1 | spam
6 | 0 | 0 | 1 | 0 | 1 | not spam
7 | 0 | 1 | 1 | 0 | 1 | not spam
8 | 0 | 0 | 0 | 0 | 1 | not spam
9 | 0 | 0 | 0 | 0 | 1 | not spam
10 | 1 | 1 | 0 | 1 | 1 | not spam

Fig. 9.1. Illustration of how documents are represented in the multiple-Bernoulli event space. In this example, there are 10 documents (each with a unique id), two classes (spam and not spam), and a vocabulary that consists of the terms "cheap", "buy", "banking", "dinner", and "the".

Figure 9.1 shows how a set of training documents can be represented in this event space.
In the example, there are 10 documents, two classes (spam and not spam), and a vocabulary that consists of the terms “cheap”, “buy”, “banking”, “din- 3 7 ner”, and “the”. In this example, P and P ( not spam ) = spam ) = . Next, ( 10 10 ( we must estimate | c ) for every pair of terms and classes. The most straight- P w forward way is to estimate the probabilities using what is called the maximum likelihood estimate, which is: df w;c w | c P ( ) = N c df in which term is the number of training documents with class label c where w;c w occurs, and is the total number of training documents with class label c . N c As we see, the maximum likelihood estimate is nothing more than the propor- c that contain term w . Using the maximum likelihood tion of documents in class P ( the | spam ) = 1 , P ( the | not spam ) = 1 , estimate, we can easily compute 1 ( | spam ) = 0 , P dinner dinner | not spam ) = ( P , and so on. 7 Using the multiple-Bernoulli model, the document likelihood, P d | c ) , can be ( written as: ∏ ( ) w;d 1 −   w;d ) ( | c ) ) = P ( d (1 − P ( w | c )) | P ( w c ∈V w where δ ( w, D ) is 1 if and only if term w occurs in document d . In practice, it is not possible to use the maximum likelihood estimate because of the . In order to illustrate the zero probability problem, zeroprobabilityproblem let us return to the spam classification example from Figure 9.1. Suppose that we receive a spam email that happens to contain the term “dinner”. No matter what other terms the email does or does not contain, the probability P d | c ) will always ( be zero because P ( dinner | spam ) = 0 and the term occurs in the document (i.e., = 1 ). Therefore, any document that contains the term “dinner” δ dinner;d will automatically have zero probability of being spam. This problem is more gen- eral, since a zero probability will result whenever a document contains a term that never occurs in one or more classes. The problem here is that the maximum likeli- hood estimate is based on counting occurrences in the training set. However, the training set is finite, so not every possible event is observed. This is known as data sparseness. Sparseness is often a problem with small training sets, but it can also happen with relatively large data sets. Therefore, we must alter the estimates in such a way that all terms, including those that have not been observed for a given</p> <p><span class="badge badge-info text-white mr-2">372</span> 348 9 Classification and Clustering P ( c ) is non- class, are given some probability mass. That is, we must ensure that | w . By doing so, we will avoid all of the problems associated zero for all terms in V with the zero probability problem. As was described in Chapter 7, smoothing is a useful technique for overcoming the zero probability problem. One popular smoothing technique is often called , which assumes some prior probability over models and uses a Bayesiansmoothing maximumaposteriori estimate. The resulting smoothed estimate for the multiple- Bernoulli model has the form: α + df w;c w ) = | w ( P c N + α β + w w c α β and where . Different settings of these are parameters that depend on w w w α parameters result in different estimates. One popular choice is to set = 1 and w = 0 for all w , which results in the following estimate: β w + 1 df w;c w ) = ( | P c + 1 N c N N w w μ Another choice is to set = α β , where = is (1 − N and w ) for all μ w w w N N the total number of training documents in which term w occurs, and μ is a single tunable parameter. 
This results in the following estimate: N w + μ df w;c N w c | ) = ( P N + μ c This event space only captures whether or not the term is generated; it fails how many times the term occurs, which can be an important piece of to capture information. We will now describe an event space that takes term frequency into account. Multinomial model The binary event space of the multiple-Bernoulli model is overly simplistic, as it does not model the number of times that a term occurs in a document. Term fre- quency has been shown to be an important feature for retrieval and classifica- tion, especially when used on long documents. When documents are very short, it is unlikely that many terms will occur more than one time, and therefore the multiple-Bernoulli model will be an accurate model. However, more often than</p> <p><span class="badge badge-info text-white mr-2">373</span> 9.1 Classification and Categorization 349 not, real collections contain documents that are both short and long, and there- fore it is important to take term frequency and, subsequently, document length into account. event space is very similar to the multiple-Bernoulli event multinomial The space, except rather than assuming that term occurrences are binary (“term oc- curs” or “term does not occur”), it assumes that terms occur zero or more times (“term occurs zero times”, “term occurs one time”, etc.). dinner document id the class cheap buy banking 1 0 0 2 not spam 0 0 3 0 1 0 1 spam 2 0 0 0 0 1 3 not spam 4 0 3 0 2 spam 2 5 0 0 1 spam 5 2 0 0 1 0 1 not spam 6 7 0 1 0 1 not spam 1 8 0 0 0 1 not spam 0 9 0 0 0 0 1 not spam 10 1 0 1 2 not spam 1 Illustration of how documents are represented in the multinomial event space. Fig. 9.2. ), two classes (spam and id In this example, there are 10 documents (each with a unique not spam), and a vocabulary that consists of the terms “cheap”, “buy”, “banking”, “dinner”, and “the”. Figure 9.2 shows how the documents from our spam classification example are represented in the multinomial event space. The only difference between this representation and the multiple-Bernoulli representation is that the events are no longer binary. The maximum likelihood estimate for the multinomial model is very similar to the multiple-Bernoulli model. It is computed as: tf w;c | P ) = ( w c | | c tf where w occurs in class c in the training set, is the number of times that term w;c and | c | is the total number of terms that occur in training documents with class la- 4 , P ( the | not spam ) = bel . In the spam classification example, P ( the | spam ) = c 20 9 1 P , ( dinner | spam ) = 0 , and P ( dinner | not spam ) = . 15 15</p> <p><span class="badge badge-info text-white mr-2">374</span> 350 9 Classification and Clustering Since terms are now distributed according to a multinomial distribution, the likelihood of a document c is computed according to: d given a class ∏ ( ) tf w;d ( | d | ) d tf P c ) = , tf ) c | , . . . , tf w ( ( P ! | P w ;d w ;d w 2 1 ;d V ∈V w ∏ tf w;d ( w | ∝ ) c P ∈V w where is the is the number of times that term w occurs in document d , | d | tf w;d d ( | d | ) is the probability of generating , total number of terms that occur in P ( ) , and d tf is the multinomial , tf | a document of length , . . . , tf ! | ;d w w w ;d 2 1 ;d V 3 | P ( coefficient. d Notice that ) and the multinomial coefficient are document- | dependent and, for the purposes of classification, can be ignored. 
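Putting the pieces together, classification under the multinomial model amounts to choosing the class that maximizes log P(c) plus the sum over document terms of tf_{w,d} log P(w|c), with the document-dependent terms dropped. The following Java sketch is illustrative only (the class name and data structures are invented, and this is not Galago code); it uses the add-one smoothed estimate of P(w|c), discussed next, so that unseen terms do not force a score to negative infinity.

```java
import java.util.*;

// Minimal multinomial Naive Bayes scorer. Assumes training statistics have
// already been collected: classPrior.get(c) = P(c), termCount.get(c).get(w) =
// tf_{w,c}, and totalTerms.get(c) = |c|. All names here are illustrative.
public class MultinomialNB {
    Map<String, Double> classPrior;
    Map<String, Map<String, Integer>> termCount;
    Map<String, Integer> totalTerms;
    int vocabularySize;

    MultinomialNB(Map<String, Double> prior, Map<String, Map<String, Integer>> tf,
                  Map<String, Integer> totals, int vocabSize) {
        this.classPrior = prior;
        this.termCount = tf;
        this.totalTerms = totals;
        this.vocabularySize = vocabSize;
    }

    // Returns arg max_c of log P(c) + sum over terms of tf_{w,d} * log P(w|c).
    // The document-dependent multinomial coefficient is ignored, as in the text.
    String classify(Map<String, Integer> docTermFreqs) {
        String best = null;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (String c : classPrior.keySet()) {
            double score = Math.log(classPrior.get(c));
            for (Map.Entry<String, Integer> e : docTermFreqs.entrySet()) {
                int tfwc = termCount.get(c).getOrDefault(e.getKey(), 0);
                // Add-one smoothing so terms unseen in class c do not zero out
                // the score; the smoothed estimates are described next in the text.
                double pwc = (tfwc + 1.0) / (totalTerms.get(c) + vocabularySize);
                score += e.getValue() * Math.log(pwc);
            }
            if (score > bestScore) { bestScore = score; best = c; }
        }
        return best;
    }
}
```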
The Bayesian smoothed estimates of the term likelihoods are computed ac- cording to: α + tf w;c w ∑ | ( P w ) = c α | | c + w w ∈V . As with the multiple-Bernoulli is a parameter that depends on w α where w model, different settings of the smoothing parameters result in different types α = 1 for all w is one possible option. This results in the of estimates. Setting w following estimate: tf + 1 w;c ( ) = | P c w c + |V| | | cf w α μ Another popular choice is to set = is the total number of cf , where w w | | C times that term w occurs in any training document, | C | is the total number of terms in all training documents, and μ , as before, is a tunable parameter. Under this setting, we obtain the following estimate: cf w tf + μ w;c | C | ) = | w ( c P c | + μ | This estimate may look familiar, as it is exactly the Dirichlet smoothed language modeling estimate that was described in Chapter 7. 3 The multinomial coefficient is a generalization of the binomial coefficient. It is com- ! N N puted as , N ( , . . . , N )! = . It counts the total number of unique 1 2 k ! ! N N ! ··· N 1 2 k ∑ times. N N ways that items (terms) can be arranged given that item i occurs i i i</p> <p><span class="badge badge-info text-white mr-2">375</span> 9.1 Classification and Categorization 351 In practice, the multinomial model has been shown to consistently outper- form the multiple-Bernoulli model. Implementing a classifier based on either of these models is straightforward. Training consists of computing simple term oc- currence statistics. In most cases, these statistics can be stored in memory, which means that classification can be done efficiently. The simplicity of the model, com- bined with good accuracy, makes the Naïve Bayes classifier a popular and attrac- tive choice as a general-purpose classification algorithm. 9.1.2 Support Vector Machines Unlike the Naïve Bayes classifier, which is based purely on probabilistic princi- ples, the next classifier we describe is based on geometric principles. Support Vec- SVMs torMachines , often called , treat inputs such as documents as points in some geometric space. For simplicity, we first describe how SVMs are applied to classi- fication problems with binary class labels, which we will refer to as the “positive” 4 and “negative” classes. In this setting, the goal of SVMs is to find a hyperplane that separates the positive examples from the negative examples. In the Naïve Bayes model, documents were treated as binary vectors in the multiple-Bernoulli model and as term frequency vectors in the multinomial case. SVMs provide more flexibility in terms of how documents can be represented. With SVMs, rather than defining some underlying event space, we must instead ( f that take a document as input and ( · ) , . . . , f define a set of featurefunctions · ) 1 N produce what is known as a feature value . Given a document d , the document is represented in an N -dimensional space by the vector x . = [ f )] ( d ) , . . . , f d ( N d 1 Given a set of training data, we can use the feature functions to embed the training documents in this N -dimensional space. Notice that different feature functions will result in different embeddings. Since SVMs find a hyperplane that separates the data according to classes, it is important to choose feature functions that will help discriminate between the different classes. Two common feature functions are f . 
The ( d ) = δ ( w, d ) and f tf ( d ) = w w;d w w occurs in d first feature function is 1 if term , which is analogous to the multiple- Bernoulli model. The second feature function counts the number of times that w occurs in d , which is analogous to the multinomial model. Notice that these fea- ture functions are indexed by w , which means that there is a total of |V| such func- tions. This results in documents being embedded in a |V| -dimensional space. It is also possible to define similar feature functions over bigrams or trigrams, which 4 A hyperplane generalizes the notion of a plane to N -dimensional space.</p> <p><span class="badge badge-info text-white mr-2">376</span> 352 9 Classification and Clustering would cause the dimensionality of the feature space to explode. Furthermore, other information can be encoded in the feature functions, such as the document length, the number of sentences in the document, the last time the document was updated, and so on. Fig. 9.3. Data set that consists of two classes (pluses and minuses). The data set on the left is linearly separable, whereas the one on the right is not. N -dimen- Now that we have a mechanism for representing documents in an sional space, we describe how SVMs actually classify the points in this space. As described before, the goal of SVMs is to find a hyperplane that separates the neg- ative and positive examples. The hyperplane is learned from the training data. An unseen test point is classified according to which side of the hyperplane the point falls on. For example, if the point falls on the negative side, then we classify it as negative. Similarly, if it falls on the positive side, then it is classified as positive. It is not always possible to draw a hyperplane that perfectly separates the negative training data from the positive training data, however, since no such hyperplane may exist for some embedding of the training data. For example, in Figure 9.3, it is possible to draw a line (hyperplane) that separates the positive class (denoted by “+”) from the negative class (denoted by “–”) in the left panel. However, it is impossible to do so in the right panel. The points in the left panel are said to be linearly separable , since we can draw a linear hyperplane that separates the points. It is much easier to define and find a good hyperplane when the data is linearly separable. Therefore, we begin our explanation of how SVMs work by focusing on this special case. We will then extend our discussion to the more general and common case where the data points are not linearly separable.</p> <p><span class="badge badge-info text-white mr-2">377</span> 9.1 Classification and Categorization 353 Case 1: Linearly separable data Suppose that you were given a linearly separable training set, such as the one in Figure 9.3, and were asked to find the optimal hyperplane that separates the data points. How would you proceed? You would very likely first ask what exactly is meant by optimal. One might first postulate that optimal means hyperplane any that separates the positive training data from the negative training data. However, we must also consider the ultimate goal of any classification algorithm, which is generalize well to unseen data. If a classifier can perfectly classify the training to data but completely fails at classifying the test set data, then it is of little value. This scenario is known as overfitting . + + w ¢ 0 x > – – + + = 1 – x – + ¢ w + – 1 + ¡ = x – ¢ + – w + = 0 x ¢ – 0 x < ¢ w w – – + Fig. 9.4. 
Graphical illustration of Support Vector Machines for the linearly separable case. Here, the hyperplane defined by w is shown, as well as the margin, the decision regions, and the support vectors, which are indicated by circles. In order to avoid overfitting, SVMs choose the hyperplane that maximizes the separation between the positive and negative data points. This selection criteria makes sense intuitively, and is backed up by strong theoretical results as well. As- suming that our hyperplane is defined by the vector w , we want to find the w that separates the positive and negative training data and maximizes the separation be-</p> <p><span class="badge badge-info text-white mr-2">378</span> 354 9 Classification and Clustering tween the data points. The maximal separation is defined as follows. Suppose that + − is the closest negative training point to the hyperplane and that is the clos- x x 5 as the est positive training point to the hyperplane. Then, we define the margin + − distance from x to the hyperplane. x to the hyperplane plus the distance from Figure 9.4 shows a graphical illustration of the margin, with respect to the hy- − + ). The margin can be computed x and x perplane and the support vectors (i.e., using simple vector mathematics as follows: + − | | w · x x | + | w · ) = Margin w ( || w || is the dot product (inner product) between two vectors, and || w || = · where 1/2 · w ) is the length of the vector w . The SVM algorithm’s notion of an opti- ( w w that maximizes the margin while mal hyperplane, therefore, is the hyperplane still separating the data. In order to simplify things, it is typically assumed that − + · = − 1 and w · x x = 1 . These assumptions, which do not change the w 2 solution to the problem, result in the margin being equal to . An alternative w || || yet equivalent formulation is to find the hyperplane w that solves the following optimization problem: 1 2 || || w minimize : 2 subject to : · x ) = + ≥ 1 ∀ i w ( i s.t. Class i s.t. Class · − ≤− 1 ∀ w x ( i ) = i i This formulation is often used because it is easier to solve. In fact, this optimization problem can be solved using a technique called quadraticprogramming , the details of which are beyond the scope of this book. However, there many excellent open source SVM packages available. In the “References and Further Reading” section at the end of this chapter we provide pointers to several such software packages. Once the best w has been found, an unseen document d can be classified using the following rule: { > x · 0 + if w d ) = ( Class d − otherwise 5 − + and The vectors x x are known as support vectors . The optimal hyperplane w is a linear combination of these vectors. Therefore, they provide the support for the decision boundary. This is the origin of the name “Support Vector Machine.”</p> <p><span class="badge badge-info text-white mr-2">379</span> 9.1 Classification and Categorization 355 Therefore, the rule classifies documents based on which side of the hyperplane the document’s feature vector is on. Referring back to Figure 9.4, we see that in this example, those points to the left of the hyperplane are classified as positive exam- ples and those to the right of the hyperplane are classified as negative examples. Case 2: Non-linearly separable data Very few real-world data sets are actually linearly separable. Therefore, the SVM formulation just described must be modified in order to account for this. 
This can be achieved by adding a penalty factor to the problem that accounts for training instances that do not satisfy the constraints of the linearly separable formulation. in the positive class, w · x = − 0 . 5 . Suppose that, for some training point x · This violates the constraint ≥ 1 . In fact, x falls on the entirely wrong side x w · x is (at least) 1, we can apply a linear w of the hyperplane. Since the target for penalty based on the difference between the target and actual value. That is, the x is penalty given to − ( − 0 . 5) = 1 . 5 . If w · x = 1 . 25 , then no penalty would 1 be assigned, since the constraint would not be violated. This type of penalty is hinge loss function known as the . It is formally defined as: { ( if Class (1 max ) = + i w · x, 0) − ) = x ( L max (1 + w · x, 0) if Class ( i ) = − This loss function is incorporated into the SVM optimization as follows: ∑ N 1 2 : minimize || C w || ξ + i i =1 2 : subject to · x ) = + ≥ 1 − ξ w ∀ i s.t. Class ( i i i w · x − ≤− 1 + ξ ) = ∀ i s.t. Class ( i i i ξ ≥ 0 ∀ i i where ξ that allows the target values to be violated. is known as a slack variable i The slack variables enforce the hinge loss function. Notice that if all of the con- straints are satisfied, all of the slack variables would be equal to 0, and therefore the loss function would reduce to the linearly separable case. In addition, if any constraint is violated, then the amount by which it is violated is added into the objective function and multiplied by C , which is a free parameter that controls how much to penalize constraint violations. It is standard to set C equal to 1. This</p> <p><span class="badge badge-info text-white mr-2">380</span> 356 9 Classification and Clustering optimization problem finds a hyperplane that maximizes the margin while allow- ing for some slack. As in the linearly separable case, this optimization problem can be solved using quadratic programming. In addition, classification is performed in the same way as the linearly separable case. The kernel trick The example in Figure 9.3 illustrates the fact that certain embeddings of the train- transforma- ing data are not linearly separable. It may be possible, however, that a or of the data into a higher dimensional space results in a set of mapping tion linearly separable points. This may result in improved classification effectiveness, although it is not guaranteed. -dimensional vector into a higher dimen- N There are many ways to map an sional space. For example, given the vector [ ( d ) , . . . , f , one could aug- ( d )] f N 1 ment the vector by including squared feature values. That is, the data items would 2 N now be represented by the -dimensional vector: [ ] 2 2 ) , . . . , f ( ( d ) , f ) ( d ) f d d ( , . . . , f N N 1 1 The higher the dimensionality of the feature vectors, however, the less efficient the algorithm becomes, both in terms of space requirements and computation time. One important thing to notice is that the key mathematical operation involved in training and testing SVMs is the dot product. If there was an efficient way to compute the dot product between two very high-dimensional vectors without having to store them, then it would be feasible to perform such a mapping. In fact, this is possible for certain classes of high-dimensional mappings. This can be achieved by using a kernel function . A kernel function takes two N -dimensional vectors and computes a dot product between them in a higher dimensional space. 
This higher dimensional space is , in that the higher dimensional vectors implicit are never actually constructed. Let us now consider an example. Suppose that we have two -dimensional vec- 2 w = [ w as follows: w ) ] and tors = [ x · x ( ] . Furthermore, we define x 2 1 2 1   2 x 1 √   x ) = ( x 2 x 1 2 2 x 2 Here, ( · ) maps 2 -dimensional vectors into 3 -dimensional vectors. As we de- scribed before, this may be useful because the original inputs may actually be lin- early separable in the 3 -dimensional space to which ( · ) maps the points. One can</p> <p><span class="badge badge-info text-white mr-2">381</span> 9.1 Classification and Categorization 357 imagine many, many ways of mapping the original inputs into higher dimensional spaces. However, as we will now show, certain mappings have very nice properties that allow us to efficiently compute dot products in the higher dimensional space. w ) · ( ( ) would be to first Given this mapping, the naïve way to compute x w ) and ( x ) and then perform the dot product in the 3 - explicitly construct ( dimensional space. However, surprisingly, it turns out that this is not necessary, since: 2 2 2 2 x ) = w + 2 ( x w x ) · ( w w x + x w 2 2 1 1 1 2 2 1 2 = ( w · x ) w · x where 2 -dimensional space. Therefore, rather is computed in the original, than explicitly computing the dot product in the higher 3 -dimensional space, we only need to compute the dot product in the original 2 -dimensional space and then square the value. This “trick,” which is often referred to as the kernel trick , allows us to efficiently compute dot products in some higher dimensional space. Of course, the example given here is rather trivial. The true power of the ker- nel trick becomes more apparent when dealing with mappings that project into much higher dimensional spaces. In fact, some kernels perform a dot product in an infinite dimensional space! Kernel Type Value Implicit Dimension K ( x , x Linear N ) = x x · 1 2 2 1 ) ( − p + N 1 p , x x ) ) = ( · x ( K x Polynomial 2 1 1 2 N 2 2 ( x Infinite , x −|| ) = exp Gaussian x σ − x /2 || K 2 2 1 1 Table 9.1. A list of kernels that are typically used with SVMs. For each kernel, the name, value, and implicit dimensionality are given. A list of the most widely used kernels is given in Table 9.1. Note that the Gaus- sian kernel is also often called a radial basis function (RBF) kernel. The best choice of kernel depends on the geometry of the embedded data. Each of these kernels has been shown to be effective on textual features, although the Gaussian kernel 2 tends to work well across a wide range of data sets, as long as the variance ( σ ) is properly set. Most standard SVM software packages have these kernels built in, so using them is typically as easy as specifying a command-line argument. Therefore,</p> <p><span class="badge badge-info text-white mr-2">382</span> 358 9 Classification and Clustering given their potential power and their ease of use, it is often valuable to experiment with each of the kernels to determine the best one to use for a specific data set and task. The availability of these software packages, together with the SVM’s flexibil- ity in representation and, most importantly, their demonstrated effectiveness in many applications, has resulted in SVMs being very widely used in classification applications. Non-binary classification Up until this point, our discussion has focused solely on how support vector ma- chines can be used for binary classification tasks. 
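The kernels in Table 9.1 are simple to implement directly. The sketch below is illustrative Java (not taken from any SVM package); p and sigma are tuning parameters, and in practice one would rely on an SVM library's built-in kernels rather than code like this.

```java
// Sketch of the kernel functions listed in Table 9.1. Inputs are dense feature
// vectors of equal length; p (polynomial degree) and sigma are chosen by the user.
public class Kernels {
    static double dot(double[] x1, double[] x2) {
        double s = 0.0;
        for (int i = 0; i < x1.length; i++) s += x1[i] * x2[i];
        return s;
    }

    // Linear kernel: K(x1, x2) = x1 . x2
    static double linear(double[] x1, double[] x2) {
        return dot(x1, x2);
    }

    // Polynomial kernel: K(x1, x2) = (x1 . x2)^p, an implicit dot product in a
    // higher-dimensional space (the p = 2 case is the worked example above).
    static double polynomial(double[] x1, double[] x2, int p) {
        return Math.pow(dot(x1, x2), p);
    }

    // Gaussian (RBF) kernel: K(x1, x2) = exp(-||x1 - x2||^2 / (2 sigma^2)).
    static double gaussian(double[] x1, double[] x2, double sigma) {
        double dist2 = 0.0;
        for (int i = 0; i < x1.length; i++) {
            double diff = x1[i] - x2[i];
            dist2 += diff * diff;
        }
        return Math.exp(-dist2 / (2 * sigma * sigma));
    }
}
```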
We will now describe two of the most popular ways to turn a binary classifier, such as a support vector machine, into a multi-class classifier. These approaches are relatively simple to implement and have been shown to work effectively.

The first technique is called the one versus all (OVA) approach. Suppose that we have a K >= 2 class classification problem. The OVA approach works by training K classifiers. When training the kth classifier, the kth class is treated as the positive class and all of the other classes are treated as the negative class. That is, each classifier treats the instances of a single class as the positive class, and the remaining instances are the negative class. Given a test instance x, it is classified using all K classifiers. The class for x is the (positive) class associated with the classifier that yields the largest value of w · x. That is, if w_c is the "class c versus not class c" classifier, then items are classified according to:

Class(x) = arg max_c w_c · x

The other technique is called the one versus one (OVO) approach. In the OVO approach, a binary classifier is trained for every unique pair of classes. For example, for a ternary classification problem with the labels "excellent", "fair", and "bad", it would be necessary to train the following classifiers: "excellent versus fair", "excellent versus bad", and "fair versus bad". In general, the OVO approach requires K(K − 1)/2 classifiers to be trained, which can be computationally expensive for large data sets and large values of K. To classify a test instance x, it is run through each of the classifiers. Each time x is classified as c, a vote for c is recorded. The class that has the most votes at the end is then assigned to x.

Both the OVA and OVO approaches work well in practice. There is no concrete evidence that suggests that either should be preferred over the other. Instead, the effectiveness of the approaches largely depends on the underlying characteristics of the data set.

9.1.3 Evaluation

Most classification tasks are evaluated using standard information retrieval metrics, such as accuracy[6], precision, recall, the F measure, and ROC curve analysis. Each of these metrics was described in detail in Chapter 8. Of these metrics, the most commonly used are accuracy and the F measure.

There are two major differences between evaluating classification tasks and other retrieval tasks. The first difference is that the notion of "relevant" is replaced with "is classified correctly." The other major difference is that microaveraging, which is not commonly used to evaluate retrieval tasks, is widely used in classification evaluations. Macroaveraging for classification tasks involves computing some metric for each class and then computing the average of the per-class metrics. On the other hand, microaveraging computes a metric for every test instance (document) and then averages over all such instances. It is often valuable to compute and analyze both the microaverage and the macroaverage, especially when the class distribution P(c) is highly skewed.

9.1.4 Classifier and Feature Selection

Up until this point we have covered the basics of two popular classifiers. We have described the principles the classifiers are built upon, their underlying assumptions, the pros and cons, and how they can be used in practice.
As classification is a deeply complex and rich subject, we cover advanced classification topics in this section that may be of interest to those who would like a deeper or more complete understanding of the topic. Generative, discriminative, and non-parametric models The Naïve Bayes classifier was based on probabilistic modeling. The model re- quires us to assume that documents are generated from class labels according to a probabilistic model that corresponds to some underlying event space. The Naïve Bayes classifier is an example of a wider class of probabilistic models called gener- ative models . These models assume that some underlying probability distribution 6 Accuracy is another name for precision at rank 1.</p> <p><span class="badge badge-info text-white mr-2">384</span> 360 9 Classification and Clustering generates both documents and classes. In the Naïve Bayes case, the classes and documents are generated as follows. First, a class is generated according to ( c ) . P P ( d | c Then, a document is generated according to . This process is summarized ) in Figure 9.5. Generative models tend to appeal to intuition by mimicking how people may actually generate (write) documents. Class 1 Class 2 Class 3 Generate class according to P(c) Class 2 Generate document according to P(d|c) Fig. 9.5. Generative process used by the Naïve Bayes model. First, a class is chosen accord- d ing to c ) , and then a document is chosen according to P ( ( | c ) . P Of course, the accuracy of generative models largely depends on how accu- rately the probabilistic model captures this generation process. If the model is a reasonable reflection of the actual generation process, then generative models can be very powerful, especially when there are very few training examples. As the number of training examples grows, however, the power of the gener- ative model can be limited by simplifying distributional assumptions, such as the independence assumption in the Naïve Bayes classifier. In such cases, discrimina- tive models often outperform generative models. Discriminative models are those that do not model the generative process of documents and classes. Instead, they directly model the class assignment problem given a document as input. In this way, they discriminate between class labels. Since these models do not need to model the generation of documents, they often have fewer distributional assump- tions, which is one reason why they are often preferred to generative models when</p> <p><span class="badge badge-info text-white mr-2">385</span> 9.1 Classification and Categorization 361 there are many training examples. Support vector machines are an example of a discriminative model. Notice that no assumptions about the document genera- tion process are made anywhere in the SVM formulation. Instead, SVMs directly learn a hyperplane that effectively discriminates between the classes. + – – + – – – – – – + – + + – + – + + – + – – + + + – – + – – + – + + + + – – + + + – – + – – + + – – + + – + – – + + + – – + – + – Fig. 9.6. Example data set where non-parametric learning algorithms, such as a nearest neighbor classifier, may outperform parametric algorithms. The pluses and minuses indi- cate positive and negative training examples, respectively. The solid gray line shows the actual decision boundary, which is highly non-linear. Non-parametric classifiers are another option when there is a large number of training examples. 
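As a preview of the simplest such method, the nearest neighbor classifier described next can be sketched in a few lines of Java. The code is illustrative only; the names are invented, and Euclidean distance is just one possible choice of distance metric.

```java
import java.util.List;

// Minimal nearest neighbor classifier: label an unseen point with the label of
// its closest training point under Euclidean distance.
public class NearestNeighbor {
    record Example(double[] features, String label) {}

    static double distance(double[] a, double[] b) {
        double s = 0.0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            s += d * d;
        }
        return Math.sqrt(s);
    }

    static String classify(List<Example> training, double[] x) {
        Example nearest = null;
        double best = Double.POSITIVE_INFINITY;
        for (Example e : training) {
            double d = distance(e.features(), x);
            if (d < best) {
                best = d;
                nearest = e;
            }
        }
        return nearest.label();
    }
}
```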
Non-parametric classifiers let the data “speak for itself ” by eliminating all distributional assumptions. One simple example of a non- parametric classifier is the nearest neighbor classifier . Given an unseen example, the nearest neighbor classifier finds the training example that is nearest (accord- ing to some distance metric) to it. The unseen example is then assigned the label of this nearest neighbor. Figure 9.6 shows an example output of a nearest neigh- bor classifier. Notice the irregular, highly non-linear decision boundary induced by the classifier. Generative and discriminative models, even SVMs with a non-</p> <p><span class="badge badge-info text-white mr-2">386</span> 362 9 Classification and Clustering linear kernel, would have a difficult time fitting a model to this data. For this rea- son, the nearest neighbor classifier is optimal as the number of training examples approaches infinity. However, the classifier tends to have a very high variance for smaller data sets, which often limits its applicability. Feature selection The SVM classifier embeds inputs, such as documents, into some feature space that is defined by a set of feature functions. As we described, it is common to de- fine one (or more) feature functions for every word in the vocabulary. This |V| - dimensional feature space can be extremely large, especially for very large vocabu- laries. Since the feature set size affects both the efficiency and effectiveness of the classifier, researchers have devised techniques for pruning the feature space. These techniques. are known as feature selection The goal of feature selection is to find a small subset of the original features that can be used in place of the original feature set with the aim of significantly improving efficiency (in terms of storage and time) while not hurting effective- ness much. In practice, it turns out that feature selection techniques often improve effectiveness instead of reducing it. The reason for this is that some of the features eliminated during feature selection may be noisy or inaccurate, and therefore hin- der the ability of the classification model to learn a good model. Information gain is one of the most widely used feature selection criteria for text classification applications. Information gain is based on information theory principles. As its name implies, it measures how much information about the class labels is gained when we observe the value of some feature. Let us return to the spam classification example in Figure 9.1. Observing the value of the fea- ture “cheap” provides us quite a bit of information with regard to the class labels. If “cheap” occurs, then it is very likely that the label is “spam”, and if “cheap” does not occur, then it is very likely that the label is “not spam”. In information theory, entropy is the expected information contained in some distribution, such as the P ( c ) . Therefore, the information gain of some feature class distribution mea- f sures how the entropy of P ( c ) changes after we observe f . Assuming a multiple- Bernoulli event space, it is computed as follows: IG ( w ) = H ( C ) − H ( C | w ) ∑ ∑ ∑ = | P ( c ) log P ( c ) + ) c ( P log ) − P ( w ) w w | P ( c ∈C c ∈C c 1 ; 0 ∈{ w }</p> <p><span class="badge badge-info text-white mr-2">387</span> 9.1 Classification and Categorization 363 H ( ) is the entropy of P ( c ) and H ( C | w ) is known as the conditional where C . 
As an illustrative example, we compute the information gain for the term "cheap" from our spam classification example:

IG(cheap) = −P(spam) log P(spam) − P(not spam) log P(not spam)
  + P(cheap) P(spam | cheap) log P(spam | cheap)
  + P(cheap) P(not spam | cheap) log P(not spam | cheap)
  + P(not cheap) P(spam | not cheap) log P(spam | not cheap)
  + P(not cheap) P(not spam | not cheap) log P(not spam | not cheap)
= −3/10 log 3/10 − 7/10 log 7/10 + 4/10 · 3/4 log 3/4 + 4/10 · 1/4 log 1/4
  + 6/10 · 0/6 log 0/6 + 6/10 · 6/6 log 6/6
= 0.2749

where P(not cheap) is shorthand for P(cheap = 0), P(not spam) means P(C = not spam), and it is assumed that 0 log 0 = 0. The corresponding information gains for "buy", "banking", "dinner", and "the" are 0.0008, 0.0434, 0.3612, and 0.0, respectively. Therefore, according to the information gain, "dinner" is the most informative word, since it is a perfect predictor of "not spam" according to the training set. On the opposite side of the spectrum, "the" is the worst predictor, since it appears in every document and therefore has no discriminative power.

Similar information gain measures can be derived for other event spaces, such as the multinomial event space. There are many different ways to use the information gain to actually select features. However, the most common approach is to select the K features with the largest information gain and train a model using only those features. It is also possible to select a percentage of all features, or to use a threshold. Although many other feature selection criteria exist, information gain tends to be a good general-purpose feature selection criterion, especially for text-based classification problems. We provide pointers to several other feature selection techniques in the "References and Further Reading" section at the end of this chapter.

9.1.5 Spam, Sentiment, and Online Advertising

Although ranking functions are a very critical part of any search engine, classification and categorization techniques also play an important role in various search-related tasks. In this section, we describe several real-world text classification applications. These applications are spam detection, sentiment classification, and online advertisement classification.

Spam, spam, spam

Classification techniques can be used to help detect and eliminate various types of spam. Spam is broadly defined to be any content that is generated for malevolent purposes,[7] such as unsolicited advertisements, deceptively increasing the ranking of a web page, or spreading a virus. One important characteristic of spam is that it tends to have little, if any, useful content. This definition of spam is very subjective, because what may be useful to one person may not be useful to another. For this reason, it is often difficult to come up with an objective definition of spam.

There are many types of spam, including email spam, advertisement spam, blog spam, and web page spam. Spammers use different techniques for different types of spam. Therefore, there is no one single spam classification technique that works for all types of spam. Instead, very specialized spam classifiers are built for the different types of spam, each taking into account domain-specific information. Much has been written about email spam, and filtering programs such as SpamAssassin[8] are in common use.

[7] The etymology of the word spam, with respect to computer abuse, is quite interesting. The meaning is believed to have been derived from a 1970 Monty Python skit set in a restaurant where everything on the menu has spam (the meat product) in it. A chorus of Vikings begins singing a song that goes, "Spam, spam, spam, spam, ..." on and on, therefore tying the word spam to repetitive, annoying behavior.
[8] http://spamassassin.apache.org/
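Filters in this style combine many weighted tests, such as blacklist lookups and the output of a Bayes classifier, into a single score that is compared to a threshold. The Java sketch below is a toy illustration of that scoring scheme only; the test names and weights are invented for illustration and do not reproduce SpamAssassin's actual rule set.

```java
import java.util.Map;

// Toy illustration of a rule-based spam filter: each test that fires on a
// message contributes a weight, and mail scoring at or above a threshold is
// flagged as spam.
public class RuleBasedSpamFilter {
    static final double THRESHOLD = 5.0;

    // firedTestWeights maps the names of tests that fired to their weights.
    static boolean isSpam(Map<String, Double> firedTestWeights) {
        double score = firedTestWeights.values().stream()
                .mapToDouble(Double::doubleValue).sum();
        return score >= THRESHOLD;
    }

    public static void main(String[] args) {
        // Hypothetical tests: a blacklisted URL, a future timestamp, and a
        // mid-range Bayes classifier probability.
        Map<String, Double> fired = Map.of(
                "URIBL_BLACK", 2.0,
                "DATE_IN_FUTURE_06_12", 1.9,
                "BAYES_50", 0.0);
        // Prints "not spam" since 3.9 is below the threshold of 5.0.
        System.out.println(isSpam(fired) ? "spam" : "not spam");
    }
}
```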
Figure 9.7 shows the SpamAssassin output for an example email. SpamAssassin computes a score for the email that is compared to a threshold (default value 5.0) to determine whether it is spam. The score is based on a combination of features, one of the most important of which is the output of a Bayes classifier. In this case, the URL contained in the body of the email was on a blacklist, the timestamp on the email is later than the time it was received, and the Bayes classifier gives the email a 40–60% chance of being in the class “spam” based on the words in the message. These three features did not, however, give the email a score over 5, so it was not classified as spam (which is a mistake). 7 The etymology of the word spam, with respect to computer abuse, is quite interesting. The meaning is believed to have been derived from a 1970 Monty Python skit set in a restaurant where everything on the menu has spam (the meat product) in it. A chorus of Vikings begins singing a song that goes, “Spam, spam, spam, spam, ...” on and on, therefore tying the word spam to repetitive, annoying behavior. 8 http://spamassassin.apache.org/</p> <p><span class="badge badge-info text-white mr-2">389</span> 9.1 Classification and Categorization 365 ... To: ... From: profit Subject: debt non Spam ! Checked: This message probably not SPAM X ! ! ! Score: 3.853, Required: 5 Spam X Spam ! Level: *** (3.853) ! X ! X ! Tests: BAYES_50,DATE_IN_FUTURE_06_12,URIBL_BLACK Spam X Spam ! Report ! rig: !!!! Start SpamAssassin (v2.6xx ! cscf) results ! the URIBL_BLACK an URL listed in 2.0 URIBL blacklist Contains [URIs: bad ! debtyh.net.cn] 1.9 DATE_IN_FUTURE_06_12 Date: is 6 to 12 hours after Received: date 60% BAYES_50 BODY: Bayesian spam probability is 40 to 0.0 [score: 0.4857] Say bye to debt good Acceptable Unsecured Debt includes All Major Credit Cards, No ! collateral Bank Personal Loans, Loans, Medical etc. Bills http://www.bad debtyh.net.cn ! Fig. 9.7. Example output of SpamAssassin email spam filter Since this book focuses on search engines, we will devote our attention to web page spam, which is one of the most difficult and widespread types of spam. De- tecting web page spam is a difficult task, because spammers are becoming increas- ingly sophisticated. It seems sometimes that the spammers themselves have ad- vanced degrees in information retrieval! There are many different ways that spam- mers target web pages. Gyöngyi and Garcia-Molina (2005) proposed a web spam taxonomy that attempts to categorize the different web page spam techniques that are often used to artificially increase the ranking of a page. The two top-level cat- egories of the taxonomy are link spam and term spam . With link spam, spammers use various techniques to artificially increase the link-based scores of their web pages. In particular, search engines often use mea- sures such as inlink count and PageRank, which are based entirely on the link structure of the Web, for measuring the importance of a web page. However, these techniques are susceptible to spam. One popular and easy link spam technique involves posting links to the target web page on blogs or unmoderated message boards. Another way for a website to artificially increase its link-based score is to join a link exchange network. Link exchange networks are large networks of web-</p> <p><span class="badge badge-info text-white mr-2">390</span> 366 9 Classification and Clustering sites that all connect to each other, thereby increasing the number of links coming into the site. 
Another link spam technique is called link farming. Link farms are similar to exchange networks, except the spammer himself buys a large number of domains, creates a large number of sites, and then links them all together. There are various other approaches, but these account for a large fraction of link spam. A number of alternatives to PageRank have been proposed recently that attempt to dampen the potential effect of link spam, including HostTrust (Gyöngyi et al., 2004) and SpamRank (Benczúr et al., 2005). The other top-level category of spam is term spam. Term spam attempts to modify the textual representation of the document in order to make it more likely to be retrieved for certain queries or keywords. As with link-based scores, term- based scores are also susceptible to spam. Most of the widely used retrieval mod- els, including BM25 and language modeling, make use of some formulation that involves term frequency and document frequency. Therefore, by increasing the term frequency of target terms, these models can easily be tricked into retrieving non-relevant documents. Furthermore, most web ranking functions match text in the incoming anchor text and the URL. Modifying the URL to match a given term or phrase is easy. However, modifying the incoming anchor text requires more effort, but can easily be done using link exchanges and link farms. Another technique, called dumping , fills documents with many unrelated words (often an entire dictionary). This results in the document being retrieved for just about any query, since it contains almost every combination of query terms. Therefore, this acts as a recall enhancing measure. This can be combined with the other spamming techniques, such as repetition, in order to have high precision as well as high re- call. Phrase stitching (combining words and sentences from various sources) and weaving (adding spam terms into a valid source such as a news story) are other techniques for generating artificial content. All of these types of term spam should be considered when developing a ranking function designed to prevent spam. Figure 9.8 shows an example of a web page containing spam. The page contains both term spam with repetition of important words and link spam where related spam sites are mentioned. As should be apparent by now, there is an overwhelming number of types of spam. Here, we simply focused on web page spam and did not even start to con- sider the other types of spam. Indeed, it would be easy to write an entire book on subjects related to spam. However, before we end our spam discussion, we will describe just one of the many different ways that classification has been used to tackle the problem of detecting web page spam.</p> <p><span class="badge badge-info text-white mr-2">391</span> 9.1 Classification and Categorization 367 Website: B E T T I N G N F L F O O T B A L L P R O F O O T B A L L S P O R T S B O O K S N F L F O O T B A L L L I N E O N L I N E N F L S P O R T S B O O K S N F L Players Super Book When It Comes To Secure NFL Betting And Finding Players Super Book Is The The Best Football Lines Best Option! Sign Up And Ask For 30 % In Bonuses. MVP Sportsbook Football Betting Has Never been so easy and secure! MVP Sportsbook has all the NFL odds you are looking for. Sign Up Now and ask for up to 30 % in Cash bonuses. 
Term spam: football online nfl sportsbooksnfl football line pro sportsbooksnfl football pro nfl football online nfl gambling pro gambling odds online nfl betting online nflgamblibg spreads gambling offshore spreads online football nfl sportsbookonline online nfl football gambling line online nfl betting nfl football wagering online football online online betting spreads betting gambling football online nfl football betting odds offshore online gambling ... sportsbookonline gambling football nfl football Link spam: Beverly Sportsbook Hills Football Gambling Sportsbook Football MVP Football Odds Popular Wagering Football SB Players Poker V Wager Football Spreads Virtual Bookmaker Football Lines Gecko Football Betting Online Casino Spreads Point Football Bogarts Casino Hour Online MVP Casino Online Football Wagering Gambling Jackpot Football Gambling NFL Poker Popular Betting NFL Casino Toucan Jockey NFL Odds Wagering NFL Tracks All Bet NFL Lines MVP RacebookNFL Point Spreads Live Horse Betting ... NFL Bogarts Poker Spreads Sportsbook NFL Poker Popular Fig. 9.8. Example of web page spam, showing the main page and some of the associated term and link spam</p> <p><span class="badge badge-info text-white mr-2">392</span> 368 9 Classification and Clustering Ntoulas et al. (2006) propose a method for detecting web page spam using content (textual) analysis. The method extracts a large number of features from each web page and uses these features in a classifier. Some of the features include the number of words on the page, number of words in the title, average length of the words, amount of incoming anchor text, and the fraction of visible text. These features attempt to capture very basic characteristics of a web page’s text. Another feature used is the compressibility of the page, which measures how much the page can be reduced in size using a compression algorithm. It turns out that pages that can be compressed more are much more likely to be spam, since pages that contain many repeated terms and phrases are easier to compress. This has also been shown to be effective for detecting email spam. The authors also use as features the 9 and the fraction of globally fraction of terms drawn from globally popular words popular words appearing on the page. These features attempt to capture whether or not the page has been filled with popular terms that are highly likely to match a query term. The last two features are based on n -gram likelihoods. Experiments show that pages that contain very rare and very common n -grams are more likely to be spam than those pages that contain n -grams of average likelihood. All of these features were used with a decision tree learning algorithm, which is another type of supervised classification algorithm, and shown to achieve classification accuracy well above 90%. The same features could easily be used in a Naïve Bayes or Support Vector Machine classifier. Sentiment As we described in Chapter 6, there are three primary types of web queries. The models described in Chapter 7 focus primarily on informational and navigational queries. Transactional queries, the third type, present many different challenges. If a user queries for a product name, then the search engine should display a vari- ety of information that goes beyond the standard ranked list of topically relevant results. For example, if the user is interested in purchasing the product, then links to online shopping sites can be provided to help the user complete her purchase. 
It may also be possible that the user already owns the product and is searching for accessories or enhancements. The search engine could then derive revenue from the query by displaying advertisements for related accessories and services. 9 The list of “globally popular” words in this experiment was simply the N most frequent words in the test corpus.</p> <p><span class="badge badge-info text-white mr-2">393</span> 9.1 Classification and Categorization 369 Another possible scenario, and the one that we focus on in detail here, is that the user is researching the product in order to determine whether he should pur- chase it. In this case, it would be valuable to retrieve information such as product specifications, product reviews, and blog postings about the product. In order to reduce the amount of information that the user needs to read through, it would be preferable to have the system automatically aggregate all of the reviews and blog posts in order to present a condensed, summarized view. There are a number of steps involved with building such a system, each of which involves some form of classification. First, when crawling and indexing sites, the system has to automatically classify whether or not a web page contains a review or if it is a blog posting expressing an opinion about a product. The task of identifying opinionated text, as opposed to factual text, is called opinion de- tection . After a collection of reviews and blog postings has been populated, an- other classifier must be used to extract product names and their corresponding reviews. This is the information extraction task. For each review identified for a given product, yet another classifier must be used to determine the sentiment of the page. Typically, the sentiment of a page is either “negative” or “positive”, al- though the classifier may choose to assign a numeric score as well, such as “two stars” or “four stars”. Finally, all of the data, including the sentiment, must be ag- gregated and presented to the user in some meaningful way. Figure 9.9 shows part of an automatically generated product review from a web service. This sentiment- based summary of various aspects of the product, such as “ease of use”, “size”, and “software”, is generated from individual user reviews. Rather than go into the details of all of these different classifiers, we will focus our attention on how sentiment classifiers work. As with our previous examples, let us consider how a person would identify the sentiment of some piece of text. For a majority of cases, we use vocabulary clues in order to determine the senti- ment. For example, a positive digital camera review would likely contain words such as “great”, “nice”, and “amazing”. On the other hand, negative reviews would contain words such as “awful”, “terrible”, and “bad”. This suggests one possible so- lution to the problem, where we build two lists. The first list will contain words that are indicative of positive sentiment, and another list will contain words in- dicative of negative sentiment. Then, given a piece of text, we could simply count the number of positive words and the number of negative words. If there are more positive words, then assign the text a positive sentiment label. Otherwise, label it as having negative sentiment. 
Even though this approach is perfectly reasonable, it turns out that people are not very good at creating lists of words that indicate pos-</p> <p><span class="badge badge-info text-white mr-2">394</span> 370 9 Classification and Clustering user reviews All Comments comments) (148 General 82% positive of Use Ease (108 comments) 78% positive (92 comments) Screen 97% positive Software (78 comments) positive 35% Quality comments) Sound (59 89% positive Size (59 comments) 76% positive Fig. 9.9. Example product review incorporating sentiment itive and negative sentiment. This is largely due to the fact that human language is ambiguous and largely dependent on context. For example, the text “the digital camera lacks the amazing picture quality promised” would likely be classified as having positive sentiment because it contains two positive words (“amazing” and “quality”) and only one negative word (“lacks”). Pang et al. (2002) proposed using machine learning techniques for sentiment classification. Various classifiers were explored, including Naïve Bayes, Support Vector Machines, and maximum entropy , which is another popular classification technique. The features used in the classifiers were unigrams, bigrams, part-of- speech tags, adjectives, and the position of a term within a piece of text. The au- thors report that an SVM classifier using only unigram features exhibited the best performance, resulting in more accurate results than a classifier trained using all of the features. In addition, it was observed that the multiple-Bernoulli event space outperformed the multinomial event space for this particular task. This is likely caused by the fact that most sentiment-related terms occur only once in any piece of text, and therefore term frequency adds very little to the model. Interestingly, the machine learning models were significantly more accurate than the baseline model that used human-generated word lists. The SVM classifier with unigrams</p> <p><span class="badge badge-info text-white mr-2">395</span> 9.1 Classification and Categorization 371 had an accuracy of over 80%, whereas the baseline model had an accuracy of only around 60%. Classifying advertisements As described in Chapter 6, sponsored search and content match are two differ- ent advertising models widely used by commercial search engines. The former matches advertisements to queries, whereas the latter matches advertisements to web pages. Both sponsored search and content match use a pricing pay per click model, which means that advertisers must pay the search engine only if a user clicks on the advertisement. A user may click on an advertisement for a number of reasons. Clearly, if the advertisement is “topically relevant,” which is the stan- dard notion of relevance discussed in the rest of this book, then the user may click on it. However, this is not the only reason why a user may click. If a user searches for “tropical fish”, she may click on advertisements for pet stores, local aquariums, or even scuba diving lessons. It is less likely, however, that she would click on ad- vertisements for fishing, fish restaurants, or mercury poisoning. The reason for this is that the concept “tropical fish” has a certain semantic scope that limits the type of advertisements a user may find interesting. 
Although it is possible to use standard information retrieval techniques such as query expansion or query reformulation analysis to find these semantic matches for advertising, it is also possible to use a classifier that maps queries (and web pages) into semantic classes. Broder et al. (2007) propose a simple yet effective technique for classifying textual items, such as queries and web pages, into a se- mantic hierarchy. The hierarchy was manually constructed and consists of over 6,000 nodes, where each node represents a single semantic class. As one moves deeper down the hierarchy, the classes become more specific. Human judges man- ually placed thousands of queries with commercial intent into the hierarchy based on each query’s intended semantic meaning. Given such a hierarchy and thousands of labeled instances, there are many pos- sible ways to classify unseen queries or web pages. For example, one could learn a Naïve Bayes model or use SVMs. Since there are over 6,000 classes, however, there could be data sparsity issues, with certain classes having very few labeled instances associated with them. A bigger problem, however, would be the efficiency of this approach. Both Naïve Bayes and SVMs would be very slow to classify an item into one of 6,000 possible classes. Since queries must be classified in real time, this is not an option. Instead, Broder et al. propose using cosine similarity with tf.idf</p> <p><span class="badge badge-info text-white mr-2">396</span> 372 9 Classification and Clustering weighting to match queries (or web pages) to semantic classes. That is, they frame retrieval problem, where the query is the query (or the classification problem as a web page) to be classified and the document set consists of 6,000 “documents”, one for each semantic class. For example, for the semantic class “Sports”, the “doc- ument” for it would consists of all of the queries labeled as “Sports”. These “doc- uments” are stored in an inverted index, which allows for efficient retrieval for an incoming query (or web page). This can be viewed as an example of a nearest neighbor classifier. Aquariums Supplies Fish Rainbow Fish Discount Tropical Fish Food Feed your tropical fish a gourmet diet Resources for just pennies a day! www.cheapfishfood.com Web Page Ad Fig. 9.10. Example semantic class match between a web page about rainbow fish (a type of tropical fish) and an advertisement for tropical fish food. The nodes “Aquariums”, “Fish”, and “Supplies” are example nodes within a semantic hierarchy. The web page is classified as “Aquariums - Fish” and the ad is classified as “Supplies - Fish”. Here, “Aquariums” is the least common ancestor. Although the web page and ad do not share any terms in common, they can be matched because of their semantic similarity.</p> <p><span class="badge badge-info text-white mr-2">397</span> 9.2 Clustering 373 To use such a classifier in practice, one would have to preclassify every adver- tisement in the advertising inventory. Then, when a new query (or web page) ar- rives, it is classified. There are a number of ways to use the semantic classes to improve matching. Obviously, if the semantic class of a query exactly matches the semantic class of an advertisement, it should be given a high score. However, there are other cases where two things may be very closely related, even though they do not have exactly the same semantic class. Therefore, Broder et al. 
propose a way of measuring the distance between two semantic classes within the hierarchy, based on the inverse of the least common ancestor of the two nodes in the hierarchy. A common ancestor is a node in the hierarchy that you must pass through in order to reach both nodes. The least common ancestor is the one with the maximum depth in the hierarchy. The distance is 0 if the two nodes are the same and very large if the least common ancestor is the root node. Figure 9.10 shows an example of how a web page can be semantically matched to an advertisement using the hierarchy. In the figure, the least common ancestor of the web page and ad classes is "Aquariums", which is one node up the hierarchy. Therefore, this match would be given a lower score than if both the web page and ad were classified into the same node in the hierarchy. The full advertisement score can be computed by combining this distance based on the hierarchy with the standard cosine similarity score. In this way, advertisements are ranked in terms of both topical relevance and semantic relevance.

9.2 Clustering

Clustering algorithms provide a different approach to organizing data. Unlike the classification algorithms covered in this chapter, clustering algorithms are based on unsupervised learning, which means that they do not require any training data. Clustering algorithms take a set of unlabeled instances and group (cluster) them together. One problem with clustering is that it is often an ill-defined problem. Classification has very clear objectives. However, the notion of a good clustering is often defined very subjectively.

In order to gain more perspective on the issues involved with clustering, let us examine how we, as humans, cluster items. Suppose, once again, that you are at a grocery store and are asked to cluster all of the fresh produce (fruits and vegetables). How would you proceed? Before you began, you would have to decide what criteria you would use for clustering. For example, you could group the items by their color, their shape, their vitamin C content, their price, or some meaningful combination of these factors. As with classification, the clustering criteria largely depend on how the items are represented. Input instances are assumed to be a feature vector that represents some object, such as a document (or a fruit). If you are interested in clustering according to some property, it is important to make sure that property is represented in the feature vector.

After the clustering criteria have been determined, you would have to determine how you would assign items to clusters. Suppose that you decided to cluster the produce according to color and you have created a red cluster (red grapes, red apples) and a yellow cluster (bananas, butternut squash). What do you do if you come across an orange? Do you create a new orange cluster, or do you assign it to the red or yellow cluster? These are important questions that clustering algorithms must address as well. They come in the form of how many clusters to use and how to assign items to clusters.

Finally, after you have assigned all of the produce to clusters, how do you quantify how well you did? That is, you must evaluate the clustering. This is often very difficult, although there have been several automatic techniques proposed.
In this example, we have described clusters as being defined by some fixed set of properties, such as the "red" cluster. This is, in fact, a very specific form of cluster, called monothetic. We discussed monothetic classes or clusters in Chapter 6, and mentioned that most clustering algorithms instead produce polythetic clusters, where members of a cluster share many properties, but there is no single defining property. In other words, membership in a cluster is typically based on the similarity of the feature vectors that represent the objects. This means that a crucial part of defining the clustering algorithm is specifying the similarity measure that is used. The classification and clustering literature often refers to a distance measure, rather than a similarity measure, and we use that terminology in the following discussion. Any similarity measure S, which typically has a value from 0 to 1, can be converted into a distance measure by using 1 − S. Many similarity and distance measures have been studied by information retrieval and machine learning researchers, from very simple measures such as Dice's coefficient (mentioned in Chapter 6) to more complex probabilistic measures. The reader should keep these factors in mind while reading this section, as they will be recurring themes throughout.

The remainder of this section describes three clustering algorithms based on different approaches, discusses evaluation issues, and briefly describes clustering applications.

9.2.1 Hierarchical and K-Means Clustering

We will now describe two different clustering algorithms that start with some initial clustering of the data and then iteratively try to improve the clustering by optimizing some objective function. The main difference between the algorithms is the objective function. As we will show, different objective functions lead to different types of clusters. Therefore, there is no one "best" clustering algorithm. The choice of algorithm largely depends on properties of the data set and task. Throughout the remainder of this section, we assume that our goal is to cluster some set of N instances (which could be web pages, for example), represented as feature vectors, into K clusters, where K is a constant that is fixed a priori.

Hierarchical clustering

Hierarchical clustering is a clustering methodology that builds clusters in a hierarchical fashion. This methodology gives rise to a number of different clustering algorithms. These algorithms are often "clustered" into two groups, depending on how the algorithm proceeds. Divisive clustering algorithms begin with a single cluster that consists of all of the instances. During each iteration the algorithm chooses an existing cluster and divides it into two (or possibly more) clusters. This process is repeated until there are a total of K clusters. The output of the algorithm largely depends on how clusters are chosen and split. Divisive clustering is a top-down approach. The other general type of hierarchical clustering algorithm is called agglomerative clustering, which is a bottom-up approach. Figures 9.11 and 9.12 illustrate the difference between the two types of algorithms.

An agglomerative algorithm starts with each input as a separate cluster. That is, it begins with N clusters, each of which contains a single input. The algorithm then proceeds by joining two (or possibly more) existing clusters to form a new cluster.
Therefore, the number of clusters decreases after each iteration. The algorithm terminates when there are K clusters. As with divisive clustering, the output of the algorithm is largely dependent on how clusters are chosen and joined.

The hierarchy generated by an agglomerative or divisive clustering algorithm can be conveniently visualized using a dendrogram (from the Greek word dendron, meaning "tree"). A dendrogram graphically represents how a hierarchical clustering algorithm progresses. Figure 9.13 shows the dendrogram that corresponds to generating the entire agglomerative clustering hierarchy for the points in Figure 9.12. In the dendrogram, points D and E are first combined to form a new cluster called H. Then, B and C are combined to form cluster I. This process is continued until a single cluster M is created, which consists of A, B, C, D, E, and F. In a dendrogram, the height at which instances combine is significant and represents the similarity (or distance) value at which the combination occurs. For example, the dendrogram shows that D and E are the most similar pair.

Fig. 9.11. Example of divisive clustering with K = 4. The clustering proceeds from left to right and top to bottom, resulting in four clusters.

Fig. 9.12. Example of agglomerative clustering with K = 4. The clustering proceeds from left to right and top to bottom, resulting in four clusters.

Fig. 9.13. Dendrogram that illustrates the agglomerative clustering of the points from Figure 9.12.

Algorithm 1 is a simple implementation of hierarchical agglomerative clustering (often called HAC in the literature). The algorithm takes the vectors X_1, ..., X_N, representing the N instances, and the desired number of clusters K as input. The array (vector) A is the assignment vector. It is used to keep track of which cluster each input is associated with. If A[i] = j, then it means that input X_i is in cluster j. The algorithm considers joining every pair of clusters. For each pair of clusters (C_i, C_j), a cost COST(C_i, C_j) is computed. The cost is some measure of how expensive it would be to merge clusters C_i and C_j. We will return to how the cost is computed shortly. After all pairwise costs have been computed, the pair of clusters with the lowest cost is merged. The algorithm proceeds until there are K clusters.

Algorithm 1 Agglomerative Clustering
 1: procedure AC(X_1, ..., X_N, K)
 2:   A[1], ..., A[N] ← 1, ..., N
 3:   ids ← {1, ..., N}
 4:   for c = N to K + 1 do
 5:     bestcost ← ∞
 6:     bestClusterA ← undefined
 7:     bestClusterB ← undefined
 8:     for i ∈ ids do
 9:       for j ∈ ids − {i} do
10:         c_{i,j} ← COST(C_i, C_j)
11:         if c_{i,j} < bestcost then
12:           bestcost ← c_{i,j}
13:           bestClusterA ← i
14:           bestClusterB ← j
15:         end if
16:       end for
17:     end for
18:     ids ← ids − {bestClusterA}
19:     for i = 1 to N do
20:       if A[i] is equal to bestClusterA then
21:         A[i] ← bestClusterB
22:       end if
23:     end for
24:   end for
25: end procedure

As shown by Algorithm 1, agglomerative clustering largely depends on the cost function.
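As a rough illustration only (not the book's Galago code), the merging loop of Algorithm 1 might be sketched in Python as follows. The function and variable names are our own, the cost function is left as a parameter, and clusters are represented through the assignment vector A exactly as in the pseudocode.

    def agglomerative_cluster(X, K, cost):
        # X: list of instances (e.g., lists of numbers); K: desired number of clusters.
        # cost: function taking two lists of instances and returning a merge cost.
        A = list(range(len(X)))      # A[i] is the cluster id of instance X[i]
        ids = set(A)                 # ids of the currently active clusters
        while len(ids) > K:
            best = None              # (cost, cluster_a, cluster_b)
            for i in ids:
                for j in ids:
                    if i == j:
                        continue
                    members_i = [X[n] for n in range(len(X)) if A[n] == i]
                    members_j = [X[n] for n in range(len(X)) if A[n] == j]
                    c = cost(members_i, members_j)
                    if best is None or c < best[0]:
                        best = (c, i, j)
            _, a, b = best
            ids.discard(a)           # merge cluster a into cluster b
            A = [b if x == a else x for x in A]
        return A

Each pass merges the cheapest pair of clusters, so the number of clusters drops by one per iteration until K remain.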
There are many different ways to define the cost function, each of which results in the final clusters having different characteristics. We now describe a few of the more popular choices and the intuition behind them.

Single linkage measures the cost between clusters C_i and C_j by computing the distance between every instance in cluster C_i and every one in C_j. The cost is then the minimum of these distances, which can be stated mathematically as:

COST(C_i, C_j) = min { dist(X_i, X_j) | X_i ∈ C_i, X_j ∈ C_j }

where dist is the distance between inputs X_i and X_j. It is typically computed using the Euclidean distance between X_i and X_j (for two vectors x and y, √( ∑_i (x_i − y_i)² ), where the subscript i denotes the ith component of the vector), but many other distance measures have been used. Single linkage relies only on the minimum distance between the two clusters. It does not consider how far apart the remainder of the instances in the clusters are. For this reason, single linkage could result in very "long" or spread-out clusters, depending on the structure of the two clusters being combined.

Complete linkage is similar to single linkage. It begins by computing the distance between every instance in cluster C_i and C_j. However, rather than using the minimum distance as the cost, it uses the maximum distance. That is, the cost is:

COST(C_i, C_j) = max { dist(X_i, X_j) | X_i ∈ C_i, X_j ∈ C_j }

Since the maximum distance is used as the cost, clusters tend to be more compact and less spread out than in single linkage.

Fig. 9.14. Examples of clusters in a graph formed by connecting nodes representing instances. A link represents a distance between the two instances that is less than some threshold value.

To illustrate the difference between single-link and complete-link clusters, consider the graph shown in Figure 9.14. This graph is formed by representing instances (i.e., the X_is) as nodes and connecting nodes i and j where dist(X_i, X_j) < T, where T is some threshold value. In this graph, clusters A, B, and C would all be single-link clusters. The single-link clusters are, in fact, the connected components of the graph, where every member of the cluster is connected to at least one other member. The complete-link clusters would be cluster A, the singleton cluster C, and the upper and lower pairs of instances from cluster B. The complete-link clusters are the cliques or maximal complete subgraphs of the graph, where every member of the cluster is connected to every other member.

Average linkage uses a cost that is a compromise between single linkage and complete linkage. As before, the distance between every pair of instances in C_i and C_j is computed. As the name implies, average linkage uses the average of all of the pairwise costs. Therefore, the cost is:

COST(C_i, C_j) = ( ∑_{X_i ∈ C_i, X_j ∈ C_j} dist(X_i, X_j) ) / ( |C_i| |C_j| )

where |C_i| and |C_j| are the number of instances in clusters C_i and C_j, respectively. The types of clusters formed using average linkage depend largely on the structure of the clusters, since the cost is based on the average of the distances between every pair of instances in the two clusters.
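A minimal sketch of these three pairwise costs, assuming instances are represented as equal-length lists of numbers and using Euclidean distance (the helper names here are ours, not the book's):

    import math
    from itertools import product

    def euclidean(x, y):
        return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

    def single_link_cost(ci, cj, dist=euclidean):
        # minimum distance over all cross-cluster pairs
        return min(dist(x, y) for x, y in product(ci, cj))

    def complete_link_cost(ci, cj, dist=euclidean):
        # maximum distance over all cross-cluster pairs
        return max(dist(x, y) for x, y in product(ci, cj))

    def average_link_cost(ci, cj, dist=euclidean):
        # average distance over all cross-cluster pairs
        return sum(dist(x, y) for x, y in product(ci, cj)) / (len(ci) * len(cj))

Any of these functions could be passed as the cost argument of the agglomerative sketch given earlier.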
Average group linkage is closely related to average linkage. The cost is computed according to:

COST(C_i, C_j) = dist(μ_{C_i}, μ_{C_j})

where μ_C = ( ∑_{X ∈ C} X ) / |C| is the centroid of cluster C. The centroid of a cluster is simply the average of all of the instances in the cluster. Notice that the centroid is also a vector with the same number of dimensions as the input instances. Therefore, average group linkage represents each cluster according to its centroid and measures the cost by the distance between the centroids. The clusters formed using average group linkage are similar to those formed using average linkage.

Figure 9.15 provides a visual summary of the four cost functions described up to this point. Specifically, it shows which pairs of instances (or centroids) are involved in computing the cost functions for the set of points used in Figures 9.11 and 9.12.

Fig. 9.15. Illustration of how various clustering cost functions are computed (panels: Single Linkage, Complete Linkage, Average Linkage, Average Group Linkage).

Ward's method is the final method that we describe. Unlike the previous costs, which are based on various notions of the distance between two clusters, Ward's method is based on the statistical property of variance. The variance of a set of numbers measures how spread out the numbers are. Ward's method attempts to minimize the sum of the variances of the clusters. This results in compact clusters with a minimal amount of spread around the cluster centroids. The cost, which is slightly more complicated to compute than the previous methods, is computed according to:

COST(C_i, C_j) = ∑_{k ≠ i,j} ∑_{X ∈ C_k} (X − μ_{C_k}) · (X − μ_{C_k}) + ∑_{X ∈ C_i ∪ C_j} (X − μ_{C_i ∪ C_j}) · (X − μ_{C_i ∪ C_j})

where C_i ∪ C_j is the union of the instances in clusters C_i and C_j, and μ_{C_i ∪ C_j} is the centroid of the cluster consisting of the instances in C_i ∪ C_j. This cost measures what the intracluster variance would be if clusters i and j were joined.

So, which of the five agglomerative clustering techniques is the best? Once again the answer depends on the data set and task the algorithm is being applied to. If the underlying structure of the data is known, then this knowledge may be used to make a more informed decision about the best algorithm to use. Typically, however, determining the best method to use requires experimentation and evaluation. In the information retrieval experiments that have involved hierarchical clustering, for example, average-link clustering has generally had the best effectiveness. Even though clustering is an unsupervised method, in the end there is still no such thing as a free lunch, and some form of manual evaluation will likely be required.

Efficiency is a problem with all hierarchical clustering methods. Because the computation involves the comparison of every instance to all other instances, even the most efficient implementations are O(N²) for N instances. This limits the number of instances that can be clustered in an application. The next clustering algorithm we describe, K-means, is more efficient because it produces a flat clustering, or partition, of the instances, rather than a hierarchy.
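Continuing the same sketch (and reusing, for example, the euclidean helper above as the dist argument), the centroid-based and variance-based costs could be written as follows. This is again an illustrative sketch under the same assumptions, not the book's implementation.

    def centroid(cluster):
        # component-wise mean of the instances in the cluster
        dims = len(cluster[0])
        return [sum(x[d] for x in cluster) / len(cluster) for d in range(dims)]

    def group_average_cost(ci, cj, dist):
        # average group linkage: distance between the two cluster centroids
        return dist(centroid(ci), centroid(cj))

    def wards_cost(ci, cj, all_clusters):
        # Ward's method: total intracluster variance if ci and cj were merged.
        # all_clusters is the current list of clusters, including ci and cj.
        def scatter(cluster):
            mu = centroid(cluster)
            return sum(sum((x[d] - mu[d]) ** 2 for d in range(len(mu))) for x in cluster)
        others = [c for c in all_clusters if c is not ci and c is not cj]
        return sum(scatter(c) for c in others) + scatter(ci + cj)

Note that Ward's cost needs the full set of current clusters, so it would require a small wrapper before it could be plugged into the two-argument cost parameter used in the earlier sketch.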
K-means

The K-means algorithm is fundamentally different than the class of hierarchical clustering algorithms just described. For example, with agglomerative clustering, the algorithm begins with N clusters and iteratively combines two (or more) clusters together based on how costly it is to do so. As the algorithm proceeds, the number of clusters decreases. Furthermore, the algorithm has the property that once instances X_i and X_j are in the same cluster as each other, there is no way for them to end up in different clusters as the algorithm proceeds.

With the K-means algorithm, the number of clusters never changes. The algorithm starts with K clusters and ends with K clusters. During each iteration of the K-means algorithm, each instance is either kept in the same cluster or assigned to a different cluster. This process is repeated until some stopping criterion is met.

The goal of the K-means algorithm is to find the cluster assignments, represented by the assignment vector A[1], ..., A[N], that minimize the following cost function:

COST(A[1], ..., A[N]) = ∑_{k=1}^{K} ∑_{i : A[i]=k} dist(X_i, C_k)

where dist(X_i, C_k) is the distance between instance X_i and class C_k. As with the various hierarchical clustering costs, this distance measure can be any reasonable measure, although it is typically assumed to be the following:

dist(X_i, C_k) = ||X_i − μ_{C_k}||² = (X_i − μ_{C_k}) · (X_i − μ_{C_k})

which is the Euclidean distance between X_i and μ_{C_k} squared. Here, as before, μ_{C_k} is the centroid of cluster C_k. Notice that this distance measure is very similar to the cost associated with Ward's method for agglomerative clustering. Therefore, the method attempts to find the clustering that minimizes the intracluster variance of the instances. Alternatively, the cosine similarity between X_i and μ_{C_k} can be used as the distance measure. As described in Chapter 7, the cosine similarity measures the angle between two vectors. For some text applications, the cosine similarity measure has been shown to be more effective than the Euclidean distance. This specific form of K-means is often called spherical K-means.

One of the most naïve ways to solve this optimization problem is to try every possible combination of cluster assignments. However, for large data sets this is computationally intractable, because it requires computing an exponential number of costs. Rather than finding the globally optimal solution, the K-means algorithm finds an approximate, heuristic solution that iteratively tries to minimize the cost. The solution returned by the algorithm is not guaranteed to be the global optimum. In fact, it is not even guaranteed to be locally optimal. Despite its heuristic nature, the algorithm tends to work very well in practice.

Algorithm 2 lists the pseudocode for one possible K-means implementation. The algorithm begins by initializing the assignment of instances to clusters. This can be done either randomly or by using some knowledge of the data to make a more informed decision. An iteration of the algorithm then proceeds as follows. Each instance is assigned to the cluster that it is closest to, in terms of the distance measure dist(X_i, C_k). The variable change keeps track of whether any of the instances changed clusters during the current iteration. If some have changed, then the algorithm proceeds.
If none have changed, then the algorithm ends. Another reasonable stopping criterion is to run the algorithm for some fixed number of iterations. In practice, K-means clustering tends to converge very quickly to a solution. Even though it is not guaranteed to find the optimal solution, the solutions returned are often optimal or close to optimal. When compared to hierarchical clustering, K-means is more efficient. Specifically, since KN distance computations are done in every iteration and the number of iterations is small, implementations of K-means are O(KN) rather than the O(N²) complexity of hierarchical methods. Although the clusters produced by K-means depend on the starting points chosen (the initial clusters) and the ordering of the input data, K-means generally produces clusters of similar quality to hierarchical methods. Therefore, K-means is a good choice for an all-purpose clustering algorithm for a wide range of search engine–related tasks, especially for large data sets.

Algorithm 2 K-Means Clustering
 1: procedure KMC(X_1, ..., X_N, K)
 2:   A[1], ..., A[N] ← initial cluster assignment
 3:   repeat
 4:     change ← false
 5:     for i = 1 to N do
 6:       k* ← arg min_k dist(X_i, C_k)
 7:       if A[i] is not equal to k* then
 8:         A[i] ← k*
 9:         change ← true
10:       end if
11:     end for
12:   until change is equal to false
13:   return A[1], ..., A[N]
14: end procedure

9.2.2 K Nearest Neighbor Clustering

Even though hierarchical and K-means clustering are different from an algorithmic point of view, one thing that they have in common is the fact that both algorithms place every input into exactly one cluster, which means that clusters do not overlap. (Note that this is true for hierarchical clusters at a given level of the dendrogram, i.e., at a given similarity or distance value. Clusters from different levels of the dendrogram do overlap.) Therefore, these algorithms partition the input instances into K partitions (clusters). However, for certain tasks, it may be useful to allow clusters to overlap. One very simple way of producing overlapping clusters is called K nearest neighbor clustering. It is important to note that the K here is very different from the K in K-means clustering, as will soon become very apparent.

Fig. 9.16. Example of overlapping clustering using nearest neighbor clustering with K = 5. The overlapping clusters for the black points (A, B, C, and D) are shown. The five nearest neighbors for each black point are shaded gray and labeled accordingly.

In K nearest neighbor clustering, a cluster is formed around every input instance. For input instance x, the K points that are nearest to x according to some distance metric and x itself form a cluster. Figure 9.16 shows several examples of nearest neighbor clusters formed for the points A, B, C, and D with K = 5. Although the figure only shows clusters around four input instances, in reality there would be one cluster per input instance, resulting in N clusters. As Figure 9.16 illustrates, the algorithm often fails to find meaningful clusters. In sparse areas of the input space, such as around D, the points assigned to cluster D are rather far away and probably should not be placed in the same cluster as D.
However, in denser areas of the input space, such as around B, the clusters are better defined, even though some related inputs may be missed because K is not large enough. Applications that use K nearest neighbor clustering tend to emphasize finding a small number of closely related instances in the K nearest neighbors (i.e., precision) over finding all the closely related instances (recall).

K nearest neighbor clustering can be rather expensive, since it requires computing distances between every pair of input instances. If we assume that computing the distance between two input instances takes constant time with respect to N and K, then this computation takes O(N²) time. After all of the distances have been computed, it takes at most O(N²) time to find the K nearest neighbors for each point. Therefore, the total time complexity for K nearest neighbor clustering is O(N²), which is the same as hierarchical clustering.

For certain applications, K nearest neighbor clustering is the best choice of clustering algorithm. The method is especially useful for tasks with very dense input spaces where it is useful or important to find a number of related items for every input. Examples of these tasks include language model smoothing, document score smoothing, and pseudo-relevance feedback. We describe how clustering can be applied to smoothing shortly.

9.2.3 Evaluation

Evaluating the output of a clustering algorithm can be challenging. Since clustering is an unsupervised learning algorithm, there is often little or no labeled data to use for the purpose of evaluation. When there is no labeled training data, it is sometimes possible to use an objective function, such as the objective function being minimized by the clustering algorithm, in order to evaluate the quality of the clusters produced. This is a chicken and egg problem, however, since the evaluation metric is defined by the algorithm and vice versa.

If some labeled data exists, then it is possible to use slightly modified versions of standard information retrieval metrics, such as precision and recall, to evaluate the quality of the clustering. Clustering algorithms assign each input instance to a cluster identifier. The cluster identifiers are arbitrary and have no explicit meaning. For example, if we were to cluster a set of emails into two clusters, some of the emails would be assigned to cluster identifier 1, while the rest would be assigned to cluster 2. Not only do the cluster identifiers have no meaning, but the clusters may not have a meaningful interpretation. For example, one would hope that one of the clusters would correspond to "spam" emails and the other to "non-spam" emails, but this will not necessarily be the case. Therefore, care must be taken when defining measures of precision and recall.

One common procedure of measuring precision is as follows. First, the algorithm clusters the input instances into K = |C| clusters. Then, for each cluster C_i, we define MaxClass(C_i) to be the (human-assigned) class label associated with the most instances in C_i. Since MaxClass(C_i) is associated with more of the instances in C_i than any other class label, it is assumed that it is the true label for cluster C_i. Therefore, the precision for cluster C_i is the fraction of instances in the cluster with label MaxClass(C_i).
This measure is often microaveraged across the instances, which results in the following measure of precision:

ClusterPrecision = ( ∑_{i=1}^{K} |MaxClass(C_i)| ) / N

where |MaxClass(C_i)| is the total number of instances in cluster C_i with the label MaxClass(C_i). This measure has the intuitive property that if each cluster corresponds to exactly one class label and every member of a cluster has the same label, then the measure is 1.

In many search applications, clustering is only one of the technologies that are being used. Typically, the output of a clustering algorithm is used as part of some complex end-to-end system. In these cases, it is important to analyze and evaluate how the clustering algorithm affects the entire end-to-end system. For example, if clustering is used as a subcomponent of a web search engine in order to improve ranking, then the clustering algorithm can be evaluated and tuned by measuring the impact on the effectiveness of the ranking. This can be difficult, as end-to-end systems are often complex and challenging to understand, and many factors will impact the ranking.

9.2.4 How to Choose K

Thus far, we have largely ignored how to choose K. In hierarchical and K-means clustering, K represents the number of clusters. In K nearest neighbors smoothing, K represents the number of nearest neighbors used. Although these two things are fundamentally different, it turns out that both are equally challenging to set properly in a fully automated way. The problem of choosing K is one of the most challenging issues involved with clustering, since there is really no good solution. No magical formula exists that will predict the optimal number of clusters to use in every possible situation. Instead, the best choice of K largely depends on the task and data set being considered. Therefore, K is most often chosen experimentally.

In some cases, the application will dictate the number of clusters to use. This, however, is rare. Most of the time, the application offers no clues as to the best choice of K. In fact, even the range of values for K to try might not be obvious. Should 2 clusters be used? 10? 100? 1,000? There is no better way of getting an understanding of the best setting for K than running experiments that evaluate the quality of the resulting clusters for various values of K.

With hierarchical clustering, it is possible to create the entire hierarchy of clusters and then use some decision mechanism to decide what level of the hierarchy to use for the clustering. In most situations, however, the number of clusters has to be manually chosen, even with hierarchical clustering.

Fig. 9.17. Example of overlapping clustering using Parzen windows. The clusters for the black points (A, B, C, and D) are shown. The shaded circles indicate the windows used to determine cluster membership. The neighbors for each black point are shaded gray and labeled accordingly.

When forming nearest neighbor clusters, it is possible to use an adaptive value for K. That is, for instances in very dense regions, it may be useful to choose a large K, since the neighbors are likely to be related. Similarly, in very sparse areas, it may be best to choose only a very small number of nearest neighbors, since moving too far away may result in unrelated neighbors being included.
This idea is closely related to Parzen windows (named after Emanuel Parzen, an American statistician), which are a variant of K nearest neighbors used for classification. With Parzen windows, the number of nearest neighbors is not fixed. Instead, all of the neighbors within a fixed distance ("window") of an instance are considered its nearest neighbors. In this way, instances in dense areas will have many nearest neighbors, and those in sparse areas will have few. Figure 9.17 shows the same set of points from Figure 9.16, but clustered using a Parzen window approach. We see that fewer outliers get assigned to incorrect clusters for points in sparse areas of the space (e.g., point C), whereas points in denser regions have more neighbors assigned to them (e.g., B). However, the clusters formed are not perfect. The quality of the clusters now depends on the size of the window. Therefore, although this technique eliminates the need to choose K, it introduces the need to choose the window size, which can be an equally challenging problem.

9.2.5 Clustering and Search

A number of issues with clustering algorithms have resulted in them being less widely used in practice than classification algorithms. These issues include the computational costs, as well as the difficulty of interpreting and evaluating the clusters. Clustering has been used in a number of search engines for organizing the results, as we discussed in section 6.3.3. There are very few results for a search compared to the size of the document collection, so the efficiency of clustering is less of a problem. Clustering is also able to discover structure in the result set for arbitrary queries that would not be possible with a classification algorithm.

Topic modeling, which we discussed in section 7.6.2, can also be viewed as an application of clustering with the goal of improving the ranking effectiveness of the search engine. In fact, most of the information retrieval research involving clustering has focused on this goal. The basis for this research is the well-known cluster hypothesis. As originally stated by van Rijsbergen (1979), the cluster hypothesis is:

Closely associated documents tend to be relevant to the same requests.

Note that this hypothesis doesn't actually mention clusters. However, "closely associated" or similar documents will generally be in the same cluster. So the hypothesis is usually interpreted as saying that documents in the same cluster tend to be relevant to the same queries.

Two different tests have been used to verify whether the cluster hypothesis holds for a given collection of documents. The first compares the distribution of similarity scores for pairs of relevant documents (for a set of queries) to the distribution for pairs consisting of a non-relevant and a relevant document. If the cluster hypothesis holds, we might expect to see a separation between these two distributions. On some smaller corpora, such as the CACM corpus mentioned in Chapter 8, this is indeed the case. If there were a number of clusters of relevant documents, however, which were not similar to each other, then this test may fail to show any separation.
To address this potential problem, Voorhees (1985) proposed a test based on the assumption that if the cluster hypothesis holds, relevant documents would have high local precision, even if they were scattered in many clusters. Local precision simply measures the number of relevant documents found in the top five nearest neighbors for each relevant document.

Fig. 9.18. Cluster hypothesis tests on two TREC collections (trec12 and robust). The top two plots compare the distributions of similarity values between relevant-relevant and relevant-nonrelevant pairs (light gray) of documents. The bottom two plots show histograms (frequency versus local precision) of the local precision of the relevant documents.

Figure 9.18 shows the results of these tests used on two TREC collections. These collections have similar types of documents, including large numbers of news stories. The 250 queries for the robust collection are known to be harder in terms of the typical MAP values obtained than the 150 queries used for trec12. The tests on the top row of the figure show that for both collections there is poor separation between the distributions of similarity values. The tests on the lower row, however, show that relevant documents in the trec12 collection have high local precision. The local precision is lower in the robust collection, which means that relevant documents tend to be more isolated and, consequently, harder to retrieve.

Given that the cluster hypothesis holds, at least for some collections and queries, the next question is how to exploit this in a retrieval model. There are, in fact, a number of ways of doing this. The first approach, known as cluster-based retrieval, ranks clusters instead of individual documents in response to a query. If there were K clusters C_1 ... C_K, for example, we could rank clusters using the query likelihood retrieval model. This means that we rank clusters by P(Q|C_j), where Q is the query, and:

P(Q|C_j) = ∏_{i=1}^{n} P(q_i|C_j)

The probabilities P(q_i|C_j) are estimated using a smoothed unigram language model based on the frequencies of words in the cluster, as described for documents in Chapter 7. After the clusters have been ranked, documents within each cluster could be individually ranked for presentation in a result list. The intuition behind this ranking method is that a cluster language model should provide better estimates of the important word probabilities than document-based estimates. In fact, a relevant document with no terms in common with the query could potentially be retrieved if it were a member of a highly ranked cluster with other relevant documents.

Rather than using this two-stage process, the cluster language model can be directly incorporated into the estimation of the document language model as follows:

P(w|D) = (1 − λ − δ) f_{w,D}/|D| + δ f_{w,C_j}/|C_j| + λ f_{w,Coll}/|Coll|

where λ and δ are parameters, f_{w,D} is the word frequency in the document D, f_{w,C_j} is the word frequency in the cluster C_j that contains D, and f_{w,Coll} is the
word frequency in the collection Coll. The second term, which comes from the cluster language model, increases the probability estimates for words that occur frequently in the cluster and are likely to be related to the topic of the document. In other words, the cluster language model makes the document more similar to other members of the cluster. This document language model with cluster-based smoothing can be used directly by the query likelihood retrieval model to rank documents as described in section 7.3.1.

The document language model can be further generalized to the case where the document D is a member of multiple overlapping clusters, as follows:

P(w|D) = (1 − λ − δ) f_{w,D}/|D| + δ ∑_{C_j} P(D|C_j) f_{w,C_j}/|C_j| + λ f_{w,Coll}/|Coll|

In this case, the second term in the document language model probability estimate for a word w is the weighted sum of the probabilities from the cluster language models for all clusters. The weight P(D|C_j) is the probability of the document being a member of cluster C_j. We can also make the simplifying assumption that P(D|C_j) is uniform for those clusters that contain D, and zero otherwise.

Retrieval experiments have shown that retrieving clusters can yield small but variable improvements in effectiveness. Smoothing the document language model with cluster-based estimates, on the other hand, provides significant and consistent benefits. In practice, however, the expense of generating clusters has meant that cluster-based techniques have not been deployed as part of the ranking algorithm in operational search engines. However, promising results have recently been obtained using query-specific clustering, where clusters are constructed only from the top-ranked (e.g., 50) documents (Liu & Croft, 2008; Kurland, 2008). These clusters, which can be used for either cluster-based retrieval or document smoothing, can obviously be generated much more efficiently.

References and Further Reading

Classification and clustering have been thoroughly investigated in the research areas of statistics, pattern recognition, and machine learning. The books by Duda et al. (2000) and Hastie et al. (2001) describe a wide range of classification and clustering techniques and provide more details about the mathematical foundations the techniques are built upon. These books also provide good overviews of other useful machine learning techniques that can be applied to various search engine tasks. For a more detailed treatment of Naïve Bayes classification for text classification, see McCallum and Nigam (1998). C. J. C. Burges (1998) gives a very detailed tutorial on SVMs that covers all of the basic concepts and theory. However, the subject matter is not light and requires a certain level of mathematical sophistication to fully understand. In addition, Joachims (2002a) is an entire book describing various uses of SVMs for text classification.

Van Rijsbergen (1979) provides a review of earlier research on clustering in information retrieval, and describes the cluster hypothesis and cluster-based retrieval. Diaz (2005) proposed an alternative interpretation of the cluster hypothesis by assuming that closely related documents should have similar scores, given the same query. Using this assumption, Diaz developed a framework for smoothing retrieval scores using properties of K nearest neighbor clusters.
Language modeling smoothing using K-means clustering was examined in Liu and Croft (2004). Another language modeling smoothing technique based on overlapping K nearest neighbor clusters was proposed in Kurland and Lee (2004).

There are various useful software packages available for text classification. The Mallet software toolkit (http://mallet.cs.umass.edu/) provides implementations of various machine learning algorithms, including Naïve Bayes, maximum entropy, boosting, Winnow, and conditional random fields. It also provides support for parsing and tokenizing text into features. Another popular software package is SVMLight (http://svmlight.joachims.org/), which is an SVM implementation that supports all of the kernels described in this chapter. Clustering methods are included in a number of packages available on the Web.

Exercises

9.1. Provide an example of how people use clustering in their everyday lives. What are the features that they use to represent the objects? What is the similarity measure? How do they evaluate the outcome?

9.2. Assume we want to do classification using a very fine-grained ontology, such as one describing all the families of human languages. Suppose that, before training, we decide to collapse all of the labels corresponding to Asian languages into a single "Asian languages" label. Discuss the negative consequences of this decision.

9.3. Suppose that we were to estimate P(d|c) according to:

P(d|c) = N_{d,c} / N_c

where N_{d,c} is the number of times document d is assigned to class c in the training set, and N_c is the number of instances assigned class label c in the training set. This is analogous to the way that P(c) is estimated. Why can this estimate not be used in practice?

9.4. For some classification data set, compute estimates for P(w|c) for all words w using both the multiple-Bernoulli and multinomial models. Compare the multiple-Bernoulli estimates with the multinomial estimates. How do they differ? Do the estimates diverge more for certain types of terms?

9.5. Explain why the solution to the original SVM formulation w = arg max_w 2/||w|| is equivalent to the alternative formulation w = arg min_w (1/2)||w||².

9.6. Compare the accuracy of a one versus all SVM classifier and a one versus one SVM classifier on a multiclass classification data set. Discuss any differences observed in terms of the efficiency and effectiveness of the two approaches.

9.7. Under what conditions will the microaverage equal the macroaverage?

9.8. Cluster the following set of two-dimensional instances into three clusters using each of the five agglomerative clustering methods:

(–4, –2), (–3, –2), (–2, –2), (–1, –2), (1, –1), (1, 1), (2, 3), (3, 2), (3, 4), (4, 3)

Discuss the differences in the clusters across methods. Which methods produce the same clusters? How do these clusters compare to how you would manually cluster the points?

9.9. Use K-means and spherical K-means to cluster the data points in Exercise 9.8. How do the clusterings differ?

9.10. Nearest neighbor clusters are not symmetric, in the sense that if instance A is one of instance B's nearest neighbors, the reverse is not necessarily true. Explain how this can happen with a diagram.

9.11.
The K nearest neighbors of a document could be represented by links to those documents. Describe two ways this representation could be used in a search application.

9.12. Can the ClusterPrecision evaluation metric ever be equal to zero? If so, provide an example. If not, explain why.

9.13. Test the cluster hypothesis on the CACM collection using both methods shown in Figure 9.18. What do you conclude from these tests?

10 Social Search

"You will be assimilated."
Borg Collective, Star Trek: First Contact

10.1 What Is Social Search?

In this chapter we will describe social search, which is rapidly emerging as a key search paradigm on the Web. As its name implies, social search deals with search within a social environment. This can be defined as an environment where a community of users actively participate in the search process. We interpret this definition of social search very broadly to include any application involving activities such as defining individual user profiles and interests, interacting with other users, and modifying the representations of the objects being searched. The active role of users in social search applications is in stark contrast to the standard search paradigms and models, which typically treat every user the same way and restrict interactions to query formulation.

Users may interact with each other online in a variety of ways. For example, users may visit social media sites, which have recently gained a great deal of popularity. (Social media sites are often collectively referred to as Web 2.0, as opposed to the classical notion of the Web, "Web 1.0", which consists of non-interactive HTML documents.) Examples of these sites include Digg (websites), Twitter (status messages), Flickr (pictures), YouTube (videos), Del.icio.us (bookmarks), and CiteULike (research papers). Social networking sites, such as MySpace, Facebook, and LinkedIn, allow friends, colleagues, and people with similar interests to interact with each other in various ways. More traditional examples of online social interactions include email, instant messenger, massively multiplayer online games (MMOGs), forums, and blogs.
An underwater video, for example, may have the tags “swimming”, “underwater”, “tropic”, and “fish”. Some sites allow multi-term tags, such as “tropical fish”, but others allow only single-term tags. As we will describe, user tags are a form of man- ual indexing , where the content of an object is represented by manually assigned terms. There are many interesting search tasks related to user tags, such as search- ing for items using tags, automatically suggesting tags, and visualizing clusters of tags. The second topic covered here is searching within communities , which de- scribes online communities and how users search within such environments. On- line communities are virtual groups of users that share common interests and in- teract socially in various ways in an online environment. For example, a sports fan who enjoys the outdoors and photography may be a member of baseball, hiking, and digital camera communities. Interactions in these communities range from passive activities (reading web pages) to those that are more active (writing in blogs and forums). These communities are virtual and ad hoc , meaning that there is typically no formal mechanism for joining one, and consequently people are im- plicitly rather than explicitly members of a community. Therefore, being able to automatically determine which communities exist in an online environment, and which users are members of each, can be valuable for a number of search-related tasks. One such task that we will describe is community-based question answer- ing, whereby a user posts a question to an online system and members of his own community, or the community most related to his question, provide answers to</p> <p><span class="badge badge-info text-white mr-2">423</span> 10.1 What Is Social Search? 399 the question. Such a search task is much more social, interactive, and focused than standard web search. filtering and recommender systems . It may be The next topics we describe are considered somewhat unusual to include these in a chapter on social search, be- cause they are not typical “Web 2.0” applications. Both types of systems, however, profiles rely on representations of individual users called , and for that reason fit into our broad definition. Both systems also combine elements of document re- trieval and classification. In standard search tasks, systems return documents in response to many different queries. These queries typically correspond to short- term information needs. In filtering, there is a fixed query (the profile) that rep- resents some long-term information need. The search system monitors incoming documents and retrieves only those documents that are relevant to the informa- tion need. Many online news websites provide document filtering functionality. For example, CNN provides an alerts service, which allows users to specify various topics of interest, such as “tropical storms”, or more general topics, such as sports or politics. When a new story matches a user’s profile, the system alerts the user, typically via email. In this way, the user does not need to continually search for articles of interest. Instead, the search system is tasked with finding relevant docu- ments that match the user’s long-term information needs. Recommender systems are similar to document filtering systems, except the goal is to predict how a user would rate some item, rather than retrieving relevant documents. 
For example, Amazon.com employs a recommender system that attempts to predict how much a user would like certain items, such as books, movies, or music. Recommender systems are social search algorithms because predictions are estimated based on ratings given by similar users, thereby implicitly linking people to a community of users with related interests. The final two topics covered in this chapter, peer-to-peer (P2P) search and metasearch , deal with architectures for social search. Peer-to-peer search is the task of querying a community of “nodes” for a given information need. Nodes can be individuals, organizations, or search engines. When a user issues a query, it is passed through the P2P network and run on one or more nodes, and then results are returned. This type of search can be fully distributed across a large network of nodes. Metasearch is a special case of P2P search where all of the nodes are search engines. Metasearch engines run queries against a number of search engines, col- lect the results, and then merge the results. The goal of metasearch engines is to provide better coverage and accuracy than a single search engine.</p> <p><span class="badge badge-info text-white mr-2">424</span> 400 10 Social Search personalization Finally, we note that is another area that could be regarded as part of social search, because it covers a range of techniques for improving search by representing individual user preferences and interests. Since most of these tech- niques provide context for the query, however, they were discussed as part of query refinement in section 6.2.5. 10.2 User Tags and Manual Indexing Before electronic search systems became available at libraries, patrons had to rely on card catalogs for finding books. As their name implies, card catalogs are large collections (catalogs) of cards. Each card contains information about a particular author, title, or subject. A person interested in a specific author, title, or subject would go to the appropriate catalog and attempt to find cards describing relevant books. The card catalogs, therefore, act as indexes to the information in a library. Card catalogs existed long before computers did, which means that these cards were constructed manually. Given a book, a person had to extract the author, ti- tle, and subject headings of the book so that the various catalogs could be built. This process is known as manual indexing . Given that it is impractical to manu- ally index the huge collections of digital media available today, search engines use automatic indexing techniques to assign identifiers (terms, phrases, features) to documents during index construction. Since this process is automatic, the quality and accuracy of the indexing can be much lower than that of manual indexing. The exhaustive , in the sense advantages of automatic indexing, however, are that it is that every word in the document is indexed and nothing is left out, and , consistent whereas people can make mistakes indexing or have certain biases in how they in- dex. Search evaluations that have compared manual to automatic indexing have found that automatic indexing is at least as effective and often much better than manual indexing. These studies have also shown, however, that the two forms of indexing complement each other, and that the most effective searches use both. 
As a compromise between manually indexing every item (library catalogs) and automatically indexing every item (search engines), social media sites pro- vide users with the opportunity to manually tag items. Each tag is typically a sin- gle word that describes the item. For example, an image of a tiger may be assigned the tags “tiger”, “zoo”, “big”, and “cat”. By allowing users to assign tags, some items end up with tags, and others do not. Of course, to make every item searchable, it is likely that every item is automatically indexed. Therefore, some items will contain</p> <p><span class="badge badge-info text-white mr-2">425</span> 10.2 User Tags and Manual Indexing 401 both automatic and manual identifiers. As we will show later in this section, this results in unique challenges for retrieval models and ranking functions. manual tagging . Social media tagging, like card catalog generation, is called This clash in naming is rather unfortunate, because the two types of indexing are actually quite different. Card catalogs are manually generated by experts who choose keywords, categories, and other descriptors from a controlled vocabulary (fixed ontology). This ensures that the descriptors are more or less standardized. On the other hand, social media tagging is done by end users who may or may not be experts. There is little-to-no quality control done on the user tags. Fur- thermore, there is no fixed vocabulary from which users choose tags. Instead, user tags form their own descriptions of the important concepts and relationships in a domain. User-generated ontologies (or taxonomies) are known as folksonomies . Therefore, a folksonomy can be interpreted as a dynamic, community-influenced ontology. There has been a great deal of research and interest invested in developing a semantically tagged version of the Web, which is often called the semantic web . The goal of the semantic web is to semantically tag web content in such a way that it becomes possible to find, organize, and share information more easily. For example, a news article could be tagged with metadata, such as the title, subject, description, publisher, date, and language. However, in order for the semantic web to materialize and yield significant improvements in relevance, a standard- ized, fixed ontology of metadata tags must be developed and used consistently across a large number of web pages. Given the growing popularity of social media sites that are based on flexible, user-driven folksonomies, compared to the small number of semantic web sites that are based on rigid, predefined ontologies, it seems as though users, in general, are more open to tagging data with a relatively unrestricted set of tags that are meaningful to them and that reflect the specific context of the application. Given that users are typically allowed to tag items in any way that they wish, there are many different types of tags. For example, Golder and Huberman (2006) described seven different categories of tags. Z. Xu et al. (2006) proposed a sim- plified set of five tag categories, which consists of the following: 1. Content-based tags. Tags describing the content of an item. Examples: “car”, “woman”, and “sky”. 2. Context-based tags. Tags that describe the context of an item. Examples: “New York City” or “Empire State Building”.</p> <p><span class="badge badge-info text-white mr-2">426</span> 402 10 Social Search 3. Attribute tags. Tags that describe implicit attributes of the item. 
1. Content-based tags. Tags describing the content of an item. Examples: "car", "woman", and "sky".
2. Context-based tags. Tags that describe the context of an item. Examples: "New York City" or "Empire State Building".
3. Attribute tags. Tags that describe implicit attributes of the item. Examples: "Nikon" (type of camera), "black and white" (type of movie), or "homepage" (type of web page).
4. Subjective tags. Tags that subjectively describe an item. Examples: "pretty", "amazing", and "awesome".
5. Organizational tags. Tags that help organize items. Examples: "todo", "my pictures", and "readme".

As we see, tags can be applied to many different types of items, ranging from web pages to videos, and used for many different purposes beyond just tagging the content. Therefore, tags and online collaborative tagging environments can be very useful tools for users in terms of searching, organizing, sharing, and discovering new information. It is likely that tags are here to stay and will become even more widely used in the future. Therefore, it is important to understand the various issues surrounding them and how they are used within search engines today. In the remainder of this section, we will describe how tags can be used for search, how new tags for an item can be inferred from existing tags, and how sets of tags can be visualized and presented to the user.

10.2.1 Searching Tags

Since this is a book about search engines, the first tag-related task that we discuss is searching a collection of collaboratively tagged items. One unique property of tags is that they are almost exclusively textual keywords that are used to describe textual or non-textual items. Therefore, tags can provide a textual dimension to items that do not explicitly have a simple textual representation, such as images or videos. These textual representations of non-textual items can be very useful for searching. We can apply many of the retrieval strategies described in Chapter 7 to the problem. Despite the fact that searching within tagged collections can be mapped to a text search problem, tags present certain challenges that are not present when dealing with standard document or web retrieval.

The first challenge, and by far the most pervasive, is the fact that tags are very sparse representations of very complex items. Perhaps the simplest way to search a set of tagged items is to use a Boolean retrieval model. For example, given the query "fish bowl", one could run the query "fish AND bowl", which would only return items that are tagged with both "fish" and "bowl", or "fish OR bowl", which would return items that are tagged with either "fish" or "bowl". Conjunctive (AND) queries are likely to produce high-quality results, but may miss many relevant items. Thus, the approach would have high precision but low recall. At the opposite end of the spectrum, the disjunctive (OR) queries will match many more relevant items, but at the cost of precision.

Fig. 10.1. Search results used to enrich a tag representation. In this example, the tag being expanded is "tropical fish".
The query "tropical fish" is run against a search engine, and the snippets returned are then used to generate a distribution over related terms.

Of course, it is highly desirable to achieve both high precision and high recall. However, doing so is very challenging. Consider the query "aquariums" and a picture of a fish bowl that is tagged with "tropical fish" and "goldfish". Most retrieval models, including Boolean retrieval, will not be able to find this item, because there is no overlap between the query terms and the tag terms. This problem, which was described in Chapter 6 in the context of advertising, is known as the vocabulary mismatch problem. There are various ways to overcome this problem, including simple things such as stemming. Other approaches attempt to enrich the sparse tag (or query) representation by performing a form of pseudo-relevance feedback. Figure 10.1 illustrates how web search results may be used to enrich a tag representation. In the example, the tag "tropical fish" is run as a query against a search engine. The snippets returned are then processed using any of the standard pseudo-relevance feedback techniques described in section 7.3.2, such as relevance modeling, which forms a distribution over related terms. In this example, the terms "fish", "tropical", "aquariums", "goldfish", and "bowls" are the terms with the highest probability according to the model. The query can also, optionally, be expanded in the same way. Search can then be done using the enriched query and/or tag representations in order to maintain high levels of precision as well as recall.

The second challenge is that tags are inherently noisy. As we have shown, tags can provide useful information about items and help improve the quality of search. However, like anything that users create, the tags can also be off topic, inappropriate, misspelled, or spam. Therefore it is important to provide proper incentives to users to enter many high-quality tags. For example, it may be possible to allow users to report inappropriate or spam tags, thereby reducing the incentive to produce junk tags. Furthermore, users may be given upgraded or privileged access if they enter some number of (non-spam) tags over a given time period. This incentive promotes more tagging, which can help improve tag quality and coverage.

The final challenge is that many items in a given collection may not be tagged, which makes them virtually invisible to any text-based search system. For such items, it would be valuable to automatically infer the missing tags and use the tags for improving search recall. We devote the next section to diving deeper into the details of this problem.

10.2.2 Inferring Missing Tags

As we just described, items that have no tags pose a challenge to a search system. Although precision is obviously a very important metric for many tag-related search tasks, recall may also be important in some cases. In such cases, it is important to automatically infer a set of tags for items that have no manual tags assigned to them. Let us first consider the case when the items in our collection are textual, such as books, news articles, research papers, or web pages. In these cases, it is possible to infer tags based solely on the textual representation of the item.
One simple approach would involve computing some weight for every term that occurs in the text and then choosing the K terms with the highest weight as the inferred tags. There are various measures of term importance, including a tf.idf-based weight, such as:

wt(w) = \log(f_{w,D} + 1) \log\left(\frac{N}{df_w}\right)

where f_{w,D} is the number of times term w occurs in item D, N is the total number of items, and df_w is the number of items that term w occurs in. Other term importance measures may take advantage of document structure. For example, terms that occur in the title of an item may be given more weight.

It is also possible to treat the problem of inferring tags as a classification problem, as was recently proposed by Heymann, Ramage, and Garcia-Molina (2008). Given a fixed ontology or folksonomy of tags, the goal is to train a binary classifier for each tag. Each of these classifiers takes an item as input and predicts whether the associated tag should be applied to the item. This approach requires training one classifier for every tag, which can be a cumbersome task and requires a large amount of training data. Fortunately, however, training data for this task is virtually free since users are continuously tagging (manually labeling) items! Therefore, it is possible to use all of the existing tag/item pairs as training data to train the classifiers. Heymann et al. use an SVM classifier to predict web page tags. They compute a number of features, including tf.idf weights for terms in the page text and anchor text, as well as a number of link-based features. Results show that high precision and recall can be achieved by using the textual features alone, especially for tags that occur many times in the collection. A similar classification approach can be applied to other types of items, such as images or videos. The challenge when dealing with non-textual items is extracting useful features from the items.

The two approaches we described for inferring missing tags choose tags for items independently of the other tags that were assigned. This may result in very relevant, yet very redundant, tags being assigned to some item. For example, a picture of children may have the tags "child", "children", "kid", "kids", "boy", "boys", "girl", "girls"—all of which would be relevant, but as you see, are rather redundant. Therefore, it is important to choose a set of tags that are both relevant and non-redundant. This is known as the novelty problem.

Carbonell and Goldstein (1998) describe the Maximal Marginal Relevance (MMR) technique, which addresses the problem of selecting a diverse set of items. Rather than choosing tags independently of each other, MMR chooses tags iteratively, adding one tag to the item at a time. Given an item i and the current set of tags for the item T_i, the MMR technique chooses the next tag according to the tag t that maximizes:

MMR(t; T_i) = \lambda \, Sim_{item}(t, i) - (1 - \lambda) \max_{t_i \in T_i} Sim_{tag}(t, t_i)

where Sim_item is a function that measures the similarity between a tag t and item i (such as those described in this section), Sim_tag measures the similarity between two tags (such as the measures described in section 10.2.1), and λ is a tunable parameter that can be used to trade off between relevance (λ = 1) and novelty (λ = 0). Therefore, a tag that is very relevant to the item and not very similar to any of the other tags will have a large MMR score. Iteratively choosing tags in this way helps eliminate the production of largely redundant sets of tags, which is useful not only when presenting the inferred tags to the user, but also from the perspective of using the inferred tags for search, since a diverse set of tags should help further improve recall.
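To make the selection step concrete, the following sketch shows one way the MMR iteration could be implemented in Java. It is only an illustration of the technique, not code from any existing system: the maps simItem and simTag stand in for the Sim_item and Sim_tag functions, which are assumed to be computed elsewhere (for example, from the term weights above and from tag co-occurrence statistics), and lambda and k correspond to the trade-off parameter and the number of tags to infer.

import java.util.*;

public class MmrTagSelector {

    // simItem: candidate tag -> Sim_item(t, i) for the item being tagged;
    // simTag.get(t1).get(t2): Sim_tag(t1, t2); lambda: relevance/novelty trade-off;
    // k: number of tags to select.
    public static List<String> selectTags(Map<String, Double> simItem,
                                          Map<String, Map<String, Double>> simTag,
                                          double lambda, int k) {
        List<String> selected = new ArrayList<>();
        Set<String> remaining = new HashSet<>(simItem.keySet());
        while (selected.size() < k && !remaining.isEmpty()) {
            String best = null;
            double bestScore = Double.NEGATIVE_INFINITY;
            for (String t : remaining) {
                double relevance = simItem.get(t);
                // Redundancy is the similarity to the most similar tag chosen so far.
                double redundancy = 0.0;
                for (String s : selected) {
                    double sim = simTag.getOrDefault(t, Collections.<String, Double>emptyMap())
                                       .getOrDefault(s, 0.0);
                    redundancy = Math.max(redundancy, sim);
                }
                double score = lambda * relevance - (1 - lambda) * redundancy;
                if (score > bestScore) {
                    bestScore = score;
                    best = t;
                }
            }
            selected.add(best);      // add the tag with the highest MMR score
            remaining.remove(best);
        }
        return selected;
    }
}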
10.2.3 Browsing and Tag Clouds

As we have shown, tags can be used for searching a set of collaboratively tagged items. However, tags can also be used to help users browse, explore, and discover new items in a large collection of items. There are several different ways that tags can be used to aid browsing. For example, when a user is viewing a given item, all of the item's tags may be displayed. The user may then click on one of the tags and be shown a list of results of items that also have that tag. The user may then repeat this process, repeatedly choosing an item and then clicking on one of the tags. This allows users to browse through the collection of items by following a chain of related tags. Such browsing behavior is very focused and does not really allow the user to explore a large range of the items in the collection. For example, if a user starts on a picture of a tropical fish, it would likely take many clicks for the user to end up viewing an image of an information retrieval textbook. Of course, this may be desirable, especially if the user is only interested in things closely related to tropical fish.

One way of providing the user with a more global view of the collection is to allow the user to view the most popular tags. These may be the most popular tags for the entire site or for a particular group or category of items. Tag popularity may be measured in various ways, but is commonly computed as the number of times the tag has been applied to some item. Displaying the popular tags allows the user to begin her browsing and exploration of the collection using one of these tags as a starting point.

Fig. 10.2. Example of a tag cloud in the form of a weighted list. The tags are in alphabetical order and weighted according to some criteria, such as popularity.

Thus far we have described ways that tags may aid users with browsing. One of the most important aspects of browsing is displaying a set of tags to a user in a visually appealing and meaningful manner. For example, consider displaying the 50 most popular tags to the user. The simplest way to do so is to just display the tags in a list or table, possibly in alphabetical or sorted order according to popularity.
Besides not being very visually appealing, this display also does not allow the user to quickly observe all of the pertinent information. When visualizing tags, it is useful to show the list of tags in alphabetical order, so that users may quickly scan through the list or find the tags they are looking for. It is also beneficial to portray the popularity or importance of a given tag. There are many ways to visualize this information, but one of the most widely used techniques is called tag clouds. In a tag cloud, the display size of a given tag is proportional to its popularity or importance. Tags may be arranged in a random order within a "cloud" or alphabetically. Figure 10.2 shows an example of a tag cloud where the tags are listed alphabetically. Such a representation is also called a weighted list. Based on this tag cloud, the user can easily see that the tags "wedding", "party", and "birthday" are all very popular. Therefore, tag clouds provide a convenient, visually appealing way of representing a set of tags.

10.3 Searching with Communities

10.3.1 What Is a Community?

The collaborative tagging environments that we just described are filled with implicit social interactions. By analyzing the tags that users submit or search for, it is possible to discover groups of users with related interests. For example, ice hockey fans are likely to tag pictures of their favorite ice hockey players, tag their favorite ice hockey web pages, search for ice hockey-related tags, and so on. Tagging is just one example of how interactions in an online environment can be used to infer relationships between entities (e.g., people). Groups of entities that interact in an online environment and that share common goals, traits, or interests form an online community. This definition is not all that different from the traditional definition of community. In fact, online communities are actually very similar to traditional communities and share many of the same social dynamics. The primary difference between our definition and that of a traditional community is that an online community can be made up of users, organizations, web pages, or just about any other meaningful online entity.

Let us return to our example of users who tag and search for ice hockey-related items. It is easy to see that ice hockey fans form an online community. Members of the community do many other things other than tag and search. For example, they also post to blogs, newsgroups, and other forums. They may also send instant messages and emails to other members of the community about their ice hockey experiences. Furthermore, they may buy and sell ice hockey-related items online, through sites such as eBay. Hence, there are many ways that a user may participate in a community. It is important to note, however, that his membership in the community is implicit. Another important thing to notice is that users are very likely to have a number of hobbies or interests, and may be members of more than one online community. Therefore, in order to improve the overall user experience, it can be useful for search engines and other online sites to automatically determine the communities associated with a given user.

Some online communities consist of non-human entities.
For example, a set of web pages that are all on the same topic form an online community that is often called a web community. These pages form a community since they share similar traits (i.e., they are all about the same topic). Since web pages are created by users, web communities share many of the same characteristics as communities of users. Automatically identifying web communities can be useful for improving search.

The remainder of this section covers several aspects of online communities that are useful from a search engine perspective. We first describe several effective methods for automatically finding online communities. We then discuss community-based question answering, where people ask questions and receive answers from other members of the community. Finally, we cover collaborative searching, which is a search paradigm that involves a group of users searching together.

10.3.2 Finding Communities

The first task that we describe is how to automatically find online communities. As we mentioned before, online communities are implicitly defined by interactions among a set of entities with common traits. This definition is rather vague and makes it difficult to design general-purpose algorithms for finding every possible type of online community. Instead, several algorithms have been developed that can effectively find special types of communities that have certain assumed characteristics. We will describe several such algorithms now.

Most of the algorithms used for finding communities take as input a set of entities, such as users or web pages, information about each entity, and details about how the entities interact or are related to each other. This can be conveniently represented as a graph, where each entity is a node in the graph, and interactions (or relationships) between the entities are denoted by edges. Graphs can be either directed or undirected. The edges in directed graphs have directional arrows that indicate the source node and destination node of the edge. Edges in undirected graphs do not have directional arrows and therefore have no notion of source and destination. Directed edges are useful for representing non-symmetric or causal relationships between two entities. Undirected edges are useful for representing symmetric relationships or for simply indicating that two entities are related in some way.

Using this representation, it is easy to define two criteria for finding communities within the graph. First, the set of entities (nodes) must be similar to each other according to some similarity measure. Second, the set of entities should interact with each other more than they interact with other entities. The first requirement makes sure that the entities actually share the same traits, whereas the second ensures that the entities interact in a meaningful way with each other, thereby making them a community rather than a set of users with the same traits who never interact with each other.

The first algorithm that we describe is the HITS algorithm, which was briefly discussed in Chapter 4 in the context of PageRank. The HITS algorithm is similar to PageRank, except that it is query-dependent, whereas PageRank is usually query-independent.
You may be wondering what HITS has to do with finding communities, since it was originally proposed as a method for improving web search. Both PageRank and HITS, however, are part of a family of general, powerful algorithms known as link analysis algorithms. These algorithms can be applied to many different types of data sets that can be represented as directed graphs.

Given a graph of entities, we must first identify a subset of the entities that may possibly be members of the community. We call these entities the candidate entities. For example, if we wish to find the ice hockey online community, then we must query each node in the graph and find all of the nodes (users) that are interested in ice hockey. This can be accomplished by, for example, finding all users of a system who have searched for anything hockey-related. This ensures that the first criterion, which states that entities should be similar to each other, is satisfied. Another example is the task of finding the "fractal art" web community. Here, we could search the Web for the query "fractal art" and consider only those entities (web pages) that match the query. Again, this ensures that all of the pages are topically similar to each other. This first step finds sets of similar items, but fails to identify the sets of entities that actively participate, via various interactions, within the community, which is the second criterion that we identified as being important.

Given the candidate entities, the HITS algorithm can be used to find the "core" of the community. The HITS algorithm takes a graph G with node set V and edge set E as input. For finding communities, the vertex set V consists of the candidate entities, and the edge set E consists of all of the edges between candidate entities. For each of the candidate entities (nodes) p in the graph, HITS computes an authority score (A(p)) and a hub score (H(p)). It is assumed that good hubs are those that point to good authorities and that good authorities are those that are pointed to by good hubs. Notice the circularity in these definitions. This means that the authority score depends on the hub score, which in turn depends on the authority score. Given a set of authority and hub scores, the HITS algorithm updates the scores according to the following equations:

A(p) = \sum_{q \rightarrow p} H(q)

H(p) = \sum_{p \rightarrow q} A(q)

where p → q indicates that an edge exists between entity p (source) and entity q (destination). As the equations indicate, A(p) is the sum of the hub scores of the entities that point at p, and H(p) is the sum of the authority scores pointed at by p.
Thus, to be a strong authority, an entity must have many incoming edges, all with relatively moderate hub scores, or have very few incoming links that have very large hub scores. Similarly, to be a good hub, an entity must have many outgoing edges to less authoritative pages, or few outgoing edges to very highly authoritative pages.

An iterative version of HITS is given in Algorithm 3.

Algorithm 3: HITS(G = (V, E), K)
  A_0(p) ← 1 for all p ∈ V
  H_0(p) ← 1 for all p ∈ V
  for i = 1 to K do
    A_i(p) ← 0 for all p ∈ V
    H_i(p) ← 0 for all p ∈ V
    Z_A ← 0
    Z_H ← 0
    for p ∈ V do
      for q ∈ V do
        if (p, q) ∈ E then
          H_i(p) ← H_i(p) + A_{i-1}(q)
          Z_H ← Z_H + A_{i-1}(q)
        end if
        if (q, p) ∈ E then
          A_i(p) ← A_i(p) + H_{i-1}(q)
          Z_A ← Z_A + H_{i-1}(q)
        end if
      end for
    end for
    for p ∈ V do
      A_i(p) ← A_i(p) / Z_A
      H_i(p) ← H_i(p) / Z_H
    end for
  end for
  return A_K, H_K

The algorithm begins by initializing all hub and authority scores to 1. The algorithm then updates the hub and authority scores according to the equations we just showed. Then, the hub scores are normalized so that the sum of the scores is 1. The same is also done for the authority scores. The entire process is then repeated on the normalized scores for a fixed number of iterations, denoted by K. The algorithm is guaranteed to converge and typically does so after a small number of iterations. Figure 10.3 shows an example of the HITS algorithm applied to a graph with seven nodes and six edges. The algorithm is carried out for three iterations.

Fig. 10.3. Illustration of the HITS algorithm. Each row corresponds to a single iteration of the algorithm and each column corresponds to a specific step of the algorithm (the input scores, the updated scores, and the normalized scores).

Notice that the nodes with many incoming edges tend to have higher authority scores, and those with more outgoing edges tend to have larger hub scores. Another characteristic of the HITS algorithm is that nodes that are not connected to any other nodes will always have hub and authority scores of 0.

Once the hub and authority scores have been computed, the entities can be ranked according to their authority score. This list will contain the most authoritative entities within the community. Such entities are likely to be the "leaders" or form the "core" of the community, based on their interactions with other members of the community. For example, if this algorithm were applied to the computer science researcher citation graph to find the information retrieval research community, the most authoritative authors would be those that are cited many times by prolific authors. These would arguably be the luminaries in the field and those authors that form the core of the research community. When finding web communities on the web graph, the algorithm will return pages that are linked to by a large number of reputable web pages.
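A minimal Java sketch of the iterative computation in Algorithm 3 is shown below. The tiny graph, node labels, and choice of K are purely illustrative; in practice the graph would contain the candidate entities and the edges between them, as described above.

import java.util.*;

public class Hits {
    public static void main(String[] args) {
        // outLinks.get(p) is the set of nodes that p points to (directed edges p -> q).
        Map<String, Set<String>> outLinks = new HashMap<>();
        outLinks.put("a", new HashSet<>(Arrays.asList("c", "d")));
        outLinks.put("b", new HashSet<>(Collections.singletonList("d")));
        outLinks.put("c", new HashSet<>());
        outLinks.put("d", new HashSet<>());

        int K = 10;  // number of iterations
        Map<String, Double> authority = new HashMap<>();
        Map<String, Double> hub = new HashMap<>();
        for (String p : outLinks.keySet()) {  // initialize all scores to 1
            authority.put(p, 1.0);
            hub.put(p, 1.0);
        }

        for (int i = 0; i < K; i++) {
            Map<String, Double> newAuthority = new HashMap<>();
            Map<String, Double> newHub = new HashMap<>();
            double zA = 0.0, zH = 0.0;
            for (String p : outLinks.keySet()) {
                double a = 0.0, h = 0.0;
                for (String q : outLinks.keySet()) {
                    if (outLinks.get(p).contains(q)) {
                        h += authority.get(q);  // p -> q: p's hub score sums the authorities it points to
                    }
                    if (outLinks.get(q).contains(p)) {
                        a += hub.get(q);        // q -> p: p's authority score sums the hubs pointing to it
                    }
                }
                newAuthority.put(p, a);
                newHub.put(p, h);
                zA += a;
                zH += h;
            }
            for (String p : outLinks.keySet()) {  // normalize so each set of scores sums to 1
                newAuthority.put(p, zA > 0 ? newAuthority.get(p) / zA : 0.0);
                newHub.put(p, zH > 0 ? newHub.get(p) / zH : 0.0);
            }
            authority = newAuthority;
            hub = newHub;
        }

        System.out.println("authority scores: " + authority);
        System.out.println("hub scores: " + hub);
    }
}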
Clustering algorithms, such as the ones described in Chapter 9, may also be used for finding online communities. These algorithms easily adapt to the problem, since community finding is an inherently unsupervised learning problem. Both agglomerative clustering and K-means can be used for finding communities. Both of these clustering algorithms require a function that measures the distance between two clusters. As we discussed in Chapter 9, the Euclidean distance is often used. However, it is not immediately clear how to apply the Euclidean distance to nodes in a (directed or undirected) graph. One simple way is to represent each node (entity) p in the graph as a vector that has |V| components—one for every node in the graph. For some node q, component q of its vector representation is set to 1 if p → q, and 0 otherwise. This results in each node being represented by the nodes it points to. Figure 10.4 shows how the nodes of a graph are represented this way. Each vector may optionally be normalized.

Fig. 10.4. Example of how nodes within a directed graph can be represented as vectors. For a given node p, its vector representation has component q set to 1 if p → q.

Using this representation, it is possible to use the Euclidean distance and directly apply agglomerative clustering or K-means to the problem. A high similarity according to the Euclidean distance will occur when two entities have edges directed toward many of the same entities. For example, returning to the problem of finding the information retrieval research community, two authors would be considered very similar if they tended to cite the same set of authors. It is realistic to assume that most members of the information retrieval research community tend to cite many of the same authors, especially those that would be given a large authority score by HITS.

Evaluating the effectiveness of community-finding algorithms is typically more difficult than evaluating traditional clustering tasks, since it is unclear how to determine whether some entity should be part of a given community. In fact, it is likely that if a group of people were asked to manually identify online communities, there would be many disagreements due to the vague definition of a community. Therefore, it is impossible to say whether HITS or agglomerative clustering is better at finding communities. The best choice largely depends on the task and data set for the application.

Now that you have seen several ways of automatically finding online communities, you may be wondering how this information can be put to practical use. There are many different things that can be done after a set of communities has been identified. For example, if a user has been identified as part of the information retrieval research community, then when the user visits a web page, targeted content may be displayed to the user that caters to her specific interests. Search engines could use the community information as additional contextual information for improving the relevance of search results, by retrieving results that are also topically related to the community or communities associated with the user. Online community information can also be used in other ways, including enhanced browsing, identifying experts, website recommendation, and possibly even suggesting who may be a compatible date for you!

10.3.3 Community-Based Question Answering

In the last section, we described how to automatically find online communities.
In this section, we describe how such communities can be used effectively to help answer complex information needs that would be difficult to answer using conventional search engines. For example, consider a person who is interested in learning about potential interactions between a medicine he was just prescribed and an herbal tea he often drinks. He could use a search engine and spend hours entering various queries, looking at search results, and trying to find useful information on the subject. The difficulty with using this approach is that no single page may exist that completely satisfies the person's information need. Now, suppose that the person could ask his question directly to a large group of other people, several of whom are pharmacists or herbal experts. He would be much more likely to get an answer. This search scenario, where a person submits a question to a community consisting of both experts and non-experts in a wide range of topics, each of whom can opt to answer the question, is called community-based question answering (CQA). These systems harness the power of human knowledge in order to satisfy a broad range of information needs. Several popular commercial systems of this type exist today, including Yahoo! Answers and Naver, a Korean search portal.

There are both pros and cons to CQA systems. The pros include users being able to get answers to complex or obscure information needs; the chance to see multiple, possibly differing opinions about a topic; and the chance to interact with other users who may share common interests, problems, and goals. The cons include the possibility of receiving no answer at all to a question, having to wait (possibly days) for an answer, and receiving answers that are incorrect, misleading, offensive, or spam.

As we just mentioned, many of the answers that people submit are of low quality. It turns out, however, that the old computer programming adage of "garbage in, garbage out" also applies to questions. Studies have shown that low-quality answers are often given to low-quality questions. Indeed, there are a wide range of questions that users can, and do, ask. Table 10.1 shows a small sample of questions submitted to Yahoo! Answers.

What part of Mexico gets the most tropical storms?
How do you pronounce the french words, coeur and miel?
GED test?
Why do I have to pay this fine?
What is Schrödinger's cat?
What's this song?
Hi...can u ppl tell me sumthing abt death dreams??
What are the engagement and wedding traditions in Egypt?
Fun things to do in LA?
What lessons from the Tao Te Ching do you apply to your everyday life?
Foci of a hyperbola?
What should I do today?
Why was iTunes deleted from my computer?
Heather Locklear?
Do people in the Australian Defense Force (RAAF) pay less tax than civilians?
Whats a psp xmb?
If C(-3, y) and D(1, 7) lie upon a line whose slope is 2, find the value of y.?
Why does love make us so irrational? Am I in love?
What are some technologies that are revolutionizing business?

Table 10.1. Example questions submitted to Yahoo! Answers

Some of the questions in the table are well-formed and grammatically correct, whereas others are not. In addition, some of the questions have simple, straightforward answers, but many others do not.

Besides allowing users to ask and answer questions, CQA services also provide users with the ability to search the archive of previously asked questions and the corresponding answers.
This search functionality serves two purposes. First, if a user finds that a similar question has been asked in the past, then they may not need to ask the question and wait for responses. Second, search engines may augment traditional search results with hits from the question and answer database. For example, if a user enters the query "schrödingers cat", the search engine may choose to return answers to "What is Schrödinger's cat?" (which appears in Table 10.1) in addition to the other, more standard set of ranked web pages.

Therefore, given a query, it is important to be able to automatically find potential answers in the question and answer database. There are several possible ways to search this database. New queries can be matched against the archived questions alone, the archived answers alone, or questions and answers combined. Studies have shown that it is better to match queries against the archived questions rather than answers, since generally it is easier to find related questions (which are likely to have relevant answers) than it is to match queries directly to answers.

2 For the remainder of our discussion, we use the term "query" to refer to a question or a search query.

Matching queries to questions can be achieved using any of the retrieval models described in Chapter 7, such as language modeling or BM25. However, traditional retrieval models are likely to miss many relevant questions because of the vocabulary mismatch problem. Here, vocabulary mismatch is caused by the fact that there are many different ways to ask the same question. For example, suppose we had the query "who is the leader of india?". Related questions are "who is the prime minister of india?", "who is the current head of the indian government?", and so on. Notice that the only terms in common among any two of these questions are "who", "is", "the", "of", and "india". Blindly applying any standard retrieval model would retrieve non-related questions such as "who is the finance minister of india?" and "who is the tallest person in all of india?". Stopword removal in this case does not help much. Instead, better matches can be achieved by generalizing the notion of "leader" to include other concepts, such as "prime minister", "head", and "government".

In section 6.4, we described cross-language retrieval, where a user queries in a source language (e.g., English) and documents are retrieved in a target language (e.g., French). Most of the retrieval methods developed for cross-language retrieval are based on machine translation techniques, which require learning translation probabilities of the form P(s | t), where s is a word in the source language and t is a word in the target language. Translation models can also be used to help overcome the vocabulary mismatch problem within a single language. This is achieved by estimating P(t | t'), where t and t' are both words in the same language. This probability can be interpreted as the probability that word t is used in place of t'. Returning to our example, it is likely that P(leader | minister) and P(leader | government) would have non-zero values, and therefore result in more relevant questions being retrieved. We will now describe two translation-based models that have been used for finding related questions and answers in an archive.

The first model was proposed by Berger and Lafferty (1999).
It is similar to the query likelihood model described in section 7.3.1, except it allows query terms to be "translated" from other terms. Given a query, related questions are ranked according to:

P(Q | A) = \prod_{w \in Q} \sum_{t \in V} P(w | t) P(t | A)

where Q is the query, A is a related question in the archive, V is the vocabulary, P(w | t) are the translation probabilities, and P(t | A) is the smoothed probability of generating t given the archived question A (see section 7.3.1 for more details). Therefore, we see that the model allows query term w to be translated from other terms t that may occur in the question. One of the primary issues with this model is that there is no guarantee the question will be related to the query; since every term is translated independently, the question with the highest score may be a good term-for-term translation of the query, but not a good overall translation.

3 The discussion focuses on question retrieval, but the same models can be used to retrieve archived answers. As we said, question retrieval is generally more effective.

The second model, developed by Xue et al. (2008), is an extension of Berger's model that attempts to overcome this issue by allowing matches of the original query terms to be given more weight than matches of translated terms. Under this model, questions (or answers) are ranked using the following formula:

P(Q | A) = \prod_{w \in Q} \frac{(1 - \beta) f_{w,A} + \beta \sum_{t \in V} P(w | t) f_{t,A} + \mu \frac{c_w}{|C|}}{|A| + \mu}

where β is a parameter between 0 and 1 that controls the influence of the translation probabilities, and μ is the Dirichlet smoothing parameter. Notice that when β = 0, this model is equivalent to the original query likelihood model, with no influence from the translation model. As β approaches 1, the translation model begins to have more impact on the ranking and becomes more similar to Berger's model.

Ranking using these models can be computationally expensive, since each involves a sum over the entire vocabulary, which can be very large. Query processing speeds can be significantly improved by considering only a small number of translations per query term. For example, if the five most likely translations of each query term are used, the number of terms in the summation will be reduced from |V| to 5.
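To illustrate how such a formula might be evaluated, the sketch below computes the log of P(Q|A) for a single archived question under the second model, assuming the translation probabilities P(w|t) have already been estimated and truncated to a handful of translations per term. All of the names here (termCounts for the f_{t,A} counts, collectionProb for c_w/|C|, and so on) are illustrative rather than part of any existing system.

import java.util.*;

public class TranslationRanker {

    // queryTerms: the terms of Q; termCounts: f_{t,A} for the archived question A;
    // translation.get(w).get(t) = P(w|t); collectionProb: c_w / |C|;
    // docLength: |A|; beta and mu as in the ranking formula above.
    public static double logScore(List<String> queryTerms,
                                  Map<String, Integer> termCounts,
                                  Map<String, Map<String, Double>> translation,
                                  Map<String, Double> collectionProb,
                                  double docLength, double beta, double mu) {
        double logP = 0.0;
        for (String w : queryTerms) {
            // Exact-match component: (1 - beta) * f_{w,A}
            double exact = (1 - beta) * termCounts.getOrDefault(w, 0);
            // Translation component: beta * sum over t of P(w|t) * f_{t,A}
            double translated = 0.0;
            for (Map.Entry<String, Double> e
                    : translation.getOrDefault(w, Collections.<String, Double>emptyMap()).entrySet()) {
                translated += e.getValue() * termCounts.getOrDefault(e.getKey(), 0);
            }
            translated *= beta;
            // Dirichlet-smoothed collection component: mu * c_w / |C|
            double background = mu * collectionProb.getOrDefault(w, 1e-10);
            logP += Math.log((exact + translated + background) / (docLength + mu));
        }
        return logP;
    }
}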
The one major issue that has been ignored thus far is how to compute the translation probabilities. In cross-language retrieval, translation probabilities can be automatically learned using a parallel corpus. Translation probabilities are estimated from pairs of documents of the form {(D_1^s, D_1^t), . . . , (D_N^s, D_N^t)}, where D_i^s is document i written in the source language and D_i^t is document i written in the target language. However, the notion of a parallel corpus becomes hazy when dealing with intra-language translations. A variety of approaches have been used for estimating translation probabilities within the same language. For finding related questions, one of the most successful approaches makes the assumption that question/answer pairs form a parallel corpus from which translation probabilities can be estimated. That is, translation probabilities are estimated from archived pairs of the form {(Q_1, A_1), . . . , (Q_N, A_N)}, where Q_i is question i and A_i is answer i. Example translations estimated from a real question and answer database using this approach are shown in Table 10.2. Pointers to algorithms for estimating translation probabilities given a parallel corpus are given in the "References and Further Reading" section at the end of this chapter.

everest: everest, mountain, tallest, 29,035, highest, mt, ft, measure, feet, mount
xp: xp, window, install, drive, computer, version, click, pc, program, microsoft
search: search, google, information, internet, website, web, list, free, info, page

Table 10.2. Translations automatically learned from a set of question and answer pairs. The 10 most likely translations for the terms "everest", "xp", and "search" are given.

In this section, we assumed that people in a community would provide answers to questions, and an archive of questions and answers would be created by this process. As we mentioned in Chapter 1, it is also possible to design question answering systems that find answers for a more limited range of questions in the text of large document corpora. We describe these systems in more detail in Chapter 11.

10.3.4 Collaborative Searching

The final community-based search task that we consider is collaborative searching. As the name suggests, collaborative searching involves a group of users with a common goal searching together in a collaborative setting. There are many situations where collaborative searching can be useful. For example, consider a group of students working together on a world history report. In order to complete the report, the students must do background research on the report topic. Traditionally, the students would split the topic into various subtopics, assign each group member one of the subtopics, and then each student would search the Web or an online library catalog, independently of the other students, for information and resources on their specific subtopic. In the end, the students would have to combine all of the information from the various subtopics to form a coherent report. Each student would learn a great deal about his or her particular subtopic, and no single student would have a thorough understanding of all of the material in the report. Clearly, every student would end up learning a great deal more if the research process were more collaborative. A collaborative search system would allow the students to search the Web and other resources together, so that every member of the group could contribute and understand every subtopic of the report.

Collaborative search can also be useful within companies, where colleagues must collect information about various aspects of a particular project. Last, but certainly not least, recreational searchers may find collaborative search systems particularly useful. Suppose you and your friends are planning a party. A collaborative search system would help everyone coordinate information-gathering tasks, such as finding recipes, choosing decorations, selecting music, deciding on invitations, etc.

There are two common types of collaborative search scenarios, depending on where the search participants are physically located with respect to each other. The first scenario, known as co-located collaborative search, occurs when all of the search participants are in the same location, such as the same office or same library, sitting in front of the same computer.
The other scenario, known as remote collaborative searching, occurs when the search participants are physically located in different locations. The participants may be in different offices within the same building, different buildings within the same city, or even in completely different countries across the globe. Figure 10.5 provides a schematic for these scenarios. Both situations present different challenges, and the systems developed for each have different requirements in terms of how they support search. To illustrate this, we briefly describe two examples of collaborative search systems.

Fig. 10.5. Overview of the two common collaborative search scenarios. On the left is co-located collaborative search, which involves multiple participants in the same location at the same time. On the right is remote collaborative search, where participants are in different locations and not necessarily all online and searching at the same time.

The CoSearch system developed by Amershi and Morris (2008) is a co-located collaborative search system. The system has a primary display, keyboard, and mouse that is controlled by the person called the "driver", who leads the search task. Additional participants, called "observers", each have a mouse or a Bluetooth-enabled mobile phone. The driver begins the session by submitting a query to a search engine. The search results are displayed on the primary display and on the display of any user with a mobile phone. Observers may click on search results, which adds the corresponding page into a shared "page queue." This allows every participant to recommend which page should be navigated to next in a convenient, centralized manner, rather than giving total control to the driver. In addition to the page queue, there is also a "query queue," where participants submit new queries. The query queue provides everyone with a list of potentially useful queries to explore next, and provides the driver with a set of options generated collectively by the group. The CoSearch system provides many useful ways for a group of people to collaboratively search together, since it allows everyone to work toward a common task, while at the same time preserving the important division of labor that is part of collaboration, via the use of multiple input devices.

4 Bluetooth is the name of a short-range wireless technology that allows for communication between devices, such as laptops, printers, PDAs, and mobile phones.

An example of a remote collaborative search system is SearchTogether, developed by Morris and Horvitz (2007b). In this system, it is assumed that every participant in the session is in a different location and has his own computer. Furthermore, unlike co-located search, which assumes that all of the participants are present during the entire search session, remote search makes no assumptions about whether everyone is online at the same time. Therefore, whereas co-located search sessions tend to be transient, remote search sessions can be persistent. Users of the system may submit queries, which are logged and shared with all of the other search participants. This allows all participants to see what others are searching for, and allows them to resubmit or refine the queries.
Users can add ratings (e.g., "thumbs up" or "thumbs down") and comments to pages that are viewed during the search process, which will be aggregated and made available to other participants. In addition, a participant may explicitly recommend a given page to another participant, which will then show up in her recommended pages list. Therefore, the SearchTogether system provides most of the functionality of the CoSearch system, except it is adapted to the specific needs of remote collaboration. One particular advantage of a persistent search session is that new participants, who were not previously part of the search process, can quickly be brought up to speed by browsing the query history, page ratings, comments, and recommendations.

As we have outlined, collaborative search systems provide users with a unique set of tools to effectively collaborate with each other during a co-located or remote search session. Despite the promise of such systems, very few commercial collaborative search systems exist today. However, such systems are beginning to gain considerable attention in the research community. Given this, and the increasingly collaborative nature of the online experience, it may be only a matter of time before collaborative search systems become more widely available.

10.4 Filtering and Recommending

10.4.1 Document Filtering

As we mentioned previously, one part of social search applications is representing individual users' interests and preferences. One of the earliest applications that focused on user profiles was document filtering. Document filtering, often simply referred to as filtering, is an alternative to the standard ad hoc search paradigm. In ad hoc search, users typically enter many different queries over time, while the document collection stays relatively static. In filtering, the user's information need stays the same, but the document collection itself is dynamic, with new documents arriving periodically. The goal of filtering, then, is to identify (filter) the relevant new documents and send them to the user. Filtering, as described in Chapter 3, is a push application.

Filtering is also an example of a supervised learning task, where the profile plays the part of the training data and the incoming documents are the test items that need to be classified as "relevant" or "not relevant." However, unlike a spam detection model, which would take thousands of labeled emails as input, a filtering system profile may only consist of a single query, making the learning task even more challenging. For this reason, filtering systems typically use more specialized techniques than general classification techniques in order to overcome the lack of training data.

Although they are not as widely used as standard web search engines, there are many examples of real-world document filtering systems. For example, many news sites offer filtering services. These services include alerting users when there is breaking news, when an article is published in a certain news category (e.g., sports or politics), or when an article is published about a certain topic, which is typically specified using one or more keywords (e.g., "terrorism" or "global warming").
The alerts come in the form of emails, SMS (text messages), or even personalized news feeds, thereby allowing the user to keep up with topics of interest without having to continually check the news site for updates or enter numerous queries to the site's search engine. Therefore, filtering provides a way of personalizing the search experience by maintaining a number of long-term information needs.

Document filtering systems have two key components. First, the user's long-term information needs must be accurately represented. This is done by constructing a profile for every information need. Second, given a document that has just arrived in the system, a decision mechanism must be devised for identifying which are the relevant profiles for that document. This decision mechanism must not only be efficient, especially since there are likely to be thousands of profiles, but it must also be highly accurate. The filtering system should not miss relevant documents and, perhaps even more importantly, should not be continually alerting users about non-relevant documents. In the remainder of this section, we describe the details of these two components.

Profiles

In web search, users typically enter a very short query. The search engine then faces the daunting challenge of determining the user's underlying information need from this very sparse piece of information. There are numerous reasons why most search engines today expect information needs to be specified as short keyword queries. However, one of the primary reasons is that users do not want to (or do not have the time to) type in long, complex queries for each and every one of their information needs. Many simple, non-persistent information needs can often be satisfied using a short query to a search engine. Filtering systems, on the other hand, cater to long-term information needs. Therefore, users may be more willing to spend more time specifying their information need in greater detail in order to ensure highly relevant results over an extended period of time. The representation of a user's long-term information need is often called a filtering profile, or just a profile.

What actually makes up a filtering profile is quite general and depends on the particular domain of interest. Profiles may be as simple as a Boolean or keyword query. Profiles may also contain documents that are known to be relevant or non-relevant to the user's information need. Furthermore, they may contain other items, such as social tags and related named entities. Finally, profiles may also have one or more relational constraints, such as "published before 1990", "price in the $10–$25 range", and so on. Whereas the other constraints described act as soft filters, relational constraints of this form act as hard filters that must be satisfied in order for a document to be retrieved.

Although there are many different ways to represent a profile, the underlying filtering model typically dictates the actual representation. Filtering models are very similar to the retrieval models described in Chapter 7. In fact, many of the widely used filtering models are simply retrieval models where the profile takes the place of the query. There are two common types of filtering models. The first are static models. Here, static refers to the fact that the user's profile does not change over time, and therefore the same model can always be applied. The second are
The second are</p> <p><span class="badge badge-info text-white mr-2">449</span> 10.4 Filtering and Recommending 425 models, where the user’s profile is constantly changing over time. This adaptive scenario requires the filtering model to be dynamic over time as new information is incorporated into the profile. Static filtering models work under the assumption that the As we just described, static filtering models filtering profile remains static over time. In some ways, this makes the filtering process easier, but in other ways it makes it less robust. All of the popular static filtering models are derived from the standard retrieval models described in Chap- ter 7. However, unlike web search, filtering systems do not return a ranked list of documents for each profile. Instead, when a new document enters the system, the filtering system must decide whether or not it is relevant with respect to each pro- file. Figure 10.6 illustrates how a static filtering system works. As new documents arrive, they are compared to each profile. Arrows from a document to a profile indicate that the document was deemed relevant to the profile and returned to the user. Profile 1 Profile 2 Profile 3 = 5 t = 8 = 8 t = 2 t = 3 t t = 2 t = 3 t = 5 t Document Stream Fig. 10.6. Example of a static filtering system. Documents arrive over time and are com- pared against each profile. Arrows from documents to profiles indicate the document matches the profile and is retrieved.</p> <p><span class="badge badge-info text-white mr-2">450</span> 426 10 Social Search In the most simple case, a Boolean retrieval model can be used. Here, the filter- ing profile would simply consist of a Boolean query, and a new document would be retrieved for the profile only if it satisfied the query. The Boolean model, de- spite its simplicity, can be used effectively for document filtering, especially where precision is important. In fact, many web-based filtering systems make use of a Boolean retrieval model. One of the biggest drawbacks of the Boolean model is the low level of recall. Depending on the filtering domain, users may prefer to have good coverage over very high precision results. There are various possible solutions to this problem, including using the vector space model, the probabilistic model, BM25, or lan- guage modeling. All of these models can be extended for use with filtering by specifying a profile using a keyword query or a set of documents. Directly apply- ing these models to filtering, however, is not trivial, since each of them returns a score, not a “retrieve” or “do not retrieve” answer as in the case of the Boolean model. One of the most widely used techniques for overcoming this problem is to use a score threshold to determine whether to retrieve a document. That is, only documents with a similarity score above the threshold will be retrieved. Such a threshold would have to be tuned in order to achieve good effectiveness. Many complex issues arise when applying a global score threshold, such as ensuring that scores are comparable across profiles and over time. As a concrete example, we describe how static filtering can be done within the language modeling framework for retrieval. Given a static profile, which may consist of a single keyword query, multiple queries, a set of documents, or some combination of these, we must first estimate a profile language model denoted by . There are many ways to do this. One possibility is: P K ∑ − λ ) (1 f c w;T w i w | P ) = P ( α λ + ∑ i K | | C T | | α i i =1 i =1 i T , . 
Document filtering can also be treated as a machine learning problem. At its core, filtering is a classification task that often has a very small amount of training data (i.e., the profile). The task is then to build a binary classifier that determines whether an incoming document is relevant to the profile. However, training data would be necessary in order to properly learn such a model. For this task, the training data comes in the form of binary relevance judgments over profile/document pairs. Any of the classification techniques described in Chapter 9 can be used. Suppose that a Support Vector Machine with a linear kernel is used; the scoring function would then have the following form:

\[
s(P; D) = \mathbf{w} \cdot f(P, D) = w_1 f_1(P, D) + w_2 f_2(P, D) + \dots + w_d f_d(P, D)
\]

where w_1, . . . , w_d are the set of weights learned during the SVM training process, and f_1(P, D), . . . , f_d(P, D) are the set of features extracted from the profile/document pair. Many of the features that have been successfully applied to text classification, such as unigrams and bigrams, can also be applied to filtering. Given a large amount of training data, it is likely that a machine learning approach will outperform the simple language modeling approach just described. However, when there is little or no training data, the language modeling approach is a good choice.
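The fragment below sketches how such a learned linear scoring function might be applied at filtering time. It is only an illustration: the weight map is assumed to come from an SVM (or any linear model) trained offline, the feature extraction step is not shown, and the class name and threshold are invented for the example.

```java
import java.util.Map;

/**
 * Minimal sketch of applying a learned linear model s(P;D) = w . f(P,D)
 * to decide whether an incoming document should be delivered to a profile.
 * The weights are assumed to have been learned offline (e.g., by a linear SVM).
 */
public class LinearFilterScorer {

    private final Map<String, Double> weights; // learned weight for each feature name
    private final double threshold;            // decision threshold (assumed)

    public LinearFilterScorer(Map<String, Double> weights, double threshold) {
        this.weights = weights;
        this.threshold = threshold;
    }

    /** Dot product of the learned weights with the extracted feature values. */
    public double score(Map<String, Double> features) {
        double s = 0.0;
        for (Map.Entry<String, Double> f : features.entrySet()) {
            s += weights.getOrDefault(f.getKey(), 0.0) * f.getValue();
        }
        return s;
    }

    /** Deliver the document if the score exceeds the threshold. */
    public boolean deliver(Map<String, Double> profileDocFeatures) {
        return score(profileDocFeatures) >= threshold;
    }
}
```

The features here might be, for instance, counts of profile unigrams and bigrams that appear in the incoming document; whatever extraction is used at training time must also be used at filtering time.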
Adaptive filtering models

Static filtering profiles are assumed not to change over time. In such a setting, a user would be able to create a profile, but could not update it to better reflect his information need. The only option would be to delete the profile and create a new one that would hopefully produce better results. This type of system is rigid and not very robust. Adaptive filtering is an alternative filtering technique that allows for dynamic profiles. This technique provides a mechanism for updating the profile over time.

Fig. 10.7. Example of an adaptive filtering system. Documents arrive over time and are compared against each profile. Arrows from documents to profiles indicate the document matches the profile and is retrieved. Unlike static filtering, where profiles are static over time, profiles are updated dynamically (e.g., when a new match occurs).

Profiles may be updated either using explicit input from the user or automatically, based on user behavior such as click or browsing patterns. There are various reasons why it may be useful to update a profile as time goes on. For example, users may want to fine-tune their information need in order to find more specific types of information. Therefore, adaptive filtering techniques are more robust than static filtering techniques and are designed to adapt to find more relevant documents over the life span of a profile. Figure 10.7 shows an example adaptive filtering system for the same set of profiles and incoming documents from Figure 10.6. Unlike the static filtering case, when a document is delivered to a profile, the user provides feedback about the document, and the profile is then updated and used for matching future incoming documents.

As Figure 10.7 suggests, one of the most common ways to adapt a profile is in response to user feedback. User feedback may come in various forms, each of which can be used in different ways to update the user profile. In order to provide a concrete example of how profiles can be adapted in response to user feedback, we consider the case where users provide relevance feedback (see Chapter 6) on documents. That is, for some set of documents, such as the set of documents retrieved for a given profile, the user explicitly states whether or not the document is relevant to the profile. Given the relevance feedback information, there are a number of ways to update the profile. As before, how the profile is represented and subsequently updated largely depends on the underlying retrieval model that is being used.

As described in Chapter 7, the Rocchio algorithm can be used to perform relevance feedback in the vector space model. Therefore, if profiles are represented as vectors in a vector space model, Rocchio's algorithm can be applied to update the profiles when the user provides relevance feedback information. Given a profile P, a set of relevant feedback documents (denoted Rel), and a set of non-relevant feedback documents (denoted Nonrel), the adapted profile P′ is computed as follows:

\[
P' = \alpha\,P + \beta\,\frac{1}{|Rel|} \sum_{D_i \in Rel} D_i \;-\; \gamma\,\frac{1}{|Nonrel|} \sum_{D_i \in Nonrel} D_i
\]

where D_i is the vector representing document i, and α, β, and γ are parameters that control how to trade off the weighting between the initial profile, the relevant documents, and the non-relevant documents.

Chapter 7 also described how relevance models can be used with language modeling for pseudo-relevance feedback. However, relevance models can also be used for true relevance feedback as follows:

\[
P(w \mid P) = \frac{1}{|Rel|} \sum_{D \in Rel} \sum_{D_i \in C} P(w \mid D_i)\, P(D_i \mid D)
\;\approx\; \frac{1}{|Rel|} \sum_{D_i \in Rel} P(w \mid D_i)
\]

where C is the set of documents in the collection, Rel is the set of documents that have been judged relevant, D_i is a document, and P(D_i | D) is the probability that document D_i is generated from document D's language model. The approximation (≈) can be made because P(D_i | D) is going to be 1 or very close to 1 when D_i = D and nearly 0 for most other documents. Therefore, the probability of w in the profile is simply the average probability of w in the language models of the relevant documents. Unlike the Rocchio algorithm, the non-relevant documents are not considered.
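As a rough illustration of the vector space case, the sketch below applies the Rocchio update to a profile held as a sparse term-weight map. The class name, the parameter handling, and the decision to drop non-positive weights are assumptions made for the example, not part of the algorithm as defined above.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/**
 * Minimal sketch of a Rocchio-style adaptive profile update.
 * Profiles and documents are represented as sparse term -> weight vectors.
 */
public class RocchioProfileUpdater {

    private final double alpha; // weight on the existing profile
    private final double beta;  // weight on relevant feedback documents
    private final double gamma; // weight on non-relevant feedback documents

    public RocchioProfileUpdater(double alpha, double beta, double gamma) {
        this.alpha = alpha;
        this.beta = beta;
        this.gamma = gamma;
    }

    /** P' = alpha*P + beta*avg(Rel) - gamma*avg(Nonrel) */
    public Map<String, Double> update(Map<String, Double> profile,
                                      List<Map<String, Double>> relevant,
                                      List<Map<String, Double>> nonRelevant) {
        Map<String, Double> updated = new HashMap<>();
        addScaled(updated, profile, alpha);
        if (!relevant.isEmpty()) {
            for (Map<String, Double> doc : relevant) {
                addScaled(updated, doc, beta / relevant.size());
            }
        }
        if (!nonRelevant.isEmpty()) {
            for (Map<String, Double> doc : nonRelevant) {
                addScaled(updated, doc, -gamma / nonRelevant.size());
            }
        }
        // The formula allows negative weights; clipping them is a common
        // simplification, since they rarely help matching.
        updated.values().removeIf(w -> w <= 0.0);
        return updated;
    }

    private void addScaled(Map<String, Double> target, Map<String, Double> source, double scale) {
        for (Map.Entry<String, Double> e : source.entrySet()) {
            target.merge(e.getKey(), scale * e.getValue(), Double::sum);
        }
    }
}
```

The corresponding update in the language modeling case is even simpler: following the relevance model approximation above, the profile probability of each word is just the average of that word's probability in the language models of the judged-relevant documents.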
If a classification technique, such as one of those described in Chapter 9, is used for filtering, then an online learning algorithm can be used to adapt the classification model as new user feedback arrives. Online learning algorithms update model parameters, such as the hyperplane w in SVMs, by considering only one new item or a batch of new items. These algorithms are different from standard supervised learning algorithms because they do not have a "memory." That is, once an input has been used for training, it is discarded and cannot be explicitly used in the future to update the model parameters. Only the new training inputs are used for training. The details of online learning methods are beyond the scope of this book. However, several references are given in the "References and Further Reading" section at the end of this chapter.

Model                Profile Representation      Profile Updating
Boolean              Boolean Expression          N/A
Vector Space         Vector                      Rocchio
Language Modeling    Probability Distribution    Relevance Modeling
Classification       Model Parameters            Online Learning

Table 10.3. Summary of static and adaptive filtering models. For each, the profile representation and profile updating algorithm are given.

Both static and adaptive filtering, therefore, can be considered special cases of many of the retrieval models and techniques described in Chapters 6, 7, and 9. Table 10.3 summarizes the various filtering models, including how profiles are represented and updated. In practice, the vector space model and language modeling have been shown to be effective and easy to implement, both for static and adaptive filtering. The classification models are likely to be more robust in highly dynamic environments. However, as with all classification techniques, the model requires training data to learn an effective model.

Fast filtering with millions of profiles

In a full-scale production system, there may be thousands or possibly even millions of profiles that must be matched against incoming documents. Fortunately, standard information retrieval indexing and query evaluation strategies can be applied to perform this matching efficiently. In most situations, profiles are represented as a set of keywords or a set of feature values, which allows each profile to be indexed using the strategies discussed in Chapter 5. Scalable indexing infrastructures can easily handle millions, or possibly even billions, of profiles. Then, once the profiles are indexed, an incoming document can be transformed into a "query", which again is represented as either a set of terms or a set of features. The "query" is then run against the index of profiles, retrieving a ranked list of profiles. The document is then delivered to only those profiles whose score, with respect to the "query", is above the relevance threshold previously discussed.
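To give a feel for this document-as-query approach, the sketch below builds a tiny in-memory inverted index over profile terms and scores profiles against an incoming document. It is only an illustration under simplifying assumptions (unweighted term overlap as the score, a single global threshold, no compression or partitioning); a real system would use the indexing and query evaluation machinery discussed in Chapter 5.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

/**
 * Minimal sketch of fast filtering: profiles are indexed by term, and each
 * incoming document is treated as a "query" against that index.
 */
public class ProfileIndex {

    // term -> ids of profiles containing that term (a tiny inverted index)
    private final Map<String, List<String>> postings = new HashMap<>();
    private final double threshold; // global relevance threshold (assumed)

    public ProfileIndex(double threshold) {
        this.threshold = threshold;
    }

    /** Index a profile represented as a set of keywords. */
    public void addProfile(String profileId, Set<String> terms) {
        for (String term : terms) {
            postings.computeIfAbsent(term, t -> new ArrayList<>()).add(profileId);
        }
    }

    /** Run the incoming document as a query; return profiles that clear the threshold. */
    public List<String> matchingProfiles(Set<String> documentTerms) {
        Map<String, Integer> overlap = new HashMap<>();
        for (String term : documentTerms) {
            for (String profileId : postings.getOrDefault(term, List.of())) {
                overlap.merge(profileId, 1, Integer::sum);
            }
        }
        List<String> matches = new ArrayList<>();
        for (Map.Entry<String, Integer> e : overlap.entrySet()) {
            // Here the "score" is simply the number of overlapping terms.
            if (e.getValue() >= threshold) {
                matches.add(e.getKey());
            }
        }
        return matches;
    }
}
```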
Evaluation

Many of the evaluation metrics described in Chapter 8 can be used to evaluate filtering systems. However, it is important to choose appropriate metrics, because filtering differs in a number of ways from standard search tasks, such as news or web search. One of the most important differences is the fact that filtering systems do not produce a ranking of documents for each profile. Instead, relevant documents are simply delivered to the profile