1 A Few Useful Things to Know about Machine Learning Pedro Domingos Department of Computer Science and Engineering University of Washington Seattle, WA 98195-2350, U.S.A. [email protected] ABSTRACT x (e.g., whether the correct output y for future examples t t spam filter correctly classifies previously unseen emails as Machine learning algorithms can figure out how to perform spam or not spam). important tasks by generalizing from examples. This is of- ten feasible and cost-effective where manual programming is not. As more data becomes available, more ambitious 2. LEARNING = REPRESENTATION + problems can be tackled. As a result, machine learning is EVALUATION + OPTIMIZATION widely used in computer science and other fields. However, Suppose you have an application that you think machine developing successful machine learning applications requi res learning might be good for. The first problem facing you a substantial amount of “black art” that is hard to find in e. is the bewildering variety of learning algorithms availabl textbooks. This article summarizes twelve key lessons that Which one to use? There are literally thousands available, machine learning researchers and practitioners have learn ed. and hundreds more are published each year. The key to not These include pitfalls to avoid, important issues to focus o n, getting lost in this huge space is to realize that it consists and answers to common questions. of combinations of just three components. The components are: 1. INTRODUCTION Machine learning systems automatically learn programs from Representation. A classifier must be represented in some data. This is often a very attractive alternative to manuall y formal language that the computer can handle. Con- constructing them, and in the last decade the use of machine versely, choosing a representation for a learner is tan- learning has spread rapidly throughout computer science tamount to choosing the set of classifiers that it can and beyond. Machine learning is used in Web search, spam possibly learn. This set is called the hypothesis space filters, recommender systems, ad placement, credit scoring, of the learner. If a classifier is not in the hypothesis fraud detection, stock trading, drug design, and many other space, it cannot be learned. A related question, which applications. A recent report from the McKinsey Global In- we will address in a later section, is how to represent stitute asserts that machine learning (a.k.a. data mining or the input, i.e., what features to use. predictive analytics) will be the driver of the next big wave of Evaluation. An evaluation function (also called objective innovation [16]. Several fine textbooks are available to inter - is needed to distinguish scoring function) or function er, ested practitioners and researchers (e.g, [17, 25]). Howev good classifiers from bad ones. The evaluation function much of the “folk knowledge” that is needed to successfully used internally by the algorithm may differ from the - develop machine learning applications is not readily avail external one that we want the classifier to optimize, for able in them. As a result, many machine learning projects ease of optimization (see below) and due to the issues take much longer than necessary or wind up producing less- discussed in the next section. than-ideal results. Yet much of this folk knowledge is fairly Finally, we need a method to search among Optimization. easy to communicate. This is the purpose of this article. the classifiers in the language for the highest-scoring one. The choice of optimization technique is key to the Many different types of machine learning exist, but for il- efficiency of the learner, and also helps determine the lustration purposes I will focus on the most mature and classifier produced if the evaluation function has more widely used one: classification. Nevertheless, the issues I than one optimum. It is common for new learners to classi- will discuss apply across all of machine learning. A start out using off-the-shelf optimizers, which are later fier is a system that inputs (typically) a vector of discrete replaced by custom-designed ones. and/or continuous feature values and outputs a single dis- crete value, the class . For example, a spam filter classifies Table 1 shows common examples of each of these three com- email messages into“spam”or“not spam,”and its input may k ponents. For example, -nearest neighbor classifies a test be a Boolean vector x = ( = 1 x x ), where , . . . , x , . . . , x j 1 j d example by finding the k most similar training examples if the th word in the dictionary appears in the email and j and predicting the majority class among them. Hyperplane- learner training set of exam- = 0 otherwise. A x inputs a j based methods form a linear combination of the features per ples , y , . . . , x ( ) is an observed input ), where x x = ( x 1 i i i, i i,d class and predict the class with the highest-valued combina - is the corresponding output, and outputs a classifier. and y i tion. Decision trees test one feature at each internal node, The test of the learner is whether this classifier produces th e

2 Table 1: The three components of learning algorithms. Optimization Evaluation Representation Combinatorial optimization Accuracy/Error rate Instances -nearest neighbor Greedy search K Precision and recall Beam search Squared error Support vector machines Hyperplanes Branch-and-bound Likelihood Posterior probability Continuous optimization Naive Bayes Logistic regression Information gain Unconstrained Decision trees K-L divergence Gradient descent Cost/Utility Conjugate gradient Sets of rules Margin Quasi-Newton methods Propositional rules Logic programs Constrained Linear programming Neural networks Quadratic programming Graphical models Bayesian networks Conditional random fields Algorithm 1 LearnDT ( TrainSet ) with one branch for each feature value, and have class predic - tions at the leaves. Algorithm 1 shows a bare-bones decision if y have the same class TrainSet all examples in then ∗ tree learner for Boolean domains, using information gain an d ) return y MakeLeaf( ∗ x , y ) is the mutual information greedy search [21]. InfoGain( j if x has InfoGain( , y ) x > 0 no feature then j j y , c , between feature ) re- x c and the class x . MakeNode( j 0 1 TrainSet y ← Most frequent class in ∗ as the child for and has c x turns a node that tests feature 0 MakeLeaf( y ) return ∗ = 1. x c x as the child for = 0 and 1 x argmax ← InfoGain( , ) x y ∗ x j j TrainSet with ← = 0 TS Examples in x 0 ∗ Of course, not all combinations of one component from each TrainSet with x = 1 TS Examples in ← ∗ 1 column of Table 1 make equal sense. For example, dis- TS , LearnDT( x MakeNode( return )) TS ), LearnDT( ∗ 0 1 crete representations naturally go with combinatorial op- timization, and continuous ones with continuous optimiza- Contamination of your classifier by test data can occur in tion. Nevertheless, many learners have both discrete and insidious ways, e.g., if you use test data to tune parameters continuous components, and in fact the day may not be and do a lot of tuning. (Machine learning algorithms have far when every single possible combination has appeared in lots of knobs, and success often comes from twiddling them some learner! a lot, so this is a real concern.) Of course, holding out data reduces the amount available for training. This can Most textbooks are organized by representation, and it’s be mitigated by doing cross-validation: randomly dividing easy to overlook the fact that the other components are your training data into (say) ten subsets, holding out each equally important. There is no simple recipe for choosing one while training on the rest, testing each learned classifi er each component, but the next sections touch on some of the on the examples it did not see, and averaging the results to key issues. And, as we will see below, some choices in a see how well the particular parameter setting does. machine learning project may be even more important than the choice of learner. In the early days of machine learning, the need to keep train- ing and test data separate was not widely appreciated. This 3. IT’S GENERALIZATION THAT COUNTS was partly because, if the learner has a very limited repre- The fundamental goal of machine learning is to generalize sentation (e.g., hyperplanes), the difference between train - beyond the examples in the training set. This is because, ing and test error may not be large. But with very flexible no matter how much data we have, it is very unlikely that sifiers classifiers (e.g., decision trees), or even with linear clas we will see those exact examples again at test time. (No- with a lot of features, strict separation is mandatory. tice that, if there are 100,000 words in the dictionary, the 000 100 , possible different in- spam filter described above has 2 g Notice that generalization being the goal has an interestin puts.) Doing well on the training set is easy (just memorize consequence for machine learning. Unlike in most other op- the examples). The most common mistake among machine timization problems, we don’t have access to the function learning beginners is to test on the training data and have we want to optimize! We have to use training error as a sur- ed the illusion of success. If the chosen classifier is then test rogate for test error, and this is fraught with danger. How on new data, it is often no better than random guessing. So, to deal with it is addressed in some of the next sections. On if you hire someone to build a classifier, be sure to keep some the positive side, since the objective function is only a pro xy of the data to yourself and test the classifier they give you for the true goal, we may not need to fully optimize it; in on it. Conversely, if you’ve been hired to build a classifier, fact, a local optimum returned by simple greedy search may set some of the data aside from the beginning, and only use be better than the global optimum. it to test your chosen classifier at the very end, followed by learning your final classifier on the whole data.

3 Low High 4. DATA ALONE IS NOT ENOUGH Generalization being the goal has another major consequence: Variance Variance data alone is not enough, no matter how much of it you have. Consider learning a Boolean function of (say) 100 variables 100 6 − 10 from a million examples. There are 2 examples whose classes you don’t know. How do you figure out what High those classes are? In the absence of further information, Bias there is just no way to do this that beats flipping a coin. This observation was first made (in somewhat different form) by the philosopher David Hume over 200 years ago, but even today many mistakes in machine learning stem from failing to appreciate it. Every learner must embody some knowl- edge or assumptions beyond the data it’s given in order to generalize beyond it. This was formalized by Wolpert in Low his famous “no free lunch” theorems, according to which no learner can beat random guessing over all possible function s Bias to be learned [26]. This seems like rather depressing news. How then can we ever hope to learn anything? Luckily, the functions we want Figure 1: Bias and variance in dart-throwing. to learn in the real world are drawn uniformly from not the set of all mathematically possible functions! In fact, , and is overfitting quirks in the data. This problem is called very general assumptions—like smoothness, similar exam- the bugbear of machine learning. When your learner outputs d ples having similar classes, limited dependences, or limite a classifier that is 100% accurate on the training data but complexity—are often enough to do very well, and this is a only 50% accurate on test data, when in fact it could have large part of why machine learning has been so successful. output one that is 75% accurate on both, it has overfit. Like deduction, induction (what learners do) is a knowledge lever: it turns a small amount of input knowledge into a Everyone in machine learning knows about overfitting, but large amount of output knowledge. Induction is a vastly it comes in many forms that are not immediately obvious. more powerful lever than deduction, requiring much less in- One way to understand overfitting is by decomposing gener- put knowledge to produce useful results, but it still needs bias and variance [9]. Bias is a learner’s alization error into more than zero input knowledge to work. And, as with any tendency to consistently learn the same wrong thing. Vari- lever, the more we put in, the more we can get out. ance is the tendency to learn random things irrespective of the real signal. Figure 1 illustrates this by an analogy with A corollary of this is that one of the key criteria for choos- throwing darts at a board. A linear learner has high bias, ing a representation is which kinds of knowledge are easily because when the frontier between two classes is not a hyper- expressed in it. For example, if we have a lot of knowledge plane the learner is unable to induce it. Decision trees don’ t about what makes examples similar in our domain, instance- have this problem because they can represent any Boolean based methods may be a good choice. If we have knowl- function, but on the other hand they can suffer from high re edge about probabilistic dependencies, graphical models a variance: decision trees learned on different training sets a good fit. And if we have knowledge about what kinds of generated by the same phenomenon are often very different, preconditions are required by each class, “IF . . . THEN . . . ” when in fact they should be the same. Similar reasoning rules may be the the best option. The most useful learners applies to the choice of optimization method: beam search in this regard are those that don’t just have assumptions has lower bias than greedy search, but higher variance, be- hard-wired into them, but allow us to state them explicitly, cause it tries more hypotheses. Thus, contrary to intuition, vary them widely, and incorporate them automatically into a more powerful learner is not necessarily better than a less the learning (e.g., using first-order logic [22] or grammars powerful one. [6]). 1 Even though the true classifier Figure 2 illustrates this. In retrospect, the need for knowledge in learning should not is a set of rules, with up to 1000 examples naive Bayes is be surprising. Machine learning is not magic; it can’t get more accurate than a rule learner. This happens despite something from nothing. What it does is get more from naive Bayes’s false assumption that the frontier is linear! less. Programming, like all engineering, is a lot of work: Situations like this are common in machine learning: strong we have to build everything from scratch. Learning is more false assumptions can be better than weak true ones, because like farming, which lets nature do most of the work. Farm- a learner with the latter needs more data to avoid overfitting . ers combine seeds with nutrients to grow crops. Learners combine knowledge with data to grow programs. 1 Training examples consist of 64 Boolean features and a 5. OVERFITTING HAS MANY FACES Boolean class computed from them according to a set of “IF What if the knowledge and data we have are not sufficient . . . THEN . . . ” rules. The curves are the average of 100 runs to completely determine the correct classifier? Then we run with different randomly generated sets of rules. Error bars the risk of just hallucinating a classifier (or parts of it) th at are two standard deviations. See Domingos and Pazzani [11] for details. is not grounded in reality, and is simply encoding random

4 Cross-validation can help to combat overfitting, for example 80 by using it to choose the best size of decision tree to learn. Bayes But it’s no panacea, since if we use it to make too many C4.5 75 parameter choices it can itself start to overfit [18]. 70 Besides cross-validation, there are many methods to combat overfitting. The most popular one is adding a regulariza- 65 to the evaluation function. This can, for exam- tion term ring ple, penalize classifiers with more structure, thereby favo 60 smaller ones with less room to overfit. Another option is to re perform a statistical significance test like chi-square befo 55 Test-Set Accuracy (%) adding new structure, to decide whether the distribution of the class really is different with and without this structure . 50 These techniques are particularly useful when data is very 10 10000 1000 100 t scarce. Nevertheless, you should be skeptical of claims tha a particular technique “solves” the overfitting problem. It’ s Number of Examples easy to avoid overfitting (variance) by falling into the op- ng posite error of underfitting (bias). Simultaneously avoidi Figure 2: Naive Bayes can outperform a state-of- both requires learning a perfect classifier, and short of know- the-art rule learner (C4.5rules) even when the true ing it in advance there is no single technique that will always classifier is a set of rules. do best (no free lunch). trillion examples, the latter covers only a fraction of about A common misconception about overfitting is that it is caused 18 − 10 of the input space. This is what makes machine learn- by noise, like training examples labeled with the wrong class . ing both necessary and hard. This can indeed aggravate overfitting, by making the learner draw a capricious frontier to keep those examples on what More seriously, the similarity-based reasoning that machin e it thinks is the right side. But severe overfitting can occur learning algorithms depend on (explicitly or implicitly) bre aks even in the absence of noise. For instance, suppose we learn a down in high dimensions. Consider a nearest neighbor clas- s Boolean classifier that is just the disjunction of the example sifier with Hamming distance as the similarity measure, and si- labeled“true”in the training set. (In other words, the clas suppose the class is just x ∧ . If there are no other fea- x 1 2 fier is a Boolean formula in disjunctive normal form, where tures, this is an easy problem. But if there are 98 irrele- each term is the conjunction of the feature values of one x vant features , . . . , x , the noise from them completely 3 100 - specific training example). This classifier gets all the train x and x , and nearest neighbor effec- swamps the signal in 2 1 ing examples right and every positive test example wrong, tively makes random predictions. regardless of whether the training data is noisy or not. Even more disturbing is that nearest neighbor still has a The problem of multiple testing [14] is closely related to over- e problem even if all 100 features are relevant! This is becaus fitting. Standard statistical tests assume that only one hy- in high dimensions all examples look alike. Suppose, for pothesis is being tested, but modern learners can easily tes t instance, that examples are laid out on a regular grid, and millions before they are done. As a result what looks signif- ’s x -dimensional, d . If the grid is x consider a test example t t icant may in fact not be. For example, a mutual fund that 2 nearest examples are all at the same distance from it. d beats the market ten years in a row looks very impressive, So as the dimensionality increases, more and more examples until you realize that, if there are 1000 funds and each has , until the choice of nearest x become nearest neighbors of t a 50% chance of beating the market on any given year, it’s neighbor (and therefore of class) is effectively random. quite likely that one will succeed all ten times just by luck. This problem can be combatted by correcting the signifi- This is only one instance of a more general problem with cance tests to take the number of hypotheses into account, high dimensions: our intuitions, which come from a three- but this can lead to underfitting. A better approach is to dimensional world, often do not apply in high-dimensional s, control the fraction of falsely accepted non-null hypothese ones. In high dimensions, most of the mass of a multivari- false discovery rate known as the [3]. ate Gaussian distribution is not near the mean, but in an increasingly distant “shell” around it; and most of the vol- 6. INTUITION FAILS IN HIGH ume of a high-dimensional orange is in the skin, not the pulp. DIMENSIONS If a constant number of examples is distributed uniformly in After overfitting, the biggest problem in machine learning a high-dimensional hypercube, beyond some dimensionality . This expression was coined is the curse of dimensionality most examples are closer to a face of the hypercube than by Bellman in 1961 to refer to the fact that many algo- to their nearest neighbor. And if we approximate a hyper- rithms that work fine in low dimensions become intractable sphere by inscribing it in a hypercube, in high dimensions when the input is high-dimensional. But in machine learn- almost all the volume of the hypercube is outside the hyper- ing it refers to much more. Generalizing correctly becomes sphere. This is bad news for machine learning, where shapes exponentially harder as the dimensionality (number of fea- of one type are often approximated by shapes of another. tures) of the examples grows, because a fixed-size training set covers a dwindling fraction of the input space. Even with Building a classifier in two or three dimensions is easy; we a moderate dimension of 100 and a huge training set of a

5 d 2 total number of functions is 2 . And even for hypothesis can find a reasonable frontier between examples of different spaces that are “merely” exponential, the bound is still very classes just by visual inspection. (It’s even been said that i f loose, because the union bound is very pessimistic. For ex- people could see in high dimensions machine learning would ample, if there are 100 Boolean features and the hypothesis not be necessary.) But in high dimensions it’s hard to under- space is decision trees with up to 10 levels, to guarantee stand what is happening. This in turn makes it difficult to = 1% in the bound above we need half a million ex- ǫ = δ design a good classifier. Naively, one might think that gath- amples. But in practice a small fraction of this suffices for ering more features never hurts, since at worst they provide accurate learning. no new information about the class. But in fact their bene- fits may be outweighed by the curse of dimensionality. Further, we have to be careful about what a bound like this means. For instance, it does not say that, if your learner Fortunately, there is an effect that partly counteracts the returned a hypothesis consistent with a particular training y.” curse, which might be called the“blessing of non-uniformit set, then this hypothesis probably generalizes well. What In most applications examples are not spread uniformly thr- it says is that, given a large enough training set, with high oughout the instance space, but are concentrated on or near t probability your learner will either return a hypothesis tha k -nearest neigh- a lower-dimensional manifold. For example, generalizes well or be unable to find a consistent hypothesis. bor works quite well for handwritten digit recognition even The bound also says nothing about how to select a good hy- though images of digits have one dimension per pixel, be- pothesis space. It only tells us that, if the hypothesis space cause the space of digit images is much smaller than the contains the true classifier, then the probability that the space of all possible images. Learners can implicitly take learner outputs a bad classifier decreases with training set advantage of this lower effective dimension, or algorithms size. If we shrink the hypothesis space, the bound improves, for explicitly reducing the dimensionality can be used (e.g. , but the chances that it contains the true classifier shrink [23]). also. (There are bounds for the case where the true classi- fier is not in the hypothesis space, but similar consideration s apply to them.) 7. THEORETICAL GUARANTEES ARE NOT WHAT THEY SEEM Another common type of theoretical guarantee is asymp- . Machine learning papers are full of theoretical guarantees totic: given infinite data, the learner is guaranteed to outp ut The most common type is a bound on the number of ex- the correct classifier. This is reassuring, but it would be ra sh amples needed to ensure good generalization. What should to choose one learner over another because of its asymptotic you make of these guarantees? First of all, it’s remarkable guarantees. In practice, we are seldom in the asymptotic that they are even possible. Induction is traditionally con - regime (also known as “asymptopia”). And, because of the trasted with deduction: in deduction you can guarantee that bias-variance tradeoff we discussed above, if learner A is be t- the conclusions are correct; in induction all bets are off. Or ter than learner B given infinite data, B is often better than such was the conventional wisdom for many centuries. One A given finite data. of the major developments of recent decades has been the realization that in fact we can have guarantees on the re- The main role of theoretical guarantees in machine learning sults of induction, particularly if we’re willing to settle for of is not as a criterion for practical decisions, but as a source probabilistic guarantees. s understanding and driving force for algorithm design. In thi capacity, they are quite useful; indeed, the close interplay The basic argument is remarkably simple [5]. Let’s say a of theory and practice is one of the main reasons machine ǫ classifier is bad if its true error rate is greater than . Then learning has made so much progress over the years. But the probability that a bad classifier is consistent with ran- n caveat emptor : learning is a complex phenomenon, and just n ǫ . ) − dom, independent training examples is less than (1 because a learner has a theoretical justification and works in b Let be the number of bad classifiers in the learner’s hy- practice doesn’t mean the former is the reason for the latter . pothesis space H . The probability that at least one of them n , by the union bound. ) − (1 b is consistent is less than ǫ 8. FEATURE ENGINEERING IS THE KEY Assuming the learner always returns a consistent classifier, At the end of the day, some machine learning projects suc- the probability that this classifier is bad is then less than n ceed and some fail. What makes the difference? Easily ≤ | H | . So H | (1 − ǫ ) | , where we have used the fact that b the most important factor is the features used. If you have , it suffices to δ if we want this probability to be less than ) ( 1 1 | H . | + ln ln many independent features that each correlate well with the H make n > ln( δ/ | | ≥ ) / ln(1 − ǫ ) δ ǫ class, learning is easy. On the other hand, if the class is Unfortunately, guarantees of this type have to be taken with a very complex function of the features, you may not be a large grain of salt. This is because the bounds obtained in able to learn it. Often, the raw data is not in a form that is this way are usually extremely loose. The wonderful feature amenable to learning, but you can construct features from it of the bound above is that the required number of examples that are. This is typically where most of the effort in a ma- | only grows logarithmically with /δ . Unfortunately, H | and 1 chine learning project goes. It is often also one of the most exponential in most interesting hypothesis spaces are doubly interesting parts, where intuition, creativity and “black a rt” the number of features d , which still leaves us needing a num- are as important as the technical stuff. ber of examples exponential in d . For example, consider the space of Boolean functions of d Boolean variables. If there First-timers are often surprised by how little time in a ma- e are possible different examples, there are 2 possible dif- e chine learning project is spent actually doing machine lear n- d ferent functions, so since there are 2 possible examples, the ing. But it makes sense if you consider how time-consuming

6 N. Bayes it is to gather data, integrate it, clean it and pre-process i t, and how much trial and error can go into feature design. Also, machine learning is not a one-shot process of build- kNN ing a data set and running a learner, but rather an iterative SVM process of running the learner, analyzing the results, modi- fying the data and/or the learner, and repeating. Learning is often the quickest part of this, but that’s because we’ve e already mastered it pretty well! Feature engineering is mor difficult because it’s domain-specific, while learners can be largely general-purpose. However, there is no sharp fronti er between the two, and this is another reason the most useful . learners are those that facilitate incorporating knowledge D. Tree Of course, one of the holy grails of machine learning is to automate more and more of the feature engineering process. Figure 3: Very different frontiers can yield similar One way this is often done today is by automatically gen- are training examples of − and + class predictions. ( erating large numbers of candidate features and selecting two classes.) the best by (say) their information gain with respect to the class. But bear in mind that features that look irrelevant in isolation may be relevant in combination. For example, and neural networks. But in fact propositional rules are input features, each of them by k if the class is an XOR of readily encoded as neural networks, and similar relation- itself carries no information about the class. (If you want ships hold between other representations. All learners es- to annoy machine learners, bring up XOR.) On the other sentially work by grouping nearby examples into the same hand, running a learner with a very large number of fea- class; the key difference is in the meaning of “nearby.” With tures to find out which ones are useful in combination may non-uniformly distributed data, learners can produce wide ly be too time-consuming, or cause overfitting. So there is ul- different frontiers while still making the same predictions i n timately no replacement for the smarts you put into feature the regions that matter (those with a substantial number of engineering. training examples, and therefore also where most test ex- amples are likely to appear). This also helps explain why 9. MORE DATA BEATS A CLEVERER powerful learners can be unstable but still accurate. Fig- ure 3 illustrates this in 2-D; the effect is much stronger in ALGORITHM high dimensions. Suppose you’ve constructed the best set of features you e can, but the classifiers you’re getting are still not accurat As a rule, it pays to try the simplest learners first (e.g., naiv e enough. What can you do now? There are two main choices: Bayes before logistic regression, -nearest neighbor before k design a better learning algorithm, or gather more data support vector machines). More sophisticated learners are (more examples, and possibly more raw features, subject to seductive, but they are usually harder to use, because they the curse of dimensionality). Machine learning researchers have more knobs you need to turn to get good results, and are mainly concerned with the former, but pragmatically because their internals are more opaque. the quickest path to success is often to just get more data. As a rule of thumb, a dumb algorithm with lots and lots of Learners can be divided into two major types: those whose data beats a clever one with modest amounts of it. (After representation has a fixed size, like linear classifiers, and all, machine learning is all about letting data do the heavy - those whose representation can grow with the data, like deci lifting.) c sion trees. (The latter are sometimes called non-parametri learners, but this is somewhat unfortunate, since they usu- This does bring up another problem, however: scalability. ally wind up learning many more parameters than paramet- In most of computer science, the two main limited resources ric ones.) Fixed-size learners can only take advantage of so are time and memory. In machine learning, there is a third much data. (Notice how the accuracy of naive Bayes asymp- one: training data. Which one is the bottleneck has changed totes at around 70% in Figure 2.) Variable-size learners can from decade to decade. In the 1980’s it tended to be data. in principle learn any function given sufficient data, but in Today it is often time. Enormous mountains of data are practice they may not, because of limitations of the algo- available, but there is not enough time to process it, so it rithm (e.g., greedy search falls into local optima) or compu - goes unused. This leads to a paradox: even though in prin- tational cost. Also, because of the curse of dimensionality , ciple more data means that more complex classifiers can be no existing amount of data may be enough. For these rea- learned, in practice simpler classifiers wind up being used, sons, clever algorithms—those that make the most of the because complex ones take too long to learn. Part of the data and computing resources available—often pay off in answer is to come up with fast ways to learn complex classi- the end, provided you’re willing to put in the effort. There fiers, and indeed there has been remarkable progress in this is no sharp frontier between designing learners and learn- direction (e.g., [12]). ing classifiers; rather, any given piece of knowledge could be encoded in the learner or learned from data. So machine Part of the reason using cleverer algorithms has a smaller learning projects often wind up having a significant compo- payoff than you might expect is that, to a first approxima- nent of learner design, and practitioners need to have some tion, they all do the same. This is surprising when you expertise in it [13]. consider representations as different as, say, sets of rules

7 those produced by (say) bagging or boosting: the latter are In the end, the biggest bottleneck is not data or CPU cycles, fairly even, while the former are extremely skewed, to the but human cycles. In research papers, learners are typically point where the single highest-weight classifier usually do m- compared on measures of accuracy and computational cost. inates, making BMA effectively equivalent to just selecting But human effort saved and insight gained, although harder it [8]. A practical consequence of this is that, while model to measure, are often more important. This favors learn- ensembles are a key part of the machine learning toolkit, ers that produce human-understandable output (e.g., rule BMA is seldom worth the trouble. sets). And the organizations that make the most of ma- e chine learning are those that have in place an infrastructur that makes experimenting with many different learners, data sources and learning problems easy and efficient, and where there is a close collaboration between machine learning ex- perts and application domain ones. 11. SIMPLICITY DOES NOT IMPLY ACCURACY 10. LEARN MANY MODELS, NOT JUST Occam’s razor famously states that entities should not be ONE multiplied beyond necessity. In machine learning, this is o f- ten taken to mean that, given two classifiers with the same In the early days of machine learning, everyone had their fa- training error, the simpler of the two will likely have the vorite learner, together with some reasons to believe a priori - lowest test error. Purported proofs of this claim appear reg in its superiority. Most effort went into trying many varia- ularly in the literature, but in fact there are many counter- tions of it and selecting the best one. Then systematic em- examples to it, and the “no free lunch” theorems imply it pirical comparisons showed that the best learner varies fro m cannot be true. application to application, and systems containing many dif - ferent learners started to appear. Effort now went into try- We saw one counter-example in the previous section: model st ing many variations of many learners, and still selecting ju ensembles. The generalization error of a boosted ensem- the best one. But then researchers noticed that, if instead ble continues to improve by adding classifiers even after the of selecting the best variation found, we combine many vari- training error has reached zero. Another counter-example is ations, the results are better—often much better—and at support vector machines, which can effectively have an infi- little extra effort for the user. nite number of parameters without overfitting. Conversely, ax )) can discriminate an arbitrarily the function sign(sin( model ensembles is now standard [1]. In the Creating such large, arbitrarily labeled set of points on the x axis, even simplest technique, called bagging , we simply generate ran- though it has only one parameter [24]. Thus, contrary to in- dom variations of the training set by resampling, learn a tuition, there is no necessary connection between the numbe r classifier on each, and combine the results by voting. This of parameters of a model and its tendency to overfit. y works because it greatly reduces variance while only slightl , training examples have weights, increasing bias. In boosting A more sophisticated view instead equates complexity with and these are varied so that each new classifier focuses on the size of the hypothesis space, on the basis that smaller the examples the previous ones tended to get wrong. In spaces allow hypotheses to be represented by shorter codes. , the outputs of individual classifiers become the in- stacking Bounds like the one in the section on theoretical guarantees puts of a “higher-level” learner that figures out how best to above might then be viewed as implying that shorter hy- combine them. potheses generalize better. This can be further refined by assigning shorter codes to the hypothesis in the space that Many other techniques exist, and the trend is toward larger we have some preference for. But viewing this as a priori and larger ensembles. In the Netflix prize, teams from all “proof” of a tradeoff between accuracy and simplicity is cir- over the world competed to build the best video recom- cular reasoning: we made the hypotheses we prefer simpler mender system (http://netflixprize.com). As the competi- by design, and if they are accurate it’s because our prefer- tion progressed, teams found that they obtained the best ences are accurate, not because the hypotheses are “simple” results by combining their learners with other teams’, and in the representation we chose. merged into larger and larger teams. The winner and runner- up were both stacked ensembles of over 100 learners, and s A further complication arises from the fact that few learner combining the two ensembles further improved the results. search their hypothesis space exhaustively. A learner with a Doubtless we will see even larger ones in the future. larger hypothesis space that tries fewer hypotheses from it is less likely to overfit than one that tries more hypotheses Model ensembles should not be confused with Bayesian model from a smaller space. As Pearl [19] points out, the size of averaging (BMA). BMA is the theoretically optimal approach the hypothesis space is only a rough guide to what really to learning [4]. In BMA, predictions on new examples are matters for relating training and test error: the procedure made by averaging the individual predictions of all classifiers by which a hypothesis is chosen. in the hypothesis space, weighted by how well the classifiers explain the training data and how much we believe in them Domingos [7] surveys the main arguments and evidence on a priori . Despite their superficial similarities, ensembles and the issue of Occam’s razor in machine learning. The conclu- BMA are very different. Ensembles change the hypothesis sion is that simpler hypotheses should be preferred because ons space (e.g., from single decision trees to linear combinati simplicity is a virtue in its own right, not because of a hy- of them), and can take a wide variety of forms. BMA assigns pothetical connection with accuracy. This is probably what weights to the hypotheses in the original space according to Occam meant in the first place. a fixed formula. BMA weights are extremely different from

8 Many researchers believe that causality is only a convenien t 12. REPRESENTABLE DOES NOT IMPLY fiction. For example, there is no notion of causality in phys- LEARNABLE ical laws. Whether or not causality really exists is a deep rners Essentially all representations used in variable-size lea philosophical question with no definitive answer in sight, have associated theorems of the form “Every function can , but the practical points for machine learners are two. First be represented, or approximated arbitrarily closely, using whether or not we call them “causal,” we would like to pre- this representation.” Reassured by this, fans of the repre- n dict the effects of our actions, not just correlations betwee sentation often proceed to ignore all others. However, just l observable variables. Second, if you can obtain experimenta because a function can be represented does not mean it can data (for example by randomly assigning visitors to different be learned. For example, standard decision tree learners versions of a Web site), then by all means do so [15]. cannot learn trees with more leaves than there are training examples. In continuous spaces, representing even simple 14. CONCLUSION functions using a fixed set of primitives often requires an Like any discipline, machine learning has a lot of “folk wis- infinite number of components. Further, if the hypothesis dom” that can be hard to come by, but is crucial for suc- space has many local optima of the evaluation function, as . cess. This article summarized some of the most salient items is often the case, the learner may not find the true function A good place to learn more is my book The Master Algo- even if it is representable. Given finite data, time and mem- , a non-technical introduction to machine learning [10]. rithm - ory, standard learners can learn only a tiny subset of all pos For a complete online machine learning course, check out sible functions, and these subsets are different for learner s http://www.cs.washington.edu/homes/pedrod/class. The re s with different representations. Therefore the key question i is also a treasure trove of machine learning lectures at http :- not “Can it be represented?”, to which the answer is often //www.videolectures.net. A widely used open source ma- trivial, but “Can it be learned?” And it pays to try different chine learning toolkit is Weka [25]. Happy learning! learners (and possibly combine them). Some representations are exponentially more compact than 15. REFERENCES others for some functions. As a result, they may also re- [1] E. Bauer and R. Kohavi. An empirical comparison of quire exponentially less data to learn those functions. Many voting classification algorithms: Bagging, boosting learners work by forming linear combinations of simple ba- , 36:105–142, 1999. and variants. Machine Learning sis functions. For example, support vector machines form [2] Y. Bengio. Learning deep architectures for AI. combinations of kernels centered at some of the training ex- , Foundations and Trends in Machine Learning bits n amples (the support vectors). Representing parity of 2:1–127, 2009. n in this way requires 2 basis functions. But using a repre- [3] Y. Benjamini and Y. Hochberg. Controlling the false sentation with more layers (i.e., more steps between input discovery rate: A practical and powerful approach to and output), parity can be encoded in a linear-size classifie r. Journal of the Royal Statistical multiple testing. Finding methods to learn these deeper representations is on e , 57:289–300, 1995. Society, Series B of the major research frontiers in machine learning [2]. . Bayesian Theory [4] J. M. Bernardo and A. F. M. Smith. Wiley, New York, NY, 1994. [5] A. Blumer, A. Ehrenfeucht, D. Haussler, and M. K. Warmuth. Occam’s razor. Information Processing 13. CORRELATION DOES NOT IMPLY Letters , 24:377–380, 1987. CAUSATION [6] W. W. Cohen. Grammatically biased learning: The point that correlation does not imply causation is made Learning logic programs using an explicit antecedent so often that it is perhaps not worth belaboring. But, even description language. Artificial Intelligence , though learners of the kind we have been discussing can only 68:303–366, 1994. learn correlations, their results are often treated as repr e- [7] P. Domingos. The role of Occam’s razor in knowledge senting causal relations. Isn’t this wrong? If so, then why Data Mining and Knowledge Discovery , discovery. do people do it? 3:409–425, 1999. [8] P. Domingos. Bayesian averaging of classifiers and the More often than not, the goal of learning predictive mod- Proceedings of the Seventeenth overfitting problem. In els is to use them as guides to action. If we find that beer International Conference on Machine Learning , pages and diapers are often bought together at the supermarket, 223–230, Stanford, CA, 2000. Morgan Kaufmann. then perhaps putting beer next to the diaper section will [9] P. Domingos. A unified bias-variance decomposition increase sales. (This is a famous example in the world of and its applications. In Proceedings of the Seventeenth data mining.) But short of actually doing the experiment International Conference on Machine Learning , pages it’s difficult to tell. Machine learning is usually applied to 231–238, Stanford, CA, 2000. Morgan Kaufmann. data, where the predictive variables are not observational [10] P. Domingos. The Master Algorithm: How the Quest experimental under the control of the learner, as opposed to for the Ultimate Learning Machine Will Remake Our data, where they are. Some learning algorithms can poten- . Basic Books, New York, NY, 2015. World tially extract causal information from observational data, [11] P. Domingos and M. Pazzani. On the optimality of the but their applicability is rather restricted [20]. On the oth er simple Bayesian classifier under zero-one loss. Machine hand, correlation is a sign of a potential causal connection , , 29:103–130, 1997. Learning and we can use it as a guide to further investigation (for [12] G. Hulten and P. Domingos. Mining complex models example, trying to understand what the causal chain might from arbitrarily large databases in constant time. In be).

9 Proceedings of the Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining , pages 525–531, Edmonton, Canada, 2002. ACM Press. [13] D. Kibler and P. Langley. Machine learning as an experimental science. In Proceedings of the Third , London, UK, European Working Session on Learning 1988. Pitman. . Multiple Comparisons [14] A. J. Klockars and G. Sax. Sage, Beverly Hills, CA, 1986. [15] R. Kohavi, R. Longbotham, D. Sommerfield, and R. Henne. Controlled experiments on the Web: Survey and practical guide. Data Mining and Knowledge , 18:140–181, 2009. Discovery [16] J. Manyika, M. Chui, B. Brown, J. Bughin, R. Dobbs, C. Roxburgh, and A. Byers. Big data: The next frontier for innovation, competition, and productivity. Technical report, McKinsey Global Institute, 2011. Machine Learning . McGraw-Hill, New [17] T. M. Mitchell. York, NY, 1997. [18] A. Y. Ng. Preventing “overfitting” of cross-validation Proceedings of the Fourteenth International data. In Conference on Machine Learning , pages 245–253, Nashville, TN, 1997. Morgan Kaufmann. [19] J. Pearl. On the connection between the complexity and credibility of inferred models. International Journal of General Systems , 4:255–264, 1978. Causality: Models, Reasoning, and Inference [20] J. Pearl. . Cambridge University Press, Cambridge, UK, 2000. [21] J. R. Quinlan. C4.5: Programs for Machine Learning . Morgan Kaufmann, San Mateo, CA, 1993. [22] M. Richardson and P. Domingos. Markov logic networks. Machine Learning , 62:107–136, 2006. [23] J. Tenenbaum, V. Silva, and J. Langford. A global geometric framework for nonlinear dimensionality reduction. Science , 290:2319–2323, 2000. [24] V. N. Vapnik. The Nature of Statistical Learning Theory . Springer, New York, NY, 1995. [25] I. Witten, E. Frank, and M. Hall. Data Mining: Practical Machine Learning Tools and Techniques . Morgan Kaufmann, San Mateo, CA, 3rd edition, 2011. [26] D. Wolpert. The lack of a priori distinctions between learning algorithms. Neural Computation , 8:1341–1390, 1996.