# cacm12

## Transcript

**Table 1: The three components of learning algorithms.**

| Representation | Evaluation | Optimization |
| --- | --- | --- |
| Instances (K-nearest neighbor, Support vector machines) | Accuracy/Error rate | Combinatorial optimization (Greedy search, Beam search, Branch-and-bound) |
| Hyperplanes (Naive Bayes, Logistic regression) | Precision and recall | Continuous optimization: Unconstrained (Gradient descent, Conjugate gradient, Quasi-Newton methods) |
| Decision trees | Squared error | Continuous optimization: Constrained (Linear programming, Quadratic programming) |
| Sets of rules (Propositional rules, Logic programs) | Likelihood | |
| Neural networks | Posterior probability | |
| Graphical models (Bayesian networks, Conditional random fields) | Information gain | |
| | K-L divergence | |
| | Cost/Utility | |
| | Margin | |

...with one branch for each feature value, and have class predictions at the leaves. Algorithm 1 shows a bare-bones decision tree learner for Boolean domains, using information gain and greedy search [21]. InfoGain(x_j, y) is the mutual information between feature x_j and the class y. MakeNode(x, c_0, c_1) returns a node that tests feature x and has c_0 as the child for x = 0 and c_1 as the child for x = 1.

```
Algorithm 1: LearnDT(TrainSet)

if all examples in TrainSet have the same class y* then
    return MakeLeaf(y*)
if no feature x_j has InfoGain(x_j, y) > 0 then
    y* ← Most frequent class in TrainSet
    return MakeLeaf(y*)
x* ← argmax_{x_j} InfoGain(x_j, y)
TS_0 ← Examples in TrainSet with x* = 0
TS_1 ← Examples in TrainSet with x* = 1
return MakeNode(x*, LearnDT(TS_0), LearnDT(TS_1))
```

Of course, not all combinations of one component from each column of Table 1 make equal sense. For example, discrete representations naturally go with combinatorial optimization, and continuous ones with continuous optimization. Nevertheless, many learners have both discrete and continuous components, and in fact the day may not be far when every single possible combination has appeared in some learner!

Most textbooks are organized by representation, and it's easy to overlook the fact that the other components are equally important. There is no simple recipe for choosing each component, but the next sections touch on some of the key issues. And, as we will see below, some choices in a machine learning project may be even more important than the choice of learner.

### 3. IT'S GENERALIZATION THAT COUNTS

The fundamental goal of machine learning is to generalize beyond the examples in the training set. This is because, no matter how much data we have, it is very unlikely that we will see those exact examples again at test time. (Notice that, if there are 100,000 words in the dictionary, the spam filter described above has 2^100,000 possible different inputs.) Doing well on the training set is easy (just memorize the examples).

The most common mistake among machine learning beginners is to test on the training data and have the illusion of success. If the chosen classifier is then tested on new data, it is often no better than random guessing. So, if you hire someone to build a classifier, be sure to keep some of the data to yourself and test the classifier they give you on it. Conversely, if you've been hired to build a classifier, set some of the data aside from the beginning, and only use it to test your chosen classifier at the very end, followed by learning your final classifier on the whole data.

Contamination of your classifier by test data can occur in insidious ways, e.g., if you use test data to tune parameters and do a lot of tuning. (Machine learning algorithms have lots of knobs, and success often comes from twiddling them a lot, so this is a real concern.) Of course, holding out data reduces the amount available for training. This can be mitigated by doing cross-validation: randomly dividing your training data into (say) ten subsets, holding out each one while training on the rest, testing each learned classifier on the examples it did not see, and averaging the results to see how well the particular parameter setting does.

In the early days of machine learning, the need to keep training and test data separate was not widely appreciated. This was partly because, if the learner has a very limited representation (e.g., hyperplanes), the difference between training and test error may not be large. But with very flexible classifiers (e.g., decision trees), or even with linear classifiers with a lot of features, strict separation is mandatory.

Notice that generalization being the goal has an interesting consequence for machine learning. Unlike in most other optimization problems, we don't have access to the function we want to optimize! We have to use training error as a surrogate for test error, and this is fraught with danger. How to deal with it is addressed in some of the next sections. On the positive side, since the objective function is only a proxy for the true goal, we may not need to fully optimize it; in fact, a local optimum returned by simple greedy search may be better than the global optimum.
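As a minimal sketch, Algorithm 1 can be made runnable for Boolean features. The function names (`learn_dt`, `info_gain`, `predict`) and the tuple-based tree representation are our own choices, not from the paper; examples are assumed to be (features, label) pairs with 0/1 values:

```python
# A runnable sketch of the LearnDT pseudocode for Boolean domains,
# using information gain and greedy search.
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def info_gain(examples, j):
    # Mutual information between Boolean feature j and the class.
    labels = [y for _, y in examples]
    split = {0: [], 1: []}
    for x, y in examples:
        split[x[j]].append(y)
    remainder = sum(len(part) / len(examples) * entropy(part)
                    for part in split.values() if part)
    return entropy(labels) - remainder

def learn_dt(examples, features):
    labels = [y for _, y in examples]
    if len(set(labels)) == 1:          # all examples have the same class y*
        return labels[0]               # MakeLeaf(y*)
    if not features:                   # no feature left to test
        return Counter(labels).most_common(1)[0][0]
    gains = {j: info_gain(examples, j) for j in features}
    best = max(gains, key=gains.get)   # x* <- argmax InfoGain(x_j, y)
    if gains[best] <= 0:               # no feature has positive gain
        return Counter(labels).most_common(1)[0][0]
    ts0 = [(x, y) for x, y in examples if x[best] == 0]
    ts1 = [(x, y) for x, y in examples if x[best] == 1]
    rest = [j for j in features if j != best]
    # MakeNode(x*, LearnDT(TS0), LearnDT(TS1)) as a (feature, child0, child1) tuple
    return (best, learn_dt(ts0, rest), learn_dt(ts1, rest))

def predict(tree, x):
    while isinstance(tree, tuple):     # internal node: follow the tested feature
        j, c0, c1 = tree
        tree = c1 if x[j] == 1 else c0
    return tree                        # leaf: class prediction
```

Trained on the four examples of the conjunction x_1 ∧ x_2, for instance, it recovers the target function exactly.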

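The cross-validation procedure just described fits in a few lines. This is a sketch with names of our own choosing: `train` and `accuracy` are placeholders for any learner and any evaluation function.

```python
# A minimal sketch of k-fold cross-validation: randomly divide the data into
# k subsets, hold each one out in turn, train on the rest, test on the
# held-out part, and average the results.
import random

def cross_validate(data, train, accuracy, k=10, seed=0):
    data = list(data)                       # don't shuffle the caller's list
    random.Random(seed).shuffle(data)
    folds = [data[i::k] for i in range(k)]
    scores = []
    for i in range(k):
        held_out = folds[i]
        rest = [ex for j, fold in enumerate(folds) if j != i for ex in fold]
        model = train(rest)                 # train on the other k-1 subsets
        scores.append(accuracy(model, held_out))
    return sum(scores) / k                  # average held-out performance
```

In practice this is run once per candidate parameter setting, and the setting with the best average held-out score is kept.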
### 4. DATA ALONE IS NOT ENOUGH

Generalization being the goal has another major consequence: data alone is not enough, no matter how much of it you have. Consider learning a Boolean function of (say) 100 variables from a million examples. There are 2^100 - 10^6 examples whose classes you don't know. How do you figure out what those classes are? In the absence of further information, there is just no way to do this that beats flipping a coin. This observation was first made (in somewhat different form) by the philosopher David Hume over 200 years ago, but even today many mistakes in machine learning stem from failing to appreciate it. Every learner must embody some knowledge or assumptions beyond the data it's given in order to generalize beyond it. This was formalized by Wolpert in his famous "no free lunch" theorems, according to which no learner can beat random guessing over all possible functions to be learned [26].

This seems like rather depressing news. How then can we ever hope to learn anything? Luckily, the functions we want to learn in the real world are not drawn uniformly from the set of all mathematically possible functions! In fact, very general assumptions—like smoothness, similar examples having similar classes, limited dependences, or limited complexity—are often enough to do very well, and this is a large part of why machine learning has been so successful.

Like deduction, induction (what learners do) is a knowledge lever: it turns a small amount of input knowledge into a large amount of output knowledge. Induction is a vastly more powerful lever than deduction, requiring much less input knowledge to produce useful results, but it still needs more than zero input knowledge to work. And, as with any lever, the more we put in, the more we can get out.

A corollary of this is that one of the key criteria for choosing a representation is which kinds of knowledge are easily expressed in it. For example, if we have a lot of knowledge about what makes examples similar in our domain, instance-based methods may be a good choice. If we have knowledge about probabilistic dependencies, graphical models are a good fit. And if we have knowledge about what kinds of preconditions are required by each class, "IF . . . THEN . . ." rules may be the best option. The most useful learners in this regard are those that don't just have assumptions hard-wired into them, but allow us to state them explicitly, vary them widely, and incorporate them automatically into the learning (e.g., using first-order logic [22] or grammars [6]).

In retrospect, the need for knowledge in learning should not be surprising. Machine learning is not magic; it can't get something from nothing. What it does is get more from less. Programming, like all engineering, is a lot of work: we have to build everything from scratch. Learning is more like farming, which lets nature do most of the work. Farmers combine seeds with nutrients to grow crops. Learners combine knowledge with data to grow programs.

### 5. OVERFITTING HAS MANY FACES

What if the knowledge and data we have are not sufficient to completely determine the correct classifier? Then we run the risk of just hallucinating a classifier (or parts of it) that is not grounded in reality, and is simply encoding random quirks in the data. This problem is called overfitting, and is the bugbear of machine learning. When your learner outputs a classifier that is 100% accurate on the training data but only 50% accurate on test data, when in fact it could have output one that is 75% accurate on both, it has overfit.

Everyone in machine learning knows about overfitting, but it comes in many forms that are not immediately obvious. One way to understand overfitting is by decomposing generalization error into bias and variance [9]. Bias is a learner's tendency to consistently learn the same wrong thing. Variance is the tendency to learn random things irrespective of the real signal. Figure 1 illustrates this by an analogy with throwing darts at a board. A linear learner has high bias, because when the frontier between two classes is not a hyperplane the learner is unable to induce it. Decision trees don't have this problem because they can represent any Boolean function, but on the other hand they can suffer from high variance: decision trees learned on different training sets generated by the same phenomenon are often very different, when in fact they should be the same. Similar reasoning applies to the choice of optimization method: beam search has lower bias than greedy search, but higher variance, because it tries more hypotheses. Thus, contrary to intuition, a more powerful learner is not necessarily better than a less powerful one.

[Figure 1: Bias and variance in dart-throwing. Four dartboards arranged by low/high bias and low/high variance; the image itself is not reproduced here.]

Figure 2 illustrates this.^1 Even though the true classifier is a set of rules, with up to 1000 examples naive Bayes is more accurate than a rule learner. This happens despite naive Bayes's false assumption that the frontier is linear! Situations like this are common in machine learning: strong false assumptions can be better than weak true ones, because a learner with the latter needs more data to avoid overfitting.

^1 Training examples consist of 64 Boolean features and a Boolean class computed from them according to a set of "IF . . . THEN . . ." rules. The curves are the average of 100 runs with different randomly generated sets of rules. Error bars are two standard deviations. See Domingos and Pazzani [11] for details.
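The 100%-accurate-on-training, chance-on-test failure mode described above is easy to reproduce. The following small demonstration is our own construction, not from the paper: a "classifier" that memorizes noise labels verbatim.

```python
# Overfitting in its purest form: a classifier that memorizes the training
# set scores perfectly on it, but no better than coin-flipping on fresh data,
# because the labels here are pure noise with no signal to generalize from.
import random

rng = random.Random(0)

def random_example(n_features=30):
    x = tuple(rng.randint(0, 1) for _ in range(n_features))
    y = rng.randint(0, 1)                    # label is random: nothing to learn
    return x, y

train = [random_example() for _ in range(200)]
test = [random_example() for _ in range(2000)]

memory = dict(train)                         # the learned "model": a lookup table

def predict(x):
    return memory.get(x, rng.randint(0, 1))  # guess on inputs never seen

train_acc = sum(predict(x) == y for x, y in train) / len(train)
test_acc = sum(predict(x) == y for x, y in test) / len(test)
```

Run as-is, the memorizer scores 100% on the 200 training examples and close to chance on the 2000 fresh ones, exactly the illusion of success warned about in Section 3.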

[Figure 2: Naive Bayes can outperform a state-of-the-art rule learner (C4.5rules) even when the true classifier is a set of rules. The plot shows test-set accuracy (%), from 50 to 80, versus number of examples, from 10 to 10,000, for naive Bayes and C4.5; only the caption and axis labels are reproduced here.]

Cross-validation can help to combat overfitting, for example by using it to choose the best size of decision tree to learn. But it's no panacea, since if we use it to make too many parameter choices it can itself start to overfit [18].

Besides cross-validation, there are many methods to combat overfitting. The most popular one is adding a regularization term to the evaluation function. This can, for example, penalize classifiers with more structure, thereby favoring smaller ones with less room to overfit. Another option is to perform a statistical significance test like chi-square before adding new structure, to decide whether the distribution of the class really is different with and without this structure. These techniques are particularly useful when data is very scarce. Nevertheless, you should be skeptical of claims that a particular technique "solves" the overfitting problem. It's easy to avoid overfitting (variance) by falling into the opposite error of underfitting (bias). Simultaneously avoiding both requires learning a perfect classifier, and short of knowing it in advance there is no single technique that will always do best (no free lunch).

A common misconception about overfitting is that it is caused by noise, like training examples labeled with the wrong class. This can indeed aggravate overfitting, by making the learner draw a capricious frontier to keep those examples on what it thinks is the right side. But severe overfitting can occur even in the absence of noise. For instance, suppose we learn a Boolean classifier that is just the disjunction of the examples labeled "true" in the training set. (In other words, the classifier is a Boolean formula in disjunctive normal form, where each term is the conjunction of the feature values of one specific training example.) This classifier gets all the training examples right and every positive test example wrong, regardless of whether the training data is noisy or not.

The problem of multiple testing [14] is closely related to overfitting. Standard statistical tests assume that only one hypothesis is being tested, but modern learners can easily test millions before they are done. As a result what looks significant may in fact not be. For example, a mutual fund that beats the market ten years in a row looks very impressive, until you realize that, if there are 1000 funds and each has a 50% chance of beating the market on any given year, it's quite likely that one will succeed all ten times just by luck. This problem can be combatted by correcting the significance tests to take the number of hypotheses into account, but this can lead to underfitting. A better approach is to control the fraction of falsely accepted non-null hypotheses, known as the false discovery rate [3].

### 6. INTUITION FAILS IN HIGH DIMENSIONS

After overfitting, the biggest problem in machine learning is the curse of dimensionality. This expression was coined by Bellman in 1961 to refer to the fact that many algorithms that work fine in low dimensions become intractable when the input is high-dimensional. But in machine learning it refers to much more. Generalizing correctly becomes exponentially harder as the dimensionality (number of features) of the examples grows, because a fixed-size training set covers a dwindling fraction of the input space. Even with a moderate dimension of 100 and a huge training set of a trillion examples, the latter covers only a fraction of about 10^-18 of the input space. This is what makes machine learning both necessary and hard.

More seriously, the similarity-based reasoning that machine learning algorithms depend on (explicitly or implicitly) breaks down in high dimensions. Consider a nearest neighbor classifier with Hamming distance as the similarity measure, and suppose the class is just x_1 ∧ x_2. If there are no other features, this is an easy problem. But if there are 98 irrelevant features x_3, . . . , x_100, the noise from them completely swamps the signal in x_1 and x_2, and nearest neighbor effectively makes random predictions.

Even more disturbing is that nearest neighbor still has a problem even if all 100 features are relevant! This is because in high dimensions all examples look alike. Suppose, for instance, that examples are laid out on a regular grid, and consider a test example x_t. If the grid is d-dimensional, x_t's 2d nearest examples are all at the same distance from it. So as the dimensionality increases, more and more examples become nearest neighbors of x_t, until the choice of nearest neighbor (and therefore of class) is effectively random.

This is only one instance of a more general problem with high dimensions: our intuitions, which come from a three-dimensional world, often do not apply in high-dimensional ones. In high dimensions, most of the mass of a multivariate Gaussian distribution is not near the mean, but in an increasingly distant "shell" around it; and most of the volume of a high-dimensional orange is in the skin, not the pulp. If a constant number of examples is distributed uniformly in a high-dimensional hypercube, beyond some dimensionality most examples are closer to a face of the hypercube than to their nearest neighbor. And if we approximate a hypersphere by inscribing it in a hypercube, in high dimensions almost all the volume of the hypercube is outside the hypersphere. This is bad news for machine learning, where shapes of one type are often approximated by shapes of another.

Building a classifier in two or three dimensions is easy; we