An Idiot's Guide to Support Vector Machines (SVMs)
R. Berwick, Village Idiot

SVMs: A New Generation of Learning Algorithms
• Pre-1980:
  – Almost all learning methods learned linear decision surfaces.
  – Linear learning methods have nice theoretical properties
• 1980's
  – Decision trees and NNs allowed efficient learning of non-linear decision surfaces
  – Little theoretical basis and all suffer from local minima
• 1990's
  – Efficient learning algorithms for non-linear functions based on computational learning theory developed
  – Nice theoretical properties

Key Ideas
• Two independent developments within last decade
  – New efficient separability of non-linear regions that use "kernel functions": generalization of 'similarity' to new kinds of similarity measures based on dot products
  – Use of quadratic optimization problem to avoid 'local minimum' issues with neural nets
  – The resulting learning algorithm is an optimization algorithm rather than a greedy search

Organization
• Basic idea of support vector machines: just like 1-layer or multi-layer neural nets
  – Optimal hyperplane for linearly separable patterns
  – Extend to patterns that are not linearly separable by transformations of original data to map into new space – the Kernel function
• SVM algorithm for pattern recognition

Support Vectors
• Support vectors are the data points that lie closest to the decision surface (or hyperplane)
• They are the data points most difficult to classify
• They have direct bearing on the optimum location of the decision surface
• We can show that the optimal hyperplane stems from the function class with the lowest "capacity" = # of independent features/parameters we can twiddle [note: this is 'extra' material not covered in the lectures ... you don't have to know this]

Recall from 1-layer nets: Which Separating Hyperplane?
• In general, lots of possible solutions for a, b, c (an infinite number!)
• Support Vector Machine (SVM) finds an optimal solution

Support Vector Machine (SVM)
• SVMs maximize the margin (Winston terminology: the 'street') around the separating hyperplane.
• The decision function is fully specified by a (usually very small) subset of training samples, the support vectors.
• This becomes a quadratic programming problem that is easy to solve by standard methods

Separation by Hyperplanes
• Assume linear separability for now (we will relax this later)
• In 2 dimensions, can separate by a line
  – in higher dimensions, need hyperplanes

General input/output for SVMs: just like for neural nets, but for one important addition ...

Input: set of (input, output) training pair samples; call the input sample features x_1, x_2 ... x_n, and the output result y. Typically, there can be lots of input features x_i.

Output: set of weights w (or w_i), one for each feature, whose linear combination predicts the value of y. (So far, just like neural nets ...)

Important difference: we use the optimization of maximizing the margin ('street width') to reduce the number of weights that are nonzero to just a few that correspond to the important features that 'matter' in deciding the separating line (hyperplane) ... these nonzero weights correspond to the support vectors (because they 'support' the separating hyperplane)

2-D Case
Find a, b, c, such that
  ax + by ≥ c for red points
  ax + by ≤ (or <) c for green points.

Which Hyperplane to pick?
• Lots of possible solutions for a, b, c.
• Some methods find a separating hyperplane, but not the optimal one (e.g., neural net)
• But: Which points should influence optimality?
  – All points?
    • Linear regression
    • Neural nets
  – Or only "difficult points" close to decision boundary?
    • Support vector machines

Support Vectors again for linearly separable case
• Support vectors are the elements of the training set that would change the position of the dividing hyperplane if removed.
• Support vectors are the critical elements of the training set
• The problem of finding the optimal hyperplane is an optimization problem and can be solved by optimization techniques (we use Lagrange multipliers to get this problem into a form that can be solved analytically).

Support Vectors: Input vectors that just touch the boundary of the margin (street) – circled below, there are 3 of them (or, rather, the 'tips' of the vectors)

  w^T x_0 + b = 1  or  w^T x_0 + b = –1

[Figure: separating line with the three circled boundary points]

Here, we have shown the actual support vectors, v_1, v_2, v_3, instead of just the 3 circled points at the tail ends of the support vectors. d denotes 1/2 of the street 'width'

[Figure: same data with the support vectors v_1, v_2, v_3 drawn as arrows; d marks half the street width]

Definitions
Define the hyperplanes H such that:
  w • x_i + b ≥ +1 when y_i = +1
  w • x_i + b ≤ –1 when y_i = –1

H_1 and H_2 are the planes:
  H_1: w • x_i + b = +1
  H_2: w • x_i + b = –1
The points on the planes H_1 and H_2 are the tips of the Support Vectors

The plane H_0 is the median in between, where w • x_i + b = 0

d+ = the shortest distance to the closest positive point
d– = the shortest distance to the closest negative point

The margin (gutter) of a separating hyperplane is d+ + d–.

Moving a support vector moves the decision boundary
Moving the other vectors has no effect

The optimization algorithm to generate the weights proceeds in such a way that only the support vectors determine the weights and thus the boundary

Defining the separating Hyperplane
• Form of equation defining the decision surface separating the classes is a hyperplane of the form:
  w^T x + b = 0
  – w is a weight vector
  – x is input vector
  – b is bias
• Allows us to write
  w^T x + b ≥ 0 for d_i = +1
  w^T x + b < 0 for d_i = –1

Some final definitions
• Margin of Separation (d): the separation between the hyperplane and the closest data point for a given weight vector w and bias b.
• Optimal Hyperplane (maximal margin): the particular hyperplane for which the margin of separation d is maximized.

Maximizing the margin (aka street width)
We want a classifier (linear separator) with as big a margin as possible.

Recall the distance from a point (x_0, y_0) to a line Ax + By + c = 0 is |Ax_0 + By_0 + c| / sqrt(A² + B²), so the distance between H_0 and H_1 is |w • x + b| / ||w|| = 1 / ||w||, and the total distance between H_1 and H_2 is thus 2 / ||w||.

In order to maximize the margin, we thus need to minimize ||w||, with the condition that there are no datapoints between H_1 and H_2:
  w • x_i + b ≥ +1 when y_i = +1
  w • x_i + b ≤ –1 when y_i = –1
These can be combined into: y_i (w • x_i + b) ≥ 1

We now must solve a quadratic programming problem
• Problem is: minimize ||w||, s.t. the discrimination boundary is obeyed, i.e., min f(x) s.t. g(x) = 0, which we can rewrite as:
  min f: ½ ||w||²  (Note this is a quadratic function)
  s.t. g: y_i (w • x_i + b) – 1 = 0

This is a constrained optimization problem. It can be solved by the Lagrangian multiplier method. Because it is quadratic, the surface is a paraboloid, with just a single global minimum (thus avoiding a problem we had with neural nets!)
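The margin algebra above is easy to check numerically. A minimal sketch, using a hand-picked toy dataset and a hand-derived hyperplane w = (1, 1), b = –3 (the data and solution are assumptions for illustration, not from the slides):

```python
import numpy as np

# Toy linearly separable set: positives above the line x1 + x2 = 3,
# negatives below (hypothetical data).
X = np.array([[2.0, 2.0], [3.0, 3.0], [1.0, 1.0], [0.0, 1.0]])
y = np.array([+1, +1, -1, -1])

# The maximal-margin hyperplane for this set is w = (1, 1), b = -3
# (found by hand; the support vectors are [2,2] and [1,1]).
w = np.array([1.0, 1.0])
b = -3.0

# Every point must satisfy y_i (w.x_i + b) >= 1 ...
margins = y * (X @ w + b)
assert np.all(margins >= 1 - 1e-12)

# ... and the street width is 2 / ||w||.
street_width = 2.0 / np.linalg.norm(w)
print(round(street_width, 4))  # sqrt(2), the distance between the two hulls
```

The two points with y_i (w • x_i + b) exactly 1 are the support vectors; the rest of the data could be deleted without moving the boundary.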

Example: paraboloid f: 2x² + 2y²  s.t.  g: x + y = 1
Intuition: find the intersection of the two functions f, g at a tangent point (intersection = both constraints satisfied; tangent = derivative is 0); this will be a min (or max) for f s.t. the constraint g is satisfied

Flattened paraboloid f: 2x² + 2y² with superimposed constraint g: x + y = 1

Minimize f (shown in red) when the constraint line g (shown in green) is tangent to the inner ellipse contour line of f – note the direction of the gradient arrows.

Flattened paraboloid f: 2x² + 2y² with superimposed constraint g: x + y = 1; at the tangent solution p, the gradient vectors of f, g are parallel (no possible move to increment f that also keeps you in the region g)

Minimize f when the constraint line g is tangent to the inner ellipse contour line of f

Two constraints
1. Parallel normal constraint (= gradient constraint on f, g s.t. the solution is a max, or a min)
2. g(x) = 0 (the solution is on the constraint line as well)

We now recast these by combining f, g into the new Lagrangian function, by introducing new 'slack' variables denoted a (or, more usually, denoted α in the literature)
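The tangency condition for this example can be verified directly: ∇f = (4x, 4y) must be parallel to ∇g = (1, 1), forcing x = y, and the constraint then gives x = y = 1/2 (this closed-form solution is derived by hand here, not stated in the slides):

```python
import numpy as np

# Lagrange-condition check for f = 2x^2 + 2y^2 s.t. g: x + y - 1 = 0.
# grad f = (4x, 4y) must be parallel to grad g = (1, 1), so x = y,
# and the constraint then forces x = y = 1/2.
x = y = 0.5
lam = 4 * x                       # the multiplier, from 4x = lam * 1

# Gradients are parallel and the constraint holds at the solution:
assert np.allclose([4 * x, 4 * y], lam * np.array([1.0, 1.0]))
assert abs(x + y - 1.0) < 1e-12

print(2 * x**2 + 2 * y**2)        # constrained minimum value of f
```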

Redescribing these conditions
• Want to look for a solution point p where
  ∇f(p) = λ ∇g(p)
  g(x) = 0
• Or, combining these two as the Lagrangian L and requiring the derivative of L to be zero:
  L(x, a) = f(x) – a g(x)
  ∇L(x, a) = 0

At a solution p
• The constraint line g and the contour lines of f must be tangent
• If they are tangent, their gradient vectors (perpendiculars) are parallel
• The gradient of g is perpendicular to the constraint line – i.e., points in the direction of steepest ascent
• The gradient of f must be in the same direction as the gradient of g

How the Lagrangian solves constrained optimization

  L(x, a) = f(x) – a g(x)  where  ∇L(x, a) = 0

Partial derivatives wrt x recover the parallel normal constraint
Partial derivatives wrt λ recover g(x, y) = 0

In general,
  L(x, a) = f(x) + Σ_i a_i g_i(x)

L(x, a) = f(x) + Σ_i a_i g_i(x) is a function of n + m variables: n for the x's, m for the a's. Differentiating gives n + m equations, each set to 0. The n eqns differentiated wrt each x_i give the gradient conditions; the m eqns differentiated wrt each a_i recover the constraints g_i.

In our case, f(x): ½ ||w||²; g(x): y_i (w • x_i + b) – 1 = 0, so the Lagrangian is:
  min L = ½ ||w||² – Σ a_i [y_i (w • x_i + b) – 1]  wrt w, b

We expand the last to get the following form of L:
  min L = ½ ||w||² – Σ a_i y_i (w • x_i + b) + Σ a_i  wrt w, b

Lagrangian Formulation
• So in the SVM problem the Lagrangian is
  min L_P = ½ ||w||² – Σ_{i=1..l} a_i y_i (w • x_i + b) + Σ_{i=1..l} a_i
  s.t. ∀i, a_i ≥ 0, where l is the # of training points
• From the property that the derivatives at the min = 0 we get:
  ∂L_P/∂w = w – Σ_{i=1..l} a_i y_i x_i = 0
  ∂L_P/∂b = – Σ_{i=1..l} a_i y_i = 0
  so
  w = Σ_{i=1..l} a_i y_i x_i,   Σ_{i=1..l} a_i y_i = 0

What's with this L_P business?
• This indicates that this is the primal form of the optimization problem
• We will actually solve the optimization problem by now solving for the dual of this original problem
• What is this dual formulation?

The Lagrangian Dual Problem: instead of minimizing over w, b, subject to constraints involving a's, we can maximize over a (the dual variable) subject to the relations obtained previously for w and b

Our solution must satisfy these two relations:
  w = Σ_{i=1..l} a_i y_i x_i,   Σ_{i=1..l} a_i y_i = 0

By substituting for w and b back in the original eqn we can get rid of the dependence on w and b.

Note first that we already now have our answer for what the weights w must be: they are a linear combination of the training inputs x_i, the training outputs y_i, and the values of a. We will now solve for the a's by differentiating the dual problem wrt a, and setting it to zero. Most of the a's will turn out to have the value zero. The non-zero a's will correspond to the support vectors.

Primal problem:
  min L_P = ½ ||w||² – Σ_{i=1..l} a_i y_i (w • x_i + b) + Σ_{i=1..l} a_i
  s.t. ∀i, a_i ≥ 0
  w = Σ_{i=1..l} a_i y_i x_i,   Σ_{i=1..l} a_i y_i = 0

Dual problem:
  max L_D(a_i) = Σ_{i=1..l} a_i – ½ Σ_{i,j} a_i a_j y_i y_j (x_i • x_j)
  s.t. Σ_{i=1..l} a_i y_i = 0  &  a_i ≥ 0
(note that we have removed the dependence on w and b)

The Dual problem
• Kuhn-Tucker theorem: the solution we find here will be the same as the solution to the original problem
• Q: But why are we doing this???? (why not just solve the original problem????)
• Ans: Because this will let us solve the problem by computing just the inner products of x_i, x_j (which will be very important later on when we want to solve non-linearly separable classification problems)

The Dual Problem
Dual problem:
  max L_D(a_i) = Σ_{i=1..l} a_i – ½ Σ_{i,j} a_i a_j y_i y_j (x_i • x_j)
  s.t. Σ_{i=1..l} a_i y_i = 0  &  a_i ≥ 0

Notice that all we have are the dot products of x_i, x_j.
If we take the derivative wrt a and set it equal to zero, we get the following solution, so we can solve for a_i:
  Σ_{i=1..l} a_i y_i = 0
  0 ≤ a_i ≤ C
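For a two-point toy problem the dual can be solved by hand and checked in a few lines. With one positive and one negative example, the constraint a_1 y_1 + a_2 y_2 = 0 forces a_1 = a_2 = a, the dual collapses to L_D = 2a – ½ a² ||x_1 – x_2||², and its maximum is at a = 2 / ||x_1 – x_2||² (the data points below are made up for illustration):

```python
import numpy as np

# Two-point toy problem: one positive, one negative example.
x1, y1 = np.array([2.0, 2.0]), +1
x2, y2 = np.array([1.0, 1.0]), -1

# a1 = a2 = a from the equality constraint; the dual
#   L_D = 2a - (1/2) a^2 ||x1 - x2||^2
# is maximized at a = 2 / ||x1 - x2||^2.
a = 2.0 / np.dot(x1 - x2, x1 - x2)

# Recover the weights from w = sum_i a_i y_i x_i:
w = a * (y1 * x1 + y2 * x2)

# Recover b from the margin condition y1 (w.x1 + b) = 1 on a support vector:
b = y1 - np.dot(w, x1)

print(w, b)   # both points are support vectors here (nonzero a)
```

Note that only dot products of the training points enter the computation of a, which is exactly the property the kernel trick will exploit later.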

Now knowing the a_i we can find the weights w for the maximal margin separating hyperplane:
  w = Σ_{i=1..l} a_i y_i x_i

And now, after training and finding the w by this method, given an unknown point u measured on features x_i we can classify it by looking at the sign of:
  f(u) = w • u + b = Σ_{i=1..l} a_i y_i (x_i • u) + b

Remember: most of the weights w_i, i.e., the a's, will be zero. Only the support vectors (on the gutters or margin) will have nonzero a's – this reduces the dimensionality of the solution.

Inner products, similarity, and SVMs
Why should inner product kernels be involved in pattern recognition using SVMs, or at all?
– Intuition is that inner products provide some measure of 'similarity'
– Inner product in 2D between 2 vectors of unit length returns the cosine of the angle between them = how 'far apart' they are
  e.g. x = [1, 0]^T, y = [0, 1]^T
  i.e. if they are parallel their inner product is 1 (completely similar)
    x^T y = x • y = 1
  If they are perpendicular (completely unlike) their inner product is 0 (so should not contribute to the correct classifier)
    x^T y = x • y = 0
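The support-vector expansion of the decision function can be sketched directly. The support vectors, multipliers a, and bias b below come from a hand-solved toy problem and are assumptions for illustration:

```python
import numpy as np

# Support-vector form of the classifier: f(u) = sum_i a_i y_i (x_i . u) + b.
sv_x = np.array([[2.0, 2.0], [1.0, 1.0]])   # support vectors
sv_y = np.array([+1, -1])                   # their labels
a = np.array([1.0, 1.0])                    # nonzero Lagrange multipliers
b = -3.0                                    # bias (hand-derived toy value)

def classify(u):
    # Predict the sign of f(u); only support vectors enter the sum.
    return np.sign(np.sum(a * sv_y * (sv_x @ u)) + b)

print(classify(np.array([3.0, 2.5])))   # positive side of the street
print(classify(np.array([0.5, 1.0])))   # negative side of the street
```

Because every non-support-vector has a_i = 0, the classifier only ever touches the few points on the gutters, no matter how large the training set was.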

Insight into inner products
Consider that we are trying to maximize the form:
  L_D(a_i) = Σ_{i=1..l} a_i – ½ Σ_{i,j} a_i a_j y_i y_j (x_i • x_j)
  s.t. Σ_{i=1..l} a_i y_i = 0  &  a_i ≥ 0

The claim is that this function will be maximized if we give nonzero values to a's that correspond to the support vectors, ie, those that 'matter' in fixing the maximum width margin ('street'). Well, consider what this looks like. Note first from the constraint condition that all the a's are positive. Now let's think about a few cases.

Case 1. If two features x_i, x_j are completely dissimilar, their dot product is 0, and they don't contribute to L.

Case 2. If two features x_i, x_j are completely alike, their dot product is 1. There are 2 subcases.
  Subcase 1: both x_i and x_j predict the same output value y_i (either +1 or –1). Then y_i × y_j is always 1, and the value of a_i a_j y_i y_j x_i x_j will be positive. But this would decrease the value of L (since it would subtract from the first term sum). So, the algorithm downgrades similar feature vectors that make the same prediction.
  Subcase 2: x_i and x_j make opposite predictions about the output value y_i (ie, one is +1, the other –1), but are otherwise very closely similar: then the product a_i a_j y_i y_j x_i x_j is negative and we are subtracting it, so this adds to the sum, maximizing it. This is precisely the examples we are looking for: the critical ones that tell the two classes apart.

Insight into inner products, graphically: 2 very, very similar x_i, x_j vectors that predict different classes tend to maximize the margin width

2 vectors that are similar but predict the same class are redundant

2 dissimilar (orthogonal) vectors don't count at all

But ... are we done??? Not Linearly Separable!
Find a line that penalizes points on "the wrong side"

Transformation to separate
[Figure: a map φ sends points x, o in input space X to φ(x), φ(o) in feature space F, where they become linearly separable]

Non-Linear SVMs
• The idea is to gain linear separation by mapping the data to a higher dimensional space
  – The following set can't be separated by a linear function, but can be separated by a quadratic one:
    (x – a)(x – b) = x² – (a + b)x + ab
  – So if we map x → {x², x} we gain linear separation
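The x → {x², x} idea can be demonstrated on a tiny 1-D set (the points and the threshold x² = 1 are made-up illustration values): no single threshold on x separates the classes, but one coordinate of the mapped data does.

```python
import numpy as np

# 1-D set that no threshold on x can separate: negatives sit between
# the two positives.
x = np.array([-2.0, -0.5, 0.5, 2.0])
y = np.array([+1, -1, -1, +1])

# After the map x -> (x, x^2), the linear rule x^2 > 1 splits the classes.
phi = np.stack([x, x**2], axis=1)

pred = np.where(phi[:, 1] > 1.0, +1, -1)
print(pred)   # matches y exactly
```

This is the whole point of the feature map: a quadratic boundary in the original space is a *linear* boundary in the lifted space, so the linear SVM machinery applies unchanged.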

Problems with linear SVM
What if the decision function is not linear? What transform would separate these?
Ans: polar coordinates!

Non-linear SVM: The Kernel trick
Imagine a function φ that maps the data into another space: φ: X → H

Remember the function we want to optimize: L_D = Σ a_i – ½ Σ a_i a_j y_i y_j (x_i • x_j), where (x_i • x_j) is the dot product of the two feature vectors. If we now transform to φ, instead of computing this dot product (x_i • x_j) we will have to compute (φ(x_i) • φ(x_j)). But how can we do this? This is expensive and time consuming (suppose φ is a quartic polynomial ... or worse, we don't know the function explicitly). Well, here is the neat thing:

If there is a "kernel function" K such that K(x_i, x_j) = φ(x_i) • φ(x_j), then we do not need to know or compute φ at all!! That is, the kernel function defines inner products in the transformed space. Or, it defines similarity in the transformed space.
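The identity K(x, y) = φ(x) • φ(y) can be checked for a concrete case. For the degree-2 polynomial kernel on 2-D inputs, the explicit feature map φ is 6-dimensional (the particular φ below is the standard expansion of (x • y + 1)², written out as an assumption-free worked example):

```python
import numpy as np

# Explicit feature map whose inner product equals (x.y + 1)^2 in 2-D.
def phi(v):
    x1, x2 = v
    return np.array([1.0,
                     np.sqrt(2) * x1, np.sqrt(2) * x2,
                     x1**2, np.sqrt(2) * x1 * x2, x2**2])

# The kernel: one dot product, one addition, one exponentiation.
def K(x, y):
    return (np.dot(x, y) + 1.0) ** 2

x = np.array([1.0, 2.0])
y = np.array([3.0, -1.0])

# Kernel value and explicit 6-D inner product agree - so we never need
# to construct phi at training or prediction time.
assert np.isclose(K(x, y), np.dot(phi(x), phi(y)))
print(K(x, y))
```

For higher degrees or the RBF kernel the implicit space is much larger (even infinite-dimensional), which is exactly why computing K directly instead of φ matters.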

Non-linear SVMs
So, the function we end up optimizing is:
  L_D = Σ a_i – ½ Σ a_i a_j y_i y_j K(x_i, x_j)

Kernel example: The polynomial kernel
  K(x_i, x_j) = (x_i • x_j + 1)^p, where p is a tunable parameter
Note: Evaluating K only requires one addition and one exponentiation more than the original dot product

Examples for Non-Linear SVMs
  K(x, y) = (x • y + 1)^p
  K(x, y) = exp(– ||x – y||² / 2σ²)
  K(x, y) = tanh(κ x • y – δ)

• 1st is polynomial (includes x • x as special case)
• 2nd is radial basis function (gaussians)
• 3rd is sigmoid (neural net activation function)
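The three kernels above are each a few lines to implement. A sketch, with arbitrary parameter values for p, σ, κ, δ (these values are assumptions; in practice they are tuned):

```python
import numpy as np

# The three kernels from the slide, evaluated on a pair of vectors.
def poly_kernel(x, y, p=2):
    return (np.dot(x, y) + 1.0) ** p

def rbf_kernel(x, y, sigma=1.0):
    return np.exp(-np.dot(x - y, x - y) / (2.0 * sigma**2))

def sigmoid_kernel(x, y, kappa=1.0, delta=0.0):
    return np.tanh(kappa * np.dot(x, y) - delta)

x = np.array([1.0, 0.0])
y = np.array([0.0, 1.0])

# Orthogonal unit vectors: polynomial sees only the +1 offset, the RBF
# decays with squared distance, and the sigmoid of a zero dot product is 0.
print(poly_kernel(x, y), rbf_kernel(x, y), sigmoid_kernel(x, y))
```

Swapping one of these functions in for the plain dot product is the *only* change needed to the dual optimization, which is why kernel choice is a plug-in design decision.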

We've already seen such nonlinear transforms ...
• What is it???
•   tanh(β_0 x^T x_i + β_1)
• It's the sigmoid transform (for neural nets)
• So, SVMs subsume neural nets! (but w/o their problems ...)

Inner Product Kernels

Type of Support Vector Machine | Inner Product Kernel K(x, x_i), i = 1, 2, ..., N | Comments
Polynomial learning machine    | (x^T x_i + 1)^p                                  | Power p is specified a priori by the user
Radial-basis function (RBF)    | exp(–(1/(2σ²)) ||x – x_i||²)                     | The width σ² is specified a priori
Two-layer neural net           | tanh(β_0 x^T x_i + β_1)                          | Actually works only for some values of β_0 and β_1

Kernels generalize the notion of 'inner product similarity'
Note that one can define kernels over more than just vectors: strings, trees, structures, ... in fact, just about anything. A very powerful idea: used in comparing DNA, protein structure, sentence structures, etc.

Examples for Non-Linear SVMs 2 – Gaussian Kernel
[Figures: the same dataset separated with a linear kernel vs. a Gaussian kernel]

Nonlinear rbf kernel
[Figure: decision boundary produced by an RBF kernel]

Admiral's delight w/ difft kernel functions
[Figure: the same dataset classified with different kernel functions]

Overfitting by SVM
Every point is a support vector ... too much freedom to bend to fit the training data – no generalization.
In fact, SVMs have an 'automatic' way to avoid such issues, but we won't cover it here ... see the book by Vapnik, 1995. (We add a penalty function for mistakes made after training by over-fitting: recall that if one over-fits, then one will tend to make errors on new data. This penalty fn can be put into the quadratic programming problem directly. You don't need to know this for this course.)
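The penalty idea can be sketched as the soft-margin objective ½||w||² + C Σ ξ_i, where ξ_i is the hinge slack of each training point. The dataset, the noisy label, and the candidate hyperplane below are all made up for illustration; the point is only how the penalty weight C trades margin width against training mistakes:

```python
import numpy as np

# Soft-margin objective: (1/2)||w||^2 + C * sum of hinge slacks.
# Large C punishes every training mistake (risking over-fit to noise);
# small C tolerates a few violations in exchange for a wider street.
X = np.array([[2.0, 2.0], [3.0, 3.0], [1.0, 1.0], [0.0, 1.0], [2.5, 2.5]])
y = np.array([+1, +1, -1, -1, -1])          # last point is a noisy label

def objective(w, b, C):
    # slack_i = max(0, 1 - y_i (w.x_i + b)): zero for points outside the margin
    slack = np.maximum(0.0, 1.0 - y * (X @ w + b))
    return 0.5 * np.dot(w, w) + C * slack.sum()

# A wide-street hyperplane that misclassifies only the noisy point:
w, b = np.array([1.0, 1.0]), -3.0
for C in (0.1, 100.0):
    print(C, round(objective(w, b, C), 2))
```

At small C the single violation costs little and the wide street wins; at large C the same hyperplane looks terrible, pushing the optimizer toward boundaries that contort to fit the noise.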
