1 Journal of Machine Learning Research 13 (2012) 2999-3039 Su bmitted 11/10; Revised 5/12; Published 10/12 Multi-Instance Learning with Any Hypothesis Class COM SABATO @ MICROSOFT . . Sivan Sabato SIVAN Microsoft Research New England 1 Memorial Drive Cambridge, MA TISHBY @ CS . HUJI . AC . Naftali Tishby IL School of Computer Science & Engineering The Hebrew University Jerusalem 91904, Israel Nicolas Vayatis Editor: Abstract In the supervised learning setting termed Multiple-Instan ce Learning (MIL), the examples are bags ts instances. Typically, this function of instances, and the bag label is a function of the labels of i e bag labels, but not the instance is the Boolean OR. The learner observes a sample of bags and th labels that determine the bag labels. The learner is then req uired to emit a classification rule for bags based on the sample. MIL has numerous applications, and many heuristic algorithms have been used successfully on this problem, each adapted to spec ific settings or applications. In this work we provide a unified theoretical analysis for MIL, which holds for any underlying hypothesis class, regardless of a specific application or problem domai n. We show that the sample complexity the bag, for any underlying hypothesis of MIL is only poly-logarithmically dependent on the size of hm for MIL, which uses a regular class. In addition, we introduce a new PAC-learning algorit supervised learning algorithm as an oracle. We prove that ef ficient PAC-learning for MIL can be generated from any efficient non-MIL supervised learning al gorithm that handles one-sided error. The computational complexity of the resulting algorithm is only polynomially dependent on the bag size. multiple-instance learning, learning theory, sample comp lexity, PAC learning, super- Keywords: vised classification 1. Introduction We consider the learning problem termed Multiple-Instance Learning (MIL) , first introduced in Dietterich et al. (1997). MIL is a special type of a supervised classificatio n problem. As in classical supervised classification, in MIL the learner receives a sample of labeled examples drawn i.i.d from an arbitrary and unknown distribution, and its objective is to discover a clas sification rule with a small expected error over the same distribution. In MIL additional structure is assumed, whereby the examples are received as bags of instances , such that each bag is composed of several instances. It is assumed that each instance has a true label, however the learner only observes the labels of the bags. In classical MIL the label of a bag is the Boolean OR of the labels of th e instances the bag contains. Various generalizations to MIL have been proposed (see, e.g ., Raedt, 1998; Weidmann et al., 2003). Here we consider both classical MIL and the more general setting, where a function c © 2012 Sivan Sabato and Naftali Tishby.

2 S ABATO AND ISBHY T unction is known to other than Boolean OR determines bag labels based on instance labels. This f . the learner a-priori. We term the more general setting generalized MIL n task, where a bag It is possible, in principle, to view MIL as a regular supervised classificatio is a single example, and the instances in a bag are merely part of its internal re presentation. Such a view, however, means that one must analyze each specific MIL problem s eparately, and that results and methods that apply to one MIL problem are not transferable to other MI L problems. We propose instead a generic approach to the analysis of MIL, in which the properties o f a MIL problem are analyzed as a function of the properties of the matching non-MIL problem. A s we show in this work, the connections between the MIL and the non-MIL properties are strong a nd useful. The generic methods that apply to approach has the advantage that it automatically extends all knowledge and non-MIL problems into knowledge and methods that apply to MIL, without requ iring specialized r diverse hypothesis analysis for each specific MIL problem. Our results are thus applicable fo ses. Moreover, the generic classes, label relationships between bags and instances, and target los approach allows a better theoretical understanding of the relationship, in g eneral, between regular learning and multi-instance learning with the same hypothesis class. The generic approach can also be helpful for the design of algorithms, s ince it allows deriving generic methods and approaches that hold across different settings. F or instance, as we show below, a generic PAC-learning algorithm can be derived for a large class of MI L problems with different hypothesis classes. Other applications can be found in follow-up resear ch of the results we report here, such as a generic bag-construction mechanism (Sabato et al., 201 0), and learning when bags have a manifold structure (Babenko et al., 2011). As generic analysis go es, it might be possible to improve upon it in some specific cases. Identifying these cases and provid ing tighter analysis for them is an important topic for future work. We do show that in some important ca ses—most notably is tight up to constants. that of learning separating hyperplanes with classical MIL—our analysis drug design appli- MIL has been used in numerous applications. In Dietterich et al. (1997) the cation motivates this setting. In this application, the goal is to predict which molecu les would bind to a specific binding site. Each molecule has several possible conformations (shapes) it can take. If at least one of the conformations binds to the binding site, then the molecule is la beled positive. However, it is not possible to experimentally identify which conformation was th e successful one. Thus, a molecule can be thought of as a bag of conformations, where eac h conformation is an in- stance in the bag representing the molecule. This application employs the hypoth esis class of Axis Parallel Rectangles (APRs), and has made APRs the hypothesis class of c hoice in several theoret- ical works that we mention below. There are many other applications for MIL , including image classification (Maron and Ratan, 1998), web index page recommendation ( Zhou et al., 2005) and text categorization (Andrews, 2007). n done in two main Previous theoretical analysis of the computational aspects of MIL has bee settings. In the first setting, analyzed for instance in Auer et al. (1998), Blum and Kalai (1998) and Long and Tan (1998), it is assumed that all the instances are drawn i.i.d fro m a single distribution over instances, so that the instances in each bag are statistically independe nt. Under this indepen- dence assumption, learning from an i.i.d. sample of bags is as easy as learnin g from an i.i.d. sample of instances with one-sided label noise. This is stated in the following theorem. Theorem 1 (Blum and Kalai, 1998) If a hypothesis class is PAC-learnable in polynomial time from one-sided random classification noise, then the same hypothesis clas s is PAC-learnable in polynomial time in MIL under the independence assumption. The computatio nal complexity of learning is polynomial in the bag size and in the sample size. 3000

3 M ULTI NSTANCE L EARNING WITH A NY H YPOTHESIS C LASS -I owever, very limiting, The assumption of statistical independence of the instances in each bag is, h as it is irrelevant to many applications. over bags In the second setting one assumes that bags are drawn from an arbitrary , distribution so that the instances within a bag may be statistically dependent. This is clearly muc h more useful in practice, since bags usually describe a complex object with internal structu re, thus it is implausible pothesis class of APRs to assume even approximate independence of instances in a bag. For the hy and an arbitrary distribution over bags, it is shown in Auer et al. (1998) th at if there exists a PAC- ize of the learning algorithm for MIL with APRs, and this algorithm is polynomial in both the s C-learn DNF bag and the dimension of the Euclidean space, then it is possible to polynomially PA R P = N P (Pitt and Valiant, 1986). In addition, if formulas, a problem which is solvable only if h is not itself an it is possible to improperly learn MIL with APRs (that is, to learn a classifier whic s not been solved APR), then it is possible to improperly learn DNF formulas, a problem which ha PAC-learn MIL on to this date for general distributions. This result implies that it is not possible to s dimensionality. APRs using an algorithm which is efficient in both the bag size and the problem’ It does not, however, preclude the possibility of performing MIL efficien tly in other cases. In practice, numerous algorithms have been proposed for MIL, each fo cusing on a different specialization of this problem. Almost none of these algorithms assume statistical in dependence of instances in a bag. Moreover, some of the algorithms explicitly exploit presu med dependences between instances in a bag. Dietterich et al. (1997) propose several he uristic algorithms for finding an APR that predicts the label of an instance and of a bag. Diverse Dens ity (Maron and Lozano- ́ erez, 1998) and EM-DD (Zhang and Goldman, 2001) employ assumptions o P n the structure of and MI-SVM (Andrews the bags of instances. DPBoost (Andrews and Hofmann, 2003), mi-SVM ̈ et al., 2002), and Multi-Instance Kernels (G artner et al., 2002) are approaches for learning MIL using margin-based objectives. Some of these methods work quite well in pra ctice. However, no generalization guarantees have been provided for any of them. , independent of a In this work we analyze MIL and generalized MIL in a general framework othesis class. We assume a specific application, and provide results that hold for any underlying hyp fixed hypothesis class defined over instances. We then investigate the rela tionship between learning tting with no bags, and with respect to this hypothesis class in the classical supervised learning se learning with respect to the same hypothesis class in MIL. We address both s ample complexity and computational feasibility. Our sample complexity analysis shows that for binary hypothesis and thresh olded real-valued hypotheses, the distribution-free sample complexity for generalized MIL gr ows only logarithmically s for the case with the maximal bag size. We also provide poly-logarithmic sample complexity bound of margin learning. We further provide distribution-dependent sample comp lexity bounds for more general loss functions. These bound are useful when only the avera ge bag size is bounded. The results imply generalization bounds for previously proposed algorithms for MIL. Addressing the computational feasibility of MIL, we provide a new learning algorithm with prov able guarantees for a class of bag-labeling functions that includes the Boolean OR, used in clas sical MIL, as a special case. Given a non-MIL learning algorithm for the desired hypothesis cla ss, which can handle one- sided errors, we improperly learn MIL with the same hypothesis class. The c onstruction is simple to implement, and provides a computationally efficient PAC-learning of MIL, with o nly a polynomial dependence of the run time on the bag size. In this work we consider the problem of learning to classify bags using a lab eled sample of bags. We do not attempt to learn to classify single instances using a labeled sample of b ags. We point out 3001

4 S ABATO AND ISBHY T tances based on a bag that it is not generally possible to find a low-error classification rule for ins sample. As a simple counter example, assume that the label of a bag is the Boolea n OR of the labels negative instance. In this of its instances, and that every bag includes both a positive instance and a case all bags are labeled as positive, and it is not possible to distinguish the two types of instances by observing only bag labels. The structure of the paper is as follows. In Section 2 the problem is formally d efined and notation is introduced. In Section 3 the sample complexity of generalized MIL fo r binary hypotheses is analyzed. We provide a useful lemma bounding covering numbers for MI L in Section 4. In Section 5 we analyze the sample complexity of generalized MIL with real-valued functions for g and real-valued learning large-margin learning. Distribution-dependent results for binary learnin based on the average bag size are presented in Section 6. In Section 7 we present a PAC-learner for ludes technical proofs MIL and analyze its properties. We conclude in Section 8. The appendix inc that have been omitted from the text. A preliminary version of this work has bee n published as Sabato and Tishby (2009). 2. Notations and Definitions , we denote [ k ] , { 1 ,..., k } . For a real number x , we denote [ x ] { = max For a natural number 0 k x } . , + n y ∈ R log denotes a base 2 logarithm. For two vectors , 〈 x , y 〉 denotes the inner product of x and , x 1 if R . We use the function sign : , + 1 } where sign ( x ) = →{− x ≥ 0 and sign ( x ) = − 1 otherwise. y 1 f : A For a function B , we denote by f . For a univariate function A its restriction to a set C ⊆ → C | ′′ ′ , denote its first and second derivatives by f f respectively. f and be the input space, also called the domain of instances. A bag is a finite order ed set of X Let R instances from N . For . Denote the set of allowed sizes for bags in a specific MIL problem by X ⊆ n ) ( R A A ∪ any set and instances from R , we denote . Thus the domain of bags with a size in A R ∈ n ) R ( ̄ ∈ . A bag of size X is denoted by X x = ( x [ 1 ] ,..., x [ n ]) where each x [ j ] is X is an instance in n ̄ ̄ by | the bag. We denote the number of instances in x | . For any univariate function f x A → B , we : may also use its extension to a multivariate function from sequences of elements in A to sequences of elements in B , defined by f ( a [ 1 ] ,..., a [ k ]) = ( f ( a [ 1 ]) ,..., f ( a [ k ])) . Let I R an allowed range for hypotheses over instances or bags. For instance , I = {− 1 , + 1 } ⊆ X I B , B ] for real-valued hypotheses with a bounded range. H ⊆ I for binary hypotheses and = [ − ed bag-labeling function is a hypothesis class for instances. Every MIL problem is defined by a fix ( R ) ψ : I that determines the bag labels given the instance labels. Formally, every insta nce I → ) R ( → X : h defines a bag hypothesis, denoted by I hypothesis : → X and defined by I h ( R ) ̄ ̄ x ∀ , X ∈ ])) ( ) , ψ ( h ( x [ 1 ]) ,..., h ( x [ r x . h h . Importantly, the identity } H H , { ∈ | h H is denoted ψ and The hypothesis class for bags given of ψ is known to the learner a-priori, thus each ψ defines a different generalized MIL problem. For instance, in classical MIL, I {− 1 , + 1 } and ψ is the Boolean OR. = R ( ) D We assume the labeled bags are drawn from a fixed distribution X ×{− 1 , + over } , where 1 each pair drawn from D constitutes a bag and its binary label. Given a range I ⊆ R of possible label predictions, we define a loss function ℓ : {− 1 , + 1 }× I → R , where ℓ ( y , ˆ y ) is the loss incurred if the R ( ) h : X true label is y . The true loss of a bag-classifier → I is denoted by y and the predicted label is ˆ ̄ H if there , D , E ℓ h [ ℓ ( Y ) h ( ( X ))] . We say that a sample or a distribution are realizable by , ̄ X ( Y ) ∼ D , is a hypothesis h ∈ H that classifies them with zero loss. 3002

5 M ULTI NSTANCE L EARNING WITH A NY H YPOTHESIS C LASS -I ( R ) ̄ ̄ { ) ,..., ( The MIL learner receives a labeled sample of bags x } , y 1 ) }⊆ X x , y ×{− 1 , + ( m 1 1 m ( ) m R ˆ ˆ D → , and returns a classifier drawn from I . The goal of the learner is to return h h that : X ˆ ( , D ) compared to the minimal loss that can be achieved with the bag hypothesis ℓ has a low loss h ∗ ℓ D , . The empirical loss of a classifier for bags on a labeled ) ( h class, denoted by ℓ ( ) , inf H , D H ∈ h ̄ ̄ ℓ ( h , S ) , E , we denote the x } sample S { = [ ℓ ( Y , h ( is X ))] . For an unlabeled set of bags S i S ) Y m , i ∈ [ X ] ( ∼ ∪ ̄ ] , { x by [ j ] | i ∈ [ m ] , j ∈ [ | S x multi-set of instances in the bags of | S } . Since this is a multi-set, i i ∪ any instance which repeats in several bags in is represented the same amount of time in S . S 2.1 Classes of Real-Valued bag-functions I = {− 1 , + 1 } and In classical MIL the bag function is the Boolean OR over binary labels, that is ( R ) 1 , + 1 ψ = OR : {− →{− 1 , + 1 } . A natural extension of the Boolean OR to a function over reals } is the max function. We further consider two classes of bag functions over reals, each representing ent subset of its properties. a different generalization of the max function, which conserves a differ The first class we consider is the class of bag-functions that extend monoto ne Boolean functions. 1 + 1 } , such that the map is monotone- {− , Monotone Boolean functions map Boolean vectors to increasing in each of the inputs. The set of monotone Boolean functions is e xactly the set of func- thus it includes the tions that can be represented by some composition of AND and OR functions, Boolean OR. The natural extension of monotone Boolean functions to real functions over real vec- tors is achieved by replacing OR with max and AND with min. Formally, we define ex tensions of monotone Boolean functions as follows. n A function from Definition 2 into R is an extension of an n -ary monotone Boolean function if it R n M defined inductively as follows, where the input to a function is z ∈ R : belongs to the set n ( 1 ) ∀ j ∈ [ n ] , z 7−→ z [ j ] ∈ M ; n + ∀ k ∈ N ( , f 2 ,..., f ) ∈ M M = ⇒ z 7−→ max }∈ ) z ( f { ; n n 1 j k ] k [ ∈ j + ⇒ k ∈ N ) , f . ,..., 3 M ∈ M }∈ = ∀ z 7−→ min ( { f ) ( z f j n n 1 k ∈ [ k ] j ( R ) ψ We say that a bag-function → R extends monotone Boolean functions if for all n ∈ R, : R n ∈ M . ψ n | R tion in a natural way. The class of extensions to Boolean functions thus generalizes the max func The second class of bag functions we consider generalizes the max func tion by noting that for rm ‖ z ‖ bounded inputs, the max function can be seen as a variant of the infinity-no = max | z [ i ] | . ∞ 1 Another natural bag-function over reals is the average function, defin ed as ψ ( z ) = , which z ∑ i ∈ ] i [ n n can be seen as a variant of the 1-norm z ‖ = ‖ | . More generally, we treat the case where i [ z | ] ∑ 1 ∈ i ] [ n ] I the hypotheses map into 1 , 1 = [ , and consider the class of bag functions inspired by a p -norm, − defined as follows. ) R ( → [ ∞ ) , the p -norm bag function ψ 1 : [ − 1 , + 1 ] ∈ For p Definition 3 , [ − 1 , + 1 ] is defined by: p ( ) p / 1 n 1 n p z z , ∈ R ) , ψ ∀ ( ) . [ i ]+ 1 z 1 − ( p ∑ n 1 = i lim For p ∞ , Define ψ . ≡ = ψ → ∞ p p ∞ 3003

6 S ABATO AND ISBHY T − 1 / p , + 1 ] , we have ψ are in ( z ) ≡ n [ ψ − 1 ·‖ z + 1 ‖ is the length − 1 where n Since the inputs of p p p . Note that the average function is simply ψ of , and ψ z ≡‖ z + 1 ‖ p − 1 ≡ max. Other values of ∞ 1 ∞ -norm inequality, which states that for all p ∈ [ 1 , ∞ fall between these two extremes: Due to the ) p 1 n − 1 / p n ‖ 1 ‖ + ≤ n x , ] 1 ‖ x ‖ − ≤‖ x ‖ [ , we have that for all z ∈ , R ∈ x and ∞ p 1 n average ψ ≡ ( z ) ≤ ψ . ( z ) ≤ ψ max ( z ) ≡ ∞ 1 p related to the scale Many of our results hold when the scale of the output of the bag-function is ction does not change by of its inputs. Formally, we consider cases where the output of the bag-fun of a Lipschitz much unless its inputs change by much. This is formalized in the following definition bag function. R ( ) -Lipschitz with respect to the infinity norm : A bag function Definition 4 → R is c ψ for c > 0 R if n n ∈ R , ∀ a , b ∈ R ) , | ψ ( a ) − ψ ( b ∀ |≤ c ‖ a − b ‖ . ∞ The average bag-function and the max bag functions are 1-Lipschitz. Mo reover, all extensions m—this is easy to of monotone Boolean functions are 1-Lipschitz with respect to the infinity nor verify by induction on Definition 2. All p -norm bag functions are also 1-Lipschitz, as the following derivation shows: 1 / p p − 1 / − ( b ) | = n | ψ ‖ ( ·|‖ a + 1 ‖ ‖ −‖ b + 1 a b |≤ n ) − ψ . ·‖ a − b ‖ − ≤‖ a p p ∞ p p p Thus, our results for Lipschitz bag-functions hold in particular for the two bag-function classes we have defined here, and in specifically for the max function. 3. Binary MIL I = {− 1 , + 1 } , thus we have a binary In this section we consider binary MIL. In binary MIL we let X H ⊆{− 1 , + 1 } instance hypothesis class . We further let our loss be the zero-one loss, defined ] by . The distribution-free sample complexity of learning relative to a binary y ( y , ˆ y ) = 1 [ y 6 = ˆ ℓ 1 / 0 the hypothesis class hypothesis class with the zero-one loss is governed by the VC-dimension of (Vapnik and Chervonenkis, 1971). Thus we bound the VC-dimension of H as a function of the r = maximal possible bag size R , and of the VC-dimension of H . We show that the VC- max dimension of H is at most logarithmic in r , and at most linear in the VC-dimension of H , for R ( ) 1 , + 1 } any bag-labeling function : {− →{− 1 , + 1 } . It follows that the sample complexity of ψ MIL grows only logarithmically with the size of the bag. Thus MIL is feasible eve n for quite large bags. In fact, based on the results we show henceforth, Sabato et al. ( 2010) have shown that MIL can sometimes be used to accelerate even single-instance learning. We furth er provide lower bounds that show that the dependence of the upper bound on r and on the VC-dimension of H is imperative, for a large class of Boolean bag-labeling functions. We also show a matchin g lower bound for the VC-dimension of classical MIL with separating hyperplanes. 3.1 VC-Dimension Upper Bound Our first theorem establishes a VC-Dimension upper bound for generaliz ed MIL. To prove the theorem we require the following useful lemma. 3004

7 M ULTI NSTANCE L EARNING WITH A NY H YPOTHESIS C LASS -I ( R ) For any R : {− 1 , + 1 } ⊆ and any bag function ψ →{− 1 , + 1 } , and for any hypoth- Lemma 5 N X ( R ) ⊆{− H and a finite set of bags S ⊆ X + 1 } , 1 esis class , ∣ ∣ ∣ ∣ ∣ ∣ ∣ ∣ ∪ H . H ≤ | S S | such that H ∈ g , g H be bag hypotheses. There exist instance hypotheses h ∈ h Let , Proof 2 1 2 1 ∪ ∪ , thus proving the lemma. for i = 1 , 2. Assume that h g h = g = 6 6 h = . We show that g i 1 2 2 1 | S | S S | S | i g g S such that = 6 ∈ x . Thus there exists at least one bag From the assumption it follows that S | S | 2 1 g ( x ) 6 = g . ( x ) . Denote its size by n . We have ψ ( g ])) ( x [ 1 ]) ,..., g n ( x [ n ])) 6 = ψ ( g [ ( x [ 1 ]) ,..., g x ( 2 1 1 2 2 2 ∪ ∪ [ ] such that g ∈ ( x [ j ]) 6 = g . There- ( x [ j ]) . By the definition of S j , x [ j ] ∈ S n Hence there exists a 2 1 ∪ ∪ . g 6 = g fore 2 1 | S | S Assume that is a hypothesis class with a finite VC-dimension d. Let r ∈ N and Theorem 6 H ( R ) ] . Let the bag-labeling function ψ : {− 1 , + assume that R } ⊆ [ r →{− 1 , + 1 } be some Boolean 1 function. Denote the VC-dimension of H by d . We have r d . ≤ max { 16 , 2 d log ( 2 er ) } r Proof J , denote by J For a set of hypotheses , so that the restriction of each of its members to A A | ( ) R H S ⊆ X of size , there exists a set of bags } . Since d , { h is the VC-dimension of J | h ∈ J A r A | d d r r ∪ ∪ d that is shattered by . | , therefore 2 | ≤| H | H = | , so that H H H |≤| 2 . By Lemma 5 | r S | | S S | | S ∪ ⊆ [ r ] implies | S In addition, | ≤ rd R . By applying Sauer’s lemma (Sauer, 1972; Vapnik and r Chervonenkis, 1971) to we get H ( ( ) ) d d ∪ e | S | erd r d r ∪ |≤ H 2 ≤| ≤ , S | d d e is the base of the natural logarithm. It follows that d To ≤ d ( log ( er Where − log d ) + d log d . ) r r provide an explicit bound for d by dividing to cases: , we bound d log d r r 1 2 , ) er ≤ ( d log , thus d d ≤ 2 d ( log ( er ) − log d ) 1. Either d log d ≤ r r r 2 1 2. or d < . In this case, d log d r r 2 (a) either d ≤ 16, r √ √ log 2 d 2 < d ≤ < d d / log d . d 2 d , thus d log d log = 2 d 16. In this case > d (b) or r r r r r r Substituting in the implicit bound we get d . ≤ d ( log ( er ) − log d )+ 2 d log 2 d ≤ 2 d log ( 2 er ) r 2 d . ≤ max { 16 , Combining the cases we have d log ( 2 er ) } r 3005

8 S ABATO AND ISBHY T 3.2 VC-Dimension Lower Bounds In this section we show lower bounds for the VC-dimension of binary MIL, in dicating that the and r dependence on d in Theorem 6 is tight in two important settings. ( R ) + } {− + 1 →{− 1 , , 1 } is r-sensitive if there exists a num- : 1 We say that a bag-function ψ n and a vector ber ∈{− 1 , + 1 } n such that for at least r different numbers j ∈ ,..., j R ∈ [ n ] , c r 1 c [ 1 ] ,..., c [ j ] ] ,..., c [ n ]) 6 = ψ ( c [ 1 ψ ,..., ( c [ j . Many commonly used Boolean functions, ] ,..., c [ n ]) − i i of the inputs, are such as OR, AND, Parity, and all their variants that stem from negating some ∈ R . Our first lower bound shows if ψ is r r -sensitive for every r -sensitive, the bound in Theorem 6 othesis classes. cannot be improved without restricting the set of considered instance hyp ( R ) 1 , + 1 } 1 Theorem 7 Assume that the bag function →{− 1 , + ψ } : ∈ {− is r-sensitive for some r . For any natural d and any instance domain with | X |≥ rd ⌊ log ( r ) ⌋ , there exists a hypothesis X N with a VC-dimension at most d, such that the VC dimension of class H ⌊ log ( H ) ⌋ . is at least d r n ψ J r -sensitive, there are a vector c ∈{− 1 , + 1 } Since and a set Proof ⊆ n such that | J | = r is ,..., ∀ J , ψ ( c [ 1 ] ,..., c [ n ]) 6 = ψ ( c [ 1 ] ∈ − c [ j ] ,..., c [ n ]) . Since ψ maps all inputs to {− 1 , + 1 } , it and j ∀ j ∈ J , ψ ( c [ 1 ] ,..., − c [ j ] ,..., c [ n ]) = − ψ ( c [ 1 ] ,..., c [ n follows that . Denote a = ψ ( c [ 1 ] ,..., c [ n ]) . ]) Then we have · j J , y ∈{− 1 , + 1 } , ψ ( c [ 1 ] ,..., c [ j ] · y ,..., c [ n ]) = ∀ ∈ y . (1) a For simplicity of notation, we henceforth assume w.l.o.g. that = r and J = [ r ] . n r ( Let S be a set of d ⌊ log ⊆ r ) ⌋ bags of size r , such that all the instances in all the bags are X distinct elements of X . Divide S into d mutually exclusive subsets, each with ⌊ log ( r ) ⌋ bags. Denote ̄ bag by p x in subset t . We define the hypothesis class t ) ( p , ⌊ log ( r ) ⌋ h ] |∀ i ∈ [ d { , k [ ∈ [ 2 ] k H ,..., k , ] } , 1 i d [ k where ,..., k h ] is defined as follows (see illustration in Table 1): For x ∈ X which is not an 1 d , h [ k instance of any bag in ,..., k in the binary ] = − 1. For x = x p be bit S b , let [ j ] 1 d ) p , n t p ( ) ( , n , and define representation of the number { ) − k = j a c [ j ] · , ( 2 b 1 t ( p , j − 1 ) [ ,..., k k ]( [ j ]) = x h 1 d ) t , p ( [ j ] j 6 = k c . t We now show that is shattered by S H , indicating that the VC-dimension of H is at least S | = d ⌊ log ( r ) ⌋ . To complete the proof, we further show that the VC-dimension of H is no more | d than . H : Let { y be some labeling over 1 + , 1 } {− } First, we show that S is shattered by p log , t ∈ [ d ] ∈⌊ ⌋ ) t , p ( ) r ( . For each t ∈ [ d ] let for the bags in S ⌋ ⌊ log ( r ) 1 + y t ) ( p , − p 1 1 k + , . 2 · t ∑ 2 p 1 = and ∈ ⌊ log ( r ) Then by Equation (1), for all ] [ t ∈ [ d ] , p ⌋ ̄ [ k ,..., k [ ]( h x ]) r [ c ,..., ) = ψ ( c 1 1 ] ,..., c [ k ) ] · a ( 2 b − t 1 d ) k ) , p ( 1 t , p ( − t 2 1 a 2 b . = − ( ) = 2 b − 1 = y k 1 − ( p , k ) − 1 ) ( ) p , ( p , t t t 3006

9 M ULTI NSTANCE L EARNING WITH A NY H YPOTHESIS C LASS -I ̄ p ( x t h ) x Instance label [ r ]) Bag label h ( i ) , ( p t − − − − − − − 1 + + − − − − + 1 + − − − 2 − − − − − − − − − 3 + 1 − − − − − − − + + 2 2 − − − − − − − + + + − − − − − − − 3 − − − − − − − − − 1 − + − − − − − − + 3 2 − − − − − − − − − 3 = h [ 4 , 8 , 3 ] , with ψ = OR (so that c is the all − 1 vector), h Table 1: An example of the hypotheses 8, and d = 3. Each line represents a bag in S , each column represents an instance in r = the bag. labels k . Thus k } ] [ S according to { y h ,..., 1 d t ) ( p , Second, we show that the VC-dimension of H d . Let A ⊆ X of size d + 1. If is no more than A there is an element in then this element is labeled − 1 by all h ∈ H , which is not an instance in S is not shattered. Otherwise, all elements in A are instances in bags in S . Since there therefore A d S , there exist two elements in A which are instances of bags in the same subset are subsets of . Denote these instances by x ( p )[ , t )[ j . Consider all the possible labelings of the ] and x ( p ] , t t j 2 1 1 2 H . If A is shattered, there must be four possible labelings for these two elements by hypotheses in h [ elements. However, by the definition of ,..., k then there ] it is easy to see that if j k = j j = 2 1 1 d H , and if j then there are at most three are at most two possible labelings by hypotheses in = j 6 2 1 A d H , hence the VC-dimension of H is no more than possible labelings. Thus . is not shattered by r the important Theorem 10 below provides a lower bound for the VC-dimension of MIL fo case where the bag-function is the Boolean OR and the hypothesis class is th e class of separating n n n . For w ∈ R hyperplanes in , the function h . : R R →{− 1 , + 1 } is defined by h ) ( x ) = sign ( 〈 w , x 〉 w w n } h W | w ∈ R , { . Let r ∈ N . We denote the VC- The hypothesis class of linear classifiers is w n OR by for R = { r } and ψ = dimension of W using two d . We prove a lower bound for d n r , n r n , n , and Lemma 9 links d d for small with d for lemmas: Lemma 8 provides a lower bound for , r , r , n n 3 r n . The resulting general lower bound, which holds for r = max R , is then stated in Theorem 10. large ≥⌊ Let d . be the VC-dimension of W ⌋ as defined above. Then d ) r 2 ( log Lemma 8 , 3 , r n n r Denote L , ⌊ log ( 2 r ) ⌋ . We will construct a set S of L bags of size r that is shattered by W Proof . 3 The construction is illustrated in Figure 1. n = ( n [ ,..., n , created by concatenating all the subsets of ) be a sequence of indices from Let L ] K 1 L − 1 L − 1 L 2 [ L ] , and every index appears 2 in some arbitrary order, so that K = ≤ r times in n . Define 3 3 ) a A | k ∈ [ K ] }⊆ R a set where a are , ( cos ( 2 π k / K = , sin ( 2 π k / K ) , 1 ) ∈ R { , so that a a ,..., 1 K k k 3 ̄ ̄ . Define the set of bags equidistant on a unit circle on a plane embedded in = { R x such ,..., S x } L 1 ̄ [ x . = ( x } [ 1 ] ,..., x i [ r ]) where { x = that j ] | j ∈ [ r ] } = { a n | i i i i k k 3007

10 S ABATO AND ISBHY T 1 3 2 2 3 1 1 −1 2 3 +1 2 1 3 = = log 4 + 1 = 3. Each dot 4 and L Figure 1: An illustration of the constructed shattered set, with r corresponds to an instance. The numbers next to the instances denote the bag to which an defined in the proof. In this illustration bags N instance belongs, and match the sequence 1 and 3 are labeled as positive by the bag-hypothesis represented by the solid line. W bags, and : Let ( y We now show that ,..., y S ) be some binary labeling of L is shattered by L 3 1 such that i | y = = + 1 } . By the definition of n , there exist j let , j . Y Y = { n } | j j ≤ k ≤ { 1 2 2 1 i k 3 R w that separates the vectors { a Clearly, there exists a hyperplane | j from the rest ≤ k ≤ j } ∈ 2 1 k ̄ h 1 ( x ) = + j ) = + 1 if and only if j . Thus sign ≤ k ≤ ( of the vectors in . It follows that 〈 w , a A 〉 i w 2 1 k ̄ if and only if there is a k ,..., j . This } such that a i is an instance in ∈{ x = , that is such that n j 2 1 i k k ∈ Y i condition holds if and only if , hence h classifies S according to the given labeling. It follows w S that W . , therefore d ⌋ ) r ≥| S | = ⌊ log ( 2 is shattered by 3 r 3 , k Let k r be natural number such that k ≤ n. Then d , n . ≥⌊ n / , ⌋ d Lemma 9 n , r r , k k ∈ R t and a number Proof ∈{ 0 ,..., ⌊ n / k ⌋} define the vector s ( x , t ) , ( 0 ,..., 0 , For a vector x n ̄ ] ,..., x [ k ] , 0 ,..., 0 ) ∈ R [ , where x x 1 ] [ kt + 1. Similarly, for a bag 1 x = is at coordinate i k r r n ̄ x x . [ r ]) ∈ ( R x ) [ , define the bag s ( 1 x ] , t ) , ( s ( ) ,..., [ 1 ] , t ) ,..., s ( x R [ r ] , t )) ∈ ( ( i i i i i k r k ̄ { ⊆ x Let } R be a set of bags with instances in S ) that is shattered by R ( = W . Define i k k ] d [ ∈ i r k , n n r ̄ } s ( )] x , t S : , a set of bags with instances in S is shattered R ⊆ ( R , ) { . Then S n i n n ⌊ n / [ ⌋ ] d k i ∈ ] , t ∈ [ , k r W W { y } , hence there are is shattered by by : Let be some labeling for S S . n n k k , ] , t ∈ [ t n / k ⌋ ] ) ⌊ i ∈ [ d ( i , k r k ̄ . y ) = x ( h k ∈ R w such that ∀ i ∈ [ d w ,..., , ] , t ∈⌊ separators / ⌋ n i w 1 , k r ⌋ k / n ⌊ ) , t ( i t ⌊ n / k ⌋ Set w , x 〉 x , s ( w w , t ) . Then 〈 w , s ( . Therefore , t ) 〉 = 〈 ∑ t t = t 0 ̄ ( s ( h x )) , t )) = OR ( sign ( 〈 w , s ( x 〉 [ 1 ] , t ) 〉 ) ,..., sign ( 〈 w , s ( x ) [ r ] , t i i i w ̄ ( x . h y ) = 〉 , x = [ 1 ] 〉 ) ,..., sign ( 〈 w OR , x ( [ r ] sign )) = ( 〈 w w i t i t i , t ) ( i t n ≥| d . d ⌋ S S k | = ⌊ is thus shattered, hence / n n , r n r , k The desired theorem is an immediate consequence of the two lemmas above, by n oting that when- ever r ∈ R , the VC-dimension of W d . is at least , n n r n Theorem 10 W be the class of separating hyperplanes in R as defined above. Assume that the Let n bag function is ψ = OR and the set of allowed bag sizes is R. Let r = max R. Then the VC-dimension / 3 is at least ⌊ n . W ⌋⌊ log 2 r ⌋ of n 3008

11 M -I L EARNING WITH A NY H YPOTHESIS C LASS ULTI NSTANCE 3.3 Pseudo-dimension for Thresholded Functions from real-valued functions In this section we consider binary hypothesis classes that are generated X using thresholds. Let be a set of real valued functions. The binary hypothesis class of ⊆ R F z T = { ( x , z ) 7→ sign ( f is x ) − thresholded functions generated by ) | f ∈ F , z ∈ R } , where x ∈ X F ( F ∈ R . The sample complexity of learning with T and and the zero-one loss is governed by the z F pseudo-dimension of T F (Pollard, 1984). In this section , which is equal to the VC-dimension of F ( R ) , thus F R , and bound the pseudo-dimension of ψ R : → we consider a bag-labeling function T providing an upper bound on the sample complexity of binary MIL with . The following bound F holds for bag-labeling functions that extend monotone Boolean functions, defined in Definition 2. X ⊆ F Let be a function class with pseudo-dimension d . Let R ⊆ [ r ] , and assume Theorem 11 R ( R ) F . Then extends monotone Boolean functions. Let d R ψ that be the pseudo-dimension of : → R r 2 ≤ max { 16 , d d log ( 2 er ) } . r Proof which extends monotone Boolean functions, First, by Definition 2, we have that for any ψ n ∈ and any y ∈ R , n R any ψ ( y [ 1 ] ,..., y [ n ]) − sign ) = sign ( ψ ( y [ 1 ] − z ,..., y [ n ] − z )) ( z ψ ( sign ( y [ 1 ] − z ,..., y [ n ] − z )) . (2) = perations allowed by This can be seen by noting that each of the equalities holds for each of the o for each n , thus by induction they hold for all functions in M and all combinations of them. M n n f let t : X × R →{− 1 , . 1 } be defined by t For a real-valued function ( y , z ) = sign ( f ( y ) − z ) + f f ∈ We have t T | f { F } , and T = and R ∈ n , | f ∈ F } . In addition, for all f ∈ F , z ∈ R { = t f F f F n ̄ x , we have ∈ X ̄ ̄ ( − ])) n [ x ( f ) z x ) − z ) = sign ( ψ ( f ( x [ 1 ]) ,..., f t ( ) = z , x ( sign f ψ ( sign ( f ( x [ 1 ]) − z ,..., f ( x [ n ]) − z )) (3) = ̄ ψ ( t = ( x [ 1 ] , z ) ,..., t )) = ( x [ n ] , z t ( x , z ) , f f f where the equality on line (3) follows from Equation (2). Therefore = } F | { ∈ f t { = . = T t } | f ∈ F } = { h | h ∈ T T f F F f F T . Thus, by Theorem 6 is equal to the pseudo-dimension of F The VC-dimension of d , which is F and the equality above, the VC-dimension of T . The proof is is bounded by max { 16 , 2 d log ( 2 er ) } F . F , is exactly the VC-dimension of T d , the pseudo-dimension of completed by noting that r F This concludes our results for distribution-free sample complexity of Binary MIL. In Section 6 we provide sample complexity analysis for distribution-dependent binary MI L, as a function of the average bag size. 3009

12 S ABATO AND ISBHY T 4. Covering Numbers Bounds for MIL s, since they allow Covering numbers are a useful measure of the complexity of a function clas m convergence guar- bounding the sample complexity of a class in various settings, based on unifor antees (see, e.g., Anthony and Bartlett, 1999). In this section we provide a lemma that relates the stance hypothesis class. covering numbers of bag hypothesis classes with those of the underlying in We will use this lemma in subsequent sections to derive sample complexity upper b ounds for addi- A A be a set of real-valued functions over some domain tional settings of MIL. Let . A F -cover ⊆ R γ A of defined on functions is a set of functions C ⊆ ‖·‖ with respect to a norm such that for any F R ◦ F there exists a g ∈ C such that ‖ f − g ‖ F ≤ γ . The covering number for given γ > 0, f and ◦ , ∈ ◦ , is the size of the smallest such , F , ◦ ) γ γ -covering for F . N ( denoted by ⊆ A be a finite set. We consider coverings with respect to the L 1, ( S ) norm for p ≥ Let S p defined by ( ) p 1 / 1 p s | ( f | ) f . , ‖ ‖ ) S ( L ∑ p | S | S ∈ s F = L ( S ) is defined by ‖ f ‖ For for a sample , . The covering number of p , max ∞ | f ( S ) | ∈ S ∞ s S ) L ( ∞ m size L norm is with respect to the p γ ( N , F , p ) , sup . )) S ( L , F , γ N ( m p S | m : A | S = ⊆ ence rates, hence A small covering number for a function class implies faster uniform converg smaller sample complexity for learning. The following lemma bounds the covering n umber of ct to the infinity norm (see bag hypothesis-classes whenever the bag function is Lipschitz with respe finition 2) and all - p Definition 4). Recall that all extensions of monotone Boolean functions (De norm bag-functions (Definition 3) are 1-Lipschitz, thus the following lemma ho lds for them with = 1. a ( R ) N and suppose the bag function ψ : R Lemma 12 Let R ⊆ → R is a-Lipschitz with respect to the ( R ) infinity norm, for some a X > . Let S ⊆ be a finite set of bags, and let r be the average size of a 0 X H ∈ [ 1 , ∞ ] , and hypothesis class , p ⊆ R γ , 0 bag in S. For any > γ ∪ . ( L )) , H S , , γ ( N L ≤ ( S )) H N ( , p p / p 1 ar ̄ Proof , for any bag n x of size ψ and hypotheses First, note that by the Lipschitz condition on , g ∈ H , h ̄ ̄ | x x ) − g ( h x ) | = | ψ ( h ( x [ 1 ]) ,..., h ( ( [ n ])) − ψ ( g ( x [ 1 ]) ,..., g ( x [ n ])) |≤ a max (4) . | ) | h ( x ) − g ( x ̄ x x ∈ ∪ be a minimal γ -cover of H with respect to the norm defined by L ( Let S C ) , so that | C | = p ∪ ∪ γ , H , L . ( S N )) . For every h ∈ H there exists a g ∈ C such that ‖ h − g ‖ ∞ < p . Assume ( γ ≤ p ) S ( L p 3010

13 M ULTI NSTANCE L EARNING WITH A NY H YPOTHESIS C LASS -I Then by Equation (4) ( ( ) ) / p 1 1 p / p a 1 p p ̄ ̄ ‖ ≤ = h ( h x ) − g ( g x ) | − ‖ h | | ) x ( g − ) x ( | max L ( S ) ∑ ∑ p ̄ x ∈ x | S S | | | ̄ ̄ ∈ S S x ∈ x ( ) ) ( / 1 p 1 p / p a a p p = ) ( ) | − g x ( h | x ) x ( h | ) | x ( g − ≤ ∑ ∑ ∑ p 1 / S | | | S | ∪ ̄ ∈ x x ̄ S ∈ x S x ∈ ( ) p / 1 ) ( 1 / p ∪ 1 | | S p − | ) | h ( x x ( g ) a = ∑ ∪ | | | S | S ∪ S ∈ x / 1 / p 1 p ∪ − h g . γ · ‖ = ≤ ar ‖ ar ) ( L S p 1 / p It follows that C is a ( ar ) -covering for H . For p = ∞ we have γ ̄ ̄ x ( g − ) x ( h | max | | ) ( h ) − g ( h x ) |≤ a max − max g x ‖ = ‖ S ( L ) ∞ ̄ x x ∈ ̄ ̄ x ∈ S S ∈ x / 1 p ∪ . γ | h ( x ) − g ( x ) | = · ‖ h − g ‖ r · a = a γ ≤ a max = a L S ) ( ∞ ∪ x ∈ S 1 / p ∪ Thus in both cases, is a γ -covering for ar , and its size is N ( γ , H , L H ( S C )) . Thus p 1 / p ∪ ∪ ( γ , H , L . ( S ar )) ≤ N ( γ , H , L )) ( S N p p γ with We get the statement of the lemma by substituting γ . / 1 p ar As an immediate corollary, we have the following bound for covering numbers of a given sample size. ( R ) ⊆ Corollary 13 r ] . Suppose the bag function ψ : R Let r ∈ N → R is a-Lipschitz with , and let R [ X 0 . Let γ > 0 , p ∈ [ 1 , ∞ ] , and H ∈ R respect to the infinity norm for some a . For any m ≥ 0 , > γ N H , p ) ≤ ( H p ) , , . ( , γ N rm m p / 1 · r a 5. Margin Learning for MIL: Fat-Shattering Dimension Large-margin classification is a popular supervised learning approach, which has received atten- ) attempts to optimize tion also as a method for MIL. For instance, MI-SVM (Andrews et al., 2002 an adaptation of the soft-margin SVM objective (Cortes and Vapnik, 1995) to MIL, in which the margin of a bag is the maximal margin achieved by any of its instances. It has no t been shown, however, whether minimizing the objective function of MI-SVM, or other marg in formulations for MIL, allows learning with a reasonable sample size. We fill in this gap in Theore m 14 below, which bounds the γ -fat-shattering dimension (see, e.g., Anthony and Bartlett 1999) of MIL. T he objec- tive of MI-SVM amounts to replacing the hypothesis class of separating hyperplanes with the H class of bag-hypotheses H where the bag function is ψ = max. Since max is the real-valued exten- sion of OR, this objective function is natural in our MIL formulation. The distr ibution-free sample complexity of large-margin learning with the zero-one loss is proportional to th e fat-shattering di- mension (Alon et al., 1997). Thus, we provide an upper bound on the fat- shattering dimension 3011

14 S ABATO AND ISBHY T esis class, and of the of MIL as a function of the fat-shattering dimension of the underlying hypoth > 0 be the desired maximal allowed bag size. The bound holds for any Lipschitz bag-function. Let γ ) γ γ , H ( , denote its -fat-shattering dimension by Fat H margin. For a hypothesis class X and assume R ⊆ [ r ] . Let B , Theorem 14 > 0 . Let H ⊆ [ 0 , B ] Let r be a real-valued hypoth- ∈ N a R ) ( , B ] esis class and assume that the bag function ψ : → [ 0 , aB ] is a-lipschitz with respect to the [ 0 infinity norm. Then for all ∈ , aB ] γ 0 ( ( )} { 2 2 B a 6 · 2048 · γ γ 2 · , H ) log ) H . , (5) ( Fat r · Fat ( , γ 33 max ( ) H 24Fat , ≤ 2 a 64 a 64 γ bag size on the sample This theorem shows that for margin learning as well, the dependence of the o results, which complexity is poly-logarithmic. In the proof of the theorem we use the following tw link the covering number of a function class with its fat-shattering dimension. γ > . For m ≥ Theorem 15 (Bartlett et al., 1997) Let F be a set of real-valued functions and let 0 16 , F ) , ( Fat γ ( 16 γ , Fat ) / 8 F ≤ e ( γ , F , N ) . ∞ m The following theorem is due to Anthony and Bartlett (1999) (Theorem 12.8 ), following Alon et al. (1993). Let F be a set of real-valued functions with range in [ 0 , B ] . Let γ > 0 . For all m ≥ 1 Theorem 16 , γ ( ) ( Fat 4 log ) F , ( ) γ / eBm 2 4 B 4 m ∞ 2 < ) N ( , F , γ . (6) m 2 γ γ . Here we only , F ) ≥ Fat ( m Theorem 12.8 in Anthony and Bartlett (1999) deals with the case 4 γ γ ) ( Fat m 4 / ≤ B ) then the trivial upper bound N ( ( γ , H , ∞ ) ≤ ( B / γ ) ) γ ≤ m 1, since if ≥ m require ( Fat m 4 implies Equation (6). H ) , , γ ≥ [of Theorem 14] From Theorem 15 and Lemma 12 it follows that for Fat ( 16 m Proof 8 log N ( γ , γ 16 Fat ( , ) H (7) ) H , ∞ ≤ ≤ 6 log N . ( γ / a , H , ∞ ) m rm e log m ≥ 1, if Fat ( γ / 4 ) ≥ 1 then By Theorem 16, for all ( ) 2 B m B 4 4 eBm γ ≤ ∀ γ , ∞ ) ≤ 1 + Fat ( , log , H ) log ( N ) ( γ log , H m 2 γ γ e 4 2 ) ( 2 8 eBm γ 4 B m (8) , log log ) H ) ( ( Fat ≤ 2 4 γ γ 2 γ m B 4 2 ≤ ( Fat ( log H , (9) ) ) . 2 γ 4 nd the value of the other The inequality in line (8) holds since we have added 1 to the second factor, a B 2 2 γ factors is at least 1. The last inequality follows since if ≤ eB ≤ γ , we have 8 4 B / / γ . Equa- 2 e tion (9) also holds if Fat ( γ / 4 ) < 1, since this implies Fat ( γ / 4 ) = 0 and N 1. Combining ( γ , H , ∞ ) = m H then ) 16 γ , ≥ ( Fat Equation (7) and Equation (9), we get that if m 2 2 4 rm B a γ aB 2 ( . Fat 16 ) ( , H ) log γ , (10) , ) ≤ 6Fat ( H ≤ γ ∀ 2 a 4 e 2 γ 3012

15 M ULTI L EARNING WITH A NY H YPOTHESIS C LASS -I NSTANCE = ( 16 γ , Fat ⌈ m Set Fat ( 16 γ , H ) + 1. If Fat ( 16 γ , H ) ≥ 1, we have that m ≥ Fat ( 16 γ , H ) and H ) ⌉≤ H ) . Thus Equation (10) holds, and ≤ , 2Fat 16 γ also ( m 2 2 aB γ a B 4 2 ≤ 6Fat H ) ≤ ∀ ( )) 1 )+ H γ ) 16 H ( , log , , γ Fat ( · r · ( Fat ( 16 γ , 2 4 e 2 γ a 2 2 γ a 8 B 2 )) H . 6Fat ( ≤ ( , ) H log · r ( 16 γ , · Fat 2 a γ 4 ( 16 γ , Now, it is easy to see that if Fat ) H < 1, this inequality also holds. Therefore it holds in general. with 16, we have that Substituting γ / γ 2 2 a 2048 B aB 8 γ 2 Fat · r · γ log , ( , H ) , Fat ( γ , ( ∀ ≤ γ (11) 6Fat ≤ ) H . ( H )) 2 64 a γ e holds, in particular, for all γ ≤ aB . Note that the condition on γ β = 6Fat ( γ / 64 a , H ) and η = To derive the desired Equation (5) from Equation (11), let 2 2 2 2 ) F ≤ β . Then Equation (11) can be restated as H ( η rF ) . log B . Denote F = Fat ( γ , / γ a 2048 √ √ It follows that log ( η rF ) ≤ F β , Thus / √ ) ( √ √ √ rF F η β log ( βη r ) . ≤ log ( η rF ) log ( η rF ) log Therefore √ √ F ≤ ))) ( log ( η rF ) / 2 − log ( log ( η rF ) β log ( βη r / 2 , rF ) log η ( hence √ √ ( log ( η rF )) 2 log r βη . log β ) ( F ≤ − 1 ( ) rF η ( log ) 1 ≥ F 2048. Assume · 33 ≥ x for all 33 and ( ≤ x ( log / )) x ( log ) Now, it is easy to verify that log 4 γ ≤ aB . Then 2 2 2 B η a rF rF / γ = ≥ 2048 F ≥ 33 · 2048 . 2048 √ √ 1 1 η rF )) / log ( η rF ) ≤ Therefore log ( log ( , which implies ≤ log ( βη r β . Thus F ≤ F ) 2 4 2 log β βη r ) . Substituting the parameters with their values, we get the desired bound, state d in 4 ( Equation (5). 6. Sample Complexity by Average Bag Size: Rademacher Complex ity The upper bounds we have shown so far provide distribution-free samp le complexity bounds, which depend only on the maximal possible bag size. In this section we show that eve n if the bag size is unbounded, we can still have a sample complexity guarantee, if the average bag size for the input mplexity (Bartlett distribution is bounded. For this analysis we use the notion of Rademacher co A and Mendelson, 2002). Let be some domain. The empirical Rademacher complexity of a class of , 1 A ×{− } 1 + R functions F ⊆ S = 1 ( x with respect to a sample , y ) } is } 1 + , ⊆ A ×{− { i i ] ∈ i [ m 1 ( f σ ) | y ] E , [ , sup x | R , ) S , F ( i i σ i ∑ m F f ∈ ] m [ ∈ i 3013

16 S ABATO AND T ISBHY {± = ( σ σ ) are m independent uniform ,..., 1 } -valued variables. The average Rademacher where σ 1 m D over A ×{− 1 , + 1 } and a sample size m is complexity of F with respect to a distribution m S , F ( R [ )] . E , ) D , F ( R ∼ m S D is The worst-case Rademacher complexity over samples of size m sup . ) S , F ( R R F sup ) = ( m m ⊆ S A This quantity can be tied to the fat-shattering dimension via the following result: ≥ 1 and γ ≥ 0 Theorem 17 (See, for example, Mendelson, 2002, Theorem 4.11) Let m . sup If ( F ) ≤ γ R γ -fat-shattering dimension of F is at most m. then the m A ⊆ R . Assume a hypothesis class H ⊆ I and a loss function Let ℓ : {− 1 , + 1 }× I → R . For a I h h , we denote by h ∈ the function defined by h hypothesis ( x , y ) = ℓ ( y , H ( x )) . Given H and ℓ , we ℓ ℓ A ×{− 1 , + 1 } h ∈ H }⊆ R H . , { h define the function class | ℓ ℓ rtlett and Mendel- Rademacher complexities can be used to derive sample complexity bounds (Ba ∈ , 1 ] . For any son, 2002): Assume the range of the loss function is [ ( 0 , 1 ) , with probability of 0 δ − δ over the draw of samples S ⊆ A ×{− 1 , + 1 } of size m drawn from D , every h ∈ H satisfies 1 √ δ ) 8 ln ( 2 / ℓ ( h , S )+ 2 R )+ ( H ℓ , D ( h , D ) ≤ . (12) m ℓ m Thus, an upper bound on the Rademacher complexity implies an upper bound on the average loss of a classifier learned from a random sample. 6.1 Binary MIL Our first result complements the distribution-free sample complexity bounds that were provided for R ) ( binary MIL in Section 3. The average (or expected) bag size under a dis X × over D tribution ̄ + 1 } is E {− 1 . Our sample complexity bound for binary MIL depends on the average ] | X , | [ ̄ D ∼ ) Y , X ( bag size and the VC dimension of the instance hypothesis class. Recall that th e zero-one loss is ℓ to S , we use ( y , ˆ defined by ) = 1 [ y 6 = ˆ y ] . For a sample of labeled examples S = { ( x } , y ) y i i X 1 / 0 [ ] ∈ i m S , that is S . denote the examples of { x } = i X i ∈ [ m ] X Let H ⊆{− 1 , + 1 } N be a binary hypothesis class with VC-dimension d. Let R ⊆ Theorem 18 ( R ) + + 1 } and assume a bag function 1 , →{− 1 , {− 1 } . Let r be the average bag size under : ψ distribution D over labeled bags. Then √ ) er 4 ( ln d D ) ≤ 17 H , ( R . ℓ 1 0 / m Let S be a labeled bag-sample of size m . Dudley’s entropy integral (Dudley, 1967) states that Proof ∫ √ ∞ 12 √ ( R , S ) ≤ H (13) S γ d ( )) ln N ( γ , H L , 2 ℓ ℓ 0 / 0 1 / 1 m 0 ∫ √ 1 12 √ . d γ ln N ( γ , H )) S ( L , = 2 ℓ 1 / 0 m 0 3014

17 M NSTANCE EARNING WITH A NY H YPOTHESIS C LASS ULTI -I L 1, ( γ , The second equality holds since for any γ N > 1, thus the expression in the )) = , L S ( H 2 ℓ / 0 1 integral is zero. S γ H with respect to the norm L is a ( H with ) , then C 2-cover for / -cover for is a C If γ X 2 ℓ ℓ 0 1 0 1 / / ∈ S respect to the norm . This can be seen as follows: Let h L ) ( ∈ h for some H . Let H ∈ f C 2 ℓ ℓ 1 / 0 1 0 / . We have h ‖ such that ‖ f ≤ γ − ) L S ( X 2 ) ( 2 / 1 1 2 y , x ( | ) h , x ( y ) − | f = ‖ h − f ‖ ℓ ℓ ℓ ℓ ( ) L S ∑ 1 0 / / 1 0 1 / 0 0 / 1 2 m , x ( y ) ∈ S ( ) / 2 1 1 2 = ( y , f ( x )) − ℓ | ℓ | ( y , h ( x )) 1 / 0 1 0 / ∑ m ) ∈ S ( x , y ) ( 2 / 1 1 1 1 2 ( | γ x ) − h ( x ) | ) ≤ . 2 / ( = f ‖ f − h ‖ = L ( S ) ∑ X 2 2 m 2 x ∈ S X C L -covering number of γ . It follows that we can bound the is a γ / 2-cover for Therefore ) ( S 2 ℓ / 1 0 H by: ℓ / 1 0 N ( γ , (14) . H , L S ( S )) ≤ N ( 2 γ , H , L ( )) 2 2 X ℓ 1 / 0 ∪ r ( S ) be the average bag size in the sample S , that is r ( S ) = | S Let | / | S | . By Lemma 12, √ ∪ r L ( S )) ≤ N ( γ / , H ( S ) , H L (15) ( S , . )) , γ ( N X 2 2 X From Equation (13), Equation (14) and Equation (15) we conclude that √ ∫ 1 √ 12 ∪ √ ( R ≤ H )) . γ d , S ln N ( 2 γ / ) r ( S ) , H , L S ( 2 ℓ X 1 / 0 m 0 d , and any γ > 0, By Dudley (1978), for any H with VC-dimension ) ( e 4 ∪ . d H S L ln )) ≤ 2 ( ln , , γ ( N 2 X 2 γ Therefore √ ( ) ∫ 1 ( S ) er 12 √ R ( , H γ 2 d ln ) S ≤ d ℓ / 0 1 2 γ m 0 √ ) ( ∫ ∫ √ 1 1 √ d 2 ln d γ + ( er ( ln )) ( 1 / γ S ) d γ ≤ 17 m 0 0 √ √ √ ( ln ( er ( S ))+ ) π / 2 d ( d ln ( 4 er S )) 17 = 17 ≤ . m m √ ln ( x ) is concave for x ≥ 1. Therefore we may take the expectation of both sides of The function this inequality and apply Jensen’s inequality, to get [ ] √ ( er 4 ( )) d S ln m m 17 [ R ( H , H D ) = E E , S )] ≤ ( R D ∼ D ∼ S S m ℓ ℓ 0 / 0 1 1 / m √ √ m [ r ( S )]) · E 4 ( ln er e d ln ( 4 d ) S D ∼ 17 . = 17 ≤ m m 3015

18 S ABATO AND T ISBHY ity of binary MIL with We conclude that even when the bag size is not bounded, the sample complex a specific distribution depends only logarithmically on the average bag size in th is distribution, and linearly on the VC-dimension of the underlying instance hypothesis class. 6.2 Real-Valued Hypothesis Classes g other loss functions In our second result we wish to bound the sample complexity of MIL when usin age bag size, and on the that accept real valued predictions. This bound will depend on the aver Rademacher complexity of the instance hypothesis class. Lipschitz. For the We consider the case where both the bag function and the loss function are e Lipschitz with respect bag function, recall that all extensions of monotone Boolean functions ar {− 1 , + 1 }× R to the infinity norm. For the loss function R , we require that it is Lipschitz ℓ : → > y ∈{− 1 , 0 such that for all 1 } and a in its second argument, that is, that there is a constant + , y . This property is satisfied by many popular losses. For ∈ R , | y ( ℓ , y | ) − ℓ ( y , y y ) |≤ a | y − y 1 2 1 2 1 2 M. It is defined as instance, consider the hinge-loss, which is the loss minimized by soft-margin SV y y , ˆ y ) = [ ℓ − y ˆ ( ] , and is 1-Lipschitz in its second argument. 1 + hl The following lemma provides a bound on the empirical Rademacher complexity of MIL, as a function of the average bag size in the sample and of the behavior of the wor st-case Rademacher complexity over instances. We will subsequently use this bound to bound the a verage Rademacher ge [ 0 , 1 ] . To avoid complexity of MIL with respect to a distribution. We consider losses with the ran ̄ st one labeled bag x , y ) ⊆ ( degenerate cases, we consider only losses such that there exists at lea R ( ) ̄ ̄ + 1 } and hypotheses h , g ∈ H such that h y ( ×{− x , 1 ) = 0 and g X ( , x , y ) = 1. We say that such ℓ ℓ a loss has a full range . X ( R ) 0 , B ] Lemma 19 be a hypothesis class. Let R ⊆ N , and let the bag function ψ : R Let H ⊆ → R [ R , : {− 1 , + 1 }× ℓ → [ 0 , 1 ] -Lipschitz with respect to the infinity norm. Assume a loss function be a 1 ℓ has a full range. Suppose there -Lipschitz in its second argument. Further assume that which is a 2 : ( 0 , 1 ] → R such that is a continuous decreasing function f sup γ ∈ ( 0 , 1 ] , f ( γ ) ∈ N = ⇒ R ∀ . γ ≤ ) ( H ) γ f ( , for all ( 0 , 1 ] ∈ Let S be a labeled bag-sample of size m, with an average bag size r. Then ε √ )( ( ) ∫ 2 2 2 1 γ 10 rm B a ea 4 1 2 √ R ( 1 + ≤ , ε + S ) H 4 f ( log ) d γ . ℓ 2 a m 4 a ε ε 2 1 A refinement of Dudley’s entropy integral (Srebro et al., 2010, Lemma A.3 Proof ) states that for ε ∈ ( 0 , 1 ] , for all real function classes F with range all 0 , 1 ] and for all sets S , [ ∫ √ 1 10 √ S ( L , F ( (16) , . γ γ d ln N )) S + 4 R ≤ ) ε , F ( 2 m ε H . In addition, for any set S , the L norm is ( S ) ℓ , is [ 0 Since the range of 1 ] , this holds for F = 2 ℓ bounded from above by the L . Thus, by ( S ) norm. Therefore N ( γ , F , L )) ( S )) ≤ N ( γ , F , L S ( 2 ∞ ∞ Equation (16) we have √ ∫ 1 10 √ ( R γ 4 + ≤ ) . S , γ H ln N ( ε , H d , L )) ( S (17) ∞ ℓ ℓ m ε 3016

19 M -I L EARNING WITH A NY H YPOTHESIS C LASS ULTI NSTANCE , H and consider h g Now, let ∈ g h ∈ H . Since , ℓ is a -Lipschitz, we have 2 ℓ ℓ ℓ ̄ ̄ ̄ ̄ ‖ x h ( g , = max y ( ℓ h )) | h x ( | x − , y ( ) − g g ( )) x , , y y ) | = max ( ℓ | − ‖ i i i i i i i i ℓ ℓ S ( L ) ℓ ℓ ∞ [ ∈ i ] m [ ∈ i m ] ̄ ̄ − h ‖ g ‖ h ( x x a ) = g ( . | ) − a ≤ | max i i 2 2 S L ( ) ∞ X ] [ ∈ i m ⊆ It follows that if C γ / a H -cover for is a then C . Therefore ⊆ H H is a γ -cover for H 2 ℓ ℓ ℓ N ( γ , , , L . By Lemma 12, ( S )) ≤ N ( γ / a )) , H H L S ( ∞ X 2 ∞ ℓ ∪ a S )) ≤ N ( , / L a H , H , L ( S ( γ ) ∞ , H a γ , . / a )) ≤ N ( a , / γ ( N X ∞ ∞ 1 2 1 2 rm 2 X Combining this with Equation (17) it follows that ∫ √ 1 10 √ H , a γ . d ) ∞ a , / γ ( (18) R ( + ε 4 N ) S , H ≤ 2 1 rm ℓ m ε sup ∈ ( 0 , 1 ] , and let γ Now, let = sup { γ , by Theorem 17 ≤ γ | f ( γ γ ) ∈ N } . Since R γ ≤ ) H ( ◦ ◦ ◦ ◦ ) ( f γ ◦ γ H is at most f ( -fat-shattering dimension of the ) . It follows that γ ◦ ◦ γ , H ) ≤ Fat Fat γ ( , H ) ≤ f ( γ ) ≤ 1 + f ( γ ) . ( ◦ ◦ γ , since f is continuous and decreasing. There- The last inequality follows from the definition of ◦ fore, by Theorem 16, ( ) 2 γ 4 eBm m B 4 H , ∞ ) ≤ 1 +( f ( log ( γ γ N , ≤ , B ∀ ) log ( log ) 1 )+ m 2 γ γ 4 ( ) 2 m 4 eB 4 eBm γ log ) log ) 1 )+ ( ( ( f (19) ≤ 2 4 γ γ 2 γ m eB 4 2 ( ≤ ( f 1 ) log )+ ( (20) ) . 2 4 γ ) 1 to the third factor, and the value of ( ≥ The inequality in line (19) holds since we have added log e B . γ the other factors is at least 1. The last inequality follows since ≤ ≤ B does not restrict us: By the assumptions on ℓ , there We now show that the assumption γ ̄ ̄ ̄ ̄ | = n ( 0. Let ) = y , x h | x . By the x , y ) = 1 and g ( and a labeled bag h ) y , x ( such that H ∈ g , are ℓ ℓ Lipschitz assumptions we have ̄ ̄ ̄ ̄ ̄ ̄ x , y ) − g x ( | x , y ) | = h ℓ ( y , h ( ) x )) − ℓ ( y , g ( ( x )) |≤ a ( | h ( | x ) − g = | 1 2 ℓ ℓ = a . | ψ ( h ( x [ 1 ]) ,..., h ( x [ n ])) − ψ ( g ( x [ 1 ]) ,..., g ( x [ n ])) |≤ a B a a max a |≤ ]) j [ | h ( x [ j ]) − g ( x 1 1 2 2 2 ] j n [ ∈ ] a B . It follows that for all γ ≤ ( 0 , 1 a , γ / a Thus 1 a ≤ B . Thus Equation (20) can be combined ∈ 1 1 2 2 ε 0 , 1 ] , ( with Equation (18) to get that for all ∈ √ ) ( ) ( ∫ 2 2 2 1 4 ea a B rm 10 γ 2 1 2 √ S ε ) + ≤ H , 4 d γ R ( )+ 1 log f ( ℓ 2 a 4 γ a m ε 2 1 √ ( ) ∫ 2 2 2 1 γ 10 a ea rm B 4 2 1 √ f ( + ε 4 ≤ )+ log γ d 1 2 a a 4 m ε ε 2 1 √ ( ) )( ∫ 2 2 2 1 B rm γ a 10 4 ea 2 1 √ γ d . ) log + 1 f ( + 4 ≤ ε 2 ε a m 4 a ε 2 1 3017

20 S ABATO AND T ISBHY √ √ √ The last inequality follows from the fact that a + ≤ b for non-negative a and b , and from + a b ∫ 1 1 ≤ 1. ε MIL, as a Based on Lemma 19, we will now bound the average Rademacher complexity of expected bag size. Since function of the worst-case Rademacher complexity over instances, and the ends on the bag sizes in the number of instances in a bag sample of a certain size is not fixed, but dep sup ( R ) for different values of m . For H the specific sample, we will need to consider the behavior of m β ln ( m ) 1 √ √ , or to for to many learnable function classes, the Rademacher complexity is proportional m m β . The following theorem bounds the average Rademacher complexity of MIL some non-negative logarithmic dependence in all these cases. The resulting bound indicates that here too there is a poly- an application of of the sample complexity on the average bag size. Following the proof we show the bound to a specific function class. R ( X ) , ⊆ ] 0 be a hypothesis class. Let R [ N , and let the bag function ψ : R ⊆ H Let → Theorem 20 B → -Lipschitz with respect to the infinity norm. Assume a loss function : {− 1 , + 1 }× R ℓ [ 0 , 1 ] , R be a 1 which is a ℓ has a full range. Suppose that -Lipschitz in its second argument. Further assume that 2 , , K ≥ there are C such that for all m ≥ K, 0 β β ln ) m ( C sup √ ≤ H ( R ) . m m Then there exists a number N 0 that depends only on C , β and K such that for any distribution D ≥ ≥ 1 , with average bag size r, and for all m ) ( a a 1 + β 2 2 2 2 2 2 2 1 ) m a ( ln a C 16 N rm B ) + a 4 ( 4 + ea 10 log 2 1 2 1 1 + β √ . R ( ≤ ) D , H m ℓ m ( m , and let ̃ r be its average bag size. Denote T S x ) = Let Proof be a labeled bag sample of size 2 2 sup ( 1 / γ T ) 4 β ) ( x C , and define f ( γ ) = , thus allowing the use of Lemma 19. . We will show that R ln γ ≤ 2 γ ( f ) γ √ √ We have ( m ) / T R ≤ T ( f ) γ )) / m f ( γ , thus it suffices to show that ≤ γ . ( m √ 1 2 2 z γ ) = Let ( T ( 1 / γ ) . Since the function )) . We will now show that z ( γ ) T ( z f ( γ )) ≥ ( γ ) / T ( f ( γ γ β 2 2 ) = Cx ln x ( x xT ) is monotonic increasing for x ≥ 1, we will conclude that z ( γ ) ≥ 1 / γ for all γ ≤ 1. ( 0 such that for all β ≥ 0, there is a number n ≥ C x ≥ n , , It is easy to see that for all values of β / 1 − 2 1 − 2 β 2 x ≤ ln ( . ) x C we have For such x x β 2 β β 2 2 )) = C ln ( ( x / T x ( T ln ))) C ( ln ( x ) − ) = ( C ln x ( 2 β 2 ) ( x C ln − 1 / β β β ) − ( 1 − 2 / ≥ C ( ) ln ( x ))) ln = C ln ( ( x ) x 2 = T ( x ) / 2 . (21) γ Let ∈ ( 0 , 1 ) such that f ( γ γ ) = k = max { n , K } . Since f ( , for all ) is monotonic decreasing with γ ◦ ◦ ≤ γ k , f ( γ ) ≥ γ . Therefore, for γ ≤ γ , ◦ ◦ √ √ √ ( f ( ) γ ) f γ 1 f ( γ ) 1 2 2 γ γ ) T ( z ( z )) = ( . 1 / ) γ / γ ( T ) = γ ( f ) ( T ≥ ( )) = γ T ( f 2 ( ( γ f 2 ( ( f ( T )) γ )) 2 T T )) γ ( f 3018

21 M ULTI EARNING WITH A NY H YPOTHESIS C LASS -I NSTANCE L rom the definition The middle inequality follows from Equation (21), and the last equality follows f 1 z ( γ ) ) f of γ ( . We conclude that ≥ γ . Therefore, for all , γ ≤ ◦ γ T ( f ( γ )) sup √ ≤ ) γ ( z γ / 1 = . ≤ ) H ( R ) γ ( f ) ( f γ ̃ as follows: Define f { ) f ( γ γ γ ≤ ◦ ̃ f ) = γ ( γ k . γ > ◦ sup ) , clearly R ≤ , For ( H γ ≤ γ , and for γ > γ γ ◦ ◦ ̃ ( γ ) f sup sup sup H H ) = R H γ ( . ) = R ( ≤ ( γ ≤ ) R ◦ ̃ k ) ( f γ ) ( γ f ◦ sup H ) ≤ γ . By Lemma 19, for all ε ( ( 0 , 1 ] , ∈ , R Therefore for all γ ∈ ( 0 , 1 ] ̃ f ( γ ) √ )( ) ( ∫ 2 2 2 1 ̃ 4 B ea rm a γ 10 1 2 ̃ √ ( f log γ d + ) 1 R ( ) ε 4 ≤ + S H , ℓ 2 a ε 4 a m ε 2 1 √ )( ( ) ∫ ∫ 2 2 2 1 4 a γ a √ ◦ 1 2 B 10 ̃ rm a ea 4 γ 2 1 √ log 1 + γ d ) ε + = 4 ( + γ k d f 2 m ε 4 a a a a γ 4 ε 2 1 ◦ 1 2 √ )( ) ( ∫ 2 2 2 γ a a 4 √ ◦ 2 1 B ea ̃ rm a 4 10 γ 1 2 √ + 1 (22) log γ d ) . 4 ε ≤ + + ( f k 2 a a m 4 ε ε 1 2 √ 1 + Denote N = β > 0 we have . Now, if k √ √ ∫ ∫ ∫ 2 2 2 a 4 a γ a γ a 4 a a γ 4 ◦ ◦ ◦ 1 2 1 2 1 2 γ γ ( 16 T a a ) γ / 1 2 f ( f ( d γ ) 2 a d a d γ ≤ γ ) = 2 1 γ 4 a 4 a a a ε ε ε 1 2 1 2 ] [ ∫ 4 a γ a ◦ 2 1 β 2 2 2 2 2 γ a a 4 ◦ 1 2 / γ 16 ) ( a ln a a a 16 1 + β 2 1 1 2 = 2 a a C β a 2 = γ ln a ) / ( 2 ( C + 1 )) d ( − 2 1 2 1 2 γ γ ε ε ) ( 2 2 2 2 2 a C a 16 a a a γ C a a 16 a 1 2 1 2 ◦ β 1 + β 1 + 2 1 2 1 = . ≤ ) ) ( ln ln ( 2 2 1 + ε + β ε 1 β = 0, since in that case The same inequality holds also for β √ ∫ ∫ 2 2 2 4 a a γ a γ 4 a ◦ ◦ 2 1 2 1 γ / ) a γ a 16 ( T 2 1 f ( γ d a γ d a 2 = ) 1 2 4 γ a a ε ε 2 1 ∫ γ a a 4 ◦ 2 1 a a 4 γ 1 1 2 ◦ 4 a a γ ◦ 2 1 C ln ( a = 2 a )] ) 2 a = a d C [ γ ( γ ln = 2 a a C 2 1 2 1 2 1 ε γ ε ε ( ) 2 2 a a 4 16 a a a a C 2 1 2 1 β + 1 2 1 C a ≤ ln a 2 ( ( ln . ) ) = 2 1 2 ε ε β + 1 Therefore we can further bound Equation (22) to get )( ) ( 2 2 2 2 2 C a a 16 a a 10 ̃ 4 B a rm ea 1 2 β + 1 1 2 1 2 √ + N ln log ) . ( 4 , ε ≤ ) + H S ( R ℓ 2 2 ε β 1 ε + m 3019

22 S ABATO AND ISBHY T √ ε = Setting / 1 we get m ( ) a a C β 1 + 2 2 2 2 2 2 1 2 ( 4 ̃ rm ea ) B N + 4 a + 10 log m ln a ) ( 16 a 1 2 2 1 β + 1 √ ( R H S ) , ≤ . ℓ m denote its average bag size by ̃ r ( S ) . We have Now, for a given sample S m ( [ R ) = E )] , H H , D S R ( D ∼ S m ℓ ℓ ) ( C a a 1 + β 2 2 2 2 2 2 1 2 ) m a ln 16 ( a ( + ) m ) S N r ̃ B a + ( 4 10 log 4 ea 2 1 2 1 1 + β √ ≤ E m ) ( C a a 1 + β 2 2 2 2 2 2 1 2 m a ) a ( 16 ln 10 log ) a N + 4 ea B 4 ( rm + 1 2 2 1 + 1 β √ ≤ . m m )] = [ . This is the r ̃ r ( S E In the last inequality we used Jensen’s inequality and the fact that ∼ S D desired bound, hence the theorem is proven. To demonstrate the implications of this theorem, consider the case of MIL with sof t-margin h we denote by T . The kernel SVM. Kernel SVM can operate in a general Hilbert space, whic x X x ∈ T |‖ { ‖≤ 1 } , and the function class is the class of linear separators domain of instances is = W ( C ) = { h C | w ∈ T , ‖ w ‖≤ with a bounded norm } , for some C > 0, where h . The loss = 〈 x , w 〉 w w ℓ defined above, which is 1-Lipschitz in the second argument. We have (Bar tlett is the hinge-loss hl and Mendelson, 2002) 0 C ( m ln C ) sup √ √ W C ( ) ( R ) ≤ = . ℓ m hl m m X ] W ( C Thus we can apply Theorem 20 with ⊆ [ − C , C β = , thus we can apply the 0. Note that ) B 2 C by simply shifting the output of each = theorem with by C and adjusting the loss function h w N ψ accordingly. By Theorem 20 there exists a number such that for any 1-Lipschitz bag-function over labeled bags with an average bag size of r , we have (such as max) and for any distribution D 2 2 16 ( ln C + N 4 )) 10 log ( 16 eC + rm m )( √ . ( R ≤ ) D , H m ℓ m e loss of MIL with We can use this result and apply Equation (12) to get an upper bound on th soft-margin SVM. 7. PAC-Learning for MIL In the previous sections we addressed the sample complexity of generalized MIL, showing that it nal aspect of grows only logarithmically with the bag size. We now turn to consider the computatio and computational MIL, and specifically the relationship between computational feasibility of MIL feasibility of the learning problem for the underlying instance hypothesis. X H ∈ [ − 1 , + 1 ] We consider real-valued hypothesis classes , and provide a MIL algorithm which uses a learning algorithm that operates on single instances as an oracle. W e show that if the oracle 3020

23 M ULTI L EARNING WITH A NY H YPOTHESIS C LASS -I NSTANCE , and the bag-function satisfies certain boundedness conditions, can minimize error with respect to H . In particular, the guarantees hold if the bag- H then the MIL algorithm is guaranteed to PAC-learn function is Boolean OR or max, as in classical MIL and its extension to real-va lued hypotheses. A H from single instances, we provide an algorithm called Given an algorithm that learns to implement a weak learner for bags with respect to MILearn . That is, for any that uses A H returns a hypothesis from MILearn weighted sample of bags, H that has some success in labeling as the building block in a Boosting the bag-sample correctly. This will allow the use of MILearn algorithm (Freund and Schapire, 1997), which will find a linear combination of hypotheses from H is efficient then the resulting that classifies unseen bags with high accuracy. Furthermore, if A Boosting algorithm is also efficient, with a polynomial dependence on the maximal bag size. We open with background on Boosting in Section 7.1. We then describe the we ak learner in and analyze its properties in Section 7.2. In Section 7.3 we provide guarantees o n a Boosting algorithm f PAC-learning for MIL that uses our weak leaner, and conclude that the computational complexity o r single instances. can be bounded by the computational complexity of agnostic PAC-learning fo 7.1 Background: Boosting with Margin Guarantees In this section we give some background on Boosting algorithms, which we will use to derive an pire, 1997) are techniques efficient learning algorithm for MIL. Boosting methods (Freund and Scha —a learning algorithm that achieves error slightly that allow enhancing the power of a weak learner input sample. The idea is better than chance—to derive a classification rule that has low error on an to iteratively execute the weak learner on weighted versions of the input sa mple, and then to return a linear combination of the classifiers that were emitted by the weak learner in ea ch round. A be a domain of objects to classify, and let H : [ − 1 , + 1 ] Let be the hypothesis class used A m ⊆ } x by the weak learner. A Boosting algorithm receives as input a labeled samp , y = ) { le ( S i i = 1 i 1 , + 1 } , and iteratively feeds to the weak learner a reweighed version of S . Denote the m − 1- A ×{− m ∈ = { w dimensional simplex by R ∆ | i ] ∆ ∈ w w . For a vector = 1 , ∀ , ∈ [ m } , w [ i ] ≥ 0 ∑ m i m i ] m [ ∈ m k . The Boosting algorithm runs in w reweighed by S is the sample rounds. ] , x , y S ) } i = { ( [ w w i i 1 = i , and receives a it sets a weight vector On round ∈ ∆ w , calls the weak learner with input t S w m t t h H as output from the weak learner. After k rounds, the Boosting algorithm returns hypothesis ∈ t f , which is a linear combination of the hypotheses received from the : A → [ − 1 , a classifier 1 ] + ◦ f = weak learner: . R ∈ α ,..., α α , where h ∑ ◦ t t 1 k k ∈ t ] [ The literature offers plenty of Boosting algorithms with desirable properties. For concreteness, ∗ ̈ we use the algorithm AdaBoost (R atsch and Warmuth, 2005), since it provides suitable guarantees on the margin of its output classifier. For a labeled example ( x , y ) , the quantity y f is the margin ( x ) ◦ of f when classifying x . If the margin is positive, then sign ◦ f correctly. The margin classifies x ◦ ◦ m is defined as S = { ( x f of any function y on a labeled sample ) } , i i i = 1 M ( f , S ) = min . ) x ( f y i i [ ] ∈ i m ) M If , S ( is positive, then the entire sample is classified correctly by sign ◦ f . f If S is an i.i.d. sample drawn from a distribution on A ×{− 1 , + 1 } , then classification error of f ◦ and the pseudo-dimension M ( f of the hypothesis , S ) on the distribution can be bounded based on d ◦ class H . The following bound (Schapire and Singer, 1999, Theorem 8) holds w ith probability 1 − δ 3021

24 S ABATO AND T ISBHY ≥ m over the training samples, for any d : √ 2 2 S m d ) / M ln ( f d , / )+ ln ( 1 / δ ) ( ◦ Y · f . ( X ) ≤ 0 ] ≤ O P [ (23) ◦ m In fact, inspection of the proof of this bound in Schapire and Singer (199 9) reveals that the only property of the hypothesis class that is used to achieve this result is the following bound, due to H with pseudo-dimension s H Haussler and Long (1995), on the covering number of a hypothesis clas d : ( ) d em 1 ] , N , ( γ ∀ H γ ∞ ) ≤ ∈ ( 0 , , . (24) m d γ that will be useful to us. Thus, Equation (23) holds whenever this covering bound holds—a fact ∗ AdaBoost , a guarantee on the size of the margin of f can be achieved if one can provide For ◦ edge of the hypotheses returned by the weak learner. The edge of a hypoth- a guarantee on the h esis measures of how successful it is in classifying labeled examples. Let → [ − 1 , + 1 ] be a : A be a distribution over is ×{− 1 , + 1 } . The edge of h with respect to D D hypothesis and let A h , D ) ( E Γ . )] X ( , h [ Y · D ) Y , X ( ∼ S = { ( w For a weighted and labeled sample , x , , y } ) } 1 + , 1 ×{− ⊆ R A × + i i i m [ ∈ i ] x ( h y . w ) , ) Γ S ( , h i i i ∑ i ∈ [ m ] 1 − Γ ( h , D ) is the Note that if x ) is interpreted as the probability of h to emit 1 for input x , then ( h 2 expected misclassification error of . Thus, a positive edge implies a labeling success of more h on D ∗ AdaBoost than chance. For , a positive edge on each of the weighted samples fed to the weak f . learner suffices to guarantee a positive margin of its output classifier ◦ ∗ ̈ atsch and Warmuth 2005) receives a labeled sample S of Theorem 21 (R AdaBoost Assume ∗ runs for k rounds and returns the classifier f AdaBoost . If size m as input. Suppose that ◦ √ − ) ≥ ρ , then M ≥ f ρ , S ) ( k ] , Γ ( h for every round t , S ∈ [ 2 ln m / k. ◦ t w t We present a simple corollary, which we will use when analyzing Boosting fo r MIL. This ∗ can be used to transform a weak learner that approximates the AdaBoost corollary shows that st margin of a best edge of a weighted sample to a Boosting algorithm that approximates the be labeled sample. The proof of the corollary employs the following well known r esult, originally by von Neumann (1928) and later extended (see, e.g., Nash and Sofer, 1996). For a hypothe- sis class H , denote by co ( H ) the set of all linear combinations of hypotheses in H . We say that A H − 1 , + 1 ] ⊆ is compact with respect to a sample S = { ( x if the set of , y [ ) } } 1 + , 1 ⊆ A ×{− i i ] m [ ∈ i vectors { ( h ( x is compact. ) ,..., h ( x } )) | h ∈ H m 1 Theorem 22 (The Strong Min-Max theorem) If H is compact with respect to S, then . ) S , f sup ( M sup Γ ( h , S ) = min w w ∈ ∆ m ∈ h H ) H ( co ∈ f 3022

25 M ULTI NSTANCE L EARNING WITH A NY H YPOTHESIS C LASS -I ∗ Suppose that is executed with an input sample S, and assume that H Corollary 23 AdaBoost ∗ is compact with respect to S. Assume the weak learner used by has the following AdaBoost ∆ as input, then with probability at least ∈ guarantee: For any w , if the weak learner receives S w m such that δ − 1 it returns a hypothesis h ◦ S , Γ ( h ≥ g ( sup , )) S Γ ( h , ) w w ◦ H ∈ h [ − 1 , + 1 ] → [ − 1 , + 1 where g is some fixed non-decreasing function. Then for any input sample S, : ] ∗ runs k rounds, it returns a linear combination of hypotheses f if = AdaBoost α h , such ∑ ◦ t t ] [ t k ∈ k δ that with probability at least 1 − √ . k / m 2 ln ( ( sup f S , M ) ≥ M ( f , S )) − g ◦ ) H ( co ∈ f ) S , f ( sup , M sup Γ ( h . Thus, for any vector of S ) = By Theorem 22, min Proof w ∆ ∈ w H ) ∈ H h f ∈ co ( m ( in the simplex, sup weights Γ ( h , S w ) ≥ sup . It follows that in each round, the ) S , f M w ) H ( co ∈ f ∈ H h ) ≥ h such that Γ ( h , S as input returns a hypothesis S weak learner that receives w t t w t t ) S , , f )) ≥ g ≥ sup ( M By Theorem 21, it follows that . )) S M ( f ( ( h g S ( sup Γ , ◦ w co ∈ f H ) ( H h ∈ t √ sup ( − )) g S , M ( f m 2 ln / k . ) H co ∈ f ( 7.2 The Weak Learner In this section we will present our weak learner for MIL and provide gua rantees for the edge it achieves. Our guarantees depend on boundedness properties of the bag-function ψ , which we de- -norm bag functions (see fine below. To motivate our definition of boundedness, consider the p ( ) p / 1 n 1 p ψ Definition 3), defined by , ) z ( z [ i ]+ 1 ) Recall that this class of functions in- . 1 − ( ∑ p 1 i = n cludes the max function ( ) and the average function ( ψ ψ ) as two extremes. Assume R ⊆ [ r ] for 1 ∞ z N . It is easy to verify that for any natural n , any sequence z some ,..., r ∈ [ − 1 , + 1 ] , and all ∈ 1 n ∈ , ∞ ] , p 1 [ 1 1 − n z z ≤ ψ ( z . ,..., z ) ≤ + i n i 1 p ∑ ∑ n [ ] i ] n ∈ ∈ i [ n ) ( R [ r ] , it follows that for all ( z , ,..., z 1 ) ∈ [ − 1 , + Since ] R ⊆ n 1 1 + . 1 − r z (25) ≤ ψ z ( z ≤ ,..., z ) i n 1 p i ∑ ∑ r ] i [ n ∈ ] [ ∈ i n m of its argu- We will show that in cases where the bag function is linearly bounded in the su ments, as in Equation (25), a single-instance learning algorithm can be used to learn MIL. Our weak learner will be parameterized by the boundedness parameters of the bag- function, defined formally as follows. ( R ) : [ − 1 , + 1 ] ] Definition 24 A function → [ − 1 , + 1 ψ is ( a , b , c , d ) -bounded if for all ( z ∈ ,..., z ) n 1 R ( ) , + ] [ 1 , 1 − . d + z c ≤ ) z z + b ≤ ψ ( z ,..., a i 1 n i ∑ ∑ i ∈ [ n ] ∈ [ i n ] 3023

26 S ABATO AND ISBHY T 1 p 1 , ∞ ) , ψ Thus, for all over bags of size at most r is ( ∈ [ 1 , , r − 1 ) -bounded. , 0 p r h denotes a special , we introduce some notations. MILearn Before listing the weak learner pos ) R ( H , H ∪{ h . } ∈ ( x ) = + 1. We denote X x ∀ 1: , h bag-hypothesis that labels all bags as pos + pos be an algorithm that receives a labeled and weighted instance sample as inpu t, and returns a Let A ( ∈ . The result of running A h S is denoted A H S ) ∈ H . hypothesis with input MILearn , listed as Algorithm 1 below, accepts as input a bag sample The algorithm and a S bounded bag-function . It also has access to the algorithm A . We sometimes emphasize that ψ A uses a specific algorithm MILearn A . MILearn constructs a MILearn as an oracle by writing S , labeling each instance in S S from the instances that make up the bags in sample of instances I I with the label of the bag it came from. The weights of the instances depend on w hether the bag they ψ . Having constructed came from was positive or negative, and on the boundedness propertie s of , S I MILearn calls A with S . It then decides whether to return the bag-hypothesis induced by applyin g I to A ( ψ . ) , or to simply return h S pos I N MILearn ( f ( N ) + O ) , where N is It is easy to see that the time complexity of is bounded by the total number of instances in the bags of , and S ( n ) is an upper bound on the time complexity f of when running on a sample of size n . As we presently show, the output of MILearn is a A . H S whose edge on S depends on the best achievable edge for bag-hypothesis in + A Algorithm 1 : MILearn : Assumptions X + 1 , ∈ 1 ] H [ • − A receives a weighted instance sample and returns a hypothesis in H . • Algorithm : Input ̄ • ( w , —a labeled and weighted sample of bags, S , y , ) } { x i i i [ ∈ ] m i —an ( a , b , c , d ) -bounded bag-function. • ψ H . Output ∈ : h + ◦ α . c ← a , α ← 1 − 1 ( ) 1 (+ ) x j ) y . · w , , [ ] } S ( ←{ α 2 i i i I y [ m ] , j ∈ [ r ] i ∈ i h 3 ← A ( S . ) I I Γ ( if h , S ) ≥ Γ ( h 4 , S ) then pos I h ← h , 5 ◦ I else 6 h 7 ← h . pos ◦ Return h . 8 ◦ 3024

27 M ULTI L EARNING WITH A NY H YPOTHESIS C LASS -I NSTANCE A depend on the properties of . We define two properties that we The guarantees for MILearn A A A consider for returns is close to the best . The first property is that the edge of the hypothesis possible one on the input sample. An algorithm that accepts a weighted and labeled sample of instances -optimal) ε Definition 25 ( A H is ε in if for all weighted samples S ⊆ R X × X ×{− 1 , + 1 } and returns a hypothesis in -optimal + with total weight W , A ) S ( , S ) ≥ sup Γ ( . Γ ( h , S ) − ε W H h ∈ returns is close to the best possible one The second property is that the edge of the hypothesis that A hypotheses that label on the input sample, but only compared to the edges that can be achieved by all the negative instances of − 1. For a hypothesis class H and a distribution D over labeled S with that label all negative examples in D with − 1, by H examples, we denote the set of hypotheses in ( H , D ) = { h ∈ H | P Ω . } 1 ] = 1 − [ h ( X ) = − 1 | Y = D ( Y , X ∼ ) where , ( H , S ) , Ω ( H S U For a labeled sample ) Ω U is the uniform distribution over the examples , S S . in S Definition 26 (one-sided- ε An algorithm A that accepts a weighted and labeled sample -optimal) X and returns a hypothesis in H is one-sided- ε -optimal if for all weighted samples of instances in , ⊆ × X ×{− 1 R + 1 } with total weight W , S + ( A ( S . , S ) ≥ sup Γ Γ ( h , S ) − ε W ) Ω ( H , S ) h ∈ Clearly, any algorithm which is -optimal is also one-sided- -optimal, thus the first requirement ε ε is stronger. In our results below we compare the edge achieved using MILearn A from to the best possible edge for the sample S H . Denote the best edge achievable for by by a hypothesis in S ∗ sup , γ h Γ ( . , S ) ∈ h H ∗ ( the best edge that can be achieved by a hypothesis in Ω We denote by γ H S ) . Formally, , + ∗ . ) S Γ , h ( , sup γ + ) S , H ( Ω ∈ h Denote the weight of the positive bags in the input sample by W S = and the weight of w ∑ + i y =+ 1 : i i W the negative bags by = . We will henceforth assume without loss of generality that w ∑ − i 1 = − i : y i the total weight of all bags in the input sample is 1, that is + W W = 1. + − Note that for any , b , c , d ) -bounded ψ , if there exists any sequence z ( ,..., z a such that n 1 ψ ( z ,..., z ) = − 1, then 1 n (26) z + d . ≤− c ≤ z + 1 b a i i ∑ ∑ ] n [ ∈ i i ∈ [ n ] This implies b − 1 − − d − 1 ≤ ≤ . z i ∑ c a ] i ∈ [ n 3025

28 S T ABATO AND ISBHY c c − d Rearranging, we get + ≥ 0, with equality if Equation (26) holds with equalities. The − b 1 a a that depends on the tightness of this inequality for MILearn next theorem provides a guarantee for sitive margin for the output the given bag function. As evident from Theorem 21, to guarantee a po ∗ MILearn as the weak learner, we need to guarantee that the edge of when used with AdaBoost of MILearn the hypothesis returned by is always positive. Since the best edge cannot be more than 1, MILearn we emphasize in the theorem below that the edge achieved by is positive at least when the best edge is 1 (and possibly also for smaller edges, depending on the par ameters). We subsequently show how these general guarantees translate to a specific result for the max function, and other bag functions with the same boundedness properties. ( R ) ⊆ [ r ] . Let ψ : Theorem 27 − 1 , + 1 ] Let r ∈ N → [ − 1 , + 1 ] be an ( a , b , c , d ) -bounded bag- and R [ 1 c c c ε ∈ [ < , 0 function such that a ≤ c. Let 0 1 . b − ) , and assume that d + − = η . Denote Z = a rc a a A MILearn with a weighted bag sample Consider running the algorithm S of total weight 1 , and let A h . Then be the hypothesis returned by MILearn ◦ is ε -optimal then A 1. If η 1 1 ∗ Z − Z γ + + ) ε − rc − ( 1 Z 2 Z . , h ( Γ ) ≥ S ◦ η 1 )( 1 − ) 1 +( 1 − Z 2 ( Γ , h Thus, 0 whenever ) S > ◦ 1 1 ε rc η 1 ∗ 1 γ > − + ( )+ + . 2 2 2 Z Z Z Z ∗ 0 > ) S . − ε / ( Z + 1 ) and γ 1 = 1 then Γ ( h rc , ( 2 ≤ η In particular, if ) ◦ 1 ε -optimal, and ψ ( z A ,..., z , then ) = − 2. If only if z 1 = ... = z − = is one-sided- n 1 n 1 η ∗ − γ rc ε ) − Z ( Z + 1 + 2 , ( h Γ . S ≥ ) ◦ η Z ( − 1 ) − Z − 2 1 2 Γ ( h Thus, , S > 0 whenever ) ◦ η ∗ γ > . ( Z + 1 )+ rc ε Z + 2 ∗ . ) S 0 > ( h then , 1 = Γ Z + 1 ) and γ In particular, if η ≤ 2 ( 1 − rc ε Z ) / ( ◦ + general terms, as it The proof of the theorem is provided in Appendix A. This theorem is stated in . In particular, if ψ is any function between an average and a max, including ψ holds for any bounded -norm bag functions ψ any of the defined in Definition 3, we can simplify the result, as captured p p by the following corollary. 1 X ⊆ [ − 1 , + 1 Corollary 28 Let . Let R ⊆ [ r ] , and ε ∈ [ 0 , H ] ) . Assume a bag function r ( R ) − 1 , + 1 ] → ψ : [ [ − 1 , + 1 ] such that for any z , ,..., z ] ∈ [ − 1 , + 1 1 n 1 z max ≤ ) . z ≤ ψ ( z ,..., z n 1 i i ∑ n i ∈ [ n ] i [ n ∈ ] A Let h be the hypothesis returned by MILearn . Then ◦ 3026

29 M ULTI NSTANCE L EARNING WITH A NY H YPOTHESIS C LASS -I A 1. If -optimal for some ε ∈ [ 0 , 1 / r ] , then is ε 2 ∗ 2 1 r ( γ ) + + − r ε 1 S ≥ ) . , Γ h ( ◦ − 1 r 2 ε 1 ∗ ∗ γ ( Γ then 1 . In particular, if = h , + h Γ Thus ( , 0 γ S ≥ 1 − ) > . 0 > whenever S ) ◦ ◦ 2 r r 2 ε -optimal some ε ∈ [ 0 2. If 1 / r A ] , then is one-sided- , 2 ∗ ε r − γ + . ( Γ , h S ≥ ) ◦ − 1 2 r ∗ 2 ∗ ( Thus , Γ h whenever γ ε . > r S ) . In particular, if > 0 0 = 1 then Γ ( h > , S ) γ ◦ ◦ + + Let z 1 ,..., z . We have ∈ [ − Proof , + 1 ] n 1 . 1 − n + z ) 1 ≤ ) z ( min − z n − ( max z ≤ i i i i ∑ ∑ [ n ] i ∈ n ] [ i ∈ ] i ∈ [ n Therefore, by the assumption on ψ , for any n ∈ R z 1 − r + . − ≤ z 1 + n ψ ≤ z ,..., z ( ) i i n 1 ∑ ∑ [ ] ∈ n i [ n ∈ ] i In addition 1 1 z ,..., z ( ψ z ≤ ≤ . z ) i i 1 n ∑ ∑ n r i n [ ] ∈ i ∈ [ n ] 1 ψ ( Therefore is , 0 , 1 , r − 1 ) -bounded. It follows that Z 0. r in this case, and d − Zb − Z + 1 = = r Claim (1) follows by applying case (1) of Theorem 27 with η = 0. For claim (2) we apply case (2) of Theorem 27. Thus we need to show tha t if ψ ( z 1 ,..., z − ) = n 1 1. We have that − z and ∈ [ − 1 , + 1 ] , then z ,..., = ... = z z = n 1 1 n 1 z . 1 ≤− z ) ≤ z ( ψ ,..., − 1 ≤ 1 n i ∑ n i ∈ [ n ] 1 Therefore 1, − = z = ... z 1. Thus case (2) = − 1. Since no z z can be smaller than − = ∑ i 1 i n [ i ∈ ] n n edness parameters of of Theorem 27 holds. We get our claim (2) directly by subsituting the bound ψ in Theorem 27 case (2). 7.3 From Single-Instance Learning to Multi-Instance Learning ∗ MILearn with the guarantees on In this section we combine the guarantees on , to show AdaBoost that efficient agnostic PAC-learning of the underlying instance hypothes H implies efficient PAC- is learning of MIL. For simplicity we formalize the results for the natural case wh ere the bag function is ψ = max. Results for other bounded bag functions can be derived in a similar fa shion. First, we formally define the notions of agnostic and one-sided PAC-learnin g algorithms. We then show that given an algorithm on instances that satisfies one of these d efinitions, we can con- struct an algorithm for MIL which approximately maximizes the margin on an input bag sample. 3027

30 S ABATO AND ISBHY T Specifically, if the input bag sample is realizable by H , then the MIL algorithm we propose will find or, and with a positive a linear combination of bag hypotheses that classifies the sample with zero err d in Section 7.1, margin. Combining this with the margin-based generalization guarantees mentione we conclude that we have an efficient PAC-learner for MIL. ) , δ , S ε be an algorithm B ( Let Definition 29 (Agnostic PAC-learner and one-sided PAC-learner) m ∈ ( 0 , that accepts as input ) , and a labeled sample S ∈ ( X ×{− 1 , + 1 } ) δ , and emits as output , ε 1 H . B a hypothesis h agnostic PAC-learner for H with complexity c ( ε , δ ) if B runs for no ∈ is an ( , δ ) steps, and for any probability distribution D over X ×{− 1 , + 1 } , if S is an i.i.d. ε more than c − ε δ ) , then with probability at least 1 , δ over S and the randomization of ( sample from D of size c , B ( B ( ε , δ , Γ ) , D ) ≥ sup . ε − Γ ( h , D ) S H h ∈ is a if under the same conditions, with probability at least 1 − δ one-sided PAC-learner B ( B ( ε , δ , S ) , D ) ≥ sup − ε Γ ) D , h ( Γ . ) , H ( Ω ∈ h D B : Algorithm 2 O ε , δ Assumptions : ε , δ ∈ ( 0 , 1 ) • . • receives a labeled instance sample as input and returns a hypothesis in H . B ε Algorithm • c ( B , δ ) . is a one-sided (or agnostic) PAC-learning algorithm with complexity Input : A labeled and weighted instance sample S = { ( w . , x } , y 1 ) } + , 1 ×{− X ⊆ R × + i i i m ∈ i ] [ H : A hypothesis in Output ∈ [ m ] , p i ← w For all / . w 1 ∑ i i i m [ ∈ i ] t ∈ [ c ( ε , δ )] , independently draw a random j j such that For each 2 = i with probability p . i t t ̃ ) . } y , x ( ←{ S 3 j j ∈ [ c ( ε , δ )] t t t ̃ B ( 4 S ) h ← h . 5 Return B , listed for H and parameters ε , δ ∈ ( 0 , 1 ) , the algorithm O Given an agnostic PAC-learner B , δ ε ε − δ . Similarly, if B is a one- -optimal algorithm with probability 1 above as Algorithm 2, is an B δ − -optimal algorithm with probability 1 ε is a one-sided- . Our MIL O sided PAC-learner, then δ ε , B O ∗ , δ ε AdaBoost with as the (high probability) weak learner. It is MILearn algorithm is then simply . We also show H easy to see that this algorithm learns a linear combination of hypotheses from + below that under certain conditions this linear combination induces a positive ma rgin on the input bag sample with high probability. Given this guaranteed margin, we bound the ge neralization error of the learning algorithm via Equation (23). B The computational complexity of and in the instance-sample size ) δ is polynomial in O ( ε , c δ , ε B O ε , δ . Therefore, the computational complexity of m MILearn is polynomial in c ( ε , δ ) and in N , where N is the total number of instances in the input bag sample S . 3028

31 M -I L EARNING WITH A NY H YPOTHESIS C LASS ULTI NSTANCE For 1-Lipschitz bag functions which have desired boundedness prope rties, both the sample com- plexity and the computational complexity of the proposed MIL algorithm are poly nomial in the maximal bag size and linear in the complexity of the underlying instance hypothes is class. This over labeled is formally stated in the following theorem, for the case of a realizable distribution p -norm bag-functions, since they are bags. Note that in particular, the theorem holds for all the 1-Lipschitz and satisfy the boundedness conditions. X [ − Theorem 30 , + 1 ] Let be a hypothesis class with pseudo-dimension d. Let B be a one- H ⊆ 1 with complexity c ( ε , δ ) . Let r ∈ N , and let R ⊆ [ r ] . Assume that the bag sided PAC-learner for H ( R ) -Lipschitz with respect to the infinity norm, and that for any , + 1 ] : − 1 → [ − 1 , + 1 ] is 1 [ function ψ ) ( R ( ) ∈ [ − 1 z + 1 ] ,..., z , 1 n 1 . z max ≤ ) z z ,..., ≤ ψ ( z i 1 i n ∑ n [ n ] ∈ i [ ∈ n i ] Assume that is compact with respect to any sample of size m. Let D be a distribution over H R ) ( ̄ , + 1 X which is realizable by } ×{− 1 ( H , that is there exists an h such that [ h P ∈ X ) = H ̄ X , Y ) ∼ ( D 1 2 . Assume m ≥ 10 d ln ( er ) , and let ε = Y ] = 1 ) and k ( 2 r − 1 32 = ln ( m ) . 2 2 r ∗ m ( 0 , 1 ) , if AdaBoost D is executed for k rounds on a random sample S ∼ For all δ , ∈ B O k ε , δ / 2 as the weak learner, then with probability 1 − δ , the classifier f returned by with MILearn ◦ ∗ AdaBoost satisfies √ 2 2 ( ln )+ m ( ) ln ) δ r / dr 2 ln ( ̄ . (27) X ≤ ( Y f [ O ] P 0 ≤ ) D m B is one-sided- ε -optimal with proba- is a one-sided PAC-learning algorithm, O Proof B Since k ε , δ / 2 B O ε , δ / k 2 / k . Therefore, by case (2) of Corollary 28, if MILearn δ receives a weighted bility at least 1 − S it returns a bag hypothesis , with probability 1 − δ / 2 k bag sample h ∈ such that H ◦ w + 2 sup − ε Γ S , r h ( ) w ( Ω ∈ h S ) , H ) Γ S , ≥ h ( . ◦ w − 1 2 r ∗ AdaBoost runs for k rounds then with probability 1 − δ / 2 it returns a Thus, by Corollary 23, if H such that linear combination of hypotheses from + 2 sup r − ) S ε f , ( M √ co ( Ω ( f ∈ )) H S , − M f ( ) S , ≥ / 2 ln m k . (28) ◦ 2 r 1 − that classifies correctly the bag ) S , H ( Due to the realizability assumption for h , there is an D Ω ∈ ( . It follows that for any weighting ∈ ∆ S S , Γ w h , S sample ) = 1. It is easy to verify that since of w m ) is compact with respect to , then so is Ω ( H , S S . Thus, by Theorem 22, sup H ) = S , f M ( f ∈ co ( Ω ( )) H S , ( with their values, setting sup ) = S , f Γ ( M , S k ) = 1. Substituting ε and h sup min w w ( H f ∈ co ( Ω , H , S )) S ) ( h ∈ Ω 1 in Equation (28) and simplifying, we get that with probability 1 − δ / 2 1 f M , S ) ≥ ( (29) . ◦ − 4 r 8 3029

32 S ABATO AND ISBHY T this we need to We would now like to apply the generalization bound in Equation (23), but for . We have the following bound on the covering numbers of H H , show that Equation (24) holds for γ 1 ( : ∈ , for all 0 ] ) ( d erm , γ ( N . ( γ , H , ∞ ) ≤ , ∞ ) ≤ N H m rm γ d The first inequality is due to Corollary 13 and the fact that is 1-Lipschitz, and the second inequality ψ is due to Haussler and Long (1995) and the pseudo-dimension of (see Equation (24) above). This H implies ( ) ) ) ( ( d d d erm em em d ln ln ) r ( d d ) ( r H , ∞ ) ≤ N , γ ( ( = e = )) er ( · e · 10 ln m γ ( ln d 10 er γ d γ ) d · ) ( d em ( r ( d ( ln )) 10 ln ( er ))+ ln e · . = ln ( er ) · 10 γ d Therefore, for ≥ 10 d ln ( er ) m ) ( d em ( ln ))+ er ( 10 ln ( ln d ( )) r γ , N ( · e H + γ ( , , H , ∞ ) ≤ 1 ∞ + ) ≤ N 1 m m + ( d er 10 · γ ln ) ( ) d em 10 ln ( er ( d ln ( )) ( er ))+ ln ≤ e · . ( γ · ) 10 d ln er Now, ln ( 10 ln ( er ))+ ln ( er ) = ln ( 10 )+ ln ( ln ( er ))+ ln ( er ) ≤ ln ( 10 )+ 2 ln ( er ) ≤ 3 + 2 ln ( er ) ≤ 5 ln ( er ) . Therefore, ( ( ( ) ) ) er ) ( ln d 5 ln ) d er ( 10 d 2 em em m e ) er ( ln d 5 H ≤ , ∞ ) ≤ ≤ . e · γ , ( N + m · γ · 10 d ln γ γ · 10 d ln ( er ) 10 d ln ( er ) ( er ) ln m 10 d ln ( er ) , Equation (24) holds for H Thus, for when substituting d with d . = 10 d ≥ ( er ) r + d with d as well. It This means the generalization bound in Equation (23) holds when substituting r − δ / 2 follows that with probability 1 √ 2 2 ( d ) ln ( m / d δ ) / M / f 1 , S )+ ln ( ◦ r r . ] ≤ ≤ [ O ) ( Y f P 0 X ◦ m − δ / 2, by Equation (29) we have M ( f Now, with probability 1 , S ) ≥ 1 / ( 8 r − 4 ) . Combining the ◦ two inequalities and applying the union bound, we have that with probability 1 δ − √ 2 2 d ( 8 r − 4 ) ) ln δ ( / / d 2 )+ ln ( m r r ) ≤ 0 ] ≤ O [ Y f X ( P ◦ m √ 2 2 ( 10 ln ( er )( 8 r − 4 ) ) ln δ d m )+ ln ( 2 / O ≤ . m Due to the O-notation we can simplify the right-hand side to get Equation (27). 3030

33 M ULTI NSTANCE L EARNING WITH A NY H YPOTHESIS C LASS -I ing as well, using Similar generalization results for Boosting can be derived for margin-learn covering-numbers arguments as discussed in Schapire et al. (1998). T he theorem above leads to the following conclusion. If there exists a one-sided PAC-learning algorithm for with polynomial run-time in H Corollary 31 1 1 , which has polynomial , then there exists a PAC-learning algorithm for classical MIL on H and ε δ 1 1 run-time in r, . and ε δ Corollary 31 is similar in structure to Theorem 1: Both state that if the single-insta nce problem is solvable with one-sided error, then the realizable MIL problem is solvable. Theorem 1 applies only gs drawn from an to bags with statistically independent instances, while Corollary 31 applies to ba nly requires that the arbitrary distribution. The assumption of Theorem 1 is similarly weaker, as it o while Corollary 31 requires single-instance PAC-learning algorithm handle random one-sided noise, urse, Corollary 31 does that the single-instance algorithm handle arbitrary one-sided noise. Of co 998). Indeed, this hardness not contradict the hardness result provided for APRs in Auer et al. (1 -dimensional APRs which is polynomial in result states that if there exists a MIL algorithm for d and d r R P = N P . Our result does not imply that such an algorithm exists, since there both , then is polynomial in d is no known agnostic or one-sided PAC-learning algorithm for APRs which . 7.3.1 E : H ALF - SPACES XAMPLE ss, to create a PAC-learning We have shown a simple and general way, independent of hypothesis cla tances. Whenever algorithm for classical MIL from a learning algorithm that runs on single ins an appropriate polynomial algorithm exists for the non-MIL learning proble m, the resulting MIL algorithm will also be polynomial in r . To illustrate, consider for instance the algorithm proposed in Shalev-Shwartz et al. (2010). This algorithm is an agnostic PAC-learne r of fuzzy kernelized half-spaces with an L > 0. Its time complexity -Lipschitz transfer function, for some constant L L 1 L (( and sample-complexity are at most poly · ln ) ( )) . Since this complexity bound is polynomial ε δ / ε and in 1 / δ , this algorithm can serve as the algorithm B in Theorem 30, and Corollary 31 in 1 lexity that depends holds. Thus, we can generate an algorithm for PAC-learning MIL with comp 1 1 . The full MIL algorithm r , directly on the complexity of this learner, and is polynomial in and ε δ ∗ un for fuzzy kernelized half-spaces can thus be described as follows: R AdaBoost with the weak B O B ε , δ B is listed in Algorithm 2, and is listed in Algorithm 1, learner MILearn O , where MILearn δ , ε ∗ AdaBoost is a labeled ut to is the agnostic PAC-learner from Shalev-Shwartz et al. (2010). The inp sample of bags, and the output is a real-valued classifier for bags. ement in the development More generally, using the construction we proposed here, any advanc of algorithms for agnostic or one-sided learning of any hypothesis class tr anslates immediately to an algorithm for PAC-learning MIL with the same hypothesis class, and with corr esponding complexity guarantees. 8. Conclusions In this work we have provided a new theoretical analysis for Multiple Instan ce Learning with any underlying hypothesis class. We have shown that the dependence of the sample complexity of 3031

34 S ABATO AND ISBHY T us implying that the generalized MIL on the number of instances in a bag is only poly-logarithmic, th statistical performance of MIL is only mildly sensitive to the size of the bag. The analysis includes f which are used in practice in binary hypotheses, real-valued hypotheses, and margin learning, all o MIL applications. Our sample complexity results can be summarized as follows, w d is the VC here dimension or pseudo-dimension of the underlying hypothesis class, and r is the maximal/average bag size. • ( d log ( r )) . The VC dimension of binary MIL is O For non-trivial bag functions, there are hypothesis classes such that • the VC dimension of binary MIL is ( d log ( r )) . Ω ( The VC dimension of binary MIL with separating hyperplanes in dimension • Ω ( d log d r )) . is • The pseudo-dimension of binary MIL for bag functions that are extensio ns of monotone Boolean functions is O ( d log ( r )) . • Covering numbers for MIL hypotheses with Lipschitz bag functions can be bounded by cov- ering numbers for the single instance hypothesis class. • The fat-shattering dimension of real-valued MIL with Lipschitz bag-function s is the sin- poly-logarithmic in the bag size and quasilinear in the fat shattering dimension of gle instance hypothesis class. The Rademacher complexity of binary MIL with a bounded average bag size is • √ d is the sample size. m ( r ) / m ) where log ( O • The Rademacher complexity of real-valued MIL with a Lipschitz loss function a nd a Lipschitz bag function is upper bounded by a logarithmic dependence on the averag e bag size and a quasilinear dependence on the Rademacher complexity of the instance hypo thesis class. for its natural ex- For classical MIL, where the bag-labeling function is the Boolean OR, and gs by executing a tension to max, we have presented a new learning algorithm, that classifies ba learning algorithm designed for single instances. This algorithm provably P AC-learns MIL. In both the sample complexity analysis and the computational analysis, we have shown tig ht connections between classical supervised learning and Multiple Instance Learning, w hich holds regardless of the underlying hypothesis class. Many interesting open problems remain for the generic analysis of MIL. In p articular, our re- sults hold under certain assumptions on the bag functions. An interesting ope n question is whether these assumptions are necessary, or whether useful results can be ac hieved for other classes of bag functions. Another interesting question is how additional structure within a ba g, such as sparsity, may affect the statistical and computational feasibility of MIL. These interestin g problems are left for future research. Acknowledgments The authors thank Shai Shalev-Shwartz and Nati Srebro for helpful d iscussions. Enlightening com- ments from Phil Long and Sally Goldman on the preliminary version of this work h ave helped 3032

35 M ULTI NSTANCE L EARNING WITH A NY H YPOTHESIS C LASS -I rted by the Adams improve the presentation immensely. During this work Sivan Sabato was suppo Fellowship Program of the Israel Academy of Sciences and Humanities. Th is work is partly sup- ported by the Gatsby Charitable Foundation. Appendix A. Proof of Theorem 27 , is to prove a guarantee for MILearn The first step in providing a guarantee for the edge achieved by A the edge achieved on the bag sample by the hypothesis returned by in step (3) of the algorithm. This is done in the following lemma. ( ) R 1 , + 1 ] 1 Lemma 32 Assume → [ − 1 , + ψ : is an ( a , b , c , d ) -bounded bag function with 0 < a ≤ [ − ] c . Consider running the algorithm MILearn with a weighted bag sample S of = c, and denote Z a . Let h total weight be the hypothesis returned by the oracle A in step (3) of MILearn . Let W be 1 I created in MILearn , step (2). Then the total weight of the sample S I A ε -optimal, 1. If is 1 1 ∗ d − Zb + W − )) . Zb W Z +( ε − − − )( d 1 Γ ( γ ) Z ≥ +( S , h + I Z Z A is one-sided- ε -optimal, and ψ ( z , then ,..., z ... ) = − 1 only if z − = 2. If = z = 1 n n 1 1 1 1 1 1 ∗ h , S ) ≥ ( Γ − +( − Z +( 1 − . W )( d − Zb )) W ε + Zb − γ + Z − d I + + Z Z Z Z ( R ) ̄ ̄ x Proof For all ,..., x ) ∈ X h ∈ H we have , and for all = ( x h . Since )) ( ) = ψ ( h ( x x ) ,..., h ( x n 1 1 n , is , b , c a d ) -bounded, it follows that ( ψ ̄ d (30) )+ x ( h . b ≤ h ( x )+ ) x c h ( ≤ a ∑ ∑ ̄ ̄ x ∈ x ∈ x x In addition, since a and c are positive we also have ̄ ̄ ≤ x ) ( ( h ) c / ≤ (31) . a d − ) x ( h h ( / x ) − b ) ( ∑ ̄ ∈ x x ̄ Assume the input bag sample is ( w x , S = = , y I ) } and } 1 { = + . Denote I y = { i ∈ [ m ] | + i i − i i ] [ ∈ m i i ∈ [ m ] | y . Let = − 1 } { h ∈ H be a hypothesis. We have i ̄ ̄ w w Γ ( ) x , S ( h ( ) = x h ) − h i i i i ∑ ∑ ∈ I I i ∈ i − + a w ( ≥ ( w c − b h ( x )+ ) h d )+ x ) ( (32) i i ∑ ∑ ∑ ∑ ̄ ̄ x x ∈ x ∈ x I ∈ i ∈ i I i i + − a w = w c w d (33) . h ( x ) − b w − x ( h )+ i i i i ∑ ∑ ∑ ∑ ∑ ∑ ̄ ̄ x ∈ x ∈ x x I ∈ i ∈ i I ∈ I i I ∈ i i i + − − + line (32) follows from Equation (30). As evident by steps (1,2) of MILearn S , In the sample all I instances from positive bags have weight (+ 1 ) = a , and all instances from negative bags have α weight α ( − 1 ) = c . Therefore Γ , S ) = ( h a w ) = x ( h ) y ( α y w w c − ) x ( h . h ( x ) I i i i i i ∑ ∑ ∑ ∑ ∑ ∑ ̄ ̄ ̄ x x ∈ x x ∈ x x ∈ i ∈ I I i ∈ ∈ [ m i ] i i i + − 3033

36 S ABATO AND ISBHY T Combining this equality with Equation (33) we get . d w w − b Γ , S h )+ ( , S ) ≥ Γ ( h i i I ∑ ∑ ∈ I i ∈ i I − + , it follows that W − 1 = W = w and = W w Since ∑ ∑ i − + + i ∈ I i I ∈ i − + Γ ( S ) ≥ Γ ( h , S − )+ bW h , dW (34) = Γ ( h , S . )+( b + d ) W d − + I + I − Now, for any hypothesis h we can conclude from Equation (31) that aw S h ) = ( Γ , cw ) x − ( h x ) ( h i I i ∑ ∑ ∑ ∑ ̄ ̄ x x ∈ x x ∈ i ∈ I ∈ i I i i + − ̄ ̄ ( aw ≥ ( cw a h ( / x x ) − b ) / c − − ( h ) ) d i i i i ∑ ∑ I ∈ i I i ∈ + − a c ̄ ̄ w w cbw a / adw c / + h − = − ) x x ( h ( ) i i i i i i ∑ ∑ ∑ ∑ a c i ∈ I ∈ I I ∈ i ∈ I i i − + − + a ad cb c c ̄ S )+( − x ) h ( h , Γ ) W − ( w + W = i i + − ∑ c a c a a i I ∈ + c ad a cb cb c ̄ = h ( h , − ) x S ( )+( w + . + W ) Γ − ) ( i i + ∑ c a c a a a I ∈ i + c , it follows that = = 1 − W In the last equality we used the fact that . Since Z W + − a 1 d ̄ ( S )+( h h − ) x , ( w Zb − Z ) + (35) . Zb + W ) ≥ h S Γ ) ( Z ( Γ , i i + I ∑ Z Z i I ∈ + 1 We will now lower-bound the right-hand-side of Equation (35). Note that Z ≤ 0 since c ≥ a . − Z ̄ ( . We consider each of the two cases in the ) x h Therefore we need an upper bound for w ∑ i i i I ∈ + statement of the lemma separately. ASE 1: A IS ε - OPTIMAL A.0.2 C ̄ H ∈ = h W . Therefore, by Equation (35) for any w ( ) ≤ h x w We have ∑ ∑ + i i i i ∈ I ∈ I i + + d 1 S ≥ ) Γ Z Γ ( ( h , − − − (36) . Z Zb ) W Zb + , S )+( h I + Z Z n 1 n ∗ such that Γ ( n h , set For a natural . We have (see explanations below) , S ) ≥ γ − h ∗ ∗ n , − , S ) ≥ Γ ( h (37) h S d )+( b + d ) W Γ ( + I I I n d (38) W , S )+( ε + d ) W − − b h ( Γ ≥ I + ∗ 1 d n ≥ Γ Z ( )+( S h , W − Z − ε (39) − Zb ) W − + Zb +( b + d ) W d − + + ∗ Z Z 1 1 n )( − ε − d − W Zb − Z +( 1 + W )) Zb d − S h )+( , Z = ( Γ + ∗ Z Z 1 1 1 ∗ γ ( Z ≥ − + W − Z +( 1 − . ε )( d − Zb )) W − )+( Zb − d + Z n Z Equation (37) is a restatement of Equation (34). Equation (38) follows fro m the ε -optimality of A . Equation (39) follows from Equation (36). By taking n → ∞ , this inequality proves case (1) of the lemma. 3034

37 M -I L EARNING WITH A NY H YPOTHESIS C LASS ULTI NSTANCE OPTIMAL 2: - SIDED - ε - NE A A.0.3 C ASE O IS ̄ ̄ I − h ( ( x ∈ ) ≤ i h . Then for all x , ) 1. Therefore S , H h ∈ Ω ( ) = . Let = w W We have w ∑ ∑ i − i + i i ∈ i I I ∈ i + + ̄ ̄ S ) = Γ ) ( x h ( h h ( , x − ) w w i i i i ∑ ∑ I ∈ i I ∈ i + − ̄ w = w ( x )+ h i i i ∑ ∑ I ∈ i i I ∈ + − ̄ w = )+ h . W ( x i − i ∑ i I ∈ + ̄ w Therefore W )+ , S h ( 1. Combining this with Equation (35) we x − ) = Γ ( h , S ) − W h = Γ ( ∑ i i + − i ∈ I + get 1 d ̄ Z Γ ( h , S Γ ( ( h , S − h ( ) x ≥ ) )+( w Zb ) − Zb + + Z W ) i I i + ∑ Z Z I ∈ i + 1 d + W ) Zb + . Zb − Z )( Γ ( , h − 1 ) − ( S W )+( S , h )+ Z ( Γ = + + Z Z 1 1 1 d = S , h )+( − Z − − (40) − Zb ) W . + Zb Γ ( Z + + Z Z Z Z n n n 1 ∗ ̄ i ∈ I , . For all bags − h ( 1. h S ) ≥ γ ∈ − Ω ( ) = H , S ) x such that h Γ ( , For a natural n , set − i + + + + n n n ̄ h ( Thus ψ x ( [ 1 ]) ,..., h ( in case (2) of the lemma, this [ | ψ x x ])) = − 1. By the assumption on i i i + + n n ̄ ) S . We have (see H , ∈ Ω ( 1. Therefore h x j [ ( − ]) = | i I x ∈ | ] , h , j ∈ [ implies that for all I i i − + + explanations below) Γ ( d , S ) ≥ Γ ( h (41) , S h )+( b + d ) W − + I I I n ( h , ≥ Γ S (42) )+( b + d ) W W − d − ε + I + 1 1 d 1 n , S )+( h Z Γ W − ε − (43) − − Zb ) W d + Zb − − W + ( +( b + d ) Z ≥ + + + Z Z Z Z 1 1 1 1 n h , S )+( = − W − − Z +( 1 − ε Γ )( d ( Zb )) W − + Zb − d + Z + + Z Z Z Z 1 1 1 1 1 ∗ ≥ − W )+( d ε − Z +( 1 − . − )( d − Zb )) W ( + Zb − γ + Z − + + Z Z n Z Z m the one-sided- - Equation (41) is a restatement of Equation (34). Equation (42) follows fro ε n optimality of A and the fact that h ∈ Ω ( H , S . Equation (43) follows from Equation (40). By ) I + considering n → ∞ , this proves the second part of the lemma. Proof [of Theorem 27] MILearn selects the hypothesis with the best edge on S between h . and h pos I Therefore , . )) S ) = max ( Γ ( h S , S ) , Γ ( h , Γ ( h I pos ◦ We have ̄ , h ( Γ ) = − . w W y 2 h = ( 1 x W ) = − w W y = S pos + i i − i i i + pos ∑ ∑ [ m ] [ m ] i ∈ i ∈ 3035

38 S ABATO AND T ISBHY Thus (44) ) = max ( 2 W − 1 S Γ ( h , S )) . , h ( Γ , + I ◦ h , S ) by bounding Γ ( ( h , S ) separately for the two cases of the theorem. Let Γ We now lower-bound I ◦ R . Since W ⊆ [ r ] , a ≤ c , and be the total weight of S w = 1, we have ∑ I i ] m ∈ [ i = W aw + w (45) rc = cw rc ≤ i i i ∑ ∑ ∑ ∑ ∑ ̄ ̄ ∈ x x x ∈ x i : y =+ 1 : i 1 = − y m ∈ i [ ] i i i i A IS ε - OPTIMAL A.0.4 C ASE 1: From Lemma 32 and Equation (45) we have 1 1 ∗ S ) ≥ Z h , +( γ Zb Z +( 1 − ε rc )( d − − )) W − + Zb − d Γ ( I + Z Z 1 1 ∗ Z γ +( = +( rc − Z 1 )( Z − 1 + η )) W ε − ( Z − 1 + η ) − − + Z Z 1 ∗ +( η − 2 )( 1 − γ Z = Z W + 1 − η − . − rc ε ) + Z d − Zb − Z + 1 = η . Combining this with Equa- The second line follows from the assumption tion (44) we get 1 ∗ ) ≥ max { 2 W − 1 , γ S +( η − 2 )( 1 − Z h , Γ ( . } ) W ε + 1 − η − Z − rc + ◦ + Z The right-hand-side is minimal when the two expressions in the maximum are equa l. This occurs when ∗ rc Z − ε + 2 − η − Z γ . , = W W ◦ + 1 ) η )( 1 − 2 +( 2 − Z Therefore, for any value of W + η 1 1 ∗ Z γ − Z + rc − + 1 ( ε ) − Z 2 Z = − W 2 ≥ ) S 1 . ( , Γ h ◦ ◦ η 1 − )( 1 ) +( 1 − 1 Z 2 A ε Case 2: -optimal From Lemma 32 and Equation (45) we have is one-sided- 1 1 1 1 ∗ ( Γ , ≥ ) S h +( − rc − Z +( 1 − γ ε )( d − Zb )) W − + Zb − d + Z I + + Z Z Z Z 1 1 1 1 ∗ = +( γ − − Z +( 1 − − rc )( Z − 1 + η )) W − ε ( Z − 1 + η )+ Z + + Z Z Z Z 1 1 1 ∗ +( η − )( 1 − 2 . ε ) γ rc + 1 − η − W − = + + Z Z Z d − Zb = Z − 1 + η . Combining this with Equation (44) The second line follows from the assumption we get 1 1 1 ∗ ≥ max { 2 h − 1 , Γ , S ( ) W ε rc . η − 2 )( 1 − } ) W − + 1 − η − γ +( + ◦ + + Z Z Z The right-hand-side is minimal when the two expressions in the maximum are equa l. This occurs when ∗ ) ε rc − η − Z 2 − 1 +( γ + . , W = W ◦ + Z +( 2 − η )( Z − 1 ) 2 3036

39 M ULTI NSTANCE L EARNING WITH A NY H YPOTHESIS C LASS -I W W in the lower bound, we get Substituting for ◦ + η ∗ − γ ε Z ( rc − ) + Z 1 + 2 1 = − W 2 ≥ ) S , Γ h ( . ◦ ◦ η 1 − Z − 2 Z − 1 ) ( 2 References N. Alon, S. Ben-David, N. Cesa-Bianchi, and D. Haussler. Scale-sen sitive dimensions, uniform convergence, and learnability. In Foundations of Computer Science, 1993. Proceedings., 34th , pages 292–301. IEEE, 1993. Annual Symposium on N. Alon, S. Ben-David, N. Cesa-Bianchi, and D. Haussler. Scale-sen sitive dimensions, uniform , 44(4):615–631, 1997. J. ACM convergence, and learnability. S. Andrews. Learning from Ambiguous Examples . PhD thesis, Brown University, May 2007. S. Andrews and T. Hofmann. Multiple-instance learning via disjunctive pro gramming boosting. In Advances in Neural Information Processing Systems 16 , 2003. S. Andrews, I. Tsochantaridis, and T. Hofmann. Support vector mach ines for multiple-instance learning. In , pages 561–568, 2002. Advances in Neural Information Processing Systems 15 Neural Network Learning: Theoretical Foundations . Cambridge M. Anthony and P. L. Bartlett. University Press, 1999. P. Auer, P. M. Long, and A. Srinivasan. Approximating hyper-rectan gles: learning and pseudoran- dom sets. Journal of Computer and System Sciences , 57(3):376–388, 1998. B. Babenko, N. Verma, P. Dollar, and S. Belongie. Multiple instance learnin g with manifold bags. In Proceedings of the 28th International Conference on Machine Learning , ICML (ICML-11) ’11, pages 81–88, 2011. P. L. Bartlett and S. Mendelson. Rademacher and Gaussian complexities: R isk bounds and structural results. , 3:463–482, 2002. Journal of Machine Learning Research P. L. Bartlett, S. R. Kulkarni, and S. E. Posner. Covering numbers for r eal-valued function classes. IEEE Transactions on Information Theory , 43(5):1721–1724, 1997. A. Blum and A. Kalai. A note on learning from multiple-instance examples. Machine Learning , 30 (1):23–29, 1998. Machine Learning , 20(3):273–297, September C. Cortes and V. Vapnik. Support-vector networks. 1995. ́ T. G. Dietterich, R. H. Lathrop, and T. Lozano-P erez. Solving the multiple instance problem with axis-parallel rectangles. Artificial Intelligence , 89(1-2):31–71, 1997. 3037

40 S ABATO AND ISBHY T The Annals of Probability , 6(6): R. M. Dudley. Central limit theorems for empirical measures. 899–929, 1978. of gaussian processes. R.M. Dudley. The sizes of compact subsets of hilbert space and continuity , 1(3):290 – 330, 1967. Journal of Functional Analysis Y. Freund and R.E. Schapire. A decision-theoretic generalization of on- line learning and an appli- cation to boosting. Journal of Computer and System Sciences , 55(1):119–139, August 1997. ̈ artner, P. A. Flach, A. Kowalczyk, and A. J. Smola. Multi-instance kerne ls. In Proceedings of T. G , pages 179–186, 2002. the Nineteenth International Conference on Machine Learning D. Haussler and P. M. Long. A generalization of Sauer’s lemma. Journal of Combinatorial Theory, , 71(2):219–240, 1995. Series A t to product distributions P. M. Long and L. Tan. PAC learning axis-aligned rectangles with respec from multiple-instance examples. Machine Learning , 30(1):7–21, 1998. ISSN 0885-6125. ́ O. Maron and T. Lozano-P erez. A framework for multiple-instance learning. In Advances in Neural Information Processing Systems 10 , pages 570–576, 1998. O. Maron and A. L. Ratan. Multiple-instance learning for natural scene c lassification. In Proceed- ings of the Fifteenth International Conference on Machine Learning , pages 341–349, 1998. S. Mendelson. Rademacher averages and phase transitions in glivenko -cantelli classes. IEEE Trans- actions on Information Theory , 48(1):251–263, 2002. Linear and Nonlinear Programming S. Nash and A. Sofer. . McGraw-Hill, New York, 1996. L. Pitt and L. G. Valiant. Computational limitations on learning from examples. Tec hnical report, Harvard University Aiken Computation Laboratory, July 1986. Convergence of Stochastic Processes . Springer-Verlag, 1984. D. Pollard. he missing links (ex- L. De Raedt. Attribute-value learning versus inductive logic programming: T tended abstract). In Proceedings of the 8th International Workshop on Inductive Logic Prog ram- ming , pages 1–8, London, UK, 1998. Springer-Verlag. ̈ atsch and M. K. Warmuth. Efficient margin maximizing with boosting. G. R Journal of Machine Learning Research , 6:2131–2152, December 2005. S. Sabato and N. Tishby. Homogeneous multi-instance learning with arbitrary dependence. In Proceedings of the Twenty Second Annual Conference on Computationa l Learning Theory , 2009. S. Sabato, N. Srebro, and N. Tishby. Reducing label complexity by learn ing from bags. In Proceed- ings of the Thirteenth International Conference on Artificial Intelligence an d Statistics , volume 9, pages 685–692, 2010. N. Sauer. On the density of families of sets. Journal of Combinatorial Theory Series A , 13:145–147, 1972. 3038

41 M ULTI NSTANCE L EARNING WITH A NY H YPOTHESIS C LASS -I nce-rated predictions. R. E. Schapire and Y. Singer. Improved boosting algorithms using confide , 37(3):1–40, 1999. Machine Learning new explanation for R.E. Schapire, Y. Freund, P. Bartlett, and W.S. Lee. Boosting the margin: A the effectiveness of voting methods. The Annals of Statistics , 26(5):1651–1686, October 1998. S. Shalev-Shwartz, O. Shamir, and K. Learning kernel-based halfspa ces with the zero-one loss. In Proceedings of the Twenty Third Annual Conference on Computational L earning Theory , 2010. N. Srebro, K. Sridharan, and A. Tewari. Smoothness, low-noise and f ast rates. CoRR , abs/1009.3896, 2010. V. N. Vapnik and A. Ya. Chervonenkis. On the uniform convergence o f relative frequencies of Theory of Probability and its Applications , XVI(2):264–280, 1971. events to their probabilities. Mathematische Annalen , 100:295–320, 1928. J. von Neumann. Zur theorie der gesellschaftsspiele. N. Weidmann, E. Frank, and B. Pfahringer. A two-level learning method fo r generalized multi- instance problems. In Proceedings of the European conference on machine learning , pages 468– 479, 2003. Q. Zhang and S.A. Goldman. EM-DD: An improved multiple-instance learning tec hnique. In Advances in Neural Information Processing Systems 14 , 2001. Z. Zhou, K. Jiang, and M. Li. Multi-instance learning based web mining. Applied Intelligence , 22 (2):135–147, 2005. 3039