1 Beyond Spatial Pyramids: Receptive Field Learning for Pooled Image Features 2 1 1 Trevor Darrell Chang Huang Yangqing Jia 2 1 UC Berkeley EECS & ICSI NEC Labs America 1 2 [email protected] } { jiayq,trevor @eecs.berkeley.edu Abstract tain properties, such as sparsity [24, 33, 34] or locality [31]. Recent papers [6, 28, 9] have explored the relationship be- tween dictionary learning and encoding, and have proposed In this paper we examine the effect of receptive field de- simple yet effective approaches that achieve competitive re- signs on classification accuracy in the commonly adopted sults. The neuroscience justification of coding comes from pipeline of image classification. While existing algorithms simple neurons in the human visual cortex V1, which have usually use manually defined spatial regions for pooling, we been believed to produce sparse and overcomplete activa- show that learning more adaptive receptive fields increases tions [24]. performance even with a significantly smaller codebook size at the coding layer. To learn the optimal pooling param- Similarly, the idea of spatial pooling dates back to eters, we adopt the idea of over-completeness by starting Hubel’s seminal paper about complex cells in the mam- with a large number of receptive field candidates, and train malian visual cortex [13], which identifies mid-level image a classifier with structured sparsity to only use a sparse sub- features that are invariant to small spatial shifting. The spa- set of all the features. An efficient algorithm based on incre- tial invariance property also reflects the concept of locally mental feature selection and retraining is proposed for fast orderless images [17], which suggests that low-level fea- learning. With this method, we achieve the best published tures are grouped spatially to provide information about the performance on the CIFAR-10 dataset, using a much lower overall semantics. Most recent research on spatial pooling dimensional feature space than previous methods. aims to find a good pooling operator, which could be seen as a function that produces informative statistics based on lo- cal features in a specific spatial area. For example, average 1. Introduction and max pooling strategies have been found in various al- gorithms respectively, and systematic comparisons between State-of-the-art category-level image classification algo- such pooling strategies have been presented and discussed rithms usually adopt a local patch based, multiple-layer in [2, 4]. Recently, Coates et al. proposed to pool over mul- pipeline to find good image features. Many methods start tiple features in the context of deep learning [7]. from local image patches using either normalized raw pixel However, relatively little effort has been put into better values or hand-crafted descriptors such as SIFT [22] or designs or learning of better spatial regions for pooling, al- HOG [10], and encode them into an overcomplete rep- though it has been discussed in the context of learning local resentation using various algorithms such as K-means or descriptors [32]. A predominant approach to define the spa- sparse coding. After coding, global image representations tial regions for pooling, which we will also call the recep- are formed by spatially pooling the coded local descriptors tive fields (borrowing the terminology from neuroscience) [33, 2, 4]. Such global representations are then fed into non- for the pooled features, comes from the idea of spatial pyra- linear classifiers [21] or linear classifiers [33], with the latter mids [21, 33], where regular grids of increasing granularity being more popular recently due to their computation effi- are used to pool local features. The spatial pyramids provide ciency. Methods following such a pipeline have achieved a reasonable cover over the image space with scale infor- competitive performance on several challenging classifica- mation, and most existing classification methods either use tion tasks, such as Caltech-101 and Pascal VOC [11]. them directly, or use slightly modified/simplified versions. During the last decade, much emphasis has been directed at the coding step. Dictionary learning algorithms have In this paper, we ask the question “are spatial pyramids been discussed to find a set of basis that reconstructs lo- optimal for image classification?” , the answer to which is cal image patches or descriptors well [23, 9], and several often neglected by existing algorithms. While a pyramid encoding methods have been proposed to map the origi- of regions succeeds in providing us information about the nal data to a high-dimensional space that emphasizes cer- spatial layout of image features, one can reasonably ques-

2 I 1 c ) ( R , 1 1 A pooling ) ( x f 2 coding , R ( c ) A 2 2 x “BEAR” : : K ) R , ( c M M A Figure 1. The image classification pipeline. See Section 2 for details. tion their optimality, as the grid structure may not be adap- while keeping the activations sparse helps classification, es- tive enough to fit the spatial statistics of natural images. As pecially when linear classifiers are used in the later steps. a simple example, to distinguish most indoor and outdoor Recently, Coates et al. [9] have shown that relatively scenes, a human may look for the existence of the horizon, simple dictionary learning and encoding approaches lead which could be captured by thin horizontal pooling regions to surprisingly good performances. To learn a dictionary over the image. Spatial grids, even with a pyramid structure, d K ··· of size ] , d , from randomly sam- d = [ D , 1 2 K fail to provide such information. { pled patches p , p , ··· , p } each reshaped as a vector 2 N 1 of pixel values, two simple yet effective approaches are ad- Instead of arbitrarily defining heuristic receptive fields, vocated: we aim to explicitly learn the receptive fields for classifi- cation tasks. Specifically, we propose to adaptively learn 1. K-means, minimizes the which squared dis- such regions by considering the receptive fields additional tance between each patch and its nearest code: ∑ parameters, and jointly learning these parameters with the N 2 min ‖ − d min p . ‖ D i j j 2 =1 i subsequent classifiers. The resulting benefit is two-fold: receptive fields tailored to classification tasks increase the 2. OMP-M, which learns a dictionary that minimizes the overall accuracy of classification; in addition, with the help reconstruction error, with the constraint that each patch of such mid-level features, we are able to use a much lower- M is modeled by a linear combination of at most ∑ N 2 dimensional feature to achieve the state-of-the-art perfor- min codes: , where the length ‖ α D − p ‖ D , α i i 2 i =1 i mance. We experiment with our algorithm on the bench- is 1 of each dictionary entry , and the cardinality of d j mark CIFAR-10 dataset and other datasets, and report a sig- α . each reconstruction coefficient is at most M i nificant improvement in both accuracy and efficiency. For encoding, Coates et al. also propose to substitute sparse The remainder of the paper is organized as follows. Sec- coding by the following efficient approaches: tion 2 provides a background review of the image classifica- tion pipeline, and Section 3 proposes the adaptive receptive 1. Triangle coding [6], which computes the activation of field learning idea, as well as an efficient algorithm to carry } ( f as p for a patch k code , z − ) z ( ,μ 0 { ) = max x k k out learning. Experiments are presented in Section 4, and d is the distance from -th code k to the p , z where k k we conclude the paper in Section 5. ) z ( μ to all codes. is the mean of distances from and p 2. Soft thresholding, which computes the inner product 2. The Classification Pipeline and each code, with a fixed threshold pa- p between > x ( f : α rameter − p 0 d , } { α ) = max In this section, we briefly review the image classification k k pipeline we adopted, which leads to the problem of learning We refer to [9] for a systematic discussion about differ- the receptive fields for spatial pooling. Specifically, we will ent dictionary learning and encoding algorithms. In our ex- focus on two-layer classification approaches. periment, we will adopt these standard approaches in order We illustrate the pipeline from raw images to the predic- to isolate the contribution of spatial pooling from the choice tion of class labels in Figure 1. Specifically, starting with of different coding methods. Since local patches are usually an input image , two stages are usually adopted to generate I extracted densely in a grid-based fashion, we will organize the global feature, as we formally define below. the activations of image I as a set of matrices denoted by 2 K 1 ( ( , one for each code in the code- } ) { A I A , ) A I ( I ) , ··· k book, whose element A ( I ) contains the activation of code (1) Coding. In the coding step, we extract local image ij i,j d ( for the local image patch at spatial location . ) k activation values based K patches, and encode each patch to K on a codebook of size (learned via a separate dictionary learning step). These activations are typically binary (in the (2) Pooling. Since the coding result are highly overcom- case of vector quantization) or continuous (in the case of plete and highly redundant, the pooling layer aggregates the e.g. sparse coding). It is generally believed that having an activations over certain spatial regions of the image to ob- the dimension of patches) codebook K overcomplete ( tain an M dimensional vector x as the global representation

3 of the image. Each dimension of the pooled feature is x i obtained by taking the activations of one code in a specific spatial region (shown as the red rectangular in Figure 1), and performing a predefined operator (usually average or max) on the set of activations. We follow a similar approach to that in [3] to formally (c) (a) (b) define pooled features. Specifically, given an operator op Figure 2. An example of overcomplete rectangular bins based on a that maps a set of real values to a single real value (e.g. × superpixel setting: (a) superpixels; (b) spatial pyramid bins; 4 4 can be defined by taking their average), a pooled feature x i (c) overcomplete rectangular bins. and a spatial based on the selection of a code indexed by c i R : region denoted by i Inspired by the selectivity of complex cells in the visual cortex, we propose to learn the pooled features adaptively. c i = op( (1) x ) A i R i pooled features is equiv- M Specifically, learning a set of , c ,c alent to learning the parameters ··· ,c } and C { = 2 1 M Borrowing the definition from neuroscience, we call R the i 1 R R R , R { , ··· , } = . To this end, we note that the 2 M 1 receptive field for the pooled feature, which could be seen pooled features are directly fed into the final classifier, and c i is then the set of A as a binary mask over the image. R i propose to jointly learn the classifier parameters θ together . R in the receptive field c activations of code i i with the pooling parameters. Thus, given a set of training This definition provides a general definition that em- N ( } ) , the joint learning leads to solving y , data I X { = n n =1 n braces existing pooling algorithms. For example, com- the following optimization problem: monly used operators involve computing the statistics of the activations under the -norm: p N ∑ 1 , ) + (3) ) θ Reg( ; y x ( λ ( θ ) f L min n n ∑ C θ , , R N 1 1 p =1 n p α (2) ) ( x = i c i i c α ∈ A i | | R i i R where x = op( A ) i ni n, R i = 1 this corresponds to the average pooling, and p when c K i I where we assume that the coding from A { to } is n n i =1 when p →∞ this corresponds to the max pooling. done in an unsupervised fashion, as has been suggested by In our paper we focus on the definition of receptive fields several papers such as [6]. We call this method receptive for pooling. The simplest form of pooling takes the whole field learning, as the receptive fields are learned in such a image as the receptive field, thus assuming a bag-of-words way that information most relevant to the classification task model where spatial information is ignored. The more com- will be extracted. monly adopted spatial pooling approach [21, 33] pools fea- One practical issue is that solving the optimization prob- tures from multiple levels of regular grids, thus defining a lem (3) may be impractical, as there is an exponential num- codes and a pyramid of pooled features. Given a set of K ber of receptive field candidates, leading to a combinato- receptive fields, the pooled features are then de- N set of rial problem. Numerical solutions are also difficult, as the fined by taking the Cartesian product of the codes and the gradient with respect to the pooling parameters is not well- receptive fields, yielding a KN -dimenisonal global feature. defined. Thus, instead of searching in the space of all possi- Finally, a classifier, usually linear SVM or logistic re- ble receptive fields, we adopt the idea of over-completeness gression, is trained using the global feature vector to predict in the sparse coding community. Specifically, we start from . θ ; x ( ) f = y the final label of the image as a set of reasonably overcomplete set of potential receptive fields, and then find a sparse subset of such pooled features. 3. Receptive Field Learning for Pooled Fea- The over-completeness enables us to maintain performance, tures while the sparsity allows us to still carry out classification efficiently during testing time. While significant efforts have been placed on the coding part of the classification pipeline, the pooling step has re- 3.1. Overcomplete Receptive Fields ceived relatively little attention. Existing research on pool- The exponential number of possible receptive fields ing mainly focuses on the analysis of the pooling operator, arises when we consider the inclusion and exclusion of sin- such as in [4]. Specifically, spatial regions are almost al- gle pixels individually. In practice this is often unneces- ways defined on regular grids [33]. In fact, regular grids sary, as we expect the active pixels in a receptive field to be may not guarantee to be optimal: for example, long hori- 1 zontal bars may serve as better pooling regions for natural For simplicity, we will use the max operator, but note that any opera- scenes, and such receptive fields may be dataset-dependent. tor could also be incorporated in our framework.

4 Performance vs. Number of Features spatially contiguous. In this paper, we use receptive fields 0.90 2 : this provides us a rea- consisting of rectangular regions 0.85 4 ( ) dif- n sonable level of over-completeness, as there are O 0.80 ferent rectangular receptive fields for an image containing 0.75 pixels. In addition, since the motivation of spatial n n × 0.70 Accuracy pooling is to provide tolerance to small spatial displace- 0.65 ments, we build the rectangular regions upon superpixels, 0.60 Training which are defined as dense regular grids on the image. Fig- 0.55 Testing ure 2 shows an example of such rectangular receptive fields 0.50 0 3000 5000 6000 1000 2000 4000 compared with regions defined by the spatial pyramid on a Number of Features 4 grid. × 4 Figure 3. Performance vs. number of selected features, with the experiment setting in Table 1 of Section 4. Given the set of P overcomplete regions, which we denote by , R , } ··· R , R { = R , and the dictionary P 1 2 L norm of the matrix W : the second regularizer is the , ∞ 1 = { d K d , , d D , ··· } of size , we can define a set of 2 1 K PK potential pooled features based the Cartesian product M M ∑ ∑ . Specifically, the R × D j -th -th receptive field and the i W ‖ = ‖ = ‖ ‖ W | | W max (5) ∞ , 1 · i, ∞ ij ∈{ 1 } j ,L ··· , K × i + code jointly defines the j ) -th pooled feature as ( =1 i =1 i j x . Note that when the coding and pool- ) = op( A i + j K × R i denotes the -th row of the matrix i where W . This reg- W · i, ing are both carried out in an overcomplete fashion, the re- ularizer introduces structured sparsity by encouraging the sulting pooled feature is usually very high-dimensional. weight matrix to be row-wise sparse, so that the classi- W fiers for different classes tend to agree on whether to use a 3.2. Structured Sparsity for Receptive Field Learn- specific feature, and when combined together, only jointly ing use a subset of the overcomplete pooled features. The addi- While it is possible to train a linear classifier using the tion of the L norm also provides a elastic-net like regu- ∞ , 1 x above, in practice it high-dimensional pooled feature larization, which is known to perform well when the dimen- is usually beneficial to build a classifier using relatively sion of data is much higher than the number of data points low-dimensional features. In addition, for multiple-label [37]. classification, we want the classifiers of different labels to For optimization considerations, we use the multi-class share features. This brings two potential advantages: fea- extension of the binomial negative log likelihood (BNLL) ture computation could be minimized, and sharing features loss function [25]: among different classifiers is known to provide robustness L ∑ > to the learned classifiers. To this end, we adopt the idea of ( − ) b + x y > W i i · ,i , x ( ln(1 + + b l W y ) (6) ) = e structured sparsity [26, 29], and train a multiple-class linear i =1 classifier via the following optimiza- b + y = f ( x ) = Wx The choice of the BNLL loss function over the hinge loss is tion problem: mainly for computational simplicity, as the gradient is eas- N ier to compute for any input. In practice, the performance ∑ λ 1 1 > 2 does not change much if we use the hinge loss instead. ‖ W ‖ + + l ( ‖ W ‖ x b W )+ y , λ min n 2 , ∞ 1 n Fro W , b N 1 n =1 3.3. Fast Approximate Learning by Feature Selec- (4) tion where is the L 1 y -dimensional label vector coded in a − i fashion, with values taken from {− 1 , +1 L given L − of } Jointly optimizing (4) is still a computationally challeng- -dimensional feature vector defined by is an M classes. x i ing task despite its convexity, due to the over-completeness overcomplete pooling in the previous subsection, and W = in both coding and pooling. While it is possible to carry w , [ w , w , ··· ] weight matrix containing the L × M is a 2 L 1 out the computation on smaller-scale problems like Caltech- classifiers. L weight vector for the 101, we adopt a greedy approach to train the model for Two regularization terms are adopted in the optimiza- larger-scale problems. Inspired by the matching pursuit al- 2 W tion. The squared Frobenius norm ‖ ‖ aims to mini- Fro gorithm in dictionary training and the grafting algorithm mize the structured loss in the classical SVM fashion, and [25] in machine learning, we start with an empty set of se- lected features, incrementally add features to the set, and 2 As a side note, we also experimented with receptive fields that are retrain the model when new features are added. sampled from an Ising model on the fly during training, but rectangular re- recording the set Mathematically, we maintain a set S gions worked empirically better, possibly because the additional flexibility of Ising models leads to over-fitting the training data. of currently selected features. At each iteration, for each

5 77.0 feature index j that has not been not selected, we compute 76.59 SPM 76.0 the score of the feature as the 2-norm of the gradient of the Rand objective function (4), denoted by L b ) , with respect to W , ( 75.41 75.0 Ours the corresponding weight vectors: 74.83 74.0 ∥ ∥ 2 Figure 4. Performance comparison among spatial pyramid pool- ∥ ∥ L ∂ ( W , b ) ∥ ∥ (7) ) = j score( ing, random feature selection and our method, all using the same ∥ ∥ ∂ W · j, Fro number of features for the final classification. It can be observed that a few selected features could already achieve a comparatively We then select the feature with the largest score, and add high performance. it to the selected set S . The model is retrained using the previously learned optimum solution as the starting point. From a boosting perspective, this can be considered as in- and we refer to [9] for a detailed discussion about dictionary crementally learning weak classifiers, but our method dif- learning and coding algorithms. fers from boosting in the sense that the weights for already For classification, when we use pre-defined receptive selected features are also updated when new features are se- fields such as spatial pyramids, the SVM regularization term lected. is chosen via 5-fold cross validation on the training data. As the speed of retraining drops when more features are 01 When we perform feature selection, we fix = 0 λ . 1 added, we adopt an approximate retraining strategy: for (which is the best value when performing 5-fold cross val- , we select an active subset S based S each iteration of t × 2 regular grid) and drop idation for max pooling on a 2 A on the score above. We then retrain the model with respect λ , since the incremental feature selection already serves as 2 to the active set and the bias term only: a greedy approximation of the sparse constraint. Although the parameters are not tuned specifically for each configura- +1) t ( b L ( W , b ) (8) W = arg min , tion, we found it to perform well empirically under various b , W , S · S , · A A scenarios. keep unchanged. The intu- W with the constraint that ̄ S , · A 4.1. Spatial Pyramid Revisited ition is that with an already trained classifier from the pre- vious iteration, adding one dimension will only introduce It is interesting to empirically evaluate the performance small changes to the existing weights. of spatial pyramid regions against other choices of recep- In practice, we found the performance of this approx- tive fields. To this end, we trained a dictionary of size 200 imate algorithm with the active set size less than 100 to (for speed considerations), and tested the performance of 3- be very close to the full retraining algorithm with a signifi- layer spatial pyramid pooling against two algorithms based cant increase in computation speed. Figure 3 shows typical on overcomplete receptive fields: (1) random selection from curves of the training and testing accuracy with respect to the overcomplete pooled features, and (2) our method, both the number of iterations. The performance usually stabi- selecting the same number of features that spatial pyramid lizes with a significantly smaller number of features, show- pooling uses. Results are shown in Figure 4. Our method ing the effectiveness of introducing structured sparsity into outperforms SPM, but a more interesting finding is that classifier learning. the predefined spatial pyramid regions perform consistently worse than random selection, indicating that arbitrarily de- 4. Experiments fined pooled features may not capture the statistics of real- world data well. With explicit learning of the pooling pa- We will mainly report the performance of our algorithm rameters, we achieved the highest performance among the 3 × on the CIFAR-10 dataset 32 , which contains 50,000 32 three algorithms, showing the effectiveness and necessity of images from 10 categories as training data, and 10,000 im- learning adaptive receptive fields. ages as testing data. We fix the dictionary learning algorithms to k-means 4.2. The Effect of Spatial Over-completeness clustering and the coding algorithms to triangular coding One may ask if the performance increase could be ob- as proposed in [6] for CFAR-10. Such a coding strategy has tained without overcompletenes by simply using a denser been shown to be particularly effective in spite of its sim- grid. To answer this question, we examined the perfor- plicity. We also tested alternative dictionary learning and mance of our algorithm against the 2 × 2 pooling grid (which coding algorithms, which led to similar conclusions. As is used in [9] to obtain very high performance) and a denser our main focus is on learning receptive fields for pooled fea- We 4 × 4 grid, with both average and max poolings. tures, the results of different coding algorithms are omitted, also compared our method against random feature selection 3 from the same pooling candidates. Table 1 summarizes the http://www.cs.toronto.edu/ kriz/cifar.html

6 Pooling Area Method Features Accuracy Performance on CIFAR-10 84 70.24 2 2 Ave × 800 82 4 4 × 72.24 3,200 Ave 80 66.31 800 Max × 2 2 78 4 Max 3,200 73.03 4 × 76 74.83 4,200 Max 3-layer SPM Accuracy 74 800 Max 73.42 OC + feat select 72 76.28 3,200 Baseline Ours, equal-dim 76.59 4,200 70 Ours, optimum-dim 76.72 6,400 68 800 200 400 1600 4000 Codebook Size 76.44 20,000 Max OC, all features Figure 5. Testing accuracy on CIFAR-10 with and without 800 Max OC + rand select 69.48 overcomplete pooling. In the figure, “equal-dim” selects the Max 3,200 74.42 OC + rand select same number of features as the baseline (Coates et al.[6]), and OC + rand select Max 4,200 75.41 “optimum-dim” selects the optimum number of features deter- Table 1. Comparison of different pre-defined pooling strategies mined by cross-validation. (X-axis in log scale) and our method (overcomplete (OC) + feature selection). Ran- dom selection from the same overcomplete pooled features is also listed, showing the necessity of better receptive field learning. 4.3. Larger Codebook vs. Better Spatial Pooling Under the two-stage pipeline adopted in this paper, there are effectively two possible directions to increase the per- testing accuracy under various experimental settings, using formance: to increase the codebook size and to increase the a codebook size of 200. pooling over-completeness. We argue that these two direc- Results from Table 1 demonstrates that denser pooling tions are complementary: the performance gain from our does help performance. The 4 × 4 grid increases the perfor- effort on pooling could not simply be replaced by increas- × 2 pooling. How- mance by about 3 percent compared to 2 ing the codebook size, at least not easily. More importantly, ever, with overcomplete receptive fields we can almost al- as the codebook size grows larger, it becomes more difficult ways increase performance further. We achieved an 76.72% to obtain further performance gain, while it is still relatively accuracy with only 200 codes, already close with state-of- easy to obtain gains from better pooling. the-art algorithms using much larger codebook sizes (Table To empirically justify this argument, we trained multi- 2). It is also worth pointing out that even random feature ple codebooks of different sizes, and compared the result- selection gives us comparable or better performance when ing accuracies with and without overcomplete pooling in compared to pre-defined pooling grids under the same num- Figure 5. As can be observed, it becomes harder to ob- ber of feature dimension (e.g. compare the performance be- tain further performance gain by increasing the codebook , 200 fea- tween 4 × 4 max pooling and randomly selecting 3 size when we already have a large codebook, while using tures from an overcomplete set of pooled features). a better pooling strategy always brings additional accuracy gains. In fact, with our method, we are able to use a code- Further, the importance of feature selection lies in two book of half the size (and half the number of pooled fea- aspects: first, simply using all the features is not practical tures) while maintaining performance (compare the green during testing time, as the dimension can easily go to hun- and blue curves). It is particularly interesting that, by select- dreds of thousands when we increase the codebook size. ing more features from the overcomplete spatial regions, we Feature selection is able to get very close performance com- are able to achieve state-of-the-art performance with a much pared to using all the features, but with a significantly lower smaller number of codes (the red curve), which has the po- dimensionality, which is essential in many practical scenar- tential in time-sensitive or memory-bounded scenarios. ios. Usually, feature selection enables us to achieve a high performance with only a few features (Figure 3). Adding 4.4. Best Performance remaining features will only contribute negligibly to the Our best performance on the CIFAR-10 dataset was overall performance. Second, performing feature selection achieved by training a codebook size of 6,000, performing has the potential benefit of removing redundancy, thus in- max pooling on overcomplete rectangular bins based on a creasing the generalization ability of the learned classifiers grid, and selecting features up to 24,000 dimensions. 4 × 4 [25, 30]. In our experiment in Table 1, the best performance We also note that the accuracy has not saturated at this num- is achieved with a few thousands features. Similarly, we ber of features, but we would like to test the performance found that with larger codebook sizes, using all the over- when the number of mid-level features is limited to a rea- complete pooled features actually decreases performance, sonable scale. With these settings, we achieved an accuracy arguably due to the decrease of the generalization ability.

7 Method Method err% Pooled Features Accuracy a 6,400 Baseline [9] 1.02 ours, d=1600 80.17 ours, d=4000 82.04 16,000 0.64 Our Method ours, d=6000 83.11 24,000 0.83 Lauer et al. [20] 0.59 77.9 Coates et al. [6], d=1600 6,400 Labusch et al. [19] Ranzato et al. [27] Coates et al. [6], d=4000 16,000 79.6 0.62 Coates et al. [9], d=6000 Jarrett et al. [15] 0.53 48,000 81.5 78.9 N/A Conv. DBN [18] a Our implementation. Improved LCC [35] 74.5 N/A 80.49 8-layer Deep NN [5] N/A Figure 6. Left: Performance comparison (error rate in percentage) 3-layer Deep NN [8] 82.0 N/A on MNIST. Top box: comparison between algorithms using simi- lar pipelines. Bottom box: performance of other related algorithms Table 2. Performance on the CIFAR-10 dataset. The first and sec- in the literature. Right: 1-vs-1 saliency maps learned on MNIST. ond blocks compare performance between our method and Coates The left-bottom corner plots the mean of digit 8 and 9 multiplied et al. [6, 9] under similar codebook sizes, where the only differ- by the corresponding saliency map, showing that the classifier fo- ence is the spatial pooling strategy. The third block reports the cuses on the bottom part which intuitively also distinguishes the performance of several state-of-the-art methods in the literature. two digits best. of 83.11% on the testing data. To the best of our knowl- Pooling Performance Method Codebook edge, this is the best published result on CIFAR-10 without ScSPM [33] 0.54 ± 73.2 SPM 1024 (SC) increasing the training set size by morphing the images. LCC+SPM [31] 1024 SPM 73.44 Table 2 lists the performance of several state-of-the-art 1024 (SC) ± 75.3 OC 0.70 Our Method methods. It is also worth pointing out that, to achieve the 0.7 Boureau et al. [3] 64K SPM 77.1 ± same performance, our algorithm usually uses a much lower SPM [21] 64.6 0.7 ± number of features compared with other well-performing 72.8 ± 0.39 (15 training) NBNN [1] algorithms. 65.6 ± 1.0 Jarret et al. [15] RLDA [16] 0.8 73.7 ± 4.5. Results on MNIST 71.0 ± 1.0 Adaptive Deconv. Net [36] Feng et al. [12] 82.6 We can view the set of learned receptive fields for pool- ing as a saliency map for classification [14]. To visually Table 3. Performance comparison (accuracy in percentage) on show the saliency map and verify its empirical correctness, Caltech-101. Top: comparison between algorithms using similar pipelines. Bottom: performance of other related algorithms in the we applied our method to handwritten digit recognition on literature. the MNIST dataset, on which convolutional deep learning models are particularly effective. To this end, we adopted a similar pipeline as we did for CIFAR-10: dense 6x6 local 4.6. Results on Caltech-101 patches with ZCA whitening are used; a dictionary of size Lastly, we report the performance of our algorithm com- is trained with OMP-1, and thresholding coding with 800 pared with SPM on the Caltech-101 dataset in Table 3. = 0 α 25 (untuned) is adopted. The features are then max- . State-of-the-art performance following similar pipelines are 6 × pooled on overcomplete rectangular areas based on a 6 also included in the table. Specifically, we used the same regular grid. Note that we used a different coding method two-step pipeline as proposed by Yang et al. [33]: SIFT from the CIFAR-10 experiment to show that the overcom- features are extracted from 16 × 16 patches with a stride of plete spatial pooling method is agnostic of the choice of 8, and are coded using sparse coding with a codebook of low-level coding algorithms. Any parameter involved in the size 1024. For SPM, the coded features are pooled over a pipeline such as SVM regularization weights is tuned on a 2 × 2 regular grids; for a fair compar- 4 , 1 × 1 pyramid of 4 × , random 50k/10k split of the training data. regular grid as our base regions, 4 × ison we also use the 4 Figure 6 shows the 1-vs-1 saliency maps between digits. and select the same number of features as SPM uses. It can be seen that by learning receptive fields, the classifier focuses on regions where the digits have maximal dissim- As can be observed in the table, our pooling algorithm ilarity, e.g., the bottom part for 8 and 9, and the top part outperforms spatial pooling, although a gap still exists be- for 3 and 5, which matches our intuition about their appear- tween our result and state-of-the-art methods, which uses more complex coding schemes than that we used. The ances. For 10-digit classification, we achieved an error rate 64% . 0 of , on par with several state-of-the-art algorithms results suggest that coding is a more dominant factor for (Figure 6 left). A gap still exists between our method and the performance of Caltech-101. Existing research, espe- the best deep-learning algorithm, and combining receptive cially the Naive Bayes nearest neighbor method [1], has learning with deeper structures is future work. also shown a consistent increase of accuracy with higher-

8 dimensional coding output [3, 34]. However, we still obtain [13] DH Hubel and TN Wiesel. Receptive fields, binocular interaction The Journal of and functional architecture in the cat’s visual cortex. a consistent gain by adopting more flexible receptive fields Physiology , 160:106–154, 1962. 1 for pooling, which justifies the effectiveness of the proposed [14] L Itti and C Koch. Computational modeling of visual attention. Na- algorithm. Note that the best performance reported by Feng ture reviews neuroscience , 2001. 7 et al. [12] was obtained by jointly learning the pooling op- [15] K Jarrett, K Kavukcuoglu, MA Ranzato, and Y LeCun. What is the p erator ( p in -norm pooling) and a per-code spatial saliency , 2009. ICCV best multi-stage architecture for object recognition? In 7 map in addition to a larger dictionary, which also follows [16] S Karayev, M Fritz, S Fidler, and T Darrell. A probabilistic model the idea of learning better spatial information beyond SPM. , 2011. 7 CVPR for recursive factorized image features. In [17] JJ Koenderink and AJ van Doorn. The structure of locally orderless 5. Conclusion images. IJCV , 31(2/3):159–168, 1999. 1 [18] A Krizhevsky. Convolutional deep belief networks on CIFAR-10. In this paper, we examined the effect of receptive field Technical Report , 2010. 7 designs on the classification accuracy in the commonly [19] K Labusch, E Barth, and T Martinetz. Simple method for high- adopted pipeline of image classification. While existing performance digit recognition based on sparse coding. , IEEE TNN algorithms use manually defined spatial regions for pool- 19(11):1985–1989, 2008. 7 ing, learning more adaptive receptive fields increases per- [20] F Lauer, CY Suen, and G Bloch. A trainable feature extractor formance even with a significantly smaller codebook size at , 40(6):1816– Pattern Recognition for handwritten digit recognition. 1824, 2007. 7 the coding layer. We adopted the idea of over-completeness [21] S Lazebnik, C Schmid, and J Ponce. Beyond bags of features: Spa- and structured sparsity, and proposed an efficient algorithm tial pyramid matching for recognizing natural scene categories. In to perform feature selection from a set of pooling candi- CVPR , 2006. 1, 3, 7 dates. With this method, we achieved the best published [22] D Lowe. Distinctive image features from scale-invariant keypoints. performance on the CIFAR-10 dataset, using a much lower IJCV , 2004. 1 dimensional feature space than previous methods. Possi- [23] J Mairal, F Bach, J Ponce, and G Sapiro. Online learning for matrix JMLR , 11:19–60, 2010. 1 factorization and sparse coding. ble future work involves more flexible definition of pooling [24] B Olshausen and DJ Field. Sparse coding with an overcomplete basis receptive fields, and unsupervised learning of such pooled , 37(23):3311–3325, set: a strategy employed by V1? Vision research features. 1997. 1 [25] S Perkins, K Lacker, and J Theiler. Grafting: fast, incremental feature References , 3:1333–1356, JMLR selection by gradient descent in function space. 2003. 4, 6 [1] O Boiman, E Shechtman, and M Irani. In defense of nearest-neighbor [26] A Quattoni, M Collins, and T Darrell. Transfer learning for image , 2008. 7 based image classification. In CVPR CVPR , 2008. classification with sparse prototype representations. In [2] Y Boureau, F Bach, Y LeCun, and J Ponce. Learning mid-level 4 features for recognition. In CVPR , 2010. 1 [27] MA Ranzato, FJ Huang, Y Boureau, and Y LeCun. Unsupervised [3] Y Boureau, N Le Roux, F Bach, J Ponce, and Y LeCun. Ask the lo- learning of invariant feature hierarchies with applications to object cals: multi-way local pooling for image recognition. In ICCV , 2011. , 2007. 7 CVPR recognition. In 3, 7, 8 [28] R Rigamonti, MA Brown, and V Lepetit. Are sparse representations [4] Y Boureau and J Ponce. A theoretical analysis of feature pooling in , 2011. 1 really relevant for image classification? In CVPR , 2010. 1, 3 ICML visual recognition. In [29] M Schmidt, K Murphy, G Fung, and R Rosales. Structure learning [5] DC Cires ̧an, U Meier, J Masci, LM Gambardella, and J Schmidhu- in random fields for heart motion abnormality detection. In CVPR , ber. High-performance neural networks for visual object classifica- 2008. 4 tion. ArXiv e-prints , 2011. 7 [30] R Tibshirani. Regression shrinkage and selection via the lasso. JRSS [6] A Coates, H Lee, and AY Ng. An analysis of single-layer networks , pages 267–288, 1996. 6 Series B AISTATS in unsupervised feature learning. In , 2010. 1, 2, 3, 5, 6, 7 [31] J Wang, J Yang, K Yu, F Lv, T Huang, and Y Gong. Locality- constrained linear coding for image classification. In CVPR , 2010. [7] A. Coates and A.Y. Ng. Selecting receptive fields in deep networks. 1, 7 , 2011. 1 NIPS In CVPR [32] S Winder and M Brown. Learning local image descriptors. In , [8] A Coates and AY Ng. Selecting receptive fields in deep networks. In 2007. 1 NIPS , 2011. 7 [33] J Yang, K Yu, and Y Gong. Linear spatial pyramid matching using [9] A Coates and AY Ng. The importance of encoding versus training sparse coding for image classification. In , 2009. 1, 3, 7 CVPR , 2011. 1, 2, 5, ICML with sparse coding and vector quantization. In [34] J Yang, K Yu, and T Huang. Efficient highly over-complete sparse 7 coding using a mixture model. In ECCV , 2010. 1, 8 [10] N Dalal. Histograms of oriented gradients for human detection. In [35] K Yu and T Zhang. Improved local coordinate coding using local CVPR , 2005. 1 ICML tangents. In , 2010. 7 [11] M Everingham, L Van Gool, CKI Williams, J Winn, and A Zisser- [36] MD Zeiler, GW Taylor, and R Fergus. Adaptive deconvolutional , IJCV man. The PASCAL visual object classes (VOC) challenge. networks for mid and high level feature learning. In ICCV , 2011. 7 88(2):303–338, 2010. 1 [37] H Zou and T Hastie. Regularization and variable selection via the [12] J Feng, B Ni, Q Tian, and S Yan. Geometric Lp-norm Feature Pool- JRSS , 67(2):301–320, 2005. 4 elastic net. CVPR ing for Image Classification. In , 2011. 7, 8