A Taxonomy and Evaluation of Dense Two-Frame Stereo Correspondence Algorithms

Daniel Scharstein
Dept. of Math and Computer Science
Middlebury College
Middlebury, VT 05753
[email protected]

Richard Szeliski
Microsoft Research
Microsoft Corporation
Redmond, WA 98052
[email protected]

Abstract

Stereo matching is one of the most active research areas in computer vision. While a large number of algorithms for stereo correspondence have been developed, relatively little work has been done on characterizing their performance. In this paper, we present a taxonomy of dense, two-frame stereo methods. Our taxonomy is designed to assess the different components and design decisions made in individual stereo algorithms. Using this taxonomy, we compare existing stereo methods and present experiments evaluating the performance of many different variants. In order to establish a common software platform and a collection of data sets for easy evaluation, we have designed a stand-alone, flexible C++ implementation that enables the evaluation of individual components and that can easily be extended to include new algorithms. We have also produced several new multi-frame stereo data sets with ground truth and are making both the code and data sets available on the Web. Finally, we include a comparative evaluation of a large set of today's best-performing stereo algorithms.

1. Introduction

Stereo correspondence has traditionally been, and continues to be, one of the most heavily investigated topics in computer vision. However, it is sometimes hard to gauge progress in the field, as most researchers only report qualitative results on the performance of their algorithms. Furthermore, a survey of stereo methods is long overdue, with the last exhaustive surveys dating back about a decade [7, 37, 25]. This paper provides an update on the state of the art in the field, with particular emphasis on stereo methods that (1) operate on two frames under known camera geometry, and (2) produce a dense disparity map, i.e., a disparity estimate at each pixel.

Our goals are two-fold:

1. To provide a taxonomy of existing stereo algorithms that allows the dissection and comparison of individual algorithm components and design decisions;

2. To provide a test bed for the quantitative evaluation of stereo algorithms. Towards this end, we are placing sample implementations of correspondence algorithms along with test data and results on the Web at www.middlebury.edu/stereo.

We emphasize calibrated two-frame methods in order to focus our analysis on the essential components of stereo correspondence. However, it would be relatively straightforward to generalize our approach to include many multi-frame methods, in particular multiple-baseline stereo [85] and its plane-sweep generalizations [30, 113].

The requirement of dense output is motivated by modern applications of stereo such as view synthesis and image-based rendering, which require disparity estimates in all image regions, even those that are occluded or without texture. Thus, sparse and feature-based stereo methods are outside the scope of this paper, unless they are followed by a surface-fitting step, e.g., using triangulation, splines, or seed-and-grow methods.

We begin this paper with a review of the goals and scope of this study, which include the need for a coherent taxonomy and a well thought-out evaluation methodology. We also review disparity space representations, which play a central role in this paper. In Section 3, we present our taxonomy of dense two-frame correspondence algorithms. Section 4 discusses our current test bed implementation in terms of the major algorithm components, their interactions, and the parameters controlling their behavior. Section 5 describes our evaluation methodology, including the methods we used for acquiring calibrated data sets with known ground truth. In Section 6 we present experiments evaluating the different algorithm components, while Section 7 provides an overall comparison of 20 current stereo algorithms. We conclude in Section 8 with a discussion of planned future work.

2. Motivation and scope

Compiling a complete survey of existing stereo methods, even restricted to dense two-frame methods, would be a formidable task, as a large number of new methods are published every year. It is also arguable whether such a survey would be of much value to other stereo researchers, besides being an obvious catch-all reference. Simply enumerating different approaches is unlikely to yield new insights.

Clearly, a comparative evaluation is necessary to assess the performance of both established and new algorithms and to gauge the progress of the field. The publication of a similar study by Barron et al. [8] has had a dramatic effect on the development of optical flow algorithms. Not only is the performance of commonly used algorithms better understood by researchers, but novel publications have to improve in some way on the performance of previously published techniques [86]. A more recent study by Mitiche and Bouthemy [78] reviews a large number of methods for image flow computation and isolates central problems, but does not provide any experimental results.

In stereo correspondence, two previous comparative papers have focused on the performance of sparse feature matchers [54, 19]. Two recent papers [111, 80] have developed new criteria for evaluating the performance of dense stereo matchers for image-based rendering and tele-presence applications. Our work is a continuation of the investigations begun by Szeliski and Zabih [116], which compared the performance of several popular algorithms, but did not provide a detailed taxonomy or as complete a coverage of algorithms. A preliminary version of this paper appeared in the CVPR 2001 Workshop on Stereo and Multi-Baseline Vision [99].

An evaluation of competing algorithms has limited value if each method is treated as a "black box" and only final results are compared. More insights can be gained by examining the individual components of various algorithms. For example, suppose a method based on global energy minimization outperforms other methods. Is the reason a better energy function, or a better minimization technique? Could the technique be improved by substituting different matching costs?

In this paper we attempt to answer such questions by providing a taxonomy of stereo algorithms. The taxonomy is designed to identify the individual components and design decisions that go into a published algorithm. We hope that the taxonomy will also serve to structure the field and to guide researchers in the development of new and better algorithms.

2.1. Computational theory

Any vision algorithm, explicitly or implicitly, makes assumptions about the physical world and the image formation process. In other words, it has an underlying computational theory [74, 72]. For example, how does the algorithm measure the evidence that points in the two images match, i.e., that they are projections of the same scene point? One common assumption is that of Lambertian surfaces, i.e., surfaces whose appearance does not vary with viewpoint. Some algorithms also model specific kinds of camera noise, or differences in gain or bias.

Equally important are assumptions about the world or scene geometry and the visual appearance of objects. Starting from the fact that the physical world consists of piecewise-smooth surfaces, algorithms have built-in smoothness assumptions (often implicit) without which the correspondence problem would be underconstrained and ill-posed. Our taxonomy of stereo algorithms, presented in Section 3, examines both matching assumptions and smoothness assumptions in order to categorize existing stereo methods.

Finally, most algorithms make assumptions about camera calibration and epipolar geometry. This is arguably the best-understood part of stereo vision; we therefore assume in this paper that we are given a pair of rectified images as input. Recent references on stereo camera calibration and rectification include [130, 70, 131, 52, 39].

2.2. Representation

A critical issue in understanding an algorithm is the representation used internally and output externally by the algorithm. Most stereo correspondence methods compute a univalued disparity function d(x, y) with respect to a reference image, which could be one of the input images, or a "cyclopean" view in between some of the images.

Other approaches, in particular multi-view stereo methods, use multi-valued [113], voxel-based [101, 67, 34, 33, 24], or layer-based [125, 5] representations. Still other approaches use full 3D models such as deformable models [120, 121], triangulated meshes [43], or level-set methods [38].

Since our goal is to compare a large number of methods within one common framework, we have chosen to focus on techniques that produce a univalued disparity map d(x, y) as their output. Central to such methods is the concept of a disparity space (x, y, d). The term disparity was first introduced in the human vision literature to describe the difference in location of corresponding features seen by the left and right eyes [72]. (Horizontal disparity is the most commonly studied phenomenon, but vertical disparity is possible if the eyes are verged.)

In computer vision, disparity is often treated as synonymous with inverse depth [20, 85]. More recently, several researchers have defined disparity as a three-dimensional projective transformation (collineation or homography) of 3-D space (X, Y, Z). The enumeration of all possible matches in such a generalized disparity space can be easily achieved with a plane sweep algorithm [30, 113], which for every disparity d projects all images onto a common plane using a perspective projection (homography).

(Note that this is different from the meaning of plane sweep in computational geometry.)

In general, we favor the more generalized interpretation of disparity, since it allows the adaptation of the search space to the geometry of the input cameras [113, 94]; we plan to use it in future extensions of this work to multiple images. (Note that plane sweeps can also be generalized to other sweep surfaces such as cylinders [106].)

In this study, however, since all our images are taken on a linear path with the optical axis perpendicular to the camera displacement, the classical inverse-depth interpretation will suffice [85]. The (x, y) coordinates of the disparity space are taken to be coincident with the pixel coordinates of a reference image chosen from our input data set. The correspondence between a pixel (x, y) in reference image r and a pixel (x', y') in matching image m is then given by

    x' = x + s d(x, y),    y' = y,    (1)

where s = ±1 is a sign chosen so that disparities are always positive. Note that since our images are numbered from leftmost to rightmost, the pixels move from right to left.

Once the disparity space has been specified, we can introduce the concept of a disparity space image or DSI [127, 18]. In general, a DSI is any image or function defined over a continuous or discretized version of disparity space (x, y, d). In practice, the DSI usually represents the confidence or log likelihood (i.e., cost) of a particular match implied by d(x, y).

The goal of a stereo correspondence algorithm is then to produce a univalued function in disparity space d(x, y) that best describes the shape of the surfaces in the scene. This can be viewed as finding a surface embedded in the disparity space image that has some optimality property, such as lowest cost and best (piecewise) smoothness [127]. Figure 1 shows examples of slices through a typical DSI. More figures of this kind can be found in [18].

3. A taxonomy of stereo algorithms

In order to support an informed comparison of stereo matching algorithms, we develop in this section a taxonomy and categorization scheme for such algorithms. We present a set of algorithmic "building blocks" from which a large set of existing algorithms can easily be constructed. Our taxonomy is based on the observation that stereo algorithms generally perform (subsets of) the following four steps [97, 96]:

1. matching cost computation;
2. cost (support) aggregation;
3. disparity computation / optimization; and
4. disparity refinement.

The actual sequence of steps taken depends on the specific algorithm.

For example, local (window-based) algorithms, where the disparity computation at a given point depends only on intensity values within a finite window, usually make implicit smoothness assumptions by aggregating support. Some of these algorithms can cleanly be broken down into steps 1, 2, 3. For example, the traditional sum-of-squared-differences (SSD) algorithm can be described as:

1. the matching cost is the squared difference of intensity values at a given disparity;
2. aggregation is done by summing matching cost over square windows with constant disparity;
3. disparities are computed by selecting the minimal (winning) aggregated value at each pixel.

Some local algorithms, however, combine steps 1 and 2 and use a matching cost that is based on a support region, e.g., normalized cross-correlation [51, 19] and the rank transform [129]. (This can also be viewed as a preprocessing step; see Section 3.1.)

On the other hand, global algorithms make explicit smoothness assumptions and then solve an optimization problem. Such algorithms typically do not perform an aggregation step, but rather seek a disparity assignment (step 3) that minimizes a global cost function that combines data (step 1) and smoothness terms. The main distinction between these algorithms is the minimization procedure used, e.g., simulated annealing [75, 6], probabilistic (mean-field) diffusion [97], or graph cuts [23].

In between these two broad classes are certain iterative algorithms that do not explicitly state a global function that is to be minimized, but whose behavior mimics closely that of iterative optimization algorithms [73, 97, 132]. Hierarchical (coarse-to-fine) algorithms resemble such iterative algorithms, but typically operate on an image pyramid, where results from coarser levels are used to constrain a more local search at finer levels [126, 90, 11].

3.1. Matching cost computation

The most common pixel-based matching costs include squared intensity differences (SD) [51, 1, 77, 107] and absolute intensity differences (AD) [58]. In the video processing community, these matching criteria are referred to as the mean-squared error (MSE) and mean absolute difference (MAD) measures; the term displaced frame difference is also often used [118].

More recently, robust measures, including truncated quadratics and contaminated Gaussians, have been proposed [15, 16, 97]. These measures are useful because they limit the influence of mismatches during aggregation.
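The three-step SSD recipe enumerated above maps directly onto a few lines of code. The sketch below (Python/NumPy for brevity — the test bed described later in this paper is C++; the window radius and the border sentinel are illustrative choices, not the paper's settings) computes the per-disparity cost, aggregates it over square windows, and selects the winner at each pixel:

```python
import numpy as np

def ssd_stereo(left, right, max_disp, radius=2):
    """Steps 1-3 of the taxonomy for the classic SSD algorithm:
    squared-difference cost, square-window aggregation, winner-take-all."""
    h, w = left.shape
    left = left.astype(float)
    right = right.astype(float)

    # Step 1: per-pixel matching cost C0(x, y, d); positions whose matching
    # pixel falls outside the image keep a large sentinel cost.
    cost = np.full((max_disp + 1, h, w), 255.0 ** 2)
    for d in range(max_disp + 1):
        cost[d, :, d:] = (left[:, d:] - right[:, :w - d]) ** 2

    # Step 2: aggregate over a (2*radius+1)^2 square window at constant d.
    r = radius
    padded = np.pad(cost, ((0, 0), (r, r), (r, r)), mode='edge')
    agg = np.zeros_like(cost)
    for dy in range(2 * r + 1):
        for dx in range(2 * r + 1):
            agg += padded[:, dy:dy + h, dx:dx + w]

    # Step 3: winner-take-all -- minimal aggregated cost at each pixel.
    return np.argmin(agg, axis=0)
```

On a synthetic pair where the left image is the right image shifted by a constant disparity, the interior estimates recover that shift exactly, since the aggregated cost is zero at the true disparity.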

Figure 1: Slices through a typical disparity space image (DSI): (a) original color image; (b) ground-truth disparities; (c–e) three (x, y) slices for d = 10, 16, 21; (f) an (x, d) slice for y = 151 (the dashed line in Figure (b)). Different dark (matching) regions are visible in Figures (c–e), e.g., the bookshelves, table and cans, and head statue, while three different disparity levels can be seen as horizontal lines in the (x, d) slice (Figure (f)). Note the dark bands in the various DSIs, which indicate regions that match at this disparity. (Smaller dark regions are often the result of textureless regions.)

Other traditional matching costs include normalized cross-correlation [51, 93, 19], which behaves similar to sum-of-squared-differences (SSD), and binary matching costs (i.e., match / no match) [73], based on binary features such as edges [4, 50, 27] or the sign of the Laplacian [82]. Binary matching costs are not commonly used in dense stereo methods, however.

Some costs are insensitive to differences in camera gain or bias, for example gradient-based measures [100, 95] and non-parametric measures such as rank and census transforms [129]. Of course, it is also possible to correct for different camera characteristics by performing a preprocessing step for bias-gain or histogram equalization [48, 32]. Other matching criteria include phase and filter-bank responses [74, 63, 56, 57]. Finally, Birchfield and Tomasi have proposed a matching cost that is insensitive to image sampling [12]. Rather than just comparing pixel values shifted by integral amounts (which may miss a valid match), they compare each pixel in the reference image against a linearly interpolated function of the other image.

The matching cost values over all pixels and all disparities form the initial disparity space image C_0(x, y, d). While our study is currently restricted to two-frame methods, the initial DSI can easily incorporate information from more than two images by simply summing up the cost values for each matching image m, since the DSI is associated with a fixed reference image r (Equation (1)). This is the idea behind multiple-baseline SSSD and SSAD methods [85, 62, 81]. As mentioned in Section 2.2, this idea can be generalized to arbitrary camera configurations using a plane sweep algorithm [30, 113].

3.2. Aggregation of cost

Local and window-based methods aggregate the matching cost by summing or averaging over a support region in the DSI C(x, y, d). A support region can be either two-dimensional at a fixed disparity (favoring fronto-parallel surfaces), or three-dimensional in x-y-d space (supporting slanted surfaces). Two-dimensional evidence aggregation has been implemented using square windows or Gaussian convolution (traditional), multiple windows anchored at different points, i.e., shiftable windows [2, 18], windows with adaptive sizes [84, 60, 124, 61], and windows based on connected components of constant disparity [22]. Three-dimensional support functions that have been proposed include limited disparity difference [50], limited disparity gradient [88], and Prazdny's coherence principle [89].

Aggregation with a fixed support region can be performed using 2D or 3D convolution,

    C(x, y, d) = w(x, y, d) * C_0(x, y, d),    (2)

or, in the case of rectangular windows, using efficient (moving average) box-filters. Shiftable windows can also be implemented efficiently using a separable sliding min-filter (Section 4.2). A different method of aggregation is iterative diffusion, i.e., an aggregation (or averaging) operation that is implemented by repeatedly adding to each pixel's cost the weighted values of its neighboring pixels' costs [114, 103, 97].

3.3. Disparity computation and optimization

Local methods. In local methods, the emphasis is on the matching cost computation and on the cost aggregation steps. Computing the final disparities is trivial: simply choose at each pixel the disparity associated with the minimum cost value. Thus, these methods perform a local "winner-take-all" (WTA) optimization at each pixel. A limitation of this approach (and many other correspondence algorithms) is that uniqueness of matches is only enforced for one image (the reference image), while points in the other image might get matched to multiple points.
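For rectangular windows, the convolution of Equation (2) in Section 3.2 reduces to a moving-average box filter whose per-pixel cost is independent of the window size. A sketch of this running-sum trick (our own illustrative code, assuming a DSI laid out as cost[d, y, x] with edge replication at the borders):

```python
import numpy as np

def box_sum_1d(a, radius, axis):
    """Moving window sum of width 2*radius+1 along `axis`, computed with a
    cumulative sum so the cost per sample is O(1), independent of width."""
    n = a.shape[axis]
    # Pad by replicating edges so every output position has a full window.
    pad = [(0, 0)] * a.ndim
    pad[axis] = (radius + 1, radius)
    c = np.cumsum(np.pad(a, pad, mode='edge'), axis=axis)
    # Window sum = difference of two cumulative sums.
    upper = np.take(c, np.arange(2 * radius + 1, n + 2 * radius + 1), axis=axis)
    lower = np.take(c, np.arange(0, n), axis=axis)
    return upper - lower

def box_aggregate(cost, radius):
    """Separable square-window aggregation of a DSI cost[d, y, x]."""
    return box_sum_1d(box_sum_1d(cost, radius, axis=1), radius, axis=2)
```

The two 1-D passes make the square-window aggregation separable, which is the same implementation trick used by the box-filter option described later in Section 4.2.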

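The iterative diffusion aggregation mentioned in Section 3.2 can likewise be sketched in a few lines: each pass adds to every pixel's cost a weighted combination of its four neighbors' costs. The weight and iteration count below are illustrative choices, not values from [97]:

```python
import numpy as np

def diffuse_costs(cost, iters=10, lam=0.15):
    """Regular iterative diffusion of a DSI cost[d, y, x]: repeatedly blend
    each pixel's cost with its four neighbors' costs (a discrete heat-equation
    step).  Weights lam <= 0.25 keep the update stable; borders replicate."""
    c = cost.astype(float)
    for _ in range(iters):
        up = np.roll(c, 1, axis=1);     up[:, 0, :] = c[:, 0, :]
        down = np.roll(c, -1, axis=1);  down[:, -1, :] = c[:, -1, :]
        left = np.roll(c, 1, axis=2);   left[:, :, 0] = c[:, :, 0]
        right = np.roll(c, -1, axis=2); right[:, :, -1] = c[:, :, -1]
        c = c + lam * (up + down + left + right - 4 * c)
    return c
```

A single step spreads an isolated cost value evenly onto its four neighbors while (in the interior) conserving the total cost mass.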
Global optimization. In contrast, global methods perform almost all of their work during the disparity computation phase and often skip the aggregation step. Many global methods are formulated in an energy-minimization framework [119]. The objective is to find a disparity function d that minimizes a global energy,

    E(d) = E_data(d) + λ E_smooth(d).    (3)

The data term, E_data(d), measures how well the disparity function d agrees with the input image pair. Using the disparity space formulation,

    E_data(d) = Σ_(x,y) C(x, y, d(x, y)),    (4)

where C is the (initial or aggregated) matching cost DSI.

The smoothness term E_smooth(d) encodes the smoothness assumptions made by the algorithm. To make the optimization computationally tractable, the smoothness term is often restricted to only measuring the differences between neighboring pixels' disparities,

    E_smooth(d) = Σ_(x,y) ρ(d(x, y) − d(x+1, y)) + ρ(d(x, y) − d(x, y+1)),    (5)

where ρ is some monotonically increasing function of disparity difference. (An alternative to smoothness functionals is to use a lower-dimensional representation such as splines [112].)

In regularization-based vision [87], ρ is a quadratic function, which makes d smooth everywhere and may lead to poor results at object boundaries. Energy functions that do not have this problem are called discontinuity-preserving and are based on robust ρ functions [119, 16, 97]. Geman and Geman's seminal paper [47] gave a Bayesian interpretation of these kinds of energy functions [110] and proposed a discontinuity-preserving energy function based on Markov Random Fields (MRFs) and additional line processes. Black and Rangarajan [16] show how line processes can often be subsumed by a robust regularization framework.

The terms in E_smooth can also be made to depend on the intensity differences, e.g.,

    ρ_d(d(x, y) − d(x+1, y)) · ρ_I(‖I(x, y) − I(x+1, y)‖),    (6)

where ρ_I is some monotonically decreasing function of intensity differences that lowers smoothness costs at high intensity gradients. This idea [44, 42, 18, 23] encourages disparity discontinuities to coincide with intensity/color edges and appears to account for some of the good performance of global optimization approaches.

Once the global energy has been defined, a variety of algorithms can be used to find a (local) minimum. Traditional approaches associated with regularization and Markov Random Fields include continuation [17], simulated annealing [47, 75, 6], highest confidence first [28], and mean-field annealing [45].

More recently, max-flow and graph-cut methods have been proposed to solve a special class of global optimization problems [92, 55, 23, 123, 65]. Such methods are more efficient than simulated annealing and have produced good results.

Dynamic programming. A different class of global optimization algorithms are those based on dynamic programming. While the 2D optimization of Equation (3) can be shown to be NP-hard for common classes of smoothness functions [123], dynamic programming can find the global minimum for independent scanlines in polynomial time. Dynamic programming was first used for stereo vision in sparse, edge-based methods [3, 83]. More recent approaches have focused on the dense (intensity-based) scanline optimization problem [10, 9, 46, 31, 18, 13]. These approaches work by computing the minimum-cost path through the matrix of all pairwise matching costs between two corresponding scanlines. Partial occlusion is handled explicitly by assigning a group of pixels in one image to a single pixel in the other image. Figure 2 shows one such example.

Problems with dynamic programming stereo include the selection of the right cost for occluded pixels and the difficulty of enforcing inter-scanline consistency, although several methods propose ways of addressing the latter [83, 9, 31, 18, 13]. Another problem is that the dynamic programming approach requires enforcing the monotonicity or ordering constraint [128]. This constraint requires that the relative ordering of pixels on a scanline remain the same between the two views, which may not be the case in scenes containing narrow foreground objects.

Cooperative algorithms. Finally, cooperative algorithms, inspired by computational models of human stereo vision, were among the earliest methods proposed for disparity computation [36, 73, 76, 114]. Such algorithms iteratively perform local computations, but use nonlinear operations that result in an overall behavior similar to global optimization algorithms. In fact, for some of these algorithms, it is possible to explicitly state a global function that is being minimized [97]. Recently, a promising variant of Marr and Poggio's original cooperative algorithm has been developed [132].

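The scanline dynamic-programming scheme described above — a minimum-cost path through the matrix of pairwise matching costs, with fixed-cost moves for partially occluded pixels — can be sketched as follows (the squared-difference match cost and the occlusion penalty value are illustrative; backtracking pointers, omitted here, would recover the actual disparity assignment):

```python
import numpy as np

def dp_scanline(left_line, right_line, occ=1.0):
    """Cost of the minimum-cost path through the matrix of all pairwise
    matching costs between two scanlines (cf. Figure 2).  Diagonal moves
    are matches (M); horizontal/vertical moves are fixed-cost occlusions
    (L / R, visible in only one image)."""
    n, m = len(left_line), len(right_line)
    D = np.zeros((n + 1, m + 1))
    D[0, :] = occ * np.arange(m + 1)      # leading occlusions
    D[:, 0] = occ * np.arange(n + 1)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            match = (left_line[i - 1] - right_line[j - 1]) ** 2
            D[i, j] = min(D[i - 1, j - 1] + match,  # M: match
                          D[i - 1, j] + occ,        # L: only in left image
                          D[i, j - 1] + occ)        # R: only in right image
    return D[n, m]
```

Identical scanlines cost nothing (an all-diagonal path), while any length mismatch must be paid for by occlusion moves.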
Figure 2: Stereo matching using dynamic programming. For each pair of corresponding scanlines, a minimizing path through the matrix of all pairwise matching costs is selected. Lowercase letters (a–k) symbolize the intensities along each scanline. Uppercase letters represent the selected path through the matrix. Matches are indicated by M, while partially occluded points (which have a fixed cost) are indicated by L and R, corresponding to points only visible in the left and right image, respectively. Usually, only a limited disparity range is considered, which is 0–4 in the figure (indicated by the non-shaded squares). Note that this diagram shows an "unskewed" x-d slice through the DSI.

3.4. Refinement of disparities

Most stereo correspondence algorithms compute a set of disparity estimates in some discretized space, e.g., for integer disparities (exceptions include continuous optimization techniques such as optic flow [11] or splines [112]). For applications such as robot navigation or people tracking, these may be perfectly adequate. However, for image-based rendering, such quantized maps lead to very unappealing view synthesis results (the scene appears to be made up of many thin shearing layers). To remedy this situation, many algorithms apply a sub-pixel refinement stage after the initial discrete correspondence stage. (An alternative is to simply start with more discrete disparity levels.)

Sub-pixel disparity estimates can be computed in a variety of ways, including iterative gradient descent and fitting a curve to the matching costs at discrete disparity levels [93, 71, 122, 77, 60]. This provides an easy way to increase the resolution of a stereo algorithm with little additional computation. However, to work well, the intensities being matched must vary smoothly, and the regions over which these estimates are computed must be on the same (correct) surface.

Recently, some questions have been raised about the advisability of fitting correlation curves to integer-sampled matching costs [105]. This situation may even be worse when sampling-insensitive dissimilarity measures are used [12]. We investigate this issue in Section 6.4 below.

Besides sub-pixel computations, there are of course other ways of post-processing the computed disparities. Occluded areas can be detected using cross-checking (comparing left-to-right and right-to-left disparity maps) [29, 42]. A median filter can be applied to "clean up" spurious mismatches, and holes due to occlusion can be filled by surface fitting or by distributing neighboring disparity estimates [13, 96]. In our implementation we are not performing such clean-up steps since we want to measure the performance of the raw algorithm components.

3.5. Other methods

Not all dense two-frame stereo correspondence algorithms can be described in terms of our basic taxonomy and representations. Here we briefly mention some additional algorithms and representations that are not covered by our framework.

The algorithms described in this paper first enumerate all possible matches at all possible disparities, then select the best set of matches in some way. This is a useful approach when a large amount of ambiguity may exist in the computed disparities. An alternative approach is to use methods inspired by classic (infinitesimal) optic flow computation. Here, images are successively warped and motion estimates incrementally updated until a satisfactory registration is achieved. These techniques are most often implemented within a coarse-to-fine hierarchical refinement framework [90, 11, 8, 112].

A univalued representation of the disparity map is also not essential. Multi-valued representations, which can represent several depth values along each line of sight, have been extensively studied recently, especially for large multi-view data sets. Many of these techniques use a voxel-based representation to encode the reconstructed colors and spatial occupancies or opacities [113, 101, 67, 34, 33, 24]. Another way to represent a scene with more complexity is to use multiple layers, each of which can be represented by a plane plus residual parallax [5, 14, 117]. Finally, deformable surfaces of various kinds have also been used to perform 3D shape reconstruction from multiple images [120, 121, 43, 38].

3.6. Summary of methods

Table 1 gives a summary of some representative stereo matching algorithms and their corresponding taxonomy, i.e., the matching cost, aggregation, and optimization techniques used by each. The methods are grouped to contrast different matching costs (top), aggregation methods (middle), and optimization techniques (third section), while the last section lists some papers outside the framework. As can be seen from this table, quite a large subset of the possible algorithm design space has been explored over the years, albeit not very systematically.

4. Implementation

We have developed a stand-alone, portable C++ implementation of several stereo algorithms. The implementation is closely tied to the taxonomy presented in Section 3 and currently includes window-based algorithms and diffusion algorithms.
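The parabola-fitting flavor of sub-pixel refinement described in Section 3.4 above can be sketched as follows: fit a quadratic through the three matching costs around each winning disparity and move the estimate to its vertex. The DSI layout cost[d, y, x] and the guard against non-convex fits are our own illustrative choices:

```python
import numpy as np

def subpixel_refine(cost, disp):
    """Refine integer WTA disparities by fitting a parabola to the three
    matching costs around the winning disparity in cost[d, y, x]."""
    max_d = cost.shape[0] - 1
    ys, xs = np.indices(disp.shape)
    d = np.clip(disp, 1, max_d - 1)        # need both neighboring costs
    c0 = cost[d, ys, xs]
    cm = cost[d - 1, ys, xs]
    cp = cost[d + 1, ys, xs]
    # Vertex of the parabola through (-1, cm), (0, c0), (+1, cp).
    denom = cm - 2.0 * c0 + cp
    offset = np.zeros(disp.shape, dtype=float)
    good = denom > 1e-9                    # only refine convex fits
    offset[good] = (cm[good] - cp[good]) / (2.0 * denom[good])
    return d + offset
```

If the sampled costs actually lie on a parabola, the refinement recovers its vertex exactly, which is why the technique works best when the matched intensities vary smoothly.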

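Cross-checking, also mentioned in Section 3.4, is equally compact: look up the right-to-left disparity at the pixel each left-to-right disparity points to, and flag disagreements as occluded or unreliable. This sketch assumes the sign convention s = −1 in Equation (1), i.e., the match for column x lies at column x − d in the other image:

```python
import numpy as np

def cross_check(disp_l, disp_r, tol=1):
    """Return a boolean consistency mask: True where the left-to-right and
    right-to-left disparity maps agree to within `tol`.  False pixels are
    candidates for occlusion filling or other clean-up steps."""
    h, w = disp_l.shape
    ys, xs = np.indices((h, w))
    xr = np.clip(xs - disp_l, 0, w - 1)    # matching column in other image
    return np.abs(disp_l - disp_r[ys, xr]) <= tol
```

Consistent maps pass everywhere; corrupting the right-to-left map at one column invalidates exactly the left pixels that map onto it.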
Method                          | Matching cost      | Aggregation              | Optimization
--------------------------------|--------------------|--------------------------|---------------
SSD (traditional)               | squared difference | square window            | WTA
Hannah [51]                     | cross-correlation  | (square window)          | WTA
Nishihara [82]                  | binarized filters  | square window            | WTA
Kass [63]                       | filter banks       | -none-                   | WTA
Fleet et al. [40]               | phase              | -none-                   | phase-matching
Jones and Malik [57]            | filter banks       | -none-                   | WTA
Kanade [58]                     | absolute difference| square window            | WTA
Scharstein [95]                 | gradient-based     | Gaussian                 | WTA
Zabih and Woodfill [129]        | rank transform     | (square window)          | WTA
Cox et al. [32]                 | histogram eq.      | -none-                   | DP
Frohlinghaus and Buhmann [41]   | wavelet phase      | -none-                   | phase-matching
Birchfield and Tomasi [12]      | shifted abs. diff. | -none-                   | DP

Marr and Poggio [73]            | binary images      | iterative aggregation    | WTA
Prazdny [89]                    | binary images      | 3D aggregation           | WTA
Szeliski and Hinton [114]       | binary images      | iterative 3D aggregation | WTA
Okutomi and Kanade [84]         | squared difference | adaptive window          | WTA
Yang et al. [127]               | cross-correlation  | non-linear filtering     | hier. WTA
Shah [103]                      | squared difference | non-linear diffusion     | regularization
Boykov et al. [22]              | thresh. abs. diff. | connected-component      | WTA
Scharstein and Szeliski [97]    | robust sq. diff.   | iterative 3D aggregation | mean-field
Zitnick and Kanade [132]        | squared difference | iterative aggregation    | WTA
Veksler [124]                   | abs. diff - avg.   | adaptive window          | WTA

Quam [90]                       | cross-correlation  | -none-                   | hier. warp
Barnard [6]                     | squared difference | -none-                   | SA
Geiger et al. [46]              | squared difference | shiftable window         | DP
Belhumeur [9]                   | squared difference | -none-                   | DP
Cox et al. [31]                 | squared difference | -none-                   | DP
Ishikawa and Geiger [55]        | squared difference | -none-                   | graph cut
Roy and Cox [92]                | squared difference | -none-                   | graph cut
Bobick and Intille [18]         | absolute difference| shiftable window         | DP
Boykov et al. [23]              | squared difference | -none-                   | graph cut
Kolmogorov and Zabih [65]       | squared difference | -none-                   | graph cut

Birchfield and Tomasi [14]      | shifted abs. diff. | -none-                   | GC + planes
Tao et al. [117]                | squared difference | (color segmentation)     | WTA + regions

Table 1: Summary taxonomy of several dense two-frame stereo correspondence methods. The methods are grouped to contrast different matching costs (top), aggregation methods (middle), and optimization techniques (third section). The last section lists some papers outside our framework. Key to abbreviations: hier. – hierarchical (coarse-to-fine), WTA – winner-take-all, DP – dynamic programming, SA – simulated annealing, GC – graph cut.

rithms, as well as global optimization methods using dynamic programming, simulated annealing, and graph cuts. While many published methods include special features and post-processing steps to improve the results, we have chosen to implement the basic versions of such algorithms, in order to assess their respective merits most directly.

The implementation is modular and can easily be extended to include other algorithms or their components. We plan to add several other algorithms in the near future, and we hope that other authors will contribute their methods to our framework as well. Once a new algorithm has been integrated, it can easily be compared with other algorithms using our evaluation module, which can measure disparity error and reprojection error (Section 5.1). The implementation contains a sophisticated mechanism for specifying parameter values that supports recursive script files for exhaustive performance comparisons on multiple data sets.

We provide a high-level description of our code using the same division into four parts as in our taxonomy. Within our code, these four sections are (optionally) executed in sequence, and the performance/quality evaluator is then invoked. A list of the most important algorithm parameters is given in Table 2.

Figure 3: Shiftable window. The effect of trying all 3 × 3 shifted windows around the black pixel is the same as taking the minimum matching score across all centered (non-shifted) windows in the same neighborhood. (Only 3 of the neighboring shifted windows are shown here for clarity.)

4.1. Matching cost computation

The simplest possible matching cost is the squared or absolute difference in color / intensity between corresponding pixels (match_fn). To approximate the effect of a robust matching score [16, 97], we truncate the matching score to a maximal value match_max. When color images are being compared, we sum the squared or absolute intensity difference in each channel before applying the clipping. If fractional disparity evaluation is being performed (disp_step < 1), each scanline is first interpolated up using either a linear or cubic interpolation filter (match_interp) [77]. We also optionally apply Birchfield and Tomasi's sampling-insensitive interval-based matching criterion (match_interval) [12], i.e., we take the minimum of the pixel matching score and the score at ±½-step displacements, or 0 if there is a sign change in either interval. We apply this criterion separately to each color channel, which is not physically plausible (the sub-pixel shift must be consistent across channels), but is easier to implement.

4.2. Aggregation

The aggregation section of our test bed implements some commonly used aggregation methods (aggr_fn):

• Box filter: use a separable moving average filter (add one right/bottom value, subtract one left/top). This implementation trick makes such window-based aggregation insensitive to window size in terms of computation time and accounts for the fast performance seen in real-time matchers [59, 64].

• Binomial filter: use a separable FIR (finite impulse response) filter. We use the coefficients 1/16 {1, 4, 6, 4, 1}, the same ones used in Burt and Adelson's [26] Laplacian pyramid.

Other convolution kernels could also be added later, as could recursive (bi-directional) IIR filtering, which is a very efficient way to obtain large window sizes [35]. The width of the box or convolution kernel is controlled by aggr_window_size.

To simulate the effect of shiftable windows [2, 18, 117], we can follow this aggregation step with a separable square min-filter. The width of this filter is controlled by the parameter aggr_minfilter. The cascaded effect of a box-filter and an equal-sized min-filter is the same as evaluating a complete set of shifted windows, since the value of a shifted window is the same as that of a centered window at some neighboring pixel (Figure 3). This step adds very little additional computation, since a moving 1-D min-filter can be computed efficiently by only recomputing the min when a minimum value leaves the window. The value of aggr_minfilter can be less than that of aggr_window_size, which simulates the effect of a partially shifted window. (The converse doesn't make much sense, since the window then no longer includes the reference pixel.)

We have also implemented all of the diffusion methods developed in [97] except for local stopping, i.e., regular diffusion, the membrane model, and Bayesian (mean-field) diffusion. While this last algorithm can also be considered an optimization method, we include it in the aggregation module since it closely resembles other iterative aggregation algorithms. The maximum number of aggregation iterations is controlled by aggr_iter. Other parameters controlling the diffusion algorithms are listed in Table 2.
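To make the cost and aggregation stages concrete, here is a minimal 1-D sketch (our own illustration, not the test bed's actual code): a truncated absolute-difference cost, a brute-force box filter, and the min-filter that simulates shiftable windows. The real implementation uses the incremental moving-average and moving-min tricks described above.

```cpp
#include <algorithm>
#include <cmath>
#include <vector>

// Truncated absolute-difference matching cost: the truncation value
// plays the role of match_max in Table 2.
inline float truncatedAD(float ref, float match, float match_max) {
    return std::min(std::fabs(ref - match), match_max);
}

// Box-filter aggregation of one row of the cost image, followed by a
// min-filter of the same width, which simulates shiftable windows
// (Figure 3).  For clarity this version is brute force and 1-D;
// borders are handled by clamping.
std::vector<float> aggregateRow(const std::vector<float>& cost, int width) {
    const int n = static_cast<int>(cost.size());
    const int r = width / 2;
    std::vector<float> box(n), shifted(n);
    for (int x = 0; x < n; ++x) {        // box filter (moving average)
        float sum = 0.0f;
        for (int k = -r; k <= r; ++k)
            sum += cost[std::clamp(x + k, 0, n - 1)];
        box[x] = sum / width;
    }
    for (int x = 0; x < n; ++x) {        // min-filter = shiftable window
        float m = box[x];
        for (int k = -r; k <= r; ++k)
            m = std::min(m, box[std::clamp(x + k, 0, n - 1)]);
        shifted[x] = m;
    }
    return shifted;
}
```

A full implementation would run this separably over rows and columns of each disparity slice of the DSI.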

Name                     Typical values     Description
disp_min                 0                  smallest disparity
disp_max                 15                 largest disparity
disp_step                0.5                disparity step size
match_fn                 SD, AD             matching function
match_interp             Linear, Cubic      interpolation function
match_max                20                 maximum difference for truncated SAD/SSD
match_interval           false              1/2 disparity match [12]
aggr_fn                  Box, Binomial      aggregation function
aggr_window_size         9                  size of window
aggr_minfilter           9                  spatial min-filter (shiftable window)
aggr_iter                1                  number of aggregation iterations
diff_lambda              0.15               parameter λ for regular and membrane diffusion
diff_beta                0.5                parameter β for membrane diffusion
diff_scale_cost          0.01               scale of cost values (needed for Bayesian diffusion)
diff_mu                  0.5                parameter μ for Bayesian diffusion
diff_sigmaP              0.4                parameter σ_P for robust prior of Bayesian diffusion
diff_epsP                0.01               parameter ε_P for robust prior of Bayesian diffusion
opt_fn                   WTA, DP, SA, GC    optimization function
opt_smoothness           1.0                weight of smoothness term (λ)
opt_grad_thresh          8.0                threshold for magnitude of intensity gradient
opt_grad_penalty         2.0                smoothness penalty factor if gradient is too small
opt_occlusion_cost       20                 cost for occluded pixels in DP algorithm
opt_sa_var               Gibbs, Metropolis  simulated annealing update rule
opt_sa_start_T           10.0               starting temperature
opt_sa_end_T             0.01               ending temperature
opt_sa_schedule          Linear             annealing schedule
refine_subpix            true               fit sub-pixel value to local correlation
eval_bad_thresh          1.0                acceptable disparity error
eval_textureless_width   3                  box filter width applied to ‖∇_x I‖²
eval_textureless_thresh  4.0                threshold applied to filtered ‖∇_x I‖²
eval_disp_gap            2.0                disparity jump threshold
eval_discont_width       9                  width of discontinuity region
eval_ignore_border       10                 number of border pixels to ignore
eval_partial_shuffle     0.2                analysis interval for prediction error

Table 2: The most important stereo algorithm parameters of our implementation.
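For readers building a similar test bed, the parameters of Table 2 map naturally onto a configuration struct. The names and typical values below are taken from the table, but the struct itself is our illustration, not the paper's actual code, which reads parameters from recursive script files.

```cpp
#include <string>

// Illustrative grouping of (a subset of) the Table 2 parameters;
// defaults are the "typical values" column.
struct StereoParams {
    int         disp_min           = 0;      // smallest disparity
    int         disp_max           = 15;     // largest disparity
    float       disp_step          = 0.5f;   // disparity step size
    std::string match_fn           = "AD";   // matching function (SD, AD)
    float       match_max          = 20.0f;  // truncation for SAD/SSD
    std::string aggr_fn            = "Box";  // aggregation function
    int         aggr_window_size   = 9;      // size of window
    int         aggr_minfilter     = 9;      // shiftable-window min-filter
    std::string opt_fn             = "WTA";  // WTA, DP, SA, or GC
    float       opt_smoothness     = 1.0f;   // smoothness weight (lambda)
    float       opt_occlusion_cost = 20.0f;  // DP occlusion cost
    bool        refine_subpix      = true;   // sub-pixel parabola fit
    float       eval_bad_thresh    = 1.0f;   // bad-pixel tolerance
};
```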

4.3. Optimization

Once we have computed the (optionally aggregated) costs, we need to determine which discrete set of disparities best represents the scene surface. The algorithm used to determine this is controlled by opt_fn, and can be one of:

• winner-take-all (WTA);
• dynamic programming (DP);
• scanline optimization (SO);
• simulated annealing (SA);
• graph cut (GC).

The winner-take-all method simply picks the lowest (aggregated) matching cost as the selected disparity at each pixel. The other methods require (in addition to the matching cost) the definition of a smoothness cost. Prior to invoking one of the optimization algorithms, we set up tables containing the values of ρ_d in Equation (6) and precompute the spatially varying weights ρ_I(x, y). These tables are controlled by the parameter opt_smoothness, which controls the overall scale of the smoothness term (i.e., λ in Equation (3)), and the parameters opt_grad_thresh and opt_grad_penalty, which control the gradient-dependent smoothness costs. We currently use the smoothness terms defined by Veksler [123]:

    ρ_I(ΔI) = { p   if ΔI <  opt_grad_thresh
              { 1   if ΔI >= opt_grad_thresh,        (7)

where p = opt_grad_penalty. Thus, the smoothness cost is multiplied by p for low intensity gradients to encourage disparity jumps to coincide with intensity edges. All of the optimization algorithms minimize the same objective function, enabling a more meaningful comparison of their performance.

Our first global optimization technique, DP, is a dynamic programming method similar to the one proposed by Bobick and Intille [18]. The algorithm works by computing the minimum-cost path through each x-d slice in the DSI (see Figure 2). Every point in this slice can be in one of three states: M (match), L (left-visible only), or R (right-visible only). Assuming the ordering constraint is being enforced, a valid path can take at most three directions at a point, each associated with a deterministic state change. Using dynamic programming, the minimum cost of all paths to a point can be accumulated efficiently. Points in state M are simply charged the matching cost at this point in the DSI. Points in states L and R are charged a fixed occlusion cost (opt_occlusion_cost). Before evaluating the final disparity map, we fill all occluded pixels with the nearest background disparity value on the same scanline.

The DP stereo algorithm is fairly sensitive to this parameter (see Section 6). Bobick and Intille address this problem by precomputing ground control points (GCPs) that are then used to constrain the paths through the DSI slice. GCPs are high-confidence matches that are computed using SAD and shiftable windows. At this point we are not using GCPs in our implementation since we are interested in comparing the basic versions of different algorithms. However, GCPs are potentially useful in other algorithms as well, and we plan to add them to our implementation in the future.

Our second global optimization technique, scanline optimization (SO), is a simple (and, to our knowledge, novel) approach designed to assess different smoothness terms. Like the previous method, it operates on individual x-d DSI slices and optimizes one scanline at a time. However, the method is asymmetric and does not utilize visibility or ordering constraints. Instead, a d value is assigned at each point x such that the overall cost along the scanline is minimized. (Note that without a smoothness term, this would be equivalent to a winner-take-all optimization.) The global minimum can again be computed using dynamic programming; however, unlike in traditional (symmetric) DP algorithms, the ordering constraint does not need to be enforced, and no occlusion cost parameter is necessary. Thus, the SO algorithm solves the same optimization problem as the graph-cut algorithm described below, except that vertical smoothness terms are ignored.

Both DP and SO algorithms suffer from the well-known difficulty of enforcing inter-scanline consistency, resulting in horizontal "streaks" in the computed disparity map. Bobick and Intille's approach to this problem is to detect edges in the DSI slice and to lower the occlusion cost for paths along those edges. This has the effect of aligning depth discontinuities with intensity edges. In our implementation, we achieve the same goal by using an intensity-dependent smoothness cost (Equation (6)), which, in our DP algorithm, is charged at all L-M and R-M state transitions.

Our implementation of simulated annealing supports both the Metropolis variant (where downhill steps are always taken, and uphill steps are sometimes taken), and the Gibbs Sampler, which chooses among several possible states according to the full marginal distribution [47]. In the latter case, we can either select one new state (disparity) to flip to at random, or evaluate all possible disparities at a given pixel. Our current annealing schedule is linear, although we plan to add a logarithmic annealing schedule in the future.

Our final global optimization method, GC, implements the α-β swap move algorithm described in [23, 123]. (We plan to implement the α-expansion in the future.) We randomize the α-β pairings at each (inner) iteration and stop the algorithm when no further (local) energy improvements are possible.
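The scanline optimization just described reduces to a small dynamic program per scanline. The sketch below is our own simplified illustration: it uses a constant (Potts-style) smoothness penalty lambda rather than the gradient-dependent weights of Equation (7), and omits the occlusion handling of the DP method.

```cpp
#include <limits>
#include <vector>

// Scanline optimization on one x-d DSI slice: pick a disparity per
// pixel minimizing matching cost plus a Potts smoothness penalty
// lambda between horizontal neighbors, via dynamic programming with
// back-pointers.  cost[x][d] is the (optionally aggregated) cost.
std::vector<int> scanlineOpt(const std::vector<std::vector<float>>& cost,
                             float lambda) {
    const int n  = static_cast<int>(cost.size());
    const int nd = static_cast<int>(cost[0].size());
    std::vector<std::vector<float>> E(n, std::vector<float>(nd));
    std::vector<std::vector<int>> back(n, std::vector<int>(nd, 0));
    E[0] = cost[0];
    for (int x = 1; x < n; ++x)
        for (int d = 0; d < nd; ++d) {
            float best = std::numeric_limits<float>::max();
            int arg = 0;
            for (int dp = 0; dp < nd; ++dp) {   // previous disparity
                float e = E[x - 1][dp] + (dp == d ? 0.0f : lambda);
                if (e < best) { best = e; arg = dp; }
            }
            E[x][d] = cost[x][d] + best;
            back[x][d] = arg;
        }
    std::vector<int> disp(n);
    int d = 0;                                  // best final disparity
    for (int k = 1; k < nd; ++k)
        if (E[n - 1][k] < E[n - 1][d]) d = k;
    for (int x = n - 1; x >= 0; --x) {          // trace back
        disp[x] = d;
        d = back[x][d];
    }
    return disp;
}
```

With lambda = 0 this degenerates to winner-take-all, as noted above; a large lambda forces a nearly constant disparity along the scanline.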

4.4. Refinement

The sub-pixel refinement of disparities is controlled by the boolean variable refine_subpix. When this is enabled, the three aggregated matching cost values around the winning disparity are examined to compute the sub-pixel disparity estimate. (Note that if the initial DSI was formed with fractional disparity steps, these are really sub-sub-pixel values. A more appropriate name might be floating point disparity values.) A parabola is fit to these three values (the three ending values are used if the winning disparity is either disp_min or disp_max). If the curvature is positive and the minimum of the parabola is within a half-step of the winning disparity (and within the search limits), this value is used as the final disparity estimate.

In future work, we would like to investigate whether initial or aggregated matching scores should be used, or whether some other approach, such as Lucas-Kanade, might yield higher-quality estimates [122].

5. Evaluation methodology

In this section, we describe the quality metrics we use for evaluating the performance of stereo correspondence algorithms and the techniques we used for acquiring our image data sets and ground truth estimates.

5.1. Quality metrics

To evaluate the performance of a stereo algorithm or the effects of varying some of its parameters, we need a quantitative way to estimate the quality of the computed correspondences. Two general approaches to this are to compute error statistics with respect to some ground truth data [8] and to evaluate the synthetic images obtained by warping the reference or unseen images by the computed disparity map [111].

In the current version of our software, we compute the following two quality measures based on known ground truth data:

1. RMS (root-mean-squared) error (measured in disparity units) between the computed disparity map d_C(x, y) and the ground truth map d_T(x, y), i.e.,

       R = ( (1/N) Σ_(x,y) |d_C(x, y) − d_T(x, y)|² )^(1/2),        (8)

   where N is the total number of pixels.

2. Percentage of bad matching pixels,

       B = (1/N) Σ_(x,y) ( |d_C(x, y) − d_T(x, y)| > δ_d ),        (9)

   where δ_d (eval_bad_thresh) is a disparity error tolerance. For the experiments in this paper we use δ_d = 1.0, since this coincides with some previously published studies [116, 132, 65].

In addition to computing these statistics over the whole image, we also focus on three different kinds of regions. These regions are computed by pre-processing the reference image and ground truth disparity map to yield the following three binary segmentations (Figure 4):

• textureless regions T̄: regions where the squared horizontal intensity gradient averaged over a square window of a given size (eval_textureless_width) is below a given threshold (eval_textureless_thresh);

• occluded regions O: regions that are occluded in the matching image, i.e., where the forward-mapped disparity lands at a location with a larger (nearer) disparity; and

• depth discontinuity regions D: pixels whose neighboring disparities differ by more than eval_disp_gap, dilated by a window of width eval_discont_width.

These regions were selected to support the analysis of matching results in typical problem areas. For the experiments in this paper we use the values listed in Table 2.

The statistics described above are computed for each of the three regions and their complements, e.g.,

       B_T̄ = (1/N_T̄) Σ_((x,y)∈T̄) ( |d_C(x, y) − d_T(x, y)| > δ_d ),

and so on for R_T̄, ..., B_D. Table 3 gives a complete list of the statistics we collect. Note that for the textureless, textured, and depth discontinuity statistics, we exclude pixels that are in occluded regions, on the assumption that algorithms generally do not produce meaningful results in such occluded regions. Also, we exclude a border of eval_ignore_border pixels when computing all statistics, since many algorithms do not compute meaningful disparities near the image boundaries.

Name                     Symb.   Description
rms_error_all            R       RMS disparity error
rms_error_nonocc         R_Ō     " (no occlusions)
rms_error_occ            R_O     " (at occlusions)
rms_error_textured       R_T     " (textured)
rms_error_textureless    R_T̄     " (textureless)
rms_error_discont        R_D     " (near discontinuities)
bad_pixels_all           B       bad pixel percentage
bad_pixels_nonocc        B_Ō     " (no occlusions)
bad_pixels_occ           B_O     " (at occlusions)
bad_pixels_textured      B_T     " (textured)
bad_pixels_textureless   B_T̄     " (textureless)
bad_pixels_discont       B_D     " (near discontinuities)
predict_err_near         P⁻      view extr. error (near)
predict_err_middle       P½      view extr. error (mid)
predict_err_match        P¹      view extr. error (match)
predict_err_far          P⁺      view extr. error (far)

Table 3: Error (quality) statistics computed by our evaluator. See the notes in the text regarding the treatment of occluded regions.
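Equations (8) and (9) are straightforward to implement; here is a minimal sketch over flattened disparity maps (function names are ours, not the evaluator's actual API):

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// RMS disparity error between computed and ground-truth maps
// (Equation (8)), over flattened disparity arrays.
float rmsError(const std::vector<float>& dc, const std::vector<float>& dt) {
    double sum = 0.0;
    for (std::size_t i = 0; i < dc.size(); ++i) {
        double e = dc[i] - dt[i];
        sum += e * e;
    }
    return static_cast<float>(std::sqrt(sum / dc.size()));
}

// Percentage of bad matching pixels (Equation (9)): pixels whose
// absolute disparity error exceeds delta_d (eval_bad_thresh).
float badPixels(const std::vector<float>& dc, const std::vector<float>& dt,
                float delta_d) {
    int bad = 0;
    for (std::size_t i = 0; i < dc.size(); ++i)
        if (std::fabs(dc[i] - dt[i]) > delta_d) ++bad;
    return 100.0f * bad / static_cast<float>(dc.size());
}
```

Restricting the sums to one of the region masks (T̄, O, D, or their complements) and normalizing by that region's pixel count yields the per-region statistics of Table 3.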

Figure 4: Segmented region maps: (a) original image, (b) true disparities, (c) textureless regions (white) and occluded regions (black), (d) depth discontinuity regions (white) and occluded regions (black).

Figure 5: Series of forward-warped reference images. The reference image is the middle one, the matching image is the second from the right. Pixels that are invisible (gaps) are shown in light magenta.

Figure 6: Series of inverse-warped original images. The reference image is the middle one, the matching image is the second from the right. Pixels that are invisible are shown in light magenta. Viewing this sequence (available on our web site) as an animation loop is a good way to check for correct rectification, other misalignments, and quantization effects.
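As a concrete illustration of how region maps like Figure 4(d) can be produced, here is a 1-D sketch of the depth-discontinuity segmentation (disparity jumps larger than eval_disp_gap, dilated by a window of width eval_discont_width); the function name is ours:

```cpp
#include <cmath>
#include <vector>

// 1-D sketch of the depth-discontinuity segmentation: mark pixels
// whose neighboring disparities differ by more than disp_gap, then
// dilate the marks by a window of width discont_width.
std::vector<bool> discontRegions(const std::vector<float>& disp,
                                 float disp_gap, int discont_width) {
    const int n = static_cast<int>(disp.size());
    const int r = discont_width / 2;
    std::vector<bool> jump(n, false), region(n, false);
    for (int x = 0; x + 1 < n; ++x)
        if (std::fabs(disp[x + 1] - disp[x]) > disp_gap)
            jump[x] = jump[x + 1] = true;
    for (int x = 0; x < n; ++x)          // dilation by the window
        for (int k = -r; k <= r; ++k) {
            const int i = x + k;
            if (i >= 0 && i < n && jump[i]) region[x] = true;
        }
    return region;
}
```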

The second major approach to gauging the quality of reconstruction algorithms is to use the color images and disparity maps to predict the appearance of other views [111]. Here again there are two major flavors possible:

1. Forward warp the reference image by the computed disparity map to a different (potentially unseen) view (Figure 5), and compare it against this new image to obtain a forward prediction error.

2. Inverse warp a new view by the computed disparity map to generate a stabilized image (Figure 6), and compare it against the reference image to obtain an inverse prediction error.

There are pros and cons to either approach.

The forward warping algorithm has to deal with tearing problems: if a single-pixel splat is used, gaps can arise even between adjacent pixels with similar disparities. One possible solution would be to use a two-pass renderer [102]. Instead, we render each pair of neighboring pixels as an interpolated color line in the destination image (i.e., we use Gouraud shading). If neighboring pixels differ by more than a disparity of eval_disp_gap, the segment is replaced by single-pixel splats at both ends, which results in a visible tear (light magenta regions in Figure 5).

For inverse warping, the problem of gaps does not occur. Instead, we get "ghosted" regions when pixels in the reference image are not actually visible in the source. We eliminate such pixels by checking for visibility (occlusions) first, and then drawing these pixels in a special color (light magenta in Figure 6). We have found that looking at the inverse-warped sequence, based on the ground-truth disparities, is a very good way to determine if the original sequence is properly calibrated and rectified.

In computing the prediction error, we need to decide how to treat gaps. Currently, we ignore pixels flagged as gaps in computing the statistics and report the percentage of such missing pixels. We can also optionally compensate for small misregistrations [111]. To do this, we convert each pixel in the original and predicted image to an interval, by blending the pixel's value with some fraction eval_partial_shuffle of its neighboring pixels' min and max values. This idea is a generalization of the sampling-insensitive dissimilarity measure [12] and the shuffle transformation of [66]. The reported difference is then the (signed) distance between the two computed intervals. We plan to investigate these and other sampling-insensitive matching costs in the future [115].

5.2. Test data

To quantitatively evaluate our correspondence algorithms, we require data sets that either have a ground-truth disparity map, or a set of additional views that can be used for prediction error tests (or preferably both).

We have begun to collect such a database of images, building upon the methodology introduced in [116]. Each image sequence consists of 9 images, taken at regular intervals with a camera mounted on a horizontal translation stage, with the camera pointing perpendicularly to the direction of motion. We use a digital high-resolution camera (Canon G1) set in manual exposure and focus mode and rectify the images using tracked feature points. We then downsample the original 2048 × 1536 images to 512 × 384 using a high-quality 8-tap filter and finally crop the images to normalize the motion of background objects to a few pixels per frame.

All of the sequences we have captured are made up of piecewise planar objects (typically posters or paintings, some with cut-out edges). Before downsampling the images, we hand-label each image into its piecewise planar components (Figure 7). We then use a direct alignment technique on each planar region [5] to estimate the affine motion of each patch. The horizontal component of these motions is then used to compute the ground-truth disparity. In future work we plan to extend our acquisition methodology to handle scenes with quadric surfaces (e.g., cylinders, cones, and spheres).

Of the six image sequences we acquired, all of which are available on our web page, we have selected two ("Sawtooth" and "Venus") for the experimental study in this paper. We also use the University of Tsukuba "head and lamp" data set [81], a 5 × 5 array of images together with hand-labeled integer ground-truth disparities for the center image. Finally, we use the monochromatic "Map" data set first introduced by Szeliski and Zabih [116], which was taken with a Point Grey Research trinocular stereo camera, and whose ground-truth disparity map was computed using the piecewise planar technique described above. Figure 7 shows the reference image and the ground-truth disparities for each of these four sequences. We exclude a border of 18 pixels in the Tsukuba images, since no ground-truth disparity values are provided there. For all other images, we use eval_ignore_border = 10 for the experiments reported in this paper.

In the future, we hope to add further data sets to our collection of "standard" test images, in particular other sequences from the University of Tsukuba, and the GRASP Laboratory's "Buffalo Bill" data set with registered laser range finder ground truth [80]. There may also be suitable images among the CMU Computer Vision Home Page data sets. Unfortunately, we cannot use data sets for which only a sparse set of feature matches has been computed [19, 54].

It should be noted that high-quality ground-truth data is critical for a meaningful performance evaluation. Accurate sub-pixel disparities are hard to come by, however. The ground-truth data for the Tsukuba images, for example, is strongly quantized since it only provides integer disparity estimates for a very small disparity range (d = 5 ... 14). This is clearly visible when the images are stabilized using the ground-truth data and viewed in a video loop.
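The stabilization used for this check (inverse warping by the disparities) can be sketched in 1-D. This is our simplified illustration: it linearly interpolates the matching scanline at x − d(x) (the sign convention is assumed), flags out-of-range samples with −1 instead of light magenta, and omits the visibility test described in Section 5.1:

```cpp
#include <cmath>
#include <vector>

// 1-D inverse warp: resample the matching scanline at x - disp[x]
// with linear interpolation.  If the disparities are correct, the
// result should match the reference scanline.  Samples that fall
// outside the matching image are flagged with -1 (standing in for
// the light magenta of Figure 6); no visibility test is performed.
std::vector<float> inverseWarp(const std::vector<float>& match,
                               const std::vector<float>& disp) {
    const int n = static_cast<int>(disp.size());
    const int m = static_cast<int>(match.size());
    std::vector<float> out(n, -1.0f);
    for (int x = 0; x < n; ++x) {
        const float xs = x - disp[x];            // source position
        const int x0 = static_cast<int>(std::floor(xs));
        if (x0 < 0 || x0 + 1 >= m) continue;     // flag as gap
        const float f = xs - x0;                 // interpolation weight
        out[x] = (1.0f - f) * match[x0] + f * match[x0 + 1];
    }
    return out;
}
```

Comparing out against the reference scanline, while skipping the flagged pixels, gives the inverse prediction error described above.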

Figure 7: Stereo images with ground truth used in this study. The Sawtooth and Venus images are two of our new 9-frame stereo sequences of planar objects. The figure shows the reference image, the planar region labeling, and the ground-truth disparities. We also use the familiar Tsukuba "head and lamp" data set, and the monochromatic Map image pair.

In contrast, the ground-truth disparities for our piecewise planar scenes have high (subpixel) precision, but at the cost of limited scene complexity. To provide an adequate challenge for the best-performing stereo methods, new stereo test images with complex scenes and sub-pixel ground truth will soon be needed.

Synthetic images have been used extensively for qualitative evaluations of stereo methods, but they are often restricted to simple geometries and textures (e.g., random-dot stereograms). Furthermore, issues arising with real cameras are seldom modeled, e.g., aliasing, slight misalignment, noise, lens aberrations, and fluctuations in gain and bias. Consequently, results on synthetic images usually do not extrapolate to images taken with real cameras. We have experimented with the University of Bonn's synthetic "Corridor" data set [41], but have found that the clean, noise-free images are unrealistically easy to solve, while the noise-contaminated versions are too difficult due to the complete lack of texture in much of the scene. There is a clear need for synthetic, photo-realistic test imagery that properly models real-world imperfections, while providing accurate ground truth.

6. Experiments and results

In this section, we describe the experiments used to evaluate the individual building blocks of stereo algorithms. Using our implementation framework, we examine the four main algorithm components identified in Section 3 (matching cost, aggregation, optimization, and sub-pixel fitting). In Section 7, we perform an overall comparison of a large set of stereo algorithms, including other authors' implementations. We use the Tsukuba, Sawtooth, Venus, and Map data sets in all experiments and report results on subsets of these images. The complete set of results (all experiments run on all data sets) is available on our web site at www.middlebury.edu/stereo.

Using the evaluation measures presented in Section 5.1, we focus on common problem areas for stereo algorithms. Of the 12 ground-truth statistics we collect (Table 3), we have chosen three as the most important subset. First, as a measure of overall performance, we use B_Ō, the percentage of bad pixels in non-occluded areas. We exclude the occluded regions for now since few of the algorithms in this study explicitly model occlusions, and most perform quite poorly in these regions. As algorithms get better at matching occluded regions [65], however, we will likely focus more on the total matching error B.

The other two important measures are B_T̄ and B_D, the percentage of bad pixels in textureless areas and in areas near depth discontinuities. These measures provide important information about the performance of algorithms in two critical problem areas. The parameter names for these three measures are bad_pixels_nonocc, bad_pixels_textureless, and bad_pixels_discont, and they appear in most of the plots below. We prefer the percentage of bad pixels over RMS disparity errors since this gives a better indication of the overall performance of an algorithm. For example, an algorithm is performing reasonably well if B_Ō < 10%. The RMS error figure, on the other hand, is contaminated by the (potentially large) disparity errors in those poorly matched 10% of the image. RMS errors become important once the percentage of bad pixels drops to a few percent and the quality of a sub-pixel fit needs to be evaluated (see Section 6.4).

Note that the algorithms always take exactly two images as input, even when more are available. For example, with our 9-frame sequences, we use the third and seventh frame as the input pair. (The other frames are used to measure the prediction error.)

6.1. Matching cost

We start by comparing different matching costs, including absolute differences (AD), squared differences (SD), truncated versions of both, and Birchfield and Tomasi's [12] sampling-insensitive dissimilarity measure (BT).

An interesting issue when trying to assess a single algorithm component is how to fix the parameters that control the other components. We usually choose good values based on experiments that assess the other algorithm components. (The inherent boot-strapping problem disappears after a few rounds of experiments.) Since the best settings for many parameters vary depending on the input image pair, we often have to compromise and select a value that works reasonably well for several images.

Experiment 1: In this experiment we compare the matching costs AD, SD, AD+BT, and SD+BT using a local algorithm. We aggregate with a 9 × 9 window, followed by winner-take-all optimization (i.e., we use the standard SAD and SSD algorithms). We do not compute sub-pixel estimates. Truncation values used are 1, 2, 5, 10, 20, 50, and ∞ (no truncation); these values are squared when truncating SD.

Results: Figure 8 shows plots of the three evaluation measures B_Ō, B_T̄, and B_D for each of the four matching costs as a function of truncation values, for the Tsukuba, Sawtooth, and Venus images. Overall, there is little difference between AD and SD. Truncation matters mostly for points near discontinuities. The reason is that for windows containing mixed populations (both foreground and background points), truncating the matching cost limits the influence of wrong matches. Good truncation values range from 5 to 50, typically around 20. Once the truncation values drop below the noise level (e.g., 2 and 1), the errors become very large. Using Birchfield-Tomasi (BT) helps for these small truncation values, but yields little improvement for good truncation values.

Figure 8: Experiment 1. Performance of different matching costs aggregated with a 9 × 9 window as a function of truncation values match_max for three different image pairs (Tsukuba, Sawtooth, and Venus; the plotted measures are bad_pixels_nonocc, bad_pixels_textureless, and bad_pixels_discont). Intermediate truncation values (5–20) yield the best results. Birchfield-Tomasi (BT) helps when truncation values are low.

The results are consistent across all data sets; however, the best truncation value varies. We have also tried a window size of 21, with similar results.

Conclusion: Truncation can help for AD and SD, but the best truncation value depends on the images' signal-to-noise ratio (SNR), since truncation should happen right above the noise level present (see also the discussion in [97]).

Experiment 2: This experiment is identical to the previous one, except that we also use a 9 × 9 min-filter (in effect, we aggregate with shiftable windows).

Results: Figure 9 shows the plots for this experiment, again for the Tsukuba, Sawtooth, and Venus images. As before, there are negligible differences between AD and SD. Now, however, the non-truncated versions perform consistently the best. In particular, for points near discontinuities we get the lowest errors overall, but also the total errors are comparable to the best settings of truncation in Experiment 1. BT helps bring down larger errors, but as before, does not significantly decrease the best (non-truncated) errors. We again also tried a window size of 21, with similar results.

Conclusion: The problem of selecting the best truncation value can be avoided by instead using a shiftable window (min-filter). This is an interesting result, as both robust matching costs (truncated functions) and shiftable windows have been proposed to deal with outliers in windows that straddle object boundaries. The above experiments suggest that avoiding outliers by shifting the window is preferable to limiting their influence using truncated cost functions.

Experiment 3: We now assess how matching costs affect global algorithms, using dynamic programming (DP), scanline optimization (SO), and graph cuts (GC) as optimization techniques. A problem with global techniques that minimize a weighted sum of data and smoothness terms (Equation (3)) is that the range of matching cost values affects the optimal value for λ, i.e., the relative weight of the smoothness term. For example, squared differences require much higher values for λ than absolute differences. Similarly, truncated difference functions result in lower matching costs and require lower values for λ. Thus, in trying to isolate the effect of the matching costs, we are faced with the problem of how to choose λ. The cleanest solution to this dilemma would perhaps be to find a (different) optimal λ independently for each matching cost under consideration, and then to report which matching cost gives the overall best results. The optimal λ, however, would not only differ across matching costs, but also across different images. Since in a practical matcher we need to choose a constant λ, we have done the same in this experiment. We use λ = 20 (guided by the results discussed in Section 6.3 below) and restrict the matching costs to absolute differences (AD), truncated by varying amounts. For the DP algorithm we use a fixed occlusion cost of 20.

Results: Figure 10 shows plots of the bad pixel percentages B_Ō, B_T̄, and B_D as a function of truncation values for the Tsukuba, Sawtooth, and Venus images. Each plot has six curves, corresponding to DP, DP+BT, SO, SO+BT, GC, and GC+BT. It can be seen that the truncation value affects the performance. As with the local algorithms, if the truncation value is too small (in the noise range), the errors get very large. Intermediate truncation values of 50–5, depending on algorithm and image pair, however, can sometimes improve the performance. The effect of Birchfield-Tomasi is mixed; as with the local algorithms in Experiments 1 and 2, it limits the errors if the truncation values are too small. BT is most beneficial for the SO algorithm; however, this is due to the fact that SO really requires a higher value of λ to work well (see Experiment 5), in which case the positive effect of BT is less pronounced.

Conclusion: Using robust (truncated) matching costs can slightly improve the performance of global algorithms. The best truncation value, however, varies with each image pair. Setting this parameter automatically based on an estimate of the image SNR may be possible and is a topic for further research. Birchfield and Tomasi's matching measure can improve results slightly. Intuitively, truncation should not be necessary for global algorithms that operate on unaggregated matching costs, since the problem of outliers in a window does not exist. An important problem for global algorithms, however, is to find the correct balance between data and smoothness terms (see Experiment 5 below). Truncation can be useful in this context since it limits the range of possible cost values.

6.2. Aggregation

We now turn to comparing different aggregation methods used by local methods. While global methods typically operate on raw (unaggregated) costs, aggregation can be useful for those methods as well, for example to provide starting values for iterative algorithms, or a set of high-confidence matches or ground control points (GCPs) [18] used to restrict the search of dynamic-programming methods.

In this section we examine aggregation with square windows, shiftable windows (min-filter), binomial filters, regular diffusion, and membrane diffusion [97]. Results for Bayesian diffusion, which combines aggregation and optimization, can be found in Section 7.

Experiment 4: In this experiment we use (non-truncated) absolute differences as matching cost and perform a winner-take-all optimization after the aggregation step (no sub-pixel estimation). We compare the following aggregation methods:

1. square windows with window sizes 3, 5, 7, ..., 29;

2. shiftable square windows (min-filter) with window sizes 3, 5, 7, ..., 29;

Figure 9: Experiment 2. Performance of different matching costs aggregated with a 9 × 9 shiftable window (min-filter) as a function of truncation values match_max for three different image pairs. Large truncation values (no truncation) work best when using shiftable windows.
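The plain and truncated cost functions compared in Experiments 1–3 can be stated in a few lines. The sketch below is illustrative Python, not the paper's C++ test bed; the function names are ours:

```python
def trunc_ad(i_left, i_right, t):
    """Truncated absolute difference: per-pixel cost clipped at t."""
    return min(abs(i_left - i_right), t)

def trunc_sd(i_left, i_right, t):
    """Truncated squared difference."""
    return min((i_left - i_right) ** 2, t)

# With t = float('inf') these reduce to plain AD / SD (no truncation),
# corresponding to the "inf" data points in the plots.
```

Small truncation values put the clipping threshold inside the noise range, which is why the errors explode at the left end of the curves.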

Figure 10: Experiment 3. Performance of different matching costs for global algorithms as a function of truncation values match_max for three different image pairs. Intermediate truncation values (∼20) can sometimes improve the performance.
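Birchfield and Tomasi's sampling-insensitive dissimilarity [12] (the "+BT" curves) compares each pixel against the linearly interpolated half-pixel neighborhood of its candidate match on the other scanline. The following is a hedged Python sketch of the symmetric measure, with names of our choosing and scanlines given as lists of intensities:

```python
def bt_half(a, pos, arr):
    """One-sided Birchfield-Tomasi dissimilarity: distance from intensity a
    to the linearly interpolated half-pixel neighborhood of arr around pos."""
    left = 0.5 * (arr[pos] + arr[max(pos - 1, 0)])
    right = 0.5 * (arr[pos] + arr[min(pos + 1, len(arr) - 1)])
    lo = min(left, right, arr[pos])
    hi = max(left, right, arr[pos])
    return max(0.0, a - hi, lo - a)

def bt_cost(xl, xr, row_l, row_r):
    """Symmetric sampling-insensitive dissimilarity between left pixel xl
    and right pixel xr on corresponding scanlines."""
    return min(bt_half(row_l[xl], xr, row_r),
               bt_half(row_r[xr], xl, row_l))
```

On an intensity ramp sampled with a half-pixel phase shift, the BT cost is zero even though the plain absolute difference is not, which is the sampling insensitivity the measure is designed for.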

Figure 11: Experiment 4. Performance of different aggregation methods as a function of spatial extent (window size, number of iterations, and diffusion β). Larger window extents do worse near discontinuities, but better in textureless areas, which tend to dominate the overall statistics. Near discontinuities, shiftable windows have the best performance.
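A shiftable window is equivalent to ordinary square-window aggregation followed by a min-filter of the same size over the aggregated costs, applied to each disparity slice: each pixel receives the best cost of any window that contains it. An illustrative Python sketch (ours, not the paper's C++ implementation; borders are simply clipped):

```python
def window_costs(cost, w, reduce_fn):
    """Slide a w x w window over a 2-D cost slice and combine the values
    inside it with reduce_fn (sum for aggregation, min for the
    shiftable-window min-filter). Windows are clipped at the borders."""
    h, wd, r = len(cost), len(cost[0]), w // 2
    out = [[0.0] * wd for _ in range(h)]
    for y in range(h):
        for x in range(wd):
            vals = [cost[yy][xx]
                    for yy in range(max(0, y - r), min(h, y + r + 1))
                    for xx in range(max(0, x - r), min(wd, x + r + 1))]
            out[y][x] = reduce_fn(vals)
    return out

def shiftable_window(cost, w):
    """Square-window aggregation followed by a same-size min-filter:
    equivalent to choosing the best shifted window at each pixel."""
    return window_costs(window_costs(cost, w, sum), w, min)
```

In the example below, a single outlier (cost 9) dominates every centered window that contains it, but the min-filter lets the pixel fall back to a shifted window that avoids it.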

3. iterated binomial (1-4-6-4-1) filter, for 2, 4, 6, ..., 28 iterations;

4. regular diffusion [97] for 10, 20, 30, ..., 150 iterations;

5. membrane diffusion [97] for 150 iterations and β = 0.9, 0.8, 0.7, ..., 0.0.

Note that for each method we are varying the parameter that controls the spatial extent of the aggregation (i.e., the equivalent of window size). In particular, for the binomial filter and regular diffusion, this amounts to changing the number of iterations. The membrane model, however, converges after sufficiently many iterations, and the spatial extent of the aggregation is controlled by the parameter β, the weight of the original cost values in the diffusion equation [97].

Results: Figure 11 shows plots of B_O, B_T, and B_D as a function of the spatial extent of aggregation for Tsukuba, Sawtooth, and Venus images. Each plot has five curves, corresponding to the five aggregation methods listed above. The most striking feature of these curves is the opposite trends of errors in textureless areas (B_T) and at points near discontinuities (B_D). Not surprisingly, more aggregation (larger window sizes or higher number of iterations) clearly helps to recover textureless areas (note especially the Venus images, which contain large untextured regions). At the same time, too much aggregation causes errors near object boundaries (depth discontinuities). The overall error in non-occluded regions, B_O, exhibits a mixture of both trends. Depending on the image, the best performance is usually achieved at an intermediate amount of aggregation. Among the five aggregation methods, shiftable windows clearly perform best, most notably in discontinuity regions, but also overall. The other four methods (square windows, binomial filter, regular diffusion, and membrane model) perform very similarly, except for differences in the shape of the curves, which are due to our (somewhat arbitrary) definition of spatial extent for each method. Note, however, that even for shiftable windows, the optimal window size for recovering discontinuities is small, while much larger windows are necessary in untextured regions.

Discussion: This experiment exposes some of the fundamental limitations of local methods. While large windows are needed to avoid wrong matches in regions with little texture, window-based stereo methods perform poorly near object boundaries (i.e., depth discontinuities). The reason is that such methods implicitly assume that all points within a window have similar disparities. If a window straddles a depth boundary, some points in the window match at the foreground disparity, while others match at the background disparity. The (aggregated) cost function at a point near a depth discontinuity is thus bimodal in the d direction, and the stronger of the two modes will be selected as the winning disparity. Which one of the two modes will win? This depends on the amount of (horizontal) texture present in the two regions.

Consider first a purely horizontal depth discontinuity (top edge of the foreground square in Figure 12). Whichever of the two regions has more horizontal texture will create a stronger mode, and the computed disparities will thus "bleed" into the less-textured region. For non-horizontal depth boundaries, however, the most prominent horizontal texture is usually the object boundary itself, since different objects typically have different colors and intensities. Since the object boundary is at the foreground disparity, a strong preference for the foreground disparity at points near the boundary is created, even if the background is textured. This is the explanation for the well-known "foreground fattening" effect exhibited by window-based algorithms. This can be seen at the right edge of the foreground in Figure 12; the left edge is an occluded area, which can't be recovered in any case.

Adaptive window methods have been developed to combat this problem. The simplest variant, shiftable windows (min-filters), can be effective, as is shown in the above experiment. Shiftable windows can recover object boundaries quite accurately if both foreground and background regions are textured, and as long as the window fits as a whole within the foreground object. The size of the min-filter should be chosen to match the window size. As is the case with all local methods, however, shiftable windows fail in textureless areas.

Conclusion: Local algorithms that aggregate support can perform well, especially in textured (even slanted) regions. Shiftable windows perform best, in particular near depth discontinuities. Large amounts of aggregation are necessary in textureless regions.

6.3. Optimization

In this section we compare the four global optimization techniques we implemented: dynamic programming (DP), scanline optimization (SO), graph cuts (GC), and simulated annealing (SA).

Experiment 5: In this experiment we investigate the role of opt_smoothness, the smoothness weight λ in Equation (3). We compare the performance of DP, SO, GC, and SA for λ = 5, 10, 20, 50, 100, 200, 500, and 1000. We use unaggregated absolute differences as the matching cost (squared differences would require much higher values for λ), and no sub-pixel estimation. The number of iterations for simulated annealing (SA) is 500.

Results: Figure 13 shows plots of B_O, B_T, and B_D as a function of λ for Tsukuba, Venus, and Map images. (To show more varied results, we use the Map images instead of Sawtooth in this experiment.) Since DP has an extra parameter, the occlusion cost, we include three runs, for

Figure 12: Illustration of the "foreground fattening" effect, using the Map image pair and a 21 × 21 SAD algorithm, with and without a min-filter. The error maps encode the signed disparity error, using gray for 0, light for positive errors, and dark for negative errors. Note that without the min-filter (middle column) the foreground region grows across the vertical depth discontinuity towards the right. With the min-filter (right column), the object boundaries are recovered fairly well.

opt_occlusion_cost = 20, 50, and 80. Using as before B_O (bad_pixels_nonocc) as our measure of overall performance, it can be seen that the graph-cut method (GC) consistently performs best, while the other three (DP, SO, and SA) perform slightly worse, with no clear ranking among them. GC also performs best in textureless areas and near discontinuities. The best performance for each algorithm, however, requires different values for λ depending on the image pair. For example, the Map images, which are well textured and only contain two planar regions, require high values (around 500), while the Tsukuba images, which contain many objects at different depths, require smaller values (20–200, also depending on the algorithm). The occlusion cost parameter for the DP algorithm, while not changing the performance dramatically, also affects the optimal value for λ. Although GC is the clear winner here, it is also the slowest algorithm: DP and SO, which operate on each scanline independently, typically run in less than 2 seconds, while GC and SA require 10–30 minutes.

Conclusion: The graph-cut method consistently outperforms the other optimization methods, although at the cost of much higher running times. GC is clearly superior to simulated annealing, which is consistent with other published results [23, 116]. When comparing GC and scanline methods (DP and SO), however, it should be noted that the latter solve a different (easier) optimization problem, since vertical smoothness terms are ignored. While this enables the use of highly efficient dynamic programming techniques, it negatively affects the performance, as exhibited in the characteristic "streaking" in the disparity maps (see Figures 17 and 18 below). Several authors have proposed methods for increasing inter-scanline consistency in dynamic-programming approaches, e.g., [9, 31, 13]. We plan to investigate this area in future work.

Experiment 6: We now focus on the graph-cut optimization method to see whether the results can be improved. We try both Birchfield–Tomasi matching costs and a smoothness cost that depends on the intensity gradients.

Results: Figure 14 shows the usual set of performance measures B_O, B_T, and B_D for four different experiments for Tsukuba, Sawtooth, Venus, and Map images. We use a smoothness weight of λ = 20, except for the Map images, where λ = 50. The matching costs are (non-truncated) absolute differences. The parameters for the gradient-dependent smoothness costs are opt_grad_thresh = 8 (same in all experiments), and opt_grad_penalty = 1, 2, or 4 (denoted p1, p2, and p4 in the plots). Recall that the smoothness cost is multiplied by opt_grad_penalty if the intensity gradient is below opt_grad_thresh to encourage disparity jumps to coincide with intensity edges. Each plot in Figure 14 shows 4 runs: p1, p1+BT, p2+BT, and p4+BT. In the first run, the penalty is 1, i.e., the gradient dependency is turned off. This gives the same results as in Experiment 5. In the second run, we

Figure 13: Experiment 5. Performance of global optimization techniques as a function of the smoothness weight λ (opt_smoothness) for Map, Tsukuba, and Venus images. Note that each image pair requires a different value of λ for optimal performance.
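Per scanline, minimizing the data term plus a λ-weighted smoothness term can be done exactly by dynamic programming. The sketch below is a simplified Python stand-in for such a scanline optimizer: it assumes a Potts (constant) penalty of λ per disparity change, which is a simplification of the paper's more general smoothness term, and the naming is ours:

```python
def scanline_dp(costs, lam):
    """Minimize sum_x costs[x][d_x] + lam * [d_x != d_{x-1}] over one
    scanline by Viterbi-style dynamic programming (Potts smoothness).
    costs[x][d] is the matching cost of disparity d at pixel x.
    Returns (optimal disparity list, minimal energy)."""
    n, nd = len(costs), len(costs[0])
    total = list(costs[0])          # best energy of a path ending at d
    back = []                       # predecessor disparity for each state
    for x in range(1, n):
        m = min(total)              # cheapest state to jump from
        arg_m = total.index(m)
        prev_choice, new_total = [], []
        for d in range(nd):
            if total[d] <= m + lam:             # stay at disparity d
                prev_choice.append(d)
                new_total.append(total[d] + costs[x][d])
            else:                               # jump from the cheapest state
                prev_choice.append(arg_m)
                new_total.append(m + lam + costs[x][d])
        back.append(prev_choice)
        total = new_total
    d = total.index(min(total))
    path = [d]
    for choices in reversed(back):  # backtrack to recover the disparities
        d = choices[d]
        path.append(d)
    path.reverse()
    return path, min(total)
```

Raising lam suppresses disparity changes along the scanline, which is exactly the trade-off swept in Figure 13; since each scanline is solved independently, the inter-scanline "streaking" discussed above remains.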

Figure 14: Experiment 6. Performance of the graph-cut optimization technique with different gradient-dependent smoothness penalties (p1, p2, p4) and with and without Birchfield–Tomasi (BT).

add Birchfield–Tomasi, still without a penalty. We then add a penalty of 2 and 4 in the last two runs. It can be seen that the low-gradient penalty clearly helps recovering the discontinuities, and also in the other regions. Which of the two penalties works better depends on the image pair. Birchfield–Tomasi also yields a slight improvement. We have also tried other values for the threshold, with mixed results. In future work we plan to replace the simple gradient threshold with an edge detector, which should improve edge localization. The issue of selecting the right penalty factor is closely related to selecting the right value for λ, since it affects the overall relationship between the data term and the smoothness term. This also deserves more investigation.

Conclusion: Both Birchfield–Tomasi's matching cost and the gradient-based smoothness cost improve the performance of the graph-cut algorithm. Choosing the right parameters (threshold and penalty) remains difficult and image-specific.

We have performed these experiments for scanline-based optimization methods (DP and SO) as well, with similar results. Gradient-based penalties usually increase performance, in particular for the SO method. Birchfield–Tomasi always seems to increase overall performance, but it sometimes decreases performance in textureless areas. As before, the algorithms are highly sensitive to the weight of the smoothness term λ and the penalty factor.

6.4. Sub-pixel estimation

Experiment 7: To evaluate the performance of the sub-pixel refinement stage, and also to evaluate the influence of the matching criteria and disparity sampling, we cropped a small planar region from one of our image sequences (Figure 15a, second column of images). The image itself is a page of newsprint mounted on cardboard, with high-frequency text and a few low-frequency white and dark regions. (These textureless regions were excluded from the statistics we gathered.) The disparities in this region are in the order of 0.8–3.8 pixels and are slanted both vertically and horizontally.

Results: We first run a simple 9 × 9 SSD window (Figure 15b). One can clearly see the discrete disparity levels computed. The disparity error map (second column of images) shows the staircase error, and the histogram of disparities (third column) also shows the discretization. If we apply the sub-pixel parabolic fit to refine the disparities, the disparity map becomes smoother (note the drop in RMS error in Figure 15c), but still shows some soft staircasing, which is visible in the disparity error map and histogram as well. These results agree with those reported by Shimizu and Okutomi [105].

In Figure 15d, we investigate whether using the Birchfield–Tomasi sampling-invariant measure [12] improves or degrades this behavior. For integral sampling, their idea does help slightly, as can be seen by the reduced

Figure 15: RMS disparity errors for cropped image sequence (planar region of newspaper):

      disp. step    subpix refine   Birchf.-Tomasi   preproc. blur   RMS disp. error
(a)   ground truth       —               —                —              0
(b)        1             no              no               no             0.296
(c)        1             yes             no               no             0.088
(d)        1             yes             yes              no             0.082
(e)        1             yes             no               yes            0.135
(f)       1/2            yes             no               no             0.051
(g)       1/4            no              no               no             0.087
(h)       1/4            yes             no               no             0.046

The reference image is shown in row (a) in the "disp. error" column. The columns indicate the disparity step, the sub-pixel refinement option, Birchfield–Tomasi's sampling-insensitive matching option, the optional initial blur, and the RMS disparity error from ground truth. The first image column shows the computed disparity map, the second shows the signed disparity error, and the last column shows a histogram of computed disparities.
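The parabolic sub-pixel fit evaluated above takes the costs at the winning disparity and its two neighbors, fits a quadratic through them, and returns the fractional offset of its minimum in [-0.5, 0.5]. A minimal Python sketch (our naming, not the paper's C++ test bed):

```python
def subpixel_parabola(c_minus, c0, c_plus):
    """Fractional disparity offset from a parabola fit through the
    matching costs at d-1, d, and d+1, where c0 is the winning
    (locally minimal) cost."""
    denom = c_minus - 2.0 * c0 + c_plus
    if denom <= 0:
        return 0.0  # flat or non-convex fit: keep the integer disparity
    return 0.5 * (c_minus - c_plus) / denom
```

For an exactly quadratic cost the recovered offset is exact; on real cost curves the fit leaves the "soft staircasing" visible in Figure 15c.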

Figure 16: Plots of RMS disparity error and inverse prediction errors as a function of disp_step and match_interval. The even data points are with sampling-insensitive matching turned on. The second set of plots in each figure is with preproc_blur enabled (1 blur iteration).

RMS value and the smoother histogram in Figure 15d. In all other instances, it leads to poorer performance (see Figure 16a, where the sampling-invariant results are the even data points).

In Figure 15e, we investigate whether lightly blurring the input images with a (1/4, 1/2, 1/4) kernel helps sub-pixel refinement, because the first-order Taylor series expansion of the imaging function becomes more valid. Blurring does indeed slightly reduce the staircasing effect (compare Figure 15e to Figure 15c), but the overall (RMS) performance degrades, probably because of loss of high-frequency detail.

We also tried 1/2 and 1/4 pixel disparity sampling at the initial matching stages, with and without later sub-pixel refinement. Sub-pixel refinement always helps to reduce the RMS disparity error, although it has negligible effect on the inverse prediction error (Figure 16b). From these prediction error plots, and also from visual inspection of the inverse warped (stabilized) image sequence, it appears that using any sub-pixel refinement after the original matching scheme is sufficient to reduce the prediction error (and the appearance of "jitter" or "shearing") to negligible levels. This is despite the fact that the theoretical justification for sub-pixel refinement is based on a quadratic fit to an adequately sampled quadratic energy function. At the moment, for global methods, we rely on the per-pixel costs that go into the optimization to do the sub-pixel disparity estimation. Alternative approaches, such as using local plane fits [5, 14, 117], could also be used to get sub-pixel precision.

Conclusions: To eliminate "staircasing" in the computed disparity map and to also eliminate the appearance of "shearing" in reprojected sequences, it is necessary to initially evaluate the matches at a fractional disparity (1/2 pixel steps appear to be adequate). This should be followed by finding the minima of local quadratic fits applied to the computed matching costs.

7. Overall comparison

We close our experimental investigation with an overall comparison of 20 different stereo methods, including 5 algorithms implemented by us and 15 algorithms implemented by their authors, who have kindly sent us their results. We evaluate all algorithms using our familiar set of Tsukuba, Sawtooth, Venus, and Map images. All algorithms are run with constant parameters over all four images. We do not perform sub-pixel estimation in this comparison.

Among the algorithms in our implementation framework, we have selected the following five:

(1) SSD – 21 × 21 shiftable window SSD,
(2) DP – dynamic programming,
(3) SO – scanline optimization,
(4) GC – graph-cut optimization, and
(5) Bay – Bayesian diffusion.

We chose shiftable-window SSD as the best-performing representative of all local (aggregation-based) algorithms. We are not including simulated annealing here, since GC solves the same optimization problem better and more efficiently. For each of the five algorithms, we have chosen fixed parameters that yield reasonably good performance over a variety of input images (see Table 4).

We compare the results of our implementation with results provided by the authors of the following algorithms:

(6) Max-flow / min-cut algorithm, Roy and Cox [92, 91] (one of the first methods to formulate matching as a graph flow problem);

(7) Pixel-to-pixel stereo, Birchfield and Tomasi [13] (scanline algorithm using gradient-modulated costs, followed by vertical disparity propagation into unreliable areas);

Table 4: Parameters for the five algorithms implemented by us:

                        SSD    DP    SO    GC    Bay
Matching cost
match_fn                SD     AD    AD    AD    AD
Truncation              no     no    no    no    no
Birchfield / Tomasi     no     yes   yes   yes   no
Aggregation
aggr_window_size        21     —     —     —     —
aggr_minfilter          21     —     —     —     —
aggr_iter               1      —     —     —     1000
diff_mu                 —      —     —     —     0.5
diff_sigmaP             —      —     —     —     0.4
diff_epsP               —      —     —     —     0.01
diff_scale_cost         —      —     —     —     0.01
Optimization
opt_fn                  WTA    DP    SO    GC    Bayesian
opt_smoothness (λ)      —      20    50    20    —
opt_occlusion_cost      —      20    —     —     —
opt_grad_thresh         —      8     8     8     —
opt_grad_penalty        —      4     2     2     —

(8) Multiway cut, Birchfield and Tomasi [14] (alternate segmentation and finding affine parameters for each segment using graph cuts);

(9) Cooperative algorithm, Zitnick and Kanade [132] (a new variant of Marr and Poggio's algorithm [73]);

(10) Graph cuts, Boykov et al. [23] (same as our GC method, but a much faster implementation);

(11) Graph cuts with occlusions, Kolmogorov and Zabih [65] (an extension of the graph-cut framework that explicitly models occlusions);

(12) Compact windows, Veksler [124] (an adaptive window technique allowing non-rectangular windows);

(13) Genetic algorithm, Gong and Yang [49] (a global optimization technique operating on quadtrees);

(14) Realtime method, Hirschmüller [53] (9 × 9 SAD with shiftable windows, followed by consistency checking and interpolation of uncertain areas);

(15) Stochastic diffusion, Lee et al. [68] (a variant of Bayesian diffusion [97]);

(16) Fast correlation algorithm, Mühlmann et al. [79] (an efficient implementation of correlation-based matching with consistency and uniqueness validation);

(17) Discontinuity-preserving regularization, Shao [104] (a multi-view technique for virtual view generation);

(18) Maximum-surface technique, Sun [108] (a fast stereo algorithm using rectangular subregions);

(19) Belief propagation, Sun et al. [109] (an MRF formulation using Bayesian belief propagation);

(20) Layered stereo, Lin and Tomasi [69] (a preliminary version of an extension of the multiway-cut method [14]).

Some of these algorithms do not compute disparity estimates everywhere, in particular those that explicitly model occlusions (3 and 11), but also (16), which leaves low-confidence regions unmatched. In these cases we fill unmatched areas as described for the DP method in Section 4.3.

Table 5 summarizes the results for all algorithms. As in the previous section, we report B_O (bad_pixels_nonocc) as a measure of overall performance, as well as B_T (bad_pixels_textureless) and B_D (bad_pixels_discont). We do not report B_T for the Map images since the images are textured almost everywhere. The algorithms are listed roughly in order of overall performance.

The disparity maps for the Tsukuba and Venus images are shown in Figures 17 and 18. The full set of disparity maps, as well as signed and binary error maps, are available on our web site at www.middlebury.edu/stereo.

Looking at these results, we can draw several conclusions about the overall performance of the algorithms. First, global optimization methods based on 2-D MRFs generally perform the best in all regions of the image (overall, textureless, and discontinuities). Most of these techniques are based on graph-cut optimization (4, 8, 10, 11, 20), but belief propagation (19) also does well. Approaches that explicitly model planar surfaces (8, 20) are especially good with piecewise planar scenes such as Sawtooth and Venus.
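The B_O, B_T, and B_D statistics used throughout are all instances of one computation: the percentage of pixels, optionally restricted to a region of interest (non-occluded, textureless, or near-discontinuity pixels), whose absolute disparity error exceeds a threshold. A Python sketch with names of our choosing, using the error threshold of 1.0 disparity level:

```python
def bad_pixel_percent(d_computed, d_truth, mask=None, delta=1.0):
    """Percentage of pixels whose absolute disparity error exceeds
    delta, optionally restricted to pixels where mask is truthy
    (e.g. a textureless or near-discontinuity region mask)."""
    pairs = [(dc, dt) for k, (dc, dt) in enumerate(zip(d_computed, d_truth))
             if mask is None or mask[k]]
    bad = sum(1 for dc, dt in pairs if abs(dc - dt) > delta)
    return 100.0 * bad / len(pairs)
```

The disparity maps are shown flattened to 1-D lists here for simplicity; the same computation applies per pixel on 2-D maps.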

Figure 17: Comparative results on the Tsukuba images (true disparities plus the disparity maps computed by algorithms 19 – Belief propagation, 20 – Layered stereo, 11 – GC + occlusions, *4 – Graph cuts, 13 – Genetic algorithm, 10 – Graph cuts, 6 – Max flow, 12 – Compact windows, 9 – Cooperative alg., 15 – Stochastic diffusion, *2 – Dynamic progr., 14 – Realtime SAD, *3 – Scanline opt., 7 – Pixel-to-pixel stereo, *1 – SSD+MF, *5 – Bayesian diffusion, 17 – Disc.-pres. regul., 16 – Fast Correlation, and 8 – Multiway cut). The results are shown in decreasing order of overall performance (B_O). Algorithms implemented by us are marked with a star.

Figure 18: Comparative results on the Venus images (true disparities plus the disparity maps computed by algorithms 8 – Multiway cut, 20 – Layered stereo, 19 – Belief propagation, 12 – Compact windows, 10 – Graph cuts, 14 – Realtime SAD, *4 – Graph cuts, 6 – Max flow, 15 – Stochastic diffusion, 13 – Genetic algorithm, 9 – Cooperative alg., 11 – GC + occlusions, *1 – SSD+MF, *5 – Bayesian diffusion, 18 – Max surface, 17 – Disc.-pres. regul., 16 – Fast Correlation, *3 – Scanline opt., and 7 – Pixel-to-pixel stereo). The results are shown in decreasing order of overall performance (B_O). Algorithms implemented by us are marked with a star.

Table 5: Comparative performance of stereo algorithms on the Tsukuba, Sawtooth, Venus, and Map images, using the three performance measures B_O (bad pixels nonocc), B_T (bad pixels textureless), and B_D (bad pixels discont). All algorithms are run with constant parameter settings across all images. The small numbers indicate the rank of each algorithm in each column. The algorithms are listed roughly in decreasing order of overall performance, and the minimum (best) value in each column is shown in bold. Algorithms implemented by us are marked with a star. [Table body omitted: the individual error percentages and ranks of the 20 algorithms (Layered, Graph cuts, Belief prop., GC+occl., Multiw. cut, Compact win., Realtime, Bay. diff., Cooperative, SSD+MF, Stoch. diff., Genetic, Pix-to-pix, Max flow, Scanl. opt., Dyn. prog., Shao, Fast Correl., and Max surf.) are not recoverable from this copy.]

Next, cooperative and diffusion-based methods (5, 9, 15) do reasonably well, but often get the boundaries wrong, especially on the more complex Tsukuba images. On the highly textured and relatively simple Map sequence, however, they can outperform some of the full optimization approaches. The Map sequence is also noisier than the others, which works against algorithms that are sensitive to internal parameter settings. (In these experiments, we asked everyone to use a single set of parameters for all four datasets.)

Lastly, local (1, 12, 14, 16) and scanline methods (2, 3, 7) perform less well, although (14), which performs additional consistency checks and clean-up steps, does reasonably well, as does the compact window approach (12), which uses a sophisticated adaptive window. Simpler local approaches such as SSD+MF (1) generally do poorly in textureless areas (if the window size is small) or near discontinuities (if the window is large). The disparity maps created by the scanline-based algorithms (DP and SO) are promising and show a lot of detail, but the larger quantitative errors are clearly a result of the "streaking" due to the lack of inter-scanline consistency.

To demonstrate the importance of parameter settings, Table 6 compares the overall results (B_O) of algorithms 1–5 for the fixed parameters listed in Table 4 with the "best" results when parameters are allowed to vary for each image. Note that we did not perform a true optimization over all parameter values, but rather simply chose the overall best results among the entire set of experiments we performed. It can be seen that for some of the algorithms the performance can be improved substantially with different parameters. The global optimization algorithms are particularly sensitive to the parameter λ, and DP is also sensitive to the occlusion cost parameter. This is consistent with our observations in Section 6.3. Note that the Map image pair can virtually be "solved" using GC, Bay, or SSD, since the images depict a simple geometry and are well textured. More challenging data sets with many occlusions and textureless regions may be useful in future extensions of this study.

Finally, we take a brief look at the efficiency of different methods. Table 7 lists the image sizes and number of disparity levels for each image pair, and the running times for 9 selected algorithms.
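The quality measures in Table 5 are bad-pixel statistics: the percentage of pixels in a given region (non-occluded, textureless, or near depth discontinuities) whose computed disparity differs from the ground truth by more than a fixed threshold. A minimal sketch of this statistic, assuming NumPy arrays for the disparity maps and a boolean region mask (the function name and the toy arrays are illustrative, not taken from our test bed):

```python
import numpy as np

def bad_pixel_rate(d_computed, d_truth, mask, delta=1.0):
    """Percentage of pixels in `mask` whose disparity error exceeds `delta`.

    B_O, B_T, and B_D differ only in the mask used
    (non-occluded, textureless, or near-discontinuity pixels).
    """
    err = np.abs(d_computed.astype(float) - d_truth.astype(float))
    bad = (err > delta) & mask
    return 100.0 * bad.sum() / mask.sum()  # percentage, as in Table 5

# Toy example: a 2x3 disparity map with one gross error.
d_true = np.array([[5.0, 5.0, 5.0], [6.0, 6.0, 6.0]])
d_est  = np.array([[5.0, 5.4, 9.0], [6.0, 6.0, 6.0]])
mask   = np.ones_like(d_true, dtype=bool)   # e.g., all pixels non-occluded
print(bad_pixel_rate(d_est, d_true, mask))  # one of six pixels is bad
```

With the non-occluded mask this yields B_O; swapping in the textureless or discontinuity masks yields B_T and B_D.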

         Tsukuba        Sawtooth       Venus          Map
         fixed   best   fixed   best   fixed   best   fixed   best
1 SSD     5.23   5.23    2.21   1.55    3.74   2.92    0.66   0.22
2 DP      4.12   3.82    4.84   3.70   10.10   9.13    3.33   1.21
3 SO      5.08   4.66    4.06   3.47    9.44   8.31    1.84   1.04
4 GC      1.94   1.94    1.30   0.98    1.79   1.48    0.31   0.09
5 Bay     6.49   6.49    1.45   1.45    4.00   4.00    0.20   0.20

Table 6: Overall performance B_O (bad pixels nonocc) for algorithms 1–5, both using fixed parameters across all images, and best parameters for each image. Note that for some algorithms significant performance gains are possible if parameters are allowed to vary for each image.

Not surprisingly, the speed-optimized methods (14 and 16) are fastest, followed by local and scanline-based methods (1 – SSD, 2 – DP, 3 – SO). Our implementations of Graph cuts (4) and Bayesian diffusion (5) are several orders of magnitude slower. The authors' implementations of the graph cut methods (10 and 11), however, are much faster than our implementation. This is due to the new max-flow code by Boykov and Kolmogorov [21], which is available at www.cs.cornell.edu/People/vnk/software.html.

In summary, if efficiency is an issue, a simple shiftable-window method is a good choice. In particular, method 14 by Hirschmüller [53] is among the fastest and produces very good results. New implementations of graph-cut methods give excellent results and have acceptable running times. Further research is needed to fully exploit the potential of scanline methods without sacrificing their efficiency.

8. Conclusion

In this paper, we have proposed a taxonomy for dense two-frame stereo correspondence algorithms. We use this taxonomy to highlight the most important features of existing stereo algorithms and to study important algorithmic components in isolation. We have implemented a suite of stereo matching algorithm components and constructed a test harness that can be used to combine these, to vary the algorithm parameters in a controlled way, and to test the performance of these algorithms on interesting data sets. We have also produced some new calibrated multi-view stereo data sets with hand-labeled ground truth. We have performed an extensive experimental investigation in order to assess the impact of the different algorithmic components. The experiments reported here have demonstrated the limitations of local methods, and have assessed the value of different global techniques and their sensitivity to key parameters.

We hope that publishing this study along with our sample code and data sets on the Web will encourage other stereo researchers to run their algorithms on our data and to report their comparative results. Since publishing the initial version of this paper as a technical report [98], we have already received experimental results (disparity maps) from 15 different research groups, and we hope to obtain more in the future. We are planning to maintain the on-line version of Table 5 that lists the overall results of the currently best-performing algorithms on our web site. We also hope that some researchers will take the time to add their algorithms to our framework for others to use and to build upon. In the long term, we hope that our efforts will lead to some set of data and testing methodology becoming an accepted standard in the stereo correspondence community, so that new algorithms will have to pass a "litmus test" to demonstrate that they improve on the state of the art.

There are many other open research questions we would like to address. How important is it to devise the right cost function (e.g., with better gradient-dependent smoothness terms) in global optimization algorithms vs. how important is it to find a global minimum? What are the best (sampling-invariant) matching metrics? What kind of adaptive/shiftable windows work best? Is it possible to automatically adapt parameters to different images? Also, is prediction error a useful metric for gauging the quality of stereo algorithms? We would also like to try other existing data sets and to produce some labeled data sets that are not all piecewise planar.

Once this study has been completed, we plan to move on to study multi-frame stereo matching with arbitrary camera geometry. There are many technical solutions possible to this problem, including voxel representations, layered representations, and multi-view representations. This more general version of the correspondence problem should also prove to be more useful for image-based rendering applications.

Developing the taxonomy and implementing the algorithmic framework described in this paper has given us a much deeper understanding of what does and does not work well in stereo matching. We hope that other researchers will also take the time to carefully analyze the behavior of their own algorithms using the framework and methodology developed in this paper, and that this will lead to a deeper understanding of the complex behavior of stereo correspondence algorithms.
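The efficiency ranking above is easy to see in code: a local window method needs, per disparity level, only one pass to compute squared differences and one box filter to aggregate them, followed by a per-pixel winner-take-all. The following sketch of such an SSD matcher is illustrative only (it is not our C++ test bed; the function name, window size, and disparity range are made up for the example):

```python
import numpy as np

def ssd_disparity(left, right, max_disp, win=3):
    """Winner-take-all disparity via window-summed squared differences.

    For each candidate disparity d: shift the right image by d, square the
    per-pixel difference, box-filter it over a win x win window, and keep
    the disparity with the lowest aggregated cost at each pixel.
    """
    h, w = left.shape
    cost = np.full((max_disp + 1, h, w), np.inf)
    r = win // 2
    for d in range(max_disp + 1):
        diff = np.full((h, w), np.inf)   # columns < d have no match
        diff[:, d:] = (left[:, d:] - right[:, : w - d]) ** 2
        # Simple padded window sum; running sums would make this O(1)
        # per pixel, independent of the window size.
        padded = np.pad(diff, r, mode="edge")
        agg = np.zeros((h, w))
        for dy in range(win):
            for dx in range(win):
                agg += padded[dy : dy + h, dx : dx + w]
        cost[d] = agg
    return np.argmin(cost, axis=0)
```

Replacing the nested box-filter loops with running sums makes the aggregation cost independent of the window size, which is the kind of optimization the real-time methods (14, 16) rely on; the global methods instead re-solve a large optimization problem over the whole disparity volume.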

                    Tsukuba   Sawtooth   Venus    Map
width               384       434        434      284
height              288       380        383      216
disparity levels    16        20         20       30
Time (seconds):
14 – Realtime       0.1       0.2        0.2      0.1
16 – Efficient      0.2       0.3        0.3      0.2
*1 – SSD+MF         1.1       1.5        1.7      0.8
*2 – DP             1.0       1.8        1.9      0.8
*3 – SO             1.1       2.2        2.3      1.3
10 – GC             23.6      48.3       51.3     22.3
11 – GC+occlusions  69.8      154.4      239.9    64.0
*4 – GC             662.0     735.0      829.0    480.0
*5 – Bay            1055.0    2049.0     2047.0   1236.0

Table 7: Running times of selected algorithms on the four image pairs.

Acknowledgements

Many thanks to Ramin Zabih for his help in laying the foundations for this study and for his valuable input and suggestions throughout this project. We are grateful to Y. Ohta and Y. Nakamura for supplying the ground-truth imagery from the University of Tsukuba, and to Lothar Hermes for supplying the synthetic images from the University of Bonn. Thanks to Padma Ugbabe for helping to label the image regions, and to Fred Lower for providing his paintings for our image data sets. Finally, we would like to thank Stan Birchfield, Yuri Boykov, Minglung Gong, Heiko Hirschmüller, Vladimir Kolmogorov, Sang Hwa Lee, Mike Lin, Karsten Mühlmann, Sébastien Roy, Juliang Shao, Harry Shum, Changming Sun, Jian Sun, Carlo Tomasi, Olga Veksler, and Larry Zitnick, for running their algorithms on our data sets and sending us the results, and Gary Bradski for helping to facilitate this comparison of stereo algorithms. This research was supported in part by NSF CAREER grant 9984485.

References

[1] P. Anandan. A computational framework and an algorithm for the measurement of visual motion. IJCV, 2(3):283–310, 1989.
[2] R. D. Arnold. Automated stereo perception. Technical Report AIM-351, Artificial Intelligence Laboratory, Stanford University, 1983.
[3] H. Baker and T. Binford. Depth from edge and intensity based stereo. In IJCAI81, pages 631–636, 1981.
[4] H. H. Baker. Edge based stereo correlation. In L. S. Baumann, editor, Image Understanding Workshop, pages 168–175. Science Applications International Corporation, 1980.
[5] S. Baker, R. Szeliski, and P. Anandan. A layered approach to stereo reconstruction. In CVPR, pages 434–441, 1998.
[6] S. T. Barnard. Stochastic stereo matching over scale. IJCV, 3(1):17–32, 1989.
[7] S. T. Barnard and M. A. Fischler. Computational stereo. ACM Comp. Surveys, 14(4):553–572, 1982.
[8] J. L. Barron, D. J. Fleet, and S. S. Beauchemin. Performance of optical flow techniques. IJCV, 12(1):43–77, 1994.
[9] P. N. Belhumeur. A Bayesian approach to binocular stereopsis. IJCV, 19(3):237–260, 1996.
[10] P. N. Belhumeur and D. Mumford. A Bayesian treatment of the stereo correspondence problem using half-occluded regions. In CVPR, pages 506–512, 1992.
[11] J. R. Bergen, P. Anandan, K. J. Hanna, and R. Hingorani. Hierarchical model-based motion estimation. In ECCV, pages 237–252, 1992.
[12] S. Birchfield and C. Tomasi. Depth discontinuities by pixel-to-pixel stereo. In ICCV, pages 1073–1080, 1998.
[13] S. Birchfield and C. Tomasi. A pixel dissimilarity measure that is insensitive to image sampling. IEEE TPAMI, 20(4):401–406, 1998.
[14] S. Birchfield and C. Tomasi. Multiway cut for stereo and motion with slanted surfaces. In ICCV, pages 489–495, 1999.
[15] M. J. Black and P. Anandan. A framework for the robust estimation of optical flow. In ICCV, pages 231–236, 1993.
[16] M. J. Black and A. Rangarajan. On the unification of line processes, outlier rejection, and robust statistics with applications in early vision. IJCV, 19(1):57–91, 1996.
[17] A. Blake and A. Zisserman. Visual Reconstruction. MIT Press, Cambridge, MA, 1987.
[18] A. F. Bobick and S. S. Intille. Large occlusion stereo. IJCV, 33(3):181–200, 1999.
[19] R. C. Bolles, H. H. Baker, and M. J. Hannah. The JISCT stereo evaluation. In DARPA Image Understanding Workshop, pages 263–274, 1993.
[20] R. C. Bolles, H. H. Baker, and D. H. Marimont. Epipolar-plane image analysis: An approach to determining structure from motion. IJCV, 1:7–55, 1987.
[21] Y. Boykov and V. Kolmogorov. A new algorithm for energy minimization with discontinuities. In Intl. Workshop on Energy Minimization Methods in Computer Vision and Pattern Recognition, pages 205–220, 2001.

[22] Y. Boykov, O. Veksler, and R. Zabih. A variable window approach to early vision. IEEE TPAMI, 20(12):1283–1294, 1998.
[23] Y. Boykov, O. Veksler, and R. Zabih. Fast approximate energy minimization via graph cuts. IEEE TPAMI, 23(11):1222–1239, 2001.
[24] A. Broadhurst, T. Drummond, and R. Cipolla. A probabilistic framework for space carving. In ICCV, volume I, pages 388–393, 2001.
[25] L. G. Brown. A survey of image registration techniques. Computing Surveys, 24(4):325–376, 1992.
[26] P. J. Burt and E. H. Adelson. The Laplacian pyramid as a compact image code. IEEE Transactions on Communications, COM-31(4):532–540, 1983.
[27] J. F. Canny. A computational approach to edge detection. IEEE TPAMI, 8(6):34–43, 1986.
[28] P. B. Chou and C. M. Brown. The theory and practice of Bayesian image labeling. IJCV, 4(3):185–210, 1990.
[29] S. D. Cochran and G. Medioni. 3-D surface description from binocular stereo. IEEE TPAMI, 14(10):981–994, 1992.
[30] R. T. Collins. A space-sweep approach to true multi-image matching. In CVPR, pages 358–363, 1996.
[31] I. J. Cox, S. L. Hingorani, S. B. Rao, and B. M. Maggs. A maximum likelihood stereo algorithm. CVIU, 63(3):542–567, 1996.
[32] I. J. Cox, S. Roy, and S. L. Hingorani. Dynamic histogram warping of image pairs for constant image brightness. In IEEE International Conference on Image Processing (ICIP'95), volume 2, pages 366–369. IEEE Computer Society, 1995.
[33] B. Culbertson, T. Malzbender, and G. Slabaugh. Generalized voxel coloring. In International Workshop on Vision Algorithms, pages 100–114, Kerkyra, Greece, 1999. Springer.
[34] J. S. De Bonet and P. Viola. Poxels: Probabilistic voxelized volume reconstruction. In ICCV, pages 418–425, 1999.
[35] R. Deriche. Fast algorithms for low-level vision. IEEE TPAMI, 12(1):78–87, 1990.
[36] P. Dev. Segmentation processes in visual perception: A cooperative neural model. COINS Technical Report 74C-5, University of Massachusetts at Amherst, 1974.
[37] U. R. Dhond and J. K. Aggarwal. Structure from stereo—a review. IEEE Trans. on Systems, Man, and Cybern., 19(6):1489–1510, 1989.
[38] O. Faugeras and R. Keriven. Variational principles, surface evolution, PDE's, level set methods, and the stereo problem. IEEE Trans. Image Proc., 7(3):336–344, 1998.
[39] O. Faugeras and Q.-T. Luong. The Geometry of Multiple Images. MIT Press, Cambridge, MA, 2001.
[40] D. J. Fleet, A. D. Jepson, and M. R. M. Jenkin. Phase-based disparity measurement. CVGIP, 53(2):198–210, 1991.
[41] T. Frohlinghaus and J. M. Buhmann. Regularizing phase-based stereo. In ICPR, volume A, pages 451–455, 1996.
[42] P. Fua. A parallel stereo algorithm that produces dense depth maps and preserves image features. Machine Vision and Applications, 6:35–49, 1993.
[43] P. Fua and Y. G. Leclerc. Object-centered surface reconstruction: Combining multi-image stereo and shading. IJCV, 16:35–56, 1995.
[44] E. Gamble and T. Poggio. Visual integration and detection of discontinuities: the key role of intensity edges. A. I. Memo 970, AI Lab, MIT, 1987.
[45] D. Geiger and F. Girosi. Parallel and deterministic algorithms for MRF's: Surface reconstruction. IEEE TPAMI, 13(5):401–412, 1991.
[46] D. Geiger, B. Ladendorf, and A. Yuille. Occlusions and binocular stereo. In ECCV, pages 425–433, 1992.
[47] S. Geman and D. Geman. Stochastic relaxation, Gibbs distribution, and the Bayesian restoration of images. IEEE TPAMI, 6(6):721–741, 1984.
[48] M. A. Gennert. Brightness-based stereo matching. In ICCV, pages 139–143, 1988.
[49] M. Gong and Y.-H. Yang. Multi-baseline stereo matching using genetic algorithm. In IEEE Workshop on Stereo and Multi-Baseline Vision, 2001. IJCV (this issue).
[50] W. E. L. Grimson. Computational experiments with a feature based stereo algorithm. IEEE TPAMI, 7(1):17–34, 1985.
[51] M. J. Hannah. Computer Matching of Areas in Stereo Images. PhD thesis, Stanford University, 1974.
[52] R. I. Hartley and A. Zisserman. Multiple View Geometry. Cambridge University Press, Cambridge, UK, 2000.
[53] H. Hirschmüller. Improvements in real-time correlation-based stereo vision. In IEEE Workshop on Stereo and Multi-Baseline Vision, 2001. IJCV (this issue).
[54] Y. C. Hsieh, D. McKeown, and F. P. Perlant. Performance evaluation of scene registration and stereo matching for cartographic feature extraction. IEEE TPAMI, 14(2):214–238, 1992.
[55] H. Ishikawa and D. Geiger. Occlusions, discontinuities, and epipolar lines in stereo. In ECCV, pages 232–248, 1998.
[56] M. R. M. Jenkin, A. D. Jepson, and J. K. Tsotsos. Techniques for disparity measurement. CVGIP: Image Understanding, 53(1):14–30, 1991.
[57] D. G. Jones and J. Malik. A computational framework for determining stereo correspondence from a set of linear spatial filters. In ECCV, pages 395–410, 1992.
[58] T. Kanade. Development of a video-rate stereo machine. In Image Understanding Workshop, pages 549–557, Monterey, CA, 1994. Morgan Kaufmann Publishers.
[59] T. Kanade et al. A stereo machine for video-rate dense depth mapping and its new applications. In CVPR, pages 196–202, 1996.
[60] T. Kanade and M. Okutomi. A stereo matching algorithm with an adaptive window: Theory and experiment. IEEE TPAMI, 16(9):920–932, 1994.
[61] S. B. Kang, R. Szeliski, and J. Chai. Handling occlusions in dense multi-view stereo. In CVPR, 2001.
[62] S. B. Kang, J. Webb, L. Zitnick, and T. Kanade. A multi-baseline stereo system with active illumination and real-time image acquisition. In ICCV, pages 88–93, 1995.
[63] M. Kass. Linear image features in stereopsis. IJCV, 1(4):357–368, 1988.
[64] R. Kimura et al. A convolver-based real-time stereo machine (SAZAN). In CVPR, volume 1, pages 457–463, 1999.

[65] V. Kolmogorov and R. Zabih. Computing visual correspondence with occlusions using graph cuts. In ICCV, volume II, pages 508–515, 2001.
[66] K. N. Kutulakos. Approximate N-view stereo. In ECCV, volume I, pages 67–83, 2000.
[67] K. N. Kutulakos and S. M. Seitz. A theory of shape by space carving. IJCV, 38(3):199–218, 2000.
[68] S. H. Lee, Y. Kanatsugu, and J.-I. Park. Hierarchical stochastic diffusion for disparity estimation. In IEEE Workshop on Stereo and Multi-Baseline Vision, 2001. IJCV (this issue).
[69] M. Lin and C. Tomasi. Surfaces with occlusions from layered stereo. Technical report, Stanford University, 2002. In preparation.
[70] C. Loop and Z. Zhang. Computing rectifying homographies for stereo vision. In CVPR, volume I, pages 125–131, 1999.
[71] B. D. Lucas and T. Kanade. An iterative image registration technique with an application in stereo vision. In Seventh International Joint Conference on Artificial Intelligence (IJCAI-81), pages 674–679, Vancouver, 1981.
[72] D. Marr. Vision. W. H. Freeman and Company, New York, 1982.
[73] D. Marr and T. Poggio. Cooperative computation of stereo disparity. Science, 194:283–287, 1976.
[74] D. C. Marr and T. Poggio. A computational theory of human stereo vision. Proceedings of the Royal Society of London, B 204:301–328, 1979.
[75] J. Marroquin, S. Mitter, and T. Poggio. Probabilistic solution of ill-posed problems in computational vision. Journal of the American Statistical Association, 82(397):76–89, 1987.
[76] J. L. Marroquin. Design of cooperative networks. Working Paper 253, AI Lab, MIT, 1983.
[77] L. Matthies, R. Szeliski, and T. Kanade. Kalman filter-based algorithms for estimating depth from image sequences. IJCV, 3:209–236, 1989.
[78] A. Mitiche and P. Bouthemy. Computation and analysis of image motion: A synopsis of current problems and methods. IJCV, 19(1):29–55, 1996.
[79] K. Mühlmann, D. Maier, J. Hesser, and R. Männer. Calculating dense disparity maps from color stereo images, an efficient implementation. In IEEE Workshop on Stereo and Multi-Baseline Vision, 2001. IJCV (this issue).
[80] J. Mulligan, V. Isler, and K. Daniilidis. Performance evaluation of stereo for tele-presence. In ICCV, volume II, pages 558–565, 2001.
[81] Y. Nakamura, T. Matsuura, K. Satoh, and Y. Ohta. Occlusion detectable stereo — occlusion patterns in camera matrix. In CVPR, pages 371–378, 1996.
[82] H. K. Nishihara. Practical real-time imaging stereo matcher. Optical Engineering, 23(5):536–545, 1984.
[83] Y. Ohta and T. Kanade. Stereo by intra- and inter-scanline search using dynamic programming. IEEE TPAMI, 7(2):139–154, 1985.
[84] M. Okutomi and T. Kanade. A locally adaptive window for signal matching. IJCV, 7(2):143–162, 1992.
[85] M. Okutomi and T. Kanade. A multiple-baseline stereo. IEEE TPAMI, 15(4):353–363, 1993.
[86] M. Otte and H.-H. Nagel. Optical flow estimation: advances and comparisons. In ECCV, volume 1, pages 51–60, 1994.
[87] T. Poggio, V. Torre, and C. Koch. Computational vision and regularization theory. Nature, 317(6035):314–319, 1985.
[88] S. B. Pollard, J. E. W. Mayhew, and J. P. Frisby. PMF: A stereo correspondence algorithm using a disparity gradient limit. Perception, 14:449–470, 1985.
[89] K. Prazdny. Detection of binocular disparities. Biological Cybernetics, 52(2):93–99, 1985.
[90] L. H. Quam. Hierarchical warp stereo. In Image Understanding Workshop, pages 149–155, New Orleans, Louisiana, 1984. Science Applications International Corporation.
[91] S. Roy. Stereo without epipolar lines: A maximum flow formulation. IJCV, 34(2/3):147–161, 1999.
[92] S. Roy and I. J. Cox. A maximum-flow formulation of the N-camera stereo correspondence problem. In ICCV, pages 492–499, 1998.
[93] T. W. Ryan, R. T. Gray, and B. R. Hunt. Prediction of correlation errors in stereo-pair images. Optical Engineering, 19(3):312–322, 1980.
[94] H. Saito and T. Kanade. Shape reconstruction in projective grid space from large number of images. In CVPR, volume 2, pages 49–54, 1999.
[95] D. Scharstein. Matching images by comparing their gradient fields. In ICPR, volume 1, pages 572–575, 1994.
[96] D. Scharstein. View Synthesis Using Stereo Vision, volume 1583 of Lecture Notes in Computer Science (LNCS). Springer-Verlag, 1999.
[97] D. Scharstein and R. Szeliski. Stereo matching with nonlinear diffusion. IJCV, 28(2):155–174, 1998.
[98] D. Scharstein and R. Szeliski. A taxonomy and evaluation of dense two-frame stereo correspondence algorithms. Technical Report MSR-TR-2001-81, Microsoft Research, 2001.
[99] D. Scharstein, R. Szeliski, and R. Zabih. A taxonomy and evaluation of dense two-frame stereo correspondence algorithms. In IEEE Workshop on Stereo and Multi-Baseline Vision, 2001.
[100] P. Seitz. Using local orientation information as image primitive for robust object recognition. In SPIE Visual Communications and Image Processing IV, volume 1199, pages 1630–1639, 1989.
[101] S. M. Seitz and C. M. Dyer. Photorealistic scene reconstruction by voxel coloring. IJCV, 35(2):1–23, 1999.
[102] J. Shade, S. Gortler, L.-W. He, and R. Szeliski. Layered depth images. In SIGGRAPH, pages 231–242, 1998.
[103] J. Shah. A nonlinear diffusion model for discontinuous disparity and half-occlusion in stereo. In CVPR, pages 34–40, 1993.
[104] J. Shao. Combination of stereo, motion and rendering for 3d footage display. In IEEE Workshop on Stereo and Multi-Baseline Vision, 2001. IJCV (this issue).
[105] M. Shimizu and M. Okutomi. Precise sub-pixel estimation on area-based matching. In ICCV, volume I, pages 90–97, 2001.
[106] H.-Y. Shum and R. Szeliski. Stereo reconstruction from multiperspective panoramas. In ICCV, pages 14–21, 1999.
[107] E. P. Simoncelli, E. H. Adelson, and D. J. Heeger. Probability distributions of optic flow. In CVPR, pages 310–315, 1991.
[108] C. Sun. Rectangular subregioning and 3-d maximum-surface techniques for fast stereo matching. In IEEE Workshop on Stereo and Multi-Baseline Vision, 2001. IJCV (this issue).

[109] J. Sun, H. Y. Shum, and N. N. Zheng. Stereo matching using belief propagation. In ECCV, 2002. Submitted.
[110] R. Szeliski. Bayesian Modeling of Uncertainty in Low-Level Vision. Kluwer Academic Publishers, Boston, MA, 1989.
[111] R. Szeliski. Prediction error as a quality metric for motion and stereo. In ICCV, pages 781–788, 1999.
[112] R. Szeliski and J. Coughlan. Spline-based image registration. IJCV, 22(3):199–218, 1997.
[113] R. Szeliski and P. Golland. Stereo matching with transparency and matting. IJCV, 32(1):45–61, 1999. Special Issue for Marr Prize papers.
[114] R. Szeliski and G. Hinton. Solving random-dot stereograms using the heat equation. In CVPR, pages 284–288, 1985.
[115] R. Szeliski and D. Scharstein. Symmetric sub-pixel stereo matching. In ECCV, 2002. Submitted.
[116] R. Szeliski and R. Zabih. An experimental comparison of stereo algorithms. In International Workshop on Vision Algorithms, pages 1–19, Kerkyra, Greece, 1999. Springer.
[117] H. Tao, H. Sawhney, and R. Kumar. A global matching framework for stereo computation. In ICCV, volume I, pages 532–539, 2001.
[118] M. Tekalp. Digital Video Processing. Prentice Hall, Upper Saddle River, NJ, 1995.
[119] D. Terzopoulos. Regularization of inverse visual problems involving discontinuities. IEEE TPAMI, 8(4):413–424, 1986.
[120] D. Terzopoulos and K. Fleischer. Deformable models. The Visual Computer, 4(6):306–331, 1988.
[121] D. Terzopoulos and D. Metaxas. Dynamic 3D models with local and global deformations: Deformable superquadrics. IEEE TPAMI, 13(7):703–714, 1991.
[122] Q. Tian and M. N. Huhns. Algorithms for subpixel registration. CVGIP, 35:220–233, 1986.
[123] O. Veksler. Efficient Graph-based Energy Minimization Methods in Computer Vision. PhD thesis, Cornell University, 1999.
[124] O. Veksler. Stereo matching by compact windows via minimum ratio cycle. In ICCV, volume I, pages 540–547, 2001.
[125] J. Y. A. Wang and E. H. Adelson. Layered representation for motion analysis. In CVPR, pages 361–366, 1993.
[126] A. Witkin, D. Terzopoulos, and M. Kass. Signal matching through scale space. IJCV, 1:133–144, 1987.
[127] Y. Yang, A. Yuille, and J. Lu. Local, global, and multilevel stereo matching. In CVPR, pages 274–279, 1993.
[128] A. L. Yuille and T. Poggio. A generalized ordering constraint for stereo correspondence. A.I. Memo 777, AI Lab, MIT, 1984.
[129] R. Zabih and J. Woodfill. Non-parametric local transforms for computing visual correspondence. In ECCV, volume II, pages 151–158, 1994.
[130] Z. Zhang. Determining the epipolar geometry and its uncertainty: A review. IJCV, 27(2):161–195, 1998.
[131] Z. Zhang. A flexible new technique for camera calibration. IEEE TPAMI, 22(11):1330–1334, 2000.
[132] C. L. Zitnick and T. Kanade. A cooperative algorithm for stereo matching and occlusion detection. IEEE TPAMI, 22(7):675–684, 2000.
