High-quality video view interpolation using a layered representation

C. Lawrence Zitnick, Sing Bing Kang, Matthew Uyttendaele, Simon Winder, Richard Szeliski

Interactive Visual Media Group, Microsoft Research, Redmond, WA

Figure 1: A video view interpolation example: (a,c) synchronized frames from two different input cameras and (b) a virtual interpolated view. (d) A depth-matted object from earlier in the sequence is inserted into the video.

Abstract

The ability to interactively control viewpoint while watching a video is an exciting application of image-based rendering. The goal of our work is to render dynamic scenes with interactive viewpoint control using a relatively small number of video cameras. In this paper, we show how high-quality video-based rendering of dynamic scenes can be accomplished using multiple synchronized video streams combined with novel image-based modeling and rendering algorithms. Once these video streams have been processed, we can synthesize any intermediate view between cameras at any time, with the potential for space-time manipulation.

In our approach, we first use a novel color segmentation-based stereo algorithm to generate high-quality photoconsistent correspondences across all camera views. Mattes for areas near depth discontinuities are then automatically extracted to reduce artifacts during view synthesis. Finally, a novel temporal two-layer compressed representation that handles matting is developed for rendering at interactive rates.

CR Categories: I.3.3 [Computer Graphics]: Picture/Image Generation - display algorithms; I.4.8 [Image Processing and Computer Vision]: Scene Analysis - Stereo and Time-varying imagery.

Keywords: Image-Based Rendering, Dynamic Scenes, Computer Vision.

1 Introduction

Most of the past work on image-based rendering (IBR) involves rendering static scenes, with two of the best-known techniques being Light Field Rendering [Levoy and Hanrahan 1996] and the Lumigraph [Gortler et al. 1996]. Their success in high quality rendering stems from the use of a large number of sampled images and has inspired a large number of papers. However, extending IBR to dynamic scenes is not trivial because of the difficulty (and cost) of synchronizing so many cameras as well as acquiring and storing the images.

Our work is motivated by this problem of capturing, representing, and rendering dynamic scenes from multiple points of view. Being able to do this interactively can enhance the viewing experience, enabling such diverse applications as new viewpoint instant replays, changing the point of view in dramas, and creating “freeze frame” visual effects at will. We wish to provide a solution that is cost-effective yet capable of realistic rendering. In this paper, we describe a system for high-quality view interpolation between relatively sparse camera viewpoints. Video matting is automatically performed to enhance the output quality. In addition, we propose a new temporal two-layer representation that enables both efficient compression and interactive playback of the captured dynamic scene.

1.1 Video-based rendering

One of the earliest attempts at capturing and rendering dynamic scenes was Kanade et al.’s Virtualized Reality™ system, which involved 51 cameras arranged around a 5-meter geodesic dome. The resolution of each camera is 512 × 512 and the capture rate 30 fps. They extract a global surface representation at each time frame, using a form of voxel coloring based on the scene flow equation [Vedula et al. 2000]. Unfortunately, the results look unrealistic because of low resolution, matching errors, and improper handling of object boundaries.

Matusik et al. use the images from four calibrated FireWire cameras (256 × 256) to compute and shade visual hulls. The computation is distributed across five PCs, which can render 8000 pixels of the visual hull at about 8 fps. Carranza et al. use seven inward looking synchronized cameras distributed around a room to capture 3D human motion. Each camera has a 320 × 240 resolution and captures at 15 fps. They use a 3D human model as a prior to compute 3D shape at each time frame. Yang et al. [2002a] designed an 8 × 8 grid of 320 × 240 cameras for capturing dynamic scenes. Instead of storing and rendering the data, they transmit only the rays necessary to compose the desired virtual view. In their system, the cameras are not genlocked; instead, they rely on internal clocks across six PCs. The camera capture rate is 15 fps, and the interactive viewing rate is 18 fps.

Using the Lumigraph structure with per-pixel depth values, Schirmacher et al. were able to render interpolated views at close
to interactive rates (ranging from 2 to 9 fps, depending on image size, number of input cameras, and whether depth data has to be computed on-the-fly). Goldlücke et al. proposed a system which also involves capturing, computing, and triangulating depth maps off-line, followed by real-time rendering using hardware acceleration. However, their triangulation process ignores depth discontinuities and matting is not accounted for (single depth per pixel).

Yang et al. [2002b] use graphics hardware to compute stereo data through plane sweeping and subsequently render new views. They are able to achieve a rendering rate of 15 fps with five 320 × 240 cameras. However, the matching window used is only one pixel, and occlusions are not handled.

As a proof of concept for storing dynamic light fields, Wilburn et al. demonstrated that it is possible to synchronize six cameras (640 × 480 at 30 fps), and compress and store all the image data in real time. They have since increased the size of the system to 128 cameras.

The MPEG community has also been investigating the issue of visualizing dynamic scenes, which it terms “free viewpoint video.” The first ad hoc group (AHG) on 3D audio and video (3DAV) of MPEG was established at the 58th meeting in December 2001 in Pattaya, Thailand. A good overview of this MPEG activity is presented by Smolić and Kimata.

1.2 Stereo with dynamic scenes

Many images are required to perform image-based rendering if the scene geometry is either unknown or known to only a rough approximation. If geometry is known accurately, it is possible to reduce the requirement for images substantially [Gortler et al. 1996]. One practical way of extracting the scene geometry is through stereo. Within the past 20 years, many stereo algorithms have been proposed for static scenes [Scharstein and Szeliski 2002].

As part of the Virtualized Reality™ work, Vedula et al. proposed an algorithm for extracting 3D motion (i.e., correspondence between scene shape across time) using 2D optical flow and 3D scene shape. In their approach, they use a voting scheme similar to voxel coloring [Seitz and Dyer 1997], where the measure used is how well a hypothesized voxel location fits the 3D flow equation.

Zhang and Kambhamettu also integrated 3D scene flow and structure in their framework. A 3D affine motion model is used locally, with spatial regularization, and discontinuities are preserved using color segmentation. Tao et al. assume the scene is piecewise planar. They also assume constant velocity for each planar patch in order to constrain the dynamic depth map estimation.

In a more ambitious effort, Carceroni and Kutulakos recover piecewise continuous geometry and reflectance (Phong model) under non-rigid motion with known lighting positions. They discretize the space into surface elements (“surfels”), and perform a search over location, orientation, and reflectance parameter to maximize agreement with the observed images.

In an interesting twist to conventional local window matching, Zhang et al. use matching windows that straddle space and time. The advantage of this method is that there is less dependence on brightness constancy over time.

Active rangefinding techniques have also been applied to moving scenes. Hall-Holt and Rusinkiewicz use projected boundary-coded stripe patterns that vary over time. There is also a commercial system on the market called ZCam™ (http://www.3dvsystems.com/products/zcam.html), which is a range sensing video camera add-on used in conjunction with a broadcast video camera. However, it is an expensive system, and provides single viewpoint depth only, which makes it less suitable for free viewpoint video.

1.3 Video view interpolation

Despite all the advances in stereo and image-based rendering, it is still very difficult to render high-quality, high resolution views of dynamic scenes. To address this problem, we use high-resolution cameras (1024 × 768) and a new color segmentation-based stereo algorithm to generate high quality photoconsistent correspondences across all camera views. Mattes for areas near depth discontinuities are automatically extracted to reduce artifacts during view synthesis. Finally, a novel temporal two-layer representation is used for on-line rendering at interactive rates. Once the input videos have been processed off-line, our real-time rendering system can interactively synthesize any intermediate view at any time.

Interactive “bullet time” and more. For several years now, viewers of TV commercials and feature films have been seeing the “freeze frame” effect used to create the illusion of stopping time and changing the camera viewpoint. The earliest commercials were produced using Dayton Taylor’s film-based Timetrack® system (http://www.timetrack.com/), which rapidly jumped between different still cameras arrayed along a rail to give the illusion of moving through a frozen slice of time.

When it first appeared, the effect was fresh and looked spectacular, and soon it was being emulated in many productions, the most famous of which is probably the “bullet time” effects seen in The Matrix. Unfortunately, this effect is typically a one-time, pre-planned affair. The viewpoint trajectory is planned ahead of time, and many man hours are expended to produce the desired interpolated views. Newer systems such as Digital Air’s Movia® are based on video camera arrays, but still rely on having many cameras to avoid software view interpolation.

In contrast, our approach is much more flexible. First of all, once all the input videos have been processed, viewing is interactive. The user can watch the dynamic scene by manipulating (freezing, slowing down, or reversing) time and changing the viewpoint at will. Since different trajectories can be taken through space-time, no two viewing experiences need be the same. Second, because we have high-quality 3D stereo data at our disposal, object manipulation (such as insertion or deletion) is easy.

Features of our system. Our current system acquires the video and computes the geometry information off-line, and subsequently renders in real-time. We chose this approach because the applications we envision include high-quality archival of dynamic events and instructional videos for activities such as ballet and martial arts. Our foremost concern is the rendering quality, and our current stereo algorithm, while very effective, is not fast enough for the entire system to operate in real-time. Our system is not meant to be used for immersive teleconferencing (such as blue-c [Gross et al. 2003]) or real-time (live) broadcast 3D TV.

We currently use eight cameras placed along a 1D arc spanning about 30° from one end to the other (this span can be extended, as shown in the discussion section). We plan to extend our system to a 2D camera arrangement and eventually 360° coverage. While this would not be a trivial extension, we believe that the Unstructured Lumigraph [Buehler et al. 2001] provides the right framework for accomplishing this. The main contribution of our work is a layered
depth image representation that produces much better results than the crude proxies used in the Unstructured Lumigraph.

In the remainder of this paper, we present the details of our system. We first describe the novel hardware we use to capture multiple synchronized videos (Section 2). Next, we describe the novel image-based representation that is the key to producing high-quality interpolated views at video rates (Section 3). We then present our multi-view stereo reconstruction and matting algorithms that enable us to reliably extract this representation from the input video (Sections 4 and 5). We then describe our compression technique (Section 6) and image-based rendering algorithm (implemented on a GPU using vertex and pixel shaders) that enable real-time interactive performance (Section 7). Finally, we highlight the results obtained using our system, and close with a discussion of future work.

2 Hardware system

Figure 2 shows a configuration of our video capturing system with 8 cameras arranged along a horizontal arc. We use high resolution (1024 × 768) PtGrey color cameras to capture video at 15 fps, with 8mm lenses, yielding a horizontal field of view of about 30°. To handle real-time storage of all the input videos, we commissioned PtGrey to build us two concentrator units. Each concentrator synchronizes four cameras and pipes the four uncompressed video streams into a bank of hard disks through a fiber optic cable. The two concentrators are synchronized via a FireWire cable.

Figure 2: A configuration of our system with 8 cameras. [Diagram labels: cameras, concentrators, banks of hard disks, controlling laptop.]

The cameras are calibrated before every capture session using a 36” × 36” calibration pattern mounted on a flat plate, which is moved around in front of all the cameras. The calibration technique of Zhang is used to recover all the camera parameters necessary for Euclidean stereo recovery.

3 Image-based representation

The goal of the offline processing and on-line rendering stages is to create view-interpolated frames of the highest possible quality. One approach, as suggested in the seminal Light Field Rendering paper [Levoy and Hanrahan 1996], is to simply re-sample rays based only on the relative positions of the input and virtual cameras. However, as demonstrated in the Lumigraph [Gortler et al. 1996] and subsequent work, using a 3D impostor or proxy for the scene geometry can greatly improve the quality of the interpolated views. Another approach is to create a single texture-mapped 3D model [Kanade et al. 1997], but this generally produces inferior results to using multiple reference views.

Since we use geometry-assisted image-based rendering, which kind of 3D proxy should we use? Wang and Adelson use planar sprites to model object motion, but such models cannot account for local depth distributions. An alternative is to use a single global polyhedral model, as in the Lumigraph and Unstructured Lumigraph papers [Buehler et al. 2001]. Another possibility is to use per-pixel depth, as in Layered Depth Images [Shade et al. 1998], the offset depth maps in Façade [Debevec et al. 1996], or sprites with depth [Baker et al. 1998; Shade et al. 1998]. In general, using different local geometric proxies for each reference view [Pulli et al. 1997; Debevec et al. 1998; Heigl et al. 1999] produces higher quality results, so that is the approach we adopt.

To obtain the highest possible quality for a fixed number of input images, we use per-pixel depth maps generated by the novel stereo algorithm described in Section 4. However, even multiple depth maps still exhibit rendering artifacts when generating novel views: aliasing (jaggies) due to the abrupt nature of the foreground to background transition and contaminated colors due to mixed pixels, which become visible when compositing over novel backgrounds or objects.

We address these problems using a novel two-layer representation inspired by Layered Depth Images and sprites with depth [Shade et al. 1998]. We first locate the depth discontinuities in a depth map $d_i$ and create a boundary strip (layer) around these pixels (Figure 3a). We then use a variant of Bayesian matting [Chuang et al. 2001] to estimate the foreground and background colors, depths, and opacities (alpha values) within these strips, as described in Section 5. To reduce the data size, the multiple alpha-matted depth images are then compressed using a combination of temporal and spatial prediction, as described in Section 6.

Figure 3: Two-layer representation: (a) discontinuities in the depth are found and a boundary strip is created around these; (b) a matting algorithm is used to pull the boundary and main layers $B_i$ and $M_i$. (The boundary layer is drawn with variable transparency to suggest partial opacity values.) [Diagram labels: depth discontinuity, strip width, matte, $d_i$, $B_i$, $M_i$.]

At rendering time, the two reference views nearest to the novel view are chosen, and all the layers involved are then warped. The warped layers are combined based on their respective pixel depths, pixel opacity, and proximity to the novel view. A more detailed description of this process is given in Section 7.

4 Reconstruction algorithm

When developing a stereo vision algorithm for use in view interpolation, the requirements for accuracy vary from those of standard stereo algorithms used for 3D reconstruction. We are not as directly concerned with error in disparity as we are in the error in intensity values for the interpolated image. For example, a multi-pixel disparity error in an area of low texture, such as a white wall, will result in significantly less intensity error in the interpolated image than the same disparity error in a highly textured area. In particular, edges and straight lines in the scene need to be rendered correctly.

Traditional stereo algorithms tend to produce erroneous results around disparity discontinuities. Unfortunately, such errors produce some of the most noticeable artifacts in interpolated scenes, since they typically coincide with intensity edges. Recently, a new approach to stereo vision called segmentation-based stereo has been proposed. These methods segment the image into regions likely
to have similar or smooth disparities prior to the stereo computation. A smoothness constraint is then enforced for each segment. Tao et al. used a planar constraint, while Zhang and Kambhamettu used the segments for local support. These methods have shown very promising results in accurately handling disparity discontinuities.

Our algorithm also uses a segmentation-based approach and has the following advantages over prior work: disparities within segments must be smooth but need not be planar; each image is treated equally, i.e., there is no reference image; occlusions are modeled explicitly; and consistency between disparity maps is enforced, resulting in higher quality depth maps.

Our algorithm is implemented using the following steps (Figure 4). First, each image is independently segmented. Second, we compute an initial disparity space distribution (DSD) for each segment, using the assumption that all pixels within a segment have the same disparity. Next, we refine each segment’s DSD using neighboring segments and its projection into other images. We relax the assumption that each segment has a single disparity during a disparity smoothing stage. Finally, we use image matting to compute alpha values for pixels along disparity discontinuities.

Figure 4: Outline of the stereo algorithm. [Diagram stages: Segmentation, Compute Initial DSD, DSD Refinement, Cross Disparity Refinement, Matting.]

4.1 Segmentation

The goal of segmentation is to split each image into regions that are likely to contain similar disparities. These regions or segments should be as large as possible to increase local support while minimizing the chance of the segments covering areas of varying disparity. In creating these segments, we assume that areas of homogeneous color generally have smooth disparities, i.e., disparity discontinuities generally coincide with intensity edges.

Our segmentation algorithm has two steps. First, we smooth the image using a variant of anisotropic diffusion [Perona and Malik 1990]. We then segment the image based on neighboring color values.

The purpose of smoothing prior to segmentation is to remove as much image noise as possible in order to create more consistent segments. We also want to reduce the number of thin segments along intensity edges. Our smoothing algorithm iteratively averages (8 times) a pixel with three contiguous neighbors as shown in Figure 5(a). The set of pixels used for averaging is determined by which pixels have the minimum absolute difference in color from the center pixel. This simplified variant of the well known anisotropic diffusion and bilateral filtering algorithms produces good results for our application.

After smoothing, each pixel is assigned its own segment. Two neighboring 4-connected segments are merged if the Euclidean distance between their average colors is less than 6. Segments smaller than 100 pixels in area are merged with their most similarly colored neighbors. Since large areas of homogeneous color may also possess varying disparity, we split horizontally and vertically segments that are more than 40 pixels wide or tall. Our segments thus vary in size from 100 to 1600 pixels. A result of our segmentation algorithm can be seen in Figure 5(b–c).

Figure 5: Segmentation: (a) neighboring pixel groups used in averaging; (b) close-up of color image and (c) its segmentation.

4.2 Initial Disparity Space Distribution

After segmentation, our next step is to compute the initial disparity space distribution (DSD) for each segment in each camera. The DSD is the set of probabilities over all disparities for segment $s_{ij}$ in image $I_i$. It is a variant of the classic disparity space image (DSI), which associates a cost or likelihood at every disparity with every pixel [Scharstein and Szeliski 2002]. The probability that segment $s_{ij}$ has disparity $d$ is denoted by $p_{ij}(d)$, with $\sum_d p_{ij}(d) = 1$.

Our initial DSD for each segment $s_{ij}$ is set to

$$p^0_{ij}(d) = \frac{\prod_{k \in N_i} m_{ijk}(d)}{\sum_{d'} \prod_{k \in N_i} m_{ijk}(d')}, \tag{1}$$

where $m_{ijk}(d)$ is the matching function for $s_{ij}$ in image $k$ at disparity $d$, and $N_i$ are the neighbors of image $i$. For this paper, we assume that $N_i$ consists of the immediate neighbors of $i$, i.e., the cameras to the left and right of $i$. We divide by the sum of all the matching scores to ensure the DSD sums to one.

Given the gain differences between our cameras, we found a matching score that uses a histogram of pixel gains produces the best results. For each pixel $x$ in segment $s_{ij}$, we find its projection $x'$ in image $k$. We then create a histogram using the gains (ratios), $I_i(x)/I_k(x')$. For color pixels, the gains for each channel are computed separately and added to the same histogram. The bins of the histogram are computed using a log scale. For all examples in this paper, we used a histogram with 20 bins ranging from 0.8 to 1.25.

If a match is good, the histogram has a few bins with large values with the rest being small, while a bad match has a more even distribution (Figure 7). To measure the “sharpness” of the distribution, we could use several methods such as measuring the variance or entropy. We found the following to be both efficient and produce good results:

$$m_{ijk}(d) = \max_l \,(h_{l-1} + h_l + h_{l+1}), \tag{2}$$

where $h_l$ is the $l$th bin in the histogram, i.e., the matching score is the sum of the three largest contiguous bins in the histogram.

Figure 7: Good and bad match gain histograms.

4.3 Coarse DSD refinement

The next step is to iteratively refine the disparity space distribution of each segment. We assume, as we did in the previous section, that each segment has a single disparity.
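As a rough illustration of Section 4.2, the following Python sketch builds the log-scale gain histogram, scores it with Equation (2), and normalizes the per-disparity products as in Equation (1). The function names are hypothetical, and several details are assumptions made for this example rather than details taken from the paper: the histogram is normalized by the number of samples, gains outside [0.8, 1.25] are dropped, bins beyond the ends of the histogram count as zero, and a uniform distribution is returned when every product is zero. The projected pixels $x'$ are assumed to have been computed already.

```python
import math

NUM_BINS = 20                    # 20 bins, as stated in Section 4.2
GAIN_MIN, GAIN_MAX = 0.8, 1.25   # gain range, as stated in Section 4.2


def gain_histogram(pixels_i, pixels_k):
    """Histogram of per-channel gains I_i(x) / I_k(x') on a log scale.

    pixels_i and pixels_k are equal-length lists of color tuples; pixels_k
    holds the already-projected samples x'. Assumption: the histogram is
    normalized by the number of channel samples, and gains outside the
    range are dropped.
    """
    log_min, log_max = math.log(GAIN_MIN), math.log(GAIN_MAX)
    bin_width = (log_max - log_min) / NUM_BINS
    hist = [0.0] * NUM_BINS
    count = 0
    for color_i, color_k in zip(pixels_i, pixels_k):
        for a, b in zip(color_i, color_k):  # all channels share one histogram
            count += 1
            if a <= 0 or b <= 0:
                continue  # assumption: skip samples with undefined gain
            idx = math.floor((math.log(a / b) - log_min) / bin_width)
            if 0 <= idx < NUM_BINS:
                hist[idx] += 1.0
    return [h / count for h in hist] if count else hist


def match_score(hist):
    """Equation (2): largest sum of three contiguous bins."""
    padded = [0.0] + list(hist) + [0.0]  # assumption: missing end bins are 0
    return max(padded[l - 1] + padded[l] + padded[l + 1]
               for l in range(1, len(padded) - 1))


def initial_dsd(scores):
    """Equation (1): scores[k][d] holds m_ijk(d) for each neighbor k."""
    num_d = len(scores[0])
    products = [math.prod(s[d] for s in scores) for d in range(num_d)]
    total = sum(products)
    if total == 0:
        return [1.0 / num_d] * num_d  # assumption: uniform fallback
    return [p / total for p in products]
```

With two neighbors whose scores over two disparities are `[0.2, 0.8]` and `[0.5, 0.5]`, `initial_dsd` yields a distribution proportional to the products 0.1 and 0.4, so the second disparity receives most of the probability mass.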
When refining the DSD, we wish to enforce a smoothness constraint between segments and a consistency constraint between images. The smoothness constraint states that neighboring segments with similar colors should have similar disparities. The second constraint enforces consistency in disparities between images. That is, if we project a segment with disparity $d$ onto a neighboring image, the segments it projects to should have disparities close to $d$.

We iteratively enforce these two constraints using the following equation:

$$p^{t+1}_{ij}(d) = \frac{l_{ij}(d) \prod_{k \in N_i} c_{ijk}(d)}{\sum_{d'} l_{ij}(d') \prod_{k \in N_i} c_{ijk}(d')}, \tag{3}$$

where $l_{ij}(d)$ enforces the smoothness constraint and $c_{ijk}(d)$ enforces the consistency constraint with each neighboring image in $N_i$. The details of the smoothness and consistency constraints are given in Appendix A.

4.4 Disparity smoothing

Up to this point, the disparities in each segment are constant. At this stage, we relax this constraint and allow disparities to vary smoothly based on disparities in neighboring segments and images.

At the end of the coarse refinement stage, we set each pixel $x$ in segment $s_{ij}$ to the disparity $\hat{d}_{ij}$ with the maximum value in the DSD, i.e., $\forall x \in s_{ij},\ d(x) = \arg\max_{d'} p_{ij}(d')$. To ensure that disparities are consistent between images, we do the following. For each pixel $x$ in $I_i$ with disparity $d_i(x)$, we project it into image $I_k$. If $y$ is the projection of $x$ in $I_k$ and $|d_i(x) - d_k(y)| < \lambda$, we replace $d_i(x)$ with the average of $d_i(x)$ and $d_k(y)$. The resulting update formula is therefore

$$d^{t+1}_i(x) = \frac{1}{\#N_i} \sum_{k \in N_i} \left[ \delta^x_{ik}\, \frac{d^t_i(x) + d^t_k(y)}{2} + (1 - \delta^x_{ik})\, d^t_i(x) \right], \tag{4}$$

where $\delta^x_{ik} = \big(|d_i(x) - d_k(y)| < \lambda\big)$ is the indicator variable that tests for similar disparities and $\#N_i$ is the number of neighbors. The parameter $\lambda$ is set to 4, i.e., the same value used to compute the occlusion function (9). After averaging the disparities across images, we then average the disparities within a 5 × 5 window of $x$ (restricted to within the segment) to ensure they remain smooth.

Figure 6 shows some sample results from the stereo reconstruction process. You can see how the disparity estimates improve at each successive refinement stage.

Figure 6: Sample results from stereo reconstruction stage: (a) input color image; (b) color-based segmentation; (c) initial disparity estimates $\hat{d}_{ij}$; (d) refined disparity estimates; (e) smoothed disparity estimates $d_i(x)$.

5 Boundary matting

During stereo computation, we assume that each pixel has a unique disparity. In general this is not the case, as some pixels along the boundary of objects will receive contributions from both the foreground and background colors. If we use the original mixed pixel colors during image-based rendering, visible artifacts will result. A method to avoid such visible artifacts is to use image priors [Fitzgibbon et al. 2003], but it is not clear if such a technique can be used for real-time rendering.

A technique that may be used to separate the mixed pixels is that of Szeliski and Golland. While the underlying principle behind their work is persuasive, the problem is still ill-conditioned. As a result, issues such as partial occlusion and fast intensity changes at or near depth discontinuities are difficult to overcome.

We handle the mixed pixel problem by computing matting information within a neighborhood of four pixels from all depth discontinuities. A depth discontinuity is defined as any disparity jump greater than $\lambda$ (= 4) pixels. Within these neighborhoods, foreground and background colors along with opacities (alpha values) are computed using Bayesian image matting [Chuang et al. 2001]. (Chuang et al. later extended their technique to videos using optic flow.) The foreground information is combined to form our boundary layer as shown in Figure 3. The main layer consists of the background information along with the rest of the image information located away from the depth discontinuities. Note that Chuang et al.’s algorithms do not estimate depths, only colors and opacities. Depths are estimated by simply using alpha-weighted averages of nearby depths in the boundary and main layers. To prevent cracks from appearing during rendering, the boundary matte is dilated by one pixel toward the inside of the boundary region.

Figure 8 shows the results of applying the stereo reconstruction and two-layer matting process to a complete image frame. Notice how only a small amount of information needs to be transmitted to account for the soft object boundaries, and how the boundary opacities and boundary/main layer colors are cleanly recovered.

6 Compression

Compression is used to reduce our large data-sets to a manageable size and to support fast playback from disk. We developed our own codec that exploits temporal and between-camera (spatial) redundancy. Temporal prediction uses motion compensated estimates from the preceding frame, while spatial prediction uses a reference camera’s texture and disparity maps transformed into the viewpoint of a spatially adjacent camera. We code the differences between predicted and actual images using a novel transform-based compression scheme that can simultaneously handle texture, disparity and alpha-matte data. Similar techniques have previously been employed for encoding light fields [Chang et al. 2003]; however, our emphasis is on high-speed decoding performance.

Our codec compresses two kinds of information: RGBD data for the main plane (where D is disparity) and RGBAD alpha-matted data for the boundary edge strips. For the former, we use both non-predicted (I) and predicted (P) compression, while for the latter, we use only I-frames because the thin strips compress extremely well.
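Returning briefly to the boundary strips of Section 5, whose contents the codec stores as RGBAD data, the Python sketch below shows one way the strip could be located in a disparity map: mark every pixel adjacent to a disparity jump greater than $\lambda = 4$, then grow the marked set by four pixels. The function names are hypothetical. Comparing only 4-connected neighbors, marking both sides of a jump, and growing with a square neighborhood are assumptions made for this example; the paper does not specify them.

```python
LAMBDA = 4        # disparity jump threshold from Section 5
STRIP_RADIUS = 4  # matting neighborhood, in pixels, from Section 5


def discontinuity_mask(disp):
    """Mark pixels on either side of a disparity jump greater than LAMBDA.

    disp is a 2D list of disparities. Assumption: only horizontal and
    vertical (4-connected) neighbors are compared.
    """
    h, w = len(disp), len(disp[0])
    mask = [[False] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            for dy, dx in ((0, 1), (1, 0)):
                ny, nx = y + dy, x + dx
                if ny < h and nx < w and abs(disp[y][x] - disp[ny][nx]) > LAMBDA:
                    mask[y][x] = True
                    mask[ny][nx] = True
    return mask


def boundary_strip(disp, radius=STRIP_RADIUS):
    """Grow the discontinuity mask by `radius` pixels in every direction.

    Assumption: a square (Chebyshev-distance) neighborhood is used.
    """
    edges = discontinuity_mask(disp)
    h, w = len(disp), len(disp[0])
    strip = [[False] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            if not edges[y][x]:
                continue
            for ny in range(max(0, y - radius), min(h, y + radius + 1)):
                for nx in range(max(0, x - radius), min(w, x + radius + 1)):
                    strip[ny][nx] = True
    return strip
```

For a single row whose disparity steps from 0 to 10 between columns 9 and 10, the strip covers columns 5 through 14. Matting would then be run only inside this strip, which is why the boundary layer stays small relative to the main layer.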
Figure 8: Sample results from matting stage: (a) main color estimates; (b) main depth estimates; (c) boundary color estimates; (d) boundary depth estimates; (e) boundary alpha (opacity) estimates. For ease of printing, the boundary images are negated, so that transparent/empty pixels show up as white.

Figure 9(a) illustrates how the main plane is coded and demonstrates our hybrid temporal and spatial prediction scheme. Of the eight camera views, we select two reference cameras and initially compress the texture and disparity data using I-frames. On subsequent frames, we use motion compensation and code the error signal using a transform-based codec to obtain frames $P_t$. The remaining camera views, $P_s$, are compressed using spatial prediction from nearby reference views. We chose this scheme because it minimizes the amount of information that must be decoded when we selectively decompress data from adjacent camera pairs in order to synthesize our novel views. At most, two temporal and two spatial decoding steps are required to move forward in time.

Figure 9: Compression figures: (a) Spatial and temporal prediction scheme; (b) PSNR compression performance curves. [Diagram labels in (a): camera views versus time (T = 0, T = 1), frame types I, $P_t$, $P_s$, inter-view prediction, temporal prediction. Plot axes in (b): PSNR (dB) versus compression ratio, with curves for prediction from Cameras 0, 1, 2, 4, 5, 6, and 7 and for no prediction.]

To carry out spatial prediction, we use the disparity data from each reference view to transform both the texture and disparity data into the viewpoint of the nearby camera, resulting in an approximation to that camera’s data, which we then correct by sending compressed difference information. During this process, the de-occlusion holes created during camera view transformation are treated separately and the missing texture is coded without prediction using an alpha-mask. This gives extremely clean results that could not be obtained with a conventional block-based P-frame codec.

To code I-frame data, we use an MPEG-like scheme with DC prediction that makes use of a fast 16-bit integer approximation to the discrete cosine transform (DCT). RGB data is converted to the YUV color-space and D is coded similarly to Y. For P-frames, we use a similar technique but with different code tables and no DC prediction. For I-frame coding with alpha data, we use a quad-tree plus Huffman coding method to first indicate which pixels have non-zero alpha values. Subsequently, we only code YUV or D texture DCT coefficients for those 8 × 8 blocks that are non-transparent.

Figure 9(b) shows graphs of signal-to-noise ratio (PSNR) versus compression factor for the RGB texture component of Camera 3 coded using our I-frame codec and using between-camera spatial prediction from the other seven cameras. Spatial prediction results in a higher coding efficiency (higher PSNR), especially for prediction from nearby cameras.

To approach real-time interactivity, the overall decoding scheme is highly optimized for speed and makes use of Intel streaming media extensions. Our 512 × 384 RGBD I-frame currently takes 9 ms to decode. We are working on using the GPU for inter-camera prediction.

7 Real-time rendering

In order to interactively manipulate the viewpoint, we have ported our software renderer to the GPU. Because of recent advances in the programmability of GPUs, we are able to render directly from the output of the decompressor without using the CPU for any additional processing. The output of the decompressor consists of 5 planes of data for each view: the main color, main depth, boundary alpha matte, boundary color, and boundary depth. Rendering and compositing this data proceeds as follows.

First, given a novel view, we pick the nearest two cameras in the data set, say cameras $i$ and $i+1$. Next, for each camera, we project the main data $M_i$ and boundary data $B_i$ into the virtual view. The results are stored in separate buffers each containing color, opacity and depth. These are then blended to generate the final frame. A block diagram of this process is shown in Figure 10. We describe each of these steps in more detail below.

Figure 10: Rendering system: the main and boundary images from each camera are rendered and composited before blending. [Diagram: for Camera $i$ and Camera $i+1$, $M$ feeds "Render main layer" and $B$ feeds "Render boundary layer"; the results feed "Blend".]

The main layer consists of color and depth at every pixel. We convert the depth map to a 3D mesh using a simple vertex shader program. The shader takes two input streams: the X-Y positions in the depth map and the depth values. To reduce the amount of memory required, the X-Y positions are only stored for a 256 × 192 block. The shader is then repeatedly called with different offsets to generate the required 3D mesh and texture coordinates.
3D mesh and texture coordinates. The color image is applied as a texture map to this mesh.

The main layer rendering step contains most of the data, so it is desirable to only create the data structures described above once. However, we should not draw triangles across depth discontinuities. Since it is difficult to kill triangles already in the pipeline on current GPU architectures, we erase these triangles in a separate pass. The discontinuities are easy to find since they are always near the inside edge of the boundary region (Figure 3). A small mesh is created to erase these, and an associated pixel shader is used to set their color to a zero-alpha main color and their depth to the maximum scene depth.

Next, the boundary regions are rendered. The boundary data is fairly sparse since only vertices with non-zero alpha values are rendered. Typically, the boundary layer contains about 1/64 the amount of data as the main layer. Since the boundary only needs to be rendered where the matte is non-zero, the same CPU pass used to generate the erase mesh is used to generate a boundary mesh. The position and color of each pixel are stored with the vertex. Note that, as shown in Figure 3, the boundary and main meshes share vertices at their boundaries in order to avoid cracks and aliasing artifacts.

Once all layers have been rendered into separate color and depth buffers, a custom pixel shader is used to blend these results. (During the initial pass, we store the depth values in separate buffers, since pixel shaders do not currently have access to the hardware z-buffer.) The blending shader is given a weight for each camera based on the camera's distance from the novel virtual view [Debevec et al. 1996]. For each pixel in the novel view, all overlapping fragments from the projected layers are composited from front to back, and the shader performs a soft Z compare in order to compensate for noise in the depth estimates and reprojection errors. Pixels that are sufficiently close together are blended using the view-dependent weights. When pixels differ in depth, the frontmost pixel is used. Finally, the blended pixel value is normalized by its alpha value. This normalization is important since some pixels might only be visible or partially visible in one camera's view.

Figure 11 shows four intermediate images generated during the rendering process. You can see how the depth discontinuities are correctly erased, how the soft alpha-matted boundary elements are rendered, and how the final view-dependent blend produces high-quality results.

Figure 11: Sample results from rendering stage: (a) rendered main layer from one view, with depth discontinuities erased; (b) rendered boundary layer; (c) rendered main layer from the other view; (d) final blended result.

Our rendering algorithms are implemented on an ATI 9800 PRO. We currently render 1024 × 768 images at 5 fps and 512 × 384 images at 10 fps from disk or 20 fps from main memory. The current rendering bottleneck is disk bandwidth, which should improve once the decompression algorithm is fully integrated into our rendering pipeline. (Our timings show that we can render at full resolution at 30 fps if the required images are all loaded into the GPU's memory.)

8 Results

We have tested our system on a number of captured sequences. Three of these sequences were captured over a two-evening period using a different camera configuration for each and are shown on the accompanying video.

The first sequence used the cameras arranged in a horizontal arc, as shown in Figure 2, and was used to film the break-dancers shown in Figures 1, 6, 8, and 12. The second sequence was shot with the same dancers, but this time with the cameras arranged on a vertical arc. Two input frames from this sequence along with an interpolated view are shown in Figure 12(a–c). The third sequence was shot the following night at a ballet studio, with the cameras arranged on an arc with a slight upward sweep (Figure 12(d–f)).

Looking at these thumbnails, it is hard to get a true sense of the quality of our interpolated views. A much better sense can be obtained by viewing our accompanying video. In general, we believe that the quality of the results significantly exceeds the quality demonstrated by previous view interpolation and image-based modeling systems.

Object insertion example. In addition to creating virtual fly-throughs and other space-time manipulation effects, we can also use our system to perform object insertion. Figure 1(d) shows a frame from our "doubled" video in which we inserted an extra copy of a break-dancer into the video. This effect was achieved by first "pulling" a matte of the dancer using a depth threshold and then inserting the pulled sprite into the original video using z-buffering. The effect is reminiscent of the Agent Smith fight scene in The Matrix Reloaded and the multiplied actors in Michel Gondry's Come Into My World music video. However, unlike the computer generated imagery used in the Matrix or the 15 days of painful post-production matting in Gondry's video, our effect was achieved totally automatically from real-world data.

Figure 13: Interpolation results at different baselines: (a) current baseline, (b) baseline doubled, (c) baseline tripled. The insets show the subtle differences in quality.

Effect of baseline. We have also looked into the effect of increasing the baseline between successive pairs of cameras. In our current camera configuration, the end-to-end coverage is about 30°. However, the maximum disparity between neighboring pairs of cameras can be as large as 100 pixels. Our algorithm can tolerate up to
about 150–200 pixels of disparity before hole artifacts due to missing background occur. Our algorithm is generally robust, and as can be seen in Figure 13, we can triple the baseline with only a small loss of visual quality.

Figure 12: More video view interpolation results: (a,c) input images from vertical arc and (b) interpolated view; (d,f) input images from ballet studio and (e) interpolated view.

9 Discussion and conclusions

Compared with previous systems for video-based rendering of dynamic scenes, which either use no 3D reconstruction or only global 3D models, our view-based approach provides much better visual quality for the same number of cameras. This makes it more practical to set up and acquire, since fewer cameras are needed. Furthermore, the techniques developed in this paper can be applied to any dynamic light field capture and rendering system. Being able to interactively view and manipulate such videos on a commodity PC opens up all kinds of possibilities for novel 3D dynamic content.

While we are pleased with the quality of our interpolated viewpoint videos, there is still much that we can do to improve the quality of the reconstructions. Like most stereo algorithms, our algorithm has problems with specular surfaces or strong reflections. In a separate work [Tsin et al. 2003], our group has worked on this problem with some success. This may be integrated into our system in the future. Note that using more views and doing view interpolation can help model such effects, unlike the single texture-mapped 3D model used in some other systems.

While we can handle motion blur through the use of the matte (soft alpha values in our boundary layer), we minimize it by using a fast shutter speed and increasing the lighting.

At the moment, we process each frame (time instant) of video separately. We believe that even better results could be obtained by incorporating temporal coherence, either in video segmentation (e.g., [Patras et al. 2001]) or directly in stereo as cited in Section 1.2.

During the matting phase, we process each camera independently from the others. We believe that better results could be obtained by merging data from adjacent views when trying to estimate the semi-occluded background. This would allow us to use the multi-image matting of [Wexler et al. 2002] to get even better estimates of foreground and background colors and opacities, but only if the depth estimates in the semi-occluded regions are accurate.

Virtual viewpoint video allows users to experience video as an interactive 3D medium. It can also be used to produce a variety of special effects such as space-time manipulation and virtual object insertion. The techniques presented in this paper bring us one step closer to making image-based (and video-based) rendering an integral component of future media authoring and delivery.

References

Baker, S., Szeliski, R., and Anandan, P. 1998. A layered approach to stereo reconstruction. In Conference on Computer Vision and Pattern Recognition (CVPR), 434–441.

Buehler, C., Bosse, M., McMillan, L., Gortler, S. J., and Cohen, M. F. 2001. Unstructured lumigraph rendering. Proceedings of SIGGRAPH 2001, 425–432.

Carceroni, R. L., and Kutulakos, K. N. 2001. Multi-view scene capture by surfel sampling: From video streams to non-rigid 3D motion, shape and reflectance. In International Conference on Computer Vision (ICCV), vol. II, 60–67.

Carranza, J., Theobalt, C., Magnor, M. A., and Seidel, H.-P. 2003. Free-viewpoint video of human actors. ACM Transactions on Graphics 22, 3, 569–577.

Chang, C.-L., et al. 2003. Inter-view wavelet compression of light fields with disparity-compensated lifting. In Visual Communication and Image Processing (VCIP 2003), 14–22.

Chuang, Y.-Y., et al. 2001. A Bayesian approach to digital matting. In Conference on Computer Vision and Pattern Recognition (CVPR), vol. II, 264–271.

Chuang, Y.-Y., et al. 2002. Video matting of complex scenes. ACM Transactions on Graphics 21, 3, 243–248.

Debevec, P. E., Taylor, C. J., and Malik, J. 1996. Modeling and rendering architecture from photographs: A hybrid geometry- and image-based approach. Computer Graphics (SIGGRAPH'96), 11–20.

Debevec, P. E., Yu, Y., and Borshukov, G. D. 1998. Efficient view-dependent image-based rendering with projective texture-mapping. Eurographics Rendering Workshop 1998, 105–116.

Fitzgibbon, A., Wexler, Y., and Zisserman, A. 2003. Image-based rendering using image-based priors. In International Conference on Computer Vision (ICCV), vol. 2, 1176–1183.

Goldlücke, B., Magnor, M., and Wilburn, B. 2002. Hardware-accelerated dynamic light field rendering. In Proceedings Vision, Modeling and Visualization VMV 2002, 455–462.

Gortler, S. J., Grzeszczuk, R., Szeliski, R., and Cohen, M. F. 1996. The Lumigraph. In Computer Graphics (SIGGRAPH'96) Proceedings, ACM SIGGRAPH, 43–54.

Gross, M., et al. 2003. blue-c: A spatially immersive display and 3D video portal for telepresence. Proceedings of SIGGRAPH 2003 (ACM Transactions on Graphics), 819–827.

Hall-Holt, O., and Rusinkiewicz, S. 2001. Stripe boundary codes for real-time structured-light range scanning of moving objects. In International Conference on Computer Vision (ICCV), vol. II, 359–366.

Heigl, B., et al. 1999. Plenoptic modeling and rendering from image sequences taken by hand-held camera. In DAGM'99, 94–101.

Kanade, T., Rander, P. W., and Narayanan, P. J. 1997. Virtualized reality: constructing virtual worlds from real scenes. IEEE MultiMedia Magazine, 1(1):34–47.

Levoy, M., and Hanrahan, P. 1996. Light field rendering. In Computer Graphics (SIGGRAPH'96) Proceedings, ACM SIGGRAPH, 31–42.

Matusik, W., et al. 2000. Image-based visual hulls. Proceedings of SIGGRAPH 2000, 369–374.

Patras, I., Hendriks, E., and Lagendijk, R. 2001. Video segmentation by MAP labeling of watershed segments. IEEE Transactions on Pattern Analysis and Machine Intelligence 23, 3, 326–332.

Perona, P., and Malik, J. 1990. Scale-space and edge detection using anisotropic diffusion. IEEE Transactions on Pattern Analysis and
Machine Intelligence 12, 7, 629–639.

Pulli, K., et al. 1997. View-based rendering: Visualizing real objects from scanned range and color data. In Proceedings of the 8th Eurographics Workshop on Rendering, 23–34.

Scharstein, D., and Szeliski, R. 2002. A taxonomy and evaluation of dense two-frame stereo correspondence algorithms. International Journal of Computer Vision 47, 1, 7–42.

Schirmacher, H., Ming, L., and Seidel, H.-P. 2001. On-the-fly processing of generalized Lumigraphs. In Proceedings of Eurographics, Computer Graphics Forum 20, 3, 165–173.

Seitz, S. M., and Dyer, C. M. 1997. Photorealistic scene reconstruction by voxel coloring. In Conference on Computer Vision and Pattern Recognition (CVPR), 1067–1073.

Shade, J., Gortler, S., He, L.-W., and Szeliski, R. 1998. Layered depth images. In Computer Graphics (SIGGRAPH'98) Proceedings, ACM SIGGRAPH, 231–242.

Smolić, A., and Kimata, H. 2003. AHG on 3DAV Coding. ISO/IEC JTC1/SC29/WG11 MPEG03/M9635.

Szeliski, R., and Golland, P. 1999. Stereo matching with transparency and matting. International Journal of Computer Vision 32, 1, 45–61.

Tao, H., Sawhney, H., and Kumar, R. 2001. A global matching framework for stereo computation. In International Conference on Computer Vision (ICCV), vol. I, 532–539.

Tsin, Y., Kang, S. B., and Szeliski, R. 2003. Stereo matching with reflections and translucency. In Conference on Computer Vision and Pattern Recognition (CVPR), vol. I, 702–709.

Vedula, S., Baker, S., Seitz, S., and Kanade, T. 2000. Shape and motion carving in 6D. In Conference on Computer Vision and Pattern Recognition (CVPR), vol. II, 592–598.

Wang, J. Y. A., and Adelson, E. H. 1993. Layered representation for motion analysis. In Conference on Computer Vision and Pattern Recognition (CVPR), 361–366.

Wexler, Y., Fitzgibbon, A., and Zisserman, A. 2002. Bayesian estimation of layers from multiple images. In Seventh European Conference on Computer Vision (ECCV), vol. III, 487–501.

Wilburn, B., Smulski, M., Lee, H. H. K., and Horowitz, M. 2002. The light field video camera. In SPIE Electronic Imaging: Media Processors, vol. 4674, 29–36.

Yang, J. C., Everett, M., Buehler, C., and McMillan, L. 2002. A real-time distributed light field camera. In Eurographics Workshop on Rendering, 77–85.

Yang, R., Welch, G., and Bishop, G. 2002. Real-time consensus-based scene reconstruction using commodity graphics hardware. In Proceedings of Pacific Graphics, 225–234.

Zhang, Y., and Kambhamettu, C. 2001. On 3D scene flow and structure estimation. In Conference on Computer Vision and Pattern Recognition (CVPR), vol. II, 778–785.

Zhang, L., Curless, B., and Seitz, S. M. 2003. Spacetime stereo: Shape recovery for dynamic scenes. In Conference on Computer Vision and Pattern Recognition, 367–374.

Zhang, Z. 2000. A flexible new technique for camera calibration. IEEE Transactions on Pattern Analysis and Machine Intelligence 22, 11, 1330–1334.

A Smoothness and consistency

Here we present the details of our smoothness and consistency constraints.

Smoothness Constraint. When creating our initial segments, we use the heuristic that neighboring pixels with similar colors should have similar disparities. We use the same heuristic across segments to refine the DSD. Let $S_{ij}$ denote the neighbors of segment $s_{ij}$, and $\hat{d}_{il}$ be the maximum disparity estimate for segment $s_{il} \in S_{ij}$. We assume that the disparity of segment $s_{ij}$ lies within a vicinity of $\hat{d}_{il}$ modeled by a contaminated normal distribution with mean $\hat{d}_{il}$:

$$l_{ij}(d) = \prod_{s_{il} \in S_{ij}} N(d; \hat{d}_{il}, \sigma_l^2) + \epsilon, \qquad (5)$$

where $N(d; \mu, \sigma^2) = (2\pi\sigma^2)^{-1/2} e^{-(d-\mu)^2 / 2\sigma^2}$ is the usual normal distribution and $\epsilon = 0.01$. We estimate the variance $\sigma_l^2$ for each neighboring segment $s_{il}$ using three values: the similarity in color of the segments, the length of the border between the segments, and $p_{il}(\hat{d}_{il})$. Let $\Delta_{jl}$ be the difference between the average colors of segments $s_{ij}$ and $s_{il}$, and $b_{jl}$ be the percentage of $s_{ij}$'s border that $s_{il}$ occupies. We set $\sigma_l^2$ to

$$\sigma_l^2 = \frac{\upsilon}{p_{il}(\hat{d}_{il})^2 \, b_{jl} \, N(\Delta_{jl}; 0, \sigma_\Delta^2)}, \qquad (6)$$

where $\upsilon = 8$ and $\sigma_\Delta^2 = 30$ in our experiments.

Consistency Constraint. The consistency constraint ensures that different images' disparity maps agree, i.e., if we project a pixel with disparity $d$ from one image into another, its projection should also have disparity $d$. When computing the value of $c_{ijk}(d)$ to enforce consistency, we apply several constraints. First, a segment's DSD should be similar to the DSD of the segments it projects to in the other images. Second, while we want the segments' DSD to agree between images, they must also be consistent with the matching function $m_{ijk}(d)$. Third, some segments may have no corresponding segments in the other image due to occlusions.

For each disparity $d$ and segment $s_{ij}$ we compute its projected DSD, $p^t_{ijk}(d)$, with respect to image $I_k$. If $\pi(k, x)$ is the segment in image $I_k$ that pixel $x$ projects to and $C_{ij}$ is the number of pixels in $s_{ij}$,

$$p^t_{ijk}(d) = \frac{1}{C_{ij}} \sum_{x \in s_{ij}} p^t_{\pi(k,x)}(d). \qquad (7)$$

We also need an estimate of the likelihood that segment $s_{ij}$ is occluded in image $k$. Since the projected DSD $p^t_{ijk}(d)$ is low if there is little evidence for a match, the visibility likelihood can be estimated as

$$v_{ijk} = \min\Big(1.0, \sum_{d'} p^t_{ijk}(d')\Big). \qquad (8)$$

Along with the projected DSD, we compute an occlusion function $o_{ijk}(d)$, which has a value of 0 if segment $s_{ij}$ occludes another segment in image $I_k$ and 1 if it does not. This ensures that even if $s_{ij}$ is not visible in image $I_k$, its estimated depth does not lie in front of a surface element in the $k$th image's estimates of depth. More specifically, we define $o_{ijk}(d)$ as

$$o_{ijk}(d) = 1.0 - \frac{1}{C_{ij}} \sum_{x \in s_{ij}} p^t_{\pi(k,x)}(\hat{d}_{kl}) \, h(d - \hat{d}_{kl} + \lambda), \qquad (9)$$

where $h(x) = 1$ if $x \geq 0$ (and 0 otherwise) is the Heaviside step function and $\lambda$ is a constant used to determine if two surfaces are the same. For our experiments, we set $\lambda$ to 4 disparity levels.

Finally, we combine the occluded and non-occluded cases. If the segment is not occluded, we compute $c_{ijk}(d)$ directly from the projected DSD and the match function, $p^t_{ijk}(d) \, m_{ijk}(d)$. For occluded regions we only use the occlusion function $o_{ijk}(d)$. Our final function for $c_{ijk}(d)$ is therefore

$$c_{ijk}(d) = v_{ijk} \, p^t_{ijk}(d) \, m_{ijk}(d) + (1.0 - v_{ijk}) \, o_{ijk}(d). \qquad (10)$$
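As a rough illustration of how the projected DSD, the visibility likelihood, and the final consistency function fit together, the following Python sketch evaluates them for a single segment. It is only a sketch under stated assumptions: the function names, the list-based DSD layout, and the per-disparity projection table `proj[d][x]` are hypothetical choices made for this example, and the match function and occlusion function are taken as precomputed inputs rather than derived from image data.

```python
def projected_dsd(dsd_k, proj):
    """Projected DSD: average the DSD values of the image-k segments
    that the pixels of the current segment land on.

    dsd_k maps a segment id in image k to its DSD, a list indexed by
    disparity level. proj[d][x] is the id of the segment hit by pixel x
    when projected with disparity d (hypothetical layout, one row per
    disparity level, one entry per pixel of the segment).
    """
    return [
        sum(dsd_k[seg][d] for seg in row) / len(row)
        for d, row in enumerate(proj)
    ]


def visibility(p_proj):
    """Visibility likelihood: total projected DSD mass, clamped to 1.0."""
    return min(1.0, sum(p_proj))


def consistency(p_proj, match, occlusion):
    """Consistency function: weight the visible case (projected DSD times
    match function) by v and the occluded case (occlusion function) by 1 - v."""
    v = visibility(p_proj)
    return [
        v * p * m + (1.0 - v) * o
        for p, m, o in zip(p_proj, match, occlusion)
    ]


# Toy example (made-up numbers): two disparity levels, a two-pixel segment.
dsd_k = {"a": [0.2, 0.8], "b": [0.6, 0.4]}
proj = [["a", "b"], ["a", "a"]]
p_proj = projected_dsd(dsd_k, proj)
c = consistency(p_proj, match=[1.0, 0.5], occlusion=[1.0, 0.0])
```

In this toy case the projected DSD sums to more than one, so the visibility likelihood saturates and the occlusion term drops out. When the projected DSD carries little mass, the occlusion term dominates instead, which matches the intent of treating weakly supported segments as probably occluded.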