Noise2Noise: Learning Image Restoration without Clean Data

Jaakko Lehtinen¹ ², Jacob Munkberg¹, Jon Hasselgren¹, Samuli Laine¹, Tero Karras¹, Miika Aittala³, Timo Aila¹

¹NVIDIA ²Aalto University ³MIT CSAIL. Correspondence to: Jaakko Lehtinen <[email protected]>.

Proceedings of the 35th International Conference on Machine Learning, Stockholm, Sweden, PMLR 80, 2018. Copyright 2018 by the author(s).

arXiv:1803.04189v3 [cs.CV] 29 Oct 2018

Abstract

We apply basic statistical reasoning to signal reconstruction by machine learning (learning to map corrupted observations to clean signals) with a simple and powerful conclusion: it is possible to learn to restore images by only looking at corrupted examples, at performance matching and sometimes exceeding that of training with clean data, without explicit image priors or likelihood models of the corruption. In practice, we show that a single model learns photographic noise removal, denoising synthetic Monte Carlo images, and reconstruction of undersampled MRI scans, all corrupted by different processes, based on noisy data only.

1. Introduction

Signal reconstruction from corrupted or incomplete measurements is an important subfield of statistical data analysis. Recent advances in deep neural networks have sparked significant interest in avoiding the traditional, explicit a priori statistical modeling of signal corruptions, and instead learning to map corrupted observations to the unobserved clean versions. This happens by training a regression model, e.g., a convolutional neural network (CNN), with a large number of pairs $(\hat{x}_i, y_i)$ of corrupted inputs $\hat{x}_i$ and clean targets $y_i$ and minimizing the empirical risk

$$\operatorname*{argmin}_\theta \sum_i L(f_\theta(\hat{x}_i), y_i), \tag{1}$$

where $f_\theta$ is a parametric family of mappings (e.g., CNNs), under the loss function $L$. We use the notation $\hat{x}$ to underline the fact that the corrupted input $\hat{x} \sim p(\hat{x} \mid y_i)$ is a random variable distributed according to the clean target. Training data may include, for example, pairs of short and long exposure photographs of the same scene, incomplete and complete k-space samplings of magnetic resonance images, fast-but-noisy and slow-but-converged ray-traced renderings of a synthetic scene, etc. Significant advances have been reported in several applications, including Gaussian denoising, de-JPEG, text removal (Mao et al., 2016), super-resolution (Ledig et al., 2017), colorization (Zhang et al., 2016), and image inpainting (Iizuka et al., 2017). Yet, obtaining clean training targets is often difficult or tedious: a noise-free photograph requires a long exposure; full MRI sampling precludes dynamic subjects; etc.

In this work, we observe that we can often learn to turn bad images into good images by only looking at bad images, and do this just as well as if we were using clean examples, sometimes even better. Further, we require neither an explicit statistical likelihood model of the corruption nor an image prior, and instead learn these indirectly from the training data. (Indeed, in one of our examples, synthetic Monte Carlo renderings, the non-stationary noise cannot be characterized analytically.) In addition to denoising, our observation is directly applicable to inverse problems such as MRI reconstruction from undersampled data. While our conclusion is almost trivial from a statistical perspective, it significantly eases practical learned signal reconstruction by lifting requirements on availability of training data.

The reference TensorFlow implementation for Noise2Noise training is available on GitHub (https://github.com/NVlabs/noise2noise).

2. Theoretical Background

Assume that we have a set of unreliable measurements $(y_1, y_2, \ldots)$ of the room temperature. A common strategy for estimating the true unknown temperature is to find a number $z$ that has the smallest average deviation from the measurements according to some loss function $L$:

$$\operatorname*{argmin}_z \mathbb{E}_y\{L(z, y)\}. \tag{2}$$

For the $L_2$ loss $L(z, y) = (z - y)^2$, this minimum is found at the arithmetic mean of the observations:

$$z = \mathbb{E}_y\{y\}. \tag{3}$$

The $L_1$ loss, the sum of absolute deviations $L(z, y) = |z - y|$, in turn, has its optimum at the median of the observations.
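The two point estimates above can be checked numerically. The following is a minimal sketch, not taken from the paper: the temperature readings, the outlier value, the candidate grid, and the helper name `point_estimate` are all illustrative assumptions. It minimizes the average loss of Equation (2) by brute force and compares the result with the sample mean and median.

```python
import random
import statistics

def point_estimate(samples, loss, candidates):
    """Return the candidate z with the smallest average loss over the samples."""
    return min(candidates, key=lambda z: sum(loss(z, y) for y in samples) / len(samples))

random.seed(0)
# Hypothetical readings: true temperature 21.0, plus a few gross outliers at 40.0.
samples = [random.gauss(21.0, 0.5) for _ in range(200)] + [40.0] * 20
candidates = [15.0 + 0.01 * k for k in range(3001)]  # grid over [15, 45], step 0.01

z_l2 = point_estimate(samples, lambda z, y: (z - y) ** 2, candidates)  # L2 loss
z_l1 = point_estimate(samples, lambda z, y: abs(z - y), candidates)    # L1 loss

print(f"L2 estimate {z_l2:.2f} vs. mean   {statistics.mean(samples):.2f}")
print(f"L1 estimate {z_l1:.2f} vs. median {statistics.median(samples):.2f}")
```

In this toy setting the L2 estimate is pulled towards the outliers, like the mean, while the L1 estimate stays near the bulk of the readings, like the median.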

The general class of deviation-minimizing estimators is known as M-estimators (Huber, 1964). From a statistical viewpoint, summary estimation using these common loss functions can be seen as ML estimation by interpreting the loss function as the negative log likelihood.

Training neural network regressors is a generalization of this point estimation procedure. Observe the form of the typical training task for a set of input-target pairs $(x_i, y_i)$, where the network function $f_\theta(x)$ is parameterized by $\theta$:

$$\operatorname*{argmin}_\theta \mathbb{E}_{(x,y)}\{L(f_\theta(x), y)\}. \tag{4}$$

Indeed, if we remove the dependency on input data, and use a trivial $f_\theta$ that merely outputs a learned scalar, the task reduces to (2). Conversely, the full training task decomposes to the same minimization problem at every training sample; simple manipulations show that (4) is equivalent to

$$\operatorname*{argmin}_\theta \mathbb{E}_x\{\mathbb{E}_{y|x}\{L(f_\theta(x), y)\}\}. \tag{5}$$

The network can, in theory, minimize this loss by solving the point estimation problem separately for each input sample. Hence, the properties of the underlying loss are inherited by neural network training.

The usual process of training regressors by Equation 1 over a finite number of input-target pairs $(x_i, y_i)$ hides a subtle point: instead of the 1:1 mapping between inputs and targets (falsely) implied by that process, in reality the mapping is multiple-valued. For example, in a superresolution task (Ledig et al., 2017) over all natural images, a low-resolution image $x$ can be explained by many different high-resolution images $y$, as knowledge about the exact positions and orientations of the edges and texture is lost in decimation. In other words, $p(y \mid x)$ is the highly complex distribution of natural images consistent with the low-resolution $x$. When a neural network regressor is trained on pairs of low- and high-resolution images using the $L_2$ loss, the network learns to output the average of all plausible explanations (e.g., edges shifted by different amounts), which results in spatial blurriness for the network's predictions. A significant amount of work has been done to combat this well-known tendency, for example by using learned discriminator functions as losses (Ledig et al., 2017; Isola et al., 2017).

Our observation is that for certain problems this tendency has an unexpected benefit. A trivial, and, at first sight, useless, property of $L_2$ minimization is that on expectation, the estimate remains unchanged if we replace the targets with random numbers whose expectations match the targets. This is easy to see: Equation (3) holds, no matter what particular distribution the $y$s are drawn from. Consequently, the optimal network parameters $\theta$ of Equation (5) also remain unchanged, if input-conditioned target distributions $p(y \mid x)$ are replaced with arbitrary distributions that have the same conditional expected values. This implies that we can, in principle, corrupt the training targets of a neural network with zero-mean noise without changing what the network learns. Combining this with the corrupted inputs from Equation 1, we are left with the empirical risk minimization task

$$\operatorname*{argmin}_\theta \sum_i L(f_\theta(\hat{x}_i), \hat{y}_i), \tag{6}$$

where both the inputs and the targets are now drawn from a corrupted distribution (not necessarily the same), conditioned on the underlying, unobserved clean target $y_i$ such that $\mathbb{E}\{\hat{y}_i \mid \hat{x}_i\} = y_i$. Given infinite data, the solution is the same as that of (1). For finite data, the variance is the average variance of the corruptions in the targets, divided by the number of training samples (see appendix). Interestingly, none of the above relies on a likelihood model of the corruption, nor a density model (prior) for the underlying clean image manifold. That is, we do not need an explicit $p(\text{noisy} \mid \text{clean})$ or $p(\text{clean})$, as long as we have data distributed according to them.

In many image restoration tasks, the expectation of the corrupted input data is the clean target that we seek to restore. Low-light photography is an example: a long, noise-free exposure is the average of short, independent, noisy exposures. With this in mind, the above suggests the ability to learn to remove photon noise given only pairs of noisy images, with no need for potentially expensive or difficult long exposures. Similar observations can be made about other loss functions. For instance, the $L_1$ loss recovers the median of the targets, meaning that neural networks can be trained to repair images with significant (up to 50%) outlier content, again only requiring access to pairs of such corrupted images.

In the next sections, we present a wide variety of examples demonstrating that these theoretical capabilities are also efficiently realizable in practice.

3. Practical Experiments

We now experimentally study the practical properties of noisy-target training. We start with simple noise distributions (Gaussian, Poisson, Bernoulli) in Sections 3.1 and 3.2, and continue to the much harder, analytically intractable Monte Carlo image synthesis noise (Section 3.3). In Section 3.4, we show that image reconstruction from sub-Nyquist spectral samplings in magnetic resonance imaging (MRI) can be learned from corrupted observations only.

3.1. Additive Gaussian Noise

We will first study the effect of corrupted targets using synthetic additive Gaussian noise. As the noise has zero mean, we use the $L_2$ loss for training to recover the mean.

Our baseline is a recent state-of-the-art method "RED30" (Mao et al., 2016), a 30-layer hierarchical residual network with 128 feature maps, which has been demonstrated to be very effective in a wide range of image restoration tasks, including Gaussian noise. We train the network using $256 \times 256$-pixel crops drawn from the 50k images in the ImageNet validation set. We furthermore randomize the noise standard deviation $\sigma \in [0, 50]$ separately for each training example, i.e., the network has to estimate the magnitude of noise while removing it ("blind" denoising).

We use three well-known datasets: BSD300 (Martin et al., 2001), Set14 (Zeyde et al., 2010), and Kodak (http://r0k.us/graphics/kodak/). As summarized in Table 1, the behavior is qualitatively similar in all three sets, and thus we discuss the averages. When trained in the standard way with clean targets (Equation 1), RED30 achieves $31.63 \pm 0.02$ dB with $\sigma = 25$. The confidence interval was computed by sampling five random initializations. The widely used benchmark denoiser BM3D (Dabov et al., 2007) gives ∼0.7 dB worse results. When we modify the training to use noisy targets (Equation 6) instead, the denoising performance remains equally good. Furthermore, the training converges just as quickly, as shown in Figure 1a. This leads us to conclude that clean targets are unnecessary in this application. This perhaps surprising observation also holds with different networks and network capacities. Figure 2a shows an example result.

[Figure 1: three plots of PSNR (dB) against training epoch. (a) White Gaussian, $\sigma = 25$: clean targets vs. noisy targets. (b) Brown Gaussian, $\sigma = 25$: blur bandwidths of 2, 5, 10, 20, and 40 pix. (c) Capture budget study: Case 1 (trad.), Case 2, Case 3 (N2N).]

Figure 1. Denoising performance (dB in Kodak dataset) as a function of training epoch for additive Gaussian noise. (a) For i.i.d. (white) Gaussian noise, clean and noisy targets lead to very similar convergence speed and eventual quality. (b) For brown Gaussian noise, we observe that increased inter-pixel noise correlation (wider spatial blur; one graph per bandwidth) slows convergence down, but eventual performance remains close. (c) Effect of different allocations of a fixed capture budget to noisy vs. clean examples (see text).

For all further tests, we switch from RED30 to a shallower U-Net (Ronneberger et al., 2015) that is roughly 10× faster to train and gives similar results (−0.2 dB in Gaussian noise). The architecture and training parameters are described in the appendix.

Table 1. PSNR results (dB) from three test datasets Kodak, BSD300, and Set14 for Gaussian, Poisson, and Bernoulli noise. The comparison methods are BM3D, Inverse Anscombe transform (ANSC), and deep image prior (DIP).

           Gaussian (σ=25)          Poisson (λ=30)           Bernoulli (p=0.5)
           clean   noisy   BM3D     clean   noisy   ANSC     clean   noisy   DIP
Kodak      32.50   32.48   31.82    31.52   31.50   29.15    33.01   33.17   30.78
BSD300     31.07   31.06   30.34    30.18   30.16   27.56    31.04   31.16   28.97
Set14      31.31   31.28   30.50    30.07   30.06   28.36    31.51   31.72   30.67
Average    31.63   31.61   30.89    30.59   30.57   28.36    31.85   32.02   30.14

Convergence speed. Clearly, every training example asks for the impossible: there is no way the network could succeed in transforming one instance of the noise to another. Consequently, the training loss does not actually decrease during training, and the loss gradients continue to be quite large. Why do the larger, noisier gradients not affect convergence speed? While the activation gradients are indeed noisy, the weight gradients are in fact relatively clean because Gaussian noise is independent and identically distributed (i.i.d.) in all pixels, and the weight gradients get averaged over $2^{16}$ pixels in our fully convolutional network.

Figure 1b makes the situation harder by introducing inter-pixel correlation to the noise. This brown additive noise is obtained by blurring white Gaussian noise by a spatial Gaussian filter of different bandwidths and scaling to retain $\sigma = 25$. An example is shown in Figure 1b. As the correlation increases, the effective averaging of weight gradients decreases, and the weight updates become noisier. This makes the convergence slower, but even with extreme blur, the eventual quality is similar (within 0.1 dB).

Finite data and capture budget. The previous studies relied on the availability of infinitely many noisy examples produced by adding synthetic noise to clean images. We now study corrupted vs. clean training data in the realistic scenario of finite data and a fixed capture budget. Our experiment setup is as follows. Let one ImageNet image with white additive Gaussian noise at $\sigma = 25$ correspond to one "capture unit" (CU). Suppose that 19 CUs are enough for a clean capture, so that one noisy realization plus the clean version (the average of 19 noisy realizations) consumes 20 CU. Let us fix a total capture budget of, say, 2000 CUs. This budget can be allocated between clean latents
($N$) and noise realizations per clean latent ($M$) such that $N \cdot M = 2000$. In the traditional scenario, we have only 100 training pairs ($N = 100$, $M = 20$): a single noisy realization and the corresponding clean image (= average of 19 noisy images; Figure 1c, Case 1). We first observe that using the same captured data as $100 \cdot 20 \cdot 19 = 38000$ training pairs with corrupted targets (i.e., for each latent, forming all the $19 \cdot 20$ possible noisy/clean pairs) yields notably better results (several tenths of a dB) than the traditional, fixed noisy+clean pairs, even if we still only have $N = 100$ latents (Figure 1c, Case 2). Second, we observe that setting $N = 1000$ and $M = 2$, i.e., increasing the number of clean latents but only obtaining two noisy realizations of each (resulting in 2000 training pairs) yields even better results (again, by several tenths of a dB, Figure 1c, Case 3).

We conclude that for additive Gaussian noise, corrupted targets offer benefits over clean targets (not just the same performance but better) on two levels: both 1) seeing more realizations of the corruption for the same latent clean image, and 2) seeing more latent clean images, even if just two corrupted realizations of each, are beneficial.

3.2. Other Synthetic Noises

We will now experiment with other types of synthetic noise. The training setup is the same as described above.

Poisson noise is the dominant source of noise in photographs. While zero-mean, it is harder to remove because it is signal-dependent. We use the $L_2$ loss, and vary the noise magnitude $\lambda \in [0, 50]$ during training. Training with clean targets results in $30.59 \pm 0.02$ dB, while noisy targets give an equally good $30.57 \pm 0.02$ dB, again at similar convergence speed. A comparison method (Mäkitalo & Foi, 2011) that first transforms the input Poisson noise into Gaussian (Anscombe transform), then denoises by BM3D, and finally inverts the transform, yields 2 dB less.

Other effects, e.g., dark current and quantization, are dominated by Poisson noise, can be made zero-mean (Hasinoff et al., 2016), and hence pose no problems for training with noisy targets. We conclude that noise-free training data is unnecessary in this application. That said, saturation (gamut clipping) renders the expectation incorrect due to removing part of the distribution. As saturation is unwanted for other reasons too, this is not a significant limitation.

Multiplicative Bernoulli noise (aka binomial noise) constructs a random mask $m$ that is 1 for valid pixels and 0 for zeroed/missing pixels. To avoid backpropagating gradients from missing pixels, we exclude them from the loss:

$$\operatorname*{argmin}_\theta \sum_i \left(m \odot (f_\theta(\hat{x}_i) - \hat{y}_i)\right)^2, \tag{7}$$

as described by Ulyanov et al. (2017) in the context of their deep image prior (DIP).

The probability of corrupted pixels is denoted with $p$; in our training we vary $p \in [0.0, 0.95]$ and during testing $p = 0.5$. Training with clean targets gives an average of $31.85 \pm 0.03$ dB, noisy targets (separate $m$ for input and target) give a slightly higher $32.02 \pm 0.03$ dB, possibly because noisy targets effectively implement a form of dropout (Srivastava et al., 2014) at the network output. DIP was almost 2 dB worse. DIP is not a learning-based solution, and as such very different from our approach, but it shares the property that neither clean examples nor an explicit model of the corruption is needed. We used the "Image reconstruction" setup as described in the DIP supplemental material (https://dmitryulyanov.github.io/deep_image_prior).

[Figure 2: image grid with columns Ground truth, Input, Our, Comparison. Rows: (a) Gaussian ($\sigma = 25$), comparison BM3D; (b) Poisson ($\lambda = 30$), comparison Anscombe; (c) Bernoulli ($p = 0.5$), comparison Deep Image Prior.]

Figure 2. Example results for Gaussian, Poisson, and Bernoulli noise. Our result was computed by using noisy targets; the corresponding result with clean targets is omitted because it is virtually identical in all three cases, as discussed in the text. A different comparison method is used for each noise type.

Text removal. Figure 3 demonstrates blind text removal. The corruption consists of a large, varying number of random strings in random places, also on top of each other, and furthermore so that the font size and color are randomized as well. The font and string orientation remain fixed.

The network is trained using independently corrupted input
and target pairs. The probability of corrupted pixels $p$ is approximately $[0, 0.5]$ during training, and $p \approx 0.25$ during testing. In this test the mean ($L_2$ loss) is not the correct answer because the overlaid text has colors unrelated to the actual image, and the resulting image would incorrectly tend towards a linear combination of the right answer and the average text color (medium gray). However, with any reasonable amount of overlaid text, a pixel retains the original color more often than not, and therefore the median is the correct statistic. Hence, we use $L_1 = |f_\theta(\hat{x}) - \hat{y}|$ as the loss function. Figure 3 shows an example result.

[Figure 3: image row. Example training pairs ($p \approx 0.04$, $p \approx 0.42$); Input ($p \approx 0.25$), 17.12 dB; $L_2$, 26.89 dB; $L_1$, 35.75 dB; Clean targets, 35.82 dB; Ground truth.]

Figure 3. Removing random text overlays corresponds to seeking the median pixel color, accomplished using the $L_1$ loss. The mean ($L_2$ loss) is not the correct answer: note shift towards mean text color. Only corrupted images shown during training.

Random-valued impulse noise replaces some pixels with noise and retains the colors of others. Instead of the standard salt and pepper noise (randomly replacing pixels with black or white), we study a harder distribution where each pixel is replaced with a random color drawn from the uniform distribution $[0, 1]^3$ with probability $p$, and retains its color with probability $1 - p$. The pixels' color distributions are a Dirac at the original color plus a uniform distribution, with relative weights given by the replacement probability $p$. In this case, neither the mean nor the median yield the correct result; the desired output is the mode of the distribution (the Dirac spike). The distribution remains unimodal. For approximate mode seeking, we use an annealed version of the "$L_0$ loss" function defined as $(|f_\theta(\hat{x}) - \hat{y}| + \epsilon)^\gamma$, where $\epsilon = 10^{-8}$ and $\gamma$ is annealed linearly from 2 to 0 during training. This annealing did not cause any numerical issues in our tests. The relationship of the $L_0$ loss and mode seeking is analyzed in the appendix.

[Figure 4: image row. Example training pairs ($p = 0.22$, $p = 0.81$); Input ($p = 0.70$), 8.89 dB; $L_2$, 13.02 dB; $L_1$, 16.36 dB; $L_0$, 28.43 dB; Clean targets, 28.86 dB; Ground truth.]

Figure 4. For random impulse noise, the approx. mode-seeking $L_0$ loss performs better than the mean ($L_2$) or median ($L_1$) seeking losses.

[Figure 5: plot of PSNR delta from clean targets (0 to −10 dB) against corruption level (10% to 90%), one curve each for $L_0$ and $L_1$.]

Figure 5. PSNR of noisy-target training relative to clean targets with a varying percentage of target pixels corrupted by RGB impulse noise. In this test a separate network was trained for each corruption level, and the graph was averaged over the Kodak dataset.

We again train the network using noisy inputs and noisy targets, where the probability of corrupted pixels is randomized separately for each pair from $[0, 0.95]$. Figure 4 shows the inference results when 70% of the input pixels are randomized.
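The difference between mean, median, and mode seeking can be illustrated on a single scalar "pixel". This is a minimal sketch under stated assumptions, not the paper's training procedure: the clean value 0.9, the replacement probability 0.7, the sample count, and the helper names are illustrative, and the loss $(|z - y| + \epsilon)^\gamma$ is evaluated by grid search at three fixed values of $\gamma$ rather than annealed during gradient descent.

```python
import random
import statistics

EPS = 1e-8  # epsilon from the loss definition in the text

def annealed_loss(z, y, gamma):
    """(|z - y| + eps)^gamma: gamma=2 acts like L2, gamma=1 is L1, small gamma approximates L0."""
    return (abs(z - y) + EPS) ** gamma

def best_value(targets, gamma, candidates):
    """Grid-search the scalar output that minimizes the summed loss over the noisy targets."""
    return min(candidates, key=lambda z: sum(annealed_loss(z, y, gamma) for y in targets))

random.seed(1)
clean, p = 0.9, 0.7  # hypothetical bright pixel; 70% of the targets are replaced
targets = [random.random() if random.random() < p else clean for _ in range(2000)]
candidates = [k / 100 for k in range(101)]  # 0.00, 0.01, ..., 1.00

est_l2 = best_value(targets, 2.0, candidates)   # pulled towards the mean
est_l1 = best_value(targets, 1.0, candidates)   # pulled towards the median
est_l0 = best_value(targets, 0.05, candidates)  # locks onto the most common value
print(est_l2, est_l1, est_l0)
```

With more than half of the targets replaced, both the mean and the median drift towards mid-gray for this bright pixel, while the small-$\gamma$ loss still selects the clean value because it remains the single most frequent target.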

Training with $L_2$ loss biases the results heavily towards gray, because the result tends towards a linear combination of the correct answer and the mean of the uniform random corruption. As predicted by theory, the $L_1$ loss gives good results as long as fewer than 50% of the pixels are randomized, but beyond that threshold it quickly starts to bias dark and bright areas towards gray (Figure 5). $L_0$, on the other hand, shows little bias even with extreme corruptions (e.g. 90% pixels), because of all the possible pixel values, the correct answer (e.g. 10%) is still the most common.

3.3. Monte Carlo Rendering

Physically accurate renderings of virtual environments are most often generated through a process known as Monte Carlo path tracing. This amounts to drawing random sequences of scattering events ("light paths") in the scene that connect light sources and virtual sensors, and integrating the radiance carried by them over all possible paths (Veach & Guibas, 1995). The Monte Carlo integrator is constructed such that the intensity of each pixel is the expectation of the random path sampling process, i.e., the sampling noise is zero-mean. However, despite decades of research into importance sampling techniques, little else can be said about the distribution. It varies from pixel to pixel, heavily depends on the scene configuration and rendering parameters, and can be arbitrarily multimodal. Some lighting effects, such as focused caustics, also result in extremely long-tailed distributions with rare, bright outliers.

All of these effects make the removal of Monte Carlo noise much more difficult than removing, e.g., Gaussian noise. On the other hand, the problem is somewhat alleviated by the possibility of generating auxiliary information that has been empirically found to correlate with the clean result during data generation. In our experiments, the denoiser input consists of not only the per-pixel luminance values, but also the average albedo (i.e., texture color) and normal vector of the surfaces visible at each pixel.

High dynamic range (HDR). Even with adequate sampling, the floating-point pixel luminances may differ from each other by several orders of magnitude. In order to construct an image suitable for the generally 8-bit display devices, this high dynamic range needs to be compressed to a fixed range using a tone mapping operator (Cerdá-Company et al., 2016). We use a variant of Reinhard's global operator (Reinhard et al., 2002): $T(v) = (v / (1 + v))^{1/2.2}$, where $v$ is a scalar luminance value, possibly pre-scaled with an image-wide exposure constant. This operator maps any $v \geq 0$ into range $0 \leq T(v) < 1$.

The combination of virtually unbounded range of luminances and the nonlinearity of operator $T$ poses a problem. If we attempt to train a denoiser that outputs luminance values $v$, a standard MSE loss $L_2 = (f_\theta(\hat{x}) - \hat{y})^2$ will be dominated by the long-tail effects (outliers) in the targets, and training does not converge. On the other hand, if the denoiser were to output tonemapped values $T(v)$, the nonlinearity of $T$ would make the expected value of noisy target images $\mathbb{E}\{T(v)\}$ different from the clean training target $T(\mathbb{E}\{v\})$, leading to incorrect predictions.

A metric often used for measuring the quality of HDR images is the relative MSE (Rousselle et al., 2011), where the squared difference is divided by the square of approximate luminance of the pixel, i.e., $(f_\theta(\hat{x}) - \hat{y})^2 / (\hat{y} + \epsilon)^2$. However, this metric suffers from the same nonlinearity problem as comparing tonemapped outputs. Therefore, we propose to use the network output, which tends towards the correct value in the limit, in the denominator: $L_{\mathrm{HDR}} = (f_\theta(\hat{x}) - \hat{y})^2 / (f_\theta(\hat{x}) + 0.01)^2$. It can be shown that $L_{\mathrm{HDR}}$ converges to the correct expected value as long as we consider the gradient of the denominator to be zero.

Finally, we have observed that it is beneficial to tone map the input image $T(\hat{x})$ instead of using HDR inputs. The network continues to output non-tonemapped (linear-scale) luminance values, retaining the correctness of the expected value. Figure 6 evaluates the different loss functions.

[Figure 6: image row. Input, 8 spp, 11.32 dB; $L_2$ with $\hat{x}, \hat{y}$, 25.46 dB; $L_2$ with $T(\hat{x}), \hat{y}$, 25.39 dB; $L_2$ with $T(\hat{x}), T(\hat{y})$, 15.50 dB; $L_{\mathrm{HDR}}$ with $\hat{x}, \hat{y}$, 29.05 dB; $L_{\mathrm{HDR}}$ with $T(\hat{x}), \hat{y}$, 30.09 dB; Reference, 32k spp.]

Figure 6. Comparison of various loss functions for training a Monte Carlo denoiser with noisy target images rendered at 8 samples per pixel (spp). In this high-dynamic range setting, our custom relative loss $L_{\mathrm{HDR}}$ is clearly superior to $L_2$. Applying a non-linear tone map to the inputs is beneficial, while applying it to the target images skews the distribution of noise and leads to wrong, visibly too dark results.

Denoising Monte Carlo rendered images. We trained a denoiser for Monte Carlo path traced images rendered using 64 samples per pixel (spp). Our training set consisted of 860 architectural images, and the validation was done using 34 images from a different set of scenes. Three versions of the training images were rendered: two with 64 spp using different random seeds (noisy input, noisy target), and one with 131k spp (clean target). The validation images were rendered in both 64 spp (input) and 131k spp (reference) versions. All images were $960 \times 540$ pixels in size, and as mentioned earlier, we also saved the albedo and normal buffers for all of the input images. Even with such a small dataset, rendering the 131k spp clean images was a strenuous effort: for example, Figure 7d took 40 minutes to render on a high-end graphics server with 8 × NVIDIA Tesla P100 GPUs and a 40-core Intel Xeon CPU.

The average PSNR of the 64 spp validation inputs with respect to the corresponding reference images was 22.31 dB (see Figure 7a for an example). The network trained for 2000 epochs using clean target images reached an average PSNR of 31.83 dB on the validation set, whereas the similarly trained network using noisy target images gave 0.5 dB less. Examples are shown in Figure 7b,c. The training took 12 hours with a single NVIDIA Tesla P100 GPU.

[Figure 7: four images. (a) Input (64 spp), 23.93 dB; (b) Noisy targets, 32.42 dB; (c) Clean targets, 32.95 dB; (d) Reference (131k spp).]

Figure 7. Denoising a Monte Carlo rendered image. (a) Image rendered with 64 samples per pixel. (b) Denoised 64 spp input, trained using 64 spp targets. (c) Same as previous, but trained on clean targets. (d) Reference image rendered with 131 072 samples per pixel. PSNR values refer to the images shown here, see text for averages over the entire validation set.

At 4000 epochs, the noisy targets matched 31.83 dB, i.e., noisy targets took approximately twice as long to converge. However, the gap between the two methods had not narrowed appreciably, leading us to believe that some quality difference will remain even in the limit. This is not surprising, since the training dataset contained only a limited number of training pairs (and thus noise realizations) due to the cost of generating the clean target images, and we wanted to test both methods using matching data. That said, given that noisy targets are 2000 times faster to produce, one could trivially produce a larger quantity of them and still realize vast gains. The finite capture budget study (Section 3.1) supports this hypothesis.

Online training. Since it can be tedious to collect a sufficiently large corpus of Monte Carlo images for training a generally applicable denoiser, a possibility is to train a model specific to a single 3D scene, e.g., a game level or a movie shot (Chaitanya et al., 2017). In this context, it can even be desirable to train on-the-fly while walking through the scene. In order to maintain interactive frame rates, we can afford only few samples per pixel, and thus both input and target images will be inherently noisy.

Figure 8 shows the convergence plots for an experiment where we trained a denoiser from scratch for the duration of 1000 frames in a scene flythrough. On an NVIDIA Titan V GPU, path tracing a single $512 \times 512$ pixel image with 8 spp took 190 ms, and we rendered two images to act as input and target. A single network training iteration with a random $256 \times 256$ pixel crop took 11.25 ms and we performed eight of them per frame. Finally, we denoised both rendered images, each taking 15 ms, and averaged the result to produce the final image shown to the user. Rendering, training and inference took 500 ms/frame.

[Figure 8: plot of PSNR (0 to 40 dB) against frame number (0 to 1000), with curves for Input, Noisy targets, and Clean targets.]

Figure 8. Online training PSNR during a 1000-frame flythrough of the scene in Figure 6. Noisy target images are almost as good for learning as clean targets, but are over 2000× faster to render (190 milliseconds vs 7 minutes per frame in this scene). Both denoisers offer a substantial improvement over the noisy input.

Figure 8 shows that training with clean targets does not perform appreciably better than noisy targets. As rendering a single clean image takes approx. 7 minutes in this scene (resp. 190 ms for a noisy target), the quality/time tradeoff clearly favors noisy targets.

3.4. Magnetic Resonance Imaging (MRI)

Magnetic Resonance Imaging (MRI) produces volumetric images of biological tissues essentially by sampling the
8 Noise2Noise: Learning Image Restoration without Clean Data -space”) of the signal. Modern Fourier transform (the “ k MRI techniques have long relied on compressed sensing (CS) to cheat the Nyquist-Shannon limit: they undersample Image k -space, and perform non-linear reconstruction that removes aliasing by exploiting the sparsity of the image in a suitable transform domain (Lustig et al., 2008). We observe that if we turn the k -space sampling into a ran- k over the dom process with a known probability density p ( ) Spectrum frequencies k , our main idea applies. In particular, we model (a) Input (b) Noisy trg. (c) Clean trg. (d) Reference k the -space sampling operation as a Bernoulli process where k | | − λ 29.77 dB 18.93 dB 29.81 dB k p ( e ) = each individual frequency has a probability 4 The frequencies that are of being selected for acquisition. retained are weighted by the inverse of the selection proba- MRI reconstruction example. (a) Input image with only Figure 9. 1 10% of spectrum samples retained and scaled by . (b) Recon- /p bility, and non-chosen frequencies are set to zero. Clearly, struction by a network trained with noisy target images similar the expectation of this “Russian roulette” process is the to the input image. (c) Same as previous, but training done with controls the overall frac- λ correct spectrum. The parameter clean target images similar to the reference image. (d) Original, tion of k -space retained; in the following experiments, we uncorrupted image. PSNR values refer to the images shown here, 10% choose it so that of the samples are retained relative to a see text for averages over the entire validation set. full Nyquist-Shannon sampling. The undersampled spectra are transformed to the primal image domain by the standard inverse Fourier transform. 
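The "Russian roulette" sampling described in Section 3.4 can be sketched in a few lines. This is only an illustrative sketch under stated assumptions, not the authors' implementation: it uses a 1D signal and a naive DFT instead of 2D MRI slices, and the function names, the test signal, and the value `lam = 0.1` are all hypothetical choices made for the example. It checks empirically that averaging many undersampled, $1/p$-weighted spectra approaches the full spectrum.

```python
import cmath
import math
import random


def dft(signal):
    """Naive O(n^2) discrete Fourier transform of a 1D sequence."""
    n = len(signal)
    return [
        sum(signal[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]


def signed_frequency(k, n):
    """Map DFT bin index k to its signed frequency (negative for the upper half)."""
    return k if k < n // 2 else k - n


def selection_probability(k, n, lam):
    """Selection probability p(k) = exp(-lam * |k|) from the text."""
    return math.exp(-lam * abs(signed_frequency(k, n)))


def undersample(spectrum, lam, rng):
    """Keep each bin with probability p(k) and weight it by 1/p(k); zero the rest."""
    n = len(spectrum)
    sampled = []
    for k, value in enumerate(spectrum):
        p = selection_probability(k, n, lam)
        sampled.append(value / p if rng.random() < p else 0j)
    return sampled


def mean_undersampled_spectrum(spectrum, lam, trials, rng):
    """Average many independent undersampled spectra (Monte Carlo expectation)."""
    total = [0j] * len(spectrum)
    for _ in range(trials):
        for k, value in enumerate(undersample(spectrum, lam, rng)):
            total[k] += value
    return [t / trials for t in total]


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed, hypothetical choice for repeatability
    signal = [math.sin(0.7 * t) + 0.5 * math.cos(2.1 * t) for t in range(16)]
    spectrum = dft(signal)
    mean_spectrum = mean_undersampled_spectrum(spectrum, lam=0.1, trials=20000, rng=rng)
    worst = max(abs(m - s) for m, s in zip(mean_spectrum, spectrum))
    print("largest deviation of the averaged spectrum:", worst)
```

Because each retained bin is divided by its own selection probability, every bin is an unbiased estimate of the true coefficient. That per-bin unbiasedness is the property that lets two independent undersamplings of the same volume serve as input and target.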
An example of an undersampled input/target picture, the corresponding fully sampled reference, and their spectra, are shown in Figure 9(a, d).

Now we simply set up a regression problem of the form (6) and train a convolutional neural network using pairs of two independent undersampled images $\hat{x}$ and $\hat{y}$ of the same volume. As the spectra of the input and target are correct on expectation, and the Fourier transform is linear, we use the $L_2$ loss. Additionally, we improve the result slightly by enforcing the exact preservation of frequencies that are present in the input image $\hat{x}$ by Fourier transforming the result $f_\theta(\hat{x})$, replacing the frequencies with those from the input, and transforming back to the primal domain before computing the loss: the final loss reads $(\mathcal{F}^{-1}(R_{\hat{x}}(\mathcal{F}(f_\theta(\hat{x})))) - \hat{y})^2$, where $R$ denotes the replacement of non-zero frequencies from the input. This process is trained end-to-end.

We perform experiments on 2D slices extracted from the IXI brain scan MRI dataset.⁵ To simulate spectral sampling, we draw random samples from the FFT of the (already reconstructed) images in the dataset. Hence, in deviation from actual MRI samples, our data is real-valued and has the periodicity of the discrete FFT built-in. The training set contained 5000 images in 256 × 256 resolution from 50 subjects, and for validation we chose 1000 random images from 10 different subjects. The baseline PSNR of the sparsely-sampled input images was 20.03 dB when reconstructed directly using IFFT. The network trained for 300 epochs with noisy targets reached an average PSNR of 31.74 dB on the validation data, and the network trained with clean targets reached 31.77 dB. Here the training with clean targets is similar to prior art (Wang et al., 2016; Lee et al., 2017). Training took 13 hours on an NVIDIA Tesla P100 GPU. Figure 9(b, c) shows an example of reconstruction results between convolutional networks trained with noisy and clean targets, respectively. In terms of PSNR, our results quite closely match those reported in recent work.

⁴ Our simplified example deviates from practical MRI in the sense that we do not sample the spectra along 1D trajectories. However, we believe that designing pulse sequences that lead to similar pseudo-random sampling characteristics is straightforward.
⁵ http://brain-development.org/ixi-dataset → T1 images.

4. Discussion

We have shown that simple statistical arguments lead to new capabilities in learned signal recovery using deep neural networks; it is possible to recover signals under complex corruptions without observing clean signals, without an explicit statistical characterization of the noise or other corruption, at performance levels equal or close to using clean target data. That clean data is not necessary for denoising is not a new observation: indeed, consider, for instance, the classic BM3D algorithm (Dabov et al., 2007) that draws on self-similar patches within a single noisy image. We show that the previously-demonstrated high restoration performance of deep neural networks can likewise be achieved entirely without clean data, all based on the same general-purpose deep convolutional model. This points the way to significant benefits in many applications by removing the need for potentially strenuous collection of clean data.

AmbientGAN (Bora et al., 2018) trains generative adversarial networks (Goodfellow et al., 2014) using corrupted observations. In contrast to our approach, AmbientGAN needs an explicit forward model of the corruption. We find combining ideas along both paths intriguing.

Acknowledgments

Bill Dally, David Luebke, Aaron Lefohn for discussions and supporting the research; NVIDIA Research staff for suggestions and discussion; Runa Lober and Gunter Sprenger for synthetic off-line training data; Jacopo Pantaleoni for the interactive renderer used in on-line training; Samuli Vuorinen for initial photography test data; Koos Zevenhoven for discussions on MRI; Peyman Milanfar for helpful comments.

References

Bora, Ashish, Price, Eric, and Dimakis, Alexandros G. AmbientGAN: Generative models from lossy measurements. ICLR, 2018.

Cerdá-Company, Xim, Párraga, C. Alejandro, and Otazu, Xavier. Which tone-mapping operator is the best? A comparative study of perceptual quality. CoRR, abs/1601.04450, 2016.

Chaitanya, Chakravarty R. Alla, Kaplanyan, Anton S., Schied, Christoph, Salvi, Marco, Lefohn, Aaron, Nowrouzezahrai, Derek, and Aila, Timo. Interactive reconstruction of Monte Carlo image sequences using a recurrent denoising autoencoder. ACM Trans. Graph., 36(4):98:1–98:12, 2017.

Dabov, K., Foi, A., Katkovnik, V., and Egiazarian, K. Image denoising by sparse 3-D transform-domain collaborative filtering. IEEE Trans. Image Process., 16(8):2080–2095, 2007.

Goodfellow, Ian, Pouget-Abadie, Jean, Mirza, Mehdi, Xu, Bing, Warde-Farley, David, Ozair, Sherjil, Courville, Aaron, and Bengio, Yoshua. Generative Adversarial Networks. In NIPS, 2014.

Hasinoff, Sam, Sharlet, Dillon, Geiss, Ryan, Adams, Andrew, Barron, Jonathan T., Kainz, Florian, Chen, Jiawen, and Levoy, Marc. Burst photography for high dynamic range and low-light imaging on mobile cameras. ACM Trans. Graph., 35(6):192:1–192:12, 2016.

He, Kaiming, Zhang, Xiangyu, Ren, Shaoqing, and Sun, Jian. Delving deep into rectifiers: Surpassing human-level performance on imagenet classification. CoRR, abs/1502.01852, 2015.

Huber, Peter J. Robust estimation of a location parameter. Ann. Math. Statist., 35(1):73–101, 1964.

Iizuka, Satoshi, Simo-Serra, Edgar, and Ishikawa, Hiroshi. Globally and locally consistent image completion. ACM Trans. Graph., 36(4):107:1–107:14, 2017.

Isola, Phillip, Zhu, Jun-Yan, Zhou, Tinghui, and Efros, Alexei A. Image-to-image translation with conditional adversarial networks. In Proc. CVPR 2017, 2017.

Kingma, Diederik P. and Ba, Jimmy. Adam: A method for stochastic optimization. In ICLR, 2015.

Ledig, Christian, Theis, Lucas, Huszar, Ferenc, Caballero, Jose, Aitken, Andrew P., Tejani, Alykhan, Totz, Johannes, Wang, Zehan, and Shi, Wenzhe. Photo-realistic single image super-resolution using a generative adversarial network. In Proc. CVPR, pp. 105–114, 2017.

Lee, D., Yoo, J., and Ye, J. C. Deep residual learning for compressed sensing MRI. In Proc. IEEE 14th International Symposium on Biomedical Imaging (ISBI 2017), pp. 15–18, 2017.

Lustig, Michael, Donoho, David L., Santos, Juan M., and Pauly, John M. Compressed sensing MRI. In IEEE Signal Processing Magazine, volume 25, pp. 72–82, 2008.

Maas, Andrew L, Hannun, Awni Y, and Ng, Andrew. Rectifier nonlinearities improve neural network acoustic models. In Proc. International Conference on Machine Learning (ICML), volume 30, 2013.

Mao, Xiao-Jiao, Shen, Chunhua, and Yang, Yu-Bin. Image restoration using convolutional auto-encoders with symmetric skip connections. In Proc. NIPS, 2016.

Martin, D., Fowlkes, C., Tal, D., and Malik, J. A database of human segmented natural images and its application to evaluating segmentation algorithms and measuring ecological statistics. In Proc. ICCV, volume 2, pp. 416–423, 2001.

Mäkitalo, Markku and Foi, Alessandro. Optimal inversion of the Anscombe transformation in low-count Poisson image denoising. IEEE Trans. Image Process., 20(1):99–109, 2011.

Reinhard, Erik, Stark, Michael, Shirley, Peter, and Ferwerda, James. Photographic tone reproduction for digital images. ACM Trans. Graph., 21(3):267–276, 2002.

Ronneberger, Olaf, Fischer, Philipp, and Brox, Thomas. U-net: Convolutional networks for biomedical image segmentation. MICCAI, 9351:234–241, 2015.

Rousselle, Fabrice, Knaus, Claude, and Zwicker, Matthias. Adaptive sampling and reconstruction using greedy error minimization. ACM Trans. Graph., 30(6):159:1–159:12, 2011.

Srivastava, Nitish, Hinton, Geoffrey, Krizhevsky, Alex, Sutskever, Ilya, and Salakhutdinov, Ruslan. Dropout: A simple way to prevent neural networks from overfitting. Journal of Machine Learning Research, 15:1929–1958, 2014.

Ulyanov, Dmitry, Vedaldi, Andrea, and Lempitsky, Victor S. Deep image prior. CoRR, abs/1711.10925, 2017.

Veach, Eric and Guibas, Leonidas J. Optimally combining sampling techniques for Monte Carlo rendering. In Proc. ACM SIGGRAPH 95, pp. 419–428, 1995.

Wang, S., Su, Z., Ying, L., Peng, X., Zhu, S., Liang, F., Feng, D., and Liang, D. Accelerating magnetic resonance imaging via deep learning. In Proc. IEEE 13th International Symposium on Biomedical Imaging (ISBI), pp. 514–517, 2016.

Zeyde, R., Elad, M., and Protter, M. On single image scale-up using sparse-representations. In Proc. Curves and Surfaces: 7th International Conference, pp. 711–730, 2010.

Zhang, Richard, Isola, Phillip, and Efros, Alexei A. Colorful image colorization. In Proc. ECCV, pp. 649–666, 2016.

A. Appendix

A.1. Network architecture

Table 2 shows the structure of the U-network (Ronneberger et al., 2015) used in all of our tests, with the exception of the first test in Section 3.1 that used the "RED30" network (Mao et al., 2016). For all basic noise and text removal experiments with RGB images, the number of input and output channels were $n = m = 3$. For Monte Carlo denoising we had $n = 9$, $m = 3$, i.e., input contained RGB pixel color, RGB albedo, and a 3D normal vector per pixel. The MRI reconstruction was done with monochrome images ($n = m = 1$). Input images were represented in range $[-0.5, 0.5]$.

NAME          N_out   FUNCTION
INPUT         n
ENC_CONV0     48      Convolution 3 × 3
ENC_CONV1     48      Convolution 3 × 3
POOL1         48      Maxpool 2 × 2
ENC_CONV2     48      Convolution 3 × 3
POOL2         48      Maxpool 2 × 2
ENC_CONV3     48      Convolution 3 × 3
POOL3         48      Maxpool 2 × 2
ENC_CONV4     48      Convolution 3 × 3
POOL4         48      Maxpool 2 × 2
ENC_CONV5     48      Convolution 3 × 3
POOL5         48      Maxpool 2 × 2
ENC_CONV6     48      Convolution 3 × 3
UPSAMPLE5     48      Upsample 2 × 2
CONCAT5       96      Concatenate output of POOL4
DEC_CONV5A    96      Convolution 3 × 3
DEC_CONV5B    96      Convolution 3 × 3
UPSAMPLE4     96      Upsample 2 × 2
CONCAT4       144     Concatenate output of POOL3
DEC_CONV4A    96      Convolution 3 × 3
DEC_CONV4B    96      Convolution 3 × 3
UPSAMPLE3     96      Upsample 2 × 2
CONCAT3       144     Concatenate output of POOL2
DEC_CONV3A    96      Convolution 3 × 3
DEC_CONV3B    96      Convolution 3 × 3
UPSAMPLE2     96      Upsample 2 × 2
CONCAT2       144     Concatenate output of POOL1
DEC_CONV2A    96      Convolution 3 × 3
DEC_CONV2B    96      Convolution 3 × 3
UPSAMPLE1     96      Upsample 2 × 2
CONCAT1       96+n    Concatenate INPUT
DEC_CONV1A    64      Convolution 3 × 3
DEC_CONV1B    32      Convolution 3 × 3
DEC_CONV1C    m       Convolution 3 × 3, linear act.

Table 2. Network architecture used in our experiments. $N_{out}$ denotes the number of output feature maps for each layer. Number of network input channels $n$ and output channels $m$ depend on the experiment. All convolutions use padding mode "same", and except for the last layer are followed by leaky ReLU activation function (Maas et al., 2013) with $\alpha = 0.1$. Other layers have linear activation. Upsampling is nearest-neighbor.

A.2. Training parameters

The network weights were initialized following He et al. (2015). No batch normalization, dropout or other regularization techniques were used. Training was done using ADAM (Kingma & Ba, 2015) with parameter values $\beta_1 = 0.9$, $\beta_2 = 0.99$, $\epsilon = 10^{-8}$.

Learning rate was kept at a constant value during training except for a brief rampdown period at the end, during which it was smoothly brought to zero. Learning rate of 0.001 was used for all experiments except Monte Carlo denoising, where 0.0003 was found to provide better stability. Minibatch size of 4 was used in all experiments.

A.3. Finite corrupted data in $L_2$ minimization

Let us compute the expected error in $L_2$ norm minimization task when corrupted targets $\{\hat{y}_i\}_{i=1}^N$ are used in place of the clean targets $\{y_i\}_{i=1}^N$, with $N$ a finite number. Let $\hat{y}_i$ be arbitrary random variables, such that $\mathbb{E}\{\hat{y}_i\} = y_i$. As usual, the point of least deviation is found at the respective mean. The expected squared difference between these means across realizations of the noise is then:

$$
\begin{aligned}
\mathbb{E}_{\hat{y}}\left[\left(\frac{1}{N}\sum_i y_i - \frac{1}{N}\sum_i \hat{y}_i\right)^2\right]
&= \frac{1}{N^2}\left[\mathbb{E}_{\hat{y}}\Big(\sum_i y_i\Big)^2 - 2\,\mathbb{E}_{\hat{y}}\Big[\Big(\sum_i y_i\Big)\Big(\sum_i \hat{y}_i\Big)\Big] + \mathbb{E}_{\hat{y}}\Big(\sum_i \hat{y}_i\Big)^2\right] \\
&= \frac{1}{N^2}\operatorname{Var}\Big(\sum_i \hat{y}_i\Big) \\
&= \frac{1}{N}\left[\frac{1}{N}\sum_i \sum_j \operatorname{Cov}(\hat{y}_i, \hat{y}_j)\right]
\end{aligned}
\tag{8}
$$

In the intermediate steps, we have used $\mathbb{E}_{\hat{y}}\big(\sum_i \hat{y}_i\big) = \sum_i y_i$ and basic properties of (co)variance. If the corruptions are mutually uncorrelated, the last row simplifies to

$$
\frac{1}{N}\left[\frac{1}{N}\sum_i \operatorname{Var}(\hat{y}_i)\right]
\tag{9}
$$

In either case, the variance of the estimate is the average (co)variance of the corruptions, divided by the number of samples $N$. Therefore, the error approaches zero as the number of samples grows. The estimate is unbiased in the sense that it is correct on expectation, even with a finite amount of data.

The above derivation assumes scalar target variables. When $\hat{y}_i$ are images, $N$ is to be taken as the total number of scalars in the images, i.e., #images × #pixels/image × #color channels.

A.4. Mode seeking and the "$L_0$" norm

Interestingly, while the "$L_0$ norm" could intuitively be expected to converge to an exact mode, i.e. a local maximum of the probability density function of the data, theoretical analysis reveals that it recovers a slightly different point. While an actual mode is a zero-crossing of the derivative of the PDF, the $L_0$ norm minimization recovers a zero-crossing of its Hilbert transform instead. We have verified this behavior in a variety of numerical experiments, and, in practice, we find that the estimate is typically close to the true mode. This can be explained by the fact that the Hilbert transform approximates differentiation (with a sign flip): the latter is a multiplication by $i\omega$ in the Fourier domain, whereas the Hilbert transform is a multiplication by $-i\,\mathrm{sgn}(\omega)$.

For a continuous data density $q(x)$, the norm minimization task for $L_p$ amounts to finding a point $x^*$ that has a minimal expected $p$-norm distance (suitably normalized, and omitting the $p$th root) from points $y \sim q(y)$:

$$
x^* = \operatorname*{argmin}_x \; \mathbb{E}_{y \sim q}\left\{\frac{1}{p}|x - y|^p\right\}
= \operatorname*{argmin}_x \int \frac{1}{p}|x - y|^p \, q(y)\, \mathrm{d}y
\tag{10}
$$

Following the typical procedure, the minimizer is found at a root of the derivative of the expression under argmin:

$$
0 = \frac{\partial}{\partial x}\int \frac{1}{p}|x - y|^p \, q(y)\, \mathrm{d}y
= \int \mathrm{sgn}(x - y)\,|x - y|^{p-1}\, q(y)\, \mathrm{d}y
\tag{11}
$$

This equality holds also when we take $\lim_{p \to 0}$. The usual results for $L_2$ and $L_1$ norms can readily be derived from this form. For the $L_0$ case, we take $p = 0$ and obtain

$$
0 = \int \mathrm{sgn}(x - y)\,|x - y|^{-1}\, q(y)\, \mathrm{d}y
= \int \frac{1}{x - y}\, q(y)\, \mathrm{d}y.
\tag{12}
$$

The right hand side is the formula for the Hilbert transform of $q(x)$, up to a constant multiplier.
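The statement of Appendix A.4 can be probed numerically. The following is a rough sketch under stated assumptions, not the authors' experiment: the skewed density $q(y) = y^2 e^{-y}/2$ (mode at $y = 2$), the midpoint quadrature, the grid spacing and all function names are hypothetical choices. It uses the fact that $(|x - y|^p - 1)/p \to \ln|x - y|$ as $p \to 0$, which differs from the objective in (10) only by the constant $1/p$ and so has the same minimizer.

```python
import math


def example_density(y):
    """Hypothetical skewed density q(y) = y^2 exp(-y) / 2 on y > 0; its mode is at y = 2."""
    return 0.5 * y * y * math.exp(-y) if y > 0.0 else 0.0


STEP = 0.01
# Midpoint quadrature nodes covering [0, 30]; the density is negligible beyond that.
NODES = [(j + 0.5) * STEP for j in range(3000)]
WEIGHTS = [example_density(y) * STEP for y in NODES]


def log_objective(x):
    """E[ln|x - y|], the p -> 0 limit of the L_p objective up to an additive constant."""
    return sum(w * math.log(abs(x - y)) for y, w in zip(NODES, WEIGHTS))


def principal_value_integral(x):
    """Approximates the right hand side of (12), the integral of q(y) / (x - y).

    x should be a multiple of STEP so that the quadrature nodes lie
    symmetrically around it, which yields the principal value.
    """
    return sum(w / (x - y) for y, w in zip(NODES, WEIGHTS))


def l0_minimizer(lo=1.5, count=201):
    """Grid search for the minimizer of the limiting objective on [lo, lo + (count-1)*STEP]."""
    candidates = [round(lo + i * STEP, 2) for i in range(count)]
    return min(candidates, key=log_objective)


if __name__ == "__main__":
    x_star = l0_minimizer()
    print("mode of q:            2.0")
    print("limiting minimizer:  ", x_star)
    print("integral (12) there: ", principal_value_integral(x_star))
```

For this particular density the minimizer lands close to, but not exactly on, the mode, and the integral in (12) changes sign around it. That agrees with the zero-crossing characterization above.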

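For Appendix A.3, the $1/N$ decay of the error in equation (9) can be checked with a small simulation. This is a minimal sketch assuming independent zero-mean Gaussian corruptions of equal variance; the helper names, the noise level and the synthetic clean values are hypothetical and serve only as an illustration.

```python
import random


def squared_error_of_noisy_mean(clean, noise_std, rng):
    """Squared difference between the mean of clean targets and of one noisy realization."""
    noisy = [y + rng.gauss(0.0, noise_std) for y in clean]
    n = len(clean)
    return (sum(clean) / n - sum(noisy) / n) ** 2


def expected_squared_error(clean, noise_std, trials, rng):
    """Monte Carlo estimate of the left hand side of equation (8)."""
    total = sum(squared_error_of_noisy_mean(clean, noise_std, rng) for _ in range(trials))
    return total / trials


def predicted_error(noise_std, n):
    """Equation (9) for uncorrelated corruptions of equal variance: Var / N."""
    return noise_std ** 2 / n


if __name__ == "__main__":
    rng = random.Random(1)  # fixed seed, hypothetical choice for repeatability
    for n in (10, 100):
        clean = [float(i % 7) for i in range(n)]  # arbitrary clean targets
        measured = expected_squared_error(clean, noise_std=2.0, trials=4000, rng=rng)
        print(n, "measured:", measured, "predicted:", predicted_error(2.0, n))
```

The measured error should track the prediction and shrink roughly tenfold when $N$ grows from 10 to 100. The values of the clean targets themselves do not enter the prediction.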