

Cédric Villani
Optimal transport, old and new
June 13, 2008
Springer: Berlin Heidelberg New York Hong Kong London Milan Paris Tokyo

Do mo chuisle mo chroí, Aëlle

Contents

Preface .......... 1
Conventions .......... 7
Introduction .......... 13
1 Couplings and changes of variables .......... 17
2 Three examples of coupling techniques .......... 33
3 The founding fathers of optimal transport .......... 41

Part I Qualitative description of optimal transport 51

4 Basic properties .......... 55
5 Cyclical monotonicity and Kantorovich duality .......... 63
6 The Wasserstein distances .......... 105
7 Displacement interpolation .......... 125
8 The Monge–Mather shortening principle .......... 175
9 Solution of the Monge problem I: Global approach .......... 217
10 Solution of the Monge problem II: Local approach .......... 227

11 The Jacobian equation .......... 287
12 Smoothness .......... 295
13 Qualitative picture .......... 347

Part II Optimal transport and Riemannian geometry 367

14 Ricci curvature .......... 371
15 Otto calculus .......... 435
16 Displacement convexity I .......... 449
17 Displacement convexity II .......... 463
18 Volume control .......... 507
19 Density control and local regularity .......... 521
20 Infinitesimal displacement convexity .......... 541
21 Isoperimetric-type inequalities .......... 561
22 Concentration inequalities .......... 583
23 Gradient flows I .......... 645
24 Gradient flows II: Qualitative properties .......... 709
25 Gradient flows III: Functional inequalities .......... 735

Part III Synthetic treatment of Ricci curvature 747

26 Analytic and synthetic points of view .......... 751
27 Convergence of metric-measure spaces .......... 759
28 Stability of optimal transport .......... 789

29 Weak Ricci curvature bounds I: Definition and Stability .......... 811
30 Weak Ricci curvature bounds II: Geometric and analytic properties .......... 865

Conclusions and open problems 921

References .......... 933
List of short statements .......... 975
List of figures .......... 983
Index .......... 985
Some notable cost functions .......... 989


Preface

When I was first approached for the 2005 edition of the Saint-Flour Probability Summer School¹, I was intrigued, flattered and scared. Apart from the challenge posed by the teaching of a rather analytical subject to a probabilistic audience, there was the danger of producing a remake of my recent book Topics in Optimal Transportation.

However, I gradually realized that I was being offered a unique opportunity to rewrite the whole theory from a different perspective, with alternative proofs and a different focus, and a more probabilistic presentation; plus the incorporation of recent progress. Among the most striking of these recent advances, there was the rising awareness that John Mather's minimal measures had a lot to do with optimal transport, and that both theories could actually be embedded in a single framework. There was also the discovery that optimal transport could provide a robust synthetic approach to Ricci curvature bounds. These links with dynamical systems on one hand, differential geometry on the other hand, were only briefly alluded to in my first book; here on the contrary they will be at the basis of the presentation. To summarize: more probability, more geometry, and more dynamical systems. Of course there cannot be more of everything, so in some sense there is less analysis and less physics, and also there are fewer digressions.

So the present course is by no means a reduction or an expansion of my previous book, but should be regarded as a complementary reading. Both sources can be read independently, or together, and hopefully the complementarity of points of view will have pedagogical value.

Throughout the book I have tried to optimize the results and the presentation, to provide complete and self-contained proofs of the most important results, and comprehensive bibliographical notes — a dauntingly difficult task in view of the rapid expansion of the literature. Many statements and theorems have been written specifically for this course, and many results appear in rather sharp form for the first time. I also added several appendices, either to present some domains of mathematics to non-experts, or to provide proofs of important auxiliary results. All this has resulted in a rapid growth of the document, which in the end is about six times (!) the size that I had planned initially. So the non-expert reader is advised to skip long proofs at first reading, and concentrate on explanations, statements, examples and sketches of proofs when they are available.

¹ Fans of Tom Waits may have identified this quotation.

About terminology: For some reason I decided to switch from "transportation" to "transport", but this really is a matter of taste.

For people who are already familiar with the theory of optimal transport, here are some more serious changes.

Part I is devoted to a qualitative description of optimal transport. The dynamical point of view is given a prominent role from the beginning, with Robert McCann's concept of displacement interpolation. This notion is discussed before any theorem about the solvability of the Monge problem, in an abstract setting of "Lagrangian action" which generalizes the notion of length space. This provides a unified picture of recent developments dealing with various classes of cost functions, in a smooth or nonsmooth context.

I also wrote down in detail some important estimates by John Mather, well-known in certain circles, and made extensive use of them, in particular to prove the Lipschitz regularity of "intermediate" transport maps (starting from some intermediate time, rather than from initial time). Then the absolute continuity of displacement interpolants comes for free, and this gives a more unified picture of the Mather and Monge–Kantorovich theories. I rewrote in this way the classical theorems of solvability of the Monge problem for quadratic cost in Euclidean space. Finally, this approach allows one to treat change of variables formulas associated with optimal transport by means of changes of variables that are Lipschitz, and not just with bounded variation.

Part II discusses optimal transport in Riemannian geometry, a line of research which started around 2000; I have rewritten all these applications in terms of Ricci curvature, or more precisely curvature-dimension bounds. This part opens with an introduction to Ricci curvature, hopefully readable without any prior knowledge of this notion.

Part III presents a synthetic treatment of Ricci curvature bounds in metric-measure spaces. It starts with a presentation of the theory of Gromov–Hausdorff convergence; all the rest is based on recent research papers mainly due to John Lott, Karl-Theodor Sturm and myself.

In all three parts, noncompact situations will be systematically treated, either by limiting processes, or by restriction arguments (the restriction of an optimal transport is still optimal; this is a simple but powerful principle). The notion of approximate differentiability, introduced in the field by Luigi Ambrosio, appears to be particularly handy in the study of optimal transport in noncompact Riemannian manifolds.

Several parts of the subject are not developed as much as they would deserve. Numerical simulation is not addressed at all, except for a few comments in the concluding part. The regularity theory of optimal transport is described in Chapter 12 (including the remarkable recent works of Xu-Jia Wang, Neil Trudinger and Grégoire Loeper), but without the core proofs and latest developments; this is not only because of the technicality of the subject, but also because smoothness is not needed in the rest of the book. Still another poorly developed subject is the Monge–Mather–Mañé problem arising in dynamical systems, and including as a variant the optimal transport problem when the cost function is a distance. This topic is discussed in several treatises, such as Albert Fathi's monograph, Weak KAM theorem in Lagrangian dynamics; but now it would be desirable to rewrite everything in a framework that also encompasses the optimal transport problem. An important step in this direction was recently performed by Patrick Bernard and Boris Buffoni. In Chapter 8 I shall provide an introduction to Mather's theory, but there would be much more to say.

The treatment of Chapter 22 (concentration of measure) is strongly influenced by Michel Ledoux's book, The Concentration of Measure Phenomenon; while the results of Chapters 23 to 25 owe a lot to the monograph by Luigi Ambrosio, Nicola Gigli and Giuseppe Savaré, Gradient flows in metric spaces and in the space of probability measures. Both references are warmly recommended complementary reading. One can also consult the two-volume treatise by Svetlozar Rachev and Ludger Rüschendorf, Mass Transportation Problems, for many applications of optimal transport to various fields of probability theory.

While writing this text I asked for help from a number of friends and collaborators. Among them, Luigi Ambrosio and John Lott are the ones whom I requested most to contribute; this book owes a lot to their detailed comments and suggestions. Most of Part III, but also significant portions of Parts I and II, are made up with ideas taken from my collaborations with John, which started in 2004 as I was enjoying the hospitality of the Miller Institute in Berkeley. Frequent discussions with Patrick Bernard and Albert Fathi allowed me to get the links between optimal transport and John Mather's theory, which were a key to the presentation in Part I; John himself gave precious hints about the history of the subject. Neil Trudinger and Xu-Jia Wang spent vast amounts of time teaching me the regularity theory of Monge–Ampère equations. Alessio Figalli took up the dreadful challenge to

check the entire set of notes from first to last page. Apart from these people, I got valuable help from Stefano Bianchini, François Bolley, Yann Brenier, Xavier Cabré, Vincent Calvez, José Antonio Carrillo, Dario Cordero-Erausquin, Denis Feyel, Sylvain Gallot, Wilfrid Gangbo, Diogo Aguiar Gomes, Nathaël Gozlan, Arnaud Guillin, Nicolas Juillet, Kazuhiro Kuwae, Michel Ledoux, Grégoire Loeper, Francesco Maggi, Robert McCann, Shin-ichi Ohta, Vladimir Oliker, Yann Ollivier, Felix Otto, Ludger Rüschendorf, Giuseppe Savaré, Walter Schachermayer, Benedikt Schulte, Theo Sturm, Josef Teichmann, Anthon Thalmaier, Hermann Thorisson, Süleyman Üstünel, Anatoly Vershik, and others.

Short versions of this course were tried on mixed audiences in the Universities of Bonn, Dortmund, Grenoble and Orléans, as well as the Borel seminar in Leysin and the IHES in Bures-sur-Yvette. Part of the writing was done during stays at the marvelous MFO Institute in Oberwolfach, the CIRM in Luminy, and the Australian National University in Canberra. All these institutions are warmly thanked.

It is a pleasure to thank Jean Picard for all his organization work on the 2005 Saint-Flour summer school; and the participants for their questions, comments and bug-tracking, in particular Sylvain Arlot (great bug-tracker!), Fabrice Baudoin, Jérôme Demange, Steve Evans (whom I also thank for his beautiful lectures), Christophe Leuridan, Jan Obłój, Erwan Saint Loubert Bié, and others. I extend these thanks to the joyful group of young PhD students and maîtres de conférences with whom I spent such a good time on excursions, restaurants, quantum ping-pong and other activities, making my stay in Saint-Flour truly wonderful (with special thanks to my personal driver, Stéphane Loisel, and my table tennis sparring-partner and adversary, François Simenhaus). I will cherish my visit there in memory as long as I live!

Typing these notes was mostly performed on my (now defunct) faithful laptop Torsten, a gift of the Miller Institute. Support by the Agence Nationale de la Recherche and Institut Universitaire de France is acknowledged. My eternal gratitude goes to those who made fine typesetting accessible to every mathematician, most notably Donald Knuth for TeX, and the developers of LaTeX, BibTeX and XFig. Final thanks to Catriona Byrne and her team for a great editing process.

As usual, I encourage all readers to report mistakes and misprints. I will maintain a list of errata, accessible from my Web page.

Cédric Villani
Lyon, June 2008


Conventions

Axioms

I use the classical axioms of set theory; not the full version of the axiom of choice (only the classical axiom of "countable dependent choice").

Sets and structures

Id is the identity mapping, whatever the space. If A is a set then the function 1_A (with 1_A(x) = 1 if x ∈ A, and 0 otherwise) is the indicator function of A. If F is a formula, then 1_F is the indicator function of the set defined by the formula F.

If f and g are two functions, then (f, g) is the function x ⟼ (f(x), g(x)). The composition f ∘ g will often be denoted by f(g).

N is the set of positive integers: N = {1, 2, 3, ...}. A sequence is written (x_k)_{k ∈ N}, or simply, when no confusion seems possible, (x_k).

R is the set of real numbers. When I write R^n it is implicitly assumed that n is a positive integer. The Euclidean scalar product between two vectors a and b in R^n is denoted interchangeably by a · b or ⟨a, b⟩. The Euclidean norm will be denoted simply by |·|, independently of the dimension n.

M_n(R) is the space of real n × n matrices, and I_n the n × n identity matrix. The trace of a matrix M will be denoted by tr M, its determinant by det M, its adjoint by M*, and its Hilbert–Schmidt norm √(tr(M*M)) by ‖M‖_HS (or just ‖M‖).

Unless otherwise stated, Riemannian manifolds appearing in the text are finite-dimensional, smooth and complete. If a Riemannian manifold M is given, I shall usually denote by n its dimension, by d the geodesic distance on M, and by vol the volume (= n-dimensional Hausdorff) measure on M. The tangent space at x will be denoted by T_x M, and the tangent bundle by TM. The norm on T_x M will most of the time be denoted by |·|, as in R^n, without explicit mention of the point x. (The symbol ‖·‖ will be reserved for special norms or functional norms.) If S is a set without smooth structure, the notation T_x S will instead denote the tangent cone to S at x (Definition 10.46).

If Q is a quadratic form defined on R^n, or on the tangent bundle of a manifold, its value on a (tangent) vector v will be denoted by ⟨Q · v, v⟩, or simply Q(v).

The open ball of radius r and center x in a metric space X is denoted interchangeably by B(x, r) or B_r(x). If X is a Riemannian manifold, the distance is of course the geodesic distance. The closed ball will be denoted interchangeably by B[x, r] or B_r](x). The diameter of a metric space X will be denoted by diam(X).

The closure of a set A in a metric space will be denoted by A̅ (this is also the set of all limits of sequences with values in A). A metric space X is said to be locally compact if every point x ∈ X admits a compact neighborhood; and boundedly compact if every closed and bounded subset of X is compact.

A map f between metric spaces (X, d) and (X′, d′) is said to be C-Lipschitz if d′(f(x), f(y)) ≤ C d(x, y) for all x, y in X. The best admissible constant C is then denoted by ‖f‖_Lip.

A map is said to be locally Lipschitz if it is Lipschitz on bounded sets, not necessarily compact (so it makes sense to speak of a locally Lipschitz map defined almost everywhere).

A curve in a space X is a continuous map defined on an interval of R, valued in X. For me the words "curve" and "path" are synonymous. The time-t evaluation map e_t is defined by e_t(γ) = γ_t = γ(t).

If γ is a curve defined from an interval of R into a metric space, its length will be denoted by L(γ), and its speed by |γ̇|; definitions are recalled on p. 131.

Usually geodesics will be minimizing, constant-speed geodesic curves. If X is a metric space, Γ(X) stands for the space of all geodesics γ : [0, 1] → X.

Being given x_0 and x_1 in a metric space, I denote by [x_0, x_1]_t the set of all t-barycenters of x_0 and x_1, as defined on p. 407. If A_0 and A_1 are two sets, then [A_0, A_1]_t stands for the set of all [x_0, x_1]_t with (x_0, x_1) ∈ A_0 × A_1.

Function spaces

C(X) is the space of continuous functions X → R, C_b(X) the space of bounded continuous functions X → R, and C_0(X) the space of continuous functions X → R converging to 0 at infinity; all of them are equipped with the norm of uniform convergence ‖φ‖_∞ = sup |φ|.

Then C^k_b(X) is the space of k-times continuously differentiable functions u : X → R, such that all the partial derivatives of u up to order k are bounded; it is equipped with the norm given by the supremum of all norms ‖∂u‖_{C_b}, where ∂u is a partial derivative of order at most k; C^k_c(X) is the space of k-times continuously differentiable functions with compact support; etc. When the target space is not R but some other space Y, the notation is transformed in an obvious way: C(X; Y), etc.

L^p is the Lebesgue space of exponent p; the space and the measure will often be implicit, but clear from the context.

Calculus

The derivative of a function u = u(t), defined on an interval of R and valued in R^n or in a smooth manifold, will be denoted by u′, or more often by u̇. The notation d⁺u/dt stands for the upper right-derivative of a real-valued function u: d⁺u/dt = lim sup_{s↓0} [u(t+s) − u(t)]/s.

If u is a function of several variables, the partial derivative with respect to the variable t will be denoted by ∂_t u, or ∂u/∂t. The notation u_t does not stand for ∂_t u, but for u(t).

The gradient operator will be denoted by grad or simply ∇; the divergence operator by div or ∇·; the Laplace operator by ∆; the Hessian operator by Hess or ∇² (so ∇² does not stand for the Laplace operator). The notation is the same in R^n or in a Riemannian manifold. ∆ is the divergence of the gradient, so it is typically a nonpositive operator.

The value of the gradient of f at point x will be denoted either by ∇f(x) or ∇_x f. The notation ∇̃f stands for the approximate gradient, introduced in Definition 10.2.

If T is a map R^n → R^n, ∇T stands for the Jacobian matrix of T, that is the matrix of all partial derivatives (∂T_i/∂x_j) (1 ≤ i, j ≤ n).

All these differential operators will be applied to (smooth) functions but also to measures, by duality. For instance, the Laplacian of a measure μ is defined via the identity ∫ ζ d(∆μ) = ∫ (∆ζ) dμ (ζ ∈ C²_c). The notation is consistent in the sense that ∆(f vol) = (∆f) vol. Similarly, I shall take the divergence of a vector-valued measure, etc.

f = o(g) means f/g −→ 0 (in an asymptotic regime that should be clear from the context), while f = O(g) means that f/g is bounded.

log stands for the natural logarithm with base e.

The positive and negative parts of x ∈ R are defined respectively by x₊ = max(x, 0) and x₋ = max(−x, 0); both are nonnegative, and |x| = x₊ + x₋. The notation a ∧ b will sometimes be used for min(a, b). All these notions are extended in the usual way to functions and also to signed measures.

Probability measures

δ_x is the Dirac mass at point x.

All measures considered in the text are Borel measures on Polish spaces, which are complete, separable metric spaces, equipped with their Borel σ-algebra. I shall usually not use the completed σ-algebra, except on some rare occasions (emphasized in the text) in Chapter 5.

A measure is said to be finite if it has finite mass, and locally finite if it attributes finite mass to compact sets.

The space of Borel probability measures on X is denoted by P(X), the space of finite Borel measures by M₊(X), the space of signed finite Borel measures by M(X). The total variation of μ is denoted by ‖μ‖_TV.

The integral of a function f with respect to a probability measure μ will be denoted interchangeably by ∫ f(x) dμ(x) or ∫ f(x) μ(dx) or ∫ f dμ.

If μ is a Borel measure on a topological space X, a set N is said to be μ-negligible if N is included in a Borel set of zero μ-measure. Then μ is said to be concentrated on a set C if X \ C is μ-negligible. (If C itself is Borel measurable, this is of course equivalent to μ[X \ C] = 0.) By abuse of language, I may say that X has full μ-measure if μ is concentrated on X.

If μ is a Borel measure, its support Spt μ is the smallest closed set on which it is concentrated. The same notation Spt will be used for the support of a continuous function.

If μ is a Borel measure on X, and T is a Borel map X → Y, then T_#μ stands for the image measure² (or push-forward) of μ by T: It is a Borel measure on Y, defined by (T_#μ)[A] = μ[T⁻¹(A)].

The law of a random variable X defined on a probability space (Ω, P) is denoted by law(X); this is the same as X_#P.

The weak topology on P(X) (or topology of weak convergence, or narrow topology) is induced by convergence against C_b(X), i.e. bounded continuous test functions. If X is Polish, then the space P(X) itself is Polish. Unless explicitly stated, I do not use the weak-∗ topology of measures (induced by C_c(X) or C_0(X)).

When a probability measure is clearly specified by the context, it will sometimes be denoted just by P, and the associated integral, or expectation, will be denoted by E.

If π(dx dy) is a probability measure in two variables x ∈ X and y ∈ Y, its marginal (or projection) on X (resp. Y) is the measure X_#π (resp. Y_#π), where X(x, y) = x, Y(x, y) = y. If (x, y) is random with law(x, y) = π, then the conditional law of x given y is denoted by π(dx|y); this is a measurable function Y → P(X), obtained by disintegrating π along its y-marginal. The conditional law of y given x will be denoted by π(dy|x).

A measure μ is said to be absolutely continuous with respect to a measure ν if there exists a measurable function f such that μ = f ν.

² Depending on the authors, the measure T_#μ is often denoted by T#μ, T_∗μ, T(μ), Tμ, μ ∘ T⁻¹, μT⁻¹, T[μ], μ(T ∈ ·), or ∫ δ_{T(a)} μ(da).

Notation specific to optimal transport and related fields

If μ ∈ P(X) and ν ∈ P(Y) are given, then Π(μ, ν) is the set of all joint probability measures on X × Y whose marginals are μ and ν.

C(μ, ν) is the optimal (total) cost between μ and ν, see p. 92. It implicitly depends on the choice of a cost function c(x, y).

For any p ∈ [1, +∞), W_p is the Wasserstein distance of order p, see Definition 6.1; and P_p(X) is the Wasserstein space of order p, i.e. the set of probability measures with finite moments of order p, equipped with the distance W_p, see Definition 6.4.

P_c(X) is the set of probability measures on X with compact support.

If a reference measure ν on X is specified, then P^ac(X) (resp. P^ac_p(X), P^ac_c(X)) stands for those elements of P(X) (resp. P_p(X), P_c(X)) which are absolutely continuous with respect to ν.

DC_N is the displacement convexity class of order N (N plays the role of a dimension); this is a family of convex functions, defined on p. 457 and in Definition 17.1.

U_ν is a functional defined on P(X); it depends on a convex function U and a reference measure ν on X. This functional will be defined at various levels of generality, first in equation (15.2), then in Definition 29.1 and Theorem 30.4.

U^β_{π,ν} is another functional on P(X), which involves not only a convex function U and a reference measure ν, but also a coupling π and a distortion coefficient β, which is a nonnegative function on X × X: See again Definition 29.1 and Theorem 30.4.

The operators Γ and Γ₂ are quadratic differential operators associated with a diffusion operator; they are defined in (14.47) and (14.48).

β^{(K,N)}_t is the notation for the distortion coefficients that will play a prominent role in these notes; they are defined in (14.61).

CD(K, N) means "curvature-dimension condition (K, N)", which morally means that the Ricci curvature is bounded below by Kg (K a real number, g the Riemannian metric) and the dimension is bounded above by N (a real number which is not less than 1).

If c(x, y) is a cost function then č(y, x) = c(x, y). Similarly, if π(dx dy) is a coupling, then π̌ is the coupling obtained by swapping variables, that is π̌(dy dx) = π(dx dy), or more rigorously, π̌ = S_#π, where S(x, y) = (y, x).

Assumptions (Super), (Twist), (Lip), (SC), (locLip), (locSC), (H∞) are defined on p. 246; (STwist) on p. 313; (Cut^{n−1}) on p. 317.

Introduction


To start, I shall recall in Chapter 1 some basic facts about couplings and changes of variables, including definitions, a short list of famous couplings (Knothe–Rosenblatt coupling, Moser coupling, optimal coupling, etc.); and some important basic formulas about change of variables, conservation of mass, and linear diffusion equations.

In Chapter 2 I shall present, without detailed proofs, three applications of optimal coupling techniques, providing a flavor of the kind of applications that will be considered later.

Finally, Chapter 3 is a short historical perspective about the foundations and development of optimal coupling theory.


1 Couplings and changes of variables

Couplings are very well-known in all branches of probability theory, but since they will occur again and again in this course, it might be a good idea to start with some basic reminders and a few more technical issues.

Definition 1.1 (Coupling). Let (X, μ) and (Y, ν) be two probability spaces. Coupling μ and ν means constructing two random variables X and Y on some probability space (Ω, P), such that law(X) = μ, law(Y) = ν. The couple (X, Y) is called a coupling of (μ, ν). By abuse of language, the law of (X, Y) is also called a coupling of (μ, ν).

If μ and ν are the only laws in the problem, then without loss of generality one may choose Ω = X × Y. In a more measure-theoretical formulation, coupling μ and ν means constructing a measure π on X × Y such that π admits μ and ν as marginals on X and Y respectively. The following three statements are equivalent ways to rephrase that marginal condition:

• (proj_X)_#π = μ and (proj_Y)_#π = ν, where proj_X and proj_Y respectively stand for the projection maps (x, y) ⟼ x and (x, y) ⟼ y;

• For all measurable sets A ⊂ X, B ⊂ Y, one has π[A × Y] = μ[A], π[X × B] = ν[B];

• For all integrable (resp. nonnegative) measurable functions φ, ψ on X, Y,

∫_{X×Y} (φ(x) + ψ(y)) dπ(x, y) = ∫_X φ dμ + ∫_Y ψ dν.
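As a quick numerical aside (not from the book): for discrete measures the marginal condition of Definition 1.1 is just a row-sum/column-sum constraint on a nonnegative matrix, and the trivial independent coupling π = μ ⊗ ν always satisfies it. A minimal sketch, with invented weights:

```python
import numpy as np

# Two discrete probability measures: mu on a 3-point space X,
# nu on a 2-point space Y (weights invented for illustration).
mu = np.array([0.2, 0.5, 0.3])
nu = np.array([0.6, 0.4])

# The trivial coupling: X and Y independent, joint law pi = mu (x) nu.
pi = np.outer(mu, nu)

# Marginal condition: summing over y recovers mu, summing over x recovers nu.
assert np.allclose(pi.sum(axis=1), mu)
assert np.allclose(pi.sum(axis=0), nu)
```

Any nonnegative matrix with these row and column sums is a coupling of (μ, ν); the independent one is merely the easiest to write down.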

A first remark about couplings is that they always exist: at least there is the trivial coupling, in which the variables X and Y are independent (so their joint law is the tensor product μ ⊗ ν). This can hardly be called a coupling, since the value of X does not give any information about the value of Y. Another extreme is when all the information about the value of Y is contained in the value of X, in other words Y is just a function of X. This motivates the following definition (in which X and Y do not play symmetric roles).

Definition 1.2 (Deterministic coupling). With the notation of Definition 1.1, a coupling (X, Y) is said to be deterministic if there exists a measurable function T : X → Y such that Y = T(X).

To say that (X, Y) is a deterministic coupling of μ and ν is strictly equivalent to any one of the four statements below:

• (X, Y) is a coupling of μ and ν whose law π is concentrated on the graph of a measurable function T : X → Y;

• X has law μ and Y = T(X), where T_#μ = ν;

• X has law μ and Y = T(X), where T is a change of variables from μ to ν: for all ν-integrable (resp. nonnegative measurable) functions φ,

∫_Y φ(y) dν(y) = ∫_X φ(T(x)) dμ(x); (1.1)

• π = (Id, T)_#μ.

The map T appearing in all these statements is the same and is uniquely defined μ-almost surely (when the joint law of (X, Y) has been fixed). The converse is true: If T and T̃ coincide μ-almost surely, then T_#μ = T̃_#μ. It is common to call T the transport map: Informally, one can say that T transports the mass represented by the measure μ, to the mass represented by the measure ν.

Unlike couplings, deterministic couplings do not always exist: Just think of the case when μ is a Dirac mass and ν is not. But there may also be infinitely many deterministic couplings between two given probability measures.
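To make the push-forward behind a deterministic coupling concrete (again a sketch, not from the book): for a discrete measure, T_#μ transports the mass of each atom to its image, so (T_#μ)[y] sums μ over the preimage T⁻¹(y). The weights and the map below are invented for illustration:

```python
import numpy as np

# A discrete measure mu on X = {0, 1, 2, 3}, and a map T : X -> Y = {0, 1},
# encoded as an array of image indices (both invented for the example).
mu = np.array([0.1, 0.4, 0.3, 0.2])
T = np.array([1, 0, 0, 1])      # T(0)=1, T(1)=0, T(2)=0, T(3)=1

# Push-forward: (T_# mu)[y] = mu[T^{-1}(y)], i.e. accumulate mass by preimage.
nu = np.zeros(2)
np.add.at(nu, T, mu)            # nu[T[i]] += mu[i], unbuffered

# nu = T_# mu is again a probability measure: nu[0] = 0.4 + 0.3, nu[1] = 0.1 + 0.2.
print(nu)
```

The deterministic coupling itself is the matrix π = (Id, T)_#μ, which has one nonzero entry per row: all mass in row i sits in column T(i).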

25 Some famous couplings 19 ome famous couplings S Here below are some of the most famous couplings used in mathematics — of course the list is far from complete, since everybody has his or her own preferred coupling technique. Each of these couplings comes with its own natural setting; this variety of assumptions reflects the variety of constructions. (This is a good reason to state each of them with some generality.) Let ( X ,μ ) and ( Y ,ν 1. The measurable isomorphism. ) be two Polish (i.e. complete, separable, metric) probability spaces with- out atom (i.e. no single point carries a positive mass). Then there exists a (nonunique) measurable bijection T : X → Y such that − 1 = ν , ( T ν T ) . In that sense, all atomless Polish prob- μ = μ # # ability spaces are isomorphic, and, say, isomorphic to the space Y , 1] equipped with the Lebesgue measure. Powerful as that = [0 theorem may seem, in practice the map T is very singular; as a good exercise, the reader might try to construct it “explicitly”, in terms X = R and Y of cumulative distribution functions, when , 1] = [0 (issues do arise when the density of vanishes at some places). Ex- μ perience shows that it is quite easy to fall into logical traps when working with the measurable isomorphism, and my advice is to never use it. Moser mapping. Let X be a smooth compact Riemannian 2. The manifold with volume vol, and let f,g be Lipschitz continuous pos- itive probability densities on X ; then there exists a deterministic μ vol and ν = g vol, constructed by resolution of an f coupling of = elliptic equation. On the positive side, there is a somewhat explicit , and it is as smooth as can representation of the transport map T k,α k +1 ,α then T is C be: if C f,g . The formula is given in the are Appendix at the end of this chapter. The same construction works n R provided that in f and g decay fast enough at infinity; and it is robust enough to accommodate for variants. be two probability on R . 
3. The increasing rearrangement. Let μ, ν be two probability measures on R; define their cumulative distribution functions by

    F(x) = ∫_{−∞}^x dμ,    G(y) = ∫_{−∞}^y dν.

Further define their right-continuous inverses by

    F^{−1}(t) = inf { x ∈ R; F(x) > t };
    G^{−1}(t) = inf { y ∈ R; G(y) > t };

and set

    T = G^{−1} ∘ F.

If μ does not have atoms, then T_# μ = ν. This rearrangement is quite simple, explicit, as smooth as can be, and enjoys good geometric properties.

4. The Knothe–Rosenblatt rearrangement. Let μ and ν be two probability measures on R^n, such that μ is absolutely continuous with respect to Lebesgue measure. Then define a coupling of μ and ν as follows.

Step 1: Take the marginal on the first variable: this gives probability measures μ_1(dx_1), ν_1(dy_1) on R, with μ_1 being atomless. Then define y_1 = T_1(x_1) by the formula for the increasing rearrangement of μ_1 into ν_1.

Step 2: Now take the marginal on the first two variables and disintegrate it with respect to the first variable. This gives probability measures μ_2(dx_1 dx_2) = μ_1(dx_1) μ_2(dx_2 | x_1), ν_2(dy_1 dy_2) = ν_1(dy_1) ν_2(dy_2 | y_1). Then, for each given x_1 ∈ R, set y_1 = T_1(x_1), and define y_2 = T_2(x_2; x_1) by the formula for the increasing rearrangement of μ_2(dx_2 | x_1) into ν_2(dy_2 | y_1). (See Figure 1.1.)

Then repeat the construction, adding variables one after the other and defining y_3 = T_3(x_3; x_1, x_2); etc. After n steps, this produces a map y = T(x) which transports μ to ν, and in practical situations might be computed explicitly with little effort. Moreover, the Jacobian matrix of the change of variables T is (by construction) upper triangular with positive entries on the diagonal; this makes it suitable for various geometric applications. On the negative side, this mapping does not satisfy many interesting intrinsic properties: it is not invariant under isometries of R^n, not even under relabeling of coordinates.
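The increasing rearrangement T = G^{−1} ∘ F is explicit enough to be computed by hand in simple cases. A minimal numerical sketch, assuming the illustrative pair μ = Exp(1) (so F(x) = 1 − e^{−x}) and ν = Uniform[0,2] (so G^{−1}(t) = 2t): pushing samples of μ through T should produce samples of ν.

```python
import numpy as np

# Increasing rearrangement T = G^{-1} o F for an illustrative pair:
# mu = Exp(1), with F(x) = 1 - exp(-x); nu = Uniform[0,2], with G^{-1}(t) = 2t.
F = lambda x: 1.0 - np.exp(-x)
G_inv = lambda t: 2.0 * t
T = lambda x: G_inv(F(x))

rng = np.random.default_rng(0)
x = rng.exponential(1.0, size=100_000)  # samples from mu
y = T(x)                                # T_# mu should be Uniform[0,2]
print(y.mean(), y.min(), y.max())       # mean near 1, values inside [0, 2]
```

Since T is nondecreasing, it also realizes the monotone (quantile) coupling of the two laws, which is the one-dimensional germ of the Knothe–Rosenblatt construction above.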
5. The Holley coupling on a lattice. Let μ and ν be two discrete probabilities on a finite lattice Λ, say {0,1}^N, equipped with the natural partial ordering (x ≤ y if x_n ≤ y_n for all n). Assume that

    ∀ x, y ∈ Λ,    μ[inf(x,y)] ν[sup(x,y)] ≥ μ[x] ν[y].    (1.2)

Fig. 1.1. Second step in the construction of the Knothe–Rosenblatt map: After the correspondence x_1 → y_1 has been determined, the conditional probability of x_2 (seen as a one-dimensional probability on a small “slice” of width dx_1) can be transported to the conditional probability of y_2 (seen as a one-dimensional probability on a slice of width dy_1).

Then there exists a coupling (X, Y) of (μ, ν) with X ≤ Y. The situation above appears in a number of problems in statistical mechanics, in connection with the so-called FKG (Fortuin–Kasteleyn–Ginibre) inequalities. Inequality (1.2) intuitively says that ν puts more mass on large values than μ.

6. Probabilistic representation formulas for solutions of partial differential equations. There are hundreds of them (if not thousands), representing solutions of diffusion, transport or jump processes as the laws of various deterministic or stochastic processes. Some of them are recalled later in this chapter.

7. The exact coupling of two stochastic processes, or Markov chains. Two realizations of a stochastic process are started at initial time, and when they happen to be in the same state at some time, they are merged: from that time on, they follow the same path and accordingly, have the same law. For two Markov chains which are started independently, this is called the classical coupling. There

are many variants with important differences which are all intended to make two trajectories close to each other after some time: the Ornstein coupling, the ε-coupling (in which one requires the two variables to be close, rather than to occupy the same state), the shift-coupling (in which one allows an additional time-shift), etc.

8. The optimal coupling or optimal transport. Here one introduces a cost function c(x,y) on X × Y, that can be interpreted as the work needed to move one unit of mass from location x to location y. Then one considers the Monge–Kantorovich minimization problem

    inf E c(X, Y),

where the pair (X, Y) runs over all possible couplings of (μ, ν); or equivalently, in terms of measures,

    inf ∫_{X×Y} c(x,y) dπ(x,y),

where the infimum runs over all joint probability measures π on X × Y with marginals μ and ν. Such joint measures are called transference plans (or transport plans, or transportation plans); those achieving the infimum are called optimal transference plans.

Of course, the solution of the Monge–Kantorovich problem depends on the cost function c. The cost function and the probability spaces here can be very general, and some nontrivial results can be obtained as soon as, say, c is lower semicontinuous and X, Y are Polish spaces. Even the apparently trivial choice c(x,y) = 1_{x≠y} appears in the probabilistic interpretation of total variation:

    ‖μ − ν‖_{TV} = 2 inf { E 1_{X≠Y}; law(X) = μ, law(Y) = ν }.

Cost functions valued in {0,1} also occur naturally in Strassen’s duality theorem.

Under certain assumptions one can guarantee that the optimal coupling really is deterministic; the search for deterministic optimal couplings (or Monge couplings) is called the Monge problem. A solution of the Monge problem yields a plan to transport the mass at minimal cost with a recipe that associates to each point x a single point y. (“No mass shall be split.”) To guarantee the existence of solutions to the

Monge problem, two kinds of assumptions are natural: First, c should “vary enough” in some sense (think that the constant cost function will allow for arbitrary minimizers), and secondly, μ should enjoy some regularity property (at least Dirac masses should be ruled out!). Here is a typical result: If c(x,y) = |x − y|² in the Euclidean space, μ is absolutely continuous with respect to Lebesgue measure, and μ, ν have finite moments of order 2, then there is a unique optimal Monge coupling between μ and ν. More general statements will be established in Chapter 10.

Optimal couplings enjoy several nice properties:

(i) They naturally arise in many problems coming from economics, physics, partial differential equations or geometry (by the way, the increasing rearrangement and the Holley coupling can be seen as particular cases of optimal transport);

(ii) They are quite stable with respect to perturbations;

(iii) They encode good geometric information, if the cost function c is defined in terms of the underlying geometry;

(iv) They exist in smooth as well as nonsmooth settings;

(v) They come with a rich structure: an optimal cost functional (the value of the infimum defining the Monge–Kantorovich problem); a dual variational problem; and, under adequate structure conditions, a continuous interpolation.

On the negative side, it is important to be warned that optimal transport is in general not so smooth. There are known counterexamples which put limits on the regularity that one can expect from it, even for very nice cost functions.

All these issues will be discussed again and again in the sequel. The rest of this chapter is devoted to some basic technical tools.

Gluing

If Z is a function of Y and Y is a function of X, then of course Z is a function of X. Something of this still remains true in the setting of nondeterministic couplings, under quite general assumptions.

Gluing lemma. Let (X_i, μ_i), i = 1, 2, 3, be Polish probability spaces. If (X_1, X_2) is a coupling of (μ_1, μ_2) and (Y_2, Y_3) is a coupling of (μ_2, μ_3),

then one can construct a triple of random variables (Z_1, Z_2, Z_3) such that (Z_1, Z_2) has the same law as (X_1, X_2) and (Z_2, Z_3) has the same law as (Y_2, Y_3).

It is simple to understand why this is called “gluing lemma”: if π_{12} stands for the law of (X_1, X_2) on X_1 × X_2 and π_{23} stands for the law of (Y_2, Y_3) on X_2 × X_3, then to construct the joint law π_{123} of (Z_1, Z_2, Z_3) on X_1 × X_2 × X_3 one just has to glue π_{12} and π_{23} along their common marginal μ_2. Expressed in a slightly informal way: Disintegrate π_{12} and π_{23} as

    π_{12}(dx_1 dx_2) = π_{12}(dx_1 | x_2) μ_2(dx_2),
    π_{23}(dx_2 dx_3) = π_{23}(dx_3 | x_2) μ_2(dx_2),

and then reconstruct π_{123} as

    π_{123}(dx_1 dx_2 dx_3) = π_{12}(dx_1 | x_2) μ_2(dx_2) π_{23}(dx_3 | x_2).

Change of variables formula

When one writes the formula for change of variables, say in R^n or on a Riemannian manifold, a Jacobian term appears, and one has to be careful about two things: the change of variables should be injective (otherwise, reduce to a subset where it is injective, or take the multiplicity into account); and it should be somewhat smooth. It is classical to write these formulas when the change of variables is continuously differentiable, or at least Lipschitz:

Change of variables formula. Let M be an n-dimensional Riemannian manifold with a C¹ metric, let μ_0, μ_1 be two probability measures on M, and let T : M → M be a measurable function such that T_# μ_0 = μ_1. Let ν(dx) = e^{−V(x)} vol(dx) be a reference measure, where V is continuous and vol is the volume (or n-dimensional Hausdorff) measure. Further assume that

(i) μ_0(dx) = ρ_0(x) ν(dx) and μ_1(dy) = ρ_1(y) ν(dy);
(ii) T is injective;
(iii) T is locally Lipschitz.

Then, μ_0-almost surely,

    ρ_0(x) = ρ_1(T(x)) J_T(x),    (1.3)

where J_T(x) is the Jacobian determinant of T at x, defined by

    J_T(x) := lim_{ε↓0} ν[T(B_ε(x))] / ν[B_ε(x)].    (1.4)

The same holds true if T is only defined on the complement of a μ_0-negligible set, and satisfies properties (ii) and (iii) on its domain of definition.

Remark 1.3. When ν is just the volume measure, J_T coincides with the usual Jacobian determinant, which in the case M = R^n is the absolute value of the determinant of the Jacobian matrix ∇T. Since V is continuous, it is almost immediate to deduce the statement with an arbitrary V from the statement with V = 0 (this amounts to multiplying ρ_0(x) by e^{V(x)}, ρ_1(y) by e^{V(y)}, J_T(x) by e^{V(x)−V(T(x))}).

Remark 1.4. There is a more general framework beyond differentiability, namely the property of approximate differentiability. A function T on an n-dimensional Riemannian manifold is said to be approximately differentiable at x if there exists a function T̃, differentiable at x, such that the set {T̃ ≠ T} has zero density at x, i.e.

    lim_{r→0} vol[{y ∈ B_r(x); T(y) ≠ T̃(y)}] / vol[B_r(x)] = 0.

It turns out that, roughly speaking, an approximately differentiable map can be replaced, up to neglecting a small set, by a Lipschitz map (this is a kind of differentiable version of Lusin’s theorem). So one can prove the Jacobian formula for an approximately differentiable map by approximating it with a sequence of Lipschitz maps.

Approximate differentiability is obviously a local property; it holds true if the distributional derivative of T is a locally integrable function, or even a locally finite measure. So it is useful to know that the change of variables formula still holds true if Assumption (iii) above is replaced by

(iii’) T is approximately differentiable.
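In one dimension, (1.3)–(1.4) can be checked directly on a grid. A minimal numerical sketch, assuming the illustrative map T(x) = x³ pushing μ_0 = Uniform[0,1] forward, with ν the Lebesgue measure (V = 0): then ρ_0 = 1, ρ_1(y) = (1/3) y^{−2/3}, and the Jacobian (1.4) reduces to the length ratio of small intervals.

```python
import numpy as np

# One-dimensional sketch of (1.3)-(1.4) for the illustrative map
# T(x) = x^3 on [0,1], with nu = Lebesgue measure (V = 0).
# Then rho_0 = 1 and rho_1(y) = (1/3) y^(-2/3) on (0,1].
T = lambda x: x ** 3
rho1 = lambda y: y ** (-2.0 / 3.0) / 3.0

x = np.linspace(0.1, 0.9, 9)
eps = 1e-6
# Jacobian via (1.4): measure of the image of a small ball, over the
# measure of the ball itself; here that is (T(x+eps)-T(x-eps))/(2 eps).
J = (T(x + eps) - T(x - eps)) / (2 * eps)    # close to |T'(x)| = 3 x^2
lhs = np.ones_like(x)                        # rho_0(x) = 1
rhs = rho1(T(x)) * J                         # rho_1(T(x)) J_T(x)
print(np.max(np.abs(lhs - rhs)))
```

Algebraically, ρ_1(T(x)) J_T(x) = (1/3) x^{−2} · 3x² = 1 = ρ_0(x), so the printed discrepancy is only the finite-difference error in J.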

Conservation of mass formula

The single most important theorem of change of variables arising in continuum physics might be the one resulting from the conservation of mass formula,

    ∂ρ/∂t + ∇·(ρξ) = 0.    (1.5)

Here ρ = ρ(t,x) stands for the density of a system of particles at time t and position x; ξ = ξ(t,x) for the velocity field at time t and position x; and ∇· stands for the divergence operator. Once again, the natural setting for this equation is a Riemannian manifold M.

It will be useful to work with particle densities μ_t(dx) (that are not necessarily absolutely continuous) and rewrite (1.5) as

    ∂μ/∂t + ∇·(μξ) = 0,

where the time-derivative is taken in the weak sense, and the divergence operator is defined by duality against continuously differentiable functions with compact support:

    ∫_M φ ∇·(μξ) = − ∫_M (ξ·∇φ) dμ.

The formula of conservation of mass is an Eulerian description of the physical world, which means that the unknowns are fields. The next theorem links it with the Lagrangian description, in which everything is expressed in terms of particle trajectories, that are integral curves of the velocity field:

    (d/dt) T_t(x) = ξ(t, T_t(x)).    (1.6)

If ξ is (locally) Lipschitz continuous, then the Cauchy–Lipschitz theorem guarantees the existence of a flow T_t locally defined on a maximal time interval, and itself locally Lipschitz in both arguments t and x. Then, for each t the map T_t is a local diffeomorphism onto its image. But the formula of conservation of mass also holds true without any regularity assumption on ξ; one should only keep in mind that if ξ is not Lipschitz, then a solution of (1.6) is not uniquely determined by its value at time 0, so x ↦ T_t(x) is not necessarily uniquely defined. Still it makes sense to consider random solutions of (1.6).
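The Eulerian/Lagrangian correspondence is easy to watch numerically. A minimal sketch, assuming the illustrative linear velocity field ξ(t,x) = −x on R (not from the text): the flow of (1.6) is T_t(x) = x e^{−t}, so if μ_0 = N(0,1) then the transported measure μ_t = (T_t)_# μ_0 is N(0, e^{−2t}), which solves (1.5). The code moves particles by an Euler step and compares the empirical variance with e^{−2t}.

```python
import numpy as np

# Lagrangian sketch of (1.5)-(1.6) for the illustrative field xi(t,x) = -x.
# The flow is T_t(x) = x e^{-t}; if mu_0 = N(0,1), conservation of mass
# gives mu_t = N(0, e^{-2t}).
rng = np.random.default_rng(1)
particles = rng.normal(0.0, 1.0, size=200_000)   # sample mu_0

dt, t_final = 1e-3, 1.0
for _ in range(int(t_final / dt)):               # Euler step of dT/dt = xi(t, T)
    particles += dt * (-particles)

print(particles.var(), np.exp(-2 * t_final))     # the two numbers nearly agree
```

The small residual gap comes from the first-order time discretization and the Monte Carlo sampling of μ_0, not from the conservation law itself.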

Mass conservation formula. Let M be a C¹ manifold, T ∈ (0, +∞], and let ξ(t,x) be a (measurable) velocity field on [0,T) × M. Let (μ_t) be a time-dependent family of probability measures on M

    dX_t = √2 σ(t, X_t) dB_t    (0 ≤ t < T).    (1.7)

Then the following two statements are equivalent:

(i) μ = μ_t(dx) is a weak solution of the linear (diffusion) partial differential equation

    ∂_t μ = ∇_x · (σσ* ∇_x μ)

on [0,T) × M, where σ* stands for the transpose of σ;

(ii) μ_t = law(X_t) for all t ∈ [0,T), where X_t solves (1.7).

Example 1.5. In R^n, the solution of the heat equation with initial datum δ_0 is the law of X_t = B_{2t} (Brownian motion sped up by a factor 2).

Remark 1.6. Actually, there is a finer criterion for the diffusion equation to hold true: it is sufficient that the Ricci curvature at point x be bounded below by −C d(x_0, x)² g_x as x → ∞, where g_x is the metric at point x and x_0 is an arbitrary reference point. The exponent 2 here is sharp.

Exercise 1.7. Let M be a smooth compact manifold, equipped with its standard reference volume, and let ρ_0 be a smooth positive probability density on M. Let (ρ_t)_{t≥0} be the solution of the heat equation

    ∂_t ρ = Δρ.

Use (ρ_t) to construct a deterministic coupling of ρ_0 and ρ_1.

Hint: Rewrite the heat equation in the form of an equation of conservation of mass.

Appendix: Moser’s coupling

In this Appendix I shall promote Moser’s technique for coupling smooth positive probability measures; it is simple, elegant and powerful, and plays a prominent role in geometry. It is not limited to compact manifolds, but does require assumptions about the behavior at infinity.

Let M be a smooth n-dimensional Riemannian manifold, equipped with a reference probability measure ν(dx) = e^{−V(x)} vol(dx), where

V ∈ C¹(M). Let μ_0 = ρ_0 ν, μ_1 = ρ_1 ν be two probability measures on M; assume for simplicity that ρ_0, ρ_1 are bounded below by a constant K > 0. Further assume that ρ_0 and ρ_1 are locally Lipschitz, and that the equation

    −(Δ − ∇V·∇) u = ρ_0 − ρ_1

can be solved for some u ∈ C^{1,1}_loc(M) (that is, ∇u is locally Lipschitz). Then, define a locally Lipschitz vector field

    ξ(t,x) = ∇u(x) / ((1−t) ρ_0(x) + t ρ_1(x)),

with associated flow (T_t(x))_{0≤t≤1}, and a family (μ_t)_{0≤t≤1} of probability

In [814], for the sake of consistency of the presentation I treated optimal coupling on R as a particular case of optimal coupling on R^n. (However, this has the drawback of involving subtle arguments.)

The Knothe–Rosenblatt coupling was introduced in 1952 by Rosenblatt [709], who suggested that it might be useful to “normalize” statistical data before applying a statistical test. In 1957, Knothe [523] rediscovered it for applications to the theory of convex bodies. It is quite likely that other people have discovered this coupling independently. An infinite-dimensional generalization was studied by Bogachev, Kolesnikov and Medvedev [134, 135].

FKG inequalities were introduced in [375], and have since then played a crucial role in statistical mechanics. Holley’s proof by coupling appears in [477]. Recently, Caffarelli [188] has revisited the subject in connection with optimal transport.

It was in 1965 that Moser proved his coupling theorem, for smooth compact manifolds without boundaries [640]; noncompact manifolds were later considered by Greene and Shiohama [432]. Moser himself also worked with Dacorogna on the more delicate case where the domain is an open set with boundary, and the transport is required to fix the boundary [270].

Strassen’s duality theorem is discussed e.g. in [814, Section 1.4].

The gluing lemma is due to several authors, starting with Vorob’ev in 1962 for finite sets. The modern formulation seems to have emerged around 1980, independently by Berkes and Philipp [101], Kallenberg, Thorisson, and maybe others. Refinements were discussed e.g. by de Acosta [273, Theorem A.1] (for marginals indexed by an arbitrary set) or Thorisson [781, Theorem 5.1]; see also the bibliographic comments in [317, p. 20]. For a proof of the statement in these notes, it is sufficient to consult Dudley [317, Theorem 1.1.10], or [814, Lemma 7.6]. A comment about terminology: I like the word “gluing” which gives a good indication of the construction, but many authors just talk about “composition” of plans.

The formula of change of variables for C¹ or Lipschitz changes of variables can be found in many textbooks, see e.g. Evans and Gariepy [331, Chapter 3]. The generalization to approximately differentiable maps is explained in Ambrosio, Gigli and Savaré [30, Section 5.5]. Such a generality is interesting in the context of optimal transportation, where changes of variables are often very rough (say BV, which means of bounded variation). In that context however, there is more structure:

For instance, changes of variables will typically be given by the gradient of a convex function in R^n, and on such a map one knows slightly more than on a general BV function, because convex functions are twice differentiable almost everywhere (Theorem 14.25 later in these notes). McCann [614] used this property to prove, by slightly more elementary means, the change of variables formula for a gradient of convex function; the proof is reproduced in [814, Theorem 4.8]. It was later generalized by Cordero-Erausquin, McCann and Schmuckenschläger to Riemannian manifolds [246], a case which again can be treated either as part of the general theory of BV changes of variables, or with the help of almost everywhere second derivatives of semiconcave functions.

The formula of conservation of mass is also called the method of characteristics for linear transport equations, and is described in a number of textbooks in partial differential equations, at least when the driving vector field is Lipschitz, see for instance Evans [327, Section 3.2]. An essentially equivalent statement is proven in [814, Theorem 5.34]. Treating vector fields that are only assumed to be locally Lipschitz is not so easy: see Ambrosio, Gigli and Savaré [30, Section 8.1]. The Lipschitz condition can be relaxed into a Sobolev or even a BV condition, but then the flow is determined only almost everywhere, and this becomes an extremely subtle problem, which has been studied by many authors since the pioneering work of DiPerna and Lions [304] at the beginning of the nineties. See Ambrosio [21] for recent progress and references. The version which is stated in these notes, with no regularity assumption, is due to Ambrosio and carefully proved in [30, Section 8.1]. In spite of its appealing and relatively natural character (especially in a probabilistic perspective), this is a very recent research result.

Note that, if T_t(x) is not uniquely determined by x, then the conservation equation starting with a given probability measure might admit several solutions.

A recent work by Lisini [565] addresses a generalization of the formula of conservation of mass in the setting of general Polish spaces. Of course, without any regularity assumption on the space it is impossible to speak of vector fields and partial differential equations; but it is still possible to consider paths in the space of probability measures, and random curves. Lisini’s results are most naturally expressed in the language of optimal transport distances; see the bibliographical notes for Chapter 7.

The diffusion formula can be obtained as a simple consequence of the Itô formula, which in the Euclidean setting can be found in any textbook on stochastic differential equations, e.g. [658]. It was recently the hundredth anniversary of the discovery of the diffusion formula by Einstein [322]; or rather rediscovery, since Bachelier already had obtained the main results at the turn of the twentieth century [251, 739]. (Some information about Bachelier’s life can be found online at sjepg.univ-fcomte.fr/sjepgbis/libre/bachelier/page01/page01.htm.) Fascinating tales about the Brownian motion can be read in Nelson’s unconventional book [648], especially Chapters 1–4. For the much more subtle Riemannian setting, one may consult Stroock [759], Hsu [483] and the references therein.

The Brownian motion on a smooth Riemannian manifold is always well-defined, even if the manifold has a wild behavior at infinity (the construction of the Brownian motion is purely local); but in the absence of a good control on the Ricci curvature, there might be several heat kernels, and the heat equation might not be uniquely solvable for a given initial datum. This corresponds to the possibility of a blow-up of the Brownian motion (i.e. the Brownian motion escapes to infinity) in finite time. All this was explained to me by Thalmaier. The sharp criterion Ric_x ≥ −C (1 + d(x_0, x)²) g_x for avoiding blow-up of the heat equation is based on comparison theorems for Laplace operators. In the version stated here it is due to Ichihara [486]; see also the book by Hackenbroch and Thalmaier [454, p. 544]. Nonexplosion criteria based on curvature have been studied by Gaffney, Yau, Hsu, Karp and Li, Davies, Takeda, Sturm, and Grigor’yan; for a detailed exposition, and many explanations, the reader can consult the survey by Grigor’yan [434, Section 9].

2 Three examples of coupling techniques

In this chapter I shall present three applications of coupling methods. The first one is classical and quite simple, the other two are more original but well-representative of the topics that will be considered later in these notes. The proofs are extremely variable in difficulty and will only be sketched here; see the references in the bibliographical notes for details.

Convergence of the Langevin process

Consider a particle subject to the force induced by a potential V ∈ C¹(R^n), a friction and a random white noise agitation. If X_t stands for the position of the particle at time t, m for its mass, λ for the friction coefficient, k for the Boltzmann constant and T for the temperature of the heat bath, then Newton’s equation of motion can be written

    m d²X_t/dt² = −∇V(X_t) − λm dX_t/dt + √(2λkTm) dB_t/dt,    (2.1)

where (B_t)_{t≥0} is a standard Brownian motion. This is a second-order (stochastic) differential equation, so it should come with initial conditions for both the position X and the velocity Ẋ.

Now consider a large cloud of particles evolving independently, according to (2.1); the question is whether the distribution of particles will converge to a definite limit as t → ∞. In other words: Consider the stochastic differential equation (2.1) starting from some initial distribution μ_0(dx dv) = law(X_0, Ẋ_0); is it true that law(X_t), or law(X_t, Ẋ_t), will converge to some given limit law as t → ∞?

Obviously, to solve this problem one has to make some assumptions on the potential V, which should prevent the particles from all escaping at infinity; for instance, we can make the very strong assumption that V is uniformly convex, i.e. there exists K > 0 such that the Hessian of V satisfies ∇²V ≥ K I_n. Some assumptions on the initial distribution might also be needed; for instance, it is natural to assume that the Hamiltonian has finite expectation at initial time:

    E ( |Ẋ_0|²/2 + V(X_0) ) < +∞.

Under these assumptions, it is true that there is exponential convergence to equilibrium, at least if V does not grow too wildly at infinity (for instance if the Hessian of V is also bounded above). However, I do not know of any simple method to prove this.

On the other hand, consider the limit where the friction coefficient is quite strong, and the motion of the particle is so slow that the acceleration term may be neglected in front of the others: then, up to resetting units, equation (2.1) becomes

    dX_t/dt = −∇V(X_t) + √2 dB_t/dt,    (2.2)

which is often called a Langevin process. Now, to study the convergence to equilibrium for (2.2) there is an extremely simple solution by coupling. Consider another random position (Y_t)_{t≥0} obeying the same equation as (2.2):

    dY_t/dt = −∇V(Y_t) + √2 dB_t/dt,    (2.3)

where the random realization of the Brownian motion is the same as in (2.2) (this is the coupling). The initial positions X_0 and Y_0 may be coupled in an arbitrary way, but it is possible to assume that they are independent. In any case, since they are driven by the same Brownian motion, X_t and Y_t will be correlated for t > 0.

Since B_t is not differentiable as a function of time, neither X_t nor Y_t is differentiable (equations (2.2) and (2.3) hold only in the sense of solutions of stochastic differential equations); but it is easily checked that α_t := X_t − Y_t is a continuously differentiable function of time, and

    dα_t/dt = −( ∇V(X_t) − ∇V(Y_t) ).
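The contraction produced by this coupling can be watched numerically. A minimal sketch, assuming the illustrative uniformly convex potential V(x) = x²/2 on R (so K = 1) and a plain Euler–Maruyama discretization of (2.2)–(2.3); the two copies are driven by the same Brownian increments, and their difference shrinks like e^{−Kt} no matter what the noise does.

```python
import numpy as np

# Coupling sketch for the Langevin processes (2.2)-(2.3), assuming the
# illustrative potential V(x) = x^2/2 (so grad V(x) = x and K = 1).
# Both copies receive the SAME Brownian increments: that is the coupling.
rng = np.random.default_rng(42)
dt, n_steps = 1e-3, 6000                 # integrate up to t = 6
X, Y = 5.0, -3.0                         # arbitrary coupled initial positions
alpha0 = abs(X - Y)

for _ in range(n_steps):
    dB = rng.normal(0.0, np.sqrt(dt))    # one shared Brownian increment
    X += -X * dt + np.sqrt(2.0) * dB     # Euler-Maruyama step for (2.2)
    Y += -Y * dt + np.sqrt(2.0) * dB     # same noise, so it cancels in X - Y
print(abs(X - Y), alpha0 * np.exp(-dt * n_steps))
```

For this linear drift the shared noise cancels exactly in the difference, so |X − Y| decays deterministically below the Gronwall bound α_0 e^{−Kt}.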

In particular,

    (d/dt) (|α_t|²/2) = −⟨ ∇V(X_t) − ∇V(Y_t), X_t − Y_t ⟩ ≤ −K |X_t − Y_t|² = −K |α_t|².

It follows by Gronwall’s lemma that

    |α_t|² ≤ e^{−2Kt} |α_0|².

Assume for simplicity that E|X_0|² and E|Y_0|² are finite. Then

    E |X_t − Y_t|² ≤ e^{−2Kt} E |X_0 − Y_0|² ≤ 2 e^{−2Kt} ( E|X_0|² + E|Y_0|² ).    (2.4)

In particular, X_t − Y_t converges to 0 almost surely, and this is independent of the distribution of Y_0.

This in itself would be essentially sufficient to guarantee the existence of a stationary distribution; but in any case, it is easy to check, by applying the diffusion formula, that

    ν(dy) = e^{−V(y)} dy / Z

(where Z = ∫ e^{−V} is a normalization constant) is stationary: If law(Y_0) = ν, then also law(Y_t) = ν. Then (2.4) easily implies that μ_t := law(X_t) converges weakly to ν; in addition, the convergence is exponentially fast.

Euclidean isoperimetry

Among all subsets of R^n with given surface, which one has the largest volume? To simplify the problem, let us assume that we are looking for a bounded open set Ω ⊂ R^n with, say, Lipschitz boundary ∂Ω, and that the measure of ∂Ω is given; then the problem is to maximize the measure of Ω. To measure ∂Ω one should use the (n−1)-dimensional Hausdorff measure, and to measure Ω the n-dimensional Hausdorff measure, which of course is the same as the Lebesgue measure in R^n.

It has been known, at least since ancient times, that the solution to this “isoperimetric problem” is the ball. A simple scaling argument shows that this statement is equivalent to the Euclidean isoperimetric inequality:

    |∂Ω| / |Ω|^{(n−1)/n} ≥ |∂B| / |B|^{(n−1)/n},

where B is any ball.

There are very many proofs of the isoperimetric inequality, and many refinements as well. It is less known that there is a proof by coupling. Here is a sketch of the argument, forgetting about regularity issues.

Let B be a ball such that |∂B| = |∂Ω|. Consider a random point X distributed uniformly in Ω, and a random point Y distributed uniformly in B. Introduce the Knothe–Rosenblatt coupling of X and Y: This is a deterministic coupling of the form Y = T(X), such that, at each x ∈ Ω, the Jacobian matrix ∇T(x) is triangular with nonnegative diagonal entries. Since the law of X (resp. Y) has uniform density 1/|Ω| (resp. 1/|B|), the change of variables formula yields

    ∀ x ∈ Ω,    det(∇T(x)) = |B| / |Ω|.    (2.5)

Since ∇T is triangular, the Jacobian determinant of T is det(∇T) = ∏ λ_i, and its divergence is ∇·T = ∑ λ_i, where the nonnegative numbers (λ_i)_{1≤i≤n} are the eigenvalues of ∇T. Then the arithmetic–geometric inequality (∏ λ_i)^{1/n} ≤ (∑ λ_i)/n becomes

    (det ∇T(x))^{1/n} ≤ (∇·T(x)) / n.

Combining this with (2.5) results in

    1 / |Ω|^{1/n} ≤ (∇·T(x)) / (n |B|^{1/n}).

Integrate this over Ω and then apply the divergence theorem:

    |Ω|^{(n−1)/n} ≤ (1 / (n |B|^{1/n})) ∫_Ω (∇·T)(x) dx = (1 / (n |B|^{1/n})) ∫_{∂Ω} (T·σ) dH^{n−1},    (2.6)

where σ is the unit outer normal to Ω and H^{n−1} is the (n−1)-dimensional Hausdorff measure (restricted to ∂Ω). But T is valued in B (which, by scale invariance of the isoperimetric ratio, may be assumed to be of radius 1), so |T·σ| ≤ 1, and (2.6) implies

    |Ω|^{(n−1)/n} ≤ |∂Ω| / (n |B|^{1/n}).

Since |∂Ω| = |∂B| = n|B|, the right-hand side is actually |B|^{1−1/n}, so the volume of Ω is indeed bounded by the volume of B. This concludes the proof.

The above argument suggests the following problem:

Open Problem 2.1. Can one devise an optimal coupling between sets (in the sense of a coupling between the uniform probability measures on these sets) in such a way that the total cost of the coupling decreases under some evolution converging to balls, such as mean curvature motion?

Caffarelli’s log-concave perturbation theorem

The previous example was about transporting a set to another; the present one is in some sense about transporting a whole space to another.

It is classical in geometry to compare a space X with a “model space” M that has nice properties and is, e.g., less curved than X. A general principle is that certain inequalities which hold true on the model space can automatically be “transported” to X. The theorem discussed in this section is a striking illustration of this idea.

Let F, G, H, J, L be nonnegative continuous functions on R, with H and J nondecreasing, and let ℓ ∈ R. For a given measure μ on R^n, let λ[μ] be the largest λ ≥ 0 such that, for all Lipschitz functions h : R^n → R,

    ∫_{R^n} G(h) dμ = ℓ  ⟹  F( ∫_{R^n} L(h) dμ ) ≤ H( (1/λ) ∫_{R^n} J(|∇h|) dμ ).    (2.7)

Functional inequalities of the form (2.7) are variants of Sobolev inequalities; many of them are well-known and useful. Caffarelli’s theorem states that they can only be improved by log-concave perturbation of the Gaussian distribution. More precisely, if γ is the standard Gaussian measure and μ = e^{−v} γ is another probability measure, with v convex, then

    λ[μ] ≥ λ[γ].

His proof is a simple consequence of the following remarkable fact, which I shall call Caffarelli’s log-concave perturbation theorem: If dμ/dγ is log-concave, then there exists a 1-Lipschitz change of variables from the measure γ to the measure μ. In other words, there is a deterministic coupling (X, Y = C(X)) of (γ, μ), such that |C(x) − C(y)| ≤ |x − y|, or equivalently |∇C| ≤ 1 (almost everywhere). It follows in particular that

    |∇(h∘C)| ≤ |(∇h)∘C|,    (2.8)

whatever the function h.

Now it is easy to understand why the existence of the map C implies (2.7): On the one hand, the definition of change of variables implies

    ∫ G(h) dμ = ∫ G(h∘C) dγ,    ∫ L(h) dμ = ∫ L(h∘C) dγ;

on the other hand, by the definition of change of variables again, inequality (2.8) and the nondecreasing property of J,

    ∫_{R^n} J(|∇h|) dμ = ∫ J(|(∇h)∘C|) dγ ≥ ∫ J(|∇(h∘C)|) dγ.

Thus, inequality (2.7) is indeed “transported” from the space (R^n, γ) to the space (R^n, μ).

Bibliographical notes

It is very classical to use coupling arguments to prove convergence to equilibrium for stochastic differential equations and Markov chains; many examples are described by Rachev and Rüschendorf [696] and Thorisson [781]. Actually, the standard argument found in textbooks to prove the convergence to equilibrium for a positive aperiodic ergodic Markov chain is a coupling argument (but the null case can also be treated in a similar way, as I learnt from Thorisson). Optimal couplings are often well adapted to such situations, but definitely not the only ones to apply.

The coupling method is not limited to systems of independent particles, and sometimes works in presence of correlations, for instance if the law satisfies a nonlinear diffusion equation. This is exemplified in works

by Tanaka [777] on the spatially homogeneous Boltzmann equation with Maxwell molecules (the core of Tanaka’s argument is reproduced in my book [814, Section 7.5]), or in some recent papers [138, 214, 379, 590]. Cattiaux and Guillin [221] found a simple and elegant coupling argument to prove the exponential convergence for the law of the stochastic process

   dX_t = √2 dB_t − Ẽ ∇V(X_t − X̃_t) dt,

where X̃_t is an independent copy of X_t, the expectation Ẽ only bears on X̃_t, and V is assumed to be a uniformly convex C¹ potential on ℝⁿ satisfying V(−x) = V(x).

It is also classical to couple a system of particles with an auxiliary artificial system to study the limit when the number of particles becomes large. For the Vlasov equation in kinetic theory this was done by Dobrushin [309] and Neunzert [653] several decades ago. (The proof is reproduced in Spohn [757, Chapter 5], and also suggested as an exercise in my book [814, Problem 14].) Later Sznitman used this strategy in a systematic way for the propagation of chaos, and made it very popular; see e.g. his work on the Boltzmann equation [767] or his Saint-Flour lecture notes [768] and the many references included therein. In all these works, the “philosophy” is always the same: Introduce some nice coupling and see how it evolves in a certain asymptotic regime (say, either the time, or the number of particles, or both, go to infinity).

It is possible to treat the convergence to equilibrium for the complete system (2.1) by methods that are either analytic [301, 472, 816, 818] or probabilistic [55, 559, 606, 701], but all methods known to me are much more delicate than the simple coupling argument which works for (2.2). It is certainly a nice open problem to find an elementary coupling argument which applies to (2.1).
(The arguments in the above-mentioned probabilistic proofs ultimately rely on coupling methods via theorems of convergence for Markov chains, but in a quite indirect way.) Coupling techniques have also been used recently for proving rather spectacular uniqueness theorems for invariant measures in infinite dimension, see e.g. [321, 456, 457].

Classical references for the isoperimetric inequality and related topics are the books by Burago and Zalgaller [176] and Schneider [741], and the survey by Osserman [664]. Knothe [523] had the idea to use a “coupling” method to prove geometric inequalities, and Gromov [635, Appendix] applied this method to prove the Euclidean isoperimetric inequality. Trudinger [787] gave a closely related treatment of the same

inequality and some of its generalizations, by means of a clever use of the Monge–Ampère equation (which more or less amounts to the construction of an optimal coupling with quadratic cost function, as will be seen in Chapter 11). Cabré [182] found a surprising simplification of Trudinger’s method, based on the solution of just a linear elliptic equation. The “proof” which I gave in this chapter is a variation on Gromov’s argument; although it is not rigorous, there is no real difficulty in turning it into a full proof, as was done by Figalli, Maggi and Pratelli [369]. These authors actually prove much more, since they use this strategy to establish a sharp quantitative stability of the isoperimetric inequality (if the shape of a set departs from the optimal shape, then its isoperimetric ratio departs from the optimal ratio in a quantifiable way). In the same work one can find a very interesting comparison of the respective performances of the couplings obtained by the Knothe method and by the optimal transport method (the comparison turns very much to the advantage of optimal transport).

Other links between coupling and isoperimetric-type inequalities are presented in Chapter 6 of my book [814], the research paper [587], the review paper [586] and the bibliographical notes at the end of Chapters 18 and 21.

The construction of Caffarelli’s map C is easy, at least conceptually: The optimal coupling of the Gaussian measure γ with the measure μ = e^{−v} γ, when the cost function is the square of the Euclidean distance, will do the job. But proving that C is indeed 1-Lipschitz is much more of a sport, and involves some techniques from nonlinear partial differential equations [188]. An idea of the core of the proof is explained in [814, Problem 13]. It would be nice to find a softer argument.
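As a sanity check (my own computation, not part of the original text), one can verify the 1-Lipschitz property directly in the simplest case, where the log-concave perturbation is itself Gaussian:

```latex
% Take v(x) = a|x|^2/2 with a > 0 (so v is convex); then
%   \mu = e^{-v}\gamma / Z = N(0, (1+a)^{-1} I_n).
% The quadratic-cost optimal map between the centered Gaussians
% N(0, I_n) and N(0, \sigma^2 I_n) is the linear rescaling
% x \mapsto \sigma x, so here
\mathcal{C}(x) = \frac{x}{\sqrt{1+a}},
\qquad
\|\nabla \mathcal{C}\|_{\mathrm{op}} = \frac{1}{\sqrt{1+a}} \le 1,
% and the map is explicitly 1-Lipschitz, as predicted.
```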
Üstünel pointed out to me that, if v is convex and symmetric (v(−x) = v(x)), then the Moser transport T from γ to e^{−v} γ is contracting, in the sense that |T(x)| ≤ |x|; it is not clear however that T would be 1-Lipschitz.

Caffarelli’s theorem has many analytic and probabilistic applications, see e.g. [242, 413, 465]. There is an infinite-dimensional version by Feyel and Üstünel [361], where the Gaussian measure is replaced by the Wiener measure. Another variant was recently studied by Valdimarsson [801].

Like the present chapter, the lecture notes [813], written for a CIME Summer School in 2001, present some applications of optimal transport in various fields, with a slightly impressionistic style.

3 The founding fathers of optimal transport

Like many other research subjects in mathematics, the field of optimal transport was born several times. The first of these births occurred at the end of the eighteenth century, by way of the French geometer Gaspard Monge.

Monge was born in 1746 under the French Ancien Régime. Because of his outstanding skills, he was admitted to a military training school from which he should have been excluded because of his modest origin. He invented descriptive geometry on his own, and the power of the method was so apparent that he was appointed professor at the age of 22, with the understanding that his theory would remain a military secret, for the exclusive use of higher officers. He later was one of the most ardent warrior scientists of the French Revolution, served as a professor under several regimes, escaped a death sentence pronounced during the Terror, and became one of Napoleon’s closest friends. He taught at the École Normale Supérieure and the École Polytechnique in Paris. Most of his work was devoted to geometry.

In 1781 he published one of his famous works, Mémoire sur la théorie des déblais et des remblais (a “déblai” is an amount of material that is extracted from the earth or a mine; a “remblai” is a material that is input into a new construction). The problem considered by Monge is as follows: Assume you have a certain amount of soil to extract from the ground and transport to places where it should be incorporated in a construction (see Figure 3.1). The places where the material should be extracted, and the ones where it should be transported to, are all known. But the assignment has to be determined: To which destination should one send the material that has been extracted at a certain place? The answer does matter because transport is costly, and you want to

minimize the total cost. Monge assumed that the transport cost of one unit of mass along a certain distance was given by the product of the mass by the distance.

Fig. 3.1. Monge’s problem of déblais and remblais (mass is moved from a point x of the déblais to a point y of the remblais).

Nowadays there is a Monge street in Paris, and therein one can find an excellent bakery called Le Boulanger de Monge. To acknowledge this, and to illustrate how Monge’s problem can be recast in an economic perspective, I shall express the problem as follows. Consider a large number of bakeries, producing loaves, that should be transported each morning to cafés where consumers will eat them. The amount of bread that can be produced at each bakery, and the amount that will be consumed at each café are known in advance, and can be modeled as probability measures (there is a “density of production” and a “density of consumption”) on a certain space, which in our case would be Paris (equipped with the natural metric such that the distance between two points is the length of the shortest path joining them). The problem is to find in practice where each unit of bread should go (see Figure 3.2), in such a way as to minimize the total transport cost. So Monge’s problem really is the search of an optimal coupling; and to be more precise, he was looking for a deterministic optimal coupling.

Fig. 3.2. Economic illustration of Monge’s problem: squares stand for production units, circles for consumption places.
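In the finite case, the bakery–café formulation is just a small assignment problem. As a toy sketch (entirely mine, not from the text), place three bakeries and three cafés on a line, each producing or consuming mass 1/3; for Monge’s cost |x − y|, the one-dimensional optimal coupling simply matches production to consumption in increasing order of position, which brute force over all assignments confirms.

```python
from itertools import permutations

# Toy Monge problem on a line: three bakeries and three cafes,
# each with mass 1/3; cost = mass * distance.
bakeries = [0.0, 2.0, 5.0]
cafes = [1.0, 3.0, 4.0]
mass = 1.0 / 3.0

def plan_cost(assignment):
    """Total cost when bakery i ships its mass to cafes[assignment[i]]."""
    return sum(mass * abs(b - cafes[j]) for b, j in zip(bakeries, assignment))

# For the cost |x - y| on the line, the monotone matching
# (leftmost bakery -> leftmost cafe, and so on) is optimal.
monotone = tuple(range(len(cafes)))  # positions are already sorted
best = min(permutations(range(len(cafes))), key=plan_cost)

print(best == monotone)                            # True
print(plan_cost(monotone) == plan_cost(best))      # True
```

Here the monotone assignment sends 0 → 1, 2 → 3, 5 → 4, for a total cost of 1; every other assignment is strictly worse.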

Monge studied the problem in three dimensions for a continuous distribution of mass. Guided by his beautiful geometric intuition, he made the important observation that transport should go along straight lines that would be orthogonal to a family of surfaces. This study led him to the discovery of lines of curvature, a concept that by itself was a great contribution to the geometry of surfaces. His ideas were developed by Charles Dupin and later by Paul Appell. By current mathematical standards, all these arguments were flawed, yet it certainly would be worth looking up all these problems with modern tools.

Much later Monge’s problem was rediscovered by the Russian mathematician Leonid Vitaliyevich Kantorovich. Born in 1912, Kantorovich was a very gifted mathematician who made his reputation as a first-class researcher at the age of 18, and earned a position of professor at just the same age as Monge had. He worked in many areas of mathematics, with a strong taste for applications in economics, and later theoretical computer science. In 1938 a laboratory consulted him for the solution of a certain optimization problem, which he found out was representative of a whole class of linear problems arising in various areas of economics. Motivated by this discovery, he developed the tools of linear programming, which later became prominent in economics. The publication of some of his most important works was delayed because of the great care with which Soviet authorities of the time handled the divulgence of scientific research related to economics. In fact (and this is another common point with Monge) for many years it was strictly forbidden for Kantorovich to publicly discuss some of his main discoveries. In the end his work became well-known, and in 1975 he was awarded the Nobel Prize for economics, jointly with Tjalling Koopmans, “for their contributions to the theory of optimum allocation of resources”.
In the case that is of direct interest for us, namely the problem of optimal coupling, Kantorovich stated and proved, by means of func- tional analytical tools, a duality theorem that would play a crucial role later. He also devised a convenient notion of distance between prob- ability measures: the distance between two measures should be the optimal transport cost from one to the other, if the cost is chosen as the distance function. This distance between probability measures is nowadays called the Kantorovich–Rubinstein distance, and has proven to be particularly flexible and useful.

It was only several years after his main results that Kantorovich made the connection with Monge’s work. The problem of optimal coupling has since then been called the Monge–Kantorovich problem.

Throughout the second half of the twentieth century, optimal coupling techniques and variants of the Kantorovich–Rubinstein distance (nowadays often called Wasserstein distances, among other denominations) were used by statisticians and probabilists. The “basis” space could be finite-dimensional or infinite-dimensional: For instance, optimal couplings give interesting notions of distance between probability measures on path spaces. Noticeable contributions from the seventies are due to Roland Dobrushin, who used such distances in the study of particle systems; and to Hiroshi Tanaka, who applied them to study the time-behavior of a simple variant of the Boltzmann equation. By the mid-eighties, specialists of the subject, like Svetlozar Rachev or Ludger Rüschendorf, were in possession of a large library of ideas, tools, techniques and applications related to optimal transport.

During that time, reparametrization techniques (yet another word for change of variables) were used by many researchers working on inequalities involving volumes or integrals. Only later would it be understood that optimal transport often provides useful reparametrizations.

At the end of the eighties, three directions of research emerged independently and almost simultaneously, which completely reshaped the whole picture of optimal transport.

One of them was John Mather’s work on Lagrangian dynamical systems. Action-minimizing curves are basic important objects in the theory of dynamical systems, and the construction of closed action-minimizing curves satisfying certain qualitative properties is a classical problem.
By the end of the eighties, Mather found it convenient to study not only action-minimizing curves, but action-minimizing stationary measures in phase space. Mather’s measures are a generalization of action-minimizing curves, and they solve a variational problem which in effect is a Monge–Kantorovich problem. Under some conditions on the Lagrangian, Mather proved a celebrated result according to which (roughly speaking) certain action-minimizing measures are automatically concentrated on Lipschitz graphs. As we shall understand in Chapter 8, this problem is intimately related to the construction of a deterministic optimal coupling.

The second direction of research came from the work of Yann Brenier. While studying problems in incompressible fluid mechanics, Brenier needed to construct an operator that would act like the projection on the set of measure-preserving mappings in an open set (in probabilistic language, measure-preserving mappings are deterministic couplings of the Lebesgue measure with itself). He understood that he could do so by introducing an optimal coupling: If u is the map for which one wants to compute the projection, introduce a coupling of the Lebesgue measure L with u_# L. This study revealed an unexpected link between optimal transport and fluid mechanics; at the same time, by pointing out the relation with the theory of Monge–Ampère equations, Brenier attracted the attention of the community working on partial differential equations.

The third direction of research, certainly the most surprising, came from outside mathematics. Mike Cullen was part of a group of meteorologists with a well-developed mathematical taste, working on semi-geostrophic equations, used in meteorology for the modeling of atmospheric fronts. Cullen and his collaborators showed that a certain famous change of unknown due to Brian Hoskins could be re-interpreted in terms of an optimal coupling problem, and they identified the minimization property as a stability condition. A striking outcome of this work was that optimal transport could arise naturally in partial differential equations which seemed to have nothing to do with it.

All three contributions emphasized (in their respective domains) that important information can be gained by a qualitative description of optimal transport. These new directions of research attracted various mathematicians (among the first, Luis Caffarelli, Craig Evans, Wilfrid Gangbo, Robert McCann, and others), who worked on a better description of the structure of optimal transport and found other applications.
An important conceptual step was accomplished by Felix Otto, who discovered an appealing formalism introducing a differential point of view in optimal transport theory. This opened the way to a more geo- metric description of the space of probability measures, and connected optimal transport to the theory of diffusion equations, thus leading to a rich interplay of geometry, functional analysis and partial differential equations. Nowadays optimal transport has become a thriving industry, involv- ing many researchers and many trends. Apart from meteorology, fluid mechanics and diffusion equations, it has also been applied to such di- verse topics as the collapse of sandpiles, the matching of images, and the design of networks or reflector antennas. My book, Topics in Optimal

Transportation, written between 2000 and 2003, was the first attempt to present a synthetic view of the modern theory. Since then the field has grown much faster than I expected, and it was never so active as it is now.

Bibliographical notes

Before the twentieth century, the main references for the problem of “déblais et remblais” are the memoirs by Monge [636], Dupin [319] and Appell [42]. Besides achieving important mathematical results, Monge and Dupin were strongly committed to the development of society, and it is interesting to browse some of their writings about economics and industry (a list can be found online at gallica.bnf.fr). A lively account of Monge’s life and political commitments can be found in Bell’s delightful treatise, Men of Mathematics [80, Chapter 12]. It seems however that Bell did dramatize the story a bit, at the expense of accuracy and neutrality. A more cold-blooded biography of Monge was written by de Launay [277]. Considered as one of the greatest geologists of his time, not particularly sympathetic to the French Revolution, de Launay documented himself with remarkable rigor, going back to original sources whenever possible. Other biographies have been written since then by Taton [778, 779] and Aubry [50].

Monge originally formulated his transport problem in Euclidean space for the cost function c(x, y) = |x − y|; he probably had no idea of the extreme difficulty of a rigorous treatment. It was only in 1979 that Sudakov [765] claimed a proof of the existence of a Monge transport for general probability densities with this particular cost function. But his proof was not completely correct, and was amended much later by Ambrosio [20]. In the meantime, alternative rigorous proofs had been devised, first by Evans and Gangbo [330] (under rather strong assumptions on the data), then by Trudinger and Wang [791], and Caffarelli, Feldman and McCann [190].
Kantorovich defined linear programming in [499], introduced his minimization problem and duality theorem in [500], and in [501] applied his theory to the problem of optimal transport; this note can be considered as the act of birth of the modern formulation of optimal transport. Later he made the link with Monge’s problem in [502]. His major work

in economics is the book [503], including a reproduction of [499]. Another important contribution is a study of numerical schemes based on linear programming, joint with his student Gavurin [505]. Kantorovich wrote a short autobiography for his Nobel Prize [504]. Online at www.math.nsc.ru/LBRT/g2/english/ssk/legacy.html are some comments by Kutateladze, who edited his mathematical works. A recent special issue of the Journal of Mathematical Sciences, edited by Vershik, was devoted to Kantorovich [810]; this reference contains translations of [501] and [502], as well as much valuable information about the personality of Kantorovich, and the genesis and impact of his ideas in mathematics, economics and computer science. In another historical note [808] Vershik recollects memories of Kantorovich and tells some tragicomical stories illustrating the incredible ideological pressure put on him and other scientists by Soviet authorities at the time.

The “classical” probabilistic theory of optimal transport is exhaustively reviewed by Rachev and Rüschendorf [696, 721]; most notable applications include limit theorems for various random processes. Relations with game theory, economics, statistics, and hypothesis testing are also common (among many references see e.g. [323, 391]).

Mather introduced minimizing measures in [600], and proved his Lipschitz graph theorem in [601]. The explicit connection with the Monge–Kantorovich problem came only recently [105]: see Chapter 8.

Tanaka’s contributions to kinetic theory go back to the mid-seventies [644, 776, 777]. His line of research was later taken up by Toscani and collaborators [133, 692]; these papers constituted my first contact with the optimal transport problem. More recent developments in the kinetic theory of granular media appear for instance in [138].

Brenier announced his main results in a short note [154], then published detailed proofs in [156].
Chapter 3 in [814] is entirely devoted to Brenier’s polar factorization theorem (which includes the existence of the projection operator), its interpretation and consequences. For the sources of inspiration of Brenier, and various links between optimal transport and hydrodynamics, one may consult [155, 158, 159, 160, 163, 170]. Recent papers by Ambrosio and Figalli [24, 25] provide a complete and thorough rewriting of Brenier’s theory of generalized incompressible flows.

The semi-geostrophic system was introduced by Eliassen [325] and Hoskins [480, 481, 482]; it is very briefly described in [814, Problem 9, pp. 323–326]. Cullen and collaborators wrote many papers on the subject, see in particular [269]; see also the review article [263], the works by Cullen and Gangbo [266], Cullen and Feldman [265] or the recent book by Cullen [262].

Further links between optimal transport and other fields of mathematics (or physics) can be found in my book [814], or in the treatise by Rachev and Rüschendorf [696]. An important source of inspiration was the relation with the qualitative behavior of certain diffusive equations arising from gas dynamics; this link was discovered by Jordan, Kinderlehrer and Otto at the end of the nineties [493], and then explored by several authors [208, 209, 210, 211, 212, 213, 214, 216, 669, 671].

Below is a nonexhaustive list of some other unexpected applications. Relations with the modeling of sandpiles, as well as compression molding problems, are reviewed by Evans [328]; see also Feldman [353] (this is for the cost function c(x, y) = |x − y|). Applications of optimal transport to image processing and shape recognition are discussed by Gangbo and McCann [400], Ahmad [6], Angenent, Haker, Tannenbaum, and Zhu [462, 463], Chazal, Cohen-Steiner and Mérigot [224], and many other contributors from the engineering community (see e.g. [700, 713]). X.-J. Wang [834], and independently Glimm and Oliker [419] (around 2000 and 2002 respectively), discovered that the theoretical problem of designing reflector antennas could be recast in terms of optimal transport for the cost function c(x, y) = −log(1 − x·y) on S² (see footnote 1); see [402, 419, 660] for further work in the area, and [420] for another version of this problem involving two reflectors. Rubinstein and Wolansky adapted the strategy in [420] to study the optimal design of lenses [712]; and Gutiérrez and Huang to treat a refraction problem [453]. In his PhD Thesis, Bernot [108] made the link between optimal transport, irrigation and the design of networks.
Such topics were also considered by Santambrogio with various collaborators [152, 207, 731, 732, 733, 734]; in particular it is shown in [732] that optimal transport theory gives a rigorous basis to some variational constructions used by physicists and hydrologists to study river basin morphology [65, 706]. Buttazzo and collaborators [178, 179, 180] explored city planning via optimal transport. Brenier found a connection to the electrodynamic equations of Maxwell and related models in string theory [161, 162, 163, 164, 165, 166]. Frisch and collaborators

1 According to Oliker, the connection between the two-reflector problem (as formulated in [661]) and optimal transport is in fact much older, since it was first formulated in a 1993 conference in which he and Caffarelli were participating.

linked optimal transport to the problem of reconstruction of the “conditions of the initial Universe” [168, 382, 755]. (The publication of [382] in the prestigious generalist scientific journal Nature is a good indication of the current visibility of optimal transport outside mathematics.)

Relations of optimal transport with geometry, in particular Ricci curvature, will be explored in detail in Parts II and III of these notes.

Many generalizations and variants have been studied in the literature, such as the optimal matching [323], the optimal transshipment (see [696] for a discussion and list of references), the optimal transport of a fraction of the mass [192, 365], or the optimal coupling with more than two prescribed marginals [403, 525, 718, 723, 725]; I learnt from Strulovici that the latter problem has applications in contract theory.

In spite of this avalanche of works, one certainly should not regard optimal transport as a kind of miraculous tool, for “there are no miracles in mathematics”. In my opinion this abundance only reflects the fact that optimal transport is a simple, meaningful, natural and therefore universal concept.


Part I

Qualitative description of optimal transport


The first part of this course is devoted to the description and characterization of optimal transport under certain regularity assumptions on the measures and the cost function.

As a start, some general theorems about optimal transport plans are established in Chapters 4 and 5, in particular the Kantorovich duality theorem. The emphasis is on c-cyclically monotone maps, both in the statements and in the proofs. The assumptions on the cost function and the spaces will be very general.

From the Monge–Kantorovich problem one can derive natural distance functions on spaces of probability measures, by choosing the cost function as a power of the distance. The main properties of these distances are established in Chapter 6.

In Chapter 7 a time-dependent version of the Monge–Kantorovich problem is investigated, which leads to an interpolation procedure between probability measures, called displacement interpolation. The natural assumption is that the cost function derives from a Lagrangian action, in the sense of classical mechanics; still (almost) no smoothness is required at that level. In Chapter 8 I shall make further assumptions of smoothness and convexity, and recover some regularity properties of the displacement interpolant by a strategy due to Mather.

Then in Chapters 9 and 10 it is shown how to establish the existence of deterministic optimal couplings, and characterize the associated transport maps, again under adequate regularity and convexity assumptions. The change of variables formula is considered in Chapter 11. Finally, in Chapter 12 I shall discuss the regularity of the transport map, which in general is not smooth.

The main results of this part are synthesized and summarized in Chapter 13. A good understanding of this chapter is sufficient to go through Part II of this course.


4 Basic properties

Existence

The first good thing about optimal couplings is that they exist:

Theorem 4.1 (Existence of an optimal coupling). Let (X, μ) and (Y, ν) be two Polish probability spaces; let a : X → ℝ ∪ {−∞} and b : Y → ℝ ∪ {−∞} be two upper semicontinuous functions such that a ∈ L¹(μ), b ∈ L¹(ν). Let c : X × Y → ℝ ∪ {+∞} be a lower semicontinuous cost function, such that c(x, y) ≥ a(x) + b(y) for all x, y. Then there is a coupling of (μ, ν) which minimizes the total cost E c(X, Y) among all possible couplings (X, Y).

Remark 4.2. The lower bound assumption on c guarantees that the expected cost E c(X, Y) is well-defined in ℝ ∪ {+∞}. In most cases of applications (but not all) one may choose a = 0, b = 0.

The proof relies on basic variational arguments involving the topology of weak convergence (i.e. the topology imposed by bounded continuous test functions). There are two key properties to check: (a) lower semicontinuity, (b) compactness. These issues are taken care of respectively in Lemmas 4.3 and 4.4 below, which will be used again in the sequel. Before going on, I recall Prokhorov’s theorem: If X is a Polish space, then a set P ⊂ P(X) is precompact for the weak topology if and only if it is tight, i.e. for any ε > 0 there is a compact set K_ε such that μ[X \ K_ε] ≤ ε for all μ ∈ P.

Lemma 4.3 (Lower semicontinuity of the cost functional). Let X and Y be two Polish spaces, and c : X × Y → ℝ ∪ {+∞} a lower

semicontinuous cost function. Let h : X × Y → ℝ ∪ {−∞} be an upper semicontinuous function such that c ≥ h. Let (π_k)_{k∈ℕ} be a sequence of probability measures on X × Y, converging weakly to some π ∈ P(X × Y), in such a way that h ∈ L¹(π_k), h ∈ L¹(π), and

   ∫_{X×Y} h dπ_k −→ ∫_{X×Y} h dπ   as k → ∞.

Then

   ∫_{X×Y} c dπ ≤ lim inf_{k→∞} ∫_{X×Y} c dπ_k.

In particular, if c is nonnegative, then F : π ↦ ∫ c dπ is lower semicontinuous on P(X × Y), equipped with the topology of weak convergence.

Lemma 4.4 (Tightness of transference plans). Let X and Y be two Polish spaces. Let P ⊂ P(X) and Q ⊂ P(Y) be tight subsets of P(X) and P(Y) respectively. Then the set Π(P, Q) of all transference plans whose marginals lie in P and Q respectively, is itself tight in P(X × Y).

Proof of Lemma 4.3. Replacing c by c − h, we may assume that c is a nonnegative lower semicontinuous function. Then c can be written as the pointwise limit of a nondecreasing family (c_ℓ)_{ℓ∈ℕ} of continuous real-valued functions. By monotone convergence,

   ∫ c dπ = lim_{ℓ→∞} ∫ c_ℓ dπ = lim_{ℓ→∞} lim_{k→∞} ∫ c_ℓ dπ_k ≤ lim inf_{k→∞} ∫ c dπ_k.  ⊓⊔

Proof of Lemma 4.4. Let μ ∈ P, ν ∈ Q, and π ∈ Π(μ, ν). By assumption, for any ε > 0 there is a compact set K_ε ⊂ X, independent of the choice of μ in P, such that μ[X \ K_ε] ≤ ε; and similarly there is a compact set L_ε ⊂ Y, independent of the choice of ν in Q, such that ν[Y \ L_ε] ≤ ε. Then, for any coupling (X, Y) of (μ, ν),

   P[(X, Y) ∉ K_ε × L_ε] ≤ P[X ∉ K_ε] + P[Y ∉ L_ε] ≤ 2ε.

The desired result follows since this bound is independent of the coupling, and K_ε × L_ε is compact in X × Y.  ⊓⊔

Proof of Theorem 4.1. Since X is Polish, {μ} is tight in P(X); similarly, {ν} is tight in P(Y). By Lemma 4.4, Π(μ, ν) is tight in P(X × Y), and

by Prokhorov’s theorem this set has a compact closure. By passing to the limit in the equation for marginals, we see that Π(μ, ν) is closed, so it is in fact compact.

Then let (π_k)_{k∈ℕ} be a sequence of probability measures on X × Y, such that ∫ c dπ_k converges to the infimum transport cost. Extracting a subsequence if necessary, we may assume that π_k converges to some π ∈ Π(μ, ν). The function h : (x, y) ↦ a(x) + b(y) lies in L¹(π_k) and in L¹(π), and c ≥ h by assumption; moreover, ∫ h dπ_k = ∫ h dπ = ∫ a dμ + ∫ b dν; so Lemma 4.3 implies

   ∫ c dπ ≤ lim inf_{k→∞} ∫ c dπ_k.

Thus π is minimizing.  ⊓⊔

Remark 4.5. This existence theorem does not imply that the optimal cost is finite. It might be that all transport plans lead to an infinite total cost, i.e. ∫ c dπ = +∞ for all π ∈ Π(μ, ν). A simple condition to rule out this annoying possibility is

   ∫ c(x, y) dμ(x) dν(y) < +∞,

which guarantees that at least the independent coupling has finite total cost. In the sequel, I shall sometimes make the stronger assumption

   c(x, y) ≤ c_X(x) + c_Y(y),   (c_X, c_Y) ∈ L¹(μ) × L¹(ν),

which implies that any coupling has finite total cost, and has other nice consequences (see e.g. Theorem 5.10).

Restriction property

The second good thing about optimal couplings is that any sub-coupling is still optimal. In words: If you have an optimal transport plan, then any induced sub-plan (transferring part of the initial mass to part of the final mass) has to be optimal too; otherwise you would be able to lower the cost of the sub-plan, and as a consequence the cost of the whole plan. This is the content of the next theorem.

Theorem 4.6 (Optimality is inherited by restriction). Let $(\mathcal{X},\mu)$ and $(\mathcal{Y},\nu)$ be two Polish spaces, $a \in L^1(\mu)$, $b \in L^1(\nu)$, and let $c : \mathcal{X}\times\mathcal{Y} \to \mathbb{R}\cup\{+\infty\}$ be a measurable cost function such that $c(x,y) \ge a(x) + b(y)$ for all $x,y$; let $C(\mu,\nu)$ be the optimal transport cost from $\mu$ to $\nu$. Assume that $C(\mu,\nu) < +\infty$ and let $\pi \in \Pi(\mu,\nu)$ be an optimal transport plan. Let $\tilde\pi$ be a nonnegative measure on $\mathcal{X}\times\mathcal{Y}$, such that $\tilde\pi \le \pi$ and $\tilde\pi[\mathcal{X}\times\mathcal{Y}] > 0$. Then the probability measure
\[
\pi' := \frac{\tilde\pi}{\tilde\pi[\mathcal{X}\times\mathcal{Y}]}
\]
is an optimal transference plan between its marginals $\mu'$ and $\nu'$.

Moreover, if $\pi$ is the unique optimal transference plan between $\mu$ and $\nu$, then $\pi'$ is also the unique optimal transference plan between $\mu'$ and $\nu'$.

Example 4.7. If $(X,Y)$ is an optimal coupling of $(\mu,\nu)$, and $Z \subset \mathcal{X}\times\mathcal{Y}$ is such that $\mathbb{P}\big[(X,Y) \in Z\big] > 0$, then the pair $(X,Y)$, conditioned to lie in $Z$, is an optimal coupling of $(\mu',\nu')$, where $\mu'$ is the law of $X$ conditioned by the event "$(X,Y) \in Z$", and $\nu'$ is the law of $Y$ conditioned by the same event.

Proof of Theorem 4.6. Assume that $\pi'$ is not optimal; then there exists $\pi''$ such that
\[
(\mathrm{proj}_{\mathcal{X}})_\#\,\pi'' = (\mathrm{proj}_{\mathcal{X}})_\#\,\pi' = \mu', \qquad (\mathrm{proj}_{\mathcal{Y}})_\#\,\pi'' = (\mathrm{proj}_{\mathcal{Y}})_\#\,\pi' = \nu', \tag{4.1}
\]
yet
\[
\int c(x,y)\,d\pi''(x,y) < \int c(x,y)\,d\pi'(x,y). \tag{4.2}
\]
Then consider
\[
\hat\pi := (\pi - \tilde\pi) + \tilde Z\,\pi'', \tag{4.3}
\]
where $\tilde Z = \tilde\pi[\mathcal{X}\times\mathcal{Y}] > 0$. Clearly, $\hat\pi$ is a nonnegative measure. On the other hand, it can be written as
\[
\hat\pi = \pi + \tilde Z\,(\pi'' - \pi');
\]
then (4.1) shows that $\hat\pi$ has the same marginals as $\pi$, while (4.2) implies that it has a lower transport cost than $\pi$. (Here I use the fact that the total cost is finite.) This contradicts the optimality of $\pi$. The conclusion is that $\pi'$ is in fact optimal.

It remains to prove the last statement of Theorem 4.6. Assume that $\pi$ is the unique optimal transference plan between $\mu$ and $\nu$; and let $\pi''$ be any optimal transference plan between $\mu'$ and $\nu'$. Define again $\hat\pi$ by (4.3). Then $\hat\pi$ has the same cost as $\pi$, so $\hat\pi = \pi$, which implies that $\tilde Z\,\pi'' = \tilde\pi$, i.e. $\pi'' = \pi'$. ⊓⊔

Convexity properties

The following estimates are of constant use:

Theorem 4.8 (Convexity of the optimal cost). Let $\mathcal{X}$ and $\mathcal{Y}$ be two Polish spaces, let $c : \mathcal{X}\times\mathcal{Y} \to \mathbb{R}\cup\{+\infty\}$ be a lower semicontinuous function, and let $C$ be the associated optimal transport cost functional on $P(\mathcal{X}) \times P(\mathcal{Y})$. Let $(\Theta,\lambda)$ be a probability space, and let $\mu_\theta$, $\nu_\theta$ be two measurable functions defined on $\Theta$, with values in $P(\mathcal{X})$ and $P(\mathcal{Y})$ respectively. Assume that $c(x,y) \ge a(x) + b(y)$, where $a \in L^1(d\mu_\theta\,\lambda(d\theta))$, $b \in L^1(d\nu_\theta\,\lambda(d\theta))$. Then
\[
C\Big( \int_\Theta \mu_\theta\,\lambda(d\theta),\; \int_\Theta \nu_\theta\,\lambda(d\theta) \Big) \;\le\; \int_\Theta C(\mu_\theta,\nu_\theta)\,\lambda(d\theta).
\]

Proof of Theorem 4.8. First notice that $a \in L^1(\mu_\theta)$, $b \in L^1(\nu_\theta)$ for $\lambda$-almost all values of $\theta$. For each such $\theta$, Theorem 4.1 guarantees the existence of an optimal transport plan $\pi_\theta \in \Pi(\mu_\theta,\nu_\theta)$, for the cost $c$. Then $\pi := \int \pi_\theta\,\lambda(d\theta)$ has marginals $\mu := \int \mu_\theta\,\lambda(d\theta)$ and $\nu := \int \nu_\theta\,\lambda(d\theta)$. Admitting temporarily Corollary 5.22, we may assume that $\pi_\theta$ is a measurable function of $\theta$. So
\begin{align*}
C(\mu,\nu) &\le \int_{\mathcal{X}\times\mathcal{Y}} c(x,y)\,\pi(dx\,dy) \\
&= \int_{\mathcal{X}\times\mathcal{Y}} c(x,y) \Big( \int_\Theta \pi_\theta\,\lambda(d\theta) \Big)(dx\,dy) \\
&= \int_\Theta \Big( \int_{\mathcal{X}\times\mathcal{Y}} c(x,y)\,\pi_\theta(dx\,dy) \Big)\,\lambda(d\theta) \\
&= \int_\Theta C(\mu_\theta,\nu_\theta)\,\lambda(d\theta),
\end{align*}
and the conclusion follows. ⊓⊔
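The convexity inequality of Theorem 4.8 can be observed concretely in the finite setting, where the Monge–Kantorovich problem is a linear program (this is also the setting of Step 1 in the proof of Theorem 5.10 below). The following sketch is not part of the text: the function name `kantorovich_cost` and the particular marginals are illustrative choices of mine, and SciPy's `linprog` is assumed available. Here $\Theta = \{0,1\}$ and $\lambda$ is the uniform measure on $\Theta$, so the claim reduces to $C\big(\tfrac{\mu_0+\mu_1}{2}, \tfrac{\nu_0+\nu_1}{2}\big) \le \tfrac12\big(C(\mu_0,\nu_0) + C(\mu_1,\nu_1)\big)$:

```python
import numpy as np
from scipy.optimize import linprog

def kantorovich_cost(mu, nu, C):
    """Optimal cost min_pi sum_ij C[i,j] pi[i,j] over couplings pi of (mu, nu)."""
    n, m = C.shape
    A_eq = np.zeros((n + m, n * m))
    for i in range(n):
        A_eq[i, i * m:(i + 1) * m] = 1.0   # first marginal: sum_j pi[i,j] = mu[i]
    for j in range(m):
        A_eq[n + j, j::m] = 1.0            # second marginal: sum_i pi[i,j] = nu[j]
    res = linprog(C.ravel(), A_eq=A_eq, b_eq=np.concatenate([mu, nu]),
                  bounds=(0, None), method="highs")
    return res.fun

# Quadratic cost on a grid of 5 points in [0, 1].
x = np.linspace(0.0, 1.0, 5)
C = (x[:, None] - x[None, :]) ** 2
mu0 = np.array([1.0, 0, 0, 0, 0]); nu0 = np.array([0, 0, 0, 0, 1.0])
mu1 = np.array([0, 0, 1.0, 0, 0]); nu1 = np.array([0.5, 0, 0, 0, 0.5])

lhs = kantorovich_cost((mu0 + mu1) / 2, (nu0 + nu1) / 2, C)
rhs = (kantorovich_cost(mu0, nu0, C) + kantorovich_cost(mu1, nu1, C)) / 2
print(lhs <= rhs + 1e-9)  # True: convexity of the optimal cost
```

Averaging the marginals allows mass to be re-paired across the two subproblems (here the left-hand side is $0.375$ against $0.625$ on the right), which is exactly the mechanism behind the inequality: the glued plan $\int \pi_\theta\,\lambda(d\theta)$ is admissible but not necessarily optimal.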

Description of optimal plans

Obtaining more precise information about minimizers will be much more of a sport. Here is a short list of questions that one might ask:

• Is the optimal coupling unique? smooth in some sense?
• Is there a Monge coupling, i.e. a deterministic optimal coupling?
• Is there a geometrical way to characterize optimal couplings? Can one check in practice that a certain coupling is optimal?

About the second question: Why don't we try to apply the same reasoning as in the proof of Theorem 4.1? The problem is that the set of deterministic couplings is in general not compact; in fact, this set is often dense in the larger space of all couplings! So we may expect that the value of the infimum in the Monge problem coincides with the value of the minimum in the Kantorovich problem; but there is no a priori reason to expect the existence of a Monge minimizer.

Example 4.9. Let $\mathcal{X} = \mathcal{Y} = \mathbb{R}^2$, let $c(x,y) = |x-y|^2$, let $\mu$ be $H^1$ restricted to $\{0\}\times[-1,1]$, and let $\nu$ be $(1/2)\,H^1$ restricted to $\{-1,1\}\times[-1,1]$, where $H^1$ is the one-dimensional Hausdorff measure. Then there is a unique optimal transport, which for each point $(0,a)$ sends one half of the mass at $(0,a)$ to $(-1,a)$, and the other half to $(1,a)$. This is not a Monge transport, but it is easy to approximate it by (nonoptimal) deterministic transports (see Figure 4.1).

Fig. 4.1. The optimal plan, represented in the left image, consists in splitting the mass in the center into two halves and transporting mass horizontally. On the right the filled regions represent the lines of transport for a deterministic (without splitting of mass) approximation of the optimum.
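A discrete caricature of Example 4.9 shows the mass splitting numerically. In the sketch below (my own illustrative setup, not from the text; SciPy's `linprog` is assumed available), three sources on the segment $\{0\}\times\{-1,0,1\}$ feed six targets on the lines $x = \pm 1$, with quadratic cost. Each source atom has mass $1/3$ and each target atom mass $1/6$, so here no deterministic coupling exists at all, and the linear-programming optimizer splits every source half-and-half:

```python
import numpy as np
from scipy.optimize import linprog

# Sources on the vertical segment, targets on the two lines x = +-1.
src = np.array([(0.0, a) for a in (-1.0, 0.0, 1.0)])
tgt = np.array([(s, a) for s in (-1.0, 1.0) for a in (-1.0, 0.0, 1.0)])
mu = np.full(len(src), 1 / 3)
nu = np.full(len(tgt), 1 / 6)
C = ((src[:, None, :] - tgt[None, :, :]) ** 2).sum(-1)  # |x - y|^2

n, m = C.shape
A_eq = np.zeros((n + m, n * m))
for i in range(n):
    A_eq[i, i * m:(i + 1) * m] = 1.0   # marginal on the sources
for j in range(m):
    A_eq[n + j, j::m] = 1.0            # marginal on the targets
res = linprog(C.ravel(), A_eq=A_eq, b_eq=np.concatenate([mu, nu]),
              bounds=(0, None), method="highs")
plan = res.x.reshape(n, m)
print(np.round(plan, 3))  # each row sends 1/6 to (-1, a) and 1/6 to (1, a)
```

The unique optimum pairs each source only with the two targets at its own height, exactly as in Figure 4.1; any diagonal move would add the squared vertical offset to the cost.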

Bibliographical notes

Theorem 4.1 has probably been known from time immemorial; it is usually stated for nonnegative cost functions. Prokhorov's theorem is a most classical result that can be found e.g. in [120, Theorems 6.1 and 6.2], or in my own course on integration [819, Section VII-5].

Theorems of the form "infimum cost in the Monge problem = minimum cost in the Kantorovich problem" have been established by Gangbo [396, Appendix A], Ambrosio [20, Theorem 2.1], and Pratelli [687, Theorem B]. The most general results to this date are those which appear in Pratelli's work: Equality holds true if the source space $(\mathcal{X},\mu)$ is Polish without atoms, and the cost $c : \mathcal{X}\times\mathcal{Y} \to \mathbb{R}\cup\{+\infty\}$ is continuous, with the value $+\infty$ allowed. (In [687] the cost $c$ is bounded below, but it is sufficient that $c(x,y) \ge a(x) + b(y)$, where $a \in L^1(\mu)$ and $b \in L^1(\nu)$ are continuous.)


5 Cyclical monotonicity and Kantorovich duality

To go on, we should become acquainted with two basic concepts in the theory of optimal transport. The first one is a geometric property called cyclical monotonicity; the second one is the Kantorovich dual problem, which is another face of the original Monge–Kantorovich problem. The main result in this chapter is Theorem 5.10.

Definitions and heuristics

I shall start by explaining the concepts of cyclical monotonicity and Kantorovich duality in an informal way, sticking to the bakery analogy of Chapter 3. Assume you have been hired by a large consortium of bakeries and cafés, to be in charge of the distribution of bread from production units (bakeries) to consumption units (cafés). The locations of the bakeries and cafés, their respective production and consumption rates, are all determined in advance. You have written a transference plan, which says, for each bakery (located at) $x_i$ and each café $y_j$, how much bread should go each morning from $x_i$ to $y_j$.

As there are complaints that the transport cost associated with your plan is actually too high, you try to reduce it. For that purpose you choose a bakery $x_1$ that sends part of its production to a distant café $y_1$, and decide that one basket of bread will be rerouted to another café $y_2$, that is closer to $x_1$; thus you will gain $c(x_1,y_1) - c(x_1,y_2)$. Of course, now this results in an excess of bread in $y_2$, so one basket of bread arriving at $y_2$ (say, from bakery $x_2$) should in turn be rerouted to yet another café, say $y_3$. The process goes on and on until finally you

redirect a basket from some bakery $x_N$ to $y_1$, at which point you can stop since you have a new admissible transference plan (see Figure 5.1).

Fig. 5.1. An attempt to improve the cost by a cycle; solid arrows indicate the mass transport in the original plan, dashed arrows the paths along which a bit of mass is rerouted.

The new plan is (strictly) better than the older one if and only if
\[
c(x_1,y_2) + c(x_2,y_3) + \cdots + c(x_N,y_1) \;<\; c(x_1,y_1) + c(x_2,y_2) + \cdots + c(x_N,y_N).
\]
Thus, if you can find such cycles $(x_1,y_1),\ldots,(x_N,y_N)$ in your transference plan, certainly the latter is not optimal. Conversely, if you do not find them, then your plan cannot be improved (at least by the procedure described above) and it is likely to be optimal. This motivates the following definitions.

Definition 5.1 (Cyclical monotonicity). Let $\mathcal{X}$, $\mathcal{Y}$ be arbitrary sets, and $c : \mathcal{X}\times\mathcal{Y} \to (-\infty,+\infty]$ be a function. A subset $\Gamma \subset \mathcal{X}\times\mathcal{Y}$ is said to be $c$-cyclically monotone if, for any $N \in \mathbb{N}$, and any family $(x_1,y_1),\ldots,(x_N,y_N)$ of points in $\Gamma$, the inequality
\[
\sum_{i=1}^N c(x_i,y_i) \;\le\; \sum_{i=1}^N c(x_i,y_{i+1}) \tag{5.1}
\]
holds (with the convention $y_{N+1} = y_1$). A transference plan is said to be $c$-cyclically monotone if it is concentrated on a $c$-cyclically monotone set.
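Inequality (5.1) can be checked by brute force on a finite set of pairs. The sketch below is illustrative only (the function name and the quadratic cost are my choices, not from the text); it enumerates all cycles up to length `N_max` among the given pairs:

```python
import itertools

def is_c_cyclically_monotone(pairs, c, N_max=4):
    """Check inequality (5.1) for every cycle of length 2..N_max among `pairs`."""
    for N in range(2, N_max + 1):
        for cycle in itertools.permutations(range(len(pairs)), N):
            lhs = sum(c(*pairs[i]) for i in cycle)          # sum c(x_i, y_i)
            rhs = sum(c(pairs[i][0], pairs[cycle[(k + 1) % N]][1])
                      for k, i in enumerate(cycle))         # sum c(x_i, y_{i+1})
            if lhs > rhs + 1e-12:
                return False
    return True

c = lambda x, y: (x - y) ** 2
monotone = [(0.0, 0.0), (1.0, 1.0), (2.0, 2.0)]   # identity pairing
crossed  = [(0.0, 2.0), (1.0, 1.0), (2.0, 0.0)]   # pairs distant points
print(is_c_cyclically_monotone(monotone, c))  # True
print(is_c_cyclically_monotone(crossed, c))   # False
```

The `crossed` pairing fails already on the 2-cycle formed by its outer pairs: swapping the two destinations strictly lowers the total quadratic cost, which is precisely the rerouting of Figure 5.1.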

Informally, a $c$-cyclically monotone plan is a plan that cannot be improved: it is impossible to perturb it (in the sense considered before, by rerouting mass along some cycle) and get something more economical. One can think of it as a kind of local minimizer. It is intuitively obvious that an optimal plan should be $c$-cyclically monotone; the converse property is much less obvious (maybe it is possible to get something better by radically changing the plan), but we shall soon see that it holds true under mild conditions.

The next key concept is the dual Kantorovich problem. While the central notion in the original Monge–Kantorovich problem is cost, in the dual problem it is price. Imagine that a company offers to take care of all your transportation problem, buying bread at the bakeries and selling it to the cafés; what happens in between is not your problem (and maybe they have tricks to do the transport at a lower price than you). Let $\psi(x)$ be the price at which a basket of bread is bought at bakery $x$, and $\phi(y)$ the price at which it is sold at café $y$. On the whole, the price which the consortium bakery + café pays for the transport is $\phi(y) - \psi(x)$, instead of the original cost $c(x,y)$. This of course is for each unit of bread: if there is a mass $\mu(dx)$ at $x$, then the total price of the bread shipment from there will be $\psi(x)\,\mu(dx)$.

So as to be competitive, the company needs to set up prices in such a way that
\[
\forall (x,y), \qquad \phi(y) - \psi(x) \le c(x,y). \tag{5.2}
\]
When you were handling the transportation yourself, your problem was to minimize the cost. Now that the company takes up the transportation charge, their problem is to maximize the profits. This naturally leads to the dual Kantorovich problem:
\[
\sup \left\{ \int_{\mathcal{Y}} \phi(y)\,d\nu(y) - \int_{\mathcal{X}} \psi(x)\,d\mu(x) \;;\; \phi(y) - \psi(x) \le c(x,y) \right\}. \tag{5.3}
\]
From a mathematical point of view, it will be imposed that the functions $\psi$ and $\phi$ appearing in (5.3) be integrable: $\psi \in L^1(\mathcal{X},\mu)$; $\phi \in L^1(\mathcal{Y},\nu)$.

With the intervention of the company, the shipment of each unit of bread does not cost more than it used to when you were handling it yourself; so it is obvious that the supremum in (5.3) is no more than the optimal transport cost:

\[
\sup_{\phi - \psi \le c} \left\{ \int_{\mathcal{Y}} \phi(y)\,d\nu(y) - \int_{\mathcal{X}} \psi(x)\,d\mu(x) \right\}
\;\le\; \inf_{\pi \in \Pi(\mu,\nu)} \left\{ \int_{\mathcal{X}\times\mathcal{Y}} c(x,y)\,d\pi(x,y) \right\}. \tag{5.4}
\]
Clearly, if we can find a pair $(\psi,\phi)$ and a transference plan $\pi$ for which there is equality, then $(\psi,\phi)$ is optimal in the left-hand side and $\pi$ is also optimal in the right-hand side.

A pair of price functions $(\psi,\phi)$ will informally be said to be competitive if it satisfies (5.2). For a given $y$, it is of course in the interest of the company to set the highest possible competitive price $\phi(y)$, i.e. the highest lower bound for (i.e. the infimum of) $\psi(x) + c(x,y)$, among all bakeries $x$. Similarly, for a given $x$, the price $\psi(x)$ should be the supremum of all $\phi(y) - c(x,y)$. Thus it makes sense to describe a pair of prices $(\psi,\phi)$ as tight if
\[
\phi(y) = \inf_x \big( \psi(x) + c(x,y) \big), \qquad \psi(x) = \sup_y \big( \phi(y) - c(x,y) \big). \tag{5.5}
\]
In words, prices are tight if it is impossible for the company to raise the selling price, or lower the buying price, without losing its competitivity.

Consider an arbitrary pair of competitive prices $(\psi,\phi)$. We can always improve $\phi$ by replacing it by $\phi_1(y) = \inf_x \big( \psi(x) + c(x,y) \big)$; then we can also improve $\psi$ by replacing it by $\psi_1(x) = \sup_y \big( \phi_1(y) - c(x,y) \big)$; then replacing $\phi_1$ by $\phi_2(y) = \inf_x \big( \psi_1(x) + c(x,y) \big)$, and so on. It turns out that this process is stationary: as an easy exercise, the reader can check that $\phi_2 = \phi_1$, $\psi_2 = \psi_1$, which means that after just one iteration one obtains a pair of tight prices. Thus, when we consider the dual Kantorovich problem (5.3), it makes sense to restrict our attention to tight pairs, in the sense of equation (5.5). From that equation we can reconstruct $\phi$ in terms of $\psi$, so we can just take $\psi$ as the only unknown in our problem.

That unknown cannot be just any function: if you take a general function $\psi$, and compute $\phi$ by the first formula in (5.5), there is no chance that the second formula will be satisfied. In fact this second formula will hold true if and only if $\psi$ is $c$-convex, in the sense of the next definition (illustrated by Figure 5.2).

Definition 5.2 ($c$-convexity). Let $\mathcal{X}$, $\mathcal{Y}$ be sets, and $c : \mathcal{X}\times\mathcal{Y} \to (-\infty,+\infty]$. A function $\psi : \mathcal{X} \to \mathbb{R}\cup\{+\infty\}$ is said to be $c$-convex if it is not identically $+\infty$, and there exists $\zeta : \mathcal{Y} \to \mathbb{R}\cup\{\pm\infty\}$ such that

\[
\psi(x) = \sup_{y\in\mathcal{Y}} \big( \zeta(y) - c(x,y) \big) \qquad \forall x \in \mathcal{X}. \tag{5.6}
\]
Then its $c$-transform is the function $\psi^c$ defined by
\[
\psi^c(y) = \inf_{x\in\mathcal{X}} \big( \psi(x) + c(x,y) \big) \qquad \forall y \in \mathcal{Y}, \tag{5.7}
\]
and its $c$-subdifferential is the $c$-cyclically monotone set defined by
\[
\partial_c\psi := \big\{ (x,y) \in \mathcal{X}\times\mathcal{Y} \;;\; \psi^c(y) - \psi(x) = c(x,y) \big\}.
\]
The functions $\psi$ and $\psi^c$ are said to be $c$-conjugate.

Moreover, the $c$-subdifferential of $\psi$ at point $x$ is
\[
\partial_c\psi(x) = \big\{ y \in \mathcal{Y} \;;\; (x,y) \in \partial_c\psi \big\},
\]
or equivalently
\[
\forall z \in \mathcal{X}, \qquad \psi(x) + c(x,y) \le \psi(z) + c(z,y). \tag{5.8}
\]

Fig. 5.2. A $c$-convex function is a function whose graph you can entirely caress from below with a tool whose shape is the negative of the cost function (this shape might vary with the point $y$). In the picture $y_i \in \partial_c\psi(x_i)$.

Particular Case 5.3. If $c(x,y) = -x\cdot y$ on $\mathbb{R}^n\times\mathbb{R}^n$, then the $c$-transform coincides with the usual Legendre transform, and $c$-convexity is just plain convexity on $\mathbb{R}^n$. (Actually, this is a slight oversimplification: $c$-convexity is equivalent to plain convexity plus lower semicontinuity! A convex function is automatically continuous on the largest

open set $\Omega$ where it is finite, but lower semicontinuity might fail at the boundary of $\Omega$.) One can think of the cost function $c(x,y) = -x\cdot y$ as basically the same as $c(x,y) = |x-y|^2/2$, since the "interaction" between the positions $x$ and $y$ is the same for both costs.

Particular Case 5.4. If $c = d$ is a distance on some metric space $\mathcal{X}$, then a $c$-convex function is just a 1-Lipschitz function, and it is its own $c$-transform. Indeed, if $\psi$ is $c$-convex it is obviously 1-Lipschitz; conversely, if $\psi$ is 1-Lipschitz, then $\psi(x) \le \psi(y) + d(x,y)$, so $\psi^c(y) = \inf_x [\psi(x) + d(x,y)] = \psi(y)$. As an even more particular case, if $c(x,y) = \mathbf{1}_{x\neq y}$, then $\psi$ is $c$-convex if and only if $\sup\psi - \inf\psi \le 1$, and then again $\psi^c = \psi$. (More generally, if $c$ satisfies the triangle inequality $c(x,z) \le c(x,y) + c(y,z)$, then $\psi$ is $c$-convex if and only if $\psi(y) - \psi(x) \le c(x,y)$ for all $x,y$; and then $\psi^c = \psi$.)

Remark 5.5. There is no measure theory in Definition 5.2, so no assumption of measurability is made, and the supremum in (5.6) is a true supremum, not just an essential supremum; the same for the infimum in (5.7). If $c$ is continuous, then a $c$-convex function is automatically lower semicontinuous, and its subdifferential is closed; but if $c$ is not continuous the measurability of $\psi^c$ and $\partial_c\psi$ is not a priori guaranteed.

Remark 5.6. I excluded the case when $\psi \equiv +\infty$ so as to avoid trivial situations; what I called a $c$-convex function might more properly (!) be called a proper $c$-convex function. This automatically implies that $\zeta$ in (5.6) does not take the value $+\infty$ at all if $c$ is real-valued. If $c$ does achieve infinite values, then the correct convention in (5.6) is $(+\infty) - (+\infty) = -\infty$.

If $\psi$ is a function on $\mathcal{X}$, then its $c$-transform is a function on $\mathcal{Y}$. Conversely, given a function on $\mathcal{Y}$, one may define its $c$-transform as a function on $\mathcal{X}$. It will be convenient in the sequel to define the latter concept by an infimum rather than a supremum. This convention has the drawback of breaking the symmetry between the roles of $\mathcal{X}$ and $\mathcal{Y}$, but has other advantages that will be apparent later on.

Definition 5.7 ($c$-concavity). With the same notation as in Definition 5.2, a function $\phi : \mathcal{Y} \to \mathbb{R}\cup\{-\infty\}$ is said to be $c$-concave if it is not identically $-\infty$, and there exists $\psi : \mathcal{X} \to \mathbb{R}\cup\{\pm\infty\}$ such that $\phi = \psi^c$. Then its $c$-transform is the function $\phi^c$ defined by
\[
\phi^c(x) = \sup_{y\in\mathcal{Y}} \big( \phi(y) - c(x,y) \big) \qquad \forall x \in \mathcal{X};
\]

and its $c$-superdifferential is the $c$-cyclically monotone set defined by
\[
\partial^c\phi := \big\{ (x,y) \in \mathcal{X}\times\mathcal{Y} \;;\; \phi(y) - \phi^c(x) = c(x,y) \big\}.
\]

In spite of its short and elementary proof, the next crucial result is one of the main justifications of the concept of $c$-convexity.

Proposition 5.8 (Alternative characterization of $c$-convexity). For any function $\psi : \mathcal{X} \to \mathbb{R}\cup\{+\infty\}$, let its $c$-convexification be defined by $\psi^{cc} = (\psi^c)^c$. More explicitly,
\[
\psi^{cc}(x) = \sup_{y\in\mathcal{Y}} \inf_{\tilde x\in\mathcal{X}} \big[ \psi(\tilde x) + c(\tilde x,y) - c(x,y) \big].
\]
Then $\psi$ is $c$-convex if and only if $\psi^{cc} = \psi$.

Proof of Proposition 5.8. As a general fact, for any function $\phi : \mathcal{Y} \to \mathbb{R}\cup\{-\infty\}$ (not necessarily $c$-convex), one has the identity $\phi^{ccc} = \phi^c$. Indeed,
\[
\phi^{ccc}(x) = \sup_{y} \inf_{\tilde x} \sup_{\tilde y} \big[ \phi(\tilde y) - c(\tilde x,\tilde y) + c(\tilde x,y) - c(x,y) \big];
\]
then the choice $\tilde x = x$ shows that $\phi^{ccc}(x) \le \phi^c(x)$; while the choice $\tilde y = y$ shows that $\phi^{ccc}(x) \ge \phi^c(x)$.

If $\psi$ is $c$-convex, then there is $\zeta$ such that $\psi = \zeta^c$, so $\psi^{cc} = \zeta^{ccc} = \zeta^c = \psi$.

The converse is obvious: If $\psi^{cc} = \psi$, then $\psi$ is $c$-convex, as the $c$-transform of $\psi^c$. ⊓⊔

Remark 5.9. Proposition 5.8 is a generalized version of the Legendre duality in convex analysis (to recover the usual Legendre duality, take $c(x,y) = -x\cdot y$ in $\mathbb{R}^n\times\mathbb{R}^n$).

Kantorovich duality

We are now ready to state and prove the main result in this chapter.

Theorem 5.10 (Kantorovich duality). Let $(\mathcal{X},\mu)$ and $(\mathcal{Y},\nu)$ be two Polish probability spaces and let $c : \mathcal{X}\times\mathcal{Y} \to \mathbb{R}\cup\{+\infty\}$ be a lower semicontinuous cost function, such that
\[
\forall (x,y) \in \mathcal{X}\times\mathcal{Y}, \qquad c(x,y) \ge a(x) + b(y)
\]
for some real-valued upper semicontinuous functions $a \in L^1(\mu)$ and $b \in L^1(\nu)$. Then

(i) There is duality:
\[
\begin{aligned}
\min_{\pi \in \Pi(\mu,\nu)} \int_{\mathcal{X}\times\mathcal{Y}} c(x,y)\,d\pi(x,y)
&= \sup_{\substack{(\psi,\phi) \in C_b(\mathcal{X})\times C_b(\mathcal{Y});\\ \phi - \psi \le c}} \left( \int_{\mathcal{Y}} \phi(y)\,d\nu(y) - \int_{\mathcal{X}} \psi(x)\,d\mu(x) \right) \\
&= \sup_{\substack{(\psi,\phi) \in L^1(\mu)\times L^1(\nu);\\ \phi - \psi \le c}} \left( \int_{\mathcal{Y}} \phi(y)\,d\nu(y) - \int_{\mathcal{X}} \psi(x)\,d\mu(x) \right) \\
&= \sup_{\psi \in L^1(\mu)} \left( \int_{\mathcal{Y}} \psi^c(y)\,d\nu(y) - \int_{\mathcal{X}} \psi(x)\,d\mu(x) \right) \\
&= \sup_{\phi \in L^1(\nu)} \left( \int_{\mathcal{Y}} \phi(y)\,d\nu(y) - \int_{\mathcal{X}} \phi^c(x)\,d\mu(x) \right),
\end{aligned}
\]
and in the above suprema one might as well impose that $\psi$ be $c$-convex and $\phi$ $c$-concave.

(ii) If $c$ is real-valued and the optimal cost $C(\mu,\nu) = \inf_{\pi\in\Pi(\mu,\nu)} \int c\,d\pi$ is finite, then there is a measurable $c$-cyclically monotone set $\Gamma \subset \mathcal{X}\times\mathcal{Y}$ (closed if $a$, $b$, $c$ are continuous) such that for any $\pi \in \Pi(\mu,\nu)$ the following five statements are equivalent:

(a) $\pi$ is optimal;
(b) $\pi$ is $c$-cyclically monotone;
(c) There is a $c$-convex $\psi$ such that, $\pi$-almost surely, $\psi^c(y) - \psi(x) = c(x,y)$;
(d) There exist $\psi : \mathcal{X} \to \mathbb{R}\cup\{+\infty\}$ and $\phi : \mathcal{Y} \to \mathbb{R}\cup\{-\infty\}$, such that $\phi(y) - \psi(x) \le c(x,y)$ for all $(x,y)$, with equality $\pi$-almost surely;
(e) $\pi$ is concentrated on $\Gamma$.

(iii) If $c$ is real-valued, $C(\mu,\nu) < +\infty$, and one has the pointwise upper bound
\[
c(x,y) \le c_{\mathcal{X}}(x) + c_{\mathcal{Y}}(y), \qquad (c_{\mathcal{X}}, c_{\mathcal{Y}}) \in L^1(\mu) \times L^1(\nu), \tag{5.9}
\]

then both the primal and dual Kantorovich problems have solutions, so
\[
\begin{aligned}
\min_{\pi \in \Pi(\mu,\nu)} \int_{\mathcal{X}\times\mathcal{Y}} c(x,y)\,d\pi(x,y)
&= \max_{\substack{(\psi,\phi) \in L^1(\mu)\times L^1(\nu);\\ \phi - \psi \le c}} \left( \int_{\mathcal{Y}} \phi(y)\,d\nu(y) - \int_{\mathcal{X}} \psi(x)\,d\mu(x) \right) \\
&= \max_{\psi \in L^1(\mu)} \left( \int_{\mathcal{Y}} \psi^c(y)\,d\nu(y) - \int_{\mathcal{X}} \psi(x)\,d\mu(x) \right),
\end{aligned}
\]
and in the latter expressions one might as well impose that $\psi$ be $c$-convex and $\phi = \psi^c$. If in addition $a$, $b$ and $c$ are continuous, then there is a closed $c$-cyclically monotone set $\Gamma \subset \mathcal{X}\times\mathcal{Y}$, such that for any $\pi \in \Pi(\mu,\nu)$ and for any $c$-convex $\psi \in L^1(\mu)$,
\[
\begin{cases}
\pi \text{ is optimal in the Kantorovich problem if and only if } \pi[\Gamma] = 1;\\
\psi \text{ is optimal in the dual Kantorovich problem if and only if } \Gamma \subset \partial_c\psi.
\end{cases}
\]

Remark 5.11. When the cost $c$ is continuous, then the support of $\pi$ is $c$-cyclically monotone; but for a discontinuous cost function it might a priori be that $\pi$ is concentrated on a (nonclosed) $c$-cyclically monotone set, while the support of $\pi$ is not $c$-cyclically monotone. So, in the sequel, the words "concentrated on" are not exchangeable with "supported in". There is another subtlety for discontinuous cost functions: It is not clear that the functions $\psi$ and $\psi^c$ appearing in statements (ii) and (iii) are Borel measurable; it will only be proven that they coincide with measurable functions outside of a $\nu$-negligible set.

Remark 5.12. Note the difference between statements (b) and (e): The set $\Gamma$ appearing in (ii)(e) is the same for all optimal $\pi$'s, it only depends on $\mu$ and $\nu$. This set is in general not unique. If $c$ is continuous and $\Gamma$ is imposed to be closed, then one can define a smallest $\Gamma$, which is the closure of the union of all the supports of the optimal $\pi$'s. There is also a largest $\Gamma$, which is the intersection of all the $c$-subdifferentials $\partial_c\psi$, where $\psi$ is such that there exists an optimal $\pi$ supported in $\partial_c\psi$. (Since the cost function is assumed to be continuous, the $c$-subdifferentials are closed, and so is their intersection.)

Remark 5.13. Here is a useful practical consequence of Theorem 5.10: Given a transference plan $\pi$, if you can cook up a pair of competitive prices $(\psi,\phi)$ such that $\phi(y) - \psi(x) = c(x,y)$ throughout the support of $\pi$, then you know that $\pi$ is optimal. This theorem also shows that

optimal transference plans satisfy very special conditions: if you fix an optimal pair $(\psi,\phi)$, then mass arriving at $y$ can come from $x$ only if $c(x,y) = \phi(y) - \psi(x) = \psi^c(y) - \psi(x)$, which means that
\[
x \in \operatorname*{Arg\,min}_{x'\in\mathcal{X}} \big( \psi(x') + c(x',y) \big).
\]
In terms of my bakery analogy this can be restated as follows: A café accepts bread from a bakery only if the combined cost of buying the bread there and transporting it here is lowest among all possible bakeries.

Similarly, given a pair of competitive prices $(\psi,\phi)$, if you can cook up a transference plan $\pi$ such that $\phi(y) - \psi(x) = c(x,y)$ throughout the support of $\pi$, then you know that $(\psi,\phi)$ is a solution to the dual Kantorovich problem.

Remark 5.14. The assumption $c \le c_{\mathcal{X}} + c_{\mathcal{Y}}$ in (iii) can be weakened into
\[
\int_{\mathcal{X}\times\mathcal{Y}} c(x,y)\,d\mu(x)\,d\nu(y) < +\infty,
\]
or even
\[
\mu\Big[\Big\{ x \;;\; \int_{\mathcal{Y}} c(x,y)\,d\nu(y) < +\infty \Big\}\Big] > 0; \qquad
\nu\Big[\Big\{ y \;;\; \int_{\mathcal{X}} c(x,y)\,d\mu(x) < +\infty \Big\}\Big] > 0. \tag{5.10}
\]

Remark 5.15. If the variables $x$ and $y$ are swapped, then $(\mu,\nu)$ should be replaced by $(\nu,\mu)$ and $(\psi,\phi)$ by $(-\phi,-\psi)$.

Particular Case 5.16. Particular Case 5.4 leads to the following variant of Theorem 5.10. When $c(x,y) = d(x,y)$ is a distance on a Polish space $\mathcal{X}$, and $\mu,\nu$ belong to $P_1(\mathcal{X})$, then
\[
\inf \mathbb{E}\,d(X,Y) = \sup \mathbb{E}\,\big[\psi(X) - \psi(Y)\big] = \sup \Big\{ \int_{\mathcal{X}} \psi\,d\mu - \int_{\mathcal{Y}} \psi\,d\nu \Big\}, \tag{5.11}
\]
where the infimum on the left is over all couplings $(X,Y)$ of $(\mu,\nu)$, and the supremum on the right is over all 1-Lipschitz functions $\psi$. This is the Kantorovich–Rubinstein formula; it holds true as soon as the left-hand side is finite, and it is very useful.

Particular Case 5.17. Now consider $c(x,y) = -x\cdot y$ in $\mathbb{R}^n\times\mathbb{R}^n$. This cost is not nonnegative, but we have the lower bound $c(x,y) \ge$

$-(|x|^2 + |y|^2)/2$. So if $x \mapsto |x|^2 \in L^1(\mu)$ and $y \mapsto |y|^2 \in L^1(\nu)$, then one can invoke Particular Case 5.3 to deduce from Theorem 5.10 that
\[
\sup \mathbb{E}\,(X\cdot Y) = \inf \Big\{ \int_{\mathcal{X}} \phi\,d\mu + \int_{\mathcal{Y}} \phi^*\,d\nu \Big\} = \inf \mathbb{E}\,\big[\phi(X) + \phi^*(Y)\big], \tag{5.12}
\]
where the supremum on the left is over all couplings $(X,Y)$ of $(\mu,\nu)$, the infimum on the right is over all (lower semicontinuous) convex functions $\phi$ on $\mathbb{R}^n$, and $\phi^*$ stands for the usual Legendre transform of $\phi$. In formula (5.12), the signs have been changed with respect to the statement of Theorem 5.10, so the problem is to maximize the correlation of the random variables $X$ and $Y$.

Before proving Theorem 5.10, I shall first informally explain the construction. At first reading, one might be content with these informal explanations and skip the rigorous proof.

Idea of proof of Theorem 5.10. Take an optimal $\pi$ (which exists from Theorem 4.1), and let $(\psi,\phi)$ be a pair of competitive prices. Of course, as in (5.4),
\[
\int c(x,y)\,d\pi(x,y) \;\ge\; \int \phi\,d\nu - \int \psi\,d\mu \;=\; \int \big[\phi(y) - \psi(x)\big]\,d\pi(x,y).
\]
So if both quantities are equal, then $\int [c - \phi + \psi]\,d\pi = 0$, and since the integrand is nonnegative, necessarily
\[
\phi(y) - \psi(x) = c(x,y) \qquad \pi(dx\,dy)\text{-almost surely}.
\]
Intuitively speaking, whenever there is some transfer of goods from $x$ to $y$, the prices should be adjusted exactly to the transport cost.

Now let $(x_i)_{0\le i\le m}$ and $(y_i)_{0\le i\le m}$ be such that $(x_i,y_i)$ belongs to the support of $\pi$, so there is indeed some transfer from $x_i$ to $y_i$. Then we hope that
\[
\begin{cases}
\phi(y_0) - \psi(x_0) = c(x_0,y_0)\\
\phi(y_1) - \psi(x_1) = c(x_1,y_1)\\
\quad\vdots\\
\phi(y_m) - \psi(x_m) = c(x_m,y_m).
\end{cases}
\]
On the other hand, if $x$ is an arbitrary point,

\[
\begin{cases}
\phi(y_0) - \psi(x_1) \le c(x_1,y_0)\\
\phi(y_1) - \psi(x_2) \le c(x_2,y_1)\\
\quad\vdots\\
\phi(y_m) - \psi(x) \le c(x,y_m).
\end{cases}
\]
By subtracting these inequalities from the previous equalities and adding up everything, one obtains
\[
\psi(x) \;\ge\; \psi(x_0) + \big[c(x_0,y_0) - c(x_1,y_0)\big] + \big[c(x_1,y_1) - c(x_2,y_1)\big] + \cdots + \big[c(x_m,y_m) - c(x,y_m)\big].
\]
Of course, one can add an arbitrary constant to $\psi$, provided that one subtracts the same constant from $\phi$; so it is possible to decide that $\psi(x_0) = 0$, where $(x_0,y_0)$ is arbitrarily chosen in the support of $\pi$. Then
\[
\psi(x) \;\ge\; \big[c(x_0,y_0) - c(x_1,y_0)\big] + \big[c(x_1,y_1) - c(x_2,y_1)\big] + \cdots + \big[c(x_m,y_m) - c(x,y_m)\big], \tag{5.13}
\]
and this should be true for all choices of $(x_i,y_i)$ ($1 \le i \le m$) in the support of $\pi$, and for all $m \ge 1$. So it becomes natural to define $\psi(x)$ as the supremum of all the functions (of the variable $x$) appearing in the right-hand side of (5.13). It will turn out that this $\psi$ satisfies the equation
\[
\psi^c(y) - \psi(x) = c(x,y) \qquad \pi(dx\,dy)\text{-almost surely}.
\]
Then, if $\psi$ and $\psi^c$ are integrable, one can write
\[
\int c\,d\pi = \int \psi^c\,d\pi - \int \psi\,d\pi = \int \psi^c\,d\nu - \int \psi\,d\mu.
\]
This shows at the same time that $\pi$ is optimal in the Kantorovich problem, and that the pair $(\psi,\psi^c)$ is optimal in the dual Kantorovich problem. ⊓⊔

Rigorous proof of Theorem 5.10, Part (i). First I claim that it is sufficient to treat the case when $c$ is nonnegative. Indeed, let
\[
\tilde c(x,y) := c(x,y) - a(x) - b(y) \;\ge\; 0, \qquad \Lambda := \int a\,d\mu + \int b\,d\nu \in \mathbb{R}.
\]
Whenever $\psi : \mathcal{X} \to \mathbb{R}\cup\{+\infty\}$ and $\phi : \mathcal{Y} \to \mathbb{R}\cup\{-\infty\}$ are two functions, define
\[
\tilde\psi(x) := \psi(x) + a(x), \qquad \tilde\phi(y) := \phi(y) - b(y).
\]

Then the following properties are readily checked:

$c$ real-valued $\implies$ $\tilde c$ real-valued;
$c$ lower semicontinuous $\implies$ $\tilde c$ lower semicontinuous;
$\psi \in L^1(\mu) \iff \tilde\psi \in L^1(\mu)$; $\quad \phi \in L^1(\nu) \iff \tilde\phi \in L^1(\nu)$;
$\forall \pi \in \Pi(\mu,\nu)$, $\displaystyle\int \tilde c\,d\pi = \int c\,d\pi - \Lambda$;
$\forall (\psi,\phi) \in L^1(\mu)\times L^1(\nu)$, $\displaystyle\int \tilde\phi\,d\nu - \int \tilde\psi\,d\mu = \Big( \int \phi\,d\nu - \int \psi\,d\mu \Big) - \Lambda$;
$\psi$ is $c$-convex $\iff$ $\tilde\psi$ is $\tilde c$-convex;
$\phi$ is $c$-concave $\iff$ $\tilde\phi$ is $\tilde c$-concave;
$(\phi,\psi)$ are $c$-conjugate $\iff$ $(\tilde\phi,\tilde\psi)$ are $\tilde c$-conjugate;
$\Gamma$ is $c$-cyclically monotone $\iff$ $\Gamma$ is $\tilde c$-cyclically monotone.

Thanks to these formulas, it is equivalent to establish Theorem 5.10 for the cost $c$ or for the nonnegative cost $\tilde c$. So in the sequel, I shall assume, without further comment, that $c$ is nonnegative.

The rest of the proof is divided into five steps.

Step 1: If $\mu = (1/n)\sum_{i=1}^n \delta_{x_i}$, $\nu = (1/n)\sum_{j=1}^n \delta_{y_j}$, where the costs $c(x_i,y_j)$ are finite, then there is at least one cyclically monotone transference plan.

Indeed, in that particular case, a transference plan between $\mu$ and $\nu$ can be identified with a bistochastic $n\times n$ array of real numbers $a_{ij} \in [0,1]$: each $a_{ij}$ tells what proportion of the $1/n$ mass carried by $x_i$ will go to destination point $y_j$. So the Monge–Kantorovich problem becomes
\[
\inf_{(a_{ij})} \sum_{ij} a_{ij}\,c(x_i,y_j),
\]
where the infimum is over all arrays $(a_{ij})$ satisfying
\[
\sum_j a_{ij} = 1, \qquad \sum_i a_{ij} = 1. \tag{5.14}
\]
Here we are minimizing a linear function on the compact set $[0,1]^{n\times n}$, so obviously there exists a minimizer; the corresponding transference plan $\pi$ can be written as

\[
\pi = \frac{1}{n} \sum_{ij} a_{ij}\,\delta_{(x_i,y_j)},
\]
and its support $S$ is the set of all couples $(x_i,y_j)$ such that $a_{ij} > 0$.

Assume that $S$ is not cyclically monotone: Then there exist $N \in \mathbb{N}$ and $(x_{i_1},y_{j_1}),\ldots,(x_{i_N},y_{j_N})$ in $S$ such that
\[
c(x_{i_1},y_{j_2}) + c(x_{i_2},y_{j_3}) + \cdots + c(x_{i_N},y_{j_1}) \;<\; c(x_{i_1},y_{j_1}) + \cdots + c(x_{i_N},y_{j_N}). \tag{5.15}
\]
Let $a := \min(a_{i_1 j_1},\ldots,a_{i_N j_N}) > 0$. Define a new transference plan $\tilde\pi$ by the formula
\[
\tilde\pi = \pi + \frac{a}{n} \sum_{\ell=1}^N \Big( \delta_{(x_{i_\ell},\,y_{j_{\ell+1}})} - \delta_{(x_{i_\ell},\,y_{j_\ell})} \Big).
\]
It is easy to check that this has the correct marginals, and by (5.15) the cost associated with $\tilde\pi$ is strictly less than the cost associated with $\pi$. This is a contradiction, so $S$ is indeed $c$-cyclically monotone!

Step 2: If $c$ is continuous, then there is a cyclically monotone transference plan.

To prove this, consider sequences of independent random variables $x_i \in \mathcal{X}$, $y_j \in \mathcal{Y}$, with respective law $\mu$, $\nu$. According to the law of large numbers for empirical measures (sometimes called fundamental theorem of statistics, or Varadarajan's theorem), one has, with probability 1,
\[
\mu_n := \frac{1}{n} \sum_{i=1}^n \delta_{x_i} \longrightarrow \mu, \qquad \nu_n := \frac{1}{n} \sum_{j=1}^n \delta_{y_j} \longrightarrow \nu \tag{5.16}
\]
as $n \to \infty$, in the sense of weak convergence of measures. In particular, by Prokhorov's theorem, $(\mu_n)$ and $(\nu_n)$ are tight sequences.

For each $n$, let $\pi_n$ be a cyclically monotone transference plan between $\mu_n$ and $\nu_n$. By Lemma 4.4, $\{\pi_n\}_{n\in\mathbb{N}}$ is tight. By Prokhorov's theorem, there is a subsequence, still denoted $(\pi_n)$, which converges weakly to some probability measure $\pi$, i.e.
\[
\int h(x,y)\,d\pi_n(x,y) \longrightarrow \int h(x,y)\,d\pi(x,y)
\]
for all bounded continuous functions $h$ on $\mathcal{X}\times\mathcal{Y}$. By applying the previous identity with $h(x,y) = f(x)$ and $h(x,y) = g(y)$, we see that

$\pi$ has marginals $\mu$ and $\nu$, so this is an admissible transference plan between $\mu$ and $\nu$.

For each $n$, the cyclical monotonicity of $\pi_n$ implies that for all $N$ and $\pi_n^{\otimes N}$-almost all $(x_1,y_1),\ldots,(x_N,y_N)$, the inequality (5.1) is satisfied; in other words, $\pi_n^{\otimes N}$ is concentrated on the set $\mathcal{C}(N)$ of all $\big((x_1,y_1),\ldots,(x_N,y_N)\big) \in (\mathcal{X}\times\mathcal{Y})^N$ satisfying (5.1). Since $c$ is continuous, $\mathcal{C}(N)$ is a closed set, so the weak limit $\pi^{\otimes N}$ of $\pi_n^{\otimes N}$ is also concentrated on $\mathcal{C}(N)$. Let $\Gamma = \operatorname{Spt}\pi$ ($\operatorname{Spt}$ stands for "support"); then
\[
\Gamma^N = (\operatorname{Spt}\pi)^N = \operatorname{Spt}(\pi^{\otimes N}) \subset \mathcal{C}(N),
\]
and since this holds true for all $N$, $\Gamma$ is $c$-cyclically monotone.

Step 3: If $c$ is continuous real-valued and $\pi$ is $c$-cyclically monotone, then there is a $c$-convex $\psi$ such that $\partial_c\psi$ contains the support of $\pi$.

Indeed, let $\Gamma$ again denote the support of $\pi$ (this is a closed set). Pick any $(x_0,y_0) \in \Gamma$, and define
\[
\psi(x) := \sup_{m\in\mathbb{N}} \sup \Big\{ \big[c(x_0,y_0) - c(x_1,y_0)\big] + \big[c(x_1,y_1) - c(x_2,y_1)\big] + \cdots + \big[c(x_m,y_m) - c(x,y_m)\big] \;;\; (x_1,y_1),\ldots,(x_m,y_m) \in \Gamma \Big\}. \tag{5.17}
\]
By applying the definition with $m = 1$ and $(x_1,y_1) = (x_0,y_0)$, one immediately sees that $\psi(x_0) \ge 0$. On the other hand, $\psi(x_0)$ is the supremum of all the quantities $[c(x_0,y_0) - c(x_1,y_0)] + \cdots + [c(x_m,y_m) - c(x_0,y_m)]$, which by cyclical monotonicity are all nonpositive. So actually $\psi(x_0) = 0$. (In fact this is the only place in this step where $c$-cyclical monotonicity will be used!)

By renaming $y_m$ as $y$, obviously
\[
\psi(x) = \sup_{y\in\mathcal{Y}} \sup_{m\in\mathbb{N}} \sup_{(x_1,y_1),\ldots,(x_{m-1},y_{m-1}),\,x_m} \Big\{ \big[c(x_0,y_0) - c(x_1,y_0)\big] + \big[c(x_1,y_1) - c(x_2,y_1)\big] + \cdots + \big[c(x_m,y) - c(x,y)\big] \;;\; (x_1,y_1),\ldots,(x_m,y) \in \Gamma \Big\}. \tag{5.18}
\]
So $\psi(x) = \sup_y \big[\zeta(y) - c(x,y)\big]$, if $\zeta$ is defined by
\[
\zeta(y) = \sup \Big\{ \big[c(x_0,y_0) - c(x_1,y_0)\big] + \big[c(x_1,y_1) - c(x_2,y_1)\big] + \cdots + c(x_m,y) \;;\; m \in \mathbb{N},\; (x_1,y_1),\ldots,(x_m,y) \in \Gamma \Big\}, \tag{5.19}
\]

with the convention that $\zeta = -\infty$ out of $\operatorname{proj}_{\mathcal{Y}}(\Gamma)$. Thus $\psi$ is a $c$-convex function.

Now let $(\overline{x}, \overline{y}) \in \Gamma$. By choosing $x_m = \overline{x}$, $y_m = \overline{y}$ in the definition of $\psi$,

\[ \psi(x) \ge \sup_{m} \sup_{(x_1,y_1),\ldots,(x_{m-1},y_{m-1})} \Bigl\{ \bigl[ c(x_0,y_0) - c(x_1,y_0) \bigr] + \cdots + \bigl[ c(x_{m-1},y_{m-1}) - c(\overline{x}, y_{m-1}) \bigr] + \bigl[ c(\overline{x},\overline{y}) - c(x,\overline{y}) \bigr] \Bigr\}. \]

In the definition of $\psi$, it does not matter whether one takes the supremum over $m-1$ or over $m$ variables, since one also takes the supremum over $m$. So the previous inequality can be recast as

\[ \psi(x) \ge \psi(\overline{x}) + c(\overline{x}, \overline{y}) - c(x, \overline{y}). \]

In particular,

\[ \psi(x) + c(x, \overline{y}) \ge \psi(\overline{x}) + c(\overline{x}, \overline{y}). \]

Taking the infimum over $x \in \mathcal{X}$ in the left-hand side, we deduce that

\[ \psi^c(\overline{y}) \ge \psi(\overline{x}) + c(\overline{x}, \overline{y}). \]

Since the reverse inequality is always satisfied, actually

\[ \psi^c(\overline{y}) = \psi(\overline{x}) + c(\overline{x}, \overline{y}), \]

and this means precisely that $(\overline{x}, \overline{y}) \in \partial_c\psi$. So $\Gamma$ does lie in the $c$-subdifferential of $\psi$.

Step 4: If $c$ is continuous and bounded, then there is duality.

Let $\|c\| := \sup c(x,y)$. By Steps 2 and 3, there exists a transference plan $\pi$ whose support is included in $\partial_c\psi$ for some $c$-convex $\psi$, which was constructed "explicitly" in Step 3. Let $\phi = \psi^c$.

From (5.17), $\psi = \sup_m \psi_m$, where each $\psi_m$ is a supremum of continuous functions, and therefore lower semicontinuous. In particular, $\psi$ is measurable.¹ The same is true of $\phi$.

Next we check that $\psi$, $\phi$ are bounded. Let $(x_0,y_0) \in \partial_c\psi$ be such that $\psi(x_0) < +\infty$; then necessarily $\phi(y_0) > -\infty$. So, for any $x \in \mathcal{X}$,

\[ \psi(x) = \sup_y\, [\phi(y) - c(x,y)] \ge \phi(y_0) - c(x,y_0) \ge \phi(y_0) - \|c\|; \]

¹ A lower semicontinuous function on a Polish space is always measurable, even if it is obtained as a supremum of uncountably many continuous functions; in fact it can always be written as a supremum of countably many continuous functions!

\[ \phi(y) = \inf_x\, [\psi(x) + c(x,y)] \le \psi(x_0) + c(x_0,y) \le \psi(x_0) + \|c\|. \]

Re-injecting these bounds into the identities $\psi = \phi^c$, $\phi = \psi^c$, we get

\[ \psi(x) \le \sup_y \phi(y) \le \psi(x_0) + \|c\|; \qquad \phi(y) \ge \inf_x \psi(x) \ge \phi(y_0) - \|c\|. \]

So both $\phi$ and $\psi$ are bounded from above and below.

Thus we can integrate $\phi$, $\psi$ against $\nu$, $\mu$ respectively, and, by the marginal condition,

\[ \int \phi(y)\, d\nu(y) - \int \psi(x)\, d\mu(x) = \int \bigl[ \phi(y) - \psi(x) \bigr]\, d\pi(x,y). \]

Since $\phi(y) - \psi(x) = c(x,y)$ on the support of $\pi$, the latter quantity equals $\int c(x,y)\, d\pi(x,y)$. It follows that (5.4) is actually an equality, which proves the duality.

Step 5: If $c$ is lower semicontinuous, then there is duality. Since $c$ is nonnegative lower semicontinuous, we can write

\[ c(x,y) = \lim_{k\to\infty} c_k(x,y), \]

where $(c_k)_{k\in\mathbb{N}}$ is a nondecreasing sequence of bounded, uniformly continuous functions.² To see this, just choose

\[ c_k(x,y) = \inf_{(x',y')} \Bigl\{ \min\bigl( c(x',y'),\, k \bigr) + k \bigl[ d(x,x') + d(y,y') \bigr] \Bigr\}; \]

note that $c_k$ is $k$-Lipschitz, nondecreasing in $k$, and further satisfies $0 \le c_k(x,y) \le \min( c(x,y),\, k )$.

By Step 4, for each $k$ we can find $\pi_k$, $\psi_k$, $\phi_k$ such that $\psi_k$ is bounded and $c_k$-convex, $\phi_k = (\psi_k)^{c_k}$, and

\[ \int c_k(x,y)\, d\pi_k(x,y) = \int \phi_k(y)\, d\nu(y) - \int \psi_k(x)\, d\mu(x). \]

Since $c_k$ is no greater than $c$, the constraint $\phi_k(y) - \psi_k(x) \le c_k(x,y)$ implies $\phi_k(y) - \psi_k(x) \le c(x,y)$; so all $(\phi_k, \psi_k)$ are admissible in the

² It is instructive to understand exactly where the lower semicontinuity assumption is used to show $c = \lim c_k$.

dual problem with cost $c$. Moreover, for each $k$, the functions $\phi_k$ and $\psi_k$ are uniformly continuous because $c_k$ itself is uniformly continuous.

By Lemma 4.4, $\Pi(\mu,\nu)$ is weakly sequentially compact. Thus, up to extraction of a subsequence, we can assume that $\pi_k$ converges to some $\widetilde\pi \in \Pi(\mu,\nu)$. For all indices $\ell \le k$, we have $c_\ell \le c_k$, so

\[ \int c_\ell\, d\widetilde\pi = \lim_{k\to\infty} \int c_\ell\, d\pi_k \le \limsup_{k\to\infty} \int c_k\, d\pi_k = \limsup_{k\to\infty} \left( \int \phi_k(y)\, d\nu(y) - \int \psi_k(x)\, d\mu(x) \right). \]

On the other hand, by monotone convergence,

\[ \int c\, d\widetilde\pi = \lim_{\ell\to\infty} \int c_\ell\, d\widetilde\pi. \]

So

\[ \inf_{\Pi(\mu,\nu)} \int c\, d\pi \le \int c\, d\widetilde\pi \le \limsup_{k\to\infty} \left( \int \phi_k(y)\, d\nu(y) - \int \psi_k(x)\, d\mu(x) \right) \le \inf_{\Pi(\mu,\nu)} \int c\, d\pi. \]

Moreover,

\[ \int \phi_k(y)\, d\nu(y) - \int \psi_k(x)\, d\mu(x) \xrightarrow[k\to\infty]{} \inf_{\Pi(\mu,\nu)} \int c\, d\pi. \quad (5.20) \]

Since each pair $(\psi_k, \phi_k)$ lies in $C_b(\mathcal{X}) \times C_b(\mathcal{Y})$, the duality also holds with bounded continuous (and even Lipschitz) test functions, as claimed in Theorem 5.10(i). ⊓⊔

Proof of Theorem 5.10, Part (ii). From now on, I shall assume that the optimal transport cost $C(\mu,\nu)$ is finite, and that $c$ is real-valued. As in the proof of Part (i) I shall assume that $c$ is nonnegative, since the general case can always be reduced to that particular case.

Part (ii) will be established in the following way: (a) ⇒ (b) ⇒ (c) ⇒ (d) ⇒ (a), and then (a) ⇒ (e) ⇒ (b). There seems to be some redundancy in this chain of implications, but this is because the implication (a) ⇒ (c) will be used to construct the set $\Gamma$ appearing in (e).
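Before moving on, the combinatorial mechanism of Step 1 — shifting mass along a cycle that violates cyclical monotonicity — can be checked numerically. The following sketch is our own illustration, not from the book; the plan, points, and helper names are all ours. It builds a discrete "crossing" plan, performs the shift $\widetilde\pi = \pi + a \sum_\ell ( \delta_{(x_{i_\ell}, y_{j_{\ell+1}})} - \delta_{(x_{i_\ell}, y_{j_\ell})} )$, and verifies that the marginals are unchanged while the cost strictly drops.

```python
# Illustration (not from the book): Step 1 of the proof on a discrete plan.
# The plan pi sends 0 -> 1 and 1 -> 0, so its support contains a cycle
# violating c-cyclical monotonicity for the quadratic cost; shifting mass
# along the cycle preserves the marginals and strictly lowers the cost.

def cost(x, y):
    return (x - y) ** 2          # quadratic cost on the real line

xs = [0.0, 1.0]                  # support points of mu
ys = [0.0, 1.0]                  # support points of nu

pi = {(0, 1): 0.5, (1, 0): 0.5}  # "crossing" plan, mass indexed by (i, j)

# Amount of mass that can be shifted along the cycle (the "a" of the proof):
a = min(pi[(0, 1)], pi[(1, 0)])

pi_new = dict(pi)
for (i, j), (i2, j2) in [((0, 1), (0, 0)), ((1, 0), (1, 1))]:
    pi_new[(i, j)] = pi_new.get((i, j), 0.0) - a   # remove delta_{(x_i, y_j)}
    pi_new[(i2, j2)] = pi_new.get((i2, j2), 0.0) + a  # add delta_{(x_i, y_j')}

def marginals(plan):
    mu = [sum(m for (i, j), m in plan.items() if i == k) for k in range(2)]
    nu = [sum(m for (i, j), m in plan.items() if j == k) for k in range(2)]
    return mu, nu

def total_cost(plan):
    return sum(m * cost(xs[i], ys[j]) for (i, j), m in plan.items())

assert marginals(pi) == marginals(pi_new)      # same marginals
assert total_cost(pi_new) < total_cost(pi)     # strictly cheaper
print(total_cost(pi), total_cost(pi_new))      # prints: 1.0 0.0
```

The improved plan is the "non-crossing" one, in agreement with the intuition that optimal plans never cross for the quadratic cost on the line.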

(a) ⇒ (b): Let $\pi$ be an optimal plan, and let $(\phi_k, \psi_k)_{k\in\mathbb{N}}$ be as in Step 5 of the proof of Part (i). Since the optimal transport cost is finite by assumption, the cost function $c$ belongs to $L^1(\pi)$. From (5.20) and the marginal property of $\pi$,

\[ \int \bigl[ c(x,y) - \phi_k(y) + \psi_k(x) \bigr]\, d\pi(x,y) \xrightarrow[k\to\infty]{} 0, \]

so $c(x,y) - \phi_k(y) + \psi_k(x)$ converges to 0 in $L^1(\pi)$ as $k \to \infty$. Up to choosing a subsequence, we can assume that the convergence is almost sure; then $\phi_k(y_i) - \psi_k(x_i)$ converges to $c(x_i,y_i)$, $\pi(dx_i\, dy_i)$-almost surely, as $k \to \infty$. By passing to the limit in the inequality

\[ \sum_{i=1}^{N} c(x_i, y_{i+1}) \ge \sum_{i=1}^{N} \bigl[ \phi_k(y_{i+1}) - \psi_k(x_i) \bigr] = \sum_{i=1}^{N} \bigl[ \phi_k(y_i) - \psi_k(x_i) \bigr] \]

(where by convention $y_{N+1} = y_1$) we see that, $\pi^{\otimes N}$-almost surely,

\[ \sum_{i=1}^{N} c(x_i, y_{i+1}) \ge \sum_{i=1}^{N} c(x_i, y_i). \quad (5.21) \]

At this point we know that $\pi^{\otimes N}$ is concentrated on some set $\Gamma_N \subset (\mathcal{X}\times\mathcal{Y})^N$, such that $\Gamma_N$ consists of $N$-tuples $((x_1,y_1), \ldots, (x_N,y_N))$ satisfying (5.21). Let $\operatorname{proj}_k((x_i,y_i)_{1\le i\le N}) := (x_k,y_k)$ be the projection on the $k$th factor of $(\mathcal{X}\times\mathcal{Y})^N$. It is not difficult to check that $\Gamma := \bigcap_{N} \bigcap_{1\le k\le N} \operatorname{proj}_k(\Gamma_N)$ is a $c$-cyclically monotone set which has full $\pi$-measure; so $\pi$ is indeed $c$-cyclically monotone.

(b) ⇒ (c): Let $\pi$ be a cyclically monotone transference plan. The function $\psi$ can be constructed just as in Step 3 of the proof of Part (i), only with some differences. First, $\Gamma$ is not necessarily closed; it is just a Borel set such that $\pi[\Gamma] = 1$. (If $\Gamma$ is not Borel, make it Borel by modifying it on a negligible set.) With this in mind, define, as in Step 3 of Part (i),

\[ \psi(x) := \sup_{m\in\mathbb{N}} \sup \Bigl\{ \bigl[ c(x_0,y_0) - c(x_1,y_0) \bigr] + \bigl[ c(x_1,y_1) - c(x_2,y_1) \bigr] + \cdots + \bigl[ c(x_m,y_m) - c(x,y_m) \bigr];\ (x_1,y_1), \ldots, (x_m,y_m) \in \Gamma \Bigr\}. \quad (5.22) \]

From its definition, for any $x \in \mathcal{X}$,

\[ \psi(x) \ge c(x_0,y_0) - c(x,y_0) > -\infty. \]

(Here the assumption of $c$ being real-valued is useful.) Then there is no difficulty in proving, as in Step 3, that $\psi(x_0) = 0$, that $\psi$ is $c$-convex, and that $\pi$ is concentrated on $\partial_c\psi$.

The rest of this step will be devoted to the measurability of $\psi$, $\psi^c$ and $\partial_c\psi$. These are surprisingly subtle issues, which do not arise if $c$ is continuous; so the reader who only cares for a continuous cost function might go directly to the next step.

First, the measurability of $\psi$ is not clear at all from formula (5.22): This is typically an uncountable supremum of upper semicontinuous functions, and there is no a priori reason for this to be Borel measurable.

Since $c$ is nonnegative lower semicontinuous, there is a nondecreasing sequence $(c_\ell)_{\ell\in\mathbb{N}}$ of continuous nonnegative functions, such that $c_\ell(x,y)$ converges to $c(x,y)$ as $\ell \to \infty$, for all $(x,y)$. By Egorov's theorem, for each $k \in \mathbb{N}$ there is a Borel set $E_k$, with $\pi[E_k] \le 1/k$, such that the convergence of $c_\ell$ to $c$ is uniform on $\Gamma \setminus E_k$. Since $\pi$ (just as any probability measure on a Polish space) is regular, we can find a compact set $\Gamma_k \subset \Gamma \setminus E_k$, such that $\pi[\Gamma_k] \ge 1 - 2/k$. There is no loss of generality in assuming that the sets $\Gamma_k$ are increasing in $k$.

On each $\Gamma_k$, the sequence $(c_\ell)$ converges uniformly and monotonically to $c$; in particular $c$ is continuous on $\Gamma_k$. Furthermore, since $\pi$ is obviously concentrated on the union of all $\Gamma_k$, there is no loss of generality in assuming that $\Gamma = \bigcup \Gamma_k$. We may also assume that $(x_0,y_0) \in \Gamma_1$.

Now, let $x$ be given in $\mathcal{X}$, and for each $k, \ell, m$, let

\[ F_{m,k,\ell}\bigl( (x_0,y_0), x_1, y_1, \ldots, x_m, y_m \bigr) := \bigl[ c(x_0,y_0) - c_\ell(x_1,y_0) \bigr] + \bigl[ c(x_1,y_1) - c_\ell(x_2,y_1) \bigr] + \cdots + \bigl[ c(x_m,y_m) - c_\ell(x,y_m) \bigr]. \]

It is clear that $F_{m,k,\ell}$ is a continuous function of $(x_1, y_1, \ldots, x_m, y_m) \in \Gamma_k^m$ (because $c_\ell$ is continuous on $\mathcal{X}\times\mathcal{Y}$, and $c$ is continuous on $\Gamma_k$). It is defined on the compact set $\Gamma_k^m$, and it is nonincreasing as a function of $\ell$, with

\[ \lim_{\ell\to\infty} F_{m,k,\ell} = F_{m,k}, \]

where

\[ F_{m,k}\bigl( (x_0,y_0), x_1, y_1, \ldots, x_m, y_m \bigr) := \bigl[ c(x_0,y_0) - c(x_1,y_0) \bigr] + \bigl[ c(x_1,y_1) - c(x_2,y_1) \bigr] + \cdots + \bigl[ c(x_m,y_m) - c(x,y_m) \bigr]. \]

Now I claim that

\[ \lim_{\ell\to\infty}\ \sup_{\Gamma_k^m} F_{m,k,\ell} = \sup_{\Gamma_k^m} F_{m,k}. \quad (5.23) \]

Indeed, by compactness, for each $\ell \in \mathbb{N}$ there is $X_\ell \in \Gamma_k^m$ such that

\[ \sup_{\Gamma_k^m} F_{m,k,\ell} = F_{m,k,\ell}(X_\ell); \]

and up to extraction of a subsequence, one may assume that $X_\ell$ converges to some $X$. Then by monotonicity, for any $\ell' \le \ell$,

\[ F_{m,k,\ell}(X_\ell) \le F_{m,k,\ell'}(X_\ell); \]

and if one lets $\ell \to \infty$, with $\ell'$ fixed, one obtains

\[ \limsup_{\ell\to\infty}\ \sup_{\Gamma_k^m} F_{m,k,\ell} \le F_{m,k,\ell'}(X). \]

Now let $\ell' \to \infty$, to get

\[ \limsup_{\ell\to\infty}\ \sup_{\Gamma_k^m} F_{m,k,\ell} \le F_{m,k}(X) \le \sup_{\Gamma_k^m} F_{m,k}. \]

The converse inequality

\[ \sup_{\Gamma_k^m} F_{m,k} \le \liminf_{\ell\to\infty}\ \sup_{\Gamma_k^m} F_{m,k,\ell} \]

is obvious because $F_{m,k} \le F_{m,k,\ell}$; so (5.23) is proven.

To summarize: If we let

\[ \psi_{m,k,\ell}(x) := \sup \Bigl\{ \bigl[ c(x_0,y_0) - c_\ell(x_1,y_0) \bigr] + \bigl[ c(x_1,y_1) - c_\ell(x_2,y_1) \bigr] + \cdots + \bigl[ c(x_m,y_m) - c_\ell(x,y_m) \bigr];\ (x_1,y_1), \ldots, (x_m,y_m) \in \Gamma_k \Bigr\}, \]

then we have

\[ \lim_{\ell\to\infty} \psi_{m,k,\ell}(x) = \sup \Bigl\{ \bigl[ c(x_0,y_0) - c(x_1,y_0) \bigr] + \bigl[ c(x_1,y_1) - c(x_2,y_1) \bigr] + \cdots + \bigl[ c(x_m,y_m) - c(x,y_m) \bigr];\ (x_1,y_1), \ldots, (x_m,y_m) \in \Gamma_k \Bigr\}. \]

It follows easily that, for each $x$,

90 84 5 Cyclical monotonicity and Kantorovich duality x ψ ) ( = sup sup . ) x lim ( ψ m,k,ℓ →∞ ℓ ∈ m N k ∈ N ψ Since x ) is lower semicontinuous in x (as a supremum of contin- ( m,k,ℓ ψ uous functions of x ), itself is measurable. c ψ := is subtle also, and at the present φ The measurability of level of generality it is not clear that this function is really Borel mea- surable. However, it can be modified on a ν -negligible set so as to φ y ) − ψ ( x ) = c ( x,y ), π ( dxdy )-almost become measurable. Indeed, ( ( π ) as π ( dx | y ) ν dxdy dy ), then φ ( y ) surely, so if one disintegrates ( ̃ ( dy )-almost surely, with the Borel function will coincide, φ ( y ) := ν ∫ y ( x ) + c ( x,y )] π ( dx | ψ ). [ X ̃ ν -measure such that Z be a Borel set of zero = φ outside Then let φ Z . The subdifferential ∂ , ψ coincides, out of the π -negligible set X× Z of c ̃ ( x,y ) ∈ X ×Y ; with the measurable set φ ( y ) − ψ ( x ) = c { x,y ) } . The ( ∂ ψ can be modified on a π -negligible set so as to be conclusion is that c Borel measurable. c ⇒ (d): (c) φ = ψ Just let . ( d) (a): Let ( ψ ,φ ) be a pair of admissible functions, and let π be ⇒ φ − = c , π -almost surely. The goal is to a transference plan such that ψ π ψ and show that is optimal. The main difficulty lies in the fact that φ need not be separately integrable. This problem will be circumvented n ∈ N , w ∈ R ∪{±∞} , define by a careful truncation procedure. For   |≤ n w if | w  w ) = T ( n if w > n n   − if w < − n, n and ξ ( x,y ) := φ ( y ) − ψ ( x ); ξ . ( x,y ) := T )) ( φ ( y )) − T x ( ψ ( n n n ξ = 0. It is easily checked that ξ converges monotoni- In particular, n 0 to ξ ; more precisely, cally ξ ( x,y ) remains equal to 0 if ξ ( x,y ) = 0; • n • ξ ) if the latter quantity is positive; ( x,y ) increases to ξ ( x,y n • ) if the latter quantity is negative. ( x,y ξ ξ ( x,y ) decreases to n As a consequence, ξ ) is an admis- ≤ ( ξ ψ ) φ,T ≤ ξ T ≤ c . 
So ( + n + n n n sible pair in the dual Kantorovich problem, and

91 Kantorovich duality 85 ) ( ∫ ∫ ∫ ∫ ∫ ′ ′ dν − = ( T . ψ ) dμ ≤ sup ξ π dν ( ψ T − d φ φ ) dμ n n n ′ ′ c ≤ ψ − φ (5.24) coincides ξ On the other hand, by monotone convergence and since π outside of a with -negligible set, c ∫ ∫ ∫ −−−→ dπ ξ dπ = cdπ ; ξ n →∞ n ξ ≥ ξ ≥ 0 0 ∫ ∫ = 0 dπ −−−→ ξ ξ dπ . n →∞ n ξ< ξ< 0 0 This and (5.24) imply that ) ( ∫ ∫ ∫ ′ ′ dμ cdπ dν ; φ ψ sup − ≤ ′ ′ ψ − c ≤ φ is optimal. π so Before completing the chain of equivalences, we should first con- Γ struct the set . By Theorem 4.1, there is at least one optimal trans- ̃ π . From the implication (a) ⇒ (c), there is some ference plan, say ψ ̃ ̃ ̃ is concentrated on ∂ . such that ψ ; just choose Γ π ∂ ψ ̃ := c c (a) ⇒ (e): Let ̃ π b e the optimal plan used to construct Γ , and let c ̃ be the associated c -convex function; let φ = ψ ψ . Then let π ψ = and ̃ π have the same cost and same π be another optimal plan. Since marginals, ∫ ∫ ∫ cd ̃ π = lim cdπ π = ( T ̃ φ − T d ψ ) n n →∞ n ∫ dπ, = lim ( T ) φ − T ψ n n →∞ n where T is the truncation operator that was already used in the proof n ⇒ (a). So of (d) ∫ [ ] ( x,y ) − T (5.25) φ ( y ) + T . ψ ( x ) c dπ ( x,y ) −−−→ 0 n n →∞ n ξ ( x,y ) := φ ( y ) − ψ ( x ); then by monotone conver- As before, define gence, ∫ ∫ [ ] ; dπ − T − φ + T c ψ ) dπ −−−→ ( ξ c n n n →∞ 0 ξ ≥ 0 ≥ ξ

92 86 5 Cyclical monotonicity and Kantorovich duality ∫ ∫ [ ] T dπ. φ + T ) ψ c dπ −−−→ ξ − − ( c n n →∞ n 0 0 < ξ ξ< ≤ , the integrands here are nonnegative and both integrals ξ Since c make sense in [0 , + ∞ ]. So by adding the two limits and using (5.25) we get ∫ ∫ [ ] ψ = lim c . ξ ) c − T = 0 φ + T dπ − ( n n →∞ n π c , this proves that Since coincides ξ -almost surely with ξ , which ≤ c was the desired conclusion. s cyclically monotone by assump- i This is obvious since Γ (b): (e) ⇒ ⊓⊔ tion. Proof of Theorem 5.10, Part (iii) . As in the proof of Part (i) we may 0. Let π be optimal, and let assume that be a c -convex func- ψ c ≥ c ∂ ψ tion such that π φ := ψ is concentrated on . The goal is to . Let c ≤ c show that under the assumption + c ) solves the dual , ( ψ,φ c Y X Kantorovich problem. ψ φ are integrable. For this we repeat The point is to prove that and the estimates of Step 4 in the proof of Part (i), with some variants: After x x ,y ) are finite, ) such that φ ( y y ), ψ ( securing ( ( ), c c ( x ) and 0 0 Y 0 0 X 0 0 we write [ ] x ) + c )] ( x ) = sup y ) φ ( y ψ − c ( x,y ) + c ( ( x ) ( ≥ sup c [ φ ( y ) − Y X X y y ( φ ( ) − c ≥ y y ); 0 Y 0 [ ] ) − c )] ( y ) = inf x x,y ψ ( x ) + c ( φ ) − ( ( ( y ) y ≤ inf c [ ψ ( x ) + c Y Y X x x ψ x ≤ ) + c ( ( x ) . 0 0 X So ψ is bounded below by the μ -integrable function φ ( y , ) − c c ( y − ) 0 Y 0 X φ )+ ν -integrable function ψ ( x ; and c is bounded above by the ( x c )+ 0 Y 0 X ∫ ∫ R ψ dμ and thus both φdν make sense in − ∪{−∞} . Since their ∫ ∫ φ − ψ ) dπ = ( cdπ > −∞ , both integrals are finite. So sum is ∫ ∫ ∫ cdπ φdν − = ψ dμ, and it results from Part (i) of the theorem that both π and ( ψ,φ ) are optimal, respectively in the original and the dual Kantorovich prob- lems.

93 Restriction property 87 o prove the last part of (iii), assume that c T is continuous; then the -convex function is a closed c c subdifferential of any -cyclically mono- tone set. be an arbitrary optimal transference plan, and ( ) an op- π ψ,φ Let c ) is optimal in the dual Kan- ψ,ψ timal pair of prices. We know that ( torovich problem, so ∫ ∫ ∫ c dν ) = ) ψ dπ x,y − x,y ψ dμ. ( c ( Thanks to the marginal condition, this be rewritten as ∫ [ ] c x,y − ψ ( x ) − c ( ( ) y dπ ( x,y ) = 0 . ) ψ π is concentrated on Since the integrand is nonnegative, it follows that c ) such that ψ ψ ( y ) − the set of pairs ( x,y x ) − c ( x,y ) = 0, which is ( precisely the subdifferential of . Thus any optimal transference plan is ψ ψ is defined Γ . So if concentrated on the subdifferential of any optimal as the intersection of all subdifferentials of optimal functions ψ , then Γ also contains the supports of all optimal plans. π ∈ Π ( μ,ν ) is a transference plan concentrated on Conversely, if ̃ ∫ ∫ ∫ ∫ c c Γ [ ψ , then − ψ cd d ̃ π = ̃ ψ π dν − = ψ dμ , so ̃ π is optimal. ] ̃ ̃ is a c -convex function such that ∂ Similarly, if ψ ψ contains Γ , then c c ̃ ̃ and ψ are integrable against μ and ν ψ by the previous estimates ∫ ∫ ∫ ∫ c c c ̃ ̃ ̃ ̃ ̃ ̃ = respectively, and [ ψ ) dν − ψ cdπ ψ dμ , so ( − ψ, = ψ ψ ] dπ ⊓⊔ is optimal. This concludes the proof. Restriction property The dual side of the Kantorovich problem also behaves well with respect to restriction, as shown by the following results. -convexity). Let X , Lemma 5.18 (Restriction of be two sets c Y ′ ′ ′ R ∪{ + ∞} . Let X and ⊂ X , Y c ⊂ Y and let c : be the X ×Y → ′ ′ to X restriction of × Y c . Let ψ : X → R ∪ { + ∞} be a c -convex ′ ′ ′ function. Then there is a : X ψ -convex function R ∪{ + ∞} such c → ′ ′ ′ ′ ′ ∂ X , ψ ψ coincides with ψ on proj ≤ (( on that ψ ) ∩ ( X ψ ×Y )) c X ′ ′ ′ ′ ∩ ( X and ×Y ∂ ) ⊂ ∂ ψ ψ . c c

94 88 5 Cyclical monotonicity and Kantorovich duality T heorem 5.19 (Restriction for the Kantorovich duality the- X ,μ ) and ( Y ,ν ) be two Polish probability spaces, let Let orem). ( 1 1 μ ) L b ∈ L and ( ν ( be real-valued upper semicontinuous func- a ∈ ) : X ×Y → R be a lower semicontinuous cost function c tions, and let ( x,y ) ≥ a ( x ) + b such that y ) for all x,y . Assume that the optimal c ( C μ,ν ) between μ and ν is finite. Let π be an optimal transport cost ( be a -convex function such that π is con- ψ transference plan, and let c centrated on ψ . Let ̃ π be a measure on ∂ satisfying ̃ π ≤ π , and X ×Y c ′ ′ ′ ′ X ×Y ] > 0 ; let π π/Z := ̃ Z , and let μ = , ν ̃ be the marginals of π π . [ ′ ′ ′ μ X and Y Spt be a closed set containing a closed set Let ⊂ X ⊂ Y ′ ′ ′ ′ containing be the restriction of c to X Spt ν ×Y . Then there . Let c ′ ′ ′ ∪{ : X ψ → R -convex function + ∞} such that is a c ′ ′ ′ ψ on proj ψ (( ∂ , which has full ψ ) ∩ ( X coincides with ×Y (a) )) c X ′ -measure; μ ′ ′ ′ ∂ (b) is concentrated on ; π ψ c ′ ′ ′ ( X ψ ,μ (c) ) and solves the dual Kantorovich problem between ′ ′ ′ ,ν ( Y with cost c ) . In spite of its superficially complicated appearance, Theorem 5.19 is very simple. If the reader feels that it is obvious, or alternatively that it is obscure, he or she may just skip the proofs and retain the loose statement that “it is always possible to restrict the cost function”. c ′ , define . For y ∈Y φ ψ Proof of Lemma 5.18 = . Let { ′ ′ ′ φ ( y ) if ; ψ ∂ ∈X ∈ ; ( x ∃ ,y ) x c ′ ) = y ( φ −∞ otherwise . ′ x ∈X let then For [ ] [ ] ′ ′ ′ ′ x ψ ) = sup ( ) − c ( x,y ) φ = sup ( y ) ( ) ( y φ − c . x,y ′ ′ y ∈Y y ∈Y ′ ′ ′ ′ c By construction, -convex. Since φ ψ ≤ φ and Y ⊂Y is it is obvious that [ ] ′ ′ ∈X ( x ) ≤ sup . ) ∀ φ ( y x − c ( x,y ) , ψ = ψ ( x ) ∈Y y ′ ′ ′ proj such (( ∂ ∈Y ψ ) ∩ ( X Let ∩Y x )); this means that there is y ∈ c X ′ y ) ∂ x,y ψ . Then φ that ( ( ∈ ) = φ ( y ), so c ′ ′ ( x ) ψ φ . 
( y ) − c ( x,y ) = φ ( y ) − c ( x,y ) = ψ ( x ) ≥

95 Application: Stability 89 ′ ′ ′ ∩ ψ (( ∂ T ψ ) does coincide with ( X ψ ×Y on proj )). hus c X ′ ′ ′ ′ ( ψ ( X Finally, let ( ×Y ∂ ∈ φ ) ( y ) = φ ( y ), ψ x,y ∩ x ) = ψ ( x ); ), then c ′ so for all , ∈X z ′ ′ ′ ′ ′ ψ ( x,y ) = ψ ( x ) + c ( x,y ) = φ ( y ) = φ ( ( y ) ≤ ψ x ( z ) + c ) + ( z,y ) , c ′ ′ ⊓⊔ which means that ( x,y ψ ∂ . ∈ ) c ′ Proof of Theorem 5.19 . Let ψ be defined by Lemma 5.18. To prove (a), ′ ′ ′ ′ is concentrated on ( ), so ( X ∂ ×Y ∩ ) μ π is it suffices to note that ψ c ′ ′ concentrated on proj is concentrated on ψ ) ∩ ( X ∂ ×Y (( )). Then π c X ′ ′ ψ , so ̃ π is concentrated on ∂ ∂ ψ ∩ ( X ), which by Lemma 5.18 ×Y c c ′ ′ is included in ; this proves (b). Finally, (c) follows from Theo- ∂ ψ c ⊓⊔ rem 5.10(ii). The rest of this chapter is devoted to various applications of Theo- rem 5.10. Application: Stability An important consequence of Theorem 5.10 is the stability of optimal transference plans. For simplicity I shall prove it under the assumption that c is bounded below. X and Y Theorem 5.20 (Stability of optimal transport). Let be c C ( X ×Y ) be a real-valued continuous cost Polish spaces, and let ∈ inf −∞ . Let ( c ) function, be a sequence of continuous cost c > ∈ N k k and c . Let ( μ ) X×Y ( ν ) on functions converging uniformly to N k k ∈ k ∈ N k X Y respectively. Assume be sequences of probability measures on and μ , let converges to that (resp. ν converges to ν ) weakly. For each k μ k k π μ and ν . If be an optimal transference plan between k k k ∫ k N , ∀ c , dπ ∞ ∈ + < k k then, up to extraction of a subsequence, π converges weakly to some k c -cyclically monotone transference plan π ∈ Π ( μ,ν ) . If moreover ∫ lim inf ∞ + , c < dπ k k N ∈ k then the optimal total transport cost C ( μ,ν ) between μ and ν is finite, and π is an optimal transference plan.

96 90 5 Cyclical monotonicity and Kantorovich duality orollary 5.21 (Compactness of the set of optimal plans). Let C Y and ( x,y ) be a real-valued continuous be Polish spaces, and let X c −∞ be two compact subsets of K and L c > cost function, inf . Let X and P ( Y ) respectively. Then the set of optimal transference plans ( P ) K π L is itself compact in whose marginals respectively belong to and ( ) . P X ×Y μ and ν are convergent sequences, by . Since Proof of Theorem 5.20 k k Prokhorov’s theorem they constitute tight sets, and then by Lemma 4.4 π all lie in a tight set of X×Y ; so we can extract a further the measures k subsequence, still denoted ( π ) for simplicity, which converges weakly k π ∈ Π ( μ,ν ). to some π To prove that c -monotone, the argument is essentially the same is as in Step 2 of the proof of Theorem 5.10(i). Indeed, by Theorem 5.10, ⊗ N π each -cyclically monotone set; so π is concentrated on a c is k k k )) such that ( N ) of (( x concentrated on the set ) ,..., ( x ,y C ,y N N 1 1 k ∑ ∑ x ( c ,y ) , x ) ≤ ( ,y c i i +1 i i k k ≤ i ≤ N 1 N ≤ i ≤ 1 where as usual y large k = y are given, for . So if ε > 0 and N 1 N +1 N ⊗ is concentrated on the set C enough π ( ) defined by N ε k ∑ ∑ c ( x ,y ) ≤ c ,y ) + x ( ε. i i +1 i i N 1 ≤ 1 N ≤ i ≤ i ≤ ⊗ N , and then by letting π Since this is a closed set, the same is true for ⊗ N π is concentrated on the set ε → C ( N ) defined by 0 we see that ∑ ∑ ,y c ( x . ) ) ≤ ,y x ( c i i i i +1 N ≤ 1 N ≤ i ≤ 1 i ≤ is c -cyclically monotone, as desired. π So the support of ∫ Now assume that lim inf dπ < + ∞ . Then by the same c k k k →∞ argument as in the proof of Theorem 4.1, ∫ ∫ ≤ lim inf . ∞ cdπ c + dπ < k k →∞ k In particular, C ( μ,ν ) < + ∞ ; so Theorem 5.10(ii) guarantees the opti- mality of π . ⊓⊔ An immediate consequence of Theorem 5.20 is the possibility to select optimal transport plans in a measurable way.

97 Application: Stability 91 Let C X orollary 5.22 (Measurable selection of optimal plans). , : R be a continuous cost function, c X×Y → Y be Polish spaces and let . Let Ω be a measurable space and let ω inf ( μ c > ,ν −∞ ) be a 7−→ ω ω → P ( X ) × P ( measurable function ) . Then there is a measurable Ω Y ω π choice such that for each ω , π 7−→ is an optimal transference ω ω plan between and ν . μ ω ω O be the set of all optimal transference Proof of Corollary 5.22 . Let ( X ×Y P Φ : O → plans, equipped with the weak topology on ), and let ( X ) × P ( Y ) be the map which to π associates its marginals ( μ,ν ). P Φ Obviously is closed; in particular is continuous. By Theorem 5.20, O is onto. By Corollary 5.21 it is a Polish space. By Theorem 4.1, Φ − 1 μ,ν Φ all pre-images ( ) are compact. So the conclusion follows from general theorems of measurable selection (see the bibliographical notes ⊓⊔ for more information). Theorem 5.20 also admits the following corollary about the stability . transport maps of X be a Corollary 5.23 (Stability of the transport map). Let locally compact Polish space, and let Y ,d ) be another Polish space. Let ( : X ×Y → R be a lower semicontinuous function with inf c > −∞ , c : k let c ∈ N X ×Y → R be lower semicontinuous, such and for each k c converges uniformly to be a . Let μ ∈ P ( X ) and let ( ν ) that c N k ∈ k k sequence of probability measures on , converging weakly to ∈ P ( Y ) . Y ν X → Y such that each T Assume the existence of measurable maps : k := (Id ,T ν π and μ is an optimal transference plan between μ ) # k k k c , having finite total transport cost. Further, assume the for the cost k existence of a measurable map X →Y such that π := (Id : ) T μ is ,T # μ and ν , for the cost c , and the unique optimal transference plan between T converges to T in probability: has finite total transport cost. Then k [{ }] ( ) ) x ∈ X ; d ∀ T (5.26) ( x ) ,T ( x ε > 0 ≥ ε μ −−−→ . 
0 k →∞ k An important assumption in the above statement is the Remark 5.24. T . uniqueness of the optimal transport map If the measure μ is replaced by a sequence ( μ ) Remark 5.25. k k N ∈ μ , then the maps converging weakly to and T may be far away T k from each other, even μ , -almost surely: take for instance X = Y = R k . ) = 1 = δ x ( , μ = δ T , ν ) = 0, = ν μ δ x , T ( = 0 0 k =0 k k x 6 /k 1

98 92 5 Cyclical monotonicity and Kantorovich duality roof of Corollary 5.23 P . By Theorem 5.20 and uniqueness of the opti- con- ν π and = (Id ,T mal coupling between ) , we know that μ μ # k k k ,T ) verges weakly to μ . = (Id π # 0 and δ > 0 be given. By Lusin’s theorem (in the abstract Let ε > version recalled in the end of the bibliographical notes) there is a closed and the restriction of such that μ [ X \ K ] ≤ δ ⊂ X T to K is K set continuous. So the set { } ∈ ( x,y ) A . ×Y ; d ( T ( x ) ,y ) ≥ ε = K ε K ×Y X ×Y . Also π [ A is closed in ] = 0 since , and therefore closed in ε is concentrated on the graph of T . So, by weak convergence of π and π k A , closedness of ε 0 = [ A ] ] ≥ lim sup A π π [ ε ε k k →∞ }] [{ ε π ≥ = lim sup ( x,y ) ∈ K ×Y ; d ( T ( x ) ,y ) k →∞ k [{ }] x ∈ K ; d ( T ( ) μ ,T x ( x )) ≥ ε = lim sup k k →∞ [{ }] δ. μ lim sup x ∈X ; d ( T ( x ) ,T − ( x )) ≥ ε ≥ k →∞ k Letting δ → 0 we conclude that lim sup μ [ d ( T ( x ) ,T ] = 0, ( x )) ≥ ε k which was the goal. ⊓⊔ Application: Dual formulation of transport inequalities Let be a given cost function, and let c ∫ ) = (5.27) C ( μ,ν cdπ inf μ,ν ) π Π ( ∈ stand for the value of the optimal transport cost of transport between μ and ν . If is a given reference measure, inequalities of the form ν ∀ μ ∈ P ( X ) , C ( μ,ν ) ≤ F ( μ ) arise in several branches of mathematics; some of them will be studied in Chapter 22. It is useful to know that if F is a convex function of

99 Application: Dual formulation of transport inequalities 93 then there is a nice dual reformulation of these inequalities in terms , μ of the Legendre transform of . This is the content of the following F theorem. , Y be two Let Theorem 5.26 (Dual transport inequalities). X a given probability measure on Y . Let F : P ( X ) → Polish spaces, and ν ) + P ( X be a convex lower semicontinuous function on , and Λ ∞} R ∪{ its Legendre transform on ( X ) ; more explicitly, it is assumed that C b ) (  ∫   − ) φ ( Λ ∀ μ ∈ P ( X ) , F ( μ ) = sup φdμ    ( X ) ∈ C X φ  b (5.28) ) ( ∫      ( X ) , Λ ∀ φ ) = sup φ ∈ C φdμ − F ( μ ) ( . b  ) μ ∈ X P ( X c Further, let R ∪{ + ∞} be a lower semicontinuous cost func- X×Y → : c > −∞ . Then, the following two statements are equivalent: inf tion, ∀ μ ∈ P ( X ) , C ( μ,ν ) ≤ F ( μ ) ; (i) ( ) ∫ c c ) := ( Y ) , Λ (ii) ∀ x φdν − φ φ ∈ ≤ 0 , where φ C ( b Y [ . ( y ) − c ( x,y )] sup φ y is a nondecreasing convex function with R → R : Φ Moreover, if (0) = 0 , then the following two statements are equivalent: Φ ∀ μ ∈ P ( X ) , Φ ( C ( μ,ν )) ≤ F ( μ ) ; (i’) ( ) ∫ c ∗ Y ) , ∀ t φ 0 , Λ ∀ t ∈ (ii’) C ( ≥ − tφ , − Φ 0 ( t ) φdν ≤ b Y ∗ Φ Φ . stands for the Legendre transform of where Remark 5.27. The writing in (ii) or (ii’) is not very rigorous since Λ is a priori defined on the set of bounded continuous functions, and c c might not belong to that set. (It is clear that φ is bounded from φ above, but this is all that can be said.) However, from (5.28) Λ ( φ ) is a nondecreasing function of φ , so there is in practice no problem to extend Λ to a more general class of measurable functions. In any case, the correct way to interpret the left-hand side in (ii) is ) ( ( ) ∫ ∫ c , φdν Λ − Λ − φ ψ φdν = sup c φ ψ ≥ Y Y where ψ in the supremum is assumed to be bounded continuous.

100 94 5 Cyclical monotonicity and Kantorovich duality emark 5.28. R One may simplify (ii’) by taking the supremum over t ; since is nonincreasing, the result is Λ ( ( )) ∫ c φ Λ Φ ≤ 0 . (5.29) − φdν Y (ii) is a particular (This shows in particular that the equivalence (i) ⇔ (ii’).) However, in certain situations it ⇔ case of the equivalence (i’) is better to use the inequality (ii’) rather than (5.29); see for instance Proposition 22.5. Example 5.29. The most famous example of inequality of the type = Y and F ( μ ) is the Kullback information of of (i) occurs when X ∫ ν , that is F ( μ ) = H = ( μ ) = with respect to ρ log ρdν , where ρ μ ν μ ; and by convention μ ) = + ∞ if ( is not absolutely continuous F dμ/dν ν . Then one has the explicit formula with respect to ) ( ∫ φ ) = log e dν ( Λ . φ So the two functional inequalities μ,ν ∈ P ∀ X ) , C ( μ ) ≤ H ) ( μ ( ν and ∫ R c φdν φ ≤ , e ∀ ∈ C ( e X ) dν φ b are equivalent. . First assume that (i) is satisfied. Then for all Proof of Theorem 5.26 c ≥ , φ ψ ( } ) ( { ) ∫ ∫ ∫ ) μ ( F − dμ φdν − ψ Λ = sup ψ φdν − μ Y X ( P ∈ ) Y X } { ∫ ∫ ( F − ) = sup ψ dμ φdν − μ X ) X ( P ∈ μ Y ] [ , 0 ≤ C ) μ ≤ sup ( μ,ν ) − F ( μ ) ( P ∈ X where the easiest part of Theorem 5.10 (that is, inequality (5.4)) was used to go from the next-to-last line to the last one. Then (ii) follows upon taking the supremum over ψ .

101 Application: Dual formulation of transport inequalities 95 onversely, assume that (ii) is satisfied. Then, for any pair ( ) ψ,φ ∈ C X ) × C C ( Y ) one has, by (5.28), ( b b ) ( ) ( ∫ ∫ ∫ ∫ ∫ φdν ) φdν − ψ ψ dμ = ≤ Λ − . μ φdν − ψ dμ + F ( Y Y X Y X c ≥ yields ψ Taking the supremum over all φ ( ) ∫ ∫ ∫ c c dμ ≤ Λ ) φdν μ φdν − φ φ − + F ( . Y Y X By assumption, the first term in the right-hand side is always nonpos- itive; so in fact ∫ ∫ c − φ . dμ ≤ F ( μ ) φdν Y X Y C Then (i) follows upon taking the supremum over ( ∈ ) and apply- φ b ing Theorem 5.10 (i). Now let us consider the equivalence between (i’) and (ii’). By as- ∗ ( r ) ≤ 0 for r ≤ 0, so Φ sumption, Φ t ) = sup if [ r t − Φ ( r )] = + ∞ ( r 0. Then the Legendre inversion formula says that t < ] [ ∗ − r ) = sup R ∀ r rt ( Φ , Φ ( t ) ∈ . R t ∈ + (The important thing is that the supremum is over R .) and not R + c ∈ ( X ), for all ψ ≥ φ φ and for all If (i’) is satisfied, then for all C b R , ∈ t + ( ) ∫ ∗ t Λ − Φ ( φdν t ) − tψ Y [ ] ∫ ∫ ( ) ∗ = sup − tψ − ) t ( t ) φdν dμ − F ( μ Φ ) μ ∈ P ( X Y X ] [ ( ) ∫ ∫ ∗ = sup t − ) ) φdν μ ( t ψ dμ F − Φ − ( μ X ( P ∈ X Y ) [ ] ∗ ) μ ( F sup tC ( μ,ν ) − Φ ≤ ( t ) − ) X ( P ∈ μ ] [ , 0 ≤ ≤ ) μ sup Φ ( C ( μ,ν )) − F ( ) ( P ∈ μ X ∗ tr ≤ ) was used. ( r ) + Φ where the inequality ( t Φ

102 96 5 Cyclical monotonicity and Kantorovich duality onversely, if (ii’) is satisfied, then for all ( φ,ψ C ( X ) × C C ( Y ) ∈ ) b b 0, and t ≥ ( ) ∫ ∫ ∫ ∫ ∗ ∗ Φ − ( t ) = t dμ − t ψ dμ tφdν − tψ − Φ φdν ( t ) Y X X Y ) ( ∫ ∗ ) Φ tψ − t Λ ( t − ≤ + F ( μ ); φdν Y c ψ one obtains ≥ then by taking the supremum over φ ) ( ∫ c ∗ ∗ ( t ) ≤ tC ( t μ,ν ) φdν − tφ ) − Φ Φ ( t ) Λ + F ( μ − Y F μ ); ≤ ( t and (i’) follows by taking the supremum over 0. ⊓⊔ ≥ Application: Solvability of the Monge problem As a last application of Theorem 5.10, I shall now present the criterion which is used in the large majority of proofs of existence of a determin- istic optimal coupling (or Monge transport). Theorem 5.30 (Criterion for solvability of the Monge prob- Let ( ,μ ) and ( Y ,ν ) be two Polish probability spaces, and let lem). X 1 1 L μ ) , b ∈ L ∈ ( ν ) be two real-valued upper semicontinuous func- a ( tions. Let : X ×Y → R be a lower semicontinuous cost function such c that ( x,y ) ≥ a ( x ) + b c y ) for all x,y . Let C ( μ,ν ) be the optimal total ( transport cost between μ and ν . If (i) C μ,ν ) < + ∞ ; ( ∪{ c (ii) for any : X → R -convex function + ∞} , the set of x ∈X ψ such that ∂ -negligible; ψ ( x ) contains more than one element is μ c then there is a unique (in law) optimal coupling X,Y ) of ( μ,ν ) , and it ( is deterministic. It is characterized (among all possible couplings) by the existence of a c -convex function ψ such that, almost surely, Y belongs ) to μ ψ ( X ∂ . In particular, the Monge problem with initial measure c and final measure ν admits a unique solution.

Proof of Theorem 5.30. The argument is almost obvious. By Theorem 5.10(ii), there is a c-convex function ψ, and a measurable set Γ ⊂ ∂_c ψ, such that any optimal plan π is concentrated on Γ. By assumption there is a Borel set Z such that μ[Z] = 0 and ∂_c ψ(x) contains at most one element if x ∉ Z. So for any x ∈ proj_X(Γ) \ Z, there is exactly one y ∈ Y such that (x,y) ∈ Γ, and we can then define T(x) = y.

Now let π be any optimal coupling. Because it has to be concentrated on Γ, and Z × Y is π-negligible, π is also concentrated on Γ \ (Z × Y), which is precisely the set of all pairs (x, T(x)), i.e. the graph of T. It follows that π is the Monge transport associated with the map T.

The argument above is in fact a bit sloppy, since I did not check the measurability of T. I shall show below how to slightly modify the construction of T to make sure that it is measurable. The reader who does not want to bother about measurability issues can skip the rest of the proof.

Let (K_ℓ)_{ℓ∈N} be a nondecreasing sequence of compact sets, all of them included in Γ \ (Z × Y), such that π[∪ K_ℓ] = 1. (The family (K_ℓ) exists because π, just as any finite Borel measure on a Polish space, is regular.) If J_ℓ := proj_X(K_ℓ) is given, then for any x lying in the compact set J_ℓ we can define T_ℓ(x) as the unique y such that (x,y) ∈ K_ℓ. Then we can define T on ∪ J_ℓ by stipulating that for each ℓ, T restricts to T_ℓ on J_ℓ. The map T_ℓ is continuous on J_ℓ: Indeed, if x_m → x in J_ℓ, then the sequence (x_m, T_ℓ(x_m))_{m∈N} is valued in the compact set K_ℓ, so up to extraction it converges to some (x,y) ∈ K_ℓ, and necessarily y = T_ℓ(x). So T is a Borel map. Even if it is not defined on the whole of Γ \ (Z × Y), it is still defined on a set of full μ-measure, so the proof can be concluded just as before.
⊓⊔

Bibliographical notes

There are many ways to state the Kantorovich duality, and even more ways to prove it. There are also several economic interpretations that belong to folklore. The one which I formulated in this chapter is a variant of one that I learnt from Caffarelli. Related economic interpretations underlie some algorithms, such as the fascinating “auction

algorithm” developed by Bertsekas (see [115, Chapter 7], or the various surveys written by Bertsekas on the subject). But also many more classical algorithms are based on the Kantorovich duality [743].

A common convention consists in taking the pair (ψ, −φ) as the unknown.³ This has the advantage of making some formulas more symmetric: The c-transform becomes φ^c(y) = inf_x [c(x,y) − φ(x)], and then ψ^c(x) = inf_y [c(x,y) − ψ(y)], so this is the same formula going back and forth between functions of x and functions of y, upon exchange of x and y. Since X and Y may have nothing in common, in general this symmetry is essentially cosmetic. The conventions used in this chapter lead to a somewhat natural “economic” interpretation, and will also lend themselves better to a time-dependent treatment. Moreover, they also agree with the conventions used in weak KAM theory, and more generally in the theory of dynamical systems [105, 106, 347].

It might be good to make the link more explicit. In weak KAM theory, X = Y is a Riemannian manifold M; a Lagrangian cost function is given on the tangent bundle TM; and c = c(x,y) is a continuous cost function defined from the dynamics, as the minimum action that one should take to go from x to y (as later in Chapter 7). Since in general c(x,x) ≠ 0, it is not absurd to consider the optimal transport cost C(μ,μ) between a measure μ and itself. If M is compact, it is easy to show that there exists a μ that minimizes C(μ,μ). To the optimal transport problem between μ and μ, Theorem 5.10 associates two c-cyclically monotone sets: a minimal one and a distinguished closed maximal one, respectively Γ_min and Γ_max ⊂ M × M. These sets can be identified with subsets of TM via the embedding (initial position, final position) ⟼ (initial position, initial velocity). Under that identification, Γ_min and Γ_max are called respectively the Mather and Aubry sets; they carry valuable information about the underlying dynamics. For mnemonic purposes, to recall which is which, the reader might use the resemblance of the name “Mather” to the word “measure”. (The Mather set is the one cooked up from the supports of the probability measures.)

In the particular case when c(x,y) = |x − y|²/2 in Euclidean space, it is customary to expand c(x,y) as |x|²/2 − x·y + |y|²/2, and change unknowns by including |x|²/2 and |y|²/2 into ψ and φ respectively, then change signs and reduce to the cost function x·y, which is the one

³ The latter pair was denoted (φ,ψ) in [814, Chapter 1], which will probably upset the reader.

appearing naturally in the Legendre duality of convex functions. This is explained carefully in [814, Chapter 2], where reminders and references about the theory of convex functions in R^n are also provided.

The Kantorovich duality theorem was proven by Kantorovich himself on a compact space in his famous note [501] (even before he realized the connection with Monge's transportation problem). As Kantorovich noted later in [502], the duality for the cost c(x,y) = |x − y| in R^n implies that transport pathlines are orthogonal to the surfaces {ψ = constant}, where ψ is the Kantorovich potential, i.e. the solution of the dual problem; in this way he recovered Monge's celebrated original observation. In 1958, Kantorovich and Rubinstein [506] made the duality more explicit in the special case when c(x,y) = d(x,y). Much later the statement was generalized by Dudley [316, Lecture 20] [318, Section 11.8], with an alternative argument (partly based on ideas by Neveu) which does not need completeness (this is sometimes useful to handle noncomplete separable spaces); the proof in the first reference contains a gap which was filled by de Acosta [273, Appendix B].⁴ Rüschendorf [390, 715], Fernique [356], Szulga [769], Kellerer [512], Feyel [357], and probably others, contributed to the problem.

Modern treatments most often use variants of the Hahn–Banach theorem, see for instance [550, 696, 814]. The proof presented in [814, Theorem 1] first proves the duality when X, Y are compact, then treats the general case by an approximation argument; this is somewhat tricky but has the advantage of avoiding the general version of the axiom of choice, since it uses the Hahn–Banach theorem only in the separable space C(K), where K is compact.

Mikami [629] recovered the duality theorem in R^n using stochastic control, and together with Thieullen [631] extended it to certain classes of stochastic processes.
Ramachandran and Rüschendorf [698, 699] investigated the Kantorovich duality out of the setting of Polish spaces, and found out a necessary and sufficient condition for its validity (the spaces should be “perfect”).

In the case when the cost function is a distance, the optimal transport problem coincides with the Kantorovich transshipment problem, for which more duality theorems are available, and a vast literature

⁴ De Acosta used an idea suggested by Dudley in Saint-Flour, 25 years before my own course!

has been written; see [696, Chapter 6] for results and references. This topic is closely related to the subject of “Kantorovich norms”: see [464], [695, Chapters 5 and 6], [450, Chapter 4], [149] and the many references therein.

Around the mid-eighties, it was understood that the study of the dual problem, and in particular the existence of a maximizer, could lead to precious qualitative information about the solution of the Monge–Kantorovich problem. This point of view was emphasized by many authors such as Knott and Smith [524], Cuesta-Albertos, Matrán and Tuero-Díaz [254, 255, 259], Brenier [154, 156], Rachev and Rüschendorf [696, 722], Abdellaoui and Heinich [1, 2], Gangbo [395], Gangbo and McCann [398, 399], McCann [616] and probably others. Then Ambrosio and Pratelli proved the existence of a maximizing pair under the conditions (5.10); see [32, Theorem 3.2]. Under adequate assumptions, one can also prove the existence of a maximizer for the dual problem by direct arguments which do not use the original problem (see for instance [814, Chapter 2]).

The classical theory of convexity and its links with the property of cyclical monotonicity are exposed in the well-known treatise by Rockafellar [705]. The more general notions of c-convexity and c-cyclical monotonicity were studied by several researchers, in particular Rüschendorf [722]. Some authors prefer to use c-concavity; I personally advocate working with c-convexity, because signs get better in many situations. However, the conventions used in the present set of notes have the disadvantage that the cost function c(·,y) is not c-convex. A possibility to remedy this would be to call (−c)-convexity the notion which I defined. This convention (suggested to me by Trudinger) is appealing, but would have forced me to write (−c)-convex hundreds of times throughout this book.
The notation ∂_c ψ(x) and the terminology of c-subdifferential is derived from the usual notation ∂ψ(x) in convex analysis. Let me stress that in my notation ∂_c ψ(x) is a set of points, not a set of tangent vectors or differential forms. Some authors prefer to call ∂_c ψ(x) the contact set of ψ at x (any y in the contact set is a contact point) and to use the notation ∂_c ψ(x) for a set of tangent vectors (which under suitable assumptions can be identified with the contact set, and which I shall denote by −∇_x c(x, ∂_c ψ(x)), or ∇_c ψ(x), in Chapters 10 and 12).

In [712] the authors argue that c-convex functions should be constructible in practice when the cost function c is convex (in the usual sense), in the sense that such c-convex profiles can be “engineered” with a convex tool.

For the proof of Theorem 5.10, I borrowed from McCann [613] the idea of recovering c-cyclical monotonicity from approximation by combinations of Dirac masses; from Rüschendorf [719] the method used to reconstruct ψ from Γ; from Schachermayer and Teichmann [738] the clever truncation procedure used in the proof of Part (ii). Apart from that the general scheme of proof is more or less the one which was used by Ambrosio and Pratelli [32], and Ambrosio, Gigli and Savaré [30]. On the whole, the proof avoids the use of the axiom of choice, does not need any painful approximation procedure, and leads to the best known results. In my opinion these advantages do compensate for its being somewhat tricky.

About the proof of the Kantorovich duality, it is interesting to notice that “duality for somebody implies duality for everybody” (a rule which is true in other branches of analysis): In the present case, constructing one particular cyclically monotone transference plan allows one to prove the duality, which leads to the conclusion that all transference plans should be cyclically monotone. By the way, the latter statement could also be proven directly with the help of a bit of measure-theoretical abstract nonsense, see e.g. [399, Theorem 2.3] or [1, 2].

The use of the law of large numbers for empirical measures might be natural for a probabilistic audience, but one should not forget that this is a subtle result: For any bounded continuous test function, the usual law of large numbers yields convergence out of a negligible set, but then one has to find a negligible set that works for all bounded continuous test functions. In a compact space X this is easy, because C_b(X) is separable, but if X is not compact one should be careful.
Dudley [318, Theorem 11.4.1] proves the law of large numbers for empirical measures on general separable metric spaces, giving credit to Varadarajan for this theorem. In the dynamical systems community, these results are known as part of the so-called Krylov–Bogoljubov theory, in relation with ergodic theorems; see e.g. Oxtoby [674] for a compact space.

The equivalence between the properties of optimality (of a transference plan) and cyclical monotonicity, for quite general cost functions and probability measures, was a widely open problem until recently; it was explicitly listed as Open Problem 2.25 in [814] for a quadratic cost function in R^n. The current state of the art is as follows:

• the equivalence is false for a general lower semicontinuous cost function with possibly infinite values, as shown by a clever counterexample of Ambrosio and Pratelli [32];

• the equivalence is true for a continuous cost function with possibly infinite values, as shown by Pratelli [688];

• the equivalence is true for a real-valued lower semicontinuous cost function, as shown by Schachermayer and Teichmann [738]; this is the result that I chose to present in this course. Actually it is sufficient for the cost to be lower semicontinuous and real-valued almost everywhere (with respect to the product measure μ ⊗ ν);

• more generally, the equivalence is true as soon as c is measurable (not even lower semicontinuous!) and {c = ∞} is the union of a closed set and a (μ ⊗ ν)-negligible Borel set; this result is due to Beiglböck, Goldstern, Maresch and Schachermayer [79].

Schachermayer and Teichmann gave a nice interpretation of the Ambrosio–Pratelli counterexample and suggested that the correct notion in the whole business is not cyclical monotonicity, but a variant which they named “strong cyclical monotonicity condition” [738].

As I am completing these notes, it seems that the final resolution of this equivalence issue might soon be available, but at the price of a journey into the very guts of measure theory. The following construction was explained to me by Bianchini. Let c be an arbitrary lower semicontinuous cost function with possibly infinite values, and let π be a c-cyclically monotone plan. Let Γ be a c-cyclically monotone set with π[Γ] = 1. Define an equivalence relation R on Γ as follows: (x,y) ∼ (x′,y′) if there is a finite number of pairs (x_k, y_k) ∈ Γ, 0 ≤ k ≤ N, such that: (x,y) = (x_0, y_0); either c(x_0, y_1) < +∞ or c(x_1, y_0) < +∞; either c(x_1, y_2) < +∞ or c(x_2, y_1) < +∞; etc. until (x_N, y_N) = (x′, y′). The relation R divides Γ into equivalence classes (Γ_α)_{α ∈ Γ/R}. Let p be the map which to a point x ∈ Γ associates its equivalence class. The set Γ/R in general has no topological or measurable structure, but we can equip it with the largest σ-algebra making p measurable. On Γ × (Γ/R) introduce the product σ-algebra. If now the graph of p is measurable for this σ-algebra, then π should be optimal in the Monge–Kantorovich problem.

Related to the above discussion is the “transitive c-monotonicity” considered by Beiglböck, Goldstern, Maresch and Schachermayer [79], who also study in depth the links of this notion with c-monotonicity, optimality, strong c-monotonicity in the sense of [738], and a new interesting concept of “robust optimality”. The results in [79] unify those of [688] and [738]. A striking theorem is that robust optimality is always equivalent to strong c-monotonicity.

An alternative approach to optimality criteria, based on extensions of the classical saddle-point theory, was developed by Léonard [550].

In most applications, the cost function is continuous, and often rather simple. However, it is sometimes useful to consider cost functions that achieve the value +∞, as in the “secondary variational problem” considered by Ambrosio and Pratelli [32] and also by Bernard and Buffoni [104]. Such is also the case for the optimal transport in Wiener space considered by Feyel and Üstünel [358, 359, 360, 361], for which the cost function c(x,y) is the square norm of x − y in the Cameron–Martin space (so it is +∞ if x − y does not belong to that space). In this setting, optimizers in the dual problem can be constructed via finite-dimensional approximations, but it is not known whether there is a more direct construction by c-monotonicity.

If one uses the cost function |x − y|^p in R^n and lets p → ∞, then the c-cyclical monotonicity condition becomes sup_i |x_i − y_i| ≤ sup_i |x_i − y_{i+1}|. Much remains of the analysis of the Kantorovich duality, but there are also noticeable changes [222].

When condition (5.9) (or its weakened version (5.10)) is relaxed, it is not clear in general that the dual Kantorovich problem admits a maximizing pair. Yet this is true for instance in the case of optimal transport in Wiener space; this is an indication that condition (5.10) might not be the “correct” one, although at present no better general condition is known.
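For finite supports, c-cyclical monotonicity as discussed above can be verified by direct enumeration: a set of pairs {(x_i, y_i)} is c-cyclically monotone when no rearrangement of the targets lowers the total cost. The sketch below is not from the book; it is an illustrative brute force (the function name and the quadratic cost are my own choices), feasible only for tiny supports since it enumerates all permutations.

```python
from itertools import permutations

def is_c_cyclically_monotone(pairs, c):
    """Check c-cyclical monotonicity of finitely many pairs (x_i, y_i):
    for every permutation sigma of the indices,
        sum_i c(x_i, y_i) <= sum_i c(x_i, y_sigma(i)).
    Permutations of the full index set cover all subset cycles, since a
    permutation may fix some points.  Brute force: O(n!)."""
    xs, ys = zip(*pairs)
    base = sum(c(x, y) for x, y in pairs)
    return all(
        base <= sum(c(x, ys[s]) for x, s in zip(xs, sigma)) + 1e-12
        for sigma in permutations(range(len(pairs)))
    )

c = lambda x, y: (x - y) ** 2                        # quadratic cost on R
print(is_c_cyclically_monotone([(0, 0), (1, 1), (2, 2)], c))  # True
print(is_c_cyclically_monotone([(0, 2), (2, 0)], c))          # False (crossing)
```

For the quadratic cost on the line, the monotone (non-crossing) pairing passes the check while the crossing pairing fails, in agreement with the equivalence between optimality and cyclical monotonicity for real-valued continuous costs.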
Lemma 5.18 and Theorem 5.19 were inspired by a recent work of Fathi and Figalli [348], in which a restriction procedure is used to solve the Monge problem for certain cost functions arising from Lagrangian dynamics in unbounded space; see Theorem 10.28 for more information.

The core argument in the proof of Theorem 5.20 has probably been discovered several times; I learnt it in [30, Proposition 7.13]. At the present level of generality this statement is due to Schachermayer and Teichmann [738].

I learnt Corollary 5.23 from Ambrosio and Pratelli [32], for measures on Euclidean space. The general statement presented here, and its simple proof, were suggested to me by Schulte, together with the counterexample in Remark 5.25. In some situations where more structure is available and optimal transport is smooth, one can refine the

convergence in probability into true pointwise convergence [822]. Even when that information is not available, under some circumstances one can obtain the uniform convergence of c(x, T_k(x)): Theorem 28.9(v) is an illustration (this principle also appears in the tedious proof of Theorem 23.14).

Measurable selection theorems in the style of Corollary 5.22 go back at least to Berbee [98]. Recently, Fontbona, Guérin and Méléard [374] studied the measurability of the map which to (μ,ν) associates the union of the supports of all optimal transports between μ and ν (measurability in the sense of set-valued functions). This question was motivated by applications in particle systems and coupling of stochastic differential equations.

Theorem 5.26 appears in a more or less explicit form in various works, especially for the particular case described in Example 5.29; see in particular [128, Section 3]. About Legendre duality for convex functions in R, one may consult [44, Chapter 14]. The classical reference textbook about Legendre duality for plainly convex functions in R^n is [705]. An excellent introduction to the Legendre duality in Banach spaces can be found in [172, Section I.3].

Finally, I shall say a few words about some basic measure-theoretical tools used in this chapter.

The regularity of Borel measures on Polish spaces is proven in [318, p. 225].

Measurable selection theorems provide conditions under which one may select elements satisfying certain conditions, in a measurable way. The theorem which I used to prove Corollary 5.22 is one of the most classical results of this kind: A surjective Borel map f between Polish spaces, such that the fibers f^{−1}(y) are all compact, admits a Borel right-inverse. See Dellacherie [288] for a modern proof.
There are more advanced selection theorems due to Aumann, Castaing, Kuratowski, Michael, Novikov, Ryll-Nardzewski, Sainte-Beuve, von Neumann, and others, whose precise statements are remarkably opaque for nonexperts; a simplified and readable account can be found in [18, Chapter 18].

Lusin's theorem [352, Theorem 2.3.5], which was used in the proof of Corollary 5.23, states the following: If X is a locally compact metric space, Y is a separable metric space, μ is a Borel measure on X, f is a measurable map X → Y, and A ⊂ X is a measurable set with finite measure, then for each δ > 0 there is a closed set K ⊂ A such that μ[A \ K] < δ and the restriction of f to K is continuous.

6 The Wasserstein distances

Assume, as before, that you are in charge of the transport of goods between producers and consumers, whose respective spatial distributions are modeled by probability measures. The farther producers and consumers are from each other, the more difficult will be your job, and you would like to summarize the degree of difficulty with just one quantity. For that purpose it is natural to consider, as in (5.27), the optimal transport cost between the two measures:

    C(μ,ν) = inf_{π ∈ Π(μ,ν)} ∫ c(x,y) dπ(x,y),   (6.1)

where c(x,y) is the cost for transporting one unit of mass from x to y. Here we do not care about the shape of the optimizer but only the value of this optimal cost.

One can think of (6.1) as a kind of distance between μ and ν, but in general it does not, strictly speaking, satisfy the axioms of a distance function. However, when the cost is defined in terms of a distance, it is easy to cook up a distance from (6.1):

Definition 6.1 (Wasserstein distances). Let (X,d) be a Polish metric space, and let p ∈ [1,∞). For any two probability measures μ, ν on X, the Wasserstein distance of order p between μ and ν is defined by the formula

    W_p(μ,ν) = ( inf_{π ∈ Π(μ,ν)} ∫_X d(x,y)^p dπ(x,y) )^{1/p}   (6.2)
             = inf { [ E d(X,Y)^p ]^{1/p} ;  law(X) = μ, law(Y) = ν }.
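The definition can be tried out numerically. For empirical measures on the real line with equally many atoms of equal weight, the optimal coupling is the monotone rearrangement (a standard one-dimensional fact, not proven at this point of the text), so W_p has a closed form in terms of order statistics. The sketch below is not taken from the book; it assumes NumPy and SciPy, and the function name is my own:

```python
import numpy as np
from scipy.stats import wasserstein_distance  # SciPy's W_1 on the real line

def wasserstein_p(xs, ys, p=2.0):
    """W_p between the empirical measures (1/n) sum_i delta_{x_i} and
    (1/n) sum_i delta_{y_i} on R: pair the i-th smallest x with the
    i-th smallest y (monotone coupling, optimal in 1D), then average
    the p-th power gaps and take the p-th root."""
    xs = np.sort(np.asarray(xs, dtype=float))
    ys = np.sort(np.asarray(ys, dtype=float))
    return float(np.mean(np.abs(xs - ys) ** p) ** (1.0 / p))

rng = np.random.default_rng(seed=0)
a = rng.normal(loc=0.0, scale=1.0, size=2000)
b = rng.normal(loc=3.0, scale=1.0, size=2000)   # same shape, shifted by 3
print(wasserstein_p(a, b, p=1))    # close to 3, the size of the shift
print(wasserstein_p(a, b, p=2))    # also close to 3 for a near-translation
print(wasserstein_distance(a, b))  # SciPy's W_1 agrees with the p = 1 value
```

For a pure translation of a measure, every W_p equals the length of the translation, which the two printed values approximate up to sampling error.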

Particular Case 6.2 (Kantorovich–Rubinstein distance). The distance W_1 is also commonly called the Kantorovich–Rubinstein distance (although it would be more proper to save the terminology Kantorovich–Rubinstein for the norm which extends W_1; see the bibliographical notes).

Example 6.3. W_p(δ_x, δ_y) = d(x,y). In this example, the distance does not depend on p; but this is not the rule.

At the present level of generality, W_p is still not a distance in the strict sense, because it might take the value +∞; but otherwise it does satisfy the axioms of a distance, as will be checked right now.

Proof that W_p satisfies the axioms of a distance. First, it is clear that W_p(μ,ν) = W_p(ν,μ).

Next, let μ_1, μ_2 and μ_3 be three probability measures on X, and let (X_1, X_2) be an optimal coupling of (μ_1, μ_2) and (Z_2, Z_3) an optimal coupling of (μ_2, μ_3) (for the cost function c = d^p). By the Gluing Lemma (recalled in Chapter 1), there exist random variables (X'_1, X'_2, X'_3) with law(X'_1, X'_2) = law(X_1, X_2) and law(X'_2, X'_3) = law(Z_2, Z_3). In particular, (X'_1, X'_3) is a coupling of (μ_1, μ_3), so

    W_p(μ_1, μ_3) ≤ ( E d(X'_1, X'_3)^p )^{1/p} ≤ ( E [ d(X'_1, X'_2) + d(X'_2, X'_3) ]^p )^{1/p}
                 ≤ ( E d(X'_1, X'_2)^p )^{1/p} + ( E d(X'_2, X'_3)^p )^{1/p}
                 = W_p(μ_1, μ_2) + W_p(μ_2, μ_3),

where the inequality leading to the second line is an application of the Minkowski inequality in L^p(P), and the last equality follows from the fact that (X'_1, X'_2) and (X'_2, X'_3) are optimal couplings. So W_p satisfies the triangle inequality.

Finally, assume that W_p(μ,ν) = 0; then there exists a transference plan which is entirely concentrated on the diagonal (y = x) in X × X. So ν = Id_# μ = μ. ⊓⊔

To complete the construction it is natural to restrict W_p to a subset of P(X) × P(X) on which it takes finite values.
Definition 6.4 (Wasserstein space). With the same conventions as in Definition 6.1, the Wasserstein space of order p is defined as

    P_p(X) := { μ ∈ P(X);  ∫_X d(x_0, x)^p μ(dx) < +∞ },

where x_0 ∈ X is arbitrary. This space does not depend on the choice of the point x_0. Then W_p defines a (finite) distance on P_p(X).

In words, the Wasserstein space is the space of probability measures which have a finite moment of order p. In this course, it will always be equipped with the distance W_p.

Proof that W_p is finite on P_p. Let π be a transference plan between two elements μ and ν in P_p(X). Then the inequality

    d(x,y)^p ≤ 2^{p−1} [ d(x, x_0)^p + d(x_0, y)^p ]

shows that d(x,y)^p is π(dx dy)-integrable as soon as d(·, x_0)^p is μ-integrable and d(x_0, ·)^p is ν-integrable. ⊓⊔

Remark 6.5. Theorem 5.10(i) and Particular Case 5.4 together lead to the useful duality formula for the Kantorovich–Rubinstein distance: For any μ, ν in P_1(X),

    W_1(μ,ν) = sup_{‖ψ‖_Lip ≤ 1} { ∫_X ψ dμ − ∫_X ψ dν }.   (6.3)

Among many applications of this formula I shall just mention the following covariance inequality: if f is a probability density with respect to μ then

    ∫ (fg) dμ − ( ∫ f dμ )( ∫ g dμ ) ≤ ‖g‖_Lip W_1(f μ, μ).

Remark 6.6. A simple application of Hölder's inequality shows that

    p ≤ q  ⟹  W_p ≤ W_q.   (6.4)

In particular, the Wasserstein distance of order 1, W_1, is the weakest of all. The most useful exponents in the Wasserstein distances are p = 1 and p = 2. As a general rule, the W_1 distance is more flexible and easier to bound, while the W_2 distance better reflects geometric features (at least for problems with a Riemannian flavor), and is better adapted when there is more structure; it also scales better with the dimension. Results in W_2 distance are usually stronger, and more difficult to establish, than results in W_1 distance.
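The duality (6.3) can be checked numerically for discrete measures on the line: the supremum over 1-Lipschitz functions becomes a small linear program in the values ψ(p_k) at the support points (on R, a function that is 1-Lipschitz on finitely many points extends to a 1-Lipschitz function on the whole line, so restricting the constraint to the supports loses nothing). The snippet is an illustrative sketch, not from the book; the measures are made up, and SciPy is assumed:

```python
import numpy as np
from scipy.optimize import linprog
from scipy.stats import wasserstein_distance

# Two discrete measures on R (supports and weights chosen arbitrarily).
x = np.array([0.0, 1.0, 4.0]); mu = np.array([0.3, 0.4, 0.3])
y = np.array([0.5, 2.0, 3.0]); nu = np.array([0.2, 0.5, 0.3])

# Dual side of (6.3): maximize sum_i psi(x_i) mu_i - sum_j psi(y_j) nu_j
# over psi that is 1-Lipschitz on the union of the supports.
pts = np.concatenate([x, y]); n = len(pts)
obj = np.concatenate([-mu, nu])           # linprog minimizes, so negate
A_ub, b_ub = [], []
for k in range(n):
    for l in range(n):
        if k != l:                        # psi_k - psi_l <= |p_k - p_l|
            row = np.zeros(n); row[k], row[l] = 1.0, -1.0
            A_ub.append(row); b_ub.append(abs(pts[k] - pts[l]))
res = linprog(obj, A_ub=np.array(A_ub), b_ub=np.array(b_ub),
              bounds=(None, None))
dual_value = -res.fun

primal_value = wasserstein_distance(x, y, mu, nu)  # W_1 computed directly
print(dual_value, primal_value)          # the two values coincide
```

The dual optimum is reached at a Kantorovich potential ψ; the agreement of the two printed numbers is exactly the content of (6.3) on this toy example.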

Remark 6.7. On the other hand, under adequate regularity assumptions on the cost function and the probability measures, it is possible to control W_p in terms of W_q even for q < p; these reverse inequalities express a certain rigidity property of optimal transport maps which comes from c-cyclical monotonicity. See the bibliographical notes for more details.

Convergence in Wasserstein sense

Now we shall study a characterization of convergence in the Wasserstein space. The notation μ_k ⟶ μ means that μ_k converges weakly to μ, i.e. ∫ φ dμ_k → ∫ φ dμ for any bounded continuous φ.

Definition 6.8 (Weak convergence in P_p). Let (X,d) be a Polish space, and p ∈ [1,∞). Let (μ_k)_{k∈N} be a sequence of probability measures in P_p(X) and let μ be another element of P_p(X). Then (μ_k) is said to converge weakly in P_p(X) if any one of the following equivalent properties is satisfied for some (and then any) x_0 ∈ X:

(i) μ_k ⟶ μ and ∫ d(x_0, x)^p dμ_k(x) ⟶ ∫ d(x_0, x)^p dμ(x);

(ii) μ_k ⟶ μ and lim sup_{k→∞} ∫ d(x_0, x)^p dμ_k(x) ≤ ∫ d(x_0, x)^p dμ(x);

(iii) μ_k ⟶ μ and lim_{R→∞} lim sup_{k→∞} ∫_{d(x_0,x) ≥ R} d(x_0, x)^p dμ_k(x) = 0;

(iv) For all continuous functions φ with |φ(x)| ≤ C (1 + d(x_0, x)^p), C ∈ R, one has ∫ φ(x) dμ_k(x) ⟶ ∫ φ(x) dμ(x).

Theorem 6.9 (W_p metrizes P_p). Let (X,d) be a Polish space, and p ∈ [1,∞); then the Wasserstein distance W_p metrizes the weak convergence in P_p(X). In other words, if (μ_k)_{k∈N} is a sequence of measures in P_p(X) and μ is another measure in P(X), then the statements

    μ_k converges weakly in P_p(X) to μ

and

    W_p(μ_k, μ) ⟶ 0

are equivalent.
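The moment condition in Definition 6.8 is not redundant: plain weak convergence does not force W_p-convergence, because a vanishing amount of mass can escape to infinity. A numerical illustration (not from the book; SciPy assumed):

```python
from scipy.stats import wasserstein_distance

# mu_k = (1 - 1/k) delta_0 + (1/k) delta_k converges weakly to delta_0:
# for bounded continuous phi,
#     int phi dmu_k = (1 - 1/k) phi(0) + (1/k) phi(k) -> phi(0).
# But a mass 1/k sits at distance k, so the first moment of mu_k stays
# equal to 1: condition (i) of Definition 6.8 fails, and accordingly
# W_1(mu_k, delta_0) does not tend to 0.
for k in [10, 100, 1000]:
    w1 = wasserstein_distance([0.0, float(k)], [0.0],
                              u_weights=[1 - 1 / k, 1 / k], v_weights=[1.0])
    print(k, w1)   # stays near 1.0 instead of going to 0
```

This is the standard picture behind Theorem 6.9: W_p-convergence is weak convergence plus convergence of the p-th moments, and here the moments refuse to converge.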

Remark 6.10. As a consequence of Theorem 6.9, convergence in the p-Wasserstein space implies convergence of the moments of order p. There is a stronger statement that the map μ ⟼ ( ∫ d(x_0, x)^p μ(dx) )^{1/p} is 1-Lipschitz with respect to W_p; in the case of a locally compact length space, this will be proven in Proposition 7.29.

Below are two immediate corollaries of Theorem 6.9 (the first one results from the triangle inequality):

Corollary 6.11 (Continuity of W_p). If (X,d) is a Polish space, and p ∈ [1,∞), then W_p is continuous on P_p(X). More explicitly, if μ_k (resp. ν_k) converges to μ (resp. ν) weakly in P_p(X) as k → ∞, then

    W_p(μ_k, ν_k) ⟶ W_p(μ,ν).

Remark 6.12. On the contrary, if these convergences are only usual weak convergences, then one can only conclude that W_p(μ,ν) ≤ lim inf W_p(μ_k, ν_k): the Wasserstein distance is lower semicontinuous on P(X) (just like the optimal transport cost C, for any lower semicontinuous nonnegative cost function c; recall the proof of Theorem 4.1).

Corollary 6.13 (Metrizability of the weak topology). Let (X,d) be a Polish space. If d̃ is a bounded distance inducing the same topology as d (such as d̃ = d/(1 + d)), then the convergence in Wasserstein sense for the distance d̃ is equivalent to the usual weak convergence of probability measures in P(X).

Before starting the proof of Theorem 6.9, it will be good to make some more comments. The short version of that theorem is that Wasserstein distances metrize weak convergence. This sounds good, but after all, there are many ways to metrize weak convergence.
Here is a list of some of the most popular ones, defined either in terms of measures μ, ν, or in terms of random variables X, Y with law(X) = μ, law(Y) = ν:

• the Lévy–Prokhorov distance (or just Prokhorov distance):

    d_P(μ,ν) = inf { ε > 0;  inf_{(X,Y)} P[ d(X,Y) > ε ] ≤ ε },   (6.5)

where the inner infimum is over couplings (X,Y) of (μ,ν);

• the bounded Lipschitz distance (also called Fortet–Mourier distance):

    d_bL(μ,ν) = sup { ∫ φ dμ − ∫ φ dν;  ‖φ‖_∞ + ‖φ‖_Lip ≤ 1 };   (6.6)

• the weak-∗ distance (on a locally compact metric space):

    d_{w∗}(μ,ν) = Σ_{k∈N} 2^{−k} | ∫ φ_k dμ − ∫ φ_k dν |,   (6.7)

where (φ_k)_{k∈N} is a dense sequence in C_0(X);

• the Toscani distance (on P_2(R^n)):

    d_T(μ,ν) = sup_{ξ ∈ R^n \ {0}} [ | ∫ e^{−i x·ξ} dμ(x) − ∫ e^{−i x·ξ} dν(x) | / |ξ|² ].   (6.8)

(Here I implicitly assume that μ, ν have the same mean, otherwise d_T(μ,ν) would be infinite; one can also introduce variants of d_T by changing the exponent 2 in the denominator.)

So why bother with Wasserstein distances? There are several answers to that question:

1. Wasserstein distances are rather strong, especially in the way they take care of large distances in X; this is a definite advantage over, for instance, the weak-∗ distance (which in practice is so weak that I advise the reader to never use it). It is not so difficult to combine information on convergence in Wasserstein distance with some smoothness bound, in order to get convergence in stronger distances.

2. The definition of Wasserstein distances makes them convenient to use in problems where optimal transport is naturally involved, such as many problems coming from partial differential equations.

3. The Wasserstein distances have a rich duality; this is especially useful for p = 1, in view of (6.3) (compare with the definition of the bounded Lipschitz distance). Passing back and forth from the original to the dual definition is often technically convenient.

4. Being defined by an infimum, Wasserstein distances are often relatively easy to bound from above: The construction of any coupling between μ and ν yields a bound on the distance between μ and ν. In the same line of ideas, any C-Lipschitz mapping f : X → X′ induces a C-Lipschitz mapping P_1(X) → P_1(X′) defined by μ ⟼ f_# μ (the proof is obvious).

5. Wasserstein distances incorporate a lot of the geometry of the space. For instance, the mapping x ⟼ δ_x is an isometric embedding of X into P_p(X); but there are much deeper links. This partly explains why P_p(X) is often very well adapted to statements that combine weak convergence and geometry.

To prove Theorem 6.9 I shall use the following lemma, which has interest on its own and will be useful again later.

Lemma 6.14 (Cauchy sequences in W_p are tight). Let X be a Polish space, let p ≥ 1 and let (μ_k)_{k∈N} be a Cauchy sequence in (P_p(X), W_p). Then (μ_k) is tight.

The proof is not so obvious and one might skip it at first reading.

Proof of Lemma 6.14. Let (μ_k)_{k∈N} be a Cauchy sequence in P_p(X): This means that W_p(μ_k, μ_ℓ) ⟶ 0 as k, ℓ → ∞. In particular,

    ∫ d(x_0, x)^p dμ_k(x) = W_p(δ_{x_0}, μ_k)^p ≤ [ W_p(δ_{x_0}, μ_1) + W_p(μ_1, μ_k) ]^p

remains bounded as k → ∞.

Since W_p ≥ W_1, the sequence (μ_k) is also Cauchy in the W_1 sense. Let ε > 0 be given, and let N ∈ N be such that

    k ≥ N  ⟹  W_1(μ_N, μ_k) < ε².   (6.9)

Then for any k ∈ N, there is j ∈ {1,...,N} such that W_1(μ_j, μ_k) < ε². (If k ≥ N, this is (6.9); if k < N, just choose j = k.)

Since the finite set {μ_1,...,μ_N} is tight, there is a compact set K such that μ_j[X \ K] < ε for all j ∈ {1,...,N}. By compactness, K can be covered by a finite number of small balls: K ⊂ B(x_1, ε) ∪ ... ∪ B(x_m, ε).

Now write

    U := B(x_1, ε) ∪ ... ∪ B(x_m, ε);
    U_ε := { x ∈ X;  d(x, U) < ε } ⊂ B(x_1, 2ε) ∪ ... ∪ B(x_m, 2ε);
    φ(x) := ( 1 − d(x, U)/ε )_+.

Note that $\mathbf 1_U \leq \varphi \leq \mathbf 1_{U_\varepsilon}$ and $\varphi$ is $(1/\varepsilon)$-Lipschitz. By using these bounds and the Kantorovich–Rubinstein duality (6.3), we find that for $j \leq N$ and $k$ arbitrary,

$$\mu_k[U_\varepsilon] \geq \int \varphi\, d\mu_k = \int \varphi\, d\mu_j + \Bigl( \int \varphi\, d\mu_k - \int \varphi\, d\mu_j \Bigr) \geq \int \varphi\, d\mu_j - \frac{W_1(\mu_j,\mu_k)}{\varepsilon} \geq \mu_j[U] - \frac{W_1(\mu_j,\mu_k)}{\varepsilon}.$$

On the one hand, $\mu_j[U] \geq \mu_j[K] \geq 1 - \varepsilon$ if $j \leq N$; on the other hand, for each $k$ we can find $j = j(k)$ such that $W_1(\mu_j,\mu_k) \leq \varepsilon^2$. So in fact

$$\mu_k[U_\varepsilon] \geq 1 - \varepsilon - \frac{\varepsilon^2}{\varepsilon} = 1 - 2\varepsilon.$$

At this point we have shown the following: For each $\varepsilon > 0$ there is a finite family $(x_i)_{1\leq i\leq m}$ such that all measures $\mu_k$ give mass at least $1 - 2\varepsilon$ to the set $Z := \bigcup_i B(x_i, 2\varepsilon)$. The point is that $Z$ might not be compact. There is a classical remedy: Repeat the reasoning with $\varepsilon$ replaced by $2^{-(\ell+1)}\varepsilon$, $\ell \in \mathbb N$; so there will be $(x_i)_{1\leq i\leq m(\ell)}$ such that

$$\mu_k \Bigl[ X \setminus \bigcup_{1\leq i\leq m(\ell)} B(x_i, 2^{-\ell}\varepsilon) \Bigr] \leq 2^{-\ell}\varepsilon.$$

Thus

$$\mu_k[X\setminus S] \leq \varepsilon, \qquad \text{where} \quad S := \bigcap_{\ell\geq 1}\ \bigcup_{1\leq i\leq m(\ell)} B(x_i, 2^{-\ell}\varepsilon).$$

By construction, $S$ can be covered by finitely many balls of radius $\delta$, where $\delta$ is arbitrarily small (just choose $\ell$ large enough that $2^{-\ell}\varepsilon < \delta$, and then $S$ will be included in $\bigcup B(x_i, 2^{-\ell}\varepsilon) \subset \bigcup B(x_i, \delta)$). Thus $S$ is totally bounded, i.e. it can be covered by finitely many balls of arbitrarily small radius. It is also closed, as an intersection of finite unions of closed sets. Since $X$ is a complete metric space, it follows from a classical result in topology that $S$ is compact. This concludes the proof of Lemma 6.14. ⊓⊔
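The inequality $W_p \geq W_1$, used at the start of this proof, can be checked numerically in the discrete setting (a sketch of mine, not from the book): for equal-weight empirical measures on the real line, the sorted pairing is optimal for every convex cost $|x-y|^p$, so $W_p$ is the $p$-th power mean of the sorted gaps, which is nondecreasing in $p$ by Jensen's inequality.

```python
# Illustration (not from the book): monotonicity of p |-> W_p on empirical
# measures over the real line, where the sorted pairing is optimal.

def wp_empirical(xs, ys, p):
    """W_p between equal-weight empirical measures on the line."""
    n = len(xs)
    gaps = (abs(x - y) ** p for x, y in zip(sorted(xs), sorted(ys)))
    return (sum(gaps) / n) ** (1.0 / p)

xs = [0.0, 1.0, 5.0]
ys = [0.2, 2.5, 3.0]
vals = [wp_empirical(xs, ys, p) for p in (1, 2, 3, 4)]
assert all(a <= b + 1e-12 for a, b in zip(vals, vals[1:]))  # W_1 <= W_2 <= ...
```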

Proof of Theorem 6.9. Let $(\mu_k)_{k\in\mathbb N}$ be such that $\mu_k \to \mu$ in distance $W_p$; the goal is to show that $\mu_k$ converges to $\mu$ in $P_p(X)$. First, by Lemma 6.14, the sequence $(\mu_k)_{k\in\mathbb N}$ is tight, so there is a subsequence $(\mu_{k'})$ such that $\mu_{k'}$ converges weakly to some probability measure $\widetilde\mu$. Then by Lemma 4.3,

$$W_p(\widetilde\mu, \mu) \leq \liminf_{k'\to\infty} W_p(\mu_{k'}, \mu) = 0.$$

So $\widetilde\mu = \mu$, and the whole sequence $(\mu_k)$ has to converge to $\mu$. This only shows the weak convergence in the usual sense, not yet the convergence in $P_p(X)$.

For any $\varepsilon > 0$ there is a constant $C_\varepsilon > 0$ such that for all nonnegative real numbers $a$, $b$,

$$(a+b)^p \leq (1+\varepsilon)\, a^p + C_\varepsilon\, b^p.$$

Combining this inequality with the usual triangle inequality, we see that whenever $x_0$, $x$ and $y$ are three points in $X$, one has

$$d(x_0,y)^p \leq (1+\varepsilon)\, d(x_0,x)^p + C_\varepsilon\, d(x,y)^p. \tag{6.10}$$

Now let $(\mu_k)$ be a sequence in $P_p(X)$ such that $W_p(\mu_k,\mu) \longrightarrow 0$, and for each $k$, let $\pi_k$ be an optimal transport plan between $\mu_k$ and $\mu$. Integrating inequality (6.10) against $\pi_k$ and using the marginal property, we find that

$$\int d(x_0,x)^p\, d\mu_k(x) \leq (1+\varepsilon) \int d(x_0,y)^p\, d\mu(y) + C_\varepsilon \int d(x,y)^p\, d\pi_k(x,y).$$

But of course,

$$\int d(x,y)^p\, d\pi_k(x,y) = W_p(\mu_k,\mu)^p \xrightarrow[k\to\infty]{} 0;$$

therefore,

$$\limsup_{k\to\infty} \int d(x_0,x)^p\, d\mu_k(x) \leq (1+\varepsilon) \int d(x_0,x)^p\, d\mu(x).$$

Letting $\varepsilon \to 0$, we see that Property (ii) of Definition 6.8 holds true; so $\mu_k$ does converge weakly in $P_p(X)$ to $\mu$.

Conversely, assume that $\mu_k$ converges weakly in $P_p(X)$ to $\mu$; and again, for each $k$, introduce an optimal transport plan $\pi_k$ between $\mu_k$ and $\mu$, so that

$$\int d(x,y)^p\, d\pi_k(x,y) \longrightarrow 0.$$

By Prokhorov's theorem, $(\mu_k)$ forms a tight sequence; also $\{\mu\}$ is tight. By Lemma 4.4, the sequence $(\pi_k)$ is itself tight in $P(X\times X)$. So, up to extraction of a subsequence, still denoted by $(\pi_k)$, one may assume that

$$\pi_k \longrightarrow \pi \qquad \text{weakly in } P(X\times X).$$

Since each $\pi_k$ is optimal, Theorem 5.20 guarantees that $\pi$ is an optimal coupling of $\mu$ and $\mu$, so this is the (completely trivial) coupling $\pi = (\mathrm{Id},\mathrm{Id})_\#\mu$ (in terms of random variables, $Y = X$). Since this limit is independent of the extracted subsequence, actually $\pi$ is the limit of the whole sequence $(\pi_k)$.

Now let $x_0 \in X$ and $R > 0$. If $d(x,y) > R$, then the largest of the two numbers $d(x,x_0)$ and $d(x_0,y)$ has to be greater than $R/2$, and no less than $d(x,y)/2$. In a slightly pedantic form,

$$\mathbf 1_{d(x,y)\geq R} \leq \mathbf 1_{[d(x,x_0)\geq R/2 \ \text{and}\ d(x,x_0)\geq d(x,y)/2]} + \mathbf 1_{[d(x_0,y)\geq R/2 \ \text{and}\ d(x_0,y)\geq d(x,y)/2]}.$$

So, obviously,

$$\bigl[ d(x,y)^p - R^p \bigr]_+ \leq d(x,y)^p\, \mathbf 1_{[d(x,x_0)\geq R/2 \ \text{and}\ d(x,x_0)\geq d(x,y)/2]} + d(x,y)^p\, \mathbf 1_{[d(x_0,y)\geq R/2 \ \text{and}\ d(x_0,y)\geq d(x,y)/2]}$$
$$\leq 2^p\, d(x,x_0)^p\, \mathbf 1_{d(x,x_0)\geq R/2} + 2^p\, d(x_0,y)^p\, \mathbf 1_{d(x_0,y)\geq R/2}.$$

It follows that

$$W_p(\mu_k,\mu)^p = \int d(x,y)^p\, d\pi_k(x,y) = \int \bigl[ d(x,y)^p \wedge R^p \bigr]\, d\pi_k(x,y) + \int \bigl[ d(x,y)^p - R^p \bigr]_+\, d\pi_k(x,y)$$
$$\leq \int \bigl[ d(x,y)^p \wedge R^p \bigr]\, d\pi_k(x,y) + 2^p \int_{d(x,x_0)\geq R/2} d(x,x_0)^p\, d\pi_k(x,y) + 2^p \int_{d(x_0,y)\geq R/2} d(x_0,y)^p\, d\pi_k(x,y)$$
$$= \int \bigl[ d(x,y)^p \wedge R^p \bigr]\, d\pi_k(x,y) + 2^p \int_{d(x,x_0)\geq R/2} d(x,x_0)^p\, d\mu_k(x) + 2^p \int_{d(x_0,y)\geq R/2} d(x_0,y)^p\, d\mu(y).$$

Since $\pi_k$ converges weakly to $\pi$, the first term goes to 0 as $k\to\infty$. So

$$\limsup_{k\to\infty} W_p(\mu_k,\mu)^p \leq \lim_{R\to\infty}\, 2^p\, \limsup_{k\to\infty} \int_{d(x,x_0)\geq R/2} d(x,x_0)^p\, d\mu_k(x) + \lim_{R\to\infty}\, 2^p \int_{d(x_0,y)\geq R/2} d(x_0,y)^p\, d\mu(y) = 0.$$

This concludes the argument. ⊓⊔

Control by total variation

The total variation is a classical notion of distance between probability measures. There is, by the way, a classical probabilistic representation formula of the total variation:

$$\|\mu - \nu\|_{TV} = 2\, \inf\, P[X \neq Y], \tag{6.11}$$

where the infimum is over all couplings $(X,Y)$ of $(\mu,\nu)$; this identity can be seen as a very particular case of Kantorovich duality for the cost function $\mathbf 1_{x\neq y}$.

It seems natural that a control in Wasserstein distance should be weaker than a control in total variation. This is not completely true, because total variation does not take into account large distances. But one can control $W_p$ by weighted total variation:

Theorem 6.15 (Wasserstein distance is controlled by weighted total variation). Let $\mu$ and $\nu$ be two probability measures on a Polish space $(X,d)$. Let $p \in [1,\infty)$ and $x_0 \in X$. Then

$$W_p(\mu,\nu) \leq 2^{1/p'} \Bigl( \int d(x_0,x)^p\, d|\mu-\nu|(x) \Bigr)^{1/p}, \qquad \frac 1p + \frac 1{p'} = 1. \tag{6.12}$$

Particular Case 6.16. In the case $p = 1$, if the diameter of $X$ is bounded by $D$, this bound implies $W_1(\mu,\nu) \leq D\, \|\mu-\nu\|_{TV}$.

Remark 6.17. The integral in the right-hand side of (6.12) can be interpreted as the Wasserstein distance $W_1$ for the particular cost function $[d(x_0,x) + d(x_0,y)]\, \mathbf 1_{x\neq y}$.
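For measures on a finite set, identity (6.11) can be verified directly (a sketch of mine, not from the book): the best strategy keeps the shared mass $\sum_i \min(\mu_i,\nu_i)$ in place, so $\inf P[X\neq Y] = 1 - \sum_i \min(\mu_i,\nu_i)$, and (6.11) reduces to the elementary identity $\sum_i |\mu_i-\nu_i| = 2\,(1 - \sum_i \min(\mu_i,\nu_i))$.

```python
# Illustration (not from the book): formula (6.11) on a three-point space,
# with |mu - nu|(X) as the total variation convention used here.

mu = [0.5, 0.3, 0.2]
nu = [0.2, 0.5, 0.3]

tv = sum(abs(m - n) for m, n in zip(mu, nu))     # ||mu - nu||_TV
shared = sum(min(m, n) for m, n in zip(mu, nu))  # mass an optimal coupling keeps fixed
inf_prob_diff = 1.0 - shared                     # inf P[X != Y] over couplings
assert abs(tv - 2.0 * inf_prob_diff) < 1e-12
```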

Proof of Theorem 6.15. Let $\pi$ be the transference plan obtained by keeping fixed all the mass shared by $\mu$ and $\nu$, and distributing the rest uniformly: this is

$$\pi = (\mathrm{Id}, \mathrm{Id})_\# (\mu\wedge\nu) + \frac 1a\, (\mu-\nu)_+ \otimes (\mu-\nu)_-,$$

where $\mu\wedge\nu = \mu - (\mu-\nu)_+$ and $a = (\mu-\nu)_+[X] = (\mu-\nu)_-[X]$. A more sloppy but probably more readable way to write $\pi$ is

$$\pi(dx\,dy) = (\mu\wedge\nu)(dx)\, \delta_{y=x} + \frac 1a\, (\mu-\nu)_+(dx)\, (\mu-\nu)_-(dy).$$

By using the definition of $W_p$, the definition of $\pi$, the triangle inequality for $d$, the elementary inequality $(A+B)^p \leq 2^{p-1}(A^p + B^p)$, and the definition of $a$, we find that

$$W_p(\mu,\nu)^p \leq \int d(x,y)^p\, d\pi(x,y) = \frac 1a \int d(x,y)^p\, d(\mu-\nu)_+(x)\, d(\mu-\nu)_-(y)$$
$$\leq \frac{2^{p-1}}{a} \int \bigl[ d(x,x_0)^p + d(x_0,y)^p \bigr]\, d(\mu-\nu)_+(x)\, d(\mu-\nu)_-(y)$$
$$\leq 2^{p-1} \Bigl[ \int d(x,x_0)^p\, d(\mu-\nu)_+(x) + \int d(x_0,y)^p\, d(\mu-\nu)_-(y) \Bigr]$$
$$= 2^{p-1} \int d(x,x_0)^p\, d\bigl[ (\mu-\nu)_+ + (\mu-\nu)_- \bigr](x) = 2^{p-1} \int d(x,x_0)^p\, d|\mu-\nu|(x). \qquad ⊓⊔$$

Topological properties of the Wasserstein space

The Wasserstein space $P_p(X)$ inherits several properties of the base space $X$. Here is a first illustration:

Theorem 6.18 (Topology of the Wasserstein space). Let $X$ be a complete separable metric space and $p \in [1,\infty)$. Then the Wasserstein
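For discrete measures, the transference plan used in this proof can be written down explicitly. Below is a numerical sketch of mine (not from the book), for three atoms on the real line with $p = 1$, so that $2^{1/p'} = 1$ in (6.12); it assumes $\mu \neq \nu$, so that $a > 0$.

```python
# Illustration (not from the book): build
#   pi = (Id,Id)_#(mu ^ nu) + (1/a) (mu-nu)_+ (x) (mu-nu)_-
# for discrete measures, check its marginals, and verify the bound (6.12)
# with p = 1 and base point x0 = 0.

points = [0.0, 1.0, 3.0]
mu = [0.5, 0.3, 0.2]
nu = [0.2, 0.5, 0.3]
k = len(points)

pos = [max(m - n, 0.0) for m, n in zip(mu, nu)]  # (mu - nu)_+
neg = [max(n - m, 0.0) for m, n in zip(mu, nu)]  # (mu - nu)_-
a = sum(pos)                                     # requires mu != nu

plan = [[(min(mu[i], nu[i]) if i == j else 0.0) + pos[i] * neg[j] / a
         for j in range(k)] for i in range(k)]

for i in range(k):  # the marginals of pi are mu and nu
    assert abs(sum(plan[i]) - mu[i]) < 1e-12
    assert abs(sum(plan[j][i] for j in range(k)) - nu[i]) < 1e-12

cost = sum(plan[i][j] * abs(points[i] - points[j])
           for i in range(k) for j in range(k))  # an upper bound on W_1(mu, nu)
x0 = 0.0
bound = sum(abs(m - n) * abs(pt - x0) for m, n, pt in zip(mu, nu, points))
assert cost <= bound + 1e-12                     # the estimate (6.12) for p = 1
```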

space $P_p(X)$, metrized by the Wasserstein distance $W_p$, is also a complete separable metric space. In short: The Wasserstein space over a Polish space is itself a Polish space. Moreover, any probability measure can be approximated by a sequence of probability measures with finite support.

Remark 6.19. If $X$ is compact, then $P_p(X)$ is also compact; but if $X$ is only locally compact, then $P_p(X)$ is not locally compact.

Proof of Theorem 6.18. The fact that $P_p(X)$ is a metric space was already explained, so let us turn to the proof of separability. Let $\mathcal D$ be a dense sequence in $X$, and let $\mathcal P$ be the space of probability measures that can be written $\sum_j a_j\, \delta_{x_j}$, where the $a_j$ are rational coefficients, and the $x_j$ are finitely many elements in $\mathcal D$. It will turn out that $\mathcal P$ is dense in $P_p(X)$.

To prove this, let $\varepsilon > 0$ be given, and let $x_0$ be an arbitrary element of $\mathcal D$. If $\mu$ lies in $P_p(X)$, then there exists a compact set $K \subset X$ such that

$$\int_{X\setminus K} d(x_0,x)^p\, d\mu(x) \leq \varepsilon^p.$$

Cover $K$ by a finite family of balls $B(x_k, \varepsilon/2)$, $1 \leq k \leq N$, with centers $x_k \in \mathcal D$, and define

$$B'_k = B(x_k, \varepsilon) \setminus \bigcup_{j<k} B(x_j, \varepsilon).$$

the coefficients $a_j$ might be replaced by rational coefficients, up to a very small error in Wasserstein distance. By Theorem 6.15,

$$W_p \Bigl( \sum_{j\leq N} a_j\, \delta_{x_j},\ \sum_{j\leq N} b_j\, \delta_{x_j} \Bigr) \leq 2^{1/p'} \Bigl[ \sum_{j\leq N} |a_j - b_j|\ \max_{k,\ell\leq N} d(x_k, x_\ell)^p \Bigr]^{1/p},$$

and obviously the latter quantity can be made as small as possible for some well-chosen rational coefficients $b_j$.

Finally, let us prove the completeness. Again let $(\mu_k)_{k\in\mathbb N}$ be a Cauchy sequence in $P_p(X)$. By Lemma 6.14, it admits a subsequence $(\mu_{k'})$ which converges weakly (in the usual sense) to some measure $\mu$. Then,

$$\int d(x_0,x)^p\, d\mu(x) \leq \liminf_{k'\to\infty} \int d(x_0,x)^p\, d\mu_{k'}(x) < +\infty,$$

so $\mu$ belongs to $P_p(X)$. Moreover, by lower semicontinuity of $W_p$,

$$W_p(\mu, \mu_{k'}) \leq \liminf_{\ell'\to\infty} W_p(\mu_{\ell'}, \mu_{k'}),$$

so in particular

$$\limsup_{k'\to\infty} W_p(\mu, \mu_{k'}) \leq \limsup_{k',\ell'\to\infty} W_p(\mu_{\ell'}, \mu_{k'}) = 0,$$

which means that $\mu_{k'}$ converges to $\mu$ in the $W_p$ sense (and not just in the sense of weak convergence). Since $(\mu_k)$ is a Cauchy sequence with a converging subsequence, it follows by a classical argument that the whole sequence is converging. ⊓⊔

Bibliographical notes

The terminology of Wasserstein distance (apparently introduced by Dobrushin) is very questionable, since (a) these distances were discovered and rediscovered by several authors throughout the twentieth century, including (in chronological order) Gini [417, 418], Kantorovich [501], Wasserstein [803], Mallows [589] and Tanaka [776] (other contributors being Salvemini, Dall'Aglio, Hoeffding, Fréchet, Rubinstein, Ornstein,

and maybe others); (b) the explicit definition of this distance is not so easy to find in Wasserstein's work; and (c) Wasserstein was only interested in the case $p = 1$. By the way, also the spelling of Wasserstein is doubtful: the original spelling was Vasershtein. (Similarly, Rubinstein was spelled Rubinshtein.) These issues are discussed in a historical note by Rüschendorf [720], who advocates the denomination of "minimal $L^p$-metric" instead of "Wasserstein distance". Also Vershik [808] tells about the discovery of the metric by Kantorovich and stands up in favor of the terminology "Kantorovich distance".

However, the terminology "Wasserstein distance" (or "Wasserstein metric") has been extremely successful: at the time of writing, about 30,000 occurrences can be found on the Internet. Nearly all recent papers relating optimal transport to partial differential equations, functional inequalities or Riemannian geometry (including my own works) use this convention. I will therefore stick to this by-now well-established terminology. After all, even if this convention is a bit unfair since it does not give credit to all contributors, not even to the most important of them (Kantorovich), at least it does give credit to somebody.

As I learnt from Bernot, terminological confusion was enhanced in the mid-nineties, when a group of researchers in image processing introduced the denomination of "Earth Mover's distance" [713] for the Wasserstein (Kantorovich–Rubinstein) distance $W_1$. This terminology was very successful and rapidly spread by the high rate of growth of the engineering literature; it is already starting to compete with "Wasserstein distance", scoring more than 15,000 occurrences on Internet.
Gini considered the special case where the random variables are discrete and lie on the real line; like Mallows later, he was interested in applications to statistics (the "Gini distance" is often used to roughly quantify the inequalities of wealth or income distribution in a given population). Tanaka discovered applications to partial differential equations. Both Mallows and Tanaka worked with the case $p = 2$, while Gini was interested both in $p = 1$ and $p = 2$, and Hoeffding and Fréchet worked with general $p$ (see for instance [381]). A useful source on the point of view of Kantorovich and Rubinstein is Vershik's review [809].

Kantorovich and Rubinstein [506] made the important discovery that the original Kantorovich distance ($W_1$ in my notation) can be extended into a norm on the set $M(X)$ of signed measures over a Polish space $X$. It is common to call this extension the Kantorovich–Rubinstein norm, and by abuse of language I also used the denomination Kantorovich–Rubinstein metric for $W_1$. (It would be more proper to call it just the Kantorovich metric, but more or less everything in this subject should be called after Kantorovich.) This norm property is a particular feature of the exponent $p = 1$, and should be taken seriously because it has strong implications in functional analysis. For one thing, the Kantorovich–Rubinstein norm provides an explicit isometric embedding of an arbitrary Polish space in a Banach space.

As pointed out to me by Vershik, the Kantorovich–Rubinstein norm on a metric space $(X,d)$ can be intrinsically characterized as the maximal norm $\|\cdot\|$ on $M(X)$ which is "compatible" with the distance, in the sense that $\|\delta_x - \delta_y\| = d(x,y)$ for all $x, y \in X$; this maximality property is a consequence of the duality formula.

Here are a few words about the other probability metrics mentioned in this chapter. The Toscani metric is useful in the theory of the Boltzmann equation, see [812, Section 4.2] and references quoted therein. Together with its variants, it is also handy for studying rates of convergence in the central limit theorem, or certain stable limit theorems [424]. The Lévy–Prokhorov metric appears in a number of textbooks, such as Dudley [318, p. 394]. For the taxonomy of probability metrics and their history, the unavoidable reference is the monograph by Rachev [695], which lists dozens and dozens of metrics together with their main properties and applications. (Many of them are variants, particular cases or extensions of the Wasserstein and Lévy–Prokhorov metrics.) The more recent set of notes by Carrillo and Toscani [216] also presents applications of various probability metrics to problems arising in partial differential equations (in particular the inelastic Boltzmann equation).

Here as in all the rest of this course, I only considered complete separable metric spaces.
However, Wasserstein distances also make sense in noncomplete separable metric spaces: The case $p = 1$ was treated by Dudley [318, Lemma 11.8.3], while the general case was recently considered by Clément and Desch [237]. In this reference the triangle inequality is proven by approximation by countable spaces.

The equivalence between the four statements in Definition 6.8 is proven in [814, Theorem 7.12]. I borrowed the proof of Lemma 6.14 from Bolley [136]; and the scheme of proof of Theorem 6.9 from Ambrosio, Gigli and Savaré [30]. There are alternative proofs of Theorem 6.9 in the literature, for instance in [814, Chapter 7]. Similar convergence results

had been obtained earlier by various authors, at least in particular cases, see e.g. [260, 468]. In dimension 1, Theorem 6.9 can be proven by simpler methods, and interpreted as a quantitative refinement of Skorokhod's representation theorem, as noticed in [795] or [814, Section 7.3].

The $\infty$-Wasserstein distance, $W_\infty = \lim_{p\to\infty} W_p$, does not fit in the setting considered in this chapter, in particular because the induced topology is quite stronger than the weak topology of measures. This distance however is useful in a surprising number of problems [208, 212, 222, 466, 617].

The representation formula (6.11) for the total variation distance is a particular case of Strassen's duality theorem, see for instance [814, Section 1.4]. Remark 6.17 is extracted from [427, comments following Remark VI.5].

Theorem 6.15 is a copy–paste from [814, Proposition 7.10], which itself was a slight adaptation of [696, Lemma 10.2.3]. Other upper bounds for the Wasserstein distances are available in the literature; see for instance [527] for the case of the Hamming distance on discrete product spaces. Results of lower bounds for the Wasserstein distance (in terms of moments for instance) are not so common. One example is Proposition 7.29 in the next chapter. In the particular case of the 2-Wasserstein distance on a Hilbert space, there are lower bounds expressed in terms of moments and covariance matrices [258, 407].

In relation with the $\infty$-Wasserstein distance, Bouchitté, Jimenez and Rajesh [151] prove the following estimate: If $\Omega$ is a bounded Lipschitz open subset of $\mathbb R^n$, equipped with the usual Euclidean distance, $\mu(dx) = f(x)\,dx$ and $\nu(dy)$ are probability measures on $\Omega$, and the density $f$ is uniformly bounded below, then for any $p > 1$,

$$W_\infty(\mu,\nu)^{p+n} \leq \frac{C}{\inf f}\, W_p(\mu,\nu)^p,$$

where $C = C(p,n,\Omega)$. As mentioned in Remark 6.7, this "converse" estimate is related to the fact that the optimal transport map for the cost function $|x-y|^p$ enjoys some monotonicity properties which make it very rigid, as we shall see again in Chapter 10. (As an analogy: the Sobolev norms $W^{1,p}$ are all topologically equivalent when applied to $C$-Lipschitz convex functions on a bounded domain.)

Theorem 6.18 belongs to folklore and has probably been proven many times; see for instance [310, Section 14]. Other arguments are

due to Rachev [695, Section 6.3], and Ambrosio, Gigli and Savaré [30]. In the latter reference the proof is very simple but makes use of the deep Kolmogorov extension theorem. Here I followed a much more elementary argument due to Bolley [136]. The statement in Remark 6.19 is proven in [30, Remark 7.1.9].

In a Euclidean or Riemannian context, the Wasserstein distance $W_2$ between two very close measures, say $(1+h_1)\nu$ and $(1+h_2)\nu$ with $h_1$, $h_2$ very small, is approximately equal to the $H^{-1}(\nu)$-norm of $h_1 - h_2$; see [671, Section 7], [814, Section 7.6] or Exercise 22.20. (One may also take a look at [567, 569].) There is in fact a natural one-parameter family of distances interpolating between $H^{-1}(\nu)$ and $W_2$, defined by a variation on the Benamou–Brenier formula (7.34) (insert a factor $(d\mu_t/d\nu)^{\alpha-1}$, $0 \leq \alpha \leq 1$, in the integrand of (7.33); this construction is due to Dolbeault, Nazaret and Savaré [312]).

Applications of the Wasserstein distances are too numerous to be listed here; some of them will be encountered again in the sequel. In [150] Wasserstein distances are used to study the best approximation of a measure by a finite number of points. Various authors [700, 713] use them to compare color distributions in different images. These distances are classically used in statistics, limit theorems, and all kinds of problems involving approximation of probability measures [254, 256, 257, 282, 694, 696, 716]. Rio [704] derives sharp quantitative bounds in Wasserstein distance for the central limit theorem on the real line, and surveys the previous literature on this problem. Wasserstein distances are well adapted to study rates of fluctuations of empirical measures, see [695, Theorem 11.1.6], [696, Theorem 10.2.1], [498, Section 4.9], and the research papers [8, 307, 314, 315, 479, 771, 845].
(The most precise results are those in [307]: there it is shown that the average $W_1$ distance between two independent copies of the empirical measure behaves like $(\int \rho^{1-1/d})/N^{1/d}$, where $N$ is the size of the samples, $\rho$ the density of the common law of the random variables, and $d \geq 3$; the proofs are partly based on subadditivity, as in [150].) Quantitative Sanov-type theorems have been considered in [139, 742].

Wasserstein distances are also commonly used in statistical mechanics, most notably in the theory of propagation of chaos, or more generally the mean behavior of large particle systems [768] [757, Chapter 5]; the original idea seems to go back to Dobrushin [308, 309] and has been applied in a large number of problems, see for instance [81, 82, 221, 590, 624]. The original version of the Dobrushin–Shlosman uniqueness criterion [308, 311] in spin systems was expressed in terms of optimal transport distance, although this formulation was lost in most subsequent developments (I learnt this from Ollivier).

Wasserstein distances are also useful in the study of mixing and convergence for Markov chains; the original idea, based on a contraction property, seems to be due to Dobrushin [310], and has been rediscovered since then by various authors [231, 662, 679]. Tanaka proved that the $W_2$ distance is contracting along solutions of a certain class of Boltzmann equations [776, 777]; these results are reviewed in [814, Section 7.5] and have been generalized in [138, 214, 379, 590].

Wasserstein distances behave well with increasing dimension, and therefore have been successfully used in large or infinite dimension; for instance for the large-time behavior of stochastic partial differential equations [455, 458, 533, 605], or hydrodynamic limits of systems of particles [444].

In a Riemannian context, the $W_2$ distance is well-adapted to the study of Ricci curvature, in relation with diffusion equations; these themes will be considered again in Part II.

Here is a short list of some more surprising applications. Werner [836] suggested that the $W_1$ distance is well adapted to quantify some variants of the uncertainty principle in quantum physics. In a recent note, Melleray, Petrov and Vershik [625] use the properties of the Kantorovich–Rubinstein norm to study spaces which are "linearly rigid", in the sense that, roughly speaking, there is only one way to embed them in a Banach space. The beautiful text [809] by Vershik reviews further applications of the Kantorovich–Rubinstein distance to several original topics (towers of measures, Bernoulli automorphisms, classification of metric spaces); see also [808] and the older contribution [807] by the same author.


7 Displacement interpolation

I shall now discuss a time-dependent version of optimal transport, leading to a continuous displacement of measures. There are two main motivations for that extension:

• a time-dependent model gives a more complete description of the transport;

• the richer mathematical structure will be useful later on.

As in the previous chapter I shall assume that the initial and final probability measures are defined on the same Polish space $(X,d)$. The main additional structure assumption is that the cost is associated with an action, which is a way to measure the cost of displacement along a continuous curve, defined on a given time-interval, say $[0,1]$. So the cost function between an initial point $x$ and a final point $y$ is obtained by minimizing the action among paths that go from $x$ to $y$:

$$c(x,y) = \inf \bigl\{ A(\gamma);\ \gamma_0 = x,\ \gamma_1 = y;\ \gamma \in \mathcal C \bigr\}. \tag{7.1}$$

Here $\mathcal C$ is a certain class of continuous curves, to be specified in each particular case of interest, on which the action functional $A$ is defined.

Of course, Assumption (7.1) is meaningless unless one requires some specific structure on the action functional (otherwise, just choose $A(\gamma) = c(\gamma_0, \gamma_1)$...). A good notion of action should provide a recipe for choosing optimal paths, and in particular a recipe to interpolate between points in $X$. It will turn out that under soft assumptions, this interpolation recipe between points can be "lifted" to an interpolation recipe between probability measures. This will provide a time-dependent

notion of optimal transport, that will be called displacement interpolation (by opposition to the standard linear interpolation between probability measures).

This is a key chapter in this course, and I have worked hard to attain a high level of generality, at the price of somewhat lengthy arguments. So the reader should not hesitate to skip proofs at first reading, concentrating on statements and explanations. The main result in this chapter is Theorem 7.21.

Deterministic interpolation via action-minimizing curves

To better understand what an action functional should be, let us start with some examples and informal discussions. Consider a model where the unknown is the position of a given physical system in some position space, say a Riemannian manifold $M$. (See the Appendix for reminders about Riemannian geometry if needed.) We learn from classical physics that in the absence of a potential, the action is the integral over time of the (instantaneous) kinetic energy:

$$A(\gamma) = \int_0^1 \frac{|\dot\gamma_t|^2}{2}\, dt,$$

where $\dot\gamma_t$ stands for the velocity (or time-derivative) of the curve $\gamma$ at time $t$. More generally, an action is classically given by the time-integral of a Lagrangian along the path:

$$A(\gamma) = \int_0^1 L(\gamma_t, \dot\gamma_t, t)\, dt. \tag{7.2}$$

Here $L$ is defined on $TM \times [0,1]$, where the smooth manifold $M$ is the position space and the tangent bundle $TM$ is the phase space, which is the space of all possible positions and velocities. It is natural to work in $TM$ because one often deals with second-order differential equations on $M$ (such as Newton's equations), which transform themselves into first-order equations on $TM$. Typically $L$ would take the form

$$L(x,v,t) = \frac{|v|^2}{2} - V(x), \tag{7.3}$$

where $V$ is a potential; but much more complicated forms are admissible. When $V$ is continuously differentiable, it is a simple particular

case of the formula of first variation (recalled in the Appendix) that minimizers of the action (7.2), with Lagrangian (7.3) and given endpoints, satisfy Newton's equation

$$\frac{d^2x}{dt^2} = -\nabla V(x). \tag{7.4}$$

To make sure that $A(\gamma)$ is well-defined, it is natural to assume that the path $\gamma$ is continuously differentiable, or piecewise continuously differentiable, or at least almost everywhere differentiable as a function of $t$. A classical and general setting is that of absolutely continuous curves: By definition, if $(X,d)$ is a metric space, a continuous curve $\gamma: [0,1] \to X$ is said to be absolutely continuous if there exists a function $\ell \in L^1([0,1]; dt)$ such that for all intermediate times $t_0 < t_1$ in $[0,1]$,

$$d(\gamma_{t_0}, \gamma_{t_1}) \leq \int_{t_0}^{t_1} \ell(t)\, dt. \tag{7.5}$$

More generally, it is said to be absolutely continuous of order $p$ if formula (7.5) holds with some $\ell \in L^p([0,1]; dt)$.

If $\gamma$ is absolutely continuous, then the function $t \longmapsto d(\gamma_{t_0}, \gamma_t)$ is differentiable almost everywhere, and its derivative is integrable. But the converse is false: for instance, if $\gamma$ is the "Devil's staircase", encountered in measure theory textbooks (a nonconstant function whose distributional derivative is concentrated on the Cantor set in $[0,1]$), then $\gamma$ is differentiable almost everywhere, and $\dot\gamma(t) = 0$ for almost every $t$, even though $\gamma$ is not constant! This motivates the "integral" definition of absolute continuity based on formula (7.5).

If $X$ is $\mathbb R^n$, or a smooth differentiable manifold, then absolutely continuous paths are differentiable for Lebesgue-almost all $t \in [0,1]$; in physical words, the velocity is well-defined for almost all times.

Before going further, here are some simple and important examples. For all of them, the class $\mathcal C$ of admissible curves is the space of absolutely continuous curves.

Example 7.1. In $X = \mathbb R^n$, choose $L(x,v,t) = |v|$ (Euclidean norm of the velocity).
Then the action is just the length functional, while the cost $c(x,y) = |x-y|$ is the Euclidean distance. Minimizing curves are straight lines, with arbitrary parametrization: $\gamma_t = \gamma_0 + s(t)(\gamma_1 - \gamma_0)$, where $s: [0,1] \to [0,1]$ is nondecreasing and absolutely continuous.

Example 7.2. In $X = \mathbb R^n$ again, choose $L(x,v,t) = c(v)$, where $c$ is strictly convex. By Jensen's inequality,

$$c(\gamma_1 - \gamma_0) = c \Bigl( \int_0^1 \dot\gamma_t\, dt \Bigr) \leq \int_0^1 c(\dot\gamma_t)\, dt,$$

and this is an equality if and only if $\dot\gamma_t$ is constant. Therefore action-minimizers are straight lines with constant velocity: $\gamma_t = \gamma_0 + t\,(\gamma_1 - \gamma_0)$. Then, of course, $c(x,y) = c(y-x)$.

Remark 7.3. This example shows that very different Lagrangians can have the same minimizing curves.

Example 7.4. Let $X = M$ be a smooth Riemannian manifold, $TM$ its tangent bundle, and $L(x,v,t) = |v|^p$, $p \geq 1$. Then the cost function is $d(x,y)^p$, where $d$ is the geodesic distance on $M$. There are two quite different cases:

• If $p > 1$, minimizing curves are defined by the equation $\ddot\gamma_t = 0$ (zero acceleration), to be understood as $(d/dt)\, \dot\gamma_t = 0$, where $(d/dt)$ stands for the covariant derivative along the path $\gamma$ (once again, see the reminders in the Appendix if necessary). Such curves have constant speed ($(d/dt)\, |\dot\gamma_t| = 0$), and are called minimizing, constant-speed geodesics, or simply geodesics.

• If $p = 1$, minimizing curves are geodesic curves parametrized in an arbitrary way.

Example 7.5. Again let $X = M$ be a smooth Riemannian manifold, and now consider a general Lagrangian $L(x,v,t)$, assumed to be strictly convex in the velocity variable $v$. The characterization and study of extremal curves for such Lagrangians, under various regularity assumptions, is one of the most classical topics in the calculus of variations. Here are some of the basic — which does not mean trivial — results in the field. Throughout the sequel, the Lagrangian $L$ is a $C^1$ function defined on $TM \times [0,1]$.

• By the first variation formula (a proof of which is sketched in the Appendix), minimizing curves satisfy the Euler–Lagrange equation

$$\frac{d}{dt}\, \bigl[ \nabla_v L(\gamma_t, \dot\gamma_t, t) \bigr] = (\nabla_x L)(\gamma_t, \dot\gamma_t, t), \tag{7.6}$$

which generalizes (7.4). At least this equation should be satisfied for minimizing curves that are sufficiently smooth, say piecewise $C^1$.

• If there exist $K, C > 0$ such that

$$L(x,v,t) \geq K\,|v| - C,$$

then the action of a curve $\gamma$ is bounded below by $K\, \mathcal L(\gamma) - C$, where $\mathcal L(\gamma)$ is the length; this implies that all action-minimizing curves starting from a given compact $K_0$ and ending in a given compact $K_1$ stay within a bounded region.

• If minimizing curves depend smoothly on their position and velocity at some time, then there is also a bound on the velocities along minimizers that join $K_0$ to $K_1$. Indeed, there is a bound on $\int_0^1 L(x,v,t)\, dt$; so there is a bound on $L(x,v,t)$ for some $t$; so there is a bound on the velocity at some time, and then this bound is propagated in time.

• Assume that $L$ is strictly convex and superlinear in the velocity variable, in the following sense:

$$\forall\, (x,t) \qquad \begin{cases} v \longmapsto L(x,v,t) \ \text{is convex}; \\[1mm] \dfrac{L(x,v,t)}{|v|} \longrightarrow +\infty \quad \text{as } |v| \to \infty. \end{cases} \tag{7.7}$$

Then $v \longmapsto \nabla_v L$ is invertible, and (7.6) can be rewritten as a differential equation on the new unknown $\nabla_v L(\gamma, \dot\gamma, t)$.

• If in addition $L$ is $C^2$ and the strict inequality $\nabla_v^2 L > 0$ holds (more rigorously, $\nabla_v^2 L(x,\cdot,t) \geq K(x)\, g_x$ for all $x$, where $g$ is the metric and $K(x) > 0$), then the new equation (where $x$ and $p = \nabla_v L(x,v,t)$ are the unknowns) has locally Lipschitz coefficients, and the Cauchy–Lipschitz theorem can be applied to guarantee the unique local existence of Lipschitz continuous solutions to (7.6). Under the same assumptions on $L$, at least if $L$ does not depend on $t$, one can show directly that minimizers are of class at least $C^1$, and therefore satisfy (7.6). Conversely, solutions of (7.6) are locally (in time) minimizers of the action.
• Finally, the convexity of $L$ makes it possible to define its Legendre transform (again, with respect to the velocity variable):

$$H(x,p,t) := \sup_{v \in T_x M} \bigl( p \cdot v - L(x,v,t) \bigr),$$

which is called the Hamiltonian; then one can recast (7.6) in terms of a Hamiltonian system, and access to the rich mathematical world

of Hamiltonian dynamics. As soon as $L$ is strictly convex superlinear, the Legendre transform $(x,v) \longmapsto (x, \nabla_v L(x,v,t))$ is a homeomorphism, so assumptions about $(x,v)$ can be re-expressed in terms of the new variables $(x,p) = (x, \nabla_v L(x,v,t))$.

• If $L$ does not depend on $t$, then $H(x, \nabla_v L(x,v))$ is constant along minimizing curves $(x,v) = (\gamma_t, \dot\gamma_t)$; if $L$ does depend on $t$, then $(d/dt)\,H(x, \nabla_v L(x,v)) = (\partial_t H)(x, \nabla_v L(x,v))$.

Some of the above-mentioned assumptions will come back often in the sequel, so I shall summarize the most interesting ones in the following definition:

Definition 7.6 (Classical conditions on a Lagrangian function). Let $M$ be a smooth, complete Riemannian manifold, and $L(x,v,t)$ a Lagrangian on $TM \times [0,1]$. In this course, it is said that $L$ satisfies the classical conditions if:
(a) $L$ is $C^1$ in all variables;
(b) $L$ is a strictly convex superlinear function of $v$, in the sense of (7.7);
(c) There are constants $K, C > 0$ such that for all $(x,v,t) \in TM \times [0,1]$, $L(x,v,t) \ge K\,|v| - C$;
(d) There is a well-defined locally Lipschitz flow associated to the Euler–Lagrange equation, or more rigorously to the minimization problem; that is, there is a locally Lipschitz map $(x_0, v_0, t_0;\, t) \to \phi_t(x_0, v_0, t_0)$ on $TM \times [0,1] \times [0,1]$, with values in $TM$, such that each action-minimizing curve $\gamma : [0,1] \to M$ belongs to $C^1([0,1]; M)$ and satisfies $(\gamma(t), \dot\gamma(t)) = \phi_t(\gamma(t_0), \dot\gamma(t_0), t_0)$.

Remark 7.7. Assumption (d) above is automatically satisfied if $L$ is of class $C^2$, $\nabla_v^2 L > 0$ everywhere and $L$ does not depend on $t$.

This looks general enough; however, there are interesting cases where $\mathcal{X}$ does not have enough differentiable structure for the velocity vector to be well-defined (tangent spaces might not exist, for lack of smoothness).
In such a case, it is still possible to define the speed along the curve:

$$|\dot\gamma_t| := \limsup_{\varepsilon \to 0} \frac{d(\gamma_t, \gamma_{t+\varepsilon})}{|\varepsilon|}. \qquad (7.8)$$

This generalizes the natural notion of speed, which is the norm of the velocity vector. Thus it makes perfect sense to write a Lagrangian of the

form $L(x, |v|, t)$ in a general metric space $\mathcal{X}$; here $L$ might be essentially any measurable function on $\mathcal{X} \times \mathbb{R}_+ \times [0,1]$. (To ensure that $\int_0^1 L\,dt$ makes sense in $\mathbb{R} \cup \{+\infty\}$, it is natural to assume that $L$ is bounded below.)

Example 7.8. Let $(\mathcal{X},d)$ be a metric space. Define the length of an absolutely continuous curve by the formula

$$\mathcal{L}(\gamma) = \int_0^1 |\dot\gamma_t|\,dt. \qquad (7.9)$$

Then minimizing curves are called geodesics. They may have variable speed, but, just as on a Riemannian manifold, one can always reparametrize them (that is, replace $\gamma$ by $\widetilde\gamma$ where $\widetilde\gamma_t = \gamma_{s(t)}$, with $s$ continuous increasing) in such a way that they have constant speed. In that case $d(\gamma_s, \gamma_t) = |t-s|\,\mathcal{L}(\gamma)$ for all $s,t \in [0,1]$.

Example 7.9. Again let $(\mathcal{X},d)$ be a metric space, but now consider the action

$$\mathcal{A}(\gamma) = \int_0^1 c(|\dot\gamma_t|)\,dt,$$

where $c$ is strictly convex and strictly increasing (say $c(|v|) = |v|^p$, $p > 1$). Then,

$$c\bigl(d(\gamma_0, \gamma_1)\bigr) \le c\bigl(\mathcal{L}(\gamma)\bigr) = c\Bigl(\int_0^1 |\dot\gamma_t|\,dt\Bigr) \le \int_0^1 c(|\dot\gamma_t|)\,dt,$$

with equality in both inequalities if and only if $\gamma$ is a constant-speed, minimizing geodesic. Thus $c(x,y) = c(d(x,y))$, and minimizing curves are also geodesics, but with constant speed. Note that the distance $d$ can be recovered from the cost function, just by inverting $c$. As an illustration, if $c(|v|) = |v|^p$, $p > 1$, then

$$d(x,y)^p = \inf\Bigl\{\int_0^1 |\dot\gamma_t|^p\,dt;\ \gamma_0 = x,\ \gamma_1 = y\Bigr\}.$$

In a given metric space, geodesics might not always exist, and it can even be the case that nonconstant continuous curves do not exist (think of a discrete space). So to continue the discussion we shall have to impose appropriate assumptions on our metric space and our cost function.
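Example 7.9 can be checked numerically. The following Python sketch (not part of the original text; the segment, grid size and speeds are illustrative choices) compares the discretized quadratic action of a constant-speed traversal of a line segment with a reparametrized traversal of the same segment: both curves have the same length and endpoints, but only the constant-speed one minimizes $\int_0^1 |\dot\gamma_t|^2\,dt$.

```python
import numpy as np

def quadratic_action(path):
    # Discretized action: sum of |γ_{k+1} - γ_k|^2 / Δt, approximating ∫ |γ̇_t|^2 dt.
    dt = 1.0 / (len(path) - 1)
    return np.sum(np.diff(path)**2) / dt

t = np.linspace(0.0, 1.0, 2001)
const_speed = 3.0 * t        # traverses [0, 3] at constant speed 3
reparam     = 3.0 * t**2     # same path and endpoints, variable speed

assert abs(quadratic_action(const_speed) - 3.0**2) < 1e-6  # equals d(x, y)^2
assert quadratic_action(reparam) > quadratic_action(const_speed)
assert abs(quadratic_action(reparam) - 12.0) < 0.05        # ∫ (6t)^2 dt = 12
```

The constant-speed curve attains the cost $c(x,y) = d(x,y)^2 = 9$, while the reparametrized curve pays $\int_0^1 (6t)^2\,dt = 12$ for covering the same distance.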

Here comes an important observation. When one wants to compute "in real life" the length of a curve, one does not use formula (7.9), but rather subdivides the curve into very small pieces, and approximates the length of each small piece by the distance between its endpoints. The finer the subdivision, the greater the measured approximate length (this is a consequence of the triangle inequality). So by taking finer and finer subdivisions we get an increasing family of measurements, whose upper bound may be taken as the measured length. This is actually an alternative definition of the length, which agrees with (7.9) for absolutely continuous curves, but does not require any further regularity assumption than plain continuity:

$$\mathcal{L}(\gamma) = \sup_{N \in \mathbb{N}} \ \sup_{0 = t_0 < t_1 < \dots < t_N = 1} \bigl[ d(\gamma_{t_0}, \gamma_{t_1}) + \dots + d(\gamma_{t_{N-1}}, \gamma_{t_N}) \bigr]. \qquad (7.10)$$
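The increasing-subdivision definition (7.10) is easy to observe numerically. A Python sketch (the quarter circle and the subdivision sizes are arbitrary illustrative choices): the polygonal measurements increase toward the true length $\pi/2$, and never exceed it.

```python
import numpy as np

def polygonal_length(curve, times):
    # Sum of chord lengths d(γ_{t_i}, γ_{t_{i+1}}) over a given subdivision,
    # as in the sup defining the length of a continuous curve.
    pts = np.array([curve(t) for t in times])
    return np.sum(np.linalg.norm(np.diff(pts, axis=0), axis=1))

# Quarter circle of radius 1, whose true length is pi/2.
circle = lambda t: np.array([np.cos(np.pi * t / 2), np.sin(np.pi * t / 2)])

coarse = polygonal_length(circle, np.linspace(0, 1, 5))
fine   = polygonal_length(circle, np.linspace(0, 1, 1000))

assert coarse <= fine <= np.pi / 2 + 1e-6  # measurements increase toward the length
assert abs(fine - np.pi / 2) < 1e-4
```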

It will be useful to consider an action as a family of functionals $\mathcal{A}^{s,t}$ parametrized by the initial and the final times, so that $\mathcal{A}^{s,t}$ is a functional on the set of paths $[s,t] \to \mathcal{X}$. Then we let

$$c^{s,t}(x,y) = \inf \bigl\{ \mathcal{A}^{s,t}(\gamma);\ \gamma \in C([s,t]; \mathcal{X});\ \gamma_s = x,\ \gamma_t = y \bigr\}. \qquad (7.12)$$

In words, $c^{s,t}(x,y)$ is the minimal work needed to go from point $x$ at initial time $s$ to point $y$ at final time $t$.

Example 7.10. Consider the Lagrangian $L(x, |v|, t) = |v|^p$. Then

$$c^{s,t}(x,y) = \frac{d(x,y)^p}{(t-s)^{p-1}}.$$

Note a characteristic property of these "power law" Lagrangians: The cost function $c^{s,t}$ depends on $s,t$ only through multiplication by a constant. In particular, minimizing curves will be independent of $s$ and $t$, up to reparametrization.

Abstract Lagrangian action

After all these preparations, the following definition should appear somewhat natural.

Definition 7.11 (Lagrangian action). Let $(\mathcal{X},d)$ be a Polish space, and let $t_i < t_f$ in $\mathbb{R}$. A Lagrangian action $(\mathcal{A}^{s,t})_{t_i \le s < t \le t_f}$ on $\mathcal{X}$ is a family of lower semicontinuous functionals $\mathcal{A}^{s,t}$ on $C([s,t]; \mathcal{X})$, and cost functions $c^{s,t}$ on $\mathcal{X} \times \mathcal{X}$, such that:
(i) $t_i \le t_1 < t_2 < t_3 \le t_f \Longrightarrow \mathcal{A}^{t_1,t_3} = \mathcal{A}^{t_1,t_2} + \mathcal{A}^{t_2,t_3}$;
(ii) $\forall x,y \in \mathcal{X}$,
$$c^{s,t}(x,y) = \inf \bigl\{ \mathcal{A}^{s,t}(\gamma);\ \gamma \in C([s,t]; \mathcal{X});\ \gamma_s = x,\ \gamma_t = y \bigr\};$$
(iii) For any curve $(\gamma_t)_{t_i \le t \le t_f}$,
$$\mathcal{A}^{t_i,t_f}(\gamma) = \sup_{N \in \mathbb{N}} \ \sup_{t_i = t_0 \le t_1 \le \dots \le t_N = t_f} \Bigl[ c^{t_0,t_1}(\gamma_{t_0}, \gamma_{t_1}) + c^{t_1,t_2}(\gamma_{t_1}, \gamma_{t_2}) + \dots + c^{t_{N-1},t_N}(\gamma_{t_{N-1}}, \gamma_{t_N}) \Bigr].$$
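As a sanity check on Example 7.10, one can verify numerically, on the real line, that the power-law costs compose correctly over an intermediate time: $\inf_z \bigl[c^{t_1,t_2}(x,z) + c^{t_2,t_3}(z,y)\bigr] = c^{t_1,t_3}(x,y)$, the infimum being reached at the point the constant-speed minimizer occupies at time $t_2$. A Python sketch with illustrative values (not part of the original text):

```python
import numpy as np

def c(s, t, x, y, p=2.0):
    # Power-law cost of Example 7.10 on the real line: c^{s,t}(x,y) = |x-y|^p / (t-s)^(p-1).
    return np.abs(x - y)**p / (t - s)**(p - 1)

x, y = 0.0, 2.0
t1, t2, t3 = 0.0, 0.3, 1.0

z = np.linspace(-1.0, 3.0, 400001)                # grid of candidate intermediate points
best = np.min(c(t1, t2, x, z) + c(t2, t3, z, y))  # infimum over the midpoint position

assert abs(best - c(t1, t3, x, y)) < 1e-6         # composition recovers d(x,y)^p/(t3-t1)^(p-1)
```

Here the infimum is attained at $z = 0.6$, the point reached at time $t_2 = 0.3$ by the straight line traversed at constant speed between times $0$ and $1$.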

The functional $\mathcal{A} = \mathcal{A}^{t_i,t_f}$ will just be called the action, and the cost function $c = c^{t_i,t_f}$ the cost associated with the action. A curve $\gamma : [t_i,t_f] \to \mathcal{X}$ is said to be action-minimizing if it minimizes $\mathcal{A}$ among all curves having the same endpoints.

Examples 7.12. (i) To recover (7.2) as a particular case of Definition 7.11, just set
$$\mathcal{A}^{s,t}(\gamma) = \int_s^t L(\gamma_\tau, \dot\gamma_\tau, \tau)\,d\tau. \qquad (7.13)$$
(ii) A length space is a space in which $\mathcal{A}^{s,t}(\gamma) = \mathcal{L}(\gamma)$ (here $\mathcal{L}$ is the length) defines a Lagrangian action.
(iii) If $[t_i', t_f'] \subset [t_i, t_f]$, then it is clear that $(\mathcal{A}^{s,t})$ induces an action on the time-interval $[t_i', t_f']$, just by restriction.

In the rest of this section I shall take $(t_i, t_f) = (0,1)$, just for simplicity; of course one can always reduce to this case by reparametrization.

It will now be useful to introduce further assumptions about existence and compactness of minimizing curves.

Definition 7.13 (Coercive action). Let $(\mathcal{A}^{s,t})_{0 \le s < t \le 1}$ be a Lagrangian action on a Polish space $\mathcal{X}$, with associated cost functions $(c^{s,t})_{0 \le s < t \le 1}$. The action is said to be coercive if:
(i) $\inf_{s<t}\ \inf_\gamma\ \mathcal{A}^{s,t}(\gamma) > -\infty$;
(ii) for any two intermediate times $s < t$ and any two compact sets $K_s, K_t \subset \mathcal{X}$, the set $\Gamma^{s,t}_{K_s \to K_t}$ of action-minimizing curves joining $K_s$ at time $s$ to $K_t$ at time $t$ is compact, and nonempty as soon as some curve of finite action joins a point of $K_s$ to a point of $K_t$.

Remark 7.14. In practice, a compactness argument will often show that there exists an

action-minimizing curve among the set of curves that have prescribed initial and final points. In that case the requirement of nonemptiness in (ii) is fulfilled.

Examples 7.15. (i) If $\mathcal{X}$ is a smooth complete Riemannian manifold and $L(x,v,t)$ is a Lagrangian satisfying the classical conditions of Definition 7.6, then the action defined by (7.13) is coercive.
(ii) If $\mathcal{X}$ is a geodesic length space, then the action defined by $\mathcal{A}^{s,t}(\gamma) = \mathcal{L}(\gamma)^2/(t-s)$ is coercive; in fact minimizers are constant-speed minimizing geodesic curves. On the other hand, the action defined by $\mathcal{A}^{s,t}(\gamma) = \mathcal{L}(\gamma)$ is not coercive, since the possibility of reparametrization prevents the compactness of the set of minimizing curves.

Proposition 7.16 (Properties of Lagrangian actions). Let $(\mathcal{X},d)$ be a Polish space and $(\mathcal{A}^{0,1})$ a coercive Lagrangian action on $\mathcal{X}$. Then:
(i) For all intermediate times $s < t$, $c^{s,t}$ is lower semicontinuous on $\mathcal{X} \times \mathcal{X}$, with values in $\mathbb{R} \cup \{+\infty\}$.
(ii) If a curve $\gamma$ is a minimizer of $\mathcal{A}^{s,t}$ on $[s,t] \subset [0,1]$, then its restriction to $[s',t'] \subset [s,t]$ is also a minimizer for $\mathcal{A}^{s',t'}$.
(iii) For all times $t_1 < t_2 < t_3$ in $[0,1]$, and $x_1, x_3$ in $\mathcal{X}$,
$$c^{t_1,t_3}(x_1,x_3) = \inf_{x_2 \in \mathcal{X}} \Bigl( c^{t_1,t_2}(x_1,x_2) + c^{t_2,t_3}(x_2,x_3) \Bigr); \qquad (7.14)$$
and if the infimum is achieved at some point $x_2$, then there is a minimizing curve which goes from $x_1$ at time $t_1$ to $x_3$ at time $t_3$, and passes through $x_2$ at time $t_2$.
(iv) A curve $\gamma$ is a minimizer of $\mathcal{A}$ if and only if, for all intermediate times $t_1 < t_2 < t_3$ in $[0,1]$,
$$c^{t_1,t_3}(\gamma_{t_1}, \gamma_{t_3}) = c^{t_1,t_2}(\gamma_{t_1}, \gamma_{t_2}) + c^{t_2,t_3}(\gamma_{t_2}, \gamma_{t_3}). \qquad (7.15)$$
(v) If the cost functions $c^{s,t}$ are continuous, then the set $\Gamma$ of all action-minimizing curves is closed in the topology of uniform convergence.
(vi) For all times $s < t$, there exists a Borel map $S_{s \to t} : \mathcal{X} \times \mathcal{X} \to C([s,t]; \mathcal{X})$ such that for all $x,y \in \mathcal{X}$, $S_{s \to t}(x,y)$ belongs to $\Gamma^{s,t}_{x \to y}$.
In words, there is a measurable recipe to join any two endpoints $x$ and $y$ by a minimizing curve $\gamma : [s,t] \to \mathcal{X}$.
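In Euclidean space with the quadratic action, the minimizing curve joining $x$ to $y$ is the unique constant-speed segment, so a selection map as in (vi) has a fully explicit form. A minimal Python sketch (the Euclidean setting and the quadratic action are illustrative assumptions, not the general construction):

```python
import numpy as np

def S(s, t, x, y):
    # Selection of the (unique) minimizer joining x at time s to y at time t
    # for the quadratic action in R^n: the constant-speed segment.
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    def gamma(tau):
        lam = (tau - s) / (t - s)   # fraction of the time interval elapsed
        return (1 - lam) * x + lam * y
    return gamma

g = S(0.0, 1.0, [0.0, 0.0], [2.0, 2.0])
assert np.allclose(g(0.0), [0.0, 0.0]) and np.allclose(g(1.0), [2.0, 2.0])
assert np.allclose(g(0.5), [1.0, 1.0])   # midpoint of the segment at mid-time
```

In this uniqueness situation the measurability of $S_{s\to t}$ is immediate; the selection theorems invoked in the proof below are needed precisely when minimizers are not unique.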

Remark 7.17. The statement in (iv) is a powerful formulation of the minimizing property. It is often quite convenient from the technical point of view, even in a smooth setting, because it does not involve any time-derivative.

Remark 7.18. The continuity assumption in (v) is satisfied in most cases of interest. For instance, if $\mathcal{A}^{s,t}(\gamma) = \mathcal{L}(\gamma)^2/(t-s)$, then $c^{s,t}(x,y) = d(x,y)^2/(t-s)$, which is obviously continuous. Continuity also holds true in the other model example where $\mathcal{X}$ is a Riemannian manifold and the cost is obtained from a Lagrangian function $L(x,v,t)$ on $TM \times [0,1]$ satisfying the classical assumptions; a proof is sketched in the Appendix.

Proof of Proposition 7.16. Let us prove (i). By definition of the coercivity, $c^{s,t}(x,y)$ is never $-\infty$. Let $(x_k)_{k \in \mathbb{N}}$ and $(y_k)_{k \in \mathbb{N}}$ be sequences converging to $x$ and $y$ respectively. Then the family $(x_k) \cup \{x\}$ forms a compact set $K_s$, and the family $(y_k) \cup \{y\}$ also forms a compact set $K_t$. By assumption, for each $k$ we can find a minimizing curve $\gamma_k : [s,t] \to \mathcal{X}$ joining $x_k$ to $y_k$, so $\gamma_k$ belongs to $\Gamma^{s,t}_{K_s \to K_t}$, which is compact. From $(\gamma_k)_{k \in \mathbb{N}}$ we can extract a subsequence which converges uniformly to some minimizing curve $\gamma$. The uniform convergence implies that $x_k = \gamma_k(s) \to \gamma(s)$, $y_k = \gamma_k(t) \to \gamma(t)$, so $\gamma$ joins $x$ to $y$. The lower semicontinuity of $\mathcal{A}^{s,t}$ implies that $\mathcal{A}^{s,t}(\gamma) \le \liminf \mathcal{A}^{s,t}(\gamma_k)$; thus

$$c^{s,t}(x,y) \le \mathcal{A}^{s,t}(\gamma) \le \liminf \mathcal{A}^{s,t}(\gamma_k) = \liminf c^{s,t}(x_k, y_k).$$

This establishes the lower semicontinuity of the cost $c^{s,t}$.

Property (ii) is obvious: if the restriction of $\gamma$ to $[s',t']$ is not optimal, introduce $\widetilde\gamma$ on $[s',t']$ such that $\mathcal{A}^{s',t'}(\widetilde\gamma) < \mathcal{A}^{s',t'}(\gamma)$. Then the path obtained by concatenating $\gamma$ on $[s,s']$, $\widetilde\gamma$ on $[s',t']$, and $\gamma$ again on $[t',t]$, has a strictly lower action than $\gamma$, which is impossible.
(Obviously, this is the same line of reasoning as in the proof of the "restriction property" of Theorem 4.6.)

Now, to prove (iii), introduce minimizing curves $\gamma_{1\to 2}$ joining $x_1$ at time $t_1$ to $x_2$ at time $t_2$, and $\gamma_{2\to 3}$ joining $x_2$ at time $t_2$ to $x_3$ at time $t_3$. Then define $\gamma$ on $[t_1,t_3]$ by concatenation of $\gamma_{1\to 2}$ and $\gamma_{2\to 3}$. From the axioms of Definition 7.11,

$$c^{t_1,t_3}(x_1,x_3) \le \mathcal{A}^{t_1,t_3}(\gamma) = \mathcal{A}^{t_1,t_2}(\gamma_{1\to 2}) + \mathcal{A}^{t_2,t_3}(\gamma_{2\to 3}) = c^{t_1,t_2}(x_1,x_2) + c^{t_2,t_3}(x_2,x_3).$$

The inequality in (iii) follows by taking the infimum over $x_2$. Moreover, if there is equality, that is,

$$c^{t_1,t_2}(x_1,x_2) + c^{t_2,t_3}(x_2,x_3) = c^{t_1,t_3}(x_1,x_3),$$

then equality holds everywhere in the above chain of inequalities, so the curve $\gamma$ achieves the optimal cost $c^{t_1,t_3}(x_1,x_3)$, while passing through $x_2$ at time $t_2$.

It is a consequence of (iii) that any minimizer should satisfy (7.15), since the restrictions of $\gamma$ to $[t_1,t_2]$ and to $[t_2,t_3]$ should both be minimizing. Conversely, let $\gamma$ be a curve satisfying (7.15) for all $(t_1,t_2,t_3)$ with $t_1 < t_2 < t_3$. By induction, this implies that for each subdivision $0 = t_0 < t_1 < \dots < t_N = 1$,

$$c^{0,1}(\gamma_0, \gamma_1) = \sum_j c^{t_j,t_{j+1}}(\gamma_{t_j}, \gamma_{t_{j+1}}).$$

By point (iii) in Definition 7.11, it follows that $\mathcal{A}^{0,1}(\gamma) = c^{0,1}(\gamma_0, \gamma_1)$, which proves (iv).

Now let $\Gamma(t_1,t_2,t_3)$, for $0 \le t_1 < t_2 < t_3 \le 1$, stand for the set of all curves satisfying (7.15). If all functions $c^{s,t}$ are continuous, then $\Gamma(t_1,t_2,t_3)$ is closed for the topology of uniform convergence. Then $\Gamma$ is the intersection of all $\Gamma(t_1,t_2,t_3)$, so it is closed also; this proves statement (v). (Now there is a similarity with the proof of Theorem 5.20.)

For given times $s < t$, let $\Gamma^{s,t}$ be the set of all action-minimizing curves defined on $[s,t]$, and let $E^{s,t}$ be the "endpoints" mapping, defined on $\Gamma^{s,t}$ by $\gamma \longmapsto (\gamma_s, \gamma_t)$. By assumption, any two points are joined by at least one minimizing curve, so $E^{s,t}$ is onto $\mathcal{X} \times \mathcal{X}$. It is clear that $E^{s,t}$ is a continuous map between Polish spaces, and by assumption $(E^{s,t})^{-1}(x,y)$ is compact for all $x,y$. It follows by general theorems of measurable selection (see the bibliographical notes in case of need) that $E^{s,t}$ admits a measurable right-inverse $S_{s\to t}$, i.e. $E^{s,t} \circ S_{s\to t} = \mathrm{Id}$. This proves statement (vi).
⊓⊔

Interpolation of random variables

Action-minimizing curves provide a fairly general framework to interpolate between points, which can be seen as deterministic random variables. What happens when we want to interpolate between genuinely

random variables, in a way that is most economic? Since a deterministic point can be identified with a Dirac mass, this new problem contains both the classical action-minimizing problem and the Monge–Kantorovich problem.

Here is a natural recipe. Let $c$ be the cost associated with the Lagrangian action, and let $\mu_0$, $\mu_1$ be two given laws. Introduce an optimal coupling $(X_0, X_1)$ of $(\mu_0, \mu_1)$, and a random action-minimizing path $(X_t)_{0 \le t \le 1}$ joining $X_0$ to $X_1$. (We shall see later that such a thing always exists.) Then the random variable $X_t$ is an interpolation of $X_0$ and $X_1$; or equivalently, the law $\mu_t$ is an interpolation of $\mu_0$ and $\mu_1$. This procedure is called displacement interpolation, by opposition to the linear interpolation $\mu_t = (1-t)\,\mu_0 + t\,\mu_1$. Note that there is a priori no uniqueness of the displacement interpolation.

Some of the concepts which we just introduced deserve careful attention. In the sequel, $e_t$ will stand for the evaluation at time $t$: $e_t(\gamma) = \gamma(t)$.

Definition 7.19 (Dynamical coupling). Let $(\mathcal{X},d)$ be a Polish space. A dynamical transference plan $\Pi$ is a probability measure on the space $C([0,1]; \mathcal{X})$. A dynamical coupling of two probability measures $\mu_0, \mu_1 \in P(\mathcal{X})$ is a random curve $\gamma : [0,1] \to \mathcal{X}$ such that $\mathrm{law}\,(\gamma_0) = \mu_0$, $\mathrm{law}\,(\gamma_1) = \mu_1$.

Definition 7.20 (Dynamical optimal coupling). Let $(\mathcal{X},d)$ be a Polish space, $(\mathcal{A}^{0,1})$ a Lagrangian action on $\mathcal{X}$, $c$ the associated cost, and $\Gamma$ the set of action-minimizing curves. A dynamical optimal transference plan is a probability measure $\Pi$ on $\Gamma$ such that $\pi := (e_0, e_1)_\# \Pi$ is an optimal transference plan between $\mu_0$ and $\mu_1$. Equivalently, $\Pi$ is the law of a random action-minimizing curve whose endpoints constitute an optimal coupling of $\mu_0$ and $\mu_1$. Such a random curve is called a dynamical optimal coupling of $(\mu_0, \mu_1)$.
By abuse of language, $\Pi$ itself is often called a dynamical optimal coupling.

The next theorem is the main result of this chapter. It shows that the law at time $t$ of a dynamical optimal coupling can be seen as a minimizing path in the space of probability measures. In the important case when the cost is a power of a geodesic distance, the corollary stated right after the theorem shows that displacement interpolation can be

thought of as a geodesic path in the space of probability measures. ("A geodesic in the space of laws is the law of a geodesic.") The theorem also shows that such interpolations can be constructed under quite weak assumptions.

Theorem 7.21 (Displacement interpolation). Let $(\mathcal{X},d)$ be a Polish space, and $(\mathcal{A}^{0,1})$ a coercive Lagrangian action on $\mathcal{X}$, with continuous cost functions $c^{s,t}$. Whenever $0 \le s < t \le 1$, denote by $C^{s,t}(\mu,\nu)$ the optimal transport cost between the probability measures $\mu$ and $\nu$ for the cost $c^{s,t}$; write $c = c^{0,1}$ and $C = C^{0,1}$. Let $\mu_0$ and $\mu_1$ be any two probability measures on $\mathcal{X}$ such that the optimal transport cost $C(\mu_0,\mu_1)$ is finite. Then, given a continuous path $(\mu_t)_{0 \le t \le 1}$, the following properties are equivalent:
(i) For each $t \in [0,1]$, $\mu_t$ is the law of $\gamma_t$, where $(\gamma_t)_{0 \le t \le 1}$ is a dynamical optimal coupling of $(\mu_0, \mu_1)$;
(ii) For any three intermediate times $t_1 < t_2 < t_3$ in $[0,1]$,
$$C^{t_1,t_3}(\mu_{t_1}, \mu_{t_3}) = C^{t_1,t_2}(\mu_{t_1}, \mu_{t_2}) + C^{t_2,t_3}(\mu_{t_2}, \mu_{t_3});$$
(iii) The path $(\mu_t)_{0 \le t \le 1}$ is a minimizing curve for the coercive action functional defined on $P(\mathcal{X})$ by
$$\mathbb{A}^{s,t}(\mu) = \sup_{N \in \mathbb{N}} \ \sup_{s = t_0 < t_1 < \dots < t_N = t} \ \sum_{i=0}^{N-1} C^{t_i,t_{i+1}}(\mu_{t_i}, \mu_{t_{i+1}}). \qquad (7.16)$$

Corollary 7.22 (Displacement interpolation as geodesics). Let $(\mathcal{X},d)$ be a complete separable, locally compact length space. Let $p > 1$,

and let $P_p(\mathcal{X})$ be the space of probability measures on $\mathcal{X}$ with finite moment of order $p$, metrized by the Wasserstein distance $W_p$. Then, given any two $\mu_0, \mu_1 \in P_p(\mathcal{X})$, and a continuous curve $(\mu_t)_{0 \le t \le 1}$, valued in $P_p(\mathcal{X})$, the following properties are equivalent:
(i) $\mu_t$ is the law of $\gamma_t$, where $\gamma$ is a random (minimizing, constant-speed) geodesic such that $(\gamma_0, \gamma_1)$ is an optimal coupling;
(ii) $(\mu_t)_{0 \le t \le 1}$ is a geodesic curve in the space $P_p(\mathcal{X})$.
Moreover, if $\mu_0$ and $\mu_1$ are given, there exists at least one such curve. More generally, if $\mathcal{K}_0 \subset P_p(\mathcal{X})$ and $\mathcal{K}_1 \subset P_p(\mathcal{X})$ are compact subsets of $P_p(\mathcal{X})$, then the set of geodesic curves $(\mu_t)_{0 \le t \le 1}$ such that $\mu_0 \in \mathcal{K}_0$ and $\mu_1 \in \mathcal{K}_1$ is compact and nonempty; and also the set of dynamical optimal transference plans $\Pi$ with $(e_0)_\#\Pi \in \mathcal{K}_0$, $(e_1)_\#\Pi \in \mathcal{K}_1$ is compact and nonempty.

Corollary 7.23 (Uniqueness of displacement interpolation). With the same assumptions as in Theorem 7.21, if:
(a) there is a unique optimal transport plan $\pi$ between $\mu_0$ and $\mu_1$;
(b) $\pi(dx_0\,dx_1)$-almost surely, $x_0$ and $x_1$ are joined by a unique minimizing curve;
then there is a unique displacement interpolation $(\mu_t)_{0 \le t \le 1}$ joining $\mu_0$ to $\mu_1$.

Remark 7.24. In Corollary 7.22, $\mathcal{A}^{s,t}(\gamma) = \int_s^t |\dot\gamma_\tau|^p\,d\tau$. Then action-minimizing curves in $\mathcal{X}$ are the same, whatever the value of $p > 1$. Yet geodesics in $P_p(\mathcal{X})$ are not the same for different values of $p$, because a coupling of $(\mu_0, \mu_1)$ which is optimal for a certain value of $p$ might well not be for another value.

Remark 7.25. Theorem 7.21 applies to Lagrangian functions $L(x,v,t)$ on $TM \times [0,1]$, where $M$ is a smooth Riemannian manifold, as soon as $L$ is $C^2$ and satisfies the classical conditions of Definition 7.6. Then $\mu_t$ is the law at time $t$ of a random solution of the Euler–Lagrange equation (7.6).

Remark 7.26.
In Theorem 7.21, the minimizing property of the path $(\mu_t)$ is expressed in a weak formulation, which makes sense with a lot of generality. But this theorem leaves open certain natural questions:
• Is there a differential equation for geodesic curves, or more generally optimal paths $(\mu_t)_{0 \le t \le 1}$? Of course, the answer is related to the possibility of defining a tangent space in the space of measures.

• Is there a more explicit formula for the action on the space of probability measures, say for a simple enough action on $\mathcal{X}$? Can it be written as $\int_0^1 \mathbb{L}(\mu_t, \dot\mu_t, t)\,dt$? (Of course, in Corollary 7.22 this is the case with $\mathbb{L} = |\dot\mu|^p$, but this expression is not very "explicit".)
• Are geodesic paths nonbranching? (Does the velocity at initial time uniquely determine the final measure $\mu_1$?)
• Can one identify simple conditions for the existence of a unique geodesic path between two given probability measures?

All these questions will be answered affirmatively in the sequel of this course, under suitable regularity assumptions on the space, the action or the probability measures.

Remark 7.27. The assumption of local compactness in Corollary 7.22 is not superficial: it is used to guarantee the coercivity of the action. For spaces that are not locally compact, there might be an analogous theory, but it is certainly more tricky. First of all, selection theorems are not immediately available if one does not assume compactness of the set of geodesics joining two given endpoints. More importantly, the convergence scheme used below to construct a random geodesic curve from a time-dependent law might fail to work. Here we are encountering a general principle in probability theory: analytic characterizations of stochastic processes (like those based on semigroups, generators, etc.) are essentially available only in locally compact spaces. In spite of all that, there are some representation theorems for Wasserstein geodesics that do not need local compactness; see the bibliographical notes for details.

The proof of Theorem 7.21 is not so difficult, but a bit cumbersome because of measurability issues. For training purposes, the reader might rewrite it in the simpler case where any pair of points is joined by a unique geodesic (as in the case of $\mathbb{R}^n$). To help understanding, I shall first sketch the main idea.
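Before the proof, the displacement interpolation recipe and the geodesic property of Corollary 7.22 can be observed numerically in the simple one-dimensional quadratic case, where the optimal coupling of two empirical measures is the monotone (sorted) rearrangement. A Python sketch (sample sizes and distributions are illustrative choices, not part of the original text):

```python
import numpy as np

def W2(a, b):
    # Wasserstein-2 distance between two equal-size empirical measures on R;
    # in 1D the monotone (sorted) coupling is optimal for the quadratic cost.
    return np.sqrt(np.mean((np.sort(a) - np.sort(b))**2))

rng = np.random.default_rng(0)
x0 = np.sort(rng.normal(0.0, 1.0, 1000))   # sample of mu_0
x1 = np.sort(rng.normal(5.0, 1.0, 1000))   # sample of mu_1

for t in (0.25, 0.5, 0.75):
    xt = (1 - t) * x0 + t * x1             # each coupled pair moves along its segment
    # Geodesic property: W2(mu_0, mu_t) = t W2(mu_0, mu_1), and symmetrically.
    assert abs(W2(x0, xt) - t * W2(x0, x1)) < 1e-9
    assert abs(W2(xt, x1) - (1 - t) * W2(x0, x1)) < 1e-9

# Contrast with linear interpolation: at t = 1/2 the displacement interpolant is a
# single bump near 2.5, while (mu_0 + mu_1)/2 keeps two bumps near 0 and 5.
mid = 0.5 * (x0 + x1)
assert abs(np.mean(mid) - 2.5) < 0.2 and np.std(mid) < 2.0
```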
Main idea in the proof of Theorem 7.21. The delicate part consists in showing that if $(\mu_t)$ is a given action-minimizing curve, then there exists a random minimizer $\gamma$ such that $\mu_t = \mathrm{law}\,(\gamma_t)$. This $\gamma$ will be constructed by dyadic approximation, as follows. First let $(\gamma_0^{(0)}, \gamma_1^{(0)})$ be an optimal coupling of $(\mu_0, \mu_1)$. (Here the notation $\gamma_0^{(0)}$ could be replaced by just $x_0$; it does not mean that there is some curve $\gamma^{(0)}$

behind.) Then let $(\gamma_0^{(1)}, \gamma_{1/2}^{(1)})$ be an optimal coupling of $(\mu_0, \mu_{1/2})$, and $((\gamma_{1/2}^{(1)})', \gamma_1^{(1)})$ be an optimal coupling of $(\mu_{1/2}, \mu_1)$. By gluing these couplings together, I can actually assume that $(\gamma_{1/2}^{(1)})' = \gamma_{1/2}^{(1)}$, so that I have a triple $(\gamma_0^{(1)}, \gamma_{1/2}^{(1)}, \gamma_1^{(1)})$ in which the first two components on the one hand, and the last two components on the other hand, constitute optimal couplings.

Now the key observation is that if $(\gamma_{t_1}, \gamma_{t_2})$ and $(\gamma_{t_2}, \gamma_{t_3})$ are optimal couplings of $(\mu_{t_1}, \mu_{t_2})$ and $(\mu_{t_2}, \mu_{t_3})$ respectively, and the $\mu_{t_k}$ satisfy the equality appearing in (ii), then also $(\gamma_{t_1}, \gamma_{t_3})$ should be optimal. Indeed, by taking expectation in the inequality

$$c^{t_1,t_3}(\gamma_{t_1}, \gamma_{t_3}) \le c^{t_1,t_2}(\gamma_{t_1}, \gamma_{t_2}) + c^{t_2,t_3}(\gamma_{t_2}, \gamma_{t_3})$$

and using the optimality assumption, one obtains

$$\mathbb{E}\,c^{t_1,t_3}(\gamma_{t_1}, \gamma_{t_3}) \le C^{t_1,t_2}(\mu_{t_1}, \mu_{t_2}) + C^{t_2,t_3}(\mu_{t_2}, \mu_{t_3}).$$

Now the fact that $(\mu_t)$ is action-minimizing imposes

$$C^{t_1,t_2}(\mu_{t_1}, \mu_{t_2}) + C^{t_2,t_3}(\mu_{t_2}, \mu_{t_3}) = C^{t_1,t_3}(\mu_{t_1}, \mu_{t_3});$$

so actually

$$\mathbb{E}\,c^{t_1,t_3}(\gamma_{t_1}, \gamma_{t_3}) \le C^{t_1,t_3}(\mu_{t_1}, \mu_{t_3}),$$

which means that indeed $(\gamma_{t_1}, \gamma_{t_3})$ is an optimal coupling of $(\mu_{t_1}, \mu_{t_3})$ for the cost $c^{t_1,t_3}$.

So $(\gamma_0^{(1)}, \gamma_1^{(1)})$ is an optimal coupling of $(\mu_0, \mu_1)$. Now we can proceed in the same manner and define, for each $k$, a random discrete path $(\gamma_{j/2^k}^{(k)})$ such that $(\gamma_s^{(k)}, \gamma_t^{(k)})$ is an optimal coupling for all times $s,t$ of the form $j/2^k$. These are only discrete paths, but it is possible to extend them into paths $(\gamma_t^{(k)})_{0 \le t \le 1}$ that are minimizers of the action. Of course, if $t$ is not of the form $j/2^k$, there is no reason why $\mathrm{law}\,(\gamma_t^{(k)})$ would coincide with $\mu_t$. But hopefully we can pass to the limit as $k \to \infty$ for each dyadic time, and conclude by a density argument. ⊓⊔
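The key observation of the sketch can be tested numerically in the one-dimensional quadratic setting (an illustrative assumption; the cost is $c^{s,t}(x,y) = |x-y|^2/(t-s)$, for which the monotone coupling between sorted samples is optimal): gluing the optimal couplings $(X_0, X_{1/2})$ and $(X_{1/2}, X_1)$ along the displacement interpolant yields a coupling $(X_0, X_1)$ whose total cost saturates the equality of property (ii).

```python
import numpy as np

def C(s, t, a, b):
    # Optimal total cost between equal-size empirical measures for
    # c^{s,t}(x,y) = |x-y|^2/(t-s); the sorted (monotone) coupling is optimal in 1D.
    return np.mean((np.sort(a) - np.sort(b))**2) / (t - s)

rng = np.random.default_rng(0)
x0 = np.sort(rng.normal(0.0, 1.0, 500))
x1 = np.sort(rng.normal(4.0, 1.0, 500))

t1, t2, t3 = 0.0, 0.5, 1.0
xt = (1 - t2) * x0 + t2 * x1   # interpolant obtained by gluing the monotone couplings

# Property (ii): C^{0,1/2} + C^{1/2,1} = C^{0,1} along the interpolation.
assert abs(C(t1, t2, x0, xt) + C(t2, t3, xt, x1) - C(t1, t3, x0, x1)) < 1e-9
```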
s,t Complete proof of Theorem 7.21 . First, if A ( γ ) is bounded below by a constant − C , independently of s,t and γ , then the same is true of s,t s,t the cost functions and of the total costs C . So all the quantities c appearing in the proof will be well-defined, the value + ∞ being possibly s,t attained. Moreover, the action A defined by the formula in (iii) will

also be bounded below by the same constant $-C$, so Property (i) of Definition 7.13 will be satisfied.

Now let $\mu_0$ and $\mu_1$ be given. According to Theorem 4.1, there exists at least one optimal transference plan $\pi$ between $\mu_0$ and $\mu_1$, for the cost $c = c^{0,1}$. Let $S_{0\to 1}$ be the mapping appearing in Proposition 7.16(vi), and let

$$\Pi := (S_{0\to 1})_\# \pi.$$

Then $\Pi$ defines the law of a random geodesic $\gamma$, and the identity $E^{0,1} \circ S_{0\to 1} = \mathrm{Id}$ implies that the endpoints of $\gamma$ are distributed according to $\pi$. This proves the existence of a path satisfying (i). Now the main part of the proof consists in checking the equivalence of properties (i) and (ii). This will be performed in four steps.

Step 1. Let $(\mu_t)_{0 \le t \le 1}$ be any continuous curve in the space of probability measures, and let $t_1 < t_2 < t_3$ be three intermediate times. Let $\pi_{t_1 \to t_2}$ be an optimal transference plan between $\mu_{t_1}$ and $\mu_{t_2}$ for the transport cost $c^{t_1,t_2}$, and similarly let $\pi_{t_2 \to t_3}$ be an optimal transference plan between $\mu_{t_2}$ and $\mu_{t_3}$ for the transport cost $c^{t_2,t_3}$. By the Gluing Lemma of Chapter 1 one can construct random variables $(\gamma_{t_1}, \gamma_{t_2}, \gamma_{t_3})$ such that $\mathrm{law}\,(\gamma_{t_1}, \gamma_{t_2}) = \pi_{t_1 \to t_2}$ and $\mathrm{law}\,(\gamma_{t_2}, \gamma_{t_3}) = \pi_{t_2 \to t_3}$ (in particular, $\mathrm{law}\,(\gamma_{t_i}) = \mu_{t_i}$ for $i = 1,2,3$). Then, by (7.14),

$$C^{t_1,t_3}(\mu_{t_1}, \mu_{t_3}) \le \mathbb{E}\,c^{t_1,t_3}(\gamma_{t_1}, \gamma_{t_3}) \le \mathbb{E}\,c^{t_1,t_2}(\gamma_{t_1}, \gamma_{t_2}) + \mathbb{E}\,c^{t_2,t_3}(\gamma_{t_2}, \gamma_{t_3}) = C^{t_1,t_2}(\mu_{t_1}, \mu_{t_2}) + C^{t_2,t_3}(\mu_{t_2}, \mu_{t_3}).$$

This inequality holds for any path, optimal or not.

Step 2. Assume that $(\mu_t)$ satisfies (i), so there is a dynamical optimal transference plan $\Pi$ such that $\mu_t = (e_t)_\# \Pi$. Let $\gamma$ be a random minimizing curve with law $\Pi$, and consider the obvious coupling $(\gamma_{t_1}, \gamma_{t_2})$ (resp. $(\gamma_{t_2}, \gamma_{t_3})$) of
$(\mu_{t_1}, \mu_{t_2})$ (resp. $(\mu_{t_2}, \mu_{t_3})$). Then from the definition of the optimal cost and the minimizing property of $\gamma$,

$$C^{t_1,t_2}(\mu_{t_1}, \mu_{t_2}) + C^{t_2,t_3}(\mu_{t_2}, \mu_{t_3}) \le \mathbb{E}\,c^{t_1,t_2}(\gamma_{t_1}, \gamma_{t_2}) + \mathbb{E}\,c^{t_2,t_3}(\gamma_{t_2}, \gamma_{t_3}) = \mathbb{E}\,\mathcal{A}^{t_1,t_2}(\gamma) + \mathbb{E}\,\mathcal{A}^{t_2,t_3}(\gamma) = \mathbb{E}\,\mathcal{A}^{t_1,t_3}(\gamma) = \mathbb{E}\,c^{t_1,t_3}(\gamma_{t_1}, \gamma_{t_3}). \qquad (7.18)$$

Now choose $t_1 = 0$, $t_2 = t$, $t_3 = 1$. Since by assumption $(\gamma_0, \gamma_1)$ is an optimal coupling of $(\mu_0, \mu_1)$, the above computation implies

$$C^{0,t}(\mu_0, \mu_t) + C^{t,1}(\mu_t, \mu_1) \le C^{0,1}(\mu_0, \mu_1),$$

and since the reverse inequality holds as a consequence of Step 1, actually

$$C^{0,t}(\mu_0, \mu_t) + C^{t,1}(\mu_t, \mu_1) = C^{0,1}(\mu_0, \mu_1).$$

Moreover, equality has to hold in (7.18) (for that particular choice of intermediate times); since $C^{0,1}(\mu_0, \mu_1) < +\infty$ this implies $C^{0,t}(\mu_0, \mu_t) = \mathbb{E}\,c^{0,t}(\gamma_0, \gamma_t)$, which means that $(\gamma_0, \gamma_t)$ should actually be an optimal coupling of $(\mu_0, \mu_t)$. Similarly, $(\gamma_t, \gamma_1)$ should be an optimal coupling of $(\mu_t, \mu_1)$.

Next choose $t_1 = 0$, $t_2 = s$, $t_3 = t$, and apply the previous deduction to discover that $(\gamma_s, \gamma_t)$ is an optimal coupling of $(\mu_s, \mu_t)$. After inserting this information in (7.18) with $t_2 = s$ and $t_3 = t$, we recover

$$C^{t_1,t_2}(\mu_{t_1}, \mu_{t_2}) + C^{t_2,t_3}(\mu_{t_2}, \mu_{t_3}) \le C^{t_1,t_3}(\mu_{t_1}, \mu_{t_3}).$$

This together with Step 1 proves that $(\mu_t)$ satisfies Property (ii). So far we have proven (i) $\Rightarrow$ (ii).

Step 3. Assume that $(\mu_t)$ satisfies Property (ii); then we can perform again the same computation as in Step 1, but now all the inequalities have to be equalities. This implies that the random variables $(\gamma_{t_1}, \gamma_{t_2}, \gamma_{t_3})$ satisfy:
(a) $(\gamma_{t_1}, \gamma_{t_3})$ is an optimal coupling of $(\mu_{t_1}, \mu_{t_3})$ for the cost $c^{t_1,t_3}$;
(b) $c^{t_1,t_3}(\gamma_{t_1}, \gamma_{t_3}) = c^{t_1,t_2}(\gamma_{t_1}, \gamma_{t_2}) + c^{t_2,t_3}(\gamma_{t_2}, \gamma_{t_3})$ almost surely;
(c) $c^{s,t}(\gamma_s, \gamma_t) < +\infty$ almost surely.

Armed with that information, we proceed as follows. We start from an optimal coupling $(\gamma_0, \gamma_1)$ of $(\mu_0, \mu_1)$, with joint law $\pi_{0\to 1}$. Then as in Step 1 we construct a triple $(\gamma_0^{(1)}, \gamma_{1/2}^{(1)}, \gamma_1^{(1)})$ with $\mathrm{law}\,(\gamma_0^{(1)}) = \mu_0$, $\mathrm{law}\,(\gamma_{1/2}^{(1)}) = \mu_{1/2}$, $\mathrm{law}\,(\gamma_1^{(1)}) = \mu_1$, such that $(\gamma_0^{(1)}, \gamma_{1/2}^{(1)})$ is an optimal coupling of $(\mu_0, \mu_{1/2})$ for the cost $c^{0,\frac12}$, and $(\gamma_{1/2}^{(1)}, \gamma_1^{(1)})$ is an optimal coupling of $(\mu_{1/2}, \mu_1)$ for the cost $c^{\frac12,1}$. From (a) and (b) above we know
that $(\gamma_0^{(1)}, \gamma_1^{(1)})$ is an optimal coupling of $(\mu_0, \mu_1)$ (but $\mathrm{law}\,(\gamma_0^{(1)}, \gamma_1^{(1)})$ might be different from $\mathrm{law}\,(\gamma_0, \gamma_1)$), and moreover $c^{0,1}(\gamma_0^{(1)}, \gamma_1^{(1)}) = c^{0,\frac12}(\gamma_0^{(1)}, \gamma_{1/2}^{(1)}) + c^{\frac12,1}(\gamma_{1/2}^{(1)}, \gamma_1^{(1)})$ almost surely.

Next it is possible to iterate the construction, introducing more and more midpoints. By a reasoning similar to the one above and an induction argument, one can construct, for each integer $k \ge 1$, random variables $(\gamma_0^{(k)}, \gamma_{1/2^k}^{(k)}, \gamma_{2/2^k}^{(k)}, \gamma_{3/2^k}^{(k)}, \dots, \gamma_1^{(k)})$ in such a way that

(a) for any two indices $i,j \le 2^k$, $(\gamma^{(k)}_{i/2^k}, \gamma^{(k)}_{j/2^k})$ constitutes an optimal coupling of $(\mu_{i/2^k}, \mu_{j/2^k})$;
(b) for any three indices $i_1 \le i_2 \le i_3 \le 2^k$, one has

$$c^{\frac{i_1}{2^k},\frac{i_3}{2^k}}\Bigl(\gamma^{(k)}_{\frac{i_1}{2^k}}, \gamma^{(k)}_{\frac{i_3}{2^k}}\Bigr) = c^{\frac{i_1}{2^k},\frac{i_2}{2^k}}\Bigl(\gamma^{(k)}_{\frac{i_1}{2^k}}, \gamma^{(k)}_{\frac{i_2}{2^k}}\Bigr) + c^{\frac{i_2}{2^k},\frac{i_3}{2^k}}\Bigl(\gamma^{(k)}_{\frac{i_2}{2^k}}, \gamma^{(k)}_{\frac{i_3}{2^k}}\Bigr).$$

At this stage it is convenient to extend the random variables $\gamma^{(k)}_t$, which are only defined for times $t = j/2^k$, into (random) continuous curves $(\gamma^{(k)}_t)_{0 \le t \le 1}$. For that we use Proposition 7.16(vi) again, and for any $t \in (i/2^k, (i+1)/2^k)$ we define

$$\gamma_t := e_t\Bigl(S_{\frac{i}{2^k} \to \frac{i+1}{2^k}}\bigl(\gamma_{\frac{i}{2^k}}, \gamma_{\frac{i+1}{2^k}}\bigr)\Bigr).$$

(Recall that $e_t$ is just the evaluation at time $t$.) Then the law $\Pi^{(k)}$ of $(\gamma_t)_{0 \le t \le 1}$ is a probability measure on the set of continuous curves in $\mathcal{X}$.

I claim that $\Pi^{(k)}$ is actually concentrated on minimizing curves. (Skip at first reading and go directly to Step 4.) To prove this, it is sufficient to check the criterion in Proposition 7.16(iv), involving three intermediate times $t_1, t_2, t_3$. By construction, the criterion holds true if all these times belong to the same time-interval $[i/2^k, (i+1)/2^k]$, and also if they are all of the form $j/2^k$; the problem consists in "crossing subintervals".
Let us show that

$$(i-1)\,2^{-k} < s < i\,2^{-k} \le j\,2^{-k} < t < (j+1)\,2^{-k} \Longrightarrow$$
$$\begin{cases} c^{\frac{i-1}{2^k},\frac{j+1}{2^k}}\bigl(\gamma_{\frac{i-1}{2^k}}, \gamma_{\frac{j+1}{2^k}}\bigr) = c^{\frac{i-1}{2^k},s}\bigl(\gamma_{\frac{i-1}{2^k}}, \gamma_s\bigr) + c^{s,t}(\gamma_s, \gamma_t) + c^{t,\frac{j+1}{2^k}}\bigl(\gamma_t, \gamma_{\frac{j+1}{2^k}}\bigr);\\[2mm] c^{s,\frac{i}{2^k}}\bigl(\gamma_s, \gamma_{\frac{i}{2^k}}\bigr) + c^{\frac{i}{2^k},\frac{j}{2^k}}\bigl(\gamma_{\frac{i}{2^k}}, \gamma_{\frac{j}{2^k}}\bigr) + c^{\frac{j}{2^k},t}\bigl(\gamma_{\frac{j}{2^k}}, \gamma_t\bigr) = c^{s,t}(\gamma_s, \gamma_t). \end{cases} \qquad (7.19)$$

To prove this, we start with

$$c^{\frac{i-1}{2^k},\frac{j+1}{2^k}}\bigl(\gamma_{\frac{i-1}{2^k}}, \gamma_{\frac{j+1}{2^k}}\bigr) \le c^{\frac{i-1}{2^k},s}\bigl(\gamma_{\frac{i-1}{2^k}}, \gamma_s\bigr) + c^{s,t}(\gamma_s, \gamma_t) + c^{t,\frac{j+1}{2^k}}\bigl(\gamma_t, \gamma_{\frac{j+1}{2^k}}\bigr)$$
$$\le c^{\frac{i-1}{2^k},s}\bigl(\gamma_{\frac{i-1}{2^k}}, \gamma_s\bigr) + c^{s,\frac{i}{2^k}}\bigl(\gamma_s, \gamma_{\frac{i}{2^k}}\bigr) + \dots + c^{\frac{j}{2^k},t}\bigl(\gamma_{\frac{j}{2^k}}, \gamma_t\bigr) + c^{t,\frac{j+1}{2^k}}\bigl(\gamma_t, \gamma_{\frac{j+1}{2^k}}\bigr). \qquad (7.20)$$

Since we have used minimizing curves to interpolate on each dyadic subinterval,
\[
c^{\frac{i-1}{2^k},\frac{i}{2^k}}\bigl(\gamma_{\frac{i-1}{2^k}}, \gamma_{\frac{i}{2^k}}\bigr)
= c^{\frac{i-1}{2^k},\,s}\bigl(\gamma_{\frac{i-1}{2^k}}, \gamma_s\bigr)
+ c^{s,\,i/2^k}\bigl(\gamma_s, \gamma_{i/2^k}\bigr),
\]
etc. So the right-hand side of (7.20) coincides with
\[
c^{\frac{i-1}{2^k},\frac{i}{2^k}}\bigl(\gamma_{\frac{i-1}{2^k}}, \gamma_{\frac{i}{2^k}}\bigr)
+ \cdots
+ c^{\frac{j}{2^k},\frac{j+1}{2^k}}\bigl(\gamma_{\frac{j}{2^k}}, \gamma_{\frac{j+1}{2^k}}\bigr),
\]
and by construction of $\Pi^{(k)}$, this is just $c^{\frac{i-1}{2^k},\frac{j+1}{2^k}}\bigl(\gamma_{\frac{i-1}{2^k}}, \gamma_{\frac{j+1}{2^k}}\bigr)$. So there has to be equality everywhere in (7.20), which leads to (7.19). (Here I use the fact that $c^{s,t}(\gamma_s,\gamma_t) < +\infty$.) After that it is an easy game to conclude the proof of the minimizing property for arbitrary times $t_1, t_2, t_3$.

Step 4. To recapitulate: Starting from a curve $(\mu_t)_{0\le t\le 1}$, we have constructed a family of probability measures $\Pi^{(k)}$ which are all concentrated on the set $\Gamma$ of minimizing curves, and satisfy $(e_t)_\#\Pi^{(k)} = \mu_t$ for all $t = j/2^k$. It remains to pass to the limit as $k \to \infty$. For that we shall check the tightness of the sequence $(\Pi^{(k)})_{k\in\mathbb N}$. Let $\varepsilon > 0$ be arbitrary. Since $\mu_0$, $\mu_1$ are tight, there are compact sets $K_0$, $K_1$ such that $\mu_0[\mathcal X\setminus K_0] \le \varepsilon$, $\mu_1[\mathcal X\setminus K_1] \le \varepsilon$. From the coercivity of the action, the set $\Gamma_{K_0\to K_1}$ of action-minimizing curves joining $K_0$ to $K_1$ is compact, and (with obvious notation)
\[
\Pi^{(k)}\bigl[\Gamma \setminus \Gamma_{K_0\to K_1}\bigr]
\le P\bigl[\gamma_0 \notin K_0\bigr] + P\bigl[\gamma_1 \notin K_1\bigr]
= \mu_0[\mathcal X\setminus K_0] + \mu_1[\mathcal X\setminus K_1] \le 2\varepsilon.
\]
This proves the tightness of the family $(\Pi^{(k)})$. So one can extract a subsequence thereof, still denoted $(\Pi^{(k)})$, that converges weakly to some probability measure $\Pi$.

By Proposition 7.16(v), $\Gamma$ is closed; so $\Pi$ is still supported in $\Gamma$. Moreover, for all dyadic times $t = i/2^\ell$ in $[0,1]$ we have, if $k$ is larger than $\ell$, $(e_t)_\#\Pi^{(k)} = \mu_t$, and by passing to the limit we find that $(e_t)_\#\Pi = \mu_t$ also.

By assumption, $\mu_t$ depends continuously on $t$. So, to conclude that $(e_t)_\#\Pi = \mu_t$ for all times $t \in [0,1]$ it now suffices to check the continuity of $(e_t)_\#\Pi$ as a function of $t$. In other words, if $\varphi$ is an arbitrary bounded continuous function on $\mathcal X$, one has to show that

\[
\psi(t) = \mathbb E\, \varphi(\gamma_t)
\]
is a continuous function of $t$, where $\gamma$ is a random geodesic with law $\Pi$. But this is a simple consequence of the continuity of $t \mapsto \gamma_t$ (for all $\gamma$) and Lebesgue's dominated convergence theorem. This concludes Step 4, and the proof of (ii) $\Rightarrow$ (i).

Next, let us check that the two expressions for $A^{s,t}$ in (iii) do coincide. This is about the same computation as in Step 1 above. Let $s < t$ be given, let $(\mu_\tau)_{s\le\tau\le t}$ be a continuous path, and let $(t_i)$ be a subdivision of $[s,t]$. Further, let $(X_s, X_t)$ be an optimal coupling of $(\mu_s, \mu_t)$ for the cost function $c^{s,t}$, and let $(\gamma_\tau)_{s\le\tau\le t}$ be a random continuous path such that $\mathrm{law}\,(\gamma_\tau) = \mu_\tau$ for all $\tau \in [s,t]$. Then
\[
\sum_i C^{t_i,t_{i+1}}(\mu_{t_i}, \mu_{t_{i+1}})
\le \sum_i \mathbb E\, c^{t_i,t_{i+1}}(\gamma_{t_i}, \gamma_{t_{i+1}})
\le \mathbb E\, A^{s,t}(\gamma),
\]
where the next-to-last inequality follows from the fact that each $(\gamma_{t_i}, \gamma_{t_{i+1}})$ is a coupling of $(\mu_{t_i}, \mu_{t_{i+1}})$, and the last inequality is a consequence of the definition of $c^{s,t}$. Taking the supremum over subdivisions shows that the first expression for $A^{s,t}$ in (iii) is bounded above by $\mathbb E\, A^{s,t}(\gamma)$. On the other hand, there is equality in the whole chain of inequalities if $t_0 = s$, $t_1 = t$, $X_s = \gamma_s$, $X_t = \gamma_t$, and $\gamma$ is a (random) action-minimizing curve. So the two expressions in (iii) do coincide.

Now let us address the equivalence between (ii) and (iii). First, it is clear that $A^{s,t}$ is lower semicontinuous, since it is defined as a supremum of lower semicontinuous functionals. The inequality $A^{t_1,t_3} \ge A^{t_1,t_2} + A^{t_2,t_3}$ holds true for all intermediate times $t_1 < t_2 < t_3$ (this is a simple consequence of the definitions), and the converse inequality is a consequence of the general inequality
\[
s < t_2 < t \;\Longrightarrow\; C^{s,t}(\mu_s, \mu_t) \le C^{s,t_2}(\mu_s, \mu_{t_2}) + C^{t_2,t}(\mu_{t_2}, \mu_t),
\]
which we proved in Step 1 above. So Property (i) in Definition 7.11 is satisfied.

To check Property (ii) of that definition, take any two probability measures $\mu_s$, $\mu_t$ and introduce a displacement interpolation

$(\mu_\tau)_{s\le\tau\le t}$ for the Lagrangian action restricted to $[s,t]$. Then Property (ii) of Theorem 7.21 implies $A^{s,t}\bigl((\mu_\tau)\bigr) = C^{s,t}(\mu_s, \mu_t)$. Finally, Property (iii) in Definition 7.11 is also satisfied by construction. In the end, $(A^{s,t})$ does define a Lagrangian action, with induced cost functionals $C^{s,t}$.

To conclude the proof of Theorem 7.21 it only remains to check the coercivity of the action; then the equivalence of (i), (ii) and (iii) will follow from Proposition 7.16(iv). Let $s < t$ be two given times in $[0,1]$, and let $\mathcal K_s$, $\mathcal K_t$ be compact sets of probability measures such that $C^{s,t}(\mu_s, \mu_t) < +\infty$ for all $\mu_s \in \mathcal K_s$, $\mu_t \in \mathcal K_t$. Action-minimizing curves $[s,t] \to P(\mathcal X)$ with endpoints in $\mathcal K_s$ and $\mathcal K_t$ can be written as $(\mathrm{law}\,(\gamma_\tau))_{s\le\tau\le t}$, where $\gamma$ is a random action-minimizing curve such that $\mathrm{law}\,(\gamma_s) \in \mathcal K_s$, $\mathrm{law}\,(\gamma_t) \in \mathcal K_t$. One can use an argument similar to the one in Step 4 above to prove that the laws of such minimizing curves form a tight, closed set; so we have a compact set of dynamical transference plans $\Pi^{s,t}$, that are probability measures on $C([s,t]; \mathcal X)$. The problem is to show that the paths $((e_\tau)_\#\Pi^{s,t})$ constitute a compact set in $C([s,t]; P(\mathcal X))$. Since the continuous image of a compact set is compact, it suffices to check that the map
\[
\Pi^{s,t} \longmapsto \bigl((e_\tau)_\#\Pi^{s,t}\bigr)_{s\le\tau\le t}
\]
is continuous from $P(C([s,t]; \mathcal X))$ to $C([s,t]; P(\mathcal X))$. To do so, it will be convenient to metrize $P(\mathcal X)$ with the Wasserstein distance $W_1$, replacing if necessary $d$ by a bounded, topologically equivalent distance. (Recall Corollary 6.13.) Then the uniform distance on $C([s,t]; \mathcal X)$ is also bounded, and there is an associated Wasserstein distance $\widetilde W_1$ on $P(C([s,t]; \mathcal X))$.

Let $\Pi$ and $\widetilde\Pi$ be two dynamical optimal transference plans, and let $((\gamma_\tau), (\widetilde\gamma_\tau))$ be an optimal coupling of $\Pi$ and $\widetilde\Pi$; let also $\mu_\tau$, $\widetilde\mu_\tau$ be the associated displacement interpolations; then the required continuity follows from the chain of inequalities
\[
\sup_{\tau} W_1(\mu_\tau, \widetilde\mu_\tau)
\le \sup_{\tau} \mathbb E\, d(\gamma_\tau, \widetilde\gamma_\tau)
\le \mathbb E \sup_{\tau}\, d(\gamma_\tau, \widetilde\gamma_\tau)
= \widetilde W_1(\Pi, \widetilde\Pi).
\]
This proves that displacement interpolations with endpoints lying in given compact sets themselves form a compact set, and concludes the proof of the coercivity of the action $(A^{s,t})$. ⊓⊔

Remark 7.28. In the proof of the implication (ii) $\Rightarrow$ (i), instead of defining $\Pi^{(k)}$ on the space of continuous curves, one could instead work with $\Pi^{(k)}$ defined on discrete times, construct by compactness a consistent system of marginals on $\mathcal X^{2^\ell+1}$, for all $\ell$, and then invoke

Kolmogorov's existence theorem to get a $\Pi$ which is defined on a set of curves. Things however are not that simple, since Kolmogorov's theorem constructs a random measurable curve which is not a priori continuous. Here one has the same conceptual catch as in the construction of Brownian motion as a probability measure on continuous paths.

Proof of Corollary 7.22. Introduce the family of actions
\[
A^{s,t}(\gamma) = \int_s^t |\dot\gamma_\tau|^p\, d\tau.
\]
Then
\[
c^{s,t}(x,y) = \frac{d(x,y)^p}{(t-s)^{p-1}},
\]
and all our assumptions hold true for this action and cost. (The assumption of local compactness is used to prove that the action is coercive, see the Appendix.) The important point now is that
\[
C^{s,t}(\mu,\nu) = \frac{W_p(\mu,\nu)^p}{(t-s)^{p-1}}.
\]
So, according to the remarks in Example 7.9, Property (ii) in Theorem 7.21 means that $(\mu_t)$ is in turn a minimizer of the action associated with the Lagrangian $|\dot\mu|^p$, i.e. a geodesic in $P_p(\mathcal X)$. Note that if $\gamma$ is a random optimal geodesic and $\mu_t$ is the law of $\gamma_t$ at time $t$, then
\[
W_p(\mu_s, \mu_t)
\le \bigl(\mathbb E\, d(\gamma_s,\gamma_t)^p\bigr)^{1/p}
= (t-s)\,\bigl(\mathbb E\, d(\gamma_0,\gamma_1)^p\bigr)^{1/p}
= (t-s)\, W_p(\mu_0,\mu_1),
\]
so the path $(\mu_t)$ is indeed continuous (and actually Lipschitz) for the distance $W_p$. ⊓⊔

Proof of Corollary 7.23. By Theorem 7.21, any displacement interpolation has to take the form $((e_t)_\#\Pi)$, where $\Pi$ is a probability measure on action-minimizing curves such that $\pi := (e_0, e_1)_\#\Pi$ is an optimal transference plan. By assumption, there is exactly one such $\pi$. Let $Z$ be the set of pairs $(x_0, x_1)$ such that there is more than one minimizing curve joining $x_0$ to $x_1$; by assumption $\pi[Z] = 0$. For $(x_0, x_1) \notin Z$ there is a unique geodesic $\gamma = S(x_0, x_1)$ joining $x_0$ to $x_1$. So $\Pi$ has to coincide with $S_\#\pi$. ⊓⊔

To conclude this section with a simple application, I shall use Corollary 7.22 to derive the Lipschitz continuity of moments, alluded to in Remark 6.10.

Proposition 7.29 ($W_p$-Lipschitz continuity of $p$-moments). Let $p \ge 1$, let $(\mathcal X, d)$ be a locally compact Polish length space, and let $\mu, \nu \in P_p(\mathcal X)$. Then for any $\varphi \in \mathrm{Lip}(\mathcal X; \mathbb R_+)$,
\[
\left| \left( \int \varphi(x)^p\, \mu(dx) \right)^{\!1/p} - \left( \int \varphi(y)^p\, \nu(dy) \right)^{\!1/p} \right|
\le \|\varphi\|_{\mathrm{Lip}}\; W_p(\mu,\nu).
\]

Proof of Proposition 7.29. Without loss of generality, $\|\varphi\|_{\mathrm{Lip}} = 1$. The case $p = 1$ is obvious from (6.3) and holds true for any Polish space, so I shall assume $p > 1$. Let $(\mu_t)_{0\le t\le 1}$ be a displacement interpolation between $\mu_0 = \mu$ and $\mu_1 = \nu$, for the cost function $c(x,y) = d(x,y)^p$. By Corollary 7.22, there is a probability measure $\Pi$ on $\Gamma$, the set of geodesics in $\mathcal X$, such that $\mu_t = (e_t)_\#\Pi$.

Then let $\Psi(t) = \int_{\mathcal X} \varphi(x)^p\, \mu_t(dx) = \int_\Gamma \varphi(\gamma_t)^p\, \Pi(d\gamma)$. By Fatou's lemma and Hölder's inequality,
\[
\begin{aligned}
\frac{d^+\Psi}{dt}
&\le \int_\Gamma \frac{d^+}{dt}\, \varphi(\gamma_t)^p\, \Pi(d\gamma)
\;\le\; p \int |\dot\gamma_t|\, \varphi(\gamma_t)^{p-1}\, \Pi(d\gamma)\\
&\le p \left( \int |\dot\gamma_t|^p\, \Pi(d\gamma) \right)^{\!1/p} \left( \int \varphi(\gamma_t)^p\, \Pi(d\gamma) \right)^{\!1-\frac1p}\\
&= p \left( \int d(\gamma_0,\gamma_1)^p\, \Pi(d\gamma) \right)^{\!1/p} \Psi(t)^{1-\frac1p}
\;=\; p\, W_p(\mu,\nu)\; \Psi(t)^{1-\frac1p}.
\end{aligned}
\]
So $d[\Psi(t)^{1/p}]/dt \le W_p(\mu,\nu)$, thus $|\Psi(1)^{1/p} - \Psi(0)^{1/p}| \le W_p(\mu,\nu)$, which is the desired result. ⊓⊔

Displacement interpolation between intermediate times and restriction

Again let $\mu_0$, $\mu_1$ be any two probability measures, $(\mu_t)_{0\le t\le 1}$ a displacement interpolation associated with a dynamical optimal transference plan $\Pi$, and $(\gamma_t)_{0\le t\le 1}$ a random action-minimizing curve with $\mathrm{law}\,(\gamma) = \Pi$. In particular, $(\gamma_0, \gamma_1)$ is an optimal coupling of $(\mu_0, \mu_1)$.
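Proposition 7.29 can be checked numerically. The sketch below is hypothetical (not from the book): it works in one dimension, where, for two empirical measures with $n$ equally weighted atoms, $W_p$ is realized by the monotone (sorted) matching, so both sides of the inequality are computable directly.

```python
import random

def w_p(xs, ys, p):
    """W_p between two n-point equally weighted empirical measures on R,
    computed via the monotone (sorted) coupling, which is optimal in 1-d."""
    n = len(xs)
    return (sum(abs(a - b) ** p
                for a, b in zip(sorted(xs), sorted(ys))) / n) ** (1 / p)

def p_moment(phi, xs, p):
    """(integral of phi^p d mu)^(1/p) for the empirical measure on xs."""
    return (sum(phi(x) ** p for x in xs) / len(xs)) ** (1 / p)

random.seed(0)
p = 2
phi = abs          # phi(x) = |x| is nonnegative and 1-Lipschitz
xs = [random.gauss(0.0, 1.0) for _ in range(200)]
ys = [random.gauss(1.0, 2.0) for _ in range(200)]

lhs = abs(p_moment(phi, xs, p) - p_moment(phi, ys, p))
rhs = w_p(xs, ys, p)     # ||phi||_Lip = 1 here
assert lhs <= rhs + 1e-12
```

The inequality holds for every sample, not just on average, since (as in the proof of Proposition 7.29, or directly by Minkowski's inequality along any coupling) the difference of $p$-moments is controlled by the $L^p$ norm of the displacement.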

With the help of the previous arguments, it is almost obvious that $(\gamma_{t_0}, \gamma_{t_1})$ is also an optimal coupling of $(\mu_{t_0}, \mu_{t_1})$. What may look at first sight more surprising is that if $(t_0, t_1) \ne (0,1)$, this is the only optimal coupling, at least if action-minimizing curves "cannot branch". Furthermore, there is a time-dependent version of Theorem 4.6.

Theorem 7.30 (Interpolation from intermediate times and restriction). Let $\mathcal X$ be a Polish space equipped with a coercive action $(A^{s,t})$ on $C([0,1]; \mathcal X)$. Let $\Pi \in P(C([0,1]; \mathcal X))$ be a dynamical optimal transference plan associated with a finite total cost. For any $t_0, t_1$ in $[0,1]$ with $0 \le t_0 < t_1 \le 1$, define the time-restriction of $\Pi$ to $[t_0, t_1]$ as $\Pi^{t_0,t_1} := (r^{t_0,t_1})_\#\Pi$, where $r^{t_0,t_1}(\gamma)$ is the restriction of $\gamma$ to the interval $[t_0, t_1]$. Then:

(i) $\Pi^{t_0,t_1}$ is a dynamical optimal coupling for the action $(A^{s,t})_{t_0\le s<t\le t_1}$.

(ii) If $\widetilde\Pi$ is a measure on $C([t_0,t_1]; \mathcal X)$, such that $\widetilde\Pi \le \Pi^{t_0,t_1}$ and $\widetilde\Pi[C([t_0,t_1]; \mathcal X)] > 0$, let
\[
\Pi' := \frac{\widetilde\Pi}{\widetilde\Pi\bigl[C([t_0,t_1]; \mathcal X)\bigr]}\,,
\qquad \mu'_t = (e_t)_\#\Pi'.
\]
Then $\Pi'$ is a dynamical optimal coupling between $\mu'_{t_0}$ and $\mu'_{t_1}$; and $(\mu'_t)_{t_0\le t\le t_1}$ is a displacement interpolation.

(iii) Further, assume that action-minimizing curves are uniquely and measurably determined by their restriction to a nontrivial time-interval, and $(t_0, t_1) \ne (0,1)$. Then $\Pi'$ in (ii) is the unique dynamical optimal coupling between $\mu'_{t_0}$ and $\mu'_{t_1}$. In particular, $\pi^{t_0,t_1} := (e_{t_0}, e_{t_1})_\#\Pi'$ is the unique optimal transference plan between $\mu'_{t_0}$ and $\mu'_{t_1}$; and $(\mu'_t)_{t_0\le t\le t_1} = ((e_t)_\#\Pi')_{t_0\le t\le t_1}$ is the unique displacement interpolation between $\mu'_{t_0}$ and $\mu'_{t_1}$.

(iv) Under the same assumptions as in (iii), for any $t \in (0,1)$, $(\Pi\otimes\Pi)(d\gamma\, d\widetilde\gamma)$-almost surely,
\[
[\gamma_t = \widetilde\gamma_t] \;\Longrightarrow\; [\gamma = \widetilde\gamma].
\]
In other words, the curves seen by $\Pi$ cannot cross at intermediate times. If the costs $c^{s,t}$ are continuous, this conclusion extends to all curves $\gamma, \widetilde\gamma \in \mathrm{Spt}\,\Pi$.

(v) Under the same assumptions as in (iii), there is a measurable map $F_t: \mathcal X \to \Gamma(\mathcal X)$ such that, $\Pi(d\gamma)$-almost surely, $F_t(\gamma_t) = \gamma$.
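Part (i) of Theorem 7.30 can be sketched numerically in one dimension with quadratic cost, where the optimal coupling is the monotone (sorted) one and displacement interpolation moves each matched pair along a straight line. The example below is hypothetical (not from the book); it checks that the coupling induced at intermediate times $(t_0, t_1)$ is again monotone, hence optimal.

```python
import random

# Hypothetical 1-d sketch: restrict the linear interpolation of a monotone
# (hence optimal, for cost |x - y|^2) coupling to intermediate times.

random.seed(1)
n = 100
x = sorted(random.gauss(0, 1) for _ in range(n))
y = sorted(random.gauss(3, 2) for _ in range(n))

def gamma(t):
    # displacement interpolation: each particle moves on a straight line
    return [(1 - t) * a + t * b for a, b in zip(x, y)]

t0, t1 = 0.3, 0.8
g0, g1 = gamma(t0), gamma(t1)

# cost of the restricted coupling, i.e. the pairs (g0[i], g1[i])
cost_restricted = sum((a - b) ** 2 for a, b in zip(g0, g1)) / n
# optimal cost between the intermediate marginals (monotone matching)
cost_optimal = sum((a - b) ** 2 for a, b in zip(sorted(g0), sorted(g1))) / n
assert abs(cost_restricted - cost_optimal) < 1e-12
```

The check succeeds because convex combinations of two sorted lists, matched index by index, remain sorted, so the restricted coupling is automatically the monotone one.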

Remark 7.31. In Chapter 8 we shall work in the setting of smooth Riemannian manifolds and derive a quantitative variant of the "no-crossing" property expressed in (iv).

Corollary 7.32 (Nonbranching is inherited by the Wasserstein space). Let $(\mathcal X, d)$ be a complete separable, locally compact length space and let $p \in (1, \infty)$. Assume that $\mathcal X$ is nonbranching, in the sense that a geodesic $\gamma: [0,1] \to \mathcal X$ is uniquely determined by its restriction to a nontrivial time-interval. Then also the Wasserstein space $P_p(\mathcal X)$ is nonbranching. Conversely, if $P_p(\mathcal X)$ is nonbranching, then $\mathcal X$ is nonbranching.

Proof of Theorem 7.30. Let $\gamma$ be a random geodesic with law $\Pi$. Let $\gamma^{0,t_0}$, $\gamma^{t_0,t_1}$ and $\gamma^{t_1,1}$ stand for the restrictions of $\gamma$ to the time intervals $[0,t_0]$, $[t_0,t_1]$ and $[t_1,1]$, respectively. Then
\[
\begin{aligned}
C^{0,t_0}(\mu_0, \mu_{t_0}) + C^{t_0,t_1}(\mu_{t_0}, \mu_{t_1}) + C^{t_1,1}(\mu_{t_1}, \mu_1)
&\le \mathbb E\, c^{0,t_0}(\gamma_0, \gamma_{t_0}) + \mathbb E\, c^{t_0,t_1}(\gamma_{t_0}, \gamma_{t_1}) + \mathbb E\, c^{t_1,1}(\gamma_{t_1}, \gamma_1)\\
&= \mathbb E\, c^{0,1}(\gamma_0, \gamma_1) = C^{0,1}(\mu_0, \mu_1)\\
&\le C^{0,t_0}(\mu_0, \mu_{t_0}) + C^{t_0,t_1}(\mu_{t_0}, \mu_{t_1}) + C^{t_1,1}(\mu_{t_1}, \mu_1).
\end{aligned}
\]
So there has to be equality in all the inequalities, and it follows that
\[
C^{t_0,t_1}(\mu_{t_0}, \mu_{t_1}) = \mathbb E\, c^{t_0,t_1}(\gamma_{t_0}, \gamma_{t_1}).
\]
So $(\gamma_{t_0}, \gamma_{t_1})$ is optimal, and $\Pi^{t_0,t_1}$ is a dynamical optimal transference plan. Statement (i) is proven.

As a corollary of (i), $\pi^{t_0,t_1} = (e_{t_0}, e_{t_1})_\#\Pi^{t_0,t_1}$ is an optimal transference plan between $\mu_{t_0}$ and $\mu_{t_1}$. Let $\widetilde\pi := (e_{t_0}, e_{t_1})_\#\widetilde\Pi$. The inequality $\widetilde\Pi \le \Pi^{t_0,t_1}$ is preserved by push-forward, so $\widetilde\pi \le \pi^{t_0,t_1}$. Also $\widetilde\pi[\mathcal X\times\mathcal X] = \widetilde\Pi[C([t_0,t_1]; \mathcal X)] > 0$. By Theorem 4.6, $\pi' := \widetilde\pi/\widetilde\pi[\mathcal X\times\mathcal X]$ is an optimal transference plan between its marginals. But $\pi'$ coincides with $(e_{t_0}, e_{t_1})_\#\Pi'$, and since $\Pi'$ is concentrated (just as $\Pi^{t_0,t_1}$) on action-minimizing curves, Theorem 7.21 guarantees that $\Pi'$ is a dynamical optimal transference plan between its marginals. This proves (ii). (The continuity of $\mu'_t$ in $t$ is shown by the same argument as in Step 4 of the proof of Theorem 7.21.)

To prove (iii), assume, without loss of generality, that $t_0 > 0$; then an action-minimizing curve $\gamma$ is uniquely and measurably determined

by its restriction $\gamma^{0,t_0}$ to $[0,t_0]$. In other words, there is a measurable function $F^{0,t_0}: \Gamma^{0,t_0} \to \Gamma$, defined on the set of all $\gamma^{0,t_0}$, such that any action-minimizing curve $\gamma: [0,1] \to \mathcal X$ can be written as $\gamma = F^{0,t_0}(\gamma^{0,t_0})$. Similarly, there is also a measurable function $F^{t_0,t_1}$ such that $\gamma = F^{t_0,t_1}(\gamma^{t_0,t_1})$.

By construction, $\widetilde\Pi$ is concentrated on the curves $\gamma^{t_0,t_1}$, which are the restrictions to $[t_0,t_1]$ of the action-minimizing curves $\gamma$. Then let $\widehat\Pi := (F^{t_0,t_1})_\#\widetilde\Pi$; this is a probability measure on $C([0,1]; \mathcal X)$. Of course $(F^{t_0,t_1})_\#\Pi^{t_0,t_1} = \Pi$; so by (ii), $\widehat\Pi/\widehat\Pi[C([0,1]; \mathcal X)]$ is optimal. (In words, $\widehat\Pi$ is obtained from $\Pi$ by extending to the time-interval $[0,1]$ those curves which appear in the sub-plan $\Pi'$.) Then it is easily seen that $\Pi' = (r^{t_0,t_1})_\#\widehat\Pi$ and $\widehat\Pi[C([0,1]; \mathcal X)] = \widetilde\Pi[C([t_0,t_1]; \mathcal X)]$. So it suffices to prove Theorem 7.30(iii) in the case when $\widetilde\Pi = \Pi^{t_0,t_1}$; this will be assumed in the sequel.

Now let $\Pi^{0,t_0} = \mathrm{law}\,(\gamma^{0,t_0})$, $\Pi^{t_0,t_1} = \mathrm{law}\,(\gamma^{t_0,t_1})$, $\Pi^{t_1,1} = \mathrm{law}\,(\gamma^{t_1,1})$. By (i), $\Pi^{t_0,t_1}$ is a dynamical optimal transference plan between $\mu_{t_0}$ and $\mu_{t_1}$; let $\widetilde\Pi^{t_0,t_1}$ be another such plan. The goal is to show that $\widetilde\Pi^{t_0,t_1} = \Pi^{t_0,t_1}$.

Disintegrate $\Pi^{0,t_0}$ and $\widetilde\Pi^{t_0,t_1}$ along their common marginal $\mu_{t_0}$ and glue them together. This gives a probability measure on $C([0,t_0); \mathcal X) \times \mathcal X \times C((t_0,t_1]; \mathcal X)$, supported on triples $(\gamma, g, \widetilde\gamma)$ such that $\gamma(t) \to g$ as $t \to t_0^-$, $\widetilde\gamma(t) \to g$ as $t \to t_0^+$. Such triples can be identified with continuous functions on $[0,t_1]$, so what we have is in fact a probability measure on $C([0,t_1]; \mathcal X)$. Repeat the operation by gluing this with $\Pi^{t_1,1}$, so as to get a probability measure $\widehat\Pi$ on $C([0,1]; \mathcal X)$.

Then let $\widehat\gamma$ be a random variable with law $\widehat\Pi$: By construction, $\mathrm{law}\,(\widehat\gamma^{\,0,t_0}) = \Pi^{0,t_0}$, $\mathrm{law}\,(\widehat\gamma^{\,t_0,t_1}) = \widetilde\Pi^{t_0,t_1}$, and $\mathrm{law}\,(\widehat\gamma^{\,t_1,1}) = \Pi^{t_1,1}$, so
\[
\begin{aligned}
\mathbb E\, c^{0,1}(\widehat\gamma_0, \widehat\gamma_1)
&\le \mathbb E\, c^{0,t_0}(\widehat\gamma_0, \widehat\gamma_{t_0}) + \mathbb E\, c^{t_0,t_1}(\widehat\gamma_{t_0}, \widehat\gamma_{t_1}) + \mathbb E\, c^{t_1,1}(\widehat\gamma_{t_1}, \widehat\gamma_1)\\
&= C^{0,t_0}(\mu_0, \mu_{t_0}) + C^{t_0,t_1}(\mu_{t_0}, \mu_{t_1}) + C^{t_1,1}(\mu_{t_1}, \mu_1)\\
&= C^{0,1}(\mu_0, \mu_1).
\end{aligned}
\]
Thus $\widehat\Pi$ is a dynamical optimal transference plan between $\mu_0$ and $\mu_1$. It follows from Theorem 7.21 that there is a random action-minimizing curve $\widehat\gamma$ with $\mathrm{law}\,(\widehat\gamma) = \widehat\Pi$. In particular,
\[
\mathrm{law}\,(\widehat\gamma^{\,0,t_0}) = \mathrm{law}\,(\gamma^{0,t_0}); \qquad
\mathrm{law}\,(\widehat\gamma^{\,t_0,t_1}) = \mathrm{law}\,(\widetilde\gamma^{\,t_0,t_1}).
\]
By assumption there is a measurable function $F = r^{t_0,t_1} \circ F^{0,t_0}$ such that $F(g^{0,t_0}) = g^{t_0,t_1}$ for any action-minimizing curve $g$. So

\[
\mathrm{law}\,(\widetilde\gamma^{\,t_0,t_1})
= \mathrm{law}\,(\widehat\gamma^{\,t_0,t_1})
= \mathrm{law}\,\bigl(F(\widehat\gamma^{\,0,t_0})\bigr)
= \mathrm{law}\,\bigl(F(\gamma^{0,t_0})\bigr)
= \mathrm{law}\,(\gamma^{t_0,t_1}).
\]
This proves the uniqueness of the dynamical optimal transference plan joining $\mu'_{t_0}$ to $\mu'_{t_1}$. The remaining part of (iii) is obvious since any optimal plan or displacement interpolation has to come from a dynamical optimal transference plan, according to Theorem 7.21.

Now let us turn to the proof of (iv). Since the plan $\pi = (e_0, e_1)_\#\Pi$ is $c^{0,1}$-cyclically monotone (Theorem 5.10(ii)), we have, $\Pi\otimes\Pi(d\gamma\, d\widetilde\gamma)$-almost surely,
\[
c^{0,1}(\gamma_0, \gamma_1) + c^{0,1}(\widetilde\gamma_0, \widetilde\gamma_1)
\le c^{0,1}(\gamma_0, \widetilde\gamma_1) + c^{0,1}(\widetilde\gamma_0, \gamma_1),
\tag{7.21}
\]
and all these quantities are finite (almost surely). If $\gamma$, $\widetilde\gamma$ are two such paths, assume that $\gamma_t = \widetilde\gamma_t = X$ for some $t \in (0,1)$. Then
\[
c^{0,1}(\gamma_0, \widetilde\gamma_1) \le c^{0,t}(\gamma_0, X) + c^{t,1}(X, \widetilde\gamma_1),
\tag{7.22}
\]
and similarly
\[
c^{0,1}(\widetilde\gamma_0, \gamma_1) \le c^{0,t}(\widetilde\gamma_0, X) + c^{t,1}(X, \gamma_1).
\tag{7.23}
\]
By adding up (7.22) and (7.23), we get
\[
\begin{aligned}
c^{0,1}(\gamma_0, \widetilde\gamma_1) + c^{0,1}(\widetilde\gamma_0, \gamma_1)
&\le \bigl[c^{0,t}(\gamma_0, X) + c^{t,1}(X, \gamma_1)\bigr]
+ \bigl[c^{0,t}(\widetilde\gamma_0, X) + c^{t,1}(X, \widetilde\gamma_1)\bigr]\\
&= c^{0,1}(\gamma_0, \gamma_1) + c^{0,1}(\widetilde\gamma_0, \widetilde\gamma_1).
\end{aligned}
\]
Since the reverse inequality holds true by (7.21), equality has to hold in all intermediate inequalities, for instance in (7.22). Then it is easy to see that the path $\widehat\gamma$ defined by $\widehat\gamma(s) = \gamma(s)$ for $0 \le s \le t$, and $\widehat\gamma(s) = \widetilde\gamma(s)$ for $s \ge t$, is a minimizing curve. Since it coincides with $\gamma$ on a nontrivial time-interval, it has to coincide with $\gamma$ everywhere, and similarly it has to coincide with $\widetilde\gamma$ everywhere. So $\gamma = \widetilde\gamma$.

If the costs $c^{s,t}$ are continuous, the previous conclusion holds true not only $\Pi\otimes\Pi$-almost surely, but actually for any two minimizing curves $\gamma$, $\widetilde\gamma$ lying in the support of $\Pi$. Indeed, inequality (7.21) defines a closed set $\mathcal C$ in $\Gamma\times\Gamma$, where $\Gamma$ stands for the set of minimizing curves; so $\mathrm{Spt}\,\Pi \times \mathrm{Spt}\,\Pi = \mathrm{Spt}(\Pi\otimes\Pi) \subset \mathcal C$.

It remains to prove (v). Let $\mathcal C$ be a $c^{0,1}$-cyclically monotone subset of $\mathcal X\times\mathcal X$ on which $\pi$ is concentrated, and let $\Gamma$ be the set of minimizing curves $\gamma: [0,1] \to \mathcal X$ such that $(\gamma_0, \gamma_1) \in \mathcal C$. Let $(K_\ell)_{\ell\in\mathbb N}$ be

a nondecreasing sequence of compact sets contained in $\Gamma$, such that $\Pi[\cup K_\ell] = \Pi[\Gamma] = 1$. For each $\ell$, we define $F_\ell$ on $e_t(K_\ell)$ by $F_\ell(\gamma_t) = \gamma$. This map is continuous: Indeed, if $(x_k)$ in $e_t(K_\ell)$ converges to $x$, then for each $k$ we have $x_k = (\gamma_k)_t$ for some $\gamma_k \in K_\ell$, and up to extraction $\gamma_k$ converges to some $\gamma \in K_\ell$; but then $(\gamma_k)_t$ converges to $\gamma_t$, and in particular $F_\ell(\gamma_t) = \gamma$. Then we can define $F_t$ on $\cup_\ell\, e_t(K_\ell)$ as a map which coincides with $F_\ell$ on each $e_t(K_\ell)$. (Obviously, this is the same line of reasoning as in Theorem 5.30.) ⊓⊔

Proof of Corollary 7.32. Assume that $\mathcal X$ is nonbranching. Then there exists some function $F$, defined on the set of all curves $\gamma^{0,t_0}$, where $\gamma^{0,t_0}$ is the restriction to $[0,t_0]$ of the geodesic $\gamma: [0,1] \to \mathcal X$, such that $\gamma = F(\gamma^{0,t_0})$.

I claim that $F$ is automatically continuous. Indeed, let $(\gamma_n)_{n\in\mathbb N}$ be such that $\gamma_n^{0,t_0}$ converges uniformly on $[0,t_0]$ to some $g: [0,t_0] \to \mathcal X$. Since the functions $\gamma_n^{0,t_0}$ are uniformly bounded, the speeds of all the geodesics $\gamma_n$ are uniformly bounded too, and the images $\gamma_n([0,1])$ are all included in a compact subset of $\mathcal X$. It follows from Ascoli's theorem that the sequence $(\gamma_n)$ converges uniformly, up to extraction of a subsequence. But then its limit has to be a geodesic, and its restriction to $[0,t_0]$ coincides with $g$. There is at most one such geodesic, so $\gamma$ is uniquely determined, and the whole sequence $\gamma_n$ converges. This implies the continuity of $F$.

Then Theorem 7.30(iii) applies: If $(\mu_t)_{0\le t\le 1}$ is a geodesic in $P_p(\mathcal X)$, there is only one geodesic between $\mu_0$ and $\mu_{t_0}$. So $(\mu_t)_{0\le t\le 1}$ is uniquely determined by its restriction to $[0,t_0]$. The same reasoning could be done for any nontrivial time-interval instead of $[0,t_0]$; so $P_p(\mathcal X)$ is indeed nonbranching.

The converse implication is obvious, since any geodesic $\gamma$ in $\mathcal X$ induces a geodesic in $P_p(\mathcal X)$, namely $(\delta_{\gamma_t})_{0\le t\le 1}$. ⊓⊔

Interpolation of prices

When the path $\mu_t$ varies in time, what becomes of the pair of "prices" $(\psi, \phi)$ in the Kantorovich duality? The short answer is that these functions will also evolve continuously in time, according to Hamilton–Jacobi equations.

Definition 7.33 (Hamilton–Jacobi–Hopf–Lax–Oleinik evolution semigroup). Let $\mathcal X$ be a metric space and $(A^{s,t})$ a coercive Lagrangian action on $\mathcal X$, with cost functions $(c^{s,t})$. For any two times $0 \le s < t \le 1$ and any two functions $\psi, \phi$ on $\mathcal X$, define
\[
(H_+^{s,t}\psi)(y) := \inf_{x\in\mathcal X}\, \bigl[\psi(x) + c^{s,t}(x,y)\bigr],
\qquad
(H_-^{t,s}\phi)(x) := \sup_{y\in\mathcal X}\, \bigl[\phi(y) - c^{s,t}(x,y)\bigr].
\]
The family of operators $(H_+^{s,t})$ (resp. $(H_-^{t,s})$) is called the forward (resp. backward) Hamilton–Jacobi (or Hopf–Lax, or Lax–Oleinik) semigroup.

Roughly speaking, $H_+^{s,t}$ gives the values of $\psi$ at time $t$, from its values at time $s$; while $H_-$ does the reverse. So the semigroups $H_+$ and $H_-$ are in some sense inverses of each other. Yet it is not true in general that $H_-^{t,s} H_+^{s,t} = \mathrm{Id}$. Proposition 7.34 below summarizes some of the main properties of these semigroups; the denomination of "semigroup" itself is justified by Property (ii).

Proposition 7.34 (Elementary properties of Hamilton–Jacobi semigroups). With the notation of Definition 7.33,

(i) $H_+^{s,t}$ and $H_-^{t,s}$ are order-preserving: $\psi \le \widetilde\psi \Longrightarrow H_\pm\psi \le H_\pm\widetilde\psi$.

(ii) Whenever $t_1 < t_2 < t_3$ are three intermediate times in $[0,1]$,
\[
H_+^{t_1,t_3} = H_+^{t_2,t_3}\, H_+^{t_1,t_2};
\qquad
H_-^{t_3,t_1} = H_-^{t_2,t_1}\, H_-^{t_3,t_2}.
\]

(iii) Whenever $s < t$ are two times in $[0,1]$,
\[
H_-^{t,s}\, H_+^{s,t} \le \mathrm{Id};
\qquad
H_+^{s,t}\, H_-^{t,s} \ge \mathrm{Id}.
\]

Proof of Proposition 7.34. Properties (i) and (ii) are immediate consequences of the definitions and Proposition 7.16(iii). To check Property (iii), e.g. the first half of it, write
\[
\bigl(H_-^{t,s}\, H_+^{s,t}\psi\bigr)(x)
= \sup_y \Bigl[ \inf_{x'} \bigl( \psi(x') + c^{s,t}(x',y) \bigr) - c^{s,t}(x,y) \Bigr].
\]
The choice $x' = x$ shows that the infimum above is bounded above by $\psi(x) + c^{s,t}(x,y)$, independently of $y$; so $(H_-^{t,s}\, H_+^{s,t}\psi)(x) \le \psi(x)$, as desired. ⊓⊔
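These semigroups can be sketched numerically. The example below is hypothetical (not from the book): it uses the quadratic cost $c^{s,t}(x,y) = |x-y|^2/(2(t-s))$ on a one-dimensional grid, and checks Proposition 7.34(ii) and (iii).

```python
# Discrete Hopf-Lax sketch: quadratic cost on a hypothetical 1-d grid.
grid = [i / 50.0 for i in range(-100, 101)]   # grid on [-2, 2], step 0.02

def c(x, y, dt):
    return (x - y) ** 2 / (2 * dt)

def H_plus(psi, dt):
    # forward semigroup: (H+ psi)(y) = inf_x [ psi(x) + c(x, y) ]
    return [min(v + c(x, y, dt) for x, v in zip(grid, psi)) for y in grid]

def H_minus(phi, dt):
    # backward semigroup: (H- phi)(x) = sup_y [ phi(y) - c(x, y) ]
    return [max(v - c(x, y, dt) for y, v in zip(grid, phi)) for x in grid]

psi = [abs(x) for x in grid]            # initial "price", 1-Lipschitz

# Proposition 7.34(iii): H- H+ <= Id, pointwise
back_forth = H_minus(H_plus(psi, 0.5), 0.5)
assert all(b <= v + 1e-9 for b, v in zip(back_forth, psi))

# Proposition 7.34(ii), semigroup property (up to grid discretization):
# evolving over [0, 1] in one step matches two steps of length 1/2.
one_step = H_plus(psi, 1.0)
two_step = H_plus(H_plus(psi, 0.5), 0.5)
assert all(abs(a - b) < 1e-2 for a, b in zip(one_step, two_step))
```

The inequality in (iii) holds exactly on the grid (the argument with $x' = x$ goes through verbatim), while the semigroup identity in (ii) holds only up to a small discretization error, since the infimum is restricted to grid points.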

The Hamilton–Jacobi semigroup is well-known and useful in geometry and dynamical systems theory. On a smooth Riemannian manifold, when the action is given by a Lagrangian $L(x,v,t)$, strictly convex and superlinear in the velocity variable, then $S_+(\,\cdot\,, t) := H_+^{0,t}\psi_0$ solves the differential equation
\[
\frac{\partial S_+}{\partial t}(x,t) + H\bigl(x, \nabla S_+(x,t), t\bigr) = 0,
\tag{7.24}
\]
where $H = L^*$ is obtained from $L$ by Legendre transform in the variable $v$, and is called the Hamiltonian of the system. This equation provides a bridge between a Lagrangian description of action-minimizing curves, and an Eulerian description: From $S_+(x,t)$ one can reconstruct a velocity field $v(x,t) = \nabla_p H\bigl(x, \nabla S_+(x,t), t\bigr)$, in such a way that integral curves of the equation $\dot x = v(x,t)$ are minimizing curves. Well, rigorously speaking, that would be the case if $S_+$ were differentiable! But things are not so simple because $S_+$ is not in general differentiable everywhere, so the equation has to be interpreted in a suitable sense (called viscosity sense). It is important to note that if one uses the backward semigroup and defines $S_-(\,\cdot\,, t) := H_-^{1,t}\psi_1$, then $S_-$ formally satisfies the same equation as $S_+$, but the equation has to be interpreted with a different convention (backward viscosity). This will be illustrated by the next example.

Example 7.35. On a Riemannian manifold $M$, consider the simple Lagrangian $L(x,v,t) = |v|^2/2$; then the associated Hamiltonian is just $H(x,p,t) = |p|^2/2$. If $S$ is a $C^1$ solution of $\partial S/\partial t + |\nabla S|^2/2 = 0$, then the gradient of $S$ can be interpreted as the velocity field of a family of minimizing geodesics. But if $S_0$ is a given Lipschitz function and $S_+(t,x)$ is defined by the forward Hamilton–Jacobi semigroup starting from initial datum $S_0$, one only has (for all $t, x$)
\[
\frac{\partial S_+}{\partial t} + \frac{|\nabla^- S_+|^2}{2} = 0,
\]
where
\[
|\nabla^- f|(x) := \limsup_{y\to x} \frac{\bigl[f(y) - f(x)\bigr]_-}{d(x,y)}, \qquad z_- = \max(-z, 0).
\]
Conversely, if one uses the backward Hamilton–Jacobi semigroup to define a function $S_-(x,t)$, then

\[
\frac{\partial S_-}{\partial t} + \frac{|\nabla^+ S_-|^2}{2} = 0,
\qquad
|\nabla^+ f|(x) := \limsup_{y\to x} \frac{\bigl[f(y) - f(x)\bigr]_+}{d(x,y)},
\]
where now $z_+ = \max(z, 0)$. When the Lagrangian is more complicated, things may become much more intricate. The standard convention is to use the forward Hamilton–Jacobi semigroup by default.

We shall now see that the Hamilton–Jacobi semigroup provides a simple answer to the problem of interpolation in dual variables. In the next statement, $\mathcal X$ is again a Polish space, $(A^{s,t})$ a coercive Lagrangian action on $\mathcal X$, with associated cost functions $c^{s,t}$; and $C^{s,t}$ stands for the optimal total cost in the transport problem with cost $c^{s,t}$.

Theorem 7.36 (Interpolation of prices). With the same assumptions and notation as in Definition 7.33, let $\mu_0$, $\mu_1$ be two probability measures on $\mathcal X$, such that $C^{0,1}(\mu_0, \mu_1) < +\infty$, and let $(\psi_0, \phi_1)$ be a pair of $c^{0,1}$-conjugate functions such that any optimal plan $\pi$ between $\mu_0$ and $\mu_1$ has its support included in $\partial_{c^{0,1}}\psi_0$. (Recall Theorem 5.10; under adequate integrability conditions, the pair $(\psi_0, \phi_1)$ is just a solution of the dual Kantorovich problem.) Further, let $(\mu_t)_{0\le t\le 1}$ be a displacement interpolation between $\mu_0$ and $\mu_1$. Whenever $s < t$ are two intermediate times in $[0,1]$, define
\[
\psi_s := H_+^{0,s}\psi_0, \qquad \phi_t := H_-^{1,t}\phi_1.
\]
Then $(\psi_s, \phi_t)$ is optimal in the dual Kantorovich problem associated to $(\mu_s, \mu_t)$ and $c^{s,t}$. In particular,
\[
C^{s,t}(\mu_s, \mu_t) = \int \phi_t\, d\mu_t - \int \psi_s\, d\mu_s,
\]
and
\[
\phi_t(y) - \psi_s(x) \le c^{s,t}(x,y),
\]
with equality $\pi_{s,t}(dx\,dy)$-almost surely, where $\pi_{s,t}$ is any optimal transference plan between $\mu_s$ and $\mu_t$.

Proof of Theorem 7.36. From the definitions,
\[
\phi_t(y) - \psi_s(x) - c^{s,t}(x,y)
= \sup_{x',y'} \Bigl[ \phi_1(y') - c^{t,1}(y,y') - \psi_0(x') - c^{0,s}(x',x) - c^{s,t}(x,y) \Bigr].
\]

Since $c^{0,s}(x',x) + c^{s,t}(x,y) + c^{t,1}(y,y') \ge c^{0,1}(x',y')$, it follows that
\[
\phi_t(y) - \psi_s(x) - c^{s,t}(x,y)
\le \sup_{x',y'} \Bigl[ \phi_1(y') - \psi_0(x') - c^{0,1}(x',y') \Bigr] \le 0.
\]
So $\phi_t(y) - \psi_s(x) \le c^{s,t}(x,y)$. This inequality does not depend on the fact that $(\psi_0, \phi_1)$ is a tight pair of prices, in the sense of (5.5), but only on the inequality $\phi_1 - \psi_0 \le c^{0,1}$.

Next, introduce a random action-minimizing curve $\gamma$ such that $\mu_t = \mathrm{law}\,(\gamma_t)$. Since $(\psi_0, \phi_1)$ is an optimal pair, we know from Theorem 5.10(ii) that, almost surely,
\[
\phi_1(\gamma_1) - \psi_0(\gamma_0) = c^{0,1}(\gamma_0, \gamma_1).
\]
From the identity $c^{0,1}(\gamma_0, \gamma_1) = c^{0,s}(\gamma_0, \gamma_s) + c^{s,t}(\gamma_s, \gamma_t) + c^{t,1}(\gamma_t, \gamma_1)$ and the definition of $\psi_s$ and $\phi_t$,
\[
\bigl[\phi_1(\gamma_1) - c^{t,1}(\gamma_t, \gamma_1)\bigr]
- \bigl[\psi_0(\gamma_0) + c^{0,s}(\gamma_0, \gamma_s)\bigr]
= c^{s,t}(\gamma_s, \gamma_t)
\le \phi_t(\gamma_t) - \psi_s(\gamma_s).
\]
This shows that actually $\phi_t(\gamma_t) - \psi_s(\gamma_s) = c^{s,t}(\gamma_s, \gamma_t)$ almost surely, so $(\psi_s, \phi_t)$ has to be optimal in the dual Kantorovich problem between $\mu_s = \mathrm{law}\,(\gamma_s)$ and $\mu_t = \mathrm{law}\,(\gamma_t)$. ⊓⊔

Remark 7.37. In the limit case $s \to t$, the above results become
\[
\begin{cases}
\phi_t \le \psi_t & \text{everywhere in } \mathcal X\\
\phi_t = \psi_t & \mu_t\text{-almost surely}
\end{cases}
\]
... but it is not true in general that $\phi_t = \psi_t$ everywhere.

Remark 7.38. However, the identity $\psi_1 = \phi_1$ holds true everywhere as a consequence of the definitions.

Exercise 7.39. After reading the rest of Part I, the reader can come back to this exercise and check his or her understanding by proving that, for a quadratic Lagrangian:

(i) The displacement interpolation between two balls in Euclidean space is always a ball, whose radius increases linearly in time (here I am identifying a set with the uniform probability measure on this set).

(ii) More generally, the displacement interpolation between two ellipsoids is always an ellipsoid.

(iii) But the displacement interpolation between two sets is in general not a set.

(iv) The displacement interpolation between two spherical caps on the sphere is in general not a spherical cap.

(v) The displacement interpolation between two antipodal spherical caps on the sphere is unique, while the displacement interpolation between two antipodal points can be realized in an infinite number of ways.

Appendix: Paths in metric structures

This Appendix is a kind of crash basic course in Riemannian geometry, and nonsmooth generalizations thereof. Much more detail can be obtained from the references cited in the bibliographical notes.

A (finite-dimensional, smooth) Riemannian manifold is a manifold $M$ equipped with a Riemannian metric $g$: this means that $g$ defines a scalar product on each tangent space $T_x M$, varying smoothly with $x$. So if $v$ and $w$ are tangent vectors at $x$, the notation $v\cdot w$ really means $g_x(v,w)$, where $g_x$ is the metric at $x$. The degree of smoothness of manifolds depends on the context, but it is customary to consider $C^3$ manifolds with a $C^2$ metric. For the purpose of this course, the reader might assume $C^\infty$ smoothness.

Let $\gamma: [0,1] \to M$ be a smooth path¹, denoted by $(\gamma_t)_{0\le t\le 1}$. For each $t \in (0,1)$, the time-derivative at time $t$ is — by the very definition of tangent space — the tangent vector $v = \dot\gamma_t$ in $T_{\gamma_t} M$. The scalar product gives a way to measure the norm of that vector: $|v| = \sqrt{v\cdot v}$. Then one can define the length of $\gamma$ by the formula
\[
L(\gamma) = \int_0^1 |\dot\gamma_t|\, dt,
\tag{7.25}
\]
and the distance, or geodesic distance, between two points $x$ and $y$ by the formula
\[
d(x,y) = \inf\, \bigl\{ L(\gamma);\; \gamma_0 = x,\ \gamma_1 = y \bigr\}.
\tag{7.26}
\]

¹ For me the words "path" and "curve" are synonymous.
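The length formula (7.25) can be approximated by discretization, summing chord lengths $d(\gamma_{t_i}, \gamma_{t_{i+1}})$ over a fine subdivision of $[0,1]$. Here is a minimal sketch in the Euclidean plane; the quarter-circle example is hypothetical, not from the book.

```python
import math

def curve(t):
    # quarter circle of radius 1, from (1, 0) to (0, 1)
    a = t * math.pi / 2
    return (math.cos(a), math.sin(a))

def length(curve, n=10000):
    """Approximate L(gamma) by summing chord lengths over n subintervals."""
    pts = [curve(i / n) for i in range(n + 1)]
    return sum(math.dist(p, q) for p, q in zip(pts, pts[1:]))

L = length(curve)
assert abs(L - math.pi / 2) < 1e-6   # true length of the arc is pi/2
```

For smooth curves the chord sum converges to the integral (7.25) as the subdivision is refined, which is also how the length formula extends to absolutely continuous curves.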

167 Appendix: Paths in metric structures 161 A fter that it is easy to extend the length formula to absolutely con- tinuous curves. Note that any one of the three objects (metric, length, distance) determines the other two; indeed, the metric can be recovered from the distance via the formula ) ,γ γ d ( 0 t 7.27) γ = lim | , ( ̇ | 0 t 0 ↓ t and the usual polarization identity ] [ 1 v,w ) = g ,v + w ) − g ( v − w,v − w ) . ( w ( v g + 4 stand for the collection of all T Let M , x TM M , equipped with ∈ x a manifold structure which is locally product. A point in TM is a pair x,v ) with v ∈ T ( M . The map π : ( x,v ) 7−→ x is the projection of TM x onto M → TM is M ; it is obviously smooth and surjective. A function called a vector field: It is given by a tangent vector at each point. So a vector field really is a map : x → ( x,v ), but by abuse of notation one f f ( x ) = v . If γ : [0 , 1] → M is an injective path, one defines often writes = vector field along as a path ξ : [0 , 1] → TM such that π ◦ ξ γ γ . a p = p ( x ) is a linear form varying smoothly on T If M , then it can x be identified, thanks to the metric g , via the formula , to a vector field ξ x x v = ξ ( · ) · v, ( p ) ∈ T where v , and the dot in the left-hand side just means “ p ( x ) M x applied to v ”, while the dot in the right-hand side stands for the scalar product defined by g . As a particular case, if p is the differential of a f , the corresponding vector field ξ is the gradient of f , denoted function ∇ by ∇ f f . or x f If f ( x,v ) is a function on TM , then one can differentiate it with = f or with respect to v . Since T respect to T d M ≃ T , both M x x x x ) ( x,v and d f can be seen as linear forms on T ; this allows us to define M x v , the “gradient with respect to the position variable” and f and ∇ ∇ f v x the “gradient with respect to the velocity variable”. 
Differentiating functions does not pose any particular conceptual problem, but differentiating vector fields is quite a different story. If ξ is a vector field on M, then ξ(x) and ξ(y) live in different vector spaces, so it does not a priori make sense to compare them, unless one identifies in some sense T_x M and T_y M. (Of course, one could say that ξ is a map M → TM and define its differential as a map TM → T(TM), but this

is of little use, because T(TM) is "too large"; it is much better if we can come up with a reasonable notion of derivation which produces a map TM → TM.)

There is in general no canonical way to identify T_x M and T_y M if x ≠ y; but there is a canonical way to identify T_{γ_t} M and T_{γ_{t₀}} M as γ_t varies continuously. This operation is called parallel transport, or Levi-Civita transport. A vector field which is transported in a parallel way along the curve γ will "look constant" to an observer who lives in M and travels along γ. If M is a surface embedded in R³, parallel transport can be described as follows: Start from the tangent plane at γ(0), and then press your plane onto M along γ, in such a way that there is no friction (no slip, no rotation) between the plane and the surface.

With this notion it becomes possible to compute the derivative of a vector field along a path: If γ is a path and ξ is a vector field along γ, say t ↦ ξ(γ_t), then the derivative of ξ along γ is another vector field along γ, defined by

    ξ̇(t₀) = (d/dt)|_{t=t₀} θ_{t→t₀}( ξ(γ_t) ),

where θ_{t→t₀} is the parallel transport from T_{γ_t} M to T_{γ_{t₀}} M along γ. This makes sense because θ_{t→t₀}(ξ(γ_t)) is an element of the fixed vector space T_{γ_{t₀}} M. The path t ↦ ξ̇(t) is a vector field along γ, called the covariant derivative of ξ along γ, and denoted by ∇_{γ̇} ξ, or Dξ/Dt (or simply dξ/dt, if there is no possible confusion about the choice of γ). If M = Rⁿ, then ∇_{γ̇} ξ coincides with (γ̇ · ∇)ξ.

It turns out that the value of ξ̇(t₀) only depends on γ_{t₀}, on the values of ξ in a neighborhood of γ_{t₀}, and on the velocity γ̇_{t₀} (not on the whole path γ). Thus if ξ is a vector field defined in the neighborhood of a point x, and v is a tangent vector at x, it makes sense to define ∇_v ξ by the formula
    ∇_v ξ(x) = (Dξ/Dt)(0),    γ₀ = x,  γ̇₀ = v.

The quantity ∇_v ξ(x) is "the covariant derivative of the vector field ξ at x in direction v". Of course, if ξ and v are two vector fields, one can define a vector field ∇_v ξ by the formula (∇_v ξ)(x) = (∇_{v(x)} ξ)(x). The linear operator v(x) ↦ (∇_v ξ)(x) is the covariant gradient of ξ at x, denoted by ∇ξ(x) or ∇_x ξ; it is a linear operation T_x M → T_x M.

It is worth noticing explicitly that the notion of covariant derivation coincides with the convective derivation used in fluid mechanics (for instance in Euler's equation for an incompressible fluid). I shall sometimes adopt the notation classically used in fluid mechanics: ∇_v ξ = v · ∇ξ. (On the contrary, the notation (∇ξ) · v should rather be reserved for (∇ξ)* v, where (∇ξ)* is the adjoint of ∇ξ; then ⟨v · ∇ξ, w⟩ = ⟨v, ∇ξ · w⟩ and we are back to the classical conventions of Rⁿ.)

The procedure of parallel transport allows one to define the covariant derivation; conversely, the equations of parallel transport along γ can be written as Dξ/Dt = 0, where D/Dt is the covariant derivative along γ. So it is equivalent to define the notion of covariant derivation, or to define the rules of parallel transport.

There are (at least) three points of view about the covariant derivation. The first one is the extrinsic point of view: Let us think of M as an embedded surface in R^N; that is, M is a subset of R^N, it is equipped with the topology induced by R^N, and the quadratic form g_x is just the usual Euclidean scalar product on R^N, restricted to T_x M. Then the covariant derivative is defined by

    ξ̇(t) = Π_{T_{γ_t} M} ( (d/dt) ξ(γ(t)) ),

where Π_{T_x M} stands for the orthogonal projection (in R^N) onto T_x M. In short, the covariant derivative is the projection of the usual derivative onto the tangent space.

While this definition is very simple, it does not reveal the fact that the covariant derivation and parallel transport are intrinsic notions, which are invariant under isometry and do not depend on the embedding of M into R^N, but just on g. An intrinsic way to define covariant derivation is as follows: It is uniquely characterized by the two natural rules

    (d/dt) ⟨ξ, ζ⟩ = ⟨ξ̇, ζ⟩ + ⟨ξ, ζ̇⟩;    (D/Dt)(f ξ) = ḟ ξ + f ξ̇;    (7.28)

where the dependence of all the expressions on t is implicit; and by the not so natural rule
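The extrinsic recipe (project the ambient derivative onto the tangent plane) translates directly into a numerical scheme for parallel transport: move along the curve in small steps and re-project the vector onto each new tangent plane. In the sketch below (Python with numpy; the curve, the step count and the tolerance are illustrative editorial choices), a vector is transported along a quarter of a great circle of S² ⊂ R³; since great circles are geodesics, the initial velocity vector should be carried onto the velocity at the endpoint.

```python
import numpy as np

N = 2000
ts = np.linspace(0.0, np.pi / 2, N + 1)
gamma = np.stack([np.cos(ts), np.sin(ts), np.zeros_like(ts)], axis=1)

xi = np.array([0.0, 1.0, 0.0])    # initial tangent vector at gamma(0) = (1, 0, 0)
for k in range(1, N + 1):
    n = gamma[k]                  # outward unit normal of S^2 at the new point
    xi = xi - np.dot(xi, n) * n   # project onto the new tangent plane (discrete Dxi/Dt = 0)

print(xi)  # close to (-1, 0, 0), the velocity of the great circle at the endpoint
```

Each projection shortens the vector by O(dt²), so the norm defect after N steps is O(dt); refining the discretization recovers the norm-preserving Levi-Civita transport.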
    ∇_ξ ζ − ∇_ζ ξ = [ξ, ζ].

Here [ξ, ζ] is the Lie bracket of ξ and ζ, which is defined as the unique vector field such that for any function F,

    dF · [ξ, ζ] = d(dF · ξ) · ζ − d(dF · ζ) · ξ.

Further, note that in the second formula of (7.28) the symbol ḟ stands for the usual derivative of t ↦ f(t); while the symbols ξ̇ and ζ̇ stand for the covariant derivatives of the vector fields ξ and ζ along γ.

A third approach to covariant derivation is based on coordinates. Let x ∈ M, then there is a neighborhood O of x which is diffeomorphic to some open subset U ⊂ Rⁿ. Let Φ be a diffeomorphism U → O, and let (e₁, …, eₙ) be the usual basis of Rⁿ. A point m in O is said to have coordinates (y¹, …, yⁿ) if m = Φ(y¹, …, yⁿ); and a vector v ∈ T_m M is said to have components (v¹, …, vⁿ) if v = Σ_k vᵏ dΦ(e_k). Then the coefficients of the metric g are the functions g_{ij} defined by

    g(v, v) = Σ_{ij} g_{ij} vⁱ vʲ.

The coordinate point of view reduces everything to "explicit" computations and formulas in Rⁿ; for instance the derivation of a function f along the i-th direction is defined as ∂_i f := (∂/∂yⁱ)(f ∘ Φ). This is conceptually simple, but rapidly leads to cumbersome expressions. A central role in these formulas is played by the Christoffel symbols, which are defined by

    Γᵐ_{ij} := ½ Σ_{k=1}^n g^{km} ( ∂_i g_{jk} + ∂_j g_{ki} − ∂_k g_{ij} ),

where (g^{ij}) is by convention the inverse of (g_{ij}). Then the covariant derivation along γ is given by the formula

    (Dξ/Dt)ᵏ = dξᵏ/dt + Σ_{ij} Γᵏ_{ij} γ̇ⁱ ξʲ.

Be it in the extrinsic or the intrinsic or the coordinate point of view, the notion of covariant derivative is one of the cornerstones on which differential Riemannian geometry has been constructed.

Another important concept is that of Riemannian volume, which I shall denote by vol. It can be defined intrinsically as the n-dimensional Hausdorff measure associated with the geodesic distance (where n is the dimension of the manifold). In coordinates, vol(dx) = √(det(g)) dx. The Riemannian volume plays the same role as the Lebesgue measure in Rⁿ.
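The coordinate formulas lend themselves to symbolic computation. A short sketch using sympy (the round metric g = diag(1, sin²θ) on S² in spherical coordinates (θ, φ) is an editorial choice of example) implements the definition of the Christoffel symbols verbatim and recovers the classical values Γ^θ_{φφ} = −sin θ cos θ and Γ^φ_{θφ} = cot θ for the sphere.

```python
import sympy as sp

theta, phi = sp.symbols('theta phi')
coords = [theta, phi]
g = sp.Matrix([[1, 0], [0, sp.sin(theta) ** 2]])   # round metric on S^2
ginv = g.inv()

def christoffel(m, i, j):
    # Gamma^m_ij = (1/2) sum_k g^{km} (d_i g_{jk} + d_j g_{ki} - d_k g_{ij})
    return sp.simplify(sp.Rational(1, 2) * sum(
        ginv[k, m] * (sp.diff(g[j, k], coords[i])
                      + sp.diff(g[k, i], coords[j])
                      - sp.diff(g[i, j], coords[k]))
        for k in range(2)))

print(christoffel(0, 1, 1))  # -sin(theta)*cos(theta)
print(christoffel(1, 0, 1))  # cos(theta)/sin(theta)
```

The same function works for any coordinate metric; only the matrix g and the coordinate list change.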
After these reminders about Riemannian calculus, we can go back to the study of action minimization. Let L(x, v, t) be a smooth Lagrangian on TM × [0,1]. To find an equation satisfied by the curves

which minimize the action, we can compute the differential of the action. So let γ be a curve, and h a small variation of that curve. (This amounts to considering a family γ_{s,t} in such a way that γ_{0,t} = γ_t and (d/ds)|_{s=0} γ_{s,t} = h(t).) Then the infinitesimal variation of the action A at γ, along the variation h, is

    dA(γ) · h = ∫₀¹ ( ∇_x L(γ_t, γ̇_t, t) · h(t) + ∇_v L(γ_t, γ̇_t, t) · ḣ(t) ) dt.

Thanks to (7.28) we can perform an integration by parts with respect to the time variable, and get

    dA(γ) · h = ∫₀¹ ( ∇_x L − (d/dt) ∇_v L )(γ_t, γ̇_t, t) · h(t) dt
                + (∇_v L)(γ₁, γ̇₁, 1) · h(1) − (∇_v L)(γ₀, γ̇₀, 0) · h(0).    (7.29)

This is the first variation formula.

When the endpoints x, y of γ are fixed, the tangent curve h vanishes at t = 0 and t = 1. Since h is otherwise arbitrary, it is easy to deduce the equation for minimizers:

    (d/dt) ∇_v L = ∇_x L.    (7.30)

More explicitly, if a differentiable curve (γ_t)_{0≤t≤1} is minimizing, then

    (d/dt) ( ∇_v L(γ_t, γ̇_t, t) ) = ∇_x L(γ_t, γ̇_t, t),    0 < t < 1.

This is the Euler–Lagrange equation associated with the Lagrangian L; to memorize it, it is convenient to write it as

    (d/dt) (∂L/∂ẋ) = ∂L/∂x,    (7.31)

so that the two time-derivatives in the left-hand side formally "cancel out". Note carefully that the left-hand side of the Euler–Lagrange equation involves the time-derivative of a curve which is valued in TM; so (d/dt) in (7.31) is in fact a covariant derivative along the minimizing curve γ, the same operation as we denoted before by ∇_{γ̇}, or D/Dt.

The most basic example is when L(x, v, t) = |v|²/2. Then ∇_v L = v and the equation reduces to dv/dt = 0, or ∇_{γ̇} γ̇ = 0, which is the usual equation of vanishing acceleration. Curves with zero acceleration are called geodesics; their equation, in coordinates, is

    γ̈ᵏ + Σ_{ij} Γᵏ_{ij} γ̇ⁱ γ̇ʲ = 0.

(Note: γ̈ᵏ is the derivative of t ↦ γ̇ᵏ(t), not the k-th component of γ̈.) The speed of such a curve is constant, and to stress this fact one can say that these are constant-speed geodesics, by opposition with general geodesics that can be reparametrized in an arbitrary way. Often I shall just say "geodesics" for constant-speed geodesics. It is equivalent to say that a geodesic γ has constant speed, or that its length between any two times s < t is proportional to t − s.

An important concept related to geodesics is that of the exponential map. If x ∈ M and v ∈ T_x M are given, then exp_x v is defined as γ(1), where γ : [0,1] → M is the unique constant-speed geodesic starting from γ(0) = x with velocity γ̇(0) = v. The exponential map is a convenient notation to handle "all" geodesics of a Riemannian manifold at the same time.

We have seen that minimizing curves have zero acceleration, and the converse is also true locally, that is if γ₁ is very close to γ₀. A curve which minimizes the action between its endpoints is called a minimizing geodesic, or minimal geodesic, or simply a geodesic. The Hopf–Rinow theorem guarantees that if the manifold M (seen as a metric space) is complete, then any two points in M are joined by at least one minimal geodesic. There might be several minimal geodesics joining two points x and y (to see this, consider two antipodal points on the sphere), but geodesics are:

• nonbranching: Two geodesics that are defined on a time interval [0, t] and coincide on [0, t′] for some t′ > 0 have to coincide on the whole of [0, t]. Actually, a stronger statement holds true: The velocity of the geodesic at time t = 0 uniquely determines the final position at time t = 1 (this is a consequence of the uniqueness statement in the Cauchy–Lipschitz theorem).
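The coordinate form of the geodesic equation can be integrated numerically. The sketch below (Python with numpy; the round metric on S² in spherical coordinates (θ, φ), the initial data and the RK4 integrator are illustrative editorial choices) uses the standard Christoffel symbols of the sphere, Γ^θ_{φφ} = −sin θ cos θ and Γ^φ_{θφ} = Γ^φ_{φθ} = cot θ; it checks that the speed is conserved along the solution and that, for unit initial speed, the arc length traced out after time 1 equals 1.

```python
import numpy as np

def rhs(y):
    th, ph, vth, vph = y
    # geodesic equations on S^2: ddot(gamma)^k = -Gamma^k_ij dot(gamma)^i dot(gamma)^j
    return np.array([vth, vph,
                     np.sin(th) * np.cos(th) * vph ** 2,
                     -2.0 * (np.cos(th) / np.sin(th)) * vth * vph])

def rk4(y, dt, n):
    for _ in range(n):
        k1 = rhs(y)
        k2 = rhs(y + dt / 2 * k1)
        k3 = rhs(y + dt / 2 * k2)
        k4 = rhs(y + dt * k3)
        y = y + dt / 6 * (k1 + 2 * k2 + 2 * k3 + k4)
    return y

y0 = np.array([np.pi / 2, 0.0, 0.6, 0.8])   # unit speed: vth^2 + sin(th)^2 vph^2 = 1
y1 = rk4(y0, 1e-3, 1000)                    # integrate up to t = 1

speed2 = y1[2] ** 2 + np.sin(y1[0]) ** 2 * y1[3] ** 2
embed = lambda th, ph: np.array([np.sin(th) * np.cos(ph),
                                 np.sin(th) * np.sin(ph), np.cos(th)])
dist = np.arccos(np.clip(np.dot(embed(*y0[:2]), embed(*y1[:2])), -1.0, 1.0))
print(speed2, dist)  # both close to 1: constant speed, arc length = elapsed time
```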
• locally unique: For any given x, there is r > 0 such that any y in the ball B_r(x) can be connected to x by a single geodesic γ = γ_{x→y}, and then the map y ↦ γ̇(0) is a diffeomorphism (this corresponds to parametrizing the endpoint by the initial velocity).

• almost everywhere unique: For any x, the set of points y that can be connected to x by several (minimizing!) geodesics is of zero measure. A way to see this is to note that the square distance function

d(x, ·)² is locally semiconcave, and therefore differentiable almost everywhere. (See Chapter 10 for background about semiconcavity.)

The set Γ_{x,y} of (minimizing, constant speed) geodesics joining x and y might not be single-valued, but in any case it is compact in C([0,1], M), even if M is not compact. To see this, note that (i) the image of any element of Γ_{x,y} lies entirely in the ball B(x, d(x,y)), so Γ_{x,y} is uniformly bounded; (ii) elements in Γ_{x,y} are d(x,y)-Lipschitz, so they constitute an equi-Lipschitz family; (iii) Γ_{x,y} is closed because it is defined by the equations γ(0) = x, γ(1) = y, L(γ) ≤ d(γ₀, γ₁) (the length functional L is not continuous with respect to uniform convergence, but it is lower semicontinuous, so an upper bound on the length defines a closed set); (iv) M is locally compact, so Ascoli's compactness theorem applies to functions with values in M.

A similar argument shows that for any two given compact sets K_s and K_t, the set of geodesics γ such that γ_s ∈ K_s and γ_t ∈ K_t is compact in C([s,t]; M). So the Lagrangian action defined by A^{s,t}(γ) = L(γ)² / (t − s) is coercive in the sense of Definition 7.13.

Most of these statements can be generalized to the action coming from a Lagrangian function L(x, v, t) on TM × [0,1], if L is C² and satisfies the classical conditions of Definition 7.6. In particular the associated cost functions will be continuous. Here is a sketch of the proof: Let x and y be two given points, and let x_k → x and y_k → y be converging sequences. For any ε > 0, small enough,

    c^{s,t}(x_k, y_k) ≤ c^{s,s+ε}(x_k, x) + c^{s+ε,t−ε}(x, y) + c^{t−ε,t}(y, y_k).    (7.32)

It is easy to show that there is a uniform bound K on the speeds of all minimizing curves which achieve the costs appearing above.
Then the Lagrangian is uniformly bounded on these curves, so c^{s,s+ε}(x_k, x) = O(ε), c^{t−ε,t}(y, y_k) = O(ε). Also it does not affect much the Lagrangian (evaluated on candidate minimizers) to reparametrize [s+ε, t−ε] into [s, t] by a linear change of variables, so c^{s+ε,t−ε}(x, y) converges to c^{s,t}(x, y) as ε → 0. This proves the upper semicontinuity, and therefore the continuity, of c^{s,t}.

In fact there is a finer statement: c^{s,t} is superdifferentiable. This notion will be explained and developed later in Chapter 10.

Besides the Euclidean space, Riemannian manifolds constitute in some sense the most regular metric structure used by mathematicians. A Riemannian structure comes with many nice features (calculus, length, distance, geodesic equations); it also has a well-defined dimension n (the dimension of the manifold) and carries a natural volume.

Finsler structures constitute a generalization of the Riemannian structure: one has a differentiable manifold, with a norm on each tangent space T_x M, but that norm does not necessarily come from a scalar product. One can then define lengths of curves, the induced distance as for a Riemannian manifold, and prove the existence of geodesics, but the geodesic equations are more complicated.

Another generalization is the notion of length space (or intrinsic length space), in which one does not necessarily have tangent spaces, yet one assumes the existence of a length L and a distance d which are compatible, in the sense that

    L(γ) = ∫₀¹ |γ̇_t| dt,    |γ̇_t| := lim sup_{ε→0} d(γ_t, γ_{t+ε}) / |ε|;

    d(x, y) = inf { L(γ);  γ₀ = x, γ₁ = y }.

In practice the following criterion is sometimes useful: A complete metric space (X, d) is a length space if and only if for any two points x, y in X, and any ε > 0, one can find an ε-midpoint of (x, y), i.e. a point m_ε such that

    | d(x, m_ε) − d(x,y)/2 | ≤ ε,    | d(y, m_ε) − d(x,y)/2 | ≤ ε.

Minimizing paths are fundamental objects in geometry. A length space in which any two points can be joined by a minimizing path, or geodesic, is called a geodesic space, or strictly intrinsic length space, or just (by abuse of language) length space. There is a criterion in terms of midpoints: A complete metric space (X, d) is a geodesic space if and only if for any two points x, y in X there is a midpoint, which is of course some m ∈ X such that

    d(x, m) = d(m, y) = d(x,y)/2.

There is another useful criterion: If the metric space (X, d) is a complete, locally compact length space, then it is geodesic.
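The midpoint criteria are easy to test on a concrete example. In the sketch below (Python with numpy; the example and the discretized set of candidate midpoints are editorial choices), the unit circle is equipped with two distances: the restriction of the Euclidean distance of the plane, for which the two endpoints of a diameter admit no approximate midpoint on the circle (so this space is not a length space), and the intrinsic arc-length distance, for which exact midpoints exist.

```python
import numpy as np

def midpoint_defect(x, y, dist, candidates):
    # smallest achievable max deviation from being a midpoint, over the candidates
    half = dist(x, y) / 2
    return min(max(abs(dist(x, m) - half), abs(dist(y, m) - half))
               for m in candidates)

angles = np.linspace(0.0, 2 * np.pi, 10_000, endpoint=False)
circle = [np.array([np.cos(a), np.sin(a)]) for a in angles]
x, y = np.array([1.0, 0.0]), np.array([-1.0, 0.0])

chord = lambda p, q: float(np.linalg.norm(p - q))   # ambient (chordal) distance

def arc(p, q):                                      # intrinsic arc-length distance
    d = abs(np.arctan2(p[1], p[0]) - np.arctan2(q[1], q[0])) % (2 * np.pi)
    return min(d, 2 * np.pi - d)

d_chord = midpoint_defect(x, y, chord, circle)   # bounded away from 0
d_arc = midpoint_defect(x, y, arc, circle)       # essentially 0
print(d_chord, d_arc)
```

For the chordal distance the best candidate sits at angle π/2, with defect √2 − 1 ≈ 0.414; for the arc-length distance that same point is an exact midpoint.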
This criterion is a generalization of the Hopf–Rinow theorem in Riemannian geometry. One

can also reparametrize geodesic curves γ in such a way that their speed |γ̇| is constant, or equivalently that for all intermediate times s and t, their length between times s and t coincides with the distance between their positions at times s and t.

The same proof that I sketched for Riemannian manifolds applies in geodesic spaces, to show that the set Γ_{x,y} of (minimizing, constant speed) geodesics joining x to y is compact; more generally, the set Γ_{K₀→K₁} of geodesics γ with γ₀ ∈ K₀ and γ₁ ∈ K₁ is compact, as soon as K₀ and K₁ are compact. So there are important common points between the structure of a length space and the structure of a Riemannian manifold. From the practical point of view, some main differences are that (i) there is no available equation for geodesic curves, (ii) geodesics may "branch", (iii) there is no guarantee that geodesics between x and y are unique for y very close to x, (iv) there is neither a unique notion of dimension, nor a canonical reference measure, (v) there is no guarantee that geodesics will be almost everywhere unique. Still there is a theory of differential analysis on nonsmooth geodesic spaces (first variation formula, norms of Jacobi fields, etc.), mainly in the case where there are lower bounds on the sectional curvature (in the sense of Alexandrov, as will be described in Chapter 26).

Bibliographical notes

There are plenty of classical textbooks on Riemannian geometry, with variable degree of pedagogy, among which the reader may consult [223], [306], [394]. For an introduction to the classical calculus of variations in dimension 1, see for instance [347, Chapters 2–3], [177], or [235]. For an introduction to the Hamiltonian formalism in classical mechanics, one may use the very pedagogical treatise by Arnold [44], or the more complex one by Thirring [780]. For an introduction to analysis in metric spaces, see Ambrosio and Tilli [37].
A wonderful introduction to the theory of length spaces can be found in Burago, Burago and Ivanov [174]. In the latter reference, a Riemannian manifold is defined as a length space which is locally isometric to Rⁿ equipped with a quadratic form g_x depending smoothly on the point x. This definition is not standard, but it is equivalent to the classical definition, and in some sense more satisfactory if one wishes to emphasize the metric point of view. Advanced elements of differential analysis on nonsmooth

metric spaces can be found also in the literature on Alexandrov spaces; see the bibliographical notes of Chapter 26.

I may have been overcautious in the formulation of the classical conditions in Definition 7.6, but in the time-dependent case there are crazy counterexamples showing that nice C¹ Lagrangian functions do not necessarily have C¹ minimizing curves (equivalently, these minimizing curves won't solve the Euler–Lagrange equation); see for instance the constructions by Ball and Mizel [63, 64]. (A backwards search in MathSciNet will give access to many papers concerned with multidimensional analogs of this problem, in relation with the so-called Lavrentiev phenomenon.) I owe these remarks to Mather, who also constructed such counterexamples and noticed that many authors have been fooled by these issues. Mather [601, Section 2 and Appendices], Clarke and Vinter [235, 236, 823] discuss in great detail sufficient conditions under which everything works fine. In particular, if L is C¹, strictly convex, superlinear in v and time-independent, then minimizing curves are automatically C¹ and satisfy the Euler–Lagrange equation. (The difficult point is to show that minimizers are Lipschitz; after that it is easier to see that they are at least as smooth as the Lagrangian.)

I introduced the abstract concept of "coercive Lagrangian action" for the purpose of this course, but this concept looks so natural to me that I would not be surprised if it had been previously discussed in the literature, maybe in disguised form.
Probability measures on action-minimizing curves might look a bit scary when encountered for the first time, but they were actually rediscovered several times by various researchers, so they are arguably natural objects: See in particular the works by Bernot, Caselles and Morel [109, 110] on irrigation problems; by Bangert [67] and Hohloch [476] on problems inspired by geometry and dynamical systems; by Ambrosio on transport equations with little or no regularity [21, 30]. In fact, in the context of partial differential equations, this approach already appears in the much earlier works of Brenier [155, 157, 158, 159] on the incompressible Euler equation and related systems. One technical difference is that Brenier considers probability measures on the huge (nonmetrizable) space of measurable paths, while the other above-mentioned authors only consider much smaller spaces consisting of continuous, or Lipschitz-continuous functions. There are important subtleties with probability measures on nonmetrizable spaces, and I advise the reader to stay away from them.

Also in relation to the irrigation problem, various models of traffic plans and "dynamic cost function" are studied in [108, 113]; while paths in the space of probability measures are considered in [152].

The Hamilton–Jacobi equation with a quadratic cost function (i.e. L(x, v, t) = |v|²) will be considered in more detail in Chapter 22; see in particular Proposition 22.16. For further information about Hamilton–Jacobi equations, there is an excellent book by Cannarsa and Sinestrari [199]; one may also consult [68, 327, 558] and the references therein. Of course Hamilton–Jacobi equations are closely related to the concept of c-convexity: for instance, it is equivalent to say that ψ is c-convex, or that it is a solution at time 0 of the backward Hamilton–Jacobi semigroup starting at time 1 (with some arbitrary initial datum).

At the end of the proof of Proposition 7.16 I used once again the basic measurable selection theorem which was already used in the proof of Corollary 5.22; see the bibliographical notes on p. 104.

Interpolation arguments involving changes of variables have a long history. The concept and denomination of displacement interpolation was introduced by McCann [614] in the particular case of the quadratic cost in Euclidean space. Soon after, it was understood by Brenier that this procedure could formally be recast as an action minimization problem in the space of measures, which would reduce to the classical geodesic problem when the probability measures are Dirac masses. In Brenier's approach, the action is defined, at least formally, by the formula

    A(μ) = inf_{v(t,x)} { ∫₀¹ ∫ |v(t,x)|² dμ_t(x) dt ;  ∂μ/∂t + ∇·(vμ) = 0 },    (7.33)

and then one has the Benamou–Brenier formula

    W₂(μ₀, μ₁)² = inf A(μ),    (7.34)

where the infimum is taken among all paths (μ_t)_{0≤t≤1} satisfying certain regularity conditions.
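The geodesic property behind (7.34) can be illustrated in one dimension, where the optimal coupling is explicit. A short sketch (Python with numpy; the discrete measures, the sample sizes and the 21-point time grid are editorial choices): along the displacement interpolation obtained by moving each atom of the monotone (sorted) coupling in a straight line at constant speed, the W₂-length of the path of measures equals W₂(μ₀, μ₁), as it should for a geodesic in Wasserstein space.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500
x = np.sort(rng.normal(0.0, 1.0, n))   # atoms of mu_0, each with mass 1/n
y = np.sort(rng.normal(3.0, 0.5, n))   # atoms of mu_1

def W2(a, b):
    # in 1D the optimal quadratic-cost coupling pairs sorted atoms (quantile coupling)
    return float(np.sqrt(np.mean((np.sort(a) - np.sort(b)) ** 2)))

# displacement interpolation: every atom travels along a straight line
ts = np.linspace(0.0, 1.0, 21)
mus = [(1 - t) * x + t * y for t in ts]

curve_length = sum(W2(mus[k], mus[k + 1]) for k in range(len(ts) - 1))
print(W2(x, y), curve_length)  # equal: (mu_t) is a constant-speed W_2 geodesic
```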
Brenier himself gave two sketches of the proof for this formula [88, 164], and another formal argument was suggested by Otto and myself [671, Section 3]. Rigorous proofs were later provided by several authors under various assumptions [814, Theorem 8.1] [451] [30, Chapter 8] (the latter reference contains the most precise results). The adaptation to Riemannian manifolds has been considered in [278, 431, 491]. We shall come back to these formulas later on, after a more precise qualitative picture of optimal transport has emerged. One of

the motivations of Benamou and Brenier was to devise new numerical methods [88, 89, 90, 91]. Wolansky [838] considered a more general situation in the presence of sources and sinks.

There was a rather amazing precursor to the idea of displacement interpolation, in the form of Nelson's theory of "stochastic mechanics". Nelson tried to build up a formalism in which quantum effects would be explained by stochastic fluctuations. For this purpose he considered an action minimization problem which was also studied by Guerra and Morato:

    inf E ∫₀¹ |Ẋ_t|² dt,

where the infimum is over all random paths (X_t)_{0≤t≤1} such that law(X₀) = μ₀, law(X₁) = μ₁, and in addition (X_t) solves the stochastic differential equation

    dX_t/dt = ξ(t, X_t) + σ dB_t/dt,

where B_t is a standard Brownian motion, σ > 0 is some coefficient, and ξ is a drift, which is an unknown in the problem. (So the minimization is over all possible couplings (X₀, X₁), but also over all drifts!) This formulation is very similar to the Benamou–Brenier formula just alluded to, only there is the additional Brownian noise in it, thus it is more complex in some sense. Moreover, the expected value of the action is always infinite, so one has to renormalize it to make sense of Nelson's problem. Nelson made the incredible discovery that after a change of variables, minimizers of the action produced solutions of the free Schrödinger equation in Rⁿ. He developed this approach for some time, and finally gave up because it was introducing unpleasant nonlocal features. I shall give references at the end of the bibliographical notes for Chapter 23.

It was Otto [669] who first explicitly reformulated the Benamou–Brenier formula (7.34) as the equation for a geodesic distance in a Riemannian setting, from a formal point of view.
Then Ambrosio, Gigli and Savaré pointed out that if one is not interested in the equations of motion, but just in the geodesic property, it is simpler to use the metric notion of geodesic in a length space [30]. Those issues were also developed by other authors working with slightly different formalisms [203, 214].

All the above-mentioned works were mainly concerned with displacement interpolation in Rⁿ. Agueh [4] also considered the case of the cost c(x,y) = |x − y|^p (p > 1) in Euclidean space. Then displacement

interpolation on Riemannian manifolds was studied, from a heuristic point of view, by Otto and myself [671]. Some useful technical tools were introduced in the field by Cordero-Erausquin, McCann and Schmuckenschläger [246] for Riemannian manifolds; Cordero-Erausquin adapted them to the case of rather general strictly convex cost functions in Rⁿ [243].

The displacement interpolation for more general cost functions, arising from a smooth Lagrangian, was constructed by Bernard and Buffoni [105], who first introduced in this context Property (ii) in Theorem 7.21. At the same time, they made the explicit link with the Mather minimization problem, which will appear in subsequent chapters. This connection was also studied independently by De Pascale, Gelli and Granieri [278].

In all these works, displacement interpolation took place in a smooth structure, resulting in particular in the uniqueness (almost everywhere) of minimizing curves used in the interpolation, at least if the Lagrangian is nice enough. Displacement interpolation in length spaces, as presented in this chapter, via the notion of dynamical transference plan, was developed more recently by Lott and myself [577]. Theorem 7.21 in this course is new; it was essentially obtained by rewriting the proof in [577] with enough generality added to include the setting of Bernard and Buffoni.

The most natural examples of Lagrangian functions are those taking the form L(t,x,v) = |v|²/2 − U(t,x), where U(t,x) is a potential energy. In relation with incompressible fluid mechanics, Ambrosio and Figalli [24, 25] studied the case when U is the pressure field (a priori nonsmooth). Another case of great interest is when U(t,x) is the scalar curvature of a manifold evolving in time according to the Ricci flow; then, up to a correct time rescaling, the associated minimal action is known in geometry as Perelman's L-distance.
First Topping [782], and then Lott [576], discussed the Lagrangian action induced by the L-distance (and some of its variants) at the level of the space of probability measures. They used this formalism to recover some key results in the theory of Ricci flow.

Displacement interpolation in the case p = 1 is quite subtle because of the possibility of reparametrization; it was carefully discussed in the Euclidean space by Ambrosio [20]. Recently, Bernard and Buffoni [104] shed some new light on that issue by making explicit the link with the Mather–Mañé problem. Very roughly, the distance cost function is

a typical representative of cost functions that arise from Lagrangians, if one also allows minimization over the choice of the time-interval [0, T] ⊂ R (rather than fixing, say, T = 1). This extra freedom accounts for the degeneracy of the problem.

Lagrangian cost functions of the form V(γ) |γ̇|, where V is a "Lyapunov functional", have been used by Hairer and Mattingly [458] in relation to convergence to equilibrium, as a way to force the system to visit a compact set. Such cost functions also appear formally in the modeling of irrigation [108], but in fact this is a nonlinear problem since V(γ) is determined by the total mass of particles passing at a given point.

The observation in Remark 7.27 came from a discussion with S. Evans, who pointed out to me that it was difficult, if not impossible, to get characterizations of random processes expressed in terms of the measures when working in state spaces that are not locally compact (such as the space of real trees). In spite of that remark, recently Lisini [565] was able to obtain representation theorems for general absolutely continuous paths (μ_t)_{0≤t≤1} in the Wasserstein space P_p(X) (p > 1), as soon as X is just a Polish space and ∫ ‖μ̇_t‖^p dt < ∞, where ‖μ̇_t‖ is the metric speed in P_p(X). He showed that such a curve may be written as (e_t)_# Π, where Π is the law of a random absolutely continuous curve γ; as a consequence, he could generalize Corollary 7.22 by removing the assumption of local compactness. Lisini also established a metric replacement for the relation of conservation of mass: For almost all t,

    E |γ̇_t|^p ≤ ‖μ̇_t‖^p.

He further applied his results to various problems about transport in infinite-dimensional Banach spaces.

Proposition 7.29 is a generalization of a result appearing in [444].
Gigli communicated to me an alternative proof, which is more elementary and needs neither local compactness, nor length space property.

8

The Monge–Mather shortening principle

Monge himself made the following important observation. Consider the transport cost c(x,y) = |x − y| in the Euclidean plane, and two pairs (x₁, y₁), (x₂, y₂), such that an optimal transport maps x₁ to y₁ and x₂ to y₂. (In our language, (x₁, y₁) and (x₂, y₂) belong to the support of an optimal coupling π.) Then either all four points lie on a single line, or the two line segments [x₁, y₁], [x₂, y₂] do not cross, except maybe at their endpoints.

The reason is easy to grasp: If the two lines would cross at a point which is not an endpoint of both lines, then, by the triangle inequality, we would have

    |x₁ − y₂| + |x₂ − y₁| < |x₁ − y₁| + |x₂ − y₂|,

and this would contradict the fact that the support of π is c-cyclically monotone. Stated otherwise: Given two crossing line segments, we can shorten the total length of the paths by replacing these lines by the new transport lines [x₁, y₂] and [x₂, y₁] (see Figure 8.1).

Quadratic cost function

For cost functions that do not satisfy a triangle inequality, Monge's argument does not apply, and pathlines can cross. However, it is often the case that the crossing of the curves (with the time variable explicitly taken into account) is forbidden. Here is the most basic example: Consider the quadratic cost function c(x,y) = |x − y|² in Euclidean space Rⁿ, and let (x₁, y₁) and (x₂, y₂) belong to the support of some optimal coupling. By cyclical monotonicity,

Fig. 8.1. Monge's observation. The cost is Euclidean distance; if $x_1$ is sent to $y_1$ and $x_2$ to $y_2$, then it is cheaper to send $x_1$ to $y_2$ and $x_2$ to $y_1$.

\[
|x_1 - y_1|^2 + |x_2 - y_2|^2 \le |x_1 - y_2|^2 + |x_2 - y_1|^2. \tag{8.1}
\]
Then let
\[
\gamma_1(t) = (1-t)\,x_1 + t\,y_1, \qquad \gamma_2(t) = (1-t)\,x_2 + t\,y_2
\]
be the two line segments respectively joining $x_1$ to $y_1$, and $x_2$ to $y_2$. It may happen that $\gamma_1(s) = \gamma_2(t)$ for some $s,t \in [0,1]$. But if there is a $t_0 \in (0,1)$ such that $\gamma_1(t_0) = \gamma_2(t_0) =: X$, then
\begin{align*}
|x_1 - y_2|^2 &+ |x_2 - y_1|^2 \\
&= |x_1 - X|^2 + |X - y_2|^2 + 2\,\langle x_1 - X,\, X - y_2\rangle \\
&\qquad + |x_2 - X|^2 + |X - y_1|^2 + 2\,\langle x_2 - X,\, X - y_1\rangle \\
&= \bigl[\,t_0^2 + (1-t_0)^2\,\bigr]\,\bigl(|x_1 - y_1|^2 + |x_2 - y_2|^2\bigr)
+ 4\,t_0(1-t_0)\,\langle x_1 - y_1,\ x_2 - y_2\rangle \\
&\le \bigl[\,t_0^2 + (1-t_0)^2 + 2\,t_0(1-t_0)\,\bigr]\,\bigl(|x_1 - y_1|^2 + |x_2 - y_2|^2\bigr) \\
&= |x_1 - y_1|^2 + |x_2 - y_2|^2,
\end{align*}
and the inequality is strict unless $x_1 - y_1 = x_2 - y_2$, in which case $\gamma_1(t) = \gamma_2(t)$ for all $t \in [0,1]$. But strict inequality contradicts (8.1). The conclusion is that two distinct interpolation trajectories cannot meet at intermediate times.

It is natural to ask whether this conclusion can be reinforced in a quantitative statement. The answer is yes; in fact there is a beautiful identity:

\begin{align*}
\bigl|\bigl((1-t)x_1 + t y_1\bigr) - \bigl((1-t)x_2 + t y_2\bigr)\bigr|^2
&= (1-t)^2\,|x_1 - x_2|^2 + t^2\,|y_1 - y_2|^2 \\
&\quad + t(1-t)\,\bigl(|x_1 - y_2|^2 + |x_2 - y_1|^2 - |x_1 - y_1|^2 - |x_2 - y_2|^2\bigr). \tag{8.2}
\end{align*}
To appreciate the consequences of (8.2), let
\[
\gamma_1(t) = (1-t)\,x_1 + t\,y_1, \qquad \gamma_2(t) = (1-t)\,x_2 + t\,y_2.
\]
Then (8.1) and (8.2) imply
\[
\max\bigl(|x_1 - x_2|,\ |y_1 - y_2|\bigr) \le \max\Bigl(\frac{1}{t},\ \frac{1}{1-t}\Bigr)\,\bigl|\gamma_1(t) - \gamma_2(t)\bigr|.
\]
Since $|\gamma_1(t) - \gamma_2(t)| \le \max\bigl(|x_1 - x_2|,\, |y_1 - y_2|\bigr)$ for all $t \in [0,1]$, one can conclude that for any $t_0 \in (0,1)$,
\[
\sup_{0 \le t \le 1} \bigl|\gamma_1(t) - \gamma_2(t)\bigr| \le \max\Bigl(\frac{1}{t_0},\ \frac{1}{1-t_0}\Bigr)\,\bigl|\gamma_1(t_0) - \gamma_2(t_0)\bigr|. \tag{8.3}
\]
(By the way, this inequality is easily seen to be optimal.) So the uniform distance between the whole paths $\gamma_1$ and $\gamma_2$ can be controlled by their distance at some time $t_0 \in (0,1)$.

General statement and applications to optimal transport

For the purpose of a seemingly different problem, Mather (aware neither of Monge's work, nor of optimal transport) established an estimate which relies on the same idea as Monge's shortening argument, only much more sophisticated, for general cost functions on Lagrangian manifolds. He obtained a quantitative version of these estimates, in a form quite similar to (8.3).

Mather's proof uses three kinds of assumptions: (i) the existence of a second-order differential equation for minimizing curves; (ii) an assumption of regularity of the Lagrangian; and (iii) an assumption of strict convexity of the Lagrangian. To quantify the strict convexity, I shall use the following concept: A continuous function $L$ on $\mathbb{R}^n$ will be said to be $(2+\kappa)$-convex if it satisfies a (strict) convexity inequality of the form
\[
\frac{L(v) + L(w)}{2} - L\Bigl(\frac{v+w}{2}\Bigr) \ge K\,|v - w|^{2+\kappa}
\]

for some constant $K > 0$.

The next statement is a slight generalization of Mather's estimate; if the reader finds it too dense, he or she can go directly to Corollary 8.2, which is simpler, and sufficient for the rest of this course.

Theorem 8.1 (Mather's shortening lemma). Let $M$ be a smooth Riemannian manifold, equipped with its geodesic distance $d$, and let $c(x,y)$ be a cost function on $M \times M$, defined by a Lagrangian $L(x,v,t)$ on $TM \times [0,1]$. Let $x_1, x_2, y_1, y_2$ be four points on $M$ such that
\[
c(x_1,y_1) + c(x_2,y_2) \le c(x_1,y_2) + c(x_2,y_1).
\]
Further, let $\gamma_1$ and $\gamma_2$ be two action-minimizing curves respectively joining $x_1$ to $y_1$, and $x_2$ to $y_2$. Let $V$ be a bounded neighborhood of the graphs of $\gamma_1$ and $\gamma_2$ in $M \times [0,1]$, and $S$ a strict upper bound on the maximal speed along these curves. Define
\[
\mathcal{V} := \bigcup_{(x,t)\in V} \bigl(x,\, B_S(0),\, t\bigr) \subset TM \times [0,1].
\]
In words, $\mathcal{V}$ is a neighborhood of $\gamma_1$ and $\gamma_2$, convex in the velocity variable. Assume that:

(i) minimizing curves for $L$ are solutions of a Lipschitz flow, in the sense of Definition 7.6(d);

(ii) $L$ is of class $C^{1,\alpha}$ in $\mathcal{V}$ with respect to the variables $x$ and $v$, for some $\alpha \in (0,1]$ (so $\nabla_x L$ and $\nabla_v L$ are Hölder-$\alpha$; Hölder-1 meaning Lipschitz);

(iii) $L$ is $(2+\kappa)$-convex in $\mathcal{V}$, with respect to the $v$ variable.

Then, for any $t_0 \in (0,1)$, there is a constant $C = C(L,\mathcal{V},t_0)$, and a positive exponent $\beta = \beta(\alpha,\kappa)$, such that
\[
\sup_{0 \le t \le 1} d\bigl(\gamma_1(t),\, \gamma_2(t)\bigr) \le C\; d\bigl(\gamma_1(t_0),\, \gamma_2(t_0)\bigr)^{\beta}. \tag{8.4}
\]
Furthermore, if $\alpha = 1$ and $\kappa = 0$, then one can choose $\beta = 1$ and $C = C(L,\mathcal{V})/\min(t_0,\, 1-t_0)$.

If $L$ is of class $C^2$ and $\nabla^2_v L > 0$, then Assumption (iii) will be true for $\kappa = 0$, so we have the next corollary:

Corollary 8.2 (Mather's shortening lemma again). Let $M$ be a smooth Riemannian manifold and let $L = L(x,v,t)$ be a $C^2$ Lagrangian on $TM \times [0,1]$, satisfying the classical assumptions of Definition 7.6, together with $\nabla^2_v L > 0$. Let $c$ be the cost function associated to $L$, and let $d$ be the geodesic distance on $M$. Then, for any compact $K \subset M$ there is a constant $C_K$ such that, whenever $x_1, x_2, y_1, y_2$ are four points in $K$ with
\[
c(x_1,y_1) + c(x_2,y_2) \le c(x_1,y_2) + c(x_2,y_1),
\]
and $\gamma_1$, $\gamma_2$ are action-minimizing curves joining respectively $x_1$ to $y_1$, and $x_2$ to $y_2$, then for any $t_0 \in (0,1)$,
\[
\sup_{0 \le t \le 1} d\bigl(\gamma_1(t),\, \gamma_2(t)\bigr) \le \frac{C_K}{\min(t_0,\, 1-t_0)}\; d\bigl(\gamma_1(t_0),\, \gamma_2(t_0)\bigr). \tag{8.5}
\]
The short version of the conclusion is that the distance between $\gamma_1$ and $\gamma_2$ is controlled, uniformly in $t$, by the distance at any time $t_0 \in (0,1)$. In particular, the initial and final distance between these curves is controlled by their distance at any intermediate time. (But the final distance is not controlled by the initial distance!) Once again, inequalities (8.4) or (8.5) are quantitative versions of the qualitative statement that the two curves, if distinct, cannot cross except at initial or final time.

Example 8.3. The cost function $c(x,y) = d(x,y)^2$ corresponds to the Lagrangian function $L(x,v,t) = |v|^2$, which obviously satisfies the assumptions of Corollary 8.2. In that case the exponent $\beta = 1$ is admissible. Moreover, it is natural to expect that the constant $C_K$ can be controlled in terms of just a lower bound on the sectional curvature of $M$. I shall come back to this issue later in this chapter (see Open Problem 8.21).

Example 8.4. The cost function $c(x,y) = d(x,y)^{1+\alpha}$ does not satisfy the assumptions of Corollary 8.2 for $0 < \alpha < 1$. Even if the associated Lagrangian $L(x,v,t) = |v|^{1+\alpha}$ is not smooth, the equation for minimizing curves is just the geodesic equation, so Assumption (i) in Theorem 8.1 is still satisfied. Then, by tracking exponents in the proof of Theorem 8.1, one can find that (8.4) holds true with $\beta = (1+\alpha)/(3-\alpha)$. But this is far from optimal: By taking advantage of the homogeneity of the power function, one can prove that the exponent $\beta = 1$ is also

admissible, for all $\alpha \in (0,1)$. (It is the constant, rather than the exponent, which deteriorates as $\alpha \downarrow 0$.) I shall explain this argument in the Appendix, in the Euclidean case, and leave the Riemannian case as a delicate exercise. This example suggests that Theorem 8.1 still leaves room for improvement.

The proof of Theorem 8.1 is a bit involved, and before presenting it I prefer to discuss some applications in terms of optimal couplings.

In the sequel, if $K$ is a compact subset of $M$, I say that a dynamical optimal transport $\Pi$ is supported in $K$ if it is supported on geodesics whose image lies entirely inside $K$.

Theorem 8.5 (The transport from intermediate times is locally Lipschitz). On a Riemannian manifold $M$, let $c$ be a cost function satisfying the assumptions of Corollary 8.2, let $K$ be a compact subset of $M$, and let $\Pi$ be a dynamical optimal transport supported in $K$. Then $\Pi$ is supported on a set of geodesics $S$ such that for any two $\gamma, \widetilde\gamma \in S$,
\[
\sup_{0 \le t \le 1} d\bigl(\gamma(t),\, \widetilde\gamma(t)\bigr) \le C_K\; d\bigl(\gamma(t_0),\, \widetilde\gamma(t_0)\bigr). \tag{8.6}
\]
In particular, if $(\mu_t)_{0\le t\le 1}$ is a displacement interpolation between any two compactly supported probability measures on $M$, and $t_0 \in (0,1)$ is given, then for any $t \in [0,1]$ the map
\[
T_{t_0\to t} :\ \gamma(t_0) \longmapsto \gamma(t)
\]
is well-defined $\mu_{t_0}$-almost surely and Lipschitz continuous on its domain; and it is in fact the unique solution of the Monge problem between $\mu_{t_0}$ and $\mu_t$. In other words, the coupling $(\gamma(t_0), \gamma(t))$ is an optimal deterministic coupling.

Example 8.6. On $\mathbb{R}^n$ with $c(x,y) = |x-y|^2$, let $\mu_0 = \delta_0$ and let $\mu_1 = \mathrm{law}\,(X)$ be arbitrary. Then it is easy to check that $\mu_t = \mathrm{law}\,(tX)$, and in fact the random geodesic $\gamma(t)$ is just $t\,\gamma(1)$. So $\gamma(t) = t\,\gamma(t_0)/t_0$, which obviously provides a deterministic coupling. This example is easily adapted to more general geometries (see Figure 8.2).

Proof of Theorem 8.5.
The proof consists only in formalizing things that by now may look essentially obvious to the reader. First note

Fig. 8.2. In this example ($\mu_0 = \delta_{x_0}$, $\mu_1$ arbitrary) the map $\gamma(0) \to \gamma(1/2)$ is not well-defined, but the map $\gamma(1/2) \to \gamma(0)$ is well-defined and Lipschitz, just as the map $\gamma(1/2) \to \gamma(1)$. Also $\mu_0$ is singular, but $\mu_t$ is absolutely continuous as soon as $t > 0$.

that $(e_0,e_1)_\#(\Pi \otimes \Pi) = \pi \otimes \pi$, where $\pi = (e_0,e_1)_\#\Pi$ is an optimal coupling between $\mu_0$ and $\mu_1$. So if a certain property holds true $\pi \otimes \pi$-almost surely for quadruples, it also holds true $\Pi \otimes \Pi$-almost surely for the endpoints of pairs of curves.

Since $\pi$ is optimal, it is $c$-cyclically monotone (Theorem 5.10(ii)), so, $\pi \otimes \pi(dx\,dy\,d\widetilde x\,d\widetilde y)$-almost surely,
\[
c(x,y) + c(\widetilde x,\widetilde y) \le c(x,\widetilde y) + c(\widetilde x,y).
\]
Thus, $\Pi \otimes \Pi(d\gamma\,d\widetilde\gamma)$-almost surely,
\[
c\bigl(\gamma(0),\gamma(1)\bigr) + c\bigl(\widetilde\gamma(0),\widetilde\gamma(1)\bigr) \le c\bigl(\gamma(0),\widetilde\gamma(1)\bigr) + c\bigl(\widetilde\gamma(0),\gamma(1)\bigr).
\]
Then (8.6) follows from Corollary 8.2.

Let $S$ be the support of $\Pi$; by assumption this is a compact set. Since the inequality (8.6) defines a closed set of pairs of geodesics, actually it has to hold true for all pairs $(\gamma,\widetilde\gamma) \in S \times S$.

Now define the map $T_{t_0\to t}$ on the compact set $e_{t_0}(S)$ (that is, the union of all $\gamma(t_0)$, when $\gamma$ varies over the compact set $S$), by the formula $T_{t_0\to t}(\gamma(t_0)) = \gamma(t)$. This map is well-defined, for if two geodesics $\gamma$ and $\widetilde\gamma$ in the support of $\Pi$ are such that $\gamma(t_0) = \widetilde\gamma(t_0)$, then inequality (8.6) imposes $\gamma = \widetilde\gamma$. The same inequality shows that $T_{t_0\to t}$ is actually Lipschitz-continuous, with Lipschitz constant $C_K/\min(t_0,\, 1-t_0)$.

All this shows that $(\gamma(t_0),\, T_{t_0\to t}(\gamma(t_0)))$ is indeed a Monge coupling of $(\mu_{t_0}, \mu_t)$, with a Lipschitz map. To complete the proof of the theorem, it only remains to check the uniqueness of the optimal coupling; but this follows from Theorem 7.30(iii). ⊓⊔
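In the explicit situation of Example 8.6 the intermediate-time map can be checked by hand: the geodesics are $\gamma(t) = tX$, and $T_{t_0\to t}$ is the dilation $p \mapsto (t/t_0)\,p$. The short numerical sketch below (plain Python, an editorial addition rather than part of the text; all names are illustrative) samples a few such geodesics and verifies both that $\gamma(t_0)$ determines $\gamma(t)$ and that the map is Lipschitz with constant $t/t_0$.

```python
import random

# Example 8.6: mu_0 = delta_0, mu_1 = law(X); the random geodesic is
# gamma(t) = t*X, and T_{t0->t} sends gamma(t0) to gamma(t), i.e. p -> (t/t0)*p.
random.seed(0)
t0, t = 0.5, 1.0
sample = [(random.uniform(-1, 1), random.uniform(-1, 1)) for _ in range(200)]

def gamma(x, s):                 # geodesic through 0 ending at x
    return (s * x[0], s * x[1])

def T(p):                        # the intermediate-time map T_{t0->t}
    return ((t / t0) * p[0], (t / t0) * p[1])

def dist(p, q):
    return ((p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2) ** 0.5

# T reconstructs the time-t positions from the time-t0 positions.
for x in sample:
    assert dist(T(gamma(x, t0)), gamma(x, t)) < 1e-12

# The Lipschitz ratio of T is exactly t/t0 here (a linear map).
ratios = [dist(T(gamma(a, t0)), T(gamma(b, t0))) / dist(gamma(a, t0), gamma(b, t0))
          for a, b in zip(sample[:-1], sample[1:])]
assert all(abs(r - t / t0) < 1e-9 for r in ratios)
```

Note that, exactly as in Figure 8.2, the map $\gamma(t_0) \mapsto \gamma(1)$ is well-defined even though $\gamma(0) \mapsto \gamma(1)$ is not.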

The second application is a result of "preservation of absolute continuity".

Theorem 8.7 (Absolute continuity of displacement interpolation). Let $M$ be a Riemannian manifold, and let $L(x,v,t)$ be a $C^2$ Lagrangian on $TM \times [0,1]$, satisfying the classical conditions of Definition 7.6, with $\nabla^2_v L > 0$; let $c$ be the associated cost function. Let $\mu_0$ and $\mu_1$ be two probability measures on $M$ such that the optimal cost $C(\mu_0,\mu_1)$ is finite, and let $(\mu_t)_{0\le t\le 1}$ be a displacement interpolation between $\mu_0$ and $\mu_1$. If either $\mu_0$ or $\mu_1$ is absolutely continuous with respect to the volume on $M$, then also $\mu_t$ is absolutely continuous for all $t \in (0,1)$.

Proof of Theorem 8.7. Let us assume for instance that $\mu_1$ is absolutely continuous, and prove that $\mu_{t_0}$ is absolutely continuous ($0 < t_0 < 1$).

First consider the case when $\mu_0$ and $\mu_1$ are compactly supported. Then the whole displacement interpolation is compactly supported, and Theorem 8.5 applies, so there is a Lipschitz map $T$ solving the Monge problem between $\mu_{t_0}$ and $\mu_1$.

Now if $N$ is a set of zero volume, the inclusion $N \subset T^{-1}(T(N))$ implies
\[
\mu_{t_0}[N] \le \mu_{t_0}\bigl[T^{-1}(T(N))\bigr] = (T_\#\mu_{t_0})\bigl[T(N)\bigr] = \mu_1\bigl[T(N)\bigr], \tag{8.7}
\]
and the latter quantity is $0$ since $\mathrm{vol}\,[T(N)] \le \|T\|_{\mathrm{Lip}}^n\, \mathrm{vol}\,[N] = 0$ and $\mu_1$ is absolutely continuous. So (8.7) shows that $\mu_{t_0}[N] = 0$ for any Borel set $N$ of zero volume, and this means precisely that $\mu_{t_0}$ is absolutely continuous.

Actually, the previous computation is not completely rigorous because $T(N)$ is not necessarily Borel measurable; but this is not serious since $T(N)$ can still be included in a negligible Borel set, and then the proof can be repaired in an obvious way.

Now let us turn to the general case where $\mu_0$ and $\mu_1$ are not assumed to be compactly supported. This situation will be handled by a restriction argument.
Assume by contradiction that $\mu_{t_0}$ is not absolutely continuous. Then there exists a set $Z_{t_0}$ with zero volume, such that $\mu_{t_0}[Z_{t_0}] > 0$. Let $Z := \{\gamma \in \Gamma(M);\ \gamma(t_0) \in Z_{t_0}\}$. Then
\[
\Pi[Z] = P\bigl[\gamma(t_0) \in Z_{t_0}\bigr] = \mu_{t_0}[Z_{t_0}] > 0.
\]
By regularity, there exists a compact set $K \subset Z$, such that $\Pi[K] > 0$. Let

\[
\Pi' := \frac{1_K\,\Pi}{\Pi[K]},
\]
and let $\pi' := (e_0,e_1)_\#\Pi'$ be the associated transference plan, and $\mu'_t = (e_t)_\#\Pi'$ the marginal of $\Pi'$ at time $t$. In particular,
\[
\mu'_1 = \frac{(e_1)_\#(1_K\,\Pi)}{\Pi[K]} \le \frac{\mu_1}{\Pi[K]},
\]
so $\mu'_1$ is still absolutely continuous.

By Theorem 7.30(ii), $(\mu'_t)_{0\le t\le 1}$ is a displacement interpolation. Now, $\mu'_{t_0}$ is concentrated on $e_{t_0}(K) \subset e_{t_0}(Z) \subset Z_{t_0}$, so $\mu'_{t_0}$ is singular. But the first part of the proof rules out this possibility, because $\mu'_0$ and $\mu'_1$ are respectively supported in $e_0(K)$ and $e_1(K)$, which are compact, and $\mu'_1$ is absolutely continuous. ⊓⊔

Proof of Mather's estimates

Now let us turn to the proof of Theorem 8.1. It is certainly more important to grasp the idea of the proof (Figure 8.3) than to follow the calculations, so the reader might be content with the informal explanations below and skip the rigorous proof at first reading.

Idea of the proof of Theorem 8.1. Assume, to fix the ideas, that $\gamma_1$ and $\gamma_2$ cross each other at a point $m_0$ and at time $t_0$. Close to $m_0$, these two curves look like two straight lines crossing each other, with respective velocities $v_1$ and $v_2$. Now cut these curves on the time-interval $[t_0-\tau,\, t_0+\tau]$ and on that interval introduce "deviations" (like a plumber installing a new piece of pipe to short-cut a damaged region of a channel) that join the first curve to the second, and vice versa. This amounts to replacing (on a short interval of time) two curves with approximate velocities $v_1$ and $v_2$, by two curves with approximate velocities $(v_1+v_2)/2$. Since the time-interval where the modification occurs is short, everything is concentrated in the neighborhood of $(m_0,t_0)$, so the modification in the Lagrangian action of the two curves is approximately
\[
(2\tau)\,\Bigl[\,2\,L\Bigl(m_0,\ \frac{v_1+v_2}{2},\ t_0\Bigr) - L(m_0,v_1,t_0) - L(m_0,v_2,t_0)\Bigr].
\]
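For the model Lagrangian $L(x,v,t) = |v|^2$ this approximate modification of the action can be evaluated in closed form: it equals $-\tau\,|v_1-v_2|^2$, negative precisely when $v_1 \ne v_2$. The minimal sketch below (plain Python, an editorial addition and not part of the text) just checks this arithmetic.

```python
# Action change produced by the "plumber's shortcut" of the informal proof,
# for the model Lagrangian L(x, v, t) = |v|^2 (strictly convex in v).
def L(v):
    return sum(c * c for c in v)

def action_change(v1, v2, tau):
    mid = tuple((a + b) / 2 for a, b in zip(v1, v2))
    return 2 * tau * (2 * L(mid) - L(v1) - L(v2))

tau = 0.05
v1, v2 = (1.0, 0.0), (0.0, 1.0)
gain = action_change(v1, v2, tau)

# Closed form for this Lagrangian: -tau * |v1 - v2|^2.
expected = -tau * L(tuple(a - b for a, b in zip(v1, v2)))
assert abs(gain - expected) < 1e-12
assert gain < 0                              # strict improvement if v1 != v2
assert action_change(v1, v1, tau) == 0.0     # no improvement if v1 == v2
```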

Fig. 8.3. Principle of Mather's proof: Let $\gamma_1$ and $\gamma_2$ be two action-minimizing curves. If at time $t_0$ the two curves $\gamma_1$ and $\gamma_2$ pass too close to each other, one can devise shortcuts (here drawn as straight lines).

Since $L(m_0,\,\cdot\,,t_0)$ is strictly convex, this quantity is negative if $v_1 \ne v_2$, which means that the total action has been strictly improved by the modification. But then
\[
c(x_1,y_2) + c(x_2,y_1) < c(x_1,y_1) + c(x_2,y_2),
\]
in contradiction to our assumptions. The only possibility available is that $v_1 = v_2$, i.e. at the crossing point the curves have the same position and the same velocity; but then, since they are solutions of a second-order differential equation, they have to coincide at all times. ⊓⊔

It only remains to make this argument quantitative: If the two curves pass close to each other at time $t_0$, then their velocities at that time will also be close to each other, and so the trajectories have to be close to each other for all times in $[0,1]$. Unfortunately this will not be so easy.

Rigorous proof of Theorem 8.1. Step 1: Localization. The goal of this step is to show that the problem reduces to a local computation that can be performed as if we were in Euclidean space, and that it is sufficient to control the difference of velocities at time $t_0$ (as in the above sketchy explanation). If the reader is ready to believe in these two statements, then he or she can go directly to Step 2.

For brevity, let $\gamma_1 \cup \gamma_2$ stand for the union of the images of the minimizing paths $\gamma_1$ and $\gamma_2$. For any point $x$ in $\mathrm{proj}_M(\mathcal{V})$, there is a

small ball $B_{r_x}(x)$ which is diffeomorphic to an open set in $\mathbb{R}^n$, and by compactness one can cover a neighborhood of $\gamma_1 \cup \gamma_2$ by a finite number of such balls $B_j$, each of them having radius no less than $\delta > 0$. Without loss of generality, all these balls are included in $\mathrm{proj}_M(\mathcal{V})$, and it can be assumed that whenever two points $X_1$ and $X_2$ in $\gamma_1 \cup \gamma_2$ are separated by a distance less than $\delta/4$, then one of the balls $B_j$ contains $B_{\delta/4}(X_1) \cup B_{\delta/4}(X_2)$.

If $\gamma_1(t_0)$ and $\gamma_2(t_0)$ are separated by a distance at least $\delta/4$, then the conclusion is obvious. Otherwise, choose $\tau$ small enough that $\tau S \le \delta/4$ (recall that $S$ is a strict bound on the maximum speed along the curves); then on the time-interval $[t_0-\tau,\, t_0+\tau]$ the curves never leave the balls $B_{\delta/4}(X_1) \cup B_{\delta/4}(X_2)$, and therefore the whole trajectories of $\gamma_1$ and $\gamma_2$ on that time-interval have to stay within a single ball $B_j$. If one takes into account positions, velocities and time, the system is confined within $B_j \times B_S(0) \times [0,1] \subset \mathcal{V}$.

On any of these balls $B_j$, one can introduce a Euclidean system of coordinates, and perform all computations in that system (write $L$ in those coordinates, etc.). The distance induced on $B_j$ by that system of coordinates will not be the same as the original Riemannian distance, but it can be bounded from above and below by multiples thereof. So we can pretend that we are really working with a Euclidean metric, and all conclusions that are obtained, involving only what happens inside the ball $B_j$, will remain true up to changing the bounds. Then, for the sake of all computations, we can freely add points as if we were working in Euclidean space.
If it can be shown, in that system of coordinates, that
\[
\bigl|\dot\gamma_1(t_0) - \dot\gamma_2(t_0)\bigr| \le C\,\bigl|\gamma_1(t_0) - \gamma_2(t_0)\bigr|^{\beta}, \tag{8.8}
\]
then this means that $(\gamma_1(t_0), \dot\gamma_1(t_0))$ and $(\gamma_2(t_0), \dot\gamma_2(t_0))$ are very close to each other in $TM$; more precisely they are separated by a distance which is $O\bigl(d(\gamma_1(t_0),\gamma_2(t_0))^{\beta}\bigr)$. Then by Assumption (i) and Cauchy–Lipschitz theory this bound will be propagated backward and forward in time, so the distance between $(\gamma_1(t), \dot\gamma_1(t))$ and $(\gamma_2(t), \dot\gamma_2(t))$ will remain bounded by $O\bigl(d(\gamma_1(t_0),\gamma_2(t_0))^{\beta}\bigr)$. Thus to conclude the argument it is sufficient to prove (8.8).

Step 2: Construction of shortcuts. First some notation: Let us write $x_1(t) = \gamma_1(t)$, $x_2(t) = \gamma_2(t)$, $v_1(t) = \dot\gamma_1(t)$, $v_2(t) = \dot\gamma_2(t)$, and also $X_1 = x_1(t_0)$, $V_1 = v_1(t_0)$, $X_2 = x_2(t_0)$, $V_2 = v_2(t_0)$. The goal is to control $|V_1 - V_2|$ by $|X_1 - X_2|$.

Let $x_{12}(t)$ be defined by
\[
x_{12}(t) =
\begin{cases}
x_1(t) & \text{for } t \in [0,\, t_0-\tau];\\[2mm]
\dfrac{x_1(t)+x_2(t)}{2}
+ \Bigl(\dfrac{t_0+\tau-t}{2\tau}\Bigr)\,\dfrac{x_1(t_0-\tau)-x_2(t_0-\tau)}{2}
+ \Bigl(\dfrac{t-t_0+\tau}{2\tau}\Bigr)\,\dfrac{x_2(t_0+\tau)-x_1(t_0+\tau)}{2}
& \text{for } t \in [t_0-\tau,\, t_0+\tau];\\[2mm]
x_2(t) & \text{for } t \in [t_0+\tau,\, 1].
\end{cases}
\]
Note that $x_{12}$ is a continuous function of $t$; it is a path that starts along $\gamma_1$, then switches to $\gamma_2$. Let $v_{12}(t)$ stand for its time-derivative:
\[
v_{12}(t) =
\begin{cases}
v_1(t) & \text{for } t \in [0,\, t_0-\tau];\\[2mm]
\dfrac{v_1(t)+v_2(t)}{2}
- \dfrac{\bigl[x_1(t_0-\tau)-x_2(t_0-\tau)\bigr] + \bigl[x_1(t_0+\tau)-x_2(t_0+\tau)\bigr]}{4\tau}
& \text{for } t \in [t_0-\tau,\, t_0+\tau];\\[2mm]
v_2(t) & \text{for } t \in [t_0+\tau,\, 1].
\end{cases}
\]
Then the path $x_{21}(t)$ and its time-derivative $v_{21}(t)$ are defined symmetrically (see Figure 8.4). These definitions are rather natural: First we try to construct paths on $[t_0-\tau,\, t_0+\tau]$ whose velocity is about the half of the velocities of $\gamma_1$ and $\gamma_2$; then we correct these paths by adding simple functions (linear in time) to make them match the correct endpoints.

I shall conclude this step with some basic estimates about the paths $x_{12}$ and $x_{21}$. For a start, note that on the time-interval $[t_0-\tau,\, t_0+\tau]$,
\[
\begin{cases}
x_{12} - \dfrac{x_1+x_2}{2} = -\Bigl(x_{21} - \dfrac{x_1+x_2}{2}\Bigr),\\[2mm]
v_{12} - \dfrac{v_1+v_2}{2} = -\Bigl(v_{21} - \dfrac{v_1+v_2}{2}\Bigr).
\end{cases} \tag{8.9}
\]
In the sequel, the symbol $O(m)$ will stand for any expression which is bounded by $Cm$, where $C$ only depends on $\mathcal{V}$ and on the regularity bounds on the Lagrangian $L$ on $\mathcal{V}$. From Cauchy–Lipschitz theory and Assumption (i),
\[
|x_1(t) - x_2(t)| + |v_1(t) - v_2(t)| = O\bigl(|X_1-X_2| + |V_1-V_2|\bigr), \tag{8.10}
\]

Fig. 8.4. The paths $x_{12}(t)$ and $x_{21}(t)$ obtained by using the shortcuts to switch from one original path to the other.

and then by plugging this back into the equation for minimizing curves we obtain
\[
|\dot v_1(t) - \dot v_2(t)| = O\bigl(|X_1-X_2| + |V_1-V_2|\bigr).
\]
Upon integration in time, these bounds imply
\[
x_1(t) - x_2(t) = (X_1-X_2) + O\bigl(\tau\,(|X_1-X_2| + |V_1-V_2|)\bigr); \tag{8.11}
\]
\[
v_1(t) - v_2(t) = (V_1-V_2) + O\bigl(\tau\,(|X_1-X_2| + |V_1-V_2|)\bigr); \tag{8.12}
\]
and therefore also
\[
x_1(t) - x_2(t) = (X_1-X_2) + (t-t_0)(V_1-V_2) + O\bigl(\tau^2\,(|X_1-X_2| + |V_1-V_2|)\bigr). \tag{8.13}
\]
As a consequence of (8.12), if $\tau$ is small enough (depending only on the Lagrangian $L$),
\[
|v_1(t) - v_2(t)| \ge \frac{|V_1-V_2|}{2} - O\bigl(\tau\,|X_1-X_2|\bigr). \tag{8.14}
\]
Next, from Cauchy–Lipschitz again,
\[
x_1(t_0+\tau) - x_2(t_0+\tau) = X_1 - X_2 + \tau\,(V_1-V_2) + O\bigl(\tau^2\,(|X_1-X_2| + |V_1-V_2|)\bigr);
\]
and since a similar expression holds true with $\tau$ replaced by $-\tau$, one has

\[
\frac{\bigl[x_1(t_0+\tau)-x_2(t_0+\tau)\bigr] + \bigl[x_1(t_0-\tau)-x_2(t_0-\tau)\bigr]}{2}
= (X_1-X_2) + O\bigl(\tau^2\,(|X_1-X_2|+|V_1-V_2|)\bigr), \tag{8.15}
\]
and also
\[
\frac{\bigl[x_1(t_0+\tau)-x_2(t_0+\tau)\bigr] - \bigl[x_1(t_0-\tau)-x_2(t_0-\tau)\bigr]}{2}
= \tau\,(V_1-V_2) + O\bigl(\tau^2\,(|X_1-X_2|+|V_1-V_2|)\bigr). \tag{8.16}
\]
It follows from (8.15) that
\[
v_{12}(t) = \frac{v_1(t)+v_2(t)}{2} - \frac{X_1-X_2}{2\tau} + O\bigl(\tau\,(|X_1-X_2|+|V_1-V_2|)\bigr). \tag{8.17}
\]
After integration in time and use of (8.16), one obtains
\[
x_{12}(t) - \frac{x_1(t)+x_2(t)}{2} = O\bigl(|X_1-X_2| + \tau\,|V_1-V_2|\bigr). \tag{8.18}
\]
In particular,
\[
|x_{12}(t) - x_{21}(t)| = O\bigl(|X_1-X_2| + \tau\,|V_1-V_2|\bigr). \tag{8.19}
\]
Step 3: Taylor formulas and regularity of $L$. Now I shall evaluate the behavior of $L$ along the old and the new paths, using the regularity assumption (ii). From that point on, I shall drop the time variable for simplicity (but it is implicit in all the computations). First,
\[
L(x_1,v_1) - L\Bigl(\frac{x_1+x_2}{2},\, v_1\Bigr)
= \nabla_x L\Bigl(\frac{x_1+x_2}{2},\, v_1\Bigr)\cdot\frac{x_1-x_2}{2} + O\bigl(|x_1-x_2|^{1+\alpha}\bigr);
\]
similarly

\[
L(x_2,v_2) - L\Bigl(\frac{x_1+x_2}{2},\, v_2\Bigr)
= \nabla_x L\Bigl(\frac{x_1+x_2}{2},\, v_2\Bigr)\cdot\frac{x_2-x_1}{2} + O\bigl(|x_1-x_2|^{1+\alpha}\bigr).
\]
Moreover,
\[
\nabla_x L\Bigl(\frac{x_1+x_2}{2},\, v_1\Bigr) - \nabla_x L\Bigl(\frac{x_1+x_2}{2},\, v_2\Bigr) = O\bigl(|v_1-v_2|^{\alpha}\bigr).
\]
The combination of these three identities, together with estimates (8.11) and (8.12), yields
\begin{align*}
L(x_1,v_1) + L(x_2,v_2) &- \Bigl(L\Bigl(\frac{x_1+x_2}{2},\, v_1\Bigr) + L\Bigl(\frac{x_1+x_2}{2},\, v_2\Bigr)\Bigr)\\
&= O\bigl(|x_1-x_2|^{1+\alpha} + |x_1-x_2|\,|v_1-v_2|^{\alpha}\bigr)\\
&= O\bigl(|X_1-X_2|^{1+\alpha} + \tau^{1+\alpha}|V_1-V_2|^{1+\alpha} + |X_1-X_2|\,|V_1-V_2|^{\alpha} + \tau\,|V_1-V_2|^{1+\alpha}\bigr).
\end{align*}
Next, in an analogous way,
\[
L\Bigl(x_{12},\, \frac{v_1+v_2}{2}\Bigr) - L(x_{12}, v_{12})
= \nabla_v L\Bigl(x_{12},\, \frac{v_1+v_2}{2}\Bigr)\cdot\Bigl(\frac{v_1+v_2}{2} - v_{12}\Bigr) + O\Bigl(\Bigl|\frac{v_1+v_2}{2} - v_{12}\Bigr|^{1+\alpha}\Bigr);
\]
\[
L\Bigl(x_{21},\, \frac{v_1+v_2}{2}\Bigr) - L(x_{21}, v_{21})
= \nabla_v L\Bigl(x_{21},\, \frac{v_1+v_2}{2}\Bigr)\cdot\Bigl(\frac{v_1+v_2}{2} - v_{21}\Bigr) + O\Bigl(\Bigl|\frac{v_1+v_2}{2} - v_{21}\Bigr|^{1+\alpha}\Bigr);
\]
\[
\nabla_v L\Bigl(x_{12},\, \frac{v_1+v_2}{2}\Bigr) - \nabla_v L\Bigl(x_{21},\, \frac{v_1+v_2}{2}\Bigr) = O\bigl(|x_{12}-x_{21}|^{\alpha}\bigr).
\]
Combining this with (8.9), (8.17) and (8.19), one finds
\begin{align*}
L(x_{12},v_{12}) + L(x_{21},v_{21}) &- \Bigl(L\Bigl(x_{12},\, \frac{v_1+v_2}{2}\Bigr) + L\Bigl(x_{21},\, \frac{v_1+v_2}{2}\Bigr)\Bigr)\\
&= O\Bigl(\Bigl|\frac{v_1+v_2}{2} - v_{12}\Bigr|^{1+\alpha} + |x_{12}-x_{21}|^{\alpha}\,\Bigl|\frac{v_1+v_2}{2} - v_{12}\Bigr|\Bigr)\\
&= O\Bigl(\frac{|X_1-X_2|^{1+\alpha}}{\tau^{1+\alpha}} + \tau^{1+\alpha}\,|V_1-V_2|^{1+\alpha}\Bigr).
\end{align*}

After that,
\[
L\Bigl(x_{12},\, \frac{v_1+v_2}{2}\Bigr) - L\Bigl(\frac{x_1+x_2}{2},\, \frac{v_1+v_2}{2}\Bigr)
= \nabla_x L\Bigl(\frac{x_1+x_2}{2},\, \frac{v_1+v_2}{2}\Bigr)\cdot\Bigl(x_{12} - \frac{x_1+x_2}{2}\Bigr) + O\Bigl(\Bigl|x_{12} - \frac{x_1+x_2}{2}\Bigr|^{1+\alpha}\Bigr);
\]
\[
L\Bigl(x_{21},\, \frac{v_1+v_2}{2}\Bigr) - L\Bigl(\frac{x_1+x_2}{2},\, \frac{v_1+v_2}{2}\Bigr)
= \nabla_x L\Bigl(\frac{x_1+x_2}{2},\, \frac{v_1+v_2}{2}\Bigr)\cdot\Bigl(x_{21} - \frac{x_1+x_2}{2}\Bigr) + O\Bigl(\Bigl|x_{21} - \frac{x_1+x_2}{2}\Bigr|^{1+\alpha}\Bigr);
\]
and now by (8.9) the terms in $\nabla_x$ cancel each other exactly upon summation, so the bound (8.18) leads to
\begin{align*}
L\Bigl(x_{12},\, \frac{v_1+v_2}{2}\Bigr) + L\Bigl(x_{21},\, \frac{v_1+v_2}{2}\Bigr) &- 2\,L\Bigl(\frac{x_1+x_2}{2},\, \frac{v_1+v_2}{2}\Bigr)\\
&= O\Bigl(\Bigl|x_{12} - \frac{x_1+x_2}{2}\Bigr|^{1+\alpha}\Bigr)\\
&= O\bigl(|X_1-X_2|^{1+\alpha} + \tau^{1+\alpha}\,|V_1-V_2|^{1+\alpha}\bigr).
\end{align*}
Step 4: Comparison of actions and strict convexity. From our minimization assumption,
\[
\mathcal{A}(x_{12}) + \mathcal{A}(x_{21}) \le \mathcal{A}(x_1) + \mathcal{A}(x_2),
\]
which of course can be rewritten
\[
\int_{t_0-\tau}^{t_0+\tau} \Bigl(L\bigl(x_{12}(t),v_{12}(t),t\bigr) + L\bigl(x_{21}(t),v_{21}(t),t\bigr) - L\bigl(x_1(t),v_1(t),t\bigr) - L\bigl(x_2(t),v_2(t),t\bigr)\Bigr)\,dt \le 0. \tag{8.20}
\]
From Step 3, we can replace in the integrand all positions by $(x_1+x_2)/2$, and $v_{12}$, $v_{21}$ by $(v_1+v_2)/2$, up to a small error. Collecting the various error terms, and taking into account the smallness of $\tau$, one obtains (dropping the $t$ variable again)
\[
\int_{t_0-\tau}^{t_0+\tau} \Bigl\{L\Bigl(\frac{x_1+x_2}{2},\, v_1,\, t\Bigr) + L\Bigl(\frac{x_1+x_2}{2},\, v_2,\, t\Bigr) - 2\,L\Bigl(\frac{x_1+x_2}{2},\, \frac{v_1+v_2}{2},\, t\Bigr)\Bigr\}\,dt
\ \le\ C\,\Bigl(\frac{|X_1-X_2|^{1+\alpha}}{\tau^{\alpha}} + \tau\,|V_1-V_2|^{1+\alpha}\Bigr). \tag{8.21}
\]

On the other hand, from the convexity condition (iii) and (8.14), the left-hand side of (8.21) is bounded below by
\[
\int_{t_0-\tau}^{t_0+\tau} K\,|v_1 - v_2|^{2+\kappa}\,dt \ \ge\ K'\,\tau\,\bigl(|V_1-V_2| - A\tau\,|X_1-X_2|\bigr)^{2+\kappa}. \tag{8.22}
\]
If $|V_1-V_2| \le 2A\tau\,|X_1-X_2|$, then the proof is finished. If this is not the case, this means that $|V_1-V_2| - A\tau\,|X_1-X_2| \ge |V_1-V_2|/2$, and then the comparison of the upper bound (8.21) and the lower bound (8.22) yields
\[
|V_1-V_2|^{2+\kappa} \le C\,\Bigl(\frac{|X_1-X_2|^{1+\alpha}}{\tau^{1+\alpha}} + \tau\,|V_1-V_2|^{1+\alpha}\Bigr). \tag{8.23}
\]
If $|V_1-V_2| = 0$, then the proof is finished. Otherwise, the conclusion follows by choosing $\tau$ small enough that $C\tau\,|V_1-V_2|^{1+\alpha} \le (1/2)\,|V_1-V_2|^{2+\kappa}$; then $\tau = O\bigl(|V_1-V_2|^{1+\kappa-\alpha}\bigr)$, and (8.23) implies
\[
|V_1-V_2| = O\bigl(|X_1-X_2|^{\beta}\bigr), \qquad \beta = \frac{1+\alpha}{(1+\kappa-\alpha)(1+\alpha) + 2 + \kappa}. \tag{8.24}
\]
In the particular case when $\kappa = 0$ and $\alpha = 1$, one has
\[
|V_1-V_2|^2 \le C\,\Bigl(\frac{|X_1-X_2|^2}{\tau^2} + \tau\,|V_1-V_2|^2\Bigr),
\]
and if $\tau$ is small enough this implies just
\[
|V_1-V_2| \le C\,\frac{|X_1-X_2|}{\tau}. \tag{8.25}
\]
The upper bound on $\tau$ depends on the regularity and strict convexity of $L$ in $\mathcal{V}$, but also on $t_0$, since $\tau$ cannot be greater than $\min(t_0,\, 1-t_0)$. This is actually the only way in which $t_0$ explicitly enters the estimates. So inequality (8.25) concludes the argument. ⊓⊔

Complement: Ruling out focalization by shortening

This section is about the application of the shortening technique to a classical problem in Riemannian geometry; it may be skipped at first reading.
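Since the exponent in (8.24) is explicit, its consistency with the final statement of Theorem 8.1 can be checked mechanically. The sketch below (plain Python, an editorial addition rather than part of the text) evaluates $\beta(\alpha,\kappa)$ and confirms that the case $\alpha = 1$, $\kappa = 0$ gives $\beta = 1$, while less regularity or weaker convexity degrades the exponent.

```python
# Exponent from (8.24): beta = (1+alpha) / ((1+kappa-alpha)(1+alpha) + 2+kappa),
# for a Hoelder exponent alpha in (0,1] and a convexity parameter kappa >= 0.
def beta(alpha, kappa):
    return (1 + alpha) / ((1 + kappa - alpha) * (1 + alpha) + 2 + kappa)

# The C^{1,1} / (2+0)-convex case of Theorem 8.1: beta = 1.
assert beta(1.0, 0.0) == 1.0

# Lowering alpha or raising kappa strictly degrades the exponent.
assert beta(0.5, 0.0) < 1.0
assert beta(1.0, 0.5) < 1.0

# beta stays in (0, 1] on the whole admissible range sampled here.
assert all(0.0 < beta(a / 10, k / 10) <= 1.0
           for a in range(1, 11) for k in range(0, 11))
```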

Let $M$ be a smooth Riemannian manifold and let $L = L(x,v,t)$ be a $C^2$ Lagrangian on $TM \times [0,1]$, satisfying the classical conditions of Definition 7.6, together with $\nabla^2_v L > 0$. Let $X_t(x_0, v_0)$ be the position at time $t$ of the solution of the flow associated with the Lagrangian $L$, starting from the initial position $x_0$ with velocity $v_0$ at time $0$.

It is said that there is focalization on another point $x' = X_{t'}(x_0, v_0)$, $t' > 0$, if the differential map $d_{v_0} X_{t'}(x_0,\,\cdot\,)$ is singular (not invertible). In words, this means that starting from $x_0$ it is very difficult to make the curve explore a whole neighborhood of $x'$ by varying its initial velocity; instead, trajectories have a tendency to "concentrate" at time $t'$ along certain preferred directions around $x'$.

The reader can test his or her understanding of the method presented in the previous section by working out the details of the following problem.

Problem 8.8 (Focalization is impossible before the cut locus). With the same notation as before, let $\gamma : [0,1] \to M$ be a minimizing curve starting from some initial point $x_0$. By using the same strategy of proof as for Mather's estimates, show that, starting from $x_0$, focalization is impossible at $\gamma(t_*)$ if $0 < t_* < 1$.

Hint: Here is a possible reasoning:

(a) Notice that the restriction of $\gamma$ to $[0, t_*]$ is the unique minimizing curve on the time-interval $[0, t_*]$ joining $x_0$ to $x_* = \gamma(t_*)$.

(b) Take $y$ close to $x_*$ and introduce a minimizing curve $\widetilde\gamma$ on $[0, t_*]$ joining $x_0$ to $y$; show that the initial velocity $\widetilde v_0$ of $\widetilde\gamma$ is close to the initial velocity $v_0$ of $\gamma$ if $y$ is close enough to $x_*$.

(c) Bound the difference between the action of $\gamma$ and the action of $\widetilde\gamma$ by $O(d(x_*, y))$ (recall that the speeds along $\gamma$ and $\widetilde\gamma$ are bounded by a uniform constant, depending only on the behavior of $L$ in some compact set around $\gamma$).

(d) Construct a path $x_0 \to \gamma(1)$ by first going along $\widetilde\gamma$ up to time $t = t_* - \tau$ ($\tau$ small enough), then using a shortcut from $\widetilde\gamma(t_* - \tau)$ to $\gamma(t_* + \tau)$, finally going along $\gamma$ up to time $1$. Show that the gain of action is at least of the order of $\tau\,|V - \widetilde V|^2 - O\bigl(d(x_*,y)^2/\tau\bigr)$, where $V = \dot\gamma(t_*)$ and $\widetilde V = \dot{\widetilde\gamma}(t_*)$. Deduce that $|V - \widetilde V| = O\bigl(d(x_*,y)/\tau\bigr)$.

(e) Conclude that $|v_0 - \widetilde v_0| = O\bigl(d(x_*,y)/\tau\bigr)$. Use a contradiction argument to deduce that the differential map $d_{v_0} X_{t_*}(x_0,\,\cdot\,)$ is invertible, and more precisely that its inverse is of size $O\bigl((1-t_*)^{-1}\bigr)$ as a function of $t_*$.
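The focalization phenomenon of Problem 8.8 can be observed concretely on the round sphere, where the geodesic flow is explicit: geodesics issued from a point $p$ with nearby unit initial velocities spread apart like $\sin t$ and refocus at the antipode at $t = \pi$. The sketch below (plain Python with explicit formulas, an editorial addition and not part of the text) exhibits this degeneracy of the endpoint map $v \mapsto X_t(p,v)$.

```python
import math

# Unit-speed geodesics on the round sphere S^2 in R^3, starting at p:
#     gamma_v(t) = cos(t) p + sin(t) v,   with |p| = |v| = 1, v orthogonal to p.
# Two such geodesics satisfy |gamma_v(t) - gamma_w(t)| = sin(t) |v - w|,
# so the endpoint map v -> gamma_v(t) degenerates exactly at t = pi
# (the antipode of p, its focal point).
p = (0.0, 0.0, 1.0)

def geodesic(v, t):
    return tuple(math.cos(t) * pc + math.sin(t) * vc for pc, vc in zip(p, v))

def dist(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

eps = 1e-3
v = (1.0, 0.0, 0.0)
w = (math.cos(eps), math.sin(eps), 0.0)      # slightly rotated initial velocity

# Before t = pi the two trajectories stay sin(t)-separated...
for t in (0.5, 1.5, 2.5):
    spread = dist(geodesic(v, t), geodesic(w, t))
    assert abs(spread - math.sin(t) * dist(v, w)) < 1e-12

# ...and at t = pi the endpoints coincide (up to roundoff): focalization.
assert dist(geodesic(v, math.pi), geodesic(w, math.pi)) < 1e-12
```

Here the focal time $\pi$ is also the cut time, consistent with the rule proved above that focalization cannot occur strictly before the cut locus.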

In the important case when $L(x,v,t) = |v|^2$, what we have proven is a well-known result in Riemannian geometry; to explain it I shall first recall the notions of cut locus and focal points.

Let $\gamma$ be a minimizing geodesic, and let $t_c$ be the largest time such that for all $t < t_c$, $\gamma$ is minimizing between $\gamma_0$ and $\gamma_t$. Roughly speaking, $\gamma(t_c)$ is the first point at which the geodesic ceases to be minimizing; $\gamma$ may or may not be minimizing between $\gamma(0)$ and $\gamma(t_c)$, but it is certainly not minimizing between $\gamma(0)$ and $\gamma(t_c + \varepsilon)$, for any $\varepsilon > 0$. Then the point $\gamma(t_c)$ is said to be a cut point of $\gamma_0$ along $\gamma$. When the initial position $x_0$ of the geodesic is fixed and the geodesic varies, the set of all cut points constitutes the cut locus of $x_0$.

Next, two points $x_0$ and $x'$ are said to be focal (or conjugate) if $x'$ can be written as $\exp_{x_0}(t'v_0)$, where the differential $d_{v_0}\exp_{x_0}(t'\,\cdot\,)$ is not invertible. As before, this means that $x'$ can be obtained from $x_0$ by a geodesic $\gamma$ with $\dot\gamma(0) = v_0$, such that it is difficult to explore a whole neighborhood of $x'$ by slightly changing the initial velocity $v_0$.

With these notions, the main result of Problem 8.8 can be summarized as follows: Focalization never occurs before the cut locus. It can occur either at the cut locus, or after.

Example 8.9. On the sphere $S^2$, the north pole $N$ has only one cut point, which is also its only focal point, namely the south pole $S$. Fix a geodesic $\gamma$ going from $\gamma(0) = N$ to $\gamma(1) = S$, and deform your sphere out of a neighborhood of $\gamma\,[0,1]$, so as to dig a shortcut that allows you to go from $N$ to $\gamma(1/2)$ in a more efficient way than using $\gamma$. This will create a new cut point along $\gamma$, and $S$ will not be a cut point along $\gamma$ any longer (it might still be a cut point along some other geodesic). On the other hand, $S$ will still be the only focal point along $\gamma$.

Remark 8.10.
If $x$ and $y$ are not conjugate, and joined by a unique minimizing geodesic $\gamma$, then it is easy to show that there is a neighborhood $U$ of $y$ such that any $z$ in $U$ is also joined to $x$ by a unique minimizing geodesic. Indeed, any minimizing geodesic has to be close to $\gamma$, therefore its initial velocity should be close to $\dot\gamma_0$; and by the local inversion theorem, there are neighborhoods $W_0$ of $\dot\gamma_0$ and $U$ of $y$ such that there is a unique correspondence between the initial velocity $\dot\gamma_0 \in W_0$ of a minimizing curve starting from $x$, and the final point $\gamma(1) \in U$.

Thus the cut locus of a point $x$ can be separated into two sets:
(a) those points $y$ for which there are at least two distinct minimizing geodesics going from $x$ to $y$;

8 The Monge–Mather shortening principle

(b) those points $y$ for which there is a unique minimizing geodesic, but which are focal points of $x$.

Introduction to Mather's theory

In this section I shall present an application of Theorem 8.1 to the theory of Lagrangian dynamical systems. This is mainly to give the reader an idea of Mather's motivations, and to let him or her better understand the link between optimal transport and Mather's theory. These results will not play any role in the sequel of the notes.

Theorem 8.11 (Lipschitz graph theorem). Let $M$ be a compact Riemannian manifold, let $L = L(x,v,t)$ be a Lagrangian function on $TM \times \mathbb{R}$, and $T > 0$, such that
(a) $L$ is $T$-periodic in the $t$ variable, i.e. $L(x,v,t+T) = L(x,v,t)$;
(b) $L$ is of class $C^2$ in all variables;
(c) $\nabla_v^2 L$ is (strictly) positive everywhere, and $L$ is superlinear in $v$.
Define as usual the action by $A^{s,t}(\gamma) = \int_s^t L(\gamma_\tau, \dot\gamma_\tau, \tau)\,d\tau$. Let $c^{s,t}$ be the associated cost function on $M \times M$, and $C^{s,t}$ the corresponding optimal cost functional on $P(M) \times P(M)$.

Let $\mu$ be a probability measure solving the minimization problem
\[ \inf_{\mu \in P(M)} C^{0,T}(\mu,\mu), \tag{8.26} \]
and let $(\mu_t)_{0 \le t \le T}$ be a displacement interpolation between $\mu_0 = \mu$ and $\mu_T = \mu$. Extend $(\mu_t)$ into a $T$-periodic curve $\mu : \mathbb{R} \to P(M)$ defined for all times. Then
(i) For all $t_0 < t_1$, the curve $(\mu_t)_{t_0 \le t \le t_1}$ still defines a displacement interpolation;
(ii) The optimal transport cost $C^{t,t+T}(\mu_t,\mu_t)$ is independent of $t$;
(iii) For any $t_0 \in \mathbb{R}$, and for any $k \in \mathbb{N}$, $\mu_{t_0}$ is a minimizer for $C^{t_0,t_0+kT}(\mu,\mu)$.
Moreover, there is a random curve $(\gamma_t)_{t \in \mathbb{R}}$, such that
(iv) For all $t \in \mathbb{R}$, $\operatorname{law}(\gamma_t) = \mu_t$;
(v) For any $t_0 < t_1$, the curve $(\gamma_t)_{t_0 \le t \le t_1}$ is action-minimizing;
(vi) The map $\gamma_0 \mapsto \dot\gamma_0$ is well-defined and Lipschitz.

Remark 8.12. Since $c^{0,T}$ is not assumed to be nonnegative, the optimal transport problem (8.26) is not trivial.

Remark 8.13. If $L$ does not depend on $t$, then one can apply the previous result for any $T = 2^{-\ell}$, and then use a compactness argument to construct a constant curve $(\mu_t)_{t\in\mathbb{R}}$ satisfying Properties (i)–(vi) above. In particular $\mu_0$ is a stationary measure for the Lagrangian system.

Before giving its proof, let me explain briefly why Theorem 8.11 is interesting from the point of view of the dynamics. A trajectory of the dynamical system defined by the Lagrangian $L$ is a curve $\gamma$ which is locally action-minimizing; that is, one can cover the time-interval by small subintervals on which the curve is action-minimizing. It is a classical problem in mechanics to construct and study periodic trajectories having certain given properties. Theorem 8.11 does not construct a periodic trajectory, but at least it constructs a random trajectory $\gamma$ (or equivalently a probability measure $\Pi$ on the set of trajectories) which is periodic on average: The law $\mu_t$ of $\gamma_t$ satisfies $\mu_{t+T} = \mu_t$. This $\Pi$ can also be thought of as a probability measure on the set of all possible trajectories of the system. Of course this in itself is not too striking, since there may be a great deal of invariant measures for a dynamical system, and some of them are often easy to construct. The important point in the conclusion of Theorem 8.11 is that the curve $\gamma$ is not "too random", in the sense that the random variable $(\gamma(0), \dot\gamma(0))$ takes values in a Lipschitz graph. (If $(\gamma(0), \dot\gamma(0))$ were a deterministic element in $TM$, this would mean that $\Pi$ just sees a single periodic curve. Here we may have an infinite collection of curves, but still it is not "too large".)

Another remarkable property of the curves $\gamma$ is the fact that the minimization property holds along any time-interval in $\mathbb{R}$, not necessarily small.
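To see the minimization (8.26) in action, here is a toy finite-state analogue (entirely my own construction; the two-point space and the cost matrix are made up, and nothing here is from the text). On two states, all couplings of a measure $\mu = (p, 1-p)$ with itself form a one-parameter family, so the optimal self-transport cost can be computed exactly and minimized by a grid search:

```python
# Toy finite analogue of (8.26): on a two-point space with a made-up
# (possibly negative) cost matrix c, minimize over mu = (p, 1-p) the
# optimal cost of transporting mu onto itself.
c = [[-1.0, 0.3], [0.5, -0.2]]   # c[i][j]: cost of moving i -> j (arbitrary choice)

def C(p):
    # Couplings of (p, 1-p) with itself are parametrized by t = pi_00:
    # pi = [[t, p-t], [p-t, 1-2p+t]],  with max(0, 2p-1) <= t <= p.
    # The cost is affine in t, so the optimum sits at an endpoint.
    def cost(t):
        return t*c[0][0] + (p - t)*(c[0][1] + c[1][0]) + (1 - 2*p + t)*c[1][1]
    lo, hi = max(0.0, 2*p - 1), p
    return min(cost(lo), cost(hi))

ps = [k / 1000 for k in range(1001)]
p_star = min(ps, key=C)
# With this cost matrix, staying put at state 0 is cheapest (c[0][0] is the
# most negative entry), so the minimizer is the Dirac mass at state 0:
assert p_star == 1.0
assert abs(C(1.0) - c[0][0]) < 1e-12
```

This mirrors the role of the Dirac mass $\delta_{x_0}$ in Example 8.14 below: the optimal "periodic" measure concentrates where the running cost of staying in place is smallest.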
Example 8.14. Let $M$ be a compact Riemannian manifold, and let $L(x,v,t) = |v|^2/2 - V(x)$, where $V$ has a unique maximum $x_0$. Then Mather's procedure selects the probability measure $\delta_{x_0}$, and the stationary curve $\gamma \equiv x_0$ (which is an unstable equilibrium).

It is natural to try to construct more "interesting" measures and curves by Mather's procedure. One way to do so is to change the Lagrangian, for instance replace $L(x,v,t)$ by $L_\omega := L(x,v,t) + \omega(x)\cdot v$, where $\omega$ is a vector field on $M$. Indeed,

• If $\omega$ is closed (as a differential form), that is if $\nabla\omega$ is a symmetric operator, then $L$ and $L_\omega$ have the same Euler–Lagrange equations, so the associated dynamical system is the same;
• If $\omega$ is exact, that is if $\omega = \nabla f$ for some function $f : M \to \mathbb{R}$, then $L$ and $L_\omega$ have the same minimizing curves.

As a consequence, one may explore various parts of the dynamics by letting $\omega$ vary over the finite-dimensional group obtained by taking the quotient of the closed forms by the exact forms. In particular, one can make sure that the expected mean "rotation number" $\mathbb{E}\,\frac1T\int_0^T \dot\gamma\,dt$ takes nontrivial values as $\omega$ varies.

Proof of Theorem 8.11. I shall repeatedly use Proposition 7.16 and Theorem 7.21. First, $C^{0,T}(\mu,\mu)$ is a lower semicontinuous function of $\mu$, bounded below by $T(\inf L) > -\infty$, so the minimization problem (8.26) does admit a solution. Define $\mu_0 = \mu_T = \mu$, then define $\mu_t$ by displacement interpolation for $0 < t < T$, and extend the result by periodicity.

Let $k \in \mathbb{N}$ be given and let $\tilde\mu$ be a minimizer for the variational problem
\[ \inf_{\mu \in P(M)} C^{0,kT}(\mu,\mu). \]
We shall see later that actually $\mu$ is a solution of this problem. For the moment, let $(\tilde\mu_t)_{t\in\mathbb{R}}$ be obtained first by taking a displacement interpolation between $\tilde\mu_0 = \tilde\mu$ and $\tilde\mu_{kT} = \tilde\mu$; and then by extending the result by $kT$-periodicity.

On the one hand,
\[ C^{0,kT}(\tilde\mu,\tilde\mu) \le C^{0,kT}(\mu_0,\mu_{kT}) \le \sum_{j=0}^{k-1} C^{jT,(j+1)T}\bigl(\mu_{jT},\mu_{(j+1)T}\bigr) = k\,C^{0,T}(\mu,\mu). \tag{8.27} \]
On the other hand, by definition of $\mu$,
\[ C^{0,T}(\mu,\mu) \le C^{0,T}\Bigl(\frac1k\sum_{j=0}^{k-1}\tilde\mu_{jT},\ \frac1k\sum_{j=0}^{k-1}\tilde\mu_{jT}\Bigr) \tag{8.28} \]
\[ = C^{0,T}\Bigl(\frac1k\sum_{j=0}^{k-1}\tilde\mu_{jT},\ \frac1k\sum_{j=0}^{k-1}\tilde\mu_{(j+1)T}\Bigr). \tag{8.29} \]

Since $C^{0,T}(\mu,\nu)$ is a convex function of $(\mu,\nu)$ (Theorem 4.8),
\[ C^{0,T}\Bigl(\frac1k\sum_{j=0}^{k-1}\tilde\mu_{jT},\ \frac1k\sum_{j=0}^{k-1}\tilde\mu_{(j+1)T}\Bigr) \le \frac1k\sum_{j=0}^{k-1} C^{jT,(j+1)T}\bigl(\tilde\mu_{jT},\tilde\mu_{(j+1)T}\bigr) = \frac1k\,C^{0,kT}\bigl(\tilde\mu_0,\tilde\mu_{kT}\bigr), \tag{8.30} \]
where the last equality is a consequence of Property (ii) in Theorem 7.21.

Inequalities (8.29) and (8.30) together imply
\[ C^{0,T}(\mu,\mu) \le \frac1k\,C^{0,kT}\bigl(\tilde\mu_0,\tilde\mu_{kT}\bigr) = \frac1k\,C^{0,kT}(\tilde\mu,\tilde\mu). \]
Since the reverse inequality holds true by (8.27), in fact all the inequalities in (8.27), (8.29) and (8.30) have to be equalities. In particular,
\[ C^{0,kT}\bigl(\mu_0,\mu_{kT}\bigr) = \sum_{j=0}^{k-1} C^{jT,(j+1)T}\bigl(\mu_{jT},\mu_{(j+1)T}\bigr). \tag{8.31} \]
Let us now check that the identity
\[ C^{t_1,t_3}\bigl(\mu_{t_1},\mu_{t_3}\bigr) = C^{t_1,t_2}\bigl(\mu_{t_1},\mu_{t_2}\bigr) + C^{t_2,t_3}\bigl(\mu_{t_2},\mu_{t_3}\bigr) \tag{8.32} \]
holds true for any three intermediate times $t_1 < t_2 < t_3$. By periodicity, it suffices to do this for $t_1 \ge 0$. If $0 \le t_1 < t_2 < t_3 \le T$, then (8.32) is true by the property of displacement interpolation (Theorem 7.21 again). If $jT \le t_1 < t_2 < t_3 \le (j+1)T$, this is also true because of the $T$-periodicity. In the remaining cases, we may choose $k$ large enough that $t_3 \le kT$. Then
\begin{align*}
C^{0,kT}\bigl(\mu_0,\mu_{kT}\bigr) &\le C^{0,t_1}\bigl(\mu_0,\mu_{t_1}\bigr) + C^{t_1,t_3}\bigl(\mu_{t_1},\mu_{t_3}\bigr) + C^{t_3,kT}\bigl(\mu_{t_3},\mu_{kT}\bigr) \\
&\le C^{0,t_1}\bigl(\mu_0,\mu_{t_1}\bigr) + C^{t_1,t_2}\bigl(\mu_{t_1},\mu_{t_2}\bigr) + C^{t_2,t_3}\bigl(\mu_{t_2},\mu_{t_3}\bigr) + C^{t_3,kT}\bigl(\mu_{t_3},\mu_{kT}\bigr) \\
&\le \sum_j C^{s_j,s_{j+1}}\bigl(\mu_{s_j},\mu_{s_{j+1}}\bigr), \tag{8.33}
\end{align*}
where the times $s_j$ are obtained by ordering of $\{0, T, 2T, \ldots, kT\} \cup \{t_1, t_2, t_3\}$. On each time-interval $[\ell T, (\ell+1)T]$ we know that $(\mu_t)$ is a displacement interpolation, so we can apply Theorem 7.21(ii), and as a result bound the right-hand side of (8.33) by
\[ \sum_\ell C^{\ell T,(\ell+1)T}\bigl(\mu_{\ell T},\mu_{(\ell+1)T}\bigr). \tag{8.34} \]

(Consider for instance the particular case when $0 < t_1 < t_2 < T < t_3 < 2T$; then one can write $C^{0,t_1} + C^{t_1,t_2} + C^{t_2,T} = C^{0,T}$, and also $C^{T,t_3} + C^{t_3,2T} = C^{T,2T}$. So $C^{0,t_1} + C^{t_1,t_2} + C^{t_2,T} + C^{T,t_3} + C^{t_3,2T} = C^{0,T} + C^{T,2T}$.)

But (8.34) is just $C^{0,kT}(\mu_0,\mu_{kT})$, as shown in (8.31). So there is in fact equality in all these inequalities, and (8.32) follows. Then by Theorem 7.21, $(\mu_t)$ defines a displacement interpolation between any two of its intermediate values. This proves (i). At this stage we have also proven (iii) in the case when $t_0 = 0$.

Now for any $t \in \mathbb{R}$, one has, by (8.32) and the $T$-periodicity,
\begin{align*}
C^{0,T}(\mu_0,\mu_T) &= C^{0,t}(\mu_0,\mu_t) + C^{t,T}(\mu_t,\mu_T) \\
&= C^{t,T}(\mu_t,\mu_T) + C^{T,t+T}(\mu_T,\mu_{t+T}) \\
&= C^{t,t+T}(\mu_t,\mu_{t+T}) = C^{t,t+T}(\mu_t,\mu_t),
\end{align*}
which proves (ii).

Next, let $t_0$ be given, and repeat the same whole procedure with the initial time 0 replaced by $t_0$: That is, introduce a minimizer $\tilde\mu$ for $C^{t_0,t_0+T}(\mu,\mu)$, etc. This gives a curve $(\tilde\mu_t)_{t\in\mathbb{R}}$ with the property that $C^{t,t+T}(\tilde\mu_t,\tilde\mu_t) = C^{0,T}(\tilde\mu_0,\tilde\mu_0)$. It follows that
\[ C^{0,T}(\mu_0,\mu_0) \le C^{0,T}(\tilde\mu_0,\tilde\mu_0) = C^{t_0,t_0+T}(\tilde\mu_{t_0},\tilde\mu_{t_0}) = C^{t_0,t_0+T}(\tilde\mu,\tilde\mu) \le C^{t_0,t_0+T}(\mu_{t_0},\mu_{t_0}) = C^{0,T}(\mu_0,\mu_0). \]
So there is equality everywhere, and $\mu_{t_0}$ is indeed a minimizer for $C^{t_0,t_0+T}(\mu,\mu)$. This proves the remaining part of (iii).

Next, let $(\gamma_t)_{0\le t\le T}$ be a random minimizing curve on $[0,T]$ with $\operatorname{law}(\gamma_t) = \mu_t$, as in Theorem 7.21. For each $k$, define $(\gamma^k_t)_{kT\le t\le (k+1)T}$ as a copy of $(\gamma_{t-kT})_{0\le t-kT\le T}$. Since $(\mu_t)$ is $T$-periodic, $\operatorname{law}(\gamma^k_t) = \mu_t$, for all $k$.
So we can glue together these random curves, just as in the proof of Theorem 7.30, and get random curves $(\gamma_t)_{t\in\mathbb{R}}$ such that $\operatorname{law}(\gamma_t) = \mu_t$ for all $t\in\mathbb{R}$, and each curve $(\gamma_t)_{kT\le t\le (k+1)T}$ is action-minimizing. Property (iv) is then satisfied by construction.

Property (v) can be established by a principle which was already used in the proof of Theorem 7.21. Let us check for instance that $\gamma$ is (almost surely) minimizing on $[0,2T]$. For this one has to show that
\[ c^{t_1,t_2}\bigl(\gamma_{t_1},\gamma_{t_2}\bigr) + c^{t_2,t_3}\bigl(\gamma_{t_2},\gamma_{t_3}\bigr) = c^{t_1,t_3}\bigl(\gamma_{t_1},\gamma_{t_3}\bigr) \tag{8.35} \]

for any choice of intermediate times $t_1 < t_2 < t_3$ in $[0,2T]$. Assume, without real loss of generality, that $0 < t_1 < T < t_2 < t_3 < 2T$. Then
\begin{align*}
C^{t_1,t_3}\bigl(\mu_{t_1},\mu_{t_3}\bigr) &\le \mathbb{E}\, c^{t_1,t_3}\bigl(\gamma_{t_1},\gamma_{t_3}\bigr) \\
&\le \mathbb{E}\,\Bigl[ c^{t_1,t_2}\bigl(\gamma_{t_1},\gamma_{t_2}\bigr) + c^{t_2,t_3}\bigl(\gamma_{t_2},\gamma_{t_3}\bigr) \Bigr] \\
&\le \mathbb{E}\, c^{t_1,T}\bigl(\gamma_{t_1},\gamma_T\bigr) + \mathbb{E}\, c^{T,t_2}\bigl(\gamma_T,\gamma_{t_2}\bigr) + \mathbb{E}\, c^{t_2,t_3}\bigl(\gamma_{t_2},\gamma_{t_3}\bigr) \\
&= C^{t_1,T}\bigl(\mu_{t_1},\mu_T\bigr) + C^{T,t_2}\bigl(\mu_T,\mu_{t_2}\bigr) + C^{t_2,t_3}\bigl(\mu_{t_2},\mu_{t_3}\bigr) \\
&= C^{t_1,t_3}\bigl(\mu_{t_1},\mu_{t_3}\bigr),
\end{align*}
where the property of optimality of the path $(\mu_t)_{t\in\mathbb{R}}$ was used in the last step. So all these inequalities are equalities, and in particular
\[ \mathbb{E}\,\Bigl[ c^{t_1,t_3}\bigl(\gamma_{t_1},\gamma_{t_3}\bigr) - c^{t_1,t_2}\bigl(\gamma_{t_1},\gamma_{t_2}\bigr) - c^{t_2,t_3}\bigl(\gamma_{t_2},\gamma_{t_3}\bigr) \Bigr] = 0. \]
Since the integrand is nonpositive, it has to vanish almost surely. So (8.35) is satisfied almost surely, for given $t_1, t_2, t_3$. Then the same inequality holds true almost surely for all choices of rational times $t_1, t_2, t_3$; and by continuity of $\gamma$ it holds true almost surely at all times. This concludes the proof of (v).

From general principles of Lagrangian mechanics, there is a uniform bound on the speeds of all the curves $(\gamma_t)_{-T\le t\le T}$ (this is because the $\gamma$'s lie in a compact set). So for any given $\varepsilon > 0$ we can find $\delta$ such that $0 \le t_0 \le \delta$ implies $d(\gamma_{t_0},\gamma_0) \le \varepsilon$. Then if $\varepsilon$ is small enough the map $(\gamma_0,\gamma_{t_0}) \mapsto (\gamma_0,\dot\gamma_0)$ is Lipschitz. (This is another well-known fact in Lagrangian mechanics; it can be seen as a consequence of Remark 8.10.) But from Theorem 8.5, applied with the intermediate time 0 on the time-interval $[-T,T]$, we know that $\gamma_0 \longmapsto \gamma_{t_0}$ is well-defined (almost surely) and Lipschitz continuous. It follows that $\gamma_0 \mapsto \dot\gamma_0$ is also Lipschitz continuous. This concludes the proof of Theorem 8.11. ⊓⊔

The story does not end here.
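The intermediate-time identity (8.35) can be checked by hand in the simplest setting, $L(x,v,t) = |v|^2$ on Euclidean space, where the minimal action is $c^{s,t}(x,y) = |x-y|^2/(t-s)$, attained by the straight line traveled at constant speed. A small sketch of mine (not from the text), verifying the additivity of the cost along a minimizer at an intermediate time:

```python
# For L(x,v,t) = |v|^2, the minimal action between x and z on [s,t] is
# c^{s,t}(x,z) = |x-z|^2/(t-s). Check the additivity (8.35) along the
# straight-line minimizer, split at an intermediate time t2.
def c(s, t, p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q)) / (t - s)

x, z = (0.0, 0.0), (3.0, 4.0)
t1, t2, t3 = 0.0, 0.7, 2.0
lam = (t2 - t1) / (t3 - t1)
y = tuple((1 - lam) * a + lam * b for a, b in zip(x, z))  # position at time t2

# The sum of the two partial costs equals the total cost, with no slack:
assert abs(c(t1, t3, x, z) - (c(t1, t2, x, y) + c(t2, t3, y, z))) < 1e-9
```

For any point $y$ other than the one actually visited at time $t_2$, the right-hand side would be strictly larger; this is exactly the mechanism that forces the almost-sure equality in the proof of (v).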
First, there is a powerful dual point of view to Mather's theory, based on solutions to the dual Kantorovich problem; this is a maximization problem defined by
\[ \sup \int (\phi - \psi)\,d\mu, \tag{8.36} \]
where the supremum is over all probability measures $\mu$ on $M$, and all pairs of Lipschitz functions $(\psi,\phi)$ such that
\[ \forall (x,y) \in M \times M, \qquad \phi(y) - \psi(x) \le c^{0,T}(x,y). \]
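The constraint $\phi(y) - \psi(x) \le c^{0,T}(x,y)$ encodes the easy ("weak") half of Kantorovich duality, which is used later in the proof of Theorem 8.17: for any feasible pair and any $\mu$, the dual value $\int(\phi-\psi)\,d\mu$ is at most the cost of every coupling of $(\mu,\mu)$, hence at most $C^{0,T}(\mu,\mu)$. A discrete numerical sketch of mine, with all data made up:

```python
import random

# Weak duality behind (8.36): if phi[j] - psi[i] <= c[i][j] for all i, j,
# then for ANY coupling pi of (mu, mu),
#   sum_ij pi[i][j]*c[i][j] >= sum_j mu[j]*phi[j] - sum_i mu[i]*psi[i],
# hence the same lower bound holds for the optimal cost C(mu, mu).
random.seed(3)
n = 4
c = [[random.uniform(0, 2) for _ in range(n)] for _ in range(n)]
psi = [random.uniform(-1, 1) for _ in range(n)]
phi = [min(c[i][j] + psi[i] for i in range(n)) for j in range(n)]  # largest feasible phi

w = [random.uniform(0.1, 1) for _ in range(n)]
mu = [x / sum(w) for x in w]
dual_value = (sum(mu[j] * phi[j] for j in range(n))
              - sum(mu[i] * psi[i] for i in range(n)))

for pi in ([[mu[i] * mu[j] for j in range(n)] for i in range(n)],              # product coupling
           [[mu[i] if i == j else 0.0 for j in range(n)] for i in range(n)]):  # diagonal coupling
    cost = sum(pi[i][j] * c[i][j] for i in range(n) for j in range(n))
    assert cost >= dual_value - 1e-12
```

The point of Theorem 8.17 below is that, for a stationary solution of the Hamilton–Jacobi equation, this easy bound is saturated.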

Next, Theorem 8.11 suggests that some objects related to optimal transport might be interesting to describe a Lagrangian system. This is indeed the case, and the notions defined below are useful and well-known in the theory of dynamical systems:

Definition 8.15 (Useful transport quantities describing a Lagrangian system). For each displacement interpolation $(\mu_t)_{t\ge 0}$ as in Theorem 8.11, define:
(i) the Mather critical value as the opposite of the mean optimal transport cost:
\[ M = -c := -\frac{1}{kT}\,C^{0,kT}(\mu,\mu) = -\frac{1}{T}\,C^{0,T}(\mu,\mu); \tag{8.37} \]
(ii) the Mather set as the closure of the union of the supports of all measures $(V_0)_\#\mu_0$, where $(\mu_t)_{t\ge 0}$ is a displacement interpolation as in Theorem 8.11 and $V_0$ is the Lipschitz map $\gamma_0 \mapsto (\gamma_0,\dot\gamma_0)$;
(iii) the Aubry set as the set of all $(\gamma_0,\dot\gamma_0)$ such that there is a solution $(\phi,\psi)$ of the dual problem (8.36) satisfying $(H^{0,T}_+\psi)(\gamma_1) - \psi(\gamma_0) = c^{0,T}(\gamma_0,\gamma_1)$.

Up to the change of variables $(\gamma_0,\gamma_1) \mapsto (\gamma_0,\dot\gamma_0)$, the Mather and Aubry sets are just the same as the sets $\Gamma_{\min}$ and $\Gamma_{\max}$ appearing in the bibliographical notes of Chapter 5.

Example 8.16. Take a one-dimensional pendulum. For small values of the total energy, the pendulum is confined in a periodic motion, making just small oscillations, going back and forth around its equilibrium position and describing an arc of circle in physical space (see Figure 8.5). For large values, it also has a periodic motion but now it goes always in the same direction, and describes a complete circle ("revolution") in physical space. But if the system is given just the right amount of energy, it will describe a trajectory that is intermediate between these two regimes, and consists in going from the vertical upward position (at time $-\infty$) to the vertical upward position again (at time $+\infty$) after exploring all intermediate angles.
There are two such trajectories (one clockwise, and one counterclockwise), which can be called revolutions of infinite period; and they are globally action-minimizing. When $\xi = 0$, the solution of the Mather problem is just the Dirac mass on the unstable equilibrium $x_0$, and the Mather and Aubry sets $\Gamma$ are reduced to $\{(x_0,x_0)\}$. When $\xi$ varies in $\mathbb{R}$, this remains the same until $\xi$ reaches

a certain critical value; above that value, the Mather measures are supported by revolutions. At the critical value, the Mather and Aubry sets differ: the Aubry set (viewed in the variables $(x,v)$) is the union of the two revolutions of infinite period.

Fig. 8.5. On the left figure, the pendulum oscillates with little energy between two extreme positions; its trajectory is an arc of circle which is described clockwise, then counterclockwise, then clockwise again, etc. On the right figure, the pendulum has much more energy and draws complete circles again and again, either clockwise or counterclockwise.

The dual point of view in Mather's theory, and the notion of the Aubry set, are intimately related to the so-called weak KAM theory, in which stationary solutions of Hamilton–Jacobi equations play a central role. The next theorem partly explains the link between the two theories.

Theorem 8.17 (Mather critical value and stationary Hamilton–Jacobi equation). With the same notation as in Theorem 8.11, assume that the Lagrangian $L$ does not depend on $t$, and let $\psi$ be a Lipschitz function on $M$, such that $H^{0,t}_+\psi = \psi + ct$ for all times $t \ge 0$; that is, $\psi$ is left invariant by the forward Hamilton–Jacobi semigroup, except for the addition of a constant which varies linearly in time. Then necessarily $cT = C^{0,T}(\mu,\mu)$, and the pair $(\psi, H^{0,T}_+\psi) = (\psi, \psi + cT)$ is optimal in the dual Kantorovich problem with cost function $c^{0,T}$, and initial and final measures equal to $\mu$.

Remark 8.18. The equation $H^{0,t}_+\psi = \psi + ct$ is a way to reformulate the stationary Hamilton–Jacobi equation $H(x,\nabla\psi(x)) + c = 0$. Yet another reformulation would be obtained by changing the forward Hamilton–Jacobi semigroup for the backward one. Theorem 8.17 does

not guarantee the existence of such stationary solutions, it just states that if such solutions exist, then the value of the constant $c$ is uniquely determined and can be related to a Monge–Kantorovich problem. In weak KAM theory, one then establishes the existence of these solutions by independent means; see the references suggested in the bibliographical notes for much more information.

Remark 8.19. The constant $-c$ (which coincides with Mather's critical value) is often called the effective Hamiltonian of the system.

Proof of Theorem 8.17. To fix the ideas, let us impose $T = 1$. Let $\psi$ be such that $H^{0,1}_+\psi = \psi + c$, and let $\mu$ be any probability measure on $M$; then
\[ \int (H^{0,1}_+\psi)\,d\mu - \int \psi\,d\mu = \int c\,d\mu = c. \]
By the easy part of the Kantorovich duality, $C^{0,1}(\mu,\mu) \ge c$. By taking the infimum over all $\mu \in P(M)$, we conclude that $\inf_\mu C^{0,1}(\mu,\mu) \ge c$.

To prove the reverse inequality, it suffices to construct a particular probability measure $\mu$ such that $C^{0,1}(\mu,\mu) \le c$. The idea is to look for $\mu$ as a limit of probability measures distributed uniformly over some well-chosen long minimizing trajectories. Before starting this construction, we first remark that since $M$ is compact, there is a uniform bound $C$ on $L(\gamma(t),\dot\gamma(t))$, for all action-minimizing curves $\gamma : [0,1] \to M$; and since $L$ is time-independent, this statement trivially extends to all action-minimizing curves defined on time-intervals $[t_0,t_1]$ with $|t_1 - t_0| \ge 1$. Also $\psi$ is uniformly bounded on $M$.

Now let $x$ be an arbitrary point in $M$; for any $T > 0$ we have, by definition of the forward Hamilton–Jacobi semigroup,
\[ (H^{-T,0}_+\psi)(x) = \inf\Bigl\{ \psi(\gamma(-T)) + \int_{-T}^0 L\bigl(\gamma(s),\dot\gamma(s)\bigr)\,ds;\ \gamma(0) = x \Bigr\}, \]
where the infimum is over all action-minimizing curves $\gamma : [-T,0] \to M$
ending at $x$. (The advantage in working with negative times is that one can fix one of the endpoints; in the present context where $M$ is compact this is nonessential, but it would become important if $M$ were noncompact.) By compactness, there is a minimizing curve $\gamma^{(T)}$; then, by the definition of $\gamma^{(T)}$ and the stationarity of $\psi$,

\[ \frac1T\int_{-T}^0 L\bigl(\gamma^{(T)}(s),\dot\gamma^{(T)}(s)\bigr)\,ds = \frac1T\Bigl[ (H^{-T,0}_+\psi)(x) - \psi\bigl(\gamma^{(T)}(-T)\bigr) \Bigr] = \frac1T\Bigl[ cT + \psi(x) - \psi\bigl(\gamma^{(T)}(-T)\bigr) \Bigr] = c + O\Bigl(\frac1T\Bigr). \]
In the sequel, I shall write just $\gamma$ for $\gamma^{(T+1)}$. Of course the estimate above remains unchanged upon replacement of $T$ by $T+1$, so
\[ \frac1T\int_{-(T+1)}^0 L\bigl(\gamma(s),\dot\gamma(s)\bigr)\,ds = c + O\Bigl(\frac1T\Bigr). \]
Then define
\[ \mu_T := \frac1T\int_{-T}^0 \delta_{\gamma(s)}\,ds; \qquad \nu_T := \frac1T\int_{-(T+1)}^{-1} \delta_{\gamma(s)}\,ds. \]
Let $\theta : \gamma(s) \longmapsto \gamma(s+1)$. It is clear that $\theta_\#\nu_T = \mu_T$; moreover,
\[ c^{0,1}\bigl(\gamma(s),\theta(\gamma(s))\bigr) = c^{0,1}\bigl(\gamma(s),\gamma(s+1)\bigr) = \int_s^{s+1} L\bigl(\gamma(u),\dot\gamma(u)\bigr)\,du. \]
Thus by Theorem 4.8,
\begin{align*}
C^{0,1}(\nu_T,\mu_T) &\le \frac1T\int_{-(T+1)}^{-1} c^{0,1}\bigl(\gamma(s),\theta(\gamma(s))\bigr)\,ds \\
&= \frac1T\int_{-(T+1)}^{-1}\Bigl(\int_s^{s+1} L\bigl(\gamma(u),\dot\gamma(u)\bigr)\,du\Bigr)\,ds \\
&= \frac1T\int_{-(T+1)}^{0} a(u)\,L\bigl(\gamma(u),\dot\gamma(u)\bigr)\,du, \tag{8.38}
\end{align*}
where $a : [-(T+1),0] \to [0,1]$ is defined by
\[ a(u) = \int \mathbf{1}_{s\le u\le s+1}\,ds = \begin{cases} 1 & \text{if } -T \le u \le -1; \\ -u & \text{if } -1 \le u \le 0; \\ u + T + 1 & \text{if } -(T+1) \le u \le -T. \end{cases} \]
Replacing $a$ by 1 in the integrand of (8.38) involves modifying the integral on a set of measure at most 2; so

\[ C^{0,1}(\nu_T,\mu_T) \le \frac1T\int_{-(T+1)}^0 L\bigl(\gamma(u),\dot\gamma(u)\bigr)\,du + O\Bigl(\frac1T\Bigr) = c + O\Bigl(\frac1T\Bigr). \tag{8.39} \]
Since $P(M)$ is compact, the family $(\mu_T)_{T\in\mathbb{N}}$ converges, up to extraction of a subsequence, to some probability measure $\mu$. Then (up to extraction of the same subsequence) $\nu_T$ also converges to $\mu$, since
\[ \|\nu_T - \mu_T\|_{TV} \le \frac1T\int_{-(T+1)}^{-T} \bigl\|\delta_{\gamma(s)}\bigr\|_{TV}\,ds + \frac1T\int_{-1}^{0} \bigl\|\delta_{\gamma(s)}\bigr\|_{TV}\,ds \le \frac2T. \]
Then from (8.39) and the lower semicontinuity of the optimal transport cost,
\[ C^{0,1}(\mu,\mu) \le \liminf_{T\to\infty} C^{0,1}(\nu_T,\mu_T) \le c. \]
This concludes the proof. ⊓⊔

The next exercise may be an occasion to manipulate the concepts introduced in this section.

Exercise 8.20. With the same assumptions as in Theorem 8.11, assume that $L$ is symmetric in $v$; that is, $L(x,-v,t) = L(x,v,t)$. Show that $c^{0,T}(x,y) = c^{0,T}(y,x)$. Take an optimal measure $\mu$ for the minimization problem (8.26), and let $\pi$ be an associated optimal transference plan. By gluing together $\pi$ and $\check\pi$ (obtained by exchanging the variables $x$ and $y$), construct an optimal transference plan for the problem (8.26) with $T$ replaced by $2T$, such that each point $x$ stays in place. Deduce that the curves $\gamma$ are $2T$-periodic. Show that $C^{0,2T}(\mu,\mu) = 2\,C^{0,T}(\mu,\mu)$, and deduce that $c^{0,2T}(x,x) = 2\,c^{0,T}(x,y)$, $\pi$-almost surely. Construct $\psi$ such that $H^{0,2T}_+\psi = \psi + 2cT$, $\mu$-almost surely. Next assume that $L$ does not depend on $t$, and use a compactness argument to construct a $\psi$ and a stationary measure $\mu$, such that $H^{0,t}_+\psi = \psi + ct$, $\mu$-almost surely, for all $t \ge 0$. Note that this is far from proving the existence of a stationary solution of the Hamilton–Jacobi equation, as appearing in Theorem 8.17, for two reasons: First the symmetry of $L$ is a huge simplification; secondly the equation $H^{0,t}_+\psi = \psi + ct$ should hold everywhere in $M$, not just $\mu$-almost surely.
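The piecewise function $a$ appearing in the proof of Theorem 8.17 is just the overlap $a(u) = \int \mathbf{1}_{s\le u\le s+1}\,ds$, the integral running over $s \in [-(T+1),-1]$. A quick numerical cross-check of the piecewise formula against this definition (the value of $T$ is an arbitrary choice of mine):

```python
# Cross-check the piecewise formula for a(u) against its integral definition
# a(u) = integral over s in [-(T+1), -1] of the indicator 1_{s <= u <= s+1}.
T = 7.0

def a_formula(u):
    if -T <= u <= -1:
        return 1.0
    if -1 <= u <= 0:
        return -u
    if -(T + 1) <= u <= -T:
        return u + T + 1
    return 0.0

def a_integral(u, n=200000):
    # Midpoint Riemann sum of the indicator over s in [-(T+1), -1].
    lo, hi = -(T + 1), -1.0
    h = (hi - lo) / n
    return sum(h for k in range(n)
               if (s := lo + (k + 0.5) * h) <= u <= s + 1)

for u in [-7.5, -6.0, -3.0, -0.5, -0.25]:
    assert abs(a_formula(u) - a_integral(u)) < 1e-3
```

Since $a = 1$ except on the two end intervals $[-(T+1),-T]$ and $[-1,0]$ (total length 2), replacing $a$ by 1 in (8.38) indeed perturbs the integral by at most $2\sup|L|$, which is the step leading to (8.39).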

Possible extensions of Mather's estimates

As noticed in Example 8.4, it would be desirable to have a sharper version of Theorem 8.1 which would contain as a special case the correct exponents for the Lagrangian function $L(x,v,t) = |v|^{1+\alpha}$, $0 < \alpha < 1$. But even for a "uniformly convex" Lagrangian there are several extensions of Theorem 8.1 which would be of interest, such as (a) getting rid of the compactness assumption, or at least control the dependence of constants at infinity; and (b) getting rid of the smoothness assumptions. I shall discuss both problems in the most typical case $L(x,v,t) = |v|^2$, i.e. $c(x,y) = d(x,y)^2$.

Intuitively, Mather's estimates are related to the local behavior of geodesics (they should not diverge too fast), and to the convexity properties of the square distance function $d(x_0,\,\cdot\,)^2$. Both features are well captured by lower bounds on the sectional curvature of the manifold. There is by chance a generalized notion of sectional curvature bounds, due to Alexandrov, which makes sense in a general metric space, without any smoothness; metric spaces which satisfy these bounds are called Alexandrov spaces. (This notion will be explained in more detail in Chapter 26.) In such spaces, one could hope to solve problems (a) and (b) at the same time. Although the proofs in the present chapter strongly rely on smoothness, I would be ready to believe in the following statement (which might be not so difficult to prove):

Open Problem 8.21. Let $(\mathcal{X},d)$ be an Alexandrov space with curvature bounded below by $K \in \mathbb{R}$, and let $x_1, x_2, y_1, y_2$ be four points in $\mathcal{X}$ such that
\[ d(x_1,y_1)^2 + d(x_2,y_2)^2 \le d(x_1,y_2)^2 + d(x_2,y_1)^2. \]
Further, let $\gamma_1$ and $\gamma_2$ be two constant-speed geodesics respectively joining $x_1$ to $y_1$ and $x_2$ to $y_2$. Then, for any $t_0 \in (0,1)$, there is a constant $C = C(K,t_0)$, depending only on $K$, $t_0$, and on an upper bound on all the distances involved, such that
\[ \sup_{0\le t\le 1} d\bigl(\gamma_1(t),\gamma_2(t)\bigr) \le C\, d\bigl(\gamma_1(t_0),\gamma_2(t_0)\bigr). \]

To conclude this discussion, I shall mention a much rougher "shortening lemma", which has the advantage of holding true in general metric spaces, even without curvature bounds. In such a situation, in general there may be branching geodesics, so a bound on the distance at

one intermediate time is clearly not enough to control the distance between the positions along the whole geodesic curves. One cannot hope either to control the distance between the velocities of these curves, since the velocities might not be well-defined. On the other hand, we may take advantage of the property of preservation of speed along the minimizing curves, since this remains true even in a nonsmooth context. The next theorem exploits this to show that if geodesics in a displacement interpolation pass near each other at some intermediate time, then their lengths have to be approximately equal.

Theorem 8.22 (A rough nonsmooth shortening lemma). Let $(\mathcal{X},d)$ be a metric space, and let $\gamma_1, \gamma_2$ be two constant-speed, minimizing geodesics such that
\[ d\bigl(\gamma_1(0),\gamma_2(0)\bigr)^2 + d\bigl(\gamma_1(1),\gamma_2(1)\bigr)^2 \le d\bigl(\gamma_1(0),\gamma_2(1)\bigr)^2 + d\bigl(\gamma_2(0),\gamma_1(1)\bigr)^2. \]
Let $L_1$ and $L_2$ stand for the respective lengths of $\gamma_1$ and $\gamma_2$, and let $D$ be a bound on the diameter of $(\gamma_1 \cup \gamma_2)([0,1])$. Then, for any $t_0 \in (0,1)$,
\[ |L_1 - L_2| \le \frac{C\sqrt{D}}{\sqrt{t_0(1-t_0)}}\,\sqrt{d\bigl(\gamma_1(t_0),\gamma_2(t_0)\bigr)}, \]
for some numeric constant $C$.

Proof of Theorem 8.22. Write $d_{12} = d(x_1,y_2)$, $d_{21} = d(x_2,y_1)$, $X_1 = \gamma_1(t_0)$, $X_2 = \gamma_2(t_0)$; note that $d(x_1,y_1) = L_1$ and $d(x_2,y_2) = L_2$. From the minimizing assumption, the triangle inequality and explicit calculations,
\begin{align*}
0 &\le d_{12}^2 + d_{21}^2 - L_1^2 - L_2^2 \\
&\le \bigl( d(x_1,X_1) + d(X_1,X_2) + d(X_2,y_2) \bigr)^2 + \bigl( d(x_2,X_2) + d(X_2,X_1) + d(X_1,y_1) \bigr)^2 - L_1^2 - L_2^2 \\
&= \bigl( t_0 L_1 + d(X_1,X_2) + (1-t_0) L_2 \bigr)^2 + \bigl( t_0 L_2 + d(X_1,X_2) + (1-t_0) L_1 \bigr)^2 - L_1^2 - L_2^2 \\
&= 2\,d(X_1,X_2)^2 + 2\,d(X_1,X_2)\,(L_1+L_2) - 2\,t_0(1-t_0)\,(L_1-L_2)^2.
\end{align*}
As a consequence,
\[ |L_1 - L_2| \le \frac{d(X_1,X_2) + \sqrt{d(X_1,X_2)\,(L_1+L_2)}}{\sqrt{t_0(1-t_0)}}, \]

and the proof is complete. ⊓⊔

Appendix: Lipschitz estimates for power cost functions

The goal of this Appendix is to prove the following shortening lemma for the cost function $c(x,y) = |x-y|^{1+\alpha}$ in Euclidean space.

Theorem 8.23 (Shortening lemma for power cost functions). Let $\alpha \in (0,1)$, and let $x_1, y_1, x_2, y_2$ be four points in $\mathbb{R}^n$ such that
\[ |x_1 - y_1|^{1+\alpha} + |x_2 - y_2|^{1+\alpha} \le |x_1 - y_2|^{1+\alpha} + |x_2 - y_1|^{1+\alpha}. \tag{8.40} \]
Further, let
\[ \gamma_1(t) = (1-t)\,x_1 + t\,y_1, \qquad \gamma_2(t) = (1-t)\,x_2 + t\,y_2. \]
Then, for any $t_0 \in (0,1)$ there is a constant $K = K(\alpha,t_0) > 0$ such that
\[ \bigl|\gamma_1(t_0) - \gamma_2(t_0)\bigr| \ge K \sup_{0\le t\le 1} \bigl|\gamma_1(t) - \gamma_2(t)\bigr|. \]

Remark 8.24. The proof below is not constructive, so I won't have any quantitative information on the best constant $K(\alpha,t_0)$. It is natural to think that for each fixed $t_0$, the constant $K(\alpha,t_0)$ (which only depends on $\alpha$) will go to 0 as $\alpha \downarrow 0$. When $\alpha = 0$, the conclusion of the theorem is false: Just think of the case when $x_1, y_1, x_2, y_2$ are aligned. But this is the only case in which the conclusion fails, so it might be that a modified statement still holds true.

Proof of Theorem 8.23. First note that it suffices to work in the affine space generated by $x_1, y_1, x_2, y_2$, which is of dimension at most 3; hence all the constants will be independent of the dimension $n$. For notational simplicity, I shall assume that $t_0 = 1/2$, which has no important influence on the computations. Let $X_1 := \gamma_1(1/2)$, $X_2 := \gamma_2(1/2)$. It is sufficient to show that
\[ |x_1 - x_2| + |y_1 - y_2| \le C\,|X_1 - X_2| \]
for some constant $C$, independent of $x_1, x_2, y_1, y_2$.

Step 1: Reduction to a compact problem by invariance. Exchanging the roles of the two pairs if necessary, we may assume that $|x_2 - y_2| \le |x_1 - y_1|$; then by translation invariance that $x_1 = 0$, by homogeneity that $|x_1 - y_1| = 1$ (treat separately the trivial case $x_1 = y_1$), and by rotation invariance that $y_1 = e$ is a fixed unit vector. Let $R := |x_2|$; then $|x_2 - y_2| \le 1$ implies $|x_2 - X_2| \le 1/2$, so $|X_2| \ge R - 1/2$; and since $|X_1| = 1/2$, it follows that $|X_1 - X_2| \ge R - 1$. On the other hand, $|x_1 - x_2| = R$ and $|y_1 - y_2| \le R + 2$. So the conclusion is obvious if $R \ge 2$. Otherwise, $x_2$ and $y_2$ lie in the ball $B_3(0)$.

Step 2: Reduction to a perturbation problem by compactness. For any positive integer $k$, let $(x_2^{(k)}, y_2^{(k)})$ be such that $(|x_1 - x_2| + |y_1 - y_2|)/|X_1 - X_2|$ is minimized by $(x_1, y_1, x_2^{(k)}, y_2^{(k)})$ under the constraint $|X_1 - X_2| \ge k^{-1}$.

By compactness, such a configuration does exist, and the value $I_k$ of the infimum goes down with $k$, and converges to
\[ I := \inf \frac{|x_1 - x_2| + |y_1 - y_2|}{|X_1 - X_2|}, \tag{8.41} \]
where the infimum is taken over all configurations such that $X_1 \ne X_2$. The strict convexity of $x \mapsto |x|^{1+\alpha}$ together with inequality (8.40) prevent $X_1 = X_2$, unless $(x_1,y_1) = (x_2,y_2)$, in which case there is nothing to prove. So it is sufficient to show that $I > 0$.

Since the sequence $(x_2^{(k)}, y_2^{(k)})$ takes values in a compact set, there is a subsequence thereof (still denoted $(x_2^{(k)}, y_2^{(k)})$) which converges to some $(x_2^{(\infty)}, y_2^{(\infty)})$. By continuity, condition (8.40) holds true with $(x_2,y_2) = (x_2^{(\infty)}, y_2^{(\infty)})$. If (with obvious notation) $|X_1 - X_2^{(\infty)}| > 0$, then the configuration $(x_1, y_1, x_2^{(\infty)}, y_2^{(\infty)})$ achieves the minimum $I$ in (8.41), and that minimum is positive. So the only case there remains to treat is $X_2^{(\infty)} = X_1$.
Then, by strict convexity, condition (8.40) imposes $x_2^{(\infty)} = x_1$, $y_2^{(\infty)} = y_1$. Equivalently, $x_2^{(k)}$ converges to $x_1$ and $y_2^{(k)}$ converges to $y_1$. All this shows that it suffices to treat the case when $x_2$ is very close to $x_1$ and $y_2$ is very close to $y_1$.

Step 3: Expansions. Now let
\[ x_2 = x_1 + \delta x, \qquad y_2 = y_1 + \delta y, \tag{8.42} \]
where $\delta x$ and $\delta y$ are vectors of small norm (recall that $x_1 - y_1$ has unit norm). Of course

\[ x_2 - x_1 = \delta x, \qquad y_2 - y_1 = \delta y, \qquad X_2 - X_1 = \frac{\delta x + \delta y}{2}; \]
so to conclude the proof it is sufficient to show that
\[ \Bigl|\frac{\delta x + \delta y}{2}\Bigr| \ge K\,\bigl( |\delta x| + |\delta y| \bigr), \tag{8.43} \]
as soon as $|\delta x|$ and $|\delta y|$ are small enough, and (8.40) is satisfied.

By using the formulas $|a+b|^2 = |a|^2 + 2\langle a,b\rangle + |b|^2$ and
\[ (1+\varepsilon)^{\frac{1+\alpha}{2}} = 1 + \frac{1+\alpha}{2}\,\varepsilon - \frac{(1+\alpha)(1-\alpha)}{8}\,\varepsilon^2 + O(\varepsilon^3), \]
one easily deduces from (8.40) that
\[ |\delta x - \delta y|^2 - |\delta x|^2 - |\delta y|^2 \le (1-\alpha)\Bigl[ \langle \delta x - \delta y, e\rangle^2 - \langle \delta x, e\rangle^2 - \langle \delta y, e\rangle^2 \Bigr] + O\bigl( |\delta x|^3 + |\delta y|^3 \bigr). \]
This can be rewritten
\[ \langle \delta x, \delta y\rangle - (1-\alpha)\,\langle \delta x, e\rangle\,\langle \delta y, e\rangle \ge O\bigl( |\delta x|^3 + |\delta y|^3 \bigr). \]
Consider the new scalar product
\[ \langle\langle v, w\rangle\rangle := \langle v, w\rangle - (1-\alpha)\,\langle v, e\rangle\,\langle w, e\rangle \]
(which is indeed a scalar product because $\alpha > 0$), and denote the associated norm by $\|v\|$. Then the above conclusion can be summarized into
\[ \langle\langle \delta x, \delta y\rangle\rangle \ge O\bigl( \|\delta x\|^3 + \|\delta y\|^3 \bigr). \tag{8.44} \]
It follows that
\[ \Bigl\|\frac{\delta x + \delta y}{2}\Bigr\|^2 = \frac14\Bigl( \|\delta x\|^2 + 2\,\langle\langle \delta x, \delta y\rangle\rangle + \|\delta y\|^2 \Bigr) \ge \frac14\Bigl( \|\delta x\|^2 + \|\delta y\|^2 \Bigr) + O\bigl( \|\delta x\|^3 + \|\delta y\|^3 \bigr). \]
So inequality (8.43) is indeed satisfied if $|\delta x| + |\delta y|$ is small enough. ⊓⊔
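Two ingredients of the proof just given lend themselves to a quick numerical sanity check (my own sketch; $\alpha = 1/2$, dimension 2, and the random sampling are arbitrary choices of mine): the modified scalar product $\langle\langle v,w\rangle\rangle$ is positive definite as soon as $\alpha > 0$, and, in accordance with Step 2, random configurations satisfying (8.40) never have $X_1 = X_2$ unless the two segments coincide.

```python
import random

alpha = 0.5
e = (1.0, 0.0)
dot = lambda u, v: u[0]*v[0] + u[1]*v[1]

# Modified scalar product <<v,w>> = <v,w> - (1-alpha) <v,e><w,e>;
# in the basis (e, e_perp) its matrix is diag(alpha, 1), positive definite.
mdot = lambda u, v: dot(u, v) - (1 - alpha) * dot(u, e) * dot(v, e)

random.seed(1)
for _ in range(1000):
    v = (random.uniform(-1, 1), random.uniform(-1, 1))
    if v != (0.0, 0.0):
        assert mdot(v, v) > 0

# Strict convexity of |x|^{1+alpha}: configurations satisfying (8.40)
# keep the midpoints X1, X2 apart unless (x1, y1) = (x2, y2).
def c(p, q):
    return ((p[0]-q[0])**2 + (p[1]-q[1])**2) ** ((1 + alpha) / 2)

P = lambda: (random.uniform(-1, 1), random.uniform(-1, 1))
for _ in range(5000):
    x1, y1, x2, y2 = P(), P(), P(), P()
    if c(x1, y1) + c(x2, y2) <= c(x1, y2) + c(x2, y1):     # condition (8.40)
        X1 = ((x1[0]+y1[0])/2, (x1[1]+y1[1])/2)
        X2 = ((x2[0]+y2[0])/2, (x2[1]+y2[1])/2)
        if (x1, y1) != (x2, y2):
            assert X1 != X2
```

Of course this does not probe the nonconstructive constant $K(\alpha, t_0)$ itself; it only confirms the two qualitative facts the compactness argument rests on.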

Exercise 8.25. Extend this result to the cost function d(x, y)^(1+α) on a Riemannian manifold, when γ and γ̃ stay within a compact set. Hints: This tricky exercise is only for a reader who feels very comfortable. One can use a reasoning similar to that in Step 2 of the above proof, introducing a sequence (γ^(k), γ̃^(k)) which is asymptotically the “worst possible”, and converges, up to extraction of a subsequence, to (γ^(∞), γ̃^(∞)). There are three cases: (i) γ^(∞) and γ̃^(∞) are distinct geodesic curves which cross; this is ruled out by Theorem 8.1. (ii) γ^(k) and γ̃^(k) converge to a point; then everything becomes local and one can use the result in R^n, Theorem 8.23. (iii) γ^(k) and γ̃^(k) converge to a nontrivial geodesic γ^(∞); then these curves can be approximated by infinitesimal perturbations of γ^(∞), which are described by differential equations (Jacobi equations).

Remark 8.26. Of course it would be much better to avoid the compactness arguments and derive the bounds directly, but I don't see how to proceed.

Bibliographical notes

Monge's observation about the impossibility of crossing appears in his seminal 1781 memoir [636]. The argument is likely to apply whenever the cost function satisfies a triangle inequality, which is always the case in what Bernard and Buffoni have called the Monge–Mañé problem [104]. I don't know of a quantitative version of it.

A very simple argument, due to Brenier, shows how to construct, without any calculations, configurations of points that lead to line-crossing for a quadratic cost [814, Chapter 10, Problem 1].

There are several possible computations to obtain inequalities in the style of (8.3). The use of the identity (8.2) was inspired by a result by Figalli, which is described below.
It is an old observation in Riemannian geometry that two minimizing curves cannot intersect twice and remain minimizing; the way to prove this is the shortcut method already known to Monge. This simple principle has important geometrical consequences; see for instance the works by Morse [637, Theorem 3] and Hedlund [467, p. 722]. (These references, as well as a large part of the historical remarks below, were pointed out to me by Mather.)

At the end of the seventies, Aubry discovered a noncrossing lemma which is similar in spirit, although in a different setting. Together with Le Daeron, he demonstrated the power of this principle in studying the so-called Frenkel–Kontorova model from solid-state physics; see the “Fundamental Lemma” in [52]. In particular, the method of Aubry and Le Daeron provided an alternative proof of results by Mather [598] about the existence of quasiperiodic orbits for certain dynamical systems.¹ The relations between the methods of proof of Aubry and Mather are discussed in [599, 604] and constitute the core of what is usually called the Aubry–Mather theory. Bangert [66, Lemma 3.1] gave a general version of the Aubry–Le Daeron lemma, and illustrated its use in various aspects of the theory of twist diffeomorphisms. (Bangert's paper can be consulted as an entry point for this subject.) He also made the connection with the earlier works of Morse and Hedlund in geometry. There is a related independent study by Bialy and Polterovich [117]. Then Moser [641] showed that the theory of twist diffeomorphisms (at least in certain particular cases) could be embedded in the theory of strictly convex Lagrangian systems, and Denzler [297] adapted the noncrossing arguments of Aubry, Le Daeron and Bangert to this new setting (see for instance Theorem 2.3 there), which in some sense goes back to older geometric works.

Around 1990, Mather came up with two contributions which for our purposes are crucial. The first consists in introducing minimizing measures rather than minimizing curves [600]; the second consists in a quantitative version of the noncrossing argument, for a general class of strictly convex Lagrangian functions [601, p. 186]. This estimate, which in these notes I called Mather's shortening lemma, was the key technical ingredient in the proof of his fundamental “Lipschitz graph theorem” [601, Theorem 2].
Although the statement in [601] is not really the same as the one which appears in this chapter, the proof really is similar. The idea to use this approach in optimal transport theory came to me when Bernard mentioned Mather's lemma in a seminar where he was presenting his results with Buffoni about the optimal transport problem for rather general Lagrangian functions [105].

¹ According to Mather, the chronology is blurred, because Aubry knew similar results somewhat earlier, at least for certain classes of systems, but had never published them; in particular, the discoveries of Aubry and Mather were independent. Further, see the review paper [51].

In the meantime, an appropriate version of the noncrossing lemma had already been rediscovered (but not in a quantitative version) by researchers in optimal transport. Indeed, the noncrossing property of optimal trajectories, and the resulting estimates about absolute continuity of the displacement interpolant, were some of the key technical tools used by McCann [614] to establish convexity properties of certain functionals along displacement interpolation in R^n for a quadratic cost; these statements were generalized by Cordero-Erausquin, McCann and Schmuckenschläger [246] for Riemannian manifolds, and for rather general convex cost functions in R^n by Cordero-Erausquin [243].

Results similar to Theorems 8.5 and 8.7 are also proven by Bernard and Buffoni [105] via the study of Hamilton–Jacobi equations, in the style of weak KAM theory. This is a bit less elementary but powerful as well. The basic idea is to exploit the fact that solutions of Hamilton–Jacobi equations are automatically semiconcave for positive times; I learnt from Otto the usefulness of this regularization property in the context of optimal transport (see [814, p. 181]). Fathi and Figalli [348] generalized this strategy to noncompact situations. Bernard [102] also used the same idea to recover an important result about the existence of C^1 subsolutions of certain Hamilton–Jacobi equations.

Figalli and Juillet [366] obtained a result similar to Theorem 8.7 when the cost is the squared distance on a degenerate Riemannian structure such as the Heisenberg group or an Alexandrov space with curvature bounded below. Their approach is completely different since it uses the uniqueness of Wasserstein geodesics and the so-called measure contraction property (which is traditionally associated with Ricci curvature bounds but nevertheless holds in the Heisenberg group [496]).
Figalli and Juillet note that concentration phenomena arise in the Heisenberg group which are not seen in Riemannian manifolds, and that the Monge–Mather shortening lemma does not hold in this setting.

Theorem 8.11 is a variant of Mather's Lipschitz graph theorem, appearing (up to minor modifications) in Bernard and Buffoni [105, Theorem C]. The core of the proof is also taken from that work.

The acronym “KAM” stands for Kolmogorov, Arnold and Moser; the “classical KAM theory” deals with the stability (with high probability) of perturbed integrable Hamiltonian systems. An account of this theory can be found in, e.g., Thirring [780, Section 3.6]. With respect to weak KAM theory, some important differences are that: (a) classical

KAM theory only applies to slight perturbations of integrable systems; (b) it only deals with very smooth objects; (c) it controls the behavior of a large portion of the phase space (the whole of it, asymptotically when the size of the perturbation goes to 0). The weak KAM theory is much more recent than the classical one; it was developed by several authors, in particular Fathi [344, 345]. A theorem of existence of a stationary solution of the Hamilton–Jacobi equation can be found in [347, Theorem 4.4.6]. Precursors are Mather [602, 603] and Mañé [592, 593]. The reader can also consult the book by Fathi [347], and (with a complementary point of view) the one by Contreras and Iturriaga [238]. Also available are some technical notes by Ishii [488], and the review works [604, 752].

The proof of Theorem 8.17, as I wrote it, is a minor variation of an argument shown to me by Fathi. Related considerations appear in a recent work by Bernard and Buffoni [106], who analyze the weak KAM theory in light of the abstract Kantorovich duality. One may also consult [278].

From its very beginning, the weak KAM theory has been associated with the theory of viscosity solutions of Hamilton–Jacobi equations. An early work on the subject (anterior to Mather's papers) is an unpublished preprint by P.-L. Lions, Papanicolaou and Varadhan [564]. Recently, the weak KAM theory has been related to the large-time behavior of Hamilton–Jacobi equations [69, 107, 346, 349, 383, 384, 385, 487, 645, 707]. Aubry sets are also related with the C^1 regularity of Hamilton–Jacobi equations, which has important applications in the theory of dynamical systems [102, 103, 350]. See also Evans and Gomes [332, 333, 334, 423] and the references therein for an alternative point of view.

In this chapter I presented Mather's problem in terms of trajectories and transport cost.
There is an alternative presentation in terms of invariant measures, following an idea by Mañé. In Mañé's version of the problem, the unknown is a probability measure μ(dx dv) on the tangent bundle TM; it is stationary in the sense that ∇_x · (v μ) = 0 (this is a stationary kinetic transport equation), and it should minimize the action ∫ L(x, v) μ(dx dv). Then one can show that μ is actually invariant under the Lagrangian flow defined by L. As Gomes pointed out to me, this approach has the drawback that the invariance of μ is not built in from the definition; but it has several nice advantages:

• It makes the graph property trivial if L is strictly convex: Indeed, one can always collapse the measure μ, at each x ∈ M, onto the barycenter ξ(x) = ∫ v μ(dv | x); this operation preserves the invariance of the measure, and decreases the cost unless μ was already supported on a graph. (Note: This does not give the Lipschitz regularity of the graph!)
• This is a linear programming problem, so it admits a dual problem, which is sup_φ inf_x H(∇φ(x), x); the value of this quantity is but another way to characterize the effective Hamiltonian H; see, e.g., [238, 239].
• This is a good starting point for some generalizations, see for instance [422].

I shall conclude with some more technical remarks.

The use of a restriction property to prove the absolute continuity of the displacement interpolant without any compactness assumption was inspired by a discussion with Sturm on a related subject. It was also Sturm who asked me whether Mather's estimates could be generalized to Alexandrov spaces with curvature bounded below.

The theorem according to which a Lipschitz map T dilates the n-dimensional Hausdorff measure by a factor at most ‖T‖_Lip^n is an almost immediate consequence of the definitions of Hausdorff measure, see e.g. [174, Proposition 1.7.8].

Alexandrov spaces are discussed at length in the very pedagogical monograph by Burago, Burago and Ivanov [174]. Several characterizations of Alexandrov spaces are given there, and their equivalence is established. For instance, an Alexandrov space has curvature bounded below by K if the square distance function d(z, ·)^2 is “no more convex” than the square distance function in the model space having constant sectional curvature K. Also geodesics in an Alexandrov space cannot diverge faster than geodesics in the model space, in some sense. These properties explain why such spaces may be a natural generalized setting for optimal transport.
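The first advantage above rests on Jensen's inequality for the strictly convex Lagrangian. A toy numerical sketch (my own, with the hypothetical choice L(x, v) = |v|²/2) illustrates how collapsing the conditional velocity distribution onto its barycenter decreases the action:

```python
import numpy as np

def L(v):
    # L(x, v) = |v|^2 / 2, strictly convex in v (no x-dependence here)
    return 0.5 * np.sum(np.atleast_2d(v) ** 2, axis=1)

rng = np.random.default_rng(1)
velocities = rng.normal(size=(1000, 2))      # sample of mu(dv | x) at a point x
action_before = float(L(velocities).mean())  # E[L(v)]
barycenter = velocities.mean(axis=0)         # xi(x) = int v mu(dv | x)
action_after = float(L(barycenter)[0])       # L(E[v]) after the collapse
# Jensen: E[L(v)] >= L(E[v]), with equality only if mu(dv|x) is a Dirac mass
assert action_after <= action_before
print("ok")
```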
Upper bounds on the sectional curvature, on the other hand, do not seem to be of any help.

Figalli recently solved Open Problem 8.21 in the special case K = 0 (nonnegative curvature), with a very simple and sharp argument: He showed that if γ_1 and γ_2 are any two minimizing, constant-speed geodesics in an Alexandrov space (X, d) with nonnegative curvature, with γ_1(0) = x_1, γ_1(1) = y_1, γ_2(0) = x_2, γ_2(1) = y_2, then

    d(γ_1(t), γ_2(t))^2 ≥ (1 − t)^2 d(x_1, x_2)^2 + t^2 d(y_1, y_2)^2
        + t (1 − t) [ d(x_1, y_2)^2 + d(x_2, y_1)^2 − d(x_1, y_1)^2 − d(x_2, y_2)^2 ].    (8.45)

(So in this case there is no need for an upper bound on the distances between x_1, x_2, y_1, y_2.) The general case where K might be negative seems to be quite more tricky. As a consequence of (8.45), Theorem 8.7 holds when the cost is the squared distance on an Alexandrov space with nonnegative curvature; but this can also be proven by the method of Figalli and Juillet [366].

Theorem 8.22 takes inspiration from the no-crossing proof in [246, Lemma 5.3]. I don't know whether the Hölder-1/2 regularity is optimal, and I don't know either whether it is possible/useful to obtain similar estimates for more general cost functions.
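In Euclidean space, a flat Alexandrov space with nonnegative curvature in which geodesics are straight lines, a direct expansion shows that (8.45) holds with equality; lower curvature bounds only turn the identity into an inequality. A quick numerical check (my own sketch, not from the text):

```python
import numpy as np

d2 = lambda a, b: float(np.sum((a - b) ** 2))   # squared Euclidean distance

rng = np.random.default_rng(2)
for _ in range(100):
    x1, x2, y1, y2 = rng.normal(size=(4, 3))
    t = rng.uniform()
    g1 = (1 - t) * x1 + t * y1    # constant-speed geodesics in R^3
    g2 = (1 - t) * x2 + t * y2
    lhs = d2(g1, g2)
    rhs = ((1 - t) ** 2 * d2(x1, x2) + t ** 2 * d2(y1, y2)
           + t * (1 - t) * (d2(x1, y2) + d2(x2, y1)
                            - d2(x1, y1) - d2(x2, y2)))
    # in the flat case (8.45) is an identity
    assert abs(lhs - rhs) < 1e-9
print("ok")
```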


9 Solution of the Monge problem I: Global approach

In the present chapter and the next one I shall investigate the solvability of the Monge problem for a Lagrangian cost function. Recall from Theorem 5.30 that it is sufficient to identify conditions under which the initial measure μ does not see the set of points where the c-subdifferential of a c-convex function ψ is multivalued.

Consider a Riemannian manifold M, and a cost function c(x, y) on M × M, deriving from a Lagrangian function L(x, v, t) on TM × [0, 1] satisfying the classical conditions of Definition 7.6. Let μ_0 and μ_1 be two given probability measures, and let (μ_t)_{0 ≤ t ≤ 1} be a displacement interpolation, written as the law of a random minimizing curve γ at time t.

If the Lagrangian satisfies adequate regularity and convexity properties, Theorem 8.5 shows that the coupling (γ(s), γ(t)) is always deterministic as soon as 0 < s < 1, however singular μ_0 and μ_1 might be. The question whether one can construct a deterministic coupling of (μ_0, μ_1) is much more subtle, and cannot be answered without regularity assumptions on μ_0. In this chapter, a simple approach to this problem will be attempted, but only with partial success, since eventually it will work out only for a particular class of cost functions, including at least the quadratic cost in Euclidean space (arguably the most important case).

Our main assumption on the cost function c will be:

Assumption (C): For any c-convex function ψ and any x ∈ M, the c-subdifferential ∂_c ψ(x) is pathwise connected.

Example 9.1. Consider the cost function c(x, y) = −x · y in R^n. Let y_0 and y_1 belong to ∂_c ψ(x); then, for any z ∈ R^n one has

    ψ(x) + y_0 · (z − x) ≤ ψ(z);    ψ(x) + y_1 · (z − x) ≤ ψ(z).

It follows that ψ(x) + y_t · (z − x) ≤ ψ(z), where y_t := (1 − t) y_0 + t y_1. Thus the line segment (y_t)_{0 ≤ t ≤ 1} is entirely contained in the subdifferential of ψ at x. The same computation applies to c(x, y) = |x − y|^2 / 2, or to any cost function of the form a(x) − x · y + b(y).

Actually, there are not so many examples where Assumption (C) is known to be satisfied. Before commenting more on this issue, let me illustrate the interest of this assumption by showing how it can be used.

Theorem 9.2 (Conditions for single-valued subdifferentials). Let M be a smooth n-dimensional Riemannian manifold, and c a real-valued cost function, bounded below, deriving from a Lagrangian L(x, v, t) on TM × [0, 1], satisfying the classical conditions of Definition 7.6 and such that:

(i) Assumption (C) is satisfied;
(ii) The conclusion of Theorem 8.1 (Mather's shortening lemma) holds true for t_0 = 1/2, in the form of inequality (8.4) with an exponent β > 1 − (1/n), and a uniform constant. More explicitly: Whenever x_1, x_2, y_1, y_2 are four points in M satisfying c(x_1, y_1) + c(x_2, y_2) ≤ c(x_1, y_2) + c(x_2, y_1), and γ_1, γ_2 are two action-minimizing curves with γ_1(0) = x_1, γ_1(1) = y_1, γ_2(0) = x_2, γ_2(1) = y_2, then

    sup_{0 ≤ t ≤ 1} d(γ_1(t), γ_2(t)) ≤ C d(γ_1(1/2), γ_2(1/2))^β.    (9.1)

Then, for any c-convex function ψ, there is a set Z ⊂ M of Hausdorff dimension at most (n − 1)/β < n (and therefore of zero n-dimensional measure), such that the c-subdifferential ∂_c ψ(x) contains at most one element if x ∉ Z.

Proof of Theorem 9.2. Let Z be the set of points x for which ψ(x) < +∞ but ∂_c ψ(x) is not single-valued; the problem is to show that Z is of dimension at most (n − 1)/β. Let x ∈ M with ψ(x) < +∞, and let y ∈ ∂_c ψ(x).
Introduce an action-minimizing curve γ = γ^{x,y} joining x = γ(0) to y = γ(1). I claim that the map

    F : γ(1/2) ↦ x

is well-defined on its domain of definition, which is the union of all the γ^{x,y}(1/2). (I mean, m = γ^{x,y}(1/2) determines x unambiguously; there cannot be two different points x for which γ(1/2) is the same.) Indeed, assume y ∈ ∂_c ψ(x) and y' ∈ ∂_c ψ(x'), with ψ(x) < +∞, ψ(x') < +∞, and let γ and γ' be minimizing geodesics between x and y on the one hand, x' and y' on the other hand. It follows from the definitions of subdifferential that

    ψ(x) + c(x, y) ≤ ψ(x') + c(x', y);
    ψ(x') + c(x', y') ≤ ψ(x) + c(x, y').

Thus

    c(x, y) + c(x', y') ≤ c(x, y') + c(x', y).

Then by (9.1),

    d(x, x') ≤ C d(γ(1/2), γ'(1/2))^β.

This implies that m = γ(1/2) determines x = F(m) unambiguously, and even that F is Hölder-β. (Obviously, this is the same reasoning as in the proof of Theorem 8.5.)

Now, cover M by a countable number of open sets U in which M is diffeomorphic to a subset of R^n, via some diffeomorphism φ_U. In each of these open sets U, consider the union H_U of all hyperplanes passing through a point of rational coordinates, orthogonal to a unit vector with rational coordinates. Transport this set back to M thanks to the local diffeomorphism, and take the union over all the sets U. This gives a set D ⊂ M with the following properties: (i) It is of dimension n − 1; (ii) It meets every nontrivial continuous curve drawn on M (to see this, write the curve locally in terms of φ_U and note that, by continuity, at least one of the coordinates of the curve has to become rational at some time).

Next, let x ∈ Z, and let y_0, y_1 be two distinct elements of ∂_c ψ(x). By assumption there is a continuous curve (y_t)_{0 ≤ t ≤ 1} lying entirely in ∂_c ψ(x). For each t, introduce an action-minimizing curve (γ_t(s))_{0 ≤ s ≤ 1} between x and y_t (s here is the time parameter along the curve). Define m_t := γ_t(1/2).
This is a continuous path, nontrivial (otherwise γ_0(1/2) = γ_1(1/2), but two minimizing trajectories starting from x cannot cross in their middle, or they have to coincide at all times by (9.1)). So there has to be some t such that m_t ∈ D. Moreover, the map F constructed above satisfies F(m_t) = x for all t. It follows that x ∈ F(D). (See Figure 9.1.)

As a conclusion, Z ⊂ F(D). Since D is of Hausdorff dimension n − 1 and F is β-Hölder, the dimension of F(D) is at most (n − 1)/β. ⊓⊔

Fig. 9.1. Scheme of proof for Theorem 9.2. Here there is a curve (y_t)_{0 ≤ t ≤ 1} lying entirely in ∂_c ψ(x), and there is a nontrivial path (m_t)_{0 ≤ t ≤ 1} obtained by taking the midpoint m_t = γ_t(1/2) between x = γ_t(0) and y_t = γ_t(1). This path has to meet D; but its image by F is {x}, so x ∈ F(D).

Now come the consequences in terms of Monge transport.

Corollary 9.3 (Solution of the Monge problem, I). Let M be a Riemannian manifold, let c be a cost function on M × M, with associated cost functional C, and let μ, ν be two probability measures on M. Assume that:
(i) C(μ, ν) < +∞;
(ii) the assumptions of Theorem 9.2 are satisfied;
(iii) μ gives zero probability to sets of dimension at most (n − 1)/β.
Then there is a unique (in law) optimal coupling (x, y) of μ and ν; it is deterministic, and characterized (among all couplings of (μ, ν)) by the existence of a c-convex function ψ such that

    y ∈ ∂_c ψ(x)  almost surely.    (9.2)

Equivalently, there is a unique optimal transport plan π; it is deterministic, and characterized by the existence of a c-convex ψ such that (9.2) holds true π-almost surely.

Proof of Corollary 9.3. The conclusion is obtained by just putting together Theorems 9.2 and 5.30. ⊓⊔

We have now solved the Monge problem in an absolutely painless way; but under what assumptions? At least we can treat the important cost function c(x, y) = −x · y. Indeed the notion of c-convexity reduces to plain convexity (plus lower semicontinuity), and the c-subdifferential of a convex function ψ is just its usual subdifferential, which I shall denote by ∂ψ. Moreover, under an assumption of finite second moments, for the Monge problem this cost is just as good as the usual squared Euclidean distance, since |x − y|^2 = |x|^2 − 2 x · y + |y|^2, and ∫ (|x|^2 + |y|^2) dπ(x, y) is independent of the choice of π ∈ Π(μ, ν). Particular as it may seem, this case is one of the most important for applications, so I shall state the result as a separate theorem.

Theorem 9.4 (Monge problem for quadratic cost, first result). Let c(x, y) = |x − y|^2 in R^n. Let μ, ν be two probability measures on R^n such that

    ∫ |x|^2 dμ(x) + ∫ |y|^2 dν(y) < +∞    (9.3)

and μ does not give mass to sets of dimension at most n − 1. (This is true in particular if μ is absolutely continuous with respect to the Lebesgue measure.) Then there is a unique (in law) optimal coupling (x, y) of μ and ν; it is deterministic, and characterized, among all couplings of (μ, ν), by the existence of a lower semicontinuous convex function ψ such that

    y ∈ ∂ψ(x)  almost surely.    (9.4)

In other words, there is a unique optimal transference plan π; it is a Monge transport plan, and it is characterized by the existence of a lower semicontinuous convex function ψ whose subdifferential contains Spt π.

Remark 9.5. The assumption that μ does not give mass to sets of dimension at most n − 1 is optimal for the existence of a Monge coupling, as can be seen by choosing μ = H^1|_{[0,1] × {0}} (the one-dimensional Hausdorff measure concentrated on the segment [0,1] × {0} in R^2), and then ν = (1/2) H^1|_{[0,1] × {−1} ∪ [0,1] × {+1}}.
(See Figure 9.2.) It is also optimal for the uniqueness, as can be seen by taking μ = (1/2) H^1|_{{0} × [−1,1]} and ν = (1/2) H^1|_{[−1,1] × {0}}. In fact, whenever μ, ν ∈ P_2(R^n) are supported on orthogonal subspaces of R^n, then any transference plan is optimal! To see this, define a function ψ by ψ = 0 on Conv(Spt μ), ψ = +∞ elsewhere; then ψ* = 0 on Conv(Spt ν), ψ is convex lower semicontinuous, so ∂ψ contains Spt μ × Spt ν, and any transference plan is supported in ∂ψ.
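The orthogonal-support phenomenon of Remark 9.5 is easy to probe numerically. In the hypothetical discrete sketch below (my own illustration), μ sits on the horizontal axis and ν on the vertical axis of R², so x · y = 0 for every pair and |x − y|² = |x|² + |y|²; consequently every transference plan has exactly the same quadratic cost:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 6
xs = np.array([[a, 0.0] for a in rng.normal(size=n)])  # support of mu (x-axis)
ys = np.array([[0.0, b] for b in rng.normal(size=n)])  # support of nu (y-axis)

def cost(perm):
    # quadratic transport cost of the deterministic plan x_i -> y_{perm(i)}
    return sum(float(np.sum((xs[i] - ys[perm[i]]) ** 2)) for i in range(n))

base = cost(np.arange(n))
for _ in range(20):
    perm = rng.permutation(n)          # an arbitrary transference plan
    assert abs(cost(perm) - base) < 1e-9
print("ok")
```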

Fig. 9.2. The source measure is drawn as a thick line, the target measure as a thin line; the cost function is quadratic. On the left, there is a unique optimal coupling but no optimal Monge coupling. On the right, there are many optimal couplings; in fact any transference plan is optimal.

In the next chapter, we shall see that Theorem 9.4 can be improved in at least two ways: Equation (9.4) can be rewritten y = ∇ψ(x); and the assumption (9.3) can be replaced by the weaker assumption C(μ, ν) < +∞ (finite optimal transport cost).

Now if one wants to apply Theorem 9.2 to nonquadratic cost functions, the question arises of how to identify those cost functions c(x, y) which satisfy Assumption (C). Obviously, there might be some geometric obstructions imposed by the domains X and Y: For instance, if Y is a nonconvex subset of R^n, then Assumption (C) is violated even by the quadratic cost function. But even in the whole of, say, R^n, Assumption (C) is not a generic condition, and so far there is only a short list of known examples. These include the cost functions c(x, y) = √(1 + |x − y|^2) on R^n × R^n, or more generally c(x, y) = (1 + |x − y|^2)^{p/2} on B_R(0) × B_R(0) ⊂ R^n × R^n, where 1 < p < 2 and R = 1/√(p − 1); and c(x, y) = d(x, y)^2 on S^{n−1} × S^{n−1}, where d is the geodesic distance on the sphere. For such cost functions, the Monge problem can be solved by combining Theorems 8.1, 9.2 and 5.30, exactly as in the proof of Theorem 9.4.

This approach suffers, however, from two main drawbacks: First it seems to be limited to a small number of examples; secondly, the verification of Assumption (C) is subtle. In the next chapter we shall investigate a more pedestrian approach, which will apply in much greater generality. I shall end this chapter with a simple example of a cost function which does not satisfy Assumption (C).
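Before turning to the counterexample, here is a small numerical illustration of Theorem 9.4 in dimension one (my own sketch, with hypothetical data points): for the quadratic cost between two uniform discrete measures, brute force over all assignments recovers the monotone rearrangement, the one-dimensional incarnation of transport along the subdifferential of a convex function.

```python
import itertools
import numpy as np

xs = np.array([0.1, 0.9, 0.4, 0.6])   # hypothetical support of mu
ys = np.array([1.0, 0.2, 0.5, 0.8])   # hypothetical support of nu

def total_cost(perm):
    # cost of the deterministic plan x_i -> y_{perm(i)} for c(x, y) = |x - y|^2
    return float(np.sum((xs - ys[list(perm)]) ** 2))

best = min(itertools.permutations(range(len(xs))), key=total_cost)

# monotone rearrangement: send the k-th smallest x to the k-th smallest y
ranks = np.argsort(np.argsort(xs))
monotone = tuple(int(i) for i in np.argsort(ys)[ranks])
assert best == monotone   # the optimal plan is deterministic and monotone
print(best)  # → (1, 0, 2, 3)
```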

Proposition 9.6 (Non-connectedness of the c-subdifferential). Let c(x, y) = |x − y|^p on R^2 × R^2, and let p > 2. Then there is a c-convex function ψ : R^2 → R such that ∂_c ψ(0) is not connected.

Proof of Proposition 9.6. For t ∈ [−1, 1] define y_t = (0, t) ∈ R^2, and

    η_t(x) = −c(x, y_t) + c(0, y_t) + β(t) = −(x_1^2 + (x_2 − t)^2)^{p/2} + |t|^p + β(t),

where β is a smooth even function, β(0) = 0, β(t) > 0 for |t| > 0. Further, let r > 0 and X_± = (± r, 0). (The fact that the segments [y_{−1}, y_{+1}] and [X_−, X_+] are orthogonal is not accidental.) Then η_t(0) = β(t) is an increasing function of |t|; while η_t(X_±) = −(r^2 + t^2)^{p/2} + |t|^p + β(t) is a decreasing function of |t| if β'(t) < p t [(r^2 + t^2)^{p/2 − 1} − t^{p−2}], which we shall assume. Now define ψ(x) = sup {η_t(x); t ∈ [−1, 1]}. By construction this is a c-convex function, and ψ(0) = β(1) > 0, while ψ(X_±) = η_0(X_±) = −r^p.

We shall check that ∂_c ψ(0) is not connected. First, ∂_c ψ(0) is not empty: this can be shown by elementary means or as a consequence of Example 10.20 and Theorem 10.24 in the next chapter. Secondly, ∂_c ψ(0) ⊂ {(y_1, y_2) ∈ R^2; y_1 = 0}: This comes from the fact that all functions η_t are decreasing as a function of |x_1|. (So ψ is also nonincreasing in |x_1|, and if (y_1, y_2) ∈ ∂_c ψ(0, 0), then ψ(0, 0) + (y_1^2 + y_2^2)^{p/2} ≤ ψ(y_1, 0) + |y_2|^p ≤ ψ(0, 0) + |y_2|^p, which imposes y_1 = 0.) Obviously, ∂_c ψ(0) is a symmetric subset of the line {y_1 = 0}. But if 0 ∈ ∂_c ψ(0), then 0 < ψ(0) ≤ ψ(X_±) + |X_±|^p = 0, which is a contradiction. So ∂_c ψ(0) does not contain 0, therefore it is not connected.
(What is happening is the following. When replacing η_0 by ψ, we have raised the graph above the origin, but we have kept the points (X_±, η_0(X_±)) in place, which forbids us to touch the graph of ψ from below at the origin with a translation of η_0.) ⊓⊔

Bibliographical notes

It is classical that the image of a set of Hausdorff dimension d by a Lipschitz map is contained in a set of Hausdorff dimension at most d: See for instance [331, p. 75]. There is no difficulty in modifying the proof to show that the image of a set of Hausdorff dimension d by a Hölder-β map is contained in a set of dimension at most d/β.

The proof of Theorem 9.2 is adapted from a classical argument according to which a real-valued convex function ψ on R^n has a single-valued subdifferential everywhere out of a set of dimension at most n − 1; see [11, Theorem 2.2]. The key estimate for the proof of the latter theorem is that (Id + ∂ψ)^{−1} exists and is Lipschitz; but this can be seen as a very particular case of the Mather shortening lemma. In the next chapter another line of argumentation for that differentiability theorem, more local, will be provided.

The paternity of Theorem 9.4 is shared by Brenier [154, 156] with Rachev and Rüschendorf [722]; it builds upon earlier work by Knott and Smith [524], who already knew that a coupling lying entirely in the subdifferential of a convex function would be optimal. Brenier rewrote the result as a beautiful polar factorization theorem, which is presented in detail in [814, Chapter 3].

The nonuniqueness statement in Remark 9.5 was formulated by McCann [613]. Related problems (existence and uniqueness of optimal couplings between measures supported on polygons) are discussed by Gangbo and McCann [400], in relation to problems of shape recognition.

Other forms of Theorem 9.4 appear in Rachev and Rüschendorf [696], in particular an extension to infinite-dimensional separable Hilbert spaces; the proof is reproduced in [814, Second Proof of Theorem 2.9]. (This problem was also considered in [2, 254].) All these arguments are based on duality; then more direct proofs, which do not use the Kantorovich duality explicitly, were found by Gangbo [395], and also Caffarelli [187] (who gives credit to Varadhan for this approach).

A probabilistic approach to Theorem 9.4 was studied by Mikami and Thieullen [628, 630]. The idea is to consider a minimization problem over paths which are not geodesics, but geodesics perturbed by some noise; then let the noise vanish.
This is reminiscent of Nelson's approach to quantum mechanics; see the bibliographical notes of Chapters 7 and 23.

McCann [613] extended Theorem 9.4 by removing the assumption of bounded second moment and even the weaker assumption of finite transport cost: Whenever μ does not charge sets of dimension n − 1, there exists a unique coupling of (μ, ν) which takes the form y = ∇Ψ(x), where Ψ is a lower semicontinuous convex function. The tricky part in this statement is the uniqueness. This theorem will be proven in

the next chapter (see Theorem 10.42, Corollary 10.44 and Particular Case 10.45).

Ma, Trudinger and X.-J. Wang [585, Section 7.5] were the first to seriously study Assumption (C); they had the intuition that it was connected to a certain fourth-order differential condition on the cost function which plays a key role in the smoothness of optimal transport. Later Trudinger and Wang [793], and Loeper [570], showed that the above-mentioned differential condition is essentially, under adequate geometric and regularity assumptions, equivalent to Assumption (C). These issues will be discussed in more detail in Chapter 12. (See in particular Proposition 12.15(iii).)

The counterexample in Proposition 9.6 is extracted from [585]. The fact that c(x, y) = (1 + |x − y|^2)^{p/2} satisfies Assumption (C) on the ball of radius 1/√(p − 1) is also taken from [585, 793]. It is Loeper [571] who discovered that the squared geodesic distance on S^{n−1} satisfies Assumption (C); then a simplified argument was devised by von Nessi [824]. As mentioned at the end of the chapter, by combining Loeper's result with Theorems 8.1, 9.2 and 5.30, one can mimic the proof of Theorem 9.4 and get the unique solvability of the Monge problem for the quadratic distance on the sphere, as soon as μ does not see sets of dimension at most n − 1. Such a theorem was first obtained for general compact Riemannian manifolds by McCann [616], with a completely different argument.

Other examples of cost functions satisfying Assumption (C) will be listed in Chapter 12 (for instance −|x − y|^p / p for −2 ≤ p ≤ 1, or |x − y|^2 + |f(x) − g(y)|^2, where f and g are convex and 1-Lipschitz). But these other examples do not come from a smooth convex Lagrangian, so it is not clear whether they satisfy Assumption (ii) in Theorem 9.2.
In the particular case when ν has finite support, one can prove the unique solvability of the Monge problem under much more general assumptions, namely that the cost function is continuous, and μ does not charge sets of the form { x ; c ( x,a ) − c ( x,b ) = k } (where a,b,k are arbitrary), see [261]. This condition was recently used again by Gozlan [429].


10 Solution of the Monge problem II: Local approach

In the previous chapter, we tried to establish the almost sure single-valuedness of the c-subdifferential by an argument involving "global" topological properties, such as connectedness. Since this strategy worked out only in certain particular cases, we shall now explore a different method, based on local properties of c-convex functions. The idea is that the global question "Is the c-subdifferential of ψ single-valued at x or not?" might be much more subtle to attack than the local question "Is the function ψ differentiable at x or not?" For a large class of cost functions, these questions are in fact equivalent; but these different formulations suggest different strategies. So in this chapter, the emphasis will be on tangent vectors and gradients, rather than points in the c-subdifferential.

This approach takes its source from the works by Brenier, Rachev and Rüschendorf on the quadratic cost in R^n, around the end of the eighties. It has since then been improved by many authors, a key step being the extension to Riemannian manifolds, first addressed by McCann in 2000.

The main results in this chapter are Theorems 10.28, 10.38 and (to a lesser extent) 10.42, which solve the Monge problem with increasing generality. For Parts II and III of this course, only the particular case considered in Theorem 10.41 is needed.

A heuristic argument

Let ψ be a c-convex function on a Riemannian manifold M, and φ = ψ^c. Assume that y ∈ ∂_c ψ(x); then, from the definition of c-subdifferential,

one has, for all x̃ ∈ M,

  φ(y) − ψ(x) = c(x,y);   (10.1)
  φ(y) − ψ(x̃) ≤ c(x̃,y).

It follows that

  ψ(x) − ψ(x̃) ≤ c(x̃,y) − c(x,y).   (10.2)

Now the idea is to see what happens when x̃ → x, along a given direction. Let w be a tangent vector at x, and consider a path ε → x̃(ε), defined for ε ∈ [0, ε₀), with initial position x and initial velocity w. (For instance, x̃(ε) = exp_x(εw); or in R^n, just consider x̃(ε) = x + εw.) Assume that ψ and c(·, y) are differentiable at x, divide both sides of (10.2) by ε > 0 and pass to the limit:

  −∇ψ(x) · w ≤ ∇_x c(x,y) · w.   (10.3)

If then one changes w to −w, the inequality will be reversed. So necessarily

  ∇ψ(x) + ∇_x c(x,y) = 0.   (10.4)

If x is given, this is an equation for y. Since our goal is to show that y is determined by x, it will surely help if (10.4) admits at most one solution, and this will obviously be the case if ∇_x c(x, ·) is injective. This property (injectivity of ∇_x c(x, ·)) is in fact a classical condition in the theory of dynamical systems, where it is sometimes referred to as a twist condition.

Three objections might immediately be raised. First, ψ is an unknown of the problem, defined by an infimum, so why would it be differentiable? Second, the injectivity of ∇_x c as a function of y seems quite hard to check on concrete examples. Third, even if c is given in the problem and a priori quite nice, why should it be differentiable at (x,y)? As a very simple example, consider the square distance function d(x,y)² on the 1-dimensional circle S¹ = R/(2πZ), identified with [0, 2π):

  d(x,y)² = min(|x − y|, 2π − |x − y|)².

Then d(x,y) is not differentiable as a function of x when |y − x| = π, and of course d(x,y)² is not differentiable either (see Figure 10.1). Similar problems would occur on, say, a compact Riemannian manifold, as soon as there is no uniqueness of the geodesic joining x to y.
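The simplest case of the twist condition is the quadratic cost in R^n. The short numerical sketch below (an added illustration, not part of the original text; all names in it are mine) checks that for c(x, y) = |x − y|²/2 one has ∇_x c(x, y) = x − y, which is injective in y, so that equation (10.4) is solved by the single point y = x + ∇ψ(x):

```python
import numpy as np

# Added illustration (not from the book): the twist condition for the
# quadratic cost c(x, y) = |x - y|^2 / 2 on R^n.  Here
# grad_x c(x, y) = x - y, so y -> grad_x c(x, y) is injective, and
# equation (10.4), grad psi(x) + grad_x c(x, y) = 0, is solved by the
# unique point y = x + grad psi(x).

def grad_x_c(x, y):
    # gradient in x of c(x, y) = |x - y|^2 / 2
    return x - y

rng = np.random.default_rng(0)
x = rng.standard_normal(3)
y1, y2 = rng.standard_normal(3), rng.standard_normal(3)

# injectivity: distinct y's give distinct gradients at the same x
assert not np.allclose(grad_x_c(x, y1), grad_x_c(x, y2))

# solving (10.4) for a sample value of grad psi(x)
grad_psi = rng.standard_normal(3)
y = x + grad_psi
assert np.allclose(grad_psi + grad_x_c(x, y), 0.0)
```

More generally, any cost of the form c(x, y) = h(x − y) with h strictly convex and smooth satisfies the same twist condition, since ∇_x c(x, y) = ∇h(x − y) and ∇h is injective.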

Fig. 10.1. The distance function d(·, y) on S¹, and its square. The upper-pointing singularity is typical. The square distance is not differentiable when |x − y| = π; still it is superdifferentiable, in a sense that is explained later.

For instance, if N and S respectively stand for the north and south poles on S², then d(x, S)² fails to be differentiable at x = N. Of course, for any x this happens only for a negligible set of y's; and the cost function is differentiable everywhere else, so we might think that this is not a serious problem. But who can tell us that the optimal transport will not try to take each x (or a lot of them) to a place y such that c(x,y) is not differentiable?

To solve these problems, it will be useful to use some concepts from nonsmooth analysis: subdifferentiability, superdifferentiability, approximate differentiability. The short answers to the above problems are that (a) under adequate assumptions on the cost function, ψ will be differentiable out of a very small set (of codimension at least 1); (b) c will be superdifferentiable because it derives from a Lagrangian, and subdifferentiable wherever ψ itself is differentiable; (c) ∇_x c, where it exists, will be injective because c derives from a strictly convex Lagrangian.

The next three sections will be devoted to some basic reminders about differentiability and regularity in a nonsmooth context. For the convenience of the non-expert reader, I shall provide complete proofs of the most basic results about these issues. Conversely, readers who feel very comfortable with these notions can skip these sections.

Differentiability and approximate differentiability

Let us start with the classical definition of differentiability:

Definition 10.1 (Differentiability). Let U ⊂ R^n be an open set. A function f : U → R is said to be differentiable at x ∈ U if there exists

a vector p ∈ R^n such that

  f(z) = f(x) + ⟨p, z − x⟩ + o(|z − x|)  as z → x.

Then the vector p is uniquely determined; it is called the gradient of f at x, and is denoted by ∇f(x); the map w → ⟨p, w⟩ is the differential of f at x.

If U is an open set of a smooth Riemannian manifold M, f : U → R is said to be differentiable at x if it is so when expressed in a local chart around x; or equivalently if there is a tangent vector p ∈ T_x M such that

  f(exp_x w) = f(x) + ⟨p, w⟩ + o(|w|)  as w → 0.

The vector p is again denoted by ∇f(x).

Differentiability is a pointwise concept, which is not invariant under, say, change of Lebesgue equivalence class: If f is differentiable or even C^∞ everywhere, by changing it on a dense countable set we may obtain a function which is discontinuous everywhere, and a fortiori not differentiable. The next notion is more flexible in this respect, since it allows for modification on a negligible set. It relies on the useful concept of density. Recall that a measurable set A is said to have density ρ at x if

  lim_{r→0} vol [A ∩ B_r(x)] / vol [B_r(x)] = ρ.

It is a basic result of measure theory that a measurable set in R^n, or in a Riemannian manifold, has density 1 at almost all of its points.

Definition 10.2 (Approximate differentiability). Let U be an open set of a Riemannian manifold M, and let f : U → R ∪ {±∞} be a measurable function. Then f is said to be approximately differentiable at x ∈ U if there is a measurable function f̃ : U → R, differentiable at x, such that the set {f̃ = f} has density 1 at x; in other words,

  lim_{r→0} vol [{z ∈ B_r(x); f̃(z) = f(z)}] / vol [B_r(x)] = 1.

Then one defines the approximate gradient of f at x by the formula

  ∇̃f(x) = ∇f̃(x).
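The notion of density lends itself to a quick numerical experiment. The sketch below (an added illustration, not part of the original text; the sampling routine is mine) estimates vol[A ∩ B_r(x)]/vol[B_r(x)] by Monte Carlo for the upper half-plane A in R², which has density 1/2 at a boundary point and density 1 at an interior point:

```python
import numpy as np

# Added illustration (not from the text): Monte Carlo estimate of the
# density of a measurable set A at a point x.  Take A to be the upper
# half-plane {(a, b) : b > 0} in R^2: density 1/2 at the origin (a
# boundary point), density 1 at any interior point.

def density_estimate(indicator, x, r, n=200_000, seed=1):
    rng = np.random.default_rng(seed)
    # uniform sample in the ball B_r(x), by rejection from the square
    pts = rng.uniform(-r, r, size=(n, 2))
    pts = pts[np.sum(pts**2, axis=1) <= r**2] + x
    return np.mean(indicator(pts))

upper = lambda p: p[:, 1] > 0.0   # indicator of the upper half-plane

d_origin = density_estimate(upper, np.array([0.0, 0.0]), r=1e-3)
d_inside = density_estimate(upper, np.array([0.0, 1.0]), r=1e-3)
assert abs(d_origin - 0.5) < 0.01   # density 1/2 at the boundary
assert d_inside == 1.0              # density 1 at an interior point
```

The same routine, run with r decreasing to 0, approximates the limit in the definition above.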

Proof that ∇̃f(x) is well-defined. Since this concept is local and invariant by diffeomorphism, it is sufficient to treat the case when U is a subset of R^n.

Let f̃₁ and f̃₂ be two measurable functions on U which are both differentiable at x and coincide with f on a set of density 1. The problem is to show that ∇f̃₁(x) = ∇f̃₂(x).

For each r > 0, let Z_r be the set of points z in B_r(x) where either f̃₁(z) ≠ f(z) or f̃₂(z) ≠ f(z). It is clear that vol [Z_r] = o(vol [B_r(x)]). Since f̃₁ and f̃₂ are continuous at x, one can write

  f̃₁(x) = lim_{r→0} (1/vol [B_r(x)]) ∫_{B_r(x)} f̃₁(z) dz
        = lim_{r→0} (1/vol [B_r(x) \ Z_r]) ∫_{B_r(x)\Z_r} f̃₁(z) dz
        = lim_{r→0} (1/vol [B_r(x) \ Z_r]) ∫_{B_r(x)\Z_r} f̃₂(z) dz
        = lim_{r→0} (1/vol [B_r(x)]) ∫_{B_r(x)} f̃₂(z) dz = f̃₂(x).

So let f̃(x) be the common value of f̃₁ and f̃₂ at x.

Next, for any z ∈ B_r(x) \ Z_r, one has

  f(z) = f̃₁(x) + ⟨∇f̃₁(x), z − x⟩ + o(r),
  f(z) = f̃₂(x) + ⟨∇f̃₂(x), z − x⟩ + o(r),

so

  ⟨∇f̃₁(x) − ∇f̃₂(x), z − x⟩ = o(r).

Let w := ∇f̃₁(x) − ∇f̃₂(x); the previous estimate reads

  z ∉ Z_r  ⟹  ⟨w, z − x⟩ = o(r).   (10.5)

If w ≠ 0, then the set of z ∈ B_r(x) such that ⟨w, z − x⟩ ≥ |w| r/2 has measure at least K vol [B_r(x)], for some K > 0. If r is small enough, then vol [Z_r] ≤ (K/4) vol [B_r(x)] ≤ (K/2) vol [B_r(x) \ Z_r], so

  vol [{z ∈ B_r(x) \ Z_r; ⟨w, z − x⟩ ≥ |w| r/2}] ≥ (K/2) vol [B_r(x) \ Z_r].

Then (still for r small enough),

  (1/vol [B_r(x) \ Z_r]) ∫_{B_r(x)\Z_r} |⟨w, z − x⟩| dz ≥ K |w| r / 4,

in contradiction with (10.5). The conclusion is that w = 0, which was the goal. ⊓⊔

Regularity in a nonsmooth world

Regularity is a loose concept about the control of "how fast" a function varies. In the present section I shall review some notions of regularity which apply in a nonsmooth context, and act as a replacement for, say, C¹ or C² regularity bounds.

Definition 10.3 (Lipschitz continuity). Let U ⊂ R^n be open, and let f : U → R be given. Then:

(i) f is said to be Lipschitz if there exists L < ∞ such that

  ∀ x, z ∈ U,  |f(z) − f(x)| ≤ L |z − x|.

(ii) f is said to be locally Lipschitz if, for any x₀ ∈ U, there is a neighborhood O of x₀ in which f is Lipschitz.

If U is an open subset of a Riemannian manifold M, then f : U → R is said to be locally Lipschitz if it is so when expressed in local charts; or equivalently if f is Lipschitz on any compact subset of U, equipped with the geodesic distance on M.

Example 10.4. Obviously, a C¹ function is locally Lipschitz, but the converse is not true (think of f(x) = |x|).

Definition 10.5 (Subdifferentiability, superdifferentiability). Let U be an open set of R^n, and f : U → R a function. Then:

(i) f is said to be subdifferentiable at x, with subgradient p, if

  f(z) ≥ f(x) + ⟨p, z − x⟩ + o(|z − x|).

The convex set of all subgradients of f at x will be denoted by ∇⁻f(x).

(ii) f is said to be uniformly subdifferentiable in U if there is a continuous function ω : R₊ → R₊, such that ω(r) = o(r) as r → 0, and

  ∀ x ∈ U  ∃ p ∈ R^n;  f(z) ≥ f(x) + ⟨p, z − x⟩ − ω(|z − x|).

(iii) f is said to be locally subdifferentiable (or locally uniformly subdifferentiable) in U if each x₀ ∈ U admits a neighborhood on which f is uniformly subdifferentiable.

If U is an open set of a smooth manifold M, a function f : U → R is said to be subdifferentiable at some point x (resp. locally subdifferentiable in U) if it is so when expressed in local charts.

Corresponding notions of superdifferentiability and supergradients are obtained in an obvious way by just reversing the signs of the inequalities. The convex set of supergradients for f at x is denoted by ∇⁺f(x).

Examples 10.6. If f has a minimum at x₀ ∈ U, then 0 is a subgradient of f at x₀, whatever the regularity of f. If f has a subgradient p at x and g is smooth, then f + g has a subgradient p + ∇g(x) at x. If f is convex in U, then it is (uniformly) subdifferentiable at every point in U, by the well-known inequality

  f(z) ≥ f(x) + ⟨p, z − x⟩,

which holds true as soon as p ∈ ∂f(x) and [x, z] ⊂ U. If f is the sum of a convex function and a smooth function, then it is also uniformly subdifferentiable.

It is obvious that differentiability implies both subdifferentiability and superdifferentiability. The converse is true, as shown by the next statement.

Proposition 10.7 (Sub- and superdifferentiability imply differentiability). Let U be an open set of a smooth Riemannian manifold M, and let f : U → R be a function. Then f is differentiable at x if and only if it is both subdifferentiable and superdifferentiable there; and then

  ∇⁻f(x) = ∇⁺f(x) = {∇f(x)}.

Proof of Proposition 10.7. The only nontrivial implication is that if f is both subdifferentiable and superdifferentiable, then it is differentiable. Since this statement is local and invariant by diffeomorphism, let us pretend that U ⊂ R^n. So let p ∈ ∇⁻f(x) and q ∈ ∇⁺f(x); then

  f(z) − f(x) ≥ ⟨p, z − x⟩ − o(|z − x|);

  f(z) − f(x) ≤ ⟨q, z − x⟩ + o(|z − x|).

This implies ⟨p − q, z − x⟩ ≤ o(|z − x|), which means

  lim_{z→x, z≠x} ⟨p − q, (z − x)/|z − x|⟩ = 0.

Since the unit vector (z − x)/|z − x| can take arbitrary fixed values in the unit sphere as z → x, it follows that p = q. Then

  f(z) − f(x) = ⟨p, z − x⟩ + o(|z − x|),

which means that f is indeed differentiable at x. This also shows that p = q = ∇f(x), and the proof is complete. ⊓⊔

The next proposition summarizes some of the most important results about the links between regularity and differentiability:

Theorem 10.8 (Regularity and differentiability almost everywhere). Let U be an open subset of a smooth Riemannian manifold M, and let f : U → R be a function. Let n be the dimension of M. Then:

(i) If f is continuous, then it is subdifferentiable on a dense subset of U, and also superdifferentiable on a dense subset of U.

(ii) If f is locally Lipschitz, then it is differentiable almost everywhere (with respect to the volume measure).

(iii) If f is locally subdifferentiable (resp. locally superdifferentiable), then it is locally Lipschitz and differentiable out of a countably (n − 1)-rectifiable set. Moreover, the set of differentiability points coincides with the set of points where there is a unique subgradient (resp. supergradient). Finally, ∇f is continuous on its domain of definition.

Remark 10.9. Statement (ii) is known as Rademacher's theorem. The conclusion in statement (iii) is stronger than differentiability almost everywhere, since an (n − 1)-rectifiable set has dimension n − 1, and is therefore negligible. In fact, as we shall see very soon, the local subdifferentiability property is stronger than the local Lipschitz property. Reminders about the notion of countable rectifiability are provided in the Appendix.

Proof of Theorem 10.8.
First we can cover U by a countable collection of small open sets U_k, each of which is diffeomorphic to an open subset

O_k of R^n. Then, since all the concepts involved are local and invariant under diffeomorphism, we may work in O_k. So in the sequel, I shall pretend that U is a subset of R^n.

Let us start with the proof of (i). Let f be continuous on U, and let V be an open subset of U; the problem is to show that f admits at least one point of subdifferentiability in V. So let x₀ ∈ V, and let r > 0 be so small that B(x₀, r) ⊂ V. Let B = B̄(x₀, r), let ε > 0, and let g be defined on B by g(x) := f(x) + |x − x₀|²/ε. Since f is continuous, g is bounded below on B; and on ∂B, g is bounded below by r²/ε − M, where M is an upper bound for |f| on B. If ε < r²/(2M), then inf_B g ≤ g(x₀) = f(x₀) ≤ M < r²/ε − M ≤ inf_{∂B} g; so g cannot achieve its minimum on ∂B, and has to achieve it at some point x₁ ∈ B \ ∂B. Then g is subdifferentiable at x₁, and therefore f also. This establishes (i).

The other two statements are more tricky. Let f : U → R be a Lipschitz function. For v ∈ R^n and x ∈ U, define

  D_v f(x) := lim_{t→0} [f(x + tv) − f(x)] / t,   (10.6)

provided that this limit exists. The problem is to show that for almost any x, there is a vector p(x) such that D_v f(x) = ⟨p(x), v⟩, and the limit in (10.6) is uniform in, say, v ∈ S^{n−1}. Since the functions [f(x + tv) − f(x)]/t are uniformly Lipschitz in v, it is enough to prove the pointwise convergence (that is, the mere existence of D_v f(x)), and then the limit will automatically be uniform by Ascoli's theorem. So the goal is to show that for almost any x, the limit D_v f(x) exists for any v, and is linear in v.
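The existence and linearity of D_v f(x) at a typical point can be observed numerically on a concrete Lipschitz function. In the sketch below (an added illustration, not part of the original text), f(x) = |x₁| + |x₂| is differentiable exactly when both coordinates are nonzero, which covers almost every point of R²:

```python
import numpy as np

# Added illustration (not from the text): for the Lipschitz function
# f(x) = |x_1| + |x_2|, the directional derivative D_v f(x) of (10.6)
# exists whenever both coordinates of x are nonzero (i.e. almost
# everywhere), and is linear in v, as in (10.7).

f = lambda x: abs(x[0]) + abs(x[1])

def D(f, x, v, t=1e-7):
    # one-sided difference quotient; at a differentiability point it
    # approximates the limit in (10.6)
    return (f(x + t * v) - f(x)) / t

# a point with no zero coordinate (such points fill almost all of R^2)
x = np.array([0.7, -0.4])

rng = np.random.default_rng(2)
v, w = rng.standard_normal(2), rng.standard_normal(2)

# at such x, D_v f(x) = sign(x_1) v_1 + sign(x_2) v_2, linear in v
exact = lambda v: np.sign(x[0]) * v[0] + np.sign(x[1]) * v[1]
assert abs(D(f, x, v) - exact(v)) < 1e-5
assert abs(D(f, x, v + w) - D(f, x, v) - D(f, x, w)) < 1e-5  # additivity
```

At the exceptional points (here, the two coordinate axes) the one-sided quotients still converge, but to a function of v that is only positively homogeneous, not linear.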
It is easily checked that:

(a) D_v f(x) is homogeneous in v: D_{tv} f(x) = t D_v f(x);

(b) D_v f(x) is a Lipschitz function of v on its domain: in fact, |D_v f(x) − D_w f(x)| ≤ L |v − w|, where L = ‖f‖_Lip;

(c) If D_{w_k} f(x) → ℓ as w_k → v, then D_v f(x) = ℓ; this comes from the estimate

  sup_t | [f(x + tv) − f(x)]/t − [f(x + t w_k) − f(x)]/t | ≤ ‖f‖_Lip |v − w_k|.

For each v ∈ R^n, let A_v be the set of x ∈ R^n such that D_v f(x) does not exist. The first claim is that each A_v has zero Lebesgue measure. This is obvious if v = 0. Otherwise, let H = v^⊥ be the hyperplane orthogonal to v, passing through the origin. For each x₀ ∈ H,

let L_{x₀} = x₀ + Rv be the line parallel to v, passing through x₀. The nonexistence of D_v f(x) at x = x₀ + t₀v is equivalent to the nondifferentiability of t ↦ f(x₀ + tv) at t = t₀. Since t ↦ f(x₀ + tv) is Lipschitz R → R, it follows from a well-known result of real analysis that it is differentiable for λ₁-almost all t ∈ R, where λ₁ stands for the one-dimensional Lebesgue measure. So λ₁[A_v ∩ L_{x₀}] = 0. Then by Fubini's theorem, λ_n[A_v] = ∫_H λ₁[A_v ∩ L_{x₀}] dx₀ = 0, where λ_n is the n-dimensional Lebesgue measure, and this proves the claim.

The problem consists in extending the function D_v f into a linear (not just homogeneous) function of v. Let v ∈ R^n, and let ζ be a smooth compactly supported function. Then, by the dominated convergence theorem,

  (D_v f ∗ ζ)(x) = lim_{t→0} ∫ ζ(x − y) [f(y + tv) − f(y)]/t dy
                = lim_{t→0} (1/t) ∫ ζ(x − y) [f(y + tv) − f(y)] dy
                = lim_{t→0} (1/t) ∫ [ζ(x − y + tv) − ζ(x − y)] f(y) dy
                = ∫ ⟨∇ζ(x − y), v⟩ f(y) dy.

(Note that D_v f ∗ ζ is well-defined for any x, even if D_v f is defined only for almost all x.) So D_v f ∗ ζ depends linearly on v. In particular, if v and w are any two vectors in R^n, then

  [D_{v+w} f − D_v f − D_w f] ∗ ζ = 0.

Since ζ is arbitrary, it follows that

  D_{v+w} f(x) = D_v f(x) + D_w f(x)   (10.7)

for almost all x ∈ R^n \ (A_v ∪ A_w ∪ A_{v+w}), that is, for almost all x ∈ R^n.

Now it is easy to conclude. Let B_{v,w} be the set of all x ∈ R^n such that D_v f(x), D_w f(x) or D_{v+w} f(x) is not well-defined, or (10.7) does not hold true. Let (v_k)_{k∈N} be a dense sequence in R^n, and let B := ⋃_{j,k∈N} B_{v_j,v_k}. Then B is still Lebesgue-negligible, and for each x ∉ B we have

  D_{v_j + v_k} f(x) = D_{v_j} f(x) + D_{v_k} f(x).   (10.8)

Since D_{v_k} f(x) is a Lipschitz continuous function of v_k, it can be extended uniquely into a Lipschitz continuous function of v, defined for all x ∉ B

and v ∈ R^n, which turns out to be D_v f(x) in view of Property (c). By passing to the limit in (10.8), we see that D_v f(x) is an additive function of v. We already know that it is a homogeneous function of v, so it is in fact linear. This concludes the proof of (ii).

Next let us turn to the proof of (iii). Before going on, I shall first explain in an informal way the main idea of the proof of statement (iii). Suppose for simplicity that we are dealing with a convex function ψ in R^n. If p lies in the subdifferential ∂ψ(x) of ψ at x, then for all z ∈ R^n,

  ψ(z) ≥ ψ(x) + ⟨p, z − x⟩.

In particular, if p ∈ ∂ψ(x) and p′ ∈ ∂ψ(x′), then

  ⟨p′ − p, x′ − x⟩ ≥ 0.

If ψ is not differentiable at x, this means that the convex set ∂ψ(x) ⊂ R^n is not a single point, so it should contain a line segment [p, p′]. For these heuristic explanations, let us fix p and p′, and let Σ be the set of all x ∈ R^n such that [p, p′] ⊂ ∂ψ(x). Then ⟨p′ − p, x′ − x⟩ ≥ 0 for all x, x′ ∈ Σ. By exchanging the roles of p and p′, we see that actually ⟨p′ − p, x′ − x⟩ = 0. This implies that Σ is included in a single hyperplane, orthogonal to p′ − p; in particular its dimension is at most n − 1.

The rigorous argument can be decomposed into six steps. In the sequel, ψ will stand for a locally subdifferentiable function.

Step 1: ψ is locally semiconvex. Without loss of generality, we may assume that ω(r)/r is nondecreasing continuous (otherwise replace ω(r) by ω̄(r) = r sup {ω(s)/s; s ≤ r}, which is a nondecreasing continuous function of r); then ω(tr) ≤ t ω(r) for t ∈ [0, 1].

Let x₀ ∈ U, and let V be a convex neighborhood of x₀ in U. Let x, y ∈ V, t ∈ [0, 1] and p ∈ ∇⁻ψ((1 − t)x + ty). Then

  ψ(x) ≥ ψ((1 − t)x + ty) + ⟨t(x − y), p⟩ − t ω(|x − y|);   (10.9)

  ψ(y) ≥ ψ((1 − t)x + ty) + ⟨(1 − t)(y − x), p⟩ − (1 − t) ω(|x − y|).   (10.10)

Take the linear combination of (10.9) and (10.10) with respective coefficients (1 − t) and t: the result is

  ψ((1 − t)x + ty) ≤ (1 − t) ψ(x) + t ψ(y) + 2 t(1 − t) ω(|x − y|).   (10.11)
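Inequality (10.11) can be tested numerically on a concrete locally subdifferentiable function. The sketch below (an added illustration, not part of the original text) uses ψ(x) = |x| − x² on R, which is uniformly subdifferentiable with modulus ω(r) = r²: the convex part |x| contributes nothing to the error term, while the smooth part −x² has second derivative −2.

```python
import numpy as np

# Added illustration (not from the text): check inequality (10.11) for
# psi(x) = |x| - x^2 on R, uniformly subdifferentiable with
# omega(r) = r^2 (convex part |x| costs nothing; -x^2 has second
# derivative -2).

psi = lambda x: abs(x) - x**2
omega = lambda r: r**2

rng = np.random.default_rng(3)
for _ in range(1000):
    x, y = rng.uniform(-1, 1, 2)
    t = rng.uniform(0, 1)
    lhs = psi((1 - t) * x + t * y)
    rhs = (1 - t) * psi(x) + t * psi(y) + 2 * t * (1 - t) * omega(abs(x - y))
    assert lhs <= rhs + 1e-12
```

In fact, for this ψ the inequality even holds with t(1 − t)ω in place of 2t(1 − t)ω, since the quadratic part saturates it exactly.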

Step 2: ψ is locally bounded above. Let x₀ ∈ U, and let x₁, ..., x_N ∈ U be such that the convex hull C of (x₁, ..., x_N) is a neighborhood of x₀ (N = 2^n will do). Any point of C can be written as Σ α_i x_i, where α_i ≥ 0, Σ α_i = 1. By (10.11) and finite induction,

  ψ(Σ_i α_i x_i) ≤ Σ_i α_i ψ(x_i) + 2N max_{ij} ω(|x_i − x_j|);

so ψ is bounded above on C, and therefore in a neighborhood of x₀.

Step 3: ψ is locally bounded below. Let x₀ ∈ U, let V be a neighborhood of x₀ on which ψ is bounded above, and let B = B_r(x₀), where r is such that B_r(x₀) ⊂ V. For any x ∈ B, let y = 2x₀ − x; then |y − x₀| = |x − x₀| < r, so y ∈ B, and

  ψ(x₀) = ψ((x + y)/2) ≤ ψ(x)/2 + ψ(y)/2 + ω(|x − y|)/2.

Since ψ(x₀) is fixed and ψ(y) is bounded above, it follows that ψ(x) is bounded below for x ∈ B.

Step 4: ψ is locally Lipschitz. Let x₀ ∈ U, let V be a neighborhood of x₀ on which |ψ| ≤ M < +∞, and let r > 0 be such that B_{2r}(x₀) ⊂ V. For any y, y′ ∈ B_{r/2}(x₀), we can write y′ = (1 − t)y + tz, where t = |y − y′|/r, z = (y′ − y)/t + y ∈ B_{2r}(x₀) and |y − z| = r. Then ψ(y′) ≤ (1 − t) ψ(y) + t ψ(z) + 2 t(1 − t) ω(|y − z|), so

  [ψ(y′) − ψ(y)] / |y′ − y| ≤ [ψ(z) − ψ(y)] / |y − z| + 2 ω(|y − z|) / |y − z|
                            ≤ 2M/r + 2 ω(r)/r.

Thus the ratio [ψ(y′) − ψ(y)] / |y′ − y| is uniformly bounded above in B_{r/2}(x₀). By symmetry (exchange y and y′), it is also uniformly bounded below, and ψ is Lipschitz on B_{r/2}(x₀).

Step 5: ∇⁻ψ is continuous. This means that if p_k ∈ ∇⁻ψ(x_k) and (x_k, p_k) → (x, p), then p ∈ ∇⁻ψ(x). To prove this, it suffices to pass to the limit in the inequality

  ψ(z) ≥ ψ(x_k) + ⟨p_k, z − x_k⟩ − ω(|z − x_k|).

Step 6: ψ is differentiable out of an (n − 1)-dimensional set.
Indeed, let Σ be the set of points x such that ∇⁻ψ(x) is not reduced to a

single element. Since ∇⁻ψ(x) is a convex set, for each x ∈ Σ there is a nontrivial segment [p, p′] ⊂ ∇⁻ψ(x). So

  Σ = ⋃_{ℓ∈N} Σ^(ℓ),

where Σ^(ℓ) is the set of points x such that ∇⁻ψ(x) contains a segment [p, p′] of length 1/ℓ with |p| ≤ ℓ. To conclude, it is sufficient to show that each Σ^(ℓ) is countably (n − 1)-rectifiable, and for that it is sufficient to show that for each x ∈ Σ^(ℓ) the dimension of the tangent cone T_x Σ^(ℓ) is at most n − 1 (Theorem 10.48(i) in the First Appendix).

So let x ∈ Σ^(ℓ), and let q ∈ T_x Σ^(ℓ), q ≠ 0. By assumption, there are sequences x_k ∈ Σ^(ℓ) and t_k → 0 such that

  (x_k − x)/t_k → q.

In particular |x_k − x|/t_k converges to the finite, nonzero limit |q|. For any k ∈ N, there is a segment [p_k, p′_k], of length ℓ^{−1}, contained in ∇⁻ψ(x_k); and |p_k| ≤ ℓ, |p′_k| ≤ ℓ + ℓ^{−1}. By compactness, up to extraction of a subsequence one has x_k → x, p_k → p, p′_k → p′, |p − p′| = ℓ^{−1}. By continuity of ∇⁻ψ, both p and p′ belong to ∇⁻ψ(x). Then the two inequalities

  ψ(x_k) ≥ ψ(x) + ⟨p′, x_k − x⟩ − ω(|x_k − x|),
  ψ(x) ≥ ψ(x_k) + ⟨p_k, x − x_k⟩ − ω(|x_k − x|)

combine to yield

  ⟨p′ − p_k, x − x_k⟩ ≥ −2 ω(|x_k − x|).

So

  ⟨p′ − p_k, (x − x_k)/t_k⟩ ≥ −2 [ω(|x_k − x|)/|x_k − x|] · [|x_k − x|/t_k].

Passing to the limit (recall that ω(r) = o(r)), we find

  ⟨p′ − p, q⟩ ≤ 0.

But the roles of p and p′ can be exchanged, so actually

  ⟨p′ − p, q⟩ = 0.

Since p′ − p is nonzero, this means that q belongs to the hyperplane (p′ − p)^⊥. So for each x ∈ Σ^(ℓ), the tangent cone T_x Σ^(ℓ) is included in a hyperplane.

To conclude the proof of (iii), it only remains to prove the equivalence between differentiability and unique subdifferentiability. If x is a differentiability point of ψ, then we know from Proposition 10.7 that there is a unique subgradient of ψ at x. Conversely, assume that x is such that ∇⁻ψ(x) = {p}. Let (x_k)_{k∈N} be a dense sequence in a neighborhood of x; for each k ∈ N, let p_k ∈ ∇⁻ψ(x_k). By definition of subdifferentiability,

  ψ(x) ≥ ψ(x_k) + ⟨p_k, x − x_k⟩ − ω(|x − x_k|)
       = ψ(x_k) + ⟨p, x − x_k⟩ + ⟨p_k − p, x − x_k⟩ − ω(|x − x_k|).

The continuity of ∇⁻ψ imposes p_k → p as x_k → x; it follows that ψ(x) ≥ ψ(x_k) + ⟨p, x − x_k⟩ − o(|x − x_k|). By density,

  ψ(x) ≥ ψ(y) + ⟨p, x − y⟩ − o(|x − y|)  as y → x.

This shows that p ∈ ∇⁺ψ(x); then by Proposition 10.7, p = ∇ψ(x) and the proof is finished. ⊓⊔

Semiconvexity and semiconcavity

Convexity can be expressed without any reference to smoothness, yet it implies a lower bound on the Hessian. In nonsmooth analysis, convexity-type estimates are often used as a replacement for second-order derivative bounds. In this respect the notion of semiconvexity is extremely convenient.

Definition 10.10 (Semiconvexity). Let U be an open set of a smooth Riemannian manifold and let ω : R₊ → R₊ be continuous, such that ω(r) = o(r) as r → 0. A function f : U → R ∪ {+∞} is said to be semiconvex with modulus ω if, for any constant-speed geodesic path (γ_t)_{0≤t≤1} whose image is included in U,

  f(γ_t) ≤ (1 − t) f(γ₀) + t f(γ₁) + t(1 − t) ω(d(γ₀, γ₁)).   (10.12)

It is said to be locally semiconvex if for each x₀ ∈ U there is a neighborhood V of x₀ in U such that (10.12) holds true as soon as γ₀, γ₁ ∈ V; or equivalently if (10.12) holds true for some fixed modulus ω_K as long as γ stays in a compact subset K of U.

Similar definitions for semiconcavity and local semiconcavity are obtained in an obvious way by reversing the sign of the inequality in (10.12).

Example 10.11. In R^n, semiconvexity with modulus ω means that for any x, y in R^n and for any t ∈ [0, 1],

  f((1 − t)x + ty) ≤ (1 − t) f(x) + t f(y) + t(1 − t) ω(|x − y|).

When ω = 0 this is the usual notion of convexity. In the case ω(r) = Cr²/2, there is a differential characterization of semiconvexity in terms of Hessian matrices: f : R^n → R is semiconvex with modulus ω(r) = Cr²/2 if and only if ∇²f ≥ −C I_n. (If f is not twice differentiable, then ∇²f should be interpreted in the distributional sense.)

A well-known theorem states that a convex function is subdifferentiable everywhere in the interior of its domain. The next result generalizes this property to semiconvex functions.

Proposition 10.12 (Local equivalence of semiconvexity and subdifferentiability). Let M be a smooth complete Riemannian manifold. Then:

(i) If ψ : M → R ∪ {+∞} is locally semiconvex, then it is locally subdifferentiable in the interior of its domain D := ψ^{−1}(R); and ∂D is countably (n − 1)-rectifiable;

(ii) Conversely, if U is an open subset of M, and ψ : U → R is locally subdifferentiable, then it is also locally semiconvex.

Similar statements hold true with "subdifferentiable" replaced by "superdifferentiable" and "semiconvex" replaced by "semiconcave".

Remark 10.13. This proposition implies that local semiconvexity and local subdifferentiability are basically the same. But there is also a global version of semiconvexity.

Remark 10.14.
We already proved part of Proposition 10.12 in the proof of Theorem 10.8 when M = R^n. Since the concept of semiconvexity is not invariant by diffeomorphism (unless this diffeomorphism is an isometry), we'll have to redo the proof.
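The differential characterization of Example 10.11 is easy to test numerically. In the sketch below (an added illustration, not part of the original text), f = cos has f″ = −cos ≥ −1, so f should be semiconvex with modulus ω(r) = r²/2 (take C = 1), or equivalently f + |·|²/2 should be convex:

```python
import numpy as np

# Added illustration (not from the text): the differential criterion of
# Example 10.11.  f(x) = cos(x) has f'' = -cos(x) >= -1, so f should be
# semiconvex with modulus omega(r) = r^2/2 (C = 1); equivalently,
# g = f + |.|^2/2 should be convex.

f = np.cos
g = lambda z: np.cos(z) + z**2 / 2

rng = np.random.default_rng(4)
for _ in range(1000):
    x, y = rng.uniform(-10, 10, 2)
    t = rng.uniform(0, 1)
    m = (1 - t) * x + t * y
    # semiconvexity inequality with omega(r) = r^2 / 2
    assert f(m) <= (1 - t) * f(x) + t * f(y) + t * (1 - t) * (x - y)**2 / 2 + 1e-9
    # equivalent statement: g is convex along segments
    assert g(m) <= (1 - t) * g(x) + t * g(y) + 1e-9
```

The two assertions are in fact the same inequality: expanding the quadratic term gives (1 − t)x² + ty² − ((1 − t)x + ty)² = t(1 − t)(x − y)², which is exactly the error term t(1 − t)ω(|x − y|).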

Proof of Proposition 10.12. For each x₀ ∈ M, let O_{x₀} be an open neighborhood of x₀. There is an open neighborhood V_{x₀} of x₀ and a continuous function ω = ω_{x₀} such that (10.12) holds true for any geodesic whose image is included in V_{x₀}. The open sets O_{x₀} ∩ V_{x₀} cover M, which is a countable union of compact sets; so we may extract from this family a countable covering of M. If the property is proven in each O_{x₀} ∩ V_{x₀}, then the conclusion will follow. So it is sufficient to prove (i) in any arbitrarily small neighborhood of any given point x₀. In the sequel, U will stand for such a neighborhood.

Let D = ψ^{−1}(R). If x₀, x₁ ∈ D ∩ U, and γ is a geodesic joining x₀ to x₁, then ψ(γ_t) ≤ (1 − t) ψ(γ₀) + t ψ(γ₁) + t(1 − t) ω(d(γ₀, γ₁)) is finite for all t ∈ [0, 1]. So D is geodesically convex.

If U is small enough, then (a) any two points in U are joined by a unique geodesic; (b) U is isometric to a small open subset V of R^n equipped with some Riemannian distance d. Since the property of (geodesic) semiconvexity is invariant by isometry, we may work in V equipped with the distance d. If x and y are two given points in V, let m(x, y) stand for the midpoint of x and y (with respect to the geodesic distance). Because d is Riemannian, one has

  m(x, y) = (x + y)/2 + o(|x − y|).

Then let x ∈ D and let T_x D be the tangent cone of D at x. If p, p′ are any two points in T_x D, there are sequences x_k → x, x′_k → x and t_k → 0, t′_k → 0 such that

  (x_k − x)/t_k → p,  (x′_k − x)/t′_k → p′;

up to extraction and renormalization we may assume t_k = t′_k. Then x_k, x′_k ∈ D and m(x_k, x′_k) = (x_k + x′_k)/2 + o(|x_k − x′_k|) = x + t_k (p + p′)/2 + o(t_k), so

  (p + p′)/2 = lim_{k→∞} [m(x_k, x′_k) − x]/t_k ∈ T_x D.

Thus T_x D is a convex cone. This leaves two possibilities: either T_x D is included in a half-space, or it is the whole of R^n.
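The midpoint estimate m(x, y) = (x + y)/2 + o(|x − y|) can be observed numerically. The sketch below (an added illustration, not part of the original text) compares the geodesic and Euclidean midpoints of two nearby points on the unit circle S¹ ⊂ R², where the geodesic midpoint is simply the point at the average angle:

```python
import numpy as np

# Added illustration (not from the text): on the unit circle, the
# geodesic (arc) midpoint of two nearby points differs from the
# Euclidean (chord) midpoint by o(|x - y|).

def point(theta):
    return np.array([np.cos(theta), np.sin(theta)])

ratios = []
for theta in [0.1, 0.01, 0.001]:
    x, y = point(0.0), point(theta)
    m_geo = point(theta / 2)      # geodesic midpoint: average angle
    m_eucl = (x + y) / 2          # Euclidean midpoint: chord midpoint
    err = np.linalg.norm(m_geo - m_eucl)
    ratios.append(err / np.linalg.norm(x - y))

# the ratio err / |x - y| tends to 0, i.e. the error is o(|x - y|)
assert ratios[0] > ratios[1] > ratios[2]
assert ratios[2] < 1e-3
```

Here the error is in fact O(|x − y|²) (the ratio decays like |x − y|/8), which is typical of Riemannian distances.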
Assume that T_x D = R^n. If C is a small (Euclidean) cube of side 2r centered at x₀, for r small enough any point in a neighborhood of x₀ can be written as a combination of barycenters of the vertices x₁, ..., x_N of C, and all these barycenters will lie within a ball of radius 2r centered

at x₀. (Indeed, let C₀ stand for the set of the vertices of the cube C; then C₀ = {x^(ε); ε ∈ {±1}^n}, where x^(ε) = x₀ + r Σ_j ε_j e_j, (e_j)_{1≤j≤n} being the canonical basis of R^n. For each ε ∈ {±1}^{n−1}, let C₁^(ε) be the union of geodesic segments [x^(ε,−1), x^(ε,1)]. Then for ε ∈ {±1}^{n−2}, let C₂^(ε) be the union of geodesic segments between an element of C₁^(ε,−1) and an element of C₁^(ε,1); etc. After n operations we have a simply connected set C_n which asymptotically coincides with the whole (solid) cube as r → 0; so it is a neighborhood of x₀ for r small enough.) Then in the interior of C_n, ψ is bounded above by max {ψ(x₁), ..., ψ(x_N)} + sup {ω(s); s ≤ 4r}. This shows at the same time that x₀ lies in the interior of D, and that ψ is bounded above around x₀.

In particular, if x ∈ ∂D, then T_x D cannot be R^n, so it is included in a half-space. By Theorem 10.48(ii) in the Appendix, this implies that ∂D is countably (n − 1)-rectifiable.

In the sequel, x₀ will be an interior point of D, and we shall show that ψ is locally subdifferentiable around x₀. We have just seen that ψ is bounded above in a neighborhood of x₀; we shall now see that it is also bounded below. Let B = B_r(x₀); if r > 0 is sufficiently small, then for any y ∈ B there is y′ ∈ B such that x₀ is the midpoint of y and y′. (Indeed, take the geodesic γ starting from y and going to x₀, say γ(t) = exp_y(tv) with exp_y(v) = x₀, 0 ≤ t ≤ 1, and extend it up to time 2, setting y′ = exp_y(2v). If B is small enough the geodesic is automatically minimizing up to time 2, and x₀ = m(y, y′).) Then

  ψ(x₀) ≤ (1/2) [ψ(y) + ψ(y′)] + sup_{s ≤ 2r} ω(s).

Since ψ(x₀) is fixed and ψ(y′) is bounded above, this shows that ψ(y) is bounded below as y varies in B.

Next, let us show that ψ is locally Lipschitz. Let V be a neighborhood of x₀ in which |ψ| is bounded by M.
If r > 0 is small enough, then for any y, y' ∈ B_r(x_0) there is z = z(y, y') ∈ V such that y' lies on the geodesic from y to z, with d(y, z) = 4r and λ := d(y, y')/(4r) ∈ [0, 1/2]. (Indeed, choose r so small that all geodesics in B_{5r}(x_0) are minimizing, and B_{5r}(x_0) ⊂ V. Given y and y', take the geodesic going from y to y', say exp_y(tv), 0 ≤ t ≤ 1; extend it up to time t(λ) = 1/(1−λ), and write z = exp_y(t(λ)v). Then d(x_0, z) ≤ d(x_0, y) + t(λ) d(y, y') ≤ d(x_0, y) + 2 d(y, y') < 5r.) So

ψ(y') ≤ (1−λ) ψ(y) + λ ψ(z) + λ(1−λ) ω(d(y, z)),

whence

[ψ(y') − ψ(y)]/d(y, y') = [ψ(y') − ψ(y)]/(λ d(y, z)) ≤ [ψ(z) − ψ(y)]/d(y, z) + ω(d(y, z))/d(y, z).    (10.13)

Since d(y, z) = d(y, y')/λ = 4r, (10.13) implies

[ψ(y') − ψ(y)]/d(y, y') ≤ (2M + ω(4r))/(4r).

So the ratio [ψ(y') − ψ(y)]/d(y, y') is bounded above in B_r(x_0); by symmetry (exchange y and y'), it is also bounded below, and ψ is Lipschitz in B_r(x_0).

The next step consists in showing that there is a uniform modulus of subdifferentiability (at points of subdifferentiability!). More precisely, if ψ is subdifferentiable at x and p ∈ ∇⁻ψ(x), then for any w ≠ 0 with |w| small enough,

ψ(exp_x w) ≥ ψ(x) + ⟨p, w⟩ − ω(|w|).    (10.14)

Indeed, let γ(t) = exp_x(tw) and y = exp_x w; then for any t ∈ [0, 1],

ψ(γ(t)) ≤ (1−t) ψ(x) + t ψ(y) + t(1−t) ω(|w|),

so

[ψ(γ(t)) − ψ(x)]/(t|w|) ≤ [ψ(y) − ψ(x)]/|w| + (1−t) ω(|w|)/|w|.

On the other hand, by subdifferentiability,

[ψ(γ(t)) − ψ(x)]/(t|w|) ≥ ⟨p, tw⟩/(t|w|) − o(t|w|)/(t|w|) = ⟨p, w/|w|⟩ − o(t|w|)/(t|w|).

The combination of the two previous inequalities implies

⟨p, w/|w|⟩ − o(t|w|)/(t|w|) ≤ [ψ(y) − ψ(x)]/|w| + (1−t) ω(|w|)/|w|.

The limit t → 0 gives (10.14). At the same time, it shows that for |w| ≤ r,

⟨p, w/|w|⟩ ≤ ‖ψ‖_{Lip(B_{2r}(x_0))} + ω(r)/r.

By choosing w = rp, we conclude that |p| is bounded above, independently of x. So ∇⁻ψ is locally bounded, in the sense that there is a uniform bound on the norms of all elements of ∇⁻ψ(x) when x varies in a compact subset of the domain of ψ.

At last we can conclude. Let x be interior to D. By Theorem 10.8(i), there is a sequence x_k → x such that ∇⁻ψ(x_k) ≠ ∅. For each k ∈ ℕ, let p_k ∈ ∇⁻ψ(x_k). As k → ∞, there is a uniform bound on |p_k|, so we may extract a subsequence such that p_k → p ∈ ℝⁿ. For each k and each w ∈ ℝⁿ small enough,

ψ(exp_{x_k} w) ≥ ψ(x_k) + ⟨p_k, w⟩ − ω(|w|).

Since ψ is continuous, we may pass to the limit as k → ∞ and recover (for |w| small enough)

ψ(exp_x w) ≥ ψ(x) + ⟨p, w⟩ − ω(|w|).

So ψ is uniformly subdifferentiable around x, and the proof of (i) is complete.

Statement (ii) is much easier to prove, and completely similar to an argument already used in the proof of Theorem 10.8(iii). Let x ∈ U, and let V be a small neighborhood of x, such that f is uniformly subdifferentiable in V with modulus ω. Without loss of generality, assume that ω(r)/r is a nondecreasing function of r; otherwise replace ω(r) by ω̃(r) = r sup{ω(s)/s; 0 < s ≤ r}. Let W ⊂ V be a neighborhood of x, small enough that any two points y, y' in W can be joined by a unique geodesic γ^{y,y'} whose image is contained in V; by abuse of notation I shall write y' − y for the initial velocity of γ^{y,y'}.

Then let γ be a geodesic such that γ_0, γ_1 ∈ V; let t ∈ [0, 1], and let p ∈ ∇⁻f(γ_t). It follows from the subdifferentiability that

f(γ_t) ≤ f(γ_1) − ⟨p, γ_1 − γ_t⟩ + ω(d(γ_t, γ_1)).

Then since d(γ_t, γ_1) = (1−t) d(γ_0, γ_1) and ω(r)/r is nondecreasing,

f(γ_t) ≤ f(γ_1) − ⟨p, γ_1 − γ_t⟩ + (1−t) ω(d(γ_0, γ_1)).    (10.15)

Similarly,

f(γ_t) ≤ f(γ_0) − ⟨p, γ_0 − γ_t⟩ + t ω(d(γ_0, γ_1)).    (10.16)

Now take the linear combination of (10.15) and (10.16) with coefficients t and 1−t: since t(γ_1 − γ_t) + (1−t)(γ_0 − γ_t) = 0 (in T_{γ_t}M), we recover

f(γ_t) − [(1−t) f(γ_0) + t f(γ_1)] ≤ 2 t(1−t) ω(d(γ_0, γ_1)).

This proves that f is semiconvex in W. ⊓⊔
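The cancellation used in the last step deserves one line of justification (added here; recall the convention that γ_b − γ_a denotes the initial velocity of the geodesic going from γ_a to γ_b):

```latex
% For a constant-speed geodesic (\gamma_t)_{0\le t\le 1},
\gamma_1-\gamma_t=(1-t)\,\dot\gamma(t), \qquad
\gamma_0-\gamma_t=-\,t\,\dot\gamma(t) \qquad\text{in } T_{\gamma_t}M,
% so that
t\,(\gamma_1-\gamma_t)+(1-t)\,(\gamma_0-\gamma_t)
  \;=\; t(1-t)\,\dot\gamma(t)-(1-t)\,t\,\dot\gamma(t)\;=\;0 .
```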

Assumptions on the cost function

Let M be a smooth complete connected Riemannian manifold, let X be a closed subset of M, let Y be an arbitrary Polish space, and let c : M × Y → ℝ be a continuous cost function. (Most of the time, we shall have X = Y = M.) I shall impose certain assumptions on the behavior of c as a function of x, when x varies in the interior (in M) of X. They will be chosen from the following list (the notation T_x S stands for the tangent cone to S at x, see Definition 10.46):

(Super) c(x, y) is everywhere superdifferentiable as a function of x, for all y.

(Twist) On its domain of definition, ∇_x c(x, ·) is injective: if x, y, y' are such that ∇_x c(x, y) = ∇_x c(x, y'), then y = y'.

(Lip) c(x, y) is locally Lipschitz as a function of x, uniformly in y.

(SC) c(x, y) is locally semiconcave as a function of x, uniformly in y.

(locLip) c(x, y) is locally Lipschitz as a function of x, locally in y.

(locSC) c(x, y) is locally semiconcave as a function of x, locally in y.

(H∞1) For any x and for any measurable set S which does not "lie on one side of x" (in the sense that T_x S is not contained in a half-space), there are a finite collection of elements z_1, ..., z_k ∈ S and a small open ball B containing x, such that for any y outside of a compact set,

inf_{w∈B} c(w, y) ≥ inf_{1≤j≤k} c(z_j, y).

(H∞2) For any x and any neighborhood U of x there is a small ball B containing x such that

lim_{y→∞} inf_{z∈U} sup_{w∈B} [c(z, y) − c(w, y)] = −∞.

Our theorems of solvability of the Monge problem will be expressed in terms of these assumptions. (I will write (H∞) for the combination of (H∞1) and (H∞2).) There are some obvious implications between them, in particular (locSC) implies (Super). Before going any further, I shall give some informal explanations about (H∞1) and (H∞2), which probably look obscure to the reader. Both of them are assumptions about the behavior of c(x, y) as y → ∞; therefore they are void if y varies in a compact set. They are essentially quantitative versions of the following statement: For any y it is possible to lower the cost to go from x to y, by starting from a well-chosen point z close to x. For instance, if c is a radially symmetric cost on ℝⁿ × ℝⁿ, then I would choose z very close to x, "opposite to y".

In the rest of this section, I shall discuss some simple sufficient conditions for all these assumptions to hold true. The first result is that Conditions (Super), (Twist), (locLip) and (locSC) are satisfied by many Lagrangian cost functions.

Proposition 10.15 (Properties of Lagrangian cost functions). On a smooth Riemannian manifold M, let c(x, y) be a cost function associated with a C¹ Lagrangian L(x, v, t). Assume that any x, y ∈ M can be joined by at least one C¹ minimizing curve. Then:

(i) For any (x, y) ∈ M × M, and any C¹ minimizing curve γ connecting x to y, the tangent vector −∇_v L(x, γ̇_0, 0) ∈ T_x M is a supergradient for c(·, y) at x; in particular, c is superdifferentiable at (x, y) as a function of x.

(ii) If L is strictly convex as a function of v, and minimizing curves are uniquely determined by their initial position and velocity, then c satisfies a twist condition: If c(x, y) is differentiable at (x, y) as a function of x, then y is uniquely determined by x and ∇_x c(x, y). Moreover,

∇_x c(x, y) + ∇_v L(x, γ̇(0), 0) = 0,

where γ is the unique minimizing curve joining x to y.
(iii) If L has the property that, for any two compact sets K_0 and K_1, the velocities of minimizing curves starting in K_0 and ending in K_1 are uniformly bounded, then c is locally Lipschitz and locally semiconcave as a function of x, locally in y.

Example 10.16. Consider the case L(x, v, t) = |v|². Then ∇_v L = 2v, and (i) says that −2v_0 is a supergradient of d(·, y)² at x, where v_0 is the velocity used to go from x to y. This is a generalization of the usual formula in Euclidean space:

∇_x |x − y|² = 2(x − y) = −2(y − x).

Also (ii) says that if d(x, y)² is differentiable at (x, y) as a function of x, then x and y are connected by a unique minimizing geodesic.¹

Remark 10.17. The requirements in (ii) and (iii) are fulfilled if the Lagrangian L is time-independent, C², strictly convex and superlinear as a function of v (recall Example 7.5). But they also hold true for other interesting cases such as L(x, v, t) = |v|^{1+α}, 0 < α < 1.

Remark 10.18. Part (i) of Proposition 10.15 means that the behavior of the (squared) distance function is typical: if one plots c(x, y) as a function of x, for fixed y, one will always see upward-pointing crests as in Figure 10.1, never downward-pointing ones.

Proof of Proposition 10.15. The proof is based on the formula of first variation. Let (x, y) be given, and let γ(t), 0 ≤ t ≤ 1, be a minimizing curve joining x to y. Let γ̃ be another curve, not necessarily minimizing, joining x̃ to ỹ. Assume that x̃ is very close to x, so that there is a unique geodesic joining x to x̃; by abuse of notation, I shall write x̃ − x for the initial velocity of this geodesic. Similarly, let us assume that ỹ is very close to y. Then, by the formula of first variation,

A(γ̃) = ∫₀¹ L(γ_t, γ̇_t, t) dt + ∇_v L(y, γ̇_1, 1) · (ỹ − y) − ∇_v L(x, γ̇_0, 0) · (x̃ − x) + ω( sup_{0≤t≤1} d(γ_t, γ̃_t) ),    (10.17)

where ω(r)/r → 0, and ω only depends on the behavior of the manifold in a neighborhood of γ, and on a modulus of continuity for the derivatives of L in a neighborhood of {(γ_t, γ̇_t, t)}_{0≤t≤1}. Without loss of generality, we may assume that ω(r)/r is nonincreasing as r ↓ 0.

Then let x̃ be arbitrarily close to x. By working in smooth charts, it is easy to construct a curve γ̃ joining γ̃_0 = x̃ to γ̃_1 = y, in such a way that d(γ_t, γ̃_t) ≤ d(x, x̃). Then by (10.17),

c(x̃, y) ≤ A(γ̃) ≤ c(x, y) − ⟨∇_v L(x, v, 0), x̃ − x⟩ + ω(|x̃ − x|),

which proves (i).

¹ As pointed out to me by Fathi, this implies (by Theorem 10.8(iii)) that d² is also differentiable at (x, y) as a function of y!

Now for the proof of (ii): If c(·, y) is not only superdifferentiable but plainly differentiable, then by Proposition 10.7 there is just one supergradient, which is the gradient, so −∇_v L(x, v, 0) = ∇_x c(x, y). Since L is strictly convex in the v variable, this equation determines v uniquely. By assumption, this in turn determines the whole geodesic γ, and in particular y.

Finally, let us consider (iii). When x and y vary in small balls, the velocity v along the minimizing curves will be bounded by assumption; so the function ω will also be uniform. Then c(x, y) is locally superdifferentiable as a function of x, and the conclusion comes from Proposition 10.12. ⊓⊔

Proposition 10.15 is basically all that is needed to treat quite general cost functions on a compact Riemannian manifold. But for noncompact manifolds, it might be difficult to check Assumptions (Lip), (SC) or (H∞). Here are a few examples where this can be done.

Example 10.19. Gangbo and McCann have considered cost functions of the form c(x, y) = c(x − y) on ℝⁿ × ℝⁿ, satisfying the following assumption: For any given r > 0 and θ ∈ (0, π), if |y| is large enough then there is a cone K_{r,θ}(y, e), with apex y, direction e, height r and angle θ, such that c takes its maximum on K_{r,θ}(y, e) at y. Let us check briefly that this assumption implies (H∞1). (The reader who feels that both assumptions are equally obscure may very well skip this and jump directly to Example 10.20.) Let S be given such that T_x S is included in no half-space. Then for each direction e ∈ S^{n−1} there are points z_+ and z_− in S, one on each side of the hyperplane passing through x and orthogonal to e. By a compactness argument, one can find a finite collection of points z_1, ..., z_k in S, an angle θ < π and a positive number r > 0 such that for all e ∈ S^{n−1} and for any w close enough to x, the truncated cone K_{r,θ}(w, e) contains at least one of the z_j; equivalently, K_{r,θ}(w − y, e) contains z_j − y. But by assumption, for |w − y| large enough there is a cone K_{r,θ}(w − y, e) such that c(z) ≤ c(w − y) for all z ∈ K_{r,θ}(w − y, e). This inequality applies to z = z_j − y (for some j), and then c(z_j − y) ≤ c(w − y).

Example 10.20. As a particular case of the previous example, (H∞1) holds true if c = c(x − y) is radially symmetric and strictly increasing as a function of |x − y|.

Example 10.21. Gangbo and McCann also considered cost functions c(x, y) = c(x − y) with c convex and superlinear. This assumption implies (H∞2). Indeed, if x ∈ ℝⁿ and ε > 0 are given, let z = x − ε(x − y)/|x − y|; it suffices to show that

c(z − y) − c(x − y) → −∞ as y → ∞,

or equivalently, with h = x − y,

c(h) − c(h(1 − ε/|h|)) → +∞ as h → ∞.

But the inequality c(0) ≥ c(p) + ∇c(p) · (−p) and the superlinearity of c imply ∇c(p) · (p/|p|) → +∞ as p → ∞; then, with the notation h_ε = h(1 − ε/|h|),

c(h) − c(h_ε) ≥ ∇c(h_ε) · (h − h_ε) = ∇c(h_ε) · (εh/|h|) = ε ∇c(h_ε) · (h_ε/|h_ε|) → +∞ as |h_ε| → ∞,

as desired.

Example 10.22. If (M, g) is a Riemannian manifold with nonnegative sectional curvature, then (as recalled in the Third Appendix) ∇²_x (d(x, x_0)²/2) ≤ g_x, and it follows that c(x, y) = d(x, y)² is semiconcave with a modulus ω(r) = r². This condition of nonnegative curvature is quite restrictive, but there does not seem to be any good alternative geometric condition implying the semiconcavity of d(x, y)², uniformly in x and y.

I conclude this section with an open problem:

Open Problem 10.23. Find simple sufficient conditions on a rather general Lagrangian on an unbounded Riemannian manifold, so that it will satisfy (H∞).

Differentiability of c-convex functions

We shall now return to optimal transport, and arrive at the core of the analysis of the Monge problem: the study of the regularity of c-convex functions. This includes plain subdifferentiability, c-subdifferentiability, and differentiability.

In Theorems 10.24 to 10.26, M is a complete connected Riemannian manifold of dimension n, X is a closed subset of M such that the frontier ∂X (in M) is of dimension at most n − 1 (for instance it is locally a graph), and Y is an arbitrary Polish space. The cost function c : X × Y → ℝ is assumed to be continuous. The statements will be expressed in terms of the assumptions appearing in the previous section; these assumptions will be made for interior points, that is, points lying in the interior of X (viewed as a subset of M).

Theorem 10.24 (c-subdifferentiability of c-convex functions). Let Assumption (H∞) be satisfied. Let ψ : X → ℝ ∪ {+∞} be a c-convex function, and let Ω be the interior (in M) of its domain ψ^{−1}(ℝ). Then ψ^{−1}(ℝ) \ Ω is a set of dimension at most n − 1. Moreover, ψ is locally bounded and c-subdifferentiable everywhere in Ω. Finally, if K ⊂ Ω is compact, then ∂_c ψ(K) is itself compact.

Theorem 10.25 (Subdifferentiability of c-convex functions). Assume that (Super) is satisfied. Let ψ be a c-convex function, and let x be an interior point of X (in M) such that ∂_c ψ(x) ≠ ∅. Then ψ is subdifferentiable at x. In short:

∂_c ψ(x) ≠ ∅  ⟹  ∇⁻ψ(x) ≠ ∅.

More precisely, for any y ∈ ∂_c ψ(x), one has −∇⁺_x c(x, y) ⊂ ∇⁻ψ(x).

Theorem 10.26 (Differentiability of c-convex functions). Assume that (Super) and (Twist) are satisfied, and let ψ be a c-convex function. Then:

(i) If (Lip) is satisfied, then ψ is locally Lipschitz and differentiable in X, apart from a set of zero volume; the same is true if (locLip) and (H∞) are satisfied.

(ii) If (SC) is satisfied, then ψ is locally semiconvex and differentiable in the interior (in M) of its domain, apart from a set of dimension at most n − 1; and the boundary of the domain of ψ is also of dimension at most n − 1. The same is true if (locSC) and (H∞) are satisfied.

Proof of Theorem 10.24. Let S = ψ^{−1}(ℝ) \ ∂X. (Here ∂X is the boundary of X in M, which by assumption is of dimension at most n − 1.)

I shall use the notion of tangent cone defined later in Definition 10.46, and show that if x ∈ S is such that T_x S is not included in a half-space, then ψ is bounded on a small ball around x. It will follow that x is in fact in the interior of Ω. So for each x ∈ S \ Ω, T_x S will be included in a half-space, and by Theorem 10.48(ii), S \ Ω will be of dimension at most n − 1. Moreover, this will show that ψ is locally bounded in Ω.

So let x be such that ψ(x) < +∞ and T_x S is not included in a half-space. By assumption, there are points z_1, ..., z_k in S, a small ball B around x, and a compact set K ⊂ Y such that for any y ∈ Y \ K,

inf_{w∈B} c(w, y) ≥ inf_{1≤j≤k} c(z_j, y).

Let φ be the c-transform of ψ. For any y ∈ Y \ K,

φ(y) − inf_{w∈B} c(w, y) ≤ φ(y) − inf_{1≤j≤k} c(z_j, y) ≤ sup_{1≤j≤k} [φ(y) − c(z_j, y)] ≤ sup_{1≤j≤k} ψ(z_j).

So

∀w ∈ B, ∀y ∈ Y \ K,  φ(y) − c(w, y) ≤ sup_{1≤j≤k} ψ(z_j).

When y ∈ K, the trivial bound φ(y) − c(w, y) ≤ ψ(x) + c(x, y) − c(w, y) implies

∀w ∈ B,  ψ(w) = sup_{y∈Y} [φ(y) − c(w, y)] ≤ max( sup_{1≤j≤k} ψ(z_j),  sup_{y∈K} [ψ(x) + c(x, y) − c(w, y)] ).

This shows that ψ is indeed bounded above on B. Since it is lower semicontinuous with values in ℝ ∪ {+∞}, it is also bounded below on a neighborhood of x. All in all, ψ is bounded in a neighborhood of x.

Next, let x ∈ Ω; the goal is to show that ∂_c ψ(x) ≠ ∅. Let U be a small neighborhood of x, on which |ψ| is bounded by M. By assumption, there are a small ball B' ⊂ U containing x and a compact set K ⊂ Y such that for any y outside K there is z ∈ U with

∀w ∈ B',  c(z, y) − c(w, y) ≤ −(2M + 1);

in particular c(z, y) − c(x, y) ≤ −(2M + 1). So if y ∉ K, there is a z such that ψ(z) + c(z, y) ≤ c(x, y) − (M + 1) ≤ ψ(x) + c(x, y) − 1, and

φ(y) − c(x, y) ≤ [ψ(z) + c(z, y)] − c(x, y) ≤ ψ(x) − 1 = sup_{y'∈Y} [φ(y') − c(x, y')] − 1.

Then the supremum of φ(y) − c(x, y) over all of Y is the same as the supremum over K only. But this is a maximization problem for an upper semicontinuous function on a compact set, so it admits a solution, which belongs to ∂_c ψ(x).

The same reasoning can be made with x replaced by w in the small ball B'; then the conclusion is that ∂_c ψ(w) is nonempty and contained in the compact set K, uniformly for w ∈ B'. If K' ⊂ Ω is a compact set, we can cover it by a finite number of small open balls B_j such that ∂_c ψ(B_j) is contained in a compact set K_j, so that ∂_c ψ(K') ⊂ ∪_j K_j. Since on the other hand ∂_c ψ(K') is closed by the continuity of c, it follows that ∂_c ψ(K') is compact. This concludes the proof of Theorem 10.24. ⊓⊔

Proof of Theorem 10.25. Let x be a point of c-subdifferentiability of ψ, and let y ∈ ∂_c ψ(x). Further, let

φ(y) := inf_{z∈X} [ψ(z) + c(z, y)]

be the c-transform of ψ. By definition of c-subdifferentiability,

ψ(x) = φ(y) − c(x, y).    (10.18)

Let w ∈ T_x M and let x_ε be obtained from x by a variation of size O(ε) in the direction w, say x_ε = exp_x(εw). From the definition of φ, one has of course

ψ(x_ε) ≥ φ(y) − c(x_ε, y).    (10.19)

Further, let p ∈ ∇⁺_x c(x, y). By (10.18), (10.19) and the superdifferentiability of c,

ψ(x_ε) ≥ φ(y) − c(x_ε, y) ≥ φ(y) − c(x, y) − ε⟨p, w⟩ + o(ε) = ψ(x) − ε⟨p, w⟩ + o(ε).

This shows that ψ is indeed subdifferentiable at x, with −p as a subgradient: −p ∈ ∇⁻ψ(x). ⊓⊔

Proof of Theorem 10.26. (i) If ‖c(·, y)‖_Lip ≤ L for all y, then ψ(x) = sup_y [φ(y) − c(x, y)] also satisfies ‖ψ‖_Lip ≤ L. Then by Theorem 10.8(ii), ψ is differentiable everywhere on the interior of X, apart from a set of zero volume.

If c is only locally Lipschitz in x and y, but condition (H∞) is ensured, then for each compact set K in X there is a compact set K' ⊂ Y such that

∀x ∈ K,  ψ(x) = sup_{y∈K'} [φ(y) − c(x, y)].

The functions inside the supremum are uniformly Lipschitz when y stays in K' and x stays in K, so the result of the supremum is again a locally Lipschitz function.

(ii) Assume that c(x, y) is semiconcave, locally in x and uniformly in y. Let K be a compact subset of M, and let γ be a constant-speed geodesic whose image is included in K; then the inequality

c(γ_t, y) ≥ (1−t) c(γ_0, y) + t c(γ_1, y) − t(1−t) ω(d(γ_0, γ_1))

leads to

ψ(γ_t) = sup_y [φ(y) − c(γ_t, y)]
  ≤ sup_y [φ(y) − (1−t) c(γ_0, y) − t c(γ_1, y)] + t(1−t) ω(d(γ_0, γ_1))
  = sup_y [(1−t)(φ(y) − c(γ_0, y)) + t(φ(y) − c(γ_1, y))] + t(1−t) ω(d(γ_0, γ_1))
  ≤ (1−t) sup_y [φ(y) − c(γ_0, y)] + t sup_y [φ(y) − c(γ_1, y)] + t(1−t) ω(d(γ_0, γ_1))
  = (1−t) ψ(γ_0) + t ψ(γ_1) + t(1−t) ω(d(γ_0, γ_1)).

So ψ inherits the semiconcavity modulus of c as a semiconvexity modulus. Then the conclusion follows from Proposition 10.12 and Theorem 10.8(iii). If c is semiconcave in x, locally in y, one can use a localization argument as in the proof of (i). ⊓⊔

Remark 10.27. Theorems 10.24 to 10.26, and (in the Lagrangian case) Proposition 10.15, provide a good picture of differentiability points of c-convex functions: Let c satisfy (Twist), (Super) and (H∞), and let x be in the interior of the domain of a c-convex function ψ. If ψ is differentiable at x, then ∂_c ψ(x) consists of just one point y, and ∇ψ(x) = −∇_x c(x, y), which in the Lagrangian case also coincides with ∇_v L(x, v, 0), where v is the initial velocity of the unique action-minimizing curve joining x to y. If ψ is not differentiable at x, the picture is not so precise; however, under (locSC) we can use the local semiconvexity of ψ to show that ∇⁻ψ(x) is included in the closed convex hull of −∇⁺_x c(x, ∂_c ψ(x)), which in the Lagrangian case is also the closed convex hull of {∇_v L(x, γ̇(0), 0)}, where γ varies among action-minimizing curves joining x to ∂_c ψ(x). (Use Remark 10.51 in the Second Appendix, the stability of the c-subdifferential, the semiconcavity of c, the continuity of ∇⁺_x c, the stability of minimizing curves, and the continuity of ∇_v L.) There is in general no reason why −∇⁺_x c(x, ∂_c ψ(x)) would be convex; we shall come back to this issue in Chapter 12.

Applications to the Monge problem

The next theorem shows how to incorporate the previous information into the optimal transport problem.

Theorem 10.28 (Solution of the Monge problem II). Let M be a Riemannian manifold, X a closed subset of M with dim(∂X) ≤ n − 1, and Y an arbitrary Polish space. Let c : X × Y → ℝ be a continuous cost function, bounded below, and let μ ∈ P(X), ν ∈ P(Y) be such that the optimal cost C(μ, ν) is finite. Assume that:

(i) c is superdifferentiable everywhere (Assumption (Super));
(ii) ∇_x c(x, ·) is injective where defined (Assumption (Twist));
(iii) any c-convex function is differentiable μ-almost surely on its domain of c-subdifferentiability.
Then there exists a unique (in law) optimal coupling (x, y) of (μ, ν); it is deterministic, and there is a c-convex function ψ such that

∇ψ(x) + ∇_x c(x, y) = 0  almost surely.    (10.20)

In other words, there is a unique transport map T solving the Monge problem, and ∇ψ(x) + ∇_x c(x, T(x)) = 0, μ(dx)-almost surely.

If moreover (H∞) is satisfied, then:

(a) Equation (10.20) characterizes the optimal coupling.

(b) Let Z be the set of points where ψ is differentiable; then one can define a continuous map x → T(x) on Z by the equation T(x) ∈ ∂_c ψ(x), and

Spt ν = T(Spt μ).    (10.21)

Remark 10.29. As a corollary of this theorem, ∇ψ is uniquely determined μ-almost surely, since the random variable ∇ψ(x) has to coincide (in law) with −∇_x c(x, y).

Remark 10.30. The uniqueness of the optimal transport does not in general imply the uniqueness of the function ψ, even up to an additive constant. However, this becomes true if X is the closure of a connected open set Ω such that the density of μ with respect to the volume measure is positive almost everywhere in Ω; indeed, ∇ψ will then be uniquely defined almost surely in Ω. (This uniqueness theorem does not require μ to be absolutely continuous.)

Remark 10.31. In general ∂_c ψ(Spt μ) is larger than Spt ν. Take for instance X = Y = ℝ², c(x, y) = |x − y|²; let B = B_1(0) ⊂ ℝ²; split B in two halves along the first coordinate, translate the right (resp. left) half ball by a unit vector (1, 0) (resp. (−1, 0)), call B' the resulting set, and define μ (resp. ν) as the normalized measure on B (resp. B'). Then ∂_c ψ(Spt μ) will be the whole convex hull of B'. Theorem 12.49 will provide sufficient conditions to control ∂_c ψ(Spt μ).

Remark 10.32. If in Theorem 10.28 the cost c derives from a C¹ Lagrangian L(x, v, t), strictly convex in v, such that minimizing curves are uniquely determined by their initial velocity, then Proposition 10.15(ii) implies the following important property of the optimal coupling (x, y): Almost surely, x is joined to y by a unique minimizing curve. For instance, if c(x, y) = d(x, y)², the optimal transference plan π will be concentrated on the set of points (x, y) in M × M such that x and y are joined by a unique geodesic.

Remark 10.33. Assumption (iii) can be realized in a number of ways, depending on which part of Theorem 10.26 one wishes to use: For instance, it is true if c is Lipschitz on X × Y and μ is absolutely continuous; or if c is locally Lipschitz and μ, ν are compactly supported and μ is absolutely continuous; or if c is locally semiconcave and satisfies (H∞) and μ does not charge sets of dimension n − 1; etc. It is important to note that Assumption (iii) implicitly contains some restrictions about the behavior at infinity of either the measure μ, or the cost function c, or the manifold M.

Example 10.34. All the assumptions of Theorem 10.28 are satisfied if X = Y = M is compact and the Lagrangian L is C² and satisfies the classical conditions of Definition 7.6.

Example 10.35. All the assumptions of Theorem 10.28 are satisfied if X = Y = M = ℝⁿ, c is a C¹ strictly convex function with a bounded Hessian, and μ does not charge sets of dimension n − 1. Indeed, ∇_x c will be injective by strict convexity of c; and c will be uniformly semiconcave with a modulus ω(r) = Cr², so Theorem 10.26 guarantees that c-convex functions are differentiable everywhere apart from a set of dimension at most n − 1.

Example 10.36. All the assumptions of Theorem 10.28 are satisfied if X = Y = M, c(x, y) = d(x, y)², and M has nonnegative sectional curvature; recall indeed Example 10.22. The same is true if M is compact.

Proof of Theorem 10.28. Let π be an optimal transference plan. From Theorem 5.10, there exists a pair of c-conjugate functions (ψ, φ) such that φ(y) − ψ(x) ≤ c(x, y) everywhere, with equality π-almost surely. Write again (10.2), at a point x of differentiability of ψ (x should be interior to X, viewed as a subset of M), choose x̃ = x̃(ε) = γ(ε), where γ̇(0) = w, divide by ε > 0 and pass to the lim inf:

−∇ψ(x) · w ≤ lim inf_{ε→0} [c(x̃(ε), y) − c(x, y)]/ε.    (10.22)

It follows that −∇ψ(x) is a subgradient of c(·, y) at x. But by assumption, there exists at least one supergradient of c(·, y) at x, say G. By Proposition 10.7, c(·, y) really is differentiable at x, with gradient −∇ψ(x).

So (10.20) holds true μ-almost surely (by Assumption (iii)), and then T(x) = y = (∇_x c)^{−1}(x, −∇ψ(x)), where (∇_x c)^{−1}(x, ·) is the inverse of y ↦ ∇_x c(x, y), viewed as a function of y and defined on the set of (x, y) for which ∇_x c(x, y) exists. (The usual measurable selection theorem implies the measurability of this inverse.) Thus π is concentrated on the graph of T; or equivalently, π = (Id, T)_# μ. Since this conclusion does not depend on the choice of π, but only on the choice of ψ, the optimal coupling is unique.
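For a concrete, editor-added illustration of this uniqueness in the simplest setting, consider the one-dimensional quadratic cost with empirical measures: it is a standard fact that the cost-minimizing matching of n points to n points is the monotone (sorted-to-sorted) pairing, and a brute-force search over all matchings confirms it. The sketch below assumes both measures are uniform on six distinct random points:

```python
import itertools
import numpy as np

rng = np.random.default_rng(1)
xs = np.sort(rng.normal(size=6))        # support of mu (sorted)
ys = np.sort(rng.normal(size=6) + 2.0)  # support of nu (sorted)

def cost(perm):
    # Transport cost of the matching xs[i] -> ys[perm[i]] for c(x, y) = |x - y|^2.
    return sum((xs[i] - ys[j]) ** 2 for i, j in enumerate(perm))

best = min(itertools.permutations(range(6)), key=cost)
# The optimal matching is the monotone rearrangement (identity on sorted points).
assert best == tuple(range(6))
print("optimal matching:", best)
```

In this discrete picture the graph of the monotone map plays the role of the graph of T above; any other matching is a non-deterministic or more expensive coupling.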

264 258 10 Solution of the Monge problem II: Local approach I t remains to prove the last part of Theorem 10.28. From now on (H ∞ is satisfied. Let π ∈ Π ( μ,ν ) be a transference I shall assume that ) be a c ψ plan, and let -convex function such that (10.20) holds true. ψ x ∈ Z ; be the set of differentiability points of , and let Let Z in particular, (in M ), and should belong x should be interior to X . By Theorem 10.24, there is some ψ to the interior of the domain of ∂ c ψ ( x ). Let G be a supergradient of y ( · ,y ) at ∈ ; by Theorem 10.25, x c − − ψ ( x ) } = {∇ ψ ( x ) } , so −∇ G ( x ) is the only supergradient of ∈ {∇ ψ ( ,y ) at x (as in the beginning of the proof of Theorem 10.28); · ( · ,y ) c c x and ∇ ) = 0. By injectivity c ( x,y ) + ∇ ψ ( really is differentiable at x x ( ∇ x, · ), this equation determines y = T ( x ) as a function of x ∈ Z . of c x ( x ∂ ) = { T ( x ) } This proves the first part of (b), and also shows that ψ c ∈ x . for any Z π Z ×Y , equation (10.20) implies that Now, since is concentrated on π really is concentrated on the graph of T . A fortiori π that ∂ ] = 1, ψ [ c π is -cyclically monotone, and therefore optimal by Theorem 5.10. so c This proves (a). ) Z . Let ( Next, let us prove that T is continuous on be a x ∈ N k k Z , converging to sequence in ∈ Z , and let y ). Assumption = T ( x x k k ∞ ) (H ∂ ψ transforms compact sets into and Theorem 10.24 imply that c compact sets; so the sequence ( y takes values in a compact set, ) k ∈ k N ′ y . By and up to extraction of a subsequence it converges to some ∈Y -subdifferential, we passing to the limit in the inequality defining the c ′ ′ ∂ y ψ ( x ). Since x obtain Z , this determines y ∈ = T ( x ) uniquely, ∈ c T ( x is indeed )) so the whole sequence ( T ), and converges to T ( x N k k ∈ continuous. T . 
Indeed, the in- Equation (10.21) follows from the continuity of − 1 (Spt μ ( T T μ )) implies ⊂ clusion Spt ] [ 1 − μ )] = (Spt T T [ ( T (Spt μ )) ν ≥ μ [Spt μ ] = 1; μ so by definition of support, Spt ν ⊂ T ( Spt μ ). On the other hand, if x ∈ Spt μ ∩ Z , let y = T ( x ), and let ε > 0; by continuity of T there is ), and then δ > B 0 such that ( x )) ⊂ B T ( y ( ε δ [ [ ] ( )] − 1 − 1 ν [ ( B B ( y )) > ≥ μ ( T y )] = μ T ( B 0; ( x )) T ≥ μ [ B ( x )] ε ε δ δ μ so Spt ν . This shows that T (Spt ∈ ) ⊂ Spt ν , and therefore y T ( Spt μ ) ⊂ Spt ν , as desired. This concludes the proof of (b). ⊓⊔ Remark 10.37. Uniqueness in Theorem 10.28 was very easy because we were careful to note in Theorem 5.10 that any optimal π is supported

in the c-subdifferential of any optimal function ψ for the dual problem. If we only knew that any optimal π is supported in the c-subdifferential of some ψ, we could still get uniqueness fairly easily, either by working out a variant of the statement in Theorem 5.10, or by noting that if any optimal measure is concentrated on a graph, then a strict convex combination of such measures has to be concentrated on a graph itself, and this is possible only if the two graphs coincide almost surely.

Removing the conditions at infinity

This section and the next one deal with extensions of Theorem 10.28. Here we shall learn how to cover situations in which no control at infinity is assumed, and in particular Assumption (iii) of Theorem 10.28 might not be satisfied. The short answer is that it is sufficient to replace the gradient in (10.20) by an approximate gradient. (Actually a little bit more will be lost; see Remarks 10.39 and 10.40 below.)

Theorem 10.38 (Solution of the Monge problem without conditions at infinity). Let M be a Riemannian manifold and Y an arbitrary Polish space. Let c : M × Y → ℝ be a continuous cost function, bounded below, and let μ ∈ P(M), ν ∈ P(Y) be such that the optimal cost C(μ,ν) is finite. Assume that:
(i) c is superdifferentiable everywhere (Assumption (Super));
(ii) ∇_x c(x,·) is injective (Assumption (Twist));
(iii) for any closed ball B = B[x₀,r] and any compact set K ⊂ Y, the function c′ defined by restriction of c to B × K is such that any c′-convex function on B × K is differentiable μ-almost surely;
(iv) μ is absolutely continuous with respect to the volume measure.

Then there exists a unique (in law) optimal coupling (x,y) of (μ,ν); it is deterministic, and it satisfies the equation

	∇̃ψ(x) + ∇_x c(x,y) = 0   almost surely.	(10.23)

In other words, there is a unique optimal transport T, and it satisfies the equation ∇̃ψ(x) + ∇_x c(x,T(x)) = 0 almost surely.

Remark 10.39. I don't know if (10.23) is a characterization of the optimal transport.

Remark 10.40. If Assumption (iv) is weakened into
(iv′) μ gives zero mass to sets of dimension at most n − 1,
then there is still uniqueness of the optimal coupling, and there is a c-convex ψ such that y ∈ ∂_c ψ(x) almost surely; but it is not clear that equation (10.23) still holds. This uniqueness result is a bit more tricky than the previous one, and I shall postpone its proof to the next section (see Theorem 10.42).

Proof of Theorem 10.38. Let ψ be a c-convex function as in Theorem 5.10. Let π be an optimal transport; according to Theorem 5.10(ii), π[∂_c ψ] = 1.

Let x₀ be any point in M. For any ℓ ∈ ℕ, let B_ℓ stand for the closed ball B[x₀,ℓ]. Let also (K_ℓ)_{ℓ∈ℕ} be an increasing sequence of compact sets in Y, such that ν[∪K_ℓ] = 1. The sets B_ℓ × K_ℓ fill up the whole of M × Y, up to a π-negligible set. Let c_ℓ be the restriction of c to B_ℓ × K_ℓ. If ℓ is large enough, then π[B_ℓ × K_ℓ] > 0, so we can define

	π_ℓ := 1_{B_ℓ×K_ℓ} π / π[B_ℓ × K_ℓ],

and then introduce the marginals μ_ℓ and ν_ℓ of π_ℓ. By Theorem 4.6, π_ℓ is optimal in the transport problem from (B_ℓ, μ_ℓ) to (K_ℓ, ν_ℓ), with cost c_ℓ. By Theorem 5.19 we can find a c_ℓ-convex function ψ_ℓ which coincides with ψ μ_ℓ-almost surely, and actually on the whole of S_ℓ := proj_M((∂_c ψ) ∩ (B_ℓ × K_ℓ)).

The union of all sets S_ℓ covers proj_M(∂_c ψ), and therefore also proj_M(Spt π), apart from a μ-negligible set. Let S̃_ℓ be the set of points at which S_ℓ has density 1; we know that S_ℓ \ S̃_ℓ has zero volume. So the union of all sets S̃_ℓ still covers M, apart from a μ-negligible set. (Here I have used the absolute continuity of μ.)

By Assumption (iii), each function ψ_ℓ is differentiable apart from a μ-negligible set Z_ℓ. Moreover, by Theorem 10.28, the equation

	∇_x c(x,y) + ∇ψ_ℓ(x) = 0	(10.24)

determines the unique optimal coupling between μ_ℓ and ν_ℓ, for the cost c_ℓ. (Note that ∇_x c_ℓ coincides with ∇_x c when x is in the interior of B_ℓ, and μ_ℓ[∂B_ℓ] = 0, so equation (10.24) does hold true π_ℓ-almost surely.)

Now we can define our Monge coupling. For each ℓ ∈ ℕ and each x ∈ S̃_ℓ \ Z_ℓ, ψ_ℓ coincides with ψ on a set which has density 1 at x, so ∇ψ_ℓ(x) = ∇̃ψ(x), and (10.24) becomes

	∇_x c(x,y) + ∇̃ψ(x) = 0.	(10.25)

This equation is independent of ℓ, and holds π_ℓ-almost surely since S̃_ℓ \ Z_ℓ has full μ_ℓ-measure. By letting ℓ → ∞ we deduce that π is concentrated on the set of (x,y) satisfying (10.25). By assumption this equation determines y = (∇_x c)⁻¹(x, −∇̃ψ(x)) uniquely as a measurable function of x. Uniqueness follows obviously since π was an arbitrary optimal plan. ⊓⊔

As an illustration of the use of Theorems 10.28 and 10.38, let us see how we can solve the Monge problem for the square distance on a Riemannian manifold. In the next theorem, I shall say that M has asymptotically nonnegative curvature if all sectional curvatures σ_x at point x satisfy

	σ_x ≥ − C / d(x₀,x)²	(10.26)

for some positive constant C and some x₀ ∈ M.

Theorem 10.41 (Solution of the Monge problem for the square distance). Let M be a Riemannian manifold, and let c(x,y) = d(x,y)². Let μ, ν be two probability measures on M, such that the optimal cost between μ and ν is finite. If μ is absolutely continuous, then there is a unique solution T of the Monge problem between μ and ν, and it can be written as

	y = T(x) = exp_x(∇̃ψ(x)),	(10.27)

where ψ is some d²/2-convex function. The approximate gradient can be replaced by a true gradient if any one of the following conditions is satisfied:
(a) μ and ν are compactly supported;
(b) M has nonnegative sectional curvature;
(c) ν is compactly supported and M has asymptotically nonnegative curvature.

Proof. The general theorem is just a particular case of Theorem 10.38. In case (a), we can apply Theorem 10.28 with X = Y = B[x₀,R], where R is large enough that B_R(x₀) contains all geodesics that go from Spt μ to Spt ν. Then the conclusion holds with some c′-convex function ψ, where c′ is the restriction of c to X × Y:

	ψ(x) = sup_{y ∈ B[x₀,R]} [ φ(y) − d(x,y)²/2 ].

To recover a true d²/2-convex function, it suffices to set φ(y) = −∞ on M \ Y, and let ψ(x) = sup_{y∈M} [ φ(y) − d(x,y)²/2 ] (as in the proof of Lemma 5.18).

In case (b), all functions d(·,y)²/2 are uniformly semiconcave (as recalled in the Third Appendix), so Theorem 10.28 applies.

In case (c), all functions d(·,y)²/2, where y varies in the support of ν, are uniformly semiconcave (as recalled in the Third Appendix), so we can choose Y to be a large closed ball containing the support of ν, and apply Theorem 10.28 again. ⊓⊔

Removing the assumption of finite cost

In this last section, I shall investigate situations where the total transport cost might be infinite. Unless the reader is specifically interested in such a situation, he or she is advised to skip this section, which is quite tricky.

If C(μ,ν) = +∞, there is no point in searching for an optimal transference plan. However, it does make sense to look for c-cyclically monotone plans, which will be called generalized optimal transference plans.

Theorem 10.42 (Solution of the Monge problem with possibly infinite total cost). Let X be a closed subset of a Riemannian manifold M such that dim(∂X) ≤ n − 1, and let Y be an arbitrary Polish space. Let c : M × Y → ℝ be a continuous cost function, bounded below, and let μ ∈ P(M), ν ∈ P(Y). Assume that:
(i) c is locally semiconcave (Assumption (locSC));
(ii) ∇_x c(x,·) is injective (Assumption (Twist));
(iii) μ does not give mass to sets of dimension at most n − 1.

Then there exists a unique (in law) coupling (x,y) of (μ,ν) such that π = law(x,y) is c-cyclically monotone; moreover this coupling is deterministic. The measure π is called the generalized optimal transference plan between μ and ν. Furthermore, there is a c-convex function ψ : M → ℝ ∪ {+∞} such that π[∂_c ψ] = 1.

• If Assumption (iii) is reinforced into
(iii′) μ is absolutely continuous with respect to the volume measure,

then

	∇̃ψ(x) + ∇_x c(x,y) = 0   π(dx dy)-almost surely.	(10.28)

• If Assumption (iii) is left as it is, but one adds
(iv) the cost function satisfies (H∞) or (SC),

then

	∇ψ(x) + ∇_x c(x,y) = 0   π(dx dy)-almost surely,	(10.29)

and this characterizes the generalized optimal transport. Moreover, one can define a continuous map x → T(x) on the set of differentiability points of ψ by the equation T(x) ∈ ∂_c ψ(x), and then Spt ν coincides with the closure of T(Spt μ).

Remark 10.43. Remark 10.39 applies also in this case.

Proof of Theorem 10.42. Let us first consider the existence problem. Let (μ_k)_{k∈ℕ} be a sequence of compactly supported probability measures converging weakly to μ; and similarly let (ν_k)_{k∈ℕ} be a sequence of compactly supported probability measures converging weakly to ν. For each k, the total transport cost C(μ_k,ν_k) is finite. Then let π_k be an optimal transference plan between μ_k and ν_k; by Theorem 5.10(ii), π_k is c-cyclically monotone. By Lemma 4.4, the sequence (π_k)_{k∈ℕ} converges, up to extraction, to some transference plan π ∈ Π(μ,ν). By Step 3 of the proof of Theorem 5.20, π is c-cyclically monotone. By Theorem 5.10(i) (Rüschendorf's theorem), there is a c-convex ψ such that Spt(π) ⊂ ∂_c ψ, in particular π[∂_c ψ] = 1.

If μ is absolutely continuous, then we can proceed as in the proof of Theorem 10.38 to show that the coupling is deterministic and that (10.28) holds true π-almost surely.

In the case when (H∞) or (SC) is assumed, we know that ψ is c-subdifferentiable everywhere in the interior of its domain; then we can proceed as in Theorem 10.28 to show that the coupling is deterministic, that (10.29) holds true, and that this equation implies y ∈ ∂_c ψ(x). So if we prove the uniqueness of the generalized optimal transference plan, this will show that (10.29) characterizes it.
Thus it all boils down to proving that under Assumptions (i)–(iii), the generalized optimal transport is unique. This will be much more technical, and the reader is advised to skip all the rest of the proof at first reading. The two main ingredients will be the Besicovich density theorem and the implicit function theorem.
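Before entering the proof, the defining property can be made concrete: a plan supported on finitely many pairs (x_i, y_i) is c-cyclically monotone precisely when no permutation of the targets decreases the total cost. The following small sketch (the helper name and the two sample configurations are invented for illustration) tests this exhaustively for the quadratic cost on the real line.

```python
from itertools import permutations

def is_c_cyclically_monotone(pairs, c):
    """Check c-cyclical monotonicity of a finitely supported plan:
    no permutation of the targets y_i may strictly decrease the
    total cost sum_i c(x_i, y_i)."""
    xs, ys = zip(*pairs)
    base = sum(c(x, y) for x, y in pairs)
    return all(
        sum(c(xs[i], ys[s[i]]) for i in range(len(pairs))) >= base - 1e-12
        for s in permutations(range(len(pairs)))
    )

c = lambda x, y: (x - y) ** 2  # quadratic cost on the real line

# The monotone pairing is c-cyclically monotone; a crossing one is not.
print(is_c_cyclically_monotone([(0.0, 0.0), (1.0, 2.0), (3.0, 5.0)], c))  # True
print(is_c_cyclically_monotone([(0.0, 5.0), (1.0, 2.0), (3.0, 0.0)], c))  # False
```

For general plans the condition must of course hold for every finite subfamily of the support, which is what the function ψ encodes.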

Let π be a generalized optimal coupling of μ and ν, and let ψ be a c-convex function such that Spt(π) ⊂ ∂_c ψ. Let z₀ ∈ X, let B_ℓ = B[z₀,ℓ] ∩ X, and let (K_ℓ)_{ℓ∈ℕ} be an increasing sequence of compact subsets of Y, such that ν[∪K_ℓ] = 1. Let c_ℓ := c|_{B_ℓ×K_ℓ}, Z_ℓ := π[B_ℓ × K_ℓ], π_ℓ := 1_{B_ℓ×K_ℓ} π / Z_ℓ, S_ℓ := proj_M(Spt π_ℓ); let also μ_ℓ and ν_ℓ be the two marginals of π_ℓ. It is easy to see that (S_ℓ) is still an increasing family of compact subsets of M, and that μ[∪S_ℓ] = 1.

According to Theorem 5.19, there is a c_ℓ-convex function ψ_ℓ : B_ℓ → ℝ ∪ {+∞} which coincides with ψ on S_ℓ. Since c is locally semiconcave, the cost c_ℓ is uniformly semiconcave, and ψ_ℓ is differentiable apart from a set of dimension n − 1.

By Besicovich's density theorem, the set S_ℓ has μ-density 1 at μ-almost all x ∈ S_ℓ; that is,

	μ[S_ℓ ∩ B_r(x)] / μ[B_r(x)] → 1   as r → 0.

(The proof of this uses the fact that we are working on a Riemannian manifold; see the bibliographical notes for more information.)

Moreover, the transport plan induced by π on S_ℓ coincides with the deterministic transport associated with the map

	T_ℓ : x ↦ (∇_x c)⁻¹(x, −∇ψ_ℓ(x)).

Since π is the nondecreasing limit of the measures Z_ℓ π_ℓ, it follows that π itself is deterministic, and associated with the transport map T that sends x to T_ℓ(x) if x ∈ S_ℓ. (This map is well-defined μ-almost surely.)

Then let

	C_ℓ := { x ∈ S_ℓ ; x is interior to X ; S_ℓ has μ-density 1 at x ; ∀k ≥ ℓ, ψ_k is differentiable at x and ∇_x c(x,T(x)) + ∇ψ_k(x) = 0 }.

(Note: There is no reason for ∇ψ_ℓ(x) to be an approximate gradient of ψ at x, because ψ_ℓ is assumed to coincide with ψ only on a set of μ-density 1 at x, not on a set of vol-density 1 at x . . . )

The sets C_ℓ form a nondecreasing family of bounded Borel sets. Moreover, C_ℓ has been obtained from S_ℓ by deletion of a set of zero volume, and therefore of zero μ-measure. In particular, μ[∪C_ℓ] = 1.

Now let π̃ be another generalized optimal transference plan, and let ψ̃ be a c-convex function with Spt(π̃) ⊂ ∂_c ψ̃. We repeat the same

construction as above with π̃ instead of π, and get sequences (Z̃_ℓ)_{ℓ∈ℕ}, (c̃_ℓ)_{ℓ∈ℕ}, (ψ̃_ℓ)_{ℓ∈ℕ}, (C̃_ℓ)_{ℓ∈ℕ}, such that the sets C̃_ℓ form a nondecreasing family of bounded Borel sets with μ[∪C̃_ℓ] = 1, and ψ̃_ℓ coincides with ψ̃ on C̃_ℓ. Also we find that π̃ is deterministic and determined by the transport map T̃, where T̃ coincides with T̃_ℓ on S̃_ℓ.

Next, the sets C_ℓ ∩ C̃_ℓ also form a nondecreasing family of Borel sets, and μ[∪(C_ℓ ∩ C̃_ℓ)] = μ[(∪C_ℓ) ∩ (∪C̃_ℓ)] = 1 (the nondecreasing property was used in the first equality). Also C_ℓ ∩ C̃_ℓ has μ-density 1 at each of its points.

Assume that T ≠ T̃ on a set of positive μ-measure; then there is some ℓ ∈ ℕ such that {T ≠ T̃} ∩ (C_ℓ ∩ C̃_ℓ) has positive μ-measure. This implies that {T_ℓ ≠ T̃_ℓ} ∩ (C_ℓ ∩ C̃_ℓ) has positive μ-measure, and then

	μ[ {∇ψ_ℓ ≠ ∇ψ̃_ℓ} ∩ C_ℓ ∩ C̃_ℓ ] > 0.

In the sequel, I shall fix such an ℓ. Let x be a μ-Besicovich point of E_ℓ := (C_ℓ ∩ C̃_ℓ) ∩ {∇ψ_ℓ ≠ ∇ψ̃_ℓ}, i.e. a point at which E_ℓ has μ-density 1. (Such a point exists since E_ℓ has positive μ-measure.) By adding a suitable constant to ψ̃, we may assume that ψ_ℓ(x) = ψ̃_ℓ(x). Since ψ_ℓ and ψ̃_ℓ are semiconvex, we can apply the implicit function theorem to deduce that there is a small neighborhood of x in which the set {ψ_ℓ = ψ̃_ℓ} has dimension n − 1. (See Corollary 10.52 in the Second Appendix of this chapter.) Then, for r small enough, Assumption (iii) implies

	μ[ {ψ_ℓ = ψ̃_ℓ} ∩ B_r(x) ] = 0.

So at least one of the sets {ψ_ℓ > ψ̃_ℓ} ∩ B_r(x) and {ψ_ℓ < ψ̃_ℓ} ∩ B_r(x) has μ-measure at least μ[B_r(x)]/2. Without loss of generality, I shall assume that this is the set {ψ_ℓ > ψ̃_ℓ}; so

	μ[ {ψ_ℓ > ψ̃_ℓ} ∩ B_r(x) ] ≥ μ[B_r(x)] / 2.	(10.30)

Next, ψ_ℓ coincides with ψ on the set S_ℓ, which has μ-density 1 at x, and similarly ψ̃_ℓ coincides with ψ̃ on a set of μ-density 1 at x. It follows that

	μ[ {ψ > ψ̃} ∩ {ψ_ℓ > ψ̃_ℓ} ∩ B_r(x) ] ≥ (1/2 − o(1)) μ[B_r(x)]   as r → 0.	(10.31)

Then since x is a Besicovich point of {∇ψ_ℓ ≠ ∇ψ̃_ℓ} ∩ C_ℓ ∩ C̃_ℓ,

	μ[ {ψ > ψ̃} ∩ {ψ_ℓ > ψ̃_ℓ} ∩ {∇ψ_ℓ ≠ ∇ψ̃_ℓ} ∩ (C_ℓ ∩ C̃_ℓ) ∩ B_r(x) ]
		≥ μ[ {ψ > ψ̃} ∩ {ψ_ℓ > ψ̃_ℓ} ∩ B_r(x) ] − μ[ B_r(x) \ (C_ℓ ∩ C̃_ℓ) ]
		≥ (1/2 − o(1) − o(1)) μ[B_r(x)].

As a conclusion,

	∀r > 0,   μ[ {ψ > ψ̃} ∩ {ψ_ℓ > ψ̃_ℓ} ∩ {∇ψ_ℓ ≠ ∇ψ̃_ℓ} ∩ (C_ℓ ∩ C̃_ℓ) ∩ B_r(x) ] > 0.	(10.32)

Now let

	A := {ψ > ψ̃}.

The proof will result from the next two claims:

Claim 1: T̃⁻¹(T(A)) ⊂ A;
Claim 2: The set {ψ_ℓ > ψ̃_ℓ} ∩ {∇ψ_ℓ ≠ ∇ψ̃_ℓ} ∩ (C_ℓ ∩ C̃_ℓ) ∩ T̃⁻¹(T(A)) lies a positive distance away from x.

Let us postpone the proofs of these claims for a while, and see why they imply the theorem. Let S ⊂ A be defined by

	S := {ψ > ψ̃} ∩ {ψ_ℓ > ψ̃_ℓ} ∩ {∇ψ_ℓ ≠ ∇ψ̃_ℓ} ∩ (C_ℓ ∩ C̃_ℓ),

and let

	r := d( x, S ∩ T̃⁻¹(T(A)) ) / 2.

On the one hand, S ∩ B(x,r) ∩ T̃⁻¹(T(A)) = ∅ by definition, so μ[S ∩ B(x,r) ∩ T̃⁻¹(T(A))] = μ[∅] = 0. On the other hand, r is positive by Claim 2, so μ[S ∩ B(x,r)] > 0 by (10.32). Then

	μ[ A \ T̃⁻¹(T(A)) ] ≥ μ[ (S ∩ B(x,r)) \ T̃⁻¹(T(A)) ] = μ[S ∩ B(x,r)] > 0.

Since T̃⁻¹(T(A)) ⊂ A by Claim 1, this implies

	μ[ T̃⁻¹(T(A)) ] < μ[A].	(10.33)

But then, we can write

	μ[A] ≤ μ[T⁻¹(T(A))] = ν[T(A)] = μ[T̃⁻¹(T(A))],

which contradicts (10.33). So it all boils down to establishing Claims 1 and 2 above.

Proof of Claim 1: Let x ∈ T̃⁻¹(T(A)). Then there exists y ∈ A such that T̃(x) = T(y). Recall that T(y) ∈ ∂_c ψ(y) and T̃(x) ∈ ∂_c ψ̃(x).

Using the definition of the c-subdifferential and the assumptions, we can write the following chain of identities and inequalities:

	ψ̃(x) + c(x, T̃(x)) ≤ ψ̃(y) + c(y, T̃(x))
		= ψ̃(y) + c(y, T(y))
		< ψ(y) + c(y, T(y))
		≤ ψ(x) + c(x, T(y))
		= ψ(x) + c(x, T̃(x)).

This implies that ψ̃(x) < ψ(x), i.e. x ∈ A, and proves Claim 1.

Proof of Claim 2: Assume that this claim is false; then there is a sequence (x_k)_{k∈ℕ}, valued in {ψ_ℓ > ψ̃_ℓ} ∩ (C_ℓ ∩ C̃_ℓ) ∩ T̃⁻¹(T(A)), such that x_k → x. For each k, there is y_k ∈ A such that T̃(x_k) = T(y_k). On C_ℓ ∩ C̃_ℓ, the transport T coincides with T_ℓ, and the transport T̃ with T̃_ℓ, so T̃(x_k) ∈ ∂_c ψ̃_ℓ(x_k) and T(y_k) ∈ ∂_c ψ_ℓ(y_k); then we can write, for any z ∈ M,

	ψ_ℓ(z) ≥ ψ_ℓ(y_k) + c(y_k, T(y_k)) − c(z, T(y_k))
		= ψ_ℓ(y_k) + c(y_k, T̃(x_k)) − c(z, T̃(x_k))
		> ψ̃_ℓ(y_k) + c(y_k, T̃(x_k)) − c(z, T̃(x_k))
		≥ ψ̃_ℓ(x_k) + c(x_k, T̃(x_k)) − c(z, T̃(x_k)).

Since ψ̃_ℓ is differentiable at x, and since c is locally semiconcave by assumption, we can expand the right-hand side and obtain

	ψ_ℓ(z) ≥ ψ̃_ℓ(x) + ∇ψ̃_ℓ(x)·(x_k − x) + o(|x_k − x|) − ∇_x c(x_k, T̃(x_k))·(z − x_k) + o(|z − x_k|),	(10.34)

where the o(|z − x_k|) in the last line is uniform in k. (Here I have cheated by pretending to work in ℝⁿ rather than on a Riemannian manifold, but all this is purely local, and invariant under diffeomorphism; so it is really no problem to make sense of these formulas when z is close enough to x.) Recall that ∇_x c(x_k, T̃(x_k)) + ∇ψ̃_ℓ(x_k) = 0; so (10.34) can be rewritten as

	ψ_ℓ(z) ≥ ψ̃_ℓ(x) + ∇ψ̃_ℓ(x)·(x_k − x) + o(|x_k − x|) + ∇ψ̃_ℓ(x_k)·(z − x_k) + o(|z − x_k|).

Then we can pass to the limit as k → ∞, remembering that ∇ψ̃_ℓ is continuous (because ψ̃_ℓ is semiconvex, recall the proof of Proposition 10.12(i)), and get

	ψ_ℓ(z) ≥ ψ̃_ℓ(x) + ∇ψ̃_ℓ(x)·(z − x) + o(|z − x|)
		= ψ_ℓ(x) + ∇ψ̃_ℓ(x)·(z − x) + o(|z − x|).	(10.35)

(Recall that x is such that ψ_ℓ(x) = ψ̃_ℓ(x).) On the other hand, since ψ_ℓ is differentiable at x, we have

	ψ_ℓ(z) = ψ_ℓ(x) + ∇ψ_ℓ(x)·(z − x) + o(|z − x|).

Combining this with (10.35), we see that

	( ∇ψ̃_ℓ(x) − ∇ψ_ℓ(x) )·(z − x) ≤ o(|z − x|),

which is possible only if ∇ψ̃_ℓ(x) − ∇ψ_ℓ(x) = 0. But this contradicts the definition of x. So Claim 2 holds true, and this concludes the proof of Theorem 10.42. ⊓⊔

The next corollary of Theorem 10.42 is the same as Theorem 10.41 except that the classical Monge problem (search of a transport of minimum cost) has been replaced by the generalized Monge problem (search of a c-monotone transport).

Corollary 10.44 (Generalized Monge problem for the square distance). Let M be a smooth Riemannian manifold, and let c(x,y) = d(x,y)². Let μ, ν be two probability measures on M.
• If μ gives zero mass to sets of dimension at most n − 1, then there is a unique transport map T solving the generalized Monge problem between μ and ν.
• If μ is absolutely continuous, then this solution can be written

	y = T(x) = exp_x(∇̃ψ(x)),	(10.36)

where ψ is some d²/2-convex function.
• If M has nonnegative sectional curvature, or ν is compactly supported and M satisfies (10.26), then equation (10.36) still holds, but in addition the approximate gradient can be replaced by a true gradient.
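On the real line with the quadratic cost, the map of Corollary 10.44 reduces to the monotone rearrangement T = F_ν⁻¹ ∘ F_μ, which is nondecreasing and hence the derivative of a convex potential. A small sample-based sketch (distributions and sample size are chosen arbitrarily for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
# Equal-size samples of mu and nu on the real line; pairing the sorted
# points realizes the monotone rearrangement T = F_nu^{-1} o F_mu.
x = np.sort(rng.normal(0.0, 1.0, 500))   # sample of mu
y = np.sort(rng.normal(3.0, 0.5, 500))   # sample of nu

# The discrete map x_i -> y_i is nondecreasing, hence the derivative
# of a convex potential, and optimal for the quadratic cost.
assert np.all(np.diff(x) >= 0) and np.all(np.diff(y) >= 0)

# Swapping any two assignments cannot decrease the quadratic cost:
i, j = 10, 400
base = (x[i] - y[i])**2 + (x[j] - y[j])**2
swapped = (x[i] - y[j])**2 + (x[j] - y[i])**2
print(bool(swapped >= base))  # True
```

The swap inequality is exactly the two-point case of c-cyclical monotonicity: swapped − base = 2(x_i − x_j)(y_i − y_j) ≥ 0 when both families are sorted the same way.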

Particular Case 10.45. If M = ℝⁿ, formula (10.36) becomes

	y = ∇( |·|²/2 + ψ )(x),

where ψ is |·|²/2-convex, or equivalently |·|²/2 + ψ is convex lower semicontinuous. So (10.36) can be written y = ∇Ψ(x), where Ψ is convex lower semicontinuous, and we are back to Theorem 9.4.

First Appendix: A little bit of geometric measure theory

The geometric counterpart of differentiability is of course the approximation of a set S by a tangent plane, or hyperplane, or more generally by a tangent d-dimensional space, if d is the dimension of S. If S is smooth, then there is no ambiguity on its dimension (a curve has dimension 1, a surface has dimension 2, etc.) and the tangent space always exists. But if S is not smooth, this might not be the case, at least not in the usual sense. The notion of tangent cone (sometimes called contingent cone) often remedies this problem; it is naturally associated with the notion of countable d-rectifiability, which acts as a replacement for the notion of "being of dimension d". I shall recall below some of the basic results about these concepts.

Definition 10.46 (Tangent cone). If S is an arbitrary subset of ℝⁿ, and x ∈ S, then the tangent cone T_x S to S at x is defined as

	T_x S := { lim_{k→∞} (x_k − x)/t_k ;  x_k ∈ S,  x_k → x,  t_k > 0,  t_k → 0 }.

The dimension of this cone is defined as the dimension of the vector space that it generates.

Definition 10.47 (Countable rectifiability). Let S be a subset of ℝⁿ, and let d ∈ [0,n] be an integer. Then S is said to be countably d-rectifiable if S ⊂ ∪_k f_k(D_k), where each f_k is Lipschitz on a measurable subset D_k of ℝᵈ. In particular, S has Hausdorff dimension at most d.

The next theorem summarizes two results which were useful in the present chapter:

Theorem 10.48 (Sufficient conditions for countable rectifiability).
(i) Let S be a measurable set in ℝⁿ, such that T_x S has dimension at most d for all x ∈ S. Then S is countably d-rectifiable.
(ii) Let S be a measurable set in ℝⁿ, such that T_x S is included in a half-space, for each x ∈ ∂S. Then ∂S is countably (n−1)-rectifiable.

Proof of Theorem 10.48. For each x ∈ S, let π_x stand for the orthogonal projection on T_x S, and let π_x^⊥ = Id − π_x stand for the orthogonal projection on (T_x S)^⊥. I claim that

	∀x ∈ S, ∃r > 0;  ∀y ∈ S,  |x − y| ≤ r ⟹ |π_x^⊥(y − x)| ≤ |π_x(y − x)|.	(10.37)

Indeed, assume that (10.37) is false. Then there is x ∈ S, and there is a sequence (y_k)_{k∈ℕ} such that |x − y_k| ≤ 1/k and yet |π_x^⊥(y_k − x)| > |π_x(y_k − x)|, or equivalently

	| π_x^⊥( (y_k − x)/|y_k − x| ) | > | π_x( (y_k − x)/|y_k − x| ) |.	(10.38)

Up to extraction of a subsequence, w_k := (y_k − x)/|y_k − x| converges to some w ∈ T_x S with |w| = 1. Then |π_x w_k| → 1 and |π_x^⊥ w_k| → 0, which contradicts (10.38). So (10.37) is true.

Next, for each k ∈ ℕ, let

	S_k := { x ∈ S ; property (10.37) holds true for |x − y| ≤ 1/k }.

It is clear that the sets S_k cover S, so it is sufficient to establish the d-rectifiability of S_k for a given k.

Let δ > 0 be small enough (δ < 1/2 will do). Let Π_d be the set of all orthogonal projections on d-dimensional linear spaces. Since Π_d is compact, we can find a finite family (π_1, ..., π_N) of such orthogonal projections, such that for any π ∈ Π_d there is j ∈ {1, ..., N} with ‖π − π_j‖ ≤ δ, where ‖·‖ stands for the operator norm. So the set S_k is covered by the sets

	S_{kℓ} := { x ∈ S_k ; ‖π_x − π_ℓ‖ ≤ δ }.

To prove (i), it suffices to prove that S_{kℓ} is locally rectifiable. We shall show that for any two x, x′ ∈ S_{kℓ},

	|x − x′| ≤ 1/k  ⟹  |π_ℓ^⊥(x − x′)| ≤ L |π_ℓ(x − x′)|,   L = (1 + 2δ)/(1 − 2δ);	(10.39)

this will imply that the intersection of S_{kℓ} with a ball of diameter 1/k is contained in an L-Lipschitz graph over π_ℓ(ℝⁿ), and the conclusion will follow immediately.

To prove (10.39), note that, if π, π′ are any two orthogonal projections, then |(π^⊥ − π′^⊥)(z)| = |(Id − π)(z) − (Id − π′)(z)| = |(π − π′)(z)|, therefore ‖π^⊥ − π′^⊥‖ = ‖π − π′‖, and

	|π_ℓ^⊥(x − x′)| ≤ |π_x^⊥(x − x′)| + |(π_ℓ^⊥ − π_x^⊥)(x − x′)|
		≤ |π_x(x − x′)| + |(π_ℓ − π_x)(x − x′)|
		≤ |π_ℓ(x − x′)| + |(π_x − π_ℓ)(x − x′)| + |(π_ℓ − π_x)(x − x′)|
		≤ |π_ℓ(x − x′)| + 2δ |x − x′|
		≤ (1 + 2δ) |π_ℓ(x − x′)| + 2δ |π_ℓ^⊥(x − x′)|.

This establishes (10.39), and Theorem 10.48(i).

Now let us turn to part (ii) of the theorem. Let F be a finite set in S^{n−1} such that the balls (B_{1/8}(ν))_{ν∈F} cover S^{n−1}. I claim that

	∀x ∈ ∂S, ∃r > 0, ∃ν ∈ F;  ∀y ∈ ∂S ∩ B_r(x),  〈y − x, ν〉 ≤ |y − x|/2.	(10.40)

Indeed, otherwise there is x ∈ ∂S such that for all k ∈ ℕ and for all ν ∈ F there is y_k ∈ ∂S such that |y_k − x| ≤ 1/k and 〈y_k − x, ν〉 > |y_k − x|/2. By assumption there is ξ ∈ S^{n−1} such that

	∀ζ ∈ T_x S,  〈ξ, ζ〉 ≤ 0.

Let ν ∈ F be such that |ξ − ν| < 1/8 and let (y_k)_{k∈ℕ} be a sequence as above. Since y_k ∈ ∂S and y_k ≠ x, there is y′_k ∈ S such that |y_k − y′_k| ≤ |y_k − x|/8. Then

	〈y′_k − x, ξ〉 ≥ 〈y_k − x, ν〉 − |y_k − x| |ξ − ν| − |y_k − y′_k| ≥ |y_k − x|/4;

so

	〈 (y′_k − x)/|y′_k − x| , ξ 〉 ≥ 1/8.	(10.41)

Up to extraction of a subsequence, (y′_k − x)/|y′_k − x| converges to some ζ ∈ T_x S, and then by passing to the limit in (10.41) we have 〈ζ, ξ〉 ≥ 1/8. But by definition, ξ is such that 〈ζ, ξ〉 ≤ 0 for all ζ ∈ T_x S. This contradiction establishes (10.40).

As a consequence, ∂S is included in the union of all sets A_{1/k,ν}, where k ∈ ℕ, ν ∈ F, and

	A_{r,ν} := { x ∈ ∂S ; ∀y ∈ ∂S ∩ B_r(x),  〈y − x, ν〉 ≤ |y − x|/2 }.

To conclude the proof of the theorem it is sufficient to show that each A_{r,ν} is locally the image of a Lipschitz function defined on a subset of an (n−1)-dimensional space.

So let r > 0 and ν ∈ F be given, let x₀ ∈ A_{r,ν}, and let π be the orthogonal projection of ℝⁿ to ν^⊥. (Explicitly, π(x) = x − 〈x,ν〉ν.) We shall show that π is injective on D := A_{r,ν} ∩ B_{r/2}(x₀), and its inverse (on π(D)) is Lipschitz. To see this, first note that for any two x, x′ ∈ D, one has x′ ∈ B_r(x), so, by definition of A_{r,ν}, 〈x′ − x, ν〉 ≤ |x − x′|/2. By symmetry, also 〈x − x′, ν〉 ≤ |x − x′|/2, so in fact

	|〈x − x′, ν〉| ≤ |x − x′|/2.

Then if z = π(x) and z′ = π(x′),

	|x − x′| ≤ |z − z′| + |〈x,ν〉 − 〈x′,ν〉| ≤ |z − z′| + |x − x′|/2,

so |x − x′| ≤ 2|z − z′|. This concludes the proof. ⊓⊔

Second Appendix: Nonsmooth implicit function theorem

Let M be an n-dimensional smooth Riemannian manifold, and x₀ ∈ M. I shall say that a set M′ ⊂ M is a k-dimensional C^r graph (resp. k-dimensional Lipschitz graph) in a neighborhood of x₀ if there are:
(i) a smooth system of coordinates around x₀, say

	x = ζ(x′, y),   x′ ∈ ℝᵏ,  y ∈ ℝ^{n−k},

where ζ is a smooth diffeomorphism from an open subset of ℝᵏ × ℝ^{n−k} into a neighborhood O of x₀;

(ii) a C^r (resp. Lipschitz) function φ : O′ → ℝ^{n−k}, where O′ is an open subset of ℝᵏ;

such that for all x ∈ O,

	x ∈ M′ ⟺ y = φ(x′).

This definition is illustrated by Figure 10.2. [Fig. 10.2. k-dimensional graph: ζ maps the graph of φ, drawn over ℝᵏ with values in ℝ^{n−k}, onto M′.]

The following statement is a consequence of the classical implicit function theorem: If f : M → ℝ is of class C^r (r ≥ 1), f(x₀) = 0 and ∇f(x₀) ≠ 0, then the set {f = 0} = f⁻¹(0) is an (n−1)-dimensional C^r graph in a neighborhood of x₀.

In this Appendix I shall consider a nonsmooth version of this theorem. The following notion will be useful.

Definition 10.49 (Clarke subdifferential). Let f be a continuous real-valued function defined on an open subset U of a Riemannian manifold. For each x ∈ U, define ∂f(x) as the convex hull of all limits of sequences ∇f(x_k), where all x_k are differentiability points of f and x_k → x. In short:

	∂f(x) = Conv lim_{x_k → x} ∇f(x_k).

Here comes the main result of this Appendix. If (A_i)_{1≤i≤m} are subsets of a vector space, I shall write Σ A_i = { Σ a_i ; a_i ∈ A_i }.
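Before stating the main result, a one-dimensional illustration of Definition 10.49: f(x) = |x| has gradient ±1 away from 0, so the limits of gradients at 0 are {−1, +1} and the convex hull gives ∂f(0) = [−1, 1]. The rough numerical sketch below (the helper name, step size and sampling scheme are all ad hoc choices) recovers this interval from difference quotients taken at nearby differentiability points.

```python
def clarke_interval(f, x, h=1e-6, n=100):
    """Approximate the Clarke subdifferential of a Lipschitz f: R -> R
    at x as the convex hull (an interval) of difference quotients
    sampled at points clustering around x."""
    slopes = []
    for k in range(1, n + 1):
        t = x + k * h / n                      # approach x from the right
        slopes.append((f(t + h) - f(t)) / h)
        t = x - k * h / n                      # approach x from the left
        slopes.append((f(t) - f(t - h)) / h)
    return min(slopes), max(slopes)

lo, hi = clarke_interval(abs, 0.0)
print(round(lo), round(hi))  # -1 1
```

At a differentiability point of f the interval collapses to the single gradient, which is the situation exploited in the proof of Corollary 10.52.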

Theorem 10.50 (Nonsmooth implicit function theorem). Let (f_i)_{1≤i≤m} be real-valued Lipschitz functions defined in an open set U of an n-dimensional Riemannian manifold, and let x₀ ∈ U be such that:
(a) Σ f_i(x₀) = 0;
(b) 0 ∉ Σ ∂f_i(x₀).

Then {Σ f_i = 0} is an (n−1)-dimensional Lipschitz graph around x₀.

Remark 10.51. Let ψ be a convex continuous function defined around some point x₀ ∈ ℝⁿ, and let p ∈ ℝⁿ be such that p does not belong to the Clarke differential of ψ at x₀; then 0 does not belong to the Clarke differential of ψ̃ : x ↦ ψ(x) − ψ(x₀) − p·(x − x₀) + |x − x₀|² at x₀, and Theorem 10.50 obviously implies the existence of x ≠ x₀ such that ψ̃(x) = 0, in particular ψ(x) < ψ(x₀) + p·(x − x₀). So p does not belong to the subdifferential of ψ at x₀. In other words, the subdifferential is included in the Clarke differential. The other inclusion is obvious, so both notions coincide. This justifies a posteriori the notation ∂ψ used in Definition 10.49. An easy localization argument shows that similarly, for any locally semiconvex function ψ defined in a neighborhood of x, ∂ψ(x) = ∇⁻ψ(x).

Corollary 10.52 (Implicit function theorem for two subdifferentiable functions). Let ψ and ψ̃ be two locally subdifferentiable functions defined in an open set U of an n-dimensional Riemannian manifold M, and let x₀ ∈ U be such that ψ, ψ̃ are differentiable at x₀, and

	ψ(x₀) = ψ̃(x₀);   ∇ψ(x₀) ≠ ∇ψ̃(x₀).

Then there is a neighborhood V of x₀ such that {ψ = ψ̃} ∩ V is an (n−1)-dimensional Lipschitz graph; in particular, it has Hausdorff dimension exactly n − 1.

Proof of Corollary 10.52. Let f₁ = ψ, f₂ = −ψ̃. Since f₁ is locally subdifferentiable and f₂ is locally superdifferentiable, both functions are Lipschitz in a neighborhood of x₀ (Theorem 10.8(iii)). Moreover, ∇f₁ and ∇f₂ are continuous on their respective domains of definition; so ∂f_i(x₀) = {∇f_i(x₀)} (i = 1, 2). Then by assumption {Σ ∇f_i(x₀)} does not contain 0. The conclusion follows from Theorem 10.50. ⊓⊔

Proof of Theorem 10.50. The statement is purely local and invariant under C¹ diffeomorphism, so we might pretend to be working in ℝⁿ.

281 Second Appendix: Nonsmooth implicit function theorem 275 n , ∂f or each ( x i ) ⊂ B (0 , ‖ f ) is a compact ‖ x ) ⊂ R F , so ∂f ( i Lip i i 0 0 ∑ n ; then also convex subset of ( x R ) is compact and convex, and ∂f 0 i by assumption does not contain 0. By the Hahn–Banach theorem, there n α > and ∈ 0 such that R are v ∑ α for all p ∈ 〈 ∂f p,v ( x 〉≥ ). (10.42) 0 i ∑ x So there is a neighborhood such that of 〈∇ f V ( x ) ,v 〉 ≥ α/ 2 i 0 ∈ V where all functions f at all points are differentiable. (Other- x i x ) , converging to x , such that wise there would be a sequence ( 0 k ∈ k N ∑ 〈∇ ( x f ) ,v 〉 < α/ 2, but then up to extraction of a subsequence we i k ∑ x ( x , which ) → p f ∈ ∂f < α ( ∇ 2 ), so 〈 would have p α/ ,v 〉 ≤ i 0 i i i k would contradict (10.42).) x = = 0, v Without loss of generality, we may assume that 0 ) , 0 ,..., 0), V = ( − β,β ( × B (0 ,r ), where the latter ball is a sub- e 0 1 ∑ n − 1 f and set of ≤ ( αβ ) / (4 R ‖ r ). Further, let ‖ Lip i 0 { ′ n − 1 ∈ B (0 ,r ; ) ⊂ R := Z y 0 } ] [ ∈ ( − β,β ); ∃ i ; ∇ f , ( t,y ) does not exist } λ { 0 t > i 1 and ′ β,β ) × Z = ( − D = V \ Z. Z ; λ I claim that [ Z ] = 0. To prove this it is sufficient to check that n ′ ′ ′ Z λ ] = 0. But Z ) is the nonincreasing limit of ( Z [ , where − 1 n ∈ N ℓ ℓ { ′ (0 Z y ∈ B = ,r ); 0 ℓ } [ ] { t ∈ ( − β,β ); ∃ i ; ∇ f λ ( t,y . } ) does not exist ≥ 1 /ℓ i 1 By Fubini’s theorem, [ ] { } ′ O ; ∇ f ); ( x ) does not exist for some i λ ( ≥ x λ /ℓ (1 × [ Z ∈ ]) − n i 1 n ℓ f and the left-hand side is equal to 0 since all are differentiable almost i ′ λ →∞ ℓ ] = 0, and by taking the limit [ Z everywhere. It follows that 1 n − ℓ ′ λ we obtain [ Z ] = 0. n − 1 ∑ f = f f stand for its partial derivative , and let ∂ 〉 Let = 〈∇ f,v 1 i with respect to the first coordinate. The first step of the proof has shown that ∂ f f ( x are ≥ α/ 2 at each point x where all functions ) 1 i

differentiable. So, for each y ∈ B(0, r₀) \ Z, the function t → f(t, y) is Lipschitz and differentiable λ₁-almost everywhere on (−β, β), and it satisfies ∂₁f(t, y) ≥ α/2. Thus for all t, t′ ∈ (−β, β),

   t < t′  ⟹  f(t′, y) − f(t, y) ≥ (α/2)(t′ − t).   (10.43)

This holds true for all ((t, y), (t′, y)) in D × D. Since Z′ = V \ D has zero Lebesgue measure, D is dense in V, so (10.43) extends to all ((t, y), (t′, y)) ∈ V.

For all y ∈ B(0, r₀), inequality (10.43), combined with the estimate

   |f(0, y)| = |f(0, y) − f(0, 0)| ≤ ‖f‖_Lip |y| ≤ αβ/4,

guarantees that the equation f(t, y) = 0 admits exactly one solution t = φ(y) in (−β, β).

It only remains to check that φ is Lipschitz on B(0, r₀). Let y, z ∈ B(0, r₀); then f(φ(y), y) = f(φ(z), z) = 0, so

   |f(φ(y), y) − f(φ(z), y)| = |f(φ(z), z) − f(φ(z), y)|.   (10.44)

Since the first partial derivative of f is no less than α/2, the left-hand side of (10.44) is bounded below by (α/2) |φ(y) − φ(z)|, while the right-hand side is bounded above by ‖f‖_Lip |z − y|. The conclusion is that

   |φ(y) − φ(z)| ≤ (2 ‖f‖_Lip / α) |z − y|,

so φ is indeed Lipschitz. ⊓⊔

Third Appendix: Curvature and the Hessian of the squared distance

The practical verification of the uniform semiconcavity of a given cost function c(x, y) might be a very complicated task in general. In the particular case when c(x, y) = d(x, y)², this problem can be related to the sectional curvature of the Riemannian manifold. In this Appendix I shall recall some results about these links, some of them well-known, other ones more confidential. The reader who does not know about sectional curvature can skip this Appendix, or take a look at Chapter 14 first.

If M = ℝⁿ is the Euclidean space, then d(x, y) = |x − y| and there is the very simple formula

   ∇²_x (|x − y|²/2) = Iₙ,

where the right-hand side is just the identity operator on T_x ℝⁿ = ℝⁿ. If M is an arbitrary Riemannian manifold, there is no simple formula for the Hessian ∇²_x d(x, y)²/2, and this operator will in general not be defined, in the sense that it can take eigenvalues −∞ if x and y are conjugate points. However, as we shall now see, there is still a recipe to estimate ∇²_x d(x, y)²/2 from above, and thus derive estimates of semiconcavity for d(·, y)²/2.

So let x and y be any two points in M, and let γ be a minimizing geodesic joining y to x, parametrized by arc length; so γ(0) = y, γ(d(x, y)) = x. Let H(t) stand for the Hessian operator of x → d(x, y)²/2 at x = γ(t). (The definition of this operator is recalled and discussed in Chapter 14.) On [0, d(x, y)) the operator H(t) is well-defined (since the geodesic is minimizing, it is only at t = d(x, y) that eigenvalues −∞ may appear). It starts at H(0) = Id, and then its eigenvectors and eigenvalues vary smoothly as t varies in (0, d(x, y)).

The unit vector γ̇(t) is an eigenvector of H(t), associated with the eigenvalue +1. The problem is to bound the eigenvalues in the orthogonal subspace S(t) = (γ̇(t))^⊥ ⊂ T_{γ(t)}M. So let (e₂, …, eₙ) be an orthonormal basis of S(0), and let (e₂(t), …, eₙ(t)) be obtained by parallel transport of (e₂, …, eₙ) along γ; for any t this remains an orthonormal basis of S(t). To achieve our goal, it is sufficient to bound above the quantities h(t) = ⟨H(t) · e_i(t), e_i(t)⟩_{γ(t)}, where t is arbitrary in (0, d(x, y)) and i is arbitrary in {2, …, n}. Since H(0) is the identity, we have h(0) = 1.
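As a quick sanity check of the Euclidean identity ∇²_x(|x − y|²/2) = Iₙ above, one can compute a finite-difference Hessian numerically. The sketch below is purely illustrative and not from the text; the helper `hessian` and the sample points are invented for the example.

```python
import numpy as np

def hessian(F, x, h=1e-4):
    """Central finite-difference Hessian of a scalar function F at x."""
    n = len(x)
    H = np.empty((n, n))
    I = np.eye(n)
    for i in range(n):
        for j in range(n):
            H[i, j] = (F(x + h*I[i] + h*I[j]) - F(x + h*I[i] - h*I[j])
                       - F(x - h*I[i] + h*I[j]) + F(x - h*I[i] - h*I[j])) / (4*h*h)
    return H

y = np.array([1.0, -2.0, 0.5])
F = lambda x: 0.5 * float(np.dot(x - y, x - y))   # x -> |x - y|^2 / 2
x = np.array([0.3, 0.7, -1.1])
H = hessian(F, x)   # equals the 3x3 identity, whatever x and y are
```

Since F is an exact quadratic, the four-point central-difference formula reproduces the identity matrix up to rounding error, for any choice of x and y.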
To get a differential equation on h(t), we can use a classical computation of Riemannian geometry, about the Hessian of the distance (not squared!): If k(t) = ⟨∇²_x d(y, x) · e_i(t), e_i(t)⟩_{γ(t)}, then

   k̇(t) + k(t)² + σ(t) ≤ 0,   (10.45)

where σ(t) is the sectional curvature of the plane generated by γ̇(t) and e_i(t) inside T_{γ(t)}M. To relate k(t) and h(t), we note that

   ∇_x (d(y, x)²/2) = d(y, x) ∇_x d(y, x);

   ∇²_x (d(y, x)²/2) = d(y, x) ∇²_x d(y, x) + ∇_x d(x, y) ⊗ ∇_x d(x, y).

By applying this to the tangent vector e_i(t) and using the fact that ∇_x d(x, y) at x = γ(t) is just γ̇(t), we get

   h(t) = d(y, γ(t)) k(t) + ⟨γ̇(t), e_i(t)⟩² = t k(t).

Plugging this into (10.45) results in

   ḣ(t) ≤ (h(t) − h(t)²)/t − t σ(t).   (10.46)

From (10.46) follow the two comparison results which were used in Theorem 10.41 and Corollary 10.44:

(a) Assume that the sectional curvatures of M are all nonnegative. Then (10.46) forces ḣ ≤ 0 whenever h ≥ 1, so h remains bounded above by 1 for all times. In short:

   nonnegative sectional curvature  ⟹  ∇²_x (d(x, y)²/2) ≤ Id_{T_xM}.   (10.47)

(If we think of the Hessian as a bilinear form, this is the same as ∇²_x (d(x, y)²/2) ≤ g, where g is the Riemannian metric.) Inequality (10.47) is rigorous if d(x, y)²/2 is twice differentiable at x; otherwise the conclusion should be reinterpreted as

   x → d(x, y)²/2 is semiconcave with a modulus ω(r) = r²/2.

(b) Assume now that the sectional curvatures at point x are bounded below by −C/d(x₀, x)², where x₀ is an arbitrary point. In this case I shall say that M has asymptotically nonnegative curvature. Then if y varies in a compact subset, we have a lower bound like σ(t) ≥ −C′/t², where C′ is some positive constant. So (10.46) implies the differential inequality

   ḣ(t) ≤ (C′ + h(t) − h(t)²)/t.

If h(t) ever becomes greater than C := (1 + √(1 + 4C′))/2, then the right-hand side becomes negative; so h can never go above C. The conclusion is that

   M has asymptotically nonnegative curvature  ⟹  ∀ y ∈ K,  ∇²_x (d(x, y)²/2) ≤ C(K) Id_{T_xM},

where K is any compact subset of M. Again, at points where d(x, y)²/2 is not twice differentiable, the conclusion should be reinterpreted as

   x → d(x, y)²/2 is semiconcave with a modulus ω(r) = C(K) r²/2.

Examples 10.53. The previous result applies to any compact manifold, or any manifold which has been obtained from ℝⁿ by modification on a compact set. But it does not apply to the hyperbolic space ℍⁿ; in fact, if y is any given point in ℍⁿ, then the function x → d(y, x)² is not uniformly semiconcave as x → ∞. (Take for instance the unit disk in ℝ², with polar coordinates (r, θ), as a model of ℍ²; then the distance from the origin is d(r, θ) = log((1 + r)/(1 − r)); an explicit computation shows that the first (and only nonzero) coefficient of the matrix of the Hessian of d²/2 is 1 + r d(r), which diverges logarithmically as r → 1.)

Remark 10.54. The exponent 2 appearing in the definition of "asymptotic nonnegative curvature" above is optimal, in the sense that for any p < 2 it is possible to construct manifolds satisfying σ_x ≥ −C/d(x₀, x)^p and on which d(x₀, ·)² is not uniformly semiconcave.

Bibliographical notes

The key ideas in this chapter were first used in the case of the quadratic cost function in Euclidean space [154, 156, 722].

The existence of solutions to the Monge problem and the differentiability of c-convex functions, for strictly superlinear convex cost functions in ℝⁿ (other than quadratic), was investigated by several authors, including in particular Rüschendorf [717] (formula (10.4) seems to appear there for the first time), Smith and Knott [754], and Gangbo and McCann [398, 399]. In the latter reference, the authors get rid of all moment assumptions by avoiding the explicit use of Kantorovich duality. These results are reviewed in [814, Chapter 2].
Gangbo and McCann impose some assumptions of growth and superlinearity, such as the one

described in Example 10.19. Ekeland [323] applies similar tools to the optimal matching problem (in which both measures are transported to another topological space). It is interesting to notice that the structure of the optimal map can be guessed heuristically from the old-fashioned method of Lagrange multipliers; see [328, Section 2.2], where this is done for the case of the quadratic cost function.

The terminology of the twist condition comes from dynamical systems, in particular the study of certain classes of diffeomorphisms in dimension 2 called twist diffeomorphisms [66]. Moser [641] showed that under certain conditions, these diffeomorphisms can be represented as the time-1 map of a strictly convex Lagrangian system. In this setting, the twist condition can be informally recast by saying that the dynamics "twists" vertical lines (different velocities, common position) into oblique curves (different positions).

There are cases of interest where the twist condition is satisfied even though the cost function does not have a Lagrangian structure. Examples are the so-called symmetrized Bregman cost function c(x, y) = ⟨∇φ(x) − ∇φ(y), x − y⟩, where φ is strictly convex [206], or the cost c(x, y) = |x − y|² + |f(x) − g(y)|², where f and g are convex and k-Lipschitz, k < 1 [794]. For applications in meteorology, Cullen and Maroofi [268] considered cost functions of the form c(x, y) = [(x₁ − y₁)² + (x₂ − y₂)² + φ(x₃)]/y₃ in a bounded region of ℝ³.

Conversely, a good example of an interesting smooth cost function which does not satisfy the twist condition is provided by the restriction of the square Euclidean distance to the sphere, or to a product of convex boundaries [6, 400].
Gangbo and McCann [399] considered not only strictly convex, but also strictly concave cost functions in ℝⁿ (more precisely, strictly concave functions of the distance), which are probably more realistic from an economic perspective, as explained in the introduction of their paper. The main results from [399] are briefly reviewed in [814, Section 2.4]. Further numerical and theoretical analysis of nonconvex cost functions in dimension 1 has been carried out by McCann [615], Rüschendorf and Uckelmann [724, 796], and Plakhov [683]. Hsu and Sturm [484] worked on a very nice application of an optimal transport problem with a concave cost to a problem of maximal coupling of Brownian paths.

McCann [616] proved Theorem 10.41 when M is a compact Riemannian manifold and μ is absolutely continuous. This was the first optimal transport theorem on a Riemannian manifold (save for the very particular case of the n-dimensional torus, which was treated before by Cordero-Erausquin [240]). In his paper McCann also mentioned the possibility of covering more general cost functions, expressed in terms of the distance.

Later Bernard and Buffoni [105] extended McCann's results to more general Lagrangian cost functions, and imported tools and techniques from the theory of Lagrangian systems (related in particular to Mather's minimization problem). The proof of the basic result of this chapter (Theorem 10.28) is in some sense an extension of the Bernard–Buffoni theorem to its "natural generality". It is clear from the proof that the Riemannian structure plays hardly any role, so it extends for instance to Finsler geometries, as was done in the work of Ohta [657].

Before the explicit link realized by Bernard and Buffoni, several researchers, in particular Evans, Fathi and Gangbo, had become gradually aware of the strong similarities between Monge's theory on the one hand, and Mather's theory on the other. De Pascale, Gelli and Granieri [278] contributed to this story; see also [839].

Fang and Shao rewrote McCann's theorem in the formalism of Lie groups [339]. They used this reformulation as a starting point to derive theorems of unique existence of the optimal transport on the path space over a Lie group. Shao's PhD Thesis [748] contains a synthetic view on these issues, and reminders about differential calculus in Lie groups.

Feyel and Üstünel [358, 359, 360, 362] derived theorems of unique solvability of the Monge problem in the Wiener space, when the cost is the square of the Cameron–Martin distance (or rather pseudo-distance, since it takes the value +∞). Their tricky analysis goes via finite-dimensional approximations.
Ambrosio and Rigot [33] adapted the proof of Theorem 10.41 to cover degenerate (subriemannian) situations such as the Heisenberg group, equipped with either the squared Carnot–Carathéodory metric or the squared Korányi norm. The proofs required a delicate analysis of minimizing geodesics, differentiability properties of the squared distance, and fine properties of BV functions on the Heisenberg group. Then Rigot [702] generalized these results to certain classes of groups. Further work in this area (including the absolute continuity of Wasserstein geodesics at intermediate times, the differentiability of the optimal

maps, and the derivation of an equation of Monge–Ampère type) was achieved by Agrachev and Lee [3], Figalli and Juillet [366], and Figalli and Rifford [370].

Another nonsmooth generalization of Theorem 10.41 was obtained by Bertrand [114], who adapted McCann's argument to the case of an Alexandrov space with finite (Hausdorff) dimension and (sectional) curvature bounded below. His analysis makes crucial use of fine regularity results on the structure of Alexandrov spaces, derived by Perelman, Otsu and Shioya [174, 175, 665].

Remark 10.30 about the uniqueness of the potential ψ is due to Loeper [570, Appendix]; it still holds if μ is any singular measure, provided that dμ/dvol > 0 almost everywhere (even though the transport map might not be unique then).

The use of approximate differentials as in Theorem 10.38 was initiated by Ambrosio and collaborators [30, Chapter 6], for strictly convex cost functions in ℝⁿ. The adaptation to Riemannian manifolds is due to Fathi and Figalli [348], with a slightly more complicated (but slightly more general) approach than the one used in this chapter.

The tricky proof of Theorem 10.42 takes its roots in Alexandrov's uniqueness theorem for graphs of prescribed Gauss curvature [16]. (The method can be found in [53, Theorem 10.2].) McCann [613] understood that Alexandrov's strategy could be revisited to yield the uniqueness of a cyclically monotone transport in ℝⁿ without the assumption of finite total cost (Corollary 10.44 in the case when M = ℝⁿ). The tricky extension to more general cost functions on Riemannian manifolds was performed later by Figalli [363]. The current proof of Theorem 10.42 is so complicated that the reader might prefer to have a look at [814, Section 2.3.3], where the core of McCann's proof is explained in simpler terms in the particular case c(x, y) = |x − y|².
x The case when the cost function is the distance ( c ( x,y ) = d ( x,y )) is not covered by Theorem 10.28, nor by any of the theorems ap- pearing in the present chapter. This case is quite more tricky, be it in Euclidean space or on a manifold. The interested reader can con- sult [814, Section 2.4.6] for a brief review, as well as the research pa- pers [20, 31, 32, 104, 190, 279, 280, 281, 354, 364, 380, 686, 765, 791]. The treatment by Bernard and Buffoni [104] is rather appealing, for its simplicity and links to dynamical system tools. An extreme case (maybe purely academic) is when the cost is the Cameron–Martin dis-

289 Bibliographical notes 283 t ance on the Wiener space; then usual strategies seem to fail, in the first place because of (non)measurability issues [23]. The optimal transport problem with a distance cost function is irrigation problem studied recently by various au- also related to the thors [109, 110, 111, 112, 152], the Bouchitt ́e–Buttazzo variational problem [147, 148], and other problems as well. In this connection, see also Pratelli [689]. The partial optimal transport problem, where only a fixed fraction of the mass is transferred, was studied in [192, 365]. Under adequate as- sumptions on the cost function, one has the following results: whenever the transferred mass is at least equal to the shared mass between the measures and ν , then (a) there is uniqueness of the partial transport μ map; (b) all the shared mass is at the same time both source and target; (c) the “active” region depends monotonically on the mass transferred, and is the union of the intersection of the supports and a semiconvex set. To conclude, here are some remarks about the technical ingredients used in this chapter. Rademacher [697] proved his theorem of almost everywhere differ- entiability in 1918, for Lipschitz functions of two variables; this was later generalized to an arbitrary number of variables. The simple ar- gument presented in this section seems to be due to Christensen [233]; it can also be found, up to minor variants, in modern textbooks about real analysis such as the one by Evans and Gariepy [331, pp. 81–84]. Ambrosio showed me another simple argument which uses Lebesgue’s density theorem and the identification of a Lipschitz function with a function whose distributional derivative is essentially bounded. The book by Cannarsa and Sinestrari [199] is an excellent reference n R , as well as the links for semiconvexity and subdifferentiability in with the theory of Hamilton–Jacobi equations. 
It is centered on semiconcavity rather than semiconvexity (and superdifferentiability rather than subdifferentiability), but this is just a question of convention. Many regularity results in this chapter have been adapted from that source (see in particular Theorem 2.1.7 and Corollary 4.1.13 there). Also the proof of Theorem 10.48(i) is adapted from [199, Theorem 4.1.6 and Corollary 4.1.9]. The core results in this circle of ideas and tools can be traced back to a pioneering paper by Alberti, Ambrosio and Cannarsa [12]. Following Ambrosio's advice, I used the same methods to establish Theorem 10.48(ii) in the present notes.

One often says that S ⊂ ℝⁿ is d-rectifiable if it can be written as a countable union of C¹ manifolds (submanifolds of ℝⁿ), apart from a set of zero H^d measure. This property seems stronger, but is actually equivalent to Definition 10.47 (see [753, Lemma 11.1]). Stronger notions are obtained by changing C¹ into C^r for some r ≥ 2. For instance Alberti [9] shows that the (n − 1)-rectifiability of the nondifferentiability set of a convex function is achieved with C² manifolds, which is optimal.

Apart from plain subdifferentiability and Clarke subdifferentiability, other notions of differentiability for nonsmooth functions are discussed in [199], such as Dini derivatives or reachable gradients.

The theory of approximate differentiability (in Euclidean space) is developed in Federer [352, Section 3.1.8]; see also Ambrosio, Gigli and Savaré [30, Section 5.5]. A central result is the fact that any approximately differentiable function coincides, up to a set of arbitrarily small measure, with a Lipschitz function.

The proof of Besicovich's density theorem [331, p. 43] is based on Besicovich's covering lemma. This theorem is an alternative to the more classical Lebesgue density theorem (based on Vitali's covering lemma), which requires the doubling property. The price to pay for Besicovich's theorem is that it only works in ℝⁿ (or a Riemannian manifold, by localization) rather than on a general metric space.

The nonsmooth implicit function theorem in the Second Appendix (Theorem 10.50) seems to be folklore in nonsmooth real analysis; the core of its proof was explained to me by Fathi. Corollary 10.52 was discovered or rediscovered by McCann [613, Appendix], in the case where ψ and ψ̃ are convex functions in ℝⁿ.

Everything in the Third Appendix, in particular the key differential inequality (10.46), was explained to me by Gallot.
The lower bound assumption σ_x ≥ −C/d(x₀, x)² on the sectional curvatures is sufficient to get upper bounds on ∇²_x d(x, y)² as y stays in a compact set, but it is not sufficient to get upper bounds that are uniform in both x and y. A counterexample is developed in [393, pp. 213–214].

The exact computation about the hyperbolic space in Example 10.53 is the extremal situation for a comparison theorem about the Hessian of the squared distance [246, Lemma 3.12]: If M is a Riemannian manifold with sectional curvature bounded below by κ < 0, then

   ∇²_x (d(x, y)²/2) ≤ (√|κ| d(x, y)) / tanh(√|κ| d(x, y)).
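In constant curvature the differential inequality (10.46) becomes an equality, which gives a convenient numerical sanity check of both the nonnegative-curvature bound h ≤ 1 and the hyperbolic comparison formula above: with σ = +1 (unit sphere) one has h(t) = t/tan(t), and with σ = −1 (hyperbolic plane) one has h(t) = t/tanh(t). The following sketch is purely illustrative, with invented sample values:

```python
import math

def check(h, sigma, ts):
    """Verify by finite differences that h solves (10.46) with equality."""
    for t in ts:
        eps = 1e-6
        hdot = (h(t + eps) - h(t - eps)) / (2 * eps)   # numerical derivative
        rhs = (h(t) - h(t) ** 2) / t - t * sigma       # right-hand side of (10.46)
        assert abs(hdot - rhs) < 1e-6

check(lambda t: t / math.tan(t), +1.0, [0.3, 0.8, 1.4])    # sphere, sigma = +1
check(lambda t: t / math.tanh(t), -1.0, [0.3, 0.8, 3.0])   # hyperbolic, sigma = -1
assert all(t / math.tan(t) <= 1.0 for t in [0.3, 0.8, 1.4])   # h <= 1 when sigma >= 0
```

The hyperbolic solution t/tanh(t) grows linearly in t, which is exactly why d(·, y)² fails to be uniformly semiconcave on ℍⁿ.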

As pointed out to me by Ghys, the problem of finding a sufficient condition for the Hessian of d(x, y)² to be bounded above is related to the problem of whether large spheres S_r(y) centered at y look flat at infinity, in the sense that their second fundamental form is bounded like O(1/r).


11 The Jacobian equation

Transport is but a change of variables, and in many problems involving changes of variables, it is useful to write the Jacobian equation

   f(x) = g(T(x)) J_T(x),

where f and g are the respective densities of the probability measures μ and ν with respect to the volume measure (in ℝⁿ, the Lebesgue measure), and J_T(x) is the absolute value of the Jacobian determinant associated with T:

   J_T(x) = |det(∇T(x))| = lim_{r→0} vol[T(B_r(x))] / vol[B_r(x)].

There are two important things that one should check before writing the Jacobian equation: First, T should be injective on its domain of definition; secondly, it should possess some minimal regularity.

So how smooth should T be for the Jacobian equation to hold true? We learn in elementary school that it is sufficient for T to be continuously differentiable, and a bit later that it is actually enough to have T Lipschitz continuous. But that degree of regularity is not always available in optimal transport! As we shall see in Chapter 12, the transport map T might fail to be even continuous.

There are (at least) three ways out of this situation:

(i) Only use the Jacobian equation in situations where the optimal map is smooth. Such situations are rare; this will be discussed in Chapter 12.

(ii) Only use the Jacobian equation for the optimal map between μ_{t₀} and μ_t, where (μ_t)_{0 ≤ t ≤ 1} is a compactly supported displacement

interpolation, and t₀ is fixed in (0, 1). Then, according to Theorem 8.5, the transport map is essentially Lipschitz. This is the strategy that I shall use in most of this course.

(iii) Apply a more sophisticated theorem of change of variables, covering for instance changes of variables with bounded variation (possibly discontinuous). It is in fact sufficient that the map T be differentiable almost everywhere, or even just approximately differentiable almost everywhere, in the sense of Definition 10.2. Such a theorem is stated below without proof; I shall use it in Chapter 23. The volume measure on M will be denoted by vol.

Theorem 11.1 (Jacobian equation). Let M be a Riemannian manifold, let f ∈ L¹(M) be a nonnegative integrable function on M, and let T : M → M be a Borel map. Define μ(dx) = f(x) vol(dx) and ν := T_# μ. Assume that:
(i) There exists a measurable set Σ ⊂ M, such that f = 0 almost everywhere outside of Σ, and T is injective on Σ;
(ii) T is approximately differentiable almost everywhere on Σ.
Let ∇̃T be the approximate gradient of T, and let J_T be defined almost everywhere on Σ by the equation J_T(x) := |det(∇̃T(x))|. Then ν is absolutely continuous with respect to the volume measure if and only if J_T > 0 almost everywhere. In that case ν is concentrated on T(Σ), and its density is determined by the equation

   f(x) = g(T(x)) J_T(x).   (11.1)

In an informal writing:

   d((T⁻¹)_#(g vol)) / d vol = J_T (g ∘ T).

Theorem 11.1 establishes the Jacobian equation as soon as, say, the optimal transport has locally bounded variation. Indeed, in this case the map T is almost everywhere differentiable, and its gradient coincides with the absolutely continuous part of the distributional gradient ∇′T.
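In dimension 1 the content of Theorem 11.1 can be checked by hand. The following minimal sketch is illustrative only (the map T(x) = x² and the uniform source density are invented for the example): pushing the uniform density on (0, 1) forward by T gives the image density g(y) = 1/(2√y), and equation (11.1) is verified pointwise.

```python
import math

# Sanity check of (11.1): mu = uniform on (0,1), T(x) = x^2 (injective there),
# J_T(x) = |T'(x)| = 2x, and nu = T_# mu has density g(y) = 1/(2*sqrt(y)).
f = lambda x: 1.0                          # density of mu on (0, 1)
T = lambda x: x * x
J_T = lambda x: 2 * x                      # |det(grad T)| in dimension 1
g = lambda y: 1.0 / (2.0 * math.sqrt(y))   # density of nu = T_# mu

for x in [0.1, 0.25, 0.5, 0.9]:
    assert abs(g(T(x)) * J_T(x) - f(x)) < 1e-12   # f(x) = g(T(x)) J_T(x)
```

Note that g blows up near y = 0, precisely where J_T vanishes: the equation f = (g ∘ T) J_T balances the two effects.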
The property of bounded variation is obviously satisfied for the quadratic cost in Euclidean space, since the second derivative of a convex function is a nonnegative measure.

Example 11.2. Consider two probability measures μ₀ and μ₁ on ℝⁿ, with finite second moments; assume that μ₀ and μ₁ are absolutely continuous with respect to the Lebesgue measure, with respective densities

f₀ and f₁. Under these assumptions there exists a unique optimal transport map T between μ₀ and μ₁, and it takes the form T(x) = ∇Ψ(x) for some lower semicontinuous convex function Ψ. There is a unique displacement interpolation (μ_t)_{0 ≤ t ≤ 1}, and it is defined by

   μ_t = (T_t)_# μ₀,   T_t(x) = (1 − t) x + t T(x) = (1 − t) x + t ∇Ψ(x).

By Theorem 8.7, each μ_t is absolutely continuous, so let f_t be its density.

The map T = ∇Ψ is of locally bounded variation, and it is differentiable almost everywhere, with Jacobian matrix ∇T = ∇²Ψ, where ∇²Ψ is the Alexandrov Hessian of Ψ (see Theorem 14.25 later in this course). Then it follows from Theorem 11.1 that, μ₀(dx)-almost surely,

   f₀(x) = f₁(∇Ψ(x)) det(∇²Ψ(x)).

Also, for any t ∈ [0, 1],

   f₀(x) = f_t(T_t(x)) det(∇T_t(x)) = f_t((1 − t) x + t ∇Ψ(x)) det((1 − t) Iₙ + t ∇²Ψ(x)).

If T_{t₀→t} = T_t ∘ T_{t₀}⁻¹ stands for the transport map between μ_{t₀} and μ_t, then the equation

   f_{t₀}(x) = f_t(T_{t₀→t}(x)) det(∇T_{t₀→t}(x))

also holds true for t₀ ∈ (0, 1); but now this is just the theorem of change of variables for Lipschitz maps.

In the sequel of this course, with the noticeable exception of Chapter 23, it will be sufficient to use the following theorem of change of variables.

Theorem 11.3 (Change of variables). Let M be a Riemannian manifold, and c(x, y) a cost function deriving from a C² Lagrangian L(x, v, t) on TM × [0, 1], where L satisfies the classical conditions of Definition 7.6, together with ∇²_v L > 0. Let (μ_t)_{0 ≤ t ≤ 1} be a displacement interpolation, such that each μ_t is absolutely continuous and has density f_t. Let t₀ ∈ (0, 1) and t ∈ [0, 1]; further, let T_{t₀→t} be the (μ_{t₀}-almost surely) unique optimal transport from μ_{t₀} to μ_t, and let J_{t₀→t} be the associated Jacobian determinant.
Let F be a nonnegative measurable function on M × ℝ₊ such that

   f_t(y) = 0  ⟹  F(y, f_t(y)) = 0.

Then,

   ∫_M F(y, f_t(y)) vol(dy) = ∫_M F(T_{t₀→t}(x), f_{t₀}(x)/J_{t₀→t}(x)) J_{t₀→t}(x) vol(dx).

Furthermore, μ_{t₀}(dx)-almost surely, J_{t₀→t}(x) > 0 for all t ∈ [0, 1].

Proof of Theorem 11.3. For brevity I shall abbreviate vol(dx) into just dx.

Let us first consider the case when (μ_t)_{0 ≤ t ≤ 1} is compactly supported. Let Π be a probability measure on the set of minimizing curves, such that μ_t = (e_t)_# Π. Let K_t = e_t(Spt Π) and K_{t₀} = e_{t₀}(Spt Π). By Theorem 8.5, the map γ_{t₀} → γ_t is well-defined and Lipschitz for all γ ∈ Spt Π. So T_{t₀→t} : γ_{t₀} → γ_t is a Lipschitz map K_{t₀} → K_t. By assumption μ_{t₀} is absolutely continuous, so Theorem 10.28 (applied with the cost function c^{t₀,t}(x, y), or maybe c^{t,t₀}(x, y) if t < t₀) guarantees that the coupling (γ_{t₀}, γ_t) is deterministic, which amounts to saying that γ_{t₀} → γ_t is injective apart from a set of zero probability.

Then we can use the change of variables formula with g = 1_{K_t}, T = T_{t₀→t}, and we find f(x) = J_{t₀→t}(x). Therefore, for any nonnegative measurable function G on M,

   ∫_{K_t} G(y) dy = ∫_{K_t} G(y) d((T_{t₀→t})_# μ)(y) = ∫_{K_{t₀}} (G ∘ T_{t₀→t})(x) f(x) dx = ∫_{K_{t₀}} G(T_{t₀→t}(x)) J_{t₀→t}(x) dx.

We can apply this to G(y) = F(y, f_t(y)) and replace f_t(T_{t₀→t}(x)) by f_{t₀}(x)/J_{t₀→t}(x); this is allowed since in the right-hand side the contribution of those x with J_{t₀→t}(x) = 0 is negligible, and f_t(T_{t₀→t}(x)) = 0 implies (almost surely) f_{t₀}(x) = 0. So in the end

   ∫_{K_t} F(y, f_t(y)) dy = ∫_{K_{t₀}} F(T_{t₀→t}(x), f_{t₀}(x)/J_{t₀→t}(x)) J_{t₀→t}(x) dx.

Since f_t(y) = 0 almost surely outside of K_t and f_{t₀}(x) = 0 almost surely outside of K_{t₀}, these two integrals can be extended to the whole of M.
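The integral identity of Theorem 11.3 can be tested numerically in a simple one-dimensional situation, where the displacement interpolation of two Gaussian measures for the quadratic cost is affine and the Jacobian J_{t₀→t} is constant. All the specific values below (the Gaussian pair, t₀, t, and the test integrand F) are invented for the illustration:

```python
import math

def gauss(x, mean, std):
    return math.exp(-((x - mean) / std) ** 2 / 2) / (std * math.sqrt(2 * math.pi))

# Interpolation of N(0, 1) to N(2, 1.5^2): mu_t = N(2t, sigma(t)^2), sigma(t) = 1 + t/2.
sigma = lambda t: 1 + 0.5 * t
t0, t = 0.4, 0.8
J = sigma(t) / sigma(t0)                                      # constant Jacobian of T_{t0->t}
T = lambda x: 2 * t + (sigma(t) / sigma(t0)) * (x - 2 * t0)   # optimal map mu_{t0} -> mu_t

F = lambda y, lam: lam ** 2        # test integrand F(y, f_t(y)); here int F = int f_t^2
dx = 1e-3
xs = [-10 + k * dx for k in range(25000)]                     # grid covering both supports
lhs = sum(F(y, gauss(y, 2 * t, sigma(t))) for y in xs) * dx
rhs = sum(F(T(x), gauss(x, 2 * t0, sigma(t0)) / J) * J for x in xs) * dx
assert abs(lhs - rhs) < 1e-6
```

With F(y, λ) = λ², both sides equal ∫ f_t², i.e. 1/(2 σ(t) √π); the same check works for any F vanishing where f_t does, e.g. F(y, λ) = λ log λ extended by 0.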

Now it remains to generalize this to the case when Π is not compactly supported. (Skip this bit at first reading.) Let (K_ℓ)_{ℓ∈ℕ} be a nondecreasing sequence of compact sets, such that Π[∪ K_ℓ] = 1. For ℓ large enough, Π[K_ℓ] > 0, so we can consider the restriction Π_ℓ of Π to K_ℓ. Then let K_{t,ℓ} and K_{t₀,ℓ} be the images of K_ℓ by e_t and e_{t₀}, and of course μ_{t,ℓ} = (e_t)_# Π_ℓ, μ_{t₀,ℓ} = (e_{t₀})_# Π_ℓ. Since μ_t and μ_{t₀} are absolutely continuous, so are μ_{t,ℓ} and μ_{t₀,ℓ}; let f_{t,ℓ} and f_{t₀,ℓ} be their respective densities. The optimal map T_{t₀→t,ℓ} for the transport problem between μ_{t₀,ℓ} and μ_{t,ℓ} is obtained as before by the map γ_{t₀} → γ_t, so this is actually the restriction of T_{t₀→t} to K_{t₀,ℓ}. Thus we have the Jacobian equation

   f_{t₀,ℓ}(x) = f_{t,ℓ}(T_{t₀→t}(x)) J_{t₀→t}(x),   (11.2)

where the Jacobian determinant does not depend on ℓ. This equation holds true almost surely for x ∈ K_{t₀,ℓ′}, as soon as ℓ′ ≤ ℓ, so we may pass to the limit as ℓ → ∞ to get

   f_{t₀}(x) = f_t(T_{t₀→t}(x)) J_{t₀→t}(x).   (11.3)

Since this is true almost surely on K_{t₀,ℓ′}, for each ℓ′, it is also true almost surely.

Next, for any nonnegative measurable function G, by monotone convergence and the first part of the proof one has

   ∫_{∪ K_{t,ℓ}} G(y) dy = lim_{ℓ→∞} ∫_{K_{t,ℓ}} G(y) dy = lim_{ℓ→∞} ∫_{K_{t₀,ℓ}} G(T_{t₀→t}(x)) J_{t₀→t}(x) dx = ∫_{∪ K_{t₀,ℓ}} G(T_{t₀→t}(x)) J_{t₀→t}(x) dx.

The conclusion follows as before by choosing G(y) = F(y, f_t(y)) and using the Jacobian equation (11.3), then extending the integrals to the whole of M.

It remains to prove the assertion about J_{t₀→t}(x) being positive for all values of t ∈ [0, 1], and not just for t = 1, or for almost all values of t. The transport map T_{t₀→t} can be written γ(t₀) → γ(t), where γ
is a minimizing curve determined uniquely by γ(t₀). Since γ is minimizing, we know (recall Problem 8.8) that the map (γ₀, γ̇₀) → (γ₀, γ_{t₀}) is locally invertible. So T_{t₀→t} can be written as the composition of the

maps F₁ : γ(t₀) → (γ(0), γ(t₀)), F₂ : (γ(0), γ(t₀)) → (γ(0), γ̇(0)), and F₃ : (γ(0), γ̇(0)) → γ(t). Both F₂ and F₃ have a positive Jacobian determinant, at least if t < 1; so if x is chosen in such a way that F₁ has a positive Jacobian determinant at x, then also T_{t₀→t} = F₃ ∘ F₂ ∘ F₁ will have a positive Jacobian determinant at x, for t ∈ [0, 1). ⊓⊔

Bibliographical notes

Theorem 11.1 can be obtained (in ℝⁿ) by combining Lemma 5.5.3 in [30] with Theorem 3.83 in [26].

In the context of optimal transport, the change of variables formula (11.1) was proven by McCann [614]. His argument is based on Lebesgue's density theory, and takes advantage of Alexandrov's theorem, alluded to in this chapter and proven later as Theorem 14.25: A convex function admits a Taylor expansion at order 2 at almost each x in its domain of definition. Since the gradient of a convex function has locally bounded variation, Alexandrov's theorem can be seen essentially as a particular case of the theorem of approximate differentiability of functions with bounded variation. McCann's argument is reproduced in [814, Theorem 4.8].

Along with Cordero-Erausquin and Schmuckenschläger, McCann later generalized his result to the case of Riemannian manifolds [246]. Modulo certain complications, the proof basically follows the same pattern as in ℝⁿ. Then Cordero-Erausquin [243] treated the case of strictly convex cost functions in ℝⁿ in a similar way.

Ambrosio pointed out that those results could be retrieved within the general framework of push-forward by approximately differentiable mappings. This point of view has the disadvantage of involving more subtle arguments, but the advantage of showing that it is not a special feature of optimal transport. It also applies to nonsmooth cost functions such as |x − y|^p.
In fact it covers general strictly convex costs of the form c(x − y), as soon as c has superlinear growth, is C¹ everywhere and C² out of the origin. A more precise discussion of these subtle issues can be found in [30, Section 6.2.1].

It is a general feature of optimal transport with strictly convex cost in ℝⁿ that if T stands for the optimal transport map, then the Jacobian matrix ∇T, even if not necessarily nonnegative symmetric, is diagonalizable with nonnegative eigenvalues; see Cordero-Erausquin [243] and

Ambrosio, Gigli and Savaré [30, Section 6.2]. From an Eulerian perspective, that diagonalizability property was already noticed by Otto [666, Proposition A.4]. I don't know if there is an analog on Riemannian manifolds.

Changes of variables of the form y = exp_x(∇ψ(x)) (where ψ is not necessarily d²/2-convex) have been used in a remarkable paper by Cabré [181] to investigate qualitative properties of nondivergent elliptic equations (Liouville theorem, Alexandrov–Bakelman–Pucci estimates, Krylov–Safonov–Harnack inequality) on Riemannian manifolds with nonnegative sectional curvature. (See for instance [189, 416, 786] for classical proofs in ℝⁿ.) It is mentioned in [181] that the methods extend to sectional curvature bounded below. For the Harnack inequality, Cabré's method was extended to nonnegative Ricci curvature by S. Kim [516].


12 Smoothness

The smoothness of the optimal transport map may give information about its qualitative behavior, as well as simplify computations. So it is natural to investigate the regularity of this map.

What characterizes the optimal transport map T is the existence of a c-convex ψ such that (10.20) (or (10.23)) holds true; so it is natural to search for a closed equation on ψ.

To guess the equation, let us work formally without being too demanding about regularity issues. We shall assume that x and y vary in ℝⁿ, or in nice subsets of smooth n-dimensional Riemannian manifolds. Let μ(dx) = f(x) vol(dx) and ν(dy) = g(y) vol(dy) be two absolutely continuous probability measures, let c(x, y) be a smooth cost function, and let T be a Monge transport. The differentiation of (10.20) with respect to x (once again) leads to

∇²ψ(x) + ∇²_{xx}c(x, T(x)) + ∇²_{xy}c(x, T(x)) · ∇T(x) = 0,

which can be rewritten

∇²ψ(x) + ∇²_{xx}c(x, T(x)) = −∇²_{xy}c(x, T(x)) · ∇T(x).   (12.1)

The expression on the left-hand side is the Hessian of the function x′ → ψ(x′) + c(x′, T(x)), considered as a function of x′, and then evaluated at x′ = x. Since this function has a minimum at x′ = x, its Hessian is nonnegative, so the left-hand side of (12.1) is a nonnegative symmetric operator; in particular its determinant is nonnegative. Take absolute values of determinants on both sides of (12.1):

det(∇²ψ(x) + ∇²_{xx}c(x, T(x))) = |det(∇²_{xy}c(x, T(x)))| |det(∇T(x))|.

Then the Jacobian determinant in the right-hand side can be replaced by f(x)/g(T(x)), and we arrive at the basic partial differential equation of optimal transport:

det(∇²ψ(x) + ∇²_{xx}c(x, T(x))) = |det(∇²_{xy}c(x, T(x)))| (f(x)/g(T(x))).   (12.2)

This becomes a closed equation on ψ in terms of f and g, if one recalls from (10.20) that

T(x) = (∇_x c)⁻¹(x, −∇ψ(x)),   (12.3)

where the inverse is with respect to the y variable.

Remark 12.1. In a genuinely Riemannian context, equation (12.2) at first sight does not seem to make sense: change the metric close to y = T(x) and not close to x; then the left-hand side is invariant but the right-hand side seems to change! This is a subtle illusion: indeed, if the metric is changed close to y, then so is the volume measure, and the density g has to be modified too. All in all, the right-hand side in (12.2) is invariant.

Now what can be said of (12.2)? Unfortunately not much simplification can be expected, except in special cases. The most important of them is the quadratic cost function, or equivalently c(x, y) = −x·y in ℝⁿ. Then (12.2)–(12.3) reduces to

det(∇²ψ(x)) = f(x)/g(∇ψ(x)).   (12.4)

This is an instance of the Monge–Ampère equation, well-known in the theory of partial differential equations. By extension, the system (12.2)–(12.3) is also called a generalized Monge–Ampère equation.

At this point we may hope that the theory of partial differential equations will help our task quite a bit by providing regularity results for the optimal map in the Monge–Kantorovich problem, at least if we rule out cases where the map is trivially discontinuous (for instance if the support of the initial measure μ is connected, while the support of the final measure ν is not). However, things are not so simple.
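Before these difficulties appear, equation (12.4) can at least be sanity-checked in dimension one, where the quadratic-cost optimal map is the monotone rearrangement and (12.4) reduces to T′(x) = f(x)/g(T(x)) with T = ∇ψ. The sketch below is not from the book; the choice of two Gaussian densities and of the test points is mine, made only because the monotone map between Gaussians is explicitly affine.

```python
import math

# 1-D sanity check of the Monge-Ampere relation T'(x) = f(x) / g(T(x)),
# where T is the monotone (quadratic-cost-optimal) map between two Gaussians.
# For N(m1, s1^2) -> N(m2, s2^2) the monotone map is affine:
#     T(x) = m2 + (s2 / s1) * (x - m1),  so  T'(x) = s2 / s1.

def gauss(x, m, s):
    return math.exp(-((x - m) ** 2) / (2 * s * s)) / (s * math.sqrt(2 * math.pi))

m1, s1, m2, s2 = 0.0, 1.0, 3.0, 0.5

def T(x):
    return m2 + (s2 / s1) * (x - m1)

for x in [-1.0, 0.3, 2.0]:
    lhs = s2 / s1                                   # Jacobian T'(x)
    rhs = gauss(x, m1, s1) / gauss(T(x), m2, s2)    # density ratio f(x)/g(T(x))
    assert abs(lhs - rhs) < 1e-12
print("Monge-Ampere identity verified at sample points")
```

The same check applied to a nonmonotone rearrangement would fail, which is one way to see that (12.4) encodes optimality and not merely mass conservation.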
As the next few sections will demonstrate, regularity can hold only under certain stringent assumptions on the geometry and the cost function. The identification of these

conditions will introduce us to a beautiful chapter of the theory of fully nonlinear partial differential equations; but one sad conclusion will be that optimal transport is in general not smooth — even worse, smoothness requires nonlocal conditions which are probably impossible to check effectively, say on a generic Riemannian manifold. So if we still want to use optimal transport in rather general situations, we'd better find ways to do without smoothness. It is actually a striking feature of optimal transport that this theory can be pushed very far with so little regularity available.

In this chapter I shall first study various counterexamples to identify obstructions to the regularity of the optimal transport, and then discuss some positive results.

The following elementary lemma will be useful:

Lemma 12.2. Let (X, μ) and (Y, ν) be any two Polish probability spaces, let T : X → Y be a continuous map, and let π = (Id, T)_# μ be the associated transport plan. Then, for each x ∈ Spt μ, the pair (x, T(x)) belongs to the support of π.

Proof of Lemma 12.2. Let x and ε > 0 be given. By continuity of T, there is δ > 0 such that T(B_δ(x)) ⊂ B_ε(T(x)). Without loss of generality, δ ≤ ε. Then

π[B_ε(x) × B_ε(T(x))] = μ[{z ∈ X; z ∈ B_ε(x) and T(z) ∈ B_ε(T(x))}]
   ≥ μ[B_ε(x) ∩ B_δ(x)] = μ[B_δ(x)] > 0.

Since ε is arbitrarily small, this shows that π attributes positive measure to any neighborhood of (x, T(x)), which proves the claim. ⊓⊔

Caffarelli's counterexample

Caffarelli understood that regularity results for (12.2) in ℝⁿ cannot be obtained unless one adds an assumption of convexity of the support of ν. Without such an assumption, the optimal transport may very well be discontinuous, as the next counterexample shows.

Theorem 12.3 (An example of discontinuous optimal transport).
There are smooth compactly supported probability densities f and g on ℝⁿ, such that the supports of f and g are smooth and connected, f and g are (strictly) positive in the interior of their respective

supports, and yet the optimal transport between μ(dx) = f(x) dx and ν(dy) = g(y) dy, for the cost c(x, y) = |x − y|², is discontinuous.

Proof of Theorem 12.3. Let f be the indicator function of the unit ball B in ℝ² (normalized to be a probability measure), and let g_ε be the (normalized) indicator function of a set C_ε obtained by first separating the ball into two halves B_1 and B_2 (say with distance 2), then building a thin bridge between those two halves, of width O(ε). (See Figure 12.1.) Let also g be the normalized indicator function of B_1 ∪ B_2: this is the limit of g_ε as ε ↓ 0. It is not difficult to see that g_ε (identified with a probability measure) can be obtained from f by a continuous deterministic transport (after all, one can deform B continuously into C_ε; just think that you are playing with clay, then it is possible to massage the ball into C_ε, without tearing off). However, we shall see here that for ε small enough, the optimal transport cannot be continuous.

[Fig. 12.1. Principle behind Caffarelli's counterexample. The optimal transport from the ball to the "dumb-bells" has to be discontinuous, and in effect splits the upper region S into the upper left and upper right regions S_− and S_+. Otherwise, there should be some transport along the dashed lines, but for some lines this would contradict monotonicity.]

The proof will rest on the stability of optimal transport: If T is the unique optimal transport between μ and ν, and T_ε is an optimal transport between μ and ν_ε, then T_ε converges to T in μ-probability as ε ↓ 0 (Corollary 5.23). In the present case, choosing μ(dx) = f(x) dx, ν(dy) = g(y) dy, c(x, y) = |x − y|², it is easy to figure out that the unique optimal

transport T is the one that sends (x, y) to (x − 1, y) if x < 0, and to (x + 1, y) if x > 0.

Let now S, S_− and S_+ be the upper region in the ball, the left half-ball and the right half-ball, respectively, as in Figure 12.1. As a consequence of the convergence in probability, for ε small enough, a large fraction (say 0.99) of the mass in S has to go to S_− (if it lies on the left) or to S_+ (if it lies on the right). Since the continuous image of a connected set is itself connected, there have to be some points in T_ε(S) that form a path going from S_− to S_+; and so there is some x ∈ S such that T_ε(x) is close to the left-end of the tube joining the half-balls, in particular T_ε(x) − x has a large downward component. From the convergence in probability again, many of the neighbors x̃ of x have to be transported to, say, S_−, with nearly horizontal displacements T_ε(x̃) − x̃. If such an x̃ is picked below x, we shall have ⟨x − x̃, T_ε(x) − T_ε(x̃)⟩ < 0; or equivalently, |x − T_ε(x)|² + |x̃ − T_ε(x̃)|² > |x − T_ε(x̃)|² + |x̃ − T_ε(x)|². If T_ε is continuous, in view of Lemma 12.2 this contradicts the c-cyclical monotonicity of the optimal coupling. The conclusion is that when ε is small enough, the optimal map T_ε is discontinuous.

The densities f and g in this example are extremely smooth (in fact constant!) in the interior of their support, but they are not smooth as functions defined on ℝⁿ. To produce a similar construction with functions that are smooth on ℝⁿ, one just needs to regularize f and g a tiny bit, letting the regularization parameter vanish as ε → 0. ⊓⊔

Loeper's counterexample

Loeper discovered that in a genuine Riemannian setting the smoothness of optimal transport can be prevented by some local geometric obstructions. The next counterexample will illustrate this phenomenon.

Theorem 12.4 (A further example of discontinuous optimal transport).
There is a smooth compact Riemannian surface S, and there are smooth positive probability densities f and g on S, such that the optimal transport between μ(dx) = f(x) vol(dx) and ν(dy) = g(y) vol(dy), with a cost function equal to the square of the geodesic distance on S, is discontinuous.
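Before entering the Riemannian setting, the elementary Euclidean identity underlying the previous proof, namely that a pair of "crossing" displacements can be swapped at strictly lower quadratic cost, can be checked numerically. The sketch below uses randomly generated points of my own choosing; it is not part of the counterexample itself.

```python
import random

# The algebraic identity used in Caffarelli's argument (quadratic cost):
#   <x - xt, T(x) - T(xt)> < 0
#   <=>  |x - T(x)|^2 + |xt - T(xt)|^2  >  |x - T(xt)|^2 + |xt - T(x)|^2,
# i.e. a crossing pair is strictly cheaper to swap. Checked on random data.

def sub(a, b): return (a[0] - b[0], a[1] - b[1])
def dot(a, b): return a[0] * b[0] + a[1] * b[1]
def nsq(a):    return dot(a, a)

random.seed(0)
for _ in range(1000):
    x, xt, Tx, Txt = [(random.uniform(-2, 2), random.uniform(-2, 2))
                      for _ in range(4)]
    ip = dot(sub(x, xt), sub(Tx, Txt))
    if abs(ip) < 1e-9:
        continue            # skip near-degenerate pairs where rounding could bite
    crossing = ip < 0
    swap_cheaper = (nsq(sub(x, Tx)) + nsq(sub(xt, Txt))
                    > nsq(sub(x, Txt)) + nsq(sub(xt, Tx)))
    assert crossing == swap_cheaper
print("identity verified")
```

Expanding the squares shows the two sides differ exactly by −2⟨x − x̃, T(x) − T(x̃)⟩, which is all the check exercises.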

Remark 12.5. The obstruction has nothing to do with the lack of smoothness of the squared distance. Counterexamples of the same type exist for very smooth cost functions.

Remark 12.6. As we shall see in Theorem 12.44, the surface S in Theorem 12.4 could be replaced by any compact Riemannian manifold admitting a negative sectional curvature at some point. In that sense there is no hope for general regularity results outside the world of nonnegative sectional curvature.

[Fig. 12.2. Principle behind Loeper's counterexample. This is the surface S, immersed in ℝ³, "viewed from above". By symmetry, O has to stay in place. Because most of the initial mass is close to A_+ and A_−, and most of the final mass is close to B_+ and B_−, at least some mass has to move from one of the A-balls to one of the B-balls. But then, because of the modified (negative curvature) Pythagoras inequality, it is more efficient to replace the transport scheme (A → B, O → O) by (A → O, O → B).]

Proof of Theorem 12.4. Let S be a compact surface in ℝ³ with the following properties: (a) S is invariant under the symmetries x → −x, y → −y; (b) S crosses the axis (x, y) = (0, 0) at exactly two points, namely O = (0, 0, 0) and O′; (c) S coincides in an open ball B(O, r) with the "horse saddle" (z = x² − y²). (Think of S as a small piece of the horse saddle which has been completed into a closed surface.)
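On the saddle piece just introduced, a right angle at O satisfies a reversed Pythagoras inequality, which is the quantitative heart of the argument below. This can be verified numerically without computing geodesics exactly, by sandwiching the intrinsic distances between a Euclidean chord (a lower bound for geodesic distance) and the arc length of a curve drawn on the surface (an upper bound). The sample point s = 0.3 and the discretization are my own choices; this is an illustration, not part of the proof.

```python
import math

# Reversed Pythagoras on the saddle z = x^2 - y^2. With A = (s, 0, s^2) and
# B = (0, s, -s^2):
#   d(O,A) <= arc length of t -> (t, 0, t^2)   (a curve on S through O and A),
#   d(O,B) <= arc length of t -> (0, t, -t^2)  (same length by symmetry),
#   d(A,B) >= |A - B|   (intrinsic distance dominates the Euclidean chord),
# so chord^2 > L_A^2 + L_B^2 implies d(O,A)^2 + d(O,B)^2 < d(A,B)^2.

def arc_length(s, n=10000):
    # length of t -> (t, 0, t^2) for t in [0, s], by the trapezoid rule
    h = s / n
    f = lambda t: math.sqrt(1.0 + 4.0 * t * t)
    return h * (0.5 * f(0) + sum(f(i * h) for i in range(1, n)) + 0.5 * f(s))

s = 0.3
LA = arc_length(s)               # upper bound for d(O, A)
LB = LA                          # the two curves are isometric copies
chord2 = 2 * s**2 + 4 * s**4     # |A - B|^2, lower bound for d(A, B)^2

assert LA**2 + LB**2 < chord2
print(LA**2 + LB**2, "<", chord2)
```

Because both bounds go in the favorable direction, the strict inequality between the computed numbers rigorously certifies the strict inequality between the true geodesic distances at this sample point.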

Let x_0, y_0 > 0 to be determined later. Then let A_± = (± x_0, 0, x_0²) and B_± = (0, ± y_0, − y_0²); see Figure 12.2. In the sequel the symbol A_± will stand for "either A_+ or A_−", etc.

If x_0 and y_0 are small enough then the four points A_+, A_−, B_+, B_− belong to a neighborhood of O where S has (strictly) negative curvature, and the unique geodesic joining O to A_± (resp. B_±) satisfies the equation (y = 0) (resp. (x = 0)); then the lines (O, A_±) and (O, B_±) are orthogonal at O. Since we are on a negatively curved surface, Pythagoras's identity in a triangle with a square angle is modified in favor of the diagonal, so

d(O, A_±)² + d(O, B_±)² < d(A_±, B_±)².

By continuity, there is ε_0 > 0 small enough that the balls B(A_+, ε_0), B(A_−, ε_0), B(B_+, ε_0) and B(B_−, ε_0) are all disjoint and satisfy

[x ∈ B(A_+, ε_0) ∪ B(A_−, ε_0),  y ∈ B(B_+, ε_0) ∪ B(B_−, ε_0)]
   ⟹  d(O, x)² + d(O, y)² < d(x, y)².   (12.5)

Next let f and g be smooth probability densities on S, even in x and y, such that

∫_{B(A_+,ε_0) ∪ B(A_−,ε_0)} f(x) dx > 1/2;   ∫_{B(B_+,ε_0) ∪ B(B_−,ε_0)} g(y) dy > 1/2.   (12.6)

Let μ(dx) = f(x) dx, ν(dy) = g(y) dy, let T be the unique optimal transport between the measures μ and ν (for the cost function c(x, y) = d(x, y)²), and let T̃ be the optimal transport between ν and μ. (T and T̃ are inverses of each other, at least in a measure-theoretical sense.) I claim that either T or T̃ is discontinuous.

Indeed, suppose to the contrary that both T and T̃ are continuous. We shall first see that necessarily T(O) = O. Since the problem is symmetric with respect to x → −x and y → −y, and since there is uniqueness of the optimal transport, T maps O into a point that is invariant under these two transforms, that is, either O or O′. Similarly, T(O′) ∈ {O, O′}. So we have two cases to dismiss:

Case 1: T(O) = O′, T(O′) = O. Then by Lemma 12.2 the two pairs (O, O′) and (O′, O) belong to the support of the optimal plan associated to T, which trivially contradicts the cyclical monotonicity since d(O, O′)² + d(O′, O)² > d(O, O)² + d(O′, O′)² = 0.

Case 2: T(O) = O′, T(O′) = O′. Then both (O, O′) and (O′, O′) belong to the support of the optimal plan. Let U and U′ be two disjoint neighborhoods of O and O′ respectively. By swapping variables x and y we see that (O′, O) and (O′, O′) belong to the support of the optimal plan (Id, T̃)_# ν; so for any ε > 0 the map T̃ has to send a set of positive measure in B_ε(O′) into U, and also a set of positive measure in B_ε(O′) into U′. This contradicts the continuity of T̃.

So we may assume that T(O) = O. By Lemma 12.2 again, (O, O) belongs to the support of the optimal plan π. Then (12.6) implies that there is some transfer of mass from B(A_+, ε_0) ∪ B(A_−, ε_0) to B(B_+, ε_0) ∪ B(B_−, ε_0); in other words, we can find, in the support of the optimal transport, some (x, y) with x ∈ B(A_+, ε_0) ∪ B(A_−, ε_0) and y ∈ B(B_+, ε_0) ∪ B(B_−, ε_0). From the previous step we know that (O, O) also lies in that support; then by c-cyclical monotonicity,

d(x, y)² + d(O, O)² ≤ d(x, O)² + d(y, O)²;

but this contradicts (12.5). The proof is complete. ⊓⊔

Smoothness and Assumption (C)

Here as in Chapter 9, I shall say that a cost function c on X × Y satisfies Assumption (C) if for any c-convex function ψ and for any x ∈ X, the c-subdifferential (or contact set) of ψ at x, ∂_c ψ(x), is connected.

Some interesting consequences of this assumption were discussed in Chapter 9, and it was shown that even the simple cost function c(x, y) = |x − y|^p on ℝⁿ × ℝⁿ, p > 2, does not satisfy it. Now we shall see that this assumption is more or less necessary for the regularity of optimal transport. For simplicity I shall work in a compact setting; it would be easy to extend the proof to more general situations by imposing suitable conditions at infinity.

Theorem 12.7 (Smoothness needs Assumption (C)). Let X (resp. Y) be the closure of a bounded open set in a smooth Riemannian manifold M (resp. N), equipped with its volume measure. Let c : X × Y → ℝ be a continuous cost function. Assume that there are a c-convex function ψ : X → ℝ, and a point x ∈ X, such that ∂_c ψ(x)

is disconnected. Then there exist C^∞ positive probability densities f on X and g on Y such that any optimal transport map from f vol to g vol is discontinuous.

Remark 12.8. Assumption (C) involves both the geometry of Y and the local properties of c; in some sense the obstruction described in this theorem underlies both Caffarelli's and Loeper's counterexamples.

Remark 12.9. The volume measure in Theorem 12.7 can be replaced by any finite measure giving positive mass to all open sets. (In that sense the Riemannian structure plays no real role.) Moreover, with simple modifications of the proof one can weaken the assumption "∂_c ψ(x) is disconnected" into "∂_c ψ(x) is not simply connected".

Proof of Theorem 12.7. Note first that the continuity of the cost function and the compactness of X × Y imply the continuity of ψ. In the proof, the notation d will be used interchangeably for the distances on X and Y.

Let C_1 and C_2 be two disjoint connected components of ∂_c ψ(x). Since ∂_c ψ(x) is closed, C_1 and C_2 lie a positive distance apart. Let r = d(C_1, C_2)/5, and let C = {y ∈ Y; d(y, ∂_c ψ(x)) ≥ 2r}. Let y_1 ∈ C_1, y_2 ∈ C_2, B_1 = B_r(y_1), B_2 = B_r(y_2). Obviously, C is compact, and any path going from B_1 to B_2 has to go through C.

Then let K = {z ∈ X; ∃ y ∈ C; y ∈ ∂_c ψ(z)}. It is clear that K is compact: Indeed, if z_k ∈ K converges to z ∈ X, and y_k ∈ ∂_c ψ(z_k), then (up to extraction of a subsequence) y_k converges to some y ∈ C, and one can pass to the limit in the inequality ψ(z_k) + c(z_k, y_k) ≤ ψ(t) + c(t, y_k), where t ∈ X is arbitrary. Also K is not empty since X and Y are compact.

Next I claim that for any y ∈ C, for any x̄ such that y ∈ ∂_c ψ(x̄), and for any i ∈ {1, 2},

c(x, y_i) + c(x̄, y) < c(x̄, y_i) + c(x, y).   (12.7)

Indeed, y ∉ ∂_c ψ(x), so

ψ(x̄) + c(x̄, y) = inf_{x̃ ∈ X} [ψ(x̃) + c(x̃, y)] < ψ(x) + c(x, y).   (12.8)

On the other hand, y_i ∈ ∂_c ψ(x), so

ψ(x) + c(x, y_i) ≤ ψ(x̄) + c(x̄, y_i).   (12.9)

The combination of (12.8) and (12.9) implies (12.7). Next, one can reinforce (12.7) into

310 304 12 Smoothness c x, y ( ) + c ( x,y ) ≤ c ( x,y ) + c ( x, y 12.10) ) − ε ( i i 0. This follows easily from a contradiction argument based ε > for some K ∂ ψ . C , and once again the continuity of on the compactness of and c (0 ,r ) be small enough that for any ( x,y ) ∈ Then let × C , for δ ∈ K ∈{ 1 , 2 } , the inequalities i any ′ ′ ′ ′ 2 ( ≤ ) d x,x ( δ, d 2 δ , d ( y x , y x, δ 2 ≤ ) ) ≤ , d ( y,y δ ≤ ) 2 i i imply ∣ ∣ ∣ ∣ ε ε ′ ′ ∣ ∣ ∣ ∣ x,y c , ( ( c , ≤ 12.11) c ( x , ) y ) − ( ≤ x,y ( c ) ) y , x − 10 10 ∣ ∣ ∣ ∣ ε ε ′ ′ ′ ′ ∣ ∣ ∣ ∣ ( c − ) ≤ y x , x, ( c c x − , y ) . ) , c ( x , y ≤ ) y ( i i i i 10 10 δ K L = { et ∈X ; d ( x,K ) ≤ δ } . From the assumptions on X and Y , x B ( x has positive volume, so we can fix a smooth positive probability ) δ on X such that the measure μ = f vol satisfies density f 3 δ (12.12) ( x μ ] ≥ [ . ; f ≥ ε B > 0 on K ) 0 δ 4 Also we can construct a sequence of smooth positive probability den- sities ( g vol satisfy ) g = ν on Y such that the measures N k k k k ∈ ( ) 1 . −−−→ δ (12.13) ν + δ eakly w y y k 2 1 →∞ k 2 Let us assume the existence of a continuous optimal transport T k ν to sending μ , for any k . We shall reach a contradiction, and this k k will prove the theorem. ) [ B ν ( y large enough. Then by (12.12) From (12.13), ] ≥ 1 / 3 for k δ k 1 T , and has to send some mass from B ( x ) to B ) the transport y ( δ k δ 1 . Since B x ) to B similarly from ( y ) is connected, it has to ) ( T ) ( B x ( δ k δ δ 2 ′ ′ y meet ∈ C and x C . ∈ B y ( x ) such that T = ( x . So there are ) k δ k k k k Let x ). Without loss of generality ∈ K be such that y x ∈ ∂ ( ψ c k k k as x x we may assume that ∈ K → k → ∞ . By the second part ∞ k m := μ [ B is large of (12.12), x k )] ≥ ε 0. When vol [ B > ( x )] ( ∞ 0 ∞ δ δ enough, ν has to [ B T ( y (by (12.13) again), so ) ∪ B m ( y − ) ] > 1 δ k k δ 2 1 ) . 
y ( B ; say ) y y ) or B ( send some mass from B ) to either ( x ( B ∞ δ δ δ δ 2 1 1 ′ ′ B ∈ ) ( ) such that x ( x ( B ∈ T x y In other words, there is some . ) ∞ δ δ 1 k k Let us recapitulate: for k large enough,

d(x, x′_k) ≤ δ;   y′_k = T_k(x′_k) ∈ C;   y′_k ∈ ∂_c ψ(x̄_k);
d(x̄_k, x̄′_k) ≤ 2δ;   d(T_k(x̄′_k), y_1) ≤ δ.   (12.14)

By c-cyclical monotonicity of optimal transport and Lemma 12.2,

c(x′_k, T_k(x′_k)) + c(x̄′_k, T_k(x̄′_k)) ≤ c(x′_k, T_k(x̄′_k)) + c(x̄′_k, T_k(x′_k)).

From the inequalities in (12.14) and (12.11), we deduce

c(x, y′_k) + c(x̄_k, y_1) ≤ c(x, y_1) + c(x̄_k, y′_k) + 4ε/10.

Since y′_k ∈ C ∩ ∂_c ψ(x̄_k) and x̄_k ∈ K, this contradicts (12.10). The proof is complete. ⊓⊔

Regular cost functions

In the previous section we dealt with plainly continuous cost functions; now we shall come back to the differentiable setting used in Chapter 10 for the solution of the Monge problem. Throughout this section, X will be a closed subset of a Riemannian manifold M, and Dom(∇_x c) will stand for the set of points (x, y) ∈ X × Y such that ∇_x c(x, y) is well-defined. (A priori x should belong to the interior of X in M.) It will be assumed that ∇_x c(x, ·) is one-to-one on its domain of definition (Assumption (Twist) in Chapter 10).

Definition 12.10 (c-segment). A continuous curve (y_t)_{t ∈ [0,1]} in Y is said to be a c-segment with base x if (a) (x, y_t) ∈ Dom(∇_x c) for all t; (b) there are p_0, p_1 ∈ T_x M such that ∇_x c(x, y_t) + p_t = 0, where p_t = (1 − t) p_0 + t p_1.

In other words a c-segment is the image of a usual segment by a map (∇_x c(x, ·))⁻¹. Since (x, y_t) in the definition always lies in Dom(∇_x c), a c-segment is uniquely determined by its base point x and its endpoints y_0, y_1; I shall denote it by [y_0, y_1]_x.

Definition 12.11 (c-convexity). A set C ⊂ Y is said to be c-convex with respect to x ∈ X if for any two points y_0, y_1 ∈ C there is a c-segment (y_t)_{0 ≤ t ≤ 1} = [y_0, y_1]_x (necessarily unique) which is entirely contained in C.

More generally, a set C ⊂ Y is said to be c-convex with respect to a subset X̃ of X if C is c-convex with respect to any x ∈ X̃.

A set D ⊂ Dom(∇_x c) is said to be totally c-convex if for any two points (x, y_0) and (x, y_1) in D, there is a c-segment (y_t)_{0 ≤ t ≤ 1} with base x such that (x, y_t) ∈ D for all t.

Similar notions of strict c-convexity are defined by imposing that y_t belong to the interior of C if y_0 ≠ y_1 and t ∈ (0, 1).

Example 12.12. When X = Y = ℝⁿ and c(x, y) = −x·y (or x·y), c-convexity is just plain convexity in ℝⁿ. If X = Y = S^{n−1} and c(x, y) = d(x, y)²/2 then Dom(∇_x c(x, ·)) = S^{n−1} \ {−x} (the cut locus is the antipodal point), and ∇_x c(x, S^{n−1} \ {−x}) = B(0, π) ⊂ T_x S^{n−1} is a convex set. So for any point x, S^{n−1} minus the cut locus of x is c-convex with respect to x. An equivalent statement is that S^{n−1} × S^{n−1} \ {(x, −x)} (which is the whole of Dom(∇_x c)) is totally d²/2-convex. By abuse of language, one may say that S^{n−1} is d²/2-convex with respect to itself. The same is true of a flat torus with arbitrary sidelengths.

Example 12.13. To construct a Riemannian manifold which is not (d²/2)-convex with respect to itself, start with the Euclidean plane ℝ² (embedded in ℝ³), and from the origin draw three half-lines L_0, L_+ and L_−, directed respectively by the vectors (1, 0), (1, 1) and (1, −1). Put a high enough mountain on the pathway of L_0, without affecting L_+ or L_−, so that L_0 is minimizing only for a finite time, while L_± are still minimizing at all times. The resulting surface is not d²/2-convex with respect to the origin.

Before stating the main definition of this section, I shall now introduce some more notation.
If X is a closed subset of a Riemannian manifold M and c : X × Y → ℝ is a continuous cost function, for any x in the interior of X I shall denote by Dom′_x(∇_x c(x, ·)) the interior of Dom(∇_x c(x, ·)). Moreover I shall write Dom′(∇_x c) for the union of all sets {x} × Dom′_x(∇_x c(x, ·)), where x varies in the interior of X. For instance, if X = Y = M is a complete Riemannian manifold and c(x, y) = d(x, y)² is the square of the geodesic distance, then Dom′(∇_x c) is obtained from M × M by removing the cut locus, while Dom(∇_x c) might be slightly bigger (these facts are recalled in the Appendix).

Definition 12.14 (regular cost function). A cost c : X × Y → ℝ is said to be regular if for any x in the interior of X and for any c-convex

function ψ : X → ℝ, the set ∂_c ψ(x) ∩ Dom′_x(∇_x c(x, ·)) is c-convex with respect to x.

The cost c is said to be strictly regular if moreover, for any nontrivial c-segment (y_t) = [y_0, y_1]_x in ∂_c ψ(x) and for any t ∈ (0, 1), x is the only contact point of y_t, i.e. the only x̃ ∈ X such that y_t ∈ ∂_c ψ(x̃).

More generally, if D is a subset of Dom′(∇_x c), I shall say that c is regular in D if for any x in proj_X D and any c-convex function ψ : X → ℝ, the set ∂_c ψ(x) ∩ {y; (x, y) ∈ D} is c-convex. Equivalently, the intersection of the graph of ∂_c ψ and D should be totally c-convex. The notion of strict regularity in D is obtained by modifying this definition in an obvious way.

What does regularity mean? Let ψ be a c-convex function and let −c(·, y_0) + a_0 and −c(·, y_1) + a_1 touch the graph of ψ from below at x, take any y_t ∈ [y_0, y_1]_x, 0 < t < 1, and increase a_t from −∞ until the function −c(·, y_t) + a_t touches the graph of ψ: the regularity property means that x should be a contact point (and the only one if the cost is strictly regular).

Before going further I shall discuss several convenient reformulations of the regularity property, in terms of (i) elementary c-convex functions; (ii) subgradients; (iii) connectedness of c-subdifferentials. Assumptions (Twist) (twist condition), (locSC) (local semiconcavity) and (H∞) (adequate behavior at infinity) will be the same as in Chapter 10 (cf. p. 246).

Proposition 12.15 (Reformulation of regularity). Let X be a closed subset of a Riemannian manifold M and let Y be a Polish space. Let c : X × Y → ℝ be a continuous cost function satisfying (Twist). Then:

(i) c is regular if and only if for any (x, y_0), (x, y_1) ∈ Dom′(∇_x c), the c-segment (y_t)_{0 ≤ t ≤ 1} = [y_0, y_1]_x is well-defined, and for any t ∈ [0, 1] and any x̄ ∈ X,

−c(x̄, y_t) + c(x, y_t) ≤ max(−c(x̄, y_0) + c(x, y_0), −c(x̄, y_1) + c(x, y_1))   (12.15)

(with strict inequality if y_0 ≠ y_1, t ∈ (0, 1) and x̄ ≠ x, if c is strictly regular).

(ii) If c is a regular cost function satisfying (locSC) and (H∞), then for any c-convex ψ : X → ℝ and any x in the interior of X such that ∂_c ψ(x) ⊂ Dom′_x(∇_x c(x, ·)), one has

∇⁻ψ(x) = ∇⁻_c ψ(x),

where the c-subgradient ∇⁻_c ψ(x) is defined by

∇⁻_c ψ(x) := −∇_x c(x, ∂_c ψ(x)).

(iii) If c satisfies (locSC) and Dom′(∇_x c) is totally c-convex, then c is regular if and only if for any c-convex ψ : X → ℝ and any x in the interior of X,

∂_c ψ(x) ∩ Dom′_x(∇_x c(x, ·))   is connected.   (12.16)

Remark 12.16. Statement (i) in Proposition 12.15 means that it is sufficient to test Definition 12.14 on the particular functions

ψ_{x,y_0,y_1}(x̄) := max(−c(x̄, y_0) + c(x, y_0), −c(x̄, y_1) + c(x, y_1)),   (12.17)

where (x, y_0) and (x, y_1) belong to Dom′(∇_x c). (The functions ψ_{x,y_0,y_1} play in some sense the role of the functions x̄ → |x̄ − x_1| in usual convexity theory.) See Figure 12.3 for an illustration of the resulting "recipe".

Example 12.17. Obviously c(x, y) = −x·y, or equivalently |x − y|², is a regular cost function, but it is not strictly regular. The same is true of c(x, y) = −|x − y|², although the qualitative properties of optimal transport are quite different in this case. We shall consider other examples later.

Remark 12.18. It follows from the definition of c-subgradient and Theorem 10.24 that for any x in the interior of X and any c-convex ψ : X → ℝ,

∇⁻_c ψ(x) ⊂ ∇⁻ψ(x).

So Proposition 12.15(ii) means that, modulo issues about the domain of differentiability of c, it is equivalent to require that c satisfies the regularity property, or that ∇⁻_c ψ(x) fills the whole of the convex set ∇⁻ψ(x).

Remark 12.19. Proposition 12.15(iii) shows that, again modulo issues about the differentiability of c, the regularity property is morally equivalent to Assumption (C) in Chapter 9. See Theorem 12.42 for more.

[Fig. 12.3. Regular cost function. Take two cost-shaped mountains peaked at y_0 and y_1, let x be a pass, choose an intermediate point y_t on [y_0, y_1]_x, and grow a mountain peaked at y_t from below; the mountain should emerge at x. (Note: the shape of the mountain is the negative of the cost function.)]

Proof of Proposition 12.15. Let us start with (i). The necessity of the condition is obvious since y_0, y_1 both belong to ∂_c ψ_{x,y_0,y_1}(x). Conversely, if the condition is satisfied, let ψ be any c-convex function X → ℝ, let x belong to the interior of X and let y_0, y_1 ∈ ∂_c ψ(x). By adding a suitable constant to ψ (which will not change the subdifferential), we may assume that ψ(x) = 0. Since ψ is c-convex, ψ ≥ ψ_{x,y_0,y_1}, so for any t ∈ [0, 1] and x̄ ∈ X,

c(x̄, y_t) + ψ(x̄) ≥ c(x̄, y_t) + ψ_{x,y_0,y_1}(x̄) ≥ c(x, y_t) + ψ_{x,y_0,y_1}(x) = c(x, y_t) + ψ(x),

which shows that y_t ∈ ∂_c ψ(x), as desired.

Now let c, ψ and x be as in Statement (ii). By Theorem 10.25, −∇_x c(x, ∂_c ψ(x)) is included in the convex set ∇⁻ψ(x). Moreover, ψ is locally semiconvex (Theorem 10.26), so ∇⁻ψ(x) is the convex hull of the cluster points of ∇ψ(x̃), as x̃ → x. (This comes from a localization argument and a similar result for convex functions, recall Remark 10.51.)

It follows that any extremal point p of ∇⁻ψ(x) is the limit of ∇ψ(x_k) for some sequence x_k → x. Then let y_k ∈ ∂_c ψ(x_k) (these exist by Theorem 10.24 and form a compact set). By Theorem 10.25, ∇_x c is well-defined at (x_k, y_k) and ∇_x c(x_k, y_k) = −∇ψ(x_k). Up to extraction of a subsequence, we have y_k → y ∈ ∂_c ψ(x); in particular (x, y) lies in the domain of ∇_x c and −∇_x c(x, y) = p. The conclusion is that

  E(∇⁻ψ(x)) ⊂ −∇_x c(x, ∂_c ψ(x)) ⊂ ∇⁻ψ(x),

where E stands for "extremal points". In particular, −∇_x c(x, ∂_c ψ(x)) is convex if and only if it coincides with ∇⁻ψ(x). Part (ii) of the Proposition follows easily.

It remains to prove (iii). Whenever (x, y_0) and (x, y_1) belong to Dom'(∇_x c), let ψ = ψ_{x,y_0,y_1} and let y_t be defined by (12.17). Both y_0 and y_1 belong to ∂_c ψ(x). If c is regular, then y_0 and y_1 can be connected by a c-segment with base x, which lies inside ∂_c ψ(x) ∩ Dom(∇_x c(x, ·)). This proves (12.16).

Conversely, let us assume that (12.16) holds true, and prove that c is regular. If x, y_0, y_1 are as above, the c-segment [y_0, y_1]_x is well-defined by (a), so by part (i) of the Proposition we just have to prove the c-convexity of ∂_c ψ_{x,y_0,y_1}(x).

Let h_0(x̃) = −c(x̃, y_0) + c(x, y_0) and h_1(x̃) = −c(x̃, y_1) + c(x, y_1). By the twist condition, ∇h_0(x) ≠ ∇h_1(x). Since h_0 and h_1 are semiconvex, we can apply the nonsmooth implicit function theorem (Corollary 10.52) to conclude that the equation (h_0 = h_1) defines an (n−1)-dimensional Lipschitz graph G in the neighborhood of x. This graph is a level set of ψ = ψ_{x,y_0,y_1}, so ∇⁻ψ(x) is included in the orthogonal of G, which is the line directed by ∇h_0(x) − ∇h_1(x). In particular, ∇⁻ψ(x) is a one-dimensional set.

Let S = ∂_c ψ(x) ∩ Dom(∇_x c(x, ·)). By (b), S is connected, so S' = −∇_x c(x, S) is connected too, and by Theorem 10.25, S' is included in the line ∇⁻ψ(x). Thus S' is a convex line segment (closed or not) containing −∇_x c(x, y_0) and −∇_x c(x, y_1). This finishes the proof of the regularity of c. ⊓⊔

The next result will show that the regularity property of the cost function is a necessary condition for a general theory of regularity of optimal transport. In view of Remark 12.19 it is close in spirit to Theorem 12.7.
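Since c-segments are the central object in Proposition 12.15, it may help to see one computed explicitly. Below is a minimal numerical sketch of my own (not from the text), assuming NumPy, for the cost c(x,y) = √(1 + |x−y|²), whose c-exponential has a closed form; all helper names are hypothetical.

```python
import numpy as np

def grad_x_cost(x, y):
    # gradient in x of c(x, y) = sqrt(1 + |x - y|^2)
    return (x - y) / np.sqrt(1.0 + np.dot(x - y, x - y))

def cexp(x, p):
    # the unique y with grad_x c(x, y) + p = 0; defined for |p| < 1
    return x + p / np.sqrt(1.0 - np.dot(p, p))

def c_segment(x, y0, y1, t):
    # (y_t) = image under c-exp_x of the straight segment [p0, p1],
    # where p_i = -grad_x c(x, y_i)
    p0, p1 = -grad_x_cost(x, y0), -grad_x_cost(x, y1)
    return cexp(x, (1.0 - t) * p0 + t * p1)

x = np.zeros(2)
y0, y1 = np.array([1.0, 0.0]), np.array([0.0, 1.0])
mid = c_segment(x, y0, y1, 0.5)
# the endpoints are recovered exactly, but the c-midpoint is NOT
# the Euclidean midpoint (0.5, 0.5): c-segments are genuinely curved
```

The same two functions specialize to straight lines for the quadratic cost, for which c-exp_x(p) = x + p.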

Theorem 12.20 (Nonregularity implies nondensity of differentiable c-convex functions). Let c : X × Y → ℝ satisfy (Twist), (locSC) and (H∞). Let C be a totally c-convex set contained in Dom'(∇_x c), such that (a) c is not regular in C; and (b) for any (x, y_0), (x, y_1) in C, ∂_c ψ_{x,y_0,y_1}(x) ⊂ Dom(∇_x c(x, ·)). Then for some (x, y_0), (x, y_1) in C the c-convex function ψ = ψ_{x,y_0,y_1} cannot be the locally uniform limit of differentiable c-convex functions ψ_k.

The short version of Theorem 12.20 is that if c is not regular, then differentiable c-convex functions are not dense, for the topology of local uniform convergence, in the set of all c-convex functions. What is happening here can be explained informally as follows. Take a function such as the one pictured in Figure 12.3, and try to tame the singularity at the "saddle" by letting mountains grow from below and gently raise the pass. To do this without creating a new singularity, you would like to be able to touch only the pass. Without the regularity assumption, you might not be able to do so.

Before proving Theorem 12.20, let us see how it can be used to prove a nonsmoothness result of the same kind as Theorem 12.7. Let x, y_0, y_1 be as in the conclusion of Theorem 12.20. We may partition X into two sets X_0 and X_1, such that y_i ∈ ∂_c ψ(x) for each x ∈ X_i (i = 0, 1). Fix an arbitrary smooth positive probability density f on X, let μ = f vol, a_i = μ[X_i], and ν = a_0 δ_{y_0} + a_1 δ_{y_1}. Let T be the transport map defined by T(X_i) = {y_i}; by construction the associated transport plan π has its support included in ∂_c ψ, so π is an optimal transport plan and (ψ, ψ^c) is optimal in the dual Kantorovich problem associated with (μ, ν). By Remark 10.30, ψ is the unique optimal potential, up to an additive constant, which can be fixed by requiring ψ(x_0) = 0 for some x_0 ∈ X.

Then let ν_k := f_k vol, k ∈ ℕ, be a sequence of smooth probability densities such that ν_k converges weakly to ν. For any k let ψ_k be a c-convex function, optimal in the dual Kantorovich problem associated with (μ, ν_k). Without loss of generality we can assume ψ_k(x_0) = 0. Since the functions ψ_k are uniformly continuous, we can extract a subsequence converging to some function ψ̃. By passing to the limit in both sides of the Kantorovich problem, we deduce that ψ̃ is optimal in the dual Kantorovich problem, so ψ̃ = ψ. By Theorem 12.20, the ψ_k cannot all be differentiable. So we have proven the following corollary:

Corollary 12.21 (Nonsmoothness of the Kantorovich potential). With the same assumptions as in Theorem 12.20, if Y is a

closed subset of a Riemannian manifold, then there are smooth positive probability densities f on X and g on Y such that the associated Kantorovich potential ψ is not differentiable.

Now let us prove Theorem 12.20. The following lemma will be useful.

Lemma 12.22. Let U be an open set of a Riemannian manifold, and let (ψ_k)_{k∈ℕ} be a sequence of semiconvex functions converging uniformly to ψ on U. Let x̄ ∈ U and p ∈ ∇⁻ψ(x̄). Then there exist sequences x_k → x̄ and p_k ∈ ∇⁻ψ_k(x_k) such that p_k → p.

Proof of Lemma 12.22. Since this statement is local, we may pretend that U is a small open ball in ℝ^n, centered at x̄. By adding a well-chosen quadratic function we may also pretend that ψ is convex. Let δ > 0 and

  ψ̃_k(x) = ψ_k(x) − ψ(x̄) + δ |x − x̄|²/2 − p · (x − x̄).

Clearly ψ̃_k converges uniformly to ψ̃(x) = ψ(x) − ψ(x̄) + δ |x − x̄|²/2 − p · (x − x̄). Let x_k ∈ U be a point where ψ̃_k achieves its minimum. Since ψ̃ has a unique minimum at x̄, by uniform convergence x_k approaches x̄; in particular x_k belongs to U for k large enough. Since ψ̃_k has a minimum at x_k, necessarily 0 ∈ ∇⁻ψ̃_k(x_k), or equivalently p_k := p − δ (x_k − x̄) ∈ ∇⁻ψ_k(x_k); so p_k → p, which proves the result. ⊓⊔

Proof of Theorem 12.20. Since c is not regular in C, there are (x, y_0) and (x, y_1) in C such that ∂_c ψ(x) ∩ C is not c-convex, where ψ = ψ_{x,y_0,y_1}. Assume that ψ is the limit of differentiable c-convex functions ψ_k. Let p ∈ ∇⁻ψ(x). By Lemma 12.22 there are sequences x_k → x and p_k → p such that p_k ∈ ∇⁻ψ_k(x_k), i.e. p_k = ∇ψ_k(x_k). Since ∂_c ψ_k(x_k) is nonempty (Proposition 10.24), in fact −∇_x c(x_k, ∂_c ψ_k(x_k)) = {p_k}; in particular ∂_c ψ_k(x_k) contains a single point, say y_k. By Proposition 10.24 again, y_k stays in a compact set, so we may assume y_k → y. The pointwise convergence of ψ_k to ψ implies y ∈ ∂_c ψ(x).

Since c is semiconcave, ∇_x c is continuous on its domain of definition, so

  −∇_x c(x, y) = lim p_k = p.

So −∇_x c(x, ∂_c ψ(x)) contains the whole of ∇⁻ψ(x); combining this with Remark 12.18, we conclude that ∂_c ψ(x) is c-convex, in contradiction with our assumption. ⊓⊔

Now that we are convinced of the importance of the regularity condition, the question naturally arises whether it is possible to translate it into analytical terms, and to check it in practice. The next section will bring a partial answer.
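Before moving on: the construction in the proof of Lemma 12.22 (minimize ψ_k plus a small quadratic penalty minus a linear term) can be played out numerically in one dimension. A sketch of my own, assuming NumPy; the choice ψ_k(x) = √(x² + 1/k²), converging uniformly to |x| (whose subgradient at 0 is all of [−1, 1]), is a hypothetical example, not from the text.

```python
import numpy as np

def approximate_subgradient(p, k=100, delta=0.1):
    # Lemma 12.22's construction with x_bar = 0: minimize the perturbed function
    # psi_k(x) + delta*|x|^2/2 - p*x on a fine grid, psi_k(x) = sqrt(x^2 + 1/k^2).
    xs = np.linspace(-2.0, 2.0, 400001)
    tilde = np.sqrt(xs**2 + 1.0 / k**2) + 0.5 * delta * xs**2 - p * xs
    xk = xs[np.argmin(tilde)]        # minimizer x_k of the perturbed function
    pk = p - delta * xk              # then p_k = p - delta*(x_k - 0) is grad psi_k(x_k)
    return xk, pk                    # (up to grid error)

xk, pk = approximate_subgradient(0.5)
# xk is close to 0 and pk close to the chosen subgradient p = 0.5
```

Re-running with any p in (−1, 1) recovers that subgradient, illustrating how the whole of ∇⁻ψ(0) is reached as a limit of genuine gradients.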

The Ma–Trudinger–Wang condition

Ma, Trudinger and X.-J. Wang discovered a differential condition on the cost function which plays a key role in smoothness estimates, but in the end turns out to be a local version of the regularity property. Before explaining this condition, I shall introduce a new key assumption, which will be called the "strong twist" condition. (Recall that the "plain" twist condition is the injectivity of ∇_x c(x, ·) on its domain of definition.)

Let X, Y be closed sets in Riemannian manifolds M, N respectively, and as before let Dom'(∇_x c) stand for the set of all (x, y) ∈ X × Y such that x lies in the interior of X and y in the interior of Dom(∇_x c(x, ·)). It will be said that c satisfies the strong twist condition if

(STwist)  Dom'(∇_x c) is an open set on which c is smooth, ∇_x c is one-to-one, and the mixed Hessian ∇²_{x,y} c is nonsingular.

Remark 12.23. The invertibility of ∇²_{x,y} c implies the local injectivity of ∇_x c(x, ·), by the implicit function theorem; but alone it does not a priori guarantee the global injectivity.

Remark 12.24. One can refine Proposition 10.15 to show that cost functions deriving from a well-behaved Lagrangian do satisfy the strong twist condition. In the Appendix I shall give more details for the important particular case of the squared geodesic distance.

Remark 12.25. One should think of ∇²_{x,y} c as a bilinear form on T_x M × T_y N: it takes a pair of tangent vectors (ξ, η) ∈ T_x M × T_y N and gives back a number, (∇²_{x,y} c(x,y)) · (ξ, η) = ⟨∇²_{x,y} c(x,y) · ξ, η⟩. It will play the role of a Riemannian metric (or rather the negative of a Riemannian metric), except that ξ and η do not necessarily belong to the same tangent space!

The Ma–Trudinger–Wang condition is not so simple and involves fourth-order derivatives of the cost. To write it in an unambiguous way, it will be convenient to use coordinates, with some care. If x̄ and ȳ are given points, let us introduce geodesic coordinates x_1, …, x_n and y_1, …, y_n in the neighborhoods of x̄ and ȳ respectively. (This means that one chooses Euclidean coordinates in, say, T_{x̄} M, and then parameterizes a point x̃ in the neighborhood of x̄ by the coordinates of the vector ξ such that x̃ = exp_{x̄}(ξ).) The technical advantage of using geodesic coordinates is that geodesic paths starting from x̄ or ȳ will be straight

curves; in particular the Hessian can be computed by just differentiating twice with respect to the coordinate variables. (This advantage is nonessential, as we shall see.)

If u is a function of x, then u_j = ∂_j u = ∂u/∂x_j will stand for the partial derivative of u with respect to x_j. Indices corresponding to derivation in y will be written after a comma; so if a(x, y) is a function of x and y, then a_{i,j} stands for ∂²a/∂x_i ∂y_j. Sometimes I shall use the convention of summation over repeated indices (a_k b^k = Σ_k a_k b^k, etc.), and often the arguments x and y will be implicit.

As noted before, the matrix (c_{i,j}), defined by −c_{i,j} ξ^i η^j = ⟨∇²_{x,y} c · ξ, η⟩, will play the role of a Riemannian metric; in agreement with classical conventions of Riemannian geometry, I shall denote by (c^{i,j}) the coordinates of the inverse of (c_{i,j}), and sometimes raise and lower indices according to the rules ξ^i = −c^{i,j} ξ_j, ξ_i = −c_{i,j} ξ^j, etc. In this case the (ξ_i) are the coordinates of a 1-form (an element of (T_x M)*), while the (ξ^j) are the coordinates of a tangent vector in T_y M. (I shall try to be clear enough to avoid devastating confusions with the operation of the metric g.)

Definition 12.26 (c-second fundamental form). Let c : X × Y → ℝ satisfy (STwist). Let Ω ⊂ Y be open (in the ambient manifold), with C² boundary contained in the interior of Y. Let (x, y) ∈ Dom'(∇_x c), with y ∈ ∂Ω, and let n be the outward unit normal vector to ∂Ω, defined close to y. Let (n^i) be defined by ⟨n, ξ⟩ = Σ n^i ξ_i (∀ξ ∈ T_y M). Define the quadratic form II_c(x, y) on T_y ∂Ω by the formula

  II_c(x,y)(ξ) = Σ_{ijkℓ} ( ∂_j n_i − c^{k,ℓ} c_{ij,k} n_ℓ ) ξ^i ξ^j = Σ_{ijkℓ} c_{i,k} ∂_j ( c^{k,ℓ} n_ℓ ) ξ^i ξ^j.        (12.18)

Definition 12.27 (Ma–Trudinger–Wang tensor, or c-curvature operator). Let c : X × Y → ℝ satisfy (STwist). For any (x, y) ∈ Dom'(∇_x c), define a quadrilinear form S_c(x,y) on the space of bivectors (ξ, η) ∈ T_x M × T_y N satisfying

  ⟨∇²_{x,y} c(x,y) · ξ, η⟩ = 0        (12.19)

by the formula

  S_c(x,y)(ξ, η) = (3/2) Σ_{ijkℓrs} ( c_{ij,r} c^{r,s} c_{s,kℓ} − c_{ij,kℓ} ) ξ^i ξ^j η^k η^ℓ.        (12.20)
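To see the mechanics of formula (12.20) in the simplest possible case, here is a worked check of my own (not in the text) for the quadratic cost on ℝ^n, for which all the third- and fourth-order coefficients vanish:

```latex
% Quadratic cost c(x,y) = |x-y|^2/2 on R^n, in Cartesian (hence geodesic) coordinates:
c_{i,j} \;=\; \frac{\partial^2 c}{\partial x_i\,\partial y_j} \;=\; -\,\delta_{ij},
\qquad c_{ij,r} = 0, \qquad c_{ij,k\ell} = 0,
% so every summand of (12.20) vanishes:
S_c(x,y)(\xi,\eta)
 \;=\; \frac{3}{2}\sum_{ijk\ell rs}
 \bigl( c_{ij,r}\,c^{r,s}\,c_{s,k\ell} - c_{ij,k\ell} \bigr)\,
 \xi^i \xi^j \eta^k \eta^\ell \;=\; 0 .
```

This is consistent with Loeper's identity below: for the squared distance, S_c at the diagonal reduces to sectional curvature, which vanishes in flat space.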

Remark 12.28. Both for practical and technical purposes, it is often better to use another equivalent definition of S_c; see formula (12.21) below.

To understand where S_c comes from, and to compute it in practice, the key is the change of variables p = −∇_x c(x, y). This leads to the following natural definition.

Definition 12.29 (c-exponential). Let c be a cost function satisfying (Twist); define the c-exponential map on the image of −∇_x c by the formula

  c-exp_x(p) = (∇_x c)^{-1}(x, −p).

In other words, c-exp_x(p) is the unique y such that ∇_x c(x, y) + p = 0. When c(x,y) = d(x,y)²/2 on X = Y = M, a complete Riemannian manifold, one recovers the usual exponential map of Riemannian geometry, whose domain of definition can be extended to the whole of TM. More generally, if c comes from a time-independent Lagrangian, under suitable assumptions the c-exponential can be defined as the solution at time 1 of the Lagrangian system starting at x with initial velocity v, in such a way that ∇_v L(x, v) = p.

Then, with the notation p = −∇_x c(x, y), we have the following reformulation of the c-curvature operator:

  S_c(x,y)(ξ, η) = −(3/2) (d²/ds²) (d²/dt²) c( exp_x(tξ), c-exp_x(p + sη) ) |_{t=s=0},        (12.21)

where η in the right-hand side is an abuse of notation for the vector at x obtained from η by the operation of −∇²_{x,y} c(x,y) (viewed as an operator T_y M → T_x M). In other words, S_c is obtained by differentiating the cost function c(x, y) twice with respect to x and twice with respect to p, not with respect to y. Getting formula (12.20) from (12.21) is just an exercise, albeit complicated, in classical differential calculus; it involves the differentiation formula for the matrix inverse: d(M^{-1}) · H = −M^{-1} H M^{-1}.

Particular Case 12.30 (Loeper's identity). If X = Y = M is a smooth complete Riemannian manifold, c(x,y) = d(x,y)²/2, and ξ, η are two unit orthogonal vectors in T_x M, then

  S_c(x,x)(ξ, η) = σ_x(P),        (12.22)

where σ_x(P) is the sectional curvature of M at x along the plane P generated by ξ and η. (See Chapter 14 for basic reminders about sectional curvature.)
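Formula (12.21) also lends itself to direct numerical evaluation by finite differences. Below is a rough sketch of my own (helper names hypothetical, assuming NumPy) for the cost c(x,y) = √(1 + |x−y|²), whose c-exponential can be written in closed form; for the particular orthogonal configuration chosen, a hand computation gives S_c = 3√2 > 0.

```python
import numpy as np

def mtw_fd(cost, cexp, x, p, xi, eta, h=0.01):
    """Crude finite-difference evaluation of formula (12.21):
    -(3/2) d^2/dt^2 d^2/ds^2 cost(x + t*xi, cexp(x, p + s*eta)) at t = s = 0,
    where p = -grad_x cost(x, y) and eta is a direction in p-space chosen
    orthogonal to xi (the form taken by (12.19) after the change of variables)."""
    F = lambda t, s: cost(x + t * xi, cexp(x, p + s * eta))
    d2s = lambda t: (F(t, h) - 2.0 * F(t, 0.0) + F(t, -h)) / h**2
    return -1.5 * (d2s(h) - 2.0 * d2s(0.0) + d2s(-h)) / h**2

def cost_sqrt(x, y):
    return np.sqrt(1.0 + np.dot(x - y, x - y))

def cexp_sqrt(x, p):
    # unique y with grad_x c(x, y) + p = 0 for this cost; defined for |p| < 1
    return x + p / np.sqrt(1.0 - np.dot(p, p))

x, y = np.zeros(2), np.array([1.0, 0.0])
p = (y - x) / np.sqrt(1.0 + np.dot(y - x, y - x))   # = -grad_x c(x, y)
S = mtw_fd(cost_sqrt, cexp_sqrt, x, p,
           xi=np.array([0.0, 1.0]), eta=np.array([1.0, 0.0]))
# S is approximately 3*sqrt(2), in particular positive
```

The step size h and the test configuration are of course arbitrary; the point is only that (12.21) reduces the fourth-order tensor to nested one-dimensional second differences.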

To establish (12.22), first note that for any fixed small t, the geodesic curve joining x to exp_x(tξ) is orthogonal to s ↦ exp_x(sη) at s = 0, so (d/ds)|_{s=0} F(t, s) = 0, where F(t,s) = d(exp_x(tξ), exp_x(sη))²/2. Similarly, (d/dt)|_{t=0} F(t, s) = 0 for any fixed s. So the Taylor expansion of F at (0, 0) takes the form

  d(exp_x(tξ), exp_x(sη))²/2 = (t² + s²)/2 + A t⁴ + B s⁴ + C t³s + D t²s² + E ts³ + O(t⁶ + s⁶).

The vanishing of these first derivatives forces C = E = 0; and since F(t,0) and F(0,s) are quadratic functions of t and s respectively, also A = B = 0. So S_c(x,x)(ξ,η) = −6D, and D is necessarily the coefficient of t⁴ in the expansion of d(exp_x(tξ), exp_x(tη))²/2. Then the result follows from formula (14.1) in Chapter 14.

Remark 12.31. Formula (12.21) shows that S_c(x,y) is intrinsic, in the sense that it is independent of any choice of geodesic coordinates (this was not obvious from Definition 12.27). However, the geometric interpretation of S_c is related to the regularity property, which is independent of the choice of Riemannian structure; so we may suspect that the choice to work in geodesic coordinates is nonessential. It turns out indeed that Definition 12.27 is independent of any choice of coordinates, geodesic or not: we may apply formula (12.20) by just letting c_{i,j} = ∂²c(x,y)/∂x_i ∂y_j, with p_i = −c_i(x,y) (partial derivative of c with respect to x_i), and replace (12.21) by

  S_c(x,y)(ξ,η) = −(3/2) (∂²/∂p²_η) (∂²/∂x²_ξ) |_{x = x̄, p = −d_x c(x̄,ȳ)} c( x, c-exp_{x̄}(p) ),        (12.23)

where the c-exponential is defined in an obvious way in terms of 1-forms (differentials) rather than tangent vectors. This requires some justification, since the second differential is not an intrinsic concept (except when the first differential vanishes) and might a priori depend on the choice of coordinates. When we differentiate (12.23) twice with respect to x, there might be an additional term, which is a combination of Γ^k_{ij}(x) ∂_k c(x, c-exp_{x̄}(p)) = −Γ^k_{ij}(x) p_k, where the Γ^k_{ij} are the Christoffel symbols. But this additional term is linear in p, so anyway it disappears when we differentiate twice with respect to p. (This argument does not need the "orthogonality" condition ∇²_{x,y} c(x,y) · (ξ,η) = 0.)

Remark 12.32. Even if it is intrinsically defined, from the point of view of Riemannian geometry S_c is not a standard curvature-type operator, for at least two reasons. First, it involves derivatives of order

greater than 2; and secondly it is nonlocal, in a strong sense. Take for instance c(x,y) = d(x,y)²/2 on a Riemannian manifold (M, g), fix x and y, and compute S_c(x,y); then a change of the Riemannian metric g can affect the value of S_c(x,y), even if the metric g is left unchanged in a neighborhood of x and y, and even if it is unchanged in a neighborhood of the geodesics joining x to y! Here we are facing the fact that the geodesic distance is a highly nonlocal notion.

Remark 12.33. The operator S_c is symmetric under the exchange of x and y, in the sense that if č(y,x) = c(x,y), then S_{č}(y,x)(η,ξ) = S_c(x,y)(ξ,η). (Here I am implicitly assuming that č also satisfies the strong twist condition.) This symmetry can be seen from (12.20) by just rearranging indices. To see it directly from (12.21), we may apply Remark 12.31 to change the geodesic coordinate system around x, and parameterize y by q = −∇_y c(x̄, y). Then we obtain the nicely symmetric expression for S_c(x̄, ȳ)(ξ, η):

  −(3/2) (∂²/∂p²_η) (∂²/∂q²_ξ) c( č-exp_{ȳ}(q), c-exp_{x̄}(p) ) |_{p = −d_x c(x̄,ȳ), q = −d_y c(x̄,ȳ)}.        (12.24)

(Caution: ξ is tangent at x̄ while q lives at ȳ, so differentiating with respect to q in the direction ξ means in fact differentiating in the direction −∇²_{x,y} c · ξ; recall (12.21). The same for p and η.)

In the sequel, I shall stick to the same conventions as in the beginning of the chapter, so I shall use a fixed Riemannian metric and use Hessians and gradients with respect to this metric, rather than differentials of order 1 and 2.

Theorems 12.35 and 12.36 below will provide necessary and sufficient differential conditions for c-convexity and regularity; one should note carefully that these conditions are valid only inside Dom'(∇_x c). I shall use the notation č(y,x) = c(x,y), Ď = {(y,x); (x,y) ∈ D}.

I shall also be led to introduce an additional assumption:

(Cut^{n−1})  For any x in the interior of proj_X(D), the "cut locus" cut_D(x) = proj_Y(D) \ Dom(∇_x c(x, ·)) locally has finite (n−1)-dimensional Hausdorff measure.

Example 12.34. Condition (Cut^{n−1}) trivially holds when c satisfies the strong twist condition and D is totally c-convex and a product. It also holds when c is the squared geodesic distance on a Riemannian manifold M, and D = (M × M) \ cut(M) is the domain of smoothness of c (see the Appendix).

Now come the main results of this section:

Theorem 12.35 (Differential formulation of c-convexity). Let c : X × Y → ℝ be a cost function satisfying (STwist). Let x ∈ X, and let C be a connected open subset of Y with C² boundary, such that {x} × C̄ ⊂ Dom'(∇_x c). Then C is c-convex with respect to x if and only if II_c(x,y) ≥ 0 for all y ∈ ∂C. If moreover II_c(x,y) > 0 for all y ∈ ∂C, then C is strictly c-convex with respect to x.

Theorem 12.36 (Differential formulation of regularity). Let c : X × Y → ℝ be a cost function satisfying (STwist), and let D be a totally c-convex open subset of Dom'(∇_x c). Then:

(i) If c is regular in D, then S_c(x,y) ≥ 0 for all (x,y) ∈ D.

(ii) Conversely, if S_c(x,y) ≥ 0 (resp. > 0) for all (x,y) ∈ D, č satisfies (STwist), Ď is totally č-convex, and č satisfies (Cut^{n−1}) on Ď, then c is regular (resp. strictly regular) in D.

Remark 12.37. For Theorem 12.36 to hold true, it is important that no condition be imposed on the sign of S_c(x,y) · (ξ,η) (or, more rigorously, of the expression in (12.20)) when ξ and η are not "orthogonal" to each other. For instance, c(x,y) = √(1 + |x−y|²) is regular, and still S_c(x,y) · (ξ,ξ) < 0 for ξ = x − y and |x−y| large enough.

Remark 12.38. A corollary of Theorem 12.36 and Remark 12.33 is that the regularity of c is more or less equivalent to the regularity of č.

Proof of Theorem 12.35. Let us start with some reminders about classical convexity in Euclidean space. If Ω is an open set in ℝ^n with C² boundary and x ∈ ∂Ω, let T_x ∂Ω stand for the tangent space to ∂Ω at x, and n for the outward normal on ∂Ω (extended smoothly in a neighborhood of x). The second fundamental form of Ω, evaluated at x, is defined on T_x ∂Ω by II(x)(ξ) = Σ_{ij} ∂_j n_i ξ^i ξ^j. A defining function for Ω at x is a function Φ, defined in a neighborhood of x, such that Φ < 0 in Ω, Φ > 0 outside Ω, and |∇Φ| > 0 on ∂Ω. Such a function always exists (locally); for instance one can choose Φ(x) = ± d(x, ∂Ω), with + sign when x is outside Ω and − sign when x is in Ω. (In that case ∇Φ is the outward normal on ∂Ω.) If Φ is a defining function, then n = ∇Φ/|∇Φ| on ∂Ω, and for all ξ ⊥ n,

  Σ_{ij} ∂_j n_i ξ^i ξ^j = Σ_{ij} ( Φ_{ij}/|∇Φ| − Φ_{ik} Φ_k Φ_j/|∇Φ|³ ) ξ^i ξ^j = Σ_{ij} Φ_{ij} ξ^i ξ^j / |∇Φ|.

So the condition II(x) ≥ 0 on ∂Ω is equivalent to the condition that ∇²Φ(x) be nonnegative when evaluated on tangent vectors. In that case, we can always choose a defining function of the form Φ = g(± d(x, ∂Ω)), with g(0) = 0, g'(0) = 1 and g strictly convex, so that ∇²Φ(x) is nonnegative in all directions. With a bit more work, one can show that Ω is convex if and only if it is connected and its second fundamental form is nonnegative on the whole of ∂Ω. Moreover, if the second fundamental form is positive, then Ω is strictly convex.

Now let C be as in the assumptions of Theorem 12.35 and let Ω = −∇_x c(x, C) ⊂ T_x M. Since ∇_x c(x, ·) is smooth, one-to-one with nonsingular differential, ∂Ω is smooth and coincides with −∇_x c(x, ∂C). Since C is connected, so is Ω; so to prove the convexity of Ω we just have to worry about the sign of its second fundamental form. Let y ∈ ∂C be fixed, and let Φ = Φ(y) be a defining function for C in a neighborhood of y. Then Ψ = Φ ∘ (−∇_x c(x, ·))^{-1} is a defining function for Ω. By direct computation,

  Ψ^{ij} = c^{r,i} c^{s,j} ( Φ_{rs} − c^{k,ℓ} c_{rs,k} Φ_ℓ ).

(The derivation indices are raised because these are derivatives with respect to the p variables.) Let ξ = (ξ^j) ∈ T_y M be tangent to ∂C; the bilinear form ∇²_{x,y} c identifies ξ with an element of (T_x M)* whose coordinates are denoted by (ξ_i) = (−c_{i,j} ξ^j). Then

  Ψ^{ij} ξ_i ξ_j = ( Φ_{rs} − c^{k,ℓ} c_{rs,k} Φ_ℓ ) ξ^r ξ^s.

Since n_s ξ^s = 0, the nonnegativity of this expression is equivalent to the nonnegativity of

  Σ ( ∂_s n_r − c^{k,ℓ} c_{rs,k} n_ℓ ) ξ^r ξ^s

for all tangent vectors ξ. This establishes the first part of Theorem 12.35, and at the same time justifies formula (12.18). (To prove the equivalence between the two expressions in (12.18), use the temporary notation n^k = −c^{k,ℓ} n_ℓ and note that ∂_j (c_{i,k} n^k) = c_{ij,k} n^k + c_{i,k} ∂_j n^k = −∂_j n_i.) The statement about strict convexity follows the same lines. ⊓⊔

In the proof of Theorem 12.36 I shall use the following technical lemma:

Lemma 12.39. If c satisfies (STwist) and (Cut^{n−1}), then it satisfies the following property of "transversal approximation":

(TA)  For any x in the interior of proj_X(D), any C² path (y_t)_{0≤t≤1} drawn in proj_Y(D) ∩ Dom(∇_x c(x, ·)) can be approximated in C² topology by a path (ŷ_t)_{0≤t≤1} such that { t ∈ (0,1); ŷ_t ∉ Dom'(∇_x c(x, ·)) } is discrete.
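The classical Euclidean identity invoked at the beginning of the proof of Theorem 12.35 — the boundary second fundamental form computed either from the normal field n or from a defining function Φ — is easy to confirm numerically. A small sketch of my own for the unit disk (whose boundary curvature is 1), assuming NumPy; all helper names are hypothetical.

```python
import numpy as np

def grad(f, z, h=1e-5):
    # central-difference gradient of a scalar function on R^2
    g = np.zeros(2)
    for i in range(2):
        e = np.zeros(2); e[i] = h
        g[i] = (f(z + e) - f(z - e)) / (2.0 * h)
    return g

def hess(f, z, h=1e-4):
    # central-difference Hessian
    H = np.zeros((2, 2))
    for i in range(2):
        for j in range(2):
            ei = np.zeros(2); ei[i] = h
            ej = np.zeros(2); ej[j] = h
            H[i, j] = (f(z + ei + ej) - f(z + ei - ej)
                       - f(z - ei + ej) + f(z - ei - ej)) / (4.0 * h * h)
    return H

phi = lambda z: np.linalg.norm(z) - 1.0          # defining function of the unit disk
n_field = lambda z: grad(phi, z) / np.linalg.norm(grad(phi, z))

x = np.array([np.cos(0.3), np.sin(0.3)])         # a boundary point
xi = np.array([-x[1], x[0]])                     # unit tangent vector at x

# II(x)(xi) from the normal field: sum_ij (d_j n_i) xi^i xi^j
step = 1e-5
Jn = np.column_stack([(n_field(x + step * e) - n_field(x - step * e)) / (2 * step)
                      for e in np.eye(2)])
II_normal = xi @ Jn @ xi

# the same quantity from the defining function: sum_ij Phi_ij xi^i xi^j / |grad Phi|
II_hess = xi @ hess(phi, x) @ xi / np.linalg.norm(grad(phi, x))
# both values are (up to discretization error) the curvature 1 of the unit circle
```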

Lemma 12.39 applies, for instance, to the squared geodesic distance on a Riemannian manifold. The detailed proof of Lemma 12.39 is a bit tedious, so I shall be slightly sketchy.

Sketch of proof of Lemma 12.39. As in the statement of (Cut^{n−1}), let cut_D(x) = proj_Y(D) \ Dom'(∇_x c(x, ·)). Since cut_D(x) has empty interior, for any fixed t_0 in [0,1] we can perturb the path (y_t) in C² topology into a path (ŷ_t), in such a way that ŷ(t_0) ∉ cut_D(x). Repeating this operation finitely many times, we can ensure that ŷ(t_j) lies outside cut_D(x) for each t_j = j/2^k, where k ∈ ℕ and j ∈ {0, …, 2^k}. If k is large enough, then for each j the path ŷ can be written, on the time-interval [t_j, t_{j+1}], in some well-chosen local chart, as a straight line. Moreover, since cut_D(x) is closed, there will be a small ball B_j = B(ŷ(t_j), r_j), r_j > 0, such that on the interval [t_j − ε_j, t_j + ε_j] the path ŷ is entirely contained in B_j, while the larger ball 2B_j = B(ŷ(t_j), 2 r_j) does not meet cut_D(x). If we prove the approximation property on each interval [t_{j−1} + ε_{j−1}, t_j − ε_j], we shall get suitable paths approximating (ŷ_t) in C² on these intervals, with endpoints in B_{j−1} and B_j respectively; and we can easily "patch together" these pieces in the intervals [t_j − ε_j, t_j + ε_j], while staying within 2B_j.

All this shows that we just have to treat the case when (y_t) takes values in a small open subset U of ℝ^n and is a straight line. In these coordinates, Σ := cut_D(x) ∩ U will have finite H^{n−1} measure, with H^k standing for the k-dimensional Hausdorff measure. Without loss of generality, y_t = t e_n, where (e_1, …, e_n) is an orthonormal basis of ℝ^n and −τ < t < τ; and U is the cylinder B(0, σ) × (−τ, τ) for some σ > 0. For any z ∈ B(0, σ) ⊂ ℝ^{n−1}, let y^z_t = (z, t).

The goal is to show that, H^{n−1}(dz)-almost surely, (y^z_t) intersects Σ in at most finitely many points. To do this, one can apply the co-area formula (see the bibliographical notes) in the following form: let f : (z, t) ↦ z (defined on U); then

  H^{n−1}[Σ] ≥ ∫ H^0[ Σ ∩ f^{-1}(z) ] H^{n−1}(dz).

By assumption the left-hand side is finite, and the right-hand side is exactly ∫ #{ t; y^z_t ∈ Σ } H^{n−1}(dz); so the integrand is finite for almost all z, and in particular there is a sequence z_k → 0 such that each (y^{z_k}_t) intersects Σ at most finitely many times. ⊓⊔

Proof of Theorem 12.36. Let us first assume that the cost c is regular on D, and prove the nonnegativity of S_c.

Let (x̄, ȳ) ∈ D be given, and let p ∈ T_{x̄} M be such that ∇_x c(x̄, ȳ) + p = 0. Let ν be a unit vector in T_{x̄} M. For ε ∈ [−ε_0, ε_0] one can define a c-segment by the formula y(ε) = (∇_x c)^{-1}(x̄, −p(ε)), p(ε) = p + ε ν; let y_0 = y(−ε_0) and y_1 = y(ε_0). Further, let h_0(x) = −c(x, y_0) + c(x̄, y_0) and h_1(x) = −c(x, y_1) + c(x̄, y_1). By construction, ∇h_0(x̄) − ∇h_1(x̄) is colinear to ν and nonzero, and h_0(x̄) = h_1(x̄) = 0.

By the implicit function theorem, the equation (h_0(x) = h_1(x)) defines, close to x̄, an (n−1)-dimensional submanifold M̃, with T_{x̄} M̃ = ν^⊥. For any ξ ∈ ν^⊥ one can define, in the neighborhood of x̄, a smooth curve (γ(t))_{−τ ≤ t ≤ τ}, valued in M̃, such that γ(0) = x̄ and γ̇(0) = ξ.

Let ψ(x) = max( h_0(x), h_1(x) ). By construction ψ is c-convex, and y_0, y_1 both belong to ∂_c ψ(x̄). Since c is regular, ȳ = y(0) also belongs to ∂_c ψ(x̄), so

  ψ(x̄) + c(x̄, ȳ) ≤ ψ(γ(t)) + c(γ(t), ȳ) = (1/2) [ −c(γ(t), y_0) + c(x̄, y_0) − c(γ(t), y_1) + c(x̄, y_1) ] + c(γ(t), ȳ),

where the equality comes from h_0 = h_1 along γ ⊂ M̃. Since ψ(x̄) = 0, this can be recast as

  (1/2) [ c(γ(t), y_0) + c(γ(t), y_1) ] − c(γ(t), ȳ) ≤ (1/2) [ c(x̄, y_0) + c(x̄, y_1) ] − c(x̄, ȳ).        (12.25)

At t = 0 the two sides coincide, so the left-hand side, as a function of t, achieves its maximum at t = 0, and its second t-derivative there is nonpositive:

  (d²/dt²) |_{t=0} [ (1/2) ( c(γ(t), y_0) + c(γ(t), y_1) ) − c(γ(t), ȳ) ] ≤ 0.

This is equivalent to saying that

  (1/2) ( ⟨∇²_x c(x̄, y_0) · ξ, ξ⟩ + ⟨∇²_x c(x̄, y_1) · ξ, ξ⟩ ) ≤ ⟨∇²_x c(x̄, ȳ) · ξ, ξ⟩.

(Note: the path γ(t) is not a geodesic, but this does not matter, because the first x-derivative of the bracketed expression in (12.25) vanishes at x̄.)

Since p = (p_0 + p_1)/2 and p_1 − p_0 was along an arbitrary direction ν (orthogonal to ξ), this shows precisely that ⟨∇²_x c(x̄, y) · ξ, ξ⟩ is concave as a function of p, after the change of variables y = y(x̄, p). In particular, the expression in (12.21) is ≥ 0. To see that the expressions in (12.21)

and (12.20) are the same, one performs a direct (tedious) computation to check that if ξ is a tangent vector in the x-space and η is a tangent vector in the p-space, then

  ( ∂⁴ c / ∂x_i ∂x_j ∂p_k ∂p_ℓ ) ξ^i ξ^j η_k η_ℓ = ( c_{ij,r} c^{r,s} c_{s,kℓ} − c_{ij,kℓ} ) ξ^i ξ^j (c^{k,m} η_m)(c^{ℓ,n} η_n)
    = ( c_{ij,r} c^{r,s} c_{s,kℓ} − c_{ij,kℓ} ) ξ^i ξ^j η^k η^ℓ.

(Here η^k = −c^{k,m} η_m stands for the coordinates of a tangent vector at y, obtained from η by the change of variables p → y, and still denoted η by abuse of notation.) To conclude, one should note that the condition ξ ⊥ η, i.e. ξ^i η_i = 0, is equivalent to c_{i,j} ξ^i η^j = 0.

Next, let us consider the converse implication. Let (x̄, y_0) and (x̄, y_1) belong to D; p_0 = −∇_x c(x̄, y_0), p_1 = −∇_x c(x̄, y_1), ζ = p_1 − p_0, p_t = p_0 + t ζ, and y_t = (∇_x c)^{-1}(x̄, −p_t). By assumption, (x̄, y_t) belongs to D for all t ∈ [0,1].

For t ∈ [0,1], let h(t) = −c(x, y_t) + c(x̄, y_t), where x is another given point. Let us first assume that x ∈ Dom'(∇_y c(·, y_t)) for all t; then h is a smooth function of t ∈ (0,1), and we can compute its first and second derivatives. First,

  (ẏ_t)^i = −c^{i,r}(x̄, y_t) ζ_r =: ζ^i.

Similarly, since (y_t) defines a straight curve in p-space,

  (ÿ_t)^i = −c^{i,k} c_{k,ℓj} ( −c^{ℓ,r} ζ_r ) ( −c^{j,s} ζ_s ) = −c^{i,k} c_{k,ℓj} ζ^ℓ ζ^j.

So

  ḣ(t) = [ −c_{,j}(x, y_t) + c_{,j}(x̄, y_t) ] ζ^j = η_j ζ^j,

where η_j = −c_{,j}(x, y_t) + c_{,j}(x̄, y_t) and η^j = −c^{j,i} η_i. Next,

  ḧ(t) = [ −c_{,ij}(x, y_t) + c_{,ij}(x̄, y_t) ] ζ^i ζ^j + [ −c_{,k}(x, y_t) + c_{,k}(x̄, y_t) ] (ÿ_t)^k
       = [ −c_{,ij}(x, y_t) + c_{,ij}(x̄, y_t) − η_ℓ c^{ℓ,k} c_{k,ij} ] ζ^i ζ^j
       = [ −c_{,ij}(x, y_t) + c_{,ij}(x̄, y_t) + η^k c_{k,ij} ] ζ^i ζ^j.

Now freeze y_t and ζ, and let Φ(x) = c_{,ij}(x, y_t) ζ^i ζ^j = ⟨∇²_y c(x, y_t) · ζ, ζ⟩. This can be seen either as a function of x, or as a function of q = −∇_y c(x, y_t), viewed as a 1-form on T_{y_t} M. Computations to go back and forth between the q-space and the x-space are the same as those between the p-space and the y-space. If q and q̄ are associated to x and x̄ respectively, then q − q̄ = η, and

  [ c_{,ij}(x, y_t) − c_{,ij}(x̄, y_t) − η^k c_{k,ij} ] ζ^i ζ^j = Φ(q) − Φ(q̄) − dΦ(q̄) · (q − q̄)
    = ∫_0^1 (1−s) ∇²_q Φ( (1−s) q̄ + s q ) · (q − q̄, q − q̄) ds.

Computing ∇²_q Φ means differentiating c(x, y_t) twice with respect to y, with ∇_y c(x, y_t) + q = 0, and then twice with respect to q. According to Remark 12.33, the result is the same as when we first differentiate twice with respect to x and then twice with respect to p, so this gives −(2/3) S_c. Since q − q̄ = η, we end up with

  ḣ(t) = ( ∇²_{x,y} c(x̄, y_t) ) · (η, ζ);
  ḧ(t) = (2/3) ∫_0^1 S_c( (∇_y c)^{-1}( (1−s) q̄ + s q, y_t ), y_t ) · (η, ζ) (1−s) ds,        (12.26)

where now ∇_y c is inverted with respect to the x variable. Here I have slightly abused notation, since the vectors η and ζ do not necessarily satisfy ∇²_{x,y} c · (η, ζ) = 0; but S_c stands for the same analytic expression as in (12.20). (Note that the argument of S_c in (12.26) is always well-defined, because Ď was assumed to be totally č-convex: the č-segment [x, x̄] with base y_t is contained in Dom'(∇_y c(·, y_t)).)

The goal is to show that h achieves its maximum value at t = 0 or t = 1. Indeed, the inequality h(t) ≤ max( h(0), h(1) ) is precisely (12.15).

Let us first consider the simpler case S_c > 0. If h achieves its maximum at some t_0 ∈ (0,1), then ḣ(t_0) = 0, so ∇²_{x,y} c · (η, ζ) = 0; then S_c · (η, ζ) > 0 (unless x = x̄), and by (12.26) we have ḧ(t_0) > 0, which contradicts the fact that t_0 is a maximum. So h has a maximum at t = 0 or t = 1, which proves (12.15); and this inequality is strict unless t = 0 or t = 1 or x = x̄.

To work out the borderline case where S_c is only assumed to be nonnegative, we shall have to refine the analysis just a bit. As before, we can assume that h is smooth. Freeze ζ and let η vary in a ball. Since S_c(x,y) · (η, ζ) is a quadratic function of η, nonnegative on the hyperplane { ζ_ℓ η^ℓ = 0 }, there is a

constant C such that S_c(x,y) · (η, ζ) ≥ −C |ζ_ℓ η^ℓ| |η| = −C |∇²_{x,y} c · (ζ, η)| |η|. This constant only depends on an upper bound on the functions of (x, y) appearing in S_c, and on upper bounds on the norm of ∇²_{x,y} c and its inverse. By homogeneity,

  S_c(x,y) · (η, ζ) ≥ −C |∇²_{x,y} c · (ζ, η)| |ζ| |η|,

where the constant C is uniform when x, y_0, y_1 vary in compact domains. The norms |ζ| and |η| remain bounded as t varies in [0,1], so (12.26) implies

  ḧ(t) ≥ −C |ḣ(t)|.        (12.27)

Now let h_ε(t) = h(t) + ε (t − 1/2)^k, where ε > 0 and k ∈ 2ℕ will be chosen later. Let t_0 be such that h_ε admits a maximum at t_0. If t_0 ∈ (0,1), then ḣ_ε(t_0) = 0 and ḧ_ε(t_0) ≤ 0, so

  ḧ(t_0) = ḧ_ε(t_0) − ε k (k−1) (t_0 − 1/2)^{k−2} ≤ −ε k (k−1) (t_0 − 1/2)^{k−2};
  ḣ(t_0) = −ε k (t_0 − 1/2)^{k−1}.

Plugging these formulas back into (12.27), we deduce C |t_0 − 1/2| ≥ k − 1, which is impossible if k has been chosen greater than 1 + C/2. Thus h_ε has to reach its maximum either at t = 0 or at t = 1, i.e.

  h(t) + ε (t − 1/2)^k ≤ max( h(0), h(1) ) + ε 2^{−k}.

Letting ε → 0, we conclude again to (12.15).

To conclude the proof of the theorem, it only remains to treat the case when x does not belong to Dom'(∇_y c(·, y_t)) for all t, or equivalently when the path (y_t) is not contained in Dom'(∇_x c(x, ·)). Let us consider for instance the case S_c > 0. Thanks to Lemma 12.39, we can approximate (y_t) by a very close path (ŷ_t), in such a way that (ŷ_t) leaves Dom'(∇_x c(x, ·)) only on a discrete set of times t_j. Outside of these times, the same computations as before can be repeated with ŷ_t in place of y_t (here I am cheating a bit, since (ŷ_t) is no longer a c-segment, but it is no big deal to handle the correction terms). So h cannot achieve a maximum in (0,1), except maybe at some time t_j; and it all amounts to proving that t_j cannot be a maximum of h either. This is obvious if ḣ is continuous at t_j and ḣ(t_j) ≠ 0. It is also obvious if ḣ is discontinuous at t_j, because then, by the semiconvexity of c, necessarily ḣ(t_j^+) > ḣ(t_j^−), so t_j cannot be a maximum. Finally, if ḣ is continuous at t_j and ḣ(t_j) = 0, the same computations as before show that ḧ(t) is strictly positive when t is close to (but different from) t_j; then the continuity of ḣ implies that h is strictly convex around t_j, so it cannot have a maximum at t_j. ⊓⊔

With Theorems 12.35 and 12.36 at hand, it becomes possible to prove or disprove the regularity of certain simple cost functions. Typically, one first tries to exhaust Dom'(∇_x c(x,·)) by smooth open sets compactly included in its interior, and one checks the c-convexity of these sets by use of the c-second fundamental form; and similarly for the x variable. Then one checks the sign condition on S_c. In simple enough situations, this strategy can be worked out successfully.

Example 12.40. If f and g are C² convex functions R^n → R with |∇f| < 1 and |∇g| < 1, then c(x,y) = |x − y|² + |f(x) − g(y)|² satisfies S_c ≥ 0 on R^n × R^n, so it is regular on R^n × R^n. If ∇²f and ∇²g are positive everywhere, then S_c > 0, so c is strictly regular on R^n × R^n.

Examples 12.41. The cost functions √(1 + |x − y|²) on R^n × R^n; √(1 − |x − y|²) on B(0,1) × B(0,1) \ {|x − y| ≥ 1}; |x − y|^p on R^n × R^n \ {x = y} for 0 < p < 1; and d(x,y)² on S^{n−1} × S^{n−1} \ {y = −x}, all satisfy S_c > 0, and are therefore strictly regular on any totally c-convex subdomain (for instance B(0,1/2) × B(0,1/2) for |x − y|^p). The same is true of the singular cost functions −|x − y|^p (−2 < p < 0), −log |x − y| on R^n × R^n \ {y = x}, or −log |x − y| on S^{n−1} × S^{n−1} \ {y = ±x}. Also the limit case c(x,y) = |x − y|^{−2} satisfies S_c ≥ 0 on R^n × R^n \ {x = y}.

Theorem 12.42 (Equivalence of regularity conditions). Let M, N be Riemannian manifolds and let c : M × N → R be a locally semiconcave cost function such that c and č satisfy (STwist), c satisfies (Cut^{n−1}), and the c-exponential map extends into a continuous map TM → N. Further assume that Dom'(∇_x c) is totally c-convex and Dom'(∇_y č) is totally č-convex. Then the following three properties are equivalent:

(i) c satisfies Assumption (C);
(ii) c is regular;
(iii) c satisfies the Ma–Trudinger–Wang condition S_c ≥ 0.

Remark 12.43. The implications (i) ⇒ (ii) and (ii) ⇒ (iii) remain true without the convexity assumptions. It is a natural open problem whether these assumptions can be completely dispensed with in Theorem 12.42. A bold conjecture would be that (i), (ii) and (iii) are always equivalent and automatically imply the total c-convexity of Dom'(∇_x c).
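The sign condition S_c ≥ 0 appearing in Example 12.40 and Examples 12.41 can at least be sanity-checked symbolically. The following sketch (mine, not part of the text) computes, in flat R² coordinates, one common normalization of the Ma–Trudinger–Wang expression, S_c·(ξ,η) = −(3/2) Σ (c_{ij,p} c^{p,q} c_{q,kl} − c_{ij,kl}) ξ^i ξ^j η^k η^l; sign and scaling conventions for S_c vary in the literature, but for the quadratic cost every mixed third and fourth derivative vanishes, so the vanishing of the tensor is convention-free.

```python
# Symbolic sanity check of the Ma-Trudinger-Wang tensor in R^2.
# Hedged: the normalization below is one common convention; for the
# quadratic cost all terms vanish identically, so the result S_c = 0
# does not depend on the convention chosen.
import sympy as sp

x1, x2, y1, y2 = sp.symbols('x1 x2 y1 y2', real=True)
X, Y = (x1, x2), (y1, y2)

def mtw(c, xi, eta, point):
    """Evaluate S_c(x,y)·(xi,eta) for a cost c on R^2 x R^2 at `point`."""
    # mixed Hessian c_{p,q} = d^2 c / dx_p dy_q, and its inverse c^{p,q}
    C = sp.Matrix(2, 2, lambda p, q: sp.diff(c, X[p], Y[q]))
    Cinv = C.inv()
    s = sp.Integer(0)
    for i in range(2):
        for j in range(2):
            for k in range(2):
                for l in range(2):
                    term = -sp.diff(c, X[i], X[j], Y[k], Y[l])
                    for p in range(2):
                        for q in range(2):
                            term += (sp.diff(c, X[i], X[j], Y[p])
                                     * Cinv[p, q]
                                     * sp.diff(c, X[q], Y[k], Y[l]))
                    s += -sp.Rational(3, 2) * term * xi[i]*xi[j]*eta[k]*eta[l]
    return sp.simplify(s.subs(point))

# Quadratic cost: mixed third/fourth derivatives vanish, hence S_c = 0.
quad = ((x1 - y1)**2 + (x2 - y2)**2) / 2
pt = {x1: sp.Rational(1, 3), x2: 0, y1: 0, y2: sp.Rational(1, 2)}
print(mtw(quad, (1, 0), (0, 1), pt))
```

Replacing `quad` by, say, `sp.sqrt(1 + (x1-y1)**2 + (x2-y2)**2)` lets one probe the costs of Examples 12.41 at chosen points; since the overall sign then depends on the convention adopted, such experiments are indicative only.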

Proof of Theorem 12.42. Theorem 12.36 ensures the equivalence of (ii) and (iii). We shall now see that (i) and (ii) are also equivalent.

Assume (ii) is satisfied. Let ψ : M → R be a c-convex function, and let x ∈ M; the goal is to show that ∂_c ψ(x) is connected. Without loss of generality we may assume ψ(x) = 0. Let y_0, y_1 ∈ ∂_c ψ(x), and let ψ_{x,y_0,y_1} be defined as in (12.17). The same reasoning as in the proof of Proposition 12.15(i) shows that ∂_c ψ_{x,y_0,y_1}(x) ⊂ ∂_c ψ(x), so it is sufficient to show that ∂_c ψ_{x,y_0,y_1}(x) is connected. Even if y_0, y_1 do not belong to Dom'(∇_x c(x,·)), the complement of the latter set has an empty interior as a consequence of (Cut^{n−1}), so we can find sequences (y_0^{(k)})_{k∈N} and (y_1^{(k)})_{k∈N} such that (x, y_i^{(k)}) ∈ Dom'(∇_x c) and y_i^{(k)} → y_i (i = 0, 1). In particular, there are p_i^{(k)} ∈ T_x M, uniquely determined, such that ∇_x c(x, y_i^{(k)}) + p_i^{(k)} = 0.

Then let ψ^{(k)} = ψ_{x, y_0^{(k)}, y_1^{(k)}}. The c-segment [y_0^{(k)}, y_1^{(k)}]_x is well-defined and included in ∂_c ψ^{(k)}(x) (because c is regular). In other words, c-exp_x((1 − t) p_0^{(k)} + t p_1^{(k)}) ∈ ∂_c ψ^{(k)}(x). Passing to the limit as k → ∞, after extraction of a subsequence if necessary, we find vectors p_0, p_1 (not necessarily uniquely determined) such that c-exp_x(p_0) = y_0, c-exp_x(p_1) = y_1, and c-exp_x((1 − t) p_0 + t p_1) ∈ ∂_c ψ(x). This proves the desired connectedness property.

Conversely, assume that (i) holds true. Let x, y be such that (x,y) ∈ Dom'(∇_x c), and let p = −∇_x c(x,y). Let ν be a unit vector in T_x M; for ε > 0 small enough, y_0^{(ε)} = c-exp_x(p − εν) and y_1^{(ε)} = c-exp_x(p + εν) belong to Dom'(∇_x c(x,·)). Let ψ^{(ε)} = ψ_{x, y_0^{(ε)}, y_1^{(ε)}}. As ε → 0, ∂_c ψ^{(ε)}(x) shrinks to ∂_c ψ_{x,y,y}(x) = {y} (because (x,y) ∈ Dom'(∇_x c)). Since Dom'(∇_x c) is open, for ε small enough the whole set ∂_c ψ^{(ε)}(x) is included in Dom'(∇_x c). By the same reasoning as in the proof of Proposition 12.15(iii), the connectedness of ∂_c ψ^{(ε)}(x) implies its c-convexity. Then the proof of Theorem 12.36(i) can be repeated to show that S_c(x,y) ≥ 0 along pairs of vectors satisfying the correct orthogonality condition. ⊓⊔

I shall conclude this section with a negative result displaying the power of the differential reformulation of regularity.

Theorem 12.44 (Smoothness of the optimal transport needs nonnegative curvature). Let M be a compact Riemannian manifold such that the sectional curvature σ_x(P) is negative for some x ∈ M and some plane P ⊂ T_x M. Then there exist smooth positive probability densities f and g on M such that the optimal transport map T from f vol to g vol, with cost c(x,y) = d(x,y)², is discontinuous.

The same conclusion holds true under the weaker assumption that S_c(x,y)·(ξ,η) < 0 for some (x,y) ∈ M × M such that y does not belong to the cut locus of x and ∇²_{xy} c(x,y)·(ξ,η) = 0.

Remark 12.45. A counterexample by Y.-H. Kim shows that the second assumption is strictly weaker than the first one.

Proof of Theorem 12.44. Let ξ, η be orthogonal tangent vectors at x generating P. By Particular Case 12.30, S_c(x,x)·(ξ,η) < 0. If we fix a neighborhood V of x, we can find r > 0 such that for any x̄ ∈ V the set C_x̄ = (∇_x c)^{−1}(x̄, B_r(0)) is well-defined. If we take a small enough subdomain U containing x and define C = ∩_{x̄∈U} C_x̄, then U × C is totally c-convex and open. By Theorem 12.36, c is not regular in U × C.

For any y_0, y_1 in C, define ψ_{x,y_0,y_1} as in (12.17). If we let (y_0, y_1) shrink to (x, x), then ∂_c ψ_{x,y_0,y_1}(x) will converge to ∂_c ψ_{x,x,x}(x) = {x}. So if U and C are small enough, ∂_c ψ_{x,y_0,y_1}(x) will be contained in an arbitrarily small ball around x, a fortiori in Dom'(∇_x c(x,·)). Then we can apply Theorem 12.20 and Corollary 12.21.

A similar reasoning works for the more general case when S_c(x,y)·(ξ,η) < 0, as long as y is not a cut point of x, since ∂_c ψ_{x,y,y}(x) = {y}. ⊓⊔

Differential formulation of c-convexity

Its definition makes the c-convexity property quite difficult to check in general. In contrast, to establish the plain convexity of a smooth function R^n → R it is sufficient to just prove the nonnegativity of its Hessian. If c is an arbitrary cost function, there does not seem to be such a simple differential characterization of c-convexity; but if c is a regular cost function there is one, as the next result will demonstrate.
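For orientation (this illustration is mine, not part of the original text), consider the model cost c(x,y) = −⟨x,y⟩ on R^n × R^n, for which c-convexity reduces to ordinary convexity plus lower semicontinuity; the differential criterion stated in (12.28) below then collapses to the familiar Hessian test:

```latex
% Illustration (assumption: c(x,y) = -<x,y>, the model case equivalent
% to the quadratic cost). The two conditions of (12.28) become
\[
  \nabla\psi(x) + \nabla_x c(x,y) = \nabla\psi(x) - y = 0
  \quad\Longleftrightarrow\quad y = \nabla\psi(x),
\]
\[
  \nabla^2\psi(x) + \nabla_x^2 c(x,y) = \nabla^2\psi(x) \geq 0,
\]
% i.e. a C^2 function is c-convex as soon as its Hessian is nonnegative,
% which is the classical convexity criterion. For this cost the tensor
% S_c vanishes identically, so the cost is (non-strictly) regular.
```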
The notation is the same as in the previous section.

Theorem 12.46 (Differential criterion for c-convexity). Let c : X × Y → R be a cost function such that c and č satisfy (STwist), and let D be a totally c-convex closed subset of Dom'(∇_x c) such that Ď is totally č-convex and S_c ≥ 0 on D. Let X' = proj_X(D) and let ψ ∈ C²(X'; R) (meaning that ψ is twice continuously differentiable on X' up to the boundary). If for any x ∈ X' there is y ∈ Y such that (x,y) ∈ D and

∇ψ(x) + ∇_x c(x,y) = 0,
                                      (12.28)
∇²ψ(x) + ∇²_x c(x,y) ≥ 0,

then ψ is c-convex on X' (or, more rigorously, c'-convex, where c' is the restriction of c to D).

Remark 12.47. In view of the discussion at the beginning of this chapter, (12.28) is a necessary and sufficient condition for c-convexity, up to issues about the smoothness of ψ and the domain of differentiability of c. Note that the set of y's appearing in (12.28) is not required to be the whole of proj_Y(D), so in practice one may often enlarge D in the y variable before applying Theorem 12.46.

Remark 12.48. Theorem 12.46 shows that if c is a regular cost, then c-convexity is a local notion (up to issues about the domain of definition).

Proof of Theorem 12.46. Let ψ satisfy the assumptions of the theorem, and let (x̄, ȳ) ∈ D be such that ∇ψ(x̄) + ∇_x c(x̄, ȳ) = 0. The goal is

∀x ∈ X',   ψ(x) + c(x, ȳ) ≥ ψ(x̄) + c(x̄, ȳ).   (12.29)

If this is true, then ψ^c(ȳ) = ψ(x̄) + c(x̄, ȳ); in particular ψ^{cc}(x̄) = sup_y [ψ^c(y) − c(x̄, y)] ≥ ψ(x̄), and since this is true for any x̄ ∈ X', we will have ψ^{cc} ≥ ψ, therefore ψ^{cc} = ψ, so ψ will be c-convex.

The proof of (12.29) is in the same spirit as the proof of the converse implication in Theorem 12.36. Let x ∈ X', and let (x_t)_{0≤t≤1} = [x̄, x]_ȳ be the č-segment with base ȳ and endpoints x_0 = x̄ and x_1 = x. Let

h(t) = ψ(x_t) + c(x_t, ȳ).

To prove (12.29) it is sufficient to show that h(t) ≥ h(0) for all t ∈ [0,1].

The č-convexity of D implies that (x_t, ȳ) always lies in D. Let q = −∇_y c(x, ȳ), q̄ = −∇_y c(x̄, ȳ), η = q − q̄; then, as in the proof of Theorem 12.36, we have ẋ_t^k = −c^{k,j}(x_t, ȳ) η_j,

ḣ(t) = ζ_i ẋ_t^i = −c^{j,i}(x_t, ȳ) η_j ζ_i,   where ζ_i = ψ_i(x_t) + c_i(x_t, ȳ);

and

ḧ(t) = (ψ_{ij}(x_t) + c_{ij}(x_t, ȳ)) ẋ_t^i ẋ_t^j + ζ_i ẍ_t^i.   (12.30)

By assumption there is y_t such that ∇ψ(x_t) + ∇_x c(x_t, y_t) = 0; in particular ζ_i = c_i(x_t, ȳ) − c_i(x_t, y_t). Then (12.30) can be rewritten in terms of the function

Φ(p) := (ψ_{ij}(x_t) + c_{ij}(x_t, y)) η^i η^j,   p = −∇_x c(x_t, y),

where of course y is seen as a function of p, evaluated at p̄ = −∇_x c(x_t, ȳ) and compared with Φ(p_t) + dΦ(p_t)·(p̄ − p_t), where p_t = −∇_x c(x_t, y_t). (Note that ψ does not contribute to dΦ.) After using the c-convexity of D and a Taylor formula, we end up with formulas that are quite similar to (12.26):

ḣ(t) = ∇²_{x,y} c(x_t, ȳ)·(η, ζ),

ḧ(t) = [∇²ψ(x_t) + ∇²_x c(x_t, y_t)]·(ζ, ζ)
     + (2/3) ∫_0^1 S_c(x_t, (∇_x c)^{−1}(x_t, −(1−s) p_t − s p̄))·(η, ζ) (1−s) ds.   (12.31)

By assumption the first term on the right-hand side of ḧ is nonnegative, so, reasoning as in the proof of Theorem 12.36, we arrive at

ḧ(t) ≥ −C |ḣ(t)|,   (12.32)

where C is a positive constant depending on c, ψ, x, x̄ and ȳ. We shall see that (12.32), combined with ḣ(0) = 0, implies that h is nondecreasing on [0,1], and therefore h(t) ≥ h(0), which was our goal.

Assume indeed that ḣ(t_*) < 0 for some t_* ∈ (0,1], and let t_0 = sup {t ≤ t_*; ḣ(t) = 0} ≥ 0. For t ∈ (t_0, t_*) we have ḣ(t) < 0 and (d/dt) log |ḣ(t)| = ḧ/ḣ ≤ C, so log |ḣ(t)| ≥ log |ḣ(t_*)| − C (t_* − t), and we obtain a contradiction since log |ḣ(t)| → −∞ as t → t_0. The conclusion is that ḣ(t) ≥ 0 for all t ∈ [0,1], and we are done. ⊓⊔

Control of the gradient via c-convexity

The property of c-convexity of the target is the key to getting good control of the localization of the gradient of the solution to (12.2). This assertion might seem awkward: after all, we already know that under general assumptions, T(Spt μ) = Spt ν (recall the end of Theorem 10.28),

where the transport T is related to the gradient of ψ by T(x) = (∇_x c)^{−1}(x, −∇ψ(x)); so ∇ψ(x) always belongs to −∇_x c(Spt μ, Spt ν) when x varies in Spt μ.

To understand why this is not enough, assume that you are approximating ψ by smooth approximate solutions ψ_k. Then ∇ψ_k → ∇ψ at all points of differentiability of ψ, but you have no control at all of the behavior of ∇ψ_k(x) if ψ is not differentiable at x! In particular, in the approximation process the point y might very well get beyond the support of ν, putting you in trouble. To guarantee good control of smooth approximations of ψ, you need information on the whole c-subdifferential ∂_c ψ(x). The next theorem says that such control is available as soon as the target is c-convex.

Theorem 12.49 (Control of c-subdifferential by c-convexity of target). Let X, Y, c : X × Y → R, μ ∈ P(X), ν ∈ P(Y), and ψ : X → R ∪ {+∞} satisfy the same assumptions as in Theorem 10.28 (including (H∞)). Let Ω ⊂ X be an open set such that Spt μ = Ω̄, and let C ⊂ Y be a closed set such that Spt ν ⊂ C. Assume that:

(a) Ω × C ⊂ Dom'(∇_x c);
(b) C is c-convex with respect to Ω.

Then ∂_c ψ(Ω) ⊂ C.

Proof of Theorem 12.49. We already know from Theorem 10.28 that T(Ω) ⊂ C, where T(x) = (∇_x c)^{−1}(x, −∇ψ(x)) stands for the optimal transport. In particular, ∂_c ψ(Ω ∩ Dom(∇ψ)) ⊂ C. The problem is to control ∂_c ψ(x) when ψ is not differentiable at x.

Let x ∈ Ω be such a point, and let y ∈ ∂_c ψ(x). By Theorem 10.25, −∇_x c(x,y) ∈ ∇^− ψ(x). Since ψ is locally semiconvex, we can apply Remark 10.51 to deduce that ∇^− ψ(x) is the convex hull of limits of ∇ψ(x_k) when x_k → x. Then by Proposition 12.15(ii), there are L ∈ N (L = n + 1 would do), α_ℓ ≥ 0 and (x_{k,ℓ})_{k∈N} (1 ≤ ℓ ≤ L) such that Σ α_ℓ = 1, x_{k,ℓ} → x as k → ∞, and

Σ_{ℓ=1}^L α_ℓ ∇ψ(x_{k,ℓ}) −→ −∇_x c(x,y) as k → ∞.   (12.33)

From the observation at the beginning of the proof, ∇ψ(x_{k,ℓ}) ∈ −∇_x c(x_{k,ℓ}, C); as k → ∞, this set converges uniformly (say in the sense of Hausdorff distance) to −∇_x c(x, C), which is convex. By passing to the limit in (12.33) we get −∇_x c(x,y) ∈ −∇_x c(x, C), so y ∈ C. ⊓⊔

Smoothness results

After a long string of counterexamples (Theorems 12.3, 12.4, 12.7, Corollary 12.21, Theorem 12.44), we can at last turn to positive results about the smoothness of the transport map T. It is indeed absolutely remarkable that a good regularity theory can be developed once all the previously discussed obstructions are avoided, by:

• suitable assumptions of convexity of the domains;
• suitable assumptions of regularity of the cost function.

These results constitute a chapter in the theory of Monge–Ampère-type equations, more precisely for the "second boundary value problem", which means that the boundary condition is not of Dirichlet type; instead, what plays the role of the boundary condition is that the image of the source domain by the transport map should be the target domain. Typically a convexity-type assumption on the target will be needed for local regularity results, while global regularity (up to the boundary) will request convexity of both domains. Throughout this theory the main problem is to get C² estimates on the unknown ψ; once these estimates are secured, the equation becomes "uniformly elliptic", and higher regularity follows from the well-developed machinery of uniformly elliptic fully nonlinear second-order partial differential equations, combined with the linear theory (Schauder estimates).

It would take much more space than I can afford here to give a fair account of the methods used, so I shall only list some of the main results proven so far, and refer to the bibliographical notes for more information. I shall distinguish three settings, which roughly speaking are, respectively, the quadratic cost function in Euclidean space; regular cost functions; and strictly regular cost functions. The day may come when these results will all be unified in just two categories (regular and strictly regular), but we are not there yet.

In the sequel, I shall denote by C^{k,α}(Ω) (resp. C^{k,α}(Ω̄)) the space of functions whose derivatives up to order k are locally α-Hölder (resp. globally α-Hölder in Ω̄) for some α ∈ (0,1], α = 1 meaning Lipschitz continuity. I shall say that a C²-smooth open set C ⊂ R^n is uniformly convex if its second fundamental form is uniformly positive on the whole of ∂C. A similar notion of uniform c-convexity can be defined by use of the c-second fundamental form of Definition 12.26. I shall say that a cost function is uniformly regular if it satisfies S_c(x,y)·(ξ,η) ≥ λ |ξ|² |η|² for some λ > 0, where S_c is defined by (12.20) and ⟨∇²_{xy} c · ξ, η⟩ = 0;

I shall abbreviate this inequality into S_c(x,y) ≥ λ Id. When I say that a density is bounded from above and below, this means bounded by positive constants.

Theorem 12.50 (Caffarelli's regularity theory). Let c(x,y) = |x − y|² in R^n × R^n, and let Ω, Λ be connected bounded open subsets of R^n. Let f, g be probability densities on Ω and Λ respectively, with f and g bounded from above and below. Let ψ : Ω → R be the unique (up to an additive constant) Kantorovich potential associated with the probability measures μ(dx) = f(x) dx and ν(dy) = g(y) dy, and the cost c. Then:

(i) If Λ is convex, then ψ ∈ C^{1,β}(Ω) for some β ∈ (0,1).

(ii) If Λ is convex, f ∈ C^{0,α}(Ω), g ∈ C^{0,α}(Λ) for some α ∈ (0,1), then ψ ∈ C^{2,α}(Ω); moreover, for any k ∈ N and α ∈ (0,1),

f ∈ C^{k,α}(Ω), g ∈ C^{k,α}(Λ) ⟹ ψ ∈ C^{k+2,α}(Ω).

(iii) If Ω and Λ are C² and uniformly convex, f ∈ C^{0,α}(Ω̄) and g ∈ C^{0,α}(Λ̄) for some α ∈ (0,1), then ψ ∈ C^{2,α}(Ω̄); more generally, for any k ∈ N and α ∈ (0,1),

f ∈ C^{k,α}(Ω̄), g ∈ C^{k,α}(Λ̄), Ω, Λ ∈ C^{k+2} ⟹ ψ ∈ C^{k+2,α}(Ω̄).

Theorem 12.51 (Urbas–Trudinger–Wang regularity theory). Let X and Y be the closures of bounded open sets in R^n, and let c : X × Y → R be a smooth cost function satisfying (STwist) and S_c ≥ 0 in the interior of X × Y. Let Ω ⊂ X and Λ ⊂ Y be C²-smooth connected open sets, and let f ∈ C(Ω̄), g ∈ C(Λ̄) be positive probability densities. Let ψ be the unique (up to an additive constant) Kantorovich potential associated with the probability measures μ(dx) = f(x) dx and ν(dy) = g(y) dy, and the cost c. If (a) Λ is uniformly c-convex with respect to Ω, and Ω is uniformly č-convex with respect to Λ; (b) f ∈ C^{1,1}(Ω̄), g ∈ C^{1,1}(Λ̄); and (c) Λ and Ω are of class C^{3,1}, then ψ ∈ C^{3,β}(Ω̄) for all β ∈ (0,1).

If moreover for some k ∈ N and α ∈ (0,1) we have f ∈ C^{k,α}(Ω̄), g ∈ C^{k,α}(Λ̄), and Ω, Λ are of class C^{k+2,α}, then ψ ∈ C^{k+2,α}(Ω̄).

Theorem 12.52 (Loeper–Ma–Trudinger–Wang regularity theory). Let X and Y be the closures of bounded connected open sets in R^n, and let c : X × Y → R be a smooth cost function satisfying (STwist). Let S_c ≥ λ Id, λ > 0, in the interior of X × Y, and let

Ω ⊂ X and Λ ⊂ Y be two connected open sets. Let μ ∈ P(Ω) be such that dμ/dx > 0 almost everywhere in Ω, and let g be a probability density on Λ, bounded from above and below. Let ψ be the unique (up to an additive constant) Kantorovich potential associated with the probability measures μ(dx), ν(dy) = g(y) dy, and the cost c. Then:

(i) If Λ is c-convex with respect to Ω and if

∃ m > n, ∃ C > 0, ∀x ∈ Ω, ∀r > 0, μ[B_r(x)] ≤ C r^{m−1},

then ψ ∈ C^{1,β}(Ω) for some β ∈ (0,1).

(ii) If Λ is uniformly c-convex with respect to Ω, f ∈ C^{1,1}(Ω̄) and g ∈ C^{1,1}(Λ̄), then ψ ∈ C^{3,β}(Ω) for all β ∈ (0,1). If moreover for some k ∈ N and α ∈ (0,1) we have f ∈ C^{k,α}(Ω̄), g ∈ C^{k,α}(Λ̄), then ψ ∈ C^{k+2,α}(Ω).

Remark 12.53. Theorem 12.51 shows that the regularity of the cost function is sufficient to build a strong regularity theory. These results are still not optimal and likely to be refined in the near future; in particular one can ask whether C^α → C^{2,α} estimates are available for plainly regular cost functions (but Caffarelli's methods strongly use the affine invariance properties of the quadratic cost function); or whether interior estimates exist (Theorem 12.52(ii) shows that this is the case for uniformly regular costs).

Remark 12.54. On the other hand, the first part of Theorem 12.52 shows that a uniformly regular cost function behaves better, in certain ways, than the square Euclidean norm! For instance, the condition in Theorem 12.52(i) is automatically satisfied if μ(dx) = f(x) dx, f ∈ L^p for p > n; but it also allows μ to be a singular measure. (Such estimates are not even true for the linear Laplace equation!) As observed by specialists, uniform regularity makes the equation much more elliptic.

Remark 12.55. Theorems 12.52 and 12.51 imply a certain converse to Theorem 12.20: roughly speaking, if the cost function is regular, then any c-convex function ψ defined in a uniformly bounded convex domain can be approximated uniformly by smooth c-convex functions. In other words, the density of smooth c-convex functions is more or less a necessary and sufficient condition for the regularity property.

All these results are stated only for bounded subsets of R^n × R^n, so the question arises whether they can be extended to more general

cost functions on Riemannian manifolds. One possibility is to redo the proofs of all these results in curved geometry (with probably additional complications and assumptions). Another possibility is to use a localization argument to reduce the general case to the particular case where the functions are defined in R^n. At the level of the optimal transport problem, such an argument seems to be doomed: if you cut out a small piece Ω ⊂ X and a small piece Λ ⊂ Y, there is in general no hope of being able to choose Ω and Λ in such a way that the optimal transport sends Ω to Λ and these domains satisfy adequate c-convexity properties. However, whenever interior a priori estimates are available besides the regularity results, this localization strategy is likely to work at the level of the partial differential equation. At least Theorems 12.50 and 12.52 can be complemented with such interior a priori estimates:

Theorem 12.56 (Caffarelli's interior a priori estimates). Let Ω ⊂ R^n be open and let ψ : Ω → R be a smooth convex function satisfying the Monge–Ampère equation

det(∇²ψ(x)) = F(x, ∇ψ(x)) in Ω.   (12.34)

Let κ_Ω(ψ) stand for the modulus of (strict) convexity of ψ in Ω. Then for any open subdomain Ω' such that Ω̄' ⊂ Ω, one has the a priori estimates (for some β ∈ (0,1), for all α ∈ (0,1), for all k ∈ N)

‖ψ‖_{C^{1,β}(Ω')} ≤ C(Ω, Ω', κ_Ω(ψ), ‖∇ψ‖_{L^∞(Ω)}, ‖F‖_{L^∞(Ω)});
‖ψ‖_{C^{2,α}(Ω')} ≤ C(α, Ω, Ω', κ_Ω(ψ), ‖∇ψ‖_{L^∞(Ω)}, ‖F‖_{C^{0,α}(Ω)});
‖ψ‖_{C^{k+2,α}(Ω')} ≤ C(k, α, Ω, Ω', κ_Ω(ψ), ‖∇ψ‖_{L^∞(Ω)}, ‖F‖_{C^{k,α}(Ω)}).

With Theorem 12.56 and some more work to establish the strict convexity, it is possible to extend Caffarelli's theory to unbounded domains.

Theorem 12.57 (Loeper–Ma–Trudinger–Wang interior a priori estimates). Let X, Y be the closures of bounded open sets in R^n, and let c : X × Y → R be a smooth cost function satisfying (STwist) and uniformly regular. Let Ω ⊂ X be a bounded open set and let ψ : Ω → R be a smooth c-convex solution of the Monge–Ampère-type equation

det(∇²ψ(x) + ∇²_x c(x, (∇_x c)^{−1}(x, −∇ψ(x)))) = F(x, ∇ψ(x)) in Ω.   (12.35)

Let Λ ⊂ Y be a strict neighborhood of {(∇_x c)^{−1}(x, −∇ψ(x)); x ∈ Ω}, c-convex with respect to Ω. Then for any open subset Ω' such that Ω̄' ⊂ Ω, one has the a priori estimates (for some β ∈ (0,1), for all α ∈ (0,1), for all k ≥ 2)

‖ψ‖_{C^{1,β}(Ω')} ≤ C(Ω, Ω', ‖∇ψ‖_{L^∞(Ω)}, ‖F‖_{L^∞(Ω)}, c|_{Ω×Λ});
‖ψ‖_{C^{3,α}(Ω')} ≤ C(α, Ω, Ω', ‖∇ψ‖_{L^∞(Ω)}, ‖F‖_{C^{1,1}(Ω)}, c|_{Ω×Λ});
‖ψ‖_{C^{k+2,α}(Ω')} ≤ C(k, α, Ω, Ω', ‖∇ψ‖_{L^∞(Ω)}, ‖F‖_{C^{k,α}(Ω)}, c|_{Ω×Λ}).

These a priori estimates can then be extended to more general cost functions. A possible strategy is the following:

(1) Identify a totally c-convex domain D ⊂ X × Y in which the cost is smooth and uniformly regular. (For instance, in the case of d²/2 on S^{n−1} this could be S^{n−1} × S^{n−1} \ {d(y, −x) < δ}, i.e. one would remove a strip around the cut locus.)

(2) Prove the continuity of the optimal transport map.

(3) Working on the mass transport condition, prove that ∂_c ψ is entirely included in D. (Still in the case of the sphere, prove that the transport map stays a positive distance away from the cut locus.)

(4) For each small enough domain Ω ⊂ X, find a c-convex subset Λ of Y such that the transport map sends Ω into Λ; deduce from Theorem 12.49 that ∂_c ψ(Ω) ⊂ Λ, and deduce an upper bound on ∇ψ.

(5) Use coordinates to reduce to the case when Ω and Λ are subsets of R^n. (Reduce Ω and Λ if necessary.) Since the tensor S_c is intrinsically defined, the uniform regularity property will be preserved by this operation.

(6) Regularize ψ on ∂Ω, solve the associated Monge–Ampère-type equation with regularized boundary data (there is a set of useful techniques for that), and use Theorem 12.57 to obtain interior a priori estimates which are independent of the regularization.
Then pass to the limit and get a smooth solution. This last step is well-understood by specialists but requires some skill. To conclude this chapter, I shall state a regularity result for optimal transport on the sphere, which was obtained by means of the preceding strategy.

Theorem 12.58 (Smoothness of optimal transport on S^{n−1}). Let S^{n−1} be the unit Euclidean sphere in R^n, equipped with its volume measure, and let d be the geodesic distance on S^{n−1}. Let f and g be C^{1,1} positive probability densities on S^{n−1}. Let ψ be the unique (up to an additive constant) Kantorovich potential associated with the transport of μ(dx) = f(x) vol(dx) to ν(dy) = g(y) vol(dy) with cost c(x,y) = d(x,y)², and let T be the optimal transport map. Then ψ ∈ C^{3,β}(S^{n−1}) for all β ∈ (0,1), and in particular T ∈ C^{2,β}(S^{n−1}, S^{n−1}).

If moreover f, g lie in C^{k,α}(S^{n−1}) for some k ∈ N, α ∈ (0,1), then ψ ∈ C^{k+2,α}(S^{n−1}) and T ∈ C^{k+1,α}(S^{n−1}, S^{n−1}). (In particular, if f and g are positive and C^∞ then ψ and T are C^∞.)

Exercise 12.59. Split S^{n−1} into two (geodesically convex) hemispheres S_+ and S_−, according to, say, the sign of the first coordinate. Let f_± stand for the uniform probability density on S_±. Find out the optimal transport between f_+ vol and f_− vol (for the cost c(x,y) = d(x,y)²) and explain why it was not a priori expected to be smooth.

Appendix: Strong twist condition for squared distance, and rectifiability of cut locus

In this Appendix I shall explain why the squared geodesic distance satisfies the strong twist condition. The argument relies on a few key results about the cut locus (introduced just after Problem 8.8), which I shall recall without proof. Details can be found in mildly advanced textbooks in Riemannian geometry, such as the ones quoted in the bibliographical notes of Chapter 14.
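The hemispheres of Exercise 12.59 can be explored numerically. The sketch below (mine, not part of the text) discretizes the two hemispheres in the lowest-dimensional case n = 2, i.e. on the circle S¹, and solves the resulting discrete Monge problem for the cost d(x,y)² with a linear assignment solver. The optimal matching exists and moves all the mass, but nothing in this construction forces the induced discrete map to behave smoothly near the boundary points of the hemispheres, consistently with the point of the exercise.

```python
# Discrete optimal transport between the two hemispheres of S^1,
# cost c(x,y) = d(x,y)^2 with d the geodesic (arc-length) distance.
# Sketch only: a uniform discretization, not a proof of (dis)continuity.
import numpy as np
from scipy.optimize import linear_sum_assignment

m = 50
# right hemisphere S_+ (first coordinate > 0) and left hemisphere S_-,
# parametrized by the angle, slightly inset from the boundary points
tp = np.linspace(-np.pi/2 + 1e-3, np.pi/2 - 1e-3, m)
tm = np.linspace(np.pi/2 + 1e-3, 3*np.pi/2 - 1e-3, m)

def geod(a, b):
    """Geodesic distance on the unit circle between angle arrays a, b."""
    d = np.abs(a[:, None] - b[None, :]) % (2 * np.pi)
    return np.minimum(d, 2 * np.pi - d)

C = geod(tp, tm) ** 2                    # cost matrix c(x_i, y_j)
row, col = linear_sum_assignment(C)      # optimal one-to-one matching
total = C[row, col].sum() / m            # approximate transport cost
print(total > 0)
```

For n = 3 one would sample S² instead; `linear_sum_assignment` handles a few thousand points, enough to visualize where the discrete transport map comes close to being discontinuous.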
In particular, if c = d y , then ⋃ ′ )) x cut( }× ( { x M \ cut( x ). Also cut( ( ) := ∇ Dom c ( x, · )) = M x M ∈ x ′ M ( ∇ ). c ) = ( is closed, so Dom × M ) \ cut( M x ∞ c is C outside cut( M ). It is easily checked that Let now x and y be such that y / ∈ cut( x ); in particular (as we know from Problem 8.8), y is not focal to x , which means that d exp is x

well-defined and nonsingular at v = (exp_x)^{−1}(y), which is the initial velocity of the unique geodesic going from x to y. But ∇_x c(x,y) coincides with −2 (exp_x)^{−1}(y) = −2v; so ∇²_{xy} c(x,y) = −2 ∇_y ((exp_x)^{−1}) is nonsingular. This concludes the proof of the strong twist condition.

It is also true that c satisfies (Cut^{n−1}); in fact, for any compact subset K of M and any x ∈ M one has

H^{n−1}[K ∩ cut(x)] < +∞.   (12.36)

This however is a much more delicate result, which will be found in recent research papers.

Bibliographical notes

The denomination Monge–Ampère equation is used for any equation which resembles det(∇²ψ) = f, or more generally

det(∇²ψ − A(x, ψ, ∇ψ)) = F(x, ψ, ∇ψ).   (12.37)

Monge himself was probably not aware of the relation between the Monge problem and Monge–Ampère equations; this link was made much later, maybe in the work of Knott and Smith [524]. In any case it is Brenier [156] who made this connection popular among the community of partial differential equations. Accordingly, weak solutions of Monge–Ampère-type equations constructed by means of optimal transport are often called Brenier solutions in the literature. McCann [614] proved that such a solution automatically satisfies the Monge–Ampère equation almost everywhere (see the bibliographical notes for Chapter 11).

Caffarelli [185] showed that for a convex target, Brenier's notion of solution is equivalent to the older concepts of Alexandrov solution and viscosity solution. These notions are reviewed in [814, Chapter 4] and a proof of the equivalence between Brenier and Alexandrov solutions is recast there. (The concept of Alexandrov solution is developed in [53, Section 11.2].) Feyel and Üstünel [359, 361, 362] studied the infinite-dimensional Monge–Ampère equation induced by optimal transport with quadratic cost on the Wiener space.
The modern regularity theory of the Monge–Ampère equation was pioneered by Alexandrov [16, 17] and Pogorelov [684, 685]. Since then it has become one of the most prestigious subjects of fully nonlinear

partial differential equations, in relation to geometric problems such as the construction of isometric embeddings, or convex hypersurfaces with prescribed (multi-dimensional) Gauss curvature (for instance, the Minkowski problem is about the construction of a convex hypersurface whose Gauss curvature is a prescribed function of the normal; the Alexandrov problem is about prescribing the so-called integral curvature). These issues are described in Bakelman's monograph [53]. Recently Oliker [659] has shown that the Alexandrov problem can be recast as an optimal transport problem on the sphere.

A modern account of some parts of the theory of the Monge–Ampère equation can be found in the recent book by Gutiérrez [452]; there is also an unpolished set of notes by Guan [446]. General references for fully nonlinear elliptic partial differential equations are the book by Caffarelli and Cabré [189], the last part of the one by Gilbarg and Trudinger [416], and the user-friendly notes by Krylov [531] or by Trudinger [786]. As a major development, the estimates derived by Evans, Krylov and Safonov [326, 530, 532] allow one to establish the regularity of fully nonlinear second-order equations under an assumption of uniform ellipticity. Eventually these techniques rely on Schauder-type estimates for certain linear equations. Monge–Ampère equations are uniformly elliptic only under a priori estimates on the second derivative of the unknown function; so these a priori estimates must first be established before applying the general theory. An elementary treatment of the basic linear Schauder estimates can be found, e.g., in [835] (together with many references), and a short treatment of the Monge–Ampère equation (based on [833] and on works by other authors) in [490].
Monge–Ampère equations arising in optimal transport have certain distinctive features; one of them is a particular type of boundary condition, sometimes referred to as the second boundary condition. The pioneering papers in this field are due to Delanoë in dimension 2 [284], then Caffarelli [185, 186, 187], Urbas [798, 799], and X.-J. Wang [833] in arbitrary dimension. The first three authors were interested in the case of a quadratic cost function in Euclidean space, while Wang considered the logarithmic cost function on the sphere, which appears in the so-called reflector antenna problem, at least in the particular case of "far field". (At the time of publication of [833] it was not yet understood that the problem treated there really was an optimal transport problem; it was only later that Wang made this remark.)

Wolfson [840] studied the optimal transport problem between two sets of equal area in the plane, with motivations in geometry, and identified obstructions to the existence of smooth solutions in terms of the curvature of the boundary. Theorem 12.49 is one of the first steps in Caffarelli's regularity theory [185] for the quadratic cost function in Euclidean space; for more general cost functions it is due to Ma, Trudinger and Wang [585]. Theorem 12.50 (together with further refinements) appears in [185, 186, 187]. An informal introduction to the first steps of Caffarelli's techniques can be found in [814, Chapter 4]. The extension to unbounded domains is sketched in [15, Appendix]. Most of the theory was developed with nonconstructive arguments, but Forzani and Maldonado [376] were able to make at least the $C^{1,\beta}$ estimates quantitative. Apart from the $C^{2,\alpha}$ theory, there is also a difficult $C^0 \to W^{2,p}$ result [184], where $p$ is arbitrarily large and $f$, $g$ should be positive and continuous. (The necessity of both the continuity and the positivity assumptions is demonstrated by counterexamples due to Wang [832]; see also [490].) Caffarelli and McCann [192] studied the regularity of the optimal map in the problem of partial optimal transport, in which only a fraction of the mass is transferred from the source to the target (both measures need not have the same mass); this problem transforms into a (double) obstacle problem for the Monge–Ampère equation. Figalli [365] obtained refined results on this problem by more elementary methods (not relying on the Monge–Ampère equation) and showed that there is in general no regularity if the supports of the source and target measures overlap. Another obstacle problem involving optimal transport, arising from a physical model, has been studied in [737].
Cordero-Erausquin [240] adapted Caffarelli's theory to the case of the torus, and Delanoë [285] studied the stability of this theory under small perturbations. Roughly speaking, he showed the following: Given two smooth probability densities $f$ and $g$ on (say) $\mathbb{T}^n$ and a smooth optimal transport $T$ between $\mu = f\,\mathrm{vol}$ and $\nu = g\,\mathrm{vol}$, it is possible to slightly perturb $f$, $g$ and the Riemannian metric, in such a way that the resulting optimal transport is still smooth. (Note carefully: How much you are allowed to perturb the metric depends on $f$ and $g$.) Urbas [798] considered directly the boundary regularity for uniformly convex domains in $\mathbb{R}^n$, by first establishing a so-called oblique derivative boundary condition, which is a sort of nonlinear version of the Neumann condition. (Actually, the uniform convexity of the domain makes the oblique condition more elliptic, in some sense, than the Neumann condition.) Fully nonlinear elliptic equations with oblique boundary condition had been studied before in [555, 560, 561], and the connection with the second boundary value problem for the Monge–Ampère equation had been suggested in [562]. Compared to Caffarelli's method, this one only covers the global estimates, and requires higher initial regularity; but it is more elementary.

The generalization of these regularity estimates to nonquadratic cost functions stood as an open problem for some time. Then Ma, Trudinger and Wang [585] discovered that the older interior estimates by Wang [833] could be adapted to general cost functions satisfying $\mathfrak{S}_c > 0$ (this condition was called condition (A3) in their paper and in subsequent works). Theorem 12.52(ii) is extracted from this reference. A subtle caveat in [585] was corrected in [793] (see Theorem 1 there). A key property discovered in this study is that if $c$ is a regular cost function and $\psi$ is $c$-convex, then any local $c$-support function for $\psi$ is also a global $c$-support function (which is nontrivial unless $\psi$ is differentiable); an alternative proof can be derived from the method of Y.-H. Kim and McCann [519, 520]. Trudinger and Wang [794] later adapted the method of Urbas to treat the boundary regularity under the weaker condition $\mathfrak{S}_c \geq 0$ (there called (A3w)). The proof of Theorem 12.51 can be found there. At this point Loeper [570] made three crucial contributions to the theory. First he derived the very strong estimates in Theorem 12.52(i), which showed that the Ma–Trudinger–Wang (A3) condition (called (As) in Loeper's paper) leads to a theory which is stronger than the Euclidean one in some sense (this was already somehow implicit in [585]).
Secondly, he found a geometric interpretation of this condition, namely the regularity property (Definition 12.14), and related it to well-known geometric concepts such as sectional curvature (Particular Case 12.30). Thirdly, he proved that the weak condition (A3w) (called (Aw) in his work) is mandatory to derive regularity (Theorem 12.21). The psychological impact of this work was important: before that, the Ma–Trudinger–Wang condition could be seen as an obscure ad hoc assumption, while now it became the natural condition. The proof of Theorem 12.52(i) in [570] was based on approximation and used auxiliary results from [585] and [793] (which also used some of the arguments in [570]... but there is no loophole!)

Loeper [571] further proved that the squared distance on the sphere is a uniformly regular cost, and combined all the above elements to derive Theorem 12.58; the proof is simplified in [572]. In [571], Loeper derived smoothness estimates similar to those in Theorem 12.52 for the far-field reflector antenna problem. The exponent $\beta$ in Theorem 12.52(i) is explicit; for instance, in the case when $f = d\mu/dx$ is bounded above and $g$ is bounded below, Loeper obtained $\beta = (4n-1)^{-1}$, $n$ being the dimension. (See [572] for a simplified proof.) However, this is not optimal: Liu [566] improved this into $\beta = (2n-1)^{-1}$, which is sharp. In a different direction, Caffarelli, Gutiérrez and Huang [191] could get partial regularity for the far-field reflector antenna problem by very elaborate variants of Caffarelli's older techniques. This "direct" approach does not yet yield results as powerful as the a priori estimates by Loeper, Ma, Trudinger and Wang, since only $C^1$ regularity is obtained in [191], and only when the densities are bounded from above and below; but it gives new insights into the subject. In dimension 2, the whole theory of Monge–Ampère equations becomes much simpler, and has been the object of numerous studies [744]. Old results by Alexandrov [17] and Heinz [471] imply $C^1$ regularity of the solution $\psi$ of $\det(\nabla^2\psi) = h$ as soon as $h$ is bounded from above (and strict convexity if it is bounded from below). Loeper noticed that this implied strengthened results for the solution of optimal transport with quadratic cost in dimension 2, and together with Figalli [368] extended this result to regular cost functions. Now I shall briefly discuss counterexample stories. Counterexamples by Pogorelov and Caffarelli (see for instance [814, pp.
128–129]) show that solutions of the usual Monge–Amp`ere equation are not smooth in general: some strict convexity on the solution is needed, and it has to come from boundary conditions in one way or the other. The counterexample in Theorem 12.3 is taken from Caffarelli [185], where it is used to prove that the “Hessian measure” (a generalized formulation of the Hessian determinant) cannot be absolutely continu- ous if the bridge is thin enough; in the present notes I used a slightly different reasoning to directly prove the discontinuity of the optimal transport. The same can be said of Theorem 12.4, which is adapted from Loeper [570]. (In Loeper’s paper the contradiction was obtained indirectly as in Theorem 12.44.)

Ma, Trudinger and Wang [585, Section 7.3] generalized Caffarelli's counterexample, showing that for any nonconvex target domain, there are smooth positive densities leading to nonsmooth optimal transport. Their result holds for more general cost functions, up to the replacement of convexity by $c$-convexity. The method of proof used in [585] is adapted from an older paper by Wang [833] on the reflector antenna problem. A similar strategy was rediscovered by Loeper [570] to construct counterexamples in the style of Theorem 12.44. Theorem 12.7 was proven for these notes, with the aim of getting an elementary topological (as opposed to differential) version of Loeper's general nonsmoothness result. An old counterexample by Lewy [744, Section 9.5] shows that the general equation (12.37) needs certain convexity-type assumptions; see [750, Section 3.3] for a rewriting of this construction. In view of Lewy's counterexample, specialists of regularity theory did expect that smoothness would need some assumptions on the cost function, although there was no hint of the geometric insights found by Loeper. For nonregular cost functions, a natural question consists in describing the possible singularities of solutions of optimal transport problems. Do these singularities occur along reasonably nice curves, or with complicated, fractal-like geometries? No such result seems to be known. Proposition 12.15, Theorems 12.20 and 12.44, Remarks 12.31 and 12.33, and the direct implication in Theorem 12.36 are all taken from Loeper [570] (up to issues about the domain of definition of $\nabla_x c$). The fact that $\mathfrak{S}_c$ is defined independently of the coordinate system was also noticed independently by Kim and McCann [520]. Theorem 12.35 is due to Trudinger and Wang [793], as well as Examples 12.40 and 12.41, except for the case of $S^{n-1}$, which is due to Loeper [571].
The converse implication in Theorem 12.36 was proven independently by Trudinger and Wang [793] on the one hand, and by Y.-H. Kim and McCann [519] on the other. (The result by Trudinger and Wang is slightly more general but the proof is more sophisticated since it involves delicate smoothness theorems, while the argument by Kim and McCann is much more direct.) The proof which I gave in these notes is a simplified version of the Kim–McCann argument, which evolved from discussions with Trudinger. To go further, a refinement of the Ma–Trudinger–Wang conditions seemed to be necessary. The condition
$$\mathfrak{S}_c(x,y)\cdot(\xi,\eta) \;\geq\; K_0\,|\xi|^2\,|\tilde\eta|^2 \;+\; C_0\,\bigl(\nabla^2_{x,y}c\cdot(\xi,\eta)\bigr)^2$$

is called MTW($K_0,C_0$) in [572]; here $\tilde\eta = -(\nabla^2_{x,y}c)\cdot\eta$. In the particular case $C_0 = +\infty$, one finds again the Ma–Trudinger–Wang condition (strong if $K_0 > 0$, weak if $K_0 = 0$). If $c$ is the squared geodesic Riemannian distance, necessarily $C_0 \geq K_0$. According to [371], the squared distance on the sphere satisfies MTW($K_0,K_0$) for some $K_0 \in (0,1)$ (this improves on [520, 571, 824]); numerical evidence even suggests that this cost satisfies the doubly optimal condition MTW(1, 1). The proof of Theorem 12.36 was generalized in [572] to the case when the cost is the squared distance and satisfies an MTW($K_0,C_0$) condition. Moreover, one can afford to "slightly bend" the $c$-segments, and the inequality expressing regularity is also complemented with a remainder term proportional to $K_0\,t(1-t)\,(\inf|\eta|^2)\,|\zeta|^2$ (in the notation of the proof of Theorem 12.36). Figalli, Kim and McCann [367] have been working on the adaptation of Caffarelli's techniques to cost functions satisfying MTW(0, 0) (which is a reinforcement of the weak regularity property, satisfied by many examples which are not strictly regular). Remark 12.45 is due to Kim [518], who constructed a smooth perturbation of a very thin cone, for which the squared distance is not a regular cost function, even though the sectional curvature is positive everywhere. (It is not known whether positive sectional curvature combined with some pinching assumption would be sufficient to guarantee the regularity.) The transversal approximation property (TA) in Lemma 12.39 derives from a recent work of Loeper and myself [572], where it is proven for the squared distance on Riemannian manifolds with "nonfocal cut locus", i.e. no minimizing geodesic is focalizing. The denomination "transversal approximation" is because the approximating path $(\hat y_t)$ is constructed in such a way that it goes transversally through the cut locus.
Before that, Kim and McCann [519] had introduced a different density condition, namely that $\bigcap_t \mathrm{Dom}'\bigl(\nabla_y c(\,\cdot\,,y_t)\bigr)$ be dense in $M$. The latter assumption clearly holds on $S^n$, but is not true for general manifolds. On the contrary, Lemma 12.39 shows that (TA) holds with great generality; the principle of the proof of this lemma, based on the co-area formula, was suggested to me by Figalli. The particular co-area formula which I used can be found in [331, p. 109]; it is established in [352, Sections 2.10.25 and 2.10.26]. Property (Cut$^{n-1}$) for the squared geodesic distance was proven in independent papers by Itoh and Tanaka [489] and Li and Nirenberg [551]; in particular, inequality (12.36) is a particular case of Corollary 1.3 in the latter reference. These results are also established there for much more general classes of cost functions. In relation to the conjecture evoked in Remark 12.43, Loeper and I [572] proved that when the cost function is the squared distance on a Riemannian manifold with nonfocal cut locus, the strict form of the Ma–Trudinger–Wang condition automatically implies the uniform $c$-convexity of all the sets $\mathrm{Dom}'\bigl(\nabla_x c(x,\cdot\,)\bigr)$, and a strict version of the regularity property. It is not so easy to prove that the solution of a Monge–Ampère equation such as (12.1) is the solution of the Monge–Kantorovich problem. There is a standard method based on the strict maximum principle [416] to prove uniqueness of solutions of fully nonlinear elliptic equations, but it requires smoothness (up to the boundary), and cannot be used directly to assess the identity of the smooth solution with the Kantorovich potential, which solves (12.1) only in a weak sense. To establish the desired property, the strategy is to prove the $c$-convexity of the smooth solution; then the result follows from the uniqueness of the Kantorovich potential. This was the initial motivation behind Theorem 12.46, which was first proven by Trudinger and Wang [794, Section 6] under slightly different assumptions (see also the remark in Section 7 of the same work). All in all, Trudinger and Wang suggested three different methods to establish the $c$-convexity; one of them is a global comparison argument between the strong and the weak solution, in the style of Alexandrov. Trudinger suggested to me that the Kim–McCann strategy from [519] would yield an alternative proof of the $c$-convexity, and this is the method which I implemented to establish Theorem 12.46.
(The proof by Trudinger and Wang is more general in the sense that it does not need the $\check c$-convexity; however, this gain of generality might not be so important because the $c$-convexity of $\psi$ automatically implies the $c$-convexity of $\partial_c\psi$, and the $\check c$-convexity of $\partial_c\psi$, up to issues about the domain of differentiability of $c$.) In [585] a local uniqueness statement is needed, but this is tricky since $c$-convexity is a global notion. So the problem arises whether a locally $c$-convex function (meaning that for each $x$ there is $y$ such that $x$ is a local minimizer of $\psi + c(\,\cdot\,,y)$) is automatically $c$-convex. This local-to-global problem, which is closely related to Theorem 12.46 (and to the possibility of localizing the study of the Monge–Ampère equation (12.35)), is solved affirmatively for uniformly regular cost functions

in [793], where a number of variants of $c$-convexity (local $c$-convexity, full $c$-convexity, strict $c$-convexity, global $c$-convexity) are carefully discussed. See also [571] for the case of the sphere. In this chapter I chose to start with geometric (more general) considerations, such as the regularity property, and to end up with analytic conditions in terms of the Ma–Trudinger–Wang tensor; but actually the tensor was discovered before the geometric conditions. The role of the Riemannian structure (and geodesic coordinates) in the presentation of this chapter might also seem artificial, since it was noticed in Remark 12.31 that the meaningful quantities are actually independent of these choices. As a matter of fact, Kim and McCann [520] develop a framework which avoids any reference to them and identifies the Ma–Trudinger–Wang tensor as the sectional curvature tensor of the mixed second derivative $\partial^2 c/\partial x\,\partial y$, considered as a pseudo-Riemannian metric (with signature $(n,n)$) on the product manifold. In the same reference they also point out interesting connections with pseudo-Riemannian, Lagrangian and symplectic geometry (related to [840]). Now some comments about terminology. The terminology of $c$-curvature was introduced by Loeper [570] after he made the connection between the Ma–Trudinger–Wang tensor and the sectional curvature. It was Trudinger who suggested the term "regular" to describe a matrix-valued function $A$ in (12.37) that would satisfy adequate assumptions of the type $\mathfrak{S}_c \geq 0$. By extension, I used the same denomination for cost functions satisfying the property of Definition 12.14. Kim and McCann [519] call this property (DASM) (Double mountain Above Sliding Mountain). To apply Theorems 12.51 or 12.52 to problems where the cost function is not everywhere differentiable, one first needs to make sure that the $c$-subdifferential of the unknown function lies within the domain of differentiability.
Typically (for the cost function $d(x,y)^2$ on a Riemannian manifold), this means controlling the distance of $T(x)$ to the cut locus of $x$, where $T$ is the optimal transport. ("Stay away from cut locus!") Until recently, the only manifolds for which this was known to be true independently of the probability densities (say bounded positive) were positively curved manifolds where all geodesics have the same length: the sphere treated in [286] and [571]; and the projective space considered in [287]. In the latter work, it was shown that the "stay-away" property still holds true if the variation of the length of geodesics is small with respect to certain other geometric quantities, and the probability densities satisfy certain size restrictions. Then Loeper and I [572] established smoothness estimates for the optimal transport on $C^4$ perturbations of the projective space, without any size restriction.

The cut locus is also a major issue in the study of the perturbation of these smoothness results. Because the dependence of the geodesic distance on the Riemannian metric is not smooth near the cut locus, it is not clear whether the Ma–Trudinger–Wang condition is stable under $C^k$ perturbations of the metric, however large $k$ may be. This stability problem, first formulated in [572], is in my opinion extremely interesting; it is solved by Figalli and Rifford [371] near $S^2$. Without knowing the stability of the Ma–Trudinger–Wang condition, if pointwise a priori bounds on the probability densities are given, one can afford a $C^4$ perturbation of the metric and retain the Hölder continuity of optimal transport; or even afford a $C^2$ perturbation and retain a mesoscopic version of the Hölder continuity [822].

Some of the smoothness estimates discussed in these notes also hold for other, more complicated fully nonlinear equations, such as the reflector antenna problem [507] (which in its general formulation does not seem to be equivalent to an optimal transport problem) or the so-called Hessian equations [789, 790, 792, 800], where the dominant term is a symmetric function of the eigenvalues of the Hessian of the unknown. The short survey by Trudinger [788] presents some results of this type, with applications to conformal geometry, and puts this into perspective together with optimal transport. In this reference Trudinger also notes that the problem of the prescribed Schouten tensor resembles an optimal transport problem with logarithmic cost function; this connection had also been made by McCann (see the remarks in [520]), who had long ago noticed the properties of conformal invariance of this cost function.
A topic which I did not address at all is the regularity of certain sets solving variational problems involving optimal transport; see [632].

13 Qualitative picture

This chapter is devoted to a recap of the whole picture of optimal transport on a smooth Riemannian manifold $M$. For simplicity I shall not try to impose the most general assumptions. A good understanding of this chapter is sufficient to attack Part II of this course.

Recap

Let $M$ be a smooth complete connected Riemannian manifold, $L(x,v,t)$ a $C^2$ Lagrangian function on $TM \times [0,1]$, satisfying the classical conditions of Definition 7.6, together with $\nabla_v^2 L > 0$. Let $c : M \times M \to \mathbb{R}$ be the induced cost function:
$$c(x,y) = \inf\left\{ \int_0^1 L(\gamma_t, \dot\gamma_t, t)\,dt;\quad \gamma_0 = x,\ \gamma_1 = y \right\}.$$
More generally, define
$$c^{s,t}(x,y) = \inf\left\{ \int_s^t L(\gamma_\tau, \dot\gamma_\tau, \tau)\,d\tau;\quad \gamma_s = x,\ \gamma_t = y \right\}.$$
So $c^{s,t}(x,y)$ is the optimal cost to go from point $x$ at time $s$ to point $y$ at time $t$.

I shall consider three cases: (i) $L(x,v,t)$ arbitrary on a compact manifold; (ii) $L(x,v,t) = |v|^2/2$ on a complete manifold (so the cost is $d^2/2$, where $d$ is the distance); (iii) $L(x,v,t) = |v|^2/2$ in $\mathbb{R}^n$ (so the cost is $|x-y|^2/2$). Throughout the sequel, I denote by $\mu_0$ the initial probability measure, and by $\mu_1$ the final one. When I say "absolutely

continuous" or "singular", this is in reference to the volume measure on the manifold (Lebesgue measure in $\mathbb{R}^n$).

Recall that a generalized optimal coupling is a $c$-cyclically monotone coupling. By analogy, I shall say that a generalized displacement interpolation is a path $(\mu_t)_{0\le t\le 1}$, valued in the space of probability measures, such that $\mu_t = \mathrm{law}\,(\gamma_t)$ and $\gamma$ is a random minimizing curve such that $(\gamma_0,\gamma_1)$ is a generalized optimal coupling. These notions are interesting only when the total cost between $\mu_0$ and $\mu_1$ is infinite.

By gathering the results from the previous chapters, we know:

1. There always exists:
• an optimal coupling (or generalized optimal coupling) $(x_0, x_1)$, with law $\pi$;
• a displacement interpolation (or generalized displacement interpolation) $(\mu_t)_{0\le t\le 1}$;
• a random minimizing curve $\gamma$ with law $\Pi$;
such that $\mathrm{law}\,(\gamma_0,\gamma_1) = \pi$ and $\mathrm{law}\,(\gamma_t) = \mu_t$. Each curve $\gamma$ is a solution of the Euler–Lagrange equation
$$\frac{d}{dt}\,\nabla_v L(\gamma_t, \dot\gamma_t, t) = \nabla_x L(\gamma_t, \dot\gamma_t, t). \tag{13.1}$$
In the case of a quadratic Lagrangian, this equation reduces to
$$\frac{d^2\gamma_t}{dt^2} = 0,$$
so trajectories are just geodesics, or straight lines in $\mathbb{R}^n$. Two trajectories in the support of $\Pi$ may intersect at time $t=0$ or $t=1$, but never at intermediate times.

2. If either $\mu_0$ or $\mu_1$ is absolutely continuous, then so is $\mu_t$, for all $t \in (0,1)$.

3. If $\mu_0$ is absolutely continuous, then the optimal coupling $(x_0, x_1)$ is unique (in law), deterministic ($x_1 = T(x_0)$), and characterized by the equation
$$\nabla\psi(x_0) = -\nabla_x c(x_0, x_1) = \nabla_v L(x_0, \dot\gamma_0, 0), \tag{13.2}$$
where $\gamma = (\gamma_t)_{0\le t\le 1}$ is the minimizing curve joining $\gamma_0 = x_0$ to $\gamma_1 = x_1$ (it is part of the theorem that this curve is almost surely unique), and $\psi$ is a $c$-convex function, that is, it can be written as
$$\psi(x) = \sup_{y\in M}\,\bigl[\phi(y) - c(x,y)\bigr]$$

for some nontrivial (i.e. not identically $-\infty$, and never $+\infty$) function $\phi$. In case (ii), if nothing is known about the behavior of the distance function at infinity, then the gradient $\nabla$ in the left-hand side of (13.2) should be replaced by an approximate gradient $\tilde\nabla$.

4. Under the same assumptions, the (generalized) displacement interpolation $(\mu_t)_{0\le t\le 1}$ is unique. This follows from the almost sure uniqueness of the minimizing curve joining $\gamma_0$ to $\gamma_1$, where $(\gamma_0,\gamma_1)$ is the optimal coupling. (Corollary 7.23 applies when the total cost is finite; but even if the total cost is infinite, we can apply a reasoning similar to the one in Corollary 7.23. Note that the result does not follow from the $\mathrm{vol}\otimes\mathrm{vol}\,(dx_0\,dx_1)$-uniqueness of the minimizing curve joining $x_0$ to $x_1$.)

5. Without loss of generality, one might assume that
$$\phi(y) = \inf_{x\in M}\,\bigl[\psi(x) + c(x,y)\bigr]$$
(these are true supremum and true infimum, not just up to a negligible set). One can also assume without loss of generality that
$$\forall x,y \in M, \qquad \phi(y) - \psi(x) \le c(x,y)$$
and
$$\phi(x_1) - \psi(x_0) = c(x_0,x_1) \quad \text{almost surely}.$$

6. It is still possible that two minimizing curves meet at time $t=0$ or $t=1$, but this event may occur only on a very small set, of dimension at most $n-1$.

7. All of the above remains true if one replaces $\mu_0$ at time 0 by $\mu_t$ at time $t$, with obvious changes of notation (e.g. replace $c = c^{0,1}$ by $c^{t,1}$); the function $\phi$ is unchanged, but now $\psi$ should be changed into $\psi_t$ defined by
$$\psi_t(y) = \inf_{x\in M}\,\bigl[\psi(x) + c^{0,t}(x,y)\bigr]. \tag{13.3}$$
This $\psi_t$ is a (viscosity) solution of the forward Hamilton–Jacobi equation
$$\partial_t\psi_t + L^*\bigl(x, \nabla\psi_t(x), t\bigr) = 0.$$

8. The equation for the optimal transport $T_t$ between $\mu_0$ and $\mu_t$ is as follows: $T_t(x)$ is the solution at time $t$ of the Euler–Lagrange equation starting from $x$ with velocity
$$v_0(x) = \bigl(\nabla_v L(x,\,\cdot\,,0)\bigr)^{-1}\bigl(\nabla\psi(x)\bigr). \tag{13.4}$$

In particular:
• For the quadratic cost on a Riemannian manifold $M$, $T_t(x) = \exp_x(t\,\nabla\psi(x))$: to obtain $T_t$, flow for time $t$ along a geodesic starting at $x$ with velocity $\nabla\psi(x)$ (or rather $\tilde\nabla\psi(x)$ if nothing is known about the behavior of $\psi$ at infinity);
• For the quadratic cost in $\mathbb{R}^n$, $T_t(x) = (1-t)\,x + t\,\nabla\Psi(x)$, where $\Psi(x) = |x|^2/2 + \psi(x)$ defines a lower semicontinuous convex function in the usual sense. In particular, the optimal transport from $\mu_0$ to $\mu_1$ is the gradient of a convex function, and this property characterizes it uniquely among all admissible transports.

9. Whenever $0 \le t_0 < t_1 \le 1$,
$$C^{t_0,t_1}(\mu_{t_0},\mu_{t_1}) = \int \psi_{t_1}\,d\mu_{t_1} - \int \psi_{t_0}\,d\mu_{t_0} = \int\!\!\int_{t_0}^{t_1} L\bigl(x,\,[(\nabla_v L)(x,\,\cdot\,,t)]^{-1}(\nabla\psi_t(x)),\,t\bigr)\,d\mu_t(x)\,dt;$$
recall indeed Theorems 7.21 and 7.36, Remarks 7.25 and 7.37, and (13.4).

Simple as they may seem by now, these statements summarize years of research. If the reader has understood them well, then he or she is ready to go on with the rest of this course. The picture is not really complete and some questions remain open, such as the following:

Open Problem 13.1. If the initial and final densities, $\rho_0$ and $\rho_1$, are positive everywhere, does this imply that the intermediate densities $\rho_t$ are also positive? Otherwise, can one identify simple sufficient conditions for the density of the displacement interpolant to be positive everywhere?

For general Lagrangian actions, the answer to this question seems to be negative, but it is not clear that one can also construct counterexamples for, say, the basic quadratic Lagrangian. My personal guess would be that the answer is about the same as for the smoothness: Positivity of the displacement interpolant is in general false, except maybe for some particular manifolds satisfying an adequate structure condition.
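In the model case (iii), these recap statements can be checked numerically. On the real line, the optimal map for the quadratic cost is the monotone rearrangement (the derivative of a convex function), and displacement interpolation moves each particle in a straight line from $x$ to $T(x)$, as in item 8. The following sketch is a finite-sample discretization of my own, not part of the text; it verifies the optimality of the monotone matching and the non-crossing of trajectories at intermediate times:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
x = np.sort(rng.normal(0.0, 1.0, n))   # empirical mu_0 (n atoms of mass 1/n)
y = np.sort(rng.normal(3.0, 0.5, n))   # empirical mu_1

# Quadratic cost on R: the optimal coupling pairs sorted samples
# (monotone rearrangement = derivative of a convex function in 1D).
def cost(perm):
    return 0.5 * np.sum((x - y[perm]) ** 2)

monotone = cost(np.arange(n))          # sorted-to-sorted matching
shuffled = cost(rng.permutation(n))    # an arbitrary competitor
assert monotone <= shuffled            # optimality of the monotone matching

# Displacement interpolation: mu_t = law of (1 - t) x + t T(x);
# each particle follows a straight trajectory, as dictated by (13.1).
def interpolant(t):
    return (1 - t) * x + t * y

assert np.allclose(interpolant(0.0), x) and np.allclose(interpolant(1.0), y)
# Trajectories do not cross at intermediate times: the order is preserved.
assert np.all(np.diff(interpolant(0.5)) >= 0)
```

The same monotone structure is what item 3 expresses in arbitrary dimension through the gradient of the convex function $\Psi$.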

Standard approximation procedure

In this last section I have gathered two useful approximation results which can be used in problems where the probability measures are either noncompactly supported, or singular.

In Chapter 10 we have seen how to treat the Monge problem in noncompact situations, without any condition at infinity, thanks to the notion of approximate differentiability. However, in practice, to treat noncompact situations, the simplest solution is often to use again a truncation argument similar to the one used in the proof of approximate differentiability. The next proposition displays the main scheme that one can use to deal with such situations.

Proposition 13.2 (Standard approximation scheme). Let $M$ be a smooth complete Riemannian manifold, let $c = c(x,y)$ be a cost function associated with a Lagrangian $L(x,v,t)$ on $TM \times [0,1]$, satisfying the classical conditions of Definition 7.6; and let $\mu_0$, $\mu_1$ be two probability measures on $M$. Let $\pi$ be an optimal transference plan between $\mu_0$ and $\mu_1$, let $(\mu_t)_{0\le t\le 1}$ be a displacement interpolation and let $\Pi$ be a dynamical optimal transference plan such that $(e_0,e_1)_\#\Pi = \pi$, $(e_t)_\#\Pi = \mu_t$. Let $\Gamma$ be the set of all action-minimizing curves, equipped with the topology of uniform convergence; and let $(K_\ell)_{\ell\in\mathbb{N}}$ be a sequence of compact sets in $\Gamma$, such that $\Pi[\cup K_\ell] = 1$. For $\ell$ large enough, $\Pi[K_\ell] > 0$; then define
$$Z_\ell := \Pi[K_\ell]; \qquad \Pi_\ell := \frac{1_{K_\ell}\,\Pi}{Z_\ell}; \qquad \mu_{t,\ell} := (e_t)_\#\Pi_\ell; \qquad \pi_\ell := (e_0,e_1)_\#\Pi_\ell;$$
and let $c_\ell$ be the restriction of $c$ to $\mathrm{proj}_{M\times M}(K_\ell)$. Then for each $\ell$, $(\mu_{t,\ell})_{0\le t\le 1}$ is a displacement interpolation and $\pi_\ell$ is an associated optimal transference plan; $\mu_{t,\ell}$ is compactly supported, uniformly in $t \in [0,1]$; and the following monotone convergences hold true:
$$Z_\ell \uparrow 1; \qquad Z_\ell\,\Pi_\ell \uparrow \Pi; \qquad Z_\ell\,\pi_\ell \uparrow \pi; \qquad Z_\ell\,\mu_{t,\ell} \uparrow \mu_t.$$

If moreover $\mu_0$ is absolutely continuous, then there exists a $c$-convex function $\psi : M \to \mathbb{R}\cup\{+\infty\}$ such that $\pi$ is concentrated on the graph of the transport $T : x \mapsto (\nabla_x c)^{-1}\bigl(x, -\tilde\nabla\psi(x)\bigr)$. For any $\ell$, $\mu_{0,\ell}$ is absolutely continuous, and the optimal transference plan $\pi_\ell$ is deterministic. Furthermore, there is a $c$-convex function $\psi_\ell$ such that $\psi_\ell$ coincides with

$\psi$ everywhere on $C_\ell := \mathrm{proj}_M(\mathrm{Spt}(\pi_\ell))$; and there is a set $Z_\ell$ such that $\mathrm{vol}\,[C_\ell \setminus Z_\ell] = 0$, and for any $x \in Z_\ell$, $\tilde\nabla\psi(x) = \nabla\psi_\ell(x)$.

Still under the assumption that $\mu_0$ is absolutely continuous, the measures $\mu_{t,\ell}$ are also absolutely continuous, and the optimal transport $T_{t_0\to t,\ell}$ between $\mu_{t_0,\ell}$ and $\mu_{t,\ell}$ is deterministic, for any given $t_0 \in [0,1)$ and $t \in [0,1]$. In addition, $T_{t_0\to t,\ell} = T_{t_0\to t}$, $\mu_{t_0,\ell}$-almost surely, where $T_{t_0\to t}$ is the optimal transport from $\mu_{t_0}$ to $\mu_t$.

Proof of Proposition 13.2. The proof is quite similar to the argument used in the proof of uniqueness in Theorem 10.42, in a time-independent context. It is no problem to make this into a time-dependent version, since displacement interpolation behaves well under restriction; recall Theorem 7.30. The last part of the theorem follows from the fact that the map $T_{t_0\to t,\ell}$ can be written as $\gamma_{t_0} \mapsto \gamma_t$. ⊓⊔

Remark 13.3. Proposition 13.2 will be used several times throughout this course, for instance in Chapter 17. Its main drawback is that there is absolutely no control of the smoothness of the approximations: Even if the densities $\rho_0$ and $\rho_1$ are smooth, the approximate densities $\rho_{0,\ell}$ and $\rho_{1,\ell}$ will in general be discontinuous. In the proof of Theorem 23.14 in Chapter 23, I shall use another approximation scheme which respects the smoothness, but at the price of a loss of control on the approximation of the transport.

Let us now turn to the problem of approximating singular transport problems by smooth ones. If $\mu_0$ and $\mu_1$ are singular, there is a priori no uniqueness of the optimal transference plans, and actually there might be a large number (possibly uncountable) of them. However, the next theorem shows that singular optimal transference plans can always be approximated by nice ones.

Theorem 13.4 (Regularization of singular transport problems).
Let $M$ be a smooth complete Riemannian manifold, and $c : M \times M \to \mathbb{R}$ be a cost function induced by a Lagrangian $L(x,v,t)$ satisfying the classical conditions of Definition 7.6. Further, let $\mu_0$ and $\mu_1$ be two probability measures on $M$, such that the optimal transport cost between $\mu_0$ and $\mu_1$ is finite, and let $\pi$ be an optimal transference plan between $\mu_0$ and $\mu_1$. Then there are sequences $(\mu_0^k)_{k \in \mathbb{N}}$, $(\mu_1^k)_{k \in \mathbb{N}}$ and $(\pi^k)_{k \in \mathbb{N}}$ such that

(i) each $\pi^k$ is an optimal transference plan between $\mu_0^k$ and $\mu_1^k$, and any one of the probability measures $\mu_0^k$, $\mu_1^k$ has a smooth, compactly supported density;

(ii) $\mu_0^k \to \mu_0$, $\mu_1^k \to \mu_1$, $\pi^k \to \pi$ in the weak sense as $k \to \infty$.

Proof of Theorem 13.4. By Theorem 7.21, there exists a displacement interpolation $(\mu_t)_{0 \le t \le 1}$ between $\mu_0$ and $\mu_1$; let $(\gamma_t)_{0 \le t \le 1}$ be such that $\mu_t = \mathrm{law}\,(\gamma_t)$. The assumptions on $L$ imply that action-minimizing curves solve a differential equation with Lipschitz coefficients, and therefore are uniquely determined by their initial position and velocity, a fortiori by their restriction to some time-interval $[0, t_0]$. So for any $t_0 \in (0, 1/2)$, by Theorem 7.30(ii), $(\gamma_{t_0}, \gamma_{1-t_0})$ is the unique optimal coupling between $\mu_{t_0}$ and $\mu_{1-t_0}$. Now it is easy to construct a sequence $(\mu_{t_0}^k)_{k \in \mathbb{N}}$ such that $\mu_{t_0}^k$ converges weakly to $\mu_{t_0}$ as $k \to \infty$, and each $\mu_{t_0}^k$ is compactly supported with a smooth density. (To construct such a sequence, first truncate to ensure the property of compact support, then localize to charts by a partition of unity, and apply a regularization in each chart.) Similarly, construct a sequence $(\mu_{1-t_0}^k)_{k \in \mathbb{N}}$ such that $\mu_{1-t_0}^k$ converges weakly to $\mu_{1-t_0}$, and each $\mu_{1-t_0}^k$ is compactly supported with a smooth density. Let $\pi_{t_0, 1-t_0}^k$ be the unique optimal transference plan between $\mu_{t_0}^k$ and $\mu_{1-t_0}^k$. By stability of optimal transport (Theorem 5.20), $\pi_{t_0, 1-t_0}^k$ converges as $k \to \infty$ to $\pi_{t_0, 1-t_0} = \mathrm{law}\,(\gamma_{t_0}, \gamma_{1-t_0})$.

Then by continuity of $\gamma$, the random variable $(\gamma_{t_0}, \gamma_{1-t_0})$ converges pointwise to $(\gamma_0, \gamma_1)$ as $t_0 \to 0$, which implies that $\pi_{t_0, 1-t_0}$ converges weakly to $\pi_{0,1} = \pi$. The conclusion follows by choosing $t_0 = t_0(n) = 1/n$, $k = k(n)$ large enough. ⊓⊔

Equations of displacement interpolation

In Chapter 7, we understood that a curve $(\mu_t)_{0 \le t \le 1}$ obtained by displacement interpolation solves an action minimization problem in the space of measures, and we wondered whether we could obtain some nice equations for these curves. Here now is a possible answer. For simplicity I shall assume that there is enough control at infinity, so that the notion of approximate differentiability can be dispensed with (this is the case for instance if $M$ is compact).

Consider a displacement interpolation $(\mu_t)_{0 \le t \le 1}$. By Theorem 7.21, $\mu_t$ can be seen as the law of $\gamma_t$, where the random path $(\gamma_t)_{0 \le t \le 1}$ satisfies the Euler–Lagrange equation (13.1), and so at time $t$ has velocity $\xi_t(\gamma_t)$, where $\xi_t(x) := (\nabla_v L(x, \cdot, t))^{-1}(\nabla\psi_t(x))$. By the formula of conservation of mass, $\mu_t$ satisfies
\[
\frac{\partial \mu_t}{\partial t} + \nabla \cdot (\xi_t\, \mu_t) = 0
\]
in the sense of distributions (be careful: $\xi_t$ is not necessarily a gradient, unless $L$ is quadratic). Then we can write down the equations of displacement interpolation:
\[
\begin{cases}
\dfrac{\partial \mu_t}{\partial t} + \nabla \cdot (\xi_t\, \mu_t) = 0; \\[2mm]
\nabla \psi_t(x) = \nabla_v L\big(x, \xi_t(x), t\big); \\[1mm]
\psi_0 \ \text{is $c$-convex}; \\[1mm]
\partial_t \psi_t + L^*\big(x, \nabla \psi_t(x), t\big) = 0.
\end{cases}
\tag{13.5}
\]
If the cost function is just the square of the distance, then these equations become
\[
\begin{cases}
\dfrac{\partial \mu_t}{\partial t} + \nabla \cdot (\xi_t\, \mu_t) = 0; \\[2mm]
\xi_t(x) = \nabla \psi_t(x); \\[1mm]
\psi_0 \ \text{is $d^2/2$-convex}; \\[1mm]
\partial_t \psi_t + \dfrac{|\nabla \psi_t|^2}{2} = 0.
\end{cases}
\tag{13.6}
\]
Finally, for the square of the Euclidean distance, this simplifies into
\[
\begin{cases}
\dfrac{\partial \mu_t}{\partial t} + \nabla \cdot (\xi_t\, \mu_t) = 0; \\[2mm]
\xi_t(x) = \nabla \psi_t(x); \\[1mm]
x \mapsto \psi_0(x) + \dfrac{|x|^2}{2} \ \text{is lower semicontinuous convex}; \\[1mm]
\partial_t \psi_t + \dfrac{|\nabla \psi_t|^2}{2} = 0.
\end{cases}
\tag{13.7}
\]
Apart from the special choice of initial datum, the latter system is well-known in physics as the pressureless Euler equation, for a potential velocity field.
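The Hamilton–Jacobi equation appearing as the last line of (13.6) and (13.7) can be probed numerically. The following sketch (my own illustration, not part of the text) uses the classical Hopf–Lax formula $\psi_t(y) = \inf_x\,[\psi_0(x) + |x-y|^2/(2t)]$, which gives the solution of $\partial_t\psi + |\nabla\psi|^2/2 = 0$ in the Euclidean case, and checks by finite differences that the equation is satisfied; the grid, the test point and the small initial datum $\psi_0$ are arbitrary choices.

```python
import numpy as np

# Hopf-Lax formula: psi_t(y) = inf_x [ psi_0(x) + |x - y|^2 / (2t) ]
# solves  d_t psi + |grad psi|^2 / 2 = 0  (last equation of (13.7)).
def hopf_lax(psi0_vals, grid, t, y):
    # infimum taken over the sample grid (an approximation of the true inf)
    return np.min(psi0_vals + (grid - y) ** 2 / (2.0 * t))

grid = np.linspace(-10.0, 10.0, 4001)
psi0 = 0.1 * np.cos(grid)          # a small, smooth initial datum (arbitrary)

t, y = 1.0, 0.3
dt, dy = 1e-2, 5 * (grid[1] - grid[0])

# centered finite differences of psi at (t, y)
d_t = (hopf_lax(psi0, grid, t + dt, y) - hopf_lax(psi0, grid, t - dt, y)) / (2 * dt)
d_y = (hopf_lax(psi0, grid, t, y + dy) - hopf_lax(psi0, grid, t, y - dy)) / (2 * dy)

residual = d_t + d_y ** 2 / 2.0
print(abs(residual))               # small: the HJ equation is satisfied
```

The characteristics of this equation are precisely the straight lines $x \mapsto x + t\,\nabla\psi_0(x)$ along which the particles of the pressureless Euler system travel.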

Quadratic cost function

In a context of Riemannian geometry, it is natural to focus on the quadratic Lagrangian cost function, or equivalently on the cost function $c(x,y) = d(x,y)^2$, and consider the Wasserstein space $P_2(M)$. This will be the core of all the transport proofs in Part II of this course, so a key role will be played by $c$-convex functions (that is, $d^2/2$-convex functions for $c = d^2/2$). In Part III we shall consider metric structures that are not Riemannian, but still the square of the distance will be the only cost function. So in the remainder of this chapter I shall focus on that particular cost.

The class of $d^2/2$-convex functions might look a bit mysterious, and if they are so important it would be good to have simple characterizations of them. If $\psi$ is $d^2/2$-convex, then $z \mapsto \psi(z) + d(z,y)^2/2$ should have a minimum at $x$ when $y = \exp_x(\nabla\psi(x))$. If in addition $\psi$ is twice differentiable at $x$, then necessarily
\[
\nabla^2 \psi(x) \ \ge\ -\, \nabla^2 \left[ \frac{d\big(\exp_x(\nabla\psi(x)),\, \cdot\, \big)^2}{2} \right] (x).
\tag{13.8}
\]
However, this is only a necessary condition, and it is not clear that it would imply $d^2/2$-convexity, except for a manifold satisfying the very strong curvature condition $S_{d^2/2} \ge 0$ as in Theorem 12.46.

On the other hand, there is a simple and useful general criterion according to which sufficiently small functions are $d^2/2$-convex. This statement will guarantee in particular that any tangent vector $v \in TM$ can be represented as the gradient of a $d^2/2$-convex function.

Theorem 13.5 ($C^2$-small functions are $d^2/2$-convex). Let $M$ be a Riemannian manifold, and let $K$ be a compact subset of $M$. Then there is $\varepsilon > 0$ such that any function $\psi \in C^2_c(M)$ satisfying
\[
\mathrm{Spt}(\psi) \subset K, \qquad \|\psi\|_{C^2_b} \le \varepsilon
\]
is $d^2/2$-convex.

Example 13.6. Let $M = \mathbb{R}^n$. Then $\psi$ is $d^2/2$-convex if $\nabla^2 \psi \ge - I_n$. (In this particular case there is no need for compact support, and a one-sided bound on the second derivative is sufficient.)
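In the Euclidean case of Example 13.6, the criterion can be justified in a few lines by Legendre duality (a sketch I add here; it is not part of the text):

```latex
% Expand the quadratic cost:
%   c(x,y) = |x-y|^2/2 = |x|^2/2 - x\cdot y + |y|^2/2.
% Hence psi is c-convex, i.e. psi(x) = sup_y [zeta(y) - c(x,y)] for some zeta,
% if and only if
\psi(x) + \frac{|x|^2}{2}
  \;=\; \sup_y \Big[\, x \cdot y \;-\; \Big( \frac{|y|^2}{2} - \zeta(y) \Big) \Big],
% that is, if and only if psi(x) + |x|^2/2 is a supremum of affine functions
% of x: lower semicontinuous convex.
```

If $\nabla^2\psi \ge -I_n$, then $\nabla^2\big(\psi + |\cdot|^2/2\big) = \nabla^2\psi + I_n \ge 0$, so $\psi + |\cdot|^2/2$ is convex and $\psi$ is $d^2/2$-convex; this is the same convexity condition as the one imposed on $\psi_0$ in system (13.7).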
Proof of Theorem 13.5. Let $(M,g)$ be a Riemannian manifold, and let $K$ be a compact subset of $M$. Let $K' = \{x \in M;\ d(x,K) \le 1\}$. For any $y \in M$, the Hessian of $x \mapsto d(x,y)^2/2$ is equal to $I_n$ (or, more rigorously, to the identity on $T_x M$) at $x = y$; so by compactness one may find $\delta > 0$ such that the Hessian of $x \mapsto d(x,y)^2/2$ remains larger than $I_n/2$ as long as $y$ stays in $K'$ and $d(x,y) < 2\delta$. Without loss of generality, $\delta < 1/2$.

Now let $\psi$ be supported in $K$, and such that
\[
\forall x \in M, \qquad |\psi(x)| < \frac{\delta^2}{4}, \qquad |\nabla^2\psi(x)| < \frac{1}{4};
\]
write
\[
f_y(x) = \psi(x) + \frac{d(x,y)^2}{2}
\]
and note that $\nabla^2 f_y \ge I_n/4$ in $B_{2\delta}(y)$, so $f_y$ is uniformly convex in that ball.

If $y \in K'$ and $d(x,y) \ge \delta$, then obviously $f_y(x) \ge \delta^2/4 > \psi(y) = f_y(y)$; so the minimum of $f_y$ can be achieved only in $B_\delta(y)$. If there are two distinct such minima, say $x_0$ and $x_1$, then we can join them by a geodesic $(\gamma_t)_{0 \le t \le 1}$ which stays within $B_{2\delta}(y)$, and then the function $t \mapsto f_y(\gamma_t)$ is uniformly convex (because $f_y$ is uniformly convex in $B_{2\delta}(y)$), with a minimum at $t = 0$ and $t = 1$, which is impossible.

If $y \notin K'$, then $\psi(x) \neq 0$ implies $d(x,y) \ge 1$, so $f_y(x) \ge (1/2) - \delta^2/4$, while $f_y(y) = 0$. So the minimum of $f_y$ can only be achieved at $x$ such that $\psi(x) = 0$, and it has to be at $x = y$.

In any case, $f_y$ has exactly one minimum, which lies in $B_\delta(y)$. We shall denote it by $x = T(y)$, and it is characterized as the unique solution of the equation
\[
\nabla\psi(x) + \nabla_x \left( \frac{d(x,y)^2}{2} \right) = 0,
\tag{13.9}
\]
where $x$ is the unknown.

Let $x$ be arbitrary in $M$, and $y = \exp_x(\nabla\psi(x))$. Then (as a consequence of the first variation formula), $\nabla_x [d(x,y)^2/2] = -\nabla\psi(x)$, so equation (13.9) holds true, and $x = T(y)$. This means that, with the notation $c(x,y) = d(x,y)^2/2$, one has $\psi^c(y) = \psi(x) + c(x,y)$. Then $\psi^{cc}(x) = \sup_y\,[\psi^c(y) - c(x,y)] \ge \psi(x)$. Since $x$ is arbitrary, we have actually shown that $\psi^{cc} \ge \psi$; but the converse inequality is always true, so $\psi^{cc} = \psi$, and then $\psi$ is $c$-convex. ⊓⊔

Remark 13.7. The end of the proof took advantage of a general principle, independent of the particular cost $c$: If there is a surjective map

$T : y \mapsto x$ such that $f_y : x \mapsto \psi(x) + c(x,y)$ is minimum at $T(y)$, then $\psi$ is $c$-convex.

The structure of $P_2(M)$

A striking discovery made by Otto at the end of the nineties is that the differentiable structure on a Riemannian manifold $M$ induces a kind of differentiable structure in the space $P_2(M)$. This idea takes substance from the following remarks: All of the path $(\mu_t)_{0 \le t \le 1}$ is determined from the initial velocity field $\xi_0(x)$, which in turn is determined by $\nabla\psi_0$ as in (13.4). So it is natural to think of the function $\nabla\psi_0$ as a kind of "initial velocity" for the path $(\mu_t)$. The conceptual shift here is about the same as when we decided that $\mu_t$ could be seen either as the law of a random minimizing curve at time $t$, or as a path in the space of measures: Now we decide that $\nabla\psi_0$ can be seen either as the field of initial velocities of our minimizing curves, or as the (abstract) velocity of the path $(\mu_t)$ at time $t = 0$.

There is an abstract notion of tangent space $T_x \mathcal{X}$ (at point $x$) to a metric space $(\mathcal{X}, d)$: in technical language, this is the pointed Gromov–Hausdorff limit of the rescaled space. It is a rather natural notion: fix your point $x$, and zoom onto it, by multiplying all distances by a large factor $\varepsilon^{-1}$, while keeping $x$ fixed. This gives a new metric space $\mathcal{X}_{x,\varepsilon}$, and if one is not too curious about what happens far away from $x$, then the space $\mathcal{X}_{x,\varepsilon}$ might converge in some nice sense to some limit space, that may not be a vector space, but in any case is a cone. If that limit space exists, it is said to be the tangent space (or tangent cone) to $\mathcal{X}$ at $x$. (I shall come back to these issues in Part III.)

In terms of that construction, the intuition sketched above is indeed correct: let $P_2(M)$ be the metric space consisting of probability measures on $M$, equipped with the Wasserstein distance $W_2$. If $\mu$ is absolutely continuous, then the tangent cone $T_\mu P_2(M)$ exists and can be identified isometrically with the closed vector space generated by $d^2/2$-convex functions $\psi$, equipped with the norm
\[
\|\nabla\psi\|_{L^2(\mu;\, TM)} := \left( \int_M |\nabla\psi(x)|^2 \, d\mu(x) \right)^{1/2}.
\]
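The identity $\psi^{cc} = \psi$ from the end of the proof of Theorem 13.5 can be checked numerically on a grid, for the quadratic cost in dimension 1. This is a sketch of mine, not part of the text; the grid bounds, the particular small function $\psi$ and the tolerances are arbitrary choices.

```python
import numpy as np

# Double c-transform check for c(x, y) = |x - y|^2 / 2 on a 1D grid.
x = np.linspace(-10.0, 10.0, 2001)
cost = 0.5 * (x[None, :] - x[:, None]) ** 2      # cost[i, j] = c(x_i, x_j)

psi = 0.1 * np.sin(x)        # small function: psi'' >= -0.1 > -1, hence c-convex

# psi^c(y)  = inf_x [ psi(x) + c(x, y) ]
psi_c = np.min(psi[:, None] + cost, axis=0)
# psi^{cc}(x) = sup_y [ psi^c(y) - c(x, y) ]
psi_cc = np.max(psi_c[None, :] - cost, axis=1)

# psi^{cc} <= psi always; equality (away from grid-boundary effects)
# witnesses c-convexity of psi.
interior = np.abs(x) < 8.0
err = np.max(np.abs(psi_cc - psi)[interior])
print(err)                   # small: psi^{cc} = psi up to discretization
```

The same two lines of `min`/`max` also compute the transforms of a *large* function, in which case `psi_cc` comes out strictly below `psi` somewhere, exhibiting failure of $c$-convexity.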

Actually, in view of Theorem 13.5, this is the same as the vector space generated by all smooth, compactly supported gradients, completed with respect to that norm.

With what we know about optimal transport, this theorem is not that hard to prove, but this would require a bit too much geometric machinery for now. Instead, I shall spend some time on an important related result by Ambrosio, Gigli and Savaré, according to which any Lipschitz curve in the space $P_2(M)$ admits a velocity (which for all $t$ lives in the tangent space at $\mu_t$). Surprisingly, the proof will not require absolute continuity.

Theorem 13.8 (Representation of Lipschitz paths in $P_2(M)$). Let $M$ be a smooth complete Riemannian manifold, and let $P_2(M)$ be the metric space of all probability measures on $M$, with a finite second moment, equipped with the metric $W_2$. Further, let $(\mu_t)_{0 \le t \le 1}$ be a Lipschitz-continuous path in $P_2(M)$:
\[
W_2(\mu_s, \mu_t) \le L\,|t - s|.
\]
For any $t \in [0,1]$, let $H_t$ be the Hilbert space generated in $L^2(\mu_t;\, TM)$ by gradients of continuously differentiable, compactly supported $\psi$:
\[
H_t := \overline{\mathrm{Vect}\,\{\nabla\psi;\ \psi \in C^1_c(M)\}}^{\,L^2(\mu_t;\, TM)}.
\]
Then there exists a measurable vector field $\xi_t(x) \in L^\infty(dt;\, L^2(d\mu_t(x)))$, $d\mu_t(x)\,dt$-almost everywhere unique, such that $\xi_t \in H_t$ for all $t$ (i.e. the velocity field really is tangent along the path), and
\[
\partial_t \mu_t + \nabla \cdot (\xi_t\, \mu_t) = 0
\tag{13.10}
\]
in the weak sense.

Conversely, if the path $(\mu_t)_{0 \le t \le 1}$ satisfies (13.10) for some measurable vector field $\xi_t(x)$ whose $L^2(d\mu_t)$-norm is bounded by $L$, almost surely in $t$, then $(\mu_t)$ is a Lipschitz-continuous curve with $\|\dot\mu\| \le L$.

The proof of Theorem 13.8 requires some analytical tools, and the reader might skip it at first reading.

Proof of Theorem 13.8. Let $\psi : M \to \mathbb{R}$ be a $C^1$ function, with Lipschitz constant at most 1. For all $s < t$ in $[0,1]$,
\[
\left| \int_M \psi \, d\mu_t - \int_M \psi \, d\mu_s \right| \ \le\ W_1(\mu_s, \mu_t) \ \le\ W_2(\mu_s, \mu_t).
\tag{13.11}
\]

In particular, $\zeta(t) := \int_M \psi \, d\mu_t$ is a Lipschitz function of $t$. By Theorem 10.8(ii), the time-derivative of $\zeta$ exists for almost all times $t \in [0,1]$.

Then let $\pi_{s,t}$ be an optimal transference plan between $\mu_s$ and $\mu_t$ (for the squared distance cost function). Let
\[
\Psi(x,y) :=
\begin{cases}
\dfrac{|\psi(x) - \psi(y)|}{d(x,y)} & \text{if } x \neq y, \\[2mm]
|\nabla\psi(x)| & \text{if } x = y.
\end{cases}
\]
Obviously $\Psi$ is bounded by 1, and moreover it is upper semicontinuous. If $t$ is a differentiability point of $\zeta$, then
\[
\begin{aligned}
\left| \frac{d}{dt} \int \psi \, d\mu_t \right|
&\le \liminf_{\varepsilon \downarrow 0} \frac{1}{\varepsilon} \left| \int \psi \, d\mu_{t+\varepsilon} - \int \psi \, d\mu_t \right| \\
&\le \liminf_{\varepsilon \downarrow 0} \frac{1}{\varepsilon} \int |\psi(y) - \psi(x)| \, d\pi_{t,t+\varepsilon}(x,y) \\
&\le \liminf_{\varepsilon \downarrow 0} \frac{\sqrt{\int d(x,y)^2 \, d\pi_{t,t+\varepsilon}(x,y)}}{\varepsilon}\ \sqrt{\int \Psi(x,y)^2 \, d\pi_{t,t+\varepsilon}(x,y)} \\
&= \liminf_{\varepsilon \downarrow 0} \left( \frac{W_2(\mu_t, \mu_{t+\varepsilon})}{\varepsilon} \right) \sqrt{\int \Psi(x,y)^2 \, d\pi_{t,t+\varepsilon}(x,y)} \\
&\le L\ \liminf_{\varepsilon \downarrow 0} \sqrt{\int \Psi(x,y)^2 \, d\pi_{t,t+\varepsilon}(x,y)}.
\end{aligned}
\]
Since $\Psi$ is upper semicontinuous and $\pi_{t,t+\varepsilon}$ converges weakly to $\delta_{x=y}$ (the trivial transport plan where nothing moves) as $\varepsilon \downarrow 0$, it follows that
\[
\left| \frac{d}{dt} \int \psi \, d\mu_t \right| \ \le\ L\, \sqrt{\int \Psi(x,x)^2 \, d\mu_t(x)} \ =\ L\, \sqrt{\int |\nabla\psi(x)|^2 \, d\mu_t(x)}.
\]
Now the key remark is that the time-derivative $(d/dt) \int (\psi + C) \, d\mu_t$ does not depend on the constant $C$. This shows that $(d/dt) \int \psi \, d\mu_t$ really is a functional of $\nabla\psi$, obviously linear. The above estimate shows that this functional is continuous with respect to the norm in $L^2(d\mu_t)$.

Actually, this is not completely rigorous, since this functional is only defined for almost all $t$, and "almost all" here might depend on $\psi$. Here is a way to make things rigorous: Let $\mathcal{L}$ be the set of all Lipschitz functions $\psi$ on $M$ with Lipschitz constant at most 1, such that, say, $\psi(x_0) = 0$, where $x_0 \in M$ is arbitrary but fixed once for all, and $\psi$ is supported in a fixed compact $K \subset M$. The set $\mathcal{L}$ is compact in the norm of uniform convergence, and contains a dense sequence $(\psi_k)_{k \in \mathbb{N}}$. By a regularization argument, one can assume that all those functions are actually of class $C^1$. For each $\psi_k$, we know that $\zeta_k := \int \psi_k \, d\mu_t$ is differentiable for almost all $t \in [0,1]$; and since there are only countably many $\zeta_k$'s, we know that for almost every $t$, each $\zeta_k$ is differentiable at time $t$. The map $(d/dt) \int \alpha \, d\mu_t$ is well-defined at each of these times $t$, for all $\alpha$ in the vector space generated by all the $\psi_k$'s, and it is continuous if that vector space is equipped with the $L^2(d\mu_t)$ norm. It follows from the Riesz representation theorem that for each differentiability time $t$ there exists a unique vector $\xi_t \in H_t \subset L^2(d\mu_t)$, with norm at most $L$, such that
\[
\frac{d}{dt} \int \psi_k \, d\mu_t = \int \nabla\psi_k \cdot \xi_t \, d\mu_t.
\tag{13.12}
\]
This identity should hold true for any $\psi_k$, and by density it should also hold true for any $\psi \in C^1(M)$, supported in $K$.

Let $C^1_K(M)$ be the set of $\psi \in C^1(M)$ that are supported in $K$. We just showed that there is a negligible set of times, $\tau_K$, such that (13.12) holds true for all $\psi \in C^1_K(M)$ and $t \notin \tau_K$. Now choose an increasing family of compact sets $(K_m)_{m \in \mathbb{N}}$, with $\cup K_m = M$, so that any compact set is included in some $K_m$. Then (13.12) holds true for all $\psi \in C^1_c(M)$, as soon as $t$ does not belong to the union of the $\tau_{K_m}$, which is still a negligible set of times.

But equation (13.12) is really the weak formulation of (13.10). Since $\xi_t$ is uniquely determined in $L^2(d\mu_t)$, for almost all $t$, actually the vector field $\xi_t(x)$ is $d\mu_t\,dt$-uniquely determined.

To conclude the proof of the theorem, it only remains to prove the converse implication. Let $(\mu_t)$ and $(\xi_t)$ solve (13.10). By the equation of conservation of mass, $\mu_t = \mathrm{law}\,(\gamma_t)$, where $\gamma$ is a (random) solution of
\[
\dot\gamma_t = \xi_t(\gamma_t).
\]
Let $s < t$ be any two times in $[0,1]$. From the formula
\[
d(\gamma_s, \gamma_t)^2 = (t - s)\, \inf \left\{ \int_s^t |\dot\zeta_\tau|^2 \, d\tau; \quad \zeta_s = \gamma_s,\ \zeta_t = \gamma_t \right\},
\]

we deduce
\[
d(\gamma_s, \gamma_t)^2 \ \le\ (t - s) \int_s^t |\dot\gamma_\tau|^2 \, d\tau \ \le\ (t - s) \int_s^t |\xi_\tau(\gamma_\tau)|^2 \, d\tau.
\]
So
\[
\mathbb{E}\, d(\gamma_s, \gamma_t)^2 \ \le\ (t - s) \int_s^t \int |\xi_\tau(x)|^2 \, d\mu_\tau(x) \, d\tau \ \le\ (t - s)^2\, \|\xi\|^2_{L^\infty(dt;\, L^2(d\mu_t))}.
\]
In particular
\[
W_2(\mu_s, \mu_t)^2 \ \le\ \mathbb{E}\, d(\gamma_s, \gamma_t)^2 \ \le\ L^2 (t - s)^2,
\]
where $L$ is an upper bound for the norm of $\xi$ in $L^\infty(L^2)$. This concludes the proof of Theorem 13.8. ⊓⊔

Remark 13.9. With hardly any more work, the preceding theorem can be extended to cover paths that are absolutely continuous of order 2, in the sense defined on p. 127. Then of course the velocity field will not live in $L^\infty(dt;\, L^2(d\mu_t))$, but in $L^2(dt;\, L^2(d\mu_t))$.

Observe that in a displacement interpolation, the initial measure $\mu_0$ and the initial velocity field $\nabla\psi_0$ uniquely determine the final measure $\mu_1$: this implies that geodesics in $P_2(M)$ are nonbranching, in the strong sense that their initial position and velocity determine uniquely their final position.

Finally, we can now derive an "explicit" formula for the action functional determining displacement interpolations as minimizing curves. Let $\mu = (\mu_t)_{0 \le t \le 1}$ be any Lipschitz (or absolutely continuous) path in $P_2(M)$; let $\xi_t(x) = \nabla\psi_t(x)$ be the associated time-dependent velocity field. By the formula of conservation of mass, $\mu_t$ can be interpreted as the law of $\gamma_t$, where $\gamma$ is a random solution of $\dot\gamma_t = \xi_t(\gamma_t)$. Define
\[
\mathcal{A}(\mu) := \inf\, \mathbb{E} \int_0^1 |\xi_t(\gamma_t)|^2 \, dt,
\tag{13.13}
\]
where the infimum is taken over all possible realizations of the random curves $\gamma$. By Fubini's theorem,
\[
\mathcal{A}(\mu) \;=\; \inf \int_0^1 \mathbb{E}\, |\xi_t(\gamma_t)|^2 \, dt \;=\; \inf \int_0^1 \mathbb{E}\, |\dot\gamma_t|^2 \, dt \;\ge\; \inf\, \mathbb{E} \int_0^1 |\dot\gamma_t|^2 \, dt \;\ge\; \inf\, \mathbb{E}\, d(\gamma_0, \gamma_1)^2,
\]

and the infimum is achieved if and only if the coupling $(\gamma_0, \gamma_1)$ is minimal, and the curves $\gamma$ are (almost surely) action-minimizing. This shows that displacement interpolations are characterized as the minimizing curves for the action $\mathcal{A}$. Actually $\mathcal{A}$ is the same as the action appearing in Theorem 7.21(iii); the only improvement is that now we have produced a more explicit form in terms of vector fields.

The expression (13.13) can be made slightly more explicit by noting that the optimal choice of velocity field is the one provided by Theorem 13.8, which is a gradient, so we may restrict the action functional to gradient velocity fields:
\[
\mathcal{A}(\mu) := \mathbb{E} \int_0^1 |\nabla\psi_t(\gamma_t)|^2 \, dt; \qquad \frac{\partial \mu_t}{\partial t} + \nabla \cdot (\nabla\psi_t\, \mu_t) = 0.
\tag{13.14}
\]
Note the formal resemblance to a Riemannian structure: What the formula above says is
\[
W_2(\mu_0, \mu_1)^2 = \inf \int_0^1 \|\dot\mu_t\|^2_{T_{\mu_t} P_2} \, dt,
\tag{13.15}
\]
where the norm on the tangent space $T_\mu P_2$ is defined by
\[
\|\dot\mu\|^2_{T_\mu P_2} = \inf \left\{ \int |v|^2 \, d\mu; \ \dot\mu + \nabla \cdot (v\,\mu) = 0 \right\}
= \inf \left\{ \int |\nabla\psi|^2 \, d\mu; \ \dot\mu + \nabla \cdot (\nabla\psi\,\mu) = 0 \right\}.
\]

Remark 13.10. There is an appealing physical interpretation, which really is an infinitesimal version of the optimal transport problem. Imagine that you observe the (infinitesimal) evolution of the density of particles moving in a continuum, but don't know the actual velocities of these particles. There might be many velocity fields that are compatible with the observed evolution of density (many solutions of the continuity equation). Among all possible solutions, select the one with minimum kinetic energy. This energy is (up to a factor 1/2) the square norm of your infinitesimal evolution.
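Formula (13.15) can be illustrated numerically in dimension 1 (a sketch of mine, not part of the text). For empirical measures on the line, the optimal coupling is the monotone one, obtained by matching sorted samples; moving each particle at constant speed realizes the displacement interpolation, and its kinetic action equals $W_2^2$, while any detour strictly increases the action. The sample sizes and the Gaussian parameters below are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 2000

# Two empirical measures on the line (arbitrary choices)
x0 = np.sort(rng.normal(0.0, 1.0, n))     # samples of mu_0
x1 = np.sort(rng.normal(3.0, 0.5, n))     # samples of mu_1

# In 1D the optimal coupling matches sorted samples, so
# W_2(mu_0, mu_1)^2 = (1/n) sum_i |x1_i - x0_i|^2.
w2_sq = np.mean((x1 - x0) ** 2)

# Displacement interpolation: particle i moves with constant velocity
# v_i = x1_i - x0_i, so its action int_0^1 |v_i|^2 dt = |v_i|^2 and the
# total action equals W_2^2, the infimum in (13.15).
velocities = x1 - x0
action = np.mean(velocities ** 2)

# A suboptimal path taking a detour, gamma_t = (1-t) x0 + t x1 + t(1-t),
# has velocity v + (1 - 2t) and a strictly larger action (larger by 1/3).
ts = np.linspace(0.0, 1.0, 1001)
detour_action = np.mean([np.mean((velocities + (1 - 2 * t)) ** 2) for t in ts])

print(w2_sq, action, detour_action)
```

The detour path has the same endpoints ($t(1-t)$ vanishes at $t = 0, 1$) yet a larger kinetic energy, in accordance with the characterization of displacement interpolations as the minimizers of $\mathcal{A}$.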

Bibliographical notes

Formula (13.8) appears in [246]. It has an interesting consequence which can be described as follows: On a Riemannian manifold, the optimal transport starting from an absolutely continuous probability measure almost never hits the cut locus; that is, the set of points $x$ such that the image $T(x)$ belongs to the cut locus of $x$ is of zero probability. Although we already know that almost surely, $x$ and $T(x)$ are joined by a unique geodesic, this alone does not imply that the cut locus is almost never hit, because it is possible that $y$ belongs to the cut locus of $x$ and still $x$ and $y$ are joined by a unique minimizing geodesic. (Recall the discussion after Problem 8.8.) But Cordero-Erausquin, McCann and Schmuckenschläger show that if such is the case, then $d(x,z)^2/2$ fails to be semiconvex at $z = y$. On the other hand, by Alexandrov's second differentiability theorem (Theorem 14.1), $\psi$ is twice differentiable almost everywhere; formula (13.8), suitably interpreted, says that $d(x,\cdot)^2/2$ is semiconvex at $T(x)$ whenever $\psi$ is twice differentiable at $x$.

At least in the Euclidean case, and up to regularity issues, the explicit formulas for geodesic curves and action in the space of measures were known to Brenier, around the mid-nineties. Otto [669] took a conceptual step forward by considering formally $P_2(M)$ as an infinite-dimensional Riemannian manifold, in view of formula (13.15). For some time it was used as a purely formal, yet quite useful, heuristic method (as in [671], or later in Chapter 15). It is only recently that rigorous constructions were performed in several research papers, e.g. [30, 203, 214, 577]. The approach developed in this chapter relies heavily on the work of Ambrosio, Gigli and Savaré [30] (in $\mathbb{R}^n$). A more geometric treatment can be found in [577, Appendix A]; see also [30, Section 12.4], and [655, Section 3] (I shall give a few more details in the bibliographical notes of Chapter 26). As I am completing this course, an important contribution to the subject has just been made by Lott [575], who established "explicit" formulas for the Riemannian connection and curvature in $P_2(M)$, or rather in the subset made of smooth positive densities, when $M$ is compact.

The pressureless Euler equations describe the evolution of a gas of particles which interact only when they meet, and then stick together (sticky particles). It is a very degenerate system whose study turns out to be tricky in general [169, 320, 745]. But in applications to optimal transport, it comes in the very particular case of potential flow (the

velocity field is a gradient), so the evolution is governed by a simple Hamilton–Jacobi equation.

There are two natural ways to extend a minimizing geodesic $(\mu_t)_{0 \le t \le 1}$ in $P_2(M)$ into a geodesic defined (but not necessarily minimizing) at all times. One is to solve the Hamilton–Jacobi equation for all times, in the viscosity sense; then the gradient of the solution will define a velocity field, and one can let $(\mu_t)$ evolve by transport, as in (13.6). Another way [30, Example 11.2.9] is to construct a trajectory of the gradient flow of the energy $-W_2(\sigma, \cdot)^2/2$, where $\sigma$ is, say, $\mu_0$, and the trajectory starts from an intermediate point, say $\mu_{1/2}$. The existence of this gradient flow follows from [30, Theorems 10.4.12 and 11.2.1], while [30, Theorem 11.2.1] guarantees that it coincides (up to time-reparameterization) with the original minimizing geodesic for short times. (This is close to the construction of quasigeodesics in [678].) It is natural to expect that both approaches give the same result, but as far as I know, this has not been established.

Khesin suggested to me the following nice problem. Let $\mu = (\mu_t)_{t \ge 0}$ be a geodesic in the Wasserstein space $P_2(\mathbb{R}^n)$ (defined with the help of the pressureless Euler equation), and characterize the cut time of $\mu$ as the "time of the first shock". If $\mu_0$ is absolutely continuous with positive density, this essentially means the following: let $t_c = \inf\{t_1;\ (\mu_t)_{t_0 \le t \le t_1}$ is not a minimizing geodesic$\}$, let $t_s = \sup\{t_0;\ \mu_t$ is absolutely continuous for $t \le t_0\}$, and show that $t_c$ should be equal to $t_s$. Since $t_c = \sup\{t;\ |x|^2/2 + t\,\psi(0,x)$ is convex$\}$, the solution of this problem is related to a qualitative study of the way in which convexity degenerates at the first shock. In dimension 1, this problem can be studied very precisely [642, Chapter 1], but in higher dimensions the problem is more tricky. Khesin and Misiołek obtain some results in this direction in [515].

Kloeckner [522] studied the isometries of $P_2(\mathbb{R})$ and found that they are not all induced by isometries of $\mathbb{R}$; roughly speaking, there is one additional "exotic" isometry.

In his PhD thesis, Agueh [4] studied the structure of $P_p(M)$ for $p > 1$ (not necessarily equal to 2). Ambrosio, Gigli and Savaré [30] pushed these investigations further.

Displacement interpolation becomes somewhat tricky in the presence of boundaries. In his study of the porous medium equations, Otto [669] considered the case of a bounded open set of $\mathbb{R}^n$ with $C^2$ boundary.

For many years, the great majority of applications of optimal transport to problems of applied mathematics have taken place in a Euclidean setting, but more recently some "genuinely Riemannian" applications have started to pop out. There was an original suggestion to use optimal transport in a three-dimensional Riemannian manifold (actually, a cube equipped with a varying metric) related to image perception and the matching of pictures with different contrasts [289]. In a meteorological context, it is natural to consider the sphere (as a model of the Earth), and in the study of the semi-geostrophic system one is naturally led to optimal transport on the sphere [263, 264]; actually, it is even natural to consider a conformal change of metric which "pinches" the sphere along its equator [263]! For completely different reasons, optimal transport on the sphere was recently used by Otto and Tzavaras [670] in the study of a coupled fluid-polymer model.


Part II

Optimal transport and Riemannian geometry


This second part is devoted to the exploration of Riemannian geometry through optimal transport. It will be shown that the geometry of a manifold influences the qualitative properties of optimal transport; this can be quantified in particular by the effect of Ricci curvature bounds on the convexity of certain well-chosen functionals along displacement interpolation. The first hints of this interplay between Ricci curvature and optimal transport appeared around 2000, in works by Otto and myself, and shortly after by Cordero-Erausquin, McCann and Schmuckenschläger.

Throughout, the emphasis will be on the quadratic cost (the transport cost is the square of the geodesic distance), with just a few exceptions. Also, most of the time I shall only handle measures which are absolutely continuous with respect to the Riemannian volume measure.

Chapter 14 is a preliminary chapter devoted to a short and tentatively self-contained exposition of the main properties of Ricci curvature. After going through this chapter, the reader should be able to understand all the rest without having to consult any extra source on Riemannian geometry. The estimates in this chapter will be used in Chapters 15, 16 and 17.

Chapter 15 presents a powerful formal differential calculus on the Wasserstein space, cooked up by Otto.

Chapters 16 and 17 establish the main relations between displacement convexity and Ricci curvature. Not only do Ricci curvature bounds imply certain properties of displacement convexity, but conversely these properties characterize Ricci curvature bounds. These results will play a key role in the rest of the course.

In Chapters 18 to 22 the main theme will be that many classical properties of Riemannian manifolds, that come from Ricci curvature estimates, can be conveniently derived from displacement convexity techniques. This includes in particular estimates about the growth of the volume of balls, Sobolev-type inequalities, concentration inequalities, and Poincaré inequalities.

Then in Chapter 23 it is explained how one can define certain gradient flows in the Wasserstein space, and recover in this way well-known equations such as the heat equation. In Chapter 24 some of the functional inequalities from the previous chapters are applied to the qualitative study of some of these gradient flows. Conversely, gradient flows provide alternative proofs to some of these inequalities, as shown in Chapter 25.

The issues discussed in this part are concisely reviewed in the surveys by Cordero-Erausquin [244] and myself [821] (both in French).

Convention: Throughout Part II, unless otherwise stated, a "Riemannian manifold" is a smooth, complete connected finite-dimensional Riemannian manifold distinct from a point, equipped with a smooth metric tensor.

14 Ricci curvature

Curvature is a generic name to designate a local invariant of a metric space that quantifies the deviation of this space from being Euclidean. (Here "local invariant" means a quantity which is invariant under local isometries.) It is standard to define and study curvature mainly on Riemannian manifolds, for in that setting definitions are rather simple, and the Riemannian structure allows for "explicit" computations. Throughout this chapter, $M$ will stand for a complete connected Riemannian manifold, equipped with its metric $g$.

The most popular curvatures are: the sectional curvature $\sigma$ (for each point $x$ and each plane $P \subset T_x M$, $\sigma_x(P)$ is a number), the Ricci curvature $\mathrm{Ric}$ (for each point $x$, $\mathrm{Ric}_x$ is a quadratic form on the tangent space $T_x M$), and the scalar curvature $S$ (for each point $x$, $S_x$ is a number). All of them can be obtained by reduction of the Riemann curvature tensor. The latter is easy to define: If $\nabla_X$ stands for the covariant derivation along the vector field $X$, then
\[
\mathrm{Riem}(X,Y) := \nabla_X \nabla_Y - \nabla_Y \nabla_X - \nabla_{[X,Y]};
\]
but it is notoriously difficult, even for specialists, to get some understanding of its meaning. The Riemann curvature can be thought of as a tensor with four indices; it can be expressed in coordinates as a nonlinear function of the Christoffel symbols and their partial derivatives.

Of these three notions of curvature (sectional, Ricci, scalar), the sectional one is the most precise; in fact the knowledge of all sectional curvatures is equivalent to the knowledge of the Riemann curvature. The Ricci curvature is obtained by "tracing" the sectional curvature: If $e$ is a given unit vector in $T_x M$ and $(e, e_2, \ldots, e_n)$ is an orthonormal basis of $T_x M$, then $\mathrm{Ric}_x(e,e) = \sum_{j=2}^n \sigma_x(P_j)$, where $P_j$ is the plane generated by $\{e, e_j\}$. Finally, the scalar curvature is the trace of the Ricci curvature. So a control on the sectional curvature is stronger than a control on the Ricci curvature, which in turn is stronger than a control on the scalar curvature.

For a surface (manifold of dimension 2), these three notions reduce to just one, which is the Gauss curvature and whose definition is elementary. Let us first describe it from an extrinsic point of view. Let $M$ be a two-dimensional submanifold of $\mathbb{R}^3$. In the neighborhood of a point $x$, choose a unit normal vector $n = n(y)$; then this defines locally a smooth map $n$ with values in $S^2 \subset \mathbb{R}^3$ (see Figure 14.1). The tangent spaces $T_x M$ and $T_{n(x)} S^2$ are parallel planes in $\mathbb{R}^3$, which can be identified unambiguously. So the determinant of the differential of $n$ can also be defined without ambiguity, and this determinant is called the curvature. The fact that this quantity is invariant under isometries is one of Gauss's most famous results, a tour de force at the time. (To appreciate this theorem, the reader might try to prove it by elementary means.)

Fig. 14.1. The dashed line gives the recipe for the construction of the Gauss map; its Jacobian determinant is the Gauss curvature.

As an illustration of this theorem: If you hold a sheet of paper straight, then its equation (as an embedded surface in $\mathbb{R}^3$, and assuming that it is infinite) is just the equation of a plane, so obviously it is not curved. Fine, but now bend the sheet of paper so that it looks like valleys and mountains, write down the horrible resulting equations, give it to a friend and ask him whether it is curved or not. One thing he can do is compute the Gauss curvature from your horrible equations, find that it is identically 0, and deduce that your surface was not curved

at all. Well, it looked curved as a surface which was embedded in ℝ³, but from an intrinsic point of view it was not: A tiny creature living on the surface of the sheet, unable to measure the lengths of curves going outside of the surface, would never have noticed that you bent the sheet.

To construct isometries from (M, g) to something else, pick up any diffeomorphism φ : M → M′ = φ(M), and equip M′ with the metric g′ = (φ⁻¹)*g, defined by g′_x(v) = g_{φ⁻¹(x)}(dφ⁻¹_x(v)). Then φ is an isometry between (M, g) and (M′, g′). Gauss's theorem says that the curvature computed in (M, g) and the curvature computed in (M′, g′) are the same, modulo obvious changes (the curvature at point x along a plane P should be compared with the curvature at φ(x) along the plane dφ_x(P)). This is why one often says that the curvature is "invariant under the action of diffeomorphisms".

Curvature is intimately related to the local behavior of geodesics. The general rule is that, in the presence of positive curvature, geodesics have a tendency to converge (at least for short times), while in the presence of negative curvature they have a tendency to diverge. This tendency can usually be felt only at second or third order in time: at first order, the convergence or divergence of geodesics is dictated by the initial conditions. So if, on a space of (strictly) positive curvature, you start two geodesics from the same point with velocities pointing in different directions, the geodesics will start to diverge, but then the tendency to diverge will diminish. Here is a more precise statement, which will show at the same time that the Gauss curvature is an intrinsic notion: From a point x ∈ M, start two constant-speed geodesics with unit speed, and respective velocities v and w. The two curves will spread apart; let δ(t) be the distance between their respective positions at time t.
In a first approximation, δ(t) ≃ √(2(1 − cos θ)) t, where θ is the angle between v and w (this is the same formula as in Euclidean space). But a more precise study shows that

    δ(t) = √(2(1 − cos θ)) t ( 1 − (κ_x cos²(θ/2) / 6) t² + O(t⁴) ),   (14.1)

where κ_x is the Gauss curvature at x.

Once the intrinsic nature of the Gauss curvature has been established, it is easy to define the notion of sectional curvature for Riemannian manifolds of any dimension, embedded or not: If x ∈ M and P ⊂ T_x M, define σ_x(P) as the Gauss curvature of the surface which is obtained as the image of P by the exponential map exp_x (that is, the

collection of all geodesics starting from x with a velocity in P). Another equivalent definition is by reduction of the Riemann curvature tensor: If {u, v} is an orthonormal basis of P, then σ_x(P) = ⟨Riem(u, v) · u, v⟩.

It is obvious from the first definition of Gauss curvature that the unit two-dimensional sphere S² has curvature +1, and that the Euclidean plane ℝ² has curvature 0. More generally, the sphere Sⁿ(R), of dimension n and radius R, has constant sectional curvature 1/R²; while the n-dimensional Euclidean space ℝⁿ has curvature 0. The other classical example is the hyperbolic space, say Hⁿ(R) = {(x, y) ∈ ℝⁿ⁻¹ × (0, +∞)}, equipped with the metric R²(dx² + dy²)/y², which has constant sectional curvature −1/R². These three families (spheres, Euclidean, hyperbolic) constitute the only simply connected Riemannian manifolds with constant sectional curvature, and they play an important role as comparison spaces.

The qualitative properties of optimal transport are also (of course) related to the behavior of geodesics, and so it is natural to believe that curvature has a strong influence on the solution of optimal transport. Conversely, some curvature properties can be read off on the solution of optimal transport. At the time of writing, these links have been best understood in terms of Ricci curvature; so this is the point of view that will be developed in the sequel.

This chapter is a tentative crash course on Ricci curvature. Hopefully, a reader who has never heard about that topic before should, by the end of the chapter, know enough about it to understand all the rest of the notes. This is by no means a complete course, since most proofs will only be sketched and many basic results will be taken for granted. In practice, Ricci curvature usually appears from two points of view: (a) estimates of the Jacobian determinant of the exponential map; (b) Bochner's formula.
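The expansion (14.1) can be probed numerically. The following check is an editor's illustration (not part of the original text): on the unit sphere S², where the Gauss curvature is κ = 1, geodesics from the north pole are explicit great circles, and the angle θ below is an arbitrary choice.

```python
import numpy as np

# Check the expansion (14.1) on the unit sphere S^2 (Gauss curvature kappa = 1).
x = np.array([0.0, 0.0, 1.0])          # starting point: north pole
theta = 0.7                            # angle between the two initial velocities
v = np.array([1.0, 0.0, 0.0])
w = np.array([np.cos(theta), np.sin(theta), 0.0])

def delta(t):
    """Great-circle distance between the two geodesic positions at time t."""
    a = np.cos(t) * x + np.sin(t) * v
    b = np.cos(t) * x + np.sin(t) * w
    return np.arccos(np.clip(np.dot(a, b), -1.0, 1.0))

for t in (0.3, 0.1, 0.03):
    approx = np.sqrt(2 * (1 - np.cos(theta))) * t * (1 - np.cos(theta / 2)**2 * t**2 / 6)
    # the discrepancy is of order t^5, comfortably below t^4
    assert abs(delta(t) - approx) < t**4
print("expansion (14.1) confirmed on S^2")
```

The error indeed shrinks one order faster than the last retained term, consistent with the O(t⁴) remainder inside the parenthesis of (14.1).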
These are two complementary points of view on the same phenomenon, and it is useful to know both. Before going on, I shall make some preliminary remarks about Riemannian calculus at second order, for functions which are not necessarily smooth.

Preliminary: second-order differentiation

All curvature calculations involve second-order differentiation of certain expressions. The notion of covariant derivation lends itself well to

those computations. A first thing to know is that the exchange of derivatives is still possible. To express this properly, consider a parametrized surface (s, t) → γ(s, t) in M, and write d/dt (resp. d/ds) for the differentiation along γ, viewed as a function of t with s frozen (resp. as a function of s with t frozen); and D/Dt (resp. D/Ds) for the corresponding covariant differentiation. Then, if F ∈ C²(M), one has

    (D/Dt) ((d/ds) F) = (D/Ds) ((d/dt) F).   (14.2)

Also a crucial concept is that of the Hessian operator. If f is twice differentiable on ℝⁿ, its Hessian matrix is just (∂²f/∂x_i ∂x_j)_{1≤i,j≤n}, that is, the array of all second-order partial derivatives. Now if f is defined on a Riemannian manifold M, the Hessian operator at x is the linear operator ∇²f(x) : T_x M → T_x M defined by the identity

    ∇²f · v = ∇_v (∇f).

(Recall that ∇_v stands for the covariant derivation in the direction v.) In short, ∇²f is the covariant gradient of the gradient of f.

A convenient way to compute the Hessian of a function is to differentiate it twice along a geodesic path. Indeed, if (γ_t)_{0≤t≤1} is a geodesic path, then ∇_{γ̇} γ̇ = 0, so

    (d²/dt²) f(γ_t) = (d/dt) ⟨∇f(γ_t), γ̇_t⟩ = ⟨∇_{γ̇_t} ∇f(γ_t), γ̇_t⟩ + ⟨∇f(γ_t), ∇_{γ̇_t} γ̇_t⟩ = ⟨∇²f(γ_t) · γ̇_t, γ̇_t⟩.

In other words, if γ_0 = x and γ̇_0 = v ∈ T_x M, then

    f(γ_t) = f(x) + t ⟨∇f(x), v⟩ + (t²/2) ⟨∇²f(x) · v, v⟩ + o(t²).   (14.3)

This identity can actually be used to define the Hessian operator. A similar computation shows that for any two tangent vectors u, v at x,

    (D/Ds) ((d/dt) f(exp_x(su + tv))) = ⟨∇²f(x) · u, v⟩,   (14.4)

where exp_x v is the value at time 1 of the constant-speed geodesic starting from x with velocity v. Identity (14.4) together with (14.2) shows that if f ∈ C²(M), then ∇²f(x) is a symmetric operator:

⟨∇²f(x) · u, v⟩_x = ⟨∇²f(x) · v, u⟩_x. In that case it will often be convenient to think of ∇²f(x) as a quadratic form on T_x M.

The Hessian is related to another fundamental second-order differential operator, the Laplacian, or Laplace–Beltrami operator. The Laplacian can be defined as the trace of the Hessian:

    ∆f(x) = tr (∇²f(x)).

Another possible definition is

    ∆f = ∇·(∇f),

where ∇· is the divergence operator, defined as the negative of the adjoint of the gradient in L²(M): More explicitly, if ξ is a C¹ vector field on M, its divergence is defined by

    ∀ζ ∈ C_c^∞(M),   ∫_M (∇·ξ) ζ dvol = − ∫_M ξ·∇ζ dvol.

Both definitions are equivalent; in fact, the divergence of any vector field ξ coincides with the trace of the covariant gradient of ξ. When M = ℝⁿ, ∆f is given by the usual expression Σ ∂²_{ii} f. More generally, in coordinates, the Laplacian reads

    ∆f = (det g)^{−1/2} Σ_{ij} ∂_i ( (det g)^{1/2} g^{ij} ∂_j f ).

In the context of optimal transport, we shall be led to consider Hessian operators for functions f that are not of class C², and not even continuously differentiable. However, ∇f and ∇²f will still be well-defined almost everywhere, and this will be sufficient to conduct the proofs. Here I should explain what it means for a function defined almost everywhere to be differentiable. Let ξ be a vector field defined on a domain U of M; when y is close enough to x, there is a unique velocity w ∈ T_x M such that y = γ_1, where γ is the constant-speed geodesic starting from x with initial velocity w; for simplicity I shall write w = y − x (to be understood as y = exp_x w).
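The coordinate formula for the Laplacian given above can be checked in a concrete chart. The sketch below is an editor's illustration (assuming sympy is available): in polar coordinates (r, t) on ℝ², the metric is g = diag(1, r²), so (det g)^{1/2} = r and g^{-1} = diag(1, 1/r²), and the formula should reproduce the familiar polar Laplacian.

```python
import sympy as sp

# Verify Delta f = (det g)^{-1/2} sum_i d_i((det g)^{1/2} g^{ii} d_i f)
# in polar coordinates on R^2, against the classical expression.
r, t = sp.symbols('r t', positive=True)
f = r**3 * sp.cos(2 * t)       # an arbitrary test function

sqrtg = r                      # (det g)^{1/2} in this chart
ginv = [1, 1 / r**2]           # diagonal entries of g^{-1}
coords = [r, t]
lap = sum(sp.diff(sqrtg * ginv[i] * sp.diff(f, coords[i]), coords[i])
          for i in range(2)) / sqrtg

# Classical polar-coordinate Laplacian, for comparison:
lap_classic = sp.diff(f, r, 2) + sp.diff(f, r) / r + sp.diff(f, t, 2) / r**2
assert sp.simplify(lap - lap_classic) == 0
print(sp.simplify(lap))        # 5*r*cos(2*t)
```

The same computation with sqrtg = 1 and ginv = [1, 1] recovers the flat expression Σ ∂²_{ii} f, i.e. the two definitions (trace of the Hessian, divergence of the gradient) agree in any chart.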
Then ξ is said to be covariantly differentiable at x in the direction v, if

    ∇_v ξ(x) := |v| lim_{y → x, (y−x)/|y−x| → v/|v|} ( θ_{y→x} ξ(y) − ξ(x) ) / |y − x|   (14.5)

exists, where y varies on the domain of definition of ξ, and θ_{y→x} is the parallel transport along the geodesic joining y to x. If ξ is defined

everywhere in a neighborhood of x, then this is just the usual notion of covariant derivation. Formulas for (14.5) in coordinates are just the same as in the smooth case.

The following theorem is the main result of second differentiability for nonsmooth functions:

Theorem 14.1 (Second differentiability of semiconvex functions). Let M be a Riemannian manifold equipped with its volume measure, let U be an open subset of M, and let ψ : U → ℝ be locally semiconvex with a quadratic modulus of semiconvexity, in the sense of Definition 10.10. Then, for almost every x ∈ U, ψ is differentiable at x and there exists a symmetric operator A : T_x M → T_x M, characterized by one of the following two equivalent properties:

(i) For any v ∈ T_x M, ∇_v(∇ψ)(x) = Av;
(ii) ψ(exp_x v) = ψ(x) + ⟨∇ψ(x), v⟩ + ⟨A·v, v⟩/2 + o(|v|²) as v → 0.

The operator A is denoted by ∇²ψ(x) and called the Hessian of ψ at x. When no confusion is possible, the quadratic form defined by A is also called the Hessian of ψ at x.

The trace of A is denoted by ∆ψ(x) and called the Laplacian of ψ at x. The function x → ∆ψ(x) coincides with the density of the absolutely continuous part of the distributional Laplacian of ψ; while the singular part of this distributional Laplacian is a nonnegative measure.

Remark 14.2. The particular case when ψ : ℝⁿ → ℝ is convex is known as Alexandrov's second differentiability theorem. By extension, I shall use the terminology "Alexandrov's theorem" for the general statement of Theorem 14.1. This theorem is more often stated in terms of Property (ii) than in terms of Property (i); but it is the latter that will be most useful for our purposes.

Remark 14.3. As the proof will show, Property (i) can be replaced by the following more precise statement involving the subdifferential of ψ: If ξ is any vector field valued in ∇⁻ψ (i.e.
ξ(y) ∈ ∇⁻ψ(y) for all y), then ∇_v ξ(x) = Av.

Remark 14.4. For the main part of this course, we shall not need the full strength of Theorem 14.1, but just the particular case when ψ is continuously differentiable and ∇ψ is Lipschitz; then the proof becomes much simpler, and ∇ψ is almost everywhere differentiable in

the usual sense. Still, on some occasions we shall need the full generality of Theorem 14.1.

Beginning of proof of Theorem 14.1. The notion of local semiconvexity with quadratic modulus is invariant by C² diffeomorphism, so it suffices to prove Theorem 14.1 when M = ℝⁿ. But a semiconvex function in an open subset U of ℝⁿ is just the sum of a quadratic form and a locally convex function (that is, a function which is convex in any convex subset of U). So it is actually sufficient to consider the special case when ψ is a convex function in a convex subset of ℝⁿ. Then if x ∈ U and B is a closed ball around x, included in U, let ψ_B be the restriction of ψ to B; since ψ_B is Lipschitz and convex, it can be extended into a Lipschitz convex function on the whole of ℝⁿ (take for instance the supremum of all supporting hyperplanes for ψ_B). In short, to prove Theorem 14.1 it is sufficient to treat the special case of a convex function ψ : ℝⁿ → ℝ. At this point the argument does not involve any more Riemannian geometry, but only convex analysis; so I shall postpone it to the Appendix (Theorem 14.25). ⊓⊔

The Jacobian determinant of the exponential map

Let M be a Riemannian manifold, and let ξ be a vector field on M (so for each x, ξ(x) lies in T_x M). Recall the definition of the exponential map T = exp ξ: Start from point x a geodesic curve with initial velocity ξ(x) ∈ T_x M, and follow it up to time 1 (it is not required that the geodesic be minimizing all along); the position at time 1 is denoted by exp_x(ξ(x)). As a trivial example, in the Euclidean space, exp_x ξ(x) = x + ξ(x).

The computation of the Jacobian determinant of such a map is a classical exercise in Riemannian geometry, whose solution involves the Ricci curvature. One can take this computation as a theorem about the Ricci curvature (previously defined in terms of sectional or Riemann curvature), or as the mere definition of the Ricci curvature.

So let x ∈ M be given, and let ξ be a vector field defined in a neighborhood of x, or almost everywhere in a neighborhood of x. Let (e_1, ..., e_n) be an orthonormal basis of T_x M, and consider small variations of x in these directions, denoted abusively by
x ∈ M be given, and let ξ So let be a vector field defined in a neighborhood of x , or almost everywhere in a neighborhood of x . Let ( e , and consider small ,...,e M ) be an orthonormal basis of T x 1 n variations of x , denoted abusively by e ,...,e in these directions 1 n

x + δe_1, ..., x + δe_n. (Here x + δe_j should be understood as, say, exp_x(δe_j); but it might also be any path x(δ) with ẋ(0) = e_j.) As δ → 0, the infinitesimal parallelepiped P_δ built on (x + δe_1, ..., x + δe_n) has volume vol[P_δ] ≃ δⁿ. (It is easy to make sense of that by using local charts.) The quantity of interest is

    J(x) := lim_{δ→0} vol[T(P_δ)] / vol[P_δ].

For that purpose, T(P_δ) can be approximated by the infinitesimal parallelogram built on T(x + δe_1), ..., T(x + δe_n). Explicitly,

    T(x + δe_i) = exp_{x+δe_i}(ξ(x + δe_i)).

(If ξ is not defined at x + δe_i, it is always possible to make an infinitesimal perturbation and replace x + δe_i by a point which is extremely close and at which ξ is well-defined. Let me skip this nonessential subtlety.)

Assume for a moment that we are in ℝⁿ, so T(x) = x + ξ(x); then, by a classical result of real analysis, J(x) = |det(∇T(x))| = |det(I_n + ∇ξ(x))|. But in the genuinely Riemannian case, things are much more intricate (unless ξ(x) = 0) because the measurement of infinitesimal volumes changes as we move along the geodesic path γ(t, x) = exp_x(tξ(x)).

To appreciate this continuous change, let us parallel-transport along the geodesic γ to define a new family E(t) = (e_1(t), ..., e_n(t)) in T_{γ(t)} M. Since (d/dt) ⟨e_i(t), e_j(t)⟩ = ⟨ė_i(t), e_j(t)⟩ + ⟨e_i(t), ė_j(t)⟩ = 0, the family E(t) remains an orthonormal basis of T_{γ(t)} M for any t. (Here the dot symbol stands for the covariant derivation along γ.) Moreover, e_1(t) = γ̇(t, x)/|γ̇(t, x)|. (See Figure 14.2.)

Fig. 14.2. The orthonormal basis E, here represented by a small cube, goes along the geodesic by parallel transport.
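Variations of geodesics like the ones just described can be tracked numerically. The following editor's illustration (not from the book; a variant where the variation vanishes at the starting point, rather than the J(0) = I setup used below) differentiates a family of unit-speed geodesics on the unit sphere with respect to the direction parameter; constant curvature +1 forces the resulting field to have norm sin t.

```python
import numpy as np

# Finite-difference variation of a family of geodesics on the unit sphere S^2.
x = np.array([0.0, 0.0, 1.0])   # common starting point (north pole)

def geodesic(t, d):
    # unit-speed geodesic with initial velocity (cos d, sin d, 0)
    v = np.array([np.cos(d), np.sin(d), 0.0])
    return np.cos(t) * x + np.sin(t) * v

eps = 1e-6
for t in (0.5, 1.0, 2.0):
    J = (geodesic(t, eps) - geodesic(t, 0.0)) / eps
    # With J(0) = 0 and |J'(0)| = 1, constant curvature +1 gives |J(t)| = sin t.
    assert abs(np.linalg.norm(J) - abs(np.sin(t))) < 1e-5
print("|J(t)| = sin t on the sphere, as predicted by constant curvature +1")
```

This is the prototype of the geodesic distortion that the matrices J(t) quantify: on the sphere the variation field first grows, then shrinks back to 0 at the antipode (t = π).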

To express the Jacobian of the map T = exp ξ, it will be convenient to consider the whole collection of maps T_t = exp(tξ). For brevity, let us write

    T_t(x + δE) = (T_t(x + δe_1), ..., T_t(x + δe_n));

then

    T_t(x + δE) ≃ T_t(x) + δJ,

where

    J = J(t, x) := (J_1, ..., J_n);   J_i = (d/dδ)|_{δ=0} T_t(x + δe_i).

(See Figure 14.3.)

The vector fields J_i have been obtained by differentiating a family of geodesics depending on a parameter (here δ); such vector fields are called Jacobi fields and they satisfy a characteristic linear second-order equation known as the Jacobi equation. To write this equation, it will be convenient to express J_1, ..., J_n in terms of the basis e_1, ..., e_n; so let J_{ij} = ⟨J_i, e_j⟩ stand for the jth component of J_i in this basis. The matrix J(t) = (J_{ij})_{1≤i,j≤n} satisfies the differential equation

    J̈(t) + R(t) J(t) = 0,   (14.6)

where R(t) is a matrix which depends on the Riemannian structure at γ(t), and can be expressed in terms of the Riemann curvature tensor:

    R_{ij}(t) = ⟨Riem_{γ(t)}(γ̇(t), e_i(t)) γ̇(t), e_j(t)⟩_{γ(t)}.   (14.7)

(All of these quantities depend implicitly on the starting point x.) The reader who prefers to stay away from the Riemann curvature tensor can take (14.6) as the equation defining the matrix R; the only things that one needs to know about it are (a) R(t) is symmetric; (b) the first row of R(t) vanishes (which is the same, modulo identification, as R(t) γ̇(t) = 0); (c) tr R(t) = Ric_{γ(t)}(γ̇(t), γ̇(t)) (which one can also adopt as a definition of the Ricci tensor); (d) R is invariant under the transform t → 1 − t, E(t) → −E(1 − t), γ_t → γ_{1−t}.

Equation (14.6) is of second order in time, so it should come with initial conditions for both J(0) and J̇(0). On the one hand, since T_0(y) = y,

    J_i(0) = (d/dδ)|_{δ=0} (x + δe_i) = e_i,

Fig. 14.3. At time t = 0, the matrices J(t) and E(t) coincide, but at later times they (may) differ, due to geodesic distortion.

so J(0) is just the identity matrix. On the other hand,

    J̇_i(0) = (D/Dt)|_{t=0} (d/dδ)|_{δ=0} T_t(x + δe_i) = (D/Dδ)|_{δ=0} (d/dt)|_{t=0} T_t(x + δe_i) = (D/Dδ)|_{δ=0} ξ(x + δe_i) = ∇_{e_i} ξ,

where ∇ξ is the covariant gradient of ξ. (The exchange of derivatives is justified by the differentiability of ξ at x and the C^∞ regularity of (t, y, ξ) → exp_y(tξ).) So

    (d/dt) J_{ij} = (d/dt) ⟨J_i, e_j⟩ = ⟨DJ_i/Dt, e_j⟩ = ⟨∇_{e_i} ξ, e_j⟩.

We conclude that the initial conditions are

    J(0) = I_n,   J̇(0) = ∇ξ(x),   (14.8)

where in the second expression the linear operator ∇ξ(x) is identified with its matrix in the basis E: (∇ξ)_{ij} = ⟨∇_{e_i} ξ, e_j⟩ = ⟨e_i · ∇ξ, e_j⟩. (Be careful, this is the converse of the usual convention A_{ij} = ⟨Ae_j, e_i⟩; anyway, later we shall work with symmetric operators, so it will not matter.)

From this point on, the problem is about a path J(t) valued in the space M_{n×n}(ℝ) of real matrices, and we can forget about the geometry: Parallel transport has provided a consistent identification of

all the tangent spaces T_{γ(t)} M with ℝⁿ. This path depends on x via the initial conditions (14.8), so in the sequel we shall put that dependence explicitly. It might be very rough as a function of x, but it is very smooth as a function of t. The Jacobian of the map T_t is defined by

    J(t, x) = det J(t, x),

and the formula for the differential of the determinant yields

    (d/dt) J(t, x) = J(t, x) tr ( J̇(t, x) J(t, x)⁻¹ ),   (14.9)

at least as long as J(t, x) is invertible (let's forget about that problem for the moment). So it is natural to set

    U := J̇ J⁻¹,   (14.10)

and look for an equation on U. By differentiating (14.10) and using (14.6), we discover that

    U̇ = J̈ J⁻¹ − J̇ J⁻¹ J̇ J⁻¹ = −R − U²

(note that J̇ and J⁻¹ do not necessarily commute). So the change of variables (14.10) has turned the second-order equation (14.6) into the first-order equation

    U̇ + U² + R = 0,   (14.11)

which is of Riccati type, that is, with a quadratic nonlinearity. By taking the trace of (14.11), we arrive at

    (d/dt) tr U + tr (U²) + tr R = 0.

Now the trace of R(t, x) only depends on γ_t and γ̇_t; in fact, as noticed before, it is precisely the value of the Ricci curvature at γ(t), evaluated in the direction γ̇(t). So we have arrived at our first important equation involving Ricci curvature:

    (d/dt) tr U + tr (U²) + Ric(γ̇) = 0,   (14.12)

where of course Ric(γ̇) is an abbreviation for Ric_{γ(t)}(γ̇(t), γ̇(t)).

Equation (14.12) holds true for any vector field ξ, as long as ξ is covariantly differentiable at x. But in the sequel, I shall only apply it in the particular case when ξ derives from a function: ξ = ∇ψ; and ψ is

locally semiconvex with a quadratic modulus of semiconvexity. There are three reasons for this restriction:

(a) In the theory of optimal transport, one only needs to consider such maps;
(b) The semiconvexity of ψ guarantees the almost everywhere differentiability of ∇ψ, by Theorem 14.1;
(c) If ξ = ∇ψ, then ∇ξ(x) = ∇²ψ(x) is symmetric, and this will imply the symmetry of U(t, x) at all times; this symmetry will allow us to derive from (14.12) a closed inequality on tr U(t, x) = J̇(t, x)/J(t, x).

So from now on, ξ = ∇ψ, where ψ is semiconvex. To prove the symmetry of U(t, x), note that U(0, x) = ∇²ψ(x) (recall J(0, x) = I_n) and R(t, x), modulo identification, are symmetric; so U(t, x) and its transpose U(t, x)* solve the same differential equation, with the same initial conditions. Then, by the uniqueness statement in the Cauchy–Lipschitz theorem, they have to coincide at all times where they are defined.

Equation (14.12) cannot be recast as a differential equation involving only the Jacobian determinant (or equivalently tr U(t, x)), since the quantity tr (U²) in (14.12) cannot be expressed in terms of tr U. However, the symmetry of U allows us to use the Cauchy–Schwarz inequality, in the form

    tr (U²) ≥ (tr U)² / n;

then, by plugging this inequality into (14.12), we obtain an important differential inequality involving Ricci curvature:

    (d/dt) tr U + (tr U)² / n + Ric(γ̇) ≤ 0.   (14.13)

There are several ways to rewrite this result in terms of the Jacobian determinant J(t). For instance, by differentiating the formula

    tr U = J̇ / J,

one obtains easily

    (d/dt) tr U + (1/n) (tr U)² = J̈/J − (1 − 1/n) (J̇/J)².

So (14.13) becomes

    J̈/J − (1 − 1/n) (J̇/J)² ≤ −Ric(γ̇).   (14.14)

For later purposes, it will be convenient to define D(t) := J(t)^{1/n} (which one can think of as a coefficient of mean distortion); then the left-hand side of (14.14) is exactly n D̈/D. So

    D̈/D ≤ −Ric(γ̇)/n.   (14.15)

Yet another useful formula is obtained by introducing ℓ(t) := −log J(t), and then (14.13) becomes

    ℓ̈(t) ≥ (ℓ̇(t))²/n + Ric(γ̇).   (14.16)

In all of these formulas, we have always taken t = 0 as the starting time, but it is clear that we could do just the same with any starting time t_0 ∈ [0, 1], that is, consider, instead of T_t(x) = exp(t∇ψ(x)), the map T_{t_0→t}(x) = exp((t − t_0)∇ψ(x)). Then all the differential inequalities are unchanged; the only difference is that the Jacobian determinant at time t = 0 is not necessarily 1.

Taking out the direction of motion

The previous formulas are quite sufficient to derive many useful geometric consequences. However, one can refine them by taking advantage of the fact that curvature is not felt in the direction of motion. In other words, if one is traveling along some geodesic γ, one will never be able to detect some curvature by considering variations (in the initial position, or initial velocity) in the direction of γ itself: the path will always be the same, up to reparametrization. This corresponds to the property R(t) γ̇(t) = 0, where R(t) is the matrix appearing in (14.6). In short, curvature is felt only in n − 1 directions out of n. This loose principle often leads to a refinement of estimates by a factor (n − 1)/n.

Here is a recipe to "separate out" the direction of motion from the other directions. As before, assume that the first vector of the orthonormal basis E(0) is e_1(0) = γ̇(0)/|γ̇(0)|. (The case when γ̇(0) = 0

can be treated separately.) Set u_{//} = u_{11} (this is the coefficient in U which corresponds to just the direction of motion), and define U_⊥ as the (n − 1) × (n − 1) matrix obtained by removing the first line and first column in U. Of course, tr U = u_{//} + tr U_⊥. Next decompose the Jacobian determinant J into a parallel and an orthogonal contribution:

    J(t) = J_{//}(t) J_⊥(t),   J_{//}(t) = exp( ∫₀ᵗ u_{//}(s) ds ).

Further define parallel and orthogonal distortions by

    D_{//} = J_{//},   D_⊥ = J_⊥^{1/(n−1)};

and, of course,

    ℓ_{//} = −log J_{//},   ℓ_⊥ = −log J_⊥.   (14.17)

Since the first row of R(t) vanishes, equation (14.11) implies

    u̇_{//} = −Σ_j u²_{1j} ≤ −u²_{//}.

It follows easily that

    ℓ̈_{//} ≥ ℓ̇²_{//},   (14.18)

or equivalently

    D̈_{//} ≤ 0,   (14.19)

so J_{//} is always a concave function of t (this property does not depend on the curvature of M), and the same holds true of course for D_{//}, which coincides with J_{//}.

Now let us take care of the orthogonal part: Putting together (14.9), (14.10), (14.11), (14.18), we find

    ℓ̈_⊥ = −(d/dt) (tr U) − ℓ̈_{//} = tr (U²) + Ric(γ̇) − Σ_j u²_{1j}.

Since tr (U²) = tr (U_⊥²) + 2 Σ_{j≥2} u²_{1j} + u²_{//}, this implies

    ℓ̈_⊥ ≥ tr (U_⊥²) + Ric(γ̇).   (14.20)

Then in the same manner as before, one can obtain

    ℓ̈_⊥ ≥ (ℓ̇_⊥)²/(n − 1) + Ric(γ̇),   (14.21)

    D̈_⊥/D_⊥ ≤ −Ric(γ̇)/(n − 1).   (14.22)

To summarize: The basic inequalities for ℓ_⊥ and ℓ_{//} are the same as for ℓ, but with the exponent n replaced by n − 1 in the case of ℓ_⊥, and 1 in the case of ℓ_{//}; and the number Ric(γ̇) replaced by 0 in the case of ℓ_{//}. The same for D_⊥ and D_{//}.

Positivity of the Jacobian

Unlike the distance function, the exponential map is always smooth. But this does not prevent the Jacobian determinant J(t) from vanishing, i.e. the matrix J(t) from becoming singular (not invertible). Then computations such as (14.9) break down. So all the computations performed before are only valid if J(t) is positive for all t ∈ (0, 1).

In terms of ℓ(t) = −log J(t), the vanishing of the Jacobian determinant corresponds to a divergence ℓ(t) → ∞. Readers familiar with ordinary differential equations will have no trouble believing that these events are not rare: Indeed, ℓ solves the Riccati-type equation (14.16), and such equations often lead to blow-up in finite time. For instance, consider a function ℓ(t) that solves

    ℓ̈ ≥ K + ℓ̇²/(n − 1),

where K > 0. Consider a time t_0 where ℓ has a minimum, so ℓ̇(t_0) = 0. Then ℓ cannot be defined on a time-interval larger than [t_0 − T, t_0 + T], where T := π √((n − 1)/K). So the Jacobian has to vanish at some time, and we even have a bound on this time. (With a bit more work, this estimate implies the Bonnet–Myers theorem, which asserts that the diameter of M cannot be larger than π √((n − 1)/K) if Ric ≥ Kg.)

The vanishing of the Jacobian may occur even along geodesics that are minimizing for all times: Consider for instance ξ(x) = −2x in ℝⁿ; then the image of exp(tξ) is reduced to a single point when t = 1/2. However, in the case of optimal transport, the Jacobian cannot vanish at intermediate times, at least for almost all initial points: Recall indeed the last part of Theorem 11.3.
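The blow-up mechanism behind this bound can be watched numerically. The following editor's sketch (illustration only; K, n, the step size and the thresholds are arbitrary choices) integrates the equality case ℓ̈ = K + ℓ̇²/(n − 1) from a minimum of ℓ; comparison with the explicit solution ℓ̇(t) = √(K(n−1)) tan(√(K/(n−1)) t) shows ℓ̇ must diverge by (π/2)√((n−1)/K), well within the bound T stated above.

```python
import numpy as np

# Integrate l'' = K + (l')^2 / (n-1) with l'(0) = 0 by forward Euler
# and record when l' exceeds a large threshold (a proxy for blow-up).
K, n = 2.0, 3
T_half = (np.pi / 2) * np.sqrt((n - 1) / K)   # exact blow-up time of the equality case

t, dt = 0.0, 1e-5
lp = 0.0                                      # current value of l'(t)
while lp < 1e6 and t < 10.0:
    lp += dt * (K + lp**2 / (n - 1))
    t += dt

assert abs(t - T_half) < 0.01                 # blow-up essentially at (pi/2)*sqrt((n-1)/K)
print("l' exceeded 1e6 near t =", round(t, 4), "; predicted", round(T_half, 4))
```

Since ℓ = −log J, the divergence of ℓ̇ is exactly the vanishing of the Jacobian determinant J.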
This property can be seen as a result of the very special choice of the velocity field ξ, which is the gradient of a d²/2-convex function; or as a consequence of the "no-crossing"

property explored in Chapter 8. (There is also an interpretation in terms of Jacobi fields, see Proposition 14.31 in the Third Appendix.)

Bochner's formula

So far, we have discussed curvature from a Lagrangian point of view, that is, by going along a geodesic path γ(t), keeping the memory of the initial position. It is useful to be also familiar with the Eulerian point of view, in which the focus is not on the trajectory, but on the velocity field ξ = ξ(t, x). To switch from Lagrangian to Eulerian description, just write

    γ̇(t) = ξ(t, γ(t)).   (14.23)

In general, this can be a subtle issue because two trajectories might cross, and then there might be no way to define a meaningful velocity field ξ(t, ·) at the crossing point. However, if a smooth vector field ξ = ξ(0, ·) is given, then around any point x₀ the trajectories γ(t, x) = exp_x(tξ(x)) do not cross for |t| small enough, and one can define ξ(t, x) without ambiguity. The covariant differentiation of (14.23) along ξ itself, and the geodesic equation γ̈ = 0, yield

    ∂ξ/∂t + ∇_ξ ξ = 0,   (14.24)

which is the pressureless Euler equation. From a physical point of view, this equation describes the velocity field of a bunch of particles which travel along geodesic curves without interacting.

The derivation of (14.24) will fail when the geodesic paths start to cross, at which point the solution to (14.24) would typically lose smoothness and need reinterpretation. But for the sequel, we only need (14.24) to be satisfied for small values of |t|, and locally around x.

All the previous discussion about Ricci curvature can be recast in Eulerian terms. Let γ(t, x) = exp_x(tξ(x)); by the definition of the covariant gradient, we have

    J̇(t, x) = ∇ξ(t, γ(t, x)) J(t, x)

(the same formula that we had before at time t = 0). Under the identification of ℝⁿ with T_{γ(t)} M provided by the basis E(t), we can identify J with the matrix J, and then

    U(t, x) = J̇(t, x) J(t, x)⁻¹ = ∇ξ(t, γ(t, x)),   (14.25)

where again the linear operator ∇ξ is identified with its matrix in the basis E.

Then tr U(t, x) = tr ∇ξ(t, x) coincides with the divergence of ξ(t, ·), evaluated at x. By the chain-rule and (14.24),

    (d/dt) tr U(t, x) = (d/dt) (∇·ξ)(t, γ(t, x)) = (∂/∂t)(∇·ξ)(t, γ(t, x)) + γ̇(t, x)·∇(∇·ξ)(t, γ(t, x)) = ( −∇·(∇_ξ ξ) + ξ·∇(∇·ξ) )(t, γ(t, x)).

Thus the Lagrangian formula (14.12) can be translated into the Eulerian formula

    −∇·(∇_ξ ξ) + ξ·∇(∇·ξ) + tr ((∇ξ)²) + Ric(ξ) = 0.   (14.26)

All functions here are evaluated at (t, γ(t, x)), and of course we can choose t = 0, and x arbitrary. So (14.26) is an identity that holds true for any smooth (say C²) vector field ξ on our manifold M. Of course it can also be established directly by a coordinate computation.¹

While formula (14.26) holds true for all vector fields ξ, if ∇ξ is symmetric then two simplifications arise:

(a) ∇_ξ ξ = ∇(|ξ|²/2);
(b) tr ((∇ξ)²) = ‖∇ξ‖²_HS, HS standing for Hilbert–Schmidt norm.

So (14.26) becomes

    −∆(|ξ|²/2) + ξ·∇(∇·ξ) + ‖∇ξ‖²_HS + Ric(ξ) = 0.   (14.27)

We shall apply it only in the case when ξ is a gradient: ξ = ∇ψ; then ∇ξ = ∇²ψ is indeed symmetric, and the resulting formula is

    −∆(|∇ψ|²/2) + ∇ψ·∇(∆ψ) + ‖∇²ψ‖²_HS + Ric(∇ψ) = 0.   (14.28)

¹ With the notation ξ·∇ = ∇_ξ (which is classical in fluid mechanics), and tr ((∇ξ)²) = ∇ξ··∇ξ, (14.26) takes the amusing form −∇·(ξ·∇ξ) + ξ·∇(∇·ξ) + ∇ξ··∇ξ + Ric(ξ) = 0.

The identity (14.26), or its particular case (14.28), is called the Bochner–Weitzenböck–Lichnerowicz formula, or just Bochner's formula.²

Remark 14.5. With the ansatz ξ = ∇ψ, the pressureless Euler equation (14.24) reduces to the Hamilton–Jacobi equation

    ∂ψ/∂t + |∇ψ|²/2 = 0.   (14.29)

One can use this equation to obtain (14.28) directly, instead of first deriving (14.26). Here equation (14.29) is to be understood in a viscosity sense (otherwise there are many spurious solutions); in fact the reader might just as well take the identity

    ψ(t, x) = inf_{y ∈ M} [ ψ(y) + d(x, y)²/(2t) ]

as the definition of the solution of (14.29). Then the geodesic curves γ starting with γ(0) = x, γ̇(0) = ∇ψ(x) are called characteristic curves of equation (14.29).

Remark 14.6. Here I have not tried to derive Bochner's formula for nonsmooth functions. This could be done for semiconvex ψ, with an appropriate "compensated" definition for −∆(|∇ψ|²/2) + ∇ψ·∇(∆ψ). In fact, the semiconvexity of ψ prevents the formation of instantaneous shocks, and will allow the Lagrangian/Eulerian duality for a short time.

Remark 14.7. The operator U(t, x) coincides with ∇²ψ(t, γ(t, x)), which is another way to see that it is symmetric for t > 0.

From this point on, we shall only work with (14.28). Of course, by using the Cauchy–Schwarz inequality as before, we can bound ‖∇²ψ‖²_HS below by (∆ψ)²/n; therefore (14.28) implies

² In (14.26) or (14.28) I have written Bochner's formula in purely "metric" terms, which will probably look quite ugly to many geometer readers. An equivalent but more "topological" way to write Bochner's formula is

    ∆ + ∇*∇ + Ric = 0,

where ∆ = −(dd* + d*d) is the Laplace operator on 1-forms, ∇ is the covariant differentiation (under the identification of a 1-form with a vector field) and the adjoints are in L²(vol).
Also I should note that the name “Bochner formula” is attributed to a number of related identities.

Δ (|∇ψ|²/2) − ∇ψ·∇(Δψ) ≥ (Δψ)²/n + Ric(∇ψ).    (14.30)

Apart from regularity issues, this inequality is strictly equivalent to (14.13), and therefore to (14.14) or (14.15).

Not so much has been lost when going from (14.28) to (14.30): there is still equality in (14.30) at all points x where ∇²ψ(x) is a multiple of the identity.

One can also take out the direction of motion, ∇̂ψ := (∇ψ)/|∇ψ|, from the Bochner identity. The Hamilton–Jacobi equation implies ∂_t ∇̂ψ + (∇²ψ·∇̂ψ)_⊥ = 0, so

⟨∇²ψ·∇̂ψ, ∂_t ∇̂ψ⟩ = −⟨∇²ψ·∇̂ψ, ∇²ψ·∇̂ψ − (∇̂ψ·∇²ψ·∇̂ψ) ∇̂ψ⟩,

and by symmetry the latter term can be rewritten −|(∇²ψ·∇̂ψ)_⊥|². From this one easily obtains the following refinement of Bochner's formula: Define

Δ_{//} f = ⟨∇̂ψ, ∇²f·∇̂ψ⟩,    Δ_⊥ = Δ − Δ_{//};

then

Δ_{//} (|∇ψ|²/2) − ∇ψ·∇(Δ_{//}ψ) ≥ (Δ_{//}ψ)² − 2 |(∇²ψ·∇̂ψ)_⊥|²,

Δ_⊥ (|∇ψ|²/2) − ∇ψ·∇(Δ_⊥ψ) ≥ ‖∇²_⊥ψ‖²_HS + 2 |(∇²ψ·∇̂ψ)_⊥|² + Ric(∇ψ).    (14.31)

This is the "Bochner formula with the direction of motion taken out". I have to confess that I never saw these frightening formulas anywhere, and don't know whether they have any use. But of course, they are equivalent to their Lagrangian counterpart, which will play a crucial role in the sequel.

Analytic and geometric consequences of Ricci curvature bounds

Inequalities (14.13), (14.14), (14.15) and (14.30) are the "working heart" of Ricci curvature analysis. Many geometric and analytic consequences follow from these estimates.

Here is a first example coming from analysis and partial differential equations theory: If the Ricci curvature of M is globally bounded below (inf_x Ric_x > −∞), then there exists a unique heat kernel, i.e. a measurable function p_t(x,y) (t > 0, x ∈ M, y ∈ M), integrable in y, smooth outside of the diagonal x = y, such that f(t,x) := ∫ p_t(x,y) f₀(y) dvol(y) solves the heat equation ∂_t f = Δf with initial datum f₀.

Here is another example in which some topological information can be recovered from Ricci bounds: If M is a manifold with nonnegative Ricci curvature (Ric_x ≥ 0 for each x), and there exists a line in M, that is, a geodesic γ which is minimizing for all values of time t ∈ ℝ, then M is isometric to ℝ × M′, for some Riemannian manifold M′. This is the splitting theorem, in a form proven by Cheeger and Gromoll.

Many quantitative statements can be obtained from (i) a lower bound on the Ricci curvature and (ii) an upper bound on the dimension of the manifold. Below is a (grossly nonexhaustive) list of some famous such results. In the statements to come, M is always assumed to be a smooth, complete Riemannian manifold, vol stands for the Riemannian volume on M, Δ for the Laplace operator and d for the Riemannian distance; K is the lower bound on the Ricci curvature, and n is the dimension of M. Also, if A is a measurable set, then A^r will denote its r-neighborhood, which is the set of points that lie at a distance at most r from A. Finally, the "model space" is the simply connected Riemannian manifold with constant sectional curvature which has the same dimension as M, and Ricci curvature constantly equal to K (more rigorously, to Kg, where g is the metric tensor on the model space).

1. Volume growth estimates: The Bishop–Gromov inequality (also called Riemannian volume comparison theorem) states that the volume of balls does not increase faster than the volume of balls in the model space. In formulas: for any x ∈ M,

vol [B_r(x)] / V(r)  is a nonincreasing function of r,

where

V(r) = ∫₀^r s(r′) dr′,

s(r) = c_{n,K} ×
  sin( √(K/(n−1)) r )^{n−1}      if K > 0
  r^{n−1}                        if K = 0
  sinh( √(|K|/(n−1)) r )^{n−1}   if K < 0.

Here of course s(r) is the surface area of B_r(0) in the model space, that is the (n−1)-dimensional volume of ∂B_r(0), and c_{n,K} is a nonessential normalizing constant. (See Theorem 18.8 later in this course.)

2. Diameter estimates: The Bonnet–Myers theorem states that, if K > 0, then M is compact and more precisely

diam (M) ≤ π √((n−1)/K),

with equality for the model sphere.

3. Spectral gap inequalities: If K > 0, then the spectral gap λ₁ of the nonnegative operator −Δ is bounded below:

λ₁ ≥ nK/(n−1),

with equality again for the model sphere. (See Theorem 21.20 later in this course.)

4. (Sharp) Sobolev inequalities: If K > 0 and n ≥ 3, let μ = vol/vol[M] be the normalized volume measure on M; then for any smooth function f on M,

‖f‖²_{L^{2*}(μ)} ≤ ‖f‖²_{L²(μ)} + ( 4(n−1)/(K n (n−2)) ) ‖∇f‖²_{L²(μ)},    2* = 2n/(n−2),

and those constants are sharp for the model sphere.

5. Heat kernel bounds: There are many of them, in particular the well-known Li–Yau estimates: If K ≥ 0, then the heat kernel p_t(x,y) satisfies

p_t(x,y) ≤ ( C / vol [B_{√t}(x)] ) exp( −d(x,y)²/(2Ct) ),

for some constant C which only depends on n. For K < 0, a similar bound holds true, only now C depends on K and there is an additional

factor e^{Ct}. There are also pointwise estimates on the derivatives of log p_t, in relation with Harnack inequalities.

The list could go on. More recently, Ricci curvature has been at the heart of Perelman's solution of the celebrated Poincaré conjecture, and more generally the topological classification of three-dimensional manifolds. Indeed, Perelman's argument is based on Hamilton's idea to use Ricci curvature in order to define a "heat flow" in the space of metrics, via the partial differential equation

∂g/∂t = −2 Ric(g),    (14.32)

where Ric(g) is the Ricci tensor associated with the metric g, which can be thought of as something like −Δg. The flow defined by (14.32) is called the Ricci flow. Some time ago, Hamilton had already used its properties to show that a compact simply connected three-dimensional Riemannian manifold with positive Ricci curvature is automatically diffeomorphic to the sphere S³.

Change of reference measure and effective dimension

For various reasons, one is often led to consider a reference measure ν that is not the volume measure vol, but, say, ν(dx) = e^{−V(x)} vol(dx), for some function V : M → ℝ, which in this chapter will always be assumed to be of class C². The metric–measure space (M,d,ν), where d stands for the geodesic distance, may be of interest in its own right, or may appear as a limit of Riemannian manifolds, in a sense that will be studied in Part III of these notes.

Of course, such a change of reference measure affects Jacobian determinants; so Ricci curvature estimates will lose their geometric meaning unless one changes the definition of the Ricci tensor to take the new reference measure into account. This might perturb the dependence of all estimates on the dimension, so it might also be a good idea to introduce an "effective dimension" N, which may be larger than the "true" dimension n of the manifold.
The most well-known example is certainly the Gaussian measure ) n n ( γ R in (do not confuse it with a geodesic!): , which I shall denote by 2 | x −| e dx ) n ( n ( ) = γ dx . R ∈ , x 2 / n ) (2 π

It is a matter of experience that most theorems which we encounter about the Gaussian measure can be written just the same in dimension 1 or in dimension n, or even in infinite dimension, when properly interpreted. In fact, the effective dimension of (ℝⁿ, γ^{(n)}) is infinite, in a certain sense, whatever n. I admit that this perspective may look strange, and might be the result of lack of imagination; but in any case, it will fit very well into the picture (in terms of sharp constants for geometric inequalities, etc.).

So, again let

T_t(x) = γ(t,x) = exp_x( t ∇ψ(x) );

now the Jacobian determinant is

J(t,x) = lim_{r↓0} ν[T_t(B_r(x))] / ν[B_r(x)] = ( e^{−V(T_t(x))} / e^{−V(x)} ) J₀(t,x),

where J₀ is the Jacobian corresponding to V ≡ 0 (that is, to ν = vol). Then (with dots still standing for derivation with respect to t),

(log J)˙(t,x) = (log J₀)˙(t,x) − γ̇(t,x)·∇V(γ(t,x)),

(log J)¨(t,x) = (log J₀)¨(t,x) − ⟨∇²V(γ(t,x))·γ̇(t,x), γ̇(t,x)⟩.

For later purposes it will be useful to keep track of all error terms in the inequalities. So rewrite (14.12) as

(tr U)˙ + (tr U)²/n + Ric(γ̇) = −‖U − (tr U/n) I_n‖²_HS.    (14.33)

Then the left-hand side in (14.33) becomes

(log J₀)¨ + [(log J₀)˙]²/n + Ric(γ̇)
= (log J)¨ + ⟨∇²V(γ)·γ̇, γ̇⟩ + [ (log J)˙ + γ̇·∇V(γ) ]²/n + Ric(γ̇).

By using the identity

a²/n + b²/(N−n) = (a−b)²/N + ( (N−n)/(nN) ) ( a + (n/(N−n)) b )²,    (14.34)

we see that

[ (log J)˙ + γ̇·∇V(γ) ]² / n

= [(log J)˙]²/N − (γ̇·∇V(γ))²/(N−n) + ((N−n)/(nN)) [ (log J)˙ + (N/(N−n)) γ̇·∇V(γ) ]²

= [(log J)˙]²/N − (γ̇·∇V(γ))²/(N−n) + ((N−n)/(nN)) [ (log J₀)˙ + (n/(N−n)) γ̇·∇V(γ) ]²

= [(log J)˙]²/N − (γ̇·∇V(γ))²/(N−n) + ((N−n)/(nN)) [ tr U + (n/(N−n)) γ̇·∇V(γ) ]².

To summarize these computations it will be useful to introduce some more notation: first, as usual, the negative logarithm of the Jacobian determinant:

ℓ(t,x) := −log J(t,x);    (14.35)

and then, the generalized Ricci tensor:

Ric_{N,ν} := Ric + ∇²V − (∇V ⊗ ∇V)/(N−n),    (14.36)

where the tensor product ∇V ⊗ ∇V is a quadratic form on TM, defined by its action on tangent vectors as

(∇V ⊗ ∇V)_x (v) = ( ∇V(x)·v )²;

so

Ric_{N,ν}(γ̇) = (Ric + ∇²V)(γ̇) − (∇V·γ̇)²/(N−n).

It is implicitly assumed in (14.36) that N ≥ n (otherwise the correct definition is Ric_{N,ν} = −∞); if N = n the convention is 0 × ∞ = 0, so (14.36) still makes sense if ∇V = 0. Note that Ric_{∞,ν} = Ric + ∇²V, while Ric_{n,vol} = Ric.

The conclusion of the preceding computations is that

ℓ̈ = ℓ̇²/N + Ric_{N,ν}(γ̇) + ‖U − (tr U/n) I_n‖²_HS
  + ((N−n)/(nN)) [ tr U + (n/(N−n)) γ̇·∇V(γ) ]².    (14.37)
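The elementary algebraic identity (14.34) behind this decomposition can be checked exactly in rational arithmetic; a throwaway sketch (the values of a, b, n, N are arbitrary test data):

```python
from fractions import Fraction as F

def lhs(a, b, n, N):                       # a²/n + b²/(N−n)
    return a * a / n + b * b / (N - n)

def rhs(a, b, n, N):                       # (a−b)²/N + ((N−n)/(nN)) (a + n b/(N−n))²
    return (a - b)**2 / N + (N - n) / (n * N) * (a + n * b / (N - n))**2

n, N = F(3), F(10)
for a in (F(3), F(-2, 5), F(0)):
    for b in (F(1, 3), F(7), F(-4)):
        assert lhs(a, b, n, N) == rhs(a, b, n, N)   # exact equality, no rounding
```

Exact Fraction arithmetic makes the check an identity rather than an approximation.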

When N = ∞ this takes a simpler form:

ℓ̈ = Ric_{∞,ν}(γ̇) + ‖U − (tr U/n) I_n‖²_HS + (tr U)²/n.    (14.38)

When N < ∞ one can introduce

D(t) := J(t)^{1/N},

and then formula (14.37) becomes

D̈/D = −(1/N) [ Ric_{N,ν}(γ̇) + ‖U − (tr U/n) I_n‖²_HS
  + ((N−n)/(nN)) ( tr U + (n/(N−n)) γ̇·∇V(γ) )² ].    (14.39)

Of course, it is a trivial corollary of (14.37) and (14.39) that

ℓ̈ ≥ ℓ̇²/N + Ric_{N,ν}(γ̇);
−N D̈/D ≥ Ric_{N,ν}(γ̇).    (14.40)

Finally, if one wishes, one can also take out the direction of motion (skip at first reading and go directly to the next section). Define, with self-explicit notation,

J_⊥(t,x) = ( e^{−V(T_t(x))} / e^{−V(x)} ) (J₀)_⊥(t,x),    ℓ_⊥ = −log J_⊥,    D_⊥ = J_⊥^{1/(N−1)}.

Now, in place of (14.33), use

(tr U_⊥)˙ + (tr U_⊥)²/(n−1) + Ric(γ̇) = −‖U_⊥ − (tr U_⊥/(n−1)) I_{n−1}‖²_HS − Σ_{j=2}^n u_{1j}²    (14.41)

as a starting point. Computations quite similar to the ones above lead to

ℓ̈_⊥ = (ℓ̇_⊥)²/(N−1) + Ric_{N,ν}(γ̇) + ‖U_⊥ − (tr U_⊥/(n−1)) I_{n−1}‖²_HS + Σ_{j=2}^n u_{1j}²
  + ((N−n)/((n−1)(N−1))) [ tr U_⊥ + ((n−1)/(N−n)) γ̇·∇V(γ) ]².    (14.42)

In the case N = ∞, this reduces to

ℓ̈_⊥ = Ric_{∞,ν}(γ̇) + ‖U_⊥ − (tr U_⊥/(n−1)) I_{n−1}‖²_HS + Σ_{j=2}^n u_{1j}² + (tr U_⊥)²/(n−1);    (14.43)

and in the case N < ∞, to

D̈_⊥/D_⊥ = −(1/(N−1)) [ Ric_{N,ν}(γ̇) + ‖U_⊥ − (tr U_⊥/(n−1)) I_{n−1}‖²_HS + Σ_{j=2}^n u_{1j}²
  + ((N−n)/((n−1)(N−1))) ( tr U_⊥ + ((n−1)/(N−n)) γ̇·∇V(γ) )² ].    (14.44)

As corollaries,

ℓ̈_⊥ ≥ (ℓ̇_⊥)²/(N−1) + Ric_{N,ν}(γ̇);
−(N−1) D̈_⊥/D_⊥ ≥ Ric_{N,ν}(γ̇).    (14.45)

Generalized Bochner formula and Γ₂ formalism

Of course there is an Eulerian translation of all that. This Eulerian formula can be derived either from the Lagrangian calculation, or from the Bochner formula, by a calculation parallel to the above one; the latter approach is conceptually simpler, while the former is faster. In any case the result is best expressed in terms of the differential operator

L = Δ − ∇V·∇,    (14.46)

and can be written

L (|∇ψ|²/2) − ∇ψ·∇(Lψ)
= (Ric + ∇²V)(∇ψ) + ‖∇²ψ‖²_HS
= Ric_{N,ν}(∇ψ) + (Lψ)²/N + ‖∇²ψ − (Δψ/n) I_n‖²_HS
  + ((N−n)/(nN)) [ Δψ + (n/(N−n)) ∇ψ·∇V ]².
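As a concrete sanity check (a sketch only, not from the text): take M = ℝ, so n = 1 and Ric = 0, with V(x) = Kx²/2, hence Ric_{∞,ν} = V″ = K. On polynomials every term above can be computed exactly, and the left-hand side L(|∇ψ|²/2) − ∇ψ·∇(Lψ) should equal ‖∇²ψ‖²_HS + Ric_{∞,ν}(∇ψ) = (ψ″)² + K(ψ′)²:

```python
from fractions import Fraction as F

# polynomials as coefficient lists: p[i] is the coefficient of x^i
def deriv(p):
    return [F(i) * c for i, c in enumerate(p)][1:] or [F(0)]

def mul(p, q):
    r = [F(0)] * (len(p) + len(q) - 1)
    for i, a in enumerate(p):
        for j, b in enumerate(q):
            r[i + j] += a * b
    return r

def lin(p, q, s):                      # p + s*q, padded to equal length
    m = max(len(p), len(q))
    p = p + [F(0)] * (m - len(p)); q = q + [F(0)] * (m - len(q))
    return [a + s * b for a, b in zip(p, q)]

def trim(p):
    while len(p) > 1 and p[-1] == 0:
        p = p[:-1]
    return p

K = F(3, 2)
def L(p):                              # L = d²/dx² − V' d/dx with V = K x²/2
    return lin(deriv(deriv(p)), mul([F(0), K], deriv(p)), F(-1))

psi = [F(0), F(2), F(0), F(1)]         # ψ(x) = 2x + x³, an arbitrary test polynomial
p1, p2 = deriv(psi), deriv(deriv(psi))
half_grad2 = [c / 2 for c in mul(p1, p1)]                     # |∇ψ|²/2
bochner  = lin(L(half_grad2), mul(p1, deriv(L(psi))), F(-1))  # L(|∇ψ|²/2) − ∇ψ·∇(Lψ)
expected = lin(mul(p2, p2), mul(p1, p1), K)                   # (ψ″)² + K (ψ′)²
assert trim(bochner) == trim(expected)
```

The comparison is an exact polynomial identity, so no numerical tolerance is needed.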

It is convenient to reformulate this formula in terms of the Γ₂ formalism. Given a general linear operator L, one defines the associated Γ operator (or carré du champ) by the formula

Γ(f,g) = ½ [ L(fg) − f Lg − g Lf ].    (14.47)

Note that Γ is a bilinear operator, which in some sense encodes the deviation of L from being a derivation operator. In our case, for (14.46), Γ(f,g) = ∇f·∇g.

Next introduce the Γ₂ operator (or carré du champ itéré):

Γ₂(f,g) = ½ [ L Γ(f,g) − Γ(f, Lg) − Γ(g, Lf) ].    (14.48)

In the case of (14.46), the important formula for later purposes is

Γ₂(ψ) := Γ₂(ψ,ψ) = L (|∇ψ|²/2) − ∇ψ·∇(Lψ).    (14.49)

Then our previous computations can be rewritten as

Γ₂(ψ) = Ric_{N,ν}(∇ψ) + (Lψ)²/N + ‖∇²ψ − (Δψ/n) I_n‖²_HS
  + ((N−n)/(nN)) [ Δψ + (n/(N−n)) ∇ψ·∇V ]².    (14.50)

Of course, a trivial corollary is

Γ₂(ψ) ≥ Ric_{N,ν}(∇ψ) + (Lψ)²/N.    (14.51)

And as the reader has certainly guessed, one can now take out the direction of motion (this computation is provided for completeness but will not be used): As before, define

∇̂ψ = ∇ψ / |∇ψ|;

then if f is a smooth function, let ∇²_⊥f be ∇²f restricted to the space orthogonal to ∇ψ, let Δ_⊥f = tr (∇²_⊥f), and

Δ_⊥f = Δf − ⟨∇̂ψ, ∇²f·∇̂ψ⟩,

and next,

L_⊥f = Δ_⊥f − ∇V·∇f,

Γ_{2,⊥}(ψ) = L_⊥ (|∇ψ|²/2) − ∇ψ·∇(L_⊥ψ) − 2 |(∇²ψ·∇̂ψ)_⊥|².

Then

Γ_{2,⊥}(ψ) = Ric_{N,ν}(∇ψ) + (L_⊥ψ)²/(N−1) + ‖∇²_⊥ψ − (Δ_⊥ψ/(n−1)) I_{n−1}‖²_HS + Σ_{j=2}^n (∇²ψ)_{1j}²
  + ((N−n)/((n−1)(N−1))) [ Δ_⊥ψ + ((n−1)/(N−n)) ∇ψ·∇V ]².

Curvature-dimension bounds

It is convenient to declare that a Riemannian manifold M, equipped with its volume measure, satisfies the curvature-dimension estimate CD(K,N) if its Ricci curvature is bounded below by K and its dimension is bounded above by N: Ric ≥ K, n ≤ N. (As usual, Ric ≥ K is a shorthand for "∀x, Ric_x ≥ K g_x".) The number K might be positive or negative. If the reference measure is not the volume, but ν = e^{−V} vol, then the correct definition is Ric_{N,ν} ≥ K.

Most of the previous discussion is summarized by Theorem 14.8 below, which is all the reader needs to know about Ricci curvature to understand the rest of the proofs in this course. For convenience I shall briefly recall the notation:

• measures: vol is the volume on M, ν = e^{−V} vol is the reference measure;

• operators: Δ is the Laplace(–Beltrami) operator on M, ∇² is the Hessian operator, L = Δ − ∇V·∇ is the modified Laplace operator, and Γ₂(ψ) = L (|∇ψ|²/2) − ∇ψ·∇(Lψ);

• tensors: Ric is the Ricci curvature bilinear form, and Ric_{N,ν} is the modified Ricci tensor: Ric_{N,ν} = Ric + ∇²V − (∇V ⊗ ∇V)/(N−n), where ∇²V(x) is the Hessian of V at x, identified to a bilinear form;

• functions: ψ is an arbitrary function; in formulas involving the Γ₂ formalism it will be assumed to be of class C³, while in formulas involving Jacobian determinants it will only be assumed to be semiconvex;

• geodesic paths: If ψ is a given function on M, γ(t,x) = T_t(x) = exp_x( (t − t₀) ∇ψ(x) ) is the geodesic starting from x with velocity γ̇(t₀,x) = ∇ψ(x), evaluated at time t ∈ [0,1]; it is assumed that J(t,x) does not vanish for t ∈ (0,1); the starting time t₀ may be the origin t = 0, or any other time in [0,1];

• Jacobian determinants: J(t,x) is the Jacobian determinant of T_t(x) (with respect to the reference measure ν, not with respect to the standard volume), ℓ = −log J, and D = J^{1/N} is the mean distortion associated with (T_t);

• the dot means differentiation with respect to time;

• finally, the subscript ⊥ in J_⊥, D_⊥, ℓ_⊥, Γ_{2,⊥} means that the direction of motion γ̇ = ∇ψ has been taken out (see above for precise definitions).

Theorem 14.8 (CD(K,N) curvature-dimension bound). Let M be a Riemannian manifold of dimension n, and let K ∈ ℝ, N ∈ [n,∞]. Then, the conditions below are all equivalent if they are required to hold true for arbitrary data; when they are fulfilled, M is said to satisfy the CD(K,N) curvature-dimension bound:

(i) Ric_{N,ν} ≥ K;

(ii) Γ₂(ψ) ≥ (Lψ)²/N + K |∇ψ|²;

(iii) ℓ̈ ≥ ℓ̇²/N + K |γ̇|².

If N < ∞, this is also equivalent to

(iv) D̈ + ( K |γ̇|²/N ) D ≤ 0.

Moreover, these inequalities are also equivalent to

(ii') Γ_{2,⊥}(ψ) ≥ (L_⊥ψ)²/(N−1) + K |∇ψ|²;

(iii') ℓ̈_⊥ ≥ (ℓ̇_⊥)²/(N−1) + K |γ̇|²;

and, in the case N < ∞,

(iv') D̈_⊥ + ( K |γ̇|²/(N−1) ) D_⊥ ≤ 0.
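The "elementary manipulations about derivatives" linking (iii) and (iv) amount to the pointwise identity D̈ + (K|γ̇|²/N) D = −(D/N)(ℓ̈ − ℓ̇²/N − K|γ̇|²), valid for D = e^{−ℓ/N}; since D > 0, the two inequalities are equivalent. A finite-difference sketch (the profile ℓ(t) and the value of K|γ̇|² are arbitrary test data):

```python
import math

N, Kv2 = 3.0, 1.7                        # N and the constant K|γ̇|² (hypothetical values)
l = lambda t: math.sin(2 * t) + 0.5 * t**2   # an arbitrary smooth profile ℓ(t)
D = lambda t: math.exp(-l(t) / N)            # D = J^{1/N} = e^{-ℓ/N}

def d1(f, t, h=1e-4):                    # central first difference
    return (f(t + h) - f(t - h)) / (2 * h)

def d2(f, t, h=1e-4):                    # central second difference
    return (f(t + h) - 2 * f(t) + f(t - h)) / h**2

t = 0.3
lhs = d2(D, t) + (Kv2 / N) * D(t)                           # left-hand side of (iv)
rhs = -(D(t) / N) * (d2(l, t) - d1(l, t)**2 / N - Kv2)      # -(D/N)(ℓ̈ - ℓ̇²/N - K|γ̇|²)
assert abs(lhs - rhs) < 1e-5
```

Because D > 0, the sign of the left-hand side of (iv) is the opposite of the sign of the defect in (iii), which is the claimed equivalence.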

Remark 14.9. Note carefully that the inequalities (i)–(iv') are always required to be true: For instance (ii) should be true for all ψ, all x and all t ∈ (0,1). The equivalence is that [(i) true for all x] is equivalent to [(ii) true for all ψ, all x and all t], etc.

Examples 14.10 (One-dimensional CD(K,N) model spaces).
(a) Let K > 0 and 1 < N < ∞, and consider

M = ( −(π/2) √((N−1)/K), (π/2) √((N−1)/K) ) ⊂ ℝ,

equipped with the usual distance on ℝ, and the reference measure

ν(dx) = cos^{N−1}( √(K/(N−1)) x ) dx;

then M satisfies CD(K,N), although the Hausdorff dimension of M is of course 1. Note that M is not complete, but this is not a serious problem since CD(K,N) is a local property. (We can also replace M by its closure, but then it is a manifold with boundary.)

(b) For K < 0, 1 ≤ N < ∞, the same conclusion holds true if one considers M = ℝ and

ν(dx) = cosh^{N−1}( √(|K|/(N−1)) x ) dx.

(c) For any N ∈ [1,∞), an example of a one-dimensional space satisfying CD(0,N) is provided by M = (0,+∞) with the reference measure ν(dx) = x^{N−1} dx.

(d) For any K ∈ ℝ, take M = ℝ and equip it with the reference measure

ν(dx) = e^{−Kx²/2} dx;

then M satisfies CD(K,∞).

Sketch of proof of Theorem 14.8. It is clear from our discussion in this chapter that (i) implies (ii) and (iii); and (iii) is equivalent to (iv) by elementary manipulations about derivatives. (Moreover, (ii) and (iii) are equivalent modulo smoothness issues, by Eulerian/Lagrangian duality.)

It is less clear why, say, (ii) would imply (i). This comes from formulas (14.37) and (14.50). Indeed, assume (ii) and choose an arbitrary

x₀ ∈ M, and v₀ ∈ T_{x₀}M. Assume, to fix ideas, that N > n. Construct a C³ function ψ such that

∇ψ(x₀) = v₀,
∇²ψ(x₀) = λ₀ I_n,
Δψ(x₀) (= nλ₀) = −(n/(N−n)) ∇V(x₀)·v₀.

(This is fairly easy by using local coordinates, or distance and exponential functions.) Then all the remainder terms in (14.50) will vanish at x₀, so that

K |v₀|² = K |∇ψ(x₀)|² ≤ ( Γ₂(ψ) − (Lψ)²/N )(x₀) = Ric_{N,ν}(∇ψ(x₀)) = Ric_{N,ν}(v₀).

So indeed Ric_{N,ν} ≥ K.

The proof goes in the same way for the equivalence between (i) and (ii'), (iii'), (iv'): again the problem is to understand why (ii') implies (i), and the reasoning is almost the same as before; the key point being that the extra error terms in ∂²_{1j}ψ, j ≥ 2, all vanish at x₀. ⊓⊔

Many interesting inequalities can be derived from CD(K,N). It was successfully advocated by Bakry and other authors during the past two decades that CD(K,N) should be considered as a property of the generalized Laplace operator L. Instead, it will be advocated in this course that CD(K,N) is a property of the solution of the optimal transport problem, when the cost function is the square of the geodesic distance. Of course, both points of view have their advantages and their drawbacks.

From differential to integral curvature-dimension bounds

There are two ways to characterize the concavity of a function f(t) on a time-interval, say [0,1]: the differential inequality f̈ ≤ 0, or the integral bound f((1−λ)t₀ + λt₁) ≥ (1−λ) f(t₀) + λ f(t₁). If the latter is required to hold true for all t₀, t₁ ∈ [0,1] and λ ∈ [0,1], then the two formulations are equivalent.

There are two classical generalizations. The first one states that the differential inequality f̈ + K ≤ 0 is equivalent to the integral inequality

f((1−λ)t₀ + λt₁) ≥ (1−λ) f(t₀) + λ f(t₁) + (K/2) λ(1−λ)(t₁ − t₀)².

Another one is as follows: The differential inequality

f̈(t) + Λ f(t) ≤ 0    (14.52)

is equivalent to the integral bound

f((1−λ)t₀ + λt₁) ≥ τ^{(1−λ)}(|t₀ − t₁|) f(t₀) + τ^{(λ)}(|t₀ − t₁|) f(t₁),    (14.53)

where

τ^{(λ)}(θ) =
  sin(λθ√Λ) / sin(θ√Λ)         if Λ > 0
  λ                            if Λ = 0
  sinh(λθ√(−Λ)) / sinh(θ√(−Λ)) if Λ < 0.

A more precise statement, together with a proof, are provided in the Second Appendix of this chapter.

This leads to the following integral characterization of CD(K,N):

Theorem 14.11 (Integral reformulation of curvature-dimension bounds). Let M be a Riemannian manifold, equipped with a reference measure ν = e^{−V} vol, V ∈ C²(M), and let d be the geodesic distance on M. Let K ∈ ℝ and N ∈ [1,∞]. Then, with the same notation as in Theorem 14.8, M satisfies CD(K,N) if and only if the following inequality is always true (for any semiconvex ψ, and almost any x, as soon as J(t,x) does not vanish for t ∈ (0,1)):

D(t,x) ≥ τ^{(1−t)}_{K,N} D(0,x) + τ^{(t)}_{K,N} D(1,x)    (N < ∞)    (14.54)

ℓ(t,x) ≤ (1−t) ℓ(0,x) + t ℓ(1,x) − ( K t(1−t)/2 ) d(x,y)²    (N = ∞)    (14.55)

where y = exp_x(∇ψ(x)) and, in case N < ∞,

τ^{(t)}_{K,N} =
  sin(tα)/sin α    if K > 0
  t                if K = 0
  sinh(tα)/sinh α  if K < 0,

where

α = √(|K|/N) d(x,y)    ( α ∈ [0,π] if K > 0 ).

Proof of Theorem 14.11. If N < ∞, inequality (14.54) is obtained by transforming the differential bound of (iii) in Theorem 14.8 into an integral bound, after noticing that |γ̇| is a constant all along the geodesic γ, and equals d(γ₀,γ₁). Conversely, to go from (14.54) to Theorem 14.8(iii), we select a geodesic γ, then reparametrize the geodesic (γ_t)_{t₀ ≤ t ≤ t₁} into a geodesic [0,1] → M, apply (14.54) to the reparametrized path and discover that

D(t,x) ≥ τ^{(1−λ)}_{K,N} D(t₀,x) + τ^{(λ)}_{K,N} D(t₁,x);    t = (1−λ) t₀ + λ t₁,

where now α = √(|K|/N) d(γ(t₀), γ(t₁)). It follows that D(t,x) satisfies (14.53) for any choice of t₀, t₁; and this is equivalent to (14.52). The reasoning is the same for the case N = ∞, starting from inequality (iii) in Theorem 14.8. ⊓⊔

The next result states that the coefficients τ^{(t)}_{K,N} obtained in Theorem 14.11 can be automatically improved, if N is finite and K ≠ 0, by taking out the direction of motion:

Theorem 14.12 (Curvature-dimension bounds with direction of motion taken out). Let M be a Riemannian manifold, equipped with a reference measure ν = e^{−V} vol, and let d be the geodesic distance on M. Let K ∈ ℝ and N ∈ [1,∞). Then, with the same notation as in Theorem 14.8, M satisfies CD(K,N) if and only if the following inequality is always true (for any semiconvex ψ, and almost any x, as soon as J(t,x) does not vanish for t ∈ (0,1)):

D(t,x) ≥ τ^{(1−t)}_{K,N} D(0,x) + τ^{(t)}_{K,N} D(1,x),    (14.56)

where now

τ^{(t)}_{K,N} =
  t^{1/N} ( sin(tα)/sin α )^{1−1/N}    if K > 0
  t                                    if K = 0
  t^{1/N} ( sinh(tα)/sinh α )^{1−1/N}  if K < 0,

and

α = √(|K|/(N−1)) d(x,y)    ( α ∈ [0,π] if K > 0 ).

Remark 14.13. When N < ∞ and K > 0 Theorem 14.12 contains the Bonnet–Myers theorem, according to which d(x,y) ≤ π √((N−1)/K). With Theorem 14.11 the bound was only π √(N/K).

Proof of Theorem 14.12. The proof that (14.56) implies CD(K,N) is done in the same way as for (14.54). (In fact (14.56) is stronger than (14.54).)

As for the other implication: Start from (14.22), and transform it into an integral bound:

D_⊥(t,x) ≥ σ^{(1−t)}_{K,N} D_⊥(0,x) + σ^{(t)}_{K,N} D_⊥(1,x),

where σ^{(t)}_{K,N} = sin(tα)/sin α if K > 0; t if K = 0; sinh(tα)/sinh α if K < 0. Next transform (14.19) into the integral bound

D_{//}(t,x) ≥ (1−t) D_{//}(0,x) + t D_{//}(1,x).

Both estimates can be combined thanks to Hölder's inequality:

D(t,x) = D_⊥(t,x)^{1−1/N} D_{//}(t,x)^{1/N}
≥ ( σ^{(1−t)}_{K,N} D_⊥(0,x) + σ^{(t)}_{K,N} D_⊥(1,x) )^{1−1/N} ( (1−t) D_{//}(0,x) + t D_{//}(1,x) )^{1/N}
≥ ( σ^{(1−t)}_{K,N} )^{1−1/N} (1−t)^{1/N} D(0,x) + ( σ^{(t)}_{K,N} )^{1−1/N} t^{1/N} D(1,x).

This implies inequality (14.56). ⊓⊔
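The last step above is the two-term "reverse Hölder" bound (a₁+b₁)^{1−1/N} (a₂+b₂)^{1/N} ≥ a₁^{1−1/N} a₂^{1/N} + b₁^{1−1/N} b₂^{1/N} for nonnegative entries; a brute-force numerical sketch (random data, arbitrary choice of N):

```python
import random

random.seed(1)
N = 4.0
p, q = 1 - 1 / N, 1 / N      # conjugate weights, p + q = 1
for _ in range(1000):
    a1, a2, b1, b2 = (random.uniform(0.0, 5.0) for _ in range(4))
    lhs = (a1 + b1)**p * (a2 + b2)**q
    rhs = a1**p * a2**q + b1**p * b2**q
    assert lhs >= rhs - 1e-9
```

This is just Hölder's inequality for the two-point measure, which is why the text invokes Hölder to merge the D_⊥ and D_{//} estimates.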

Estimate (14.56) is sharp in general. The following reformulation yields an appealing interpretation of CD(K,N) in terms of comparison spaces. In the sequel, I will write Jac_x for the (unoriented) Jacobian determinant evaluated at point x, computed with respect to a given reference measure.

Corollary 14.14 (Curvature-dimension bounds by comparison). Let M be a Riemannian manifold equipped with a reference measure ν = e^{−V} vol, V ∈ C²(M). Define the J-function of (M,ν) on [0,1] × ℝ₊ × ℝ₊ by the formula

J_{M,ν}(t, δ, J) := inf { Jac_x(exp(tξ)) ;  |ξ(x)| = δ ;  Jac_x(exp(ξ)) = J },    (14.57)

where the infimum is over all vector fields ξ defined around x, such that ∇ξ(x) is symmetric, and Jac_x(exp(sξ)) ≠ 0 for 0 ≤ s < 1. Then, for any K ∈ ℝ, N ∈ [1,∞] (K ≤ 0 if N = 1),

M satisfies CD(K,N)  ⟺  J_{M,ν} ≥ J^{(K,N)},

where J^{(K,N)} is the J-function of the model CD(K,N) space considered in Examples 14.10.

If N is an integer, J^{(K,N)} is also the J-function of the N-dimensional model space

  S^N( √((N−1)/K) )     if K > 0
  ℝ^N                   if K = 0
  H^N( √((N−1)/|K|) )   if K < 0,

equipped with its volume measure.

Corollary 14.14 results from Theorem 14.12 by a direct computation of the J-function of the model spaces. In the case of S^{(K,N)}, one can also make a direct computation, or note that all the inequalities which were used to obtain (14.56) turn into equalities for suitable choices of parameters.
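For M = ℝⁿ with ν = vol, which satisfies CD(0,n), the comparison of Corollary 14.14 boils down to the concavity of t ↦ det(I + tS)^{1/n} on symmetric matrices: det(I + tS)^{1/n} ≥ (1−t) + t det(I + S)^{1/n}, which is exactly (14.54) with K = 0, N = n. A random 2 × 2 sketch (the sampling ranges are arbitrary, chosen so that I + sS stays nonsingular on [0,1]):

```python
import random

random.seed(7)

def det_I_tS(t, a, b, c):
    """det(I + tS) for the symmetric matrix S = [[a, b], [b, c]]."""
    return (1 + t * a) * (1 + t * c) - (t * b)**2

for _ in range(500):
    # eigenvalues of S stay > -1, so I + sS is positive definite for s in [0, 1]
    a, c = random.uniform(-0.8, 2.0), random.uniform(-0.8, 2.0)
    b = random.uniform(-0.1, 0.1)
    t = random.random()
    D1 = det_I_tS(1.0, a, b, c)**0.5            # D(1) = det(I + S)^{1/2}
    assert det_I_tS(t, a, b, c)**0.5 >= (1 - t) + t * D1 - 1e-12
```

Equality holds when S is a multiple of the identity, matching the sharpness discussion above.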

Remark 14.15. There is a quite similar (and more well-known) formulation of lower sectional curvature bounds which goes as follows. Define the L-function of a manifold M by the formula

L_M(t, δ, L) = inf { d( exp_x(tv), exp_x(tw) ) ;  |v| = |w| = δ ;  d( exp_x v, exp_x w ) = L },

where the infimum is over tangent vectors v, w ∈ T_xM. Then M has sectional curvature larger than κ if and only if L_M ≥ L^{(κ)}, where L^{(κ)} is the L-function of the reference space S^{(κ)}, which is S²(1/√κ) if κ > 0, ℝ² if κ = 0, and H²(1/√|κ|) if κ < 0. By changing the infimum into a supremum and reversing the inequalities, one can also obtain a characterization of upper sectional curvature bounds (under a topological assumption of simple connectivity). The comparison with (14.14) conveys the idea that sectional curvature bounds measure the rate of separation of geodesics in terms of distances, while Ricci curvature bounds do it in terms of Jacobian determinants.

Distortion coefficients

Apart from Definition 14.19, the material in this section is not necessary to the understanding of the rest of this course. Still, it is interesting because it will give a new interpretation of Ricci curvature bounds, and motivate the introduction of distortion coefficients, which will play a crucial role in the sequel.

Definition 14.16 (Barycenters). If A and B are two measurable sets in a Riemannian manifold, and t ∈ [0,1], a t-barycenter of A and B is a point which can be written γ_t, where γ is a (minimizing, constant-speed) geodesic with γ₀ ∈ A and γ₁ ∈ B. The set of all t-barycenters between A and B is denoted by [A,B]_t.

Definition 14.17 (Distortion coefficients). Let M be a Riemannian manifold, equipped with a reference measure e^{−V} vol, V ∈ C(M), and let x and y be any two points in M. Then the distortion coefficient β_t(x,y) between x and y at time t ∈ (0,1) is defined as follows:

• If x and y are joined by a unique geodesic γ, then

β_t(x,y) = lim_{r↓0} ν[ [x, B_r(y)]_t ] / ν[ B_{tr}(y) ] = lim_{r↓0} ν[ [x, B_r(y)]_t ] / ( tⁿ ν[ B_r(y) ] ).    (14.58)

• If x and y are joined by several minimizing geodesics, then

β_t(x,y) = inf_γ lim sup_{s→1⁻} β_t(x, γ_s),    (14.59)

where the infimum is over all minimizing geodesics γ joining γ(0) = x to γ(1) = y.

Finally, the values of β_t(x,y) for t = 0 and t = 1 are defined by

β₁(x,y) ≡ 1;    β₀(x,y) := lim inf_{t→0⁺} β_t(x,y).

The heuristic meaning of distortion coefficients is as follows (see Figure 14.4). Assume you are standing at point x and observing some device located at y. You are trying to estimate the volume of this device, but your appreciation is altered because light rays travel along curved lines (geodesics). If x and y are joined by a unique geodesic, then the coefficient β₀(x,y) tells by how much you are overestimating; so it is less than 1 in negative curvature, and greater than 1 in positive curvature. If x and y are joined by several geodesics, this is just the same, except that you choose to look in the direction where the device looks smallest.

[Figure 14.4 depicts a light source, geodesics distorted by curvature effects, the location of the observer, and how the observer thinks the light source looks.] Fig. 14.4. The meaning of distortion coefficients: Because of positive curvature effects, the observer overestimates the surface of the light source; in a negatively curved world this would be the contrary.

More generally, β_t(x,y) compares the volume occupied by the light rays emanating from the light source, when they arrive close to γ(t), to the volume that they would occupy in a flat space (see Figure 14.5).

[Figure 14.5 depicts light rays between the source y and the point x in a negatively curved space.] Fig. 14.5. The distortion coefficient is approximately equal to the ratio of the volume filled with lines, to the volume whose contour is in dashed line. Here the space is negatively curved and the distortion coefficient is less than 1.

Now let us express distortion coefficients in differential terms, and more precisely Jacobi fields. A key concept in doing so will be the notion of focalization, which was already discussed in Chapter 8: A point y is said to be focal to another point x if there exists v ∈ T_xM such that exp_x v = y and the differential d_v exp_x : T_xM → T_yM is not invertible. It is equivalent to say that there is a geodesic γ which visits both x and y, and a nonzero Jacobi field J along γ such that J(x) = 0, J(y) = 0. This concept is obviously symmetric in x and y, and then x, y are said to be conjugate points (along γ).

If x and y are joined by a unique geodesic γ and are not conjugate, then by the local inversion theorem, for r small enough, there is a unique velocity ξ(z) at z ∈ B_r(y) such that exp_z ξ(z) = x. Then the distortion coefficients can be interpreted as the Jacobian determinant of z ↦ exp_z(t ξ(z)), renormalized by (1−t)ⁿ, which would be the value in Euclidean space. The difference with the computations at the beginning of this chapter is that now the Jacobi field is not defined by its initial value and initial derivative, but rather by its initial value and its final value: exp_z ξ(z) = x independently of z, so the Jacobi field vanishes after a time 1. It will be convenient to reverse time so that t = 0 corresponds to x and t = 1 to y; thus the conditions are J(0) = 0, J(1) = I_n. After these preparations it is easy to derive the following:

Proposition 14.18 (Computation of distortion coefficients). Let $M$ be a Riemannian manifold, and let $x$ and $y$ be two points in $M$. Then
$$\beta_t(x,y) = \inf_\gamma \beta_t^{[\gamma]}(x,y),$$
where the infimum is over all minimizing geodesics $\gamma$ joining $\gamma(0)=x$ to $\gamma(1)=y$, and $\beta_t^{[\gamma]}(x,y)$ is defined as follows:

• If $x,y$ are not conjugate along $\gamma$, let $E$ be an orthonormal basis of $T_yM$ and define
$$\beta_t^{[\gamma]}(x,y) = \begin{cases} \dfrac{\det J^{1,0}(t)}{t^n} & \text{if } 0 < t \le 1,\\[2mm] \displaystyle\lim_{s\to 0}\frac{\det J^{1,0}(s)}{s^n} & \text{if } t = 0,\end{cases}\tag{14.60}$$
where $J^{1,0}$ is the unique matrix of Jacobi fields along $\gamma$ satisfying $J^{1,0}(0) = 0$, $J^{1,0}(1) = E$;

• If $x,y$ are conjugate along $\gamma$, define
$$\beta_t^{[\gamma]}(x,y) = \begin{cases} 1 & \text{if } t = 1,\\ +\infty & \text{if } 0 \le t < 1.\end{cases}$$

Distortion coefficients can be explicitly computed for the model CD($K,N$) spaces and depend only on the distance between the two points $x$ and $y$. These particular coefficients (or rather their expression as a function of the distance) will play a key role in the sequel of these notes.

Definition 14.19 (Reference distortion coefficients). Given $K \in \mathbb R$, $N \in [1,\infty]$ and $t \in [0,1]$, and two points $x,y$ in some metric space $(X,d)$, define $\beta_t^{(K,N)}(x,y)$ as follows:

• If $0 < t \le 1$ and $1 < N < \infty$ then

$$\beta_t^{(K,N)}(x,y) = \begin{cases} +\infty & \text{if } K > 0 \text{ and } \alpha > \pi,\\[1mm] \left(\dfrac{\sin(t\alpha)}{t\sin\alpha}\right)^{N-1} & \text{if } K > 0 \text{ and } \alpha \in [0,\pi],\\[2mm] 1 & \text{if } K = 0,\\[1mm] \left(\dfrac{\sinh(t\alpha)}{t\sinh\alpha}\right)^{N-1} & \text{if } K < 0,\end{cases}\tag{14.61}$$
where
$$\alpha = \sqrt{\frac{|K|}{N-1}}\;d(x,y).\tag{14.62}$$

• In the two limit cases $N \to 1$ and $N \to \infty$, modify the above expressions as follows:
$$\beta_t^{(K,1)}(x,y) = \begin{cases} +\infty & \text{if } K > 0,\\ 1 & \text{if } K \le 0,\end{cases}\tag{14.63}$$
$$\beta_t^{(K,\infty)}(x,y) = e^{\frac K6 (1-t^2)\,d(x,y)^2}.\tag{14.64}$$

• For $t = 0$ define $\beta_0^{(K,N)}(x,y) = 1$.

If $X$ is the model space for CD($K,N$), as in Examples 14.10, then $\beta^{(K,N)}$ is just the distortion coefficient on $X$.

If $K$ is positive, then for fixed $t$, $\beta_t^{(K,N)}$ is an increasing function of $\alpha$ (going to $+\infty$ at $\alpha = \pi$), while for fixed $\alpha$, it is a decreasing function of $t$ on $[0,1]$. All this is reversed for negative $K$. On the whole, $\beta_t^{(K,N)}$ is nondecreasing in $K$ and nonincreasing in $N$. (See Figure 14.6.)

The next two theorems relate distortion coefficients with Ricci curvature lower bounds; they show that (a) distortion coefficients can be interpreted as the "best possible" coefficients in concavity estimates for the Jacobian determinant; and (b) the curvature-dimension bound CD($K,N$) is a particular case of a family of more general estimates characterized by a lower bound on the distortion coefficients.
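Though the text is purely mathematical, the explicit formulas (14.61)–(14.64) are easy to evaluate numerically. The following sketch is our own illustration, not part of the book (the function name `beta` and all parameter choices are ours); it implements the reference coefficients and lets one check the monotonicity in $K$ and $N$ and the $N\to\infty$ limit stated above.

```python
import math

def beta(K, N, t, d):
    # Reference distortion coefficient beta_t^{(K,N)}(x,y) of
    # Definition 14.19, as a function of the distance d = d(x,y).
    # (Illustrative sketch; the name "beta" is ours.)
    if t == 0:
        return 1.0
    if N == math.inf:                        # limit case (14.64)
        return math.exp(K * (1 - t**2) * d**2 / 6)
    if N == 1:                               # limit case (14.63)
        return math.inf if K > 0 else 1.0
    alpha = d * math.sqrt(abs(K) / (N - 1))  # equation (14.62)
    if K > 0:
        if alpha > math.pi:
            return math.inf
        return (math.sin(t * alpha) / (t * math.sin(alpha))) ** (N - 1)
    if K == 0:
        return 1.0
    return (math.sinh(t * alpha) / (t * math.sinh(alpha))) ** (N - 1)
```

For fixed $t$ and $d$ one can verify that `beta` is nondecreasing in $K$, nonincreasing in $N$, and approaches the $N = \infty$ expression as $N$ grows.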

Fig. 14.6. The shape of the curves $\beta_t^{(K,N)}(x,y)$, for fixed $t \in (0,1)$, as a function of $\alpha = \sqrt{|K|/(N-1)}\;d(x,y)$. [Three regimes are shown: $K > 0$, diverging at $\alpha = \pi$; $K = 0$; and $K < 0$.]

Theorem 14.20 (Distortion coefficients and concavity of Jacobian determinant). Let $M$ be a Riemannian manifold of dimension $n$, and let $x,y$ be any two points in $M$. Then if $(\overline\beta_t(x,y))_{0\le t\le 1}$ and $(\overline\beta_t(y,x))_{0\le t\le 1}$ are two families of nonnegative coefficients, the following statements are equivalent:

(a) $\forall t \in [0,1]$, $\overline\beta_t(x,y) \le \beta_t(x,y)$; $\overline\beta_t(y,x) \le \beta_t(y,x)$;

(b) For any geodesic $\gamma$ joining $x$ to $y$, for any $N \ge n$, for any $t_0 \in [0,1]$, and for any initial vector field $\xi$ around $x_0 = \gamma(t_0)$ with $\nabla\xi(x_0)$ symmetric, let $J(s)$ stand for the Jacobian determinant of $\exp((s-t_0)\xi)$ at $x_0$; if $J(s)$ does not vanish for $0 < s < 1$, then for all $t \in [0,1]$,
$$\begin{cases} J(t)^{1/N} \ge (1-t)\,\overline\beta_{1-t}(y,x)^{1/N}\,J(0)^{1/N} + t\,\overline\beta_t(x,y)^{1/N}\,J(1)^{1/N} & (N < \infty);\\[1mm] \log J(t) \ge (1-t)\log J(0) + t\log J(1) + \bigl[(1-t)\log\overline\beta_{1-t}(y,x) + t\log\overline\beta_t(x,y)\bigr] & (N = \infty);\end{cases}\tag{14.65}$$

(c) Property (b) holds true for $N = n$.

Theorem 14.21 (Ricci curvature bounds in terms of distortion coefficients). Let $M$ be a Riemannian manifold of dimension $n$, equipped with its volume measure. Then the following two statements are equivalent:

(a) $\mathrm{Ric} \ge K$;

(b) $\beta \ge \beta^{(K,n)}$.

Sketch of proof of Theorem 14.20. To prove the implication (a) ⇒ (b), it suffices to establish (14.65) for $\overline\beta = \beta$.

The case $N = \infty$ is obtained from the case $N < \infty$ by passing to the limit, since $\lim_{N\to\infty} N\,[a^{1/N} - 1] = \log a$. So all we have to show is that if $n \le N < \infty$, then
$$J(t)^{1/N} \ge (1-t)\,\beta_{1-t}(y,x)^{1/N}\,J(0)^{1/N} + t\,\beta_t(x,y)^{1/N}\,J(1)^{1/N}.$$

The case when $x,y$ are conjugate can be treated by a limiting argument. (In fact the conclusion is that both $J(0)$ and $J(1)$ have to vanish if $x$ and $y$ are conjugate.) So we may assume that $x$ and $y$ are not conjugate, introduce a moving orthonormal basis $E(t)$ along $\gamma$, and define the Jacobi matrices $J^{0,1}(t)$ and $J^{1,0}(t)$ by the requirements
$$J^{1,0}(0) = 0,\quad J^{1,0}(1) = I_n;\qquad J^{0,1}(0) = I_n,\quad J^{0,1}(1) = 0.$$
(Here $J^{0,1}$, $J^{1,0}$ are identified with their expressions in the moving basis $E$.)

As noted after (14.7), the Jacobi equation is invariant under the change $E \to -E$, $t \to 1-t$; so $J^{1,0}$ becomes $J^{0,1}$ when one exchanges the roles of $x$ and $y$, and replaces $t$ by $1-t$. In particular, we have the formula
$$\beta_{1-t}(y,x) = \frac{\det J^{0,1}(t)}{(1-t)^n}.\tag{14.66}$$

As in the beginning of this chapter, the issue is to compute the determinant of a Jacobi field $J(t)$ at time $t$. Since the Jacobi fields are solutions of a linear differential equation of the form $\ddot J + RJ = 0$, they form a vector space of dimension $2n$, and they are invariant under right-multiplication by a constant matrix. This implies
$$J(t) = J^{0,1}(t)\,J(0) + J^{1,0}(t)\,J(1).\tag{14.67}$$

The determinant in dimension $n$ satisfies the following inequality: If $X$ and $Y$ are two $n\times n$ nonnegative symmetric matrices, then
$$\det(X+Y)^{1/n} \ge (\det X)^{1/n} + (\det Y)^{1/n}.\tag{14.68}$$
By combining this with Hölder's inequality, in the form
$$(a+b)^{n/N} \ge t^{\frac{N-n}{N}}\,a^{n/N} + (1-t)^{\frac{N-n}{N}}\,b^{n/N},$$
we obtain a generalization of (14.68):
$$\det(X+Y)^{1/N} \ge t^{\frac{N-n}{N}}\,(\det X)^{1/N} + (1-t)^{\frac{N-n}{N}}\,(\det Y)^{1/N}.\tag{14.69}$$
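Inequality (14.68) is Minkowski's determinant inequality, and (14.69) its Hölder interpolation; both can be sanity-checked numerically. Below is a small sketch of ours (not from the text), using random $2\times 2$ nonnegative symmetric matrices of the form $MM^{\mathsf T}$ and the sample values $N = 5 \ge n = 2$, $t = 0.3$.

```python
import random

def det2(X):
    return X[0][0] * X[1][1] - X[0][1] * X[1][0]

def powdet(X, p):
    return max(det2(X), 0.0) ** p   # clamp tiny negative round-off

def rand_psd2(rng):
    # M M^T is symmetric nonnegative for any real 2x2 matrix M
    a, b, c, d = (rng.uniform(-1, 1) for _ in range(4))
    return [[a*a + b*b, a*c + b*d], [a*c + b*d, c*c + d*d]]

def add2(X, Y):
    return [[X[i][j] + Y[i][j] for j in range(2)] for i in range(2)]

rng = random.Random(0)
n, N, t = 2, 5, 0.3
minkowski_ok = holder_ok = True
for _ in range(1000):
    X, Y = rand_psd2(rng), rand_psd2(rng)
    S = add2(X, Y)
    # (14.68): det(X+Y)^{1/n} >= det(X)^{1/n} + det(Y)^{1/n}
    minkowski_ok &= powdet(S, 1/n) >= powdet(X, 1/n) + powdet(Y, 1/n) - 1e-9
    # (14.69): det(X+Y)^{1/N} >= t^{(N-n)/N} det(X)^{1/N}
    #                            + (1-t)^{(N-n)/N} det(Y)^{1/N}
    rhs = t**((N-n)/N) * powdet(X, 1/N) + (1-t)**((N-n)/N) * powdet(Y, 1/N)
    holder_ok &= powdet(S, 1/N) >= rhs - 1e-9
```

The small tolerances only guard against floating-point round-off; the inequalities themselves hold exactly for nonnegative symmetric matrices.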

If one combines (14.67) and (14.69) one obtains
$$(\det J(t))^{1/N} \ge (1-t)^{\frac{N-n}{N}}\bigl(\det J^{0,1}(t)\bigr)^{1/N}(\det J(0))^{1/N} + t^{\frac{N-n}{N}}\bigl(\det J^{1,0}(t)\bigr)^{1/N}(\det J(1))^{1/N}$$
$$= (1-t)\left[\frac{\det J^{0,1}(t)}{(1-t)^n}\right]^{1/N} J(0)^{1/N} + t\left[\frac{\det J^{1,0}(t)}{t^n}\right]^{1/N} J(1)^{1/N}$$
$$= (1-t)\,\beta_{1-t}(y,x)^{1/N}\,J(0)^{1/N} + t\,\beta_t(x,y)^{1/N}\,J(1)^{1/N},$$
where the final equality follows from (14.60) and (14.66).

In this way we have shown that (a) implies (b), but at the price of a gross cheating, since in general the matrices appearing in (14.67) are not symmetric! It turns out however that they can be positively cosymmetrized: there is a kernel $K(t)$ such that $\det K(t) > 0$, and $K(t)J^{0,1}(t)$ and $K(t)J^{1,0}(t)J(1)$ are both positive symmetric, at least for $t \in (0,1)$. This remarkable property is a consequence of the structure of Jacobi fields; see Propositions 14.30 and 14.31 in the Third Appendix of this chapter. Once the cosymmetrization property is known, it is obvious how to fix the proof: just write
$$\det\bigl(K(t)J(t)\bigr)^{1/n} \ge \det\bigl(K(t)J^{0,1}(t)\bigr)^{1/n} + \det\bigl(K(t)J^{1,0}(t)J(1)\bigr)^{1/n},$$
and then factor out the positive quantity $(\det K(t))^{1/n}$ to get
$$\det\bigl(J(t)\bigr)^{1/n} \ge \det\bigl(J^{0,1}(t)\bigr)^{1/n} + \det\bigl(J^{1,0}(t)J(1)\bigr)^{1/n}.$$
The end of the argument is as before.

Next, it is obvious that (b) implies (c). To conclude the proof, it suffices to show that (c) ⇒ (a). By symmetry and definition of $\beta_t$, it is sufficient to show that $\beta_t(x,y) \le \beta_t^{[\gamma]}(x,y)$ for any geodesic $\gamma$. If $x$ and $y$ are conjugate along $\gamma$ then there is nothing to prove. Otherwise, we can introduce $\xi(z)$ in the ball $B_r(y)$ such that for any $z \in B_r(y)$, $\exp_z\xi(z) = x$, and $\exp_z(t\xi(z))$ is the only geodesic joining $z$ to $x$.
Let $\mu_0$ be the uniform probability distribution on $B_r(y)$, and $\mu_1$ be the Dirac mass at $x$; then $\exp\xi$ is the unique map $T$ such that $T_\#\mu_0 = \mu_1$, so it is the optimal transport map, and therefore can be written as $\exp(\nabla\psi)$ for some $d^2/2$-convex $\psi$; in particular, $\xi = \nabla\psi$. (Here I have chosen $t = 1$, say.) So we can apply (b) with $N = n$, $\mathcal D(1) = 0$, $\mathcal D(0) = 1$, $\mathcal D(t,x) = \det J^{0,1}(t)$, and obtain

$$\bigl(\det J^{0,1}(t)\bigr)^{1/n} \ge (1-t)\,\beta_{1-t}(x,y)^{1/n}.$$
It follows that $\beta_{1-t}(x,y) \le (\det J^{0,1}(t))/(1-t)^n = \beta_{1-t}^{[\gamma]}(x,y)$; since $t$ is arbitrary, $\beta_t(x,y) \le \beta_t^{[\gamma]}(x,y)$, as desired. ⊓⊔

Sketch of proof of Theorem 14.21. To prove (a) ⇒ (b), we may apply inequality (14.56) with $N = n$, to conclude that Property (c) in Theorem 14.20 is satisfied with $\overline\beta = \beta^{(K,n)}$; thus $\beta \ge \beta^{(K,n)}$. Conversely, if $\beta \ge \beta^{(K,n)}$, Theorem 14.20 implies inequality (14.56), which in turn implies CD($K,n$), or equivalently $\mathrm{Ric} \ge K$. ⊓⊔

Remark 14.22. If $(M,\nu)$ satisfies CD($K,N$) then (14.65) still holds true with $\overline\beta = \beta^{(K,N)}$ (provided of course that one takes the measure $\nu$ into account when computing Jacobian determinants): this is just a rewriting of Theorem 14.12. However, Theorem 14.20 does not apply in this case, since $N$ is in general larger than the "true dimension" $n$.

Remark 14.23. Theorems 14.20 and 14.21 suggest a generalization of the CD($K,N$) criterion: Given an effective dimension $N$, define the generalized distortion coefficients $\beta_{N,\nu}$ as the best coefficients in (14.65) (the first inequality if $N < \infty$, the second one if $N = \infty$). In this way the condition CD($K,N$) might be a particular case of a more general condition CD($\beta,N$), which would be defined by the inequality $\beta_{N,\nu} \ge \beta$, where $\beta(x,y)$ would be, say, a given function of the distance between $x$ and $y$. I shall not develop this idea, because (i) it is not clear at present that it would really add something interesting to the CD($K,N$) theory; (ii) the condition CD($\beta,N$) would in general be nonlocal.

Remark 14.24. It is not a priori clear what kind of functions $\beta$ can occur as distortion coefficients.
It is striking to note that, in view of Theorems 14.11 and 14.12, for any given manifold $M$ of dimension $n$ the following two conditions are equivalent, say for $K > 0$:

(i) $\forall x,y \in M$, $\forall t \in [0,1]$, $\quad\beta_t(x,y) \ge \left(\dfrac{\sin\bigl(t\,d(x,y)\sqrt{K/n}\bigr)}{t\,\sin\bigl(d(x,y)\sqrt{K/n}\bigr)}\right)^{\!n}$;

(ii) $\forall x,y \in M$, $\forall t \in [0,1]$, $\quad\beta_t(x,y) \ge \left(\dfrac{\sin\bigl(t\,d(x,y)\sqrt{K/(n-1)}\bigr)}{t\,\sin\bigl(d(x,y)\sqrt{K/(n-1)}\bigr)}\right)^{\!n-1}$.

This self-improvement property implies restrictions on the possible behavior of $\beta$.

First Appendix: Second differentiability of convex functions

In this Appendix I shall provide a proof of Theorem 14.1. As explained right after the statement of that theorem, it suffices to consider the particular case of a convex function $\mathbb R^n \to \mathbb R$. So here is the statement to be proven:

Theorem 14.25 (Alexandrov's second differentiability theorem). Let $\varphi : \mathbb R^n \to \mathbb R$ be a convex function. Then, for Lebesgue-almost every $x \in \mathbb R^n$, $\varphi$ is differentiable at $x$ and there exists a symmetric operator $A : \mathbb R^n \to \mathbb R^n$, characterized by any one of the following equivalent properties:

(i) $\nabla\varphi(x+v) = \nabla\varphi(x) + Av + o(|v|)$ as $v \to 0$;

(i') $\partial\varphi(x+v) = \nabla\varphi(x) + Av + o(|v|)$ as $v \to 0$;

(ii) $\varphi(x+v) = \varphi(x) + \nabla\varphi(x)\cdot v + \dfrac{\langle Av,v\rangle}{2} + o(|v|^2)$ as $v \to 0$;

(ii') $\forall v \in \mathbb R^n$, $\quad\varphi(x+tv) = \varphi(x) + t\,\nabla\varphi(x)\cdot v + t^2\,\dfrac{\langle Av,v\rangle}{2} + o(t^2)$ as $t \to 0$.

(In (i) the vector $v$ is such that $\varphi$ is differentiable at $x+v$; in (i') the notation $o(|v|)$ means a set whose elements are all bounded in norm like $o(|v|)$.)

The operator $A$ is denoted by $\nabla^2\varphi(x)$ and called the Hessian of $\varphi$ at $x$. When no confusion is possible, the quadratic form defined by $A$ is also called the Hessian of $\varphi$ at $x$. Moreover, the function $x \mapsto \nabla^2\psi(x)$ (resp. $x \mapsto \Delta\psi(x) = \operatorname{tr}(\nabla^2\psi(x))$) is the density of the absolutely continuous part of the distribution $\nabla^2_{\mathcal D'}\psi$ (resp. of the distribution $\Delta_{\mathcal D'}\psi$).

Before starting the proof, let me recall an elementary lemma about convex functions.

Lemma 14.26. (i) Let $\varphi : \mathbb R^n \to \mathbb R$ be a convex function, and let $x_0, x_1, \ldots, x_{n+1} \in \mathbb R^n$ be such that $B(x_0,2r)$ is included in the convex hull of $x_1,\ldots,x_{n+1}$. Then,
$$2\,\varphi(x_0) - \max_{1\le i\le n+1}\varphi(x_i) \;\le\; \inf_{B(x_0,2r)}\varphi \;\le\; \sup_{B(x_0,2r)}\varphi \;\le\; \max_{1\le i\le n+1}\varphi(x_i);$$
$$\|\varphi\|_{\mathrm{Lip}(B(x_0,r))} \le \frac{2\bigl(\max_{1\le i\le n+1}\varphi(x_i) - \varphi(x_0)\bigr)}{r}.$$

(ii) If $(\varphi_k)_{k\in\mathbb N}$ is a sequence of convex functions which converges pointwise to some function $\Phi$, then the convergence is locally uniform.

Proof of Lemma 14.26. If $x$ lies in $B(x_0,2r)$ then of course, by convexity, $\varphi(x) \le \max(\varphi(x_1),\ldots,\varphi(x_{n+1}))$. Next, if $z \in B(x_0,2r)$, then $\tilde z := 2x_0 - z \in B(x_0,2r)$ and $\varphi(z) \ge 2\varphi(x_0) - \varphi(\tilde z) \ge 2\varphi(x_0) - \max\varphi(x_i)$.

Next, let $x \in B(x_0,r)$ and let $y \in \partial\varphi(x)$; let $z = x + r\,y/|y| \in B(x_0,2r)$. From the subdifferential inequality, $r\,|y| = \langle y, z-x\rangle \le \varphi(z) - \varphi(x) \le 2\,(\max\varphi(x_i) - \varphi(x_0))$. This proves (i).

Now let $(\varphi_k)_{k\in\mathbb N}$ be a sequence of convex functions, let $x_0 \in \mathbb R^n$ and let $r > 0$. Let $x_1,\ldots,x_{n+1}$ be such that $B(x_0,2r)$ is included in the convex hull of $x_1,\ldots,x_{n+1}$. If $\varphi_k(x_j)$ converges for all $j$, then by (i) there is a uniform bound on $\|\varphi_k\|_{\mathrm{Lip}}$ on $B(x_0,r)$. So if $\varphi_k$ converges pointwise on $B(x_0,r)$, the convergence has to be uniform. This proves (ii). ⊓⊔

Now we start the proof of Theorem 14.25. To begin with, we should check that the formulations (i), (i'), (ii) and (ii') are equivalent; this will use the convexity of $\varphi$.

Proof of the equivalence in Theorem 14.25. It is obvious that (i') ⇒ (i) and (ii) ⇒ (ii'), so we just have to show that (i) ⇒ (ii) and (ii') ⇒ (i').

To prove (i) ⇒ (ii), the idea is to use the mean value theorem; since $\varphi$ is not a priori smooth, we shall regularize it. Let $\zeta$ be a radially symmetric nonnegative smooth function $\mathbb R^n \to \mathbb R$, with compact support in $B_1(0)$, such that $\int\zeta = 1$. For any $\varepsilon > 0$, let $\zeta_\varepsilon(x) = \varepsilon^{-n}\,\zeta(x/\varepsilon)$; then let $\varphi_\varepsilon := \varphi * \zeta_\varepsilon$. The resulting function $\varphi_\varepsilon$ is smooth and converges pointwise to $\varphi$ as $\varepsilon \to 0$; moreover, since $\varphi$ is locally Lipschitz we have (by dominated convergence)
$\nabla\varphi_\varepsilon = (\nabla\varphi)*\zeta_\varepsilon$.

Then we can write
$$\varphi(x+v) - \varphi(x) = \lim_{\varepsilon\to 0}\bigl[\varphi_\varepsilon(x+v) - \varphi_\varepsilon(x)\bigr] = \lim_{\varepsilon\to 0}\int_0^1 \nabla\varphi_\varepsilon(x+tv)\cdot v\,dt.\tag{14.70}$$

Let us assume that $\varepsilon \le |v|$; then, by (i), for all $z \in B_{2\varepsilon}(x)$,
$$\nabla\varphi(z) = \nabla\varphi(x) + A\,(z-x) + o(|v|).$$
If $y \in B_\varepsilon(x)$, we can integrate this identity against $\zeta_\varepsilon(y-z)\,dz$ (since $\zeta_\varepsilon(y-z) = 0$ if $|y-z| > \varepsilon$); taking into account $\int\zeta_\varepsilon(z-x)\,dz = 1$ and $\int (z-x)\,\zeta_\varepsilon(z-x)\,dz = 0$, we obtain

$$\nabla\varphi_\varepsilon(y) = \nabla\varphi(x) + A\,(y-x) + o(|v|).$$
In particular, $\nabla\varphi_\varepsilon(x+tv) = \nabla\varphi(x) + tAv + o(|v|)$. By plugging this into the right-hand side of (14.70), we obtain Property (ii).

Now let us prove that (ii') ⇒ (i'). Without loss of generality we may assume that $\varphi(x) = 0$ and $\nabla\varphi(x) = 0$. So the assumption is $\varphi(tw) = t^2\,\langle Aw,w\rangle/2 + o(t^2)$, for any $w$. If (i') is false, then there are sequences $x_k \to 0$, $|x_k| \ne 0$, and $y_k \in \partial\varphi(x_k)$ such that
$$\frac{y_k - A x_k}{|x_k|} \;\not\longrightarrow\; 0\qquad (k\to\infty).\tag{14.71}$$
Extract an arbitrary subsequence from $(x_k,y_k)$ (still denoted $(x_k,y_k)$ for simplicity) and define
$$\varphi_k(w) := \frac{1}{|x_k|^2}\,\varphi\bigl(|x_k|\,w\bigr).$$
Assumption (ii') implies that $\varphi_k$ converges pointwise to $\Phi$ defined by
$$\Phi(w) = \frac{\langle Aw,w\rangle}{2}.$$
The functions $\varphi_k$ are convex, so the convergence is actually locally uniform by Lemma 14.26.

Since $y_k \in \partial\varphi(x_k)$,
$$\forall z \in \mathbb R^n,\qquad \varphi(z) \ge \varphi(x_k) + \langle y_k,\,z - x_k\rangle,$$
or equivalently, with the notation $w_k = x_k/|x_k|$,
$$\forall w \in \mathbb R^n,\qquad \varphi_k(w) \ge \varphi_k(w_k) + \Bigl\langle \frac{y_k}{|x_k|},\,w - w_k\Bigr\rangle.\tag{14.72}$$
The choice $w = w_k + y_k/|y_k|$ shows that $|y_k|/|x_k| \le \varphi_k(w) - \varphi_k(w_k)$, so $|y_k|/|x_k|$ is bounded. Up to extraction of a subsequence, we may assume that $w_k = x_k/|x_k| \to \sigma \in S^{n-1}$ and $y_k/|x_k| \to y$. Then we can pass to the limit in (14.72) and recover
$$\forall w \in \mathbb R^n,\qquad \Phi(w) \ge \Phi(\sigma) + \langle y,\,w - \sigma\rangle.$$
It follows that $y \in \partial\Phi(\sigma) = \{A\sigma\}$. So $y_k/|x_k| \to A\sigma$, or equivalently $(y_k - Ax_k)/|x_k| \to 0$. What has been shown is that each subsequence of the original sequence $(|y_k - Ax_k|/|x_k|)$ has a subsequence which converges to 0; thus the whole sequence converges to 0. This is in contradiction with (14.71), so (i') has to be true. ⊓⊔
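As a concrete toy illustration of formulation (ii), consider the convex function $\varphi(x) = \max_i\langle a_i,x\rangle + |x|^2/2$; this example is ours, not from the text. At any point where the maximum is attained by a single $a_i$, the Alexandrov Hessian is $I_n$, and the second-order Taylor remainder in (ii) even vanishes exactly for small $v$:

```python
import random

# phi(x) = max_i <a_i, x> + |x|^2 / 2: convex, not C^2 across the
# "kinks" where two slopes tie, but with Alexandrov Hessian I_2 at
# every point where a single a_i attains the max.  (Toy example.)
A = [(1.0, 0.0), (0.0, 1.0), (-1.0, -1.0)]

def phi(x):
    return max(a[0]*x[0] + a[1]*x[1] for a in A) + (x[0]**2 + x[1]**2) / 2

x = (0.7, 0.2)                     # only a_1 = (1, 0) attains the max here
grad = (1.0 + x[0], 0.0 + x[1])    # nabla phi(x) = a_1 + x

rng = random.Random(1)
max_remainder = 0.0
for _ in range(200):
    v = (rng.uniform(-0.1, 0.1), rng.uniform(-0.1, 0.1))
    quad = (v[0]**2 + v[1]**2) / 2          # <Av, v>/2 with A = I_2
    taylor = phi(x) + grad[0]*v[0] + grad[1]*v[1] + quad
    max_remainder = max(max_remainder,
                        abs(phi((x[0] + v[0], x[1] + v[1])) - taylor))
```

For $|v|$ small enough the same $a_i$ keeps attaining the maximum, so $\varphi(x+v)$ agrees exactly with its quadratic expansion at $x$; at the kinks themselves, only the almost-everywhere statement of Theorem 14.25 survives.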

Now, before proving Theorem 14.25 in full generality, I shall consider two particular cases which are much simpler.

Proof of Theorem 14.25 in dimension 1. Let $\varphi : \mathbb R \to \mathbb R$ be a convex function. Then its derivative $\varphi'$ is nondecreasing, and therefore differentiable almost everywhere. ⊓⊔

Proof of Theorem 14.25 when $\nabla\varphi$ is locally Lipschitz. Let $\varphi : \mathbb R^n \to \mathbb R$ be a convex function, continuously differentiable and such that $\nabla\varphi$ is locally Lipschitz. By Rademacher's theorem, each function $\partial_i\varphi$ is differentiable almost everywhere, where $\partial_i$ stands for the partial derivative with respect to $x_i$. So the functions $\partial_j(\partial_i\varphi)$ are defined almost everywhere. To conclude the proof, it suffices to show that $\partial_j(\partial_i\varphi) = \partial_i(\partial_j\varphi)$ almost everywhere. To prove this, let $\zeta$ be any compactly supported $C^2$ function; then, by successive use of the dominated convergence theorem and the smoothness of $\varphi * \zeta$,
$$\partial_j(\partial_i\varphi)*\zeta = \partial_j\bigl((\partial_i\varphi)*\zeta\bigr) = \partial_j\partial_i(\varphi*\zeta) = \partial_i\partial_j(\varphi*\zeta) = \partial_i\bigl((\partial_j\varphi)*\zeta\bigr) = \partial_i(\partial_j\varphi)*\zeta.$$
It follows that $(\partial_j\partial_i\varphi - \partial_i\partial_j\varphi)*\zeta = 0$, and since $\zeta$ is arbitrary this implies that $\partial_j\partial_i\varphi - \partial_i\partial_j\varphi$ vanishes almost everywhere. This concludes the argument. ⊓⊔

Proof of Theorem 14.25 in the general case. As in the proof of Theorem 10.8(ii), the strategy will be to reduce to the one-dimensional case. For any $x \in \mathbb R^n$ such that $\varphi$ is differentiable at $x$, any $v \in \mathbb R^n$ and $t > 0$, define
$$Q_v(t,x) = \frac{\varphi(x+tv) - \varphi(x) - t\,\nabla\varphi(x)\cdot v}{t^2} \;\ge\; 0.$$
The goal is to show that for Lebesgue–almost all $x \in \mathbb R^n$,
$$q_v(x) := \lim_{t\to 0} Q_v(t,x)$$
exists for all $v$, and is a quadratic function of $v$.

Let $\mathrm{Dom}\,q(x)$ be the set of $v \in \mathbb R^n$ such that $q_v(x)$ exists.
It is clear from the definition that:

(a) $q_v(x)$ is nonnegative and homogeneous of degree 2 in $v$ on $\mathrm{Dom}\,q(x)$;

(b) $q_v(x)$ is a convex function of $v$ on $\mathrm{Dom}\,q(x)$: this is just because it is the limit of the family $Q_v(t,x)$, which is convex in $v$;

(c) If $v$ is interior to $\mathrm{Dom}\,q(x)$ and $q_w(x) \to \ell$ as $w \to v$, $w \in \mathrm{Dom}\,q(x)$, then also $v \in \mathrm{Dom}\,q(x)$ and $q_v(x) = \ell$. Indeed, let $\varepsilon > 0$ and let $\delta$ be so small that $|w-v| \le \delta \Rightarrow |q_w(x) - \ell| \le \varepsilon$; then we can find $v_1,\ldots,v_{n+1}$ in $\mathrm{Dom}\,q(x) \cap B(v,\delta)$ so that $v$ lies in the convex hull of $v_1,\ldots,v_{n+1}$, and then $v_0 \in \mathrm{Dom}\,q(x) \cap B(v,\delta)$ and $r > 0$ so that $B(v_0,2r)$ is included in the convex hull of $v_1,\ldots,v_{n+1}$. By Lemma 14.26,
$$2\,Q_{v_0}(t,x) - \max_i Q_{v_i}(t,x) \;\le\; Q_v(t,x) \;\le\; \max_i Q_{v_i}(t,x).$$
So
$$\ell - 3\varepsilon \le 2\,q_{v_0}(x) - \max_i q_{v_i}(x) \le \liminf_{t\to 0} Q_v(t,x) \le \limsup_{t\to 0} Q_v(t,x) \le \max_i q_{v_i}(x) \le \ell + \varepsilon.$$
It follows that $\lim_{t\to 0} Q_v(t,x) = \ell$, as desired.

Next, we can use the same reasoning as in the proof of Rademacher's theorem (Theorem 10.8(ii)): Let $v$ be given, $v \ne 0$; let us show that $q_v(x)$ exists for almost all $x$. By Fubini's theorem, it is sufficient to show that $q_v(x)$ exists $\lambda_1$-almost everywhere on each line parallel to $v$. So let $x_0 \perp v$ be given, and let $L_{x_0} = x_0 + \mathbb R v$ be the line passing through $x_0$, parallel to $v$; the existence of $q_v(x_0 + t_0 v)$ is equivalent to the second differentiability of the convex function $\psi : t \mapsto \varphi(x_0 + tv)$ at $t = t_0$, and from our study of the one-dimensional case we know that this happens for $\lambda_1$-almost all $t_0 \in \mathbb R$.

So for each $v$, the set $A_v$ of $x \in \mathbb R^n$ such that $q_v(x)$ does not exist is of zero measure. Let $(v_k)$ be a dense subset of $\mathbb R^n$, and let $A = \cup A_{v_k}$: $A$ is of zero measure, and for each $x \in \mathbb R^n \setminus A$, $\mathrm{Dom}\,q(x)$ contains all the vectors $v_k$.

Again, let $x \in \mathbb R^n \setminus A$. By Property (b), $q_v(x)$ is a convex function of $v$, so it is locally Lipschitz and can be extended uniquely into a continuous convex function $r(v)$ on $\mathbb R^n$. By Property (c), $r(v) = q_v(x)$, which means that $\mathrm{Dom}\,q(x) = \mathbb R^n$.
At this point we know that for almost any $x$ the limit $q_v(x)$ exists for all $v$, and it is a convex function of $v$, homogeneous of degree 2. What we do not know is whether $q_v(x)$ is a quadratic function of $v$.

Let us try to solve this problem by a regularization argument. Let $\zeta$ be a smooth nonnegative compactly supported function on $\mathbb R^n$, with $\int\zeta = 1$. Then $\nabla(\varphi*\zeta) = (\nabla\varphi)*\zeta$. Moreover, thanks to the nonnegativity of $Q_v(x,t)$ and Fatou's lemma,

$$(q_v*\zeta)(x) = \int \lim_{t\downarrow 0} Q_v(y,t)\,\zeta(x-y)\,dy \;\le\; \liminf_{t\downarrow 0}\int Q_v(y,t)\,\zeta(x-y)\,dy$$
$$= \liminf_{t\downarrow 0}\frac{1}{t^2}\Bigl[(\varphi*\zeta)(x+tv) - (\varphi*\zeta)(x) - t\,\nabla(\varphi*\zeta)(x)\cdot v\Bigr] = \frac12\,\bigl\langle\nabla^2(\varphi*\zeta)(x)\cdot v,\,v\bigr\rangle.$$

It is obvious that the right-hand side is a quadratic form in $v$, but this is only an upper bound on $q_v*\zeta(x)$. In fact, in general $q_v*\zeta$ does not coincide with $(1/2)\langle\nabla^2(\varphi*\zeta)\,v,v\rangle$. The difference is caused by the singular part of the measure $\mu_v := (1/2)\,\langle\nabla^2\varphi\cdot v,v\rangle$, defined in the distribution sense by
$$\int\zeta\,d\mu_v = \frac12\int\varphi(x)\,\bigl\langle\nabla^2\zeta(x)\cdot v,\,v\bigr\rangle\,dx.$$
This obstacle is the main new difficulty in the proof of Alexandrov's theorem, as compared to the proof of Rademacher's theorem.

To avoid the singular part of the measure $\mu_v$, we shall appeal to Lebesgue's density theory, in the following precise form: Let $\mu$ be a locally finite measure on $\mathbb R^n$, and let $\mu = \rho\,\lambda_n + \mu_s$ be its Lebesgue decomposition into an absolutely continuous part and a singular part. Then, for Lebesgue–almost all $x \in \mathbb R^n$,
$$\frac{1}{\delta^n}\,\bigl\|\mu - \rho(x)\,\lambda_n\bigr\|_{\mathrm{TV}(B_\delta(x))} \xrightarrow[\delta\to 0]{} 0,$$
where $\|\cdot\|_{\mathrm{TV}(B_\delta(x))}$ stands for the total variation on the ball $B_\delta(x)$. Such an $x$ will be called a Lebesgue point of $\mu$.

So let $\rho_v$ be the density of $\mu_v$. It is easy to check that $\mu_v$ is locally finite, and we also showed that $q_v$ is locally integrable. So, for $\lambda_n$-almost all $x_0$ we have
$$\frac{1}{\delta^n}\int_{B_\delta(x_0)}|q_v(x) - q_v(x_0)|\,dx \xrightarrow[\delta\to 0]{} 0;\qquad \frac{1}{\delta^n}\,\bigl\|\mu_v - \rho_v(x_0)\,\lambda_n\bigr\|_{\mathrm{TV}(B_\delta(x_0))} \xrightarrow[\delta\to 0]{} 0.$$
The goal is to show that $q_v(x_0) = \rho_v(x_0)$. Then the proof will be complete, since $\rho_v(x_0)$ is a quadratic form in $v$ (indeed, $\rho_v(x_0)$ is obtained by

averaging $\mu_v(dx)$, which itself is quadratic in $v$). Without loss of generality, we may assume that $x_0 = 0$.

To prove that $q_v(0) = \rho_v(0)$, it suffices to establish
$$\lim_{\delta\to 0}\frac{1}{\delta^n}\int_{B_\delta(0)}|q_v(x) - \rho_v(0)|\,dx = 0.\tag{14.73}$$
To estimate $q_v(x)$, we shall express it as a limit involving points in $B_\delta(x)$, and then use a Taylor formula; since $\varphi$ is not a priori smooth, we shall regularize it on a scale $\varepsilon \le \delta$. Let $\zeta$ be as before, and let $\zeta_\varepsilon(x) := \varepsilon^{-n}\,\zeta(x/\varepsilon)$; further, let $\varphi_\varepsilon := \varphi*\zeta_\varepsilon$.

We can restrict the integral in (14.73) to those $x$ such that $\nabla\varphi(x)$ exists and such that $x$ is a Lebesgue point of $\nabla\varphi$; indeed, such points form a set of full measure. For such an $x$, $\varphi(x) = \lim_{\varepsilon\to 0}\varphi_\varepsilon(x)$ and $\nabla\varphi(x) = \lim_{\varepsilon\to 0}\nabla\varphi_\varepsilon(x)$. So,
$$\frac{1}{\delta^n}\int_{B_\delta(0)}|q_v(x) - \rho_v(0)|\,dx$$
$$= \frac{1}{\delta^n}\int_{B_\delta(0)}\Bigl|\lim_{t\to 0}\frac{\varphi(x+t\delta v) - \varphi(x) - t\delta\,\nabla\varphi(x)\cdot v}{t^2\delta^2} - \rho_v(0)\Bigr|\,dx$$
$$= \frac{1}{\delta^n}\int_{B_\delta(0)}\Bigl|\lim_{t\to 0}\lim_{\varepsilon\to 0}\frac{\varphi_\varepsilon(x+t\delta v) - \varphi_\varepsilon(x) - t\delta\,\nabla\varphi_\varepsilon(x)\cdot v}{t^2\delta^2} - \rho_v(0)\Bigr|\,dx$$
$$= \frac{1}{\delta^n}\int_{B_\delta(0)}\Bigl|\lim_{t\to 0}\lim_{\varepsilon\to 0}\int_0^1 (1-s)\,\bigl[\langle\nabla^2\varphi_\varepsilon(x+st\delta v)\cdot v,v\rangle - 2\rho_v(0)\bigr]\,ds\Bigr|\,dx$$
$$\le \liminf_{t\to 0}\liminf_{\varepsilon\to 0}\frac{1}{\delta^n}\int_{B_\delta(0)}\int_0^1\bigl|\langle\nabla^2\varphi_\varepsilon(x+st\delta v)\cdot v,v\rangle - 2\rho_v(0)\bigr|\,(1-s)\,ds\,dx$$
$$\le \liminf_{t\to 0}\liminf_{\varepsilon\to 0}\frac{1}{\delta^n}\int_0^1\!\!\int_{B_\delta(st\delta v)}\bigl|\langle\nabla^2\varphi_\varepsilon(y)\cdot v,v\rangle - 2\rho_v(0)\bigr|\,dy\,ds,$$
where Fatou's lemma and Fubini's theorem were used successively. Since $B(st\delta v,\delta) \subset B(0,(1+|v|)\delta)$, independently of $s$ and $t$, we can bound the above expression by
$$\liminf_{\varepsilon\to 0}\frac{1}{\delta^n}\int_{B(0,(1+|v|)\delta)}\bigl|\langle\nabla^2\varphi_\varepsilon(y)\cdot v,v\rangle - 2\rho_v(0)\bigr|\,dy$$
$$= \liminf_{\varepsilon\to 0}\frac{2}{\delta^n}\int_{B(0,(1+|v|)\delta)}\Bigl|\int\zeta_\varepsilon(y-z)\,[\mu_v - \rho_v(0)\,\lambda_n](dz)\Bigr|\,dy$$
$$\le \liminf_{\varepsilon\to 0}\frac{2}{\delta^n}\int_{B(0,(1+|v|)\delta)}\int\zeta_\varepsilon(y-z)\,\bigl|\mu_v - \rho_v(0)\,\lambda_n\bigr|(dz)\,dy.$$

When $y$ varies in $B(0,(1+|v|)\delta)$, $z$ stays in $B(0,(1+|v|)\delta + \varepsilon)$, which itself is included in $B(0,C\delta)$ with $C = 2 + |v|$. So, after using Fubini's theorem and integrating out $\zeta_\varepsilon(y-z)\,dy$, we conclude that
$$\frac{1}{\delta^n}\int_{B_\delta(0)}|q_v(x) - \rho_v(0)|\,dx \;\le\; \frac{2}{\delta^n}\,\bigl\|\mu_v - \rho_v(0)\,\lambda_n\bigr\|_{\mathrm{TV}(B(0,C\delta))}.$$
The conclusion is obtained by taking the limit $\delta \to 0$.

Once $\nabla^2\varphi$ has been identified as the density of the distributional Hessian of $\varphi$, it follows immediately that $\Delta\varphi := \operatorname{tr}(\nabla^2\varphi)$ is the density of the distributional Laplacian of $\varphi$. (The trace of a matrix-valued nonnegative measure is singular if and only if the measure itself is singular.) ⊓⊔

Remark 14.27. The notion of a distributional Hessian on a Riemannian manifold is a bit subtle, which is why I did not state anything about it in Theorem 14.1. On the other hand, there is no difficulty in defining the distributional Laplacian.

Second Appendix: Very elementary comparison arguments

There are well-developed theories of comparison estimates for second-order linear differential equations; but the statement to be considered here can be proven by very elementary means.

Theorem 14.28 (One-dimensional comparison for second-order inequalities). Let $\Lambda \in \mathbb R$, and $f \in C([0,1]) \cap C^2(0,1)$, $f \ge 0$. Then the following two statements are equivalent:

(i) $\ddot f + \Lambda f \le 0$ in $(0,1)$;

(ii) If $\Lambda < \pi^2$, then for all $t_0, t_1 \in [0,1]$ and $\lambda \in [0,1]$,
$$f\bigl((1-\lambda)\,t_0 + \lambda\,t_1\bigr) \;\ge\; \tau^{(1-\lambda)}\bigl(|t_0 - t_1|\bigr)\,f(t_0) + \tau^{(\lambda)}\bigl(|t_0 - t_1|\bigr)\,f(t_1),$$
where

$$\tau^{(\lambda)}(\theta) = \begin{cases} \dfrac{\sin(\lambda\theta\sqrt\Lambda)}{\sin(\theta\sqrt\Lambda)} & \text{if } 0 < \Lambda < \pi^2,\\[2mm] \lambda & \text{if } \Lambda = 0,\\[1mm] \dfrac{\sinh(\lambda\theta\sqrt{-\Lambda})}{\sinh(\theta\sqrt{-\Lambda})} & \text{if } \Lambda < 0.\end{cases}$$

If $\Lambda = \pi^2$ then $f(t) = c\,\sin(\pi t)$ for some $c \ge 0$; finally if $\Lambda > \pi^2$ then $f = 0$.

Proof of Theorem 14.28. The easy part is (ii) ⇒ (i). If $\Lambda \ge \pi^2$ this is trivial. If $\Lambda < \pi^2$, take $\lambda = 1/2$; then a Taylor expansion shows that
$$\tau^{(1/2)}(\theta) = \frac12\left(1 + \frac{\theta^2\Lambda}{8}\right) + o(\theta^3),$$
and
$$f\left(\frac{t_0+t_1}{2}\right) = \frac{f(t_0) + f(t_1)}{2} - \frac{(t_1-t_0)^2}{4}\,\frac{\ddot f}{2} + o\bigl(|t_1-t_0|^2\bigr).$$
So, if we fix $t \in (0,1)$ and let $t_0, t_1 \to t$ in such a way that $t = (t_0+t_1)/2$, we get
$$f(t) - \tau^{(1/2)}(|t_1-t_0|)\,f(t_0) - \tau^{(1/2)}(|t_1-t_0|)\,f(t_1) = -\frac{(t_1-t_0)^2}{8}\,\bigl(\ddot f(t) + \Lambda f(t)\bigr) + o\bigl((t_1-t_0)^2\bigr).$$
By assumption the left-hand side is nonnegative, so in the limit we recover $\ddot f + \Lambda f \le 0$.

Now consider the reverse implication (i) ⇒ (ii). By abuse of notation, let us write $f(\lambda) = f((1-\lambda)t_0 + \lambda t_1)$, and denote by a prime the derivation with respect to $\lambda$; so $f'' + \Lambda\theta^2 f \le 0$, $\theta = |t_1 - t_0|$. Let $g(\lambda)$ be defined by the right-hand side of (ii); that is, $g$ is the solution of $g'' + \Lambda\theta^2 g = 0$ with $g(0) = f(0)$, $g(1) = f(1)$. The goal is to show that $f \ge g$ on $[0,1]$.

(a) Case $\Lambda < 0$. Let $a > 0$ be any constant; then $f_a := f + a$ still solves the same differential inequality as $f$ (with strict inequality, since $\Lambda a < 0$), and $f_a > 0$ (even if we did not assume $f \ge 0$, we could take $a$ sufficiently large for this to be true). Let $g_a$ be defined as the solution of $g_a'' + \Lambda\theta^2 g_a = 0$ with $g_a(0) = f_a(0)$,

$g_a(1) = f_a(1)$. As $a \to 0$, $f_a$ converges to $f$ and $g_a$ converges to $g$, so it is sufficient to show $f_a \ge g_a$. Therefore, without loss of generality we may assume that $f, g$ are positive, so $g/f$ is continuous.

If $g/f$ attains its maximum at 0 or 1, then we are done. Otherwise, there is $\lambda_0 \in (0,1)$ such that $(g/f)''(\lambda_0) \le 0$, $(g/f)'(\lambda_0) = 0$, and then the identity
$$\left(\frac gf\right)'' = \frac{g'' + \Lambda\theta^2 g}{f} - \frac gf\,\frac{f'' + \Lambda\theta^2 f}{f} - 2\,\frac{f'}{f}\left(\frac gf\right)',$$
evaluated at $\lambda_0$ (where $g'' + \Lambda\theta^2 g = 0$ and $f'' + \Lambda\theta^2 f < 0$), yields $0 \ge (g/f)''(\lambda_0) > 0$, which is impossible.

(b) Case $\Lambda = 0$. This is the basic property of concave functions.

(c) Case $0 < \Lambda < \pi^2$. Let $\theta = |t_1 - t_0| \le 1$. Since $\theta^2\Lambda < \pi^2$, we can find a function $w$ such that $w'' + \Lambda\theta^2 w \le 0$ and $w > 0$ on $(0,1)$. (Just take a well-chosen sine or cosine function.) Then $f_a := f + aw$ still satisfies the same differential inequality as $f$, and it is positive. Let $g_a$ be defined by the equation $g_a'' + \Lambda\theta^2 g_a = 0$ with $g_a(0) = f_a(0)$, $g_a(1) = f_a(1)$. As $a \to 0$, $f_a$ converges to $f$ and $g_a$ to $g$, so it is sufficient to show that $f_a \ge g_a$. Thus we may assume that $f$ and $g$ are positive, and $f/g$ is continuous.

The rest of the reasoning is parallel to the case $\Lambda < 0$: If $f/g$ attains its minimum at 0 or 1, then we are done. Otherwise, there is $\lambda_0 \in (0,1)$ such that $(f/g)''(\lambda_0) \ge 0$, $(f/g)'(\lambda_0) = 0$, and then the identity
$$\left(\frac fg\right)'' = \frac{f'' + \Lambda\theta^2 f}{g} - \frac fg\,\frac{g'' + \Lambda\theta^2 g}{g} - 2\,\frac{g'}{g}\left(\frac fg\right)',$$
evaluated at $\lambda_0$, yields $0 \le (f/g)''(\lambda_0) < 0$, which is impossible.

(d) Case $\Lambda = \pi^2$. Take $t_0 = 0$, $t_1 = 1$. Then let $g(\lambda) = \sin(\pi\lambda)$, and let $h := f/g$. The differential equations $f'' + \Lambda f \le 0$ and $g'' + \Lambda g = 0$ combine to yield $(g^2 h')' = g^2 h'' + 2gg'h' \le 0$. So $h' g^2$ is nonincreasing. If $h'(\lambda_0) < 0$ for some $\lambda_0 \in (0,1)$, then $h'(\lambda)\,g(\lambda)^2 \le h'(\lambda_0)\,g(\lambda_0)^2 < 0$ for all $\lambda \ge \lambda_0$, so $h'(\lambda) \le -C/g(\lambda)^2 \le -C'/(1-\lambda)^2$ as $\lambda \to 1$, where $C, C'$ are positive constants.
It follows that $h(\lambda)$ becomes negative for $\lambda$ close to 1, which is impossible. If on the other hand $h'(\lambda_0) > 0$, then a similar reasoning shows that $h(\lambda)$ becomes negative for $\lambda$ close to 0. The conclusion is that $h'$ is identically 0, so $f/g$ is a constant.

(e) If $\Lambda > \pi^2$, then for all $t_0, t_1 \in [0,1]$ with $|t_1 - t_0| = \pi/\sqrt\Lambda$, the function $f(\lambda) = f((1-\lambda)t_0 + \lambda t_1)$ is proportional to $\sin(\pi\lambda)$, by Case (d). By letting $t_0$, $t_1$ vary, it is easy to deduce that $f$ is identically 0. ⊓⊔
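Theorem 14.28 lends itself to a quick numerical sanity check. In the sketch below (our own illustration, with parameter choices of ours), $\Lambda = 1 < \pi^2$ and $f(t) = \sin(\sqrt2\,t + 0.3)$, which is positive on $[0,1]$ and satisfies $\ddot f + \Lambda f = (\Lambda - 2)f \le 0$ there; inequality (ii) is then verified on a grid of $(t_0, t_1, \lambda)$.

```python
import math

Lam = 1.0                                        # Lambda, with Lam < pi^2
def f(t):                                        # f'' + 2 f = 0, hence
    return math.sin(math.sqrt(2.0) * t + 0.3)    # f'' + Lam f <= 0, f > 0

def tau(lam, theta):       # tau^{(lam)}(theta) in the case 0 < Lam < pi^2
    s = math.sqrt(Lam)
    return math.sin(lam * theta * s) / math.sin(theta * s)

concavity_ok = True
for i in range(11):
    for j in range(11):
        t0, t1 = i / 10, j / 10
        if t0 == t1:
            continue
        theta = abs(t0 - t1)
        for k in range(1, 10):
            lam = k / 10
            lhs = f((1 - lam) * t0 + lam * t1)
            rhs = tau(1 - lam, theta) * f(t0) + tau(lam, theta) * f(t1)
            concavity_ok &= lhs >= rhs - 1e-12
```

Equality holds exactly when $f$ itself solves $\ddot f + \Lambda f = 0$; with $\ddot f + \Lambda f < 0$, as here, the inequality is strict in the interior.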

Third Appendix: Jacobi fields forever

Let $R : t \mapsto R(t)$ be a continuous map defined on $[0,1]$, valued in the space of $n\times n$ symmetric matrices, and let $\mathcal U$ be the space of functions $u : t \mapsto u(t) \in \mathbb R^n$ solving the second-order linear differential equation
$$\ddot u(t) + R(t)\,u(t) = 0.\tag{14.74}$$
By the theory of linear differential equations, $\mathcal U$ is a $(2n)$-dimensional vector space and the map $u \mapsto (u(0), \dot u(0))$ is a linear isomorphism $\mathcal U \to \mathbb R^{2n}$.

As explained in the beginning of this chapter, if $\gamma : [0,1] \to M$ is a geodesic in a Riemannian manifold $M$, and $\xi$ is a Jacobi field along $\gamma$ (that is, an infinitesimal variation of geodesic around $\gamma$), then the coordinates of $\xi(t)$, written in an orthonormal basis of $T_{\gamma(t)}M$ evolving by parallel transport, satisfy an equation of the form (14.74).

It is often convenient to consider an array $(u_1(t),\ldots,u_n(t))$ of solutions of (14.74); this can be thought of as a time-dependent matrix $t \mapsto J(t)$ solving the differential (matrix) equation
$$\ddot J(t) + R(t)\,J(t) = 0.$$
Such a solution will be called a Jacobi matrix. If $J$ is a Jacobi matrix and $A$ is a constant matrix, then $JA$ is still a Jacobi matrix.

Jacobi matrices enjoy remarkable properties, some of which are summarized in the next three statements. In the sequel, the time-interval $[0,1]$ and the symmetric matrix $t \mapsto R(t)$ are fixed once for all, and the dot stands for time-derivation.

Proposition 14.29 (Jacobi matrices have symmetric logarithmic derivatives). Let $J$ be a Jacobi matrix such that $J(0)$ is invertible and $\dot J(0)\,J(0)^{-1}$ is symmetric. Let $t_* \in [0,1]$ be largest possible such that $J(t)$ is invertible for $t < t_*$. Then $\dot J(t)\,J(t)^{-1}$ is symmetric for all $t \in [0,t_*)$.

Proposition 14.30 (Cosymmetrization of Jacobi matrices).
Let 1 0 J J be Jacobi matrices defined by the initial conditions and 0 1 0 1 0 1 ̇ ̇ I J (0) = ; (0) = 0 , J I (0) = (0) = 0 , , J J n n 1 1 0 0 so that any Jacobi matrix can be written as 1 0 ̇ ) = J (14.75) . ( t ) J (0) + J ( (0) ( t ) t J J 1 0
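The linear structure behind (14.75) is easy to probe numerically. The following sketch (mine, not the book's; the matrix $R$ and the initial data are arbitrary illustrative choices) integrates the matrix Jacobi equation $\ddot J + RJ = 0$ with a hand-rolled RK4 scheme for a constant symmetric $R$, and checks that an arbitrary solution is reproduced by $J^0(t)\,J(0) + J^1(t)\,\dot J(0)$:

```python
# Illustrative numerical check of (14.75); not taken from the book.
# R is a fixed symmetric 2x2 matrix; all numerical values are arbitrary.

def madd(A, B, s=1.0):
    return [[A[i][j] + s * B[i][j] for j in range(2)] for i in range(2)]

def mmul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]

R = [[1.0, 0.3], [0.3, 0.5]]

def deriv(J, Jd):
    # (J, J') -> (J', J'') with J'' = -R J
    return Jd, [[-sum(R[i][k] * J[k][j] for k in range(2)) for j in range(2)]
                for i in range(2)]

def flow(J, Jd, t=1.0, steps=1000):
    # classical RK4 for the first-order system (J, J')
    h = t / steps
    for _ in range(steps):
        a1, b1 = deriv(J, Jd)
        a2, b2 = deriv(madd(J, a1, h/2), madd(Jd, b1, h/2))
        a3, b3 = deriv(madd(J, a2, h/2), madd(Jd, b2, h/2))
        a4, b4 = deriv(madd(J, a3, h), madd(Jd, b3, h))
        J = [[J[i][j] + h/6*(a1[i][j] + 2*a2[i][j] + 2*a3[i][j] + a4[i][j])
              for j in range(2)] for i in range(2)]
        Jd = [[Jd[i][j] + h/6*(b1[i][j] + 2*b2[i][j] + 2*b3[i][j] + b4[i][j])
               for j in range(2)] for i in range(2)]
    return J, Jd

I2 = [[1.0, 0.0], [0.0, 1.0]]
Z2 = [[0.0, 0.0], [0.0, 0.0]]
J0, _ = flow(I2, Z2)                      # J^0(1): J(0) = I,  J'(0) = 0
J1, _ = flow(Z2, I2)                      # J^1(1): J(0) = 0,  J'(0) = I
A = [[0.2, -0.1], [0.4, 1.0]]             # arbitrary J(0)
B = [[0.5, 0.1], [0.1, -0.3]]             # arbitrary J'(0)
J, _ = flow(A, B)
pred = madd(mmul(J0, A), mmul(J1, B))
err = max(abs(J[i][j] - pred[i][j]) for i in range(2) for j in range(2))
print(err < 1e-10)  # True: both sides of (14.75) agree
```

Since the equation is linear, the RK4 map is itself linear in the initial data, so the two sides agree to rounding accuracy.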

Assume that $J^1(t)$ is invertible for all $t \in (0,1]$. Then:

(a) $S(t) := J^1(t)^{-1}\,J^0(t)$ is symmetric for all $t \in (0,1]$, and it is a strictly decreasing function of $t$.

(b) There is a unique pair of Jacobi matrices $(J^{1,0}, J^{0,1})$ such that
\[
J^{1,0}(0) = I_n, \quad J^{1,0}(1) = 0; \qquad J^{0,1}(0) = 0, \quad J^{0,1}(1) = I_n;
\]
moreover $\dot J^{1,0}(0)$ and $\dot J^{0,1}(1)$ are symmetric.

(c) For $t \in [0,1]$ let
\[
K(t) = t\,J^1(t)^{-1},
\]
extended by continuity at $t = 0$ by $K(0) = I_n$. If $J$ is a Jacobi matrix such that $J(0)$ is invertible, $J(0)^*\,S(1) = S(1)\,J(0)$, and $\dot J(0)\,J(0)^{-1}$ is symmetric, then for any $t \in (0,1]$ the matrices
\[
\bigl( K(t)\,J^{1,0}(t) \bigr) \qquad\text{and}\qquad \bigl( K(t)\,J^{0,1}(t)\,J(1)\,J(0)^{-1} \bigr)
\]
are symmetric. Moreover, $\det K(t) > 0$ for all $t \in [0,1)$.

Proposition 14.31 (Jacobi matrices with positive determinant). Let $S(t)$ and $K(t)$ be the matrices defined in Proposition 14.30. Let $J$ be a Jacobi matrix such that $J(0) = I_n$ and $\dot J(0)$ is symmetric. Then the following properties are equivalent:

(i) $\dot J(0) + S(1) > 0$;
(ii) $K(t)\,J^{0,1}(t)\,J(1) > 0$ for all $t \in (0,1)$;
(iii) $K(t)\,J(t) > 0$ for all $t \in [0,1]$;
(iv) $\det J(t) > 0$ for all $t \in [0,1]$.

The equivalence remains true if one replaces the strict inequalities in (i)–(ii) by nonstrict inequalities, and the time-interval $[0,1]$ in (iii)–(iv) by $[0,1)$.

Before proving these propositions, it is interesting to discuss their geometric interpretation:

• If $\gamma(t) = \exp_x(t\,\nabla\psi(x))$ is a minimizing geodesic, then $\dot\gamma(t) = \nabla\psi(t,\gamma(t))$, where $\psi$ solves the Hamilton–Jacobi equation
\[
\frac{\partial\psi}{\partial t} + \frac{|\nabla\psi(t,x)|^2}{2} = 0; \qquad \psi(0,\cdot) = \psi,
\]

at least before the first shock of the Hamilton–Jacobi equation. This corresponds to Proposition 14.29 with $J(0) = I_n$, $\dot J(0) = \nabla^2\psi(x)$, $\dot J(t)\,J(t)^{-1} = \nabla^2\psi(t,\gamma(t))$. (Here $\nabla^2\psi(x)$ and $\nabla^2\psi(t,\gamma(t))$ are identified with the matrix of their respective coordinates, as usual in a varying orthonormal basis.)

• The Jacobi matrices $J^0(t)$ and $J^1(t)$ represent $\partial_x F_t$ and $\partial_v F_t$ respectively, where $F_t(x,v) = (\exp_x(tv), (d/dt)\exp_x(tv))$ is the geodesic flow at time $t$. So the hypothesis of invertibility of $J^1(t)$ in Proposition 14.30 corresponds to an assumption of nonfocalization along the geodesic $\gamma$.

• The formula
\[
\gamma(t) = \exp_x\Bigl( -\nabla_x\,\frac{d(\,\cdot\,,\gamma(t))^2}{2} \Bigr)
\]
yields, after differentiation,
\[
0 = J^0(t) - J^1(t)\,\frac{H(t)}{t},
\]
where (modulo identification)
\[
H(t) = \nabla_x^2\,\frac{d(\,\cdot\,,\gamma(t))^2}{2}.
\]
(The extra $t$ in the denominator comes from time-reparameterization.) So $S(t) = J^1(t)^{-1}\,J^0(t) = H(t)/t$ should be symmetric.

• The assumption
\[
J(0) = I_n, \qquad \dot J(0) + S(1) \ge 0
\]
is the Jacobi field version of the formulas
\[
\gamma(t) = \exp_x\bigl( t\,\nabla\psi(x) \bigr), \qquad \nabla^2\psi(x) + \nabla_x^2\,\frac{d\bigl(\,\cdot\,, \exp_x(\nabla\psi(x))\bigr)^2}{2} \ge 0.
\]
The latter inequality holds true if $\psi$ is $d^2/2$-convex, so Proposition 14.31 implies the last part of Theorem 11.3, according to which the Jacobian of the optimal transport map remains positive for $0 < t < 1$.

Now we can turn to the proofs of Propositions 14.29 to 14.31.
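Before turning to the proofs, it may help to record the constant-curvature model case, where everything can be computed in closed form (this worked example is mine, not the book's). If $R(t) \equiv R$ is a fixed symmetric positive definite matrix, the power series defining matrix sine and cosine give

```latex
J^0(t) = \cos\bigl(t\sqrt{R}\bigr), \qquad
J^1(t) = \bigl(\sqrt{R}\bigr)^{-1}\sin\bigl(t\sqrt{R}\bigr),
\qquad
S(t) = J^1(t)^{-1}\,J^0(t) = \sqrt{R}\,\cot\bigl(t\sqrt{R}\bigr).
```

Here $J^1(t)$ is invertible on $(0,1]$ as soon as $\sqrt{\lambda_{\max}(R)} < \pi$, which is the nonfocalization assumption of Proposition 14.30; $S(t)$ is symmetric because it is a function of the symmetric matrix $R$, and on each eigenspace it acts as $\sqrt\lambda\,\cot(t\sqrt\lambda)$, a strictly decreasing function of $t$, in agreement with (a). One may also check $\dot S(t) = -R/\sin^2(t\sqrt R)$, negative definite, as predicted by (14.79) below.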

Proof of Proposition 14.29. The argument was already mentioned in the beginning of this chapter: the matrix $U(t) = \dot J(t)\,J(t)^{-1}$ satisfies the Riccati equation
\[
\dot U(t) + U(t)^2 + R(t) = 0
\]
on $[0, t_*)$; and since $R$ is symmetric, the transpose $U(t)^*$ of $U(t)$ satisfies the same equation. By assumption $U(0) = U(0)^*$; so the Cauchy–Lipschitz uniqueness theorem implies $U(t) = U(t)^*$ for all $t$. ⊓⊔

Proof of Proposition 14.30. First of all, the identity in (14.75) follows immediately from the observation that both sides solve the Jacobi equation with the same initial conditions.

Let now $w \in \mathbb R^n$ and $\widehat w = J^1(t)^{-1}\,J^0(t)\,w$, $t \in (0,1]$; so that $J^1(t)\,\widehat w = J^0(t)\,w$. The function $s \mapsto u(s) = J^0(s)\,w - J^1(s)\,\widehat w$ belongs to $\mathcal U_n$ and satisfies $u(t) = 0$. Moreover, $w = u(0)$, $\widehat w = -\dot u(0)$. So the matrix $S(t) = J^1(t)^{-1}\,J^0(t)$ can be interpreted as the linear map $u(0) \mapsto -\dot u(0)$, where $s \mapsto u(s)$ varies in the vector space $\mathcal U_t$ of solutions of the Jacobi equation (14.74) satisfying $u(t) = 0$.

To prove the symmetry of $S(t)$ it suffices to check that for any two $u$, $v$ in $\mathcal U_t$,
\[
\langle u(0), \dot v(0)\rangle = \langle \dot u(0), v(0)\rangle,
\]
where $\langle\cdot,\cdot\rangle$ stands for the usual scalar product on $\mathbb R^n$. But
\[
\frac{d}{ds}\Bigl( \langle u(s), \dot v(s)\rangle - \langle \dot u(s), v(s)\rangle \Bigr) = -\langle u(s), R(s)\,v(s)\rangle + \langle R(s)\,u(s), v(s)\rangle = 0 \qquad (14.76)
\]
by symmetry of $R$. This implies the symmetry of $S$ since
\[
\langle u(0), \dot v(0)\rangle - \langle \dot u(0), v(0)\rangle = \langle u(t), \dot v(t)\rangle - \langle \dot u(t), v(t)\rangle = 0.
\]

Remark 14.32. A knowledgeable reader may have noticed that this is a "symplectic" argument (related to the Hamiltonian nature of the geodesic flow): if $\mathbb R^n \times \mathbb R^n$ is equipped with its natural symplectic form
\[
\omega\bigl( (u, \dot u), (v, \dot v) \bigr) = \langle \dot u, v\rangle - \langle \dot v, u\rangle,
\]
then the flow $(u(s), \dot u(s)) \mapsto (u(t), \dot u(t))$, where $u \in \mathcal U_n$, preserves $\omega$. The subspaces $\mathcal U_0 = \{u \in \mathcal U_n;\ u(0) = 0\}$ and $\dot{\mathcal U}_0 = \{u \in \mathcal U_n;\ \dot u(0) = 0\}$ are Lagrangian: this means that their dimension is half the dimension

of $\mathcal U_n$, and that $\omega$ vanishes identically on each of them. Moreover $\omega$ is nondegenerate on $\mathcal U_0 \times \dot{\mathcal U}_0$, providing an identification of these spaces. Then $\mathcal U_t$ is also Lagrangian, and if one writes it as a graph in $\mathcal U_0 \times \dot{\mathcal U}_0$, it is the graph of a symmetric operator.

Back to the proof of Proposition 14.30, let us show that $S(t)$ is a decreasing function of $t$. To reformulate the above observations we write, for any $w \in \mathbb R^n$,
\[
\langle S(t)\,w, w\rangle = -\Bigl\langle \frac{\partial u}{\partial s}(0,t),\ w \Bigr\rangle,
\]
where $u = u(s,t)$ is defined by
\[
\frac{\partial^2 u(s,t)}{\partial s^2} + R(s)\,u(s,t) = 0; \qquad u(t,t) = 0; \qquad u(0,t) = w. \qquad (14.77)
\]
So
\[
\langle \dot S(t)\,w, w\rangle = -\Bigl\langle \frac{\partial^2 u}{\partial s\,\partial t}(0,t),\ w \Bigr\rangle = -\Bigl\langle \frac{\partial v}{\partial s}(0),\ u(0) \Bigr\rangle,
\]
where $s \mapsto v(s) = (\partial_t u)(s,t)$ and $s \mapsto u(s) = u(s,t)$ are solutions of the Jacobi equation. Moreover, by differentiating the conditions in (14.77) one obtains
\[
v(t) + \frac{\partial u}{\partial s}(t) = 0; \qquad v(0) = 0. \qquad (14.78)
\]
By (14.76) again,
\[
\Bigl\langle \frac{\partial v}{\partial s}(0),\ u(0) \Bigr\rangle
= \Bigl\langle \frac{\partial u}{\partial s}(0),\ v(0) \Bigr\rangle + \Bigl\langle \frac{\partial v}{\partial s}(t),\ u(t) \Bigr\rangle - \Bigl\langle \frac{\partial u}{\partial s}(t),\ v(t) \Bigr\rangle.
\]
The first two terms in the right-hand side vanish because $v(0) = 0$ and $u(t) = u(t,t) = 0$. Combining this with the first identity in (14.78) one finds in the end
\[
\langle \dot S(t)\,w, w\rangle = -\|v(t)\|^2. \qquad (14.79)
\]
We already know that $v(0) = 0$; if in addition $v(t) = 0$, then $v(s) = J^1(s)\,\dot v(0)$, so (by invertibility of $J^1(t)$) $\dot v(0) = 0$, and $v$ vanishes

identically; then by (14.78) $du/ds$ vanishes at $s = t$, and since also $u(t) = 0$ we know that $u$ vanishes identically, which implies $w = 0$. In other words, the right-hand side of (14.79) is strictly negative unless $w = 0$; this means that $S(t)$ is a strictly decreasing function of $t$. Thus the proof of (a) is finished.

To prove (b) it is sufficient to exhibit the matrices $J^{1,0}(t)$ and $J^{0,1}(t)$ explicitly in terms of $J^0$ and $J^1$:
\[
J^{1,0}(t) = J^0(t) - J^1(t)\,J^1(1)^{-1}\,J^0(1); \qquad J^{0,1}(t) = J^1(t)\,J^1(1)^{-1}. \qquad (14.80)
\]
Moreover $\dot J^{1,0}(0) = -J^1(1)^{-1}\,J^0(1)$ and $\dot J^{0,1}(1) = \dot J^1(1)\,J^1(1)^{-1}$ are symmetric in view of (a) and Proposition 14.29. Also
\[
K(t)\,J^{1,0}(t) = t\,J^1(t)^{-1}\bigl( J^0(t) - J^1(t)\,J^1(1)^{-1}\,J^0(1) \bigr)
= t\,\bigl( J^1(t)^{-1}\,J^0(t) - J^1(1)^{-1}\,J^0(1) \bigr)
\]
is positive symmetric since by (a) the matrix $S(t) = J^1(t)^{-1}\,J^0(t)$ is a strictly decreasing function of $t$. In particular $K(t)$ is invertible for $t \in (0,1)$; but since $K(0) = I_n$, it follows by continuity that $\det K(t)$ remains positive on $[0,1)$.

Finally, if $J$ satisfies the assumptions of (c), then $S(1)\,J(0)$ is symmetric (because $J(0)^*\,S(1) = S(1)\,J(0)$). Then
\[
K(t)\,\bigl( J^{0,1}(t)\,J(1)\,J(0)^{-1} \bigr)
= t\,J^1(1)^{-1}\bigl( J^0(1)\,J(0) + J^1(1)\,\dot J(0) \bigr)\,J(0)^{-1}
= t\,\bigl( S(1) + \dot J(0)\,J(0)^{-1} \bigr)
\]
is also symmetric. ⊓⊔

Proof of Proposition 14.31. Assume (i); then, by the formulas in the end of the proof of Proposition 14.30, with $J(0) = I_n$,
\[
K(t)\,J^{1,0}(t) = t\,\bigl[\, S(t) - S(1) \,\bigr]; \qquad K(t)\,J^{0,1}(t)\,J(1) = t\,\bigl[\, S(1) + \dot J(0) \,\bigr].
\]
As we already noticed, the first matrix is positive for $t \in (0,1)$; and the second is also positive, by assumption. In particular (ii) holds true.

The implication (ii) ⇒ (iii) is obvious since $K(t)\,J(t) = K(t)\,J^{1,0}(t) + K(t)\,J^{0,1}(t)\,J(1)$ is the sum of two positive matrices for $t \in (0,1)$. (At $t = 0$ one sees directly $K(0)\,J(0) = I_n$.)

If (iii) is true then $(\det K(t))\,(\det J(t)) > 0$ for all $t \in [0,1)$, and we already know that $\det K(t) > 0$; so $\det J(t) > 0$, which is (iv).

It remains to prove (iv) ⇒ (i). Recall that $K(t)\,J(t) = t\,[\,S(t) + \dot J(0)\,]$; since $\det K(t) > 0$, the assumption (iv) is equivalent to the statement that the symmetric matrix $A(t) = t\,S(t) + t\,\dot J(0)$ has positive determinant for all $t \in (0,1]$. The identity $t\,S(t) = K(t)\,J^{1,0}(t) + t\,S(1)$ shows that $A(t)$ approaches $I_n$ as $t \to 0$; and since none of its eigenvalues vanishes, $A(t)$ has to remain positive for all $t$. So $S(t) + \dot J(0)$ is positive for all $t \in (0,1]$; but $S$ is a decreasing function of $t$, so this is equivalent to $S(1) + \dot J(0) > 0$, which is condition (i).

The last statement in Proposition 14.31 is obtained by similar arguments and its proof is omitted. ⊓⊔

Bibliographical notes

Recommended textbooks about Riemannian geometry are the ones by do Carmo [306], Gallot, Hulin and Lafontaine [394] and Chavel [223]. All the necessary background about Hessians, Laplace–Beltrami operators, Jacobi fields and Jacobi equations can be found there. Apart from these sources, a review of comparison methods based on Ricci curvature bounds can be found in [846].

Formula (14.1) does not seem to appear in standard textbooks of Riemannian geometry, but can be derived with the tools found therein, or by comparison with the sphere/hyperbolic space. On the sphere, the computation can be done directly, thanks to a classical formula of spherical trigonometry: If $a$, $b$, $c$ are the lengths of the sides of a triangle drawn on the unit sphere $S^2$, and $\gamma$ is the angle opposite to $c$, then $\cos c = \cos a\,\cos b + \sin a\,\sin b\,\cos\gamma$. A more standard computation usually found in textbooks is the asymptotic expansion of the perimeter of a circle centered at $x$ with (geodesic) radius $r$, as $r \to 0$. Y.-H. Kim and McCann [520, Lemma 4.5] recently generalized (14.1) to more general cost functions, and curves of possibly differing lengths.

The differential inequalities relating the Jacobian of the exponential map to the Ricci curvature can be found (with minor variants) in a number of sources, e.g. [223, Section 3.4]. They usually appear in conjunction with volume comparison principles such as the Heintze–Karcher, Lévy–Gromov and Bishop–Gromov theorems, all of which express

the idea that if the Ricci curvature is bounded below by $K$, and the dimension is less than $N$, then volumes along geodesic fields grow no faster than volumes in model spaces of constant sectional curvature having dimension $N$ and Ricci curvature identically equal to $K$. These computations are usually performed in a smooth setting; their adaptation to the nonsmooth context of semiconvex functions has been achieved only recently, first by Cordero-Erausquin, McCann and Schmuckenschläger [246] (in a form that is somewhat different from the one presented here) and more recently by various sets of authors [247, 577, 761].

Bochner's formula appears, e.g., as [394, Proposition 4.15] (for a vector field $\xi = \nabla\psi$) or as [680, Proposition 3.3(3)] (for a vector field $\xi$ such that $\nabla\xi$ is symmetric, i.e. the 1-form $p \mapsto \xi\cdot p$ is closed). In both cases, it is derived from properties of the Riemannian curvature tensor. Another derivation of Bochner's formula for a gradient vector field is via the properties of the square distance function $d(x_0,x)^2$; this is quite simple, and not far from the presentation that I have followed, since $d(x_0,x)^2/2$ is the solution of the Hamilton–Jacobi equation at time 1, when the initial datum is 0 at $x_0$ and $+\infty$ everywhere else. But I thought that the explicit use of the Lagrangian/Eulerian duality would make Bochner's formula more intuitive to the readers, especially those who have some experience of fluid mechanics.

There are several other Bochner formulas in the literature; Chapter 7 of Petersen's book [680] is entirely devoted to that subject. In fact "Bochner formula" is a generic name for many identities involving commutators of second-order differential operators and curvature.

The examples (14.10) are by now standard; they have been discussed for instance by Bakry and Qian [61], in relation with spectral gap estimates. When the dimension $N$ is an integer, these reference spaces are obtained by "projection" of the model spaces with constant sectional curvature.

The practical importance of separating out the direction of motion is implicit in Cordero-Erausquin, McCann and Schmuckenschläger [246], but it was Sturm who attracted my attention to this. To implement this idea in the present chapter, I essentially followed the discussion in [763, Section 1]. Also the integral bound (14.56) can be found in this reference.

Many analytic and geometric consequences of Ricci curvature bounds are discussed in Riemannian geometry textbooks such as the one by

Gallot, Hulin and Lafontaine [394], and also in hundreds of research papers.

Cordero-Erausquin, McCann and Schmuckenschläger [246, Section 2] express differential inequalities about the Jacobian determinant in terms of volume distortion coefficients; all the discussion about distortion coefficients is inspired from this reference, and most of the material in the Third Appendix is also adapted from that source. This Appendix was born from exchanges with Cordero-Erausquin, who also unwillingly provided its title.

It is a pleasure to acknowledge the help of my geometer colleagues (Ghys, Sikorav, Welschinger) in getting the "symplectic argument" for the proof of Proposition 14.29.

Concerning Bakry's approach to curvature-dimension bounds, among many sources one can consult the survey papers [54] and [545].

The almost everywhere second differentiability of convex functions was proven by Alexandrov in 1942 [16]. The proof which I gave in the First Appendix has several points in common with the one that can be found in [331, pp. 241–245], but I have modified the argument to make it look as much as possible like the proof of Rademacher's theorem (Theorem 10.8(ii)). The resulting proof is a bit redundant in some respects, but hopefully it will look rather natural to the reader; also I think it is interesting to have a parallel presentation of the theorems by Rademacher and Alexandrov. Alberti and Ambrosio [11, Theorem 7.10] prove Alexandrov's theorem by a quite different technique, since they deduce it from Rademacher's theorem (in the form of the almost everywhere existence of the tangent plane to a Lipschitz graph) together with the area formula. Also they directly establish the differentiability of the gradient, and then deduce the existence of the Hessian; that is, they prove formulation (i) in Theorem 14.1 and then deduce (ii), while in the First Appendix I did it the other way round.

Lebesgue's density theorem can be found for instance in [331, p. 42]. The theorem according to which a nonincreasing function $\mathbb R \to \mathbb R$ is differentiable almost everywhere is a well-known result, which can be deduced as a corollary of [318, Theorems 7.2.4 and 7.2.7].

15 Otto calculus

Let $M$ be a Riemannian manifold, and let $P_2(M)$ be the associated Wasserstein space of order 2. Recall from Chapter 7 that $P_2(M)$ is a length space and that there is a nice representation formula for the Wasserstein distance $W_2$:
\[
W_2(\mu^0, \mu^1)^2 = \inf \int_0^1 \|\dot\mu_t\|^2_{\mu_t}\,dt, \qquad (15.1)
\]
where $\|\dot\mu\|_\mu$ is the norm of the infinitesimal variation $\dot\mu$ of the measure $\mu$, defined by
\[
\|\dot\mu\|^2_\mu = \inf\Bigl\{ \int |v|^2\,d\mu;\ \dot\mu + \nabla\cdot(v\mu) = 0 \Bigr\}.
\]

One of the reasons for the popularity of Riemannian geometry (as opposed to the study of more general metric structures) is that it allows for rather explicit computations. At the end of the nineties, Otto realized that some precious inspiration could be gained by performing computations of a Riemannian nature in the Wasserstein space. His motivations will be described later on; to make a long story short, he needed a good formalism to study certain diffusive partial differential equations which he knew could be considered as gradient flows in the Wasserstein space.

In this chapter, as in Otto's original papers, this problem will be considered from a purely formal point of view, and there will be no attempt at rigorous justification. So the problem is to set up rules for formally differentiating functions (i.e. functionals) on $P_2(M)$. To fix the ideas, and because this is an important example arising in many different contexts, I shall discuss only a certain class of functionals,

that involve (i) a function $V : M \to \mathbb R$, used to distort the reference volume measure; and (ii) a function $U : \mathbb R_+ \to \mathbb R$, twice differentiable (at least on $(0, +\infty)$), which will relate the values of the density of our probability measure and the value of the functional. So let
\[
\nu(dx) := e^{-V(x)}\,\mathrm{vol}(dx); \qquad
U_\nu(\mu) := \int_M U(\rho(x))\,d\nu(x), \quad \mu = \rho\,\nu. \qquad (15.2)
\]
So far the functional $U_\nu$ is only defined on the set of probability measures that are absolutely continuous with respect to $\nu$, or equivalently with respect to the volume measure, and I shall not go beyond that setting before Part III of these notes. If $\rho^0$ stands for the density of $\mu$ with respect to the plain volume, then obviously $\rho^0 = \rho\,e^{-V}$, so there is the alternative expression
\[
U_\nu(\mu) = \int_M U\bigl( \rho^0\,e^{V} \bigr)\,e^{-V}\,d\,\mathrm{vol}.
\]
One can think of $U$ as a constitutive law for the internal energy of a fluid: this is jargon to say that the energy "contained" in a fluid of density $\rho$ is given by the formula $\int U(\rho)$. The function $U$ should be a property of the fluid itself, and might reflect some microscopic interaction between the particles which constitute it; it is natural to assume $U(0) = 0$.

In the same thermodynamical analogy, one can also introduce the pressure law:
\[
p(\rho) = \rho\,U'(\rho) - U(\rho). \qquad (15.3)
\]
The physical interpretation is as follows: if the fluid is enclosed in a domain $\Omega$, then the pressure felt by the boundary $\partial\Omega$ at a point $x$ is normal and proportional to $p(\rho)$ at that point. (Recall that the pressure is defined, up to a sign, as the partial derivative of the internal energy with respect to the volume of the fluid.) So if you consider a homogeneous fluid of total mass 1, in a volume $V$, then its density is $\rho = 1/V$, the total energy is $V\,U(1/V)$, and the pressure should be $-(d/dV)\bigl[\, V\,U(1/V) \,\bigr] = p(1/V)$; this justifies formula (15.3).
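The thermodynamic justification of (15.3) can be checked numerically. The sketch below (illustrative, not from the book; the volume $V = 2$ and the step $h$ are arbitrary) compares the pressure $-\,dE/dV$ of a homogeneous fluid of unit mass, computed by a centered finite difference, with $p(1/V) = \rho\,U'(\rho) - U(\rho)$, for $U(\rho) = \rho\log\rho$ and $U(\rho) = (\rho^m - \rho)/(m-1)$:

```python
# Numerical check (illustrative, not from the book) of the identity behind
# (15.3): for a homogeneous fluid of mass 1 in volume V, the pressure
# -d/dV [ V U(1/V) ] equals p(1/V) with p(r) = r U'(r) - U(r).
import math

def check(U, Uprime, V=2.0, h=1e-6):
    E = lambda vol: vol * U(1.0 / vol)            # total internal energy
    pressure = -(E(V + h) - E(V - h)) / (2 * h)   # -dE/dV, centered difference
    r = 1.0 / V                                   # density
    return abs(pressure - (r * Uprime(r) - U(r)))

# U(r) = r log r        =>  p(r) = r
err1 = check(lambda r: r * math.log(r), lambda r: math.log(r) + 1.0)
# U(r) = (r^m - r)/(m-1), m = 3  =>  p(r) = r^m
m = 3.0
err2 = check(lambda r: (r**m - r) / (m - 1), lambda r: (m * r**(m-1) - 1) / (m - 1))
print(err1 < 1e-8 and err2 < 1e-8)  # True
```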
To the pressure $p(\rho)$ is associated a total pressure $\int p(\rho)\,d\nu$, and one can again consider the influence of small variations of volume on this functional; this leads to the definition of the iterated pressure:

\[
p_2(\rho) = \rho\,p'(\rho) - p(\rho). \qquad (15.4)
\]
Both the pressure and the iterated pressure will appear naturally when one differentiates the energy functional: the pressure for first-order derivatives, and the iterated pressure for second-order derivatives.

Example 15.1. Let $m \neq 1$, and
\[
U(\rho) = U^{(m)}(\rho) = \frac{\rho^m - \rho}{m-1};
\]
then
\[
p(\rho) = \rho^m, \qquad p_2(\rho) = (m-1)\,\rho^m.
\]
There is an important limit case as $m \to 1$:
\[
U^{(1)}(\rho) = \rho\log\rho;
\]
then
\[
p(\rho) = \rho, \qquad p_2(\rho) = 0.
\]
By the way, the linear part $-\rho/(m-1)$ in $U^{(m)}$ does not contribute to the pressure, but has the merit of displaying the link between $U^{(m)}$ and $U^{(1)}$.

Differential operators will also be useful. Let $\Delta$ be the Laplace operator on $M$; then the distortion of the volume element by the function $V$ leads to a natural second-order operator:
\[
L = \Delta - \nabla V\cdot\nabla. \qquad (15.5)
\]
Recall from Chapter 14 the expression of the carré du champ itéré associated with $L$:
\[
\Gamma_2(\psi) = L\,\frac{|\nabla\psi|^2}{2} - \nabla\psi\cdot\nabla(L\psi) \qquad (15.6)
\]
\[
= \|\nabla^2\psi\|_{HS}^2 + \bigl( \mathrm{Ric} + \nabla^2 V \bigr)(\nabla\psi); \qquad (15.7)
\]
the second equality is a consequence of Bochner's formula (14.28), as we shall briefly check. With respect to (14.28), there is an additional term in the left-hand side:
\[
-\nabla V\cdot\nabla\,\frac{|\nabla\psi|^2}{2} + \nabla\psi\cdot\nabla\bigl( \nabla V\cdot\nabla\psi \bigr)
= -\bigl\langle \nabla^2\psi\cdot\nabla V,\ \nabla\psi \bigr\rangle + \bigl\langle \nabla^2 V\cdot\nabla\psi,\ \nabla\psi \bigr\rangle + \bigl\langle \nabla^2\psi\cdot\nabla V,\ \nabla\psi \bigr\rangle
= \bigl\langle \nabla^2 V\cdot\nabla\psi,\ \nabla\psi \bigr\rangle,
\]

which is precisely the additional term in the right-hand side.

The next formula is the first important result in this chapter: it gives an "explicit" expression for the gradient of the functional $U_\nu$. For a given measure $\mu$, the gradient of $U_\nu$ at $\mu$ is a "tangent vector" at $\mu$ in the Wasserstein space, so this should be an infinitesimal variation of $\mu$.

Formula 15.2 (Gradient formula in Wasserstein space). Let $\mu$ be absolutely continuous with respect to $\nu$. Then, with the above notation,
\[
\mathrm{grad}_\mu\,U_\nu = -\nabla\cdot\bigl( \mu\,\nabla U'(\rho) \bigr) \qquad (15.8)
\]
\[
= -\nabla\cdot\bigl( e^{-V}\,\nabla p(\rho) \bigr)\,\mathrm{vol} \qquad (15.9)
\]
\[
= -\bigl( L\,p(\rho) \bigr)\,\nu. \qquad (15.10)
\]

Remark 15.3. The expression in the right-hand side of (15.8) is the divergence of a vector-valued measure; recall that $\nabla\cdot m$ is defined in the weak sense by its action on compactly supported smooth functions:
\[
\int \phi\,d(\nabla\cdot m) = -\int \nabla\phi\cdot(dm).
\]
On the other hand, the divergence in (15.9) is the divergence of a vector field. Note that $\nabla\cdot(\xi\,\mathrm{vol}) = (\nabla\cdot\xi)\,\mathrm{vol}$, so in (15.9) one could put the volume "inside the divergence". All three expressions in Formula 15.2 are interesting, the first one because it writes the "tangent vector" $\mathrm{grad}_\mu\,U_\nu$ in the normalized form $-\nabla\cdot(\mu\,\nabla\psi)$, with $\psi = U'(\rho)$; the second one because it gives the result as the divergence of a vector field; the third one because it is stated in terms of the infinitesimal variation of the density $\rho = d\mu/d\nu$.

Below are some important examples of application of Formula 15.2.

Example 15.4. Define the $H$-functional of Boltzmann (opposite of the entropy) by
\[
H(\mu) = \int_M \rho\log\rho\,d\,\mathrm{vol}.
\]
Then the second expression in Formula 15.2 yields
\[
\mathrm{grad}_\mu H = -\Delta\mu,
\]
which can be identified with the function $-\Delta\rho$. Thus the gradient of Boltzmann's entropy is the Laplace operator. This short statement is one of the first striking conclusions of Otto's formalism.
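One consequence of Example 15.4 that can be tested numerically: if $\mathrm{grad}_\mu H = -\Delta\mu$, then along the heat flow $\partial_t\rho = \Delta\rho$ (the gradient flow of $H$) one should have $dH/dt = -\|\mathrm{grad}_\mu H\|^2_\mu = -\int |\nabla\rho|^2/\rho$. The following sketch (mine, not the book's; grid size, time step and initial density are arbitrary choices) checks this on the flat circle with finite differences:

```python
# Illustrative check of the gradient-flow identity behind Example 15.4:
# along one explicit heat step on the periodic grid, the decay rate of
# H(ρ) = ∫ ρ log ρ matches -∫ |∇ρ|²/ρ (the squared Wasserstein norm of grad H).
import math

N = 1000
dx = 1.0 / N
rho = [1.0 + 0.5 * math.cos(2 * math.pi * i * dx) for i in range(N)]  # positive, smooth

def H(r):
    return sum(ri * math.log(ri) for ri in r) * dx

lap = [(rho[(i+1) % N] - 2*rho[i] + rho[(i-1) % N]) / dx**2 for i in range(N)]

dt = 1e-7
rho_new = [rho[i] + dt * lap[i] for i in range(N)]           # one explicit heat step

dHdt = (H(rho_new) - H(rho)) / dt                            # observed decay rate
grad = [(rho[(i+1) % N] - rho[(i-1) % N]) / (2 * dx) for i in range(N)]
fisher = sum(grad[i]**2 / rho[i] for i in range(N)) * dx     # ∫ |∇ρ|²/ρ

print(abs(dHdt + fisher) < 1e-3)  # True: dH/dt ≈ -∫ |∇ρ|²/ρ
```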

Example 15.5. Now consider a general $\nu = e^{-V}\,\mathrm{vol}$, write $\mu = \rho\,\nu = \rho^0\,\mathrm{vol}$, and define
\[
H_\nu(\mu) = \int_M \rho\log\rho\,d\nu = \int_M (\log\rho^0 + V)\,d\mu
\]
(this is the $H$-functional relative to the reference measure $\nu$). Then
\[
\mathrm{grad}_\mu H_\nu = -\bigl( \Delta\rho - \nabla V\cdot\nabla\rho \bigr)\,\nu = -(L\rho)\,\nu.
\]
In short, the gradient of the relative entropy is the distorted Laplace operator.

Example 15.6. To generalize Example 15.4 in another direction, consider
\[
H^{(m)}(\mu) = \int \frac{\rho^m - \rho}{m-1}\,d\,\mathrm{vol};
\]
then
\[
\mathrm{grad}_\mu H^{(m)} = -\bigl( \Delta\rho^m \bigr)\,\mathrm{vol}.
\]
More generally, if $\nu = e^{-V}\,\mathrm{vol}$, and $\rho$ is the density with respect to $\nu$,
\[
H^{(m)}_\nu(\mu) = \int \frac{\rho^m - \rho}{m-1}\,d\nu,
\]
then
\[
\mathrm{grad}_\mu H^{(m)}_\nu = -\nabla\cdot\bigl( e^{-V}\,\nabla\rho^m \bigr)\,\mathrm{vol} \qquad (15.11)
\]
\[
= -\bigl( L\,\rho^m \bigr)\,\nu. \qquad (15.12)
\]

The next formula is about second-order derivatives, or Hessians. Since the Hessian of $U_\nu$ at $\mu$ is a quadratic form on the tangent space $T_\mu P_2$, I shall write down its expression when evaluated on a tangent vector of the form $-\nabla\cdot(\mu\,\nabla\psi)$.

Formula 15.7 (Hessian formula in Wasserstein space). Let $\mu$ be absolutely continuous with respect to $\nu$, and let $\dot\mu = -\nabla\cdot(\mu\,\nabla\psi)$ be a tangent vector at $\mu$. Then, with the above notation,
\[
\mathrm{Hess}_\mu\,U_\nu\,(\dot\mu) = \int_M \Gamma_2(\psi)\,p(\rho)\,d\nu + \int_M (L\psi)^2\,p_2(\rho)\,d\nu \qquad (15.13)
\]
\[
= \int_M \Bigl[ \|\nabla^2\psi\|_{HS}^2 + (\mathrm{Ric} + \nabla^2 V)(\nabla\psi) \Bigr]\,p(\rho)\,d\nu
+ \int_M \bigl( \Delta\psi - \nabla V\cdot\nabla\psi \bigr)^2\,p_2(\rho)\,d\nu. \qquad (15.14)
\]

Remark 15.8. As expected, this is a quadratic expression in $\nabla\psi$ and its derivatives; and this expression does depend on the measure $\mu$.

Example 15.9. Applying the formula with $U(\rho) = (\rho^m - \rho)/(m-1)$, recalling that $\mu = \rho\,\nu$, one obtains
\[
\mathrm{Hess}_\mu\,H^{(m)}_\nu\,(\dot\mu) = \int_M \Bigl[ \|\nabla^2\psi\|_{HS}^2 + (m-1)\bigl( \Delta\psi - \nabla V\cdot\nabla\psi \bigr)^2 + (\mathrm{Ric} + \nabla^2 V)(\nabla\psi) \Bigr]\,\rho^{m-1}\,d\mu.
\]
In the limit case $m = 1$, which is $U(\rho) = \rho\log\rho$, this expression simplifies into
\[
\mathrm{Hess}_\mu\,H_\nu\,(\dot\mu) = \int_M \Bigl[ \|\nabla^2\psi\|_{HS}^2 + (\mathrm{Ric} + \nabla^2 V)(\nabla\psi) \Bigr]\,d\mu;
\]
or equivalently, with the notation of Chapter 14,
\[
\mathrm{Hess}_\mu\,H_\nu\,(\dot\mu) = \int_M \Bigl[ \|\nabla^2\psi\|_{HS}^2 + \mathrm{Ric}_{\infty,\nu}(\nabla\psi) \Bigr]\,d\mu.
\]

Formulas 15.2 and 15.7 will be justified only at a heuristic level. A rigorous proof would require many more definitions and much more apparatus, as well as regularity and decay assumptions on the measures and the functionals. So here I shall disregard all issues about integrability and regularity, which will be a huge simplification. Still, the proofs will not be completely trivial.

"Proof" of Formula 15.2. When the integration measure is not specified, it will be the volume rather than $\nu$. To understand the proof, it is important to make the distinction between a gradient and a differential. Let $\zeta$ be such that the tangent vector $\mathrm{grad}_\mu U_\nu$ can be represented as $-\nabla\cdot(\mu\,\nabla\zeta)$, and let $\partial_t\mu = -\nabla\cdot(\mu\,\nabla\psi)$ be an arbitrary "tangent vector". The infinitesimal variation of the density $\rho = d\mu/d\nu$ is given by
\[
\partial_t\rho = -e^{V}\,\nabla\cdot\bigl( \rho\,e^{-V}\,\nabla\psi \bigr).
\]
By direct computation and integration by parts, the infinitesimal variation of $U_\nu$ along that variation is equal to
\[
\int U'(\rho)\,\partial_t\rho\,d\nu = -\int U'(\rho)\,\nabla\cdot\bigl( \rho\,e^{-V}\,\nabla\psi \bigr)
= \int \nabla U'(\rho)\cdot\nabla\psi\,\rho\,e^{-V}
= \int \nabla U'(\rho)\cdot\nabla\psi\,d\mu.
\]

By definition of the gradient operator, this should coincide with
\[
\bigl\langle \mathrm{grad}_\mu U_\nu,\ \partial_t\mu \bigr\rangle = \int \nabla\zeta\cdot\nabla\psi\,d\mu.
\]
If this should hold true for all $\psi$, the only possible choice is that $\nabla\zeta = \nabla U'(\rho)$, at least $\mu$-almost everywhere. In any case $\zeta := U'(\rho)$ provides an admissible representation of $\mathrm{grad}_\mu U_\nu$. This proves formula (15.8). The other two formulas are obtained by noting that $p'(\rho) = \rho\,U''(\rho)$, and so
\[
\rho\,\nabla U'(\rho) = \rho\,U''(\rho)\,\nabla\rho = p'(\rho)\,\nabla\rho = \nabla p(\rho);
\]
therefore
\[
-\nabla\cdot\bigl( \mu\,\nabla U'(\rho) \bigr) = -\nabla\cdot\bigl( e^{-V}\,\rho\,\nabla U'(\rho) \bigr)\,\mathrm{vol}
= -\nabla\cdot\bigl( e^{-V}\,\nabla p(\rho) \bigr)\,\mathrm{vol}
= -\bigl( L\,p(\rho) \bigr)\,\nu. \qquad ⊓⊔
\]

For the second order (Formula 15.7), things are more intricate. The following identity will be helpful: if $\xi$ is a tangent vector at $x$ on a Riemannian manifold $M$, and $F$ is a function on $M$, then
\[
\mathrm{Hess}_x F\,(\xi) = \frac{d^2}{dt^2}\Bigr|_{t=0} F(\gamma(t)), \qquad (15.15)
\]
where $\gamma(t)$ is a geodesic starting from $\gamma(0) = x$ with velocity $\dot\gamma(0) = \xi$. To prove (15.15), it suffices to note that the first derivative of $F(\gamma(t))$ is $\dot\gamma(t)\cdot\nabla F(\gamma(t))$; so the second derivative is $(d/dt)(\dot\gamma(t))\cdot\nabla F(\gamma(t)) + \bigl\langle \nabla^2 F(\gamma(t))\cdot\dot\gamma(t),\ \dot\gamma(t) \bigr\rangle$, and the first term vanishes because a geodesic has zero acceleration.

"Proof" of Formula 15.7. The problem consists in differentiating $U_\nu(\mu_t)$ twice along a geodesic path of the form
\[
\partial_t\mu + \nabla\cdot(\mu\,\nabla\psi) = 0; \qquad \partial_t\psi + \frac{|\nabla\psi|^2}{2} = 0.
\]
The following integration by parts formula will be useful:
\[
\int \nabla f\cdot\nabla g\,d\nu = -\int (Lf)\,g\,d\nu. \qquad (15.16)
\]
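The weighted integration-by-parts formula (15.16) is also easy to verify numerically. The sketch below (mine, not the book's; the test functions $f$, $g$ and the potential $V$ are arbitrary smooth periodic choices) checks it on the circle, where $d\nu = e^{-V}\,dx$ and $L = \partial_x^2 - V'\,\partial_x$:

```python
# Sanity check of (15.16) on the periodic grid:  ∫ f' g' dν = -∫ (Lf) g dν,
# with dν = e^{-V} dx and L = d²/dx² - V' d/dx.  Illustrative, not from the book.
import math

N = 2000
dx = 1.0 / N
xs = [i * dx for i in range(N)]
f = [math.sin(2 * math.pi * xi) for xi in xs]
g = [math.cos(4 * math.pi * xi) for xi in xs]
V = [0.3 * math.sin(2 * math.pi * xi) for xi in xs]
w = [math.exp(-Vi) for Vi in V]                      # density of ν w.r.t. dx

def d1(u):   # centered first derivative, periodic
    return [(u[(i+1) % N] - u[(i-1) % N]) / (2 * dx) for i in range(N)]

def d2(u):   # centered second derivative, periodic
    return [(u[(i+1) % N] - 2*u[i] + u[(i-1) % N]) / dx**2 for i in range(N)]

fp, gp, Vp, fpp = d1(f), d1(g), d1(V), d2(f)
Lf = [fpp[i] - Vp[i] * fp[i] for i in range(N)]

lhs = sum(fp[i] * gp[i] * w[i] for i in range(N)) * dx
rhs = -sum(Lf[i] * g[i] * w[i] for i in range(N)) * dx
print(abs(lhs - rhs) < 1e-2)  # True
```

Note that without the weight $e^{-V}$ the two sides would differ by the $V'\,\partial_x$ term: the distorted operator $L$ is precisely the one that is symmetric in $L^2(\nu)$.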

From the proof of the gradient formula, one has, with the notation $\rho_t = d\mu_t/d\nu$,
\[
\frac{d\,U_\nu(\mu_t)}{dt} = \int_M \nabla U'(\rho_t)\cdot\nabla\psi_t\,\rho_t\,d\nu
= \int_M \nabla\psi_t\cdot\nabla p(\rho_t)\,d\nu
= -\int_M (L\psi_t)\,p(\rho_t)\,d\nu.
\]
It remains to differentiate again. To alleviate notation, I shall not write the time variable explicitly. So
\[
\frac{d^2\,U_\nu(\mu_t)}{dt^2} = -\int (L\,\partial_t\psi)\,p(\rho)\,d\nu - \int (L\psi)\,p'(\rho)\,\partial_t\rho\,d\nu \qquad (15.17)
\]
\[
= \int \Bigl( L\,\frac{|\nabla\psi|^2}{2} \Bigr)\,p(\rho)\,d\nu - \int (L\psi)\,p'(\rho)\,\partial_t\rho\,d\nu. \qquad (15.18)
\]
The last term in (15.18) can be rewritten as
\[
\int (L\psi)\,p'(\rho)\,\nabla\cdot(\mu\,\nabla\psi)
= -\int \nabla\bigl( (L\psi)\,p'(\rho) \bigr)\cdot\nabla\psi\,d\mu
\]
\[
= -\int \nabla\bigl( (L\psi)\,p'(\rho) \bigr)\cdot\nabla\psi\,\rho\,d\nu
\]
\[
= -\int \nabla(L\psi)\cdot\nabla\psi\,p'(\rho)\,\rho\,d\nu - \int (L\psi)\,p''(\rho)\,\rho\,\nabla\rho\cdot\nabla\psi\,d\nu
\]
\[
= -\int \nabla(L\psi)\cdot\nabla\psi\,\rho\,p'(\rho)\,d\nu - \int (L\psi)\,\nabla p_2(\rho)\cdot\nabla\psi\,d\nu. \qquad (15.19)
\]
The second term in (15.19) needs a bit of reworking: it can be recast as
\[
-\int (L\psi)\,\nabla p_2(\rho)\cdot\nabla\psi\,d\nu
= \int (L\psi)^2\,p_2(\rho)\,d\nu + \int \nabla(L\psi)\cdot\nabla\psi\,p_2(\rho)\,d\nu,
\]
where (15.16) has been used once more.

By collecting all these calculations,

\[
\frac{d^2\,U_\nu(\mu_t)}{dt^2} = \int \Bigl( L\,\frac{|\nabla\psi|^2}{2} \Bigr)\,p(\rho)\,d\nu + \int (L\psi)^2\,p_2(\rho)\,d\nu
+ \int \nabla(L\psi)\cdot\nabla\psi\,\bigl( p_2(\rho) - \rho\,p'(\rho) \bigr)\,d\nu.
\]
Since $p_2(\rho) - \rho\,p'(\rho) = -p(\rho)$, this transforms into
\[
\int \Bigl( L\,\frac{|\nabla\psi|^2}{2} - \nabla(L\psi)\cdot\nabla\psi \Bigr)\,p(\rho)\,d\nu + \int (L\psi)^2\,p_2(\rho)\,d\nu. \qquad (15.20)
\]
In view of (15.6)–(15.7), this establishes formula (15.13). ⊓⊔

Exercise 15.10. "Prove" that the gradient of an arbitrary functional $\mathcal F$ on $P_2(M)$ can be written
\[
\mathrm{grad}_\mu\,\mathcal F = -\nabla\cdot(\mu\,\nabla\phi), \qquad \phi = \frac{\delta\mathcal F}{\delta\mu},
\]
where $\delta\mathcal F/\delta\mu$ is a function defined by
\[
\frac{d}{dt}\,\mathcal F(\mu_t) = \int \Bigl( \frac{\delta\mathcal F}{\delta\mu} \Bigr)\,\partial_t\mu_t.
\]
Check that in the particular case
\[
\mathcal F(\mu) = \int_M F\bigl( x, \rho(x), \nabla\rho(x) \bigr)\,d\nu(x), \qquad (15.21)
\]
where $F = F(x,\rho,p)$ is a smooth function of $\rho \in \mathbb R_+$, $p \in T_xM$, one has
\[
\frac{\delta\mathcal F}{\delta\mu}(x) = (\partial_\rho F)\bigl( x, \rho(x), \nabla\rho(x) \bigr)
- \bigl( \nabla_x - \nabla V(x) \bigr)\cdot\Bigl( (\nabla_p F)\bigl( x, \rho(x), \nabla\rho(x) \bigr) \Bigr).
\]

The following two open problems (loosely formulated) are natural and interesting, and I don't know how difficult they are:

Open Problem 15.11. Find a nice formula for the Hessian of the functional $\mathcal F$ appearing in (15.21).

Open Problem 15.12. Find a nice formalism playing the role of the Otto calculus in the space $P_p(M)$, for $p \neq 2$. More generally, are there nice formal rules for taking derivatives along displacement interpolation, for general Lagrangian cost functions?

450 444 15 Otto calculus T o conclude this chapter, I shall come back to the subject of rig- orous justification of Otto’s formalism. At the time of writing, several theories have been developed, at least in the Euclidean setting (see the bibliographical notes); but they are rather heavy and not yet com- 1 pletely convincing. From the technical point of view, they are based on the natural strategy which consists in truncating and regularizing, then applying the arguments presented in this chapter, then passing to the limit. A quite different strategy, which I personally recommend, consists in translating all the Eulerian statements in the language of Lagrangian formalism. This is less appealing for intuition and calculations, but somehow easier to justify in the case of optimal transport. For in- stance, instead of the Hessian operator, one will only speak of the second derivative along geodesics in the Wasserstein space. This point of view will be developed in the next two chapters, and then a rigorous treatment will not be that painful. Still, in many situations the Eulerian point of view is better for intu- ition and for understanding, in particular in certain problems involving functional inequalities. The above discussion might be summarized by “Think Eulerian, prove Lagrangian” . This is a rather excep- the slogan tional situation from the point of view of fluid dynamics, where the standard would rather be “Think Lagrangian, prove Eulerian” (for in- stance, shocks are delicate to treat in a Lagrangian formalism). Once again, the point is that “there are no shocks” in optimal transport: as discussed in Chapter 8, trajectories do not meet until maybe at final time. Bibliographical notes Otto’s seminal paper [669] studied the formal Riemannian structure of the Wasserstein space, and gave applications to the study of the porous medium equation; I shall come back to this topic later. 
With all the preparations of Part I, the computations performed in this chapter may look rather natural, but they were a conceptual tour de force at the time of Otto's contribution, and had a strong impact on the research community. This work was partly inspired by the desire to

¹ I can afford this negative comment, since I myself participated in the story.

understand in depth a previous contribution by Jordan, Kinderlehrer and Otto [493].

Otto's computations were concerned with the case U(ρ) = ρ^m in ℝ^n. Then Otto and I considered U(ρ) = ρ log ρ on a manifold [671, Section 3]; we computed the Hessian by differentiating twice along geodesics in the Wasserstein space. (To my knowledge, this was the first published work where Ricci curvature appeared in relation to optimal transport.) Functionals of the form E(μ) = ∫∫_{ℝ^n×ℝ^n} W(x − y) μ(dx) μ(dy) were later studied by Carrillo, McCann and myself [213]. More recently, Lott and I [577, Appendix E] considered the functionals U_ν presented in this chapter (on a manifold M and with a reference measure e^{−V} vol).

In my previous book [814, Section 9.1], I already gave formulas for the gradient and Hessian of three basic types of functionals on P_2(ℝ^n), that I called internal energy, potential energy and interaction energy, and which can be written respectively (with obvious notation) as

∫ U(ρ(x)) dx;    ∫ V dμ;    (1/2) ∫∫ W(x − y) dμ(x) dμ(y).    (15.22)

A short presentation of the differential calculus in the Wasserstein space can be found in [814, Chapter 8]; other sources dealing with this subject, with some variations in the presentation, are [30, 203, 214, 671, 673].

Apart from computations of gradients and Hessians, little is known about Riemannian calculus in P_2(M). The following issues are natural (I am not sure how important they are, but at least they are natural):

• Is there a Jacobi equation in P_2(M), describing small variations of geodesic fields?
• Can one define Christoffel symbols, at least formally?
• Can one define a Laplace operator?
• Can one define a volume element? A divergence operator?
Recently, Lott [575] partly answered some of these questions by establishing formulas for the Riemannian connection and Riemannian curvature in the subset P^∞(M) of smooth positive densities, viewed as a subset of P_2(M), when M is compact. In a different direction, Gigli [415] gave a rigorous construction of a parallel transport along a curve in P_2(ℝ^n) for which ∫_0^1 ‖v_t‖_Lip dt < +∞. (See [29] for improved results.)

The problem whether there exists a natural probability measure P ("volume", or "Boltzmann–Gibbs measure") on P_2(M) is, I think, very relevant for applications in geometry or theoretical statistics. Von Renesse and Sturm [827] have managed to construct natural probability measures on P_2(S¹); these measures depend on a parameter β ("inverse temperature") and may be written heuristically as

dP_β(μ) = e^{−β H_ν(μ)} dvol(μ) / Z_β,    (15.23)

where ν is the reference measure on S¹, that is, the Lebesgue measure. Their construction strongly uses the one-dimensional assumption, and makes the link with the theory of "Poisson measures" used in nonparametric statistics. A particle approximation was studied in [38].

The point of view that was first advocated by Otto himself, and which I shall adopt in this course, is that the "Otto calculus" should primarily be considered a heuristic tool, and conclusions drawn by its use should then be checked by "direct" means. This might lack elegance, but it is much safer from the point of view of mathematical rigor. Some papers in which this strategy has been used with success are [577, 669, 671, 673, 761]. Recently, Calvez [197] used the Otto formalism to derive complex identities for chemotaxis models of Keller–Segel type, which would have been very difficult to guess otherwise.

In most of these works, rigorous justifications are done in Lagrangian formalism, or by methods which do not use transport at all. The work by Otto and Westdickenberg [673] is an interesting exception: there everything is attacked from an Eulerian perspective (using such tools as regularization of currents on manifolds); see [271] for an elaboration of these ideas, which applies even without smoothness.

All the references quoted above mainly deal with calculus in P_p(M) for p = 2. The case p ≠ 2 is much less well understood; as noticed in [30, p. 10], P_p(M) can be seen as a kind of Finsler structure, and there are also rules to compute derivatives in that space, at least to first order. The most general results to this date are in [30].

A better understood generalization treats the case when geodesics in P_2(M) are replaced by action-minimizing curves, for some Lagrangian action like those considered in Chapter 7; the adaptation of Otto calculus to this situation was worked out by Lott [576], with applications to the study of the Ricci flow.

Let me conclude with some remarks about the functionals considered in this chapter. Functionals of the form (15.22) appear everywhere in mathematical physics to model all kinds of energies. It would be foolish to try to make a list.

The interpretation of p(ρ) = ρ U′(ρ) − U(ρ) as a pressure associated to the constitutive law U is well-known in thermodynamics, and was explained to me by McCann; the discussion in the present chapter is slightly expanded in [814, Remarks 5.18].

The functional H_ν(μ) = ∫ ρ log ρ dν (μ = ρν) is well-known in statistical physics, where it was introduced by Boltzmann [141]. In Boltzmann's theory of gases, H_ν is identified with the negative of the entropy. It would take a whole book to review the meaning of entropy in thermodynamics and statistical mechanics (see, e.g., [812] for its use in kinetic theory). I should also mention that the functional H_ν coincides with the Kullback information in statistics, and it appears in Shannon's theory of information as an optimal compression rate [747], and in Sanov's theorem as the rate function for large deviations of the empirical measure of independent samples [302, Chapter 3] [296, Theorem 6.2.10].

An interesting example of a functional of the form (15.21) that was considered in relation with optimal transport is the so-called Fisher information,

I(μ) = ∫ (|∇ρ|²/ρ) dν;

see [30, Example 11.1.10] and references there provided. We shall encounter this functional again later.
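Both functionals just mentioned can be evaluated in closed form in the simplest case. For a centered Gaussian density ρ on the real line with standard deviation σ (and ν the Lebesgue measure), classical computations give H(μ) = ∫ ρ log ρ = −(1/2) log(2πσ²) − 1/2 and I(μ) = ∫ |ρ′|²/ρ = 1/σ². The following Python snippet (a numerical sanity check added here, not part of the original text) confirms both values by quadrature:

```python
import numpy as np

# Centered Gaussian density with standard deviation sigma
sigma = 0.7
x = np.linspace(-12, 12, 200001)
dx = x[1] - x[0]
rho = np.exp(-x**2 / (2 * sigma**2)) / np.sqrt(2 * np.pi * sigma**2)
drho = -x / sigma**2 * rho          # rho'(x), in closed form

H = np.sum(rho * np.log(rho)) * dx  # Boltzmann H functional
I = np.sum(drho**2 / rho) * dx      # Fisher information

# Closed forms: H = -log(2 pi sigma^2)/2 - 1/2,  I = 1/sigma^2
assert abs(H - (-0.5 * np.log(2 * np.pi * sigma**2) - 0.5)) < 1e-6
assert abs(I - 1 / sigma**2) < 1e-6
```

The truncation at |x| = 12 is harmless here since the integrands decay like a Gaussian.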


16 Displacement convexity I

Convexity plays a prominent role in analysis in general. It is most generally used in a vector space V: A function F : V → ℝ ∪ {+∞} is said to be convex if

∀ x, y ∈ V,  ∀ t ∈ [0,1],   F((1−t) x + t y) ≤ (1−t) F(x) + t F(y).    (16.1)

But convexity is also a metric notion: In short, convexity in a metric space means convexity along geodesics. Consequently, geodesic spaces are a natural setting in which to define convexity:

Definition 16.1 (Convexity in a geodesic space). Let (X, d) be a complete geodesic space. Then a function F : X → ℝ ∪ {+∞} is said to be geodesically convex, or just convex, if for any constant-speed geodesic path (γ_t)_{0≤t≤1} valued in X,

∀ t ∈ [0,1],   F(γ_t) ≤ (1−t) F(γ_0) + t F(γ_1).    (16.2)

It is said to be weakly convex if for any x_0, x_1 in X there exists at least one constant-speed geodesic path (γ_t)_{0≤t≤1} with γ_0 = x_0, γ_1 = x_1, such that inequality (16.2) holds true.

It is a natural problem to identify functionals that are convex on the Wasserstein space. In his 1994 PhD thesis, McCann established and used the convexity of certain functionals on P_2(ℝ^n) to prove the uniqueness of their minimizers. Since then, his results have been generalized; yet almost all examples which have been treated so far belong to the general class

F(μ) = ∫_{X^k} I(x_1, …, x_k) dμ(x_1) ⋯ dμ(x_k) + ∫_X U(dμ/dν) dν,

where I(x_1, …, x_k) is a certain "k-particle interaction potential", U a nice function ℝ_+ → ℝ, and ν is a reference measure.

In this and the next chapter I shall consider the convexity problem on a general Riemannian manifold M, in the case I = 0, so the functionals under study will be the functionals U_ν defined by

U_ν(μ) = ∫_M U(ρ) dν,   μ = ρν.    (16.3)

As a start, I shall give some reminders about the notion of convexity and its refinements; then I shall make these notions more explicit in the case of the Wasserstein space P_2(M). In the last section of this chapter I shall use Otto's calculus to guess sufficient conditions under which U_ν satisfies some interesting convexity properties (Guesses 16.6 and 16.7).

Let the reader not be offended if I strongly insist that convexity in the metric space P_2(M) has nothing to do with the convex structure of the space of probability measures. The former concept will be called "convexity along optimal transport" or displacement convexity.

Reminders on convexity: differential and integral conditions

The material in this section has nothing to do with optimal transport, and is, for the most part, rather standard.

It is well-known that a function F : ℝ^n → ℝ is convex, in the sense of (16.1), if and only if it satisfies

∇²F ≥ 0    (16.4)

(nonnegative Hessian) on ℝ^n. The latter inequality should generally be understood in distribution sense, but let me just forget about this subtlety which is not essential here.

The inequality (16.4) is a differential condition, in contrast with the "integral" condition (16.1). There is a more general principle relating a lower bound on the Hessian (differential condition) to a convexity-type inequality (integral condition). It can be stated in terms of the one-dimensional Green function (of the Laplace operator with Dirichlet boundary conditions).
That Green function is the nonnegative kernel G(s,t) such that for all functions φ ∈ C([0,1]; ℝ) ∩ C²((0,1); ℝ),

φ(t) = (1−t) φ(0) + t φ(1) − ∫_0^1 φ̈(s) G(s,t) ds.    (16.5)

It is easy to give an explicit expression for G (see Figure 16.1):

G(s,t) = { s(1−t)  if s ≤ t;   t(1−s)  if s ≥ t. }    (16.6)

Then formula (16.5) actually extends to arbitrary continuous functions φ on [0,1], provided that φ̈ (taken in a distribution sense) is bounded below by a real number.

[Figure] Fig. 16.1. The Green function G(s,t) as a function of s.

The next statement provides the equivalence between several differential and integral convexity conditions in a rather general setting.

Proposition 16.2 (Convexity and lower Hessian bounds). Let (M, g) be a Riemannian manifold, and let Λ = Λ(x, v) be a continuous quadratic form on TM; that is, for any x, Λ(x, ·) is a quadratic form in v, and it depends continuously on x. Assume that for any constant-speed geodesic γ : [0,1] → M,

λ[γ] := inf_{0≤t≤1} Λ(γ_t, γ̇_t) / |γ̇_t|² > −∞.    (16.7)

Then, for any function F ∈ C²(M), the following statements are equivalent:

(i) ∇²F ≥ Λ;

(ii) For any constant-speed, minimizing geodesic γ : [0,1] → M,

F(γ_t) ≤ (1−t) F(γ_0) + t F(γ_1) − ∫_0^1 Λ(γ_s, γ̇_s) G(s,t) ds;

(iii) For any constant-speed, minimizing geodesic γ : [0,1] → M,

F(γ_1) ≥ F(γ_0) + ⟨∇F(γ_0), γ̇_0⟩ + ∫_0^1 Λ(γ_t, γ̇_t) (1−t) dt;

(iv) For any constant-speed, minimizing geodesic γ : [0,1] → M,

⟨∇F(γ_1), γ̇_1⟩ − ⟨∇F(γ_0), γ̇_0⟩ ≥ ∫_0^1 Λ(γ_t, γ̇_t) dt.

When these properties are satisfied, F is said to be Λ-convex. The equivalence is still preserved if conditions (ii), (iii) and (iv) are respectively replaced by the following a priori weaker conditions:

(ii') For any constant-speed, minimizing geodesic γ : [0,1] → M,

F(γ_t) ≤ (1−t) F(γ_0) + t F(γ_1) − λ[γ] ( t(1−t)/2 ) d(γ_0, γ_1)²;

(iii') For any constant-speed, minimizing geodesic γ : [0,1] → M,

F(γ_1) ≥ F(γ_0) + ⟨∇F(γ_0), γ̇_0⟩ + λ[γ] d(γ_0, γ_1)²/2;

(iv') For any constant-speed, minimizing geodesic γ : [0,1] → M,

⟨∇F(γ_1), γ̇_1⟩ − ⟨∇F(γ_0), γ̇_0⟩ ≥ λ[γ] d(γ_0, γ_1)².

Remark 16.3. In the particular case when Λ is equal to λg for some constant λ ∈ ℝ, Property (ii) reduces to Property (ii') with λ[γ] = λ. Indeed, since γ has constant speed,

F(γ_t) ≤ (1−t) F(γ_0) + t F(γ_1) − λ ∫_0^1 G(s,t) g_{γ_s}(γ̇_s, γ̇_s) ds
      = (1−t) F(γ_0) + t F(γ_1) − λ d(γ_0, γ_1)² ∫_0^1 G(s,t) ds.

By plugging the function φ(t) = t² into (16.5) one sees that ∫_0^1 G(s,t) ds = t(1−t)/2. So (ii) indeed reduces to

F(γ_t) ≤ (1−t) F(γ_0) + t F(γ_1) − λ ( t(1−t)/2 ) d(γ_0, γ_1)².    (16.8)
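The explicit formula (16.6) and the value ∫_0^1 G(s,t) ds = t(1−t)/2 used in Remark 16.3 are easy to confirm numerically. The following Python sketch (an added illustration, not part of the original text) checks the representation (16.5) on the test function φ(s) = s³, for which φ̈(s) = 6s, φ(0) = 0 and φ(1) = 1:

```python
import numpy as np

def G(s, t):
    # The Green function (16.6): s(1-t) for s <= t, t(1-s) for s >= t
    return np.where(s <= t, s * (1.0 - t), t * (1.0 - s))

def integrate(f, s):
    # trapezoidal rule on the grid s
    return float(np.sum((f[:-1] + f[1:]) / 2) * (s[1] - s[0]))

s = np.linspace(0.0, 1.0, 200001)

for t in (0.2, 0.5, 0.9):
    # identity (16.5) for phi(s) = s^3:
    # phi(t) = (1-t) phi(0) + t phi(1) - int_0^1 phi''(s) G(s,t) ds
    rhs = (1 - t) * 0.0 + t * 1.0 - integrate(6 * s * G(s, t), s)
    assert abs(rhs - t**3) < 1e-8
    # the value used in Remark 16.3: int_0^1 G(s,t) ds = t(1-t)/2
    assert abs(integrate(G(s, t), s) - t * (1 - t) / 2) < 1e-8
```

Since G is piecewise linear in s, the trapezoidal rule is essentially exact here except near the kink at s = t, which makes the check very accurate on a fine grid.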

Definition 16.4 (Λ-convexity). Let M be a Riemannian manifold, and let Λ = Λ(x, v) be a continuous quadratic form on TM, satisfying (16.7). A function F : M → ℝ ∪ {+∞} is said to be Λ-convex if Property (ii) in Proposition 16.2 is satisfied. In the case when Λ = λg, λ ∈ ℝ, F will be said to be λ-convex; this means that inequality (16.8) is satisfied. In particular, 0-convexity is just plain convexity.

Proof of Proposition 16.2. The arguments in this proof will come again several times in the sequel, in various contexts.

Assume that (i) holds true. Consider x_0 and x_1 in M, and introduce a constant-speed minimizing geodesic γ joining γ_0 = x_0 to γ_1 = x_1. Then

(d²/dt²) F(γ_t) = ⟨∇²F(γ_t) · γ̇_t, γ̇_t⟩ ≥ Λ(γ_t, γ̇_t).

Then Property (ii) follows from identity (16.5) with φ(t) := F(γ_t).

As for Property (iii), it can be established either by dividing the inequality in (ii) by t > 0, and then letting t → 0, or directly from (i) by using the Taylor formula at order 2 with φ(t) = F(γ_t) again. Indeed, φ̈(t) ≥ Λ(γ_t, γ̇_t), while φ̇(0) = ⟨∇F(γ_0), γ̇_0⟩.

To go from (iii) to (iv), replace the geodesic (γ_t) by the geodesic (γ_{1−t}), to get

F(γ_0) ≥ F(γ_1) − ⟨∇F(γ_1), γ̇_1⟩ + ∫_0^1 Λ(γ_{1−t}, γ̇_{1−t}) (1−t) dt.

After changing variables in the last integral, this is

F(γ_0) ≥ F(γ_1) − ⟨∇F(γ_1), γ̇_1⟩ + ∫_0^1 Λ(γ_t, γ̇_t) t dt,

and by adding up (iii), one gets Property (iv).

So far we have seen that (i) ⇒ (ii) ⇒ (iii) ⇒ (iv). To complete the proof of equivalence it is sufficient to check that (iv') implies (i). So assume (iv'). From the identity

⟨∇F(γ_1), γ̇_1⟩ − ⟨∇F(γ_0), γ̇_0⟩ = ∫_0^1 ∇²F(γ_t)(γ̇_t) dt,

and (iv'), one deduces that for all geodesic paths γ,

λ[γ] d(γ_0, γ_1)² ≤ ∫_0^1 ∇²F(γ_t)(γ̇_t) dt.    (16.9)

Choose (x_0, v_0) in TM, with v_0 ≠ 0, and γ_t = exp_{x_0}(ε t v_0), ε > 0, where γ depends implicitly on ε. Note that d(γ_0, γ_1) = ε |v_0|. Write

λ[γ] ( d(γ_0, γ_1)² / ε² ) ≤ ∫_0^1 ∇²F(γ_t)( γ̇_t/ε ) dt.    (16.10)

As ε → 0, (γ_t, γ̇_t) ≃ (x_0, ε v_0) in TM, so

λ[γ] = inf_{0≤t≤1} Λ(γ_t, γ̇_t)/|γ̇_t|² = inf_{0≤t≤1} Λ(γ_t, γ̇_t/ε)/|γ̇_t/ε|² → Λ(x_0, v_0)/|v_0|²  as ε → 0.

Thus the left-hand side of (16.10) converges to Λ(x_0, v_0). On the other hand, since ∇²F is continuous, the right-hand side obviously converges to ∇²F(x_0)(v_0). Property (i) follows. ⊓⊔

Displacement convexity

I shall now discuss convexity in the context of optimal transport, replacing the manifold M of the previous section by the geodesic space P_2(M). For the moment I shall only consider measures that are absolutely continuous with respect to the volume on M, and denote by P_2^ac(M) the space of such measures. It makes sense to study convexity in P_2^ac(M) because this is a geodesically convex subset of P_2(M): By Theorem 8.7, a displacement interpolation between any two absolutely continuous measures is itself absolutely continuous. (Singular measures will be considered later, together with singular metric spaces, in Part III.)

So let μ_0 and μ_1 be two probability measures on M, absolutely continuous with respect to the volume element, and let (μ_t)_{0≤t≤1} be the displacement interpolation between μ_0 and μ_1. Recall from Chapter 13 that this displacement interpolation is uniquely defined, and characterized by the formulas

μ_t = (T_t)_# μ_0,   T_t(x) = exp_x( t ∇̃ψ(x) ),    (16.11)

where ψ is d²/2-convex. (Forget about the ̃ symbol if you don't like it.) Moreover, T_t is injective for t < 1; so for any t < 1 it makes sense to define the velocity field v(t,x) on T_t(M) by

v(t, T_t(x)) = (d/dt) T_t(x),

and one also has

v(t, T_t(x)) = ∇̃ψ_t(T_t(x)),

where ψ_t is a solution at time t of the quadratic Hamilton–Jacobi equation with initial datum ψ_0 = ψ.

The next definition adapts the notions of convexity, λ-convexity and Λ-convexity to the setting of optimal transport. Below, λ is a real number that might be nonnegative or nonpositive, while Λ = Λ(μ, v) defines for each probability measure μ a quadratic form on vector fields v : M → TM.

Definition 16.5 (Displacement convexity). With the above notation, a functional F : P_2^ac(M) → ℝ ∪ {+∞} is said to be:

• displacement convex if, whenever (μ_t)_{0≤t≤1} is a (constant-speed, minimizing) geodesic in P_2^ac(M),

∀ t ∈ [0,1],   F(μ_t) ≤ (1−t) F(μ_0) + t F(μ_1);

• λ-displacement convex if, whenever (μ_t)_{0≤t≤1} is a (constant-speed, minimizing) geodesic in P_2^ac(M),

∀ t ∈ [0,1],   F(μ_t) ≤ (1−t) F(μ_0) + t F(μ_1) − λ ( t(1−t)/2 ) W_2(μ_0, μ_1)²;

• Λ-displacement convex if, whenever (μ_t)_{0≤t≤1} is a (constant-speed, minimizing) geodesic in P_2^ac(M), and (ψ_t)_{0≤t≤1} is an associated solution of the Hamilton–Jacobi equation,

∀ t ∈ [0,1],   F(μ_t) ≤ (1−t) F(μ_0) + t F(μ_1) − ∫_0^1 Λ(μ_s, ∇̃ψ_s) G(s,t) ds,

where G(s,t) is the one-dimensional Green function of (16.6). (It is assumed that Λ(μ_s, ∇̃ψ_s) G(s,t) is bounded below by an integrable function of s ∈ [0,1].)

Of course these definitions are more and more general: Λ-displacement convexity reduces to λ-displacement convexity when Λ(μ, v) = λ ‖v‖²_{L²(μ)}; and this in turn reduces to plain displacement convexity when λ = 0.
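To see Definition 16.5 at work in the simplest possible setting, one can take M = ℝ (where Ric = 0, so the functional H(μ) = ∫ ρ log ρ is expected to be displacement convex) and interpolate between two Gaussian measures. It is classical that in one dimension the optimal map from N(m_0, σ_0²) to N(m_1, σ_1²) is the affine map T(x) = m_1 + (σ_1/σ_0)(x − m_0), so the displacement interpolant μ_t is Gaussian with standard deviation (1−t)σ_0 + tσ_1, and H(μ_t) = −(1/2) log(2πe σ_t²). The following Python snippet (an added illustration, not part of the original text) checks the displacement convexity inequality along this geodesic:

```python
import numpy as np

# Displacement interpolation between N(m0, s0^2) and N(m1, s1^2) on the line:
# the optimal map is affine, so mu_t is Gaussian with standard deviation
# sigma_t = (1-t) s0 + t s1 (the means play no role in H).
m0, s0, m1, s1 = -1.0, 0.5, 2.0, 3.0

def H(t):
    # H(mu_t) = int rho_t log rho_t = -1/2 log(2 pi e sigma_t^2)
    sigma_t = (1 - t) * s0 + t * s1
    return -0.5 * np.log(2 * np.pi * np.e * sigma_t**2)

ts = np.linspace(0.0, 1.0, 101)
vals = np.array([H(t) for t in ts])

# displacement convexity: H(mu_t) <= (1-t) H(mu_0) + t H(mu_1)
chord = (1 - ts) * vals[0] + ts * vals[-1]
assert np.all(vals <= chord + 1e-12)
```

Here the convexity is visible by hand as well: t ↦ −log((1−t)σ_0 + tσ_1) is convex because −log is convex and the argument is affine in t.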

Displacement convexity from curvature-dimension bounds

The question is whether the previously defined concepts apply to functionals of the form U_ν, as in (16.3). Of course Proposition 16.2 does not apply, because neither P_2^ac(M) nor P_2(M) are smooth manifolds. However, if one believes in Otto's formalism, then one can hope that displacement convexity, λ-displacement convexity, Λ-displacement convexity of U_ν would be respectively equivalent to

Hess_μ U_ν ≥ 0,   Hess_μ U_ν ≥ λ,   Hess_μ U_ν (μ̇) ≥ Λ(μ, μ̇),    (16.12)

where Hess_μ U_ν stands for the formal Hessian of U_ν at μ (which was computed in Chapter 15), λ is shorthand for λ ‖·‖²_{L²(μ)}, and μ̇ is identified with ∇ψ via the usual continuity equation

μ̇ + ∇·(∇ψ μ) = 0.

Let us try to identify simple sufficient conditions on the manifold M, the reference measure ν and the energy function U, for (16.12) to hold. This quest is, for the moment, just formal; it will be checked later, without any reference to Otto's formalism, that our guess is correct.

To identify conditions for displacement convexity I shall use again the formalism of Chapter 14. Equip the Riemannian manifold M with a reference measure ν = e^{−V} vol, where V is a smooth function on M, and assume that the resulting space satisfies the curvature-dimension bound CD(K,N), as in Theorem 14.8, for some N ∈ [1, ∞] and K ∈ ℝ. Everywhere in the sequel, ρ will stand for the density of μ with respect to ν.

Consider a continuous function U : ℝ_+ → ℝ. I shall assume that U is convex and U(0) = 0. The latter condition is quite natural from a physical point of view (no matter ⇒ no energy). The convexity assumption might seem more artificial, and to justify it I will argue that (i) the convexity of U is necessary for U_ν to be lower semicontinuous with respect to the weak topology induced by the metric W_2; (ii) if one imposes the nonnegativity of the pressure p(r) = r U′(r) − U(r), which is natural from the physical point of view, then conditions for displacement convexity will be in the end quite more stringent than just convexity of U; (iii) the convexity of U automatically implies the nonnegativity of the pressure, since p(r) = r U′(r) − U(r) + U(0) ≥ 0. For simplicity I shall also impose that U is twice continuously differentiable

in (0, +∞). Finally, I shall assume that ψ in (16.11) is C² everywhere, and I shall avoid the discussion about the domain of definition of U_ν by just considering compactly supported probability measures.

Then, from (15.13) and (14.51),

Hess_μ U_ν (μ̇) = ∫_M p₂(ρ) (Lψ)² dν + ∫_M p(ρ) Γ₂(ψ) dν    (16.13)
    ≥ ∫_M [ p₂(ρ) + p(ρ)/N ] (Lψ)² dν + ∫_M p(ρ) Ric_{N,ν}(∇ψ) dν    (16.14)
    ≥ ∫_M [ p₂(ρ) + p(ρ)/N ] (Lψ)² dν + K ∫_M p(ρ) |∇ψ|² dν.    (16.15)

To get a bound on this expression, it is natural to assume that

p₂ + p/N ≥ 0.    (16.16)

The set of all functions U for which (16.16) is satisfied will be called the displacement convexity class of dimension N and denoted by DC_N. A typical representative of DC_N, for which (16.16) holds as an equality, is U = U_N defined by

U_N(ρ) = −N ( ρ^{1−1/N} − ρ )   (1 < N < ∞);
U_N(ρ) = ρ log ρ   (N = ∞).    (16.17)

These functions will come back again and again in the sequel, and the associated functionals will be denoted by H_{N,ν}.

If inequality (16.16) holds true, then Hess_μ U_ν ≥ K Λ_U, where

Λ_U(μ, μ̇) = ∫_M p(ρ) |∇ψ|² dν.    (16.18)

So the conclusion is as follows:

Guess 16.6. Let M be a Riemannian manifold satisfying a curvature-dimension bound CD(K,N) for some K ∈ ℝ, N ∈ (1, ∞], and let U satisfy (16.16); then U_ν is KΛ_U-displacement convex.
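That U_N in (16.17) realizes equality in (16.16) is a short computation: p(ρ) = ρ^{1−1/N} and p₂(ρ) = −ρ^{1−1/N}/N, so p₂ + p/N = 0; and for N = ∞, p(ρ) = ρ, p₂(ρ) = 0. A symbolic double check with the Python library sympy (an added illustration, not part of the original text):

```python
import sympy as sp

rho, N = sp.symbols("rho N", positive=True)

# The reference nonlinearity (16.17) for 1 < N < infinity
U = -N * (rho**(1 - 1/N) - rho)
p = rho * U.diff(rho) - U        # pressure
p2 = rho * p.diff(rho) - p       # iterated pressure

assert (p - rho**(1 - 1/N)).equals(0)   # p(rho) = rho^(1 - 1/N)
assert (p2 + p / N).equals(0)           # equality in (16.16)

# The case N = infinity: U(rho) = rho log rho gives p = rho, p2 = 0
Ui = rho * sp.log(rho)
pi = rho * Ui.diff(rho) - Ui
assert (pi - rho).equals(0)
assert (rho * pi.diff(rho) - pi).equals(0)
```

The same few lines can be reused to test whether any other candidate nonlinearity satisfies the defining inequality (16.16) of DC_N.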

Actually, there should be an equivalence between the two statements in Guess 16.6. To see this, assume that U_ν is KΛ_U-displacement convex; pick up an arbitrary point x_0 ∈ M, and a tangent vector v_0 ∈ T_{x_0}M; consider the particular function U = U_N, a probability measure μ which is very much concentrated close to x_0, and a function ψ such that ∇ψ(x_0) = v_0 and Γ₂(ψ)(x_0) = Ric_{N,ν}(v_0) + (Lψ(x_0))²/N (as in the proof of Theorem 14.8). Then, on the one hand,

K Λ_U(μ, μ̇) = K ∫ ρ^{1−1/N} |∇ψ|² dν ≃ K |v_0|² ∫ ρ^{1−1/N} dν;    (16.19)

on the other hand, by the choice of U,

Hess_μ U_ν (μ̇) = ∫ ρ^{1−1/N} [ Γ₂(ψ) − (Lψ)²/N ] dν,

but then since μ is concentrated around x_0, this is well approximated by

∫ ρ^{1−1/N} [ Γ₂(ψ)(x_0) − (Lψ(x_0))²/N ] dν = Ric_{N,ν}(v_0) ∫ ρ^{1−1/N} dν.

Comparing the latter expression with (16.19) shows that Ric_{N,ν}(v_0) ≥ K |v_0|². Since x_0 and v_0 were arbitrary, this implies Ric_{N,ν} ≥ K. Note that this reasoning only used the functional H_{N,ν} = (U_N)_ν, and probability measures that are very concentrated around a given point. This heuristic discussion is summarized in the following:

Guess 16.7. If, for each x_0 ∈ M, H_{N,ν} is KΛ_U-displacement convex when applied to probability measures that are supported in a small neighborhood of x_0, then M satisfies the curvature-dimension bound CD(K,N).

Example 16.8. Condition CD(0, ∞) with ν = vol just means Ric ≥ 0, and the statement U ∈ DC_∞ just means that the iterated pressure p₂ is nonnegative. The typical example is when U(ρ) = ρ log ρ, and then the corresponding functional is

H(μ) = ∫ ρ log ρ dvol,   μ = ρ vol.

Then the above considerations suggest that the following statements are equivalent:

(i) Ric ≥ 0;

(ii) If the nonlinearity U is such that the iterated pressure p₂ is nonnegative, then the functional U_vol is displacement convex;

(iii) H is displacement convex;

(iii') For any x_0 ∈ M, the functional H is displacement convex when applied to probability measures that are supported in a small neighborhood of x_0.

Example 16.9. The above considerations also suggest that the inequality Ric ≥ Kg is equivalent to the K-displacement convexity of the functional H, whatever the value of K ∈ ℝ.

These guesses will be proven and generalized in the next chapter.

A fluid mechanics feeling for Ricci curvature

Ricci curvature is familiar to physicists because it plays a crucial role in Einstein's theory of general relativity. But what we have been discovering in this chapter is that Ricci curvature can also be given a physical interpretation in terms of classical fluid mechanics. To provide the reader with a better feeling of this new point of view, let us imagine how two physicists, the first one used to relativity and light propagation, the second one used to fluid mechanics, would answer the following question: Describe in an informal way an experiment that can determine whether we live in a nonnegatively Ricci-curved space.

The light source test: Take a small light source, and try to determine its volume by looking at it from a distant position. If you systematically overestimate the volume of the light source, then you live in a nonnegatively curved space (recall Figure 14.4).

The lazy gas experiment: Take a perfect gas in which particles do not interact, and ask him to move from a certain prescribed density field at time t = 0, to another prescribed density field at time t = 1. Since the gas is lazy, he will find a way to do so that needs a minimal amount of work (least action path).
Measure the entropy of the gas at each time, and check that it always lies above the line joining the final and initial entropies. If such is the case, then we know that we live in a nonnegatively curved space (see Figure 16.2).

[Figure] Fig. 16.2. The lazy gas experiment (panels at t = 0, t = 1/2, t = 1; entropy S = −∫ρ log ρ): To go from state 0 to state 1, the lazy gas uses a path of least action. In a nonnegatively curved world, the trajectories of the particles first diverge, then converge, so that at intermediate times the gas can afford to have a lower density (higher entropy).

Bibliographical notes

Convexity has been extensively studied in the Euclidean space [705] and in Banach spaces [172, 324]. I am not aware of textbooks where the study of convexity in more general geodesic spaces is developed, although this notion is now quite frequently used (in the context of optimal transport, see, e.g., [30, p. 50]).

The concept and terminology of displacement convexity were introduced by McCann in the mid-nineties [614]. He identified (16.16) as the basic criterion for convexity in P_2(ℝ^n), and also discussed other formulations of this condition, which will be studied in the next chapter. Inequality (16.16) was later rediscovered by several authors, in various contexts.

The application of Otto calculus to the study of displacement convexity goes back to [669] and [671]. In the latter reference it was conjectured that nonnegative Ricci curvature would imply displacement convexity of H.

Ricci curvature appears explicitly in Einstein's equations, and will be encountered in any mildly advanced book on general relativity.

Fluid mechanics analogies for curvature appear explicitly in the work by Cordero-Erausquin, McCann and Schmuckenschläger [246].

Lott [576] recently pointed out some interesting properties of displacement convexity for functionals explicitly depending on t ∈ [0,1]; for instance the convexity of t ↦ ∫ ρ_t log ρ_t dν + N t log t along displacement interpolation is a characterization of CD(0, N). The Otto formalism is also useful here.


17 Displacement convexity II

In Chapter 16, a conjecture was formulated about the links between displacement convexity and curvature-dimension bounds; its plausibility was justified by some formal computations based on Otto's calculus. In the present chapter I shall provide a rigorous justification of this conjecture. For this I shall use a Lagrangian point of view, in contrast with the Eulerian approach used in the previous chapter. Not only is the Lagrangian formalism easier to justify, but it will also lead to new curvature-dimension criteria based on so-called "distorted displacement convexity".

The main results in this chapter are Theorems 17.15 and 17.37.

Displacement convexity classes

What I shall call displacement convexity class of order N is a family of convex nonlinearities satisfying a certain characteristic differential inequality of second order (recall (16.16)).

Definition 17.1 (Displacement convexity classes). Let N be a real parameter in [1, ∞]. The class DC_N is defined as the set of continuous convex functions U : ℝ_+ → ℝ, twice continuously differentiable on (0, +∞), such that U(0) = 0, and, with the notation

p(r) = r U′(r) − U(r),   p₂(r) = r p′(r) − p(r),

U satisfies any one of the following equivalent differential conditions:

(i) p₂ + p/N ≥ 0;

(ii) p(r)/r^{1−1/N} is a nondecreasing function of r;

(iii) u(δ) := { δ^N U(δ^{−N})  (δ > 0)  if N < ∞;   e^δ U(e^{−δ})  (δ ∈ ℝ)  if N = ∞ }  is a convex function of δ.

Remark 17.2. Since U is convex and U(0) = 0, the function u appearing in (iii) is automatically nonincreasing.

Remark 17.3. It is clear (from condition (i) for instance) that DC_N ⊂ DC_{N′} if N ≥ N′. So the smallest class of all is DC_∞, while DC_1 is the largest (actually, conditions (i)–(iii) are void for N = 1).

Remark 17.4. If U belongs to DC_N, then for any a ≥ 0, b > 0, c ∈ ℝ, the function r ↦ a U(br) + cr also belongs to DC_N.

Remark 17.5. The requirement for U to be twice differentiable on (0, +∞) could be removed from many subsequent results involving displacement convexity classes. Still, this regularity assumption will simplify the proofs, without significantly restricting the generality of applications.

Examples 17.6. (i) For any α ≥ 1, the function U(r) = r^α belongs to all classes DC_N.

(ii) If α < 1, then the function U(r) = −r^α belongs to DC_N if and only if 1 − 1/N ≤ α (that is, α ≥ 1 − 1/N). The function −r^{1−1/N} is in some sense the minimal representative of DC_N.

(iii) The function U_∞(r) = r log r belongs to DC_∞. It can be seen as the limit of the functions U_N(r) = −N (r^{1−1/N} − r), which are the same (up to multiplication and addition of a linear function) as the functions appearing in (ii) above.

Proof of the equivalence in Definition 17.1. Assume first N < ∞, and write δ = r^{−1/N}, so that u(δ) = δ^N U(δ^{−N}). By computation, u′(δ) = −N p(r)/r^{1−1/N}. So u is convex if and only if p(r)/r^{1−1/N} is a nonincreasing function of δ, i.e. a nondecreasing function of r. Thus (ii) and (iii) are equivalent.

Next, by computation again,

u″(δ) = N² r^{(2/N)−1} ( p₂(r) + p(r)/N ).    (17.1)

So u is convex if and only if p₂ + p/N is nonnegative. This shows the equivalence between (i) and (iii).

In the case N = ∞, the arguments are similar, with the formulas

r = e^{−δ},   u′(δ) = −p(r)/r,   u″(δ) = p₂(r)/r.   ⊓⊔

The behavior of functions in DC_N will play an important role in the sequel of this course. Functions in DC_N may present singularities at the origin; for example U_N(r) is not differentiable at r = 0. It is often possible to get around this problem by replacing U_N(r) by a smooth approximation which still belongs to DC_N, for instance −N ( r (r + ε)^{−1/N} − r ), and later passing to the limit as ε → 0. The next proposition provides more systematic ways to "regularize" functions in DC_N near 0 or +∞; at the same time it gives additional information about the behavior of functions in DC_N. The notation p(r) and p₂(r) is the same as in Definition 17.1.

Proposition 17.7 (Behavior of functions in DC_N).

(i) Let N ∈ [1, ∞), and let Ψ ∈ C(ℝ_+; ℝ_+) be such that Ψ(r)/r → +∞ as r → ∞; then there exists U ∈ DC_N such that 0 ≤ U ≤ Ψ, and U(r)/r → +∞ as r → ∞.

(ii) If U ∈ DC_∞, then either U is linear, or there exist constants a > 0, b ∈ ℝ such that

∀ r ≥ 0,   U(r) ≥ a r log r + b r.

(iii) Let N ∈ [1, ∞] and let U ∈ DC_N. If r_0 ∈ (0, +∞) is such that p(r_0) > 0, then there is a constant K > 0 such that p(r) ≥ K r^{1−1/N} for all r ≥ r_0. If on the contrary p(r_0) = 0, then U is linear on [0, r_0]. In particular, the set {r; U″(r) = 0} is either empty, or an interval of the form [0, r_0].

(iv) Let N ∈ [1, ∞] and let U ∈ DC_N. Then U is the pointwise nondecreasing limit of a sequence of functions (U_ℓ)_{ℓ∈ℕ} in DC_N, such that (a) U_ℓ coincides with U on [0, r_ℓ], where r_ℓ is arbitrarily large; (b) for each ℓ there are a ≥ 0 and b ∈ ℝ such that U_ℓ(r) = −a r^{1−1/N} + b r (or a r log r + b r if N = ∞) for r large enough; (c) U_ℓ′(∞) → U′(∞) as ℓ → ∞.

(v) Let N ∈ [1, ∞] and let U ∈ DC_N. Then U is the pointwise nonincreasing limit of a sequence of functions (U_ℓ)_{ℓ∈ℕ} in DC_N, such

472 466 17 Displacement convexity II hat (a) t coincides with U on [ r U , + ∞ ) , where r is an arbitrary real ℓ ℓ ℓ ′ r number such that ) > 0 ; (b) U ( ( r ) is a linear function of r close to p ℓ ℓ ′ ′ →∞ (0) → U the origin; (c) (0) as ℓ U . ) ( ℓ ′′ ′′ U ≤ C U , (vi) In statements (iv) and (v), one can also impose that ℓ independent of ℓ . In statement (v), one can also for some constant C ′′ impose that [0 ,r U ] increases nicely from 0, in the following sense: If 0 ℓ ′′ U , then there are r = 0 > r is the interval where , an increasing 0 1 ℓ : [ function h ,r ≤ ] → R h , and constants K K ,K such that r 2 1 1 + 1 0 ′′ ≤ K U h on [ r . ,r ] 1 0 2 [1 , ∞ ] and let N ∈ DC (vii) Let . Then there is a sequence U ∈ N ∞ ) ( con- U of functions in DC , such that U )) U C ∈ ((0 , + ∞ N ℓ ℓ ∈ ℓ N ℓ 2 monotonically and in U ((0 , C ∞ )) ; and, with the notation verges to + loc ′ ( r p r U ) = , ( r ) − U ) ( r ℓ ℓ ℓ p ) ( r ) ( p p ( r r ) r ( p ) ℓ ℓ inf inf − sup − −−→ . −−→ ; sup 1 1 1 1 − 1 1 1 − 1 − − r r ℓ →∞ →∞ ℓ r r N N N N r r r r ere are some comments about these results. Statements (i) and (ii) H show that functions in can grow as slowly as desired at infinity if DC N , but have to grow at least like r log r if N < = ∞ . Statements ∞ N U ∈ DC as a monotone (iv) to (vi) make it possible to write any N r , nondecreasing for large r ) of “very limit (nonincreasing for small nice” functions U ∈ DC , which behave linearly close to 0 and like N ℓ 1 − 1 /N ar r (or ar log − + br ) at infinity (see Figure 17.1). This ap- br proximation scheme makes it possible to extend many results which can be proven for very nice nonlinearities, to general nonlinearities in DC . N The proof of Proposition 17.7 is more tricky than one might expect, and it is certainly better to skip it at first reading. Proof of Proposition 17.7 . 
The case N = 1 is not difficult to treat sep- arately (recall that DC is the class of all convex continuous functions 1 2 U (0) = 0 and U ∈ C U ((0 , + ∞ with ))). So in the sequel I shall as- sume 1. The strategy will always be the same: First approximate N > , then reconstruct u from the approximation, thanks to the formula U − 1 /N ) = r u ( r N U ( ) ( r u (log 1 /r ) if r = ∞ ). Let us start with the proof of (i). Without loss of generality, we Ψ is identically 0 on [0 , 1] (otherwise, replace Ψ by may assume that χΨ , where 0 ≤ χ ≤ 1 and χ is identically 0 on [0 , 1], identically 1 on , [2 + ∞ )). Define a function u : (0 , + ∞ ) → R by

[Fig. 17.1. U_ℓ (dashed line) is an approximation of U (solid line); it is linear close to the origin and almost affine at infinity. This regularization can be made without going out of the class DC_N, and without increasing too much the second derivative of U.]

Then u ≡ 0 on [1, +∞), and lim_{δ→0⁺} u(δ) = +∞.

The problem now is that u is not necessarily convex. So let ũ be the lower convex hull of u on (0, +∞), i.e. the supremum of all linear functions bounded above by u. Then ũ ≡ 0 on [1, +∞) and ũ is nonincreasing. Necessarily,

lim_{δ→0⁺} ũ(δ) = +∞.  (17.2)

Indeed, suppose on the contrary that lim_{δ→0⁺} ũ(δ) = M < +∞, and let a ∈ R be defined by a := sup_{δ>0} [ (M + 1 − u(δ))/δ ] (the latter function is nonpositive when δ is small enough, so the supremum is finite). Then u(δ) ≥ M + 1 − aδ, so lim_{δ→0⁺} ũ(δ) ≥ M + 1, which is a contradiction. Thus (17.2) does hold true.

Then let

U(r) := r ũ(r^{−1/N}).
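(A numerical aside, not in the original text: on a grid, the lower convex hull ũ — the largest convex minorant — can be approximated by the double Legendre transform f** computed over a finite set of slopes. The sample function and slope range below are arbitrary illustrative choices.)

```python
import numpy as np

def lower_convex_hull(xs, fs, slopes):
    """Double Legendre transform on a grid: approximates the largest
    convex minorant of the data (xs, fs) using the given trial slopes."""
    conj = np.max(slopes[:, None]*xs[None, :] - fs[None, :], axis=1)    # f*(s)
    return np.max(slopes[None, :]*xs[:, None] - conj[None, :], axis=1)  # f**(x)

# a nonconvex u that blows up near 0 and vanishes on [1, 2], as in the proof
xs = np.linspace(0.05, 2.0, 400)
u_vals = np.where(xs < 1.0, (1.0/xs - 1.0)*(np.sin(8*xs) + 1.2), 0.0)
ss = np.linspace(-500.0, 0.0, 2000)       # nonpositive slopes: hull is nonincreasing
hull = lower_convex_hull(xs, u_vals, ss)

assert np.all(hull <= u_vals + 1e-8)      # the hull is a minorant of u
d2 = hull[:-2] - 2.0*hull[1:-1] + hull[2:]
assert np.all(d2 >= -1e-6)                # ... and it is convex
assert np.all(np.diff(hull) <= 1e-9)      # ... and nonincreasing, like u-tilde
```

Enlarging the slope range tightens the approximation near the blow-up at the origin, mirroring the fact that ũ(δ) → +∞ as δ → 0⁺.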

Clearly U is continuous and nonnegative, with U ≡ 0 on [0, 1]. By computation,

U''(r) = N^{−2} r^{−1−1/N} ( r^{−1/N} ũ''(r^{−1/N}) − (N−1) ũ'(r^{−1/N}) ).

As ũ is convex and nonincreasing, it follows that U is convex. Hence U ∈ DC_N. On the other hand, since ũ ≤ u and Ψ(r) = r u(r^{−1/N}), it is clear that U ≤ Ψ; and still (17.2) implies that U(r)/r goes to +∞ as r → ∞.

Now consider Property (ii). If N = ∞, then the function U can be reconstructed from u by the formula

U(r) = r u(log(1/r)).  (17.3)

As u is convex and nonincreasing, either u is constant (in which case U is linear), or there are constants a > 0, b ∈ R, such that u(δ) ≥ −aδ + b, and then U(r) ≥ −a r log(1/r) + b r = a r log r + b r.

Next let us turn to (iii). First assume N < ∞. The formula

p(r) = r U'(r) − U(r) = −(1/N) r^{1−1/N} u'(r^{−1/N})

shows that p(r₀) > 0 if and only if u'(r₀^{−1/N}) < 0. Then for any r ≥ r₀, u'(r^{−1/N}) ≤ u'(r₀^{−1/N}), so

p'(r) = r U''(r) = N^{−2} r^{−1/N} ( r^{−1/N} u''(r^{−1/N}) − (N−1) u'(r^{−1/N}) ) ≥ −[ (N−1) u'(r₀^{−1/N}) / N² ] r^{−1/N}.

If on the other hand u'(r₀^{−1/N}) = 0, then necessarily u'(r^{−1/N}) = 0 for all r ≤ r₀, which means that u is constant on [r₀^{−1/N}, +∞), so U is linear on [0, r₀].

The reasoning is the same in the case N = ∞, with the help of the formulas

p(r) = −r u'(log(1/r)),  U''(r) = (1/r) [ u''(log(1/r)) − u'(log(1/r)) ]

and

r ≥ r₀ ⇒ p'(r) = r U''(r) ≥ −u'(log(1/r₀)).
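(Sanity check added in this transcription: the identity p(r) = −(1/N) r^{1−1/N} u'(r^{−1/N}), used repeatedly above, can be verified by a centered finite difference. The choice U(r) = r² is illustrative.)

```python
import numpy as np

N = 4.0
U      = lambda r: r**2                    # sample element of DC_N (illustrative)
Uprime = lambda r: 2.0*r
p      = lambda r: r*Uprime(r) - U(r)      # p(r) = r U'(r) - U(r)
u      = lambda d: d**N * U(d**(-N))       # u(delta) = delta^N U(delta^{-N})

# check p(r) = -(1/N) r^{1-1/N} u'(r^{-1/N}) at several radii
h = 1e-6
errs = []
for r in (0.5, 1.0, 2.0):
    d = r**(-1.0/N)
    uprime = (u(d + h) - u(d - h)) / (2.0*h)   # finite-difference u'(delta)
    errs.append(abs(p(r) - (-(1.0/N) * r**(1.0 - 1.0/N) * uprime)))
assert max(errs) < 1e-4
```

For U(r) = r² one has p(r) = r² exactly, so the residuals above are pure discretization error.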

Now consider statement (iv). The idea is to replace u by an affine function close to the origin, essentially by a C¹ smoothing of the trivial approximation by the tangent. First let N < ∞, let U ∈ DC_N and let u(δ) = δ^N U(δ^{−N}). We know that u is a nonincreasing, twice differentiable convex function on (0, +∞). If u is linear close to the origin, there is nothing to prove. Otherwise there is a sequence of positive numbers (a_ℓ)_{ℓ∈N} such that a_{ℓ+1} ≤ a_ℓ/4 and u'(a_{ℓ+1}) < u'(a_ℓ/2) < 0. For each ℓ, construct a C² function u_ℓ as follows:

• u_ℓ coincides with u on [a_ℓ, +∞);
• u_ℓ'' = χ_ℓ u'' on [0, a_ℓ], where χ_ℓ is a smooth cutoff function such that 0 ≤ χ_ℓ ≤ 1, χ_ℓ(δ) = 0 for δ ≤ a_ℓ/2, χ_ℓ(a_ℓ) = 1.

Since u'(a_ℓ) < 0, u_ℓ is convex and u_ℓ' < 0 on (0, a_ℓ]. By construction, u_ℓ'' ≤ u'', u_ℓ'(a_ℓ) = u'(a_ℓ), u_ℓ(a_ℓ) = u(a_ℓ), and u_ℓ is linear on [0, a_ℓ/2]; by writing the Taylor formula on [s, a_ℓ] (with a_ℓ as base point), we deduce that u_ℓ'(s) ≥ u'(s), u_ℓ(s) ≤ u(s) for all s ≤ a_ℓ (and therefore for all s).

For each ℓ, u lies above the tangent to u at a_ℓ; that is,

u(s) ≥ u(a_ℓ) + (s − a_ℓ) u'(a_ℓ) =: T_ℓ(s).

Since u'(a_{ℓ+1}) < u'(a_ℓ/2) and u' is nondecreasing, the curve T_ℓ lies strictly below the curve T_{ℓ+1} on [0, a_ℓ/2], and therefore on [0, a_{ℓ+1}]. By choosing χ_ℓ in such a way that ∫_{a_ℓ/2}^{a_ℓ} χ_ℓ u'' is very small, we can make sure that u_ℓ(s) is very close to the line T_ℓ on [0, a_ℓ]; and in particular that the whole curve u_ℓ is bounded above by T_{ℓ+1} on [a_{ℓ+1}, a_ℓ]. This will ensure that u_ℓ is a nondecreasing function of ℓ.

To recapitulate: u_ℓ ≤ u_{ℓ+1} ≤ u; u_ℓ'' ≤ u''; u_ℓ = u on [a_ℓ, +∞); 0 ≥ u_ℓ' ≥ u'; u_ℓ is affine on [0, a_ℓ/2].

Now let

U_ℓ(r) = r u_ℓ(r^{−1/N}).

By direct computation,

U_ℓ''(r) = N^{−2} r^{−1−1/N} ( r^{−1/N} u_ℓ''(r^{−1/N}) − (N−1) u_ℓ'(r^{−1/N}) ).  (17.4)

Since u_ℓ is convex nonincreasing, the above expression is nonnegative; so U_ℓ is convex, and by construction, it lies in DC_N. Moreover U_ℓ satisfies the first requirement in (vi), since U_ℓ''(r) is bounded above by

N^{−2} r^{−1−1/N} ( r^{−1/N} u''(r^{−1/N}) − (N−1) u'(r^{−1/N}) ) = U''(r).

In the case N = ∞, things are similar, except that now u is defined on the whole of R, the sequence (a_ℓ) converges to −∞ (say a_{ℓ+1} ≤ a_ℓ − 2), and one should use the formulas

U_ℓ'(r) = u_ℓ(log(1/r)) − u_ℓ'(log(1/r));  U_ℓ''(r) = (1/r) [ u_ℓ''(log(1/r)) − u_ℓ'(log(1/r)) ].

For (v), the idea is to replace u by a constant function for large values of δ. But this cannot be done in a C¹ way, so the smoothing turns out to be more tricky. (Please consider again possibly skipping the rest of this proof.)

I shall distinguish four cases, according to the behavior of u at infinity, and the value of u'(+∞) = lim_{s→+∞} u'(s). To fix ideas I shall assume that N < ∞; but the case N = ∞ can be treated similarly. In each case I shall also check the first requirement of (vi), which is U_ℓ'' ≤ C U''.

Case 1: u is affine at infinity and u'(+∞) = 0. This means that u(δ) = c for δ ≥ δ₀ large enough, where c is some constant. Then U(r) = r u(r^{−1/N}) = cr for r ≤ δ₀^{−N}, and there is nothing to prove.

Case 2: u is affine at infinity and u'(+∞) < 0. Let a := −u'(+∞), so a > 0. By assumption there are δ₀ and b ∈ R such that u(s) = −as + b for s ≥ δ₀. Let a₁ ≥ max(1, δ₀). I shall define recursively an increasing sequence (a_ℓ)_{ℓ∈N} and C¹ functions u_ℓ, such that:

• on [0, a_ℓ], u_ℓ coincides with u;
• on [a_ℓ, +∞), u_ℓ''(s) = χ_ℓ(s)/s, where χ_ℓ is a continuous function with compact support in (a_ℓ, +∞), 0 ≤ χ_ℓ ≤ 1. (So u_ℓ is obtained by integrating this twice, and ensuring the continuity at s = a_ℓ; note that u''(a_ℓ) = 0, so the result will be C².)

Let us choose χ_ℓ to be supported in some interval (a_ℓ, b_ℓ), such that

∫_{a_ℓ}^{+∞} χ_ℓ(s) ds/s = a.

Such a χ_ℓ exists since ∫_{a_ℓ}^{+∞} ds/s = +∞. Then we let a_{ℓ+1} ≥ b_ℓ + 1.

The function u_ℓ is convex by construction, and affine at infinity; moreover, u_ℓ'(+∞) = u'(a_ℓ) + ∫_{a_ℓ}^{+∞} χ_ℓ(s) ds/s = 0, so u_ℓ is actually constant at infinity and u_ℓ' ≤ 0. Obviously u_ℓ'' ≥ u'' and u_ℓ' ≥ u', so u_ℓ ≥ u. Also, on [a_{ℓ+1}, +∞), u_{ℓ+1} ≤ u_{ℓ+1}(a_{ℓ+1}) = u(a_{ℓ+1}) ≤ u_ℓ, while on [0, a_{ℓ+1}], u_{ℓ+1} = u ≤ u_ℓ; so the sequence (u_ℓ) is nonincreasing in ℓ.

Let U_ℓ(r) = r u_ℓ(r^{−1/N}). Formula (17.4) shows again that U_ℓ'' ≥ 0, and it is clear that U_ℓ ∈ C(R₊) ∩ C²((0, +∞)), U_ℓ(0) = 0; so U_ℓ ∈ DC_N. It is clear also that U_ℓ ≥ U, U_ℓ coincides with U on [0, a_ℓ^{−N}], U_ℓ is linear on [0, b_ℓ^{−N}], U_ℓ converges monotonically to U as ℓ → ∞, and U_ℓ'(0) = u_ℓ(+∞) converges to u(+∞) = U'(0) = −∞.

It only remains to check the bound U_ℓ'' ≤ C U''. This bound is obvious on [a_ℓ^{−N}, +∞); for r ≤ a_ℓ^{−N}, it results from the formulas

U_ℓ''(r) = N^{−2} r^{−1−1/N} ( χ_ℓ(r^{−1/N}) − (N−1) u_ℓ'(r^{−1/N}) ) ≤ N^{−2} r^{−1−1/N} ( 1 + (N−1) a );

U''(r) = N^{−2} r^{−1−1/N} (N−1) a.

So C = 1 + 1/((N−1) a) is admissible.

Case 3: u is not affine at infinity and u'(+∞) = 0. The proof is based again on the same principle, but modified as follows:

• on [0, a_ℓ], u_ℓ coincides with u;
• on [a_ℓ, +∞), u_ℓ''(s) = ζ_ℓ(s) u''(s), where ζ_ℓ is a smooth function with values in [0, 2], identically equal to 1 close to a_ℓ, and identically equal to 0 at infinity.

Choose a_ℓ < b_ℓ < c_ℓ, and ζ_ℓ supported in [a_ℓ, c_ℓ], so that 1 ≤ ζ_ℓ ≤ 2 on [a_ℓ, b_ℓ], 0 ≤ ζ_ℓ ≤ 2 on [b_ℓ, c_ℓ], and

∫_{a_ℓ}^{b_ℓ} ζ_ℓ(s) u''(s) ds > u'(b_ℓ) − u'(a_ℓ);  ∫_{a_ℓ}^{c_ℓ} ζ_ℓ(s) u''(s) ds = u'(+∞) − u'(a_ℓ).

This is possible since u' and u'' are continuous and ∫_{a_ℓ}^{+∞} 2u''(s) ds = 2( u'(+∞) − u'(a_ℓ) ) > u'(+∞) − u'(a_ℓ) > 0 (otherwise u would be affine on [a_ℓ, +∞)). Then choose a_{ℓ+1} > c_ℓ + 1.

The resulting function u_ℓ is convex and it satisfies u_ℓ'(+∞) = u'(a_ℓ) + ( u'(+∞) − u'(a_ℓ) ) = 0, so u_ℓ' ≤ 0 and u_ℓ is constant at infinity.

On [a_ℓ, b_ℓ], u_ℓ'' ≥ u'' and u_ℓ' ≥ u', and these inequalities are strict at b_ℓ. Since u' and u'' are continuous, we can always arrange that c_ℓ is so close to b_ℓ that the inequalities u_ℓ ≥ u and u_ℓ' ≥ u' hold true on [b_ℓ, c_ℓ]. Then these inequalities will also hold true on [c_ℓ, +∞), since u_ℓ is constant there, and u is nonincreasing.

Define U_ℓ(r) = r u_ℓ(r^{−1/N}). The same reasoning as in the previous case shows that U_ℓ lies in DC_N, U_ℓ ≥ U, U_ℓ is linear on [0, c_ℓ^{−N}], U_ℓ converges monotonically to U as ℓ → ∞, and U_ℓ'(0) = u_ℓ(+∞) converges to u(+∞) = U'(0). The sequence (U_ℓ) satisfies all the desired properties; in particular the inequalities u_ℓ'' ≤ 2u'' and u_ℓ' ≥ u' ensure that U_ℓ'' ≤ 2U''.

Case 4: u is not affine at infinity and u'(+∞) < 0. In this case the proof is based on the same principle, and u_ℓ is defined as follows:

• on [0, a_ℓ], u_ℓ coincides with u;
• on [a_ℓ, +∞), u_ℓ''(s) = η_ℓ(s) u''(s) + χ_ℓ(s)/s, where χ_ℓ and η_ℓ are both valued in [0, 1], χ_ℓ is a smooth cutoff function with compact support in (a_ℓ, +∞), and η_ℓ is a smooth function identically equal to 1 close to a_ℓ, and identically equal to 0 close to infinity.

To construct these functions, first choose b_ℓ > a_ℓ and χ_ℓ supported in [a_ℓ, b_ℓ] in such a way that

∫_{a_ℓ}^{b_ℓ} ( χ_ℓ(s)/s ) ds = −( u'(b_ℓ) + u'(+∞) )/2.

This is always possible since u' is continuous, ∫_{a_ℓ}^{+∞} ds/s = +∞, and −( u'(b) + u'(+∞) )/2 approaches the finite limit −u'(+∞) as b → +∞.

Then choose c_ℓ > b_ℓ, and η_ℓ supported in [a_ℓ, c_ℓ] such that η_ℓ = 1 on [a_ℓ, b_ℓ] and

∫_{b_ℓ}^{c_ℓ} η_ℓ u'' = ( u'(+∞) − u'(b_ℓ) )/2.

This is always possible since ∫_{b_ℓ}^{+∞} u''(s) ds = u'(+∞) − u'(b_ℓ) > [ u'(+∞) − u'(b_ℓ) ]/2 > 0 (otherwise u would be affine on [b_ℓ, +∞)). Finally choose a_{ℓ+1} ≥ c_ℓ + 1.

The function u_ℓ so constructed is convex, affine at infinity, and

u_ℓ'(+∞) = u'(a_ℓ) + ∫_{a_ℓ}^{b_ℓ} ( χ_ℓ(s)/s ) ds + ∫_{a_ℓ}^{b_ℓ} u'' + ∫_{b_ℓ}^{c_ℓ} η_ℓ u'' = 0.

So u_ℓ is actually constant at infinity, and u_ℓ' ≤ 0.

On [a_ℓ, b_ℓ], u_ℓ'' ≥ u'', u_ℓ'(a_ℓ) = u'(a_ℓ), u_ℓ(a_ℓ) = u(a_ℓ); so u_ℓ' ≥ u', u_ℓ ≥ u on [a_ℓ, b_ℓ].

On [b_ℓ, +∞), one has u_ℓ' ≥ u_ℓ'(b_ℓ) = ( u'(b_ℓ) − u'(+∞) )/2 ≥ u'(+∞) if u'(b_ℓ) ≥ 3u'(+∞). We can always ensure that this inequality holds true by choosing a₁ large enough that u'(a₁) ≥ 3u'(+∞). Then

u_ℓ(s) ≥ u(b_ℓ) + u'(+∞)(s − b_ℓ) ≥ u(b_ℓ) + ∫_{b_ℓ}^s u' = u(s);

so u_ℓ ≥ u also on [b_ℓ, +∞).

Define U_ℓ(r) = r u_ℓ(r^{−1/N}). All the desired properties of U_ℓ can be shown just as before, except for the bound on U_ℓ'', which we shall now check. On [a_ℓ^{−N}, +∞), U_ℓ'' = U''. On [0, a_ℓ^{−N}), with the notation a = −u'(+∞), we have −u'(r^{−1/N}) ≥ a and −u_ℓ'(r^{−1/N}) ≤ 3a (recall that we imposed u'(a₁) ≥ −3a), so

U_ℓ''(r) ≤ N^{−2} r^{−1−1/N} ( r^{−1/N} u''(r^{−1/N}) + 1 + 3(N−1) a ),
U''(r) ≥ N^{−2} r^{−1−1/N} ( r^{−1/N} u''(r^{−1/N}) + (N−1) a ),

and once again U_ℓ'' ≤ C U'', with C = 3 + 1/((N−1) a).

It remains to prove the second part of (vi). This will be done by a further approximation scheme. So let U ∈ DC_N be linear close to the origin. (We can always reduce to this case by (v).) If U is linear on the whole of R₊, there is nothing to do. Otherwise, by (iii), the set where U'' vanishes is an interval [0, r₀]. The goal is to show that we may approximate U by functions U_ℓ ∈ DC_N, nonincreasing in ℓ, in such a way that U_ℓ is linear on some interval [0, r₀(ℓ)], and U_ℓ'' increases nicely from 0 on [r₀(ℓ), r₁(ℓ)).

In this case, u is a nonincreasing function, identically equal to a constant on [s₀, +∞), with s₀ = r₀^{−1/N}; and also u' is nondecreasing to 0, so in fact u is strictly decreasing up to s₀. Let a₁ ∈ (s₀/2, s₀). We can recursively define real numbers a_ℓ < b_ℓ < c_ℓ < s₀ and C² functions u_ℓ as follows:

• on (0, a_ℓ], u_ℓ coincides with u;
• on [a_ℓ, +∞), u_ℓ'' = χ_ℓ u'' + η_ℓ (−u'), where χ_ℓ and η_ℓ are smooth functions valued in [0, 2], χ_ℓ(r) is identically equal to 1 for r close to a_ℓ, and identically equal to 0 for r ≥ b_ℓ; and η_ℓ is compactly supported in [b_ℓ, c_ℓ] and decreasing to 0 close to c_ℓ.

Let us choose χ_ℓ, η_ℓ, b_ℓ, c_ℓ in such a way that

∫_{a_ℓ}^{b_ℓ} χ_ℓ u'' > ∫_{a_ℓ}^{b_ℓ} u'';  ∫_{a_ℓ}^{b_ℓ} χ_ℓ u'' + ∫_{b_ℓ}^{c_ℓ} η_ℓ (−u') = −u'(a_ℓ);  ∫_{b_ℓ}^{c_ℓ} η_ℓ (−u') > 0.

This is possible since u', u'' are continuous, ∫_{a_ℓ}^{s₀} 2u'' = −2u'(a_ℓ) > −u'(a_ℓ), and (−u') is strictly positive on [a_ℓ, s₀). It is clear that u_ℓ ≥ u and u_ℓ' ≥ u' on [a_ℓ, b_ℓ], with strict inequalities at b_ℓ; by choosing c_ℓ very close to b_ℓ, we can make sure that these inequalities are preserved on [b_ℓ, c_ℓ]. Then we choose a_{ℓ+1} = (c_ℓ + s₀)/2.

Let us check that U_ℓ(r) := r u_ℓ(r^{−1/N}) satisfies all the required properties. To bound U_ℓ'', note that for r ∈ [s₀^{−N}, (s₀/2)^{−N}],

U_ℓ''(r) ≤ C(N, r₀) ( u_ℓ''(r^{−1/N}) − u_ℓ'(r^{−1/N}) ) ≤ C(N, r₀) ( 2u''(r^{−1/N}) − 3u'(r^{−1/N}) )

and

U''(r) ≥ K(N, r₀) ( u''(r^{−1/N}) − u'(r^{−1/N}) ),

where C(N, r₀), K(N, r₀) are positive constants. Finally, on [b_ℓ, c_ℓ], u_ℓ'' = η_ℓ (−u') is decreasing close to c_ℓ (indeed, η_ℓ is decreasing close to c_ℓ, and −u' is positive nonincreasing); and of course −u_ℓ' is decreasing as well. So u_ℓ''(r^{−1/N}) and −u_ℓ'(r^{−1/N}) are increasing functions of r in a small interval [r₁, r₀]. This concludes the argument.

To prove (vii), we may first approximate u by a C^∞ convex, nonincreasing function u_ℓ, in such a way that ‖u_ℓ − u‖_{C²((a,b))} → 0 for any a, b > 0. This can be done in such a way that (u_ℓ(s)) is nondecreasing in ℓ for small s and nonincreasing in ℓ for large s; and u_ℓ'(0) → u'(0), u_ℓ'(+∞) → u'(+∞). The conclusion follows easily since p(r)/r^{1−1/N} is nondecreasing and equal to −(1/N) u'(r^{−1/N}) (−u'(log(1/r)) in the case N = ∞). ⊓⊔

Domain of the functionals U_ν

To each U ∈ DC_N corresponds a functional U_ν. However, some conditions might be needed to make sense of U_ν(μ). Why is that so?

If U is, say, nonnegative, then an integral such as ∫ U(ρ) dν always makes sense in [0, +∞], so U_ν is well-defined on the whole of P₂^{ac}(M). But U might be partially negative, and then one should not exclude the possibility that both the negative and the positive parts of U(ρ) have infinite integrals. The problem comes from infinity and does not arise if M is a compact manifold, or more generally if ν has finite mass.

Theorem 17.8 below solves this issue: It shows that under some integral growth condition on ν, the quantity U_ν(μ) is well-defined if μ has finite moments of order large enough. This suggests that we study U_ν on the set P_p^{ac}(M) of absolutely continuous measures with finite moment of order p, rather than on the whole space P₂^{ac}(M).

Since this theorem only uses the metric structure, I shall state it in the context of general Polish spaces rather than Riemannian manifolds.

Theorem 17.8 (Moment conditions make sense of U_ν(μ)). Let (X, d) be a Polish space and let ν be a reference Borel measure on X. Let p ∈ [2, +∞) and N ∈ [1, ∞]. Assume that there exists x₀ ∈ X such that

∫_X dν(x) / [1 + d(x₀, x)]^{p(N−1)} < +∞  if N < ∞;
∃ c > 0;  ∫_X e^{−c d(x₀,x)^p} dν(x) < +∞  if N = ∞.  (17.5)

Then, for any U ∈ DC_N, the formula

U_ν(μ) = ∫_X U(ρ) dν,  μ = ρν

unambiguously defines a functional U_ν : P_p^{ac}(X) → R ∪ {+∞}, where P_p^{ac}(X) is the set of absolutely continuous probability measures on X with a finite moment of order p.

Even if no such p exists, U_ν is still well-defined on P_c^{ac}(X), the set of absolutely continuous compactly supported probability measures, provided that ν is finite on compact sets.

Example 17.9. If ν is the Lebesgue measure on R^N, then U_ν is well-defined on P₂^{ac}(R^N) for all U ∈ DC_N, as long as N ≥ 3. For N = 2, Theorem 17.8 allows us to define U_ν on P_p^{ac}(R^N), for any p > 2. In the case N = 1, U_ν is well-defined on P_c^{ac}(R).

All this remains true if R^N is replaced by an arbitrary N-dimensional Riemannian manifold M with nonnegative Ricci curvature. (Indeed, vol[B_r(x₀)] = O(r^N) for any fixed x₀ ∈ M, so ∫ dν(x)/[1 + d(x₀, x)]^{p(N−1)} < +∞ if p(N−1) > N.)

Convention 17.10. In the sequel of this course I shall sometimes write "p ∈ [2, +∞) ∪ {c} satisfying the assumptions of Theorem 17.8" or "p ∈ [2, +∞) ∪ {c} satisfying (17.5)". This means that p is either a real number greater than or equal to 2, satisfying (17.5) (the metric space (X, d) and the reference measure ν should be obvious from the context); or the symbol "c", so that P_p(X) stands for the set P_c(X) of compactly supported probability measures.

Remark 17.11. For any positive constant C, the set of probability measures μ in P₂(X) with ∫ d(x₀, x)^p dμ(x) ≤ C is closed in P₂(X); but in general the whole set P_p(X) is not. Similarly, if K is a given compact subset of X, then the set of probability measures with support in K is compact in P₂(X); but P_c(X) is not closed in general.

Remark 17.12. If X is a length space (for instance a Riemannian manifold equipped with its geodesic distance), then P_p(M) is a geodesically convex subset of P_q(M), for any q ∈ (1, +∞). Indeed, let (μ_t)_{0≤t≤1} be a geodesic in P_q(M); according to Corollary 7.22, there is a random geodesic γ such that μ_t = law(γ_t); then the bounds E d(x₀, γ₀)^p < +∞ and E d(x₀, γ₁)^p < +∞ together imply E d(x₀, γ_t)^p < +∞, in view of the inequality

0 ≤ t ≤ 1 ⇒ d(x₀, γ_t)^p ≤ 2^{2p−1} [ d(x₀, γ₀)^p + d(x₀, γ₁)^p ].

Combining this with Theorem 8.7, we deduce that P_p^{ac}(M) is geodesically convex in P₂(M), and more precisely

∫ d(x₀, x)^p μ_t(dx) ≤ 2^{2p−1} ( ∫ d(x₀, x)^p μ₀(dx) + ∫ d(x₀, x)^p μ₁(dx) ).

Thus even if the functional U_ν is a priori only defined on P_p^{ac}(M), it is not absurd to study its convexity properties along geodesics of P₂(M).

Proof of Theorem 17.8. The problem is to show that under the assumptions of the theorem, U(ρ) is bounded below by a ν-integrable function; then U_ν(μ) = ∫ U(ρ) dν will be well-defined in R ∪ {+∞}.

Suppose first that N < ∞. By convexity of u, there is a constant A > 0 so that u(δ) ≥ −Aδ − A, which means

U(ρ) ≥ −A ( ρ + ρ^{1−1/N} ).  (17.6)

Of course, ρ lies in L¹(ν); so it is sufficient to show that ρ^{1−1/N} also lies in L¹(ν). But this is a simple consequence of Hölder's inequality:

∫_X ρ(x)^{1−1/N} dν(x) = ∫_X [ ρ(x) (1 + d(x₀, x))^p ]^{1−1/N} (1 + d(x₀, x))^{−p(1−1/N)} dν(x)
≤ ( ∫_X ρ(x) (1 + d(x₀, x))^p dν(x) )^{1−1/N} ( ∫_X (1 + d(x₀, x))^{−p(N−1)} dν(x) )^{1/N}.

Now suppose that N = ∞. By Proposition 17.7(ii), there are positive constants a, b such that

U(ρ) ≥ a ρ log ρ − b ρ.  (17.7)

So it is sufficient to show that (ρ log ρ)_− ∈ L¹(ν). Write

∫_X ρ(x) log ρ(x) dν(x) = ∫_X [ ρ(x) e^{c d(x₀,x)^p} ] log[ ρ(x) e^{c d(x₀,x)^p} ] e^{−c d(x₀,x)^p} dν(x) − c ∫_X d(x₀, x)^p ρ(x) dν(x).  (17.8)

By Jensen's inequality, applied with the convex function r → r log r, the probability measure e^{−c d(x₀,·)^p} dν / ( ∫_X e^{−c d(x₀,·)^p} dν ) and the integrable function ρ e^{c d(x₀,·)^p}, (17.8) can be bounded below by

( ∫_X ρ dν ) log( ∫_X ρ dν / ∫_X e^{−c d(x₀,x)^p} dν(x) ) − c ∫_X ρ(x) d(x₀, x)^p dν(x).

This concludes the argument. ⊓⊔

In the sequel of this chapter, I shall study properties of the functionals U_ν on P_p^{ac}(M), where M is a Riemannian manifold equipped with its geodesic distance.
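(Numerical aside, not in the original text: the Hölder step above can be replayed on a discrete measure, where the inequality holds verbatim for sums. The uniform reference measure, the Gaussian-shaped density, and the values N = 3, p = 2 — the admissible pair from Example 17.9 — are illustrative choices.)

```python
import numpy as np

# Discrete toy version of the Hoelder step: nu = uniform measure on [-10, 10],
# rho = a Gaussian-shaped probability density with respect to nu.
x = np.linspace(-10.0, 10.0, 2001)
w = np.full_like(x, 20.0/2000.0)          # nu-weights of the grid points
rho = np.exp(-x**2/2.0)
rho /= np.sum(rho*w)                      # normalize so that integral of rho d(nu) = 1

N, p = 3.0, 2.0                           # p(N-1) = 4 > N = 3, as required
lhs = np.sum(rho**(1.0 - 1.0/N) * w)
rhs = np.sum(rho*(1.0 + np.abs(x))**p * w)**(1.0 - 1.0/N) * \
      np.sum((1.0 + np.abs(x))**(-p*(N - 1.0)) * w)**(1.0/N)
assert lhs <= rhs + 1e-12
```

The right-hand side is finite precisely because the p-th moment of μ = ρν and the integral in (17.5) are both finite.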

Displacement convexity from curvature bounds, revisited

Recall the notation U_N introduced in (16.17) (or in Example 17.6(iii)). For any N > 1, the functional (U_N)_ν will instead be denoted by H_{N,ν}:

H_{N,ν}(μ) = ∫_M U_N(ρ) dν,  μ = ρν.

I shall often write H_ν instead of H_{∞,ν}; and I may even write just H if the reference measure is the volume measure. This notation is justified by analogy with Boltzmann's H functional: H(ρ) = ∫ ρ log ρ d vol.

For each U ∈ DC_N, formula (16.18) defines a functional Λ_U which will later play a role in displacement convexity estimates. It will be convenient to compare this quantity with Λ_N := Λ_{U_N}; explicitly,

Λ_N(μ, v) = ∫_M ρ(x)^{1−1/N} |v(x)|² dν(x),  μ = ρν.  (17.9)

It is clear that Λ_U ≥ K_{N,U} Λ_N, where

K_{N,U} = inf_{r>0} K p(r)/r^{1−1/N} =
  K lim_{r→0} p(r)/r^{1−1/N}  if K > 0,
  0  if K = 0,
  K lim_{r→∞} p(r)/r^{1−1/N}  if K < 0.  (17.10)

It will also be useful to introduce a local version of displacement convexity. In short, a functional U_ν is said to be locally displacement convex if it is displacement convex in the neighborhood of each point.

Definition 17.13 (Local displacement convexity). Let M be a Riemannian manifold, and let F be defined on a geodesically convex subset of P₂^{ac}(M), with values in R ∪ {+∞}. Then F is said to be locally displacement convex if, for any x₀ ∈ M there is r > 0 such that the convexity inequality

∀ t ∈ [0, 1]  F(μ_t) ≤ (1−t) F(μ₀) + t F(μ₁)

holds true as soon as all measures μ_t, 0 ≤ t ≤ 1, are supported in the ball B_r(x₀).

The notions of local Λ-displacement convexity and local λ-displacement convexity are defined similarly, by localizing Definition 16.5.
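(Sanity check added in this transcription: for the reference nonlinearity U_N itself one computes p(r) = r U_N'(r) − U_N(r) = r^{1−1/N}, so the ratio in (17.10) is identically 1 and K_{N,U_N} = K for every sign of K. This is easy to confirm numerically; N = 5 is an arbitrary choice.)

```python
import numpy as np

N = 5.0
U_N      = lambda r: -N*(r**(1.0 - 1.0/N) - r)           # nonlinearity behind H_{N,nu}
U_Nprime = lambda r: -N*((1.0 - 1.0/N)*r**(-1.0/N) - 1.0)
p        = lambda r: r*U_Nprime(r) - U_N(r)              # p(r) = r U'(r) - U(r)

r = np.linspace(0.01, 100.0, 5000)
ratio = p(r) / r**(1.0 - 1.0/N)
# p(r)/r^{1-1/N} is identically 1, so (17.10) gives K_{N,U_N} = K
assert np.allclose(ratio, 1.0, atol=1e-10)
```

This is consistent with condition (iii) of Theorem 17.15 below, where H_{N,ν} is paired with the modulus K Λ_N rather than a smaller multiple of it.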

Warning 17.14. When one says that a functional F is locally displacement convex, this does not mean that F is displacement convex in a small neighborhood of μ, for any μ. The word "local" refers to the topology of the base space M, not the topology of the Wasserstein space.

The next theorem is a rigorous implementation of Guesses 16.6 and 16.7; it relates curvature-dimension bounds, as appearing in Theorem 14.8, to displacement convexity properties. Recall Convention 17.10.

Theorem 17.15 (CD bounds read off from displacement convexity). Let M be a Riemannian manifold, equipped with its geodesic distance d and a reference measure ν = e^{−V} vol, V ∈ C²(M). Let K ∈ R and N ∈ (1, ∞]. Let p ∈ [2, +∞) ∪ {c} satisfy the assumptions of Theorem 17.8. Then the following three conditions are equivalent:

(i) M satisfies the curvature-dimension criterion CD(K, N);
(ii) For each U ∈ DC_N, the functional U_ν is Λ_{N,U}-displacement convex on P_p^{ac}(M), where Λ_{N,U} := K_{N,U} Λ_N;
(iii) H_{N,ν} is locally K Λ_N-displacement convex;

and then necessarily N ≥ n, with equality possible only if V is constant.

Remark 17.16. Statement (ii) means, explicitly, that for any displacement interpolation (μ_t)_{0≤t≤1} in P_p^{ac}(M), and for any t ∈ [0, 1],

U_ν(μ_t) + K_{N,U} ∫₀¹ ∫_M ρ_s(x)^{1−1/N} |∇ψ̃_s(x)|² dν(x) G(s, t) ds ≤ (1−t) U_ν(μ₀) + t U_ν(μ₁),  (17.11)

where ρ_s is the density of μ_s, ψ̃_s = H_s ψ̃ (Hamilton–Jacobi semigroup), ψ̃ is d²/2-convex, exp(∇ψ̃) is the Monge transport μ₀ → μ₁, and K_{N,U} is defined by (17.10).

Remark 17.17. If the two quantities in the left-hand side of (17.11) are infinite with opposite signs, the convention is (+∞) − (+∞) = −∞, i.e. the inequality is void. This eventuality is ruled out by any one of the following conditions: (a) K_{N,U} ≥ 0; (b) N = ∞; (c) μ₀, μ₁ ∈ P_q(M), where q > 2N/(N−1) is such that ∫ (1 + d(x₀, x))^{−[q(N−1)−2N−δ]} ν(dx) < +∞ for some δ > 0. This is a consequence of Proposition 17.24 later in this chapter.

Remark 17.18. The case N = 1 is degenerate since U₁ is not defined; but the equivalence (i) ⇔ (ii) remains true if one defines K_{1,U} to be +∞ if K > 0, and 0 if K ≤ 0. I shall address this case from a slightly different point of view in Theorem 17.41 below. (As stated in that theorem, N = 1 is possible only if M is one-dimensional and ν = vol.)

As a particular case of Theorem 17.15, we now have a rigorous justification of the guess formulated in Example 16.8: The nonnegativity of the Ricci curvature is equivalent to the (local) displacement convexity of Boltzmann's H functional. This is the intersection of two situations where Theorem 17.15 is easier to formulate: (a) the case N = ∞; and (b) the case K = 0. These cases are important enough to be stated explicitly as corollaries of Theorem 17.15:

Corollary 17.19 (CD(K, ∞) and CD(0, N) bounds via optimal transport). Let M be a Riemannian manifold, K ∈ R and N ∈ (1, ∞]; then:

(a) M satisfies Ric ≥ K g if and only if Boltzmann's H functional is K-displacement convex on P_c^{ac}(M);

(b) M has nonnegative Ricci curvature and dimension bounded above by N if and only if H_{N,vol} is displacement convex on P_c^{ac}(M).

Remark 17.20. All these results can be extended to singular measures, so the restriction to absolutely continuous measures is nonessential. I shall come back to these issues in Chapter 30.

Core of the proof of Theorem 17.15. Before giving a complete proof, for pedagogical reasons I shall give the main argument behind the implication (i) ⇒ (ii) in Theorem 17.15, in the simple case K = 0.

Let (μ_t)_{0≤t≤1} be a Wasserstein geodesic, with μ_t absolutely continuous, and let ρ_t be the density of μ_t with respect to ν. By change of variables,

∫ U(ρ_t) dν = ∫ U(ρ₀/J_t) J_t dν,

where J_t is the Jacobian of the optimal transport taking μ₀ to μ_t. The next step consists in rewriting this as a function of the mean distortion. Let u(δ) = δ^N U(δ^{−N}); then

∫ U(ρ₀/J_t) J_t dν = ∫ u( (J_t/ρ₀)^{1/N} ) ρ₀ dν.
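(Illustration added in this transcription, not in the original: the elementary convexity mechanism that this argument rests on — a convex nonincreasing function composed with a concave function of t is convex in t — can be checked numerically. The sample functions below are arbitrary stand-ins for u and for the mean distortion.)

```python
import numpy as np

def second_diffs(vals):
    """Second differences on a uniform grid; nonnegative means discretely convex."""
    return vals[:-2] - 2.0*vals[1:-1] + vals[2:]

t = np.linspace(0.0, 1.0, 401)
concave_arg = 1.0 + t*(1.0 - t)     # a concave function of t (distortion surrogate)
u = lambda d: 1.0/d                 # convex and nonincreasing on (0, +infinity)

composed = u(concave_arg)
assert np.all(second_diffs(concave_arg) <= 1e-12)   # the argument is concave
assert np.all(second_diffs(composed) >= -1e-12)     # u(concave) is convex
```

Dropping either hypothesis (monotonicity of u, or concavity of the argument) makes the conclusion fail in general, which is why both the DC_N condition and the curvature bound are needed.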

487 Displacement convexity from curvature bounds, revisited 48 1 U The fact that belongs to means precisely that u is convex DC N nonincreasing. The nonnegativity of the (generalized) Ricci curvature t . Then the u means that the argument of is a concave function of convexity of the whole expression follows from the simple fact that the composition of a convex nonincreasing function with a concave function ⊓⊔ is itself convex. . Let us start with the implication Complete proof of Theorem 17.15 N < (i) , since the case N = ∞ ⇒ (ii). I shall only treat the case ∞ , I shall also assume that μ are and μ is very similar. In a first step 1 0 compactly supported; this assumption will be relaxed later on. μ and μ So let be two absolutely continuous, compactly supported 0 1 probability measures and let ( ) μ be the unique displacement t 1 ≤ t 0 ≤ and interpolation between μ , where μ T μ ) . It can be written ( t 0 1 0 # solve the Hamilton–Jacobi ( x ) = exp ) ( t ∇ ψ ( x )), then let ( ψ T t ≤ t ≤ 1 t 0 x ψ ψ . The goal is to show that equation with initial datum = 0 ( μ U ) ≤ (1 − t ) U ) ( μ μ ) + tU ( ν 0 1 ν t ν ∫ ∫ 1 1 1 − 2 N K − x ) ( ρ | ( ∇ ψ (17.12) ( x ) | ds. dν ( x ) G ) s,t N,U s s M 0 2 ) is almost surely Note that the last integral is finite since x ψ | |∇ ( s 2 D is the maximum distance between elements , where D bounded by 1 ∫ 1 − /N 1 N d μ ); and that of Spt( ρ ) and elements of Spt( ν ≤ ν [Spt μ μ ] s s 1 0 by Jensen’s inequality. U ∞ ( μ , then there is nothing to ) = + If either or U ∞ ( μ ) = + 1 ν ν 0 prove; so let us assume that these quantities are finite. ( be a fixed time in (0 , 1); on T Let t 1], M ), define, for all t ∈ [0 , 0 t 0 ( ) )) x ( T exp ψ ( t ∇ ∇ ψ ( x )) . = exp t ( → t t 0 x x 0 T Then μ be → μ . Let J is the unique optimal transport t t t → t t t → 0 0 0 the associated Jacobian determinant (well-defined μ -almost surely). 
Recall from Chapter 11 that $\mu_t$ is concentrated on $T_{t_0\to t}(M)$ and that its density $\rho_t$ is determined by the equation
\[
\rho_{t_0}(x) = \rho_t\big(T_{t_0\to t}(x)\big)\, J_{t_0\to t}(x). \tag{17.13}
\]
Since $U(0) = 0$, we may apply Theorem 11.3 to $F(x) = U(\rho_t(x))$; or more precisely, to the positive part and the negative part of $F$ separately. So

17 Displacement convexity II

\[
\int_M U(\rho_t(x))\, d\nu(x) = \int_M U\big(\rho_t(T_{t_0\to t}(x))\big)\, J_{t_0\to t}(x)\, d\nu(x).
\]
Then formula (17.13) implies
\[
U_\nu(\mu_t) = \int_M U\!\left(\frac{\rho_{t_0}(x)}{J_{t_0\to t}(x)}\right) J_{t_0\to t}(x)\, d\nu(x). \tag{17.14}
\]
Since the contribution of $\{\rho_{t_0} = 0\}$ does not matter, this can be rewritten
\[
U_\nu(\mu_t) = \int_M U\!\left(\frac{\rho_{t_0}(x)}{J_{t_0\to t}(x)}\right) \frac{J_{t_0\to t}(x)}{\rho_{t_0}(x)}\, \rho_{t_0}(x)\, d\nu(x)
= \int_M \delta(t,x)^N\, U\big(\delta(t,x)^{-N}\big)\, d\mu_{t_0}(x)
= \int_M w(t,x)\, d\mu_{t_0}(x),
\]
where $w(t,x) := \delta(t,x)^N\, U(\delta(t,x)^{-N})$, and
\[
\delta(t,x) := \left(\frac{J_{t_0\to t}(x)}{\rho_{t_0}(x)}\right)^{\!\frac1N} = \rho_t\big(T_{t_0\to t}(x)\big)^{-\frac1N}.
\]
Up to a factor which does not depend on $t$, $\delta(\,\cdot\,,x)$ coincides with $\mathcal{D}(t)$ in the notation of Chapter 14. So, by Theorem 14.8, for $\mu_{t_0}$-almost all $x$ one has
\[
\ddot\delta(t,x) \le -\,\frac{K}{N}\, \big|\nabla\psi_t(T_{t_0\to t}(x))\big|^2\, \delta(t,x).
\]
Set $u(\delta) = \delta^N\,U(\delta^{-N})$, so that $w = u\circ\delta$, where $\delta$ is shorthand for $\delta(\,\cdot\,,x)$ and $x$ is fixed. Since $u'' \ge 0$ and $u'(\,\cdot\,) = -\,N\,p(r)/r^{1-1/N} \le 0$, one has, with $r = \delta^{-N}$,
\[
\frac{\partial^2 w}{\partial t^2} = \frac{\partial^2 u}{\partial\delta^2}\,\big(\dot\delta(t)\big)^2 + \frac{\partial u}{\partial\delta}\,\ddot\delta(t)
\ge \left(-\,\frac{N\,p(r)}{r^{1-\frac1N}}\right)\left(-\,\frac{K}{N}\,\big|\nabla\psi_t(T_{t_0\to t}(x))\big|^2\,\delta(t)\right).
\]
By combining this with the definition of $K_{N,U}$, one obtains
\[
\ddot w(t,x) \ge K_{N,U}\,\delta(t,x)\,\big|\nabla\psi_t(T_{t_0\to t}(x))\big|^2
= K_{N,U}\,\rho_t\big(T_{t_0\to t}(x)\big)^{-\frac1N}\,\big|\nabla\psi_t(T_{t_0\to t}(x))\big|^2. \tag{17.15}
\]
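For the record, the formula $u'(\delta) = -N\,p(r)/r^{1-1/N}$ used above follows by direct differentiation; a worked version of this step (my computation, consistent with the definitions $u(\delta) = \delta^N U(\delta^{-N})$, $r = \delta^{-N}$ and $p(r) = r\,U'(r) - U(r)$ in the text):

```latex
% Differentiate u(\delta) = \delta^N U(\delta^{-N}), writing r = \delta^{-N}:
\begin{align*}
u'(\delta) &= N\,\delta^{N-1}\,U(r) + \delta^N\,U'(r)\,(-N)\,\delta^{-N-1}
            = N\,\delta^{-1}\bigl[\delta^N U(r) - U'(r)\bigr] \\
           &= -\,N\,\delta^{-1}\,\frac{r\,U'(r) - U(r)}{r}
            = -\,\frac{N\,p(r)}{r^{1-1/N}} \;\le\; 0,
\end{align*}
% using \delta^N = r^{-1} and \delta^{-1} = r^{1/N}; p(r) \ge 0 is the
% monotonicity part of U \in DC_N (u nonincreasing), while u'' \ge 0 is
% exactly the convexity of \delta \mapsto \delta^N U(\delta^{-N}).
```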

Since $w$ is a continuous function of $t$, this implies (recall Proposition 16.2)
\[
w(t,x) - (1-t)\,w(0,x) - t\,w(1,x)
\le -\,K_{N,U} \int_0^1 \rho_s\big(T_{t_0\to s}(x)\big)^{-\frac1N}\,\big|\nabla\psi_s(T_{t_0\to s}(x))\big|^2\, G(s,t)\, ds.
\]
Upon integration against $\mu_{t_0}$ and use of Fubini's theorem, this inequality becomes
\[
U_\nu(\mu_t) - (1-t)\,U_\nu(\mu_0) - t\,U_\nu(\mu_1)
\le -\,K_{N,U} \int_M \left( \int_0^1 \rho_s\big(T_{t_0\to s}(x)\big)^{-\frac1N}\,\big|\nabla\psi_s(T_{t_0\to s}(x))\big|^2\, G(s,t)\, ds \right) d\mu_{t_0}(x)
\]
\[
= -\,K_{N,U} \int_0^1\!\!\int_M \rho_s\big(T_{t_0\to s}(x)\big)^{-\frac1N}\,\big|\nabla\psi_s(T_{t_0\to s}(x))\big|^2\, G(s,t)\, d\mu_{t_0}(x)\, ds
\]
\[
= -\,K_{N,U} \int_0^1\!\!\int_M \rho_s(y)^{-\frac1N}\,|\nabla\psi_s(y)|^2\, G(s,t)\, d\mu_s(y)\, ds
= -\,K_{N,U} \int_0^1\!\!\int_M \rho_s(y)^{1-\frac1N}\,|\nabla\psi_s(y)|^2\, G(s,t)\, d\nu(y)\, ds.
\]
This concludes the proof of Property (ii) when $\mu_0$ and $\mu_1$ have compact support.

In a second step I shall relax this compactness assumption by a restriction argument. Let $p \in [2, +\infty) \cup \{c\}$ satisfy the assumptions of Theorem 17.8, and let $\mu_0,\mu_1$ be two probability measures in $P_p^{\rm ac}(M)$. Let $\big((\mu_{t,\ell})_{0\le t\le 1,\,\ell\in\mathbb{N}},\ (\psi_{t,\ell})_{0\le t\le 1,\,\ell\in\mathbb{N}},\ (Z_\ell)_{\ell\in\mathbb{N}}\big)$ be as in Proposition 13.2. Let $\rho_{t,\ell}$ stand for the density of $\mu_{t,\ell}$. By Remark 17.4, the function $U_\ell : r \to U(Z_\ell\, r)$ belongs to $DC_N$; and it is easy to check that $K_{N,U_\ell} = K_{N,U}\, Z_\ell^{1-\frac1N}$. Since the measures $\mu_{t,\ell}$ are compactly supported, we can apply the previous inequality with $\mu_t$ replaced by $\mu_{t,\ell}$ and $U$ replaced by $U_\ell$:
\[
\int U(Z_\ell\,\rho_{t,\ell})\, d\nu \le (1-t) \int U(Z_\ell\,\rho_{0,\ell})\, d\nu + t \int U(Z_\ell\,\rho_{1,\ell})\, d\nu
\]
\[
-\ K_{N,U}\, Z_\ell^{1-\frac1N} \int_0^1\!\!\int_M \rho_{s,\ell}(y)^{1-\frac1N}\,|\nabla\psi_{s,\ell}(y)|^2\, G(s,t)\, d\nu(y)\, ds. \tag{17.16}
\]
It remains to pass to the limit in (17.16) as $\ell\to\infty$. Recall from Proposition 13.2 that $(Z_\ell\,\rho_{t,\ell})$ is a nondecreasing family of functions converging monotonically to $\rho_t$. Since $U_+$ is nondecreasing, it follows that

\[
U_+(Z_\ell\,\rho_{t,\ell}) \uparrow U_+(\rho_t).
\]
On the other hand, the proof of Theorem 17.8 shows that $U_-(r) \le A\,(r + r^{1-\frac1N})$ for some constant $A = A(N,U)$; so
\[
U_-(Z_\ell\,\rho_{t,\ell}) \le A\left(Z_\ell\,\rho_{t,\ell} + (Z_\ell\,\rho_{t,\ell})^{1-\frac1N}\right) \le A\left(\rho_t + \rho_t^{1-\frac1N}\right). \tag{17.17}
\]
By the proof of Theorem 17.8 and Remark 17.12, the function on the right-hand side of (17.17) is $\nu$-integrable, and then we may pass to the limit by dominated convergence. To summarize:
\[
\int U_+(Z_\ell\,\rho_{t,\ell})\, d\nu \;\xrightarrow[\ell\to\infty]{}\; \int U_+(\rho_t)\, d\nu \quad\text{by monotone convergence;}
\]
\[
\int U_-(Z_\ell\,\rho_{t,\ell})\, d\nu \;\xrightarrow[\ell\to\infty]{}\; \int U_-(\rho_t)\, d\nu \quad\text{by dominated convergence.}
\]
So we can pass to the limit in the first three terms appearing in the inequality (17.16). As for the last term, note that $|\nabla\psi_{s,\ell}(y)|^2 = d\big(y, T_{s\to 1,\ell}(y)\big)^2/(1-s)^2$, at least $\mu_{s,\ell}(dy)$-almost surely; but then according to Proposition 13.2 this coincides with $d\big(y, T_{s\to 1}(y)\big)^2/(1-s)^2 = |\nabla\widetilde\psi_s(y)|^2$. So the last term in (17.16) can be rewritten as
\[
-\,K_{N,U} \int_0^1\!\!\int_M \big(Z_\ell\,\rho_{s,\ell}(y)\big)^{1-\frac1N}\,|\nabla\widetilde\psi_s(y)|^2\, G(s,t)\, d\nu(y)\, ds,
\]
and by monotone convergence this goes to
\[
-\,K_{N,U} \int_0^1\!\!\int_M \rho_s(y)^{1-\frac1N}\,|\nabla\widetilde\psi_s(y)|^2\, G(s,t)\, d\nu(y)\, ds
\]
as $\ell\to\infty$. Thus we have passed to the limit in all terms of (17.16), and the proof of (i) ⇒ (ii) is complete.

Since the implication (ii) ⇒ (iii) is trivial, to conclude the proof of Theorem 17.15 it only suffices to prove (iii) ⇒ (i). So let $x_0 \in M$; the goal is to show that $\mathrm{Ric}_{N,\nu}(x_0) \ge K g_{x_0}$, where $g$ is the Riemannian metric. Let $r > 0$ be such that $H_{N,\nu}$ is $K\Lambda$-displacement convex in $B_r(x_0)$. Let $v_0 \ne 0$ be a tangent vector at $x_0$. For a start, assume $N > n$. As in the proof of Theorem 14.8, we can construct $\widetilde\psi \in C^2(M)$, compactly supported in $B_r(x_0)$, such that $\nabla\widetilde\psi(x_0) = v_0$, $\nabla^2\widetilde\psi(x_0) = \lambda_0\,I_n$ ($I_n$ is the identity on $T_{x_0}M$) and
\[
\left[\Gamma_2(\widetilde\psi) - \frac{(L\widetilde\psi)^2}{N}\right](x_0) = \mathrm{Ric}_{N,\nu}(v_0).
\]
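For orientation, the displayed identity is a finite-dimensional computation once one recalls (as in Chapter 14) the Bakry–Émery formula $\Gamma_2(\widetilde\psi) = \|\nabla^2\widetilde\psi\|_{HS}^2 + (\mathrm{Ric}+\nabla^2 V)(\nabla\widetilde\psi)$ for $L = \Delta - \nabla V\cdot\nabla$. The following sketch of the choice of $\lambda_0$ is my own computation, consistent with $\mathrm{Ric}_{N,\nu} = \mathrm{Ric} + \nabla^2 V - \frac{\nabla V\otimes\nabla V}{N-n}$:

```latex
% With \nabla\tilde\psi(x_0) = v_0 and \nabla^2\tilde\psi(x_0) = \lambda_0 I_n:
\Bigl[\Gamma_2(\tilde\psi) - \tfrac{(L\tilde\psi)^2}{N}\Bigr](x_0)
  = n\lambda_0^2 + (\mathrm{Ric}+\nabla^2 V)(v_0)
    - \frac{\bigl(n\lambda_0 - a\bigr)^2}{N},
\qquad a := \nabla V(x_0)\cdot v_0.
% Choosing \lambda_0 = -a/(N-n) (recall N > n here):
n\lambda_0^2 - \frac{(n\lambda_0 - a)^2}{N}
  = \frac{n\,a^2}{(N-n)^2} - \frac{N\,a^2}{(N-n)^2}
  = -\,\frac{a^2}{N-n},
% so the bracket equals
% (\mathrm{Ric}+\nabla^2 V)(v_0) - a^2/(N-n) = \mathrm{Ric}_{N,\nu}(v_0).
```

This choice of $\lambda_0$ also explains why, in the expansions below, the last square bracket in (17.19) only contributes at higher order in $\theta$.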

Then let $\psi := \theta\widetilde\psi$, where $\theta$ is a positive real number. If $\theta$ is small enough, $\psi$ is $d^2/2$-convex by Theorem 13.5, and $|\nabla\psi| \le r/2$. Let $\rho_0$ be a smooth probability density, supported in $B_\eta(x_0)$, with $\eta < r/2$; and
\[
\mu_0 = \rho_0\,\nu; \qquad \mu_t = \exp(t\,\nabla\psi)_\#\,\mu_0.
\]
Then $(\mu_t)_{0\le t\le 1}$ is a geodesic in $P_2(M)$, entirely supported in $B_r(x_0)$, so condition (iii) implies
\[
H_{N,\nu}(\mu_t) - (1-t)\,H_{N,\nu}(\mu_0) - t\,H_{N,\nu}(\mu_1)
\le -\,K \int_0^1\!\!\int \rho_s(x)^{1-\frac1N}\,|\nabla\psi_s(x)|^2\, G(s,t)\, d\nu(x)\, ds. \tag{17.18}
\]
As in the proof of (i) ⇒ (ii), let $\mathcal{J}(t,x)$ be the Jacobian determinant of the map $\exp(t\,\nabla\psi)$ at $x$, and let $\delta(t,x) = \mathcal{J}(t,x)^{1/N}$. (This amounts to choosing $t_0 = 0$ in the computations above; now this is not a problem since $\exp(t\,\nabla\psi)$ is for sure Lipschitz.) Further, let $\gamma(t,x) = \exp_x(t\,\nabla\psi(x))$. Formula (14.39) becomes
\[
-\,N\,\frac{\ddot\delta(t,x)}{\delta(t,x)} = \mathrm{Ric}_{N,\nu}\big(\dot\gamma(t,x)\big)
+ \left\| U(t,x) - \frac{\mathrm{tr}\,U(t,x)}{n}\,I_n \right\|_{HS}^2
\]
\[
+\ \frac{N-n}{nN}\left[\mathrm{tr}\,U(t,x) + \frac{n}{N-n}\,\dot\gamma(t,x)\cdot\nabla V\big(\gamma(t,x)\big)\right]^2, \tag{17.19}
\]
where $U(t,x)$ solves the nonlinear differential equation $\dot U + U^2 + R = 0$, $U(0,x) = \nabla^2\psi(x)$, and $R$ is defined by (14.7). By using all this information, we shall derive expansions of (17.19) as $\theta\to 0$, $\widetilde\psi$ being fixed. First of all, $x = x_0 + O(\theta)$ (this is informal writing to mean that $d(x_0,x) = O(\theta)$); then, by smoothness of the exponential map, $\dot\gamma(t,x) = \theta\,v_0 + O(\theta^2)$; it follows that $\mathrm{Ric}_{N,\nu}(\dot\gamma(t,x)) = \theta^2\,\mathrm{Ric}_{N,\nu}(v_0) + O(\theta^3)$. Next, $U(0) = \nabla^2\psi(x) = \lambda_0\,\theta\,I_n + O(\theta^2)$ and $R(t) = O(\theta^2)$; by an elementary comparison argument, $\dot U(t,x) = O(\theta^2)$, so $U(t,x) = \lambda_0\,\theta\,I_n + O(\theta^2)$. Also $U - (\mathrm{tr}\,U/n)\,I_n = O(\theta^2)$, $\mathrm{tr}\,U(t) = \lambda_0\,\theta\,n + O(\theta^2)$, and $\dot\gamma(t,x)\cdot\nabla V(\gamma(t,x)) + ((N-n)/n)\,\mathrm{tr}\,U(t,x) = O(\theta^2)$.
Plugging all these expansions into (17.19), we get
\[
\frac{\ddot\delta(t,x)}{\delta(t,x)} = -\,\frac1N\left(\theta^2\,\mathrm{Ric}_{N,\nu}(v_0) + O(\theta^3)\right). \tag{17.20}
\]
By repeating the proof of (i) ⇒ (ii) with $U = U_N$ and using (17.20), one obtains

\[
H_{N,\nu}(\mu_t) - (1-t)\,H_{N,\nu}(\mu_0) - t\,H_{N,\nu}(\mu_1)
\ge -\left(\theta^2\,\mathrm{Ric}_{N,\nu}(v_0) + O(\theta^3)\right) \int_0^1\!\!\int_M \rho_s(y)^{1-\frac1N}\, G(s,t)\, d\nu(y)\, ds. \tag{17.21}
\]
On the other hand, by (17.18),
\[
H_{N,\nu}(\mu_t) - (1-t)\,H_{N,\nu}(\mu_0) - t\,H_{N,\nu}(\mu_1)
\le -\,K \int_0^1\!\!\int_M \rho_s(y)^{1-\frac1N}\,|\dot\gamma(s,y)|^2\, G(s,t)\, d\nu(y)\, ds
\]
\[
= -\,K\left(\theta^2\,|v_0|^2 + O(\theta^3)\right) \int_0^1\!\!\int_M \rho_s(y)^{1-\frac1N}\, G(s,t)\, d\nu(y)\, ds.
\]
Combining this with (17.21) and canceling out the multiplicative factor $\theta^2 \int_0^1\!\int_M \rho_s(y)^{1-\frac1N}\, G(s,t)\, d\nu(y)\, ds$ on both sides, we obtain $\mathrm{Ric}_{N,\nu}(v_0) \ge K\,|v_0|^2 + O(\theta)$. The limit $\theta\to 0$ yields
\[
\mathrm{Ric}_{N,\nu}(v_0) \ge K\,|v_0|^2, \tag{17.22}
\]
and since $v_0$ was arbitrary this concludes the proof.

The previous argument was under the assumption $N > n$. If $N = n$ and $V$ is constant, the proof is the same but the modified Ricci tensor $\mathrm{Ric}_{N,\nu}$ is replaced by the usual Ricci tensor.

It remains to rule out the cases when $N = n$ and $V$ is nonconstant, or $N < n$. In these situations (17.19) should be replaced by
\[
-\,N\,\frac{\ddot\delta(t,x)}{\delta(t,x)} = \big(\mathrm{Ric} + \nabla^2 V\big)\big(\dot\gamma(t,x)\big)
+ \left\| U(t,x) - \frac{\mathrm{tr}\,U(t,x)}{n}\,I_n \right\|_{HS}^2
\]
\[
+\ \frac{\big(\mathrm{tr}\,U(t,x)\big)^2}{n} - \frac{\big(\mathrm{tr}\,U(t,x) - \dot\gamma(t,x)\cdot\nabla V(\gamma(t,x))\big)^2}{N}. \tag{17.23}
\]
(To see this, go back to Chapter 14, start again from (14.33) but this time don't apply (14.34).) Repeat the same argument as before, with $\widetilde\psi$ satisfying $\nabla\widetilde\psi(x_0) = v_0$ and $\nabla^2\widetilde\psi(x_0) = \lambda_0\,I_n$ (now $\lambda_0$ is arbitrary). Then instead of (17.22) one retrieves
\[
\big(\mathrm{Ric} + \nabla^2 V\big)(v_0) + \lambda_0^2\,n^2\left(\frac1n - \frac1N\right)
+ \frac{2\lambda_0\,n\,\big(\nabla V(x_0)\cdot v_0\big)}{N} - \frac{\big(\nabla V(x_0)\cdot v_0\big)^2}{N}
\ \ge\ K\,|v_0|^2. \tag{17.24}
\]
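The $\lambda_0$-dependence of the left-hand side of (17.24) is governed by the coefficient $n^2(1/n - 1/N)$ of $\lambda_0^2$: it vanishes for $N = n$ and is negative for $N < n$. A small numerical illustration (my own sketch; the curvature values `ric` and `a` are made-up placeholders, not from the book):

```python
# Coefficient of lambda_0^2 on the left of (17.24): n^2 (1/n - 1/N).
def lam2_coeff(n, N):
    return n ** 2 * (1.0 / n - 1.0 / N)

assert lam2_coeff(3, 3) == 0.0   # N = n: expression is affine in lambda_0
assert lam2_coeff(3, 2) < 0      # N < n: quadratic opening downward
assert lam2_coeff(3, 5) > 0      # N > n: the admissible case

# For N < n the whole left-hand side is unbounded below in lambda_0:
def lhs(lam, n=3, N=2, ric=1.0, a=0.5):
    # ric ~ (Ric + Hess V)(v_0), a ~ grad V(x_0) . v_0 -- placeholder values
    return ric + lam2_coeff(n, N) * lam ** 2 + 2 * lam * n * a / N - a ** 2 / N

assert lhs(100.0) < lhs(0.0) and lhs(-100.0) < lhs(0.0)
```

This is exactly the dichotomy exploited next: affine in $\lambda_0$ forces the slope to vanish when $N = n$, and a negative leading coefficient gives a contradiction when $N < n$.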

If $N = n$, the left-hand side of (17.24) is an affine expression of $\lambda_0$, so the inequality is possible only if the slope is zero, i.e. $v_0\cdot\nabla V(x_0) = 0$. Since $x_0$ and $v_0$ are arbitrary, actually $\nabla V = 0$, so $V$ is constant (and $\mathrm{Ric}(v_0) \ge K\,|v_0|^2$). If $N < n$, what we have on the left of (17.24) is a quadratic expression of $\lambda_0$ with negative dominant coefficient, so it cannot be bounded below. This contradiction establishes $N \ge n$. ⊓⊔

Remark 17.21. A completely different (and much more general) proof of the inequality $N \ge n$ will be given later in Corollary 30.13.

Exercise 17.22 (Necessary condition for displacement convexity). This exercise shows that the elements of $DC_N$ are essentially the only nonlinearities leading to displacement convex functionals. Let $N$ be a positive integer, $M = \mathbb{R}^N$ and let $\nu$ be the Lebesgue measure in $\mathbb{R}^N$. Let $U$ be a measurable function $\mathbb{R}_+ \to \mathbb{R}$ such that $U_\nu$ is lower semicontinuous and convex on the space $P_c^{\rm ac}(\mathbb{R}^N)$ (absolutely continuous, compactly supported probability measures), equipped with the distance $W_2$. Show that (a) $U$ itself is convex lower semicontinuous; (b) $\delta \to \delta^N\,U(\delta^{-N})$ is convex. Hint: To prove (b), consider the geodesic curve $(\mu_\delta)_{\delta>0}$, where $\mu_\delta$ is the uniform probability measure on $B_\delta(0)$.

Exercise 17.23. Show that if $(M,\nu)$ satisfies $CD(K,N)$ and $U$ belongs to $DC_N$, then $U_\nu$ is $KR^{-1/N}$-displacement convex, when restricted to the geodesically convex set defined by $\|\rho\|_{L^\infty(\nu)} \le R$. In short, $U_\nu$ is $K\,\|\rho\|^{-1/N}$-displacement convex. Hint: Use $U(r) = r^m$ and let $m\to\infty$. (A longer solution is via the proof of Corollary 19.5.)

To conclude this section, I shall establish sufficient conditions for the time-integral appearing in (17.11) to be finite.

Proposition 17.24 (Finiteness of time-integral in displacement convexity inequality).
Let $M$ be a Riemannian manifold equipped with a reference measure $\nu = e^{-V}\,\mathrm{vol}$, $V \in C(M)$, and let $\rho_0, \rho_1$ be two probability densities on $M$. Let $\widetilde\psi$ be $d^2/2$-convex, such that $T = \exp(\nabla\widetilde\psi)$ is the optimal Monge transport between $\mu_0 = \rho_0\,\nu$ and $\mu_1 = \rho_1\,\nu$, for the cost function $c(x,y) = d(x,y)^2$. Let $(\mu_t)_{0\le t\le 1}$ be the displacement interpolation between $\mu_0$ and $\mu_1$, and let $\rho_t$ be the density of $\mu_t$. Then for any $t \in [0,1)$:

(i) $\displaystyle\int_M |\nabla\widetilde\psi_t|^2\,\rho_t\, d\nu = W_2(\mu_0,\mu_1)^2$.

(ii) Let $z$ be any point in $M$. If $N \in (1,\infty)$ and $q > 2N/(N-1)$ is such that $\mu_0,\mu_1 \in P_q^{\rm ac}(M)$ and

\[
\exists\,\delta > 0; \qquad \int_M \frac{d\nu(x)}{\big(1 + d(z,x)\big)^{q(N-1)-2N-\delta}} < +\infty,
\]
then
\[
\int_0^1 \left( \int_M \rho_t^{1-\frac1N}\,|\nabla\widetilde\psi_t|^2\, d\nu \right) (1-t)\, dt < +\infty.
\]
More precisely, there are constants $C = C(N,q,\delta) > 0$ and $\theta = \theta(N,q,\delta) \in (0,\,1-1/N)$ such that
\[
\int_M \rho_t^{1-\frac1N}\,|\nabla\widetilde\psi_t|^2\, d\nu \tag{17.25}
\]
\[
\le\ \frac{C}{(1-t)^{2-2\theta}}\ W_2(\mu_0,\mu_1)^{2\theta}
\left( \int \frac{\nu(dx)}{\big(1+d(z,x)\big)^{q(N-1)-2N-\delta}} \right)^{\!\frac1N}
\left( \int \big(1 + d(z,x)^q\big)\,\mu_0(dx) + \int \big(1 + d(z,x)^q\big)\,\mu_1(dx) \right)^{\!\frac{(1-\theta)N-1}{N}}.
\]

Proof of Proposition 17.24. First, $|\nabla\widetilde\psi_t(x)| = d\big(x, T_{t\to 1}(x)\big)/(1-t)$, where $T_{t\to 1}$ is the optimal transport $\mu_t \to \mu_1$. So
\[
\int |\nabla\widetilde\psi_t|^2\,\rho_t\, d\nu = \frac{W_2(\mu_t,\mu_1)^2}{(1-t)^2} = W_2(\mu_0,\mu_1)^2.
\]
This proves (i).

To prove (ii), I shall start from
\[
(1-t) \int \rho_t(x)^{1-\frac1N}\,|\nabla\widetilde\psi_t(x)|^2\,\nu(dx)
= \frac{1}{1-t} \int \rho_t(x)^{1-\frac1N}\, d\big(x, T_{t\to 1}(x)\big)^2\,\nu(dx).
\]
Let us first see how to bound the integral on the right-hand side, without worrying about the factor $(1-t)^{-1}$ in front. This can be done with the help of Jensen's inequality, in the same spirit as the proof of Theorem 17.8: If $r \ge 0$ is to be chosen later, then
\[
\int \rho_t(x)^{1-\frac1N}\, d\big(x, T_{t\to 1}(x)\big)^2\,\nu(dx)
\le \left( \int \rho_t(x)\, d\big(x, T_{t\to 1}(x)\big)^{\frac{2N}{N-1}} \big(1+d(z,x)\big)^{r}\,\nu(dx) \right)^{\!\frac{N-1}{N}}
\left( \int \frac{\nu(dx)}{\big(1+d(z,x)\big)^{r(N-1)}} \right)^{\!\frac1N}
\]
\[
\le \left( \int \rho_t(x)\,\big(d(z,x) + d(z, T_{t\to 1}(x))\big)^{\frac{2N}{N-1}} \big(1+d(z,x)\big)^{r}\,\nu(dx) \right)^{\!\frac{N-1}{N}}
\left( \int \frac{\nu(dx)}{\big(1+d(z,x)\big)^{r(N-1)}} \right)^{\!\frac1N}
\]

\[
\le\ C \left( \int \rho_t(x) \left( 1 + d(z,x)^{\,r+\frac{2N}{N-1}} + d\big(z, T_{t\to 1}(x)\big)^{\,r+\frac{2N}{N-1}} \right) \nu(dx) \right)^{\!\frac{N-1}{N}}
\]
\[
=\ C \left( \int \rho_t(x) \left( 1 + d(z,x)^{\,r+\frac{2N}{N-1}} \right) \nu(dx) + \int \rho_1(y)\, d(z,y)^{\,r+\frac{2N}{N-1}}\,\nu(dy) \right)^{\!\frac{N-1}{N}},
\]
where
\[
C = C(r,N,\nu) = C(r,N) \left( \int \frac{\nu(dx)}{\big(1+d(z,x)\big)^{r(N-1)}} \right)^{\!\frac1N},
\]
and $C(r,N)$ stands for some constant depending only on $r$ and $N$. By Remark 17.12, the previous expression is bounded by
\[
C(r,N,\nu) \left( \int \left( 1 + d(z,x)^{\,r+\frac{2N}{N-1}} \right) d\mu_0(x) + \int \left( 1 + d(z,x)^{\,r+\frac{2N}{N-1}} \right) d\mu_1(x) \right)^{\!\frac{N-1}{N}};
\]
and the choice $r = q - 2N/(N-1)$ leads to
\[
\int \rho_t(x)^{1-\frac1N}\, d\big(x, T_{t\to 1}(x)\big)^2\,\nu(dx)
= (1-t)^2 \int \rho_t(x)^{1-\frac1N}\,|\nabla\widetilde\psi_t(x)|^2\,\nu(dx) \tag{17.26}
\]
\[
\le\ C(N,q) \left( \int \big(1 + d(z,x)^q\big)\,\mu_0(dx) + \int \big(1 + d(z,x)^q\big)\,\mu_1(dx) \right)^{\!\frac{N-1}{N}}
\left( \int \frac{d\nu(x)}{\big(1+d(z,x)\big)^{q(N-1)-2N}} \right)^{\!\frac1N}.
\]
This estimate is not enough to establish the convergence of the time-integral, since $\int_0^1 dt/(1-t) = +\infty$; so we need to gain some cancellation as $t \to 1$. The idea is to interpolate with (i), and this is where $\delta$ will be useful. Without loss of generality, we may assume $\delta < q(N-1) - 2N$.

Let $\theta \in (0,\,1-1/N)$ to be chosen later, and $N' = (1-\theta)N \in (1,\infty)$. By Hölder's inequality,
\[
\int \rho_t(x)^{1-\frac1N}\, d\big(x, T_{t\to 1}(x)\big)^2\,\nu(dx)
\le \left( \int \rho_t(x)\, d\big(x, T_{t\to 1}(x)\big)^2\,\nu(dx) \right)^{\!\theta}
\left( \int \rho_t(x)^{1-\frac{1}{N'}}\, d\big(x, T_{t\to 1}(x)\big)^2\,\nu(dx) \right)^{\!1-\theta}
\]
\[
=\ W_2(\mu_t,\mu_1)^{2\theta} \left( \int \rho_t(x)^{1-\frac{1}{N'}}\, d\big(x, T_{t\to 1}(x)\big)^2\,\nu(dx) \right)^{\!1-\theta}
=\ (1-t)^{2\theta}\, W_2(\mu_0,\mu_1)^{2\theta} \left( \int \rho_t(x)^{1-\frac{1}{N'}}\, d\big(x, T_{t\to 1}(x)\big)^2\,\nu(dx) \right)^{\!1-\theta}.
\]

Since $N' > 1$ we can apply the preceding computation with $N$ replaced by $N'$:
\[
\int \rho_t(x)^{1-\frac{1}{N'}}\, d\big(x, T_{t\to 1}(x)\big)^2\,\nu(dx)
\le\ C(N',q) \left( \int \big(1 + d(z,x)^q\big)\,\mu_0(dx) + \int \big(1 + d(z,x)^q\big)\,\mu_1(dx) \right)^{\!\frac{N'-1}{N'}}
\left( \int \frac{\nu(dx)}{\big(1+d(z,x)\big)^{q(N'-1)-2N'}} \right)^{\!\frac{1}{N'}}.
\]
Then we may choose $\theta$ so that $q(N'-1) - 2N' = q(N-1) - 2N - \delta$; that is, $\theta = \delta/((q-2)N) \in (0,\,1-1/N)$. The conclusion follows. ⊓⊔

Ricci curvature bounds from distorted displacement convexity

In Theorem 17.15, all the influence of the Ricci curvature bounds lies in the additi