Cycle Consistent Probability Divergences
Across Different Spaces
Abstract
Discrepancy measures between probability distributions are at the core of statistical inference and machine learning. In many applications, distributions of interest are supported on different spaces, and yet a meaningful correspondence between data points is desired. Motivated to explicitly encode consistent bidirectional maps into the discrepancy measure, this work proposes a novel unbalanced Monge optimal transport formulation for matching, up to isometries, distributions on different spaces. Our formulation arises as a principled relaxation of the GromovHaussdroff distance between metric spaces, and employs two cycleconsistent maps that push forward each distribution onto the other. We study structural properties of the proposed discrepancy and, in particular, show that it captures the popular cycleconsistent generative adversarial network (GAN) framework as a special case, thereby providing the theory to explain it. Motivated by computational efficiency, we then kernelize the discrepancy and restrict the mappings to parametric function classes. The resulting kernelized version is coined the generalized maximum mean discrepancy (GMMD). Convergence rates for empirical estimation of GMMD are studied and experiments to support our theory are provided.
1 Introduction
Discrepancy measures between probability distributions are ubiquitous in machine learning. In practice, distributions of interests are often supported on different spaces and the goal is not only to quantify discrepancy, but also to obtain a meaningful and consistent correspondence between data points. Such problems arise, e.g., in natural language processing for unsupervised matching across different languages or ontologies (AlvarezMelis and Jaakkola, 2018; Grave et al., 2019; AlvarezMelis et al., 2020), shape matching (Bronstein et al., 2006a; Mémoli, 2011; Xu et al., 2019), heterogenous domain adaptation (Yan et al., 2018), generative modeling (Bunne et al., 2019), and many more.
Among the most popular discrepancies between distributions on incompatible spaces is the GromovWasserstein distance (GW) (Mémoli, 2011) (see Séjourné et al. (2020) for an unbalanced variant). Computationally, the GW distance amounts to a quadratic assignment problem that is NP hard (Commander, 2005). To alleviate this impasse, Peyré et al. (2016) proposed an entropic regularization to the GW problem, and derived an algorithm with cubic complexity in the number of samples. More recently, Vayer et al. (2019) proposed slicing the GW distance, which further reduces the computational complexity to . Despite these algorithmic advances, a common issue with GWbased discrepancies is their lack of generalization to new data points. These approaches only quantify the distance without generating a map that captures the correspondence. This requires recomputing the distance whenever one wants to account for new data points, thereby incurring an additional cost. This shortcoming motivate us to explore computationally friendly discrepancies between distributions on different spaces that explicitly encode consistent, bidirectional measure preserving mappings that capture the correspondence.
This is similar in spirit to the recent interest in learning Monge optimal transport (OT) maps (Perrot et al., 2016; Makkuva et al., 2020; Paty et al., 2020; Flamary et al., 2019).
Specifically, we propose a novel unbalanced divergence between probability measures supported on different spaces that explicitly employs two cycleconsistent maps that (approximately) push each distribution onto another. Cycleconsistency here is in the context of cycle generative adversarial networks (Zhu et al., 2017; Kim et al., 2017) (GANs), which requires that the two pushforward maps are roughly inverses one of another. Note that Mémoli and Needham (2021) recently proposed a quasimetric called the GromovMonge distance that employs a single push forward map. A key advantage of our approach is its cycle consistency that provides a mathematical framework for the popular cycle GAN and enables a principled study thereof.
The main contributions of this paper are:

[leftmargin=*,align=parleft]

We introduce and study in Section 3 the unbalanced bidirectional GromovMonge (UBGM) divergence, drawing connections between UBGM and the popular cycle GAN framework.

Motivated by computational efficiency, we kernelize UBGM in Section 4 and restrict mappings to parametric function classes, such as neural networks (NNs). We call the resulting divergence the generalized maximum mean discrepancy (GMMD). We then derive convergence rates for twosample empirical estimation of GMMD.

We present numerical results in Section 5 that support our theory and demonstrate the computational efficiency, and generalization capability of the proposed framework for matching across different spaces.
2 Background and Preliminaries
2.1 Notations
Let be a compact metric space. The diameter of a set is . We use to denote the open ball of radius centered at . For , a set is called an cover of if for any , . The covering number of is . When is a subset of , we always use the metric induce by the Euclidean norm, denoted as . For a metric space , the diameter of is defined as .
For , the space over is denoted by with designating the norm. The Lipschitz constant of a function is , with denoting the Lipschitz ball of radius . A mapping between metric spaces is called an isometry if , for all , i.e., preserves the metric structure. For a class of mappings from to , define the supmetric on as .
The probability space on which all random variables are defined is denoted by (assumed to be sufficiently rich), with designating the corresponding expectation. The class of Borel probability measures over is denoted by . For , denotes the fold product measure of . Given a measurable and , the pushforward of through is , for any Borel set . Clearly . A sequence of probability measures converges to a probability measure , denoted as , if for any bounded continuous function on , We also write when , where is a constant depending only on , and write if the omitted constant is universal. Also denote .
2.2 GromovHaussdroff Distance
To motivate our proposed discrepancy measure, we start by recalling the GromovHausdorff distance between metric spaces, which is defined as
(1) 
where the infimum is on an ambient metric space and isometric embeddings and , with as the Hausdroff distance on .
Evidently, the formulation above is not computable. Nevertheless, has several equivalent forms (Mémoli, 2011) that, with appropriate relaxations, can be computed efficiently. Two such forms are as follows.
Correspondence set reformulation.
Following Mémoli and Needham (2021), a correspondence set between and is a set whose projections to and define surjections on . The set of all such correspondences is denoted by . The GromovHausdroff distance can be reformulated as
(2) 
where is the pointwise distortion between and . Note that (2) can be written in the following compact form as an norm:
(3) 
Two mappings reformulation.
Another important reformulation of the GromovHausdroff distance in terms of mappings between spaces is:
(4) 
where the distortions ,, and are given by^{1}^{1}1The superscript indicates that , , and can be written as norms of the appropriate pointwise distortions (e.g., for ).
Formulation (4) thus measures distance by searching for low distortion maps between the two metric spaces, such that the socalled cycle consistency property holds, i.e., the maps are approximate inverses of one another. More specifically following Mémoli and Sapiro (2005), if then there exists and such that: (1) the induced metric distortions are small, i.e., and ; and (2) these functions are almost inverses one of another, in the sense that and (which follows from by taking and , respectively). Thus, ensures cycle consistency of the maps.
3 Probability Divergences Across Metric Measure Spaces
We are now ready to present the proposed discrepancy between probability measures on different spaces. In conjunction with the preceding discussion, such a discrepancy can equivalently be viewed as a distance between metric measure (mm) spaces (Mémoli, 2011).
Definition 1 (Metric measure spaces).
A metric measure space is a triple where is a compact metric space and has full support.
3.1 The Unbalanced Bidirectional GromovMonge Divergence
Formulation (4) of is computationally appealing and has lead to many algorithmic relaxations that approximate the GromovHausdroff distance (Mémoli and Sapiro, 2004, 2005; Bronstein et al., 2006b). As a stepping stone towards our unbalanced formulation, we first adapt this formulation to an extended metric between two mm spaces and . To that end, we first relax the distortion terms and then restrict the mappings to be measure persevering, as described next.
Let be independent of . For , set
as the relaxation of the distortion terms from (4). Restricting the mappings in (4) to be ‘Monge’ measure preserving, i.e., and , gives rise to th order bidirectional GromovMonge (BGM) distance, as defined next.
Definition 2 (Bidirectional GromovMonge distance).
Fix . The th order BGM distance between and is
where . If the set of measure preserving maps is empty, then we set .
This definition follows a reasoning similar to Formulation (4) of above: We are looking for measure preserving maps , that are low metric distortion (minimizing ) and satisfy a cycle consistency propriety (minimizing , i.e., almost inverses one of another).
The BGM distance defines an extended metric between equivalence classes of mm spaces.
Definition 3 (Equivalence classes).
Two mm spaces and are called equivalent if and only if (iff) there is an invertible isometry such that (and hence ). The set of all such equivalence classes is denoted by .
Proposition 1 (BGM distance metrizes ).
For any , defines an extended metric on .
Remark 1 (Finiteness of BGM).
Let be a mm space, with uncountable and atomless. If is the subcategory of mm spaces isomorphic to (Mémoli and Needham, 2021), then BGM is a finite metric on . This setting is interesting for matching isomorphic mm spaces (of same dimension), in image, shape and text matching applications.
The proof of Proposition 1 is given in Appendix A.1. Positivity, symmetry, and the triangle inequality all follow from elementary calculations. The main challenge is in showing that if nullifies then the considered spaces are isometrically isomorphic. To that end, we use the following lemma, that may be of independent interest (see Appendix A.2 for the proof).
Lemma 1 (Existence of isometries).
Fix and let and be arbitrary function classes such that . Then there exist minimizing sequences and that converge almost surely (a.s.) to isometries and , respectively, such that .
Remark 2 (Convergent sequences).
By definition of isometry, the limits are metric preserving, thus continuous and Lipschitz 1. We don’t know if they still live in unless we impose compactness conditions later on. Notice that the convergence stated in the lemma is guaranteed solely by the first two terms of . If we drop the third term, convergent sequences still exists, but the limits are not guaranteed to be inverse of each other.
While is a valid extended metric on , its evaluation is computationally challenging as it is unclear how to optimize over bidirectional Monge maps. Following the unbalanced OT framework (Chizat et al., 2018; Frogner et al., 2015), we relax the measure preserving constraint using divergences^{2}^{2}2A divergence on is a functional such that iff . and on measures on and , respectively, and restrict the functions to prespecified classes. This gives rise to the unbalanced GromovMonge formulation.
Definition 4 (Unbalanced bidirectional GromovMonge divergence).
Fix , let and be divergences on and , respectively. Further take as a class of mappings from to , and a class of mappings from to . The order UBGM divergence between and is
where is given in Definition 2, and are fixed regularization coefficients.
Evidently, no longer requires optimizing over Monge maps, which alleviates the computational difficulty associated with from Definition 2. The function classes and can also be chosen for computational convenience (e.g., NNs). The following proposition (see Appendix A.3 for the proof) shows that is a continuous divergence.
Proposition 2 ( is a divergence).
Suppose that and are weakly continuous in their arguments, and are rich enough so that they are dense around the isometric bijections if are isometric. Then

[leftmargin=*,align=parleft]

is an upper semicontinuous divergence on , i.e., , for all and , with equality iff there exists isometries on the support of respectively, such that , , and .

If further are compact in the supmetrics and , then is continuous with respect to (w.r.t.) weak convergence.
3.2 Cycle GAN as Unbalanced GromovMonge Divergence
If and are integral probability metrics (IPM)^{3}^{3}3An IPM indexed by a function class is a pseudometric on defined as . (Zolotarev, 1984; Müller, 1997) indexed by the function classes and , respectively, then amounts to a minimax game between the maps and the witness functions and of the IPMs on and respectively, as follows:
(5) 
where and .
Written in this form, we see the similarity to the cycle GAN formulation (Zhu et al., 2017; Kim et al., 2017):
(6) 
The first two terms in (6) encourage and to be approximate inverses of one another, similar to the role of the distortion in from (5). While the original Cycle GAN formulation did not require and to be isometries, this constraint was introduced in followup works (Hoshen and Wolf, 2018). We thus see that Cycle GAN, with a relaxed isometry requirement of and , is a particular instantiation of the UBGM divergence.
3.3 Relation to Past Works
We show briefly here the construction of two wellknown discrepancies between mm spaces, namely the GW distance (Mémoli, 2011), and the GromovMonge (GM) distance (Mémoli and Needham, 2021). The starting point of defining these two distances is the ‘correspondence set formulation’ of from (3).
GromovWasserstein distance.
In a nutshell, the GW distance is an relaxation of the norm in formulation (3) of , along with a Kantorovich relaxation of the correspondence set using couplings.
Definition 5 (GromovWasserstein distance (Mémoli, 2011)).
The GW distance between and is
where , and is the set of all couplings of .
Another closely related formulation is the unbalanced GW distance from Séjourné et al. (2020). For any divergence^{4}^{4}4Séjourné et al. (2020) used divergences (Csiszár, 1967) for and , but we provide a general definition. on , define its twofold extension .
Definition 6 (Unbalanced GromovWasserstein distance (Séjourné et al., 2020)).
Let be the set of all nonnegative Borel measures on . The unbalanced GW distance between and is
where are the marginals of on and .
The unbalanced relaxation of the GW distance is similar to how our UBGM distance (Definition 4) relaxes the BGM distance. A crucial difference is that both the BGM distance and its unbalanced version explicitly encode bidirectional mappings, which are important in applications as they alleviate the need to recompute the coupling matrix given new datapoints.
GromovMonge distance.
More recently, Mémoli and Needham (2021) presented another extension of to a discrepancy between mm spaces. Termed the GM distance, it considers an Monge relaxation of (3), as opposed to the Kanotrovichbased approach of GW. Namely, instead of using couplings, the correspondence set now comprises Monge maps, i.e., all measurable maps s.t. . Also for arbitrary , denote . Clearly for Monge maps that pushes to , .
Definition 7 (GromovMonge distance (Mémoli and Needham, 2021)).
The GM distance between two mm spaces and is
where
Comparing to above, we see that while the latter uses a single low metric distortion maps (with a cost of the form ), our BGM distance uses two such mappings that are approximately inverses (as enforced by ). In a sense our definition is a symmetrized and cycle consistent version of .
4 Kernelization: Generalized Maximum Mean Discrepancy
Motivated by computational considerations, we now instantiate the divergences and in (see Definition 4) as maximum mean discrepancies (MMDs) (Gretton et al., 2012). We coin the resulting kernelized divergence as the generalized maximum mean discrepancy (GMMD). MMDs can be efficiently computed and offer flexibility in picking the proper kernel for each space. We start by reviewing preliminaries on MMDs (Section 4.1), after which we present the kernelized UBGM distance (Section 4.2), and explore its empirical convergence rates (Section 4.3).
4.1 Reproducing Kernel Hilbert Spaces
We define reproducing kernel Hilbert spaces (RKHS) and the associated MMD. For a separable space and a continuous, positive definite, realvalued kernel , let denote the corresponding RKHS, in which for any , we have , for any . See Berlinet and ThomasAgnan (2011) for existence and uniqueness of . There is a natural way to embed into , given by the kernel mean embedding
where . This enables defining a discrepancy measure between probability distribution as the RKHS distance between their kernel mean embeddings.
Definition 8 (Maximum mean discrepancy).
Let be an RKHS. The MMD between is
When the kernel is characteristic, as defined next, metrizes the space of distributions .
Definition 9 (Characteristic kernel).
The kernel of an RKHS is called characteristic if the mean embedding is injective.
Also recall that characteristic kernels enable defining a metric on the space (Sejdinovic et al., 2013). Namely, defining , for , we have that is a metric space. To simplify notation, we henceforth denote by .
4.2 Generalized MMD
The GMMD is defined as follows. Throughout we assume that and are compact with diameters bounded by , and specialize to the case of .
Definition 10 (Generalized MMD between Metric Measure Spaces).
Let and be characteristic kernels on and , respectively. The GMMD between and is
Remark 3 (GMMD is a divergence).
MMDs w.r.t. characteristic kernels are metrics (in particular, divergences) on the space of Borel probability measures. Proposition 2 thus implies that GMMD is a divergence. Further, under the mild condition that are bounded from above, MMDs are weakly continuous. Hence, weak continuity of GMMD also follows from the proposition, given that are compact.
Remark 4 (Kernels specify GMMD).
GMMD can be fully specified by the kernels if one defines the mm spaces using the kernel induced metrics and .
4.3 GMMD Empirical Estimation Rates
We now study the convergence rate of the twosample plugin estimator of GMMD. Let and be i.i.d. samples from and , respectively. Denote by and the empirical measures associated with these samples. Since GMMD is weakly continuous (for compact and ), we immediately have as a.s. The focus of this section is the rate at which this convergence happens.
Theorem 1.
Suppose are uniformly bounded by a constant , and the diameters of and are bounded by . Further suppose that and are compact in and , respectively. Then
where and
with and defined analogously, and
Theorem 1 bounds the estimation error for general function classes and in terms of the appropriate entropy integrals. The proof is given in Appendix A.4 and relies on standard chaining arguments and bounds on Rademacher chaos complexity (cf. e.g, Sriperumbudur (2016)). In general, these entropy integrals cannot be further simplified due to the dependence on the arbitrary classes and . Nevertheless, as discussed next, Theorem 1 can be instantiated to obtain explicit rates for particular function classes of interest.
Remark 5 (Special cases).
Corollary 1 in Appendix A.6 instantiates the result for two important cases: when the function classes are (i) Lipschitz and (ii) parametric. We obtain a convergence rate of for the Lipschitz case when , and for the parametric case. The latter is of particular practical interest, as NNs offer a convenient and trainable model for the bidirectional maps (see next section).
5 Numerical Experiments
We present applications of GMMD in shape matching. We parametrize the bidirectional maps as neural networks with parameters and respectively. Algorithm 1 (Appendix B) summarizes the optimization of the GMMD objective as function of and . All experiments are run on the same machine with 4 core CPUs and a Tesla T4 GPU. The examples below highlight the qualitative and quantitative behavior of GMMD, illustrating the fact that GMMD is applicable to the same tasks as classical methods such as GW and UGW for finding correspondences. Further, GMMD amortizes the computational cost, as it results in continuous mappings that generalize to unseen datapoints drawn from the same distributions.
5.1 GMMD For Shape Matching
We consider here matching of synthetic shapes, specifically a 2dimensional heart shape given in Figure 1(\subreffig:heart) and its transformations through rotation (\subreffig:rotation), scaling (\subreffig:scaling) and isometrically embedding into 3dimensional space (\subreffig:3D). The data is generated via sampling points for each shape. The distributions for each matching experiment are the empirical measures induced by these samples, with corresponding to 1(\subreffig:heart) and , , and corresponding to subfigures (\subreffig:rotation), (\subreffig:scaling), and (\subreffig:3D), respectively (the subscript is suppressed when we simultaneously refer to several experiments).
For matching experiments, we compute GMMD using Algorithm 1 for , where . We use a uniform mixture of Gaussian kernels to define and and use kernel induced metrics , and in the distortion . The bandwidths we used for the Gaussian kernels are median of the metric . The architecture of the bidirectional maps and is a 3layer ReLU NN with 200 neurons each, and an output dimension matching the target distribution dimension. We use Adam optimizer (Kingma and Ba, 2014) for 3000 epochs with a learning rate .
In Figure 2, the first row corresponds to GMMD matching for . For each case, we see that the learned bidirectional maps of GMMD successfully perform the matching, i.e., and . We also confirm that they satisfy the cycle consistency property, i.e., and . The second row shows entropic GW matchings (Peyré et al., 2016). We use the POT library (Flamary et al., 2021) to perform discrete entropic GW for an entropic regularization parameter . Note that entropic GW results in a coupling matrix . To obtain discrete mappings of the points we employ barycentric maps (Ferradans et al., 2014), i.e., and .
We see from Figure 2 that the GMMD continuous maps and the discrete barycentric maps induced by the GW coupling are on par qualitatively in these matching tasks. To confirm this quantitatively, we consider the matching of the heart shape and its rotation (Figure 1(\subreffig:rotation)) since for this case an isometry exists, i.e., there are with . Tables 1 and 2 state the values of , , and across different regularization parameters for the GMMD and GWbased mappings, respectively. We see that GMMD and GW indeed result in small MMD and distortions values. GMMD yields a smaller distortion than GW. Note that we also have evaluated UGW (Séjourné et al., 2020) with the code provided by the authors and found that it is sensitive to hyperparameters choice, and did not result in an accurate matching on the considered tasks. We think that more tuning is needed for UGW. Additional results and ablation on regularization parameters and shapes are given in Appendix B.
Figure 3 presents a more complex matching of 3D shapes that consist in two different biplanes models from the Princeton Shape benchmark (Shilane et al., 2004) (for ). We see that GMMD and GW are also on par and that the GMMD bidirectional maps result in less outliers than barycentric GWbased maps. This robustness of GMMD is due to the use of kernel induced metrics. Quantitative evaluation is presented in Appendix B.
0.256  0.0794  0.0294  0.0294  0.0801 
0.128  0.00825  4.94e4  4.05e4  0.0574 
0.064  0.00776  0.00227  0.00190  0.0560 
0.032  0.0924  2.90e4  0.00386  2.76 
0.0005  0.00134  0.00420  0.00299  0.696 
0.005  0.00660  0.127  0.116  1.73 
0.05  0.0424  0.615  0.613  6.69 
0.5  0.0686  3.99  4.12  22.9 
5.2 GMMD Amortization and Generalization
For the biplane matching experiment with sample size from each distribution, Table 3 reports the training time for GMMD and the runtime for GW and UGW computings. The computational complexity of training GMMD maps amounts to the complexity of gradient descent in NN training for epochs, which is . For entropic GW and UGW, however, the implementations are variants of the Sinkhorn algorithm, whose complexity scales as . The longer training time for GMMD is due to the large number of epochs used in gradient descent (namely, ), but at inference time this cost is amortized since we obtain continuous maps that generalize to unseen datapoints (see Appendix B for quantitative evaluation of the generalization). For instance, matching 8000 new datapoints sampled from and each using the learned mapping requires 63 ms, while with GW one would incur the cost of recomputing the coupling (26 minutes)—a three order of magnitude speedup.
0.0005  1566.86  0.002  5048.11 


0.256  5026.5  
0.1  28.7508  0.064  5052.89 
6 Conclusion
This paper introduced the UBGM divergence—a novel discrepancy measure between distributions across heterogeneous spaces, which employs bidirectional and cycleconsistent mappings. We established structural properties of the UBGM divergence and highlighted its intimate connection to the socalled cycle GAN. We also presented a kernelized variant of this divergence, termed GMMD, and analysed its statistical estimation from samples. Numerical experiments demonstrated the promise of this new divergence and compared it to other known metrics, such as the GW and UGW distances. Appealing future directions include extending the GMMD to allow optimization over kernels, sharper statistical bounds, as well as connections between the UBGM divergence and the UGW distance (in particular, under what conditions they coincide).
References
 AlvarezMelis and Jaakkola (2018) D. AlvarezMelis and T. S. Jaakkola. GromovWasserstein alignment of word embedding spaces. arXiv preprint arXiv:1809.00013, 2018.
 AlvarezMelis et al. (2020) D. AlvarezMelis, Y. Mroueh, and T. Jaakkola. Unsupervised hierarchy matching with optimal transport over hyperbolic spaces. In S. Chiappa and R. Calandra, editors, Proceedings of the Twenty Third International Conference on Artificial Intelligence and Statistics (AISTATS2020), volume 108 of Proceedings of Machine Learning Research, pages 1606–1617, 2020.
 Berlinet and ThomasAgnan (2011) A. Berlinet and C. ThomasAgnan. Reproducing kernel Hilbert spaces in probability and statistics. Springer Science & Business Media, 2011.
 Bojanowski et al. (2016) P. Bojanowski, E. Grave, A. Joulin, and T. Mikolov. Enriching word vectors with subword information. arXiv preprint arXiv:1607.04606, 2016.
 Bronstein et al. (2006a) A. M. Bronstein, M. M. Bronstein, and R. Kimmel. Generalized multidimensional scaling: A framework for isometryinvariant partial surface matching. Proceedings of the National Academy of Sciences, 103(5):1168–1172, 2006a.
 Bronstein et al. (2006b) A. M. Bronstein, M. M. Bronstein, and R. Kimmel. Efficient computation of isometryinvariant distances between surfaces. SIAM Journal on Scientific Computing, 28(5):1812–1836, 2006b.
 Bunne et al. (2019) C. Bunne, D. AlvarezMelis, A. Krause, and S. Jegelka. Learning generative models across incomparable spaces. In Proceedings of the International Conference on Machine Learning (ICML2019), pages 851–861, June 2019.
 Chizat et al. (2018) L. Chizat, G. Peyré, B. Schmitzer, and F.X. Vialard. Unbalanced optimal transport: Dynamic and Kantorovich formulations. Journal of Functional Analysis, 274(11):3090–3123, 2018.
 Commander (2005) C. W. Commander. A survey of the quadratic assignment problem, with applications. Morehead Electronic Journal of Applicable Mathematics, 4:MATH–2005–01, 2005.
 Conneau et al. (2017) A. Conneau, G. Lample, M. Ranzato, L. Denoyer, and H. Jégou. Word translation without parallel data. arXiv preprint arXiv:1710.04087, 2017.
 Csiszár (1967) I. Csiszár. Informationtype measures of difference of probability distributions and indirect observation. Studia Ccientiarum Mathematicarum Hungarica, 2:229–318, 1967.
 De la Pena and Giné (2012) V. De la Pena and E. Giné. Decoupling: from dependence to independence. Springer Science & Business Media, 2012.
 Ferradans et al. (2014) S. Ferradans, N. Papadakis, G. Peyré, and J.F. Aujol. Regularized discrete optimal transport. SIAM Journal on Imaging Sciences, 7(3):1853–1882, 2014.
 Flamary et al. (2019) R. Flamary, K. Lounici, and A. Ferrari. Concentration bounds for linear monge mapping estimation and optimal transport domain adaptation. arXiv preprint arXiv:1905.10155, 2019.
 Flamary et al. (2021) R. Flamary, N. Courty, A. Gramfort, M. Z. Alaya, A. Boisbunon, S. Chambon, L. Chapel, A. Corenflos, K. Fatras, N. Fournier, L. Gautheron, N. T. Gayraud, H. Janati, A. Rakotomamonjy, I. Redko, A. Rolet, A. Schutz, V. Seguy, D. J. Sutherland, R. Tavenard, A. Tong, and T. Vayer. Pot: Python optimal transport. Journal of Machine Learning Research, 22(78):1–8, 2021.
 Frogner et al. (2015) C. Frogner, C. Zhang, H. Mobahi, M. ArayaPolo, and T. Poggio. Learning with a Wasserstein loss. arXiv preprint arXiv:1506.05439, 2015.
 Grave et al. (2019) E. Grave, A. Joulin, and Q. Berthet. Unsupervised alignment of embeddings with Wasserstein procrustes. In Proceedings of the The 22nd International Conference on Artificial Intelligence and Statistics (AISTATS2019), pages 1880–1890. PMLR, 2019.
 Gretton et al. (2012) A. Gretton, K. M. Borgwardt, M. J. Rasch, B. Schölkopf, and A. Smola. A kernel twosample test. The Journal of Machine Learning Research, 13(1):723–773, 2012.
 Hoshen and Wolf (2018) Y. Hoshen and L. Wolf. Unsupervised correlation analysis. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR2018), June 2018.
 Kim et al. (2017) T. Kim, M. Cha, H. Kim, J. K. Lee, and J. Kim. Learning to discover crossdomain relations with generative adversarial networks. In Proceedings of the 34th International Conference on Machine Learning (ICML2017), pages 1857–1865. PMLR, 2017.
 Kingma and Ba (2014) D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
 Makkuva et al. (2020) A. Makkuva, A. Taghvaei, S. Oh, and J. Lee. Optimal transport mapping via input convex neural networks. In Proceedings of the International Conference on Machine Learning (ICML2020), pages 6672–6681. PMLR, 2020.
 Mémoli (2011) F. Mémoli. Gromov–Wasserstein distances and the metric approach to object matching. Foundations of computational mathematics, 11(4):417–487, 2011.
 Mémoli and Needham (2021) F. Mémoli and T. Needham. Distance distributions and inverse problems for metric measure spaces. arXiv preprint arXiv:1810.09646, 2021.
 Mémoli and Sapiro (2004) F. Mémoli and G. Sapiro. Comparing point clouds. In Proceedings of the 2004 Eurographics/ACM SIGGRAPH symposium on Geometry processing, pages 32–40, 2004.
 Mémoli and Sapiro (2005) F. Mémoli and G. Sapiro. A theoretical and computational framework for isometry invariant recognition of point cloud data. Foundations of Computational Mathematics, 5(3):313–347, 2005.
 Müller (1997) A. Müller. Integral probability metrics and their generating classes of functions. Advances in Applied Probability, 29(2):429–443, 1997.
 Paty et al. (2020) F.P. Paty, A. d’Aspremont, and M. Cuturi. Regularity as regularization: Smooth and strongly convex brenier potentials in optimal transport. In International Conference on Artificial Intelligence and Statistics (AISTATS2021), pages 1222–1232. PMLR, 2020.
 Perrot et al. (2016) M. Perrot, N. Courty, R. Flamary, and A. Habrard. Mapping estimation for discrete optimal transport. In Advances in Neural Information Processing Systems (NeurIPS2016), pages 4197–4205, 2016.
 Peyré et al. (2016) G. Peyré, M. Cuturi, and J. Solomon. GromovWasserstein averaging of kernel and distance matrices. In M. F. Balcan and K. Q. Weinberger, editors, Proceedings of The 33rd International Conference on Machine Learning (ICML2016), volume 48 of Proceedings of Machine Learning Research, pages 2664–2672, New York, New York, USA, 20–22 Jun 2016. PMLR.
 Sejdinovic et al. (2013) D. Sejdinovic, B. Sriperumbudur, A. Gretton, and K. Fukumizu. Equivalence of distancebased and RKHSbased statistics in hypothesis testing. The Annals of Statistics, pages 2263–2291, 2013.
 Séjourné et al. (2020) T. Séjourné, F.X. Vialard, and G. Peyré. The unbalanced Gromov Wasserstein distance: Conic formulation and relaxation. arXiv preprint arXiv:2009.04266, 2020.
 Shilane et al. (2004) P. Shilane, P. Min, M. Kazhdan, and T. Funkhouser. The princeton shape benchmark. In Proceedings of the International Conference on Shape Modeling and Applications (SMI2004), pages 167–178. IEEE, 2004.
 Sriperumbudur (2016) B. Sriperumbudur. On the optimal estimation of probability measures in weak and strong topologies. Bernoulli, 22(3):1839–1893, 2016.
 van der Vaart and Wellner (1996) A. W. van der Vaart and J. A. Wellner. Weak convergence and empirical processes: with applications to statistics. Springer Science & Business Media, 1996.
 Vayer et al. (2019) T. Vayer, R. Flamary, R. Tavenard, L. Chapel, and N. Courty. Sliced GromovWasserstein. arXiv preprint arXiv:1905.10124, 2019.
 Wainwright (2019) M. J. Wainwright. Highdimensional statistics: A nonasymptotic viewpoint, volume 48. Cambridge University Press, 2019.
 Weed and Bach (2019) J. Weed and F. Bach. Sharp asymptotic and finitesample rates of convergence of empirical measures in Wasserstein distance. Bernoulli, 25(4A):2620–2648, 2019.
 Xu et al. (2019) H. Xu, D. Luo, and L. Carin. Scalable GromovWasserstein learning for graph partitioning and matching. Advances in neural information processing systems (NeurIPS19), 32:3052–3062, 2019.
 Yan et al. (2018) Y. Yan, W. Li, H. Wu, H. Min, M. Tan, and Q. Wu. Semisupervised optimal transport for heterogeneous domain adaptation. In Proceedings of the International Joint Conference on Artificial Intelligence (IJCAI2018), volume 7, pages 2969–2975, 2018.
 Zhu et al. (2017) J.Y. Zhu, T. Park, P. Isola, and A. A. Efros. Unpaired imagetoimage translation using cycleconsistent adversarial networks. In Proceedings of the IEEE international conference on computer vision (CCV2017), pages 2223–2232, 2017.
 Zolotarev (1984) V. M. Zolotarev. Probability metrics. Theory of Probability & Its Applications, 28(2):278–302, 1984.
Appendix A Proofs
To simplify notation we denote , which is the functional that is optimized in definition of .
a.1 Proof of Proposition 1
The symmetry and positivity follows directly from definition and Lemma 1, which is proven below. For the triangle inequality, fix 3 mm spaces , , and functions (over the appropriate domains) with , , , . We only show the derivation for ; a similar argument applies to . For , we have
Hence is a metric on .
a.2 Proof of Lemma 1
Suppose and are sequence such that . We will show that up to extracting subsequences, these sequences converge a.s. to isometrics, and , respectively, such that . The argument first shows that there is a countable dense such that the distortion function (defined below) converges on to 0. Then we take a subsequence of that converges on , and show that this subsequence also converges a.s. on , and the limit is an isometry. After applying the same to , we conclude the desired convergence and demonstrate that the limits and satisfy .
We first consider the term . Since
we may assume that, up to extraction of subsequences, we have
Set as the set of pairs for which the convergence occurs, and let be the slice at in the first coordinate. Then , and by Fubini’s theorem
Denoting , we thus have , and hence is dense. We next construct as a countable dense subset of .
Step 1 – Separability of convergence points: We present an inductive construction of a countable dense subset such that converges to 0 on . First take any and define
then since . Suppose we have points , such that for , and define for . Define function
and set . Suppose , otherwise is already dense. Since is continuous, is a nonempty open set on . Notice that set still have probability 1, and any point satisfies that converges for all . Since has full support, is not empty, hence we pick . Inductively we have sequence such that converges for any . Denote .
Now we prove that is dense in . Suppose it is not, then there is an and an such that