{"title": "Modelling heterogeneous distributions with an Uncountable Mixture of Asymmetric Laplacians", "book": "Advances in Neural Information Processing Systems", "page_first": 8838, "page_last": 8848, "abstract": "In regression tasks, aleatoric uncertainty is commonly addressed by considering a parametric distribution of the output variable, which is based on strong assumptions such as symmetry, unimodality or a restricted shape. These assumptions are too limited in scenarios where complex shapes, strong skews or multiple modes are present. In this paper, we propose a generic deep learning framework that learns an Uncountable Mixture of Asymmetric Laplacians (UMAL), which allows us to estimate heterogeneous distributions of the output variable, and we show its connections to quantile regression. Despite having a fixed number of parameters, the model can be interpreted as an infinite mixture of components, which yields a flexible approximation for heterogeneous distributions. Apart from synthetic cases, we apply this model to room price forecasting and to predict financial operations in personal bank accounts. We demonstrate that UMAL produces proper distributions, which allows us to extract richer insights and to sharpen decision-making.", "full_text": "Modelling heterogeneous distributions with an\nUncountable Mixture of Asymmetric Laplacians\n\nAxel Brando \u2217\n\nBBVA Data & Analytics\nUniversitat de Barcelona\n\nJose A. Rodr\u00edguez-Serrano\u2020\nBBVA Data & Analytics\n\nJordi Vitri\u00e0\u2021\n\nUniversitat de Barcelona\n\nAlberto Rubio\n\nBBVA Data & Analytics\n\nAbstract\n\nIn regression tasks, aleatoric uncertainty is commonly addressed by considering a parametric distribution of the output variable, which is based on strong assumptions such as symmetry, unimodality or a restricted shape. These assumptions are too limited in scenarios where complex shapes, strong skews or multiple modes are present. 
In this paper, we propose a generic deep learning framework\nthat learns an Uncountable Mixture of Asymmetric Laplacians (UMAL), which will\nallow us to estimate heterogeneous distributions of the output variable and we show\nits connections to quantile regression. Despite having a \ufb01xed number of parameters,\nthe model can be interpreted as an in\ufb01nite mixture of components, which yields a\n\ufb02exible approximation for heterogeneous distributions. Apart from synthetic cases,\nwe apply this model to room price forecasting and to predict \ufb01nancial operations in\npersonal bank accounts. We demonstrate that UMAL produces proper distributions,\nwhich allows us to extract richer insights and to sharpen decision-making.\n\n1\n\nIntroduction\n\nFigure 1: Regression problem with heterogeneous output distributions modelled with UMAL.\n\nIn the last decade, deep learning has had signi\ufb01cant success in many real-world tasks, such as image\nclassi\ufb01cation [1] and natural language processing [2]. While most of the successful examples have\nbeen in classi\ufb01cation tasks, regression tasks can also be tackled with deep networks by considering\narchitectures where the last layer represents the continuous response variable(s) [3, 4, 5]. However,\n\n\u2217axel.brando@bbvadata.com | axelbrando@ub.edu.\n\u2020joseantonio.rodriguez.serrano@bbvadata.com\n\u2021jordi.vitria@ub.edu\n\n33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.\n\n\fthis point-wise approach does not provide us with information about the uncertainty underlying the\nprediction process. When an error in a regression task is associated with a high cost, we might prefer\nto include uncertainty estimates in our model, or actually estimate the distribution of the response\nvariable.\nThe modelling of uncertainty in regression tasks has\nbeen approached from two main standpoints [6]. 
On\nthe one hand, one of the sources of uncertainty is\n\u201cmodel ignorance\u201d, i.e. the mismatch between the\nmodel that approximates the task and the true (and\nunknown) underlying process. This has been referred\nto as Epistemic uncertainty. This type of uncertainty\ncan be modelled using Bayesian methods [7, 8, 9, 10]\nand can be partially reduced by increasing the size\nand quality of training data.\nOn the other hand, another source of uncertainty is\n\u201cinevitable variability in the response variable\u201d, i.e.\nwhen the variable to predict exhibits randomness,\nwhich is possible even in the extreme case where the\ntrue underlying model is known. This randomness\ncould be caused by several factors. For instance, by\nthe fact that the input data do not contain all vari-\nables that explain the output. This type of uncertainty\nhas been referred to as Aleatoric uncertainty. This\ncan be modelled by considering output distributions\n[11, 12, 13], instead of point-wise estimations, and is\nnot necessarily reduced by increasing the amount of\ntraining data.\nWe will concentrate on the latter case, our goal being\nto improve the state-of-the-art in deep learning meth-\nods to approximate aleatoric uncertainty. The need to\nimprove current solutions can be understood by con-\nsidering the regression problem in Figure 1. In this\nregression problem, the distribution of the response\nvariable exhibits several regimes. Consequently, there\nis no straightforward de\ufb01nition of aleatoric uncer-\ntainty that can represent all these regimes. A quanti-\ntative de\ufb01nition of uncertainty valid for one regime\n(e.g. standard deviation) might not be valid for others.\nAlso, the usefulness of such uncertainty could depend\non the end-task. 
For example, reporting the number of modes would be enough for some applications. For other applications, it might be more interesting to analyse the differences among the asymmetries of the predicted distributions.\nIn this paper, we propose a new model for estimating a heterogeneous distribution of the response variable. By heterogeneous, we mean that no strong assumptions are made, such as unimodality or symmetry. As Figure 2 shows, this can be done by implementing a deep learning network, \u03c6, which implicitly learns the parameters for the components of an Uncountable Mixture of Asymmetric Laplacians (UMAL). While the number of weights of such an internal network is finite, we show that it implicitly fits a mixture of an infinite number of functions.\nUMAL is a generic model that is based on two main ideas. Firstly, in order to capture the distribution of possible outputs given the same input, a parametric probability distribution is imposed on the output of the model and a neural network is trained to predict the parameters that maximise the\n\nFigure 2: The bottom shows a representation of the proposed regression model, which captures all the components \u03c4_i of the mixture of Asymmetric Laplacian distributions (ALD) simultaneously. Moreover, this model is agnostic to the architecture of the neural network \u03c6. The middle shows a visualisation of certain ALD components that predict the upper plot, i.e. the distribution of values of y for a fixed point, x, from the Multimodal part of Figure 1 (for ease of visualisation, the plot has been clipped to 0.2).\n\n2\n\n\flikelihood of such a probability distribution [11, 14]. Specifically, if that parametric distribution is a mixture model, the approach is known as a Mixture Density Network (MDN). 
Secondly, UMAL can be seen as a generalisation of a method developed in the field of statistics and particularly in the field of econometrics: Quantile Regression (QR) [15]. QR is agnostic with respect to the modelled distribution, which allows it to deal with more heterogeneous distributions. Moreover, QR is still a maximum likelihood estimation of an Asymmetric Laplacian Distribution (ALD) [16]. UMAL extends this model by considering a mixture model that implicitly combines an infinite number of ALDs to approximate the distribution of the response variable.\nIn order to quantitatively validate the capabilities of the proposed model, we have considered a real problem where the behaviour of the variable to be predicted has a heterogeneous distribution. Furthermore, in the interest of reproducibility, we have considered the use of open data4. Price forecasting per night of houses / rooms offered on Airbnb, a global online marketplace and hospitality service, fulfils these conditions. Specifically, we predict prices for the cities of Barcelona and Vancouver using public information downloaded from [17]. Price prediction is based on informative features such as neighbourhood, number of beds and other characteristics associated with the houses / rooms. As we can see in the results section, by predicting the full distribution of the price, as opposed to a single estimate of it, we are able to extract much richer conclusions.\nFurthermore, we have also applied the UMAL model to a private, large dataset of short sequences, in order to forecast monthly aggregated spending and incomes jointly for each category in personal bank accounts. 
As in the case of the price prediction per room, to draw conclusions we have used neural networks to perform comparisons with Mixture Density Network models [11], single-distribution estimators as well as other baselines.\n\n2 Background concepts and notation\n\nIt should be noted that this paper does not attempt to model epistemic uncertainty [6], for which recent work exists on Bayesian neural networks and their variations [7, 8, 9, 10], on a Bayesian interpretation of the dropout technique [18], or on combining the forecasts of a deep ensemble [19]. Importantly, the main objective of this article is to study models that capture the aleatoric uncertainty in regression problems by using deep learning methods. The reason to focus on this type of uncertainty is that we are interested in problems where there are large volumes of data but there still exists a high variability of possible correct answers given the same input information.\nTo obtain the richest representation of aleatoric uncertainty, we want to determine a conditional density model p (y | x) that fits an observed set of samples D = (X, Y) = {(x_i, y_i) | x_i \u2208 R^F, y_i \u2208 R}_{i=1}^{n}, which we assume to be sampled from an unknown distribution q (y | x). To achieve this goal, we consider solutions that restrict p to a parametric family of distributions {f (y; \u03b8) | \u03b8 \u2208 \u0398}, where \u03b8 denotes the parameters of the distributions [20]. These parameters are the outputs of a deep learning function \u03c6 : R^F \u2192 \u0398 with weights w optimised to maximise the likelihood function on a training subset of D. The problem appears when the assumed parametric distribution imposed on p differs greatly from the real distribution shape of q. This case will become critical the more heterogeneous q is with respect to p, i.e. 
in cases when its distribution shape is more complex, containing further behaviours such as extra modes or stronger asymmetries.\n\n3 Modelling heterogeneous distributions\n\nIn this paper, we have selected as baseline approaches two different types of distribution that belong to the generalised normal distribution family: the normal and the Laplacian distribution. Thus, in both cases the neural network function is defined as \u03c6 : R^F \u2192 R \u00d7 (0, +\u221e) to predict location and scale parameters. However, the assumption of a simple normal or Laplace variability in the output of the model forces the conditional distribution of the output given the input to be unimodal [11]. This could be very limiting in some problems, such as when we want to estimate the price of housing and there may be various types of price distributions given the same input characteristics.\n\n4The source code to reproduce the public results reported is published in https://github.com/BBVA/UMAL.\n\n3\n\n\fMixture Density Network One proposed solution in the literature to approximate a multimodal conditional distribution is the Mixture Density Network (MDN) [11]. Specifically, the mixture likelihood for any normal or Laplacian distribution components is\n\np (y | x, w) = \u2211_{i=1}^{m} \u03b1_i(x) \u00b7 pdf (y | \u00b5_i(x), b_i(x)) , (1)\n\nwhere m denotes the fixed number of components of the mixture, each one being defined by the distribution function pdf. On the other hand, \u03b1_i(x) is the mixture weight (such that \u2211_{i=1}^{m} \u03b1_i(x) = 1). Therefore, for this type of model we will have an extra output to predict, \u03b1, in the original neural network, i.e. \u03c6 : R^F \u2192 R^m \u00d7 (0, +\u221e)^m \u00d7 [0, 1]^m.\nAlthough this model can approximate any type of distribution, provided m is sufficiently large [21], it does not capture asymmetries at the level of each component. 
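As an illustration of the mixture likelihood in Equation 1, the following is a minimal numpy sketch with Laplacian components; the function name and signature are ours, not part of the paper's released code.

```python
import numpy as np

def mdn_pdf(y, alpha, mu, b):
    """Mixture likelihood of Equation 1 with Laplacian components:
    sum_i alpha_i(x) * Laplace(y | mu_i(x), b_i(x))."""
    comp = np.exp(-np.abs(y - mu) / b) / (2.0 * b)  # Laplacian density per component
    return np.sum(alpha * comp, axis=-1)            # weighted sum over the m components
```

In an MDN, `alpha`, `mu` and `b` would be produced per input x by the network \u03c6, with the \u03b1_i normalised (e.g. via a softmax) so that they sum to 1.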
Additionally, it entails determining\nthe optimal number of components m e.g. by cross-validation [22], which in practice multiplies the\ntraining cost by a signi\ufb01cant factor.\n\nQuantile Regression Alternatively, an extension to classic regression has been proposed in the\n\ufb01eld of statistics and econometrics: Quantile Regression (QR). QR is based on estimating the desired\nconditional quantiles of the response variable [15, 23, 24, 25]. Given \u03c4 \u2208 (0, 1), the \u03c4-th quantile\nregression loss function would be de\ufb01ned as\n\nL\u03c4 (x, y; w) = (y \u2212 \u00b5(x)) \u00b7 (\u03c4 \u2212 1[y < \u00b5(x)]) ,\n\n(2)\nwhere 1[p] is the indicator function that veri\ufb01es the condition p. This loss function is an asymmetric\nconvex loss function that penalises overestimation errors with weight \u03c4 and underestimation errors\nwith weight 1 \u2212 \u03c4 [26].\nRecently, some works have combined neural networks with QR [26, 12, 27]. For instance, in the\nreinforcement learning \ufb01eld, a neural network has been proposed to approximate a given set of\nquantiles [26]. This is achieved by jointly minimizing a sum of terms like those in Equation 2, one\nfor each given quantile. Following this, the Implicit Quantile Networks (IQN) model was proposed\n[12] in order to learn the full quantile range instead of a \ufb01nite set of quantiles. This was done by\nconsidering the \u03c4 parameter as an input of the deep reinforcement learning model and conditioning\nthe single output to the input desired quantile, \u03c4. In order to optimize for all possible \u03c4 values, the\nloss function considers an expectation over \u03c4, which in the stochastic gradient descent method is\napproximated by sampling \u03c4 \u223c U (0, 1) from a uniform distribution in each iteration. Recently, a\nneural network has also been applied to regression problems in order to simultaneously minimise\nthe Equation 2 for all quantile values sampled as IQN [13]. 
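The quantile loss of Equation 2 is straightforward to implement; here is a hedged numpy sketch (identifiers are ours), which weights positive residuals by \u03c4 and negative residuals by 1 \u2212 \u03c4.

```python
import numpy as np

def quantile_loss(y, mu, tau):
    """tau-th quantile regression loss of Equation 2:
    (y - mu) * (tau - 1[y < mu]), applied elementwise."""
    diff = y - mu
    return diff * (tau - (diff < 0).astype(float))
```

Averaging this loss over \u03c4 values sampled uniformly from (0, 1), with \u03c4 fed to the network as an extra input, recovers the IQN-style objective described above.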
Thus, both solutions consider a joint but \u201cindependent\u201d quantile minimisation with respect to the loss function. Consequently, for the sake of consistency with the following nomenclature, we will refer to them as Independent QR models.\nGiven a neural network function \u03c6 : R^{F+1} \u2192 R, whose input is (x, \u03c4) \u2208 R^{F+1} and which implicitly approximates all the quantiles \u03c4 \u2208 (0, 1), we can obtain the distribution shape for a given input x by integrating the conditioned function over \u03c4. However, because this function is estimated empirically, there is no guarantee that it will be strictly increasing with respect to the value \u03c4, which can lead to a crossing-quantiles phenomenon [23, 13]. Below, we introduce a concept that allows this limitation to be bypassed by applying a method described in the following section.\n\nAsymmetric Laplacian distribution As is widely known, fitting a function with the mean squared error or the mean absolute error loss is equivalent to a maximum likelihood estimation of the location parameter of a Normal or a Laplacian distribution, respectively. Similarly to these unimodal cases, when we minimise Equation 2, we are optimising the maximum likelihood of the location parameter of an Asymmetric Laplacian Distribution (ALD) [16, 23], expressed as\n\nALD (y | \u00b5, b, \u03c4) = [\u03c4 (1 \u2212 \u03c4) / b(x)] \u00b7 exp{\u2212 (y \u2212 \u00b5(x)) \u00b7 (\u03c4 \u2212 1[y < \u00b5(x)]) / b(x)} . (3)\n\nWhen the \u00b5, b parameters are predicted by deep networks conditioned on \u03c4, we are considering a non-point-wise approach to QR. 
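For concreteness, the ALD density of Equation 3 can be sketched as follows; this is a simplified standalone version with array-valued y and scalar parameters, and the names are ours.

```python
import numpy as np

def ald_pdf(y, mu, b, tau):
    """Asymmetric Laplacian density of Equation 3:
    tau*(1-tau)/b * exp(-(y - mu) * (tau - 1[y < mu]) / b)."""
    ind = (y < mu).astype(float)  # indicator 1[y < mu]
    return tau * (1.0 - tau) / b * np.exp(-(y - mu) * (tau - ind) / b)
```

For any \u03c4 \u2208 (0, 1) this integrates to 1, and \u03c4 directly controls the fraction of probability mass that lies below the location parameter \u00b5, which is what makes it the natural likelihood for quantile regression.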
Next, we combine all ALDs to infer a response variable distribution.\n\n4\n\n\f4 The Uncountable Mixture of Asymmetric Laplacians model\n\nIn order to define the proposed framework, the objective is to consider a model that corresponds to the mixture distribution of all possible ALD functions with respect to the asymmetry parameter, \u03c4 \u2208 (0, 1). This mixture model has an uncountable set of components that are combined to produce the uncountable mixture5 distribution. Let w be the weights of the deep learning model to estimate, \u03c6 : R^{F+1} \u2192 R \u00d7 (0, +\u221e), which predicts the (\u00b5_\u03c4, b_\u03c4) parameters of the different ALDs conditioned on a \u03c4 value. Then, we can consider the following compound model marginalising over \u03c4:\n\np (y | x, w) = \u222b \u03b1_\u03c4(x) \u00b7 ALD (y | \u00b5_\u03c4(x), b_\u03c4(x), \u03c4) d\u03c4 . (4)\n\nNow we can make two considerations. On the one hand, we assume a uniform distribution for each component \u03b1_\u03c4 of the mixture model. Therefore, the weight \u03b1_\u03c4 is the same for all ALDs, maintaining the restriction to integrate to 1. On the other hand, in order to make the integral tractable at training time, following the strategy proposed in implicit cases [12, 13], we consider a random variable \u03c4 \u223c U(0, 1) and apply Monte Carlo (MC) integration [7], selecting N_\u03c4 random values of \u03c4 in each iteration, so that we discretise the integral. 
This results in the following expression:\n\np (y | x, w) \u2248 (1/N_\u03c4) \u2211_{t=1}^{N_\u03c4} ALD(y | \u00b5_{\u03c4_t}(x), b_{\u03c4_t}(x), \u03c4_t). (5)\n\nTherefore, the Uncountable Mixture of Asymmetric Laplacians (UMAL) model is optimised by minimising the following negative log-likelihood function with respect to w,\n\n\u2212 log p (Y | X, w) \u2248 \u2212 \u2211_{i=1}^{n} [ log ( \u2211_{t=1}^{N_\u03c4} exp [log ALD(y_i | \u00b5_{\u03c4_t}(x_i), b_{\u03c4_t}(x_i), \u03c4_t)] ) \u2212 log(N_\u03c4) ], (6)\n\nwhere, as is commonly the case in mixture models [29], we have a \u201clogarithm of the sum of exponentials\u201d. This form allows application of the LogSumExp trick [30] during optimisation to prevent overflow or underflow when computing the logarithm of the sum of the exponentials.\n\n4.1 Connection with quantile models\n\nIt is important to note the link between UMAL and QR. If we consider an Independent QR model where the entire range of quantiles is implicitly and independently approximated (as in the case of IQN), then the mode of an ALD can be directly inferred. Thus, at inference time there is a perfect solution that estimates the real distribution, but in a point-estimate manner. However, an alternative approach would be to minimise the negative logarithms of all ALDs as a sum of distributions, where each one \u201cindependently\u201d captures the variability for each quantile. This solution is, in fact, an upper bound of the UMAL model. 
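As a hedged sketch of the Monte Carlo objective in Equations 5 and 6 (ours, not the authors' released code), the UMAL negative log-likelihood with the log-sum-exp stabilisation can be written in numpy as:

```python
import numpy as np

def umal_nll(y, mu, b, tau):
    """Monte Carlo UMAL negative log-likelihood (Equations 5-6).
    y: shape (n, 1); mu, b, tau: shape (n, N_tau), one ALD per sampled tau."""
    diff = y - mu
    # log ALD density of Equation 3, elementwise per sampled tau
    log_ald = np.log(tau * (1.0 - tau) / b) - diff * (tau - (diff < 0)) / b
    # log of the uniform mixture over the N_tau sampled components,
    # computed with the log-sum-exp trick to avoid under/overflow
    m = log_ald.max(axis=1, keepdims=True)
    log_mix = (m + np.log(np.exp(log_ald - m).sum(axis=1, keepdims=True))
               - np.log(log_ald.shape[1]))
    return float(-log_mix.mean())
```

With a single sampled \u03c4 the mixture collapses to one ALD, so the loss reduces to the plain negative log ALD likelihood.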
Applying Jensen\u2019s Inequality to the negative logarithm function of Equation 6 gives us an expression that corresponds to considering all ALDs as independent elements,\n\n\u2212 log p (Y | X, w) \u2264 \u2212 \u2211_{i=1}^{n} [ ( \u2211_{t=1}^{N_\u03c4} log ALD(y_i | \u00b5_{\u03c4_t}(x_i), b_{\u03c4_t}(x_i), \u03c4_t) ) \u2212 log(N_\u03c4) ]. (7)\n\nWe will refer to this upper-bound solution as Independent ALD and it will be used as a baseline in further comparisons.\n\n5The concept of \u201cuncountable mixture\u201d refers to the marginalisation formula that defines a compound probability distribution [28].\n\n5\n\n\f5 UMAL as a deep learning framework\n\nUMAL can be viewed as a framework for upgrading any point-wise estimation regression model in deep learning to an output distribution shape forecaster, as shown in Algorithm 2. This implementation can be performed using any automatic differentiation library such as TensorFlow [31] or PyTorch [32]. Additionally, it also performs the Monte Carlo step within the procedure, which results in more efficient computation at training time.\nTherefore, in order to obtain the conditioned mixture distribution we should perform Algorithm 3. By using this rich information we are able to conduct the following experiments.\n\nPrerequisites 1 Definitions and functions used in the following Algorithms\n\u25b7 x has batch size and number of features as shape, [bs, F].\n\u25b7 RESHAPE(tensor, shape): returns tensor with shape shape.\n\u25b7 REPEAT(tensor, n): repeats the last dimension of tensor n times.\n\u25b7 CONCAT(T1, T2): concatenates T1 and T2 along their last dimension.\n\u25b7 LEN(T1): number of elements in T1.\n\nAlgorithm 2 How to build the UMAL model using any deep learning architecture for regression\n1: procedure BUILD_UMAL_GRAPH(input vectors x, deep architecture \u03c6, MC sampling N_\u03c4)\n2: x \u2190 RESHAPE(REPEAT(x, N_\u03c4), [bs \u00b7 N_\u03c4, F]) \u25b7 Adapting x to be able to associate a \u03c4.\n3: \u03c4 \u2190 U(0, 1) \u25b7 \u03c4 must have [bs \u00b7 N_\u03c4, 1] shape.\n4: i \u2190 CONCAT(x, \u03c4) \u25b7 i has [bs \u00b7 N_\u03c4, F + 1] shape.\n5: (\u00b5, b) \u2190 \u03c6(i) \u25b7 Applying any deep learning function \u03c6.\n6: L \u2190 Equation 6 \u25b7 Applying the UMAL loss function using the (\u00b5, b, \u03c4) triplet.\n7: return L\n\nAlgorithm 3 How to generate the final conditioned distribution using the UMAL model\n1: procedure PREDICT(input vectors x, response vectors y, deep architecture \u03c6, selected \u03c4s sel_\u03c4)\n2: \u03c4 \u2190 RESHAPE(REPEAT(sel_\u03c4, bs), [bs \u00b7 LEN(sel_\u03c4), 1]) \u25b7 Adapting \u03c4 shape.\n3: x \u2190 RESHAPE(REPEAT(x, sel_\u03c4), [bs \u00b7 LEN(sel_\u03c4), F]) \u25b7 Adjusting x shape.\n4: i \u2190 CONCAT(x, \u03c4) \u25b7 i has [bs \u00b7 LEN(sel_\u03c4), F + 1] shape.\n5: (\u00b5, b) \u2190 \u03c6(i) \u25b7 Apply the trained deep learning function \u03c6.\n6: p (y | x) \u2190 (1/N_\u03c4) \u2211_{t=1}^{N_\u03c4} ALD(y | \u00b5_{\u03c4_t}, b_{\u03c4_t}, \u03c4_t) \u25b7 Calculate the mixture model of sel_\u03c4 for each y.\n7: return p (y | x)\n\n6 Experimental Results\n\n6.1 Data sets and experiment settings\n\nIn this section, we show the performance of the proposed model. All experiments are implemented in TensorFlow [33] and Keras [34], running on a workstation with a Titan X (Pascal) GPU and a GeForce RTX 2080 GPU. Regarding parameters, we use a common learning rate of 10^{\u22123}. In addition, to restrict the value of the scale parameter, b, to strictly positive values, the respective output has a softplus function [35] as activation. We will refer to the number of parameters to be estimated as P. On the other hand, the Monte Carlo sampling number, N_\u03c4, for the Independent QR, Independent ALD and UMAL models will always be fixed to 100 at training time. 
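The tensor manipulations of Algorithm 2 (its RESHAPE / REPEAT / CONCAT steps) amount to tiling each input N_\u03c4 times and pairing every copy with its own sampled \u03c4; a minimal numpy sketch, with names of our own choosing, is:

```python
import numpy as np

def build_umal_inputs(x, n_tau, rng):
    """Tile the batch n_tau times and append a sampled tau ~ U(0, 1)
    as an extra feature, as in steps 2-4 of Algorithm 2."""
    bs, n_features = x.shape
    x_rep = np.repeat(x, n_tau, axis=0)           # shape [bs * n_tau, F]
    tau = rng.uniform(size=(bs * n_tau, 1))       # shape [bs * n_tau, 1]
    return np.concatenate([x_rep, tau], axis=1)   # shape [bs * n_tau, F + 1]
```

The network \u03c6 then maps each row of this [bs \u00b7 N_\u03c4, F + 1] tensor to its (\u00b5_\u03c4, b_\u03c4) pair, and the UMAL loss of Equation 6 is computed over the N_\u03c4 rows belonging to each original sample.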
Furthermore, all public experiments are trained using an early-stopping policy with 200 epochs of patience for all compared methods.\n\n6\n\n\fTable 1: Comparison of the log-likelihood on the test set over different alternatives to model the distribution of the different proposed data sets. The scale for each data set is indicated in parentheses.\nLog-Likelihood comparison\n\nModel | Synthetic (10^2) | BCN RPF (10^3) | YVC RPF (10^2) | Financial (10^6)\nNormal distribution | \u221239.88 \u00b1 13.4 | \u221238.44 \u00b1 6.55 | \u221270.79 \u00b1 3.26 | \u22128.56\nLaplace distribution | \u221241.30 \u00b1 0.78 | \u221219.84 \u00b1 0.93 | \u221282.87 \u00b1 8.01 | \u22127.88\nIndependent QR | \u2212119.0 \u00b1 7.68 | \u221232.98 \u00b1 1.63 | \u2212113.54 \u00b1 10.4 | \u22128.26\n2 comp. Normal MDN | \u221243.14 \u00b1 6.12 | \u221228.59 \u00b1 3.38 | \u221274.11 \u00b1 3.26 | \u22126.37\n3 comp. Normal MDN | \u221251.79 \u00b1 21.0 | \u221231.66 \u00b1 4.85 | \u221274.22 \u00b1 2.37 | \u22127.25\n4 comp. Normal MDN | \u2212111.6 \u00b1 43.27 | \u221228.60 \u00b1 7.22 | \u221276.85 \u00b1 5.95 | \u22126.75\n10 comp. Normal MDN | \u2212184.3 \u00b1 35.5 | \u221227.72 \u00b1 2.81 | \u221277.26 \u00b1 6.12 | \u221210.40\n2 comp. Laplace MDN | \u221242.83 \u00b1 1.54 | \u221219.76 \u00b1 0.18 | \u221265.52 \u00b1 0.40 | \u221210.83\n3 comp. Laplace MDN | \u221264.13 \u00b1 36.70 | \u221219.57 \u00b1 0.30 | \u221278.80 \u00b1 3.79 | \u22125.84\n4 comp. Laplace MDN | \u221252.53 \u00b1 8.79 | \u221219.89 \u00b1 0.44 | \u221266.58 \u00b1 1.10 | \u22125.72\n10 comp. Laplace MDN | \u2212155.9 \u00b1 32.9 | \u221221.45 \u00b1 0.83 | \u221282.51 \u00b1 9.66 | \u22126.28\nIndependent ALD | \u221239.03 \u00b1 0.45 | \u221219.03 \u00b1 0.81 | \u221264.16 \u00b1 0.19 | \u22125.66\nUMAL model | \u221228.14 \u00b1 0.44 | \u221218.04 \u00b1 0.72 | \u221262.68 \u00b1 0.21 | \u22125.49\n\nSynthetic regression Figure 1 corresponds to the following data set. Given (X, Y) = {(x_i, y_i)}_{i=1}^{3800} points where x_i \u2208 [0, 1] and y_i \u2208 R, they are defined by 4 different fixed synthetic distributions depending on the range of X values. In particular, if x_i < 0.21, then the corresponding y_i comes from a Beta(\u03b1 = 0.5, \u03b2 = 1) distribution. Next, if 0.21 < x_i < 0.47, then the y_i values are obtained from a N(\u00b5 = 3 \u00b7 cos x_i \u2212 2, \u03c3 = |3 \u00b7 cos x_i \u2212 2|) distribution depending on the x_i value. Then, when 0.47 < x_i < 0.61 their respective y_i values are obtained from an increasing uniform random distribution and, finally, all values above 0.61 are obtained from three different uniform distributions: U(8, 0.5), U(1, 3) and U(\u22124.5, 1.5). A total of 50% of the randomly generated data were considered as test data, 40% for training and 10% for validation.\nFor all compared models, we use the same neural network architecture for \u03c6. It consists of 4 dense layers with output dimensions 120, 60, 10 and P, respectively, all but the last layer with ReLU activation. 
Regarding training time, all models took less than 3 minutes to converge.\n\nRoom price forecasting (RPF) Using the publicly available information from the Inside Airbnb platform [17], we selected Barcelona (BCN) and Vancouver (YVC) as the cities to carry out the comparison of the models in a real situation. For both cities, we select the last time each house or flat appeared within the months available from April 2018 to March 2019.\nThe regression problem is defined as predicting the real price per night of each flat in its respective currency using the following information: the one-hot encoding of the categorical attributes (present in the corresponding Inside Airbnb \u201clistings.csv\u201d files) of district number, neighbourhood number, room type and property type, as well as the number of bathrooms and accommodates values, together with the latitude and longitude normalised according to the minimums and maximums of the corresponding city.\nGiven the 36,367 and 11,497 flats in BCN and YVC respectively, we have considered 80% as a training set, 10% as a validation set and the remaining 10% as a test set. Regarding the trained models, all share the same neural network architecture for their \u03c6, composed of 6 dense layers with ReLU activation in all but the last layer and output dimensions of 120, 120, 60, 60, 10 and P, respectively. Concerning training time, all models took less than 30 minutes to converge.\n\nFinancial estimation The aim here is to anticipate personal expenses and income for each specific financial category in the upcoming month by only considering the last 24 months of aggregated historic values for that customer as a short time-series problem. This private data set contains monthly aggregated expense and income operations for each customer in a certain category as time series of 24 months. 
1.8 million of these time series from a selected year form the training set, 200 thousand form the validation set, and 1 million time series from the following year form the test set. Regarding the \u03c6 architecture for all compared models, after an internal refinement task to select the best architecture, we used a recurrent model that contains 2 stacked Long Short-Term Memory (LSTM) layers [36] of 128 output neurons each, followed by two dense layers of 128 and P outputs, respectively. It is important to note that, because all compared solutions used in this article are agnostic with respect to the architecture, the only decision we need to take is how to insert the extra \u03c4 information into the \u03c6 function in the QR, ALD and UMAL models.\n\n7\n\n\fFigure 3: Plot with the performance of three different models in terms of calibration. The table reports the mean and standard deviation over all folds of the mean absolute error between the predicted calibration and the perfect ideal calibration.\n\nLikelihood calibration comparison\n\nModel | BCN RPF | YVC RPF\nNormal distribution | .12 \u00b1 .04 | .04 \u00b1 .01\nLaplace distribution | .03 \u00b1 .00 | .06 \u00b1 .01\nIndependent QR | .10 \u00b1 .02 | .12 \u00b1 .02\n2 comp. Normal MDN | .05 \u00b1 .02 | .12 \u00b1 .05\n3 comp. Normal MDN | .07 \u00b1 .02 | .14 \u00b1 .04\n4 comp. Normal MDN | .10 \u00b1 .03 | .17 \u00b1 .06\n10 comp. Normal MDN | .19 \u00b1 .04 | .19 \u00b1 .06\n2 comp. Laplace MDN | .05 \u00b1 .01 | .09 \u00b1 .01\n3 comp. Laplace MDN | .08 \u00b1 .02 | .11 \u00b1 .02\n4 comp. Laplace MDN | .13 \u00b1 .05 | .12 \u00b1 .03\n10 comp. Laplace MDN | .24 \u00b1 .03 | .18 \u00b1 .05\nIndependent ALD | .06 \u00b1 .01 | .02 \u00b1 .01\nUMAL model | .04 \u00b1 .01 | .07 \u00b1 .01\n\n
In these cases, for simplicity, we add \u03c4 repeatedly as one more attribute of each point of the input time series.\n\n6.2 Results\n\nLog-Likelihood comparison We compare the test-set log-likelihood of all models presented in Table 1 for the three types of problems introduced. For all public data sets, we report the corresponding mean and standard deviation over the 10 runs of each model. Due to computational constraints, the result for the private data set comes from one execution per model. Furthermore, we take into account different numbers of components for the different MDN models. We observe that the best solutions for MDN are far from the UMAL cases. Thus, we conclude that the UMAL models achieve the best performance in all of these heterogeneous problems.\n\nCalibrated estimated likelihoods To determine whether the learned likelihood is useful (i.e. if UMAL yields calibrated outputs), we performed an additional empirical study to assess this point. We highlight that our system predicts an output distribution p(y|x, w) (not a confidence value). Specifically, we have computed the % of actual test data that falls into different thresholds of predicted probability. Ideally, given a certain threshold \u03b8 \u2208 [0, 1], the amount of data points with a predicted probability above or equal to 1 \u2212 \u03b8 should be similar to \u03b8. On the left side of Figure 3 we plot these measures for different methods (in green, our model) when considering the BCN RPF data set. Furthermore, on the right side of Figure 3 we report the mean absolute error between the empirical measures and the ideal ones for both rental-price data sets. As we can see, the conditional distribution predicted by UMAL has low error values. 
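One common way to compute such a calibration curve, sketched below with our own names and formulation (not necessarily the authors' exact procedure), is to evaluate the predicted CDF at each observed test target and compare empirical coverage against the ideal diagonal:

```python
import numpy as np

def calibration_mae(cdf_at_y, thetas):
    """Mean absolute gap between empirical and ideal coverage.
    cdf_at_y: predicted CDF evaluated at each observed test target;
    for a calibrated model, the fraction of points with F(y|x) <= theta
    should be close to theta for every threshold theta."""
    empirical = np.array([(cdf_at_y <= t).mean() for t in thetas])
    return float(np.abs(empirical - thetas).mean())
```

A perfectly calibrated model yields CDF values uniformly distributed on [0, 1], so this error tends to zero.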
Therefore, we can state that UMAL produces proper and calibrated conditional distributions that are especially suitable for heterogeneous problems6.

Predicted distribution shape analysis In Figure 4 (from right to left) we show a two-dimensional t-SNE [37] projection, using perplexity 50 and the Wasserstein distance, of the normalised predicted distribution of each room in the Barcelona test set, with each distribution discretised over 500 linearly spaced points. Each colour of the palette corresponds to a DBSCAN [38] cluster obtained with ε = 5.8 and a minimum of 40 samples as DBSCAN parameters. We also show a hex-bin plot over the map of Barcelona, where each colour corresponds to the mode cluster of all the rooms inside the hexagonal limits. A similar study could be used to extract patterns within the city and, consequently, to adapt specific actions to them.

6In the Appendix, we evaluate calibration quality and negative log-likelihood on the UCI data sets with the same architectures as [19, 8].

Figure 4: DBSCAN clustering of the t-SNE projection to 2 dimensions of the normalised Barcelona predicted distributions. Hex-bin plot of the most common cluster for each hexagon on top of the map.

7 Conclusion

This paper has introduced the Uncountable Mixture of Asymmetric Laplacians (UMAL) model, a framework that uses deep learning to estimate the output distribution without strong restrictions (Figure 1). As shown in Figure 2, UMAL implicitly learns an infinite set of ALD distributions, which are combined to form the final mixture distribution. Thus, in contrast with mixture density networks, UMAL does not need to increase the size of its internal neural network output, which tends to produce unstable behaviour when it is required.
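This implicit mixture can be sketched numerically. Assuming a model that, for each quantile level τ, predicts an ALD location μ(τ) and scale b(τ), the mixture likelihood is approximated by Monte Carlo sampling of τ (a NumPy illustration; the function names and the sampling range away from the degenerate endpoints are our own choices, not the paper's code):

```python
import numpy as np

def ald_logpdf(y, mu, b, tau):
    """Log-density of the Asymmetric Laplace distribution ALD(y; mu, b, tau)."""
    u = (y - mu) / b
    rho = u * (tau - (u < 0))  # asymmetric (pinball) check function
    return np.log(tau * (1.0 - tau) / b) - rho

def umal_loglik(y, mu_fn, b_fn, n_taus=1000, seed=0):
    """Monte Carlo estimate of the mixture log-likelihood:
    log p(y) ~= logsumexp_tau ald_logpdf(y, mu_fn(tau), b_fn(tau), tau) - log n_taus,
    with tau drawn uniformly, avoiding the degenerate endpoints 0 and 1."""
    taus = np.random.default_rng(seed).uniform(0.01, 0.99, size=n_taus)
    logps = np.array([ald_logpdf(y, mu_fn(t), b_fn(t), t) for t in taus])
    m = logps.max()  # numerically stable log-sum-exp
    return float(m + np.log(np.exp(logps - m).sum()) - np.log(n_taus))
```

Increasing the number of sampled τ values refines the approximation of the mixture, which connects to the sampling-size remark below.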
Furthermore, the number of Monte Carlo samples in UMAL can be treated like a batch size, i.e. a setting that can be adjusted even during training.
We have presented a benchmark comparison, in terms of log-likelihood adaptation on the test set, across three different types of problems. The first was a synthetic experiment with distinct controlled heterogeneous distributions containing multimodal and skewed behaviours. Next, we used public data to create a complex problem of predicting the room price per night in two different cities, treated as two independent problems. Finally, we compared all the presented models on a financial forecasting problem: anticipating the next aggregated monthly expense or income of a customer given their historical data. We showed that the UMAL model outperforms the other baselines in its capacity to approximate the output distribution, as well as yielding calibrated outputs.
In introducing UMAL, we emphasise the importance of taking the concept of aleatoric uncertainty to a richer level, where we are not restricted to studying variability or evaluating confidence intervals before taking certain actions, but can carry out shape analysis in order to develop task-tailored methods.

Acknowledgements We gratefully acknowledge the Government of Catalonia's Industrial Doctorates Plan for funding part of this research. The UB acknowledges that part of the research described in this paper was partially funded by RTI2018-095232-B-C21 and SGR 1219. We would also like to thank BBVA Data and Analytics for sponsoring the industrial PhD.

References

[1] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pages 1097–1105, 2012.

[2] Ilya Sutskever, Oriol Vinyals, and Quoc V Le. Sequence to sequence learning with neural networks.
In Advances in Neural Information Processing Systems, pages 3104–3112, 2014.

[3] Vasileios Belagiannis, Christian Rupprecht, Gustavo Carneiro, and Nassir Navab. Robust optimization for deep regression. In CVPR, pages 2830–2838, 2015.

[4] Yagmur Gizem Cinar, Hamid Mirisaee, Parantapa Goswami, Eric Gaussier, and Ali Aït-Bachir. Period-aware content attention rnns for time series forecasting with missing values. Neurocomputing, 312:177–186, 2018.

[5] Chuan Wang, Hua Zhang, Liang Yang, Si Liu, and Xiaochun Cao. Deep people counting in extremely dense crowds. In Proceedings of the 23rd ACM International Conference on Multimedia, pages 1299–1302. ACM, 2015.

[6] Armen Der Kiureghian and Ove Ditlevsen. Aleatory or epistemic? does it matter? Structural Safety, 31(2):105–112, 2009.

[7] Carl Edward Rasmussen. A practical monte carlo implementation of bayesian learning. In Advances in Neural Information Processing Systems, pages 598–604, 1996.

[8] José Miguel Hernández-Lobato and Ryan Adams. Probabilistic backpropagation for scalable learning of bayesian neural networks. In ICML, pages 1861–1869, 2015.

[9] Charles Blundell, Julien Cornebise, Koray Kavukcuoglu, and Daan Wierstra. Weight uncertainty in neural networks. ICML, 2015.

[10] Mattias Teye, Hossein Azizpour, and Kevin Smith. Bayesian uncertainty estimation for batch normalized deep networks. In International Conference on Machine Learning, pages 4914–4923, 2018.

[11] Christopher M Bishop. Pattern recognition and machine learning. Springer, 2006.

[12] Will Dabney, Georg Ostrovski, David Silver, and Remi Munos. Implicit quantile networks for distributional reinforcement learning. In International Conference on Machine Learning, pages 1104–1113, 2018.

[13] Natasa Tagasovska and David Lopez-Paz.
Frequentist uncertainty estimates for deep learning. Bayesian Deep Learning Workshop, NeurIPS, 2018.

[14] Alex Kendall and Yarin Gal. What uncertainties do we need in bayesian deep learning for computer vision? In NIPS, pages 5580–5590, 2017.

[15] Roger Koenker and Kevin F Hallock. Quantile regression. Journal of Economic Perspectives, 15(4):143–156, 2001.

[16] Keming Yu and Rana A Moyeed. Bayesian quantile regression. Statistics & Probability Letters, 54(4):437–447, 2001.

[17] Murray Cox. Inside airbnb: adding data to the debate. Inside Airbnb [Internet]. [cited 16 May 2019]. Available: http://insideairbnb.com, 2019.

[18] Yarin Gal and Zoubin Ghahramani. Dropout as a bayesian approximation: Representing model uncertainty in deep learning. In ICML, pages 1050–1059, 2016.

[19] Balaji Lakshminarayanan, Alexander Pritzel, and Charles Blundell. Simple and scalable predictive uncertainty estimation using deep ensembles. In Advances in Neural Information Processing Systems, pages 6402–6413, 2017.

[20] Christopher M Bishop. Mixture density networks. Technical Report NCRG/4288, 1994.

[21] N Kostantinos. Gaussian mixtures and their applications to signal processing. Advanced Signal Processing Handbook: Theory and Implementation for Radar, Sonar, and Medical Imaging Real Time Systems, pages 3–1, 2000.

[22] Carl Edward Rasmussen. The infinite gaussian mixture model. In Advances in Neural Information Processing Systems, pages 554–560, 2000.

[23] Roger Koenker, Victor Chernozhukov, Xuming He, and Limin Peng. Handbook of Quantile Regression. CRC Press, 2017.

[24] C Gutenbrunner and J Jurecková. Regression quantile and regression rank score process in the linear model and derived statistics. Annals of Statistics, 20:305–330, 1992.

[25] Gary Chamberlain. Quantile regression, censoring, and the structure of wages.
In Advances in Econometrics: Sixth World Congress, volume 2, pages 171–209, 1994.

[26] Will Dabney, Mark Rowland, Marc G Bellemare, and Rémi Munos. Distributional reinforcement learning with quantile regression. In Thirty-Second AAAI Conference on Artificial Intelligence, 2018.

[27] Ruofeng Wen, Kari Torkkola, Balakrishnan Narayanaswamy, and Dhruv Madeka. A multi-horizon quantile recurrent forecaster. arXiv preprint arXiv:1711.11053, 2017.

[28] Ana Fred, Maria De Marsico, and Gabriella Sanniti di Baja. Pattern Recognition Applications and Methods: 5th International Conference, ICPRAM 2016, Rome, Italy, February 24-26, 2016, Revised Selected Papers, volume 10163. Springer, 2017.

[29] Mohammad E Khan, Guillaume Bouchard, Kevin P Murphy, and Benjamin M Marlin. Variational bounds for mixed-data factor analysis. In Advances in Neural Information Processing Systems, pages 1108–1116, 2010.

[30] Frank Nielsen and Ke Sun. Guaranteed bounds on the kullback–leibler divergence of univariate mixtures. IEEE Signal Processing Letters, 23(11):1543–1546, 2016.

[31] Martín Abadi, Paul Barham, Jianmin Chen, Zhifeng Chen, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Geoffrey Irving, Michael Isard, et al. Tensorflow: A system for large-scale machine learning. In 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16), pages 265–283, 2016.

[32] Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. Automatic differentiation in pytorch. 2017.

[33] Martín Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen, Craig Citro, Greg S.
Corrado, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Ian Goodfellow, Andrew Harp, Geoffrey Irving, Michael Isard, Yangqing Jia, Rafal Jozefowicz, Lukasz Kaiser, Manjunath Kudlur, Josh Levenberg, Dandelion Mané, Rajat Monga, Sherry Moore, Derek Murray, Chris Olah, Mike Schuster, Jonathon Shlens, Benoit Steiner, Ilya Sutskever, Kunal Talwar, Paul Tucker, Vincent Vanhoucke, Vijay Vasudevan, Fernanda Viégas, Oriol Vinyals, Pete Warden, Martin Wattenberg, Martin Wicke, Yuan Yu, and Xiaoqiang Zheng. TensorFlow: Large-scale machine learning on heterogeneous systems, 2015. Software available from tensorflow.org.

[34] François Chollet et al. Keras (2015), 2019.

[35] Hao Zheng, Zhanlei Yang, Wenju Liu, Jizhong Liang, and Yanpeng Li. Improving deep neural networks using softplus units. In 2015 International Joint Conference on Neural Networks (IJCNN), pages 1–4. IEEE, 2015.

[36] Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997.

[37] Laurens van der Maaten and Geoffrey Hinton. Visualizing data using t-sne. Journal of Machine Learning Research, 9(Nov):2579–2605, 2008.

[38] Martin Ester, Hans-Peter Kriegel, Jörg Sander, Xiaowei Xu, et al. A density-based algorithm for discovering clusters in large spatial databases with noise. In KDD, volume 96, pages 226–231, 1996.