{"title": "Processing of missing data by neural networks", "book": "Advances in Neural Information Processing Systems", "page_first": 2719, "page_last": 2729, "abstract": "We propose a general, theoretically justified mechanism for processing missing data by neural networks. Our idea is to replace typical neuron's response in the first hidden layer by its expected value. This approach can be applied for various types of networks at minimal cost in their modification. Moreover, in contrast to recent approaches, it does not require complete data for training. Experimental results performed on different types of architectures show that our method gives better results than typical imputation strategies and other methods dedicated for incomplete data.", "full_text": "Processing of missing data by neural networks\n\nMarek \u00b4Smieja\n\nmarek.smieja@uj.edu.pl\n\n\u0141ukasz Struski\n\nlukasz.struski@uj.edu.pl\n\nJacek Tabor\n\njacek.tabor@uj.edu.pl\n\nBartosz Zieli\u00b4nski\n\nbartosz.zielinski@uj.edu.pl\n\nPrzemys\u0142aw Spurek\n\nprzemyslaw.spurek@uj.edu.pl\n\nFaculty of Mathematics and Computer Science\n\nJagiellonian University\n\n\u0141ojasiewicza 6, 30-348 Krak\u00f3w, Poland\n\nAbstract\n\nWe propose a general, theoretically justi\ufb01ed mechanism for processing missing\ndata by neural networks. Our idea is to replace typical neuron\u2019s response in the\n\ufb01rst hidden layer by its expected value. This approach can be applied for various\ntypes of networks at minimal cost in their modi\ufb01cation. Moreover, in contrast to\nrecent approaches, it does not require complete data for training. Experimental\nresults performed on different types of architectures show that our method gives\nbetter results than typical imputation strategies and other methods dedicated for\nincomplete data.\n\n1\n\nIntroduction\n\nLearning from incomplete data has been recognized as one of the fundamental challenges in machine\nlearning [1]. 
Due to the great interest in deep learning in the last decade, it is especially important to establish unified tools for practitioners to process missing data with arbitrary neural networks.\nIn this paper, we introduce a general, theoretically justified methodology for feeding neural networks with missing data. Our idea is to model the uncertainty on missing attributes by probability density functions, which eliminates the need for direct completion (imputation) by single values. In consequence, every missing data point is identified with a parametric density, e.g. a GMM, which is trained together with the remaining network parameters. To process this probabilistic representation by a neural network, we generalize the neuron's response at the first hidden layer by taking its expected value (Section 3). This strategy can be understood as calculating the average neuron's activation over the imputations drawn from the missing data density (see Figure 1 for an illustration).\nThe main advantage of the proposed approach is the ability to train a neural network on data sets containing only incomplete samples (without a single fully observable data point). This distinguishes our approach from recent models like the context encoder [2, 3], the denoising autoencoder [4] or a modified generative adversarial network [5], which require complete data as the output of the network in training.\nMoreover, our approach can be applied to various types of neural networks and requires only minimal modifications to their architectures. Our main theoretical result shows that this generalization does not lead to a loss of information when processing the input (Section 4). 
Experimental results performed on several types of networks demonstrate the practical usefulness of the method (see Section 5 and Figure 2 for sample results).\n\n32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montréal, Canada.\n\n\f[Figure 1 schematic: for an input x1, ⋆, x3, ⋆, x5, x6, x7 (⋆ marks a missing attribute) with weights w1, . . . , w7, the first hidden layer outputs ∫ φ(wT x + b) FS(x) dx, computed from the GMM params (pi, µi, Σi)i.]\n\nFigure 1: A missing data point (x, J), where x ∈ RD and J ⊂ {1, . . . , D} denotes the absent attributes, is represented as a conditional density FS (the data density restricted to the affine subspace S = Aff[x, J] identified with (x, J)). Instead of calculating the activation function φ on a single data point (as for complete data points), the first hidden layer computes the expected activation of neurons. The parameters of the missing data density (pi, µi, Σi)i are tuned jointly with the remaining network parameters.\n\n2 Related work\n\nA typical strategy for using machine learning methods with incomplete inputs relies on filling absent attributes based on observable ones [6], e.g. mean or k-NN imputation. One can also train separate models, e.g. neural networks [7], extreme learning machines (ELM) [8], k-nearest neighbors [9], etc., for predicting the unobserved features. Iterative filling of missing attributes is one of the most popular techniques in this class [10, 11]. Recently, a modified generative adversarial net (GAN) was adapted to fill in absent attributes with realistic values [12]. 
A supervised imputation, which learns a replacement value for each missing attribute jointly with the remaining network parameters, was proposed in [13].\nInstead of generating candidates for filling missing attributes, one can build a probabilistic model of incomplete data (under certain assumptions on the missing mechanism) [14, 15], which is subsequently fed into a particular learning model [16, 17, 18, 19, 20, 21, 22, 23]. A decision function can also be learned based on the visible inputs alone [24, 25]; see [26, 27] for the SVM and random forest cases. Pelckmans et al. [28] modeled the expected risk under the uncertainty of the predicted outputs. The authors of [29] designed an algorithm for kernel classification under a low-rank assumption, while Goldberg et al. [30] used a matrix completion strategy to solve the missing data problem.\nThe paper [31] used recurrent neural networks with feedback into the input units, which fill absent attributes for the sole purpose of minimizing a learning criterion. By applying rough set theory, the authors of [32] presented a feedforward neural network which gives an imprecise answer as the result of input data imperfection. Goodfellow et al. [33] introduced the multi-prediction deep Boltzmann machine, which is capable of solving different inference problems, including classification with missing inputs.\nAlternatively, missing data can be processed using the popular context encoder (CE) [2, 3] or a modified GAN [5], which were proposed for filling missing regions in natural images. The other possibility would be to use a denoising autoencoder [4], which was used e.g. for removing complex patterns like superimposed text from an image. 
Both approaches, however, require complete data as the output of the network in the training phase, which is in contradiction with many real data sets (such as medical ones).\n\n\f3 Layer for processing missing data\n\nIn this section, we present our methodology for feeding neural networks with missing data. We show how to represent incomplete data by probability density functions and how to generalize the neuron's activation function to process them.\nMissing data representation. A missing data point is denoted by (x, J), where x ∈ RD and J ⊂ {1, . . . , D} is the set of attributes with missing values. With each missing point (x, J) we associate the affine subspace consisting of all points which coincide with x on the known coordinates J′ = {1, . . . , D} \ J:\n\nS = Aff[x, J] = x + span(eJ),\n\nwhere eJ = [ej]j∈J and ej is the j-th canonical vector in RD.\nLet us assume that the values at the missing attributes come from the unknown D-dimensional probability distribution F. Then we can model the unobserved values of (x, J) by restricting F to the affine subspace S = Aff[x, J]. In consequence, the possible values of the incomplete data point (x, J) are described by a conditional density1 FS : S → R given by (see Figure 1):\n\nFS(x) = F(x) / ∫S F(s) ds for x ∈ S, and FS(x) = 0 otherwise. (1)\n\nNotice that FS is a degenerate density defined on the whole RD space2, which allows us to interpret it as a probabilistic representation of the missing data point (x, J).\nIn our approach, we use a mixture of Gaussians (GMM) with diagonal covariance matrices as the missing data density F. The choice of diagonal covariances reduces the number of model parameters, which is crucial in high-dimensional problems. Clearly, the conditional density for a mixture of Gaussians is a (degenerate) mixture of Gaussians with support in the subspace. 
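For a diagonal GMM the restriction above can be computed in closed form: each component is reweighted by its likelihood on the observed coordinates, while on the missing coordinates it keeps its (diagonal) mean and variance. The following is a minimal NumPy sketch of that computation; it is our own illustration rather than the authors' released code (all names are ours, and the regularization described in the Supplementary Materials is omitted):

```python
import numpy as np

def conditional_gmm(x, missing, p, means, variances):
    """Restrict a diagonal GMM sum_i p_i N(m_i, diag(s_i^2)) to the affine
    subspace Aff[x, J] of a missing point (x, J).

    missing -- boolean mask of length D (True marks the set J)

    Returns the renormalized mixture weights together with the means and
    variances of each component on the missing coordinates.
    """
    obs = ~missing
    # log-likelihood of the observed coordinates under each component
    log_lik = -0.5 * np.sum(
        np.log(2 * np.pi * variances[:, obs])
        + (x[obs] - means[:, obs]) ** 2 / variances[:, obs],
        axis=1,
    )
    log_w = np.log(p) + log_lik
    cond_p = np.exp(log_w - log_w.max())   # stable softmax-style normalization
    cond_p /= cond_p.sum()
    return cond_p, means[:, missing], variances[:, missing]
```

The returned weights and per-coordinate moments are exactly the ingredients that the generalized activations of Section 3 consume.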
Moreover, we apply an additional regularization in the calculation of the conditional density (1) to avoid certain artifacts which appear when Gaussian densities are used3. This regularization allows us to move from the typical conditional density given by (1) to the marginal density in the limiting case. Precise formulas for the regularized density for a GMM, with detailed explanations, are presented in the Supplementary Materials (Section 1).\nGeneralized neuron's response. To process probability density functions (representing missing data points) with neural networks, we generalize the neuron's activation function. For a probability density function FS, we define the generalized response (activation) of a neuron n : RD → R on FS as the mean output:\n\nn(FS) = E[n(x) | x ∼ FS] = ∫ n(x) FS(x) dx.\n\nObserve that it is sufficient to generalize the neuron's response at the first layer only, while the rest of the network architecture can remain unchanged. The basic requirement is the ability to compute the expected value with respect to FS. We demonstrate that the generalized responses of ReLU and RBF neurons with respect to a mixture of diagonal Gaussians can be calculated efficiently.\nLet us recall that the ReLU neuron is given by\n\nReLUw,b(x) = max(wT x + b, 0),\n\nwhere w ∈ RD and b ∈ R is the bias. Given a 1-dimensional Gaussian density N(m, σ2), we first evaluate ReLU[N(m, σ2)], where ReLU(x) = max(0, x). If we define an auxiliary function:\n\nNR(w) = ReLU[N(w, 1)],\n\nthen the generalized response equals:\n\nReLU[N(m, σ2)] = σ NR(m/σ).\n\n1More precisely, FS equals the density F conditioned on the observed attributes.\n2An example of a degenerate density is a degenerate Gaussian N(m, Σ), for which Σ is not invertible. A degenerate Gaussian is defined on an affine subspace (given by the image of Σ); see [34] for details. 
For simplicity we use the same notation N(m, Σ) to denote both standard and degenerate Gaussians.\n\n3One can show that the conditional density of a missing point (x, J) sufficiently distant from the data reduces to a single Gaussian, whose center is nearest in the Mahalanobis distance to Aff[x, J].\n\n\fAn elementary calculation gives:\n\nNR(w) = (1/√(2π)) exp(−w2/2) + (w/2)(1 + erf(w/√2)), (2)\n\nwhere erf(z) = (2/√π) ∫0^z exp(−t2) dt.\nWe proceed with the general case, where an input data point x is generated from a mixture of (degenerate) Gaussians. The following theorem shows how to calculate the generalized response of ReLUw,b, where w ∈ RD, b ∈ R are the neuron weights.\nTheorem 3.1. Let F = ∑i pi N(mi, Σi) be a mixture of (possibly degenerate) Gaussians. Given weights w = (w1, . . . , wD) ∈ RD and b ∈ R, we have:\n\nReLUw,b(F) = ∑i pi √(wT Σi w) NR((wT mi + b) / √(wT Σi w)).\n\nProof. If x ∼ N(m, Σ), then wT x + b ∼ N(wT m + b, wT Σ w). Consequently, if x ∼ ∑i pi N(mi, Σi), then wT x + b ∼ ∑i pi N(wT mi + b, wT Σi w). Making use of (2), we get:\n\nReLUw,b(F) = ∫R ReLU(x) ∑i pi N(wT mi + b, wT Σi w)(x) dx = ∑i pi ∫0^∞ x N(wT mi + b, wT Σi w)(x) dx = ∑i pi √(wT Σi w) NR((wT mi + b) / √(wT Σi w)).\n\nWe now show the formula for the generalized RBF neuron's activation. Let us recall that the RBF function is given by RBFc,Γ(x) = N(c, Γ)(x).\nTheorem 3.2. Let F = ∑i pi N(mi, Σi) be a mixture of (possibly degenerate) Gaussians and let the RBF unit be parametrized by N(c, Γ). 
We have:\n\nRBFc,Γ(F) = ∑i pi N(mi − c, Γ + Σi)(0).\n\nProof. We have:\n\nRBFc,Γ(F) = ∫RD RBFc,Γ(x) F(x) dx = ∑i pi ∫RD N(c, Γ)(x) N(mi, Σi)(x) dx = ∑i pi ⟨N(c, Γ), N(mi, Σi)⟩ = ∑i pi N(mi − c, Γ + Σi)(0). (3)\n\nNetwork architecture. The adaptation of a given neural network to incomplete data relies on the following steps:\n\n1. Estimation of the missing data density with the use of a mixture of diagonal Gaussians. If the data satisfy the missing at random (MAR) assumption, then we can adapt the EM algorithm to estimate the incomplete data density with the use of a GMM. In the more general case, we can let the network learn the optimal parameters of the GMM with respect to its cost function4. The latter case was examined in the experiments.\n\n4If a huge amount of complete data is available during training, one should use variants of the EM algorithm to estimate the data density. It could either be used directly as the missing data density or tuned by the neural network using a small amount of missing data.\n\n\f2. Generalization of the neuron's response. A missing data point (x, J) is interpreted as the mixture of degenerate Gaussians FS on S = Aff[x, J]. Thus we need to generalize the activation functions of all neurons in the first hidden layer of the network to process probability measures. In consequence, the response of n(·) on (x, J) is given by n(FS).\n\nThe rest of the architecture does not change, i.e. the modification is only required in the first hidden layer.\nObserve that our generalized network can also process classical points, which do not contain any missing values. In this case, the generalized neurons reduce to classical ones, because the missing data density F is only used to estimate possible values at absent attributes. If all attributes are complete, then this density is simply not used. 
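Both generalized responses have short closed-form implementations. The sketch below is our own illustration under the paper's diagonal-covariance assumption (function names are ours); it evaluates NR from equation (2) via the standard error function and then the mixture responses of Theorems 3.1 and 3.2:

```python
import math
import numpy as np

def nr(w):
    """NR(w) = ReLU[N(w, 1)]: the expected value of max(0, Z) for Z ~ N(w, 1),
    i.e. equation (2): exp(-w^2/2)/sqrt(2*pi) + (w/2)*(1 + erf(w/sqrt(2)))."""
    return math.exp(-w * w / 2) / math.sqrt(2 * math.pi) \
        + w / 2 * (1 + math.erf(w / math.sqrt(2)))

def relu_response(w, b, p, means, variances):
    """Generalized ReLU response of Theorem 3.1 for a diagonal GMM:
    E[max(w^T x + b, 0)] with x ~ sum_i p_i N(m_i, diag(s_i^2))."""
    total = 0.0
    for p_i, m_i, s2_i in zip(p, means, variances):
        mu = float(np.dot(w, m_i)) + b              # mean of w^T x + b under component i
        sd = math.sqrt(float(np.dot(w * w, s2_i)))  # std of w^T x + b under component i
        total += p_i * sd * nr(mu / sd)
    return total

def rbf_response(c, gamma, p, means, variances):
    """Generalized RBF response of Theorem 3.2:
    sum_i p_i N(m_i - c, diag(gamma + s_i^2))(0)."""
    total = 0.0
    for p_i, m_i, s2_i in zip(p, means, variances):
        var = gamma + s2_i                          # diagonal of Gamma + Sigma_i
        quad = float(np.sum((m_i - c) ** 2 / var))
        log_norm = -0.5 * float(np.sum(np.log(2 * math.pi * var)))
        total += p_i * math.exp(log_norm - 0.5 * quad)
    return total
```

A useful sanity check is to compare relu_response against a Monte Carlo average of max(0, wT x + b) over samples drawn from the mixture; the two agree up to sampling error.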
In consequence, if we want to handle missing data at test time, we need to feed the network with incomplete data during training to fit an accurate density model.\n\n4 Theoretical analysis\n\nA natural question arises: how much information do we lose by using the generalized neuron's activation at the first layer? Our main theoretical result shows that our approach does not lead to a loss of information, which justifies our reasoning from a theoretical perspective. For transparency, we will work with general probability measures instead of density functions. The generalized response of a neuron n : RD → R evaluated on a probability measure µ is given by:\n\nn(µ) := ∫ n(x) dµ(x).\n\nThe following theorem shows that a neural network with generalized ReLU units is able to distinguish any two probability measures. The proof is a natural modification of the respective standard proofs of the Universal Approximation Property (UAP), and therefore we present only its sketch. Observe that all generalized ReLU units return finite values iff the probability measure µ satisfies the condition\n\n∫ ‖x‖ dµ(x) < ∞. (4)\n\nThat is the reason why we restrict to such measures in the following theorem.\nTheorem 4.1. Let µ, ν be probability measures satisfying condition (4). If\n\nReLUw,b(µ) = ReLUw,b(ν) for all w ∈ RD, b ∈ R, (5)\n\nthen µ = ν.\nProof. Let us fix an arbitrary w ∈ RD and define the set\n\nFw = {p : R → R : ∫ p(wT x) dµ(x) = ∫ p(wT x) dν(x)}.\n\nOur main step in the proof lies in showing that Fw contains all continuous bounded functions.\nLet ri ∈ R with −∞ = r0 < r1 < . . . < rl−1 < rl = ∞ and qi ∈ R with q0 = q1 = 0 = ql−1 = ql be given. 
Let Q : R → R be the piecewise linear continuous function which is affine on the intervals [ri, ri+1] and satisfies Q(ri) = qi. We show that Q ∈ Fw. Since\n\nQ = ∑i=1^(l−1) qi · Tri−1,ri,ri+1,\n\nwhere the tent-like piecewise linear function T is defined by\n\nTp0,p1,p2(r) = 0 for r ≤ p0; (r − p0)/(p1 − p0) for r ∈ [p0, p1]; (p2 − r)/(p2 − p1) for r ∈ [p1, p2]; 0 for r ≥ p2,\n\n\fFigure 2: Reconstructions of partially incomplete images using the autoencoder. From left: (1) the original image, (2) the image with missing pixels passed to the autoencoder; the output produced by the autoencoder when unknown pixels were initially filled by (3) k-nn imputation and (4) mean imputation; (5) the results obtained by the autoencoder with dropout, (6) our method and (7) the context encoder. All columns except the last one were obtained with the loss function computed based on pixels from outside the mask (no fully observable data available in the training phase). It can be noticed that our method gives much sharper images than the competitive methods.\n\nit is sufficient to prove that T ∈ Fw. Let Mp(r) = max(0, r − p). Clearly\n\nTp0,p1,p2 = (1/(p1 − p0)) · (Mp0 − Mp1) − (1/(p2 − p1)) · (Mp1 − Mp2).\n\nHowever, directly from (5) we see that Mp ∈ Fw for every p, and consequently T and Q are also in Fw.\nNow let us fix an arbitrary bounded continuous function G. We show that G ∈ Fw. To observe this, take a uniformly bounded sequence of piecewise linear functions of the form described above which converges pointwise to G. 
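The decomposition of the tent function into shifted ReLU terms Mp is easy to verify numerically. A small self-contained check, our own sketch using the sign convention Tp0,p1,p2 = (Mp0 − Mp1)/(p1 − p0) − (Mp1 − Mp2)/(p2 − p1):

```python
def tent(r, p0, p1, p2):
    """Piecewise-linear tent: 0 outside [p0, p2], rising to 1 at p1."""
    if r <= p0 or r >= p2:
        return 0.0
    return (r - p0) / (p1 - p0) if r <= p1 else (p2 - r) / (p2 - p1)

def tent_from_relu(r, p0, p1, p2):
    """The same tent assembled from ReLU-type terms M_p(r) = max(0, r - p)."""
    M = lambda p: max(0.0, r - p)
    return (M(p0) - M(p1)) / (p1 - p0) - (M(p1) - M(p2)) / (p2 - p1)
```

Since each Mp is a ReLU of the scalar wT x, equality of these two functions on a grid of points is exactly the reduction used in the proof.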
By the Lebesgue dominated convergence theorem we obtain that G ∈ Fw.\nTherefore cos(·), sin(·) ∈ Fw, and consequently for the function eir = cos r + i sin r we have the equality\n\n∫ exp(iwT x) dµ(x) = ∫ exp(iwT x) dν(x).\n\nSince w ∈ RD was chosen arbitrarily, this means that the characteristic functions of the two measures coincide, and therefore µ = ν.\n\nIt is possible to obtain an analogous result for the RBF activation function. Moreover, we can also obtain a more general result under stronger assumptions on the considered probability measures. More precisely, if a given family of neurons satisfies the UAP, then their generalizations are also capable of identifying any probability measure with compact support. A complete analysis of both cases is presented in the Supplementary Material (Section 2).\n\n5 Experiments\n\nWe evaluated our model on three types of architectures. First, as a proof of concept, we verified the proposed approach in the context of an autoencoder (AE). Next, we applied a multilayer perceptron (MLP) to a multiclass classification problem, and finally we used a shallow radial basis function network (RBFN) for binary classification. For comparison, we only considered methods with publicly available code, and thus many methods described in the related work section have not been taken into account.\nThe code implementing the proposed method is available at https://github.com/lstruski/Processing-of-missing-data-by-neural-networks.\nAutoencoder. An autoencoder (AE) is usually used for generating a compressed representation of data. However, in this experiment, we were interested in restoring corrupted images, where part of the data was hidden.\n\n\fTable 1: Mean square error of reconstruction on MNIST incomplete images (we report the errors calculated over the whole area, inside and outside the mask). 
Described errors are obtained for images with intensities scaled to [0, 1].\n\n                          only missing data                       complete data\n                          k-nn      mean      dropout   our       CE\nTotal error               0.01189   0.01727   0.01379   0.01056   0.01326\nError inside the mask     0.00722   0.00898   0.00882   0.00810   0.00710\nError outside the mask    0.00468   0.00829   0.00498   0.00246   0.00617\n\nAs a data set, we used the grayscale handwritten digits retrieved from the MNIST database. For each image of the size 28 × 28 = 784 pixels, we removed a square patch of the size5 13 × 13. The location of the patch was uniformly sampled for each image. The AE used in the experiments consists of 5 hidden layers with 256, 128, 64, 128, 256 neurons in the subsequent layers. The first layer was parametrized by ReLU activation functions, while the remaining units used sigmoids6.\nAs described in Section 1, our model assumes that there is no complete data in the training phase. Therefore, the loss function was computed based only on pixels from outside the mask.\nAs a baseline, we considered the combination of an analogous architecture with popular imputation techniques:\nk-nn: Missing features were replaced with the mean values of those features computed from the K nearest training samples (we used K = 5). The neighborhood was measured using the Euclidean distance in the subspace of observed features.\nmean: Missing features were replaced with the mean values of those features computed for all (incomplete) training samples.\ndropout: Input neurons with missing values were dropped7.\nAdditionally, we used a type of context encoder (CE), where missing features were replaced with mean values; however, in contrast to mean imputation, the complete data were used as the output of the network in the training phase. 
This model was expected to perform better, because it used complete data in computing the network's loss function.\nIncomplete inputs and their reconstructions obtained with various approaches are presented in Figure 2 (more examples are included in the Supplementary Material, Section 3). It can be observed that our method gives sharper images than the competitive methods. In order to support the qualitative results, we calculated the mean square error of reconstruction (see Table 1). The quantitative results confirm that our method has a lower error than the imputation methods, both inside and outside the mask. Moreover, it outperforms CE on the whole area and on the area outside the mask. In the case of the area inside the mask, the CE error is only slightly better than ours; however, CE requires complete data in training.\nMultilayer perceptron. In this experiment, we considered a typical MLP architecture with 3 ReLU hidden layers. It was applied to a multiclass classification problem on the Epileptic Seizure Recognition data set (ESR) taken from [35]. Each 178-dimensional vector (out of 11500 samples) is an EEG recording of a given person for 1 second, categorized into one of 5 classes. To generate missing attributes, we randomly removed 25%, 50%, 75% and 90% of the values.\nIn addition to the imputation methods described in the previous experiment, we also used iterative filling of missing attributes with Multiple Imputation by Chained Equations (mice), where several imputations are drawn from the conditional distribution of the data by Markov chain Monte Carlo techniques [10, 11]. 
Moreover, we considered the mixture of Gaussians (gmm), where missing features were replaced with values sampled from a GMM estimated from the incomplete data using the EM algorithm8.\n\n5In the case when the removed patch size was smaller, all considered methods performed very well and could not be visibly distinguished.\n6We also experimented with ReLU in the remaining layers (except the last one); however, the results we obtained were less plausible.\n7Values of the remaining neurons were divided by 1 − dropout rate.\n\n\fTable 2: Classification results on ESR data obtained using MLP (the results of CE are not bolded, because it had access to complete examples).\n\n                            only missing data                             complete data\n% of missing   k-nn    mice    mean    gmm     dropout   our      CE\n25%            0.773   0.823   0.799   0.823   0.796     0.815    0.812\n50%            0.773   0.816   0.703   0.801   0.780     0.817    0.813\n75%            0.628   0.786   0.624   0.748   0.755     0.787    0.792\n90%            0.615   0.670   0.596   0.697   0.749     0.760    0.771\n\nTable 3: Summary of data sets with internally absent attributes.\n\nData set         #Instances   #Attributes   #Missing\nbands            539          19            5.38%\nkidney disease   400          24            10.54%\nhepatitis        155          19            5.67%\nhorse            368          22            23.80%\nmammographics    961          5             3.37%\npima             768          8             12.24%\nwisconsin        699          9             0.25%\n\nWe applied a double 5-fold cross-validation procedure to report classification results and to tune the required hyper-parameters. The number of mixture components for our method was selected in the inner cross-validation from the possible values {2, 3, 5}. The initial mixture of Gaussians was selected using a classical GMM with diagonal matrices. The results were assessed using the classical accuracy measure.\nThe results presented in Table 2 show the advantage of our model over classical imputation methods, which give reasonable results only for a low number of missing values. 
It is also slightly better than dropout, which is more robust to the number of absent attributes than typical imputations. It can be seen that our method gives scores comparable to CE, even though CE had access to complete training data. We also ran the MLP on complete ESR data (with no missing attributes), which gave 0.836 accuracy.\nRadial basis function network. An RBFN can be considered a minimal architecture implementing our model, as it contains only one hidden layer. We used the cross-entropy function applied on a softmax in the output layer. This network is well suited to small low-dimensional data.\nFor the evaluation, we considered two-class data sets retrieved from the UCI repository [36] with internally missing attributes; see Table 3 (more data sets are included in the Supplementary Materials, Section 4). Since the classification is binary, we extended the baseline with two additional SVM kernel models which work directly with incomplete data without performing any imputations:\ngeom: Its objective function is based on the geometric interpretation of the margin and aims to maximize the margin of each sample in its own relevant subspace [26].\nkarma: This algorithm iteratively tunes a kernel classifier under low-rank assumptions [29].\nThe above SVM methods were combined with the RBF kernel function.\nWe applied an analogous cross-validation procedure as before. The number of RBF units was selected in the inner cross-validation from the range {25, 50, 75, 100}. The initial centers of the RBFNs were randomly selected from the training data, while the variances were sampled from N(0, 1). For the SVM methods, the margin parameter C and the kernel radius γ were selected from {2k : k = −5, −3, . . . , 9} for both parameters. For karma, the additional parameter γkarma was selected from the set {1, 2}.\n\n8Due to the high dimensionality of the MNIST data, mice was not able to construct imputations in the previous experiment. 
Analogously, the EM algorithm was not able to fit a GMM because of the singularity of the covariance matrices.\n\n\fTable 4: Classification results obtained using RBFN (the results of CE are not bolded, because it had access to complete examples).\n\n            only missing data                                                     complete data\ndata        karma   geom    k-nn    mice    mean    gmm     dropout   our        CE\nbands       0.580   0.571   0.520   0.544   0.545   0.577   0.616     0.598      0.621\nkidney      0.995   0.986   0.992   0.992   0.985   0.980   0.983     0.993      0.996\nhepatitis   0.665   0.817   0.825   0.792   0.825   0.820   0.780     0.846      0.843\nhorse       0.826   0.822   0.807   0.820   0.793   0.818   0.823     0.864      0.858\nmammogr.    0.773   0.815   0.822   0.825   0.819   0.803   0.814     0.831      0.822\npima        0.768   0.766   0.767   0.769   0.760   0.742   0.754     0.747      0.743\nwisconsin   0.958   0.958   0.967   0.970   0.965   0.957   0.964     0.970      0.968\n\nThe results, presented in Table 4, indicate that our model outperformed the imputation techniques in almost all cases. This partially confirms that using raw incomplete data in neural networks is usually a better approach than filling in the missing attributes before the learning process. Moreover, our model obtained more accurate results than the modified kernel methods, which work directly on incomplete data.\n\n6 Conclusion\n\nIn this paper, we proposed a general approach for adapting neural networks to process incomplete data, which is able to train on data sets containing only incomplete samples. Our strategy introduces an input layer for processing missing data, which can be used for a wide range of networks and does not require extensive modifications. Thanks to representing incomplete data with probability density functions, it is possible to determine a more generalized and accurate response (activation) of a neuron. We showed that this generalization is justified from a theoretical perspective. 
The experiments confirm its practical usefulness in various tasks and for diverse network architectures. In particular, it gives results comparable to methods which require complete data in training.\n\nAcknowledgement\n\nThis work was partially supported by the National Science Centre, Poland (grants no. 2016/21/D/ST6/00980, 2015/19/B/ST6/01819, 2015/19/D/ST6/01215, 2015/19/D/ST6/01472). We would like to thank the anonymous reviewers for their valuable comments on our paper.\n\nReferences\n[1] Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep learning. MIT Press, 2016.\n\n[2] Deepak Pathak, Philipp Krahenbuhl, Jeff Donahue, Trevor Darrell, and Alexei A Efros. Context encoders: Feature learning by inpainting. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2536–2544, 2016.\n\n[3] Chao Yang, Xin Lu, Zhe Lin, Eli Shechtman, Oliver Wang, and Hao Li. High-resolution image inpainting using multi-scale neural patch synthesis. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), volume 1, page 3, 2017.\n\n[4] Junyuan Xie, Linli Xu, and Enhong Chen. Image denoising and inpainting with deep neural networks. In Advances in Neural Information Processing Systems, pages 341–349, 2012.\n\n[5] Raymond A Yeh, Chen Chen, Teck Yian Lim, Alexander G Schwing, Mark Hasegawa-Johnson, and Minh N Do. Semantic image inpainting with deep generative models. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5485–5493, 2017.\n\n[6] Patrick E McKnight, Katherine M McKnight, Souraya Sidani, and Aurelio Jose Figueredo. Missing data: A gentle introduction. Guilford Press, 2007.\n\n[7] Peter K Sharpe and RJ Solly. Dealing with missing values in neural network-based diagnostic systems. 
Neural Computing & Applications, 3(2):73–77, 1995.\n\n[8] Dušan Sovilj, Emil Eirola, Yoan Miche, Kaj-Mikael Björk, Rui Nian, Anton Akusok, and Amaury Lendasse. Extreme learning machine for missing data using multiple imputations. Neurocomputing, 174:220–231, 2016.\n\n[9] Gustavo EAPA Batista, Maria Carolina Monard, et al. A study of k-nearest neighbour as an imputation method. HIS, 87(251-260):48, 2002.\n\n[10] Stef van Buuren and Karin Groothuis-Oudshoorn. mice: Multivariate imputation by chained equations in R. Journal of Statistical Software, 45(3), 2011.\n\n[11] Melissa J Azur, Elizabeth A Stuart, Constantine Frangakis, and Philip J Leaf. Multiple imputation by chained equations: what is it and how does it work? International Journal of Methods in Psychiatric Research, 20(1):40–49, 2011.\n\n[12] Jinsung Yoon, James Jordon, and Mihaela van der Schaar. GAIN: Missing data imputation using generative adversarial nets. In Proceedings of the International Conference on Machine Learning, pages 5689–5698, 2018.\n\n[13] Maya Gupta, Andrew Cotter, Jan Pfeifer, Konstantin Voevodski, Kevin Canini, Alexander Mangylov, Wojciech Moczydlowski, and Alexander Van Esbroeck. Monotonic calibrated interpolated look-up tables. The Journal of Machine Learning Research, 17(1):3790–3836, 2016.\n\n[14] Zoubin Ghahramani and Michael I Jordan. Supervised learning from incomplete data via an EM approach. In Advances in Neural Information Processing Systems, pages 120–127. Citeseer, 1994.\n\n[15] Volker Tresp, Subutai Ahmad, and Ralph Neuneier. Training neural networks with deficient data. In Advances in Neural Information Processing Systems, pages 128–135, 1994.\n\n[16] Marek Śmieja, Łukasz Struski, and Jacek Tabor. Generalized RBF kernel for incomplete data. arXiv preprint arXiv:1612.01480, 2016.\n\n[17] David Williams, Xuejun Liao, Ya Xue, and Lawrence Carin. Incomplete-data classification using logistic regression. 
In Proceedings of the International Conference on Machine Learning, pages 972–979. ACM, 2005.

[18] Alexander J Smola, SVN Vishwanathan, and Thomas Hofmann. Kernel methods for missing variables. In Proceedings of the International Conference on Artificial Intelligence and Statistics, 2005.

[19] David Williams and Lawrence Carin. Analytical kernel matrix completion with incomplete multi-view data. In Proceedings of the ICML Workshop on Learning With Multiple Views, 2005.

[20] Pannagadatta K Shivaswamy, Chiranjib Bhattacharyya, and Alexander J Smola. Second order cone programming approaches for handling missing and uncertain data. Journal of Machine Learning Research, 7:1283–1314, 2006.

[21] Diego PP Mesquita, João PP Gomes, and Leonardo R Rodrigues. Extreme learning machines for datasets with missing values using the unscented transform. In Intelligent Systems (BRACIS), 2016 5th Brazilian Conference on, pages 85–90. IEEE, 2016.

[22] Xuejun Liao, Hui Li, and Lawrence Carin. Quadratically gated mixture of experts for incomplete data classification. In Proceedings of the International Conference on Machine Learning, pages 553–560. ACM, 2007.

[23] Uwe Dick, Peter Haider, and Tobias Scheffer. Learning from incomplete data with infinite imputations. In Proceedings of the International Conference on Machine Learning, pages 232–239. ACM, 2008.

[24] Ofer Dekel, Ohad Shamir, and Lin Xiao. Learning to classify with missing and corrupted features. Machine Learning, 81(2):149–178, 2010.

[25] Amir Globerson and Sam Roweis. Nightmare at test time: robust learning by feature deletion. In Proceedings of the International Conference on Machine Learning, pages 353–360. ACM, 2006.

[26] Gal Chechik, Geremy Heitz, Gal Elidan, Pieter Abbeel, and Daphne Koller. Max-margin classification of data with absent features.
Journal of Machine Learning Research, 9:1–21, 2008.

[27] Jing Xia, Shengyu Zhang, Guolong Cai, Li Li, Qing Pan, Jing Yan, and Gangmin Ning. Adjusted weight voting algorithm for random forests in handling missing values. Pattern Recognition, 69:52–60, 2017.

[28] Kristiaan Pelckmans, Jos De Brabanter, Johan AK Suykens, and Bart De Moor. Handling missing values in support vector machine classifiers. Neural Networks, 18(5):684–692, 2005.

[29] Elad Hazan, Roi Livni, and Yishay Mansour. Classification with low rank and missing data. In Proceedings of The 32nd International Conference on Machine Learning, pages 257–266, 2015.

[30] Andrew Goldberg, Ben Recht, Junming Xu, Robert Nowak, and Xiaojin Zhu. Transduction with matrix completion: Three birds with one stone. In Advances in Neural Information Processing Systems, pages 757–765, 2010.

[31] Yoshua Bengio and Francois Gingras. Recurrent neural networks for missing or asynchronous data. In Advances in Neural Information Processing Systems, pages 395–401, 1996.

[32] Robert K Nowicki, Rafal Scherer, and Leszek Rutkowski. Novel rough neural network for classification with missing data. In Methods and Models in Automation and Robotics (MMAR), 2016 21st International Conference on, pages 820–825. IEEE, 2016.

[33] Ian Goodfellow, Mehdi Mirza, Aaron Courville, and Yoshua Bengio. Multi-prediction deep Boltzmann machines. In Advances in Neural Information Processing Systems, pages 548–556, 2013.

[34] Calyampudi Radhakrishna Rao. Linear statistical inference and its applications, volume 2. Wiley, New York, 1973.

[35] Ralph G Andrzejak, Klaus Lehnertz, Florian Mormann, Christoph Rieke, Peter David, and Christian E Elger.
Indications of nonlinear deterministic and finite-dimensional structures in time series of brain electrical activity: Dependence on recording region and brain state. Physical Review E, 64(6):061907, 2001.

[36] Arthur Asuncion and David J. Newman. UCI Machine Learning Repository, 2007.