{"title": "Icebreaker: Element-wise Efficient Information Acquisition with a Bayesian Deep Latent Gaussian Model", "book": "Advances in Neural Information Processing Systems", "page_first": 14820, "page_last": 14831, "abstract": "In this paper, we address the ice-start problem, i.e., the challenge of deploying machine learning models when only a little or no training data is initially available, and acquiring each feature element of data is associated with costs. This setting is representative of the real-world machine learning applications. For instance, in the health care domain, obtaining every single measurement comes with a cost. We propose Icebreaker, a principled framework for elementwise training data acquisition. Icebreaker introduces a full Bayesian Deep Latent Gaussian Model (BELGAM) with a novel inference method, which combines recent advances in amortized inference and stochastic gradient MCMC to enable fast and accurate posterior inference. By utilizing BELGAM\u2019s ability to fully quantify model uncertainty, we also propose two information acquisition functions for imputation and active prediction problems. We demonstrate that BELGAM performs significantly better than previous variational autoencoder (VAE) based models, when the data set size is small, using both machine learning benchmarks and real world recommender systems and health-care applications. Moreover, Icebreaker not only demonstrates improved performance compared to baselines, but it is also capable of achieving better test performance with less training data available.", "full_text": "Icebreaker:\n\nElement-wise Ef\ufb01cient Information Acquisition with\n\na Bayesian Deep Latent Gaussian Model\n\nWenbo Gong1\u2217, Sebastian Tschiatschek2, Richard E. 
Turner1,2,\n\nSebastian Nowozin2\u2020, Jos\u00e9 Miguel Hern\u00e1ndez-Lobato1,2, Cheng Zhang2\n\nAbstract\n\nIn this paper, we address the ice-start problem, i.e., the challenge of deploying\nmachine learning models when only a little or no training data is initially available,\nand acquiring each feature element of data is associated with costs. This setting\nis representative of real-world machine learning applications. For instance, in\nthe health-care domain, obtaining every single measurement comes with a cost.\nWe propose Icebreaker, a principled framework for element-wise training data\nacquisition. Icebreaker introduces a full Bayesian Deep Latent Gaussian Model\n(BELGAM) with a novel inference method, which combines recent advances in\namortized inference and stochastic gradient MCMC to enable fast and accurate\nposterior inference. By utilizing BELGAM\u2019s ability to fully quantify model\nuncertainty, we also propose two information acquisition functions for imputation\nand active prediction problems. We demonstrate that BELGAM performs\nsignificantly better than previous variational autoencoder (VAE) based models when\nthe data set size is small, using both machine learning benchmarks and real-world\nrecommender systems and health-care applications. Moreover, Icebreaker not only\ndemonstrates improved performance compared to baselines, but it is also capable\nof achieving better test performance with less training data available.\n\n1 Introduction\n\nAcquiring information is costly in many real-world applications. For example, a medical doctor often\nneeds to carry out a sequence of lab tests to make a correct diagnosis, where each of these tests is\nassociated with a cost in terms of money, time, and health risks. To this end, an AI system should be\nable to suggest the information to be acquired in the form of \"one measurement (feature) at a time\" for\naccurate predictions (diagnosis) of any new user. 
Recently, test-time active prediction methods, such\nas EDDI (Efficient Dynamic Discovery of high-value Information) [28], provide a solution for such a\nproblem when there is a sufficient amount of training data. Unfortunately, in many scenarios, training\ndata can also be challenging and costly to obtain. For example, new data needs to be collected by\ntaking measurements of currently hospitalized patients with their consent. Ideally, we would like to\ndeploy an AI system, such as EDDI, when no or only limited training data is available. We call this\nproblem the ice-start problem.\n\n1Department of Engineering, University of Cambridge, Cambridge, UK\n\u2217Contributed during internship in Microsoft Research\n2Microsoft Research, Cambridge, UK\n\u2020Now at Google AI, Berlin, Germany (contributed while being with Microsoft Research)\n\nCorrespondence to: Cheng Zhang <Cheng.Zhang@microsoft.com> and Wenbo Gong <wg242@cam.ac.uk>\n\n33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.\n\nThe key to addressing the ice-start problem is to have a scalable model that knows what it does not know,\nnamely to quantify the epistemic uncertainty. This knowledge can be used to guide the acquisition of\ntraining data. Intuitively, unfamiliar but informative features are more useful for model training. We\nrefer to this as element-wise training-time active acquisition.\nTraining-time active acquisition is needed in a great range of applications. One example is the\nrecommender system with no historical user data.\nDespite the success of element-wise test-time active prediction [23, 28, 44, 56], few works have\nprovided a general and scalable solution for the ice-start problem. Additionally, these works [21, 22,\n32] are commonly limited to a specific application scenario. 
More importantly, we need to design\nnew acquisition functions that take the model parameter uncertainty into account.\nIn this work, we propose Icebreaker 1, a principled and efficient framework to solve the ice-start\nproblem. Icebreaker actively acquires informative feature elements during training and also performs\ntwo general test tasks. To enable Icebreaker, we contribute the following:\n\n1. We propose a Bayesian Deep Latent Gaussian Model (BELGAM). Standard training of the\ndeep generative model produces point estimates for the parameters, whereas our approach\napplies a fully Bayesian treatment to the weights. The resulting epistemic uncertainty can\nlater be used for training acquisition. (Section 2)\n\n2. We design a novel partial amortized inference method for BELGAM, named PA-BELGAM.\nWe combine the efficient amortized inference for the local latent variables with stochastic\ngradient MCMC for the model parameters to ensure high inference accuracy. (Section 2.2)\n\n3. To complete Icebreaker, we propose two training-time element-wise information acquisition\nfunctions based on PA-BELGAM for imputation (Section 3) and active prediction (Section\n4) tasks, respectively.\n\n4. We evaluate PA-BELGAM and the entire Icebreaker approach on common machine learning\nbenchmarks and a real-world health-care task. Our method demonstrates clear improvements\nwhen compared to multiple baselines, showing that it can be effectively used to solve the\nice-start problem. 
(Section 5)\n\n2 Bayesian Deep Latent Gaussian Model (BELGAM) with Partial Amortized Inference\n\nHere, we propose a Bayesian Deep Latent Gaussian Model (BELGAM) with explicit epistemic\nuncertainty quantification, and a novel hybrid inference scheme for efficient and accurate inference.\n2.1 Bayesian Deep Latent Gaussian Model (BELGAM)\nA Bayesian latent variable model, shown in Figure 1, is a common\nmodeling choice, but previous work has focused on models that are\ntypically linear and not flexible enough to model highly complex data.\nOn the other hand, the Deep Latent Gaussian Model [20], which uses a\nflexible neural network, does not quantify the parameter uncertainty.\nWe unify the above two models and propose a Bayesian Deep Latent\nGaussian Model (BELGAM), which uses a Bayesian neural network\nto generate observations XO from local latent variables Z with\nglobal weights \u03b8, as shown in Figure 1. The model is thus defined as:\n\np(XO, \u03b8, Z) = p(\u03b8) \u220f_{i=1}^{|O|} [\u220f_{d\u2208Oi} p(xi,d|zi, \u03b8)] p(zi),   (1)\n\nwhere |O| is the amount of observed data, and Oi is the set of indices of observed feature entries\nfor the ith data point. The goal is to infer the posterior, p(\u03b8, Z|XO), for both local latent variables\nZ = [z1, . . . , z|O|] and global latent weights \u03b8. However, the posterior is generally intractable, and\napproximate inference is needed [25, 57]. Variational inference (VI) [3, 18, 25, 52, 57] and sampling-\nbased methods [1] are two types of approaches commonly used for this task. Sampling-based\napproaches are known for accurate inference performance and theoretical guarantees [6].\n\nFigure 1: BELGAM (graphical model with nodes X, Z and \u03b8).\n\n1Code available: https://github.com/microsoft/Icebreaker\n\nHowever, sampling the local latent variable Z is computationally expensive as the cost scales\nlinearly with the data set size. 
To best trade off computational cost against inference accuracy, we\npropose to amortize the inference for Z and keep an accurate sampling-based approach for the global\nlatent weights \u03b8. Specifically, we use preconditioned stochastic gradient Hamiltonian Monte Carlo\n(SGHMC) [6] (see appendix for details).\n2.2 Partial Amortized BELGAM\nRevisiting amortized inference in the presence of missing data. Amortized inference [20, 38]\nis an efficient extension of variational inference. It was originally proposed for inferring local latent\nvariables Z of deep latent Gaussian models. Amortized inference uses a deep neural network as a\nfunction estimator to compute the variational distribution q(zi|xi) for the posterior of zi using xi as\ninput, instead of using individually parameterized approximations q(zi). Thus, the cost of estimating the\nlocal latent variables does not scale with data set size during model training.\nHowever, in our problem setting, the feature values for\neach data instance are partially observed. Thus, the vanilla\namortized inference approach cannot be used, as the input\ndimensionality of the observed data can vary for each\ndata instance. As with the Partial VAE proposed in [28],\nwe adopt a set encoding structure [37, 55] to build an\ninference network to infer Z based on partial observations\nin an amortized manner.\nThe structure of the inference net is shown in Figure 2.\nFor each data instance xi \u2208 XO with |Oi| observed features, the input is modified as Si =\n[si,1, . . . , si,|Oi|] where si,d = [xi,d, ed] and ed is a feature embedding. This is fed into a standard\nneural network h : R^{M+1} \u2192 R^K, where M and K are the dimensions of ed and the latent space,\nrespectively. Finally, a permutation invariant set function g(\u00b7) is applied.\n\nFigure 2: The illustration of P-VAE inference network structure. Each input pair [xi,d, ed] is passed through h(\u00b7), the outputs are aggregated by g(\u00b7), and the result is mapped to the latent space.\n\nAmortized inference + SGHMC As discussed previously, we want to be computationally efficient\nwhen inferring Z and be accurate when inferring the global latent weights \u03b8 for BELGAM. Here, we\ndiscuss how to combine an accurate sampling approach for the global parameters with the efficient\namortized inference for the local latent variables.\nAssume we have the factorized approximate posterior q(\u03b8, Z|XO) \u2248 q(\u03b8|XO)q\u03c6(Z|XO) [20, 28],\nthen the proposed inference scheme can be summarized into two stages: (i) Sample \u03b8 \u223c q(\u03b8|XO)\nusing SGHMC, (ii) Update the amortized inference network q\u03c6(zi|xi) to approximate p(zi|xi).\nFirst, we present how to sample \u03b8 \u223c q(\u03b8|XO) using SGHMC. The optimal form for q(\u03b8|XO) can\nbe defined as q(\u03b8|XO) = (1/C) e^{log p(XO, \u03b8)}, where C is the normalization constant p(XO). The key to\nsampling from such a distribution is to compute the gradient \u2207\u03b8 log p(XO, \u03b8), which, unfortunately,\nis intractable due to marginalizing the latent variable Z. Instead, we propose to approximate this\nquantity by transforming the marginalization into an optimization:\n\nlog p(XO, \u03b8) \u2265 \u2211_{i\u2208XO} [E_{q\u03c6(zi|xi)}[log p(xi|zi, \u03b8)] \u2212 KL[q\u03c6(zi|xi)||p(zi)]] + log p(\u03b8),   (2)\n\nwhere the right-hand side is the lower bound of the joint distribution. Assuming that F is a sufficiently\nlarge function class, we can compute the gradient as:\n\n\u2207\u03b8 log p(XO, \u03b8) = \u2207\u03b8 max_{q\u03c6\u2208F} \u2211_{i\u2208XO} [E_{q\u03c6(zi|xi)}[log p(xi|zi, \u03b8)] \u2212 KL[q\u03c6(zi|xi)||p(zi)]] + log p(\u03b8).   (3)\n\nAfter sampling \u03b8, we then update the inference network with these samples by optimizing:\n\nL(XO; \u03c6) = E_{q(\u03b8,Z|XO)}[log p(XO|Z, \u03b8)] \u2212 KL[q(Z, \u03b8|XO)||p(Z, \u03b8)]\n= E_{q(\u03b8|XO)}[\u2211_{i\u2208XO} E_{q\u03c6(zi|xi)}[log p(xi|zi, \u03b8)] \u2212 KL[q\u03c6(zi|xi)||p(zi)]] \u2212 KL[q(\u03b8|XO)||p(\u03b8)],   (4)\n\nwhere the outer expectation can be approximated by SGHMC samples, and the outer KL penalty is\nintractable but can be ignored for updating the inference network. The resulting inference algorithm\nresembles an iterative update procedure like Monte Carlo Expectation Maximization (MCEM) [53],\nwhich instead samples the latent Z and optimizes \u03b8. We call the proposed model Partial Amortized\nBELGAM (PA-BELGAM). Partial VAE [27] is actually a special case of PA-BELGAM, where \u03b8 is\nestimated by a point instead of with a set of samples.\nNote that, in this way, the computational cost with the single-chain SGHMC is exactly the same as\ntraining a normal VAE thanks to the amortization for Z. Thus, PA-BELGAM scales to large data\nwhen needed. To limit the additional memory cost, we adopt a similar idea based on the Moving Window\nMCEM algorithm [12], where samples are stored and updated in a fixed-size pool with a first-in first-out\nprocedure. 
In the next two sections, we present two objective functions for two general machine\nlearning tasks, respectively: imputation tasks and prediction tasks.\n\n3 Icebreaker for Imputation Tasks\n\nWe present Icebreaker for imputation tasks, which can be directly applied in the same way as [27].\n\nProblem Definition Assume that at each training data acquisition step we have already obtained\ntraining data Dtrain and a pool data set Dpool that contains the data we could query next, with Dtrain \u222a\nDpool = X \u2208 R^{N\u00d7D}. In the ice-start scenario, Dtrain = \u2205. At each step of the training-time\nacquisition, we actively select data points xi,d \u2208 Dpool to acquire, thereby moving them into Dtrain\nand updating the model with the newly formed Dtrain. Figure 3 shows the flow diagram of this\nprocedure at a given step. During the process, there is an observed data set XO (e.g. the training data\nset XO = Dtrain) and an unobserved set XU with |O| and |U| rows, respectively. For each\ndata instance xi \u2208 XO, we have the observed index set Oi containing the indices of the observed\nfeatures for row i. The training time acquisition procedure is summarised in Algorithm 1.\n\nAlgorithm 1: Element-wise training time acquisition\ninput: XO, XU, \u03a6, M, acquisition number K, \u039e\nXO = \u2205;\nwhile XU \u2260 \u2205 do\n    /* Information acquisition */\n    Compute reward R(xi,d, XO) for xi,d \u2208 XU using Eq. 5 or 10;  // Reward computation\n    Sample Xnew;  // Sample K feature elements according to the R value\n    XO = XO \u222a Xnew; XU = XU \u2216 Xnew;  // Update training set (acquired elements leave the pool)\n    /* Model training */\n    Re-initialize model M;  // Re-initialization to avoid local optimum\n    M = Train(M, \u039e);\n    /* Test task */\n    Test(M);  // Test performance of the current model M\nend\n\nFigure 3: Icebreaker Flowchart. The green\nand gray blocks represent observed and unobserved items, respectively.\n\nWe denote the training set Dtrain = XO and the pool set Dpool = XU . 
The model M and training\nhyper-parameters are grouped as \u039e. We evaluate the model\u2019s quality on the test task using metrics such as\npredictive negative log likelihood (NLL).\n3.1 Active Information Acquisition for Imputation\nDesigning the training time acquisition function is nontrivial. Existing information-theoretical\nobjectives such as the one used in EDDI are not applicable in this setting (see appendix C.1). The key\nfor such an objective function is to make the model certain about the data set as quickly as possible\nwhile simultaneously focusing on improving test performance.\nImputing missing values is important in applications such as recommender systems and other down-\nstream tasks. In this setting, the goal is to learn about all the feature elements as quickly as possible.\nThis can be formalized as selecting the elements xi,d that maximize the expected reduction in the\nposterior uncertainty of \u03b8:\n\nRI (xi,d, XO) = H[p(\u03b8|XO)] \u2212 E_{p(xi,d|XO)}[H[p(\u03b8|XO, xi,d)]],   (5)\n\nwhere H[\u00b7] denotes the entropy of a distribution. We use the symmetry of the mutual information to\nsidestep the posterior update p(\u03b8|XO, xi,d) and entropy estimation of \u03b8 for efficiency. Thus, Eq. 5 is\nwritten as\n\nRI (xi,d, XO) = H[p(xi,d|XO)] \u2212 E_{p(\u03b8|XO)}[H[p(xi,d|\u03b8, XO)]].   (6)\n\nWe can approximate Eq. 6 as\n\nRI (xi,d, XO) \u2248 \u2212(1/K) \u2211_{k} log[(1/(MN)) \u2211_{m,n} p(x^k_{i,d}|z^m_i, \u03b8^n)] + (1/(NK)) \u2211_{k,n} log[(1/M) \u2211_{m} p(x^k_{i,d}|z^m_i, \u03b8^n)],   (7)\n\nbased on the samples {\u03b8^n}_{n=1}^{N}, {z^m_i}_{m=1}^{M} and {x^k_{i,d}}_{k=1}^{K} from SGHMC, the amortized inference\nnetwork and the data distribution, respectively. The sample xi,d \u223c p(xi,d|XO) can be generated in\nthe following way: (i) zi \u223c q\u03c6(zi|xio), (ii) \u03b8 \u223c q(\u03b8|XO) and (iii) xi,d \u223c p(xi,d|\u03b8, zi), where xio\nrepresents the observed features in the ith row of XO.\n\n4 Icebreaker for Prediction Tasks\n\nNext, we introduce a second type of test task called active prediction, where a sequence of active\nacquisition steps is carried out before predicting a specified target variable at test time. Note that the\ntypical test prediction task is a special case where no acquisition of features is performed. Here, we\ndemonstrate the case where feature-wise active information acquisition is used at both training and\ntest time, which is desired in data-costly situations.\n\nProblem Definition During the training acquisition, the procedure is the same as in the imputation\ntask, which is shown in Algorithm 1 and Figure 3. The only difference is that we have specified target\nvariables. We denote the target as Y . In this case, each xi \u2208 XO has a corresponding target yi. In\naddition, instead of querying a single feature value xi,d during training, as in the imputation task, we\nquery a feature-target pair (xi,d, yi) if yi has not been queried before. Otherwise, we only query xi,d.\nAs an example, we adopt a procedure similar to the one used in EDDI [28] for test time active prediction, and\nuse the area under the information curve (AUIC) generated from EDDI to evaluate the performance\nof Icebreaker. This reflects the overall model performance with test time active acquisition. The\nevaluation procedure is summarised in Algorithm 3 in the appendix.\n4.1 Model and Active Information Acquisition for Active Prediction\nConditional BELGAM The proposed model and inference algorithm in Section 2 can be easily\nextended to incorporate the target variables. In general, PA-BELGAM can be directly adapted to any\nVAE based framework. 
One possible choice is to adopt the formulation of the conditional VAE [45]\nfor the prediction task here (see appendix B for details).\n\nIcebreaker for active target prediction. For the prediction task, solely reducing the model\nepistemic uncertainty is not optimal as the goal is to predict the target variable Y. Instead, we require\nthe model to (1) capture feature correlations for accurate imputations at both training and test time\n(similar to reducing the model epistemic uncertainty), and (2) find informative features to learn to\npredict the target variable. Thus, the desired acquisition function needs to balance unsupervised\nlearning, which focuses on exploring relations between features, and supervised learning, which exploits\ninformative features to predict specified targets. We propose the following objective:\n\nRP (xi,d, XO) = E_{p(xi,d|XO)}[H[p(yi|xi,d, XO)]] \u2212 E_{p(\u03b8,xi,d|XO)}[H[p(yi|\u03b8, xi,d, XO)]].   (8)\n\nThe above objective is the conditional mutual information I(yi, \u03b8|xi,d; XO). Thus, maximizing\nEq. 8 is the same as maximizing the information gain between the target yi and the model weights\n\u03b8, conditioned on the additional feature xi,d and the observed features XO. In our case, xi,d is\nunobserved. As the weights \u03b8 do not change significantly after collecting xi,d, for computational\nconvenience, we assume p(\u03b8|XO) \u2248 p(\u03b8|XO, xi,d) when estimating the objective.\nAs before, we approximate this objective using Monte Carlo integration:\n\nRP (xi,d, XO) \u2248 \u2212(1/(JK)) \u2211_{j,k} log[(1/(MN)) \u2211_{m,n} p(y^{(j,k)}_i|z^{(m,k)}_i, \u03b8^n)] + (1/(KNJ)) \u2211_{j,n,k} log[(1/M) \u2211_{m} p(y^{(j,k)}_i|z^{(m,k)}_i, \u03b8^n)],   (9)\n\nwhere we draw {z^{(m,k)}_i}_{m=1}^{M} from q\u03c6(zi|XO, x^k_{i,d}) for each imputed sample x^k_{i,d}. Others ({\u03b8^n}_{n=1}^{N},\n{y^{(j,k)}_i}_{j=1}^{J} and {x^k_{i,d}}_{k=1}^{K}) are sampled in a similar way as in the imputation task. This objective\nnaturally balances the exploration of new unseen features that may be informative as well as the\nexploitation of the familiar ones to facilitate learning a better predictor. For example, if feature xi,d\nhas not been observed before or is uninformative about the target, the first entropy term in Eq. 8 will be\nhigh, which encourages the algorithm to pick this data point. However, using this term alone may\nresult in selecting uninformative/noisy features. Thus, we need an extra term that eliminates the\npossibility of selecting uninformative features, which is exactly the second term. Unless xi,d together\nwith \u03b8 can provide extra information about yi, the entropy in the second term for uninformative\nfeatures will still be high. Thus, the two terms combined encourage the model to select the\nless explored but informative features. The resulting objective is mainly targeted at (2) mentioned at\nthe beginning of this subsection. Thus, a natural way to satisfy both (1) and (2) is a combination of\nthe two objectives:\n\nRC(xi,d, XO) = (1 \u2212 \u03b1)RI (xi,d, XO) + \u03b1RP (xi,d, XO),   (10)\n\nwhere \u03b1 controls which task the model focuses on. This objective also has an information-theoretic\ninterpretation. In appendix C.1, we show that when \u03b1 = 1/2, this combined objective is equivalent\nto the mutual information between \u03b8 and the feature-target pair (xi,d, yi).\n\nFigure 4: Boston Housing experimental results. Panels: (a) Boston Housing Imputation, (b) Long-tail selection pattern, (c) Boston Housing Active prediction. (a) The NLL over the number of observed feature\nvalues. (b) The distribution (log scale) of the number of observed features per data instance during\nthe training time. (c) Performance on the active prediction task vs. training set size. The test time\nactive prediction curves at the training data sizes indicated by the black dashed lines are shown in Figure 5.\n\n5 Experiments\n\nWe evaluate Icebreaker first on UCI benchmark data sets [8] on both imputation and prediction tasks.\nWe then consider two real-world applications: (a) a movie rating imputation task using the MovieLens\ndataset [10]; and (b) risk prediction in intensive care using the MIMIC dataset [17].\n\nExperimental setup and evaluation. We compare Icebreaker with a random feature acquisition\nstrategy for training where both P-VAE [28] and PA-BELGAM are used. For the imputation task,\nP-VAE already achieves excellent results in various data sets compared to traditional methods [28, 34].\nAdditionally, for the active prediction task, we compare Icebreaker to an instance-wise active learning\nmethod, denoted as Row AT, in which the data are assumed to be fully observed apart from the target.\nWe evaluate the imputation performance by reporting negative log likelihood (NLL) over the test\ntarget. For the active prediction task, we use EDDI [28] to sequentially select features at test time. 
We\nreport the area under the information curve (AUIC) [28] for the test set (see Figure 5 for an example\nand the appendix for details). A smaller value of AUIC indicates better overall active prediction\nperformance. All experiments are averaged over 10 runs, and their setting details are in the appendix.\n5.1 UCI Data Set\nImputation Task. At each step of Icebreaker, we select 50 feature elements from the pool. Figure\n4a shows the averaged NLL on the test set as the training set increases. Icebreaker outperforms\nrandom acquisition with both PA-BELGAM and P-VAE by a large margin, especially at the early\nstages of training. We also see that PA-BELGAM alone can be beneficial compared to P-VAE with\nsmall data sets. This is because P-VAE tends to over-fit, while PA-BELGAM leverages the model\nuncertainties.\n\nFigure 5: Evaluation of test time performance after exposure to different amounts of training data:\n(Left): 500 feature elements. (Middle): 1250 feature elements. (Right): 2250 feature elements. The x-axis\nindicates the number of actively-acquired feature elements used for prediction. The legend indicates\nthe methods used for training (Icebreaker, Row AT, etc.) and test time acquisition (EDDI, RAND).\n\nWe also analyze the selection pattern. We gather all the rows that have been queried with at least one\nfeature during training acquisition and count how many features are queried for each. We repeat this\nfor the first 5 acquisitions. Figure 4b shows the histogram of the number of features acquired for each\ndata point. The random selection concentrates around one feature per data instance. However, the\nlong-tailed distribution of the number of features selected by Icebreaker means it tends to concentrate\nmore features in certain rows to exploit feature relations for predicting the target but simultaneously tries\nto spread its selection for more exploration. We include imputation results on other UCI data sets in\nthe Appendix. We find that Icebreaker consistently outperforms the baselines by a large margin.\n\nPrediction Task. Figure 4c shows the AUIC curve as the amount of training data increases.\nIcebreaker clearly achieves better results compared to all baselines (also confirmed by Figure 5). This\nshows that it not only yields a more accurate prediction of the targets but also captures correlations\nbetween features and targets. Interestingly, the baseline Row AT performs a little worse than PA-\nBELGAM. We argue that before querying a single target variable, Row AT needs to query the whole\nrow, which incurs a cost equivalent to the number of features. Thus, with fixed query budgets,\nRow AT will form a relatively small but complete data set. 
Again, the uncertainty of PA-BELGAM\nbrings benefits compared to P-VAE with point estimated parameters.\nAt the early training stage (500 data points, the left panel in Figure 5), the performance of Row AT is\nworse at test time than others when few features are selected. This is due to the fact that obtaining a\ncompletely observed datum is costly. With the budget of 500 feature elements, it can only select 50\nfully observed data instances. In contrast, Icebreaker has obtained, within that budget, 260 partially\nobserved instances with different levels of missingness. As more features are selected during the\ntest, these issues are mitigated, and the performance starts to improve. Further evidence suggests\nthat, as the training data grows, we can clearly observe a better prediction performance of Row AT at\nthe early test stage. We also include in the appendix the evaluation of other UCI data sets for active\nprediction.\n5.2 Recommender System using MovieLens\nOne common benchmark data set for recommender systems is MovieLens-1M [10]. P-VAE has\nobtained state-of-the-art imputation performance on this dataset after training with a sufficient amount\nof data [27]. Figure 6a shows the performance on predicting unseen data points in terms of NLL.\nIcebreaker shows that with minimal training data, the model has already learned to predict the unseen\ndata with high accuracy. Given any small amount of data, Icebreaker obtains the best performance at\nthe given query budget, followed by PA-BELGAM, which outperforms P-VAE. The selection pattern\nin Figure 6b is similar to the UCI imputation pattern shown in Figure 4b. We argue this long-tail selection is\nimportant, especially when each row contains many features. The random selection tends to scatter\nthe choices and is less likely to discover dependencies until the data set grows larger. 
However, if\nthere are many features per data instance, this accumulation will take a very long time. On the other\nhand, the long-tailed selection exploits the features inside certain rows to discover their dependencies\nand simultaneously tries to spread out the queries for exploration.\n\nFigure 6: Performance on MovieLens. (a) Imputation NLL curve. (b) Long-tailed selection. Panel (a) shows the imputation NLL vs. the number of\nobserved movie ratings. Panel (b) shows the distribution of the number of features selected per user.\n\n5.3 Mortality Prediction using MIMIC\nWe apply Icebreaker in a health-care setting using the Medical Information Mart for Intensive\nCare (MIMIC III) data set [17]. This is the largest real-world health-care data set in terms of\npatient numbers. The goal is to predict mortality based on 17 medical measurements. The data is\npre-processed following [11] and balanced. Full details are available in appendix E.2.1.\nThe left panel in Figure 7 shows that Icebreaker outperforms the other baselines significantly\nin active prediction with higher robustness (smaller std. error). Robustness is crucial in health-care\nsettings as the cost of unstable model performance is high. As before, Row AT performs worse until\nit accumulates sufficient data. 
Note that even without active training-time feature selection, PA-BELGAM\nperforms better than P-VAE due to its ability to model uncertainty, which is very useful in this\nextremely noisy data set.\nTo evaluate whether the proposed method can discover valuable information, we plot the accumulated\nfeature number in the middle panel of Figure 7. The x-axis indicates the total number of observed\ndata in the training set, and each point on a curve indicates how many values of that feature have been selected in the\ncorresponding training set. We see not only that different features have been collected at different\nfrequencies, but also that the Glucose curve is clearly non-linear. This indicates that the importance\nof different features varies with the training set size. Icebreaker is establishing a sophisticated\nfeature element acquisition scheme that no heuristic method can currently achieve. The top 3\nfeatures all belong to the Glasgow coma scale (GCS). These features have been identified previously as being\nclinically important (e.g. by the IMPACT model [47]). Glucose is also in the IMPACT set. It was not\ncollected frequently in the early stage, but in the later training phase, more Glucose values have been\nselected. Compared to GCS, Glucose has a highly non-linear relationship with the patient outcome\n[36] (see also appendix E.2.1). Icebreaker chooses more informative features with simpler\nrelationships in the very early iterations. As learning progresses, Icebreaker is able to identify\ninformative features with complex relationships to the target. Additionally, the missing rate of\neach feature in the entire data set differs. Capillary refill rate (Cap.) has more than 90% of its data missing,\nmuch higher than Height. Icebreaker is still able to pick the useful and rarely observed information,\nwhile only choosing a small percentage of the irrelevant information at test time. 
On the right-hand side\nof Figure 7, we plot the histogram of the initial choices during test-time acquisition. GCS features are mostly\nselected in the first step, as they are the most informative.\n\n[Figure 6 plots: (a) MovieLens imputation NLL vs. training data set size for Icebreaker, PA-BELGAM, and PVAE; (b) feature picking distribution (frequency vs. number of picked features) for Icebreaker and Random.]\n\nFigure 7: Performance on the MIMIC experiments. (Left) The predictive AUIC curve as the\ntraining data size increases. (Middle) The accumulated feature statistics as active selection progresses.\n(Right) The histogram of initial choices during active prediction using EDDI.\n\n6 Related Work\n\nData-wise Active Learning. The goal of active learning is to obtain optimal model performance\nwith as few queries as possible [29, 31, 43], where only querying labels incurs a\ncost. One category is based on decision theory [39], where the acquisition step is to minimize the\nloss defined by test tasks after making the query based on observed data. Indeed, this coincides\nperfectly with the goal of active learning. However, its evaluation can be expensive in practice\n[19, 59]. Another category is based on information theory, including many previous active learning\napproaches [7, 26, 50]. A well-known acquisition function of this kind is BALD [14], which is based on\nmutual information. Although our acquisition for imputation is also based on mutual information, we\nemphasize that the original BALD objective applies only to scenarios with a complete data set. 
In\nother words, those methods aim only to select the next data instance to label, while assuming that every\nfeature of each data point is observed. We call this approach instance-wise selection.\nThese methods are therefore not directly applicable to the ice-start problem, as they assume that the only cost\ncomes from acquiring labels.\n\nFeature-wise Active Learning.\nInstead of only querying labels, the above active learning idea\ncan be extended to query features, which is known as active feature acquisition (AFA). It makes sequential\nfeature selections in order to improve model performance [5, 15, 32, 40, 41, 48, 49], which is similar\nto our framework. However, these methods are commonly designed for a specific application such as clustering\n[51] or classification [33], assuming the data are fully observed at test time. In addition, many\nmethods have other limitations. For example, only simple linear models can be used [5, 40, 48], or the\nobjective functions are not information-theoretic [15, 32]. None of the above methods can be easily\ncombined with test-time active prediction methods [16, 28, 44]. Our method enables efficient information\nacquisition at both training time and test time in a principled way with a flexible model, which\nis greatly needed in real-life applications.\n\nCold-start problem. Another problem related to ice-start is the cold-start problem [30, 42].\nThe key difference between these two scenarios is that the cold-start problem targets data scarcity at test\ntime, after the model has been trained. Taking the recommender system as an example, the\ncold-start problem handles the scenario in which new users arrive with no historical ratings,\ngiven a trained recommender. One common strategy is to utilise the metadata (e.g. 
user profiles,\nitem category) to initialise the latent factors of users/items [35, 46, 54].\n\n7 Conclusion\n\nIn this work, we introduce the ice-start problem, in which machine learning models are expected to be\ndeployed when little or no training data has been collected. The cost of collecting new training\ndata applies at the level of feature elements. Icebreaker provides an information-theoretic way to\nactively acquire element-wise training data, and it uses the minimum amount of data for downstream\ntest tasks like imputation and active prediction. Within the framework of Icebreaker, we propose\nPA-BELGAM, a Bayesian deep latent Gaussian model together with a novel inference scheme\nthat combines amortized inference and SGHMC. This enables fast and accurate posterior inference.\nFurthermore, we propose two training-time acquisition functions targeted at the imputation and active\nprediction tasks. We evaluate Icebreaker on several benchmark data sets, including two real-world\napplications. Icebreaker consistently outperforms the baselines. Possible future directions include\ntaking mixed-type variables into account and deploying Icebreaker in a pure streaming environment.\n\n[Figure 7 plots: (left) MIMIC-III AUIC curve vs. training data set size for Icebreaker+EDDI, PA-BELGAM+EDDI, PVAE+EDDI, and Row AT+EDDI; (middle) accumulated feature number vs. training data set size for GCS:eye, GCS:verbal, GCS:motor, GCS:total, Glucose, Height, pH, and Cap.; (right) initial choice for active prediction.]\n\nReferences\n\n[1] C. Andrieu, N. De Freitas, A. Doucet, and M. I. Jordan. An introduction to MCMC for machine learning. Machine learning, 50(1-2):5\u201343, 2003.\n\n[2] Y. Baram, R. E. Yaniv, and K. Luz. Online choice of active learning algorithms. Journal of Machine Learning Research, 5(Mar):255\u2013291, 2004.\n\n[3] M. J. Beal et al. 
Variational algorithms for approximate Bayesian inference. 2003.\n\n[4] J. M. Bernardo. Expected information as expected utility. The Annals of Statistics, pages 686\u2013690, 1979.\n\n[5] S. Chakraborty, J. Zhou, V. Balasubramanian, S. Panchanathan, I. Davidson, and J. Ye. Active matrix completion. In 2013 IEEE 13th International Conference on Data Mining, pages 81\u201390. IEEE, 2013.\n\n[6] C. Chen, D. Carlson, Z. Gan, C. Li, and L. Carin. Bridging the gap between stochastic gradient MCMC and stochastic optimization. In Artificial Intelligence and Statistics, pages 1051\u20131060, 2016.\n\n[7] T. M. Cover and J. A. Thomas. Elements of information theory. John Wiley & Sons, 2012.\n\n[8] D. Dheeru and E. Karra Taniskidou. UCI machine learning repository, 2017.\n\n[9] Y. Gal, R. Islam, and Z. Ghahramani. Deep Bayesian active learning with image data. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pages 1183\u20131192. JMLR.org, 2017.\n\n[10] F. M. Harper and J. A. Konstan. The MovieLens datasets: History and context. ACM Transactions on Interactive Intelligent Systems (TiiS), 5(4):19, 2016.\n\n[11] H. Harutyunyan, H. Khachatrian, D. C. Kale, and A. Galstyan. Multitask learning and benchmarking with clinical time series data. arXiv preprint arXiv:1703.07771, 2017.\n\n[12] M. Havasi, J. M. Hern\u00e1ndez-Lobato, and J. J. Murillo-Fuentes. Inference in deep Gaussian processes using stochastic gradient Hamiltonian Monte Carlo. In Advances in Neural Information Processing Systems, pages 7506\u20137516, 2018.\n\n[13] N. Houlsby, J. M. Hern\u00e1ndez-Lobato, and Z. Ghahramani. Cold-start active learning with robust ordinal matrix factorization. In International Conference on Machine Learning, pages 766\u2013774, 2014.\n\n[14] N. Houlsby, F. Husz\u00e1r, Z. Ghahramani, and M. Lengyel. Bayesian active learning for classification and preference learning. 
arXiv preprint arXiv:1112.5745, 2011.\n\n[15] S.-J. Huang, M. Xu, M.-K. Xie, M. Sugiyama, G. Niu, and S. Chen. Active feature acquisition with supervised matrix completion. arXiv preprint arXiv:1802.05380, 2018.\n\n[16] J. Janisch, T. Pevn\u00fd, and V. Lis\u00fd. Classification with costly features using deep reinforcement learning. arXiv preprint arXiv:1711.07364, 2017.\n\n[17] A. E. Johnson, T. J. Pollard, L. Shen, L.-W. H. Lehman, M. Feng, M. Ghassemi, B. Moody, P. Szolovits, L. A. Celi, and R. G. Mark. MIMIC-III, a freely accessible critical care database. Scientific Data, 3:160035, 2016.\n\n[18] M. I. Jordan, Z. Ghahramani, T. S. Jaakkola, and L. K. Saul. An introduction to variational methods for graphical models. Machine learning, 37(2):183\u2013233, 1999.\n\n[19] A. Kapoor, E. Horvitz, and S. Basu. Selective supervision: Guiding supervised learning with decision-theoretic active learning.\n\n[20] D. P. Kingma and M. Welling. Auto-encoding variational Bayes. In International Conference on Learning Representations, 2014.\n\n[21] A. Krause and E. Horvitz. A utility-theoretic approach to privacy in online services. Journal of Artificial Intelligence Research, 39:633\u2013662, 2010.\n\n[22] J. Krumm and E. Horvitz. Traffic updates: Saying a lot while revealing a little. 2019.\n\n[23] Y. Lewenberg, Y. Bachrach, U. Paquet, and J. S. Rosenschein. Knowing what to ask: A Bayesian active learning approach to the surveying problem. In AAAI, pages 1396\u20131402, 2017.\n\n[24] C. Li, C. Chen, D. Carlson, and L. Carin. Preconditioned stochastic gradient Langevin dynamics for deep neural networks. In Thirtieth AAAI Conference on Artificial Intelligence, 2016.\n\n[25] Y. Li. Approximate Inference: New Visions. PhD thesis, University of Cambridge, 2018.\n\n[26] D. V. Lindley. On a measure of the information provided by an experiment. The Annals of Mathematical Statistics, pages 986\u20131005, 1956.\n\n[27] C. Ma, W. Gong, J. M. 
Hern\u00e1ndez-Lobato, N. Koenigstein, S. Nowozin, and C. Zhang. Partial VAE for hybrid recommender system. In NIPS Workshop on Bayesian Deep Learning, 2018.\n\n[28] C. Ma, S. Tschiatschek, K. Palla, J. M. H. Lobato, S. Nowozin, and C. Zhang. EDDI: Efficient dynamic discovery of high-value information with partial VAE. In Proceedings of the International Conference on Machine Learning, 2019.\n\n[29] D. J. MacKay. Information-based objective functions for active data selection. Neural computation, 4(4):590\u2013604, 1992.\n\n[30] D. Maltz and K. Ehrlich. Pointing the way: active collaborative filtering.\n\n[31] A. K. McCallum and K. Nigam. Employing EM and pool-based active learning for text classification. In International Conference on Machine Learning, pages 359\u2013367. Citeseer, 1998.\n\n[32] P. Melville, M. Saar-Tsechansky, F. Provost, and R. Mooney. Active feature-value acquisition for classifier induction. In International Conference on Data Mining, pages 483\u2013486. IEEE, 2004.\n\n[33] P. Melville, M. Saar-Tsechansky, F. Provost, and R. Mooney. An expected utility approach to active feature-value acquisition. In Fifth IEEE International Conference on Data Mining (ICDM\u201905), pages 4\u2013pp. IEEE, 2005.\n\n[34] A. Nazabal, P. M. Olmos, Z. Ghahramani, and I. Valera. Handling incomplete heterogeneous data using VAEs. arXiv preprint arXiv:1807.03653, 2018.\n\n[35] A. K. Pandey and D. S. Rajpoot. Resolving cold start problem in recommendation system using demographic approach. In 2016 International Conference on Signal Processing and Communication (ICSC), pages 213\u2013218. IEEE, 2016.\n\n[36] A.-L. Popkes, H. Overweg, A. Ercole, Y. Li, J. M. Hern\u00e1ndez-Lobato, Y. Zaykov, and C. Zhang. Interpretable outcome prediction with sparse Bayesian neural networks in intensive care. arXiv preprint arXiv:1905.02599, 2019.\n\n[37] C. R. Qi, H. Su, K. Mo, and L. J. Guibas. 
PointNet: Deep learning on point sets for 3D classification and segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 652\u2013660, 2017.\n\n[38] D. J. Rezende, S. Mohamed, and D. Wierstra. Stochastic backpropagation and approximate inference in deep generative models. In International Conference on Machine Learning, 2014.\n\n[39] N. Roy and A. McCallum. Toward optimal active learning through Monte Carlo estimation of error reduction.\n\n[40] N. Ruchansky, M. Crovella, and E. Terzi. Matrix completion with queries. In Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 1025\u20131034. ACM, 2015.\n\n[41] M. Saar-Tsechansky, P. Melville, and F. Provost. Active feature-value acquisition. Management Science, 55(4):664\u2013684, 2009.\n\n[42] A. I. Schein, A. Popescul, L. H. Ungar, and D. M. Pennock. Methods and metrics for cold-start recommendations. In Proceedings of the 25th annual international ACM SIGIR conference on Research and development in information retrieval, pages 253\u2013260. ACM, 2002.\n\n[43] B. Settles. Active learning. Synthesis Lectures on Artificial Intelligence and Machine Learning, 6(1):1\u2013114, 2012.\n\n[44] H. Shim, S. J. Hwang, and E. Yang. Joint active feature acquisition and classification with variable-size set encoding. In Advances in Neural Information Processing Systems, 2018.\n\n[45] K. Sohn, H. Lee, and X. Yan. Learning structured output representation using deep conditional generative models. In Advances in neural information processing systems, pages 3483\u20133491, 2015.\n\n[46] D. Stern, R. Herbrich, and T. Graepel. Matchbox: Large scale Bayesian recommendations. In International World Wide Web Conference, 2009.\n\n[47] E. W. Steyerberg, N. Mushkudiani, P. Perel, I. Butcher, J. Lu, G. S. McHugh, G. D. Murray, A. Marmarou, I. Roberts, J. D. F. Habbema, et al. 
Predicting outcome after traumatic brain injury: development and international validation of prognostic scores based on admission characteristics. PLoS medicine, 5(8):e165, 2008.\n\n[48] D. J. Sutherland, B. P\u00f3czos, and J. Schneider. Active learning and search on low-rank matrices. In Proceedings of the 19th ACM SIGKDD international conference on Knowledge discovery and data mining, pages 212\u2013220. ACM, 2013.\n\n[49] M. Thahir, T. Sharma, and M. K. Ganapathiraju. An efficient heuristic method for active feature acquisition and its application to protein-protein interaction prediction. In BMC proceedings, volume 6, page S2. BioMed Central, 2012.\n\n[50] S. Tong and D. Koller. Support vector machine active learning with applications to text classification. Journal of machine learning research, 2(Nov):45\u201366, 2001.\n\n[51] D. Vu, P. Melville, M. Bilenko, and M. Saar-Tsechansky. Intelligent information acquisition for improved clustering.\n\n[52] M. J. Wainwright, M. I. Jordan, et al. Graphical models, exponential families, and variational inference. Foundations and Trends\u00ae in Machine Learning, 1(1\u20132):1\u2013305, 2008.\n\n[53] G. C. Wei and M. A. Tanner. A Monte Carlo implementation of the EM algorithm and the poor man\u2019s data augmentation algorithms. Journal of the American statistical Association, 85(411):699\u2013704, 1990.\n\n[54] J. Xu, Y. Yao, H. Tong, X. Tao, and J. Lu. Ice-breaking: mitigating cold-start recommendation problem by rating comparison. In Twenty-Fourth International Joint Conference on Artificial Intelligence, 2015.\n\n[55] M. Zaheer, S. Kottur, S. Ravanbakhsh, B. Poczos, R. R. Salakhutdinov, and A. J. Smola. Deep sets. In Advances in Neural Information Processing Systems, pages 3391\u20133401, 2017.\n\n[56] S. Zannone, J. M. Hern\u00e1ndez-Lobato, C. Zhang, and K. Palla. Odin: Optimal discovery of high-value information using model-based deep reinforcement learning. 
In ICML Real-world Sequential Decision Making Workshop, 2019.\n\n[57] C. Zhang, J. Butepage, H. Kjellstrom, and S. Mandt. Advances in variational inference. IEEE transactions on pattern analysis and machine intelligence, 2018.\n\n[58] J.-J. Zhu and J. Bento. Generative adversarial active learning. arXiv preprint arXiv:1702.07956, 2017.\n\n[59] X. Zhu, J. Lafferty, and Z. Ghahramani. Combining active learning and semi-supervised learning using Gaussian fields and harmonic functions.", "award": [], "sourceid": 8401, "authors": [{"given_name": "Wenbo", "family_name": "Gong", "institution": "University of Cambridge"}, {"given_name": "Sebastian", "family_name": "Tschiatschek", "institution": "Microsoft Research"}, {"given_name": "Sebastian", "family_name": "Nowozin", "institution": "Microsoft Research Cambridge"}, {"given_name": "Richard", "family_name": "Turner", "institution": "University of Cambridge"}, {"given_name": "Jos\u00e9 Miguel", "family_name": "Hern\u00e1ndez-Lobato", "institution": "University of Cambridge"}, {"given_name": "Cheng", "family_name": "Zhang", "institution": "Microsoft Research, Cambridge, UK"}]}