NeurIPS 2019
Sun Dec 8th through Sat Dec 14th, 2019, at the Vancouver Convention Center
Paper ID: 8621 Multi-task Learning for Aggregated Data using Gaussian Processes

### Reviewer 1

ORIGINALITY: The authors present a framework for multi-task learning on aggregated data, i.e. data that has been averaged over time and/or space. The key contribution is to consider a *multi-task setting*, in which correlated aggregated data is jointly modeled and each task can have a different likelihood function. As such, the paper seems to be a relatively straightforward combination of the work of Moreno-Munoz et al. 2018 (multi-output GPs where outputs have different likelihoods) with the work of Smith et al. 2018 (GPs for aggregated data in the single-output case). This combination is (to my knowledge) novel and relevant.

QUALITY: Technically, the method seems sound and supported by the experiments, although the latter are partially presented in a manner that makes them hard to follow. There is a clear setting in which this method should be used (different kinds of correlated, aggregated data), but there is no assessment of the method's weaknesses.

CLARITY: The paper is well organized. It is partially well written and easy to follow, while other parts, specifically the experiments section, leave quite some room for improvement. Suggestions for more clarity below.

SIGNIFICANCE: I consider the work significant, because there may be many settings in which integrated data about the same quantity (or related quantities) comes at different cost. There is no earlier method that allows one to take several such sources of data into account, and even though this is a fairly straightforward extension of multi-task models and inference on aggregated data, it is relevant.

MORE DETAILED COMMENTS:

--INTRO & RELATED WORK:
* Could you state somewhere early in the introduction that by "task" you mean "output"?
* Regarding the 3rd paragraph of the introduction and the related work section: they read unnaturally separated.
The paragraph in the introduction reads very technical, and it would be great if the authors could put more emphasis there on how their work differs from previous work and introduce just the main concepts (e.g., in what way multi-task learning differs from multiple instance learning). Much of the more technical assessment could go into the related work section (or partially be condensed).

--SECTION 2.3: Section 2 was straightforward to follow up to 2.3 (SVI). From there on, it would be helpful if a bit more explanation were available (at the expense of parts of the related work section, for example). More concretely:
* l.145ff: $N_d$ is not defined. It would be good to state explicitly that there could be a different number of observations per task.
* l.145ff: The notation confused me on first reading, e.g. $\mathbf{y}$ is used in l.132 for a data vector with one observation per task, and in l.145 for the collection of all observations. I am aware that the setting (multi-task, multiple supports, different numbers of observations per task) is inherently complex, but it would help to guide the reader through it by adding some more explanation and changing notation. Also l.155: do you mean the process f as in l.126, or do you refer to the object introduced in l.147?
* l.150ff: How are the inducing inputs Z chosen? Is there any effect of the integration on the choice of inducing inputs? l.170: What is z' here? Is that where the inducing inputs go?
* l.166ff: It would be very helpful for the reader to be reminded of the dimensions of the matrices involved.
* l.174: Could you explicitly state the computational complexity?
* Could you comment on the performance of this approximate inference scheme based on inducing inputs and SVI?

--EXPERIMENTS:
* Synthetic data: Could you give an example of what kind of data could look like this? In Figure 1, what is meant by "support data" and what by "predicted training count data"?
Could you write down the model used here explicitly, e.g. add it to the appendix?
* Fertility rates:
  - It is unclear to me how the training data is aggregated and over which inputs, i.e. what you mean by 5x5.
  - Now that the likelihood is Gaussian, why not go for exact inference?
* Sensor network:
  - l.283/4: You might want to emphasize here that CI gives high-accuracy but low-time-resolution results, e.g. "...a cheaper method for __accurately__ assessing the mass..."
  - Again, given a Gaussian likelihood, why do you use inducing inputs? What is the trade-off (computational and quality) between using the full model and SVI?
  - l.304ff: What do you mean by "additional training data"?
  - Figure 3: I don't understand the red line: where does the test data come from? Do you have a ground truth?
  - Now the sensors are co-located. Ideally, you would want to have more low-cost sensors than high-cost (high-accuracy) sensors, in different locations. Do you have a thought on how you would account for the spatial distribution of sensors?

--REFERENCES:
* Please make the style of your references consistent, and start with the last name.

Typos etc.:
* l.25: types of datasets
* l.113: should be $f_{d'}(v')$, i.e. $d'$ instead of $d$
* l.282: "... but are badly bias" should be "is(?) badly biased" (does the verb refer to the measurement or the sensor? Maybe rephrase.)
* l.292: biased
* Figure 3: biased, higher peaks, 500 with unit.
* l.285: consisting of? Or just "...as observations of integrals"
* l.293: these variables
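To make my question about the effect of the integration on the inducing inputs (l.150ff) concrete: the aggregated model needs cross-covariances between a support and a point, and these are smoothed versions of the underlying kernel. A minimal sketch of my own (the RBF kernel, unit hyperparameters, and midpoint-rule quadrature are all my assumptions, not the authors' implementation):

```python
import numpy as np

def rbf(x, z, ell=1.0):
    """Squared-exponential kernel k(x, z) on the real line."""
    return np.exp(-0.5 * (x - z) ** 2 / ell ** 2)

def k_support_point(a, b, z, n_quad=200, ell=1.0):
    """Averaged cross-covariance between the interval support v = [a, b]
    and a single input z:
        k(v, z) = (1 / |v|) * int_a^b k(x, z) dx,
    approximated here with the midpoint rule."""
    h = (b - a) / n_quad
    xs = a + h * (np.arange(n_quad) + 0.5)
    return rbf(xs, z, ell).mean()

# Aggregation flattens the kernel: even with z at the centre of the
# support, the averaged covariance stays strictly below the point-wise
# peak k(z, z) = 1.
print(k_support_point(0.0, 2.0, z=1.0))  # < 1.0
print(rbf(1.0, 1.0))                     # exactly 1.0
```

This smoothing is why I suspect the placement of Z may matter less (or differently) than in the point-observation case, and why I would like the authors to comment on it.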

### Reviewer 2

This paper proposes a general framework based on GPs for multi-task learning on data aggregated over supports of different shapes and sizes. The change of support addressed here is an important problem in various disciplines (e.g., geostatistics and epidemiology). The authors define the covariance function between any pair of supports as the double integration of the GPs, in which dependences between tasks are designed by a linear model of latent GPs. The inference procedure is based on variational EM that incorporates inducing points. The problem addressed in this submission is important, and the proposed approach is reasonable. However, there are several concerns; in particular, the experimental results seem insufficient to support the authors' claims. Comments follow the guidelines as requested.

[Originality]
Strengths: The main technical contribution of this submission is to extend the multi-task GP to handle the change of support; this idea sounds good and useful.
Weaknesses: Some important related works are not discussed. Multi-task (i.e., multivariate) GPs have been widely studied in the machine learning community. Although most of them assume that data values are associated with points, it would be better to mention several related multi-task GPs (e.g., [1], [2], [3]). In particular, [1] designed the dependent GP by a linear mixing of latent GPs, which is similar to this submission. Also, there is an important related work missing here: [4]. I think that paper essentially addressed a related task: predicting fine-grained data by using auxiliary data sets with various granularities. I would like the authors to clarify the differences and advantages of this submission.

[Quality]
Strengths: This paper is technically sound except for some concerns. The authors evaluate the proposed model in a simple experimental setting using synthetic and real data sets.
Weaknesses:
My concerns about the proposed model are as follows:
1) I understand the integral in Equation (1) to correspond to the bag observation model in [Law et al., NeurIPS'18] or the spatial aggregation process in [4]. The formulation introduced by the authors assumes that the observations are obtained by averaging over the corresponding support $v$. However, the data might be aggregated by another procedure, e.g., simple summation or a population-weighted average; in fact, disease incidence data are often available as counts, or as rates per number of residents.
2) In order to handle various data types (e.g., counts and rates), shouldn't the corresponding aggregation processes be performed at the likelihood level?
3) I think it would be more efficient to estimate $\{a_{d,q}\}$ instead of $B_q$, since $b^q_{d,d'} = a_{d,q} a_{d',q}$.

The major weakness of this submission is in the experiments. First, the proposed model should be compared with a typical baseline, such as a regression-based model with an aggregation process (e.g., Law et al., NeurIPS'18, [4]) or a multi-task GP with point-referenced data (e.g., [1]). I believe a previous multi-task GP can be applied via a simplification; that is, each data value at the support $v$ is assumed to be associated with a representative point (e.g., the centroid) of its support (as in the previous work [4]). Second, more extensive experiments would help verify the effectiveness of the proposed model. In all the experiments, the authors consider two tasks. I would like to see experimental results considering more tasks; it would then also be a good idea to discuss how to determine the number of latent GPs $Q$.

Short question: I was wondering if you could give the details of *resolution 5 x 5* in the experimental setting of fertility rates.

[Clarity]
This paper is easy to understand. Some typos:
1) In line 235, *low-cost* should be *low-accuracy*?
2) In line 239, *GP process* should be *GP*.
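To illustrate concern 3) above: parameterizing each coregionalization matrix through a vector $a_q$ makes $B_q$ symmetric and positive semi-definite by construction, with fewer free parameters. A small sketch of my own (task/latent-GP counts and the NumPy check are mine, not taken from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
D, Q = 3, 2  # number of tasks and latent GPs; sizes chosen for illustration

# Parameterize each coregionalization matrix through a vector a_q, so that
# B_q = a_q a_q^T, i.e. b^q_{d,d'} = a_{d,q} * a_{d',q}.
A = rng.standard_normal((D, Q))
B = [np.outer(A[:, q], A[:, q]) for q in range(Q)]

# Each B_q is symmetric, rank-1, and PSD by construction, with D free
# parameters per matrix instead of D*(D+1)/2 for a free-form B_q.
for B_q in B:
    assert np.allclose(B_q, B_q.T)
    assert np.linalg.matrix_rank(B_q) == 1
    assert np.min(np.linalg.eigvalsh(B_q)) > -1e-10
```

Estimating $A$ directly therefore sidesteps any PSD constraint one would otherwise need to enforce on free-form $B_q$ matrices during optimization.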
[Significance]
Aggregated data with different supports are commonplace in a wide variety of applications, so I think this is an important problem to tackle. However, the major weakness of the submission in my view is that the evaluation of the proposed model is not sufficient, so the effectiveness/usefulness of the model is unclear from the experimental results. I think it would be great to compare the proposed model with baseline methods.

[1] Teh, Y. W., et al., Semiparametric Latent Factor Models, AISTATS, 333-340, 2005.
[2] Boyle, P., et al., Dependent Gaussian Processes, NeurIPS, 217-224, 2005.
[3] Bonilla, E., et al., Multi-task Gaussian Process Prediction, NeurIPS, 153-160, 2008.
[4] Tanaka, Y., et al., Refining Coarse-grained Spatial Data Using Auxiliary Spatial Data Sets with Various Granularities, AAAI, 2019. https://arxiv.org/abs/1809.07952

------------------------------

After author feedback: I appreciate the responses to my questions. The new experimental results in the rebuttal are a welcome addition. In light of this, I have upgraded my score. The proposal is a combination of coregionalization and the concept of the aggregation process used in block kriging; this is a simple but effective approach. I also agree that the sensor experiment is one of the applications of the proposed model. But I am still of the opinion that there are not enough experiments and/or discussions to support the authors' claims. The authors state that the model is a general framework and has many applications related to geostatistics (lines 14-23); the support $v$ corresponds to a 2-dimensional region, e.g., a borough (line 92). As described in Related work (lines 222-229), the proposed model is strongly related to spatial downscaling and disaggregation in geostatistics. If anything, I think this application involving spatial aggregation is a more critical one for the proposed model.
In the spatial data setting, a wide variety of data sets is available at various spatial granularities (for instance, New York City publishes open data at [https://opendata.cityofnewyork.us]). Naturally, one would like to handle these data sets simultaneously (as in Law et al., NeurIPS'18, [4]); namely, the setting with a large number of tasks. In that case, I believe the authors should discuss several issues, for example the sensitivity to the number of latent GPs $Q$, the approximation accuracy of the integral over regions, etc. I think it would be better to clarify the scope of this study and to discuss the above issues.
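To ground the point about the approximation accuracy of the integral over regions: the doubly averaged covariance between two 2-D supports has no closed form in general and must be estimated numerically. A minimal sketch of my own (RBF kernel, rectangular supports, and uniform Monte Carlo sampling are all my assumptions, not the authors' scheme):

```python
import numpy as np

def rbf2(x, y, ell=1.0):
    """Squared-exponential kernel on R^2 for row-wise paired samples."""
    return np.exp(-0.5 * np.sum((x - y) ** 2, axis=-1) / ell ** 2)

def k_regions_mc(lo1, hi1, lo2, hi2, n=10000, seed=0):
    """Monte Carlo estimate of the doubly averaged covariance
        k(v, v') = (1 / (|v| |v'|)) * iint_{v x v'} k(x, x') dx dx'
    for two axis-aligned rectangles v = [lo1, hi1], v' = [lo2, hi2]."""
    rng = np.random.default_rng(seed)
    x = rng.uniform(lo1, hi1, size=(n, 2))  # uniform samples in v
    y = rng.uniform(lo2, hi2, size=(n, 2))  # uniform samples in v'
    return rbf2(x, y).mean()

# Overlapping regions correlate more strongly than distant ones; the MC
# error of each estimate shrinks at the usual O(n^{-1/2}) rate, which is
# exactly the accuracy/cost trade-off the authors should quantify.
near = k_regions_mc([0, 0], [1, 1], [0.5, 0.5], [1.5, 1.5])
far = k_regions_mc([0, 0], [1, 1], [5, 5], [6, 6])
```

With many tasks and many regions, this per-pair sampling cost multiplies, which is why I think the trade-off deserves explicit discussion.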