NeurIPS 2019
Sun Dec 8th through Sat the 14th, 2019 at Vancouver Convention Center
Paper ID: 3745 Beyond Vector Spaces: Compact Data Representation as Differentiable Weighted Graphs

Reviewer 1

The paper proposes an embedding of data into a non-vectorial metric space: a weighted graph equipped with the shortest-path distance. I think the paper is original. I find it well written, and the methodology described in Section 3 is sound. The experimental results are not always impressive, though. The experiments in Section 4.1 show that PRODIGE, the proposed technique, reaches good performance with a low memory budget, but the competing methods nearly catch up already at 8-dimensional embeddings. Since nowadays we tend to work with much larger embeddings than 8-dimensional ones, I wonder how important this result is in practice. Similar reasoning applies to the experiments in Section 4.3; in that case, I would additionally suggest comparing PRODIGE with state-of-the-art embedding techniques such as BERT or ELMo. The experiments in Section 4.2 look more convincing. I appreciated the comparison with metric learning, i.e., the same technique using Euclidean embeddings.

Reviewer 2

Mainstream research in representation learning deals with representing structured data as vectorial data. In contrast, the authors propose to represent vectorial data as weighted graphs equipped with a shortest-path distance. This makes the paper original. However, many graph construction methods have already been proposed (e.g., \epsilon-graphs, kNN graphs) given a similarity measure; moreover, the similarity can be learned, or graphs can be adapted to the task at hand. Here, the authors propose an end-to-end learning algorithm based on gradient descent, which is also original. Experimental results are provided on tasks such as compression, classification, and collaborative filtering. The development of the training objective is technically sound. But, in my opinion, many points are not carefully discussed. First, the probabilistic model assumes that edges are independent, while a shortest-path distance strongly depends on relations between edges. Second, the training objective is an expectation over all random graphs, and its estimation should be described in more detail. Third, it is not made clear in the paper that the results depend strongly on a good initialization, thus requiring a known problem-specific similarity. Last, convergence of the algorithm is not discussed. In summary, the paper is based on an original idea and could lead to a future publication, but, in my opinion, it is not ready for publication at NeurIPS because many questions about the learning algorithm remain unanswered.

*** Detailed comments.

* Introduction. It is not clear whether you discuss unsupervised or supervised representation methods ("for the task at hand", "in the specific task"). This remark also applies to the introduction of your paper: it is not clear whether you consider a target task. It is mentioned in $3 of the related-work section, a task-specific loss is introduced in Section 3.2, and a problem-specific similarity is required in Section 3.3.

* Related work. In my opinion, embeddings (whatever the geometry) should be discussed in the introduction to clearly position your work. The related-work section should be devoted to graph construction methods: basic constructions such as \epsilon-graphs and kNN graphs, but also learning methods that adapt graphs to a target task. Metric learning should also be discussed here.

* $3.1. You assume independence between edges, but you consider a shortest-path distance in the learning objective. You should discuss this issue.

* $3.1. You should state that there are 2n^2 parameters to be learned.

* $3.2. You must estimate (3). This estimate seems to be based on a single sample, which is rather optimistic. Please elaborate on this and justify why such a rough estimate could be sufficient. Moreover, the interplay between gradient descent on the weights and gradient descent on the edge probabilities is not discussed.

* $3.2. It is asserted that "once the training procedure converges, the output graph is nearly deterministic". Is this a theoretical statement or an experimental one? Please discuss the convergence of your algorithm. What can be said from a theoretical perspective? Does it converge in general? What can be said about the output? And from an experimental perspective?

* $3.3. Here we learn that the exact algorithm is intractable and a heuristic is proposed. The heuristic is itself probabilistic, which again raises questions about the convergence of your algorithm. We also learn here that a problem-specific similarity is required for the method to work.

* $4.1. The initialization heuristic of Section 3.3 is for edge probabilities. Do you run several experiments with different draws? What is the initialization procedure for the edge weights? We would also like to know how you tune \lambda: what is your criterion for choosing \lambda, and what is the chosen value? How do the output graphs vary depending on the initialization and on \lambda?

* $4.2. Metric learning is a research domain in itself. I do not understand what you call "metric learning" in this section. Which metric? Which algorithm?

=============== after rebuttal ===========

Thanks for your answers. I still have some concerns about the independence of the edge probabilities and the importance of the supposedly given metric.
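To make the two main concerns above concrete, here is a minimal sketch (hypothetical code, not the authors' implementation) of the kind of model under review: edges are drawn independently as Bernoulli variables, yet the quantity being optimized, the shortest-path distance, couples all edges, and the expected distance is estimated by Monte Carlo sampling, where `n_samples=1` corresponds to the rough single-draw estimate questioned in the comments on Section 3.2. All function names and the graph representation are illustrative assumptions.

```python
import math
import random
import heapq

def sample_graph(probs, weights, rng):
    """Draw one random graph: directed edge (i, j) is kept
    independently with probability probs[i][j]."""
    n = len(probs)
    adj = {i: [] for i in range(n)}
    for i in range(n):
        for j in range(n):
            if i != j and rng.random() < probs[i][j]:
                adj[i].append((j, weights[i][j]))
    return adj

def shortest_path(adj, src, dst):
    """Dijkstra's algorithm; returns math.inf if dst is
    unreachable in this particular draw."""
    dist = {src: 0.0}
    heap = [(0.0, src)]
    while heap:
        d, u = heapq.heappop(heap)
        if u == dst:
            return d
        if d > dist.get(u, math.inf):
            continue  # stale heap entry
        for v, w in adj[u]:
            nd = d + w
            if nd < dist.get(v, math.inf):
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    return math.inf

def expected_distance(probs, weights, src, dst, n_samples, seed=0):
    """Monte Carlo estimate of E[d(src, dst)] over random graphs.
    n_samples=1 is the single-sample estimate the review asks about."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_samples):
        total += shortest_path(sample_graph(probs, weights, rng), src, dst)
    return total / n_samples
```

Note how the independence assumption enters only in `sample_graph`, while `shortest_path` makes the distance a function of many edges jointly; this is exactly the tension between the factorized edge model and the path-based objective raised above.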