Paper ID: 1379

Title: Kronecker Determinantal Point Processes

This paper describes optimisation algorithms for determinantal point processes in which the kernel matrix is restricted to have Kronecker product structure. Naively applying previous maximum likelihood algorithms to this model would not preserve the Kronecker product structure; a block coordinate descent algorithm yields a tractable procedure.

The authors propose restricting the kernel matrix in a DPP to a Kronecker product structure. In itself this is a very natural idea (and as such could be considered a bit incremental), but it does seem to take quite a bit of work to find a reasonable optimization approach, e.g., to show that the updates retain positive definiteness of the submatrices and that each of the required terms can be computed efficiently. I should think that this would be of interest to those at NIPS who are interested in this model.

Very minor questions: Choosing N_1 = N_2 \approx sqrt(N) does seem the most natural choice. Is there any motivation to choose differently? The authors also mention that choosing more than two matrices in the decomposition did not seem to be worthwhile. Why was that?

Typos / minor comments: line 85: the equations that follow are missing parentheses; line 174: "subdivised": do you mean "subdivided"?; line 180: I think \eqref{} was intended instead of \ref.
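To make the structural assumption under review concrete, here is a minimal numpy sketch (my own illustration, not the authors' code) of a Kronecker-structured kernel. Positive definiteness of the sub-kernels carries over to the full kernel because the eigenvalues of L1 ⊗ L2 are the pairwise products of the factors' eigenvalues.

```python
import numpy as np

def random_pd(n, rng):
    """Random symmetric positive definite matrix (illustrative stand-in)."""
    A = rng.standard_normal((n, n))
    return A @ A.T + n * np.eye(n)

rng = np.random.default_rng(0)
N1, N2 = 4, 5                     # N1 = N2 ~ sqrt(N) is the natural split
L1, L2 = random_pd(N1, rng), random_pd(N2, rng)
L = np.kron(L1, L2)               # full kernel over N = N1 * N2 = 20 items

# Positive definiteness is preserved: eigenvalues of kron(L1, L2) are the
# pairwise products of the factors' (positive) eigenvalues.
assert np.all(np.linalg.eigvalsh(L) > 0)
```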

1-Less confident (might not have understood significant parts)

Determinantal point processes suffer from scalability issues both when sampling from the model and when learning from data. This paper proposes an approach that addresses the scalability problem in both sampling and learning by factoring the marginal kernel as the Kronecker product of sub-kernels. The authors derive an iterative optimisation algorithm for the sub-kernels that preserves positive definiteness and show the correctness of the optimisation algorithm. The properties of the proposed approach are well characterised in terms of time and space complexity, and the experiments support these characteristics.

I'm not familiar with the DPP field; however, throughout the paper, the authors clearly depict the problem with existing methods and their approach to solving it. The paper is definitely well written and easy to follow. Although the proposed approach, a Kronecker product of sub-kernels, may seem to be one of several natural approaches (alongside, e.g., low-rank approximation), the authors describe well the constraints that must be satisfied and derive an optimisation algorithm that satisfies them. One thing that I'd like to understand better about the proposed method is its empirical performance compared with other approaches to the scalability issue, and how performance changes with respect to the number of sub-kernels. For example, how well does the proposed approach compare with k-DPP, even though k-DPP is not appropriate for modelling some problems? Also, the paper generalises the algorithm to multiple sub-kernels, not only two, so it would be good to have some analysis of how the number of sub-kernels influences empirical performance in terms of time complexity and predictive accuracy.
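As a back-of-the-envelope illustration of the sub-kernel question raised above (my own sketch, not from the paper; `N = 10_000` is an arbitrary example size): the parameter count of a Kronecker kernel with m equal-size factors shrinks sharply from m = 1 to m = 2, with diminishing returns afterwards.

```python
# Parameter count of a Kronecker kernel with m equal-size factors:
# each factor is n x n with n^m ~ N, so storage is m * n^2 entries.
# (My own illustration; N = 10_000 is an arbitrary example size.)
N = 10_000
for m in (1, 2, 3, 4):
    n = round(N ** (1 / m))        # side length of each factor
    params = m * n * n             # entries stored across the m factors
    print(f"m={m}: factor size {n}, parameters {params}")
```

This counts parameters only; the trade-offs the authors observed presumably also involve optimization behaviour and modelling capacity.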

1-Less confident (might not have understood significant parts)

Efficient sampling and learning for DPPs is achieved by a model whose kernel matrix decomposes as a Kronecker (tensor) product.

The paper looks interesting, is on a high technical level, and my cursory reading confirms its correctness. I wasn't able to check it more carefully. Minor remarks: * Big-O notation should be written either $\mathcal{O}(...)$ or just $O(...)$

1-Less confident (might not have understood significant parts)

The authors propose two efficient algorithms (learning and sampling) for determinantal point processes (DPPs) in which the DPP-kernel matrix factors as the Kronecker product of smaller positive definite matrices. The first proposed algorithm learns a DPP-kernel which can be factored as the Kronecker product of two positive definite matrices in time quadratic in the number of items. This is an improvement over the existing algorithm [25] for learning a DPP-kernel matrix, which runs in time cubic in the number of items -- however, the existing algorithm [25] works without assuming the Kronecker structure of the DPP-kernel. The stochastic version of the algorithm is even faster and requires time O(N^{1.5}), with N denoting the number of items. The second proposed algorithm is for sampling from a DPP whose kernel has the Kronecker structure. This algorithm runs in time O(N^{1.5}) and represents an improvement over more general DPP-sampling algorithms which do not assume the Kronecker structure of the DPP-kernel and require either cubic time in the number of items or MCMC sampling with an impractical mixing bound.

The paper can be split into three major components: i) learning algorithm, ii) sampling algorithm, iii) experiments. i) Learning algorithm: The algorithm works by maximizing the log-likelihood of the model over the space of positive definite matrices which factor as the Kronecker product of smaller positive definite matrices. It represents an extension of the algorithm for learning determinantal point processes proposed in [25], which does not assume that the DPP-kernel has the Kronecker structure. The authors handle the additional constraint on the DPP-kernel by performing block-coordinate ascent. In Theorem 3.2 they show that the proposed optimization method makes the objective non-decreasing (a similar result was also shown in [25] for the algorithm which does not assume the Kronecker structure of the DPP-kernel). The authors also give guidelines on how to generalize the algorithm to DPP-kernels which can be represented as the Kronecker product of more than two matrices. However, for the case of more than two matrices in the Kronecker product, the authors do not show the computational gains over [25] and learning without the Kronecker assumption (see Theorem 3.3). @LA1: The maximization of the log-likelihood is a non-convex problem, and while the proposed version of block-coordinate ascent is guaranteed to produce a non-decreasing objective, there is no guarantee that the learned model is globally optimal. As this is a non-convex problem and, hence, highly dependent on the initial solution, it would be nice to have a heuristic or guidelines for initializing the solver in Algorithm 1. Has this aspect been tested thoroughly while empirically evaluating Kronecker DPPs? @LA2: It is not clear how constraining the assumption that the DPP-kernel factors as the Kronecker product of two positive definite matrices is (for this case, the authors give an efficient implementation). 
How does this assumption affect the capacity/properties of such DPP models? ii) Sampling algorithm: The algorithm exploits the fact that the eigendecomposition of DPP-kernels which factor as Kronecker products of smaller positive definite matrices can be obtained by decomposing the smaller matrices (see Corollary 2.2). @SA1: The proposed sampling algorithm is the standard DPP-sampling algorithm with an efficient eigendecomposition made possible by the assumed Kronecker structure of the DPP-kernel. iii) Experiments: The main focus of the experiments is to compare the learning algorithm to [25]. This is achieved via synthetic experiments and experiments on two real-world data sets. In one of the experiments with a real-world data set, the authors confirm that Kronecker DPPs are less expressive than DPPs with kernels not assuming such structure. The second experiment with a real-world data set shows a clear reduction in the training time of Algorithm 1 for Kronecker DPPs over [25], which is designed for standard DPP-learning. @E1: It is very difficult to distinguish the plots when printed in black-and-white. It would be nice if the authors could assign a different marker to each of the plots (e.g., circle, square, etc.). @E2: Given that the problem is non-convex, it is rather difficult to empirically assess the loss in expressiveness which results from the Kronecker structure assumption. However, in my opinion this is an important aspect of Kronecker DPPs that was not addressed properly by the authors in the experiments. 
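The identity behind Corollary 2.2, as described above, is easy to sanity-check numerically (a generic numpy sketch, not the authors' code): the eigendecomposition of L1 ⊗ L2 follows from the factor decompositions, so sampling never needs to eigendecompose the full N x N kernel.

```python
import numpy as np

rng = np.random.default_rng(1)

def random_pd(n):
    """Random symmetric positive definite matrix (illustrative)."""
    A = rng.standard_normal((n, n))
    return A @ A.T + n * np.eye(n)

L1, L2 = random_pd(3), random_pd(4)
w1 = np.linalg.eigvalsh(L1)
w2 = np.linalg.eigvalsh(L2)

# Eigenvalues of kron(L1, L2) are all pairwise products w1[i] * w2[j]
# (and kron of the eigenvector matrices gives matching eigenvectors).
w_kron = np.sort(np.outer(w1, w2).ravel())
w_full = np.sort(np.linalg.eigvalsh(np.kron(L1, L2)))
assert np.allclose(w_kron, w_full)
```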
Minor comments:
-- line 40: Y \in \mathcal{Y} is just an element of the set of items, so (1) can be misinterpreted (perhaps use the power set instead)
-- line 43: L_Y is not introduced in the text around (2)
-- line 85: bracket missing on Tr_1
-- line 112: the notation with the composition of f and the Kronecker product looks suboptimal
-- line 133: typo on `Generalization'
-- lines 137-141: it would be nice to add more details to the appendix
-- line 166: `by from'
-- line 200: `true' kernel is not clear enough
-- Appendix: Algorithm 3, `end forreturn ...'
------------------------------------------------------------------------------------
The paper is very well organized and extremely easy to follow. All the proofs are clear enough and the compressed parts can be derived easily. It is a solid paper with a very elegant theoretical part. Also, a very interesting observation is given in Section 6, where the authors observe that one of the two parts of the log-likelihood function is geodesically convex over the Riemannian manifold of positive definite matrices; it will be interesting to see how this idea develops in the future.

3-Expert (read the paper in detail, know the area, quite certain of my opinion)

The paper proposes a Kronecker factorization of Determinantal Point Process kernels for purposes of efficient estimation and sampling. Under such a factorization, the authors observe that the log likelihood can be split into concave and convex terms if all but one sub-kernel is fixed. The resulting coordinate ascent procedure is guaranteed to monotonically increase the log likelihood while decreasing runtime and space costs (from, roughly, cubic to quadratic). Further efficiency can be obtained by a stochastic variant (quadratic to N^{3/2}). Experiments on simulated, small-scale, and large-scale datasets are reported. The Kronecker factorization seems to be a reasonable constraint, as it performs comparably to unconstrained kernels while allowing DPPs to be scaled to 10,000 items in the ground set.
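For a sense of where the savings come from, a hedged numpy sketch (mine, not the authors' implementation): the normalizer log det(L + I) in the DPP log-likelihood can be evaluated from the factor eigenvalues alone, without ever forming the full N x N kernel.

```python
import numpy as np

rng = np.random.default_rng(2)

def random_pd(n):
    """Random symmetric positive definite matrix (illustrative)."""
    A = rng.standard_normal((n, n))
    return A @ A.T + n * np.eye(n)

L1, L2 = random_pd(5), random_pd(6)

# With L = kron(L1, L2), the eigenvalues of L are products w1[i] * w2[j],
# so log det(L + I) = sum_{i,j} log(1 + w1[i] * w2[j]) -- computed from the
# small factors only.
w1 = np.linalg.eigvalsh(L1)
w2 = np.linalg.eigvalsh(L2)
fast = np.sum(np.log1p(np.outer(w1, w2)))

# Direct computation on the full 30 x 30 kernel, for comparison.
direct = np.linalg.slogdet(np.kron(L1, L2) + np.eye(30))[1]
assert np.allclose(fast, direct)
```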

This is a nice paper showing how to exploit Kronecker factorization to improve DPP estimation and sampling. The contributions are incremental but clearly novel, the presentation is clear (I thought section 3 was especially well organized), and the experiments are well thought-out (I like how the small-scale ones effectively isolate the effect of the constraint). Here are the justifications for my ratings. 1. Technical quality: Exploiting Kronecker factorizations is commonly done, so it's no surprise that the technique can be used to improve the efficiency of DPPs. However, the authors had to overcome non-trivial obstacles in (1) observing the concave-convex decomposition, (2) showing the coordinate ascent updates monotonically increase the LL, and (3) performing the complexity analysis (the Knapsack problem in particular). I gave the rating "Poster level" because of these novel, non-obvious contributions. I have no major issues with the proofs or experiments. The only reason I didn't give a higher rating is that I do not think any of these developments (or the aggregate) are sufficiently impressive or creative to say the paper is in the top 3% of submissions: the primary contribution is scaling an existing model. 2. Novelty / originality: Again, the paper's contributions seem genuinely new and interesting. Yet, I gave the rating "poster level" because exploiting Kronecker factorizations is fairly common (as the authors state themselves, line 35). The paper's main developments are a byproduct of the need to overcome some challenges introduced by the factorization (for instance, preserving the Kronecker structure while still having ascending updates). Thus, the contributions are useful but don't constitute any paradigm shift. 3. Potential impact: I rated the paper "oral level" in this category because the demonstrated reductions from cubic runtimes can bring DPPs into a new regime of scalability. 
The experiment on the genes dataset, with 10,000 items in the ground set, is impressive. Furthermore, the second, small-scale experiment demonstrates that the Kronecker assumption only slightly degrades the kernel estimate. Thus, there is little reason not to use a Kronecker factorization if runtime is of any concern. 4. Clarity and presentation: The paper is well-written. I liked the logical flow of section #3 in particular: the log-likelihood with the Kronecker factorization --> decomposition into concave-convex terms --> coordinate ascent by fixing sub-kernels. My one suggestion is that I would have found a table summarizing the current state of DPP estimation / sampling techniques, with their associated runtime and memory costs, very helpful (as someone who knows DPPs but not the various algorithms for scaling them). Including such a table would further highlight the Kronecker factorization's superiority.

2-Confident (read it all; understood it all reasonably well)

This paper proposes the Kronecker Determinantal Point Process (KronDPP) model, which is a DPP model whose kernel matrix is decomposed as the Kronecker product of several smaller kernel matrices. Efficient algorithms for batch and stochastic learning are presented, which lead to significant speedups compared to standard full-rank DPP learning approaches. An experimental evaluation on synthetic and real datasets shows that KronDPP learning is much faster than existing full-rank DPP learning algorithms.

This paper makes an important contribution to the field of DPP learning, and represents one of the first efforts to scale DPP learning to large-scale datasets by overcoming the O(N^3) cost of standard full-rank DPP learning. The proposed learning algorithm (Algorithm 1) appears relatively straightforward to implement, which is a plus. In addition to a time complexity analysis, the experimental results confirm that the learning algorithm is substantially faster than conventional DPP learning algorithms. In the Introduction, this paper claims that O(N^3) preprocessing is required for exact sampling from a rank-k DPP. However, this claim appears to be incorrect, as shown in the time complexity analysis for Algorithm 3 in [Gillenwater, "Approximate Inference for Determinantal Point Processes", PhD thesis, 2014]. For this faster dual DPP sampling algorithm, we have an overall cost of O(k^3 + Nk^3) for sampling from a rank-k DPP (which includes the O(k^3) cost of the initial eigendecomposition). This is much better than O(N^3) and is also better than the O(N^(3/2) + Nk^3) cost for KronDPP with m = 2 subkernels. It is not clear that the stochastic updates in Algorithm 1 in this paper are faster than those in Ref. [9] cited in this paper, since [9] does not provide a time complexity analysis. Either a complexity analysis of [9] or an experimental comparison to [9] should be provided, or this claim should be removed. At various places in this paper, the cost of stochastic updates for Algorithm 1 is stated as either O(N \kappa^2 + N^(3/2)) or O(N \kappa^3 + N^(3/2)). From the supplement it appears that the latter is correct, so the appropriate corrections should be made. There are a few grammar issues throughout this paper that should be corrected.
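The cost comparison above can be eyeballed numerically (a rough sketch of my own; constants are ignored and `k = 50` is an arbitrary hypothetical rank, so this only illustrates the asymptotic claim, not measured runtimes).

```python
# Rough numeric illustration (constants ignored) of the costs discussed:
# full preprocessing O(N^3), dual rank-k sampling O(k^3 + N*k^3), and
# KronDPP sampling with m = 2 sub-kernels, O(N^(3/2) + N*k^3).
k = 50                                 # hypothetical kernel rank
for N in (10_000, 1_000_000):
    full = N**3
    dual = k**3 + N * k**3
    kron = N**1.5 + N * k**3
    print(f"N={N:>9}: full ~{full:.1e}, dual ~{dual:.1e}, KronDPP ~{kron:.1e}")
```

The shared N·k^3 term dominates both reduced costs here; the point is that the k^3 preprocessing beats N^(3/2) whenever k is well below sqrt(N), and both are far below the N^3 of full-rank preprocessing.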

2-Confident (read it all; understood it all reasonably well)

The authors provide an improvement to learning Determinantal Point Processes by introducing Kronecker product kernels. This improvement is based on earlier work on fixed-point algorithms for learning DPPs. By decomposing the kernel matrix into a Kronecker product of smaller matrices, more efficient learning and sampling algorithms are obtained compared to the baseline. The authors evaluate the new model on one synthetic and two real datasets.

The paper is technically well written, the proofs seem plausible, and the speedups obtained look promising. However, the experiments appear to be slightly biased towards the proposed model and should be improved. If I understand correctly, the synthetic datasets are sampled from the model the authors present in this paper. If this is not the case, it should be made clear in the text. However, if it is the case, then this gives an "unfair" advantage to the authors' model in comparison to others. To correctly demonstrate that this model is better than others in general, the dataset should be sampled in some other, more general way. One final remark: the authors mention k-DPP and rank-k DPP models [references 16 and 9] as the current state of the art, but don't provide any experimental comparison.
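To make the synthetic-data concern concrete, here is a minimal version of the standard exact eigendecomposition-based DPP sampler (a generic sketch, not the paper's Kronecker-accelerated sampler; `random_pd` is an illustrative stand-in for a learned kernel).

```python
import numpy as np

def random_pd(n, rng):
    """Random symmetric positive definite kernel (illustrative stand-in)."""
    A = rng.standard_normal((n, n))
    return A @ A.T + n * np.eye(n)

def sample_dpp(L, rng):
    """Exact DPP sample via the standard two-phase eigendecomposition method."""
    vals, vecs = np.linalg.eigh(L)
    # Phase 1: keep each eigenvector independently with prob val / (1 + val).
    V = vecs[:, rng.random(vals.size) < vals / (1.0 + vals)]
    Y = []
    while V.shape[1] > 0:
        # Phase 2: pick an item with prob proportional to its squared row norm.
        p = np.sum(V**2, axis=1) / V.shape[1]
        i = rng.choice(L.shape[0], p=p)
        Y.append(i)
        # Restrict span(V) to the subspace orthogonal to e_i.
        j = np.argmax(np.abs(V[i]))       # a column with V[i, j] != 0
        col = V[:, j].copy()
        V = np.delete(V, j, axis=1)
        V -= np.outer(col, V[i] / col[i])  # zero out row i
        if V.shape[1] > 0:
            V, _ = np.linalg.qr(V)        # re-orthonormalize remaining columns
    return sorted(Y)

rng = np.random.default_rng(0)
N = 12
Y = sample_dpp(random_pd(N, rng), rng)
assert len(set(Y)) == len(Y) and all(0 <= i < N for i in Y)
```

A fairer synthetic benchmark would draw data from a kernel without Kronecker structure (as this sketch does) and then compare how well each method recovers it.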

2-Confident (read it all; understood it all reasonably well)