Paper ID: 470
Title: Learning Multi-level Sparse Representations
Reviews

Submitted by Assigned_Reviewer_4

Q1: Comments to author(s). First provide a summary of the paper, and then address the following criteria: Quality, clarity, originality and significance. (For detailed reviewing guidelines, see http://nips.cc/PaperInformation/ReviewerInstructions)
This paper proposes a method for decomposing a matrix into a product of more than
two sparse matrices in a hierarchical manner, where the rank decreases from
the lower to the higher levels. This method is called multi-level sparse matrix
factorization. One benefit of such a decomposition is that a different form of
regularization can be applied at each level.
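In generic form (the notation here is illustrative, not the authors'), the model seeks sparse factors with decreasing inner dimensions such that

$$Y \;\approx\; U^{(1)} U^{(2)} \cdots U^{(L)}, \qquad U^{(l)} \in \mathbb{R}^{q_{l-1} \times q_{l}}, \qquad q_1 > q_2 > \cdots > q_{L-1},$$

where $q_0$ and $q_L$ are the numbers of rows and columns of $Y$, and a separate sparsity-inducing penalty $\Omega_l\big(U^{(l)}\big)$ with weight $\lambda_l$ is attached to each factor.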

Overall, the general idea of multi-level sparse factorization is original and
has been clearly described.


1) Tuning

Some methods for data-driven tuning of the parameters could be derived, as opposed
to relying on user-specified values.

2) Optimization

Can something be said about the convergence properties of the proposed learning
algorithm, such as its convergence speed or computational complexity?
Q2: Please summarize your review in 1-2 sentences
An interesting and novel method.

Submitted by Assigned_Reviewer_5

Q1: Comments to author(s). First provide a summary of the paper, and then address the following criteria: Quality, clarity, originality and significance. (For detailed reviewing guidelines, see http://nips.cc/PaperInformation/ReviewerInstructions)
The article was motivated by the difficult problem of inferring neurons and their assemblies from calcium imaging recordings. It proposes a "bilevel sparse heterarchical matrix factorization", whereby neurons (i.e. their pixel-correlates) and the assemblies they belong to (potentially >1 per neuron) are inferred jointly. More generally, it offers a decomposition of an observation matrix into a product of a number of sparse matrices, with the rank decreasing from the lower to the higher levels. It allows for imposing (structured) sparsity constraints at every level, and is not limited to hierarchical structures, as in previous work (Jenatton et al., 2012; Kim and Xing, 2012). In addition, their method does not require an a priori assumption on the exact number of assemblies (although the authors mention that the number of inferred assemblies depends on the optimisation parameters, which need to be defined by the user).

The method is shown to work well on synthetic data:
(i) it recovered assembly assignments of neurons with sensitivity surpassing that of MNMMF (Cichocki and Zdunek, 2006) and KSVDS (Rubinstein et al., 2010); all methods, including the proposed one, were initialised with Adina cell sorting (Diego et al., 2013);
(ii) it detected calcium transients of single neurons with sensitivity higher than that of MNMMF or Adina cell sorting.
An example from the analysis of real experimental data (calcium imaging in rat hippocampus) is also given, although the reviewer was not sure what to conclude from these images (other than that the neurons seem to be less spatially sparse than in the initial state derived from the Adina algorithm).

The parallel inference of structures at every level, including the most basic one (the dictionary), as well as of the time courses at each level, is emphasised throughout the article and is said to be the reason for the more robust identification of neuronal time courses. However, the inferred dictionary (images of single neurons) is never discussed. Images of such reconstructions (such as supplementary Fig. 6), shown next to the ground truth and to reconstructions from other methods, would be far more interesting to the reviewer than the images of calcium recordings, where the ground truth is unknown (Fig. 6).

Quality
This paper is technically sound. It provides an in-depth description of the proposed cost function, with details of the optimisation made available in the Supplemental Material. Some details might be missing, such as how to optimise the trade-off parameters (or how they were chosen and what values were used in the simulations), how the adjacency matrices were binarized (to yield true and false positives for the evaluation, l. 308), or how sensitive the method was to the noise levels tested (l. 300), but overall I find the article a complete piece of work.

Clarity
The paper is clearly written.

Originality
To the best of my knowledge, the proposed approach is a new one. Relevant literature is adequately referenced.

Significance
The results are potentially important for any lab performing calcium imaging on cell assemblies. As the authors have shown, the parallel optimisation of assembly assignments and time courses at every level of interest improves the robustness of the inference. For any experiment in which a fairly coherent activation of neurons within an assembly can be assumed, this method should be of great value.
One more advantage is the flexibility in imposing prior knowledge at any level of detail. (It is a pity that the method apparently did not work well without initialising the dictionary with one obtained from Adina. In principle, it should be possible to incorporate the same knowledge that Adina uses at the 0-th level of inference.)
Q2: Please summarize your review in 1-2 sentences
An interesting, technically sound article, of potential value to anyone working with structured data (e.g., calcium imaging of neural assemblies).

Submitted by Assigned_Reviewer_6

Q1: Comments to author(s). First provide a summary of the paper, and then address the following criteria: Quality, clarity, originality and significance. (For detailed reviewing guidelines, see http://nips.cc/PaperInformation/ReviewerInstructions)

The authors present an algorithm for learning multilevel sparse representations. This is modeled as a (penalized) low-rank matrix factorization problem where the number of factors is typically greater than 2 and each factor represents a different entity in the hierarchy. The factors are estimated using standard block coordinate descent, where each factor is optimized while the others are kept fixed, and sparsity-promoting penalties are used at each step.
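To make the optimization scheme concrete, here is a minimal block coordinate descent sketch for such a multi-factor sparse factorization, assuming a plain l1-penalized least-squares objective and a proximal-gradient update per factor; the function names, update rule, and step size are illustrative, not the authors' exact algorithm (which additionally enforces consistency with a two-way decomposition and uses structured penalties):

```python
import numpy as np
from functools import reduce

def soft_threshold(X, lam):
    """Elementwise soft-thresholding: the proximal operator of lam * ||.||_1."""
    return np.sign(X) * np.maximum(np.abs(X) - lam, 0.0)

def _chain(mats, identity_size):
    """Matrix product of a list of factors; identity when the list is empty."""
    return reduce(np.matmul, mats) if mats else np.eye(identity_size)

def multilevel_sparse_factorization(Y, ranks, lams, n_iter=80, step=1e-3, seed=0):
    """Approximate Y ~ U[0] @ U[1] @ ... with sparse factors (illustrative sketch).

    ranks: inner dimensions, decreasing from the lower to the higher levels,
           giving len(ranks) + 1 factors in total.
    lams:  one l1 weight per factor.
    Each sweep updates one factor by a proximal-gradient step on the squared
    reconstruction error while all other factors are held fixed.
    """
    rng = np.random.default_rng(seed)
    m, n = Y.shape
    dims = [m] + list(ranks) + [n]
    U = [0.1 * rng.standard_normal((dims[i], dims[i + 1])) for i in range(len(dims) - 1)]
    for _ in range(n_iter):
        for k in range(len(U)):
            left = _chain(U[:k], m)           # product of factors to the left of U[k]
            right = _chain(U[k + 1:], n)      # product of factors to the right of U[k]
            resid = left @ U[k] @ right - Y   # current reconstruction error
            grad = left.T @ resid @ right.T   # gradient of 0.5 * ||resid||_F^2 w.r.t. U[k]
            U[k] = soft_threshold(U[k] - step * grad, step * lams[k])
    return U
```

For example, a call such as multilevel_sparse_factorization(Y, ranks=[40, 8], lams=[0.1, 0.1, 0.5]) would return three sparse factors whose product approximates Y (the values are hypothetical).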

The method is discussed in the context of analyzing large scale calcium imaging data where the assumption is that the neurons can be collected into a set of (possibly overlapping) neuronal assemblies giving rise to the hierarchy pixel --> neuron --> assembly. The goal then is to estimate the location of each neuron (in terms of pixels), the membership of each neuronal assembly, and the spikes of each assembly.

The authors argue that their method performs better than other methods in artificial (but realistic) data and that it can correctly identify neural assemblies in real data.


Originality

The ideas presented in this paper are novel to the best of my knowledge. The three-way decomposition approach is interesting, and I liked the fact that the authors also keep a standard two-way decomposition in their model and enforce consistency between the two. This extra step can make the algorithm more robust to phenomena such as a neuron in an assembly "missing" a spike.

Quality

My main concern about this paper is efficiency. Large-scale calcium imaging experiments typically involve tens of thousands of pixels and thousands of timesteps. Although the multilevel decomposition significantly reduces the dimensionality, and several steps of the algorithm can be parallelized, I can imagine that the algorithm might still be quite slow and require a large number of iterations. These points should be discussed.

I am also not convinced by the use of neural assemblies. It is not clear that this is something that will often appear in practice. Also, if a subset of neurons has the same spikes, then this will be reflected in the temporal component of the spatiotemporal matrix. The rank of the matrix will still be small, and I do not see a direct benefit from incorporating the proposed multilevel hierarchical decomposition. The authors would probably need to elaborate on this issue.

Moreover, the various regularizers \Omega_U, \Omega_A should be clarified. For example, what are the convex cell shape priors that are used?

The authors leave the choice of regularization weights to the user. This seems an important task to me, since the number of regularizers is large and their role is of critical importance. The authors claim that an accurate prior estimation of the rank is not of great importance, since the algorithm will shrink non-relevant components to zero. Although this sounds intuitive, I can imagine that this behavior depends heavily on the right choice of the various sparsity penalties that shrink the non-relevant components.


Clarity

I think the paper is not clearly written at all, and right now it is only accessible to an expert in calcium imaging data analysis. For example, the authors mention very briefly in the results section how the neural spikes are translated into calcium activity through a convolution operator, but do not discuss this central aspect of spike deconvolution at all in their algorithm. In fact, it seems to me that they sidestep this issue "by eliminating all transients that overlap more than 20%" (if I understand correctly). This is a topic of active research (e.g. Vogelstein et al. 2010, Grewe et al. 2010), and I would like to see how the authors relate to that literature.
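For reference, the standard generative model in that literature treats the fluorescence trace of a neuron as its spike train convolved with a decaying-exponential indicator kernel; a minimal sketch, with an illustrative function name and parameter values:

```python
import numpy as np

def spikes_to_calcium(spikes, tau=10.0, amplitude=1.0):
    """Convolve a spike train with a decaying-exponential calcium kernel.

    spikes: 1-D array of spike counts per frame.
    tau:    indicator decay constant in frames (illustrative value).
    """
    t = np.arange(int(np.ceil(10 * tau)))
    kernel = amplitude * np.exp(-t / tau)
    # full convolution, truncated back to the length of the input
    return np.convolve(spikes, kernel)[: len(spikes)]
```

Spike deconvolution methods such as those cited above invert this mapping to recover the spike train from the noisy fluorescence.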

I found the inclusion of the main equations as a figure a bit unusual, although that is a question of style as well.

Finally, I would prefer a more thorough presentation of the real-data experiments. Although videos cannot be presented easily in a paper, the results shown in Fig. 6 could be more convincing. In general, Fig. 6 needs a better explanation.
Q2: Please summarize your review in 1-2 sentences
An interesting approach, but the paper needs significant clarification on a few important issues.
Author Feedback

Q1:Author rebuttal: Please respond to any concerns raised in the reviews. There are no constraints on how you want to argue your case, except for the fact that your text should be limited to a maximum of 6000 characters. Note however that reviewers and area chairs are very busy and may not read long vague rebuttals. It is in your own interest to be concise and to the point.
We thank the reviewers for their careful comments. Reviews are mainly positive about the significance of the contribution, experiments, novelty and clarity of the paper. All the suggestions and comments made by the reviewers will be taken into account for a more complete manuscript.

Please find our responses to the main concerns below.

@ALL rev

*Parameters

The trade-off parameters are chosen depending on the structure of the problem and on the confidence placed in each term. The main parameters are the weights of the data fidelity and temporal consistency terms. Depending on their values, the algorithm relies only on the data fidelity, as KSVDS does, by setting \nabla_U = 0, or only on the temporal consistency, as MNNMF does, by setting \nabla_D = 0. Both terms are needed because the temporal consistency indicates when neurons fire and the raw data indicate the intensity of the firing. In our experiments, we set \nabla_D larger than \nabla_U because neurons do not fire at exactly the same time.

Regarding the sparsity parameters, \lambda_U should increase monotonically with the layer index, since the number of active signals is higher at the lower levels, and \lambda_A should be set to a small value (0.0001) to prevent A from shrinking to zero in the first iterations.

*Fig. 6

Fig. 2 and Fig. 6 will be improved to show that our algorithm is able to reconstruct the input frame at different levels of representation with a better SNR than the raw data, and also that several neurons can be active at the same time while only a few of them form a neuronal assembly. The identification of these entities is an important element for neuroscience, as is monitoring neurons at single-cell resolution.

*Computational complexity

The problem in Fig. 3d) is not jointly convex, just like the simplest cases of dictionary learning or non-negative matrix factorization. Due to this non-convex formulation, even the simplest cases can only be proven to converge to a stationary point (or local minimum), as indicated in [16], and the convergence rate depends strongly on the initialization of the matrices, as pointed out in [1*]. Empirically, the algorithm converges in 80 iterations on average. Despite estimating all parameters, the computational time is similar to that of KSVDS and MNNMF, which only estimate the upper layer. All experiments were conducted under the same conditions and run in MATLAB.

The complexity of the first layer alone depends on the updates of D and U. The former, based on [10], is O(m·q_0·max(m^{1/2}, n) + m^2 + max(n, m)·q_0^2), and the latter depends on the solver; for instance, orthogonal matching pursuit is O(n·q_0·(m + L^2)), where L is the number of non-zero elements. For multiple layers, the complexity increases by approximately a multiplicative factor of l^2, where l is the number of layers.

[1*] Albright et al., "Algorithms, Initializations, and Convergence for the Nonnegative Matrix Factorization", NCSU Technical Report Math.

@REV_4

Data-driven tuning of these parameters is technically possible. This would make our results even better than those we currently obtain with suboptimal manual parameter tuning.

@REV_5

*"how were the adjacency matrices binarized (to yield true and false positives for their evaluation, l. 308)"

The adjacency matrices are binarized as follows: (i) KSVDS: entries larger than 5% of the maximum value in A are set to 1; (ii) MNNMF: entries larger than 0.001 are set to 1; and (iii) bilevel SHMF: entries larger than 0 are set to 1, since \lambda_A is set to 0.001. These values are chosen to give the best performance.
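As a sketch of the rule just described (the function name is hypothetical, thresholds as stated above):

```python
import numpy as np

def binarize_adjacency(A, method):
    """Binarize a coefficient matrix A into assembly memberships."""
    if method == "ksvds":
        return A > 0.05 * A.max()   # 5% of the maximum value in A
    if method == "mnnmf":
        return A > 0.001            # fixed threshold
    if method == "bilevel_shmf":
        return A > 0                # any non-zero entry counts
    raise ValueError("unknown method: %s" % method)
```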

*"how sensitive was the method to levels of noise they tested (l.300)"

We preferred to show the overall performance, for a clear comparison among the algorithms, instead of showing the performance for each single test sequence. We will include the performance for different noise levels, showing that the performance of the algorithms decreases slightly as the noise level increases.

*"It is a pity that the method apparently didn't work well without initializing dictionary by one obtained from Adina"

Bilevel SHMF can also be initialized randomly, achieving performance similar to that of the other methods, but slightly worse (approx. 3%) than when initialized with ADINA. For a fair comparison, all algorithms were evaluated under the same conditions.

@REV_6

*"I am also not convinced by the usage of neural assemblies. It is not clear that this is something that will often appear in practice..."

Neural assemblies are an important concept in neuroscience because patterns of co-active neurons are crucial determinants of behavioral and cognitive functions such as memory consolidation, and they also play a "level-bridging" role between single cells and whole-organism properties.

*"\Omega_U, \Omega_A should be clarified. For example, what are the convex cell shape priors that are used?"

In line 266, \Omega_U and \Omega_A are defined as the l_1 norm instead of the l_0 norm because (1) the l_1 norm is a convex relaxation of the l_0 norm, and (2) the number of neurons firing in each frame and the number of cells belonging to an assembly are unknown a priori. The only prior on the cell shape is that it must be convex.
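(For completeness, the reason the l_1 penalty still produces exact zeros is the standard one: its proximal operator is elementwise soft-thresholding,

$$\big[\operatorname{prox}_{\lambda\|\cdot\|_1}(v)\big]_i \;=\; \operatorname{sign}(v_i)\,\max\big(|v_i| - \lambda,\ 0\big),$$

which sets small coefficients exactly to zero.)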

*"...not discuss this central aspect of spike deconvolution...This is a topic of active research (e.g. Vogelstein et al. 2010, Grewe et. al. 2010)"

Bilevel SHMF can be complemented with either of these spike deconvolution methods to estimate the spike trains of the neurons and assemblies once the calcium transients are inferred. We prefer to formulate the complete approach on the calcium transients, enforcing temporal consistency as a matrix factorization instead of using a Hamming distance.

It is correct that we eliminate such transients, but we then sample another firing position for that assembly. Hence, we ensure that the assemblies are independent and are not merged into a single assembly.