NIPS 2016
Mon Dec 5th through Sun Dec 11th, 2016, at the Centre Convencions Internacional Barcelona
Paper ID 1717: "On Mixtures of Markov Chains"

### Reviewer 1

#### Summary

The paper proposes a novel approach to estimating the parameters of a mixture of Markov chains. It is based on sampled trajectories of length 3 (3-trails) from the finite mixture and uses singular value decompositions and simple linear-algebra techniques to recover the parameters from the empirical distribution of these 3-trails. Some synthetic experiments and a real-data application are provided.
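As an illustrative aside (not the paper's actual algorithm), the structural fact that makes 3-trails informative can be sketched in a few lines: for a mixture of L chains, each "middle-state slice" of the exact 3-trail distribution is a sum of L rank-one terms, hence has rank at most L. All names below (n, L, P, s0, T) are hypothetical.

```python
import numpy as np

# Illustrative sketch: for a mixture of L Markov chains, each slice
# T[:, j, :] of the exact 3-trail distribution has rank at most L.
rng = np.random.default_rng(0)
n, L = 5, 2
P = rng.random((L, n, n))
P /= P.sum(axis=2, keepdims=True)   # per-chain row-stochastic transitions
s0 = rng.random((L, n))
s0 /= s0.sum()                      # joint (chain, start-state) probabilities

# Exact 3-trail tensor: T[i, j, k] = sum_l s0[l, i] P[l, i, j] P[l, j, k]
T = np.einsum('li,lij,ljk->ijk', s0, P, P)

# Fixing the middle state j leaves a sum of L rank-one outer products.
ranks = [np.linalg.matrix_rank(T[:, j, :], tol=1e-10) for j in range(n)]
print(ranks)  # each slice has rank at most L = 2
```

Empirically estimating T from sampled 3-trails and reading the rank of these slices off an SVD is the kind of spectral step the summary alludes to.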

#### Confidence in this Review

3-Expert (read the paper in detail, know the area, quite certain of my opinion)

### Reviewer 2

#### Summary

The new model is interesting and captures some of the new challenges in machine learning. The main idea is to use a "semantic vector" to represent the labels, which allows infinitely many labels, as well as labels that are semantically correlated with each other. The hypothesis is a function f that maps pairs (x, \lambda) (data and semantic label vector) to +-1 (the true label). The new model is a natural generalization of the multi-label setting and is better suited to modern supervised learning applications. The paper gives a generalization bound for the new model based on the VC-dimension. The technique is quite straightforward and similar to previous generalization bounds. The new model is quite clean, and it would not be surprising to see other generalization bounds carry over to this new setting.

One questionable point about the model is that it requires the semantic vectors for the labels to be known (and these are generated independently of the training samples x). It would be interesting to see whether it is possible to learn both the semantic vectors and the labeling function f from the same data set (information-theoretically it seems possible given enough samples from each class), or at least, given a few new samples in a new class, to estimate the semantic vector for this new class. In the experiments the paper uses word embeddings, which is reasonable, but being able to learn/adjust labels based on the data might also be interesting.

#### Qualitative Assessment

The new algorithm is based on the observation that the backward matrix P and the forward matrix Q are closely related (and both are closely related to the transition matrices of the different chains). In particular, the two sets of matrices can be captured by the "shuffle set" property. This property allows the paper to show that a co-kernel matrix A(P,Q) has low rank, which gives a way to relate the two rank-L subspaces associated with P and Q. The paper then uses the standard approach of eigendecomposition to find a unique linear transformation from which the matrices P and Q can be computed. The shuffle-set property and the co-kernel matrix are interesting new techniques for spectral learning. The algorithm is clean and seems to perform well. The paper also shows that the non-degeneracy condition is not restrictive. It would be interesting to see a more careful perturbation analysis, especially regarding the condition number of the co-kernel matrix. The analysis is mostly well written, except that in certain places the notation is a bit confusing. For example, on Page 4, when describing step (ii), the result should be Y_j = R Y'_j and Z_j = R Z'_j (the primes appeared on the wrong matrices), and in (iii) it should be R = D\Pi \bar{R'}. Overall this is an interesting new result on spectral learning, containing new techniques specifically designed for mixtures of Markov chains.
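As an aside, the generic eigendecomposition step described above, relating two matrices that share an unknown basis up to diagonal factors, can be sketched as follows; all names (R, D1, D2, A, B) are illustrative, not the paper's notation.

```python
import numpy as np

# Hedged sketch of the generic spectral step: two observed matrices that
# share an unknown basis R but differ by diagonal factors can be aligned
# via an eigendecomposition, recovering R up to permutation and scaling.
rng = np.random.default_rng(1)
L = 3
R = rng.random((L, L))                    # unknown linear transformation
D1 = np.diag(rng.random(L) + 0.5)
D2 = np.diag(rng.random(L) + 0.5)
A, B = R @ D1, R @ D2                     # "observed" quantities

# A @ inv(B) = R (D1 D2^{-1}) R^{-1}: its eigenvectors are columns of R.
w, V = np.linalg.eig(A @ np.linalg.inv(B))

# Check: each eigenvector matches a column of R up to sign and scale.
R_cols = R / np.linalg.norm(R, axis=0)
V_cols = np.real(V) / np.linalg.norm(np.real(V), axis=0)
match = np.abs(R_cols.T @ V_cols)         # |cosine| similarity matrix
print(np.allclose(match.max(axis=1), 1.0, atol=1e-6))  # True
```

The uniqueness of this recovery (up to the permutation and scaling absorbed by D and \Pi) is exactly what the step (iii) equation R = D\Pi \bar{R'} expresses.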

#### Confidence in this Review

2-Confident (read it all; understood it all reasonably well)

### Reviewer 3

#### Summary

This paper tackles the problem of recovering a mixture of Markov chains from trajectories of length three. The authors give a condition under which the mixture is identifiable, and a spectral algorithm that recovers the parameters exactly in the noiseless case. The algorithm can easily be adapted to the noisy case, and experiments on synthetic and real data are provided.

#### Qualitative Assessment

This paper studies an interesting problem with important applications. I am not very familiar with related work, and as such cannot accurately situate the problem in the context of other works, but the framework proposed by the authors appears to be new, and the technical contributions seem notable. I enjoyed reading this paper. Altogether I am in favor of accepting it.

Here are the two main points that could be improved, in my opinion:
1) I wonder if the definition of "well-distributedness" could be simplified (see also my question below). I also feel that proving the necessity of this condition is within reach, and it would strengthen the paper.
2) The experimental results are not very convincing. E.g., the regime chosen for the synthetic experiments (nb samples = 10^9, Ln^2 = 100) does not seem applicable to most realistic settings. A discussion of why EM seemingly outperforms the spectral approach when L increases would also be interesting.

Questions:
- l. 97, "It is easy to see that using...". That statement does not seem "easy" to me. At first sight, O(log n) samples seem just too few to estimate O(Ln^2) parameters. Can you make your statement more precise? What do you mean by "approximated within +- eps"? Also, when you say "sample trails", do you mean 3-trails?
- Eq. 2: What is the factor 2 doing at the end? (Probably a typo?)
- The definition of well-distributedness is not trivial to fully grasp. Suppose the starting probability is uniform (i.e. s^l(i) = 1/nL); is the definition then equivalent to saying "the set \mathcal{M} is linearly independent"?
- l. 157: I suppose you mean $Y_j = R Y'_j$ and $Z_j = R Z'_j$? (Note that the prime is in a different place.)
- Above l. 224: by E_{\ell}[...] you probably mean the average? It is not immediately clear. Are you taking into account a distribution over the L components, e.g. as given by the starting probabilities?
- Figure 3b: can you give an intuitive explanation of why EM becomes better when L is decreasing?
Comments:
- l. 64: "the latter of which has no quality guarantees, even in practice". I don't understand what you mean by quality guarantees 'in practice'.
- l. 114: I believe you mean 'row' instead of 'column' (the matrix is row-stochastic).
- I think a discussion (or some examples of classes) of well-distributed vs. non-well-distributed mixtures would be helpful.
- l. 136: "to prove our bounds". Which bounds? I suppose you mean "to perfectly recover the mixture"?

===== UPDATE AFTER REBUTTAL =====

Thank you for your rebuttal, which managed to answer some of my questions and comments above. One last comment: "With regard to independence, the ideal case would be if we had truly independent samples, eg, the same identifiable user coming back to last.FM every day." As a matter of fact, the dataset that you use *does* contain this. As each "play" event is timestamped, you could easily check whether there was a long break between two successive plays.

#### Confidence in this Review

2-Confident (read it all; understood it all reasonably well)

### Reviewer 4

#### Summary

This paper addresses the problem of reconstructing a mixture of Markov chains given observed trajectories. The authors show that it is sufficient to observe only the first three states of each trail to uniquely reconstruct the mixture of Markov chains, and they propose an efficient algorithm for doing so.

#### Qualitative Assessment

Since I am not very familiar with the topic, I had some difficulty understanding the novel contribution and do not have many valuable comments. From my point of view, the paper is not well motivated, especially regarding how the authors arrived at their reconstruction algorithm and why it works.

#### Confidence in this Review

1-Less confident (might not have understood significant parts)

### Reviewer 5

#### Summary

This paper explores the problem of reconstructing mixtures of Markov chains by considering only triples of states. Specifically, the authors show that only the first three states of each trail need to be considered to recover the mixture parameters. By recovering the Markov chain that produces a set of states, the authors can analyze user behavior. They show that their algorithm outperforms EM.

#### Qualitative Assessment

The paper gives a good background on the problem and a thorough explanation of the proof of their method. It would be nice to have an explanation of Figure 3(a) and of why the median recovery error of EM fluctuates so widely as the number of samples varies. The authors show good accuracy and performance compared to EM.
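For concreteness, the EM baseline the reviews compare against could look roughly like the following: alternate soft assignments of each 3-trail to the L chains (E-step) with weighted transition-count updates (M-step). This is a generic sketch, not necessarily the paper's exact baseline; all names (em_mixture_mc, trails, pi, mu, P) are illustrative.

```python
import numpy as np

# Hedged sketch of a generic EM for a mixture of L Markov chains,
# fitted to observed 3-trails (i, j, k) over n states.
def em_mixture_mc(trails, n, L, iters=50, seed=0):
    rng = np.random.default_rng(seed)
    P = rng.random((L, n, n))
    P /= P.sum(axis=2, keepdims=True)   # per-chain transition matrices
    pi = np.full(L, 1.0 / L)            # mixture weights
    mu = np.full((L, n), 1.0 / n)       # per-chain start distributions
    t = np.asarray(trails)
    i, j, k = t[:, 0], t[:, 1], t[:, 2]
    for _ in range(iters):
        # E-step: responsibility of chain l for each observed 3-trail
        lik = pi[:, None] * mu[:, i] * P[:, i, j] * P[:, j, k]
        r = lik / lik.sum(axis=0, keepdims=True)
        # M-step: re-estimate weights, starts, and transitions
        pi = r.mean(axis=1)
        for l in range(L):
            mu[l] = np.bincount(i, weights=r[l], minlength=n) + 1e-12
            mu[l] /= mu[l].sum()
            C = np.full((n, n), 1e-12)  # smoothed weighted counts
            np.add.at(C, (i, j), r[l])
            np.add.at(C, (j, k), r[l])
            P[l] = C / C.sum(axis=1, keepdims=True)
    return pi, mu, P

# Toy usage on a handful of 3-trails over n = 3 states
pi, mu, P = em_mixture_mc([(0, 1, 2), (1, 2, 0), (2, 0, 1)], n=3, L=2)
```

Unlike the spectral approach, such an EM has no global recovery guarantee and depends on initialization, which may partly explain the erratic recovery-error behavior the reviewer asks about.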

#### Confidence in this Review

1-Less confident (might not have understood significant parts)