Paper ID: | 2998 |
---|---|

Title: | Optimal Best Markovian Arm Identification with Fixed Confidence |

In general the paper is well written and easy to follow. I have couple questions about the results of the paper. See the comments below. The main concern is on the significance of the results. In particular, are the results in Section 2 about the reparametrization, and the result in Section 3 about the concentration of Markov chain mean estimation significant? The results look very interesting to me, but I am not familiar with this field. Other reviewers may have better judgement. Another concern is that the proposed algorithm needs the knowledge of P, as it is used to compute the stopping rule. This dependence makes the algorithm much less practical. Questions about the results: 1. Is the R_a term natural in Lemma 2? It seems to be a bit technical 2. Is the lower bound result in Section 3 under the same conditions with other references? For example, compared to [DZ98] and [Lez98]? 3. Does the algorithm need to know \phi? Minor comments: 1. Line 81-83: \mathcal{X} and \mathcal{S} are not unified. 2. Line 91. What is y_m? 3. Line 205, P_{(q,\mu)} is not defined. 4. Line 248 and 250, how are \hat{\mu}_a(N_a(t)) and \hat{\mu}_a(t) defined? ================ After rebuttal: Thanks for the reply. The rebuttal confirms that the algorithm needs some sort of knowledge of P. I think it needs some further justification why this is not a serious concern, or it should be stated explicitly. As it is also pointed out by Reviewer #4, the claims in section 3 are misleading. The comparison of The results of [Lez98], [DZ98], and Theorem 1 should be more rigorous and careful. I will keep my original score.

This work extends the problem of optimal best arm identification to the Markovian setting. Similarly to the iid case, each arm is still parametrized by a single parameter which can here be taken to be the stationary mean (exponential family of Markov chains), and a fair amount of structure is therefore assumed. For example the law of all arms is generated by the same Markov kernel (generator of the exponential family), and must be known for the algorithm to yield a priori guarantees and a valid stopping rule. The paper is of theoretical nature, is extremely well written and uses proper mathematical formalism. Its structure is doctrinal, allowing the reader to understand the underlying family of distributions, the bandit problem, main developed tools, and parse the results efficiently. Although the reviewer has no reason to doubt the results, a certain number of clarifications would be welcome here, especially with regards to Section 3. P(S) is defined at L.92 as the set of chains (identified with their Markov matrix) that satisfy a collection of irreducibility and other structural properties that are tied to a previously defined function φ. However, in the statement of Theorem 1, P(S) is introduced prior to φ. This begs the question of whether Theorem 1 is as general as currently stated. If indeed there is a strong connection between the function evaluated on the chain and the transition matrix for the theorem to hold, it would be uncalled for the author to claim general improvements over the references at L.163-164. In the proof of lemma 1, at L.373-374 the reader is referred to [Lax07], which guarantees existence of an eigenvector for the PF root that ’depends differentiably’ on the parameter, but also cautions the reader against the fact that not all such vectors might work. Is it therefore clear that the chosen u and v (where u sums to 1 and with v satisfying a prescribed inner product with u) are indeed sufficiently differentiable ? === Minor comments === L.91 ’y_m’ should be y L.135 ’the the’ L.166-167 The transpose notation is perhaps a bit too liberal here, as it refers to the adjoint of P in L2(π). L.371 ’eigenvectors’ L.375 Missing a constant term in log P(x,y) (that will indeed vanish upon taking the derivative) L.379 Last term should apply on x and not y. L.380 The infinitesimal term in theta is sub-scripted. L.387 Could you explain how dependence of this ratio in theta implies the ’Therefore’ ? L.461 ’eigenvalue’ L.488 Can you add intermediary steps for the second equality ? ==== AFTER REBUTTAL ==== The reviewer thanks the authors for their detailed answers. As argued by the authors, concentration inequalities are handy tools, but as such it is important for the reader to also quickly understand when they apply best, and when they don't. This coupling between φ and P could lead to a very interesting discussion and perhaps new insight. The reviewer keeps his score: an accept 7/10.

- It is not clear to me why one should care about a problem introduced by this paper. The authors put no words on introducing the physical background of the markov reward setting. I think motivation is important especially when the setting is not something that naturally exists: the scalar-exponential-family-parameterized markov chain looks artificial. It would help if the authors can explain why such setting is interesting to study. - It seems to me that given the concentration result in line 148, everything (lower bound, algorithm, and upper bound) directly follows [GK16]. It is not clear to me if there is anything interesting brought by the markov reward setting, regarding the problem structure revealed by the bound, challenges or techniques in the analysis. - The writing structure needs improvement. The actual problem setting is introduced on page 5 and even after reading the setting it takes me quite while to figure out what are the arms and what are the rewards. That said, the writing of theoretical statements looks clear. -------------------- update after reading the rebuttal: The authors pointed out a contribution I missed in my initial review - the finite sample concentration result. Although I still think the motivation for the problem setting is not very convincing and the results are a bit unsurprising from a practical point of view, I appreciate the technical clarity of this paper and that the authors work through all the theoretical results in a new best arm identification setting. I have increased my score accordingly.