This paper proposes the application of reinforcement learning, in particular Q-learning, to determining the check-node (CN) scheduling policy in BP decoding of short LDPC codes. This is in contrast to other work in the broad area of machine learning applied to coding, which focuses on finding coding schemes or on "deep unfolding" of iterative decoders. Discretization of the state space and clustering of CNs are introduced to avoid an explosion of the state-space size and of the learning complexity. The experiments in Section 5 show that the proposal improves both the error probability and the number of CN-to-variable-node message propagations.

The reviewers rated this paper favorably, with particular emphasis on its novelty. They were also satisfied with the author response. I would thus like to recommend acceptance of this paper. I would also appreciate it if the authors took the following items into account, in addition to the review comments.

- Use of the term "bandit": As Reviewer #1 mentioned, the reader may suspect some confusion in this paper between bandit problems and reinforcement learning. The authors appear to use the term "bandit" in accordance with reference [6], where it refers to a problem in which the environment consists of N statistically independent processes of state transition and reward generation, each with its own state space. The now-standard use of the term "bandit", however, refers to a problem in which the environment has only a single state, so that neither the future action sets nor the future reward processes are affected by today's decisions. See, e.g., Section 1.1.2 of Lattimore and Szepesvari, Bandit Algorithms, Cambridge University Press, 2020 (accessible via https://tor-lattimore.com/downloads/book/book.pdf). A clarification of this point would be useful.

- Line 132: The use of the term "value" here is somewhat confusing, as it refers to the numerical value representing the aggregated log-likelihood ratios, whereas in the standard reinforcement learning framework the term generally refers to the (estimated) expected return of a state or a state-action pair (see the note after this list).

- Line 305: The quantity Pr[\hat{x} \not= x] corresponds not to the bit error rate, as claimed by the authors, but to the block error rate, since the boldface x and \hat{x} are defined as the transmitted codeword and the reconstructed codeword in line 128 and in lines 130-131 of the paper, respectively. To avoid confusion between bit error rate and block error rate, a convention in communication engineering is to abbreviate them as BER and BLER, respectively, which the authors may wish to adopt together with explicit definitions (the two quantities are written out after this list). I also think it a good idea to mention that a target block error rate of 10^{-1} has been adopted in several radio network specifications, so that the range of error rates shown in Figure 2 is relevant.
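On the "value" terminology in the second item above: as a point of reference only (the discount factor \gamma and reward R below are my notation, not the paper's), the standard reinforcement-learning notion of value is the expected return

  V^\pi(s) = E_\pi[ \sum_{t \ge 0} \gamma^t R_{t+1} \mid S_0 = s ],   Q^\pi(s, a) = E_\pi[ \sum_{t \ge 0} \gamma^t R_{t+1} \mid S_0 = s, A_0 = a ],

which is distinct from the aggregated log-likelihood ratio that line 132 calls a "value". Using a different word for the latter, or stating the intended meaning explicitly where it first appears, would resolve the ambiguity.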
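On the error-rate terminology in the third item above: for concreteness, and assuming only the paper's own definitions in lines 128-131 (boldface x = (x_1, ..., x_n) the transmitted codeword of length n, \hat{x} its reconstruction), the two quantities would read

  BLER = Pr[ \hat{x} \not= x ],   BER = (1/n) \sum_{i=1}^{n} Pr[ \hat{x}_i \not= x_i ],

so the quantity discussed at line 305 is the block error rate. This is merely a sketch of the standard definitions; the exact expressions should of course follow the paper's notation.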