The paper proposes a method for tuning hyper-parameters in deep (reinforcement) learning using Bayesian optimization. The key idea is to exploit the iterative structure of the problem and use a variable-augmentation trick to learn a score function that compresses the learning progress at any stage. The score accounts for training success and stability.

The strengths of the paper are:
- well written
- good relation to prior work
- good experimental study

However, the paper also has weaknesses, which mostly concern theoretical aspects and the chosen heuristics (details below).

1. If we are only interested in the predictive mean of the cost GP, why use a GP in the first place rather than a parametric function, which scales much better?

2. All reviewers agree that everything related to the condition number is heuristic and unclear. This is the part that caused us the most trouble. We don't think it is overly critical, and we find the other ideas quite valuable. Here are our concerns:

First, condition numbers are unintuitive: whether a given value signals numerical instability depends on what you want to do with the matrix, on the implementation, on the floating-point representation, etc., so any statement about numerical stability is heuristic. None of this is explained, and none of us knows how to set these thresholds.

The GP covariance's condition number shouldn't be too bad if the (measurement) noise is reasonably large (more precisely, if the signal-to-noise ratio is not horrible). Also, the smallest eigenvalue of K + \sigma^2 I is bounded below by \sigma^2, and this bound is already attained with two identical inputs.

If observations are noise-free, condition numbers can become arbitrarily bad. In this case, a standard approach is to add a jitter term (nugget) to the kernel matrix, which plays the role of the measurement noise. Again, this is a good way to control condition numbers.
A possible solution would be to add a nugget/jitter term by default and remove the part on the condition number from the paper.

3. Optimization via gradient descent: How do you deal with multi-modality?

4. When the parameters of the score function are estimated, this is basically a (regularized) least-squares fit (using the data-fit term of the log-marginal likelihood). It would be interesting to see what effect this has on the GP model itself, i.e., how the GP hyper-parameters change compared to a vanilla GP fit. Does the GP model still make sense, i.e., is the fit good? Some insights here would be useful.

Overall, we (the reviewers) think that this paper is borderline-borderline. To be honest, it's not a great paper, but there's nothing fundamentally wrong with it, and it has some interesting ideas. In the end, the question is whether a resubmission would make the work significantly better or whether it would just add pain to the authors' lives without significant benefit to the paper. We don't think this paper will ever make it out of the "noise zone" (borderline zone) at conferences.

In the end, we recommend accepting this paper, but we *strongly* urge the authors to address the issues raised above (see also the individual reviews) in the final version. In particular (though this is not the only issue), we require the authors to address the comments related to the conditioning issues. These issues were already pointed out by the reviewers for UAI 2020, but the authors didn't address them in their resubmission to NeurIPS. So, this is the last chance to address this issue.
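
To illustrate the jitter suggestion in point 2, here is a minimal NumPy sketch. It is our own illustration, not the authors' code, and the squared-exponential kernel, the inputs, and the noise levels are assumptions chosen for the example. It builds a kernel matrix with one duplicated input and compares the condition number of K + \sigma^2 I with the bound (\lambda_max(K) + \sigma^2) / \sigma^2.

```python
import numpy as np


def rbf_kernel(x, lengthscale=1.0, variance=1.0):
    """Squared-exponential kernel matrix for 1-D inputs x (assumed kernel)."""
    diff = x[:, None] - x[None, :]
    return variance * np.exp(-0.5 * (diff / lengthscale) ** 2)


def condition_number(K, noise_var):
    """2-norm condition number of K + noise_var * I, from its eigenvalues."""
    eigvals = np.linalg.eigvalsh(K + noise_var * np.eye(K.shape[0]))
    return eigvals[-1] / eigvals[0]  # eigvalsh returns ascending eigenvalues


# Hypothetical inputs; the last point duplicates the first, so K is singular.
x = np.array([0.0, 0.5, 1.0, 1.5, 0.0])
K = rbf_kernel(x)
lam_max = np.linalg.eigvalsh(K)[-1]

# Condition number for several (hypothetical) noise / jitter levels.
results = {}
for noise_var in (1e-8, 1e-4, 1e-2):
    results[noise_var] = condition_number(K, noise_var)
    bound = (lam_max + noise_var) / noise_var
    print(f"sigma^2={noise_var:g}  cond={results[noise_var]:.3e}  bound={bound:.3e}")
```

Because the duplicated input makes K singular, the smallest eigenvalue of K + \sigma^2 I equals \sigma^2 up to rounding, so the computed condition number coincides with the bound. The noise or jitter level alone therefore controls the worst case, without any separate condition-number threshold.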