NeurIPS 2020

Logarithmic Regret Bound in Partially Observable Linear Dynamical Systems


Review 1

Summary and Contributions: Update: Thanks for the thoughtful response. To clarify, while PE is a common assumption in the classical control literature, it is not common in more recent nonasymptotic work. If one were to assume PE in the state-feedback setting, then injecting noise would not be necessary and better regret could be achieved -- but lower bounds tell us that this is not the case. So justifying the applicability of the assumption in this output-feedback setting is crucial, and I'm happy to hear that it ends up being a mild assumption satisfied by well-known controllers. --- This paper presents a method, AdaptOn, for adaptive output-feedback control of linear systems with time-varying convex loss functions. AdaptOn estimates the dynamics each epoch and then runs online gradient descent. The authors show that this method achieves regret polylogarithmic in T for controllers satisfying a persistence of excitation condition.
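
For concreteness, here is a minimal sketch of the epoch-based "estimate, then run online gradient descent" loop described above. This is my own illustrative paraphrase, not the paper's algorithm: it uses full state observation instead of output feedback, a placeholder quadratic cost, a crude gradient step, and made-up constants, and it doubles epoch lengths only as an example schedule.

import numpy as np

# Illustrative sketch only: epoch-wise least-squares estimation interleaved with
# a gradient step on the controller. Full state observation is used for
# simplicity, unlike the output-feedback setting of the paper.

rng = np.random.default_rng(0)
n, m, T = 2, 1, 2048                            # state dim, input dim, horizon
A_true = np.array([[0.8, 0.1], [0.0, 0.7]])      # unknown (stable) dynamics
B_true = np.array([[0.0], [1.0]])

K = np.zeros((m, n))                             # controller parameter updated online
x = np.zeros(n)
regressors, targets = [], []
lr, epoch_len, t = 0.05, 64, 0

while t < T:
    # Run one epoch with the current controller; the injected input noise is a
    # stand-in for the excitation that the paper's assumption provides.
    for _ in range(min(epoch_len, T - t)):
        u = K @ x + 0.1 * rng.standard_normal(m)
        x_next = A_true @ x + B_true @ u + 0.05 * rng.standard_normal(n)
        regressors.append(np.concatenate([x, u]))
        targets.append(x_next)

        # Crude online gradient step on the input-cost term |u|^2 of a quadratic
        # stage cost, standing in for the paper's OCO-with-memory update.
        K -= lr * 2 * np.outer(u, x)
        x = x_next
        t += 1

    # Re-estimate the dynamics by least squares on all data collected so far.
    Z, Y = np.array(regressors), np.array(targets)
    theta, *_ = np.linalg.lstsq(Z, Y, rcond=None)
    A_hat, B_hat = theta.T[:, :n], theta.T[:, n:]
    epoch_len *= 2                               # doubling epoch lengths

print("parameter error:",
      np.linalg.norm(A_hat - A_true) + np.linalg.norm(B_hat - B_true))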

Strengths: This work appears to be theoretically sound. The combination of online gradient descent with model estimation is novel, as is the analysis. The main result is significant, showing that polylog(T) regret can be achieved for partially observed systems. This is somewhat surprising, since upper and lower bounds of order sqrt(T) have been shown in the case where states are fully observed. The analysis includes an interesting argument about persistence of excitation in output-feedback problems. This perspective highlights that the presence of measurement noise induces a degree of exploration.

Weaknesses: My main reservation with this paper is that it restricts controller synthesis to the class of persistently exciting controllers. This class is crucial to the proposed algorithm, since Corollary 8.1 shows that only a T^2/3 regret bound would follow from these techniques otherwise. However, the condition is not sufficiently explained or motivated in the body of the paper. Is this a natural set of control policies to consider? If it is not, then the regret baseline is less meaningful, and the results are less impressive. It is unclear whether partially observed linear systems admit polylogarithmic regret due to an inherent difference from state observation, or due to artificial restrictions imposed by the persistence of excitation condition on control policies.

Correctness: As far as I can tell, the results seem to be correct.

Clarity: The paper is generally clear and has a well-organized appendix.

Relation to Prior Work: Overall, the paper is well situated with respect to related work. However, a longer and more explicit discussion of the relation to the upper and lower bounds presented in [46] is warranted.

Reproducibility: Yes

Additional Feedback:


Review 2

Summary and Contributions: This work establishes a logarithmic regret bound for the task of controlling an unknown latent-state linear dynamical system subject to strongly convex costs, under some conditions. To obtain logarithmic regret, the typical scheme is to ensure that the information obtained during exploitation is enough to drive down the parameter estimation error. Mania et al. established that an "E" error in the parameters translates to an "E^2" increase in instantaneous cost (as did Simchowitz et al. for strongly convex costs). Therefore, establishing that "E" scales as T^-0.5 suffices. The latter requirement is met for least squares as long as the covariance of the covariates is lower bounded. This property, which holds under appropriate assumptions, is the central observation here. It should be noted that the assumption of persistent excitation here rules out the applicability of the result to LQRs, where T^0.5 regret is optimal (Simchowitz, Foster).
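
Spelling out that chain of reasoning (my paraphrase, not the paper's statement): if persistence of excitation keeps the covariate covariance bounded below, least squares yields a parameter error E_t = O(t^{-1/2}) after t samples, and an error of E costs only O(E^2) extra per step, so

\[
  \mathrm{Regret}(T) \;\lesssim\; \sum_{t=1}^{T} E_t^2
  \;\lesssim\; \sum_{t=1}^{T} \frac{1}{t}
  \;=\; O(\log T).
\]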

Strengths: + This work offers the first log-regret guarantee for learning partially observable systems. + The closed-loop sys-id procedure, while an adaptation of known sys-id procedures, completes an important part of the picture.

Weaknesses: + While persistent excitation is necessary for logarithmic regret, ideally the algorithm should default to a T^0.5 regret bound in its absence, while offering log regret otherwise. Can the authors say whether any modifications of the present approach permit this? + Previous works in this space often did not assume stability of the system, but instead operated with stabilizable systems. Note that superimposing a stabilizing controller on a stabilizable system is not equivalent to having a stable system, because the cost must then account for the stabilizing controller's actions along with the suggested actions. Past works have sometimes put considerable effort into this aspect. Its absence here limits the applicability. (It is also not apparent whether the results extend.)

Correctness: While details were not checked, the claims made look correct.

Clarity: The paper is generally well written.

Relation to Prior Work: To the reviewer's knowledge, the best known bounds (granted, under fewer assumptions) for this setup scale as T^0.5.

Reproducibility: No

Additional Feedback: Thanks for the response -- the reviewer opts to retain their score.


Review 3

Summary and Contributions: This paper studies the online learning problem for the linear quadratic Gaussian (LQG) model. The proposed algorithm is a combination of System Identification (sys-id) and online convex optimization (OCO) with memory. The proposed algorithm is shown to achieve a polylog(T) regret.

Strengths: This appears to be the first work that achieves polylog(T) regret for the LQG model. LQG is a classical model in the control literature.

Weaknesses: 1. Although the authors claim the algorithm is an RL algorithm, it is unclear to me how it can be applied to general reinforcement learning problems. The algorithm seems to rely heavily on the linear dynamical system structure. If the algorithm and theory are limited to LQG, the work would be less relevant to a machine learning conference. 2. The proof technique seems to depend heavily on [Simchowitz et al 2020]. The improvement is that the authors propose a new system identification approach and a doubling trick.

Correctness: The theory seems to be correct. It would be nice to have numerical experiments to validate the polylog(T) regret.

Clarity: This paper is overall well written. It would be better if the following two points could be made clearer: 1. How does studying LQG help with learning POMDPs, and can the proposed algorithm be applied to a POMDP? 2. How do the theory and proof differ from [Simchowitz et al 2020]?

Relation to Prior Work: The paper discusses prior works. However, it would be nice if the most relevant paper, [Simchowitz et al 2020], could be compared in more depth.

Reproducibility: Yes

Additional Feedback: Update after reading the author response: I agree that LQG is a fundamental problem in control theory and that establishing a polylog(T) regret bound for this problem is significant. But I don't think the proposed algorithm can be modified for POMDPs. Thus, it seems a bit oversold to say that solving LQG is a first step toward solving POMDPs. As pointed out by another reviewer, it would be nice to highlight that the system is assumed stable. Also, I would recommend that the authors highlight the technical novelty relative to [Simchowitz et al 2020] in the revision. Numerical experiments would also be helpful to validate the polylog(T) regret.