__ Summary and Contributions__: This paper investigates unsupervised learning from a continuous stream of visual input, with a visual selective attention mechanism that allows saccade-like dynamics. The method attempts to approximately maximize mutual information between the pixel-wise predictions of a deep network and the visual input stream. Experimental results show that focal attention reminiscent of human saccade choices improves mutual information between the visual stream and the network's output.

__ Strengths__: This paper addresses an important problem: how can a system learn useful representations truly online, from a long sequence of video input?
The experiments show that the proposed method achieves higher mutual information with the visual input stream compared to the more naive models considered.
------Update after rebuttal------
The theoretical framework could be useful to the community, but the empirical evaluation still seems anecdotal and hard to place in the broader context of work in the area.

__ Weaknesses__: The evaluation shows that the proposed attention mechanism improves the mutual information metric defined in the paper, but this is not linked back to a clear functional benefit. This makes the significance of the MI improvements hard to interpret. Are the resulting features better able to perform some task of interest? Demonstrating this would strengthen the paper.
The paper introduces fourth-order dynamics, and spends considerable time simplifying these to second-order dynamics. It is unclear what this adds to the presentation; it may be more straightforward to simply state the implemented model directly.
The experiments chose hyperparameters to maximize the mutual information extracted by each algorithm, and it is not clear from the text whether this optimization was performed on a validation set or not. It seems possible therefore that some of the results could be due to overfitting the hyperparameters to the test dataset. The paper could be improved with a more complete discussion of hyperparameter selection.
The evaluation videos could be better motivated. They are quite specific (exactly three videos), and the paper would be improved by discussing the significance of looping the videos. How many loops were necessary to generate 105k frames for the Carpark/Call streams, for instance? It is hard to tell from these specific examples how robust the results are.

__ Correctness__: The theoretical framework as initially introduced requires many approximations and simplifications in practice, but these are relatively well-motivated. The experiments appear to be correct, and admirably, full code is provided.

__ Clarity__: I found the paper somewhat difficult to follow. I think the presentation of the fourth-order dynamics could be considerably shortened, with the exposition focusing on exactly what algorithm was implemented in the experiments.

__ Relation to Prior Work__: Prior work is clearly discussed, and this paper's contributions are clear.

__ Reproducibility__: Yes

__ Additional Feedback__:

__ Summary and Contributions__: The paper presents a detailed mathematical description of a method for optimizing information transfer in an attentive visual system, based on the least action principle.

__ Strengths__: (see the Weaknesses section for further context) I think the main contribution of the paper is the very detailed and novel mathematical specification and solution of the variational learning problem (resulting in what the paper calls the "Cognitive Action Laws").

__ Weaknesses__: The approach is well-founded and the derivations are solid, but the main problem with the paper is that the general approach and motivating ideas are very similar to the free energy principle ideas of Karl Friston and his co-workers. The authors need to contrast their work with that of Friston and point out the novel contributions that are being made.
In the experiments section, a comparison should be made to the input/output mutual information at the fixation/attention locations generated by other state-of-the-art video attention algorithms, such as the one in:
Cornia, M., Baraldi, L., Serra, G., & Cucchiara, R. (2018). Predicting human eye fixations via an LSTM-based saliency attentive model. IEEE Transactions on Image Processing, 27(10), 5142-5154.

__ Correctness__: Yes

__ Clarity__: Yes, the paper is clear, although there are many grammatical errors. These errors do not affect the readability of the paper, however.

__ Relation to Prior Work__: I am surprised to see that a paper proposing a method based on the principle of least action does not cite the work of Karl Friston, who is the most well-known proponent of this idea as it relates to perception and cognition.
Friston, Karl. "The free-energy principle: a unified brain theory?." Nature reviews neuroscience 11.2 (2010): 127-138.
Friston, Karl. "A Free Energy Principle for Biological Systems." Entropy 14.11 (2012): 2100-2121. doi:10.3390/e14112100
(from the abstract of the latter paper: "We motivate a solution using a principle of least action based on variational free energy (from statistical physics) and establish the conditions under which it is formally equivalent to the information bottleneck method.")
Friston and co-workers have applied his ideas to modeling attention.
For example, in the following paper they state:
"We have suggested recently that perception is the inference about causes of sensory inputs and attention is the inference about the uncertainty (precision) of those causes (Friston, 2009). This places attention in the larger context of perceptual inference under uncertainty"
Feldman, H., & Friston, K. (2010). Attention, uncertainty, and free-energy. Frontiers in human neuroscience, 4, 215.
See also:
Schwartenbeck, P., FitzGerald, T., Dolan, R., & Friston, K. (2013). Exploration, novelty, surprise, and free energy minimization. Frontiers in psychology, 4, 710.
The paper should compare the proposed focus-of-attention modeling approach with prior methods employing mutual information, such as that of Bruce and Tsotsos:
Bruce, N., & Tsotsos, J. (2006). Saliency based on information maximization. In Advances in neural information processing systems (pp. 155-162).

__ Reproducibility__: Yes

__ Additional Feedback__: I like the paper mainly because it provides a different approach to the problem. Although the underlying motivating idea is similar to others, such as the Friston free energy paradigm, they have a different way of attacking the problem. So I think it is useful to others building such models. To me, the rebuttal didn't really seem to answer any of the reviewers' main questions, so I don't know if any minds will be changed here. I still think the paper is worth accepting.

__ Summary and Contributions__: The paper proposes an interesting combination of a Lagrangian (classical mechanics) formulation of temporal (streaming) learning over time with an information-theoretic view of human-like attention mechanisms. The Lagrangian approach leads to second-order ODE models, on top of which the authors propose mutual information maximization to capture attention. The learning-over-time formulation is analogous to the Lagrangian formulation of classical mechanics, with kinetic and potential energy terms that are functions of the generalized coordinates (the generalized canonical coordinates of the flow) and their first two derivatives. They add a dissipative, exponentially decaying term, which allows for loss of energy in the dynamics and hence for non-conservative flows. The learning algorithm tries to estimate the optimal potential function, which can be time dependent, together with a few discrete parameters of the kinetic term. On top of that formalism, they add an information or entropy measure which is a functional of the probability density function of these trajectories (something that in principle requires solving a Fokker-Planck-like equation). Assuming that this can be done, they use this spatio-temporal density to calculate the mutual information between different times along the dynamics of this flow.
This rather involved formalism is applied to several videos by fitting a time-dependent potential to the dynamics of the pixels in each video and then estimating the attention (salient) points in each scene. They also relate and compare it to the neuroscience of the retina.
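For concreteness, the dissipative structure described above is reminiscent of the classic Caldirola-Kanai construction, in which an exponential weight on an otherwise standard Lagrangian produces a damping term in the resulting equations of motion. The notation below is mine, not the paper's, and is only a sketch of the general mechanism:

```latex
% Exponentially weighted Lagrangian (Caldirola-Kanai form),
% with dissipation rate \theta and time-dependent potential U
\mathcal{L}(q, \dot q, t) = e^{\theta t}\left(\tfrac{1}{2}\, m\, \dot q^{2} - U(q, t)\right)

% The Euler-Lagrange equation: the weight e^{\theta t} factors out,
% leaving a damped second-order ODE (a non-conservative flow)
m\, \ddot q + \theta\, m\, \dot q + \frac{\partial U}{\partial q}(q, t) = 0
```

This makes plain why an exponentially decaying factor suffices to model loss of energy without leaving the variational framework.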

__ Strengths__: The paper is inspiring in its physics-like approach to several hard problems: trajectory learning in a Lagrangian framework, flow density estimation, and the changes of mutual information along the flow and their relationship to attention mechanisms. If the algorithms involved are convincingly controlled, this could be a very interesting contribution, though not so much to the NIPS audience, as the emphasis is on the physics-like formulation.

__ Weaknesses__: This is a very ambitious paper that tries to combine several novel and computationally difficult tasks in one framework and apply them to real data, with claims of relevance to neuroscience. The main problem is that these difficult tasks (learning potentials for the classical mechanics of trajectories, ensemble density estimation of these trajectories, mutual information estimation for this ensemble, and then extracting the attention points from these estimates) are too much for one short paper. The paper did not convince me that any of these difficult problems is satisfactorily solved algorithmically here; their combination on real data gives interesting comparative values on these datasets, but looks anecdotal and inconclusive to me. The paper completely ignores other algorithms for learning temporal flows, such as RL or deep RL, and does not specify the algorithmic issues well enough. It is not clear how to compare this approach to any standard method, or what the real benefit of the proposed measure of attention estimation is for cognitive processing in general.

__ Correctness__: The main theorem (Thm 1) and its proof in the supplementary material seem correct, but no learning algorithms are given in the paper. I could not understand how exactly the trajectory density estimation, on which the mutual information estimation is based, is actually calculated. Solving a Fokker-Planck-like equation in high dimensions is notoriously difficult without some parametric assumptions. What is actually assumed about the unknown potential U? How is it learned from the data? These are crucial details that are missing from this short paper, and they make the result difficult to evaluate or reproduce.
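To illustrate why the missing estimation details matter: even the simplest non-parametric route, a plug-in (histogram) estimate of mutual information between two scalar streams, already hinges on binning choices that are unstated in the paper, and it degrades badly in high dimensions. The sketch below is my own illustrative baseline, not the paper's method; the function name and the choice of 16 bins are arbitrary:

```python
import numpy as np

def mutual_information(x, y, bins=16):
    """Plug-in (2-D histogram) estimate of I(X;Y) in nats for scalar samples."""
    # Joint histogram, normalized to a joint probability mass function
    pxy, _, _ = np.histogram2d(x, y, bins=bins)
    pxy = pxy / pxy.sum()
    # Marginals; the outer product px @ py is the independent joint
    px = pxy.sum(axis=1, keepdims=True)
    py = pxy.sum(axis=0, keepdims=True)
    nz = pxy > 0  # avoid log(0); zero-mass cells contribute nothing
    return float((pxy[nz] * np.log(pxy[nz] / (px @ py)[nz])).sum())

rng = np.random.default_rng(0)
x = rng.normal(size=10_000)
y_dep = x + 0.1 * rng.normal(size=10_000)  # strongly dependent on x
y_ind = rng.normal(size=10_000)            # independent of x

# A dependent pair should yield much higher estimated MI than an
# independent one, whose estimate is only a small positive bias term.
assert mutual_information(x, y_dep) > mutual_information(x, y_ind)
```

Note that even this toy estimator has a positive bias for independent streams that grows with the number of bins, which is exactly the kind of detail a reader would need to interpret the paper's reported MI values.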

__ Clarity__: The paper is clearly written and can be understood by people familiar with the concepts of Lagrangian dynamics, entropy and mutual information, attention mechanisms, and this physics-style mathematics. Otherwise, the paper is inaccessible.

__ Relation to Prior Work__: The references are limited to this very specific line of research and its several threads. I missed comparisons to other dynamic learning methods and algorithms, such as deep RL, InfoRL, and their extensions. While the submission contains the video data and Python code, and I suppose the results can be reproduced by running the code on these data, it is not clear to me what happens on other data, what the main numerical results in Table 2 mean, or how to interpret them.

__ Reproducibility__: Yes

__ Additional Feedback__:

__ Summary and Contributions__: This paper deals with the problem of unsupervised statistical modelling of time-series signals and proposes a method based on mutual information maximisation. The proposed method originates from previous papers by a specific author group and is extended here to spatio-temporal signals. The validity of the proposed method is evaluated on simple video signals.

__ Strengths__: 1. This paper has a strong mathematical background that has been developed by the same author group.

__ Weaknesses__: 1. Low clarity. Honestly speaking, I could not follow the discussion in the current manuscript at all, partly due to my lack of knowledge of the previous methods presented by the same author group.
2. If my understanding is correct, the current manuscript does not contain any experimental comparisons with other previous methods related to statistical modelling of video signals.

__ Correctness__: The mathematical discussions seem to have no problems, as far as I understand them. However, there is a big logical gap between the motivation presented in the abstract and the methodology.

__ Clarity__: Clearly no. The current manuscript is not self-contained and requires extensive knowledge of the methods proposed by a specific author group. The paper contains many instances of "See ... for the detail", and the motivations and meanings of the mathematical expressions are totally skipped.

__ Relation to Prior Work__: I think this is adequate, viewed as a follow-up to the work of a specific author group.

__ Reproducibility__: No

__ Additional Feedback__: