Summary and Contributions: Edit: I'm raising my score from a 4 to a 5. The author rebuttal addressed a misconception I had: I believed CuLE was not compatible with the gym interface, but it seems this is not the case, which significantly improves the usability of the artifact. The authors introduce CuLE, a port of Atari to the GPU for improved throughput without needing many CPUs. Beyond the simulator artifact, the claimed contributions are: 1. identification of computational bottlenecks; 2. a proposed batching strategy for speeding up CuLE training; 3. an analysis of the limitations of CUDA acceleration for Atari.
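Since the rebuttal clarified that CuLE is usable through a gym-style interface, swapping it into an existing vectorized training loop should look roughly like the sketch below. This is a hypothetical stand-in, not CuLE's actual API: the `BatchedAtariEnv` class and the NumPy-backed internals are mine, meant only to illustrate the batched step pattern a GPU-resident simulator would expose (every call operates on all environments at once and returns stacked arrays).

```python
import numpy as np

class BatchedAtariEnv:
    """Toy stand-in for a GPU-resident batched Atari simulator.

    Mimics a vectorized gym-style interface: reset() and step() operate
    on all `num_envs` environments at once and return stacked arrays
    instead of single observations.
    """

    def __init__(self, num_envs, obs_shape=(84, 84), seed=0):
        self.num_envs = num_envs
        self.obs_shape = obs_shape
        self.rng = np.random.default_rng(seed)

    def reset(self):
        # One observation per environment, stacked along axis 0.
        return np.zeros((self.num_envs, *self.obs_shape), dtype=np.uint8)

    def step(self, actions):
        assert actions.shape == (self.num_envs,)
        # Placeholder dynamics: random frames, rewards, and episode ends.
        obs = self.rng.integers(0, 256, (self.num_envs, *self.obs_shape),
                                dtype=np.uint8)
        rewards = self.rng.random(self.num_envs).astype(np.float32)
        dones = self.rng.random(self.num_envs) < 0.01
        return obs, rewards, dones, {}

env = BatchedAtariEnv(num_envs=1024)
obs = env.reset()
actions = np.zeros(env.num_envs, dtype=np.int64)
obs, rewards, dones, _ = env.step(actions)
print(obs.shape)  # (1024, 84, 84)
```

The key usability point is that a trainer written against this batched interface does not care whether the frames were produced by CPU workers or rendered on the GPU.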
Strengths: The throughput numbers of CuLE are impressive, as is the engineering required to accelerate Atari with CUDA.
Weaknesses: My main concern with this work is the limited scope of impact. CuLE only accelerates Atari environments. It also seems that CuLE requires special integration with RL libraries, which would demand additional work from the authors of those libraries. (1) In terms of impact, CuLE falls short of comparable work such as SampleFactory, which appears not only to have higher performance but is also a more general-purpose system that supports CPU-hosted environments. Similarly, Accelerated Methods for Deep Reinforcement Learning does not require special-purpose CUDA implementations for the environment. (2) Performance-wise, the numbers provided by CuLE are strong, but there are no clear improvements over existing systems. For example, solving Pong in 5.9 minutes has been achieved by several RL libraries. There is also a strong focus on evaluating inference-only speed, which is quite easy to achieve in any parallel system. A strong evaluation section would compare training speed against other distributed / GPU-based RL systems. (3) The contributions beyond the CuLE artifact are limited. I did not see many novel insights in the proposed batching strategy or the analysis of algorithm bottlenecks.
Clarity: The paper is generally well organized and clearly written. However, I do think the evaluation section could be better structured to point out takeaways from the experiments.
Relation to Prior Work: Yes.
Summary and Contributions: Edit: I appreciate that the authors are planning to include additional experiments for more Atari levels. Keeping my rating the same for the stated reasons. The paper implements an Atari simulator in CUDA and demonstrates that environments can hence be run at much higher FPS using a GPU. The authors then run A2C and V-trace with large batches to obtain very fast training in wall-clock time. As the setup is entirely on GPUs, it is fairly on-policy and can employ 1024 environments in parallel.
Strengths: Porting the Atari emulator to CUDA is a strong engineering feat and can have a large impact on the community by simplifying and accelerating Atari experiments.
Weaknesses: To utilize this work, large batches seem to be required, which limits the selection of applicable algorithms. I am also wondering why only 4 Atari games were tested. Otherwise, it needs to be stated that this is more an engineering contribution than a research contribution. The meta-reviewer needs to decide whether it fits the scope of the conference.
Correctness: The empirical approach is sound.
Relation to Prior Work: Yes
Additional Feedback: I am assuming that the code that is promised to be submitted will ensure reproducibility of the results.
Summary and Contributions: ===Edit after rebuttal=== After reading the author rebuttal and discussing with other reviewers, overall I have no objection to accepting, given that we all agree this would be a great tool for the RL community. However, I am keeping my score at 6 since I still see that this particular implementation would not take effect without executing a large number of environments (I had a misunderstanding that this would require more GPUs, but the authors clarified that a single GPU is capable of executing more environments, which was helpful). The question here is: is it really necessary that we *always* run thousands of environments? As I pointed out in the initial review, sometimes a smaller number of environments is sufficient to reach a certain performance, and in that situation more environments would just be a waste of frames, in my opinion. To re-iterate my review regarding "time efficiency" vs. "data efficiency": this paper addresses the former only, which is a weakness. That said, I am okay with this as long as the authors emphasize that this work is not aiming at the latter (as also stated in the rebuttal). === The authors introduce a fully GPU-hosted Atari environment, called CuLE, and perform a thorough analysis of the improvement over CPU implementations. The analyses and comparisons to the commonly used OpenAI Gym environment are very detailed, with good insights into when one tool is better or worse than another. Open-sourcing this tool (after the review period) would make a good contribution to RL research platforms.
Strengths: The part that I like most about this paper is that the authors recognized and disentangled the different factors that make up "speed". The emulation speed, which is the part usually implemented on the CPU; moving the emulator to the GPU is the authors' main contribution here. The inference speed, which is dominated by CPU-GPU communication cost; this is what CuLE reduces. And the training speed, which is the limitation of the RL algorithm itself; this part was not addressed in the paper (i.e., the authors used existing algorithms to test CuLE). Having a clear definition of each component inside a training system is very helpful in identifying bottlenecks. As shown in the experiments, the throughput of the emulator itself can be one of the reasons why some games run slower than others. I also appreciate the "generalization for different systems" experiments, which show that CuLE will be a good tool for most commonly used RL algorithms. The comparison with existing acceleration methods in Table 1 also provides a clear view of CuLE's advantages.
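The emulation / inference / training decomposition praised above can be made concrete with a simple timing harness. The stub functions below are illustrative placeholders (not the paper's instrumentation); the point is only that wall-clock time per training iteration decomposes into these three buckets, which is exactly what makes the bottleneck analysis possible.

```python
import time
from collections import defaultdict
from contextlib import contextmanager

timers = defaultdict(float)

@contextmanager
def timed(name):
    # Accumulate elapsed wall-clock time under the given bucket name.
    start = time.perf_counter()
    yield
    timers[name] += time.perf_counter() - start

def emulate_step():   # stand-in for stepping the Atari emulator
    sum(range(1000))

def infer_actions():  # stand-in for a policy forward pass
    sum(range(1000))

def train_update():   # stand-in for a gradient update
    sum(range(1000))

for _ in range(100):
    with timed("emulation"):
        emulate_step()
    with timed("inference"):
        infer_actions()
    with timed("training"):
        train_update()

total = sum(timers.values())
for name, t in timers.items():
    print(f"{name}: {100 * t / total:.1f}% of wall-clock time")
```

Whichever bucket dominates is the component worth accelerating; in this framing, CuLE targets the first two.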
Weaknesses: My main concern is that the results seem to contradict what the authors claim as the benefit of leveraging GPU acceleration. Specifically, in the "impact statement" the authors describe CuLE as able to "provide access to an accelerated training environment to researchers with limited computational capabilities," but the results show the acceleration does not take effect unless you use more computation: in Figure 2, CuLE runs slower than OpenAI Gym when using a smaller number of environments. If someone can only afford to run 100 environments, does this mean CuLE is not useful for them? The memory limitation has been noted in the paper, which is good. I was confused when looking at Table 3. First, why is there no 120-env experiment for CuLE? Second, comparing the OpenAI columns with the CuLE columns, the message I got was "CuLE has a higher FPS but somehow performs fewer UPS, and it takes more data and a longer time to achieve the same performance as OpenAI Gym with a smaller number of envs." Take Assault as an example: if I can achieve a score of 800 using 120 OpenAI CPU envs within 5M frames, why should I bother using CuLE with 4 GPUs and consuming 18M frames? Of course, this depends on what you are trying to optimize: CuLE optimizes wall-clock time (when you have the computational resources), but when I consider speeding up RL, I often care more about learning with fewer environment interactions (i.e., frames), because collecting data is often more expensive than computation (e.g., in robotics it is impossible to let a robot repeat a movement millions of times). I hope the authors can elaborate further on this. This might be a point worth addressing in the impact statement: CuLE might create a resource imbalance among researchers; big companies can easily leverage CuLE to produce results faster and better, while small labs will be unable to do so, resulting in slower progress.
Correctness: All methods are sound.
Clarity: The paper is well written and easy to read.
Relation to Prior Work: A clear comparison with existing acceleration methods is made in Table 1 of the paper.
Additional Feedback: Some minor formatting issues:
- no need to insert a line break between section titles & text
- references  and  are duplicated
- line 127 "...TIA kernel may be scheduled *one* the GPU with more than one thread per game..." -> *on* the GPU?
- there are several undefined references to Figures and Tables inside the supplementary materials
- the impact section should be titled "broader impact" instead of "impact statement"
Overall, I think this paper presents a good open-source tool for the RL community. And as the authors discussed, CuLE can open up new directions in designing RL algorithms that fully leverage the hardware advantage (i.e., improve the "training speed"). I wonder whether the authors have a plan to extend their implementation to more environments (e.g., can you make a full wrapper around all OpenAI gym environments?), which would certainly have a great impact.
Summary and Contributions: The authors propose CuLE, a CUDA port of the Atari Learning Environment (ALE). CuLE utilizes GPU parallelization to render frames directly on the GPU, which removes the current inference bottleneck for Atari in the RL loop: CPU-GPU communication. This also enables running thousands of games simultaneously on a single GPU. Given that Atari is still one of the most popular environments for RL research, this paper can potentially be extremely high impact, assuming the authors also make it easy to use! CuLE: 1) provides access to an accelerated training environment to researchers with limited computational capabilities; 2) facilitates research in novel directions that explore thousands of agents without requiring access to a distributed system with hundreds of CPU cores.
Strengths: The paper analyzes the computational bottlenecks of current methods in a systematic way and addresses them by porting the Atari emulation to the GPU. There are great discussions on how this port could be done, and the authors clearly justify their decisions. Finally, well-designed experiments illustrate the achieved improvements with clarity.
Weaknesses: This is not a weakness of the paper per se, but a way the authors could substantially strengthen the paper for a machine learning conference: the paper provides a method for emulating Atari faster, but the authors do not demonstrate how this can be used to achieve high scores, using the same method, in a shorter amount of time. There is Figure 5, but what I have in mind is a table of scores across many games obtained by training the same model for a fixed amount of time. This would show the strong impact this paper can have in shortening the scientific experiment cycle.
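The fixed-time-budget comparison requested above could be scripted along the following lines. The `train_one_iteration` and `evaluate_score` functions are hypothetical placeholders for a real algorithm and evaluation protocol; only the budget-controlled loop is the point, since it lets every system be compared at equal wall-clock cost per game.

```python
import time

def run_for_budget(train_one_iteration, evaluate_score, budget_seconds):
    """Train until the wall-clock budget is exhausted, then report the
    final evaluation score and the number of iterations completed."""
    start = time.perf_counter()
    iterations = 0
    while time.perf_counter() - start < budget_seconds:
        train_one_iteration()
        iterations += 1
    return evaluate_score(), iterations

# Toy stand-ins: a counter whose "score" improves by 1 each iteration.
state = {"score": 0.0}

def train_one_iteration():
    state["score"] += 1.0

def evaluate_score():
    return state["score"]

score, iters = run_for_budget(train_one_iteration, evaluate_score,
                              budget_seconds=0.1)
print(score, iters)  # score equals the number of completed iterations
```

Running this loop per game and per system, then tabulating the resulting scores, would yield exactly the fixed-time table suggested in the review.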
Clarity: The paper is very well written and easy to follow. This paper is not in my direct field of expertise, and I was not familiar with some of the concepts, but I did manage to follow and understand. However, I encourage the authors to rewrite some parts of the paper to make it more "NeurIPS friendly", given that NeurIPS is not a systems conference; e.g., the reader could use a one-line description of what the TIA is. I also encourage the authors to include a table of abbreviations, given that there are so many of them, e.g., SPU, UPS, etc. The clarity of some of the graphs can be improved, e.g., the labels in Fig. 5 are really tiny.
Relation to Prior Work: The authors explored the previous deep RL methods in depth and recognized their computational bottleneck. This paper is a direct answer to this bottleneck which, in theory, should improve the majority of the previous methods.