NeurIPS 2020

Multi-agent Trajectory Prediction with Fuzzy Query Attention

Review 1

Summary and Contributions: The paper proposes a novel attention block termed ‘Fuzzy Query Attention’ (FQA) to better model social interactions between agents. The crux of the FQA block lies in learning decision variables that dictate how a human reacts to their neighbor. In particular, the model learns how to react in each yes/no (binary) scenario (e.g., a possible collision), and the decision variable indicates the probability of that scenario occurring. FQA is shown mathematically to have the potential to model important features such as proximity and approach. The experimentation is extensive, and the overall model is shown to outperform related works on multiple datasets.

Strengths: The paper proposes a novel attention module to better capture social interactions. The architecture of the module, with regard to two-agent interactions, matches the human intuition of decision making. The authors further support it by demonstrating, mathematically, two common decisions: Proximity and Approach. The experiments are extensive with regard to the datasets chosen, and they strongly indicate the effectiveness of the proposed architecture. Both quantitatively and qualitatively, the performance of the proposed model is strong. I personally like the diversity of the datasets, ranging from human walking dynamics to vehicles and physics data. The paper is very well written and was enjoyable to read.

Weaknesses: The experiments are extensive; however, I have the following three crucial questions to better understand the performance boost arising from the overall architecture: 1. Does the improvement arise from the interaction module or the motion module? Taking Social LSTM [1] as an interaction-based baseline, the proposed architecture has two different components: the interaction and motion modules. Is the boost coming from the interaction module, which is FQA in place of Social Pooling [1]? Or is it the new motion module? An ablation study showing the performance while keeping the motion module the same as the baseline would help answer this question. 2. Are the decisions actually fuzzy? The authors use the term fuzzy to describe continuous-valued decisions as opposed to their discrete-valued Boolean counterparts. It is therefore important to ask what happens within the network if the Di’s were Boolean. What is the performance drop? 3. Comparison to self-attention [2]: The authors provide an ablation study ‘Removing decision making of FQA’. It is slightly unclear what exactly the architecture is without decision making. Do the authors perform a procedure similar to self-attention (using K, Q and V) to determine Vsr? If not, it would be interesting to see a comparison with a self-attention block. [1] Social LSTM: Human Trajectory Prediction in Crowded Spaces, 2016 [2] Attention Is All You Need, 2017
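
Question 2 can be made concrete with a minimal sketch. It assumes, as the reviews suggest but do not spell out, that a decision D in [0, 1] interpolates between a "yes" response V_y and a "no" response V_n. The function names, the sigmoid parameterization, and the threshold at a logit of 0 are illustrative assumptions, not the authors' implementation.

```python
import math


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def blend(d, v_yes, v_no):
    """Interpolate element-wise: d * v_yes + (1 - d) * v_no (assumed form)."""
    return [d * y + (1.0 - d) * n for y, n in zip(v_yes, v_no)]


def fuzzy_response(logit, v_yes, v_no):
    """Continuous-valued (fuzzy) decision: d lies strictly between 0 and 1."""
    return blend(sigmoid(logit), v_yes, v_no)


def boolean_response(logit, v_yes, v_no):
    """Hard (Boolean) counterpart: d is exactly 0 or 1."""
    d = 1.0 if logit > 0.0 else 0.0
    return blend(d, v_yes, v_no)


if __name__ == "__main__":
    v_yes, v_no = [2.0, 0.0], [0.0, 4.0]
    for logit in (-3.0, -0.3, 0.3, 3.0):
        print(logit, fuzzy_response(logit, v_yes, v_no),
              boolean_response(logit, v_yes, v_no))
```

Under these assumptions the two variants agree only for large-magnitude logits. Near a logit of 0 the Boolean version jumps between V_n and V_y while the fuzzy version changes smoothly, which is the behavior the requested ablation would probe.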

Correctness: The methodology is correct.

Clarity: Yes, the paper is well written.

Relation to Prior Work: yes, it is clearly discussed

Reproducibility: Yes

Additional Feedback: - In the rebuttal, it would be great if the authors could answer the questions in the "Weaknesses" section. - It would also be good to see a comparison to non-graph-based trajectory models on ETH-UCY. - As a suggestion, maybe use the term ‘gating mechanism’ as opposed to ‘fuzzy’. POST REBUTTAL: Thank you for the clarifications. I have updated my score.

Review 2

Summary and Contributions: The authors propose Fuzzy Query Attention (FQA), a pair-wise attention mechanism for relational models. Similar to Graph Nets or Transformers, FQA captures the effects of the interaction of a sender node on a receiver node with a function of the pair. In particular, the function of choice is a "fuzzy" dot product that allows the network to interpolate between the presence and absence of an effect.

Strengths: The paper is well written and the ideas presented are appropriately placed in the context of existing work. The work is certainly relevant to the NeurIPS community: relational reasoning has received considerable attention in recent years. The ideas are novel and results are impressive and simple to understand.

Weaknesses: This is more a suggestion than a weakness: it would be nice to see how this method scales up to very large systems (e.g., the simulations in Learning to simulate complex physics with graph networks). FQA might be able to drop edges from large graphs in a context-dependent way, and might be able to do so dynamically. It would be interesting to see if it helps. Similarly, it would be interesting to look at a thresholded version of these fuzzy decisions on over-connected graphs. Do they correctly recover the true underlying structure?
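
The thresholding experiment suggested above could be sketched roughly as follows. This assumes each sender-receiver pair has been summarized by a single scalar decision in [0, 1] (for example, by averaging several decisions). That summary, the 0.5 threshold, and both function names are hypothetical choices made for illustration.

```python
def threshold_edges(decisions, threshold=0.5):
    """Keep the directed edge (s, r) when decisions[s][r] >= threshold.

    `decisions` is an n x n nested list of scalars in [0, 1];
    self-edges are ignored.
    """
    n = len(decisions)
    return {
        (s, r)
        for s in range(n)
        for r in range(n)
        if s != r and decisions[s][r] >= threshold
    }


def precision_recall(predicted, true_edges):
    """Compare recovered edges against a known ground-truth edge set."""
    true_positives = len(predicted & true_edges)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(true_edges) if true_edges else 0.0
    return precision, recall


if __name__ == "__main__":
    decisions = [
        [0.0, 0.9, 0.2],
        [0.8, 0.0, 0.6],
        [0.1, 0.4, 0.0],
    ]
    recovered = threshold_edges(decisions)
    print(sorted(recovered), precision_recall(recovered, {(0, 1), (1, 0)}))
```

Sweeping the threshold and plotting precision against recall would show how sharply the learned decisions separate true edges from spurious ones.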

Correctness: Yes

Clarity: Yes

Relation to Prior Work: Yes

Reproducibility: Yes

Additional Feedback: Thank you for sharing these cool ideas! I have two thoughts on future directions that I would like to share. 1. If I understand the math correctly, nothing limits your method to being fuzzy between only two values. In particular, you could produce a set of queries and keys with a multi-headed network, construct several decisions D_sr,1; D_sr,2; ...; D_sr,m, pass them through a softmax (as if they were logits), and form a linear combination of V_1, V_2, ..., V_m, which can be produced similarly to V_y and V_n. This would be especially useful for a mixed system (e.g., for a mass-spring chain falling on a rigid surface, you would have the mass-spring interaction as well as the mass-surface interaction). 2. This is minor: I wonder whether it would be wise to normalize V_y and V_n to have norm 1 or similar. I am worried that unbounded values might interfere with your fuzzy decisions (e.g., if \|V_y\| >> \|V_n\|, your fuzzy decision does not work as intended). I like this research direction, and I hope my feedback helps. All the best. ============ After Author Feedback =========== Thank you for taking the time to write a rebuttal, and for sharing these cool ideas. I stand by my initial assessment that this paper should be accepted. Best.
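
Both suggestions above can be illustrated with a small sketch. It treats m decision logits as softmax inputs and mixes m candidate responses, optionally rescaled to unit norm. All names (`softmax`, `unit_norm`, `multiway_response`) and the `eps` constant are assumptions introduced for illustration; nothing here is taken from the paper's code.

```python
import math


def softmax(logits):
    """Stable softmax over a list of logits."""
    shift = max(logits)
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def unit_norm(v, eps=1e-8):
    """Rescale v to (approximately) unit Euclidean norm."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / (norm + eps) for x in v]


def multiway_response(logits, values, normalize=True):
    """Mix m candidate responses V_1..V_m with softmax weights.

    With normalize=True each V_k is rescaled first, so no single
    large-norm response can dominate the mixture regardless of weights.
    """
    weights = softmax(logits)
    if normalize:
        values = [unit_norm(v) for v in values]
    dim = len(values[0])
    return [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]


if __name__ == "__main__":
    logits = [0.0, 0.0]
    values = [[300.0, 400.0], [0.0, -2.0]]  # very different norms
    print(multiway_response(logits, values, normalize=False))
    print(multiway_response(logits, values, normalize=True))
```

With m = 2, the softmax weight on the first response equals a sigmoid of the logit difference, so this form contains the two-valued fuzzy decision as a special case. The unnormalized call shows the concern in point 2: equal weights still yield an output dominated by the larger-norm response.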

Review 3

Summary and Contributions: This work presents a novel attention mechanism for modeling edge interactions in a graph. It is demonstrated by modeling physical simulations as well as real-world pedestrian and vehicle interactions.

Strengths: The experimental evaluation is quite thorough, with a good ablation study demonstrating where performance comes from.

Weaknesses: A lack of clarity in Section 3 really hinders understanding how the method works. Notably, how can the FC layers behind K_sr and Q_sr be trained if their output ends up being detached? Why are the keys and queries necessary? What do they do? If the FC layers cannot be trained, then aren't K and Q essentially just random vectors? To remedy this, an overview paragraph stating something like "At a high level, FQA performs the following ... FQA generates keys K and queries Q that represent ... The decision variables D then select the most relevant parts of K and Q to ..." would really aid a reader's understanding. This set of questions: "How can the FC layers behind K_sr and Q_sr be trained if their output ends up being detached? Why are the keys and queries necessary? What do they do? If the FC layers cannot be trained, then aren't K and Q essentially just random vectors?" are my main identified weaknesses with the work, and are the reason behind the initial review score. If these questions are answered satisfactorily, I am happy to increase my score to be above the acceptance threshold. POST-REBUTTAL: Thank you for the clarification! My questions were answered and thus I will raise my score accordingly. Good luck!

Correctness: Yes, the experimental methodology follows existing work and claims in the paper are substantiated by the experiments.

Clarity: The paper's general grammar is fine, but Section 3 was hard to follow at times and I found it difficult to parse out what exactly is going on in the model.

Relation to Prior Work: This could be done a little bit better. In particular, what concretely separates FQA from existing attention architectures? Yes, it is evident that this fuzzy combination of keys and queries is different, but it is hard to find any deeper insight as to why this is intuitively a better approach or how different it really is from common dot-product attention methods (e.g., Bahdanau attention). Such information could be put into lines 128-138, which is where I was hoping to find such a discussion.

Reproducibility: Yes

Additional Feedback: Line 114: "parallely parsed" -> "parsed in parallel"

Review 4

Summary and Contributions: The authors propose a method for multi-agent trajectory prediction. They present a relational model that describes interactions between agents with a fuzzy query attention mechanism. The method is evaluated on several datasets, such as human crowd trajectories, US freeway traffic, and NBA sports data, and is shown to be valid in different environments.

Strengths: Modeling interaction is important in trajectory prediction. The FQA module proposed in this paper looks good. Experiments are sufficient and show that the proposed FQA module has high validity. The paper is well written and the mathematical framework seems correct.

Weaknesses: There are a few points I'd like to discuss. 1. Making continuous-valued decisions, especially continuous attention between agents, is not new in trajectory prediction. However, the FQA module proposed in this paper looks new. 2. The decisions demonstrated in Section 4.3 look too simple and could easily be evolved by other methods. Overall, I think the innovation of this paper is limited.

Correctness: The method seems correct.

Clarity: Well written.

Relation to Prior Work: Clearly discussed.

Reproducibility: Yes

Additional Feedback: