The paper provides a new algorithm for Federated Learning with resource-constrained edge devices. The algorithm adapts distillation-based techniques (usually used to compress a larger model into a smaller one) into a two-way knowledge-transfer scheme that aids the learning of small local neural networks on the edge devices and of a larger global network on the server/cloud. Methodologically the paper is novel, useful, and well written. However, a few points raised by the reviewers are very pertinent and need to be discussed in the final version.

1. One key advantage and motivation for the model is stated as reduced communication, but this has not been validated empirically against FedAvg: the method has the potential for less frequent communication than FedAvg, and it would be good to report this for the experiments presented. Exchanging features rather than parameters is stated as an advantage, but I agree with R1 & R3's concern that this may not hold for now-standard networks on high-resolution images, where the per-iteration communication scales as #samples x #hidden units (or features), which could be large (a rough back-of-envelope comparison is sketched after these points). I encourage the authors to be upfront about this potential shortcoming and to argue the benefit of the small memory footprint in spite of the potentially higher communication cost.

2. Although the paper is not about privacy, sharing NN-layer activations of data points raises potential privacy concerns, and I strongly encourage the authors to include a detailed discussion of the (potentially negative) privacy implications, including in the broader impacts section.
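
To make the scaling concern in point 1 concrete, here is a minimal back-of-envelope sketch in Python. The model size, dataset size, and feature dimension below are purely illustrative assumptions, not values taken from the paper or the experiments.

# Rough per-round communication comparison: parameter exchange (FedAvg-style)
# vs. feature/activation exchange. All numbers are illustrative assumptions.

def fedavg_bytes(num_params: int, bytes_per_value: int = 4) -> int:
    """Bytes sent per round if a client uploads its full model parameters."""
    return num_params * bytes_per_value

def feature_exchange_bytes(num_samples: int, feature_dim: int,
                           bytes_per_value: int = 4) -> int:
    """Bytes sent per round if a client uploads one feature vector per sample."""
    return num_samples * feature_dim * bytes_per_value

# Assumed setting: a small edge model with 2M parameters, a local dataset of
# 50,000 images, and 512-dimensional hidden features.
params = fedavg_bytes(num_params=2_000_000)             # ~8 MB per round
features = feature_exchange_bytes(num_samples=50_000,   # ~102 MB per round
                                  feature_dim=512)

print(f"Parameter upload (FedAvg-style): {params / 1e6:.1f} MB")
print(f"Feature/activation upload      : {features / 1e6:.1f} MB")

Under these assumed numbers the feature exchange is an order of magnitude larger per round, which is why an explicit empirical comparison against FedAvg, together with an argument based on the small memory footprint rather than raw communication, would strengthen the paper.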