{"title": "Domes to Drones: Self-Supervised Active Triangulation for 3D Human Pose Reconstruction", "book": "Advances in Neural Information Processing Systems", "page_first": 3912, "page_last": 3915, "abstract": "Existing state-of-the-art estimation systems can detect 2d poses of multiple people in images quite reliably. In contrast, 3d pose estimation from a single image is ill-posed due to occlusion and depth ambiguities. Assuming access to multiple cameras, or given an active system able to position itself to observe the scene from multiple viewpoints, reconstructing 3d pose from 2d measurements becomes well-posed within the framework of standard multi-view geometry. Less clear is what is an informative set of viewpoints for accurate 3d reconstruction, particularly in complex scenes, where people are occluded by others or by scene objects. In order to address the view selection problem in a principled way, we here introduce ACTOR, an active triangulation agent for 3d human pose reconstruction. Our fully trainable agent consists of a 2d pose estimation network (any of which would work) and a deep reinforcement learning-based policy for camera viewpoint selection. The policy predicts observation viewpoints, the number of which varies adaptively depending on scene content, and the associated images are fed to an underlying pose estimator. Importantly, training the policy requires no annotations - given a 2d pose estimator, ACTOR is trained in a self-supervised manner. In extensive evaluations on complex multi-people scenes filmed in a Panoptic dome, under multiple viewpoints, we compare our active triangulation agent to strong multi-view baselines, and show that ACTOR produces significantly more accurate 3d pose reconstructions. 
We also provide a proof-of-concept experiment indicating the potential of connecting our view selection policy to a physical drone observer.", "full_text": "Domes to Drones: Self-Supervised Active\n\nTriangulation for 3D Human Pose Reconstruction\n\nSupplementary Material\n\nAleksis Pirinen1\u2217, Erik G\u00e4rtner1\u2217 and Cristian Sminchisescu1,2\n1Department of Mathematics, Faculty of Engineering, Lund University\n\n2Google Research\n\n{aleksis.pirinen, erik.gartner, cristian.sminchisescu}@math.lth.se\n\nThis supplementary material provides more insight into our ACTOR model and experimental setup. Section \u00a71\ndescribes the details of the network architecture, implementation, and hyperparameters. \u00a72 elaborates\non how we match 2d pose estimates in space and time using instance features. In \u00a73 we provide 2d\nreprojection errors onto 2d OpenPose [2] estimates on the Panoptic test splits. Finally, \u00a74 describes\nfurther dataset details.\n\n1 Model Architecture\n\nSee Fig. 1 for a full description of the ACTOR network architecture. ACTOR was implemented in\nCaffe [5] and MATLAB. We used an open-source TensorFlow implementation of OpenPose2. All\ncode and pre-trained weights have been made publicly available.3\n\nFigure 1: ACTOR policy architecture. A multi-people 2d pose estimation system (here OpenPose,\nbut any similar deep system would work) processes an input image. The deep feature map Bt\nproduced by OpenPose (conv4_4_CPM) is fed into the ACTOR policy network and is processed by\ntwo convolutional layers with ReLU-activations. The \ufb01rst and second convolutional layers both have\n3 \u00d7 3 kernels with stride 1. Their output dimensions are 8 \u00d7 39 \u00d7 21 and 4 \u00d7 18 \u00d7 9, respectively.\nThe max pooling layer has a 2 \u00d7 2 kernel with stride 2. 
The output from the second convolutional\nlayer is then concatenated with agent-centric camera rig information about previously visited cameras\nrelative to the current position (Rig), and auxiliary information such as the number of joints triangulated\nand the number of people detected in the view (Aux). The \ufb02attened and concatenated data is then fed into\nthree fully-connected layers with tanh-activations and 1024, 512 and 2 neurons, respectively. The\n\ufb01nal output is scaled by two constants to produce radial angles on the viewing sphere.\n\n1.1 Hyperparameters\n\nHyperparameter search was performed using two workstations equipped with several\nNVIDIA Titan V100 GPUs. Training a single model for 40k episodes took about 32 hours on one\nGPU; to speed up the search for optimal hyperparameters, we trained several model\ncon\ufb01gurations in parallel using Hyperdock [3].\n\n\u2217Denotes equal contribution, order determined by coin \ufb02ip.\n2https://gist.github.com/alesolano/b073d8ec9603246f766f9f15d002f4f4\n3https://github.com/ErikGartner/actor\n\n33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.\n\nThe most important parameters for training ACTOR were the learning rate, the precision of the\nvon Mises distributions (ma, me), and the annealing rate of the precision.\nSee Table 1 for a summary of the values tested for these hyperparameters. In total we trained around\n10 different versions of the \ufb01nal model with varying hyperparameters and evaluated each of them on\nthe validation set. Finally, the best model was evaluated on the test dataset and retrained with four\nadditional random seeds to measure the model\u2019s sensitivity to the random seed (the model is not very\nsensitive, as indicated in Fig. 
2, main paper).\n\nHyperparameter        Attempted values\nLearning rate         {1e-7, 5e-7, 1e-6, 5e-6}\nvon Mises precision   {(1, 10) \u2192 (25, 50), (10, 50) \u2192 (20, 100), (10, 50) \u2192 (100, 500)}\n\nTable 1: The values tested for the most important hyperparameters when training ACTOR. The \ufb01nal\nand best values are highlighted in bold. For the von Mises precisions, the arrow indicates linear\nannealing performed during training (e.g. from (ma, me) = (1, 10) to (ma, me) = (25, 50) for the\nbest con\ufb01guration).\n\n2 Matching Multiple People\n\nACTOR reconstructs multiple people in both space and time from 2d pose estimates. In order to\ntrack and match these estimates we compute instance-sensitive features. These deep features can then\nbe stably matched to each other using the Hungarian algorithm, with the L2 distance as the\nmatching cost.\nWe trained an instance classi\ufb01er structured as a siamese network [1] using a contrastive loss [4] that\naims to produce 50-dimensional features for each person that can be used to distinguish individuals.\nAs input the instance classi\ufb01er takes VGG-19 [7] features from the bounding box of the 2d pose\nestimate. The instance classi\ufb01er is trained for 40k iterations on the training set with a mini-batch\nsize of 16, where half of each batch contains positive pairs and the other half contains negative pairs. The training\nexamples are sampled randomly in both space and time, yielding a robust classi\ufb01er. Lastly, the\ninstance classi\ufb01er is \ufb01ne-tuned for 2k iterations on each scene, creating scene-speci\ufb01c versions of\nthe classi\ufb01er that are slightly adapted to the environment of those scenes. This tuning is performed\noutside the range of the active-sequence in which the agent operates.\nAt the start of an active-sequence the agent is given an appearance model for each target it should\nreconstruct. 
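The matching step described in this section can be sketched in a few lines. The snippet below is an illustrative reimplementation under stated assumptions, not the released MATLAB/Caffe code: the function names, the array-based feature representation, and the use of SciPy's Hungarian solver are ours; the squared-L2 cost and the threshold C = 0.5 follow the text.

```python
# Illustrative sketch of detection-to-person matching (assumed helper
# names; cost and threshold follow the paper's description).
import numpy as np
from scipy.optimize import linear_sum_assignment

def appearance_model(instance_feats):
    '''Element-wise median of K instance features for one person.'''
    return np.median(instance_feats, axis=0)

def match_detections(det_feats, models, C=0.5):
    '''Match detected 2d poses to per-person appearance models.

    det_feats: (J, D) array of instance features, one row per detection.
    models:    (L, D) array, one appearance model m_l per tracked person.
    Returns {detection_index: person_index} for Hungarian assignments
    whose squared L2 cost c_{j,l} = ||u_j - m_l||^2 is at most C.
    '''
    diff = det_feats[:, None, :] - models[None, :, :]
    cost = np.sum(diff ** 2, axis=-1)          # cost matrix c[j, l]
    rows, cols = linear_sum_assignment(cost)   # Hungarian algorithm
    return {j: l for j, l in zip(rows, cols) if cost[j, l] <= C}
```

The threshold filter discards spurious assignments arising from false detections or people not visible in the current view; `linear_sum_assignment` handles rectangular cost matrices, so extra detections simply remain unmatched.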
These appearance models are computed from K different instance features per target in the scene,\ntaken from time-freezes that are not part of the current active-sequence. We\ndenote the i-th instance feature for the l-th person by u_i^l, with i = 1, . . . , K. In practice we use\nK = 10. The appearance model m^l is then set to the element-wise median\n\nm^l = median(u_1^l, . . . , u_K^l).    (1)\n\nFor each camera location we compute the distance between the instance features of each detected\nperson and all appearance models in that scene. This gives us a cost matrix whose elements are\nc_{j,l} = ||u_j - m^l||_2^2, i.e., the cost of matching detection j to person l. Given this matrix we assign\ndetections according to the Hungarian algorithm. Since there might be false detections by the 2d pose\nestimator, and since not all people are visible from every camera location, we \ufb01lter out matches with a cost\nlarger than a threshold C = 0.5, keeping only matches with c_{j,l} \u2264 C.\nIf a person is never detected in an active-view, and if it does not have a previous temporal backup to\nuse as 3d pose reconstruction (cf. \u00a73 and the implementation details in \u00a74 of the main paper), we set\neach joint estimate to the ground-truth center hip location. Obviously, this estimate is implausible and\nhighly inaccurate \u2013 it is used only to compute average errors (not including such an estimate when\ncomputing average errors would be another option, but this would not penalize missed persons).\n\n3 Reprojection Errors onto OpenPose 2d Estimates\n\nThe 3d ground-truth in Panoptic is generated from exhaustive triangulation of 2d pose estimates [6],\nbut those 2d pose estimates are not from OpenPose. Thus it is relevant to also look at reprojection\nerrors onto the OpenPose 2d estimates, since these errors are not affected by any potential incorrect\nbias in the 3d ground-truth. Such reprojection errors are shown in Fig. 2. We note that ACTOR is\nmore accurate relative to the oracle in this metric. For single-people data the agent even converges\nclose to the oracle, while the oracle is still slightly better for multi-people data, owing to the more dif\ufb01cult\nnature of such scenes with occlusions. ACTOR yields lower reprojection errors than the heuristic baselines, with an\nexception at 2 cameras for multi-people data, where Max-Azim is more accurate. Note however that\nACTOR was not trained to produce accurate estimates at any \ufb01xed number of cameras, but rather to\nquickly triangulate all joints. Despite this, we outperform the baselines in the vast majority of cases.\n\nFigure 2: Mean 2d reprojection errors per joint relative to OpenPose 2d estimates vs the number of\ncameras on the test sets. Left: Multi-people data. Right: Single-people data. ACTOR reduces the 2d\nreprojection error faster than the heuristic baselines, particularly for multi-people data. Single-person\nscenes are easier to reconstruct, especially when using many cameras \u2013 also note that all models\nconverge close to the error of the oracle in this case.\n\n4 Additional Dataset Insights\n\nTable 2 shows the size and split of the Panoptic dataset [6] into train, validation and test sets. The data\nwas created using scripts that downsampled from 30 FPS to 2 FPS to increase the movement between\nframes.\n\n           Train     Val      Test      All\nMa\ufb01a      53,100    27,900   33,728    114,728\nUltimatum  27,960    4,340    55,825    88,125\nPose       51,079    29,672   59,288    140,039\nAll        132,139   61,912   148,841   342,892\n\nTable 2: The number of images in our dataset categorized by scene type and subset type (training,\nvalidation, and testing). 
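Downsampling from 30 FPS to 2 FPS amounts to keeping every 15th frame. A minimal sketch, where the helper name and the list-of-frames representation are ours rather than from the released dataset scripts:

```python
def downsample(frames, src_fps=30, dst_fps=2):
    '''Keep every (src_fps // dst_fps)-th frame, increasing the motion
    between consecutive kept frames. Assumes src_fps % dst_fps == 0.'''
    stride = src_fps // dst_fps  # 15 for 30 FPS -> 2 FPS
    return frames[::stride]
```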
Note that Ma\ufb01a and Ultimatum are complex multi-people scenes and that\nthey account for more than half of the dataset.\n\nReferences\n\n[1] J. Bromley, I. Guyon, Y. LeCun, E. S\u00e4ckinger, and R. Shah. Signature veri\ufb01cation using a \"siamese\" time delay neural network. In NIPS, pages 737\u2013744, 1994.\n\n[2] Z. Cao, G. Hidalgo, T. Simon, S.-E. Wei, and Y. Sheikh. OpenPose: realtime multi-person 2D pose estimation using Part Af\ufb01nity Fields. In CVPR, 2017.\n\n[3] E. G\u00e4rtner. Hyperdock. https://github.com/ErikGartner/Hyperdock, 2019.\n\n[4] R. Hadsell, S. Chopra, and Y. LeCun. Dimensionality reduction by learning an invariant mapping. In CVPR, pages 1735\u20131742. IEEE, 2006.\n\n[5] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell. Caffe: Convolutional architecture for fast feature embedding. In Proceedings of the 22nd ACM International Conference on Multimedia, pages 675\u2013678. ACM, 2014.\n\n[6] H. Joo, H. Liu, L. Tan, L. Gui, B. Nabbe, I. Matthews, T. Kanade, S. Nobuhara, and Y. Sheikh. Panoptic studio: A massively multiview system for social motion capture. In ICCV, 2015.\n\n[7] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. In ICLR, 2015.\n", "award": [], "sourceid": 2164, "authors": [{"given_name": "Aleksis", "family_name": "Pirinen", "institution": "Lund University"}, {"given_name": "Erik", "family_name": "G\u00e4rtner", "institution": "Lund University"}, {"given_name": "Cristian", "family_name": "Sminchisescu", "institution": "Google Research"}]}