{"title": "Perceiving the arrow of time in autoregressive motion", "book": "Advances in Neural Information Processing Systems", "page_first": 2306, "page_last": 2317, "abstract": "Understanding the principles of causal inference in the visual system has a long history at least since the seminal studies by Albert Michotte. Many cognitive and machine learning scientists believe that intelligent behavior requires agents to possess causal models of the world. Recent ML algorithms exploit the dependence structure of additive noise terms for inferring causal structures from observational data, e.g. to detect the direction of time series; the arrow of time. This raises the question whether the subtle asymmetries between the time directions can also be perceived by humans. Here we show that human observers can indeed discriminate forward and backward autoregressive motion with non-Gaussian additive independent noise, i.e. they appear sensitive to subtle asymmetries between the time directions. We employ a so-called frozen noise paradigm enabling us to compare human performance with four different algorithms on a trial-by-trial basis: A causal inference algorithm exploiting the dependence structure of additive noise terms, a neurally inspired network, a Bayesian ideal observer model as well as a simple heuristic. Our results suggest that all human observers use similar cues or strategies to solve the arrow of time motion discrimination task, but the human algorithm is significantly different from the three machine algorithms we compared it to. 
In fact, our simple heuristic appears most similar to our human observers.", "full_text": "Perceiving the arrow of time in autoregressive motion\n\nKristof Meding\n\nUniversity of T\u00fcbingen\n\nNeural Information Processing Group\n\nT\u00fcbingen, Germany\n\nkristof.meding@uni-tuebingen.de\n\nBernhard Sch\u00f6lkopf*\n\nMax-Planck-Institute for Intelligent Systems\n\nEmpirical Inference Department\n\nT\u00fcbingen, Germany\n\nbs@tuebingen.mpg.de\n\nDominik Janzing\n\nAmazon Research T\u00fcbingen\n\nT\u00fcbingen, Germany\n\njanzind@amazon.com\n\nFelix A. Wichmann*\nUniversity of T\u00fcbingen\n\nNeural Information Processing Group\n\nT\u00fcbingen, Germany\n\nfelix.wichmann@uni-tuebingen.de\n\n*joint senior authors\n\nAbstract\n\nUnderstanding the principles of causal inference in the visual system has a long\nhistory at least since the seminal studies by Albert Michotte. Many cognitive\nand machine learning scientists believe that intelligent behavior requires agents to\npossess causal models of the world. Recent ML algorithms exploit the dependence\nstructure of additive noise terms for inferring causal structures from observational\ndata, e.g. to detect the direction of time series; the arrow of time. This raises\nthe question whether the subtle asymmetries between the time directions can\nalso be perceived by humans. Here we show that human observers can indeed\ndiscriminate forward and backward autoregressive motion with non-Gaussian\nadditive independent noise, i.e. they appear sensitive to subtle asymmetries between\nthe time directions. We employ a so-called frozen noise paradigm enabling us to\ncompare human performance with four different algorithms on a trial-by-trial basis:\nA causal inference algorithm exploiting the dependence structure of additive noise\nterms, a neurally inspired network, a Bayesian ideal observer model as well as a\nsimple heuristic. 
Our results suggest that all human observers use similar cues or\nstrategies to solve the arrow of time motion discrimination task, but the human\nalgorithm is signi\ufb01cantly different from the three machine algorithms we compared\nit to. In fact, our simple heuristic appears most similar to our human observers.\n\n1\n\nIntroduction\n\nDiscriminative convolutional neural networks (CNNs) have produced impressive results in machine\nlearning, but certain striking failures of generalisation have been pointed out as well in terms\nof adversarial examples [1\u20133] or the recent \ufb01ndings of Geirhos and colleagues that CNNs show\nsurprisingly large generalisation errors under image degradations [4, 5]. Many cognitive and machine\nlearning scientists maintain that \ufb02exible and robust intelligent behaviour in the real world requires\nagents to possess generative or causal models of the world [6]. The importance of causality for\ncognitive science and psychology has long been recognized [7\u201316]. In visual perception, for example,\nit is fundamental to identify the causal structure in a visual scene: are objects moving or standing\nstill, are some objects causing the movement of other objects [17\u201319], are the movements caused by\n(intentional) actors or rather by forces of nature? [20, 21] On a cognitive rather than perceptual level\nprogress has been made to understand how we intuitively understand physics [22], how humans learn\n\n33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.\n\n\fcausal structures from data [10, 7, 8, 13] and on human causal inference via counterfactual reasoning\n[16, 23].\nMuch less research has explored whether the earlier, perceptual and unconscious\u2014cognitively\nimpenetrable [24]\u2014processing stages in humans possess already causal inference algorithms, see\nDanks [25] for a recent overview on the relationship between causal perception, causal inference and\ncausal reasoning. 
Rolfs et al. [26] found evidence for perceptual adaptation to causality, thus arguing\nthat the perceptual system already possesses mechanisms tuned to \u201ccausal features\u201d in the visual\ninput (but c.f. for a critique of the paper on methodological grounds [27]). More recently it was\nshown, using the continuous \ufb02ash suppression paradigm, that simple Michotte style launching-events\nenter awareness faster when they are perceived as continuous causal events, again suggesting that\nrather early, perceptual and pre-conscious processes may already be tuned to \u201ccausal features\u201d [28].\nRecently there has been considerable progress in understanding causal inference by approaching it as\na machine learning problem [29\u201332]. In the last two decades algorithms for causal inference with\ndifferent approaches have been suggested. Based on the language of graphical models and structural\nequation models, the \u201cclassical approaches\u201d infer the directed acyclic graph (DAG) formalizing\nthe causal relations from the observed conditional statistical (in)dependences subject to causal\nMarkov condition and causal faithfulness [29, 30]. After about 2004, several other approaches were\nsuggested that infer causal DAGs using properties of distributions other than conditional independence.\nThese approaches also consider DAGs that consist of two variables only (in which case conditional\nindependence testing is futile), i.e., to decide what is cause and what is effect, see chapter 4 of [32] for\nan overview. It was shown that one can still infer the structure if one is willing to place restrictions on\nthe action of the noise disturbances, speci\ufb01cally, that it is additive and independent, and that either the\nnoise is non-Gaussian or the functions are nonlinear [33\u201336]. 
These methods have also been applied\nto determine the causal direction of time series by \ufb01tting autoregressive models, i.e., by predicting\nfuture from past, and examining the noise terms [37\u201339].\nInvestigation of the arrow of time in causal learning was motivated by its role in physics [40, 41, 37],\nsince it can be shown that the time asymmetry based on the independence of noise can be explained\nby the usual thermodynamic arrow of time [42] and that recent approaches to causal inference are\nthus linked to statistical physics [43]. Pickup et al. showed that the independence of noise can be\nemployed to detect the arrow of time in real world YouTube videos, without semantic or cognitive\nknowledge about the visual world [39]. Recently it was shown that also neural networks can infer\nthe arrow of time from movies alone [44], suggesting that even low-level motion information in the\nvideo contains information about the arrow of time.\nClearly, humans can perceive the arrow of time in settings where semantic information or world\nknowledge is available. In a famous movie by the Lumi\u00e8re brothers, a wall falls over, subsequently\nshown backwards to illustrate the perceptual contrast.1 Similarly, humans can perceive the arrow of\ntime if there is a clear non-stationarity in the data, or a directionality due to a perceivable increase in\nentropy, e.g. if we observe an explosion. However, ML causality methods can also infer the arrow\nof time in cases that at \ufb01rst sight appear hard, i.e. where the marginals are the same in both motion\ndirections and the setting is stationary. For humans, in contrast, the perception of the arrow of time\nin such settings is unclear. Although it is well established that humans are sensitive to higher-order\nregularities in the spatial statistics of static natural images [45], for motion sequences or motion\ndiscrimination analogous results have not yet been established. 
It was even recently shown—at least when assessing the motion direction of random dot kinematograms (RDKs)—that humans appear sensitive only to the mean and variance of the displacement angles but insensitive to skewness and kurtosis [46]. Thus, for RDKs, and unlike in the case of static spatial structure, the human visual system appears insensitive to higher-order statistics. Causal dependency algorithms, however, in the linear case crucially rely on non-Gaussianity of additive noise, for which kurtosis and additional non-zero higher-order moments are a measure.
Thus we investigated whether the human visual system is sensitive to dependencies in the motion of a single disk. Furthermore, we investigate in depth the relationship between the abilities of different machine learning algorithms: a residual-dependence-based algorithm, a neural network, a Bayesian ideal observer and a very simple ecologically valid heuristic. We show, first, that human observers can indeed discriminate the arrow of time in autoregressive (AR) motion with non-Gaussian additive independent noise, i.e. they appear sensitive to subtle time-reflection asymmetries. Second, we show that humans are remarkably efficient in this task, requiring only a short motion sequence to identify the direction of the time series. Third, humans might use a strategy similar to the heuristic. Fourth, we show that the ideal observer algorithm and the neural network both achieve "super-human"—and quantitatively very similar—performance, but the frozen noise paradigm we employed shows that both algorithms use different cues or strategies.

1 https://www.youtube.com/watch?v=W_bB0TVTwg8

2 Methods

Here we provide the minimum information necessary to understand our experiments and results.
We refer to the supplementary material for detailed explanations and all information needed to allow all experiments to be reproduced.

2.1 The arrow of time: Causal and anti-causal time series

We constructed time series from a generative additive noise model:

x_t = 0.05 · x_{t−4} + 0.1 · x_{t−3} + 0.2 · x_{t−2} + 0.4 · x_{t−1} + ε_t

The noise ε_t is independent of all previous states x_{t−1}, x_{t−2}, .... Clearly, future states x_t, x_{t+1}, x_{t+2}, ... are dependent on ε_t since ε_t influences them (the arrow of time in this setting). This is true for all types of noise distributions for ε_t; however, the direction is not detectable for Gaussian noise in a linear time series, since a linear Gaussian time series can be modeled in the forward and backward direction with independent noise terms. For non-Gaussian noise, however, this is not true: it is not possible to fit a time series in the backward direction with independent noise terms [37]. Multiple algorithmic ways exist to detect the direction of such a time series based on this dependence structure. We describe them in section 2.3. Note that we can use the case with the Gaussian distribution for ε_t as a "sanity check" to test our psychophysical experiment as well as our algorithms: neither humans nor algorithms should be able to identify the direction with Gaussian noise.
Throughout we use time series for which the additive noise component ε_t is distributed according to ε_t ∼ sgn(Y) · |Y|^r, with Y Gaussian distributed. We choose the exponent r in the range 0.1−6. This yields noise which is either Bimodal (r < 1) or peaked Super-Gaussian (r > 1). The closer the value of r is to 1, the more Gaussian ε_t becomes. An exponent r = 1 yields Gaussian distributed noise.
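As a concrete illustration, the generative process can be simulated in a few lines. This is a minimal sketch; the function names, seed, and burn-in length are our own choices, and the rescaling of the noise to the on-screen standard deviation used in the experiment is omitted here:

```python
import numpy as np

def make_noise(n, r, rng):
    # eps_t = sgn(Y) * |Y|^r with Y standard normal: r = 1 is exactly
    # Gaussian, r < 1 yields Bimodal, r > 1 peaked Super-Gaussian noise
    y = rng.standard_normal(n)
    return np.sign(y) * np.abs(y) ** r

def ar4_series(n, r, seed=0, burn_in=100):
    # x_t = 0.05*x_{t-4} + 0.1*x_{t-3} + 0.2*x_{t-2} + 0.4*x_{t-1} + eps_t
    coeffs = (0.4, 0.2, 0.1, 0.05)  # weights on x_{t-1} ... x_{t-4}
    rng = np.random.default_rng(seed)
    eps = make_noise(n + burn_in, r, rng)
    x = np.zeros(n + burn_in)
    for t in range(4, n + burn_in):
        x[t] = sum(c * x[t - k - 1] for k, c in enumerate(coeffs)) + eps[t]
    return x[burn_in:]  # discard burn-in so the series is close to stationary
```

Reversing the returned array (`x[::-1]`) yields the corresponding anti-causal series.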
The noise parameterization with exponent r has the advantage that the non-Gaussianity of the time series can be precisely controlled with one parameter. Additionally, we choose a single smoothed Uniform distribution with tails extending to ±∞. In total 16 noise distributions were used in our experiment: seven with Super-Gaussian additive noise, seven with Bimodal additive noise, one with smoothed Uniform and one with Gaussian additive noise. All noise distributions had mean 0 and a standard deviation of 44.72 pixels on screen (1.13 cm), see appendix sec. A.1. These values ensure, in practice, that the time series is bounded to the range of possible coordinates of the monitor used in our experiment. Time series in the true time direction are in the following denoted as causal time series, and time series which are flipped along the temporal axis are denoted as anti-causal. Movies of the stimuli are presented in the supplementary material.

2.2 Psychophysical Experiment

We tested whether humans have the ability to discriminate causal from anti-causal time series in a psychophysical experiment. Observers saw a white random dot moving across the horizontal axis on a computer screen. The dot position followed a linear non-Gaussian time series with additive noise as described above. Observers had to press a button to indicate whether the moving dot belonged to the green (causal) or to the red (anti-causal) category—observers were unaware that the difference between the categories was a time-reversal; they were given a cover story about identifying harmless versus dangerous bacteria based on their motion. We hypothesized that humans, like algorithms [37], are better at classifying strongly non-Gaussian time series. Thus we began by training subjects with easily classifiable noise and made the time series progressively more difficult (making r approach 1.0).
Human observers should be able to use the same cue for different intensities of the Bimodal or Super-Gaussian noise. The discrimination task is rather difficult and we screened participants based on their performance in what we considered "easy conditions" with r = 6, 4, 2 (Super-Gaussian) and r = 0.1, 0.3, 0.5 (Bimodal). Participants had to achieve at least 67.5% in these blocks (40 trials) to be significantly different from chance level and to participate further in our experiment. Seven of our 17 naive observers failed to reach the criterion. We provide detailed information in the supplementary material (sec. A.2.2) on why we think this does not influence our overall results on human performance.
Ten naive observers participated successfully in the first experiment (6 female, 4 male, mean age = 24 yrs, std = 2.5 yrs). All subjects received monetary compensation. The observers were tested on time series with all 16 noise distributions. For Bimodal and Super-Gaussian noise observers progressed from easy to difficult noise. Each observer classified each of the 16 noise distributions 40 times, i.e. 640 trials in total per observer, and it took each observer four hours to complete the first experiment.
The first experiment assessed how well observers were able to discriminate forward and backward AR motion sequences as a function of the degree of non-Gaussianity of the additive noise, i.e. to see the arrow of time. Our second experiment aimed to investigate both human and algorithmic strategies for the detection of the arrow of time. In this experiment the noise was not sampled anew for every subject; instead, all subjects were tested on exactly the same time series: the so-called frozen noise paradigm often successfully employed in auditory psychophysics [47–49]. This experimental technique allows us to examine inter-subject or subject-algorithm correlation and consistency.
In the second experiment we only used a single noise distribution, Bimodal noise with exponent r = 0.5. The length of the motion sequence—and thus the viewing time—was reduced progressively from the initial 100 time points to finally only 2 time points (100, 50, 25, 20, 16, 12, 8, 4, 2). Participants classified 40 trials for each sequence length, yielding in total 360 trials per observer. Similar to experiment one, the task got more difficult as the experiment progressed. Four of the best observers in the previous experiment participated in this experiment (2 male, 2 female, mean age = 22.5 yrs, std = 2.3 yrs). The experiment lasted 1.5 hours per observer.

2.3 Algorithms for causal inference

One central aim of ours is to compare the abilities of humans and algorithms to detect the arrow of time. We compared the performance of our human observers to three different algorithms: first, an algorithm which directly exploits the residual dependence structure (ResDep); second, a neurally inspired network; and, third, a Bayesian ideal observer algorithm. Furthermore, a simple heuristic is tested.
The ResDep algorithm proposed by Peters et al. [37] directly exploits the dependence of the residuals ε_t on the preceding values x_{t−1}. The algorithm fits an autoregressive model to the time series and to a series flipped along the time dimension. Subsequently an independence test is performed between fitted residuals and data points. The direction is decided using the Hilbert-Schmidt Independence Criterion test: the true time direction maximizes the independence score between residuals and data points.
The second algorithm was a (simple) neurally inspired network [50, 51]. The network consisted of one convolution layer, followed by a batch normalization layer, a ReLU layer and a fully connected layer (see A.3.1 in the appendix for further details). For each noise distribution the network was trained with 30000 time series.
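The core of the ResDep procedure described above can be sketched as follows. This is our own simplified illustration, not the paper's implementation: we fit the AR model by ordinary least squares and use a biased HSIC estimator with a fixed-bandwidth Gaussian kernel, whereas the paper relies on an HSIC independence test and MATLAB's ARMA fitting:

```python
import numpy as np

def ar_fit_residuals(x, p=4):
    # OLS fit of an AR(p) model; returns residuals and the lagged values x_{t-1}
    X = np.column_stack([x[p - k - 1: len(x) - k - 1] for k in range(p)])
    y = x[p:]
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return y - X @ beta, X[:, 0]

def hsic(a, b, sigma=1.0):
    # biased estimator of the Hilbert-Schmidt Independence Criterion with
    # Gaussian kernels; vanishes (in the population limit) iff a and b are independent
    n = len(a)
    def gram(v):
        d = np.subtract.outer(v, v)
        return np.exp(-d ** 2 / (2 * sigma ** 2))
    H = np.eye(n) - np.ones((n, n)) / n
    return float(np.trace(gram(a) @ H @ gram(b) @ H)) / n ** 2

def resdep_direction(x):
    # the true direction is the one whose AR residuals are more independent
    # of the preceding values (smaller dependence score)
    x = np.asarray(x, float)
    res_f, lag_f = ar_fit_residuals(x)
    res_b, lag_b = ar_fit_residuals(x[::-1])
    return "forward" if hsic(res_f, lag_f) <= hsic(res_b, lag_b) else "backward"
```

By construction, reversing the input swaps the two fits, so the decision flips for the time-reversed series.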
We used the Adam optimizer with an initial learning rate of 0.01. The network was trained for a maximum of 30 epochs. Both the ResDep algorithm and the neural network have full temporal memory since we input the full time series at the first step.
We can contrast these algorithms with one based on Bayesian statistics. In the vision literature this is often done in an ideal observer framework [52]. An ideal observer analysis is a statistical framework which provides the upper limit of performance given a set of constraints, since the ideal observer has perfect knowledge about the underlying model and its constraints.
We calculated the probability of the direction d given the data X = (x_1, x_2, ..., x_N) using Bayes' rule:

p(d|X) = p(X|d) · p(d) / p(X) = [Π_{t=1}^{N} p(x_t | x_{t−1}, x_{t−2}, ..., x_1, d)] · p(d) / p(X).

If we consider only stationary and stable time series of order 4—as in our experiments—the terms in the numerator become p(x_t | x_{t−1}, x_{t−2}, x_{t−3}, x_{t−4}) for the forward time series. This term corresponds exactly to the chosen noise distribution. We compare this expression in the forward and backward direction and choose the direction for which the corresponding probability is larger. This method is very similar to calculating the Bayes factor. See section A.3.1 in the appendix for a detailed explanation.
As a final algorithm we fitted a heuristic to the data, similar in spirit to heuristics proposed, e.g., by Stengård and Berg [53]. The heuristics were developed after we had evaluated the feedback from our observers and the analysis of the noise structure. We found two different principles for Bimodal and Super-Gaussian noise. For Super-Gaussian noise, noise values are often sampled around 0.
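Returning briefly to the ideal observer: under a flat prior p(d), the comparison above reduces to a log-likelihood ratio. A minimal sketch, with function names of our own choosing; the noise log-density follows by change of variables from ε = sgn(Y)|Y|^r, and r = 1 recovers the normal density:

```python
import numpy as np

def noise_logpdf(e, r):
    # density of eps = sgn(Y)|Y|^r with Y ~ N(0,1); change of variables gives
    # f(e) = phi(|e|^(1/r)) * (1/r) * |e|^(1/r - 1)
    a = np.abs(e) ** (1.0 / r)
    return (-0.5 * a ** 2 - 0.5 * np.log(2 * np.pi)
            - np.log(r) + (1.0 / r - 1.0) * np.log(np.abs(e)))

def loglik_forward(x, coeffs=(0.4, 0.2, 0.1, 0.05), r=0.5):
    # sum_t log p(x_t | x_{t-1}, ..., x_{t-4}) under the known AR(4) model:
    # the innovations are scored with the known noise density
    p = len(coeffs)
    X = np.column_stack([x[p - k - 1: len(x) - k - 1] for k in range(p)])
    eps = x[p:] - X @ np.array(coeffs)
    return float(noise_logpdf(eps, r).sum())

def ideal_observer(x, r=0.5):
    # flat prior over the two directions: decide by the likelihood ratio
    x = np.asarray(x, float)
    return ("forward" if loglik_forward(x, r=r) >= loglik_forward(x[::-1], r=r)
            else "backward")
```

Note that this observer assumes perfect knowledge of the AR order, the coefficients, and the noise exponent r, exactly as described for the ideal observer analysis above.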
Therefore in the forward direction, the dot often jumps around the center and rarely makes a big jump outwards. After such a big jump, the point slowly drifts back to the center. This means that in the forward direction there are big jumps to the outside, and in the backward direction there are big jumps to the center. The Bimodal condition behaves the other way around: often large values are sampled and only rarely smaller ones. We have used this observation to develop a heuristic in a few lines of code. At the maximum displacement, it is checked whether a large jump occurred before or after it. This 5-line code heuristic also works to identify the arrow of time of "real" data (EEG recordings; 60% accuracy). For details see section A.3.1 in the appendix.

Figure 1: Psychometric functions for Bimodal noise (A) and Super-Gaussian noise (B). The black dots represent the human accuracy for different exponents, pooled over all 10 subjects. The psychometric functions are fitted with cumulative Gaussian distributions. Performance gets worse towards an exponent of 1, which corresponds to the non-identifiable Gaussian noise case. The horizontal line marks the width, the scaled 75% threshold for the different fits. Whiskers show 95% Credible Intervals (CI) for the threshold.

3 Results

The following psychometric functions and Bayesian Credible Intervals (CI) were calculated with the Beta-Binomial model in Psignifit 4 [54]. Figure 1 shows the main result of experiment one. The black psychometric functions show the human data (pooled across all ten observers) and the coloured curves the results for the algorithms on the same time series seen by our human observers: ResDep in red, the neural network in yellow, the ideal observer in blue, our heuristic in light blue. Data for single observers are shown in Figure A.8.
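The jump heuristic described in section 2.3 can be sketched roughly as follows. This is our own reading of the verbal description; the paper's actual 5-line implementation is in appendix A.3.1, and the `bimodal` flag encodes the stated flip of the decision rule between the Super-Gaussian and Bimodal conditions:

```python
import numpy as np

def jump_heuristic(x, bimodal=False):
    # find the largest excursion and compare the step into it with the step out
    x = np.asarray(x, float)
    i = int(np.argmax(np.abs(x - x.mean())))
    step_in = abs(x[i] - x[i - 1]) if i > 0 else 0.0
    step_out = abs(x[i + 1] - x[i]) if i + 1 < len(x) else 0.0
    # Super-Gaussian noise: big jumps outward, slow drift back, so going
    # forward the step INTO the extreme is the large one
    forward_like = step_in >= step_out
    if bimodal:  # the Bimodal condition behaves the other way around
        forward_like = not forward_like
    return "forward" if forward_like else "backward"
```

For example, a trajectory that leaps to an extreme and then decays back (`[0, 0, 5, 4, 3, 2, 1, 0]`) is classified "forward", and its reversal "backward".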
All individual psychometric functions can be found in figures A.9 and A.10, and the thresholds with credible intervals in table 4. On the one hand, humans can indeed detect the direction of a time series for Super-Gaussian and Bimodal noise, with thresholds r = 0.67, 95% CI [0.62, 0.72] (Bimodal) and r = 1.62, 95% CI [1.45, 1.81] (Super-Gaussian). The ResDep algorithm, on the other hand, performs similarly to humans with Bimodal noise (threshold r = 0.64, 95% CI [0.61, 0.68]) and, perhaps, marginally better with Super-Gaussian noise (threshold r = 1.48, 95% CI [1.4, 1.56])². Algorithmic performance of the neural network and the ideal observer is superior to human and ResDep performance, and both algorithms show remarkably similar results. A detailed analysis of the neural network can be found in A.5. Thresholds for the exponents of the neural network are r = 0.85, 95% CI [0.82, 0.96] and r = 1.19, 95% CI [0.98, 1.24], for the ideal observer r = 0.87, 95% CI [0.83, 0.96] and r = 1.18, 95% CI [0.99, 1.25], and for the heuristic r = 0.67, 95% CI [0.61, 0.74] and r = 1.52, 95% CI [1.11, 1.88].

²The best three human observers for Super-Gaussian noise had thresholds of r = 1.32, 1.36, 1.38—at least as sensitive as ResDep.

The parameterization with exponent r is somewhat arbitrary, and we tested other distance scales (Kullback-Leibler divergence, Jensen-Shannon divergence, Kolmogorov-Smirnov distance and normalized exponents), see figure A.14.
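For reference, the cumulative-Gaussian form underlying such psychometric fits can be written as below. This is a generic sketch with a 0.5 guess rate and a small lapse rate, assuming a stimulus axis on which performance increases; the paper's actual fits use the Beta-Binomial model in Psignifit 4, which this does not reproduce:

```python
import math

def psychometric(x, mu, sigma, lapse=0.02):
    # proportion correct as a cumulative Gaussian in the stimulus level x,
    # scaled to run from the 0.5 guess rate up to 1 - lapse
    F = 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))
    return 0.5 + (0.5 - lapse) * F
```

At x = mu the function sits at the midpoint between guessing and ceiling; the threshold reported in such analyses is the level at which this curve crosses 75% correct.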
Normalized exponents yield the most similar scales.
The results for the smoothed Uniform noise were much more diverse—remember that there is only a single smoothed Uniform distribution with zero mean and the same variance as all other noise distributions we used: the average human accuracy was 50% (chance performance), for ResDep 70%, for the neural network 96%, for the ideal observer 97% and for the heuristic 75%; we discuss these results in the next section.
From Figure 1 and the block-by-block comparison in A.15 it appears as if human observers may use an internal algorithm similar to ResDep (top left panel in A.15) or the heuristic, and the neural network may have learned a strategy mimicking that of the ideal observer.
The frozen noise paradigm described above in section 2.2 and used in our experiment 2 allows us to investigate this question in a much more stringent way: all human observers and the algorithms classified exactly the same time series—they were not only drawn from the same distribution but were the very same time series. In addition, in experiment 2 we explored how human observers and algorithms cope with shorter time series (Bimodal noise, r = 0.5 fixed throughout the experiment). This, too, may offer a way to distinguish human observers and algorithms from each other.
Figure 2 shows the results for experiment 2. Plotting conventions are as in Figure 1: the black psychometric function shows human data (pooled across all four observers) and the coloured curves the results for the algorithms on exactly the same time series seen by our human observers: ResDep in red, the neural network in yellow, the ideal observer in blue, the heuristic in light blue. Individual psychometric functions of the four human observers are shown in Figure A.16. The neural network was trained exactly as in experiment one, with the exception that we shrank the size of the convolutional layer for very short time series, see A.3.1 for details.
Human observers are able to detect the direction of time series even for rather short time series, with a threshold of about 17.76, 95% CI [14.40, 22.44], time points. The results are even more impressive if we exclude observer 2—who told us after the experiment that he had not been fully attentive during the experiment: the threshold drops to 15.17 time points, 95% CI [11.51, 19.18], see figure A.17 in the appendix. In this respect, humans clearly outperform the ResDep algorithm, which requires 42.67, 95% CI [28.88, 58.86], time points for 75% correct discrimination. The neural network with a threshold of 8.13, 95% CI [-1.85, 12.52], time points and the ideal observer algorithm with a threshold of 7.73, 95% CI [-0.71, 11.16], again show similar performance and are again superior to human observers and ResDep. However, the heuristic shows a threshold very close to that of human observers: 18.07 time points, 95% CI [14.20, 46.50]. Please note, however, that the somewhat poor performance of ResDep may not (only) reflect its intrinsic inferiority but may in part be due to the difficulty of fitting short time series. ResDep relies on the ARMA method in MATLAB; ResDep is effectively guessing for time series shorter than or equal to 8 time points. Also, the ideal observer has intrinsic problems with short time series since our underlying assumptions for the approximation do not hold anymore, see A.3.1 in the supplementary material for further details.
The frozen noise method allows us to compare consistency among human observers as well as consistency between humans and algorithms. If subject 1 has for a given block an accuracy p_1 and subject 2 has for the same block an accuracy p_2, then we would expect for independent binomial observers a fraction of p_1 · p_2 + (1 − p_1) · (1 − p_2) equally answered ("consistent") trials.
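This expected-versus-observed comparison can be sketched as follows (function names are ours; the Wilson score interval used for the confidence regions is included for completeness):

```python
import math

def expected_consistency(p1, p2):
    # chance agreement rate of two independent binomial observers:
    # both correct or both wrong on a given trial
    return p1 * p2 + (1.0 - p1) * (1.0 - p2)

def observed_consistency(answers1, answers2):
    # fraction of trials on which the two observers gave the same answer
    same = sum(a == b for a, b in zip(answers1, answers2))
    return same / len(answers1)

def wilson_interval(k, n, z=1.96):
    # Wilson score interval for a binomial proportion k/n
    p = k / n
    denom = 1.0 + z ** 2 / n
    center = (p + z ** 2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1.0 - p) / n + z ** 2 / (4 * n ** 2)) / denom
    return center - half, center + half
```

An observed consistency significantly above `expected_consistency(p1, p2)` is the evidence of non-independence referred to in the text.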
This fraction of expected consistency is compared to the number of actually equally answered trials per block. If the observed proportion is significantly higher than expected, this provides evidence that subjects 1 and 2—be they two humans, two algorithms or a human and an algorithm—are not independent, which in turn indicates that they rely on similar processing strategies or at least use similar stimulus information.
Figure 3 shows this comparison for humans and algorithms, with the expected consistency on the x-axis plotted against the observed consistency on the y-axis. Comparing human observers to each other (top left panel in Figure 3), we see that humans tend to have more similarities than expected from independent observers. (The shaded ellipsoidal regions indicate the confidence regions around the null hypothesis that they are independent, given the amount of data.) The first column in Figure 3 strongly suggests that human observers use a strategy or internal algorithm independent from all four of our ML algorithms. Furthermore, the graph shows that all algorithms show only a level of consistency compatible with their being independent. (Because we can generate more data for the algorithmic comparisons, we confirmed this using many more trials, reaching the same conclusion, see Figure A.18.) Finally, we note that human observers and our heuristic show a high observed consistency.

Figure 2: Performance of humans and algorithms for time series (Bimodal 0.5) of different lengths; plot conventions are as in figure 1.

4 Conclusion

Our frozen noise paradigm shows that the ideal observer and the neural network have unique strategies. Even if we use more data points in figure A.18 we see only a small effect of similarity. One could argue that we do not find an agreement of the ideal observer and the neural network due to the intrinsic problem of the ideal observer algorithm for short time series.
But even if we redo the frozen noise paradigm using long sequences but varying the exponent—thus rendering the sequences difficult not by shortening them but by making the noise more Gaussian—we again see only a minor effect, see figure A.19. The ideal observer and the neural network use different, albeit equally successful, strategies.
Despite the fact that, on the one hand, human observers, ResDep and the heuristic and, on the other hand, the neural network and the ideal observer show roughly the same performance in experiment 1, the frozen noise paradigm in experiment 2 allows us to conclude that they actually all use independent strategies. In particular, human observers do not use a ResDep-style dependency algorithm, and neither do they use an ideal (or suboptimal, see A.3.2) Bayesian probability calculation—the latter in particular is a popular notion in visual perception and the cognitive sciences. Instead, our human observers appear to use an approach similar to our (simple) heuristic.
Another main outcome of our study is how remarkably efficient the unique strategy of the visual system is: our observers only needed 17.76, 95% CI [14.40, 22.44], time points (15.17, 95% CI [11.51, 19.18], if we exclude one somewhat poorer performing observer) for 75% correct discrimination of the forward or backward played AR motion sequences. They required fewer data samples than a successful ML algorithm for causal inference (ResDep; with the caveat regarding implementation mentioned above; a different implementation of the ResDep ideas may perform better). Performance approached that of the ideal observer, which knows the underlying statistics perfectly, i.e., the order of the AR process, the AR coefficients, and the variance and exponent of the noise of the time series. We deem it unlikely that the human observers could extract these parameters from visual input alone, let alone for the very short sequences.
Our heuristic, on the other hand, is\n\n7\n\n\fFigure 3: Human observer consistency and observer-algorithmic consistency for the frozen noise\nparadigm. The x-axis shows the expected proportion of equally answered trials under the assumption\nof independent observers or algorithms. The y-axis shows the actual observed number of equally\nanswered trials in the experiment. The shaded area shows a 95% con\ufb01dence interval calculated\nbased on the Wilson score interval [55]. Colour codes the number of time points. We used in the\nalgorithm-algorithm comparison not only time series with lengths from the experiment but also a\n\ufb01ner grid: 10-30 time points with spacing 1 and 35-100 with spacing 5. The upper number on the\nright shows the proportion of point lying above the diagonal, the lower number the p-value for the\nnull hypothesis that the same number of points are lying above as below the diagonal. Red numbers\nindicate signi\ufb01cant deviations from the null hypothesis.\nimplemented in a few lines of code, is incredibly stable and fast. The surprisingly good performance\nand the high degree of observed consistency with humans indicates that human observers may be\nusing a rather similar strategy.\nIn addition we also tried to use suboptimal algorithms [53]3. We used two approaches to make\nthe Bayesian observer and the Neural Network suboptimal. First, we \ufb01tted an additive noise term\nto the decision variable (Model 2 in Steng\u00e5rd and Berg). This corresponds to late noise in the\nvisual pathway. Second, we \ufb01tted an additive noise term to the individual time series points before\ncalculating the decision of the ideal observer. This corresponds to noise in the early visual pathway,\n\n3We would like to thank one of our reviewers for suggesting the suboptimal analyses\n\n8\n\n\fthat is uncertainty about the exact location of the disk (e.g. micro-saccadic eye movements). 
The addition of both early and late noise yields more similar fits between the algorithms and human data. However, on a trial-by-trial basis, only minor similarities are visible. See section A.3.2 in the appendix for details.
When performing demanding psychophysical experiments with human observers, there is always the question of learning: are we really reporting and interpreting stationary performance? In causal inference in cognitive science, structure learning from data is an important topic (e.g. dynamical causal learning [10]). However, in our experiments observers were able to do the motion discrimination after a few training trials. More importantly, the accuracy in the first and second half of every block was very similar. Average performance in experiment 1, pooled over all subjects and across all noise distributions, was 82% for trials 1-20 and 80% for trials 20-40; 64%/63% in experiment 2. This strongly suggests that our data are not contaminated by learning effects during our experiments.

One puzzle we are unable to resolve is why our human observers typically failed to reach above-chance performance with the smoothed Uniform distribution: performance for smoothed Uniform was at 50% across all observers. From a psychophysical point of view the smoothed Uniform condition was more difficult by experimental design: observers could not start with easy smoothed Uniform noise since there was no free parameter. On the other hand, Bimodal and smoothed Uniform distributions have a similar dependence structure, see Figure A.1. We expected that at least those observers already trained on Bimodal noise should be able to detect the direction of the smoothed Uniform time series; however, that was not the case. Only observer 10 achieved an accuracy above 65%. (The JS-Divergence of the smoothed Uniform distribution corresponds to a Bimodal exponent of 0.73.
As we can see from Figure 1, we expect around 65% performance, in line with LL's performance.) The surprising difficulty of the smoothed Uniform distribution should help constrain which strategy or algorithm was used by our human observers during our experiments.
Recent advances in causal inference have been strongly driven by human intuition about how the shape of the joint distribution indicates causal directions [56, 33]. This line of argument, together with our experimental results, suggests that many of the human abilities regarding the recognition of causal and temporal asymmetries are not yet known.
In any case, we argue that we can learn a lot about the inner workings of a cognitive system by probing it with appropriate artificial (not ecologically valid) stimuli [57, 58]. This is not to say one should only and always use simple, artificial stimuli, but there is a place for their use, particularly when studying less well-known areas, such as the human visual system's sensitivity to subtle temporal dependencies. In a predictive coding framework, for example, it would be useful to know the exact temporal statistical structure of, say, the motion of leaves and grass in the wind. An unusual motion pattern (e.g. one with the "wrong" dependencies) may signal a hidden predator behind the foliage.
Ever since Albert Michotte performed his studies, there has been the question of whether causal inference may, under certain circumstances, already be a perceptual rather than a cognitive ability [26, 28]. In our experiment observers were able to discriminate very subtle temporal asymmetries, similar to the remarkable sensitivity to higher-order spatial dependencies in patches of natural images [45]. To us this hints at an early, perceptual locus in our experiments.

Author contributions

B.S. had the initial project idea connecting causal inference with (early) visual perception. B.S. and F.W.
developed the idea of using AR motion and additive noise as a visual stimulus, with help from D.J. The concrete psychophysical experiments with all parameters were designed by K.M.; F.W. suggested the use of the "frozen noise"-inspired analyses; K.M. conducted all experiments and wrote all the code. K.M. did the statistical analyses and implemented the algorithmic observers with help from F.W. and D.J. The paper was jointly written by K.M. and F.W. with input from D.J. and B.S.

Acknowledgments

This work was supported by the German Research Foundation (DFG): SFB1233, Robust Vision (project number 276693517): Inference Principles and Neural Mechanisms, TP4 Causal inference strategies in human vision (F.W. and B.S.).

We would like to thank Frank Jäkel for invaluable intuitions about the structure of the arrow of time problem. Additionally, we thank Heiko Schütt and Bernhard Lang for discussions about Bayesian observers and Robert Geirhos for discussions about neural networks. Moreover, we are grateful to Karin Bierig and Vincent Plikat for help with data collection, and to Silke Gramer for administrative and Uli Wannek for technical support. Finally, we thank the internal reviewers at the Max-Planck Institute for Intelligent Systems and our three anonymous reviewers for constructive feedback. We are particularly indebted to reviewer #3 for the suggestion to explore suboptimal observers.

References

[1] Christian Szegedy, Wojciech Zaremba, Ilya Sutskever, Joan Bruna, Dumitru Erhan, Ian Goodfellow, and Rob Fergus. Intriguing properties of neural networks. arXiv preprint arXiv:1312.6199, 2014.

[2] Alexey Kurakin, Ian Goodfellow, and Samy Bengio. Adversarial examples in the physical world. arXiv preprint arXiv:1607.02533, 2016.

[3] Nicolas Papernot, Patrick McDaniel, Ian Goodfellow, Somesh Jha, Z. Berkay Celik, and Ananthram Swami. Practical black-box attacks against machine learning.
In Proceedings of the 2017 ACM on Asia Conference on Computer and Communications Security, pages 506-519. ACM, 2017.

[4] Robert Geirhos, Carlos R. Medina Temme, Jonas Rauber, Heiko H. Schütt, Matthias Bethge, and Felix A. Wichmann. Generalisation in humans and deep neural networks. Advances in Neural Information Processing Systems, 31:7549-7561, 2018.

[5] Robert Geirhos, Patricia Rubisch, Claudio Michaelis, Matthias Bethge, Felix A. Wichmann, and Wieland Brendel. ImageNet-trained CNNs are biased towards texture; increasing shape bias improves accuracy and robustness. International Conference on Learning Representations (ICLR), 2019.

[6] Brenden M. Lake, Tomer D. Ullman, Joshua B. Tenenbaum, and Samuel J. Gershman. Building machines that learn and think like people. Behavioral and Brain Sciences, 40:e253, 2017.

[7] Patricia W. Cheng. From covariation to causation: a causal power theory. Psychological Review, 104(2):367-405, 1997.

[8] Joshua B. Tenenbaum and Thomas L. Griffiths. Structure learning in human causal induction. In Advances in Neural Information Processing Systems, volume 13, pages 59-65, 2001.

[9] Tamar Kushnir, Alison Gopnik, Laura Schulz, and David Danks. Inferring hidden causes. In Proceedings of the Annual Meeting of the Cognitive Science Society, volume 25, 2003.

[10] David Danks, Thomas L. Griffiths, and Joshua B. Tenenbaum. Dynamical causal learning. In Advances in Neural Information Processing Systems, volume 15, pages 83-90, 2003.

[11] Mark Steyvers, Joshua B. Tenenbaum, Eric-Jan Wagenmakers, and Ben Blum. Inferring causal networks from observations and interventions. Cognitive Science, 27(3):453-489, 2003.

[12] Alison Gopnik, Clark Glymour, David M. Sobel, Laura E. Schulz, Tamar Kushnir, and David Danks. A theory of causal learning in children: causal maps and Bayes nets. Psychological Review, 111(1):3-32, 2004.

[13] Thomas L.
Griffiths and Joshua B. Tenenbaum. Structure and strength in causal induction. Cognitive Psychology, 51(4):334-384, 2005.

[14] Alison Gopnik and Laura Schulz. Causal learning: Psychology, philosophy, and computation. Oxford University Press, 2007.

[15] Noah D. Goodman, Tomer D. Ullman, and Joshua B. Tenenbaum. Learning a theory of causality. Psychological Review, 118(1):110, 2011.

[16] David A. Lagnado, Tobias Gerstenberg, and Ro'i Zultan. Causal responsibility and counterfactuals. Cognitive Science, 37(6):1036-1073, 2013.

[17] Albert Michotte. The perception of causality. Basic Books, Oxford, England, 1963.

[18] Alan M. Leslie and Stephanie Keeble. Do six-month-old infants perceive causality? Cognition, 25(3):265-288, 1987.

[19] Lance J. Rips. Causation from perception. Perspectives on Psychological Science, 6(1):77-97, 2011.

[20] Fritz Heider and Marianne Simmel. An experimental study of apparent behavior. The American Journal of Psychology, 57(2):243-259, 1944.

[21] Brian J. Scholl and Tao Gao. Perceiving animacy and intentionality: Visual processing or higher-level judgment. Social Perception: Detection and Interpretation of Animacy, Agency, and Intention, pages 197-230, 2013.

[22] James R. Kubricht, Keith J. Holyoak, and Hongjing Lu. Intuitive physics: Current research and controversies. Trends in Cognitive Sciences, 21(10):749-759, 2017.

[23] Tobias Gerstenberg, Noah D. Goodman, David A. Lagnado, and Joshua B. Tenenbaum. How, whether, why: Causal judgments as counterfactual contrasts. In Proceedings of the 37th Annual Conference of the Cognitive Science Society, pages 782-787, Austin, TX, 2015. Cognitive Science Society.

[24] Jerry A. Fodor. The Modularity of Mind. MIT Press, Cambridge, MA, 1983.

[25] David Danks. The psychology of causal perception and reasoning. In Oxford Handbook of Causation, pages 447-470.
Oxford University Press, 2010.

[26] Martin Rolfs, Michael Dambacher, and Patrick Cavanagh. Visual adaptation of the perception of causality. Current Biology, 23(3):250-254, 2013.

[27] Regan M. Gallagher and Derek H. Arnold. Comparing the aftereffects of motion and causality across visual co-ordinates. bioRxiv, 2018.

[28] Pieter Moors, Johan Wagemans, and Lee de-Wit. Causal events enter awareness faster than non-causal events. PeerJ, 5:e2932, 2017.

[29] Peter Spirtes, Clark N. Glymour, and Richard Scheines. Causation, prediction, and search. MIT Press, 2000.

[30] Judea Pearl. Causality: models, reasoning, and inference. Cambridge University Press, 2000.

[31] Daniel Malinsky and David Danks. Causal discovery algorithms: A practical guide. Philosophy Compass, 13(1):e12470, 2018.

[32] Jonas Peters, Dominik Janzing, and Bernhard Schölkopf. Elements of causal inference: foundations and learning algorithms. MIT Press, 2017.

[33] Patrik O. Hoyer, Dominik Janzing, Joris M. Mooij, Jonas Peters, and Bernhard Schölkopf. Nonlinear causal discovery with additive noise models. In Advances in Neural Information Processing Systems, volume 21, pages 689-696, 2009.

[34] Shohei Shimizu, Patrik O. Hoyer, Aapo Hyvärinen, and Antti Kerminen. A linear non-Gaussian acyclic model for causal discovery. Journal of Machine Learning Research, 7(Oct):2003-2030, 2006.

[35] Kun Zhang and Aapo Hyvärinen. On the identifiability of the post-nonlinear causal model. In Proceedings of the Twenty-Fifth Conference on Uncertainty in Artificial Intelligence, pages 647-655. AUAI Press, 2009.

[36] Jonas Peters, Joris M. Mooij, Dominik Janzing, and Bernhard Schölkopf. Causal discovery with continuous additive noise models. The Journal of Machine Learning Research, 15(Jun):2009-2053, 2014.

[37] Jonas Peters, Dominik Janzing, Arthur Gretton, and Bernhard Schölkopf.
Detecting the direction of causal time series. In Proceedings of the 26th Annual International Conference on Machine Learning, pages 801-808, 2009.

[38] Stefan Bauer, Bernhard Schölkopf, and Jonas Peters. The arrow of time in multivariate time series. In Proceedings of the 33rd International Conference on Machine Learning, volume 48, pages 2043-2051, 2016.

[39] Lyndsey C. Pickup, Zheng Pan, Donglai Wei, YiChang Shih, Changshui Zhang, Andrew Zisserman, Bernhard Schölkopf, and William T. Freeman. Seeing the arrow of time. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2035-2042, 2014.

[40] Hans Reichenbach. The direction of time. University of California Press, Berkeley, USA, 1956.

[41] Huw Price. Time's arrow & Archimedes' point: new directions for the physics of time. Oxford University Press, 1997.

[42] Dominik Janzing. On the entropy production of time series with unidirectional linearity. Journal of Statistical Physics, 138:767-779, 2010.

[43] Dominik Janzing, Rafael Chaves, and Bernhard Schölkopf. Algorithmic independence of initial condition and dynamical law in thermodynamics and causal inference. New Journal of Physics, 18:093052, 2016.

[44] Donglai Wei, Joseph Lim, Andrew Zisserman, and William T. Freeman. Learning and using the arrow of time. In IEEE Conference on Computer Vision and Pattern Recognition, 2018.

[45] Holly E. Gerhard, Felix A. Wichmann, and Matthias Bethge. How sensitive is the human visual system to the local statistics of natural images? PLOS Computational Biology, 9(1):1-15, 2013.

[46] Michael L. Waskom, Janeen Asfour, and Roozbeh Kiani. Perceptual insensitivity to higher-order statistical moments of coherent random dot motion. Journal of Vision, 18(6):9, 2018.

[47] Newman Guttman and Bela Julesz. Lower limits of auditory periodicity analysis.
The Journal of the Acoustical Society of America, 35(4):610, 1963.

[48] Trevor R. Agus, Simon J. Thorpe, and Daniel Pressnitzer. Rapid formation of robust auditory memories: insights from noise. Neuron, 66(4):610-618, 2010.

[49] Vinzenz H. Schönfelder and Felix A. Wichmann. Identification of stimulus cues in narrow-band tone-in-noise detection using sparse observer models. The Journal of the Acoustical Society of America, 134(1):447-463, 2013.

[50] Robbe L. T. Goris, Tom Putzeys, Johan Wagemans, and Felix A. Wichmann. A neural population model for visual pattern detection. Psychological Review, 120(3):472-496, 2013.

[51] Heiko H. Schütt and Felix A. Wichmann. An image-computable spatial vision model. Journal of Vision, 17(12):12, 1-35, 2017.

[52] Wilson S. Geisler. Ideal observer analysis. The Visual Neurosciences, 10(7):825-837, 2003.

[53] Elina Stengård and Ronald van den Berg. Imperfect Bayesian inference in visual perception. PLOS Computational Biology, 15(4):1-27, 2019.

[54] Heiko H. Schütt, Stefan Harmeling, Jakob H. Macke, and Felix A. Wichmann. Painfree and accurate Bayesian estimation of psychometric functions for (potentially) overdispersed data. Vision Research, 122:105-123, 2016.

[55] Edwin B. Wilson. Probable inference, the law of succession, and statistical inference. Journal of the American Statistical Association, 22(158):209-212, 1927.

[56] Xiaohai Sun, Dominik Janzing, and Bernhard Schölkopf. Causal inference by choosing graphs with most plausible Markov kernels. In Proceedings of the 9th International Symposium on Artificial Intelligence and Mathematics, pages 1-11, 2006.

[57] Nicole C. Rust and J. Anthony Movshon. In praise of artifice. Nature Neuroscience, 8(12):1647-1650, 2005.

[58] Marina Martinez-Garcia, Marcelo Bertalmío, and Jesús Malo.
In praise of artifice reloaded: Caution with natural image databases in modeling vision. Frontiers in Neuroscience, 13:8, 2019.

[59] David H. Brainard. The psychophysics toolbox. Spatial Vision, 10:433-436, 1997.

[60] Mario Kleiner, David H. Brainard, Denis Pelli, Allen Ingling, Richard Murray, and Christopher Broussard. What's new in Psychtoolbox-3. Perception, 36(14):1-16, 2007.

[61] Hirotugu Akaike. Information theory and an extension of the maximum likelihood principle. In Second International Symposium on Information Theory, pages 267-281, Tsahkadsor, Armenia, 1973.

[62] Felix A. Wichmann and N. Jeremy Hill. The psychometric function: I. Fitting, sampling, and goodness of fit. Perception & Psychophysics, 63(8):1293-1313, 2001.