{"title": "Occam's razor is insufficient to infer the preferences of irrational agents", "book": "Advances in Neural Information Processing Systems", "page_first": 5598, "page_last": 5609, "abstract": "Inverse reinforcement learning (IRL) attempts to infer human rewards or preferences from observed behavior. Since human planning systematically deviates from rationality, several approaches have been tried to account for specific human shortcomings. \nHowever, the general problem of inferring the reward function of an agent of unknown rationality has received little attention.\nUnlike the well-known ambiguity problems in IRL, this one is practically relevant but cannot be resolved by observing the agent's policy in enough environments.\nThis paper shows (1) that a No Free Lunch result implies it is impossible to uniquely decompose a policy into a planning algorithm and reward function, and (2) that even with a reasonable simplicity prior/Occam's razor on the set of decompositions, we cannot distinguish between the true decomposition and others that lead to high regret.\nTo address this, we need simple `normative' assumptions, which cannot be deduced exclusively from observations.", "full_text": "Occam\u2019s razor is insuf\ufb01cient to infer the preferences\n\nof irrational agents\n\nS\u00a8oren Mindermann \u2217* \u2020\n\nVector Institute\n\nUniversity of Toronto\n\nStuart Armstrong* \u2021\n\nFuture of Humanity Institute\n\nUniversity of Oxford\n\nsoeren.mindermann@gmail.com\n\nstuart.armstrong@philosophy.ox.ac.uk\n\nAbstract\n\nInverse reinforcement learning (IRL) attempts to infer human rewards or pref-\nerences from observed behavior. Since human planning systematically deviates\nfrom rationality, several approaches have been tried to account for speci\ufb01c human\nshortcomings. However, the general problem of inferring the reward function of an\nagent of unknown rationality has received little attention. 
Unlike the well-known ambiguity problems in IRL, this one is practically relevant but cannot be resolved by observing the agent's policy in enough environments. This paper shows (1) that a No Free Lunch result implies it is impossible to uniquely decompose a policy into a planning algorithm and reward function, and (2) that even with a reasonable simplicity prior/Occam's razor on the set of decompositions, we cannot distinguish between the true decomposition and others that lead to high regret. To address this, we need simple 'normative' assumptions, which cannot be deduced exclusively from observations.

1 Introduction

In today's reinforcement learning systems, a simple reward function is often hand-crafted, and still sometimes leads to undesired behaviors on the part of the RL agent, as the reward function is not well aligned with the operator's true goals4. As AI systems become more powerful and autonomous, these failures will become more frequent and grave as RL agents exceed human performance, operate at time-scales that forbid constant oversight, and are given increasingly complex tasks, from driving cars to planning cities to eventually evaluating policies or helping run companies. Ensuring that the agents behave in alignment with human values is known, appropriately, as the value alignment problem [Amodei et al., 2016, Hadfield-Menell et al., 2016, Russell et al., 2015, Bostrom, 2014, Leike et al., 2017].

One way of resolving this problem is to infer the correct reward function by observing human behaviour. This is known as inverse reinforcement learning (IRL) [Ng and Russell, 2000, Abbeel and Ng, 2004, Ziebart et al., 2008]. Often, learning a reward function is preferred over imitating a policy: when the agent must outperform humans, transfer to new environments, or be interpretable.
The reward function is also usually a (much) more succinct and robust task representation than the policy, especially in planning tasks [Abbeel and Ng, 2004]. Moreover, supervised learning of long-range and goal-directed behavior is often difficult without the reward function [Ratliff et al., 2006].

*Equal contribution.
†Work performed at Future of Humanity Institute.
‡Further affiliation: Machine Intelligence Research Institute, Berkeley, USA.
4See for example the game CoastRunners, where an RL agent didn't finish the course, but instead found a bug allowing it to get a high score by crashing round in circles https://blog.openai.com/faulty-reward-functions/.

32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montréal, Canada.

Usually, the reward function is inferred based on the assumption that human behavior is optimal or noisily optimal. However, it is well known that humans deviate from rationality in systematic, non-random ways [Tversky and Kahneman, 1975]. This can be due to specific biases such as time-inconsistency, loss aversion and hundreds of others, but also to limited cognitive capacity, which leads to forgetfulness, limited planning and false beliefs. This limits the use of IRL methods for tasks that humans don't find trivial.

Some IRL approaches address specific biases [Evans et al., 2015b,a], and others assume noisy rationality [Ziebart et al., 2008, Boularias et al., 2011]. But a general framework for inferring the reward function from suboptimal behavior does not exist to our knowledge.
Such a framework needs to infer two unobserved variables simultaneously: the human reward function and the human's planning algorithm5, which connects the reward function with behaviour, henceforth called a planner.

The task of observing human behaviour (or the human policy) and inferring from it the human reward function and planner will be termed decomposing the human policy. This paper will show that there is a No Free Lunch theorem in this area: it is impossible to get a unique decomposition of the human policy, and hence a unique human reward function. Indeed, any reward function is possible. Hence, if an IRL agent acts on what it believes is the human policy, the potential regret is near-maximal. This is another form of unidentifiability of the reward function, beyond the well-known ones [Ng and Russell, 2000, Amin and Singh, 2016].

The main result of this paper is that, unlike other No Free Lunch theorems, this unidentifiability does not disappear when regularising with a general simplicity prior that formalizes Occam's razor [Vitanyi and Li, 1997]. This result will be shown in two steps: first, that the simplest decompositions include degenerate ones, and second, that the most 'reasonable' decompositions according to human judgement are of high complexity.

So, although current IRL methods can perform well on many well-specified problems, they are fundamentally and philosophically incapable of establishing a 'reasonable' reward function for the human, no matter how powerful they become.
To do so, they will need to build in 'normative assumptions': key assumptions about the reward function and/or planner that cannot be deduced from observations, and that allow the algorithm to focus on good ways of decomposing the human policy. Future work will sketch out some potential normative assumptions that can be used in this area, making use of the fact that humans assess each other as irrational, and that these assessments often agree. In view of the No Free Lunch result, this shows that humans must share normative assumptions. One of these 'normative assumption' approaches is briefly illustrated in an appendix, while another appendix demonstrates how to use the planner-reward formalism to define when an agent might be manipulating or overriding human preferences. This happens when the agent pushes the human towards situations where their policy is very suboptimal according to their reward function.

2 Related Work

In the first IRL papers, Ng and Russell [2000] and Abbeel and Ng [2004] used a max-margin algorithm to find the reward function under which the observed policy most outperforms other policies. Suboptimal behavior was first addressed explicitly by Ratliff et al. [2006], who added slack variables to allow for it. This finds reward functions such that the observed policy outperforms most other policies and the biggest margin by which another policy outperforms it is minimal, i.e. the observed policy has low regret. Shiarlis et al. [2017] introduce a modern max-margin technique with an approximate planner in the optimisation.

However, the max-margin approach has mostly been replaced by max-entropy IRL [Ziebart et al., 2008]. Here, the assumption is that observed actions or trajectories are chosen with probability proportional to the exponential of their value.
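This noisy-rationality model can be sketched in a few lines; the Q-values and the inverse temperature `beta` below are illustrative assumptions, not values from the paper.

```python
import math

def boltzmann_policy(q_values, beta=1.0):
    """Return P(a | s) proportional to exp(beta * Q(s, a)).

    q_values: dict mapping action -> Q-value in the current state.
    beta: inverse temperature; beta -> infinity recovers the rational
    argmax policy, beta = 0 gives uniformly random behaviour.
    """
    # Subtract the max Q-value for numerical stability.
    m = max(q_values.values())
    exps = {a: math.exp(beta * (q - m)) for a, q in q_values.items()}
    z = sum(exps.values())
    return {a: e / z for a, e in exps.items()}

# A state where action "b" is slightly better than "a":
probs = boltzmann_policy({"a": 1.0, "b": 1.2}, beta=5.0)
# The better action is more likely, but the suboptimal one keeps
# non-zero probability, so occasional suboptimal choices are expected.
assert probs["b"] > probs["a"] > 0.0
```

Under this model every suboptimal action is attributed to random noise, which is exactly why systematically repeated mistakes mislead the inference.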
This assumes a specific suboptimal planning algorithm which is noisily rational (also known as Boltzmann-rational). Noisy rationality explains human behavior on various data sets better than full rationality does [Hula et al., 2015]. However, Evans et al. [2015b] and Evans et al. [2015a] showed that this can fail, since humans deviate from rationality in systematic, non-random ways. If noisy rationality is assumed, repeated suboptimal actions throw off the inference.

5Technically we only need to infer the human reward function, but inferring that from behaviour requires some knowledge of the planning algorithm.

Literature on inferring the reasoning capabilities of an agent is scarce. Evans et al. [2015b] and Evans et al. [2015a] use Bayesian inference to identify specific planning biases such as myopic planning and hyperbolic time-discounting, while simultaneously inferring the agent's preferences. Cundy and Filan [2018] add bias resulting from hierarchical planning. Hula et al. [2015] similarly let agents infer features of their opponent's reasoning, such as planning depth and impulsivity, in simple economic games. Recent work learns the planning algorithm under two assumptions: closeness to noisy rationality in a high-dimensional planner space, and supervised planner-learning [Anonymous, 2019].

The related ideas of meta-reasoning [Russell, 2016], computational rationality [Lewis et al., 2014] and resource rationality [Griffiths et al., 2015] may make it possible to redefine irrational behavior as rational in an 'ancestral' distribution of environments, where the agent optimises its rewards by choosing among the limited computations it is able to perform, or by jointly minimising the cost of computation and maximising reward.
This could in theory redefine many biases as computationally optimal in some distribution of environments, and provide priors on human planning algorithms. Unfortunately, doing this in practice seems extremely difficult, and it assumes that human goals are roughly the same as evolution's goals, which is certainly not the case.

3 Problem setup and background

A human will be performing a series of actions, and from these, an agent will attempt to estimate both the human's reward function and their planning algorithm.

The environment M in which the human operates is an MDP/R, a Markov Decision Process without reward function (a world-model [Hadfield-Menell et al., 2017]). An MDP/R is defined as a tuple ⟨S, A, T, ŝ⟩ consisting of a discrete state space S, a finite action space A, a fixed starting state ŝ, and a probabilistic transition function T : S × A × S → [0, 1] to the next state (also called the dynamics). At each step, the human is in a certain state s, takes a certain action a, and ends up in a new state s′ as given by T(s′ | s, a).

Let R = {R : S × A → [−1, 1]} = [−1, 1]^(S×A) be the space of candidate reward functions; a given R will map any state-action pair to a reward value in the interval [−1, 1].

Let Π be the space of deterministic, Markovian policies, so Π is the space of functions S → A. The human will be following the policy π̇ ∈ Π.

The results of this paper apply both to discounted-reward and to episodic environment settings6.

3.1 Planners and reward functions: decomposing the policy

The human has their reward function, and then follows a policy that presumably attempts to maximise it.
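For concreteness, the objects just defined can be transcribed directly into code; the two-state, two-action world below is our own toy encoding chosen purely for illustration.

```python
from itertools import product

# A toy MDP/R <S, A, T, s_hat>: states, actions, dynamics, start state.
S = ["s0", "s1"]
A = ["left", "right"]
s_hat = "s0"

def T(s_next, s, a):
    """Deterministic toy dynamics: 'right' moves to s1, 'left' to s0."""
    target = "s1" if a == "right" else "s0"
    return 1.0 if s_next == target else 0.0

# A candidate reward function R: S x A -> [-1, 1], here favouring 'right'.
def R(s, a):
    return 1.0 if a == "right" else -1.0

# A deterministic Markovian policy is just a map S -> A.
policy = {s: "right" for s in S}

# Sanity checks that the toy objects have the right shape:
# T(. | s, a) is a probability distribution, and R stays in [-1, 1].
assert all(abs(sum(T(sn, s, a) for sn in S) - 1.0) < 1e-9
           for s, a in product(S, A))
assert all(-1.0 <= R(s, a) <= 1.0 for s, a in product(S, A))
```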
Therefore there is something that bridges between the reward function and the policy: a process of greater or lesser rationality that transforms knowledge of the reward function into a plan of action. This bridge will be modeled as a planner p : R → Π, a function that takes a reward function and outputs a policy. This planner encodes all the rationality, irrationality, and biases of the human. Let P be the set of planners. The human is therefore defined by a planner-reward pair (p, R) ∈ P × R. Similarly, (p, R) with p(R) = π is a decomposition of the policy π. The task of the agent is to find a 'good' decomposition of the human policy π̇.

3.2 Compatible pairs and evidence

The agent can observe the human's behaviour and infer their policy from that. In order to simplify the problem and separate out the effect of the agent's learning, we will assume the agent has perfect knowledge of the human policy π̇ and of the environment M. At this point, the agent cannot learn anything by observing the human's actions, as it can already perfectly predict them.

A pair (p, R) is then defined to be compatible with π̇ if p(R) = π̇; such a pair is a possible candidate for decomposing the human policy into the human's planner and reward function.

6The setting is only chosen for notational convenience: it also emulates discrete POMDPs, non-Markovianness (e.g. by encoding the whole history in the state) and pseudo-random policies.

4 Irrationality-based unidentifiability

Unidentifiability of the reward is a well-known problem in IRL [Ng and Russell, 2000]. Amin and Singh [2016] categorise the problem into representational and experimental unidentifiability. The former means that adding a constant to a reward function, or multiplying it by a positive scalar, does not change what behavior is optimal.
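This invariance can be checked mechanically; a minimal sketch, with made-up reward values, in which a positive affine rescaling leaves the greedy choice unchanged:

```python
def greedy_action(reward, state, actions):
    """Pick the action maximising immediate reward in `state`."""
    return max(actions, key=lambda a: reward(state, a))

# A toy reward and a positive affine rescaling of it (values are ours).
R = lambda s, a: {"stay": 0.2, "go": 0.7}[a]
R_scaled = lambda s, a: 3.0 * R(s, a) - 0.5   # positive scale + shift

actions = ["stay", "go"]
# The preference ordering, and hence the chosen action, is unchanged.
assert greedy_action(R, "s0", actions) == "go"
assert greedy_action(R_scaled, "s0", actions) == "go"
```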
This is unproblematic, as rescaling the reward function doesn't change the preference ordering. The latter can be resolved by observing optimal policies in a whole class of MDPs which contains all possible transition dynamics. We complete this framework with a third kind of unidentifiability, which arises when we observe suboptimal agents. This kind of unidentifiability is worse, as it cannot necessarily be resolved by observing the agent in many tasks. In fact, it can lead to almost arbitrary regret.

4.1 Weak No Free Lunch: unidentifiable reward function and half-maximal regret

The results in this section show that without assumptions about the rationality of the human, all attempts to optimise their reward function are essentially futile. Everitt et al. [2017] work in a similar setting to ours: in their case, a corrupted version of the reward function is observed. The problem in our case is that a 'corrupted' version π̇ of an optimal policy π*_Ṙ is observed and used as information to optimise for the ideal reward Ṙ. A No Free Lunch result analogous to theirs applies in our case; both resemble the No Free Lunch theorems for optimisation [Wolpert and Macready, 1997].

More philosophically, this result is an instance of the well-known is-ought problem from meta-ethics. Hume [1888] argued that what ought to be (here, the human's reward function) can never be concluded from what is (here, behavior) without extra assumptions. Equivalently, the human reward function cannot be inferred from behavior without assumptions about the planning algorithm p. In probabilistic terms, the likelihood P(π | R) = Σ_{p∈P} P(π | R, p) P(p) is undefined without P(p). As shown in Section 5 and Section 5.2, even a simplicity prior on p and R will not help.

4.1.1 Unidentifiable reward functions

Firstly, we note that compatibility (p(R) = π̇) puts no restriction on R, and few restrictions on p:

Theorem 1. For all π ∈ Π and R ∈ R, there exists a p ∈ P such that p(R) = π. For all p ∈ P and π ∈ Π in the image of p, there exists an R such that p(R) = π.

Proof. Trivial proof: define the planner7 p as mapping all of R to π; then p(R) = π. The second statement is even more trivial, as π is in the image of p, so there must exist R with p(R) = π.

4.1.2 Half-maximal regret

The above shows that the reward function cannot be constrained by observation of the human, but what about the expected long-term value? Suppose that an agent is unsure what the actual human reward function is; if the agent itself is acting in an MDP/R, can it follow a policy that minimises the possible downside of its ignorance?

This is prevented by a recent No Free Lunch theorem. Being ignorant of the reward function one should maximise is equivalent to having a corrupted reward channel with arbitrary corruption. In that case, Everitt et al. [2017] demonstrated that whatever policy π the agent follows, there is an R ∈ R for which π is half as bad as the worst policy the agent could have followed. Specifically, let V^π_R(s) be the expected return of reward function R from state s, given that the agent follows policy π. If π is the optimal policy for R, then this can be written as V*_R(s). The regret of π for R at s is given by the difference:

Reg(π, R)(s) = V*_R(s) − V^π_R(s).

Then Everitt et al. [2017] demonstrate that for any π,

max_{R∈R} Reg(π, R)(s) ≥ (1/2) max_{π′∈Π, R∈R} Reg(π′, R)(s).

So for any (p, R) compatible with π̇, we cannot rule out that maximizing R leads to at least half of the worst-case regret.

7This is the 'indifferent' planner p_π of subsubsection 5.1.1.

5 Simplicity of degenerate decompositions

Like many No Free Lunch theorems, the result of the previous section is not surprising, given there are no assumptions about the planning algorithm. No Free Lunch results are generally avoided by placing a simplicity prior on the algorithm, dataset, function class or other object [Everitt et al., 2014]. This amounts to saying algorithms can benefit from regularisation. This section is dedicated to showing that, surprisingly, simplicity does not solve the No Free Lunch result.

Our simplicity measure is the minimum description length of an object, defined as its Kolmogorov complexity [Kolmogorov, 1965]: the length of the shortest program that outputs a string describing the object. This is the most general formalization of Occam's razor we know of [Vitanyi and Li, 1997]. Appendix A explores how the results extend to other measures of complexity, such as those that include computation time. We start with informal versions of our main results.

Theorem 2 (Informal simplicity theorem). Let (ṗ, Ṙ) be a 'reasonable' planner-reward pair that captures our judgements about the biases and rationality of a human with policy π̇ = ṗ(Ṙ).
Then there are degenerate planner-reward pairs, compatible with π̇, of lower complexity than (ṗ, Ṙ), and a pair (ṗ′, −Ṙ) of similar complexity to (ṗ, Ṙ), but with the opposite reward function.

There are a few issues with this theorem as it stands. Firstly, simplicity in algorithmic information theory is relative to the computer language (or, equivalently, Universal Turing Machine) L used [Ming and Vitányi, 2014, Calude, 2002], and there exist languages in which the theorem is clearly false: one could choose a degenerate language in which (ṗ, Ṙ) is encoded by the string '0', for example, and all other planner-reward pairs have extremely long encodings. What constitutes a 'reasonable' language is a long-standing open problem; see Leike et al. [2017] and Müller [2010]. For any pair of languages, complexities differ only by a constant, the amount required for one language to describe the other, but this constant can be arbitrarily large.

Nevertheless, this section will provide grounds for the following two semi-formal results:

Proposition 3. If π̇ is a human policy, and L is a 'reasonable' computer language, then there exist degenerate planner-reward pairs amongst the pairs of lowest complexity compatible with π̇.

Proposition 4.
If π̇ is a human policy, and L is a 'reasonable' computer language with (ṗ, Ṙ) a compatible planner-reward pair, then there exists a pair (ṗ′, −Ṙ) of comparable complexity to (ṗ, Ṙ), but with the opposite reward function.

The last part of Theorem 2, the fact that any 'reasonable' (ṗ, Ṙ) is expected to be of higher complexity, will be addressed in Section 6.

5.1 Simple degenerate pairs

The argument in this subsection will be that 1) the complexity of π̇ is close to a lower bound on the complexity of any pair compatible with it, and 2) degenerate decompositions are themselves close to this bound. The first statement follows because for any decomposition (p, R) compatible with π̇, the map (p, R) ↦ p(R) = π̇ is a simple one, adding little complexity. And if a compatible pair (p′, R′) can be constructed from π̇ with little extra complexity, then it too will have a complexity close to the minimal complexity of any pair compatible with π̇. Therefore we will first produce three degenerate pairs that can be simply constructed from π̇.

5.1.1 The degenerate pairs

We can define the trivial constant reward function 0, and the greedy planner p_g. The greedy planner p_g acts by taking the action that maximises the immediate reward in the current state. Thus8 p_g(R)(s) = argmax_a R(s, a). We can also define the anti-greedy planner −p_g, with −p_g(R)(s) = argmin_a R(s, a). In general, it will be useful to define the negative of a planner:

Definition 5. If p : R → Π is a planner, the planner −p is defined by −p(R) = p(−R).

For any given policy π, we can define the indifferent planner p_π, which maps any reward function to π.
We can also define the reward function R_π, so that R_π(s, a) = 1 if π(s) = a, and R_π(s, a) = 0 otherwise. The reward function −R_π is defined to be the negative of R_π. Then:

8Recall that p_g is a planner, p_g(R) is a policy, so p_g(R) can be applied to states, and p_g(R)(s) is an action.

Lemma 6. The pairs (p_π, 0), (p_g, R_π), and (−p_g, −R_π) are all compatible with π.

Proof. Since the image of p_π is π, p_π(0) = π. Now, R_π(s, a) > 0 iff π(s) = a, hence for all s:

p_g(R_π)(s) = argmax_a R_π(s, a) = π(s),

so p_g(R_π) = π. Then −p_g(−R_π) = p_g(−(−R_π)) = p_g(R_π) = π, by Definition 5.

5.1.2 Complexity of basic operations

We will look at the operations that build the degenerate planner-reward pairs from any compatible pair:

1. For any planner p, f1(p) = (p, 0) as a planner-reward pair.
2. For any reward function R, f2(R) = (p_g, R).
3. For any planner-reward pair (p, R), f3(p, R) = p(R).
4. For any planner-reward pair (p, R), f4(p, R) = (−p, −R).
5. For any policy π, f5(π) = p_π.
6. For any policy π, f6(π) = R_π.

These will be called the basic operations, and there are strong arguments that reasonable computer languages should be able to express them with short programs.
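Indeed, the degenerate objects of Lemma 6 and the negation of Definition 5 are each only a few lines in a general-purpose language. A toy sketch (the dict-based encoding of policies and rewards is our own illustrative choice, not the paper's notation):

```python
# Toy state-action space, assumed for illustration.
S = ["s0", "s1"]
A = ["x", "y"]

def p_g(R):
    """Greedy planner: in each state, pick the action maximising R."""
    return {s: max(A, key=lambda a: R[(s, a)]) for s in S}

def neg_planner(p):
    """Definition 5: (-p)(R) = p(-R)."""
    return lambda R: p({k: -v for k, v in R.items()})

def R_pi(pi):
    """Reward 1 on the policy's chosen actions, 0 elsewhere."""
    return {(s, a): 1.0 if pi[s] == a else 0.0 for s in S for a in A}

def p_pi(pi):
    """Indifferent planner: maps every reward function to pi."""
    return lambda R: pi

# Observed human policy and the three degenerate decompositions:
pi_dot = {"s0": "x", "s1": "y"}
zero_R = {(s, a): 0.0 for s in S for a in A}

assert p_pi(pi_dot)(zero_R) == pi_dot            # (p_pi, 0)
assert p_g(R_pi(pi_dot)) == pi_dot               # (p_g, R_pi)
neg_R = {k: -v for k, v in R_pi(pi_dot).items()}
assert neg_planner(p_g)(neg_R) == pi_dot         # (-p_g, -R_pi)
```

All three pairs reproduce π̇ exactly, which is the compatibility claim of Lemma 6.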
The operation f1, for instance, simply appends the flat trivial reward 0; f2 appends a planner defined by the simple9 search operator argmax; f3 applies a planner to the object, a reward function, that the planner naturally acts on; f4 is a double negation; while f5 and f6 are simply described in subsubsection 5.1.1.

From these basic operations, we can define three composite operations that map any compatible planner-reward pair to one of the degenerate pairs (the element F4 = f4 is useful for later definitions). Thus define

F = {F1 = f1 ∘ f5 ∘ f3, F2 = f2 ∘ f6 ∘ f3, F3 = f4 ∘ f2 ∘ f6 ∘ f3, F4 = f4}.

For any π̇-compatible pair (p, R) we have F1(p, R) = (p_π̇, 0), F2(p, R) = (p_g, R_π̇), and F3(p, R) = (−p_g, −R_π̇) (see the proof of Proposition 7).

Let K_L denote Kolmogorov complexity in the language L: the length of the shortest program in L that generates a particular object. We define the F-complexity of L as

max_{(p,R), Fi∈F} K_L(Fi(p, R)) − K_L(p, R).

Thus the F-complexity of L is how much the Fi can potentially increase10 the complexity of pairs. For a constant c ≥ 0, this allows us to formalise what we mean by L being a c-reasonable language for F: that the F-complexity of L is at most c.
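Kolmogorov complexity itself is uncomputable, so any concrete check must use a proxy; a common (and, as stressed above, language-dependent) stand-in is compressed description length. The string encodings below are entirely our illustrative assumptions.

```python
import zlib

def description_length(obj_repr: str) -> int:
    """Crude proxy for K_L: bytes in a compressed description."""
    return len(zlib.compress(obj_repr.encode("utf-8")))

# A policy over many states, described as a string.
pi = {f"s{i}": "a" if i % 2 == 0 else "b" for i in range(200)}

# Degenerate pair (p_pi, 0): the description only needs pi itself.
degenerate = f"planner=replay({pi}); reward=0"
# A 'reasonable' pair must additionally spell out many bias parameters
# (here fake numbers standing in for quantified biases).
biases = {f"bias_{i}": 0.01 * i for i in range(200)}
reasonable = f"planner=biased({pi}, {biases}); reward=complex"

# The degenerate description carries no information beyond pi, so under
# this proxy it is no longer than the 'reasonable' one.
assert description_length(degenerate) <= description_length(reasonable)
```

This is only an intuition pump: compression length is one particular choice of L, and the constant between two such choices can be arbitrarily large, as noted above.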
A reasonable language is a c-reasonable language for a c that we feel is intuitively low enough.

5.1.3 Low complexity of degenerate planner-reward pairs

To formalise the concepts 'of lowest complexity' and 'of comparable complexity', choose a constant c ≥ 0; then (p, R) and (p′, R′) are of 'comparable complexity' if

|K_L(p, R) − K_L(p′, R′)| ≤ c.

For a set S ⊂ P × R, the pair (p, R) ∈ S is amongst the lowest complexity in S if

|K_L(p, R) − min_{(p′,R′)∈S} K_L(p′, R′)| ≤ c,

thus K_L(p, R) is within distance c of the minimum complexity of elements of S. We now formalize Proposition 3:

9In most standard computer languages, argmax just requires a for-loop, a reference to R, a comparison with a previously stored value, and possibly the storage of a new value and the current action.
10F-complexity is non-negative: F4 ∘ F4 is the identity, so that K_L(F4(p, R)) − K_L(p, R) = −(K_L(F4(F4(p, R))) − K_L(F4(p, R))), meaning that max over (p, R) and F4 of K_L(F4(p, R)) − K_L(p, R) must be non-negative; this is a reason to include F4 in the definition of F.

Proposition 7. If π̇ is the human policy, c defines a reasonable measure of comparable complexity, and L is a c-reasonable language for F, then the degenerate planner-reward pairs (p_π̇, 0), (p_g, R_π̇), and (−p_g, −R_π̇) are amongst the pairs of lowest complexity among the pairs compatible with π̇.

Proof. By Lemma 6, (p_π̇, 0), (p_g, R_π̇), and (−p_g, −R_π̇) are compatible with π̇.
By the definitions of the fi and Fi, for (p, R) compatible with π̇ we have f3(p, R) = p(R) = π̇, and hence

F1(p, R) = f1 ∘ f5(π̇) = f1(p_π̇) = (p_π̇, 0),
F2(p, R) = f2 ∘ f6(π̇) = f2(R_π̇) = (p_g, R_π̇),
F3(p, R) = f4 ∘ F2(p, R) = (−p_g, −R_π̇).

Now pick (p, R) to be the simplest pair compatible with π̇. Since L is c-reasonable for F, K_L(p_π̇, 0) ≤ c + K_L(p, R). Hence (p_π̇, 0) is of lowest complexity among the pairs compatible with π̇; the same argument applies for the other two degenerate pairs.

5.2 Negative reward

If (ṗ, Ṙ) is compatible with π̇, then so is (−ṗ, −Ṙ) = f4(ṗ, Ṙ) = F4(ṗ, Ṙ). This immediately implies the formalisation of Proposition 4:

Proposition 8. If π̇ is a human policy, c defines a reasonable measure of comparable complexity, L is a c-reasonable language for F, and (ṗ, Ṙ) is compatible with π̇, then (−ṗ, −Ṙ) is of comparable complexity to (ṗ, Ṙ).

So complexity fails to distinguish between a reasonable human reward function and its negative.

6 The high complexity of the genuine human reward function

Section 5 demonstrated that there are degenerate planner-reward pairs close to the minimum complexity among all pairs compatible with π̇. This section will argue that any reasonable pair (ṗ, Ṙ) is unlikely to be close to this minimum, and is therefore of higher complexity than the degenerate pairs. Unlike simplicity, reasonableness of a decomposition cannot easily be formalised; indeed, a formalisation would likely already solve the problem, yielding an algorithm to maximise it.
Therefore, the arguments in this section are mostly qualitative. We use reasonable to mean 'compatible with human judgements about rationality'. Since we do not have direct access to such a decomposition, the complexity argument will instead address the complexity of these human judgements. This argument will proceed in three stages:

1. Any reasonable (ṗ, Ṙ) is of high complexity, higher than it may intuitively seem to us.
2. Even given π̇, any reasonable (ṗ, Ṙ) involves a high number of contingent choices. Hence any given (ṗ, Ṙ) has high information content (and thus high complexity), even given π̇.
3. Past failures to find a simple (ṗ, Ṙ) derived from π̇ are evidence that this is difficult.

6.1 The complexity of human (ir)rationality

Humans make noisy and biased decisions all the time. Though noise is important [Kahneman et al., 2016], many biases, such as anchoring bias, overconfidence and planning fallacies, affect humans in a highly systematic way; see Kahneman and Egan [2011] for many examples.

Many people may feel that they have a good understanding of rationality, and therefore assume that assessing the (ir)rationality of any particular decision is not a complicated process. But an intuition for bias does not translate into a process for establishing a (ṗ, Ṙ).

Consider the anchoring bias described in Ariely et al. [2004], where irrelevant information, the last digits of social security numbers, changed how much people were willing to pay for goods.
When\nde\ufb01ning a reasonable ( \u02d9p, \u02d9R), it does not suf\ufb01ce to be aware of the existence of anchoring bias11, but\n\n11 The fact that many cognitive biases have only been discovered recently argue against people having a good\n\nintuitive grasp of bias and rationality, as do people\u2019s persistent bias blind spots [Scopelliti et al., 2015].\n\n7\n\n\fone has to precisely quantify the extent of the bias \u2014 why does anchoring bias seem to be stronger\nfor chocolate than for wine, for instance? And why these precise percentages and correlations, and\nnot others? And can people\u2019s judgment tell which people are more or less susceptible to anchoring\nbias? And can one quantify the bias for a single individual, rather than over a sample?\nAny given ( \u02d9p, \u02d9R) can quantify the form and extent of these biases by computing objects like the regret\nfunction Reg( \u02d9p, \u02d9R)(s) := Reg( \u02d9p( \u02d9R), \u02d9R)(s) = V \u2217\n(s), which measures the divergence\n\u02d9R\nbetween the expected value of the actual and optimal human policies12. Thus any given ( \u02d9p, \u02d9R)\n\u2014 which contains the information to compute quantities like Reg( \u02d9p, \u02d9R)(s) or similar measures of\nbias13, in every state \u2014 carries a high amount of numerical information about bias, and hence a high\ncomplexity.\nSince humans do not easily have access to this information, this implies that human judgement of\nirrationality is subject to Moravec\u2019s paradox [Moravec, 1988]. 
Judging (ir)rationality is similar to, for example, social skills: though they seem intuitively simple to us, they are highly complex to define in algorithmic terms. Other authors have argued directly for the complexity of human values, from fields as diverse as computer science, philosophy, neuroscience, and economics [Minsky, 1984, Bostrom, 2014, Glimcher et al., 2009, Muehlhauser and Helm, 2012, Yudkowsky, 2011].

6.2 The contingency of human judgement

The previous section showed that reasonable (ṗ, Ṙ) carry large amounts of information/complexity, but the key question is whether it requires information additional to that in π̇. This section will show that even when π̇ is known, there are many contingent choices that need to be made to define any specific reasonable (ṗ, Ṙ). Hence any given (ṗ, Ṙ) contains a large amount of information beyond that in π̇, and is therefore of higher complexity.

Reasons to believe that human judgement about reasonable (ṗ, Ṙ) contains many contingent choices:

• There is variability of human judgement between cultures. When Miller [1984] compared American and Indian assessments of the same behaviours, they found systematically different explanations for them14. Basic intuitions about rationality also vary between cultures [Nisbett et al., 2001, Brück, 1999].
• There is variability of human judgement within a single culture. When Slovic and Tversky [1974] analysed the “Allais Paradox”, they found that different people gave different answers as to what the rational behaviour was in their experiments.
• There is evidence of variability of human judgement within the same person. Slovic and Tversky [1974] further attempted to argue for the rationality of one of the answers.
This sometimes resulted in the participants changing their minds, and contradicting their previous assessment of rationality.
• There is variability of human judgement for the same person assessing their own values, caused by differences as trivial as question ordering [Schuman and Ludwig, 1983]. So human meta-judgement, of one’s own values and rationality, is also contingent and variable.
• People have partial bias blind spots around their own biases [Scopelliti et al., 2015].

Thus if a human is following policy π̇, a decomposition (ṗ, Ṙ) would provide additional information about the cultural background of the decomposer, their personality within their culture, and even about the past history of the decomposer and how the issue is being presented to them. These last pieces prevent us from ‘simply’ using the human’s own assessment of their own rationality, as that assessment is subject to change and re-interpretation depending on their possible histories.

12 To exactly quantify the anchoring bias above, we could use a regret function that contrasts π̇ with the same policy, but where the decision is optimal for one turn only (rather than for all turns, as in standard regret).

13 In contrast, regret for the degenerate planner-reward pairs is trivial. Reg(pπ̇, 0) and Reg(pg, Rπ̇) are identically zero — in the second case, since pg(Rπ̇) is actually optimal for Rπ̇, getting the maximal possible reward — while (−pg, −Rπ̇) has a regret that is identically −1 at each step.

14 “Results show that there were cross-cultural and developmental differences related to contrasting cultural conceptions of the person [...]
rather than from cognitive, experiential, and informational differences [...].”

6.3 The search for human rationality models

One final argument that there is no simple algorithm for going from π̇ to (ṗ, Ṙ): many have tried and failed to find such an algorithm. Since the subject of human rationality has been a major one for several thousand years, the ongoing failure is indicative — though not a proof — of the difficulties involved. There have been many suggested philosophical avenues for finding such a reward (such as reflective equilibrium [Rawls, 1971]), but all have been underdefined and disputed.

The economic concept of revealed preferences [Samuelson, 1948] is the most explicit, using the assumption of rational behaviour to derive human preferences. This is an often acceptable approximation, but can be taken too far: failure to achieve an achievable goal does not imply that the failure was desired. Even within the confines of economics, it has been criticised by behavioural economics approaches, such as prospect theory [Kahneman and Tversky, 2013] — and there are counter-criticisms to these.

Using machine learning to deduce the intentions and preferences of humans is in its infancy, but we can already see non-trivial real-world examples, even in settings as simple as car-driving [Lazar et al., 2018]. Thus to date, neither humans nor machine learning have been able to find simple ways of going from π̇ to (ṗ, Ṙ), nor any simple and explicit theory for how such a decomposition could be achieved. This suggests that (ṗ, Ṙ) is a complicated object, even if π̇ is known. In conclusion:

Conjecture 9 (Informal complexity proposition).
If π̇ is a human policy, and L is a ‘reasonable’ computer language with (ṗ, Ṙ) a ‘reasonable’ compatible planner-reward pair, then the complexity of (ṗ, Ṙ) is not close to minimal amongst the pairs compatible with π̇.

7 Conclusion

We have shown that some degenerate planner-reward decompositions of a human policy have near-minimal description length and argued that decompositions we would endorse do not. Hence, under the Kolmogorov-complexity simplicity prior, a formalization of Occam’s razor, the posterior would endorse degenerate solutions. Previous work has shown that noisy rationality is too strong an assumption, as it does not account for bias; we tried the weaker assumption of simplicity, strong enough to avoid typical No Free Lunch results, but it is insufficient here.

This is no reason for despair: there is a large space to explore between these two extremes. Our hope is that with some minimal assumptions about planner and reward we can infer the rest with enough data. Staying close to agnostic is desirable in some settings: for example, a misspecified model of the human reward function can lead to disastrous decisions with high confidence [Milli et al., 2017]. Anonymous [2019] makes a promising first try — a high-dimensional parametric planner is initialized to noisy rationality and then adapts to fit the behavior of a systematically irrational agent.

How can we reconcile our results with the fact that humans routinely make judgments about the preferences and irrationality of others? And that these judgments are often correlated from human to human? After all, No Free Lunch applies to human as well as artificial agents.
Our result shows that they must be using shared priors, beyond simplicity, that are not learned from observations. We call these normative assumptions because they encode beliefs about which reward functions are more likely and what constitutes approximately rational behavior. Uncovering minimal normative assumptions would be an ideal way to build on this paper; Appendix C shows one possible approach.

Acknowledgments.

We wish to thank Laurent Orseau, Xavier O’Rourke, Jan Leike, Shane Legg, Nick Bostrom, Owain Evans, Jelena Luketina, Tom Everitt, Jessica Taylor, Paul Christiano, Eliezer Yudkowsky, Stuart Russell, Dylan Hadfield-Menell, Anders Sandberg, Adam Gleave, and Rohin Shah, among many others. This work was supported by the Alexander Tamas programme on AI safety research, the Leverhulme Trust, and the Machine Intelligence Research Institute.

References

Pieter Abbeel and Andrew Y Ng. Apprenticeship Learning via Inverse Reinforcement Learning. 2004.

Eric Allender. When worlds collide: Derandomization, lower bounds, and Kolmogorov complexity. In International Conference on Foundations of Software Technology and Theoretical Computer Science, pages 1–15. Springer, 2001.

Kareem Amin and Satinder Singh. Towards Resolving Unidentifiability in Inverse Reinforcement Learning. 2016.

Dario Amodei, Chris Olah, Jacob Steinhardt, Paul Christiano, John Schulman, and Dan Mané. Concrete Problems in AI Safety. 2016.

Anonymous. Inferring reward functions from demonstrators with unknown biases. In Submitted to International Conference on Learning Representations, 2019. URL https://openreview.net/forum?id=rkgqCiRqKQ. Under review.

Dan Ariely, George Loewenstein, and Drazen Prelec. Arbitrarily coherent preferences. The psychology of economic decisions, 2:131–161, 2004.

Nick Bostrom. Superintelligence: Paths, dangers, strategies.
Oxford University Press, 2014.

Abdeslam Boularias, Jens Kober, and Jan Peters. Relative Entropy Inverse Reinforcement Learning, 2011.

Joanna Brück. Ritual and rationality: some problems of interpretation in European archaeology. European Journal of Archaeology, 2(3):313–344, 1999.

Cristian Calude. Information and randomness: an algorithmic perspective. Springer, 2002.

Chris Cundy and Daniel Filan. Exploring hierarchy-aware inverse reinforcement learning. arXiv preprint arXiv:1807.05037, 2018.

Owain Evans, Andreas Stuhlmueller, and Noah D. Goodman. Learning the Preferences of Ignorant, Inconsistent Agents. Thirtieth AAAI Conference on Artificial Intelligence, 2015a.

Owain Evans, Andreas Stuhlmüller, and Noah D Goodman. Learning the preferences of bounded agents. NIPS Workshop on Bounded Optimality, pages 16–22, 2015b.

Tom Everitt and Marcus Hutter. Avoiding wireheading with value reinforcement learning. In International Conference on Artificial General Intelligence, pages 12–22. Springer, 2016.

Tom Everitt, Tor Lattimore, and Marcus Hutter. Free Lunch for optimisation under the universal distribution. In Proceedings of the 2014 IEEE Congress on Evolutionary Computation, CEC 2014, pages 167–174, 2014.

Tom Everitt, Victoria Krakovna, Laurent Orseau, Marcus Hutter, and Shane Legg. Reinforcement Learning with a Corrupted Reward Channel. 2017.

Paul W Glimcher, Colin F Camerer, Ernst Fehr, and Russell A Poldrack. Neuroeconomics: Decision making and the brain, 2009.

Thomas L. Griffiths, Falk Lieder, and Noah D. Goodman. Rational Use of Cognitive Resources: Levels of Analysis Between the Computational and the Algorithmic. Topics in Cognitive Science, 7(2):217–229, 2015.

Dylan Hadfield-Menell, Anca Dragan, Pieter Abbeel, and Stuart Russell. Cooperative Inverse Reinforcement Learning.
arXiv:1606.03137 [cs], 2016.

Dylan Hadfield-Menell, Smitha Milli, Stuart J Russell, Pieter Abbeel, and Anca Dragan. Inverse reward design. In Advances in Neural Information Processing Systems, pages 6749–6758, 2017.

Andreas Hula, P. Read Montague, and Peter Dayan. Monte Carlo Planning Method Estimates Planning Horizons during Interactive Social Exchange. PLOS Computational Biology, 11(6):e1004254, 2015.

David Hume. A Treatise of Human Nature. Edited by L. A. Selby-Bigge. 1888.

Daniel Kahneman and Patrick Egan. Thinking, fast and slow, volume 1. Farrar, Straus and Giroux, New York, 2011.

Daniel Kahneman and Amos Tversky. Prospect theory: An analysis of decision under risk. In Handbook of the fundamentals of financial decision making: Part I, pages 99–127. World Scientific, 2013.

Daniel Kahneman, Andrew M Rosenfield, Linnea Gandhi, and Tom Blaser. Noise: How to overcome the high, hidden cost of inconsistent decision making. Harvard Business Review, 94(10):38–46, 2016.

Andrei N Kolmogorov. Three approaches to the quantitative definition of information. Problems of Information Transmission, 1(1):1–7, 1965.

Daniel A Lazar, Kabir Chandrasekher, Ramtin Pedarsani, and Dorsa Sadigh. Maximizing road capacity using cars that influence people. arXiv preprint arXiv:1807.04414, 2018.

Jan Leike, Miljan Martic, Victoria Krakovna, Pedro Ortega, Tom Everitt, Andrew Lefrancq, Laurent Orseau, and Shane Legg. AI safety gridworlds. arXiv preprint arXiv:1711.09883, 2017.

Leonid A Levin. Randomness conservation inequalities; information and independence in mathematical theories. Information and Control, 61(1):15–37, 1984.

Richard L Lewis, Andrew Howes, and Satinder Singh. Computational Rationality: Linking Mechanism and Behavior Through Bounded Utility Maximization. Topics in Cognitive Science, 6:279–311, 2014.

Joan G Miller.
Culture and the development of everyday social explanation. Journal of Personality and Social Psychology, 46(5):961, 1984.

Smitha Milli, Dylan Hadfield-Menell, Anca Dragan, and Stuart Russell. Should robots be obedient? arXiv preprint arXiv:1705.09990, 2017.

Ming Li and Paul M. B. Vitányi. Kolmogorov complexity and its applications. Algorithms and Complexity, 1:187, 2014.

Marvin Minsky. Afterword to Vernor Vinge’s novel, “True Names.” Unpublished manuscript, 1984. URL http://web.media.mit.edu/~minsky/papers/TrueNames.Afterword.html.

Hans Moravec. Mind children: The future of robot and human intelligence. Harvard University Press, 1988.

Luke Muehlhauser and Louie Helm. The singularity and machine ethics. In Singularity Hypotheses, pages 101–126. Springer, 2012.

Markus Müller. Stationary algorithmic probability. Theoretical Computer Science, 411(1):113–130, 2010.

Andrew Ng and Stuart Russell. Algorithms for inverse reinforcement learning. Proceedings of the Seventeenth International Conference on Machine Learning, pages 663–670, 2000.

Richard E Nisbett, Kaiping Peng, Incheol Choi, and Ara Norenzayan. Culture and systems of thought: holistic versus analytic cognition. Psychological Review, 108(2):291, 2001.

Nathan D Ratliff, J Andrew Bagnell, and Martin A Zinkevich. Maximum margin planning. In Proceedings of the 23rd International Conference on Machine Learning - ICML ’06, pages 729–736, 2006.

John Rawls. A Theory of Justice. Cambridge, Massachusetts: Belknap Press, 1971. ISBN 0-674-00078-1.

Stuart Russell. Rationality and Intelligence: A Brief Update. In Fundamental Issues of Artificial Intelligence, pages 7–28. 2016.

Stuart Russell, Daniel Dewey, and Max Tegmark. Research Priorities for Robust and Beneficial Artificial Intelligence. AI Magazine, 36(4):105, 2015.

Paul A Samuelson.
Consumption theory in terms of revealed preference. Economica, 15(60):243–253, 1948.

Jürgen Schmidhuber. The speed prior: a new simplicity measure yielding near-optimal computable predictions. In International Conference on Computational Learning Theory, pages 216–228. Springer, 2002.

Howard Schuman and Jacob Ludwig. The norm of even-handedness in surveys as in life. American Sociological Review, pages 112–120, 1983.

Irene Scopelliti, Carey K Morewedge, Erin McCormick, H Lauren Min, Sophie Lebrecht, and Karim S Kassam. Bias blind spot: Structure, measurement, and consequences. Management Science, 61(10):2468–2486, 2015.

Kyriacos Shiarlis, Joao Messias, and Shimon Whiteson. Rapidly exploring learning trees. In Proceedings - IEEE International Conference on Robotics and Automation, pages 1541–1548, 2017.

Paul Slovic and Amos Tversky. Who accepts Savage’s axiom? Behavioral Science, 19(6):368–373, 1974.

Amos Tversky and Daniel Kahneman. Judgment under Uncertainty: Heuristics and Biases. In Utility, Probability, and Human Decision Making, pages 141–162. Springer Netherlands, Dordrecht, 1975.

Paul M. B. Vitányi and Ming Li. An introduction to Kolmogorov complexity and its applications, volume 34. Springer, Heidelberg, 1997.

David H. Wolpert and William G. Macready. No free lunch theorems for optimization. IEEE Transactions on Evolutionary Computation, 1(1):67–82, 1997.

Eliezer Yudkowsky. Complex value systems in friendly AI. In International Conference on Artificial General Intelligence, pages 388–393. Springer, 2011.

Brian D Ziebart, Andrew Maas, J Andrew Bagnell, and Anind K Dey. Maximum Entropy Inverse Reinforcement Learning.
In AAAI Conference on Artificial Intelligence, pages 1433–1438, 2008.