{"title": "Ask not what AI can do, but what AI should do: Towards a framework of task delegability", "book": "Advances in Neural Information Processing Systems", "page_first": 57, "page_last": 67, "abstract": "While artificial intelligence (AI) holds promise for addressing societal challenges, issues of exactly which tasks to automate and to what extent to do so remain understudied. We approach this problem of task delegability from a human-centered perspective by developing a framework on human perception of task delegation to AI. We consider four high-level factors that can contribute to a delegation decision: motivation, difficulty, risk, and trust. To obtain an empirical understanding of human preferences in different tasks, we build a dataset of 100 tasks from academic papers, popular media portrayal of AI, and everyday life, and administer a survey based on our proposed framework. We find little preference for full AI control and a strong preference for machine-in-the-loop designs, in which humans play the leading role. Among the four factors, trust is the most correlated with human preferences of optimal human-machine delegation. This framework represents a first step towards characterizing human preferences of AI automation across tasks. We hope this work encourages future efforts towards understanding such individual attitudes; our goal is to inform the public and the AI research community rather than dictating any direction in technology development.", "full_text": "Ask not what AI can do, but what AI should do:\n\nTowards a framework of task delegability\n\nBrian Lubars\n\nUniversity of Colorado Boulder\nbrian.lubars@colorado.edu\n\nChenhao Tan\n\nUniversity of Colorado Boulder\nchenhao.tan@colorado.edu\n\nAbstract\n\nWhile arti\ufb01cial intelligence (AI) holds promise for addressing societal challenges,\nissues of exactly which tasks to automate and to what extent to do so remain un-\nderstudied. 
We approach this problem of task delegability from a human-centered\nperspective by developing a framework on human perception of task delegation to\nAI. We consider four high-level factors that can contribute to a delegation deci-\nsion: motivation, dif\ufb01culty, risk, and trust. To obtain an empirical understanding\nof human preferences in different tasks, we build a dataset of 100 tasks from aca-\ndemic papers, popular media portrayal of AI, and everyday life, and administer\na survey based on our proposed framework. We \ufb01nd little preference for full AI\ncontrol and a strong preference for machine-in-the-loop designs, in which hu-\nmans play the leading role. Among the four factors, trust is the most correlated\nwith human preferences of optimal human-machine delegation. This framework\nrepresents a \ufb01rst step towards characterizing human preferences of AI automation\nacross tasks. We hope this work encourages future efforts towards understand-\ning such individual attitudes; our goal is to inform the public and the AI research\ncommunity rather than dictating any direction in technology development.\n\n1\n\nIntroduction\n\nRecent developments in machine learning have led to signi\ufb01cant excitement about the promise of\narti\ufb01cial intelligence. Ng [35] claims that \u201carti\ufb01cial intelligence is the new electricity.\u201d Arti\ufb01cial\nintelligence indeed approaches or even outperforms human-level intelligence in critical domains\nsuch as hiring, medical diagnosis, and judicial systems [7, 10, 12, 23, 29]. However, we also observe\ngrowing concerns about which problems are appropriate applications of AI. For instance, a recent\nstudy used deep learning to predict sexual orientation from images [45]. This study has caused\ncontroversy [32, 34]: Glaad and the Human Rights Campaign denounced the study as \u201cjunk science\u201d\nthat \u201cthreatens the safety and privacy of LGBTQ and non-LGBTQ people alike\u201d [2]. 
In general,\nresearchers also worry about the impact on jobs and the future of employment [14, 42, 44].\nSuch excitement and concern begs a fundamental question at the interface of arti\ufb01cial intelligence\nand human society: which tasks should be delegated to AI, and in what way? To answer this\nquestion, we need to at least consider two dimensions. The \ufb01rst dimension is capability. Machines\nmay excel at some tasks, but struggle at others; this area has been widely explored since Fitts \ufb01rst\ntackled function allocation in 1951 [13, 36, 38]. The goal of AI research has also typically focused\non pushing the boundaries of machine ability and exploring what AI can do.\nThe second dimension is human preferences, i.e., what role humans would like AI to play. The\nautomation of some tasks is celebrated, while others should arguably not be automated for reasons\nbeyond capability alone. For instance, automated civil surveillance may be disastrous from ethical,\nprivacy, and legal standpoints. Motivation is another reason: no matter the quality of machine\ntext generation, it is unlikely that an aspiring writer will derive the same satisfaction or value from\n33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.\n\n\fdelegating their writing to a fully automated system. Despite the clear importance of understanding\nhuman preferences, the question of what AI should do remains understudied in AI research.\nIn this work, we present the \ufb01rst empirical study to understand how different factors relate to human\npreferences of delegability, i.e., to what extent AI should be involved. Our contributions are three-\nfold. First, building on prior literature on function allocation, mixed-initiative systems, and trust and\nreliance on machines [20, 27, 36, inter alia], we develop a framework of four factors \u2014 motivation,\ndif\ufb01culty, risk, and trust \u2014 to explain task delegability. 
Second, we construct a dataset of diverse\ntasks ranging from those found in academic research to ones that people routinely perform in daily\nlife. Third, we conduct a survey to solicit human evaluations of the four factors and delegation pref-\nerences, and validate the effectiveness of our framework. We \ufb01nd that our survey participants seldom\nprefer full automation, but value AI assistance. Among the four factors, trust is the most correlated\nwith human preferences of delegability. However, the need for interpretability, a component in trust,\ndoes not show a strong correlation with delegability. Our study contributes towards a framework of\ntask delegability and an evolving database of tasks and their associated human preferences.\n\n2 Related Work\n\nAs machine learning grows further embedded in society, human preferences of AI automation gain\nrelevance. We believe surveying, tracking, and understanding such preferences is important. How-\never, apart from human-machine integration studies on speci\ufb01c tasks, the primary mode of thinking\nin AI research is towards automation. We summarize related work in three main areas.\nTask allocation and delegation. Several studies have proposed theories of delegation in the context\nof general automation [6, 33, 36, 38]. Function allocation examines how to best divide tasks based\non human and machine abilities [13, 36, 38]. Castelfranchi and Falcone [6] emphasize the role of\nrisk, uncertainty, and trust in delegation, which we build on in developing our framework. Milewski\nand Lewis [33] suggest that people may not want to delegate to machines in tasks characterized\nby low trust or low con\ufb01dence, where automation is unnecessary, or where automation does not\nadd to utility. In the context of jobs and their susceptibility to automation, Frey and Osborne [14]\n\ufb01nd social, creative, and perception-manipulation requirements to be good prediction criteria for\nmachine ability. 
Parasuraman et al.\u2019s Levels of Automation is the closest to our work [36]. However,\ntheir work is primarily concerned with performance-based criteria (e.g., capability, reliability, cost),\nwhile our interest involves human preferences.\nShared-control design paradigms. Many tasks are amenable to a mix of human and machine\ncontrol. Mixed-initiative systems and collaborative control have gained traction over function allo-\ncation, driven by the need for \ufb02exibility and adaptability, and the importance of a user\u2019s goals in\noptimizing value-added automation [4, 20, 21].\nWe broadly split work on shared-control systems into two categories. We \ufb01nd this split more \ufb02exible\nand practical for our application than the Levels of Automation. On one side, we have human-in-\nthe-loop machine learning designs, wherein humans assist machines. The human role is to oversee,\nhandle edge cases, augment training sets, and re\ufb01ne system outputs. Such designs enjoy prevalence\nin applications from vision and recognition to translation [5, 11, 18]. Alternatively, a machine-in-\nthe-loop paradigm places the human in the primary position of action and control while the machine\nassists. Examples of this include a creative-writing assistance system that generates contextual sug-\ngestions [8, 41], and predicting situations in which people are likely to make judgmental errors in\ndecision-making [1]. Even tasks which should not be automated may still bene\ufb01t from machine\nassistance [16, 17, 24], especially if human performance is not the upper bound as Kleinberg et al.\nfound in judge bail decisions [23].\nTrust and reliance on machines. Finally, we consider the community\u2019s interest in trust. As au-\ntomation grows in complexity, complete understanding becomes impossible; trust serves as a proxy\nfor rational decisions in the face of uncertainty, and appropriate use of technology becomes criti-\ncal [27]. 
As such, calibration of trust continues to be a popular avenue of research [15, 28]. Lee\nand See [27] identify three bases of trust in automation: performance, process, and purpose. Performance\ndescribes the automation\u2019s ability to reliably achieve the operator\u2019s goals. Process describes\nthe inner workings of the automation; examples include dependability, integrity, and interpretability\n(in particular, interpretable ML has received signi\ufb01cant interest [9, 22, 39, 40]). Finally, purpose\nrefers to the intent behind the automation and its alignment with the user\u2019s goals.\n\nFactor | Components\nMotivation | Intrinsic motivation, goals, utility\nDif\ufb01culty | Social skills, creativity, effort required, expertise required, human ability\nRisk | Accountability, uncertainty, impact\nTrust | Machine ability, interpretability, value alignment\n\nTable 1: An overview of the four factors in our AI task delegability framework.\n\nFigure 1: Factors behind task delegability.\n\n3 A Framework for Task Delegability\n\nTo explain human preferences of task delegation to AI, we develop a framework with four factors: a\nperson\u2019s motivation in undertaking the task, their perception of the task\u2019s dif\ufb01culty, their perception\nof the risk associated with accomplishing the task, and \ufb01nally their trust in the AI agent. We choose\nthese factors because motivation, dif\ufb01culty, and risk respectively cover why a person chooses to\nperform a task, the process of performing a task, and the outcome, while trust captures the interaction\nbetween the person and the AI. We now explain the four factors, situate them in literature, and\npresent the statements used in the surveys to capture each component. Table 1 presents an overview.\nMotivation. Motivation is an energizing factor that helps initiate, sustain, and regulate task-related\nactions by directing our attention towards goals or values [25, 30]. 
Affective (emotional) and cogni-\ntive processes are thought to be collectively responsible for driving action, so we consider intrinsic\nmotivation and goals as two components in motivation [30]. We also distinguish between learning\ngoals and performance goals, as indicated by Goal Setting Theory [31]. Finally, the expected utility\nof a task captures its value from a rational cost-bene\ufb01t analysis perspective [20]. Note that a task\nmay be of high intrinsic motivation yet low utility, e.g., reading a novel. Speci\ufb01cally, we use the\nfollowing statements to measure these motivation components in our surveys:\n1. Intrinsic Motivation: \u201cI would feel motivated to perform this task, even without needing to; for\n\nexample, it is fun, interesting, or meaningful to me.\u201d\n\n2. Goals: \u201cI am interested in learning how to master this task, not just in completing the task.\u201d\n3. Utility: \u201cI consider this task especially valuable or important; I would feel committed to com-\n\npleting this task because of the value it adds to my life or the lives of others.\u201d\n\nDif\ufb01culty. Dif\ufb01culty is a subjective measure re\ufb02ecting the cost of performing a task. For delegation,\nwe frame dif\ufb01culty as the interplay between task requirements and the ability of a person to meet\nthose requirements. Some tasks are dif\ufb01cult because they are time-consuming or laborious; others,\nbecause of the required training or expertise. To differentiate the two, we include effort required\nand expertise required as components in dif\ufb01culty. 
The third component, belief about abilities\npossessed, can also be thought of as task-speci\ufb01c self-con\ufb01dence (also called self-ef\ufb01cacy [3]) and\nhas been empirically shown to predict allocation strategies between people and automation [26].\nAdditionally, we contextualize our dif\ufb01culty measures with two speci\ufb01c skill requirements: the\namount of creativity and social skills required. We choose these because they are considered more\ndif\ufb01cult for machines than for humans [14].\n1. Social skills: \u201cThis task requires social skills to complete.\u201d\n2. Creativity: \u201cThis task requires creativity to complete.\u201d\n3. Effort: \u201cThis task requires a great deal of time or effort to complete.\u201d\n4. Expertise: \u201cIt takes signi\ufb01cant training or expertise to be quali\ufb01ed for this task.\u201d\n5. (Perceived) human ability: \u201cI am con\ufb01dent in [my own/a quali\ufb01ed person\u2019s] ability to complete this task.\u201d 1\n\n1 We \ufb02ip this component in coherence analysis (Fig. 2c) so that higher lack of con\ufb01dence in human ability indicates higher dif\ufb01culty.\n\nRisk. Real-world tasks involve uncertainty and risk in accomplishing the task, so a rational decision\non delegation involves more than just cost and bene\ufb01t. Delegation amounts to a bet: a choice\nconsidering the probabilities of accomplishing the goal against the risks and costs of each agent [6].\nPerkins et al. [37] de\ufb01ne risk practically as a \u201cprobability of harm or loss,\u201d \ufb01nding that people rely\non automation less as the probability of mortality increases. Responsibility or accountability may\nplay a role if delegation is seen as a way to share blame [28, 43]. We thus decompose risk into three\ncomponents: personal accountability for the task outcome; the uncertainty, or the probability of\nerrors; and the scope of impact, or cost or magnitude of those errors.\n1. Accountability: \u201cIn the case of mistakes or failure on this task, someone needs to be held accountable.\u201d\n2. Uncertainty: \u201cA complex or unpredictable environment/situation is likely to cause this task to fail.\u201d\n3. Impact: \u201cFailure would result in a substantial negative impact on my life or the lives of others.\u201d\nTrust. Trust captures how people deal with risk or uncertainty. We use the de\ufb01nition of trust as\n\u201cthe attitude that an agent will help achieve an individual\u2019s goals in a situation characterized by\nuncertainty and vulnerability\u201d [27]. Trust is generally regarded as the most salient factor in reliance\non automation [27, 28]. Here, we operationalize trust as a combination of perceived ability of the\nAI agent, agent interpretability (ability to explain itself), and perceived value alignment. Each of\nthese corresponds to a component of trust in Lee and See [27]: performance, process, and purpose.\n1. (Perceived) machine ability: \u201cI trust the AI agent\u2019s ability to reliably complete the task.\u201d\n2. Interpretability: \u201cUnderstanding the reasons behind the AI agent\u2019s actions is important for me to trust the AI agent on this task (e.g., explanations are necessary).\u201d 2\n3. Value alignment: \u201cI trust the AI agent\u2019s actions to protect my interests and align with my values for this task.\u201d\n\nDegree of delegation. We develop this framework of motivation, dif\ufb01culty, risk, and trust to explain\nhuman preferences of delegation. 
To measure human preferences, we split the degree of delegation\ninto the following four categories (refer to Supplementary Material for the wordings in the survey):\n1. No AI assistance: the person does the task completely on their own (henceforth, \u201chuman only\u201d).\n2. The human leads and the AI assists: the person does the task mostly on their own, but the AI\noffers recommendations or help when appropriate (e.g., human gets stuck or AI sees possible\nmistakes) (henceforth, \u201cmachine in the loop\u201d).\n\n3. The AI leads and the human assists: the AI performs the task, but asks the person for sugges-\n\ntions/con\ufb01rmation when appropriate (henceforth, \u201chuman in the loop\u201d).\n\n4. Full AI automation: decisions and actions are made automatically by the AI once the task is\n\nassigned; no human involvement (henceforth, \u201cAI only\u201d).\n\nFig. 1 presents our expectation of how motivation, dif\ufb01culty, risk, and trust relate to delegability.\nMotivation describes how invested someone is in the task, including how much effort they are willing\nto expend, while dif\ufb01culty determines the amount of effort the task requires. Therefore, we expect\ndif\ufb01culty and motivation to relate to each other: we hypothesize that people are more likely to\ndelegate tasks which they \ufb01nd dif\ufb01cult (or have low con\ufb01dence in their abilities), and less likely\nto delegate tasks which they are highly invested in. Risk re\ufb02ects uncertainty and vulnerability in\nperforming a task, the situational conditions necessary for trust to be salient [27]. We thus expect\nrisk to moderate trust. 
Finally, we hypothesize that the correlation between components within\neach factor should be greater than that across different factors, i.e., factors should show coherence in\ncomponent measurements.\n\n2 We \ufb02ip this component in coherence analysis (Fig. 2c and 2d) so that higher lack of need for interpretability indicates higher trust.\n\n4 A Task Dataset and Survey Results\n\nTo evaluate our framework empirically, we build a database of diverse tasks covering settings ranging\nfrom academic research to daily life, and develop and administer a survey to gather perceptions\nof those tasks under our framework. We examine survey responses through both quantitative analyses\nand qualitative case studies.\n\n4.1 A Dataset of Tasks\n\nWe choose 100 tasks drawn from academic conferences, popular discussions in the media, well-known\noccupations, and mundane tasks people encounter in their everyday lives. These tasks are\ngenerally relevant to current AI research and discussion, and present challenging delegability decisions\nwith which to evaluate our framework. Example tasks can be found in \u00a74.3. To additionally\nbalance the variety of tasks chosen, we categorize them as art, creative, business, civic, entertainment,\nhealth, living, and social, and keep a minimum of 7 tasks of each (a task can belong to multiple\ncategories; refer to Supplementary Material for details). Ideally, the tasks would cover the entire automation\n\u201ctask space\u201d; our task set is intended as a reasonable starting point.\nSince some tasks, e.g., medical diagnosis, require expertise, and since motivation does not apply if\nthe subject does not personally perform the task, we develop two survey versions.\n\u2022 Personal survey. We include all four factors in Table 1 and ask participants \u201cIf you were to\ndo the given (above) task, what level of AI/machine assistance would you prefer?\u201d\n\u2022 Expert survey. 
We include only dif\ufb01culty, risk, and trust, and ask participants \u201cIf you were to\nask someone to complete the given (above) task, what level of AI/machine assistance would you\nprefer?\u201d\n\nFollowing a general explanation, our survey begins by asking subjects for basic demographic in-\nformation. Subjects are then presented one randomly-selected task from our set. They evaluate the\ntask under each component in our framework (see Table 1) according to a \ufb01ve-point Likert scale.\nFinally, subjects select the degree of delegation they would prefer for the task from the following\nfour choices: Full Automation, AI leads and human assists, Human leads and AI assists, or No AI\nassistance. Note that subjects are not told which factor each question measures beyond the question\ntext itself, and can choose the degree of delegation independently of our framework.\nWe administer this 5-minute survey on Mechanical Turk. To improve the quality of surveys, we\nrequire that participants have completed 200 HITs with at least a 99% acceptance rate and are from\nthe United States. We additionally add two attention check questions to ensure participants read\nthe survey carefully. Subjects are paid $0.80 upon completing the survey and passing the checks;\notherwise the data is discarded. We record 1000 survey responses: 500 each for the personal and\nthe expert versions, composed of 5 responses for each of the 100 tasks. Finally, we further discard\nresponses that are identical in each component (e.g., marking \u201cAgree\u201d for all components), resulting\nin 495 and 497 responses for the personal and expert versions, respectively. This leaves 8 tasks with\n4 responses rather than 5. We obtain a gender-balanced sample with 525 males, 463 females, and 4\nidentifying otherwise. 
The dataset is available at http://delegability.github.io.\n\n4.2 Survey Results\n\nIn this section, we begin by examining the overall distribution of the delegability preferences, then\ninvestigate the relation between components in our framework and the delegability labels.\nParticipants seldom choose \u201cAI only\u201d and prefer designs where humans play the leading role.\nParticipants labeled the delegability of tasks ranging from 1 (\u201cHuman only\u201d) to 4 (\u201cAI only\u201d). Fig. 2a\npresents the distribution. In both the personal and expert surveys, 4 (\u201cAI only\u201d) is seldom chosen.\nInstead, both distributions peak at 2, indicating strong preferences towards machine-in-the-loop de-\nsigns. This result becomes even more striking when averaging the \ufb01ve responses received per task,\nconcentrating almost half the mass between 2 and 2.5 \u2014 again indicating a preference for machine-\nin-the-loop designs. In fact, after averaging responses, we \ufb01nd that none of the 100 tasks yield an\noverall preference for full automation (\u2265 3.5). Taken together, these results imply that people prefer\nhumans to keep control over the vast majority of tasks, yet are also open to AI assistance.\nIf we view our surveys as a labeling exercise, the agreement between individuals is low but is\nrelatively higher in the expert survey than the personal survey: the Krippendorff\u2019s \u03b1 is 0.063 in\nthe personal survey and is 0.174 in the expert survey [19]. The lower agreement in the personal\nsurvey is consistent with heterogeneity between individuals. Two of the most contentious personal\nsurvey tasks were: \u201cCleaning your house\u201d and \u201cBreaking up with your romantic partner\u201d.\nTrust is most correlated with human preferences of automation. Consistent with our hypothesis\nin Fig. 1, trust is generally positively correlated with delegability. Table 2 shows the component\ncorrelations with the delegability labels. 
After Bonferroni correction, 5 out of the 11 components\nare signi\ufb01cantly correlated with the delegability label in the expert survey, while only 4 of the 14\ncomponents are in the personal survey. Trust takes the top two spots in both. Dif\ufb01culty, the only\nother signi\ufb01cantly correlated factor after trust, is generally negatively correlated with delegability.\nIn particular, the creative and social skill requirements are the next highest correlations in both surveys,\nsuggesting speci\ufb01c skills may form a stronger basis for delegation to AI than more subjective\ndif\ufb01culty measures (e.g., self-con\ufb01dence).\n\nFigure 2 panels: (a) Distribution of individual preferences. (b) Distribution of average preferences for each task. (c) Average pairwise correlation between components in the four factors. (d) Correlation of components in the trust factor.\n\nFigure 2: Fig. 2a and 2b show that full automation is rarely preferred in our surveys. Fig. 2c and\n2d examine correlations between components. We observe low coherence in trust and dif\ufb01culty. In\nparticular, interpretability seems distinct from the other two trust components.\n\nFactor | Component | Personal | Expert\nMotivation | Utility | -0.126 (\u2020) | N/A\nMotivation | Intrinsic motivation | -0.104 (\u2020) | N/A\nMotivation | Goals | NS | N/A\nDif\ufb01culty | Social skills | -0.303 (***) | -0.294 (***)\nDif\ufb01culty | Creativity | -0.223 (***) | -0.290 (***)\nDif\ufb01culty | Human ability | NS | -0.160 (**)\nDif\ufb01culty | Effort required | NS | -0.124 (\u2020)\nDif\ufb01culty | Expertise required | NS | -0.120 (\u2020)\nRisk | Uncertainty | NS | -0.135 (\u2020)\nRisk | Accountability | NS | -0.123 (\u2020)\nRisk | Impact | NS | -0.106 (\u2020)\nTrust | Machine ability | 0.520 (***) | 0.593 (***)\nTrust | Value alignment | 0.481 (***) | 0.522 (***)\nTrust | Process (interpretability) | NS | NS\n\nTable 2: Pearson correlation of framework components with the delegability label for individual responses\nto the personal and expert surveys. p-values were computed by aggregating over individual\nLikert ratings separately for the personal and expert surveys, resulting in 25 tests in total. Signi\ufb01cance\nafter Bonferroni correction is indicated by *** for p < 0.001, ** for p < 0.01, * for p < 0.05,\nand NS for p >= 0.05. Results that were p < 0.05 prior to correction are indicated by \u2020.\n\nNext, we highlight three exceptions: 1) Contrary to our hypothesis, we did not \ufb01nd statistically\nsigni\ufb01cant correlations between the risk or motivation factors and delegability after Bonferroni correction.\n2) The interpretability (process) component of trust is not correlated with delegability. 3)\nCon\ufb01dence in human ability (within the dif\ufb01culty factor; lower con\ufb01dence indicates higher dif\ufb01culty)\nis not correlated with delegability in the personal survey, but is actually negatively correlated\nin the expert survey (the lower the con\ufb01dence, the more delegable). 
This differs from the general\ntrend of rating more dif\ufb01cult tasks as less delegable, suggesting that people prefer experts to accept\nAI assistance on low-con\ufb01dence tasks, but perhaps are less willing to do so themselves.\nFactors are generally \u201ccoherent\u201d, but risk components have only weak correlation with trust\ncomponents. We next study the correlation between components in our framework. We focus\non the personal survey here because it has all four factors, but results are consistent in the expert\nsurvey. Fig. 2c presents the average pairwise component correlations between the four factors: the\ncorrelation along the diagonal is generally higher than the off-diagonal ones. This \ufb01nding con\ufb01rms\nthat factors are generally \u201ccoherent\u201d, af\ufb01rming our categorization of the components within them.\nComparing the factor relations to our expectation in Fig. 1 (see the correlation graph in the Supple-\nmentary Material for detailed relations between components), we \ufb01nd that motivation and dif\ufb01culty\nare indeed correlated, most notably between self-con\ufb01dence and intrinsic motivation in the personal\nsurvey (\u03c1 = 0.41, people enjoy doing what they are good at). However, contrary to our expectation,\ncomponents in risk have only weak connections with trust, while dif\ufb01culty is correlated with trust\n(through the social and creative skill requirements) and to risk in personal and expert surveys.\nCoherence is lower in dif\ufb01culty and trust than in risk and motivation. To investigate this, we zoom\ninto the correlation matrix in Fig. 2d to show individual components of trust. We observe inter-\npretability has little correlation with machine ability and negative correlation with value alignment,\nsuggesting that the need for explanation is independent of whether the machine is perceived as ca-\npable, and perhaps higher when machine is perceived as benign. 
In comparison, interpretability is\nmost strongly correlated with risk. In the personal survey, it is also connected to motivation through\nutility and learning goals. Thus risky and important tasks, which people want to learn, tend to require\nmore interpretability; but this may not be tied to their willingness to delegate.\n\n[Figure 2 plot data. Panels (a) and (b): histograms of delegability counts for the personal and expert surveys. Panel (c), rows and columns in the order Motivation, Dif\ufb01culty, Risk, Trust: 0.52 0.07 0.13 -0.02; 0.07 0.24 0.22 -0.15; 0.13 0.22 0.49 -0.15; -0.02 -0.15 -0.15 0.17. Panel (d), rows and columns in the order Machine Ability, Interpretability, Value Alignment: 1.00 -0.03 0.70; -0.03 1.00 -0.17; 0.70 -0.17 1.00.]\n\nTask Description | Social Skills (D) | Expertise Req (D) | Human Ability (D) | Accountability (R) | Impact (R) | Machine Ability (T) | Interpretability (T) | Delegability\nMedical diagnosis: \ufb02u | 3.4 | 4.2 | 4.6 | 4 | 4.2 | 3 | 4.2 | 2.4\nMedical treatment planning: \ufb02u | 3.6 | 3.6 | 4.4 | 4 | 3.4 | 3.6 | 3.8 | 2.4\nExplaining diagnosis & treatment options to a patient: \ufb02u | 3.4 | 3.6 | 4.2 | 3.8 | 3.2 | 3.8 | 3.2 | 2.2\nMedical diagnosis: depression | 4.4 | 4.6 | 4.6 | 4.2 | 3.8 | 2.8 | 3.4 | 2.2\nMedical treatment planning: depression | 3.8 | 3.4 | 4.4 | 4.2 | 4.4 | 3 | 4 | 2\nExplaining diagnosis & treatment options to a patient: depression | 4.4 | 4.4 | 4.2 | 4.4 | 4.4 | 2.2 | 4.4 | 1.6\nMedical diagnosis: cancer | 2.6 | 5 | 3.6 | 3.8 | 4.8 | 2.4 | 3.4 | 2\nMedical treatment planning: cancer | 3.6 | 4.6 | 4.8 | 4.4 | 4.8 | 2.4 | 3.8 | 1.6\nExplaining diagnosis & treatment options to a patient: cancer | 4.4 | 4.4 | 4.2 | 4.2 | 4.6 | 2.4 | 2.6 | 1.4\nNew employee hiring decisions | 4.4 | 3.6 | 3.8 | 3.8 | 3.8 | 2.4 | 4.4 | 2.2\nJudging a defendant\u2019s recidivism risk | 3.8 | 4 | 4.4 | 4.4 | 4.6 | 2.4 | 3.6 | 1.8\n\nTable 3: A case study of tasks from the expert survey. 
See full names of tasks in the supplementary\nmaterial. Note that we do not \ufb02ip any component in these case study tables.\n\nTask Description | Social Skills (D) | Expertise Req (D) | Human Ability (D) | Accountability (R) | Impact (R) | Intrinsic (M) | Machine Ability (T) | Delegability\nServing on jury duty | 4.6 | 2.4 | 4.6 | 4.6 | 4.2 | 3.8 | 1.2 | 1.4\nNew employee hiring decisions | 4 | 3.6 | 3.6 | 4.2 | 3 | 2.6 | 1.8 | 1.8\nReading bedtime stories to your child | 4 | 2 | 4.4 | 2.6 | 1.8 | 4.8 | 3.2 | 1.8\nScheduling an important business meeting | 3.6 | 2 | 4.4 | 3.2 | 3 | 2.6 | 3.6 | 3\n\nTable 4: A case study of tasks from the personal survey. Refer to https://delegability.github.io/table.html for a live demo.\n\n4.3 Case Studies\n\nTo further illustrate the operation of our delegability framework, we present averaged responses to\nselected tasks from the expert survey in Table 3 and the personal survey in Table 4. For the expert\ncase studies, we examine responses to medical diagnosis, recidivism risk, and hiring. Next, we observe\nthe effects of motivation on some personal tasks. Finally, some tasks such as hiring are suitable\nfor both experts and laypeople, allowing us to compare the personal and expert contexts. Though\ntrust is clearly important, we observe that it alone cannot explain differences in delegability preferences.\nExpert Survey Case Studies. The medical domain is often considered a promising area for AI, but\nhow open are patients to delegating different aspects of their healthcare to AI? We compare three\nmedical tasks (diagnosis, treatment, and explanation), each with three different illness contexts (the\n\ufb02u, depression, and cancer).\nIntuitively, \ufb02u-related tasks are seen as the most delegable \u2013 even\napproaching a delegability of 3 (human-in-the-loop) \u2013 while cancer is the least. 
Correspondingly,\nthe flu-related tasks are perceived as less difficult (lower social skill and expertise requirements,\nhigher confidence in the human expert), less risky (lower impact and accountability), and as eliciting higher\ntrust in machine ability. However, all except cancer explanation are nearest a delegability level of 2\n(machine-in-the-loop): though human control is preferred, machine assistance is valued.\nFairness decision problems such as recidivism risk and employee hiring decisions are typically characterized\nas requiring high transparency and accountability. From the responses, we see that both\ntasks are rated as difficult and risky, with low degrees of trust in AI; in fact, they look similar\nto the medical tasks under our framework. Accordingly, we again observe that both tasks result in an\naverage preference for machine-in-the-loop designs.\n\nPersonal Survey Case Studies. To observe the role of motivation, we consider \u201cReading bedtime\nstories to your child\u201d, a machine-in-the-loop task, and \u201cScheduling an important business meeting\nwith several co-workers\u201d, a human-in-the-loop task. The former carries higher motivation, while the latter\ncarries higher risk. The tasks are otherwise similar. Here, motivation appears to be the deciding factor:\nthe former\u2019s higher motivation makes it less delegable despite the lower risk.\nFinally, we compare some personal-context tasks which are similar to the above expert-context case\nstudies. \u201cServing on jury duty: deciding if a defendant is innocent or guilty\u201d is comparable to the\nhigh-risk low-trust medical and recidivism tasks, and is similarly rated as one of the tasks least delegable\nto AI. For the employee hiring task, respondents rated their self-confidence as near their\nconfidence in an expert. 
Comparing the other components directly, we observe\nhigher accountability, lower trust, and slightly lower delegability levels in the personal evaluations.\n\n5 Concluding Discussion\n\nIn this work, we present first steps towards understanding human preferences in human-AI task\ndelegation decisions. We develop an intuitive framework of motivation, difficulty, risk, and trust,\nwhich we hope will serve as a starting point for reasoning about delegation preferences across tasks.\nWe develop a survey to quantify human preferences, and validate the promise of such an empirical\napproach through correlation analysis and case studies. Our findings show a preference for machine-in-the-loop\ndesigns and a disinclination towards \u201cAI Only\u201d.\nIn developing this framework, our intent is not to suppress technology development, but rather to\nprovide an avenue for more effective human-centered automation. Human preferences regarding the\nextent of autonomy, and the reasons and motivations behind these preferences, are an understudied\nyet valuable source of information. There is a clear gap in methodologies for both understanding\nand contextualizing preferences across tasks; it is this gap that we wish to address.\n\nImplications. First, our finding of trust as the most salient factor behind delegation preferences\nsupports the community\u2019s widespread interest in trust and reliance. We find negative correlations\nbetween trust in machine abilities and the social and creative skill requirements. These are skills\ncommonly considered difficult for machines, hinting that directly measuring task characteristics\ninstead of human assessments of trust may also be an effective approach. 
We also note the low\ncorrelation between desired interpretability and delegability to AI, which points to the complex\nrelation between interpretable machine learning and trust.\nMoreover, our findings show that people do not prefer \u201cAI only\u201d, instead opting for machine-in-the-loop\ndesigns. Interestingly, even for low-trust tasks such as cancer diagnosis or babysitting, people\nare still receptive to the idea of a machine-in-the-loop assistant. We should explore paradigms that\nlet people maintain high-level control over tasks while leveraging machines as support, as in recent\nwork on clinical decision systems [46].\n\nLimitations. Our framework does not fully explain delegability preferences: the highest measured\ncorrelation is 0.59, and due to limited data, we do not explore higher-order feature interactions.\nAdditionally, human preferences are dynamic and survey results will likely evolve over time. Nevertheless,\nmapping current perceptions enables tracking any future changes, providing a mechanism to\nunderstand how basic changes in factors like machine ability manifest through trust and reliance.\nOur exploratory survey methodology also has several limitations. First, we abstracted delegability decisions\nand measured a limited number of factors, thus potentially overstating the importance of trust\nand overlooking others that occur in real situations. For example, we did not consider situational\ndetails like the trust in specific companies, or actual human or machine performance. Second, we\ndesigned the survey to avoid biasing respondents or requiring them to understand our framework conceptually.\nTraining may improve subject calibration and agreement. We also recommend consideration\nof individual baseline attitudes towards automation. Since each participant only filled out one survey,\nwe were unable to determine if individuals were consistently biased towards AI, and what effect\nthis might have had on our measurements. 
In addition, we chose only four delegability categories,\nranging from \u201chuman only\u201d to \u201cAI only\u201d. This was a deliberate abstraction choice to handle the wide\nvariety of tasks presented. However, since most responses fell into one of the two shared-control\ncategories, future studies may benefit from more fine-grained choices on shared control.\nFinally, our empirical survey is based on participants recruited on Mechanical Turk. We use a strict filter\nand attention checks to promote quality, but this sample may not be representative of the general\npopulation in the US, or of other populations with different cultural expectations.\n\nTowards a Framework. In addition to resolving the above limitations, we specifically suggest two\ncharacteristics for any framework addressing the task delegability question: 1) a characterization of\nhuman preferences towards automation and machine control; and 2) a characterization of the task\nspace, enabling the generalization of task-specific findings to other domains.\nIn particular, generalization represents a significant challenge, especially when incorporating human\nfactors. For instance, it is unclear how to generalize from physicians interacting with AI for cancer\ndiagnosis to judges interacting with AI for recidivism prediction. While our approach maps tasks\nto delegability preferences through a common set of task-automation perception factors and enables\ndirectly comparing tasks with a quantifiable perception distance, its capacity to generalize\nrequires further verification. Ultimately, we believe an effective quantification of human preferences\nand task relations will prove invaluable for the community and the public as a whole.\n\nReferences\n\n[1] Ashton Anderson, Jon M. Kleinberg, and Sendhil Mullainathan. Assessing human error against a benchmark of perfection. In Proceedings of KDD, 2016.\n\n[2] Drew Anderson. 
GLAAD and HRC call on Stanford University & responsible media to debunk dangerous & flawed report claiming to identify LGBTQ people through facial recognition technology. https://www.glaad.org/blog/glaad-and-hrc-call-stanford-university-responsible-media-debunk-dangerous-flawed-report, 2017. [Online; accessed 3-Sep-2018].\n\n[3] Albert Bandura. Human agency in social cognitive theory. 44:1175\u201384, 10 1989.\n\n[4] J. M. Bradshaw, V. Dignum, C. Jonker, and M. Sierhuis. Human-agent-robot teamwork. IEEE Intelligent Systems, 27(2):8\u201313, March 2012. ISSN 1541-1672. doi: 10.1109/MIS.2012.37.\n\n[5] Steve Branson, Catherine Wah, Florian Schroff, Boris Babenko, Peter Welinder, Pietro Perona, and Serge Belongie. Visual recognition with humans in the loop. In Proceedings of ECCV, 2010.\n\n[6] Cristiano Castelfranchi and Rino Falcone. Towards a theory of delegation for agent-based systems. Robotics and Autonomous Systems, 24(3):141\u2013157, 1998. ISSN 0921-8890. doi: https://doi.org/10.1016/S0921-8890(98)00028-1. URL http://www.sciencedirect.com/science/article/pii/S0921889098000281. Multi-Agent Rationality.\n\n[7] Jie-Zhi Cheng, Dong Ni, Yi-Hong Chou, Jing Qin, Chui-Mei Tiu, Yeun-Chung Chang, Chiun-Sheng Huang, Dinggang Shen, and Chung-Ming Chen. Computer-aided diagnosis with deep learning architecture: applications to breast lesions in US images and pulmonary nodules in CT scans. Scientific Reports, 6:24454, 2016.\n\n[8] Elizabeth Clark, Anne Spencer Ross, Chenhao Tan, Yangfeng Ji, and Noah A Smith. Creative writing with a machine in the loop: Case studies on slogans and stories. In Proceedings of IUI, 2018.\n\n[9] Finale Doshi-Velez and Been Kim. Towards a rigorous science of interpretable machine learning. arXiv preprint arXiv:1702.08608, 2017.\n\n[10] Isil Erel, L\u00e9a H Stern, Chenhao Tan, and Michael S Weisbach. Selecting directors using machine learning. 
Technical report, National Bureau of Economic Research, 2018.\n\n[11] Jerry Alan Fails and Dan R Olsen Jr. Interactive machine learning. In Proceedings of IUI, 2003.\n\n[12] Rasool Fakoor, Faisal Ladhak, Azade Nazi, and Manfred Huber. Using deep learning to enhance cancer diagnosis and classification. In Proceedings of ICML, volume 28, 2013.\n\n[13] Paul M Fitts, MS Viteles, NL Barr, DR Brimhall, Glen Finch, Eric Gardner, WF Grether, WE Kellum, and SS Stevens. Human engineering for an effective air-navigation and traffic-control system, and appendixes 1 thru 3. Technical report, Ohio State University Research Foundation Columbus, 1951.\n\n[14] Carl Benedikt Frey and Michael A. Osborne. The future of employment: How susceptible are jobs to computerisation? Technological Forecasting and Social Change, 114:254\u2013280, 2013. ISSN 0040-1625. doi: https://doi.org/10.1016/j.techfore.2016.08.019. URL http://www.sciencedirect.com/science/article/pii/S0040162516302244.\n\n[15] Matthew Gombolay, Xi Jessie Yang, Bradley Hayes, Nicole Seo, Zixi Liu, Samir Wadhwania, Tania Yu, Neel Shah, Toni Golen, and Julie Shah. Robotic assistance in the coordination of patient care. The International Journal of Robotics Research, page 0278364918778344, 2016.\n\n[16] Ben Green and Yiling Chen. Disparate interactions: An algorithm-in-the-loop analysis of fairness in risk assessments. In Proceedings of FAT*, 2019.\n\n[17] Ben Green and Yiling Chen. The principles and limits of algorithm-in-the-loop decision making. In Proceedings of CSCW. ACM, 2019.\n\n[18] Spence Green, Jeffrey Heer, and Christopher D. Manning. Natural language translation at the intersection of AI and HCI. Queue, 13(6):30:30\u201330:42, June 2015. ISSN 1542-7730. doi: 10.1145/2791301.2798086. URL http://doi.acm.org/10.1145/2791301.2798086.\n\n[19] Andrew F Hayes and Klaus Krippendorff. Answering the call for a standard reliability measure for coding data. 
Communication Methods and Measures, 1(1):77\u201389, 2007.\n\n[20] Eric Horvitz. Principles of mixed-initiative user interfaces. In Proceedings of CHI, pages 159\u2013166. ACM, 1999.\n\n[21] Matthew Johnson, Jeffrey M Bradshaw, Paul J Feltovich, Catholijn M Jonker, M Birna Van Riemsdijk, and Maarten Sierhuis. Coactive design: Designing support for interdependence in joint activity. Journal of Human-Robot Interaction, 3(1):43\u201369, 2014.\n\n[22] Been Kim, Rajiv Khanna, and Oluwasanmi O Koyejo. Examples are not enough, learn to criticize! Criticism for interpretability. In Proceedings of NIPS, 2016.\n\n[23] Jon Kleinberg, Himabindu Lakkaraju, Jure Leskovec, Jens Ludwig, and Sendhil Mullainathan. Human decisions and machine predictions. The Quarterly Journal of Economics, 133(1):237\u2013293, 2018. doi: 10.1093/qje/qjx032. URL http://dx.doi.org/10.1093/qje/qjx032.\n\n[24] Vivian Lai and Chenhao Tan. On human predictions with explanations and predictions of machine learning models: A case study on deception detection. In Proceedings of FAT*, 2019.\n\n[25] Gary P Latham and Craig C Pinder. Work motivation theory and research at the dawn of the twenty-first century. Annu. Rev. Psychol., 56:485\u2013516, 2005.\n\n[26] John D. Lee and Neville Moray. Trust, self-confidence, and operators\u2019 adaptation to automation. International Journal of Human-Computer Studies, 40(1):153\u2013184, 1994. ISSN 1071-5819. doi: https://doi.org/10.1006/ijhc.1994.1007. URL http://www.sciencedirect.com/science/article/pii/S107158198471007X.\n\n[27] John D Lee and Katrina A See. Trust in automation: Designing for appropriate reliance. Human Factors: The Journal of Human Factors and Ergonomics Society, 46:50\u201380, 2004.\n\n[28] Stephan Lewandowsky, Michael Mundy, and Gerard P. A. Tan. The dynamics of trust: Comparing humans to automation. 
Journal of Experimental Psychology: Applied, 6:104\u2013123, 2000.\n\n[29] Geert Litjens, Clara I S\u00e1nchez, Nadya Timofeeva, Meyke Hermsen, Iris Nagtegaal, Iringo Kovacs, Christina Hulsbergen-Van De Kaa, Peter Bult, Bram Van Ginneken, and Jeroen Van Der Laak. Deep learning as a tool for increased accuracy and efficiency of histopathological diagnosis. Scientific Reports, 6:26286, 2016.\n\n[30] Edwin Locke. Motivation, cognition, and action: An analysis of studies of task goals and knowledge. Applied Psychology, 49(3):408\u2013429, 2000. doi: 10.1111/1464-0597.00023. URL https://onlinelibrary.wiley.com/doi/abs/10.1111/1464-0597.00023.\n\n[31] Edwin Locke and Gary Latham. Building a practically useful theory of goal setting and task motivation - a 35-year odyssey. 57:705\u201317, 10 2002.\n\n[32] Gianluca Mezzofiore. That AI study which claims to guess whether you\u2019re gay or straight is flawed and dangerous. https://mashable.com/2017/09/11/artificial-intelligence-ai-lgbtq-gay-straight/, 2017. [Online; accessed 3-Sep-2018].\n\n[33] Allen Milewski and Steven Lewis. When people delegate. 11 2014.\n\n[34] Heather Murphy. Why Stanford researchers tried to create a \u2018gaydar\u2019 machine. https://www.nytimes.com/2017/10/09/science/stanford-sexual-orientation-study.html, 2017. [Online; accessed 3-Sep-2018].\n\n[35] Andrew Ng. Artificial intelligence is the new electricity. In presentation at the Stanford MSx Future Forum, 2017.\n\n[36] Raja Parasuraman, Thomas B Sheridan, and Christopher D Wickens. A model for types and levels of human interaction with automation. IEEE Transactions on Systems, Man, and Cybernetics-Part A: Systems and Humans, 30(3):286\u2013297, 2000.\n\n[37] LeeAnn Perkins, Janet E. Miller, Ali Hashemi, and Gary Burns. Designing for human-centered systems: Situational risk as a factor of trust in automation, 2010.\n\n[38] Harold E. Price. 
The allocation of functions in systems. Human Factors, 27(1):33\u201345, 1985. doi: 10.1177/001872088502700104.\n\n[39] Marco Tulio Ribeiro, Sameer Singh, and Carlos Guestrin. Why should I trust you?: Explaining the predictions of any classifier. In Proceedings of KDD, 2016.\n\n[40] Marco Tulio Ribeiro, Sameer Singh, and Carlos Guestrin. Anchors: High-precision model-agnostic explanations. In Proceedings of AAAI, 2018.\n\n[41] Melissa Roemmele and Andrew S. Gordon. Creative help: A story writing assistant. In Proceedings of ICDIS, 2015.\n\n[42] Klaus Schwab. The fourth industrial revolution. Crown Business, 2017.\n\n[43] Nathan Stout, Alan Dennis, and Taylor Wells. The buck stops there: The impact of perceived accountability and control on the intention to delegate to software agents. AIS Transactions on Human-Computer Interaction, 6, 03 2014.\n\n[44] Richard E Susskind and Daniel Susskind. The future of the professions: How technology will transform the work of human experts. Oxford University Press, USA, 2015.\n\n[45] Yilun Wang and Michal Kosinski. Deep neural networks are more accurate than humans at detecting sexual orientation from facial images. Journal of Personality and Social Psychology, 114(2):246\u2013257, 2018.\n\n[46] Qian Yang, Aaron Steinfeld, and John Zimmerman. Unremarkable AI: Fitting intelligent decision support into critical, clinical decision-making processes. In Proceedings of CHI, 2019.\n", "award": [], "sourceid": 39, "authors": [{"given_name": "Brian", "family_name": "Lubars", "institution": "University of Colorado Boulder"}, {"given_name": "Chenhao", "family_name": "Tan", "institution": "University of Colorado Boulder"}]}