{"title": "Making the Cut: A Bandit-based Approach to Tiered Interviewing", "book": "Advances in Neural Information Processing Systems", "page_first": 4639, "page_last": 4649, "abstract": "Given a huge set of applicants, how should a firm allocate sequential resume screenings, phone interviews, and in-person site visits? In a tiered interview process, later stages (e.g., in-person visits) are more informative, but also more expensive than earlier stages (e.g., resume screenings). Using accepted hiring models and the concept of structured interviews, a best practice in human resources, we cast tiered hiring as a combinatorial pure exploration (CPE) problem in the stochastic multi-armed bandit setting. The goal is to select a subset of arms (in our case, applicants) with some combinatorial structure. We present new algorithms in both the probably approximately correct (PAC) and fixed-budget settings that select a near-optimal cohort with provable guarantees. We show via simulations on real data from one of the largest US-based computer science graduate programs that our algorithms make better hiring decisions or use less budget than the status quo.", "full_text": "A Bandit-based Approach to Tiered Interviewing\n\nMaking the Cut:\n\nCandice Schumann?\n\nZhi Lang?\n\nJeffrey S. Foster\u2020\n\nJohn P. Dickerson?\n\n{schumann,zlang}@cs.umd.edu, jfoster@cs.tufts.edu, john@cs.umd.edu\n\n?University of Maryland\n\n\u2020Tufts University\n\nAbstract\n\nGiven a huge set of applicants, how should a \ufb01rm allocate sequential resume\nscreenings, phone interviews, and in-person site visits? In a tiered interview process,\nlater stages (e.g., in-person visits) are more informative, but also more expensive\nthan earlier stages (e.g., resume screenings). Using accepted hiring models and\nthe concept of structured interviews, a best practice in human resources, we cast\ntiered hiring as a combinatorial pure exploration (CPE) problem in the stochastic\nmulti-armed bandit setting. 
The goal is to select a subset of arms (in our case, applicants) with some combinatorial structure. We present new algorithms in both the probably approximately correct (PAC) and fixed-budget settings that select a near-optimal cohort with provable guarantees. We show via simulations on real data from one of the largest US-based computer science graduate programs that our algorithms make better hiring decisions or use less budget than the status quo.

"... nothing we do is more important than hiring and developing people. At the end of the day, you bet on people, not on strategies." – Lawrence Bossidy, The CEO as Coach (1995)

1 Introduction

Hiring workers is expensive and lengthy. The average cost-per-hire in the United States is $4,129 [Society for Human Resource Management, 2016], and with over five million hires per month on average, total annual hiring cost in the United States tops hundreds of billions of dollars [United States Bureau of Labor Statistics, 2018]. In the past decade, the average length of the hiring process has doubled to nearly one month [Chamberlain, 2017]. At every stage, firms expend resources to learn more about each applicant's true quality, and choose to either cut that applicant or continue interviewing with the intention of offering employment.
In this paper, we address the problem of a firm hiring a cohort of multiple workers, each with unknown true utility, over multiple stages of structured interviews. We operate under the assumption that a firm is willing to spend an increasing amount of resources (e.g., money or time) on applicants as they advance to later stages of interviews. Thus, the firm is motivated to aggressively "pare down" the applicant pool at every stage, culling low-quality workers so that resources are better spent in more costly later stages. 
This concept of tiered hiring can be extended to crowdsourcing or finding a cohort of trusted workers: at each successive stage, crowdsourced workers are given harder tasks.
Using techniques from the multi-armed bandit (MAB) and submodular optimization literature, we present two new algorithms, in the probably approximately correct (PAC) setting (§3) and the fixed-budget setting (§4), and prove upper bounds that select a near-optimal cohort in this restricted setting. We explore those bounds in simulation and show that the restricted setting is not necessary in practice (§5). Then, using real data from admissions to a large US-based computer science Ph.D. program, we show that our algorithms yield better hiring decisions at equivalent cost to the status quo, or comparable hiring decisions at lower cost (§5).

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.

2 A Formal Model of Tiered Interviewing

In this section, we provide a brief overview of related work, give necessary background for our model, and then formally define our general multi-stage combinatorial MAB problem. Each of our n applicants is an arm a in the full set of arms A. Our goal is to select K < n arms that maximize some objective w using a maximization oracle. We split the review/interview process into m stages, such that each stage i ∈ [m] has per-interview information gain si, cost ji, and number of required arms Ki (representing the size of the "short list" of applicants who proceed to the next round). We want to solve this problem using either a confidence constraint (δ, ε) or a budget constraint over each stage (Ti). We rigorously define each of these inputs below.

Multi-armed bandits. The multi-armed bandit problem allows for modeling resource allocation during sequential decision making. Bubeck et al. 
[2012] provide a general overview of historic research in this field. In a MAB setting there is a set of n arms A. Each arm a ∈ A has a true utility u(a) ∈ [0, 1], which is unknown. When an arm a ∈ A is pulled, a reward is drawn from a distribution with mean u(a) and a σ-sub-Gaussian tail. These pulls give an empirical estimate û(a) of the underlying utility, and an uncertainty bound rad(a) around the empirical estimate, i.e., û(a) − rad(a) < u(a) < û(a) + rad(a) with some probability (governed by δ). Once arm a is pulled (e.g., an application is reviewed or an interview is performed), û(a) and rad(a) are updated.

Top-K and subsets. Traditionally, MAB problems focus on selecting a single best (i.e., highest-utility) arm. Recently, MAB formulations have been proposed that select an optimal subset of K arms. Bubeck et al. [2013] propose a budgeted algorithm (SAR) that successively accepts and rejects arms. We build on work by Chen et al. [2014], which generalizes SAR to a setting with a combinatorial objective. They also outline a fixed-confidence version of the combinatorial MAB problem. In the Chen et al. [2014] formulation, the overall goal is to choose an optimal cohort M* from a decision class M. In this work, we use the decision class MK(A) = {M ⊆ A : |M| = K}. A cohort is optimal if it maximizes a linear objective function w : R^n × MK(A) → R. Chen et al. [2014] rely on a maximization oracle, as do we, defined as

    OracleK(û, A) = argmax_{M ∈ MK(A)} w(û, M).    (1)

Chen et al. [2014] define a gap score for each arm a in the optimal cohort M*, which is the difference in combinatorial utility between M* and the best cohort without arm a. For each arm a not in the optimal set M*, the gap score is the difference in combinatorial utility between M* and the best set with arm a. 
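For a Top-K objective, the maximization oracle and gap scores just described have simple closed forms. The following is a minimal sketch (the function names `top_k_oracle` and `gap_score` are our own, not code from the paper):

```python
def top_k_oracle(u, K):
    """Equation (1) for the Top-K objective: the K highest-utility arms."""
    return set(sorted(u, key=u.get, reverse=True)[:K])

def gap_score(u, K, a):
    """Gap of arm a: utility lost by forcing a out of (or into) the best set."""
    w = lambda M: sum(u[x] for x in M)
    best = top_k_oracle(u, K)
    others = sorted((x for x in u if x != a), key=u.get, reverse=True)
    if a in best:
        alt = set(others[:K])            # best cohort that excludes a
    else:
        alt = {a} | set(others[:K - 1])  # best cohort that includes a
    return w(best) - w(alt)

# Toy example with four hypothetical arms.
u = {"a": 0.9, "b": 0.8, "c": 0.5, "d": 0.1}
assert top_k_oracle(u, 2) == {"a", "b"}
```

For Top-K, the gap of an accepted arm reduces to its utility minus that of the best rejected arm, and vice versa; arms near the accept/reject border have small gaps.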
Formally, for any arm a ∈ A, the gap score Δa is defined as

    Δa = w(M*) − max_{M ∈ MK : a ∈ M} w(M),   if a ∉ M*,
    Δa = w(M*) − max_{M ∈ MK : a ∉ M} w(M),   if a ∈ M*.    (2)

Using this gap score we estimate the hardness of a problem as the sum of inverse squared gaps:

    H = Σ_{a ∈ A} Δa^{−2}.    (3)

This helps determine how easy it is to differentiate between arms at the border of accept/reject.

Objectives. Cao et al. [2015] tighten the bounds of Chen et al. [2014] where the objective function is Top-K, defined as

    wTOP(u, M) = Σ_{a ∈ M} u(a).    (4)

In this setting the objective is to pick the K arms with the highest utility. Jun et al. [2016] look at the Top-K MAB problem with batch arm pulls, and Singla et al. [2015a] look at the Top-K problem from a crowdsourcing point of view.
In this paper, we explore a different type of objective that balances both individual utility and the diversity of the set of arms returned. Research has shown that a more diverse workforce produces better products and increases productivity [Desrochers, 2001; Hunt et al., 2015]. Thus, such an objective is of interest to our application of hiring workers. In the document summarization setting, Lin and Bilmes [2011] introduced a submodular diversity function where the arms are partitioned into q disjoint groups P1, ..., Pq:

    wDIV(u, M) = Σ_{i=1}^{q} √( Σ_{a ∈ Pi ∩ M} u(a) ).    (5)

Nemhauser et al. [1978] prove theoretical bounds for the simple greedy algorithm that selects a set that maximizes a submodular, monotone function. Krause and Golovin [2014] overview submodular optimization in general. Singla et al. [2015b] propose an algorithm for maximizing an unknown function, and Ashkan et al. [2015] introduce a greedy algorithm that optimally solves the problem of diversification if that diversity function is submodular and monotone. Radlinski et al. 
[2008] learn a diverse ranking from behavior patterns of different users by using multiple MAB instances. Yue and Guestrin [2011] introduce the linear submodular bandits problem to select diverse sets of content while optimizing for a class of feature-rich submodular utility models. Each of these papers uses submodularity to promote some notion of diversity. Using this as motivation, we empirically show that we can hire a diverse cohort of workers (Section 5).

Variable costs. In many real-world settings, there are different ways to gather information, each of which varies in cost and effectiveness. Ding et al. [2013] looked at a regret-minimization MAB problem in which, when an arm is pulled, a random reward is received and a random cost is taken from the budget. Xia et al. [2016] extend this work to a batch arm pull setting. Jain et al. [2014] use MABs with variable rewards and costs to solve a crowdsourcing problem. While we also assume non-unit costs and rewards, our setting is different from each of these, in that we actively choose how much to spend on each arm pull.
Interviews allow firms to compare applicants. Structured interviews treat each applicant the same by following the same questions and scoring strategy, allowing for meaningful cross-applicant comparison. A substantial body of research shows that structured interviews serve as better predictors of job success and reduce bias across applicants when compared to traditional methods [Harris, 1989; Posthuma et al., 2002]. As decision-making becomes more data-driven, firms look to demonstrate a link between hiring criteria and applicant success, and increasingly adopt structured interview processes [Kent and McCarthy, 2016; Levashina et al., 2014].
Motivated by the structured interview paradigm, Schumann et al. [2019] introduced a concept of "weak" and "strong" pulls in the Strong Weak Arm Pull (SWAP) algorithm. 
SWAP probabilistically chooses to strongly or weakly pull an arm. Inspired by that work, we associate pulls with a cost j ≥ 1 and an information gain s ≤ j, where a pull receives a reward drawn from a distribution with a σ/√s-sub-Gaussian tail, but incurs cost j. Information gain s relates to the confidence of accept/reject from an interview versus a review. As stages get more expensive, the estimates of utility become more precise: the estimate comes from a distribution with a lower variance. In practice, a resume review may make a candidate seem much stronger than they are, or a badly written resume could severely underestimate their abilities. However, in-person interviews give better estimates. A strong arm pull with information gain s is equivalent to s weak pulls because it is equivalent to pulling from a distribution with a σ/√s sub-Gaussian tail; that is, it yields a (probably) closer estimate. Schumann et al. [2019] only allow for two types of arm pulls, and they do not account for the structure of current tiered hiring frameworks; nevertheless, in Section 5, we extend (as best we can) their model to our setting and compare against it as part of our experimental testbed.

Generalizing to multiple stages. This paper, to our knowledge, gives the first computational formalization of tiered structured interviewing. We build on hiring models from the behavioral science literature [Vardarlier et al., 2014; Breaugh and Starke, 2000] in which the hiring process starts at recruitment and follows several stages, concluding with successful hiring. We model these m successive stages as having an increased cost (in-person interviews cost more than phone interviews, which in turn cost more than simple résumé screenings) but returning additional information via the score given to an applicant. For each stage i ∈ [m] the user defines a cost ji and an information gain si for the type of pull (type of interview) being used in that stage. 
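The pull model above can be checked with a short simulation. This is our own sketch under an assumed Gaussian reward distribution (σ = 0.2 and the true utility are made-up values): one pull with information gain s should have the same spread as the average of s unit-gain pulls.

```python
import random
import statistics

SIGMA = 0.2          # assumed noise scale of a single weak pull
TRUE_UTILITY = 0.6   # hypothetical true utility u(a)

def pull(u, s, rng):
    """One pull with information gain s: noise shrinks by a factor sqrt(s)."""
    return rng.gauss(u, SIGMA / s ** 0.5)

rng = random.Random(0)
# Empirical spread of single pulls with information gain s = 9 ...
strong = [pull(TRUE_UTILITY, 9, rng) for _ in range(20000)]
# ... versus averages of 9 weak (unit-gain) pulls.
weak_avg = [statistics.fmean(pull(TRUE_UTILITY, 1, rng) for _ in range(9))
            for _ in range(20000)]
print(round(statistics.stdev(strong), 3), round(statistics.stdev(weak_avg), 3))
```

Both empirical standard deviations come out close to SIGMA/3, matching the claim that a gain-s pull is statistically equivalent to s weak pulls.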
During each stage, Ki arms move on to the next stage (we cut off Ki−1 − Ki arms), where n = K0 > K1 > ··· > Km−1 > Km = K. The user must therefore define Ki for each i ∈ [m − 1]. The arms chosen to move on to the next stage are denoted as Am ⊂ Am−1 ⊂ ··· ⊂ A1 ⊂ A0 = A.

Tiered MAB and interviewing stages. Our formulation was initially motivated by the graduate admissions system run at our university. Here, at every stage, it is possible for multiple independent reviewers to look at an applicant. Indeed, our admissions committee strives to hit at least two written reviews per application package, before potentially considering one or more Skype/Hangouts calls with a potential applicant. (In our data, for instance, some applicants received up to 6 independent reviews per stage.)
While motivated by academic admissions, we believe our model is of broad interest to industry as well. For example, in the tech industry, it is common to allocate more (or fewer) 30-minute one-on-one interviews on a visit day, and/or multiple pre-visit programming screening teleconference calls. Similarly, in management consulting [Hunt et al., 2015], it is common to repeatedly give independent "case study" interviews to borderline candidates.

3 Probably Approximately Correct Hiring

In this section, we present Cutting Arms using a Combinatorial Oracle (CACO), the first of two multi-stage algorithms for selecting a cohort of arms with provable guarantees. CACO is a probably approximately correct (PAC) [Haussler and Warmuth, 1993] algorithm that performs interviews over m stages, for a user-supplied parameter m, before returning a final subset of K arms.
Algorithm 1 provides pseudocode for CACO. 
The algorithm requires several user-supplied parameters in addition to the standard PAC-style confidence parameters (δ, the confidence probability, and ε, the error), including the total number of stages m; pairs (si, ji) for each stage i ∈ [m] representing the information gain si and cost ji associated with each arm pull; the number Ki of arms to remain at the end of each stage i ∈ [m]; and a maximization oracle. After each stage i is complete, CACO removes all but Ki arms. The algorithm tracks these "active" arms, denoted by Ai−1 for each stage i, the total cost Cost that accumulates over time when pulling arms, and per-arm information such as the empirical utility û(a) and total information gain T(a). For example, if arm a has been pulled once in stage 1 and twice in stage 2, then T(a) = s1 + 2s2.

Algorithm 1 Cutting Arms using a Combinatorial Oracle (CACO)
Require: Confidence δ ∈ (0, 1); error ε ∈ (0, 1); Oracle; number of stages m; (si, ji, Ki) for each stage i
1:  A0 ← A
2:  for stage i = 1, . . . , m do
3:    Pull each a ∈ Ai−1 once using the given si, ji pair
4:    Update empirical means û
5:    Cost ← Cost + Ki−1 · ji
6:    for t = 1, 2, . . . do
7:      Ai ← OracleKi(û)
8:      for a ∈ Ai−1 do
9:        rad(a) ← σ √( 2 log(4 |A| Cost³ / δ) / T(a) )
10:       if a ∈ Ai then ũ(a) ← û(a) − rad(a)
11:       else ũ(a) ← û(a) + rad(a)
12:     Ãi ← OracleKi(ũ)
13:     if |w(Ãi) − w(Ai)| < ε then break
14:     p ← argmax_{a ∈ (Ãi \ Ai) ∪ (Ai \ Ãi)} rad(a)
15:     Pull arm p using the given si, ji pair
16:     Update û(p) with the observed reward
17:     T(p) ← T(p) + si
18:     Cost ← Cost + ji
19: Out ← Am; return Out

CACO begins with all arms active (line 1). Each stage i starts by pulling each active arm once using the given (si, ji) pair to initialize or update empirical utilities (line 3). It then pulls arms until a confidence level is triggered, removes all but Ki arms, and continues to the next stage. In a stage i, CACO proceeds in rounds indexed by t. In each round, the algorithm first finds a set Ai of size Ki using the maximization oracle and the current empirical means û (line 7). Then, given a confidence radius (line 9), it computes pessimistic estimates ũ(a) of the true utilities of each arm a and uses the oracle to find a set of arms Ãi under these pessimistic assumptions (lines 10-12). If those two sets are "close enough" (within ε), CACO proceeds to the next stage (line 13). Otherwise, across all arms a in the symmetric difference between Ai and Ãi, the arm p with the most uncertainty over its true utility, determined via rad(a), is pulled (lines 14-15). At the end of the last stage m, CACO returns a final set of K active arms that approximately maximizes an objective function (line 19).
We prove a bound on CACO in Theorem 1. As a special case of this theorem, when only a single stage of interviewing is desired, and as ε → 0, Algorithm 1 reduces to Chen et al. [2014]'s CLUCB, and our bound then reduces to their upper bound for CLUCB. This bound provides insights into the trade-offs of Cost, information gain s, problem hardness H (Equation 3), and shortlist size Ki. Given the Cost and information gain s parameters, Theorem 1 provides a tighter bound than those for CLUCB.

Theorem 1. Given any δ ∈ (0, 1), any ε ∈ (0, 1), any decision classes Mi ⊆ 2^[n] for each stage i ∈ [m], any linear function w, and any expected rewards u ∈ R^n, assume that the reward distribution φa for each arm a ∈ [n] has mean u(a) with a σ-sub-Gaussian tail. Let M*i = argmax_{M ∈ Mi} w(M) denote the optimal set in stage i ∈ [m]. 
Set rad_t(a) = σ √( 2 log( 4 Ki−1 Cost³_{i,t} / δ ) / T_{i,t}(a) ) for all t > 0 and a ∈ [n]. Then, with probability at least 1 − δ, the CACO algorithm (Algorithm 1) returns the set Out where w(Out) − w(M*m) < ε and

    T ≤ O( σ² Σ_{i∈[m]} (ji/si) ( Σ_{a∈Ai−1} min{ 1/Δa², Ki²/ε² } ) log( σ² Σ_{a∈Ai−1} min{ 1/Δa², Ki²/ε² } ) ).

Theorem 1 gives a bound relative to problem-specific parameters such as the gap scores Δa (Equation 2), inter-stage cohort sizes Ki, and so on. Figure 1 lends intuition as to how CACO changes with respect to these inputs, in terms of problem hardness (defined in Eq. 3). When a problem is easy (gap scores Δa are large and hardness H becomes small), the min parts of the bound are dominated by gap scores Δa, and there is a smooth increase in total cost. When the problem gets harder (gap scores Δa are small and hardness H becomes large), the mins are dominated by Ki²/ε² and the cost is noisy but bounded below. When ε or δ increases, the lower bounds of the noisy section decrease, with the impact of ε dominating that of δ. A policymaker can use these high-level trade-offs to determine hiring-mechanism parameters. For example, assume there are two interview stages. As the number K1 of applicants who pass the first interview stage increases, so too does total cost T. However, if K1 is too small (here, very close to the final cohort size K), then the cost also increases.

Figure 1: Hardness (H) vs theoretical cost (T) as user-specified parameters to the CACO algorithm change.

4 Hiring on a Fixed Budget with BRUTAS

In many hiring situations, a firm or committee has a fixed budget for hiring (number of phone interviews, total dollars to spend on hosting, and so on). 
With that in mind, in this section, we present Budgeted Rounds Updated Targets Successively (BRUTAS), a tiered-interviewing algorithm in the fixed-budget setting.
Algorithm 2 provides pseudocode for BRUTAS, which takes as input fixed budgets T̄i for each stage i ∈ [m], where Σ_{i∈[m]} T̄i = T̄, the total budget. In this version of the tiered-interview problem, we also know how many decisions (whether to accept or reject an arm) we need to make in each stage. This is slightly different than in the CACO setting (§3), where we need to remove all but Ki arms at the conclusion of each stage i. We make this change to align with the CSAR setting of Chen et al. [2014], which BRUTAS generalizes. In this setting, let K̃i represent how many decisions we need to make at stage i ∈ [m]; thus, Σ_{i∈[m]} K̃i = n. The K̃i's are independent of K, the final number of arms we want to accept, except that the total number of accept decisions across all stages must sum to K.
The budgeted setting uses a constrained oracle COracle : R^n × 2^[n] × 2^[n] → M ∪ {⊥}, defined as

    COracle(û, A, B) = argmax_{M ∈ MK : A ⊆ M, B ∩ M = ∅} w(û, M),

where A is the set of arms that have been accepted and B is the set of arms that have been rejected.
In each stage i ∈ [m], BRUTAS starts by collecting the accept and reject sets from the previous stage. It then proceeds through K̃i rounds, indexed by t, and selects a single arm to place in the accept set A or the reject set B. In a round t, it first pulls each active arm (arms not in A or B) a total of T̃i,t − T̃i,t−1 times using the appropriate si and ji values. T̃i,t is set according to line 6 of Algorithm 2; note that T̃i,0 = 0. Once all the empirical means for each active arm have been updated, the constrained oracle is run to find the empirical best set Mi,t (line 9). 
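For the Top-K objective, the constrained oracle has a simple closed form: keep the forced accepts, drop the rejects, and fill the remaining slots with the best free arms. A minimal sketch (our own function name; `None` stands in for the paper's ⊥):

```python
def coracle(u, K, accept, reject):
    """Best size-K set containing every arm in `accept` and none in `reject`,
    under the Top-K objective; None if no feasible set exists."""
    if len(accept) > K or not accept.isdisjoint(reject):
        return None
    free = [a for a in u if a not in accept and a not in reject]
    need = K - len(accept)
    if need > len(free):
        return None
    free.sort(key=u.get, reverse=True)
    return set(accept) | set(free[:need])

# Toy example with four hypothetical arms.
u = {"a": 0.9, "b": 0.8, "c": 0.5, "d": 0.1}
assert coracle(u, 2, {"d"}, set()) == {"a", "d"}   # forced accept of d
assert coracle(u, 2, set(), {"a"}) == {"b", "c"}   # forced reject of a
```

BRUTAS calls this oracle both to find the empirical best set and, with one arm temporarily flipped into the opposite set, to estimate that arm's gap.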
For each active arm a, a new pessimistic set M̃i,t,a is found (lines 11-15): arm a is temporarily placed in the accept set A if a is not in Mi,t, or in the reject set B if a is in Mi,t. This is done to calculate the gap that arm a creates (Equation 2). The arm pi,t with the largest gap is selected and placed in the accept set A if pi,t was included in Mi,t, or placed in the reject set B otherwise (lines 16-20). Once all rounds are complete, the final accept set A is returned.
Theorem 2 provides a lower bound on the confidence that BRUTAS returns the optimal set. Note that if there is only a single stage, then Algorithm 2 reduces to Chen et al. [2014]'s CSAR algorithm, and our Theorem 2 reduces to their upper bound for CSAR. Again, Theorem 2 provides tighter bounds than those for CSAR given the parameters for information gain sb and arm-pull cost jb.

Theorem 2. Given any T̄i's such that Σ_{i∈[m]} T̄i = T̄ > n, any decision class MK ⊆ 2^[n], any linear function w, and any true expected rewards u ∈ R^n, assume that the reward distribution φa for each arm a ∈ [n] has mean u(a) with a σ-sub-Gaussian tail. Let Δ(1), . . . , Δ(n) be a permutation of Δ1, . . . , Δn (defined in Eq. 2) such that Δ(1) ≤ . . . ≤ Δ(n). Define H̃ ≜ max_{i∈[n]} i Δ(i)^{−2}. Then, Algorithm 2 uses at most T̄i samples per stage i ∈ [m] and outputs a solution Out ∈ MK ∪ {⊥} such that

    Pr[Out ≠ M*] ≤ n² exp( − ( Σ_{b=1}^{m} sb (T̄b − K̃b) / (jb flog(K̃b)) ) / (72 σ² H̃) ),    (6)

where flog(n) ≜ Σ_{i=1}^{n} 1/i, and M* = argmax_{M∈MK} w(M).

Algorithm 2 Budgeted Rounds Updated Targets Successively (BRUTAS)
Require: Budgets T̄i ∀i ∈ [m]; (si, ji, K̃i) for each stage i; constrained oracle COracle
1:  Define flog(n) ≜ Σ_{i=1}^{n} 1/i
2:  A0,1 ← ∅; B0,1 ← ∅
3:  for stage i = 1, . . . , m do
4:    Ai,1 ← Ai−1,K̃i−1+1; Bi,1 ← Bi−1,K̃i−1+1; T̃i,0 ← 0
5:    for t = 1, . . . , K̃i do
6:      T̃i,t ← ⌈ (T̄i − (n − Σ_{a=0}^{i−1} K̃a)) / ( flog(n − Σ_{a=0}^{i−1} K̃a) · ji · (K̃i − t + 1) ) ⌉
7:      foreach a ∈ [n] \ (Ai,t ∪ Bi,t) do
8:        Pull a (T̃i,t − T̃i,t−1) times; update ûi,t(a)
9:      Mi,t ← COracle(ûi,t, Ai,t, Bi,t)
10:     if Mi,t = ⊥ then return ⊥
11:     foreach a ∈ [n] \ (Ai,t ∪ Bi,t) do
12:       if a ∈ Mi,t then
13:         M̃i,t,a ← COracle(ûi,t, Ai,t, Bi,t ∪ {a})
14:       else
15:         M̃i,t,a ← COracle(ûi,t, Ai,t ∪ {a}, Bi,t)
16:     pi,t ← argmax_{a ∈ [n] \ (Ai,t ∪ Bi,t)} w(Mi,t) − w(M̃i,t,a)
17:     if pi,t ∈ Mi,t then
18:       Ai,t+1 ← Ai,t ∪ {pi,t}; Bi,t+1 ← Bi,t
19:     else
20:       Ai,t+1 ← Ai,t; Bi,t+1 ← Bi,t ∪ {pi,t}
21: Out ← Am,K̃m+1; return Out

When setting the budget for each stage, a policymaker should ensure there is sufficient budget for the number of arms in each stage i, and for the given exogenous cost values ji associated with interviewing at that stage. There is also a balance between the number of decisions that must be made in a given stage i and the ratio si/ji of interview information gain to cost. Intuitively, giving a higher budget to stages with a higher si/ji ratio makes sense, but one also would not want to make all accept/reject decisions in those stages, since more decisions correspond to lower confidence. Generally, arms with high gap scores Δa are accepted/rejected in the earlier stages, while arms with low gap scores Δa are accepted/rejected in the later stages. The policymaker should look at past decisions to estimate gap scores Δa (Equation 2) and hardness H (Equation 3). There is a clear trade-off between information gain and cost. If the policymaker assumes (based on past data) that the gap scores will be high (it is easy to differentiate between applicants) then the lower stages should have a high Ki, and a budget to match the relevant cost ji. If the gap scores are all low (it is hard to differentiate between applicants) then more decisions should be made in the higher, more expensive stages. By looking at the ratio of small gap scores to high gap scores, or by bucketing gap scores, a policymaker will be able to set each Ki.

Figure 2: Comparison of Cost vs information gain (s) as ε increases for CACO. Here, δ = 0.05 and σ = 0.2. As ε increases, the cost of the algorithm decreases. If the overall cost of the algorithm is low, then increasing s (while keeping j constant) provides diminishing returns.

¹For detailed figures, see Appendix E.

5 Experiments

In this section, we experimentally evaluate BRUTAS and CACO in two different settings. The first setting uses data from a toy problem of Gaussian-distributed arms. The second setting uses real admissions data from one of the largest US-based graduate computer science programs.

5.1 Gaussian Arm Experiments

We begin by using simulated data to test the tightness of our theoretical bounds. To do so, we instantiate a cohort of n = 50 arms whose true utilities, ua, are sampled from a normal distribution. We aim to select a final cohort of size K = 7. When an arm is pulled during a stage with cost j and information gain s, the algorithm is charged a cost of j and a reward is pulled from a distribution with mean ua and standard deviation σ/√s. For simplicity, we present results in the setting of m = 2 stages.

Figure 3: Hardness (H) vs Cost, comparing against Theorem 1.

CACO. 
To evaluate CACO, we vary δ, ε, σ, K1, and s2. We find that as δ increases, both cost and utility decrease, as expected. Similarly, Figure 2 shows that as ε increases, both cost and utility decrease. Higher values of σ increase the total cost, but do not affect utility. We also find diminishing returns from high information-gain values s (x-axis of Figure 2). This makes sense: as s tends to infinity, the true utility is returned from a single arm pull. We also notice that if many "easy" arms (arms with very large gap scores) are allowed in higher stages, total cost rises substantially.
Although the bound defined in Theorem 1 assumes a linear function w, we empirically tested CACO using the submodular function wDIV. We find that the cost of running CACO using this submodular function is significantly lower than the theoretical bound. This suggests that (i) the bound for CACO can be tightened and (ii) CACO could be run with submodular functions w.

BRUTAS. To evaluate BRUTAS, we varied σ and (K̃i, Ti) pairs for two stages. Utility varies as expected from Theorem 2: when σ increases, utility decreases. There is also a trade-off between K̃i and Ti values. If the problem is easy, a low budget and a high K̃1 value are sufficient to get high utility. If the problem is hard (high H value), a higher overall budget is needed, with more budget spent in the second stage. Figure 4 shows this escalating relationship between budget and utility based on problem hardness. Again we found that BRUTAS performed well when using the submodular function wDIV.

Figure 4: Effect of an increasing budget on the overall utility of a cohort. As hardness (H) increases, more budget is needed to produce a high-quality cohort.

Finally, we compare CACO and BRUTAS to two baseline algorithms: UNIFORM and RANDOM, which pull arms uniformly and randomly, respectively, in each stage. In both algorithms, the maximization oracle is run after each stage to determine which arms should move on to the next stage. 
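For intuition, the UNIFORM baseline can be sketched as follows. This is our own simplified reconstruction (the function name, stage tuples, and budget accounting are assumptions, not the authors' code): each stage splits its budget evenly across surviving arms, then keeps the top arms by empirical mean.

```python
import random

def uniform_baseline(true_u, stages, rng):
    """stages: list of (s, j, K_i, budget) per stage; returns (cohort, cost)."""
    sigma = 0.2                     # assumed noise scale
    active = list(true_u)
    cost = 0
    for s, j, K_i, budget in stages:
        # Spend the stage budget evenly: same number of pulls for every arm.
        pulls = max(1, int(budget // (j * len(active))))
        est = {}
        for a in active:
            samples = [rng.gauss(true_u[a], sigma / s ** 0.5)
                       for _ in range(pulls)]
            est[a] = sum(samples) / pulls
            cost += pulls * j
        # Oracle step: keep the K_i empirically best arms for the next stage.
        active = sorted(active, key=est.get, reverse=True)[:K_i]
    return active, cost

rng = random.Random(1)
true_u = {i: rng.random() for i in range(50)}   # 50 arms, as in the testbed
cohort, spent = uniform_baseline(true_u, [(1, 1, 20, 500), (4, 6, 7, 1500)], rng)
```

Unlike CACO and BRUTAS, this policy ignores per-arm uncertainty, so budget is wasted on arms whose accept/reject decision is already clear.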
When given a budget of 2,750, BRUTAS achieves a utility of 244.0, which outperforms both the UNIFORM and RANDOM baseline utilities of 178.4 and 138.9, respectively. When CACO is run on the same problem, it finds a solution (utility of 231.0) that beats both UNIFORM and RANDOM at a roughly equivalent cost of 2,609. This qualitative behavior exists for other budgets.

Figure 5: Utility vs Cost over five different algorithms (RANDOM, UNIFORM, SWAP, CACO, BRUTAS) and the actual admissions decisions made at the university. Both CACO and BRUTAS produce equivalent cohorts to the actual admissions process with lower cost, or produce higher-quality cohorts than the actual admissions process with equivalent cost.

5.2 Graduate Admissions Experiment

We evaluate how CACO and BRUTAS might perform in the real world by applying them to a graduate admissions dataset from one of the largest US-based graduate computer science programs. These experiments were approved by the university's Institutional Review Board and did not affect any admissions decisions for the university. Our dataset consists of three years (2014-16) worth of graduate applications. For each application we also have graduate committee review scores (normalized to between 0 and 1) and admission decisions.

Experimental setup. Using information from 2014 and 2015, we used a random forest classifier [Pedregosa et al., 2011], trained in the standard way on features extracted from the applications, to predict probability of acceptance. 
Features included numerical information such as GPA and GRE scores, topics from running Latent Dirichlet Allocation (LDA) on faculty recommendation letters [Schmader et al., 2007], and categorical information such as region of origin and undergraduate school. In the testing phase, the classifier was run on the set of applicants A from 2016 to produce a probability of acceptance P(a) for every applicant a ∈ A.
We mimic the university's application process of two stages: a first review stage, where admissions committee members review the application packet, and a second interview stage, where committee members perform a Skype interview for a select subset of applicants. The committee members follow a structured interview approach. We determined that a Skype interview takes roughly 6 times as long as a packet review, and we therefore set the cost multiplier for the second stage to j2 = 6. We ran over a variety of s2 values, and we determined σ by looking at the distribution of review scores from past years. When an arm a ∈ A is pulled with information gain s and cost j, a reward is randomly drawn from the arm's review scores (when s1 = 1 and j1 = 1, as in the first stage), or a reward is drawn from a Gaussian distribution with mean P(a) and a standard deviation of σ/√s.
We ran simulations for BRUTAS, CACO, UNIFORM, and RANDOM. In addition, we compare to an adjusted version of Schumann et al. [2019]'s SWAP. SWAP uses a strong pull policy to decide probabilistically whether to weak pull or strong pull each arm. In this adjusted version we use a strong pull policy that always weak pulls arms until some threshold time t and strong pulls for the remainder of the algorithm. Note that this adjustment moves SWAP away from fixed confidence; it is not quite a budgeted algorithm like BRUTAS, but it fits into the tiered structure.
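The simulated reward model above can be sketched in a few lines. This is a hedged sketch, assuming the interview-stage noise shrinks as σ/√s as described; `pull_arm` and its parameter names are illustrative, not the paper's code.

```python
import random

def pull_arm(review_scores, p_accept, s, sigma):
    # First-stage pull (s = 1): resample one of the arm's historical
    # committee review scores uniformly at random.
    if s == 1:
        return random.choice(review_scores)
    # Later-stage pull with information gain s: draw from a Gaussian
    # centered on the predicted acceptance probability P(a), with the
    # standard deviation shrinking as sigma / sqrt(s).
    return random.gauss(p_accept, sigma / s ** 0.5)
```

Under this model a single pull with information gain s carries roughly the same information as s unit-gain pulls, which is why the true utility is recovered from one pull as s grows.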
For the budgeted algorithms BRUTAS, UNIFORM, and RANDOM (as well as the pseudo-budgeted SWAP), if there are Ki arms in round i, the budget is Ki · xi where xi ∈ ℕ. We vary δ and ε to control CACO's cost.
We compare the utility of the cohort selected by each algorithm to the utility of the cohort that was actually selected by the university. We maximize either objective, wTOP or wDIV, for each of the algorithms. We instantiate wDIV, defined in Equation 5, in two ways: first, with self-reported gender, and second, with region of origin. Note that since the graduate admissions process is run entirely by humans, the committee does not explicitly maximize a particular function. Instead, the committee tries to find a good overall cohort while balancing areas of interest and general diversity.

Results. Figure 5 compares each algorithm to the actual admissions decision process performed by the real-world committee. In terms of utility, for both wTOP and wDIV, BRUTAS and CACO achieve gains similar to the actual admissions process (higher for wDIV over region of origin) while using less cost/budget. When roughly the same amount of budget is used, BRUTAS and CACO are able to provide higher predicted utility than the true accepted cohort, for both wTOP and wDIV. As expected, BRUTAS and CACO outperform the baseline algorithms RANDOM and UNIFORM. The adjusted SWAP algorithm performs poorly in this restricted setting of tiered hiring.
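Equation 5 itself is not reproduced in this excerpt, so the following is an illustration only: a standard concave-over-modular objective in the spirit of Lin and Bilmes [2011] showing how a submodular wDIV rewards spreading utility across groups. The function name and exact form are assumptions, not the paper's definition.

```python
import math

def w_div(cohort, utility, group):
    # Illustrative submodular diversity objective: sum the utility captured
    # within each demographic group, then pass each group total through a
    # concave sqrt so marginal returns diminish inside any single group.
    totals = {}
    for a in cohort:
        totals[group[a]] = totals.get(group[a], 0.0) + utility[a]
    return sum(math.sqrt(t) for t in totals.values())
```

With equal per-applicant utilities, a cohort drawn from two groups scores strictly higher than one concentrated in a single group, which is the diversity-promoting behavior the experiments rely on.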
By limiting the strong pull policy of SWAP, only small incremental improvements can be made as cost is increased.

6 Conclusions & Discussion of Future Research

We provided a formalization of tiered structured interviewing and presented two algorithms, CACO in the PAC setting and BRUTAS in the fixed-budget setting, which select a near-optimal cohort of applicants with provable bounds. We used simulations to quantitatively explore the impact of various parameters on CACO and BRUTAS and found that their behavior aligns with theory. We showed empirically that both CACO and BRUTAS work well with a submodular function that promotes diversity. Finally, on a real-world dataset from a large US-based Ph.D. program, we showed that CACO and BRUTAS identify higher-quality cohorts using equivalent budgets, or comparable cohorts using lower budgets, than the status quo admissions process. Moving forward, we plan to incorporate multi-dimensional feedback (e.g., with respect to an applicant's technical, presentation, and analytical qualities) into our model; recent work by Katz-Samuels and Scott [2018, 2019] introduces such feedback (in a single-tiered setting) as a marriage of MAB and constrained optimization, and we see it as a fruitful model to combine with our tiered system.
Discussion. The results support the use of BRUTAS and CACO in a practical hiring scenario. Once policymakers have determined an objective, BRUTAS and CACO could help reduce costs and produce better cohorts of employees. Yet we note that although this experiment uses real data, it is still a simulation. The classifier is not a true predictor of an applicant's utility; indeed, finding an estimate of utility for an applicant is a nontrivial task. Additionally, the data that we use incorporates human bias in admission decisions and reviewer scores [Schmader et al., 2007; Angwin et al., 2016].
Finally, defining an objective function on which to run CACO and BRUTAS is a difficult task. Recent advances in human value judgment aggregation [Freedman et al., 2018; Noothigattu et al., 2018] could find use in this decision-making framework.

7 Acknowledgements

Schumann and Dickerson were supported by NSF IIS RI CAREER Award #1846237. We thank Google for gift support, University of Maryland professors David Jacobs and Ramani Duraiswami for helpful input, and the anonymous reviewers for helpful comments.

References

Julia Angwin, Jeff Larson, Surya Mattu, and Lauren Kirchner. Machine bias. www.propublica.org/article/machine-bias-risk-assessments-in-criminal-sentencing, 2016.

Azin Ashkan, Branislav Kveton, Shlomo Berkovsky, and Zheng Wen. Optimal greedy diversity for recommendation. In IJCAI, 2015.

James A. Breaugh and Mary Starke. Research on employee recruitment. Journal of Management, 2000.

Sébastien Bubeck, Nicolò Cesa-Bianchi, et al. Regret analysis of stochastic and nonstochastic multi-armed bandit problems. Foundations and Trends in Machine Learning, 2012.

Sébastien Bubeck, Tengyao Wang, and Nitin Viswanathan. Multiple identifications in multi-armed bandits. In ICML, 2013.

Wei Cao, Jian Li, Yufei Tao, and Zhize Li. On top-k selection in multi-armed bandits and hidden bipartite graphs. In NeurIPS, 2015.

Andrew Chamberlain. How long does it take to hire? https://www.glassdoor.com/research/time-to-hire-in-25-countries/, 2017.

Shouyuan Chen, Tian Lin, Irwin King, Michael R. Lyu, and Wei Chen. Combinatorial pure exploration of multi-armed bandits. In NeurIPS, 2014.

Pierre Desrochers. Local diversity, human creativity, and technological innovation. Growth and Change, 2001.

Wenkui Ding, Tao Qin, Xu-Dong Zhang, and Tie-Yan Liu. Multi-armed bandit with budget constraint and variable costs.
In AAAI, 2013.

Rachel Freedman, J. Schaich Borg, Walter Sinnott-Armstrong, J. Dickerson, and Vincent Conitzer. Adapting a kidney exchange algorithm to align with human values. In AAAI, 2018.

Michael M. Harris. Reconsidering the employment interview. Personnel Psychology, 1989.

David Haussler and Manfred Warmuth. The probably approximately correct (PAC) and other learning models. In Foundations of Knowledge Acquisition, 1993.

Vivian Hunt, Dennis Layton, and Sara Prince. Diversity matters. McKinsey & Company, 2015.

Shweta Jain, Sujit Gujar, Satyanath Bhat, Onno Zoeter, and Y. Narahari. An incentive compatible MAB crowdsourcing mechanism with quality assurance. arXiv, 2014.

Kwang-Sung Jun, Kevin Jamieson, Robert Nowak, and Xiaojin Zhu. Top arm identification in multi-armed bandits with batch arm pulls. In AISTATS, 2016.

Julian Katz-Samuels and Clayton Scott. Feasible arm identification. In ICML, 2018.

Julian Katz-Samuels and Clayton Scott. Top feasible arm identification. In AISTATS, 2019.

Julia D. Kent and Maureen Terese McCarthy. Holistic Review in Graduate Admissions. Council of Graduate Schools, 2016.

Andreas Krause and Daniel Golovin. Submodular function maximization. In Tractability, 2014.

Julia Levashina, Christopher J. Hartwell, Frederick P. Morgeson, and Michael A. Campion. The structured employment interview. Personnel Psychology, 2014.

Hui Lin and Jeff Bilmes. A class of submodular functions for document summarization. In ACL, 2011.

George L. Nemhauser, Laurence A. Wolsey, and Marshall L. Fisher. An analysis of approximations for maximizing submodular set functions. Mathematical Programming, 1978.

Ritesh Noothigattu, Snehalkumar Neil S. Gaikwad, Edmond Awad, Sohan D'Souza, Iyad Rahwan, Pradeep Ravikumar, and Ariel D. Procaccia. A voting-based system for ethical decision making. In AAAI, 2018.

F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P.
Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. Scikit-learn: Machine learning in Python. JMLR, 2011.

Richard A. Posthuma, Frederick P. Morgeson, and Michael A. Campion. Beyond employment interview validity. Personnel Psychology, 2002.

Filip Radlinski, Robert Kleinberg, and Thorsten Joachims. Learning diverse rankings with multi-armed bandits. In ICML, 2008.

Toni Schmader, Jessica Whitehead, and Vicki H. Wysocki. A linguistic comparison of letters of recommendation for male and female chemistry and biochemistry job applicants. Sex Roles, 2007.

Candice Schumann, Samsara N. Counts, Jeffrey S. Foster, and John P. Dickerson. The diverse cohort selection problem. In AAMAS, 2019.

Adish Singla, Eric Horvitz, Pushmeet Kohli, and Andreas Krause. Learning to hire teams. In HCOMP, 2015.

Adish Singla, Sebastian Tschiatschek, and Andreas Krause. Noisy submodular maximization via adaptive sampling with applications to crowdsourced image collection summarization. In AAAI, 2015.

Society for Human Resource Management. Human capital benchmarking report, 2016.

United States Bureau of Labor Statistics. Job openings and labor turnover. https://www.bls.gov/news.release/pdf/jolts.pdf, 2018.

Pelin Vardarlier, Yalcin Vural, and Semra Birgun. Modelling of the strategic recruitment process by axiomatic design principles. In Social and Behavioral Sciences, 2014.

Yingce Xia, Tao Qin, Weidong Ma, Nenghai Yu, and Tie-Yan Liu. Budgeted multi-armed bandits with multiple plays. In IJCAI, 2016.

Yisong Yue and Carlos Guestrin. Linear submodular bandits and their application to diversified retrieval.
In NeurIPS, 2011.