{"title": "Minimizing Statistical Bias with Queries", "book": "Advances in Neural Information Processing Systems", "page_first": 417, "page_last": 423, "abstract": null, "full_text": "Minimizing Statistical Bias with Queries\n\nDavid A. Cohn\n\nAdaptive Systems Group\n\nHarlequin, Inc.\n\nOne Cambridge Center\nCambridge, MA 02142\ncohn@harlequin.com\n\nAbstract\n\nI describe a querying criterion that attempts to minimize the error\nof a learner by minimizing its estimated squared bias. I describe\nexperiments with locally-weighted regression on two simple problems,\nand observe that this \"bias-only\" approach outperforms the\nmore common \"variance-only\" exploration approach, even in the\npresence of noise.\n\n1 INTRODUCTION\n\nIn recent years, there has been an explosion of interest in \"active\" machine learning\nsystems. These are learning systems that make queries, or perform experiments\nto gather data that are expected to maximize performance. When compared with\n\"passive\" learning systems, which accept given, or randomly drawn data, active\nlearners have demonstrated significant decreases in the amount of data required to\nachieve equivalent performance. In industrial applications, where each experiment\nmay take days to perform and cost thousands of dollars, a method for optimally\nselecting these points would offer enormous savings in time and money.\nAn active learning system will typically attempt to select data that will minimize\nits predictive error. This error can be decomposed into bias and variance terms.\nMost research in selecting optimal actions or queries has assumed that the learner\nis approximately unbiased, and that to minimize learner error, variance is the only\nthing to minimize (e.g. Fedorov [1972], MacKay [1992], Cohn [1996], Cohn et al.\n[1996], Paass [1995]).\n
In practice, however, there are very few problems for which\nwe have unbiased learners. Frequently, bias constitutes a large portion of a learner's\nerror; if the learner is deterministic and the data are noise-free, then bias is the only\nsource of error. Note that the bias term here is a statistical bias, distinct from the\ninductive bias discussed in some machine learning research [Dietterich and Kong,\n1995].\n\nIn this paper I describe an algorithm which selects actions/queries designed to minimize\nthe bias of a locally weighted regression-based learner. Empirically, \"variance-minimizing\"\nstrategies which ignore bias seem to perform well, even in cases where,\nstrictly speaking, there is no variance to minimize. In the tasks considered in this\npaper, the bias-minimizing strategy consistently outperforms variance minimization,\neven in the presence of noise.\n\n1.1 BIAS AND VARIANCE\n\nLet us begin by defining P(x, y) to be the unknown joint distribution over x and y,\nand P(x) to be the known marginal distribution of x (commonly called the input\ndistribution). We denote the learner's output on input x, given training set D, as\n$\\hat{y}(x; D)$. We can then write the expected error of the learner as\n\n$$\\int E\\left[(\\hat{y}(x; D) - y(x))^2 \\mid x\\right] P(x)\\,dx, \\qquad (1)$$\n\nwhere E[\u00b7] denotes the expectation over P and over training sets D. The expectation\ninside the integral may be decomposed as follows (Geman et al., 1992):\n\n$$E\\left[(\\hat{y}(x; D) - y(x))^2 \\mid x\\right] = E\\left[(y(x) - E[y \\mid x])^2\\right] + \\left(E_D[\\hat{y}(x; D)] - E[y \\mid x]\\right)^2 + E_D\\left[(\\hat{y}(x; D) - E_D[\\hat{y}(x; D)])^2\\right] \\qquad (2)$$\n\nwhere $E_D[\\cdot]$ denotes the expectation over training sets. The first term in Equation 2\nis the variance of y given x; it is the noise in the distribution, and does not depend\non our learner or how the training data are chosen.\n
The second term is the learner's\nsquared bias, and the third is its variance; these last two terms comprise the expected\nsquared error of the learner with respect to the regression function $E[y \\mid x]$.\nMost research in active learning assumes that the second term of Equation 2 is\napproximately zero, that is, that the learner is unbiased. If this is the case, then\none may concentrate on selecting data so as to minimize the variance of the learner.\nAlthough this \"all-variance\" approach is optimal when the learner is unbiased, truly\nunbiased learners are rare. Even when the learner's representation class is able\nto match the target function exactly, bias is generally introduced by the learning\nalgorithm and learning parameters. From the Bayesian perspective, a learner is\nonly unbiased if its priors are exactly correct.\nThe optimal choice of query would, of course, minimize both bias and variance, but\nI leave that for future work. For the purposes of this paper, I will only be concerned\nwith selecting queries that are expected to minimize learner bias. This approach\nis justified in cases where noise is believed to be only a small component of the\nlearner's error. If the learner is deterministic and there is no noise, then strictly\nspeaking, there is no error due to variance; all the error must be due to learner\nbias. In cases with non-determinism or noise, all-bias minimization, like all-variance\nminimization, becomes an approximation of the optimal approach.\n\nThe learning model discussed in this paper is a form of locally weighted regression\n(LWR) [Cleveland et al., 1988], which has been used in difficult machine learning\ntasks, notably the \"robot juggler\" of Schaal and Atkeson [1994].\n
Previous work\n[Cohn et al., 1996] discussed all-variance query selection for LWR; in the remainder\nof this paper, I describe a method for performing all-bias query selection. Section 2\ndescribes the criterion that must be optimized for all-bias query selection. Section 3\ndescribes the locally weighted regression learner used in this paper and describes\nhow the all-bias criterion may be computed for it. Section 4 describes the results\nof experiments using this criterion on several simple domains. Directions for future\nwork are discussed in Section 5.\n\n2 ALL-BIAS QUERY SELECTION\n\nLet us assume for the moment that we have a source of noise-free examples $(x_i, y_i)$\nand a deterministic learner which, given input x, outputs estimate $\\hat{y}(x)$.\u00b9 Let us\nalso assume that we have an accurate estimate of the bias of $\\hat{y}$ which can be used\nto estimate the true function $y(x) = \\hat{y}(x) - \\mathrm{bias}(x)$. We will break these rather\nstrong assumptions of noise-free examples and accurate bias estimates in Section 4,\nbut they are useful for deriving the theoretical approach described below.\n\nGiven an accurate bias estimate, we must force the biased estimator into the best\napproximation of y(x) with the fewest examples. This, in effect, transforms\nthe query selection problem into an example filter problem similar to that\nstudied by Plutowski and White [1993] for neural networks. Below, I derive this\ncriterion for estimating the change in error at x given a new queried example at $\\tilde{x}$.
\nSince we have (temporarily) assumed a deterministic learner and noise-free data,\nthe expected error in Equation 2 simplifies to:\n\n$$E\\left[(\\hat{y}(x; D) - y(x))^2 \\mid x, D\\right] = (\\hat{y}(x; D) - y(x))^2 \\qquad (3)$$\n\nWe want to select a new $\\tilde{x}$ such that when we add $(\\tilde{x}, \\tilde{y})$, the resulting squared bias\nis minimized:\n\n$$(\\hat{y}' - y)^2 \\equiv (\\hat{y}(x; D \\cup (\\tilde{x}, \\tilde{y})) - y(x))^2. \\qquad (4)$$\n\nI will, for the remainder of the paper, use the \"'\" to indicate estimates based on the\ninitial training set plus the additional example $(\\tilde{x}, \\tilde{y})$. To minimize Expression 4,\nwe need to compute how a query at $\\tilde{x}$ will change the learner's bias at x. If we\nassume that we know the input distribution,\u00b2 then we can integrate this change\nover the entire domain (using Monte Carlo procedures) to estimate the resulting\naverage change, and select an $\\tilde{x}$ such that the expected squared bias is minimized.\nDefining $\\mathrm{bias} \\equiv \\hat{y} - y$ and $\\Delta\\hat{y} \\equiv \\hat{y}' - \\hat{y}$, we can write the new squared bias as:\n\n$$\\mathrm{bias}'^2 = (\\hat{y}' - y)^2 = (\\hat{y} + \\Delta\\hat{y} - y)^2 = \\Delta\\hat{y}^2 + 2\\Delta\\hat{y} \\cdot \\mathrm{bias} + \\mathrm{bias}^2 \\qquad (5)$$\n\nNote that since bias as defined here is independent of $\\tilde{x}$, minimizing the new squared bias is\nequivalent to minimizing $\\Delta\\hat{y}^2 + 2\\Delta\\hat{y} \\cdot \\mathrm{bias}$.\nThe estimate of $\\mathrm{bias}'$ tells us how much our bias will change for a given $\\tilde{x}$. We may\noptimize this value over $\\tilde{x}$ in one of a number of ways. In low dimensional spaces, it\nis often sufficient to consider a set of \"candidate\" $\\tilde{x}$ and select the one promising the\nsmallest resulting error. In higher dimensional spaces, it is often more efficient to\nsearch for an optimal $\\tilde{x}$ with a response surface technique [Box and Draper, 1987],\nor hill climb on $\\partial\\,\\mathrm{bias}'^2 / \\partial\\tilde{x}$.\nEstimates of bias and $\\Delta\\hat{y}$ depend on the specific learning model being used.\n
In Section 3, I describe a locally weighted regression model, and show how differentiable\nestimates of bias and $\\Delta\\hat{y}$ may be computed for it.\n\n\u00b9 For clarity, I will drop the argument x except where required for disambiguation. I\nwill also denote only the univariate case; the results apply in higher dimensions as well.\n\u00b2 This assumption is contrary to the assumption normally made in some forms of learning,\ne.g. PAC-learning, but it is appropriate in many domains.\n\n2.1 AN ASIDE: WHY NOT JUST USE $\\hat{y} - \\widehat{\\mathrm{bias}}$?\n\nIf we have an accurate bias estimate, it is reasonable to ask why we do not simply\nuse the corrected $\\hat{y} - \\widehat{\\mathrm{bias}}$ as our predictor. The answer has two parts, the first\nof which is that for most learners, there are no perfect bias estimators; they\nintroduce their own bias and variance, which must be addressed in data selection.\nSecond, we can define a composite learner $\\hat{y}_c \\equiv \\hat{y} - \\widehat{\\mathrm{bias}}$. Given a random training\nsample then, we would expect $\\hat{y}_c$ to outperform $\\hat{y}$. However, there is no obvious\nway to select data for this composite learner other than selecting to maximize the\nperformance of its two components. In our case, the second component (the bias\nestimate) is non-analytic, which leaves us selecting data so as to maximize the\nperformance of the first component (the uncorrected estimator). We are now back\nto our original problem: we can select data so as to minimize either the bias or\nvariance of the uncorrected LWR-based learner. Since the purpose of the correction\nis to give an unbiased estimator, intuition suggests that variance minimization would\nbe the more sensible route in this case. Empirically, this approach does not appear\nto yield any benefit over uncorrected variance minimization (see Figure 1).
\n\n3 LOCALLY WEIGHTED REGRESSION\n\nThe type of learner I consider here is a form of locally weighted regression (LWR)\nthat is a slight variation on the LOESS model of Cleveland et al. [1988] (see Cohn\net al. [1996] for details). The LOESS model performs a linear regression on points\nin the data set, weighted by a kernel centered at x. The kernel shape is a design\nparameter: the original LOESS model uses a \"tricubic\" kernel; in my experiments\nI use the more common Gaussian\n\n$$h_i(x) \\equiv h(x - x_i) = \\exp(-k (x - x_i)^2),$$\n\nwhere k is a smoothing parameter. For brevity, I will drop the argument x for $h_i(x)$,\nand define $n = \\sum_i h_i$. We can then write the weighted means and covariances as:\n\n$$\\mu_x = \\sum_i h_i \\frac{x_i}{n}, \\qquad \\mu_y = \\sum_i h_i \\frac{y_i}{n}, \\qquad \\sigma_{xy} = \\sum_i h_i \\frac{(x_i - \\mu_x)(y_i - \\mu_y)}{n}.$$\n\nWe use these means and covariances to produce an estimate $\\hat{y}$ at the x around which\nthe kernel is centered, with a confidence term in the form of a variance estimate\n(see Cohn et al. [1996] for its form):\n\n$$\\hat{y} = \\mu_y + \\frac{\\sigma_{xy}}{\\sigma_x^2}(x - \\mu_x),$$\n\nwhere $\\sigma_x^2$ is the weighted variance of x, defined analogously to $\\sigma_{xy}$.\nIn all the experiments discussed in this paper, the smoothing parameter k was set\nso as to minimize $\\sigma^2$.\nThe low cost of incorporating new training examples makes this form of locally\nweighted regression appealing for learning systems which must operate in real time,\nor with time-varying target functions (e.g. [Schaal and Atkeson 1994]).\n\n3.1 COMPUTING $\\Delta\\hat{y}$ FOR LWR\n\nIf we know what new point $(\\tilde{x}, \\tilde{y})$ we're going to add, computing $\\Delta\\hat{y}$ for LWR is\nstraightforward. Defining $\\tilde{h}$ as the weight given to $\\tilde{x}$, and $n'$ as $n + \\tilde{h}$, we can write\n\n$$\\Delta\\hat{y} = \\hat{y}' - \\hat{y} = \\mu'_y + \\frac{\\sigma'_{xy}}{{\\sigma'_x}^2}(x - \\mu'_x) - \\mu_y - \\frac{\\sigma_{xy}}{\\sigma_x^2}(x - \\mu_x)$$\n\n$$= \\frac{\\tilde{h}}{n'}(\\tilde{y} - \\mu_y) - \\frac{\\sigma_{xy}}{\\sigma_x^2}(x - \\mu_x) + \\left(x - \\frac{n\\mu_x + \\tilde{h}\\tilde{x}}{n'}\\right) \\cdot \\frac{n\\sigma_{xy} + \\frac{n\\tilde{h}}{n'}(\\tilde{x} - \\mu_x)(\\tilde{y} - \\mu_y)}{n\\sigma_x^2 + \\frac{n\\tilde{h}}{n'}(\\tilde{x} - \\mu_x)^2}.$$\n\nNote that computing $\\Delta\\hat{y}$ requires us to know both the $\\tilde{x}$ and $\\tilde{y}$ of the new point. In\npractice, we only know $\\tilde{x}$. If we assume, however, that we can estimate the learner's\nbias at any x, then we can also estimate the unknown value $\\tilde{y} \\approx \\hat{y}(\\tilde{x}) - \\mathrm{bias}(\\tilde{x})$.\nBelow, I consider how to compute the bias estimate.\n\n3.2 ESTIMATING BIAS FOR LWR\n\nThe most common technique for estimating bias is cross-validation. Standard cross-validation,\nhowever, only gives estimates of the bias at our specific training points,\nwhich are usually combined to form an average bias estimate. This is sufficient if\none assumes that the training distribution is representative of the test distribution\n(which it isn't in query learning) and if one is content to just estimate the bias\nwhere one already has training data (which we can't be).\nIn the query selection problem, we must be able to estimate the bias at all possible\nx. Box and Draper [1987] suggest fitting a higher order model and measuring the\ndifference. For the experiments described in this paper, this method yielded poor\nresults; two other bias-estimation techniques, however, performed very well.\n\nOne method of estimating bias is by bootstrapping the residuals of the training\npoints. One produces a \"bootstrap sample\" of the learner's residuals on the training\ndata, and adds them to the original predictions to create a synthetic training set. By\naveraging predictions over a number of bootstrapped training sets and comparing\nthe average prediction with that of the original predictor, one arrives at a first-order\nbootstrap estimate of the predictor's bias [Connor 1993; Efron and Tibshirani, 1993].
\nIt is known that this estimate is itself biased towards zero; a standard heuristic is\nto divide the estimate by 0.632 [Efron, 1983].\nAnother method of estimating bias of a learner is by fitting its own cross-validated\nresiduals. We first compute the cross-validated residuals on the training examples.\nThese produce estimates of the learner's bias at each of the training points. We\ncan then use these residuals as training examples for another learner (again LWR)\nto produce estimates of what the cross-validated error would be in places where we\ndon't have training data.\n\n4 EMPIRICAL RESULTS\n\nIn the previous two sections, I have explained how having an estimate of $\\Delta\\hat{y}$ and\nbias for a learner allows one to compute the learner's change in bias given a new\nquery, and have shown how these estimates may be computed for a learner that\nuses locally weighted regression. Here, I apply these results to two simple problems\nand demonstrate that they may actually be used to select queries that minimize\nthe statistical bias (and the error) of the learner. The problems involve learning the\nkinematics of a planar two-jointed robot arm: given the shoulder and elbow joint\nangles, the learner must predict the tip position.\n\n4.1 BIAS ESTIMATES\n\nI tested the accuracy of the two bias estimators by observing their correlations on 64\nreference inputs, given 100 random training examples from the planar arm problem.\nThe bias estimates had a correlation with actual biases of 0.852 for the bootstrap\nmethod, and 0.871 for the cross-validation method.\n\n4.2 BIAS MINIMIZATION\n\nI ran two sets of experiments using the bias-minimizing criterion in conjunction with\nthe bias estimation techniques of the previous section on the planar arm problem.
\nThe bias minimization criterion was used as follows: At each time step, the learner\nwas given a set of 64 randomly chosen candidate queries and 64 uniformly chosen\nreference points. It evaluated E'(x) for each reference point given each candidate\npoint and selected for its next query the candidate point with the smallest average\nE'(x) over the reference points. I compared the bias-minimizing strategy (using the\ncross-validation and bootstrap estimation techniques) against random sampling and\nthe variance-minimizing strategy discussed in Cohn et al. [1996]. On a Sparc 10,\nwith m training examples, the average evaluation times per candidate per reference\npoint were 58 + 0.16m microseconds for the variance criterion, 65 + 0.53m microseconds for\nthe cross-validation-based bias criterion, and 83 + 3.7m microseconds for the bootstrap-based\nbias criterion (with 20x resampling).\nTo test whether the bias-only assumption was robust against the presence of noise,\n1% Gaussian noise was added to the input values of the training data in all experiments.\nThis simulates noisy position effectors on the arm, and results in non-Gaussian\nnoise in the output coordinate system.\n\nIn the first series of experiments, the candidate shoulder and elbow joint angles\nwere drawn uniformly over $(U[0, 2\\pi], U[0, \\pi])$. In unconstrained domains like this,\nrandom sampling is a fairly good default strategy. The bias minimization strategies\nstill significantly outperform both random sampling and the variance minimizing\nstrategy in these experiments (see Figure 1).\n\n[Plots: log-scale MSE versus training set size (left, center), with curves for random, variance-min, cross-val-min and bootstrap-min; joint-space trajectory (right), axis theta 1]\n\nFigure 1: (left) MSE as a function of number of noisy training examples for the\nunconstrained arm problem. Errors are averaged over 10 runs for the bootstrap\nmethod and 15 runs for all others. One run with the cross-validation-based method\nwas excluded when k failed to converge to a reasonable value. (center) MSE as\na function of number of noisy training examples for the constrained arm problem.\nThe bias correction strategy discussed in Section 2.1 does no better than the uncorrected\nvariance-minimizing strategy, and much worse than the bias-minimization\nstrategy. (right) Sample exploration trajectory in joint-space for the constrained\narm problem, explored according to the bias minimizing criterion.\n\nIn the second series of experiments, candidates were drawn uniformly from a region\nlocal to the previously selected query: $(\\theta_1 \\pm 0.2\\pi, \\theta_2 \\pm 0.1\\pi)$. This corresponds to\nrestricting the arm to local motions. In a constrained problem such as this, random\nsampling is a poor strategy; both the bias and variance-reducing strategies\noutperform it by at least an order of magnitude. Further, the bias-minimization strategy\noutperforms variance minimization by a large margin (Figure 1). Figure 1 also\nshows an exploration trajectory produced by pursuing the bias-minimizing criterion.\n
It is noteworthy that, although the implementation in this case was a greedy\n(one-step) minimization, the trajectory results in globally good exploration.\n\n5 DISCUSSION\n\nI have argued in this paper that, in many situations, selecting queries to minimize\nlearner bias is an appropriate and effective strategy for active learning. I have given\nempirical evidence that, with an LWR-based learner and the examples considered\nhere, the strategy is effective even in the presence of noise.\n\nBeyond minimizing either bias or variance, an important next step is to explicitly\nminimize them together. The bootstrap-based estimate should facilitate this, as\nit produces a complementary variance estimate with little additional computation.\nBy optimizing over both criteria simultaneously, we expect to derive a criterion\nthat, in terms of statistics, is truly optimal for selecting queries.\n\nREFERENCES\nBox, G., & Draper, N. (1987). Empirical model-building and response surfaces.\nWiley, New York.\nCleveland, W., Devlin, S., & Grosse, E. (1988). Regression by local fitting.\nJournal of Econometrics, 37, 87-114.\nCohn, D. (1996). Neural network exploration using optimal experiment design.\nNeural Networks, 9(6):1071-1083.\nCohn, D., Ghahramani, Z., & Jordan, M. (1996). Active learning with statistical\nmodels. Journal of Artificial Intelligence Research, 4:129-145.\nConnor, J. (1993). Bootstrap Methods in Neural Network Time Series Prediction.\nIn J. Alspector et al., eds., Proc. of the Int. Workshop on Applications of Neural\nNetworks to Telecommunications, Lawrence Erlbaum, Hillsdale, N.J.\nDietterich, T., & Kong, E. (1995). Error-correcting output coding corrects\nbias and variance. In S. Prieditis and S. Russell, eds., Proceedings of the 12th\nInternational Conference on Machine Learning.
\nEfron, B. (1983). Estimating the error rate of a prediction rule: some improvements\non cross-validation. J. Amer. Statist. Assoc. 78:316-331.\nEfron, B., & Tibshirani, R. (1993). An introduction to the bootstrap. Chapman\n& Hall, New York.\nFedorov, V. (1972). Theory of Optimal Experiments. Academic Press, New York.\nGeman, S., Bienenstock, E., & Doursat, R. (1992). Neural networks and the\nbias/variance dilemma. Neural Computation, 4, 1-58.\nMacKay, D. (1992). Information-based objective functions for active data selection.\nNeural Computation, 4, 590-604.\nPaass, G., & Kindermann, J. (1994). Bayesian Query Construction for Neural\nNetwork Models. In G. Tesauro et al., eds., Advances in Neural Information\nProcessing Systems 7, MIT Press.\nPlutowski, M., & White, H. (1993). Selecting concise training sets from clean\ndata. IEEE Transactions on Neural Networks, 4, 305-318.\nSchaal, S., & Atkeson, C. (1994). Robot Juggling: An Implementation of\nMemory-based Learning. Control Systems 14, 57-71.\n", "award": [], "sourceid": 1288, "authors": [{"given_name": "David", "family_name": "Cohn", "institution": null}]}