{"title": "Prediction and Semantic Association", "book": "Advances in Neural Information Processing Systems", "page_first": 11, "page_last": 18, "abstract": null, "full_text": "Prediction and  Semantic Association \n\nThomas L.  Griffiths  &  Mark Steyvers \n\nDepartment of Psychology \n\nStanford University,  Stanford,  CA  94305-2130 \n{gruffydd,msteyver}@psych.stanford.edu \n\nAbstract \n\nWe  explore  the  consequences  of  viewing  semantic  association  as \nthe result of attempting to predict the concepts likely to arise in a \nparticular context.  We  argue that the success of existing accounts \nof semantic representation comes as a result of indirectly addressing \nthis problem, and show that a closer correspondence to human data \ncan be obtained by taking a  probabilistic approach that explicitly \nmodels the generative structure of language. \n\n1 \n\nIntroduction \n\nMany cognitive capacities, such as  memory and categorization, can be analyzed as \nsystems for  efficiently  predicting  aspects  of an organism's environment  [1].  Previ(cid:173)\nously,  such  analyses  have been  concerned  with  memory for  facts  or the properties \nof  objects,  where  the  prediction  task  involves  identifying  when  those  facts  might \nbe needed  again,  or what  properties novel  objects might  possess.  However,  one  of \nthe  most  challenging  tasks  people  face  is  linguistic  communication.  Engaging  in \nconversation or reading a  passage of text requires retrieval of a  variety of concepts \nfrom  memory  in  response  to  a  stream of information.  This  retrieval  task  can  be \nfacilitated by predicting which concepts are likely to be needed from  their context, \nhaving efficiently abstracted and stored the cues that support these predictions. \n\nIn  this  paper,  we  examine  how  understanding  the  problem  of  predicting  words \nfrom  their  context  can provide  insight  into  human semantic association,  exploring \nthe  hypothesis  that  the  association  between  words  is  at  least  partially  affected \nby  their  statistical  relationships.  Several  researchers  have  argued  that  semantic \nassociation  can  be  captured  using  high-dimensional  spatial  representations,  with \nthe most  prominent  such  approach being Latent  Semantic Analysis  (LSA)  [5].  We \nwill  describe this procedure, which indirectly addresses the prediction problem.  We \nwill  then suggest an alternative approach which explicitly models the way language \nis  generated and show that this approach provides a  better account of human word \nassociation  data than  LSA,  although the  two  approaches  are  closely  related.  The \ngreat promise of this approach is that it illustrates how we might begin to relax some \nof the  strong  assumptions  about  language  made  by  many  corpus-based  methods. \nWe  will  provide  an  example of this,  showing  results  from  a  generative  model  that \nincorporates both sequential and contextual information. \n\n\f2  Latent  Semantic Analysis \n\nLatent Semantic Analysis addresses the prediction problem by capturing similarity \nin  word  usage:  seeing  a  word  suggests  that  we  should  expect  to  see  other  words \nwith similar usage patterns.  Given a corpus containing W  words and D  documents, \nthe input to LSA  is  a  W  x  D  word-document co-occurrence matrix F  in which  fwd \ncorresponds  to  the  frequency  with  which  word  w  occurred  in  document  d.  
3 The topic model

Latent Semantic Analysis gives results that seem consistent with human judgments and extracts information relevant to predicting words from their contexts, although it was not explicitly designed with prediction in mind. This relationship suggests that a closer correspondence to human data might be obtained by directly attempting to solve the prediction task. In this section, we outline an alternative approach that involves learning a probabilistic model of the way language is generated. One generative model that has been used to outperform LSA on information retrieval tasks views documents as being composed of sets of topics [2, 4]. If we assume that the words that occur in different documents are drawn from T topics, where each topic is a probability distribution over words, then we can model the distribution over words in any one document as a mixture of those topics

    P(w_i) = \sum_{j=1}^{T} P(w_i \mid z_i = j) P(z_i = j)    (2)

where z_i is a latent variable indicating the topic from which the ith word was drawn and P(w_i | z_i = j) is the probability of the ith word under the jth topic. The words likely to be used in a new context can be determined by estimating the distribution over topics for that context, corresponding to P(z_i).

Intuitively, P(w | z = j) indicates which words are important to a topic, while P(z) is the prevalence of those topics within a document. For example, imagine a world where the only topics of conversation are love and research. We could then express the probability distribution over words with two topics, one relating to love and the other to research. The content of the topics would be reflected in P(w | z = j): the love topic would give high probability to words like JOY, PLEASURE, or HEART, while the research topic would give high probability to words like SCIENCE, MATHEMATICS, or EXPERIMENT. Whether a particular conversation concerns love, research, or the love of research would depend upon its distribution over topics, P(z), which determines how these topics are mixed together in forming documents.
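As a minimal numerical sketch of Equation 2 in this imagined two-topic world (all words and probabilities below are invented for illustration):

```python
import numpy as np

vocab = ["joy", "pleasure", "heart", "science", "mathematics", "experiment"]

# phi[:, j] = P(w | z = j): column 0 is the love topic, column 1 the research topic.
phi = np.array([[0.30, 0.02],
                [0.25, 0.03],
                [0.30, 0.05],
                [0.05, 0.35],
                [0.05, 0.25],
                [0.05, 0.30]])

# theta[j] = P(z = j) for one document, here mostly about the love of research.
theta = np.array([0.4, 0.6])

# Equation 2: P(w) = sum_j P(w | z = j) P(z = j), a mixture of the two topics.
for word, prob in zip(vocab, phi @ theta):
    print(f"{word:12s} {prob:.3f}")
```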
Having defined a generative model, learning topics becomes a statistical problem. The data consist of words w = {w_1, ..., w_n}, where each w_i belongs to some document d_i, as in a word-document co-occurrence matrix. For each document we have a multinomial distribution over the T topics, with parameters θ^{(d)}, so for a word in document d, P(z_i = j) = θ_j^{(d_i)}. The jth topic is represented by a multinomial distribution over the W words in the vocabulary, with parameters φ^{(j)}, so P(w_i | z_i = j) = φ_{w_i}^{(j)}. To make predictions about new documents, we need to assume a prior distribution on the parameters θ. Existing parameter estimation algorithms make different assumptions about θ, with varying results [2, 4]. Here, we present a novel approach to inference in this model, using Markov chain Monte Carlo with a symmetric Dirichlet(α) prior on θ^{(d_i)} for all documents and a symmetric Dirichlet(β) prior on φ^{(j)} for all topics. In this approach we do not need to explicitly represent the model parameters: we can integrate out θ and φ, defining the model simply in terms of the assignments of words to topics indicated by the z_i.

Markov chain Monte Carlo is a procedure for obtaining samples from complicated probability distributions, allowing a Markov chain to converge to the target distribution and then drawing samples from the states of that chain (see [3]). We use Gibbs sampling, where each state is an assignment of values to the variables being sampled, and the next state is reached by sequentially sampling all variables from their distribution when conditioned on the current values of all other variables and the data. We will sample only the assignments of words to topics, z_i. The conditional posterior distribution for z_i is given by

    P(z_i = j \mid \mathbf{z}_{-i}, \mathbf{w}) \propto \frac{n_{-i,j}^{(w_i)} + \beta}{n_{-i,j}^{(\cdot)} + W\beta} \cdot \frac{n_{-i,j}^{(d_i)} + \alpha}{n_{-i,\cdot}^{(d_i)} + T\alpha}    (3)

where z_{-i} is the assignment of all z_k such that k ≠ i, and n_{-i,j}^{(w_i)} is the number of words assigned to topic j that are the same as w_i, n_{-i,j}^{(·)} is the total number of words assigned to topic j, n_{-i,j}^{(d_i)} is the number of words from document d_i assigned to topic j, and n_{-i,·}^{(d_i)} is the total number of words in document d_i, all not counting the assignment of the current word w_i. α and β are free parameters that determine how heavily these distributions are smoothed.

We applied this algorithm to our subset of the TASA corpus, which contains n = 5628867 word tokens. Setting α = 0.1 and β = 0.01, we obtained 100 samples of 500 topics, with 10 samples from each of 10 runs with a burn-in of 1000 iterations and a lag of 100 iterations between samples. (Random numbers were generated with the Mersenne Twister, which has an extremely deep period [6]. For each run, the initial state of the Markov chain was found using an on-line version of Equation 3.) Each sample consists of an assignment of every word token to a topic, giving a value to each z_i.
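The sketch below is our compact illustration of this collapsed Gibbs sampler on a toy corpus, not the authors' implementation; the word and document ids are invented, while α and β follow the values above.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy corpus: word ids (into a W-word vocabulary) and the document of each token.
words = np.array([0, 1, 2, 0, 3, 4, 5, 3, 0, 1, 4, 5])
docs = np.array([0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2])
W, D, T = 6, 3, 2
alpha, beta = 0.1, 0.01

# z holds the current topic assignment of every token; nwz and ndz cache counts.
z = rng.integers(T, size=len(words))
nwz = np.zeros((W, T))
ndz = np.zeros((D, T))
for w, d, t in zip(words, docs, z):
    nwz[w, t] += 1
    ndz[d, t] += 1

for _ in range(1000):                       # burn-in; keep thinned samples afterwards
    for i, (w, d) in enumerate(zip(words, docs)):
        t = z[i]
        nwz[w, t] -= 1                      # remove token i from the counts
        ndz[d, t] -= 1
        # Equation 3, up to normalization.
        p = ((nwz[w] + beta) / (nwz.sum(axis=0) + W * beta)
             * (ndz[d] + alpha) / (ndz[d].sum() + T * alpha))
        t = rng.choice(T, p=p / p.sum())
        z[i] = t                            # add it back under the sampled topic
        nwz[w, t] += 1
        ndz[d, t] += 1

print(z)
```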
A subset of the 500 topics found in a single sample is shown in Table 1.

FEEL        MUSIC       BALL          SCIENCE        WORKERS       FORCE
FEELINGS    PLAY        GAME          STUDY          WORK          FORCES
FEELING     DANCE       TEAM          SCIENTISTS     LABOR         MOTION
ANGRY       PLAYS       PLAY          SCIENTIFIC     JOBS          BODY
WAY         STAGE       BASEBALL      KNOWLEDGE      WORKING       GRAVITY
THINK       PLAYED      FOOTBALL      WORK           WORKER        MASS
SHOW        BAND        PLAYERS       CHEMISTRY      WAGES         PULL
FEELS       AUDIENCE    GAMES         RESEARCH       FACTORY       NEWTON
PEOPLE      MUSICAL     PLAYING       BIOLOGY        JOB           OBJECT
FRIENDS     DANCING     FIELD         MATHEMATICS    WAGE          LAW
THINGS      RHYTHM      PLAYED        LABORATORY     SKILLED       DIRECTION
MIGHT       PLAYING     PLAYER        STUDYING       PAID          MOVING
HELP        THEATER     COACH         SCIENTIST      CONDITIONS    REST
HAPPY       DRUM        BASKETBALL    PHYSICS        PAY           FALL
FELT        ACTORS      SPORTS        FIELD          FORCE         ACTING
LOVE        SHOW        HIT           STUDIES        MANY          MOMENTUM
ANGER       BALLET      BAT           UNDERSTAND     HOURS         DISTANCE
BEING       ACTOR       TENNIS        STUDIED        EMPLOYMENT    GRAVITATIONAL
WAYS        DRAMA       TEAMS         SCIENCES       EMPLOYED      PUSH
FEAR        SONG        SOCCER        MANY           EMPLOYERS     VELOCITY

Table 1: Each column shows the 20 most probable words in one of the 500 topics obtained from a single sample. The organization of the columns and use of boldface displays the way in which polysemy is captured by the model.

For each sample we can compute the posterior predictive distribution (and posterior mean for φ^{(j)}):

    P(w \mid z = j, \mathbf{z}, \mathbf{w}) = \int P(w \mid z = j, \phi^{(j)}) P(\phi^{(j)} \mid \mathbf{z}, \mathbf{w}) \, d\phi^{(j)} = \frac{n_j^{(w)} + \beta}{n_j^{(\cdot)} + W\beta}    (4)

4 Predicting word association

We used both LSA and the topic model to predict the association between pairs of words, comparing these results with human word association norms collected by Nelson, McEvoy, and Schreiber [7]. These word association norms were established by presenting a large number of participants with a cue word and asking them to name an associated word in response. A total of 4544 of the words in these norms appear in the set of 26414 taken from the TASA corpus.

4.1 Latent Semantic Analysis

In LSA, the association between two words is usually measured using the cosine of the angle between their vectors. We ordered the associates of each word in the norms by their frequencies, making the first associate the word most commonly given as a response to the cue. For example, the first associate of NEURON is BRAIN. We evaluated the cosine between each word and the other 4543 words in the norms, and then computed the rank of the cosine of each of the first ten associates, or all of the associates for words with fewer than ten. The results are shown in Figure 1. Small ranks indicate better performance, with a rank of one meaning that the target word had the highest cosine. The median rank of the first associate was 32, and LSA correctly predicted the first associate for 507 of the 4544 words.
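The ranking procedure can be stated compactly; the sketch below is our reconstruction, with an invented three-word example standing in for the 4544-word norms and for any real similarity measure.

```python
import numpy as np

def first_associate_ranks(score, cues, norms):
    # For each cue, rank every other word in the norms by score(cue, w),
    # descending, and record where the first human associate lands.
    ranks = []
    for cue in cues:
        others = [w for w in cues if w != cue]
        others.sort(key=lambda w: score(cue, w), reverse=True)
        ranks.append(others.index(norms[cue][0]) + 1)
    return ranks

# Toy illustration with three words and invented scores and norms.
toy_scores = {("neuron", "brain"): .9, ("neuron", "dog"): .1,
              ("brain", "neuron"): .8, ("brain", "dog"): .2,
              ("dog", "brain"): .3, ("dog", "neuron"): .1}
norms = {"neuron": ["brain"], "brain": ["neuron"], "dog": ["brain"]}
ranks = first_associate_ranks(lambda a, b: toy_scores[(a, b)], list(norms), norms)
print(np.median(ranks), sum(r == 1 for r in ranks))  # median rank, number exactly right
```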
4.2 The topic model

The probabilistic nature of the topic model makes it easy to predict the words likely to occur in a particular context. If we have seen word w_1 in a document, then we can determine the probability that word w_2 occurs in that document by computing P(w_2 | w_1). The generative model allows documents to contain multiple topics, which is extremely important to capturing the complexity of large collections of words and computing the probability of complete documents. However, when comparing individual words it is more effective to assume that they both come from a single topic. This assumption gives us

    P_1(w_2 \mid w_1) = \sum_{z} P(w_2 \mid z) P(z \mid w_1)    (5)

where we use Equation 4 for P(w | z) and P(z) is uniform, consistent with the symmetric prior on θ, and the subscript in P_1(w_2 | w_1) indicates the restriction to a single topic. This estimate can be computed for each sample separately, and an overall estimate obtained by averaging over samples. We computed P_1(w_2 | w_1) for the 4544 words in the norms, and then assessed the rank of the associates in the resulting distribution using the same procedure as for LSA. The results are shown in Figure 1. The median rank for the first associate was 32, with 585 of the 4544 first associates exactly correct. The probabilistic model performed better than LSA, with the improved performance becoming more apparent for the later associates.

[Figure 1 omitted: bar chart of rank by associate number (1-10) for three methods: LSA with the cosine, LSA with the inner product, and the topic model.]

Figure 1: Performance of different methods of prediction on the word association task. Error bars show one standard error, estimated with 1000 bootstrap samples.
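A sketch of Equation 5 with a toy φ matrix (invented numbers; in the paper, φ comes from Equation 4 and the estimate is averaged over the 100 retained Gibbs samples):

```python
import numpy as np

# phi[w, j] = P(w | z = j), here a toy 6-word, 2-topic matrix whose columns sum to one.
phi = np.array([[0.30, 0.02],
                [0.25, 0.03],
                [0.30, 0.05],
                [0.05, 0.35],
                [0.05, 0.25],
                [0.05, 0.30]])
W, T = phi.shape

def p1(w2, w1):
    # Equation 5: P_1(w2 | w1) = sum_z P(w2 | z) P(z | w1). With P(z) uniform,
    # P(z | w1) is proportional to P(w1 | z).
    p_z_given_w1 = phi[w1] / phi[w1].sum()
    return phi[w2] @ p_z_given_w1

# Rank all words as associates of word 0.
print(sorted(range(W), key=lambda w: p1(w, 0), reverse=True))
```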
4.3 Discussion

The central problem in modeling semantic association is capturing the interaction between word frequency and similarity of word usage. Word frequency is an important factor in a variety of cognitive tasks, and one reason for its importance is its predictive utility. A higher observed frequency means that a word should be predicted to occur more often. However, this effect of frequency should be tempered by the relationship between a word and its semantic context. The success of the topic model is a consequence of naturally combining frequency information with semantic similarity: when a word is very diagnostic of a small number of topics, semantic context is used in prediction. Otherwise, word frequency plays a larger role.

The effect of word frequency in the topic model can be seen in the rank-order correlation of the predicted ranks of the first associates with the ranks predicted by word frequency alone, which is ρ = 0.49. In contrast, the cosine is used in LSA because it explicitly removes the effect of word frequency, with the corresponding correlation being ρ = -0.01. The cosine is purely a measure of semantic similarity, which is useful in situations where word frequency is misleading, such as in tests of English fluency or other linguistic tasks, but not necessarily consistent with human performance. This measure reflects the origins of LSA in information retrieval, but other measures that do incorporate word frequency have been used for modeling psychological data. We consider one such measure in the next section.

5 Relating LSA and the topic model

The decomposition of a word-document co-occurrence matrix provided by the topic model can be written in a matrix form similar to that of LSA. Given a word-document co-occurrence matrix F, we can convert the columns into empirical estimates of the distribution over words in each document by dividing each column by its sum. Calling this matrix P, the topic model approximates it with the nonnegative matrix factorization P ≈ φθ, where column j of φ gives φ^{(j)}, and column d of θ gives θ^{(d)}. The inner product matrix PP^T is proportional to the empirical estimate of the joint distribution over words P(w_1, w_2). We can write PP^T ≈ φθθ^T φ^T, corresponding to P(w_1, w_2) = Σ_{z_1,z_2} P(w_1 | z_1) P(w_2 | z_2) P(z_1, z_2), with θθ^T an empirical estimate of P(z_1, z_2). The theoretical distribution for P(z_1, z_2) is proportional to I + α, where I is the identity matrix, so θθ^T should be close to diagonal. The single topic assumption removes the off-diagonal elements, replacing θθ^T with I to give P_1(w_1, w_2) ∝ φφ^T.

By comparison, LSA transforms F to a matrix G via Equation 1, then the SVD gives G ≈ UDV^T for some low-rank diagonal D. The locations of the words along the extracted dimensions are X = UD. If the column sums do not vary extensively, the empirical estimate of the joint distribution over words specified by the entries in G will be approximately P(w_1, w_2) ∝ GG^T. The properties of the SVD guarantee that XX^T, the matrix of inner products among the word vectors, is the best low-rank approximation to GG^T in terms of squared error. The transformations in Equation 1 are intended to reduce the effects of word frequency in the resulting representation, making XX^T more similar to φφ^T.

We used the inner product between word vectors to predict the word association norms, exactly as for the cosine. The results are shown in Figure 1. The inner product initially shows worse performance than the cosine, with a median rank of 34 for the first associate and 500 exactly correct, but performs better for later associates. The rank-order correlation with the predictions of word frequency for the first associate was ρ = 0.46, similar to that for the topic model. The rank-order correlation between the ranks given by the inner product and the topic model was ρ = 0.81, while the cosine and the topic model correlate at ρ = 0.69. The inner product and P_1(w_2 | w_1) in the topic model seem to give quite similar results, despite being obtained by very different procedures. This similarity is emphasized by choosing to assess the models with separate ranks for each cue word, since this measure does not discriminate between joint and conditional probabilities. While the inner product is related to the joint probability of w_1 and w_2, P_1(w_2 | w_1) is a conditional probability and thus allows reasonable comparisons of the probability of w_2 across choices of w_1, as well as having properties like asymmetry that are exhibited by word association.
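The algebra of this section can be mimicked numerically. The sketch below uses toy random counts, a few multiplicative NMF updates (Lee and Seung's algorithm, our stand-in for the Gibbs sampler of Section 3) to obtain P ≈ φθ, and a truncated SVD for the LSA side; the entropy weighting of Equation 1 is omitted for brevity.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy count matrix F (W x D); adding 1 keeps every document nonempty.
F = rng.poisson(2.0, size=(8, 20)).astype(float) + 1.0
P = F / F.sum(axis=0, keepdims=True)     # column d: empirical P(w) for document d

# Topic-model side: P ~= phi theta via nonnegative matrix factorization.
T = 3
phi, theta = rng.random((8, T)), rng.random((T, 20))
for _ in range(500):
    theta *= (phi.T @ P) / (phi.T @ phi @ theta + 1e-12)
    phi *= (P @ theta.T) / (phi @ theta @ theta.T + 1e-12)
phi /= phi.sum(axis=0, keepdims=True)    # columns of phi now sum to one, like P(w|z)

assoc_topic = phi @ phi.T                # single topic assumption: proportional to phi phi^T

# LSA side: X = U_k D_k, and X X^T is the best rank-k approximation to G G^T.
G = np.log(F + 1.0)                      # Equation 1 without the entropy weighting
U, s, _ = np.linalg.svd(G, full_matrices=False)
X = U[:, :T] * s[:T]
assoc_lsa = X @ X.T

print(np.round(assoc_topic[:3, :3], 3))
print(np.round(assoc_lsa[:3, :3], 3))
```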
6 Exploring more complex generative models

The topic model, which explicitly addresses the problem of predicting words from their contexts, seems to show a closer correspondence to human word association than LSA. A major consequence of this analysis is the possibility that we may be able to gain insight into some of the associative aspects of human semantic memory by exploring statistical solutions to this prediction problem. In particular, it may be possible to develop more sophisticated generative models of language that can capture some of the important linguistic distinctions that influence our processing of words. The close relationship between LSA and the topic model makes the latter a good starting point for an exploration of semantic association, but perhaps the greatest potential of the statistical approach is that it illustrates how we might go about relaxing some of the strong assumptions made by both of these models.

One such assumption is the treatment of a document as a "bag of words", in which sequential information is irrelevant. Semantic information is likely to influence only a small subset of the words used in a particular context, with the majority of the words playing functional syntactic roles that are consistent across contexts. Syntax is just as important as semantics for predicting words, and may be an effective means of deciding if a word is context-dependent. In a preliminary exploration of the consequences of combining syntax and semantics in a generative model for language, we applied a simple model combining the syntactic structure of a hidden Markov model (HMM) with the semantic structure of the topic model. Specifically, we used a third-order HMM with 50 states in which one state marked the start or end of a sentence, 48 states each emitted words from a different multinomial distribution, and one state emitted words from a document-dependent multinomial distribution corresponding to the topic model with T = 150. We estimated parameters for this model using Gibbs sampling, integrating out the parameters for both the HMM and the topic model and sampling a state and a topic for each of the 11821091 word tokens in the corpus. (This larger number is a result of including low-frequency and stop words.) Some of the state and topic distributions from a single sample after 1000 iterations are shown in Table 2. The states of the HMM accurately picked out many of the functional classes of English syntax, while the state corresponding to the topic model was used to capture the context-specific distributions over nouns.

            "syntax"                                    "semantics"

HE          ON         BE         SAID         MAP           DOCTOR
YOU         AT         MAKE       ASKED        NORTH         PATIENT
THEY        INTO       GET        THOUGHT      EARTH         HEALTH
I           FROM       HAVE       TOLD         SOUTH         HOSPITAL
SHE         WITH       GO         SAYS         POLE          MEDICAL
WE          THROUGH    TAKE       MEANS        MAPS          CARE
IT          OVER       DO         CALLED       WEST          PATIENTS
PEOPLE      AROUND     FIND       CRIED        LINES         NURSE
EVERYONE    AGAINST    USE        SHOWS        EAST          DOCTORS
OTHERS      ACROSS     SEE        ANSWERED     EQUATOR       MEDICINE
SCIENTISTS  UPON       HELP       TELLS        AUSTRALIA     NURSING
SOMEONE     TOWARD     KEEP       REPLIED      GLOBE         TREATMENT
WHO         UNDER      GIVE       SHOUTED      POLES         NURSES
NOBODY      ALONG      LOOK       EXPLAINED    HEMISPHERE    PHYSICIAN
ONE         NEAR       COME       LAUGHED      LATITUDE      HOSPITALS
SOMETHING   BEHIND     WORK       MEANT        PLACES        DR
ANYONE      OFF        MOVE       WROTE        LAND          SICK
EVERYBODY   ABOVE      LIVE       SHOWED       WORLD         ASSISTANT
SOME        DOWN       EAT        BELIEVED     COMPASS       EMERGENCY
THEN        BEFORE     BECOME     WHISPERED    CONTINENTS    PRACTICE

Table 2: Each column shows the 20 most probable words in one of the 48 "syntactic" states of the hidden Markov model (four columns on the left) or one of the 150 "semantic" topics (two columns on the right) obtained from a single sample.
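As a sketch of the composite model's generative process, reduced to a first-order HMM with invented toy dimensions and parameters (the paper's model is third-order with 50 states and T = 150):

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy dimensions: states, topics, vocabulary size.
S, T, W = 3, 2, 8
SEM = 0                                         # state 0 plays the topic-model role

trans = rng.dirichlet(np.ones(S), size=S)       # state-to-state transition matrix
state_emit = rng.dirichlet(np.ones(W), size=S)  # word multinomial for each state
phi = rng.dirichlet(np.ones(W), size=T)         # word multinomial for each topic
theta = rng.dirichlet(np.ones(T))               # topic weights for one document

def generate(n_words):
    words = []
    state = rng.integers(S)
    for _ in range(n_words):
        state = rng.choice(S, p=trans[state])
        if state == SEM:                        # content word: drawn from a topic
            topic = rng.choice(T, p=theta)
            words.append(rng.choice(W, p=phi[topic]))
        else:                                   # function word: drawn from the state
            words.append(rng.choice(W, p=state_emit[state]))
    return words

print(generate(10))
```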
Combining the topic model with the HMM seems to have advantages for both: no function words are absorbed into the topics, and the HMM does not need to deal with the context-specific variation in nouns. The model also seems to do a good job of generating topic-specific text: we can clamp the distribution over topics to pick out those of interest, and then use the model to generate phrases. For example, we can generate phrases on the topics of research ("the chief wicked selection of research in the big months", "astronomy peered upon your scientist's door", or "anatomy established with principles expected in biology"), language ("he expressly wanted that better vowel"), and the law ("but the crime had been severely polite and confused", or "custody on enforcement rights is plentiful"). While these phrases are somewhat nonsensical, they are certainly topical.
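Clamping amounts to replacing the document's topic weights with a point mass before sampling content words; a toy self-contained sketch (vocabulary and probabilities invented):

```python
import numpy as np

rng = np.random.default_rng(3)

vocab = ["research", "science", "scientist", "love", "heart", "joy"]
phi = np.array([[0.40, 0.02], [0.30, 0.03], [0.20, 0.05],   # P(w | z) by column
                [0.04, 0.40], [0.03, 0.30], [0.03, 0.20]])
theta = np.array([1.0, 0.0])   # clamped: all weight on the research topic

# Every content word is now drawn from the clamped topic's distribution.
topics = rng.choice(2, size=5, p=theta)
print([vocab[rng.choice(6, p=phi[:, t])] for t in topics])
```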
7 Conclusion

Viewing memory and categorization as systems involved in the efficient prediction of an organism's environment can provide insight into these cognitive capacities. Likewise, it is possible to learn about human semantic association by considering the problem of predicting words from their contexts. Latent Semantic Analysis addresses this problem, and provides a good account of human semantic association. Here, we have shown that a closer correspondence to human data can be obtained by taking a probabilistic approach that explicitly models the generative structure of language, consistent with the hypothesis that the association between words reflects their probabilistic relationships. The great promise of this approach is the potential to explore how more sophisticated statistical models of language, such as those incorporating both syntax and semantics, might help us understand cognition.

Acknowledgments

This work was generously supported by the NTT Communications Sciences Laboratories. We used Mersenne Twister code written by Shawn Cokus, and are grateful to Touchstone Applied Science Associates for making available the TASA corpus, and to Josh Tenenbaum for extensive discussions on this topic.

References

[1] J. R. Anderson. The Adaptive Character of Thought. Erlbaum, Hillsdale, NJ, 1990.

[2] D. M. Blei, A. Y. Ng, and M. I. Jordan. Latent Dirichlet allocation. In T. G. Dietterich, S. Becker, and Z. Ghahramani, eds., Advances in Neural Information Processing Systems 14, 2002.

[3] W. R. Gilks, S. Richardson, and D. J. Spiegelhalter, eds. Markov Chain Monte Carlo in Practice. Chapman and Hall, Suffolk, 1996.

[4] T. Hofmann. Probabilistic latent semantic indexing. In Proceedings of the Twenty-Second Annual International SIGIR Conference, 1999.

[5] T. K. Landauer and S. T. Dumais. A solution to Plato's problem: The Latent Semantic Analysis theory of acquisition, induction, and representation of knowledge. Psychological Review, 104:211-240, 1997.

[6] M. Matsumoto and T. Nishimura. Mersenne twister: A 623-dimensionally equidistributed uniform pseudorandom number generator. ACM Transactions on Modeling and Computer Simulation, 8:3-30, 1998.

[7] D. L. Nelson, C. L. McEvoy, and T. A. Schreiber. The University of South Florida word association norms. http://www.usf.edu/FreeAssociation, 1999.