{"title": "Streaming Pointwise Mutual Information", "book": "Advances in Neural Information Processing Systems", "page_first": 1892, "page_last": 1900, "abstract": "Recent work has led to the ability to perform space ef\ufb01cient, approximate counting over large vocabularies in a streaming context. Motivated by the existence of data structures of this type, we explore the computation of associativity scores, other- wise known as pointwise mutual information (PMI), in a streaming context. We give theoretical bounds showing the impracticality of perfect online PMI compu- tation, and detail an algorithm with high expected accuracy. Experiments on news articles show our approach gives high accuracy on real world data.", "full_text": "Streaming Pointwise Mutual Information\n\nBenjamin Van Durme\nUniversity of Rochester\n\nRochester, NY 14627, USA\n\nAshwin Lall\n\nGeorgia Institute of Technology\n\nAtlanta, GA 30332, USA\n\nAbstract\n\nRecent work has led to the ability to perform space ef\ufb01cient, approximate counting\nover large vocabularies in a streaming context. Motivated by the existence of data\nstructures of this type, we explore the computation of associativity scores, other-\nwise known as pointwise mutual information (PMI), in a streaming context. We\ngive theoretical bounds showing the impracticality of perfect online PMI compu-\ntation, and detail an algorithm with high expected accuracy. Experiments on news\narticles show our approach gives high accuracy on real world data.\n\n1\n\nIntroduction\n\nRecent work has led to the ability to perform space ef\ufb01cient counting over large vocabularies [Talbot,\n2009; Van Durme and Lall, 2009]. 
As online extensions to previous work in randomized storage [Talbot and Osborne, 2007], significant space savings are enabled if your application can tolerate a small chance of false positives in lookup, and you do not require the ability to enumerate the contents of your collection.1 Recent interest in this area is motivated by the scale of available data outpacing the computational resources typically at hand.\nWe explore what a data structure of this type means for the computation of associativity scores, or pointwise mutual information, in a streaming context. We show that approximate k-best PMI rank lists may be maintained online, with high accuracy, both in theory and in practice. This result is useful both when storage constraints prohibit explicitly storing all observed co-occurrences in a stream, as well as in cases where accessing such PMI values would be useful online.\n\n2 Problem Definition and Notation\n\nThroughout this paper we will assume our data is in the form of pairs ⟨x, y⟩, where x ∈ X and y ∈ Y. Further, we assume that the sets X and Y are so large that it is infeasible to explicitly maintain precise counts for every such pair on a single machine (e.g., X and Y are all the words in the English language).\nWe define the pointwise mutual information (PMI) of a pair x and y to be\n\nPMI(x, y) ≡ lg [P(x, y)/(P(x)P(y))],\n\nwhere these (empirical) probabilities are computed over a particular data set of interest.2 Now, it is often the case that we are not interested in all such pairs, but instead are satisfied with estimating the subset of Y with the k largest PMIs with each x ∈ X. We denote this set by PMIk(x).\nOur goal in this paper is to estimate these top-k sets in a streaming fashion, i.e., where there is only a single pass allowed over the data and it is infeasible to store all the data for random access. This model is natural for a variety of reasons, e.g., the data is being accessed by crawling the web and it is infeasible to buffer all the crawled results.\n\n1This situation holds in language modeling, such as in the context of machine translation.\n2As is standard, lg refers to log2.\n\nAs mentioned earlier, there has been considerable work in keeping track of the counts of a large number of items succinctly. We explore the possibility of using these succinct data structures to solve this problem. Suppose there is a multi-set M = {m1, m2, m3, . . .} of word pairs from X × Y. Using an approximate counter data structure, it is possible to maintain in an online fashion the counts\n\nc(x, y) = |{i | mi = ⟨x, y⟩}|,\nc(x) = |{i | mi = ⟨x, y′⟩, for some y′ ∈ Y }|, and\nc(y) = |{i | mi = ⟨x′, y⟩, for some x′ ∈ X}|,\n\nwhich allow us to estimate PMI(x, y) as lg [(c(x, y)/n)/((c(x)/n)(c(y)/n))] = lg [n c(x, y)/(c(x)c(y))], where n is the length of the stream. The challenge for this problem is determining how to keep track of the set PMIk(x) for all x ∈ X in an online fashion.\n\n3 Motivation\n\nPointwise mutual information underlies many experiments in computational (psycho-)linguistics, going back at least to Church and Hanks [1990], who at the time referred to PMI as a mathematical formalization of the psycholinguistic association score. We do not attempt to summarize this work in its entirety, but give representative highlights below.\nTrigger Models Rosenfeld [1994] was interested in collecting trigger pairs, ⟨A, B⟩, such that the presence of A in a document is likely to “trigger” an occurrence of B. There the concern was in finding the most useful triggers overall, and thus pairs were favored based on high average mutual information:\n\nI(A, B) = P(AB) lg [P(AB)/(P(A)P(B))] + P(AB̄) lg [P(AB̄)/(P(A)P(B̄))] + P(ĀB) lg [P(ĀB)/(P(Ā)P(B))] + P(ĀB̄) lg [P(ĀB̄)/(P(Ā)P(B̄))].\n\nAs commented by Rosenfeld, the first term of his equation relates to the PMI formula given by Church and Hanks [1990]. We might describe our work here as collecting terms y, triggered by each x, once we know x to be present. As the number of possible terms is large,3 we limit ourselves to the top-k items.\nAssociated Verbs Chambers and Jurafsky [2008], following work such as Lin [1998] and Chklovski and Pantel [2004], introduced a probabilistic model for learning Shankian script-like structures which they termed narrative event chains; for example, if in a given document someone pleaded, admits and was convicted, then it is likely they were also sentenced, or paroled, or fired. Prior to enforcing a temporal ordering (which does not concern us here), Chambers and Jurafsky acquired clusters of related verb-argument pairs by finding those that shared high PMI.\nAssociativity in Human Memory Central to their rational analysis of human memory, Schooler and Anderson [1997] approximated the needs odds, n, of a memory structure S as the product of recency and context factors, where the context factor is the product of associative ratios between S and local cues:\n\nn ≅ [P(S|HS)/P(S̄|HS)] ∏_{q∈QS} [P(Sq)/(P(S)P(q))].\n\nIf we take x to range over cues, and y to be a memory structure, then in our work here we are storing the identities of the top-k memory structures for a given cue x, as according to strength of associativity.4\n\n4 Lower Bound\n\nWe first discuss the difficulty in solving the online PMI problem exactly. An obvious first attempt at an algorithm for this problem is to use approximate counters to estimate the PMI for each pair in\n\n3Rosenfeld: ... 
unlike in a bigram model, where the number of different consecutive word pairs is much less than [the vocabulary] V 2, the number of word pairs where both words occurred in the same document is a significant fraction of V 2.\n4Note that Frank et al. [2007] gave evidence suggesting PMI may be suboptimal for cue modeling, but to our understanding this result is limited to the case of novel language acquisition.\n\nthe stream and maintain the top-k for each x using a priority queue. This method does not work, as illustrated by the examples below.\nExample 1 (probability of y changes): Consider the stream\n\nxy xy xy xz wz | wy wy wy wy wy\n\nwhich we have divided in half. After the first half, y is best for x since PMI(x, y) = lg [(3/5)/((4/5)(3/5))] = lg (5/4) and PMI(x, z) = lg [(1/5)/((4/5)(2/5))] = lg (5/8). At the end of the second half of the stream, z is best for x since PMI(x, y) = lg [(3/10)/((4/10)(8/10))] ≈ lg (0.94) and PMI(x, z) = lg [(1/10)/((4/10)(2/10))] = lg (1.25). However, during the second half of the stream we never encounter x and hence never update its value. So, the naive algorithm behaves erroneously.\nWhat this example shows is that not only does the naive algorithm fail, but also that the top-k PMI of some x may change (because of the change in probability of y) without any opportunity to update PMIk(x).\nNext, we show another example which illustrates the failure of the naive algorithm due to the fact that it does not re-compute every PMI each time.\nExample 2 (probability of x changes): Consider the stream\n\npd py py xy xd\n\nin which we are interested in only the top PMI tuples for x. When we see xy in the stream, PMI(x, y) = lg [(1/4)/((1/4)(3/4))] ≈ lg (1.33), and when we see xd in the stream, PMI(x, d) = lg [(1/5)/((2/5)(2/5))] = lg (1.25). As a result, we retain xy but not xd. However, xy's PMI is now lg [(1/5)/((2/5)(3/5))] = lg (0.833) which means that we should replace xy with xd. 
However, since we didn't re-compute PMI(x, y), we erroneously output xy.\nWe next formalize these intuitions into a lower bound showing why it might be hard to compute every PMIk(x) precisely. For this lower bound, we make the simplifying assumption that the size of the set X is much smaller than n (i.e., |X| ∈ o(n)), which is the usual case in practice.\nTheorem 1: Any algorithm that explicitly maintains the top-k PMIs for all x ∈ X in a stream of length at most n (where |X| ∈ o(n)) in a single pass requires Ω(n|X|) time.\nWe will prove this theorem using the following lemma:\nLemma 1: Any algorithm that explicitly maintains the top-k PMIs of |X| = p + 1 items over a stream of length at most n = 2r + 2p + 1 in a single pass requires Ω(pr) time.\nProof of Lemma 1: Let us take the length of the stream to be n, where we assume without loss of generality that n is odd. Let X = {x1, . . . , xp+1}, Y = {y1, y2} and let us consider the following stream:\n\nx1y1, x2y1, x3y1, . . . , xpy1,\nx1y2, x2y2, x3y2, . . . , xpy2,\nxp+1y1,\nxp+1y2, xp+1y2,\nxp+1y1, xp+1y1,\nxp+1y2, xp+1y2,\n. . .\nxp+1y(1+r mod 2), xp+1y(1+r mod 2),\n\nwhere the doubled xp+1 pairs alternate between y2 and y1 a total of r times.\nSuppose that we are interested in maintaining only the top-PMI item for each xi ∈ X (the proof easily generalizes to larger k). Let us consider the update cost for only the set Xp = {x1, . . . , xp} ⊆ X. After xp+1y1 appears in the stream for the first time, it should be evident that all the elements of Xp have a higher PMI with y2 than y1. However, after we see two copies of xp+1y2, the PMI of y1 is higher than that of y2 for each x ∈ Xp. Similarly, the top-PMI of each element of Xp alternates between y1 and y2 for the remainder of the stream. Now, the current PMI for each element of Xp must be correct at any point in the stream since the stream may terminate at any time. 
Hence, by construction, the top PMI of x1, . . . , xp will change at least r times in the course of this stream, for a total of at least pr operations. The length of the stream is n = 2p + 2r + 1. This completes the proof of Lemma 1. □\nProof of Theorem 1: Taking |X| = p + 1, we have in the construction of Lemma 1 that r = (n − 2p − 1)/2 = (n − 2|X| + 1)/2. Hence, there are at least pr = (|X| − 1)(n − 2|X| + 1)/2 = Ω(n|X| − |X|2) update operations required. Since we assumed that |X| ∈ o(n), this is Ω(n|X|) operations. □\nHence, there must be a high update cost for any such algorithm. That is, on average, any algorithm must perform Ω(|X|) operations per item in the stream.\n\n5 Algorithm\n\nThe lower bound from the previous section shows that, when solving the PMI problem, the best one can do is effectively cross-check the PMI for every possible x ∈ X for each item in the stream. In practice, this is far too expensive and will lead to online algorithms that cannot keep up with the rate at which the input data is produced. To solve this problem, we propose a heuristic algorithm that sacrifices some accuracy for speed in computation.\nBesides keeping processing times in check, we have to be careful about the memory requirements of any proposed algorithm. Recall that we are interested in retaining information for all pairs of x and y, where each is drawn from a set of cardinality in the millions. Our algorithm uses approximate counting to retain the counts of all pairs of items ⟨x, y⟩ in a data structure Cxy. We keep exact counts of all x and y since this takes considerably less space. Given these values, we can (approximately) estimate PMI(x, y) for any ⟨x, y⟩ in the stream.\nWe assume Cxy to be based on recent work in space efficient counting methods for streamed text data [Talbot, 2009; Van Durme and Lall, 2009]. 
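Example 1 above can be checked concretely against the count-based estimate lg [n c(x, y)/(c(x)c(y))]. The following sketch (plain Python; the `top_pmi` helper is ours, introduced for illustration) recomputes PMIs offline after each half of that stream, confirming that the best partner for x flips:

```python
from collections import Counter
from math import log2

def pmi(pairs, xs, ys, n, x, y):
    # lg [ (c(x,y)/n) / ((c(x)/n)(c(y)/n)) ] = lg [ n*c(x,y) / (c(x)*c(y)) ]
    return log2(n * pairs[(x, y)] / (xs[x] * ys[y]))

def top_pmi(stream, x):
    # Offline: count the whole stream, then rank every y co-occurring with x.
    pairs = Counter(stream)
    xs = Counter(a for a, _ in stream)
    ys = Counter(b for _, b in stream)
    n = len(stream)
    cands = {y for (a, y) in pairs if a == x}
    return max(cands, key=lambda y: pmi(pairs, xs, ys, n, x, y))

first_half = [("x", "y")] * 3 + [("x", "z"), ("w", "z")]
second_half = [("w", "y")] * 5

print(top_pmi(first_half, "x"))                # y: lg(5/4) beats lg(5/8)
print(top_pmi(first_half + second_half, "x"))  # z: lg(1.25) beats lg(0.94)
```

A naive online algorithm never revisits x during the second half, so it would still report y; the offline recomputation exposes the flip.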
For our implementation we used TOMB counters [Van Durme and Lall, 2009] which approximate counts by storing values in log-scale. These log-scale counts are maintained in unary within layers of Bloom filters [Bloom, 1970] (Figure 1) that can be probabilistically updated using a small base (Figure 2); each occurrence of an item in the stream prompts a probabilistic update to its value, dependent on the base. By tuning this base, one can trade off between the accuracy of the counts and the space savings of approximate counting.\n\nFigure 1: Unary counting with Bloom filters.\n\nFigure 2: Transition by base b.\n\nNow, to get around the problem of having stale PMI values because of the count of x changing (i.e., the issue in Example 2 in the previous section), we divide the stream up into fixed-size buffers B and re-compute the PMIs for all pairs seen within each buffer (see Algorithm 1).\nUpdating counts for x, y and ⟨x, y⟩ is constant time per element in the stream. Insertion into a k-best priority queue requires O(lg k) operations. Per interval, we perform in the worst case one insertion per new element observed, along with one insertion for each element stored in the previous rank lists. As long as |B| ≥ |X|k, updating rank lists costs O(|B| lg k) per interval.5 The algorithm therefore requires O(n + n lg k) = O(n lg k) time, where n is the length of the stream. Note that when |B| = n we have the standard offline method for computing PMI across X and Y (notwithstanding approximate counters). When |B| < |X|k, we run afoul of the lower bound given by Theorem 1.\nRegarding space, |I| ≤ |B|. 
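The base-b probabilistic update described above can be illustrated with a minimal Morris-style log-scale counter (a simplification: the actual TOMB structure keeps these unary log-scale values inside layered Bloom filters, which this sketch omits):

```python
import random

class LogScaleCounter:
    """Morris-style approximate counter: stores a level v whose
    unbiased count estimate is (b**v - 1) / (b - 1) for base b."""

    def __init__(self, base=1.25):
        self.base = base
        self.v = 0

    def increment(self):
        # Advance from v to v+1 with probability b**(-v), so on
        # average b**v real occurrences are needed per transition.
        if random.random() < self.base ** (-self.v):
            self.v += 1

    def estimate(self):
        # Expected number of increments needed to reach level v.
        return (self.base ** self.v - 1) / (self.base - 1)

random.seed(0)
c = LogScaleCounter(base=1.25)
for _ in range(10000):
    c.increment()
print(round(c.estimate()))  # an unbiased estimate of 10000
```

A smaller base tracks counts more precisely but needs more levels (space); a larger base saves space at the cost of higher variance, which is the trade-off the text describes.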
A benefit of our algorithm is that this can be kept significantly smaller than |X| × |Y |,6 since in practice, |Y | ≫ lg k.\n\n5I.e., the extra cost for reinserting elements from the previous rank lists is amortized over the buffer length.\n6E.g., the V 2 of Rosenfeld.\n\nAlgorithm 1 FIND-ONLINE-PMI\n1: initialize hashtable counters Hx and Hy for exact counts\n2: initialize an approximate counter Cxy\n3: initialize rank lists, L, mapping x to a k-best priority queue storing ⟨y, PMI(x, y)⟩\n4: for each buffer B in the stream do\n5:   initialize I, mapping ⟨x, y⟩ to {0, 1}, denoting whether ⟨x, y⟩ was observed in B\n6:   for ⟨x, y⟩ in B do\n7:     set I(⟨x, y⟩) = 1\n8:     increment Hx(x)  ▷ initial value of 0\n9:     increment Hy(y)  ▷ initial value of 0\n10:    insert ⟨x, y⟩ into Cxy\n11:  end for\n12:  for each x ∈ X do\n13:    re-compute L(x) using current y ∈ L(x) and {y | I(⟨x, y⟩) = 1}\n14:  end for\n15: end for\n\n5.1 Misclassification Probability Bound\n\nOur algorithm removes problems due to the count of x changing, but does not solve the problem that the probability of y changes (i.e., the issue in Example 1 in the previous section). The PMI of a pair ⟨x, y⟩ may decrease considerably if there are many occurrences of y (and relatively few occurrences of ⟨x, y⟩) in the stream, leading to the removal of y from the true top-k list for x. 
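Algorithm 1 can be sketched in Python as follows (a sketch, not the paper's implementation: exact dictionaries stand in for Hx, Hy, and, for readability, for the approximate counter Cxy; the stream is a list of (x, y) pairs and the buffer size is illustrative):

```python
import heapq
import math
from collections import Counter

def find_online_pmi(stream, k=2, buffer_size=100):
    Hx, Hy, Cxy = Counter(), Counter(), Counter()  # Cxy would be approximate in practice
    L = {}   # x -> k-best heap of (PMI, y), smallest PMI at the root
    n = 0
    for start in range(0, len(stream), buffer_size):
        buf = stream[start:start + buffer_size]
        I = set()   # pairs observed in this buffer
        for x, y in buf:
            I.add((x, y))
            Hx[x] += 1
            Hy[y] += 1
            Cxy[(x, y)] += 1
            n += 1
        # Re-compute rank lists over current members plus pairs seen this buffer.
        for x in Hx:
            cands = {y for (_, y) in L.get(x, [])} | {y for (a, y) in I if a == x}
            heap = []
            for y in cands:
                score = math.log2(n * Cxy[(x, y)] / (Hx[x] * Hy[y]))
                heapq.heappush(heap, (score, y))
                if len(heap) > k:
                    heapq.heappop(heap)  # drop the smallest PMI
            L[x] = heap
    return {x: [y for _, y in sorted(L[x], reverse=True)] for x in L}

res = find_online_pmi([("a", "b"), ("a", "c"), ("d", "c"), ("a", "b")],
                      k=1, buffer_size=2)
print(res)  # {'a': ['b'], 'd': ['c']}
```

Per buffer, only elements already in a rank list or observed in the buffer are re-scored, matching the O(|B| lg k) per-interval cost claimed above.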
We show in the following that this is not likely to happen very often for the text data that our algorithm is designed to work on.\nIn giving a bound on this error, we will make two assumptions: (i) the PMI for a given x follows a Zipfian distribution (something that we observed in our data), and (ii) the items in the stream are drawn independently from some underlying distribution (i.e., they are i.i.d.). Both these assumptions together help us to sidestep the lower bound proved earlier and demonstrate that our single-pass algorithm will perform well on real language data sets.\nWe first make the observation that, for any y in the set of top-k PMIs for x, if ⟨x, y⟩ appears in the final buffer then we are guaranteed that y is correctly placed in the top-k at the end. This is because we recompute PMIs for all the pairs in the last buffer at the end of the algorithm (line 13 of Algorithm 1). The probability that ⟨x, y⟩ does not appear in the last buffer can be bounded using the i.i.d. assumption to be at most\n\n(1 − c(x, y)/n)^|B| ≈ e^(−|B|c(x,y)/n) ≤ e^(−k|X|c(x,y)/n),\n\nwhere for the last inequality we use the bound |B| ≥ |X|k that we assumed in the previous section. Hence, in those cases that c(x, y) = Ω(n/(|X|k)), our algorithm correctly identifies y as being in the top-k PMI for x with high probability. The proof for general c(x, y) is given next.\nWe study the probability with which some y′ which is not in the top-k PMI for a fixed x can displace some y in the top-k PMI for x. We do so by studying the last buffer in which ⟨x, y⟩ appears. The only way that y′ can displace y in the top-k for x in our algorithm is if at the end of this buffer the following holds true:\n\nct(x, y′)/ct(y′) > ct(x, y)/ct(y),\n\nwhere the t subscripts denote the respective counts at the end of the buffer. 
We will show that this event occurs with very small probability. We do so by bounding the probability of the following three unlikely events.\nIf we assume all c(x, y) are above some threshold m, then with only small probability (i.e., 1/2^m) will the last buffer containing ⟨x, y⟩ appear before the midpoint of the stream. So, let us assume that the buffer appears after the midpoint of the stream. Then, the probability that ⟨x, y′⟩ appears more than (1 + δ)c(x, y′)/2 times by this point can be bounded by the Chernoff bound to be at most exp(−c(x, y′)δ²/8). Similarly, the probability that y′ appears less than (1 − δ)c(y′)/2 times by this point can be bounded by exp(−c(y′)δ²/4). Putting all these together, we get that\n\nPr[ct(x, y′)/ct(y′) > (1 + δ)c(x, y′)/((1 − δ)c(y′))] < 1/2^m + exp(−c(x, y′)δ²/8) + exp(−c(y′)δ²/4).\n\nWe now make use of the assumption that the PMIs are distributed in a Zipfian manner. Let us take the rank of the PMI of y′ to be i (and recall that the rank of the PMI of y is at most k). Then, by the Zipfian assumption, we have that PMI(x, y) ≥ (i/k)^s PMI(x, y′), where s is the Zipfian parameter. This can be re-written as c(x, y)/c(y) ≥ (c(x, y′)/c(y′)) · 2^(((i/k)^s − 1)PMI(x,y′)). We can now put all these results together to bound the probability of the event\n\nPr[ct(x, y′)/ct(y′) > ct(x, y)/ct(y)] ≤ 1/2^m + exp(−c(x, y′)δ²/8) + exp(−c(y′)δ²/4),\n\nwhere we take δ = (2^(((i/k)^s − 1)PMI(x,y′)) − 1)/(2^(((i/k)^s − 1)PMI(x,y′)) + 1).\nHence, the probability that some low-ranked y′ will displace a y in the top-k PMI of x is low. 
Taking a union bound across all possible y′ ∈ Y gives a bound of 1/2^m + |Y |(exp(−c(x, y′)δ²/8) + exp(−c(y′)δ²/4)).7\n\n6 Experiments\n\nWe evaluated our algorithm for online, k-best PMI with a set of experiments on collecting verbal triggers in a document collection. For each document, we considered all verb::verb pairs, non-stemmed; e.g., wrote::ruled, fighting::endure, argued::bore. For each unique verb x observed in the stream, our goal was to recover the top-k verbs y with the highest PMI given x.8 Readers may peek ahead to Table 2 for example results.\nExperiments were based on 100,000 NYTimes articles taken from the Gigaword Corpus [Graff, 2003]. Tokens were tagged for part of speech (POS) using SVMTool [Giménez and Màrquez, 2004], a POS tagger based on SVMlight [Joachims, 1999].\nOur stream was constructed by considering all pairwise combinations of the roughly 82 (on average) verb tokens occurring in each document. Where D ∈ D is a document in the collection, let Dv refer to the list of verbal tokens, not necessarily unique. The length of our stream, n, is therefore ∑_{D∈D} |Dv|².9\nWhile research into methods for space efficient, approximate counting has been motivated by a desire to handle exceptionally large datasets (using limited resources), we restricted ourselves here to a dataset that would allow for comparison to explicit, non-approximate counting (implemented through use of standard hashtables).10 We will refer to such non-approximate counting as perfect counting. Finally, to guard against spurious results arising from rare terms, we employed the same c(xy) > 5 threshold as used by Church and Hanks [1990].\nWe did not heavily tune our counting mechanism to this task, other than to experiment with a few different bases (settling on a base of 1.25). 
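Assuming the self-inclusive, ordered pairing implied by n = ∑ |Dv|² (our reading of the stream-length formula; the document contents below are invented for illustration), the stream construction can be sketched as:

```python
from itertools import product

def pair_stream(documents):
    # documents: one list of verb tokens per document (tokens need not be unique).
    # Emit every ordered verb::verb pair within a document, so a document
    # with |Dv| verb tokens contributes |Dv|**2 pairs to the stream.
    for Dv in documents:
        for x, y in product(Dv, Dv):
            yield (x, y)

docs = [["pleaded", "admits", "convicted"], ["vetoed", "overridden"]]
stream = list(pair_stream(docs))
print(len(stream))  # 3**2 + 2**2 = 13
```

Pairing in both directions is what lets a single pass fill the rank list of every x that occurs in the document.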
As such, empirical results for approximate counting should be taken as a lower bound, while the perfect counting results are the upper bound on what an approximate counter might achieve.\n\n7For streams composed as described in our experiments, this bound becomes powerful as m approaches 100 or beyond (recalling that both c(x, y′), c(y′) > m). Experimentally we observed this to be conservative in that such errors appear unlikely even when using a smaller threshold (e.g., m = 5).\n8Unlike in the case of Rosenfeld [1994], we allowed for triggers to occur anywhere in a document, rather than exclusively in the preceding context. This can be viewed as a restricted version of the experiments of Chambers and Jurafsky [2008], where we consider all verb pairs, regardless of whether they are assumed to possess a co-referent argument.\n9For the experiments here, n = 869,641,588, or roughly 900 million, ⟨x, y⟩ pairs. If fully enumerated as text, this stream would have required 12GB of uncompressed storage. Vocabulary size, |X| = |Y |, was roughly 30 thousand (28,972) unique tokens.\n10That is, since our algorithm is susceptible to adversarial manipulation of the stream, it is important to establish the experimental upper bound that is possible assuming zero error due to the use of probabilistic counts.\n\nFigure 3: 3(a): Normalized, mean PMI for top-50 y for each x. 3(b): Accuracy of top-5 rank lists using the standard measurement, and when using an instrumented counter that had oracle access to which ⟨x, y⟩ were above threshold.\n\nTable 1: When using a perfect counter and a buffer of 50, 500 and 5,000 documents, for k = 1, 5, 10: the accuracy of the resultant k-best lists when compared to the first k, k + 1 and k + 2 true values.\n\nBuffer | k = 1: 1, 2, 3 | k = 5: 5, 6, 7 | k = 10: 10, 11, 12\n50 | 94.10, 98.75, 99.45 | 97.25, 99.13, 99.60 | 98.05, 99.26, 99.63\n500 | 94.14, 98.81, 99.53 | 97.31, 99.16, 99.62 | 98.12, 99.29, 99.65\n5000 | 94.69, 98.93, 99.60 | 97.76, 99.30, 99.71 | 98.55, 99.46, 99.74\n\nWe measured the accuracy of resultant k-best lists by first collecting the true top-50 elements for each x, offline, to be used as a key. Then, for a proposed k-best list, accuracy was calculated at different ranks of the gold standard. For example, the elements of a proposed 10-best list will optimally fully intersect with the first 10 elements of the gold standard. In the case the list is not perfect, we would hope that an element incorrectly positioned at, e.g., rank 9, should really be of rank 12, rather than rank 50.\nUsing this gold standard, Figure 3(a) shows the normalized, mean PMI scores as according to rank. This curve supports our earlier theoretical assumption that PMI over Y is a Zipfian distribution for a given x.\n\n6.1 Results\n\nIn Table 1 we see that when using a perfect counter, our algorithm succeeds in recovering almost all top-k elements. For example, when k = 5, reading 500 documents at a time, our rank lists are 97.31% accurate. Further, of those collected triggers that are not truly in the top-5, most were either in the top 6 or 7. 
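The rank-based accuracy measurement can be sketched as follows (`accuracy_at_rank` is a hypothetical helper of ours; the gold key is the true top-50 list for a given x, and the verbs below are invented):

```python
def accuracy_at_rank(proposed, gold_key, r):
    # Fraction of the proposed k-best list found within the
    # first r elements of the gold-standard ranking.
    hits = sum(1 for y in proposed if y in gold_key[:r])
    return hits / len(proposed)

gold = ["detonate", "assassinate", "bomb", "plotting", "plotted",
        "bombed", "exploded"]
proposed = ["detonate", "bombed", "assassinate", "plotting", "exploded"]

print(accuracy_at_rank(proposed, gold, 5))  # 0.6: 3 of 5 are in the true top-5
print(accuracy_at_rank(proposed, gold, 7))  # 1.0: all 5 are within the true top-7
```

Scoring the same proposed list against the first k, k + 1, and k + 2 gold elements is what produces the grouped columns of Table 1.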
As there appears to be minimal impact based on buffer size, we fixed |B| = 500 documents for the remainder of our experiments.11 This result supports the intuition behind our misclassification probability bound: while it is possible for an adversary to construct a stream that would mislead our online algorithm, this seems to rarely occur in practice.\nShown in Figure 3(b) are the accuracy results when using an approximate counter and a buffer size of 500 documents, to collect top-5 rank lists. Two results are presented. The standard result is based on comparing the rank lists to the key just as with the results when using a perfect counter. A problem with this evaluation is that the hard threshold used for both generating the key, and the results for perfect counting, cannot be guaranteed to hold when using approximate counts. It is possible that some ⟨x, y⟩ pair that occurs perhaps 4 or 5 times may be misreported as occurring 6 times or more. In this case, the ⟨x, y⟩ pair will not appear in the key in any position, thus creating an artificial upper bound on the possible accuracy as according to this metric. For purposes of comparison, we instrumented the approximate solution to use a perfect counter in parallel. All PMI values were computed as before, using approximate counts, but the perfect counter was used just in verifying whether a given pair exceeded the threshold. In this way the approximate counting solution saw just those elements of the stream as observed in the perfect counting case, allowing us to evaluate the ranking error introduced by the counter, irrespective of issues in “dipping below” the threshold. As seen in the instrumented curve, top-5 rank lists generated when using the approximate counter are composed primarily of elements truly ranked 10 or below.\n\n11Strictly speaking, |B| is no larger than the maximum length interval in the stream resulting from enumerating the contents of, e.g., 500 consecutive documents.\n\nTable 2: Top 5 verbs, y, for x = bomb, laughed and vetoed. Left columns are based on using a perfect counter, while right columns are based on an approximate counter. Numeral prefixes denote rank of element in true top-k lists. All results are with respect to a buffer of 500 documents.\n\nx = bomb: perfect: 1:detonate, 2:assassinate, 3:bomb, 4:plotting, 5:plotted | approximate: 1:detonate, 7:bombed, 2:assassinate, 4:plotting, 8:expel\nx = laughed: perfect: 1:tickle, 2:tickling, 3:tickled, 4:snickered, 5:captivating | approximate: -:panang, 1:tickle, 3:tickled, 2:tickling, 4:snickered\nx = vetoed: perfect: 1:vetoing, 2:overridden, 3:overrode, 4:override, 5:latches | approximate: 1:vetoing, 2:overridden, 4:override, 5:latches, 7:vetoed\n\n6.2 Examples\n\nTable 2 contains the top-5 most associated verbs as according to our algorithm, both when using a perfect and an approximate counter. As can be seen for the perfect counter, and as suggested by Table 1, in practice it is possible to track PMI scores over buffered intervals with a very high degree of accuracy. 
For the examples shown (and more generally throughout the results), the resultant\nk-best lists are near perfect matches to those computed of\ufb02ine.\nWhen using an approximate counter we continue to see reasonable results, with some error intro-\nduced due to the use of probabilistic counting. The rank 1 entry reported for x = laughed exempli-\n\ufb01es the earlier referenced issue of the approximate counter being able to incorrectly dip below the\nthreshold for terms that the gold standard would never see.12\n\n7 Conclusions\n\nIn this paper we provided the \ufb01rst study of estimating top-k PMI online. We showed that while a\nprecise solution comes at a high cost in the streaming model, there exists a simple algorithm that\nperforms well on real data. An avenue of future work is to drop the assumption that each of the\ntop-k PMI values is maintained explicitly and see whether there is an algorithm that is feasible for\nthe streaming version of the problem or if a similar lower bound still applies. Another promising\napproach would be to apply the tools of two-way associations to this problem [Li and Church, 2007].\nAn experiment of Schooler and Anderson [1997] assumed words in NYTimes headlines operated as\ncues for the retrieval of memory structures associated with co-occurring terms. Missing from that\nreport was how such cues might be accumulated over time. The work presented here can be taken as\na step towards modeling resource constrained, online cue learning, where an appealing description\nof our model involves agents tracking co-occurring events over a local temporal window (such as\na day), and regularly consolidating this information into long term memory (when they \u201csleep\u201d).\nFuture work may continue this direction by considering data from human trials.\n\nAcknowledgements Special thanks to Dan Gildea, as well as Rochester HLP/Jaeger-lab members\nfor ideas and feedback. 
The \ufb01rst author was funded by a 2008 Provost\u2019s Multidisciplinary Award\nfrom the University of Rochester, and NSF grant IIS-0328849. The second author was supported in\npart by the NSF grants CNS-0905169 and CNS-0910592, funded under the American Recovery and\nReinvestment Act of 2009 (Public Law 111-5), and by NSF grant CNS-0716423.\n\n12I.e., the token panang, incorrectly tagged as a verb, is sparsely occurring.\n\n8\n\n\fReferences\n[Bloom, 1970] Burton H. Bloom. Space/time trade-offs in hash coding with allowable errors. Communications\n\nof the ACM, 13:422\u2013426, 1970.\n\n[Chambers and Jurafsky, 2008] Nathanael Chambers and Dan Jurafsky. Unsupervised Learning of Narrative\n\nEvent Chains. In Proceedings of ACL, 2008.\n\n[Chklovski and Pantel, 2004] Timothy Chklovski and Patrick Pantel. VerbOcean: Mining the Web for Fine-\nGrained Semantic Verb Relations. In Proceedings of Conference on Empirical Methods in Natural Language\nProcessing (EMNLP-04), pages 33\u201340, Barcelona, Spain, 2004.\n\n[Church and Hanks, 1990] Kenneth Church and Patrick Hanks. Word Association Norms, Mutual Information\n\nand Lexicography. Computational Linguistics, 16(1):22\u201329, March 1990.\n\n[Frank et al., 2007] Michael C. Frank, Noah D. Goodman, and Joshua B. Tenenbaum. A Bayesian framework\n\nfor cross-situational word learning. In Advances in Neural Information Processing Systems, 20, 2007.\n\n[Gim\u00b4enez and M`arquez, 2004] Jes\u00b4us Gim\u00b4enez and Llu\u00b4\u0131s M`arquez. SVMTool: A general POS tagger generator\n\nbased on Support Vector Machines. In Proceedings of LREC, 2004.\n\n[Graff, 2003] David Graff. English Gigaword. Linguistic Data Consortium, Philadelphia, 2003.\n[Joachims, 1999] Thorsten Joachims. Making large-scale SVM learning practical. In B. Sch\u00a8olkopf, C. Burges,\nand A. 
Smola, editors, Advances in Kernel Methods - Support Vector Learning, chapter 11, pages 169\u2013184.\nMIT Press, Cambridge, MA, 1999.\n\n[Li and Church, 2007] Ping Li and Kenneth W. Church. A sketch algorithm for estimating two-way and multi-\n\nway associations. Computational Linguistics, 33(3):305\u2013354, 2007.\n\n[Lin, 1998] Dekang Lin. Automatic Retrieval and Clustering of Similar Words. In Proceedings of COLING-\n\nACL, 1998.\n\n[Rosenfeld, 1994] Ronald Rosenfeld. Adaptive Statistical Language Modeling: A Maximum Entropy Ap-\n\nproach. PhD thesis, Computer Science Department, Carnegie Mellon University, April 1994.\n\n[Schooler and Anderson, 1997] Lael J. Schooler and John R. Anderson. The role of process in the rational\n\nanalysis of memory. Cognitive Psychology, 32(3):219\u2013250, 1997.\n\n[Talbot and Osborne, 2007] David Talbot and Miles Osborne. Randomised Language Modelling for Statistical\n\nMachine Translation. In Proceedings of ACL, 2007.\n\n[Talbot, 2009] David Talbot. Succinct approximate counting of skewed data. In Proceedings of IJCAI, 2009.\n[Van Durme and Lall, 2009] Benjamin Van Durme and Ashwin Lall. Probabilistic Counting with Randomized\n\nStorage. In Proceedings of IJCAI, 2009.\n\n9\n\n\f", "award": [], "sourceid": 627, "authors": [{"given_name": "Benjamin", "family_name": "Durme", "institution": null}, {"given_name": "Ashwin", "family_name": "Lall", "institution": null}]}