{"title": "Collective Inference on Markov Models for Modeling Bird Migration", "book": "Advances in Neural Information Processing Systems", "page_first": 1321, "page_last": 1328, "abstract": null, "full_text": "Collective Inference on Markov Models\n\nfor Modeling Bird Migration\n\nDaniel Sheldon\n\nM. A. Saleh Elmohamed\n\nDexter Kozen\n\nCornell University\nIthaca, NY 14853\n\n{dsheldon,kozen}@cs.cornell.edu\n\nsaleh@cam.cornell.edu\n\nAbstract\n\nWe investigate a family of inference problems on Markov models, where many\nsample paths are drawn from a Markov chain and partial information is revealed\nto an observer who attempts to reconstruct the sample paths. We present algo-\nrithms and hardness results for several variants of this problem which arise by re-\nvealing different information to the observer and imposing different requirements\nfor the reconstruction of sample paths. Our algorithms are analogous to the clas-\nsical Viterbi algorithm for Hidden Markov Models, which \ufb01nds the single most\nprobable sample path given a sequence of observations. Our work is motivated by\nan important application in ecology: inferring bird migration paths from a large\ndatabase of observations.\n\n1 Introduction\n\nHidden Markov Models (HMMs) assume a generative model for sequential data whereby a sequence\nof states (or sample path) is drawn from a Markov chain in a hidden experiment. Each state generates\nan output symbol from alphabet \u03a3, and these output symbols constitute the data or observations. A\nclassical problem, solved by the Viterbi algorithm, is to \ufb01nd the most probable sample path given\ncertain observations for a given Markov model. We call this the single path problem; it is well suited\nto labeling or tagging a single sequence of data. 
For example, HMMs have been successfully applied in speech recognition [1], natural language processing [2], and biological sequence analysis [3].\n\nWe introduce two generalizations of the single path problem for performing collective inference on Markov models, motivated by an effort to model bird migration patterns using a large database of static observations. The eBird database hosted by the Cornell Lab of Ornithology contains millions of bird observations from throughout North America, reported by the general public using the eBird web application.1 Observations report location, date, species and number of birds observed. The eBird data set is very rich; the human eye can easily discern migration patterns from animations showing the observations as they unfold over time on a map of North America.2 However, the eBird data are static: they do not explicitly record movement, only the distributions at different points in time. Conclusions about migration patterns are left to the human observer. Our goal is to build a mathematical framework to infer dynamic migration models from the static eBird data. Quantitative migration models are of great scientific and practical import: for example, this problem arose out of an interdisciplinary project at Cornell University to model the possible spread of avian influenza in North America through wild bird migration.\n\n1 http://ebird.org\n2 http://www.avianknowledge.net/visualization\n\nThe migratory behavior of a species of birds can be modeled using a single generative process that independently governs how individual birds fly between locations, giving rise to the following inference problem: a hidden experiment simultaneously draws many independent sample paths from a Markov chain, and the observations reveal aggregate information about the collection of sample paths at each time step, from which the observer attempts to reconstruct the paths. 
For example, the eBird data estimate the geographical distribution of a species on successive days, but do not track individual birds.\n\nWe discuss two problems within this framework. In the multiple path problem, we assume that exactly M independent sample paths are drawn from the Markov model, and the observations reveal the number of paths that output symbol \u03b1 at time t, for each \u03b1 and t. The observer seeks the most likely collection of paths given the observations. The fractional path problem is a further generalization in which paths are divisible entities. The observations reveal the fraction of paths that output symbol \u03b1 at time t, and the observer\u2019s job is to find the most likely (in a sense to be defined later) weighted collection of paths given the observations. Conceptually, the fractional path problem can be derived from the multiple path problem by letting M go to infinity; alternatively, it has a probabilistic interpretation in terms of distributions over paths.\n\nAfter discussing some preliminaries in section 2, sections 3 and 4 present algorithms for the multiple and fractional path problems, respectively, using network flow techniques on the trellis graph of the Markov model. The multiple path problem in its most general form is NP-hard, but can be solved as an integer program. The special case when output symbols uniquely identify their associated states can be solved efficiently as a flow problem; although the single path problem is trivial in this case, the multiple and fractional path problems remain interesting. The fractional path problem can be solved by linear programming. We also introduce a practical extension to the fractional path problem, adding slack variables that allow the solution to deviate slightly from potentially noisy observations. 
In section 5, we demonstrate our techniques with visualizations for the migration of Archilochus colubris, the Ruby-throated Hummingbird, devoting some attention to a challenging problem we have neglected so far: estimating species distributions from eBird observations.\n\nWe briefly mention some related work. Caruana et al. [4] and Phillips et al. [5] used machine learning techniques to model bird distributions from observations and environmental features. For problems on sequential data, many variants of HMMs have been proposed [3], and recently, conditional random fields (CRFs) have become a popular alternative [6]. Roth and Yih [7] present an integer programming inference framework for CRFs that is similar to our problem formulations.\n\n2 Preliminaries\n\n2.1 Data Model and Notation\n\nA Markov model (V, p, \u03a3, \u03c3) is a Markov chain with state set V and transition probabilities p(u, v) for all u, v \u2208 V. Each state generates a unique output symbol from alphabet \u03a3, given by the mapping \u03c3 : V \u2192 \u03a3. Although some presentations allow each state to output multiple symbols with different emission probabilities, we lose no generality by assuming that each state emits a unique symbol: to encode a model where state v outputs multiple symbols, we simply duplicate v for each symbol and encode the emission probabilities into the transitions. Of course, \u03c3 need not be one-to-one. It is useful to think of \u03c3 as a partition of the states, letting V_\u03b1 = \u03c3^{-1}(\u03b1) be the set of all states that output \u03b1. We assume each model has a distinguished start state s and output symbol start.\n\nLet Y = V^T be the set of all possible sample paths of length T. We represent a path y \u2208 Y as a row vector y = (y_1, ..., y_T), and a collection of M paths as the M \u00d7 T matrix Y = (y_{it}), with each row y_{i\u00b7} representing an independent sample path. 
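To make the data model concrete, here is a small sketch (our own, under the paper's generative assumptions) that draws M independent sample paths from a transition matrix and records them as an M x T matrix Y:

```python
import numpy as np

def sample_paths(P, start, M, T, seed=0):
    """Draw M independent length-T sample paths from a Markov chain with
    transition matrix P (rows sum to 1), all starting in state `start`.
    Returns the M x T integer matrix Y, one sample path per row."""
    rng = np.random.default_rng(seed)
    Y = np.zeros((M, T), dtype=int)
    Y[:, 0] = start
    for t in range(1, T):
        for i in range(M):
            # each path steps independently according to its current row of P
            Y[i, t] = rng.choice(P.shape[0], p=P[Y[i, t - 1]])
    return Y
```

The counts N_t(\u03b1) revealed to the observer in section 3 are then just column histograms of Y.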
The transition probabilities induce a distribution \u03bb on Y, where \u03bb(y) = \u220f_{t=1}^{T-1} p(y_t, y_{t+1}). We will also consider arbitrary distributions \u03c0 over Y, letting Y = (Y_1, ..., Y_T) denote a random path from \u03c0. Then, for example, we write Pr_\u03c0[Y_t = u] for the probability under \u03c0 that the tth state is u, and E_\u03c0[f(Y)] for the expected value of f(Y) for any function f of a random path Y drawn from \u03c0. Note that Y (boldface) denotes a matrix of M paths, while Y denotes a random path.\n\n2.2 The Trellis Graph and Viterbi as Shortest Path\n\nTo develop our flow-based algorithms, it is instructive to build upon a shortest-path interpretation of the Viterbi algorithm [7].\n\nFigure 1: Trellis graph for Markov model with states {s, u, v, w} and alphabet {start, 0, 1}. States u and v output the symbol 0, and state w outputs the symbol 1. (a) The bold path is feasible for the specified observations, with probability p(s, u)p(u, u)p(u, w). (b) Infeasible edges have been removed (indicated by light dashed lines), and probabilities changed to costs. The bold path has cost c(s, u) + c(u, u) + c(u, w).\n\nIn an instance of the single path problem we are given a model (V, p, \u03a3, \u03c3) and observations \u03b1_1, ..., \u03b1_T, and we seek the most probable path y given these observations. We call path y feasible if \u03c3(y_t) = \u03b1_t for all t; then we wish to maximize \u03bb(y) over feasible y. The problem is conveniently illustrated using the trellis graph of the Markov model (Figure 1). Here, the states are replicated for each time step, and edges connect a state at time t to its possible successors at time t + 1, labeled with the transition probability. A feasible path must pass through partition V_{\u03b1_t} at step t, so we can prune all edges incident on other partitions, leaving only feasible paths. 
By defining the cost of an edge as c(u, v) = \u2212log p(u, v), and letting the path cost c(y) be the sum of its edge costs, straightforward algebra shows that arg max_y \u03bb(y) = arg min_y c(y), i.e., the path of maximum probability becomes the path of minimum cost under this transformation. Thus the Viterbi algorithm finds the shortest feasible path in the trellis using edge lengths c(u, v).\n\n3 Multiple Path Problem\n\nIn the multiple path problem, M sample paths are drawn from the model and the observations reveal the number of paths N_t(\u03b1) that output \u03b1 at time t, for all \u03b1 and t; or, equivalently, the multiset A_t of output symbols at time t. The objective is to find the most probable collection Y that is feasible, meaning it produces multisets A_1, ..., A_T. The probability \u03bb(Y) is just the product of the path-wise probabilities:\n\n\u03bb(Y) = \u220f_{i=1}^{M} \u03bb(y_i) = \u220f_{i=1}^{M} \u220f_{t=1}^{T-1} p(y_{i,t}, y_{i,t+1}). (1)\n\nThen the formal specification of this problem is\n\nmax_Y \u03bb(Y) subject to |{i : y_{i,t} \u2208 V_\u03b1}| = N_t(\u03b1) for all \u03b1, t. (2)\n\n3.1 Reduction to the Single Path Problem\n\nA naive approach to the multiple path problem reduces it to the single path problem by creating a new Markov model on state set V^M where state \u27e8v_1, ..., v_M\u27e9 encodes an entire tuple of original states, and the transition probabilities are given by the product of the element-wise transition probabilities:\n\np(\u27e8u_1, ..., u_M\u27e9, \u27e8v_1, ..., v_M\u27e9) = \u220f_{i=1}^{M} p(u_i, v_i).\n\nA state from the product space V^M corresponds to an entire column of the matrix Y, and by changing the order of multiplication in (1), we see that the probability of a path in the new model is equal to the probability of the entire collection of paths in the old model. To complete the reduction, we form a new alphabet \u02c6\u03a3 whose symbols represent multisets of size M on \u03a3. 
Then the solution to (2) can be found by running the Viterbi algorithm to find the most likely sequence of states from V^M that produce output symbols (multisets) A_1, ..., A_T. The running time is polynomial in |V^M| and |\u02c6\u03a3|, but exponential in M.\n\n3.2 Graph Flow Formulation\n\nCan we do better than the naive approach? Viewing the cost of a path as the cost of routing one unit of flow along that path in the trellis, a minimum cost collection of M paths is equivalent to a minimum cost flow of M units through the trellis: given M paths, we can route one unit along each to get a flow, and we can decompose any flow of M units into paths each carrying a single unit of flow. Thus we can write the optimization problem in (2) as the following flow integer program, with additional constraints that the flow paths generate the correct observations. The decision variable x^t_{uv} indicates the flow traveling from u to v at time t; equivalently, the number of sample paths that transition from u to v at time t.\n\n(IP) min \u2211_{u,v,t} c(u, v) x^t_{uv}\ns.t. \u2211_u x^t_{uv} = \u2211_w x^{t+1}_{vw} for all v, t, (3)\n\u2211_{u \u2208 V_\u03b1, v \u2208 V} x^t_{uv} = N_t(\u03b1) for all \u03b1, t, (4)\nx^t_{uv} \u2208 N for all u, v, t.\n\nThe flow conservation constraints (3) are standard: the flow into v at time t is equal to the flow leaving v at time t + 1. The observation constraints (4) specify that N_t(\u03b1) units of flow leave partition V_\u03b1 at time t. 
These also imply that exactly M units of flow pass through each level of the trellis, by summing over all \u03b1:\n\n\u2211_{u,v} x^t_{uv} = \u2211_\u03b1 \u2211_{u \u2208 V_\u03b1, v \u2208 V} x^t_{uv} = \u2211_\u03b1 N_t(\u03b1) = M.\n\nWithout the observation constraints, IP would be an instance of the minimum-cost flow problem [8], which is solvable in polynomial time by a variety of algorithms [9]. However, we cannot hope to encode the observation constraints into the flow framework, due to the following result.\n\nLemma 1. The multiple path problem is NP-hard.\n\nThe proof of Lemma 1 is by reduction from SET COVER, and is omitted. One may use a general purpose integer program solver to solve IP directly; this may be efficient in some cases despite the lack of polynomial time performance guarantees. In the following sections we discuss alternatives that are efficiently solvable.\n\n3.3 An Efficient Special Case\n\nIn the special case when \u03c3 is one-to-one, the output symbols uniquely identify their generating states, so we may assume that \u03a3 = V, and the output symbol is always the name of the current state. To see how the problem IP simplifies, we now have V_u = {u} for all u, so each partition consists of a single state, and the observations completely specify the flow through each node in the trellis:\n\n\u2211_v x^t_{uv} = N_t(u) for all u, t. (4')\n\nSubstituting the new observation constraints (4') for time t + 1 into the RHS of the flow conservation constraints (3) for time t yields the following replacements:\n\n\u2211_u x^t_{uv} = N_{t+1}(v) for all v, t. (3')\n\nThis gives an equivalent set of constraints, each of which refers only to variables x^t_{uv} for a single t. Hence the problem can be decomposed into T \u2212 1 disjoint subproblems for t = 1, ..., T \u2212 1. The tth subproblem IP_t is given in Figure 2(a), and illustrated on the trellis in Figure 2(b). 
State u at time t has a supply of N_t(u) units of flow coming from the previous step, and we must route N_{t+1}(v) units of flow to state v at time t + 1, so we place a demand of N_{t+1}(v) at the corresponding node. Then the problem reduces to finding a minimum cost routing of the supply from time t to meet the demand at time t + 1, solved separately for all t = 1, ..., T \u2212 1. The problem IP_t is an instance of the transportation problem [10], a special case of the minimum-cost flow problem. There are a variety of efficient algorithms to solve both problems [8, 9], or one may use a general purpose linear program (LP) solver; any basic solution to the LP relaxation of IP_t is guaranteed to be integral [8].\n\n(IP_t) min \u2211_{u,v} c(u, v) x^t_{uv}\ns.t. \u2211_u x^t_{uv} = N_{t+1}(v) for all v, (3')\n\u2211_v x^t_{uv} = N_t(u) for all u, (4')\nx^t_{uv} \u2208 N for all u, v.\n\nFigure 2: (a) The definition of subproblem IP_t. (b) Illustration on the trellis.\n\n4 Fractional Path Problem\n\nIn the fractional path problem, a path is a divisible entity. The observations specify q_t(\u03b1), the fraction of paths that output \u03b1 at time t, and the observer chooses \u03c0(y) fractional units of each path y, totaling one unit, such that q_t(\u03b1) units output \u03b1 at time t. The objective is to maximize \u220f_{y \u2208 Y} \u03bb(y)^{\u03c0(y)}. Put another way, \u03c0 is a distribution over paths such that Pr_\u03c0[Y_t \u2208 V_\u03b1] = q_t(\u03b1), i.e., q_t specifies the marginal distribution over symbols at time t. By taking the logarithm, an equivalent objective is to maximize E_\u03c0[log \u03bb(Y)], so we seek the distribution \u03c0 that maximizes the expected log-probability of a path Y drawn from \u03c0. 
Conceptually, the fractional path problem arises by letting M \u2192 \u221e in the multiple path problem and normalizing to let q_t(\u03b1) = N_t(\u03b1)/M specify the fraction of paths that output \u03b1 at time t. Operationally, the fractional path problem is modeled by the LP relaxation of IP, which routes one splittable unit of flow through the trellis.\n\n(RELAX) min \u2211_{u,v,t} c(u, v) x^t_{uv}\ns.t. \u2211_u x^t_{uv} = \u2211_w x^{t+1}_{vw} for all v, t,\n\u2211_{u \u2208 V_\u03b1, v \u2208 V} x^t_{uv} = q_t(\u03b1) for all \u03b1, t, (5)\nx^t_{uv} \u2265 0 for all u, v, t.\n\nIt is easy to see that a unit flow x corresponds to a probability distribution \u03c0. Given any distribution \u03c0, let x^t_{uv} = Pr_\u03c0[Y_t = u, Y_{t+1} = v]; then x is a flow because the probability that a path enters v at time t is equal to the probability that it leaves v at time t + 1. Conversely, given a unit flow x, any path decomposition assigning flow \u03c0(y) to each y \u2208 Y is a probability distribution because the total flow is one. In general, the decomposition is not unique, but any choice yields a distribution \u03c0 with the same objective value. Furthermore, under this correspondence, x satisfies the marginal constraints (5) if and only if \u03c0 has the correct marginals:\n\n\u2211_{u \u2208 V_\u03b1} \u2211_{v \u2208 V} x^t_{uv} = \u2211_{u \u2208 V_\u03b1} \u2211_{v \u2208 V} Pr[Y_t = u, Y_{t+1} = v] = \u2211_{u \u2208 V_\u03b1} Pr[Y_t = u] = Pr[Y_t \u2208 V_\u03b1].\n\nFinally, we can rewrite the objective function in terms of paths:\n\n\u2211_{u,v,t} c(u, v) x^t_{uv} = \u2211_{y \u2208 Y} \u03c0(y) c(y) = E_\u03c0[c(Y)] = E_\u03c0[\u2212log \u03bb(Y)].\n\nBy switching signs and changing from minimization to maximization, we see that RELAX solves the fractional path problem. 
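To make the LP concrete in the special case where \u03c3 is one-to-one (so the marginals attach directly to states) and without slack, each time step of the relaxed problem is a small transportation LP. A sketch using scipy.optimize.linprog; this toy formulation is ours, not the authors' MOSEK model:

```python
import numpy as np
from scipy.optimize import linprog

def transport_step(C, q_t, q_next):
    """One time step of the relaxed flow problem when sigma is one-to-one:
    route one splittable unit of flow from marginal q_t (over states at
    time t) to marginal q_next (time t+1) at minimum cost C[u, v].
    Returns the n x n flow matrix x with row sums q_t, column sums q_next."""
    n = C.shape[0]
    A_eq, b_eq = [], []
    for u in range(n):                      # flow leaving u equals q_t[u]
        row = np.zeros((n, n))
        row[u, :] = 1
        A_eq.append(row.ravel())
        b_eq.append(q_t[u])
    for v in range(n):                      # flow entering v equals q_next[v]
        col = np.zeros((n, n))
        col[:, v] = 1
        A_eq.append(col.ravel())
        b_eq.append(q_next[v])
    res = linprog(C.ravel(), A_eq=np.array(A_eq), b_eq=np.array(b_eq),
                  bounds=(0, None), method="highs")
    return res.x.reshape(n, n)
```

Solving one such LP per t and chaining the results reproduces the decomposition noted for the one-to-one case; the full RELAX (with partitions and slack) must instead be solved as a single LP.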
This problem is very similar to maximum entropy or minimum cross entropy modeling, but the details are slightly different: such a model would typically find the distribution \u03c0 with the correct marginals that minimizes the cross entropy or Kullback-Leibler divergence [11] between \u03bb and \u03c0, which, after removing a constant term, reduces to minimizing E_\u03bb[\u2212log \u03c0(Y)]. Like IP, the RELAX problem also decomposes into subproblems in the case when \u03c3 is one-to-one, but this simplification is incompatible with the slack variables introduced in the following section.\n\n4.1 Incorporating Slack\n\nIn our application, the marginal distributions q_t(\u00b7) are themselves estimates, and it is useful to allow the LP to deviate slightly from these marginals to find a better overall solution. To accomplish this, we add slack variables \u03b4^t_\u03b1 to the marginal constraints (5), and charge for the slack in the objective function. The new marginal constraints are\n\n\u2211_{u \u2208 V_\u03b1, v \u2208 V} x^t_{uv} = q_t(\u03b1) + \u03b4^t_\u03b1 for all \u03b1, t, (5')\n\nand we add the term \u2211_{\u03b1,t} \u03b3^t_\u03b1 |\u03b4^t_\u03b1| to the objective function to charge for the slack, using a standard LP trick [8] to model the absolute value term. The slack costs \u03b3^t_\u03b1 can be tailored to individual input values; for example, one may want to charge more to deviate from a confident estimate. This will depend on the specific application. 
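The standard LP trick for the absolute value splits each slack variable into nonnegative parts (\u03b4 = \u03b4+ \u2212 \u03b4\u2212) and charges \u03b3(\u03b4+ + \u03b4\u2212) in the objective. A scalar toy sketch of this linearization, with names of our own choosing:

```python
from scipy.optimize import linprog

def soft_match(c, q, gamma):
    """Toy soft constraint: minimize c*x + gamma*|x - q| over x >= 0.
    The slack delta = x - q is split into nonnegative parts dp, dm with
    delta = dp - dm, and gamma*(dp + dm) is charged in the objective --
    the standard LP linearization of an absolute-value penalty."""
    # variables: [x, dp, dm]; equality constraint: x - dp + dm = q
    res = linprog(c=[c, gamma, gamma],
                  A_eq=[[1.0, -1.0, 1.0]], b_eq=[q],
                  bounds=[(0, None)] * 3, method="highs")
    x, dp, dm = res.x
    return x, dp - dm
```

With a large penalty the solution sticks to the target q; with a small penalty it is cheaper to pay for slack than to route expensive flow, which is exactly the trade-off the slack costs \u03b3 control.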
We also add the necessary constraints to ensure that the new marginals q\u2032_t(\u03b1) = q_t(\u03b1) + \u03b4^t_\u03b1 form a valid probability distribution for all t.\n\n5 Demonstration\n\nIn this section, we demonstrate our techniques by using the fractional path problem to create visualizations showing likely migration routes of Archilochus colubris, the Ruby-throated Hummingbird, a common bird whose range is relatively well covered by eBird observations. We work in discretized space and time, dividing the map into grid cells and the year into weeks. We must specify the Markov model governing transitions between locations (grid cells) in successive weeks; we also require estimates q_t(\u00b7) of the weekly distributions of hummingbirds across locations. Since the actual eBird observations are highly non-uniform in space and time, estimating weekly distributions requires significant inference for locations with few or no observations. In the appendix, we outline one approach based on harmonic energy minimization [12], but we may use any technique that produces weekly distributions q_t(u) and slack costs \u03b3^t_u. Improving these estimates, say, by incorporating important side information such as climate and habitat features, could significantly improve the overall model. Finally, although our final observations q_t(\u00b7) are distributions over states (locations) and not output symbols (i.e., \u03c3 is one-to-one), we cannot use the simplification from section 3.3 because we incorporate slack into the model.\n\n5.1 eBird Data\n\nLaunched in 2002, eBird is a citizen science project run by the Cornell Lab of Ornithology, leveraging the data gathering power of the public. 
On the eBird website, birdwatchers submit checklists of birds they observe, indicating a count for each species, along with the location, date, time and additional information. Our data set consists of the 428,648 complete checklists from 1995^3 through 2007, meaning the reporter listed all species observed. This means we can infer a count of zero, or a negative observation, for any species not listed. Using a land cover map from the United States Geological Survey (USGS), we divide North America into grid cells that are roughly 225 km on a side. All years of data are aggregated into one, and the year is divided into weeks, so t = 1, ..., 52 represents the week of the year.\n\n3 Users may enter historical observations.\n\n5.2 Migration Inference\n\nGiven weekly distributions q_t(u) and slack costs \u03b3^t_u (see the appendix), it remains to specify the Markov model. We use a simple Gaussian model favoring short flights, letting p(u, v) \u221d exp(\u2212d(u, v)^2/\u03c3^2), where d(u, v) measures the distance between grid cell centers. This corresponds to a squared distance cost function. To reduce problem size, we omitted variables x^t_{uv} from the LP when d(u, v) > 1350 km, effectively setting p(u, v) = 0. We also found it useful to impose upper bounds \u03b4^t_u \u2264 q_t(u) on the slack variables so that no single value could increase by more than a factor of two. Our final LP, which was solved using the MOSEK optimization toolbox, had 78,521 constraints and 3,031,116 variables.\n\nFigure 3 displays the migration paths our model inferred for the four weeks starting on the dates indicated: week 10 (March 5), week 20 (May 14), week 30 (July 28) and week 40 (October 1). The top row shows the distribution and paths inferred by the model; grid cells colored\n\nFigure 3: Ruby-throated Hummingbird migration. 
See text for description.\n\nin lighter shades have more birds (higher values for q\u2032_t(u)). Arrows indicate flight paths (x^t_{uv}) between the week shown and the following week, with line width proportional to the flow x^t_{uv}. In the bottom row, the raw data are given for comparison. White dots indicate negative observations; black squares indicate positive observations, with size proportional to count. Locations with both positive and negative observations appear in a charcoal color. The inferred distributions and paths are consistent with both seasonal ranges and written accounts of migration routes. For example, in the summary paragraph on migration from the Archilochus colubris species account in Birds of North America [13], Robinson et al. write: \u201cMany fly across Gulf of Mexico, but many also follow coastal route. Routes may differ for north- and southbound birds.\u201d\n\nAcknowledgments\n\nWe are grateful to Daniel Fink, Wesley Hochachka and Steve Kelling from the Cornell Lab of Ornithology for useful discussions. This work was supported in part by ONR Grant N00014-01-1-0968 and by NSF grant CCF-0635028. The views and conclusions herein are those of the authors and do not necessarily represent the official policies or endorsements of these organizations or the US Government.\n\nReferences\n\n[1] L. R. Rabiner. A tutorial on hidden Markov models and selected applications in speech recognition. Proceedings of the IEEE, 77(2):257\u2013286, 1989.\n\n[2] E. Charniak. Statistical techniques for natural language parsing. AI Magazine, 18(4):33\u201344, 1997.\n\n[3] R. Durbin, S. Eddy, A. Krogh, and G. Mitchison. Biological sequence analysis: Probabilistic models of proteins and nucleic acids. Cambridge University Press, 1998.\n\n[4] R. Caruana, M. Elhawary, A. Munson, M. Riedewald, D. Sorokina, D. Fink, W. M. Hochachka, and S. Kelling. Mining citizen science data to predict prevalence of wild bird species. 
In SIGKDD, 2006.\n\n[5] S. J. Phillips, M. Dud\u00edk, and R. E. Schapire. A maximum entropy approach to species distribution modeling. In ICML, 2004.\n\n[6] J. Lafferty, A. McCallum, and F. Pereira. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In ICML, 2001.\n\n[7] D. Roth and W. Yih. Integer linear programming inference for conditional random fields. In ICML, 2005.\n\n[8] V. Chv\u00e1tal. Linear Programming. W. H. Freeman, New York, NY, 1983.\n\n[9] A. V. Goldberg, S. A. Plotkin, and E. Tardos. Combinatorial algorithms for the generalized circulation problem. Math. Oper. Res., 16(2):351\u2013381, 1991.\n\n[10] G. B. Dantzig. Application of the simplex method to a transportation problem. In T. C. Koopmans, editor, Activity Analysis of Production and Allocation, volume 13 of Cowles Commission for Research in Economics, pages 359\u2013373. Wiley, 1951.\n\n[11] J. Shore and R. Johnson. Properties of cross-entropy minimization. IEEE Trans. on Information Theory, 27:472\u2013482, 1981.\n\n[12] X. Zhu, Z. Ghahramani, and J. Lafferty. Semi-supervised learning using Gaussian fields and harmonic functions. In ICML, 2003.\n\n[13] T. R. Robinson, R. R. Sargent, and M. B. Sargent. Ruby-throated Hummingbird (Archilochus colubris). In A. Poole and F. Gill, editors, The Birds of North America, number 204. The Academy of Natural Sciences, Philadelphia, and The American Ornithologists\u2019 Union, Washington, D.C., 1996.\n\n[14] D. Aldous and J. Fill. Reversible Markov Chains and Random Walks on Graphs. Monograph in preparation, http://www.stat.berkeley.edu/users/aldous/RWG/book.html.\n\nA Estimating Weekly Distributions from eBird\n\nOur goal is to estimate q_t(u), the fraction of birds in grid cell u during week t. Given enough observations, we can estimate q_t(u) using the average number of birds counted per checklist, a quantity we call the rate r_t(u). 
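A minimal sketch of this rate estimate (the flat tuple layout below is our own assumption; real eBird checklists carry much more structure):

```python
import numpy as np

def weekly_rates(checklists, n_cells, n_weeks=52):
    """Estimate r_t(u): average birds counted per complete checklist in
    cell u during week t. Complete checklists where the species is absent
    contribute a count of zero (a 'negative observation').
    `checklists` is a list of (cell, week, count) tuples, week in 1..52."""
    total = np.zeros((n_weeks, n_cells))   # summed counts per (week, cell)
    n_obs = np.zeros((n_weeks, n_cells))   # number of checklists
    for cell, week, count in checklists:
        total[week - 1, cell] += count
        n_obs[week - 1, cell] += 1
    # average where data exist; NaN marks the gaps the appendix must fill
    return np.where(n_obs > 0, total / np.maximum(n_obs, 1), np.nan)
```

The NaN cells are exactly the interior points that the harmonic interpolation described next has to fill in.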
However, even for a bird with good eBird coverage, there are cells with few or no observations during some weeks. To fill these gaps, we use the harmonic energy minimization technique [12] to determine values for empty cells based on neighbors in space and time. This technique uses a graph-based similarity structure, in our case the 3-dimensional lattice built on points u_t, where u_t represents cell u during week t. Edges are weighted, with weights representing similarity between points. Point u_t is connected to its four grid neighbors in time slice t by edges of unit weight, excluding edges between cells separated by water (specifically, when the line connecting the centers is more than half water). Point u_t is also connected to points u_{t-1} and u_{t+1} with weight 1/4, to achieve some temporal smoothing.\n\nHarmonic energy minimization learns a function f on the graph; the idea is to match r_t(u) on points with sufficient data and to find values for other points according to the similarity structure. To this end, we designate some boundary points for which the value of f is fixed by the data, while other points are interior points. The value of f at interior point u_t is determined by the expected value of the following random experiment: perform a random walk starting from u_t, following outgoing edges with probability proportional to their weight. When the walk first hits a boundary point v_{t'}, terminate and accept the boundary value f(v_{t'}). In this way, the values at interior points are a weighted average of nearby boundary values, where \u201cnearness\u201d is interpreted as the absorption probability in an absorbing random walk. We derive a measure of confidence in the value f(u_t) from the same experiment: let h(u_t) be the expected number of steps for the random walk from u_t to hit the boundary (the hitting time of the boundary set [14]). 
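The harmonic interpolation of f can be computed by simple iterative averaging; a sketch of the general technique from [12] with our own dense-matrix representation (the actual lattice, soft boundary nodes, and sink are as described in this appendix):

```python
import numpy as np

def harmonic(W, boundary, n_iter=500):
    """Harmonic interpolation on a weighted graph: f is fixed at the
    boundary points and, at every interior point, equals the weight-
    averaged value of its neighbors (the random-walk absorption value).
    W: symmetric n x n weight matrix; boundary: {index: fixed value}."""
    n = W.shape[0]
    f = np.zeros(n)
    fixed = np.array([i in boundary for i in range(n)])
    f[fixed] = [boundary[i] for i in range(n) if i in boundary]
    deg = W.sum(axis=1)
    for _ in range(n_iter):                      # Jacobi-style sweeps
        f_new = W @ f / np.where(deg > 0, deg, 1)
        f_new[fixed] = f[fixed]                  # boundary values stay pinned
        f = f_new
    return f
```

On a three-node path with the endpoints pinned to 0 and 1, the middle point converges to their average, as the random-walk interpretation predicts.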
When h(u_t) is small, u_t is close to the boundary and we are more confident in f(u_t).\n\nRather than choosing a threshold on the number of observations required to be a boundary point, we create a soft boundary by designating all points u_t as interior points, and adding one boundary node to the graph structure for each observation, connected by an edge of unit weight to the cell in which it occurred, with value equal to the number of birds observed. As point u_t gains more observations, its behavior approaches that of a hard boundary: with probability approaching one, the walk from u_t will reach an observation in the first step, so f(u_t) will approach r_t(u), the average of the observations. As a conservative measure, each node is also connected to a sink with boundary value 0, to prevent values from propagating over very long distances.\n\nWe compute h and f iteratively using standard techniques. Since f(u_t) approximates the rate r_t(u), we multiply by the land mass of cell u to get an estimate \u02c6q_t(u) for the (relative) number of birds in cell u at time t. Finally, we normalize \u02c6q for each time slice t, taking q_t(u) = \u02c6q_t(u)/\u2211_u \u02c6q_t(u). For slack costs, we set \u03b3^t_u = \u03b3_0/h(u_t), inversely proportional to the boundary hitting time, with \u03b3_0 \u2248 261 chosen in conjunction with the transition costs in section 5.2 so that the average cost for a unit of slack is the same as moving 600 km.", "award": [], "sourceid": 525, "authors": [{"given_name": "M.a.", "family_name": "Elmohamed", "institution": null}, {"given_name": "Dexter", "family_name": "Kozen", "institution": null}, {"given_name": "Daniel", "family_name": "Sheldon", "institution": null}]}