{"title": "Consistent Estimation of Functions of Data Missing Non-Monotonically and Not at Random", "book": "Advances in Neural Information Processing Systems", "page_first": 3144, "page_last": 3152, "abstract": "Missing records are a perennial problem in analysis of complex data of all types, when the target of inference is some function of the full data law. In simple cases, where data is missing at random or completely at random (Rubin, 1976), well-known adjustments exist that result in consistent estimators of target quantities. Assumptions underlying these estimators are generally not realistic in practical missing data problems. Unfortunately, consistent estimators in more complex cases where data is missing not at random, and where no ordering on variables induces monotonicity of missingness status are not known in general, with some notable exceptions (Robins, 1997), (Tchetgen Tchetgen et al, 2016), (Sadinle and Reiter, 2016). In this paper, we propose a general class of consistent estimators for cases where data is missing not at random, and missingness status is non-monotonic. Our estimators, which are generalized inverse probability weighting estimators, make no assumptions on the underlying full data law, but instead place independence restrictions, and certain other fairly mild assumptions, on the distribution of missingness status conditional on the data. The assumptions we place on the distribution of missingness status conditional on the data can be viewed as a version of a conditional Markov random field (MRF) corresponding to a chain graph. Assumptions embedded in our model permit identification from the observed data law, and admit a natural fitting procedure based on the pseudo likelihood approach of (Besag, 1975). 
We illustrate our approach with a simple simulation study, and an analysis of risk of premature birth in women in Botswana exposed to highly active anti-retroviral therapy.", "full_text": "Consistent Estimation of Functions of Data Missing\n\nNon-Monotonically and Not at Random\n\nIlya Shpitser\n\nDepartment of Computer Science\n\nJohns Hopkins University\nilyas@cs.jhu.edu\n\nAbstract\n\nMissing records are a perennial problem in analysis of complex data of all types,\nwhen the target of inference is some function of the full data law. In simple cases,\nwhere data is missing at random or completely at random [15], well-known ad-\njustments exist that result in consistent estimators of target quantities.\nAssumptions underlying these estimators are generally not realistic in practical\nmissing data problems. Unfortunately, consistent estimators in more complex\ncases where data is missing not at random, and where no ordering on variables\ninduces monotonicity of missingness status are not known in general, with some\nnotable exceptions [13, 18, 16].\nIn this paper, we propose a general class of consistent estimators for cases where\ndata is missing not at random, and missingness status is non-monotonic. Our es-\ntimators, which are generalized inverse probability weighting estimators, make\nno assumptions on the underlying full data law, but instead place independence\nrestrictions, and certain other fairly mild assumptions, on the distribution of miss-\ningness status conditional on the data.\nThe assumptions we place on the distribution of missingness status conditional on\nthe data can be viewed as a version of a conditional Markov random \ufb01eld (MRF)\ncorresponding to a chain graph. Assumptions embedded in our model permit\nidenti\ufb01cation from the observed data law, and admit a natural \ufb01tting procedure\nbased on the pseudo likelihood approach of [2]. 
We illustrate our approach with\na simple simulation study, and an analysis of risk of premature birth in women in\nBotswana exposed to highly active anti-retroviral therapy.\n\n1\n\nIntroduction\n\nPractical data sets generally have missing or corrupted entries. A classical missing data problem is\nto \ufb01nd a way to make valid inferences about the full data law. In other words, the goal is to exploit\nassumptions on the mechanism which is responsible for missingness or corruption of data records\nto transform the problem into another where missingness or corruption were not present at all.\nIn simple cases, where missingness status is assumed to be missing completely at random (deter-\nmined by an independent coin \ufb02ip), or at random (determined by a coin \ufb02ip independent conditional\non observed data records), adjustments exist which result in consistent estimators of many functions\nof the full data law. Unfortunately, these cases are dif\ufb01cult to justify in practice. Often, data records\nare missing intermittently and in complex patterns that do not conform to above assumptions. For\ninstance, in longitudinal observational studies in medicine, patients may elect to not show up at a\nparticular time point, for reasons having to do with their (by de\ufb01nition missing) health status at that\ntime point, and then later return for followup.\n\n1\n\n\fIn this situation, missingness is not determined by a coin \ufb02ip independent of missing data conditional\non the observed data (data is missing not at random), and missingness status of a patient is not\nmonotonic under any natural ordering. In this setting, deriving consistent estimators of even simple\nfunctions of the full data law is a challenging problem [13, 18, 16].\nIn this paper we propose a new class of consistent generalized inverse probability weighting (IPW)\nestimators for settings where data is missing non-monotonically and not at random. 
Like other IPW\nestimators, ours makes no modeling assumptions on the full data law, and only models the joint\nmissingness status of all variables, conditional on those variables. This model can be viewed as\na conditional Markov random \ufb01eld (MRF) with independence assumptions akin to those made in\nfactors of a chain graph model [6]. The assumptions encoded in our model permit identi\ufb01cation of\nthe full data law, and allow estimation based on the pseudo likelihood approach of [2].\nOur paper is organized as follows. We discuss relevant preliminaries on graphical models in Section\n2. We \ufb01x additional notation and discuss some prior work on missing data in Section 3. We introduce\nour missingness model, and identi\ufb01cation results based on it in Section 4, and discuss estimation in\nSection 5. We illustrate the use of our model with a simple simulation study in Section 6, and give\na data analysis application in Section 7. Finally, we illustrate the difference between our model and\na seemingly similar non-identi\ufb01ed model via a parameter counting argument in Section 8, and give\nour conclusions in Section 9.\n\n2 Chain Graph Models\n\nWe brie\ufb02y review statistical chain graph models. A simple mixed graph is a graph where every\nvertex pair is connected by at most one edge, and there are two types of possible edges: directed and\nundirected. Chain graphs are mixed graphs with the property that for every edge cycle in the graph,\nthere is no way to assign an orientation to undirected edges in any cycle to form a directed cycle [6].\nFor a graph G with a vertex set V, and any subset A \u2286 V, de\ufb01ne the induced subgraph GA to be\na graph with the vertex set A and all edges in G between elements in A. Given a graph G, de\ufb01ne\nthe augmented or moral graph Ga to be an undirected graph obtained from adding a new undirected\nedge between any unconnected vertices W1, W2 if a path of the form W1 \u2192 \u25e6 \u2212 \u25e6 . . . 
◦ − ◦ ← W2 exists in G (note the path may contain only a single intermediate vertex), and then replacing all directed edges in G by undirected edges.

A clique in an undirected graph is a set of vertices where any pair of vertices are neighbors. A maximal clique is a clique such that no superset of it forms a clique. Given an undirected graph G, denote the set of maximal cliques in G by C(G). A block in a simple mixed graph G is any connected component formed by undirected edges in the graph obtained from G by dropping all directed edges. Given a simple mixed graph G, denote the set of blocks in G by B(G).

A chain graph model is defined by the following factorization:

p(V) = \prod_{B \in \mathcal{B}(G)} p(B \mid \mathrm{pa}_G(B)),   (1)

where for each B,

p(B \mid \mathrm{pa}_G(B)) = \frac{1}{Z(\mathrm{pa}_G(B))} \prod_{C \in \mathcal{C}\left( (G_{B \cup \mathrm{pa}_G(B)})^a \right)} \phi_C(C),   (2)

and the φ_C(C) are called potential functions; they map value configurations of the variables in C to real numbers, which are meant to denote an "affinity" of the model towards that particular value configuration. The chain graph factorization implies Markov properties, described in detail in [6].

3 Preliminaries, and Prior Work on Missing Data

We will consider data sets on random variables L ≡ L_1, ..., L_k, drawn from a full data law p(L). Associated with each random variable L_i is a binary missingness indicator R_i, where L_i is observed if and only if R_i = 1. Define a vector (l^j, r^j) ≡ (l^j_1, ..., l^j_k, r^j_1, ..., r^j_k) to be the jth realization of p(L, R). Define (l^*)^j ≡ {l^j_i | r^j_i = 1} ⊆ l^j. In missing data settings, for every j, we only get to observe the vector of values ((l^*)^j, r^j), and we wish to make inferences using the true realizations (l^j, r^j) from the underlying law. 
Doing this entails building a bridge between the observed and the\nunderlying realizations, and this bridge is provided by assumptions made on p(L, R).\nIf we can assume that for any i, p(Ri | L) = p(Ri), in other words, every missing value is de-\ntermined by an independent biased coin \ufb02ip, then data is said to be missing completely at random\n(MCAR) [15]. In this simple setting, it is known that any estimator for complete data remains con-\nsistent if applied to just the complete cases. A more complex assumption, known as missing at\nrandom (MAR) [15], states that for every i, p(Ri | L) = p(Ri | L\u2217). In other words, every missing\nvalue is determined by a biased coin \ufb02ip that is independent of missing data values conditional on\nthe observed data values. In this setting, a variety of adjustments lead to consistent estimators.\nThe most interesting patterns of missingness, and the most relevant in practice, are those that do\nnot obey either of the above two assumptions, in which case data is said to be missing not at ran-\ndom (MNAR). Conventional wisdom in MNAR settings is that without strong parametric modeling\nassumptions, many functions of the full data law are not identi\ufb01ed from the observed data law.\nNevertheless, a series of recent papers [8, 7, 17], which represented missing data mechanisms as\ngraphical models, and exploited techniques developed in causal inference, have shown that the full\ndata law may be non-parametrically identi\ufb01ed under MNAR.\nIn this approach, the full data law is assumed to factorize with respect to a directed acyclic graph\n(DAG) [11]. Assumptions implied by this factorization are then used to derive functions of p(L)\nin terms of p(R, L\u2217). We illustrate the approach using Fig. 1 (a),(b) and (c). Here nodes in green\nare assumed to be completely observed. In Fig. 1 (a), the Markov factorization is p(R2, L1, L2) =\np(R2 | L1)p(L2 | L1)p(L1). 
It is easy to verify using d-separation [11] in this DAG that p(R_2 | L_1, L_2) = p(R_2 | L_1). Since L_1 is always observed, this setting is MAR, and we get the following: p(L_1, L_2) = p(L_2 | L_1) p(L_1) = p(L_2 | L_1, R_2 = 1) p(L_1) = p(R_2 = 1, L^*) / p(R_2 = 1 | L_1), where the last expression is a functional of p(R, L^*), and so the full data law p(L) is non-parametrically identified from the observed data law p(R, L^*).

The ratio form of the identifying functional suggests the following simple IPW estimator for E[L_2], known as the Horvitz-Thompson estimator [4]. We estimate p(R_2 | L_1) either directly, if L_1 is discrete and low dimensional, or using maximum likelihood fitting of a model p(R_2 | L_1; β), for instance a logistic regression model. We then average observed values of L_2, but compensate for the fact that observed and missing values of L_2 systematically differ using the inverse of the fitted probability of the case being observed, conditional on L_1:

\hat{E}[L_2] = N^{-1} \sum_{n : r^n_2 = 1} l^n_2 \, / \, p(R_2 = 1 \mid l^n_1; \hat{\beta}).

Under our missingness model, this estimator is clearly unbiased. Under a number of additional fairly mild conditions, this estimator is also consistent.

A more complicated graph, shown in Fig. 1 (b), implies the following factorization:

p(L_1, L_2, R_1, R_2) = p(R_1 \mid L_2) \, p(R_2 \mid L_1) \, p(L_1 \mid L_2) \, p(L_2).   (3)

Using d-separation in this DAG, we see that in cases where any values are missing, neither MCAR nor MAR assumptions hold under this model. Thus, in this example, data is MNAR. However, the conditional independence constraints implied by the factorization (3) imply the following:

p(L_1, L_2) = \frac{p(R_1 = 1, R_2 = 1, L^*)}{p(R_1 = 1 \mid L^*_2, R_2 = 1) \cdot p(R_2 = 1 \mid L^*_1, R_1 = 1)}.

As before, all terms on the right hand side are functions of p(R, L^*), and so p(L) is non-parametrically identified from p(R, L^*). 
This example was discussed in [8].

The form of the identifying functional suggests a simple generalization of the IPW estimator from the previous example for E[L_2]. As before, we fit models p(R_1 | L^*_2; β_1) and p(R_2 | L^*_1; β_2) by MLE. We take the empirical average of the observed values of L_2, but reweigh them by the inverses of both of the estimated probabilities, using complete cases only:

\frac{1}{N} \sum_{n : r^n_1 = r^n_2 = 1} l^n_2 \cdot \frac{1}{p(R_1 = 1 \mid l^n_2; \hat{\beta}_1) \cdot p(R_2 = 1 \mid l^n_1; \hat{\beta}_2)}.

This estimator is also consistent, with the proof a simple generalization of that for Horvitz-Thompson. More generally, it has been shown in [8] that in DAGs where no R variable has a child, and the edge L_i → R_i does not exist for any i, we get:

p(L) = \frac{p(L^*, R = 1)}{\prod_{R_i} p(R_i \mid \mathrm{pa}_G(R_i), R_{\{i \mid L_i \in \mathrm{pa}_G(R_i)\}} = 1)}.

This identifying functional implies consistent IPW estimators can be derived that are similar to estimators in the above examples.

Figure 1: (a) A graphical model for MAR data. (b), (c) Graphical models for MNAR data where identification of the full data law is possible. (d) The no self-censoring model for k = 3. (e) A missingness model seemingly similar to (d), where the full data law is not identified. (f) An undirected graph representing an independence model Markov equivalent to the independence model represented by the chain graph in (d).

The difficulty with this result is that it assumes missingness indicators are disconnected. 
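Before turning to connected missingness indicators, the Horvitz-Thompson adjustment reviewed above can be sketched numerically. The following is a minimal simulation of the MAR example of Fig. 1 (a) only; the data-generating coefficients and the Newton-Raphson logistic fit are our own illustrative choices, not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 200_000

# MAR example of Fig. 1 (a): L1 is always observed, R2 depends on L1 only.
l1 = rng.normal(size=N)
l2 = l1 + rng.normal(size=N)                 # so the true E[L2] = 0
p_r2 = 1.0 / (1.0 + np.exp(-(0.5 + l1)))     # p(R2 = 1 | L1)
r2 = rng.binomial(1, p_r2)

# Fit a logistic regression model p(R2 = 1 | L1; beta) by Newton-Raphson MLE.
X = np.column_stack([np.ones(N), l1])
beta = np.zeros(2)
for _ in range(25):
    p = 1.0 / (1.0 + np.exp(-X @ beta))
    W = p * (1.0 - p)                        # IRLS weights
    beta += np.linalg.solve((X * W[:, None]).T @ X, X.T @ (r2 - p))
p_hat = 1.0 / (1.0 + np.exp(-X @ beta))

cc_mean = l2[r2 == 1].mean()                 # complete-case average: biased
ht_mean = np.mean(r2 * l2 / p_hat)           # Horvitz-Thompson IPW: near E[L2] = 0
```

Here the complete-case mean over-represents records with large L1 (which are more likely to be observed), while inverse weighting by the fitted observation probability removes the bias.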
This assumption means we cannot model persistent dropout or loss to followup (where R_i = 0 at one time point implies R_i = 0 at all following time points), or complex patterns of non-monotone missing data (where data is missing intermittently, but missingness also exhibits complex dependence structure). This kind of dependence is represented by connecting R variables in the graph. Unfortunately, this often leads to non-identification – the functional of the full data law not being a function of the observed data law. For instance, if we add an edge R_1 → R_2 to Fig. 1 (b), it is known that p(L_1, L_2) is not identified from p(R, L^*). Intuition for this is presented in Section 8.

A classical approach to missingness with connected R variables assumes sequential ignorability and monotone missingness (where there exists an ordering on variables such that every unit that is missing earlier in the ordering remains missing later in the ordering) [12]. However, this approach does not easily generalize to data missing in non-monotone patterns and not at random.

Nevertheless, if a sufficient number of edges are missing in the graph, identification sometimes is possible even if R variables are dependent, and monotonicity and MAR do not hold. In particular, techniques from causal inference have been used to derive complex identification results in this setting [7, 17]. For instance, it has been shown that in Fig. 1 (c),

p(L_1, L_2, L_3, L_4) = \frac{p(L^*, R = 1)}{\tilde{p}_1 \cdot \tilde{p}_2},

where \tilde{p}_1 = q_{L_4}(R_1 = 1 \mid L_2, R_2 = 1), \tilde{p}_2 = \frac{q_{L_4}(L_1 \mid R_2 = 1, R_1 = 1) \, q_{L_4}(R_2 = 1)}{\sum_{R_2} q_{L_4}(L_1 \mid R_2, R_1 = 1) \, q_{L_4}(R_2)}, and q_{L_4}(R_1, R_2, L_1, L_2, L_3) = p(L_1, L_2, R_1, R_2 \mid L_3, L_4) \, p(L_3). See [17] for details. Unfortunately, it is often difficult to give a practical missing data setting which exhibits the particular pattern of missing edges that permits identification. In addition, a full characterization of identifiability of functionals of the full data law under MNAR is an open problem.

In the next sections, we generalize the graphical model approach to missing data from DAGs to a particular type of chain graph. Our model is able to encode fairly general settings where data is missing non-monotonically and not at random, while also permitting identification of the full data law under fairly mild assumptions.

4 The No Self-Censoring Missingness Model

Having given the necessary preliminaries, we are ready to define our missingness model for data missing non-monotonically and not at random. Our desiderata for such a model are as follows. First, in order for our model to be useful in as wide a variety of missing data settings as possible, we want to avoid imposing any assumptions on the underlying full data law. Second, since we wish to consider arbitrary non-monotonic missingness patterns, we want to allow arbitrary relationships between missingness indicators. Finally, since we wish to allow data to be missing not at random, we want to allow as much dependence of missingness indicators on the underlying data, even if missing, as possible.

However, a completely unrestricted relationship between underlying variables and missingness indicators can easily lead to non-identification. For instance, in any graph where the edge L_i → R_i exists, the marginal distribution p(L_i) is not in general a function of the observed data law. Thus, we do not allow variables to drive their own missingness status, and thus rule out edges of the form L_i → R_i. However, we allow a variable to influence its own missingness status indirectly.

Surprisingly, the restrictions given so far essentially characterize the independences defining our proposed model. Consider the following chain graph on vertices L_1, ..., L_k, R_1, ..., R_k. The vertices L_1, . . . 
, L_k form a complete DAG, meaning that the full data law p(L_1, ..., L_k) has no restrictions. The vertices R_1, ..., R_k form a k-clique, meaning arbitrary dependence structure between R variables is allowed. In addition, for every i, pa_G(R_i) ≡ L \ {L_i}, which restricts a variable L_i from directly causing its own missingness status R_i. The resulting graph is always a chain graph. An example (for k = 3) is shown in Fig. 1 (d). The factorizations (1) and (2) for chain graphs of this form imply a particular set of independence constraints.

Lemma 1 Let G be a chain graph with vertex set R ∪ L, where B(G) = {R, {L_1}, ..., {L_k}}, and for every i, pa_G(L_i) = {L_1, ..., L_{i−1}}, pa_G(R_i) = L \ {L_i}. Then for every i, and every p(L, R) that factorizes according to G, the only conditional independences implied by this factorization on p(L, R) are (∀i) (R_i ⊥⊥ L_i | R \ {R_i}, L \ {L_i}).¹

Proof: This follows by the global Markov property results for chain graphs, found in [6]. □

This set of independences in p(R, L) can be represented not only by a chain graph, but also by an undirected graph where every pair of vertices except R_i and L_i (for every i) are connected. Such a graph, interpreted as a Markov random field, would imply the same set of conditional independences as those in Lemma 1. An example of such a graph for k = 3 is shown in Fig. 1 (f). The reason we emphasize viewing the model using chain graphs is that the only independence restrictions we place are on the conditional distribution p(R | L); these restrictions resemble those found in factors of (1), and not those in classical conditional Markov random fields, where every variable in R would depend on every variable in L. We call the missingness model with this independence structure the no self-censoring model, due to the fact that no variable L_i is allowed to directly censor itself via setting R_i to 0. 
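The defining independence of Lemma 1 can be verified by brute-force enumeration in a tiny instance. The following sketch (our own toy example, with arbitrary potential values) builds a k = 2 no self-censoring model in which no potential contains both R_i and its own L_i, and exposes the conditional p(R_1 = 1 | L_1, L_2, R_2), which turns out not to depend on L_1:

```python
import itertools
import numpy as np

# Toy no self-censoring model for k = 2, all variables binary.
# p(L) is arbitrary; the conditional MRF p(R | L) only uses potentials
# in which R_i never appears together with its own L_i:
#   p(r1, r2 | l1, l2) ∝ exp(a1*r1*l2 + a2*r2*l1 + a12*r1*r2)
a1, a2, a12 = 0.8, -0.5, 1.1                     # arbitrary potential weights
pL = {(0, 0): 0.1, (0, 1): 0.2, (1, 0): 0.3, (1, 1): 0.4}

joint = {}
for (l1, l2), pl in pL.items():
    w = {(r1, r2): np.exp(a1 * r1 * l2 + a2 * r2 * l1 + a12 * r1 * r2)
         for r1, r2 in itertools.product((0, 1), repeat=2)}
    Z = sum(w.values())                          # normalizer for each l
    for rs, v in w.items():
        joint[(l1, l2) + rs] = pl * v / Z

def cond_r1(l1, l2, r2):
    """p(R1 = 1 | L1 = l1, L2 = l2, R2 = r2) computed from the joint."""
    num = joint[(l1, l2, 1, r2)]
    return num / (num + joint[(l1, l2, 0, r2)])
```

Evaluating `cond_r1` at both values of `l1` (holding `l2`, `r2` fixed) gives identical probabilities, which is exactly (R_1 ⊥⊥ L_1 | R_2, L_2).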
We now show that, under relatively benign assumptions, we can identify the full data law p(L) in this model.

Lemma 2 If p(R = 1 | L) is identified from the observed data distribution p(L^*, R = 1), then p(L) is identified from p(L^*, R = 1) via p(L^*, R = 1) / p(R = 1 | L).

Proof: Trivially follows by the chain rule of probability, and the fact that L = L^* if R = 1. □

To obtain identification, we use a form of the log conditional pseudo-likelihood (LCPL) function, first considered (in joint form) in [2]. Define, for any parameterization p(R | L; α), where |R| = k,

\log PL(\alpha) = \sum_{i=1}^{k} \; \sum_{j : L^j \setminus \{L^j_i\} \subseteq (L^*)^j} \log p\big(R_i = r^j_i \mid R \setminus \{R_i\} = r^j \setminus \{r^j_i\}, \, L^j; \alpha\big).

In subsequent discussion we will assume the parameterization is identifiable, that is, if p(R | L; α_0) ≠ p(R | L; α) then α_0 ≠ α.

Lemma 3 Under the no self-censoring model, in the limit of infinite data sampled from p(R, L), where only L^*, R is observed, log PL(α) is maximized at the true parameter values α_0.

Proof: The proof follows that for the standard pseudo-likelihood in [9]. The difference between the LCPL functions evaluated at α_0 and α can be expressed as a sum of conditional relative entropies, which is always non-negative. 
The fact that every term in the LCPL function is a function of the observed data follows by Lemma 1. □

We will restrict attention to function classes which satisfy standard assumptions needed to derive consistent estimators [10], namely compactness of the parameter space, dominance, and (twice) differentiability with respect to α, which implies continuity.

¹ A ⊥⊥ B | C is notation found in [3], meaning A is independent of B given C.

Corollary 1 Under the no self-censoring model of missingness, and the assumptions above, the estimator of α maximizing the LCPL function is weakly consistent.

Proof: Follows by Lemma 3, and the argument in [9] via equation (9), Lemma 1 and Theorem 1. □

5 Estimation

Since all variables in R are binary, and our model for p(R | L) is a type of conditional MRF, a log-linear parameterization is natural. We thus adopt the following class of parameterizations:

p(R = r \mid L = l) = \frac{1}{Z(l)} \exp\Big\{ \sum_{R^\dagger \in \mathcal{P}(R) \setminus \{\emptyset\}} r_{R^\dagger} \cdot f_{R^\dagger}(l_{L \setminus L^\dagger}; \alpha_{R^\dagger}) \Big\}   (4)

where L^† ≡ {L_i | R_i ∈ R^†}, P(R) is the powerset of R, and for every R^†, f_{R^†} is a function parameterized by α_{R^†}, mapping values of L \ L^† to an |R^†|-way interaction. Let α ≡ {α_{R^†} | R^† ∈ P(R) \ {∅}}. 
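As a concrete illustration, the following sketch fits a k = 2 model of the form (4), with hypothetical linear interaction functions of our own choosing, by gradient ascent on the LCPL. For simplicity it evaluates every pseudo-likelihood term on fully simulated records, whereas the paper restricts each term to rows where the required variables are observed:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 50_000

# Hypothetical interaction functions for k = 2 (our choices, not the paper's):
#   f_{R1}(l2) = b1 + w1*l2,  f_{R2}(l1) = b2 + w2*l1,  f_{R1,R2} = c (constant)
truth = dict(b1=-0.3, w1=0.7, b2=0.2, w2=-0.6, c=0.9)

l1 = rng.normal(size=n)
l2 = rng.normal(size=n)

def sample_r(l1, l2, th):
    # enumerate the four R configurations of (4) and sample one per record
    keys = [(0, 0), (0, 1), (1, 0), (1, 1)]
    logits = [r1 * (th['b1'] + th['w1'] * l2)
              + r2 * (th['b2'] + th['w2'] * l1)
              + r1 * r2 * th['c'] for r1, r2 in keys]
    ws = np.exp(np.stack(logits, axis=1))
    probs = ws / ws.sum(axis=1, keepdims=True)
    idx = (probs.cumsum(axis=1) > rng.random((n, 1))).argmax(axis=1)
    r = np.array([keys[i] for i in idx])
    return r[:, 0], r[:, 1]

r1, r2 = sample_r(l1, l2, truth)
sig = lambda x: 1.0 / (1.0 + np.exp(-x))

# Gradient ascent on the LCPL: sum_j [log p(r1 | r2, l) + log p(r2 | r1, l)].
# Each conditional is logistic in the parameters (see Lemma 4 below).
th = np.zeros(5)                                  # b1, w1, b2, w2, c
for _ in range(3000):
    b1, w1, b2, w2, c = th
    e1 = r1 - sig(b1 + w1 * l2 + c * r2)          # residual, R1 conditional
    e2 = r2 - sig(b2 + w2 * l1 + c * r1)          # residual, R2 conditional
    th += np.array([e1.mean(), (e1 * l2).mean(), e2.mean(), (e2 * l1).mean(),
                    (e1 * r2).mean() + (e2 * r1).mean()])
```

With a well-specified model, the maximizer of the LCPL recovers the generating parameters up to sampling noise, matching Lemma 3.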
We now show our class of parameterizations gives the right independence structure.

Lemma 4 For an arbitrary p(L), and a conditional distribution p(R | L) parameterized as in (4), the set of independences in Lemma 1 holds in the joint distribution p(L, R) = p(R | L) p(L).

Proof: For any R_i ∈ R, and values r, l such that r_{R_i} = 1,

p(r_{R_i} \mid r_{R \setminus \{R_i\}}, l_L) = \frac{\exp\Big\{ \sum_{R^\dagger \ni R_i} r_{R^\dagger} \cdot f_{R^\dagger}(l_{L \setminus L^\dagger}; \alpha_{R^\dagger}) \Big\}}{1 + \exp\Big\{ \sum_{R^\dagger \ni R_i} r_{R^\dagger} \cdot f_{R^\dagger}(l_{L \setminus L^\dagger}; \alpha_{R^\dagger}) \Big\}}.

By definition of f_{R^†}, this functional is not a function of L_i, which gives our result. □

As expected with a log-linear conditional MRF, the distribution p(R_i | R \ {R_i}, L) resembles the logistic regression model. Under twice differentiability of f_{R^†}, first and second derivatives of the LCPL function have a straightforward derivation, which we omit in the interests of space. Just as with the logistic model, the estimating equations cannot be solved in closed form, but iterative algorithms are straightforward to construct. For sufficiently simple f_{R^†}, the Newton-Raphson algorithm may be employed. Note that every conditional model for R_i is fit only using rows where L \ {L_i} are observed. Thus, the fitting procedure fails in datasets with few enough samples that for some R_i, no such rows exist. We leave extensions of our model that deal with this issue to future work.

Finally, we use our fitted model p(R | L; α̂) as a joint IPW for estimating functions of p(L). For instance, if L_1, . . . 
L_{k−1} represent intermediate outcomes, and L_k the final outcome of a longitudinal study with intermittent MNAR dropout represented by our model, and we are interested in the expected final outcome E[L_k], we would extend the IPW estimators discussed in Section 3 as follows:

\hat{E}[L_k] = N^{-1} \sum_{n : r^n = 1} l^n_k \, / \, p(R = 1 \mid l^n; \hat{\alpha}).

Estimation of more complex functionals of p(L) proceeds similarly, though it may employ marginal structural models if L is high-dimensional. Consistency of these estimators follows, under the usual assumptions, by standard arguments for IPW estimators, and Corollary 1.

6 A Simple Simulation Study

To verify our results, we implemented our estimator for a simple model in the class of parameterizations (4) that satisfies the assumptions needed for deriving the true parameter by maximizing the LCPL function. Fig. 2 shows our results. For the purposes of illustration, we chose the model in Fig. 1 (d), with functions f_{R^†} defined as follows. For every edge (L_i, R_j) in the graph, define a parameter w_{ij}, and define a parameter w_∅. Define every function f_{R^†} to be of the form \sum_{i : L_i \in L \setminus L^\dagger, \; j : R_j \in R^\dagger} w_{ij} L_i(1). The values of L_1, L_2, L_3 were drawn from a multivariate normal distribution with parameters μ = (1, 1, 1), Σ = I + 1. We generated a series of datasets with sample sizes from 100 to 1000, and compared differences between the true means E[L_i(1)] and the unadjusted (complete case) MLE estimate of E[L_i(1)] (blue), and the IPW adjusted estimate of E[L_i(1)] (red), for i = 1, 2, 3. The true difference is, of course, 0. Confidence intervals at the 95% level were computed using the case resampling bootstrap (50 iterations). The confidence intervals generally overlapped

Figure 2: (a), (b), (c) Results of estimating E[L_1(1)], E[L_2(1)] and E[L_3(1)], respectively, from the model in Fig. 1 (d). 
The Y axis is parameter value, and the X axis is sample size. Confidence intervals are reported using the case resampling bootstrap at the 95% level. Confidence interval size does not necessarily shrink with sample size – a known issue with IPW estimators.

0, while complete case analysis did not. We noted that confidence intervals did not always shrink with increased sample size – a known difficulty with IPW estimators.

Aside from the usual difficulties with IPW estimators, which are known to suffer from high variance, our estimator only reweighs observed cases, which may in general be a small fraction of the overall dataset as k grows (in our simulations only 50-60% of cases were complete). Furthermore, estimating weights by maximizing the pseudo-likelihood is known to be less efficient than maximizing the likelihood, since all variability of variables in the conditioning sets is ignored.

7 Analysis Application

To illustrate the performance of our model in a practical setting where data is missing not at random, we report an analysis of a survey dataset for HIV-infected women in Botswana, also analyzed in [18]. The goal is to estimate an association between maternal exposure to highly active anti-retroviral therapy (HAART) during pregnancy and a premature birth outcome among HIV-infected women in Botswana. The overall data consisted of 33148 obstetrical records from 6 sites in Botswana. Here we restricted to a subset of HIV positive women (n = 9711). We considered four features: the outcome (preterm delivery), with 6.7% of values missing, and two risk factors – whether the CD4 count (a measure of immune system health) was lower than 200 cells per µL (53.1% missing), and whether HAART was continued from before pregnancy (69.0% missing). We also included hypertension – a common comorbidity of HIV (6.5% missing). 
In this dataset, missing at random is not a reasonable assumption, and, what is more, the missingness patterns are not monotonic.

We used a no self-censoring model with f_{R^†}(·) of the same form as in Section 6. The results are shown in Fig. 3, which contains the complete case analysis (CC), the no self-censoring model (NSCM), and a version of the discrete choice model in [18] (DCM). We report the odds ratios (ORs) with a 95% confidence interval, obtained by bootstrap. Note that the CC and DCM confidence intervals for the OR overlap 1, indicating a weak or non-existent effect. The confidence interval for the NSCM indicates a somewhat non-intuitive inverse relationship for low CD4 count and premature birth, which we believe may be due to assumptions of the NSCM not being met with the limited set of four variables we considered. In fact, the dataset was sufficiently noisy that an expected positive relationship was not found by any method.

            Low CD4 Count           Cont HAART
  CC        0.782 (0.531, 1.135)    1.142 (0.810, 1.620)
  NSCM      0.651 (0.414, 0.889)    1.032 (0.670, 1.394)
  DCM       1.020 (0.742, 1.397)    1.158 (0.869, 1.560)

Figure 3: Analyses of the HIV pregnancy Botswana dataset. CC: complete case analysis, NSCM: the no self-censoring model with a linear parameterization, DCM: a member of the discrete choice model family described in [18].

8 Parameter Counting

Parameter counting may be used to give an intuition for why p(L) is identified under the no self-censoring model, but not under a very similar missingness model where undirected edges between R variables are replaced by directed edges under some ordering (see Fig. 1 (d) and (e) for an example for k = 3). Assume |L| = k, where L variables are discrete with d levels. Then the observed data law may be parameterized by 2^k − 1 parameters for p(R), and by d^{|R^†|} − 1 parameters for each p(L^* | R^† = 1, R \ R^† = 0), where R^† ≠ ∅, for a total of

2^k - 1 + \sum_{m=1}^{k} \binom{k}{m} (d^m - 1) = (d + 1)^k - 1.

The no self-censoring model needs d^k − 1 parameters for p(L), and \sum_{m=1}^{k} \binom{k}{m} d^{k-m} parameters for p(R | L), yielding a total of

d^k - 1 + \big( (d + 1)^k - d^k \big) = (d + 1)^k - 1,

which means the model is just-identified, and imposes no restrictions on the observed data law under our assumptions on L. However, the DAG model needs d^k − 1 parameters for p(L), and \sum_{i=1}^{k} d^{k-1} \cdot 2^{i-1} parameters for p(R | L), for a total of d^k − 1 + d^{k−1} · (2^k − 1).

The following lemma implies the DAG version of the no self-censoring model is not identified.

Lemma 5 d^{k−1} · (2^k − 1) > (d + 1)^k − d^k for k ≥ 2, d ≥ 2.

Proof: For k = 2, we have 3d > 2d + 1, which holds for any d > 1. The inductive hypothesis for k can be rewritten as 2^k > (d + 1)^k / d^{k−1} − d + 1. The inequality then holds for k + 1, since 2 > (d + 1)/d for d > 1. □

Just-identification under the independence structure given in Lemma 1 was used in [16] (independently of this paper) to derive a parameterization of the model that uses the observed data law. 
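The counting identities above are easy to confirm numerically; a small sketch (the ranges of d and k are arbitrary):

```python
from math import comb

for d in (2, 3, 5):                # number of levels of each L variable
    for k in (2, 3, 4, 6):         # number of variables
        # observed data law: p(R) plus p(L* | each nonempty response pattern)
        obs = (2 ** k - 1) + sum(comb(k, m) * (d ** m - 1) for m in range(1, k + 1))
        assert obs == (d + 1) ** k - 1

        # no self-censoring model: p(L) plus the log-linear p(R | L)
        nscm = (d ** k - 1) + sum(comb(k, m) * d ** (k - m) for m in range(1, k + 1))
        assert nscm == (d + 1) ** k - 1            # just-identified

        # DAG variant: more parameters than the observed law supports (Lemma 5)
        assert d ** (k - 1) * (2 ** k - 1) > (d + 1) ** k - d ** k
```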
This paper, by contrast, only models the missingness process represented by p(R | L), and does not model the observed data law p(L*) at all.

9 Conclusions

In this paper, we have presented a graphical missingness model based on chain graphs for data missing non-monotonically and not at random. Specifically, our model places no restrictions on the underlying full data law or on the dependence structure of the missingness indicators, and allows a high degree of interdependence between the underlying unobserved variables and missingness indicators. Nevertheless, under our model, and fairly mild assumptions, the full data law is identified. Our estimator is an inverse probability weighting estimator with the weights being joint probabilities of the data being observed, conditional on all variables. The weights are fitted by maximizing the log conditional pseudo likelihood function, first derived in joint form in [2].

We view our work as an alternative to existing and newly developed methods for MNAR data [13, 18, 16], and an attempt to bridge the gap between the existing rich missing data literature on identification and estimation strategies for MAR data (see [14] for further references), and newer work which gave an increasingly sophisticated set of identification conditions for MNAR data using missingness graphs [8, 7, 17]. The drawback of existing MAR methods is that most missingness patterns of practical interest are not MAR; the drawback of the missingness graph literature is that it has not yet considered estimation, and has used assumptions on missingness that, while MNAR, are difficult to justify in practice (for example, Fig. 1 (c) implies a complicated identifying functional under MNAR, but places a marginal independence restriction (L1 ⊥⊥ L2) on the full data law). Our work remedies both of these shortcomings.
On the one hand, we assume a very general, and thus easier to justify in practice, missingness model for MNAR data. On the other, we do not just consider the identification problem for our model, but give a class of IPW estimators for functions of the observed data law. Addressing the statistical and computational challenges posed by our class of estimators, and making them practical for the analysis of high dimensional MNAR data, is our next step.

References

[1] Heejung Bang and James M. Robins. Doubly robust estimation in missing data and causal inference models. Biometrics, 61:962–972, 2005.

[2] Julian Besag. Statistical analysis of non-lattice data. The Statistician, 24(3):179–195, 1975.

[3] A. Philip Dawid. Conditional independence in statistical theory. Journal of the Royal Statistical Society, Series B, 41:1–31, 1979.

[4] D. G. Horvitz and D. J. Thompson. A generalization of sampling without replacement from a finite universe. Journal of the American Statistical Association, 47:663–685, 1952.

[5] J. Lafferty, A. McCallum, and F. Pereira. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In Proceedings of the Eighteenth International Conference on Machine Learning (ICML-01), pages 282–289. Morgan Kaufmann, 2001.

[6] Steffen L. Lauritzen. Graphical Models. Oxford, U.K.: Clarendon, 1996.

[7] Karthika Mohan and Judea Pearl. Graphical models for recovering probabilistic and causal queries from missing data. In Z. Ghahramani, M. Welling, C. Cortes, N. D. Lawrence, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 27, pages 1520–1528. Curran Associates, Inc., 2014.

[8] Karthika Mohan, Judea Pearl, and Jin Tian. Graphical models for inference with missing data. In C. J. C. Burges, L. Bottou, M. Welling, Z. Ghahramani, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 26, pages 1277–1285.
Curran Associates, Inc., 2013.

[9] A. Mozeika, O. Dikmen, and J. Piili. Consistent inference of a general model using the pseudolikelihood method. Physical Review E: Statistical, Nonlinear, and Soft Matter Physics, 90, 2014.

[10] Whitney Newey and Daniel McFadden. Chapter 35: Large sample estimation and hypothesis testing. In Handbook of Econometrics, Vol. 4, pages 2111–2245. Elsevier Science, 1994.

[11] Judea Pearl. Probabilistic Reasoning in Intelligent Systems. Morgan Kaufmann, San Mateo, 1988.

[12] James M. Robins. A new approach to causal inference in mortality studies with sustained exposure periods – application to control of the healthy worker survivor effect. Mathematical Modelling, 7:1393–1512, 1986.

[13] James M. Robins. Non-response models for the analysis of non-monotone non-ignorable missing data. Statistics in Medicine, 16:21–37, 1997.

[14] Mark J. van der Laan and James M. Robins. Unified Methods for Censored Longitudinal Data and Causality. Springer-Verlag New York, Inc., 2003.

[15] D. B. Rubin. Inference and missing data (with discussion). Biometrika, 63:581–592, 1976.

[16] Mauricio Sadinle and Jerome P. Reiter. Itemwise conditionally independent nonresponse modeling for incomplete multivariate data. https://arxiv.org/abs/1609.00656, 2016. Working paper.

[17] Ilya Shpitser, Karthika Mohan, and Judea Pearl. Missing data as a causal and probabilistic problem. In Proceedings of the Thirty First Conference on Uncertainty in Artificial Intelligence (UAI-15), pages 802–811. AUAI Press, 2015.

[18] Eric J. Tchetgen Tchetgen, Linbo Wang, and BaoLuo Sun. Discrete choice models for nonmonotone nonignorable missing data: Identification and inference. https://arxiv.org/abs/1607.02631, 2016. Working paper.