{"title": "MiME: Multilevel Medical Embedding of Electronic Health Records for Predictive Healthcare", "book": "Advances in Neural Information Processing Systems", "page_first": 4547, "page_last": 4557, "abstract": "Deep learning models exhibit state-of-the-art performance for many predictive healthcare tasks using electronic health records (EHR) data, but these models typically require training data volume that exceeds the capacity of most healthcare systems.\nExternal resources such as medical ontologies are used to bridge the data volume constraint, but this approach is often not directly applicable or useful because of inconsistencies with terminology.\nTo solve the data insufficiency challenge, we leverage the inherent multilevel structure of EHR data and, in particular, the encoded relationships among medical codes.\nWe propose Multilevel Medical Embedding (MiME) which learns the multilevel embedding of EHR data while jointly performing auxiliary prediction tasks that rely on this inherent EHR structure without the need for external labels. \nWe conducted two prediction tasks, heart failure prediction and sequential disease prediction, where MiME outperformed baseline methods in diverse evaluation settings.\nIn particular, MiME consistently outperformed all baselines when predicting heart failure on datasets of different volumes, especially demonstrating the greatest performance improvement (15% relative gain in PR-AUC over the best baseline) on the smallest dataset, demonstrating its ability to effectively model the multilevel structure of EHR data.", "full_text": "MiME: Multilevel Medical Embedding of Electronic\n\nHealth Records for Predictive Healthcare\n\nEdward Choi\u21e4\nGoogle Brain\n\nedwardchoi@google.com\n\nCao Xiao\n\nIBM Research\n\ncxiao@us.ibm.com\n\nWalter F. 
Stewart\u2020\nHINT Consultants\n\nwfs502000@yahoo.com\n\nJimeng Sun\n\nGeorgia Institute of Technology\n\njsun@cc.gatech.edu\n\nAbstract\n\nDeep learning models exhibit state-of-the-art performance for many predictive\nhealthcare tasks using electronic health records (EHR) data, but these models\ntypically require training data volume that exceeds the capacity of most healthcare\nsystems. External resources such as medical ontologies are used to bridge the\ndata volume constraint, but this approach is often not directly applicable or useful\nbecause of inconsistencies with terminology. To solve the data insuf\ufb01ciency chal-\nlenge, we leverage the inherent multilevel structure of EHR data and, in particular,\nthe encoded relationships among medical codes. We propose Multilevel Medical\nEmbedding (MiME) which learns the multilevel embedding of EHR data while\njointly performing auxiliary prediction tasks that rely on this inherent EHR struc-\nture without the need for external labels. We conducted two prediction tasks, heart\nfailure prediction and sequential disease prediction, where MiME outperformed\nbaseline methods in diverse evaluation settings. In particular, MiME consistently\noutperformed all baselines when predicting heart failure on datasets of different\nvolumes, especially demonstrating the greatest performance improvement (15% rel-\native gain in PR-AUC over the best baseline) on the smallest dataset, demonstrating\nits ability to effectively model the multilevel structure of EHR data.\n\n1\n\nIntroduction\n\nThe rapid growth of electronic health record (EHR) data has motivated use of deep learning models\nand demonstrated state-of-the-art performance in diagnostics [26, 13, 12, 27], disease detection [14,\n10, 17], risk prediction [20, 32], and patient subtyping [3, 6]. However, training optimal deep learning\nmodels typically requires a large volume (i.e. 
number of patient records and features per record).\nMost health systems do not have the data volume required to optimize performance of these models,\nespecially for less common services (e.g. intensive care units (ICU)) or rare conditions.\nExternal resources, particularly medical ontologies, have been used to address data volume\ninsufficiencies [12, 31, 7]. For example, in [12], the latent embedding of a clinical code (e.g. diagnosis code) can\nbe learned as a convex combination of the embeddings of the code itself and its ancestors on the\nontology graph. However, medical ontologies are often not available or not directly applicable due to\nthe nonstandard or idiosyncratic use of terminology and the complex terminology mapping from one\nhealth system\u2019s EHR to another. For example, many clinics still use their own in-house terminologies\nfor medications and lab tests, which do not conform with the standard medical ontologies such as\nAnatomical Therapeutic Chemical (ATC) Classification system and Logical Observation Identifiers\nNames and Codes (LOINC).\n\n\u21e4Work done at Georgia Institute of Technology.\n\u2020Work done at Sutter Health.\n\n32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montr\u00e9al, Canada.\n\n[Figure 1 labels. Visits: (t-1)-th, t-th, (t+1)-th. Diagnosis level: Fatigue, Cough, Fever, Dyspnea, Nausea. Treatment level: Benzonatate, Acetaminophen, IV Fluid, Cardiac EKG.]\n\nFigure 1: Symbolic representation of a single visit of a patient. Red denotes diagnosis codes, and blue\ndenotes medication/procedure codes. A visit encompasses a set of codes, as well as a hierarchical\nstructure and heterogeneous relations among these codes. For example, while both Acetaminophen\nand IV fluid form an explicit relationship with Fever, they are also correlated with each other as\ndescendants of Fever.\n\nAs an alternative, we explored how the inherent multilevel structure of EHR data could be leveraged\nto improve learning efficiency. The hierarchical structure of EHR data begins with the patient,\nfollowed by visits, then diagnosis codes within visits, which are then linked to treatment orders\n(e.g. medications, procedures). This hierarchical structure reveals influential multilevel relationships,\nespecially between diagnosis codes and treatment codes. For example, a diagnosis fever can lead to\nassociated treatments such as acetaminophen (medication) and IV fluid (procedure). We examine\nwhether this multilevel structure could be leveraged to obtain a robust model under small data volume.\nTo the best of our knowledge, none of the existing works leverages this multilevel structure in EHR.\nRather, they flatten EHR data as a set of independent codes [18, 38, 11, 12, 14, 10, 13, 27, 2], which\nignores hierarchical relationships among medical codes within visits.\nWe propose Multilevel Medical Embedding (MiME) to simultaneously transform the inherent\nmultilevel structure of EHR data into multilevel embeddings, while jointly performing auxiliary prediction\ntasks that reflect this inherent structure without the need for external labels. Modeling the\ninherent structure among medical codes enables us to accurately capture the distinguishing patterns of\ndifferent patient states. 
The auxiliary tasks inject the hierarchical knowledge of EHR data into the\nembedding process such that the main task can borrow prediction power from related auxiliary tasks.\nWe conducted two prediction tasks, heart failure prediction and sequential disease prediction, where\nMiME outperformed baseline methods in diverse evaluation settings. In particular, for heart failure\nprediction on datasets of different volumes, MiME consistently outperformed all baseline models.\nNotably, MiME showed the greatest performance improvement (15% relative gain in PR-AUC over\nthe best baseline) for the smallest dataset, demonstrating its ability to effectively model the multilevel\nstructure of EHR data.\n\n2 Method\n\nEHR data can be represented by a common hierarchy that begins with individual patient records,\nwhere each patient record consists of a sequence of visits. In a typical visit, a physician gives a\ndiagnosis to a patient and then orders medications or procedures based on the diagnosis. This process\ngenerates a set of treatment (medication and procedure) codes and a relationship among diagnosis\nand treatment codes (see Figure 1). MiME is designed to explicitly capture the relationship between\nthe diagnosis codes and the treatment codes within visits.\n\n2.1 Notations of MiME\nAssume a patient has a sequence of visits $V^{(1)}, \\ldots, V^{(t)}$ over time, where each visit $V^{(t)}$ contains a\nvarying number of diagnosis (Dx) objects $O^{(t)}_1, \\ldots, O^{(t)}_{|V^{(t)}|}$. Each $O^{(t)}_i$ consists of a single Dx code\n$d^{(t)}_i \\in A$ and a set of associated treatments (medications or procedures) $M^{(t)}_i$. Similarly, each $M^{(t)}_i$\nconsists of a varying number of treatment codes $m^{(t)}_{i,1}, \\ldots, m^{(t)}_{i,|M^{(t)}_i|} \\in B$. To reduce clutter, we omit\nthe superscript $(t)$ indicating the t-th visit when we are discussing a single visit. Table 1 summarizes\nthe notations we will use throughout the paper.\n\nTable 1: Notations for MiME. Note that the dimension size z is used in many places due to the use of\nskip-connections, which will be described in section 2.2.\n\nNotation\tDefinition\n$A$\tSet of unique diagnosis codes\n$B$\tSet of unique treatment codes (medications and procedures)\n$h$\tA vector representation of a patient\n$V^{(t)}$\tA patient\u2019s t-th visit, which contains diagnosis objects $O^{(t)}_1, \\ldots, O^{(t)}_{|V^{(t)}|}$\n$v^{(t)} \\in \\mathbb{R}^z$\tA vector representation of $V^{(t)}$\n$O^{(t)}_i$\tThe i-th diagnosis object of the t-th visit, consisting of Dx code $d^{(t)}_i$ and treatment codes $M^{(t)}_i$\n$o^{(t)}_i \\in \\mathbb{R}^z$\tA vector representation of $O^{(t)}_i$\n$p(d^{(t)}_i | o^{(t)}_i)$, $p(m^{(t)}_{i,j} | o^{(t)}_i)$\tAuxiliary predictions, respectively for a Dx code and a treatment code based on $o^{(t)}_i$\n$d^{(t)}_i \\in A$\tDx code of diagnosis object $O^{(t)}_i$\n$M^{(t)}_i$\tA set of treatment codes associated with the i-th Dx code $d^{(t)}_i$ in visit t\n$m^{(t)}_{i,j} \\in B$\tThe j-th treatment code of $M^{(t)}_i$\n$g(d^{(t)}_i, m^{(t)}_{i,j})$\tA function that captures the interaction between $d^{(t)}_i$ and $m^{(t)}_{i,j}$\n$f(d^{(t)}_i, M^{(t)}_i)$\tA function that computes the embedding of diagnosis object $o^{(t)}_i$\n$r(\\cdot) \\in \\mathbb{R}^z$\tA helper notation for extracting the embedding vector of $d^{(t)}_i$ or $m^{(t)}_{i,j}$\n\nIn Figure 1, there are five Dx codes, hence five Dx objects $O^{(t)}_1, \\ldots, O^{(t)}_5$. More specifically, the\nfirst Dx object $O_1$ has $d^{(t)}_1$ = Fatigue as the Dx code, but no treatment codes. $O_2$, on the other\nhand, has Dx code $d^{(t)}_2$ = Cough and two associated treatment codes $m^{(t)}_{2,1}$ = Benzonatate and\n$m^{(t)}_{2,2}$ = Acetaminophen. In this case, we can use $g(d^{(t)}_2, m^{(t)}_{2,1})$ to capture the interaction between Dx\ncode Cough and treatment code Benzonatate, which will be fed to $f(d^{(t)}_2, M^{(t)}_2)$ to obtain the vector\nrepresentation of Dx object $o^{(t)}_2$. Using the five Dx object embeddings $o^{(t)}_1, \\ldots, o^{(t)}_5$, we can obtain a\nvisit embedding $v^{(t)}$. In addition, some treatment codes (e.g. Acetaminophen) can be shared by two\nor more Dx codes (e.g. Cough, Fever), if the doctor ordered a single medication for more than one\ndiagnosis. Then each Dx object will have its own copy of the treatment code attached to it, in this\ncase denoted $m^{(t)}_{2,2}$ and $m^{(t)}_{3,1}$, respectively.\n\n2.2 Description of MiME\nMultilevel Embedding. As discussed earlier, previous approaches often flatten a single visit such\nthat Dx codes and treatment codes are packed together so that a single visit $V^{(t)}$ can be expressed as a\nbinary vector $x^{(t)} \\in \\{0, 1\\}^{|A|+|B|}$ where each dimension corresponds to a specific Dx and treatment\ncode. Then a patient\u2019s visit sequence is encoded as:\n\n$$v^{(t)} = \\sigma(W_x x^{(t)} + b_x)$$\n$$h = h(v^{(1)}, v^{(2)}, \\ldots, v^{(t)})$$\n\nwhere $W_x$ is the embedding matrix that converts the binary vector x to a lower-dimensional visit\nrepresentation (footnote 3), $\\sigma$ a non-linear activation function such as sigmoid or rectified linear unit (ReLU), and\n$h(\\cdot)$ a function that maps a sequence of visit representations $v^{(1)}, \\ldots, v^{(t)}$ to a patient representation\nh. In contrast, MiME effectively derives a visit representation $v^{(t)}$ that can be plugged into any $h(\\cdot)$\nfor the downstream prediction task. $h(\\cdot)$ can simply be an RNN or a combination of RNNs and CNN\nand attention mechanisms [1].\n\nFootnote 3: We omit bias variables throughout the paper to reduce clutter.\n\n[Figure 2 labels. Levels: Patient level, Visit level, Diagnosis level, Treatment level. Legend: Embedding flow, Interaction between diagnosis and treatment, Auxiliary prediction.]\n\nFigure 2: Prediction model using MiME. Codes are embedded into multiple levels: diagnosis-level,\nvisit-level, and patient-level. Final prediction $p(y|h)$ is based on the patient representation h, which is\nderived from visit representations $v^{(1)}, v^{(2)}, \\ldots$, where each $v^{(t)}$ is generated using the MiME framework.\nAs shown in the Treatment level, MiME explicitly captures the interactions between a diagnosis code\nand the associated treatment codes. MiME also uses those codes as auxiliary prediction targets to\nimprove generalizability when large training data are not available.\n\nMiME explicitly captures the hierarchy between Dx codes and treatment codes depicted in Figure 1.\nFigure 2 illustrates how MiME builds the representation of V (omitting the superscript (t)) in a bottom-up\nfashion via multilevel embedding. In a single Dx object $O_i$, a Dx code $d_i$ and its associated\ntreatment codes $M_i$ are used to obtain a vector representation of $O_i$, $o_i$. Then multiple Dx object\nembeddings $o_1, \\ldots, o_{|V|}$ in a single visit are used to obtain a visit embedding v, which in turn forms\na patient embedding h with other visit embeddings. The formulation of MiME is as follows:\n\n$$v = \\sigma\\Big(W_v \\Big(\\underbrace{\\sum_{i=1}^{|V|} f(d_i, M_i)}_{F:\\ \\text{used for skip-connection}}\\Big)\\Big) + F \\qquad (1)$$\n$$f(d_i, M_i) = o_i = \\sigma\\Big(W_o \\Big(\\underbrace{r(d_i) + \\sum_{j=1}^{|M_i|} g(d_i, m_{i,j})}_{G:\\ \\text{used for skip-connection}}\\Big)\\Big) + G \\qquad (2)$$\n$$g(d_i, m_{i,j}) = W_m r(d_i) \\odot r(m_{i,j}) \\qquad (3)$$\n\nwhere Eq. (1), Eq. (2) and Eq. (3) describe MiME in a top-down fashion, respectively corresponding\nto Visit level, Diagnosis level and Treatment level in Figure 2.\nIn Eq. (1), a visit embedding v is obtained by summing Dx object embeddings $o_1, \\ldots, o_{|V|}$, which are\nthen transformed with $W_v \\in \\mathbb{R}^{z \\times z}$. $\\sigma$ is a non-linear activation function such as sigmoid or rectified\nlinear unit (ReLU). In Eq. (2), $o_i$ is obtained by summing $r(d_i) \\in \\mathbb{R}^z$, the vector representation of\nthe Dx code $d_i$, and the effect of the interactions between $d_i$ and its associated treatments $M_i$, which\nare then transformed with $W_o \\in \\mathbb{R}^{z \\times z}$. The interactions captured by $g(d_i, m_{i,j})$ are added to\n$r(d_i)$, which can be interpreted as adjusting the diagnosis representation according to its associated\ntreatments (medications and procedures). Note that in both Eq. (1) and Eq. (2), F and G are used to\ndenote skip-connections [23].\nIn Eq. (3), the interaction between a Dx code embedding $r(d_i)$ and a treatment code embedding\n$r(m_{i,j})$ is captured by element-wise multiplication $\\odot$. Weight matrix $W_m \\in \\mathbb{R}^{z \\times z}$ sends the Dx code\nembedding $r(d_i)$ into another latent space, where the interaction between $d_i$ and the corresponding\n$m_{i,j}$ can be effectively captured. The formulation of Eq. (3) was inspired by recent developments in\nbilinear pooling technique [37, 21, 19, 24], which we discuss in more detail in Appendix A. With\nEq. (3) in mind, G in Eq. (2) can also be interpreted as $r(d_i)$ being skip-connected to the sum of\ninteractions $g(d_i, m_{i,j})$.\n\nJoint Training with Auxiliary Tasks. Patient embedding h is often used for specific prediction tasks,\nsuch as heart failure prediction or mortality. The representation power of h comes from properly\ncapturing each visit $V^{(t)}$, and modeling the longitudinal aspect with the function $h(v^{(1)}, \\ldots, v^{(t)})$. Since\nthe focus of this work is on modeling a single visit $V^{(t)}$, we perform auxiliary predictions as follows:\n\n$$\\hat{d}^{(t)}_i = p(d^{(t)}_i | o^{(t)}_i) = \\mathrm{softmax}(U_d o^{(t)}_i) \\qquad (4)$$\n$$\\hat{m}^{(t)}_{i,j} = p(m^{(t)}_{i,j} | o^{(t)}_i) = \\sigma(U_m o^{(t)}_i) \\qquad (5)$$\n$$L_{aux} = \\lambda_{aux} \\sum_{t=1}^{T} \\Big( \\sum_{i=1}^{|V^{(t)}|} \\Big( CE(d^{(t)}_i, \\hat{d}^{(t)}_i) + \\sum_{j=1}^{|M^{(t)}_i|} CE(m^{(t)}_{i,j}, \\hat{m}^{(t)}_{i,j}) \\Big) \\Big) \\qquad (6)$$\n\nGiven Dx object embeddings $o^{(t)}_1, \\ldots, o^{(t)}_{|V^{(t)}|}$, while aggregating them to obtain $v^{(t)}$ as in Eq. (1),\nMiME predicts the Dx code $d^{(t)}_i$ and the associated treatment code $m^{(t)}_{i,j}$ as depicted by Figure 2. In\nEq. (4) and Eq. (5), $U_d \\in \\mathbb{R}^{|A| \\times z}$ and $U_m \\in \\mathbb{R}^{|B| \\times z}$ are weight matrices used to compute the\nprediction of the Dx code $\\hat{d}^{(t)}_i$ and the prediction of the treatment code $\\hat{m}^{(t)}_{i,j}$, respectively. In Eq. (6), T\ndenotes the total number of visits the patient made, $CE(\\cdot,\\cdot)$ the cross-entropy function and $\\lambda_{aux}$ the\ncoefficient for the auxiliary loss term. We used the softmax function for predicting $d^{(t)}_i$ since in a\nsingle Dx object $O^{(t)}_i$, there is only one Dx code involved. However, there could be no (or many)\ntreatment codes associated with $O^{(t)}_i$, and therefore we used $|B|$ sigmoid functions for\npredicting each treatment code.\nAuxiliary tasks are based on the inherent structure of the EHR data, and require no additional\nlabeling effort. These auxiliary tasks guide the model to learn Dx object embeddings $o^{(t)}_i$ that are\nrepresentative of the specific codes involved with it. Correctly capturing the events within a visit is\nthe basis of all downstream prediction tasks, and these general-purpose auxiliary tasks, combined\nwith the specific target task, encourage the model to learn visit embeddings $v^{(t)}$ that are not only\ntuned for the target prediction task, but also grounded in general-purpose foundational knowledge.\n\n3 Experiments\n\nIn this section, we first describe the dataset and the baseline models, and present evaluation results.\nThe source code of MiME is publicly available at https://github.com/mp2893/mime.\n\n3.1 Source of Data\nWe conducted all our experiments using EHR data provided by Sutter Health. The dataset was\nconstructed for a study designed to predict a future diagnosis of heart failure, and included EHR data\nfrom 30,764 senior patients 50 to 85 years of age. 
We extracted the diagnosis codes, medication\ncodes and the procedure codes from encounter records and related orders. We used Clinical\nClassification Software for ICD9-CM (footnote 4) to group the ICD9 diagnosis codes into 388 categories.\nGeneric Product Identifier Drug Group (footnote 5) was used to group the medication codes into 99 categories.\nClinical Classifications Software for Services and Procedures (footnote 6) was used to group the CPT procedure\ncodes into 1,824 categories. Any code that did not fit into the grouper formed its own category.\nTable 2 summarizes data statistics.\n\nFootnote 4: https://www.hcup-us.ahrq.gov/toolssoftware/ccs/ccs.jsp\nFootnote 5: http://www.wolterskluwercdi.com/drug-data/medi-span-electronic-drug-file/\nFootnote 6: https://www.hcup-us.ahrq.gov/toolssoftware/ccs_svcsproc/ccssvcproc.jsp\n\nTable 2: Statistics of the dataset\n\n# of patients\t30,764\n# of visits\t616,073\nAvg. # of visits per patient\t20.0\n# of unique codes\t2,311 (Dx:388, Rx:99, Proc:1,824)\nAvg. # of Dx per visit\t1.93 (Max: 29)\nAvg. # of Rx per diagnosis\t0.31 (Max: 17)\nAvg. # of Proc. per diagnosis\t0.36 (Max: 10)\n\n3.2 Baseline Models\nFirst, we use Gated Recurrent Units (GRU) [9] with different embedding strategies to map visit\nembedding sequence $v^{(1)}, \\ldots, v^{(T)}$ to a patient representation h:\n\u2022 raw: A single visit $V^{(t)}$ is represented by a binary vector $x^{(t)} \\in \\{0, 1\\}^{|A|+|B|}$. Only the\ndimensions corresponding to the codes occurring in that visit are set to 1, and the rest are 0.\n\u2022 linear: The binary vector $x^{(t)}$ is linearly transformed to a lower-dimensional vector $v^{(t)} = W_x x^{(t)}$\nwhere $W_x \\in \\mathbb{R}^{b \\times (|A|+|B|)}$ is the embedding matrix. This is equivalent to taking the\nvector representations of the codes (i.e. 
columns of the embedding matrix $W_x$) in the visit $V^{(t)}$,\nand summing them up to derive a single vector $v^{(t)} \\in \\mathbb{R}^b$.\n\u2022 sigmoid, tanh, relu: The binary vector $x^{(t)}$ is transformed to a lower-dimensional vector $v^{(t)} = \\sigma(W_x x^{(t)})$\nwhere we use either sigmoid, tanh, or ReLU for $\\sigma(\\cdot)$ to add non-linearity to linear.\n\u2022 sigmoidmlp, tanhmlp, relumlp: We add one more layer to sigmoid, tanh and relu to increase\ntheir expressivity. The visit embedding is now $v^{(t)} = \\sigma(W_{x2} \\sigma(W_{x1} x^{(t)}))$ where $\\sigma$ is either\nsigmoid, tanh or ReLU. We do not test linearmlp since two consecutive linear layers can be\ncollapsed to a single linear layer.\n\nSecond, we also compare with two advanced embedding methods that are specifically designed for\nmodeling EHR data.\n\u2022 Med2Vec: We use Med2Vec [11] to learn visit representations, and use those fixed vectors\nas input to the prediction model. We test this model as a representative case of an unsupervised\nembedding approach using EHR data.\n\n\u2022 GRAM: We use GRAM [12], which is equivalent to injecting domain knowledge (ICD9 Dx code\ntree) to tanh via an attention mechanism. We test this model as a representative case of incorporating\nexternal domain knowledge.\n\n3.3 Prediction Tasks\nHeart failure prediction. The objective is to predict the first diagnosis of heart failure (HF), given the\n18-month observation records discussed in section 3.1. Among 30,764 patients, 3,414 were case\npatients who were diagnosed with HF within a 1-year window after the 18-month observation. The\nremaining 27,350 patients were controls. The case-control selection criteria are detailed in [39] and\nsummarized in Appendix B. While an accurate prediction of HF can save a large amount of costs and\nlives [33], this task is also suitable for assessing how well a model can learn the relationship between\nthe external label (i.e. 
the label information is not inherent in the EHR data) and the features (i.e.\ncodes).\nWe applied logistic regression to the patient representation h to obtain a value between 0 (no HF\nonset) and 1 (HF onset). All models were trained end-to-end except Med2Vec. We report Area under\nthe Precision-Recall Curve (PR-AUC) in the experiment and Area under the Receiver Operating\nCharacteristic (ROC-AUC) in the appendix, as PR-AUC is considered a better measure for imbalanced\ndata like ours [34, 16]. Implementation and training configurations are described in Appendix C. We\nalso performed sequential disease prediction (SDP) (predicting all diagnoses of the next visit at every\ntimestep) where MiME demonstrated superior performance over all baseline models. The detailed\ndescription and results of SDP are provided in Appendix H and Appendix I respectively.\n\n3.4 Experiment 1: Varying the Data Size\nTo evaluate MiME\u2019s performance from another perspective, we created four datasets E1, E2, E3, E4\nfrom the original data such that each dataset consisted of patients with varying maximum sequence\nlength Tmax (i.e. maximum number of visits). In order to simulate a new hospital collecting patient\nrecords over time, we increased Tmax across the datasets, setting it to 10, 20, 30, 150 for E1, E2, E3, E4\nrespectively. The datasets had 6299 (414 cases), 15794 (1177 cases), 21128 (1848 cases), 27428\n(3173 cases) patients respectively. For MiME aux, we used the same 0.015 for the auxiliary loss\ncoefficient $\\lambda_{aux}$.\n\nFigure 3: Test PR-AUC of HF prediction for increasing data size. A table with the results of all\nbaseline models is provided in Appendix F.\n\nFigure 3 shows the test PR-AUC for HF prediction across all datasets (loss and ROC-AUC are\ndescribed in Appendix G). Again we show the strongest activation functions tanh and tanhmlp here\nand provide the full table in Appendix F. We can readily see that MiME outperforms all baseline\nmodels across all datasets. 
Moreover, the performance gap between MiME and the baselines is larger\nin datasets E1, E2 than in datasets E3, E4, confirming our assumption that exploiting the inherent\nstructure of EHR can alleviate the data insufficiency problem. Especially for the smallest dataset E1,\nMiME aux (0.2831 PR-AUC) demonstrated significantly better performance than the best baseline\ntanhmlp (0.2462 PR-AUC), showing 15% relative improvement.\nIt is notable that MiME consistently outperformed GRAM in both Table 3 and Figure 3 in terms of test\nloss and test PR-AUC. To be fair, GRAM was only using Dx code hierarchy (thus ungrouped 5814\nDx codes were used), and no additional domain knowledge regarding treatment codes. However,\nthe experiment results tell us that even without resorting to external domain knowledge, we can still\ngain improved predictive performance by carefully studying the EHR data and leveraging its inherent\nstructure.\n\n3.5 Experiment 2: Varying Visit Complexity\n\nTable 3: HF prediction performance on small datasets. Values in the parentheses denote standard\ndeviations from 5-fold random data splits. All models used GRU for mapping the visit embeddings\n$v^{(1)}, \\ldots, v^{(T)}$ to a patient representation h. Two best values in each column are marked in bold. A\nfull table with all baselines is provided in Appendix D.\nD1: visit complexity 0-15% (5608 patients, 464 cases). D2: visit complexity 15-30% (5180 patients, 341 cases). D3: visit complexity 30-100% (5231 patients, 383 cases).\n\nModel\tD1 test loss\tD1 test PR-AUC\tD2 test loss\tD2 test PR-AUC\tD3 test loss\tD3 test PR-AUC\nraw\t0.2553 (0.0084)\t0.2669 (0.0314)\t0.2203 (0.0186)\t0.2388 (0.0460)\t0.2144 (0.0127)\t0.3776 (0.0589)\nlinear\t0.2562 (0.0108)\t0.2722 (0.0354)\t0.2200 (0.0187)\t0.2403 (0.0229)\t0.2021 (0.0176)\t0.4339 (0.0411)\ntanh\t0.2648 (0.0124)\t0.2707 (0.0138)\t0.2186 (0.0182)\t0.2479 (0.0512)\t0.2025 (0.0151)\t0.4415 (0.0532)\ntanhmlp\t0.2587 (0.0121)\t0.2671 (0.0257)\t0.2289 (0.0213)\t0.2296 (0.0185)\t0.2024 (0.0181)\t0.4290 (0.0510)\nMed2Vec\t0.2601 (0.0186)\t0.2771 (0.0288)\t0.2171 (0.0170)\t0.2356 (0.0309)\t0.2044 (0.0129)\t0.3813 (0.0240)\nGRAM\t0.2554 (0.0254)\t0.2633 (0.0521)\t0.2249 (0.0448)\t0.2505 (0.0609)\t0.2333 (0.0362)\t0.3998 (0.0628)\nMiME\t0.2535 (0.0042)\t0.2637 (0.0326)\t0.2121 (0.0238)\t0.2579 (0.0241)\t0.1931 (0.0140)\t0.4685 (0.0432)\nMiME aux\t0.2512 (0.0073)\t0.2750 (0.0326)\t0.2117 (0.0238)\t0.2589 (0.0287)\t0.1910 (0.0163)\t0.4787 (0.0434)\n\nNext, we conducted a series of experiments to confirm that MiME can indeed capture the\nrelationship between Dx codes and treatment codes, thus producing robust performance in small datasets.\nSpecifically, we created three small datasets D1, D2, D3 from the original data such that each dataset\nconsisted of patients with varying degree of Dx-treatment interactions (i.e. visit complexity). We\ndefined visit complexity as below to calculate for a patient the percentage of visits that have at least\ntwo diagnosis codes associated with different sets of treatment codes,\n\n$$\\text{visit complexity} = \\frac{\\#\\, V^{(t)} \\text{ where } |\\mathrm{set}(M^{(t)}_1, \\ldots, M^{(t)}_{|V^{(t)}|})| \\geq 2}{T}$$\n\nwhere T denotes the total number of visits. 
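As a worked illustration of this definition, using hypothetical numbers that are not taken from the dataset: suppose a patient has T = 10 visits, and in 4 of them at least two diagnosis codes are associated with different sets of treatment codes. Then\n\n```latex\n\\text{visit complexity} = \\frac{4}{10} = 40\\%\n```\n\nso, under the visit complexity ranges listed for Table 3, this hypothetical patient would fall into the 30-100% group (D3). 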
For example, in Figure 1, the t-th visit $V^{(t)}$ has Fatigue\nassociated with no treatments, and Cough associated with two treatments. Therefore $V^{(t)}$ qualifies\nas a complex visit. From the original dataset, we selected patients with a short sequence (fewer\nthan 20 visits) to simulate a hospital newly equipped with an EHR system, where not much\ndata has been collected yet. Among the patients with fewer than 20 visits, we used visit complexity ranges\n0-15%, 15-30%, 30-100% to create D1, D2, D3 consisting of 5608 (464 HF cases), 5180 (341\nHF cases), 5231 (383 HF cases) patients respectively. For training MiME with auxiliary tasks, we\nexplored various $\\lambda_{aux}$ values between 0.01 and 0.1, and found 0.015 to provide the best performance,\nalthough other values also improved the performance in varying degrees.\nTable 3 shows the HF prediction performance for the datasets D1, D2 and D3. To enhance readability,\nwe show here the results of the strongest activation functions tanh and tanhmlp, and we report test\nloss and test PR-AUC. The results of other activation functions and the test ROC-AUC are provided\nin Appendix D and Appendix E.\nTable 3 provides two important messages. First of all, both MiME and MiME aux show close to the best\nperformance in all datasets D1, D2 and D3, especially the high complexity dataset D3. This confirms\nthat MiME indeed draws its power from the interactions between Dx codes and treatment codes, with\nor without the auxiliary tasks. In D1, patients\u2019 visits do not have much structure, so it makes little\ndifference whether we use MiME or not, and its performance is more or less similar to many baselines.\nSecond, auxiliary tasks indeed help MiME generalize better to patients unseen during training. 
In\nall datasets D1, D2 and D3, MiME aux outperforms MiME in all measures, especially in D3 where it\nshows PR-AUC 0.4787 (8.4% relative improvement over the best baseline tanh).\n\n4 Related Work\n\nOver the years, medical concept embedding has been an active research area. Some works tried to\nsummarize sparse and high-dimensional medical concepts into compressed vectors [15, 18]. In those\nworks, medical concepts were organized as temporal sequences, from which embeddings were derived.\nOther works used latent layers of deep models for representing more abstract medical concepts\n[14, 10, 13, 12, 27, 2]. For example, restricted Boltzmann machines, stacked auto-encoders or\nmulti-layer neural networks were used to learn the representation of codes, visits, or patients [38, 28, 11].\nSome works used medical ontologies to learn medical concept representations [12, 8]. Although all of these\nworks successfully learned concept embeddings for some tasks to varying degrees, they did not fully\nutilize the multilevel structure or diagnosis-treatment relationship of EHR.\nRecently, multiple code types in EHR have gained more attention. In [35], the authors viewed different code\ntypes separately, and tried to capture complex relationships across these disparate data types using\nRNNs, but they did not explicitly address the hierarchy of EHR data. More recently in [30], the\nauthors tried to explicitly capture the interaction between a set of all diagnosis codes and a set of all\nmedication codes occurring in a visit. However, in their experiment, simply concatenating both sets\nto obtain a visit vector outperformed other methods in many tasks. This suggests that disregarding\nthe diagnosis-specific Dx-Rx interaction and flattening all codes as sets is a suboptimal approach to\nmodeling EHR data.\nAs described in section 2.2, we employ an auxiliary task strategy to train a robust model. 
Training a\nmodel to predict multiple related targets has been shown to improve model robustness in medical prediction\ntasks in previous studies. For example, [5] used lab values as auxiliary targets to improve mortality\nprediction performance. More recent studies [29, 22, 4] demonstrated improved prediction accuracy\nwhen training a model with multiple related tasks such as mortality prediction and phenotyping.\n\n5 Conclusion\n\nIn this work, we presented MiME, an integrated approach that simultaneously models hierarchical\ninter-code relations into medical concept embedding while jointly performing auxiliary prediction\ntasks. Through extensive empirical evaluation, MiME demonstrated impressive performance across all\nbenchmark tasks and its generalization ability to smaller datasets, especially outperforming baselines\nin terms of PR-AUC in heart failure prediction. As we have established in this work that MiME can be\na good choice for modeling visits, in the future, we plan to extend MiME to include more fine-grained\nmedical events such as procedure outcomes, demographic information, and medication instructions.\n\nAcknowledgments\nThis work was supported by the National Science Foundation, awards IIS-#1418511 and\nCCF-#1533768, the National Institutes of Health awards 1R01MD011682-01 and R56HL138415, and\nSamsung Scholarship. We would also like to thank Sherry Yan for her helpful comments on the\noriginal manuscript.\n\nReferences\n[1] Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. Neural machine translation by jointly\n\nlearning to align and translate. In ICLR, 2015.\n\n[2] Jacek M Bajor and Thomas A Lasko. Predicting medications from diagnostic codes with\n\nrecurrent neural networks. In ICLR, 2017.\n\n[3] Inci M Baytas, Cao Xiao, Xi Zhang, Fei Wang, Anil K Jain, and Jiayu Zhou. Patient subtyping\n\nvia time-aware lstm networks. In SIGKDD, 2017.\n\n[4] Adrian Benton, Margaret Mitchell, and Dirk Hovy. 
Multi-task learning for mental health using social media text. arXiv preprint arXiv:1712.03538, 2017.\n\n[5] Rich Caruana, Shumeet Baluja, and Tom Mitchell. Using the future to \"sort out\" the present: Rankprop and multitask learning for medical risk evaluation. In NIPS, pages 959\u2013965, 1996.\n\n[6] Chao Che, Cao Xiao, Jian Liang, Bo Jin, Jiayu Zhou, and Fei Wang. An RNN architecture with dynamic temporal matching for personalized predictions of Parkinson\u2019s disease. In SIAM on Data Mining, 2017.\n\n[7] Zhengping Che, David Kale, Wenzhe Li, Mohammad Taha Bahadori, and Yan Liu. Deep computational phenotyping. In Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD \u201915, pages 507\u2013516, New York, NY, USA, 2015. ACM.\n\n[8] Zhengping Che, David Kale, Wenzhe Li, Mohammad Taha Bahadori, and Yan Liu. Deep computational phenotyping. In SIGKDD, 2015.\n\n[9] Kyunghyun Cho, Bart Van Merri\u00ebnboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. Learning phrase representations using RNN encoder-decoder for statistical machine translation. In EMNLP, 2014.\n\n[10] Edward Choi, Mohammad Taha Bahadori, Andy Schuetz, Walter F Stewart, and Jimeng Sun. Doctor AI: Predicting clinical events via recurrent neural networks. In MLHC, 2016.\n\n[11] Edward Choi, Mohammad Taha Bahadori, Elizabeth Searles, Catherine Coffey, Michael Thompson, James Bost, Javier Tejedor-Sojo, and Jimeng Sun. Multi-layer representation learning for medical concepts. In SIGKDD, 2016.\n\n[12] Edward Choi, Mohammad Taha Bahadori, Le Song, Walter F Stewart, and Jimeng Sun. GRAM: Graph-based attention model for healthcare representation learning. In SIGKDD, 2017.\n\n[13] Edward Choi, Mohammad Taha Bahadori, Jimeng Sun, Joshua Kulas, Andy Schuetz, and Walter Stewart. RETAIN: An interpretable predictive model for healthcare using reverse time attention mechanism. 
In NIPS, 2016.\n\n[14] Edward Choi, Andy Schuetz, Walter F Stewart, and Jimeng Sun. Using recurrent neural network models for early detection of heart failure onset. Journal of the American Medical Informatics Association, 2016.\n\n[15] Youngduck Choi, Chill Yi-I Chiu, and David Sontag. Learning low-dimensional representations of medical concepts. AMIA Summits on Translational Science Proceedings, 2016.\n\n[16] Jesse Davis and Mark Goadrich. The relationship between precision-recall and ROC curves. In Proceedings of the 23rd International Conference on Machine Learning, pages 233\u2013240. ACM, 2006.\n\n[17] Crist\u00f3bal Esteban, Oliver Staeck, Stephan Baier, Yinchong Yang, and Volker Tresp. Predicting clinical events by combining static and dynamic information using recurrent neural networks. In ICHI, 2016.\n\n[18] Wael Farhan, Zhimu Wang, Yingxiang Huang, Shuang Wang, Fei Wang, and Xiaoqian Jiang. A predictive model for medical events based on contextual embedding of temporal sequences. JMIR Medical Informatics, 2016.\n\n[19] Akira Fukui, Dong Huk Park, Daylen Yang, Anna Rohrbach, Trevor Darrell, and Marcus Rohrbach. Multimodal compact bilinear pooling for visual question answering and visual grounding. In EMNLP, 2016.\n\n[20] Joseph Futoma, Jonathan Morris, and Joseph Lucas. A comparison of models for predicting early hospital readmissions. JBI, 2015.\n\n[21] Yang Gao, Oscar Beijbom, Ning Zhang, and Trevor Darrell. Compact bilinear pooling. In CVPR, 2016.\n\n[22] Hrayr Harutyunyan, Hrant Khachatrian, David C Kale, and Aram Galstyan. Multitask learning and benchmarking with clinical time series data. arXiv preprint arXiv:1703.07771, 2017.\n\n[23] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In CVPR, 2016.\n\n[24] Jin-Hwa Kim, Kyoung-Woon On, Woosang Lim, Jeonghee Kim, Jung-Woo Ha, and Byoung-Tak Zhang. Hadamard product for low-rank bilinear pooling. 
In ICLR, 2017.\n\n[25] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv:1412.6980, 2014.\n\n[26] Zachary C Lipton, David C Kale, Charles Elkan, and Randall Wetzel. Learning to diagnose with LSTM recurrent neural networks. In ICLR, 2016.\n\n[27] Fenglong Ma, Radha Chitta, Jing Zhou, Quanzeng You, Tong Sun, and Jing Gao. Dipole: Diagnosis prediction in healthcare via attention-based bidirectional recurrent neural networks. In SIGKDD, 2017.\n\n[28] Riccardo Miotto, Li Li, Brian A Kidd, and Joel T Dudley. Deep patient: An unsupervised representation to predict the future of patients from the electronic health records. Scientific Reports, 2016.\n\n[29] Che Ngufor, Sudhindra Upadhyaya, Dennis Murphree, Daryl Kor, and Jyotishman Pathak. Multi-task learning with selective cross-task transfer for predicting bleeding and other important patient outcomes. In Data Science and Advanced Analytics (IEEE DSAA), pages 1\u20138, 2015.\n\n[30] Phuoc Nguyen, Truyen Tran, and Svetha Venkatesh. Resset: A recurrent model for sequence of sets with applications to electronic medical records. arXiv:1802.00948, 2018.\n\n[31] Nozomi Nori, Hisashi Kashima, Kazuto Yamashita, Hiroshi Ikai, and Yuichi Imanaka. Simultaneous modeling of multiple diseases for mortality prediction in acute hospital care. In Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD \u201915, pages 855\u2013864, New York, NY, USA, 2015. ACM.\n\n[32] T. Pham, T. Tran, D. Phung, and S. Venkatesh. Predicting healthcare trajectories from medical records: A deep learning approach. Journal of Biomedical Informatics, 2017.\n\n[33] Veronique L Roger, Susan A Weston, Margaret M Redfield, Jens P Hellermann-Homan, Jill Killian, Barbara P Yawn, and Steven J Jacobsen. Trends in heart failure incidence and survival in a community-based population. JAMA, 2004.\n\n[34] Takaya Saito and Marc Rehmsmeier. 
The precision-recall plot is more informative than the ROC plot when evaluating binary classifiers on imbalanced datasets. PLoS ONE, 10(3):e0118432, 2015.\n\n[35] Harini Suresh, Nathan Hunt, Alistair Johnson, Leo Anthony Celi, Peter Szolovits, and Marzyeh Ghassemi. Clinical intervention prediction and understanding using deep networks. In MLHC, 2017.\n\n[36] TensorFlow Team. TensorFlow: A system for large-scale machine learning. In OSDI, 2016.\n\n[37] JB Tenenbaum and WT Freeman. Separating style and content with bilinear models. Neural Computation, 2000.\n\n[38] Truyen Tran, Tu Dinh Nguyen, Dinh Phung, and Svetha Venkatesh. Learning vector representation of medical objects via EMR-driven nonnegative restricted Boltzmann machines (eNRBM). Journal of Biomedical Informatics, 2015.\n\n[39] Rajakrishnan Vijayakrishnan, Steven R Steinhubl, Kenney Ng, Jimeng Sun, Roy J Byrd, Zahra Daar, Brent A Williams, Shahram Ebadollahi, Walter F Stewart, et al. Prevalence of heart failure signs and symptoms in a large primary care population identified through the use of text and data mining of the electronic health record. Journal of Cardiac Failure, 2014.", "award": [], "sourceid": 2222, "authors": [{"given_name": "Edward", "family_name": "Choi", "institution": "Google"}, {"given_name": "Cao", "family_name": "Xiao", "institution": "IBM Research"}, {"given_name": "Walter", "family_name": "Stewart", "institution": "No Affiliation"}, {"given_name": "Jimeng", "family_name": "Sun", "institution": "Georgia Tech"}]}