{"title": "REFUEL: Exploring Sparse Features in Deep Reinforcement Learning for Fast Disease Diagnosis", "book": "Advances in Neural Information Processing Systems", "page_first": 7322, "page_last": 7331, "abstract": "This paper proposes REFUEL, a reinforcement learning method with two techniques: {\\em reward shaping} and {\\em feature rebuilding}, to improve the performance of online symptom checking for disease diagnosis. Reward shaping can guide the search of policy towards better directions. Feature rebuilding can guide the agent to learn correlations between features. Together, they can find symptom queries that can yield positive responses from a patient with high probability. Experimental results justify that the two techniques in REFUEL allows the symptom checker to identify the disease more rapidly and accurately.", "full_text": "REFUEL: Exploring Sparse Features in Deep\n\nReinforcement Learning for Fast Disease Diagnosis\n\nYu-Shao Peng\n\nHTC Research & Healthcare\n\nys_peng@htc.com\n\nKai-Fu Tang\n\nHTC Research & Healthcare\n\nkevin_tang@htc.com\n\nHsuan-Tien Lin\n\nDepartment of CSIE,\n\nNational Taiwan University\nhtlin@csie.ntu.edu.tw\n\nEdward Y. Chang\n\nHTC Research & Healthcare\nedward_chang@htc.com\n\nAbstract\n\nThis paper proposes REFUEL, a reinforcement learning method with two tech-\nniques: reward shaping and feature rebuilding, to improve the performance of\nonline symptom checking for disease diagnosis. Reward shaping can guide the\nsearch of policy towards better directions. Feature rebuilding can guide the agent\nto learn correlations between features. Together, they can \ufb01nd symptom queries\nthat can yield positive responses from a patient with high probability. Experimental\nresults justify that the two techniques in REFUEL allow the symptom checker to\nidentify the disease more rapidly and accurately.\n\n1\n\nIntroduction\n\nOne of the rising needs in healthcare is self-diagnosis. About 35% of adults in the U.S. 
had the experience of self-diagnosing their discomfort or illness via online services, according to a survey performed in 2012 [12]. Although online search is a convenient medium, search results can often be inaccurate or irrelevant [12]. The process of disease diagnosis can be considered as a sequence of queries and answers: a doctor asks a patient about his/her symptoms to make a disease prediction. During the Q&A process, the doctor carefully chooses relevant questions to ask the patient with twin performance aims. First, each symptom probing should obtain maximal information about the patient's state. Second, after a series of Q&A, the diagnosis should be accurate.

From the machine learning perspective, the doctor's iterative decision-making process can be viewed as feature acquisition and classification. The costs of feature acquisition are non-trivial in the decision-making process, because acquiring a large number of symptoms from a patient is annoying and impractical. One big family of existing methods that take the feature acquisition costs into account is tree-based. For instance, cost-sensitive decision trees were studied by Xu et al. [19] and Kusner et al. [5], and cost-sensitive random forests by Nan et al. [7, 8] and Nan and Saligrama [6]. Recently, reinforcement learning (RL) was shown to be a promising approach to the sequential decision problem with acquisition costs. Janisch et al. [2] use a double deep Q-network [15] to learn a query strategy. Their experimental results show that their RL approach outperforms tree-based methods.

Even though the RL method can reach higher prediction accuracy, applying it to the disease diagnosis task encounters two challenges. First, the numbers of possible diseases and symptoms are in the hundreds, which leads to a large search space.
Second, the number of symptoms that a patient actually suffers from is much smaller than the number of possible symptoms, which results in a sparse feature space. This sparsity makes the search more difficult, because discovering these few symptoms in the sparse feature space is time-consuming. To alleviate the first problem, Tang et al. [14] and Kao

32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montréal, Canada.

et al. [3] proposed ensemble models. Although the search space in each model is reduced, which resulted in improved performance, the problem of the sparse feature space remains open.

This paper proposes the REFUEL1 model, which is a reinforcement learning model with reward shaping and feature rebuilding techniques. Reward shaping gives the agent an additional reward to guide the search towards better directions in the sparse feature space. Feature rebuilding enables the agent to discover the key symptoms by learning a better representation. Combining these two techniques, the agent can effectively learn a strategy that achieves higher prediction accuracy. Our empirical study demonstrates that our proposed techniques can productively explore the sparse feature space and obtain high disease-prediction accuracy (e.g., achieving 91.71% top-5 accuracy among 73 diseases).

2 Disease Diagnosis Using Reinforcement Learning

We consider the disease diagnosis problem as a sequential decision problem. The doctor who interacts with a patient can be considered as an agent. The agent's actions in the decision process include feature acquisition and classification. Initially, the patient provides an initial symptom to the agent. Then, the agent starts its feature acquisition stage. At each time step, the agent asks the patient whether he/she suffers from a certain symptom. The patient answers the agent with true/false, indicating whether he/she suffers from that symptom. After the agent has acquired sufficient symptoms, it switches to the classification stage and makes a disease prediction.

Suppose we have a medical dataset consisting of patients' symptoms and diseases. Let X = {0, 1}^m denote the set of possible symptom profiles, where m is the number of symptoms. A patient's symptom profile x ∈ X is said to have the jth symptom if x_j = 1; otherwise, it does not have the jth symptom. We call x+ = {j : x_j = 1} and x− = {j : x_j = 0} the sets of positive and negative symptoms of x, respectively.2 Also, we say that x is sparse if |x+| ≪ m. Let Y = {1, . . . , n} be the set of labels, where n is the number of diseases, and let D = {(x^(j), y^(j)) : x^(j) ∈ X, y^(j) ∈ Y}_{j=1}^{k} be the dataset of k labeled examples. We say that D is sparse if all x ∈ {x ∈ X : (x, y) ∈ D} are sparse. In the problem of disease diagnosis, a patient typically has only a few symptoms, whereas the number of possible symptoms can be several hundred (i.e., |x+| ≪ m). That is, the symptom space of the disease diagnosis problem is sparse.

To formulate our problem as a decision process, we introduce the Markov decision process (MDP) terminology used in this paper. We follow the standard notation in [13]. An MDP M is a five-tuple M = (S, A, R, p, γ), where: S is a set of states; A is a set of actions; R ⊆ ℝ is a set of possible rewards; p : S × R × S × A → [0, 1] is the dynamics of the MDP, with

    p(s′, r | s, a) := Pr{S_t = s′, R_t = r | S_{t−1} = s, A_{t−1} = a},

for all s, s′ ∈ S, r ∈ R, a ∈ A; and γ ∈ [0, 1] is the discount-rate parameter. An MDP is said to be finite if S, A, and R are finite.

Given our dataset D, we can construct a sample model M = (S, A, R, p, γ) as follows. The state space S is {−1, 0, 1}^m.
A state can be viewed as the partial information of a patient's symptom profile x ∈ X. In the state encoding, we use 1 for positive symptoms, −1 for negative symptoms, and 0 for unknown symptoms. The action space A is {1, . . . , m} ∪ {m + 1, . . . , m + n}. We say that an action is a feature acquisition action if it is less than or equal to m; otherwise, it is a classification action. The reward space R is {−1, 0, 1}. The dynamics p is a generative model which can generate sample episodes. At the beginning, the sample model generates an initial state S_1 with one positive symptom by

    (x, y) ∼ uniform(D),    i ∼ uniform(x+),    S_1 = e_i,

where e_i is the vector in ℝ^m with ith component 1 and all other components 0. The initial state can be viewed as the information of an initial symptom provided by a patient.

In the following time step t, if M receives an action A_t ≤ m (i.e., a feature acquisition action) from the agent and the time step t is less than the maximum length T, the sample model generates the next state S_{t+1} and the reward R_{t+1} by

    S_{t+1,j} = { 2x_j − 1   if j = A_t
                { S_{t,j}    otherwise,        R_{t+1} = 0.

1 REFUEL stands for REward shaping and FeatUrE rebuiLding.
2 We use positive and negative symptoms to denote present and absent symptoms.

The next state S_{t+1} contains the positive and negative symptoms in S_t, and also one additional feature x_{A_t} (i.e., the A_t-th position of x) acquired by the action A_t. If M receives a feature acquisition action at the last step of an episode (t = T), the next state S_{t+1} and the reward R_{t+1} are

    S_{t+1} = S_⊥,    R_{t+1} = −1,

where S_⊥ is the terminal state. Therefore, the agent is punished if it asks too many questions.
On the other hand, if M receives an action A_t > m (i.e., a classification action), the sample model generates S_{t+1} and R_{t+1} by

    S_{t+1} = S_⊥,    R_{t+1} = 2δ_{y+m, A_t} − 1,

where δ is the Kronecker delta. The reward R_{t+1} is 1 if the disease prediction made by the agent is correct (i.e., y + m = A_t). Otherwise, R_{t+1} is −1 for a wrong disease prediction (y + m ≠ A_t). The next state S_{t+1} is the terminal state S_⊥, and the episode terminates.

In reinforcement learning, a policy π : S → A is a strategy function of an agent that is used to interact with an MDP. At time step t, the agent chooses an action based on π and obtains an immediate reward R_{t+1} generated by the MDP. The cumulative discounted reward obtained by the agent is

    G_t = Σ_{t′=t}^{T} γ^{t′−t} R_{t′+1}.

Given a policy π, the state value of state s under policy π is defined as v_π(s) := E_π[G_t | S_t = s], and the action value of taking action a in state s under policy π is defined as q_π(s, a) := E_π[G_t | S_t = s, A_t = a]. The optimal action value is defined as q∗(s, a) := sup_π q_π(s, a). For a finite MDP, the optimal policy is defined as π∗(s) = argmax_{a∈A} q∗(s, a).

We use the policy-based method REINFORCE [17] to learn the optimal policy π∗, whose goal is to maximize the cumulative discounted reward G_t. To achieve "fast" disease diagnosis, we have two candidate approaches. The first approach is to penalize the agent with a negative reward whenever the agent queries a symptom. Since the agent can easily learn unexpected behavior if the reward criterion is slightly changed, the severity of the penalty is difficult to determine. The other approach is to set the discount-rate parameter γ < 1.
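To make the sample model M concrete, the dynamics above can be sketched in a few lines of Python (a toy illustration of our own, not the paper's implementation; we use 0-based indices, so actions 0..m−1 query symptoms and actions ≥ m predict diseases):

```python
import random

TERMINAL = "S_bot"  # stands for the terminal state S_⊥

class SymptomEnv:
    """Toy sketch of the sample model M: states live in {-1, 0, 1}^m,
    actions below m acquire features, actions >= m classify."""

    def __init__(self, x, y, m, n, T):
        self.x, self.y, self.m, self.n, self.T = x, y, m, n, T
        i = random.choice([j for j in range(m) if x[j] == 1])  # initial positive symptom
        self.state = [1 if j == i else 0 for j in range(m)]    # S_1 = e_i
        self.t = 1

    def step(self, a):
        if a < self.m:                       # feature acquisition action
            if self.t >= self.T:             # asked too many questions: punish
                self.state = TERMINAL
                return TERMINAL, -1
            s = list(self.state)
            s[a] = 2 * self.x[a] - 1         # reveal symptom a as +1 (present) or -1 (absent)
            self.state, self.t = s, self.t + 1
            return s, 0
        self.state = TERMINAL                # classification action ends the episode
        return TERMINAL, (1 if a - self.m == self.y else -1)

def discounted_return(rewards, gamma):
    """G_t = sum over t' of gamma^(t'-t) * R_(t'+1)."""
    return sum(gamma ** k * r for k, r in enumerate(rewards))
```

With γ = 0.9, for example, the reward sequence [0, 1] (one question, then a correct diagnosis) yields a return of 0.9, while [0, 0, 0, 1] yields 0.9³ ≈ 0.729, so a correct but shorter episode scores higher.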
The discount-rate parameter γ affects the cumulative discounted reward G_t, and thereby the strategy of the agent. If γ < 1 and the agent makes a correct prediction at the end, an agent asking fewer questions receives a larger cumulative discounted reward G_t than one asking more questions. Therefore, the agent learns to shorten the diagnosis process. This approach is preferable because its reward (penalty) setting is simpler.

3 Our Proposed Schemes

As illustrated in Section 2, the feature vector of symptoms of each training instance is sparse in our disease diagnosis problem. If the agent relies on random sampling in the symptom space to query a patient, a symptom probe is unlikely to be answered positively by the patient. Without any positive symptoms, the agent cannot perform disease prediction. Therefore, in the early phase of symptom probing, the agent must employ a strategy that can quickly find some positive symptoms during reinforcement learning. However, the naïve reinforcement learning algorithm presented in Section 2 does not explicitly guide the agent to discover positive symptoms quickly, which makes the learning progress slow under the challenging scenario of sparse features. In this section, we propose two novel techniques to guide the reinforcement learning agent to quickly discover key positive symptoms. The first technique is built on the foundation of reward shaping to encourage discovering positive symptoms; the second technique leverages the concept of representation learning to ensure focusing on key positive symptoms.
As we will show in Section 4, these two techniques significantly improve the REINFORCE framework in both speed and accuracy.

3.1 Exploring Positive Symptoms with Reward Shaping

To encourage the reinforcement learning agent to discover positive symptoms more quickly, a simple heuristic is to provide the agent with an auxiliary reward when a positive symptom is queried, and a relatively smaller (or even negative) reward when a negative symptom is queried. However, this heuristic carries the risk of changing the optimal policy of the original MDP. We first introduce a principled framework that allows changing the reward function while keeping the optimal policy untouched. We then design a novel reward function that satisfies both the intuition behind the heuristic and the requirement of the framework.

The principled framework is called reward shaping, which changes the original reward function of an MDP into a new one in order to make the reinforcement learning problem easier to solve [11]. We consider the special case of potential-based reward shaping, which comes with theoretical guarantees on the invariance of the optimal policy [9]. Given an MDP M = (S, A, R, p, γ), potential-based reward shaping assigns a potential to each state by a bounded potential function φ : S → ℝ. Physically, the potential function represents the prior "goodness" of each state. We define the auxiliary reward function from state s to state s′ as f(s, s′) := γφ(s′) − φ(s), which carries the discounted potential difference between two states.
Potential-based reward shaping considers a new MDP M_φ = (S, A, R_φ, p_φ, γ) with the changed reward distribution p_φ defined by

    p_φ(s′, r | s, a) := p(s′, r − f(s, s′) | s, a),    for all s, s′ ∈ S, r ∈ R_φ, a ∈ A,

where R_φ := {r + f(s, s′) : r ∈ R, s, s′ ∈ S}. That is, the new MDP provides an additional reward f(s, s′) when the agent moves from state s to state s′. The following theorem establishes the guarantee that the optimal policies of M and M_φ are equivalent.

Theorem 1 (Policy Invariance Theorem [9]). For any given MDP M = (S, A, R, p, γ) and any potential function φ, define the changed MDP M_φ = (S, A, R_φ, p_φ, γ) as above. Then, every optimal policy in M_φ is also optimal in M.

With Theorem 1 guaranteeing the invariance of the optimal policies, the only issue left is to design a proper potential function φ(s) for each state s. Recall that the potential function needs to match the heuristic of providing a more positive reward after identifying a positive symptom. Next, we propose a simple potential function, and then establish a condition under which the potential function matches the heuristic.

The simple potential function is defined as

    φ(s) := { λ × |{j : s_j = 1}|   if s ∈ S \ {S_⊥}
            { 0                     otherwise,          (1)

where λ > 0 is a hyperparameter that controls the magnitude of reward shaping. That is, when the state s is not the terminal state S_⊥, φ(s) carries a scaled version of the number of positive symptoms that have been identified before entering state s. By the definition f(s, s′) := γφ(s′) − φ(s), scaling up λ is equivalent to scaling up f to make the auxiliary rewards larger in magnitude.
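The potential function (1) and the auxiliary reward f can be sketched directly (illustrative code of our own, plugging in the hyperparameter values λ = 0.25 and γ = 0.99 reported in Section 4):

```python
GAMMA, LAM = 0.99, 0.25   # discount rate and shaping magnitude (the values used in Section 4)
TERMINAL = "S_bot"        # stands for the terminal state S_⊥

def phi(state):
    """Potential function (1): lambda times the number of known positive symptoms."""
    if state == TERMINAL:
        return 0.0
    return LAM * sum(1 for v in state if v == 1)

def aux_reward(s, s_next):
    """Auxiliary reward f(s, s') = gamma * phi(s') - phi(s)."""
    return GAMMA * phi(s_next) - phi(s)

def shaped_return(states, rewards):
    """Backward recursion G_t = R_(t+1) + f(S_t, S_(t+1)) + gamma * G_(t+1).
    `states` has one more entry than `rewards` (it ends with the terminal state)."""
    g = 0.0
    for t in reversed(range(len(rewards))):
        g = rewards[t] + aux_reward(states[t], states[t + 1]) + GAMMA * g
    return g
```

Revealing a positive symptom ([1, 0, 0] → [1, 1, 0]) gives f = 0.99 · 0.5 − 0.25 = 0.245 > 0, while a negative answer ([1, 0, 0] → [1, −1, 0]) gives f = 0.99 · 0.25 − 0.25 = −0.0025 ≤ 0, so positive discoveries are rewarded and negative ones are slightly discouraged, as the heuristic above intends.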
Once the agent reaches the terminal state S_⊥, its potential value is set to 0 to ensure the policy invariance [9].

After defining the potential function, we would like to ensure that the auxiliary reward f(s, s′) satisfies our heuristic. The heuristic is to provide the agent with a positive auxiliary reward when it discovers a positive symptom, and a negative auxiliary reward when it queries a negative symptom. Recall that we set the discount-rate parameter γ < 1 to shorten the diagnosis process. Since the discount-rate parameter γ affects the auxiliary reward f(s, s′), we need to investigate the range of γ that makes the auxiliary reward satisfy our heuristic. A similar analysis is performed by Grzes and Kudenko [1]. Here, we provide a more rigorous analysis. For any state s, consider a state s+ that differs from s by one positive symptom. Similarly, consider a state s− that differs from s by one negative symptom. The following theorem establishes the condition on γ that makes f(s, s+) positive and f(s, s−) non-positive.

Theorem 2. Let φ : S → ℝ≥0 be a bounded function over S, and u = sup{φ(s) : s ∈ S}. If s, s+, s− ∈ S satisfy φ(s+) = φ(s) + c and φ(s−) = φ(s) for some c > 0, then, for any u/(u+c) < γ < 1,

    f(s, s−) ≤ 0   and   f(s, s+) > 0.

Proof. f(s, s−) = (γ − 1)φ(s), which is trivially non-positive by the conditions γ < 1 and φ(s) ≥ 0; f(s, s+) = γφ(s+) − φ(s) > (u/(u+c))(φ(s) + c) − φ(s) ≥ (φ(s)/(φ(s)+c))(φ(s) + c) − φ(s) = 0.

The proposed potential function in (1) easily satisfies the condition of Theorem 2 by setting c = λ and u = λd, where d = max{|x+| : x ∈ D_X} is the sparsity level of dataset D. Then, by choosing γ within the range (λd/(λd+λ), 1) = (d/(d+1), 1), we can ensure that the auxiliary reward of discovering a positive symptom is positive and the auxiliary reward of discovering a negative symptom is non-positive, therefore satisfying the simple heuristic under the principled framework of potential-based reward shaping.

Given that certain important negative symptoms are also helpful for distinguishing diseases, one may think it is counterintuitive to punish an agent with non-positive auxiliary rewards when it queries negative symptoms. However, Theorem 1 guarantees the invariance of the optimal policy with auxiliary rewards. Therefore, even though the agent may receive non-positive auxiliary rewards, it can still learn to query those critical negative symptoms.

3.2 Feature Rebuilding of Sparse Features

[Figure 1: Dual neural network architecture. The upper branch is the policy π of an agent; the lower branch is the feature rebuilding part of sparse features. The input (state s concatenated with context κ, of size m + l) feeds shared 1024-unit layers; the policy branch outputs m + n action scores, and the rebuilding branch (2048 units) outputs the m-dimensional rebuilt vector z.]

One specialty of the disease diagnosis problem is that many symptoms can be highly correlated. Symptom correlation allows doctors to use some positive symptoms to infer the existence of some other symptoms. For example, a patient suffering from runny nose may also suffer from sore throat or headache. Our agent can take advantage of such correlation without "wasting" queries on probing those highly correlated symptoms.

The MDP designed in Section 2 allows the agent to learn not to waste queries by encouraging the agent to obtain the correct classification reward as quickly as possible.
Nevertheless, the reward signal, which comes only at the moment of the final classification action, is rather weak to the agent. Thus, the concept of not wasting queries can be slow to learn. Furthermore, the reward signal is weakened by the reward shaping technique introduced previously. In particular, because any positive symptom receives the same amount of auxiliary reward from the changed MDP, the agent is generally more eager to identify any positive symptom (to receive some auxiliary reward immediately) rather than a key positive symptom that can help infer other positive symptoms towards a quicker classification reward.

To identify key positive symptoms, we combine a symptom rebuilding task with policy learning. Figure 1 illustrates our proposed architecture: the first three layers are shared layers, the upper branch is for policy learning, and the lower branch is for feature rebuilding. In the training stage, the model should generate a policy π, as well as rebuild full symptoms from partial ones. If the agent queries more key positive symptoms, the model can perform a better feature rebuilding at the next time step. A better feature rebuilding provides the agent with a better representation in the shared layers. That is, the agent can possess a global view of the symptom information. Referring to this global information, the agent is inclined to make a correct diagnosis and to receive a positive reward. Afterwards, the desired behavior of querying key positive symptoms is reinforced by the positive reward.

As shown in Figure 1, the input to the neural network is the concatenation of the state s and contextual information κ ∈ {0, 1}^l. We use the same contextual information κ as in [3]. Let x ∈ X be a vector with full symptom information and z ∈ ℝ^m the rebuilt one calculated by the lower branch (with a sigmoid output function).
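As a schematic sketch of the dual-branch computation in Figure 1 (toy code of our own, with tiny layer widths and random weights; the actual model uses 1024- and 2048-unit layers):

```python
import math
import random

random.seed(0)

def linear(inp, w, b):
    """Dense layer: out[j] = b[j] + sum_i inp[i] * w[i][j]."""
    return [b[j] + sum(v * w[i][j] for i, v in enumerate(inp)) for j in range(len(b))]

def relu(v):
    return [max(0.0, x) for x in v]

def sigmoid(v):
    return [1.0 / (1.0 + math.exp(-x)) for x in v]

def softmax(v):
    mx = max(v)
    e = [math.exp(x - mx) for x in v]
    total = sum(e)
    return [x / total for x in e]

def init(n_in, n_out):
    w = [[random.uniform(-0.1, 0.1) for _ in range(n_out)] for _ in range(n_in)]
    return w, [0.0] * n_out

m, l, n, hidden = 6, 3, 4, 8             # symptoms, context bits, diseases, hidden width
W_sh, b_sh = init(m + l, hidden)         # shared trunk
W_pi, b_pi = init(hidden, m + n)         # policy branch: one logit per action
W_reb, b_reb = init(hidden, m)           # rebuilding branch: one probability per symptom

def forward(state, kappa):
    """Shared layers feed both the policy head and the feature-rebuilding head."""
    h = relu(linear(state + kappa, W_sh, b_sh))
    pi = softmax(linear(h, W_pi, b_pi))      # distribution over the m + n actions
    z = sigmoid(linear(h, W_reb, b_reb))     # rebuilt symptom probabilities z in (0, 1)^m
    return pi, z
```

Here `pi` is the action distribution of the policy branch and `z` is the rebuilt estimate of the full symptom profile x, which is scored against x by a binary cross-entropy rebuilding loss.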
We propose to consider the binary cross-entropy function [16] as the feature rebuilding loss:

    J_reb(x, z; θ) = − Σ_j Σ_i [ x_i^(j) log z_i^(j) + (1 − x_i^(j)) log(1 − z_i^(j)) ].    (2)

We propose to minimize the feature rebuilding loss together with maximizing the cumulative discounted reward in REINFORCE. In particular, consider an objective function

    J = J_pg(θ) − β J_reb(x, z; θ).

The first term, J_pg(θ) = v_{π_θ}(s_0), is the cumulative discounted reward from the initial state to the terminal state. The second term is the feature rebuilding loss (i.e., Equation 2), and β is a hyperparameter which controls the importance of the feature rebuilding task.

Optimizing the new objective function requires some simple changes to the original stochastic gradient ascent algorithm within REINFORCE. In particular, the policy gradient theorem already establishes that ∇J_pg(θ) ∝ E_π[ G_t ∇_θ π(A_t|S_t, θ) / π(A_t|S_t, θ) ]. Thus, the update rule for the new objective function is simply

    θ′ = θ + α [ G_t ∇_θ π(A_t|S_t, θ) / π(A_t|S_t, θ) + β ∇J_reb(x, z_t; θ) + η ∇H(π(A_t|S_t, θ)) ].

Note that we follow a popular practice of adding an additional entropy regularization term H, first proposed by [18], to help the agent escape from local optima at the beginning of the training process.

By additionally minimizing the feature rebuilding loss, we direct the agent to internally carry the knowledge towards good feature rebuilding (low rebuilding loss), and this knowledge can naturally guide a confident classification action more quickly.

Algorithm 1: REFUEL: Exploring sparse features using reward shaping and feature rebuilding

    Input:  A dataset D = {(x^(j), y^(j)) : x^(j) ∈ X, y^(j) ∈ Y}_{j=1}^{k}, where X = {0, 1}^m and
            Y = {1, . . . , n}. An action set A = {1, . . . , m} ∪ {m + 1, . . . , m + n}.
            N is the number of episodes. T is the maximum number of steps of an episode.
    Output: Agent's parameters θ

     1  begin
     2      Initialize parameters θ
     3      for l ← 0 to N do
     4          t ← 1
     5          (x, y) ∼ uniform(D), x+ := {j : x_j = 1}, i ∼ uniform(x+)
     6          S_{t,j} ← 1 if j = i, 0 otherwise
     7          // start one sample episode
     8          repeat
     9              A_t ∼ π(A_t | S_t)
    10              if A_t ≤ m and t < T then
    11                  S_{t+1,j} ← (2x_j − 1 if j = A_t, S_{t,j} otherwise);  R_{t+1} ← 0
    12              else if A_t ≤ m and t ≥ T then
    13                  S_{t+1} ← S_⊥;  R_{t+1} ← −1
    14              else
    15                  S_{t+1} ← S_⊥;  R_{t+1} ← 2δ_{y+m,A_t} − 1
    16              end
    17              t ← t + 1
    18          until S_t = S_⊥
    19          // REFUEL: reward shaping and feature rebuilding
    20          φ(s) := λ × |{j : s_j = 1}|
    21          G_t ← 0;  Δθ ← 0;  t ← t − 1
    22          repeat
    23              G_t ← R_{t+1} + (γφ(S_{t+1}) − φ(S_t)) + γG_{t+1}
    24              Δθ ← Δθ + G_t ∇_θπ(A_t|S_t, θ)/π(A_t|S_t, θ) + β∇J_reb(x, z_t; θ) + η∇H(π(A_t|S_t, θ))
    25              t ← t − 1
    26          until t = 0
    27          θ ← θ + αΔθ
    28      end
    29      return θ
    30  end

3.3 The REFUEL Algorithm

Combining the techniques of reward shaping and feature rebuilding, we propose the REFUEL algorithm to solve the disease diagnosis problem, as shown in Algorithm 1. The inputs to the algorithm are a training dataset D, an action set A, the number of episodes N, and the maximum number of steps of an episode T. The output of the algorithm is the trained model.

At the beginning of the training process, the algorithm initializes the model's parameters and enters a training loop. The sample model generates an initial state with a positive symptom to simulate a patient with an initial symptom (lines 5 - 6). Afterwards, the algorithm starts an episode. Within the episode, the agent selects an action based on the current policy π. Once the agent completes an action, the immediate reward R_t and the state S_t are stored in a buffer for later use by the reward shaping technique.

When the agent makes a decision, two types of actions can be performed. If the agent selects an acquisition action (A_t ≤ m), the sample model generates the next state, which consists of the symptoms from the previous state and one additional symptom acquired by the action, and a reward 0 for t < T (line 11), or the terminal state S_⊥ and a reward −1 for t ≥ T (line 13).
If the agent selects a classification action (A_t > m), it receives a reward 1 for a correct prediction; otherwise, it receives a reward −1. Subsequently, the agent reaches the terminal state S_⊥ and the episode terminates (line 15).

Once the episode terminates, the algorithm takes the immediate rewards R_t and the states S_t stored in the buffer to calculate shaped rewards that encourage the identification of positive symptoms (line 23). Then, the algorithm takes the rebuilt symptoms z_t to compute the feature rebuilding loss, and uses the update rule of the new objective function to update the model parameters (line 24).

4 Experiments

Owing to privacy laws (e.g., the Health Insurance Portability and Accountability Act, HIPAA), it is difficult to obtain real-world medical data at present. We instead follow our earlier work on the simulation procedure to generate artificial medical data [3]. The procedure maintains a probability distribution of a patient's symptoms given that the patient has a certain disease. The probability distribution is over 650 diseases and 376 symptoms. The simulation procedure first uniformly samples a disease from a disease set. Under the sampled disease, the procedure extracts the probabilities of symptoms given the sampled disease. Then, the procedure generates the symptoms of the sampled disease by performing a Bernoulli trial on each symptom. For example, if a disease called common cold is sampled by the procedure, the probabilities of cough and sore throat under common cold (which are 85% and 82%) can be extracted. Doing Bernoulli trials according to these probabilities then generates one instance.

In all of our experiments, we used the simulation procedure [3] to sample 10^6, 10^4, and 10^5 instances for training, validation, and testing, respectively. We used the same neural network architecture (Figure 1) and the same hyperparameters in all experiments.
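The simulation procedure described above can be sketched as follows (illustrative code of our own; only the common-cold cough and sore-throat probabilities come from the text, the remaining numbers are invented, and the real distribution covers 650 diseases and 376 symptoms):

```python
import random

# Pr(symptom | disease); the common-cold cough/sore-throat values are from the text,
# the other entries are made up for illustration.
SYMPTOM_PROBS = {
    "common_cold": {"cough": 0.85, "sore_throat": 0.82, "chest_pain": 0.05},
    "angina":      {"cough": 0.10, "sore_throat": 0.05, "chest_pain": 0.90},
}

def simulate_patient(rng):
    """Uniformly sample a disease, then run one Bernoulli trial per symptom."""
    disease = rng.choice(sorted(SYMPTOM_PROBS))
    probs = SYMPTOM_PROBS[disease]
    symptoms = {name: int(rng.random() < p) for name, p in probs.items()}
    return disease, symptoms

rng = random.Random(0)
dataset = [simulate_patient(rng) for _ in range(1000)]
```

Each generated instance pairs a disease label with a binary symptom vector; across many samples, roughly 85% of the simulated common-cold patients present cough.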
We used Adam [4] as our optimizer. The initial learning rate is 0.0001, and the batch size is 512. The coefficient of the entropy regularizer η is initially set to 0.01 and linearly decayed to 0 over time. The coefficient of the feature rebuilding loss β is 10. The discount-rate parameter γ is 0.99, and the hyperparameter λ in our potential function φ is 0.25. We report the sensitivity analyses of hyperparameters β and λ in the supplementary material.

In our implementation, we changed the auxiliary reward f in the terminal state. Specifically, when the agent encounters a terminal state S_⊥, it receives an auxiliary reward f(S, S_⊥) = γφ(S_⊥) instead of f(S, S_⊥) = γφ(S_⊥) − φ(S). We found that this deviation from the standard reward shaping technique resulted in better performance. Our hypothesis is that the modified auxiliary reward f can encourage the agent to query more positive symptoms, because our definition of φ actually counts the number of positive symptoms.

To evaluate the learning efficiency and accuracy of our REFUEL algorithm, we formed three different disease diagnosis tasks, which consist of the top-200, top-300, and top-400 diseases in terms of the number of occurrences in the Centers for Disease Control and Prevention (CDC) records. Figure 2 shows the experimental results. The baseline (blue line) is REINFORCE [17] without any enhancement; RESHAPE (yellow line) is the baseline plus the technique of reward shaping; REFUEL (red line) is the baseline plus both reward shaping and feature rebuilding. Each line in Figure 2 is the average training accuracy over 5 different random seeds. The shaded area represents the region of two standard deviations. Note that as shown in Figure 2, RESHAPE and REFUEL outperform the baseline, especially in the tasks of 300 and 400 diseases.
We can also see that REFUEL explores faster and predicts better than RESHAPE.

Figure 3 shows the comparison of cumulative discounted rewards received during the training process. Since the reward and accuracy are highly correlated, the trends of these two figures are similar. Also, we can see that in the case of 400 diseases, the baseline struggles to obtain positive rewards. Both RESHAPE and REFUEL obtain far higher rewards than the baseline. This suggests that our algorithm is effective in discovering key positive symptoms.

Since our reward setting imposes no penalty when the agent selects a repeated action, we are interested in seeing whether our algorithm can more robustly avoid repeated actions. Figure 4 shows the probability of choosing a repeated action in each step. The curves in Figure 4 are quite different from the curves in Figures 2 and 3. While the repeat rate of the baseline sharply increases at the end of the training process in the cases of 300 and 400 diseases, RESHAPE and REFUEL quickly converge towards 0.

[Figure 2: Experiments on 3 datasets of different disease numbers (panels: 200, 300, and 400 diseases). The curves show the training accuracy of three methods. REFUEL (red line) uses reward shaping and feature rebuilding; RESHAPE (yellow line) only uses reward shaping; Baseline (blue line) adopts neither. The solid line is the averaged result of 5 different random seeds. The shaded area represents two standard deviations.]

[Figure 3: Experiments on 3 datasets of different disease numbers (panels: 200, 300, and 400 diseases). Each plot shows the average cumulative discounted reward during training.]

[Figure 4: The probability of choosing a repeated action during training for each dataset (panels: 200, 300, and 400 diseases).]

After training, the models were selected according to the performance on the validation set.
The testing results are reported in Table 1. The table clearly shows that REFUEL outperforms the baseline in every setting. Note that the performance of the baseline in the case of 400 diseases is only slightly better than random guessing, whereas REFUEL obtains a top-5 accuracy of 67.09% on 400 diseases.
Table 2 shows the statistics of symptom acquisition. In each case, there are more than 300 possible symptoms, while the average number of positive symptoms per patient is only about 3. Furthermore, one positive symptom is given in the initial state, so the average maximum number of queriable positive symptoms is only about 2. Finding the positive symptoms is therefore extremely hard. As shown in Table 2, the baseline acquires only 0.01 positive symptoms on average in the case of 400 diseases. In contrast, our proposed REFUEL algorithm acquires 1.31 positive symptoms. This experiment thus indicates that our algorithm is beneficial for sparse feature exploration.

Table 1: The test accuracy of the baseline and REFUEL.

                     Baseline                          REFUEL
#Diseases   Top-1   Top-3   Top-5   #Steps    Top-1   Top-3   Top-5   #Steps
200         48.75   61.70   66.55   6.66      54.84   73.66   79.68   8.01
300         21.78   28.42   31.18   6.78      47.49   65.08   71.17   8.09
400          0.74    1.46    2.09   8.98      43.83   60.76   67.09   8.35

Table 2: The average number of positive symptoms that the agent successfully inquires.

#Diseases   #Possible Symptoms   #Symptoms / Patient   Baseline   REFUEL
200         300                  3.07                  0.56       1.40
300         344                  3.12                  0.33       1.37
400         354                  3.19                  0.01       1.31

Table 3: Comparison of REFUEL with the best prior work published in AAAI 2018 [3].

                 Hierarchical Model [3]                REFUEL
#Diseases   Top-1   Top-3   Top-5   #Steps    Top-1   Top-3   Top-5   #Steps
73          63.55   73.35   77.94   7.15      69.15   86.85   91.71   7.52
136         44.50   51.90   55.03   5.73      60.60   79.57   84.98   7.70
196         32.87   38.02   40.20   5.14      55.01   74.20   80.22   7.96
255         26.26   29.81   31.42   5.01      50.66   68.96   75.04   8.14

Next, we compare with prior work. Since a previous study [2] reported that the RL method readily outperforms tree-based methods, we put the experimental results of a tree-based method in the supplementary material. Here, we focus on the hierarchical model proposed in [3], which is the best prior work. That work defines four tasks with disease sizes 73, 136, 196, and 255. We train 5 models with 5 different random seeds for each task. The average test accuracies are reported in Table 3, which reveals that REFUEL significantly outperforms the hierarchical model: the top-1 accuracy of REFUEL is nearly twice that of the hierarchical model on 255 diseases. Moreover, REFUEL's top-3 and top-5 accuracies are significantly higher than its top-1 accuracy, which is particularly helpful when some diseases are difficult to detect based on symptoms alone.

5 Concluding Remarks

In this paper, we presented REFUEL for disease diagnosis. The contributions of REFUEL are twofold. First, we design an informative potential function to guide the agent to productively search the sparse feature space, and adopt a reward shaping technique to ensure policy invariance. 
Second, we design a component that rebuilds the feature space to provide the agent with a better representation. Together, the two techniques help identify symptom queries that yield positive patient responses with much higher probability, which in turn leads to faster exploration and higher diagnosis accuracy. The experimental results confirm the validity of our proposed method and show that it is a promising approach to fast and accurate disease diagnosis.
We also discovered that slightly modifying the auxiliary reward in the terminal state resulted in better performance. To the best of our knowledge, its theoretical justification is still open. We will further investigate this interesting finding in future work.
We have deployed REFUEL with our DeepQuest service in more than ten hospitals. In future work, we plan to enhance REFUEL to handle continuous features so that the algorithm can be applied to other domains whose features take continuous values.

Acknowledgments

The work was carried out during the first author's internship at HTC Research and became part of the first author's Master's thesis [10]. We thank Profs. Shou-De Lin, Yun-Nung Chen, Min Sun, and the anonymous reviewers for their valuable suggestions. We also thank the members of HTC Research: Jin-Fu Lin for support in operating GPU instances and Yang-En Chen for his efforts in organizing the experimental results. This work was partially supported by the Ministry of Science and Technology of Taiwan under MOST 107-2628-E-002-008-MY3.

References

[1] M. Grzes and D. Kudenko. Theoretical and empirical analysis of reward shaping in reinforcement learning. In International Conference on Machine Learning and Applications, ICMLA 2009, Miami Beach, Florida, USA, December 13-15, 2009, pages 337–344, 2009.

[2] J. Janisch, T. Pevný, and V. Lisý. 
Classification with costly features using deep reinforcement learning. CoRR, abs/1711.07364, 2017.

[3] H.-C. Kao, K.-F. Tang, and E. Y. Chang. Context-aware symptom checking for disease diagnosis using hierarchical reinforcement learning. In Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence, New Orleans, Louisiana, USA, February 2-7, 2018, 2018.

[4] D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. CoRR, abs/1412.6980, 2014.

[5] M. J. Kusner, W. Chen, Q. Zhou, Z. E. Xu, K. Q. Weinberger, and Y. Chen. Feature-cost sensitive learning with submodular trees of classifiers. In Proceedings of the Twenty-Eighth AAAI Conference on Artificial Intelligence, July 27-31, 2014, Québec City, Québec, Canada, pages 1939–1945, 2014.

[6] F. Nan and V. Saligrama. Adaptive classification for prediction under a budget. In Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, December 4-9, 2017, Long Beach, CA, USA, pages 4730–4740, 2017.

[7] F. Nan, J. Wang, and V. Saligrama. Feature-budgeted random forest. In Proceedings of the 32nd International Conference on Machine Learning, ICML 2015, Lille, France, July 6-11, 2015, pages 1983–1991, 2015.

[8] F. Nan, J. Wang, and V. Saligrama. Pruning random forests for prediction on a budget. In Advances in Neural Information Processing Systems 29: Annual Conference on Neural Information Processing Systems 2016, December 5-10, 2016, Barcelona, Spain, pages 2334–2342, 2016.

[9] A. Y. Ng, D. Harada, and S. J. Russell. Policy invariance under reward transformations: Theory and application to reward shaping. In Proceedings of the 16th International Conference on Machine Learning, ICML 1999, Bled, Slovenia, June 27-30, 1999, pages 278–287, 1999.

[10] Y.-S. Peng. Exploring sparse features in deep reinforcement learning towards fast disease diagnosis. Master's thesis, National Taiwan University, 2018.

[11] J. Randløv and P. Alstrøm. Learning to drive a bicycle using reinforcement learning and shaping. In Proceedings of the 15th International Conference on Machine Learning, ICML 1998, Madison, Wisconsin, USA, July 24-27, 1998, pages 463–471, 1998.

[12] H. L. Semigran, J. A. Linder, C. Gidengil, and A. Mehrotra. Evaluation of symptom checkers for self diagnosis and triage: audit study. BMJ, 351, 2015.

[13] R. Sutton and A. Barto. Reinforcement Learning: An Introduction. MIT Press, second edition, 2018.

[14] K.-F. Tang, H.-C. Kao, C.-N. Chou, and E. Y. Chang. Inquire and diagnose: Neural symptom checking ensemble using deep reinforcement learning. In NIPS Workshop on Deep Reinforcement Learning, 2016.

[15] H. van Hasselt, A. Guez, and D. Silver. Deep reinforcement learning with double Q-learning. In Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence, February 12-17, 2016, Phoenix, Arizona, USA, pages 2094–2100, 2016.

[16] P. Vincent, H. Larochelle, I. Lajoie, Y. Bengio, and P. Manzagol. Stacked denoising autoencoders: Learning useful representations in a deep network with a local denoising criterion. Journal of Machine Learning Research, 11:3371–3408, 2010.

[17] R. J. Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8(3):229–256, May 1992.

[18] R. J. Williams and J. Peng. Function optimization using connectionist reinforcement learning algorithms. Connection Science, 3(3):241–268, 1991.

[19] Z. E. Xu, M. J. Kusner, K. Q. Weinberger, and M. Chen. Cost-sensitive tree of classifiers. In Proceedings of the 30th International Conference on Machine Learning, ICML 2013, Atlanta, GA, USA, June 16-21, 2013, pages 133–141, 2013.