{"title": "Deep Knowledge Tracing", "book": "Advances in Neural Information Processing Systems", "page_first": 505, "page_last": 513, "abstract": "Knowledge tracing, where a machine models the knowledge of a student as they interact with coursework, is an established and significantly unsolved problem in computer supported education. In this paper we explore the benefit of using recurrent neural networks to model student learning. This family of models has important advantages over current state of the art methods in that it does not require the explicit encoding of human domain knowledge, and has a far more flexible functional form which can capture substantially more complex student interactions. We show that these neural networks outperform the current state of the art in prediction on real student data, while allowing straightforward interpretation and discovery of structure in the curriculum. These results suggest a promising new line of research for knowledge tracing.", "full_text": "Deep Knowledge Tracing\n\nChris Piech\u2217, Jonathan Bassen\u2217, Jonathan Huang\u2217\u2021, Surya Ganguli\u2217, Mehran Sahami\u2217, Leonidas Guibas\u2217, Jascha Sohl-Dickstein\u2217\u2020\n\n\u2217Stanford University, \u2020Khan Academy, \u2021Google\n\n{piech,jbassen}@cs.stanford.edu, jascha@stanford.edu,\n\nAbstract\n\nKnowledge tracing\u2014where a machine models the knowledge of a student as they interact with coursework\u2014is a well established problem in computer supported education. Though effectively modeling student knowledge would have high educational impact, the task has many inherent challenges. In this paper we explore the utility of using Recurrent Neural Networks (RNNs) to model student learning. The RNN family of models has important advantages over previous methods in that it does not require the explicit encoding of human domain knowledge, and can capture more complex representations of student knowledge. 
Using neu-\nral networks results in substantial improvements in prediction performance on a\nrange of knowledge tracing datasets. Moreover the learned model can be used for\nintelligent curriculum design and allows straightforward interpretation and dis-\ncovery of structure in student tasks. These results suggest a promising new line of\nresearch for knowledge tracing and an exemplary application task for RNNs.\n\n1\n\nIntroduction\n\nComputer-assisted education promises open access to world class instruction and a reduction in the\ngrowing cost of learning. We can develop on this promise by building models of large scale student\ntrace data on popular educational platforms such as Khan Academy, Coursera, and EdX.\nKnowledge tracing is the task of modelling student knowledge over time so that we can accurately\npredict how students will perform on future interactions. Improvement on this task means that re-\nsources can be suggested to students based on their individual needs, and content which is predicted\nto be too easy or too hard can be skipped or delayed. Already, hand-tuned intelligent tutoring sys-\ntems that attempt to tailor content show promising results [28]. One-on-one human tutoring can\nproduce learning gains for the average student on the order of two standard deviations [5] and ma-\nchine learning solutions could provide these bene\ufb01ts of high quality personalized teaching to anyone\nin the world for free. The knowledge tracing problem is inherently dif\ufb01cult as human learning is\ngrounded in the complexity of both the human brain and human knowledge. Thus, the use of rich\nmodels seems appropriate. However most previous work in education relies on \ufb01rst order Markov\nmodels with restricted functional forms.\nIn this paper we present a formulation that we call Deep Knowledge Tracing (DKT) in which we\napply \ufb02exible recurrent neural networks that are \u2018deep\u2019 in time to the task of knowledge tracing. 
This family of models represents latent knowledge state, along with its temporal dynamics, using large vectors of artificial \u2018neurons\u2019, and allows the latent variable representation of student knowledge to be learned from data rather than hard-coded. The main contributions of this work are:\n\n1. A novel way to encode student interactions as input to a recurrent neural network.\n2. A 25% gain in AUC over the best previous result on a knowledge tracing benchmark.\n3. Demonstration that our knowledge tracing model does not need expert annotations.\n4. Discovery of exercise influence and generation of improved exercise curricula.\n\nFigure 1: A single student and her predicted responses as she solves 50 Khan Academy exercises. She seems to master finding x and y intercepts and then has trouble transferring knowledge to graphing linear equations.\n\nThe task of knowledge tracing can be formalized as: given observations of interactions x0 . . . xt taken by a student on a particular learning task, predict aspects of their next interaction xt+1 [6]. In the most ubiquitous instantiation of knowledge tracing, interactions take the form of a tuple xt = {qt, at} that combines a tag for the exercise being answered, qt, with whether or not the exercise was answered correctly, at. When making a prediction, the model is provided the tag of the exercise being answered, qt, and must predict whether the student will get the exercise correct, at. Figure 1 shows a visualization of tracing knowledge for a single student learning 8th grade math. The student first answers two square root problems correctly and then gets a single x-intercept exercise incorrect. In the subsequent 47 interactions the student solves a series of x-intercept, y-intercept and graphing exercises. Each time the student answers an exercise we can make a prediction as to whether or not she would answer an exercise of each type correctly on her next interaction. 
In the visualization\nwe only show predictions over time for a relevant subset of exercise types. In most previous work,\nexercise tags denote the single \u201cconcept\u201d that human experts assign to an exercise. Our model\ncan leverage, but does not require, such expert annotation. We demonstrate that in the absence of\nannotations the model can autonomously learn content substructure.\n\n2 Related Work\n\nThe task of modelling and predicting how human beings learn is informed by \ufb01elds as diverse\nas education, psychology, neuroscience and cognitive science. From a social science perspective\nlearning has been understood to be in\ufb02uenced by complex macro level interactions including affect\n[21], motivation [10] and even identity [4]. The challenges present are further exposed on the micro\nlevel. Learning is fundamentally a re\ufb02ection of human cognition which is a highly complex process.\nTwo themes in the \ufb01eld of cognitive science that are particularly relevant are theories that the human\nmind, and its learning process, are recursive [12] and driven by analogy [13].\nThe problem of knowledge tracing was \ufb01rst posed, and has been heavily studied within the intelligent\ntutoring community. In the face of aforementioned challenges it has been a primary goal to build\nmodels which may not capture all cognitive processes, but are nevertheless useful.\n\n2.1 Bayesian Knowledge Tracing\n\nBayesian Knowledge Tracing (BKT) is the most popular approach for building temporal models\nof student learning. BKT models a learner\u2019s latent knowledge state as a set of binary variables,\neach of which represents understanding or non-understanding of a single concept [6]. A Hidden\nMarkov Model (HMM) is used to update the probabilities across each of these binary variables, as a\nlearner answers exercises of a given concept correctly or incorrectly. The original model formulation\nassumed that once a skill is learned it is never forgotten. 
Recent extensions to this model include contextualization of guessing and slipping estimates [7], estimating prior knowledge for individual learners [33], and estimating problem difficulty [23].\nWith or without such extensions, Knowledge Tracing suffers from several difficulties. First, the binary representation of student understanding may be unrealistic. Second, the meaning of the hidden variables and their mappings onto exercises can be ambiguous, rarely meeting the model\u2019s expectation of a single concept per exercise. Several techniques have been developed to create and refine concept categories and concept-exercise mappings. The current gold standard, Cognitive Task Analysis [31], is an arduous and iterative process where domain experts ask learners to talk through their thought processes while solving problems. Finally, the binary response data used to model transitions imposes a limit on the kinds of exercises that can be modeled.\n\n2.2 Other Dynamic Probabilistic Models\n\nPartially Observable Markov Decision Processes (POMDPs) have been used to model learner behavior over time, in cases where the learner follows an open-ended path to arrive at a solution [29]. Although POMDPs present an extremely flexible framework, they require exploration of an exponentially large state space. Current implementations are also restricted to a discrete state space, with hard-coded meanings for latent variables. This makes them intractable or inflexible in practice, though they have the potential to overcome both of those limitations.\nSimpler models from the Performance Factors Analysis (PFA) framework [24] and Learning Factors Analysis (LFA) framework [3] have shown predictive power comparable to BKT [14]. 
To obtain\nbetter predictive results than with any one model alone, various ensemble methods have been used\nto combine BKT and PFA [8]. Model combinations supported by AdaBoost, Random Forest, linear\nregression, logistic regression and a feed-forward neural network were all shown to deliver superior\nresults to BKT and PFA on their own. But because of the learner models they rely on, these ensemble\ntechniques grapple with the same limitations, including a requirement for accurate concept labeling.\nRecent work has explored combining Item Response Theory (IRT) models with switched nonlinear\nKalman \ufb01lters [20], as well as with Knowledge Tracing [19, 18]. Though these approaches are\npromising, at present they are both more restricted in functional form and more expensive (due to\ninference of latent variables) than the method we present here.\n\n2.3 Recurrent Neural Networks\n\nRecurrent neural networks are a family of \ufb02exible dynamic models which connect arti\ufb01cial neurons\nover time. The propagation of information is recursive in that hidden neurons evolve based on both\nthe input to the system and on their previous activation [32]. In contrast to hidden Markov models\nas they appear in education, which are also dynamic, RNNs have a high dimensional, continuous,\nrepresentation of latent state. A notable advantage of the richer representation of RNNs is their abil-\nity to use information from an input in a prediction at a much later point in time. This is especially\ntrue for Long Short Term Memory (LSTM) networks\u2014a popular type of RNN [16].\nRecurrent neural networks are competitive or state-of-the-art for several time series tasks\u2013for in-\nstance, speech to text [15], translation [22], and image captioning [17]\u2013where large amounts of\ntraining data are available. 
These results suggest that we could be much more successful at tracing student knowledge if we formulated the task as a new application of temporal neural networks.\n\n3 Deep Knowledge Tracing\n\nWe believe that human learning is governed by many diverse properties \u2013 of the material, the context, the timecourse of presentation, and the individual involved \u2013 many of which are difficult to quantify relying only on first principles to assign attributes to exercises or structure a graphical model. Here we will apply two different types of RNNs \u2013 a vanilla RNN model with sigmoid units and a Long Short Term Memory (LSTM) model \u2013 to the problem of predicting student responses to exercises based upon their past activity.\n\n3.1 Model\n\nTraditional Recurrent Neural Networks (RNNs) map an input sequence of vectors x1, . . . , xT , to an output sequence of vectors y1, . . . , yT . This is achieved by computing a sequence of \u2018hidden\u2019 states h1, . . . , hT which can be viewed as successive encodings of relevant information from past observations that will be useful for future predictions. See Figure 2 for a cartoon illustration. The variables are related using a simple network defined by the equations:\n\nht = tanh(Whx xt + Whh ht\u22121 + bh),   (1)\nyt = \u03c3(Wyh ht + by),   (2)\n\nFigure 2: The connection between variables in a simple recurrent neural network. The inputs (xt) to the dynamic network are either one-hot encodings or compressed representations of a student action, and the prediction (yt) is a vector representing the probability of getting each of the dataset exercises correct.\n\nwhere both tanh and the sigmoid function, \u03c3 (\u00b7), are applied elementwise. The model is parameterized by an input weight matrix Whx, recurrent weight matrix Whh, initial state h0, and readout weight matrix Wyh. 
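As a concrete illustration (not from the original paper), Equations (1) and (2), together with the bias terms bh and by, can be sketched in NumPy; the dimensions and weight initialization below are illustrative assumptions, not the paper's training setup:

```python
import numpy as np

def rnn_step(x_t, h_prev, W_hx, W_hh, W_yh, b_h, b_y):
    # Equation (1): hidden state update with elementwise tanh
    h_t = np.tanh(W_hx @ x_t + W_hh @ h_prev + b_h)
    # Equation (2): elementwise sigmoid readout over all exercises
    y_t = 1.0 / (1.0 + np.exp(-(W_yh @ h_t + b_y)))
    return h_t, y_t

# Toy dimensions: M = 5 exercises (input size 2M), hidden size 200 as in the paper.
rng = np.random.default_rng(0)
M, H = 5, 200
W_hx = rng.normal(0.0, 0.1, (H, 2 * M))
W_hh = rng.normal(0.0, 0.1, (H, H))
W_yh = rng.normal(0.0, 0.1, (M, H))
b_h, b_y = np.zeros(H), np.zeros(M)

h = np.zeros(H)                   # initial state h_0
x = np.zeros(2 * M); x[3] = 1.0   # one-hot encoded student interaction
h, y = rnn_step(x, h, W_hx, W_hh, W_yh, b_h, b_y)
```

Each entry of y lies in (0, 1) and is read as the predicted probability of answering the corresponding exercise correctly.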
Biases for latent and readout units are given by bh and by.\nLong Short Term Memory (LSTM) networks [16] are a more complex variant of RNNs that often prove more powerful. In LSTMs latent units retain their values until explicitly cleared by the action of a \u2018forget gate\u2019. They thus more naturally retain information for many time steps, which is believed to make them easier to train. Additionally, hidden units are updated using multiplicative interactions, and they can thus perform more complicated transformations for the same number of latent units. The update equations for an LSTM are significantly more complicated than for an RNN, and can be found in Appendix A.\n\n3.2 Input and Output Time Series\n\nIn order to train an RNN or LSTM on student interactions, it is necessary to convert those interactions into a sequence of fixed length input vectors xt. We do this using two methods depending on the nature of those interactions:\nFor datasets with a small number M of unique exercises, we set xt to be a one-hot encoding of the student interaction tuple {qt, at} that represents the combination of which exercise was answered and if the exercise was answered correctly, so xt \u2208 {0, 1}2M . We found that having separate representations for qt and at degraded performance.\nFor large feature spaces, a one-hot encoding can quickly become impractically large. For datasets with a large number of unique exercises, we therefore instead assign a random vector nq,a \u223c N (0, I) to each input tuple, where nq,a \u2208 RN , and N \u226a M. We then set each input vector xt to the corresponding random vector, xt = nqt,at. This random low-dimensional representation of a one-hot high-dimensional vector is motivated by compressed sensing. Compressed sensing states that a k-sparse signal in d dimensions can be recovered exactly from k log d random linear projections (up to scaling and additive constants) [2]. 
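The two input encodings described above can be sketched as follows; the slot layout of the one-hot vector is an assumption for illustration (the paper does not specify an ordering):

```python
import numpy as np

def one_hot_interaction(q, a, M):
    # One-hot encoding of the tuple {q_t, a_t} in {0, 1}^{2M}.
    # Assumed layout: first M slots for incorrect answers, next M for correct.
    x = np.zeros(2 * M)
    x[q + a * M] = 1.0
    return x

def random_input_vectors(M, N, seed=0):
    # Fixed random Gaussian vectors n_{q,a} ~ N(0, I) in R^N, one per tuple,
    # for datasets where 2M is impractically large (N much smaller than M).
    rng = np.random.default_rng(seed)
    return rng.standard_normal((2 * M, N))

M, N = 1000, 100
x_small = one_hot_interaction(q=5, a=1, M=M)   # exercise 5, answered correctly
embed = random_input_vectors(M, N)
x_large = embed[5 + 1 * M]                     # compressed vector for the same tuple
```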
Since a one-hot encoding is a 1-sparse signal, the student interaction tuple can be exactly encoded by assigning it to a fixed random Gaussian input vector of length \u223c log 2M. Although the current paper deals only with 1-hot vectors, this technique can be extended easily to capture aspects of more complex student interactions in a fixed length vector.\nThe output yt is a vector of length equal to the number of problems, where each entry represents the predicted probability that the student would answer that particular problem correctly. Thus the prediction of at+1 can then be read from the entry in yt corresponding to qt+1.\n\n3.3 Optimization\n\nThe training objective is the negative log likelihood of the observed sequence of student responses under the model. Let \u03b4(qt+1) be the one-hot encoding of which exercise is answered at time t + 1, and let \u2113 be binary cross entropy. The loss for a given prediction is \u2113(yt \u00b7 \u03b4(qt+1), at+1), and the loss for a single student is:\n\nL = \u2211t \u2113(yt \u00b7 \u03b4(qt+1), at+1)   (3)\n\nThis objective was minimized using stochastic gradient descent on minibatches. To prevent overfitting during training, dropout was applied to ht when computing the readout yt, but not when computing the next hidden state ht+1. We prevent gradients from \u2018exploding\u2019 as we backpropagate through time by truncating the length of gradients whose norm is above a threshold. For all models in this paper we consistently used hidden dimensionality of 200 and a mini-batch size of 100. To facilitate research in DKTs we have published our code and relevant preprocessed data1.\n\n4 Educational Applications\n\nThe training objective for knowledge tracing is to predict a student\u2019s future performance based on their past activity. 
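Returning to the per-student loss of Equation (3), a minimal sketch (the epsilon clamp is an implementation detail we add for numerical safety, not part of the paper's formulation):

```python
import numpy as np

def student_loss(y_seq, q_seq, a_seq):
    # Equation (3): sum over t of the binary cross entropy between the
    # prediction selected by delta(q_{t+1}) and the observed answer a_{t+1}.
    eps = 1e-12  # numerical safety only
    total = 0.0
    for t in range(len(q_seq) - 1):
        p = y_seq[t][q_seq[t + 1]]   # y_t . delta(q_{t+1})
        a = a_seq[t + 1]
        total -= a * np.log(p + eps) + (1 - a) * np.log(1 - p + eps)
    return total

# Two timesteps, two exercises: the model assigns P(correct) = 0.2 to the
# exercise the student answers (correctly) next, so the loss is -log(0.2).
loss = student_loss([np.array([0.9, 0.2]), np.array([0.8, 0.6])],
                    q_seq=[0, 1], a_seq=[1, 1])
```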
This is directly useful \u2013 for instance formal testing is no longer necessary if a student\u2019s ability undergoes continuous assessment. As explored experimentally in Section 6, the DKT model can also power a number of other advancements.\n\n4.1 Improving Curricula\n\nOne of the biggest potential impacts of our model is in choosing the best sequence of learning items to present to a student. Given a student with an estimated hidden knowledge state, we can query our RNN to calculate what their expected knowledge state would be if we were to assign them a particular exercise. For instance, in Figure 1 after the student has answered 50 exercises we can test every possible next exercise we could show her and compute her expected knowledge state given that choice. The predicted optimal next problem for this student is to revisit solving for the y-intercept.\nWe use a trained DKT to test two classic curricula rules from education literature: mixing, where exercises from different topics are intermixed, and blocking, where students answer series of exercises of the same type [30]. Since choosing the entire sequence of next exercises so as to maximize predicted accuracy can be phrased as a Markov decision problem we can also evaluate the benefits of using the expectimax algorithm (see Appendix) to choose an optimal sequence of problems.\n\n4.2 Discovering Exercise Relationships\n\nThe DKT model can further be applied to the task of discovering latent structure or concepts in the data, a task that is typically performed by human experts. We approached this problem by assigning an influence Jij to every directed pair of exercises i and j,\n\nJij = y(j|i) / \u2211k y(j|k),   (4)\n\nwhere y(j|i) is the correctness probability assigned by the RNN to exercise j on the second timestep, given that a student answered exercise i correctly on the first. 
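Equation (4) amounts to normalizing each column of the matrix of conditional correctness probabilities; a small sketch with hypothetical numbers:

```python
import numpy as np

def influence(y_cond):
    # Equation (4): J[i, j] = y(j|i) / sum_k y(j|k), where y_cond[i, j] is the
    # RNN's correctness probability for exercise j at the second timestep,
    # given that exercise i was answered correctly at the first.
    return y_cond / y_cond.sum(axis=0, keepdims=True)

# Hypothetical 3-exercise example: answering exercise 0 strongly raises the
# predicted success on exercise 1, suggesting 0 acts as a prerequisite of 1.
y_cond = np.array([[0.9, 0.8, 0.3],
                   [0.5, 0.9, 0.3],
                   [0.5, 0.2, 0.9]])
J = influence(y_cond)
```

Each column of J sums to one, so J[i, j] measures the relative share of exercise j's predicted success attributable to having answered i.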
We show that this characterization of the dependencies captured by the RNN recovers the pre-requisites associated with exercises.\n\n5 Datasets\n\nWe test the ability to predict student performance on three datasets: simulated data, Khan Academy data, and the Assistments benchmark dataset. On each dataset we measure area under the curve (AUC). For the non-simulated data we evaluate our results using 5-fold cross validation and in all cases hyper-parameters are learned on training data. We compare the results of Deep Knowledge Tracing to standard BKT and, when possible, to optimal variations of BKT. Additionally we compare our results to predictions made by simply calculating the marginal probability of a student getting a particular exercise correct.\n\n1 https://github.com/chrispiech/DeepKnowledgeTracing\n\nDataset        Students   Exercise Tags   Answers     AUC: Marginal   BKT    BKT*   DKT\nSimulated-5       4,000        50           200 K           0.64      0.54    -     0.75\nKhan Math        47,495        69         1,435 K           0.63      0.68    -     0.85\nAssistments      15,931       124           526 K           0.62      0.67   0.69   0.86\n\nTable 1: AUC results for all datasets tested. BKT is the standard BKT. BKT* is the best reported result from the literature for Assistments. DKT is the result of using LSTM Deep Knowledge Tracing.\n\nSimulated Data: We simulate virtual students learning virtual concepts and test how well we can predict responses in this controlled setting. For each run of this experiment we generate two thousand students who answer 50 exercises drawn from k \u2208 1 . . . 5 concepts. For this dataset only, all students answer the same sequence of 50 exercises. Each student has a latent knowledge state \u201cskill\u201d for each concept, and each exercise has both a single concept and a difficulty. 
The probability of a student getting an exercise with difficulty \u03b2 correct, if the student has concept skill \u03b1, is modelled using classic Item Response Theory [9] as: p(correct | \u03b1, \u03b2) = c + (1 \u2212 c) / (1 + e^(\u03b2 \u2212 \u03b1)), where c is the probability of a random guess (set to be 0.25). Students \u201clearn\u201d over time via an increase to the concept skill which corresponded to the exercise they answered. To understand how the different models can incorporate unlabelled data, we do not provide models with the hidden concept labels (instead the input is simply the exercise index and whether or not the exercise was answered correctly). We evaluate prediction performance on an additional two thousand simulated test students. For each number of concepts we repeat the experiment 20 times with different randomly generated data to evaluate accuracy mean and standard error.\nKhan Academy Data: We used a sample of anonymized student usage interactions from the eighth grade Common Core curriculum on Khan Academy. The dataset included 1.4 million exercises completed by 47,495 students across 69 different exercise types. It did not contain any personal information. Only the researchers working on this paper had access to this anonymized dataset, and its use was governed by an agreement designed to protect student privacy in accordance with Khan Academy\u2019s privacy notice [1]. Khan Academy provides a particularly relevant source of learning data, since students often interact with the site for an extended period of time and for a variety of content, and because students are often self-directed in the topics they work on and in the trajectory they take through material.\nBenchmark Dataset: In order to understand how our model compared to other models we evaluated models on the Assistments 2009-2010 \u201cskill builder\u201d public benchmark dataset2. 
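The simulated students above follow the stated IRT response model with guessing; a minimal sketch (the specific skill and difficulty values are illustrative, not the paper's generator settings):

```python
import numpy as np

def p_correct(alpha, beta, c=0.25):
    # Classic IRT with a guess floor: p = c + (1 - c) / (1 + exp(beta - alpha)).
    # alpha: student's concept skill; beta: exercise difficulty; c: guess probability.
    return c + (1.0 - c) / (1.0 + np.exp(beta - alpha))

# When skill matches difficulty, the logistic term is 1/2, so the student sits
# halfway between guessing and mastery.
p_matched = p_correct(alpha=1.0, beta=1.0)   # 0.25 + 0.75 / 2 = 0.625
```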
Assistments is an online tutor that simultaneously teaches and assesses students in grade school mathematics. It is, to the best of our knowledge, the largest publicly available knowledge tracing dataset [11].\n\n6 Results\n\nOn all three datasets Deep Knowledge Tracing substantially outperformed previous methods. On the Khan dataset using an LSTM neural network model led to an AUC of 0.85 which was a notable improvement over the performance of a standard BKT (AUC = 0.68), especially when compared to the small improvement BKT provided over the marginal baseline (AUC = 0.63). See Table 1 and Figure 3(b). On the Assistments dataset DKT produced a 25% gain over the previous best reported result (AUC = 0.86 and 0.69 respectively) [23]. The gain we report in AUC compared to the marginal baseline (0.24) is more than triple the largest gain achieved on the dataset to date (0.07).\nThe prediction results from the synthetic dataset provide an interesting demonstration of the capacities of deep knowledge tracing. Both the LSTM and RNN models did as well at predicting student responses as an oracle which had perfect knowledge of all model parameters (and only had to fit the latent student knowledge variables). See Figure 3(a). In order to get accuracy on par with an oracle the models would have to mimic a function that incorporates: latent concepts, the difficulty of each exercise, the prior distributions of student knowledge and the increase in concept skill that happened after each exercise.\n\n2 https://sites.google.com/site/assistmentsdata/home/assistment-2009-2010-data\n\nFigure 3: Left: Prediction results for (a) simulated data and (b) Khan Academy data. Right: (c) Predicted knowledge on Assistments data for different exercise curricula. Error bars are standard error of the mean.\n\n
In contrast, the BKT prediction degraded substantially as the number of hidden concepts increased as it doesn\u2019t have a mechanism to learn unlabelled concepts.\nWe tested our ability to intelligently choose exercises on a subset of five concepts from the Assistments dataset. For each curriculum method, we used our DKT model to simulate how a student would answer questions and evaluate how much a student knew after 30 exercises. We repeated student simulations 500 times and measured the average predicted probability of a student getting future questions correct. In the Assistments context the blocking strategy had a notable advantage over mixing. See Figure 3(c). While blocking performs on par with solving expectimax one exercise deep (MDP-1), if we look further into the future when choosing the next problem we come up with curricula where students have higher predicted knowledge after solving fewer problems (MDP-8).\nThe prediction accuracy on the synthetic dataset suggests that it may be possible to use DKT models to extract the latent structure between the assessments in the dataset. The graph of our model\u2019s conditional influences for the synthetic dataset reveals a perfect clustering of the five latent concepts (see Figure 4), with directed edges set using the influence function in Equation 4. An interesting observation is that some of the exercises from the same concept occurred far apart in time. For example, in the synthetic dataset, where node numbers depict sequence, the 5th exercise was from hidden concept 1 and even though it wasn\u2019t until the 22nd problem that another problem from the same concept was asked, we were able to learn a strong conditional dependency between the two. We analyzed the Khan dataset using the same technique. The resulting graph is a compelling articulation of how the concepts in the 8th grade Common Core are related to each other (see Figure 4. 
Node numbers depict exercise tags). We restricted the analysis to ordered pairs of exercises {A, B} such that after A appeared, B appeared more than 1% of the time in the remainder of the sequence. To determine if the resulting conditional relationships are a product of obvious underlying trends in the data we compared our results to two baseline measures: (1) the transition probabilities of students answering B given they had just answered A and (2) the probability in the dataset (without using a DKT model) of answering B correctly given a student had earlier answered A correctly. Both baseline methods generated discordant graphs, which are shown in the Appendix. While many of the relationships we uncovered may be unsurprising to an education expert, their discovery is affirmation that the DKT network learned a coherent model.\n\n7 Discussion\n\nIn this paper we apply RNNs to the problem of knowledge tracing in education, showing improvement over prior state-of-the-art performance on the Assistments benchmark and Khan dataset. Two particularly interesting novel properties of our new model are that (1) it does not need expert annotations (it can learn concept patterns on its own) and (2) it can operate on any student input that can be vectorized. One disadvantage of RNNs over simple hidden Markov methods is that they require large amounts of training data, and so are well suited to an online education environment, but not a small classroom environment.\n\nFigure 4: Graphs of conditional influence between exercises in DKT models. 
Above: We observe a perfect\nclustering of latent concepts in the synthetic data. Below: A convincing depiction of how 8th grade math\nCommon Core exercises in\ufb02uence one another. Arrow size indicates connection strength. Note that nodes may\nbe connected in both directions. Edges with a magnitude smaller than 0.1 have been thresholded. Cluster\nlabels are added by hand, but are fully consistent with the exercises in each cluster.\n\nThe application of RNNs to knowledge tracing provides many directions for future research. Fur-\nther investigations could incorporate other features as inputs (such as time taken), explore other\neducational impacts (such as hint generation, dropout prediction), and validate hypotheses posed in\neducation literature (such as spaced repetition, modeling how students forget). Because DKTs take\nvector input it should be possible to track knowledge over more complex learning activities. An es-\npecially interesting extension is to trace student knowledge as they solve open-ended programming\ntasks [26, 27]. Using a recently developed method for vectorization of programs [25] we hope to be\nable to intelligently model student knowledge over time as they learn to program.\nIn an ongoing collaboration with Khan Academy, we plan to test the ef\ufb01cacy of DKT for curriculum\nplanning in a controlled experiment, by using it to propose exercises on the site.\n\nAcknowledgments\nMany thanks to John Mitchell for his guidance and Khan Academy for its support. Chris Piech is\nsupported by NSF-GRFP grant number DGE-114747. Surya Ganguli thanks the Burroughs Well-\ncome, James S. McDonnell, Sloan and McKnight foundations for support.\n\nReferences\n[1] Khan academy privacy notice https://www.khanacademy.org/about/privacy-policy, 2015.\n[2] BARANIUK, R. Compressive sensing. IEEE signal processing magazine 24, 4 (2007).\n[3] CEN, H., KOEDINGER, K., AND JUNKER, B. 
Learning factors analysis\u2013a general method for cognitive model evaluation and improvement. In Intelligent tutoring systems (2006), Springer, pp. 164\u2013175.\n\n[4] COHEN, G. L., AND GARCIA, J. Identity, belonging, and achievement: a model, interventions, implications. Current Directions in Psychological Science 17, 6 (2008), 365\u2013369.\n\n[5] CORBETT, A. Cognitive computer tutors: Solving the two-sigma problem. In User Modeling 2001. Springer, 2001, pp. 137\u2013147.\n\n[6] CORBETT, A. T., AND ANDERSON, J. R. Knowledge tracing: Modeling the acquisition of procedural knowledge. User modeling and user-adapted interaction 4, 4 (1994), 253\u2013278.\n\n[7] D BAKER, R. S. J., CORBETT, A. T., AND ALEVEN, V. More accurate student modeling through contextual estimation of slip and guess probabilities in bayesian knowledge tracing. In Intelligent Tutoring Systems (2008), Springer, pp. 406\u2013415.\n\n[Figure 4 legend: node numbers 1\u201369 correspond to Khan Academy 8th grade exercise tags (e.g. 1 Linear function intercepts, 22 Graphing linear equations, 65 Pythagorean theorem 1); hand-added cluster labels: Scatter plots, Pythagorean theorem, Line graphs, Systems of Equations, Lines, Functions, Exponents, Fractions, Angles; the simulated-data panel shows five clusters labeled Hidden concept 1\u20135.]\n\n[8] D BAKER, R. S. J., PARDOS, Z. A., GOWDA, S. M., NOORAEI, B. B., AND HEFFERNAN, N. T. Ensembling predictions of student knowledge within intelligent tutoring systems. In User Modeling, Adaption and Personalization. Springer, 2011, pp. 13\u201324.\n\n[9] DRASGOW, F., AND HULIN, C. L. Item response theory. 
Handbook of Industrial and Organizational Psychology 1 (1990), 577–636.
[10] ELLIOT, A. J., AND DWECK, C. S. Handbook of competence and motivation. Guilford Publications, 2013.
[11] FENG, M., HEFFERNAN, N., AND KOEDINGER, K. Addressing the assessment challenge with an online system that tutors as it assesses. User Modeling and User-Adapted Interaction 19, 3 (2009), 243–266.
[12] FITCH, W. T., HAUSER, M. D., AND CHOMSKY, N. The evolution of the language faculty: clarifications and implications. Cognition 97, 2 (2005), 179–210.
[13] GENTNER, D. Structure-mapping: A theoretical framework for analogy. Cognitive Science 7, 2.
[14] GONG, Y., BECK, J. E., AND HEFFERNAN, N. T. In Intelligent Tutoring Systems, Springer.
[15] GRAVES, A., MOHAMED, A.-R., AND HINTON, G. Speech recognition with deep recurrent neural networks. In Acoustics, Speech and Signal Processing (ICASSP), 2013 IEEE International Conference on (2013), IEEE, pp. 6645–6649.
[16] HOCHREITER, S., AND SCHMIDHUBER, J. Long short-term memory. Neural Computation 9, 8.
[17] KARPATHY, A., AND FEI-FEI, L. Deep visual-semantic alignments for generating image descriptions. arXiv preprint arXiv:1412.2306 (2014).
[18] KHAJAH, M., WING, R. M., LINDSEY, R. V., AND MOZER, M. C. Incorporating latent factors into knowledge tracing to predict individual differences in learning. Proceedings of the 7th International Conference on Educational Data Mining (2014).
[19] KHAJAH, M. M., HUANG, Y., GONZÁLEZ-BRENES, J. P., MOZER, M. C., AND BRUSILOVSKY, P. Integrating knowledge tracing and item response theory: A tale of two frameworks. Proceedings of the 4th International Workshop on Personalization Approaches in Learning Environments (2014).
[20] LAN, A. S., STUDER, C., AND BARANIUK, R. G. Time-varying learning and content analytics via sparse factor analysis.
In Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (2014), ACM, pp. 452–461.
[21] LINNENBRINK, E. A., AND PINTRICH, P. R. Role of affect in cognitive processing in academic contexts. Motivation, Emotion, and Cognition: Integrative Perspectives on Intellectual Functioning and Development (2004), 57–87.
[22] MIKOLOV, T., KARAFIÁT, M., BURGET, L., ČERNOCKÝ, J., AND KHUDANPUR, S. Recurrent neural network based language model. In INTERSPEECH 2010, 11th Annual Conference of the International Speech Communication Association, Makuhari, Chiba, Japan, 2010 (2010), pp. 1045–1048.
[23] PARDOS, Z. A., AND HEFFERNAN, N. T. KT-IDEM: Introducing item difficulty to the knowledge tracing model. In User Modeling, Adaption and Personalization. Springer, 2011, pp. 243–254.
[24] PAVLIK JR, P. I., CEN, H., AND KOEDINGER, K. R. Performance Factors Analysis–A New Alternative to Knowledge Tracing. Online Submission (2009).
[25] PIECH, C., HUANG, J., NGUYEN, A., PHULSUKSOMBATI, M., SAHAMI, M., AND GUIBAS, L. J. Learning program embeddings to propagate feedback on student code. CoRR abs/1505.05969 (2015).
[26] PIECH, C., SAHAMI, M., HUANG, J., AND GUIBAS, L. Autonomously generating hints by inferring problem solving policies. In Proceedings of the Second (2015) ACM Conference on Learning @ Scale (New York, NY, USA, 2015), L@S '15, ACM, pp. 195–204.
[27] PIECH, C., SAHAMI, M., KOLLER, D., COOPER, S., AND BLIKSTEIN, P. Modeling how students learn to program. In Proceedings of the 43rd ACM Symposium on Computer Science Education.
[28] POLSON, M. C., AND RICHARDSON, J. J. Foundations of intelligent tutoring systems. Psychology Press, 2013.
[29] RAFFERTY, A. N., BRUNSKILL, E., GRIFFITHS, T. L., AND SHAFTO, P. Faster teaching by POMDP planning. In Artificial Intelligence in Education (2011), Springer, pp. 280–287.
[30] ROHRER, D.
The effects of spacing and mixing practice problems. Journal for Research in Mathematics Education (2009), 4–17.
[31] SCHRAAGEN, J. M., CHIPMAN, S. F., AND SHALIN, V. L. Cognitive task analysis. Psychology Press, 2000.
[32] WILLIAMS, R. J., AND ZIPSER, D. A learning algorithm for continually running fully recurrent neural networks. Neural Computation 1, 2 (1989), 270–280.
[33] YUDELSON, M. V., KOEDINGER, K. R., AND GORDON, G. J. Individualized Bayesian knowledge tracing models. In Artificial Intelligence in Education (2013), Springer, pp. 171–180.