{"title": "Learning to Teach with Dynamic Loss Functions", "book": "Advances in Neural Information Processing Systems", "page_first": 6466, "page_last": 6477, "abstract": "Teaching is critical to human society: it is with teaching that prospective students are educated and human civilization can be inherited and advanced. A good teacher not only provides his/her students with qualified teaching materials (e.g., textbooks), but also sets up appropriate learning objectives (e.g., course projects and exams) considering different situations of a student. When it comes to artificial intelligence, treating machine learning models as students, the loss functions that are optimized act as perfect counterparts of the learning objective set by the teacher. In this work, we explore the possibility of imitating human teaching behaviors by dynamically and automatically outputting appropriate loss functions to train machine learning models. Different from typical learning settings in which the loss function of a machine learning model is predefined and fixed, in our framework, the loss function of a machine learning model (we call it student) is defined by another machine learning model (we call it teacher). The ultimate goal of teacher model is cultivating the student to have better performance measured on development dataset. Towards that end, similar to human teaching, the teacher, a parametric model, dynamically outputs different loss functions that will be used and optimized by its student model at different training stages. We develop an efficient learning method for the teacher model that makes gradient based optimization possible, exempt of the ineffective solutions such as policy optimization. We name our method as ``learning to teach with dynamic loss functions'' (L2T-DLF for short). 
Extensive experiments on real world tasks including image classification and neural machine translation demonstrate that our method significantly improves the quality of various student models.", "full_text": "Learning to Teach with Dynamic Loss Functions\n\n1Lijun Wu\u2020\u2217, 2Fei Tian\u2020, 2Yingce Xia, 3Yang Fan(cid:63),\n\n2Tao Qin, 1Jianhuang Lai, 2Tie-Yan Liu\n\n1Sun Yat-sen University, Guangzhou, China\n\n2Microsoft Research, Beijing, China\n\n3University of Science and Technology of China, Hefei, China\n\n1wulijun3@mail2.sysu.edu.cn, stsljh@mail.sysu.edu.cn\n\n2{fetia, yingce.xia, taoqin, tie-yan.liu}@microsoft.com, 3fyabc@mail.ustc.edu.cn\n\nAbstract\n\nTeaching is critical to human society: it is with teaching that prospective students\nare educated and human civilization can be inherited and advanced. A good\nteacher not only provides his/her students with quali\ufb01ed teaching materials (e.g.,\ntextbooks), but also sets up appropriate learning objectives (e.g., course projects\nand exams) considering different situations of a student. When it comes to arti\ufb01cial\nintelligence, treating machine learning models as students, the loss functions that\nare optimized act as perfect counterparts of the learning objective set by the teacher.\nIn this work, we explore the possibility of imitating human teaching behaviors\nby dynamically and automatically outputting appropriate loss functions to train\nmachine learning models. Different from typical learning settings in which the loss\nfunction of a machine learning model is prede\ufb01ned and \ufb01xed, in our framework, the\nloss function of a machine learning model (we call it student) is de\ufb01ned by another\nmachine learning model (we call it teacher). The ultimate goal of teacher model\nis cultivating the student to have better performance measured on development\ndataset. 
Towards that end, similar to human teaching, the teacher, a parametric\nmodel, dynamically outputs different loss functions that will be used and optimized\nby its student model at different training stages. We develop an ef\ufb01cient learning\nmethod for the teacher model that makes gradient based optimization possible,\nexempt of the ineffective solutions such as policy optimization. We name our\nmethod as \u201clearning to teach with dynamic loss functions\u201d (L2T-DLF for short).\nExtensive experiments on real world tasks including image classi\ufb01cation and neural\nmachine translation demonstrate that our method signi\ufb01cantly improves the quality\nof various student models.\n\n1\n\nIntroduction\n\nTeaching, which aims to help students learn new knowledge or skills effectively and ef\ufb01ciently, is\nimportant to advance modern human civilization. In human society, the rapid growth of quali\ufb01ed\nstudents not only relies on their intrinsic learning capability, but also, even more importantly, relies on\nthe substantial guidance from their teachers. The duties of teachers cover a wide spectrum: de\ufb01ning\nthe scope of learning (e.g., the knowledge and skills that we expect students to demonstrate by the end\nof a course), choosing appropriate instructional materials (e.g., textbooks), and assessing the progress\nof students (e.g., through course projects or exams). Effective teaching involves progressively and\ndynamically re\ufb01ning the teaching strategy based on re\ufb02ection and feedback from students.\nRecently, the concept of teaching has been introduced into arti\ufb01cial intelligence (AI), so as to improve\nthe learning process of a machine learning model. 
Currently, teaching in AI mainly focuses on\ntraining data selection. For example, machine teaching [56, 34, 35] aims at identifying the smallest\ntraining data that is capable of producing the optimal learner models. The very recent work, learning\nto teach (L2T for short) [13], demonstrates how to automatically design teacher models for better\nmachine learning process. While conceptually L2T can cover different aspects of teaching in AI, [13]\nonly studies the problem of training data teaching.\n\n\u2217The work was done when the first and fourth authors were interns at Microsoft Research Asia.\n\u2020The first two authors contribute equally to this work.\n\n32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montr\u00e9al, Canada.\n\nIn this work, inspired by learning to teach, we study loss function teaching in a formal and concrete\nmanner for the first time. The main motivation of our work is a natural observation on the analogy\nbetween loss functions in machine learning and exams in educating human students: appropriate\nexams reflect the progress of students and urge them to make improvements accordingly, while loss\nvalues outputted by the loss function evaluate the performance of the current machine learning model\nand set the optimization direction for the model parameters.\nIn our loss function teaching framework, a teacher model plays the role of outputting loss functions\nfor the student model (i.e., the daily machine learning model to solve a task) to minimize. Inspired\nby human teaching, we design the teacher model according to the following principles. First,\nsimilar to the different difficulty levels of exams with respect to the progress of students in human\neducation, the loss function set by the teacher model should be dynamic, i.e., the loss functions\nshould be adaptive to different phases of the training process of the student model. 
To achieve\nthis, we require our teacher model to take the status of student model into consideration in setting\nthe loss functions, and to dynamically change the loss functions with respect to the growth of the\nstudent model. Such process is shown in Fig. 1. Second, the teacher model should be able to make\nself-improvement, just as a human teacher can accumulate more knowledge and improve his/her\nteaching skills through more teaching practices. To achieve that, we assume the loss function takes\nthe form of neural network whose coef\ufb01cients are determined via a parametric teacher model, which\nis also a neural network. The parameters of the teacher model can be automatically optimized in\nthe teaching process. Through optimization, the teacher keeps improving its teaching model and\nconsequently the quality of loss functions it outputs. We name our method as learning to teach with\ndynamic loss functions (L2T-DLF).\nThe eventual goal of the teacher model is that its output can\nserve as the loss function of the student model to maximize\nthe long-term performance of the student, measured via a task-\nspeci\ufb01c objective such as 0-1 accuracy in classi\ufb01cation and\nBLEU score in sequence prediction [41], on a stand-alone de-\nvelopment dataset. Learning a good teaching model is not triv-\nial, since on the one hand the task-speci\ufb01c objective is usually\nnon-smooth w.r.t. student model outputs, and on the other hand\nthe \ufb01nal evaluation of the student model is incurred on the dev\nset, disjoint with the training dataset where the teaching process\nactually happens. We design an ef\ufb01cient gradient based opti-\nmization algorithm to optimize teacher models. Speci\ufb01cally, to\ntackle the \ufb01rst challenge, we smooth the task-speci\ufb01c measure\nto its expected version where the expectation is taken on the\ndirect output of student model. 
To address the second challenge,\ninspired by Reverse-Mode Differentiation (RMD) [6, 7, 38],\nthrough reversing the stochastic gradient descent training\nprocess of the student model, we obtain derivatives of the\nparameters of the teacher model via chaining backwards the error\nsignals incurred on the development dataset.\nWe demonstrate the effectiveness of L2T-DLF on various\nreal-world tasks including image classification and neural machine\ntranslation with different student models such as multi-layer\nperceptron networks, convolutional neural networks and sequence-to-sequence models with attention.\nThe improvements clearly demonstrate the effectiveness of the new loss function learnt by L2T-DLF.\n\nFigure 1: The student model is trained via minimizing the dynamic loss functions taught by the teacher\nmodel (yellow curve). The bottom black plane represents the parameter space of student model and the\nfour colored mesh surfaces denote different loss functions outputted via teacher model at different phases\nof student model training.\n\n2 Related Work\n\nThe study of teaching for AI, inspired by human teaching process, has a long history [1, 17]. The\nmost recent efforts of teaching mainly focus on the level of training data selection. For example,\nthe machine teaching [56, 34, 35] literature aims at building the smallest training set to obtain a\npre-given optimal student model. A teaching strategy is designed in [18, 19] to iteratively select\nunlabeled data to label within the context of multi label propagation, in a similar manner to\ncurriculum learning [8, 27]. 
Furthermore there are research on pedagogical teaching inspired from\ncognitive science [44, 23, 39] in which a teacher module is responsible for providing informative\nexamples to the learner for the sake of understanding a concept rapidly.\nThe recent work learning to teach (L2T) [13] offers a more comprehensive view of teaching for AI,\nincluding training data teaching, loss function teaching and hypothesis space teaching. Furthermore,\nL2T breaks the strong assumption towards the existence of an optimal off-the-shelf student model\nadopted by previous machine teaching literature [56, 35]. Our work belongs to the general framework\nof L2T, with a particular focus on a thorough landscape of loss function teaching, including the\ndetailed problem setup and ef\ufb01cient solution for dynamically setting loss functions for training\nmachine learning models.\nOur work, and the more general L2T, leverages automatic techniques to bypass human prior knowl-\nedge as much as possible, which is in line with the principles of learning to learn and meta learn-\ning [43, 50, 2, 57, 37, 29, 10, 14]. What makes our work different with others, from the technical\npoint of view, is that: 1) we leverage gradient based optimization method rather than reinforce-\nment learning [57, 13]; 2) we need to handle the dif\ufb01culty when the error information cannot be\ndirectly back propagated from the loss function, since we aim at discovering the best loss function\nfor the machine learning models. 
We design an algorithm based on Reverse-Mode Differentiation\n(RMD) [7, 38, 15] to tackle such a difficulty.\nSpecially designed loss functions play important roles in boosting the performance of real-world\ntasks, either by approximating the non-smooth task-specific objective such as 0-1 accuracy in\nclassification [40], NDCG in ranking [49], BLEU in machine translation [45, 3] and MAP in object\ndetection [22, 46], or by easing the optimization process of the student model, such as overcoming the\ndifficulty brought by data imbalance [30, 32] and numerous local optima [20]. L2T-DLF differs from\nprior works in that: 1) the loss functions are automatically learned, covering a large space and without\nthe demand of heuristic understanding of the task-specific objective and optimization process; 2) the\nloss function dynamically evolves during the training process, leading to a more coherent interaction\nbetween loss and student model.\n\n3 Model\n\nIn this section, we introduce the details of L2T-DLF, including the student model and the teacher\nmodel, as well as the training strategy for optimizing the teacher model.\n\n3.1 Student Model\n\nFor a task of interest, we denote its input space and output space respectively as X and Y. The student\nmodel for this task is then denoted as $f_\\omega : X \\to Y$, with $\\omega$ as its weight parameters. The training of\nstudent model $f_\\omega$ is an optimization process that discovers a good weight parameter $\\omega^*$ within a\nhypothesis space $\\Omega$, by minimizing a loss function $l$ on the training data $D_{train}$ containing $M$ data points,\n$D_{train} = \\{(x_i, y_i)\\}_{i=1}^{M}$. Specifically, $\\omega^*$ is obtained via solving\n$\\min_{\\omega \\in \\Omega} \\sum_{(x,y) \\in D_{train}} l(f_\\omega(x), y)$.\nFor the convenience of description, we define a new notation $L(f_\\omega, D) = \\sum_{(x,y) \\in D} l(f_\\omega(x), y)$,\nwhere $D$ is a dataset, and we will also refer to $L$ as the loss function when the context is clear. The\nlearnt student model $f_{\\omega^*}$ is then evaluated on a test data set $D_{test} = \\{(x_i, y_i)\\}_{i=1}^{N}$ to obtain a score\n$M(f_{\\omega^*}, D_{test}) = \\sum_{(x,y) \\in D_{test}} m(f_{\\omega^*}(x), y)$ as its performance. Here the task-specific objective\n$m(y_1, y_2)$ measures the similarity between two output candidates $y_1$ and $y_2$.\nThe loss function $l(\\hat{y}, y)$, taking the model prediction $\\hat{y} = f_\\omega(x)$ and ground-truth $y$ as inputs, acts as\nthe surrogate of $m$ to evaluate the student model $f_\\omega$ during its training process, just as the exams in\nreal-world human teaching. We assume $l(\\hat{y}, y)$ is a neural network with some coefficients $\\Phi$, denoted\nas $l_\\Phi(\\hat{y}, y)$. It can be a simple linear model, or a deep neural network (some concrete examples\nare provided in section 4.1 and section 4.2). With such a loss function $l_\\Phi(\\hat{y}, y)$ (and the induced\nnotation $L_\\Phi$), the student model gets sequentially updated via minimizing the output value of $l_\\Phi$ by,\nfor example, stochastic gradient descent (SGD):\n$\\omega_{t+1} = \\omega_t - \\eta_t \\frac{\\partial L_\\Phi(f_{\\omega_t}, D^t_{train})}{\\partial \\omega_t}$, $t = \\{1, 2, \\cdots, T\\}$,\nwhere $D^t_{train} \\subseteq D_{train}$, $\\omega_t$ and $\\eta_t$ are respectively the mini-batch training data, student model weight\nparameter and learning rate at the $t$-th timestep. For ease of statement we simply set $\\omega^* = \\omega_T$.\n\n3.2 Teacher Model\n\nA teacher model is responsible for setting the proper loss function l for the student model by outputting\nappropriate loss function coefficients \u03a6. To cater for different statuses of student model training, we\nask the teacher model to output different loss functions lt at each training step t. To achieve that, the\nstatus of a student model is represented by a state vector st at timestep t, which contains for example\nthe current training/dev accuracy and iteration number. 
The teacher model, denoted as \u00b5, then takes\nst as inputs to compute the coefficients of loss function \u03a6t at t-th timestep as \u03a6t = \u00b5\u03b8(st), where\n\u03b8 denotes the parameters of the teacher model. We further provide some examples of \u00b5\u03b8 in section 4.1\nand section 4.2. The actual loss function for student model is then lt = l\u03a6t. The learning process of\nstudent model then switches to:\n\n$\\omega_{t+1} = \\omega_t - \\eta_t \\frac{\\partial L_{\\mu_\\theta(s_t)}(f_{\\omega_t}, D^t_{train})}{\\partial \\omega_t} = \\omega_t - \\eta_t \\frac{\\partial L_{\\Phi_t}(f_{\\omega_t}, D^t_{train})}{\\partial \\omega_t}$.  (1)\n\nSuch a sequential procedure of obtaining f\u03c9\u2217 (i.e., f\u03c9T ) is the learning process of the student model\nwith training data Dtrain and loss function provided via the teacher model \u00b5\u03b8, and we use an abstract\noperator F to denote it: f\u03c9\u2217 = F(Dtrain, \u00b5\u03b8).\nJust as in the training and testing setup of typical machine learning scenarios, the teacher model here\nfollows a two-phase setup. Specifically, in the training process of the teacher model, just as\nqualified human teachers are good at improving the quality of exams, the teacher model in L2T-DLF\nrefines the loss function it sets up via optimizing its own \u03b8. The ultimate goal of teacher model is to\nmaximize the performance of induced student model on a stand-alone development dataset Ddev:\n\n$\\max_\\theta M(f_{\\omega^*}, D_{dev}) = \\max_\\theta M(F(D_{train}, \\mu_\\theta), D_{dev})$.  (2)\n\nWe introduce the detailed training process (i.e., how to efficiently optimize Eqn. (2)) in section 3.3.\nIn the testing process of the teacher model, \u03b8 is fixed and the student model f\u03c9 gets updated with the\nguidance of teacher model \u00b5\u03b8, as specified in Eqn. 
(1).\n\n3.3 Training Process of Teacher Model\n\nThere are two challenges in optimizing the teacher model: 1) the evaluation measure m is typically\nnon-smooth and non-differentiable w.r.t. the parameters of student model; 2) the error is incurred on the dev\nset while the teacher model takes effect in the training phase.\nWe use continuous relaxation of m to tackle the first challenge. The main idea is to inject randomness\ninto m to form an approximated version $\\tilde{m}$, where the randomness comes from the student\nmodel [49]. Thanks to the fact that quite a few student models output probabilistic distributions on\nY, the randomness naturally comes from the direct outputs of $f_\\omega$. Specifically, to approximate the\nperformance of $f_\\omega$ on a test data sample $(x, y)$, we have $\\tilde{m}(f_\\omega(x), y) = \\sum_{y^* \\in Y} m(y^*, y) p_\\omega(y^*|x)$,\nwhere $p_\\omega(y^*|x)$ is the probability of predicting $y^*$ given $x$ using $f_\\omega$. The gradient w.r.t. $\\omega$ is then\neasy to obtain via $\\frac{\\partial \\tilde{m}(f_\\omega(x), y)}{\\partial \\omega} = \\sum_{y^* \\in Y} m(y^*, y) \\frac{\\partial p_\\omega(y^*|x)}{\\partial \\omega}$. We further introduce a new notation\n$\\tilde{M}(f_\\omega, D_{dev}) = \\sum_{(x,y) \\in D_{dev}} \\tilde{m}(f_\\omega(x), y)$, which approximates the objective of the teacher model\n$M(f_{\\omega_T}, D_{dev})$.\nWe use Reverse-Mode Differentiation (RMD) [6, 7, 38] to fill in the gap between training data and\ndevelopment data. To better show the RMD process, we can view the sequential process in Eqn. (1)\nas a special feed-forward process of a deep neural network where each t corresponds to one layer,\nand RMD corresponds to the backpropagation process looping the SGD process backwards from T\nto 1. Specifically, denote $d\\theta$ as the gradient of $\\tilde{M}(f_{\\omega_T}, D_{dev})$ w.r.t. the teacher model parameters\n$\\theta$, which has initial value $d\\theta = 0$. On the dev dataset $D_{dev}$, the gradient of $\\tilde{M}(f_\\omega, D_{dev})$ w.r.t. the\nparameter of student model $\\omega_T$ is calculated as\n\n$d\\omega_T = \\frac{\\partial \\tilde{M}(f_{\\omega_T}, D_{dev})}{\\partial \\omega_T} = \\sum_{(x,y) \\in D_{dev}} \\frac{\\partial \\tilde{m}(f_{\\omega_T}(x), y)}{\\partial \\omega_T}$.  (3)\n\nThen looping backwards from T and corresponding to Eqn. (1), at each step $t = \\{T - 1, \\cdots, 1\\}$ we\nhave\n\n$d\\omega_t = \\frac{\\partial \\tilde{M}(f_{\\omega_t}, D_{dev})}{\\partial \\omega_t} = d\\omega_{t+1} - \\eta_t \\frac{\\partial^2 L_{\\mu_\\theta(s_t)}(f_{\\omega_t}, D^t_{train})}{\\partial \\omega_t^2} d\\omega_{t+1}$.  (4)\n\nAt the same time, the gradient of $\\tilde{M}$ w.r.t. $\\theta$ is accumulated at this time step as:\n\n$d\\theta = d\\theta - \\eta_t \\frac{\\partial^2 L_{\\mu_\\theta(s_t)}(f_{\\omega_t}, D^t_{train})}{\\partial \\theta \\partial \\omega_t} d\\omega_{t+1}$.  (5)\n\nWe leave the detailed derivations for Eqn. (4) and (5) to Appendix. Furthermore, it is worth noting that\nthe computing of $d\\omega_t$ and $d\\theta$ involves Hessian-vector products, which can be effectively computed via\n$\\frac{\\partial^2 g}{\\partial x \\partial y} v = \\partial(\\frac{\\partial g}{\\partial y} v) / \\partial x$, without explicitly calculating the Hessian matrix. Reverting backwards from\nt = T to t = 1, we obtain $d\\theta$ and then $\\theta$ is updated using any gradient based optimization algorithm\nsuch as momentum SGD, forming one step optimization for $\\theta$ which we call teacher optimization\nstep. By iterating teacher optimization steps we obtain the final teacher model. The details are listed\nin Algorithm 1.\n\nAlgorithm 1 Training Teacher Model \u00b5\u03b8\n\nInput: Continuous relaxation $\\tilde{m}$. Initial value of \u03b8.\nwhile Teacher model parameter \u03b8 not converged do    \u25b7 One teacher optimization step\n    Randomly initialize student model parameter \u03c90.\n    for each time step t = 0,\u00b7\u00b7\u00b7 , T \u2212 1 do    \u25b7 Teach student model\n        Conduct student model training step via Eqn. (1).\n    end for\n    d\u03b8 = 0. Compute d\u03c9T via Eqn. (3).\n    for each time step t = T \u2212 1,\u00b7\u00b7\u00b7 , 0 do    \u25b7 Reversely calculating the gradient d\u03b8\n        Update d\u03b8 as Eqn. (5).\n        Compute d\u03c9t as Eqn. (4).\n    end for\n    Update \u03b8 using d\u03b8 via gradient based optimization algorithm.\nend while\nOutput: the final teacher model \u00b5\u03b8.\n\n3.4 Discussion\n\nAnother possible way to conduct teacher model optimization is through deep reinforcement learning.\nBy treating the teacher model as a policy outputting continuous action (i.e., the loss function), one\ncan leverage a continuous control algorithm such as DDPG [31] to optimize the teacher model. However,\nreinforcement learning algorithms, including Q-learning based ones such as DDPG, are sample\ninefficient, probably requiring a huge amount of sampled trajectories to approximate the reward using\na critic network. Considering the training of student model is typically costly, we resort to gradient\nbased optimization algorithms instead.\nFurthermore, there are similarities between L2T-DLF and the actor-critic (AC) method [5, 48] in\nreinforcement learning (RL), in which a critic (corresponding to the parametric loss function) guides the\noptimization of an actor (corresponding to the student model). Apart from the difference in\napplication domain (supervised learning versus RL), there are differences between the design principles of\nL2T-DLF and AC. For AC, by treating student model as actor, the student model output (e.g., f\u03c9t(xt))\nis essentially the action at timestep t, fed into the critic to output an approximation to the future\nreward (e.g., dev set accuracy). This is typically difficult since: 1) the student model output (i.e., the\naction) at a particular step t is weakly related to the final dev performance. 
Therefore optimizing its\naction with the guidance from critic network is largely meaningless; 2) the approximation to the future\nreward is hard given the performance measure is highly non-smooth. As a comparison, L2T-DLF is\nmore general in that at each timestep: 1) the teacher model considers the overall status of the student\nmodel for the sake of optimizing its parameters, rather than the instant action (i.e., the direct output);\n2) the teacher model outputs a loss function with the goal of maximizing, but not approximating the\nfuture reward. In that sense, L2T-DLF is more appropriate to real world applications.\n\nFigure 2: (a) Loss function: the bilinear neural network specifying the loss function $l_{\\Phi_t}(p_\\omega, y) = -\\sigma(\\vec{y}\\,' \\Phi_t \\log p_\\omega)$.\n(b) Teacher model: the teacher model outputting $\\Phi_t$ via attention mechanism: $\\Phi_t = \\mu_\\theta(s_t) = W \\mathrm{softmax}(V s_t)$.\n\n4 Experiments\n\nWe conduct comprehensive empirical verifications of the proposed L2T-DLF, in automatically\ndiscovering the most appropriate loss functions for student model training. The tasks in our experiments\ncome from two domains: image classification, and neural machine translation.\n\n4.1 Image Classification\n\nThe evaluation measure m here is the 0-1 accuracy: $m(y_1, y_2) = 1_{y_1 = y_2}$ where 1 is the 0-1\nindicator function. The student model $f_\\omega$ can be a logistic classifier specifying a softmax distribution\n$p_\\omega(y|x) = \\exp(w_y' x + b_y) / \\sum_{y^* \\in Y} \\exp(w_{y^*}' x + b_{y^*})$ with $\\omega = \\{w_{y^*}, b_{y^*}\\}_{y^* \\in Y}$. The class label\nis predicted as $\\hat{y} = \\arg\\max_{y^* \\in Y} p_\\omega(y^*|x)$ given input data $x$.\nInstead of imposing loss on $\\hat{y}$\nand ground-truth $y$, for the sake of efficient optimization $l$ typically takes the direct model output\n$p_\\omega$ and $y$ as inputs. 
For example, the most widely adopted loss function l is cross-entropy loss\n$l(p_\\omega, y) = -\\log p_\\omega(y|x)$, which could be re-written in vector form $l(p_\\omega, y) = -\\vec{y}\\,' \\log p_\\omega$, where\n$\\vec{y} \\in \\{0, 1\\}^{|Y|}$ is a one-hot representation of the true label $y$, i.e., $\\vec{y}_j = 1_{j=y}, \\forall j \\in Y$, $\\vec{y}\\,'$ is the\ntranspose of $\\vec{y}$ and $p_\\omega \\in R^{|Y|}$ is the vector of probabilities for each class outputted via $f_\\omega$.\nGeneralizing the cross entropy loss, we set the loss function coefficients $\\Phi$ as a matrix\ninteracting between $\\log p_\\omega$ and $\\vec{y}$, which switches loss function at t-th timestep into $l_{\\Phi_t}(p_\\omega, y) =\n-\\sigma(\\vec{y}\\,' \\Phi_t \\log p_\\omega)$, $\\Phi_t \\in R^{|Y| \\times |Y|}$, as is shown in Fig. 2(a). $\\sigma$ is the sigmoid function. The teacher\nmodel $\\mu_\\theta$ here is then responsible for setting $\\Phi_t$ according to the state feature vector of student model\n$s_t$: $\\Phi_t = \\mu_\\theta(s_t)$. One possible form of the teacher model is a neural network with attention\nmechanism (shown in Fig. 2(b)): $\\Phi_t = \\mu_\\theta(s_t) = W \\mathrm{softmax}(V s_t)$, where $W \\in R^{|Y| \\times |Y| \\times N}$, $V \\in R^{N \\times |s_t|}$\nconstitute the teacher model parameter set $\\theta$, and N = 10 is the number of keys in attention mechanism.\nThe state vector st is a 13 dimensional vector composed of 1) the current iteration number t; 2)\ncurrent training accuracy of f\u03c9; 3) current dev accuracy of f\u03c9; 4) current precision of f\u03c9 for the 10\nclasses on the dev set, all normalized into [0, 1].\nWe choose three widely adopted datasets: the MNIST, CIFAR-10 and CIFAR-100 datasets. For\nthe sake of showing the robustness of L2T-DLF, the student models we choose cover a wide range,\nincluding multi-layer perceptron (MLP), plain convolutional neural network (CNN) following LeNet\narchitecture [28], and advanced CNN architecture including ResNet [21], Wide-ResNet [55] and\nDenseNet [24]. 
For all the student models, we use momentum stochastic gradient descent to perform\ntraining. In Appendix we describe the network structures of student models.\n\nTable 1: The recognition results (error rate %) on MNIST dataset.\n\nStudent Model / Loss | Cross Entropy [11] | Smooth [40] | Large-Margin Softmax [36] | L2T-DLF\nMLP | 1.94 | 1.89 | 1.83 | 1.69\nLeNet | 0.98 | 0.94 | 0.88 | 0.77\n\nTable 2: The recognition results (error rate %) on CIFAR-10 (C10) and CIFAR-100 (C100) dataset. Each entry is C10/C100.\n\nStudent Model / Loss | Cross Entropy [11] | Smooth [40] | Large-Margin Softmax [36] | L2T-DLF\nResNet-8 | 12.45/39.79 | 12.08/39.52 | 11.34/38.93 | 10.82/38.27\nResNet-20 | 8.75/32.33 | 8.53/32.01 | 8.02/31.65 | 7.63/30.97\nResNet-32 | 7.51/30.38 | 7.42/30.12 | 7.01/29.56 | 6.95/29.25\nWRN | 3.80/- | 3.81/- | 3.69/- | 3.42/-\nDenseNet-BC | 3.54/- | 3.48/- | 3.37/- | 3.08/-\n\nThe different loss functions we compare include: 1) Cross entropy loss $L_{ce}(p_\\omega(x), y) =\n-\\log p_\\omega(y|x)$, which is the most widely adopted loss function to train neural network model;\n2) The smooth 0-1 loss proposed in [40]. It optimizes a smooth version of 0-1 accuracy in\nbinary classification. We extend it to handle multi-class case by modifying the loss function as\n$L_{smooth}(p_\\omega(x), y) = -\\log \\sigma(K(\\log p_\\omega(y|x) - \\max_{y^* \\neq y} \\log p_\\omega(y^*|x)))$. It is not difficult to\nobserve when $K \\to +\\infty$, $-L_{smooth}$ exactly matches the 0-1 accuracy. We choose the value of K to be\n50 according to the performance on dev set; 3) The large-margin softmax loss in [36] denoted as $L_{lm}$,\nwhich aims to enhance discrimination between different classes via maximizing the margin induced\nby the angle between $x$ and a target class representation $w_y$. We use the open-sourced code released\nby the authors in our experiment; 4) The loss function discovered via the teacher in L2T-DLF. 
The\nteacher models are optimized with Adam [26] and the detailed setting is in Appendix.\nThe classification results on MNIST, CIFAR-10 and CIFAR-100 are respectively shown in Table 1\nand 2. As can be observed, on all the three tasks, the dynamic loss functions outputted via teacher\nmodel help to cultivate better student models. For example, the teacher model helps WRN to achieve\n3.42% classification error rate on CIFAR-10, which is on par with the result discovered via automatic\narchitecture search (e.g., 3.41% of NASNet [57]). Furthermore, our dynamic loss functions for\nDenseNet on CIFAR-10 reduce the error rate of DenseNet-BC (k=40) from 3.54% to 3.08%, which\nis a non-trivial margin.\n\n4.1.1 Teacher Optimization\n\nIn Fig. 3, we provide the dev measure performance along with the teacher model optimization in the\nMNIST experiment, where the student model is LeNet. It can be observed that the dev measure increases\nas the teacher model is optimized, and finally converges to a high score.\n\n4.1.2 Analysis Towards the Loss Functions\n\nTo better understand the loss functions outputted via teacher model, we visualize the coefficients of\nsome loss functions outputted by teacher model for training ResNet-8 in CIFAR-100 classification\ntask. Specifically, note that the loss function $l_{\\Phi_t}(p_\\omega, y) = -\\sigma(\\vec{y}\\,' \\Phi_t \\log p_\\omega)$ essentially characterizes\nthe correlations among different classes via the coefficients \u03a6t. A positive \u03a6t(i, j) value means positive\ncorrelation between classes i and j, so that their probabilities should be jointly maximized, whereas a negative\nvalue imposes negative correlation and higher discrimination between the two classes i and j. We\nchoose two classes in CIFAR-100: the Otter and Baby as class i and for each of them pick several\nrepresentative classes as class j. The corresponding \u03a6t(i, j) values are visualized in Fig. 
4, with\nt = 20, 40, 60 denoting the coefficients output by the teacher model at the t-th epoch of student model\ntraining. As can be observed, at the initial phase of training the student model (t = 20), the teacher\nmodel chooses to enhance the correlation between two similar classes, e.g., Otter and Dolphin, Baby\nand Boy, for the sake of speeding up training. In contrast, when the student model is powerful\nenough (t = 60), the teacher model forces it to perform better in discriminating two similar\nclasses, as indicated by the more negative coefficient values \u03a6t(i, j). The variation of the \u03a6t(i, j)\nvalues w.r.t. t demonstrates that the teacher model captures the status of the student model when outputting\ncorrespondingly appropriate loss functions.\n\nFigure 3: Measure score on the MNIST dev set along the teacher model optimization. The student\nmodel is LeNet. The x-axis is the teacher optimization step and the y-axis is the dev measure.\n\nFigure 4: Coefficient matrix \u03a6t output by the teacher model for (a) class Otter and (b) class Baby. The y-axis (20, 40, 60) corresponds to\nthe different epochs of the student model training. Darker color means the coefficient values are more\nnegative while lighter color means more positive. In each figure, the leftmost two columns denote\nsimilar classes and the rightmost three columns represent dissimilar classes.\n\n4.2 Neural Machine Translation\n\nIn the task of neural machine translation (NMT), the evaluation measure m(\u02c6y, y) is typically the\nBLEU score [41] between the translated sentence \u02c6y and the ground-truth reference y. The student model\nf\u03c9 is a neural network performing sequence-to-sequence generation based on models including\nRNN [47], CNN [16] and self-attention network [51]. 
The decoding process of f\u03c9 is typically\nautoregressive, in that f\u03c9 factorizes the translation probability as $p_\\omega(y|x) = \\prod_{r=1}^{|y|} p_\\omega(y_r|x, y_{<r})$.\nHere $p_\\omega(\\cdot|x, y_{<r})$ is the distribution on the target vocabulary $\\mathcal{V}$ at the r-th position, taking the source side\nsentence x and the previous words $y_{<r}$ as inputs. Similar to the classification task, the loss function\ngeneralizing cross entropy loss is $l_\\Phi = -\\sum_{r=1}^{|y|} \\sigma(\\vec{y}_r' \\, \\mathrm{diag}(\\Phi) \\log p_\\omega(\\cdot|x, y_{<r}))$, where $\\Phi \\in \\mathbb{R}^{|\\mathcal{V}|}$ is\nthe coefficients of the loss function and diag(\u03a6) denotes the diagonal matrix with \u03a6 as its diagonal\nelements. Here we set the interaction matrix as diagonal mainly for the sake of computational\nefficiency, since the target vocabulary size $|\\mathcal{V}|$ is usually very large (e.g., 30k). The teacher model then\noutputs \u03a6t at timestep t taking st as input: $\\Phi_t = \\mu_\\theta(s_t) = W \\, \\mathrm{softmax}(V s_t)$, where the teacher model\nparameter is $\\theta = \\{W \\in \\mathbb{R}^{|\\mathcal{V}| \\times N}, V \\in \\mathbb{R}^{N \\times |s_t|}\\}$. We set N = 5. The state vector st is the\nsame as that in classification except that: 1) the training/dev set accuracy is replaced with BLEU\nscores; 2) the last ten features in st for classification are ignored, leading to |st| = 3.\n\nTable 3: The translation results (BLEU score) on IWSLT-14 German-English task.\n\nStudent Model/Loss | Cross Entropy [52] | RL [42] | AC [3] | Softmax-Margin [12] | L2T-DLF\nLSTM-1 | 27.28 | 27.53 | 27.75 | 28.12 | 29.52\nLSTM-2 | 30.86 | 31.03 | 31.21 | 31.22 | 31.75\nTransformer | 34.01 | 34.32 | 34.34 | 34.46 | 34.80\n\nWe choose a widely used benchmark dataset in NMT literature [42, 54, 53], released in the IWSLT-14\nGerman-English evaluation campaign [9], as the test-bed for different loss functions. The student\nmodel f\u03c9 for this task is based on LSTM with attention [4]. For the sake of fair comparison\nwith previous works [3, 42], we use a single-layer LSTM model as f\u03c9 and name it LSTM-1. 
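

The teacher mapping Phi_t = W softmax(V s_t) and the diagonal-coefficient loss defined in this section can be illustrated with a small, dependency-free sketch. It assumes that sigma is the logistic sigmoid (this excerpt does not define sigma) and that each target word is one-hot, so that the term inside sigma reduces to Phi[y_r] * log p(y_r | x, y_<r). The toy vocabulary of 4 words, the random initialization, the state features and all function names are hypothetical and are not taken from the paper.

```python
import math
import random

def softmax(z):
    # Numerically stable softmax over a list of floats.
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

def sigmoid(v):
    # Assumption: sigma in the loss is the logistic sigmoid.
    return 1.0 / (1.0 + math.exp(-v))

def matvec(M, x):
    # Plain matrix-vector product for a list-of-rows matrix.
    return [sum(a * b for a, b in zip(row, x)) for row in M]

def teacher_coefficients(W, V, s_t):
    # Phi_t = W softmax(V s_t), with W of shape |vocab| x N and V of shape N x |s_t|.
    return matvec(W, softmax(matvec(V, s_t)))

def diagonal_loss(phi, log_probs, targets):
    # l_Phi = -sum_r sigmoid(phi[y_r] * log p(y_r | x, y_<r)) for one-hot targets.
    return -sum(sigmoid(phi[y] * log_probs[r][y]) for r, y in enumerate(targets))

rng = random.Random(0)
vocab_size, N, state_dim = 4, 5, 3  # N = 5 and |s_t| = 3 as in the text; vocab size is a toy value
W = [[rng.gauss(0.0, 1.0) for _ in range(N)] for _ in range(vocab_size)]
V_mat = [[rng.gauss(0.0, 1.0) for _ in range(state_dim)] for _ in range(N)]
s_t = [0.20, 0.31, 0.29]  # hypothetical state features
phi_t = teacher_coefficients(W, V_mat, s_t)

targets = [0, 2, 1]  # hypothetical target word ids
# Hypothetical student output: a uniform distribution at every position.
log_probs = [[math.log(1.0 / vocab_size)] * vocab_size for _ in targets]
loss = diagonal_loss(phi_t, log_probs, targets)
print(phi_t, loss)
```

Under the sigmoid assumption each position contributes a value in (0, 1), so the loss of a sentence with T target words lies between -T and 0. In an actual training loop W and V would be the learned teacher parameters rather than random draws.
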
To\nfurther verify the effectiveness of L2T-DLF, we use a deeper translation model stacking two LSTM\nlayers as f\u03c9. We denote this stronger student model as LSTM-2. Furthermore, we also evaluate\nL2T-DLF on the Transformer [51] network. The Transformer architecture is based on the\nself-attention mechanism [33], and it achieves superior performance on several NMT tasks. Both the\nLSTM and Transformer student models are trained with simple SGD. In the Appendix we provide the details\nof the LSTM/Transformer student models and the training settings of the student/teacher models.\nThe loss functions we leverage to train student models include: 1) The cross entropy loss Lce, which performs\nmaximum likelihood estimation (MLE) for training the LSTM/Transformer model with teacher\nforcing [52]; 2) The reinforcement learning (RL) loss Lrl, a.k.a. sequence level training [42] or minimum\nrisk training [45], which targets directly optimizing the BLEU scores for NMT models. A typical RL loss\nis $L_{rl}(p_\\omega(x), y) = -\\sum_{y^* \\in \\mathcal{Y}} \\log p_\\omega(y^*|x)(\\mathrm{BLEU}(y^*, y) - b)$, where b is the reward baseline and\n$\\mathcal{Y}$ is the candidate subset; 3) The loss specified via the actor-critic (AC) algorithm Lac [3], which approximates\nthe BLEU score via a critic network; 4) The softmax-margin loss, which is empirically shown\nto be the most effective structural prediction loss for NMT [12]; 5) The loss function discovered via\nour L2T-DLF.\nWe report the experimental results in Table 3. From the table, we can observe that the dynamic loss\nfunctions output by our teacher model guide the student model to superior performance\ncompared with other specially designed loss functions. Specifically, with the shallow student model\nLSTM-1, we improve the BLEU score by more than 2.0 points compared with the predefined\ncross-entropy loss. 
In addition, our LSTM-2 student model achieves a 31.75 BLEU score, surpassing the\npreviously reported best result of 30.08 by [25] on IWSLT-14 German-English achieved via RNN/LSTM\nmodels. With a much stronger Transformer student model, we also improve the model performance\nfrom BLEU score 34.01 to 34.80. The above results clearly demonstrate the effectiveness of our\nL2T-DLF approach.\n\n5 Conclusion\n\nIn contrast to expert-designed and fixed loss functions in conventional machine learning systems,\nin this paper we study how to learn dynamic loss functions so as to better teach a student machine\nlearning model. Since the loss functions provided by the teacher model change dynamically with\nthe growth of the student model and the teacher model is trained through end-to-end optimization,\nthe quality of the student model is improved significantly, as shown in our experiments. We hope\nour work will stimulate and inspire the research community to automatically discover loss functions\nbetter than expert-designed ones. As to future work, we would like to conduct empirical verification\non tasks with more powerful student models and larger datasets. We are also interested in trying more\ncomplicated teacher models such as deeper neural networks.\n\nAcknowledgments\n\nThis work was partially supported by the NSFC 61573387. We thank all the anonymous reviewers\nfor their constructive feedback.\n\nReferences\n\n[1] John R Anderson, C Franklin Boyle, and Brian J Reiser. Intelligent tutoring systems. Science,\n228(4698):456\u2013462, 1985.\n\n[2] Marcin Andrychowicz, Misha Denil, Sergio Gomez, Matthew W Hoffman, David Pfau, Tom\nSchaul, and Nando de Freitas. Learning to learn by gradient descent by gradient descent. 
In\nAdvances in Neural Information Processing Systems, pages 3981\u20133989, 2016.\n\n[3] Dzmitry Bahdanau, Philemon Brakel, Kelvin Xu, Anirudh Goyal, Ryan Lowe, Joelle Pineau,\nAaron Courville, and Yoshua Bengio. An actor-critic algorithm for sequence prediction. arXiv\npreprint arXiv:1607.07086, 2016.\n\n[4] Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. Neural machine translation by jointly\nlearning to align and translate. arXiv preprint arXiv:1409.0473, 2014.\n\n[5] Andrew G Barto, Richard S Sutton, and Charles W Anderson. Neuronlike adaptive elements\nthat can solve difficult learning control problems. IEEE transactions on systems, man, and\ncybernetics, (5):834\u2013846, 1983.\n\n[6] Atilim Gunes Baydin and Barak A Pearlmutter. Automatic differentiation of algorithms for\nmachine learning. arXiv preprint arXiv:1404.7456, 2014.\n\n[7] Yoshua Bengio. Gradient-based optimization of hyperparameters. Neural computation,\n12(8):1889\u20131900, 2000.\n\n[8] Yoshua Bengio, J\u00e9r\u00f4me Louradour, Ronan Collobert, and Jason Weston. Curriculum learning.\nIn Proceedings of the 26th ICML, pages 41\u201348. ACM, 2009.\n\n[9] M Cettolo, J Niehues, S St\u00fcker, L Bentivogli, and M Federico. Report on the 11th IWSLT\nevaluation campaign, IWSLT 2014. In IWSLT-International Workshop on Spoken Language\nProcessing, pages 2\u201317. Marcello Federico, Sebastian St\u00fcker, Fran\u00e7ois Yvon, 2014.\n\n[10] Yutian Chen, Matthew W Hoffman, Sergio G\u00f3mez Colmenarejo, Misha Denil, Timothy P\nLillicrap, Matt Botvinick, and Nando de Freitas. Learning to learn without gradient descent by\ngradient descent. arXiv preprint arXiv:1611.03824, 2016.\n\n[11] Pieter-Tjerk De Boer, Dirk P Kroese, Shie Mannor, and Reuven Y Rubinstein. A tutorial on the\ncross-entropy method. Annals of operations research, 134(1):19\u201367, 2005.\n\n[12] Sergey Edunov, Myle Ott, Michael Auli, David Grangier, and Marc\u2019Aurelio Ranzato. 
Classical\nstructured prediction losses for sequence to sequence learning. arXiv preprint arXiv:1711.04956,\n2017.\n\n[13] Yang Fan, Fei Tian, Tao Qin, Xiang-Yang Li, and Tie-Yan Liu. Learning to teach. In International\nConference on Learning Representations, 2018.\n\n[14] Chelsea Finn, Pieter Abbeel, and Sergey Levine. Model-agnostic meta-learning for fast adaptation\nof deep networks. In Doina Precup and Yee Whye Teh, editors, Proceedings of the 34th\nInternational Conference on Machine Learning, volume 70 of Proceedings of Machine Learning\nResearch, pages 1126\u20131135, International Convention Centre, Sydney, Australia, 06\u201311 Aug\n2017. PMLR.\n\n[15] Luca Franceschi, Michele Donini, Paolo Frasconi, and Massimiliano Pontil. Forward and\nreverse gradient-based hyperparameter optimization. In Doina Precup and Yee Whye Teh,\neditors, Proceedings of the 34th International Conference on Machine Learning, volume 70\nof Proceedings of Machine Learning Research, pages 1165\u20131173, International Convention\nCentre, Sydney, Australia, 06\u201311 Aug 2017. PMLR.\n\n[16] Jonas Gehring, Michael Auli, David Grangier, Denis Yarats, and Yann N Dauphin. Convolutional\nsequence to sequence learning. In International Conference on Machine Learning, pages\n1243\u20131252, 2017.\n\n[17] S.A. Goldman and M.J. Kearns. On the complexity of teaching. J. Comput. Syst. Sci.,\n50(1):20\u201331, February 1995.\n\n[18] Chen Gong, Dacheng Tao, Wei Liu, Liu Liu, and Jie Yang. Label propagation via teaching-to-learn\nand learning-to-teach. IEEE transactions on neural networks and learning systems,\n28(6):1452\u20131465, 2017.\n\n[19] Chen Gong, Dacheng Tao, Jie Yang, and Wei Liu. Teaching-to-learn and learning-to-teach for\nmulti-label propagation. In AAAI 2016, pages 1610\u20131616, 2016.\n\n[20] Elad Hazan, Kfir Yehuda Levy, and Shai Shalev-Shwartz. On graduated optimization for\nstochastic non-convex problems. 
In International Conference on Machine Learning, pages\n1833\u20131841, 2016.\n\n[21] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image\nrecognition. In Proceedings of the IEEE conference on computer vision and pattern recognition,\npages 770\u2013778, 2016.\n\n[22] Paul Henderson and Vittorio Ferrari. End-to-end training of object class detectors for mean\naverage precision. In Asian Conference on Computer Vision, pages 198\u2013213. Springer, 2016.\n\n[23] Mark K Ho, Michael Littman, James MacGlashan, Fiery Cushman, and Joseph L Austerweil.\nShowing versus doing: Teaching by demonstration. In Advances in Neural Information\nProcessing Systems 29, pages 3027\u20133035. 2016.\n\n[24] Gao Huang, Zhuang Liu, Laurens Van Der Maaten, and Kilian Q Weinberger. Densely connected\nconvolutional networks. In CVPR, 2017.\n\n[25] Po-Sen Huang, Chong Wang, Sitao Huang, Dengyong Zhou, and Li Deng. Towards neural\nphrase-based machine translation. In International Conference on Learning Representations,\n2018.\n\n[26] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint\narXiv:1412.6980, 2014.\n\n[27] M Pawan Kumar, Benjamin Packer, and Daphne Koller. Self-paced learning for latent variable\nmodels. In Advances in Neural Information Processing Systems, pages 1189\u20131197, 2010.\n\n[28] Yann LeCun, L\u00e9on Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning\napplied to document recognition. Proceedings of the IEEE, 86(11):2278\u20132324, 1998.\n\n[29] Ke Li and Jitendra Malik. Learning to optimize. arXiv preprint arXiv:1606.01885, 2016.\n\n[30] Yaoyong Li, Hugo Zaragoza, Ralf Herbrich, John Shawe-Taylor, and Jaz S. Kandola. The\nperceptron algorithm with uneven margins. 
In Proceedings of the Nineteenth International\nConference on Machine Learning, ICML \u201902, pages 379\u2013386, San Francisco, CA, USA, 2002.\nMorgan Kaufmann Publishers Inc.\n\n[31] Timothy P Lillicrap, Jonathan J Hunt, Alexander Pritzel, Nicolas Heess, Tom Erez, Yuval Tassa,\nDavid Silver, and Daan Wierstra. Continuous control with deep reinforcement learning. arXiv\npreprint arXiv:1509.02971, 2015.\n\n[32] Tsung-Yi Lin, Priya Goyal, Ross Girshick, Kaiming He, and Piotr Doll\u00e1r. Focal loss for dense\nobject detection. arXiv preprint arXiv:1708.02002, 2017.\n\n[33] Zhouhan Lin, Minwei Feng, Cicero Nogueira dos Santos, Mo Yu, Bing Xiang, Bowen Zhou,\nand Yoshua Bengio. A structured self-attentive sentence embedding. In ICLR, 2017.\n\n[34] Ji Liu and Xiaojin Zhu. The teaching dimension of linear learners. Journal of Machine Learning\nResearch, 17(162):1\u201325, 2016.\n\n[35] Weiyang Liu, Bo Dai, James Rehg, and Le Song. Iterative machine teaching. In Proceedings of\nthe 34th International Conference on Machine Learning (ICML-17), pages 1188\u20131196, 2017.\n\n[36] Weiyang Liu, Yandong Wen, Zhiding Yu, and Meng Yang. Large-margin softmax loss for\nconvolutional neural networks. In International Conference on Machine Learning, pages\n507\u2013516, 2016.\n\n[37] Renqian Luo, Fei Tian, Tao Qin, and Tie-Yan Liu. Neural architecture optimization. arXiv\npreprint arXiv:1808.07233, 2018.\n\n[38] Dougal Maclaurin, David Duvenaud, and Ryan Adams. Gradient-based hyperparameter optimization\nthrough reversible learning. In International Conference on Machine Learning, pages\n2113\u20132122, 2015.\n\n[39] Smitha Milli, Pieter Abbeel, and Igor Mordatch. Interpretable and pedagogical examples. arXiv\npreprint arXiv:1711.00694, 2017.\n\n[40] Tan Nguyen and Scott Sanner. Algorithms for direct 0\u20131 loss optimization in binary classification. 
In International Conference on Machine Learning, pages 1085\u20131093, 2013.\n\n[41] Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. Bleu: a method for automatic\nevaluation of machine translation. In Proceedings of the 40th annual meeting on association for\ncomputational linguistics, pages 311\u2013318. Association for Computational Linguistics, 2002.\n\n[42] Marc\u2019Aurelio Ranzato, Sumit Chopra, Michael Auli, and Wojciech Zaremba. Sequence level\ntraining with recurrent neural networks. arXiv preprint arXiv:1511.06732, 2015.\n\n[43] Jurgen Schmidhuber. Evolutionary principles in self-referential learning. Diploma thesis,\nInstitut f. Informatik, Tech. Univ. Munich, 1987.\n\n[44] Patrick Shafto, Noah D Goodman, and Thomas L Griffiths. A rational account of pedagogical\nreasoning: Teaching by, and learning from, examples. Cognitive psychology, 71:55\u201389, 2014.\n\n[45] Shiqi Shen, Yong Cheng, Zhongjun He, Wei He, Hua Wu, Maosong Sun, and Yang Liu.\nMinimum risk training for neural machine translation. In Proceedings of the 54th Annual\nMeeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages\n1683\u20131692. Association for Computational Linguistics, 2016.\n\n[46] Yang Song, Alexander Schwing, Raquel Urtasun, et al. Training deep neural networks via direct\nloss minimization. In International Conference on Machine Learning, pages 2169\u20132177, 2016.\n\n[47] Ilya Sutskever, Oriol Vinyals, and Quoc V Le. Sequence to sequence learning with neural\nnetworks. In Advances in neural information processing systems, pages 3104\u20133112, 2014.\n\n[48] Richard Stuart Sutton. Temporal credit assignment in reinforcement learning. 1984.\n\n[49] Michael Taylor, John Guiver, Stephen Robertson, and Tom Minka. Softrank: optimizing non-smooth\nrank metrics. In Proceedings of the 2008 International Conference on Web Search and\nData Mining, pages 77\u201386. ACM, 2008.\n\n[50] Sebastian Thrun and Lorien Pratt. 
Learning to learn. Springer Science & Business Media,\n2012.\n\n[51] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez,\n\u0141ukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in Neural Information\nProcessing Systems, pages 6000\u20136010, 2017.\n\n[52] Ronald J Williams and David Zipser. A learning algorithm for continually running fully\nrecurrent neural networks. Neural computation, 1(2):270\u2013280, 1989.\n\n[53] Lijun Wu, Fei Tian, Tao Qin, Jianhuang Lai, and Tie-Yan Liu. A study of reinforcement learning\nfor neural machine translation. In EMNLP, 2018.\n\n[54] Lijun Wu, Yingce Xia, Li Zhao, Fei Tian, Tao Qin, Jianhuang Lai, and Tie-Yan Liu. Adversarial\nneural machine translation. In ACML, 2018.\n\n[55] Sergey Zagoruyko and Nikos Komodakis. Wide residual networks. arXiv preprint\narXiv:1605.07146, 2016.\n\n[56] Xiaojin Zhu. Machine teaching: An inverse problem to machine learning and an approach\ntoward optimal education. In Proceedings of the Twenty-Ninth AAAI Conference on Artificial\nIntelligence, AAAI\u201915, pages 4083\u20134087. AAAI Press, 2015.\n\n[57] Barret Zoph and Quoc Le. Neural architecture search with reinforcement learning. 
In International\nConference on Learning Representations, 2017.\n", "award": [], "sourceid": 3182, "authors": [{"given_name": "Lijun", "family_name": "Wu", "institution": "Sun Yat-sen University"}, {"given_name": "Fei", "family_name": "Tian", "institution": "Microsoft Research"}, {"given_name": "Yingce", "family_name": "Xia", "institution": "Microsoft Research Asia"}, {"given_name": "Yang", "family_name": "Fan", "institution": "University of Science and Technology of China"}, {"given_name": "Tao", "family_name": "Qin", "institution": "Microsoft Research"}, {"given_name": "Lai", "family_name": "Jian-Huang", "institution": "Sun Yat-sen University"}, {"given_name": "Tie-Yan", "family_name": "Liu", "institution": "Microsoft Research Asia"}]}