{"title": "Sparse Multi-Task Reinforcement Learning", "book": "Advances in Neural Information Processing Systems", "page_first": 819, "page_last": 827, "abstract": "In multi-task reinforcement learning (MTRL), the objective is to simultaneously learn multiple tasks and exploit their similarity to improve the performance w.r.t.\\ single-task learning. In this paper we investigate the case when all the tasks can be accurately represented in a linear approximation space using the same small subset of the original (large) set of features. This is equivalent to assuming that the weight vectors of the task value functions are \\textit{jointly sparse}, i.e., the set of their non-zero components is small and it is shared across tasks. Building on existing results in multi-task regression, we develop two multi-task extensions of the fitted $Q$-iteration algorithm. While the first algorithm assumes that the tasks are jointly sparse in the given representation, the second one learns a transformation of the features in the attempt of finding a more sparse representation. For both algorithms we provide a sample complexity analysis and numerical simulations.", "full_text": "Sparse Multi-Task Reinforcement Learning\n\nDaniele Calandriello \u2217\n\nAlessandro Lazaric\u2217\n\nTeam SequeL\n\nINRIA Lille \u2013 Nord Europe, France\n\nMarcello Restelli\u2020\n\nDEIB\n\nPolitecnico di Milano, Italy\n\nAbstract\n\nIn multi-task reinforcement learning (MTRL), the objective is to simultaneously\nlearn multiple tasks and exploit their similarity to improve the performance w.r.t.\nsingle-task learning. In this paper we investigate the case when all the tasks can\nbe accurately represented in a linear approximation space using the same small\nsubset of the original (large) set of features. This is equivalent to assuming that\nthe weight vectors of the task value functions are jointly sparse, i.e., the set of\ntheir non-zero components is small and it is shared across tasks. 
Building on existing results in multi-task regression, we develop two multi-task extensions of the fitted Q-iteration algorithm. While the first algorithm assumes that the tasks are jointly sparse in the given representation, the second one learns a transformation of the features in the attempt of finding a more sparse representation. For both algorithms we provide a sample complexity analysis and numerical simulations.

1 Introduction

Reinforcement learning (RL) and approximate dynamic programming (ADP) [24, 2] are effective approaches to solve the problem of decision-making under uncertainty. Nonetheless, they may fail in domains where a relatively small amount of samples can be collected (e.g., in robotics where samples are expensive, or in applications where human interaction is required, such as in automated rehabilitation). Fortunately, the lack of samples can be compensated by leveraging on the presence of multiple related tasks (e.g., different users). In this scenario, usually referred to as multi-task reinforcement learning (MTRL), the objective is to simultaneously solve multiple tasks and exploit their similarity to improve the performance w.r.t. single-task learning (we refer to [26] and [15] for a comprehensive review of the more general setting of transfer RL). In this setting, many approaches have been proposed, which mostly differ in the notion of similarity leveraged in the multi-task learning process. In [28] the transition and reward kernels of all the tasks are assumed to be generated from a common distribution, and samples from different tasks are used to estimate the generative distribution, thus improving the inference on each task. A similar model, but for value functions, is proposed in [16], where the parameters of all the different value functions are assumed to be drawn from a common distribution.
In [23] different shaping function approaches for Q-table initialization are considered and empirically evaluated, while a model-based approach that estimates statistical information on the distribution of the Q-values is proposed in [25]. Similarity at the level of the MDPs is also exploited in [17], where samples are transferred from source to target tasks. Multi-task reinforcement learning approaches have also been applied in partially observable environments [18].

In this paper we investigate the case when all the tasks can be accurately represented in a linear approximation space using the same small subset of the original (large) set of features. This is equivalent to assuming that the weight vectors of the task value functions are jointly sparse, i.e., the set of their non-zero components is small and it is shared across tasks. Let us illustrate the concept of shared sparsity using the blackjack card game. The player can rely on a very large number of features such as: value and color of the cards in the player's hand, value and color of the cards on the table and/or already discarded, different scoring functions for the player's hand (e.g., sum of the values of the cards) and so on. The more the features, the more likely it is that the corresponding feature space could accurately represent the optimal value function.

∗{daniele.calandriello,alessandro.lazaric}@inria.fr
†{marcello.restelli}@polimi.it
Nonetheless, depending on the rules of the game (i.e., the reward and dynamics), a very limited subset of features actually contributes to the value of a state, and we expect the optimal value function to display a high level of sparsity. Furthermore, if we consider multiple tasks differing in the behavior of the dealer (e.g., the value at which she stays) or slightly different rule sets, we may expect such sparsity to be shared across tasks. For instance, if the game uses an infinite number of decks, features based on the history of the cards played in previous hands have no impact on the optimal policy for any task, and the corresponding value functions are all jointly sparse in this representation. Building on this intuition, in this paper we first introduce the notion of sparse MDPs in Section 3. Then we rely on existing results in multi-task regression [19, 1] to develop two multi-task extensions of the fitted Q-iteration algorithm (Section 4 and Section 5) and we study their theoretical and empirical performance (Section 6). An extended description of the results, as well as the full proofs of the statements, are reported in [5].

2 Preliminaries

Multi-Task Reinforcement Learning (MTRL). A Markov decision process (MDP) is a tuple M = (X, A, R, P, γ), where the state space X is a bounded subset of the Euclidean space, the action space A is finite (i.e., |A| < ∞), R : X × A → [0, 1] is the reward of a state-action pair, P : X × A → P(X) is the transition distribution over the states achieved by taking an action in a given state, and γ ∈ (0, 1) is a discount factor. A deterministic policy π : X → A is a mapping from states to actions. We denote by B(X × A; b) the set of measurable bounded state-action functions f : X × A → [−b, b].
Solving an MDP corresponds to computing the optimal action-value function Q∗ ∈ B(X × A; Qmax = 1/(1 − γ)), defined as the fixed point of the optimal Bellman operator T, defined as T Q(x, a) = R(x, a) + γ Σ_y P(y|x, a) max_{a'} Q(y, a'). The optimal policy is obtained as the greedy policy w.r.t. the optimal value function as π∗(x) = arg max_{a ∈ A} Q∗(x, a). In this paper we study the multi-task reinforcement learning (MTRL) setting where the objective is to solve T tasks, defined as M_t = (X, A, P_t, R_t, γ) with t ∈ [T] = {1, ..., T}, with the same state-action space, but different dynamics and rewards. The objective of MTRL is to exploit similarities between tasks to improve the performance w.r.t. single-task learning. In particular, we choose linear fitted Q-iteration as the single-task baseline and we propose multi-task extensions tailored to exploit the sparsity in the structure of the tasks.

Figure 1 (pseudocode): Linear FQI with fixed design and fresh samples at each iteration in a multi-task setting. Input: state sets {S_t = {x_{i,t}}_{i=1}^{n_x}}_{t=1}^T, tolerance tol, iteration budget K. Initialize W^0 ← 0, k = 0. Repeat: k ← k + 1; for a = 1, ..., |A|: for t = 1, ..., T and i = 1, ..., n_x, sample r^k_{i,a,t} = R_t(x_{i,t}, a) and y^k_{i,a,t} ∼ P_t(·|x_{i,t}, a), and compute z^k_{i,a,t} = r^k_{i,a,t} + γ max_{a'} Q̃^{k−1}_t(y^k_{i,a,t}, a'); build datasets D^k_{a,t} = {(x_{i,t}, a), z^k_{i,a,t}}_{i=1}^{n_x}; compute Ŵ^k_a on {D^k_{a,t}}_{t=1}^T (see Eqs. 2, 5, or 8). Until max_a ||W^k_a − W^{k−1}_a||_2 < tol or k = K.

Linear Fitted Q-iteration. Whenever X and A are large or continuous, we need to resort to approximation schemes to learn a near-optimal policy. One of the most popular ADP methods is the fitted Q-iteration (FQI) algorithm [7], which extends value iteration to approximate action-value functions. While exact value iteration proceeds by iterative applications of the Bellman operator (i.e., Q_k = T Q_{k−1}), at each iteration FQI approximates T Q_{k−1} by solving a regression problem. Among possible instances, here we focus on a specific implementation of FQI in the fixed design setting with linear approximation, and we assume access to a generative model of the MDP. Since the action space A is finite, we represent action-value functions as a collection of |A| independent state-value functions. We introduce a d_x-dimensional state-feature vector φ(·) = [φ_1(·), ..., φ_{d_x}(·)]^T with φ_i : X → R such that sup_x ||φ(x)||_2 ≤ L, and a state-action feature vector ψ(x, a) ∈ R^{|A| d_x} obtained by surrounding features φ(x) with (a − 1)d_x and (|A| − a)d_x zeros before and after respectively. From φ we obtain a linear approximation space for action-value functions as F = {f_w(x, a) = φ(x)^T w_a, x ∈ X, a ∈ A, w_a ∈ R^{d_x}}. FQI receives as input a fixed set of states S = {x_i}_{i=1}^{n_x} (fixed design setting) and the space F. Starting from w^0 = 0, at each iteration k, FQI first draws a (fresh) set of samples (r^k_{i,a}, y^k_{i,a})_{i=1}^{n_x} from the generative model of the MDP for each action a on each of the states {x_i}_{i=1}^{n_x} (i.e., r^k_{i,a} = R(x_i, a) and y^k_{i,a} ∼ P(·|x_i, a)) and builds |A|
Since\n\nunbiased sample of T (cid:98)Qk\u22121 and (cid:98)Qk\u22121(yk\ni,a}nx\na = {(xi, a), zk\ni,a, a(cid:48)) is an\ni,a, a(cid:48)) is computed using the weight vector learned at the\na and it returns vectors (cid:98)wk\ni,a, a(cid:48))Twk\u22121. Then FQI solves |A| linear regression problems, each \ufb01tting\n1 , . . . ,(cid:98)wk|A|]. At each iteration the total number of samples is n = |A| \u00d7 nx. The process\n(cid:98)wk = [(cid:98)wk\na, which lead to the new action value function f(cid:98)wk with\nin principle (cid:98)Qk\u22121 could be unbounded (due to numerical issues in the regression step), in comput-\ni,a we use a function (cid:101)Qk\u22121 obtained by truncating (cid:98)Qk\u22121 in [\u2212Qmax; Qmax]. The\ning the samples zk\nconvergence and the performance of FQI are studied in detail in [20] in the case of bounded ap-\nproximation space, while linear FQI is studied in [17, Thm. 5] and [22, Lemma 5]. When moving\nto the multi-task setting, we consider different state sets {St}T\na \u2208 Rdx\u00d7T\na,t \u2208 Rdx as the t\u2013th column. The general structure of FQI in a multi-task\nsetting is reported in Fig. 1. Finally, we introduce the following matrix notation. For any matrix\nW \u2208 Rd\u00d7T , [W ]t \u2208 Rd is the t\u2013th column and [W ]i \u2208 RT the i\u2013th row of the matrix, Vec(W )\nis the RdT vector obtained by stacking the columns of the matrix, Col(W ) is its column-space and\nRow(W ) is its row-space. Beside the (cid:96)2, (cid:96)1-norm for vectors, we use the trace (or nuclear) norm\ni,j)1/2 and the (cid:96)2,1-norm\ni=1 (cid:107)[W ]i(cid:107)2. 
We denote by O^d the set of orthonormal matrices and, for any pair of matrices V and W, V ⊥ Row(W) denotes the orthogonality between the spaces spanned by the two matrices.

3 Fitted Q-Iteration in Sparse MDPs

Depending on the regression algorithm employed at each iteration, FQI can be designed to take advantage of different characteristics of the functions at hand, such as smoothness (ℓ2-regularization) and sparsity (ℓ1-regularization). In this section we consider the high-dimensional regression scenario and we study the performance of FQI under sparsity assumptions. Let π_w(x) = arg max_a f_w(x, a) be the greedy policy w.r.t. f_w. We start with the following assumption.¹

Assumption 1. For any function f_w ∈ F, the Bellman operator T can be expressed as

T f_w(x, a) = R(x, a) + γ E_{x'∼P(·|x,a)}[f_w(x', π_w(x'))] = ψ(x, a)^T w_R + γ ψ(x, a)^T P^{π_w}_ψ w.   (1)

This assumption implies that F is closed w.r.t. the Bellman operator, since for any f_w, its image T f_w can be computed as the product between features ψ(·,·) and the vector of weights w_R + γ P^{π_w}_ψ w. As a result, the optimal value function Q∗ itself belongs to F and it can be computed as ψ(x, a)^T w∗. This assumption encodes the intuition that in the high-dimensional feature space F induced by ψ, the transition kernel P, and therefore the system dynamics, can be expressed as a linear combination of the features using the matrix P^{π_w}_ψ, which depends on both the function f_w and the features ψ.
This condition is usually satisfied whenever the space F is spanned by a very large set of features that allows it to approximate a wide range of different functions, including the reward and transition kernel. Under this assumption, at each iteration k of FQI, there exists a weight vector w^k such that T Q̂^{k−1} = f_{w^k}, and an approximation of the target function f_{w^k} can be obtained by solving an ordinary least-squares problem on the samples in D^k_a. Unfortunately, it is well known that OLS fails whenever the number of samples is not sufficient w.r.t. the number of features (i.e., d > n). For this reason, Asm. 1 is often joined together with a sparsity assumption. Let J(w) = {i = 1, ..., d : w_i ≠ 0} be the set of s non-zero components of vector w (i.e., s = |J(w)|) and J^c(w) be the complementary set. In supervised learning, the LASSO [11, 4] is effective in exploiting the sparsity assumption that s ≪ d and dramatically reduces the sample complexity. In RL the idea of sparsity has been successfully integrated into policy evaluation [14, 21, 8, 12] but rarely in the full policy iteration. In value iteration, it can be easily integrated in FQI by approximating the target weight vector w^k_a as

ŵ^k_a = arg min_{w ∈ R^{d_x}} (1/n_x) Σ_{i=1}^{n_x} (φ(x_i)^T w − z^k_{i,a})² + λ ||w||_1.   (2)

While this integration is technically simple, the conditions on the MDP structure that imply sparsity in the value functions are not fully understood. In fact, one may simply assume that Q∗ is sparse in F, with s non-zero weights, thus implying that d − s features capture aspects of states and actions that do not have any impact on the actual optimal value function. Nonetheless, this would provide no guarantee about the actual level of sparsity encountered by FQI through iterations, where the target functions f_{w^k} may not be sparse at all. For this reason we need stronger conditions on the structure of the MDP. We state the following assumption (see [10, 6] for similar conditions).

Assumption 2 (Sparse MDPs). There exists a set J (the set of useful features) for MDP M, with |J| = s ≪ d, such that for any i ∉ J and any policy π the rows [P^π_ψ]^i are equal to 0, and there exists a function f_{w_R} = R such that J(w_R) ⊆ J.

This assumption implies not only that the reward function is sparse, but also that the features that are useless for the reward have no impact on the dynamics of the system. Since P^π_ψ can be seen as a linear representation of the transition kernel embedded in the high-dimensional space F, this assumption corresponds to imposing that the matrix P^π_ψ has all its rows corresponding to features outside of J set to 0. This in turn means that the future state-action vector E[ψ(x', a')^T] = ψ(x, a)^T P^π_ψ depends only on the features in J. In the blackjack scenario illustrated in the introduction, this assumption is verified by features related to the history of the cards played so far. In fact, if we consider an infinite number of decks, the feature indicating whether an ace has already been played is not used in the definition of the reward function and it is completely unrelated to the other features, and thus it does not contribute to the optimal value function. An important consideration on this assumption can be derived by a closer look at the sparsity pattern of the matrix P^π_ψ. Since the sparsity is required at the level of the rows, this does not mean that the features that do not belong to J have to be equal to 0 after each transition.

¹A similar assumption has been previously used in [9] where the transition P is embedded in a RKHS.
Instead, their value will be governed simply by the interaction with the features in J. This means that the features outside of J can vary from completely unnecessary features with no dynamics, to features that are redundant to those in J in describing the evolution of the system. Additional discussion on this assumption is available in [5]. Assumption 2, together with Asm. 1, leads to the following lemma.

Lemma 1. Under Assumptions 1 and 2, the application of the Bellman operator T to any function f_w ∈ F produces a function f_{w'} = T f_w ∈ F such that J(w') ⊆ J.

This lemma guarantees that at any iteration k of FQI, the target function f_{w^k} = T Q̂^{k−1} has a level of sparsity |J(w^k)| ≤ s. We are now ready to study the performance of LASSO-FQI over iterations. In order to simplify the comparison to the multi-task results in Sections 4 and 5, we analyze the average performance over multiple tasks. We consider that the previous assumptions extend to all the MDPs {M_t}_{t=1}^T, each with a set of useful features J_t and sparsity s_t. The action-value function learned after K iterations is evaluated by comparing the performance of the corresponding greedy policy π^K_t(x) = arg max_a Q^K_t(x, a) to the optimal policy. The performance loss is measured w.r.t. a target distribution μ ∈ P(X × A). We introduce the following standard assumption for LASSO [3].

Assumption 3 (Restricted Eigenvalues (RE)). Define n as the number of samples, and J^c as the complement of the set of indices J. For any s ∈ [d], there exists κ(s) ∈ R+ such that:

min { ||Φ∆||_2 / (√n ||∆_J||_2) : |J| ≤ s, ∆ ∈ R^d \ {0}, ||∆_{J^c}||_1 ≤ 3 ||∆_J||_1 } ≥ κ(s).   (3)

Theorem 1 (LASSO-FQI). Let the tasks {M_t}_{t=1}^T and the function space F satisfy Assumptions 1, 2 and 3 with average sparsity s̄ = Σ_t s_t / T, κ_min(s) = min_t κ(s_t) and features bounded sup_x ||φ(x)||_2 ≤ L. If LASSO-FQI (Alg. 1 with Eq. 2) is run independently on all T tasks for K iterations with a regularizer λ = δ Qmax √(log(d)/n), for any numerical constant δ > 8, then with probability at least (1 − 2d^{1−δ/8})^{KT}, the performance loss is bounded as

(1/T) Σ_{t=1}^T ||Q∗_t − Q^{π^K_t}_t||²_{2,μ} ≤ O( (1/(1−γ)^4) [ (Q²max L² / κ⁴_min(s)) (s̄ log d / n) ] + γ^K Q²max ).   (4)

Remark 1 (assumptions). Asm. 3 is a relatively weak constraint on the representation capability of the data. The RE assumption is common in regression, and it is extensively analyzed in [27]. Asm. 1 and 2 are specific to our setting and may pose significant constraints on the set of MDPs of interest. Asm. 1 is introduced to give a more explicit interpretation for the notion of sparse MDPs. Without Asm. 1, the bound in Eq. 4 would have an additional approximation error term similar to standard approximate value iteration results (see e.g., [20]). Asm. 2 is a potentially very loose sufficient condition to guarantee that the target functions encountered over the iterations of LASSO-FQI have a minimum level of sparsity. Thm. 1 requires that for any k ≤ K, the target function f_{w^{k+1}_t} = T f_{w^k_t} has weights w^{k+1}_t that are sparse, i.e., max_{t,k} s^k_t ≤ s̄ with s^k_t = |J(w^{k+1}_t)|. In other words, all target functions encountered must be sparse, or LASSO-FQI could suffer a huge loss at an intermediate step.
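Concretely, the per-action, per-task regression of Eq. 2 is a standard LASSO problem. A minimal proximal-gradient (ISTA) sketch, offered here only as an illustration of the regression step (the solver choice is an assumption, not the one used in the experiments):

```python
import numpy as np

def soft_threshold(v, tau):
    """Prox of tau * ||.||_1: componentwise shrinkage toward zero."""
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

def lasso_ista(Phi, z, lam, n_iter=500):
    """Solve min_w (1/n) ||Phi w - z||^2 + lam * ||w||_1 by ISTA,
    as in the regression step of Eq. 2 (Phi: (n, d), z: (n,))."""
    n, d = Phi.shape
    w = np.zeros(d)
    # step size = 1 / Lipschitz constant of the smooth part
    lr = n / (2 * np.linalg.norm(Phi, 2) ** 2)
    for _ in range(n_iter):
        grad = (2.0 / n) * Phi.T @ (Phi @ w - z)   # gradient of squared loss
        w = soft_threshold(w - lr * grad, lr * lam)
    return w
```

With an orthonormal design the iteration reaches the closed-form LASSO solution in one step, which makes the shrinkage effect of λ easy to inspect.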
Such a sparsity condition could be obtained under much less restrictive assumptions than Asm. 2, which leave it up to the MDP dynamics to re-sparsify the target function at each step, at the expense of interpretability. It could be sufficient to prove that the MDP dynamics do not enforce sparsity, but simply do not reduce it across iterations, and use guarantees for LASSO reconstruction to maintain sparsity across iterations. Finally, we point out that even if "useless" features do not satisfy Asm. 2 and are weakly correlated with the dynamics and the reward function, their weights are discounted by γ at each step. As a result, the target functions could become "approximately" as sparse as Q∗ over iterations, and provide enough guarantees to be used for a variation of Thm. 1. We leave for future work a more thorough investigation of these possible relaxations.

4 Group-LASSO Fitted Q-Iteration

After introducing the concept of sparse MDP in Sect. 3, we move to the multi-task scenario and we study the setting where there exists a suitable representation (i.e., set of features) under which all the tasks can be solved using roughly the same set of features, the so-called shared sparsity assumption. Given the set of useful features J_t for task t, we denote by J = ∪_{t=1}^T J_t the union of all the non-zero coefficients across all the tasks. Similar to Asm. 2 and Lemma 1, we first assume that the set of features "useful" for at least one of the tasks is relatively small compared to d, and then show how this propagates through iterations.

Assumption 4. We assume that the joint useful features over all the tasks are such that |J| = s̃ ≪ d.

Lemma 2. Under Asm. 2 and 4, at any iteration k, the target weight matrix W^k has |J(W^k)| ≤ s̃.

The Algorithm. In order to exploit the similarity across tasks stated in Asm. 4, we resort to the Group LASSO (GL) algorithm [11, 19], which defines a joint optimization problem over all the tasks. GL is based on the intuition that given the weight matrix W ∈ R^{d×T}, the norm ||W||_{2,1} measures the level of shared sparsity across tasks. In fact, in ||W||_{2,1} the ℓ2-norm measures the "relevance" of feature i across tasks, while the ℓ1-norm "counts" the total number of relevant features, which we expect to be small in agreement with Asm. 4. Building on this intuition, we define the GL-FQI algorithm in which at each iteration, for each action a ∈ A, we compute (details about the implementation of GL-FQI are reported in [5, Appendix A])

Ŵ^k_a = arg min_{W_a} Σ_{t=1}^T ||Z^k_{a,t} − Φ_t w_{a,t}||²_2 + λ ||W_a||_{2,1}.   (5)

Theoretical Analysis. The regularization of GL-FQI is designed to take advantage of the shared-sparsity assumption at each iteration, and in this section we show that this may lead to a reduced sample complexity w.r.t. using LASSO in FQI for each task separately. Before reporting the analysis of GL-FQI, we need to introduce a technical assumption defined in [19] for GL.

Assumption 5 (Multi-Task Restricted Eigenvalues). Define Φ as the block diagonal matrix composed by the T sample matrices Φ_t. For any s ∈ [d], there exists κ(s) ∈ R+ s.t.

min { ||Φ Vec(∆)||_2 / (√(nT) ||Vec(∆_J)||_2) : |J| ≤ s, ∆ ∈ R^{d×T} \ {0}, ||∆_{J^c}||_{2,1} ≤ 3 ||∆_J||_{2,1} } ≥ κ(s).   (6)

Similar to Theorem 1, we evaluate the performance of GL-FQI as the performance loss of the returned policy w.r.t.
the optimal policy, and we obtain the following performance guarantee.

Theorem 2 (GL-FQI). Let the tasks {M_t}_{t=1}^T and the function space F satisfy Assumptions 1, 2, 4, and 5 with joint sparsity s̃ and features bounded sup_x ||φ(x)||_2 ≤ L. If GL-FQI (Alg. 1 with Eq. 5) is run jointly on all T tasks for K iterations with a regularizer λ = (L Qmax / √(nT)) (1 + T^{−1/2} (log d)^{3/2+δ})^{1/2}, for any numerical constant δ > 0, then with probability at least (1 − log(d)^{−δ})^K, the performance loss is bounded as

(1/T) Σ_{t=1}^T ||Q∗_t − Q^{π^K_t}_t||²_{2,μ} ≤ O( (1/(1−γ)^4) [ (L² Q²max / κ⁴(2s̃)) (s̃/n) (1 + (log d)^{3/2+δ}/√T) ] + γ^K Q²max ).   (7)

Remark 2 (comparison with LASSO-FQI). Ignoring all the terms in common between the two methods, constants, and logarithmic factors, we can summarize the bounds of LASSO-FQI and GL-FQI as Õ(s̄ log(d)/n) and Õ((s̃/n)(1 + log(d)/√T)) respectively. The first interesting aspect of the bound of GL-FQI is the role played by the number of tasks T. In LASSO-FQI the "cost" of discovering the s_t useful features is a factor log d, while GL-FQI has a factor 1 + log(d)/√T, which decreases with the number of tasks. This illustrates the advantage of the multi-task learning dimension of GL-FQI, where the samples of all tasks actually contribute to discovering the useful features, so that the larger the number of tasks, the smaller the cost. In the limit, we notice that when T → ∞, the bound for GL-FQI does not depend on the dimensionality of the problem anymore.
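The ℓ2,1 penalty in Eq. 5 is what produces row-sparse solutions: its proximal operator shrinks entire rows of W (features) to zero jointly across all tasks. A minimal proximal gradient step for Eq. 5 is sketched below, assuming for simplicity a design matrix shared across tasks; this is an illustration, not the implementation of [5, Appendix A]:

```python
import numpy as np

def prox_l21(W, tau):
    """Proximal operator of tau * ||W||_{2,1}: row-wise soft-thresholding.
    A row (feature) is zeroed for all tasks at once when its l2 norm <= tau."""
    out = np.zeros_like(W)
    for i, row in enumerate(W):
        norm = np.linalg.norm(row)
        if norm > tau:
            out[i] = (1.0 - tau / norm) * row
    return out

def gl_step(W, Phi, Z, lam, lr):
    """One proximal gradient step on Eq. 5 (shared design Phi for all tasks).
    W: (d, T) weight matrix, Phi: (n, d) features, Z: (n, T) Bellman targets."""
    grad = 2 * Phi.T @ (Phi @ W - Z)          # gradient of the squared loss
    return prox_l21(W - lr * grad, lr * lam)
```

Iterating `gl_step` drives the rows of W outside the shared support J to zero, which is exactly the behavior the ℓ2,1-norm is chosen to enforce.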
The other critical aspect of the bounds is the difference between s̄ and s̃. In fact, max_t s_t ≤ s̃ ≤ d and, if the shared-sparsity assumption does not hold, we can construct cases where the number of non-zero features s_t is very small for each task, but the union J = ∪_t J_t is still a full set, so that s̃ ≈ d. In this case, GL-FQI cannot leverage the shared sparsity across tasks and it may perform significantly worse than LASSO-FQI. This is the well-known negative transfer effect that happens whenever a wrong assumption over the tasks is enforced, thus worsening the single-task learning performance.

5 Feature Learning Fitted Q-Iteration

Unlike other properties such as smoothness, the sparsity of a function is intrinsically related to the specific representation used to approximate it (i.e., the function space F). While Asm. 2 guarantees that F induces sparsity for each task separately, Asm. 4 requires that all the tasks share the same useful features in the given representation. As discussed in Rem. 2, whenever this is not the case, GL-FQI may perform worse than LASSO-FQI. In this section we investigate an alternative notion of sparsity in MDPs and we introduce the Feature Learning fitted Q-iteration (FL-FQI) algorithm.

Low Rank Approximation. Since the poor performance of GL-FQI is due to the chosen representation (i.e., features), it is natural to ask whether there exists an alternative representation (i.e., different features) inducing a higher level of shared sparsity. Let us assume that there exists a space F∗ defined by features φ∗ such that the weight matrix A∗ ∈ R^{d×T} of the optimal Q-functions is such that |J(A∗)| = s∗ ≪ d. As shown in Lemma 2, together with Asm. 2 and 4, this guarantees that at any iteration |J(A^k)| ≤ s∗.
Given the set of states {St}T\nt=1, let \u03a6 and \u03a6\u2217 the feature matrices\nobtained by evaluating \u03c6 and \u03c6\u2217 on the states. We assume that there exists a linear transformation\nof the features of F\u2217 to the features of F such that \u03a6 = \u03a6\u2217U with U \u2208 Rdx\u00d7dx. In this setting\nthe samples used to de\ufb01ne the regression problem can be formulated as noisy observations of \u03a6\u2217Ak\na\nfor any action a. Together with the transformation U, this implies that there exists a weight matrix\na = U\u22121Ak\na is indeed sparse,\nW k\nany attempt to learn W k\na may have a very low level of sparsity. On\nthe other hand, an algorithm able to learn a suitable transformation U, it may be able to recover the\nrepresentation \u03a6\u2217 (and the corresponding space F\u2217) and exploit the high level of sparsity of Ak\na.\nWhile this additional step of representation or feature learning introduces additional complexity, it\nallows to relax the strict assumption on the joint sparsity \u02dcs and may improve the performance of\nGL\u2013FQI. Our assumption is formulated as follows.\nAssumption 6. There exists an orthogonal matrix U \u2208 Od (block diagonal matrix having matrices\n{Ua \u2208 Odx} on the diagonal) such that the weight matrix A\u2217 obtained as A\u2217 = U\u22121W \u2217 is jointly\nsparse, i.e., has a set of \u201cuseful\u201d features J(A\u2217) = \u222aT\n\na with W k\na using GL would fail, since W k\n\nt=1J([A\u2217]t) with |J(A\u2217)| = s\u2217 (cid:28) d.\n\na such that \u03a6\u2217Ak\n\na = \u03a6\u2217U U\u22121Ak\n\na. 
Although Ak\n\na = \u03a6W k\n\nCoherently with this assumption, we adapt the multi-task feature learning (MTFL) algorithm de\ufb01ned\nin [1] and at each iteration k for any action a we solve the optimization problem\n\na) = arg min\nUa\u2208Od\n\nmin\n\nAa\u2208Rd\u00d7T\n\n||Z k\n\na,t \u2212 \u03a6tUa[Aa]t||2 + \u03bb(cid:107)A(cid:107)2,1 .\n\n(8)\n\nIn order to better characterize the solution to this optimization problem, we study more in detail\nthe relationship between A\u2217 and W \u2217 and analyze the two directions of the equality A\u2217 = U\u22121W \u2217.\nWhen A\u2217 has s\u2217 non-zero rows, then any orthonormal transformation W \u2217 will have at most rank\nr\u2217 = s\u2217. This suggests that instead of solving the joint optimization problem in Eq. 8 and explicitly\nrecover the transformation U, we may directly try to solve for low-rank weight matrices W . Then\nwe need to show that a low-rank W \u2217 does indeed imply the existence of a transformation to a jointly-\nsparse matrix A\u2217. Assume W \u2217 has low rank r\u2217. It is then possible to perform a standard singular\n\n6\n\n((cid:98)U k\na , (cid:98)Ak\n\nT(cid:88)\n\nt=1\n\n\fvalue decomposition W \u2217 = U \u03a3V = U A\u2217. Because \u03a3 is diagonal with r\u2217 non-zero entries, A\u2217 will\nhave r\u2217 non-zero rows, thus being jointly sparse. It is possible to derive the following equivalence.\nProposition 1 ([5, Appendix A]). Given A, W \u2208 Rd\u00d7T , U \u2208 Od, the following equality holds,\nwith the relationship between the optimal solutions being W \u2217 = U A\u2217,\n\nmin\nA,U\n\n||Z k\n\na,t \u2212 \u03a6tUa[Aa]t||2 + \u03bb(cid:107)A(cid:107)2,1 = min\n\nW\n\n||Z k\n\na,t \u2212 \u03a6t[Wa]t||2 + \u03bb(cid:107)W(cid:107)1.\n\n(9)\n\nT(cid:88)\n\nt=1\n\nT(cid:88)\n\nt=1\n\nThe previous proposition states the equivalence between solving a feature learning version of GL\nand solving a nuclear norm (or trace norm) regularized problem. 
This penalty is equivalent to an ℓ1-norm penalty on the singular values of the matrix W, thus forcing W to have low rank. Notice that assuming that W* has low rank can also be interpreted as the fact that either the task weights [W*]_t or the feature weights [W*]^i are linearly correlated. In the first case, it means that there is a dictionary of core tasks able to reproduce all the other tasks as a linear combination. As a result, Assumption 6 can be reformulated as Rank(W*) = s*. Building on this intuition, we define the FL–FQI algorithm, where the regression is carried out according to Eq. 9.

Theoretical Analysis. Our aim is to obtain a bound similar to Theorem 2 for the new FL–FQI algorithm. We begin by introducing a slightly different assumption on the data available for regression.

Assumption 7 (Restricted Strong Convexity). Under Assumption 6, let W* = UDV^⊤ be a singular value decomposition of the optimal matrix W* of rank r, and U^r, V^r the submatrices associated with the top r singular values. Define B = {∆ ∈ R^{d×T} : Row(∆) ⊥ U^r and Col(∆) ⊥ V^r}, and let Π_B denote the projection operator onto this set. There exists a positive constant κ such that

$$\min\left\{ \frac{\|\Phi \operatorname{Vec}(\Delta)\|_2^2}{2nT\,\|\operatorname{Vec}(\Delta)\|_2^2} \,:\, \Delta \in \mathbb{R}^{d\times T},\ \|\Pi_B(\Delta)\|_* \le 3\,\|\Delta - \Pi_B(\Delta)\|_* \right\} \ge \kappa. \qquad (10)$$

Theorem 3 (FL–FQI). Let the tasks {M_t}_{t=1}^T and the function space F satisfy Assumptions 1, 2, 6, and 7 with rank s*, features bounded as sup_x ‖φ(x)‖_2 ≤ L, and T > Ω(log n). If FL–FQI (Alg. 1 with Eq. 8) is run jointly on all T tasks for K iterations with a regularizer λ ≥ 2LQ_max √((d+T)/n), then with probability at least Ω((1 − exp{−(d+T)})^K) the performance loss is bounded as

$$\frac{1}{T}\sum_{t=1}^{T} \left\| Q^*_t - Q^{\pi_K}_t \right\|^2_{2,\rho_t} \le O\!\left( \frac{1}{(1-\gamma)^4}\left[ \frac{Q_{\max}^2 L^4}{\kappa^2}\,\frac{s^*}{n}\left(1 + \frac{d}{T}\right) + \gamma^K Q_{\max}^2 \right] \right).$$

Remark 3 (comparison with GL–FQI). Unlike GL–FQI, the performance of FL–FQI does not depend on the shared sparsity s̃ of W* but on its rank, that is, the value s* of the most jointly-sparse representation that can be obtained through an orthogonal transformation U of the features. Whenever the tasks are somehow linearly dependent, even if the weight matrix W* is dense and s̃ ≈ d, the rank s* can be small, thus guaranteeing a dramatic improvement over GL–FQI. On the other hand, learning a new representation comes at the cost of a worse dependency on d. In fact, the term log(d)/√T in GL–FQI becomes d/T, implying that many more tasks are needed for FL–FQI to construct a suitable representation. This is not surprising, since we introduced a d×d matrix U in the optimization problem and a larger number of parameters needs to be learned. As a result, although significantly reduced by the use of the trace norm instead of ℓ2,1-regularization, negative transfer is not completely removed.
In particular, the introduction of new tasks that are not linear combinations of the previous tasks may again increase the rank s*, corresponding to the fact that no jointly-sparse representation can be constructed.

6 Experiments

We investigate the empirical performance of GL–FQI and FL–FQI and compare their results to single-task LASSO–FQI in two variants of the blackjack game. In the first variant (reduced variant) the player can choose to hit, to obtain a new card, or stay, to end the episode, while in the second one (full variant) she can also choose to double the bet on the first turn. Different tasks can be defined depending on several parameters of the game, such as the number of decks, the threshold at which the dealer stays, and whether she hits when the threshold is reached exactly with a soft hand.

Full variant experiment. The tasks are generated by selecting 2, 4, 6, 8 decks, by setting the stay threshold at {16, 17}, and whether the dealer hits on soft, for a total of 16 tasks.

Figure 2: Comparison of FL–FQI, GL–FQI and LASSO–FQI on the full (left) and reduced (right) variants. The y axis is the average house edge (HE) computed across tasks.

We define a very rich description of the state space with the objective of satisfying Asm. 1. At the same time, this is likely to come with a large number of useless features, which makes it suitable for sparsification. In particular, we include the player hand value, indicator functions for each possible player hand value and dealer hand value, and a large description of the cards not dealt yet (corresponding to the history of the game), in the form of indicator functions for various ranges. In total, the representation contains d = 212 features. We notice that although none of the features is completely useless (according to the definition in Asm.
2), the features related to the history of the game are unlikely to be very useful for most of the tasks defined in this experiment. We collect samples from up to 5000 episodes, although they may not be representative enough, given the large state space of all possible histories that the player can encounter and the high stochasticity of the game. The evaluation is performed by simulating the learned policy for 2,000,000 episodes and computing the average house edge (HE) across tasks. For each algorithm, we report the performance for the best regularization parameter λ in the range {2, 5, 10, 20, 50}. Results are reported in Fig. 2-(left). Although the set of features is quite large, we notice that all the algorithms succeed in learning a good policy even with relatively few samples, showing that all of them can take advantage of the sparsity of the representation. In particular, GL–FQI exploits the fact that all 16 tasks share the same useless features (although the sets of useful features may not overlap entirely) and its performance is the best. FL–FQI suffers from the increased complexity of representation learning, which in this case does not lead to any benefit since the initial representation is already sparse, and it performs on par with LASSO–FQI.

Reduced variant experiment. We consider a representation for which we expect the weight matrix to be dense. In particular, we only consider the value of the player's hand and of the dealer's hand, and we generate features as the Cartesian product of these two discrete variables, plus a feature indicating whether the hand is soft, for a total of 280 features. Similarly to the previous setting, the tasks are generated with 2, 4, 6, 8 decks, whether the dealer hits on soft, and a larger number of stay thresholds in {15, 16, 17, 18}, for a total of 32 tasks. We used regularizers in the range {0.1, 1, 2, 5, 10}.
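As a hedged illustration of this kind of representation, a Cartesian-product indicator encoding can be built as follows (the hand-value ranges below are assumptions chosen for the example, not the paper's exact encoding, so the feature count differs slightly):

```python
import numpy as np

# Illustrative sketch: one-hot "Cartesian product" features over the player
# total and the dealer total, plus a soft-hand indicator, in the spirit of
# the reduced-variant representation. Ranges are assumed for the example.
PLAYER_TOTALS = list(range(4, 32))   # assumed range of player hand values
DEALER_TOTALS = list(range(2, 12))   # assumed range of dealer hand values

def encode_state(player_total, dealer_total, is_soft):
    phi = np.zeros(len(PLAYER_TOTALS) * len(DEALER_TOTALS) + 1)
    i = PLAYER_TOTALS.index(player_total)
    j = DEALER_TOTALS.index(dealer_total)
    phi[i * len(DEALER_TOTALS) + j] = 1.0   # one indicator per (player, dealer) pair
    phi[-1] = float(is_soft)                # soft-hand flag
    return phi

phi = encode_state(player_total=17, dealer_total=10, is_soft=True)
print(phi.shape, phi.sum())   # (281,) 2.0 -- one joint indicator plus the soft flag
```

Since every (player, dealer) pair gets its own indicator, few features are globally useless, which is why the weight matrix is expected to be dense in this variant.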
Since the history is not included, the different number of decks influences only the probability distribution of the totals. Moreover, limiting the actions to either hit or stay further increases the similarity among tasks. Therefore, we expect to be able to find a dense, low-rank solution. The results in Fig. 2-(right) confirm this expectation, with FL–FQI performing significantly better than the other methods. In addition, GL–FQI and LASSO–FQI perform similarly, since the dense representation penalizes both single-task and shared sparsity; in fact, both methods favor low values of λ, meaning that the sparsity-inducing penalties are not effective.

7 Conclusions

We studied the multi-task reinforcement learning problem under shared sparsity assumptions across the tasks. GL–FQI extends the FQI algorithm by introducing a Group-LASSO step at each iteration and leverages the fact that all the tasks are expected to share the same small set of useful features to improve on the performance of single-task learning. Whenever this assumption is not valid, GL–FQI may perform worse than LASSO–FQI. With FL–FQI we take a step further and learn a transformation of the given representation that can guarantee a higher level of shared sparsity.
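The two regression steps compared above differ essentially in their proximal operators: the Group-LASSO step of GL–FQI amounts to row-wise (ℓ2,1) soft-thresholding, which enforces joint sparsity, while the trace-norm step of FL–FQI soft-thresholds singular values, which enforces low rank. A minimal numpy sketch of both operators (a proximal-gradient style solver is an assumption here; the paper does not prescribe a specific optimizer):

```python
import numpy as np

# Minimal sketch of the two proximal operators behind the regression steps
# discussed above. Illustrative only; not the paper's implementation.

def prox_l21(W, tau):
    """Prox of tau*||W||_{2,1}: shrink each row; small rows vanish (joint sparsity)."""
    norms = np.linalg.norm(W, axis=1, keepdims=True)
    return W * np.maximum(1.0 - tau / np.maximum(norms, 1e-12), 0.0)

def prox_trace(W, tau):
    """Prox of tau*||W||_*: soft-threshold the singular values (low rank)."""
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    return U @ np.diag(np.maximum(s - tau, 0.0)) @ Vt

W = np.array([[3.0, 4.0],
              [0.3, 0.4]])
print(prox_l21(W, 1.0))                            # second row (norm 0.5) is zeroed out
print(np.linalg.matrix_rank(prox_trace(W, 1.0)))   # rows of W are parallel: rank stays 1
```

Inside an FQI iteration, either operator would be applied after a gradient step on the squared regression loss, yielding the jointly-sparse (GL–FQI) or low-rank (FL–FQI) weight matrix.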
Future work will focus on relaxing the theoretical assumptions and on studying alternative multi-task regularization formulations, such as those in [29] and [13].

Acknowledgments This work was supported by the French Ministry of Higher Education and Research, the European Community's Seventh Framework Programme under grant agreement 270327 (project CompLACS), and the French National Research Agency (ANR) under project ExTra-Learn n.ANR-14-CE24-0010-01.

References

[1] Andreas Argyriou, Theodoros Evgeniou, and Massimiliano Pontil. Convex multi-task feature learning. Machine Learning, 73(3):243–272, 2008.
[2] D. Bertsekas and J. Tsitsiklis. Neuro-Dynamic Programming. Athena Scientific, 1996.
[3] Peter J. Bickel, Ya'acov Ritov, and Alexandre B. Tsybakov. Simultaneous analysis of Lasso and Dantzig selector. The Annals of Statistics, pages 1705–1732, 2009.
[4] Peter Bühlmann and Sara van de Geer. Statistics for High-Dimensional Data: Methods, Theory and Applications. Springer, 1st edition, 2011.
[5] Daniele Calandriello, Alessandro Lazaric, and Marcello Restelli. Sparse multi-task reinforcement learning. Technical report, https://hal.inria.fr/hal-01073513, 2014.
[6] A. Castelletti, S. Galelli, M. Restelli, and R. Soncini-Sessa. Tree-based feature selection for dimensionality reduction of large-scale control systems. In IEEE ADPRL, 2011.
[7] Damien Ernst, Pierre Geurts, Louis Wehenkel, and Michael L. Littman. Tree-based batch mode reinforcement learning. Journal of Machine Learning Research, 6(4), 2005.
[8] Mohammad Ghavamzadeh, Alessandro Lazaric, Rémi Munos, Matt Hoffman, et al. Finite-sample analysis of Lasso-TD. In ICML, 2011.
[9] Steffen Grunewalder, Guy Lever, Luca Baldassarre, Massimiliano Pontil, and Arthur Gretton.
Modelling transition dynamics in MDPs with RKHS embeddings. In ICML, 2012.
[10] H. Hachiya and M. Sugiyama. Feature selection for reinforcement learning: Evaluating implicit state-reward dependency via conditional mutual information. In ECML PKDD, 2010.
[11] T. Hastie, R. Tibshirani, and J. Friedman. The Elements of Statistical Learning. Springer, 2009.
[12] M. Hoffman, A. Lazaric, M. Ghavamzadeh, and R. Munos. Regularized least squares temporal difference learning with nested ℓ2 and ℓ1 penalization. In EWRL, pages 102–114, 2012.
[13] Laurent Jacob, Guillaume Obozinski, and Jean-Philippe Vert. Group lasso with overlap and graph lasso. In ICML, pages 433–440. ACM, 2009.
[14] J. Zico Kolter and Andrew Y. Ng. Regularization and feature selection in least-squares temporal difference learning. In ICML, 2009.
[15] A. Lazaric. Transfer in reinforcement learning: a framework and a survey. In M. Wiering and M. van Otterlo, editors, Reinforcement Learning: State of the Art. Springer, 2011.
[16] Alessandro Lazaric and Mohammad Ghavamzadeh. Bayesian multi-task reinforcement learning. In ICML, 2010.
[17] Alessandro Lazaric and Marcello Restelli. Transfer from multiple MDPs. In NIPS, 2011.
[18] Hui Li, Xuejun Liao, and Lawrence Carin. Multi-task reinforcement learning in partially observable stochastic environments. Journal of Machine Learning Research, 10:1131–1186, 2009.
[19] Karim Lounici, Massimiliano Pontil, Sara van de Geer, Alexandre B. Tsybakov, et al. Oracle inequalities and optimal inference under group sparsity. The Annals of Statistics, 39(4):2164–2204, 2011.
[20] Rémi Munos and Csaba Szepesvári. Finite-time bounds for fitted value iteration. The Journal of Machine Learning Research, 9:815–857, 2008.
[21] C. Painter-Wakefield and R. Parr. Greedy algorithms for sparse reinforcement learning.
In ICML, 2012.
[22] Bruno Scherrer, Victor Gabillon, Mohammad Ghavamzadeh, and Matthieu Geist. Approximate modified policy iteration. In ICML, 2012.
[23] Matthijs Snel and Shimon Whiteson. Multi-task reinforcement learning: Shaping and feature selection. In EWRL, September 2011.
[24] Richard S. Sutton and Andrew G. Barto. Introduction to Reinforcement Learning. MIT Press, 1998.
[25] F. Tanaka and M. Yamamura. Multitask reinforcement learning on the distribution of MDPs. In CIRA 2003, pages 1108–1113, 2003.
[26] Matthew E. Taylor and Peter Stone. Transfer learning for reinforcement learning domains: A survey. Journal of Machine Learning Research, 10(1):1633–1685, 2009.
[27] Sara A. van de Geer, Peter Bühlmann, et al. On the conditions used to prove oracle results for the lasso. Electronic Journal of Statistics, 3:1360–1392, 2009.
[28] A. Wilson, A. Fern, S. Ray, and P. Tadepalli. Multi-task reinforcement learning: A hierarchical Bayesian approach. In ICML, pages 1015–1022, 2007.
[29] Yi Zhang and Jeff G. Schneider. Learning multiple tasks with a sparse matrix-normal penalty. In NIPS, pages 2550–2558, 2010.