{"title": "Fighting Boredom in Recommender Systems with Linear Reinforcement Learning", "book": "Advances in Neural Information Processing Systems", "page_first": 1757, "page_last": 1768, "abstract": "A common assumption in recommender systems (RS) is the existence of a best fixed recommendation strategy. Such a strategy may be simple and work at the item level (e.g., in multi-armed bandits it is assumed that one best fixed arm/item exists) or implement a more sophisticated RS (e.g., the objective of A/B testing is to find the best fixed RS and execute it thereafter). We argue that this assumption is rarely verified in practice, as the recommendation process itself may impact the user\u2019s preferences. For instance, a user may get bored by a strategy, while she may gain interest again if enough time has passed since the last time that strategy was used. In this case, a better approach consists in alternating different solutions at the right frequency to fully exploit their potential. In this paper, we first cast the problem as a Markov decision process, where the rewards are a linear function of the recent history of actions, and we show that a policy considering the long-term influence of the recommendations may outperform both fixed-action and contextual greedy policies. We then introduce an extension of the UCRL algorithm (LINUCRL) to effectively balance exploration and exploitation in an unknown environment, and we derive a regret bound that is independent of the number of states. 
Finally,\nwe empirically validate the model assumptions and the algorithm in a number of realistic scenarios.", "full_text": "Fighting Boredom in Recommender Systems with\n\nLinear Reinforcement Learning\n\nRomain Warlop\n\n\ufb01fty-\ufb01ve, Paris, France\n\nSequeL Team, Inria Lille, France\n\nromain@fifty-five.com\n\nAlessandro Lazaric\nFacebook AI Research\n\nParis, France\n\nlazaric@fb.com\n\nJ\u00e9r\u00e9mie Mary\nCriteo AI Lab\nParis, France\n\nj.mary@criteo.com\n\nAbstract\n\nA common assumption in recommender systems (RS) is the existence of a best\n\ufb01xed recommendation strategy. Such strategy may be simple and work at the item\nlevel (e.g., in multi-armed bandit it is assumed one best \ufb01xed arm/item exists) or\nimplement more sophisticated RS (e.g., the objective of A/B testing is to \ufb01nd the\nbest \ufb01xed RS and execute it thereafter). We argue that this assumption is rarely\nveri\ufb01ed in practice, as the recommendation process itself may impact the user\u2019s\npreferences. For instance, a user may get bored by a strategy, while she may gain\ninterest again, if enough time passed since the last time that strategy was used. In\nthis case, a better approach consists in alternating different solutions at the right\nfrequency to fully exploit their potential. In this paper, we \ufb01rst cast the problem as\na Markov decision process, where the rewards are a linear function of the recent\nhistory of actions, and we show that a policy considering the long-term in\ufb02uence\nof the recommendations may outperform both \ufb01xed-action and contextual greedy\npolicies. We then introduce an extension of the UCRL algorithm (LINUCRL) to\neffectively balance exploration and exploitation in an unknown environment, and\nwe derive a regret bound that is independent of the number of states. 
Finally,\nwe empirically validate the model assumptions and the algorithm in a number of\nrealistic scenarios.\n\n1\n\nIntroduction\n\nConsider a movie recommendation problem, where the recommender system (RS) selects the genre\nto suggest to a user. A basic strategy is to estimate user\u2019s preferences and then recommend movies of\nthe preferred genres. While this strategy is sensible in the short term, it overlooks the dynamics of the\nuser\u2019s preferences caused by the recommendation process. For instance, the user may get bored of\nthe proposed genres and then reduce her ratings. This effect is due to the recommendation strategy\nitself and not by an actual evolution of user\u2019s preferences, as she would still like the same genres, if\nonly they were not proposed so often.1\nThe existence of an optimal \ufb01xed strategy is often assumed in RS using, e.g., matrix factorization to\nestimate users\u2019 ratings and the best (\ufb01xed) item/genre [16]. Similarly, multi-armed bandit (MAB)\nalgorithms [4] effectively trade off exploration and exploitation in unknown environments, but still\nassume that rewards are independent from the sequence of arms selected over time and they try\nto select the (\ufb01xed) optimal arm as often as possible. Even when comparing more sophisticated\nrecommendation strategies, as in A/B testing, we implicitly assume that once the better option\n(either A or B) is found, it should be constantly executed, thus ignoring how its performance may\ndeteriorate if used too often. An alternative approach is to estimate the state of the user (e.g., her\n\n1In this paper, we do not study non-stationarity preferences, as it is a somehow orthogonal problem to the\n\nissue that we consider.\n\n32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montr\u00e9al, Canada.\n\n\flevel of boredom) as a function of the movies recently watched and estimate how her preferences are\naffected by that. 
We could then learn a contextual strategy that recommends the best genre depending\non the actual state of the user (e.g., using LINUCB [17]). While this could partially address the\nprevious issue, we argue that in practice it may not be satisfactory. As the preferences depend on\nthe sequence of recommendations, a successful strategy should \u201cdrive\u201d the user\u2019s state in the most\nfavorable condition to gain as much reward as possible in the long term, instead selecting the best\n\u201cinstantaneous\u201d action at each step. Consider a user with preferences 1) action, 2) drama, 3) comedy.\nAfter showing a few action and drama movies, the user may get bored. A greedy contextual strategy\nwould then move to recommending comedy, but as soon as it estimates that action or drama are\nbetter again (i.e., their potential value reverts to its initial value as they are not watched), it would\nimmediately switch back to them. On the other hand, a more farsighted strategy may prefer to stick\nto comedy for a little longer to increase the preference of the user for action to its higher level and\nfully exploit its potential.\nIn this paper, we propose to use a reinforcement learning (RL) [23] model to capture this dynamical\nstructure, where the reward (e.g., the average rating of a genre) depends on a state that summarizes the\neffect of the recent recommendations on user\u2019s preferences. We introduce a novel learning algorithm\nthat effectively trades off exploration and exploitation and we derive theoretical guarantees for it.\nFinally, we validate our model and algorithm in synthetic and real-data based environments.\nRelated Work. While in the MAB model, regret minimization [2] and best-arm identi\ufb01cation\nalgorithms [11, 22] have been often proposed to learn effective RS, they all rely on the assumption\nthat one best \ufb01xed arm exists. 
[8] study settings with time-varying rewards, where each time an arm is pulled its reward decreases due to loss of interest but, unlike our scenario, it never increases again, even if the arm is not selected for a long time. [14] also consider rewards that continuously decrease over time whether the arm is selected or not (e.g., modeling novelty effects, where new products naturally lose interest over time). This model fits into the more general case of restless bandits [e.g., 6, 25, 20], where each arm has a partially observable internal state that evolves as a Markov chain independently from the arms selected over time. Time-varying preferences have also been widely studied in RS. [25, 15] consider a time-dependent bias to capture seasonality and trend effects, but do not consider the effects on users' state. More related to our model is the setting proposed by [21], who consider an MDP-based RS at the item level, where the next item reward depends on the k previously selected items. Working at the item level without any underlying model assumption prevents their algorithm from learning in large state spaces, as every single combination of k items should be considered (in their approach this is partially mitigated by state aggregation). Finally, they do not consider the exploration-exploitation trade-off and directly solve an estimated MDP. This may lead to an overall linear regret, i.e., to failing to learn the optimal policy. Somewhat similarly, [12] propose a semi-Markov model to decide which item to recommend to a user based on her latent psychological state toward this item. They assume two possible states: sensitization, a state during which she is highly engaged with the item, and boredom, a state during which she is not interested in the item. Thanks to the use of a semi-Markov model, the next state of the user depends on how long she has been in the current state. 
Our work is also related to the linear bandit model [17, 1], where rewards are a linear function of a context and an unknown target vector. Despite producing context-dependent policies, this model does not consider the influence that the actions may have on the state and thus overlooks the potential of long-term reward maximization.

2 Problem Formulation

We consider a finite set of actions a ∈ {1, . . . , K} = [K]. Depending on the application, actions may correspond to simple items or complex RS. We define the state st at time t as the history of the last w actions, i.e., st = (at−1, · · · , at−w), where for w = 0 the state reduces to the empty history. As described in the introduction, we expect the reward of an action a to depend on how often a has been recently selected (e.g., a user may get bored the more a RS is used). We introduce the recency function ρ(st, a) = Σ_{τ=1}^{w} 1{at−τ = a}/τ, where the effect of an action fades as 1/τ, so that the recency is large if an action is often selected and it decreases when the action is not selected for a while. We define the (expected) reward function associated to an action a in state s as

r(st, a) = Σ_{j=0}^{d} θ*_{a,j} ρ(st, a)^j = x_{s,a}^T θ*_a,    (1)

where x_{s,a} = [1, ρ(s, a), · · · , ρ(s, a)^d] ∈ R^{d+1} is the context vector associated to action a in state s and θ*_a ∈ R^{d+1} is an unknown vector.

Figure 1: Average rating as a function of the recency for different genres of movies (w = 10) and predictions of our model for d = 5 in red. From left to right: drama, comedy, action and thriller. The confidence intervals are constructed based on the amount of samples available at each state s and the red curves are obtained by fitting the data with the model in Eq. 1.

In practice, the reward observed when selecting a at st is rt = r(st, a) + εt, with εt a zero-mean noise. For d = 0 or w = 0, this model reduces to the standard MAB setting, where θ*_{a,0} is the expected reward of action a. Eq. 1 extends the MAB model by adding to the "stationary" component θ*_{a,0} a polynomial function of the recency ρ(st, a). While alternative and more complicated functions of st may be used to model the reward, in the next section we show that a small-degree polynomial of the recency is rich enough to model real data.

The formulation in Eq. 1 may suggest that this is an instance of a linear bandit problem, where x_{st,a} is the context for action a at time t and θ*_a is the unknown vector. Nonetheless, in linear bandits the sequence of contexts {x_{st,a}}_t is independent from the actions selected over time and the optimal action at time t is a*_t = arg max_{a∈[K]} x_{st,a}^T θ*_a,2 while in our model x_{st,a} actually depends on the state st, which summarizes the last w actions. As a result, an optimal policy should take into account its effect on the state to maximize the long-term average reward. We thus introduce the deterministic Markov decision process (MDP) M = ⟨S, [K], f, r⟩ with state space S enumerating the possible sequences of w actions, action space [K], noisy reward function in Eq. 1, and a deterministic transition function f : S × [K] → S that simply drops the action selected w steps ago and appends the last action to the state. A policy π : S → [K] is evaluated according to its long-term average reward η_π = lim_{n→∞} E[1/n Σ_{t=1}^{n} rt], where rt is the (random) reward of state st and action at = π(st). 
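As a minimal, self-contained sketch of the recency feature and the reward model of Eq. 1 (the window w, degree d, and θ values below are illustrative assumptions, not fitted parameters):

```python
# Sketch of the recency feature rho and the linear reward model of Eq. 1.
# The window w, degree d, and theta values below are illustrative assumptions.

def recency(state, a):
    """rho(s, a) = sum_{tau=1}^{w} 1{a_{t-tau} = a} / tau,
    with state = (a_{t-1}, ..., a_{t-w})."""
    return sum(1.0 / tau for tau, past in enumerate(state, start=1) if past == a)

def reward(state, a, theta):
    """r(s, a) = sum_{j=0}^{d} theta[a][j] * rho(s, a)**j (Eq. 1, noiseless)."""
    rho = recency(state, a)
    return sum(c * rho ** j for j, c in enumerate(theta[a]))

# Two actions, w = 3, d = 2; action 0 suffers a boredom effect (negative slope).
theta = {0: [3.0, -1.0, 0.0], 1: [2.5, 0.0, 0.0]}
s = (0, 0, 1)               # action 0 played at t-1 and t-2, action 1 at t-3
print(recency(s, 0))        # 1/1 + 1/2 = 1.5
print(reward(s, 0, theta))  # 3.0 - 1.0 * 1.5 = 1.5
```

Under this toy model the attractiveness of action 0 recovers as it stops being played, which is the boredom/recovery effect the MDP formulation is meant to exploit.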
The optimal policy is thus π* = arg max_π η_π, with optimal average reward η* = η_{π*}. While an explicit form for π* cannot be obtained in general, an optimal policy may select an action with suboptimal instantaneous reward (i.e., action at = π(st) is s.t. r(st, at) < max_a r(st, a)) so as to let other (potentially more rewarding) actions "recharge" and select them later on. This results in a policy that alternates actions with a fixed schedule (see Sec. 5 for more insights).3 If the parameters θ*_a were known, we could compute the optimal policy by value iteration, where a value function u_0 ∈ R^S is iteratively updated as

u_{i+1}(s) = max_{a∈[K]} [ r(s, a) + u_i(f(s, a)) ],    (2)

and a nearly-optimal policy is obtained after n iterations as π(s) = arg max_{a∈[K]} [r(s, a) + u_n(f(s, a))]. Alternatively, algorithms to compute the maximum-reward cycle for deterministic MDPs could be used [see e.g., 13, 5]. The objective of a learning algorithm is to approach the performance of the optimal policy as quickly as possible. This is measured by the regret, which compares the reward cumulated over T steps by a learning algorithm and by the optimal policy, i.e.,

Δ(T) = T η* − Σ_{t=1}^{T} r(st, at),    (3)

where (st, at) is the sequence of states and actions observed and selected by the algorithm.

3 Model Validation on Real Data

In order to provide a preliminary validation of our model, we use the movielens-100k dataset [9, 7]. We consider a simple scenario where a RS directly recommends a genre to a user. 
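The value-iteration update of Eq. 2 can be sketched on a toy instance of this deterministic MDP (K = 2 actions, window w = 2; the reward parameters are illustrative assumptions, chosen so that one action suffers boredom while the other is flat):

```python
# Value iteration (Eq. 2) on the deterministic MDP of Sec. 2:
# states are the last w actions, f drops the oldest action and appends the new one.
# K, w, and the reward parameters below are illustrative assumptions.
from itertools import product

K, w = 2, 2
states = list(product(range(K), repeat=w))   # s = (a_{t-1}, ..., a_{t-w})
theta = {0: [3.0, -1.0], 1: [2.5, 0.0]}      # action 0 gets "boring", action 1 is flat

def f(s, a):                                 # deterministic transition function
    return (a,) + s[:-1]

def recency(s, a):                           # rho(s, a) = sum 1{a_{t-tau} = a}/tau
    return sum(1.0 / tau for tau, past in enumerate(s, start=1) if past == a)

def r(s, a):                                 # Eq. 1, noiseless
    rho = recency(s, a)
    return sum(c * rho ** j for j, c in enumerate(theta[a]))

u = {s: 0.0 for s in states}
for _ in range(200):                         # u_{i+1}(s) = max_a [r(s,a) + u_i(f(s,a))]
    u = {s: max(r(s, a) + u[f(s, a)] for a in range(K)) for s in states}
policy = {s: max(range(K), key=lambda a: r(s, a) + u[f(s, a)]) for s in states}

# Long-run average of the greedy policy vs. the best fixed action.
s, total = (1, 1), 0.0
for _ in range(300):
    a = policy[s]; total += r(s, a); s = f(s, a)
print(total / 300, max(r((a,) * w, a) for a in range(K)))
```

On this instance the extracted policy settles into a fixed cycle that plays the "boring" action once every three steps, and its long-run average reward exceeds that of the best fixed action, illustrating the point above.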
In practice, one may prefer to use collaborative filtering algorithms (e.g., matrix factorization) and apply our proposed algorithm on top of them to find the optimal cadence that maximizes long-term performance. However, when dealing with very sparse information, as in retargeting, it may happen that a RS just focuses on performing recommendations from a very limited set of items.4 Once applied to this scenario, our model predicts that a user's preferences change with the number of movies of the same genre she has recently watched (e.g., she may get bored after seeing too many movies of a genre and then get interested again as time goes by without watching that genre). In order to verify this intuition, we sort the ratings of each user by their timestamps to produce an ordered sequence of ratings.5 For the different genres observed more than 10,000 times, we compute the average rating for each value of the recency function ρ(st, a) at the states st encountered in the dataset.

2We will refer to this strategy as "greedy" policy thereafter.
3In deterministic MDPs the optimal policy is a recurrent sequence of actions inducing a maximum-reward cycle over states.

Genre      d = 1   d = 2   d = 3   d = 4   d = 5   d = 6
action     0.55    0.74    0.79    0.81    0.81    0.82
comedy     0.77    0.85    0.88    0.90    0.90    0.91
drama      0.0     0.77    0.80    0.83    0.86    0.87
thriller   0.74    0.81    0.83    0.91    0.91    0.91

Table 1: R2 for the different genres and values of d on movielens-100k and a window w = 10.

The charts of Fig. 1 provide a first qualitative support for our model. 
The rating for comedy, action, and thriller genres is a\nmonotonically decreasing function of the recency, hinting to the existence of a boredom-effect, so\nthat the rating of a genre decreases with how many movies of that kind have been recently watched.\nOn the other hand, drama shows a more sophisticated behavior where users \u201cdiscover\u201d the genre and\nincrease the ratings as they watch more movies, but get bored if they recently watched \u201ctoo many\u201d\ndrama movies. This suggests that in this case there is a critical frequency at which users enjoy movies\nof this genre. In order to capture the dependency between rating and recency for different genres, in\nEq. 1 we de\ufb01ned the reward as a polynomial of \u03c1(st, a) with coef\ufb01cients that are speci\ufb01c for each\naction a. In Table 1 we report the coef\ufb01cient of determination R2 of \ufb01tting the model of Eq. 1 to\nthe dataset for different genres and values of d. The results show how our model becomes more and\nmore accurate as we increase its complexity. We also notice that even polynomials of small degree\n(from d = 4 the R2 tends to plateau) actually produce accurate reward predictions, suggesting that\nthe recency does really capture the key elements of the state s and that a relatively simple function of\n\u03c1 is enough to accurately predict the rating. This result also suggests that standard approaches in RS,\nsuch as matrix factorization, where the rating is contextual (as it depends on features of both users\nand movies/genres) but static, potentially ignore a critical dimension of the problem that is related to\nthe dynamics of the recommendation process itself.\n\n4 Linear Upper-Con\ufb01dence bound for Reinforcement Learning\nThe Learning Algorithm. LINUCRL directly builds on the UCRL algorithm [10] and exploits the\nlinear structure of the reward function and the deterministic and known transition function f. 
The core idea of LINUCRL is to construct confidence intervals on the reward function and apply the optimism-in-face-of-uncertainty principle to compute an optimistic policy. The structure of LINUCRL is illustrated in Alg. 1. Let us consider an episode k starting at time t: LINUCRL first uses the current samples collected for each action a separately to compute an estimate θ̂_{t,a} by regularized least squares, i.e.,

θ̂_{t,a} = arg min_θ Σ_τ (x_{s_τ,a}^T θ − r_τ)^2 + λ‖θ‖^2,    (4)

[. . .] η^{π̃_k}(M̃_k)) ≤ t^{−α}.    (7)

We are now ready to prove the main result.

Proof of Thm. 1. We follow similar steps as in [10]. We split the regret over episodes as

Δ(A, T) = Σ_{k=1}^{m} Σ_{t=t_k}^{t_{k+1}−1} (η* − r(st, at)) = Σ_{k=1}^{m} Δ_k.

Let T_{k,a} = {t_k ≤ t < t_{k+1} : at = a} be the steps when action a is selected during episode k. We upper bound the per-episode regret as

Δ_k = Σ_{a∈[K]} Σ_{t∈T_{k,a}} (η* − r(st, a)) ≤ Σ_{a∈[K]} Σ_{t∈T_{k,a}} (η̃_k − r̃_k(st, a)) + Σ_{a∈[K]} Σ_{t∈T_{k,a}} (r̃_k(st, a) − r(st, a)),    (8)

where the inequality directly follows from the event that η̃_k ≥ η* (Eq. 7) with probability 1 − T^{−α}. Notice that the low-probability event of failing confidence intervals can be treated as in [10]. We proceed by bounding the first term of Eq. 8. Unlike in the general online learning scenario, in our setting the transition function f is known and thus the regret incurred from bad estimates of the dynamics is reduced to zero. 
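The estimate of Eq. 4 has the usual ridge-regression closed form, and the quantity ‖x‖_{V^{-1}} appearing in the proof is the width of the corresponding confidence interval. A minimal numpy sketch (the synthetic data, dimensions, and λ below are illustrative assumptions):

```python
# Sketch of the regularized least-squares estimate of Eq. 4 in closed form:
# theta_hat = (X^T X + lambda I)^{-1} X^T r, with V = X^T X + lambda I.
# The synthetic data, dimensions, and lambda below are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)
d, lam = 2, 0.1
theta_true = np.array([3.0, -1.0, 0.2])          # plays the role of theta*_a

rho = rng.uniform(0, 2, size=500)                # observed recency values
X = np.vander(rho, N=d + 1, increasing=True)     # rows x = [1, rho, rho^2]
r = X @ theta_true + 0.1 * rng.standard_normal(500)

V = X.T @ X + lam * np.eye(d + 1)                # V_{t,a} in the proof
theta_hat = np.linalg.solve(V, X.T @ r)          # solution of Eq. 4
print(theta_hat)                                 # close to theta_true

x = np.array([1.0, 0.5, 0.25])                   # context of one (state, action)
width = np.sqrt(x @ np.linalg.solve(V, x))       # ||x||_{V^{-1}}: confidence width
```

The optimistic reward used by the algorithm then adds a bonus proportional to this width to the point estimate x^T θ̂.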
Furthermore, since we are dealing with deterministic MDPs, the optimal policy converges to a loop over states. When starting a new policy, we may start from a state outside its loop. Nonetheless, it is easy to verify that starting from any state s, it is always possible to reach any desired state s' in at most w steps (i.e., the size of the history window). As a result, within each episode k the difference between the cumulative reward (Σ_t r̃_k(st, a)) and the (optimistic) average reward ((t_{k+1} − t_k) η̃_k) in the loop never exceeds w. Furthermore, since episodes terminate when one action doubles its number of samples, using a similar proof as [10], we have that the number of episodes is bounded as m ≤ K log_2(8T/K). As a result, the contribution of the first term of Eq. 8 to the overall regret is bounded as

Σ_{k=1}^{m} Σ_{t=t_k}^{t_{k+1}−1} (η̃_k − r̃_k(st, a)) ≤ K w log_2(8T/K).

The second term in Eq. 8 refers to the (cumulative) reward estimation error and it can be decomposed as

|r̃_k(st, a) − r(st, a)| ≤ |r̃_k(st, a) − r̂_k(st, a)| + |r̂_k(st, a) − r(st, a)|.

We can bound the cumulative sum of the second term as (similarly for the first, since r̃_k belongs to the confidence interval of r̂_k by construction)

Σ_{k=1}^{m} Σ_{a∈[K]} Σ_{t∈T_{k,a}} |r̂_k(st, a) − r(st, a)| ≤ Σ_{k=1}^{m} Σ_{a∈[K]} Σ_{t∈T_{k,a}} c_{t,a} ‖x_{st,a}‖_{V_{a,t}^{−1}} ≤ c_max Σ_{a∈[K]} √T_a √( Σ_{k=1}^{m} Σ_{t∈T_{k,a}} ‖x_{st,a}‖²_{V_{a,t}^{−1}} ),

where the first inequality follows from Prop. 
2 with probability 1 − T^{−α}, and T_a is the total number of times a has been selected up to step T. Let T_a = ∪_k T_{k,a}; then, using Lemma 11 of [1], we have

Σ_{t∈T_a} ‖x_{st,a}‖²_{V_{t,a}^{−1}} ≤ 2 log( det(V_{T,a}) / det(λI) ),

and by Lem. 10 of [1], we have

det(V_{t,a}) ≤ (λ + t L_w² / (d + 1))^{d+1},

which leads to

Σ_{k=1}^{m} Σ_{a∈[K]} Σ_{t∈T_{k,a}} |r̂_k(st, a) − r(st, a)| ≤ c_max Σ_{a∈[K]} √T_a √( 2(d + 1) log( (λ + t L_w²) / (λ(d + 1)) ) ) ≤ c_max √( 2 K T (d + 1) log( (λ + t L_w²) / (λ(d + 1)) ) ).

Bringing all the terms together gives the regret bound.

B Experiments Details

Genre       θ*_{a,0}  θ*_{a,1}  θ*_{a,2}  θ*_{a,3}  θ*_{a,4}  θ*_{a,5}
Action      3.1       0.54      -1.08     0.78      -0.22     0.02
Comedy      3.34      0.54      -1.08     0.78      -0.22     0.02
Adventure   3.51      0.86      -2.7      3.06      -1.46     0.24
Thriller    3.4       1.26      -2.9      2.76      -1.14     0.16
Drama       2.75      1.0       0.94      -1.86     0.94      -0.16
Children    3.52      0.1       0.0       -0.3      0.2       -0.04
Crime       3.37      0.32      1.12      -3.0      2.26      -0.54
Horror      3.54      -0.68     1.84      -2.04     0.82      -0.12
SciFi       3.3       0.64      -1.32     1.1       -0.38     0.02
Animation   3.4       1.38      -3.44     3.62      -1.62     0.24

Table 3: Reward parameters of each genre for the movielens experiment.

The parameters used in the MovieLens experiment are reported in Table 3.
", "award": [], "sourceid": 883, "authors": [{"given_name": "Romain", "family_name": "WARLOP", "institution": "Inria"}, {"given_name": "Alessandro", "family_name": "Lazaric", "institution": "INRIA"}, {"given_name": "J\u00e9r\u00e9mie", "family_name": "Mary", "institution": null}]}