{"title": "A Geometric Perspective on Optimal Representations for Reinforcement Learning", "book": "Advances in Neural Information Processing Systems", "page_first": 4358, "page_last": 4369, "abstract": "We propose a new perspective on representation learning in reinforcement learning based on geometric properties of the space of value functions. From there, we provide formal evidence regarding the usefulness of value functions as auxiliary tasks in reinforcement learning. Our formulation considers adapting the representation to minimize the (linear) approximation error of the value function of all stationary policies for a given environment. We show that this optimization reduces to making accurate predictions regarding a special class of value functions which we call adversarial value functions (AVFs). We demonstrate that using value functions as auxiliary tasks corresponds to an expected-error relaxation of our formulation, with AVFs a natural candidate, and identify a close relationship with proto-value functions (Mahadevan, 2005). 
We highlight characteristics of AVFs and their usefulness as auxiliary tasks in a series of experiments on the four-room domain.", "full_text": "AGeometricPerspectiveonOptimalRepresentationsforReinforcementLearningMarcG.Bellemare1,WillDabney2,RobertDadashi1,AdrienAliTaiga1,3,PabloSamuelCastro1,NicolasLeRoux1,DaleSchuurmans1,4,TorLattimore2,ClareLyle5AbstractWeproposeanewperspectiveonrepresentationlearninginreinforcementlearningbasedongeometricpropertiesofthespaceofvaluefunctions.Weleveragethisperspectivetoprovideformalevidenceregardingtheusefulnessofvaluefunctionsasauxiliarytasks.Ourformulationconsidersadaptingtherepresentationtomini-mizethe(linear)approximationofthevaluefunctionofallstationarypoliciesforagivenenvironment.Weshowthatthisoptimizationreducestomakingaccuratepredictionsregardingaspecialclassofvaluefunctionswhichwecalladversarialvaluefunctions(AVFs).Wedemonstratethatusingvaluefunctionsasauxiliarytaskscorrespondstoanexpected-errorrelaxationofourformulation,withAVFsanaturalcandidate,andidentifyacloserelationshipwithproto-valuefunctions(Mahadevan,2005).WehighlightcharacteristicsofAVFsandtheirusefulnessasauxiliarytasksinaseriesofexperimentsonthefour-roomdomain.1IntroductionAgoodrepresentationofstateiskeytopracticalsuccessinreinforcementlearning.Whileearlyapplicationsusedhand-engineeredfeatures(e.g.Samuel,1959),thesehaveprovenoneroustogenerateanddif\ufb01culttoscale.Asaresult,methodsinrepresentationlearninghave\ufb02ourished,rangingfrombasisadaptation(Menacheetal.,2005;Kelleretal.,2006),gradient-basedlearning(YuandBertsekas,2009),proto-valuefunctions(MahadevanandMaggioni,2007),featuregenerationschemessuchastilecoding(Sutton,1996)andthedomain-independentfeaturesusedinsomeAtari2600game-playingagents(Bellemareetal.,2013;Liangetal.,2016),andnonparametricmethods(Ernstetal.,2005;Farahmandetal.,2016;Tosattoetal.,2017).Today,themethodofchoiceisdeeplearning.Deeplearninghasmadeitsmarkbyshowingitcanlearncomplexrepresentationsofrelativelyunprocessedinputsusinggradient-bas
edoptimization(Tesauro,1995;Mnihetal.,2015;Silveretal.,2016).Mostcurrentdeepreinforcementlearningmethodsaugmenttheirmainobjectivewithadditionallossescalledauxiliarytasks,typicallywiththeaimoffacilitatingandregularizingtherepresentationlearningprocess.TheUNREALalgorithm,forexample,makespredictionsaboutfuturepixelvalues(Jaderbergetal.,2017);recentworkapproximatesaone-steptransitionmodeltoachieveasimilareffect(Franc\u00b8ois-Lavetetal.,2018;Geladaetal.,2019).Thegoodempiricalperformanceofdistributionalreinforcementlearning(Bellemareetal.,2017)hasalsobeenattributedtorepresentationlearningeffects,withrecentvisualizationssupportingthisclaim(Suchetal.,2019).However,whilethereisnowconclusiveempiricalevidenceoftheusefulnessofauxiliarytasks,theirdesignandjusti\ufb01cationremainonthewholead-hoc.Oneofourmaincontributionsistoprovidesaformalframeworkinwhichtoreasonaboutauxiliarytasksinreinforcementlearning.Webeginbyformulatinganoptimizationproblemwhosesolutionisaformofoptimalrepresentation.Speci\ufb01cally,weseekastaterepresentationfromwhichwecanbestapproximatethevaluefunctionofanystationarypolicyforagivenMarkovDecisionProcess.Simultaneously,thelargestapproximation1GoogleResearch2DeepMind3Mila,Universit\u00b4edeMontr\u00b4eal4UniversityofAlberta5UniversityofOxford33rdConferenceonNeuralInformationProcessingSystems(NeurIPS2019),Vancouver,Canada.\ferrorinthatclassservesasameasureofthequalityoftherepresentation.Whileourapproachmayappearnaive\u2013inrealsettings,mostpoliciesareuninterestingandhencemaydistracttherepresentationlearningprocess\u2013weshowthatourrepresentationlearningproblemcaninfactberestrictedtoaspecialsubsetofvaluefunctionswhichwecalladversarialvaluefunctions(AVFs).Wethencharacterizetheseadversarialvaluefunctionsandshowtheycorrespondtodeterministicpoliciesthateitherminimizeormaximizetheexpectedreturnateachstate,basedonthesolutionofanetwork-\ufb02owoptimizationderivedfromaninterestfunction\u03b4.Aconsequenceofourworkistoformalizewhypredictingvaluefunction-likeobjectsishe
lpfulinlearningrepresentations,ashasbeenarguedinthepast(Suttonetal.,2011,2016).Weshowhowusingthesepredictionsasauxiliarytaskscanbeinterpretedasarelaxationofouroptimizationproblem.Fromouranalysis,wehypothesizethatauxiliarytasksthatresembleadversarialvaluefunctionsshouldgiverisetogoodrepresentationsinpractice.Wecomplementourtheoreticalresultswithanempiricalstudyinasimplegridworldenvironment,focusingontheuseofdeeplearningtechniquestolearnrepresentations.We\ufb01ndthatpredictingadversarialvaluefunctionsasauxiliarytasksleadstorichrepresentations.2SettingWeconsideranenvironmentdescribedbyaMarkovDecisionProcesshX,A,r,P,\u03b3i(Puterman,1994);XandAare\ufb01nitestateandactionspaces,P:X\u00d7A\u2192P(X)isthetransitionfunction,\u03b3thediscountfactor,andr:X\u2192Rtherewardfunction.Fora\ufb01nitesetS,writeP(S)fortheprobabilitysimplexoverS.A(stationary)policy\u03c0isamappingX\u2192P(A),alsodenoted\u03c0(a|x).WedenotethesetofpoliciesbyP=P(A)X.Wecombineapolicy\u03c0withthetransitionfunctionPtoobtainthestate-to-statetransitionfunctionP\u03c0(x0|x):=Pa\u2208A\u03c0(a|x)P(x0|x,a).ThevaluefunctionV\u03c0describestheexpecteddiscountedsumofrewardsobtainedbyfollowing\u03c0:V\u03c0(x)=Eh\u221eXt=0\u03b3tr(xt)(cid:12)(cid:12)x0=x,xt+1\u223cP\u03c0(\u00b7|xt)i.Thevaluefunctionsatis\ufb01esBellman\u2019sequation(Bellman,1957):V\u03c0(x)=r(x)+\u03b3EP\u03c0V\u03c0(x0).Assumingtherearen=|X|states,weviewrandV\u03c0asvectorsinRnandP\u03c0\u2208Rn\u00d7n,suchthatV\u03c0=r+\u03b3P\u03c0V\u03c0=(I\u2212\u03b3P\u03c0)\u22121r.Ad-dimensionalrepresentationisamapping\u03c6:X\u2192Rd;\u03c6(x)isthefeaturevectorforstatex.Wewrite\u03a6\u2208Rn\u00d7dtodenotethematrixwhoserowsare\u03c6(X),andwithsomeabuseofnotationdenotethesetofd-dimensionalrepresentationsbyR\u2261Rn\u00d7d.Foragivenrepresentationandweightvector\u03b8\u2208Rd,thelinearapproximationforavaluefunctionis\u02c6V\u03c6,\u03b8(x):=\u03c6(x)>\u03b8.(1)Weconsidertheapproximationminimizingtheuniformlyweightedsquarederror(cid:13)(cid:13)\u02c6V\u03c6
,\u03b8\u2212V\u03c0(cid:13)(cid:13)22=Xx\u2208X(\u03c6(x)>\u03b8\u2212V\u03c0(x))2.Wedenoteby\u02c6V\u03c0\u03c6theprojectionofV\u03c0ontothelinearsubspaceH=(cid:8)\u03a6\u03b8:\u03b8\u2208Rd(cid:9).2.1Two-PartNetworksMostdeepnetworksusedinvalue-basedreinforcementlearningcanbemodelledastwointeractingparts\u03c6and\u03b8whichgiverisetoalinearapproximation(Figure1,left).Here,therepresentation\u03c6canalsobeadjustedandisalmostalwaysnonlinearinx.Two-partnetworksareasimpleframeworkinwhichtostudythebehaviourofrepresentationlearningindeepreinforcementlearning.Wewillespeciallyconsidertheuseof\u03c6(x)tomakeadditionalpredictions,calledauxiliarytasksfollowingcommonusage,andwhosepurposeistoimproveorstabilizetherepresentation.Westudytwo-partnetworksinanidealizedsettingwherethelengthdof\u03c6(x)is\ufb01xedandsmallerthann,butthemappingisotherwiseunconstrained.Eventhisidealizeddesignoffersinteresting2\fRepresentation(x)<latexit sha1_base64=\"bO+I8PXml61+kiq1+jHcoWIIiWY=\">AAAHlXicfZXbjts2EIaVQ+N0kzZJc9GL3gg1AqTFIrCCAC1SBMjBbRp0d+Pd7Cm7MgyKGlmESVEh6dReRg+R2/bJ8jYdyvKuDmkE2BrN9484HA7FKOdMm8Hg06XLV65+da13/euNGze/+fbW7TvfHWo5VxQOqORSHUdEA2cZHBhmOBznCoiIOBxFsxeOH70HpZnM9s0yh7Eg04wljBKDrqMwT9n9xU+T2/3Bg0F5+V0jqIy+V12jyZ1rizCWdC4gM5QTrU+DQW7GlijDKIdiI5xryAmdkSmcopkRAXpsy3wL/x56Yj+RCn+Z8UtvPcISofVSRKgUxKS6zZzz/5hJRWN061RKJ7qVk0l+HVuW5XMDGV2llMy5b6TvyuTHTAE1fIkGoYrhrHyaEkWowWI2x+RTiYL2sOduRptAAdfsrFWi+D3LdVWkxapKjaAI374RZvA3lUKQLLYhfVPY0M2NEm7fFEWLHtfocYe+rdG3HfqsRp916F6N7nXosEaHHXpSoycdul2j2x26X6P7HbpVo1tFu1h6t4Z328H6sEYPO8HRToWjyO60Y3NQosKaKjvqRNOIKBRsfgg32znHFQo/wzQI5thvDjXZIivsYmKDx5kb7F4YQ+KHsMixX/3zRH93iQ4B96aCbXS9xkSJkepnW0kLu9KiYApqHZj4GNjKZaXXxdoKM8mZYOjBxMrBVSrP8A14m9jQwMJoas/K96yxbmPdwKbCZu07E2Th25MLPT5fRMQz7tvhBfxrq1jX3aQgFQjLAZdlC3AKRcOvnH8PBFGzJqC45ewLqSTHiiybzG1uu796aK9GJrEBToOxDTkkxg8/+P0A58SmqXtolZJlyZf1Eycxy1ZYxOewinETLj8PtvTZftBeLZbF+MUpxVUrvCom1WCntbHGZcU+3yHYEoJhl+E93HTWl4S4MJUQrXVHjF76q2SVsKOX5wuX7qA7JcbfWXuG7t+G0p1RYPCTYR8/OZcP4V2TOYRnVdA+mbrG4cMHAdq7j/pPn1en1nXvB+9
H774XeL94T70/vZF34FFv5n30/vH+7X3fe9Ib9v5YSS9fqmLueo2r9/o/pEHA3Q==</latexit><latexit sha1_base64=\"bO+I8PXml61+kiq1+jHcoWIIiWY=\">AAAHlXicfZXbjts2EIaVQ+N0kzZJc9GL3gg1AqTFIrCCAC1SBMjBbRp0d+Pd7Cm7MgyKGlmESVEh6dReRg+R2/bJ8jYdyvKuDmkE2BrN9484HA7FKOdMm8Hg06XLV65+da13/euNGze/+fbW7TvfHWo5VxQOqORSHUdEA2cZHBhmOBznCoiIOBxFsxeOH70HpZnM9s0yh7Eg04wljBKDrqMwT9n9xU+T2/3Bg0F5+V0jqIy+V12jyZ1rizCWdC4gM5QTrU+DQW7GlijDKIdiI5xryAmdkSmcopkRAXpsy3wL/x56Yj+RCn+Z8UtvPcISofVSRKgUxKS6zZzz/5hJRWN061RKJ7qVk0l+HVuW5XMDGV2llMy5b6TvyuTHTAE1fIkGoYrhrHyaEkWowWI2x+RTiYL2sOduRptAAdfsrFWi+D3LdVWkxapKjaAI374RZvA3lUKQLLYhfVPY0M2NEm7fFEWLHtfocYe+rdG3HfqsRp916F6N7nXosEaHHXpSoycdul2j2x26X6P7HbpVo1tFu1h6t4Z328H6sEYPO8HRToWjyO60Y3NQosKaKjvqRNOIKBRsfgg32znHFQo/wzQI5thvDjXZIivsYmKDx5kb7F4YQ+KHsMixX/3zRH93iQ4B96aCbXS9xkSJkepnW0kLu9KiYApqHZj4GNjKZaXXxdoKM8mZYOjBxMrBVSrP8A14m9jQwMJoas/K96yxbmPdwKbCZu07E2Th25MLPT5fRMQz7tvhBfxrq1jX3aQgFQjLAZdlC3AKRcOvnH8PBFGzJqC45ewLqSTHiiybzG1uu796aK9GJrEBToOxDTkkxg8/+P0A58SmqXtolZJlyZf1Eycxy1ZYxOewinETLj8PtvTZftBeLZbF+MUpxVUrvCom1WCntbHGZcU+3yHYEoJhl+E93HTWl4S4MJUQrXVHjF76q2SVsKOX5wuX7qA7JcbfWXuG7t+G0p1RYPCTYR8/OZcP4V2TOYRnVdA+mbrG4cMHAdq7j/pPn1en1nXvB+9H774XeL94T70/vZF34FFv5n30/vH+7X3fe9Ib9v5YSS9fqmLueo2r9/o/pEHA3Q==</latexit><latexit 
sha1_base64=\"bO+I8PXml61+kiq1+jHcoWIIiWY=\">AAAHlXicfZXbjts2EIaVQ+N0kzZJc9GL3gg1AqTFIrCCAC1SBMjBbRp0d+Pd7Cm7MgyKGlmESVEh6dReRg+R2/bJ8jYdyvKuDmkE2BrN9484HA7FKOdMm8Hg06XLV65+da13/euNGze/+fbW7TvfHWo5VxQOqORSHUdEA2cZHBhmOBznCoiIOBxFsxeOH70HpZnM9s0yh7Eg04wljBKDrqMwT9n9xU+T2/3Bg0F5+V0jqIy+V12jyZ1rizCWdC4gM5QTrU+DQW7GlijDKIdiI5xryAmdkSmcopkRAXpsy3wL/x56Yj+RCn+Z8UtvPcISofVSRKgUxKS6zZzz/5hJRWN061RKJ7qVk0l+HVuW5XMDGV2llMy5b6TvyuTHTAE1fIkGoYrhrHyaEkWowWI2x+RTiYL2sOduRptAAdfsrFWi+D3LdVWkxapKjaAI374RZvA3lUKQLLYhfVPY0M2NEm7fFEWLHtfocYe+rdG3HfqsRp916F6N7nXosEaHHXpSoycdul2j2x26X6P7HbpVo1tFu1h6t4Z328H6sEYPO8HRToWjyO60Y3NQosKaKjvqRNOIKBRsfgg32znHFQo/wzQI5thvDjXZIivsYmKDx5kb7F4YQ+KHsMixX/3zRH93iQ4B96aCbXS9xkSJkepnW0kLu9KiYApqHZj4GNjKZaXXxdoKM8mZYOjBxMrBVSrP8A14m9jQwMJoas/K96yxbmPdwKbCZu07E2Th25MLPT5fRMQz7tvhBfxrq1jX3aQgFQjLAZdlC3AKRcOvnH8PBFGzJqC45ewLqSTHiiybzG1uu796aK9GJrEBToOxDTkkxg8/+P0A58SmqXtolZJlyZf1Eycxy1ZYxOewinETLj8PtvTZftBeLZbF+MUpxVUrvCom1WCntbHGZcU+3yHYEoJhl+E93HTWl4S4MJUQrXVHjF76q2SVsKOX5wuX7qA7JcbfWXuG7t+G0p1RYPCTYR8/OZcP4V2TOYRnVdA+mbrG4cMHAdq7j/pPn1en1nXvB+9H774XeL94T70/vZF34FFv5n30/vH+7X3fe9Ib9v5YSS9fqmLueo2r9/o/pEHA3Q==</latexit><latexit 
sha1_base64=\"bO+I8PXml61+kiq1+jHcoWIIiWY=\">AAAHlXicfZXbjts2EIaVQ+N0kzZJc9GL3gg1AqTFIrCCAC1SBMjBbRp0d+Pd7Cm7MgyKGlmESVEh6dReRg+R2/bJ8jYdyvKuDmkE2BrN9484HA7FKOdMm8Hg06XLV65+da13/euNGze/+fbW7TvfHWo5VxQOqORSHUdEA2cZHBhmOBznCoiIOBxFsxeOH70HpZnM9s0yh7Eg04wljBKDrqMwT9n9xU+T2/3Bg0F5+V0jqIy+V12jyZ1rizCWdC4gM5QTrU+DQW7GlijDKIdiI5xryAmdkSmcopkRAXpsy3wL/x56Yj+RCn+Z8UtvPcISofVSRKgUxKS6zZzz/5hJRWN061RKJ7qVk0l+HVuW5XMDGV2llMy5b6TvyuTHTAE1fIkGoYrhrHyaEkWowWI2x+RTiYL2sOduRptAAdfsrFWi+D3LdVWkxapKjaAI374RZvA3lUKQLLYhfVPY0M2NEm7fFEWLHtfocYe+rdG3HfqsRp916F6N7nXosEaHHXpSoycdul2j2x26X6P7HbpVo1tFu1h6t4Z328H6sEYPO8HRToWjyO60Y3NQosKaKjvqRNOIKBRsfgg32znHFQo/wzQI5thvDjXZIivsYmKDx5kb7F4YQ+KHsMixX/3zRH93iQ4B96aCbXS9xkSJkepnW0kLu9KiYApqHZj4GNjKZaXXxdoKM8mZYOjBxMrBVSrP8A14m9jQwMJoas/K96yxbmPdwKbCZu07E2Th25MLPT5fRMQz7tvhBfxrq1jX3aQgFQjLAZdlC3AKRcOvnH8PBFGzJqC45ewLqSTHiiybzG1uu796aK9GJrEBToOxDTkkxg8/+P0A58SmqXtolZJlyZf1Eycxy1ZYxOewinETLj8PtvTZftBeLZbF+MUpxVUrvCom1WCntbHGZcU+3yHYEoJhl+E93HTWl4S4MJUQrXVHjF76q2SVsKOX5wuX7qA7JcbfWXuG7t+G0p1RYPCTYR8/OZcP4V2TOYRnVdA+mbrG4cMHAdq7j/pPn1en1nXvB+9H774XeL94T70/vZF34FFv5n30/vH+7X3fe9Ib9v5YSS9fqmLueo2r9/o/pEHA3Q==</latexit>x<latexit 
sha1_base64=\"l03uCu57NEwOx4KA5SzuAFGPBH4=\">AAAIQXicfZXfb9s2EMfVbqu7bOva9XEvxIwCQ2EEVrFhRYsC7eqhG5ZkTpqkaSLDoKiTTYSUVJLu7LD6C/a6/UX7K/Yn7G3Y6152lGRHP9oKsHW6z/fE4/FEhpng2gyHf125+sGHH13rXf9465NPP7vx+c1bXxzrdKEYHLFUpOokpBoET+DIcCPgJFNAZSjgRXj+1PEXr0FpniaHZpXBRNJZwmPOqEHX/nJ6sz/cHhYX6Rp+ZfS96hpPb/XuBlHKFhISwwTV+swfZmZiqTKcCci3goWGjLJzOoMzNBMqQU9skWlO7qAnInGq8JcYUnjrEZZKrVcyRKWkZq7bzDnfxcxcNka3TqV0rFs5mfj+xPIkWxhIWJlSvBDEpMQViERcATNihQZliuOsCJtTRZnBMjbHFLMUBe1hN27OmkCB0PyiVaLoNc90VaRlWaVGUIhv3woS+JWlUtIksgF7ntvAzY1RYZ/neYue1OhJh76s0Zcd+qRGn3ToQY0edOioRkcdelqjpx26W6O7HXpYo4cdulOjO3m7WHq/hvfbwfq4Ro87weFehcPQ7rVjM1CywpopO+5Es5AqFAzeBIN2zlGFgrcwDZI79tChJlsmuV1Orf8gcYPdCSKISQDLDPuVbBL9wSU6Avw2Feyi6xdMlJpU3bWVNLelFgUzUOvAmGBgK5dSr/O1FSSp4JKjBxMrBlfz9ALfgLepDQwsjWb2onjPGus21g1sKmzWvgtJl8SeXurx+TIiOhfEji7hzzv5uu5mDqkCaQXgsuwATiFv+JXzH4Ck6rwJGH5y9mmqUoEVWTWZ+7jtYfnQXo0kxQY48yc2EBAbErwhfR/nxGdz99AqJU/i9+unTmJWnbAId5EiqFren/Jp9YKzWvwkr5YEj4FiE7ERzlNBlFs1C3M73L4/wF19MMy7qpkCSDY61Gx/67StTEKxgDIRV/oyuvDZvt/uGzduS7pJp6sux3+LvkqsiNh6V09jE0uOKrwHA2e9T4itVAnRWvfw+BkpR1bSjp9tWm2+h+45NWRv7Rm5fxuk7jwFg5ucffBoIx/BqyZzCE9Xv32Wdo3je9s+2vvf9B9/X52z170vva+8rz3f+8577P3ojb0jj3ng/eb97v3R+7P3d++f3r+l9OqVKua217h6//0PuMIE2A==</latexit><latexit 
sha1_base64=\"l03uCu57NEwOx4KA5SzuAFGPBH4=\">AAAIQXicfZXfb9s2EMfVbqu7bOva9XEvxIwCQ2EEVrFhRYsC7eqhG5ZkTpqkaSLDoKiTTYSUVJLu7LD6C/a6/UX7K/Yn7G3Y6152lGRHP9oKsHW6z/fE4/FEhpng2gyHf125+sGHH13rXf9465NPP7vx+c1bXxzrdKEYHLFUpOokpBoET+DIcCPgJFNAZSjgRXj+1PEXr0FpniaHZpXBRNJZwmPOqEHX/nJ6sz/cHhYX6Rp+ZfS96hpPb/XuBlHKFhISwwTV+swfZmZiqTKcCci3goWGjLJzOoMzNBMqQU9skWlO7qAnInGq8JcYUnjrEZZKrVcyRKWkZq7bzDnfxcxcNka3TqV0rFs5mfj+xPIkWxhIWJlSvBDEpMQViERcATNihQZliuOsCJtTRZnBMjbHFLMUBe1hN27OmkCB0PyiVaLoNc90VaRlWaVGUIhv3woS+JWlUtIksgF7ntvAzY1RYZ/neYue1OhJh76s0Zcd+qRGn3ToQY0edOioRkcdelqjpx26W6O7HXpYo4cdulOjO3m7WHq/hvfbwfq4Ro87weFehcPQ7rVjM1CywpopO+5Es5AqFAzeBIN2zlGFgrcwDZI79tChJlsmuV1Orf8gcYPdCSKISQDLDPuVbBL9wSU6Avw2Feyi6xdMlJpU3bWVNLelFgUzUOvAmGBgK5dSr/O1FSSp4JKjBxMrBlfz9ALfgLepDQwsjWb2onjPGus21g1sKmzWvgtJl8SeXurx+TIiOhfEji7hzzv5uu5mDqkCaQXgsuwATiFv+JXzH4Ck6rwJGH5y9mmqUoEVWTWZ+7jtYfnQXo0kxQY48yc2EBAbErwhfR/nxGdz99AqJU/i9+unTmJWnbAId5EiqFren/Jp9YKzWvwkr5YEj4FiE7ERzlNBlFs1C3M73L4/wF19MMy7qpkCSDY61Gx/67StTEKxgDIRV/oyuvDZvt/uGzduS7pJp6sux3+LvkqsiNh6V09jE0uOKrwHA2e9T4itVAnRWvfw+BkpR1bSjp9tWm2+h+45NWRv7Rm5fxuk7jwFg5ucffBoIx/BqyZzCE9Xv32Wdo3je9s+2vvf9B9/X52z170vva+8rz3f+8577P3ojb0jj3ng/eb97v3R+7P3d++f3r+l9OqVKua217h6//0PuMIE2A==</latexit><latexit 
sha1_base64=\"l03uCu57NEwOx4KA5SzuAFGPBH4=\">AAAIQXicfZXfb9s2EMfVbqu7bOva9XEvxIwCQ2EEVrFhRYsC7eqhG5ZkTpqkaSLDoKiTTYSUVJLu7LD6C/a6/UX7K/Yn7G3Y6152lGRHP9oKsHW6z/fE4/FEhpng2gyHf125+sGHH13rXf9465NPP7vx+c1bXxzrdKEYHLFUpOokpBoET+DIcCPgJFNAZSjgRXj+1PEXr0FpniaHZpXBRNJZwmPOqEHX/nJ6sz/cHhYX6Rp+ZfS96hpPb/XuBlHKFhISwwTV+swfZmZiqTKcCci3goWGjLJzOoMzNBMqQU9skWlO7qAnInGq8JcYUnjrEZZKrVcyRKWkZq7bzDnfxcxcNka3TqV0rFs5mfj+xPIkWxhIWJlSvBDEpMQViERcATNihQZliuOsCJtTRZnBMjbHFLMUBe1hN27OmkCB0PyiVaLoNc90VaRlWaVGUIhv3woS+JWlUtIksgF7ntvAzY1RYZ/neYue1OhJh76s0Zcd+qRGn3ToQY0edOioRkcdelqjpx26W6O7HXpYo4cdulOjO3m7WHq/hvfbwfq4Ro87weFehcPQ7rVjM1CywpopO+5Es5AqFAzeBIN2zlGFgrcwDZI79tChJlsmuV1Orf8gcYPdCSKISQDLDPuVbBL9wSU6Avw2Feyi6xdMlJpU3bWVNLelFgUzUOvAmGBgK5dSr/O1FSSp4JKjBxMrBlfz9ALfgLepDQwsjWb2onjPGus21g1sKmzWvgtJl8SeXurx+TIiOhfEji7hzzv5uu5mDqkCaQXgsuwATiFv+JXzH4Ck6rwJGH5y9mmqUoEVWTWZ+7jtYfnQXo0kxQY48yc2EBAbErwhfR/nxGdz99AqJU/i9+unTmJWnbAId5EiqFren/Jp9YKzWvwkr5YEj4FiE7ERzlNBlFs1C3M73L4/wF19MMy7qpkCSDY61Gx/67StTEKxgDIRV/oyuvDZvt/uGzduS7pJp6sux3+LvkqsiNh6V09jE0uOKrwHA2e9T4itVAnRWvfw+BkpR1bSjp9tWm2+h+45NWRv7Rm5fxuk7jwFg5ucffBoIx/BqyZzCE9Xv32Wdo3je9s+2vvf9B9/X52z170vva+8rz3f+8577P3ojb0jj3ng/eb97v3R+7P3d++f3r+l9OqVKua217h6//0PuMIE2A==</latexit><latexit 
sha1_base64=\"l03uCu57NEwOx4KA5SzuAFGPBH4=\">AAAIQXicfZXfb9s2EMfVbqu7bOva9XEvxIwCQ2EEVrFhRYsC7eqhG5ZkTpqkaSLDoKiTTYSUVJLu7LD6C/a6/UX7K/Yn7G3Y6152lGRHP9oKsHW6z/fE4/FEhpng2gyHf125+sGHH13rXf9465NPP7vx+c1bXxzrdKEYHLFUpOokpBoET+DIcCPgJFNAZSjgRXj+1PEXr0FpniaHZpXBRNJZwmPOqEHX/nJ6sz/cHhYX6Rp+ZfS96hpPb/XuBlHKFhISwwTV+swfZmZiqTKcCci3goWGjLJzOoMzNBMqQU9skWlO7qAnInGq8JcYUnjrEZZKrVcyRKWkZq7bzDnfxcxcNka3TqV0rFs5mfj+xPIkWxhIWJlSvBDEpMQViERcATNihQZliuOsCJtTRZnBMjbHFLMUBe1hN27OmkCB0PyiVaLoNc90VaRlWaVGUIhv3woS+JWlUtIksgF7ntvAzY1RYZ/neYue1OhJh76s0Zcd+qRGn3ToQY0edOioRkcdelqjpx26W6O7HXpYo4cdulOjO3m7WHq/hvfbwfq4Ro87weFehcPQ7rVjM1CywpopO+5Es5AqFAzeBIN2zlGFgrcwDZI79tChJlsmuV1Orf8gcYPdCSKISQDLDPuVbBL9wSU6Avw2Feyi6xdMlJpU3bWVNLelFgUzUOvAmGBgK5dSr/O1FSSp4JKjBxMrBlfz9ALfgLepDQwsjWb2onjPGus21g1sKmzWvgtJl8SeXurx+TIiOhfEji7hzzv5uu5mDqkCaQXgsuwATiFv+JXzH4Ck6rwJGH5y9mmqUoEVWTWZ+7jtYfnQXo0kxQY48yc2EBAbErwhfR/nxGdz99AqJU/i9+unTmJWnbAId5EiqFren/Jp9YKzWvwkr5YEj4FiE7ERzlNBlFs1C3M73L4/wF19MMy7qpkCSDY61Gx/67StTEKxgDIRV/oyuvDZvt/uGzduS7pJp6sux3+LvkqsiNh6V09jE0uOKrwHA2e9T4itVAnRWvfw+BkpR1bSjp9tWm2+h+45NWRv7Rm5fxuk7jwFg5ucffBoIx/BqyZzCE9Xv32Wdo3je9s+2vvf9B9/X52z170vva+8rz3f+8577P3ojb0jj3ng/eb97v3R+7P3d++f3r+l9OqVKua217h6//0PuMIE2A==</latexit>StateValue\u2713<latexit 
sha1_base64=\"ty2ZhM7TIrIs31mJvTNAn2S9FrI=\">AAAIRnicfZXdbts2FMfV7sNZ9tVul7sRZhQYCiOwig0rOgxoVw/dsCRz0sRNExkGRR3ZXEhJI+nODqt32O32RHuFvcTuht3uUJIdiWorwNbR+f2PeHh4REY5Z0oPh3/fuPnW2++829t5b/f9Dz786ONbtz+ZqGwpKZzSjGfyLCIKOEvhVDPN4SyXQETE4Vl0+djyZy9AKpalJ3qdw1SQecoSRolG1yTUC9Bkdqs/3BuWl981gtroe/U1nt3u3Q3jjC4FpJpyotRFMMz11BCpGeVQ7IZLBTmhl2QOF2imRICamjLdwr+DnthPMom/VPultxlhiFBqLSJUCqIXymXW+TqmF6I1urEqqRLl5KST+1PD0nypIaVVSsmS+zrzbZX8mEmgmq/RIFQynJVPF0QSqrGW7TH5PEOBO+zWzWgbSOCKXTklil+wXNVFWlVVagVF+PbdMIXfaCYESWMT0qeFCe3cKOHmaVE49KxBzzr0eYM+79BHDfqoQ48b9LhDRw066tDzBj3v0IMGPejQkwY96dD9Bt0v3GKpowY+coPVpEEnneDosMZRZA7d2BykqLGi0ow70TQiEgWDl+HAzTmuUfgKpkAwy76xqM1WaWFWMxM8SO1gd8IYEj+EVY796m8T/d4mOgL8NiUcoOtnTJToTN41tbQwlRYFc5CbwMTHQCeXSq+KjRWmGWeCoQcTKweXi+wK34C3mQk1rLSi5qp8zwYrF6sW1jXWG9+VICvfnF/r8fk6Ir7kvhldw5/2i03dcTvLJAjDAZdlH3AKRcsvrf8YBJGXbUDxkzOPM5lxrMi6zezHbU6qB3c10gwb4CKYmpBDov3wpd8PcE5svrAPTilZmrxZP7MSve6ExbiLlEH18v5YzOoXXDTip0W9JHgWlJuIiXGeEuLCyHlUmOHe/QHu6oNh0VXNJUC61aFm7yurdTKJ+BKqRGzpq+jSZ/qB2zd2XEe6TaerrsZ/hb5OrIzYfV1PYxMLhiq8hwNrvUmIrVQL0dr08PiJX40shRk/2bba4hDdC6L9w41nZP9NmNlDFTRucubBt1v5CH5tM4vwdA3cs7RrTO7tBWgffdl/+F19zu54n3mfe194gfe199D7wRt7px71fvF+9/7w/uz91fun92/vv0p680Yd86nXuna8/wEh7AYF</latexit><latexit 
sha1_base64=\"ty2ZhM7TIrIs31mJvTNAn2S9FrI=\">AAAIRnicfZXdbts2FMfV7sNZ9tVul7sRZhQYCiOwig0rOgxoVw/dsCRz0sRNExkGRR3ZXEhJI+nODqt32O32RHuFvcTuht3uUJIdiWorwNbR+f2PeHh4REY5Z0oPh3/fuPnW2++829t5b/f9Dz786ONbtz+ZqGwpKZzSjGfyLCIKOEvhVDPN4SyXQETE4Vl0+djyZy9AKpalJ3qdw1SQecoSRolG1yTUC9Bkdqs/3BuWl981gtroe/U1nt3u3Q3jjC4FpJpyotRFMMz11BCpGeVQ7IZLBTmhl2QOF2imRICamjLdwr+DnthPMom/VPultxlhiFBqLSJUCqIXymXW+TqmF6I1urEqqRLl5KST+1PD0nypIaVVSsmS+zrzbZX8mEmgmq/RIFQynJVPF0QSqrGW7TH5PEOBO+zWzWgbSOCKXTklil+wXNVFWlVVagVF+PbdMIXfaCYESWMT0qeFCe3cKOHmaVE49KxBzzr0eYM+79BHDfqoQ48b9LhDRw066tDzBj3v0IMGPejQkwY96dD9Bt0v3GKpowY+coPVpEEnneDosMZRZA7d2BykqLGi0ow70TQiEgWDl+HAzTmuUfgKpkAwy76xqM1WaWFWMxM8SO1gd8IYEj+EVY796m8T/d4mOgL8NiUcoOtnTJToTN41tbQwlRYFc5CbwMTHQCeXSq+KjRWmGWeCoQcTKweXi+wK34C3mQk1rLSi5qp8zwYrF6sW1jXWG9+VICvfnF/r8fk6Ir7kvhldw5/2i03dcTvLJAjDAZdlH3AKRcsvrf8YBJGXbUDxkzOPM5lxrMi6zezHbU6qB3c10gwb4CKYmpBDov3wpd8PcE5svrAPTilZmrxZP7MSve6ExbiLlEH18v5YzOoXXDTip0W9JHgWlJuIiXGeEuLCyHlUmOHe/QHu6oNh0VXNJUC61aFm7yurdTKJ+BKqRGzpq+jSZ/qB2zd2XEe6TaerrsZ/hb5OrIzYfV1PYxMLhiq8hwNrvUmIrVQL0dr08PiJX40shRk/2bba4hDdC6L9w41nZP9NmNlDFTRucubBt1v5CH5tM4vwdA3cs7RrTO7tBWgffdl/+F19zu54n3mfe194gfe199D7wRt7px71fvF+9/7w/uz91fun92/vv0p680Yd86nXuna8/wEh7AYF</latexit><latexit 
sha1_base64=\"ty2ZhM7TIrIs31mJvTNAn2S9FrI=\">AAAIRnicfZXdbts2FMfV7sNZ9tVul7sRZhQYCiOwig0rOgxoVw/dsCRz0sRNExkGRR3ZXEhJI+nODqt32O32RHuFvcTuht3uUJIdiWorwNbR+f2PeHh4REY5Z0oPh3/fuPnW2++829t5b/f9Dz786ONbtz+ZqGwpKZzSjGfyLCIKOEvhVDPN4SyXQETE4Vl0+djyZy9AKpalJ3qdw1SQecoSRolG1yTUC9Bkdqs/3BuWl981gtroe/U1nt3u3Q3jjC4FpJpyotRFMMz11BCpGeVQ7IZLBTmhl2QOF2imRICamjLdwr+DnthPMom/VPultxlhiFBqLSJUCqIXymXW+TqmF6I1urEqqRLl5KST+1PD0nypIaVVSsmS+zrzbZX8mEmgmq/RIFQynJVPF0QSqrGW7TH5PEOBO+zWzWgbSOCKXTklil+wXNVFWlVVagVF+PbdMIXfaCYESWMT0qeFCe3cKOHmaVE49KxBzzr0eYM+79BHDfqoQ48b9LhDRw066tDzBj3v0IMGPejQkwY96dD9Bt0v3GKpowY+coPVpEEnneDosMZRZA7d2BykqLGi0ow70TQiEgWDl+HAzTmuUfgKpkAwy76xqM1WaWFWMxM8SO1gd8IYEj+EVY796m8T/d4mOgL8NiUcoOtnTJToTN41tbQwlRYFc5CbwMTHQCeXSq+KjRWmGWeCoQcTKweXi+wK34C3mQk1rLSi5qp8zwYrF6sW1jXWG9+VICvfnF/r8fk6Ir7kvhldw5/2i03dcTvLJAjDAZdlH3AKRcsvrf8YBJGXbUDxkzOPM5lxrMi6zezHbU6qB3c10gwb4CKYmpBDov3wpd8PcE5svrAPTilZmrxZP7MSve6ExbiLlEH18v5YzOoXXDTip0W9JHgWlJuIiXGeEuLCyHlUmOHe/QHu6oNh0VXNJUC61aFm7yurdTKJ+BKqRGzpq+jSZ/qB2zd2XEe6TaerrsZ/hb5OrIzYfV1PYxMLhiq8hwNrvUmIrVQL0dr08PiJX40shRk/2bba4hDdC6L9w41nZP9NmNlDFTRucubBt1v5CH5tM4vwdA3cs7RrTO7tBWgffdl/+F19zu54n3mfe194gfe199D7wRt7px71fvF+9/7w/uz91fun92/vv0p680Yd86nXuna8/wEh7AYF</latexit><latexit 
sha1_base64=\"ty2ZhM7TIrIs31mJvTNAn2S9FrI=\">AAAIRnicfZXdbts2FMfV7sNZ9tVul7sRZhQYCiOwig0rOgxoVw/dsCRz0sRNExkGRR3ZXEhJI+nODqt32O32RHuFvcTuht3uUJIdiWorwNbR+f2PeHh4REY5Z0oPh3/fuPnW2++829t5b/f9Dz786ONbtz+ZqGwpKZzSjGfyLCIKOEvhVDPN4SyXQETE4Vl0+djyZy9AKpalJ3qdw1SQecoSRolG1yTUC9Bkdqs/3BuWl981gtroe/U1nt3u3Q3jjC4FpJpyotRFMMz11BCpGeVQ7IZLBTmhl2QOF2imRICamjLdwr+DnthPMom/VPultxlhiFBqLSJUCqIXymXW+TqmF6I1urEqqRLl5KST+1PD0nypIaVVSsmS+zrzbZX8mEmgmq/RIFQynJVPF0QSqrGW7TH5PEOBO+zWzWgbSOCKXTklil+wXNVFWlVVagVF+PbdMIXfaCYESWMT0qeFCe3cKOHmaVE49KxBzzr0eYM+79BHDfqoQ48b9LhDRw066tDzBj3v0IMGPejQkwY96dD9Bt0v3GKpowY+coPVpEEnneDosMZRZA7d2BykqLGi0ow70TQiEgWDl+HAzTmuUfgKpkAwy76xqM1WaWFWMxM8SO1gd8IYEj+EVY796m8T/d4mOgL8NiUcoOtnTJToTN41tbQwlRYFc5CbwMTHQCeXSq+KjRWmGWeCoQcTKweXi+wK34C3mQk1rLSi5qp8zwYrF6sW1jXWG9+VICvfnF/r8fk6Ir7kvhldw5/2i03dcTvLJAjDAZdlH3AKRcsvrf8YBJGXbUDxkzOPM5lxrMi6zezHbU6qB3c10gwb4CKYmpBDov3wpd8PcE5svrAPTilZmrxZP7MSve6ExbiLlEH18v5YzOoXXDTip0W9JHgWlJuIiXGeEuLCyHlUmOHe/QHu6oNh0VXNJUC61aFm7yurdTKJ+BKqRGzpq+jSZ/qB2zd2XEe6TaerrsZ/hb5OrIzYfV1PYxMLhiq8hwNrvUmIrVQL0dr08PiJX40shRk/2bba4hDdC6L9w41nZP9NmNlDFTRucubBt1v5CH5tM4vwdA3cs7RrTO7tBWgffdl/+F19zu54n3mfe194gfe199D7wRt7px71fvF+9/7w/uz91fun92/vv0p680Yd86nXuna8/wEh7AYF</latexit>Goalstate\u02c6V(x)<latexit 
sha1_base64=\"hQTqaSVnMtWOt6x1pLM8o8Z6p9E=\">AAAI5XicfZVbj9tEFMfdcmkItxYeeRkRrVRQFMVcRFWE1KWBgthdstnddLvrNBqPT+zR+sbMpCQ79UfgDSHeEHwTPgbfhjO2k/jS1lLi4/P7H8+ZM8czbhpyqYbD/27cfO31N9681Xmr+/Y77773/u07H0xlshQMzlgSJuLcpRJCHsOZ4iqE81QAjdwQHrtXDw1//AyE5El8qtYpzCLqx3zBGVXoeuoEVJHp3EkDfnf1yfx2bzgY5hdpG3Zp9KzyGs/v3PrX8RK2jCBWLKRSXtrDVM00FYqzELKus5SQUnZFfbhEM6YRyJnO087IHno8skgE/mJFcm81QtNIynXkojKiKpBNZpwvYyqIaqNroxJyIRs5qcW9meZxulQQsyKlxTIkKiGmWsTjApgK12hQJjjOirCACsoU1rQ+ZugnKGgOu3VzVgcCQsmvGyXynvFUlkVaFVWqBbn49u4eXmR/+j1JaQpYMA8WxAmmKcdFJLpY0adOyvNVzUp+Amqc47xaTOhJVkWGsHGWv7vbdWL4lSVRRGMP3SdZEcVoqE9MVI2eV+h5iz6p0Cctul+h+y06qdBJi44qdNSi4wodt+hFhV606GGFHrboaYWetuhBhR4grWN5XMHHzWA5rdBpK9g9KrHr6qNmrDvZwVatsEuibLfy49armUsFCvrPnX5zQl6JnBcwCRE37GuD6mwVZ3o11/b92Ay2VzQarFL8mMg20e9MoiPAjUPAIbp+xkSpSsSnupRmutCiwAexCVwQDGzkUuhltrGcOAl5xNHTLbtcBMk1vgFvc+0oWCnJ9HW2/QjQL5tY1rAqsdr4riO6Ivpip8fnXYR3FRI92sGfDrJN3VUAiYBIh4DLcgA4hazmF8Y/gYiKqzpguB/oh4lIQqzIus7MzqNPi4fmasQJNsClPcM+4T5xnpOeTUqzUUceL14hnhuu1q0YDze3PKJc2B8znHcIC0Uu82jB/UCRWVYuBh5V+d6mPZyhAC/TwnczPRzc6+Nh0x9mbZUvAOKtDjWDL422+RWESygSMUUvonOf7tnNjjHjNqTbdNrqYvwX6MvE8ojuy7oZ2zfiqMK70zfWq4TYRKUQrU33jh+RYmQR6fGjbZMFR+WGf7TxjMy/dhJz5oPCnVHf/2YrH8EvdWYQHvp284hvG9PPBvbnA/v4i96Db8vjv2N9ZH1s3bVs6yvrgfWDNbbOLGYJ60/rb+ufjt/5rfN7549CevNGGfOhVbs6f/0PKx89cg==</latexit>AuxiliarytasksV\u21e2Rn<latexit 
sha1_base64=\"lKq0xkq+B5SgE8UPh4b8PgtnaGk=\">AAAIYnicfZVbb9s2FMfV7hLPuzXr4/ZAzCgwFEZgFRtatBjQrh66YUnmpImbJvIMijqyiZCSRtKdHVav+zR73b7L3vdBdmjJji5tBdg6Or//IQ8Pb2EmuDaDwb83br73/gcf7nQ+6n78yaeffX5r94uxTheKwSlLRarOQqpB8ARODTcCzjIFVIYCXoSXTx1/8QqU5mlyYlYZTCSdJTzmjBp0TW+RQFIzZ1TYcU4CvQg1mMIXhvY4/w0lvcHeYP2QtuGXRs8rn9F0d+duEKVsISExTFCtL/xBZiaWKsOZgLwbLDRklF3SGVygmVAJemLXQ8nJHfREJE4V/hJD1t5qhKVS65UMUemS1E3mnG9jZi5rvVunUjrWjZxM/GBieZItDCSsSCleCGJS4ipIIq6AGbFCgzLFcVSEzamizGCd632KWYqCZrdbN2d1oEBoftUoUfSKZ7os0rKoUi0oxNa7QQJ/sFRKmkQ2YM9zu53W53neoGcVetaiLyv0ZYs+qdAnLXpcocctOqzQYYueV+h5ix5U6EGLnlToSYvuV+h+3iyWPqrgo2awHlfouBUcHpYYN8thMzYDJUusmbKjVjQLqUJB/3XQb+YclSh4A9MguWOPHKqzZZLb5dT6DxPX2Z0ggpgEsMxwvZJtoj+6RIeAe1PBAbp+xUSpSdVdW0pzW2hRMAO1CYwJBjZyKfQ631hBkgouOXowsXXnap5eYQv4mtrAwNJoZq/W7WywbmJdw6bEZuO7knRJ7Pm1Hr+vI6JLQezwGv6yn2/qbuaQKpBWAE7LPuAQ8ppfOf8xSKou64DhlrNPU5UKrMiqztzmtifFR3M2khQXwIU/sYGAGE/V16Tn45j4bO4+GqXkSfxu/dRJzKoVFuEpsg4qp/fnfFo2cFGJn+TllOA9sT5EbITjVBDlVs3C3A72HvTxVO8P8rZqpgCSrQ41e985bSOTUCygSMSVvohe+2zPb64b129Duk2nrS76f4O+TGwd0X3bmsZFLDmq8B30nfUuIS6lUojWZg2PnpGiZyXt6Nl2qc0P0T2nhhxuPEP3b4PUXbhg8JCzD7/fyofwe505hLer37xL28b43p6P9tG3vcc/lPdsx/vS+9r7xvO9+95j7ydv5J16zPvT+8v72/tn579Ot7PbuV1Ib94oY257tafz1f/eBQ5h</latexit><latexit 
sha1_base64=\"lKq0xkq+B5SgE8UPh4b8PgtnaGk=\">AAAIYnicfZVbb9s2FMfV7hLPuzXr4/ZAzCgwFEZgFRtatBjQrh66YUnmpImbJvIMijqyiZCSRtKdHVav+zR73b7L3vdBdmjJji5tBdg6Or//IQ8Pb2EmuDaDwb83br73/gcf7nQ+6n78yaeffX5r94uxTheKwSlLRarOQqpB8ARODTcCzjIFVIYCXoSXTx1/8QqU5mlyYlYZTCSdJTzmjBp0TW+RQFIzZ1TYcU4CvQg1mMIXhvY4/w0lvcHeYP2QtuGXRs8rn9F0d+duEKVsISExTFCtL/xBZiaWKsOZgLwbLDRklF3SGVygmVAJemLXQ8nJHfREJE4V/hJD1t5qhKVS65UMUemS1E3mnG9jZi5rvVunUjrWjZxM/GBieZItDCSsSCleCGJS4ipIIq6AGbFCgzLFcVSEzamizGCd632KWYqCZrdbN2d1oEBoftUoUfSKZ7os0rKoUi0oxNa7QQJ/sFRKmkQ2YM9zu53W53neoGcVetaiLyv0ZYs+qdAnLXpcocctOqzQYYueV+h5ix5U6EGLnlToSYvuV+h+3iyWPqrgo2awHlfouBUcHpYYN8thMzYDJUusmbKjVjQLqUJB/3XQb+YclSh4A9MguWOPHKqzZZLb5dT6DxPX2Z0ggpgEsMxwvZJtoj+6RIeAe1PBAbp+xUSpSdVdW0pzW2hRMAO1CYwJBjZyKfQ631hBkgouOXowsXXnap5eYQv4mtrAwNJoZq/W7WywbmJdw6bEZuO7knRJ7Pm1Hr+vI6JLQezwGv6yn2/qbuaQKpBWAE7LPuAQ8ppfOf8xSKou64DhlrNPU5UKrMiqztzmtifFR3M2khQXwIU/sYGAGE/V16Tn45j4bO4+GqXkSfxu/dRJzKoVFuEpsg4qp/fnfFo2cFGJn+TllOA9sT5EbITjVBDlVs3C3A72HvTxVO8P8rZqpgCSrQ41e985bSOTUCygSMSVvohe+2zPb64b129Duk2nrS76f4O+TGwd0X3bmsZFLDmq8B30nfUuIS6lUojWZg2PnpGiZyXt6Nl2qc0P0T2nhhxuPEP3b4PUXbhg8JCzD7/fyofwe505hLer37xL28b43p6P9tG3vcc/lPdsx/vS+9r7xvO9+95j7ydv5J16zPvT+8v72/tn579Ot7PbuV1Ib94oY257tafz1f/eBQ5h</latexit><latexit 
sha1_base64=\"lKq0xkq+B5SgE8UPh4b8PgtnaGk=\">AAAIYnicfZVbb9s2FMfV7hLPuzXr4/ZAzCgwFEZgFRtatBjQrh66YUnmpImbJvIMijqyiZCSRtKdHVav+zR73b7L3vdBdmjJji5tBdg6Or//IQ8Pb2EmuDaDwb83br73/gcf7nQ+6n78yaeffX5r94uxTheKwSlLRarOQqpB8ARODTcCzjIFVIYCXoSXTx1/8QqU5mlyYlYZTCSdJTzmjBp0TW+RQFIzZ1TYcU4CvQg1mMIXhvY4/w0lvcHeYP2QtuGXRs8rn9F0d+duEKVsISExTFCtL/xBZiaWKsOZgLwbLDRklF3SGVygmVAJemLXQ8nJHfREJE4V/hJD1t5qhKVS65UMUemS1E3mnG9jZi5rvVunUjrWjZxM/GBieZItDCSsSCleCGJS4ipIIq6AGbFCgzLFcVSEzamizGCd632KWYqCZrdbN2d1oEBoftUoUfSKZ7os0rKoUi0oxNa7QQJ/sFRKmkQ2YM9zu53W53neoGcVetaiLyv0ZYs+qdAnLXpcocctOqzQYYueV+h5ix5U6EGLnlToSYvuV+h+3iyWPqrgo2awHlfouBUcHpYYN8thMzYDJUusmbKjVjQLqUJB/3XQb+YclSh4A9MguWOPHKqzZZLb5dT6DxPX2Z0ggpgEsMxwvZJtoj+6RIeAe1PBAbp+xUSpSdVdW0pzW2hRMAO1CYwJBjZyKfQ631hBkgouOXowsXXnap5eYQv4mtrAwNJoZq/W7WywbmJdw6bEZuO7knRJ7Pm1Hr+vI6JLQezwGv6yn2/qbuaQKpBWAE7LPuAQ8ppfOf8xSKou64DhlrNPU5UKrMiqztzmtifFR3M2khQXwIU/sYGAGE/V16Tn45j4bO4+GqXkSfxu/dRJzKoVFuEpsg4qp/fnfFo2cFGJn+TllOA9sT5EbITjVBDlVs3C3A72HvTxVO8P8rZqpgCSrQ41e985bSOTUCygSMSVvohe+2zPb64b129Duk2nrS76f4O+TGwd0X3bmsZFLDmq8B30nfUuIS6lUojWZg2PnpGiZyXt6Nl2qc0P0T2nhhxuPEP3b4PUXbhg8JCzD7/fyofwe505hLer37xL28b43p6P9tG3vcc/lPdsx/vS+9r7xvO9+95j7ydv5J16zPvT+8v72/tn579Ot7PbuV1Ib94oY257tafz1f/eBQ5h</latexit><latexit 
sha1_base64=\"lKq0xkq+B5SgE8UPh4b8PgtnaGk=\">AAAIYnicfZVbb9s2FMfV7hLPuzXr4/ZAzCgwFEZgFRtatBjQrh66YUnmpImbJvIMijqyiZCSRtKdHVav+zR73b7L3vdBdmjJji5tBdg6Or//IQ8Pb2EmuDaDwb83br73/gcf7nQ+6n78yaeffX5r94uxTheKwSlLRarOQqpB8ARODTcCzjIFVIYCXoSXTx1/8QqU5mlyYlYZTCSdJTzmjBp0TW+RQFIzZ1TYcU4CvQg1mMIXhvY4/w0lvcHeYP2QtuGXRs8rn9F0d+duEKVsISExTFCtL/xBZiaWKsOZgLwbLDRklF3SGVygmVAJemLXQ8nJHfREJE4V/hJD1t5qhKVS65UMUemS1E3mnG9jZi5rvVunUjrWjZxM/GBieZItDCSsSCleCGJS4ipIIq6AGbFCgzLFcVSEzamizGCd632KWYqCZrdbN2d1oEBoftUoUfSKZ7os0rKoUi0oxNa7QQJ/sFRKmkQ2YM9zu53W53neoGcVetaiLyv0ZYs+qdAnLXpcocctOqzQYYueV+h5ix5U6EGLnlToSYvuV+h+3iyWPqrgo2awHlfouBUcHpYYN8thMzYDJUusmbKjVjQLqUJB/3XQb+YclSh4A9MguWOPHKqzZZLb5dT6DxPX2Z0ggpgEsMxwvZJtoj+6RIeAe1PBAbp+xUSpSdVdW0pzW2hRMAO1CYwJBjZyKfQ631hBkgouOXowsXXnap5eYQv4mtrAwNJoZq/W7WywbmJdw6bEZuO7knRJ7Pm1Hr+vI6JLQezwGv6yn2/qbuaQKpBWAE7LPuAQ8ppfOf8xSKou64DhlrNPU5UKrMiqztzmtifFR3M2khQXwIU/sYGAGE/V16Tn45j4bO4+GqXkSfxu/dRJzKoVFuEpsg4qp/fnfFo2cFGJn+TllOA9sT5EbITjVBDlVs3C3A72HvTxVO8P8rZqpgCSrQ41e985bSOTUCygSMSVvohe+2zPb64b129Duk2nrS76f4O+TGwd0X3bmsZFLDmq8B30nfUuIS6lUojWZg2PnpGiZyXt6Nl2qc0P0T2nhhxuPEP3b4PUXbhg8JCzD7/fyofwe505hLer37xL28b43p6P9tG3vcc/lPdsx/vS+9r7xvO9+95j7ydv5J16zPvT+8v72/tn579Ot7PbuV1Ib94oY257tafz1f/eBQ5h</latexit>\u02c6V\u21e1<latexit 
sha1_base64=\"Y9XVFiR0gRzt0o8ZcHM/1+3D4iM=\">AAAIUXicfZXfb9s2EMfV7EfcdFvT7XEvwowCQ2EEVrFhRYcB7eqiHZZkTpq4aULPoKiTRUSUNJJu7bD6S/a6/UV72p+ytx0l2dGPtgJsne7zPfF4PJF+FnOlh8N/b2x99PEnn273bu7c+uzzL27v3vlyotKFZHDK0jiVZz5VEPMETjXXMZxlEqjwY3jpXz6x/OVrkIqnyYleZTAVdJ7wkDOq0TXbvU0iqt3J7yTjM5JFfLbbH+4Ni8vtGl5l9J3qGs/ubN8jQcoWAhLNYqrUhTfM9NRQqTmLId8hCwUZZZd0DhdoJlSAmpoi89y9i57ADVOJv0S7hbceYahQaiV8VAqqI9Vm1vk+piPRGN1YlVShauWkwwdTw5NsoSFhZUrhInZ16tqCuQGXwHS8QoMyyXFWLouopExjWZtjxvMUBe1hN27OmkBCrPhVq0TBa56pqkjLskqNIB/fvkMSeMNSIWgSGMJe5IbYuTEamxd53qJnNXrWoa9q9FWHPq7Rxx16XKPHHTqq0VGHntfoeYce1OhBh57U6EmH7tfoft4uljqq4aN2sJrU6KQT7B9W2PfNYTs2AykqrJg0404086lEweAtGbRzDipE3sEUCG7ZjxY12TLJzXJmvIeJHewuCSB0CSwz7Fd3k+hTm+gI8NuUcICu3zBRqlN5z1TS3JRaFMxBrgNDFwNbuZR6la8tkqQxFxw9mFgxuIzSK3wD3maGaFhqxcxV8Z41Vm2sGlhXWK99V4IuXXN+rcfn64jgMnbN6Br+up+v664jSCUIEwMuyz7gFPKGX1r/MQgqL5uA4SdnnqQyjbEiqyazH7c5KR/aq5Gk2AAX3tSQGELtkrdu38M58XlkH1ql5En4Yf3MSvSqExbgLlIEVcv7Sz6rXnBRi5/m1ZLgsVBsIibAeUoIciPnfm6Gew8GuKsPhnlXNZcAyUaHmr3vrbaViR8voEzElr6MLnym77X7xo7bkm7S6arL8d+hrxIrInbe19PYxIKjCu9kYK0PCbGVKiFa6x4eP3PLkaUw42ebVosO0W0Py8O1Z2T/DUnt+QoaNznz8KeNfAR/NJlFeLp67bO0a0zu73loH33Xf/Rzdc72nK+db5xvHc/5wXnkPHfGzqnDnIXzp/OX8/f2P9v/9ZzeVindulHFfOU0rt6t/wGzCwe+</latexit><latexit 
sha1_base64=\"Y9XVFiR0gRzt0o8ZcHM/1+3D4iM=\">AAAIUXicfZXfb9s2EMfV7EfcdFvT7XEvwowCQ2EEVrFhRYcB7eqiHZZkTpq4aULPoKiTRUSUNJJu7bD6S/a6/UV72p+ytx0l2dGPtgJsne7zPfF4PJF+FnOlh8N/b2x99PEnn273bu7c+uzzL27v3vlyotKFZHDK0jiVZz5VEPMETjXXMZxlEqjwY3jpXz6x/OVrkIqnyYleZTAVdJ7wkDOq0TXbvU0iqt3J7yTjM5JFfLbbH+4Ni8vtGl5l9J3qGs/ubN8jQcoWAhLNYqrUhTfM9NRQqTmLId8hCwUZZZd0DhdoJlSAmpoi89y9i57ADVOJv0S7hbceYahQaiV8VAqqI9Vm1vk+piPRGN1YlVShauWkwwdTw5NsoSFhZUrhInZ16tqCuQGXwHS8QoMyyXFWLouopExjWZtjxvMUBe1hN27OmkBCrPhVq0TBa56pqkjLskqNIB/fvkMSeMNSIWgSGMJe5IbYuTEamxd53qJnNXrWoa9q9FWHPq7Rxx16XKPHHTqq0VGHntfoeYce1OhBh57U6EmH7tfoft4uljqq4aN2sJrU6KQT7B9W2PfNYTs2AykqrJg0404086lEweAtGbRzDipE3sEUCG7ZjxY12TLJzXJmvIeJHewuCSB0CSwz7Fd3k+hTm+gI8NuUcICu3zBRqlN5z1TS3JRaFMxBrgNDFwNbuZR6la8tkqQxFxw9mFgxuIzSK3wD3maGaFhqxcxV8Z41Vm2sGlhXWK99V4IuXXN+rcfn64jgMnbN6Br+up+v664jSCUIEwMuyz7gFPKGX1r/MQgqL5uA4SdnnqQyjbEiqyazH7c5KR/aq5Gk2AAX3tSQGELtkrdu38M58XlkH1ql5En4Yf3MSvSqExbgLlIEVcv7Sz6rXnBRi5/m1ZLgsVBsIibAeUoIciPnfm6Gew8GuKsPhnlXNZcAyUaHmr3vrbaViR8voEzElr6MLnym77X7xo7bkm7S6arL8d+hrxIrInbe19PYxIKjCu9kYK0PCbGVKiFa6x4eP3PLkaUw42ebVosO0W0Py8O1Z2T/DUnt+QoaNznz8KeNfAR/NJlFeLp67bO0a0zu73loH33Xf/Rzdc72nK+db5xvHc/5wXnkPHfGzqnDnIXzp/OX8/f2P9v/9ZzeVindulHFfOU0rt6t/wGzCwe+</latexit><latexit 
sha1_base64=\"Y9XVFiR0gRzt0o8ZcHM/1+3D4iM=\">AAAIUXicfZXfb9s2EMfV7EfcdFvT7XEvwowCQ2EEVrFhRYcB7eqiHZZkTpq4aULPoKiTRUSUNJJu7bD6S/a6/UV72p+ytx0l2dGPtgJsne7zPfF4PJF+FnOlh8N/b2x99PEnn273bu7c+uzzL27v3vlyotKFZHDK0jiVZz5VEPMETjXXMZxlEqjwY3jpXz6x/OVrkIqnyYleZTAVdJ7wkDOq0TXbvU0iqt3J7yTjM5JFfLbbH+4Ni8vtGl5l9J3qGs/ubN8jQcoWAhLNYqrUhTfM9NRQqTmLId8hCwUZZZd0DhdoJlSAmpoi89y9i57ADVOJv0S7hbceYahQaiV8VAqqI9Vm1vk+piPRGN1YlVShauWkwwdTw5NsoSFhZUrhInZ16tqCuQGXwHS8QoMyyXFWLouopExjWZtjxvMUBe1hN27OmkBCrPhVq0TBa56pqkjLskqNIB/fvkMSeMNSIWgSGMJe5IbYuTEamxd53qJnNXrWoa9q9FWHPq7Rxx16XKPHHTqq0VGHntfoeYce1OhBh57U6EmH7tfoft4uljqq4aN2sJrU6KQT7B9W2PfNYTs2AykqrJg0404086lEweAtGbRzDipE3sEUCG7ZjxY12TLJzXJmvIeJHewuCSB0CSwz7Fd3k+hTm+gI8NuUcICu3zBRqlN5z1TS3JRaFMxBrgNDFwNbuZR6la8tkqQxFxw9mFgxuIzSK3wD3maGaFhqxcxV8Z41Vm2sGlhXWK99V4IuXXN+rcfn64jgMnbN6Br+up+v664jSCUIEwMuyz7gFPKGX1r/MQgqL5uA4SdnnqQyjbEiqyazH7c5KR/aq5Gk2AAX3tSQGELtkrdu38M58XlkH1ql5En4Yf3MSvSqExbgLlIEVcv7Sz6rXnBRi5/m1ZLgsVBsIibAeUoIciPnfm6Gew8GuKsPhnlXNZcAyUaHmr3vrbaViR8voEzElr6MLnym77X7xo7bkm7S6arL8d+hrxIrInbe19PYxIKjCu9kYK0PCbGVKiFa6x4eP3PLkaUw42ebVosO0W0Py8O1Z2T/DUnt+QoaNznz8KeNfAR/NJlFeLp67bO0a0zu73loH33Xf/Rzdc72nK+db5xvHc/5wXnkPHfGzqnDnIXzp/OX8/f2P9v/9ZzeVindulHFfOU0rt6t/wGzCwe+</latexit><latexit 
sha1_base64=\"Y9XVFiR0gRzt0o8ZcHM/1+3D4iM=\">AAAIUXicfZXfb9s2EMfV7EfcdFvT7XEvwowCQ2EEVrFhRYcB7eqiHZZkTpq4aULPoKiTRUSUNJJu7bD6S/a6/UV72p+ytx0l2dGPtgJsne7zPfF4PJF+FnOlh8N/b2x99PEnn273bu7c+uzzL27v3vlyotKFZHDK0jiVZz5VEPMETjXXMZxlEqjwY3jpXz6x/OVrkIqnyYleZTAVdJ7wkDOq0TXbvU0iqt3J7yTjM5JFfLbbH+4Ni8vtGl5l9J3qGs/ubN8jQcoWAhLNYqrUhTfM9NRQqTmLId8hCwUZZZd0DhdoJlSAmpoi89y9i57ADVOJv0S7hbceYahQaiV8VAqqI9Vm1vk+piPRGN1YlVShauWkwwdTw5NsoSFhZUrhInZ16tqCuQGXwHS8QoMyyXFWLouopExjWZtjxvMUBe1hN27OmkBCrPhVq0TBa56pqkjLskqNIB/fvkMSeMNSIWgSGMJe5IbYuTEamxd53qJnNXrWoa9q9FWHPq7Rxx16XKPHHTqq0VGHntfoeYce1OhBh57U6EmH7tfoft4uljqq4aN2sJrU6KQT7B9W2PfNYTs2AykqrJg0404086lEweAtGbRzDipE3sEUCG7ZjxY12TLJzXJmvIeJHewuCSB0CSwz7Fd3k+hTm+gI8NuUcICu3zBRqlN5z1TS3JRaFMxBrgNDFwNbuZR6la8tkqQxFxw9mFgxuIzSK3wD3maGaFhqxcxV8Z41Vm2sGlhXWK99V4IuXXN+rcfn64jgMnbN6Br+up+v664jSCUIEwMuyz7gFPKGX1r/MQgqL5uA4SdnnqQyjbEiqyazH7c5KR/aq5Gk2AAX3tSQGELtkrdu38M58XlkH1ql5En4Yf3MSvSqExbgLlIEVcv7Sz6rXnBRi5/m1ZLgsVBsIibAeUoIciPnfm6Gew8GuKsPhnlXNZcAyUaHmr3vrbaViR8voEzElr6MLnym77X7xo7bkm7S6arL8d+hrxIrInbe19PYxIKjCu9kYK0PCbGVKiFa6x4eP3PLkaUw42ebVosO0W0Py8O1Z2T/DUnt+QoaNznz8KeNfAR/NJlFeLp67bO0a0zu73loH33Xf/Rzdc72nK+db5xvHc/5wXnkPHfGzqnDnIXzp/OX8/f2P9v/9ZzeVindulHFfOU0rt6t/wGzCwe+</latexit>Value, State 2Value, State 
1 [Figure 1: Left. A deep network viewed as a composition of two parts, one linear and one not. Right. The optimal representation \u03c6\u2217 is a linear subspace that cuts through the value polytope.] problems to study. We might be interested in sharing a representation across problems, as is often done in transfer or continual learning. In this context, auxiliary tasks may inform how the value function should generalize to these new problems. In many problems of interest, the weights \u03b8 can also be optimized more efficiently than the representation itself, warranting the view that the representation should be adapted using a different process (Levine et al., 2017; Chung et al., 2019). Note that a trivial \u201cvalue-as-feature\u201d representation exists for the single-policy optimization problem: minimize \u2016V\u0302^\u03c0_\u03c6 \u2212 V^\u03c0\u2016_2^2 w.r.t. \u03c6 \u2208 R; this approximation sets \u03c6(x) = V^\u03c0(x) and \u03b8 = 1. In this paper we take the stance that this is not a satisfying representation, and that a good representation should be in the service of a broader goal (e.g. control, transfer, or fairness). 3 Representation Learning by Approximating Value Functions We measure the quality of a representation \u03c6 in terms of how well it can approximate all possible value functions, formalized as the representation error L(\u03c6) := max_{\u03c0\u2208P} L(\u03c6; \u03c0), where L(\u03c6; \u03c0) := \u2016V\u0302^\u03c0_\u03c6 \u2212 V^\u03c0\u2016_2^2. We consider the problem of finding the representation \u03c6 \u2208 R minimizing L(\u03c6): minimize max_{\u03c0\u2208P} \u2016V\u0302^\u03c0_\u03c6 \u2212 V^\u03c0\u2016_2^2 w.r.t. \u03c6 \u2208 R. (2) In the context of our work, we call this the representation learning problem (RLP) and say that a representation \u03c6\u2217 is optimal when it minimizes the error in (2). Note that L(\u03c6) (and hence \u03c6\u2217) depends on characteristics of the environment, in particular on both reward and transition functions. We consider the RLP from a geometric perspective (Figure 1, right). Dadashi et al. (2019) showed that the set of value functions achieved by the set of policies P, denoted V := {V^\u03c0 \u2208 R^n : \u03c0 \u2208 P}, forms a (possibly nonconvex) polytope. As previo
usly noted, a representation \u03c6 defines a subspace H of possible value approximations. The maximal error is achieved by the value function in V which is furthest along the subspace normal to H, since V\u0302^\u03c0_\u03c6 is the orthogonal projection of V^\u03c0. We say that V \u2208 V is an extremal vertex if it is a vertex of the convex hull of V. We will make use of the relationship between directions \u03b4 \u2208 R^n, the set of extremal vertices, and the set of deterministic policies. The following lemma, based on a well-known notion of duality from convex analysis (Boyd and Vandenberghe, 2004), states this relationship formally. Lemma 1. Let \u03b4 \u2208 R^n and define the functional f_\u03b4(V) := \u03b4^\u22a4 V, with domain V. Then f_\u03b4 is maximized by an extremal vertex U \u2208 V, and there is a deterministic policy \u03c0 for which V^\u03c0 = U. Furthermore, the set of directions \u03b4 \u2208 R^n for which the maximum of f_\u03b4 is achieved by multiple extremal vertices has Lebesgue measure zero in R^n. Denote by P_v the set of policies corresponding to extremal vertices of V. We next derive an equivalence between the RLP and an optimization problem which only considers policies in P_v. Theorem 1. For any representation \u03c6 \u2208 R, the maximal approximation error measured over all value functions is the same as the error measured over the set of extremal vertices: max_{\u03c0\u2208P} \u2016V\u0302^\u03c0_\u03c6 \u2212 V^\u03c0\u2016_2^2 = max_{\u03c0\u2208P_v} \u2016V\u0302^\u03c0_\u03c6 \u2212 V^\u03c0\u2016_2^2. Theorem 1 indicates that we can find an optimal representation by considering a finite (albeit exponential) number of value functions: each extremal vertex corresponds to the value function of some deterministic policy, of which there are at most an exponential number. We will call these adversarial value functions (AVFs), because of the minimax flavour of the RLP. Solving the RLP allows us to provide quantifiable guarantees on the performance of certain value-based learning algorithms. For example, in the context of least-squares policy iteration (LSPI; Lagoudakis and Parr, 2003), minimizing the representation error L directly improves the performance bound. By contrast, we cannot have the same guarantee if \u03c6 is learned by minimizing the a
pproximation error for a single value function. Corollary 1. Let \u03c6\u2217 be an optimal representation in the RLP. Consider the sequence of policies \u03c0_0, \u03c0_1, ... derived from LSPI using \u03c6\u2217 to approximate V^{\u03c0_0}, V^{\u03c0_1}, ... under a uniform sampling of the state space. Then there exists an MDP-dependent constant C \u2208 R such that lim sup_{k\u2192\u221e} \u2016V\u2217 \u2212 V^{\u03c0_k}\u2016_2^2 \u2264 C L(\u03c6\u2217). This result is a direct application of the quadratic norm bounds given by Munos (2003), in whose work the constant is made explicit. We emphasize that the result is illustrative; our approach should enable similar guarantees in other contexts (e.g. Munos, 2007; Petrik and Zilberstein, 2011). 3.1 The Structure of Adversarial Value Functions The RLP suggests that an agent trained to predict various value functions should develop a good state representation. Intuitively, one may worry that there are simply too many \u201cuninteresting\u201d policies, and that a representation learned from their value functions emphasizes the wrong quantities. However, the search for an optimal representation \u03c6\u2217 is closely tied to the much smaller set of adversarial value functions (AVFs). The aim of this section is to characterize the structure of AVFs and show that they form an interesting subset of all value functions. From this, we argue that their use as auxiliary tasks should also produce structured representations. From Lemma 1, recall that an AVF is geometrically defined using a vector \u03b4 \u2208 R^n and the functional f_\u03b4(V) := \u03b4^\u22a4 V, which the AVF maximizes. Since f_\u03b4 is restricted to the value polytope, we can consider the equivalent policy-space functional g_\u03b4 : \u03c0 \u21a6 \u03b4^\u22a4 V^\u03c0. Observe that max_{\u03c0\u2208P} g_\u03b4(\u03c0) = max_{\u03c0\u2208P} \u03b4^\u22a4 V^\u03c0 = max_{\u03c0\u2208P} \u2211_{x\u2208X} \u03b4(x) V^\u03c0(x). (3) In this optimization problem, the vector \u03b4 defines a weighting over the state space X; for this reason, we call \u03b4 an interest function in the context of AVFs. Whenever \u03b4 \u2265 0 componentwise, we recover the optimal value function, irrespective of the exact magnitude of \u03b4 (Bertsekas, 2012). If \u03b4(x) < 0 for some x, however, the maximization becomes a minimization. As the n
ext result shows, the policy maximizing g_\u03b4(\u03c0) depends on a network flow d_\u03c0 derived from \u03b4 and the transition function P. Theorem 2. Maximizing the functional g_\u03b4 is equivalent to finding a network flow d_\u03c0 that satisfies a reverse Bellman equation: max_{\u03c0\u2208P} \u03b4^\u22a4 V^\u03c0 = max_{\u03c0\u2208P} d_\u03c0^\u22a4 r, d_\u03c0 = \u03b4 + \u03b3 P_\u03c0^\u22a4 d_\u03c0. For a policy \u03c0\u0303 maximizing the above we have V^{\u03c0\u0303}(x) = r(x) + \u03b3 max_{a\u2208A} E_{x'\u223cP} V^{\u03c0\u0303}(x') if d_{\u03c0\u0303}(x) > 0, and V^{\u03c0\u0303}(x) = r(x) + \u03b3 min_{a\u2208A} E_{x'\u223cP} V^{\u03c0\u0303}(x') if d_{\u03c0\u0303}(x) < 0. Corollary 2. There are at most 2^n distinct adversarial value functions. The vector d_\u03c0 corresponds to the sum of discounted interest weights flowing through a state x, similar to the dual variables in the theory of linear programming for MDPs (Puterman, 1994). Theorem 2, by way of the corollary, implies that there are fewer AVFs (\u2264 2^n) than deterministic policies (= |A|^n). It also implies that AVFs relate to a reward-driven purpose, similar to how the optimal value function describes the goal of maximizing return. We will illustrate this point empirically in Section 4.1. 3.2 Relationship to Auxiliary Tasks So far we have argued that solving the RLP leads to a representation which is optimal in a meaningful sense. However, solving the RLP seems computationally intractable: there are an exponential number of deterministic policies to consider (Prop. 1 in the appendix gives a quadratic formulation with quadratic constraints). Using interest functions does not mitigate this difficulty: the computational problem of finding the AVF for a single interest function is NP-hard, even when restricted to deterministic MDPs (Prop. 2 in the appendix). Instead, in this section we consider a relaxation of the RLP and show that this relaxation describes existing representation learning methods, in particular those that use auxiliary tasks. Let \u03be be some distribution over R^n. We begin by replacing the maximum in (2) by an expectation: minimize E_{V\u223c\u03be} \u2016V\u0302_\u03c6 \u2212 V\u2016_2^2 w.r.t. \u03c6 \u2208 R. (4) The use of the expectation offers three practical advantages over the use of the maximum. First, this leads to a differentiable objective w
hich can be minimized using deep learning techniques. Second, the choice of \u03be gives us an additional degree of freedom; in particular, \u03be need not be restricted to the value polytope. Third, the minimizer in (4) is easily characterized, as the following theorem shows. Theorem 3. Let u\u2217_1, ..., u\u2217_d \u2208 R^n be the principal components of the distribution \u03be, in the sense that u\u2217_i := argmax_{u\u2208B_i} E_{V\u223c\u03be} (u^\u22a4 V)^2, where B_i := {u \u2208 R^n : \u2016u\u2016_2^2 = 1, u^\u22a4 u\u2217_j = 0 \u2200 j < i}. Equivalently, u\u2217_1, ..., u\u2217_d are the eigenvectors of E_\u03be V V^\u22a4 \u2208 R^{n\u00d7n} with the d largest eigenvalues. Then the matrix [u\u2217_1, ..., u\u2217_d] \u2208 R^{n\u00d7d}, viewed as a map X \u2192 R^d, is a solution to (4). When the principal components are uniquely defined, any minimizer of (4) spans the same subspace as u\u2217_1, ..., u\u2217_d. One may expect the quality of the learned representation to depend on how closely the distribution \u03be relates to the RLP. From an auxiliary tasks perspective, this corresponds to choosing tasks that are in some sense useful. For example, generating value functions from the uniform distribution over the set of policies P, while a natural choice, may put too much weight on \u201cuninteresting\u201d value functions. In practice, we may further restrict \u03be to a finite set V. Under a uniform weighting, this leads to a representation loss L(\u03c6; V) := \u2211_{V\u2208V} \u2016V\u0302_\u03c6 \u2212 V\u2016_2^2 (5) which corresponds to the typical formulation of an auxiliary-task loss (e.g. Jaderberg et al., 2017). In a deep reinforcement learning setting, one typically minimizes (5) using stochastic gradient descent methods, which scale better than batch methods such as singular value decomposition (but see Wu et al. (2019) for further discussion). Our analysis leads us to conclude that, in many cases of interest, the use of auxiliary tasks produces representations that are close to the principal components of the set of tasks under consideration. If V is well-aligned with the RLP, minimizing L(\u03c6; V) should give rise to a reasonable representation. To demonstrate the power of this approach, in Section 4 we will study the case when the set V is constructed by sampling AVFs, emphasizing the policies that support the solution to the RLP. 3.
3 Relationship to Proto-Value Functions Proto-value functions (Mahadevan and Maggioni, 2007, PVF) are a family of representations which vary smoothly across the state space. Although the original formulation defines this representation as the largest-eigenvalue eigenvectors of the Laplacian of the transition function\u2019s graphical structure, recent formulations use the top singular vectors of (I \u2212 \u03b3P_\u03c0)^{\u22121}, where \u03c0 is the uniformly random policy (Stachenfeld et al., 2014; Machado et al., 2017; Behzadian and Petrik, 2018). In line with the analysis of the previous section, proto-value functions can also be interpreted as defining a set of value-based auxiliary tasks. Specifically, if we define an indicator reward function r_y(x) := I[x = y] and a set of value functions V = {(I \u2212 \u03b3P_\u03c0)^{\u22121} r_y}_{y\u2208X} with \u03c0 the uniformly random policy, then any d-dimensional representation that minimizes (5) spans the same basis as the d-dimensional PVF (up to the bias term). This suggests a connection with hindsight experience replay (Andrychowicz et al., 2017), whose auxiliary tasks consist in reaching previously experienced states. 4 Empirical Studies In this section we complement our theoretical analysis with an experimental study. In turn, we take a closer look at 1) the structure of adversarial value functions, 2) the shape of representations learned using AVFs, and 3) the performance profile of these representations in a control setting. Our eventual goal is to demonstrate that the RLP, which is based on approximating value functions, gives rise to representations that are both interesting and comparable to previously proposed schemes. Our concrete instantiation (Algorithm 1) uses the representation loss (5). As-is, this algorithm is of limited practical relevance (our AVFs are learned using a tabular representation) but we believe it provides an inspirational basis for further developments. Algorithm 1: Representation learning using AVFs. Input: k, the desired number of AVFs; d, the desired number of features. Sample \u03b4_1, ..., \u03b4_k \u223c [\u22121, 1]^n. Compute \u00b5_i = argmax_\u03c0 \u03b4_i^\u22a4 V^\u03c0 using a policy gradient method. Find \u03c6\u2217 = argmin_\u03c6 L(\u03c6; {V^{\u00b5_1}, ..., V^{\u00b5_k}}) (Equation 5). We perform all of o
ur experiments within the four-room domain (Sutton et al., 1999; Solway et al., 2014; Machado et al., 2017, Figure 2, see also Appendix H.1). [Figure 2 panel labels: Goal state; Interest function \u03b4; Policy; Adversarial value function; Network flow d_\u03c0.] Figure 2: Leftmost. The four-room domain. Other panels. An interest function \u03b4, the network flow d_\u03c0, the corresponding adversarial value function (blue/red = low/high value) and its policy. We consider a two-part network where we pretrain \u03c6 end-to-end to predict a set of value functions. Our aim here is to compare the effects of using different sets of value functions, including AVFs, on the learned representation. As our focus is on the efficient use of a d-dimensional representation (with d < n, the number of states), we encode individual states as one-hot vectors and map them into \u03c6(x) without capacity constraints. Additional details may be found in Appendix H. 4.1 Adversarial Value Functions Our first set of results studies the structure of adversarial value functions in the four-room domain. We generated interest functions by assigning a value \u03b4(x) \u2208 {\u22121, 0, 1} u
niformly at random to each state x (Figure 2, left). We restricted \u03b4 to these discrete choices for illustrative purposes. We then used model-based policy gradient (Sutton et al., 2000) to find the policy maximizing \u2211_{x\u2208X} \u03b4(x) V^\u03c0(x). We observed some local minima or accumulation points but as a whole reasonable solutions were found. The resulting network flow and AVF for a particular sample are shown in Figure 2. For most states, the signs of \u03b4 and d_\u03c0 agree; however, this is not true of all states (larger version and more examples in appendix, Figures 6, 7). As expected, states for which d_\u03c0 > 0 (respectively, d_\u03c0 < 0) correspond to states maximizing (resp. minimizing) the value function. Finally, we remark on the \u201cflow\u201d nature of d_\u03c0: trajectories over minimizing states accumulate in corners or loops, while those over maximizing states flow to the goal. We conclude that AVFs exhibit interesting structure, and are generated by policies that are not random (Figure 2, right). As we will see next, this is a key differentiator in making AVFs good auxiliary tasks. 4.2 Representation Learning with AVFs We next consider the representations that arise from training a deep network to predict AVFs (denoted AVF from here on). We sample k = 1000 interest functions and use Algorithm 1 to generate k AVFs. We combine these AVFs into the representation loss (5) and adapt the parameters of the deep network using RMSProp (Tieleman and Hinton, 2012). We contrast the AVF-driven representation with one learned by predicting the value function of random deterministic policies (RP). Specifically, these policies are generated by assigning an action uniformly at random to each state. We also consider the value function of the uniformly random policy (VALUE). While we make these choices here for concreteness, other experiments yielded similar results (e.g. predicting the value of the optimal policy; appendix, Figure 8). In all cases, we learn a d = 16 dimensional representation, not including the bias unit. Value function; Random policies (k=1000); AVFs 
(k=1000) Figure 3: 16-dimensional representations learned by predicting a single value function, the value functions of 1000 random policies, or 1000 AVFs sampled using Algorithm 1. Each panel element depicts the activation of a given feature across states, with blue/red indicating low/high activation. Figure 3 shows the representations learned by the three methods. The features learned by VALUE resemble the value function itself (top left feature) or its negated image (bottom left feature). Coarsely speaking, these features capture the general distance to the goal but little else. The features learned by RP are of even worse quality. This is because almost all random deterministic policies cause the agent to avoid the goal (appendix, Figure 12). The representation learned by AVF, on the other hand, captures the structure of the domain, including paths between distal states and focal points corresponding to rooms or parts of rooms. Although our focus is on the use of AVFs as auxiliary tasks to a deep network, we observe the same results when discovering a representation using singular value decomposition (Section 3.2), as described in Appendix I. All in all, our results illustrate that, among all value functions, AVFs are particularly useful auxiliary tasks for representation learning. 4.3 Learning the Optimal Policy Figure 4: Average discounted return achieved by policies learned using a representation produced by VALUE, AVF, or PVF. Average is over 20 random seeds and shading gives standard deviation. In a final set of experiments, we consider learning a reward-maximizing policy using a pretrained representation and a model-based version of the SARSA algorithm (Rummery and Niranjan, 1994; Sutton and Barto, 1998). We compare the value-based and AVF-based representations from the previous section (VALUE and AVF), and also proto-value functions (PVF; details in Appendix H.3). We report the quality of the learned policies after training, as a function of d, the size of the representation. Our quality measure is the average return from the designated start state (bottom left). Results are provided in Figure 4 and Figure 13 (appendix). We observe a failure of the VALUE representation to provide a useful basis for learning a good policy, even as d increases; while the representation is not rank-deficient, the fea
tures do not help reduce the approximation error. In comparison, our AVF representations perform similarly to PVFs. Increasing the number of auxiliary tasks also leads to better representations; recall that PVF implicitly uses n = 104 auxiliary tasks. 5 Related Work Our work takes inspiration from research in basis or feature construction for reinforcement learning. Ratitch and Precup (2004), Foster and Dayan (2002), Menache et al. (2005), Yu and Bertsekas (2009), Bhatnagar et al. (2013), and Song et al. (2016) consider methods for adapting parametrized basis functions using iterative schemes. Including Mahadevan and Maggioni (2007)\u2019s proto-value functions, a number of works (we note Dayan, 1993; Petrik, 2007; Mahadevan and Liu, 2010; Ruan et al., 2015; Barreto et al., 2017) have used characteristics of the transition structure of the MDP to generate representations; these are the closest in spirit to our approach, although none use the reward or consider the geometry of the space of value functions. Parr et al. (2007) proposed constructing a representation from successive Bellman errors; Keller et al. (2006) used dimensionality reduction methods; finally, Hutter (2009) proposes a universal scheme for selecting representations. Deep reinforcement learning algorithms have made extensive use of auxiliary tasks to improve agent performance, beginning perhaps with universal value function approximators (Schaul et al., 2015) and the UNREAL architecture (Jaderberg et al., 2017); see also Dosovitskiy and Koltun (2017), Fran\u00e7ois-Lavet et al. (2018) and, more tangentially, van den Oord et al. (2018). Levine et al. (2017) and Chung et al. (2019) make explicit use of two-part networks to derive more sample-efficient deep reinforcement learning algorithms. Veeriah et al. (2019) use a meta-gradient approach to generate auxiliary tasks. The notion of augmenting an agent with side predictions is not new, with roots in TD models (Sutton, 1995), predictive state representations (Littman et al., 2002), and the Horde architecture (Sutton et al., 2011), itself inspired by the work of Selfridge (1959). A number of works quantify or explain the usefulness of a representation. Parr et al. (2008) demonstrated that a good representation should support a good approximation of both reward and expected next state. W
e conjecture that the relaxed problem (4) trades these two quantities off in a principled fashion. Li et al. (2006) and Abel et al. (2016) consider the approximation error that arises from state abstraction. More recently, Nachum et al. (2019) provide some interesting guarantees in the context of hierarchical reinforcement learning, while Such et al. (2019) visualize the representations learned by Atari-playing agents. Finally, Bertsekas (2018) remarks on the two-part network we study here. 6 Conclusion In this paper we studied the notion of an adversarial value function, derived from a geometric perspective on representation learning in RL. Our work shows that adversarial value functions exhibit interesting structure, and are good auxiliary tasks when learning a representation of an environment. We believe our work to be the first to provide formal evidence as to the usefulness of predicting value functions for shaping an agent\u2019s representation. Our work opens up the possibility of automatically generating auxiliary tasks in deep reinforcement learning, analogous to how deep learning itself enabled a move away from hand-crafted features. To do so, we expect that a number of practical challenges will need to be overcome: Off-policy learning. A practical implementation will require learning AVFs concurrently with the main task. Doing so results in off-policy learning, whose negative effects are well-documented even in recent applications (e.g. van Hasselt et al., 2018). Policy parametrization. AVFs are the value functions of deterministic policies. While a natural choice is to look for policies that maximize representation error, this poses the problem of how to parametrize the policies themselves. In particular, a policy parametrized using the representation \u03c6 may not provide a sufficient degree of \u201cadversariality\u201d. Smoothness in the interest function. In continuous or large state spaces, it is desirable for interest functions to incorporate some degree of smoothness, rather than vary rapidly from state to state. It is not clear how to control this smoothness in a principled manner. From a mathematical perspective, our formulation of the RLP was made with both convenience and geometry in mind. Conceptually, it may be interesting to consider our approach in other norms, including t
he weighted norms used in approximation results. Practically, this would translate into an emphasis on \u201cinteresting\u201d value functions, for example by giving additional weight to the optimal value function and its neighbouring AVFs. 7 Acknowledgements The authors thank the many people who helped shape this project through discussions and feedback on early and late drafts: Lihong Li, George Tucker, Doina Precup, Ofir Nachum, Csaba Szepesv\u00e1ri, Georg Ostrovski, Marek Petrik, Marlos Machado, Tim Lillicrap, Danny Tarlow, Hugo Larochelle, Saurabh Kumar, Carles Gelada, R\u00e9mi Munos, David Silver, and Andr\u00e9 Barreto. Special thanks also to Philip Thomas and Scott Niekum, who gave this project its initial impetus. 8 Author Contributions M.G.B., W.D., D.S., and N.L.R. conceptualized the representation learning problem. M.G.B., W.D., T.L., A.A.T., R.D., D.S., and N.L.R. contributed to the theoretical results. M.G.B., W.D., P.S.C., R.D., and C.L. performed experiments and collated results. All authors contributed to the writing. References Abadi, M., Barham, P., Chen, J., Chen, Z., Davis, A., Dean, J., Devin, M., Ghemawat, S., Irving, G., Isard, M., et al. (2016). TensorFlow: A system for large-scale machine learning. In Symposium on Operating Systems Design and Implementation. Abel, D., Hershkowitz, D. E., and Littman, M. L. (2016). Near optimal behavior via approximate state abstraction. In Proceedings of the International Conference on Machine Learning. Andrychowicz, M., Wolski, F., Ray, A., Schneider, J., Fong, R., Welinder, P., McGrew, B., Tobin, J., Abbeel, O. P., and Zaremba, W. (2017). Hindsight experience replay. In Advances in Neural Information Processing Systems. Barreto, A., Dabney, W., Munos, R., Hunt, J. J., Schaul, T., van Hasselt, H. P., and Silver, D. (2017). Successor features for transfer in reinforcement learning. In Advances in Neural Information Processing Systems. Behzadian, B. and Petrik, M. (2018). Feature selection by singular value decomposition for reinforcement learning. In Proceedings of the ICML Prediction and Generative Modeling Workshop. Bellemare, M. G., Dabney, W., and Munos, R. (2017). A distributional perspective on reinforcement learning. In Proceedings of the International Conference on Machine Learning. Bellemare, M. G., Naddaf, Y
., Veness, J., and Bowling, M. (2013). The Arcade Learning Environment: An evaluation platform for general agents. Journal of Artificial Intelligence Research, 47:253\u2013279.\nBellman, R. E. (1957). Dynamic programming. Princeton University Press, Princeton, NJ.\nBernhard, K. and Vygen, J. (2008). Combinatorial optimization: Theory and algorithms. Springer, Third Edition, 2005.\nBertsekas, D. P. (2012). Dynamic Programming and Optimal Control, Vol. II: Approximate Dynamic Programming. Athena Scientific.\nBertsekas, D. P. (2018). Feature-based aggregation and deep reinforcement learning: A survey and some new implementations. Technical report, MIT/LIDS.\nBhatnagar, S., Borkar, V. S., and Prabuchandran, K. (2013). Feature search in the Grassmanian in online reinforcement learning. IEEE Journal of Selected Topics in Signal Processing.\nBoyd, S. and Vandenberghe, L. (2004). Convex optimization. Cambridge University Press.\nCastro, P. S., Moitra, S., Gelada, C., Kumar, S., and Bellemare, M. G. (2018). Dopamine: A research framework for deep reinforcement learning. arXiv.\nChung, W., Nath, S., Joseph, A. G., and White, M. (2019). Two-timescale networks for nonlinear value function approximation. In International Conference on Learning Representations.\nDadashi, R., Ta\u00efga, A. A., Roux, N. L., Schuurmans, D., and Bellemare, M. G. (2019). The value function polytope in reinforcement learning.\nDayan, P. (1993). Improving generalisation for temporal difference learning: The successor representation. Neural Computation.\nDosovitskiy, A. and Koltun, V. (2017). Learning to act by predicting the future. In Proceedings of the International Conference on Learning Representations.\nErnst, D., Geurts, P., and Wehenkel, L. (2005). Tree-based batch mode reinforcement learning. Journal of Machine Learning Research, 6:503\u2013556.\nFarahmand, A., Ghavamzadeh, M., Szepesv\u00e1ri, C., and Mannor, S. (2016). Regularized policy iteration with nonparametric function spaces. Journal of Machine Learning Research.\nFoster, D. and Dayan, P. (2002). Structure in the space of value functions. Machine Learning.\nFran\u00e7ois-Lavet, V., Bengio, Y., Precup, D., and Pineau, J. (2018). Combined reinforcement learning via abstract representations. arXiv.\nGelada, C., Kumar, S., Buckman, J
., Nachum, O., and Bellemare, M. G. (2019). DeepMDP: Learning continuous latent space models for representation learning. In Proceedings of the International Conference on Machine Learning.\nHutter, M. (2009). Feature reinforcement learning: Part I. Unstructured MDPs. Journal of Artificial General Intelligence.\nJaderberg, M., Mnih, V., Czarnecki, W. M., Schaul, T., Leibo, J. Z., Silver, D., and Kavukcuoglu, K. (2017). Reinforcement learning with unsupervised auxiliary tasks. In Proceedings of the International Conference on Learning Representations.\nKeller, P. W., Mannor, S., and Precup, D. (2006). Automatic basis function construction for approximate dynamic programming and reinforcement learning. In Proceedings of the International Conference on Machine Learning.\nLagoudakis, M. and Parr, R. (2003). Least-squares policy iteration. The Journal of Machine Learning Research.\nLevine, N., Zahavy, T., Mankowitz, D., Tamar, A., and Mannor, S. (2017). Shallow updates for deep reinforcement learning. In Advances in Neural Information Processing Systems.\nLi, L., Walsh, T., and Littman, M. (2006). Towards a unified theory of state abstraction for MDPs. In Proceedings of the Ninth International Symposium on Artificial Intelligence and Mathematics.\nLiang, Y., Machado, M. C., Talvitie, E., and Bowling, M. H. (2016). State of the art control of atari games using shallow reinforcement learning. In Proceedings of the International Conference on Autonomous Agents and Multiagent Systems.\nLittman, M. L., Sutton, R. S., and Singh, S. (2002). Predictive representations of state. In Advances in Neural Information Processing Systems.\nMachado, M. C., Bellemare, M. G., and Bowling, M. (2017). A Laplacian framework for option discovery in reinforcement learning. In Proceedings of the International Conference on Machine Learning.\nMachado, M. C., Rosenbaum, C., Guo, X., Liu, M., Tesauro, G., and Campbell, M. (2018). Eigenoption discovery through the deep successor representation. In Proceedings of the International Conference on Learning Representations.\nMahadevan, S. and Liu, B. (2010). Basis construction from power series expansions of value functions. In Advances in Neural Information Processing Systems.\nMahadevan, S. and Maggioni, M. (2007). Proto-value functions: A Laplacian framework f
or learning representation and control in Markov decision processes. Journal of Machine Learning Research.\nMenache, I., Mannor, S., and Shimkin, N. (2005). Basis function adaptation in temporal difference reinforcement learning. Annals of Operations Research.\nMnih, V., Kavukcuoglu, K., Silver, D., Rusu, A. A., Veness, J., Bellemare, M. G., Graves, A., Riedmiller, M., Fidjeland, A. K., Ostrovski, G., et al. (2015). Human-level control through deep reinforcement learning. Nature, 518(7540):529\u2013533.\nMunos, R. (2003). Error bounds for approximate policy iteration. In Proceedings of the International Conference on Machine Learning.\nMunos, R. (2007). Performance bounds in lp-norm for approximate value iteration. SIAM Journal on Control and Optimization.\nNachum, O., Gu, S., Lee, H., and Levine, S. (2019). Near-optimal representation learning for hierarchical reinforcement learning. In Proceedings of the International Conference on Learning Representations.\nParr, R., Li, L., Taylor, G., Painter-Wakefield, C., and Littman, M. L. (2008). An analysis of linear models, linear value-function approximation, and feature selection for reinforcement learning. In Proceedings of the International Conference on Machine Learning.\nParr, R., Painter-Wakefield, C., Li, L., and Littman, M. (2007). Analyzing feature generation for value-function approximation. In Proceedings of the International Conference on Machine Learning.\nPetrik, M. (2007). An analysis of Laplacian methods for value function approximation in MDPs. In Proceedings of the International Joint Conference on Artificial Intelligence.\nPetrik, M. and Zilberstein, S. (2011). Robust approximate bilinear programming for value function approximation. Journal of Machine Learning Research.\nPuterman, M. L. (1994). Markov Decision Processes: Discrete stochastic dynamic programming. John Wiley & Sons, Inc.\nRatitch, B. and Precup, D. (2004). Sparse distributed memories for on-line value-based reinforcement learning. In Proceedings of the European Conference on Machine Learning.\nRockafellar, R. T. and Wets, R. J.-B. (2009). Variational analysis. Springer Science & Business Media.\nRuan, S. S., Comanici, G., Panangaden, P., and Precup, D. (2015). Representation discovery for mdps using bisimulation metrics. In Proceedings of t
he AAAI Conference on Artificial Intelligence.\nRummery, G. A. and Niranjan, M. (1994). On-line Q-learning using connectionist systems. Technical report, Cambridge University Engineering Department.\nSamuel, A. L. (1959). Some studies in machine learning using the game of checkers. IBM Journal of Research and Development.\nSchaul, T., Horgan, D., Gregor, K., and Silver, D. (2015). Universal value function approximators. In Proceedings of the International Conference on Machine Learning.\nSelfridge, O. (1959). Pandemonium: A paradigm for learning. In Symposium on the mechanization of thought processes.\nSilver, D., Huang, A., Maddison, C. J., Guez, A., Sifre, L., van den Driessche, G., Schrittwieser, J., Antonoglou, I., Panneershelvam, V., Lanctot, M., Dieleman, S., Grewe, D., Nham, J., Kalchbrenner, N., Sutskever, I., Lillicrap, T., Leach, M., Kavukcuoglu, K., Graepel, T., and Hassabis, D. (2016). Mastering the game of Go with deep neural networks and tree search. Nature, 529(7587):484\u2013489.\nSolway, A., Diuk, C., C\u00f3rdova, N., Yee, D., Barto, A. G., Niv, Y., and Botvinick, M. M. (2014). Optimal behavioral hierarchy. PLOS Computational Biology.\nSong, Z., Parr, R., Liao, X., and Carin, L. (2016). Linear feature encoding for reinforcement learning. In Advances in Neural Information Processing Systems.\nStachenfeld, K. L., Botvinick, M., and Gershman, S. J. (2014). Design principles of the hippocampal cognitive map. In Advances in Neural Information Processing Systems.\nSuch, F. P., Madhavan, V., Liu, R., Wang, R., Castro, P. S., Li, Y., Schubert, L., Bellemare, M. G., Clune, J., and Lehman, J. (2019). An Atari model zoo for analyzing, visualizing, and comparing deep reinforcement learning agents. In Proceedings of the International Joint Conference on Artificial Intelligence.\nSutton, R., Modayil, J., Delp, M., Degris, T., Pilarski, P., White, A., and Precup, D. (2011). Horde: A scalable real-time architecture for learning knowledge from unsupervised sensorimotor interaction. In Proceedings of the International Conference on Autonomous Agents and Multiagent Systems.\nSutton, R. S. (1995). TD models: Modeling the world at a mixture of time scales. In Proceedings of the International Conference on Machine Learning.\nSutton, R. S. (1996). Generalization in reinforcement learning: Succes
sful examples using sparse coarse coding. In Advances in Neural Information Processing Systems.\nSutton, R. S. and Barto, A. G. (1998). Reinforcement learning: An introduction. MIT Press.\nSutton, R. S., Mahmood, A. R., and White, M. (2016). An emphatic approach to the problem of off-policy temporal-difference learning. Journal of Machine Learning Research.\nSutton, R. S., McAllester, D. A., Singh, S. P., and Mansour, Y. (2000). Policy gradient methods for reinforcement learning with function approximation. In Advances in Neural Information Processing Systems.\nSutton, R. S., Precup, D., and Singh, S. P. (1999). Between MDPs and semi-MDPs: A framework for temporal abstraction in reinforcement learning. Artificial Intelligence.\nTesauro, G. (1995). Temporal difference learning and TD-Gammon. Communications of the ACM, 38(3).\nTieleman, T. and Hinton, G. (2012). RmsProp: Divide the gradient by a running average of its recent magnitude. COURSERA: Neural Networks for Machine Learning.\nTosatto, S., Pirotta, M., D\u2019Eramo, C., and Restelli, M. (2017). Boosted fitted q-iteration. In Proceedings of the International Conference on Machine Learning.\nvan den Oord, A., Li, Y., and Vinyals, O. (2018). Representation learning with contrastive predictive coding. In Advances in Neural Information Processing Systems.\nvan Hasselt, H., Doron, Y., Strub, F., Hessel, M., Sonnerat, N., and Modayil, J. (2018). Deep reinforcement learning and the deadly triad. arXiv.\nVeeriah, V., Hessel, M., Xu, Z., Lewis, R., Rajendran, J., Oh, J., van Hasselt, H., Silver, D., and Singh, S. (2019). Discovery of useful questions as auxiliary tasks. In Advances in Neural Information Processing Systems.\nWu, Y., Tucker, G., and Nachum, O. (2019). The Laplacian in RL: Learning representations with efficient approximations. In Proceedings of the International Conference on Learning Representations.\nYu, H. and Bertsekas, D. P. (2009). Basis function adaptation methods for cost approximation in mdp. In Proceedings of the IEEE Symposium on Adaptive Dynamic Programming and Reinforcement Learning.", "award": [], "sourceid": 2438, "authors": [{"given_name": "Marc", "family_name": "Bellemare", "institution": "Google Brain"}, {"given_name": "Will", "family_name": "Dabney", "institution": 
"DeepMind"}, {"given_name": "Robert", "family_name": "Dadashi", "institution": "Google Brain"}, {"given_name": "Adrien", "family_name": "Ali Taiga", "institution": "MILA"}, {"given_name": "Pablo Samuel", "family_name": "Castro", "institution": "Google"}, {"given_name": "Nicolas", "family_name": "Le Roux", "institution": "Google Brain"}, {"given_name": "Dale", "family_name": "Schuurmans", "institution": "Google Inc."}, {"given_name": "Tor", "family_name": "Lattimore", "institution": "DeepMind"}, {"given_name": "Clare", "family_name": "Lyle", "institution": "University of Oxford"}]}