{"title": "Online Dynamic Programming", "book": "Advances in Neural Information Processing Systems", "page_first": 2827, "page_last": 2837, "abstract": "We consider the problem of repeatedly solving a variant of the same dynamic programming problem in successive trials. An instance of the type of problems we consider is to find a good binary search tree in a changing environment. At the beginning of each trial, the learner probabilistically chooses a tree with the n keys at the internal nodes and the n + 1 gaps between keys at the leaves. The learner is then told the frequencies of the keys and gaps and is charged by the average search cost for the chosen tree. The problem is online because the frequencies can change between trials. The goal is to develop algorithms with the property that their total average search cost (loss) in all trials is close to the total loss of the best tree chosen in hindsight for all trials. The challenge, of course, is that the algorithm has to deal with exponential number of trees. We develop a general methodology for tackling such problems for a wide class of dynamic programming algorithms. Our framework allows us to extend online learning algorithms like Hedge and Component Hedge to a significantly wider class of combinatorial objects than was possible before.", "full_text": "Online Dynamic Programming\n\nHolakou Rahmanian\n\nDepartment of Computer Science\nUniversity of California Santa Cruz\n\nSanta Cruz, CA 95060\nholakou@ucsc.edu\n\nManfred K. Warmuth\n\nDepartment of Computer Science\nUniversity of California Santa Cruz\n\nSanta Cruz, CA 95060\nmanfred@ucsc.edu\n\nAbstract\n\nWe consider the problem of repeatedly solving a variant of the same dynamic\nprogramming problem in successive trials. An instance of the type of problems\nwe consider is to \ufb01nd a good binary search tree in a changing environment. 
At the beginning of each trial, the learner probabilistically chooses a tree with the n keys at the internal nodes and the n + 1 gaps between keys at the leaves. The learner is then told the frequencies of the keys and gaps and is charged by the average search cost for the chosen tree. The problem is online because the frequencies can change between trials. The goal is to develop algorithms with the property that their total average search cost (loss) in all trials is close to the total loss of the best tree chosen in hindsight for all trials. The challenge, of course, is that the algorithm has to deal with an exponential number of trees. We develop a general methodology for tackling such problems for a wide class of dynamic programming algorithms. Our framework allows us to extend online learning algorithms like Hedge [16] and Component Hedge [25] to a significantly wider class of combinatorial objects than was possible before.

1 Introduction

Consider the following online learning problem. In each trial, the algorithm plays with a Binary Search Tree (BST) for a given set of n keys. Then the adversary reveals a set of probabilities for the n keys and their n + 1 gaps, and the algorithm incurs a linear loss: the average search cost. The goal is to predict with a sequence of BSTs minimizing the regret, which is the difference between the total loss of the algorithm and the total loss of the single best BST chosen in hindsight.

A natural approach to solve this problem is to keep track of a distribution over all possible BSTs during the trials (e.g. by running the Hedge algorithm [16] with one weight per BST). However, this seems impractical since it requires maintaining a weight vector of exponential size. Here we focus on combinatorial objects that are comprised of n components where the number of objects is typically exponential in n.
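To make the average search cost concrete, here is a small illustrative computation (our own sketch, not code from the paper): with the root counted at depth 1, the loss of a fixed BST is the dot product between the node depths and the revealed frequencies.

```python
# Illustrative sketch: the loss of a BST is the dot product between
# node depths (keys and gaps) and the search frequencies.
def average_search_cost(key_depths, gap_depths, p, q):
    """Average search cost; root has depth 1, p/q are key/gap frequencies."""
    assert len(key_depths) + 1 == len(gap_depths)
    return (sum(d * f for d, f in zip(key_depths, p)) +
            sum(d * f for d, f in zip(gap_depths, q)))

# Tree over keys K1 < K2 < K3 with K2 at the root:
#        K2
#       /  \
#      K1   K3
#   D0  D1 D2  D3     (the 4 gaps sit at the leaves)
key_depths = [2, 1, 2]        # depths of K1, K2, K3
gap_depths = [3, 3, 3, 3]     # depths of D0..D3
p = [0.2, 0.3, 0.1]           # key frequencies
q = [0.1, 0.1, 0.1, 0.1]      # gap frequencies (all frequencies sum to 1)
cost = average_search_cost(key_depths, gap_depths, p, q)  # 0.9 + 1.2 = 2.1
```

The adversary may change p and q in every trial, which changes which tree is best.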
For a BST the components are the depth values of the keys and the gaps in the tree. This line of work requires that the loss of an object is linear in the components (see e.g. [35]). In our BST example the loss is simply the dot product between the components and the frequencies.

There has been much work on developing efficient algorithms for learning objects that are composed of components when the loss is linear in the components. These algorithms get away with keeping one weight per component instead of one weight per object. Previous work includes learning k-sets [36], permutations [19, 37, 2] and paths in a DAG [35, 26, 18, 11, 5]. There are also general tools for learning such combinatorial objects with linear losses. The Follow the Perturbed Leader (FPL) algorithm [22] is a simple algorithm that adds random perturbations to the cumulative loss of each component, and then predicts with the combinatorial object that has the minimum perturbed loss. The Component Hedge (CH) algorithm [25] (and its extensions [34, 33, 17]) constitutes another generic approach. Each object is typically represented as a bit vector over the set of components where the 1-bits indicate the components appearing in the object. The algorithm maintains a mixture of the weight vectors representing all objects. The weight space of CH is thus the convex hull of the weight vectors representing the objects. This convex hull is a polytope of dimension n with the objects as corners. For the efficiency of CH it is typically required that this polytope has a small number of facets (polynomial in n). The CH algorithm predicts with a random corner of the polytope whose expectation equals the maintained mixture vector in the polytope.

31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.

Unfortunately the results of CH and its current extensions cannot be directly applied to problems like BST.
This is because the BST polytope discussed above does not have a characterization with polynomially many facets. There is an alternate polytope for BSTs with a polynomial number of facets (called the associahedron [29]), but the average search cost is not linear in the components used for this polytope. We close this gap by exploiting the dynamic programming algorithm which solves the BST optimization problem. This gives us a polytope with a polynomial number of facets while the loss is linear in the natural components of the BST problem.

Contributions  We propose a general method for learning combinatorial objects whose optimization problem can be solved efficiently via an algorithm belonging to a wide class of dynamic programming algorithms. Examples include BST (see Section 4.1), Matrix-Chain Multiplication, Knapsack, Rod Cutting, and Weighted Interval Scheduling (see Appendix A). Using the underlying graph of subproblems induced by the dynamic programming algorithm for these problems, we define a representation of the combinatorial objects by encoding them as a specific type of subgraphs called k-multipaths. These subgraphs encode each object as a series of successive decisions (i.e. the components) over which the loss is linear. Also the associated polytope has a polynomial number of facets. These properties allow us to apply the standard Hedge [16, 28] and Component Hedge [25] algorithms.

Paper Outline  In Section 2 we start with online learning of paths, which are the simplest type of subgraphs we consider. This section briefly describes the two main existing algorithms for the path problem: (1) an efficient implementation of Hedge using path kernels and (2) Component Hedge. Section 3 introduces a much richer class of subgraphs, called k-multipaths, and generalizes the algorithms. In Section 4, we define a class of combinatorial objects recognized by dynamic programming algorithms.
Then we prove that repeatedly minimizing a dynamic programming problem from this class over trials reduces to online learning of k-multipaths. The online learning for BSTs uses k-multipaths with k = 2 (Section 4.1). A large number of additional examples are discussed in Appendix A. Finally, Section 5 concludes with a comparison to other algorithms and future work, and discusses how our method is generalized to arbitrary "min-sum" dynamic programming problems.

2 Background

Perhaps the simplest algorithms in online learning are the "experts algorithms" like the Randomized Weighted Majority [28] or the Hedge algorithm [16]. They keep track of a probability vector over all experts. The weight/probability w_i of expert i is proportional to exp(−η L(i)), where L(i) is the cumulative loss of expert i until the current trial and η is a non-negative learning rate. In this paper we use exponentially many combinatorial objects (composed of components) as the set of experts. When Hedge is applied to such combinatorial objects, we call it Expanded Hedge (EH) because it is applied to a combinatorially "expanded domain". As we shall see, if the loss is linear over the components (and thus the exponential weight of an object becomes a product over components), then this often can be exploited to obtain an efficient implementation of EH.

Learning Paths  The online shortest path problem has been explored both in the full information setting [35, 25] and in various bandit settings [18, 4, 5, 12]. Concretely, the problem in the full information setting is as follows. We are given a directed acyclic graph (DAG) G = (V, E) with a designated source node s ∈ V and sink node t ∈ V. In each trial, the algorithm predicts with a path from s to t. Then for each edge e ∈ E, the adversary reveals a loss ℓ_e ∈ [0, 1]. The loss of the algorithm is given by the sum of the losses of the edges along the predicted path.
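Before moving on, the exponential weighting used by Hedge, described at the start of this section, can be illustrated with a small snippet (ours, not the paper's; the losses and learning rate are made up):

```python
import math

def hedge_weights(cum_losses, eta):
    """Weight of expert i is proportional to exp(-eta * L_i)."""
    raw = [math.exp(-eta * L) for L in cum_losses]
    Z = sum(raw)                      # normalization
    return [r / Z for r in raw]

# Three experts with cumulative losses 3, 1 and 2: the middle expert,
# having suffered the least loss, receives the largest probability.
w = hedge_weights([3.0, 1.0, 2.0], eta=0.5)
```

Applied naively to combinatorial objects, `cum_losses` would have one entry per object, which is exactly the exponential blow-up that EH avoids by exploiting linearity of the loss.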
The goal is to minimize the regret, which is the difference between the total loss of the algorithm and that of the single best path chosen in hindsight.

Expanded Hedge on Paths  Takimoto and Warmuth [35] found an efficient implementation of EH by exploiting the additivity of the loss over the edges of a path. In this case the weight w_π of a path π is proportional to ∏_{e∈π} exp(−η L_e), where L_e is the cumulative loss of edge e. The algorithm maintains one weight w_e per edge such that the total weight of all edges leaving any non-sink node sums to 1. This implies that w_π = ∏_{e∈π} w_e and sampling a path is easy. At the end of the current trial, each edge e receives additional loss ℓ_e, and the updated path weights have the form w_π^new = (1/Z) ∏_{e∈π} w_e exp(−η ℓ_e), where Z is a normalization. Now a certain efficient procedure called weight pushing [31] is applied. It finds new edge weights w_e^new such that the total outflow out of each node is one and the updated weights are again in "product form", i.e. w_π^new = ∏_{e∈π} w_e^new, facilitating sampling.

Theorem 1 (Takimoto-Warmuth [35]). Given a DAG G = (V, E) with designated source node s ∈ V and sink node t ∈ V, assume N is the number of paths in G from s to t, L* is the total loss of the best path, and B is an upper bound on the loss of any path in each trial. Then with proper tuning of the learning rate η over the T trials, EH guarantees:

E[L_EH] − L* ≤ B √(2 T log N) + B log N.

Component Hedge on Paths  Koolen, Warmuth and Kivinen [25] applied CH to the path problem. The edges are the components of the paths. A path is encoded as a bit vector π of |E| components where the 1-bits are the edges in the path. The convex hull of all paths is called the unit-flow polytope. CH maintains a mixture vector in this polytope.
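The weight-pushing step of EH described above can be sketched on a toy DAG (our illustration; the graph, edge weights and losses are made up):

```python
import math

# Toy DAG s -> {a, b} -> t with locally normalized edge weights.
edges = {('s', 'a'): 0.7, ('s', 'b'): 0.3, ('a', 't'): 1.0, ('b', 't'): 1.0}
losses = {('s', 'a'): 0.9, ('s', 'b'): 0.1, ('a', 't'): 0.5, ('b', 't'): 0.2}
eta = 1.0

# Multiplicative update: w_e <- w_e * exp(-eta * loss_e).
w = {e: we * math.exp(-eta * losses[e]) for e, we in edges.items()}

# Weight pushing: Z_t = 1 at the sink; recursing backwards,
# Z_v = sum of w_e * Z_u over the edges e = (v, u) leaving v.
Z = {'t': 1.0}
for v in ['b', 'a', 's']:            # reverse topological order
    Z[v] = sum(we * Z[u] for (x, u), we in w.items() if x == v)

# The pushed weights w_e <- w_e * Z_u / Z_v are again locally normalized,
# so sampling a path edge by edge is easy.
w_new = {(v, u): we * Z[u] / Z[v] for (v, u), we in w.items()}
```

After the push, the weights of the edges leaving each non-sink node again sum to one, so the product-form path distribution can be sampled by a forward walk.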
The constraints of the polytope enforce an outflow of 1 from the source node s, and flow conservation at every node except the sink node t. In each trial, the weight of each edge w_e is updated multiplicatively by the factor exp(−η ℓ_e). Then the weight vector is projected back to the unit-flow polytope via a relative entropy projection. This projection is achieved by iteratively projecting onto the flow constraint of a particular vertex and then repeatedly cycling through the vertices [8]. Finally, to sample with the same expectation as the mixture vector in the polytope, this vector is decomposed into paths using a greedy approach which removes one path at a time and zeros out at least one edge of the remaining mixture vector in each iteration.

Theorem 2 (Koolen-Warmuth-Kivinen [25]). Given a DAG G = (V, E) with designated source node s ∈ V and sink node t ∈ V, let D be a length bound on the paths in G from s to t against which the CH algorithm is compared. Also denote the total loss of the best path of length at most D by L*. Then with proper tuning of the learning rate η over the T trials, CH guarantees:

E[L_CH] − L* ≤ D √(4 T log |V|) + 2 D log |V|.

Much of this paper is concerned with generalizing the tools sketched in this section from paths to k-multipaths, from the unit-flow polytope to the k-flow polytope, and with developing a generalized version of weight pushing for k-multipaths.

3 Learning k-Multipaths

As we shall see, k-multipaths will be subgraphs of k-DAGs built from k-multiedges. Examples of all the definitions are given in Figure 1 for the case k = 2.

Definition 1 (k-DAG).
A DAG G = (V, E) is called a k-DAG if it has the following properties:
(i) There exists one designated "source" node s ∈ V with no incoming edges.
(ii) There exists a set of "sink" nodes T ⊂ V which is the set of nodes with no outgoing edges.
(iii) For all non-sink vertices v, the set of edges leaving v is partitioned into disjoint sets of size k which are called k-multiedges.

We denote the set of multiedges "leaving" vertex v as M_v and all multiedges of the DAG as M.

Each k-multipath can be generated by starting with a single multiedge at the source and choosing inflow many (i.e. number of incoming edges many) successor multiedges at the internal nodes (until we reach the sink nodes in T). An example of a 2-multipath is given in Figure 1. Recall that paths were described as bit vectors π of size |E| where the 1-bits were the edges in the path. In k-multipaths each edge bit π_e becomes a non-negative count.

Figure 1: On the left we give an example of a 2-DAG. The source s and the nodes in the first layer each have two 2-multiedges, depicted in red and blue. The nodes in the next layer each have one 2-multiedge, depicted in green. An example of a 2-multipath in the 2-DAG is given on the right. The 2-multipath is represented as an |E|-dimensional count vector π. The grayed edges are the edges with count π_e = 0. All non-zero counts π_e are shown next to their associated edges e. Note that for nodes in the middle layers, the outflow is always 2 times the inflow.

Definition 2 (k-multipath). Given a k-DAG G = (V, E), let π ∈ ℕ^{|E|} in which π_e is associated with e ∈ E. Define the inflow π_in(v) := Σ_{(u,v)∈E} π_{(u,v)} and the outflow π_out(v) := Σ_{(v,u)∈E} π_{(v,u)}.
We call π a k-multipath if it has the following properties:
(i) The outflow π_out(s) of the source s is k.
(ii) For any two edges e, e′ in a multiedge m of G, π_e = π_e′. (When clear from the context, we denote this common value as π_m.)
(iii) For each vertex v ∈ V \ (T ∪ {s}), the outflow is k times the inflow, i.e. π_out(v) = k × π_in(v).

k-Multipath Learning Problem  We define the problem of online learning of k-multipaths on a given k-DAG as follows. In each trial, the algorithm randomly predicts with a k-multipath π. Then for each edge e ∈ E, the adversary reveals a loss ℓ_e ∈ [0, 1] incurred during that trial. The linear loss of the algorithm during this trial is given by π · ℓ. Observe that the online shortest path problem is the special case where k = |T| = 1. In the remainder of this section, we generalize the algorithms of Section 2 to the online learning problem of k-multipaths.

3.1 Expanded Hedge on k-Multipaths

We implement EH efficiently for learning k-multipaths by considering each k-multipath as an expert. Recall that each k-multipath can be generated by starting with a single multiedge at the source and choosing inflow many successor multiedges at the internal nodes. Multipaths are composed of multiedges as components, and with each multiedge m ∈ M we associate a weight w_m. We maintain a distribution W over multipaths defined in terms of the weights w ∈ ℝ^{|M|}_{≥0} on the multiedges. The distribution W will have the following canonical properties:

Definition 3 (EH distribution properties).
1. The weights are in product form, i.e. W(π) = ∏_{m∈M} (w_m)^{π_m}. Recall that π_m is the common value in π among the edges in m.
2. The weights are locally normalized, i.e. Σ_{m∈M_v} w_m = 1 for all v ∈ V \ T.
3. The total multipath weight is one, i.e. Σ_π W(π) = 1.

Using these properties, sampling a k-multipath from W can easily be done as follows.
We start with sampling a single k-multiedge at the source and continue sampling inflow many successor multiedges at the internal nodes until the k-multipath reaches the sink nodes in T. Observe that π_m indicates the number of times the k-multiedge m is sampled through this process. EH updates the weights of the multipaths as follows:

W^new(π) = (1/Z) W(π) exp(−η π · ℓ)
         = (1/Z) [ ∏_{m∈M} (w_m)^{π_m} ] exp[ −η Σ_{m∈M} π_m Σ_{e∈m} ℓ_e ]
         = (1/Z) ∏_{m∈M} ( w_m exp[−η Σ_{e∈m} ℓ_e] )^{π_m} = (1/Z) ∏_{m∈M} (ŵ_m)^{π_m}.

Thus the weight w_m of each k-multiedge m ∈ M is updated multiplicatively to ŵ_m by multiplying w_m with the exponentiated loss factor exp[−η Σ_{e∈m} ℓ_e] and then renormalizing with Z. Note that Σ_{e∈m} ℓ_e is the loss of multiedge m.

Generalized Weight Pushing  We generalize the weight pushing algorithm [31] to k-multipaths to reestablish the three canonical properties of Definition 3. The new weights W^new(π) = (1/Z) ∏_{m∈M} (ŵ_m)^{π_m} sum to 1 (i.e. Property 3 holds) since Z normalizes the weights. Our goal is to find new multiedge weights w_m^new so that the other two properties hold as well, i.e. W^new(π) = ∏_{m∈M} (w_m^new)^{π_m} and Σ_{m∈M_v} w_m^new = 1 for all non-sinks v. For this purpose, we introduce a normalization Z_v for each vertex v. Note that Z_s = Z where s is the source node. Now the generalized weight pushing finds new weights w_m^new for the multiedges to be used in the next trial:

1. For sinks v ∈ T, Z_v := 1.
2. Recursing backwards in the DAG, let Z_v := Σ_{m∈M_v} ŵ_m ∏_{u:(v,u)∈m} Z_u for all non-sinks v.
3. For each multiedge m from v to u_1, ..., u_k, w_m^new := ŵ_m ∏_{i=1}^{k} Z_{u_i} / Z_v.

Appendix B proves the correctness and time complexity of this generalized weight pushing algorithm.

Regret Bound  In order to apply the regret bound of EH [16], we have to initialize the distribution W on k-multipaths to the uniform distribution. This is achieved by setting all w_m to 1, followed by an application of generalized weight pushing. Note that Theorem 1 is the special case of the theorem below for k = 1.

Theorem 3. Given a k-DAG G with designated source node s and sink nodes T, assume N is the number of k-multipaths in G from s to T, L* is the total loss of the best k-multipath, and B is an upper bound on the loss of any k-multipath in each trial. Then with proper tuning of the learning rate η over the T trials, EH guarantees:

E[L_EH] − L* ≤ B √(2 T log N) + B log N.

3.2 Component Hedge on k-Multipaths

We implement CH efficiently for learning k-multipaths. Here the k-multipaths are the objects, which are represented as |E|-dimensional¹ count vectors π (Definition 2). The algorithm maintains an |E|-dimensional mixture vector w in the convex hull of the count vectors. This hull is the following polytope over weight vectors obtained by relaxing the integer constraints on the count vectors:

Definition 4 (k-flow polytope). Given a k-DAG G = (V, E), let w ∈ ℝ^{|E|}_{≥0} in which w_e is associated with e ∈ E. Define the inflow w_in(v) := Σ_{(u,v)∈E} w_{(u,v)} and the outflow w_out(v) := Σ_{(v,u)∈E} w_{(v,u)}. w belongs to the k-flow polytope of G if it has the following properties:
(i) The outflow w_out(s) of the source s is k.
(ii) For any two edges e, e′ in a multiedge m of G, w_e = w_e′.
(iii) For each vertex v ∈ V \ (T ∪ {s}), the outflow is k times the inflow, i.e.
w_out(v) = k × w_in(v).

¹For convenience we use the edges as components for CH instead of the multiedges as for EH.

In each trial, the weight of each edge w_e is updated multiplicatively to ŵ_e = w_e exp(−η ℓ_e) and then the weight vector ŵ is projected back to the k-flow polytope via a relative entropy projection:

w^new := argmin_{w ∈ k-flow polytope} Δ(w ‖ ŵ),  where Δ(a ‖ b) = Σ_i [ a_i log(a_i/b_i) + b_i − a_i ].

This projection is achieved by repeatedly cycling over the vertices and enforcing the local flow constraints at the current vertex. Based on the properties of the k-flow polytope in Definition 4, the corresponding projection steps can be rewritten as follows:

(i) Normalize the outflow w_out(s) to k.
(ii) Given a multiedge m, set the k weights in m to their geometric average.
(iii) Given a vertex v ∈ V \ (T ∪ {s}), scale the adjacent edges of v such that

w_out(v) := ( k (w_out(v))^k w_in(v) )^{1/(k+1)}  and  w_in(v) := (1/k) ( k (w_out(v))^k w_in(v) )^{1/(k+1)}.

See Appendix C for details.

Decomposition  The flow polytope has exponentially many objects as its corners. We now rewrite any vector w in the polytope as a mixture of at most |M| objects. CH then predicts with a random object drawn from this sparse mixture. The mixture vector is decomposed by greedily removing one multipath at a time from the current weight vector as follows: Ignore all edges with zero weight. Pick a multiedge at s and then iteratively pick inflow many multiedges at the internal nodes until the sink nodes are reached. Now subtract the constructed multipath, scaled by its minimum edge weight, from the mixture vector w. This zeros out at least k edges and maintains the flow constraints at the internal nodes.

Regret Bound  The regret bound for CH depends on a good choice of the initial weight vector w_init in the k-flow polytope. We use an initialization technique recently introduced in [32].
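For the special case k = 1 (paths and the unit-flow polytope), the greedy decomposition just described can be sketched as follows; the flow values and node names are made up for illustration:

```python
# Sketch of the greedy decomposition (k = 1 case): write a point of the
# unit-flow polytope as a convex combination of s-t paths.
def decompose(flow, s='s', t='t'):
    flow = dict(flow)
    mixture = []                         # list of (coefficient, path)
    while any(v > 1e-12 for v in flow.values()):
        path, v = [], s                  # follow positive-weight edges s -> t
        while v != t:
            e = next((u1, u2) for (u1, u2), w in flow.items()
                     if u1 == v and w > 1e-12)
            path.append(e)
            v = e[1]
        c = min(flow[e] for e in path)   # subtracting c zeroes >= 1 edge
        for e in path:
            flow[e] -= c
        mixture.append((c, path))
    return mixture

flow = {('s', 'a'): 0.6, ('s', 'b'): 0.4, ('a', 't'): 0.6, ('b', 't'): 0.4}
mix = decompose(flow)                    # two paths with weights 0.6 and 0.4
```

Since every iteration zeros at least one edge, the loop terminates after at most |E| iterations, which is what makes the resulting mixture sparse.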
Instead of explicitly selecting w_init in the k-flow polytope, the initial weight is obtained by projecting a point ŵ_init outside of the polytope to the inside. This yields the following regret bounds (Appendix D):

Theorem 4. Given a k-DAG G = (V, E), let D be an upper bound on the 1-norm of the k-multipaths in G. Also denote the total loss of the best k-multipath by L*. Then with proper tuning of the learning rate η over the T trials, CH guarantees:

E[L_CH] − L* ≤ D √(2 T (2 log |V| + log D)) + 2 D log |V| + D log D.

Moreover, when the k-multipaths are bit vectors:

E[L_CH] − L* ≤ D √(4 T log |V|) + 2 D log |V|.

Notice that by setting |T| = k = 1, the algorithm for path learning in [25] is recovered. Also observe that Theorem 2 is a corollary of Theorem 4 since every path is represented as a bit vector.

4 Online Dynamic Programming with Multipaths

We consider the problem of repeatedly solving a variant of the same dynamic programming problem in successive trials. We will use our definition of k-DAGs to describe a certain type of dynamic programming problem. The vertex set V is a set of subproblems to be solved. The source node s ∈ V is the final subproblem. The sink nodes T ⊂ V are the base subproblems. An edge from a node v to another node v′ means that subproblem v may recurse on v′. We assume a non-base subproblem v always breaks into exactly k smaller subproblems. A step of the dynamic programming recursion is thus represented by a k-multiedge. We assume the sets of k subproblems between possible recursive calls at a node are disjoint. This corresponds to the fact that the choice of multiedges at a node partitions the edge set leaving that node.

There is a loss associated with every sink node in T. Also, with the recursion at an internal node v, a local loss is added to the loss of the subproblems; this loss depends on v and the chosen k-multiedge
Recall that Mv is the set of multiedges leaving v. We can handle the following type of\n\u201cmin-sum\u201d recurrences:\n\nOPT(v) =(LT (v)\n\nminm2MvhPu:(v,u)2m OPT(u) + LM (m)i\n\nv 2T\nv 2 V T .\n\nThe problem of repeatedly solving such a dynamic programming problem over trials now becomes\nthe problem of online learning of k-multipaths in this k-DAG. Note that due to the correctness of the\ndynamic programming, every possible solution to the dynamic programming can be encoded as a\nk-multipath in the k-DAG and vice versa.\nThe loss of a given multipath is the sum of LM (m) over all multiedges m in the multipath plus the\nsum of LT (v) for all sink nodes v at the bottom of the multipath. To capture the same loss, we can\nalternatively de\ufb01ne losses over the edges of the k-DAG. Concretely, for each edge (v, u) in a given\nmultiedge m de\ufb01ne `(v,u) := 1\nIn summary we are addressing the above min-sum type dynamic programming problem speci\ufb01ed by\na k-DAG and local losses where for the sake of simplicity we made two assumptions: each non-base\nsubproblem breaks into exactly k smaller subproblems and the choice of k subproblems at a node\nare disjoint. We brie\ufb02y discuss in the conclusion section how to generalize our methods to arbitrary\nmin-sum dynamic programming problems, where the sets of subproblems can overlap and may have\ndifferent sizes.\n\nk LM (m) + {u2T }LT (u) where {\u00b7} is the indicator function.\n\n4.1 The Example of Learning Binary Search Trees\nRecall again the online version of optimal binary search tree (BST) problem [10]: We are given a set\nof n distinct keys K1 < K2 < . . . < Kn and n + 1 gaps or \u201cdummy keys\u201d D0, . . . , Dn indicating\nsearch failures such that for all i 2{ 1..n}, Di1 < Ki < Di. In each trial, the algorithm predicts\nwith a BST. Then the adversary reveals a frequency vector ` = (p, q) with p 2 [0, 1]n, q 2 [0, 1]n+1\nandPn\nj=0 qj = 1. 
For each i, j, the frequencies p_i and q_j are the search probabilities for K_i and D_j, respectively. The loss is defined as the average search cost in the predicted BST, which is the average depth² of all the nodes in the BST:

loss = Σ_{i=1}^{n} depth(K_i) · p_i + Σ_{j=0}^{n} depth(D_j) · q_j.

Convex Hull of BSTs  Implementing CH requires a representation where not only does the BST polytope have a polynomial number of facets, but the loss is also linear over the components. Since the average search cost is linear in the depth(K_i) and depth(D_j) variables, it would be natural to choose these 2n + 1 variables as the components for representing a BST. Unfortunately the convex hull of all BSTs represented this way is not known to be a polytope with a polynomial number of facets. There is an alternate characterization of the convex hull of BSTs with n internal nodes called the associahedron [29]. This polytope has polynomially in n many facets, but the average search cost is not linear in the n components associated with this polytope³.

The Dynamic Programming Representation  The optimal BST problem can be solved via dynamic programming [10]. Each subproblem is denoted by a pair (i, j), for 1 ≤ i ≤ n + 1 and i − 1 ≤ j ≤ n, indicating the optimal BST problem with the keys K_i, ..., K_j and dummy keys D_{i−1}, ..., D_j. The base subproblems are (i, i − 1), for 1 ≤ i ≤ n + 1, and the final subproblem is (1, n). The BST dynamic programming problem uses the following recurrence:

OPT(i, j) = q_{i−1}  if j = i − 1,
OPT(i, j) = min_{i≤r≤j} { OPT(i, r−1) + OPT(r+1, j) + Σ_{k=i}^{j} p_k + Σ_{k=i−1}^{j} q_k }  if i ≤ j.

This recurrence always recurses on 2 subproblems. Therefore we have k = 2 and the associated 2-DAG has the subproblems/vertices V = {(i, j) | 1 ≤ i ≤ n + 1, i − 1 ≤ j ≤ n}, source s = (1, n) and sinks T = {(i, i − 1) | 1 ≤ i ≤ n + 1}. Also, at node (i, j) the set M_{(i,j)} consists of (j − i + 1) many 2-multiedges. The r-th 2-multiedge leaving (i, j) is comprised of the 2 edges going from the node (i, j) to the nodes (i, r − 1) and (r + 1, j). Figure 2 illustrates the 2-DAG and 2-multipaths associated with BSTs.

²Here the root starts at depth 1.
³Concretely, the i-th component is a_i b_i where a_i and b_i are the number of nodes in the left and right subtrees of the i-th internal node K_i, respectively.

Figure 2: (left) Two different 2-multipaths in the DAG, in red and blue, and (right) their associated BSTs of n = 5 keys and 6 "dummy" keys. Note that each node, and consequently each edge, is visited at most once in these 2-multipaths.

Table 1: Performance of various algorithms over different problems. C is the capacity in the Knapsack problem, and d_max is the upper bound on the dimension in the matrix-chain multiplication problem.

Problem | FPL | EH | CH
Optimal Binary Search Trees | O(n^{3/2} √T) | O(n^{3/2} √T) | O(n (log n)^{1/2} √T)
Matrix-Chain Multiplications⁴ | — | O(n^{3/2} (d_max)^3 √T) | O(n (log n)^{1/2} (d_max)^3 √T)
Knapsack | O(n^{3/2} √T) | O(n^{3/2} √T) | O(n (log nC)^{1/2} √T)
Rod Cutting | O(n^{3/2} √T) | O(n^{3/2} √T) | O(n (log n)^{1/2} √T)
Weighted Interval Scheduling | O(n^{3/2} √T) | O(n^{3/2} √T) | O(n (log n)^{1/2} √T)

Since the above recurrence relation correctly solves the offline optimization problem, every 2-multipath in the DAG represents a BST, and every possible BST can be represented by a 2-multipath of the 2-DAG.
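The offline recurrence above can be sketched directly (our own illustration, not the paper's code; p is padded so p[1..n] are the key frequencies, q[0..n] are the gap frequencies, and the root is counted at depth 1):

```python
from functools import lru_cache

def optimal_bst_cost(p, q):
    """Minimum average search cost via the min-sum recurrence OPT(i, j)."""
    n = len(p) - 1                        # p[0] is unused padding

    @lru_cache(maxsize=None)
    def opt(i, j):
        if j == i - 1:                    # base subproblem (i, i-1)
            return q[i - 1]
        w = sum(p[i:j + 1]) + sum(q[i - 1:j + 1])
        return min(opt(i, r - 1) + opt(r + 1, j) + w
                   for r in range(i, j + 1))

    return opt(1, n)                      # final subproblem (1, n)

# Three keys with frequencies 0.2, 0.3, 0.1 and four gaps of 0.1 each:
cost = optimal_bst_cost([0.0, 0.2, 0.3, 0.1], [0.1, 0.1, 0.1, 0.1])
```

Each evaluation of the `min` corresponds to choosing one of the (j − i + 1) 2-multiedges leaving node (i, j) in the 2-DAG.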
We have O(n³) edges and multiedges, which are the components of our new representation. The loss of each 2-multiedge leaving (i, j) is Σ_{k=i}^{j} p_k + Σ_{k=i−1}^{j} q_k and is upper bounded by 1. Most crucially, the original average search cost is linear in the losses of the multiedges, and the 2-flow polytope has O(n³) facets.

Regret Bound  As mentioned earlier, the number of binary trees with n nodes is the n-th Catalan number. Therefore N = (2n)! / (n! (n+1)!) ∈ (2^n, 4^n). Also note that the expected search cost is bounded by B = n in each trial. Thus using Theorem 3, EH achieves a regret bound of O(n^{3/2} √T). Additionally, notice that the number of subproblems in the dynamic programming problem for BSTs is (n+1)(n+2)/2. This is also the number of vertices in the associated 2-DAG, and each 2-multipath representing a BST consists of exactly D = 2n edges. Therefore using Theorem 4, CH achieves a regret bound of O(n (log n)^{1/2} √T).

5 Conclusions and Future Work

We developed a general framework for online learning of combinatorial objects whose offline optimization problems can be efficiently solved via an algorithm belonging to a large class of dynamic programming algorithms. In addition to BSTs, several example problems are discussed in Appendix A. Table 1 gives the performance of EH and CH in our dynamic programming framework and compares it with the Follow the Perturbed Leader (FPL) algorithm.⁴

⁴The loss of a fully parenthesized matrix-chain multiplication is the number of scalar multiplications in the execution of all matrix products. This number cannot be expressed as a linear loss over the dimensions of the matrices. We are thus unaware of a way to apply FPL to this problem using the dimensions of the matrices as the components. See Appendix A.1 for more details.

FPL additively perturbs the losses and then uses dynamic programming to find the solution of minimum loss.
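The FPL strategy just mentioned can be sketched on a toy path problem (our own illustration with a made-up DAG; FPL variants differ in the perturbation distribution, here Uniform[0, 1/ε]):

```python
import random

# Sketch of Follow the Perturbed Leader on paths: perturb the cumulative
# edge losses, then run the dynamic program (here: shortest path to sink t).
def fpl_pick_path(cum_loss, epsilon, order, t='t'):
    perturbed = {e: L + random.uniform(0, 1.0 / epsilon)
                 for e, L in cum_loss.items()}
    best = {t: (0.0, [])}                 # node -> (cost to sink, path)
    for v in order:                       # reverse-topological, sinks first
        cands = [(perturbed[(v, u)] + best[u][0], [(v, u)] + best[u][1])
                 for (x, u) in perturbed if x == v]
        best[v] = min(cands)
    return best['s'][1]

cum_loss = {('s', 'a'): 2.0, ('s', 'b'): 0.5, ('a', 't'): 1.0, ('b', 't'): 0.5}
path = fpl_pick_path(cum_loss, epsilon=100.0, order=['a', 'b', 's'])
```

With a small perturbation (large ε) the clearly cheaper path s-b-t is chosen; the randomness only matters when cumulative losses are close.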
FPL essentially always matches EH, and CH is better than both in all cases.

We conclude with a few remarks:
• For EH, projections are simply a renormalization of the weight vector. In contrast, iterative Bregman projections are often needed for projecting back into the polytope used by CH [25, 19]. These methods are known to converge to the exact projection [8, 6] and are reported to be very efficient empirically [25]. For the special cases of Euclidean projections [13] and Sinkhorn balancing [24], linear convergence has been proven. However, we are unaware of a linear convergence proof for general Bregman divergences. Regardless of the convergence rate, the remaining gaps to the exact projections have to be accounted for as additional loss in the regret bounds. We do this in Appendix E for CH.
• For the sake of concreteness, we focused in this paper on dynamic programming problems with "min-sum" recurrence relations, a fixed branching factor k and mutually exclusive sets of choices at a given subproblem. However, our results can be generalized to arbitrary "min-sum" dynamic programming problems with the methods introduced in [30]: We let the multiedges in G form hyperarcs, each of which is associated with a loss. Furthermore, each combinatorial object is encoded as a hyperpath, which is a sequence of hyperarcs from the source to the sinks. The polytope associated with such a dynamic programming problem is defined by flow-type constraints over the underlying hypergraph G of subproblems. Thus online learning of a dynamic programming solution becomes a problem of learning hyperpaths in a hypergraph, and the techniques introduced in this paper let us implement EH and CH for this more general class of dynamic programming problems.
• In this work we use dynamic programming algorithms for building polytopes for combinatorial objects that have a polynomial number of facets.
The technique of going from the original polytope to a higher dimensional polytope in order to reduce the number of facets is known as extended formulation (see e.g. [21]). In the learning application we also need the additional requirement that the loss is linear in the components of the objects. A general framework of using extended formulations to develop learning algorithms has recently been explored in [32].
• We hope that many of the techniques from the expert setting literature can be adapted to learning combinatorial objects that are composed of components. This includes lower bounding weights for shifting comparators [20] and sleeping experts [7, 1]. In this paper we also focus on the full information setting, where the adversary reveals the entire loss vector in each trial. In contrast, in full- and semi-bandit settings the adversary only reveals partial information about the loss. Significant work has already been done on learning combinatorial objects in full- and semi-bandit settings [3, 18, 4, 27, 9]. It seems that the techniques introduced in this paper will also carry over.
• Online Markov Decision Processes (MDPs) [15, 14] are an online learning model that focuses on the sequential revelation of an object using a sequential state based model. This is very much related to learning paths and to the sequential decisions made in our dynamic programming framework. Connecting our work with the large body of research on MDPs is a promising direction for future research.
• There are several important dynamic programming instances that are not included in the class considered in this paper: the Viterbi algorithm for finding the most probable path in a graph, and variants of the Cocke-Younger-Kasami (CYK) algorithm for parsing probabilistic context-free grammars. The solutions for these problems are min-sum type optimization problems after taking logs of the probabilities. However, taking logs creates unbounded losses.
Extending our methods to these dynamic programming problems would be very worthwhile.

Acknowledgments We thank S.V.N. Vishwanathan for initiating and guiding much of this research. We also thank Michael Collins for helpful discussions and pointers to the literature on hypergraphs and PCFGs. This research was supported by the National Science Foundation (NSF grant IIS-1619271).

References
[1] Dmitry Adamskiy, Manfred K. Warmuth, and Wouter M. Koolen. Putting Bayes to sleep. In Advances in Neural Information Processing Systems, pages 135–143, 2012.
[2] Nir Ailon. Improved bounds for online learning over the Permutahedron and other ranking polytopes. In AISTATS, pages 29–37, 2014.
[3] Jean-Yves Audibert, Sébastien Bubeck, and Gábor Lugosi. Minimax policies for combinatorial prediction games. In COLT, volume 19, pages 107–132, 2011.
[4] Jean-Yves Audibert, Sébastien Bubeck, and Gábor Lugosi. Regret in online combinatorial optimization. Mathematics of Operations Research, 39(1):31–45, 2013.
[5] Baruch Awerbuch and Robert Kleinberg. Online linear optimization and adaptive routing. Journal of Computer and System Sciences, 74(1):97–114, 2008.
[6] Heinz H. Bauschke and Jonathan M. Borwein. Legendre functions and the method of random Bregman projections. Journal of Convex Analysis, 4(1):27–67, 1997.
[7] Olivier Bousquet and Manfred K. Warmuth. Tracking a small set of experts by mixing past posteriors. Journal of Machine Learning Research, 3(Nov):363–396, 2002.
[8] Lev M. Bregman. The relaxation method of finding the common point of convex sets and its application to the solution of problems in convex programming. USSR Computational Mathematics and Mathematical Physics, 7(3):200–217, 1967.
[9] Nicolò Cesa-Bianchi and Gábor Lugosi. Combinatorial bandits. Journal of Computer and System Sciences, 78(5):1404–1422, 2012.
[10] Thomas H.
Cormen, Charles E. Leiserson, Ronald L. Rivest, and Clifford Stein. Introduction to Algorithms. MIT Press, 2009.
[11] Corinna Cortes, Vitaly Kuznetsov, Mehryar Mohri, and Manfred Warmuth. On-line learning algorithms for path experts with non-additive losses. In Conference on Learning Theory, pages 424–447, 2015.
[12] Varsha Dani, Sham M. Kakade, and Thomas P. Hayes. The price of bandit information for online optimization. In Advances in Neural Information Processing Systems, pages 345–352, 2008.
[13] Frank Deutsch. Dykstra's cyclic projections algorithm: the rate of convergence. In Approximation Theory, Wavelets and Applications, pages 87–94. Springer, 1995.
[14] Travis Dick, András György, and Csaba Szepesvári. Online learning in Markov decision processes with changing cost sequences. In Proceedings of the 31st International Conference on Machine Learning (ICML-14), pages 512–520, 2014.
[15] Eyal Even-Dar, Sham M. Kakade, and Yishay Mansour. Online Markov decision processes. Mathematics of Operations Research, 34(3):726–736, 2009.
[16] Yoav Freund and Robert E. Schapire. A decision-theoretic generalization of on-line learning and an application to boosting. Journal of Computer and System Sciences, 55(1):119–139, 1997.
[17] Swati Gupta, Michel Goemans, and Patrick Jaillet. Solving combinatorial games using products, projections and lexicographically optimal bases. Preprint arXiv:1603.00522, 2016.
[18] András György, Tamás Linder, Gábor Lugosi, and György Ottucsák. The on-line shortest path problem under partial monitoring. Journal of Machine Learning Research, 8(Oct):2369–2403, 2007.
[19] David P. Helmbold and Manfred K. Warmuth. Learning permutations with exponential weights. Journal of Machine Learning Research, 10:1705–1736, 2009.
[20] Mark Herbster and Manfred K. Warmuth. Tracking the best expert.
Machine Learning, 32(2):151–178, 1998.
[21] Volker Kaibel. Extended formulations in combinatorial optimization. Preprint arXiv:1104.1023, 2011.
[22] Adam Kalai and Santosh Vempala. Efficient algorithms for online decision problems. Journal of Computer and System Sciences, 71(3):291–307, 2005.
[23] Jon Kleinberg and Éva Tardos. Algorithm Design. Addison-Wesley, 2006.
[24] Philip A. Knight. The Sinkhorn–Knopp algorithm: convergence and applications. SIAM Journal on Matrix Analysis and Applications, 30(1):261–275, 2008.
[25] Wouter M. Koolen, Manfred K. Warmuth, and Jyrki Kivinen. Hedging structured concepts. In Conference on Learning Theory, pages 239–254. Omnipress, 2010.
[26] Dima Kuzmin and Manfred K. Warmuth. Optimum follow the leader algorithm. In Learning Theory, pages 684–686. Springer, 2005.
[27] Branislav Kveton, Zheng Wen, Azin Ashkan, and Csaba Szepesvári. Tight regret bounds for stochastic combinatorial semi-bandits. In Artificial Intelligence and Statistics, pages 535–543, 2015.
[28] Nick Littlestone and Manfred K. Warmuth. The weighted majority algorithm. Information and Computation, 108(2):212–261, 1994.
[29] Jean-Louis Loday. The multiple facets of the associahedron. Proc. 2005 Academy Coll. Series, 2005.
[30] R. Kipp Martin, Ronald L. Rardin, and Brian A. Campbell. Polyhedral characterization of discrete dynamic programming. Operations Research, 38(1):127–138, 1990.
[31] Mehryar Mohri. Weighted automata algorithms. In Handbook of Weighted Automata, pages 213–254. Springer, 2009.
[32] Holakou Rahmanian, David Helmbold, and S.V.N. Vishwanathan. Online learning of combinatorial objects via extended formulation. Preprint arXiv:1609.05374, 2017.
[33] Arun Rajkumar and Shivani Agarwal.
Online decision-making in general combinatorial spaces. In Advances in Neural Information Processing Systems, pages 3482–3490, 2014.
[34] Daiki Suehiro, Kohei Hatano, Shuji Kijima, Eiji Takimoto, and Kiyohito Nagano. Online prediction under submodular constraints. In International Conference on Algorithmic Learning Theory, pages 260–274. Springer, 2012.
[35] Eiji Takimoto and Manfred K. Warmuth. Path kernels and multiplicative updates. Journal of Machine Learning Research, 4:773–818, 2003.
[36] Manfred K. Warmuth and Dima Kuzmin. Randomized online PCA algorithms with regret bounds that are logarithmic in the dimension. Journal of Machine Learning Research, 9(10):2287–2320, 2008.
[37] Shota Yasutake, Kohei Hatano, Shuji Kijima, Eiji Takimoto, and Masayuki Takeda. Online linear optimization over permutations. In Algorithms and Computation, pages 534–543. Springer, 2011.