Search results for "scalable"

Scalable Inference for LogisticNormal Topic Models
Logistic-normal topic models can effectively discover correlation structures among latent topics. However, their inference remains a challenge because of the non-conjugacy between the logistic-normal prior and multinomial topic mixing proportions. Existing algorithms either make restrictive mean-field assumptions or are not scalable to large-scale applications. This paper presents a partially collapsed Gibbs sampling algorithm that approaches the provably correct distribution by exploring ideas from data augmentation. To improve time efficiency, we further present a parallel implementation that can deal with large-scale applications and learn the correlation structures of thousands of topics from millions of documents. Extensive empirical results demonstrate the promise. Jianfei Chen, Jun Zhu, Zi Wang, Xun Zheng, Bo Zhang

Scalable Inference for Gaussian Process Models with Black-Box Likelihoods
We propose a sparse method for scalable automated variational inference (AVI) in a large class of models with Gaussian process (GP) priors, multiple latent functions, multiple outputs and nonlinear likelihoods. Our approach maintains the statistical efficiency property of the original AVI method, requiring only expectations over univariate Gaussian distributions to approximate the posterior with a mixture of Gaussians. Experiments on small datasets for various problems including regression, classification, Log Gaussian Cox processes, and warped GPs show that our method can perform as well as the full method under high levels of sparsity. On larger experiments using the MNIST and the SARCOS datasets we show that our method can provide superior performance to previously published scalable approaches that have been handcrafted to specific likelihood models. Amir Dezfouli, Edwin V. Bonilla

A Scalable Approach to Probabilistic Latent Space Inference of Large-Scale Networks
We propose a scalable approach for making inference about latent spaces of large networks. With a succinct representation of networks as a bag of triangular motifs, a parsimonious statistical model, and an efficient stochastic variational inference algorithm, we are able to analyze real networks with over a million vertices and hundreds of latent roles on a single machine in a matter of hours, a setting that is out of reach for many existing methods. When compared to the state-of-the-art probabilistic approaches, our method is several orders of magnitude faster, with competitive or improved accuracy for latent space recovery and link prediction. Junming Yin, Qirong Ho, Eric Xing

Scalable Inference of Overlapping Communities
We develop a scalable algorithm for posterior inference of overlapping communities in large networks. Our algorithm is based on stochastic variational inference in the mixed-membership stochastic blockmodel. It naturally interleaves subsampling the network with estimating its community structure. We apply our algorithm on ten large, real-world networks with up to 60,000 nodes. It converges several orders of magnitude faster than the state-of-the-art algorithm for MMSB, finds hundreds of communities in large real-world networks, and detects the true communities in 280 benchmark networks with equal or better accuracy compared to other scalable algorithms. Prem Gopalan, Sean Gerrish, Michael Freedman, David M. Blei, David M. Mimno

Scalable Influence Estimation in Continuous-Time Diffusion Networks
If a piece of information is released from a media site, can it spread, in 1 month, to a million web pages? This influence estimation problem is very challenging since both the time-sensitive nature of the problem and the issue of scalability need to be addressed simultaneously. In this paper, we propose a randomized algorithm for influence estimation in continuous-time diffusion networks. Our algorithm can estimate the influence of every node in a network with $|\Vcal|$ nodes and $|\Ecal|$ edges to an accuracy of $\epsilon$ using $n = O(1/\epsilon^2)$ randomizations and, up to logarithmic factors, $O(n|\Ecal| + n|\Vcal|)$ computations. When used as a subroutine in a greedy influence maximization algorithm, our proposed method is guaranteed to find a set of nodes with an influence of at least $(1 - 1/e)\operatorname{OPT} - 2\epsilon$, where $\operatorname{OPT}$ is the optimal value. Experiments on both synthetic and real-world data show that the proposed method can easily scale up to networks of millions of nodes while significantly improving over previous state-of-the-art methods in terms of the accuracy of the estimated influence and the quality of the selected nodes for maximizing the influence. Nan Du, Le Song, Manuel Gomez-Rodriguez, Hongyuan Zha
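The sampling idea behind the estimator can be illustrated with a toy Monte Carlo sketch (hypothetical and much cruder than the paper's algorithm: the neighborhood-size estimation that yields the stated complexity is omitted, and exponential transmission times are an illustrative model choice):

```python
import heapq
import random

def estimate_influence(edges, source, horizon, n_samples=500, seed=0):
    """Naive Monte Carlo influence estimate in a continuous-time diffusion
    network. `edges` maps a directed edge (u, v) to the rate of an exponential
    transmission-time distribution (an illustrative choice). Returns the
    expected number of nodes infected from `source` within `horizon`."""
    rng = random.Random(seed)
    total = 0
    for _ in range(n_samples):
        # Draw one random transmission time per edge; a node is infected iff
        # its shortest-path time from the source is within the horizon.
        w = {e: rng.expovariate(rate) for e, rate in edges.items()}
        dist = {source: 0.0}
        heap = [(0.0, source)]
        while heap:
            d, u = heapq.heappop(heap)
            if d > dist.get(u, float("inf")):
                continue
            for (a, b), t in w.items():
                nd = d + t
                if a == u and nd <= horizon and nd < dist.get(b, float("inf")):
                    dist[b] = nd
                    heapq.heappush(heap, (nd, b))
        total += len(dist)  # every entry in dist is reached within the horizon
    return total / n_samples
```

On a fast chain 0 → 1 → 2 with a generous horizon, the estimate approaches 3, i.e., all nodes are typically reached.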

Scalable Methods for Nonnegative Matrix Factorizations of Near-separable Tall-and-skinny Matrices
Numerous algorithms are used for nonnegative matrix factorization under the assumption that the matrix is nearly separable. In this paper, we show how to make these algorithms scalable for data matrices that have many more rows than columns, so-called "tall-and-skinny" matrices. One key component of these improved methods is an orthogonal matrix transformation that preserves the separability of the NMF problem. Our final methods need to read the data matrix only once and are suitable for streaming, multicore, and MapReduce architectures. We demonstrate the efficacy of these algorithms on terabyte-sized matrices from scientific computing and bioinformatics. Austin R. Benson, Jason D. Lee, Bartek Rajwa, David F. Gleich

Scalable Kernel Methods via Doubly Stochastic Gradients
The general perception is that kernel methods are not scalable, so neural nets become the choice for large-scale nonlinear learning problems. Have we tried hard enough for kernel methods? In this paper, we propose an approach that scales up kernel methods using a novel concept called ``doubly stochastic functional gradients''. Based on the fact that many kernel methods can be expressed as convex optimization problems, our approach solves these optimization problems by making two unbiased stochastic approximations to the functional gradient, one using random training points and another using random features associated with the kernel, and performing descent steps with this noisy functional gradient. Our algorithm is simple, does not commit to a preset number of random features, and allows the flexibility of the function class to grow as we see more incoming data in the streaming setting. We demonstrate that a function learned by this procedure after t iterations converges to the optimal function in the reproducing kernel Hilbert space at rate O(1/t), and achieves a generalization bound of O(1/\sqrt{t}). Our approach can readily scale kernel methods up to the regimes which are dominated by neural nets. We show competitive performance of our approach compared to neural nets on datasets such as 2.3 million energy materials from MolecularSpace, 8 million handwritten digits from MNIST, and 1 million photos from ImageNet using convolution features. Bo Dai, Bo Xie, Niao He, Yingyu Liang, Anant Raj, Maria-Florina F. Balcan, Le Song
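The two stochastic approximations can be sketched in a few lines. This is a hypothetical toy setup, not the paper's implementation: an RBF kernel via random cosine features, a single-variable input, and a constant step size in place of the paper's step-size schedule:

```python
import math
import random

def doubly_stochastic_fit(data, gamma=1.0, lam=1e-3, step=0.1, iters=2000, seed=0):
    """Doubly stochastic functional gradient descent, sketched for kernel
    ridge regression with an RBF kernel. Each iteration draws (1) a random
    training point and (2) a fresh random Fourier feature of the kernel,
    then descends along the resulting unbiased noisy functional gradient."""
    rng = random.Random(seed)
    coefs, feats = [], []  # one coefficient per random feature (omega, b)

    def phi(x, omega, b):
        return math.sqrt(2.0) * math.cos(omega * x + b)

    def predict(x):
        return sum(w * phi(x, o, b) for w, (o, b) in zip(coefs, feats))

    for _ in range(iters):
        x, y = data[rng.randrange(len(data))]           # random training point
        omega = rng.gauss(0.0, math.sqrt(2.0 * gamma))  # random RBF feature
        b = rng.uniform(0.0, 2.0 * math.pi)
        err = predict(x) - y                            # noisy functional gradient
        coefs[:] = [(1.0 - step * lam) * w for w in coefs]  # regularization shrink
        coefs.append(-step * err * phi(x, omega, b))    # new feature's coefficient
        feats.append((omega, b))
    return predict
```

Note how the number of random features grows with the number of iterations, matching the abstract's point that the function class expands as more data streams in.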

Scaling-up Importance Sampling for Markov Logic Networks
Markov Logic Networks (MLNs) are weighted first-order logic templates for generating large (ground) Markov networks. Lifted inference algorithms for them bring the power of logical inference to probabilistic inference. These algorithms operate as much as possible at the compact first-order level, grounding or propositionalizing the MLN only as necessary. As a result, lifted inference algorithms can be much more scalable than propositional algorithms that operate directly on the much larger ground network. Unfortunately, existing lifted inference algorithms suffer from two interrelated problems, which severely affect their scalability in practice. First, for most real-world MLNs having complex structure, they are unable to exploit symmetries and end up grounding most atoms (the grounding problem). Second, they suffer from the evidence problem, which arises because evidence breaks symmetries, severely diminishing the power of lifted inference. In this paper, we address both problems by presenting a scalable, lifted importance-sampling-based approach that never grounds the full MLN. Specifically, we show how to scale up the two main steps in importance sampling: sampling from the proposal distribution and weight computation. Scalable sampling is achieved by using an informed, easy-to-sample proposal distribution derived from a compressed MLN representation. Fast weight computation is achieved by only visiting a small subset of the sampled groundings of each formula instead of all of its possible groundings. We show that our new algorithm yields an asymptotically unbiased estimate. Our experiments on several MLNs clearly demonstrate the promise of our approach. Deepak Venugopal, Vibhav G. Gogate

Scalable imputation of genetic data with a discrete fragmentation-coagulation process
We present a Bayesian nonparametric model for genetic sequence data in which a set of genetic sequences is modelled using a Markov model of partitions. The partitions at consecutive locations in the genome are related by their clusters first splitting and then merging. Our model can be thought of as a discrete-time analogue of continuous-time fragmentation-coagulation processes [Teh et al. 2011], preserving the important properties of projectivity, exchangeability and reversibility, while being more scalable. We apply this model to the problem of genotype imputation, showing improved computational efficiency while maintaining the same accuracy as in [Teh et al. 2011]. Lloyd Elliott, Yee W. Teh

Large-Scale Bayesian Multi-Label Learning via Topic-Based Label Embeddings
We present a scalable Bayesian multi-label learning model based on learning low-dimensional label embeddings. Our model assumes that each label vector is generated as a weighted combination of a set of topics (each topic being a distribution over labels), where the combination weights (i.e., the embeddings) for each label vector are conditioned on the observed feature vector. This construction, coupled with a Bernoulli-Poisson link function for each label of the binary label vector, leads to a model with a computational cost that scales in the number of positive labels in the label matrix. This makes the model particularly appealing for real-world multi-label learning problems where the label matrix is usually very massive but highly sparse. Using a data-augmentation strategy leads to full local conjugacy in our model, facilitating simple and very efficient Gibbs sampling, as well as an Expectation-Maximization algorithm for inference. Also, predicting the label vector at test time does not require doing an inference for the label embeddings and can be done in closed form. We report results on several benchmark data sets, comparing our model with various state-of-the-art methods. Piyush Rai, Changwei Hu, Ricardo Henao, Lawrence Carin
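The Bernoulli-Poisson link mentioned in the abstract has a simple closed form, which is what makes the cost scale with the positive labels; a minimal sketch:

```python
import math

def bp_log_lik(y, lam):
    """Bernoulli-Poisson link: y = 1[n > 0] with n ~ Poisson(lam), so
    P(y = 1) = 1 - exp(-lam). A zero label contributes just -lam to the
    log-likelihood, so the sum over the (many) zeros collapses to a single
    linear term and the work scales with the number of positive labels."""
    return math.log1p(-math.exp(-lam)) if y == 1 else -lam
```

For example, lam = log 2 gives P(y = 1) = 0.5, so both labels contribute log 0.5.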

Scalable Semi-Supervised Aggregation of Classifiers
We present and empirically evaluate an efficient algorithm that learns to aggregate the predictions of an ensemble of binary classifiers. The algorithm uses the structure of the ensemble predictions on unlabeled data to yield significant performance improvements. It does this without making assumptions on the structure or origin of the ensemble, without parameters, and as scalably as linear learning. We empirically demonstrate these performance gains with random forests. Akshay Balsubramani, Yoav Freund

Scalable nonconvex inexact proximal splitting
We study large-scale, nonsmooth, nonconvex optimization problems. In particular, we focus on nonconvex problems with \emph{composite} objectives. This class of problems includes the extensively studied convex, composite-objective problems as a special case. To tackle composite nonconvex problems, we introduce a powerful new framework based on asymptotically \emph{nonvanishing} errors, avoiding the common convenient assumption of eventually vanishing errors. Within our framework we derive both batch and incremental nonconvex proximal splitting algorithms. To our knowledge, our framework is the first to develop and analyze incremental \emph{nonconvex} proximal-splitting algorithms, even if we disregard the ability to handle nonvanishing errors. We illustrate our theoretical framework by showing how it applies to difficult large-scale, nonsmooth, and nonconvex problems. Suvrit Sra

Variational Information Maximisation for Intrinsically Motivated Reinforcement Learning
The mutual information is a core statistical quantity that has applications in all areas of machine learning, whether this is in training of density models over multiple data modalities, in maximising the efficiency of noisy transmission channels, or when learning behaviour policies for exploration by artificial agents. Most learning algorithms that involve optimisation of the mutual information rely on the Blahut-Arimoto algorithm, an enumerative algorithm with exponential complexity that is not suitable for modern machine learning applications. This paper provides a new approach for scalable optimisation of the mutual information by merging techniques from variational inference and deep learning. We develop our approach by focusing on the problem of intrinsically motivated learning, where the mutual information forms the definition of a well-known internal drive known as empowerment. Using a variational lower bound on the mutual information, combined with convolutional networks for handling visual input streams, we develop a stochastic optimisation algorithm that allows for scalable information maximisation and empowerment-based reasoning directly from pixels to actions. Shakir Mohamed, Danilo Jimenez Rezende

VDCBPI: an Approximate Scalable Algorithm for Large POMDPs
Pascal Poupart, Craig Boutilier

A Scalable Machine Learning Approach to Go
Lin Wu, Pierre F. Baldi

Scalable Inference for Neuronal Connectivity from Calcium Imaging
Fluorescent calcium imaging provides a potentially powerful tool for inferring connectivity in neural circuits with up to thousands of neurons. However, a key challenge in using calcium imaging for connectivity detection is that current systems often have a temporal response and frame rate that can be orders of magnitude slower than the underlying neural spiking process. Bayesian inference methods based on expectation-maximization (EM) have been proposed to overcome these limitations, but they are often computationally demanding since the E-step in the EM procedure typically involves state estimation in a high-dimensional nonlinear dynamical system. In this work, we propose a computationally fast method for the state estimation based on a hybrid of loopy belief propagation and approximate message passing (AMP). The key insight is that a neural system as viewed through calcium imaging can be factorized into simple scalar dynamical systems for each neuron with linear interconnections between the neurons. Using this structure, the updates in the proposed hybrid AMP methodology can be computed by a set of one-dimensional state estimation procedures and linear transforms with the connectivity matrix. This yields a computationally scalable method for inferring the connectivity of large neural circuits. Simulations of the method on realistic neural networks demonstrate good accuracy with computation times that are potentially significantly faster than current approaches based on Markov chain Monte Carlo methods. Alyson K. Fletcher, Sundeep Rangan

Mixed Robust/Average Submodular Partitioning: Fast Algorithms, Guarantees, and Applications
We investigate two novel mixed robust/average-case submodular data partitioning problems that we collectively call Submodular Partitioning. These problems generalize purely robust instances of the problem, namely max-min submodular fair allocation (SFA) and \emph{min-max submodular load balancing} (SLB), and also average-case instances, that is the submodular welfare problem (SWP) and submodular multiway partition (SMP). While the robust versions have been studied in the theory community, existing work has focused on tight approximation guarantees, and the resultant algorithms are not generally scalable to large real-world applications. This contrasts the average-case instances, where most of the algorithms are scalable. In the present paper, we bridge this gap, by proposing several new algorithms (including greedy, majorization-minimization, minorization-maximization, and relaxation algorithms) that not only scale to large datasets but that also achieve theoretical approximation guarantees comparable to the state-of-the-art. We moreover provide new scalable algorithms that apply to additive combinations of the robust and average-case objectives. We show that these problems have many applications in machine learning (ML), including data partitioning and load balancing for distributed ML, data clustering, and image segmentation. We empirically demonstrate the efficacy of our algorithms on real-world problems involving data partitioning for distributed optimization (of convex and deep neural network objectives), and also purely unsupervised image segmentation. Kai Wei, Rishabh K. Iyer, Shengjie Wang, Wenruo Bai, Jeff A. Bilmes

Collaborative Filtering with Graph Information: Consistency and Scalable Methods
Low rank matrix completion plays a fundamental role in collaborative filtering applications, the key idea being that the variables lie in a smaller subspace than the ambient space. Often, additional information about the variables is known, and it is reasonable to assume that incorporating this information will lead to better predictions. We tackle the problem of matrix completion when pairwise relationships among variables are known, via a graph. We formulate and derive a highly efficient, conjugate-gradient-based alternating minimization scheme that solves optimizations with over 55 million observations up to 2 orders of magnitude faster than state-of-the-art (stochastic) gradient-descent-based methods. On the theoretical front, we show that such methods generalize weighted nuclear norm formulations, and derive statistical consistency guarantees. We validate our results on both real and synthetic datasets. Nikhil Rao, Hsiang-Fu Yu, Pradeep K. Ravikumar, Inderjit S. Dhillon

Scalable Adaptation of State Complexity for Nonparametric Hidden Markov Models
Bayesian nonparametric hidden Markov models are typically learned via fixed truncations of the infinite state space or local Monte Carlo proposals that make small changes to the state space. We develop an inference algorithm for the sticky hierarchical Dirichlet process hidden Markov model that scales to big datasets by processing a few sequences at a time yet allows rapid adaptation of the state space cardinality. Unlike previous point-estimate methods, our novel variational bound penalizes redundant or irrelevant states and thus enables optimization of the state space. Our birth proposals use observed data statistics to create useful new states that escape local optima. Merge and delete proposals remove ineffective states to yield simpler models with more affordable future computations. Experiments on speaker diarization, motion capture, and epigenetic chromatin datasets discover models that are more compact, more interpretable, and better aligned to ground truth segmentations than competitors. We have released an open-source Python implementation which can parallelize local inference steps across sequences. Michael C. Hughes, William T. Stephenson, Erik Sudderth

Large Scale Distributed Sparse Precision Estimation
We consider the problem of sparse precision matrix estimation in high dimensions using the CLIME estimator, which has several desirable theoretical properties. We present an inexact alternating direction method of multipliers (ADMM) algorithm for CLIME, and establish rates of convergence for both the objective and optimality conditions. Further, we develop a large-scale distributed framework for the computations, which scales to millions of dimensions and trillions of parameters, using hundreds of cores. The proposed framework solves CLIME in column blocks and only involves elementwise operations and parallel matrix multiplications. We evaluate our algorithm on both shared-memory and distributed-memory architectures, which can use block cyclic distribution of data and parameters to achieve load balance and improve the efficiency in the use of memory hierarchies. Experimental results show that our algorithm is substantially more scalable than state-of-the-art methods and scales almost linearly with the number of cores. Huahua Wang, Arindam Banerjee, Cho-Jui Hsieh, Pradeep Ravikumar, Inderjit Dhillon

Symbolic Opportunistic Policy Iteration for Factored-Action MDPs
We address the scalability of symbolic planning under uncertainty with factored states and actions. Prior work has focused almost exclusively on factored states but not factored actions, and on value iteration (VI) compared to policy iteration (PI). Our first contribution is a novel method for symbolic policy backups via the application of constraints, which is used to yield a new efficient symbolic implementation of modified PI (MPI) for factored action spaces. While this approach improves scalability in some cases, naive handling of policy constraints comes with its own scalability issues. This leads to our second and main contribution, symbolic Opportunistic Policy Iteration (OPI), which is a novel convergent algorithm lying between VI and MPI. The core idea is a symbolic procedure that applies policy constraints only when they reduce the space and time complexity of the update, and otherwise performs full Bellman backups, thus automatically adjusting the backup per state. We also give a memory-bounded version of this algorithm allowing a space-time tradeoff. Empirical results show significantly improved scalability over the state-of-the-art. Aswin Raghavan, Roni Khardon, Alan Fern, Prasad Tadepalli

Bayesian models for Large-scale Hierarchical Classification
A challenging problem in hierarchical classification is to leverage the hierarchical relations among classes for improving classification performance. An even greater challenge is to do so in a manner that is computationally feasible for the large-scale problems usually encountered in practice. This paper proposes a set of Bayesian methods to model hierarchical dependencies among class labels using multivariate logistic regression. Specifically, the parent-child relationships are modeled by placing a hierarchical prior over the children nodes centered around the parameters of their parents, thereby encouraging classes nearby in the hierarchy to share similar model parameters. We present new, efficient variational algorithms for tractable posterior inference in these models, and provide a parallel implementation that can comfortably handle large-scale problems with hundreds of thousands of dimensions and tens of thousands of classes. We run a comparative evaluation on multiple large-scale benchmark datasets that highlights the scalability of our approach and shows a significant performance advantage over the other state-of-the-art hierarchical methods. Siddharth Gopal, Yiming Yang, Bing Bai, Alexandru Niculescu-Mizil

Fast Kernel Learning for Multidimensional Pattern Extrapolation
The ability to automatically discover patterns and perform extrapolation is an essential quality of intelligent systems. Kernel methods, such as Gaussian processes, have great potential for pattern extrapolation, since the kernel flexibly and interpretably controls the generalisation properties of these methods. However, automatically extrapolating large-scale multidimensional patterns is in general difficult, and developing Gaussian process models for this purpose involves several challenges. A vast majority of kernels, and kernel learning methods, currently only succeed in smoothing and interpolation. This difficulty is compounded by the fact that Gaussian processes are typically only tractable for small datasets, and scaling an expressive kernel learning approach poses different challenges than scaling a standard Gaussian process model. One faces additional computational constraints, and the need to retain significant model structure for expressing the rich information available in a large dataset. In this paper, we propose a Gaussian process approach for large-scale multidimensional pattern extrapolation. We recover sophisticated out-of-class kernels, and perform texture extrapolation, inpainting, video extrapolation, and long-range forecasting of land surface temperatures, all on large multidimensional datasets, including a problem with 383,400 training points. The proposed method significantly outperforms alternative scalable and flexible Gaussian process methods, in speed and accuracy. Moreover, we show that a distinct combination of expressive kernels, a fully nonparametric representation, and scalable inference which exploits existing model structure, are critical for large-scale multidimensional pattern extrapolation. Andrew Wilson, Elad Gilboa, John P. Cunningham, Arye Nehorai

Scalable Algorithms for String Kernels with Inexact Matching
We present a new family of linear-time algorithms based on sufficient statistics for string comparison with mismatches under the string kernels framework. Our algorithms improve theoretical complexity bounds of existing approaches while scaling well with respect to the sequence alphabet size, the number of allowed mismatches and the size of the dataset. In particular, on large alphabets with loose mismatch constraints our algorithms are several orders of magnitude faster than the existing algorithms for string comparison under the mismatch similarity measure. We evaluate our algorithms on synthetic data and real applications in music genre classification, protein remote homology detection and protein fold prediction. The scalability of the algorithms allows us to consider complex sequence transformations, modeled using longer string features and larger numbers of mismatches, leading to a state-of-the-art performance with significantly reduced running times. Pavel P. Kuksa, Pai-Hsi Huang, Vladimir Pavlovic

Scalable Nonlinear Learning with Adaptive Polynomial Expansions
Can we effectively learn a nonlinear representation in time comparable to linear learning? We describe a new algorithm that explicitly and adaptively expands higher-order interaction features over base linear representations. The algorithm is designed for extreme computational efficiency, and an extensive experimental study shows that its computation/prediction tradeoff ability compares very favorably against strong baselines. Alekh Agarwal, Alina Beygelzimer, Daniel J. Hsu, John Langford, Matus J. Telgarsky
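A hypothetical sketch of the expansion step: instead of the full quadratic blow-up, add interaction products only among the currently most important base features (the top-|weight| selection rule here is illustrative, not the paper's exact criterion):

```python
from itertools import combinations

def expand_interactions(x, weights, top_s=2):
    """Append pairwise products of the top_s base features with largest
    current |weight| to the feature vector, leaving the rest untouched.
    This grows the representation adaptively instead of forming all
    O(d^2) interaction features up front."""
    top = sorted(sorted(range(len(x)), key=lambda i: -abs(weights[i]))[:top_s])
    return list(x) + [x[i] * x[j] for i, j in combinations(top, 2)]
```

With top_s = 2, only a single interaction feature is appended per example, keeping the cost close to linear learning.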

Learning Label Trees for Probabilistic Modelling of Implicit Feedback
User preferences for items can be inferred from either explicit feedback, such as item ratings, or implicit feedback, such as rental histories. Research in collaborative filtering has concentrated on explicit feedback, resulting in the development of accurate and scalable models. However, since explicit feedback is often difficult to collect it is important to develop effective models that take advantage of the more widely available implicit feedback. We introduce a probabilistic approach to collaborative filtering with implicit feedback based on modelling the user's item selection process. In the interests of scalability, we restrict our attention to tree-structured distributions over items and develop a principled and efficient algorithm for learning item trees from data. We also identify a problem with a widely used protocol for evaluating implicit feedback models and propose a way of addressing it using a small quantity of explicit feedback data. Andriy Mnih, Yee W. Teh

Large-scale L-BFGS using MapReduce
L-BFGS has been applied as an effective parameter estimation method for various machine learning algorithms since the 1980s. With an increasing demand to deal with massive instances and variables, it is important to scale up and parallelize L-BFGS effectively in a distributed system. In this paper, we study the problem of parallelizing the L-BFGS algorithm in large clusters of tens of thousands of shared-nothing commodity machines. First, we show that a naive implementation of L-BFGS using MapReduce requires either a significant amount of memory or a large number of MapReduce steps with negative performance impact. Second, we propose a new L-BFGS algorithm, called Vector-free L-BFGS, which avoids the expensive dot product operations in the two-loop recursion and greatly improves computation efficiency with a great degree of parallelism. The algorithm scales very well and enables a variety of machine learning algorithms to handle a massive number of variables over large datasets. We prove the mathematical equivalence of the new Vector-free L-BFGS and demonstrate its excellent performance and scalability using real-world machine learning problems with billions of variables in production clusters. Weizhu Chen, Zhenghao Wang, Jingren Zhou
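For reference, the classic two-loop recursion whose dot products the vector-free variant reorganizes (a plain-Python sketch of the standard algorithm, not the paper's MapReduce version):

```python
def lbfgs_direction(grad, s_hist, y_hist):
    """Classic L-BFGS two-loop recursion: computes the search direction
    -H_k * grad from the stored pairs (s, y) = (x_{k+1} - x_k, g_{k+1} - g_k).
    Vectors are plain Python lists."""
    dot = lambda a, b: sum(ai * bi for ai, bi in zip(a, b))
    axpy = lambda alpha, x, y: [alpha * xi + yi for xi, yi in zip(x, y)]
    q = list(grad)
    alphas = []
    for s, y in reversed(list(zip(s_hist, y_hist))):  # first loop: newest to oldest
        a = dot(s, q) / dot(y, s)
        alphas.append(a)
        q = axpy(-a, y, q)
    if s_hist:  # initial Hessian scaling gamma = s'y / y'y from the newest pair
        s, y = s_hist[-1], y_hist[-1]
        gamma = dot(s, y) / dot(y, y)
        q = [gamma * qi for qi in q]
    for (s, y), a in zip(zip(s_hist, y_hist), reversed(alphas)):  # second loop
        beta = dot(y, q) / dot(y, s)
        q = axpy(a - beta, s, q)
    return [-qi for qi in q]
```

On a quadratic with Hessian A = diag(1, 2), feeding in pairs that span both coordinates recovers the exact Newton direction -A⁻¹g.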

Learning word embeddings efficiently with noise-contrastive estimation
Continuous-valued word embeddings learned by neural language models have recently been shown to capture semantic and syntactic information about words very well, setting performance records on several word similarity tasks. The best results are obtained by learning high-dimensional embeddings from very large quantities of data, which makes scalability of the training method a critical factor. We propose a simple and scalable new approach to learning word embeddings based on training log-bilinear models with noise-contrastive estimation. Our approach is simpler, faster, and produces better results than the current state-of-the-art method of Mikolov et al. (2013a). We achieve results comparable to the best ones reported, which were obtained on a cluster, using four times less data and more than an order of magnitude less computing time. We also investigate several model types and find that the embeddings learned by the simpler models perform at least as well as those learned by the more complex ones. Andriy Mnih, Koray Kavukcuoglu
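The NCE objective itself is easy to state; a hedged sketch for one (context, word) pair, where the names are hypothetical and `log_model_score` stands in for the model's unnormalized log-probability:

```python
import math

def nce_loss(log_model_score, word, context, noise_prob, k, noise_samples):
    """Noise-contrastive estimation loss for one observation: train a logistic
    classifier to distinguish the data word from k words drawn from a noise
    distribution, using the log-odds s(w, c) - log(k * p_noise(w))."""
    def log_sigmoid(x):
        # numerically stable log(1 / (1 + exp(-x)))
        return -math.log1p(math.exp(-x)) if x > -30.0 else x
    # data term: classify the observed word as "data"
    delta = log_model_score(word, context) - math.log(k * noise_prob(word))
    loss = -log_sigmoid(delta)
    # noise terms: classify each sampled noise word as "noise"
    for w in noise_samples:
        delta_n = log_model_score(w, context) - math.log(k * noise_prob(w))
        loss -= log_sigmoid(-delta_n)
    return loss
```

Sanity check: if the model's scores exactly match the noise distribution, every log-odds term is 0 and each of the 1 + k classifications costs log 2.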

Small-Variance Asymptotics for Exponential Family Dirichlet Process Mixture Models
Links between probabilistic and non-probabilistic learning algorithms can arise by performing small-variance asymptotics, i.e., letting the variance of particular distributions in a graphical model go to zero. For instance, in the context of clustering, such an approach yields precise connections between the k-means and EM algorithms. In this paper, we explore small-variance asymptotics for exponential family Dirichlet process (DP) and hierarchical Dirichlet process (HDP) mixture models. Utilizing connections between exponential family distributions and Bregman divergences, we derive novel clustering algorithms from the asymptotic limit of the DP and HDP mixtures that feature the scalability of existing hard clustering methods as well as the flexibility of Bayesian nonparametric models. We focus on special cases of our analysis for discrete-data problems, including topic modeling, and we demonstrate the utility of our results by applying variants of our algorithms to problems arising in vision and document analysis. Ke Jiang, Brian Kulis, Michael I. Jordan
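The flavour of such limits is captured by DP-means, the small-variance limit of a DP Gaussian mixture (a 1-D sketch; lam is the penalty, in squared distance, for opening a new cluster):

```python
def dp_means(xs, lam, iters=20):
    """DP-means clustering, a hard-assignment limit of a DP mixture (1-D
    sketch). Like k-means, but a point whose squared distance to every
    existing centre exceeds lam opens a new cluster, so the number of
    clusters is inferred rather than fixed in advance."""
    centers = [xs[0]]
    assign = [0] * len(xs)
    for _ in range(iters):
        for idx, x in enumerate(xs):
            d, j = min(((x - c) ** 2, j) for j, c in enumerate(centers))
            if d > lam:                 # penalty beaten: open a new cluster
                centers.append(x)
                j = len(centers) - 1
            assign[idx] = j
        for j in range(len(centers)):   # standard mean update per cluster
            members = [x for x, a in zip(xs, assign) if a == j]
            if members:
                centers[j] = sum(members) / len(members)
    return centers, assign
```

On two well-separated 1-D groups with lam = 1, the algorithm discovers exactly two clusters without being told k.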

Unsupervised Detection of Regions of Interest Using Iterative Link Analysis
This paper proposes a fast and scalable alternating optimization technique to detect regions of interest (ROIs) in cluttered Web images without labels. The proposed approach discovers highly probable regions of object instances by iteratively repeating the following two functions: (1) choose the exemplar set (i.e., a small number of high-ranked reference ROIs) across the dataset and (2) refine the ROIs of each image with respect to the exemplar set. These two subproblems are formulated as ranking in two different similarity networks of ROI hypotheses by link analysis. The experiments with the PASCAL 06 dataset show that our unsupervised localization performance is better than that of state-of-the-art techniques and comparable to supervised methods. Also, we test the scalability of our approach with five objects in a Flickr dataset consisting of more than 200,000 images. Gunhee Kim, Antonio Torralba

A Scalable CUR Matrix Decomposition Algorithm: Lower Time Complexity and Tighter Bound
The CUR matrix decomposition is an important extension of the Nyström approximation to a general matrix. It approximates any data matrix in terms of a small number of its columns and rows. In this paper we propose a novel randomized CUR algorithm with an expected relative-error bound. The proposed algorithm has two advantages over existing relative-error CUR algorithms: it possesses a tighter theoretical bound and lower time complexity, and it can avoid maintaining the whole data matrix in main memory. Finally, experiments on several real-world datasets demonstrate significant improvement over the existing relative-error algorithms. Shusen Wang, Zhihua Zhang
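A minimal sketch of one common randomized CUR construction: sample columns and rows with probability proportional to squared norms (length-squared sampling) and set the middle matrix via pseudoinverses. This is a generic illustration of the CUR idea, not the specific algorithm of the paper.

```python
import numpy as np

def cur_decompose(A, c, r, rng=None):
    """Randomized CUR sketch: sample c columns and r rows of A with
    probability proportional to their squared Euclidean norms, then
    choose U = C^+ A R^+ so that C @ U @ R approximates A."""
    rng = np.random.default_rng(rng)
    col_p = (A ** 2).sum(axis=0); col_p = col_p / col_p.sum()
    row_p = (A ** 2).sum(axis=1); row_p = row_p / row_p.sum()
    cols = rng.choice(A.shape[1], size=c, replace=False, p=col_p)
    rows = rng.choice(A.shape[0], size=r, replace=False, p=row_p)
    C, R = A[:, cols], A[rows, :]
    U = np.linalg.pinv(C) @ A @ np.linalg.pinv(R)
    return C, U, R
```

When A has rank at most min(c, r) and the sampled columns/rows span its column/row spaces, the reconstruction C @ U @ R is exact; otherwise it is an approximation whose quality the relative-error analyses bound.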

Individual Planning in Infinite-Horizon Multiagent Settings: Inference, Structure and Scalability
This paper provides the first formalization of self-interested planning in multiagent settings using expectation-maximization (EM). Our formalization in the context of infinite-horizon and finitely-nested interactive POMDPs (I-POMDPs) is distinct from EM formulations for POMDPs and cooperative multiagent planning frameworks. We exploit the graphical model structure specific to I-POMDPs, and present a new approach based on block-coordinate descent for further speed-up. Forward filtering-backward sampling, a combination of exact filtering with sampling, is explored to exploit problem structure. Xia Qu, Prashant Doshi

On Triangular versus Edge Representations: Towards Scalable Modeling of Networks
In this paper, we argue for representing networks as a bag of {\it triangular motifs}, particularly for important network problems that current model-based approaches handle poorly due to computational bottlenecks incurred by using edge representations. Such approaches require both 1-edges and 0-edges (missing edges) to be provided as input, and as a consequence, approximate inference algorithms for these models usually require $\Omega(N^2)$ time per iteration, precluding their application to larger real-world networks. In contrast, triangular modeling requires less computation, while providing equivalent or better inference quality. A triangular motif is a vertex triple containing 2 or 3 edges, and the number of such motifs is $\Theta(\sum_{i}D_{i}^{2})$ (where $D_i$ is the degree of vertex $i$), which is much smaller than $N^2$ for low-maximum-degree networks. Using this representation, we develop a novel mixed-membership network model and approximate inference algorithm suitable for large networks with low max-degree. For networks with high maximum degree, the triangular motifs can be naturally subsampled in a {\it node-centric} fashion, allowing for much faster inference at a small cost in accuracy. Empirically, we demonstrate that our approach, when compared to that of an edge-based model, has faster runtime and improved accuracy for mixed-membership community detection. We conclude with a large-scale demonstration on a network with $N \approx 280{,}000$ nodes, which is infeasible for network models with $\Omega(N^2)$ inference cost. Qirong Ho, Junming Yin, Eric P. Xing
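As an illustration of the counting argument, all 2- or 3-edge vertex triples can be enumerated by scanning each vertex's neighbour pairs, which is $O(\sum_i D_i^2)$ work overall. This is a generic sketch of the representation, not the paper's inference code.

```python
from itertools import combinations

def triangular_motifs(adj):
    """Enumerate vertex triples containing 2 or 3 edges by scanning, for each
    vertex, all pairs of its neighbours.  A 3-edge triple (a triangle) is
    discovered from all three of its vertices; a 2-edge triple only from its
    centre vertex.  adj maps each vertex to its set of neighbours."""
    motifs = set()
    for v, nbrs in adj.items():
        for a, b in combinations(sorted(nbrs), 2):
            motifs.add(tuple(sorted((v, a, b))))
    return motifs
```

On a path 1-2-3 plus a triangle 4-5-6, this yields exactly the two motifs (1, 2, 3) and (4, 5, 6), while the edge representation would also have to account for all 0-edges among the six vertices.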

Scalable kernels for graphs with continuous attributes
While graphs with continuous node attributes arise in many applications, state-of-the-art graph kernels for comparing continuous-attributed graphs suffer from a high runtime complexity; for instance, the popular shortest path kernel scales as $\mathcal{O}(n^4)$, where $n$ is the number of nodes. In this paper, we present a class of path kernels with computational complexity $\mathcal{O}(n^2 (m + \delta^2))$, where $\delta$ is the graph diameter and $m$ the number of edges. Due to the sparsity and small diameter of real-world graphs, these kernels scale comfortably to large graphs. In our experiments, the presented kernels outperform state-of-the-art kernels in terms of speed and accuracy on classification benchmark datasets. Aasa Feragen, Niklas Kasenburg, Jens Petersen, Marleen de Bruijne, Karsten Borgwardt

Copeland Dueling Bandits
A version of the dueling bandit problem is addressed in which a Condorcet winner may not exist. Two algorithms are proposed that instead seek to minimize regret with respect to the Copeland winner, which, unlike the Condorcet winner, is guaranteed to exist. The first, Copeland Confidence Bound (CCB), is designed for small numbers of arms, while the second, Scalable Copeland Bandits (SCB), works better for large-scale problems. We provide theoretical results bounding the regret accumulated by CCB and SCB, both substantially improving existing results. Such existing results either offer bounds of the form O(K log T) but require restrictive assumptions, or offer bounds of the form O(K^2 log T) without requiring such assumptions. Our results offer the best of both worlds: O(K log T) bounds without restrictive assumptions. Masrour Zoghi, Zohar S. Karnin, Shimon Whiteson, Maarten de Rijke
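For reference, the Copeland winner itself is cheap to compute once pairwise win probabilities are known (the bandit difficulty lies in estimating them from duels). A hypothetical sketch:

```python
def copeland_winner(pref):
    """Given pref[i][j] = probability that arm i beats arm j, the Copeland
    score of i is the number of arms it beats with probability > 1/2.
    A Copeland winner maximizes this score and always exists; a Condorcet
    winner, when one exists, beats all K-1 others and is the unique
    Copeland winner."""
    K = len(pref)
    scores = [sum(1 for j in range(K) if j != i and pref[i][j] > 0.5)
              for i in range(K)]
    best = max(scores)
    return [i for i in range(K) if scores[i] == best], scores
```

In a 4-arm instance with a preference cycle (no Condorcet winner), the function still returns a non-empty set of Copeland winners, which is exactly the fallback the CCB and SCB algorithms target.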

Small-Variance Asymptotics for Hidden Markov Models
Small-variance asymptotics provide an emerging technique for obtaining scalable combinatorial algorithms from rich probabilistic models. We present a small-variance asymptotic analysis of the Hidden Markov Model and its infinite-state Bayesian nonparametric extension. Starting with the standard HMM, we first derive a “hard” inference algorithm analogous to k-means that arises when particular variances in the model tend to zero. This analysis is then extended to the Bayesian nonparametric case, yielding a simple, scalable, and flexible algorithm for discrete-state sequence data with a non-fixed number of states. We also derive the corresponding combinatorial objective functions arising from our analysis, which involve a k-means-like term along with penalties based on state transitions and the number of states. A key motivation for such algorithms is that, particularly in the nonparametric setting, standard probabilistic inference algorithms lack scalability and are heavily dependent on good initialization. A number of results on synthetic and real data sets demonstrate the advantages of the proposed framework. Anirban Roychowdhury, Ke Jiang, Brian Kulis

Scalable Discriminative Learning for Natural Language Parsing and Translation
Joseph Turian, Benjamin Wellington, I. D. Melamed

Efficient Recovery of Jointly Sparse Vectors
We consider the reconstruction of sparse signals in the multiple measurement vector (MMV) model, in which the signal, represented as a matrix, consists of a set of jointly sparse vectors. MMV is an extension of the single measurement vector (SMV) model employed in standard compressive sensing (CS). Recent theoretical studies focus on the convex relaxation of the MMV problem based on $(2,1)$-norm minimization, which is an extension of the well-known $1$-norm minimization employed in SMV. However, the resulting convex optimization problem in MMV is significantly more difficult to solve than the one in SMV. Existing algorithms reformulate it as a second-order cone program (SOCP) or semidefinite program (SDP), which is computationally expensive to solve for problems of moderate size. In this paper, we propose a new (dual) reformulation of the convex optimization problem in MMV and develop an efficient algorithm based on the prox-method. Interestingly, our theoretical analysis reveals a close connection between the proposed reformulation and multiple kernel learning. Our simulation studies demonstrate the scalability of the proposed algorithm. Liang Sun, Jun Liu, Jianhui Chen, Jieping Ye

Fast and Scalable Training of Semi-Supervised CRFs with Application to Activity Recognition
We present a new and efficient semi-supervised training method for parameter estimation and feature selection in conditional random fields (CRFs). In real-world applications such as activity recognition, unlabeled sensor traces are relatively easy to obtain, whereas labeled examples are expensive and tedious to collect. Furthermore, the ability to automatically select a small subset of discriminatory features from a large pool can be advantageous in terms of computational speed as well as accuracy. In this paper, we introduce the semi-supervised virtual evidence boosting (sVEB) algorithm for training CRFs, a semi-supervised extension to the recently developed virtual evidence boosting (VEB) method for feature selection and parameter learning. Semi-supervised VEB takes advantage of the unlabeled data via minimum entropy regularization: the objective function combines the unlabeled conditional entropy with the labeled conditional pseudo-likelihood. The sVEB algorithm reduces the overall system cost as well as the human labeling cost required during training, which are both important considerations in building real-world inference systems. In a set of experiments on synthetic data and real activity traces collected from wearable sensors, we illustrate that our algorithm benefits from both the use of unlabeled data and automatic feature selection, and outperforms other semi-supervised training approaches. Maryam Mahdaviani, Tanzeem Choudhury

Learning to Search Efficiently in High Dimensions
High-dimensional similarity search in large-scale databases has become an important challenge due to the advent of the Internet. For such applications, specialized data structures are required to achieve computational efficiency. Traditional approaches relied on algorithmic constructions that are often data-independent (such as Locality Sensitive Hashing) or weakly dependent (such as kd-trees and k-means trees). While supervised learning algorithms have been applied to related problems, those proposed in the literature mainly focused on learning hash codes optimized for compact embedding of the data rather than for search efficiency. Consequently, such an embedding has to be used with a linear scan or another search algorithm; learning to hash therefore does not directly address the search efficiency issue. This paper considers a new framework that applies supervised learning to directly optimize a data structure that supports efficient large-scale search. Our approach takes both search quality and computational cost into consideration. Specifically, we learn a boosted search forest that is optimized using pairwise similarity-labeled examples. The output of this search forest can be efficiently converted into an inverted indexing data structure, which can leverage modern text search infrastructure to achieve both scalability and efficiency. Experimental results show that our approach significantly outperforms state-of-the-art learning-to-hash methods (such as spectral hashing), as well as state-of-the-art high-dimensional search algorithms (such as LSH and k-means trees). Zhen Li, Huazhong Ning, Liangliang Cao, Tong Zhang, Yihong Gong, Thomas S. Huang

Reducing the Rank in Relational Factorization Models by Including Observable Patterns
Tensor factorizations have become popular methods for learning from multi-relational data. In this context, the rank of a factorization is an important parameter that determines runtime as well as generalization ability. To determine conditions under which factorization is an efficient approach for learning from relational data, we derive upper and lower bounds on the rank required to recover adjacency tensors. Based on our findings, we propose a novel additive tensor factorization model for learning from latent and observable patterns in multi-relational data and present a scalable algorithm for computing the factorization. Experimentally, we show that the proposed approach not only improves predictive performance over pure latent variable methods but also significantly reduces the required rank, and therefore runtime and memory complexity. Maximilian Nickel, Xueyan Jiang, Volker Tresp

Scalable Training of Mixture Models via Coresets
How can we train a statistical mixture model on a massive data set? In this paper, we show how to construct coresets for mixtures of Gaussians and natural generalizations. A coreset is a weighted subset of the data, which guarantees that models fitting the coreset will also provide a good fit for the original data set. We show that, perhaps surprisingly, Gaussian mixtures admit coresets of size independent of the size of the data set. More precisely, we prove that a weighted set of $O(dk^3/\epsilon^2)$ data points suffices for computing a $(1+\epsilon)$-approximation for the optimal model on the original $n$ data points. Moreover, such coresets can be efficiently constructed in a MapReduce-style computation, as well as in a streaming setting. Our results rely on a novel reduction of statistical estimation to problems in computational geometry, as well as new complexity results about mixtures of Gaussians. We empirically evaluate our algorithms on several real data sets, including a density estimation problem in the context of earthquake detection using accelerometers in mobile phones. Dan Feldman, Matthew Faulkner, Andreas Krause
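The general shape of such constructions is importance sampling: upper-bound each point's sensitivity (its maximum possible share of any model's cost), sample from the resulting distribution, and reweight so that weighted costs are unbiased. The sketch below uses a crude uniform-plus-cost sensitivity bound relative to a rough clustering; the paper's actual bounds for Gaussian mixtures are tighter and more involved.

```python
import random

def coreset(points, m, centers, seed=0):
    """Importance-sampling coreset sketch: bound each point's sensitivity by
    a mixture of a uniform term and its share of the total cost w.r.t. a
    rough clustering (e.g. a k-means++ solution), sample m points i.i.d.
    from that distribution, and weight each sample by 1/(m * p_i) so that
    weighted sums remain unbiased estimates of data-set costs."""
    rng = random.Random(seed)
    def d2(p):
        return min(sum((a - b) ** 2 for a, b in zip(p, c)) for c in centers)
    costs = [d2(p) for p in points]
    total = sum(costs) or 1.0
    n = len(points)
    probs = [0.5 / n + 0.5 * c / total for c in costs]
    idx = rng.choices(range(n), weights=probs, k=m)
    return [(points[i], 1.0 / (m * probs[i])) for i in idx]
```

Because the weights are 1/(m * p_i), the expected total weight equals n, so the weighted coreset stands in for the full data set in downstream EM fitting.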

A Scalable Hierarchical Distributed Language Model
Neural probabilistic language models (NPLMs) have been shown to be competitive with and occasionally superior to the widely used n-gram language models. The main drawback of NPLMs is their extremely long training and testing times. Morin and Bengio have proposed a hierarchical language model built around a binary tree of words that was two orders of magnitude faster than the non-hierarchical language model it was based on. However, it performed considerably worse than its non-hierarchical counterpart in spite of using a word tree created using expert knowledge. We introduce a fast hierarchical language model along with a simple feature-based algorithm for automatic construction of word trees from the data. We then show that the resulting models can outperform non-hierarchical models and achieve state-of-the-art performance. Andriy Mnih, Geoffrey E. Hinton

Parallelizing Support Vector Machines on Distributed Computers
Support Vector Machines (SVMs) suffer from a widely recognized scalability problem in both memory use and computational time. To improve scalability, we have developed a parallel SVM algorithm (PSVM), which reduces memory use through performing a row-based, approximate matrix factorization, and which loads only essential data to each machine to perform parallel computation. Let $n$ denote the number of training instances, $p$ the reduced matrix dimension after factorization ($p$ is significantly smaller than $n$), and $m$ the number of machines. PSVM reduces the memory requirement from $O(n^2)$ to $O(np/m)$, and improves computation time to $O(np^2/m)$. Empirical studies on up to $500$ computers show PSVM to be effective. Kaihua Zhu, Hao Wang, Hongjie Bai, Jian Li, Zhihuan Qiu, Hang Cui, Edward Y. Chang

Scaling MPE Inference for Constrained Continuous Markov Random Fields with Consensus Optimization
Probabilistic graphical models are powerful tools for analyzing constrained, continuous domains. However, finding most-probable explanations (MPEs) in these models can be computationally expensive. In this paper, we improve the scalability of MPE inference in a class of graphical models with piecewise-linear and piecewise-quadratic dependencies and linear constraints over continuous domains. We derive algorithms based on a consensus-optimization framework and demonstrate their superior performance over the state of the art. We show empirically that in a large-scale voter-preference modeling problem our algorithms scale linearly in the number of dependencies and constraints. Stephen Bach, Matthias Broecheler, Lise Getoor, Dianne O'Leary

Bayesian Experimental Design of Magnetic Resonance Imaging Sequences
We show how improved sequences for magnetic resonance imaging can be found through automated optimization of Bayesian design scores. Combining recent advances in approximate Bayesian inference and natural image statistics with high-performance numerical computation, we propose the first scalable Bayesian experimental design framework for this problem of high relevance to clinical and brain research. Our solution requires approximate inference for dense, non-Gaussian models on a scale seldom addressed before. We propose a novel scalable variational inference algorithm, and show how powerful methods of numerical mathematics can be modified to compute primitives in our framework. Our approach is evaluated on a realistic setup with raw data from a 3T MR scanner. Hannes Nickisch, Rolf Pohmann, Bernhard Schölkopf, Matthias Seeger

Semi-supervised Learning with Deep Generative Models
The ever-increasing size of modern data sets combined with the difficulty of obtaining label information has made semi-supervised learning one of the problems of significant practical importance in modern data analysis. We revisit the approach to semi-supervised learning with generative models and develop new models that allow for effective generalisation from small labelled data sets to large unlabelled ones. Generative approaches have thus far been either inflexible, inefficient or non-scalable. We show that deep generative models and approximate Bayesian inference exploiting recent advances in variational methods can be used to provide significant improvements, making generative approaches highly competitive for semi-supervised learning. Diederik P. Kingma, Shakir Mohamed, Danilo Jimenez Rezende, Max Welling

Fairness in Multi-Agent Sequential Decision-Making
We define a fairness solution criterion for multi-agent decision-making problems, where agents have local interests. This new criterion aims to maximize the worst performance among agents while taking the overall performance into consideration. We develop a simple linear programming approach and a more scalable game-theoretic approach for computing an optimal fairness policy. The game-theoretic approach formulates the fairness optimization as a two-player, zero-sum game and employs an iterative algorithm for finding a Nash equilibrium, corresponding to an optimal fairness policy. We scale up this approach by exploiting problem structure and value function approximation. Our experiments on resource allocation problems show that this fairness criterion provides a more favorable solution than the utilitarian criterion, and that our game-theoretic approach is significantly faster than linear programming. Chongjie Zhang, Julie A. Shah

Deep Temporal Sigmoid Belief Networks for Sequence Modeling
Deep dynamic generative models are developed to learn sequential dependencies in time-series data. The multi-layered model is designed by constructing a hierarchy of temporal sigmoid belief networks (TSBNs), defined as a sequential stack of sigmoid belief networks (SBNs). Each SBN has a contextual hidden state, inherited from the previous SBNs in the sequence, which is used to regulate its hidden bias. Scalable learning and inference algorithms are derived by introducing a recognition model that yields fast sampling from the variational posterior. This recognition model is trained jointly with the generative model by maximizing its variational lower bound on the log-likelihood. Experimental results on bouncing balls, polyphonic music, motion capture, and text streams show that the proposed approach achieves state-of-the-art predictive performance and has the capacity to synthesize various sequences. Zhe Gan, Chunyuan Li, Ricardo Henao, David E. Carlson, Lawrence Carin

Clustering by Nonnegative Matrix Factorization Using Graph Random Walk
Nonnegative Matrix Factorization (NMF) is a promising relaxation technique for clustering analysis. However, conventional NMF methods that directly approximate the pairwise similarities using the least square error often yield mediocre performance for data in curved manifolds because they can capture only the immediate similarities between data samples. Here we propose a new NMF clustering method which replaces the approximated matrix with its smoothed version using a random walk. Our method can thus accommodate farther relationships between data samples. Furthermore, we introduce a novel regularization in the proposed objective function in order to improve over spectral clustering. The new learning objective is optimized by a multiplicative Majorization-Minimization algorithm with a scalable implementation for learning the factorizing matrix. Extensive experimental results on real-world datasets show that our method has strong performance in terms of cluster purity. Zhirong Yang, Tele Hao, Onur Dikmen, Xi Chen, Erkki Oja

Parallel Inference for Latent Dirichlet Allocation on Graphics Processing Units
The recent emergence of Graphics Processing Units (GPUs) as general-purpose parallel computing devices provides us with new opportunities to develop scalable learning methods for massive data. In this work, we consider the problem of parallelizing two inference methods for latent Dirichlet allocation (LDA) models on GPUs: collapsed Gibbs sampling (CGS) and collapsed variational Bayesian (CVB) inference. To address the limited memory on GPUs, we propose a novel data partitioning scheme that effectively reduces the memory cost. Furthermore, the partitioning scheme balances the computational cost on each multiprocessor and enables us to easily avoid memory access conflicts. We also use data streaming to handle extremely large datasets. Extensive experiments showed that our parallel inference methods consistently produced LDA models with the same predictive power as sequential training methods, but with a 26x speedup for CGS and a 196x speedup for CVB on a GPU with 30 multiprocessors; the speedup scales almost linearly with the number of multiprocessors available. The proposed partitioning scheme and data streaming can be easily ported to many other models in machine learning. Feng Yan, Ningyi Xu, Yuan Qi

Emergence of ObjectSelective Features in Unsupervised Feature Learning
Recent work in unsupervised feature learning has focused on the goal of discovering high-level features from unlabeled images. Much progress has been made in this direction, but in most cases it is still standard to use a large amount of labeled data in order to construct detectors sensitive to object classes or other complex patterns in the data. In this paper, we aim to test the hypothesis that unsupervised feature learning methods, provided with only unlabeled data, can learn high-level, invariant features that are sensitive to commonly occurring objects. Though a handful of prior results suggest that this is possible when each object class accounts for a large fraction of the data (as in many labeled datasets), it is unclear whether something similar can be accomplished when dealing with completely unlabeled data. A major obstacle to this test, however, is scale: we cannot expect to succeed with small datasets or with small numbers of learned features. Here, we propose a large-scale feature learning system that enables us to carry out this experiment, learning 150,000 features from tens of millions of unlabeled images. Based on two scalable clustering algorithms (K-means and agglomerative clustering), we find that our simple system can discover features sensitive to a commonly occurring object class (human faces) and can also combine these into detectors invariant to significant global distortions like large translations and scale. Adam Coates, Andrej Karpathy, Andrew Y. Ng

Stochastic Relational Models for Large-scale Dyadic Data using MCMC
Stochastic relational models provide a rich family of choices for learning and predicting dyadic data between two sets of entities. They generalize matrix factorization to a supervised learning problem that utilizes attributes of objects in a hierarchical Bayesian framework. Previously, empirical Bayesian inference was applied, which is however not scalable when the size of either object set reaches tens of thousands. In this paper, we introduce a Markov chain Monte Carlo (MCMC) algorithm to scale the model to very large-scale dyadic data. Both superior scalability and predictive accuracy are demonstrated on a collaborative filtering problem, which involves tens of thousands of users and half a million items. Shenghuo Zhu, Kai Yu, Yihong Gong

Optimization Methods for Sparse Pseudo-Likelihood Graphical Model Selection
Sparse high-dimensional graphical model selection is a popular topic in contemporary machine learning. To this end, various useful approaches have been proposed in the context of $\ell_1$-penalized estimation in the Gaussian framework. Though many of these approaches are demonstrably scalable and have leveraged recent advances in convex optimization, they still depend on the Gaussian functional form. To address this gap, a convex pseudo-likelihood based partial correlation graph estimation method (CONCORD) has recently been proposed. This method uses cyclic coordinate-wise minimization of a regression-based pseudo-likelihood, and has been shown to have robust model selection properties in comparison with the Gaussian approach. In direct contrast to the parallel work in the Gaussian setting, however, this new convex pseudo-likelihood framework has not leveraged the extensive array of methods that have been proposed in the machine learning literature for convex optimization. In this paper, we address this crucial gap by proposing two proximal gradient methods (CONCORD-ISTA and CONCORD-FISTA) for performing $\ell_1$-regularized inverse covariance matrix estimation in the pseudo-likelihood framework. We present timing comparisons with coordinate-wise minimization and demonstrate that our approach yields tremendous payoffs for $\ell_1$-penalized partial correlation graph estimation outside the Gaussian setting, thus yielding the fastest and most scalable approach for such problems. We undertake a theoretical analysis of our approach and rigorously demonstrate convergence, and also derive rates thereof. Sang Oh, Onkar Dalal, Kshitij Khare, Bala Rajaratnam

Regret based Robust Solutions for Uncertain Markov Decision Processes
In this paper, we seek robust policies for uncertain Markov Decision Processes (MDPs). Most robust optimization approaches for these problems have focused on the computation of {\em maximin} policies, which maximize the value corresponding to the worst realization of the uncertainty. Recent work has proposed {\em minimax} regret as a suitable alternative to the {\em maximin} objective for robust optimization. However, existing algorithms for handling {\em minimax} regret are restricted to models with uncertainty over rewards only. We provide algorithms that employ sampling to improve across multiple dimensions: (a) handling uncertainties over both transition and reward models; (b) dependence of model uncertainties across state-action pairs and decision epochs; (c) scalability and quality bounds. Finally, to demonstrate the empirical effectiveness of our sampling approaches, we provide comparisons against benchmark algorithms on two domains from the literature. We also provide a Sample Average Approximation (SAA) analysis to compute a posteriori error bounds. Asrar Ahmed, Pradeep Varakantham, Yossiri Adulyasak, Patrick Jaillet

Factoring Variations in Natural Images with Deep Gaussian Mixture Models
Generative models can be seen as the Swiss Army knives of machine learning, as many problems can be written probabilistically in terms of the distribution of the data, including prediction, reconstruction, imputation and simulation. One of the most promising directions for unsupervised learning may lie in deep learning methods, given their success in supervised learning. However, one of the current problems with deep unsupervised learning methods is that they are often harder to scale. As a result, some easier, more scalable shallow methods, such as the Gaussian mixture model and the Student-t mixture model, remain surprisingly competitive. In this paper we propose a new scalable deep generative model for images, called the Deep Gaussian Mixture Model, that is a straightforward but powerful generalization of GMMs to multiple layers. The parametrization of a deep GMM allows it to efficiently capture products of variations in natural images. We propose a new EM-based algorithm that scales well to large datasets, and we show that both the Expectation and the Maximization steps can easily be distributed over multiple machines. In our density estimation experiments we show that deeper GMM architectures generalize better than shallower ones, with results in the same ballpark as the state of the art. Aaron van den Oord, Benjamin Schrauwen

A Complete Recipe for Stochastic Gradient MCMC
Many recent Markov chain Monte Carlo (MCMC) samplers leverage continuous dynamics to define a transition kernel that efficiently explores a target distribution. In tandem, a focus has been on devising scalable variants that subsample the data and use stochastic gradients in place of full-data gradients in the dynamic simulations. However, such stochastic gradient MCMC samplers have lagged behind their full-data counterparts in terms of the complexity of dynamics considered, since proving convergence in the presence of the stochastic gradient noise is non-trivial. Even with simple dynamics, significant physical intuition is often required to modify the dynamical system to account for the stochastic gradient noise. In this paper, we provide a general recipe for constructing MCMC samplers, including stochastic gradient versions, based on continuous Markov processes specified via two matrices. We constructively prove that the framework is complete: any continuous Markov process that provides samples from the target distribution can be written in our framework. We show how previous continuous-dynamic samplers can be trivially reinvented in our framework, avoiding the complicated sampler-specific proofs. We likewise use our recipe to straightforwardly propose a new state-adaptive sampler: stochastic gradient Riemann Hamiltonian Monte Carlo (SGRHMC). Our experiments on simulated data and a streaming Wikipedia analysis demonstrate that the proposed SGRHMC sampler inherits the benefits of Riemann HMC, with the scalability of stochastic gradient methods. Yi-An Ma, Tianqi Chen, Emily Fox
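The simplest instance of such a recipe is stochastic gradient Langevin dynamics (SGLD), which in the two-matrix parametrization corresponds to an identity diffusion matrix and a zero curl matrix. A minimal one-dimensional sketch (the function name and signature are illustrative):

```python
import math
import random

def sgld_step(theta, grad_logpost_est, step):
    """One stochastic gradient Langevin dynamics update:
    theta <- theta + step * (noisy estimate of the log-posterior gradient)
                   + Gaussian noise with variance 2 * step.
    In the two-matrix recipe this is the D = I (diffusion), Q = 0 (curl)
    special case."""
    noise = random.gauss(0.0, math.sqrt(2.0 * step))
    return theta + step * grad_logpost_est(theta) + noise
```

With the exact gradient of a standard normal log-density (grad = -theta), iterating this update produces samples whose empirical mean and variance approach 0 and 1, up to discretization bias of order `step`; with minibatch gradients, the same update subsamples the data at each step.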

Communication Efficient Distributed Machine Learning with the Parameter Server
This paper describes a third-generation parameter server framework for distributed machine learning. This framework offers two relaxations to balance system performance and algorithm efficiency. We propose a new algorithm that takes advantage of this framework to solve non-convex non-smooth problems with convergence guarantees. We present an in-depth analysis of two large-scale machine learning problems ranging from $\ell_1$-regularized logistic regression on CPUs to reconstruction ICA on GPUs, using 636 TB of real data with hundreds of billions of samples and dimensions. We demonstrate using these examples that the parameter server framework is an effective and straightforward way to scale machine learning to larger problems and systems than have been previously achieved. Mu Li, David G. Andersen, Alex J. Smola, Kai Yu

Parallel Feature Selection Inspired by Group Testing
This paper presents a parallel feature selection method for classification that scales up to very high dimensions and large data sizes. Our original method is inspired by group testing theory, under which the feature selection procedure consists of a collection of randomized tests to be performed in parallel. Each test corresponds to a subset of features, for which a scoring function may be applied to measure the relevance of the features in a classification task. We develop a general theory providing sufficient conditions under which true features are guaranteed to be correctly identified. Superior performance of our method is demonstrated on a challenging relation extraction task from a very large data set that has both redundant features and a sample size on the order of millions. We present comprehensive comparisons with state-of-the-art feature selection methods on a range of data sets, for which our method exhibits competitive performance in terms of running time and accuracy. Moreover, it also yields substantial speedup when used as a preprocessing step for most other existing methods. Yingbo Zhou, Utkarsh Porwal, Ce Zhang, Hung Q. Ngo, Long Nguyen, Christopher Ré, Venu Govindaraju

Efficient Online Inference for Bayesian Nonparametric Relational Models
Stochastic block models characterize observed network relationships via latent community memberships. In large social networks, we expect entities to participate in multiple communities, and the number of communities to grow with the network size. We introduce a new model for these phenomena, the hierarchical Dirichlet process relational model, which allows nodes to have mixed membership in an unbounded set of communities. To allow scalable learning, we derive an online stochastic variational inference algorithm. Focusing on assortative models of undirected networks, we also propose an efficient structured mean-field variational bound, and online methods for automatically pruning unused communities. Compared to state-of-the-art online learning methods for parametric relational models, we show significantly improved perplexity and link prediction accuracy for sparse networks with tens of thousands of nodes. We also showcase an analysis of LittleSis, a large network of who-knows-who at the heights of business and government. Dae Il Kim, Prem Gopalan, David Blei, Erik Sudderth

Theoretical Analysis of Heuristic Search Methods for Online POMDPs
Planning in partially observable environments remains a challenging problem, despite significant recent advances in offline approximation techniques. A few online methods have also been proposed recently, and proven to be remarkably scalable, but without the theoretical guarantees of their offline counterparts. Thus it seems natural to try to unify offline and online techniques, preserving the theoretical properties of the former, and exploiting the scalability of the latter. In this paper, we provide theoretical guarantees on an anytime algorithm for POMDPs which aims to reduce the error made by approximate offline value iteration algorithms through the use of an efficient online searching procedure. The algorithm uses search heuristics based on an error analysis of lookahead search, to guide the online search towards reachable beliefs with the most potential to reduce error. We provide a general theorem showing that these search heuristics are admissible, and lead to complete and epsilon-optimal algorithms. This is, to the best of our knowledge, the strongest theoretical result available for online POMDP solution methods. We also provide empirical evidence showing that our approach is also practical, and can find (provably) near-optimal solutions in reasonable time. Stephane Ross, Joelle Pineau, Brahim Chaib-draa

An Online Algorithm for Large Scale Image Similarity Learning
Learning a measure of similarity between pairs of objects is a fundamental problem in machine learning. It lies at the core of classification methods like kernel machines, and is particularly useful for applications like searching for images that are similar to a given image or finding videos that are relevant to a given video. In these tasks, users look for objects that are not only visually similar but also semantically related to a given object. Unfortunately, current approaches for learning similarity may not scale to large datasets with high dimensionality, especially when imposing metric constraints on the learned similarity. We describe OASIS, a method for learning pairwise similarity that is fast and scales linearly with the number of objects and the number of non-zero features. Scalability is achieved through online learning of a bilinear model over sparse representations using a large margin criterion and an efficient hinge loss cost. OASIS is accurate at a wide range of scales: on a standard benchmark with thousands of images, it is more precise than state-of-the-art methods, and faster by orders of magnitude. On 2 million images collected from the web, OASIS can be trained within 3 days on a single CPU. The non-metric similarities learned by OASIS can be transformed into metric similarities, achieving higher precisions than similarities that are learned as metrics in the first place. This suggests an approach for learning a metric from data that is larger by an order of magnitude than was handled before. Gal Chechik, Uri Shalit, Varun Sharma, Samy Bengio
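As an illustration of the kind of update this abstract describes, here is a minimal sketch of a passive-aggressive triplet update for a bilinear similarity S_W(p, q) = p^T W q with a hinge loss. This is a dense toy version: the function name, the aggressiveness parameter C, and the toy triplet are assumptions for illustration, and the paper's linear scaling comes from sparse representations that this sketch omits.

```python
import numpy as np

def oasis_step(W, p, p_pos, p_neg, C=0.1):
    """One passive-aggressive update of the bilinear similarity matrix W.

    Triplet hinge loss: the query p should score at least a unit margin
    higher with the relevant item p_pos than with the irrelevant p_neg.
    """
    loss = max(0.0, 1.0 - p @ W @ p_pos + p @ W @ p_neg)
    if loss > 0.0:
        V = np.outer(p, p_pos - p_neg)          # gradient of the hinge term
        tau = min(C, loss / (V ** 2).sum())     # PA-I capped step size
        W = W + tau * V
    return W

rng = np.random.default_rng(0)
d = 5
W = np.eye(d)                                   # start from the identity similarity
p, p_pos = rng.normal(size=d), rng.normal(size=d)
p_neg = -p_pos                                  # a clearly dissimilar item
for _ in range(20):
    W = oasis_step(W, p, p_pos, p_neg)
margin = p @ W @ p_pos - p @ W @ p_neg          # grows until the unit margin holds
```

After a few updates the triplet's margin reaches the hinge target, after which further updates are no-ops.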

Factor Modeling for Advertisement Targeting
We adapt a probabilistic latent variable model, namely GaP (Gamma-Poisson), to ad targeting in the contexts of sponsored search (SS) and behaviorally targeted (BT) display advertising. We also approach the important problem of ad positional bias by formulating a one-latent-dimension GaP factorization. Learning from click-through data is intrinsically large scale, even more so for ads. We scale up the algorithm to terabytes of real-world SS and BT data that contains hundreds of millions of users and hundreds of thousands of features, by leveraging the scalability characteristics of the algorithm and the inherent structure of the problem, including data sparsity and locality. Specifically, we demonstrate two somewhat orthogonal philosophies of scaling algorithms to large-scale problems, through the SS and BT implementations, respectively. Finally, we report the experimental results using Yahoo's vast datasets, and show that our approach substantially outperforms the state-of-the-art methods in prediction accuracy. For BT in particular, the ROC area achieved by GaP exceeds 0.95, while one prior approach using Poisson regression yielded 0.83. For computational performance, we compare a single-node sparse implementation with a parallel implementation using Hadoop MapReduce; the results are counterintuitive yet quite interesting. We therefore provide insights into the underlying principles of large-scale learning. Ye Chen, Michael Kapralov, John Canny, Dmitry Y. Pavlov

On Lifting the Gibbs Sampling Algorithm
Statistical relational learning models combine the power of first-order logic, the de facto tool for handling relational structure, with that of probabilistic graphical models, the de facto tool for handling uncertainty. Lifted probabilistic inference algorithms for them have been the subject of much recent research. The main idea in these algorithms is to improve the speed, accuracy and scalability of existing graphical models' inference algorithms by exploiting symmetry in the first-order representation. In this paper, we consider blocked Gibbs sampling, an advanced variation of the classic Gibbs sampling algorithm, and lift it to the first-order level. We propose to achieve this by partitioning the first-order atoms in the relational model into a set of disjoint clusters such that exact lifted inference is polynomial in each cluster given an assignment to all other atoms not in the cluster. We propose an approach for constructing such clusters and determining their complexity and show how it can be used to trade accuracy against computational complexity in a principled manner. Our experimental evaluation shows that lifted Gibbs sampling is superior to the propositional algorithm in terms of accuracy and convergence. Deepak Venugopal, Vibhav Gogate

Generalized HigherOrder Orthogonal Iteration for Tensor Decomposition and Completion
Low-rank tensor estimation has been frequently applied in many real-world problems. Despite successful applications, existing Schatten 1-norm minimization (SNM) methods may become very slow or even inapplicable for large-scale problems. To address this difficulty, we therefore propose an efficient and scalable core tensor Schatten 1-norm minimization method for simultaneous tensor decomposition and completion, with a much lower computational complexity. We first establish the equivalence between the Schatten 1-norm of a low-rank tensor and that of its core tensor. Then the Schatten 1-norm of the core tensor is used to replace that of the whole tensor, which leads to a much smaller-scale matrix SNM problem. Finally, an efficient algorithm with a rank-increasing scheme is developed to solve the proposed problem with a convergence guarantee. Extensive experimental results show that our method is usually more accurate than the state-of-the-art methods, and is orders of magnitude faster. Yuanyuan Liu, Fanhua Shang, Wei Fan, James Cheng, Hong Cheng

On Integrated Clustering and Outlier Detection
We model the joint clustering and outlier detection problem using an extension of the facility location formulation. The advantages of combining clustering and outlier selection include: (i) the resulting clusters tend to be compact and semantically coherent; (ii) the clusters are more robust against data perturbations; and (iii) the outliers are contextualised by the clusters and more interpretable. We provide a practical subgradient-based algorithm for the problem and also study the theoretical properties of the algorithm in terms of approximation and convergence. Extensive evaluation on synthetic and real data sets attests to both the quality and scalability of our proposed method. Lionel Ott, Linsey Pang, Fabio T. Ramos, Sanjay Chawla

Fast Second Order Stochastic Backpropagation for Variational Inference
We propose a second-order (Hessian or Hessian-free) optimization method for variational inference inspired by Gaussian backpropagation, and argue that quasi-Newton optimization can be developed as well. This is accomplished by generalizing the gradient computation in stochastic backpropagation via a reparametrization trick with lower complexity. As an illustrative example, we apply this approach to the problems of Bayesian logistic regression and the variational autoencoder (VAE). Additionally, we compute bounds on the estimator variance of intractable expectations for the family of Lipschitz continuous functions. Our method is practical, scalable and model-free. We demonstrate our method on several real-world datasets and provide comparisons with other stochastic gradient methods to show substantial enhancement in convergence rates. Kai Fan, Ziteng Wang, Jeff Beck, James Kwok, Katherine A. Heller

Coresets for k-Segmentation of Streaming Data
Life-logging video streams, financial time series, and Twitter tweets are a few examples of high-dimensional signals over practically unbounded time. We consider the problem of computing optimal segmentation of such signals by a k-piecewise linear function, using only one pass over the data, by maintaining a coreset for the signal. The coreset enables fast further analysis such as automatic summarization and analysis of such signals. A coreset (core-set) is a compact representation of the data seen so far, which approximates the data well for a specific task, in our case segmentation of the stream. We show that, perhaps surprisingly, the segmentation problem admits coresets of cardinality only linear in the number of segments k, independently of both the dimension d of the signal, and its number n of points. More precisely, we construct a representation of size O(k log n / eps^2) that provides a (1+eps)-approximation for the sum of squared distances to any given k-piecewise linear function. Moreover, such coresets can be constructed in a parallel streaming approach. Our results rely on a novel reduction of statistical estimations to problems in computational geometry. We empirically evaluate our algorithms on very large synthetic and real data sets from GPS, video and financial domains, using 255 machines in the Amazon cloud. Guy Rosman, Mikhail Volkov, Dan Feldman, John W. Fisher III, Daniela Rus

Deeply Learning the Messages in Message Passing Inference
Deep structured output learning shows great promise in tasks like semantic image segmentation. We propose a new, efficient deep structured model learning scheme, in which we show how deep Convolutional Neural Networks (CNNs) can be used to directly estimate the messages in message passing inference for structured prediction with Conditional Random Fields (CRFs). With such CNN message estimators, we obviate the need to learn or evaluate potential functions for message calculation. This confers significant efficiency for learning, since otherwise, when performing structured learning for a CRF with CNN potentials, it is necessary to undertake expensive inference for every stochastic gradient iteration. The network output dimension of message estimators is the same as the number of classes, rather than growing exponentially in the order of the potentials. Hence it is more scalable for cases where a large number of classes is involved. We apply our method to semantic image segmentation and achieve impressive performance, which demonstrates the effectiveness and usefulness of our CNN message learning method. Guosheng Lin, Chunhua Shen, Ian Reid, Anton van den Hengel

DFacTo: Distributed Factorization of Tensors
We present a technique for significantly speeding up Alternating Least Squares (ALS) and Gradient Descent (GD), two widely used algorithms for tensor factorization. By exploiting properties of the Khatri-Rao product, we show how to efficiently address a computationally challenging substep of both algorithms. Our algorithm, DFacTo, only requires two matrix-vector products and is easy to parallelize. DFacTo is not only scalable but also on average 4 to 10 times faster than competing algorithms on a variety of datasets. For instance, DFacTo only takes 480 seconds on 4 machines to perform one iteration of the ALS algorithm and 1,143 seconds to perform one iteration of the GD algorithm on a 6.5 million x 2.5 million x 1.5 million dimensional tensor with 1.2 billion non-zero entries. Joon Hee Choi, S. Vishwanathan
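For context, the challenging ALS substep this abstract refers to is the product of an unfolded tensor with a Khatri-Rao product of the other factors. The dense reference implementation below shows that classical step, not DFacTo's reorganized matrix-vector version; all names and the toy sizes are illustrative.

```python
import numpy as np

def khatri_rao(B, C):
    """Column-wise Khatri-Rao product: (J x R) and (K x R) -> (J*K x R)."""
    J, R = B.shape
    K = C.shape[0]
    return (B[:, None, :] * C[None, :, :]).reshape(J * K, R)

def als_update_A(X, B, C):
    """One ALS update of factor A for a dense I x J x K tensor X.

    This is the classical step A = X_(1) (B kr C) [(B^T B) * (C^T C)]^+,
    whose expensive X_(1) (B kr C) multiplication is the substep that
    DFacTo reorganizes into two matrix-vector products.
    """
    X1 = X.reshape(X.shape[0], -1)              # mode-1 unfolding (row-major order)
    G = (B.T @ B) * (C.T @ C)                   # small R x R Gram matrix
    return X1 @ khatri_rao(B, C) @ np.linalg.pinv(G)

rng = np.random.default_rng(1)
A = rng.normal(size=(4, 2))
B = rng.normal(size=(5, 2))
C = rng.normal(size=(6, 2))
X = np.einsum('ir,jr,kr->ijk', A, B, C)         # exact rank-2 tensor
A_hat = als_update_A(X, B, C)                   # recovers A given B and C
```

With an exactly low-rank tensor and the true B and C, one update recovers A exactly.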

Fast Lifted MAP Inference via Partitioning
Recently, there has been growing interest in lifting MAP inference algorithms for Markov logic networks (MLNs). A key advantage of these lifted algorithms is that they have much smaller computational complexity than propositional algorithms when symmetries are present in the MLN and these symmetries can be detected using lifted inference rules. Unfortunately, lifted inference rules are sound but not complete and can often miss many symmetries. This is problematic because when symmetries cannot be exploited, lifted inference algorithms ground the MLN, and search for solutions in the much larger propositional space. In this paper, we present a novel approach, which cleverly introduces new symmetries at the time of grounding. Our main idea is to partition the ground atoms and force the inference algorithm to treat all atoms in each part as indistinguishable. We show that by systematically and carefully refining (and growing) the partitions, we can build advanced anytime and any-space MAP inference algorithms. Our experiments on several real-world datasets clearly show that our new algorithm is superior to previous approaches and often finds useful symmetries in the search space that existing lifted inference rules are unable to detect. Somdeb Sarkhel, Parag Singla, Vibhav G. Gogate

Online and Stochastic Gradient Methods for Non-decomposable Loss Functions
Modern applications in sensitive domains such as biometrics and medicine frequently require the use of non-decomposable loss functions such as precision@k, F-measure, etc. Compared to point loss functions such as hinge loss, these offer much more fine-grained control over prediction, but at the same time present novel challenges in terms of algorithm design and analysis. In this work we initiate a study of online learning techniques for such non-decomposable loss functions with an aim to enable incremental learning as well as design scalable solvers for batch problems. To this end, we propose an online learning framework for such loss functions. Our model enjoys several nice properties, chief amongst them being the existence of efficient online learning algorithms with sublinear regret and online-to-batch conversion bounds. Our model is a provable extension of existing online learning models for point loss functions. We instantiate two popular losses, Prec@k and pAUC, in our model and prove sublinear regret bounds for both of them. Our proofs require a novel structural lemma over ranked lists which may be of independent interest. We then develop scalable stochastic gradient descent solvers for non-decomposable loss functions. We show that for a large family of loss functions satisfying a certain uniform convergence property (that includes Prec@k, pAUC, and F-measure), our methods provably converge to the empirical risk minimizer. Such uniform convergence results were not known for these losses and we establish these using novel proof techniques. We then use extensive experimentation on real-life and benchmark datasets to establish that our method can be orders of magnitude faster than a recently proposed cutting plane method. Purushottam Kar, Harikrishna Narasimhan, Prateek Jain
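To make "non-decomposable" concrete: precision@k cannot be written as a sum of per-example losses, because its value depends on the ranked list as a whole. A small illustrative computation (the function name and toy data are made up, not from the paper):

```python
import numpy as np

def prec_at_k(scores, labels, k):
    """Precision@k: fraction of true positives among the k top-scoring items.

    The value depends on the whole ranking, not on any single
    (prediction, label) pair, which is what non-decomposable means here.
    """
    topk = np.argsort(-scores)[:k]
    return labels[topk].mean()

scores = np.array([0.9, 0.8, 0.3, 0.7, 0.1])
labels = np.array([1, 0, 1, 1, 0])
p_at_3 = prec_at_k(scores, labels, 3)           # top 3 are items 0, 1, 3 -> 2/3
```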

Automatic Feature Induction for Stagewise Collaborative Filtering
Recent approaches to collaborative filtering have concentrated on estimating an algebraic or statistical model, and using the model for predicting missing ratings. In this paper we observe that different models have relative advantages in different regions of the input space. This motivates our approach of using stagewise linear combinations of collaborative filtering algorithms, with non-constant combination coefficients based on kernel smoothing. The resulting stagewise model is computationally scalable and outperforms a wide selection of state-of-the-art collaborative filtering algorithms. Joonseok Lee, Mingxuan Sun, Guy Lebanon, Seungjean Kim

FastEx: Hash Clustering with Exponential Families
Clustering is a key component in the data analysis toolbox. Despite its importance, scalable algorithms often eschew rich statistical models in favor of simpler descriptions such as $k$-means clustering. In this paper we present a sampler capable of estimating mixtures of exponential families. At its heart lies a novel proposal distribution using random projections to achieve high throughput in generating proposals, which is crucial for clustering models with large numbers of clusters. Amr Ahmed, Sujith Ravi, Alex J. Smola, Shravan M. Narayanamurthy

Random Conic Pursuit for Semidefinite Programming
We present a novel algorithm, Random Conic Pursuit, that solves semidefinite programs (SDPs) via repeated optimization over randomly selected two-dimensional subcones of the PSD cone. This scheme is simple, easily implemented, applicable to very general SDPs, scalable, and theoretically interesting. Its advantages are realized at the expense of the ability to readily compute highly accurate solutions, though useful approximate solutions are easily obtained. This property renders Random Conic Pursuit of particular interest for machine learning applications, in which the relevant SDPs are generally based upon random data and so exact minima are often not a priority. Indeed, we present empirical results to this effect for various SDPs encountered in machine learning; these experiments demonstrate the potential practical usefulness of Random Conic Pursuit. We also provide a preliminary analysis that yields insight into the theoretical properties and convergence of the algorithm. Ariel Kleiner, Ali Rahimi, Michael I. Jordan

A Convergent Gradient Descent Algorithm for Rank Minimization and Semidefinite Programming from Random Linear Measurements
We propose a simple, scalable, and fast gradient descent algorithm to optimize a nonconvex objective for the rank minimization problem and a closely related family of semidefinite programs. With $O(r^3 \kappa^2 n \log n)$ random measurements of a positive semidefinite $n\times n$ matrix of rank $r$ and condition number $\kappa$, our method is guaranteed to converge linearly to the global optimum. Qinqing Zheng, John Lafferty
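A hedged sketch of the general idea in this abstract, assuming the setting of recovering a rank-r PSD matrix M = UU^T from random linear measurements: plain gradient descent on the factor U for a squared-error objective. The step size, iteration count, and problem sizes below are arbitrary toy choices, not the paper's guaranteed regime or its initialization scheme.

```python
import numpy as np

rng = np.random.default_rng(2)
n, r, m = 8, 2, 200
U_true = rng.normal(size=(n, r)) / np.sqrt(n)
M_true = U_true @ U_true.T                      # rank-r PSD ground truth
As = rng.normal(size=(m, n, n))
As = (As + As.transpose(0, 2, 1)) / 2           # symmetric random measurement matrices
b = np.einsum('mij,ij->m', As, M_true)          # linear measurements of M_true

def loss(U):
    resid = np.einsum('mij,ij->m', As, U @ U.T) - b
    return 0.5 / m * (resid ** 2).sum()

def grad(U):
    # d/dU of loss: (2/m) * sum_i resid_i * A_i U for symmetric A_i
    resid = np.einsum('mij,ij->m', As, U @ U.T) - b
    return 2.0 / m * np.einsum('m,mij,jr->ir', resid, As, U)

U = rng.normal(size=(n, r)) / np.sqrt(n)        # random initialization
loss0 = loss(U)
for _ in range(2000):
    U -= 0.05 * grad(U)
loss1 = loss(U)                                 # should be far below loss0
err = np.linalg.norm(U @ U.T - M_true) / np.linalg.norm(M_true)
```

Optimizing over the n x r factor rather than the full matrix is what keeps the per-iteration cost low; the linear convergence rate the abstract states is a property of the paper's analyzed regime, not of this toy run.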

Burn-in, bias, and the rationality of anchoring
Bayesian inference provides a unifying framework for addressing problems in machine learning, artificial intelligence, and robotics, as well as the problems facing the human mind. Unfortunately, exact Bayesian inference is intractable in all but the simplest models. Therefore minds and machines have to approximate Bayesian inference. Approximate inference algorithms can achieve a wide range of time-accuracy tradeoffs, but what is the optimal tradeoff? We investigate time-accuracy tradeoffs using the Metropolis-Hastings algorithm as a metaphor for the mind's inference algorithm(s). We find that reasonably accurate decisions are possible long before the Markov chain has converged to the posterior distribution, i.e. during the period known as burn-in. Therefore the strategy that is optimal subject to the mind's bounded processing speed and opportunity costs may perform so few iterations that the resulting samples are biased towards the initial value. The resulting cognitive process model provides a rational basis for the anchoring-and-adjustment heuristic. The model's quantitative predictions are tested against published data on anchoring in numerical estimation tasks. Our theoretical and empirical results suggest that the anchoring bias is consistent with approximate Bayesian inference. Falk Lieder, Thomas Griffiths, Noah Goodman
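A toy numerical illustration of the burn-in effect discussed above, assuming a random-walk Metropolis-Hastings sampler targeting a standard normal: a chain started at a far-off "anchor" and stopped after a few iterations yields estimates biased toward the anchor, while a long chain with burn-in discarded does not. All settings here (step size, chain lengths, anchor value) are illustrative, not the paper's model.

```python
import numpy as np

rng = np.random.default_rng(3)

def mh_samples(n_iter, init, step=0.5):
    """Random-walk Metropolis-Hastings targeting a standard normal."""
    log_p = lambda z: -0.5 * z * z              # unnormalized N(0, 1) log-density
    x, out = init, []
    for _ in range(n_iter):
        prop = x + step * rng.normal()
        if np.log(rng.uniform()) < log_p(prop) - log_p(x):
            x = prop                            # accept the proposal
        out.append(x)
    return np.array(out)

anchor = 10.0                                   # far-off initial value (the "anchor")
few = mh_samples(20, anchor).mean()             # stopped during burn-in: biased high
many = mh_samples(5000, anchor)[1000:].mean()   # burn-in discarded: near the true mean 0
```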

Polynomial Semantic Indexing
We present a class of nonlinear (polynomial) models that are discriminatively trained to directly map from the word content in a query-document or document-document pair to a ranking score. Dealing with polynomial models on word features is computationally challenging. We propose a low-rank (but diagonal-preserving) representation of our polynomial models to induce feasible memory and computation requirements. We provide an empirical study on retrieval tasks based on Wikipedia documents, where we obtain state-of-the-art performance while providing realistically scalable methods. Bing Bai, Jason Weston, David Grangier, Ronan Collobert, Kunihiko Sadamasa, Yanjun Qi, Corinna Cortes, Mehryar Mohri

Time Dependent Adaptive Neural Networks
Fernando J. Pineda

A latent factor model for highly multi-relational data
Many data such as social networks, movie preferences or knowledge bases are multi-relational, in that they describe multiple relationships between entities. While there is a large body of work focused on modeling these data, few considered modeling these multiple types of relationships jointly. Further, existing approaches tend to break down when the number of these types grows. In this paper, we propose a method for modeling large multi-relational datasets, with possibly thousands of relations. Our model is based on a bilinear structure, which captures the various orders of interaction of the data, but also shares sparse latent factors across different relations. We illustrate the performance of our approach on standard tensor-factorization datasets where we attain, or outperform, state-of-the-art results. Finally, an NLP application demonstrates our scalability and the ability of our model to learn efficient, and semantically meaningful, verb representations. Rodolphe Jenatton, Nicolas L. Roux, Antoine Bordes, Guillaume R. Obozinski

Selecting Receptive Fields in Deep Networks
Recent deep learning and unsupervised feature learning systems that learn from unlabeled data have achieved high performance in benchmarks by using extremely large architectures with many features (hidden units) at each layer. Unfortunately, for such large architectures the number of parameters usually grows quadratically in the width of the network, thus necessitating hand-coded "local receptive fields" that limit the number of connections from lower-level features to higher ones (e.g., based on spatial locality). In this paper we propose a fast method to choose these connections that may be incorporated into a wide variety of unsupervised training methods. Specifically, we choose local receptive fields that group together those low-level features that are most similar to each other according to a pairwise similarity metric. This approach allows us to harness the advantages of local receptive fields (such as improved scalability, and reduced data requirements) when we do not know how to specify such receptive fields by hand or where our unsupervised training algorithm has no obvious generalization to a topographic setting. We produce results showing how this method allows us to use even simple unsupervised training algorithms to train successful multilayer networks that achieve state-of-the-art results on the CIFAR and STL datasets: 82.0% and 60.1% accuracy, respectively. Adam Coates, Andrew Y. Ng
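A minimal sketch of the grouping idea in this abstract: pick seed features and let each receptive field be the features whose responses are most similar to the seed's. Squared correlation is used here as one simple choice of pairwise similarity metric; the function, parameters, and synthetic data are illustrative assumptions, not the authors' code.

```python
import numpy as np

def select_receptive_fields(H, n_fields, field_size, seed=0):
    """Group columns of H (examples x features) by pairwise similarity.

    Each receptive field is a random seed feature together with the
    features whose responses correlate most strongly with it.
    """
    rng = np.random.default_rng(seed)
    Z = (H - H.mean(0)) / (H.std(0) + 1e-8)     # standardize feature responses
    S = (Z.T @ Z / len(H)) ** 2                 # squared-correlation similarity
    fields = []
    for seed_feat in rng.choice(H.shape[1], size=n_fields, replace=False):
        field = np.argsort(-S[seed_feat])[:field_size]
        fields.append(np.sort(field))
    return fields

# Synthetic check: two blocks of mutually correlated features
rng = np.random.default_rng(4)
base = rng.normal(size=(200, 2))
H = np.concatenate([base[:, [0]] + 0.1 * rng.normal(size=(200, 5)),
                    base[:, [1]] + 0.1 * rng.normal(size=(200, 5))], axis=1)
fields = select_receptive_fields(H, n_fields=2, field_size=5)
```

On this synthetic data each recovered field stays within one correlated block, which is the behavior the grouping heuristic is after.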

Accelerated Gradient Methods for Stochastic Optimization and Online Learning
Regularized risk minimization often involves nonsmooth optimization, either because of the loss function (e.g., hinge loss) or the regularizer (e.g., $\ell_1$-regularizer). Gradient descent methods, though highly scalable and easy to implement, are known to converge slowly on these problems. In this paper, we develop novel accelerated gradient methods for stochastic optimization while still preserving their computational simplicity and scalability. The proposed algorithm, called SAGE (Stochastic Accelerated GradiEnt), exhibits fast convergence rates on stochastic optimization with both convex and strongly convex objectives. Experimental results show that SAGE is faster than recent (sub)gradient methods including FOLOS, SMIDAS and SCD. Moreover, SAGE can also be extended for online learning, resulting in a simple but powerful algorithm. Chonghai Hu, Weike Pan, James T. Kwok

On Model Parallelization and Scheduling Strategies for Distributed Machine Learning
Distributed machine learning has typically been approached from a data-parallel perspective, where big data are partitioned to multiple workers and an algorithm is executed concurrently over different data subsets under various synchronization schemes to ensure speedup and/or correctness. A sibling problem that has received relatively less attention is how to ensure efficient and correct model-parallel execution of ML algorithms, where the parameters of an ML program are partitioned to different workers and undergo concurrent iterative updates. We argue that model and data parallelism impose rather different challenges for system design, algorithmic adjustment, and theoretical analysis. In this paper, we develop a system for model parallelism, STRADS, that provides a programming abstraction for scheduling parameter updates by discovering and leveraging changing structural properties of ML programs. STRADS enables a flexible tradeoff between scheduling efficiency and fidelity to intrinsic dependencies within the models, and improves memory efficiency of distributed ML. We demonstrate the efficacy of model-parallel algorithms implemented on STRADS versus popular implementations for topic modeling, matrix factorization, and Lasso. Seunghak Lee, Jin Kyu Kim, Xun Zheng, Qirong Ho, Garth A. Gibson, Eric P. Xing

Copula variational inference
We develop a general variational inference method that preserves dependency among the latent variables. Our method uses copulas to augment the families of distributions used in mean-field and structured approximations. Copulas model the dependency that is not captured by the original variational distribution, and thus the augmented variational family guarantees better approximations to the posterior. With stochastic optimization, inference on the augmented distribution is scalable. Furthermore, our strategy is generic: it can be applied to any inference procedure that currently uses the mean-field or structured approach. Copula variational inference has many advantages: it reduces bias; it is less sensitive to local optima; it is less sensitive to hyperparameters; and it helps characterize and interpret the dependency among the latent variables. Dustin Tran, David Blei, Edo M. Airoldi

The Lovász ϑ function, SVMs and finding large dense subgraphs
The Lovasz $\theta$ function of a graph is a fundamental tool in combinatorial optimization and approximation algorithms. Computing $\theta$ involves solving an SDP and is extremely expensive even for moderately sized graphs. In this paper we establish that the Lovasz $\theta$ function is equivalent to a kernel learning problem related to the one-class SVM. This interesting connection opens up many opportunities bridging graph theoretic algorithms and machine learning. We show that there exist graphs, which we call $SVM\theta$ graphs, on which the Lovasz $\theta$ function can be approximated well by a one-class SVM. This leads to a novel use of SVM techniques to solve algorithmic problems in large graphs, e.g. identifying a planted clique of size $\Theta({\sqrt{n}})$ in a random graph $G(n,\frac{1}{2})$. A classic approach for this problem involves computing the $\theta$ function; however, it is not scalable due to the SDP computation. We show that the random graph with a planted clique is an example of an $SVM\theta$ graph, and as a consequence an SVM-based approach easily identifies the clique in large graphs and is competitive with the state-of-the-art. Further, we introduce the notion of a ''common orthogonal labelling'', which extends the notion of an ''orthogonal labelling'' of a single graph (used in defining the $\theta$ function) to multiple graphs. The problem of finding the optimal common orthogonal labelling is cast as a Multiple Kernel Learning problem and is used to identify a large common dense region in multiple graphs. The proposed algorithm achieves an order of magnitude scalability compared to the state of the art. Vinay Jethava, Anders Martinsson, Chiranjib Bhattacharyya, Devdatt Dubhashi

(Almost) No Label No Cry
In Learning with Label Proportions (LLP), the objective is to learn a supervised classifier when, instead of labels, only label proportions for bags of observations are known. This setting has broad practical relevance, in particular for privacy-preserving data processing. We first show that the mean operator, a statistic which aggregates all labels, is minimally sufficient for the minimization of many proper scoring losses with linear (or kernelized) classifiers without using labels. We provide a fast learning algorithm that estimates the mean operator via a manifold regularizer with guaranteed approximation bounds. Then, we present an iterative learning algorithm that uses this as initialization. We ground this algorithm in Rademacher-style generalization bounds that fit the LLP setting, introducing a generalization of Rademacher complexity and a Label Proportion Complexity measure. This latter algorithm optimizes tractable bounds for the corresponding bag-empirical risk. Experiments are provided on fourteen domains, whose size ranges up to 300K observations. They display that our algorithms are scalable and tend to consistently outperform the state of the art in LLP. Moreover, in many cases, our algorithms compete with or are just a few percent of AUC away from the Oracle that learns knowing all labels. On the largest domains, half a dozen proportions can suffice, i.e. roughly 40K times fewer than the total number of labels. Giorgio Patrini, Richard Nock, Tiberio Caetano, Paul Rivera

Multiplicative Forests for Continuous-Time Processes
Learning temporal dependencies between variables over continuous time is an important and challenging task. Continuous-time Bayesian networks effectively model such processes but are limited by the number of conditional intensity matrices, which grows exponentially in the number of parents per variable. We develop a partition-based representation using regression trees and forests whose parameter spaces grow linearly in the number of node splits. Using a multiplicative assumption, we show how to update the forest likelihood in closed form, producing efficient model updates. Our results show multiplicative forests can be learned from few temporal trajectories with large gains in performance and scalability. Jeremy Weiss, Sriraam Natarajan, David Page

Efficient Structured Matrix Rank Minimization
We study the problem of finding structured low-rank matrices using nuclear norm regularization, where the structure is encoded by a linear map. In contrast to most known approaches for linearly structured rank minimization, we do not (a) use the full SVD; nor (b) resort to augmented Lagrangian techniques; nor (c) solve linear systems per iteration. Instead, we formulate the problem differently so that it is amenable to a generalized conditional gradient method, which results in a practical improvement with low per-iteration computational cost. Numerical results show that our approach significantly outperforms state-of-the-art competitors in terms of running time, while effectively recovering low-rank solutions in stochastic system realization and spectral compressed sensing problems. Adams Wei Yu, Wanli Ma, Yaoliang Yu, Jaime Carbonell, Suvrit Sra

Balanced Graph Matching
Timothee Cour, Praveen Srinivasan, Jianbo Shi

Scale Up Nonlinear Component Analysis with Doubly Stochastic Gradients
Nonlinear component analysis methods such as kernel Principal Component Analysis (KPCA) and kernel Canonical Correlation Analysis (KCCA) are widely used in machine learning, statistics and data analysis, but they cannot scale up to big datasets. Recent attempts have employed random feature approximations to convert the problem to the primal form for linear computational complexity. However, to obtain high-quality solutions, the number of random features should be of the same order of magnitude as the number of data points, making such an approach not directly applicable to the regime with millions of data points. We propose a simple, computationally efficient, and memory-friendly algorithm based on "doubly stochastic gradients" to scale up a range of kernel nonlinear component analysis methods, such as kernel PCA, CCA and SVD. Despite the non-convex nature of these problems, our method enjoys theoretical guarantees that it converges at the rate $\tilde{O}(1/t)$ to the global optimum, even for the top $k$ eigen subspace. Unlike many alternatives, our algorithm does not require explicit orthogonalization, which is infeasible on big datasets. We demonstrate the effectiveness and scalability of our algorithm on large-scale synthetic and real-world datasets. Bo Xie, Yingyu Liang, Le Song
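
A minimal sketch of one half of the idea: approximate the kernel with random Fourier features and estimate the top kernel principal component with a stochastic (mini-batch) power iteration that touches only a random subset of points per step. The paper's doubly stochastic scheme additionally redraws random features at every step; the data, bandwidth, and batch size below are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# two well-separated clusters: kernel PCA's top component separates them
X = np.vstack([rng.standard_normal((1000, 2)) * 0.1 + [2, 0],
               rng.standard_normal((1000, 2)) * 0.1 + [-2, 0]])

# random Fourier features approximating an RBF kernel with bandwidth 1
D = 200
W = rng.standard_normal((2, D))
b = rng.uniform(0, 2 * np.pi, D)
Phi = np.sqrt(2.0 / D) * np.cos(X @ W + b)
Phi -= Phi.mean(0)                       # center in feature space

# stochastic power iteration for the top principal direction, using a
# random mini-batch of points at each step
v = rng.standard_normal(D)
v /= np.linalg.norm(v)
for _ in range(100):
    batch = Phi[rng.integers(0, len(X), size=256)]
    v = batch.T @ (batch @ v)            # power step on batch covariance
    v /= np.linalg.norm(v)

scores = Phi @ v                         # projections onto the component
print(scores[:1000].mean(), scores[1000:].mean())  # the clusters separate
```

Each step costs O(batch x D) time and O(D) memory, independent of the dataset size, which is what makes this family of methods scale.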

Fast Computation of Posterior Mode in Multi-Level Hierarchical Models
Multi-level hierarchical models provide an attractive framework for incorporating correlations induced in a response variable organized in a hierarchy. Model fitting is challenging, especially for hierarchies with a large number of nodes. We provide a novel algorithm based on a multi-scale Kalman filter that is both scalable and easy to implement. For non-Gaussian responses, quadratic approximation to the log-likelihood results in biased estimates. We suggest a bootstrap strategy to correct such biases. Our method is illustrated through simulation studies and analyses of real-world data sets in health care and online advertising. Liang Zhang, Deepak Agarwal

Natural Neural Networks
We introduce Natural Neural Networks, a novel family of algorithms that speed up convergence by adapting their internal representation during training to improve conditioning of the Fisher matrix. In particular, we show a specific example that employs a simple and efficient reparametrization of the neural network weights by implicitly whitening the representation obtained at each layer, while preserving the feed-forward computation of the network. Such networks can be trained efficiently via the proposed Projected Natural Gradient Descent algorithm (PRONG), which amortizes the cost of these reparametrizations over many parameter updates and is closely related to the Mirror Descent online learning algorithm. We highlight the benefits of our method on both unsupervised and supervised learning tasks, and showcase its scalability by training on the large-scale ImageNet Challenge dataset. Guillaume Desjardins, Karen Simonyan, Razvan Pascanu, koray kavukcuoglu
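
The whitening reparametrization at the heart of this approach can be illustrated in isolation: whiten a layer's input activations and compensate the weights so the forward computation is unchanged. This is a static sketch under assumed shapes; the actual algorithm re-estimates and amortizes these transforms during training.

```python
import numpy as np

rng = np.random.default_rng(0)
H = rng.standard_normal((1000, 4)) @ rng.standard_normal((4, 4))  # correlated activations
W = rng.standard_normal((4, 3))                                   # layer weights

# ZCA whitening transform built from the activation covariance
C = np.cov(H, rowvar=False) + 1e-8 * np.eye(4)   # small ridge for stability
lam, E = np.linalg.eigh(C)
U = E @ np.diag(lam ** -0.5) @ E.T               # symmetric whitening matrix

Hc = H - H.mean(0)
H_white = Hc @ U                  # whitened activations (identity covariance)
W_new = np.linalg.inv(U) @ W      # compensated weights

out_old, out_new = Hc @ W, H_white @ W_new
print(np.allclose(out_old, out_new))             # forward pass is preserved
```

Gradient descent on `W_new` with whitened inputs is better conditioned than on `W` directly, which is the convergence benefit the abstract describes.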

Discriminant Saliency for Visual Recognition from Cluttered Scenes
Dashan Gao, Nuno Vasconcelos

Relaxed Clipping: A Global Training Method for Robust Regression and Classification
Robust regression and classification are often thought to require non-convex loss functions that prevent scalable, global training. However, such a view neglects the possibility of reformulated training methods that can yield practically solvable alternatives. A natural way to make a loss function more robust to outliers is to truncate loss values that exceed a maximum threshold. We demonstrate that a relaxation of this form of "loss clipping" can be made globally solvable and applicable to any standard loss while guaranteeing robustness against outliers. We present a generic procedure that can be applied to standard loss functions and demonstrate improved robustness in regression and classification problems. Min Yang, Linli Xu, Martha White, Dale Schuurmans, Yaoliang Yu
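
The loss-clipping idea the paper starts from is easy to see in its raw, non-convex form: truncate the squared loss at a threshold so that gross outliers stop influencing the fit. The sketch below optimizes this clipped loss with plain gradient descent on assumed toy data; the paper's contribution is a relaxation that makes such training globally solvable, which this sketch does not implement.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.standard_normal((200, 2))
w_true = np.array([1.0, -2.0])
y = X @ w_true + 0.01 * rng.standard_normal(200)
y[:10] += 50.0                       # gross outliers

def clipped_grad(w, tau=1.0):
    # squared loss truncated at tau: points whose loss exceeds the
    # threshold are "clipped" and contribute nothing to the gradient
    r = X @ w - y
    active = 0.5 * r ** 2 < tau
    return X[active].T @ r[active] / max(active.sum(), 1)

w = np.zeros(2)
for _ in range(1000):
    w -= 0.2 * clipped_grad(w)
print(w)                             # near w_true despite the outliers
```

Ordinary least squares on the same data would be pulled far off by the ten corrupted responses; the clipped loss ignores them once their residuals exceed the threshold.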

Rank-Approximate Nearest Neighbor Search: Retaining Meaning and Speed in High Dimensions
The long-standing problem of efficient nearest-neighbor (NN) search has ubiquitous applications ranging from astrophysics to MP3 fingerprinting to bioinformatics to movie recommendations. As the dimensionality of the dataset increases, exact NN search becomes computationally prohibitive; (1+eps)-distance-approximate NN search can provide large speedups but risks losing the meaning of NN search present in the ranks (ordering) of the distances. This paper presents a simple, practical algorithm allowing the user to, for the first time, directly control the true accuracy of NN search (in terms of ranks) while still achieving large speedups over exact NN search. Experiments with high-dimensional datasets show that it often achieves faster and more accurate results than the best-known distance-approximate method, with much more stable behavior. Parikshit Ram, Dongryeol Lee, Hua Ouyang, Alexander G. Gray
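
The rank-control idea can be illustrated with its simplest instrument, uniform sampling with an order-statistics bound: a uniform sample of size m misses all of the tau nearest points with probability at most (1 - tau/n)^m, so m can be chosen from the desired rank tau and failure probability delta. This standalone sketch (with illustrative parameter names) uses brute force on the sample; the paper combines such bounds with tree-based search for real speedups.

```python
import numpy as np

def rank_approx_nn(X, q, tau, delta, rng):
    # choose m so that, with probability >= 1 - delta, the best point in
    # the sample ranks within the top tau of all n points
    n = len(X)
    m = min(n, int(np.ceil(np.log(delta) / np.log(1.0 - tau / n))))
    idx = rng.choice(n, size=m, replace=False)
    d = np.linalg.norm(X[idx] - q, axis=1)
    return idx[np.argmin(d)]

rng = np.random.default_rng(0)
X = rng.standard_normal((10000, 20))
q = rng.standard_normal(20)
i = rank_approx_nn(X, q, tau=100, delta=0.05, rng=rng)
true_rank = (np.linalg.norm(X - q, axis=1) < np.linalg.norm(X[i] - q)).sum()
print(true_rank)   # < 100 with probability >= 0.95
```

Note the sample size depends on n, tau, and delta but not on the dimension, which is why rank guarantees remain meaningful where distance-approximation guarantees degrade.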

Predictive Matrix-Variate t Models
It is becoming increasingly important to learn from a partially observed random matrix and predict its missing elements. We assume that the entire matrix is a single sample drawn from a matrix-variate t distribution and suggest a matrix-variate t model (MVTM) to predict those missing elements. We show that MVTM generalizes a range of known probabilistic models, and automatically performs model selection to encourage sparse predictive models. Due to the non-conjugacy of its prior, it is difficult to make predictions by computing the mode or mean of the posterior distribution. We suggest an optimization method that sequentially minimizes a convex upper bound of the log-likelihood, which is very efficient and scalable. Experiments on toy data and the EachMovie dataset show good predictive accuracy of the model. Shenghuo Zhu, Kai Yu, Yihong Gong

Submodular Optimization with Submodular Cover and Submodular Knapsack Constraints
We investigate two new optimization problems: minimizing a submodular function subject to a submodular lower bound constraint (submodular cover), and maximizing a submodular function subject to a submodular upper bound constraint (submodular knapsack). We are motivated by a number of real-world applications in machine learning, including sensor placement and data subset selection, which require maximizing a certain submodular function (like coverage or diversity) while simultaneously minimizing another (like cooperative cost). These problems are often posed as minimizing the difference between submodular functions [9, 23], which is in the worst case inapproximable. We show, however, that by phrasing these problems as constrained optimization, which is more natural for many applications, we achieve a number of bounded approximation guarantees. We also show that both problems are closely related, and an approximation algorithm solving one can be used to obtain an approximation guarantee for the other. We provide hardness results for both problems, thus showing that our approximation factors are tight up to log factors. Finally, we empirically demonstrate the performance and good scalability properties of our algorithms. Rishabh K. Iyer, Jeff A. Bilmes
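
For intuition, here is the classical greedy heuristic for a special case of submodular cover: set coverage as the submodular lower-bound function with a modular (additive) cost. The paper's setting is more general (submodular costs as well); the family of sets and costs below are made up for illustration.

```python
# Greedy heuristic for submodular set cover: repeatedly pick the set with
# the best marginal-coverage-to-cost ratio until the target is met.
def greedy_cover(family, costs, target):
    covered, chosen, total = set(), [], 0.0
    while len(covered) < target:
        best, best_ratio = None, 0.0
        for i, s in enumerate(family):
            gain = len(s - covered)           # marginal gain of set i
            if gain > 0 and gain / costs[i] > best_ratio:
                best, best_ratio = i, gain / costs[i]
        if best is None:
            break                             # target unreachable
        covered |= family[best]
        chosen.append(best)
        total += costs[best]
    return chosen, total

family = [{1, 2, 3}, {3, 4}, {4, 5, 6, 7}, {1, 5}]
chosen, total = greedy_cover(family, costs=[1.0, 1.0, 2.0, 1.0], target=6)
print(chosen, total)
```

Greedy ratios of this kind are the building block behind the bounded approximation guarantees the abstract refers to.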

On Iterative Hard Thresholding Methods for High-dimensional M-Estimation
The use of M-estimators in generalized linear regression models in high-dimensional settings requires risk minimization with hard L_0 constraints. Of the known methods, the class of projected gradient descent methods (also known as iterative hard thresholding (IHT)) is known to offer the fastest and most scalable solutions. However, the current state of the art is only able to analyze these methods in extremely restrictive settings which do not hold in high-dimensional statistical models. In this work we bridge this gap by providing the first analysis of IHT-style methods in the high-dimensional statistical setting. Our bounds are tight and match known minimax lower bounds. Our results rely on a general analysis framework that enables us to analyze several popular hard thresholding style algorithms (such as HTP, CoSaMP, SP) in the high-dimensional regression setting. Finally, we extend our analysis to the problem of low-rank matrix recovery. Prateek Jain, Ambuj Tewari, Purushottam Kar
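
The basic IHT iteration is short enough to state in full: a gradient step on the least-squares risk followed by projection onto the set of k-sparse vectors (keep the k largest magnitudes). A minimal sketch on assumed noiseless data, not the paper's analysis:

```python
import numpy as np

def iht(X, y, k, iters=200):
    """Iterative hard thresholding for k-sparse least squares:
    minimize ||y - X w||^2 subject to ||w||_0 <= k."""
    step = 1.0 / np.linalg.norm(X, 2) ** 2   # 1 / sigma_max(X)^2
    w = np.zeros(X.shape[1])
    for _ in range(iters):
        w = w + step * X.T @ (y - X @ w)     # gradient step
        w[np.argsort(np.abs(w))[:-k]] = 0.0  # keep k largest magnitudes
    return w

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 50))
w_true = np.zeros(50)
w_true[:3] = [2.0, -1.5, 1.0]
y = X @ w_true                               # noiseless sparse regression
w_hat = iht(X, y, k=3)
print(np.nonzero(w_hat)[0])                  # recovered support
```

Each iteration costs one matrix-vector pair plus a partial sort, which is why IHT-style methods scale to large problems.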

Learning the Semantic Correlation: An Alternative Way to Gain from Unlabeled Text
In this paper, we address the question of what kind of knowledge is generally transferable from unlabeled text. We suggest and analyze the semantic correlation of words as a generally transferable structure of the language and propose a new method to learn this structure using an appropriately chosen latent variable model. This semantic correlation contains structural information of the language space and can be used to control the joint shrinkage of model parameters for any specific task in the same space through regularization. In an empirical study, we construct 190 different text classification tasks from a real-world benchmark, and the unlabeled documents are a mixture from all these tasks. We test the ability of various algorithms to use the mixed unlabeled text to enhance all classification tasks. Empirical results show that the proposed approach is a reliable and scalable method for semi-supervised learning, regardless of the source of unlabeled data, the specific task to be enhanced, and the prediction model used. Yi Zhang, Artur Dubrawski, Jeff G. Schneider

New Rules for Domain Independent Lifted MAP Inference
Lifted inference algorithms for probabilistic first-order logic frameworks such as Markov logic networks (MLNs) have received significant attention in recent years. These algorithms use so-called lifting rules to identify symmetries in the first-order representation and reduce the inference problem over a large probabilistic model to an inference problem over a much smaller model. In this paper, we present two new lifting rules, which enable fast MAP inference in a large class of MLNs. Our first rule uses the concept of single occurrence equivalence class of logical variables, which we define in the paper. The rule states that the MAP assignment over an MLN can be recovered from a much smaller MLN, in which each logical variable in each single occurrence equivalence class is replaced by a constant (i.e., an object in the domain of the variable). Our second rule states that we can safely remove a subset of formulas from the MLN if all equivalence classes of variables in the remaining MLN are single occurrence and all formulas in the subset are tautologies (i.e., evaluate to true) at extremes (i.e., assignments with identical truth values for groundings of a predicate). We prove that our two new rules are sound and demonstrate via a detailed experimental evaluation that our approach is superior in terms of scalability and MAP solution quality to state-of-the-art approaches. Happy Mittal, Prasoon Goyal, Vibhav G. Gogate, Parag Singla