{"title": "Regularization With Stochastic Transformations and Perturbations for Deep Semi-Supervised Learning", "book": "Advances in Neural Information Processing Systems", "page_first": 1163, "page_last": 1171, "abstract": "Effective convolutional neural networks are trained on large sets of labeled data. However, creating large labeled datasets is a very costly and time-consuming task. Semi-supervised learning uses unlabeled data to train a model with higher accuracy when there is a limited set of labeled data available. In this paper, we consider the problem of semi-supervised learning with convolutional neural networks. Techniques such as randomized data augmentation, dropout and random max-pooling provide better generalization and stability for classifiers that are trained using gradient descent. Multiple passes of an individual sample through the network might lead to different predictions due to the non-deterministic behavior of these techniques. We propose an unsupervised loss function that takes advantage of the stochastic nature of these methods and minimizes the difference between the predictions of multiple passes of a training sample through the network. We evaluate the proposed method on several benchmark datasets.", "full_text": "Regularization With Stochastic Transformations and\nPerturbations for Deep Semi-Supervised Learning\n\nMehdi Sajjadi\n\nMehran Javanmardi\n\nTolga Tasdizen\n\nDepartment of Electrical and Computer Engineering\n\nUniversity of Utah\n\n{mehdi, mehran, tolga}@sci.utah.edu\n\nAbstract\n\nEffective convolutional neural networks are trained on large sets of labeled data.\nHowever, creating large labeled datasets is a very costly and time-consuming\ntask. Semi-supervised learning uses unlabeled data to train a model with higher\naccuracy when there is a limited set of labeled data available.\nIn this paper,\nwe consider the problem of semi-supervised learning with convolutional neural\nnetworks. 
Techniques such as randomized data augmentation, dropout and random max-pooling provide better generalization and stability for classifiers that are trained using gradient descent. Multiple passes of an individual sample through the network might lead to different predictions due to the non-deterministic behavior of these techniques. We propose an unsupervised loss function that takes advantage of the stochastic nature of these methods and minimizes the difference between the predictions of multiple passes of a training sample through the network. We evaluate the proposed method on several benchmark datasets.

1 Introduction

Convolutional neural networks (ConvNets) [1, 2] achieve state-of-the-art accuracy on a variety of computer vision tasks, including classification, object localization, detection, recognition and scene labeling [3, 4]. The advantage of ConvNets partially originates from their complexity (large number of parameters), but this can result in overfitting without a large amount of training data. However, creating a large labeled dataset is very costly. A notable example is the 'ImageNet' [5] dataset with 1000 categories and more than 1 million training images. The state-of-the-art accuracy on this dataset is improved every year using ConvNet-based methods (e.g., [6, 7]). This dataset is the result of significant manual effort. However, with around 1000 images per category, it barely contains enough training samples to prevent the ConvNet from overfitting [7]. On the other hand, unlabeled data is cheap to collect. For example, there are numerous online resources for images and video sequences of different types. Therefore, there has been an increased interest in exploiting the readily available unlabeled data to improve the performance of ConvNets.
Randomization plays an important role in the majority of learning systems. 
Stochastic gradient\ndescent, dropout [8], randomized data transformation and augmentation [9] and many other training\ntechniques that are essential for fast convergence and effective generalization of the learning functions\nintroduce some non-deterministic behavior to the learning system. Due to these uncertainties, passing\na single data sample through a learning system multiple times might lead to different predictions.\nBased on this observation, we introduce an unsupervised loss function optimized by gradient descent\nthat takes advantage of this randomization effect and minimizes the difference in predictions of\nmultiple passes of a data sample through the network during the training phase, which leads to better\ngeneralization in testing time. The proposed unsupervised loss function speci\ufb01cally regularizes the\nnetwork based on the variations caused by randomized data augmentation, dropout and randomized\nmax-pooling schemes. This loss function can be combined with any supervised loss function. In this\n\n30th Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain.\n\n\fpaper, we apply the proposed unsupervised loss function to ConvNets as a state-of-the-art supervised\nclassi\ufb01er. We show through numerous experiments that this combination leads to a competitive\nsemi-supervised learning method.\n\n2 Related Work\n\nThere are many approaches to semi-supervised learning in general. Self-training and co-training\n[10, 11] are two well-known classic examples. Another set of approaches is based on generative\nmodels, for example, methods based on Gaussian Mixture Models (GMM) and Hidden Markov\nModels (HMM) [12]. These generative models generally try to use unlabeled data in modeling the\njoint probability distribution of the training data and labels. 
Transductive SVM (TSVM) [13] and S3VM [14] are other semi-supervised learning approaches that try to find a decision boundary with a maximum margin on both labeled and unlabeled data. A large group of semi-supervised methods is based on graphs and the similarities between the samples [15, 16]. For example, if a labeled sample is similar to an unlabeled sample, its label is assigned to that unlabeled sample. In these methods, the similarities are encoded in the edges of a graph. Label propagation [17] is an example of these methods, in which the goal is to minimize the difference between the model predictions of two samples connected by an edge with a large weight. In other words, similar samples tend to get similar predictions.
In this paper, our focus is on semi-supervised deep learning. There has always been interest in exploiting unlabeled data to improve the performance of ConvNets. One approach is to use unlabeled data to pre-train the filters of a ConvNet [18, 19]. The goal is to reduce the number of training epochs required to converge and to improve the accuracy compared to a model trained from random initialization. Predictive sparse decomposition (PSD) [20] is one example of these methods, used for learning the weights in the filter bank layer. The works presented in [21] and [22] are two recent examples of learning features by pre-training ConvNets using unlabeled data. In these approaches, an auxiliary target is defined for a pair of unlabeled images [21] or a pair of patches from a single unlabeled image [22]. Then a pair of ConvNets is trained to learn descriptive features from the unlabeled images. These features can be fine-tuned for a specific task with a limited set of labeled data. However, many recent ConvNet models with state-of-the-art accuracy start from randomly initialized weights using techniques such as Xavier's method [23, 6]. 
Therefore, approaches that make better use of unlabeled data during training, instead of just for pre-training, are more desirable.
Another example of semi-supervised learning with ConvNets is region embedding [24], which is used for text categorization. The work in [25] is also a deep semi-supervised learning method based on embedding techniques. Unlabeled video frames are also being used to train ConvNets [26, 27]; the target of the ConvNet is calculated based on the correlations between video frames. Another notable example is semi-supervised learning with ladder networks [28], in which the sum of the supervised and unsupervised loss functions is minimized simultaneously by backpropagation. In this method, a feedforward model is assumed to be an encoder. The proposed network consists of a noisy encoder path and a clean one. A decoder is added to each layer of the noisy path and is supposed to reconstruct a clean activation of each layer. The unsupervised loss function is the difference between the output of each layer in the clean path and its corresponding reconstruction from the noisy path. Another approach, by [29], is to take a random unlabeled sample and generate multiple instances by randomly transforming that sample multiple times. The resulting set of images forms a surrogate class; multiple surrogate classes are produced and a ConvNet is trained on them. One disadvantage of this method is that it does not scale well with the number of unlabeled examples, because a separate class is needed for every training sample during unsupervised training. In [30], the authors propose a mutual-exclusivity loss function that forces the set of predictions for a multiclass dataset to be mutually exclusive. In other words, it forces the classifier's prediction to be close to one for only one class and zero for the others. 
It is shown that this loss function makes use of unlabeled data and\npushes the decision boundary to a less dense area of decision space.\nAnother set of works related to our approach try to restrict the variations of the prediction function.\nTangent distance and tangent propagation proposed by [31] enforce local classi\ufb01cation invariance with\nrespect to the transformations of input images. Here, we propose a simpler method that additionally\nminimizes the internal variations of the network caused by dropout and randomized pooling and leads\nto state-of-the-art results on MNIST (with 100 labeled samples), CIFAR10 and CIFAR100. Another\nexample is Slow Feature Analysis (SFA) (e.g., [32] and [33]) that encourages the representations of\ntemporally close data to exhibit small differences.\n\n2\n\n\f3 Method\n\nGiven any training sample, a model\u2019s prediction should be the same under any random transformation\nof the data and perturbations to the model. The transformations can be any linear and non-linear data\naugmentation being used to extend the training data. The disturbances include dropout techniques\nand randomized pooling schemes. In each pass, each sample can be randomly transformed or the\nhidden nodes can be randomly activated. As a result, the network\u2019s prediction can be different for\nmultiple passes of the same training sample. However, we know that each sample is assigned to only\none class. Therefore, the network\u2019s prediction is expected to be the same despite transformations\nand disturbances. We introduce an unsupervised loss function that minimizes the mean squared\ndifferences between different passes of an individual training sample through the network. Note that\nwe do not need to know the label of a training sample in order to enforce this loss. Therefore, the\nproposed loss function is completely unsupervised and can be used along with supervised training as\na semi-supervised learning method. 
Even if we do not have a separate unlabeled set, we can apply the proposed loss function to samples of the labeled set to enforce stability.
Here, we formally define the proposed unsupervised loss function. We start with a dataset of N training samples and C classes. Let f^j(x_i) be the classifier's prediction vector on the i'th training sample during the j'th pass through the network, and assume that each training sample is passed n times through the network. We define T^j(x_i) to be a random linear or non-linear transformation applied to the training sample x_i before the j'th pass through the network. The proposed loss function is:

l_U^{TS} = \sum_{i=1}^{N} \sum_{j=1}^{n-1} \sum_{k=j+1}^{n} \left\| f^{j}(T^{j}(x_i)) - f^{k}(T^{k}(x_i)) \right\|_2^2 \qquad (1)

where 'TS' stands for transformation/stability. We pass a training sample through the network n times. In each pass, the transformation T^j(x_i) produces a different input to the network from the original training sample. In addition, each time the randomness inside the network, which can be caused by dropout or randomized pooling schemes, leads to a different prediction output. We minimize the sum of squared differences between each possible pair of predictions, and this objective can be minimized using gradient descent. Although Eq. 1 is quadratic in the number of augmented versions of the data (n), calculation of the loss and gradient is based only on the prediction vectors, so the computational cost is negligible even for large n. Note that recent neural-network-based methods are optimized on batches of training samples instead of single samples (batch vs. online training). We can design batches to contain replications of training samples so we can easily optimize this transformation/stability loss function. 
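The pairwise sum in Eq. 1 for a single sample can be sketched as follows. This is a minimal NumPy illustration for exposition, not the authors' implementation; `preds` is assumed to hold the n prediction vectors produced by the n stochastic passes:

```python
import numpy as np

def transform_stability_loss(preds):
    """Inner sums of Eq. 1 for one sample: squared l2 distance
    between every unordered pair of the n prediction vectors."""
    n = len(preds)
    loss = 0.0
    for j in range(n - 1):
        for k in range(j + 1, n):
            diff = preds[j] - preds[k]
            loss += float(diff @ diff)
    return loss
```

If all n passes agree exactly, the loss is zero, which is the behavior the regularizer rewards.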
If we use data augmentation, we put different transformed versions of an unlabeled sample in the mini-batch instead of exact replications. This unsupervised loss function can be used with any backpropagation-based algorithm. Even though every mini-batch contains replications of a training sample, these are used to calculate a single backpropagation signal, avoiding gradient bias and not adversely affecting convergence. It is also possible to combine this loss with any supervised loss function: we reserve part of the mini-batch for labeled data, which are not replicated.
As mentioned in Section 2, the mutual-exclusivity loss function of [30] forces the classifier's prediction vector to have only one non-zero element. This loss function naturally complements the transformation/stability loss function. In supervised learning, each element of the prediction vector is pushed towards zero or one depending on the corresponding element in the label vector. The proposed loss minimizes the l2-norm of the difference between predictions of multiple transformed versions of a sample, but it does not impose any restrictions on the individual elements of a single prediction vector. As a result, each prediction vector might be a trivial solution instead of a valid prediction due to the lack of labels. The mutual-exclusivity loss function forces each prediction vector to be valid and prevents trivial solutions. This loss function is defined as follows:

l_U^{ME} = \sum_{i=1}^{N} \sum_{j=1}^{n} \left( - \sum_{k=1}^{C} f_k^{j}(x_i) \prod_{l=1,\, l \neq k}^{C} \left( 1 - f_l^{j}(x_i) \right) \right) \qquad (2)

where 'ME' stands for mutual-exclusivity and f_k^j(x_i) is the k-th element of the prediction vector f^j(x_i). In the experiments, we show that the combination of both loss functions leads to further improvements in the accuracy of the models. 
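A small plain-Python illustration of the per-sample term of Eq. 2 (for exposition only): the loss is most negative, i.e. minimized, when each prediction vector is one-hot.

```python
def mutual_exclusivity_loss(preds):
    """Eq. 2 for one sample: for each of the n prediction vectors f,
    add -sum_k f_k * prod_{l != k} (1 - f_l)."""
    loss = 0.0
    for f in preds:               # one stochastic pass through the network
        s = 0.0
        for k in range(len(f)):
            prod = 1.0
            for l in range(len(f)):
                if l != k:
                    prod *= 1.0 - f[l]
            s += f[k] * prod
        loss -= s
    return loss
```

A one-hot vector gives the term its extreme value, while a uniform prediction is penalized, which is why this loss rules out the trivial constant solutions of the transformation/stability loss.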
We define the combination of both loss functions as the transformation/stability plus mutual-exclusivity loss function:

l_U = \lambda_1 l_U^{ME} + \lambda_2 l_U^{TS} \qquad (3)

4 Experiments

We show the effect of the proposed unsupervised loss functions using ConvNets on MNIST [2], CIFAR10 and CIFAR100 [34], SVHN [35], NORB [36] and the ILSVRC 2012 challenge [5]. We use two frameworks to implement and evaluate the proposed loss function. The first one is cuda-convnet [37], the original implementation of the well-known AlexNet model. The second is sparse convolutional networks [38] with fractional max-pooling [39], a more recent implementation of ConvNets achieving state-of-the-art accuracy on the CIFAR10 and CIFAR100 datasets. We show through different experiments that the proposed loss function improves the accuracy of models trained on a few labeled samples in both implementations. In Eq. 1, we set n to 4 for experiments conducted using cuda-convnet and to 5 for experiments performed using sparse convolutional networks. Sparse convolutional networks allow arbitrary batch sizes; we tried different options for n and found n = 5 to be optimal. cuda-convnet, however, requires mini-batches of size 128, which rules out n = 5 (128 is not divisible by 5), so we use n = 4 instead; in practice the difference is insignificant. We used MNIST to find the optimal n: we tried values of n up to 10 and did not observe improvements for n larger than 5. It must be noted that replicating a training sample four or five times does not necessarily increase the computational complexity by the same factor; based on the experiments, with higher n fewer training epochs are required for the models to converge. We perform multiple experiments for each dataset, using the available training data of each dataset to create two sets: labeled and unlabeled. 
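The mini-batch layout described above (labeled samples appear once, each unlabeled sample is replicated n times, one slot per stochastic pass) can be sketched with a hypothetical helper; this is our illustration, not code from either framework:

```python
import numpy as np

def build_minibatch(labeled_x, unlabeled_x, n=4):
    """Stack labeled samples once and each unlabeled sample n times,
    so the n copies of a sample feed the n stochastic passes of Eq. 1."""
    replicated = np.repeat(unlabeled_x, n, axis=0)  # each row repeated n times
    return np.concatenate([labeled_x, replicated], axis=0)
```

With 8 labeled samples and 30 unlabeled samples at n = 4, this yields exactly a 128-row batch, matching the cuda-convnet batch size mentioned above.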
We do not use the labels of the unlabeled set during training. It must be noted that for the experiments with data augmentation, we apply data augmentation to both the labeled and unlabeled sets. We compare models that are trained only on the labeled set with models that are trained on both the labeled set and the unlabeled set using the unsupervised loss function. We show that by using the unsupervised loss function, we can improve the accuracy of classifiers on benchmark datasets. For experiments performed using sparse convolutional networks, we describe the network parameters using the format adopted from the original paper [39]:

(10kC2 − FMP√2)5 − C2 − C1

In the above example network, 10k is the number of maps in the k'th convolutional layer; in this example, k = 1, 2, ..., 5. C2 specifies that convolutions use a kernel size of 2. FMP√2 indicates that convolutional layers are followed by a fractional max-pooling (FMP) layer [39] that reduces the size of feature maps by a factor of √2. As mentioned earlier, the mutual-exclusivity loss function of [30] complements the transformation/stability loss function. We implement that loss function in both cuda-convnet and sparse convolutional networks as well. We experimentally choose λ1 and λ2 in Eq. 3. However, the performance of the models is not overly sensitive to these parameters, and in most of the experiments they are fixed to λ1 = 0.1 and λ2 = 1.
4.1 MNIST
MNIST is the most frequently used dataset in the area of digit classification. It contains 60000 training and 10000 test samples of size 28 × 28 pixels. We perform experiments on MNIST using a sparse convolutional network with the following architecture: (32kC2 − FMP√2)6 − C2 − C1. We use dropout to regularize the network. The ratio of dropout gradually increases from the first layer to the last layer. 
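As a toy illustration of this shorthand, the map-count part can be expanded mechanically. The parsing rule and function below are our own reading of the notation (maps grow linearly with layer index k), not code from [39]:

```python
import re

def conv_map_counts(descriptor):
    """Expand e.g. '(32kC2-FMPsqrt2)6' into per-layer map counts:
    layer k of a 'BkC2' block repeated d times has B*k maps, k = 1..d."""
    m = re.match(r"\((\d+)kC2-FMP[^)]*\)(\d+)", descriptor)
    base, depth = int(m.group(1)), int(m.group(2))
    return [base * k for k in range(1, depth + 1)]
```

So the MNIST network above has 32, 64, ..., 192 maps across its six convolutional layers.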
We do not use any data augmentation for this task; in other words, T^j(x_i) of Eq. 1 is the identity function for this dataset. In this case, we take advantage of the random effects of dropout and fractional max-pooling using the unsupervised loss function. We randomly select 10 samples from each class (a total of 100 labeled samples). We use all available training data as the unlabeled set. First, we train a model based on this labeled set only. Then, we train models by adding the unsupervised loss functions: in separate experiments, we add the transformation/stability loss function, the mutual-exclusivity loss function and the combination of both. Each experiment is repeated five times with a different random subset of training samples. We repeat the same set of experiments using 100% of the MNIST training samples. The results are given in Table 1. We can see that the proposed loss significantly improves the accuracy on test data. We also compare the results with ladder networks [28]. The combination of both loss functions reduces the error rate to 0.55% ± 0.16, which is the state-of-the-art for the task of MNIST with 100 labeled samples to the best of our knowledge. The state-of-the-art error rate on MNIST using all training data without data augmentation is 0.24% [40]. It can be seen that we achieve a close accuracy by using only 100 labeled samples.

Table 1: Error rates (%) on test set for MNIST (mean % ± std).

       labeled data only   transform/stability loss   mut-excl loss [30]   both losses   ladder net. [28]   ladder net baseline [28]
100:   5.44 ± 1.48         0.76 ± 0.61                3.92 ± 1.12          0.55 ± 0.16   0.89 ± 0.50        6.43 ± 0.84
all:   0.32 ± 0.02         0.29 ± 0.02                0.30 ± 0.03          0.27 ± 0.02   -                  0.36

4.2 SVHN and NORB
SVHN is another digit classification task similar to MNIST. 
This dataset contains about 70000 images\nfor training and more than 500000 easier images [35] for validation. We do not use the validation\nset. The test set contains 26032 images, which are RGB images of size 32 \u00d7 32. Generally, SVHN\nis a more dif\ufb01cult task compared to MNIST because of the large variations in the images. We do\nnot perform any pre-processing for this dataset. We simply convert the color images to grayscale\nby removing hue and saturation information. NORB is a collection of stereo images in six classes.\nThe training set contains 10 folds of 29160 images. It is common practice to use only the \ufb01rst\ntwo folds for training. The test set contains two folds, totaling 58320. The original images are\n108 \u00d7 108. However, we scale them down to 48 \u00d7 48 similar to [9]. We perform experiments on\nthese two datasets using both cuda-convnet and sparse convolutional network implementations of the\nunsupervised loss function.\nIn the \ufb01rst set of experiments, we use cuda-convnet to train models with different ratios of labeled and\nunlabeled data. We randomly choose 1%, 5%, 10%, 20% and 100% of training samples as labeled\ndata. All of the training samples are used as the unlabeled set. For each labeled set, we train four\nmodels using cuda-convnet. The \ufb01rst model uses labeled set only. The second model is trained on\nunlabeled set using mutual-exclusivity loss function in addition to the labeled set. The third model is\ntrained on the unlabeled set using the transformation/stability loss function in addition to the labeled\nset. The last model is also trained on both sets but combines two unsupervised loss functions. Each\nexperiment is repeated \ufb01ve times. For each repetition, we use a different subset of training samples as\nlabeled data. The cuda-convnet model consists of two convolutional layers with 64 maps and kernel\nsize of 5, two locally connected layers with 32 maps and kernel size 3. 
Each convolutional layer is\nfollowed by a max-pooling layer. A fully connected layer with 256 nodes is added before the last\nlayer. We use data augmentation for these experiments. T j(xi) of Eq. 1 crops every training sample\nto 28 \u00d7 28 for SVHN and 44 \u00d7 44 for NORB at random locations. T j(xi) also randomly rotates\ntraining samples up to \u00b120\u25e6. These transformations are applied to both labeled and unlabeled sets.\nThe results are shown in Figure 1 for SVHN and Figure 2 for NORB. Each point in the graph is the\nmean error rate of \ufb01ve repetitions. The error bars show the standard deviation of these \ufb01ve repetitions.\nAs expected, we can see that in all experiments the classi\ufb01cation accuracy is improved as we add\nmore labeled data. However, we observe that for each set of labeled data we can improve the results\nby using the proposed unsupervised loss functions. We can also see that when the number of labeled\nsamples is small, the improvement is more signi\ufb01cant. For example, when we use only 1% of labeled\ndata, we gain an improvement in accuracy of about 2.5 times by using unsupervised loss functions.\nAs we add more labeled samples, the difference in accuracy between semi-supervised and supervised\napproaches becomes smaller. Note that the combination of transformation/stability loss function and\nmutual-exclusivity loss function improves the accuracy even further. As mentioned earlier, these two\nunsupervised loss functions complement each other. Therefore, in most of the experiments we use\nthe combination of two unsupervised loss functions.\nWe perform another set of experiments on these two datasets using sparse convolutional networks as a\nstate-of-the-art classi\ufb01er. We create \ufb01ve sets of labeled data. For each set, we randomly pick a different\n1% subset of training samples as labeled set and all training data as unlabeled set. 
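The random transformations T^j(x_i) used in these experiments combine random crops with rotations up to ±20°. The crop component can be sketched as follows (a NumPy-only sketch with names of our choosing, not the cuda-convnet augmentation code):

```python
import numpy as np

def random_crop(img, out_h, out_w, rng):
    """One component of a T^j(x_i): crop at a uniformly random location,
    e.g. 32x32 -> 28x28 for SVHN or 48x48 -> 44x44 for NORB."""
    h, w = img.shape[:2]
    top = int(rng.integers(0, h - out_h + 1))
    left = int(rng.integers(0, w - out_w + 1))
    return img[top:top + out_h, left:left + out_w]
```

Calling this twice on the same image generally yields two different inputs, which is exactly the variation Eq. 1 penalizes at the prediction level.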
We train two models: the first trained only on labeled data, and the second using the labeled set and a combination of both unsupervised losses. Similarly, we train models using all available training data as both the labeled set and the unlabeled set. We do not use data augmentation for any of these experiments; in other words, T^j(x_i) of Eq. 1 is the identity function. As a result, dropout and random max-pooling are the only sources of variation in this case.

Figure 1: SVHN dataset: semi-supervised learning vs. training with labeled data only.
Figure 2: NORB dataset: semi-supervised learning vs. training with labeled data only.

We use the following model: (32kC2 − FMP∛2)12 − C2 − C1. Similar to MNIST, we use dropout to regularize the network; again, the ratio of dropout gradually increases from the first layer to the last layer. The results (average of five error rates) are shown in Table 2. Here, we can see that by using the unsupervised loss functions, which minimize the variation in the predictions of the network, we significantly improve the accuracy of the classifier. In addition, for the NORB dataset we observe that by using only 1% of the labeled data and applying the unsupervised loss functions, we can achieve accuracy close to the case when we use 100% of the labeled data.

Table 2: Error on test data for SVHN and NORB with 1% and 100% of data (mean % ± std).

                      SVHN                             NORB
                      1% of data      100% of data     1% of data      100% of data
labeled data only:    12.25 ± 0.80    2.28 ± 0.05      10.01 ± 0.81    1.63 ± 0.12
semi-supervised:      6.03 ± 0.62     2.22 ± 0.04      2.15 ± 0.37     1.63 ± 0.07

4.3 CIFAR10
CIFAR10 is a collection of 60000 tiny 32 × 32 images of 10 categories (50000 for training and 10000 for test). We use sparse convolutional networks to perform experiments on this dataset. For this dataset, we create 10 labeled sets. 
Each set contains 4000 samples randomly picked from the training set. All 50000 training samples are used as the unlabeled set. We train two sets of models on these data. The first set of models is trained on labeled data only, and the other set is trained on the unlabeled set using a combination of both unsupervised loss functions in addition to the labeled set. For this dataset, we do not perform separate experiments for the two unsupervised loss functions because of time constraints; however, based on the results from MNIST, SVHN and NORB, we deduce that the combination of both unsupervised losses provides improved accuracy. We use data augmentation for these experiments. Similar to [39], T^j(x_i) of Eq. 1 performs affine transformations, including a randomized mix of translations, rotations, flipping, stretching and shearing operations. Similar to [39], we train the network without transformations for the last 10 epochs. We use the following parameters for the models: (32kC2 − FMP∛2)12 − C2 − C1. We use dropout, and its ratio gradually increases from the first layer to the last layer. The results are given in Table 3. We also compare the results to ladder networks [28]; the model in [28] does not use data augmentation. We can see that the combination of unsupervised loss functions on unlabeled data improves the accuracy of the models. In another set of experiments, we use all available training data as both the labeled and unlabeled sets. We train a network with the following parameters: (96kC2 − FMP∛2)12 − C2 − C1. We use affine transformations for this task too. Here again, we use the transformation/stability plus mutual-exclusivity loss function. We repeat this experiment five times and achieve a mean error rate of 3.18% ± 0.1. 
Table 3: Error rates on test data for CIFAR10 with 4000 labeled samples (mean % ± std).

                      transformation/stability+mutual-exclusivity    ladder networks [28]
labeled data only:    13.60 ± 0.24                                   23.33 ± 0.61
semi-supervised:      11.29 ± 0.24                                   20.40 ± 0.47

The state-of-the-art error rate for this dataset is 3.47%, achieved by the fractional max-pooling method [39] but obtained with a larger model (160n vs. 96n). We perform a single-run experiment with the 160n model and achieve an error rate of 3.00%. Similar to [39], we perform 100 passes at test time. Here, we surpass state-of-the-art accuracy by adding the unsupervised loss functions.

4.4 CIFAR100
CIFAR100 is also a collection of 60000 tiny images of size 32 × 32. This dataset is similar to CIFAR10; however, it contains images of 100 categories compared to 10, so there is a smaller number of training samples per category. Similar to CIFAR10, we perform experiments on this dataset using sparse convolutional networks. We use all available training data as both the labeled and unlabeled sets. The state-of-the-art error rate for this dataset is 23.82%, obtained by fractional max-pooling [39] on sparse convolutional networks with the following model: (96kC2 − FMP∛2)12 − C2 − C1. Dropout was also used, with a ratio increasing from the first layer to the last layer. We use the same model parameters and add the transformation/stability plus mutual-exclusivity loss function. 
Similar to [39], we do not use data augmentation for this\ntask (T j(xi) of Eq. 1 is identity function). Therefore, the proposed loss function minimizes the\nrandomness effect due to dropout and max-pooling. We achieve 21.43% \u00b1 0.16 mean and standard\ndeviation error rate, which is the state-of-the-art for this task. We perform 12 passes during the test\ntime similar to [39].\n\n4.5\n\nImageNet\n\nWe perform experiments on the ILSVRC 2012 challenge. The training data consists of 1281167\nnatural images of different sizes from 1000 categories. We create \ufb01ve labeled datasets from available\ntraining samples. Each dataset consists of 10% of training data. We form each dataset by randomly\npicking a subset of training samples. All available training data is used as the unlabeled set. We use\ncuda-convnet to train AlexNet model [7] for this dataset. Similar to [7], all images are re-sized to\n256 \u00d7 256. We also use data augmentation for this task following steps of [7], i.e., T j(xi) of Eq. 1\nperforms random translations, \ufb02ipping and color noise. We train two models on each labeled dataset.\nOne model is trained using labeled data only. The other model is trained on both labeled and unlabeled\nset using the transformation/stability plus mutual-exclusivity loss function. At each iteration, we\ngenerate four different transformed versions of each unlabeled sample. So, each unlabeled sample\nis forward passed through the network four times. Since we use all training data as unlabeled set,\nthe computational cost of each iteration is roughly quadrupled. But, in practice we found that when\nwe use 10% of training data as labeled set, the network converges in 20 epochs instead of standard\n90 epochs of AlexNet model. So, overall cost of our method for ImageNet is less than or equal to\nAlexNet. The results on validation set are shown in Table 4. 
We also compare the results to the model trained with the mutual-exclusivity loss function only, as reported in [30]. We can see that even for a large dataset with many categories, the proposed unsupervised loss function improves the classification accuracy. The error rate of a single AlexNet model on the validation set of ILSVRC 2012 using all training data is 18.2% [7].

Table 4: Error rates (%) on validation set for ILSVRC 2012 (Top-5).

                rep 1   rep 2   rep 3   rep 4   rep 5   mean ± std     mutual-xcl [30]
labeled only:   45.73   46.15   46.06   45.57   46.08   45.91 ± 0.25   45.63
semi-sup:       39.50   39.99   39.94   39.70   40.08   39.84 ± 0.23   42.90

5 Discussion

We can see that the proposed loss function improves the accuracy of a ConvNet regardless of the architecture and implementation: we improve the accuracy of two relatively different implementations of ConvNets, cuda-convnet and sparse convolutional networks. For SVHN and NORB, we do not use dropout or randomized pooling in the experiments performed with cuda-convnet. Therefore, the only source of variation between different passes of a sample through the network is the random transformations (translation and rotation). For the experiments performed with sparse convolutional networks on these two datasets, we do not use data transformations; instead, we use dropout and randomized pooling. Based on the results, we can see that in both cases we significantly improve the accuracy when we have a small number of labeled samples. For CIFAR100, we achieve a state-of-the-art error rate of 21.43% by taking advantage of the variations caused by dropout and randomized pooling. In the ImageNet and CIFAR10 experiments, we use both data transformations and dropout. For CIFAR10, we also have randomized pooling and achieve the state-of-the-art error rate of 3.00%.
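For the cuda-convnet runs discussed above, the only per-pass stochasticity is a small random translation and rotation of the input. Producing the differing versions T^j(x_i) of one sample can be sketched with SciPy as below; the shift and angle ranges are illustrative assumptions, not the paper's exact settings:

```python
import numpy as np
from scipy.ndimage import rotate, shift

rng = np.random.default_rng(42)

def random_transform(img, max_shift=2.0, max_angle=10.0):
    """One T^j(x_i): a small random translation followed by a small
    random rotation of the image (illustrative parameter ranges)."""
    dy, dx = rng.uniform(-max_shift, max_shift, size=2)
    angle = rng.uniform(-max_angle, max_angle)
    out = shift(img, (dy, dx), order=1, mode="nearest")
    return rotate(out, angle, reshape=False, order=1, mode="nearest")

# Four stochastic versions of one 32x32 image, one per forward pass:
x = rng.random((32, 32))
versions = [random_transform(x) for _ in range(4)]
```

Because each call draws fresh shift and angle values, the four versions differ, which is exactly the variation the transformation/stability loss penalizes at the output.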
In the MNIST experiments with 100 labeled samples and the NORB experiments with 1% of labeled data, we achieve accuracy reasonably close to that obtained with all available training data by applying the mutual-exclusivity loss and minimizing the difference in the predictions of multiple passes caused by dropout and randomized pooling.

6 Conclusion

In this paper, we proposed an unsupervised loss function that minimizes the variations between different passes of a sample through the network caused by non-deterministic transformations and randomized dropout and max-pooling schemes. We evaluated the proposed method using two ConvNet implementations on multiple benchmark datasets. We showed that it is possible to achieve significant improvements in accuracy by using the transformation/stability loss function along with the mutual-exclusivity loss of [30] when only a small amount of labeled data is available.

Acknowledgments

This work was supported by NSF IIS-1149299.

References

[1] Y. Le Cun, B. Boser, J. S. Denker, D. Henderson, R. E. Howard, W. Hubbard, and L. D. Jackel, “Handwritten digit recognition with a back-propagation network,” in Advances in neural information processing systems, Citeseer, 1990.

[2] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, “Gradient-based learning applied to document recognition,” Proceedings of the IEEE, vol. 86, no. 11, pp. 2278–2324, 1998.

[3] P. Sermanet, D. Eigen, X. Zhang, M. Mathieu, R. Fergus, and Y. LeCun, “Overfeat: Integrated recognition, localization and detection using convolutional networks,” arXiv preprint arXiv:1312.6229, 2013.

[4] J. Long, E. Shelhamer, and T. Darrell, “Fully convolutional networks for semantic segmentation,” in Computer Vision and Pattern Recognition, pp. 3431–3440, 2015.

[5] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M.
Bernstein, et al., “Imagenet large scale visual recognition challenge,” International Journal of Computer Vision, vol. 115, no. 3, pp. 211–252, 2015.

[6] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich, “Going deeper with convolutions,” in Computer Vision and Pattern Recognition, pp. 1–9, 2015.

[7] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classification with deep convolutional neural networks,” in Advances in neural information processing systems, pp. 1097–1105, 2012.

[8] G. E. Hinton, N. Srivastava, A. Krizhevsky, I. Sutskever, and R. R. Salakhutdinov, “Improving neural networks by preventing co-adaptation of feature detectors,” arXiv preprint arXiv:1207.0580, 2012.

[9] D. Ciresan, U. Meier, and J. Schmidhuber, “Multi-column deep neural networks for image classification,” in Computer Vision and Pattern Recognition, pp. 3642–3649, IEEE, 2012.

[10] A. Blum and T. Mitchell, “Combining labeled and unlabeled data with co-training,” in Proceedings of the eleventh annual conference on Computational learning theory, pp. 92–100, ACM, 1998.

[11] V. R. de Sa, “Learning classification with unlabeled data,” in Advances in neural information processing systems, pp. 112–119, 1994.

[12] D. J. Miller and H. S. Uyar, “A mixture of experts classifier with learning based on both labelled and unlabelled data,” in Advances in neural information processing systems, pp. 571–577, 1997.

[13] T. Joachims, “Transductive inference for text classification using support vector machines,” in ICML, vol. 99, pp. 200–209, 1999.

[14] K. Bennett, A. Demiriz, et al., “Semi-supervised support vector machines,” in Advances in neural information processing systems, pp. 368–374, 1999.

[15] A. Blum and S.
Chawla, “Learning from labeled and unlabeled data using graph mincuts,” 2001.

[16] X. Zhu, Z. Ghahramani, J. Lafferty, et al., “Semi-supervised learning using Gaussian fields and harmonic functions,” in International Conference on Machine Learning, vol. 3, pp. 912–919, 2003.

[17] X. Zhu and Z. Ghahramani, “Learning from labeled and unlabeled data with label propagation,” tech. rep., Citeseer, 2002.

[18] Y. LeCun, K. Kavukcuoglu, C. Farabet, et al., “Convolutional networks and applications in vision,” in ISCAS, pp. 253–256, 2010.

[19] K. Jarrett, K. Kavukcuoglu, M. Ranzato, and Y. LeCun, “What is the best multi-stage architecture for object recognition?,” in International Conference on Computer Vision, pp. 2146–2153, IEEE, 2009.

[20] K. Kavukcuoglu, M. Ranzato, and Y. LeCun, “Fast inference in sparse coding algorithms with applications to object recognition,” arXiv preprint arXiv:1010.3467, 2010.

[21] P. Agrawal, J. Carreira, and J. Malik, “Learning to see by moving,” in International Conference on Computer Vision, pp. 37–45, 2015.

[22] C. Doersch, A. Gupta, and A. A. Efros, “Unsupervised visual representation learning by context prediction,” in International Conference on Computer Vision, pp. 1422–1430, 2015.

[23] X. Glorot and Y. Bengio, “Understanding the difficulty of training deep feedforward neural networks,” in International conference on artificial intelligence and statistics, pp. 249–256, 2010.

[24] R. Johnson and T. Zhang, “Semi-supervised convolutional neural networks for text categorization via region embedding,” in Advances in Neural Information Processing Systems, pp. 919–927, 2015.

[25] J. Weston, F. Ratle, H. Mobahi, and R. Collobert, “Deep learning via semi-supervised embedding,” in Neural Networks: Tricks of the Trade, pp.
639\u2013655, Springer, 2012.\n\n[26] X. Wang and A. Gupta, \u201cUnsupervised learning of visual representations using videos,\u201d in International\n\nConference on Computer Vision, pp. 2794\u20132802, 2015.\n\n[27] D. Jayaraman and K. Grauman, \u201cLearning image representations tied to ego-motion,\u201d in International\n\nConference on Computer Vision, pp. 1413\u20131421, 2015.\n\n[28] A. Rasmus, M. Berglund, M. Honkala, H. Valpola, and T. Raiko, \u201cSemi-supervised learning with ladder\n\nnetworks,\u201d in Advances in Neural Information Processing Systems, pp. 3532\u20133540, 2015.\n\n[29] A. Dosovitskiy, J. T. Springenberg, M. Riedmiller, and T. Brox, \u201cDiscriminative unsupervised feature\nlearning with convolutional neural networks,\u201d in Advances in Neural Information Processing Systems,\npp. 766\u2013774, 2014.\n\n[30] M. Sajjadi, M. Javanmardi, and T. Tasdizen, \u201cMutual exclusivity loss for semi-supervised deep learning,\u201d\n\nin International Conference on Image Processing, IEEE, 2016.\n\n[31] P. Y. Simard, Y. A. LeCun, J. S. Denker, and B. Victorri, \u201cTransformation invariance in pattern recogni-\ntion\u2014tangent distance and tangent propagation,\u201d in Neural networks: tricks of the trade, pp. 239\u2013274,\nSpringer, 1998.\n\n[32] D. Jayaraman and K. Grauman, \u201cSlow and steady feature analysis: higher order temporal coherence in\n\nvideo,\u201d Computer Vision and Pattern Recognition, 2016.\n\n[33] L. Sun, K. Jia, T.-H. Chan, Y. Fang, G. Wang, and S. Yan, \u201cDl-sfa: deeply-learned slow feature analysis for\n\naction recognition,\u201d in Computer Vision and Pattern Recognition, pp. 2625\u20132632, 2014.\n\n[34] A. Krizhevsky and G. Hinton, \u201cLearning multiple layers of features from tiny images,\u201d 2009.\n[35] Y. Netzer, T. Wang, A. Coates, A. Bissacco, B. Wu, and A. Y. 
Ng, “Reading digits in natural images with unsupervised feature learning,” in NIPS workshop on deep learning and unsupervised feature learning, vol. 2011, p. 4, Granada, Spain, 2011.

[36] Y. LeCun, F. J. Huang, and L. Bottou, “Learning methods for generic object recognition with invariance to pose and lighting,” in Computer Vision and Pattern Recognition, vol. 2, pp. II–97, IEEE, 2004.

[37] A. Krizhevsky, “Cuda-convnet.” code.google.com/p/cuda-convnet, 2014.

[38] B. Graham, “Spatially-sparse convolutional neural networks,” arXiv preprint arXiv:1409.6070, 2014.

[39] B. Graham, “Fractional max-pooling,” arXiv preprint arXiv:1412.6071, 2014.

[40] J.-R. Chang and Y.-S. Chen, “Batch-normalized maxout network in network,” arXiv preprint arXiv:1511.02583, 2015.