{"title": "Humans Learn Using Manifolds, Reluctantly", "book": "Advances in Neural Information Processing Systems", "page_first": 730, "page_last": 738, "abstract": "When the distribution of unlabeled data in feature space lies along a manifold, the information it provides may be used by a learner to assist classification in a semi-supervised setting. While manifold learning is well-known in machine learning, the use of manifolds in human learning is largely unstudied. We perform a set of experiments which test a human's ability to use a manifold in a semi-supervised learning task, under varying conditions. We show that humans may be encouraged into using the manifold, overcoming the strong preference for a simple, axis-parallel linear boundary.", "full_text": "Humans Learn Using Manifolds, Reluctantly\n\nBryan R. Gibson, Xiaojin Zhu, Timothy T. Rogers\u2217, Charles W. Kalish\u2020, Joseph Harrison\u2217\n\nDepartment of Computer Sciences, \u2217Psychology, and \u2020Educational Psychology\n\nUniversity of Wisconsin-Madison, Madison, WI 53706 USA\n\n{bgibson, jerryzhu}@cs.wisc.edu\n\n{ttrogers, cwkalish, jcharrison}@wisc.edu\n\nAbstract\n\nWhen the distribution of unlabeled data in feature space lies along a manifold,\nthe information it provides may be used by a learner to assist classi\ufb01cation in\na semi-supervised setting. While manifold learning is well-known in machine\nlearning, the use of manifolds in human learning is largely unstudied. We perform\na set of experiments which test a human\u2019s ability to use a manifold in a semi-\nsupervised learning task, under varying conditions. We show that humans may\nbe encouraged into using the manifold, overcoming the strong preference for a\nsimple, axis-parallel linear boundary.\n\n1\n\nIntroduction\n\nConsider a classi\ufb01cation task where a learner is given training items x1, . . . , xl \u2208 Rd, represented by\nd-dimensional feature vectors. 
The learner is also given the corresponding class labels y_1, . . . , y_l \u2208 Y. In this paper, we focus on binary labels, Y = {\u22121, 1}. In addition, the learner is given some unlabeled items x_{l+1}, . . . , x_{l+u} \u2208 R^d without the corresponding labels. Importantly, the labeled and unlabeled items x_1, . . . , x_{l+u} are distributed in a peculiar way in the feature space: they lie on smooth, lower-dimensional manifolds, such as those schematically shown in Figure 1(a). The question is: given this knowledge of labeled and unlabeled data, how will the learner classify x_{l+1}, . . . , x_{l+u}? Will the learner ignore the distribution information of the unlabeled data, and simply use the labeled data to form a decision boundary as in Figure 1(b)? Or will the learner propagate labels along the nonlinear manifolds as in Figure 1(c)?\n\nFigure 1: On a dataset with manifold structure, supervised learning and manifold learning make dramatically different predictions. Large symbols represent labeled items, dots unlabeled items. Panels: (a) the data; (b) supervised learning; (c) manifold learning.\n\nWhen the learner is a machine learning algorithm, this question has been addressed by semi-supervised learning [2, 11]. The designer of the algorithm can choose to make the manifold assumption, also known as graph-based semi-supervised learning, which states that the labels vary slowly along the manifolds or the discrete graph formed by connecting nearby items. Consequently, the learning algorithm will predict Figure 1(c). The mathematics of manifold learning is well understood [1, 6, 9, 10]. Alternatively, the designer can choose to ignore the unlabeled data and perform supervised learning, which results in Figure 1(b).\n\nWhen the learner is a human being, however, the answer is not so clear. 
Consider that the human learner does not directly see how the items are distributed in the feature space (such as Figure 1(a)), but only a set of items (such as those in Figure 2(a)). The underlying manifold structure of the data may not be immediately obvious. Thus there are many possibilities for how the human learner will behave: 1) They may completely ignore the manifold structure and perform supervised learning; 2) They may discover the manifold under some learning conditions and not others; or 3) They may always learn using the manifold.\n\nFor readers not familiar with manifold learning, the setting might seem artificial. But in fact, many natural stimuli we encounter in everyday life are distributed on manifolds. An important example is face recognition, where different poses (viewing angles) of the same face produce different 2D images. These images can be quite different, as in the frontal and profile views of a person. However, if we continuously change the viewing angle, these 2D images will form a one-dimensional manifold in a very high dimensional image space. This example illustrates the importance of a manifold to facilitate learning: if we can form and maintain such a face manifold, then with a single label (e.g., the name) on one of the face images, we can recognize all other poses of that person by propagating the label along the manifold. The same is true for visual object recognition in general. Other more abstract stimuli form manifolds, or the discrete analogue, graphs. For example, text documents in a corpus occupy a potentially nonlinear manifold in the otherwise very high dimensional space used to represent them, such as the \u201cbag of words\u201d representation.\n\nThere exists little empirical evidence addressing the question of whether human beings can learn using manifolds when classifying objects, and the few studies we are aware of come to opposing conclusions. 
For instance, Wallis and B\u00fclthoff created artificial image sequences where a frontal face is morphed into the profile face of a different person. When participants were shown such sequences during training, their ability to match frontal and profile faces during testing was impaired [8]. This might be evidence that people depend on manifold structure stemming from temporal and spatial proximity to perform face recognition. On the other hand, Vandist et al. conducted a categorization experiment where the true decision boundary is at 45 degrees in a 2D stimulus space (i.e., an information integration task). They showed that when the two classes are elongated Gaussians, which are parallel to, and on opposite sides of, the decision boundary, unlabeled data does not help learning [7]. If we view these two elongated Gaussians as linear manifolds, this result suggests that people do not generally learn using manifolds.\n\nThis study seeks to understand under what conditions, if any, people are capable of manifold learning in a semi-supervised setting. The study has important implications for cognitive psychology: first, if people are capable of learning manifolds, this suggests that manifold-learning models that have been developed in machine learning can provide hypotheses about how people categorize objects in natural domains like face recognition, where manifolds appear to capture the true structure of the domain. Second, if there are reliable methods for encouraging manifold learning in people, these methods can be employed to aid learning in other domains that are structured along manifolds. For machine learning, our study will help in the design of algorithms which can decide when to invoke the manifold learning assumption.\n\n2 Human Manifold Learning Experiments\n\nWe designed and conducted a set of experiments to study manifold learning in humans, with the following design considerations. 
First, the task was a \u201cbatch learning\u201d paradigm in which participants viewed all labeled and unlabeled items at once (in contrast to an \u201conline\u201d or sequential learning paradigm where items appear one at a time). Batch learning allows us to compare human behavior against well-established machine learning models that typically operate in batch mode. Second, we avoided using faces or familiar 3D objects as stimuli, despite their natural manifold structures as discussed above, because we wished to avoid any bias resulting from strong prior real-world knowledge. Instead, we used unfamiliar stimuli, from which we could add or remove a manifold structure easily. This design should allow our experiments to shed light on people\u2019s intrinsic ability to learn using a manifold.\nParticipants and Materials. In the first set of experiments, 139 university undergraduates participated for partial course credit. A computer interface was created to represent a table with three bins, as shown in Figure 2(a). Unlabeled cards were initially placed in a central white bin, with bins to either side colored red and blue to indicate the two classes y \u2208 {\u22121, 1}. Each stimulus is a card. Participants sorted cards by clicking and dragging with a mouse. When a card was clicked, other similar cards could be \u201chighlighted\u201d in gray (depending on condition). Labeled cards were pinned down in their respective red or blue bins and could not be moved, indicated by a \u201cpin\u201d in the corner of the card. The layout of the cards was such that all cards remained visible at all times. Unlabeled cards could be re-categorized at any time by dragging from any bin to any other bin. Upon sorting all cards, participants would click a button to indicate completion.\n\nTwo sets of stimuli were created. 
The first, used solely to acquaint the participants with the interface, consisted of a set of 20 cards with animal line drawings on a white background. The images were chosen to approximate a linear continuum between fish and mammal, with shark, dolphin, and whale at the center. The second set of stimuli, used for the actual experiment, was composed of 82 \u201ccrosshair\u201d cards, each with a pair of perpendicular, axis-parallel lines, all of equal length, crossing on a white background. Four examples are shown in Figure 2(b). Each card can therefore be encoded as x \u2208 [0, 1]^2, whose two features represent the positions of the vertical and horizontal lines, respectively.\n\nFigure 2: Experimental interface (with highlighting shown), and example crosshair stimuli. (a) Card sorting interface. (b) x1 = (0, 0.1), x2 = (1, 0.9), x3 = (0.39, 0.41), x4 = (0.61, 0.59).\n\nProcedure. Each participant was given two tasks to complete.\n\nTask 1 was a practice task to familiarize the participant with the interface. The participant was asked to sort the set of 20 animal cards into two categories, with the two ends of the continuum (a clown fish and a dachshund) labeled. Participants were told that when they clicked on a card, highlighting of similar cards might occur. In reality, highlighting was always shown for the two nearest-neighboring cards (on the defined continuum) of a clicked card. Importantly, we designed the dataset so that, near the middle of the continuum, cards from opposite biological classes would be highlighted together. For example, when a dolphin was clicked, both a shark and a whale would be highlighted. The intention was to indicate to the participant that highlighting is not always a clear give-away for class labels. At the end of task 1 their fish vs. mammal classification accuracy was presented. 
No time limit was enforced.\n\nTask 2 asked the participant to sort a set of 82 crosshair cards into two categories. The set of cards, the number of labeled cards, and the highlighting of cards depended on condition. The participant was again told that some cards might be highlighted, whether the condition actually provided for highlighting or not. The participant was also told that cards that shared highlighting may not all have the same classification. Again, no time limit was enforced. After they completed this task, a follow-up questionnaire was administered.\nConditions. Each of the 139 participants was randomly assigned to one of 6 conditions, shown in Figure 3, which varied according to three manipulations:\nThe number of labeled items l can be 2 or 4 (2l vs. 4l). For conditions with two labeled items, the labeled items are always (x1, y1 = \u22121), (x2, y2 = 1); with four labeled items, they are always (x1, y1 = \u22121), (x2, y2 = 1), (x3, y3 = 1), (x4, y4 = \u22121). The features of x1 . . . x4 are those given in Figure 2(b). We chose these four labeled points by maximizing the prediction differences made by seven machine learning models, as discussed in the next section.\nUnlabeled items are distributed on a uniform grid or manifolds (gridU vs. moonsU). The items x5 . . . x82 were either on a uniform grid in the 2D feature space, or along two \u201chalf-moons\u201d, which is a well-studied dataset in the semi-supervised learning community. No linear boundary can separate the two moons in feature space. x3 and x4, if unlabeled, are the same as in Figure 2(b).\nHighlighting similar items or not (the suffix h). For the moonsU conditions, the neighboring cards of any clicked card may be highlighted. The neighborhood is defined as within a radius of \u03b5 = 0.07 in the Euclidean feature space. This value was chosen as it includes at least two neighbors for each point in the moonsU dataset. 
To form the unweighted graph shown in Figure 3, an edge is placed between all neighboring points.\n\nThe rationale for comparing these different conditions will become apparent as we consider how different machine-learning models perform on these datasets.\n\nFigure 3: The six experimental conditions. Large symbols indicate labeled items, dots unlabeled items. Highlighting is represented as graph edges. Panels (participants): 2lgridU (8), 2lmoonsU (8), 2lmoonsUh (8), 4lgridU (22), 4lmoonsU (24), 4lmoonsUh (23).\n\n3 Model Predictions\n\nWe hypothesize that human participants consider a set of models ranging from simple to sophisticated, and that they will perform model selection based on the training data given to them. We start by considering seven typical machine learning models to motivate our choice, and present the models we actually use later on. The seven models are:\n(graph) Graph-based semi-supervised learning [1, 10], which propagates labels along the graph. It reverts to supervised learning when there is no graph (i.e., no highlighting).\n(1NN,\u21132) 1-nearest-neighbor classifier with \u21132 (Euclidean) distance.\n(1NN,\u21131) 1-nearest-neighbor classifier with \u21131 (Manhattan) distance. These two models are similar to exemplar models in psychology [3].\n(multi-v) multiple vertical linear boundaries.\n(multi-h) multiple horizontal linear boundaries.\n(single-v) a single vertical linear boundary.\n(single-h) a single horizontal linear boundary.\nWe plot the label predictions by these 7 models on four of the six conditions in Figure 4. Their predictions on 2lmoonsU are identical to 2lmoonsUh, and on 4lmoonsU are identical to 4lmoonsUh, except that \u201c(graph)\u201d is not available.\nFor conceptual simplicity and elegance, instead of using these disparate models we adopt a single model capable of making all these predictions. In particular, we use a Gaussian Process (GP) with different kernels (i.e., covariance functions) k to simulate the seven models. For details on GPs see standard textbooks such as [4]. In particular, we find seven different kernels k to match GP classification to each of the seven model predictions on all 6 conditions. This is somewhat unusual in that our GPs are not learned from data, but by matching other model predictions. Nonetheless, it is a valid procedure to create seven different GPs which will later be compared against human data. For models (1NN,\u21132), (multi-v), (multi-h), (single-v), and (single-h), we use diagonal RBF kernels diag(\u03c3_1^2, \u03c3_2^2) and tune \u03c3_1, \u03c3_2 on a coarse parameter grid to minimize classification disagreement w.r.t. the corresponding model prediction on all 6 conditions. For model (1NN,\u21131) we use a Laplace kernel and tune its bandwidth. For model (graph), we produce a graph kernel k\u0303 following the Reproducing Kernel Hilbert Space trick in [6]. That is, we extend a base RBF kernel k with a graph component:\n\nk\u0303(x, z) = k(x, z) \u2212 k_x^\u22a4 (I + cLK)^{\u22121} c L k_z    (1)\n\nwhere x, z are two arbitrary items (not necessarily on the graph), k_x = (k(x, x_1), . . . , k(x, x_{l+u}))^\u22a4 is the kernel vector between x and all l + u points x_1, . . . , x_{l+u} in the graph, K is the (l + u) \u00d7 (l + u) Gram matrix with K_ij = k(x_i, x_j), L is the unnormalized graph Laplacian matrix derived from unweighted edges on the \u03b5NN graph defined earlier for highlighting, and c is the parameter that we tune. We take the base RBF kernel k to be the tuned kernel for model (1NN,\u21132). It can be shown that k\u0303 is a valid kernel formed by warping the base kernel k along the graph, see [6] for technical details. We used the GP classification implementation with Expectation Propagation approximation [5].\n\nIn the end, our seven GPs were able to exactly match the predictions made by the seven models in Figure 4. We will use these GPs in the rest of the paper.\n\nFigure 4: Predictions made by the seven models on 4 of the 6 conditions. Columns: (graph), (1NN,\u21132), (1NN,\u21131), (multi-v), (multi-h), (single-v), (single-h). Rows: 2lgridU, 2lmoonsUh, 4lgridU, 4lmoonsUh.\n\n4 Behavioral Experiment Results\n\nWe now compare human categorization behaviors to model predictions. 
We first consider the aggregate behavior for all participants within each condition. One way to characterize this aggregate behavior is the \u201cmajority vote\u201d of the participants on each item. That is, if more than half of the participants classified an item as y = 1, the majority vote classification for that item is y = 1, and so on. The first row in Figure 5 shows the majority vote for each condition. In these and all further plots, blue circles indicate y = \u22121, red pluses y = 1, and green stars ambiguous, meaning the classification into positive or negative is half-half. We also compute how well the seven GPs predict human majority votes. The accuracies of these GP models are shown in Table 1 (see footnote 1).\n\nFigure 5: Human categorization results. (First row) the majority vote of participants within each condition. (Bottom three rows) a sample of responses from 18 different participants. Columns: 2lgridU, 2lmoonsU, 2lmoonsUh, 4lgridU, 4lmoonsU, 4lmoonsUh.\n\nOf course, a majority vote only reveals average behavior. We have observed wide variability across participants. 
Participants appeared to find the tasks difficult, as their self-reported confidence scores were fairly low in all conditions. It was also noted that strategies for completing the task varied widely, with some participants simply categorizing cards in the order they appeared on the screen, while others took a much longer, studied approach.\n\ncondition | (graph) | (1NN,\u21132) | (1NN,\u21131) | (multi-v) | (multi-h) | (single-v) | (single-h)\n2lgridU | 0.81 | 0.94 | 0.84 | 0.86 | 0.58 | 0.85 | 0.61\n2lmoonsU | 0.47 | 0.84 | 0.62 | 0.74 | 0.42 | 0.79 | 0.45\n2lmoonsUh | 0.50 | 0.78 | 0.56 | 0.76 | 0.36 | 0.76 | 0.39\n4lgridU | 0.54 | 0.61 | 0.64 | 0.64 | 0.50 | 0.60 | 0.51\n4lmoonsU | 0.64 | 0.62 | 0.60 | 0.69 | 0.47 | 0.38 | 0.45\n4lmoonsUh | 0.97 | 0.76 | 0.54 | 0.64 | 0.31 | 0.65 | 0.26\n4lmoonsUhR | 0.68 | 0.63 | 0.44 | 0.56 | 0.40 | 0.59 | 0.42\n\nTable 1: GP model accuracy in predicting human majority vote for each condition.\n\nFootnote 1: The condition 4lmoonsUhR will be explained later in Section 5.\n\nMost interestingly, different participants seem to use different models, as the individual participant plots in the bottom three rows of Figure 5 suggest. We would like to be able to make a claim about what model, from our set of models, each participant used for classification. In order to do this, we compute per-participant accuracies of the seven models on that participant\u2019s classification. We then find the model M with the highest accuracy for the participant, out of the seven models. If this highest accuracy is above 0.75, we declare that the participant is potentially using model M; otherwise no model is deemed a good fit and we say the participant is using some \u201cother\u201d model. 
We show the proportion of participants in each condition attributed to each of our seven models, plus \u201cother\u201d, in Table 2.\n\ncondition | (graph) | (1NN,\u21132) | (1NN,\u21131) | (multi-v) | (multi-h) | (single-v) | (single-h) | other\n2lgridU | 0.12 | 0.00 | 0.12 | 0.25 | 0.25 | 0.12 | 0.00 | 0.12\n2lmoonsU | 0.00 | 0.12 | 0.00 | 0.25 | 0.25 | 0.25 | 0.00 | 0.12\n2lmoonsUh | 0.12 | 0.00 | 0.00 | 0.38 | 0.25 | 0.00 | 0.00 | 0.25\n4lgridU | 0.00 | 0.05 | 0.09 | 0.00 | 0.00 | 0.18 | 0.09 | 0.59\n4lmoonsU | 0.25 | 0.25 | 0.12 | 0.12 | 0.00 | 0.04 | 0.08 | 0.38\n4lmoonsUh | 0.39 | 0.09 | 0.09 | 0.04 | 0.04 | 0.00 | 0.13 | 0.22\n4lmoonsUhR | 0.13 | 0.03 | 0.07 | 0.00 | 0.00 | 0.07 | 0.03 | 0.67\n\nTable 2: Proportion of participants potentially using each model.\n\nBased on Figure 5, Table 1, and Table 2, we make some observations:\n1. When there are only two labeled points, the unlabeled distribution does not encourage humans to perform manifold learning (comparing 2lgridU vs. 2lmoonsU). That is, they do not follow the possible implicit graph structure (2lmoonsU). Instead, in both conditions they prefer a simple single vertical or horizontal decision boundary, as Table 2 shows (see footnote 2).\n2. With two labeled points, even if they are explicitly given the graph structure in the form of highlighting, participants still do not perform manifold learning (comparing 2lmoonsU vs. 2lmoonsUh). It seems they are \u201cblocked\u201d by the simpler vertical or horizontal hypothesis, which perfectly explains the labeled data.\n3. When there are four labeled points but no highlighting, the distribution of unlabeled data still does not encourage people to perform manifold learning (comparing 4lgridU vs. 4lmoonsU). This further suggests that people cannot easily extract manifold structure from unlabeled data in order to learn, when there is no hint to do so. However, most participants have given up the simple single vertical or horizontal decision boundary, because it contradicts the four labeled points.\n4. 
Finally, when we provide the graph structure, there is a marked switch to manifold learning (comparing 4lmoonsU vs. 4lmoonsUh). This suggests that a combination of the elimination of preferred, simpler hypotheses, together with a stronger graph hint, finally gives the originally less preferred manifold learning model a chance of being used. It is under this condition that we observed human manifold learning behavior.\n\nFootnote 2: The two rows in Table 1 for these two conditions are therefore misleading, as they average classifications made with vertical and horizontal decision boundaries. Also note that in the 2l conditions (multi-v) and (multi-h) are effectively single linear boundary models (see Figure 4) and differ from (single-v) and (single-h) only slightly due to the training method used.\n\n5 Humans do not Blindly Follow the Highlighting\n\nDo humans really learn using manifolds? Could they have adopted a \u201cfollow-the-highlighting\u201d procedure to label the manifolds 100% correctly: in the beginning, click on a labeled card x to highlight its neighboring unlabeled cards; pick one such neighbor x\u2032 and classify it with the label of x; now click on (the now labeled) x\u2032 to find one of its unlabeled neighbors x\u2032\u2032, and repeat? Because our graph has disconnected components with consistently labeled seeds, this procedure will succeed. The procedure is known as propagating-1NN in semi-supervised learning (Algorithm 2.7, [11]). In this section we present three arguments that humans are not blindly following the highlighting.\n\nFirst, participants in 2lmoonsUh did not learn the manifold while those in 4lmoonsUh did, even though the two conditions have the same \u03b5NN highlighting.\n\nSecond, a necessary condition for follow-the-highlighting is to always classify an unlabeled x\u2032 according to a labeled highlighted neighbor x. 
Conversely, if a participant classifies x\u2032 as class y\u2032, while all neighbors of x\u2032 are either still unlabeled or have labels other than y\u2032, she could not have been using follow-the-highlighting on x\u2032. We say she has taken a leap-of-faith on x\u2032. The 4lmoonsUh participants had an average of 17 leaps-of-faith among about 78 classifications (see footnote 3), while a strict follow-the-highlighting procedure would yield zero leaps-of-faith.\n\nThird, the basic challenge of follow-the-highlighting is that the underlying manifold structure of the stimuli may have been irrelevant. Would participants have shown the same behavior, following the highlighting, regardless of the actual stimuli? We therefore designed the following experiment. Take the 4lmoonsUh graph which has 4 labeled nodes, 78 unlabeled nodes, and an adjacency matrix (i.e., edges) defined by \u03b5NN, as shown in Figure 3. Take a random permutation \u03c0 = (\u03c0_1, . . . , \u03c0_78). Map the feature vector of the ith unlabeled point to x_{\u03c0_i}, while keeping the adjacency matrix the same. This creates the random-looking graph in Figure 6(a), which we call the 4lmoonsUhR condition (the suffix R stands for random), and which is equivalent to the 4lmoonsUh graph in structure. In particular, there are two connected components with consistent labeled seeds. However, now the highlighted neighbors may look very different from the clicked card.\n\nIf we assume humans blindly follow the highlighting (perhaps noisily), then we predict that they are more likely to classify those unlabeled points nearer (in shortest path length on the graph, not Euclidean distance) a labeled point with the latter\u2019s label; and that this correlation should be the same under 4lmoonsUhR and 4lmoonsUh. This prediction turns out to be false. 30 additional undergraduates participated in the new 4lmoonsUhR condition. 
Figure 6(b) shows the above behavioral evaluation, which does not exhibit the predicted correlation, and is clearly different from the same evaluation for 4lmoonsUh in Figure 6(c). Again, this is evidence that humans are not just following the highlighting. In fact, human behavior in 4lmoonsUhR is similar to 4lmoonsU. That is, having random highlighting is similar to having no highlighting in how it affects human categorization. This can be seen from the last rows of Tables 1 and 2, and Figure 6(d) (see footnote 4).\n\n6 Discussion\n\nWe have presented a set of experiments exploring human manifold learning behaviors. Our results suggest that people can perform manifold learning, but only when there is no alternative, simpler explanation of the data, and people need strong hints about the graph structure.\n\nWe propose that Bayesian model selection is one possible way to explain these human behaviors. Recall we defined seven Gaussian Processes, each with a different kernel. For a given GP with kernel k, the evidence p(y1:l | x1:l, k) is the marginal likelihood on labeled data, integrating out the hidden discriminant function sampled from the GP. With multiple candidate GP models, one may perform model selection by selecting the one with the largest marginal likelihood. 
From the absence of manifold learning in conditions without highlighting or with random highlighting, we speculate that the GP with the graph-based kernel k\u0303 (1) is special: it is accessible in a participant\u2019s repertoire only when strong hints (highlighting) exist and agree with the underlying unlabeled data manifold structure. Under this assumption, we can then explain the contrast between the lack of manifold learning in 2lmoonsUh, and the presence of manifold learning in 4lmoonsUh.\n\nFootnote 3: The individual numbers of leaps-of-faith are 0, 1, 2, 4, 10, 13, 13, 14, 14, 15, 15, 16, 18, 19, 20, 21, 22, 24, 25, 27, 33, 36, and 36, respectively, for the 23 participants.\n\nFootnote 4: In addition, if we create a GP from the Laplacian of the random highlighting graph, the GP accuracy in predicting 4lmoonsUhR human majority vote is 0.46, and the percentage of participants in 4lmoonsUhR who can be attributed to this model is 0.\n\nFigure 6: The 4lmoonsUhR experiment with 30 participants. (a) The 4lmoonsUhR condition. (b) The behavioral evaluation for 4lmoonsUhR, where the x-axis is the shortest path length of an unlabeled point to a labeled point, and the y-axis (labeled \u201cempirical accuracy\u201d) is the fraction of participants who classified that unlabeled point consistent with the nearest labeled point. (c) The same behavioral evaluation for 4lmoonsUh. (d) The majority vote in 4lmoonsUhR.\n\n
On one hand, for the 2lmoonsUh condition, the evidence values for the seven GP models on the two labeled points are: (graph) 0.249, (1NN,\u21132) 0.250, (1NN,\u21131) 0.250, (multi-v) 0.250, (multi-h) 0.250, (single-v) 0.249, (single-h) 0.249. The graph-based GP has slightly lower evidence than several other GPs, which may be due to our specific choice of kernel parameters in (1). In any case, there is no reason to prefer the GP with a graph kernel, and we do not expect humans to learn on the manifold in 2lmoonsUh. On the other hand, for 4lmoonsUh, the evidence values for the seven GP models on those four labeled points are: (graph) 0.0626, (1NN,\u21132) 0.0591, (1NN,\u21131) 0.0625, (multi-v) 0.0625, (multi-h) 0.0625, (single-v) 0.0341, (single-h) 0.0342. The graph-based GP has a small lead over other GPs. In particular, it is better than the evidence 1/16 = 0.0625 for kernels that treat the four labeled points essentially independently. The graph-based GP obtains this lead by warping the space along the two manifolds so that the two positive (resp. negative) labeled points tend to co-vary. Thus, there is a reason to prefer the GP with a graph kernel, and we do expect humans to learn on the manifold in 4lmoonsUh.\nWe also explore the convex combination of the seven GPs as a richer model for human behavior: k(\u03bb) = \u03a3_{i=1}^{7} \u03bb_i k_i, where \u03bb_i \u2265 0 and \u03a3_i \u03bb_i = 1. This allows a weighted combination of kernels to be used, and is more powerful than selecting a single kernel. Again, we optimize the mixing weights \u03bb by maximizing the evidence p(y1:l | x1:l, k(\u03bb)). This is a constrained optimization problem, and can easily be solved up to a local optimum (because the evidence is in general non-convex) with a projected gradient method, given the gradient of the log evidence. 
For the 2lmoonsUh condition, in 100 trials\nwith random starting \u03bb values, the maximum evidence always converges to 1/4, while the optimal\n\u03bb is not unique and occupies a subspace (0, \u03bb_2, \u03bb_3, \u03bb_4, \u03bb_5, 0, 0) with \u03bb_2 + \u03bb_3 + \u03bb_4 + \u03bb_5 = 1 and mean\n(0, 0.27, 0.25, 0.22, 0.26, 0, 0). Note that the weight \u03bb_1 for the graph-based kernel is zero. In contrast, for\nthe 4lmoonsUh condition, in 100 trials \u03bb overwhelmingly converges to (1, 0, 0, 0, 0, 0, 0) with evidence\n0.0626; i.e., the analysis again suggests that people would perform manifold learning in 4lmoonsUh.\n\nOf course, this Bayesian model selection analysis is over-simplified. For instance, we did not consider people\u2019s prior p(\u03bb) on GP models, i.e., which model they would prefer before seeing the data.\nIt is possible that humans favor models which produce axis-parallel decision boundaries. Defining\nand incorporating non-uniform p(\u03bb) priors is a topic for future research.\n\nAcknowledgments. We thank Rob Nowak and the anonymous reviewers for their valuable comments that motivated us to conduct the new experiments discussed in Section 5 after initial review. This work is supported in\npart by NSF IIS-0916038, NSF IIS-0953219, NSF DRM/DLS-0745423, and AFOSR FA9550-09-1-0313.\n\nReferences\n\n[1] Mikhail Belkin, Partha Niyogi, and Vikas Sindhwani. Manifold regularization: A geometric framework for learning from labeled and unlabeled examples. Journal of Machine Learning Research, 7:2399\u20132434, November 2006.\n\n[2] Olivier Chapelle, Bernhard Sch\u00f6lkopf, and Alexander Zien, editors. Semi-supervised learning. MIT Press, 2006.\n\n[3] R. M. Nosofsky. Attention, similarity, and the identification-categorization relationship. Journal of Experimental Psychology: General, 115(1):39\u201357, 1986.\n\n[4] Carl E. Rasmussen and Christopher K. I. Williams. Gaussian Processes for Machine Learning. MIT Press, 2006.\n\n[5] Carl E. 
Rasmussen and Christopher K. I. Williams. GPML Matlab code, 2007. http://www.gaussianprocess.org/gpml/code/matlab/doc/, accessed May 2010.\n\n[6] Vikas Sindhwani, Partha Niyogi, and Mikhail Belkin. Beyond the point cloud: from transductive to semi-supervised learning. In ICML05, 22nd International Conference on Machine Learning, 2005.\n\n[7] Katleen Vandist, Maarten De Schryver, and Yves Rosseel. Semisupervised category learning: The impact of feedback in learning the information-integration task. Attention, Perception, & Psychophysics, 71(2):328\u2013341, 2009.\n\n[8] Guy Wallis and Heinrich H. B\u00fclthoff. Effects of temporal association on recognition memory. Proceedings of the National Academy of Sciences, 98(8):4800\u20134804, 2001.\n\n[9] Dengyong Zhou, Olivier Bousquet, Thomas Lal, Jason Weston, and Bernhard Sch\u00f6lkopf. Learning with local and global consistency. In Advances in Neural Information Processing Systems 16, 2004.\n\n[10] Xiaojin Zhu, Zoubin Ghahramani, and John Lafferty. Semi-supervised learning using Gaussian fields and harmonic functions. In The 20th International Conference on Machine Learning (ICML), 2003.\n\n[11] Xiaojin Zhu and Andrew B. Goldberg. Introduction to Semi-Supervised Learning. Synthesis Lectures on Artificial Intelligence and Machine Learning. Morgan & Claypool Publishers, San Rafael, CA, 2009.", "award": [], "sourceid": 1302, "authors": [{"given_name": "Tim", "family_name": "Rogers", "institution": null}, {"given_name": "Chuck", "family_name": "Kalish", "institution": null}, {"given_name": "Joseph", "family_name": "Harrison", "institution": null}, {"given_name": "Jerry", "family_name": "Zhu", "institution": null}, {"given_name": "Bryan", "family_name": "Gibson", "institution": null}]}