{"title": "Learning Classification with Unlabeled Data", "book": "Advances in Neural Information Processing Systems", "page_first": 112, "page_last": 119, "abstract": null, "full_text": "Learning Classification with Unlabeled Data \n\nVirginia R. de Sa \n\ndesa@cs.rochester.edu \n\nDepartment of Computer Science \n\nUniversity of Rochester \nRochester, NY 14627 \n\nAbstract \n\nOne of the advantages of supervised learning is that the final error metric is available during training. For classifiers, the algorithm can directly reduce the number of misclassifications on the training set. Unfortunately, when modeling human learning or constructing classifiers for autonomous robots, supervisory labels are often not available or too expensive. In this paper we show that we can substitute for the labels by making use of structure between the pattern distributions to different sensory modalities. We show that minimizing the disagreement between the outputs of networks processing patterns from these different modalities is a sensible approximation to minimizing the number of misclassifications in each modality, and leads to similar results. Using the Peterson-Barney vowel dataset we show that the algorithm performs well in finding appropriate placement for the codebook vectors, particularly when the confusable classes are different for the two modalities. \n\n1 INTRODUCTION \n\nThis paper addresses the question of how a human or autonomous robot can learn to classify new objects without experience with previous labeled examples. We represent objects with n-dimensional pattern vectors and consider piecewise-linear classifiers consisting of a collection of (labeled) codebook vectors in the space of the input patterns (see Figure 1). The classification boundaries are given by the Voronoi tessellation of the codebook vectors. 
\nPatterns are said to belong to the class (given by the label) of the codebook vector to which they are closest. \n\nFigure 1: A piecewise-linear classifier in a 2-dimensional input space. The circles represent data samples from two classes (filled (A) and not filled (B)). The X's represent codebook vectors (they are labeled according to their class, A and B). Future patterns are classified according to the label of the closest codebook vector. \n\nIn [de Sa and Ballard, 1993] we showed that the supervised algorithm LVQ2.1 [Kohonen, 1990] moves the codebook vectors to minimize the number of misclassified patterns. The power of this algorithm lies in the fact that it directly minimizes its final error measure (on the training set). The codebook vectors are positioned not to approximate the probability distributions but to decrease the number of misclassifications. \n\nUnfortunately, in many situations labeled training patterns are either unavailable or expensive. The classifier cannot measure its classification performance while learning (and hence cannot directly maximize it). One such unsupervised algorithm, Competitive Learning [Grossberg, 1976; Kohonen, 1982; Rumelhart and Zipser, 1986], has unlabeled codebook vectors that move to minimize a measure of the reconstruction cost. Even with subsequent labeling of the codebook vectors, they are not well suited for classification because they have not been positioned to induce optimal borders. 
\n\n[Figure 2 panels: Supervised: implausible label (\"COW\"); Unsupervised: limited power; Self-Supervised: derives label from a co-occurring input to another modality.] \n\nFigure 2: The idea behind the algorithm \n\nThis paper presents a new measure for piecewise-linear classifiers receiving unlabeled patterns from two or more sensory modalities. Minimizing the new measure is an approximation to minimizing the number of misclassifications directly. It takes advantage of the structure available in natural environments which results in sensations to different sensory modalities (and sub-modalities) that are correlated. For example, hearing \"mooing\" and seeing cows tend to occur together. So, although the sight of a cow does not come with an internal homuncular \"cow\" label, it does co-occur with an instance of a \"moo\". The key is to process the \"moo\" sound to obtain a self-supervised label for the network processing the visual image of the cow and vice-versa. See Figure 2. \n\nFigure 3: This figure shows an example world as sensed by two different modalities. If modality 1 receives a pattern from its Class A distribution, modality 2 receives a pattern from its own Class A distribution (and the same for Class B). Without receiving information about which class the patterns came from, they must try to determine appropriate placement of the boundaries b_1 and b_2. P(C_i) is the prior probability of Class i and p(x_j|C_i) is the conditional density of Class i for modality j. 
\n\n2 USING MULTI-MODALITY INFORMATION \n\nOne way to make use of the cross-modality structure is to derive labels for the codebook vectors (after they have been positioned either by random initialization or an unsupervised algorithm). The labels can be learnt with a competitive learning algorithm using a network such as that shown in Figure 4. In this network the hidden layer competitive neurons represent the codebook vectors. Their weights from the input neurons represent their positions in the respective input spaces. Presentation of the paired patterns results in activation of the closest codebook vectors in each modality (and 0's elsewhere). Co-occurring codebook vectors will then increase their weights to the same competitive output neuron. After several iterations the codebook vectors are given the (arbitrary) label of the output neuron to which they have the strongest weight. We will refer to this as the \"labeling algorithm\". \n\n2.1 MINIMIZING DISAGREEMENT \n\nA more powerful use of the extra information is for better placement of the codebook vectors themselves. \n\nIn [de Sa, 1994] we derive an algorithm that minimizes\u00b9 the disagreement between the outputs of two modalities. The algorithm is originally derived not as a piecewise-linear classifier but as a method of moving boundaries for the case of two classes and an agent with two 1-dimensional sensing modalities as shown in Figure 3. \n\nEach class has a particular probability distribution for the sensation received by each modality. If modality 1 experiences a sensation from its pattern A distribution, modality 2 experiences a sensation from its own pattern A distribution. 
That is, the world presents patterns from the 2-D joint distribution shown in Figure 5a) but each modality can only sample its 1-D marginal distribution (shown in Figure 3 and Figure 5a). \n\n\u00b9 The goal is actually to find a non-trivial local minimum (for details see [de Sa, 1994]). \n\n[Figure 4 diagram labels: Output (Class); Hidden Layer: Codebook Vectors (W); Input (X); Modality/Network 1; Modality/Network 2] \n\nFigure 4: This figure shows a network for learning the labels of the codebook vectors. The weight vectors of the hidden layer neurons represent the codebook vectors while the weight vectors of the connections from the hidden layer neurons to the output neurons represent the output class that each codebook vector currently represents. In this example there are 3 output classes and two modalities, each of which has 2-D input patterns and 5 codebook vectors. \n\nWe show [de Sa, 1994] that minimizing the disagreement error (the proportion of pairs of patterns for which the two modalities output different labels) \n\nE(b_1, b_2) = Pr{x_1 < b_1 & x_2 > b_2} + Pr{x_1 > b_1 & x_2 < b_2} (1) \n\n= \\int_{-\\infty}^{b_1} \\int_{b_2}^{\\infty} f(x_1, x_2) dx_2 dx_1 + \\int_{b_1}^{\\infty} \\int_{-\\infty}^{b_2} f(x_1, x_2) dx_2 dx_1 (2) \n\n(where f(x_1, x_2) = P(C_A) p(x_1|C_A) p(x_2|C_A) + P(C_B) p(x_1|C_B) p(x_2|C_B) is the joint probability density for the two modalities) in the above problem results in an algorithm that corresponds to the optimal supervised algorithm except that the \"label\" for each modality's pattern is the hypothesized output of the other modality. \n\nConsider the example illustrated in Figure 5. In the supervised case (Figure 5a) the labels are given, allowing sampling of the actual marginal distributions. For each modality, the number of misclassifications can be minimized by setting the boundaries for each modality at the crossing points of their marginal distributions. \n\nHowever, in the self-supervised system, the labels are not available. 
Instead we are given the output of the other modality. Consider the system from the point of view of modality 2. Its patterns are labeled according to the outputs of modality 1. This labels the patterns in Class A as shown in Figure 5b). Thus from the actual Class A patterns, the second modality sees the \"labeled\" distributions shown. Letting a be the fraction of misclassified patterns from Class A, the resulting distributions are given by (1 - a)P(C_A)p(x_2|C_A) and (a)P(C_A)p(x_2|C_A). \n\nSimilarly Figure 5c) shows the effect on the patterns in Class B. Letting b be the fraction of Class B patterns misclassified, the distributions are given by (1 - b)P(C_B)p(x_2|C_B) and (b)P(C_B)p(x_2|C_B). Combining the effects on both classes results in the \"labeled\" distributions shown in Figure 5d). The \"apparent Class A\" distribution is given by (1 - a)P(C_A)p(x_2|C_A) + (b)P(C_B)p(x_2|C_B) and the \"apparent Class B\" distribution by (a)P(C_A)p(x_2|C_A) + (1 - b)P(C_B)p(x_2|C_B). Notice that even though the approximated distributions may be discrepant, if a \u2248 b, the crossing point will be close. \n\nSimultaneously the second modality is labeling the patterns to the first modality. At each iteration of the algorithm both borders move according to the samples from the \"apparent\" marginal distributions. \n\n[Figure 5 legend: P(C_A)p(x_1|C_A); P(C_B)p(x_1|C_B); (a)P(C_A)p(x_2|C_A); (1 - a)P(C_A)p(x_2|C_A)] \n\nFigure 5: This figure shows an example of the joint and marginal distributions (for better visualization the scale of the joint distribution is twice that of the marginal distributions) for the example problem introduced in Figure 3. The darker gray represents patterns labeled \"A\", while the lighter gray are labeled \"B\". 
The dark and light curves are the corresponding marginal distributions with bold and regular labels respectively. a) shows the labeling for the supervised case. b), c) and d) reflect the labels given by modality 1 and the corresponding marginal distributions seen by modality 2. See text for more details. \n\n2.2 Self-Supervised Piecewise-Linear Classifier \n\nThe above ideas have been extended [de Sa, 1994] to rules for moving the codebook vectors in a piecewise-linear classifier. Codebook vectors are initially chosen randomly from the data patterns. In order to complete the algorithm idea, the codebook vectors need to be given initial labels (the derivation assumes that the current labels are correct). In LVQ2.1 the initial codebook vectors are chosen from among the data patterns that are consistent with their neighbours (according to a k-nearest neighbour algorithm); their labels are then taken as the labels of the data patterns. In order to keep our algorithm unsupervised the \"labeling algorithm\" mentioned earlier is used to derive labels for the initial codebook vectors. \n\nAlso, because the codebook vectors may cross borders or may not be accurately labeled in the initialization stage, their labels are updated throughout the algorithm by increasing the weight to the output class hypothesized by the other modality, from the neuron representing the closest codebook vector. The final algorithm is given in Figure 6. \n\n1. Randomly choose initial codebook vectors from data vectors \n2. Initialize labels of codebook vectors using the labeling algorithm described in text \n3. 
Repeat for each presentation of input patterns X_1(n) and X_2(n) to their respective modalities \n\n\u2022 Find the two nearest codebook vectors in modality 1 -- w_{1,i_1^*}, w_{1,i_2^*}, and modality 2 -- w_{2,k_1^*}, w_{2,k_2^*} to the respective input patterns \n\n\u2022 Find the hypothesized output class (C_A, C_B) in each modality (as given by the label of the closest codebook vector) \n\n\u2022 For each modality update the weights according to the following rules (only the rules for modality 1 are given) \nIf neither or both of w_{1,i_1^*}, w_{1,i_2^*} have the same label as w_{2,k_1^*}, or X_1(n) does not lie within c(n) of the border between them, no updates are done; otherwise \n\nw_{1,i^*}(n) = w_{1,i^*}(n-1) + a(n) (X_1(n) - w_{1,i^*}(n-1)) / ||X_1(n) - w_{1,i^*}(n-1)|| \n\nw_{1,j^*}(n) = w_{1,j^*}(n-1) - a(n) (X_1(n) - w_{1,j^*}(n-1)) / ||X_1(n) - w_{1,j^*}(n-1)|| \n\nwhere w_{1,i^*} is the codebook vector with the same label, and w_{1,j^*} is the codebook vector with another label. \n\n\u2022 Update the labeling weights \n\nFigure 6: The Self-Supervised piecewise-linear classifier algorithm \n\n3 EXPERIMENTS \n\nThe following experiments were all performed using the Peterson and Barney vowel formant data\u00b2. The dataset consists of the first and second formants for ten vowels in a /hVd/ context from 75 speakers (32 males, 28 females, 15 children) who repeated each vowel twice\u00b3. \n\nTo enable performance comparisons, each modality received patterns from the same dataset. This is because the final classification performance within a modality depends 
This  is  because  the  final  classification  performance  within  a  modality  depends \n\n20 btained from  Steven Nowlan \n33 speakers were missing one vowel and the raw data was linearly transformed to have zero mean \n\nand fall  within  the range [-3, 3] in both components \n\n\f118 \n\nde Sa \n\nTable  1:  Tabulation of performance figures  (mean  percent correct and  sample standard deviation \nover 60 trials  and 2 modalities). The heading i - j  refers  to  performance measured after the lh  step \nduring the  ilh  iteration. (Note  Step 1 is  not repeated during the multi-iteration  runs). \n\nsame-paired vowels \nrandom pairing \n\nnot only on the difficulty of the measured modality but also on that of the other \"labeling\" \nmodality.  Accuracy  was  measured  individually (on  the training  set)  for  both modalities \nand averaged.  These results were then averaged over 60 runs.  The results described below \nare also tabulated in Table  1 \n\nIn  the  first  experiment,  the classes  were  paired  so  that  the  modalities  received  patterns \nfrom  the  same  vowel  class.  If modality  1 received  an  [a]  vowel,  so  did modality  2  and \nlikewise for  all  the  vowel  classes  (i.e.  p(xt!Cj )  =  p(x2ICj)  for  all j).  After the  labeling \nalgorithm stage, the accuracy was 60\u00b15% as the initial random placement of the codebook \nvectors does not induce a good classifier.  After application of the third step in Figure 6 (the \nminimizing-disagreement part of the algorithm) the accuracy was 75 \u00b14%.  At this point the \ncodebook vectors are much better suited to defining appropriate classification boundaries. \n\nIt was  discovered  that all  stages  of the algorithm tended  to produce better results on  the \nruns that started with better random initial configurations.  Thus,  for each run,  steps 2 and \n3 were repeated with the final codebook vectors.  
Average performance improved (73\u00b14% after step 2 and 76\u00b14% after step 3). Steps 2 and 3 were repeated several more times with no further significant increase in performance. \n\nThe power of using the cross-modality information to move the codebook vectors can be seen by comparing these results to those obtained with unsupervised competitive learning within modalities followed by an optimal supervised labeling algorithm, which gave a performance of 72%. \n\nOne of the features of multi-modality information is that classes that are easily confusable in one modality may be well separated in another. This should improve the performance of the algorithm as the \"labeling\" signal for separating the overlapping classes will be more reliable. In order to demonstrate this, more tests were conducted with random pairing of the vowels for each run. For example, presentation of [a] vowels to one modality would be paired with presentation of [i] vowels to the other. That is, p(x_1|C_j) = p(x_2|C_{a_j}) for a random permutation a_1, a_2, ..., a_{10}. For the labeling stage the performance was as before (60\u00b14%) as the difficulty within each modality has not changed. However, after the minimizing-disagreement algorithm the results were better, as expected. After 1 and 2 iterations of the algorithm, 77\u00b13% and 79\u00b12% were classified correctly. These results are close to the 80% obtained with the related supervised algorithm LVQ2.1. \n\n4 DISCUSSION \n\nIn summary, appropriate classification borders can be learnt without an explicit external labeling or supervisory signal. For the particular vowel recognition problem, the performance of this \"self-supervised\" algorithm is almost as good as that achieved with supervised algorithms. 
This algorithm would be ideal for tasks in which signals for two or more modalities are available, but labels are either not available or expensive to obtain. \n\nOne specific task is learning to classify speech sounds from images of the lips and the acoustic signal. Stork et al. [1992] performed this task with a supervised algorithm but one of the main limitations for data collection was the manual labeling of the patterns [David Stork, personal communication, 1993]. This task also has the feature that the speech sounds that are confusable are not confusable visually and vice-versa [Stork et al., 1992]. This complementarity helps the performance of this classifier as the other modality provides more reliable labeling where it is needed most. \n\nThe algorithm could also be used for learning to classify signals to a single modality where the signal to the other \"modality\" is a temporally close sample. As the world changes slowly over time, signals close in time are likely from the same class. This approach should be more powerful than that of [Foldiak, 1991] as signals close in time need not be mapped to the same codebook vector but the closest codebook vector of the same class. \n\nAcknowledgements \n\nI would like to thank Steve Nowlan for making the vowel formant data available to me. Many thanks also to Dana Ballard, Geoff Hinton and Jeff Schneider for their helpful conversations and suggestions. A preliminary version of parts of this work appears in greater depth in [de Sa, 1994]. \n\nReferences \n\n[de Sa, 1994] Virginia R. de Sa, \"Minimizing disagreement for self-supervised classification,\" In M.C. Mozer, P. Smolensky, D.S. Touretzky, J.L. Elman, and A.S. Weigend, editors, Proceedings of the 1993 Connectionist Models Summer School, pages 300-307. Erlbaum Associates, 1994. \n\n[de Sa and Ballard, 1993] Virginia R. de Sa and Dana H. 
Ballard, \"A note on learning vector quantization,\" In C.L. Giles, S.J. Hanson, and J.D. Cowan, editors, Advances in Neural Information Processing Systems 5, pages 220-227. Morgan Kaufmann, 1993. \n\n[Foldiak, 1991] Peter F\u00f6ldi\u00e1k, \"Learning Invariance from Transformation Sequences,\" Neural Computation, 3(2):194-200, 1991. \n\n[Grossberg, 1976] Stephen Grossberg, \"Adaptive Pattern Classification and Universal Recoding: I. Parallel Development and Coding of Neural Feature Detectors,\" Biological Cybernetics, 23:121-134, 1976. \n\n[Kohonen, 1982] Teuvo Kohonen, \"Self-Organized Formation of Topologically Correct Feature Maps,\" Biological Cybernetics, 43:59-69, 1982. \n\n[Kohonen, 1990] Teuvo Kohonen, \"Improved Versions of Learning Vector Quantization,\" In IJCNN International Joint Conference on Neural Networks, volume 1, pages I-545-I-550, 1990. \n\n[Rumelhart and Zipser, 1986] D. E. Rumelhart and D. Zipser, \"Feature Discovery by Competitive Learning,\" In David E. Rumelhart, James L. McClelland, and the PDP Research Group, editors, Parallel Distributed Processing: Explorations in the Microstructure of Cognition, volume 2, pages 151-193. MIT Press, 1986. \n\n[Stork et al., 1992] David G. Stork, Greg Wolff, and Earl Levine, \"Neural network lipreading system for improved speech recognition,\" In IJCNN International Joint Conference on Neural Networks, volume 2, pages II-286-II-295, 1992. \n", "award": [], "sourceid": 831, "authors": [{"given_name": "Virginia", "family_name": "de", "institution": null}]}