{"title": "A P300 BCI for the Masses: Prior Information Enables Instant Unsupervised Spelling", "book": "Advances in Neural Information Processing Systems", "page_first": 710, "page_last": 718, "abstract": "The usability of Brain Computer Interfaces (BCI) based on the P300 speller is severely hindered by the need for long training times and many repetitions of the same stimulus. In this contribution we introduce a set of unsupervised hierarchical probabilistic models that tackle both problems simultaneously by incorporating prior knowledge from two sources: information from other training subjects (through transfer learning) and information about the words being spelled (through language models). We show that, due to this prior knowledge, the performance of the unsupervised models parallels and in some cases even surpasses that of supervised models, while eliminating the tedious training session.", "full_text": "A P300 BCI for the Masses: Prior Information Enables Instant Unsupervised Spelling

Pieter-Jan Kindermans, Hannes Verschore, David Verstraeten and Benjamin Schrauwen
PieterJan.Kindermans@UGent.be
Ghent University, Electronics and Information Systems
Sint-Pietersnieuwstraat 41, 9000 Ghent, Belgium

Abstract

The usability of Brain Computer Interfaces (BCI) based on the P300 speller is severely hindered by the need for long training times and many repetitions of the same stimulus. In this contribution we introduce a set of unsupervised hierarchical probabilistic models that tackle both problems simultaneously by incorporating prior knowledge from two sources: information from other training subjects (through transfer learning) and information about the words being spelled (through language models). 
We show that, due to this prior knowledge, the performance of the unsupervised models parallels and in some cases even surpasses that of supervised models, while eliminating the tedious training session.

1 Introduction

Brain Computer Interfaces interpret brain signals to allow direct man-machine communication [17]. In this contribution, we study the so-called P300 paradigm [6]. The user is presented with a grid of 36 characters in which rows and columns light up alternately, and focuses on the character he wishes to spell. The intensification of the focused letter can typically be detected through an event-related potential around the parietal lobe occurring 300 ms after the stimulus. By correctly detecting this so-called P300 Event Related Potential (ERP), the character intended by the user can be determined. To increase the spelling accuracy, multiple epochs are used before a character is predicted, where a single epoch is defined as a sequence of stimuli in which each row and each column is intensified once. The main difficulty in the construction of a P300 speller thus lies in the construction of a classifier for the P300 wave.

Previous work related to the P300 has mainly focused on supervised training. These techniques were evaluated during several BCI Competitions [2, 3]. A popular classification method, to which we will compare our proposed methods, is Bayesian Linear Discriminant Analysis [7]. This is essentially Bayesian Linear Regression where the hyperparameters are optimized using EM [1]. It has been shown that these classifiers are among the best performing for P300 spelling [12]. A recent interesting improvement of P300 spelling is post-processing of the classifier outputs with a language model to improve spelling results [15]. Other researchers have focused on adaptive classifiers which are first trained supervisedly and then adapt to the test session while spelling [11, 13, 10]. 
The most flexible of these methods can be found in [11], which is able to adapt unsupervisedly from one subject to another; however, some initial supervised training sessions are still needed.

Recent work has introduced unsupervised linear classifiers [9] that achieve accuracies comparable to state-of-the-art supervised methods. However, these still suffer from some limitations. When the speller is used online without any prior training, it needs a warm-up period. During this warm-up period the speller output will be more or less random, as the classifier is still trying to determine the underlying structure of the P300 ERP. Once the classifier has successfully learned the task, it rarely makes new mistakes. The length of this warm-up period depends on both the individual subject and the number of epochs used to spell each character. A higher number of epochs will result in fewer letters in the warm-up, but the total spelling time might increase. A second disadvantage is the fact that the classifier is randomly initialized. The remedy for this - evaluating many random initializations and selecting the best - is suboptimal, and ideally one would like to choose a more intelligent initialization based on prior knowledge.

The aim of this paper is thus to reduce the warm-up period and to limit the number of initializations required to achieve acceptable performance without any subject-specific information. This will yield instant subject-specific spelling, with high accuracy and a low number of epochs. To achieve this goal, we extend the graphical model of the unsupervised classifier by incorporating two types of prior knowledge: inter-subject information and language information. 
The key idea is that the incorporation of constraints and prior information can drastically improve a BCI's performance. The power of incorporating prior knowledge has previously been demonstrated in a BCI where finger flexion is decoded from electrocorticographic signals [18].

What we propose is a fully integrated probabilistic model, unlike previous methods, which are combinations of different techniques. Furthermore, the prior work related to P300 classification possesses only a subset of the capabilities of our model.

2 Methods

(a) standard (b) subject transfer (c) subject transfer and language model

Figure 1: Graphical representation of the different classifiers. On the left we show the basic unsupervised classifier [9]. In the middle we present our first contribution: the incorporation of inter-subject information through a shared hyperprior. On the right is the most complex model: inter-subject information and a trigram language model.

2.1 Unsupervised P300 Speller

The basic unsupervised speller which we extend in this paper is the unsupervised P300 classifier proposed in [9]. We will present a slightly generalized version of this model such that it does not depend on the column/row intensification structure of the default P300 application. The model is built around the following assumption: the EEG can be projected into one dimension where the projection will have a Gaussian distribution with a class-dependent mean (containing the P300 response versus not containing the response). From now on the distribution on the projected EEG will be used as an approximation of the distribution on the EEG itself. This makes inference and reasoning about the model simpler, but it remains an approximation. 
The full model, shown in Figure 1(a), is as follows:

p(w_s) = N(w_s | 0, α_s I),

p(x_{s,t,i} | c_{s,t}, w_s, β_s) = N(x_{s,t,i} w_s | y_{s,t,i}(c_{s,t}), β_s),

p(c_{s,t}) = 1/C,

where w_s is the classifier's weight vector, C is the number of symbols in the character grid, s indicates the subject, and c_{s,t} is the t-th character for subject s. The row vector x_{s,t,i} contains the EEG recorded after intensification i during spelling of c_{s,t} by subject s, and a bias term. The EEG for a character will be denoted as X_{s,t}, a matrix whose rows are the different x_{s,t,i}. Likewise, X_s consists of all the features for a single subject. Both α_s and β_s are values for the precision of the associated Gaussian distribution. The mean y_{s,t,i}(c_{s,t}) equals 1 when character c_{s,t} was highlighted during intensification i, and otherwise y_{s,t,i}(c_{s,t}) = −1. This class-dependent mean encompasses the constraint on the labeling of the individual EEG segments posed by the application: during all of the intensifications for a single character, the subject focuses on the same character. Thus, during all epochs, each intensification of this character should yield the P300, and each intensification which does not include this character should not elicit a P300 response.

The probability of a character given the EEG can be computed by application of Bayes's rule:

p(c_{s,t} | X_{s,t}, w_s, β_s) = p(c_{s,t}) p(X_{s,t} | c_{s,t}, w_s, β_s) / Σ_{c_{s,t}} p(c_{s,t}) p(X_{s,t} | c_{s,t}, w_s, β_s).

In this model, the EEG X_s contains the observed variables, the characters are the latent variables which need to be inferred, and w_s, β_s, α_s are the parameters which we want to optimize. 
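To make the character posterior concrete, the computation above can be sketched in NumPy. This is an illustrative sketch only; the array names and shapes are our own, not from the authors' code. Given the projections of the stimulus responses and the target means y(c) implied by each candidate character, it evaluates the Gaussian likelihood and normalizes over the C characters:

```python
import numpy as np

def character_posterior(proj, Y, beta, prior=None):
    """p(c | X, w, beta) for one character slot (illustrative sketch).

    proj  : (N,) projections x_{t,i} w of the N stimulus responses.
    Y     : (C, N) target means y(c) in {-1, +1} for each candidate c.
    beta  : precision of the Gaussian on the projection.
    prior : (C,) p(c); uniform 1/C when omitted.
    """
    C = Y.shape[0]
    if prior is None:
        prior = np.full(C, 1.0 / C)
    # Log-likelihood of the projections under each candidate labelling;
    # the Gaussian normalization constant cancels when we normalize.
    log_lik = -0.5 * beta * ((proj[None, :] - Y) ** 2).sum(axis=1)
    log_post = np.log(prior) + log_lik
    log_post -= log_post.max()          # guard against underflow
    post = np.exp(log_post)
    return post / post.sum()
```

Working in log space before normalizing keeps the computation stable when many stimuli are accumulated over epochs.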
The well-known Expectation Maximization algorithm [5] can be used to optimize for w_s, β_s and yields the following update equations:

w_s = ( X_s^T X_s + (α_s^old / β_s^old) I )^{−1} Σ_{c_s} p(c_s | X_s, w_s^old, β_s^old) X_s^T y_s(c_s),

β_s^{−1} = ⟨ Σ_{c_{s,t}} p(c_{s,t} | X_s, w_s^old, β_s^old) ( x_{s,t,i} w_s^old − y_{s,t,i}(c_{s,t}) )^2 ⟩_{t,i}.

The update for w_s is a weighted sum of ridge regression classifiers trained with all possible labellings for the EEG. The weights are the probabilities that the used labels are correct given the previous weight vector w_s^old. Let y_s(c_s) be the labels which are assigned to the EEG given the character prediction c_s when the application constraints described above are taken into account. The value of β_s^{−1} is the expected mean squared error between the projection and the target mean given the old weight vector. The hyper-parameter α_s can be optimized directly: α_s = D / ( (w_s^old)^T w_s^old ), where D is the dimensionality of the weight vector. The combined optimization of α_s, β_s allows for automatic tuning of the amount of regularization, but α_s will be bounded by 10^3 to prevent the weight vector from collapsing onto the prior.

From the graphical representation it is clear that, without making additional assumptions about the data, there are only two ways to add additional constraints or information. First, we can incorporate prior information about the characters (the bottom of the graphical model) through language models. The second option is to incorporate prior information about the weight vector (the top of the model). We will start with the latter. 
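The M-step above can be sketched for a single subject as follows. This is a toy sketch under assumed shapes (all names are illustrative, not the authors' code): X stacks the EEG feature rows, Y holds the label vectors y_s(c) for every candidate character, and p_c is the posterior over candidates from the E-step.

```python
import numpy as np

def m_step(X, Y, p_c, w_old, alpha_old, beta_old, alpha_max=1e3):
    """One M-step of the unsupervised P300 model (illustrative sketch).

    X    : (N, D) EEG feature rows x_{t,i} stacked for all stimuli.
    Y    : (C, N) target means y(c) in {-1, +1} per candidate labelling.
    p_c  : (C,) posterior p(c | X, w_old, beta_old) from the E-step.
    """
    D = X.shape[1]
    # Posterior-weighted average of the candidate label vectors: the sum
    # over c of p(c) X^T y(c) equals X^T (p_c @ Y).
    y_bar = p_c @ Y                                    # (N,)
    # Weighted-sum ridge regression update for w.
    A = X.T @ X + (alpha_old / beta_old) * np.eye(D)
    w = np.linalg.solve(A, X.T @ y_bar)
    # beta^{-1}: expected mean squared error under the old weights.
    proj = X @ w_old                                   # (N,)
    sq_err = (proj[None, :] - Y) ** 2                  # (C, N)
    beta = 1.0 / (p_c @ sq_err).mean()
    # alpha: closed-form update, bounded to avoid collapse onto the prior.
    alpha = min(D / (w_old @ w_old), alpha_max)
    return w, alpha, beta
```

Because the ridge matrix does not depend on the candidate labelling, the weighted sum over candidates reduces to a single solve against the averaged targets.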
Both these access points for prior knowledge are given a brighter color in Figure 1(a).

2.2 Inter-subject Transfer

For the transfer learning, we drew inspiration from the work by Kemp et al. [8]. We will use hierarchical Bayesian models to share knowledge about the P300 response detection across different subjects. Our proposed model is shown in Figure 1(b) and is defined as follows:

p(µ_w) = N(µ_w | 0, α_p I), with α_p = 0,

p(w_s | µ_w) = N(w_s | µ_w, α_s I),

p(x_{s,t,i} | c_{s,t}, w_s, β_s) = N(x_{s,t,i} w_s | y_{s,t,i}(c_{s,t}), β_s),

p(c_{s,t}) = 1/C,

where we have placed a zero-mean, zero-precision Gaussian prior on the mean of the weight vector. When doing inference, we will always assume that µ_w is given and set to its most likely value. The advantage of working with the most likely value is that there is no time penalty for transfer learning when used in an online setting. In the case that µ_w = 0, the model reduces to the original model. On the other hand, if µ_w takes on a nonzero value, the update equations for w_s, α_s become:

w_s = ( X_s^T X_s + (α_s^old / β_s^old) I )^{−1} ( Σ_{c_s} p(c_s | X_s, w_s^old, β_s^old) X_s^T y_s(c_s) + (α_s^old / β_s^old) µ_w ),

α_s = D / ( (w_s^old − µ_w)^T (w_s^old − µ_w) ).

The update for β_s remains unaltered. When we train without transfer for an initial set of subjects s = 1, . . . , S, we initialize all α_s = α_p = 0 and µ_w = 0. For this specific assignment of µ_w, α_p, training is actually the same as integrating out µ_w. After the training has converged for all the subjects, we have a subject-specific Maximum A Posteriori estimate w_s^new and an optimized value α_s^new. 
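The transfer-specific updates, together with the precision-weighted combination of the converged subject estimates used to form µ_w (derived next), can be sketched as follows. All names and shapes are our own illustrative choices:

```python
import numpy as np

def m_step_transfer(X, y_bar, mu_w, w_old, alpha_old, beta_old):
    """Transfer-learning variants of the w and alpha updates (sketch).

    Identical to the standard M-step except that the prior mean mu_w
    enters the ridge solution, and alpha measures the distance of w_old
    from mu_w instead of from zero. y_bar is the posterior-weighted
    average of the candidate label vectors.
    """
    D = X.shape[1]
    ratio = alpha_old / beta_old
    w = np.linalg.solve(X.T @ X + ratio * np.eye(D),
                        X.T @ y_bar + ratio * mu_w)
    alpha = D / ((w_old - mu_w) @ (w_old - mu_w))
    return w, alpha

def combine_subjects(ws, alphas):
    """Precision-weighted combination of converged subject models,
    giving the mean and precision of the hyperprior on the weights."""
    ws, alphas = np.asarray(ws), np.asarray(alphas)
    alpha_p = alphas.sum()
    mu_p = (alphas[:, None] * ws).sum(axis=0) / alpha_p
    return mu_p, alpha_p
```

With mu_w set to the zero vector, m_step_transfer reduces to the standard ridge update of Section 2.1, mirroring how the model itself reduces to the original one.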
Using these, we can compute the posterior distribution on µ_w:

p(µ_w | w_1^new, . . . , w_S^new) = N(µ_w | µ_p^new, α_p^new I),

µ_p^new = (1 / α_p^new) Σ_{s=1...S} α_s^new w_s^new,

α_p^new = Σ_{s=1...S} α_s^new.

To apply transfer learning for a new subject S + 1, we assign µ_w = µ_p^new and keep it fixed. The new α_{S+1} is initialized with α_p^new. The role of the optimization of α_{S+1} is to let the model determine whether we can build a proper model by staying close to the prior (α_{S+1} takes on large values) or whether we have to build a very specific model (α_{S+1} becomes very small).

2.3 Incorporation of language models

A second possibility is to incorporate language models. The only difference between working with and without a language model lies in the computation of the probability of a character given the EEG. Hence the E-step will change but the M-step will not. Please note that we have dropped the subject-specific index, and we will continue to do so in this section to keep the notation uncluttered.

An n-gram language model takes the history into account: the probability of a character is defined given the n−1 previous characters: p(c_t | c_{t−1}, . . . , c_{t−n+1}). In this work, we limit ourselves to uni-, bi- and trigram language models. The graphical model of the P300 speller with subject transfer and a trigram language model is shown in Figure 1(c). For the unigram language model, which counts character frequencies, we only have to change the prior on the characters p(c_t) to the probability of each character occurring.

To compute the marginal probability of a character given the EEG, which is exactly what we need in the E-step, we use the well-known forward-backward algorithm for HMMs [1]. 
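As a concrete illustration of that algorithm in our setting, here is a sketch for the bigram case; the paper's trigram case extends this with pair states (c_t, c_{t−1}). All names, and the per-row normalization added for numerical stability, are our own choices:

```python
import numpy as np

def char_posteriors(lik, trans, prior):
    """Marginal p(c_t | X) under a bigram language model (sketch).

    lik   : (T, C) p(X_t | c_t) up to a constant, from the EEG classifier.
    trans : (C, C) bigram probabilities p(c_t | c_{t-1}).
    prior : (C,)  p(c_1) for the first character.
    """
    T, C = lik.shape
    f = np.zeros((T, C))
    b = np.ones((T, C))
    f[0] = prior * lik[0]
    f[0] /= f[0].sum()                    # normalize for stability
    for t in range(1, T):                 # forward recursion
        f[t] = lik[t] * (f[t - 1] @ trans)
        f[t] /= f[t].sum()
    for t in range(T - 2, -1, -1):        # backward recursion
        b[t] = trans @ (lik[t + 1] * b[t + 1])
        b[t] /= b[t].sum()
    post = f * b                          # unnormalized marginals
    return post / post.sum(axis=1, keepdims=True)
```

Because each new character only extends the forward recursion by one step, caching f from previous predictions is what makes online use cheap, as noted below.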
For general n-grams, this algorithm computes

p(X_1, . . . , X_T, c_t, . . . , c_{t−n+2}) = f(c_t, . . . , c_{t−n+2}) b(c_t, . . . , c_{t−n+2}),

f(c_t, . . . , c_{t−n+2}) = p(X_1, . . . , X_t, c_t, . . . , c_{t−n+2}),

b(c_t, . . . , c_{t−n+2}) = p(X_{t+1}, . . . , X_T | c_t, . . . , c_{t−n+2}).

The forward and backward recursions are as follows:

f(c_t, . . . , c_{t−n+2}) = p(X_t | c_t) Σ_{c_{t−n+1}} p(c_t | c_{t−1}, . . . , c_{t−n+1}) f(c_{t−1}, . . . , c_{t−n+1}),

b(c_t, . . . , c_{t−n+2}) = Σ_{c_{t+1}} p(X_{t+1} | c_{t+1}) p(c_{t+1} | c_t, . . . , c_{t−n+2}) b(c_{t+1}, . . . , c_{t−n+3}).

The initialization of the forward and backward recursions is analogous to the initialization for the default HMM [1]. The probability of a character can be computed as follows:

p(c_t | X) = Σ_{c_{t−1},...,c_{t−n+2}} p(X_1, . . . , X_T, c_t, . . . , c_{t−n+2}) / p(X)

= Σ_{c_{t−1},...,c_{t−n+2}} f(c_t, . . . , c_{t−n+2}) b(c_t, . . . , c_{t−n+2}) / Σ_{c_t,...,c_{t−n+2}} f(c_t, . . . , c_{t−n+2}) b(c_t, . . . , c_{t−n+2}).

This can then be plugged directly into the EM update equations from Section 2.1. Note that when we cache the forward pass from previous character predictions, only a single step of both the forward and backward pass has to be executed to spell a new character.

3 Experiments and Discussion

3.1 The Akimpech Dataset

We performed our experiments on the public Akimpech P300 database [19]. This dataset covers 22 subjects1 who spelled Spanish words. The data was recorded with a 16-channel g.tec gUSBamp EEG amplifier at 256 Hz, but only 10 channels are available in the dataset. 
The recording was performed with the BCI2000 P300 speller software [14] with the following settings: a 2 second pause before and after each character, 62.5 ms between the intensifications, each intensification lasting 125 ms, and a spelling matrix containing the characters [a−z1−9 ]. The dataset comprises both a train and a test set. The train set contains 16 characters with 15 epochs per character. This train set will not be used by the unsupervised classifiers but only by the supervised classifier which we will later use for comparison. The number of characters in the test set is subject-dependent and ranges from 17 to 29, with an average of 22.18. This limited number of characters per sequence is very challenging for our unsupervised classifier, since the spelling has to be as correct as possible right from the start in order to obtain high average accuracies.

As the pre-processing in [9] has been shown to lead to good spelling performance, we adhere to their approach. The EEG is preprocessed one character at a time; as a consequence this approach is valid in real online experiments2. Pre-processing begins by applying a Common Average Reference filter, followed by a bandpass filter (0.5 Hz - 15 Hz). Afterwards, each EEG channel is normalized such that it has zero mean and unit variance. The final step is sub-sampling the EEG by a factor of 6 and retaining 10 samples which are centered at 300 ms after stimulus presentation.

3.2 Training the Language Models and Spelling Real Text

The Akimpech dataset was recorded using a limited number of Spanish words with an unrepresentative subset of characters. It is therefore not an accurate representation of how a realistic speller would be used. To alleviate this, we constructed a dataset which contains words that would be spelled in a realistic context. 
This is done by re-synthesizing a dataset using the EEG from the Akimpech dataset and sentences from the English Wikipedia dataset from Sutskever et al. [16].

In a P300 speller, a look-up table assigns a specific character to each position in the on-screen matrix. The actual task is to determine the position that, when intensified, evokes the P300 response. To spell a symbol, we predict the desired position, then look up the symbol assigned to it. Thus, in a standard P300 setup the desired text can be modified by altering the look-up table. Furthermore, this will not influence the performance as long as the desired symbol is assigned to a single position. This approach remains valid when language models are integrated into the classifier, because neither the EEG nor the intensification structure is modified.

The Wikipedia dataset was transformed to lowercase and we used the first 5·10^8 characters in the dataset to select the 36 most frequently occurring characters, excluding numeric symbols. We argue that using only a subset of the numbers is of no use, and since we add the space as a symbol, we would have to drop at least one numeric symbol. As such, it makes sense to replace all the numeric characters with other symbols. The selected characters are the following: [a−z :%()'−"., _], where the underscore signifies whitespace. This set of characters is then used to train unigram, bigram and trigram letter models. These language models were trained on the first 5·10^8 characters and we applied Witten-Bell smoothing [4], which assigns small but non-zero probabilities to n-grams not encountered in the train set.

The remaining part of the Wikipedia dataset was used to generate target texts for classifier evaluation. This part was not used to train the language models. 
First we dropped the non-selected characters. Then, for each subject, we sampled new texts from the dataset with the same length as the originally spelled text. Additionally, we modified the contents of the character grid such that it contains the 36 selected symbols. The look-up table for the individual spelling actions was changed such that the correct solution is the newly sampled text. This is implemented by taking the base look-up table, with [a−z :%()'−"., _], with a in the top left and _ in the bottom right corner of the screen, and cyclically shifting it.

1 There are more subjects listed on the website but some files are corrupt.
2 This claim was empirically verified; we omit the discussion of these experiments due to page constraints.

3.3 Experimental setup

We tested 12 different classifiers, where we use the following code to name them. The first letter indicates how the classifier is initialized: either randomly (R) or using subject Transfer (T). The second letter indicates whether the classifier adapts unsupervisedly during the spelling session (A) or is static (S). We compared the standard unsupervised (and adaptive) algorithm (RA), which is randomly initialized, our proposed transfer learning approach without online adaptation (TS), and the transfer learning approach with adaptation (TA). These three setups were tested without a language model, and with a uni-, bi- and trigram language model. We indicate the language model by appending the subscript '−' for the classifier without language model, 'uni' for unigram, etc. For example: TAtri is the unsupervised classifier which uses transfer learning, learns on the fly and includes a trigram language model. 
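As an aside on the grid re-mapping described in Section 3.2: cyclically shifting the position-to-symbol look-up table can be sketched as follows. The row-major layout of the base table here is hypothetical, for illustration only; only the 36-symbol character set is taken from the text above.

```python
import string

# Hypothetical base look-up table: 26 letters plus ten punctuation/space
# symbols, 36 entries in row-major order for the 6x6 grid ('_' = space).
BASE = list(string.ascii_lowercase) + list(":%()'-\".,_")

def shifted_table(shift):
    """Cyclically shift the position -> symbol look-up table."""
    return BASE[shift:] + BASE[:shift]

def spell(position, shift):
    """Map a predicted grid position to the symbol it displays."""
    return shifted_table(shift)[position]
```

Because every symbol still occupies exactly one position, shifting the table changes the target text without touching the EEG or the intensification structure.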
The classi\ufb01er RA\u2212 is the baseline which we want to\nimprove on.\nThe in\ufb02uence of performance \ufb02uctuations caused by the initialization or desired text is minimized\nas follows. We executed 20 experiments per subject, where in each experiment all the classi\ufb01ers\nare evaluated. The desired text and classi\ufb01er initializations are experiment speci\ufb01c. This means\nthat for each subject we have 20 desired texts, 20 random initializations and 20 subject transfer\ninitializations. Each classi\ufb01er was evaluated on all of these texts, where for each text we always\nused the same initialization. Additionally, we repeated the experiments with 3, 4, 5, 10 and 15\nepochs per character.\nThe randomly initialized adaptive procedures work as in [9]. In short, the classi\ufb01er \ufb01rst receives the\nEEG for the next character. The EEG is added to the unsupervised trainset and 3 EM iterations are\nexecuted3. Next, the desired symbol is predicted with the updated classi\ufb01er.\nIn the case of transfer learning the initializations are computed as discussed in section 2.2. The initial\nclassi\ufb01ers used in the transfer learning process itself were trained unsupervisedly and of\ufb02ine without\na language model. For each subject, we drew 5 samples for w and trained 2 classi\ufb01ers per draw: one\nwith w and one with \u2212w such that at least one is above chance level for the binary P300 detection.\nFrom the resulting 10 classi\ufb01ers we selected the one which has the highest log likelihood, to be used\nin transfer learning. Finally, the current test subject is omitted when computing the transfer learning\nparameters. In short: the transfer learning parameters are computed without seeing labeled data and\nmore importantly without seeing any data from the current subject.\nWe conclude this section by discussing the time complexity of the methods. The use of transfer\nlearning does not increase the time needed to predict a character. 
However, the time needed per EM iteration scales linearly with the number of characters in the trainset, and the addition of n-gram language models scales the time per E-step with (number of characters in the grid)^{n−1}. Therefore character prediction can become very time consuming. As this is a major issue in this real-time application, we will also discuss the setting where the classifier is first used to spell the next character and the EM updates are executed during the intensifications for the following symbol. As mentioned in Section 2.3, only a single step in the forward and backward pass is needed to spell the next character. Thus we can state that this approach yields instantaneous spelling. This classifier will be named TA*.

3.4 Results

We will start the discussion of the results with the baseline method RA−, followed by the evaluation of our contributions. An overview of the averaged results of all online experiments is available in Figure 2. In Figure 3 we show the performance on the test set after the classifiers have processed the test set and adapted to it, if possible. When we retest the adapted classifier we denote this by appending '-R' to its name.

3 This is a trade-off between classifier update time and performance.

(a) 3 epochs (b) 4 epochs (c) 5 epochs (d) 10 epochs (e) 15 epochs

Figure 2: Overview of all spelling results from online experiments. Increasing the number of epochs or adding complex language models improves accuracy. Furthermore, transfer learning without adaptation (TS) outperforms learning from scratch (RA). Adding adaptation to the transfer learning improves the results even further (TA).

(a) 3 epochs (b) 4 epochs (c) 5 epochs (d) 10 epochs (e) 15 epochs

Figure 3: Spelling accuracy when the test set is processed online and the classifiers are re-evaluated afterwards. 
In Figure 2 we saw that the TS approach outperformed the RA range of classi\ufb01ers. Here\nwe see that TA-R and RA-R outperform TS even with few epochs. It is also clear that the adaptive\nclassi\ufb01ers are able to correct mistakes they made initially.\n\nApplication of the baseline method RA\u2212 and averaging the results over the different subjects results\nin an online spelling accuracy starting at 24.6% for 3 epochs and up to 82.1% for 15 epochs. The\nresult with 15 epochs is usable in practice and predicts only 4 characters incorrectly. However,\nthe spelling time is about half a minute per character. Retesting the classi\ufb01ers obtained after the\nonline experiment gives the following results: when 3 epochs are used the \ufb01nal classi\ufb01er is able to\nspell 60.5% correctly, for 15 epochs this becomes 94.6%. This corroborates the \ufb01ndings from the\noriginal paper [9] that the classi\ufb01ers need the warm-up period before they start to produce correct\npredictions.\nBy evaluating the addition of a language model, RAuni,bi,tri, we see an improvement of the online\nresults. The longer the time dependency in the language model, the bigger the improvement. As\nmore repetitions are used per character, the performance gain of the language models diminishes.\nFor 3 repetitions, a tri-gram model produces an online spelling accuracy of 43.5% compared to\n24.6% without a language model. The results for 15 repetitions show that on average 3 characters\nare predicted incorrectly when a trigram is used. Analysis of the re-evaluation of the classi\ufb01ers after\nonline processing shows a smaller improvement to the results, indicating that the language model\nmainly helps to reduce the warm up period.\nNext we consider the in\ufb02uence of transfer learning. We begin by evaluating the TS classi\ufb01ers, which\ndo not use unsupervised adaptation. Overall, TS classi\ufb01ers outperform the RA range, even when the\nlatter uses a trigram model. 
However, the post-test re-evaluation shows that the RA methods are able to outperform TS. In essence: given enough data, the adaptive method has the ability to learn a better model than the transfer learning approach. Addition of the language models to the TS classifier shows a secondary improvement, as is to be expected.

This brings us to the full model TA: adaptive unsupervised training which is initialized with transfer learning and optionally makes use of language models. Figures 2 and 3 indeed confirm that these models produce the best results, both in the online test and in the re-evaluation afterwards, when we consider unsupervised spelling. Also, the trigram classifier produces the best results, which is not surprising given the incorporation of important prior language knowledge into the model.

Next, we give an overview of spelling accuracies in Table 1, where we compare the basic unsupervised method RA− to the full model TAtri.

Table 1: Comparison between different classifiers. The BLDA classifiers are subject-specific and supervisedly trained. BLDA−tri was trained using 3 epochs. The basic RA− and the full model TAtri are included. Furthermore, we give results for an adapted version TA*tri, which spells the character before the EM updates, and for TA-Rtri, which is the re-evaluation of TAtri after processing the test set.

Epochs | RA−  | TA*tri | TAtri | TA-Rtri | BLDA | BLDAtri | BLDA−tri
3      | 24.6 | 73.8   | 74.8  | 83.5    | 74.5 | 89.4    | 78.9
4      | 42.2 | 82.1   | 83.0  | 91.0    | 82.2 | 93.0    | 83.8
5      | 58.6 | 87.0   | 87.8  | 94.4    | 84.9 | 94.6    | 86.5
10     | 78.4 | 95.0   | 95.5  | 98.5    | 93.0 | 97.4    | 92.5
15     | 82.1 | 97.9   | 98.4  | 99.5    | 96.7 | 98.1    | 94.3

With nearly three times as accurate spelling for 3 epochs (74.8% compared to 24.6%) and near-perfect spelling for 15 epochs, we can conclude that the full model is capable of instant spelling for a novel subject. The application of TA*tri results in a minute performance drop, but as this classifier spells the character before performing the EM iterations, it allows for real-time spelling as soon as the EEG is received and is therefore of more use in an online setting.

To conclude, we compare the unsupervised methods with BLDA, which is the supervised counterpart of the RA− classifier. The BLDA classifiers in this table are supervisedly trained using 15 epochs per character on 16 characters. This is slightly over 10 minutes of training before one can start spelling. The BLDA−tri classifier used a limited training set with only 3 epochs per character, or almost three minutes of training. When the limited training set is used, we see that our proposed method produces results which are competitive for 3-5 epochs and better for 10 and 15. The BLDAtri model outperforms our method when we consider a low number of repetitions per character, but not for 10 or 15 epochs. From 4 epochs onwards we can see that the classifier re-evaluated after online learning (TA-Rtri) is able to learn models which are as good as supervisedly trained models. Finally, we would like to point out that even for just 3 epochs per character, our proposed method spelled fewer characters incorrectly (about 6 on average) than the number of characters used during the supervised training (16 for each subject).

4 Conclusion

In this work we set out to build a P300-based BCI which is able to produce accurate spelling for a novel subject without any form of training session. This is made possible by incorporating both inter-subject information and language models directly into an unsupervised classifier. 
This yields a coherent probabilistic model which quickly adapts to unseen subjects by exploiting several forms of prior information. This contrasts with supervised methods, which all require a time-consuming training session. There are only a few other unsupervised approaches to P300 spelling, and they either need a warm-up period during which the speller is unreliable or need labeled data to initialize the adaptive speller. We compared our method to the original unsupervised speller proposed in [9] and have shown that, unlike that speller, our approach works instantly. Furthermore, our final experiments demonstrated that the proposed method can compete with state-of-the-art subject-specific, supervisedly trained classifiers [7], even when those classifiers incorporate a language model.

Acknowledgments

This work was partially funded by the Ghent University Special Research Fund under the BOF-GOA project Home-MATE.

References

[1] C. M. Bishop. Pattern Recognition and Machine Learning (Information Science and Statistics). Springer, 1st edition, 2007.

[2] B. Blankertz, K.-R. Muller, G. Curio, T. M. Vaughan, G. Schalk, J. R. Wolpaw, A. Schlogl, C. Neuper, G. Pfurtscheller, T. Hinterberger, M. Schroder, and N. Birbaumer. The BCI competition 2003: progress and perspectives in detection and discrimination of EEG single trials. IEEE Trans. on Biomedical Engineering, 51(6):1044–1051, June 2004.

[3] B. Blankertz, K.-R. Muller, D. J. Krusienski, G. Schalk, J. R. Wolpaw, A. Schlogl, G. Pfurtscheller, J. d. R. Millan, M. Schroder, and N. Birbaumer. The BCI competition III: validating alternative approaches to actual BCI problems. IEEE Trans. on Neural Systems and Rehabilitation Engineering, 14(2):153–159, June 2006.

[4] S. F. Chen and J. Goodman. An empirical study of smoothing techniques for language modeling. Computer Speech & Language, 13(4):359–393, 1999.

[5] A. P. Dempster, N. M. Laird, and D. B. Rubin.
Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B (Methodological), 39(1):1–38, 1977.

[6] L. A. Farwell and E. Donchin. Talking off the top of your head: toward a mental prosthesis utilizing event-related brain potentials. Electroencephalography and Clinical Neurophysiology, 70(6):510–523, 1988.

[7] U. Hoffmann, J.-M. Vesin, T. Ebrahimi, and K. Diserens. An efficient P300-based brain-computer interface for disabled subjects. Journal of Neuroscience Methods, 167(1):115–125, 2008.

[8] C. Kemp, A. Perfors, and J. B. Tenenbaum. Learning overhypotheses with hierarchical Bayesian models. Developmental Science, 10(3):307–321, 2007.

[9] P.-J. Kindermans, D. Verstraeten, and B. Schrauwen. A Bayesian model for exploiting application constraints to enable unsupervised training of a P300-based BCI. PLoS ONE, 7(4):e33758, 2012.

[10] Y. Li, C. Guan, H. Li, and Z. Chin. A self-training semi-supervised SVM algorithm and its application in an EEG-based brain computer interface speller system. Pattern Recognition Letters, 29(9):1285–1294, 2008.

[11] S. Lu, C. Guan, and H. Zhang. Unsupervised brain computer interface based on intersubject information and online adaptation. IEEE Trans. on Neural Systems and Rehabilitation Engineering, 17(2):135–145, 2009.

[12] N. V. Manyakov, N. Chumerin, A. Combaz, and M. M. Van Hulle. Comparison of linear classification methods for P300 brain-computer interface on disabled subjects. BIOSIGNALS, Rome, Italy, pages 328–334, 2011.

[13] R. C. Panicker, S. Puthusserypady, and Ying S. Adaptation in P300 brain-computer interfaces: A two-classifier cotraining approach. IEEE Trans. on Biomedical Engineering, 57(12):2927–2935, December 2010.

[14] G. Schalk, D. J. McFarland, T. Hinterberger, N. Birbaumer, and J. R. Wolpaw.
BCI2000: A general-purpose brain-computer interface (BCI) system. IEEE Trans. on Biomedical Engineering, 51(6):1034–1043, 2004.

[15] W. Speier, C. Arnold, J. Lu, R. K. Taira, and N. Pouratian. Natural language processing with dynamic classification improves P300 speller accuracy and bit rate. Journal of Neural Engineering, 9(1):016004, 2012.

[16] I. Sutskever, J. Martens, and G. Hinton. Generating text with recurrent neural networks. In International Conference on Machine Learning (ICML), 2011.

[17] J. J. Vidal. Toward direct brain-computer communication. Annual Review of Biophysics and Bioengineering, 2(1):157–180, 1973.

[18] Z. Wang, G. Schalk, and Q. Ji. Anatomically constrained decoding of finger flexion from electrocorticographic signals. In J. Shawe-Taylor, R. S. Zemel, P. Bartlett, F. C. N. Pereira, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 24, pages 2070–2078, 2011.

[19] O. Yanez-Suarez, L. Bougrain, C. Saavedra, E. Bojorges, and G. Gentiletti. P300-speller public-domain database, May 2012.