{"title": "Unsupervised Transcription of Piano Music", "book": "Advances in Neural Information Processing Systems", "page_first": 1538, "page_last": 1546, "abstract": "We present a new probabilistic model for transcribing piano music from audio to a symbolic form. Our model reflects the process by which discrete musical events give rise to acoustic signals that are then superimposed to produce the observed data. As a result, the inference procedure for our model naturally resolves the source separation problem introduced by the piano's polyphony. In order to adapt to the properties of a new instrument or acoustic environment being transcribed, we learn recording-specific spectral profiles and temporal envelopes in an unsupervised fashion. Our system outperforms the best published approaches on a standard piano transcription task, achieving a 10.6% relative gain in note onset F1 on real piano audio.", "full_text": "Unsupervised Transcription of Piano Music\n\nTaylor Berg-Kirkpatrick\n\nJacob Andreas\n\nDan Klein\n\nComputer Science Division\n\nUniversity of California, Berkeley\n\n{tberg,jda,klein}@cs.berkeley.edu\n\nAbstract\n\nWe present a new probabilistic model for transcribing piano music from audio to a symbolic form. Our model reflects the process by which discrete musical events give rise to acoustic signals that are then superimposed to produce the observed data. As a result, the inference procedure for our model naturally resolves the source separation problem introduced by the piano\u2019s polyphony. In order to adapt to the properties of a new instrument or acoustic environment being transcribed, we learn recording-specific spectral profiles and temporal envelopes in an unsupervised fashion. 
Our system outperforms the best published approaches on a standard piano transcription task, achieving a 10.6% relative gain in note onset F1 on real piano audio.\n\n1 Introduction\n\nAutomatic music transcription is the task of transcribing a musical audio signal into a symbolic representation (for example, MIDI or sheet music). We focus on the task of transcribing piano music, which is potentially useful for a variety of applications ranging from information retrieval to musicology. This task is extremely difficult for multiple reasons. First, even individual piano notes are quite rich. A single note is not simply a fixed-duration sine wave at an appropriate frequency, but rather a full spectrum of harmonics that rises and falls in intensity. These profiles vary from piano to piano, and therefore must be learned in a recording-specific way. Second, piano music is generally polyphonic, i.e. multiple notes are played simultaneously. As a result, the harmonics of the individual notes can and do collide. In fact, combinations of notes that exhibit ambiguous harmonic collisions are particularly common in music, because consonances sound pleasing to listeners. This polyphony creates a source-separation problem at the heart of the transcription task.\n\nIn our approach, we learn the timbral properties of the piano being transcribed (i.e. the spectral and temporal shapes of each note) in an unsupervised fashion, directly from the input acoustic signal. We present a new probabilistic model that describes the process by which discrete musical events give rise to (separate) acoustic signals for each keyboard note, and the process by which these signals are superimposed to produce the observed data. 
Inference over the latent variables in the model yields transcriptions that satisfy an informative prior distribution on the discrete musical structure and at the same time resolve the source-separation problem.\n\nFor the problem of unsupervised piano transcription, where the test instrument is not seen during training, the classic starting point is a non-negative factorization of the acoustic signal\u2019s spectrogram. Most previous work improves on this baseline in one of two ways: either by better modeling the discrete musical structure of the piece being transcribed [1, 2] or by better adapting to the timbral properties of the source instrument [3, 4]. Combining these two kinds of approaches has proven challenging. The standard approach to modeling discrete musical structures\u2014using hidden Markov or semi-Markov models\u2014relies on the availability of fast dynamic programs for inference. Here, coupling these discrete models with timbral adaptation and source separation breaks the conditional independence assumptions that the dynamic programs rely on. In order to avoid this inference problem, past approaches typically defer detailed modeling of discrete structure or timbre to a post-processing step [5, 6, 7].\n\nFigure 1: We transcribe a dataset consisting of R songs produced by a single piano with N notes. For each keyboard note, n, and each song, r, we generate a sequence of musical events, M(nr), parameterized by \u00b5(n). Then, conditioned on M(nr), we generate an activation time series, A(nr), parameterized by \u03b1(n). Next, conditioned on A(nr), we generate a component spectrogram for note n in song r, S(nr), parameterized by \u03c3(n). The observed total spectrogram for song r is produced by superimposing component spectrograms: X(r) = \u2211n S(nr).\n\nWe present the first approach that tackles these discrete and timbral modeling problems jointly. 
We have two primary contributions: first, a new generative model that reflects the causal process underlying piano sound generation in an articulated way, starting with discrete musical structure; second, a tractable approximation to the inference problem over transcriptions and timbral parameters. Our approach achieves state-of-the-art results on the task of polyphonic piano music transcription. On a standard dataset consisting of real piano audio data, annotated with ground-truth onsets, our approach outperforms the best published models for this task on multiple metrics, achieving a 10.6% relative gain in our primary measure of note onset F1.\n\n2 Model\n\nOur model is laid out in Figure 1. It has parallel random variables for each note on the piano keyboard. For now, we illustrate these variables for a single concrete note\u2014say C\u266f in the 4th octave\u2014and in Section 2.4 describe how the parallel components combine to produce the observed audio signal. Consider a single song, divided into T time steps. The transcription will be I musical events long. 
The component model for C\u266f consists of three primary random variables:\n\nM, a sequence of I symbolic musical events, analogous to the locations and values of symbols along the C\u266f staff line in sheet music,\n\nA, a time series of T activations, analogous to the loudness of sound emitted by the C\u266f piano string over time as it peaks and attenuates during each event in M,\n\nS, a spectrogram of T frames, specifying the spectrum of frequencies over time in the acoustic signal produced by the C\u266f string.\n\nFigure 2: Joint distribution on musical events, M(nr), and activations, A(nr), for note n in song r, conditioned on event parameters, \u00b5(n), and envelope parameters, \u03b1(n). The dependence of Ei, Di, and Vi on n and r is suppressed for simplicity.\n\nThe parameters that generate each of these random variables are depicted in Figure 1. First, musical events, M, are generated from a distribution parameterized by \u00b5(C\u266f), which specifies the probability that the C\u266f key is played, how long it is likely to be held for (duration), and how hard it is likely to be pressed (velocity). Next, the activation of the C\u266f string over time, A, is generated conditioned on M from a distribution parameterized by a vector, \u03b1(C\u266f) (see Figure 1), which specifies the shape of the rise and fall of the string\u2019s activation each time the note is played. Finally, the spectrogram, S, is generated conditioned on A from a distribution parameterized by a vector, \u03c3(C\u266f) (see Figure 1), which specifies the frequency distribution of sounds produced by the C\u266f string. As depicted in Figure 3, S is produced from the outer product of \u03c3(C\u266f) and A. 
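In matrix terms, the mean of S is the rank-one outer product just described. A minimal NumPy sketch of this construction (the dimensions and parameter values here are illustrative, not taken from the paper):

```python
import numpy as np

F, T = 6, 8  # illustrative counts of frequency bins and time steps

# Spectral profile sigma for one note (positive, length F) and its
# activation series A over the song (positive, length T).
sigma = np.array([1.0, 0.5, 0.25, 0.12, 0.06, 0.03])
A = np.array([0.0, 2.0, 1.5, 1.0, 0.7, 0.0, 3.0, 2.2])

# Mean of the component spectrogram: outer product, shape (F, T).
# Each column is the spectral profile scaled by that step's activation.
S_mean = np.outer(sigma, A)

assert S_mean.shape == (F, T)
assert np.allclose(S_mean[:, 6], 3.0 * sigma)
```

In the full model these outer-product values do not form S directly; they parameterize a noise distribution that generates S, as described in Section 2.3.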
The joint distribution for the note1 is:\n\nP(S, A, M | \u03c3(C\u266f), \u03b1(C\u266f), \u00b5(C\u266f)) = P(M | \u00b5(C\u266f)) [Event Model, Section 2.1]\n\u00b7 P(A | M, \u03b1(C\u266f)) [Activation Model, Section 2.2]\n\u00b7 P(S | A, \u03c3(C\u266f)) [Spectrogram Model, Section 2.3]\n\nIn the next three sections we give detailed descriptions of each of the component distributions.\n\n2.1 Event Model\n\nOur symbolic representation of musical structure (see Figure 2) is similar to the MIDI format used by musical synthesizers. M consists of a sequence of I random variables representing musical events for the C\u266f piano key: M = (M1, M2, . . . , MI). Each event, Mi, is a tuple consisting of a state, Ei, which is either PLAY or REST, a duration, Di, which is a length in time steps, and a velocity, Vi, which specifies how hard the key was pressed (assuming Ei is PLAY).\n\nThe graphical model for the process that generates M is depicted in Figure 2. The sequence of states, (E1, E2, . . . , EI), is generated from a Markov model. The transition probabilities, \u00b5TRANS, control how frequently the note is played (some notes are more frequent than others). An event\u2019s duration, Di, is generated conditioned on Ei from a distribution parameterized by \u00b5DUR. The durations of PLAY events have a multinomial parameterization, while the durations of REST events are distributed geometrically. An event\u2019s velocity, Vi, is a real number on the unit interval and is generated conditioned on Ei from a distribution parameterized by \u00b5VEL. If Ei = REST, deterministically Vi = 0. 
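As an illustration of this event model, the following sketch samples one note's event sequence. All parameter values (the transition probabilities, duration distributions, and velocity range) are invented for the example; they are not the paper's learned parameters:

```python
import random

random.seed(0)

# Invented event parameters for a single keyboard note (illustrative only).
P_PLAY = {'PLAY': 0.1, 'REST': 0.2}   # mu_TRANS: P(next state = PLAY | current state)
PLAY_DURATIONS = [3, 5, 8, 12]        # mu_DUR: multinomial support for PLAY, in time steps
REST_CONT_P = 0.95                    # mu_DUR: geometric continuation prob. for REST

def sample_events(num_events, state='REST'):
    """Sample a sequence of (state, duration, velocity) musical events."""
    events = []
    for _ in range(num_events):
        # Markov transition on the PLAY/REST state.
        state = 'PLAY' if random.random() < P_PLAY[state] else 'REST'
        if state == 'PLAY':
            duration = random.choice(PLAY_DURATIONS)  # multinomial duration
            velocity = random.random()                # velocity on the unit interval
        else:
            duration = 1                              # geometric duration for REST
            while random.random() < REST_CONT_P:
                duration += 1
            velocity = 0.0                            # rests are silent
        events.append((state, duration, velocity))
    return events

events = sample_events(5)
assert len(events) == 5
assert all(velocity == 0.0 for state, _, velocity in events if state == 'REST')
```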
The complete event parameters for keyboard note C\u266f are \u00b5(C\u266f) = (\u00b5TRANS, \u00b5DUR, \u00b5VEL).\n\n1For notational convenience, we suppress the C\u266f superscripts on M, A, and S until Section 2.4.\n\n2.2 Activation Model\n\nIn an actual piano, when a key is pressed, a hammer strikes a string and a sound with sharply rising volume is emitted. The string continues to emit sound as long as the key remains depressed, but the volume decays since no new energy is being transferred. When the key is released, a damper falls back onto the string, truncating the decay. Examples of this trajectory are depicted in Figure 1 in the graph of activation values. The graph depicts the note being played softly and held, and then being played more loudly, but held for a shorter time. In our model, PLAY events represent hammer strikes on a piano string with raised damper, while REST events represent the lowered damper.\n\nIn our parameterization, the shape of the rise and decay is shared by all PLAY events for a given note, regardless of their duration and velocity. We call this shape an envelope and describe it using a positive vector of parameters. For our running example of C\u266f, this parameter vector is \u03b1(C\u266f) (depicted in Figure 1).\n\nThe time series of activations for the C\u266f string, A, is a positive vector of length T, where T denotes the total length of the song in time steps. Let [A]t be the activation at time step t. As mentioned in Section 2, A may be thought of as roughly representing the loudness of sound emitted by the piano string as a function of time. The process that generates A is depicted in Figure 2. We generate A conditioned on the musical events, M. Each musical event, Mi = (Ei, Di, Vi), produces a segment of activations, Ai, of length Di. 
For PLAY events, Ai will exhibit an increase in activation. For REST events, the activation will remain low. The segments are appended together to make A. The activation values in each segment are generated in a way loosely inspired by piano mechanics. Given \u03b1(C\u266f), we generate the values in segment Ai as follows: \u03b1(C\u266f) is first truncated to duration Di, then is scaled by the velocity of the strike, Vi, and, finally, is used to parameterize an activation noise distribution which generates the segment Ai. Specifically, we add independent Gaussian noise to each dimension after \u03b1(C\u266f) is truncated and scaled. In principle, this choice of noise distribution gives a formally deficient model, since activations are positive, but in practice it performs well and has the benefit of making inference mathematically simple (see Section 3.1).\n\n2.3 Component Spectrogram Model\n\nPiano sounds have a harmonic structure; when played, each piano string primarily emits energy at a fundamental frequency determined by the string\u2019s length, but also at all integer multiples of that frequency (called partials) with diminishing strength. For example, the audio signal produced by the C\u266f string will vary in loudness, but its frequency distribution will remain mostly fixed. We call this frequency distribution a spectral profile. In our parameterization, the spectral profile of C\u266f is specified by a positive spectral parameter vector, \u03c3(C\u266f) (depicted in Figure 1). 
\u03c3(C\u266f) is a vector of length F, where [\u03c3(C\u266f)]f represents the weight of frequency bin f.\n\nIn our model, the audio signal produced by the C\u266f string over the course of the song is represented as a spectrogram, S, which is a positive matrix with F rows, one for each frequency bin, f, and T columns, one for each time step, t (see Figures 1 and 3 for examples). We denote the magnitude of frequency f at time step t as [S]ft. In order to generate the spectrogram (see Figure 3), we first produce a matrix of intermediate values by taking the outer product of the spectral profile, \u03c3(C\u266f), and the activations, A. These intermediate values are used to parameterize a spectrogram noise distribution that generates S. Specifically, for each frequency bin f and each time step t, the corresponding value of the spectrogram, [S]ft, is generated from a noise distribution parameterized by [\u03c3(C\u266f)]f \u00b7 [A]t. In practice, the choice of noise distribution is very important. After examining residuals resulting from fitting mean parameters to actual musical spectrograms, we experimented with various noising assumptions, including multiplicative gamma noise, additive Gaussian noise, log-normal noise, and Poisson noise. Poisson noise performed best. This is consistent with previous findings in the literature, where non-negative matrix factorization using KL divergence (which has a generative interpretation as maximum likelihood inference in a Poisson model [8]) is commonly chosen [7, 2]. Under the Poisson noise assumption, the spectrogram is interpreted as a matrix of (large) integer counts.\n\nFigure 3: Conditional distribution for song r on the observed total spectrogram, X(r), and the component spectrograms for each note, (S(1r), . . . , S(Nr)), given the activations for each note, (A(1r), . . . , A(Nr)), and the spectral parameters for each note, (\u03c3(1), . . . , \u03c3(N)). X(r) is the superposition of the component spectrograms: X(r) = \u2211n S(nr).\n\n2.4 Full Model\n\nSo far we have only looked at the component of the model corresponding to a single note\u2019s contribution to a single song. Our full model describes the generation of a collection of many songs, from a complete instrument with many notes. This full model is diagrammed in Figures 1 and 3. Let a piano keyboard consist of N notes (on a standard piano, N is 88), where n indexes the particular note. Each note, n, has its own set of musical event parameters, \u00b5(n), envelope parameters, \u03b1(n), and spectral parameters, \u03c3(n). Our corpus consists of R songs (\u201crecordings\u201d), where r indexes a particular song. Each pair of note n and song r has its own musical events variable, M(nr), activations variable, A(nr), and component spectrogram, S(nr). The full spectrogram for song r, which is the observed input, is denoted as X(r). Our model generates X(r) by superimposing the component spectrograms: X(r) = \u2211n S(nr). Going forward, we will need notation to group together variables across all N notes: let \u00b5 = (\u00b5(1), . . . , \u00b5(N)), \u03b1 = (\u03b1(1), . . . , \u03b1(N)), and \u03c3 = (\u03c3(1), . . . , \u03c3(N)). Also let M(r) = (M(1r), . . . , M(Nr)), A(r) = (A(1r), . . . , A(Nr)), and S(r) = (S(1r), . . . , S(Nr)).\n\n3 Learning and Inference\n\nOur goal is to estimate the unobserved musical events for each song, M(r), as well as the unknown envelope and spectral parameters of the piano that generated the data, \u03b1 and \u03c3. 
Our inference will estimate both, though our output is only the musical events, which specify the final transcription. Because MIDI sample banks (piano synthesizers) are readily available, we are able to provide the system with samples from generic pianos (but not from the piano being transcribed). We also give the model information about the distribution of notes in real musical scores by providing it with an external corpus of symbolic music data. Specifically, the following information is available to the model during inference: 1) the total spectrogram for each song, X(r), which is the input, 2) the event parameters, \u00b5, which we estimate by collecting statistics on note occurrences in the external corpus of symbolic music, and 3) truncated normal priors on the envelopes and spectral profiles, \u03b1 and \u03c3, which we extract from the MIDI samples.\n\nLet \u00afM = (M(1), . . . , M(R)), \u00afA = (A(1), . . . , A(R)), and \u00afS = (S(1), . . . , S(R)) be the tuples of event, activation, and spectrogram variables across all notes n and songs r. We would like to compute the posterior distribution on \u00afM, \u03b1, and \u03c3. However, marginalizing over the activations \u00afA couples the discrete musical structure with the superposition process of the component spectrograms in an intractable way. We instead approximate the joint MAP estimates of \u00afM, \u00afA, \u03b1, and \u03c3 via iterated conditional modes [9], only marginalizing over the component spectrograms, \u00afS. 
Specifically, we perform the following optimization via block-coordinate ascent:\n\nmax over \u00afM, \u00afA, \u03b1, \u03c3 of  \u220f_r [ \u2211_{S(r)} P(X(r), S(r), A(r), M(r) | \u00b5, \u03b1, \u03c3) ] \u00b7 P(\u03b1, \u03c3)\n\nThe updates for each group of variables are described in the following sections: \u00afM in Section 3.1, \u03b1 in Section 3.2, \u00afA in Section 3.3, and \u03c3 in Section 3.4.\n\n3.1 Updating Events\n\nWe update \u00afM to maximize the objective while the other variables are held fixed. The joint distribution on \u00afM and \u00afA is a hidden semi-Markov model [10]. Given the optimal velocity for each segment of activations, the computation of the emission potentials for the semi-Markov dynamic program is straightforward and the update over \u00afM can be performed exactly and efficiently. We let the distribution of velocities for PLAY events be uniform. This choice, together with the choice of Gaussian activation noise, yields a simple closed-form solution for the optimal setting of the velocity variable V(nr)i. Let [\u03b1(n)]j denote the jth value of the envelope vector \u03b1(n), and let [A(nr)i]j be the jth entry of the segment of A(nr) generated by event M(nr)i. The velocity that maximizes the activation segment\u2019s probability is given by:\n\nV(nr)i = ( \u2211_{j=1}^{D(nr)i} [\u03b1(n)]j \u00b7 [A(nr)i]j ) / ( \u2211_{j=1}^{D(nr)i} [\u03b1(n)]j\u00b2 )\n\n3.2 Updating Envelope Parameters\n\nGiven settings of the other variables, we update the envelope parameters, \u03b1, to optimize the objective. The truncated normal prior on \u03b1 admits a closed-form update. Let I(j, n, r) = {i : D(nr)i \u2265 j}, the set of event indices for note n in song r with durations of at least j time steps. 
Let [\u03b1(n)0]j be the location parameter for the prior on [\u03b1(n)]j, and let \u03b2 be the scale parameter (which is shared across all n and j). The update for [\u03b1(n)]j is given by:\n\n[\u03b1(n)]j = ( \u2211_r \u2211_{i\u2208I(j,n,r)} V(nr)i \u00b7 [A(nr)i]j + (1/(2\u03b2\u00b2)) [\u03b1(n)0]j ) / ( \u2211_r \u2211_{i\u2208I(j,n,r)} (V(nr)i)\u00b2 + 1/(2\u03b2\u00b2) )\n\n3.3 Updating Activations\n\nIn order to update the activations, \u00afA, we optimize the objective with respect to \u00afA, with the other variables held fixed. The choice of Poisson noise for generating each of the component spectrograms, S(nr), means that the conditional distribution of the total spectrogram for each song, X(r) = \u2211n S(nr), with S(r) marginalized out, is also Poisson. Specifically, the distribution of [X(r)]ft is Poisson with mean \u2211n ( [\u03c3(n)]f \u00b7 [A(nr)]t ). Optimizing the probability of X(r) under this conditional distribution with respect to A(r) corresponds to computing the supervised NMF using KL divergence [7]. However, to perform the correct update in our model, we must also incorporate the distribution of A(r), and so, instead of using the standard multiplicative NMF updates, we use exponentiated gradient ascent [11] on the log objective. Let L denote the log objective, let \u02dc\u03b1(n, r, t) denote the velocity-scaled envelope value used to generate the activation value [A(nr)]t, and let \u03b3\u00b2 denote the variance parameter for the Gaussian activation noise. 
The gradient of the log objective with respect to [A(nr)]t is:\n\n\u2202L/\u2202[A(nr)]t = \u2211_f [ ( [X(r)]ft \u00b7 [\u03c3(n)]f ) / ( \u2211_{n\u2032} [\u03c3(n\u2032)]f \u00b7 [A(n\u2032r)]t ) \u2212 [\u03c3(n)]f ] \u2212 (1/\u03b3\u00b2) ( [A(nr)]t \u2212 \u02dc\u03b1(n, r, t) )\n\n3.4 Updating Spectral Parameters\n\nThe update for the spectral parameters, \u03c3, is similar to that of the activations. Like the activations, \u03c3 is part of the parameterization of the Poisson distribution on each X(r). We again use exponentiated gradient ascent. Let [\u03c3(n)0]f be the location parameter of the prior on [\u03c3(n)]f, and let \u03be be the scale parameter (which is shared across all n and f). The gradient of the log objective with respect to [\u03c3(n)]f is given by:\n\n\u2202L/\u2202[\u03c3(n)]f = \u2211_{(r,t)} [ ( [X(r)]ft \u00b7 [A(nr)]t ) / ( \u2211_{n\u2032} [\u03c3(n\u2032)]f \u00b7 [A(n\u2032r)]t ) \u2212 [A(nr)]t ] \u2212 (1/\u03be\u00b2) ( [\u03c3(n)]f \u2212 [\u03c3(n)0]f )\n\n4 Experiments\n\nBecause polyphonic transcription is so challenging, much of the existing literature has either worked with synthetic data [12] or assumed access to the test instrument during training [5, 6, 13, 7]. As our ultimate goal is the transcription of arbitrary recordings from real, previously-unseen pianos, we evaluate in an unsupervised setting, on recordings from an acoustic piano not observed in training.\n\nData We evaluate on the MIDI-Aligned Piano Sounds (MAPS) corpus [14]. 
This corpus includes a collection of piano recordings from a variety of time periods and styles, performed by a human player on an acoustic \u201cDisklavier\u201d piano equipped with electromechanical sensors under the keys. The sensors make it possible to transcribe directly into MIDI while the instrument is in use, providing a ground-truth transcript to accompany the audio for the purpose of evaluation. In keeping with much of the existing music transcription literature, we use the first 30 seconds of each of the 30 ENSTDkAm recordings as a development set, and the first 30 seconds of each of the 30 ENSTDkCl recordings as a test set. We also assume access to a collection of synthesized piano sounds for parameter initialization, which we take from the MIDI portion of the MAPS corpus, and a large collection of symbolic music data from the IMSLP library [15, 16], used to estimate the event parameters in our model.\n\nPreprocessing We represent the input audio as a magnitude spectrum short-time Fourier transform with a 4096-frame window and a hop size of 512 frames, similar to the approach used by Weninger et al. [7]. We temporally downsample the resulting spectrogram by a factor of 2, taking the maximum magnitude over collapsed bins. The input audio is recorded at 44.1 kHz and the resulting spectrogram has 23ms frames.\n\nInitialization and Learning We estimate initializers and priors for the spectral parameters, \u03c3, and envelope parameters, \u03b1, by fitting isolated, synthesized piano sounds. We collect these isolated sounds from the MIDI portion of MAPS, and average the parameter values across several synthesized pianos. We estimate the event parameters \u00b5 by counting note occurrences in the IMSLP data. 
At decode time, to fit the spectral and envelope parameters and predict transcriptions, we run 5 iterations of the block-coordinate ascent procedure described in Section 3.\n\nEvaluation We report two standard measures of performance: an onset evaluation, in which a predicted note is considered correct if it falls within 50ms of a note in the true transcription, and a frame-level evaluation, in which each transcription is converted to a boolean matrix specifying which notes are active at each time step, discretized to 10ms frames. Each entry is compared to the corresponding entry in the true matrix. Frame-level evaluation is sensitive to offsets as well as onsets, but does not capture the fact that note onsets have greater musical significance than do offsets. As is standard, we report precision (P), recall (R), and F1-measure (F1) for each of these metrics.\n\n4.1 Comparisons\n\nWe compare our system to three state-of-the-art unsupervised systems: the hidden semi-Markov model described by Benetos and Weyde [2] and the spectrally-constrained factorization models described by Vincent et al. [3] and O\u2019Hanlon and Plumbley [4]. To our knowledge, Benetos and Weyde [2] report the best published onset results for this dataset, and O\u2019Hanlon and Plumbley [4] report the best frame-level results.\n\nThe literature also includes a number of supervised approaches to this task. In these approaches, a model is trained on annotated recordings from a known instrument. While best performance is achieved when testing on the same instrument used for training, these models can also achieve reasonable performance when applied to new instruments. Thus, we also compare to a discriminative baseline, a simplified reimplementation of a state-of-the-art supervised approach [7] which achieves slightly better performance than the original on this task. This system only produces note onsets, and therefore is not evaluated at the frame level. 
We train the discriminative baseline on synthesized audio with ground-truth MIDI annotations, and apply it directly to our test instrument, which the system has never seen before.\n\nSystem             | Onsets P | Onsets R | Onsets F1 | Frames P | Frames R | Frames F1\nDiscriminative [7] |   76.8   |   65.1   |   70.4    |    -     |    -     |    -\nBenetos [2]        |    -     |    -     |   68.6    |    -     |    -     |   68.0\nVincent [3]        |   62.7   |   76.8   |   69.0    |   79.6   |   63.6   |   70.7\nO\u2019Hanlon [4]       |   48.6   |   73.0   |   58.3    |   73.4   |   72.8   |   73.2\nThis work          |   78.1   |   74.7   |   76.4    |   69.1   |   80.7   |   74.4\n\nTable 1: Unsupervised transcription results on the MAPS corpus. \u201cOnsets\u201d columns show scores for identification (within \u00b150ms) of note start times. \u201cFrames\u201d columns show scores for 10ms frame-level evaluation. Our system achieves state-of-the-art results on both metrics.2\n\n4.2 Results\n\nOur model achieves the best published numbers on this task: as shown in Table 1, it achieves an onset F1 of 76.4, which corresponds to a 10.6% relative gain over the onset F1 achieved by the system of Vincent et al. [3], the top-performing unsupervised baseline on this metric. Surprisingly, the discriminative baseline [7], which was not developed for the unsupervised task, outperforms all the unsupervised baselines in terms of onset evaluation, achieving an F1 of 70.4. Evaluated on frames, our system achieves an F1 of 74.4, corresponding to a more modest 1.6% relative gain over the system of O\u2019Hanlon and Plumbley [4], which is the best performing baseline on this metric.\n\nThe surprisingly competitive discriminative baseline shows that it is possible to achieve high onset accuracy on this task without adapting to the test instrument. Thus, it is reasonable to ask how much of the gain our model achieves is due to its ability to learn instrument timbre. 
If we skip the block-coordinate ascent updates (Section 3) for the envelope and spectral parameters, and thus prevent our system from adapting to the test instrument, onset F1 drops from 76.4 to 72.6. This result indicates that learning instrument timbre does indeed help performance.\n\nAs a short example of our system\u2019s behavior, Figure 4 shows our system\u2019s output passed through a commercially-available MIDI-to-sheet-music converter. This example was chosen because its onset F1 of 75.5 and error types are broadly representative of the system\u2019s performance on our data. The resulting score has musically plausible errors.\n\nFigure 4: Result of passing our system\u2019s prediction and the reference transcription MIDI through the GarageBand MIDI-to-sheet-music converter. This is a transcription of the first three bars of Schumann\u2019s Hobgoblin.\n\nA careful inspection of the system\u2019s output suggests that a large fraction of errors are either off by an octave (i.e. the frequency of the predicted note is half or double the correct frequency) or are segmentation errors (in which a single key press is transcribed as several consecutive key presses). While these are tricky errors to correct, they may also be relatively harmless for some applications because they are not detrimental to musical perception: converting the transcriptions back to audio using a synthesizer yields music that is qualitatively quite similar to the original recordings.\n\n5 Conclusion\n\nWe have shown that combining unsupervised timbral adaptation with a detailed model of the generative relationship between piano sounds and their transcriptions can yield state-of-the-art performance. We hope that these results will motivate further joint approaches to unsupervised music transcription. 
Paths forward include exploring more nuanced timbral parameterizations and developing more sophisticated models of discrete musical structure.\n\n2For consistency we re-ran all systems in this table with our own evaluation code (except for the system of Benetos and Weyde [2], for which numbers are taken from the paper). For O\u2019Hanlon and Plumbley [4], scores are higher than the authors themselves report; this is due to an extra post-processing step suggested by O\u2019Hanlon in personal correspondence.\n\nReferences\n\n[1] Masahiro Nakano, Yasunori Ohishi, Hirokazu Kameoka, Ryo Mukai, and Kunio Kashino. Bayesian nonparametric music parser. In IEEE International Conference on Acoustics, Speech and Signal Processing, 2012.\n\n[2] Emmanouil Benetos and Tillman Weyde. Explicit duration hidden Markov models for multiple-instrument polyphonic music transcription. In International Society for Music Information Retrieval, 2013.\n\n[3] Emmanuel Vincent, Nancy Bertin, and Roland Badeau. Adaptive harmonic spectral decomposition for multiple pitch estimation. IEEE Transactions on Audio, Speech, and Language Processing, 2010.\n\n[4] Ken O\u2019Hanlon and Mark D. Plumbley. Polyphonic piano transcription using non-negative matrix factorisation with group sparsity. In IEEE International Conference on Acoustics, Speech, and Signal Processing, 2014.\n\n[5] Graham E. Poliner and Daniel P. W. Ellis. A discriminative model for polyphonic piano transcription. EURASIP Journal on Advances in Signal Processing, 2007.\n\n[6] C. G. van de Boogaart and R. Lienhart. Note onset detection for the transcription of polyphonic piano music. In Multimedia and Expo (ICME). IEEE, 2009.\n\n[7] Felix Weninger, Christian Kirst, Bjorn Schuller, and Hans-Joachim Bungartz. A discriminative approach to polyphonic piano note transcription using supervised non-negative matrix factorization. 
In IEEE International Conference on Acoustics, Speech and Signal Processing, 2013.\n\n[8] Paul H. Peeling, Ali Taylan Cemgil, and Simon J. Godsill. Generative spectrogram factorization models for polyphonic piano transcription. IEEE Transactions on Audio, Speech, and Language Processing, 2010.\n\n[9] Julian Besag. On the statistical analysis of dirty pictures. Journal of the Royal Statistical Society, 1986.\n\n[10] Stephen Levinson. Continuously variable duration hidden Markov models for automatic speech recognition. Computer Speech & Language, 1986.\n\n[11] Jyrki Kivinen and Manfred K. Warmuth. Exponentiated gradient versus gradient descent for linear predictors. Information and Computation, 1997.\n\n[12] Matti P. Ryynanen and Anssi Klapuri. Polyphonic music transcription using note event modeling. In IEEE Workshop on Applications of Signal Processing to Audio and Acoustics, 2005.\n\n[13] Sebastian B\u00f6ck and Markus Schedl. Polyphonic piano note transcription with recurrent neural networks. In IEEE International Conference on Acoustics, Speech and Signal Processing, 2012.\n\n[14] Valentin Emiya, Roland Badeau, and Bertrand David. Multipitch estimation of piano sounds using a new probabilistic spectral smoothness principle. IEEE Transactions on Audio, Speech, and Language Processing, 2010.\n\n[15] The International Music Score Library Project, June 2014. URL http://imslp.org.\n\n[16] Vladimir Viro. Peachnote: Music score search and analysis platform. In The International Society for Music Information Retrieval, 2011.\n", "award": [], "sourceid": 819, "authors": [{"given_name": "Taylor", "family_name": "Berg-Kirkpatrick", "institution": "UC Berkeley"}, {"given_name": "Jacob", "family_name": "Andreas", "institution": "UC Berkeley"}, {"given_name": "Dan", "family_name": "Klein", "institution": "UC Berkeley"}]}