{"title": "Deep Recurrent Neural Network-Based Identification of Precursor microRNAs", "book": "Advances in Neural Information Processing Systems", "page_first": 2891, "page_last": 2900, "abstract": "MicroRNAs (miRNAs) are small non-coding ribonucleic acids (RNAs) which play key roles in post-transcriptional gene regulation. Direct identification of mature miRNAs is infeasible due to their short lengths, and researchers instead aim at identifying precursor miRNAs (pre-miRNAs). Many of the known pre-miRNAs have distinctive stem-loop secondary structure, and structure-based filtering is usually the first step to predict the possibility of a given sequence being a pre-miRNA. To identify new pre-miRNAs that often have non-canonical structure, however, we need to consider additional features other than structure. To obtain such additional characteristics, existing computational methods rely on manual feature extraction, which inevitably limits the efficiency, robustness, and generalization of computational identification. To address the limitations of existing approaches, we propose a pre-miRNA identification method that incorporates (1) a deep recurrent neural network (RNN) for automated feature learning and classification, (2) multimodal architecture for seamless integration of prior knowledge (secondary structure), (3) an attention mechanism for improving long-term dependence modeling, and (4) an RNN-based class activation mapping for highlighting the learned representations that can contrast pre-miRNAs and non-pre-miRNAs. In our experiments with recent benchmarks, the proposed approach outperformed the compared state-of-the-art alternatives in terms of various performance metrics.", "full_text": "Deep Recurrent Neural Network-Based Identi\ufb01cation\n\nof Precursor microRNAs\n\nSeunghyun Park\n\nSeonwoo Min\n\nElectrical and Computer Engineering\n\nElectrical and Computer Engineering\n\nSeoul National University\n\nSeoul 08826, Korea\n\nSchool of Electrical Engineering\n\nSeoul National University\n\nSeoul 08826, Korea\n\nKorea University\nSeoul 02841, Korea\n\nHyun-Soo Choi\n\nSeoul National University\n\nSeoul 08826, Korea\n\nElectrical and Computer Engineering\n\nElectrical and Computer Engineering\n\nSungroh Yoon\u2217\n\nSeoul National University\n\nSeoul 08826, Korea\nsryoon@snu.ac.kr\n\nAbstract\n\nMicroRNAs (miRNAs) are small non-coding ribonucleic acids (RNAs) which play\nkey roles in post-transcriptional gene regulation. Direct identi\ufb01cation of mature\nmiRNAs is infeasible due to their short lengths, and researchers instead aim at iden-\ntifying precursor miRNAs (pre-miRNAs). Many of the known pre-miRNAs have\ndistinctive stem-loop secondary structure, and structure-based \ufb01ltering is usually\nthe \ufb01rst step to predict the possibility of a given sequence being a pre-miRNA. To\nidentify new pre-miRNAs that often have non-canonical structure, however, we\nneed to consider additional features other than structure. To obtain such additional\ncharacteristics, existing computational methods rely on manual feature extraction,\nwhich inevitably limits the ef\ufb01ciency, robustness, and generalization of computa-\ntional identi\ufb01cation. 
To address the limitations of existing approaches, we propose a pre-miRNA identification method that incorporates (1) a deep recurrent neural network (RNN) for automated feature learning and classification, (2) multimodal architecture for seamless integration of prior knowledge (secondary structure), (3) an attention mechanism for improving long-term dependence modeling, and (4) an RNN-based class activation mapping for highlighting the learned representations that can contrast pre-miRNAs and non-pre-miRNAs. In our experiments with recent benchmarks, the proposed approach outperformed the compared state-of-the-art alternatives in terms of various performance metrics.\n\n\u2217To whom correspondence should be addressed.\n\n31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.\n\n1 Introduction\n\nMicroRNAs (miRNAs) play crucial roles in post-transcriptional gene regulation by binding to the 3\u2032 untranslated region of target messenger RNAs [16]. Among the research problems related to miRNA, computational identification of miRNAs has been one of the most significant. The biogenesis of a miRNA consists of the primary miRNA stage, the precursor miRNA (pre-miRNA) stage, and the mature miRNA stage [17]. Mature miRNAs are usually short, having only 20\u201323 base pairs (bp), and it is difficult to identify them directly. Most computational approaches focus on detecting pre-miRNAs since they are usually more identifiable because they are longer (approximately 80 bp) and have a distinctive stem-loop secondary structure.\nIn terms of machine learning (ML), pre-miRNA identification can be viewed as a binary classification problem in which a given sequence must be classified as either a pre-miRNA or a non-pre-miRNA. A variety of computational approaches for miRNA identification have been proposed, and we can broadly classify them [18] into rule-based approaches such as MIReNA [14], and ML-based approaches, which can be categorized into three groups in terms of the classification algorithm used: (1) MiPred [12], microPred [11], triplet-SVM [15], iMiRNA-SSF [38], miRNApre [39], and miRBoost [7] use support vector machines; (2) MiRANN [1] and DP-miRNA [37] use neural networks; and (3) CSHMM [13] uses a context-sensitive hidden Markov model.\nKnown pre-miRNAs have distinctive structural characteristics, and therefore most computational methods make first-order decisions based on the secondary structure of the input RNA sequence. However, the identification of new pre-miRNAs with non-canonical structure, subtle properties, or both requires the consideration of features other than secondary structure. Some authors [19] have even argued that the performance of ML-based tools is more dependent on the set of input features than on the ML algorithms that are used.\nThe discovery of new features which are effective in pre-miRNA identification currently involves either searching for hand-crafted features (such as the frequency of nucleotide triplets in the loop, global and intrinsic folding attributes, stem length, and minimum free energy) or combining existing features. One recent study utilized 187 such features [7], another 48 features [11], most of which were manually prepared. Manual feature extraction requires ingenuity and inevitably limits the efficiency, robustness, and generalization of the resulting identification scheme. 
Furthermore, the neural network-based methods mentioned above use neural networks only for the classification of hand-designed features, not for feature learning.\nSimilar challenges exist in other disciplines. Recently, end-to-end deep learning approaches have been successfully applied to tasks such as speech and image recognition, largely eliminating manual feature engineering. Motivated by these successes, we propose a deep neural network-based pre-miRNA identification method, which we call deepMiRGene, to address the limitations of existing approaches. It incorporates the following key components:\n\n1. A deep recurrent neural network (RNN) with long short-term memory (LSTM) units for RNA sequence modeling, automated feature learning, and robust classification based on the learned representations.\n\n2. A multimodal architecture for seamless integration of prior knowledge (such as the importance of RNA secondary structure in pre-miRNA identification) with automatically learned features.\n\n3. An attention mechanism for effective modeling of the long-term dependence of the primary structure (i.e., sequence) and the secondary structure of RNA molecules.\n\n4. An RNN-based class activation mapping (CAM) to highlight the learned representations in a way that contrasts pre-miRNAs and non-pre-miRNAs to obtain biological insight.\n\nWe found that simply combining existing deep learning modules did not deliver satisfactory performance in our task. Thus our contribution can be seen as the invention of a novel pipeline with components optimized for handling RNA sequences and structures to predict (possibly subtle) pre-miRNA signals, rather than the mere assembly of pre-packaged components. Our search for an optimal set of RNN architectures and hyperparameters for pre-miRNA identification involved an exploration of the design space spanned by the components of our methodology. The result of this search is a technique with demonstrable advantages over other state-of-the-art alternatives in terms of both cross-validation results and generalization ability (i.e., performance on test data). The source code for the proposed method is available at https://github.com/eleventh83/deepMiRGene.\n\n2 Related Work\n\n2.1 The Secondary Structure of a Pre-miRNA\n\nFigure 1: (A) sequence of a pre-miRNA, and (B) the secondary structure of the given sequence. The dot-bracket notation in (A) describes RNA secondary structures. Unpaired nucleotides are represented as \u201c.\u201ds and base-paired nucleotides are represented as \u201c(\u201ds and \u201c)\u201ds. (The example shown is 5\u2032-ACGUGCCACGAUUCAACGUGGCACAG-3\u2032, which folds into ..((((((((......)))))))).. .)\n\nThe secondary structure of an RNA transcript represents the base-pairing interactions within that transcript. The usual secondary structure of a pre-miRNA is shown in Fig. 1, which shows that a pre-miRNA is a base-paired double helix rather than a single strand, and this pairing is one of the most prominent features for pre-miRNA identification [12, 11]. The secondary structure of a given sequence can be predicted by tools such as RNAfold [5], which is widely used. It constructs a thermodynamically stable secondary structure from a given RNA sequence by calculating the minimum free energy and the probable base-pairings [20]. 
However, reliable pre-miRNA identification requires features other than the secondary structure to be considered, since false positives may be generated due to the limitations of structure prediction algorithms and the inherent unpredictability of these structures [21].\n\n2.2 Deep Recurrent Neural Networks\n\nRNNs are frequently used for sequential modeling and learning. RNNs process one element of input data at a time and implicitly store past information using cyclic connections of hidden units [8]. However, early RNNs often had difficulty in learning long-term dependencies because of the vanishing or exploding gradient problem [9]. Recent deep RNNs incorporate mechanisms to address this problem. Explicit memory units, such as LSTM units [10] or GRUs [3], are one such mechanism. An LSTM unit, for example, works as a sophisticated hidden unit that uses multiplicative gates to learn when to input, output, and forget, in addition to cyclic connections to store the state vector. A more recent innovation [2] is the attention mechanism. This can take various forms, but in our system, a weighted combination of the output vectors at each point in time replaces the single final output vector of a standard RNN. An attention mechanism of this sort helps learn long-term dependencies and also facilitates the interpretation of results, e.g., by showing how closely the output at a specific time point is related to the final output [30, 29, 2].\n\n3 Methodology\n\nFig. 2 shows the proposed methodology of our system. The input consists of either a set of pre-miRNA sequences (in the training phase) or a test sequence (in the testing phase). The output for each input sequence is a two-dimensional (softmax) vector which indicates whether the input sequence encodes a pre-miRNA or not. In a preprocessing phase, we derive the secondary structure of the input sequence and then encode the sequence and its structure together into a 16-dimensional binary vector. Encoded vectors are then processed by the RNN architecture, consisting of LSTM layers and fully connected (FC) layers, and the attention mechanism. The pseudocode of our approach is available as Appendix A, in the supplementary material.\n\nFigure 2: Overview of our method: #sample is the number of input sequences and lseq is the maximum length of the input sequences. The dimensions of intermediate data are labeled, e.g., (#sample, lseq, 16). (Pipeline, as drawn: one-hot encoding \u2192 two LSTM layers with dropout 0.1, the first producing (#sample, lseq, 20) \u2192 attention (softmax) with states merging \u2192 flattening to (#sample, lseq\u00d710) \u2192 three FC layers with weight shapes (lseq\u00d710)\u00d7400, 400\u00d7100, and 100\u00d72, each with dropout 0.1 \u2192 softmax output (#sample, 2).)\n\n3.1 Preprocessing\n\nPreprocessing a set of input pre-miRNA sequences involves two tasks. First, RNAfold is used to obtain the secondary structure of each sequence; we already noted the importance of this data. We represent each position in an RNA sequence as one of {A, C, G, U}, and the corresponding position in the secondary structure as one of {(, ), ., :}. This dot-bracket notation is shown in Fig. 1. The symbol \u201c:\u201d represents a position inside a loop (unpaired nucleotides surrounded by a stem).
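To make this folding step concrete, the following Python sketch (ours, for illustration; not the released deepMiRGene code) derives the dot-bracket structure, assuming the ViennaRNA Python bindings are installed. The relabeling of hairpin-loop interiors as \u201c:\u201d is our heuristic reading of the notation above, not the paper's exact procedure.\n\nimport RNA  # ViennaRNA Python bindings (assumption: pip install ViennaRNA)\n\ndef fold_with_loops(seq):\n    # Minimum-free-energy structure in plain dot-bracket notation\n    structure, mfe = RNA.fold(seq)\n    chars = list(structure)\n    # Heuristic: an unpaired position whose nearest brackets are '(' on the left\n    # and ')' on the right lies inside a hairpin loop; relabel it as ':'.\n    for i in range(len(chars)):\n        if chars[i] != '.':\n            continue\n        left = next((c for c in reversed(chars[:i]) if c not in '.:'), None)\n        right = next((c for c in chars[i + 1:] if c not in '.:'), None)\n        if left == '(' and right == ')':\n            chars[i] = ':'\n    return ''.join(chars), mfe\n\nprint(fold_with_loops('ACGUGCCACGAUUCAACGUGGCACAG'))  # toy sequence from Fig. 1\n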
Let xs and xt denote an input sequence and its secondary structure, respectively. Then, xs \u2208 {A, C, G, U}^|xs| and xt \u2208 {(, ), ., :}^|xt|; note that |xs| = |xt|.\nNext, each input sequence xs is combined with its secondary structure xt into a numerical representation. This is a simple one-hot encoding [4], which gave better results in our experiments than a soft encoding (see Section 4). Our encoding scheme uses a 16-dimensional one-hot vector, in which position i (i = 0, 1, . . . , 15) is interpreted as follows:\n\n\u230ai/4\u230b = 0 \u2192 A    i % 4 = 0 \u2192 (\n\u230ai/4\u230b = 1 \u2192 C    i % 4 = 1 \u2192 )\n\u230ai/4\u230b = 2 \u2192 G    i % 4 = 2 \u2192 .\n\u230ai/4\u230b = 3 \u2192 U    i % 4 = 3 \u2192 :\n\nThe \u201c%\u201d symbol denotes the modulus operator. After preprocessing, the sequence xs and the structure xt are together represented by the matrix Xs \u2208 {0, 1}^(|xs|\u00d716), each row of which is the 16-dimensional one-hot vector described above. For instance, xs = AUG and xt = (:) are represented by the following 3 \u00d7 16 binary matrix, in which the four columns of each nucleotide block correspond to (, ), ., and :, respectively:\n\n          A            C            G            U\nXs = [ 1 0 0 0      0 0 0 0      0 0 0 0      0 0 0 0\n       0 0 0 0      0 0 0 0      0 0 0 0      0 0 0 1\n       0 0 0 0      0 0 0 0      0 1 0 0      0 0 0 0 ]
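This encoding takes only a few lines; the following NumPy sketch is illustrative and is not taken from the authors' implementation.\n\nimport numpy as np\n\nBASE_INDEX = {'A': 0, 'C': 1, 'G': 2, 'U': 3}\nSTRUCT_INDEX = {'(': 0, ')': 1, '.': 2, ':': 3}\n\ndef one_hot(seq, struct):\n    # Joint 16-dim one-hot encoding of Section 3.1: i = 4 * base + structure symbol\n    assert len(seq) == len(struct)\n    X = np.zeros((len(seq), 16), dtype=np.int8)\n    for row, (b, s) in enumerate(zip(seq, struct)):\n        X[row, 4 * BASE_INDEX[b] + STRUCT_INDEX[s]] = 1\n    return X\n\nprint(np.argmax(one_hot('AUG', '(:)'), axis=1))  # -> [ 0 15  9], matching the matrix above\n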
3.2 Neural Network Architecture\n\nThe main components of our neural network are the LSTM layers, the attention mechanism, and the FC layers.\n\n1) LSTM layers: The purpose of these layers is sequential modeling of the primary and secondary structure of the input pre-miRNA transcripts. We use two stacked LSTM layers, denoted by L1^LSTM and L2^LSTM, respectively. L1^LSTM takes the matrix Xs produced in the preprocessing stage and returns a weight matrix H1, as follows:\n\nH1 = L1^LSTM(Xs) \u2208 R^(|xs|\u00d7d1),   (1)\n\nwhere d1 is the number of LSTM units in the first layer. Similarly, the second layer returns a second weight matrix H2:\n\nH2 = L2^LSTM(H1) \u2208 R^(|xs|\u00d7d2),   (2)\n\nwhere d2 is the number of LSTM units in the second layer.\nWe apply an attention mechanism to the output of L2^LSTM with the aim of learning the importance of each position of xs. The neural network first learns an attention weight for each output of the second LSTM layer at each sequence position in a training process. These weights are collectively represented by a matrix \u2126 \u2208 R^(d2\u00d7|xs|). An attention weight matrix \u2126att \u2208 R^(|xs|\u00d7|xs|) is then constructed as follows:\n\n\u2126att = H2\u2126.   (3)\n\nThis yields the attention weight vector \u03c9att:\n\n\u03c9att = softmax(diag(\u2126att)) \u2208 R^|xs|,   (4)\n\nwhere the ith element of \u03c9att corresponds to the attention weight for the ith position of xs. Then, Hatt \u2208 R^(|xs|\u00d7d2), the attention-weighted representation of H2, can be expressed as follows:\n\nHatt = H2 \u2299 (\u03c9att \u2297 ud2),   (5)\n\nwhere ud2 is the d2-dimensional unit vector, and \u2299 and \u2297 respectively denote the element-wise multiplication and outer product operators.\nFinally, we reshape the matrix Hatt by flattening it into a (d2 \u00b7 |xs|)-dimensional vector \u02dchatt for compatibility with the subsequent fully connected layers. We use the standard nonlinearities (i.e., hyperbolic tangent and logistic sigmoid) inside each LSTM cell.
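As an illustration of Eqs. (3)\u2013(5), the following NumPy sketch computes the attention weights and the attention-weighted outputs for a single sequence. It is a forward-pass sketch only: \u2126 is drawn at random here, whereas in our model it is learned during training, and this is not the authors' released code.\n\nimport numpy as np\n\ndef attention_weighted(H2, Omega):\n    # H2: (|xs|, d2) outputs of the second LSTM layer; Omega: (d2, |xs|) attention parameters\n    scores = np.diag(H2 @ Omega)        # diagonal of Eq. (3)\n    e = np.exp(scores - scores.max())\n    w_att = e / e.sum()                 # Eq. (4): softmax over sequence positions\n    H_att = H2 * w_att[:, None]         # Eq. (5): row i of H2 scaled by w_att[i]\n    return H_att, w_att\n\nrng = np.random.default_rng(0)\nH_att, w_att = attention_weighted(rng.normal(size=(80, 10)), rng.normal(size=(10, 80)))\nh_flat = H_att.reshape(-1)              # the flattened (d2 * |xs|)-dimensional vector\n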
2) Fully connected layers: The neural network collects the outputs from the last LSTM layer and makes a final decision using three FC layers. We denote the operations performed by these three layers by L1^FC, L2^FC, and L3^FC, which allows us to represent the outputs of the three FC layers as f1 = L1^FC(\u02dchatt), f2 = L2^FC(f1), and \u02c6y = L3^FC(f2), where f1 \u2208 R^d3 and f2 \u2208 R^d4 are intermediate vectors, and \u02c6y \u2208 R^2 denotes the final softmax output; d3 and d4 are the numbers of hidden nodes in the last two FC layers. The first two FC layers use logistic sigmoids as their activation functions, while the last FC layer uses the softmax function.\n\n3.3 Training\n\nWe based our training objective on binary cross-entropy (also known as logloss). As will be explained in Section 4 (see Table 1), we encountered a class-imbalance problem in this study, since there exist significantly more negative training examples (non-pre-miRNA sequences) than positives (known pre-miRNA sequences). We addressed this issue by augmenting the logloss training objective with balanced class weights [31], so that the training error E is expressed as follows:\n\nE = \u2212(1/b) \u03a3i { c\u2212 yi log(\u02c6yi) + c+ (1 \u2212 yi) log(1 \u2212 \u02c6yi) }\n\nwhere b is the mini-batch size (we used b = 128), and yi \u2208 {0, 1} is the class label provided in the training data (yi = 0 for pre-miRNA; yi = 1 for non-pre-miRNA); c\u2212 and c+ represent the balanced class weights given by\n\nck = N / (2 \u00b7 nk),   k \u2208 {\u2212, +}   (6)\n\nwhere N is the total number of training examples and nk is the number of examples in either the positive or the negative class.\nWe minimized E using the Adam [6] gradient descent method, which uses learning rates that adapt to the first and second moments of the gradients of each parameter. We tried other optimization methods (e.g., stochastic gradient descent [27] and RMSprop [28]), but they did not give better results.\nWe used dropout regularization with an empirical setup. In the LSTM layers, a dropout parameter for the input gates and another for the recurrent connections were both set to 0.1. In the FC layers, we set the dropout parameter to 0.1. We tried batch normalization [22], but did not find it effective.\nAll the weights were randomly initialized in the range [\u22120.05, 0.05]. The numbers of hidden nodes in the LSTM (d1, d2) and FC (d3, d4) layers were determined by cross-validation as d1 = 20, d2 = 10, d3 = 400, and d4 = 100. The mini-batch size and the number of training epochs were set to 128 and 300, respectively.
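A minimal NumPy sketch of this class-weighted objective, following the label convention stated above (yi = 0 for pre-miRNA), is given below; it is illustrative rather than the training code actually used.\n\nimport numpy as np\n\ndef balanced_weights(y):\n    # Eq. (6): c_k = N / (2 * n_k), with y = 0 for pre-miRNAs and y = 1 for negatives\n    N = len(y)\n    c_minus = N / (2.0 * np.sum(y == 1))  # multiplies the y_i = 1 (negative-class) term\n    c_plus = N / (2.0 * np.sum(y == 0))   # multiplies the y_i = 0 (positive-class) term\n    return c_minus, c_plus\n\ndef weighted_logloss(y, y_hat, c_minus, c_plus, eps=1e-12):\n    # The class-weighted binary cross-entropy E of Section 3.3, averaged over a mini-batch\n    y_hat = np.clip(y_hat, eps, 1.0 - eps)\n    return -np.mean(c_minus * y * np.log(y_hat) + c_plus * (1 - y) * np.log(1 - y_hat))\n\ny = np.array([0, 0, 1, 1, 1, 1])                  # two pre-miRNAs, four negatives\ny_hat = np.array([0.1, 0.2, 0.8, 0.7, 0.9, 0.6])  # predicted probability of the y = 1 class\nprint(weighted_logloss(y, y_hat, *balanced_weights(y)))\n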
4 Experimental Results\n\nWe used three public benchmark datasets [7] named human, cross-species, and new. The positive pre-miRNA sequences in all three datasets were obtained from miRBase [25] (release 18). For the negative training sets, we obtained noncoding RNAs other than pre-miRNAs and exonic regions of protein-coding genes from NCBI (http://www.ncbi.nlm.nih.gov), fRNAdb [23], NONCODE [24], and snoRNA-LBME-db [26]. Note that we only acquired those datasets that had undergone redundancy removal and had annotation corrected by the data owners.\n\nTable 1: Numbers of sequences in the three benchmark datasets [7] used in this study. The median length of each dataset is given in brackets.\n\nType \\ Dataset name    Human        Cross-species    New\nPositive examples      863 (85)     1677 (93)        690 (71)\nNegative examples      7422 (92)    8266 (96)        8246 (96)\n\nAs shown in Table 1, the human dataset contains 863 human pre-miRNA sequences (positive examples) and 7422 non-pre-miRNA sequences (negative examples). The cross-species dataset contains 1677 pre-miRNA sequences collected from various species (e.g., human, mouse, and fly), and 8266 non-pre-miRNA sequences. The new dataset has 690 newly discovered pre-miRNA sequences, which are in miRBase releases 19 and 20, with 8246 non-pre-miRNA sequences. For the human and cross-species datasets, 10% of the data was randomly chosen as a clean test dataset (also known as a publication dataset) and was never used in training. Using the remaining 90% of each dataset, we carried out five-fold cross-validation for training and model selection. Note that the new dataset was used for testing purposes only, as described in Tran et al. [7]. Additional details of the experimental settings used can be found in Appendix B.\n\n4.1 Validation and Test Performance Evaluation\n\nWe used seven evaluation metrics: sensitivity (SE), specificity (SP), positive predictive value (PPV), F-score, the geometric mean of SE and SP (g-mean), the area under the receiver operating characteristic curve (AUROC), and the area under the precision-recall curve (AUPR). Higher sensitivity indicates a more accurate pre-miRNA predictor, which is likely to assist the discovery of novel pre-miRNAs. Higher specificity indicates more effective filtering of pseudo pre-miRNAs, which increases the efficiency of biological experiments. Because they take account of results with different decision thresholds, AUROC and AUPR typically deliver more information than more basic metrics such as sensitivity, specificity, and PPV, which are computed with a single decision threshold. Note that miRBoost, MIReNA, and CSHMM do not provide decision values, and so the AUROC and AUPR metrics cannot be obtained for these methods.\n\nTable 2: Performance evaluation of different pre-miRNA identification methods with cross-validation (CV) and test data using sensitivity (SE), specificity (SP), positive predictive value (PPV), F-score, geometric mean (g-mean), area under the receiver operating characteristic curve (AUROC), and area under the precision-recall curve (AUPR).\n\nHuman dataset:\nMethods             SE     SP     PPV    F-score  g-mean  AUROC  AUPR\nmiRBoost (CV)       0.861  0.977  0.884  0.872    0.917   -      -\nCSHMM (CV)          0.826  0.576  0.533  0.564    0.524   -      -\ntriplet-SVM (CV)    0.735  0.967  0.819  0.775    0.843   0.943  0.869\nmicroPred (CV)      0.825  0.975  0.875  0.848    0.897   0.970  0.873\nMIReNA (CV)         0.766  0.952  0.765  0.765    0.854   -      -\nProposed (CV)       0.886  0.982  0.911  0.898    0.933   0.985  0.927\nmiRBoost (test)     0.803  0.988  0.887  0.843    0.891   -      -\nCSHMM (test)        0.713  0.777  0.559  0.570    0.673   -      -\ntriplet-SVM (test)  0.669  0.986  0.851  0.749    0.812   0.957  0.854\nmicroPred (test)    0.763  0.989  0.888  0.820    0.869   0.974  0.890\nMIReNA (test)       0.818  0.943  0.624  0.708    0.878   -      -\nProposed (test)     0.799  0.988  0.885  0.839    0.888   0.984  0.915\n\nCross-species dataset:\nMethods             SE     SP     PPV    F-score  g-mean  AUROC  AUPR\nmiRBoost (CV)       0.856  0.844  0.526  0.651    0.850   -      -\nCSHMM (CV)          0.749  0.960  0.791  0.769    0.848   -      -\ntriplet-SVM (CV)    0.760  0.977  0.870  0.812    0.862   0.952  0.908\nmicroPred (CV)      0.814  0.985  0.919  0.863    0.896   0.963  0.906\nMIReNA (CV)         0.796  0.950  0.764  0.780    0.870   -      -\nProposed (CV)       0.900  0.983  0.913  0.906    0.940   0.984  0.955\nmiRBoost (test)     0.884  0.969  0.768  0.822    0.925   -      -\nCSHMM (test)        0.616  0.978  0.768  0.684    0.777   -      -\ntriplet-SVM (test)  0.744  0.992  0.914  0.821    0.859   0.947  0.830\nmicroPred (test)    0.779  0.988  0.882  0.827    0.877   0.980  0.892\nMIReNA (test)       0.826  0.941  0.617  0.706    0.881   -      -\nProposed (test)     0.822  0.992  0.919  0.868    0.903   0.981  0.918\n\nTP: \u03a3 true positives, TN: \u03a3 true negatives, FP: \u03a3 false positives, FN: \u03a3 false negatives. SE = TP/(TP + FN); SP = TN/(TN + FP); PPV (precision) = TP/(TP + FP); F-score = 2TP/(2TP + FP + FN); g-mean = \u221a(SE \u00b7 SP).\n\nThe results of a cross-validation performance comparison are shown in the upper half of each panel of Table 2, while the results of the test performance comparison are shown in the bottom half. For the human dataset, the cross-validation performance of our method was comparable to that of others, but our method achieved the highest test performance in terms of F-score, AUROC, and AUPR. For the cross-species dataset, our method achieved the best overall performance in terms of both cross-validation and test evaluation results. Some tools, such as miRBoost, showed fair performance in terms of the cross-validation but failed to deliver the same level of performance on the test data. These results suggest that our approach provides better generalization than the alternatives. The similarity of the performance in terms of the cross-validation and test results suggests that overfitting was handled effectively.\nFollowing the experimental setup used by Tran et al. [7], we also evaluated the proposed method on the new dataset, with a model trained on the cross-species dataset, as shown in Table 3, to assess the potential of our approach in the search for novel pre-miRNAs.\n\nTable 3: Evaluation of performance on the new dataset.\n\nMethods             SE     SP     PPV    F-score  g-mean  AUROC  AUPR\nmiRBoost            0.921  0.936  0.609  0.733    0.928   -      -\nCSHMM               0.536  0.069  0.046  0.085    0.192   -      -\ntriplet-SVM         0.721  0.981  0.759  0.740    0.841   0.934  0.766\nmicroPred           0.728  0.970  0.672  0.699    0.840   0.940  0.756\nMIReNA              0.450  0.941  0.392  0.419    0.650   -      -\nProposed method     0.917  0.964  0.682  0.782    0.941   0.981  0.808\n\nAgain, our method did not show the best performance in terms of basic metrics such as sensitivity and specificity, but it returned the best values of AUROC and AUPR. The results show that the proposed method can be used effectively to identify novel pre-miRNAs as well as to filter out pseudo pre-miRNAs.\nTo evaluate the statistical significance of our approach, we applied a Kolmogorov-Smirnov test [40] to the classifications produced by our method, grouped by true data labels. For the human, cross-species, and new datasets, the p-values we obtained were 5.23 \u00d7 10^\u221254, 6.06 \u00d7 10^\u2212102, and 7.92 \u00d7 10^\u221249, respectively, indicating that the observed separation between the two classes is extremely unlikely to have arisen by chance.
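The following sketch shows one natural reading of this significance check, assuming SciPy's two-sample Kolmogorov-Smirnov test applied to the decision values of the two label groups; the score distributions below are random stand-ins, not our model's actual outputs.\n\nimport numpy as np\nfrom scipy.stats import ks_2samp\n\nrng = np.random.default_rng(0)\npos_scores = rng.beta(5, 2, size=690)    # stand-in decision values for true pre-miRNAs\nneg_scores = rng.beta(2, 5, size=8246)   # stand-in decision values for true negatives\nstat, p = ks_2samp(pos_scores, neg_scores)\nprint(stat, p)  # a vanishingly small p indicates well-separated score distributions\n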
4.2 Effectiveness of Multimodal Learning\n\nOur approach to the identification of pre-miRNAs takes both biological sequence information and secondary structure information into account. To assess the benefit of this multimodality, we measured the performance of our method using only sequences or only secondary structures in training on the human dataset. As shown in Fig. 3, all of the performance metrics were higher when both sequence and structure information were used together. Compared with the use of sequence or structure alone, the sensitivity of the multimodal approach was increased by 48% and 2%, respectively. For specificity, the case using both sequence and structure achieved a higher value (0.988) than the sequence-only (0.987) and structure-only (0.978) cases. Similarly, in terms of F-score, using the multimodality gave 29% and 5% higher scores (0.839) than using sequence only (0.649) or structure only (0.795), respectively.\n\nFigure 3: Using both sequence and structure information gives the best performance on the human dataset. Each bar shows the metrics of average cross-validation results. (Bar values: sequence only SE 0.537, SP 0.987, F-score 0.649, g-mean 0.727; structure only SE 0.783, SP 0.978, F-score 0.795, g-mean 0.875; multimodal SE 0.799, SP 0.988, F-score 0.839, g-mean 0.888.)\n\n4.3 Gaining Insights by Analyzing Attention Weights\n\nA key strength of our approach is its ability to learn the features useful for pre-miRNA identification from data. This improves efficiency, and also has the potential to aid the discovery of subtle features that might be missed in manual feature design. However, learned features, which are implicitly represented by the trained weights of a deep model, come without intuitive significance.\nTo address this issue, we experimented with the visualization of attention weights using class activation mapping [32], a technique that was originally proposed to interpret the operation of convolutional neural networks (CNNs) in image classification by highlighting discriminative regions. We modified class activation mapping for RNNs to discover which part of the sequential output is significant for identifying pre-miRNAs. We performed one-dimensional global average pooling (GAP) on the attention-weighted output Hatt (see Section 3.2) to derive a d2-dimensional weight vector \u03c9gap. We then multiplied Hatt by \u03c9gap to obtain a class activation map of size |xs| for each sequence sample.
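The two steps just described amount to a few lines of NumPy; the sketch below is illustrative (it uses random stand-in activations) and is not the authors' implementation.\n\nimport numpy as np\n\ndef rnn_cam(H_att):\n    # H_att: (|xs|, d2) attention-weighted LSTM outputs from Section 3.2\n    w_gap = H_att.mean(axis=0)  # one-dimensional global average pooling -> (d2,)\n    return H_att @ w_gap        # class activation map: one value per sequence position\n\nrng = np.random.default_rng(0)\ncam = rnn_cam(rng.normal(size=(71, 10)))  # e.g., a 71-nt sequence with d2 = 10\nprint(cam.shape)                          # (71,), ready to be drawn as one heatmap row\n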
Figure 4: Attention-weighted RNN outputs with the human dataset. (A) Class activation mapping for predicted examples (negatives: non-pre-miRNAs). (B) Class activation mapping for predicted examples (positives: pre-miRNAs). (C) Stem-loop structure of a pre-miRNA (Homo sapiens miR-1-1): 5\u2032-UGGGAAACAUACUUCUUUAUAUGCCCAUAUGGACCUGCUAAGCUAUGGAAUGUAAAGAAGUAUGUAUCUCA-3\u2032, with the 5\u2032 stem and the mature miRNA region marked in the original figure.\n\nFig. 4 (A) and Fig. 4 (B) show the resulting heatmap representations of class activation mapping on the human dataset for negative and positive predicted examples, respectively. Since sequences can have different lengths, we normalized the sequence lengths to 1 and presented individual positions in a sequence between 0% and 100% on the x-axis. By comparing the plots in Fig. 4 (A) and (B), we can see that the class activation maps of the positive and negative data show clear differences, especially at the 10\u201350% sequence positions, within the red box in Fig. 4 (B). This region corresponds to the 5\u2032 stem region of typical pre-miRNAs, as shown in Fig. 4 (C). This region coincides with the location of a mature miRNA encoded within a pre-miRNA, suggesting that the data-driven features learned by our approach have revealed relevant characteristics of pre-miRNAs.\nThe presence of some nucleotide patterns has recently been reported in the mature miRNA region inside a pre-miRNA [33]. We anticipate that further interpretation of our data-driven features may assist in confirming such patterns, and also in discovering novel motifs in pre-miRNAs.\n\n4.4 Additional Experiments\n\n1) Architecture exploration: We explored various alternative network architectures, as listed in Table 4, which shows the performance of each architecture, annotated with the number of stacked layers and whether an attention mechanism was included. Rows 1\u20133 of the table show results for CNNs with and without LSTM networks. Rows 4\u20136 show the results of LSTM networks. Rows 7\u20138 show results for bi-directional LSTM (BiLSTM) networks. More details can be found in Appendix C.1.\n\nTable 4: Performance of different types of neural network, assessed in terms of five-fold cross-validation results from the human dataset. The number of stacked layers is shown in brackets. ATT means that an attention mechanism was included, and a BiLSTM is a bi-directional LSTM. The configuration that we finally adopted is shown in row 6.\n\nNo.  Type                       SE     SP     F-score  g-mean\n1    1D-CNN(2)                  0.745  0.978  0.771    0.853\n2    1D-CNN(2)+LSTM(2)          0.707  0.976  0.738    0.830\n3    1D-CNN(2)+LSTM(2)+ATT      0.691  0.979  0.739    0.822\n4    LSTM(2)                    0.666  0.988  0.751    0.810\n5    LSTM(1)+ATT                0.781  0.987  0.824    0.878\n6    LSTM(2)+ATT (proposed)     0.799  0.988  0.839    0.888\n7    BiLSTM(1)+ATT              0.783  0.987  0.827    0.879\n8    BiLSTM(2)+ATT              0.795  0.987  0.834    0.886\n\n2) Additional results: Appendices C.2\u2013C.4 present more details of hyperparameter tuning, the design decisions made between the use of soft and hard encoding, and running-time comparisons.
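For reference, the threshold-based metrics reported in Tables 2\u20134 can be computed from confusion counts as in the following sketch (AUROC and AUPR additionally require ranked decision values and are omitted). The label convention here, 1 = pre-miRNA, is for this sketch only.\n\nimport numpy as np\n\ndef basic_metrics(y_true, y_pred):\n    tp = np.sum((y_true == 1) & (y_pred == 1))\n    tn = np.sum((y_true == 0) & (y_pred == 0))\n    fp = np.sum((y_true == 0) & (y_pred == 1))\n    fn = np.sum((y_true == 1) & (y_pred == 0))\n    se = tp / (tp + fn)                    # sensitivity (recall)\n    sp = tn / (tn + fp)                    # specificity\n    ppv = tp / (tp + fp)                   # positive predictive value (precision)\n    f_score = 2 * tp / (2 * tp + fp + fn)\n    g_mean = np.sqrt(se * sp)\n    return {'SE': se, 'SP': sp, 'PPV': ppv, 'F-score': f_score, 'g-mean': g_mean}\n\nprint(basic_metrics(np.array([1, 1, 0, 0, 0]), np.array([1, 0, 0, 0, 1])))\n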
5 Discussion\n\nGiven the importance of the secondary structure in pre-miRNA identification (e.g., see Section 4.2), we derived the secondary structure of each input sequence using RNAfold. We then combined the secondary structure information with the primary structure (i.e., the sequence), and sent the result to the RNN. However, for a fully end-to-end approach to pre-miRNA identification, we would need to learn even the secondary structure from the input sequences. Due to the limited number of known pre-miRNA sequences, this remains challenging future work.\nOur experimental results supported the effectiveness of a multimodal approach that considers sequences and structures together from an early stage of the pipeline. Incorporating other types of information would be possible and might improve performance further. For example, sequencing results from RNA-seq experiments reflect the expression levels and the positions of each sequenced RNA [34]; and conservation information would allow a phylogenetic perspective [35]. Such additional information could be integrated into the current framework by representing it as new network branches and merging them with the current data before the FC layers.\nOur proposed method has the clear advantage over existing approaches that it does not require hand-crafted features. But we need to ensure that learned features provide satisfactory performance, and they also need to have some biological meaning. Biomedical researchers naturally hesitate to use a black-box methodology. Our method of visualizing attention weights provides a tool for opening that black box, and assists data-driven discovery.\n\nAcknowledgments\n\nThis work was supported in part by the Samsung Research Funding Center of Samsung Electronics [No. SRFC-IT1601-05], the Institute for Information & communications Technology Promotion (IITP) grant funded by the Korea government (MSIT) [No. 2016-0-00087], the Future Flagship Program funded by the Ministry of Trade, Industry & Energy (MOTIE, Korea) [No. 10053249], the Basic Science Research Program through the National Research Foundation of Korea (NRF) funded by the Ministry of Science, ICT & Future Planning [No. 2016M3A7B4911115], and the Brain Korea 21 Plus Project in 2017.\n\nReferences\n\n[1] M. E. Rahman, et al. MiRANN: A reliable approach for improved classification of precursor microRNA using Artificial Neural Network model. Genomics, 99(4):189\u2013194, 2012.\n\n[2] K. Xu, et al. Show, Attend and Tell: Neural Image Caption Generation with Visual Attention. In ICML, volume 14, pages 77\u201381, 2015.\n\n[3] J. Chung, et al. Empirical evaluation of gated recurrent neural networks on sequence modeling. arXiv preprint arXiv:1412.3555, 2014.\n\n[4] P. Baldi and S. Brunak. Chapter 6. Neural Networks: applications. In Bioinformatics: The Machine Learning Approach. MIT Press, 2001.\n\n[5] I. L. Hofacker. Vienna RNA secondary structure server. Nucleic Acids Research, 31(13):3429\u20133431, 2003.\n\n[6] D. P. Kingma and J. Ba. Adam: A Method for Stochastic Optimization. CoRR, abs/1412.6980, 2014.\n\n[7] V. D. T. Tran, et al. miRBoost: boosting support vector machines for microRNA precursor classification. RNA, 21(5):775\u2013785, 2015.\n\n[8] Y. LeCun, Y. Bengio, and G. Hinton. Deep learning. Nature, 521(7553):436\u2013444, 2015.\n\n[9] Y. Bengio, P. Simard, and P. Frasconi. Learning long-term dependencies with gradient descent is difficult. IEEE Transactions on Neural Networks, 5(2):157\u2013166, 1994.\n\n[10] S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735\u20131780, 1997.\n\n[11] R. Batuwita and V. Palade. microPred: effective classification of pre-miRNAs for human miRNA gene prediction. Bioinformatics, 25(8):989\u2013995, 2009.\n\n[12] P. Jiang, et al. MiPred: classification of real and pseudo microRNA precursors using random forest prediction model with combined features. Nucleic Acids Research, 35(suppl 2):W339\u2013W344, 2007.\n\n[13] S. Agarwal, et al. Prediction of novel precursor miRNAs using a context-sensitive hidden Markov model (CSHMM). BMC Bioinformatics, 11(Suppl 1):S29, 2010.\n\n[14] A. Mathelier and A. Carbone. MIReNA: finding microRNAs with high accuracy and no learning at genome scale and from deep sequencing data. 
Bioinformatics, 26(18):2226\u20132234, 2010.\n\n[15] C. Xue, et al. Classification of real and pseudo microRNA precursors using local structure-sequence features and support vector machine. BMC Bioinformatics, 6(1):310, 2005.\n\n[16] R. C. Lee, R. L. Feinbaum, and V. Ambros. The C. elegans heterochronic gene lin-4 encodes small RNAs with antisense complementarity to lin-14. Cell, 75(5):843\u2013854, 1993.\n\n[17] D. P. Bartel. MicroRNAs: genomics, biogenesis, mechanism, and function. Cell, 116(2):281\u2013297, 2004.\n\n[18] D. Kleftogiannis, et al. Where we stand, where we are moving: Surveying computational techniques for identifying miRNA genes and uncovering their regulatory role. Journal of Biomedical Informatics, 46(3):563\u2013573, 2013.\n\n[19] I. de O. N. Lopes, A. Schliep, and A. C. d. L. de Carvalho. The discriminant power of RNA features for pre-miRNA recognition. BMC Bioinformatics, 15(1):1, 2014.\n\n[20] R. Lorenz, et al. ViennaRNA Package 2.0. Algorithms for Molecular Biology, 6(1):1, 2011.\n\n[21] R. B. Lyngs\u00f8. Complexity of pseudoknot prediction in simple models. In Automata, Languages and Programming, pages 919\u2013931. Springer, 2004.\n\n[22] S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167, 2015.\n\n[23] T. Kin, et al. fRNAdb: a platform for mining/annotating functional RNA candidates from non-coding RNA sequences. Nucleic Acids Research, 35(suppl 1):D145\u2013D148, 2007.\n\n[24] D. Bu, et al. NONCODE v3.0: integrative annotation of long noncoding RNAs. Nucleic Acids Research, page gkr1175, 2011.\n\n[25] S. Griffiths-Jones, et al. miRBase: microRNA sequences, targets and gene nomenclature. Nucleic Acids Research, 34(suppl 1):D140\u2013D144, 2006.\n\n[26] L. Lestrade and M. J. Weber. snoRNA-LBME-db, a comprehensive database of human H/ACA and C/D box snoRNAs. Nucleic Acids Research, 34(suppl 1):D158\u2013D162, 2006.\n\n[27] L. Bottou. Stochastic gradient learning in neural networks. Proceedings of Neuro-N\u00eemes, 91(8), 1991.\n\n[28] T. Tieleman and G. Hinton. Lecture 6.5-rmsprop: Divide the gradient by a running average of its recent magnitude. COURSERA: Neural Networks for Machine Learning, 4:2, 2012.\n\n[29] O. Vinyals, et al. Grammar as a foreign language. In Advances in Neural Information Processing Systems, pages 2773\u20132781, 2015.\n\n[30] T. Rockt\u00e4schel, et al. Reasoning about entailment with neural attention. arXiv preprint arXiv:1509.06664, 2015.\n\n[31] G. King and L. Zeng. Logistic regression in rare events data. Political Analysis, 9(2):137\u2013163, 2001.\n\n[32] B. Zhou, et al. Learning deep features for discriminative localization. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2921\u20132929, 2016.\n\n[33] J. Starega-Roslan, P. Galka-Marciniak, and W. J. Krzyzosiak. Nucleotide sequence of miRNA precursor contributes to cleavage site selection by Dicer. Nucleic Acids Research, 43(22):10939\u201310951, 2015.\n\n[34] M. R. Friedl\u00e4nder, et al. Discovering microRNAs from deep sequencing data using miRDeep. Nature Biotechnology, 26(4):407\u2013415, 2008.\n\n[35] N. Mendes, A. T. Freitas, and M.-F. Sagot. Current tools for the identification of miRNA genes and their targets. Nucleic Acids Research, 37(8):2419\u20132433, 2009.\n\n[36] T. Mikolov, et al. Distributed representations of words and phrases and their compositionality. 
In Advances in Neural Information Processing Systems, pages 3111\u20133119, 2013.\n\n[37] J. Thomas, S. Thomas, and L. Sael. DP-miRNA: An improved prediction of precursor microRNA using deep learning model. In Big Data and Smart Computing (BigComp), 2017 IEEE International Conference on, pages 96\u201399. IEEE, 2017.\n\n[38] J. Chen, X. Wang, and B. Liu. iMiRNA-SSF: improving the identification of MicroRNA precursors by combining negative sets with different distributions. Scientific Reports, 6, 2016.\n\n[39] L. Wei, et al. Improved and promising identification of human microRNAs by incorporating a high-quality negative set. IEEE/ACM Transactions on Computational Biology and Bioinformatics, 11(1):192\u2013201, 2014.\n\n[40] H. W. Lilliefors. On the Kolmogorov-Smirnov test for normality with mean and variance unknown. Journal of the American Statistical Association, 62(318):399\u2013402, 1967.\n", "award": [], "sourceid": 1662, "authors": [{"given_name": "Seunghyun", "family_name": "Park", "institution": "Seoul National University"}, {"given_name": "Seonwoo", "family_name": "Min", "institution": "Seoul National University"}, {"given_name": "Hyun-Soo", "family_name": "Choi", "institution": "Seoul National University"}, {"given_name": "Sungroh", "family_name": "Yoon", "institution": "Seoul National University"}]}