Welcome back to Peking University MOOC: "Bioinformatics: Introduction and Methods". I am Ge Gao from the Center for Bioinformatics, Peking University. Let's continue our course.

In the last unit, we gave a brief introduction to the basic concepts of transcriptome data mining and the background knowledge of non-coding RNAs. Now we will introduce the corresponding data mining methods, starting with the identification of long non-coding RNAs (lncRNAs).

Identification can be regarded as a type of classification: separating individuals with different properties by a set of features. It should be noted that these features need only be associated with, not equal to, the properties of interest. So what features can we use to identify non-coding RNAs?

A simple idea is to use known biological properties directly. For example, a pre-miRNA folds into a hairpin structure, so we can identify new miRNAs by RNA secondary structure. Similar methods have been applied to find tRNAs and other non-coding RNAs with well-defined structural properties. This type of method, however, does not work for lncRNAs, whose functions do not depend on specific secondary structures.

Since non-coding RNAs do not code for proteins, base conservation in a non-coding RNA should not fluctuate across the three positions of putative codons the way it does in a coding sequence. PhyloCSF uses this feature to classify sequences. However, this method relies on a genome alignment of multiple species, so it will fail if the alignment is incorrect or not available at all.

Is it possible to distinguish ncRNAs from mRNAs using the sequence information of the transcript itself, without external information such as a multiple alignment?
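To make the codon-position conservation idea concrete, here is a toy Python sketch. The aligned sequences and the helper `identity_by_codon_position` are invented for illustration only; PhyloCSF itself uses a far more sophisticated phylogenetic model.

```python
def identity_by_codon_position(ref, others):
    """Fraction of aligned bases identical to `ref`, split by codon position 1/2/3."""
    match, total = [0, 0, 0], [0, 0, 0]
    for seq in others:
        for i, (a, b) in enumerate(zip(ref, seq)):
            total[i % 3] += 1
            if a == b:
                match[i % 3] += 1
    return [m / t for m, t in zip(match, total)]

# Hypothetical gap-free alignment: all substitutions are synonymous,
# hitting only the third ("wobble") position of each codon.
ref    = "ATGGCTGAAGTTCTG"
others = ["ATGGCAGAAGTACTC",
          "ATGGCGGAGGTGCTA"]

conservation = identity_by_codon_position(ref, others)
# Conservation at the third codon position comes out markedly lower than at
# positions 1 and 2 -- the 3-periodic signature of protein-coding sequence.
# A truly non-coding region shows no such periodicity.
```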
Also, it would be better for this method to be independent of specific mechanisms, making it suitable for identifying both lncRNAs and small non-coding RNAs such as miRNAs. Finally, considering the huge amount of data generated by deep sequencing, the method should be as fast as possible without losing accuracy.

We learned in previous units that machine learning algorithms such as the SVM can classify data quickly and accurately given a feature set and a training set. But how should we choose the features? As mentioned just now, we want features derived from the sequence alone, for robustness and generality. However, there are still too many sequence-only features to choose from. Moreover, multiple features can be combined to increase the overall accuracy significantly, as the earlier SAPRED example implied. So how can we find the best feature set?

We need feature selection to handle this problem. Feature selection systematically screens a set of candidate features to obtain a feature subset for a specific classification goal. Generally, we want this subset to be as small as possible without losing accuracy, in order to speed up the algorithm. Feature selection methods can be classified into three types by their nature: complete, heuristic, and random.

Breadth-first search is a typical complete search method. Basically, it evaluates all possible combinations from the original feature set exhaustively, so it is guaranteed to find the optimal subset. Because of the combinatorial explosion (2^n − 1 non-empty subsets for n features), however, it takes a long time to run, making it impractical when the initial feature set is large.

Forward search is a heuristic search method.
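Before turning to forward search, the exhaustive strategy above can be sketched in a few lines of Python. Here `score` is an invented stand-in for "train a classifier on this subset and measure its validation accuracy"; the feature names and weights are hypothetical.

```python
from itertools import combinations

# Invented per-feature "gains"; a real run would train an SVM per subset
# and measure cross-validated accuracy instead.
GAIN = {"orf_coverage": 0.20, "hexamer_bias": 0.12, "gc_content": 0.05, "seq_length": 0.02}

def score(subset):
    """Toy accuracy: additive gains minus a small cost per extra feature."""
    return 0.5 + sum(GAIN[f] for f in subset) - 0.03 * max(0, len(subset) - 1)

def exhaustive_search(features):
    """Complete search: evaluate every non-empty subset (2**n - 1 of them)."""
    best, best_score = None, float("-inf")
    for k in range(1, len(features) + 1):
        for subset in combinations(features, k):
            if score(subset) > best_score:
                best, best_score = set(subset), score(subset)
    return best, best_score

best, acc = exhaustive_search(list(GAIN))
# Guaranteed optimal, but the 2**n - 1 evaluations explode quickly:
# 15 subsets for 4 features, but roughly 1.15e18 for ~60 features.
```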
This method starts with an empty subset, at each step adds the unselected feature that improves classification accuracy the most, and stops when adding further features no longer increases accuracy. Forward search cannot delete features once they have been added, which can result in selecting highly correlated features and thus redundancy in the feature set. However, given n features to select from, forward search builds at most n nested subsets, evaluating on the order of n² candidates rather than 2^n. This takes much less computation, making forward search suitable for relatively large initial feature sets.

The simulated annealing algorithm is named after the physical process of metal annealing. It introduces randomness into feature selection to avoid getting trapped in local optima. It is essentially a stochastic algorithm, so its final performance depends heavily on the initial values and parameter choices, and the stability of the final result is also an important issue.

So, how do we get the initial feature set? In fact, a rational and effective initial feature set is an important factor in the success of feature selection: if the initial set is contaminated with many irrelevant features, the efficiency of the subsequent selection suffers badly. In practice, the initial set is assembled mainly from the existing literature and data, combined with our own biological intuition.

Based on the relevant literature and our biological background knowledge, we selected about 60 features at the RNA sequence level as the initial set. We first used the forward search algorithm to screen these down to a candidate set of 11 features. To further improve accuracy, we then applied a complete search based on the breadth-first strategy to obtain the final six features.

Of the final six features, three come from the ORF (open reading frame) of the RNA sequence predicted by a probabilistic model.
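Before looking at the six features in detail, the greedy forward procedure just described can be sketched the same way. As before, `score` is an invented stand-in for measured classifier accuracy (hypothetical feature names and weights); in a two-stage pipeline like the one described here, a greedy step like this shrinks the candidate pool before the exhaustive pass.

```python
def score(subset):
    """Toy stand-in for classifier accuracy on a validation set (invented weights)."""
    gain = {"orf_coverage": 0.20, "hexamer_bias": 0.12, "gc_content": 0.05, "seq_length": 0.02}
    return 0.5 + sum(gain[f] for f in subset) - 0.03 * max(0, len(subset) - 1)

def forward_search(features):
    """Greedily add the single most helpful feature; stop when nothing helps.

    Builds at most n nested subsets -- a feature, once added, is never removed.
    """
    selected, best = [], score([])
    remaining = list(features)
    while remaining:
        cand = max(remaining, key=lambda f: score(selected + [f]))
        cand_score = score(selected + [cand])
        if cand_score <= best:      # no further improvement: stop
            break
        selected.append(cand)
        remaining.remove(cand)
        best = cand_score
    return selected, best

selected, acc = forward_search(["orf_coverage", "hexamer_bias", "gc_content", "seq_length"])
# On this toy score the greedy path happens to reach the optimal subset, but in
# general forward search can miss the optimum and may keep correlated features.
```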
Here, coverage is the ratio of the predicted ORF length to the length of the entire RNA sequence. ORF integrity indicates whether the predicted ORF is complete. The log-odds score assesses the reliability of the prediction: the higher the score, the more reliable the predicted ORF.

The other three features are homology-based. The basic idea is that an mRNA is much more likely than a non-coding RNA to match similar proteins in a protein database. A non-coding RNA may still match a few protein fragments by chance, but because it has no real ORF, these random matches are dispersed across the three reading frames rather than concentrated in one true coding frame, as they are for a genuine match.

Next, we feed the six features into the SVM we discussed before to obtain CPC, the Coding Potential Calculator. Tests show that CPC reaches more than 90% accuracy for ncRNAs of different lengths, and it achieves this accuracy without sacrificing speed: CPC runs over 10 times faster than CONC, which is also SVM-based. Proper feature selection plays an important role in this. CPC has become one of the standard online tools for non-coding RNA identification, applied in many fields from expression and regulation to disease research and evolutionary analysis. Since its launch in 2007, CPC has analyzed thirty-two million sequences submitted by more than 50,000 users all over the world.

Well, in this section we focused on the identification of non-coding RNAs and introduced feature selection methods. Here are some summary questions. You are encouraged to think about them and discuss them with other students and the TAs in the online forum.

That's all for this section. In the next section, we will focus further on functional prediction.