Welcome back to Peking University MOOC: "Bioinformatics: Introduction and Methods". I am Ge Gao from the Center for Bioinformatics, Peking University. Let's start this week's topic: Sequence Database Search. First, let's review some basic concepts. Last time we have illustrated how to compare two sequences by dynamic programming. We can infer the structural, functional, and evolutionary similarity from the sequence similarity of proteins In practice, we often need to find the most similar sequences of our interested sequence by searching sequence databases. We usually use a query sequence with unknown function to search databases of genes or proteins with known sequences and functional annotations, in order to infer the structure, functional, and evolutionary origin of our query sequence. Specifically, we will align the query sequence to known sequences in the database one by one. We will then use the annotations of the statistically significantly similar sequences (often called “hits”) to infer the structure, function, and evolution. It seems that we can implement it by simply looping over the pairwise sequence alignment taught last week. Students with background in computer sciences may realize immediately that such implementation merely requires an additional for-loop outside the pairwise sequence alignment algorithm. So that's all for this week... right? Hold on. Let's estimate how long it would take. As discussed earlier, we can put the sequences of X and Y on the x- and y-axis of a matrix. The i and j in F(i,j) become the subscripts of the elements in this matrix. the recursion of this formula then becomes filling the blanks one by one from top to bottom and left to right. Then for two sequences of length m and n, respectively, we need to fill in m*n cells. Each filling takes a constant time c. The total time needed will thus be c*m*n. Let's try to substitute some numerical values in for these variables. Assume that the query sequence is human myoglobin alpha subunit. It is made of 142 amino acids. The database to search is the latest version of the Swiss-Prot database released on Sep 18th, 2013. If your computer can fill in a cell within one microsecond, then you will need about 7.8 hours to finish searching the whole database! The sequence databases are growing rapidly, especially nucleotide sequence databases. Therefore, although dynamic programming has significantly reduced the computational time compared with enumeration, we need even faster algorithms to search the rapidly growing large biological databases. Let's return to our pairwise sequence alignment. As the formula shows, each pair of residues aligned is denoted by a diagonal move in the dynamic programming matrix. Successive alignments of residue pairs (i.e. no gaps) are denoted by a path parallel to the [main] diagonal of the matrix. In extreme cases, the optimal alignment path between two identical sequences is the main diagonal of the matrix. Similarly, the optimal global alignment path between two nearly identical sequences is confined within a small region around the main diagonal. For local alignment, the optimal alignment path might not be around the main diagonal, but will be parallel to the main diagonal. Thus it is natural to take advantage of this property and "guess" first where the diagonal fragments of the optimal alignment path might reside in the matrix. This will significantly reduce the number of cells to fill in and, as a result, the computational time. In fact, this is the basic idea behind the BLAST algorithm. BLAST was proposed first in 1990. Its two main papers have been cited more than 30,000 times each, making it one of the most widely used and appreciated algorithms in bioinformatics. Next we will illustrate how to use BLAST to search the databases by an example. Again, we will illustrate it online. OK, here is the homepage of NCBI BLAST. We will choose the "protein blast" first at the center of the webpage. Here's the webpage. Does it seem a little more complex than that of pairwise sequence alignment? The first box is for query sequence. Let's fill it with the sequence of human myoglobin alpha subunit. Next we will choose which database to search. As you can see, the default database is "nr", i.e. the non-redundant set of all known protein sequences. This database is often used as the first filter to check whether the query sequence has been studied by other scientists. However, although there are a lot of sequences in this database, not all of them have detailed and precise functional annotations. Because we want to annotate our query sequence as much as possible, we choose the Swissprot database instead. The Swissprot database is part of the Uniprot protein database. Thus it is also called the UniProt/Swiss-Prot database. Every sequence in the Swissprot database has been manually curated by a team of experts. Each sequence record contains a description of this protein including its sequence, function, modification, and structure, whatever available, plus a list of hyperlinks to other related databases. Therefore, to annotate a new query sequence we often start from searching the Swiss-Prot database. Next we can select the species. We will choose human first. Now let's click the big blue BLAST button below! OK, the result comes out within a minute. Does the resultsult look confusing? Don't worry. Let's look at it together. First, on the top of the screen we see a summary of the query sequence and the database searched. Below that there is an analysis of the functional domains in the query sequence. The concept of "Profile" is used here. We will explain it later. Next is a summary of the Hits. As you can see, some hits almost aligned to the full length of the query sequence, while others aligned to only part of the query sequence. Next is a detailed list of the hits. Aside from the familiar scores and Identity as used in pairwise sequence alignment there's a new value called E-value. We will explain the meaning and calculation of this E-value later. Finally there is a detailed list of alignments. As you can see, BLAST has found out not only the query sequence itself, but also the myoglobin beta subunit. That provides some clues for the function annotation of this sequence. Here are the alignment results done by the Needleman-Wunsch algorithm, Smith-Waterman algorithm, and BLAST. Please think about which parts of the three results are the same and which parts are different, and why. Good. Next let's remove the restriction on species. We will try to find homologous of human myoglobin alpha subunit in other species. Now we press the BLAST button. Here's the result. It took a little longer, but less than two minutes. Let's take a look at the result. The first two parts appear more or less the same as previous search of human sequences, but now we get considerably more hits. In fact, we can see that the hits are from many different species, implying that BLAST has found out many sequences that are potential homologues of human myoglobin alpha subunit To see from which species the hits come, we can choose the Taxonomy Report option on the top of the webpage to sort the hits in the order of species. We can then construct a phylogenetic tree. So we can see that BLAST can find lots of possible homologous sequences in different species indeed. Good. We have illustrated with an example how to use BLAST to search databases for similar sequences of the query sequence within a selected species or across species. Here are the summary questions for this unit. These questions are not the part of the assignment. You are encouraged to think about them and discuss your answers and ideas in the online forum. In Unit 2, we will explain in detail the idea behind the BLAST algorithm.