We are now ready to define profile hidden market models for classifying proteins into family, a difficult task in bioinformatics. Classifying proteins into families important because as soon as a protein is classified as a family member, we know the function of this project. However, the similarities may be subtle for diverse protein and therefore, pairwise similarities may fail to detect what family a given newly sequenced protein belongs to. However, protein may have thick similarities with as many proteins in the family. And it will help us to assign this protein to a family, but we need new computational techniques for finding such subtle similarities. Let's start from the alignment of number of proteins, as shown on this slide. And let's remove columns if the fraction of insertion exceeds a threshold theta, the maximum fraction of insertions per column. And after we've done this we'll construct alignment for when each column has relatedly small number of insertions. We can throw the construct profile of this alignment and finally, we will construct an HMM correspondent to this profile. This is actually a strange HMM, because the probability of transition from every state to max state next to the right state is 1. Therefore, there is nothing probabilistic about this HMM with respect to transition probability models. However, with respect to mission probabilities there is a probabilistic minimum of this HMM. And it corresponds to every new protein having certain probabilities being committed by this HMM. This HMM has a major flaw. It is not clear how to model insertions in this HMM. For example, how would we represents this protein in the framework of this HMM. Well an obvious way is to add insertion states like this one. And if there are multiple insertions, we will also connect insertion state with itself. And finally, we will design this HMM of these insertions. Now we took care of insertions and this would be passed in the HMM correspondent to this protein, but what about deletions? And there is a simple way to account for deletion. Let's say for this deletion, by introducing transitions between all can stay to the right in this HMM diagram like this state. However the question arises how many such edges in HMM diagram we will have after adding all those possible transitions? In this case, we added only transition for a given node of the HMM diagram. And there are already many edges and as you remember the running time of equals to the number of edges on the HMM diagram multiply by the number of omitted symbols. So as the solving the corresponding decoding problem will be time consuming. However, there is a better way. We can introduce deletion states in the HMM diagram. And in this case, we can root the sequence of this newly introduced deletion state. We need to introduce all of them of course. And that's now how the dilution issue will be addressed with introducing relatively small numbers of new edges in the HMM diagram. So here's a path corresponding to this proteins that is shown in the bottom of this diagram. And I want to ask you a question, are there any edges missing from this HMM diagram? And of course, there are missing edges between deletion and insertion states. We will add them and our HMM diagram is almost done. The only thing we need to add is the start and end states. However, the question that remains unanswered is what are the transitions and emissions probabilities for this HMM? And we wanted to define profile HMM problem, which is to construct a profile HMM from a multiple alignment. The input to this problem is a multiple alignment and a threshold theta, the maximum fraction of insertions per column. And the output, transition and emission matrices of the profile HMM. Let's construct hidden paths for every row of the multiple alignment in the profile HMM. Please note that they are constructing hidden process in the HMM diagram, not in the Viterbi graph of the HMM. We haven't even constructed the Viterbi graph yet. So for the top row, in the multiply alignment this will be our green paths through the HMM diagram. This is [INAUDIBLE] path as it responds to the second row and this is yet another path corresponding to the third row. And here all five passes corresponding to each row of the most probable alignment and the question arises. How do we compute transition probabilities based on these paths? Let's take a look at one particular node in these HMM diagram, and nodes and passes. Only four out of five passes paths resist node and the rather for a full transition from this state M5. 3 of them go to the state I5, 1 go to the state M6, and 0 go to the state D6. And therefore, it makes sense to define transition probabilities from M5 to I5 as 3/4 transition probability from M5 to M6 as 1/4 in transition probability from M5 to D6 as 0. It turns out that this common sense assigning of contrition probability. Actually correspond to optimal assigning of transition probabilities in the HMM diagram. But what about the emission probability? In this case, we look at the node M2 and it is traversed by again by 4 out of 5 passes. And the emitted symbol are C, F, C, and D from this state. And therefore, we define the initial probabilities correspondingly, according to the count of the emitted symbol from this state. We are almost done with defining the profile HMM for multiplier alignment. But let's consider a small complication caused by forbidden transitions in the HMM diagrams. Please note that most of the transitions between states in the profile HMM are forbidden. And the graph of the HMM diagram shown here only shows allowed transitions. So the transition matrix, matrix of transition probability for these HMM diagrams is mainly empty. We show gray cells here corresponding to a loud transition and clear cells corresponding to forbidden transitions. Everything is zero in the forbidden transition. But please know there are also many zeros in the gray cells. And you should worry about a zero, because they simply indicate the number of throws in our multiplier alignment is small. and therefore, don't forget about pseudocounts in this HMM by aiding small pseudocounts for every gray cell in the matrix of transition probability for profile HMM. It's usually be done in the section when we can consider it finding.