[MUSIC] All right, welcome to Bioinformatic Methods II. I'm your instructor, Nicholas Provart. Course material for this course was developed by Ryan Austin, David Guttman, Laura Hug, Momoko Price, and myself. And the course was produced by Jamie Waese, Rohan Patel, William Heikoop and, again, myself. As a reminder, please do use the Coursera tools to discuss the lecture content and labs.
The course format and syllabus are as follows. The course will cover motif searching, protein-protein interactions, structural bioinformatics, gene expression data analysis and cis-element prediction. Most of the tools used for exploration are web-based. Week 1, we'll cover protein motifs. Week 2, we'll cover protein-protein interactions. Week 3, protein structure. Weeks 4 and 5, gene expression analysis, and Week 6, cis-regulatory elements. The weekly material consists of a mini-lecture about 20 minutes long and short 2-minute intro and summary videos. Then there are the weekly labs, which will take you about 1 to 2 hours to do, and then there are fairly short lab quizzes associated with those. There's also an optional online lab discussion video that you can watch to help you work through the lab. And there are two sectional quizzes, one after the first three weeks' material and the other one at the end of the course. Finally, we'll finish up with one assignment, which is due at the end of the course. I should add that it's not necessary to have taken Bioinformatic Methods I for Bioinformatic Methods II. It would help, but it's not necessary.
All right, so in this week, we're doing Motif and Profile Analysis, and we'll talk about motifs and profiles and profile HMMs, and touch on a tool called HMMer and a database of profiles and motifs. So why do we want motifs and profiles? Why do we care about them? The reason is that divergence, evolutionary divergence, gives rise to sequence families.
Given protein families have related structural elements necessary for biological function. And there are tight constraints on amino acid composition and orientation, necessary, for example, for correct active site geometry. However, sequence divergence may result in no homologue being identified. But the structural elements might still be present, and we can use these to infer function if we can't identify a homologue. And also, having the model of the structural elements may allow better alignment of a new sequence family member. There are also sequence motifs that can be present in the promoters of genes. And these are necessary for the binding of transcription factors and other regulatory proteins. And we'll discuss these in greater detail in the cis-element laboratory in week 6.
All right, we'll start with motifs, which are also called patterns or rules. And this is the simplest approach to structural element identification. An example database for motifs is Prosite. So given an alignment, here's an example alignment here... we can start to see that certain residues within the alignment are conserved or at least semi-conserved. For instance, at the second position, we see an aspartate, which seems to be conserved. And then at the 4th position, we see a glycine that seems to be absolutely conserved. We can use the following set of rules to create or derive a motif. And the patterns in Prosite are described using these rules. First of all, we use the standard IUPAC one-letter code for the amino acids. We use an X to denote a position where any amino acid is accepted. We denote ambiguities within square brackets. So if we see something that looks like this, [ALT], that means that an alanine, a leucine or a threonine is allowed at that position. More general ambiguities use a pair of curly braces to indicate what is disallowed at that position. So for instance, this {AM} means that any amino acid except alanine or methionine is allowed at that position.
Now each element in the pattern is separated using a dash, though that's not an absolute rule. Repetition is denoted using a numerical value or a numerical range between parentheses. So x(3), for instance, would mean three Xs, and x(2,4) would mean you could have two Xs in a row, three Xs in a row or four Xs in a row. Patterns at the N- or C-terminal end of the sequence can be denoted using the leftward-pointing less-than symbol or the rightward-pointing greater-than symbol, respectively. And a period ends the pattern, though that's also not always observed. All right, coming back to our alignment, we use those rules to derive a motif, which we can see here. And we would read that motif as an alanine or serine at the first position, followed by an absolutely conserved aspartate, followed by I, V or L, followed by an absolutely conserved glycine, any one of four amino acids, anything except proline or glycine, followed by an absolutely conserved cysteine, then D or E, arginine, any one of phenylalanine or tyrosine, twice, and then ending up with a glutamine. So a real-life example would be the C2H2 zinc finger. And here we see two absolutely conserved cysteines, which are zinc ligands, as well as the two absolutely conserved histidines, which are also zinc ligands, and then this sort of intervening spacer region. The problem with the motif approach, though, is that there is no such thing as a partial match. So for instance, if we're searching with an evolutionarily divergent sequence and are trying to identify C2H2 zinc fingers, and that sequence doesn't have one of these amino acids in the spacer region, then it won't be found through a database search.
So this leads us to the next way of scoring patterns, and that's using profiles, and these are also called position-specific scoring matrices or PSSMs. So here, we've got another alignment of five sequences, one, two, three, four, five, and there are five positions in this alignment, five columns.
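As a brief step back to the Prosite pattern rules described above, here is a minimal sketch of how such a pattern might be translated into a regular expression. The function name `prosite_to_regex` is hypothetical, the motif string is reconstructed from the spoken description (reading "any one of four amino acids" as x(4) is an assumption), and the two test sequences are made up. It only handles the rules mentioned in the lecture, not the full Prosite syntax.

```python
import re


def prosite_to_regex(pattern: str) -> str:
    """Translate a Prosite-style pattern into a Python regular expression.

    Hypothetical helper covering only the rules described in the lecture.
    """
    pattern = pattern.strip().rstrip(".")  # a trailing period just ends the pattern
    parts = []
    for element in pattern.split("-"):  # elements are separated by dashes
        if not element:
            continue
        prefix = suffix = ""
        if element.startswith("<"):  # anchored at the N-terminal end
            prefix, element = "^", element[1:]
        if element.endswith(">"):  # anchored at the C-terminal end
            suffix, element = "$", element[:-1]
        # Split off an optional repetition such as (3) or (2,4).
        match = re.fullmatch(r"(.+?)(?:\((\d+(?:,\d+)?)\))?", element)
        core, repeat = match.group(1), match.group(2)
        if core.lower() == "x":  # any amino acid
            core = "."
        elif core.startswith("{") and core.endswith("}"):  # disallowed residues
            core = "[^" + core[1:-1] + "]"
        # [ALT]-style ambiguities and single letters are already valid regex.
        if repeat:
            core += "{" + repeat + "}"
        parts.append(prefix + core + suffix)
    return "".join(parts)


# Motif reconstructed from the spoken description (x(4) is an assumption).
MOTIF = "[AS]-D-[IVL]-G-x(4)-{PG}-C-[DE]-R-[FY](2)-Q."
motif_regex = re.compile(prosite_to_regex(MOTIF))

# Made-up test sequences, not taken from the lecture alignment.
print(bool(motif_regex.search("ADLGAVFALCDRYFQ")))  # satisfies every position
print(bool(motif_regex.search("ADLGAVFAPCDRYFQ")))  # proline where {PG} forbids it
```

The second sequence differs from the first at a single position and is rejected outright, which is the "no partial match" limitation of motifs in miniature.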
So we build a matrix with all of the amino acids on the rows here, cysteine, lysine, histidine, serine, and so on. And the positions in the matrix correspond to the alignment columns. At each position, we just record how often we see a cysteine or a glycine or a histidine at that position. So in the first column, we have four of the five amino acids being cysteines. So we put in a probability of observing a cysteine in that position of 0.8, and a probability of observing a glycine of 0.2. And we do that across all of the positions. So we can then use this profile, this PSSM, to actually score any given sequence as to how well it matches the profile. So if we're given a sequence, so here C G G S V, we can calculate a score based on the profile simply by multiplying the probability of observing a C at the first position times the probability of observing a G at the second position, a G at the third position, an S at the fourth position and a V at the fifth position, to come up with an overall score of 0.031. So it seems like a great thing. We can actually take into account the abundance of certain amino acids at given positions. There is some leeway given when creating the profiles in terms of deletions and the weights given to unlikely amino acids and so on. But these are all kind of tweaks that have to be done manually, and this leads us to a new kind of profile based on hidden Markov models.
Now just as an aside, I'd like to introduce sequence logos to allow the visualization of conserved residues. So what we're looking at here, even though you can't see anything, is a set of sequences that are in common between triose phosphate isomerases. This is from a profile database, and we see that there's a phenylalanine at the first position, some tryptophans here in sort of the middle, and so on.
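Coming back for a moment to the profile scoring described above, a minimal sketch of building a PSSM from an alignment and scoring a sequence against it might look like this. The alignment is hypothetical: only its first column (four cysteines and one glycine) comes from the lecture, and the other columns were made up so that the product for C G G S V lands near the 0.031 quoted above. The function names `build_profile` and `score` are also just illustrative.

```python
from collections import Counter

# Hypothetical five-sequence alignment; only the first column (four C, one G)
# is taken from the lecture, the rest is made up for illustration.
ALIGNMENT = ["CGGSV", "CGGSL", "CHGSL", "CHGKL", "GSKKI"]


def build_profile(alignment):
    """Return one {residue: frequency} dict per alignment column."""
    n_sequences = len(alignment)
    profile = []
    for column in zip(*alignment):  # iterate over columns, not rows
        counts = Counter(column)
        profile.append({aa: n / n_sequences for aa, n in counts.items()})
    return profile


def score(sequence, profile):
    """Multiply the per-position probabilities of the residues in `sequence`."""
    result = 1.0
    for residue, column in zip(sequence, profile):
        result *= column.get(residue, 0.0)  # unseen residues score zero here
    return result


profile = build_profile(ALIGNMENT)
print(profile[0])               # frequencies for the first column
print(score("CGGSV", profile))  # product of the five probabilities
print(score("CGGSW", profile))  # W never seen in column 5, so the score is 0
```

The last call shows why the weights given to unlikely amino acids matter: with raw frequencies, a single unseen residue drives the whole score to zero.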
But even if we add colour to denote residues that have the same physicochemical properties, it's really hard to tell which residues are conserved and how well they are conserved. We might pick up this lysine at this position here, the red stripe. But otherwise, it's kind of difficult, so we can use something called sequence logos to actually get at this in a visual way. And here, this is a sequence logo of that alignment, and what we can see actually is that there is absolute conservation of the lysine at the 7th position and semi-conservation of the asparagine at the fifth position. And this tryptophan here is also somewhat conserved at the sixth position. Now the height of the letters in this sequence logo is determined by the conservation, as measured by the entropy. And we use something called the bit score to calculate that, and the bit score is calculated given this equation here. Basically, for each amino acid at a given position, we compute the frequency of that amino acid and multiply it by the log 2 of that frequency, and then we sum over all amino acids at that position and take the negative of the sum, which is the Shannon entropy. And we subtract that value from the log 2 of 20 in the case of protein sequences, amino acid sequences, since there are 20 amino acids. In the case of nucleotide sequences, we would subtract the Shannon entropy value from the log 2 of four, because there are four different nucleotides. So the maximum value you can have, when a residue is absolutely conserved, as is the case for this lysine residue at position 7, is 4.32, so keep that in mind. The other nice thing about sequence logos is that you can read off a consensus sequence by simply reading off the top letter in each pile. The letters are ordered in each pile according to their abundance in the amino acid alignment at the given column position. So to read off the consensus sequence, we would simply read the top letter in each column.
W V M G N W K M N G T, and that will give us the consensus sequence for that particular alignment. So we can use these to examine bits of biology and look at, for instance, the CAP-DNA binding complex. We see that there are certain residues on the DNA sequence that this CAP protein recognizes, and these are visible here. We need a T G T G A and a T C A C A at this position, and then these in turn map to residues on the protein structure... these residues on the protein structure bind to these DNA residues. And we see conservation of these protein residues in terms of the DNA binding region of the helix-turn-helix motif. In the case of yeast TATA sites, we see that certainly there does seem to be a TATA motif. This is the start of transcription for the yeast promoters. Some sites within the TATA box are better conserved than others. For instance, the second A seems to be an absolute requirement. We can also see in the case of intron-exon splice junctions that the signal is actually fairly weak. There does seem to be a requirement of a G and T at the first and second positions of the intron, and an A and G at the last two positions of the intron. And then there's this polypyrimidine tract here towards the 3 prime end of the intron that is also required. But here again, it's not a very strong signal. We also see some requirement here of some nucleotide specificity at the 3 prime end of the exon.
So we're coming now back to hidden Markov models, and hidden Markov models or HMMs offer a more systematic approach to estimating the model parameters if we're trying to describe a specific structural pattern. It's a dynamic kind of statistical profile and, as with an ordinary profile, we can build it by analyzing the distribution of the amino acids in a training set of related proteins, an alignment. However, an HMM has a more complex topology than a profile.
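Before going further with HMMs, the bit-score calculation for sequence logos described earlier can be sketched roughly as follows. The function names and the example columns are made up for illustration. The sketch ignores gaps and the small-sample correction that logo software may apply, and the letter-height rule (frequency times the column's bits) is the usual logo convention rather than something spelled out in the lecture.

```python
import math
from collections import Counter


def column_information(column, alphabet_size=20):
    """Information content in bits: log2(alphabet size) minus Shannon entropy."""
    total = len(column)
    entropy = 0.0
    for n in Counter(column).values():
        frequency = n / total
        entropy -= frequency * math.log2(frequency)
    return math.log2(alphabet_size) - entropy


def letter_heights(column, alphabet_size=20):
    """Height of each letter in the stack: its frequency times the column's bits."""
    total = len(column)
    bits = column_information(column, alphabet_size)
    # most_common() lists the most abundant letter (top of the pile) first.
    return [(aa, n / total * bits) for aa, n in Counter(column).most_common()]


# Made-up columns, each string holding one residue per aligned sequence.
print(round(column_information("KKKKKKKK"), 2))               # 4.32, absolutely conserved
print(round(column_information("ACGT", alphabet_size=4), 2))  # 0.0, no conservation at all
print(letter_heights("NNNNNNDS"))                             # N is on top of the pile
```

Reading the first letter of each `letter_heights` result across the columns of an alignment gives the consensus sequence in the same way as reading the top letter of each pile in the logo.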
So rather than just having a matrix of values, we can use a finite state machine to represent not only the values at a given position but also the ability to transition into different states, so an insert state or a delete state. And this little cartoon here just shows the kinds of states, the hidden states, that can exist within a model in terms of a finite state machine. In the case of a sequence HMM, typically we have a match state for each position in the alignment that's well conserved, that is, not gappy. And then we also have insert states, as denoted by these characters here, and we also have delete states, denoted by the circles. And once we've created this HMM, we can actually generate a sequence by moving through the HMM, starting at the beginning and then transitioning in any number of ways into either an insert state or a match state or a delete state. And the transition probabilities can all be described based on the data that we use to generate the HMM. And the emission probabilities associated with the match states and the insert states are also described based on the data that we use to generate the HMM. So this is sort of a cartoon of what a sequence HMM would look like. In the case of a real alignment, something like this where we have eight match states, we determine the number of match states with a simple heuristic here: a match state is any column in the sequence alignment where more than 50% of the sequences have a residue at that position. There are more sophisticated ways of doing this. Then we would compute the frequency of each residue at each match state. So in this first column, for instance, we have one, two, three, four, five valines plus a phenylalanine plus an isoleucine. And in the match state emission probabilities, we would have the highest probability of emitting a valine at this given position, followed by isoleucine and phenylalanine.
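The match-state heuristic and the emission frequencies just described can be sketched as follows. The alignment here is a toy: its first column copies the five valines, one phenylalanine and one isoleucine from the lecture example, and everything else, including the names `match_columns` and `emission_frequencies`, is an assumption for illustration. Transition probabilities are not covered, and residues never seen in a column are simply left out.

```python
from collections import Counter

# Toy gapped alignment; only the first column mirrors the lecture example.
TOY_ALIGNMENT = ["VG-A", "VG-A", "VGKA", "VG-S", "VA-A", "FG-A", "IGRA"]


def match_columns(alignment, gap="-", min_occupancy=0.5):
    """Indices of columns where more than `min_occupancy` of sequences have a residue."""
    n_sequences = len(alignment)
    columns = []
    for j in range(len(alignment[0])):
        residues = sum(1 for seq in alignment if seq[j] != gap)
        if residues / n_sequences > min_occupancy:
            columns.append(j)
    return columns


def emission_frequencies(alignment, column, gap="-"):
    """Observed residue frequencies in one column, ignoring gaps."""
    counts = Counter(seq[column] for seq in alignment if seq[column] != gap)
    total = sum(counts.values())
    return {aa: n / total for aa, n in counts.items()}


print(match_columns(TOY_ALIGNMENT))            # the gappy third column is skipped
print(emission_frequencies(TOY_ALIGNMENT, 0))  # valine gets the highest frequency
```

In this toy case the third column holds residues in only two of the seven sequences, so it would be modelled by an insert state rather than a match state.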
We typically add in a very small probability of emitting other amino acids at a given position so that we can still use the HMM to score sequences rationally, and as I mentioned before, we also capture the transition probabilities between states. So the transition probabilities here are denoted by the width of the arrows. So the vast majority of the sequences don't contain any insertions or deletions. And so the transition would typically be in this direction. However, we can at some points transition into a delete state or an insert state. We would need to transition into an insert state to generate this sequence. Or to generate this sequence, we need to transition into a delete state, and then we finish up at the end. And then we can use this HMM with the Viterbi algorithm, which is sort of beyond the scope of this course. But we can use this model of sequence properties, alignment properties, to then score any given sequence as to whether or not it matches the HMM, or how well it matches the HMM.
A database of profile HMMs is Pfam. And it encompasses a large collection of multiple sequence alignments, which are then used to generate a large collection of hidden Markov models. The current iteration encompasses around 19,000 protein families. Pfam is formed in two separate ways. There are two flavours of Pfam models. Pfam-A HMMs are based on fairly accurate human-crafted multiple sequence alignments, whereas Pfam-B models are based on an automated clustering of the rest of SWISS-PROT using a program called Domainer. Pfam-A uses high-quality seed alignments to build HMMs, and then additional sequences are added to generate a final set of aligned sequences. And the seeds for those alignments are honed by iterative methods.
So there are issues... HMMs sound great, and it sounds like they've solved all our problems. They allow gaps. They allow deletions. However, it's a linear model and it's unable to capture higher-order correlations among amino acids in a protein molecule.
So for instance, amino acids which are far apart in the linear chain, but which may be in proximity to each other when the protein folds: the interactions between those amino acids, the dependencies, can't be predicted with a linear model. And for HMMs, we assume that the probability of any amino acid in the sequence is independent of its neighbours. And this may not always be true. So in the case of the hydrophobic core of proteins, hydrophobic amino acids are likely to appear in proximity to each other. And so researchers have developed new kinds of statistical models and neural nets, hybrid HMMs, dynamic Bayesian nets, factorial HMMs, and so on. But for the purpose of this course, we're just going to explore HMMs, and they really are quite useful.
So in today's lab, we'll use several domain, motif and profile HMM databases and tools to examine a representative sequence. We'll look at the CDD, the Conserved Domain Database. You should consider what was used to generate the CDD. We'll use CDART to identify conserved domain architectures. We'll look at SMART, which is the Simple Modular Architecture Research Tool, and look at Pfam. Actually, we won't be looking at HMMer, but it is a suite of tools for generating profile HMMs, if you're interested in exploring that on your own. InterProScan offers a convenient way to search Pfam and other profile and motif databases. It's not completely comprehensive, but it's a really good starting place to scan for sequence patterns in a protein of unknown function if you can't find a homologue. All right, well, I hope you enjoy the lab and I'll see you in a bit.