Our goal now is to solve the monkey and the typewriter problem for the case of a spectral dictionary, and after we solve this problem, we will see that it leads us, surprisingly, to ostrich hemoglobin. Let's start by finding the probability that a string Peptide matches the string starting at a given position in DecoyProteome. This probability, Pr(Peptide), is of course (1/20)^|Peptide|. Now, let's estimate the expected number of times Peptide appears in a DecoyProteome of length n. To compute this number, E(Peptide, n), we simply need to multiply Pr(Peptide) by n. Now, our goal is to find the expected number of times that peptides from Dictionary appear in a DecoyProteome of length n. This number, E(Dictionary, n), is simply n, the length of DecoyProteome, multiplied by the sum over all peptides in Dictionary of the probability of each individual peptide. For future reference, we denote it n*Pr(Dictionary).
The question that we will be interested in is: How many peptides in DecoyUniprot+ are expected to score at least -19 against DinosaurSpectrum? I remind you that -19 is the score of the peptide-spectrum match between DinosaurPeptide and DinosaurSpectrum. To answer this question, we will first need to solve the following Probability of Spectral Dictionary Problem: Find the probability of a spectral dictionary for a given spectrum and score threshold. The input is a spectral vector and a score threshold, and the output is the probability of the resulting dictionary.
Before we solve the Probability of Spectral Dictionary Problem, we will solve a slightly simpler Size of Spectral Dictionary Problem, which is: Find the size of a spectral dictionary for a given spectrum and score threshold. To compute the size of a spectral dictionary, we will introduce the variable size(i,t), which is the number of peptides of mass i that match the i-prefix of SpectralVector, s_1...s_i, with score t.
We will also introduce the variable size_a(i,t), which is the number of peptides matching the i-prefix with score t and ending in a specific amino acid a. Well, size(i,t), of course, is equal to the sum over all amino acids a of size_a(i,t). And since removing the last amino acid a from a peptide results in a shorter peptide with mass i-|a| and score t-s_i, we can rewrite this expression for size(i,t) as the sum over all amino acids a of size(i-|a|, t-s_i). And, of course, we need to add the initialization, which is size(0,0) = 1. So, we are now ready to compute the size of a spectral dictionary using this recurrence.
Given a spectral vector s_1...s_m, let's construct a directed acyclic graph on nodes 0, ..., m, with weight(node i) = s_i. Those will be the nodes of our graph, and I've already represented them simply as the weights that they inherit from the spectral vector. Let's assume that we have a toy example with amino acids X and Z with respective masses 4 and 5. It gives us this set of edges for amino acid X (from each node i to node i+4), this set of edges for amino acid Z (from each node i to node i+5), and a peptide in this case becomes simply a path in the resulting graph from source to sink. The labels of this path, of course, spell out the peptide. Now, the score of the peptide-spectrum match is simply the sum of the weights of the nodes on this path. In this case, the score is 3, and there's another path, corresponding to a different peptide XZX, whose score is 4.
Therefore, to compute the size of a spectral dictionary, we simply need to construct a table for computing size(i,t). So this is an example of how we compute the values of size(i,t) in a given column. In this case, the recurrence corresponds to the value s_i = 1. We continue further, and in this case, the recurrence corresponds to a value in the spectral vector equal to 0. And finally, we fill in the whole table. In this case, the final value, for the last element of the spectral vector, is 2.
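The size recurrence above can be sketched in a few lines of Python. This is a minimal sketch, not the lecture's own code: the function names `size_table` and `dictionary_size` are hypothetical, the toy spectral vector `TOY` is invented for illustration (it is not the vector from the slide), and the dictionary size is taken to be the sum of size(m,t) over all scores t at or above the threshold, which is how the "score at least" wording is read here.

```python
from collections import defaultdict

# Toy alphabet from the lecture: amino acids X and Z with masses 4 and 5.
MASSES = {"X": 4, "Z": 5}

# Hypothetical spectral vector s_1..s_13 (an assumption, not the slide's vector).
# Nonzero weights: s_4 = 1, s_8 = 1, s_9 = 2, s_13 = 1.
TOY = [0, 0, 0, 1, 0, 0, 0, 1, 2, 0, 0, 0, 1]


def size_table(spectral_vector, masses):
    """Return a list where size[i][t] is the number of peptides of mass i
    that score exactly t against the i-prefix of the spectral vector."""
    m = len(spectral_vector)
    size = [defaultdict(int) for _ in range(m + 1)]
    size[0][0] = 1  # initialization: the empty peptide has mass 0 and score 0
    for i in range(1, m + 1):
        s_i = spectral_vector[i - 1]  # the vector is 1-indexed in the lecture
        for mass in masses.values():
            if i - mass < 0:
                continue
            # size(i, t) += size(i - |a|, t - s_i) for each amino acid a
            for t_prev, count in size[i - mass].items():
                size[i][t_prev + s_i] += count
    return size


def dictionary_size(spectral_vector, masses, threshold):
    """Count peptides of mass m whose score is at least `threshold`."""
    last_column = size_table(spectral_vector, masses)[len(spectral_vector)]
    return sum(count for t, count in last_column.items() if t >= threshold)


if __name__ == "__main__":
    print(dict(size_table(TOY, MASSES)[len(TOY)]))
    print(dictionary_size(TOY, MASSES, threshold=4))
```

On this invented vector, the three peptides of mass 13 are XXZ, XZX, and ZXX, with scores 3, 4, and 3, so only XZX reaches a threshold of 4. Storing each column as a dictionary keyed by score avoids having to know the range of possible scores in advance, which matters when the spectral vector contains negative entries.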
We have just computed the size of the spectral dictionary, and it turns out that computing the probability of the spectral dictionary is not much different. Once again, consider a spectral vector s_1, ..., s_i, ..., s_m, and define Pr(i,t) as the sum of probabilities of all peptides matching the i-prefix with a score of t. We will also define Pr_a(i,t) as the sum of probabilities of all peptides matching the i-prefix with a score of t and ending in amino acid a. Obviously, Pr(i,t) equals the sum over all amino acids a of Pr_a(i,t), and since removing the last amino acid a from a peptide results in a shorter peptide with mass i-|a|, score t-s_i, and 20 times larger probability, we have the recurrence that is shown at the bottom of this slide: Pr(i,t) is the sum over all amino acids a of (1/20)*Pr(i-|a|, t-s_i). Please note that the only difference between the recurrence for size(i,t) and the recurrence for Pr(i,t) is this term "divided by 20". So compared with the example of computing size(i,t) that we just saw, the only difference in computing Pr(i,t) is that we need to divide new entries by 20.
And now, we are ready to find the statistical significance of the peptide-spectrum match between DinosaurPeptide and DinosaurSpectrum, under the assumption that we search for this peptide-spectrum match in the UniProt+ database, whose length is approximately 200 million amino acids. To compute it, I will give you a reminder and a hint. A reminder: PSM(DinosaurPeptide, DinosaurSpectrum) has a score of -19. And a hint: if you have already solved the problems of computing the size and probability of the spectral dictionary, you know that the dictionary of DinosaurSpectrum for threshold -19 contains an astounding number of peptides and has probability 0.00018. This allows us to answer the question: How many PSMs with score at least -19 do we expect to find in a decoy proteome of the same size as UniProt+? We are simply solving the monkey and the typewriter problem, n*Pr(Dictionary), and in this case we get an amazing number: over 35,000.
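As a hedged illustration of the probability recurrence and of the final arithmetic, here is a minimal Python sketch. The function names `dictionary_probability` and `expected_matches` and the toy vector `TOY` are assumptions made for this example, not part of the lecture. The only figures taken from the lecture are the database length of roughly 200 million amino acids and the dictionary probability 0.00018.

```python
from collections import defaultdict

# Toy alphabet from the lecture: amino acids X and Z with masses 4 and 5.
MASSES = {"X": 4, "Z": 5}

# Hypothetical spectral vector s_1..s_13 (an assumption, not the slide's vector).
TOY = [0, 0, 0, 1, 0, 0, 0, 1, 2, 0, 0, 0, 1]


def dictionary_probability(spectral_vector, masses, threshold, alphabet_size=20):
    """Sum of probabilities of all peptides of mass m scoring >= threshold.

    Same table as for size(i, t), except that every new entry is divided
    by the alphabet size (20 in the lecture), because each added amino
    acid makes a peptide 20 times less probable.
    """
    m = len(spectral_vector)
    pr = [defaultdict(float) for _ in range(m + 1)]
    pr[0][0] = 1.0  # the empty peptide has probability 1
    for i in range(1, m + 1):
        s_i = spectral_vector[i - 1]
        for mass in masses.values():
            if i - mass < 0:
                continue
            # Pr(i, t) += Pr(i - |a|, t - s_i) / 20 for each amino acid a
            for t_prev, p in pr[i - mass].items():
                pr[i][t_prev + s_i] += p / alphabet_size
    return sum(p for t, p in pr[m].items() if t >= threshold)


def expected_matches(n, pr_dictionary):
    """Monkey-and-typewriter estimate E(Dictionary, n) = n * Pr(Dictionary)."""
    return n * pr_dictionary


if __name__ == "__main__":
    # Toy case: three peptides of length 3 score at least 3,
    # so the result is approximately 3 * (1/20)**3.
    print(dictionary_probability(TOY, MASSES, threshold=3))
    # Lecture figures: about 200 million positions, Pr(Dictionary) = 0.00018.
    print(expected_matches(200_000_000, 0.00018))
```

With the lecture's figures, the product is about 36,000, which agrees with the "over 35,000" quoted above. The toy alphabet has only two letters, but the division by 20 is kept as a parameter so that the sketch mirrors the recurrence as stated.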
Therefore, finding DinosaurPeptide as an interpretation of DinosaurSpectrum is no more surprising than the monkey typing THE after 200 million attempts. There would be absolutely nothing surprising if that happened in the monkey typing exercise. And likewise, there is nothing surprising in finding DinosaurPeptide in the UniProt+ database as an interpretation of DinosaurSpectrum.
After it turned out that some peptides identified in the T-Rex sample are questionable, it became clear that all peptides reported in the T-Rex paper have to be re-analyzed. There are seven T-Rex peptides reported in the T-Rex paper. After receiving criticism of his discovery, John Asara released all 30,000 T-Rex spectra so that people from all over the world were able to analyze them. And within a week after the release of the T-Rex spectra, the laboratory of Martin McIntosh at the Fred Hutchinson Cancer Research Center in Seattle made an astonishing discovery. They found yet another peptide in the T-Rex sample that was much more statistically significant than any peptide Asara had identified. The shock was that the peptide actually comes from ostrich, and it's not a collagen. It is a hemoglobin.
It would be shocking if the hemoglobin peptide indeed came from T-Rex, because hemoglobins are much less conserved than collagens. Furthermore, hemoglobin peptides have never been found in much younger fossils, such as the bones of extinct cave bears. These cave bear fossils are so common in European caves that they were used as a source of phosphate to produce gunpowder during World War I. Because Asara had analyzed ostrich samples before analyzing the T-Rex sample, could it be that the hemoglobin peptide is a carry-over, that is, the identification of leftover peptides hiding inside a mass spectrometer after a previous experiment? Contamination is a fact of life in every proteomics laboratory.
Mass spectrometrists are never surprised when they identify human keratin in their samples, because the air in any room contains millions of tiny human skin particles. It doesn't matter how many times you have vacuumed it. If the hemoglobin peptide is a carry-over, then the entire T-Rex sample has been contaminated, implying that all other T-Rex peptides should be discarded. However, Asara maintained that there was no contamination and that the ostrich hemoglobin must be a T-Rex peptide, expanding the class of proteins that can survive for millions of years beyond just collagen.
If the T-Rex fossil is indeed a treasure trove of ancient proteins, then why should we limit our search to collagen peptides? Why not search against all known proteins from all vertebrates? Of course, you should use criteria similar to the ones that Asara used, such as allowing for up to one mutation. If we follow this criterion, then we find a surprisingly diverse array of peptides, including mouse and human peptides. Thus, the hemoglobin peptide didn't help the claim about finding molecular evidence of the link between birds and dinosaurs: it weakened this claim.
And as the T-Rex peptides paper continues to age, there is no end in sight to its controversy. Yet, it was not the first paper to report the retrieval of genetic material from a dinosaur. In 1994, Scott Woodward announced that he had sequenced DNA from an 80-million-year-old dinosaur fossil. The most vehement critic of his finding was, believe it or not, Mary Schweitzer, who proved that Woodward had only sequenced contaminated human DNA. The moral is that, although we often present scientific discoveries as clear and noncontroversial, the reality is that many of the avenues of modern science fall short of this ideal. In a sense, the academic battleground is a part of the appeal of becoming a scientist in the first place.
But we also cannot help but wonder whether we would have a conclusive answer to whether Horner's fossil really contained dinosaur peptides if it had been shared with thousands of fossil researchers. Fittingly, in her criticism of Woodward's dinosaur DNA paper, Schweitzer wrote: "Real advance in paleontology will come only when it is demonstrated that those studies can be replicated in independent laboratories."