[SOUND] Yet another method of normalization for microarray data is the quantile normalization scheme. The basic idea behind this scheme is to transform the distribution of probe intensities so that every sample you want to compare has not only the same mean but exactly the same distribution of values. This is perhaps the most aggressive form of normalization in use for microarray data. It can work well in certain circumstances, but it is not always appropriate. For example, if your biological question requires you to detect absolute differences in expression that are not reflected in the rank order of the genes, then this method will not be able to address that question.

We can go into a little more detail about quantile normalization with a very simple example, which will give you an operational definition of what we mean. Here on the left we have an example data set. You can think of each of the three columns as a microarray sample and each row as corresponding to a gene, so each element is the expression of a given gene in a given sample. Under the quantile normalization scheme, the first step is to rank the data: within each column we reorder the numbers so that the smallest is at the top and the largest is at the bottom, and we do this for each column. Then we take the mean across each row and replace every value in that row with that mean. For the first row, for example, the mean of two, three, and four is three, so we replace all the numbers in that row with three, and we do this for each row. Finally, we reshuffle the rows within each column to restore the original rank order, and this gives us our transformed, quantile-normalized expression values. It is not immediately obvious from these numbers, but what we have effectively done is take the distribution of expression intensities from each sample and force them all to have exactly the same distribution.
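To make the ranking, averaging, and reordering steps concrete, here is a minimal sketch of the procedure in Python with NumPy. The matrix and the function name are made up for illustration (it is not the data set shown on the slide, although it is chosen so the smallest values in the three columns are 2, 3, and 4), and ties are handled in the naive way.

```python
import numpy as np

def quantile_normalize(X):
    """Quantile-normalize the columns of X (rows = genes, columns = samples)."""
    X = np.asarray(X, dtype=float)
    # Step 1: rank the data by sorting each column independently.
    sorted_cols = np.sort(X, axis=0)
    # Step 2: take the mean across each row of the sorted matrix;
    # that mean becomes the shared value for that rank.
    rank_means = sorted_cols.mean(axis=1)
    # Step 3: put the rank means back in the original rank order of each column.
    ranks = X.argsort(axis=0).argsort(axis=0)   # rank of each entry within its column
    return rank_means[ranks]

# Toy example: 3 genes (rows) x 3 samples (columns).
X = np.array([[5.0, 3.0, 4.0],
              [2.0, 6.0, 8.0],
              [3.0, 4.0, 7.0]])
print(quantile_normalize(X))
```

After the transformation, every column contains the same three values (3, 4.67, 6.33), just in different orders, which is exactly the "identical distributions" property described above.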
Now, let's take a quick glimpse at the effects of different normalization procedures on real gene expression data. Here we are looking at gene expression data from samples of peripheral blood: 189 samples in total, with expression profiles generated on an Illumina array with 14,434 unique probes. This data came from a study by the Center for Health Discovery and Well-being, and you can look it up yourself on GEO under accession GSE35846.

Let's look at two of the normalization schemes we have discussed and how they affect the distribution of probe intensities for this microarray data. On the top we have the mean-centering scheme, and on the bottom we show the distributions after quantile normalization. There is quite a contrast between them: under mean centering each distribution is different, although they all have the same mean, whereas under quantile normalization all 189 samples have exactly the same distribution of probe intensities.

After applying these two different normalization transformations to the data, we can get a further glimpse of their effect by examining the correlations between probes over the 189 samples. In this figure you see a heat map: an array of size 14,434 by 14,434 in which each element gives the correlation, across the 189 samples, of the expression intensities of a pair of probes, with the rows and columns shuffled to reveal the clustering structure. On the top is the probe similarity matrix for the mean-centered data, and on the bottom the probe similarity matrix for the quantile-normalized version. You can see immediately, just with your eyes, that there are significant differences in the apparent similarities between the expression of probes on this array. This illustrates in a visual way that the normalization scheme you apply has a real effect on the inferences you make from your data, and it can have a very strong effect on your biological conclusions. So it is very important to consider the normalization scheme that you use.

Now, let's consider some normalization methods that are specific to RNA-Seq data. RNA-Seq is a relatively new technology, and there is still a lot of work going into figuring out which normalization schemes are appropriate for which biological questions in this context. Here I will give you a sketch of some of the more common schemes for RNA-Seq data.

The first is total count normalization. This is a very crude scheme, roughly equivalent to the mean-centering approach for microarrays: for each gene count, you divide that number by the total number of gene counts for that sample. It is a basic way of removing the effect of variation in sequencing depth across your samples.

The second is more sophisticated; it comes from the DESeq method. Here the normalization scales the gene counts by a scaling factor, with a different scaling factor for each sample. To calculate the scaling factor for a sample, you first take, for each gene, the ratio of its count in that sample to the geometric mean of its counts across all samples, so that each gene contributes one ratio. You then take the median of those ratios across all the genes, and that median is the scaling factor with which you rescale the gene counts. The idea is that this scaling factor rescales the data in such a way that most genes will appear not to be differentially expressed, so the hypothesis that most genes are not differentially expressed is satisfied after the transformation is applied.
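As an illustration of this median-of-ratios idea, here is a minimal sketch in Python with NumPy. The counts are made-up values, and this simplified calculation is not the actual DESeq implementation.

```python
import numpy as np

def deseq_size_factors(counts):
    """Median-of-ratios size factors (one per sample) for a genes x samples count matrix."""
    counts = np.asarray(counts, dtype=float)
    with np.errstate(divide="ignore"):
        log_counts = np.log(counts)
    # Per-gene geometric mean across samples, computed on the log scale.
    log_geo_means = log_counts.mean(axis=1)
    # Genes with a zero count in any sample give -inf here and are dropped, as is commonly done.
    ok = np.isfinite(log_geo_means)
    # For each sample: median over genes of log(count / geometric mean), then back-transform.
    return np.exp(np.median(log_counts[ok, :] - log_geo_means[ok, None], axis=0))

# Made-up counts: 4 genes (rows) x 3 samples (columns).
counts = np.array([[100, 200, 150],
                   [ 50, 120,  60],
                   [ 30,  60,  45],
                   [ 10,  25,  12]])
size_factors = deseq_size_factors(counts)
normalized = counts / size_factors   # each sample rescaled by its own factor
print(size_factors)
```

Dividing each sample's counts by its size factor puts the samples on a common scale under the assumption that most genes are not differentially expressed.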
The third normalization scheme is the one used in the edgeR program, called the trimmed mean of M-values (TMM). It has a very similar idea to DESeq: you rescale the gene counts in such a way that most genes appear not to be differentially expressed. The difference is in the manner in which the scaling factor is calculated. With the trimmed mean, you take one sample and define it to be the reference sample; all the other samples are test samples. Then for each test sample you calculate, gene by gene, the log ratio of its counts to the counts in the reference sample. You exclude very highly expressed genes and genes with very large log ratios, and with what is left you take the mean, and this gives you the scaling factor with which you transform the data.

Quantile normalization is sometimes applied to RNA-Seq data in exactly the same way that it is applied to microarrays.

Finally, a very commonly used normalization scheme is Reads Per Kilobase per Million mapped reads (RPKM). This scheme rescales gene counts not only to correct for library size but also to take account of gene length; a small sketch of the calculation appears below. It has now been established that this method actually introduces a bias into the normalized data, particularly for lowly expressed genes. However, the method still remains popular in the literature. [MUSIC]
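Here is a minimal sketch of the RPKM calculation referred to above, again in Python with NumPy. The counts and gene lengths are made-up values, and the per-sample library size is approximated by the column totals of the count matrix.

```python
import numpy as np

def rpkm(counts, gene_lengths_bp):
    """Reads Per Kilobase per Million mapped reads for a genes x samples count matrix."""
    counts = np.asarray(counts, dtype=float)
    lengths_kb = np.asarray(gene_lengths_bp, dtype=float) / 1e3   # gene length in kilobases
    # Library size per sample in millions of reads (approximated by the column totals).
    millions_mapped = counts.sum(axis=0) / 1e6
    # Divide by library size (per million reads) and by gene length (per kilobase).
    return counts / millions_mapped[None, :] / lengths_kb[:, None]

# Made-up counts: 3 genes (rows) x 2 samples (columns).
counts = np.array([[500, 1000],
                   [ 50,  120],
                   [ 10,   30]])
gene_lengths_bp = [2000, 500, 1500]   # made-up gene lengths in base pairs
print(rpkm(counts, gene_lengths_bp))
```

Dividing by gene length in kilobases is what distinguishes RPKM from the library-size corrections above, and it is this gene-length term that is implicated in the bias for lowly expressed genes mentioned in the lecture.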