In this session I will introduce the BSgenome package, which is a package for representing full genomes in Bioconductor. Genome packages in Bioconductor have a very specific naming scheme that's rather long, so it's useful to get an idea of which genomes are directly available. You get that with the available.genomes function, which lists all the genomes you can download directly from the Bioconductor website. You can make genome packages yourself, it's not that difficult, but of course it's much more convenient to get something from the Bioconductor website. We see that we have different species and different sources: if you look under human, Hsapiens, there's both a genome from NCBI and a genome from UCSC. Some packages are masked, some are not masked, and there are different genome versions. So there's a wealth of opportunities there.
To illustrate further, we're going to take a specific yeast genome and look at that, because the yeast genome is not that big and our commands are going to execute relatively quickly. When you load a genome package, you get an object named after the species, in this case Scerevisiae. This is the short name of the genome object. We print it, and we can see that in this case Bioconductor got the genome from UCSC, we can see a release date, and we can see the names of the different sequences, chromosomes, or contigs that are present in this version of the genome. We can get the names with seqnames and the lengths with seqlengths, as we know from GRanges.
It's important to understand that at this point in time, nothing has been loaded into memory. BSgenome allows for efficient representation of genomes and for loading and unloading of large sequences on the fly.
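The steps described so far can be sketched roughly as follows. The session does not name the yeast package; BSgenome.Scerevisiae.UCSC.sacCer2 is assumed here purely for illustration, and any installed yeast BSgenome package should behave the same way. Note that available.genomes needs network access to the Bioconductor repositories.

```r
library(BSgenome)

# List genome packages that can be installed directly from Bioconductor
# (queries the online repositories, so this needs a network connection).
head(available.genomes())

# Assumption: the sacCer2 build from UCSC; the session only says
# "a specific yeast genome", so substitute whichever build you have installed.
library(BSgenome.Scerevisiae.UCSC.sacCer2)

# Printing the object shows provider, release information and sequence names.
# No sequence data has been read into memory at this point.
Scerevisiae

# Names and lengths of the sequences, as with GRanges.
chrom_names <- seqnames(Scerevisiae)
chrom_lengths <- seqlengths(Scerevisiae)
chrom_names
chrom_lengths
```

Only metadata is touched here, so these calls return quickly even for much larger genomes.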
This on-demand loading is pretty nice, because otherwise, when you compute on, say, the human genome, you very quickly use up a lot of memory. So let's access a given chromosome here. We do that with the dollar operator or the double bracket operator, as we know from lists. Now we have loaded that specific chromosome into memory. There are about 230,000 letters, and it's a DNAString. Now we can call standard functions on this long string. For example, we will compute the GC content of the chromosome with letterFrequency. This just gives us the number of G and C nucleotides; if you want it as a proportion, use as.prob = TRUE. We see that this particular chromosome has a GC content of around 40 percent.
So now it seems very natural, in order to compute the GC content of the entire genome, to take this function and apply it to each chromosome. You can do this using lapply, but the way to do it with genome objects is a new type of apply called bsapply. bsapply has a slightly different interface compared to the standard apply family of functions. That's because behind the scenes, when you run bsapply, it loads and unloads the different chromosomes as they are needed. So this is very fancy.
We start off a bsapply by setting up something called a BSParams. A BSParams is a small object that contains the function we're going to apply and the object we're going to apply it to. This seems very weird when you see it the first time, but it's a paradigm that has begun to be introduced in Bioconductor packages. For example, the BiocParallel package for parallel processing in Bioconductor uses this paradigm quite a bit. It makes a lot more sense when you see it in practice. You set up a new BSParams object, and inside there's an X, which is the object we're going to apply the function to. That's not in quotes, it's the object Scerevisiae itself, and the function is going to be letterFrequency.
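The single-chromosome part of this walkthrough might look like the sketch below. The package name BSgenome.Scerevisiae.UCSC.sacCer2 and the chromosome name chrI are assumptions made for illustration; use seqnames on your own genome object to see which names are valid.

```r
# Assumption: sacCer2 yeast build; the session does not name the package.
library(BSgenome.Scerevisiae.UCSC.sacCer2)

# Accessing a chromosome loads just that sequence into memory.
# Scerevisiae[["chrI"]] is the equivalent double bracket form.
chr1 <- Scerevisiae$chrI   # "chrI" is an assumed sequence name
chr1

# Count of G and C nucleotides combined on this chromosome.
gc_count <- letterFrequency(chr1, "GC")

# The same quantity as a proportion of the chromosome length.
gc_prop <- letterFrequency(chr1, "GC", as.prob = TRUE)

gc_count
gc_prop
```

Passing "GC" as a single string asks letterFrequency for one combined count rather than separate counts for G and C.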
We then use the BSParams object by just running bsapply on it. Remember that for letterFrequency, we need to say which nucleotides we're going to count. This is an additional argument to the function, and we just put it inside the bsapply call. Back we get the number of GC nucleotides on the different chromosomes. It's a little hard to see because we get a list back, so let's unlist it, and here we have it. Now we're almost there. To get the GC content across the entire genome, we sum up all of the GC nucleotides and divide by the sum of the lengths of the chromosomes. So here we have it: the GC content of the genome is 38%.
Now let's compare this with the GC content of the individual chromosomes. We can use as.prob = TRUE to get it as a proportion. Scanning over this list, we see that there's almost no difference from chromosome to chromosome, except for the mitochondrial chromosome, which has a much smaller GC content. This introduced BSgenome objects and the bsapply function.
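Putting the BSParams and bsapply steps together, a minimal sketch could look like this. As before, the package BSgenome.Scerevisiae.UCSC.sacCer2 is an assumption for illustration, and the variable names are hypothetical.

```r
# Assumption: sacCer2 yeast build; the session does not name the package.
library(BSgenome.Scerevisiae.UCSC.sacCer2)

# Bundle the genome object (not a quoted name) with the function to apply.
param <- new("BSParams", X = Scerevisiae, FUN = letterFrequency)

# Extra arguments are passed on to letterFrequency for each chromosome.
# bsapply returns a list, so flatten it into a named vector.
gc_counts <- unlist(bsapply(param, letters = "GC"))

# Genome-wide GC content: total G/C count over total sequence length.
gc_genome <- sum(gc_counts) / sum(seqlengths(Scerevisiae))

# Per-chromosome GC content as proportions, for comparison.
gc_by_chr <- unlist(bsapply(param, letters = "GC", as.prob = TRUE))

gc_genome
gc_by_chr
```

Summing counts before dividing weights each chromosome by its length, which is why this is preferred over averaging the per-chromosome proportions.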