Learn to use tools from the Bioconductor project to perform analysis of genomic data. This is the fifth course in the Genomic Big Data Specialization from Johns Hopkins University.

From the course by Johns Hopkins University

Bioconductor for Genomic Data Science

Course 6 of 8 in the Specialization Genomic Data Science

From the lesson

Week One

The class will cover how to install and use Bioconductor software. We will discuss common data structures, including ExpressionSets, SummarizedExperiment and GRanges used across several types of analyses.

- Kasper Daniel Hansen, PhD, Assistant Professor, Biostatistics and Genetic Medicine

Bloomberg School of Public Health

Sounds a little funny, but let's take an example. We construct an IRanges by using the IRanges constructor function, and we give two out of three arguments: start, end, and width. We only need two, because if we know two of them, the last one can be inferred.

So here we have a start, an end, and a width, and we can see the width column has been filled in from the start and the end.

Here we construct another IRanges by specifying the start and the width, and we get exactly the same object out.
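A minimal sketch of the two constructor calls described above. The specific start, end, and width values are illustrative assumptions, since the transcript does not show the numbers used on screen.

```r
library(IRanges)

# Illustrative values (assumed, not taken from the lecture screen).
ir1 <- IRanges(start = c(1, 3, 5), end = c(3, 5, 7))  # width is inferred
ir2 <- IRanges(start = c(1, 3, 5), width = 3)         # end is inferred

ir1                                  # prints start, end, and width columns
identical(start(ir1), start(ir2))    # the two objects share the same starts
identical(end(ir1), end(ir2))        # and the same ends
```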

We access the different components of the IRanges vector using accessor functions that are named start, end, and width.

So, for example, start of the IRanges gives us back a vector of the start positions of these different intervals.

We can also set the elements of the IRanges using these accessor functions. In this case here we have resized the different ranges to have width 1, and we can see that the start position was used as the anchor point of the resizing.
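The accessor and setter usage described above might look like the following sketch. The example ranges are assumed for illustration.

```r
library(IRanges)

# Illustrative ranges (assumed values).
ir <- IRanges(start = c(1, 3, 5), width = 3)

start(ir)        # integer vector of start positions
end(ir)          # integer vector of end positions

# The replacement form of width() keeps each start fixed
# and moves the end, so the start acts as the anchor.
width(ir) <- 1
ir
```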

IRanges can have names like any other vector, and because they are vectors, even though they look a little bit like matrices, they don't have a dimension, they have a length. And we subset them using a single bracket with a single index, either an integer or the name, exactly as we know it from any other vector.
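A short sketch of naming and subsetting, under the assumption of three example ranges and made-up names "A1" to "A3":

```r
library(IRanges)

# Illustrative ranges and names (both assumed).
ir <- IRanges(start = c(1, 3, 5), width = 3)
names(ir) <- paste0("A", 1:3)

dim(ir)       # NULL: an IRanges is a vector, not a matrix
length(ir)    # number of ranges

ir[1]         # subset by integer index
ir["A1"]      # subset by name
```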

A specific type of IRanges that's very important, because we encounter it again and again in usage, is something called a normal IRanges. That's a little hard to explain at first, so let's plot them and see some examples. So first, we evaluate a function here that allows us to plot these things.

We construct an IRanges and we plot it. So here we have an IRanges, and as you should have seen before, there's no requirement that the different intervals inside the IRanges are non-overlapping. We have two intervals on the left that are clearly overlapping. So you can think of this, for example, as exons in the genome. A normal IRanges is created by the reduce function, and it's a minimal representation of the original IRanges as a set. So what do I mean by that? Well, I mean that each integer that belongs to one or more of the original ranges belongs to a single range.

Furthermore, the ranges are as big as they can be. If you look to the right of the picture, the two ranges have been merged into one. They're also sorted, so that the first element of the output is the element furthest to the left on the diagram.

So this is kind of a minimal representation of the integers that belong to the original IRanges. And we'll see many functions that output normal IRanges.
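To make the description of reduce concrete, here is a small sketch. The four ranges are assumed values chosen to match the picture described (two overlapping ranges on the left, two adjacent ranges on the right); the plotting helper from the lecture is not reproduced.

```r
library(IRanges)

# Assumed example: [1,4] and [3,4] overlap; [7,8] and [9,10] are adjacent.
ir <- IRanges(start = c(1, 3, 7, 9), end = c(4, 4, 8, 10))

red <- reduce(ir)   # merges overlapping and adjacent ranges, sorted left to right
red                 # two ranges: [1,4] and [7,10]

isNormal(ir)        # FALSE: the input has overlapping ranges
isNormal(red)       # TRUE: the reduced ranges are sorted and separated by gaps
```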

In a way, the inverse of reduce is a function called disjoin. It can be incredibly handy when you need it, but I've found that I mostly use it in rare, esoteric circumstances. So disjoin here also creates a set of disjoint intervals.
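A sketch of disjoin on an assumed set of four ranges. Where reduce merges, disjoin splits the covered positions at every range boundary, so no two output ranges overlap.

```r
library(IRanges)

# Assumed example ranges.
ir <- IRanges(start = c(1, 3, 7, 9), end = c(4, 4, 8, 10))

dj <- disjoin(ir)
dj   # [1,2], [3,4], [7,8], [9,10]
```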

When you manipulate IRanges, there is a set of functions that do kind of a straightforward manipulation.

One way of manipulating IRanges is a manipulation that takes all of the original ranges and produces a single new range for each of them.

Let's close this off here and look at resize. So here we have an IRanges of length 4, and we resize them around the start position. You can see the fix argument here: we want to resize them to width 1, anchored at the start position. More useful, in my experience, is to resize them from the center of the intervals. In this case here, the original ranges have an even number of elements, and the start position becomes the element to the left of the midpoint. There are other types of manipulation, such as shift, flank, and so on.
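The two resize calls described above can be sketched as follows. The four input ranges are assumed values; each has an even width, so the center case shows the rounding to the left of the midpoint.

```r
library(IRanges)

# Assumed example: an IRanges of length 4 with even widths.
ir <- IRanges(start = c(1, 3, 7, 9), end = c(4, 4, 8, 10))

# Anchor at the start: each range collapses onto its first position.
at_start <- resize(ir, width = 1, fix = "start")

# Anchor at the center: for even widths, the position just left
# of the midpoint is chosen.
at_center <- resize(ir, width = 1, fix = "center")

start(at_start)    # 1 3 7 9
start(at_center)   # 2 3 7 9
```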

Another way of manipulating IRanges is thinking of IRanges as sets of integers. In other words, converting them to normal IRanges first. Then we can think about doing stuff like union and intersection.

So we can take the union of them, and we can see that what comes out of it is a normal IRanges. We have merged things together. And you can see here that, in the union, we have merged the neighboring ranges together.

Another way of saying that is that the union is equal to first concatenating the two IRanges together and then calling reduce.
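A sketch of the set view of IRanges, using two assumed sets of width-1 ranges. It checks that union gives the same ranges as concatenating and then reducing.

```r
library(IRanges)

# Assumed example ranges, each covering a single integer.
ir1 <- IRanges(start = c(1, 3, 5), width = 1)
ir2 <- IRanges(start = c(4, 5, 6), width = 1)

u  <- union(ir1, ir2)        # covers {1, 3, 4, 5, 6}: ranges [1,1] and [3,6]
rc <- reduce(c(ir1, ir2))    # concatenate, then reduce
i  <- intersect(ir1, ir2)    # covers {5}: range [5,5]

identical(start(u), start(rc)) && identical(end(u), end(rc))
```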

Now the real power of the IRanges library is the findOverlaps function. findOverlaps allows us to relate two sets of IRanges to each other. Let's take an example here. Let's look at them. And now we're going to find the overlaps between them.

So the output of the findOverlaps function is this two-dimensional matrix, or it looks like a two-dimensional matrix.

When I call findOverlaps, I give it a query and a subject. And that's what the two columns of the Hits object here refer to. So the Hits object is really an adjacency matrix, or it is a matrix of indices of the different overlaps. So the first row, or the first element of the Hits object, means that range number one in the query overlaps range number one in the subject. So let's verify that by hand. Range number one of the query is ir1[1], and range number one of the subject is ir2[1]. And we can see that these two ranges do overlap.
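A sketch of the findOverlaps call and the by-hand check described above. The names ir1 and ir2 follow the transcript; the range values are assumptions chosen so that the first query range overlaps the first subject range.

```r
library(IRanges)

# Assumed example values for the query (ir1) and subject (ir2).
ir1 <- IRanges(start = c(1, 4, 8), end = c(3, 7, 10))
ir2 <- IRanges(start = c(3, 4), width = 3)   # [3,5] and [4,6]

ov <- findOverlaps(ir1, ir2)   # query first, subject second
ov                             # a Hits object: one row per overlapping pair

# Verify the first hit by hand: [1,3] and [3,5] share position 3.
ir1[queryHits(ov)[1]]
ir2[subjectHits(ov)[1]]
```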

We can access the query hits and the subject hits through the queryHits and the subjectHits accessor functions.

So let's do this overlap here, and we get basically the columns of what looks like a matrix here.

Note that there are repeated elements in the query hits, because range number two in the query overlaps multiple ranges in the subject.
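The accessor functions named above can be tried on a small assumed example, in which the second query range overlaps both subject ranges and therefore appears twice in the query hits:

```r
library(IRanges)

# Assumed example values.
ir1 <- IRanges(start = c(1, 4, 8), end = c(3, 7, 10))
ir2 <- IRanges(start = c(3, 4), width = 3)
ov  <- findOverlaps(ir1, ir2)

qh <- queryHits(ov)     # indices into ir1, with 2 repeated
sh <- subjectHits(ov)   # indices into ir2
qh
sh
```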

It's very common to call unique on both the queryHits and the subjectHits. In this case here, unique would give us exactly

whether or not there should be a minimal overlap. Whether or not the overlap should just be like we've done it here. Or an overlap means they should be exactly equal to each other, for example.
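As a hedged illustration of these options, findOverlaps takes a minoverlap argument and a type argument. The ranges below are assumed values, and only two of the available settings are shown.

```r
library(IRanges)

# Assumed example values.
ir1 <- IRanges(start = c(1, 4, 8), end = c(3, 7, 10))
ir2 <- IRanges(start = c(3, 4), width = 3)

# Require at least 2 shared positions: the single-position overlap
# between ir1[1] and ir2[1] is dropped.
ov_min <- findOverlaps(ir1, ir2, minoverlap = 2)

# Require identical start and end: comparing ir1 with itself,
# each range matches only itself.
ov_eq <- findOverlaps(ir1, ir1, type = "equal")

length(ov_min)   # 2
length(ov_eq)    # 3
```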

And there's also a way of specifying what should be returned. Should it be all the possible overlaps? Or just the first overlap you encounter? And so on and so forth. This takes a while to become totally comfortable with, and we will see more uses of it throughout the class.

So in many cases when you're running findOverlaps, you're not really interested in the exact overlaps. You're just interested in how often you see overlaps between a query set and a subject set. For this we have the convenient function countOverlaps, which returns a vector. In this case, it means that range number one in ir1 overlaps one range in ir2, namely ir2[1]. Element two overlaps two elements of ir2, as represented in the Hits object above. And element number three doesn't overlap anything at all. countOverlaps is faster and more memory efficient, which matters a lot if you use this for extremely big IRanges, and we will be using it for extremely big IRanges.

Finally, we can also relate IRanges in a different way than through overlaps. We can look at which ones are close to each other. So again, we take our two IRanges and we can ask: which of the ranges in ir2 are closest to the ones in ir1?
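The three ideas above (choosing what is returned, counting overlaps, and finding the nearest range) can be sketched together. The names ir1 and ir2 follow the transcript; the values are assumptions chosen so that the counts match the description (one overlap, two overlaps, none).

```r
library(IRanges)

# Assumed example values.
ir1 <- IRanges(start = c(1, 4, 8), end = c(3, 7, 10))
ir2 <- IRanges(start = c(3, 4), width = 3)   # [3,5] and [4,6]

# select = "first" returns one subject index per query range,
# with NA where there is no overlap.
first_hit <- findOverlaps(ir1, ir2, select = "first")

# One count per query range.
counts <- countOverlaps(ir1, ir2)

# Index of the closest range in ir2 for each range in ir1.
# ir1[2] overlaps both subject ranges, so which of the two
# is reported for it is not asserted here.
nn <- nearest(ir1, ir2)

first_hit   # 1 1 NA
counts      # 1 2 0
nn
```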