Hello everyone. Today we are going to talk about metagenomic assembly and binning, which is an alternative to the read-count analysis we have previously discussed. You can use the same raw data you get from your sequencing machines to do this kind of analysis. So, an outline: we are going to very briefly talk about metagenomic assembly, and some of the things you might want to do when you do that. Then we're going to talk about how we can characterize the different contigs or scaffolds we get from the assembly. Then we are going to talk about the actual binning, putting scaffolds together that belong together, and then some of the things you might want to do after you have binned your different organisms. So just to clarify, it's the part over here we're going to talk about, where you have your reads, you do your assembly, and you do your binning after that. Okay, so very briefly: genome assembly. If you have just a single organism, you have isolated it and sequenced its genome, and you get a lot of reads, a good analogy would be a puzzle like this. You have a lot of puzzle pieces, and you know they are pretty much all going to fit together to make a single puzzle. And you will pretty much have all the pieces if you have sequenced deep enough. And that's pretty nice. Now, in metagenomic assembly on the other hand, you have all these different colors here signaling the different species, the different organisms living in our environmental sample. And you need to put these together, not into one puzzle, but into maybe hundreds of different puzzles. So for some of them, we are hoping for this: yes, we can fit it all together and we reconstruct an entire organism, all its genomic content. This is what we're hoping for: full genomes or draft genomes, very good. But in many cases, you will not have all the data. You will get some of your genome, maybe very little. This could be analogous to a scaffold or a single contig.
So you'll basically get this mixture. So what are some of the pros and cons of this assembly-binning approach over the read mapping we have previously discussed? One of the pros, of course, is that you get the genomic context of your genes. Instead of just knowing there's this resistance gene and it's this abundant, now we can see it's actually sitting together, maybe with other antimicrobial resistance genes, which would cause the organism to be multi-resistant; it's not spread across different bacteria. That's useful information. Also, you can get the taxonomic context. For example, it's very relevant to know whether it's a dangerous, pathogenic bacterium that has the resistance gene, or just a bacterium we don't care too much about clinically. Another pro is that you can actually discover completely new bacteria. We have never grown them before, never seen them in any databases, you cannot find them at NCBI, but we maybe manage to piece them together from these small puzzle pieces. And if you actually manage to assemble something, then you are pretty sure it's there. Whereas with read counts, it can be a bit harder to correctly assign the reads to where they need to go. Some of the cons: you have a much higher limit of detection, which is kind of tied to this. You need a certain depth on each organism in order to assemble it. I seem to recall people saying, okay, maybe you need 7 billion base pairs of bacterial DNA in order to accurately assemble a bacterium making up 1% of your data. Then it's also very computationally expensive. You might not want to run big surveillance sequencing assembly-binning projects on your laptop; maybe you want a compute cluster somewhere to carry it out for you. And if something is not assembled, that does not mean it's not in your sample. So for surveillance work, absence from the assembly is not a good indicator of absence from the sample. It could fail to assemble for a lot of reasons.
It could be that there were a lot of strains that looked very similar, so the assembler didn't manage to piece them together. So it's more qualitative and less quantitative than the read mapping approaches. So first, you would assemble your reads, and I'm not going to explain the algorithms for doing this. But if we have our blue reads here, these are paired-end reads with some distance in between, we can see we can actually get continuous sequences, or contigs, in this case two of them. And then because the reads are paired, even though we don't know the sequence in between, we know that these two contigs actually belong together. So we can call this one scaffold. A lot of these metagenomic assemblers will also do scaffolding, and these are some popular examples here. Yes, so after you have done your metagenomic assembly, you have maybe, I don't know, 100,000 or so different scaffolds, and you want to know which scaffolds actually belong together. So to keep with our puzzle or jigsaw analogy, you have these partially completed puzzles and you can see, that's a bit of a horse here, this is some other part of a horse, these scaffolds need to go together. These other partially completed pieces belong to another puzzle, and you want to separate them out, or bin them. You can use a lot of different information to do your binning. A popular way is to look at the read coverage. So you take your original short reads and you map them back to your contigs or scaffolds, and then you simply see what the coverage is across the different scaffolds. The idea being that if two contigs come from the same genome in your sample, so the same organism, then they would have been sequenced to roughly the same depth; they would have a similar coverage, and they go together. It gets even better if you have multiple samples that vary a bit, and you sequence all of them.
You do your assembly on one of the samples, then you take the reads from the other samples, map them to that metagenomic assembly, and use that as additional information. I'm going to show you a little bit of that later. You can also look at the composition or some other characteristics of each of your scaffolds. For example, you might take all four-nucleotide words, 4-mers, tetranucleotides, count how many times each 4-mer occurs, and use that as a way to describe each of your scaffolds. You can look at the general GC content, predict genes, look at the codon usage, which is somehow specific to different taxa of microorganisms. You could also try to take each scaffold and assign it taxonomically, and use that as input in how you want to bin your scaffolds; or maybe predict all the genes, and then try to taxonomically assign the genes and use that information. So just to try to visualize it a bit here: if we have a super, super simple metagenome, we have two different organisms, that is, two different genomes in our sample. We have the green and the blue, and we have three times as much of the green. Then when we chop these up to sequence them, we get some short reads here, and we have more of the green ones than the blue ones. We then try to assemble these, and we end up with two blue contigs or scaffolds here, and two green ones. Now when we map our reads, here in red, back to them, we can see how covered each position is on average, on this one and on this one. On both of the blue ones, it's roughly two times coverage with our sequencing depth. But because we have three times as many copies of the green genome, we will see roughly three times as much coverage on the green contigs, so maybe six times coverage. So, yeah, the other way to do it is, as we said, you can use codon usage, GC content, taxonomic assignment, k-mer frequency. You can use a lot of different things to look at it.
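To make the composition idea concrete, here is a minimal sketch of how you might compute a tetranucleotide profile for a scaffold. The function name and the toy sequence are just illustrations, not part of any particular binning tool:

```python
from collections import Counter
from itertools import product

def tetranucleotide_freqs(sequence):
    """Count all overlapping 4-mers in a scaffold and return their
    relative frequencies as a fixed-order vector of length 256."""
    seq = sequence.upper()
    counts = Counter(seq[i:i + 4] for i in range(len(seq) - 3))
    # Fixed ordering over unambiguous A/C/G/T 4-mers, so that every
    # scaffold is described by a comparable 256-dimensional vector.
    kmers = ["".join(p) for p in product("ACGT", repeat=4)]
    total = sum(counts[k] for k in kmers) or 1
    return [counts[k] / total for k in kmers]

# Toy example: a short repetitive sequence.
profile = tetranucleotide_freqs("ATGCATGCATGC")
```

Scaffolds from the same genome tend to have similar profiles, so these vectors can feed directly into clustering or the correlation approach discussed next.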
So here we have some more fake data I've generated. We have four scaffolds, and then we have different samples where we have taken the reads and mapped them to those scaffolds, so we can see what the coverage is. We can see it's 27, 29, 5.8, 6, etc. So these different read sets are all mapped to the same scaffolds, and we can see how covered they are. Then, just as additional information, we have also calculated the GC content for each of our four scaffolds, in the last column here. So what you can do is take these different variables, these different attributes we have recorded for each of our scaffolds, like coverage for read sets 1, 2, 3, and GC. And when you plot your different scaffolds, you can see how scaffolds 1 and 2 up here follow each other, no matter which of these parameters or attributes you look at. And the same for scaffolds 3 and 4. And that is, of course, because these scaffolds actually originated from the same genome in your sample. So what you could do is say, okay, now we have these numbers for each of the scaffolds, so we could correlate these numbers to each other, in all of the combinations. And you would actually see that, yes, scaffold 1 and scaffold 2 are highly correlated, and not so correlated to the two other ones. And, again, scaffolds 3 and 4 are highly correlated, and not so correlated to the other two. So we have a strong indication that they're coming from two different organisms, and we could take the first two scaffolds and put them in their own bin, and the other two scaffolds in their own bin. So there are many ways to do binning. On the last slide I described some of the general things you might want to use for binning. You can do manual binning: you can have a user interface where you have all this information available, your GC, coverage and so on, and you can try to bin these different things by hand.
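The correlate-all-combinations step can be sketched in a few lines. The numbers below are made up in the spirit of the fake table (only the 27, 29, 5.8 and 6 values come from the slide; the rest, and the 0.99 cutoff, are hypothetical), and in a real pipeline you would scale the attributes before mixing coverage with GC:

```python
import numpy as np

# Hypothetical per-scaffold profiles: coverage in three read sets plus GC.
# Scaffolds 1 and 2 track each other, as do scaffolds 3 and 4.
profiles = {
    "scaffold_1": [27.0, 12.0, 41.0, 0.52],
    "scaffold_2": [29.0, 13.0, 44.0, 0.53],
    "scaffold_3": [5.8, 30.0, 2.0, 0.38],
    "scaffold_4": [6.0, 31.0, 2.2, 0.39],
}

names = list(profiles)
matrix = np.array([profiles[n] for n in names])
corr = np.corrcoef(matrix)  # pairwise Pearson correlation between scaffolds

# Scaffold pairs whose profiles correlate strongly are candidates
# for ending up in the same bin.
for i in range(len(names)):
    for j in range(i + 1, len(names)):
        if corr[i, j] > 0.99:
            print(names[i], names[j], round(corr[i, j], 3))
```

Running this reports the scaffold 1/2 pair and the scaffold 3/4 pair, matching the two-bin split described above.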
You can do that from any number of graphs, or emergent self-organizing maps, where scaffolds that are similar across all these attributes cluster together. You can maybe circle them and bin them yourself. Or you could use an automatic binner, an unsupervised solution; here are some examples of programs that can do that for you. So one of the pros of manual binning would be expert curation. You are actually sitting there making judgement calls, and maybe you have a lot of knowledge about what these genomes look like. A con of the manual approach is that it's time consuming. If you have hundreds of samples, you assemble them individually, and you try to sit and bin them, and they are complex metagenomes with hundreds of species, it's not really feasible. It's not something you can scale that easily. With automated binners it's much more scalable. It's also reproducible, because you can write down exactly how you ran your program with which parameters, and others should be able to reproduce it. You don't have to somehow defend all the individual decisions you made during your manual curation. The con, of course, is that there is no expert opinion going into it. Maybe you get some weird artifacts, things binned together that really shouldn't be, which you would have been able to spot manually. So perhaps for big projects it's best to first use an automatic binner, and then manually go in, curate it, and check that everything is actually okay, so you get the best of both worlds. After you have binned all your scaffolds and you have all these individual groups, maybe you would split them out into different FASTA files, so you have them as their own draft genomes. Here are some of the things you might want to do post-binning. For one, maybe you want to figure out which bacterium each of your bins actually is. You can compare their content to known bacteria.
You can do phylogenetics, compare them phylogenetically, align them to other whole bacterial genomes. You could also look for single-copy genes. So you search each of your files, each of your bins, for single-copy genes, because we know that all bacteria should have a number of genes that only occur once: these bacterial single-copy genes. So if we find a single-copy gene multiple times within the same bin, then we know there's some contamination of the bin; we actually have scaffolds belonging to different organisms in the same bin. On the other hand, if you're lacking some of these single-copy genes you know should be there, then, of course, you can say this is not a complete bin. This is just a partial genome, or a puzzle that's half done, to keep with the previous analogy. And then finally, maybe you were doing this whole surveillance project to look for antimicrobial resistance genes or virulence genes or something else. You can actually go in and search each of your bins for these specific genes you are interested in, and now suddenly you can know a lot more. You can know that this antimicrobial resistance gene is sitting in this specific species of bacteria, which is very, very useful knowledge. That's all for today. Thank you.
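As a closing sketch, the single-copy-gene check described above could look like this in its simplest form. The five marker genes and the counts are purely illustrative; real tools use marker sets of around a hundred genes chosen per lineage:

```python
from collections import Counter

# Hypothetical marker set; real pipelines use far larger, curated sets.
SINGLE_COPY_GENES = {"rpoB", "gyrB", "recA", "dnaK", "rplB"}

def bin_quality(marker_hits):
    """Estimate completeness and contamination of a bin from the
    single-copy marker genes found in it (a list, possibly with repeats)."""
    counts = Counter(g for g in marker_hits if g in SINGLE_COPY_GENES)
    # Fraction of expected markers seen at least once.
    completeness = len(counts) / len(SINGLE_COPY_GENES)
    # Markers seen more than once suggest scaffolds from >1 organism.
    extra = sum(c - 1 for c in counts.values())
    contamination = extra / len(SINGLE_COPY_GENES)
    return completeness, contamination

# A bin where recA occurs twice and dnaK is missing:
comp, cont = bin_quality(["rpoB", "gyrB", "recA", "recA", "rplB"])
```

Here the duplicated recA flags possible contamination, and the missing dnaK marks the bin as incomplete, exactly the two failure modes mentioned above.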