[MUSIC] Welcome to this session entitled Phylogenetic relatedness: CSI Phylogeny tool description and application. My name is Rolf Sommer Kaas, and I'm a bioinformatician with the National Food Institute, here at the Technical University of Denmark. In this session, I'm going to talk about the web tool CSI Phylogeny, which calls SNPs, or single-nucleotide polymorphisms, from raw data and assemblies, and then infers a phylogeny.

First, I'll talk briefly about some general information about phylogenies inferred from single-nucleotide polymorphisms, or SNPs. Then I'm going to talk specifically about the CSI Phylogeny pipeline, then I'll show you how to run it and which options you can set. And then we'll talk about some of the output.

Now, the most important assumption when we're calling SNPs and inferring phylogenies is that the SNPs are random and independent. First of all, we find the differences between each of your input samples and a specific reference. This is called SNP calling, and this is where we find these point mutations. In general, a rule of thumb is that the closer the reference, the better. Then we take all our SNPs and create a pseudo-alignment, and this is where the independence assumption comes in, and from that alignment we infer a phylogeny.

Now, for raw reads, you come in with your raw reads, and you then use some mapping software of your choice. For CSI Phylogeny, we use BWA, and the raw reads are then mapped to a reference sequence. It's important to note that any part that is found in the reference sequence but not in your data will be ignored, and any part of the genome that is in your sample but not in the reference sequence is also ignored. This is why it's important to use close references. Now, this is an example of how it could look when you map reads to a reference and call a SNP.
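To make the idea of mapping and SNP calling concrete, here is a minimal toy sketch in Python. It is not how BWA or CSI Phylogeny work internally: the function name `call_snps`, the `(start, sequence)` read format, and the simple majority vote are all assumptions made for illustration, and the reads are assumed to be already placed on the reference without gaps.

```python
from collections import Counter


def call_snps(reference, mapped_reads, min_depth=1):
    """Return {position: (reference_base, called_base)} for a toy pileup.

    mapped_reads is a list of (start, sequence) pairs with 0-based starts.
    Reads are assumed to be already mapped, without gaps (illustration only).
    """
    # One counter of observed bases per reference position.
    pileup = [Counter() for _ in reference]
    for start, seq in mapped_reads:
        for offset, base in enumerate(seq):
            pos = start + offset
            if 0 <= pos < len(reference):
                pileup[pos][base] += 1

    snps = {}
    for pos, counts in enumerate(pileup):
        depth = sum(counts.values())
        if depth == 0 or depth < min_depth:
            continue  # reference positions without enough reads are ignored
        called_base, _ = counts.most_common(1)[0]
        if called_base != reference[pos]:
            snps[pos] = (reference[pos], called_base)
    return snps


reference = "ACGTGGCTA"
reads = [(0, "ACGTA"), (2, "GTAGC"), (3, "TAGCTA")]
print(call_snps(reference, reads))
```

In this toy data set, every read covering position 4 carries an A where the reference has a G, so that position is reported as a SNP.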
So for instance, in this example, all the reads agree that there should be an A at this position, and in the reference, a G is found. So here we have a SNP: an A in the position of the reference G.

Now, we also want to filter our SNPs, and we do this for several reasons. One of the reasons is to get rid of false positives, which can occur from repeat sequences in your genome, and the other is mobile elements. Mobile elements are horizontally transferred and do not share the same phylogeny as the backbone of your genome, so that's why we're trying to get rid of that part. The way we do this is that we use the fact that these false positives, and the SNPs occurring in mobile elements, will often occur in clusters over the genome. So we look for SNPs that are within close proximity of each other, and then we remove those.

You should also be clear with yourself about why you're calling SNPs. If you're looking for specific mutations, then you probably want to be less conservative with your filtering, because then false positives might not be such a big issue. But if you're studying relations between different genomes, then you want to use a more conservative filtering.

So the first step is to call SNPs for each of your isolates, using the same reference, of course. Then you concatenate all the SNPs for each isolate into a SNP sequence, and you put these sequences on top of each other to create a pseudo-alignment. The pseudo-alignment is then put into a tree algorithm, which creates trees like this.

Now, in CSI Phylogeny, if we're fed raw reads, we start off by mapping the reads to the reference using BWA. We then call all possible SNPs using SAMtools. We then filter the SNPs based on several parameters: coverage, quality, and also Z-score. I'll get back to what the Z-score means. And then finally, we prune the SNPs.
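The concatenation into a pseudo-alignment described above can be sketched as follows. This is a simplified illustration, not the CSI Phylogeny code: the function name `build_pseudo_alignment` and the input layout (isolate name mapped to `{position: base}`) are hypothetical, and the sketch assumes that every position is validly covered in every isolate, so an isolate without a SNP at a position simply carries the reference base.

```python
def build_pseudo_alignment(reference, snps_by_isolate):
    """Build one SNP sequence per isolate from per-isolate SNP calls.

    snps_by_isolate maps isolate name -> {0-based position: called base}.
    Assumption: all positions are covered in all isolates.
    """
    # Every position where at least one isolate differs from the reference.
    positions = sorted(
        {pos for snps in snps_by_isolate.values() for pos in snps}
    )
    alignment = {}
    for name, snps in snps_by_isolate.items():
        # Use the called base if there is one, otherwise the reference base.
        alignment[name] = "".join(
            snps.get(pos, reference[pos]) for pos in positions
        )
    return positions, alignment


reference = "ACGTGGCTA"
calls = {
    "isolate_1": {4: "A"},
    "isolate_2": {4: "A", 7: "C"},
    "isolate_3": {1: "T"},
}
positions, alignment = build_pseudo_alignment(reference, calls)
for name, seq in alignment.items():
    print(f">{name}\n{seq}")  # FASTA-style output of the SNP sequences
```

Each output sequence has one column per variable position, so the sequences line up and can be handed to a tree-building program of your choice.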
Pruning is where we try to remove the SNPs that are in very close proximity to each other. Then we write all our SNPs into a text file in a specific format called Variant Call Format, which VCF is short for.

If you provide assembled genomes, then there's actually only one step: we use NUCmer, which aligns all your contigs to the reference, then finds the SNPs, and then does some pruning as well. This also illustrates why it's always better to use raw reads if possible, because it's much easier to validate the SNPs that we are actually calling.

So now we have raw reads or assembled sequences, and from them we create these VCF files with the SNPs. The next part of the pipeline is comparing all the SNPs found in all your isolates. We only use the part of the genomes that is common to all your isolates. This creates new VCF files with the SNPs at positions that are found in all of the genomes. These are also files that you can download on the output page. We also create a SNP matrix, which tells you the number of SNP differences between each pair of isolates. This is also something you can download on the output page. Then we take our SNPs, concatenate them into the pseudo-alignment, and infer a maximum likelihood phylogeny using FastTree. And this is what creates our final phylogeny.

So this is what part of the front page looks like. This is where you set all your options, and you start off by inputting a reference genome. Then you check whether you want that reference genome to be part of the phylogeny; it doesn't have to be. Then you select the minimum depth that you require at each of the SNP positions, so 10x means that you require at least 10 reads to cover each of your SNP positions in order to trust it. Then you can set the minimum relative depth, which is a percentage of the average depth that you have across your entire genome.
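As a rough illustration of the SNP matrix mentioned above, the sketch below counts pairwise differences between SNP sequences in a pseudo-alignment. The function name `snp_matrix` and the example sequences are made up for this example, and the tab-separated printout is only an approximation of what a downloadable text matrix might look like.

```python
from itertools import combinations


def snp_matrix(alignment):
    """Count pairwise SNP differences between aligned SNP sequences.

    alignment maps isolate name -> SNP sequence; all sequences must have
    the same length, since each column is one shared genome position.
    """
    lengths = {len(seq) for seq in alignment.values()}
    if len(lengths) > 1:
        raise ValueError("all sequences in the alignment must be equally long")

    names = sorted(alignment)
    matrix = {a: {b: 0 for b in names} for a in names}
    for a, b in combinations(names, 2):
        differences = sum(x != y for x, y in zip(alignment[a], alignment[b]))
        matrix[a][b] = matrix[b][a] = differences
    return names, matrix


alignment = {"isolate_1": "CAT", "isolate_2": "CAC", "isolate_3": "TGT"}
names, matrix = snp_matrix(alignment)

# Tab-separated table that a spreadsheet program can import.
print("\t" + "\t".join(names))
for a in names:
    print(a + "\t" + "\t".join(str(matrix[a][b]) for b in names))
```

The matrix is symmetric with zeros on the diagonal, so a quick look at one row tells you how far a given isolate is from all the others.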
So if you have an average depth of more than 100, then the relative depth would be stricter than the minimum depth setting above it. But the minimum depth will always be enforced, even if the relative depth ends up being less than 10x.

Then we have the minimum distance between SNPs; this is the pruning I was talking about. In this case, we require at least 10 nucleotides between each SNP in order to keep them. If we find SNPs with less than 10 nucleotides between them, then we will ignore all of those.

Then we set the minimum SNP quality and the minimum read mapping quality, and then we come to the Z-score. The Z-score is used to sort out ambiguous SNP calls. The Z-score is calculated like this, where X is the number of reads supporting the base in question, and Y is the number of reads that support alternative base calls. This means that setting a score of 1.6 corresponds to a probability of 0.05 that we are calling the wrong base. You can increase this to 3.26, and then your probability will drop to 0.01. Now, some of these ambiguous SNPs are also classified as heterozygous SNPs by SAMtools. This will usually be resolved by the Z-score, but you can choose to ignore them completely.

Now, the output results from CSI Phylogeny. You will obviously get a phylogeny out, and you will get it in Newick format, as a PDF, and as a picture file. We recommend that you download the Newick file and then use tree visualization software like FigTree, or something similar, to visualize and manipulate your tree. Then you can download the SNP matrix, which comes as a txt file or a PDF. The txt file is tab-separated and can be imported directly into Excel or other spreadsheet software. Then you can download the pseudo-alignment in FASTA format, which means that you can take this alignment and create trees using other tree-building algorithms. Finally, we have a section on quality control.
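The three filters described above (depth, Z-score, and pruning) can be sketched in a few lines of Python. Several things here are assumptions rather than facts from the talk. The Z-score formula is shown on a slide and not spoken, so the sketch uses Z = (X - Y) / sqrt(X + Y), which is how I understand the published description of CSI Phylogeny; treat it as an assumption. The 10% default for relative depth is inferred from the "depth of 100" remark. Measuring "distance" as the difference between positions is one possible reading. All function names are hypothetical.

```python
import math


def depth_threshold(average_depth, min_depth=10, min_relative_depth_percent=10):
    """Depth a SNP position must reach: the stricter of the two settings.

    The absolute minimum is always enforced, even when the relative
    depth (a percentage of the average depth) would be lower.
    """
    relative = average_depth * min_relative_depth_percent / 100
    return max(min_depth, relative)


def z_score(x, y):
    """ASSUMED formula: Z = (X - Y) / sqrt(X + Y).

    x = reads supporting the called base, y = reads supporting other bases.
    """
    if x + y == 0:
        return 0.0
    return (x - y) / math.sqrt(x + y)


def prune(positions, min_distance=10):
    """Drop every SNP that has a neighbour closer than min_distance.

    Distance is taken as the difference between positions (an assumption).
    All SNPs in a cluster are removed, not just the later ones.
    """
    ordered = sorted(positions)
    kept = []
    for i, pos in enumerate(ordered):
        close_left = i > 0 and pos - ordered[i - 1] < min_distance
        close_right = i + 1 < len(ordered) and ordered[i + 1] - pos < min_distance
        if not (close_left or close_right):
            kept.append(pos)
    return kept


print(depth_threshold(50))    # absolute minimum wins at low average depth
print(depth_threshold(200))   # relative depth wins at high average depth
print(round(z_score(18, 2), 2), round(z_score(12, 8), 2))
print(prune([100, 105, 300, 1000, 1009, 1020]))
```

In the example, a position with 18 reads for the called base and 2 against gets a much higher Z-score than one with 12 for and 8 against, and the two clusters of nearby SNPs are removed while the isolated SNPs are kept.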
Very important: you need to make a note of how much of the reference genome was actually covered by your isolates. This needs to be relatively high. If it's too low, you might not even get results.

Then, if you provided raw data, you will also get two plots. One of them is a plot of ignored SNPs. This is a bar plot where, for each isolate, you see how many SNPs are ignored because of low coverage and so on, that is, because of all the filters, if you are just looking at that isolate. You should be cautious if it looks like one of your isolates is ignoring a lot of SNPs, because that might influence your phylogeny.

Then you get a plot of the heterozygous SNPs, and this is a plot of the heterozygous SNPs that actually have high enough quality to be included in the analysis. If you chose to ignore those, they're of course not part of the analysis, but if you didn't choose to ignore them, this is a plot of how many were actually used in the analysis. Usually, it's of no concern if there are a few heterozygous SNPs, 20 or 50 or something like that, but if it seems like one of your isolates has a lot of heterozygous SNPs, that might indicate that it has been contaminated with a similar genome.

Finally, just to remind you: use closely related references. Check the percentage of the reference genome covered by all your isolates. Low coverage is the number one reason that people don't get results. And remember that you are allowed to mix raw read data and assembled data in the same analysis. Thank you for watching. [MUSIC]