Hi everyone, my name is Pimlapas Leekitcharoenphon. I'm a researcher in the Research Group for Genomic Epidemiology at the National Food Institute, Technical University of Denmark. Today I'm going to talk about species identification: the KmerFinder tool, its description and application. First of all, one of the questions that comes up as soon as you start working with a bacterial organism of interest is: what is it? To be precise, which species is it? The 16S rRNA gene formed the basis of the first method for sequence-based taxonomy, and 16S is still used for species identification of bacteria. However, 16S rRNA has been found to have a number of shortcomings, and that's why other approaches have been proposed, such as the gyrB gene, ribosomal MLST, and species-specific functional protein domain profiles. But whether you look at 16S or any other gene used for species identification, a gene is a very tiny part of the whole genome. Once we have whole-genome sequencing data, we have the whole thing, so why use only these small pieces to infer the species? Why don't we use all the genetic information in the whole-genome sequencing data? Yes, why not? If you want to do that, can you just compare a whole genome like this with the whole genomes of all the available bacteria? With current computing capacity, that is not really possible in a very short time yet. But if you break your whole genome into small pieces and identify the species from those small pieces, then it is possible to get the species from the whole-genome data. And what are the small pieces from the whole-genome sequencing data? We call them k-mers.
K-mers are short sequences with a length of k bases. If I say 10-mers, I mean short sequences 10 base pairs long. For example, you have a long sequence and you cut it into pieces of DNA. These small chunks of DNA sequence here are what we call k-mers, and if they have a size of seven, we call them 7-mers. How do we identify species from this? In principle, any sequences with high similarity must share k-mers. What does that mean? It means that if you have genome A and genome B and you break both into k-mers, then if the genomes are similar, they must have a lot of k-mers in common. In principle, that's how you identify species using k-mers. The database behind the KmerFinder tool contains all the known species, which we basically got from NCBI, and we break the genomes of those known species into k-mers. So basically you have a library of k-mers of the known species. Then, when you have your unknown species, the tool breaks its genome into k-mers as well, compares the k-mers of your genome with the k-mers of the known species, and gives you the best-matching species for your genome. Let me go into more detail on how the genome is cut into k-mers. The tool basically splits your genome into 16-mers, that is, pieces 16 base pairs long. For example, here it chops your data into 16-base-pair pieces like this, but we do not keep all of these k-mers in the database. Why? Because there is a lot of redundancy here. This one is very similar to this one; it differs by just one base pair. If you keep all of this data, the tool takes longer to compute and identify species. To save time and make it fast, we reduce the redundant data. For example, we only keep the 16-mers with a particular prefix, such as ATGAC.
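The idea described above (cut sequences into k-mers, keep only those with a chosen prefix, and count how many k-mers the unknown genome shares with each known species) can be sketched in a few lines of Python. This is only an illustrative sketch, not the actual KmerFinder or KMA implementation: the function names `kmers` and `rank_templates` are made up for this example, the sequences are toy data, and details such as reverse-complement strands are ignored.

```python
def kmers(sequence, k=16, prefix="ATGAC"):
    """Return the set of k-mers in `sequence` that start with `prefix`.

    Hypothetical helper for illustration; k=16 and the ATGAC prefix follow
    the values mentioned in the talk.
    """
    sequence = sequence.upper()
    return {
        sequence[i:i + k]
        for i in range(len(sequence) - k + 1)
        if sequence[i:i + k].startswith(prefix)
    }


def rank_templates(query, templates, k=16, prefix="ATGAC"):
    """Rank known species by the number of k-mers they share with the query.

    `templates` maps a species name to its genome sequence (toy data here).
    Returns (name, shared k-mer count) pairs, best match first.
    """
    query_kmers = kmers(query, k, prefix)
    scores = {
        name: len(query_kmers & kmers(seq, k, prefix))
        for name, seq in templates.items()
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    # Toy sequences, far shorter than real genomes, so a smaller k is used.
    known_species = {
        "species_A": "GGATGACCGTATGACTTGCC",
        "species_B": "ATGACCGAAATGACGGA",
    }
    unknown = "ATGACCGTATGACTTG"
    for name, shared in rank_templates(unknown, known_species, k=7):
        print(name, shared)
```

The species with the most shared k-mers comes first in the ranking, which mirrors the "best-matching species" described in the talk. Using sets means each distinct k-mer is counted once, which is a simplifying assumption of this sketch.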
Keeping only the k-mers with a particular prefix removes a lot of redundancy from your database, which speeds up the tool when it predicts your species. Here is another example of how it works. Imagine these are the reference bacterial species in the database, and these are the k-mers in their genomes. You have your own unknown species with its k-mers, and you align or compare them to all the species to see which one has the best-matching k-mers. It turns out this one is similar to this Staphylococcus aureus species. The tool behind KmerFinder that does this alignment of k-mers is KMA, or k-mer alignment, which was developed by a PhD researcher in our group. If you want to read more about the details of KMA, you can read this paper. The performance of KmerFinder has been tested with training and test data. This goes back many years, to when we started building the tool. We tested it with training data, which means data for which we already know the species. We have about this number of training genomes, covering about 1,000 different known species. Then we have the test data: the species are actually known to us, but we treat them as unknown to test the tool. We have NCBI data with draft genomes from about 600 species, and we have a lot from the SRA or ENA databases, about 10,000 draft genomes. All of these were put into KmerFinder, alongside the other tools, to see the performance. This is the performance. You can see that the blue one is KmerFinder; TaxonomyFinder is another kind of tool to identify species; this is ribosomal MLST; and this is 16S, the first method used to identify species. The first group here is where the tool predicts both genus and species correctly. This one is only the genus level, and this one is where not even the genus is predicted correctly. If you look at this, you can see that KmerFinder performs better than the other tools here in terms of predicting both genus and species correctly.
This holds for both the NCBI data and the SRA data, and it is better than 16S, as you can see over here. The concept of the tool is that, as a scientist or researcher, once you have your genomic data, either raw reads or assembled data, you put it into KmerFinder and KmerFinder returns a predicted species. Let's have a look at the KmerFinder website. This is the link to the KmerFinder website, and this is the front page of the tool. First things first, you have to choose which database you want to search against. The default, of course, is bacterial organisms, but the tool has more organisms that you can choose, such as algae, fungi, protozoa, or even viruses. You choose the database that you want to search and then you upload your genome here. You can upload a FASTA file or FASTQ files. FASTA means assembled genomes. FASTQ means raw reads, short reads, and it can be single-end reads in one file or paired-end reads in two FASTQ files. You can drop them here, but you should do one genome at a time: one submission per unknown sample. Don't mix them all here; otherwise, you're going to get a result that is a mix of everything. When you have put your data here and click upload, your data is uploaded to the server. When the upload is complete, it shows this page. If you want the tool to send you an email, you put your address here, and the tool will send the output link directly to the email address you provided. Let's have a look at the output. This is an example of the KmerFinder output. This number is just a taxonomy number; it doesn't matter for this result. What you have to focus on is which species is shown here. Sometimes it shows many species, but you should focus on the first one, because the first one is the best-hit species. Then you can verify whether it's a good result or not.
There are some parameters over here that you can use to judge whether the result is trustworthy or not. For example, the score here means how many k-mers from your genome match the species. You see two species pop up here, E. coli and Salmonella. But about 174,000 k-mers of your genome match E. coli, whereas only 1,500 k-mers of your genome match Salmonella. Of course, Salmonella and E. coli have some genetic elements that are similar, so some of the k-mers can match Salmonella by chance. But that is a very minor match, so you should not consider Salmonella as your predicted species in this result. Your result here should be E. coli, since far more k-mers match it. As for the other statistical values, you can see more of them over here, so let me go through each of them. The template length is the number of k-mers in the template, that is, in the species shown here. What about the template coverage? The template coverage is the number of k-mers in the template species that match the k-mers of your query sequence, meaning your input genome, divided by the total number of k-mers of that species in the database. What is query coverage? Query means your input genome, so it is the number of k-mers in your input genome that match the template species in the database, divided by the total number of k-mers in your query sequences. The last one is depth. What does depth mean? Depth is the number of k-mers in your input genome that match the template species, divided by the total number of k-mers in the template species. We expect those numbers to be very high. For example, we expect this to be nearly 100 percent, and then we can be very sure. But sometimes it is not 100 percent or even close to 100 percent. Why? Because the species that it matches, that is, your species, may be very rare.
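To make the definitions above concrete, here is a small Python sketch that computes the score, template length, template coverage, query coverage and depth for one query against one template. It is a rough illustration under stated assumptions, not the KmerFinder code: the function name `match_statistics` is hypothetical, and the sketch assumes that query k-mers are counted with repeats (as would happen with raw reads) while template k-mers are counted once each.

```python
def match_statistics(query_kmers, template_kmers):
    """Compute illustrative match statistics for one query/template pair.

    query_kmers: list of k-mers from the input genome; repeats are kept
        (assumption: repeated k-mers, e.g. from raw reads, count each time).
    template_kmers: iterable of k-mers of one database species.
    """
    template_unique = set(template_kmers)
    if not query_kmers or not template_unique:
        raise ValueError("query and template must each contain k-mers")

    # Score: how many query k-mers (with repeats) are found in the template.
    score = sum(1 for kmer in query_kmers if kmer in template_unique)
    # Distinct template k-mers that are hit by at least one query k-mer.
    covered = template_unique & set(query_kmers)

    return {
        "score": score,
        "template_length": len(template_unique),
        "template_coverage": 100 * len(covered) / len(template_unique),
        "query_coverage": 100 * score / len(query_kmers),
        "depth": score / len(template_unique),
    }


if __name__ == "__main__":
    # Toy 3-mers only; real k-mers are 16 bases long.
    query = ["AAA", "AAA", "CCC", "GGG"]
    template = ["AAA", "CCC", "TTT", "ACG"]
    for name, value in match_statistics(query, template).items():
        print(name, value)
```

Under these assumptions, template coverage and depth differ only when query k-mers repeat: coverage counts each template k-mer once, while depth counts every matching query k-mer.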
A rare species does not have much of its variation represented in the database, and that's why you cannot find a very good match for your genome in the database. So that is a limitation not only of the tool but also of the database: how much of the variation of all the bacterial species is available. If you want to know more about the other parameters in the KmerFinder output, go to the output page, where you will see an explanation of each parameter. As I told you, for the CGE tools we have a web-based version and also a standalone version. But to be able to use the standalone version, you need some UNIX or bioinformatics skills to install and run the tool. Anyway, this is where you can download and install the tool if you want to run it locally, standalone, on your computer or your server. Of course, the tool has to come with a database. This is the link to the bacterial species database that we use in KmerFinder. But keep in mind that this is a very large database, so be prepared for that before you download it. That's all for KmerFinder. Thank you very much for watching.