Hi, I'm Judit Szarvas, a PhD student at the National Food Institute. In this video, I'm going to talk about Evergreen Online, a platform for the identification of food-borne bacterial outbreaks. First, I'm going to give a brief introduction to genomic epidemiology. Then I'll walk you through the workflow of Evergreen Online, and then the Evergreen Online platform itself.
In this context, genomic epidemiology means the use of genomic data in epidemiological investigations to track infectious diseases. Samples of pathogens that differ in only a few single nucleotide polymorphisms, or SNPs, could be related and have a common source. We use phylogenetic trees to show the relationships between the samples and to cluster related samples. Evergreen Online is a continuous effort, with the aim to discover food-borne disease outbreaks and connect clinical samples to sources.
The bioinformatic pipeline behind Evergreen Online first downloads raw sequencing data from public repositories, uploaded by food safety authorities or public health laboratories. This could be, for example, the US Centers for Disease Control and Prevention or the Food and Drug Administration. In the next step, it divides the data by sub-typing the pathogens. Thereafter, it calls high-quality SNPs by aligning the reads to the references, and then calculates the genetic distances, which are then used to infer phylogenies on the non-clustered data. As the last step, it displays the phylogenetic trees and putative clusters on the website.
In a bit more detail: the raw sequencing data from Illumina and Ion Torrent platforms are downloaded to the system, and then k-mer based sub-typing is performed to find the closely matching references, or templates as we call them. The data is then divided into sets based on these templates. All this information is saved in the database. From the next step on, things are done set by set.
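To make the sub-typing step a bit more concrete, here is a minimal sketch of how reads could be assigned to a template by counting shared k-mers. This is an illustration only, not the tool the pipeline actually uses: the function names, the scoring by number of shared k-mers, and the small value of k are all assumptions made for the example.

```python
from collections import defaultdict


def kmers(seq, k):
    """Return the set of k-mers in seq that consist only of A, C, G and T."""
    seq = seq.upper()
    return {
        seq[i:i + k]
        for i in range(len(seq) - k + 1)
        if set(seq[i:i + k]) <= set("ACGT")
    }


def best_template(reads, templates, k=5):
    """Return the name of the template sharing the most k-mers with the reads.

    Hypothetical scoring: the plain count of distinct shared k-mers.
    Returns None if no template shares any k-mer with the reads.
    """
    read_kmers = set()
    for read in reads:
        read_kmers |= kmers(read, k)
    best, best_score = None, 0
    for name, seq in templates.items():
        score = len(read_kmers & kmers(seq, k))
        if score > best_score:
            best, best_score = name, score
    return best


def divide_into_sets(samples, templates, k=5):
    """Group sample IDs by their best-matching template."""
    sets = defaultdict(list)
    for sample_id, reads in samples.items():
        template = best_template(reads, templates, k)
        if template is not None:
            sets[template].append(sample_id)
    return dict(sets)
```

Calling `divide_into_sets` with a dictionary of samples (sample ID to list of reads) and a dictionary of templates (template name to sequence) gives back one list of sample IDs per template, which corresponds to the sets that the later steps work through one by one.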
The reads are aligned to the references, and consensus sequences of the same length are created by evaluating each site in the reference against the criteria displayed here, in order to ensure high confidence in the consensus sequences that we create. Thereafter, pairwise distance calculation is performed. Let's go into details here and look at an example with four sequences. The differences between the sequences are highlighted in blue, and the unknown or missing bases are marked in red. The method disregards a difference between two sequences if either of them contains an N at that position. For example, if we look at sequences A and C, we find only one difference between them, even though there is a C against an N here, because we don't count this as a difference. If we go through all the pairs of sequences, we get this distance matrix on the right.
In order to reduce the redundancy in the data set and to create clusters, a clustering step is performed on samples with fewer than 10 SNPs between them. There are three possible scenarios displayed on this slide. In the first, at the beginning only the reference sequence is contained in the database and the set. Each new sequence that we add is compared to this reference. If any of them is closer than 10 SNPs to it, it is clustered with the reference sequence. Afterwards, the genetic distances are calculated between the remaining sequences. The distance matrix and the clustered sequences are saved in the database.
In the next scenario, we are looking at a continued template set, where there are already samples in the set that were homology reduced before. The new sequences are first compared to all these sequences, and any that are clustered are removed from further consideration. When we calculate the distances between the new sequences, only the remaining ones are used. All this information, together with the distance matrix, is again saved in the database.
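The distance calculation and the clustering against already stored sequences can be sketched roughly as follows. This is a simplified illustration under a few assumptions, not the pipeline's actual code: the function and variable names are made up, the distance matrix is kept as a plain dictionary, a new sequence that is close to several stored sequences is assigned to the nearest one, and new sequences are not clustered with each other. The example sequences are invented and use a threshold of 2 instead of 10 to keep them short.

```python
from itertools import combinations


def snp_distance(a, b):
    """Count differing sites, skipping sites where either sequence has an N."""
    if len(a) != len(b):
        raise ValueError("consensus sequences must have the same length")
    return sum(1 for x, y in zip(a, b) if x != y and x != "N" and y != "N")


def update_template_set(reps, clusters, matrix, new_seqs, threshold=10):
    """Add new consensus sequences to one template set (all arguments mutated).

    reps:     name -> sequence of the non-clustered (representative) samples
    clusters: representative name -> names of samples clustered to it
    matrix:   frozenset({name1, name2}) -> SNP distance between representatives
    Returns the sorted names of the new sequences that were not clustered.
    """
    remaining = {}
    # Step 1: compare every new sequence to the sequences already in the set.
    for name, seq in new_seqs.items():
        dists = {rep: snp_distance(seq, rep_seq) for rep, rep_seq in reps.items()}
        close = [rep for rep in dists if dists[rep] < threshold]
        if close:
            # Assumption: cluster with the nearest representative.
            nearest = min(close, key=dists.get)
            clusters.setdefault(nearest, []).append(name)
        else:
            remaining[name] = (seq, dists)
    # Step 2: only the remaining sequences extend the distance matrix.
    for name, (seq, dists) in remaining.items():
        for rep, d in dists.items():
            matrix[frozenset((name, rep))] = d
    for (n1, (s1, _)), (n2, (s2, _)) in combinations(remaining.items(), 2):
        matrix[frozenset((n1, n2))] = snp_distance(s1, s2)
    for name, (seq, _) in remaining.items():
        reps[name] = seq
    return sorted(remaining)


# New template set: only the reference is present at the beginning.
reps = {"ref": "ACGTACGTACGT"}
clusters, matrix = {}, {}
update_template_set(
    reps, clusters, matrix,
    {"s1": "ACGTACGTACGA", "s2": "ACGNACGTACGT",
     "s3": "TTGTACGTAGGT", "s4": "ACGTTGCTACGN"},
    threshold=2,
)
# Continued template set: s5 differs from s3 only at an N, so it joins s3.
update_template_set(reps, clusters, matrix, {"s5": "TTGTACGTAGGN"}, threshold=2)
```

Because distances between sequences already in the set are kept in `matrix`, a later call only computes the pairs that involve new sequences.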
In the last scenario, there are no sequences that fall below the clustering threshold, so we are only updating the distance matrix.
Using the distance matrices and consensus sequences, phylogenetic trees are inferred for each updated template set. The previously clustered isolates are placed back onto the tree, marked with an asterisk. If we return to our previous three scenarios, we can see how each tree is updated. It makes a difference whether there is a new template set, or we are continuing an old, already populated template set and only adding a small number of new sequences and clustered sequences. The pipeline is cyclical. This means that when the next iteration is run, some of the data doesn't need to be computed again, so we save compute time and resources during the process.
The Evergreen Online platform is hosted at the Center for Genomic Epidemiology. The pipeline runs daily. However, it could be that there is no new data to add to the system. In any case, you can find the date of the last update on the website. There are two expandable tables. The first table contains the latest changes in the clusters. Each cluster is identified by its template name and by the non-clustered isolate that the other samples were clustered to, which we call the representative sample, because it represents the other samples in the calculations. The third column contains the new and total numbers of sequences in the cluster. The fourth column is marked if the cluster is new, and the letters in the fifth signify whether the cluster contains isolates from different sources, or from different institutes and countries. The second table lists all the phylogenetic trees in the system, with their date of last update and number of isolates. We offer download links in different formats, and also a link to an external viewer called Microreact. Samples can be searched by their ID.
For example, this sample is the representative of a Campylobacter jejuni outbreak that was caused by raw milk. This concludes our tour. Thank you for watching.