[SOUND] [MUSIC] In this lecture, I will give an introduction to LINCS L1000 data, the gene expression data generated by the L1000 technology. Here is the background from which the project came into being. First, gene expression signatures are most informative of the global cellular state. The cellular state could be disease, physiological, or drug induced. Second, the Connectivity Map project is potentially transformative. I think most people have heard about the Connectivity Map, or CMAP, project published in 2006 by the Broad Institute. They applied 164 small molecules to a few cell lines and then measured the gene expression profiles using Affymetrix array technology. They were able to connect the gene expression signatures of the drug-perturbed cells to disease and pathophysiology, which provides a systematic approach to finding connections between drugs and disease. Obviously, the more gene expression signatures we have, the more resources we can use to find meaningful connections for our own research. But common gene expression profiling methods, like the Affymetrix array technology used in the original CMAP project, are expensive and do not scale well to high throughput. So the LINCS L1000 project was launched as part of the LINCS project to generate a large number of gene expression signatures using the L1000 technology. The LINCS project produces the largest set of perturbations to cell lines, but it still covers only a small fraction of all possibilities.
So what is LINCS L1000 data, and how does the L1000 technology reduce the cost? As I mentioned, LINCS L1000 data is generated by the L1000 technology. The L1000 technology measures 978 genes in each experiment and estimates the rest of the transcriptome: the remaining 22 thousand genes are predicted using a model built from GEO data.
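To make the idea of estimating an unmeasured gene from a measured one concrete, here is a minimal sketch. It assumes a simple one-variable linear fit on made-up reference values; the lecture only says the real model is built from GEO data, so this is an illustration of the principle and not the actual L1000 inference model.

```python
def fit_line(x, y):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Made-up reference data: expression of one measured (landmark) gene and one
# unmeasured gene across five reference samples, standing in for GEO profiles.
landmark_train = [1.0, 2.0, 3.0, 4.0, 5.0]
target_train = [3.1, 4.9, 7.2, 9.0, 10.8]

slope, intercept = fit_line(landmark_train, target_train)


def infer_target(landmark_value):
    """Estimate the unmeasured gene from a new landmark measurement."""
    return slope * landmark_value + intercept


print(round(infer_target(6.0), 2))
```

In practice each unmeasured gene would be estimated from many landmark genes at once, but the logic is the same: learn the relationship on reference data, then apply it to new measurements.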
So the reason that the L1000 technology is cost effective is that it measures only 978 genes instead of the tens of thousands of genes measured by common methods. Thus, the L1000 technology is suitable for gene expression measurements in high throughput. So why are 978 genes enough to reasonably represent the whole transcriptome? The short answer is that gene expression levels are correlated. If two genes are strongly correlated, you only need to measure one to estimate the other. To demonstrate this idea, the CMAP team compared the connections between all the available gene expression data on GEO, and they found that only 978 carefully picked genes are needed to recover 80% of the connections. The criteria for the landmark genes are that they are minimally redundant, widely expressed in different cellular contexts, and carry inferential value.
This slide is an overview of the protocol for measuring the landmark genes. First, mRNA is reverse transcribed into cDNA. Then upstream and downstream probes specific to the landmark genes are annealed to the cDNA and ligated. The upstream probe has a unique barcode sequence. In the next step, the probes are amplified using PCR and hybridized to the beads by their barcodes. Each bead recognizes two barcodes. The beads are then sent to Luminex detectors to measure how many probes are hybridized.
The previous slide showed the protocol for measuring a sample in each well. Hundreds of experiments are measured simultaneously on a 384-well plate. This slide shows the common experimental setup. About 360 experiments are normally measured together in a batch. Each experiment has two to four replicates. The figure shows a batch of experiments with three replicates. The red circles are the control replicates and the blue are experimental replicates. The replicates of the same experiment are placed in the same well position on separate plates, so the number of plates equals the number of replicates in the batch. There are normally 18 control replicates per plate. The field name for batch is brew_prefix in the LINCS L1000 metadata.
This slide shows the data levels of the LINCS L1000 data. Level 1 is the raw, unprocessed flow cytometry data from the Luminex scanners. One LXB file is generated for each well of a 384-well plate, and each file contains a fluorescence intensity value for every observed analyte in the well. Level 2 is the GEX file type: gene expression values for the 1000 genes after deconvolution from the Luminex beads. Level 3 is quantile normalized data: gene expression profiles of both directly measured landmark transcripts and imputed genes, normalized using invariant set scaling followed by quantile normalization, first within each plate and then across replicate plates, which means plates in the same batch. Level 4 is the z-score data: profiles of differentially expressed genes computed as robust z-scores for each profile relative to the population control. And level 5 includes moderated z-scores, which are computed by the Broad, and also Characteristic Directions calculated by our lab. These are signatures computed from replicate profiles.
I think it is useful to have an in-depth look at the LINCS L1000 ID system. The IDs reflect the experimental setup and link the data to the metadata. This slide shows the distil_id, which is the ID for a replicate-level gene expression profile. Since both level 3 and level 4 data are at the replicate level, it is this ID that I use as the index for the level 3 and level 4 data. The distil_id consists of a brew prefix, a plate index, and a well index. The brew prefix equals the batch, as previously mentioned. Each batch name is in turn made up of three parts: the perturbagen group, the cell line, and the time point, which indicates that experiments in the same batch share the same cell line and time point. The perturbagen group is just a broader grouping concept, consisting of several related batches.
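To illustrate how a distil_id mirrors the setup, here is a small parsing sketch. The ID layout is an assumption for illustration: underscore-separated perturbagen group, cell line, and time point, followed by the plate tokens, then a colon and the well. The example ID, the class name DistilId, and the function parse_distil_id are hypothetical and not part of any LINCS library; check the layout against your own metadata before relying on it.

```python
from typing import NamedTuple


class DistilId(NamedTuple):
    pert_group: str   # perturbagen group, e.g. a group of related batches
    cell_line: str
    time_point: str
    plate: str        # plate index within the batch
    well: str         # well position on the plate

    @property
    def brew_prefix(self) -> str:
        """Batch name: perturbagen group, cell line, and time point."""
        return "_".join((self.pert_group, self.cell_line, self.time_point))


def parse_distil_id(distil_id: str) -> DistilId:
    """Split an ID of the assumed form GROUP_CELL_TIME_PLATE...:WELL."""
    plate_part, well = distil_id.rsplit(":", 1)
    tokens = plate_part.split("_")
    if len(tokens) < 4:
        raise ValueError(f"unexpected distil_id layout: {distil_id!r}")
    pert_group, cell_line, time_point = tokens[:3]
    plate = "_".join(tokens[3:])
    return DistilId(pert_group, cell_line, time_point, plate, well)


# Illustrative ID only; it assumes the cell line name contains no underscore.
parsed = parse_distil_id("CPC005_A375_6H_X1_B3_DUO52HI53LO:K06")
print(parsed.brew_prefix, parsed.plate, parsed.well)
```

Grouping profiles by brew_prefix and well then collects the replicates of one experiment, and grouping by brew_prefix and plate collects everything measured on one plate.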
The plate index is the index of a plate within a batch; together with the brew prefix, it makes up the unique identifier for a plate. The well index designates the position where the replicate is profiled. Replicates of the same experiment normally have the same well index. You can see that the distil_id exactly mirrors the experimental setup.
This slide shows the sig_id, which is the ID for level 5 data. The sig_id also consists of three parts: brew_prefix, pert_id, and pert_dose. The pert_id is the compound index of the Broad Institute and carries BRD, the abbreviation for Broad. In the LINCS data, each experiment is determined by four pieces of information: perturbation, cell line, dose, and time point. You can see that the sig_id reflects these four essentials. Together with the perturbagen group, it is the unique identifier of a level 5 gene expression signature. Sometimes the pert_id is the pert_mfc_id, which is the ID that also carries the drug batch information. Because they track drugs obtained at different times and from different vendors, they use this pert_mfc_id/pert_id to describe the source of the drug more accurately.
If you are an experienced gene expression analyst, you have probably wondered why there is level 4 data. Most differential expression analysis methods start with normalized data and yield differential expression signatures, namely, jumping from level 3 to level 5 with no intermediate level 4 step. The reason for having level 4 data here is that there is a strong plate batch effect in level 3 data, as shown in the figure. The figure, a PCA plot, is an example of experimental replicates and control replicates in level 3 data with the same conditions across plates in the same batch. The pink and the blue dots are control replicates from four plates, and the yellow dots are experimental replicates from four plates. Normally you should expect control replicates to group together and experimental replicates to group together.
But here the replicates from the same plate are grouped together, which makes no biological sense and is surely an artifact. So level 4 L1000 data are calculated to correct this plate batch effect. The level 5 characteristic direction data, however, are computed directly from level 3 normalized data. How does this jump overcome the batch effect? If you look carefully at the figure, you can see that the control replicates point to the experimental replicate in the same direction in each plate, which reflects the biology of how the experimental replicate systematically deviates from the control. The characteristic direction approach computes the four per-plate directions and averages them to get the level 5 characteristic direction signatures. The details about this approach can be found in the following YouTube video link.
So now let's see what perturbations and cell lines we have in LINCS L1000 data. There are more than 20,000 small-molecule compounds, among which there are 1,300 FDA-approved drugs, about 5,585 bioactive tool compounds, and more than 2,000 screen hits. There are also 22,000 genetic constructs for knocking down or over-expressing genes. They cover 900 targets or pathways of FDA-approved drugs, 600 candidate disease genes, and more than 500 community nominations, which are genes that are interesting to the biology community in general. The data set covers more than 18 cell types, including primary cells, cancer cell lines, stem-cell lines, and differentiated cell lines from different tissue types.
So here we arrive at the question that I think is probably most important to the audience: where to find the LINCS L1000 data? There are three locations to find and download the data: lincscloud, GEO, and the Ma'ayan Lab website. The lincscloud hosts about 95% of the data, from level 1 to level 4, with the chemical perturbations, gene knockdowns, and gene over-expression perturbations.
GSE70138 holds about 5% of the data, via the LJP005 to LJP009 perturbation group datasets. It consists of chemical perturbations only, and you can find it on GEO under the accession GSE70138. The Ma'ayan Lab website provides level 5 characteristic direction signatures only, and the signatures are computed from the data in both of the above sources.
In the next few slides I will show you how to download and analyze the data and the metadata from lincscloud. It is big data and a little bit complex. You first have to register for an account to access the website. Upon entering the website, there are four icons in the upper left corner. The first icon provides access to several apps that help users interact with the data. They are easy to use and will not be covered in this presentation. The second, the API icon, offers functionality to download and analyze the data and the metadata. The API itself is an HTTP service to search and query metadata that is actively updated. The following Data icon enables users to download level 3 and level 4 data and the associated metadata. The Code icon provides code in various programming languages to parse and analyze the downloaded data. The last function enables users to analyze the data on the cloud without downloading it. Note that the metadata accessed through the API are generally more complete and accurate than the directly downloaded metadata.
This slide shows the services provided by the lincscloud API. After downloading the data, you will get a basic data matrix with row IDs and column IDs. The row IDs are probe IDs for each gene, and they can be queried with the GeneInfo service. The column ID for level 3 and level 4 data is the distil_id, and it can be queried with the InstInfo service. The column ID for level 5 data is the sig_id, and it can be queried with the SigInfo service. Yes, you have probably realized that although the moderated z-score level 5 data are not available for download, you can browse and search them using the SigInfo service.
Here is the workflow for processing data downloaded from lincscloud. This workflow uses level 3 data as an example, but it is also applicable to level 4 data. First, download the big matrix file onto your computer. Then download the code for your preferred language and use the parse_gctx() function to parse the big matrix file with the annot_only option set to true. The result will be a list of column IDs, the cid, and a list of row IDs, the rid. Query the GeneInfo API with the row IDs to gather the information for each gene. Query the InstInfo API to find the column IDs of the gene expression profiles that are interesting to your project. Then use the parse_gctx() function again with the cid option set to the selected column IDs returned by the API. The matrix will be sliced according to the selected column IDs, and the output is a submatrix that contains only the data of interest. Notice that by default, the parse_gctx() function will try to parse the whole matrix into memory. This is not practical, since it requires hundreds of gigabytes of available memory. That's why we need to slice the matrix by column IDs to analyze the data.
The LINCS L1000 data on GEO is much easier to handle. The GEO page lists several files to download, covering data from level 2 to level 4. Both the data and the metadata are assembled in a single file, and there is no need to query a separate API. The files are in GCT format, which is a simplified version of the GCTX format. You can still use the parse_gctx() function to parse the GCT files. As you can see, there are two level 4 files. In one, the z-scores are computed relative to the population background, and in the other, the z-scores are computed relative to the vehicle controls. Generally, I think the Broad Institute prefers the file computed relative to the population background.
If you want to get the level 5 characteristic direction signatures processed by the Ma'ayan Lab, you need to install MongoDB.
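Before moving on, the slicing-by-column-ID idea from the workflow above can be sketched on a GCT-style text file. This is not the parse_gctx() function from the lincscloud code. It is a hypothetical reader, read_gct_columns, that assumes the GCT v1.3 text layout (a version line, a dimensions line, a header line, column metadata rows, then data rows), and the toy file content is made up.

```python
import io


def read_gct_columns(handle, wanted_cids):
    """Read GCT v1.3 text, keeping only columns whose IDs are in wanted_cids."""
    if handle.readline().strip() != "#1.3":
        raise ValueError("expected a GCT v1.3 file")
    n_rows, n_cols, n_row_meta, n_col_meta = map(int, handle.readline().split())
    header = handle.readline().rstrip("\n").split("\t")
    first_col = 1 + n_row_meta  # data columns follow the id and row-metadata fields
    cids = header[first_col:]
    if len(cids) != n_cols:
        raise ValueError("column count does not match the dimensions line")
    keep = [i for i, cid in enumerate(cids) if cid in wanted_cids]

    def pick(fields):
        values = fields[first_col:]
        return [values[i] for i in keep]

    # Column metadata rows: field name first, then one value per column.
    col_meta = {}
    for _ in range(n_col_meta):
        fields = handle.readline().rstrip("\n").split("\t")
        col_meta[fields[0]] = pick(fields)

    # Data rows: only the selected columns are converted and stored.
    matrix = {}
    for _ in range(n_rows):
        fields = handle.readline().rstrip("\n").split("\t")
        matrix[fields[0]] = [float(v) for v in pick(fields)]

    return [cids[i] for i in keep], col_meta, matrix


# Made-up toy file: 2 probes, 3 profiles, 1 row-metadata and 1 column-metadata field.
toy_rows = [
    ["#1.3"],
    ["2", "3", "1", "1"],
    ["id", "pr_gene_symbol", "PLATE1:K06", "PLATE2:K06", "PLATE1:A01"],
    ["pert_iname", "na", "drugX", "drugX", "DMSO"],
    ["probe_1", "GENE1", "1.5", "-0.5", "0.0"],
    ["probe_2", "GENE2", "2.0", "0.25", "-1.0"],
]
toy_gct = "\n".join("\t".join(row) for row in toy_rows) + "\n"

cids, col_meta, matrix = read_gct_columns(
    io.StringIO(toy_gct), {"PLATE1:K06", "PLATE1:A01"}
)
print(cids, col_meta, matrix)
```

The reader goes through the file one line at a time and stores only the requested columns, so the full matrix is never held in memory. The real GCTX files are binary, so for those you would rely on the provided parsing code with a column ID selection instead.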
MongoDB is a popular NoSQL database that stores data as objects rather than as rows in tables. In this L1000 database, each signature is represented as an object with the data and metadata as attributes. The files downloaded from our web page are MongoDB files that can be used directly in the database. Instructions for how to set this up are provided on this web page. The only thing that requires your attention is that the database does not store the gene metadata. The gene metadata needs to be downloaded separately in three files. The lincscloud rid JSON file is an array of probe IDs matching the order of genes in the full characteristic direction in the lincscloud collection. The order of landmark genes in this array matches the order of landmark genes in the landmark direction. The GSE70138 rid JSON file is an array of probe IDs matching the order of the genes in the GSE70138 collection. And the API row metadata file holds the metadata information for each probe ID, downloaded from the Broad API. It can be used to convert probe IDs to gene symbols and also to determine whether a probe ID is a landmark gene.
Here are some apps developed by the BD2K-LINCS DCIC that use LINCS L1000 data. LIFE is a search engine developed by the University of Miami that integrates all LINCS content, leveraging a semantic knowledge model and common LINCS metadata standards. iLINCS is a computational biology app developed by the University of Cincinnati that aims to provide statistical methods and a computational tool for integrated analysis of the data produced by the LINCS program. L1000CDS2 is a search engine developed in our lab to search for level 5 characteristic direction signatures that either mimic or reverse user input signatures. Slicr is a metadata search engine for LINCS L1000 data deposited on GEO, and it provides customized download of level 3 and level 5 data. GEO2Enrichr is a Chrome and Firefox extension that helps users extract signatures from studies deposited in GEO.
Although it does not directly use LINCS L1000 data, GEO2Enrichr pipes the GEO signatures to L1000CDS2 to search for similar or reverse signatures in the LINCS L1000 database. The last slide is a summary of resources that might help your research with LINCS L1000 data. [MUSIC]