Another important source of healthcare data is imaging data, which comes in many different types, such as X-ray, computed tomography (CT), magnetic resonance imaging (MRI), and positron emission tomography (PET-CT). These are really different technologies: they have different resolutions, and they produce images of different sizes. For example, a full-body PET-CT generates eight gigabytes of data, a cardiac CT generates 36 gigabytes, and an fMRI study can be up to 300 gigabytes. The estimate for all the imaging data in the US in 2014 is about 100 petabytes. So it is definitely very big data, it is routinely collected, and it is a very important source of data we need to analyze.

Here is a quiz question about imaging data. What type of imaging is shown in these pictures: X-ray, CT, MRI, or PET-CT? Annotate the four images and tell us which is which. A is an X-ray; you can see it is chest X-ray data. B is the CT, C is the MRI data, and D is the PET-CT.

What are the properties of medical imaging data? It is an objective measure; again, no human opinion is involved, at least in the raw data. It is also standardized: imaging has its own data standards, so if you train a model on a specific standard, you can expect it to be more generalizable for that type of data. The data is very detailed and high-resolution, and each image can be quite large. The limitations are that labels can be insufficient: annotating this data is a human effort, and it can be difficult to acquire high-quality labels, for example for X-rays, especially for a large number of images. It is also high-dimensional: the raw data is recorded at very high resolution, the size is large, and that high dimensionality makes it difficult to analyze.

Medical literature is the knowledge documented in medical publications and clinical guidelines.
It is mainly text data. One data source is PubMed, a medical literature search engine that also publishes the abstracts of papers along with their authors, journals, and publication dates. Guideline Central is a clinical guideline database that contains important text information about how you should treat different types of patients.

What are the properties of medical literature data? The pros: it is very high quality. These are well-written articles and documents compared to, say, EHR data or clinical notes; from a human perspective, the quality of this data is a lot higher. It is also very comprehensive, covering diverse conditions. The cons: it can be difficult to parse. As with any natural language processing challenge, these documents are written for humans, so if you want a machine to benefit from them, you have to develop algorithms. In fact, they are written for human experts: you need the necessary expertise to really understand the content. It is not machine-friendly; it is designed for human consumption, has very limited structured data, and can be hard for a machine to consume.

Medical ontologies are knowledge graphs of medical terminologies such as diseases, symptoms, treatments, and so on. There are many popular medical ontologies serving different purposes: CPT for procedures, RxNorm for drugs, SNOMED for general clinical terms, MeSH (Medical Subject Headings) for indexing literature, and ATC, again, for drugs. So different ontologies have been developed for different types of data.

What are the properties of medical ontologies? The pros: they are very machine-readable. The whole thing is a directed acyclic graph, so it is very easy for a machine to parse, and it is easy to integrate with other data sources such as EHR or claims data.
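The point that an ontology is a machine-readable directed acyclic graph can be sketched with a toy example. The terms and edges below are invented for illustration and are not drawn from SNOMED, CPT, or any real ontology; the idea is simply that parent links let a machine roll a specific term up to broader categories, which is what makes integration with EHR or claims codes easy.

```python
# A toy ontology as a directed acyclic graph (DAG): each term maps to its
# list of parent terms. All terms and edges here are hypothetical.
TOY_ONTOLOGY = {
    "type 2 diabetes": ["diabetes mellitus"],
    "type 1 diabetes": ["diabetes mellitus"],
    "diabetes mellitus": ["endocrine disorder"],
    "endocrine disorder": ["disease"],
    "disease": [],  # root of this toy hierarchy
}

def ancestors(term, ontology):
    """All terms reachable by following parent links (transitive closure)."""
    seen = set()
    stack = [term]
    while stack:
        for parent in ontology.get(stack.pop(), []):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen

def is_a(term, category, ontology):
    """True if `term` falls under `category` in the hierarchy."""
    return term == category or category in ancestors(term, ontology)

# Integrating with a (hypothetical) patient record: a specific diagnosis
# rolls up to a broader category for cohort building or feature grouping.
print(is_a("type 2 diabetes", "endocrine disorder", TOY_ONTOLOGY))  # True
print(is_a("type 2 diabetes", "type 1 diabetes", TOY_ONTOLOGY))     # False
```

Because the graph is acyclic, the traversal always terminates, and the same few lines of code work for any ontology expressed this way.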
The cons: an ontology has limited coverage, because constructing medical ontologies is a labor-intensive effort, so any one ontology may cover only a subset of all medical knowledge. Being a labor-intensive process, errors can occur and are difficult to control; and because the usage of any given ontology is actually quite limited, such noise and errors are more common. An ontology can also easily become out of date: you put in a huge labor-intensive effort to construct it once, and it is hard to keep it up to date all the time.

Clinical trial data is another important kind of data, produced in the process of drug development. What are the different data sources? There are clinical trial protocols: text documents you can find on clinicaltrials.gov, including all the trial eligibility criteria and protocol designs. Then there are recruitment networks between trials, investigators, and patients, so you know who participated in which trials, run by which investigators. Then there are safety reports; those are reports monitored and produced by the FDA, such as the FDA adverse event database. Finally there is the clinical trial management system (CTMS), which monitors all the results from the clinical trial process; the final result is reported to the FDA, and the CTMS contains all the patient information collected during the trial.

Here are the pros and cons of trial data. Pros: it is very important, very valuable data, because it is the data you need to get a drug approved. It is heterogeneous, containing both structured and unstructured information, as well as some longitudinal information about patients. Cons: it is difficult to match, as the unstructured protocols need to be mapped to structured patient records. This process is done largely manually, and it is a very difficult matching task for algorithms.
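To see why matching unstructured protocol text to structured patient records is hard, here is a minimal, deliberately naive sketch. The criterion text, field names, and rules are all hypothetical; real eligibility matching needs NLP, negation handling, and terminology normalization, which is exactly why the process is still largely manual.

```python
import re

# Hypothetical free-text eligibility criterion and structured patient record.
criterion = "Inclusion: age 18-65, diagnosed with type 2 diabetes, no insulin use"
patient = {"age": 54, "diagnoses": ["type 2 diabetes"], "medications": ["metformin"]}

def naive_match(criterion_text, record):
    # Age range: pull "age LO-HI" out of the free text.
    m = re.search(r"age (\d+)-(\d+)", criterion_text)
    if m and not (int(m.group(1)) <= record["age"] <= int(m.group(2))):
        return False
    # Diagnosis: require the phrase after "diagnosed with" to appear verbatim
    # in the structured diagnosis list. Brittle on purpose: a synonym like
    # "T2DM" in the record would be missed.
    d = re.search(r"diagnosed with ([\w\s]+?),", criterion_text)
    if d and d.group(1).strip() not in record["diagnoses"]:
        return False
    # Exclusion: "no insulin use" -- scan medications for the literal word,
    # with no understanding of drug classes or negation in the record itself.
    if "no insulin" in criterion_text and any(
        "insulin" in med for med in record["medications"]
    ):
        return False
    return True

print(naive_match(criterion, patient))  # True for this record
```

Every rule here breaks the moment the protocol author phrases a criterion differently, which illustrates the data-matching difficulty described above.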
There are also many data integration challenges involved, because these are heterogeneous data sources with variable quality. So data integration is a challenge for trial data.

Finally, we have drug-related data. There is data about existing drugs: if you go to the DrugBank database, you can find all the existing drugs and all the meta-information about them. Then there are chemical databases mainly used for drug discovery, where you will see all the different compounds and their properties. ChEMBL is a large-scale bioactivity database for drug discovery, ZINC is a database of commercially available chemical compounds for virtual screening, and QM9 is a quantum chemistry benchmark dataset for property prediction.

Here are the properties of drug data. On the positive side, the data formats are standard, and the data is mainly free; a lot of it is freely available online. What are the limitations? There is a lack of 3D structures in most cases: although the actual molecules are three-dimensional structures, in a lot of these datasets we have just two-dimensional graph representations. Also, the most novel, latest chemical data is lacking: those compounds are tested and kept secret inside pharmaceutical companies during their drug discovery process.
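The two-dimensional graph representation mentioned above can be made concrete with a tiny sketch: atoms as nodes, bonds as edges, and no 3D coordinates anywhere. The example molecule is ethanol with hydrogens left implicit, a common convention in chemical databases; the data layout here is a simplification for illustration, not the format of any particular database.

```python
# A molecule as a 2D graph: nodes are atoms, edges are bonds. No 3D
# coordinates are stored -- which is exactly the limitation noted above.
# Example: ethanol (heavy atoms only, hydrogens implicit).
atoms = {0: "C", 1: "C", 2: "O"}   # node id -> element symbol
bonds = [(0, 1, 1), (1, 2, 1)]     # (atom id, atom id, bond order)

def degree(atom_id, bond_list):
    """Number of explicit bonds an atom participates in."""
    return sum(1 for a, b, _ in bond_list if atom_id in (a, b))

def heavy_atom_counts(atom_map):
    """Count heavy atoms by element (a crude molecular-formula summary)."""
    counts = {}
    for element in atom_map.values():
        counts[element] = counts.get(element, 0) + 1
    return counts

print(heavy_atom_counts(atoms))  # {'C': 2, 'O': 1}
print(degree(1, bonds))          # 2: the middle carbon bonds to C and O
```

Graph properties like degree or atom counts fall out of this representation directly, but anything that depends on 3D conformation, such as binding geometry, cannot be computed from it.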