We begin with an overview of data quality and its many facets or dimensions. We start with some definitions that we use to organize the topics in this module. Data quality is a very broad subject that encompasses a wide range of challenges that can arise when using data. Data quality problems can be organized in different ways, called data quality dimensions. For example, some data quality problems may be relevant across an entire data set, whereas other data quality problems may relate to specific intervals of time, such as successive years or quarters. The first type is called atemporal data quality, whereas the second is called temporal data quality. There is a wide range of different categorizations of data quality problems. In the readings, we provide an organizational model that was developed by a large group of clinical investigators who focus specifically on electronic medical records data quality. That framework uses terms such as completeness, conformance, and plausibility; other frameworks use similar words. Because data quality encompasses many ways in which data may be used, not all data quality dimensions are relevant to a specific use case or data set. Data quality measures are qualitative or quantitative assessments that provide insights into a data quality issue. The most common measures are simple summary statistics, such as mean, median, minimum, and maximum values. But there is a very large range of data quality measures that have been developed to produce insights into special situations, such as when data are measured at multiple points in time (time-varying data), which is common in electronic medical records. Data quality rules use the data quality measures to provide insights into the acceptability of a data set for the desired use case. These can be considered acceptability thresholds that provide warnings to data users about critical data quality deficiencies that may render the data set unusable.
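The relationship between measures and rules described above can be sketched in a few lines of code. This is a minimal illustration, not part of the module's materials: the field (age), the summary function, and the plausibility thresholds are all illustrative assumptions.

```python
def summarize(values):
    """Data quality measures: simple summary statistics for a numeric field."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    median = ordered[mid] if n % 2 else (ordered[mid - 1] + ordered[mid]) / 2
    return {
        "n": n,
        "mean": sum(ordered) / n,
        "median": median,
        "min": ordered[0],
        "max": ordered[-1],
    }

def age_rule(measures, low=0, high=120):
    """A data quality rule: an acceptability threshold applied to the measures.

    Warns when ages fall outside a plausible range (bounds are assumptions).
    """
    ok = low <= measures["min"] and measures["max"] <= high
    return ok, None if ok else f"age out of plausible range [{low}, {high}]"

measures = summarize([4, 9, 15, 32, 67])   # compute the measures once
passed, warning = age_rule(measures)       # then test acceptability
```

The point of the separation is that the same measures can feed many different rules, each tuned to a different use case.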
There are broad use cases for considering data quality. Intrinsic data quality refers to the overall data quality features of a database or data set without any reference to a specific use of those data. Intrinsic data quality provides a very high-level overview of the key strengths and weaknesses of a data set. Fitness for use, as the name implies, is a more focused view of data quality that is organized around a specific use case. In the fitness-for-use setting, some intrinsic data quality measures that may indicate serious data quality problems might be totally irrelevant to that specific use case. Similarly, intrinsic data quality measures that look acceptable at the general overview may be found to hide very significant data quality problems for a specific use case. For example, a general overview of the age distribution of a data set may look acceptable, but drilling into the age distribution for a specific disease may reveal that no pediatric patients with this disease are present in the data set. If one envisions a data lifecycle that begins with the initial recording of data into a data collection system or device and continues through the steps of data storage, management, transfer, extraction, computation, and reporting, every one of these steps can be the root cause of a data quality problem, including erroneous, missing, or corrupted data. In the setting of combining data from multiple institutions, additional data quality problems can arise from harmonizing data into a common data model and from linking records together from multiple sources. This graph comes from an article published by Ritu Khare, previously at Children's Hospital of Philadelphia. On the x-axis is the number of ETL cycles performed by a data network called PEDSnet. For each new data cycle, new data quality features were examined. You will see that as the number of data quality checks increased, the number of data quality problems detected increased in a similar manner.
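The age-distribution example above can be made concrete with a short sketch contrasting the intrinsic view and the fitness-for-use view. The records, field names, and the age-18 pediatric cutoff are illustrative assumptions, not data from PEDSnet or the module.

```python
# Toy records standing in for a patient-level data set (assumed structure).
records = [
    {"age": 6,  "diagnosis": "asthma"},
    {"age": 54, "diagnosis": "copd"},
    {"age": 61, "diagnosis": "copd"},
    {"age": 12, "diagnosis": "asthma"},
    {"age": 70, "diagnosis": "copd"},
]

def pediatric_fraction(rows):
    """Intrinsic view: share of pediatric (< 18) patients in the whole set."""
    return sum(r["age"] < 18 for r in rows) / len(rows)

def pediatric_fraction_by_diagnosis(rows):
    """Fitness-for-use view: the same measure, drilled into each diagnosis."""
    by_dx = {}
    for r in rows:
        by_dx.setdefault(r["diagnosis"], []).append(r)
    return {dx: pediatric_fraction(grp) for dx, grp in by_dx.items()}

overall = pediatric_fraction(records)              # looks acceptable overall
per_dx = pediatric_fraction_by_diagnosis(records)  # one subgroup has none
```

Here the overall pediatric fraction looks reasonable, yet the drill-down shows no pediatric patients at all for one diagnosis, exactly the kind of problem an intrinsic summary can hide.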
From the same article, the types of data quality problems found during one of the PEDSnet ETL cycles are categorized along the x-axis. By far, the largest contributor of data quality problems is missing data. Outlier values and unmapped concepts are a distant second and third. The other problems tail off, but represent a small fraction of the data quality problems detected in PEDSnet. We have just skimmed the surface of a very large body of work that is the subject of multiple books on data quality. Needless to say, data quality is a complex concept, with multiple dimensions used to describe its many facets, multiple methods for measuring data quality features, and an endless number of rules that could be developed to test the acceptability of data for a specific use case. Assessing data quality is never a simple linear process of deciding which data quality dimensions to assess, creating data quality measures, and applying data quality rules. As illustrated in the PEDSnet examples, data quality programs tend to evolve in sophistication over time. There are multiple cycles in determining whether a data quality issue warrants more investigation. The deeper one looks at the data, the more problems one finds that could impact the results. It is important that data quality remain a constant subject of examination and vigilance. The overused expression "garbage in, garbage out" is still very applicable. All major data initiatives need to invest significant technical and personnel resources in a data quality assessment program. This activity is often overlooked until some unpleasant surprise arises. It is much better to be proactive in developing a data quality program than to be faced with an embarrassing situation where data that were not fit to be used produced incorrect results.
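The three most common problem types named above can each be detected with a simple check. The sketch below is illustrative only: the field names, the set of mapped concept codes, and the weight-range bounds are all assumptions, not checks taken from PEDSnet.

```python
# Assumed set of concept codes that successfully mapped to the common data model.
KNOWN_CONCEPTS = {"8507", "8532"}

rows = [
    {"weight_kg": 72.0,  "gender_concept": "8507"},
    {"weight_kg": None,  "gender_concept": "8532"},  # missing data
    {"weight_kg": 640.0, "gender_concept": "0"},     # outlier and unmapped
    {"weight_kg": 81.5,  "gender_concept": "8507"},
]

# Missing data: fraction of records with no recorded weight.
missing_rate = sum(r["weight_kg"] is None for r in rows) / len(rows)

# Outlier values: weights outside an assumed plausible range.
outliers = [r for r in rows
            if r["weight_kg"] is not None and not (0 < r["weight_kg"] < 300)]

# Unmapped concepts: codes absent from the common data model's vocabulary.
unmapped = [r for r in rows if r["gender_concept"] not in KNOWN_CONCEPTS]
```

Checks of this kind are cheap to run on every ETL cycle, which is how a program like the one described above can grow its battery of checks over time.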