Hello. In this video, we'll talk about emerging data types. The key takeaway for this video is to identify the emerging data types that you may encounter in different health data sources. Remember, in previous videos we covered a good number of common data types, but these are the emerging ones that we will cover in this video.
So, what are the emerging data types? Here is a list: lab orders and values, vital signs, social data, patient-generated data, and also other data that don't fall into one category. We'll talk about each of them further in the next slides.
So, lab data. The reason we call this an emerging data type is that, on a population level, lab data was not available in the past. But now that we have more EHR data and more lab data in a centralized fashion, we are able to look at lab data for a large population, hence calling it an emerging data type. Lab data may contain lab orders and values, or only orders, depending on what data set you're looking at. Lab values could be indicative of certain sub-populations at risk, so this is very good data to know about. You can derive certain variables out of lab data, like the severity of a disease. You can see a lab value going up while the coding of a diagnosis is not changing, so that lab data could help you find certain severities or even missing diagnoses. Sometimes you might see a lab value that crosses a threshold, which should have meant that the patient has a disease, but you don't see it on the diagnostic side.
There are coding standards. However, most of the providers here in the US use their own internally developed coding terminologies, which makes it very hard to map them to the standardized coding terminologies. The most prominent one in this area is LOINC, which is the Logical Observation Identifiers Names and Codes.
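To make the missing-diagnosis idea from a moment ago concrete, here is a minimal Python sketch. Everything in it is an assumption for illustration only: the table layouts, the patient IDs, the test name `HBA1C`, the 6.5 cutoff, and the `E11` diagnosis prefix are hypothetical choices, not something taken from a real data set. It simply lists patients who have at least one lab value at or above a cutoff but no matching diagnosis code.

```python
# Hypothetical lab rows: (patient_id, test_name, value). All values are made up.
LABS = [
    ("p1", "HBA1C", 7.2),
    ("p1", "HBA1C", 6.9),
    ("p2", "HBA1C", 5.4),
    ("p3", "HBA1C", 8.1),
]

# Hypothetical diagnosis rows: (patient_id, diagnosis_code).
DIAGNOSES = [("p3", "E11.9")]


def possible_missing_diagnosis(labs, diagnoses, test_name, threshold, dx_prefix):
    """Return sorted patient ids that have a result >= threshold for
    test_name but no diagnosis code starting with dx_prefix."""
    coded = {pid for pid, code in diagnoses if code.startswith(dx_prefix)}
    above = {
        pid
        for pid, name, value in labs
        if name == test_name and value >= threshold
    }
    return sorted(above - coded)


# The cutoff (6.5) and the code prefix ("E11") are illustrative assumptions.
flagged = possible_missing_diagnosis(LABS, DIAGNOSES, "HBA1C", 6.5, "E11")
print(flagged)
```

A list like this is only a starting point for review. Whether a flagged patient really has an uncoded condition still depends on the units, the test definition, and clinical judgment.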
Actually, LOINC started as a lab coding standard a while ago, but now it includes a lot of other observational information and also surveys and other things. But LOINC is the de facto standard for lab data.
Now, there might be a wide range of data sources for lab data. EHRs, of course, are one source, but at least here in the US there are some lab servicing companies that are almost national, that cover wide populations and provide services to a large number of clinical providers, and they also have their own databases. Again, that is not called an EHR database; it is a laboratory information system. But a laboratory information system could be part of the EHR as well, so it depends on where you look.
In terms of data quality, it's usually acceptable, but there are varying coding standards, and we will talk about that. The units of measurement and the types of devices used to measure those things may all create biases and skewness in the data. We have the typical data interoperability issues when you want to crosswalk between different coding systems. Also, if certain lab orders or results relate to a very special person or a special disease that is very sensitive, they might be protected by various federal or even state laws.
Here is an example just showing the complexity of coding. This is LOINC. This is basically going to the LOINC website and trying to find the code for a very common lab test like hemoglobin A1c, which helps you track your blood glucose levels over a longer time frame, and it's especially important for diabetic patients to track that. You can see there are multiple results coming back and, depending on what you're looking at, you might actually pick the wrong code. So there needs to be some training or expertise before you start working with lab coding standards. Here's the typical coding issue. You can see that CPT is sometimes used for ordering lab tests.
If you want to crosswalk them with LOINC, it is not as easy. You can see that number one is a one-to-one match, which is very easy, but numbers two, three, and four show a one-to-many mapping, where one CPT code maps to three, a dozen, or even more than 1,000 different LOINC codes.
Here's a sample table from a database where you see the lab data. In each row, you have the patient ID and the date that the order was made. At the end, you also see the test name, like a blood test for MCV or other things, and the value associated with it. Now, as you can see in box number one, there might be multiple lab orders and multiple results for even one visit, so it's important to know the relationship between a lab test and a patient. Number two: you can see that sometimes a lab order is cancelled, either by the physician or by the lab technician, because they wanted to redo it, the device gave an error, and things like that. So you need to do a lot of pruning and cleaning before you use lab data. Number three indicates the fact that there are different numbers of decimal places for different lab tests. There are a lot of little things that you need to know before you can start using those lab tests, especially if you do not know the units associated with them. So that is another complexity with lab data.
Now, the next emerging data type that might become available on a very large population level is vital signs. I've listed some of them here: weight, height, body mass index or BMI, blood pressure, temperature, pulse rate, respiratory rate, and so on. They are becoming very helpful for finding certain trends in the population, especially if you have good temporal data on these variables, where your population is at risk of developing certain outcomes. Vital signs, very much like the other emerging data types, could also help you derive new variables such as diagnosis severity or even a missing diagnosis when the diagnostic codes are not there.
So, for example, if the blood pressure is higher than a certain limit in a couple of visits, then you might infer that the patient has hypertension, even if the hypertension diagnosis is not in the database. In terms of coding standards, LOINC also covers vital signs, but a lot of EHRs do not use any standards for vital signs, which makes them tricky to use in your research. Data sources: mostly it's EHR data, but there could also be other monitoring systems, either at home or in an ICU setting, that may not actually feed the data into the EHR and that might also have good vital sign data. Data quality is acceptable, but there are a lot of human errors, unit errors, and loss of quality over time, especially when you transfer the data from one system to another. There are a lot of mismatched-unit issues and other things about these vital signs that make interoperability a challenge. For example, blood pressure: any blood pressure needs to be accompanied by how it was measured. Was the patient sitting or standing, and under what other conditions? A lot of databases may not even have that. I'm not aware of any legal considerations about vital signs.
There are a lot of variations in vital signs, and that depends on the source. I don't want to go into all of the details, but it could be simple human physiology, where blood pressure just changes over time, or there could be a comorbidity associated with it. It could be a subjective bias, like a pain level, which is very subjective depending on what the patient says. There might be measurement issues where a tool is used and the tool is not calibrated, and that creates all sorts of problems. It could be data capture errors, like somebody typing the wrong weight for a patient.
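Going back to the hypertension example for a moment, the inference could be sketched roughly like this in Python. This is only a sketch under stated assumptions: the 140/90 limits, the requirement of two distinct visit dates, the function name `infer_hypertension`, and the sample readings are all hypothetical, and a real rule would need clinical input.

```python
from datetime import date

# Hypothetical readings for one patient: (visit_date, systolic, diastolic) in mmHg.
READINGS = [
    (date(2020, 1, 10), 150, 95),
    (date(2020, 1, 10), 148, 92),  # second reading during the same visit
    (date(2020, 3, 2), 128, 82),
    (date(2020, 6, 15), 145, 88),
]


def infer_hypertension(readings, sys_limit=140, dia_limit=90, min_visits=2):
    """Return True when readings at or above either limit occur on at least
    min_visits distinct visit dates. The default limits are assumptions."""
    elevated_dates = {
        visit_date
        for visit_date, systolic, diastolic in readings
        if systolic >= sys_limit or diastolic >= dia_limit
    }
    return len(elevated_dates) >= min_visits


print(infer_hypertension(READINGS))      # elevated readings on two distinct dates
print(infer_hypertension(READINGS[:3]))  # elevated readings on one date only
```

Counting distinct visit dates rather than raw readings is a deliberate choice here, so that repeated measurements within a single visit do not count as a couple of visits.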
Then there are also data interoperability issues where, as you extract the data from one data table and want to import it elsewhere, you forget the measures or the units, or the target table might actually use a different unit system, and that throws off all of your results.
Now, here's an example of a vital sign table, which is often called a flow sheet in a lot of these EHR databases. Number one refers to a column that describes other conditions, for example the position the patient was in when the blood pressure was measured, such as sitting. So you always have to make sure that you look at those conditions. Number two refers to the fact that the weight seems to be very high, like 2,800, which doesn't make sense in either kilograms or pounds. It's just too high, but because the unit is not there, that's a data quality issue. It could be ounces; somebody might weigh 2,800 ounces. Number three refers to the fact that sometimes a value might not be a simple number, and you may need to do some pre-processing before you can use it. For example, here there is a slash between the two numbers of the systolic and diastolic blood pressure, so you need to pre-process the data. Finally, number four shows that it's very important to see the trajectory of a lot of the vital signs, because vital signs change almost on a daily basis. Some change much faster, some take a bit longer, but the trajectory might be more helpful than just looking at one value. For example, for one patient here, you can see that the BMI has changed by almost 10 points from one date to another, which is a considerable change in BMI.
Another emerging data type is social data, which includes a long list of variables that you may or may not find in health data, like smoking status, alcohol consumption, addictive behavior, socioeconomic status, and so on.
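Going back to the flow sheet for a moment, the kind of pre-processing it calls for might look like the following Python sketch. The function names, the accepted unit labels, and the plausible weight range of 0.5 to 350 kg are hypothetical choices made for illustration. The sketch splits a combined blood pressure string into systolic and diastolic numbers, and it refuses to return a weight when the unit is missing or the converted value is implausible.

```python
import re

# Two numbers separated by a forward slash or a backslash, e.g. "120/80".
BP_PATTERN = re.compile(r"^\s*(\d{2,3})\s*[/\\]\s*(\d{2,3})\s*$")

# Conversion factors to kilograms for the unit labels assumed here.
KG_PER_UNIT = {"kg": 1.0, "lb": 0.45359237, "oz": 0.028349523125}


def parse_blood_pressure(raw):
    """Return (systolic, diastolic) as integers, or None if raw does not match."""
    match = BP_PATTERN.match(raw)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


def weight_in_kg(value, unit, low=0.5, high=350.0):
    """Convert a weight to kilograms. Return None when the unit is missing or
    unknown, or when the result falls outside the assumed plausible range."""
    if unit not in KG_PER_UNIT:
        return None
    kg = value * KG_PER_UNIT[unit]
    return kg if low <= kg <= high else None


print(parse_blood_pressure("120/80"))
print(weight_in_kg(2800, None))   # unit missing, so the value cannot be trusted
print(weight_in_kg(2800, "oz"))   # plausible once the unit is known to be ounces
```

Returning `None` instead of guessing a unit keeps the data quality problem visible, so it can be counted and handled later rather than silently turned into a wrong number.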
Having social data is very helpful for better managing people who are at risk of social needs. There are a lot of variables that you might derive from it, including things like treatment affordability: if somebody has a low SES, or socioeconomic status, knowing whether they can actually afford a treatment or not could be helpful. There are currently not that many coding standards for social data, because for a long time social data were not considered medical data. That is now changing, and social data are becoming part of medical data as part of the patient-centered movement. Some coding standards have actually started coding a lot of social data, including ICD, LOINC, and SNOMED.
Now, depending on whether you are looking at the personal level or the aggregate level, you might find social data in different places. On the personal level, you might find it in an EHR, either on the structured side of the EHR or in the clinical notes, which are free-text, unstructured data. If you're looking at the aggregate level, census data or other data, such as housing data for a neighborhood, might help you understand the context in which a patient lives.
Data quality is not good. There is a lot of incompleteness in survey responses, and at the personal level in EHRs the data are very sparse and not well connected, because there are no reimbursement policies for them here in the US except in certain pockets or certain states. So the data quality is low, and there is also the possibility of bias, as usual when you use a survey to collect data. Data interoperability is still in flux; we still don't know how to share the data. There are some legal considerations with some social data, and they're not always HIPAA. There are other rules about some social data; for example, education data is protected by the Family Educational Rights and Privacy Act, or FERPA.
Here's an example of a social table. You can see number one shows that the patient is a former smoker.
But it doesn't say for how many years they have not smoked, so that's one problem, for example. Number two shows whether they're consuming alcohol or not. Then number three shows whether they are sexually active or not. Again, for various unknown reasons, these social variables are often not asked about and are missing, and it's very hard to deal with that. Very much like vital signs, the trajectory of social data might be more important than just flatly knowing whether somebody is sexually active or not. Here, number four shows that somebody was not sexually active but now they are sexually active. So it might be a good trigger or change in your analytics to understand some trends or outcomes.
Now, patient-generated data is also booming because of all of the technology, mobile technology and wearables, that helps you to collect data. Of course, we have to look into their value in terms of predicting certain outcomes of interest. There might be many derived variables that you can get out of such data, like fitness data, activities of daily living levels, and so on. Although there are some standards that are recommended by the Food and Drug Administration, or FDA, here in the US, there is no other mechanism. If you want to create an app on a smartphone, you can just go ahead and have your own standard. If it's not really affecting the health of a person and it's just collecting some lifestyle information, it's not even FDA regulated. So, again, coding standards don't really exist much in this area yet. As I said, data sources could be a lot of different devices. Data quality would vary a lot depending on the device, the app, and so on. Interoperability, as I said, is a challenge right now. It is not an active point of discussion in this area right now, but we hope that it also becomes an important topic to consider here in the US.
I'm not aware of specific legal considerations for such data. The only problem is that if you want to run a trial and get the data from individual patients, obtaining consent is usually complex, but there are some solutions available. Here's just a picture of some patient-generated data. You can see it's not only smartphones; there are now smartwatches, and almost everything around you is smart and can collect data that could be helpful for medical research.
There are also some other emerging data types, like workflow data: who connected with whom and when, how many patients were seen by this doctor, how many doctors were responsible for this patient, and things like that. Then there are environmental data and geographical information systems: depending on where you live, how many restaurants are close to you, all of that could be useful information. Even marketing and consumer data, like shopping behavior, bankruptcy records, credit scores, and so on, could all be helpful for research. However, they are not typically in EHRs or insurance claims. They all have their own databases or log files, and it might be tough to get them for medical research.
So, in summary, we talked about four emerging data types: lab orders and values, vital signs, social data, and patient-generated data. We also touched on the possibility of other data that you might use for medical research.