Those of you interested in data science should be saying, yes, finally we are getting to data science, because this last section is about how you get knowledge out of the data warehouse. That is what data science is supposed to be about, so you can see where you fit into our whole course. When we talk about getting knowledge into the knowledge base, we mean getting computable knowledge into the knowledge base. We already talked about how the qualitative knowledge from experts and the qualitative knowledge from the literature has to be turned into computational rules, or intelligent use of semantic networks, whatever it was we covered in the last session. We already talked about maintaining consistency using ontologies and such. But when you learn from data, you want your approach to learning from data to also generate a computational artifact that is consistent, coherent, and usable. That is what the process of learning from data should produce. On the left-hand side, starting off with the data, let's now talk about how that happens.

So the first question the data scientist needs to ask is: what data am I going to use? The answer is that it's multi-modal, right? You have your numerical data, your coded data like ICD-9, your text data if you want to use NLP, your imaging data, your continuous signal data, your sound data, who knows what. Obviously, the more types of data you have, the more complicated things are. In our next course, we'll go through in more detail the different types of data you'd be dealing with in this data-to-knowledge concern.

If you're going to learn from EHR data, you're dealing with, to put it bluntly, garbage data. Now, it's not fair to call it garbage unless we can articulate what's garbage about it. There are at least three "garbage" issues you've got to deal with. Number one is noise: transcription errors, where instead of 32 it's 23, or rather than 223 it's 230. There are many ways the data can be wrong in and of themselves; they can be outside a plausible range, or they can fail to make sense given what was previously recorded. So that's one level.

The next level of data quality is normalization. Here I have weight in kilos, here I have weight in pounds. Here I have diagnoses in ICD-9, here I have diagnoses in ICD-10. Here I have procedures in CPT, here they are in ICD, et cetera. If I can't normalize the data, I can't put them together, and therefore I can't learn from them; they have to go together.

Finally, missing data. This is a huge issue with EHR data in particular. If you are ever doing a data science project, there's this thing called data imputation, where you fill in data that aren't there from the surrounding data. That makes sense when data are missing at random. Data in the EHR are not missing at random, and you've got to use a method that addresses the fact that they're not missing at random. For instance, if the data are missing, it could be that the patient is dead and no longer able to come to clinic. Well, death is a very big deal, and it's not a random event. Or the patient may be feeling so much better that they don't bother to come again; it's important to know that the patient is better. Or the patient may be doing so badly that they had to go to the tertiary care center and not to you. So those are the issues around whether the patient comes in at all.
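To make those three issues concrete, here is a minimal sketch, assuming pandas and purely hypothetical column names (weight_lb, weight_kg, hba1c), of the kinds of steps involved: a plausibility check for noise, a unit conversion for normalization, and an explicit missingness indicator kept alongside any imputed value, since EHR data are not missing at random.

```python
# A minimal sketch (not an actual course pipeline): plausibility checks, unit
# normalization, and missingness handling for hypothetical EHR columns.
import numpy as np
import pandas as pd

# Hypothetical extract: weight recorded in pounds at one site, kilos at another,
# plus a lab value (HbA1c) that is sometimes missing.
ehr = pd.DataFrame({
    "weight_lb": [154.0, 23.0, np.nan, 230.0],   # 23 lb is a likely transcription error
    "weight_kg": [np.nan, np.nan, 81.5, np.nan],
    "hba1c":     [6.1, np.nan, 7.4, np.nan],
})

# 1. Noise: flag values outside a plausible adult range instead of trusting them.
plausible = ehr["weight_lb"].between(60, 700) | ehr["weight_lb"].isna()
ehr.loc[~plausible, "weight_lb"] = np.nan

# 2. Normalization: put both weight columns on one scale (kilograms).
ehr["weight_kg_all"] = ehr["weight_kg"].fillna(ehr["weight_lb"] * 0.4536)

# 3. Missingness: EHR data are not missing at random, so keep an explicit
#    indicator alongside any imputed value rather than silently filling it in.
ehr["hba1c_missing"] = ehr["hba1c"].isna().astype(int)
ehr["hba1c_imputed"] = ehr["hba1c"].fillna(ehr["hba1c"].median())

print(ehr)
```

The indicator column is the simplest way to let a downstream model see that a value was absent, which matters precisely because the reasons for absence, as discussed here, carry clinical meaning of their own.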
Let's say the patient has come in, but maybe the doctor did not examine the part of the body that you care about, or didn't administer the survey, or didn't send off the laboratory test that you care about. Or let's say they did examine that part of the body but didn't record it. So you can see that missingness in EHR data has multiple levels of causation, and if you do not take that causation into account, you are not going to come up with great solutions and answers.

Another issue is that there's so much data: which of it do you really focus on? If you know what the features are, which ones do you select? When I talked about normalizing data, I took it for granted that weight and weight should be combined. Okay, that's great. But what about weight and heaviness? Are those the same concept? Finally, if I have lots of data but I don't know what the essential features are, how do I figure out what those features are?

A good example of this comes from a paper on temporal lobe epilepsy. Epilepsy is seizures or convulsions, and the temporal lobe is the part of your brain right above your inner ear. A temporal lobe seizure is a different type of seizure from the typical [inaudible] of the whole person convulsing and becoming unconscious. A reasonable question is: from their scans (I'm not going to go into the types of scans here), can we get the machine to diagnose or classify temporal lobe epilepsy? What you're looking at in the three columns is left temporal lobe epilepsy, then right temporal lobe epilepsy, and then the normal controls. If I ask you how you distinguish those columns, you might say, gee, it looks like there's a lot more red on the right than on the left, and who knows what. So there's a lot of data in these pictures, and I'm just showing you one layer; this has multiple layers in 3-D, and I can turn it around and inside out. What's important?

The machine shows that, well, it turns out cortical thickness is important: how thick the grey matter on the surface of the temporal lobe is, what surface area that grey matter has, what its volume is, and finally its mean curvature. I can understand the first ones: seizures come from grey matter, so it's not surprising that grey matter is important. Would I have thought of surface area and volume? I might have. But curvature, where does that come from? So I can make up a story, a kind of after-the-fact interpretation of this, but thank goodness the machine knows how to extract features from these gobs and gobs of data.
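What "the machine extracts the features" looks like in practice is beyond this lecture, but here is a hedged sketch, not the paper's actual method, using synthetic data and scikit-learn: train a simple classifier on four per-region measures (thickness, surface area, volume, mean curvature) and ask it which ones mattered most for separating left TLE, right TLE, and controls.

```python
# A minimal sketch of machine-driven feature ranking (not the paper's actual
# method): given per-region measures like cortical thickness, surface area,
# volume, and mean curvature, ask a model which ones matter for classifying
# left TLE vs. right TLE vs. control. Data here are entirely synthetic.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
n = 300
features = ["thickness", "surface_area", "volume", "mean_curvature"]

# Synthetic stand-in: three classes (0 = control, 1 = left TLE, 2 = right TLE)
# with thickness and volume carrying most of the class signal.
y = rng.integers(0, 3, size=n)
X = rng.normal(size=(n, len(features)))
X[:, 0] -= 0.8 * (y > 0)          # thinner cortex in the epilepsy groups
X[:, 2] -= 0.5 * (y > 0)          # smaller volume in the epilepsy groups

model = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)

# The learned importances play the role of "what the machine decided matters".
for name, importance in sorted(zip(features, model.feature_importances_),
                               key=lambda pair: -pair[1]):
    print(f"{name:15s} {importance:.2f}")
```

In the real study the inputs come from segmented 3-D scans rather than random numbers; the point of the sketch is only that the ranking of features falls out of the model rather than out of a human's prior story.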