In previous videos, we talked very briefly about categorical and continuous variables in electronic health records, and we adopted one-hot encoding to encode categorical values and combine them with continuous variables. Here we are going to overview more options for encoding information in electronic health records, and we will see the main strengths as well as the main limitations of these approaches.

First of all, categorical variables can be classified as nominal, dichotomous, and ordinal. Nominal variables are variables that have two or more categories but no intrinsic order. Examples are ethnicity, or the mode of arrival at the hospital: whether the patient came by ambulance, by car, or on foot. Other nominal variables in electronic health records include the blood group, the symptoms of a disease, or the cause of death. Dichotomous variables are nominal variables which have only two categories or levels. For example, if we are looking at gender at birth, we would most probably categorize patients as either male or female; this is an example of a dichotomous variable, which is also a nominal variable. Ordinal variables are variables that have two or more categories, just like nominal variables, but the categories can also be ordered or ranked. The Glasgow Coma Scale, for example, can be seen as an ordinal variable, since its values follow an order that starts from no response and gradually increases through low levels of response to full response.

Continuous variables, also known as quantitative variables, are further categorized into interval and ratio variables. Interval variables are numeric, with equal distances between adjacent values on the scale. An example is the patient's temperature: the difference between 20 and 30 degrees Celsius is the same as the difference between 30 and 40 degrees Celsius. Interval variables do not have a true zero; a value of 0 does not mean the absence of the quantity. With interval variables we can add and subtract, but we cannot multiply, divide, or calculate meaningful ratios. Ratio variables are similar to interval variables, with the difference that they have an absolute zero. Examples are the patient's height or weight.

It is worth noting that how we categorize a variable is sometimes a matter of choice. In some cases the measurement scale of the data is ordinal, but the variable is treated as continuous. Another example where we find different choices is gender: we categorized it as a dichotomous variable, but social scientists may disagree and argue that gender is a more complex variable involving more than two distinctions. Another example is the Likert scale, which is sometimes used as a continuous variable, whereas some researchers would argue that it should never be treated as continuous. The key message is that how we categorize variables plays an important role in how we represent them, how we standardize them for a machine learning model, and, subsequently, how we process them.

Here we see an example of why ordinal encoding is challenging. Blood type is a categorical variable with no intrinsic order, so how are we going to convert it into a unique, meaningful numerical input? It might seem possible to simply assign a number to each category; however, this leads to a misleading representation, because it imposes an artificial order on the categories. An alternative approach is to use one-hot encoding. One-hot encoding, also known as dummy variables, is a method of converting a categorical variable into several binary columns, where a one indicates the presence of the corresponding category in a row.
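As a quick illustration, here is a minimal sketch of one-hot encoding in Python with pandas; the blood-type values and column names are illustrative, not taken from the lecture.

```python
import pandas as pd

# Illustrative patient records with one nominal variable (blood type).
df = pd.DataFrame({"blood_type": ["A+", "B+", "O-", "AB+", "O-"]})

# One binary column per category; a 1 marks the category present in that row.
one_hot = pd.get_dummies(df["blood_type"], prefix="blood_type")
print(one_hot)
```

Each row contains exactly one 1, so no artificial order is imposed on the categories.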
One-hot encoding, however, produces one column per category, so the dimensionality grows quickly when a variable has many categories. A solution to reduce the dimensionality of the data is the hashing trick. Hashing converts categorical variables to a higher-dimensional space of integers, where the distance between two vectors of categorical variables is approximately maintained. With hashing, the number of dimensions will be far lower than with an encoding like one-hot encoding; therefore, this method may have an advantage when the number of classes is high. In hash encoding, instead of assigning a different unit vector to each category, a hash function designates for each category a feature vector in a reduced vector space. The problem with this method is that hashing is a one-way process: one cannot recover the original input from the hashed representation. This reduces our ability to explain the model and to build trust in it, which is a very important limitation for healthcare applications.

Another family of encodings is target encoding. The target here is the predicted variable. The key idea in target encoding is a mapping of each categorical value to a probability estimate of the target attribute. In a classification scenario, the numerical representation corresponds to the posterior probability of the target conditioned on the value of the categorical attribute, whereas in a regression scenario it corresponds to the expected value of the target given the value of the categorical attribute. In target encoding it is important to split the data into training and testing sets before we apply the encoding, to avoid mixing information between the training and the testing data sets.

Each encoded value S_i represents a probability, and thus the transformed attribute is automatically normalized between 0 and 1, which is important for deep learning models. If the training set is sufficiently large, then to calculate the probability S_i it is enough to take the ratio of the number of observations in category i with target variable y equal to 1 to the total number of samples in that category. For smaller data sets, the probability is estimated as a mixture of two probabilities, the posterior probability of y given the category and the prior probability of y:

S_i = lambda(n_i) * P(y = 1 | x = i) + (1 - lambda(n_i)) * P(y = 1)

Lambda here is a monotonically increasing function of the category sample size n_i. When n_i is large, lambda is close to 1, and we assign more credit to the posterior probability; when the sample size is small, the estimate falls back towards the prior probability of the dependent variable.

Here we can see an example of mean target encoding. First we identify the samples that belong to each of the classes; let's start with A positive. Then we identify the corresponding target variables and we average them, and we replace the values of the class with the average value of the target variable. We repeat this process for each of the classes in our categorical variable, and once we have applied it to each class, we have an encoding for our feature. In target encoding, missing values can be handled by treating them as a new "missing" class. Target encoding is a more compact representation than one-hot encoding.
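Here is a minimal sketch of mean target encoding with smoothing, assuming a pandas DataFrame; the column names, the outcome values, and the smoothing form lambda(n_i) = n_i / (n_i + m) are illustrative assumptions, since the lecture only requires lambda to increase with n_i.

```python
import pandas as pd

def mean_target_encode(train, col, target, m=10.0):
    # Prior probability of the target, P(y = 1).
    prior = train[target].mean()
    # Per-category posterior mean and sample count n_i.
    stats = train.groupby(col)[target].agg(["mean", "count"])
    # lambda(n_i) = n_i / (n_i + m): monotonically increasing in n_i.
    lam = stats["count"] / (stats["count"] + m)
    # Mixture of posterior and prior: rare categories shrink to the prior.
    return lam * stats["mean"] + (1 - lam) * prior, prior

# Fit the encoding on the training split only, to avoid information leakage.
train = pd.DataFrame({
    "blood_type": ["A+", "A+", "B+", "O-", "O-", "O-", "AB+"],
    "outcome":    [1,    0,    1,    0,    1,    0,    1],
})
encoding, prior = mean_target_encode(train, "blood_type", "outcome")
train["blood_type_te"] = train["blood_type"].map(encoding)
# At test time, reuse the same mapping; unseen categories fall back to the prior.
```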
Mean target encoding has certain limitations. It is very sensitive to the target variable, and it tends to overfit. Also, if our training data set contains only a few examples of a category and they take extreme values, this will affect the average value and the performance of the model. Last but not least, mean target encoding does not extract information from the intra-category target variable distribution, because it is based only on the mean.

Leave-one-out target encoding has been suggested as a way to avoid the data leakage that mean target encoding can create. Here we see an example of leave-one-out target encoding. The process is very similar to mean target encoding, but each time we take a sample of a particular class, we exclude that sample's own target value from the mean used to create its encoding (a minimal code sketch follows at the end). This breaks the direct dependency of the encoding on the corresponding target value. Again, we repeat this process iteratively for all of the classes in the categorical variable. So, summarizing here: leave-one-out target encoding is very similar to mean target encoding; however, the mean response over all rows for a given category is estimated by excluding the row itself, and in this way we avoid direct response leakage. It also reduces the effect of outliers, and it is computationally more efficient.

Target encoding might be better than one-hot encoding in some cases, when the categorical data contain a large number of categories. However, if the data have only a few categories, then one-hot encoding is, most of the time, considered the better option. There is also an effect of category imbalance: having few samples in one category and a much higher number of samples in other categories affects target encoding, but it also affects one-hot encoding. Target encoding does not handle interactions between columns. If there is an interaction effect, the effect on the target variable will not simply be the sum of the two feature effects. For example, just adding sugar or just stirring the coffee may not have a huge effect on the sweetness of the coffee, but if one adds sugar and stirs, then there is a large effect on how sweet the coffee will be.

Summarizing, we saw that one-hot encoding is the most popular way of encoding categorical and binary variables, also in electronic health records. However, other approaches exist, with their own strengths and limitations. We explored mean target encoding, which encodes categorical variables with the conditional mean of the target; it takes the predicted variable into account, and it can be used for binary problems, multi-class problems, and regression problems. We saw that leave-one-out target encoding is an extension of mean target encoding that reduces overfitting. However, none of these target encodings addresses the problem of interactions between variables, which is common in electronic health records.
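To make the leave-one-out step concrete, here is a minimal sketch, again assuming a pandas DataFrame with illustrative column names and outcome values.

```python
import pandas as pd

def loo_target_encode(train, col, target):
    grp = train.groupby(col)[target]
    # Category totals and counts, broadcast back to each row.
    sums, counts = grp.transform("sum"), grp.transform("count")
    # Mean over all *other* rows of the category: the row's own target
    # is subtracted, which breaks the direct response dependency.
    loo = (sums - train[target]) / (counts - 1)
    # Singleton categories have no other rows; fall back to the global mean.
    return loo.fillna(train[target].mean())

train = pd.DataFrame({
    "blood_type": ["A+", "A+", "B+", "O-", "O-", "O-"],
    "outcome":    [1,    0,    1,    0,    1,    0],
})
train["blood_type_loo"] = loo_target_encode(train, "blood_type", "outcome")
```

Unlike the mean encoding above, two rows of the same category can receive different encoded values, because each row's own outcome is excluded.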