So far, we have seen how to develop deep learning models such as multi-layer perceptrons, convolutional neural networks, and recurrent neural networks, including LSTMs and GRUs. We have also seen how to evaluate and compare their performance. In particular, we used time series data obtained from ECG recordings and gained some experience with multi-class classification. How do we apply these models and this knowledge to electronic health record data? We have already discussed that electronic health records include very diverse information: vital signs from the intensive care unit, lab tests, medical images, and doctor diagnoses. Another important challenge in processing electronic health records is that all these signals have irregular timing. Pre-processing and standardizing this data is more complicated than we might originally anticipate, because clinical variables have different sampling rates and can be continuous or categorical. Categorical variables can be ordinal, or they can have no ranking, as we will see later. How do we aggregate such diverse information? We have to make choices that may require expert knowledge, or that can vary and produce different results. To be able to use this information to build machine learning models, we will look at clinical variables over fixed time intervals. The most common practice for benchmarking datasets and comparing the performance of machine learning models is to compute summary statistics within windows of a few hours or minutes, and use them to construct a dataset that resembles a regular time series. We see here an example of patient data, with demographic information such as age, sex, and ethnicity, along with a number of different clinical variables. Some of them are continuous, like the heart rate; others are sampled differently, like glucose, or take categorical values. By binning this information into fixed intervals, we will end up with missing values.
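The binning just described can be sketched with pandas: a minimal example, assuming hypothetical irregular heart-rate measurements for a single ICU stay, aggregated into fixed one-hour windows by taking the mean within each window.

```python
import pandas as pd

# Hypothetical irregularly timed heart-rate measurements for one ICU stay.
events = pd.DataFrame({
    "charttime": pd.to_datetime([
        "2101-01-01 00:12", "2101-01-01 00:47",
        "2101-01-01 02:30", "2101-01-01 05:05",
    ]),
    "heart_rate": [88.0, 92.0, 85.0, 90.0],
})

# Aggregate into fixed 1-hour windows: the mean within each window,
# leaving NaN wherever no measurement fell in that hour.
hourly = (events.set_index("charttime")["heart_rate"]
          .resample("1h").mean())
print(hourly)
```

Note that the resampled series already exposes the problem mentioned above: hours with no measurement come out as NaN, which is exactly the missingness we will have to address later.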
Later on, we will see approaches for addressing this problem. The choice of the window will affect the missingness in our data, and it will also affect the performance of the algorithms. Our goal is to build a robust representation of the lab and vital-sign time series, and to build frameworks that are reproducible and easy to extend, while at the same time being clinically meaningful, both in terms of interventions and outcomes. This is an overview of the pre-processing pipeline we are going to follow. The first part is cohort selection. The second part relates to standardization of the data: standardizing the units of measurement, detecting and correcting outliers, aggregating the data within specific time intervals, and addressing the missingness in our data. For each of these steps, we should consider clinical validity, as well as clinically meaningful interventions and outcomes. The pipeline is based on hourly observed treatment signals for several critical care interventions, including ventilation, vasopressors, fluid bolus therapies, and so on. There are also several common outcomes of interest that we can consider, like mortality, length of stay, or decompensation. Meaningful temporal gaps in our measurements are very important, not only to control the missingness in our data, but also for other related reasons; for example, it is important to minimize noise that can come from label leakage. The cohort selection process has been designed so that it can be easily adjusted to support future research questions. Deep learning models are very successful because they automatically extract features, and in this way we would like to minimize the expert knowledge required to build the models. For our cohort selection, we use the basic tables of the MIMIC relational database, which give us information about the patient ID, the admission ID, and the intensive care unit stay ID.
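The unit standardization and outlier steps of the pipeline can be sketched as follows. This is a minimal example with hypothetical temperature events in mixed units; the Fahrenheit-to-Celsius conversion is standard, but the valid physiological range used for outlier removal is an assumption, not a value prescribed by the pipeline.

```python
import pandas as pd

# Hypothetical temperature events recorded in mixed measurement units.
events = pd.DataFrame({
    "valuenum": [98.6, 37.2, 104.0, 500.0],
    "valueuom": ["F", "C", "F", "C"],
})

# Standardize all measurements to Celsius.
is_f = events["valueuom"] == "F"
events.loc[is_f, "valuenum"] = (events.loc[is_f, "valuenum"] - 32) * 5 / 9
events["valueuom"] = "C"

# Mark physiologically implausible values as missing
# (the 25-45 C range here is an assumed cutoff for illustration).
events.loc[~events["valuenum"].between(25, 45), "valuenum"] = None
print(events["valuenum"].tolist())
```

Marking outliers as missing, rather than dropping the rows, keeps the event stream aligned in time so that the later aggregation and imputation steps can handle these values uniformly.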
Before we can actually proceed to the cohort selection, it is important to consider a clinically meaningful task for our research question. In-hospital mortality prediction is an example. Here, we can frame this as a binary classification task: predicting in-hospital mortality based on the first 48 hours of an intensive care unit stay. Decompensation prediction, on the other hand, predicts whether a patient's health will rapidly deteriorate in the next 24 hours. We can examine this by looking at mortality prediction at each hour. Therefore, decompensation prediction is similar to in-hospital mortality prediction, but it requires far more data; the reason is that we have to slice our time windows in a different manner. Length-of-stay prediction predicts the remaining time spent in the intensive care unit at each hour of the stay. Finally, phenotype classification classifies a given patient, based on their ICU records, into one of 25 acute care conditions, which can be extracted from the International Classification of Diseases (ICD) codes. In our examples, we are going to focus on in-hospital mortality prediction. The reason is computational efficiency: it requires less data and computation than the other tasks. Nevertheless, all the models and knowledge we are going to obtain are easily extensible to the other clinical tasks described here. Since successful deep learning applications require a lot of data, we would like to use as much patient data as possible, as long as this is clinically meaningful. Therefore, we are going to select all the patients in MIMIC-III who have an ICU stay of more than one day and less than 10 days. Also, we would like to focus on adult patients. The cohort selection can be done efficiently using just the three basic tables of the MIMIC database: patients, icustays, and admissions.
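The cohort selection criteria just listed can be sketched in pandas. This is a toy example: the miniature icustays and patients tables below are hypothetical stand-ins for the real MIMIC-III tables, the age column is assumed to be pre-computed, and the bounds are taken inclusively; the adult cutoff of 15 years follows the exclusion described in the next step.

```python
import pandas as pd

# Hypothetical miniature versions of two of the three basic MIMIC-III tables.
icustays = pd.DataFrame({
    "icustay_id": [1, 2, 3],
    "hadm_id": [10, 11, 12],
    "subject_id": [100, 101, 102],
    "los": [2.5, 0.4, 14.0],   # length of stay in days
})
patients = pd.DataFrame({
    "subject_id": [100, 101, 102],
    "age": [67, 45, 12],       # assumed pre-computed age at admission
})

# Join stays to patients, then keep adult stays lasting 1 to 10 days.
cohort = icustays.merge(patients, on="subject_id")
cohort = cohort[(cohort["los"] >= 1) & (cohort["los"] <= 10)
                & (cohort["age"] >= 15)]
print(cohort["icustay_id"].tolist())   # only stay 1 survives the filters
```

In practice the same selection is usually expressed as a SQL join over the database tables, but the logic is identical: join on the shared identifiers, then filter on length of stay and age.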
Focusing on the unique ICU stay identifier, we are interested in the patient's corresponding hospital admission identifier, the subject identifier, and the times of admission to the hospital and to the ICU. Along with the time of admission, we will also extract the time of discharge and the in-hospital death time. Based on this information, we will need to join the patients and icustays tables with the admissions table, with a view to selecting all the information necessary to exclude patients who are younger than 15 years old. Also note that from the beginning we can exclude the neonatal intensive care unit, which includes only babies. The next step is to choose clinically meaningful variables and use them for our predictive model. We did this based on state-of-the-art benchmark datasets. This includes 17 variables, drawn based on expert knowledge. We see here that we have information for capillary refill rate, diastolic blood pressure, fraction of inspired oxygen, the Glasgow Coma Scale measurements, glucose, heart rate, height, mean blood pressure, oxygen saturation, respiratory rate, systolic blood pressure, temperature, weight, and pH. In order to extract those variables, we need to know the chart itemid, information that can be found in the chartevents table of the MIMIC database. We need to combine this information with the patient information through the icustays table. We can do that by getting the unique subject, hospital admission, and ICU stay identifiers, along with the time the event was recorded, the value, and the measurement units. To get this information, we first use the chartevents table and the labevents table; subsequently, we join the results with the icustays table. We should be careful to select the ICU stays in accordance with the cohort from the previous step.
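The itemid-based extraction can be sketched as follows. The itemids below are assumptions for illustration (in MIMIC-III, itemids for the same variable differ between the CareVue and MetaVision source systems, which is why a set of ids is used per variable), and the tables are hypothetical miniatures of chartevents and icustays.

```python
import pandas as pd

# Assumed itemids for heart rate; real extraction would list the ids
# from both CareVue and MetaVision for each of the 17 variables.
HEART_RATE_ITEMIDS = {211, 220045}

chartevents = pd.DataFrame({
    "icustay_id": [1, 1, 2],
    "itemid": [211, 220045, 618],
    "charttime": pd.to_datetime(["2101-01-01 00:15",
                                 "2101-01-01 01:10",
                                 "2101-01-01 00:30"]),
    "valuenum": [88.0, 91.0, 18.0],
    "valueuom": ["bpm", "bpm", "insp/min"],
})
# Only stay 1 was selected into the cohort in the previous step.
icustays = pd.DataFrame({"icustay_id": [1],
                         "subject_id": [100],
                         "hadm_id": [10]})

# Keep only events for the variable of interest, then restrict to the
# cohort: the inner merge silently drops events from excluded stays.
events = chartevents[chartevents["itemid"].isin(HEART_RATE_ITEMIDS)]
events = events.merge(icustays, on="icustay_id")
print(len(events))
```

The inner join with the cohort's icustays is what enforces the warning at the end of the step above: events belonging to stays outside the cohort never enter the extracted dataset.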
Once we have all the information we need from chartevents and from labevents, we can take the union of those two and extract all our events relating to the clinical variables of interest. A dictionary can help us verify that we have extracted the right codes for the clinical variables we want to obtain. To do that, we can use the D_items table, filtering it based on the itemids selected in the previous query. To summarize: deep-learning models are powerful algorithms for automatically extracting relevant features. Electronic health records include very diverse information. We need to robustly encode and represent these data, which come from lab and vital-sign time series, and aggregate them in order to express them as regular time series. To do that, we need to keep in mind that the interventions and the outcomes need to be clinically meaningful, and the resulting frameworks need to be reproducible and extensible.
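The union and dictionary check can be sketched as below. The event rows, itemids, and labels are hypothetical; note also that in MIMIC-III lab itemids are described in a separate D_LABITEMS dictionary, so a single toy dictionary table stands in for both here.

```python
import pandas as pd

# Hypothetical extracted events from the two source tables, already
# filtered to the itemids of interest and to the selected cohort.
chart_events = pd.DataFrame({
    "icustay_id": [1], "itemid": [220045],
    "valuenum": [88.0], "valueuom": ["bpm"],
})
lab_events = pd.DataFrame({
    "icustay_id": [1], "itemid": [50809],
    "valuenum": [120.0], "valueuom": ["mg/dL"],
})

# Take the union of chart and lab events into a single event stream.
events = pd.concat([chart_events, lab_events], ignore_index=True)

# Verify the selected itemids against a toy dictionary table; any
# unexpected itemid would surface as a missing label after the join.
d_items = pd.DataFrame({
    "itemid": [220045, 50809],
    "label": ["Heart Rate", "Glucose"],
})
labelled = events.merge(d_items, on="itemid", how="left")
print(labelled["label"].tolist())
```

A left join is used deliberately: if we had extracted a wrong code, its label would come back as NaN instead of the row being dropped, making the mistake visible.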