This segment is about de-identifying your date variables. We breezed through the issue of date shifting in the previous video when we discussed how to de-identify your data sets by removing the 18 HIPAA identifiers. Dates are identifiers, but they are also often the trickiest variables to handle in the de-identification process. In this segment we're going to cover a few popular approaches to removing identifying information from your date variables. I'm going to cover four different date management techniques that I've seen used for date sharing by researchers and statisticians. These are keeping year information only, shifting dates by random intervals, changing your dates to time intervals, and changing your dates to age-at-event variables. I'm sure this doesn't cover the whole universe of options. So if you use different techniques, or particular tools to accomplish this task, please share them on the forums, and we might include them in the next iteration of this course. Here's the slide from our last lecture to review what is considered a date identifier: all elements of dates, except year, for dates that are directly related to an individual, including birth date, admission date, discharge date, and death date. Then there's a second part to this definition, for people 90 years of age and older. Just surviving long enough to make it into this age group can be identifying. So the definition also states that all ages over 89, and all elements of dates, including year, that indicate ages over 89, are identifiers, unless you aggregate those ages and date elements into a single category of age 90 or older. Here's a very basic example that we're going to use to look at some of these methods. This is record ID number 4. It contains four date events: birth on January 1, 1970, study enrollment on March 10, 2013, first visit on March 20, 2013, and finally study completion on August 23, 2013.
In this first approach, you just keep the year information from each date and discard the month and day values. This is very easy to implement, and year-only values are explicitly excluded from the HIPAA definition of a date identifier. It's also easy to group all years that indicate an age of 90 and above. On the downside, this involves a huge loss of information, including in many cases the order in which events took place. You might lose information on which of two drugs a patient took first, which adverse event occurred first, or the true duration of a drug regimen. In most cases, this approach provides insufficient date detail for analysis. Here's what our example looks like with this technique applied. You can see that the enrollment, first visit, and study completion all happened in 2013, but you can't tell which happened first. Next, let's look at shifting dates by random intervals. Here's how this works. For each record you choose a random integer between 0 and 364, which covers the span of a year. So your dates could be adjusted to anywhere within a one-year window, which is sort of like preserving the true year. Some groups prefer to pick negative numbers so that dates are all shifted back in time. That means if you have four records, as shown here, you'll have four random integers, one for each record. Then, for each record, shift all dates in that record by the random integer for that record. And don't forget to save your date shifting key in a separate, safe place. Don't share it by mistake, or else you'll have a data breach. Our example record is ID 4, and the random integer for that record is 6, so we're going to shift all the dates in our example record by 6 days. Here's our example again with all dates shifted by 6 days. The enrollment date was originally March 10, 2013, but in the shifted data set it becomes March 16, 2013. This approach is very popular, because it preserves the order and duration of events and it also preserves the meaning of your variables.
You don't have to create new variables in your data set to accommodate the de-identified information. Data collection software and data analysis packages often enable this type of date shifting, either out of the box or with some scripting. On the downside, you have to generate and store a secret key, and make sure you use the same key again if you need re-exported copies of your data to match. It also makes it challenging to handle data quality queries. The investigator who sees your de-identified data set might ask you to verify a date in record 4, but you can't match it against source documents without consulting the key. You'll also need to separately process any dates that indicate ages 90 and above. This type of date shifting tool is built into REDCap, to make it easy to create compliant data sets. Here's how to find it. First, go to a REDCap project you've created. Then, look in the Applications menu on the left side of the REDCap page, and click the Data Export Tool link. That will take you to the Data Export Tool page, where you'll be presented with two options. The Simple Data Export is a one-click button that takes you straight to the data download page for the full data set. In this case, you'll select Advanced Data Export instead, which lets you apply de-identification methods to your data set. The Advanced Data Export will take you to a page that looks like this, showing all of the variables in your forms or surveys. If you scroll down to the bottom of this page, you'll see a red box with data de-identification options. It looks like this. You can select any of these de-identification options to apply to the exported copy of your data set. None of this will change the real data you've collected in your REDCap project. It only applies to the exported data files. The options in the first section here will remove known identifiers.
The first one is particularly useful, as it will strip your data of any variable that you've marked as an identifier when developing your forms. The second option will let you encode the record ID. In this next section, on free-form text, you can remove any text fields and text boxes, because unless you've reviewed those fields or they were entered by your study staff, you can't be sure there's no identifying information in there. What do I mean by identifying characteristics in free text? Well, remember the introduction survey you took in REDCap? It was optional, so if you didn't take it, don't worry. But here's an example of a text field, in green, from that survey. If you were reviewing answers to a question about your experience in data management for clinical research, and somebody wrote, "currently co-teaching a Coursera class on this topic," it wouldn't be too hard to narrow down the identity of that person to just a small group of us. This type of information is hard to catch in an automated fashion unless you use advanced natural language processing techniques. So if you have any doubts and you can't review the specific information, one by one, don't hesitate to remove your text fields in the de-identification process. The final section of REDCap's de-identification options refers to date fields, and fields with dates plus times. You can remove all such fields using the first option, which is certainly a method of removing date identifiers that's even more extreme than just restricting them to the year. Or you can apply a date shifting process like the one we just looked at. REDCap will pick a secret value between 0 and 364 for each record, and shift all dates in that record by that secret value. And REDCap will use the same secret value for each record every time you export this data set, as long as you don't change the ID of that record. That's convenient in case you have to send collaborators data set updates over time.
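If you are scripting date shifting yourself rather than using a built-in tool, the procedure described in this segment (one secret offset per record, applied to every date in that record, and reused on every export) might look something like the Python sketch below. This is only an illustration under stated assumptions, not REDCap's implementation: the function names `get_shift_days` and `shift_record`, the field names, and the use of a plain dictionary as the secret key are all hypothetical. In practice the key would need to be stored securely and separately from the shared data, and any dates that indicate ages 90 and above would still need separate handling.

```python
import secrets
from datetime import date, timedelta


def get_shift_days(record_id, key):
    """Return the secret offset for a record, creating it on first use.

    `key` is a dict mapping record ID -> offset in days. It is the
    secret date shifting key and must never be shared with the data.
    """
    if record_id not in key:
        # Random integer from 0 to 364 inclusive, as in the lecture.
        key[record_id] = secrets.randbelow(365)
    return key[record_id]


def shift_record(record_id, dates, key):
    """Shift every date in one record by that record's secret offset."""
    offset = timedelta(days=get_shift_days(record_id, key))
    return {name: d + offset for name, d in dates.items()}


# The lecture's example: record ID 4 with an offset of 6 days.
shift_key = {4: 6}
record_4 = {
    "birth": date(1970, 1, 1),
    "enrollment": date(2013, 3, 10),
    "first_visit": date(2013, 3, 20),
    "completion": date(2013, 8, 23),
}
shifted_4 = shift_record(4, record_4, shift_key)
```

Because the offset is looked up in the key before a new one is generated, exporting the same record again produces the same shifted dates, and the number of days between any two events in a record is unchanged.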
Now let's look at a third option for disguising your date information. This is changing dates to time intervals and durations instead. First, you need to choose an event to serve as your baseline event. Let's say that's the date of study enrollment. That date will probably be different for every record. Now go through every record, calculate the number of days from the baseline date to every other event date, and report those intervals instead of actual dates. If you have a pair of associated dates, like hospital admission date and hospital discharge date, you can also convert that to a duration variable, like duration of hospital stay in days. Let's look at this in the context of our simple example, which has dates for birth, study enrollment, first visit, and study completion. The enrollment date becomes the baseline event for this patient. Here the enrollment date is March 10, 2013. That date is 0 days from the baseline event. First visit is March 20, 2013, which is 10 days from the baseline event on March 10. Study completion, on August 23, is 166 days after baseline. To anchor this study in time, you can also report the subject's age at enrollment, and the year of enrollment. The fourth approach is very similar. Calculate the study subject's exact age at each event: 0 years of age at birth, 43.186 years old at enrollment, and so on. Report the year of enrollment, but the year only. Now the resulting data set has only exact ages and the year of enrollment, no dates. Here's what this looks like when you report age at event instead of dates. Both the age and interval approaches are nice, because they remove all real dates, and the remaining data doesn't look like PHI. They also preserve time intervals and the order of events, and they don't require keeping a secret key like date shifting does. On the other hand, you have to create a lot of new variables, either intervals or ages, to replace your existing date variables.
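As a rough illustration of those new variables, the interval and age-at-event calculations could be sketched in Python as follows. This is a minimal sketch rather than a definitive implementation: the function names (`age_at`, `to_intervals`, `to_ages`) and field names are hypothetical, and the fractional age is assumed to be completed years plus the days since the last birthday divided by the length of that birthday year, which reproduces the 43.186 figure from the example. Ages of 90 and above would still need to be grouped separately.

```python
from datetime import date


def _birthday_in(birth, year):
    """Birthday in a given year (Feb 29 falls back to Feb 28)."""
    try:
        return birth.replace(year=year)
    except ValueError:
        return date(year, 2, 28)


def age_at(birth, event):
    """Age in years at an event, rounded to three decimals."""
    years = event.year - birth.year
    last = _birthday_in(birth, event.year)
    if last > event:  # birthday not yet reached in the event year
        years -= 1
        last = _birthday_in(birth, event.year - 1)
    nxt = _birthday_in(birth, last.year + 1)
    return round(years + (event - last).days / (nxt - last).days, 3)


def to_intervals(record, baseline="enrollment"):
    """Replace dates with days from the baseline event."""
    base = record[baseline]
    out = {
        f"{name}_day": (d - base).days
        for name, d in record.items()
        if name != "birth"
    }
    # Anchor the study in time without exposing any real date.
    out["age_at_" + baseline] = age_at(record["birth"], base)
    out[baseline + "_year"] = base.year
    return out


def to_ages(record, baseline="enrollment"):
    """Replace dates with the subject's age at each event."""
    out = {f"age_at_{name}": age_at(record["birth"], d)
           for name, d in record.items()}
    out[baseline + "_year"] = record[baseline].year
    return out


record_4 = {
    "birth": date(1970, 1, 1),
    "enrollment": date(2013, 3, 10),
    "first_visit": date(2013, 3, 20),
    "completion": date(2013, 8, 23),
}
intervals_4 = to_intervals(record_4)  # first_visit_day is 10, completion_day is 166
ages_4 = to_ages(record_4)            # age_at_enrollment is 43.186
```

Neither function needs a secret key, since every output value is derived only from differences between dates within the same record.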
The codebook that you share alongside the de-identified data set may need to be updated as a result. This process can be difficult to implement, and the resulting data can be tricky to analyze in some cases. Grouping ages 90 and above also requires some pre- or post-processing of the data with this method. This was our overview of a few different methods of de-identifying date variables in your data sets. We reviewed some of the benefits and drawbacks of preserving year data only, shifting dates by random intervals, reporting time intervals, and reporting ages at study events. Not every one of these techniques is appropriate for every analysis. Your institution or country may have different requirements for date de-identification, and your research team may have a different approach too. Your local IRB is a great resource. Don't hesitate to consult them on de-identification issues. We also took a look at how REDCap facilitates data de-identification and date shifting. Feel free to log in to datacourse.org, create a test project, and experiment with this functionality.