[MUSIC] Hi, this is Dave Shade. Today we're going to talk about elements of data assembly: creating data sets for sharing, the steps necessary for that, and some other associated topics that go along with data set assembly and distribution. Some of the topics we'll cover include data freezes and the related topic of data locking, data cleaning, data de-identification, data sharing, and some standards associated with data assembly and sharing. To start, we'll talk a little bit about data freezes and data locks. A data freeze is often used to refer to a saved dataset reflecting the state of the data at a particular point in time. Most clinical trials will include many data freezes over the course of the study. Some purposes of creating a data freeze are internal quality assurance and internal monitoring, as well as the preparation of reports for entities such as the DSMB, or sometimes elements of a report for a funder, a sponsor, or other such entities. Related to, but slightly different from, a data freeze is a data lock. In most cases a data lock refers to the final data freeze, or the final frozen dataset, and might imply that there will never be other changes to the data underlying the trial. Sometimes people will use the term data lock to describe the final frozen data set used for a particular purpose, such as a publication, even if the study is ongoing. Therefore there could be more than one data lock associated with a study, but only one data lock associated with the use of the study's data for a particular purpose. I will point out that the terms data freeze and data lock are widely used, but they're not universal, and not everyone uses them in exactly the same way. So when discussing anything associated with the creation of data sets for analysis, it's always best to clarify and define the terms, just to make sure there are no misunderstandings.
Whether we talk about a data freeze or a data lock, both will involve some step, or sequence of steps, needed to assemble the data and to merge various data sets, and we'll talk about the dataset assembly process that would underlie either a data freeze or a data lock. One tip I have learned over the years that is particularly helpful is to consider adding a variable to every data file that records the date and time the dataset was frozen, attached to every record. Adding that variable to every observation in every data file within a study dataset documents the date and time of the data freeze. That can be helpful when people are using the data freeze, or elements of it, to make sure, among other things, that elements of two or more different data freezes are not inadvertently combined together. Once the data freeze is compiled and assembled, which we'll talk about, and that freeze variable has been added to all the records, the final files that would be used for analysis should be archived in some clear way. I do encourage folks to add the date and time of the data freeze to the name of each file, in addition to embedding it within the file in that new variable, so the file can be readily identified. Data freezes should include both the raw files as exported directly from the data management system, or, if using a smaller data management approach such as Excel, the raw Excel file or files, and additionally all the analysis files that have been prepared from the assembly process. Both kinds of files should be archived in the data freeze, and it's typical, although certainly not required, to combine all the files in a zip file. The process of creating an analysis data file, or set of data files, from a trial data management system can be complex.
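As a quick sketch of that tip, here is one way the freeze date and time might be stamped onto every record and into every file name. This is a minimal illustration in Python with pandas; the file names, the example data, and the variable name `freeze_datetime` are all assumptions for the example, not a prescribed convention.

```python
from datetime import datetime, timezone
import pandas as pd

# Hypothetical study files; names and contents are illustrative only.
files = {
    "enrollment": pd.DataFrame({"participant_id": ["001", "002"], "age": [54, 61]}),
    "labs": pd.DataFrame({"participant_id": ["001", "002"], "hgb": [13.2, 11.8]}),
}

# One timestamp for the entire freeze, applied uniformly to every file.
freeze_time = datetime(2024, 3, 15, 9, 30, tzinfo=timezone.utc)
stamp = freeze_time.strftime("%Y-%m-%d_%H%M")

for name, df in files.items():
    # Embed the freeze date/time in every record of every file...
    df["freeze_datetime"] = stamp
    # ...and also in the file name, so the file is readily identifiable.
    df.to_csv(f"{name}_freeze_{stamp}.csv", index=False)
```

Because the same timestamp appears in every record, files from two different freezes can never be silently mixed without the mismatch being visible in the data themselves.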
Some data management systems might carry out the series of steps automatically with a single click of a button, but in many projects there will be a series of steps that have to be undertaken by study staff. Perhaps the first step is to figure out how to get the data, in its raw form, out of the data management system; that might involve some form of a download or an export, or clicking a button that allows for the download of the data. It might also involve separate steps to gather and incorporate external data coming from separate sources. There was more discussion of this in the data management module of this course, but those external files might reside in a different location than the main study data, so it's necessary to think about whether they will be part of the data freeze. Generally speaking they should be, if possible, and so that has to be accounted for. In some data management systems it might be necessary to identify and remove records that should not be included for analysis. For example, some data management systems might include multiple copies of an individual record if that record has been edited successively over time. Say a data record is originally stored, and then a study site identifies that a mistake was made during the data entry phase. They go to the data system to correct that mistake, and many data systems will retain both the original and the new corrected copy of the record. The copies will be marked in some way to identify which record is the current one for that single observation, but some data systems will download or export all of the records. Keeping all of those records together is actually a requirement for good auditing practice. They don't have to be downloaded or exported every time, but they do need to remain available in order to conduct a full audit of the study database.
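To make that concrete, here is a minimal Python/pandas sketch of removing superseded copies from an export. The `is_current` flag and the column names are assumptions about how one particular system might mark the active copy of each record; real systems mark this in different ways.

```python
import pandas as pd

# Hypothetical export in which edited records are retained alongside their
# corrections; "is_current" marks the active copy (an assumed convention).
export = pd.DataFrame({
    "participant_id": ["001", "001", "002"],
    "visit_id":       ["V1",  "V1",  "V1"],
    "weight_kg":      [82.0,  80.5,  71.3],
    "is_current":     [False, True,  True],   # the first row was later corrected
})

# The full export is archived as-is to support auditing; for analysis we
# keep only the current copy of each record.
analysis = (export[export["is_current"]]
            .drop(columns="is_current")
            .reset_index(drop=True))
```

The key point is that the filtering happens during assembly, while the unfiltered export is preserved separately for the audit trail.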
A similar set of rules might apply to records that have been deleted. Deleted records might be removed and not included in a download or an export, or they might be included in the download or export but marked as having been deleted. If the records for edits or deletions remain in the exported or downloaded data, they should be removed during the dataset assembly process. Typically these data freeze files are going to be handed over to an analyst or a statistician, and unless there's been some communication to the contrary, records that do not reflect the current state of the data would typically be removed before the files are given to the analyst. Some data systems might include different, separate files for different versions of a particular data collection instrument that was modified during the course of the study. Therefore, there might be a need to merge the data from those different versions of the data collection instrument, if the data were not originally stored together in a single file. Again, this is an area where communication between the data managers and the analysts, if they're separate people, can work out the proper procedure; but it's also an area that is sometimes miscommunicated, or not communicated at all, and can lead to confusion. During the data assembly phase, there might be a need to standardize and structure some of the data elements, maybe many of the data elements. For example, a data management system that uses standard web browsers to collect and enter data is often going to send those data to the storage server in a pure text format. The data may appear to contain information about, for example, numbers or dates, but their raw form is really just text. Data analysts generally prefer having their numeric and date information in actual variables that are structured as dates and numbers.
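As an illustration of both points, here is a hedged Python/pandas sketch that stacks two versions of the same instrument and drops records flagged as deleted. The two version files, the `pain_location` field added in the second version, and the `deleted` flag are all assumptions made for the example.

```python
import pandas as pd

# Hypothetical exports: two versions of one data collection instrument,
# each carrying a deletion flag (all field names here are assumptions).
v1 = pd.DataFrame({"participant_id": ["001"],
                   "pain_score": [3],
                   "deleted": [False]})
v2 = pd.DataFrame({"participant_id": ["002", "003"],
                   "pain_score": [5, 4],
                   "pain_location": ["knee", "hip"],  # field added in version 2
                   "deleted": [False, True]})

# Stack the versions; fields absent from version 1 become missing values.
combined = pd.concat([v1, v2], ignore_index=True)

# Drop records marked as deleted, then drop the administrative flag.
combined = (combined[~combined["deleted"]]
            .drop(columns="deleted")
            .reset_index(drop=True))
```

Whether fields missing from the older version should stay missing, or be back-filled somehow, is exactly the kind of decision the data managers and analysts need to communicate about.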
And so there might be a need to transform that text information into the proper format. In our group, we also undertake a process to standardize certain key fields such as participant ID, participant site, visit ID, visit date, and data collection instrument revision date. Because we often collect data using web browsers, sometimes those variables can be stored with different lengths. Some analysis software will set the size of a variable based on the first, or the first few, observations it encounters when loading a dataset. If you haven't standardized the length of those variables, it's possible that later observations could have that variable truncated, and that can be a significant problem. For example, say some of the sites in your study have a three-letter identifier, other sites have a six-letter identifier, and you haven't properly set the length of that variable to six characters. When your analysis program, for example SAS, encounters the data set the first time, it will set the size of that variable from the first observation, which might be three characters, and truncate that variable in other observations. That could lead to duplicates, with two values appearing to be the same site while not actually representing the same site, or it could cause other problems. Likewise, it's important to standardize the length of variables that will be used to merge files, because some programs will not merge correctly if variables contain the same information but are of different sizes. So in our group we always make a point of standardizing the length of key variables that are likely to be used for merging different data sources. There are other steps in the assembly process. Sometimes we remove administrative variables. For example, in our systems, data are always stored with an associated timestamp, but we usually remove the timestamp when preparing an analysis dataset.
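A minimal Python/pandas sketch of those two steps follows: converting text fields into typed values, and padding a merge key to a fixed width. The column names and the six-character site width are assumptions for illustration.

```python
import pandas as pd

# Hypothetical raw export in which everything arrived as text.
raw = pd.DataFrame({
    "site_id":    ["ABC", "ABCDEF"],          # mixed 3- and 6-character sites
    "visit_date": ["2024-01-05", "2024-02-10"],
    "age":        ["54", "61"],
})

clean = raw.copy()
clean["visit_date"] = pd.to_datetime(clean["visit_date"])  # text -> date
clean["age"] = pd.to_numeric(clean["age"])                 # text -> number

# Pad the key used for merging to a fixed width, so software that sizes a
# variable from its first observation cannot truncate later values.
clean["site_id"] = clean["site_id"].str.ljust(6)
```

With every `site_id` padded to the same length, a merge on that key behaves the same regardless of which observation a program happens to read first.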
Likewise, we often store information about the version of the browser that was used to supply the information to our server, and since those details are not necessary for the analysts, we remove them just to make the files a little smaller and easier to manage. Sometimes there will be anomalies identified in the data, perhaps from an earlier analysis, from some problem report, or just from an observation, that require a change to the data that can't be, or has not been, accomplished in the data management system itself. We typically refer to these as manual edits, and our assembly program will often include some, hopefully small, number of these manual edits during the assembly process. If we're working on a study that has a convention of using certain special values to convey particular pieces of information, we often will handle those during the dataset assembly stage as well. A common example: some projects might like to use, say, the number 999 to represent a value that is missing. I will add that I'm not a fan of this particular approach. Most analysts have at least once in their life encountered a time when they computed, for example, the average age of participants in their study, and the average age was 342 years because there was one entry of 999 that was not properly removed before computing the average. This is why we recommend handling special values during the assembly stage, where there's often more standardization, as opposed to the analysis stage. But if given a choice, I often will not work with those kinds of special values to begin with, because of the risk of retaining them and using the retained values in some form of analysis. This is also the stage at which we would add that variable I discussed earlier that keeps track of the date and time of the freeze itself. Then, when all the other steps are completed, some step is undertaken to save the data sets in one or more selected formats that will then be used for analysis.
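Here is a hedged Python/pandas sketch of that recoding, using the 999-for-missing convention and a throwaway administrative column as assumptions; it also shows the average-age pitfall the recoding prevents.

```python
import numpy as np
import pandas as pd

# Hypothetical file that uses 999 as a "missing" sentinel (an assumed
# convention) and carries an administrative variable the analysts won't need.
df = pd.DataFrame({
    "participant_id":  ["001", "002", "003"],
    "age":             [54, 999, 61],
    "browser_version": ["Firefox 121", "Chrome 120", "Safari 17"],
})

# Naively averaging with the sentinel retained gives a nonsense value,
# roughly (54 + 999 + 61) / 3, about 371 "years".
naive_mean = df["age"].mean()

# Recode the sentinel to a true missing value during assembly, so it can
# never leak into a downstream analysis, and drop the administrative column.
df["age"] = df["age"].replace(999, np.nan)
df = df.drop(columns="browser_version")

mean_age = df["age"].mean()   # the missing value is skipped: (54 + 61) / 2 = 57.5
```

Doing the recoding once, centrally, during assembly means no individual analyst has to remember the sentinel convention later.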
For example, if the analysts are hoping to get SAS files, then we might save the data sets at this point in SAS format. CSV, which stands for comma-separated values, is another very common data transport format that is relatively universal. And of course, it's an important and good practice to archive, and save forever, all data freezes that are used, especially those used to complete some external task, for example a data freeze that supports a publication, a communication with a regulatory agency, or a DSMB interim monitoring report. Those data freezes should be preserved basically forever. [MUSIC]