Hi. In this module I'm going to talk about some of the ethical or professional issues that come up when we engage in data sharing, an increasingly common practice in social science research. Sharing of data is steadily becoming more common for a variety of reasons. Funding agencies typically require a plan for data sharing following the completion of a research project before they will fund the project. Some journals also now require, or at least strongly encourage, the sharing of the data used in an analysis whose results are reported in that journal. Collaborators, of course, routinely share data with each other. And it's important to note that sharing data in all of these contexts requires a lot of planning and thought. It's a lot more difficult than just handing somebody a USB stick, e-mailing them an attachment, or posting something on the web and saying 'go to it'.
Now, publicly releasing data is increasingly common in social science research. As I mentioned, funding agencies may require a plan for data sharing following the completion of a research project. In other cases they fund data collection directly: in many countries funding agencies support large data collection efforts, such as the longitudinal surveys that we talked about in previous lectures. These are directly funded, and their specific intent is to produce data that are publicly accessible. Now, the public release of data requires extensive preparation, well beyond the work required to collect the data in the first place. The data typically need to be cleaned. Even if we've been working with a data set for quite a while, it usually still has to be cleaned before it's released to the public, because there may be variables that we didn't use in our own analysis and so never checked for consistency or errors.
So all that work has to be done before the data are released to the public. There's also documentation. We may be very familiar with our own data, so we may not need documentation ourselves because we already know what each variable contains, but if we hand the data to somebody else they need documentation to understand what's going on. It may also be the case that additional variables need to be created to make the data set useful to a broader audience than ourselves or the people who originally collected the data. There may be specific variables that help people organize the data along different dimensions to prepare it for kinds of analysis that we were not interested in doing ourselves. Typically the data will also have to be anonymised. While we were working with the data ourselves and keeping it secure, it may not have been entirely anonymous, or it may have been a candidate for re-identification. Once the data are about to become public, though, a whole new process is needed to think through how re-identification, once the data are out in the wild, might lead to problems for the subjects in the data.
Finally, there's the issue of hosting. Releasing data is more complex than simply putting it up on a website and letting people download it. You'll need to find a site that's going to be stable for decades. As we discussed in previous lectures, repositories like ICPSR act as custodians for data even after the person who collected the data has retired or passed away. These are the ideal sites for hosting publicly released data. Now, failure to complete these steps may render the released data of little use for anyone else's research and limit its impact.
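To make the re-identification concern a bit more concrete, here is a minimal sketch of one common kind of check: counting how many records share each combination of potentially identifying variables, and flagging combinations held by only a few people. This is an illustration under assumptions, not a complete disclosure review. The function name `small_cells`, the variable names, the toy records, and the threshold `k` are all hypothetical and do not come from any particular data set or repository guideline.

```python
from collections import Counter


def small_cells(records, quasi_identifiers, k=5):
    """Return each combination of quasi-identifier values shared by fewer than k records."""
    counts = Counter(
        tuple(record[name] for name in quasi_identifiers) for record in records
    )
    return {combo: n for combo, n in counts.items() if n < k}


# Hypothetical toy records; the variable names are illustrative only.
respondents = [
    {"age_group": "30-39", "region": "North", "occupation": "teacher"},
    {"age_group": "30-39", "region": "North", "occupation": "teacher"},
    {"age_group": "30-39", "region": "North", "occupation": "nurse"},
    {"age_group": "70-79", "region": "South", "occupation": "judge"},
]

# Flag combinations of age group and region held by fewer than 2 respondents.
risky = small_cells(respondents, ["age_group", "region"], k=2)
print(risky)
```

In this toy example the single respondent aged 70-79 in the South is flagged, since nobody else shares that combination. Adding more variables to the list, such as occupation, generally flags more records, which is one reason a data set that felt safe for internal use can need further coarsening or suppression before public release.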
Now, there's another issue. As I said, journals increasingly require, or at least encourage, authors to share their data following the publication of a paper. For publicly accessible data this may be as simple as releasing code that will help reproduce the results. So, if a paper analyzes, say, census data that anybody can download from the web through a browser, it may not be necessary to pass along the entire data set; it may be enough to pass along the code that was used to transform and analyze the downloaded data. For proprietary or restricted data, issues related to ownership can arise. This can get a bit tricky. Quite often when we use proprietary or restricted data we have signed an agreement that limits our ability to share the data in turn or pass it on to others. That is something that has to be negotiated on a case-by-case basis. For sensitive data, with anything that might be put into the public domain or made available on a journal website, it's going to be crucial to prevent re-identification. Again, as we discussed for the public release of datasets, for anything that goes out into the wild and becomes widely available we have to think about the risks of re-identification that we talked about in previous modules. Now, even for something as simple as disseminating a data set in conjunction with a publication, documenting the data and the accompanying code may be time consuming, perhaps taking as much time as, or more than, the original preparation of the paper. So this is something you have to think about when you're writing a paper and preparing it for submission. Leave some time, especially if the journal that you're planning to send it to requires data sharing.
Now, some of the more complex issues that come up with data sharing involve collaboration. For proprietary and restricted data, the lead researcher may need to obtain permission to share the data with their collaborators. Take some of the major datasets that I've talked about in previous lectures, the big longitudinal data sets such as Add Health, which contain a lot of sensitive data. Somebody may sign an agreement that gives them access to such a data set. But then, if they want to allow collaborators to work with it, those collaborators may also need to sign an agreement, or the original agreement may have included a special provision allowing students to make use of the data. So pay attention to that if you're given access to, or are seeking access to, these sorts of datasets. Now, all collaborators need to respect whatever requirements exist for keeping the data secure. If you're sharing data, or somebody is sharing data with you, and it's proprietary or otherwise restricted, you need to follow all of the rules about keeping it secure: you may have to keep it on an encrypted hard drive or a secure computer, or you may only be allowed to analyze it in a data enclave with no internet connection. There are different rules for different situations, but you have to obey them. And one issue that can come up is that collaborators need to know clearly whether they can continue to work with the data on their own after they've completed the collaborative project that gave them access to the data in the first place, and whether they can use data that is shared with them for their own independent side projects.
So, we sometimes have misunderstandings where somebody shares data with a student for the specific purpose of working on a particular paper, and then the student, perhaps not with any malicious intent but just as a result of a misunderstanding, keeps the data on their hard drive after that project is complete and starts to conduct other research with it. Now, in many cases that may have been part of the plan from the beginning, but all of that has to be made clear through communication between the people sharing the data with each other. If students or other collaborators are allowed to continue to use data that's been shared with them in their own work, they need to know for how long, whether they can share it with others in turn, and whether they need to inform whoever shared the data with them about what they're working on. These are all issues that won't cause a problem if they're sorted out in advance through communication among all the parties involved, but if they're not discussed beforehand they can lead to misunderstanding. Now, similar understandings may be required for the sharing of code. As I mentioned, there are publicly accessible data sets where you can get the original data directly from a website, and people then write code to transform the data, create variables, and so forth. If you're sharing a program, or somebody shares a program with you, that carries out these sorts of transformations, you have to have a clear understanding of the circumstances under which the program can be used. Can you go on to use it in your own work? How do you credit the person who shared the program with you? Or, if you are asked to create the program for your mentor or your professor, are they allowed to share it with other people?
Again, these are all issues that can be dealt with, and they're best dealt with by communication at the beginning, before things get out of hand. Now, if you're making use of publicly released data or data that's been shared with you, you also have to think about the issue of publication credit. Publications that make use of public data should cite the dataset as well as any documentation that was available for it. If the public release was funded by a government agency or a private foundation, then the number and title of the grant should also be cited. Quite often agencies use statistics on publications to decide whether their support for the creation or release of a dataset was worthwhile. Contribution of proprietary data may also be recognized with co-authorship in certain specific situations, where the individuals who produced the data have shared it and the production of that data represented a real intellectual contribution. It's less common to offer co-authorship to somebody who was simply doing their job when they granted access to a dataset. So if somebody just happens to be working at a company or a government agency and they are the one who signs the approval that grants access to an existing data set, we typically don't offer them co-authorship. Co-authorship for data production or data sharing is offered because the individual was the intellectual creator, the producer of the dataset.
So, as you move forward in your careers, you'll see that data sharing is becoming more and more common. It's been a big boon for social science research. In previous lectures we talked about some of the amazing datasets that are out there. But, as data sharing has become more common, some issues have emerged, the ones I've talked about here, that you have to think about when you're planning your own research and writing the resulting publications.