Dataset protections. In this lesson, we will look at how secure models really are. First, we're going to define data minimization, and then we'll identify ways to maximize the privacy and security of your datasets.

To get started, let's talk about the challenges to dataset security. The first is a pretty common one: security versus convenience. As time-crunched researchers, we tend to make a lot of mistakes in the early days of a project, passing up the right way for the easy way. This is the classic attitude of "I'll secure things later." In reality, it's a much better practice to do the hard things up front than to have to rework security as your model scales.

We also have complexity to deal with in the security realm. There are multiple datasets, multiple storage points, and multiple people accessing them. Because there are so many places where the data lives, it can be really hard to make sure each one is a secure environment. Think of all the different places and tools involved in a typical data science project: you're not only trusting your own team's security practices, you're also trusting each of those companies to secure your dataset. A good practice is to research all of the companies involved and use each tool at the highest level of security offered: unique passwords, two-factor authentication, and so on.

Here are some security basics to follow to overcome those challenges. The first is to ensure that all team members and stakeholders have a basic understanding of security and privacy. Make sure everyone is on the same page about the terminology, policies, and standards the team will use when it comes to data storage and use. Second is to enact a sound data governance structure. This means you need to have ownership and accountability for all team members as data changes hands at each step of the process. What happens when a new version of the dataset comes into use? Is the old version archived in an email somewhere, or is it actually securely erased? This is one of the biggest challenges: older datasets are more likely to be forgotten about and then accidentally leaked or breached. Finally, we need to perform threat modeling and think like an adversary. Make sure that security is built in throughout the system. If you were trying to get access to the dataset, where would you look? Remember, it's not going to be worth the trouble of building an incredibly secure model if a hacker can get access to the entire dataset just by resetting the password on AWS or any other tool. We need to make sure that we have security throughout the entire stack.

Because of the challenges in securing datasets, we can abide by an overarching principle, and that is the principle of data minimization. Data minimization states that we need to limit data collection to only what is required to fulfill a specific purpose. In machine learning, that specific purpose is an accurate and fair model. What does it actually look like in practice? It's really common for machine learning researchers to say, "We're not going to use this column in this dataset, but we should hold on to it just in case." It's never worth the privacy trade-off; drop what you don't need up front, as in the first sketch below. Beyond that, we need to delete unused datasets early and often. Remember that each unused dataset that's just hanging around your hard drive is a potential security liability. It makes sense to securely erase it before it ever becomes a threat, as in the second sketch below.
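As a concrete illustration, here is a minimal sketch of minimizing a dataset at ingestion time, assuming pandas and purely hypothetical file and column names:

```python
import pandas as pd

# Hypothetical file and column names, for illustration only.
df = pd.read_csv("users.csv")

# Keep only the features the model actually requires, plus the label.
# Everything else (names, emails, free-text notes) is dropped up front
# instead of being held on to "just in case".
REQUIRED_COLUMNS = ["age", "purchase_total", "label"]
df[REQUIRED_COLUMNS].to_csv("users_minimal.csv", index=False)
```

The point of doing this at ingestion time, rather than at training time, is that the sensitive columns never land in your working copies at all.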
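And here is a best-effort sketch of securely erasing a file from Python, using a hypothetical filename. Note that on SSDs and on journaling or copy-on-write filesystems, overwriting a file in place does not guarantee the old blocks are gone; full-disk encryption or dedicated wiping tools are stronger options.

```python
import os

def secure_erase(path: str, passes: int = 3) -> None:
    """Overwrite a file with random bytes, then delete it (best effort)."""
    size = os.path.getsize(path)
    with open(path, "r+b") as f:
        for _ in range(passes):
            f.seek(0)
            f.write(os.urandom(size))  # overwrite contents in place
            f.flush()
            os.fsync(f.fileno())       # force the bytes to disk
    os.remove(path)

# Hypothetical leftover dataset from an earlier experiment.
secure_erase("old_training_data_v1.csv")
```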
Second, we need to monitor diminishing returns in model performance. For example, if your model is performing at 95 percent accuracy with 100,000 records in the dataset and it would take one million records to get to 97 percent accuracy, consider whether that's worth the privacy trade-off. In most cases, due to diminishing returns in model performance, it is not: the privacy cost of roughly 900,000 additional sensitive records is not worth that extra two-point bump in performance. (A sketch of how to measure this appears at the end of this section.)

Another thing to recognize is that governments are starting to realize the necessity of these practices as well. The European Union's privacy law, the GDPR, states that personal data shall be "adequate, relevant and limited to what is necessary in relation to the purposes for which they are processed." This is really a sign of what's to come. The key thing here is the subjective language about what is or isn't necessary when it comes to personally identifying data. For a machine learning algorithm that usually requires thousands of data points to be effective, it's going to be an uphill battle to prove in court that only the necessary data was collected and used. All of these considerations are worth studying before we get into the next realm of privacy.
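As promised, here is a minimal sketch of watching for diminishing returns with a learning curve, using scikit-learn and a synthetic dataset as a stand-in for real data; the model and numbers are illustrative, not from the lesson:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import learning_curve

# Synthetic stand-in for a real (sensitive) dataset.
X, y = make_classification(n_samples=20_000, n_features=20, random_state=0)

# Cross-validated accuracy at increasing training-set sizes.
sizes, _, test_scores = learning_curve(
    LogisticRegression(max_iter=1000),
    X, y,
    train_sizes=np.linspace(0.1, 1.0, 5),
    cv=5,
    scoring="accuracy",
)

for n, score in zip(sizes, test_scores.mean(axis=1)):
    print(f"{n:>6} records -> {score:.3f} mean accuracy")

# If the curve has already flattened, collecting more sensitive records
# buys very little performance for a real privacy cost.
```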