Welcome back to the Computer Forensics Path, course 5, Module 2. In this module we're going to discuss big data and some of the challenges we're going to face when we collect it. With big data, we definitely need to talk about the three V's; you're going to hear this term a lot. We're talking about volume, the amount of data; velocity, the speed at which the data is moving; and variety, because the data comes in many shapes and sizes. Most of the time it's going to be unstructured, structured in many different ways, or multi-structured. This data is going to be too large for the standard tools and techniques we normally use in forensics.

Why do we care about big data? Why is it going to be important to our investigations? Well, it is going to be part of a lot of them; big data will play a role. It is especially important in incident forensics, because big data may contain the source of our breach, so we need to find it. It can be buried under a lot of other things that are irrelevant, and sorting through that is going to be the key: we're looking for patterns of data hidden under piles of irrelevant information. Big data is also of great value to businesses and to cybersecurity. In business, marketing and sales rely on this data, all of it collected about us constantly and moving across the Internet at high speed. It's important to cybersecurity because, as we said, it can reveal the sources of breaches. Most of the data handled in the business world is unstructured, and we need to remember that. It could also be vendor proprietary, proprietary to just the particular entity or company you're looking at. These things are going to play a big role in how we look at the data.

The methodology is pretty much the same one we use in any other type of forensic investigation. First, we need to identify our source of evidence: where is this big data? Once we've identified it, we need to collect it, and then we need to acquire it, which means taking some type of forensic copy. With big data, you're going to be doing a lot of live acquisitions; we'll get more into that throughout this path. Next is preservation: like any other evidence we've collected, we need to preserve it as best we can. Then we need to analyze it. Again, this is where we're going to have challenges, because the data is too big for our traditional tools and for how we typically examine things; it's going to be unstructured, possibly vendor proprietary, and it can be multi-structured. Then we write a report, and of course after that we should have peer review and quality control. We'll see a small sketch of the acquire-and-preserve steps in a moment.

How do we identify it? Where are we going to find the data? Well, you'll find it on servers. Servers contain multiple hard drives, and those drives can be configured in what's called a RAID, which means you have multiple disks that may hold duplicate data. The RAID may be striped, so parts of the data could be spread out over several disks. RAIDs do present their own set of challenges.
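To make those acquire-and-preserve steps a little more concrete, here is a minimal sketch of a logical, targeted acquisition with a hash manifest. The paths, the file pattern, and the helper names are assumptions for illustration, not any specific tool's workflow.

```python
# Minimal sketch (not a production tool): logically acquire a targeted set of
# files and record SHA-256 hashes so the copies can be verified later.
# All paths and the "*.log" pattern are hypothetical examples.
import hashlib
import shutil
from pathlib import Path

def sha256_of(path: Path) -> str:
    """Hash a file in chunks so very large files don't exhaust memory."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1024 * 1024), b""):
            digest.update(chunk)
    return digest.hexdigest()

def acquire(source_dir: str, dest_dir: str, pattern: str = "*.log") -> None:
    """Copy only the targeted files and write a hash manifest alongside them."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    with (dest / "manifest.txt").open("w") as manifest:
        for src in Path(source_dir).rglob(pattern):
            copy = dest / src.name
            shutil.copy2(src, copy)  # copy2 keeps timestamps where the OS allows
            manifest.write(f"{sha256_of(copy)}  {src}\n")

# Example call (hypothetical paths):
# acquire("/mnt/evidence/webserver/logs", "/cases/2024-001/acquired")
```

The point of the manifest is simply that anyone reviewing the case later can re-hash the copies and confirm they still match what was acquired.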
We could have a SAN, a Storage Area Network, which is a network of servers containing large-scale data. On something like that, you're going to need to do some type of live collection. We can have large-scale databases; databases now come in many shapes and sizes, and they are growing in size. We may have distributed file systems. A lot of people use distributed file systems and distributed processing because of their power and speed, and that means your data could be spread out across multiple servers and systems. You're also going to have associated, specialized applications to deal with that are not our traditional file systems.

Some of the challenges we're going to have: first, a lack of expertise. There are not a lot of people who are experts in big data collection and analysis; it takes specialized training beyond the scope of this path, so there's a lack of experience and a lack of experts. It is also extremely time-consuming, because you're dealing with a very large amount of data that you have to sift through. It's going to be difficult to understand, because it's not going to be our traditional NTFS, EXT, or HFS file systems; a lot of times it's going to be something unstructured. We're going to be dealing with large-scale storage devices, which again are these servers and RAIDs, and a lot of this data is going to need to be collected live. Then we have to deal with IoT, the Internet of Things, which means we can have many different devices, and even different types of devices, all intertwined into the same system. You also have to deal with cloud computing. A lot of your data is not going to be on a local server; it's going to be on some server farm out in the cloud, and you're going to have to rely on Amazon or whoever holds that cloud to give you the data you want to sift through. Social media: there's a ton of data out there on social media, and again you're going to have to rely on somebody else to give you that data, because Facebook isn't going to let you live-image their servers. You may be dealing with thousands of storage drives, and with cloud computing and social media your servers may be located in different jurisdictions, which in itself can give you even more challenges.

Another challenge with big data is that the systems cannot be shut down. We've talked about that: a lot of times you can't shut the systems down because you could lose data if you do, the servers may not come back up, and you can't shut down a legitimate business. That's going to require a logical acquisition of the targeted data, so you're going to need to talk to somebody who knows the network. You're going to have to talk to a network admin to figure out where the data you want lives; it could even be on a virtual machine located on a server. You're going to need to find that information out so you can target the data you need with a logical acquisition. We will cover more about this throughout the path.

Now, some of the tools that are specialized for big data. We have Apache Hadoop, which is an open-source framework that stores and processes large data sets. We also have the traditional tools: FTK, the Forensic Toolkit; Cellebrite; X-Ways; KAPE, which is a free tool; EnCase; and Autopsy, which is also free. But a lot of these tools are going to have a hard time with extremely large data sources.
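To give a feel for why a framework like Hadoop scales where desktop tools struggle, here is a minimal sketch of a mapper written in the style used with Hadoop Streaming, which feeds input line by line on standard input and expects tab-separated key/value pairs on standard output. The log layout (client IP as the first field) is an assumption for illustration, not a requirement of Hadoop.

```python
#!/usr/bin/env python3
# Minimal sketch of a Hadoop Streaming-style mapper: reads raw log lines from
# stdin and emits "ip<TAB>1" for each hit, so a reducer can total hits per
# source address across a data set far too large for a single workstation.
import sys

def main() -> None:
    for line in sys.stdin:
        fields = line.split()
        if not fields:
            continue
        source_ip = fields[0]  # assumed: first whitespace-separated field is the client IP
        print(f"{source_ip}\t1")

if __name__ == "__main__":
    main()
```

Locally you could sanity-check it with something like `cat access.log | python3 mapper.py | sort | uniq -c` before running it across a cluster.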
Hadoop, on the other hand, is built to handle large data sources, and KAPE would do a pretty good job on a large data source and a live acquisition. But with Cellebrite, X-Ways, or FTK, you're going to have a hard time examining these large data sets. To use your traditional tools, you're going to have to narrow down your data set to something they can process; there's a small sketch of that kind of narrowing step at the end of this section. In our next section, we are going to talk about acquiring data, how we would do an acquisition, and collecting digital evidence.
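As a final illustration of that narrowing-down step, here is a minimal sketch that filters a large, unstructured log collection against a short list of indicators so the much smaller result can be loaded into a traditional forensic tool. The indicator strings and paths are hypothetical examples, not drawn from any specific case.

```python
# Minimal triage sketch: keep only the log lines that match known indicators,
# producing a reduced data set a traditional tool can handle.
from pathlib import Path

INDICATORS = ("203.0.113.7", "powershell -enc", "wget http")  # hypothetical indicators

def triage(source_dir: str, output_file: str) -> int:
    """Scan every .log file under source_dir and keep only matching lines."""
    kept = 0
    with open(output_file, "w", errors="replace") as out:
        for log in Path(source_dir).rglob("*.log"):
            with log.open("r", errors="replace") as f:
                for line in f:
                    if any(ind in line for ind in INDICATORS):
                        out.write(f"{log}: {line}")
                        kept += 1
    return kept

# Example call (hypothetical paths):
# triage("/cases/2024-001/acquired", "/cases/2024-001/hits.txt")
```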