In this video, I want to talk about the data science workflow. Before diving into that, let me remind you of what we just spoke about, which is the scientific method. That was a four-step process where we begin by clearly articulating a specific or precise question. We then guess an answer or answers, that is, we hypothesize the answers to that question. We then identify the empirical implications of those different hypotheses or guesses. That is, if those hypotheses or guesses are correct, what should we see in the data? Then the last step is to compare those implications with what's actually in the data. In this video, I want to focus on that last step because there's an actual process for comparing those implications with what we see in the data, and that process is referred to as the data science workflow.
Now, as always, we start by googling "data science workflow" and you get, in this case, over 95 million hits. Perhaps more interestingly, there's a whole bunch of different images describing the data science workflow, some quite colorful, as you can see on the screen. While there are many different ways to describe this workflow, and many different labels for the different steps, I like to keep things very simple and intuitive. Let me describe the data science workflow in terms of the following four steps.
The first step is acquisition and verification. First, you are going to get the data, you've got to acquire the data. That could be a very simple process, as easy as grabbing a spreadsheet off your local computer. It could be more involved, such as getting data from different parts of the organization. It could involve hooking into external APIs to download data, it could be scraping data off of the web, it could be purchasing proprietary data from different vendors. Data is everywhere and there are nearly as many ways to get it, but acquiring it is, in and of itself, a part of the process. Of course, once we have the data, we have to verify it.
Reagan used to say, "Trust, but verify." Never is that more true than when working with data. Quite often, data will come with some data dictionary or documentation. Never take that at face value. We always want to verify the data by actually looking at it, by beginning to work with it, and making sure we understand it.
Now, having acquired the data and verified that it is what we think it is, the second step is to prepare the data for analysis. That might sound like a straightforward step, but as I'll discuss in a minute, it is not. Quite often, data is in a variety of formats. We're going to have to wrangle that data, we're going to have to clean that data, we're going to have to explore that data through exploratory data analysis, or EDA, to again improve our understanding and further verify it. Maybe we'll have to go back to step 1 to get more data or different data. But preparation is, in and of itself, an important step in the data science workflow.
Once we've prepared our data, again, ensuring that we understand it, that it's correct, that it's in a format that's ready for whatever analysis we intend to undertake, we then analyze the data. Analyzing the data can be as simple as producing a summary statistic such as an average or standard deviation, or it can be as complex as running a machine learning or AI pipeline, in which we explore, train, and test a variety of different models to arrive at a final model that we'll put into production.
The last step of the data science workflow, one that's alluded to but not often considered a formal step, is what I call communication. Having gone through the process of acquiring, preparing, and analyzing the data, arguably the most important part, or at least an equally important part, is being able to communicate your results to decision-makers in a manner that is clear and compelling so that they can take action.
Quite often, there's a disconnect between steps 1, 2, and 3 and step 4 that I think does a disservice to data analytics in the workplace. It puts a ceiling on just how powerful and how useful data analytics can be, because data scientists, or whoever is working with the data, are simply unable to communicate their findings in a way that resonates with management, who may not be as well versed in working with data as data scientists, statisticians, and others may be. Those are the four steps of the data science workflow.
I put a very lowbrow little pie chart on the slide to illustrate what I think is a general breakdown of time and effort in the data science workflow. The numbers in the pie chart correspond to the numbers of the steps in the data science workflow, and you'll see that step 2, preparation, is the majority of the pie. That's certainly been my experience, and it's been the experience of virtually all of my colleagues. Preparing, cleaning, and understanding the data takes so much time. In my experience, a good 75 to 80 percent of the time, at least. Whereas the actual analysis, once the data is in a nice clean format, is relatively straightforward. It's relatively easy to push data through models. It's a whole different thing to have confidence in the data that's being pushed into the model. You can see that just the preparation takes a lot of time, so don't underestimate that. Of course, there are always exceptions. This isn't a rule, it's just a general characterization.
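To make the four steps a little more concrete, here is a minimal sketch in Python using only the standard library. Everything in it is an assumption made for illustration: the tiny CSV text, the column names, the data dictionary, and the function names acquire, verify, prepare, analyze, and communicate are all made up, and a real project would pull data from files, APIs, or vendors rather than from a string.

```python
import csv
import io
import statistics

# Hypothetical raw data standing in for a spreadsheet (values made up for illustration).
RAW_CSV = """region,units_sold
north,12
south, 7
east,
west,15
north,n/a
"""

# Hypothetical data dictionary: the columns the documentation claims the data has.
DATA_DICTIONARY = ["region", "units_sold"]


def acquire(text):
    """Step 1a: get the data (here, by parsing CSV text instead of calling an API)."""
    return list(csv.DictReader(io.StringIO(text)))


def verify(rows, expected_columns):
    """Step 1b: check the data against its documentation instead of trusting it."""
    if not rows:
        raise ValueError("no rows acquired")
    actual = list(rows[0].keys())
    if actual != expected_columns:
        raise ValueError(f"columns {actual} do not match {expected_columns}")
    return rows


def prepare(rows):
    """Step 2: clean the data, keeping only rows with a usable whole-number value."""
    clean = []
    for row in rows:
        value = (row["units_sold"] or "").strip()
        if value.isdigit():
            clean.append({"region": row["region"].strip(), "units_sold": int(value)})
    return clean


def analyze(rows):
    """Step 3: simple summary statistics, the mean and sample standard deviation."""
    values = [row["units_sold"] for row in rows]
    return {
        "n": len(values),
        "mean": statistics.mean(values),
        "stdev": statistics.stdev(values),
    }


def communicate(summary, dropped):
    """Step 4: state the result in plain language for a decision-maker."""
    return (
        f"Across {summary['n']} usable records, average units sold was "
        f"{summary['mean']:.1f} (std. dev. {summary['stdev']:.1f}); "
        f"{dropped} records were dropped during cleaning."
    )


rows = verify(acquire(RAW_CSV), DATA_DICTIONARY)
clean = prepare(rows)
summary = analyze(clean)
report = communicate(summary, dropped=len(rows) - len(clean))
print(report)
```

Notice that even in this toy version, most of the code goes into verifying and preparing the data, while the analysis itself is a couple of lines. Reporting how many records were dropped is one small way to carry what you learned in preparation through to the communication step.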