But let's look at a real-world problem that I had to solve when I was building a social network from scratch several years ago. One of the core questions with a social network is: how do you actually get users to join? One way to do that is what we now call influencer marketing; it didn't used to have a name. You partner with people who are already on an existing social network, and you get a person with a large number of followers to post to your platform. To figure out who to partner with, you first need to learn which people would, if you partnered with them, actually drive the right signal to your platform.

This is something that I built from scratch. I would go to a third-party system and get a list of all of the people in sports in the United States, and around the world as well, including European soccer players like Ronaldo and other famous players. In this example, it's Joe Montana, and I have the information that he played for the 49ers, and I put this into a database. Then I have a nightly job that runs inside a jobs framework that I built. This jobs framework can run something periodically, and like most machine learning or data engineering jobs frameworks, you tell it to run at a certain interval, and you also tell it the data sources and the data output.

What I did was create a Mechanical Turk job as a way of cleaning up the data. I would effectively ask people around the world to find the social media handles, then I would take 10 of the results and figure out whether most of the 10 agreed on who Joe Montana's social media handle was; if they did, I would create the full record. Here's an example with LeBron James: it says LeBron James, Twitter handle @kingjames, Wikipedia handle King James. Then it would go through and put it into the database. Then I would go back to this nightly job, and now I could do more interesting things. So first there's a cycle here, where I would get a rough record that isn't fully complete, and I would need to augment it with a way to clean up the data. That was what the first stage of the data engineering pipeline did.

Next, once I knew what the real social media handles were, I could create a second job for social media metadata. This would go through and query Twitter, Facebook, and Instagram, and it would find metadata about that particular user. The things I would look for were: how many followers do they have? What is their engagement ratio? Then I would take that information and use it to build a machine learning model. How would I do this? Well, I would look at the platform that I built and at how much traffic was actually being generated. I would then look at the signals on the third-party platform and use them to predict how many page views would occur based on the users on my platform. What we saw here is that we were able to have a machine learning model that would predict the page views per post. One of the things that happened was that we were able to identify the right people to partner with. Here's a good example: using those same signals, we were able to partner with a famous NFL quarterback.
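A rough sketch of what that prediction step could look like in Python is below. The exact features, numbers, and model aren't spelled out in this walkthrough, so the follower-count and engagement-ratio inputs, the sample data, and the linear model here are illustrative assumptions rather than the original system's code.

```python
# Sketch of the prediction step: given social-media signals for a user
# (follower count, engagement ratio), predict page views per post on our
# platform. Feature set, sample numbers, and model choice are illustrative.
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical training data: rows are users we already partnered with.
# Columns: [follower_count, engagement_ratio]
X_train = np.array([
    [12_000_000, 0.031],
    [850_000,    0.012],
    [40_000,     0.004],
    [2_300_000,  0.022],
])
# Observed page views per post those users generated on our platform.
y_train = np.array([480_000, 31_000, 900, 95_000])

model = LinearRegression()
model.fit(X_train, y_train)

# Score a candidate influencer before deciding whether to partner with them.
candidate = np.array([[5_000_000, 0.027]])
predicted_page_views = model.predict(candidate)[0]
print(f"Predicted page views per post: {predicted_page_views:,.0f}")
```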
Here's the logo of our company at the time. That quarterback created some posts around the NFL draft and completely saturated our system because they were so popular. From a high level, the overview is: you have Amazon Mechanical Turk, it goes through and finds the social media handles from these different platforms, and then, as you can see in this diagram, we were able to predict which are the most successful users. The actuals were that we were able to get exponential growth of traffic on our platform. This is just a good example of the iterative process it takes to actually build a real-world data engineering pipeline.

But it really starts with this: you have to architect out from a high level what your system does, and often there will be multiple loops. There'll be the start, which gives you the raw data, and then you'll continually have to run multiple other jobs to complete the data. This could take maybe a year; I think this took almost a year to get the data pipeline completely right, just because of how difficult it is to get fully complete data. I think that's one of the big takeaways with building this kind of pipeline from scratch: the difficulty of doing ETL and identifying the right data.

What I can draw next here are some of the edge cases that I noticed in building a machine learning pipeline and doing this kind of work from scratch. What are the edge cases in building these data engineering pipelines? One that comes up a lot is that it's easy to forget that the data input is the most challenging part of the problem when you're building machine learning systems. What this means is that in the first part of building a data pipeline, if you don't have the data right, it can cause a lot of extra work on the back end. When we were first inputting the data, we used interns in our company to collect it, and they really didn't do that good of a job, because it's a really hard job. As a result, let's say 50 percent of the records were bad records.

A good example of this: there was a player in the NFL named Anthony Davis, and his NFL career wasn't hugely popular, but there's also a player in the NBA called Anthony Davis, who just a few days ago won the championship along with LeBron James. These are very different people, and they have very different social media handles. The NFL Anthony Davis had, I don't know, 10,000 followers. The NBA Anthony Davis, even back then, in 2014 or 2015, had, I think, millions of followers, so the signals would be very different. If I captured the NBA player's data and put it into the machine learning system, this would give me accurate predictions about whether we'd later want to partner with him; but if I mistakenly put in the NFL player, with only 10,000 followers, this now causes a huge problem downstream, in that my prediction accuracy is very poor. This is the kind of mistake where it looks like the data is correctly input, but it causes poor accuracy in the model downstream. It's a subtle problem, and the only way to really solve it was that the humans had to go away.
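The fix that was actually used is described next: human consensus through Mechanical Turk. Purely as an illustration of what automation can add on top of that, here is a sketch of a cheap sanity check that could flag a fused record like the Anthony Davis mix-up. All field names and thresholds below are hypothetical, not part of the original system.

```python
# Illustrative sanity check: flag records whose follower counts are wildly out
# of line with the rest of their league, a cheap signal of a possible
# identity mix-up (e.g. an NFL player stored with an NBA star's handle).
from typing import Dict, List

def flag_suspect_records(records: List[Dict]) -> List[Dict]:
    """Return records whose follower count dwarfs the league median."""
    flagged = []
    by_league: Dict[str, List[int]] = {}
    for r in records:
        by_league.setdefault(r["league"], []).append(r["followers"])
    for r in records:
        peers = by_league[r["league"]]
        median = sorted(peers)[len(peers) // 2]
        # A follower count 100x the league median deserves human review.
        if median > 0 and r["followers"] > 100 * median:
            flagged.append(r)
    return flagged

records = [
    {"name": "Anthony Davis", "league": "NFL", "followers": 4_800_000},  # wrong handle captured
    {"name": "Player B", "league": "NFL", "followers": 12_000},
    {"name": "Player C", "league": "NFL", "followers": 45_000},
]
print(flag_suspect_records(records))  # -> the suspicious Anthony Davis row
```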
When you have humans involved in building things like this, it really can cause big issues, because the work is so boring. They needed to create a repository of, let's say, 100,000 users here, and the work is so intense that there are going to be a constant series of mistakes; and if what we're trying to build is a machine learning prediction model, the result is that it's really going to have some significant problems. So what is the solution to make sure you don't introduce these kinds of errors, where you confuse records or switch things? Really, the solution here is automation.

Now I'll explain the solution. What we were able to do is use that Mechanical Turk system. The way Mechanical Turk works, and this is really a bigger trend now in terms of getting the right data, is that it's basically a mixture of humans plus technology: here are the humans, and then we also have some kind of data augmentation. In the case of Mechanical Turk, we'll say this is MT, and it becomes the fix; we'll just call this automation.

What we had to do was create a record here for each of the people. We would have to tell a very specific story to the people around the world who were answering these questions. We would say, effectively: go to Google and search for this person's name, search for X, and then find these three things: A, the Instagram handle; B, the Facebook handle; and C, the Twitter handle; then put them into the spreadsheet. We would ask 10 people, and these would be random people, to grab that data, and if eight out of the 10 people agreed on the correct social media handles, then we would accept that record. So we knew that the error rate was effectively zero if we were able to get eight out of 10 people to agree.

The fix here is that you really can't rely on just a human by themselves inputting data for a machine learning pipeline; you have to have more sophisticated tools. One of the things people are really starting to lean on is this mixture of humans inputting the data into a larger system with some kind of data augmentation, where you have a Mechanical Turk system that can hold it all and pull it into an API that's available for you.
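To make that acceptance rule concrete, here is a minimal sketch of the eight-out-of-ten consensus check. The function name and the handle normalization are illustrative, not code from the original system.

```python
# Sketch of the acceptance rule: ask 10 workers for a handle and accept it
# only if at least 8 of them agree on the same answer.
from collections import Counter
from typing import List, Optional

def accept_handle(answers: List[str], min_agreement: int = 8) -> Optional[str]:
    """Return the consensus handle, or None if agreement is too weak."""
    # Normalize casual differences in how workers type handles.
    normalized = [a.strip().lstrip("@").lower() for a in answers if a.strip()]
    if not normalized:
        return None
    handle, votes = Counter(normalized).most_common(1)[0]
    return handle if votes >= min_agreement else None

# Example: 9 of 10 workers agree, so the record is accepted.
answers = ["@KingJames", "kingjames", "@kingjames", "KingJames", "@kingjames",
           "kingjames", "@kingjames", "@kingjames", "kingjames", "@king_james"]
print(accept_handle(answers))  # -> "kingjames"
```

Records that fail the threshold would simply go back into the queue for another pass, which matches the iterative, multi-loop nature of the pipeline described above.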