All right. So here's my [SOUND] personal example. I asked a question. Here's the hypothesis: is it possible to look at a whole year of NFL statistics, using machine learning algorithms and data analytics techniques, to predict the winners of each game of the regular season with 80+% accuracy? Many people watch cricket. So pick your game. There are lots of statistics available online; you pick whatever game you want. I happen to like football. I grew up with my dad, ever since I was knee-high to nothing, watching football games with him — it was the thing we did on Sunday afternoons. And it just occurred to me, as I was working on this material and driving home from the grocery store or something: I wonder if we could extract statistics from a season and use machine learning algorithms to do this. So then I started on it, and I'll share my experience with you here.

The first question is, what data do you have to look at? There are tons of statistics — you go to the NFL website, and there are all kinds of statistics out there. Having watched NFL football games for many, many years — and it's true for any sport — you start to understand what the critical statistics are, the critical features you might want to consider for a machine learning problem. So, starting as just an exercise, I began putting all this data into a spreadsheet. I chose what I thought were the five most important statistics for the offense and the five most important statistics for the defense, plus how many wins they had at home, how many wins they had on the road, what their total wins were, and whether they went to the Super Bowl or not. And I'm not sure that last column really matters, because there are only two teams. The offensive statistics are offensive points per game, offensive yards per game, offensive yards per play, the number of fumbles, and the third-down percentage.
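As a sketch of what that spreadsheet might look like once it's exported to CSV and read into Python — the column names and every number below are mine for illustration, not real NFL statistics:

```python
import csv
import io

# Hypothetical excerpt of the per-team spreadsheet described above.
# Column names and all values are illustrative, not real NFL numbers.
raw = """team,off_pts_pg,off_yds_pg,off_yds_pp,off_fumbles,off_3rd_pct,def_pts_pg,def_yds_pg,def_yds_pp,def_fumbles,def_sacks,wins_home,wins_road,wins_total,superbowl
TeamA,28.0,390.0,6.0,7,44.0,18.0,330.0,5.1,10,35,7,6,13,1
TeamB,24.5,360.0,5.5,9,41.0,22.0,355.0,5.6,8,30,6,5,11,0
"""

# In the real project this would read the exported file, e.g.
# csv.DictReader(open("teams.csv")); StringIO keeps the sketch self-contained.
reader = csv.DictReader(io.StringIO(raw))
teams = {row["team"]: row for row in reader}

# 14 feature columns per team: 5 offense + 5 defense + 3 win columns
# + the Super Bowl flag (the team name is just the row key).
n_feature_columns = len(teams["TeamA"]) - 1
```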
So if you know anything about the NFL: you get the ball and you're trying to move it down the field, and you get four downs — four plays with possession of the ball — in which you have to move the ball ten yards, just under ten meters. If you can't do it in those four plays, the other team gets the ball. So this third-down percentage is a big deal. Teams that do well on third-down percentage — meaning that when it's third down, they gain the yardage they need — get a fresh first down: four more downs, four more plays to move the ball another ten yards. So this percentage is really important, and I included it.

And the defense. Defenses can take the ball away from the offensive team — they can cause interceptions, they can cause fumbles, and so on. So this is defensive points per game, defensive yards per game, defensive yards per play, the number of fumbles they caused, and the number of defensive sacks. The quarterback is the guy who takes the snap, right? And if the defense can get to him before he can throw the ball or hand it off, that's called sacking the quarterback. So sacks are important on the defensive side, much like third-down percentage is important on the offensive side. And I explained those already.

So I've got five offensive features, five defensive features, and three win features. Total wins is not an independent feature: it depends on wins at home and wins on the road. So these are correlated — this is not uncorrelated data — and that may explain some things I see that I'll have to deal with. I plan at some point to go back and finish this. This is an unfinished, ongoing project that's sitting on my hard drive at home.
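The statistic itself is simple to compute — a one-line sketch (the function name and sample numbers are mine):

```python
def third_down_pct(conversions, attempts):
    """Third-down conversion rate: of the third downs a team faced,
    the percentage where it gained enough yardage for a fresh first down."""
    return 100.0 * conversions / attempts

# e.g. a team that converted 6 of 13 third downs in a game
rate = third_down_pct(6, 13)
```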
And I plan to go back, because if I can get it to work, then I might be able to use it to start predicting while an active season is taking place. This is looking at all of the data, all the features, and all of the outcomes we knew from the 2016 season. So in addition to those features, I started creating a win-loss matrix for each of the 16 weeks in the regular season. I created a sheet with all the teams on it, where all the cells start at 0. In this week, the Broncos played the Colts, and the Broncos won 34 to 20. In this case, that week, the Tampa Bay Buccaneers played the Cardinals, and the Cardinals won 40 to 7. And so on for the other teams: there are rows for all the teams and columns for all the teams, so it forms a grid with the resulting scores of each of the games. So you get more features.

So here was my process: build a spreadsheet, extract CSV files, feed them into Python with Anaconda — and I was expecting these awesome results, all smiles. I'm all, this is going to be so cool. No, it didn't quite work out like I had anticipated.

Here is the approach that I took. All of my features — offense, defense, wins, Super Bowl — for Team A and Team B. In Week 1, there are 16 games where Team A plays Team B, okay? So there were 16 entries for Week 1, 16 entries for Week 2, and 16 entries all the way down — except that of those team-versus-team statistics, I only had time to create six weeks' worth, so I'm missing over half a season of data. It was 16 weeks total, and I only had six weeks of data in there. This is how I chose to structure my feature vectors. And then the outcome, the prediction, is a label: a one means Team A won, and a zero means Team B won. So my hypothesis function, or the target variable I'm trying to predict: for each game where Team A plays Team B, I'm trying to predict, based on those statistics, which of the teams won. That binary value acts as a label for who won.
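A minimal sketch of that row structure, using only the ten offense/defense stats per team (the full version also carried the win columns) and made-up numbers; the two games are the Broncos–Colts and Buccaneers–Cardinals results mentioned above:

```python
# Per-team season stats: 5 offensive + 5 defensive features.
# All numbers here are made up for illustration.
team_stats = {
    "Broncos":    [20.8, 340.0, 5.1, 7, 40.0, 18.0, 316.0, 4.8, 9, 42],
    "Colts":      [25.0, 360.0, 5.5, 8, 42.0, 24.0, 380.0, 6.0, 7, 33],
    "Buccaneers": [22.1, 345.0, 5.3, 6, 38.0, 23.0, 368.0, 5.7, 8, 38],
    "Cardinals":  [26.0, 370.0, 5.8, 9, 41.0, 21.0, 330.0, 5.0, 10, 48],
}

# (team_a, team_b, a_score, b_score) for two of the games mentioned
games = [("Broncos", "Colts", 34, 20), ("Buccaneers", "Cardinals", 7, 40)]

X, y = [], []
for a, b, a_pts, b_pts in games:
    X.append(team_stats[a] + team_stats[b])  # Team A's stats, then Team B's
    y.append(1 if a_pts > b_pts else 0)      # 1 = Team A won, 0 = Team B won
```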
So that's how I structured my data. Do I have good data? Do I have smart data? I'm pretty certain I don't have enough data. Is the data properly prepared and arranged? I'm not 100% convinced of that yet; I've got more work to do. Do I have the right features? An excellent question. I think I do: based on what I know about the game, I believe I have extracted the right features from the NFL's website. But I'm not 100% convinced. Are there ambiguous samples? I don't know. I don't think so. There might be. But these are all questions, if you find yourself on your first machine learning and data analytics problem — whether it's for work or for fun — that you need to be asking yourself. You can't just trust that you're going to jam a bunch of data into a machine learning algorithm and out comes the answer to life, the universe, and everything.

So, ambiguous samples. One ambiguous sample, I thought, was: Team A plays Team B, and Team A wins. And then later in the season Team B plays Team A, and Team B wins. What does that mean? Teams in their own division play each other twice during the course of the 16 games. Does that represent ambiguity or not? I don't know. That's something I've been thinking about. Might be, might not be.

Am I using the right algorithm? Another excellent question. I chose a classifier because this is a classification problem: I'm feeding in a bunch of features and producing a zero or a one to tell me which team wins based on those features. It's not a regression problem, because my output isn't a real number like in the housing price example; it's a label, either a zero or a one. So it's a classification problem. So I took my team data and I thought, well, I'm just going to start with Principal Component Analysis, because I've been reading all about it. And I'm like, well, what can it tell me? I'm still not sure. [LAUGH] So the two teams here.
All right, so what are we looking at first here? This takes all 14 columns — the total number of features — for the 32 rows of data that I had when I ran this, and reduces them to a two-dimensional view. The horizontal axis is the first principal component. You can print out which combination of the original features that component corresponds to — I didn't, so I don't know which features it is, but you can figure it out. I just chose to plot it, to stand back and look at it and ask, what does this tell me? So I fed in all the team data and reduced it to two dimensions. And these two teams marked here are the teams that played in the Super Bowl that season, the New England Patriots and the Atlanta Falcons. I labeled those, and instead of printing them as little red X's, I printed them as little blue stars, because I wanted to see where they were in this data set.

Then I said, well, I want to look at just the offensive data — just those five features for the offense. And again, you can see the Patriots are here and the Falcons are here. They're not very far apart in their first principal component, but they are pretty far apart in their second principal component. And then I plotted the same thing for the defensive data, and here you can see they're pretty close on the defensive side: this one's a little to the right of zero, and this one's just a little to the left. I've been staring at those, off and on, for a year now, and I want to go back and figure out which features those variances correspond to, which I just haven't had time to do. I've been staring at them wondering, what is this telling me, if anything? It might not be telling me anything at all.

So then one night I had this other idea, and I said, could it be possible to create new features from this data? Think about them as derived features, okay? Is it possible to create an offensive and a defensive force vector?
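The PCA step can be sketched with scikit-learn. The matrix below is random stand-in data with the same shape as described (32 teams × 14 features), since the real spreadsheet isn't shown; the last line shows how `components_` would answer the open question of which original features each axis corresponds to:

```python
import numpy as np
from sklearn.decomposition import PCA

# Stand-in for the 32-team stat matrix (32 rows x 14 feature columns);
# random values, purely to show the mechanics of the reduction.
rng = np.random.default_rng(0)
team_matrix = rng.normal(size=(32, 14))

pca = PCA(n_components=2)
reduced = pca.fit_transform(team_matrix)  # one (PC1, PC2) point per team

# Each principal axis is a weighted combination of the 14 original
# features; the largest weights tell you which features dominate it.
top_feature_for_pc1 = int(np.argmax(np.abs(pca.components_[0])))
```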
This would be a vector in five dimensions for the offense, and another in five dimensions for the defense: you've got five features for the offense, and remember, there were five features for the defense. I just drew a three-dimensional graph of the first three features. So think about these as five-dimensional vectors that have a magnitude, computed using the square root of the sum of squares, like you would in three dimensions but done in five instead. That's what this is supposed to show. This line represents a three-dimensional vector of offensive yards per play, offensive yards per game, and offensive points per game; there would be two more dimensions, the fumbles and the third-down percentage. So that creates the five-dimensional vector, and it did.

These two vectors could then be combined into a single team force vector. The force vectors say how strong the offense is and how strong the defense is, and you could combine those two vectors together and ask: how strong is the team as a whole? That was my idea for creating derived features. So I did that, and again I called out the Falcons and the Patriots. I don't remember what the x-axis is here, but I normalized all of the force values to one. And you can see that the Falcons and the Patriots, when you look at the offensive force data, are both very high. Well, not quite the highest — the Falcons and the Redskins are pretty close, and the Saints had the highest offensive force value. But these are the two teams that made it to the Super Bowl, just looking at the offense. The defense is a little murkier; I was looking at this over the weekend. Here, larger numbers represent a better offense, while in this middle case a larger number represented a worse defense. And then I combined them into a single force vector. And I don't remember my math — actually, in order to combine these two, this scale needs to be inverted. This should have been one minus, right?
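A sketch of that derived-feature idea, assuming each feature has already been min-max normalized to [0, 1], and using the one-minus inversion for defense being discussed here (all numbers are made up):

```python
import math

def magnitude(v):
    """Euclidean length of a feature vector: square root of the sum of squares."""
    return math.sqrt(sum(x * x for x in v))

# Five normalized offensive and five normalized defensive features for one
# hypothetical team, already scaled to [0, 1]; for defense, LOW is good.
offense = [0.9, 0.8, 0.85, 0.7, 0.75]
defense = [0.2, 0.3, 0.25, 0.6, 0.4]

# Invert defense so that larger = better on both scales before combining.
defense_inverted = [1.0 - x for x in defense]

off_force = magnitude(offense)           # offensive force value
def_force = magnitude(defense_inverted)  # defensive force value
team_force = magnitude([off_force, def_force])  # combined team strength
```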
Larger should have meant better for both, right, if you're going to combine them — because otherwise one's pointing in one direction and they're going to cancel each other out. So, before I created the combined vector, I may have inverted this value into one minus the normalized value, so it lies between zero and one. But when you see the normalized combined force vectors, the Falcons and the Patriots both rate very high in the combined and in the offensive views, and it's somewhat ambiguous when we look at the defensive data.

So, here's where the results are right now. I've got six weeks of win/loss data, 16 games a week — like I said, I'm missing 10 weeks. X has 96 rows with 28 features: the 14 for Team A and the 14 for Team B. And y has 96 rows of who won, okay? I used a train/test split to hold out 10% for test and 90% for training, and I used the Linear Support Vector Classifier in the scikit-learn library — this is a classification problem, okay? And when I ran it for the very first time: accuracy, 80%. Yes! I was all smiles. This is so cool. But then I remembered the random element — there's that random variable these algorithms use — and you run it again and you get 0.71, and you run it again and you get 0.55, and then you run it again [LAUGH] and you get 0.35, and you run it again and you get 0.6, and it moves around. I just got lucky on the very first one. I thought, I've gotta capture that and share it with students, because it was like, yay, whoo, get the champagne out — I'm on to something now! I've only got six weeks of the whole 16-week season in there, and I've got 80% accuracy. Man! This is going to be so cool! No, it turned out that wasn't the case. There is a ton of work that I have to do. Clearly, I need to add the other ten weeks of win/loss data. I probably need to do cross-validation in the training with k-folds; we worked through that last week.
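The training step, and the k-fold fix, might look like the following sketch. X and y are random stand-ins with the shapes described (96 rows × 28 features), so the scores will hover near chance; the point is that a single split swings from run to run, while `cross_val_score` averages out the luck:

```python
import numpy as np
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.svm import LinearSVC

# Random stand-in data with the shapes from the lecture:
# 96 game rows (6 weeks x 16 games), 28 features, binary labels.
rng = np.random.default_rng(0)
X = rng.normal(size=(96, 28))
y = rng.integers(0, 2, size=96)

# A single 90/10 split: the accuracy moves around from run to run,
# which is the 0.80 / 0.71 / 0.55 / 0.35 behavior described above.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.1)
clf = LinearSVC(max_iter=5000).fit(X_tr, y_tr)
single_score = clf.score(X_te, y_te)

# 5-fold cross-validation averages over several splits and gives a
# much more honest estimate than one lucky draw.
kfold_scores = cross_val_score(LinearSVC(max_iter=5000), X, y, cv=5)
mean_score = kfold_scores.mean()
```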
I have not introduced the force vectors as features yet, although plotting the results looked promising. And I want to try other algorithms besides the one in scikit-learn. Again, here's my interest in exploring Spark, and there are other SVM libraries out there that you can Google and find. So, that was interesting. Like I said, I plan at some point to finish this experiment and see if I can actually get close to 80%. I think I would consider that a successful machine learning process.