Welcome to this special lecture, demonstration, and interview. I'm here with one of our Master of Applied Data Science students, Anthony Giove. Anthony is special to me: he's one of the first students in this program I had a chance to start teaching and working with, and with this course he's helped with content preparation and dataset finding, and he's looked over some of the materials. One of the things that's really impressive about Anthony is the work he's done in the milestones of his degree. We wanted to share that with you because it's focused on sports analytics, which is a passion of his. Anthony, welcome.

Thank you, Chris. It's a pleasure to be here. It's been a fun ride working with you over the past six months on the sports analytics specialization, and I'm excited to help launch this course.

Now, one of the things we're going to do today is show you a data science environment called Deepnote. This is a commercial product, but it's freely available; I believe even their Teams tier is available, and for education users it's also free. I use it a lot for content preparation, and we're going to demonstrate Anthony's project and some of the work he's done in sports analytics through this environment. I don't have any affiliation with Deepnote, the company, but I find it very useful for exploring on your own. Since this is the last course in this specialization on Coursera, you might be wondering where you should go next: where you can start to explore this data, or similar data, while keeping costs under control and without having to deploy a lot on your own machine.

Deepnote has a couple of different features. The big one I love is collaboration. You can share your Deepnote, and Anthony and I are in different studios because of COVID, but we'll be sharing a Deepnote and you'll be able to see me go in there, highlight things, and reference things as he goes through. You can also scale with Deepnote: you can get bigger and bigger computational instances, GPUs, and so forth, with reasonable cloud computing costs. I really enjoy it as a service and use it a bunch, and we're going to use it to demonstrate Anthony's project. But that's enough about that. I like to get right into the code, so why don't I pass it over to you, Anthony. You can share what the goals of your project were and where you're going with it, and then we can start looking at the code.

As Chris mentioned, I am a MADS student; I've completed Milestone 1 and Milestone 2, and I'm about to begin working on my capstone. My project is centered around sports analytics, mostly with respect to baseball. I've played baseball my entire life, I'm extremely passionate about baseball as well as baseball analytics, and Bill James, to me, is an idol. I've read all of his work and, along with the rest of the community for the most part, I'm thoroughly impressed with what he's done. Because of my love of baseball and anything that has to do with it, one of the things I wanted to do was predict Baseball Hall of Fame worthiness with machine learning. Today, there are multiple paths to getting elected into the Hall of Fame: through the Baseball Writers' Association of America and through the Veterans Committee.
What I wanted to do was try to peel back a little bit of the obfuscation that comes from the writers' association with respect to what's important and what actually separates a Hall of Famer from a player who isn't in the Hall of Fame. I had to do this for my milestone, but it's also a personal passion, so it seemed like a fun topic to explore. This portion was completed during my second milestone, along with a new look at player similarity, which isn't included here, but I'll probably share that code with the specialization afterwards. I completed the project with Avinash Reddy, who isn't featured here on camera, but he was an integral part of it, performing data cleaning and manipulation as well as modeling and visualization work. I just wanted to give Avinash his fair due.

What we wanted to do, obviously, was predict Major League Baseball player Hall of Fame worthiness. The Baseball Hall of Fame held its first election in 1936. It's a special place for the best and most impactful players in baseball history. As I mentioned before, there are two primary methods of being inducted: the Baseball Writers' Association of America and, now, the Veterans Committee. There have been many alternative routes to election over the years, and they've changed; there's an ebb and flow to how veteran players can get elected. At the moment, there are 333 members enshrined, and the breakdown of that is shown on the screen.

As far as why we want to predict Hall of Fame induction: there are a lot of external factors that go into the voting process, and many variables and factors that allow for differences of opinion. That's what made Bill James' work on similarity scores so impactful. He wrote a book centered on that, with respect to the politics of the Hall of Fame. It's an excellent read, so for anyone interested in the topic, I highly recommend it; it's a very fun read cover to cover. We wanted to collect the data, have a little bit of fun with it, and, being a Bill James fan myself, recreate Bill James' similarity score work with a different way of visualizing the data. You can look at different segments of a player's career; baseballreference.com also goes into that, and you can see age breakdowns for the most similar players. But we wanted to go a step further and make it so that you could select the seasons you care about in a player's career, as well as how old they were, on top of the era. You could filter out players that played before 1950, for instance, and get a snapshot of how Gary Sheffield looks against players that played after 1950. But you can also do the opposite and look only at players from the pre-World War II era and see how Gary Sheffield stacks up there.

To get the data, we had to collect it from baseballreference.com, because the type of data and the aggregation we needed doesn't currently exist as a ready-made dataset. A large portion of our project focused on just collecting the data and organizing it so that we could quickly query it, manipulate it, and explore it as we needed, and then ultimately build these models. We then performed some feature engineering to take stats like runs above replacement and wins above replacement and to try to isolate the prime of a player's career, which is obviously a very fluid thing.
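To make that collection and feature work a little more concrete, here is a minimal sketch of pulling a player's standard batting table from baseballreference.com with pandas and deriving one toy "prime window" feature. The URL, the player ID, and the column names are assumptions for illustration; this is not the milestone's actual scraping code, and a real collection run should respect the site's terms and rate limits.

```python
from io import StringIO

import pandas as pd
import requests

# Hypothetical player page; Baseball Reference IDs follow a
# last-name/first-name pattern (e.g. Gary Sheffield -> sheffga01).
URL = "https://www.baseball-reference.com/players/s/sheffga01.shtml"

# Identify the scraper and fail loudly on a bad response.
resp = requests.get(URL, headers={"User-Agent": "hof-milestone-demo"}, timeout=30)
resp.raise_for_status()

# read_html parses every visible HTML table into a DataFrame; on a player
# page the standard batting table is typically the first one. (Several of
# the advanced tables live inside HTML comments and need extra handling.)
tables = pd.read_html(StringIO(resp.text))
batting = tables[0]
print(batting.head())

# Toy "prime" feature (column name assumed): the best five-consecutive-
# season stretch of OPS, found with a rolling mean.
if "OPS" in batting.columns:
    ops = pd.to_numeric(batting["OPS"], errors="coerce").dropna()
    print("Best 5-season OPS stretch:", ops.rolling(window=5).mean().max())
```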
As technology has advanced, the primes of players' careers have, from what I recall, somewhat expanded. We did a lot of iteration on the model itself to try to narrow it down and ultimately get a predictive model that wasn't over- or under-fitting, terms which, if you weren't exposed to them prior to this course, I know Chris has talked about at length. To get more information on the Baseball Hall of Fame, the URL is attached at the bottom there; I highly recommend going there because you can learn a lot about the Hall of Fame.

With respect to our problem and our time constraints: our milestone takes place over the course of two months, and the first month is spent getting ready for a comprehensive exam on everything we've learned in the program to that point. We had to narrow down what we thought was achievable and attainable within a month. Because of this, we focused only on voting by the Baseball Writers' Association of America. The reason is that election by the writers follows a very standard process: five years after your retirement you appear on your first ballot, and you can remain on the ballot for up to ten years, until you fall under a five percent voting threshold, at which point you're removed. The other alternative is the Veterans Committee, which doesn't have this kind of regulation; the rules for when players can go before the Veterans Committee for discussion have changed over the years, as has the committee's name. That wasn't something we wanted to focus on, because we wanted something that was a bit more repeatable, but that also had enough data that further expansion could be done with that data as we collect it.

Another pretty important factor is that there's a large disparity between the number of batters and the number of pitchers that have been inducted into the Hall of Fame. With our dataset and time constraint, we chose to focus on batters only, which left us with 85 inductions by the Baseball Writers' Association of America. With that, some of the statistics we wanted to collect were basic hitting and fielding, advanced hitting and fielding, as well as All-Star Game appearances and MVP voting appearances. As we were designing and thinking about which features might be critical and important in a model, these were the aspects we felt we should capture so that we would have the flexibility to explore and investigate the different features available to us.

As mentioned before, all the statistical data was scraped from Baseball Reference, and because we already have labels for whether a player is in the Hall of Fame or not, this lends itself well to supervised machine learning. Because of the class disparity, with 4,136 players not in the Hall of Fame versus 85 that are, we used a stratified train-test split during all of our model creation so that we could take into consideration and account for the limited number of positive class members.
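To make that last point concrete, here is a minimal sketch of a stratified split and a simple baseline classifier in scikit-learn. The file name, feature columns, and choice of model are illustrative assumptions, not the actual milestone pipeline; the point is how stratify=y preserves the roughly 2% positive rate in both splits.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

# Assumed layout: one row per eligible batter, a handful of engineered
# career features, and a 0/1 label for BBWAA induction (85 positives
# out of roughly 4,221 players, so the classes are heavily imbalanced).
df = pd.read_csv("batters_features.csv")            # hypothetical file
X = df.drop(columns=["player_id", "hall_of_fame"])  # hypothetical columns
y = df["hall_of_fame"]

# stratify=y keeps the ~2% positive rate the same in train and test,
# so the few Hall of Famers are not lost to an unlucky split.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)

# class_weight="balanced" is one simple way to keep the model from
# ignoring the rare positive class; accuracy alone can look great when
# ~98% of players are negatives, so report precision/recall as well.
model = RandomForestClassifier(
    n_estimators=300, class_weight="balanced", random_state=42
)
model.fit(X_train, y_train)
print(classification_report(y_test, model.predict(X_test)))
```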
Ultimately, we had a ton of fun doing this. My love for baseball definitely made it a very intriguing and exciting project for me. My partner doesn't share my love of baseball, but he is a sports analytics fanatic as well, and he learned a ton, especially about the different statistics that are kept. He asked so many questions about why certain stats are even kept, and why people came up with all of these advanced fielding and batting metrics that didn't used to exist but now do.

Ultimately, our Hall of Fame prediction performed pretty well. In my opinion, all of the false negative and false positive cases, as well as the true negative and true positive cases, are at least deserving of being in the conversation for the Hall of Fame. Again, that's my personal opinion, not cemented in fact. As I mentioned, there are some external factors which cannot be accounted for, like allegations surrounding performance-enhancing drugs, political standing, or anything else that could potentially sway a voter's mind but isn't a statistic. Our model accuracy scores were pretty high. They were a little higher on the test set than on the train set, which we were happy with; we didn't feel we were over- or under-fitting, and the players that fall just inside and outside the bounds of being classified as a Hall of Famer, along with current players, made sense to my partner and me.

Then there's future work. In our capstone, we plan on expanding this into pitcher Hall of Fame predictions as well; there are some traditional statistical measures we will have to take to account for an even larger sample bias. Additional interactive visualizations that bring more configuration and feature engineering to end users for exploring Hall of Fame worthiness are another expansion we're working on for our capstone, while we develop a full-fledged website around all of this. When I get into the notebooks, I will show what a couple of the visualizations look like, although in Deepnote they don't have the interactivity you could get on your personal machine. Overall, it's a pretty compelling visualization, in my mind, in that you can see, along the trajectory of a player's career, at what point the model predicts they became a Hall of Famer. That's really cool when you look at players that are actually in the Hall of Fame, because you can see, at year 7, 8, 9, or 10, whether they would have been a Hall of Famer if they had retired then. In some cases, you can actually see a player flip from being classified as a Hall of Famer to not being one, and then back to being a Hall of Famer as they extend their career, which to me is a pretty interesting and pretty cool feature.

With that, we also want to try to dabble in predicting possible Veterans Committee elections, which in my opinion is infinitely harder than the Baseball Writers' Association of America vote, but it's something we want to explore further, and even if we don't capture it in my capstone project, it's something I plan on exploring on my own in the future.

Then another piece of our second milestone was creating a new player similarity method. For that, we utilized the individual pitch-by-pitch descriptions of events and implemented a Word2vec-style methodology that used pitch sequences as well as cumulative statistics, which gave us a different look at player similarity that we could compare directly with how Bill James' player similarity is done. That's not captured in this discussion, but incorporating that new player similarity method into what we've previously done is future work we're looking into. With that, let's dive into the notebook; rough sketches of the career-trajectory scoring and the pitch-sequence similarity ideas follow below.
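As a rough illustration of the career-trajectory idea described above, the sketch below re-scores a player's career-to-date statistics after each season with an already-fitted classifier and returns the predicted probability over time. The helper function, the column names, and the model are assumptions standing in for whatever the real notebook does.

```python
import pandas as pd

def career_to_date_features(seasons: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: turn the first N seasons of a career into the
    same cumulative feature vector the classifier was trained on."""
    totals = seasons[["H", "HR", "RBI", "SB", "BB"]].sum()  # assumed columns
    totals["seasons_played"] = len(seasons)
    return totals.to_frame().T

def hof_trajectory(model, seasons: pd.DataFrame) -> pd.Series:
    """Probability of being classed a Hall of Famer if the player had
    retired after each season of their career."""
    probs = []
    for n in range(1, len(seasons) + 1):
        feats = career_to_date_features(seasons.iloc[:n])
        probs.append(model.predict_proba(feats)[0, 1])
    return pd.Series(probs, index=seasons["Year"].values)  # assumed column

# Hypothetical usage with a fitted classifier and one player's season rows:
# traj = hof_trajectory(model, sheffield_seasons)
# traj.plot(marker="o")  # watch the prediction flip across the 0.5 line
```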
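And for the pitch-sequence similarity idea, here is a minimal sketch of a Word2vec-style approach using gensim: treat each plate appearance as a "sentence" of pitch and event tokens, learn token embeddings, average them per player, and compare players by cosine similarity. The tokens, player IDs, and the simple averaging step are illustrative assumptions, not the exact method from the milestone.

```python
import numpy as np
from gensim.models import Word2Vec

# Toy pitch-by-pitch "sentences": one list of event tokens per plate
# appearance, grouped by (hypothetical) player ID.
pitch_sequences = {
    "sheffga01": [["fastball", "ball", "slider", "swinging_strike", "home_run"],
                  ["fastball", "foul", "curve", "groundout"]],
    "playerxx01": [["fastball", "called_strike", "changeup", "walk"],
                   ["slider", "ball", "fastball", "single"]],
}

# Train embeddings over all sequences, regardless of player.
sentences = [seq for seqs in pitch_sequences.values() for seq in seqs]
w2v = Word2Vec(sentences, vector_size=32, window=3, min_count=1, epochs=50)

def player_vector(player_id: str) -> np.ndarray:
    """Average the embeddings of every token the player saw."""
    tokens = [t for seq in pitch_sequences[player_id] for t in seq]
    return np.mean([w2v.wv[t] for t in tokens], axis=0)

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Higher cosine similarity = more similar pitch/event profiles.
print(cosine(player_vector("sheffga01"), player_vector("playerxx01")))
```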