Okay, so here we are in the Deepnote notebook, and it looks just like a Jupyter notebook because that's what it's based on. You can see I'll be moving around in some of these cells, selecting some things and giving some comments, and Anthony will be as well. Anthony, your project really was broken up into two segments, which I think is really nice. The first one's on the data manipulation, the data cleaning, and the second one's on the modeling. Why don't I just hand it over to you and have you walk us through the data cleaning?

Yes. As I mentioned in the slides earlier, a large portion of our project, I would say roughly 75 percent of the time that we spent on it, focused on the data acquisition and cleaning. At the top of the notebook, obviously, I'm just doing some imports here, and then if I scroll down real quick, I want to give an overview of the baseball-reference.com website. The link is embedded in the notebook. From here, it takes you to baseball-reference.com, where you can see all the different major league players that have played, as well as standings, the works. One of the nice things about baseball-reference.com, and one of the things that we leveraged when writing our scraper to iterate through all the different players and extract information, was the way that the HTML is set up. When you click on the players here, you get a nice URL, which isn't shown here; however, it's included later on in the code. As you click on an index of players, let's say A, it pulls up all 690 players that have a last name starting with A. Then all of these links, which are embedded in the HTML code, can be extracted. We used Beautiful Soup to do that extraction. That allowed us to iterate through, and then parallel process, the collection of the data for the individual URLs that exist underneath. One example of this, moving to a player page, is Hank Aaron, or Henry Aaron by his full name. On this page, you can see that he has his name, positions, how he bats, and then his career summary statistics. Then the information that we care about, as I alluded to earlier, is standard batting, which is just your basic batting statistics. Then we have player value batting, which starts getting into some of the advanced statistics showing how worthwhile and how valuable a player is, and then advanced batting. We wanted to collect this information as well. Ultimately, we don't end up using any of the information that was collected there exactly; however, we did store it and we do have access to it, so for future improvement and future iteration we can leverage that. Then another aspect, which isn't as popularized but matters more for Baseball Writers' Association of America candidates versus Veterans Committee candidates, where a theory of recency comes into play with respect to fielding and there's a higher focus on hitting statistics: we also wanted to collect the standard fielding. We grabbed all of this for every single player. As you're going through and trying to do this, pandas, which is what we used to collect our data and what's been featured so far in this course and some of the other courses, has a nice read_html method that you can use to extract tabular information. When we run it on this page, for instance, this first table here, standard batting, can be extracted just by leveraging the pandas built-in method. All the other tables that come afterward we had to extract ourselves.
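As a quick, hedged illustration of that built-in pandas method on the player page referenced here (a minimal sketch, not the project's code; the URL is Hank Aaron's page as discussed, and whether the request succeeds without extra headers is an assumption):

```python
import pandas as pd

# Hank Aaron's player page, as referenced in the walkthrough.
url = "https://www.baseball-reference.com/players/a/aaronha01.shtml"

# read_html returns a list of DataFrames, one per table it can parse directly.
# On pages like this, only the standard batting table tends to come back;
# the remaining tables have to be scraped by hand.
tables = pd.read_html(url)
print(len(tables))        # how many tables pandas found on its own
print(tables[0].head())   # the standard batting statistics
```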
This is a nice trial-and-error situation where we go through and, by leveraging the dev tools that exist within Chrome, we can select elements so that as we highlight them, it takes us to where they exist within the code. Then, leveraging Beautiful Soup and some legwork, we know that if we want to grab the player value batting, we know what the table is going to be called, and that it's going to be that batting value identifier. Then if we want to scroll down and look at the advanced batting, again, because we selected it, it tells us where we are within the HTML code. This allowed us to go through the page and figure out all the different tables that we cared about. This one in particular is all batting value. We use that identifier to specify that this is the beginning of the information we care about, and then it iterates through and extracts the rest of the information.

Now Anthony, I think one of the things that you're talking about here, this need to screen scrape and manage HTML, is really interesting, because up until this point in the course we've dealt with pretty clean datasets. We've done a little bit of read_html with pandas, which is a great feature, and a lot of CSV datasets. We looked at a little bit of JSON data that came straight from the NHL API. But I would say the majority of data that I need to get at in this area, especially when I'm just starting to play around, unless I'm paying for the data itself from an API, I end up having to rip out of HTML pages and parse out of HTML pages. How much of your project, for the data cleaning portion, do you think actually was dealing with HTML, webpages, and web dev technologies?

With respect to our project, all of the data that we used came from web scraping and the HTML itself. For one of the four tables that we ultimately needed to scrape from every player, we were able to use the pandas built-in method, so that eliminated the need for us to do some additional cleaning, as it got it into a nicer format right off the bat. We also scraped a Hall of Fame page, where we could get the year players were inducted and how they were inducted, and we were able to leverage pandas there as well. But for the most part it was a lot of web dev tools and just going through and writing code that didn't work. I mean, working on software and automation in my day job, I'm very well aware of the process of writing something that doesn't work and then trying to fix it. As you get more experienced and skilled, the thing I think you get better at is just figuring out when something isn't going to work and then iterating faster. That's something we definitely got more well-versed in with respect to our project. Looking at these different HTML pages and figuring out how we were going to go about aggregating, then declaring data types and making sure that they would play nicely later on, was another pain point, because a lot of the data didn't come in the format that we would ultimately need it to be in. It was definitely a fun learning process, and definitely something that I imagine a lot of the projects I want to do in my personal time will require. I have completed some projects in the past with respect to fantasy sports; daily fantasy is another passion of mine and something I enjoy doing. A lot of that data also came from creating the dataset myself, because having something that is as comprehensive as I would like doesn't necessarily exist.
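To give a concrete sense of what looking a table up by the identifier found in dev tools can look like, here is a minimal sketch; the table id "batting_value" and the direct lookup are assumptions, and on some pages the table may not be addressable this way at all (for example, if it sits inside an HTML comment):

```python
from io import StringIO

import pandas as pd
import requests
from bs4 import BeautifulSoup

url = "https://www.baseball-reference.com/players/a/aaronha01.shtml"
soup = BeautifulSoup(requests.get(url).text, "html.parser")

# Look up the table by the id spotted via Chrome dev tools.
# "batting_value" is an assumed identifier for the player value batting table.
table = soup.find("table", id="batting_value")
if table is not None:
    # Hand just that table's HTML to pandas to turn it into a DataFrame.
    df_value = pd.read_html(StringIO(str(table)))[0]
    print(df_value.head())
else:
    print("Table not found directly; it may be rendered differently in the raw HTML.")
```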
Moving on to the code. This function here was basically written to iterate through that base URL, which is the page that I pulled up and showed, and then we're passing in the alphabet, which is just a string of A, B, C, D, E, F, G, and so on. This code essentially just iterates through and creates all the unique URLs as it's extracting the player information. When this is ultimately run, it collects close to 23,000 unique URLs, which are all the players we're going to have to iterate through to determine whether or not we need to scrape their data, and this is an example of running that. We take this base URL here, which is baseball-reference.com, then the alphabet, and then we're creating a list of all the URLs. This list can ultimately be passed in and parallel processed, leveraging parallel computing, so that we can concatenate DataFrames and/or dictionaries depending on the implementation you want to take. Here, I just have it iterate through five times. These are the URLs that are created; this first one is for Hank Aaron, as I showed earlier, and then it just breaks rather than creating all of them.

Then if we wanted to look at collecting the data from one URL in particular, we could go about doing it this way. This cell here is looking at the top of the page. When you go to Hank Aaron, you can see his name, the position he plays, and when his last game was. The last game is extremely important when you're looking at Hall of Fame induction, because you have to be retired for at least five years before you're eligible to be voted into the Hall of Fame. This code here just goes through and prints the player name, position, and last game as it's found. Within this, it's utilizing Beautiful Soup. Then we're using the schema.org Person markup, so that we can take that matched data and put it in a slightly nicer format, and then we can run finds, which essentially leverage regex, looking for identifiers that exist within the code of the page and grabbing the match from that. Then down here it's just doing that: ultimately we get Henry Aaron, right fielder and first baseman, and his last game was in 1976. Then if we wanted to expand on that and collect the summary stats for Henry Aaron, this bit of code here goes through and extracts his WAR, his at-bats, his hits, his home runs, and his batting average for his entire career, which again is at the top of the page on the baseball-reference.com site for him. That's the core summary information, which we used during our first milestone. A portion of this was leveraged in Milestone 1, and then we also extracted every single pitch that's been thrown in Major League Baseball that's available via Baseball Savant in our Milestone 1, to leverage that for looking at umpire tendencies and to explore how an umpire calls a game. We built a nice dashboard around that, so that you could explore: well, given this condition, how does this umpire call versus that umpire? It created a pretty powerful visualization that allowed exploration into that, and that was a space where I've never seen anything quite like it. That's one reason we targeted it.
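Going back to the URL-collection function described at the top of this cell, a rough sketch of that kind of iteration might look like the following; the /players/<letter>/ index pattern and the href filter are assumptions based on the page structure described in the walkthrough, not the project's actual implementation:

```python
import re

import requests
from bs4 import BeautifulSoup

def build_player_urls(base_url, alphabet, limit=None):
    """Iterate the per-letter index pages and collect the player URLs.

    The /players/<letter>/ pattern and the href regex are assumptions based
    on the page structure described in the walkthrough.
    """
    urls = []
    for letter in alphabet.lower():
        resp = requests.get(f"{base_url}/players/{letter}/")
        soup = BeautifulSoup(resp.text, "html.parser")
        for link in soup.find_all("a", href=True):
            if re.match(rf"^/players/{letter}/\w+\.shtml$", link["href"]):
                urls.append(base_url + link["href"])
                if limit is not None and len(urls) >= limit:
                    return urls  # break early, e.g. after five URLs
    return urls

# Mirrors the demo in the cell: only create the first five URLs, the first of
# which is Hank Aaron's page, rather than all ~23,000.
print(build_player_urls("https://www.baseball-reference.com",
                        "abcdefghijklmnopqrstuvwxyz", limit=5))
```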
As mentioned earlier, pandas does have a read_html function, which allows you to extract tabular information. So if we take that URL for Henry Aaron and try to extract all the information that we care about, which is those four tables, we can immediately see that, well, it's only going to grab one table, and if we print it, you can see it comes out as a list. Underneath it here, I'm going to convert this into a DataFrame. Basically this df, which is typical nomenclature, equals pd.read_html of the example, and then I'm taking the first element, because in the list that is the first instance. For those that are well-versed in Python, I could use a negative 1 instead, and since there's only one element in the list, it would index the exact same point. But ultimately I get this DataFrame. This is the standard batting information for Hank Aaron. You can look through and see that everything that was there is extracted, as well as the position and the awards. The awards will come in handy later on in our project.

Now Anthony, one of the things I like about using Deepnote, actually, is that its default rendering for the DataFrame tells you a little bit more about the data: it tells you what the most common value is (for instance, for year, 1952 appeared three percent of the time there) and how much data is actually missing in that DataFrame on a per-column basis. That's one of the features that I've enjoyed, and it would be great to see it in our general computing environment.

Yeah, that's another feature that I really enjoy, because a lot of the time, if you're working in your Jupyter notebook environment on your own machine, unless you install something like sparklines, which essentially gives you another nice, easy way of looking at the data, you then have to perform an info or describe on your DataFrame to collect more statistics, whereas Deepnote just brings it right to you. That definitely aided in our exploration process, as we were able to iterate through different feature permutations and, just at a glance, get an idea of the impact they would have on our false positives and so on. And they keep adding new features into Deepnote as well. I'm hopeful that ultimately there'll be a lot more interactivity built into Deepnote, so that some of the interactive visualizations that I enjoyed creating with ipywidgets, for instance, on my personal machine, I can bring into the Deepnote environment. I think it's only a matter of time personally, and I'll be excited to see how that goes. But I do agree with you, it is a very nice feature.

As I was walking through the HTML itself, I did mention there were four tables that we cared about and wanted to extract information from. These are the actual tags that are used in the HTML code. Later on, you'll see that we're iterating through these in order to extract the information. This get_table function that we wrote is essentially just trying to get a table. The first try is attached to the read_html within pandas, which allows us to extract the first table. Then when that fails, because it can't parse all the different table names, it goes into the except block, and within that except block it performs some different cleaning to extract a nice-looking table. Running this get_table method here with all batting standard, this DataFrame matches identically what we saw above, because it's using the exact same method. Next we move into all batting value. This is completely scraped on its own; we're defining everything that goes into it based on the different tags which are in the HTML code. Then moving on, we have the all batting advanced table, and you can see all of these different fields popping up as well.
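A rough sketch of that try/except pattern, under the assumption that the missed tables can be recovered by re-parsing the raw HTML; the table id, the comment-scanning fallback, and the helper name are illustrative, not the project's exact get_table implementation:

```python
from io import StringIO

import pandas as pd
import requests
from bs4 import BeautifulSoup, Comment

def get_table(url, table_id):
    """Try the pandas fast path first, then fall back to manual extraction."""
    html = requests.get(url).text
    try:
        # Fast path: let pandas locate a table whose id attribute matches.
        return pd.read_html(StringIO(html), attrs={"id": table_id})[0]
    except ValueError:
        # Fallback: some tables are only present inside HTML comments
        # (an assumption about how these pages are built), so scan those.
        soup = BeautifulSoup(html, "html.parser")
        for comment in soup.find_all(string=lambda t: isinstance(t, Comment)):
            if f'id="{table_id}"' in comment:
                return pd.read_html(StringIO(str(comment)),
                                    attrs={"id": table_id})[0]
    raise ValueError(f"table {table_id!r} not found at {url}")

# Example with one of the identifiers discussed in the walkthrough (assumed id).
df_value = get_table("https://www.baseball-reference.com/players/a/aaronha01.shtml",
                     "batting_value")
```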
Then the last table is the all standard fielding. All of this has extra information at the bottom, because one of the nice things about having all the statistics is that they're also broken out by the teams you played for and the positions you played, so you can get smaller chunks. While that's cool if you just want to look through player data, we don't actually care about that information for our purposes. Part of our cleaning process involved chopping all of that off and getting just the seasonal data.

Next, we move into cleaning the data. This is the function that we wrote where we're passing in our tables. We're doing a bunch of different conversions to convert data types, and we're also using regex basically to replace special characters with empty strings, so we can get everything into a nice float or int format. Then also, as we go through and zoom in, there's a lot of different cleaning, and some of it is specific to certain tables. Here we have an if table type equals batting standard, and then batting advanced, batting value, and then all standard fielding; there's different cleaning that we have to do for each. This is a nice function in that we can just call it and run it, but the process to actually create it involved doing each of these on a one-by-one basis, iterating through and getting to our cleaned DataFrame and what we want it to look like for each individual table, before worrying about trying to concatenate and merge them together. This is what the end product can look like if you take your chunked work and put it all together, so that you have a nice clean method to call to do your cleaning. After that, we have some additional support functions which are used in the cleaner itself. You can see here that these are specific functions for the different tables within all of this. If you want, you can dive into this and play with it on your own; a lot of this code you can cut and paste into a different cell, play around with, and try different things. It should lend itself to that pretty well.

Then here we're cleaning the career data and doing a groupby. Groupby is a very powerful thing in pandas in that it allows you to take unaggregated data and aggregate it together based on something. Then you have a quick-and-dirty way to look at averaging different columns and everything, which ultimately isn't what gets added into the data long-term at the end of our pipeline. But just for a quick look at the data, we wrote in a condition so that if you do see a percent sign, that represents something that's going to be fractional, so instead of taking the sum we'd want to take the mean, because there are a different number of games played in each season and we want to represent every game, rather than every season, on an equal basis. Ultimately, for all the features we use, we go back and calculate those correctly. But in the meantime, just for a quick-and-dirty look, this allows you to get an idea of what lies in your data.
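A hedged sketch of the two ideas just described, special-character stripping plus dtype conversion, and the quick percent-aware groupby; the column names, the regex, and the threshold are illustrative assumptions, not the project's df_cleaner:

```python
import pandas as pd

def rough_clean(df):
    """Illustrative cleaning pass: strip special characters with a regex,
    then keep a numeric conversion only where it mostly succeeds."""
    df = df.copy()
    for col in df.columns:
        if df[col].dtype == object:
            stripped = df[col].astype(str).str.replace(r"[*#+]", "", regex=True)
            as_num = pd.to_numeric(stripped, errors="coerce")
            # Keep numbers for mostly-numeric columns; otherwise keep the text.
            df[col] = as_num if as_num.notna().mean() > 0.5 else stripped
    return df

def quick_career_view(df, player_col="Player"):
    """Quick-and-dirty career aggregation: columns whose names contain a
    percent sign are fractional, so take the mean; sum the other numerics."""
    numeric_cols = df.select_dtypes("number").columns
    agg = {col: ("mean" if "%" in col else "sum") for col in numeric_cols}
    return df.groupby(player_col).agg(agg)
```

The real cleaner, as described above, also applies table-specific rules to each of the four tables before they ever get merged.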
Now we're just going to go through, and this is actually creating the DataFrames which will be used for the merge. Here we're creating our first DataFrame and then our second DataFrame. Sorry, this is our first one again, and this is our second one; I had split up the get_table and df_cleaner calls in the notebook to make them a little bit different. Here you basically run these two commands, and then this third one is just printing out the DataFrame. I'm at df2 equals df_cleaner, and we're passing in the arguments of df2, which is coming from above, and then all batting value. If you remember from above, in this DataFrame there was a lot of extra stuff at the bottom. As I mentioned, we don't need anything other than the actual seasonal data, so we're chopping all that off and only keeping the individual season data. This individual season data is only with respect to Major League Baseball seasons, so minor league baseball isn't taken into consideration. Any Negro League stats also aren't taken into consideration for the scope of our project. However, that is a factor: when we get to the false positives and negatives, there are some players who spent time in the Negro Leagues and are members of the Hall of Fame that the model isn't necessarily identifying correctly. Again, that can be explained by the limitations of our data collection and what we had focused on as in scope. Then jumping down, this is the third DataFrame, which represents the third table, and this is the fourth DataFrame, df4, which represents the fourth table, the fielding.

Soon after this, it's time to merge the DataFrames together. Depending on what exists in the DataFrames, we wrote a merge function which takes two DataFrames and then merges them. If it sees year or player in the DataFrame, it's going to perform the merge in a different way. Basically, that's how it does the merge. When you're combining two different DataFrames, if you have the exact same column name in the two DataFrames, you can merge on it. In this case we want to merge on year or merge on player. That's essentially what this is doing, and then we get this df_final here. We're iterating through: we're doing the df_merger on DataFrames one and two, then on DataFrame three, and then DataFrame four, and then we get this final DataFrame here. This is for Hank Aaron only at the moment, but you can see that we now have 23 rows and 81 columns. If I scroll to the right here, I can look at a quick overview via histogram for a lot of the different columns to see the distribution of where the values lie. This is similar to what you would get with sparklines if you're operating in your personal environment, but having it integrated into Deepnote is again a really cool feature and definitely aided in our exploration. Then we come over here and we have these awards, and then you can see all the other tables which are getting added onto this. Because we have the unique identifiers, whether it be a year or a player name, everything is added in sync: one season is represented by all of these different features. Then, scrolling over, this information here is used for player similarity, taking into consideration what position they spent a lot of their season at and then creating an adjustment. You get a positional mean, which allows you, if a player splits time between first and third base, to represent that they weren't exclusively a first baseman. That's a semi-departure from exactly how Bill James did it, in that he focused on primary position, but it's something that we wanted to keep in there for future flexibility, even though it isn't actually used in the model creation.
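A minimal sketch of that merge idea, joining on whichever shared key columns exist; the column names "Year" and "Player" and the helper name are assumptions based on the description, not the project's exact df_merger:

```python
import pandas as pd

def df_merger(left, right):
    """Merge two of the per-table DataFrames on the keys they share."""
    keys = [c for c in ("Year", "Player")
            if c in left.columns and c in right.columns]
    if not keys:
        raise ValueError("no shared 'Year' or 'Player' column to merge on")
    return left.merge(right, on=keys, how="left", suffixes=("", "_dup"))

# Chained the same way as described: combine the four tables one at a time.
# df_final = df_merger(df_merger(df_merger(df1, df2), df3), df4)
```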
Then, putting it all together, this is ultimately the function that we cascade into our parallel processing, so that we can run across all the different URLs and take advantage of the computing system that we have. Within this, you'll see it iterate through all the different functions that came previously, and the ultimate result is a nice, clean, concatenated DataFrame, which, when you run it, will look like this. The process and the steps that were run above are all put together here. Then, when we ultimately do parallel process this, I converted the result into a CSV so that it can be used in the next notebook. But for now we're done with this notebook and we can move on to the modeling. At the beginning of the modeling, I'll quickly pull up the DataFrame that was created for all the individual seasons that we scraped.
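For reference, the parallel run and CSV export described a moment ago could be sketched roughly like this; the transcript doesn't say which parallelism library was used, so multiprocessing.Pool, the scrape_player wrapper, and the output file name are all illustrative assumptions:

```python
from multiprocessing import Pool

import pandas as pd

def scrape_player(url):
    # Stand-in for the end-to-end per-player pipeline described above:
    # fetch the page, extract the four tables, clean them, and merge them.
    # Here it just returns a tiny frame so the skeleton runs on its own.
    return pd.DataFrame({"url": [url]})

if __name__ == "__main__":
    urls = [
        "https://www.baseball-reference.com/players/a/aaronha01.shtml",
        # ...the rest of the ~23,000 player URLs collected earlier
    ]
    with Pool(processes=8) as pool:
        frames = pool.map(scrape_player, urls)

    all_seasons = pd.concat(frames, ignore_index=True)
    all_seasons.to_csv("player_seasons.csv", index=False)
```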