Every year, millions of basketball fans from around the world tune in to the NBA Draft with the hope that their favorite team strikes gold and discovers the next big NBA star. The people in the front offices of these NBA teams spend thousands of hours scouting and evaluating college and international talent trying to find players that can succeed at the pro level and contribute to the team. Following the growth of the field of data science, it makes sense to try and evaluate talent beyond traditional methods. This article documents a project that attempted to do just that by predicting the stat-lines for the newest batch of NBA rookies.

### Data Preparation

The overall objective of this project was to predict how certain players would do in their first year in the NBA in terms of points, assists, rebounds, steals, and blocks, and the first step to achieving that was to create the right dataset. There are a lot of variables that contribute to the success of an NBA player, but for this project I decided to focus on how well these players performed at the college level. To create this dataset, BeautifulSoup was used to scrape the NBA rookie stats of players drafted between 2000 and 2018 from www.basketball-reference.com. After that, the average college stats of all of those drafted players were scraped from www.sports-reference.com/cbb and everything was formatted into a pandas DataFrame in Python. All of the datasets that were created for the purposes of this project are now available here as a collection of .csv files.

### Analysis

Before jumping into the machine learning models, it is worth going over the dataset first and looking for any basic or interesting patterns and anomalies.

#### Statistical Trends

The evolution of college and professional basketball was visualized with box plots of various statistics, grouped by year.

This year’s draft class (represented by the box plot for the year 2020) doesn’t stand out significantly in any statistical category. That should translate into this year’s draft class being a very typical one, in the sense that it will follow the pattern set by previous years: a few superstars and a plethora of average to below-average role players.

The NBA rookie box plots proved to be a lot more interesting, with more significant trends and patterns sticking out. The most fascinating pattern here concerns the evolution of the 3-point shot and how it has become more and more popular in recent years. Just as interesting as the uptick in average 3-point attempts is how recently the pivot happened. Before 2010, it doesn’t look like any rookie class averaged one 3-point attempt per game, whereas after 2010, almost every rookie class exceeded that mark.

#### Clusters

Besides looking for historic statistical trends, the data was also examined through cluster analysis with two main objectives: to put into perspective how players in this draft class stack up against each other, and how they stack up against rookies from previous years.

Three different clustering algorithms (K-Means clustering, Agglomerative clustering, and Affinity Propagation) were run on the dataset of the college stats of this year’s draft class. Zion Williamson is one player who has received a lot of hype from the sports media world as the next big superstar, and all of the clustering algorithms compare his college performance to that of Brandon Clarke and Bol Bol.

The same three clustering algorithms were run on the dataset comprising the college stats of players in the last 20 NBA Draft classes, and some interesting results were obtained. The Affinity Propagation model here describes Zion as a hybrid of Blake Griffin and Deandre Ayton, and it correspondingly estimates that he will put up an impressive stat line of 19.4 Points, 11.2 Rebounds, 2.8 Assists, 0.85 Steals, and 0.7 Blocks per game.
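As a rough illustration of how a cluster can be turned into a stat-line estimate, the sketch below groups players by college stats and then averages the rookie numbers of a prospect's cluster peers. This is only one plausible reading of the approach: the data is synthetic, the averaging step is an assumption, and K-Means from scikit-learn is used in place of Affinity Propagation to keep the example small.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(4)

# Synthetic college stats (points, rebounds per game) for two made-up player types.
guards = rng.normal(loc=[12.0, 3.0], scale=0.8, size=(30, 2))
bigs = rng.normal(loc=[16.0, 10.0], scale=0.8, size=(30, 2))
college_stats = np.vstack([guards, bigs])

# Synthetic rookie points per game for the same 60 players.
rookie_points = np.concatenate([
    rng.normal(6.0, 1.0, size=30),
    rng.normal(12.0, 1.0, size=30),
])

# Group past players by their college profile.
model = KMeans(n_clusters=2, n_init=10, random_state=0).fit(college_stats)

# Assign a hypothetical prospect to a cluster and average his peers' rookie output.
prospect = np.array([[17.0, 9.5]])
cluster = model.predict(prospect)[0]
peers = model.labels_ == cluster
estimate = float(rookie_points[peers].mean())
```

With real data, `college_stats` would hold the scraped college averages and `rookie_points` the matching rookie-year numbers.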

### Feature Engineering

The three main steps in creating a powerful machine learning model are selecting and manipulating the input features, choosing the most successful algorithm, and fine-tuning that algorithm’s hyperparameters. That is why, before running the data through all of the ML algorithms, some adjustments need to be made to the dataset.

Categorical variables, such as the name of the college and the name of the team that drafted the player, were originally broken down into a series of dummy variables uniquely representing each college/team. This technique was ultimately unsuccessful though as the algorithms run on this modified dataset tended to yield lower metrics than algorithms run on the original dataset without the team or college.

The team variable wasn’t very strong because teams fundamentally change from year to year. For example, it doesn’t seem quite right to equate the 2010 Cleveland Cavaliers team that won 74% of its games to the 2011 Cleveland Cavaliers team that ended up winning just 23% of its games. That is why the team variable was replaced with metadata features describing the success of that team in the year before the player was drafted (e.g., wins, point differential per game). This feature expansion was validated to a degree by the results, as the algorithms run on the modified dataset yielded better metrics than the algorithms run on the raw dataset with just the team.
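The two encodings discussed above can be sketched with pandas. The player names, column names, and table layout here are hypothetical; only the two Cavaliers win percentages come from the text.

```python
import pandas as pd

# Hypothetical draftees; names and columns are placeholders.
players = pd.DataFrame({
    "player": ["Player A", "Player B"],
    "team": ["CLE", "CLE"],
    "draft_year": [2011, 2012],
})

# Team results per season, using the Cavaliers figures quoted in the text.
team_seasons = pd.DataFrame({
    "team": ["CLE", "CLE"],
    "season": [2010, 2011],
    "win_pct": [0.74, 0.23],
})

# Option 1: one dummy column per team. Both players look identical on this feature.
dummies = pd.get_dummies(players, columns=["team"])

# Option 2: attach the team's record from the season before the draft.
players["season"] = players["draft_year"] - 1
enriched = players.merge(team_seasons, on=["team", "season"], how="left")
```

In `enriched`, the two players now carry very different `win_pct` values even though they share a team, which is the distinction the dummy encoding cannot express.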

Besides experimentation with dummy variables, a correlation matrix was constructed to better understand the strength of the relationships between the input variables and the target variables. For example, as seen in the diagram above, there seems to be a strong correlation between field goals attempted per game in college and actual points scored per game in the NBA.
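A correlation matrix like the one described above can be computed with NumPy. The numbers below are synthetic and the variable names are placeholders; the sketch only shows the mechanics, not the project's actual correlations.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200

# Synthetic stand-ins for two college inputs and one NBA target.
college_fga = rng.uniform(5, 18, size=n)   # field goal attempts per game
college_ast = rng.uniform(0, 7, size=n)    # assists per game
nba_pts = 0.8 * college_fga + rng.normal(scale=1.0, size=n)

# np.corrcoef treats each row as one variable and returns pairwise Pearson r.
data = np.vstack([college_fga, college_ast, nba_pts])
corr = np.corrcoef(data)

fga_vs_pts = corr[0, 2]   # strong by construction in this toy data
ast_vs_pts = corr[1, 2]   # near zero by construction
```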

Recursive feature elimination (RFE) was also used to determine the best variable subset to consider. This method works by repeatedly retrieving feature importances from a linear regression model and removing the feature with the lowest importance. Upon experimentation, it was found that reducing the input variables from 37 to 30 using RFE produced the best results.
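The elimination loop can be written out by hand to show what happens at each step. This is a minimal sketch on synthetic data, assuming the absolute value of each regression coefficient serves as the importance and that features share a common scale; scikit-learn's `RFE` class offers a ready-made version of the same idea.

```python
import numpy as np

def recursive_feature_elimination(X, y, n_keep):
    """Drop the weakest feature one at a time until n_keep remain."""
    remaining = list(range(X.shape[1]))
    while len(remaining) > n_keep:
        # Fit ordinary least squares (with intercept) on the surviving features.
        design = np.column_stack([X[:, remaining], np.ones(len(X))])
        coefs, *_ = np.linalg.lstsq(design, y, rcond=None)
        # Importance is |coefficient|; the last entry is the intercept, so skip it.
        weakest = int(np.argmin(np.abs(coefs[:-1])))
        remaining.pop(weakest)
    return remaining

rng = np.random.default_rng(2)
X = rng.normal(size=(300, 37))                      # 37 synthetic input features
y = 3.0 * X[:, :5].sum(axis=1) + rng.normal(size=300)  # only the first 5 matter

kept = recursive_feature_elimination(X, y, n_keep=30)
```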

### Models

A lot of different algorithms were run throughout the process of this experiment, and the raw code for all of the algorithms described below can be found here.

#### Linear Regression

Before jumping into all the fancy algorithms, a basic regression model was run to set some baseline benchmarks. Linear regression was selected as this benchmark model. It works by fitting a flat surface (a line in one dimension, a hyperplane in general) to the points in the training set, where each of the N features in the dataset contributes one coefficient. The coefficients are calculated with the method of least squares, where the objective is to minimize the sum of the squared errors.
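The least squares fit itself takes only a few lines of NumPy. The data below is synthetic and the coefficients are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in data: 50 players, 3 made-up college features.
X = rng.normal(size=(50, 3))
true_coefs = np.array([2.0, -1.0, 0.5])
y = X @ true_coefs + 4.0 + rng.normal(scale=0.1, size=50)

# Append a column of ones so the intercept is learned as one more coefficient.
X_design = np.column_stack([X, np.ones(len(X))])

# Least squares: pick beta that minimizes the sum of squared errors.
beta, *_ = np.linalg.lstsq(X_design, y, rcond=None)

predictions = X_design @ beta
sse = float(np.sum((y - predictions) ** 2))
```

Here `beta[:3]` holds the feature coefficients and `beta[3]` the intercept.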

#### Random Forest

The first major algorithm used was the Random Forest regressor. This algorithm works by randomly extracting various subsets from the original training dataset, where each subset is the intersection of N randomly chosen rows and M randomly chosen input features (columns). Next, the basic decision tree algorithm illustrated above is run on each of these subsets. Once all the trees are created, the prediction for an element in the test set is the mean of the results produced by running its input features through each and every decision tree.

#### Extra Trees

The second algorithm run was the Extra Trees regressor, which acts in a very similar manner to the Random Forest regressor. Just like Random Forest, Extra Trees runs a decision tree algorithm on various random subsets generated from the training dataset to create predictions. The big difference between the two comes from the way each decision tree is built. The Random Forest algorithm uses the traditional decision tree approach, where the feature and the value used at a split point are chosen to maximize the information gained at that step. The Extra Trees algorithm uses a more lenient approach, where candidate split values are drawn at random instead of being optimized, which makes the individual trees more varied.
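Both ensembles are available in scikit-learn with the same interface, so they are easy to compare side by side. The sketch below uses synthetic data and arbitrary settings (50 trees, fixed seeds), and it also checks the averaging step by hand for the forest.

```python
import numpy as np
from sklearn.ensemble import ExtraTreesRegressor, RandomForestRegressor

rng = np.random.default_rng(3)

# Synthetic data: 4 made-up features, with a mostly linear target plus noise.
X = rng.uniform(0, 10, size=(400, 4))
y = 2.0 * X[:, 0] + np.sin(X[:, 1]) + rng.normal(scale=0.5, size=400)
X_train, X_test = X[:300], X[300:]
y_train, y_test = y[:300], y[300:]

forest = RandomForestRegressor(n_estimators=50, random_state=0).fit(X_train, y_train)
extra = ExtraTreesRegressor(n_estimators=50, random_state=0).fit(X_train, y_train)

# The forest's prediction is the mean of its individual trees' predictions.
per_tree = np.array([tree.predict(X_test) for tree in forest.estimators_])
manual_mean = per_tree.mean(axis=0)

# .score returns the r^2 on the held-out rows.
forest_r2 = forest.score(X_test, y_test)
extra_r2 = extra.score(X_test, y_test)
```

Which of the two scores higher depends on the dataset, so both are worth trying.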

#### XGBoost

The next algorithm run was XGBoost, which uses a technique known as gradient boosting to create a powerful and accurate model. Gradient boosting works by building models sequentially, with each new model fit to the errors left by the models before it, so that the overall error shrinks step by step. Since the whole objective of XGBoost is to minimize the error found on the training set, this algorithm has an occasional tendency to overfit the data and perform subpar on the testing set.
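To make the boosting idea concrete, here is a bare-bones version written with NumPy. It is not XGBoost itself, which adds regularization and other refinements; it is plain gradient boosting with squared error, one input feature, and one-split trees (stumps), run on synthetic data.

```python
import numpy as np

def fit_stump(x, residuals):
    """Find the single threshold on x that best splits the residuals."""
    best = None
    for threshold in np.unique(x)[:-1]:  # skip the max so the right side is never empty
        left = residuals[x <= threshold]
        right = residuals[x > threshold]
        sse = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if best is None or sse < best[0]:
            best = (sse, threshold, left.mean(), right.mean())
    return best[1:]

def predict_stump(stump, x):
    threshold, left_value, right_value = stump
    return np.where(x <= threshold, left_value, right_value)

def gradient_boost(x, y, n_rounds=50, learning_rate=0.1):
    prediction = np.full_like(y, y.mean())  # start from the mean
    stumps, errors = [], []
    for _ in range(n_rounds):
        residuals = y - prediction  # what the current ensemble still gets wrong
        stump = fit_stump(x, residuals)
        prediction = prediction + learning_rate * predict_stump(stump, x)
        stumps.append(stump)
        errors.append(float(np.mean((y - prediction) ** 2)))
    return stumps, errors

rng = np.random.default_rng(6)
x = rng.uniform(0, 10, size=100)
y = np.sin(x) + rng.normal(scale=0.1, size=100)

stumps, errors = gradient_boost(x, y)
```

The training error in `errors` keeps shrinking with every round, which is exactly why an unchecked boosting model can end up overfitting.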

#### Neural Nets

Next up came the challenge of designing effective and appropriate Neural Networks to understand the data provided. Neural Networks work quite differently from all of the algorithms mentioned above, but the core makeup of a Neural Network can be described as a series of layers made up of nodes that connect to the nodes of the next layer via weights and activation functions. More specifically, the values in a node at some hidden layer N are defined by the values of the nodes at the previous layer, put in a linear combination with initially randomized weights and run through some activation function. The algorithm behind the Neural Network continuously modifies these initially random weights with the goal of producing outputs close to the provided target outputs.
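The description above maps onto a small NumPy network with a single hidden layer. The layer size, `tanh` activation, learning rate, and synthetic data are all arbitrary choices for illustration and are not the architecture used in the project.

```python
import numpy as np

def train_network(X, y, hidden_units=8, lr=0.02, epochs=1000, seed=0):
    rng = np.random.default_rng(seed)
    # Weights start out random; biases start at zero.
    W1 = rng.normal(scale=0.5, size=(X.shape[1], hidden_units))
    b1 = np.zeros((1, hidden_units))
    W2 = rng.normal(scale=0.5, size=(hidden_units, 1))
    b2 = np.zeros((1, 1))
    losses = []
    for _ in range(epochs):
        # Forward pass: linear combination, then activation, then output layer.
        hidden = np.tanh(X @ W1 + b1)
        output = hidden @ W2 + b2
        error = output - y
        losses.append(float(np.mean(error ** 2)))
        # Backward pass: gradients of the mean squared error.
        grad_output = 2.0 * error / len(X)
        grad_W2 = hidden.T @ grad_output
        grad_b2 = grad_output.sum(axis=0, keepdims=True)
        grad_hidden = (grad_output @ W2.T) * (1.0 - hidden ** 2)
        grad_W1 = X.T @ grad_hidden
        grad_b1 = grad_hidden.sum(axis=0, keepdims=True)
        # Nudge every weight against its gradient.
        W1 -= lr * grad_W1
        b1 -= lr * grad_b1
        W2 -= lr * grad_W2
        b2 -= lr * grad_b2
    return losses

rng = np.random.default_rng(5)
X = rng.normal(size=(200, 3))                  # synthetic inputs
y = (2.0 * X[:, 0] - X[:, 1]).reshape(-1, 1)   # synthetic target

losses = train_network(X, y)
```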

#### TPOT

The final algorithm that was run was TPOT, which is intrinsically quite different from the aforementioned algorithms in the sense that it is really a tool used to find good algorithms and models. In essence, it uses genetic programming to continuously eliminate models with poor results so that the most successful model is returned.

### Results

At the end of the day, machine learning is really a results-driven game where models that produce higher metrics are significantly more valuable than models that don’t. For the purposes of this project, there were 5 main metrics used to compare the success of the different models built.

- Adjusted Testing r²: This statistic measures the adjusted r² value on the testing set. The value of this statistic ranges from -inf to 1, with higher values indicating better results.
- Cross Validation Score: This statistic is derived by taking the average of the raw r² values produced by running the algorithms on different train-test splits within the dataset and multiplying it by 100. The value of this statistic ranges from -inf to 100, with higher values indicating better results.
- Percent Very Accurate: A prediction is considered “Very Accurate” if it is within 20% of the actual result. This statistic looks at what percent of the testing set was labeled as “Very Accurate”.
- Percent Accurate: A prediction is considered “Accurate” if it lies between 20% and 50% away from the actual result. This statistic looks at what percent of the testing set was labeled as “Accurate”.
- Point Differential Error: This statistic looks at what percent of the predictions in the testing set fell fewer than 2 points away from the actual results.
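Four of these metrics can be computed directly from a list of actual and predicted values, as sketched below in plain Python. The function names are placeholders, the adjusted r² uses the standard textbook formula, and treating exactly 20% and exactly 50% as inclusive upper bounds is an assumption, since the definitions above leave the boundaries open. The cross validation score is left out because it needs a model to refit on each split.

```python
def r_squared(actual, predicted):
    """Plain coefficient of determination."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1 - ss_res / ss_tot

def adjusted_r_squared(actual, predicted, n_features):
    """Standard adjustment: penalizes r^2 for the number of input features."""
    n = len(actual)
    r2 = r_squared(actual, predicted)
    return 1 - (1 - r2) * (n - 1) / (n - n_features - 1)

def relative_errors(actual, predicted):
    """Absolute error as a fraction of the actual value (assumes actual != 0)."""
    return [abs(p - a) / abs(a) for a, p in zip(actual, predicted)]

def percent_very_accurate(actual, predicted):
    errs = relative_errors(actual, predicted)
    return 100 * sum(e <= 0.20 for e in errs) / len(errs)

def percent_accurate(actual, predicted):
    errs = relative_errors(actual, predicted)
    return 100 * sum(0.20 < e <= 0.50 for e in errs) / len(errs)

def point_differential_error(actual, predicted, tolerance=2.0):
    hits = sum(abs(p - a) < tolerance for a, p in zip(actual, predicted))
    return 100 * hits / len(actual)

# Made-up points-per-game values for four players.
actual = [10.0, 5.0, 20.0, 8.0]
predicted = [11.0, 7.0, 12.0, 8.5]

very_accurate = percent_very_accurate(actual, predicted)
accurate = percent_accurate(actual, predicted)
within_two = point_differential_error(actual, predicted)
```

Note that a prediction can count as “Accurate” in percentage terms while still missing by more than 2 points, as the third player in the toy data shows.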

A basic website was created to display the results from the table above in a more in-depth and interactive manner.