Predicting NBA Championships with Machine Learning

Each NBA season, 30 teams compete for something only one of them will achieve: a champion’s legacy. From power rankings to the chaos and heartbreak of the trade deadline, fans and analysts speculate endlessly about who will lift the Larry O’Brien Trophy.

But what if we could go beyond hot takes and gut predictions and, at the end of the regular season, use data and machine learning to predict the NBA champion?

In this article, I will cover the whole process, from collecting and preparing data, to training and evaluating models, and finally using them to make predictions for the upcoming 2024-25 playoffs. Along the way, I will highlight the most surprising insights from the analysis.

All code and data used are available on GitHub.


Understand the problem

The most important step in any machine learning project is to understand the problem before training any model:
What questions are we trying to answer, and what data (and models) can help us get there?

In this case, the problem is simple: Who will become the NBA champion?

A natural first idea is to frame it as a classification problem: every team in each season is labeled as champion or not champion.

But there is a catch: there is only one champion each year (obviously).

So if we pull data from the last 40 seasons, we’ll have 40 positive examples... and hundreds of negative examples. This lack of positive samples makes it difficult for a model to learn meaningful patterns. Winning an NBA title is a rare event, and we simply don’t have enough historical data; we don’t have 20,000 seasons to work with. This scarcity makes it very difficult for any classification model to really understand what separates champions from everyone else.

We need a smarter way to solve the problem.

To help the model understand what makes a champion, it helps to also teach it what makes an almost-champion, and how that differs from a team eliminated in the first round. In other words, we want the model to learn degrees of success in the playoffs, not simply a yes/no outcome.

This led me to the concept of Champion Share: the number of playoff wins a team achieved, divided by the total needed to win the championship.

Since 2003, it has taken 16 wins to become NBA champion. However, between 1984 and 2002, the first round was a best-of-five series, so during this period the total required was 15 wins.

A team that loses in the first round may have 0 or 1 wins (with one win, Champion Share = 1/16), while a team that reaches the Finals but loses may have 14 wins (Champion Share = 14/16). The champion’s share is 1.0.
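
As a minimal sketch of the calculation described above, the function below computes Champion Share from a team's playoff wins. The function names and the choice of the playoff year as input are assumptions made for illustration; only the 15-win and 16-win totals come from the text.

```python
def wins_needed(playoff_year: int) -> int:
    """Total playoff wins required for the title in a given playoff year.

    Assumption: playoff_year is the calendar year the playoffs were played in.
    """
    # 1984-2002: best-of-five first round, so 3 + 4 + 4 + 4 = 15 wins.
    # 2003 onward: four best-of-seven rounds, so 4 * 4 = 16 wins.
    return 16 if playoff_year >= 2003 else 15


def champion_share(playoff_wins: int, playoff_year: int) -> float:
    """Fraction of the required playoff wins that a team actually achieved."""
    total = wins_needed(playoff_year)
    if not 0 <= playoff_wins <= total:
        raise ValueError(f"playoff_wins must be between 0 and {total}")
    return playoff_wins / total


print(champion_share(1, 2021))   # first-round exit with one win: 1/16
print(champion_share(14, 2021))  # lost the Finals with 14 wins: 14/16
print(champion_share(15, 1995))  # champion in the 15-win era: 1.0
```

Dividing by the era-specific total keeps the target on the same 0 to 1 scale for every season, so a champion from the 15-win era and a champion from the 16-win era both get exactly 1.0.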

Example of a playoff bracket: the 2021 playoffs

This reframes the task as a regression problem: the model predicts a continuous value between 0 and 1, representing how close each team came to winning it all.

In this setup, the team with the highest predicted value is our model’s pick for NBA champion.
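
To make the selection rule concrete, here is a small sketch that picks, for each season, the team with the highest predicted Champion Share. The team names and predicted values are placeholders invented for illustration, not real model output, and `pick_champions` is a hypothetical helper name.

```python
# Hypothetical (season, team, predicted Champion Share) rows.
# The values are made up for illustration only.
predictions = [
    (2023, "Team A", 0.72),
    (2023, "Team B", 0.55),
    (2023, "Team C", 0.08),
    (2024, "Team A", 0.31),
    (2024, "Team B", 0.64),
    (2024, "Team C", 0.12),
]


def pick_champions(rows):
    """Return {season: team}, keeping the highest predicted share per season."""
    best = {}
    for season, team, predicted in rows:
        # Keep the running maximum for each season.
        if season not in best or predicted > best[season][1]:
            best[season] = (team, predicted)
    return {season: team for season, (team, _) in best.items()}


print(pick_champions(predictions))
```

Note that the predicted values within a season do not need to sum to anything in particular; only their ordering matters for the pick.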

This is similar to the approach I used for MVP prediction in my previous post.

Data

Basketball, and especially the NBA, is one of the most exciting sports for data science thanks to the volume of freely available statistics. For this project, I collected data from Basketball Reference using my Python package brscraper, which allows easy access to player and team data. All data collection was done in accordance with the website’s guidelines and rate limits.

The data used includes team-level statistics, final regular season standings (e.g., win percentage, seed), player-level statistics for each team (limited to players who appeared in at least 30 games), and historical playoff performance indicators.

However, it is important to be careful when working with raw, absolute values. For example, the average points per game (PPG) in the 2023-24 season was 114.2, while in 2000-01 it was 94.8, an increase of roughly 20%.

This is due to a series of factors, but the fact is that the game has changed dramatically over the years, and so have the metrics derived from it.

Evolution of certain NBA per-game statistics (image by the author)

To account for this shift, the approach here avoids using absolute statistics directly and instead opts for normalized, relative metrics. For example:

  • Instead of a team’s PPG, you can use its ranking in that season.
  • You might consider how many of a team’s players are in the top 10 in scoring, etc.

This enables the model to capture relative dominance within each era, making comparisons across decades more meaningful and allowing older seasons to be included to enrich the dataset.
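
The within-season ranking idea can be sketched with pandas as follows. The column names and the toy values are assumptions made for illustration; they are not taken from the actual dataset.

```python
import pandas as pd

# Toy data: column names and values are illustrative assumptions.
teams = pd.DataFrame({
    "season": [2001, 2001, 2001, 2024, 2024, 2024],
    "team": ["A", "B", "C", "D", "E", "F"],
    "ppg": [101.0, 95.0, 90.0, 121.0, 114.0, 106.0],
})

# Rank each team's PPG within its own season (1 = highest scoring),
# so a 95-PPG team from 2001 and a 114-PPG team from 2024 can both rank 2nd.
teams["ppg_rank"] = (
    teams.groupby("season")["ppg"]
    .rank(ascending=False, method="min")
    .astype(int)
)

print(teams)
```

Because the rank is computed per season, the feature carries the same meaning in a low-scoring era as in a high-scoring one, which is the point of the normalization.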

Seasons from 1984 to 2024 were used to train and test the model, totaling 40 seasons, with 70 variables in total.

Before diving into the model itself, some interesting patterns emerged from the exploratory analysis when comparing championship teams to playoff teams as a whole:

Team comparison: champions vs. other playoff teams (image by the author)

Not surprisingly, champions usually come from the top seeds and have higher winning percentages. During this period, the champion with the worst regular season record was the 1994–95 Houston Rockets, led by Hakeem Olajuwon, who entered the playoffs at 47-35 (.573) as the tenth-best team overall (sixth in the West).

Another notable trend is that champions tend to have a slightly higher average age, which suggests that experience plays a crucial role once the playoffs begin. The youngest championship team in the database, averaging 26.6 years, is the 1990–91 Chicago Bulls, and the oldest, at 31.2 years, is the 1997–98 Chicago Bulls: the first and last championships of the Michael Jordan dynasty.

Similarly, teams whose coaches have been with the franchise longer tend to have greater success in the playoffs.

Modeling

The model used was LightGBM, a tree-based algorithm widely regarded as one of the most effective methods for tabular data, alongside other algorithms like XGBoost. A grid search was performed to find the best hyperparameters for this specific problem.

Model performance was evaluated using root mean squared error (RMSE) and the coefficient of determination (R²).

You can find the formulas and explanations for each metric in my previous MVP article.
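
For readers who want the two metrics at a glance, here is a minimal pure-Python sketch using their standard definitions. The sample values are invented for illustration and are unrelated to the model's actual predictions.

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error: square root of the mean squared residual."""
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


# Made-up Champion Share values, for illustration only.
y_true = [0.0, 0.25, 0.5, 1.0]
y_pred = [0.1, 0.2, 0.6, 0.8]

print(round(rmse(y_true, y_pred), 3))      # about 0.125
print(round(r2_score(y_true, y_pred), 3))  # about 0.886
```

RMSE is expressed in the same units as the target, so here it reads directly as an average miss in Champion Share, while R² measures how much of the variance in the target the predictions explain.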

Seasons were randomly selected for training and testing, with the constraint that the last three seasons were kept in the test set to better evaluate the model’s performance on the most recent data. Importantly, all teams were included in the dataset, not only those that qualified for the playoffs, allowing the model to learn patterns without relying on prior knowledge of playoff qualification.

Results
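
A season-level split like the one described above could look roughly like this. It is only a sketch: the helper name, the 20% test fraction, the random seed, and the labeling of seasons by their ending year are all assumptions, since the text does not specify them.

```python
import random


def split_seasons(seasons, holdout, test_fraction=0.2, seed=42):
    """Split seasons into train/test, forcing the `holdout` seasons into test.

    Splitting by season (not by row) keeps all teams from the same season
    on the same side of the split.
    """
    all_seasons = set(seasons)
    holdout = set(holdout)
    candidates = sorted(all_seasons - holdout)
    rng = random.Random(seed)
    # Target test size, never smaller than the forced holdout.
    n_test = max(len(holdout), round(test_fraction * len(all_seasons)))
    extra = rng.sample(candidates, n_test - len(holdout))
    test = holdout | set(extra)
    train = all_seasons - test
    return sorted(train), sorted(test)


# Assumption: 40 seasons labeled by the year in which they ended.
seasons = list(range(1985, 2025))
train, test = split_seasons(seasons, holdout=[2022, 2023, 2024])

print(len(train), len(test))
print(test)
```

Each row of the dataset would then be assigned to train or test according to its season, so no season contributes teams to both sets.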

Here we can see a comparison between the “distributions” of the predictions and the actual values. Technically, this is a histogram (since we are dealing with a regression problem), but it still works as a visual distribution because the target values range from 0 to 1. In addition, we show the distribution of the residual error for each prediction.

(Image by the author)

As we can see, both the predictions and the actual values follow a similar pattern, concentrated close to zero, because most teams do not achieve high playoff success. This is further supported by the distribution of the residual errors, which is centered around zero and resembles a normal distribution. This suggests that the model is able to capture and reproduce the underlying patterns present in the data.

In terms of performance metrics, the best model achieved an RMSE of 0.184 on the test dataset, with an R² score of 0.537.

An effective way to visualize the impact of key variables on the model’s predictions is through SHAP values, a technique that provides a reasonable explanation of how each feature affects the model’s predictions.

Likewise, more details can be found in Predicting NBA MVP with Machine Learning.

SHAP plot (image by the author)

From the SHAP plot, some important insights emerge:

  • Seed and W/L% are among the three most influential features, highlighting the importance of team performance in the regular season.
  • Team-level statistics such as Net Rating (NRTG), Opponent Points per Game (PA/G), Margin of Victory (MOV) and Adjusted Offensive Rating (ORTG/A) also play an important role in shaping playoff success.
  • When it comes to players, advanced metrics stand out: the number of players in the top 30 for Box Plus/Minus (BPM) and in the top 3 for Win Shares per 48 Minutes (WS/48) are the most influential.

Interestingly, the model also captures broader trends: teams with a higher average age tend to perform better in the playoffs, and a strong showing in the previous playoffs is often associated with future success. Both patterns point to experience as a valuable asset in the pursuit of a championship.

Now let’s take a closer look at the model in action, predicting the last three NBA champions:

Predictions for the last three years (image by the author)

The model correctly predicted two of the last three NBA champions. The only miss was in 2023, when it favored the Milwaukee Bucks. That season, Milwaukee had the best regular season record at 58-24 (.707), but an injury to Giannis Antetokounmpo hurt their playoff run. The Miami Heat eliminated the Bucks 4-1 in the first round and went on to reach the Finals, a surprising and disappointing playoff exit for Milwaukee, who had won the championship two years earlier.

2025 Playoff Forecast

For the upcoming 2025 playoffs, the model predicts the Boston Celtics as champions, with OKC and Cleveland close behind.

Given their regular season (61-21, second seed in the East) and the fact that they are the defending champions, I tend to agree. They combine current performance with recent playoff success.

However, as we all know, anything can happen in sports, and we will only get the real answer by the end of June.

(Photo by Richard Burlton on Unsplash)

Conclusion

The project demonstrates how machine learning can be applied to complex, dynamic environments such as sports. Using a dataset spanning forty years of basketball history, the model is able to discover meaningful patterns that drive playoff success. In addition to predictions, tools like SHAP allow us to explain the model’s decisions and better understand the factors that contribute to playoff success.

One of the biggest challenges with this problem is accounting for injuries. They can completely reshape the playoff landscape, especially when they affect star players during the playoffs or late in the regular season. Ideally, we could incorporate injury history and availability data to account for this better. Unfortunately, consistent and structured open data on this, especially at the granularity required for modeling, is hard to come by. As a result, this remains one of the model’s blind spots: it treats all teams as if they were at full strength, which is often not the case.

Although no model can perfectly predict the chaos and unpredictability of sports, this analysis shows that data-driven approaches can come close. As the 2025 playoffs unfold, it will be exciting to see how the predictions hold up, and where the game still surprises us.

(Photo by Tim Hart

I’m always available on my channels (LinkedIn and GitHub).

Thank you for your attention! 👏

Gabriel Speranza Pastorello
