
How to establish a benchmark for a model

Over the past three years of working as a data science consultant, I have had the opportunity to work on multiple projects across various industries. However, I noticed one thing most of the clients I worked with had in common:

They rarely have clear ideas about the project goals.

This is one of the main obstacles data scientists face, especially now that generative AI is taking over every field.

But let’s assume that, after some back and forth, the goal becomes clear. We managed to pin down a specific problem. For example:

I want to classify my customers into two groups according to their probability of churning: “high likelihood of churning” and “low likelihood of churning”

OK, now what? Simple, let’s start building some models!

Wrong!

If clear goals are rare, reliable benchmarks are even rarer.

In my opinion, one of the most important steps in delivering a data science project is defining and agreeing on a set of benchmarks with the client.

In this blog post, I will explain:

  • What a benchmark is
  • Why it is important to have one
  • How I would build one
  • Some potential drawbacks to keep in mind

What is a benchmark?

A benchmark is a standardized way to evaluate a model’s performance. It provides a reference point against which new models can be compared.

A benchmark requires two key components to be complete:

  1. A set of metrics to evaluate performance
  2. A set of simple models to use as baselines

The core concept is simple: whenever I develop a new model, I compare it with previous versions and baseline models. This ensures that improvements are real and tracked.

It is essential to understand that this baseline shouldn’t be model- or dataset-specific, but rather business-case-specific. It should be a general benchmark for a given business case.

Whenever I encounter a new dataset with the same business objective, this benchmark should remain a reliable reference point.


Why is building a benchmark important?

Now that we have defined what a benchmark is, let’s dig into why I think it’s worth spending an extra project week on developing a strong one.

  1. Without a benchmark you’re chasing perfection – If you don’t have a clear reference point to work against, any result loses its meaning. “My model has a MAE of 30,000.” So what? Maybe with a simple baseline the MAE is 25,000. By comparing your model with a baseline, you can measure both performance and improvement.
  2. Improves communication with clients – Clients and business teams may not immediately understand the standard outputs of a model. However, by engaging them with a simple baseline from the start, it becomes easier to demonstrate improvements later. In many cases, the benchmark itself can come directly from the business in one shape or another.
  3. Helps in model selection – A benchmark gives a starting point to compare multiple models fairly. Without it, you might waste time testing models that aren’t worth considering.
  4. Model drift detection and monitoring – Models can degrade over time. With a benchmark in place, you may be able to catch drift early by comparing new outputs against past benchmarks and baselines.
  5. Consistency across different datasets – Datasets evolve. By keeping the metrics and baseline models fixed, you ensure that performance comparisons remain valid over time.

With a clear benchmark, every step of the model development process provides immediate feedback, making the whole process more intentional and data-driven.


How I would build a benchmark

I hope I have convinced you of the importance of having a benchmark. Now, let’s actually build one.

Let’s start with the business question raised at the beginning of this blog post:

I want to classify my customers into two groups according to their probability of churning: “high likelihood of churning” and “low likelihood of churning”

For simplicity, I’ll assume there are no other business constraints, but in real cases, constraints usually exist.

For this example, I’m using this dataset (CC0: Public Domain). The data contains some attributes of a company’s customer base (e.g., age, gender, number of products) along with their churn status.

Now that we have something to work on, let’s set up the benchmark:

1. Define metrics

We’re dealing with a churn use case; in particular, this is a binary classification problem. Therefore, the main metrics we can use are:

  • Precision – Percentage of correctly predicted churners among all predicted churners
  • Recall – Percentage of actual churners correctly identified
  • F1 score – Balances precision and recall
  • True positives, false positives, true negatives and false negatives

These are some “simple” metrics that can be used to evaluate model output.
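
All of these come out of the box in scikit-learn. As a quick, minimal sketch (the toy arrays below are purely illustrative), computing them could look like this:

# Minimal sketch: computing the standard metrics with scikit-learn on toy arrays
import numpy as np
from sklearn.metrics import precision_score, recall_score, f1_score, confusion_matrix

y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])  # actual churn labels
y_pred = np.array([1, 0, 0, 1, 0, 1, 1, 0])  # model predictions

precision = precision_score(y_true, y_pred)  # 0.75
recall = recall_score(y_true, y_pred)        # 0.75
f1 = f1_score(y_true, y_pred)                # 0.75
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()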

However, this is not an exhaustive list, and standard metrics are not always sufficient. In many use cases, it can be useful to build custom metrics.

Let’s assume that in our business case, customers flagged as “high likelihood of churning” are offered a discount. This creates:

  • a cost ($250) when the discount is offered to a customer who wasn’t going to churn
  • a gain ($1,000) when a churning customer is retained

With these definitions, we can build a custom metric that will be crucial in our scenario:

# Defining the business case-specific reference metric
import numpy as np

def financial_gain(y_true, y_pred):
    loss_from_fp = np.sum(np.logical_and(y_pred == 1, y_true == 0)) * 250
    gain_from_tp = np.sum(np.logical_and(y_pred == 1, y_true == 1)) * 1000
    return gain_from_tp - loss_from_fp
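
As a quick sanity check, here is what the metric returns on a couple of made-up arrays (illustrative values only):

# Toy example: 2 retained churners ($2,000) minus 1 wasted discount ($250) = $1,750
y_true = np.array([1, 0, 1, 0])  # actual churners
y_pred = np.array([1, 1, 1, 0])  # customers we would offer the discount to
print(financial_gain(y_true, y_pred))  # 1750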

When you can build them, business-driven metrics like this are usually the most relevant. Such metrics can take any shape or form: financial gains, minimum requirements, percentage of coverage, etc.
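
For instance, a coverage-style metric of the kind mentioned above could be sketched as follows (this is an illustrative assumption, not part of the original project):

# Illustrative coverage-style metric: share of the customer base we would contact.
# Useful when the business imposes a cap such as "the campaign can reach at most 20% of customers".
# y_true is unused but kept so the signature matches the other metrics.
def contacted_share(y_true, y_pred):
    return np.mean(y_pred == 1)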

2. Define the benchmark models

Now that we have defined the metrics, we can define a set of benchmark models to use as references.

At this stage, you should define a list of simple-to-implement models in their simplest possible setup. There is no reason at this point to spend time and resources on optimizing these models; my mindset is:

If I have 15 minutes, how would I implement this model?

In later phases of the project, you can add new benchmark models as the project progresses.

In this case, I will use the following models:

  • Random model – Assigns labels at random, using the churn rate observed in the training set
  • Majority model – Always predicts the most frequent class
  • Simple XGBoost
  • Simple KNN

import numpy as np
import xgboost as xgb
from sklearn.neighbors import KNeighborsClassifier

class BinaryMean():
    """Random baseline: draws labels with the churn rate observed in the training set."""
    @staticmethod
    def run_benchmark(df_train, df_test):
        np.random.seed(21)
        return np.random.choice(a=[1, 0], size=len(df_test), p=[df_train['y'].mean(), 1 - df_train['y'].mean()])

class SimpleXbg():
    """Out-of-the-box XGBoost classifier trained on the numeric features only."""
    @staticmethod
    def run_benchmark(df_train, df_test):
        model = xgb.XGBClassifier()
        model.fit(df_train.select_dtypes(include=np.number).drop(columns='y'), df_train['y'])
        return model.predict(df_test.select_dtypes(include=np.number).drop(columns='y'))

class MajorityClass():
    """Majority baseline: always predicts the most frequent class in the training set."""
    @staticmethod
    def run_benchmark(df_train, df_test):
        majority_class = df_train['y'].mode()[0]
        return np.full(len(df_test), majority_class)

class SimpleKNN():
    """Out-of-the-box k-nearest-neighbours classifier on the numeric features only."""
    @staticmethod
    def run_benchmark(df_train, df_test):
        model = KNeighborsClassifier()
        model.fit(df_train.select_dtypes(include=np.number).drop(columns='y'), df_train['y'])
        return model.predict(df_test.select_dtypes(include=np.number).drop(columns='y'))
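
Each baseline exposes the same static run_benchmark(df_train, df_test) interface, which is what lets the benchmark class shown later loop over them uniformly. As a quick illustration on a hypothetical toy dataset (made-up values), a single baseline can also be run on its own:

import pandas as pd

# Hypothetical toy data: numeric features plus the churn target 'y'
df_train_toy = pd.DataFrame({'Age': [25, 52, 61, 33], 'IsActiveMember': [1, 0, 0, 1], 'y': [0, 1, 1, 1]})
df_test_toy = pd.DataFrame({'Age': [58, 29], 'IsActiveMember': [0, 1], 'y': [1, 0]})

print(MajorityClass.run_benchmark(df_train_toy, df_test_toy))  # [1 1]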

Just like with the metrics, we can also build custom benchmarks.

Let’s assume that in our business case, the marketing team contacts every client who is:

  • at least 50 years old, and
  • no longer an active member

Following this rule, we can build this model:

# Defining the business case-specific benchmark
class BusinessBenchmark():  
    @staticmethod  
    def run_benchmark(df_train, df_test):  
        df = df_test.copy()  
        df.loc[:,'y_hat'] = 0  
        df.loc[(df['IsActiveMember'] == 0) & (df['Age'] >= 50), 'y_hat'] = 1  
        return df['y_hat']

Running the benchmark

To run the benchmark, I will use the following class. The entry point is the compare_pred_with_benchmark() method: given a prediction, it runs all the benchmark models and calculates all the metrics for each of them.

import numpy as np

class ChurnBinaryBenchmark():
    def __init__(
        self,
        metrics = [],
        benchmark_models = [],
        ):
        self.metrics = metrics
        self.benchmark_models = benchmark_models

    def compare_pred_with_benchmark(
        self,
        df_train,
        df_test,
        my_predictions,
        ):
        # Metrics for the model's own predictions
        output_metrics = {
            'Prediction': self._calculate_metrics(df_test['y'], my_predictions)
        }
        dct_benchmarks = {}

        # Run every benchmark model and compute the same metrics on its output
        for model in self.benchmark_models:
            dct_benchmarks[model.__name__] = model.run_benchmark(df_train=df_train, df_test=df_test)
            output_metrics[f'Benchmark - {model.__name__}'] = self._calculate_metrics(df_test['y'], dct_benchmarks[model.__name__])

        return output_metrics

    def _calculate_metrics(self, y_true, y_pred):
        # Apply each metric function and key the result by its name
        return {getattr(func, '__name__', 'Unknown'): func(y_true=y_true, y_pred=y_pred) for func in self.metrics}

Now all we need is a prediction. For this example, I built a model with some quick feature engineering and hyperparameter tuning.
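
The tuned model itself is not the focus of this post, but a minimal sketch of this step could look something like the following, assuming df_train and df_test are the train/test splits of the dataset (the feature selection, hyperparameters, and the tp/tn/fp/fn helper metrics below are illustrative assumptions, not the exact code used for this article):

# Illustrative only: a lightly tuned XGBoost model on the numeric features
import numpy as np
import xgboost as xgb

model = xgb.XGBClassifier(n_estimators=300, max_depth=4, learning_rate=0.05)
model.fit(df_train.select_dtypes(include=np.number).drop(columns='y'), df_train['y'])
preds = model.predict(df_test.select_dtypes(include=np.number).drop(columns='y'))

# Assumed confusion-count helpers, referenced as metrics in the next snippet
def tp(y_true, y_pred): return int(np.sum((y_pred == 1) & (y_true == 1)))
def tn(y_true, y_pred): return int(np.sum((y_pred == 0) & (y_true == 0)))
def fp(y_true, y_pred): return int(np.sum((y_pred == 1) & (y_true == 0)))
def fn(y_true, y_pred): return int(np.sum((y_pred == 0) & (y_true == 1)))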

The last step is just running the benchmark:

from sklearn.metrics import f1_score, precision_score, recall_score
import pandas as pd

binary_benchmark = ChurnBinaryBenchmark(
    metrics=[f1_score, precision_score, recall_score, tp, tn, fp, fn, financial_gain],
    benchmark_models=[BinaryMean, SimpleXbg, MajorityClass, SimpleKNN, BusinessBenchmark]
    )

res = binary_benchmark.compare_pred_with_benchmark(
    df_train=df_train,
    df_test=df_test,
    my_predictions=preds,
)

pd.DataFrame(res)
[Image: Benchmark comparison table | Image by the author]

This generates a comparison table of all models across all metrics. From this table, you can draw concrete conclusions about your model’s predictions and make informed decisions about the next steps of the process.


Some drawbacks

As we have seen, there are plenty of reasons why having a benchmark is useful. Even so, there are some pitfalls to watch out for:

  1. Non-informative benchmark – When the metrics or models are poorly defined, the marginal impact of having a benchmark decreases. Always define meaningful baselines.
  2. Misinterpretation by stakeholders – Communication with the client is essential, so it is important to state clearly what each metric measures. The best model may not be the best on every defined metric.
  3. Overfitting to the benchmark – You might end up engineering features that are too specific: they may beat the benchmark but fail to generalize to new data. Don’t focus on beating the benchmark; focus on creating the best solution to the problem.
  4. Change of objective – The defined objective may change due to miscommunication or shifts in plans. Keep your benchmark flexible so it can adapt when needed.

Final thoughts

Benchmarks provide clarity, ensure improvements are measurable, and create a shared reference point between data scientists and clients. They help avoid the trap of assuming a model is performing well without evidence, and they ensure that each iteration brings real value.

They also act as a communication tool, making it easier to show progress to clients. Instead of only presenting numbers, you can show clear comparisons that highlight improvements.

Here you can find a notebook with the full implementation from this blog post.
