
Survival Analysis When No One Dies: A Value-Based Approach

Survival analysis is a statistical approach used to answer the question: “How long will something last?” That something could be anything from a patient’s lifespan to the durability of a machine component or the duration of a user’s subscription.

One of the most widely used tools in this field is the Kaplan-Meier estimator.

Born in the world of biology, Kaplan-Meier made its debut tracking life and death. But, like any real celebrity algorithm, it didn’t stay in its lane. Today, it shows up in business dashboards, marketing teams and churn analyses.

But here’s the catch: business is not biology. It’s messy, unpredictable, and full of plot twists. That’s why, when we try to use survival analysis in the business world, a couple of problems make our lives more difficult.

First, we are usually interested not only in whether a customer has “survived” (whatever that means in this context), but in how much of that individual’s economic value has survived.

Second, contrary to biology, customers can very well “die” and “resurrect” multiple times (think of when you unsubscribe from and resubscribe to an online service).

In this article, we will see how to extend the classic Kaplan-Meier approach to make it better suited to our needs: modeling a continuous (economic) value instead of a binary one (life/death), and allowing for “resurrections”.

Refresher on the Kaplan-Meier Estimator

Let’s pause and rewind for a second. Before we start customizing Kaplan-Meier to meet our business needs, we need a quick refresher on how the classic version works.

Suppose you have 3 subjects (e.g., lab mice) and you give them a drug you need to test. The drug is given at different moments: subject a receives it in January, subject b in April, and subject c in May.

Then, you measure how long they survive. Subject a dies 6 months later, subject c dies 4 months later, and subject b is still alive at the time of the analysis (November).

Graphically, we can represent the 3 subjects as follows:

[Image by Author]

Now, even if we just want to measure a simple metric, such as average survival, we face a problem. We don’t know how long subject b will survive, because it is still alive today.

This is a classic problem in statistics, called “right censoring”.

Right censoring is the statistician’s way of saying “we don’t know what happens after a certain point”, and it is a big deal in survival analysis. So big that it led to the development of one of the most iconic estimators in the history of statistics: the Kaplan-Meier estimator, named after the two statisticians who introduced it in the 1950s.

So, how does Kaplan-Meier handle our problems?

First, we align the clocks. Even if our mice were treated at different times, what matters is the time since treatment. So we reset the x-axis to zero for everyone: day zero is the day they got the drug.

[Image by Author]
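The clock alignment can be sketched in a few lines. This is only an illustration: the record layout and the names `subjects` and `ANALYSIS_MONTH` are assumptions made for the example, with calendar months encoded as integers (1 = January).

```python
# Calendar months as integers: 1 = January, ..., 11 = November.
ANALYSIS_MONTH = 11

# Hypothetical record layout: (subject, treatment month, death month or None).
subjects = [
    ("a", 1, 7),     # treated in January, died 6 months later
    ("b", 4, None),  # treated in April, still alive in November
    ("c", 5, 9),     # treated in May, died 4 months later
]

durations = []       # months elapsed since treatment
event_observed = []  # 1 = death observed, 0 = right-censored

for name, start, death in subjects:
    # For a censored subject, the clock stops at the analysis date.
    end = death if death is not None else ANALYSIS_MONTH
    durations.append(end - start)
    event_observed.append(0 if death is None else 1)

print(durations)       # [6, 7, 4]
print(event_observed)  # [1, 0, 1]
```

Once every subject is expressed as a duration plus a flag saying whether the death was actually observed, the calendar dates are no longer needed.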

Now that we are all on the same timeline, we want to build something useful: an aggregate survival curve. This curve tells us the probability that a typical mouse in our group survives at least x months after treatment.

Let’s follow the logic together.

  • Up to time 3? Everyone is still alive. Therefore survival = 100%. Simple.
  • At time 4, mouse c dies. This means that, of the 3 mice, only 2 survive past time 4. This gives us a 67% survival rate at time 4.
  • Then at time 6, mouse a checks out. Of the 2 mice that made it to time 6, only 1 survives, so the survival rate from time 5 to 6 is 50%. Multiply it by the previous 67%, and we get 33% survival up to time 6.
  • After time 7, we have no subjects left under observation, so the curve has to stop here.

Let’s plot these results:

[Image by Author]

Since code is often easier to understand than words, let’s translate this into Python. We have the following variables:

  • kaplan_meier, an array containing the Kaplan-Meier estimate for each time point, i.e. the probability of survival up to time t.
  • obs_t, a boolean array telling us whether each individual is observed (i.e. not right-censored) at time t.
  • surv_t, a boolean array telling us whether each individual is still alive at time t.
  • surv_t_minus_1, a boolean array telling us whether each individual is still alive at time t-1.

All we have to do is take all the individuals observed at t, compute their survival rate from t-1 to t (survival_rate_t), and multiply it by the survival rate up to time t-1 (kaplan_meier[t-1]) to obtain the survival rate up to time t (kaplan_meier[t]). In other words,

survival_rate_t = surv_t[obs_t].sum() / surv_t_minus_1[obs_t].sum()

kaplan_meier[t] = kaplan_meier[t-1] * survival_rate_t

Of course, the starting point is kaplan_meier[0] = 1.
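For readers who want to run the recursion end to end, here is a minimal sketch applied to the three mice. It assumes discrete monthly time steps; the way `obs_t` and the helper `alive_at` are built from durations and death flags is one reading of the definitions above, not necessarily the exact construction the author used.

```python
import numpy as np

durations = np.array([6, 7, 4])        # months since treatment for a, b, c
died = np.array([True, False, True])   # False = still alive when last seen

def alive_at(t):
    """True for subjects not known to have died by time t."""
    return ~(died & (durations <= t))

# The curve stops at the last time point at which anyone is still observed.
horizon = int(durations.max())

kaplan_meier = np.ones(horizon + 1)    # kaplan_meier[0] = 1
for t in range(1, horizon + 1):
    # A dead subject's status is known forever; a censored subject
    # is only known up to its last observed month.
    obs_t = died | (durations >= t)
    surv_t = alive_at(t)
    surv_t_minus_1 = alive_at(t - 1)

    survival_rate_t = surv_t[obs_t].sum() / surv_t_minus_1[obs_t].sum()
    kaplan_meier[t] = kaplan_meier[t - 1] * survival_rate_t

print(kaplan_meier.round(2))
```

The result is 1.0 through month 3, 0.67 at months 4 and 5, and 0.33 at months 6 and 7, which matches the step-by-step reasoning above. Going past month 7 would leave no living subject under observation and the ratio would become 0/0, which is why the loop stops at the largest duration.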

If you don’t want to code this from scratch, the Kaplan-Meier algorithm is available in the Python library lifelines, and can be used as follows:

from lifelines import KaplanMeierFitter

KaplanMeierFitter().fit(
    durations=[6,7,4],
    event_observed=[1,0,1],
).survival_function_["KM_estimate"]

If you run this code, you will get the same result we obtained manually with the calculation above.

So far, we have been hanging out in the land of mice, medicine and mortality. Not exactly your average quarterly KPI review, right? So, how is this useful for business?

Moving to a Business Setting

So far, we have treated “death” as something obvious. In Kaplan-Meier land, someone either lives or dies, and we can easily record the time of death. But now let’s stir in some real-world business messiness.

What even is “death” in a business context?

It turns out that answering this question is not easy for at least two reasons:

  1. “Death” is not easy to define. Suppose you work at an e-commerce company and you want to know when a user has “died”. Should you consider them dead when they delete their account? That is easy to track, but too rare to be useful. What if they just start shopping less? But how much less? A week of silence? One month? Two? You see the problem. The definition of “death” is arbitrary, and depending on where you draw the line, your analysis may tell very different stories.
  2. “Death” is not permanent. Kaplan-Meier was conceived for biological applications where, once an individual is dead, there is no coming back. But in business applications, resurrection is not only possible but quite frequent. Imagine a streaming service that people subscribe to on a monthly basis. In this case, “death” is easy to define: it happens when a user cancels the subscription. However, it is common for users to resubscribe some time after cancelling.

So how does all this play out in the data?

Let’s walk through a toy example. Suppose we have a user on our e-commerce platform. Here is how much they spent in each of the past 10 months:

[Image by Author]

To squeeze this into the Kaplan-Meier framework, we need to convert this spending behavior into a life-or-death decision.

So we make a rule: if a user stops spending for 2 consecutive months, we declare them “inactive”.

Graphically, this rule looks like this:

[Image by Author]

Since the user spent $0 for two consecutive months (months 4 and 5), we consider this user dead starting from month 4. And we do so even though the user starts spending again in month 7. This is because, in Kaplan-Meier, resurrections are assumed to be impossible.
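To see how much hinges on this rule, here is a small sketch of it. The spending figures are made up for illustration (the real ones are only shown in the chart), and `death_month` is a hypothetical helper, not part of any library.

```python
def death_month(spend, window=2):
    """Return the first month of the first run of `window` consecutive
    zero-spend months, or None if no such run exists."""
    run = 0
    for month, amount in enumerate(spend):
        run = run + 1 if amount == 0 else 0
        if run == window:
            return month - window + 1
    return None

# Made-up monthly spend for months 0..9: silent in months 4-6, back in month 7.
spend = [10, 8, 12, 9, 0, 0, 0, 7, 11, 6]

month = death_month(spend)  # 4
# Binary "alive" flags as Kaplan-Meier would see them: no resurrection allowed.
alive = [month is None or m < month for m in range(len(spend))]

print(month)                         # 4
print(alive)                         # True for months 0-3, False afterwards
print(death_month(spend, window=4))  # None: with a 4-month rule this user never "dies"
```

Changing `window` from 2 to 4 flips this user from dead at month 4 to never dead (in which case they would simply be treated as censored at the last observed month), which is exactly the arbitrariness described above.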

Now, let’s add two more users to our example. Since we have decided on a rule for turning their value curves into survival curves, we can also compute the Kaplan-Meier survival curve:

[Image by Author]

By now, you may have noticed how much nuance (and data) we have thrown away just to make this work. User a came back from the dead, but we ignored that. User c’s spending dropped sharply, but Kaplan-Meier doesn’t care, because all it sees is 1s and 0s. We forced a continuous value (spending) into a binary box (alive/dead), and along the way, we lost a lot of information.

So the question is: can we extend Kaplan-Meier in a way that:

  • keeps the original continuous data intact,
  • avoids any binary cutoff,
  • allows for resurrections?

Yes, we can. In the next section, I’ll show you how.

Introducing “Value Kaplan-Meier”

Let’s start with the simple Kaplan-Meier formula we’ve seen before.

# kaplan_meier: array containing the Kaplan-Meier estimates,
#               e.g. the probability of survival up to time t
# obs_t: array, whether a subject has been observed at time t
# surv_t: array, whether a subject was alive at time t
# surv_t_minus_1: array, whether a subject was alive at time t−1

survival_rate_t = surv_t[obs_t].sum() / surv_t_minus_1[obs_t].sum()

kaplan_meier[t] = kaplan_meier[t-1] * survival_rate_t

The first change we need to make is to replace surv_t and surv_t_minus_1, which are boolean arrays telling us whether a subject is alive (1) or dead (0), with arrays telling us the (economic) value of each subject at a given time. For this we can use two arrays named val_t and val_t_minus_1.

But that’s not enough. Since we are dealing with continuous values, each user is on a different scale, so, assuming we want to weight them equally, we need to rescale them based on some individual value. But which value should we use? The most reasonable option is their initial value at time 0, before they were affected by whatever treatment we are applying to them.

Therefore, we need another vector, named val_t_0, that represents each individual’s value at time 0.

# value_kaplan_meier: array containing the Value Kaplan-Meier estimates
# obs_t: array, whether a subject has been observed at time t
# val_t_0: array, user value at time 0
# val_t: array, user value at time t
# val_t_minus_1: array, user value at time t−1

value_rate_t = (
    (val_t[obs_t] / val_t_0[obs_t]).sum()
    / (val_t_minus_1[obs_t] / val_t_0[obs_t]).sum()
)

value_kaplan_meier[t] = value_kaplan_meier[t-1] * value_rate_t

What we have built is a direct generalization of Kaplan-Meier. In fact, if you set val_t = surv_t, val_t_minus_1 = surv_t_minus_1, and val_t_0 as an array of 1s, the formula collapses neatly back into our original survival estimator. So yes, it’s legit.
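Both the estimator and the claim that it collapses back to the classic one can be checked with a short sketch. The matrix layout (one row per user, one column per period, NaN for periods not yet observed), the function name `value_kaplan_meier` and the spending figures are all assumptions made for this example.

```python
import numpy as np

def value_kaplan_meier(values):
    """values: 2-D float array, one row per user, one column per period
    since time 0. NaN marks periods that have not been observed yet."""
    n_periods = values.shape[1]
    val_t_0 = values[:, 0]                 # each user's value at time 0
    curve = np.ones(n_periods)             # curve[0] = 1
    for t in range(1, n_periods):
        obs_t = ~np.isnan(values[:, t])    # users observed at time t
        val_t = values[:, t]
        val_t_minus_1 = values[:, t - 1]
        value_rate_t = (
            (val_t[obs_t] / val_t_0[obs_t]).sum()
            / (val_t_minus_1[obs_t] / val_t_0[obs_t]).sum()
        )
        curve[t] = curve[t - 1] * value_rate_t
    return curve

# Made-up monthly spend for three users with different observation lengths.
spend = np.array([
    [10.0,  5.0,  0.0,  0.0,  5.0,    5.0],   # goes silent, then comes back
    [20.0, 20.0, 10.0, 10.0, 10.0, np.nan],
    [40.0, 30.0, 20.0, 20.0, np.nan, np.nan],
])
value_curve = value_kaplan_meier(spend)

# Binary special case: the three mice (a dies at 6, b censored at 7, c dies at 4).
alive = np.array([
    [1, 1, 1, 1, 1, 1, 0, 0],
    [1, 1, 1, 1, 1, 1, 1, 1],
    [1, 1, 1, 1, 0, 0, 0, 0],
], dtype=float)
binary_curve = value_kaplan_meier(alive)

print(value_curve.round(2))
print(binary_curve.round(2))
```

With the 0/1 matrix, the function returns 1, 0.67 and 0.33 at the same time points as the mouse example. With the spending matrix, the curve rises again at month 4 when the first user resumes spending, something the binary estimator cannot express. Note that the sketch does not guard against a zero denominator, which would occur if every observed user had zero value at t-1.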

This is the curve we obtain when applying it to our 3 users.

[Image by Author]

Let’s call this new version the Value Kaplan-Meier estimator. In fact, it answers the question:

What percentage of value is still surviving, on average, after x time?

We’ve got the theory. But does it work in the wild?

Using Value Kaplan-Meier in Practice

If you try the Value Kaplan-Meier estimator on real-world data and compare it to the good old Kaplan-Meier curve, you may notice something comforting: they often have the same shape. That’s a good sign. It means we haven’t broken anything fundamental while upgrading from binary to continuous.

But here’s the interesting part: Value Kaplan-Meier usually sits a bit higher than its traditional cousin. Why? Because in this new world, users are allowed to “resurrect”. Kaplan-Meier, being the more rigid of the two, writes them off the moment they go quiet.

So how do we use it?

Imagine you are running an experiment. At time zero, you start a new treatment on a group of users. Whatever it is, you can track how much value “survives” in both the treatment and control groups over time.

This is what your output might look like:

[Image by Author]
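To make the comparison concrete, here is a minimal sketch using a compact vectorized form of the recursion given earlier in the article. The function name `value_km` and both data sets are invented for illustration; real experiments would have many more users and typically some censoring (NaN entries).

```python
import numpy as np

def value_km(values):
    """Value Kaplan-Meier curve for a (users x periods) float array.
    NaN marks periods not yet observed for a user."""
    norm = values / values[:, [0]]       # rescale each user by its value at time 0
    obs = ~np.isnan(norm[:, 1:])         # observed at t, for t = 1..T
    num = np.where(obs, norm[:, 1:], 0.0).sum(axis=0)    # normalized value at t
    den = np.where(obs, norm[:, :-1], 0.0).sum(axis=0)   # normalized value at t-1
    return np.concatenate([[1.0], np.cumprod(num / den)])

# Made-up spend per period for two small groups.
control = np.array([
    [10.0,  8.0,  4.0, 2.0],
    [20.0, 10.0, 10.0, 5.0],
])
treatment = np.array([
    [10.0,  9.0,  8.0,  8.0],
    [20.0, 16.0, 10.0, 12.0],
])

control_curve = value_km(control)
treatment_curve = value_km(treatment)
uplift = treatment_curve - control_curve  # gap in surviving value per period

print(control_curve.round(3))
print(treatment_curve.round(3))
print(uplift.round(3))
```

In this toy data the control group retains 22.5% of its initial value by the last period and the treatment group 70%. When no user is censored, as here, the curve is simply the average of each user's value relative to time 0; the recursion only starts to matter once users have different observation lengths.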

Conclusion

Kaplan-Meier is a widely used and intuitive method for estimating survival functions, especially when the result is a binary event such as death or failure. However, many real-world business scenarios involve more complexity – resurrection is possible, and the results can be better represented by continuous value rather than binary states.

In such cases, Value Kaplan-Meier provides a natural extension. By incorporating the economic value of each individual, it enables a more nuanced understanding of value retention and decay over time. The method preserves the simplicity and interpretability of the original Kaplan-Meier estimator while adapting it to better reflect the dynamics of customer behavior.

Compared with Kaplan-Meier, Value Kaplan-Meier tends to give a higher estimate of retained value, because it is able to account for recoveries. This makes it particularly useful when evaluating experiments or tracking customer value over time.
