The other day I got a call from a friend who had built a predictive maintenance model for drilling machines in her factory. The hope was that the model would accurately predict which machines were about to fail. When they had tested the model on past data, it was extremely accurate at predicting upcoming failures. But now, with the model in production, accuracy had apparently dropped sharply, and she was baffled. She wanted to know: what happened?
Here are two of the most glaring problems that illustrate the challenges of evaluating AI models as we move them from the lab to the real world:
Suppose you’ve built a predictive model for the churn and renewal risk of your SaaS customers (we, at StepFunction, have). How do you know it’s working, and working as well as it should be? It’s one thing to evaluate the model on historical data (see here for a discussion of various efficacy metrics, e.g. Precision and Recall). But once you put your model into action, you can no longer assess it based on whether “acted-upon” customers churn or not: if your Customer Success team intervenes and saves a flagged customer, the model’s prediction looks wrong even though it was right.
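To make the offline half of that concrete, here is a minimal sketch of the historical evaluation step, assuming a hypothetical labeled dataset where we already know which customers eventually churned; the labels and numbers below are illustrative only.

```python
# Offline evaluation sketch: Precision and Recall on historical data.
# The labels below are made up for illustration.
from sklearn.metrics import precision_score, recall_score

# 1 = customer actually churned, 0 = customer renewed (historical ground truth)
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
# 1 = model flagged the customer as at-risk, 0 = model did not
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

print("Precision:", precision_score(y_true, y_pred))  # of flagged customers, how many churned?
print("Recall:   ", recall_score(y_true, y_pred))     # of churned customers, how many were flagged?
```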
So what’s the right way to measure the impact of your model? Start by thinking hard about your objective for the project: is it to minimize churn rate, maximize NRR (Net Revenue Retained), or keep your Customer Success (CSM) costs to a minimum?
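As a rough illustration, here is how two of those candidate metrics are commonly computed; the revenue figures are made up, and the NRR formula shown is one common convention rather than a definition from this post.

```python
# Illustrative calculations of two candidate objective metrics.
# The exact NRR formula used here is one common convention; your finance
# team's definition may differ.

def churn_rate(customers_at_start: int, customers_lost: int) -> float:
    """Fraction of customers lost over the period."""
    return customers_lost / customers_at_start

def net_revenue_retained(starting_mrr: float, expansion: float,
                         contraction: float, churned_mrr: float) -> float:
    """NRR over the period, as a fraction of starting recurring revenue."""
    return (starting_mrr + expansion - contraction - churned_mrr) / starting_mrr

print(f"Churn rate: {churn_rate(1000, 40):.1%}")   # 40 of 1000 customers lost -> 4.0%
print(f"NRR: {net_revenue_retained(500_000, 30_000, 10_000, 20_000):.1%}")  # -> 100.0%
```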
Once you’ve decided on the business metrics you want to improve, you can monitor them over time. You’ll still have the problem of confounding variables. For example, suppose you deploy the churn-reducing AI model in Jan 2020, and six months later the churn rate is clearly down. Should you be overjoyed and declare the project a victory? No, because you cannot say for sure what caused the dip; in fact, many businesses saw reduced churn in the first months of the COVID-19 quarantines as customers avoided drastic action.
Nonetheless, you should certainly track the most important business metrics, e.g. Net Revenue Retained, over time. A significant uptick in NRR, sufficiently long after your AI model goes into production, everything else being equal, can and should give you a good feeling about the project. And you won’t fall into the trap of false excitement if you’re aware of most, if not all, factors that could be changing your key metrics.
The gold standard for testing the impact of a new business practice is to run an A/B experiment. If the churn AI model identifies 1000 at-risk customers in a certain cycle, randomly put X% in set A (the set to be acted upon and given to the CSM team) and the rest in set B (not given to the CSM team). The obvious problem here is that most business execs won’t be willing to stay silent on an identified set of customers and risk losing them to churn. Research in medical experiments gives us a ray of hope here, and in subsequent posts, we’ll explore techniques for conducting such experiments without going silent on identified at-risk customers.
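Mechanically, the split itself is simple; the sketch below shows one way to do it, assuming a hypothetical list of flagged customer IDs and a 50/50 split. The hard part, as noted above, is being willing to hold out set B and then waiting a full cycle to compare outcomes.

```python
# Randomized A/B assignment of at-risk customers (illustrative sketch).
import random

random.seed(42)  # fixed seed so the split is reproducible
at_risk = [f"cust_{i}" for i in range(1000)]  # customers flagged by the churn model
random.shuffle(at_risk)

treatment_pct = 0.5  # the "X%" handed to the CSM team
cut = int(len(at_risk) * treatment_pct)
set_a = at_risk[:cut]   # acted upon by the CSM team
set_b = at_risk[cut:]   # held out: no proactive CSM intervention

# Months later, once renewal outcomes are known, compare churn rates:
def observed_churn_rate(group: list[str], churned_ids: set[str]) -> float:
    return sum(1 for c in group if c in churned_ids) / len(group)
```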
Now suppose your CSM team already had some methodology in place for identifying customers at risk of churn (a lot of SaaS businesses do). The team used triggers, e.g. “weekly product usage drops by more than 50%,” to flag at-risk customers; call it the Trigger model, or Model T. Now, starting in Jan 2021, they’ve put the new AI-based predictive model in place; call it Model P. Your objective might now shift slightly: you might want to know whether Model P is doing a better job than Model T of helping you reduce your churn rate. And even if it isn’t, is it additive to your existing model? Doing a fair comparison between Model T and Model P is still challenging, e.g. if there’s a big overlap between the sets of customers they identify or if the sizes of the sets are very different. But the comparison is doable, and having a baseline model to compare against is much better than not having one.
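One simple way to start that comparison is to bucket the flagged customers by which model caught them and then track realized churn per bucket. The sketch below uses made-up customer IDs and glosses over the set-size and overlap caveats mentioned above.

```python
# Comparing the customers flagged by the old trigger rules vs. the new model.
# The IDs are made up for illustration.
model_t = {"c1", "c2", "c3", "c4", "c5"}        # flagged by the trigger rules (Model T)
model_p = {"c3", "c4", "c5", "c6", "c7", "c8"}  # flagged by the predictive model (Model P)

both   = model_t & model_p  # customers both models agree on
p_only = model_p - model_t  # Model P's incremental (potentially "additive") catches
t_only = model_t - model_p  # customers only the old triggers catch

overlap = len(both) / len(model_t | model_p)    # Jaccard overlap of the two sets
print(f"Overlap: {overlap:.0%}, P-only: {len(p_only)}, T-only: {len(t_only)}")

# A fair comparison would then track realized churn (and CSM effort) separately
# for each bucket over the following cycles.
```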
It’s easy to get excited when something you’ve invested in shows good results over a short period and to conclude you’ve built something magical, but clear proof isn’t easy to come by. Here are 3 takeaways that can help your ROI efforts: