Introduction

At SnackNation, we strive to provide the right snacks and experience to each of our members (customers). To ensure we provide the best service, each member is assigned a Member Success Account Manager.

If our service or snack curation leaves a member dissatisfied, we risk losing their business. Member churn is one of the key indicators of success in a subscription business, so our team’s objective was to improve the interaction between our Member Success team and our active members.

Prior to our predictive churn model, our member success managers would manually go through all the member accounts in their respective portfolios and make experience- and hunch-based guesses about who was most likely to cancel their subscription. We saw an opportunity to improve on this approach by training a machine learning model that learns from historical instances to predict churn among current active members.

This machine learning model makes it easier for our member success team to find our most ‘at risk’ members and arms the team with insightful data about those members.

Methodology

Mind Map

The business intelligence (BI) team gathered to brainstorm ideas that could explain potential reasons for member churn. To visualize how different ideas relate to the primary problem and to each other, we created a mind map. Our BI team consists of individuals with diverse skill sets and backgrounds, including engineering, analytics, business, and entrepreneurship. Our broad perspectives helped us capture an array of different dimensions and points of view.

The Mind Map begins by defining a problem from an analytical perspective. In this case, it was to determine the key performance indicators of member churn. Next, we defined broad hypotheses that encapsulate factors affecting churn. As an example, one of our hypotheses was that frequency of communication with the member positively affects churn. We then dug deeper into each hypothesis and defined even more granular hypotheses. The leaf nodes of this map are the key performance indicators (KPIs), or measurable factors, that we can use to conduct analyses. For example, one such leaf node of the broad hypothesis mentioned above was ‘Number of calls exchanged’. The statistical test we could conduct is checking whether a correlation exists between the number of calls and churn. In simpler terms, we tried to find out whether a higher number of calls makes a member more susceptible to churn, or the reverse.
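The correlation check described for the ‘Number of calls exchanged’ leaf node could look roughly like the sketch below. The data is simulated purely for illustration, and the variable names (`calls`, `churned`) are assumptions, not columns from our warehouse. With a 0/1 churn flag, the Pearson correlation computed by `cor.test` is the point-biserial correlation.

```r
# Hypothetical data: one value per member (simulated for illustration only).
set.seed(42)
n <- 200
calls <- rpois(n, lambda = 4)   # number of calls exchanged with each member

# Simulate churn so that more calls slightly raise the churn probability.
churned <- rbinom(n, size = 1, prob = plogis(-1.5 + 0.3 * calls))

# Pearson correlation against a 0/1 outcome (point-biserial correlation),
# together with a test of the null hypothesis of zero correlation.
result <- cor.test(calls, churned)

cat("correlation:", round(result$estimate, 3), "\n")
cat("p-value:", signif(result$p.value, 3), "\n")
```

A small p-value would only indicate an association between calls and churn; it would not tell us which one drives the other.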

Why this step is important:

  1. It helps us organize our thoughts on the problem.
  2. It is easily digestible for upper management and non-technical stakeholders.
  3. It helps us recognize most of the factors/variables that we can feed our machine learning algorithm.

Data Gathering

The backbone of any good data science project is its data engineering architecture. A great data engineering solution can substantially reduce the data science effort. Luckily, we have a very skilled engineering and tech team at SnackNation, which ensured that all the required data sources were in a centralized data warehouse. For details, see the article “How a Data Warehouse Solved a Snack Company’s Data Problems”. With our data warehousing solution, we removed manual and repetitive workflows and were able to uncover insights faster by feeding live data into our machine learning solution.

After we established our data sources in a single place, we queried them using SQL and pulled all the variables we needed into a CSV file.

Processing

Our data science team primarily uses R. With the plethora of R libraries available and supported by thousands of developers and data scientists, our team rarely needs to go outside the R ecosystem. One such popular library we used is caret. It integrates all activities related to machine learning model development in a streamlined workflow.

Train Test Split

We split our dataset into three parts: training, testing, and validation. The split is stratified to keep the class distribution similar across the sets. We used the initial_split function, provided by the `rsample` package, to divide the complete set into training and testing chunks of 70% and 30%, respectively.

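library(rsample)

# 70% training, 30% testing; strata keeps the class balance similar in both sets
train_test_split <- initial_split(churn_data_frame, prop = 0.7, strata = target_variable)
train_df <- training(train_test_split)
test_df <- testing(train_test_split)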
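
The snippet above returns two sets. One way to obtain all three sets in a single stratified pass is sketched below in base R. The helper name `stratified_three_way_split`, the 70/15/15 proportions, and the simulated data frame are all assumptions made for illustration; they are not taken from our production code.

```r
# Hypothetical helper: stratified train/validation/test split in base R.
# The 70/15/15 proportions are an assumption for illustration.
stratified_three_way_split <- function(df, target, p_train = 0.7, p_valid = 0.15) {
  assignment <- character(nrow(df))
  # Assign rows to sets separately within each class to preserve the class mix.
  for (rows in split(seq_len(nrow(df)), df[[target]])) {
    rows <- rows[sample.int(length(rows))]          # shuffle within the class
    n <- length(rows)
    n_train <- round(p_train * n)
    n_valid <- min(round(p_valid * n), n - n_train) # never exceed remaining rows
    n_test <- n - n_train - n_valid                 # whatever is left over
    assignment[rows] <- rep(c("train", "validation", "test"),
                            times = c(n_train, n_valid, n_test))
  }
  split(df, factor(assignment, levels = c("train", "validation", "test")))
}

# Simulated data frame standing in for the real churn data.
set.seed(1)
demo_df <- data.frame(
  x = rnorm(1000),
  target_variable = factor(sample(c("Will_Churn", "Will_Not_Churn"), 1000,
                                  replace = TRUE, prob = c(0.6, 0.4)))
)

parts <- stratified_three_way_split(demo_df, "target_variable")
print(sapply(parts, nrow))
# Share of churners in each set; these should be close to one another.
print(sapply(parts, function(p) mean(p$target_variable == "Will_Churn")))
```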

Transformations

We applied a few transformations and basic pre-processing steps to the training dataset, as follows:

  1. Removing features such as primary keys, names, addresses, and columns with unique values.
  2. Checking the number of NAs in each variable and removing variables that have less than 20% non-null values.
  3. Fixing the levels of features with underrepresented values. For example, one of the levels of the variable ‘snack_budget’ was $3000-$4000. Very few records in our data had this value (1–2 cases). To tackle this, we created a new level called ‘$2000+’ and grouped all corresponding values under it. This fixed the distribution of the ‘snack_budget’ variable so that none of its levels was underrepresented. We treated other categorical variables similarly.
  4. Fixing the data types of dates, factors, and numerical variables.
  5. Handling NAs and missing values in accordance with business sense. For example, we used simple regression to fill in missing values for the lifetime (in days) of the customer.

For each of the above steps we wrote a user-defined function. These come in handy when we create a data transformation pipeline.
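
As a minimal sketch of what two such functions might look like (steps 2 and 3 above), consider the base R helpers below. The function names, the `min_count` threshold, and the demo values other than ‘$3000-$4000’ and ‘$2000+’ are hypothetical and chosen only for illustration.

```r
# Hypothetical helpers mirroring steps 2 and 3; names and thresholds are illustrative.

# Drop columns whose share of non-missing values is below `min_non_null`.
drop_sparse_columns <- function(df, min_non_null = 0.2) {
  non_null_share <- colMeans(!is.na(df))
  df[, non_null_share >= min_non_null, drop = FALSE]
}

# Merge levels with fewer than `min_count` rows into a single new level.
collapse_rare_levels <- function(x, min_count, new_level) {
  x <- as.character(x)
  counts <- table(x)
  rare <- names(counts)[counts < min_count]
  x[x %in% rare] <- new_level
  factor(x)
}

# Made-up demo data: two budget levels are rare, one column is mostly missing.
demo <- data.frame(
  snack_budget = c(rep("$0-$1000", 6), rep("$1000-$2000", 5),
                   "$2000-$3000", "$3000-$4000"),
  mostly_missing = c(1, rep(NA, 12)),
  stringsAsFactors = FALSE
)

clean <- drop_sparse_columns(demo)
clean$snack_budget <- collapse_rare_levels(clean$snack_budget,
                                           min_count = 2, new_level = "$2000+")
print(table(clean$snack_budget))
```

Keeping each step in its own function means the exact same logic can later be applied to the test and validation sets.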

Feature Selection

We used the chi.squared function to get an idea of the importance of each feature. We also used the Boruta feature selection method. Based on these weights and importance measures, we selected the top 10 variables.
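
To give a feel for chi-squared based ranking, the sketch below uses base R's `chisq.test` as a stand-in; it is not the chi.squared function we used, and the simulated features `plan` and `region` are hypothetical. Each categorical feature is cross-tabulated against the target, and features are ranked by the resulting test statistic.

```r
# Simulated data for illustration: `plan` is related to churn, `region` is not.
set.seed(7)
n <- 500
churn <- factor(sample(c("yes", "no"), n, replace = TRUE))
plan <- factor(ifelse(
  churn == "yes",
  sample(c("basic", "premium"), n, replace = TRUE, prob = c(0.8, 0.2)),
  sample(c("basic", "premium"), n, replace = TRUE, prob = c(0.3, 0.7))
))
region <- factor(sample(c("east", "west"), n, replace = TRUE))
features <- data.frame(plan = plan, region = region)

# Chi-squared statistic of each categorical feature against the target.
chi_sq_scores <- sapply(features, function(feature) {
  unname(chisq.test(table(feature, churn))$statistic)
})

# Rank features from most to least associated with churn.
ranked <- sort(chi_sq_scores, decreasing = TRUE)
print(ranked)
```

In practice, the top of such a ranking would be compared with the output of a second method, as we did with Boruta, before fixing the final feature list.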

Pipelining

Pipelines reduce work by streamlining repeated tasks. As mentioned earlier, we wrote functions for data pre-processing. These functions can now be chained in a pipeline and applied to the test and validation datasets.

library(magrittr)  # provides the %>% pipe operator

validation_set_processed <- validation_set %>%
  data_processing_step_one() %>%
  data_processing_step_two() %>%
  data_processing_step_three()

Machine Learning

Our problem is a classic case of supervised learning, with the target variable being ‘Will_Churn?’. Looking at similar studies done across the industry, we found that Random Forests, Decision Trees, and Gradient Boosting Machines were rated as the most effective algorithms. Of these, we ended up selecting Gradient Boosting, which gave us the best balance between accuracy and generalizability of results.

For GBM, we started by defining the trainControl settings as follows:

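library(caret)

# Resampling setup: repeated k-fold cross-validation
gbm.ctrl <- trainControl(
  method = "repeatedcv",
  repeats = 5,                        # repeat the cross-validation 5 times
  classProbs = TRUE,                  # class probabilities are needed for ROC
  summaryFunction = twoClassSummary,  # report ROC, sensitivity, and specificity
  savePredictions = TRUE,
  verboseIter = FALSE
)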

Further, for hyperparameter tuning we used Grid Search defined as follows:

gbmGrid <- expand.grid(
  interaction.depth = 1:5,
  n.trees = (1:10) * 25,
  shrinkage = 0.1,
  n.minobsinnode = 10
)
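
To see how large this search is, the grid can be inspected with base R alone, as in the sketch below. The resample count assumes caret's default of 10 folds for "repeatedcv", since the trainControl call above does not set the number of folds explicitly.

```r
# Rebuild the tuning grid and inspect its size.
gbmGrid <- expand.grid(
  interaction.depth = 1:5,
  n.trees = (1:10) * 25,
  shrinkage = 0.1,
  n.minobsinnode = 10
)

n_candidates <- nrow(gbmGrid)  # 5 tree depths x 10 tree counts
n_resamples <- 10 * 5          # assumed 10 folds (caret default) x 5 repeats

cat("candidate settings:", n_candidates, "\n")
cat("resamples per candidate:", n_resamples, "\n")
cat("range of n.trees:", range(gbmGrid$n.trees), "\n")
```

Only the tree depth and the number of trees vary here; shrinkage and the minimum node size are held fixed, which keeps the search small.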

Finally, the train function looked as follows:

gbm.fit <- train(
  target_variable ~ .,
  data = train_df,
  method = "gbm",
  metric = "ROC",                   # pick the best model by ROC, as reported by twoClassSummary
  preProc = c("center", "scale"),
  trControl = gbm.ctrl,
  verbose = FALSE,
  tuneGrid = gbmGrid
)

Model Evaluation

The caret package provides functions to test how well the algorithm performs on a validation or test set. We chose the ROC statistic to see how well our algorithm generalizes to unseen datasets. As the diagram below shows, a tree depth of 3 over 140 iterations gives us the best area under the curve.

We can also use the confusionMatrix function, which provides a much more digestible form of model evaluation. It reports true positives, false positives, accuracy, sensitivity, and more.
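
For intuition about the area under the ROC curve, here is a minimal base R sketch that computes it from ranks. The function name `auc_from_scores` and the toy scores are made up for illustration; caret computes this statistic for us during training.

```r
# Rank-based AUC: the probability that a randomly chosen positive case
# receives a higher score than a randomly chosen negative case.
auc_from_scores <- function(scores, is_positive) {
  ranks <- rank(scores)  # average ranks are used for tied scores
  n_pos <- sum(is_positive)
  n_neg <- sum(!is_positive)
  (sum(ranks[is_positive]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

# Toy example: predicted churn probabilities and whether the member churned.
scores <- c(0.10, 0.40, 0.35, 0.80)
churned <- c(FALSE, FALSE, TRUE, TRUE)

auc <- auc_from_scores(scores, churned)
print(auc)  # 0.75: three of the four churner/non-churner pairs are ordered correctly
```

An AUC of 0.5 corresponds to random ordering and 1.0 to a perfect ranking of churners above non-churners.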

Confusion Matrix and Statistics

                Reference
Prediction       Will_Not_Churn Will_Churn
  Will_Not_Churn           1070        250
  Will_Churn                184       1657

               Accuracy : 0.8627
                 95% CI : (0.8502, 0.8745)
    No Information Rate : 0.6033
    P-Value [Acc > NIR] : < 2.2e-16
                  Kappa : 0.7157
 Mcnemar's Test P-Value : 0.001808
            Sensitivity : 0.8533
            Specificity : 0.8689
         Pos Pred Value : 0.8106
         Neg Pred Value : 0.9001
             Prevalence : 0.3967
         Detection Rate : 0.3385
   Detection Prevalence : 0.4176
      Balanced Accuracy : 0.861
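
The headline statistics can be re-derived by hand from the four counts in the matrix, which is a useful sanity check on how to read it. The sketch below does this in base R; it takes ‘Will_Not_Churn’ as the positive class, which is what the reported sensitivity and specificity imply.

```r
# Counts copied from the confusion matrix (rows = prediction, columns = reference).
cm <- matrix(
  c(1070, 184, 250, 1657),
  nrow = 2,
  dimnames = list(
    Prediction = c("Will_Not_Churn", "Will_Churn"),
    Reference = c("Will_Not_Churn", "Will_Churn")
  )
)

positive <- "Will_Not_Churn"  # treated as the positive class in the output
negative <- "Will_Churn"

accuracy <- sum(diag(cm)) / sum(cm)
sensitivity <- cm[positive, positive] / sum(cm[, positive])  # true positive rate
specificity <- cm[negative, negative] / sum(cm[, negative])  # true negative rate
pos_pred_value <- cm[positive, positive] / sum(cm[positive, ])
balanced_accuracy <- (sensitivity + specificity) / 2

print(round(c(
  accuracy = accuracy,
  sensitivity = sensitivity,
  specificity = specificity,
  pos_pred_value = pos_pred_value,
  balanced_accuracy = balanced_accuracy
), 4))
```

Because the positive class is ‘Will_Not_Churn’, the specificity of 0.8689 is the share of actual churners that the model flags, which is the number the member success team cares about most.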

Deployment using R Shiny

After creating and testing a model, we needed to deploy it through an application that the member success team could use. We leveraged the R Shiny package developed by RStudio, which enabled us to build web-based applications in R. This seemed like the best solution, since our team did not have to move outside the R environment to code simple web applications.

At the time of deployment, the initial version of our application looked like this:

R Shiny Application for Predicting Churn Probability

In Summary

We can summarize the above process as follows:

  1. We started our data science project with a mind map discussion to recognize important data points to explore.
  2. We gathered the data sources using our data engineering infrastructure to create the analytical dataset.
  3. Once we had the required dataset, we cleaned and transformed the data for machine learning use.
  4. We then compared various machine learning models, tuning them using the validation set and testing them on unseen data.
  5. Finally, to share all our hard work with the end user, we made use of the data scientist-friendly R Shiny web framework.

For our team, the next step is to improve the algorithm by including more data points that will be made available to us by new systems being introduced at the company. In future versions, we also intend to use sentiment analysis of customer voice calls and emails as an additional indicator of churn risk.

Since this was one of the first machine learning projects at the company, there is a lot of scope for improvement in the workflow. Please share your thoughts and possible improvements in the comments below. Also, stay tuned for many more exciting Data Science projects coming up at SnackNation.

Author

Data Scientist at SnackNation
