
Predict Churn (Post Feedback)

Expanding on the previous EDA work and training models at multiple cutoff dates
The given dataset has three fields:
  1. Merchant ID as 'merchant'
  2. Time of transaction as 'time'
  3. Transaction amount in cents 'amount_usd_in_cents'
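For context, the earlier EDA rolled these transactions up to one row per merchant. That rollup code isn't reproduced in this report, but a minimal sketch (an assumed reconstruction, not the original code) of how a merchant_data frame with an avg_time_bw_payments column could be rebuilt from the raw file might look like this:

import pandas as pd

# Hypothetical reconstruction of the earlier merchant-level rollup
payments = pd.read_csv("payments_data.csv")
payments['time'] = pd.to_datetime(payments['time'])
payments = payments.sort_values(['merchant', 'time'])

# Days since the same merchant's previous payment (NaN for each merchant's first payment)
payments['time_from_last_order'] = payments.groupby('merchant')['time'].diff().dt.days

# One row per merchant: average number of days between successive payments
merchant_data = (payments.groupby('merchant')['time_from_last_order']
                 .mean()
                 .rename('avg_time_bw_payments')
                 .reset_index())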
Using this rolled-up data from previous work to understand when a customer might be considered churned:


Working on the merchant-level rollup data we created to find churn
merchant_data['avg_time_bw_payments'] = merchant_data['avg_time_bw_payments'].fillna(1)

pd.cut(merchant_data['avg_time_bw_payments'],
bins = [0,30,60,90,120,150,180,210,240,270,300,330,360,390,420,
450,480,510,540,570,600,630,660,690,720]).value_counts(normalize = True).sort_index().cumsum().plot(kind = 'bar');
Average number of days between successive payments:


Churn Identification

We could say that anyone who hasn't returned within 90 days of their last payment is churned, because the chart above shows that fewer than 5% of customers take more than 90 days on average between successive purchases.
To model churn, we need:
  1. Merchant behavior - i.e. transaction history for a period of time
  2. Merchant churn - i.e. in the days / months following the transaction history, information on whether the merchant churned or not.
Since we have a two-year history of merchant transactions, we can train models at several cutoff dates, each with a 90-day post-period that moves along with the cutoff.
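Concretely, for a given cutoff the features come from payments on or before the cutoff, and the label comes from the 90 days that follow. A minimal sketch of that split, reusing the payments frame loaded above (the full feature pipeline is in the function below):

from datetime import timedelta
import pandas as pd

cutoff = pd.to_datetime("2034-03-31")   # one of the cutoff dates used later

# assumes payments['time'] has already been parsed to datetime
payments_pre = payments[payments['time'] <= cutoff]                          # feature window
payments_post = payments[(payments['time'] > cutoff) &
                         (payments['time'] <= cutoff + timedelta(days=90))]  # label window

# A merchant with no payment in the 90-day post-window is treated as churned
active_post = set(payments_post['merchant'])
churn = {m: m not in active_post for m in payments_pre['merchant'].unique()}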
Using the following:
  1. A Cutoff date
  2. Payments data
  3. An empty DataFrame that collects the model results

Feature Extraction and Modeling

We train three simple models and compare them against a naive baseline that always predicts "no churn".
The following code has three components:
  1. Merchant level rollups and log transformations
  2. Train models at multiple cutoffs
  3. Append metrics to a larger table for comparison
from datetime import timedelta

import numpy as np
import pandas as pd
import wandb
from sklearn import metrics
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

def train_a_few_models(pre_cutoff, payments, model_results):

payments['time'] = pd.to_datetime(payments['time'])
payments.sort_values(by = ['merchant', 'time'], ascending = [True, True], inplace = True)
payments_pre = payments[payments['time'] <= pd.to_datetime(pre_cutoff)].copy()

payments_post = payments[(payments['time'] > pd.to_datetime(pre_cutoff)) &
(payments['time'] <= pd.to_datetime(pre_cutoff) + timedelta(days=90))]

payments_pre['purchase_quarter'] = pd.to_datetime(payments_pre['time']).dt.quarter

merchant_q1 = payments_pre[payments_pre['purchase_quarter'] == 1].groupby(['merchant']).agg({'amount_usd_in_cents':
['sum', 'count']})
merchant_q1.columns = ['_'.join(t) for t in merchant_q1.columns]
merchant_q1.rename(columns = {'amount_usd_in_cents_sum':'q1_amount',
'amount_usd_in_cents_count':'q1_purchases'},
inplace = True)
merchant_q1['q1_aov'] = merchant_q1['q1_amount'] / merchant_q1['q1_purchases']

merchant_q2 = payments_pre[payments_pre['purchase_quarter'] == 2].groupby(['merchant']).agg({'amount_usd_in_cents':
['sum', 'count']})
merchant_q2.columns = ['_'.join(t) for t in merchant_q2.columns]
merchant_q2.rename(columns = {'amount_usd_in_cents_sum':'q2_amount',
'amount_usd_in_cents_count':'q2_purchases'},
inplace = True)

merchant_q2['q2_aov'] = merchant_q2['q2_amount'] / merchant_q2['q2_purchases']

merchant_q3 = payments_pre[payments_pre['purchase_quarter'] == 3].groupby(['merchant']).agg({'amount_usd_in_cents':
['sum', 'count']})
merchant_q3.columns = ['_'.join(t) for t in merchant_q3.columns]
merchant_q3.rename(columns = {'amount_usd_in_cents_sum':'q3_amount',
'amount_usd_in_cents_count':'q3_purchases'},
inplace = True)

merchant_q3['q3_aov'] = merchant_q3['q3_amount'] / merchant_q3['q3_purchases']

merchant_q4 = payments_pre[payments_pre['purchase_quarter'] == 4].groupby(['merchant']).agg({'amount_usd_in_cents':
['sum', 'count']})
merchant_q4.columns = ['_'.join(t) for t in merchant_q4.columns]
merchant_q4.rename(columns = {'amount_usd_in_cents_sum':'q4_amount',
'amount_usd_in_cents_count':'q4_purchases'},
inplace = True)

merchant_q4['q4_aov'] = merchant_q4['q4_amount'] / merchant_q4['q4_purchases']

# first payment date
merchant_first_payment = pd.DataFrame(payments_pre.groupby(['merchant'])['time'].min().dt.date).reset_index().rename(columns =
{'time':'first_payment'})

# last payment date
merchant_last_payment = pd.DataFrame(payments_pre.groupby(['merchant'])['time'].max().dt.date).reset_index().rename(columns =
{'time':'last_payment'})

# total payment amount
merchant_total_payment = pd.DataFrame(payments_pre.groupby(['merchant'])['amount_usd_in_cents'].sum()).reset_index().rename(columns =
{'amount_usd_in_cents':'total_spend'})

# number of payments in the pre-cutoff window
merchant_num_payments = pd.DataFrame(payments_pre.groupby(['merchant'])['time'].count()).reset_index().rename(columns =
{'time':'num_payments'})

# days since the same merchant's previous payment
payments_pre['time_from_last_order'] = payments_pre.groupby(['merchant'])['time'].diff().dt.days
mask = payments_pre['merchant'] != payments_pre['merchant'].shift(1)
payments_pre.loc[mask, 'time_from_last_order'] = np.nan

# average time between payments in the pre-cutoff window
merchant_time_bw_payments = pd.DataFrame(payments_pre.groupby('merchant')['time_from_last_order'].mean()).reset_index().rename(columns =
{'time_from_last_order':
'avg_time_bw_payments'})

merchant_level = merchant_first_payment.merge(merchant_last_payment,
how = 'inner',
on = 'merchant').merge(merchant_total_payment,
how = 'inner',
on = 'merchant').merge(merchant_num_payments,
how = 'inner',
on = 'merchant').merge(merchant_time_bw_payments,
how = 'inner',
on = 'merchant').merge(merchant_q1,
how = 'left',
on = 'merchant').merge(merchant_q2,
how = 'left',
on = 'merchant').merge(merchant_q3,
how = 'left',
on = 'merchant').merge(merchant_q4,
how = 'left',
on = 'merchant')

merchant_level['time_between_first_and_last_payment'] = (pd.to_datetime(merchant_level['last_payment']) - pd.to_datetime(merchant_level['first_payment'])).dt.days + 1

quarter_cols = ['q1_amount', 'q1_purchases', 'q1_aov',
'q2_amount', 'q2_purchases', 'q2_aov',
'q3_amount', 'q3_purchases', 'q3_aov',
'q4_amount', 'q4_purchases', 'q4_aov']
merchant_level[quarter_cols] = merchant_level[quarter_cols].fillna(0)

# recency features measured as of the cutoff date
merchant_level['time_since_first_payment'] = (pd.to_datetime(pre_cutoff) - pd.to_datetime(merchant_level['first_payment'])).dt.days

merchant_level['time_since_last_payment'] = (pd.to_datetime(pre_cutoff) - pd.to_datetime(merchant_level['last_payment'])).dt.days

merchant_level['acquisition_month'] = pd.to_datetime(merchant_level['first_payment']).dt.to_period('M')

merchant_level['acquisition_quarter'] = pd.to_datetime(merchant_level['first_payment']).dt.quarter

merchant_level = pd.concat([merchant_level,
pd.get_dummies(merchant_level['acquisition_quarter'],
prefix = 'acquisition_quarter')],
axis = 1)

merchant_level['acquisition_year'] = pd.to_datetime(merchant_level['first_payment']).dt.year

merchant_level = pd.concat([merchant_level,
pd.get_dummies(merchant_level['acquisition_year'],
prefix = 'acquisition_year')],
axis = 1)

merchant_level['num_payments_per_day_in_life'] = merchant_level['num_payments'] / merchant_level['time_between_first_and_last_payment']

merchant_level['av_order_value'] = merchant_level['total_spend'] / merchant_level['num_payments']

merchant_level['avg_time_bw_payments'] = merchant_level['avg_time_bw_payments'].fillna(1)

merchant_level['log_amt_in_cents'] = np.log(merchant_level['total_spend'] + 1)
merchant_level['log_q1_amount'] = np.log(merchant_level['q1_amount'] + 1)
merchant_level['log_q2_amount'] = np.log(merchant_level['q2_amount'] + 1)
merchant_level['log_q3_amount'] = np.log(merchant_level['q3_amount'] + 1)
merchant_level['log_q4_amount'] = np.log(merchant_level['q4_amount'] + 1)

merchant_level['log_num_payments'] = np.log(merchant_level['num_payments'] + 1)
merchant_level['log_q1_purchases'] = np.log(merchant_level['q1_purchases'] + 1)
merchant_level['log_q2_purchases'] = np.log(merchant_level['q2_purchases'] + 1)
merchant_level['log_q3_purchases'] = np.log(merchant_level['q3_purchases'] + 1)
merchant_level['log_q4_purchases'] = np.log(merchant_level['q4_purchases'] + 1)

merchant_level['log_time_bw_payments'] = np.log(merchant_level['avg_time_bw_payments'])

merchant_level['log_av_order_value'] = np.log(merchant_level['av_order_value'] + 1)
merchant_level['log_q1_aov'] = np.log(merchant_level['q1_aov'] + 1)
merchant_level['log_q2_aov'] = np.log(merchant_level['q2_aov'] + 1)
merchant_level['log_q3_aov'] = np.log(merchant_level['q3_aov'] + 1)
merchant_level['log_q4_aov'] = np.log(merchant_level['q4_aov'] + 1)

merchant_level['log_num_payments_per_day_in_life'] = np.log(merchant_level['num_payments_per_day_in_life'])

merchant_level['log_time_between_first_and_last_payment'] = np.log(merchant_level['time_between_first_and_last_payment'] + 1)
merchant_level['log_time_since_first_payment'] = np.log(merchant_level['time_since_first_payment'] + 1)
merchant_level['log_time_since_last_payment'] = np.log(merchant_level['time_since_last_payment'] + 1)

merchant_post = payments_post['merchant'].unique()

# churned = no payment in the 90-day window after the cutoff
merchant_level['churn'] = ~merchant_level['merchant'].isin(merchant_post)

merchant_level = merchant_level[(merchant_level['av_order_value'] < 2000000) &
(merchant_level['num_payments_per_day_in_life'] < 1000)]

columns_to_use = ['merchant',
'avg_time_bw_payments',
'log_amt_in_cents',
'log_q1_amount',
'log_q2_amount',
'log_q3_amount',
'log_q4_amount',
'log_num_payments',
'log_q1_purchases',
'log_q2_purchases',
'log_q3_purchases',
'log_q4_purchases',
'log_time_between_first_and_last_payment',
'log_time_since_first_payment',
'log_time_since_last_payment',
'churn']

features_data = merchant_level[columns_to_use]
y = features_data['churn']
X = features_data.drop(columns = 'churn')
X_id = X['merchant']
X_train, X_test, y_train, y_test = train_test_split(X, y,
test_size = 0.20,
random_state = 0)

X_train_id = X_train['merchant']
X_test_id = X_test['merchant']
X_train = X_train.drop(columns = 'merchant')
X_test = X_test.drop(columns = 'merchant')

#### Baseline model
from sklearn import metrics

y_pred_test = [False]*len(y_test)
y_pred_train = [False]*len(y_train)

wandb.log({'accuracy': metrics.accuracy_score(y_test, y_pred_test),
'roc-auc-score': np.nan,
'precision': metrics.precision_score(y_test, y_pred_test, zero_division = 0),
'recall': metrics.recall_score(y_test, y_pred_test)})

results = {'Algorithm':'Baseline_False',
'ROC AUC Score':np.nan,
'Accuracy':metrics.accuracy_score(y_test, y_pred_test),
'Precision':metrics.precision_score(y_test, y_pred_test, zero_division = 0),
'Recall':metrics.recall_score(y_test, y_pred_test)}

model_results = pd.concat([model_results, pd.DataFrame([results])], ignore_index = True)

### simple logistic regression

scaler = StandardScaler()

X_train_scaled = pd.DataFrame(scaler.fit_transform(X_train))
X_train_scaled.columns = X_train.columns.values
X_train_scaled.index = X_train.index.values

X_test_scaled = pd.DataFrame(scaler.transform(X_test))
X_test_scaled.columns = X_test.columns.values
X_test_scaled.index = X_test.index.values

from sklearn.linear_model import LogisticRegression
from sklearn import metrics

model = LogisticRegression(random_state = 42)
model.fit(X_train_scaled, y_train)

prediction_test = model.predict(X_test_scaled)

wandb.log({'accuracy': metrics.accuracy_score(y_test, prediction_test),
'roc-auc-score': metrics.roc_auc_score(y_test, model.predict_proba(X_test_scaled)[:,1]),
'precision': metrics.precision_score(y_test, prediction_test),
'recall': metrics.recall_score(y_test, prediction_test)})

results = {'Algorithm':'Simple Logistic Regression',
'ROC AUC Score':metrics.roc_auc_score(y_test, model.predict_proba(X_test_scaled)[:,1]),
'Accuracy':metrics.accuracy_score(y_test, prediction_test),
'Precision':metrics.precision_score(y_test, prediction_test),
'Recall':metrics.recall_score(y_test, prediction_test)}

model_results = pd.concat([model_results, pd.DataFrame([results])], ignore_index = True)

### The data was slightly imbalanced; let's see if class weighting makes any difference

model = LogisticRegression(random_state = 42, class_weight='balanced')
model.fit(X_train_scaled, y_train)

prediction_test = model.predict(X_test_scaled)

wandb.log({'accuracy': metrics.accuracy_score(y_test, prediction_test),
'roc-auc-score': metrics.roc_auc_score(y_test, model.predict_proba(X_test_scaled)[:,1]),
'precision': metrics.precision_score(y_test, prediction_test),
'recall': metrics.recall_score(y_test, prediction_test)})

results = {'Algorithm':'Simple Logistic Regression with weight balancing',
'ROC AUC Score':metrics.roc_auc_score(y_test, model.predict_proba(X_test_scaled)[:,1]),
'Accuracy':metrics.accuracy_score(y_test, prediction_test),
'Precision':metrics.precision_score(y_test, prediction_test),
'Recall':metrics.recall_score(y_test, prediction_test)}

model_results = pd.concat([model_results, pd.DataFrame([results])], ignore_index = True)

### Simple RF

# Train model

clf_rf = RandomForestClassifier(random_state=42)
clf_rf.fit(X_train, y_train)

# Predict on test set
pred_y_rf = clf_rf.predict(X_test)

wandb.log({'accuracy': metrics.accuracy_score(y_test, pred_y_rf),
'roc-auc-score': metrics.roc_auc_score(y_test, clf_rf.predict_proba(X_test)[:,1]),
'precision': metrics.precision_score(y_test, pred_y_rf),
'recall': metrics.recall_score(y_test, pred_y_rf)})

results = {'Algorithm':'Simple Random Forest Classifier',
'ROC AUC Score':metrics.roc_auc_score(y_test, clf_rf.predict_proba(X_test)[:,1]),
'Accuracy':metrics.accuracy_score(y_test, pred_y_rf),
'Precision':metrics.precision_score(y_test, pred_y_rf),
'Recall':metrics.recall_score(y_test, pred_y_rf)}

model_results = pd.concat([model_results, pd.DataFrame([results])], ignore_index = True)

recall = model_results[['Algorithm','Recall']].rename(columns = {'Recall':pre_cutoff}).set_index('Algorithm').round(2)
precision = model_results[['Algorithm','Precision']].rename(columns = {'Precision':pre_cutoff}).set_index('Algorithm').round(2)
accuracy = model_results[['Algorithm','Accuracy']].rename(columns = {'Accuracy':pre_cutoff}).set_index('Algorithm').round(2)
roc = model_results[['Algorithm','ROC AUC Score']].rename(columns = {'ROC AUC Score':pre_cutoff}).set_index('Algorithm').round(2)

return recall, precision, accuracy, roc

Calling it at multiple cutoff dates:
pre_cutoff_dates = ["2033-12-31", "2034-01-31", "2034-02-28", "2034-03-31", "2034-04-30",
"2034-05-31", "2034-06-30", "2034-07-31", "2034-08-31", "2034-09-30"]

payments = pd.read_csv("payments_data.csv")
recalls = []
precisions = []
accuracies = []
rocs = []

for pre_cutoff_date in pre_cutoff_dates:
col = ['Algorithm', 'ROC AUC Score', 'Accuracy', 'Precision', 'Recall']
model_results = pd.DataFrame(columns=col)
recall_ans, precision_ans, accuracy_ans, roc_ans = train_a_few_models(pre_cutoff_date, payments, model_results)
recalls.append(recall_ans)
precisions.append(precision_ans)
accuracies.append(accuracy_ans)
rocs.append(roc_ans)
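Each of these lists now holds one Algorithm-indexed table per cutoff date, so they can be combined and averaged for the comparison in the Results section below; a minimal sketch:

# Each entry is indexed by Algorithm with a single column named after its cutoff date,
# so concatenating along columns gives an algorithm x cutoff-date grid
recall_by_cutoff = pd.concat(recalls, axis = 1)
roc_by_cutoff = pd.concat(rocs, axis = 1)

# Average each metric over the ten cutoff dates
avg_metrics = pd.DataFrame({'Recall': recall_by_cutoff.mean(axis = 1),
                            'ROC AUC Score': roc_by_cutoff.mean(axis = 1)}).round(2)
print(avg_metrics)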

Results

Metrics averaged over models trained on different cutoff dates

We see that the simple random forest does best on accuracy, precision, and area under the ROC curve, while logistic regression does better on recall. Since the models aren't far apart on any of these metrics, the one that is easiest to implement and maintain should be preferred.

Charts: Recall, Precision, Accuracy, and Area under ROC Curve.

Conclusion

  1. Our models were fairly simplistic
  2. They relied entirely on the transaction dataset
  3. We leveraged the time-series component of the data to train models at multiple cutoff dates and check whether they stay stable with different amounts of data available, a form of cross-validation
  4. We see that they are fairly robust to the cutoff date
  5. Based on just this information, our model was able to:
    • Correctly classify, more than 80% of the time, whether a merchant would churn or stay active
    • Maintain a fairly low false negative rate over time, lowering the chances of misclassifying churning merchants as healthy
    • Stay robust to new data, i.e. show little model drift, because merchant acquisition patterns, transaction types, frequency, and order values remain quite stable over time
  6. Putting a model into production could be a good start to proactively identifying potentially churning customers