Predict Churn (Post Feedback)
Expanding on previous EDA work and training the model on multiple cutoff time periods
The dataset provided has three fields:
- Merchant ID as 'merchant'
- Time of transaction as 'time'
- Transaction amount in cents as 'amount_usd_in_cents'
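For reference, a minimal sketch of loading the raw file (assuming it is the same payments_data.csv used later in this report) and parsing the timestamp up front:

import pandas as pd

# Load raw transactions; parsing 'time' here makes the date arithmetic below straightforward
payments = pd.read_csv("payments_data.csv", parse_dates=["time"])
print(payments.dtypes)  # expect: merchant, time (datetime64), amount_usd_in_cents
print(payments["merchant"].nunique(), "merchants,", len(payments), "transactions")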
Using the rolled-up data from previous work, we can work out when a customer should be considered churned:

Working with the merchant-level rollup data we created earlier to identify churn:
merchant_data['avg_time_bw_payments'] = merchant_data['avg_time_bw_payments'].fillna(1)
# Cumulative share of merchants by average gap (in days) between successive payments
pd.cut(merchant_data['avg_time_bw_payments'],
       bins=[0, 30, 60, 90, 120, 150, 180, 210, 240, 270, 300, 330, 360, 390,
             420, 450, 480, 510, 540, 570, 600, 630, 660, 690, 720]
      ).value_counts(normalize=True).cumsum().plot(kind='bar');
Average number of days between successive payments:

Churn Identification
We could say that anyone who hasn't returned within 90 days of their last payment can be considered churned, because the chart above shows that fewer than 5% of customers take more than 90 days on average between successive purchases.
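As a quick sanity check on the 90-day threshold, a minimal sketch (assuming the merchant_data rollup and its avg_time_bw_payments column from above):

# Share of merchants whose average gap between payments exceeds 90 days
share_over_90 = (merchant_data['avg_time_bw_payments'] > 90).mean()
print(f"{share_over_90:.1%} of merchants average more than 90 days between payments")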
To model churn, we need:
- Merchant behavior - i.e. transaction history for a period of time
- Merchant churn - i.e. whether the merchant churned in the days / months following that transaction history.
Since we have a two-year history of merchant transactions, we can train models at different cutoff dates, each with a 90-day post-period that moves along with the cutoff (a minimal sketch of this split follows the list below).
Using the following:
- A cutoff date
- Payments data
- An empty DataFrame that collects the model results
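A minimal sketch of this split at a single cutoff (file and column names as used elsewhere in this report; the 90-day post-window is what we use to label churn):

from datetime import timedelta
import pandas as pd

payments = pd.read_csv("payments_data.csv", parse_dates=["time"])
cutoff = pd.to_datetime("2033-12-31")

# Behavior window: everything up to and including the cutoff
payments_pre = payments[payments["time"] <= cutoff]
# Labeling window: the 90 days immediately after the cutoff
payments_post = payments[(payments["time"] > cutoff) &
                         (payments["time"] <= cutoff + timedelta(days=90))]

# A merchant seen before the cutoff but with no payments in the post window is labeled churned
labels = pd.DataFrame({"merchant": payments_pre["merchant"].unique()})
labels["churn"] = ~labels["merchant"].isin(payments_post["merchant"].unique())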
Feature Extraction and Modeling
We train three simple models and compare them against a baseline that always predicts the dominant class.
The code below has three components:
- Merchant level rollups and log transformations
- Train models at multiple cutoffs
- Append metrics to a larger table for comparison
import numpy as np
import pandas as pd
import wandb
from datetime import timedelta
from sklearn import metrics
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler


def train_a_few_models(pre_cutoff, payments, model_results):
    payments['time'] = pd.to_datetime(payments['time'])
    payments.sort_values(by=['merchant', 'time'], ascending=[True, True], inplace=True)

    # Split into a behavior window (pre) and a 90-day labeling window (post)
    cutoff = pd.to_datetime(pre_cutoff)
    payments_pre = payments[payments['time'] <= cutoff].copy()
    payments_post = payments[(payments['time'] > cutoff) &
                             (payments['time'] <= cutoff + timedelta(days=90))]

    # Quarterly rollups: amount, number of purchases and average order value per quarter
    payments_pre['purchase_quarter'] = payments_pre['time'].dt.quarter
    quarterly = []
    for q in [1, 2, 3, 4]:
        merchant_q = (payments_pre[payments_pre['purchase_quarter'] == q]
                      .groupby('merchant')
                      .agg({'amount_usd_in_cents': ['sum', 'count']}))
        merchant_q.columns = [f'q{q}_amount', f'q{q}_purchases']
        merchant_q[f'q{q}_aov'] = merchant_q[f'q{q}_amount'] / merchant_q[f'q{q}_purchases']
        quarterly.append(merchant_q.reset_index())

    # Merchant-level rollups over the whole pre-cutoff window
    merchant_first_payment = (payments_pre.groupby('merchant')['time'].min().dt.date
                              .reset_index().rename(columns={'time': 'first_payment'}))
    merchant_last_payment = (payments_pre.groupby('merchant')['time'].max().dt.date
                             .reset_index().rename(columns={'time': 'last_payment'}))
    merchant_total_payment = payments_pre.groupby('merchant')['amount_usd_in_cents'].sum().reset_index()
    merchant_num_payments = (payments_pre.groupby('merchant')['time'].count()
                             .reset_index().rename(columns={'time': 'num_payments'}))

    # Average time between successive payments (gap to the previous order, per merchant)
    payments_pre['time_from_last_order'] = payments_pre.groupby('merchant')['time'].diff().dt.days
    merchant_time_bw_payments = (payments_pre.groupby('merchant')['time_from_last_order'].mean()
                                 .reset_index()
                                 .rename(columns={'time_from_last_order': 'avg_time_bw_payments'}))

    merchant_level = (merchant_first_payment
                      .merge(merchant_last_payment, how='inner', on='merchant')
                      .merge(merchant_total_payment, how='inner', on='merchant')
                      .merge(merchant_num_payments, how='inner', on='merchant')
                      .merge(merchant_time_bw_payments, how='inner', on='merchant'))
    for merchant_q in quarterly:
        merchant_level = merchant_level.merge(merchant_q, how='left', on='merchant')

    merchant_level['time_between_first_and_last_payment'] = (
        pd.to_datetime(merchant_level['last_payment']) - pd.to_datetime(merchant_level['first_payment'])
    ).dt.days + 1

    # Merchants with no purchases in a given quarter get zeros
    quarterly_cols = [f'q{q}_{m}' for q in [1, 2, 3, 4] for m in ['amount', 'purchases', 'aov']]
    merchant_level[quarterly_cols] = merchant_level[quarterly_cols].fillna(0)

    # Recency features relative to a fixed reference date after all cutoffs
    reference_date = pd.to_datetime('2035-01-01')
    merchant_level['time_since_first_payment'] = (reference_date - pd.to_datetime(merchant_level['first_payment'])).dt.days
    merchant_level['time_since_last_payment'] = (reference_date - pd.to_datetime(merchant_level['last_payment'])).dt.days

    # Acquisition cohort features
    merchant_level['acquisition_month'] = pd.to_datetime(merchant_level['first_payment']).dt.to_period('M')
    merchant_level['acquisition_quarter'] = pd.to_datetime(merchant_level['first_payment']).dt.quarter
    merchant_level = pd.concat([merchant_level,
                                pd.get_dummies(merchant_level['acquisition_quarter'], prefix='acquisition_quarter')],
                               axis=1)
    merchant_level['acquisition_year'] = pd.to_datetime(merchant_level['first_payment']).dt.year
    merchant_level = pd.concat([merchant_level,
                                pd.get_dummies(merchant_level['acquisition_year'], prefix='acquisition_year')],
                               axis=1)

    merchant_level['num_payments_per_day_in_life'] = merchant_level['num_payments'] / merchant_level['time_between_first_and_last_payment']
    merchant_level['av_order_value'] = merchant_level['amount_usd_in_cents'] / merchant_level['num_payments']
    merchant_level['avg_time_bw_payments'] = merchant_level['avg_time_bw_payments'].fillna(1)

    # Log transformations to tame heavy-tailed amounts and counts
    merchant_level['log_amt_in_cents'] = np.log(merchant_level['amount_usd_in_cents'] + 1)
    for q in [1, 2, 3, 4]:
        merchant_level[f'log_q{q}_amount'] = np.log(merchant_level[f'q{q}_amount'] + 1)
        merchant_level[f'log_q{q}_purchases'] = np.log(merchant_level[f'q{q}_purchases'] + 1)
        merchant_level[f'log_q{q}_aov'] = np.log(merchant_level[f'q{q}_aov'] + 1)
    merchant_level['log_num_payments'] = np.log(merchant_level['num_payments'] + 1)
    merchant_level['log_time_bw_payments'] = np.log(merchant_level['avg_time_bw_payments'])
    merchant_level['log_av_order_value'] = np.log(merchant_level['av_order_value'] + 1)
    merchant_level['log_num_payments_per_day_in_life'] = np.log(merchant_level['num_payments_per_day_in_life'])
    merchant_level['log_time_between_first_and_last_payment'] = np.log(merchant_level['time_between_first_and_last_payment'] + 1)
    merchant_level['log_time_since_first_payment'] = np.log(merchant_level['time_since_first_payment'] + 1)
    merchant_level['log_time_since_last_payment'] = np.log(merchant_level['time_since_last_payment'] + 1)

    # Label: a merchant is churned if they have no payments in the 90-day post window
    merchant_post = payments_post['merchant'].unique()
    merchant_level['churn'] = ~merchant_level['merchant'].isin(merchant_post)

    # Drop extreme outliers
    merchant_level = merchant_level[(merchant_level['av_order_value'] < 2000000) &
                                    (merchant_level['num_payments_per_day_in_life'] < 1000)]

    columns_to_use = ['merchant', 'avg_time_bw_payments', 'log_amt_in_cents',
                      'log_q1_amount', 'log_q2_amount', 'log_q3_amount', 'log_q4_amount',
                      'log_num_payments', 'log_q1_purchases', 'log_q2_purchases',
                      'log_q3_purchases', 'log_q4_purchases',
                      'log_time_between_first_and_last_payment',
                      'log_time_since_first_payment', 'log_time_since_last_payment', 'churn']
    features_data = merchant_level[columns_to_use]

    y = features_data['churn']
    X = features_data.drop(columns='churn')
    X_id = X['merchant']  # keep merchant ids so predictions can be traced back later
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=0)
    X_train_id = X_train['merchant']
    X_test_id = X_test['merchant']
    X_train = X_train.drop(columns='merchant')
    X_test = X_test.drop(columns='merchant')

    #### Baseline model: always predict the dominant class (not churned)
    y_pred_test = [False] * len(y_test)
    wandb.log({'accuracy': metrics.accuracy_score(y_test, y_pred_test),
               'roc-auc-score': np.nan,
               'precision': metrics.precision_score(y_test, y_pred_test, zero_division=0),
               'recall': metrics.recall_score(y_test, y_pred_test)})
    results = {'Algorithm': 'Baseline_False',
               'ROC AUC Score': np.nan,
               'Accuracy': metrics.accuracy_score(y_test, y_pred_test),
               'Precision': metrics.precision_score(y_test, y_pred_test, zero_division=0),
               'Recall': metrics.recall_score(y_test, y_pred_test)}
    model_results = pd.concat([model_results, pd.DataFrame([results])], ignore_index=True)

    ### Simple logistic regression on standardized features
    scaler = StandardScaler()
    X_train_scaled = pd.DataFrame(scaler.fit_transform(X_train), columns=X_train.columns, index=X_train.index)
    X_test_scaled = pd.DataFrame(scaler.transform(X_test), columns=X_test.columns, index=X_test.index)

    model = LogisticRegression(random_state=42)
    model.fit(X_train_scaled, y_train)
    prediction_test = model.predict(X_test_scaled)
    wandb.log({'accuracy': metrics.accuracy_score(y_test, prediction_test),
               'roc-auc-score': metrics.roc_auc_score(y_test, model.predict_proba(X_test_scaled)[:, 1]),
               'precision': metrics.precision_score(y_test, prediction_test),
               'recall': metrics.recall_score(y_test, prediction_test)})
    results = {'Algorithm': 'Simple Logistic Regression',
               'ROC AUC Score': metrics.roc_auc_score(y_test, model.predict_proba(X_test_scaled)[:, 1]),
               'Accuracy': metrics.accuracy_score(y_test, prediction_test),
               'Precision': metrics.precision_score(y_test, prediction_test),
               'Recall': metrics.recall_score(y_test, prediction_test)}
    model_results = pd.concat([model_results, pd.DataFrame([results])], ignore_index=True)

    ### Data was slightly imbalanced; see if class-weight balancing makes any difference
    model = LogisticRegression(random_state=42, class_weight='balanced')
    model.fit(X_train_scaled, y_train)
    prediction_test = model.predict(X_test_scaled)
    wandb.log({'accuracy': metrics.accuracy_score(y_test, prediction_test),
               'roc-auc-score': metrics.roc_auc_score(y_test, model.predict_proba(X_test_scaled)[:, 1]),
               'precision': metrics.precision_score(y_test, prediction_test),
               'recall': metrics.recall_score(y_test, prediction_test)})
    results = {'Algorithm': 'Simple Logistic Regression with weight balancing',
               'ROC AUC Score': metrics.roc_auc_score(y_test, model.predict_proba(X_test_scaled)[:, 1]),
               'Accuracy': metrics.accuracy_score(y_test, prediction_test),
               'Precision': metrics.precision_score(y_test, prediction_test),
               'Recall': metrics.recall_score(y_test, prediction_test)}
    model_results = pd.concat([model_results, pd.DataFrame([results])], ignore_index=True)

    ### Simple random forest on the unscaled features
    clf_rf = RandomForestClassifier(random_state=42)
    clf_rf.fit(X_train, y_train)
    pred_y_rf = clf_rf.predict(X_test)
    wandb.log({'accuracy': metrics.accuracy_score(y_test, pred_y_rf),
               'roc-auc-score': metrics.roc_auc_score(y_test, clf_rf.predict_proba(X_test)[:, 1]),
               'precision': metrics.precision_score(y_test, pred_y_rf),
               'recall': metrics.recall_score(y_test, pred_y_rf)})
    results = {'Algorithm': 'Simple Random Forest Classifier',
               'ROC AUC Score': metrics.roc_auc_score(y_test, clf_rf.predict_proba(X_test)[:, 1]),
               'Accuracy': metrics.accuracy_score(y_test, pred_y_rf),
               'Precision': metrics.precision_score(y_test, pred_y_rf),
               'Recall': metrics.recall_score(y_test, pred_y_rf)}
    model_results = pd.concat([model_results, pd.DataFrame([results])], ignore_index=True)

    # One Algorithm-indexed column per metric, named after the cutoff date
    recall = model_results[['Algorithm', 'Recall']].rename(columns={'Recall': pre_cutoff}).set_index('Algorithm').round(2)
    precision = model_results[['Algorithm', 'Precision']].rename(columns={'Precision': pre_cutoff}).set_index('Algorithm').round(2)
    accuracy = model_results[['Algorithm', 'Accuracy']].rename(columns={'Accuracy': pre_cutoff}).set_index('Algorithm').round(2)
    roc = model_results[['Algorithm', 'ROC AUC Score']].rename(columns={'ROC AUC Score': pre_cutoff}).set_index('Algorithm').round(2)
    return recall, precision, accuracy, roc
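As a side note, the all-False baseline above is equivalent to scikit-learn's majority-class dummy classifier whenever "not churned" is the dominant class; a minimal sketch using the train/test split from inside the function:

from sklearn.dummy import DummyClassifier

# Always predicts the most frequent class in y_train (here, presumably "not churned")
baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X_train, y_train)
baseline_accuracy = baseline.score(X_test, y_test)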
Calling it at multiple cutoff dates:
pre_cutoff_dates = ["2033-12-31", "2034-01-31", "2034-02-28", "2034-03-31", "2034-04-30",
                    "2034-05-31", "2034-06-30", "2034-07-31", "2034-08-31", "2034-09-30"]
payments = pd.read_csv("payments_data.csv")

recalls = []
precisions = []
accuracies = []
rocs = []
for pre_cutoff_date in pre_cutoff_dates:
    # Fresh results table per cutoff; wandb.init() is assumed to have been called earlier in the run
    model_results = pd.DataFrame(columns=['Algorithm', 'ROC AUC Score', 'Accuracy', 'Precision', 'Recall'])
    recall_ans, precision_ans, accuracy_ans, roc_ans = train_a_few_models(pre_cutoff_date, payments, model_results)
    recalls.append(recall_ans)
    precisions.append(precision_ans)
    accuracies.append(accuracy_ans)
    rocs.append(roc_ans)
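One way (a sketch, using the lists built above) to produce the averaged view shown in the results below is to concatenate the per-cutoff columns and take a row-wise mean per algorithm:

# Each element of `recalls` etc. is an Algorithm-indexed column named after its cutoff date
recall_table = pd.concat(recalls, axis=1)
precision_table = pd.concat(precisions, axis=1)
accuracy_table = pd.concat(accuracies, axis=1)
roc_table = pd.concat(rocs, axis=1)

# Average each metric across cutoff dates into a single comparison table
summary = pd.DataFrame({
    "Recall": recall_table.mean(axis=1),
    "Precision": precision_table.mean(axis=1),
    "Accuracy": accuracy_table.mean(axis=1),
    "ROC AUC": roc_table.mean(axis=1),
}).round(2)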
Results
Metrics averaged over models trained on different cutoff dates

We see that the simple random forest does best on accuracy, precision, and area under the ROC curve, while logistic regression does better on recall. Since the models aren't that different on any of these metrics, the one that is easiest to implement and maintain should be preferred.
[Panels: Recall, Precision, Accuracy, and Area under ROC Curve across cutoff dates]
Conclusion
- Our models were fairly simplistic
- They relied entirely on the transaction dataset
- We leveraged the time-series component of the data to train models at multiple cutoff dates and check whether they stay stable with different amounts of data available - a form of cross-validation
- We see that they are fairly robust to the cutoff date
- Based on just this information, our model was able to:
- Correctly classify, 80%+ of the time, whether a customer would likely churn or stay active
- Maintain a fairly low false negative rate over time (lowering the chance of misclassifying churning merchants as healthy)
- Remain robust to new data, i.e. there is little model drift, because merchant acquisition patterns, transaction types, frequency, and order values remain quite stable over time
- Putting a model into production could be a good start to proactively identifying potentially churning customers