
Predict Churn (Post Feedback)

Expanding on the previous EDA work and training models at multiple cutoff dates
The given dataset has three fields:
  1. Merchant ID as 'merchant'
  2. Time of transaction as 'time'
  3. Transaction amount in cents 'amount_usd_in_cents'
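For context, the earlier EDA rolled these transactions up to one row per merchant. That rollup code isn't reproduced in this report, but a minimal sketch (an assumed reconstruction, not the original code) of how a merchant_data frame with an avg_time_bw_payments column could be rebuilt from the raw file might look like this:

import pandas as pd

# Hypothetical reconstruction of the earlier merchant-level rollup
payments = pd.read_csv("payments_data.csv")
payments['time'] = pd.to_datetime(payments['time'])
payments = payments.sort_values(['merchant', 'time'])

# Days since the same merchant's previous payment (NaN for each merchant's first payment)
payments['time_from_last_order'] = payments.groupby('merchant')['time'].diff().dt.days

# One row per merchant: average number of days between successive payments
merchant_data = (payments.groupby('merchant')['time_from_last_order']
                 .mean()
                 .rename('avg_time_bw_payments')
                 .reset_index())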
Using this rolled-up data from previous work to understand when a customer might be considered churned:


Working on the merchant-level rollup data we created to find churn
merchant_data['avg_time_bw_payments'] = merchant_data['avg_time_bw_payments'].fillna(1)

pd.cut(merchant_data['avg_time_bw_payments'],
bins = [0,30,60,90,120,150,180,210,240,270,300,330,360,390,420,
450,480,510,540,570,600,630,660,690,720]).value_counts(normalize = True).sort_index().cumsum().plot(kind = 'bar');
Average number of days between successive payments:


Churn Identification

We could say that anyone who hasn't returned within 90 days of their last payment is churned, because the chart above shows that fewer than 5% of customers take more than 90 days on average between successive purchases.
To model churn, we need:
  1. Merchant behavior - i.e. transaction history for a period of time
  2. Merchant churn - i.e. in the days / months following the transaction history, information on whether the merchant churned or not.
Since we have a two-year history of merchant transactions, we can train models at several cutoff dates, each with a 90-day post-period that moves along with the cutoff.
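Concretely, for a given cutoff the features come from payments on or before the cutoff, and the label comes from the 90 days that follow. A minimal sketch of that split, reusing the payments frame loaded above (the full feature pipeline is in the function below):

from datetime import timedelta
import pandas as pd

cutoff = pd.to_datetime("2034-03-31")   # one of the cutoff dates used later

# assumes payments['time'] has already been parsed to datetime
payments_pre = payments[payments['time'] <= cutoff]                          # feature window
payments_post = payments[(payments['time'] > cutoff) &
                         (payments['time'] <= cutoff + timedelta(days=90))]  # label window

# A merchant with no payment in the 90-day post-window is treated as churned
active_post = set(payments_post['merchant'])
churn = {m: m not in active_post for m in payments_pre['merchant'].unique()}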
Using the following:
  1. A Cutoff date
  2. Payments data
  3. An empty DataFrame that collects the model results

Feature Extraction and Modeling

We train three simple models and compare them against a naive baseline that always predicts "no churn".
The following code has three components:
  1. Merchant level rollups and log transformations
  2. Train models at multiple cutoffs
  3. Append metrics to a larger table for comparison
from datetime import timedelta

import numpy as np
import pandas as pd
import wandb
from sklearn import metrics
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

def train_a_few_models(pre_cutoff, payments, model_results):

payments['time'] = pd.to_datetime(payments['time'])
payments.sort_values(by = ['merchant', 'time'], ascending = [True, True], inplace = True)
payments_pre = payments[payments['time'] <= pd.to_datetime(pre_cutoff)].copy()

payments_post = payments[(payments['time'] > pd.to_datetime(pre_cutoff)) &
(payments['time'] <= pd.to_datetime(pre_cutoff) + timedelta(days=90))]

payments_pre['purchase_quarter'] = pd.to_datetime(payments_pre['time']).dt.quarter

merchant_q1 = payments_pre[payments_pre['purchase_quarter'] == 1].groupby(['merchant']).agg({'amount_usd_in_cents':
['sum', 'count']})
merchant_q1.columns = ['_'.join(t) for t in merchant_q1.columns]
merchant_q1.rename(columns = {'amount_usd_in_cents_sum':'q1_amount',
'amount_usd_in_cents_count':'q1_purchases'},
inplace = True)
merchant_q1['q1_aov'] = merchant_q1['q1_amount'] / merchant_q1['q1_purchases']

merchant_q2 = payments_pre[payments_pre['purchase_quarter'] == 2].groupby(['merchant']).agg({'amount_usd_in_cents':
['sum', 'count']})
merchant_q2.columns = ['_'.join(t) for t in merchant_q2.columns]
merchant_q2.rename(columns = {'amount_usd_in_cents_sum':'q2_amount',
'amount_usd_in_cents_count':'q2_purchases'},
inplace = True)

merchant_q2['q2_aov'] = merchant_q2['q2_amount'] / merchant_q2['q2_purchases']

merchant_q3 = payments_pre[payments_pre['purchase_quarter'] == 3].groupby(['merchant']).agg({'amount_usd_in_cents':
['sum', 'count']})
merchant_q3.columns = ['_'.join(t) for t in merchant_q3.columns]
merchant_q3.rename(columns = {'amount_usd_in_cents_sum':'q3_amount',
'amount_usd_in_cents_count':'q3_purchases'},
inplace = True)

merchant_q3['q3_aov'] = merchant_q3['q3_amount'] / merchant_q3['q3_purchases']

merchant_q4 = payments_pre[payments_pre['purchase_quarter'] == 4].groupby(['merchant']).agg({'amount_usd_in_cents':
['sum', 'count']})
merchant_q4.columns = ['_'.join(t) for t in merchant_q4.columns]
merchant_q4.rename(columns = {'amount_usd_in_cents_sum':'q4_amount',
'amount_usd_in_cents_count':'q4_purchases'},
inplace = True)

merchant_q4['q4_aov'] = merchant_q4['q4_amount'] / merchant_q4['q4_purchases']

# first payment date
merchant_first_payment = pd.DataFrame(payments_pre.groupby(['merchant'])['time'].min().dt.date).reset_index().rename(columns =
{'time':'first_payment'})

# last payment date
merchant_last_payment = pd.DataFrame(payments_pre.groupby(['merchant'])['time'].max().dt.date).reset_index().rename(columns =
{'time':'last_payment'})

# total payment amount
merchant_total_payment = pd.DataFrame(payments_pre.groupby(['merchant'])['amount_usd_in_cents'].sum()).reset_index().rename(columns =
{'amount_usd_in_cents':'total_spend'})

# number of payments in the pre-cutoff window
merchant_num_payments = pd.DataFrame(payments_pre.groupby(['merchant'])['time'].count()).reset_index().rename(columns =
{'time':'num_payments'})

# days since the same merchant's previous payment
payments_pre['time_from_last_order'] = payments_pre.groupby(['merchant'])['time'].diff().dt.days
mask = payments_pre['merchant'] != payments_pre['merchant'].shift(1)
payments_pre.loc[mask, 'time_from_last_order'] = np.nan

# average time between payments in the pre-cutoff window
merchant_time_bw_payments = pd.DataFrame(payments_pre.groupby('merchant')['time_from_last_order'].mean()).reset_index().rename(columns =
{'time_from_last_order':
'avg_time_bw_payments'})

merchant_level = merchant_first_payment.merge(merchant_last_payment,
how = 'inner',
on = 'merchant').merge(merchant_total_payment,
how = 'inner',
on = 'merchant').merge(merchant_num_payments,
how = 'inner',
on = 'merchant').merge(merchant_time_bw_payments,
how = 'inner',
on = 'merchant').merge(merchant_q1,
how = 'left',
on = 'merchant').merge(merchant_q2,
how = 'left',
on = 'merchant').merge(merchant_q3,
how = 'left',
on = 'merchant').merge(merchant_q4,
how = 'left',
on = 'merchant')

merchant_level['time_between_first_and_last_payment'] = (pd.to_datetime(merchant_level['last_payment']) - pd.to_datetime(merchant_level['first_payment'])).dt.days + 1

quarter_cols = ['q1_amount', 'q1_purchases', 'q1_aov',
'q2_amount', 'q2_purchases', 'q2_aov',
'q3_amount', 'q3_purchases', 'q3_aov',
'q4_amount', 'q4_purchases', 'q4_aov']
merchant_level[quarter_cols] = merchant_level[quarter_cols].fillna(0)

# recency features measured as of the cutoff date
merchant_level['time_since_first_payment'] = (pd.to_datetime(pre_cutoff) - pd.to_datetime(merchant_level['first_payment'])).dt.days

merchant_level['time_since_last_payment'] = (pd.to_datetime(pre_cutoff) - pd.to_datetime(merchant_level['last_payment'])).dt.days

merchant_level['acquisition_month'] = pd.to_datetime(merchant_level['first_payment']).dt.to_period('M')

merchant_level['acquisition_quarter'] = pd.to_datetime(merchant_level['first_payment']).dt.quarter

merchant_level = pd.concat([merchant_level,
pd.get_dummies(merchant_level['acquisition_quarter'],
prefix = 'acquisition_quarter')],
axis = 1)

merchant_level['acquisition_year'] = pd.to_datetime(merchant_level['first_payment']).dt.year

merchant_level = pd.concat([merchant_level,
pd.get_dummies(merchant_level['acquisition_year'],
prefix = 'acquisition_year')],
axis = 1)

merchant_level['num_payments_per_day_in_life'] = merchant_level['num_payments'] / merchant_level['time_between_first_and_last_payment']

merchant_level['av_order_value'] = merchant_level['total_spend'] / merchant_level['num_payments']

merchant_level['avg_time_bw_payments'] = merchant_level['avg_time_bw_payments'].fillna(1)

merchant_level['log_amt_in_cents'] = np.log(merchant_level['total_spend'] + 1)
merchant_level['log_q1_amount'] = np.log(merchant_level['q1_amount'] + 1)
merchant_level['log_q2_amount'] = np.log(merchant_level['q2_amount'] + 1)
merchant_level['log_q3_amount'] = np.log(merchant_level['q3_amount'] + 1)
merchant_level['log_q4_amount'] = np.log(merchant_level['q4_amount'] + 1)

merchant_level['log_num_payments'] = np.log(merchant_level['num_payments'] + 1)
merchant_level['log_q1_purchases'] = np.log(merchant_level['q1_purchases'] + 1)
merchant_level['log_q2_purchases'] = np.log(merchant_level['q2_purchases'] + 1)
merchant_level['log_q3_purchases'] = np.log(merchant_level['q3_purchases'] + 1)
merchant_level['log_q4_purchases'] = np.log(merchant_level['q4_purchases'] + 1)

merchant_level['log_time_bw_payments'] = np.log(merchant_level['avg_time_bw_payments'])

merchant_level['log_av_order_value'] = np.log(merchant_level['av_order_value'] + 1)
merchant_level['log_q1_aov'] = np.log(merchant_level['q1_aov'] + 1)
merchant_level['log_q2_aov'] = np.log(merchant_level['q2_aov'] + 1)
merchant_level['log_q3_aov'] = np.log(merchant_level['q3_aov'] + 1)
merchant_level['log_q4_aov'] = np.log(merchant_level['q4_aov'] + 1)

merchant_level['log_num_payments_per_day_in_life'] = np.log(merchant_level['num_payments_per_day_in_life'])

merchant_level['log_time_between_first_and_last_payment'] = np.log(merchant_level['time_between_first_and_last_payment'] + 1)
merchant_level['log_time_since_first_payment'] = np.log(merchant_level['time_since_first_payment'] + 1)
merchant_level['log_time_since_last_payment'] = np.log(merchant_level['time_since_last_payment'] + 1)

merchant_post = payments_post['merchant'].unique()

# churned = no payment in the 90-day window after the cutoff
merchant_level['churn'] = ~merchant_level['merchant'].isin(merchant_post)

merchant_level = merchant_level[(merchant_level['av_order_value'] < 2000000) &
(merchant_level['num_payments_per_day_in_life'] < 1000)]

columns_to_use = ['merchant',
'avg_time_bw_payments',
'log_amt_in_cents',
'log_q1_amount',
'log_q2_amount',
'log_q3_amount',
'log_q4_amount',
'log_num_payments',
'log_q1_purchases',
'log_q2_purchases',
'log_q3_purchases',
'log_q4_purchases',
'log_time_between_first_and_last_payment',
'log_time_since_first_payment',
'log_time_since_last_payment',
'churn']

features_data = merchant_level[columns_to_use]
y = features_data['churn']
X = features_data.drop(columns = 'churn')
X_id = X['merchant']
X_train, X_test, y_train, y_test = train_test_split(X, y,
test_size = 0.20,
random_state = 0)

X_train_id = X_train['merchant']
X_test_id = X_test['merchant']
X_train = X_train.drop(columns = 'merchant')
X_test = X_test.drop(columns = 'merchant')

#### Baseline model
from sklearn import metrics

y_pred_test = [False]*len(y_test)
y_pred_train = [False]*len(y_train)

wandb.log({'accuracy': metrics.accuracy_score(y_test, y_pred_test),
'roc-auc-score': np.nan,
'precision': metrics.precision_score(y_test, y_pred_test, zero_division = 0),
'recall': metrics.recall_score(y_test, y_pred_test)})

results = {'Algorithm':'Baseline_False',
'ROC AUC Score':np.nan,
'Accuracy':metrics.accuracy_score(y_test, y_pred_test),
'Precision':metrics.precision_score(y_test, y_pred_test, zero_division = 0),
'Recall':metrics.recall_score(y_test, y_pred_test)}

model_results = pd.concat([model_results, pd.DataFrame([results])], ignore_index = True)

### simple logistic regression

scaler = StandardScaler()

X_train_scaled = pd.DataFrame(scaler.fit_transform(X_train))
X_train_scaled.columns = X_train.columns.values
X_train_scaled.index = X_train.index.values

X_test_scaled = pd.DataFrame(scaler.transform(X_test))
X_test_scaled.columns = X_test.columns.values
X_test_scaled.index = X_test.index.values

from sklearn.linear_model import LogisticRegression
from sklearn import metrics

model = LogisticRegression(random_state = 42)
model.fit(X_train_scaled, y_train)

prediction_test = model.predict(X_test_scaled)

wandb.log({'accuracy': metrics.accuracy_score(y_test, prediction_test),
'roc-auc-score': metrics.roc_auc_score(y_test, model.predict_proba(X_test_scaled)[:,1]),
'precision': metrics.precision_score(y_test, prediction_test),
'recall': metrics.recall_score(y_test, prediction_test)})

results = {'Algorithm':'Simple Logistic Regression',
'ROC AUC Score':metrics.roc_auc_score(y_test, model.predict_proba(X_test_scaled)[:,1]),
'Accuracy':metrics.accuracy_score(y_test, prediction_test),
'Precision':metrics.precision_score(y_test, prediction_test),
'Recall':metrics.recall_score(y_test, prediction_test)}

model_results = pd.concat([model_results, pd.DataFrame([results])], ignore_index = True)

### The data was slightly imbalanced; let's see if class weighting makes any difference

model = LogisticRegression(random_state = 42, class_weight='balanced')
model.fit(X_train_scaled, y_train)

prediction_test = model.predict(X_test_scaled)

wandb.log({'accuracy': metrics.accuracy_score(y_test, prediction_test),
'roc-auc-score': metrics.roc_auc_score(y_test, model.predict_proba(X_test_scaled)[:,1]),
'precision': metrics.precision_score(y_test, prediction_test),
'recall': metrics.recall_score(y_test, prediction_test)})

results = {'Algorithm':'Simple Logistic Regression with weight balancing',
'ROC AUC Score':metrics.roc_auc_score(y_test, model.predict_proba(X_test_scaled)[:,1]),
'Accuracy':metrics.accuracy_score(y_test, prediction_test),
'Precision':metrics.precision_score(y_test, prediction_test),
'Recall':metrics.recall_score(y_test, prediction_test)}

model_results = pd.concat([model_results, pd.DataFrame([results])], ignore_index = True)

### Simple RF

# Train model

clf_rf = RandomForestClassifier(random_state=42)
clf_rf.fit(X_train, y_train)

# Predict on test set
pred_y_rf = clf_rf.predict(X_test)

wandb.log({'accuracy': metrics.accuracy_score(y_test, pred_y_rf),
'roc-auc-score': metrics.roc_auc_score(y_test, clf_rf.predict_proba(X_test)[:,1]),
'precision': metrics.precision_score(y_test, pred_y_rf),
'recall': metrics.recall_score(y_test, pred_y_rf)})

results = {'Algorithm':'Simple Random Forest Classifier',
'ROC AUC Score':metrics.roc_auc_score(y_test, clf_rf.predict_proba(X_test)[:,1]),
'Accuracy':metrics.accuracy_score(y_test, pred_y_rf),
'Precision':metrics.precision_score(y_test, pred_y_rf),
'Recall':metrics.recall_score(y_test, pred_y_rf)}

model_results = pd.concat([model_results, pd.DataFrame([results])], ignore_index = True)

recall = model_results[['Algorithm','Recall']].rename(columns = {'Recall':pre_cutoff}).set_index('Algorithm').round(2)
precision = model_results[['Algorithm','Precision']].rename(columns = {'Precision':pre_cutoff}).set_index('Algorithm').round(2)
accuracy = model_results[['Algorithm','Accuracy']].rename(columns = {'Accuracy':pre_cutoff}).set_index('Algorithm').round(2)
roc = model_results[['Algorithm','ROC AUC Score']].rename(columns = {'ROC AUC Score':pre_cutoff}).set_index('Algorithm').round(2)

return recall, precision, accuracy, roc

Calling it at multiple cutoff dates:
pre_cutoff_dates = ["2033-12-31", "2034-01-31", "2034-02-28", "2034-03-31", "2034-04-30",
"2034-05-31", "2034-06-30", "2034-07-31", "2034-08-31", "2034-09-30"]

payments = pd.read_csv("payments_data.csv")
recalls = []
precisions = []
accuracies = []
rocs = []

for pre_cutoff_date in pre_cutoff_dates:
col = ['Algorithm', 'ROC AUC Score', 'Accuracy', 'Precision', 'Recall']
model_results = pd.DataFrame(columns=col)
recall_ans, precision_ans, accuracy_ans, roc_ans = train_a_few_models(pre_cutoff_date, payments, model_results)
recalls.append(recall_ans)
precisions.append(precision_ans)
accuracies.append(accuracy_ans)
rocs.append(roc_ans)
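Each of these lists now holds one Algorithm-indexed table per cutoff date, so they can be combined and averaged for the comparison in the Results section below; a minimal sketch:

# Each entry is indexed by Algorithm with a single column named after its cutoff date,
# so concatenating along columns gives an algorithm x cutoff-date grid
recall_by_cutoff = pd.concat(recalls, axis = 1)
roc_by_cutoff = pd.concat(rocs, axis = 1)

# Average each metric over the ten cutoff dates
avg_metrics = pd.DataFrame({'Recall': recall_by_cutoff.mean(axis = 1),
                            'ROC AUC Score': roc_by_cutoff.mean(axis = 1)}).round(2)
print(avg_metrics)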

Results

Metrics averaged over models trained on different cutoff dates

We see that the simple random forest does best on accuracy, precision, and area under the ROC curve, while logistic regression does better on recall. Since the models aren't far apart on any of these metrics, the one that is easiest to implement and maintain should be preferred.

Charts: Recall, Precision, Accuracy, and Area under ROC Curve.

Conclusion

  1. Our models were fairly simplistic
  2. They relied entirely on the transaction dataset
  3. We leveraged the time-series component of the data to train models at multiple cutoff dates and check whether they stay stable with different amounts of data available, a form of cross-validation
  4. We see that they are fairly robust to the cutoff date
  5. Based on just this information, our model was able to:
    • Correctly classify, more than 80% of the time, whether a merchant would churn or stay active
    • Maintain a fairly low false negative rate over time, lowering the chances of misclassifying churning merchants as healthy
    • Stay robust to new data, i.e. show little model drift, because merchant acquisition patterns, transaction types, frequency, and order values remain quite stable over time
  6. Putting a model into production could be a good start to proactively identifying potentially churning customers