Data Science Take Home
Description and submission instructions
Created on September 10 | Last edited on January 10
Customer Churn and Segmentation
This take-home is intended to take less than 6 hours. Please respect that limit, for both our sakes. I strongly encourage you to read this entire page carefully before getting started.
In this discussion, our goal is to examine some transaction data and attempt to understand a few things:
- what are some natural ways to segment or cluster the merchants in the transactions data?
- how might we predict which active merchants in the dataset might become inactive soon (churn)?
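Both questions are asked about merchants, while the data has one row per transaction, so a common first step is to roll transactions up to one row per merchant. The following is a minimal sketch of that step, not a recommended feature set: the function name `merchant_summary`, the output column names, and the toy rows are all made up for illustration, and it assumes the `time` column has already been parsed to datetimes.

```python
import pandas as pd


def merchant_summary(pay: pd.DataFrame) -> pd.DataFrame:
    """Aggregate raw transactions into one row per merchant (illustrative only)."""
    summary = pay.groupby("merchant").agg(
        n_transactions=("amount_usd_in_cents", "count"),
        total_usd=("amount_usd_in_cents", lambda s: s.sum() / 100),
        first_seen=("time", "min"),
        last_seen=("time", "max"),
    )
    # Whole days between a merchant's first and last transaction.
    summary["active_days"] = (summary["last_seen"] - summary["first_seen"]).dt.days
    return summary


# Toy data (made up) with the three columns named in the brief.
pay = pd.DataFrame({
    "merchant": ["a", "a", "b"],
    "time": pd.to_datetime([
        "2033-01-01 10:00:00",
        "2033-01-11 10:00:00",
        "2034-06-01 09:30:00",
    ]),
    "amount_usd_in_cents": [1000, 2500, 400],
})
summary = merchant_summary(pay)
print(summary)
```

A table like this can feed either a clustering step or a churn model; which columns to include is a modelling decision the write-up should justify.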
Some useful things to consider:
- Starting with some EDA is a good idea.
- Please explain what the segments in your analysis mean, both mathematically and in terms you could convey to business partners.
- Please provide a clear definition of churn, and any caveats regarding the dataset.
- Please present your results with tables and charts, and don't forget the explanations! Your write-up should read like a brief report or blog post on the topic.
- For examples of how to log to W&B from sklearn, see the Weights and Biases resources on plotting the outputs of your models.
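To make the request for "a clear definition of churn" concrete, here is one possible shape such a definition could take: a merchant is flagged as churned if it has had no transaction in the last N days, measured from the reference date '2035-01-01' given further down this page. This is only a sketch under stated assumptions. The 90-day threshold, the function name `label_churn`, and the toy rows are hypothetical, and a recency rule is just one of several defensible definitions.

```python
import pandas as pd

AS_OF = pd.Timestamp("2035-01-01")  # the "current" date stated in the brief
INACTIVE_AFTER_DAYS = 90            # hypothetical threshold, not from the brief


def label_churn(pay, as_of=AS_OF, inactive_after_days=INACTIVE_AFTER_DAYS):
    """Flag merchants whose last transaction is older than the threshold."""
    last_seen = pay.groupby("merchant")["time"].max()
    days_since_last = (as_of - last_seen).dt.days
    labels = pd.DataFrame({
        "last_seen": last_seen,
        "days_since_last": days_since_last,
    })
    labels["churned"] = labels["days_since_last"] > inactive_after_days
    return labels


# Toy data (made up): merchant "a" is recent, merchant "b" has gone quiet.
pay = pd.DataFrame({
    "merchant": ["a", "a", "b"],
    "time": pd.to_datetime([
        "2034-11-01 08:00:00",
        "2034-12-15 12:00:00",
        "2034-06-01 09:30:00",
    ]),
    "amount_usd_in_cents": [1000, 2500, 400],
})
labels = label_churn(pay)
print(labels)
```

Whatever rule is chosen, the report should state the threshold explicitly and discuss its caveats, for example merchants with naturally seasonal or infrequent activity.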
Goals of the take-home (how you'll be evaluated):
- Ensure you have a working knowledge of how to interact with data in Python.
- Verify that you can build simple models for a business-related task and efficiently manage your time.
- Establish your competency in communicating your results in a written report format, demonstrating both clarity and communication skill without sacrificing technical rigor. It's essential to explain your model choice, data processing decisions, and the basic mathematical relationship between the model and the business problem.
Note: You will not be evaluated on the sophistication of your model choice, or the performance of your model.
A few important technical details:
- Please stick to somewhat reasonable Python packages.
- Make certain that your approach is yours. Code snippets from the internet are fine, but directly copying someone else's churn example is not. Where you choose to follow an online resource or example, please cite it as you would in a research paper.
- Make sure you communicate how to run your code and how to view your writeup. Your writeup should be submitted as a report on W&B with relevant code and discussion. Separately, you can send a link to the repo, a link to a Colab notebook, or email the code/notebook you ran.
- Please set repos and reports as private.
- For the sake of this take-home, assume the date is currently '2035-01-01' (if this doesn't make sense now, it will eventually).
- After you've completed the feature engineering, please save your dataset CSV as an artifact in your W&B workspace via add_file (see the Artifacts documentation), and do the same for the model.
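The artifact step in the last bullet can be sketched as follows. This is a hedged example rather than the required approach: the project name, artifact name, job type, and file name are placeholders, and the helper names `save_features` and `log_csv_artifact` are made up. It relies on `wandb.init`, `wandb.Artifact`, `Artifact.add_file`, and `run.log_artifact` from the W&B Python client, and assumes you are already logged in to W&B.

```python
import pandas as pd


def save_features(features: pd.DataFrame, path: str = "merchant_features.csv") -> str:
    """Write the engineered feature table to a CSV file and return its path."""
    features.to_csv(path, index=False)
    return path


def log_csv_artifact(path: str,
                     project: str = "ds-take-home",      # placeholder project name
                     name: str = "merchant-features"):   # placeholder artifact name
    """Upload a CSV file to W&B as a dataset artifact."""
    import wandb  # imported here so the CSV step works without W&B installed

    run = wandb.init(project=project, job_type="feature-engineering")
    artifact = wandb.Artifact(name=name, type="dataset")
    artifact.add_file(path)
    run.log_artifact(artifact)
    run.finish()
```

Typical use would be `log_csv_artifact(save_features(features))` once the feature table exists. A saved model file can be logged the same way by changing the path and using an artifact type such as "model".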
Simple data validation:
Before you get started, please confirm that your dataset has the columns:
- merchant
- time
- amount_usd_in_cents
and that it comprises 544831 rows.
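The check above can be written as a small function that fails loudly if the data does not match. This is a minimal sketch: the function name `validate_transactions` and the toy frame are illustrative, while the column names and the row count come from the list above.

```python
import pandas as pd

EXPECTED_COLUMNS = {"merchant", "time", "amount_usd_in_cents"}
EXPECTED_ROWS = 544831


def validate_transactions(pay: pd.DataFrame, expected_rows: int = EXPECTED_ROWS) -> None:
    """Raise ValueError if required columns are missing or the row count differs."""
    missing = EXPECTED_COLUMNS - set(pay.columns)
    if missing:
        raise ValueError(f"missing columns: {sorted(missing)}")
    if len(pay) != expected_rows:
        raise ValueError(f"expected {expected_rows} rows, found {len(pay)}")


# Toy frame (made up) to show the call; on the real data, use the default row count.
toy = pd.DataFrame({
    "merchant": ["a", "b"],
    "time": ["2033-01-01 00:00:00", "2033-01-02 00:00:00"],
    "amount_usd_in_cents": [100, 250],
})
validate_transactions(toy, expected_rows=2)
```

On the real dataset, `validate_transactions(pay)` returns silently when both conditions hold.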
To save you the annoyance of thinking about it, here's one way to parse the timestamp column:
import pandas as pd

# where pay is the dataset as a pandas DataFrame:
print(pay.iloc[0]['time'])  # what the data looks like before parsing
dt_format_str = '%Y-%m-%d %H:%M:%S'
# parse the strings and store the result back on the DataFrame
pay['time'] = pd.to_datetime(
    pay['time'],
    format=dt_format_str
)