Hi there,
At work I was recently given a dataset of customer orders totaling around $73m of spend across 380,000 customers. I wanted to see what I can learn by applying the KMeans algorithm to the dataset of customers, to see how it would classify customers. I got the results, they make sense, but I wanted to start a discussion here to see how everybody thinks about clustering methods in practice.
Context:
I decided to go with three groups of customers. The charts for inertia and silhouette scores are attached (I tested k from 2 to 11). I selected 3 because of 2 main reasons:
middle ground between what the inertia and silhouette scores are telling me. After k=4, inertia starts to decrease at a slower rate, and silhouette sore is highest at k=2.
intuitively, three groups of customers make sense for us.
Overall, the three clusters that were identified represented:
50% of customers that place only a couple of smaller orders
25% of customers with very high LTV, due to many/frequent orders
25% of customers with very high AOV (they purchase a specific product type).
Attached image shows differences between groups.
What I'm thinking about:
Does using KMeans even make sense in this case? The results matched pretty well with a manual classification I did separately (high-value, frequent customers / small amount of orders, low value customers, and the rest). Is it better to use a classification that you can understand / has a clear interpretation, instead of using clusters?
How do you interpret inertia / silhouette scores? From what I understand, the absolute values themselves do not matter, it's the relationship between different number of clusters. In this case, the silhouette chart is a bit misleading (y-axis actually shows a very small range, I just wanted to zoom in a little bit). From what I understand, domain knowledge is key when selecting k, but wanted to see if there are some other "tricks" here to search for. Which one to prioritize between inertia and silhouette?
I used KMeans because it seemed like a reasonable starting point, I had little intuition about the geometry of data points in the space, to assume another clustering methods would be better. So how do you decide between clustering methods?
Did clustering methods help you solve a problem in production? I'm interested in hearing your thoughts about clustering methods in general.
Inertia and silhouette charts
Averages of spend, # orders, AOV between three groups
submitted by