from Data Science

So how do we all feel about the KMeans algorithm for clustering?


Hi there,

At work I was recently given a dataset of customer orders totaling around $73m of spend across 380,000 customers. I wanted to see what I could learn by applying the KMeans algorithm to the customers and how it would group them. I got the results and they make sense, but I wanted to start a discussion here to see how everybody thinks about clustering methods in practice.

Context:

I decided to go with three groups of customers. The charts for inertia and silhouette scores are attached (I tested k from 2 to 11). I selected 3 for two main reasons:

  1. It is a middle ground between what the inertia and silhouette scores are telling me. After k=4, inertia starts to decrease at a slower rate, and the silhouette score is highest at k=2.

  2. Intuitively, three groups of customers make sense for us.
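For anyone who wants to reproduce this kind of k scan, here is a minimal sketch using scikit-learn. It runs on synthetic blobs, not on my customer data, and the feature scaling step and the `sample_size` value are assumptions for illustration, not a description of exactly what I ran. Sampling matters at 380,000 rows because the silhouette score needs pairwise distances.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for a customer table (NOT the real data):
# three well-separated blobs in a 3-feature space.
X, _ = make_blobs(
    n_samples=600,
    centers=[[0, 0, 0], [10, 10, 0], [-10, 10, 5]],
    cluster_std=1.0,
    random_state=0,
)

# Put features on a comparable scale so no single one dominates the distances.
X_scaled = StandardScaler().fit_transform(X)

results = []
for k in range(2, 12):  # k = 2..11, the same range as in the post
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X_scaled)
    # sample_size scores a random subset; the full computation is O(n^2).
    sil = silhouette_score(X_scaled, km.labels_, sample_size=300, random_state=0)
    results.append((k, km.inertia_, sil))

for k, inertia, sil in results:
    print(f"k={k:2d}  inertia={inertia:10.2f}  silhouette={sil:.3f}")

best_k = max(results, key=lambda r: r[2])[0]
print("k with the highest silhouette:", best_k)
```

On real data the two curves rarely agree this cleanly, which is the situation I ended up in.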

Overall, the three clusters that were identified represented:

  1. 50% of customers that place only a couple of smaller orders

  2. 25% of customers with very high LTV, due to many/frequent orders

  3. 25% of customers with very high AOV (they purchase a specific product type).

The attached image shows the differences between groups.

What I'm thinking about:

  1. Does using KMeans even make sense in this case? The results matched pretty well with a manual classification I did separately (high-value, frequent customers / low-value customers with a small number of orders / the rest). Is it better to use a classification that you can understand and that has a clear interpretation, instead of using clusters?

  2. How do you interpret inertia / silhouette scores? From what I understand, the absolute values themselves do not matter; what matters is how they change across different numbers of clusters. In this case, the silhouette chart is a bit misleading (the y-axis actually shows a very small range, I just wanted to zoom in a little bit). From what I understand, domain knowledge is key when selecting k, but I wanted to see if there are some other "tricks" to look for here. Which one should be prioritized, inertia or silhouette?

  3. I used KMeans because it seemed like a reasonable starting point. I had too little intuition about the geometry of the data points in the space to assume that another clustering method would be better. So how do you decide between clustering methods?
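On question 1, one way to put a number on "matched pretty well" is a cross-tabulation of manual segments against cluster labels, plus the adjusted Rand index (1.0 means identical groupings, values near 0 mean chance-level agreement). The sketch below uses only the standard library, and the ten labels are made up for illustration; they are not my results. scikit-learn's `adjusted_rand_score` computes the same quantity.

```python
from collections import Counter
from math import comb


def contingency(manual, clusters):
    """Count how many customers fall into each (manual segment, cluster) pair."""
    return Counter(zip(manual, clusters))


def adjusted_rand_index(manual, clusters):
    """Adjusted Rand index between two groupings of the same customers."""
    n = len(manual)
    pairs = contingency(manual, clusters)
    # Number of customer pairs grouped together in both, and in each, labeling.
    sum_cells = sum(comb(c, 2) for c in pairs.values())
    sum_rows = sum(comb(c, 2) for c in Counter(manual).values())
    sum_cols = sum(comb(c, 2) for c in Counter(clusters).values())
    expected = sum_rows * sum_cols / comb(n, 2)
    max_index = (sum_rows + sum_cols) / 2
    if max_index == expected:  # degenerate case, e.g. one group on both sides
        return 1.0
    return (sum_cells - expected) / (max_index - expected)


# Hypothetical labels for ten customers (illustration only).
manual = ["low", "low", "low", "low", "low",
          "frequent", "frequent", "frequent", "rest", "rest"]
clusters = [0, 0, 0, 0, 2, 1, 1, 1, 2, 2]

for (segment, cluster), count in sorted(contingency(manual, clusters).items()):
    print(f"manual={segment:9s} cluster={cluster} customers={count}")

ari = adjusted_rand_index(manual, clusters)
print("ARI:", round(ari, 3))
```

If the agreement is very high, that is arguably a point in favor of the interpretable manual rules, since the clusters add little beyond them.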

Did clustering methods help you solve a problem in production? I'm interested in hearing your thoughts about clustering methods in general.

Inertia and silhouette charts

Averages of spend, # orders, AOV between three groups

submitted by /u/vercig09