K-Means, a method of vector quantization that is popular for cluster analysis in data mining, is about choosing the number of clusters, selecting the centroids (not necessarily from the dataset) at random K points, assigning each data point to the closest centroid (forming K clusters), computing and placing the new centroids of each cluster, reassigning each data point to the new closest centroid, and keep repeating the last step until no reassignment takes place.
WCSS (Within-Cluster-Sum-of-Squares) is calculated to allow choosing the appropriate number of clusters: the minimal WCSS (decreased to a limit) is chosen as the right number of clusters. Here comes the Elbow method, to find where the drop goes from being substantial to not as substantial:
Once the number of clusters is chosen, centroids are to be selected, and data points to be assigned to the closet centroids. Afterwards, new centroids are being chosen in the middle of each cluster, and data points are being reassigned to the corresponding cluster. This is done until no reassignment takes place:
P.S.: k-means++ is used to prevent choosing wrong initial values, centroids leading to clusters not being the most appropriate.
Computer Engineer • Entrepreneur • Blogger