K-Means Explained

K-Means is a method of vector quantization that is popular for cluster analysis in data mining. The algorithm works as follows:

1. Choose the number of clusters, K.
2. Select K random points as the initial centroids (not necessarily from the dataset).
3. Assign each data point to the closest centroid, forming K clusters.
4. Compute the new centroid of each cluster and move each centroid there.
5. Reassign each data point to its new closest centroid.
6. Repeat steps 4 and 5 until no reassignment takes place.

 

Intuition:

WCSS (Within-Cluster Sum of Squares) is computed to help choose the appropriate number of clusters. WCSS always decreases as K grows, so the goal is to find the K beyond which adding another cluster stops reducing it substantially. This is the Elbow method: plot WCSS against K and pick the point where the drop goes from substantial to marginal:

[Figures: the WCSS formula, and WCSS plotted against the number of clusters showing the elbow]
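As a rough sketch of the Elbow method (the toy data and the use of scikit-learn are my assumptions, not from the post), fit K-Means for a range of K values and record the WCSS, which scikit-learn exposes as `inertia_`:

```python
import numpy as np
from sklearn.cluster import KMeans

# Toy data: three well-separated blobs, a stand-in for a real dataset.
rng = np.random.default_rng(42)
X = np.vstack([rng.normal(c, 0.5, size=(50, 2)) for c in (0, 5, 10)])

# WCSS (scikit-learn's inertia_) for K = 1..10.
wcss = []
for k in range(1, 11):
    km = KMeans(n_clusters=k, init="k-means++", n_init=10, random_state=0)
    km.fit(X)
    wcss.append(km.inertia_)

# WCSS shrinks as K grows; the "elbow" is where the drop flattens out —
# here that should happen around K = 3, the true number of blobs.
```

Plotting `wcss` against `range(1, 11)` gives the elbow curve shown above.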

 

Once the number of clusters is chosen, the initial centroids are selected and each data point is assigned to its closest centroid. New centroids are then computed at the middle (the mean) of each cluster, and data points are reassigned to the closest new centroid. This repeats until no reassignment takes place:

[Figure: K-Means iterations — assignment and centroid updates until convergence]
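The loop described above can be sketched from scratch in NumPy (a minimal illustration under my own made-up data, not a production implementation):

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Minimal K-Means: random initial centroids, then assign/update until stable."""
    rng = np.random.default_rng(seed)
    # Initial centroids: k random points drawn from the data.
    centroids = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    labels = None
    for _ in range(n_iter):
        # Assign each point to its closest centroid (Euclidean distance).
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        # Stop when no reassignment takes place.
        if labels is not None and np.array_equal(new_labels, labels):
            break
        labels = new_labels
        # Move each centroid to the mean of its cluster.
        for j in range(k):
            members = X[labels == j]
            if len(members):
                centroids[j] = members.mean(axis=0)
    return labels, centroids

# Illustrative usage on two well-separated blobs.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(c, 0.3, size=(30, 2)) for c in (0.0, 4.0)])
labels, centroids = kmeans(X, k=2)
```

With well-separated data like this, the centroids settle near the true blob centers after a few iterations.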

 

Python example:

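A typical scikit-learn version might look like the following (the toy blob data and parameter choices are illustrative assumptions):

```python
import numpy as np
from sklearn.cluster import KMeans

# Illustrative data: three Gaussian blobs.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(c, 0.4, size=(40, 2)) for c in ((0, 0), (5, 0), (0, 5))])

# Fit K-Means with K = 3; init="k-means++" spreads out the initial centroids.
km = KMeans(n_clusters=3, init="k-means++", n_init=10, random_state=0)
y = km.fit_predict(X)

print(km.cluster_centers_)   # one centroid per cluster
print(km.inertia_)           # WCSS of the final clustering
```

`fit_predict` returns the cluster label of each sample, and `inertia_` is the WCSS used by the Elbow method above.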

P.S.: k-means++ is used to avoid a poor random initialization — badly placed initial centroids can make K-Means converge to clusters that are not the most appropriate.
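In scikit-learn this is just the `init` parameter (shown on made-up data; `init="k-means++"` is in fact scikit-learn's default):

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(c, 0.4, size=(40, 2)) for c in ((0, 0), (6, 0), (0, 6))])

# Plain random initialization vs. k-means++ seeding, one run each.
wcss_random = KMeans(n_clusters=3, init="random", n_init=1, random_state=0).fit(X).inertia_
wcss_pp = KMeans(n_clusters=3, init="k-means++", n_init=1, random_state=0).fit(X).inertia_

# k-means++ picks initial centroids that are spread apart (each new centroid is
# sampled with probability proportional to its squared distance from the ones
# already chosen), so a single run is much less likely to get stuck in a bad
# local optimum than with init="random".
```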

 

mostlyfad — Computer Engineer • Entrepreneur • Blogger
