What is clustering?
Clustering is an unsupervised machine learning task that automatically divides the data into clusters, or groups of similar items.
Clustering is guided by the principle that items inside a cluster should be very similar to each other, but very different from those outside.
What is K-means clustering?
K-means clustering is a type of unsupervised learning, which is used when you have unlabeled data (i.e., data without defined categories or groups). The goal of this algorithm is to find groups in the data, with the number of groups represented by the variable K.
The algorithm works iteratively to assign each data point to one of K groups based on the features that are provided. Data points are clustered based on feature similarity. The results of the K-means clustering algorithm are :-
- The centroids of the K clusters, which can be used to label new data
- Labels for the training data (each data point is assigned to a single cluster)
Rather than defining groups before looking at the data, clustering allows you to find and analyze the groups that have formed organically. The “Choosing K” section below describes how the number of groups can be determined.
Each cluster's centroid is a collection of feature values that defines the resulting group. Examining the centroid's feature values helps to qualitatively interpret what kind of group each cluster represents.
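As a minimal sketch of how centroids can be used to label new data, the snippet below assigns each new point to its nearest centroid. The centroid values, the sample points, and the helper name nearest_centroid are made up for illustration; they do not come from a real dataset.

```python
import math

# Hypothetical centroids for K = 2, each with two feature values.
# These numbers are illustrative only, not learned from real data.
centroids = [(1.0, 1.5), (8.0, 9.0)]

def nearest_centroid(point, centroids):
    """Return the index of the centroid closest to `point` (Euclidean distance)."""
    distances = [math.dist(point, c) for c in centroids]
    return distances.index(min(distances))

new_points = [(0.5, 2.0), (9.0, 8.5), (5.0, 5.0)]
labels = [nearest_centroid(p, centroids) for p in new_points]
print(labels)  # [0, 1, 1]
```

Reading the centroid tuples directly (here, low values on both features versus high values on both) is the kind of qualitative interpretation described above.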
Applications of K-Means Clustering
K-Means clustering is used in a variety of real-life business cases, such as :-
- Academic performance :- Based on the scores, students are categorized into grades like A, B, or C.
- Diagnostic systems :- The medical profession uses k-means in creating smarter medical decision support systems, especially in the treatment of liver ailments.
- Search engines :- Clustering forms a backbone of search engines. When a search is performed, the search results need to be grouped, and the search engines very often use clustering to do this.
- Wireless sensor networks :- The clustering algorithm finds the cluster heads, each of which collects all the data in its respective cluster.
How Does K-Means Clustering Work?
The flowchart below shows how k-means clustering works :-
The goal of the K-Means algorithm is to find clusters in the given input data. One way to accomplish this is the trial and error method: we specify a value of K (e.g., 3, 4, or 5), run the algorithm, and keep changing the value until we get the best clusters.
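The trial and error idea can be sketched as a loop over candidate values of K. The text does not say how the "best" clusters are judged, so this sketch assumes one common score: the sum of squared distances from each point to its nearest centroid (often called inertia). The sample points, the seed, the number of restarts, and the function name kmeans are all illustrative choices.

```python
import math
import random

def kmeans(points, k, rng, max_iter=100):
    """Plain K-means sketch; returns (centroids, inertia)."""
    centroids = rng.sample(points, k)  # start from K distinct data points
    for _ in range(max_iter):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda j: math.dist(p, centroids[j]))
            clusters[nearest].append(p)
        # Mean of each cluster; an empty cluster keeps its previous centroid.
        new_centroids = [
            tuple(sum(axis) / len(members) for axis in zip(*members)) if members else centroids[j]
            for j, members in enumerate(clusters)
        ]
        if new_centroids == centroids:
            break
        centroids = new_centroids
    # Assumed score: sum of squared distances to the nearest centroid.
    inertia = sum(min(math.dist(p, c) ** 2 for c in centroids) for p in points)
    return centroids, inertia

# Made-up data: three small, well-separated groups.
points = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8), (1, 8), (2, 9), (1, 9)]

rng = random.Random(0)
inertia_by_k = {}
for k in range(1, 6):
    # Several random restarts per K; keep the lowest score found.
    inertia_by_k[k] = min(kmeans(points, k, rng)[1] for _ in range(10))
    print(k, round(inertia_by_k[k], 2))
```

The score generally keeps falling as K grows, so the lowest value alone does not identify the best K. In practice one looks for the K beyond which the improvement becomes small.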
- Figure 1 :- It shows data for two different items. The first item is shown in blue and the second item is shown in red. Here I am choosing the value of K arbitrarily as 2. There are different methods by which we can choose the right value of K.
- Figure 2 :- It joins the two selected points with a line. To find the centroids, we draw the perpendicular bisector of that line, which splits the data so that each point belongs to the nearer of the two selected points. The two selected points then move to the centroids of their groups. Notice that some of the red points have now moved to the blue side. These points now belong to the group of blue items.
- Figure 3 :- The same process continues in figure 3. We join the two points, draw the perpendicular bisector, and find the centroids. The two points move to their centroids, and again some of the red points become blue points.
- Figure 4 :- The same process happens in figure 4. It continues until we get two completely separate clusters.
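The line-drawing picture in the figures is equivalent to a nearest-point rule: a point falls on one side of the perpendicular bisector of the line joining two centers exactly when it is closer to the center on that side. The sketch below checks this on a few made-up points. The coordinates and function names are illustrative and are not taken from the figures.

```python
import math

def side_of_bisector(point, a, b):
    """Return 0 if `point` is on a's side of the perpendicular bisector of ab, else 1."""
    mid = [(ai + bi) / 2 for ai, bi in zip(a, b)]
    direction = [bi - ai for ai, bi in zip(a, b)]
    # Signed projection of (point - mid) onto the direction from a to b.
    projection = sum((pi - mi) * di for pi, mi, di in zip(point, mid, direction))
    return 0 if projection <= 0 else 1

def nearest_of_two(point, a, b):
    """Return 0 if `point` is at least as close to a as to b, else 1."""
    return 0 if math.dist(point, a) <= math.dist(point, b) else 1

a, b = (2, 2), (6, 4)  # the two selected points (illustrative values)
samples = [(1, 1), (3, 4), (5, 2), (7, 6), (4, 2)]

by_bisector = [side_of_bisector(p, a, b) for p in samples]
by_distance = [nearest_of_two(p, a, b) for p in samples]
print(by_bisector)  # [0, 0, 1, 1, 0]
print(by_distance)  # [0, 0, 1, 1, 0]
```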
K-Means Clustering Algorithm
Let’s say we have x(1), x(2), x(3), ..., x(n) as our inputs, and we want to split them into K clusters.
The steps to form clusters are :-
Step 1 :- Choose K random points as cluster centers, called centroids.
Step 2 :- Assign each x(i) to the closest cluster by calculating its Euclidean distance to each centroid.
Step 3 :- Identify new centroids by taking the average of the assigned points.
Step 4 :- Keep repeating step 2 and step 3 until convergence is achieved, that is, until the cluster assignments stop changing.
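The four steps above can be sketched in plain Python as follows. This is a minimal illustration, not a production implementation. It assumes that convergence means the assignments no longer change, and that an empty cluster simply keeps its previous centroid. The function name k_means, the seed, and the sample data are made up for the example.

```python
import math
import random

def k_means(points, k, max_iter=100, seed=0):
    """Return (centroids, labels) for a list of equal-length tuples."""
    rng = random.Random(seed)
    # Step 1: choose K random data points as the initial centroids.
    centroids = rng.sample(points, k)
    labels = None
    for _ in range(max_iter):
        # Step 2: assign each point to its closest centroid (Euclidean distance).
        new_labels = [
            min(range(k), key=lambda j: math.dist(p, centroids[j])) for p in points
        ]
        # Step 4: stop once no assignment changes (assumed convergence test).
        if new_labels == labels:
            break
        labels = new_labels
        # Step 3: move each centroid to the average of its assigned points.
        for j in range(k):
            members = [p for p, label in zip(points, labels) if label == j]
            if members:  # an empty cluster keeps its previous centroid
                centroids[j] = tuple(sum(axis) / len(members) for axis in zip(*members))
    return centroids, labels

# Made-up data: two small, well-separated groups.
points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
centroids, labels = k_means(points, k=2)
print(centroids)
print(labels)
```

Which group gets label 0 and which gets label 1 depends on the random start, so only the grouping itself is meaningful, not the label numbers.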
Use case of K-means clustering
- Crime analysis :-
Criminal activities are a major cause for concern for law enforcement officials. Existing strategies to control crime are usually reactive, responding to the crime scene after the crimes have occurred.
There are certain questions that law enforcement officers often ask: Is there any correlation between crime type, the weapon used, and location? What are the demographics of the people committing a certain crime? What weapons do criminals most typically possess? Can the reports help us predict future criminal activities?
The K-means data mining approach helps us identify patterns, since it is very difficult for humans to process large amounts of data and detect patterns, especially when some information is missing.
K-means clustering is one of the methods of cluster analysis. In the K-means algorithm, each point is assigned to the cluster whose centroid is the closest. K-means clustering aims to partition n observations into k clusters in which each observation belongs to the cluster with the nearest mean. It can be applied to relatively large sets of data.
Advantages of K-means
- It is very simple to implement.
- It scales to very large datasets and runs quickly on them.
- It adapts easily to new examples.
- It can be generalized to clusters of different shapes and sizes.
Disadvantages of K-means
- It is sensitive to outliers.
- Choosing the value of K manually is a tough job.
- As the number of dimensions increases, its scalability decreases.
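The sensitivity to outliers follows from the centroid being an average. As a small illustration with made-up one-dimensional values, a single extreme point pulls the centroid far away from the rest of the cluster:

```python
from statistics import mean

# Illustrative values only: a tight cluster and one extreme point.
cluster = [10.0, 11.0, 12.0, 13.0]
outlier = 100.0

centroid_clean = mean(cluster)
centroid_with_outlier = mean(cluster + [outlier])

print(centroid_clean)         # 11.5
print(centroid_with_outlier)  # 29.2
```

With the outlier included, the centroid no longer sits near any of the original points, which in turn can change which points are assigned to the cluster.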