Sunday, October 16, 2022

What is Clustering?


.

Clustering is an unsupervised machine learning technique that divides the population into several groups or clusters such that data points in the same group are similar to each other, and data points in different groups are dissimilar.

Clustering is used to identify segments or groups in a dataset.

Clustering can be divided into two subgroups:

(1) Hard Clustering.

In hard clustering, each data point is assigned to exactly one cluster: it either belongs to a cluster completely or not at all.

K-Means Clustering is a hard clustering algorithm. It partitions the data points into k clusters.
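A minimal sketch of hard clustering with scikit-learn's KMeans (the data points here are made up for illustration):

```python
import numpy as np
from sklearn.cluster import KMeans

# Two obvious groups of 2-D points (illustrative data).
X = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.1],
              [8.0, 8.0], [8.2, 7.9], [7.8, 8.1]])

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

# Hard assignment: every point gets exactly one label.
# The first three points share one label, the last three the other
# (the cluster numbering itself is arbitrary).
print(km.labels_)
print(km.cluster_centers_)
```

Note that each point appears in `labels_` exactly once — there is no notion of partial membership.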

(2) Soft Clustering.

In soft clustering, instead of assigning each data point to exactly one cluster, a probability of that point belonging to each cluster is assigned. In soft clustering, also called fuzzy clustering, each data point can belong to multiple clusters, each with a probability score or likelihood.

One of the widely used soft clustering algorithms is the Fuzzy C-means clustering (FCM) Algorithm.
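A from-scratch sketch of the FCM idea in NumPy (scikit-learn has no built-in FCM; the function name, data, and parameter choices below are illustrative, not from the source):

```python
import numpy as np

def fuzzy_c_means(X, c=2, m=2.0, n_iter=100, seed=0):
    """Return an (n_samples, c) membership matrix U and c cluster centres."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    # Random initial memberships; each row sums to 1.
    U = rng.random((n, c))
    U /= U.sum(axis=1, keepdims=True)
    for _ in range(n_iter):
        # Centres are membership-weighted means (weights raised to the fuzzifier m).
        W = U ** m
        centres = (W.T @ X) / W.sum(axis=0)[:, None]
        # Memberships are updated from inverse distances to each centre.
        d = np.linalg.norm(X[:, None, :] - centres[None, :, :], axis=2)
        d = np.fmax(d, 1e-10)          # avoid division by zero
        inv = d ** (-2.0 / (m - 1.0))
        U = inv / inv.sum(axis=1, keepdims=True)
    return U, centres

X = np.array([[1.0, 1.0], [1.1, 0.9], [8.0, 8.0], [7.9, 8.1]])
U, centres = fuzzy_c_means(X, c=2)
# Each row of U sums to 1: a point belongs to every cluster to some degree.
print(U.round(3))
```

With well-separated points like these, the memberships are close to 0 or 1; for points between the two groups they would be closer to 0.5 each.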

.

https://towardsdatascience.com/fuzzy-c-means-clustering-is-it-better-than-k-means-clustering-448a0aba1ee7

.

.

Clustering is an unsupervised machine learning algorithm. In unsupervised learning, you have only input data and no output labels. Unsupervised learning is used to find patterns in the given data in order to learn more about it.

.

https://medium.com/@codingpilot25/clustering-explained-to-beginners-of-data-science-e25d73c77a24

.

Cluster analysis, or clustering, is the most commonly used technique of unsupervised learning. It is used to find clusters in the data such that the data points within each cluster are as closely matched as possible.

.

.
The types of Clustering Algorithms are:

1. Prototype-Based Clustering
2. Graph-Based Clustering (Contiguity-Based Clustering)
3. Density-Based Clustering
4. Well-Separated Clustering
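As one example from the density-based family above, here is an illustrative sketch using scikit-learn's DBSCAN (the data and parameter values are made up for the example):

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Two dense blobs plus one isolated point.
X = np.array([[0.0, 0.0], [0.1, 0.1], [0.2, 0.0],
              [5.0, 5.0], [5.1, 5.1], [5.0, 5.2],
              [20.0, 20.0]])

db = DBSCAN(eps=0.5, min_samples=2).fit(X)
# Label -1 marks noise: points that fall in no dense region.
print(db.labels_)
```

Unlike prototype-based methods, DBSCAN needs no pre-specified number of clusters and can label outliers as noise.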
.


.
.

What is the difference between K Means and Hierarchical Clustering?


.

k-means is a method of cluster analysis that uses a pre-specified number of clusters. It requires advance knowledge of 'K'.

Hierarchical clustering, also known as hierarchical cluster analysis (HCA), is a method of cluster analysis that seeks to build a hierarchy of clusters without requiring a fixed number of clusters in advance.

The main differences between k-means and Hierarchical Clustering are:


k-means Clustering vs Hierarchical Clustering

1. In k-means, using a pre-specified number of clusters, the method assigns records to clusters to find mutually exclusive clusters of roughly spherical shape based on distance. Hierarchical methods, by contrast, can be either divisive or agglomerative.

2. k-means needs advance knowledge of K, i.e. the number of clusters you want to divide your data into. In hierarchical clustering, you can stop at any number of clusters you find appropriate by interpreting the dendrogram.

3. In k-means, the mean (or median) is used as a cluster centre to represent each cluster. Agglomerative methods begin with 'n' clusters and sequentially merge the most similar clusters until only one cluster is obtained.

4. k-means methods are normally less computationally intensive and are suited to very large datasets. Divisive methods work in the opposite direction, beginning with one cluster that includes all the records; hierarchical methods are especially useful when the goal is to arrange the clusters into a natural hierarchy.

5. In k-means, since you start with a random choice of initial clusters, the results of running the algorithm many times may differ. Hierarchical clustering is deterministic, so its results are reproducible.

6. k-means produces a division of the set of data objects into non-overlapping subsets (clusters) such that each data object is in exactly one subset. A hierarchical clustering is a set of nested clusters arranged as a tree.

7. k-means is found to work well when the structure of the clusters is hyper-spherical (like a circle in 2D or a sphere in 3D). Hierarchical clustering does not handle hyper-spherical clusters as well as k-means.

8. Advantages of k-means: convergence is guaranteed, and it is computationally efficient. Advantages of hierarchical clustering: it can handle any form of similarity or distance, and is consequently applicable to any attribute type.

9. Disadvantages of k-means: the value of K is difficult to predict, and it does not work well with non-globular clusters. Disadvantage of hierarchical clustering: it requires the computation and storage of an n×n distance matrix, which can be expensive and slow for very large datasets.
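The reproducibility difference can be checked directly; a small sketch with scikit-learn (random data, assumed parameters):

```python
import numpy as np
from sklearn.cluster import KMeans, AgglomerativeClustering

rng = np.random.default_rng(0)
X = rng.random((30, 2))

# Two agglomerative runs always agree: the merge order is deterministic.
a1 = AgglomerativeClustering(n_clusters=3).fit_predict(X)
a2 = AgglomerativeClustering(n_clusters=3).fit_predict(X)
assert (a1 == a2).all()

# k-means runs with different seeds may land in different local optima
# (or permute the cluster labels) unless the seed is fixed.
k1 = KMeans(n_clusters=3, n_init=1, random_state=1).fit_predict(X)
k2 = KMeans(n_clusters=3, n_init=1, random_state=2).fit_predict(X)
print(k1)
print(k2)
```

In practice, k-means is usually run with several random restarts (the `n_init` parameter) and the best result kept.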

 


.
.

What is hierarchical clustering analysis?

 .

In data mining and statistics, hierarchical clustering analysis is a method of cluster analysis that seeks to build a hierarchy of clusters, i.e. a tree-like structure of nested clusters.

Basically, there are two types of hierarchical cluster analysis strategies –

1. Agglomerative Clustering: Also known as the bottom-up approach or hierarchical agglomerative clustering (HAC). It produces a structure that is more informative than the unstructured set of clusters returned by flat clustering, and it does not require us to prespecify the number of clusters. Bottom-up algorithms treat each data point as a singleton cluster at the outset and then successively merge pairs of clusters until all points have been merged into a single cluster that contains all the data.

2. Divisive Clustering: Also known as the top-down approach. This algorithm likewise does not require us to prespecify the number of clusters. Top-down clustering requires a method for splitting a cluster that contains the whole data, and proceeds by splitting clusters recursively until individual data points have been split into singleton clusters.
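A short sketch of the agglomerative (bottom-up) strategy with SciPy: build the full merge hierarchy, then cut the tree at a chosen number of clusters (data and linkage method are illustrative):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

X = np.array([[1.0, 1.0], [1.1, 0.9], [1.0, 1.2],
              [9.0, 9.0], [9.1, 8.9], [8.9, 9.1]])

# Each point starts as its own singleton cluster; the closest pairs
# are merged successively until one cluster remains.
Z = linkage(X, method='ward')

# Cutting the tree at 2 clusters recovers the two groups;
# scipy.cluster.hierarchy.dendrogram(Z) would draw the full tree.
labels = fcluster(Z, t=2, criterion='maxclust')
print(labels)
```

Because the whole hierarchy is stored in `Z`, you can re-cut it at any number of clusters without re-running the algorithm — exactly the flexibility the dendrogram gives in the table above.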

.

https://www.geeksforgeeks.org/ml-hierarchical-clustering-agglomerative-and-divisive-clustering/

.