Basics of Clustering
- Finding groups of objects such that objects in a group are similar (or related) to one another and different from (or unrelated to) objects in other groups.
- Why is clustering difficult?
- Curse of dimensionality: Almost all pairs of points are at about the same distance in high-dimensional spaces.
- Clusters can be ambiguous: the same data may support several equally reasonable groupings.
Preliminaries:
- Similarity: normally we use the distance between vectors to measure similarity (smaller distance = more similar).
- Evaluation:
• Cluster cohesion (within-cluster tightness) and separation (between-cluster distance)
• Silhouette coefficient: combines the ideas of both cohesion and separation, for individual points as well as whole clusters
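As a concrete illustration, the per-point silhouette coefficient can be sketched in a few lines (the function name and toy clusters below are my own; for a point x, a is its mean distance to its own cluster and b its mean distance to the nearest other cluster):

```python
import numpy as np

def silhouette(x, own_cluster, nearest_other_cluster):
    """Silhouette coefficient for one point x: s = (b - a) / max(a, b),
    where a = mean distance to the other members of x's own cluster
    (cohesion) and b = mean distance to the nearest other cluster
    (separation). s is close to 1 when x sits well inside its cluster."""
    a = np.mean([np.linalg.norm(x - p) for p in own_cluster])
    b = np.mean([np.linalg.norm(x - p) for p in nearest_other_cluster])
    return (b - a) / max(a, b)

# Toy 1-D example: x lies inside a tight cluster far from the other one.
own = np.array([[0.0], [1.0]])
other = np.array([[10.0], [11.0]])
s = silhouette(np.array([0.5]), own, other)   # (10 - 0.5) / 10 = 0.95
```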
Clustering Techniques
K-means (Partitioning Clustering)
- Objective function: minimise the sum of squared errors, SSE = Σ_i Σ_{x ∈ C_i} dist(x, c_i)², where c_i is the centroid of cluster C_i.
- Algorithm:
• Determine the value of K.
• Choose K cluster centres randomly.
• Each data point is assigned to its closest centroid.
• Use the mean of each cluster to update each centroid.
• Repeat until no more new assignment.
• Return the K centroids.
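The steps above can be sketched as a minimal Lloyd-style implementation (function name and toy data are illustrative; no empty-cluster handling):

```python
import numpy as np

def kmeans(X, k, iters=100, seed=0):
    """Minimal K-means sketch following the steps above."""
    rng = np.random.default_rng(seed)
    # Choose K cluster centres randomly from the data.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        # Assign each point to its closest centroid.
        dists = np.linalg.norm(X[:, None] - centroids[None, :], axis=2)
        labels = dists.argmin(axis=1)
        # Use the mean of each cluster to update each centroid.
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # Repeat until the centroids (hence the assignments) stop changing.
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

# Two well-separated toy blobs: the two groups are recovered.
rng = np.random.default_rng(42)
X = np.vstack([rng.normal(0, 0.5, (10, 2)), rng.normal(8, 0.5, (10, 2))])
labels, centroids = kmeans(X, 2)
```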
- Pros: relatively fast, O(t·k·n) for t iterations, k clusters, and n points.
- Cons:
• Must choose K in advance: pick the elbow point of the SSE curve or the peak of the silhouette coefficient.
• Different starting points may lead to different clustering results.
• Sensitive to outliers: use K-Medians (Manhattan distance, median as centre) or K-Medoids (an actual data point as centre) instead.
• Cannot find non-convex clusters: use other clustering methods.
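A minimal sketch of the elbow idea for choosing K, assuming a small helper that runs Lloyd iterations and reports the final SSE (the helper's name and the toy data are mine):

```python
import numpy as np

def kmeans_sse(X, k, iters=20, seed=0):
    """Run bare-bones Lloyd iterations and return the final SSE for this K.
    Sketch only: no empty-cluster handling, fixed iteration count."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        labels = np.linalg.norm(X[:, None] - centroids[None, :], axis=2).argmin(axis=1)
        centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
    return float(((X - centroids[labels]) ** 2).sum())

# Two tight, well-separated blobs: SSE drops sharply from K=1 to K=2
# (the elbow at the true number of clusters), then flattens for larger K.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.1, (20, 2)), rng.normal(5, 0.1, (20, 2))])
sses = [kmeans_sse(X, k) for k in (1, 2)]
```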
Agglomerative Clustering Algorithm (Hierarchical Clustering)
- Objective function for similarity (inter-cluster proximity):
- MIN (single link)
- MAX (complete link)
- Group Average
- Distance Between Centroids
- Other methods driven by an objective function
- Ward's Method uses squared error
- Algorithm
- Compute the proximity matrix
- Let each data point be a cluster
- Repeat
• Merge the two closest clusters
• Update the proximity matrix
- Until only a single cluster remains
- Pros: Do not have to assume any particular number of clusters; may correspond to meaningful taxonomies.
- Cons: computationally expensive (O(n²) space, O(n³) time naively); merging decisions are greedy and cannot be undone.
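The merge loop above can be sketched naively using MIN (single-link) proximity (function name and toy data are mine; real implementations maintain the proximity matrix incrementally rather than rescanning it):

```python
import numpy as np

def agglomerate(X, num_clusters):
    """Naive agglomerative clustering with MIN (single-link) proximity:
    start with one cluster per point, then repeatedly merge the two
    closest clusters until num_clusters remain."""
    clusters = [[i] for i in range(len(X))]   # each data point is a cluster
    while len(clusters) > num_clusters:
        best = (0, 1, np.inf)
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # MIN proximity: closest pair of points across the clusters.
                d = min(np.linalg.norm(X[i] - X[j])
                        for i in clusters[a] for j in clusters[b])
                if d < best[2]:
                    best = (a, b, d)
        a, b, _ = best
        clusters[a] += clusters.pop(b)        # merge the two closest clusters
    return clusters

# Two obvious pairs of points merge into two clusters.
X = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 0.0], [10.0, 1.0]])
clusters = agglomerate(X, 2)
```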
DBSCAN (Density Clustering)
- Density: number of points within a specified radius
• Core Point: a point with high density
• Border Point: a point with low density but in the neighbourhood of a core point
• Noise Point: neither a core point nor a border point
- Algorithm
• A cluster is defined as the maximal set of density connected points
• Start from a randomly selected unseen point P
• If P is a core point, build a cluster by gradually adding all points that are density-reachable from the current point set
• Noise points are discarded (unlabelled).
- Pros:
• Generate clusters of arbitrary shapes.
• Robust against noise.
• No K value required in advance.
• Somewhat similar to human vision.
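The cluster-growing step can be sketched as follows, assuming the usual Eps (radius) and MinPts (density threshold) parameters; variable names and toy data are mine, and real implementations use spatial indexes for the neighbourhood queries rather than a full distance matrix:

```python
import numpy as np

def dbscan(X, eps=1.0, min_pts=3):
    """Minimal DBSCAN sketch: a core point has at least min_pts
    neighbours (including itself) within eps; a cluster is grown by
    adding all density-reachable points. Label -1 marks noise."""
    n = len(X)
    dist = np.linalg.norm(X[:, None] - X[None, :], axis=2)
    neighbours = [np.flatnonzero(dist[i] <= eps) for i in range(n)]
    labels = np.full(n, -1)
    cluster = 0
    for p in range(n):
        if labels[p] != -1 or len(neighbours[p]) < min_pts:
            continue  # already assigned, or not a core point
        # Start a new cluster from the unseen core point p.
        labels[p] = cluster
        frontier = list(neighbours[p])
        while frontier:
            q = frontier.pop()
            if labels[q] == -1:
                labels[q] = cluster
                if len(neighbours[q]) >= min_pts:   # q is core: expand further
                    frontier.extend(neighbours[q])
        cluster += 1
    return labels

# Two dense 4-point blobs plus one isolated noise point.
X = np.array([[0, 0], [0.5, 0], [0, 0.5], [0.5, 0.5],
              [10, 0], [10.5, 0], [10, 0.5], [10.5, 0.5],
              [5, 5]], dtype=float)
labels = dbscan(X, eps=1.0, min_pts=3)
```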