Basics of Clustering
- Finding groups of objects such that objects in a group are similar (or related) to one another and different from (or unrelated to) objects in other groups.
- Why is clustering difficult?
- Curse of dimensionality: Almost all pairs of points are at about the same distance in high-dimensional spaces.
- Clusters can be ambiguous: the same data may support several equally reasonable groupings.
Preliminaries:
- Similarity: normally we use the distance between vectors to measure similarity (smaller distance = more similar).
- Evaluation:
• Cluster cohesion (within-cluster tightness) and separation (between-cluster distance)
• Silhouette coefficient: combines the ideas of both cohesion and separation, for individual points as well as whole clusters
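As a concrete illustration, the per-point silhouette coefficient can be sketched in a few lines (the function name and toy clusters below are my own; for a point x, a is its mean distance to its own cluster and b its mean distance to the nearest other cluster):

```python
import numpy as np

def silhouette(x, own_cluster, nearest_other_cluster):
    """Silhouette coefficient for one point x: s = (b - a) / max(a, b),
    where a = mean distance to the other members of x's own cluster
    (cohesion) and b = mean distance to the nearest other cluster
    (separation). s is close to 1 when x sits well inside its cluster."""
    a = np.mean([np.linalg.norm(x - p) for p in own_cluster])
    b = np.mean([np.linalg.norm(x - p) for p in nearest_other_cluster])
    return (b - a) / max(a, b)

# Toy 1-D example: x lies inside a tight cluster far from the other one.
own = np.array([[0.0], [1.0]])
other = np.array([[10.0], [11.0]])
s = silhouette(np.array([0.5]), own, other)   # (10 - 0.5) / 10 = 0.95
```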
Clustering Techniques
K-means (Partitioning Clustering)
- Objective function: minimise the sum of squared errors, SSE = Σ_i Σ_{x ∈ C_i} dist(x, c_i)², where c_i is the centroid of cluster C_i.
- Algorithm:
• Determine the value of K.
• Choose K cluster centres randomly.
• Each data point is assigned to its closest centroid.
• Use the mean of each cluster to update each centroid.
• Repeat until no more new assignment.
• Return the K centroids.
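The steps above can be sketched as a minimal Lloyd-style implementation (function name and toy data are illustrative; no empty-cluster handling):

```python
import numpy as np

def kmeans(X, k, iters=100, seed=0):
    """Minimal K-means sketch following the steps above."""
    rng = np.random.default_rng(seed)
    # Choose K cluster centres randomly from the data.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        # Assign each point to its closest centroid.
        dists = np.linalg.norm(X[:, None] - centroids[None, :], axis=2)
        labels = dists.argmin(axis=1)
        # Use the mean of each cluster to update each centroid.
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # Repeat until the centroids (hence the assignments) stop changing.
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

# Two well-separated toy blobs: the two groups are recovered.
rng = np.random.default_rng(42)
X = np.vstack([rng.normal(0, 0.5, (10, 2)), rng.normal(8, 0.5, (10, 2))])
labels, centroids = kmeans(X, 2)
```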
- Pros: relatively fast, O(t·k·n) for t iterations, k clusters, and n points.
- Cons:
• Must choose K in advance: pick the elbow point of the SSE curve or the peak of the silhouette coefficient.
• Different starting points may lead to different clustering results.
• Sensitive to outliers: use K-Medians (Manhattan distance, median as centre) or K-Medoids (an actual data point as centre) instead.
• Cannot find non-convex clusters: use other clustering methods.
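A minimal sketch of the elbow idea for choosing K, assuming a small helper that runs Lloyd iterations and reports the final SSE (the helper's name and the toy data are mine):

```python
import numpy as np

def kmeans_sse(X, k, iters=20, seed=0):
    """Run bare-bones Lloyd iterations and return the final SSE for this K.
    Sketch only: no empty-cluster handling, fixed iteration count."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        labels = np.linalg.norm(X[:, None] - centroids[None, :], axis=2).argmin(axis=1)
        centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
    return float(((X - centroids[labels]) ** 2).sum())

# Two tight, well-separated blobs: SSE drops sharply from K=1 to K=2
# (the elbow at the true number of clusters), then flattens for larger K.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.1, (20, 2)), rng.normal(5, 0.1, (20, 2))])
sses = [kmeans_sse(X, k) for k in (1, 2)]
```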
Agglomerative Clustering Algorithm (Hierarchical Clustering)
- Objective function for similarity (inter-cluster proximity):
- MIN (single link)
- MAX (complete link)
- Group Average
- Distance Between Centroids
- Other methods driven by an objective function
- Ward's Method uses squared error
- Algorithm
- Compute the proximity matrix
- Let each data point be a cluster
- Repeat
• Merge the two closest clusters
• Update the proximity matrix
- Until only a single cluster remains
- Pros: Do not have to assume any particular number of clusters; may correspond to meaningful taxonomies.
- Cons: computationally expensive (O(n²) space, O(n³) time naively); merging decisions are greedy and cannot be undone.
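The merge loop above can be sketched naively using MIN (single-link) proximity (function name and toy data are mine; real implementations maintain the proximity matrix incrementally rather than rescanning it):

```python
import numpy as np

def agglomerate(X, num_clusters):
    """Naive agglomerative clustering with MIN (single-link) proximity:
    start with one cluster per point, then repeatedly merge the two
    closest clusters until num_clusters remain."""
    clusters = [[i] for i in range(len(X))]   # each data point is a cluster
    while len(clusters) > num_clusters:
        best = (0, 1, np.inf)
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # MIN proximity: closest pair of points across the clusters.
                d = min(np.linalg.norm(X[i] - X[j])
                        for i in clusters[a] for j in clusters[b])
                if d < best[2]:
                    best = (a, b, d)
        a, b, _ = best
        clusters[a] += clusters.pop(b)        # merge the two closest clusters
    return clusters

# Two obvious pairs of points merge into two clusters.
X = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 0.0], [10.0, 1.0]])
clusters = agglomerate(X, 2)
```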
DBSCAN (Density Clustering)
- Density: number of points within a specified radius
• Core Point: a point with high density
• Border Point: a point with low density but in the neighbourhood of a core point
• Noise Point: neither a core point nor a border point
- Algorithm
• A cluster is defined as the maximal set of density connected points
• Start from a randomly selected unseen point P
• If P is a core point, build a cluster by gradually adding all points that are density-reachable from the current point set
• Noise points are discarded (unlabelled).
- Pros:
• Generate clusters of arbitrary shapes.
• Robust against noise.
• No K value required in advance.
• Somewhat similar to human vision.
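The cluster-growing step can be sketched as follows, assuming the usual Eps (radius) and MinPts (density threshold) parameters; variable names and toy data are mine, and real implementations use spatial indexes for the neighbourhood queries rather than a full distance matrix:

```python
import numpy as np

def dbscan(X, eps=1.0, min_pts=3):
    """Minimal DBSCAN sketch: a core point has at least min_pts
    neighbours (including itself) within eps; a cluster is grown by
    adding all density-reachable points. Label -1 marks noise."""
    n = len(X)
    dist = np.linalg.norm(X[:, None] - X[None, :], axis=2)
    neighbours = [np.flatnonzero(dist[i] <= eps) for i in range(n)]
    labels = np.full(n, -1)
    cluster = 0
    for p in range(n):
        if labels[p] != -1 or len(neighbours[p]) < min_pts:
            continue  # already assigned, or not a core point
        # Start a new cluster from the unseen core point p.
        labels[p] = cluster
        frontier = list(neighbours[p])
        while frontier:
            q = frontier.pop()
            if labels[q] == -1:
                labels[q] = cluster
                if len(neighbours[q]) >= min_pts:   # q is core: expand further
                    frontier.extend(neighbours[q])
        cluster += 1
    return labels

# Two dense 4-point blobs plus one isolated noise point.
X = np.array([[0, 0], [0.5, 0], [0, 0.5], [0.5, 0.5],
              [10, 0], [10.5, 0], [10, 0.5], [10.5, 0.5],
              [5, 5]], dtype=float)
labels = dbscan(X, eps=1.0, min_pts=3)
```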