Basics of Clustering

  • Finding groups of objects such that the objects in a group will be similar (or related) to one another and different from (or unrelated to) the objects in other groups.
  • Why is clustering difficult?
    • Curse of dimensionality: Almost all pairs of points are at about the same distance in high-dimensional spaces.
    • Clusters can be ambiguous: the same data may admit several reasonable groupings.

Preliminary:

  • Similarity: normally measured by the distance between vectors (smaller distance means more similar).
  • evaluation:
    • Cluster Cohesion and Separation
    • Silhouette coefficient: combines the ideas of cohesion and separation, and applies to individual points as well as to whole clusters.
    Data mining lecture6 notes: Clustering
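The silhouette coefficient can be sketched directly from its definition, s(i) = (b − a) / max(a, b), where a is the mean distance from point i to the other members of its own cluster (cohesion) and b is the smallest mean distance to any other cluster (separation). The toy clusters below are illustrative, not from the notes:

```python
# Minimal silhouette-coefficient sketch for a single point (pure Python).

def dist(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q)) ** 0.5

def silhouette(point, own_cluster, other_clusters):
    # a: cohesion -- mean distance to the other members of its own cluster
    others = [p for p in own_cluster if p != point]
    a = sum(dist(point, p) for p in others) / len(others)
    # b: separation -- smallest mean distance to any other cluster
    b = min(sum(dist(point, p) for p in c) / len(c) for c in other_clusters)
    return (b - a) / max(a, b)

c1 = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0)]     # a tight cluster
c2 = [(10.0, 10.0), (10.0, 11.0)]             # a far-away cluster
print(round(silhouette(c1[0], c1, [c2]), 3))  # close to 1 -> the point sits well inside its cluster
```

Values near 1 indicate good clustering, values near 0 a point on a cluster boundary, and negative values a likely misassignment.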

Clustering Techniques

K-means (Partitioning Clustering)

  1. Objective function: minimise the total within-cluster sum of squared errors,
    SSE = Σᵢ Σ_{x∈Cᵢ} ‖x − μᵢ‖², where μᵢ is the centroid (mean) of cluster Cᵢ.

  2. Algorithm:
    • Determine the value of K.
    • Choose K cluster centres randomly.
    • Each data point is assigned to its closest centroid.
    • Use the mean of each cluster to update each centroid.
    • Repeat the assignment and update steps until no assignment changes.
    • Return the K centroids.
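The loop above can be sketched in a few lines of pure Python (the 2-D toy data and the fixed seed are illustrative, not from the notes):

```python
# Minimal K-means sketch following the steps above (pure Python, 2-D points).
import random

def sq_dist(p, q):
    return (p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2

def kmeans(points, k, seed=0):
    random.seed(seed)
    centroids = random.sample(points, k)          # choose K cluster centres randomly
    while True:
        # assignment step: each point goes to its closest centroid
        clusters = [[] for _ in range(k)]
        for p in points:
            clusters[min(range(k), key=lambda j: sq_dist(p, centroids[j]))].append(p)
        # update step: move each centroid to the mean of its cluster
        new = [(sum(x for x, _ in c) / len(c), sum(y for _, y in c) / len(c))
               if c else centroids[j] for j, c in enumerate(clusters)]
        if new == centroids:                      # no assignment changes -> done
            return new, clusters
        centroids = new

pts = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centres, clusters = kmeans(pts, 2)
print(sorted(centres))  # one centre per blob
```

Note the guard for an empty cluster (it keeps its old centroid), a detail the high-level description leaves implicit.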

  3. Pros: relatively fast, O(t·k·n) (t iterations, k clusters, n points).
    Cons: K must be chosen in advance; results depend on the initial centroids;
    sensitive to outliers; assumes roughly convex (globular) clusters.

  4. Choose optimal K: look for the elbow point of the SSE curve and the peak of the silhouette coefficient.
    Different starting points may lead to different clustering results, so a common
    remedy is to run K-means several times and keep the lowest-SSE solution.
    Outliers: use K-Medians (Manhattan distance, median as the centre) or
    K-Medoids (Manhattan distance, an actual data point as the centre).
    Non-convex: use other clustering methods (e.g. density-based).
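The elbow criterion can be illustrated with a self-contained sketch that runs a small K-means for several values of K and reports the SSE; where the drop in SSE levels off is a reasonable K (toy data and seed are illustrative):

```python
# Sketch of the elbow heuristic: run K-means for several K and watch where the
# drop in SSE (sum of squared distances to the assigned centroid) levels off.
import random

def sq(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans_sse(points, k, seed=0):
    random.seed(seed)
    cent = random.sample(points, k)
    for _ in range(100):                          # fixed iteration cap keeps the sketch simple
        groups = [[] for _ in range(k)]
        for p in points:
            groups[min(range(k), key=lambda j: sq(p, cent[j]))].append(p)
        cent = [tuple(sum(v) / len(g) for v in zip(*g)) if g else cent[j]
                for j, g in enumerate(groups)]
    return sum(sq(p, cent[min(range(k), key=lambda j: sq(p, cent[j]))]) for p in points)

# two tight blobs: SSE collapses going from K=1 to K=2, then barely improves
pts = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10), (10, 11), (11, 10), (11, 11)]
for k in (1, 2, 3):
    print(k, round(kmeans_sse(pts, k), 2))
```

The big drop from K=1 to K=2 followed by a tiny further gain is exactly the "elbow" the heuristic looks for.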

Agglomerative Clustering Algorithm (Hierarchical Clustering)

  1. Objective function for inter-cluster similarity:
  • MIN (single link)
  • MAX (complete link)
  • Group average
  • Distance between centroids
  • Other methods driven by an objective function
    • Ward’s Method uses squared error
  2. Algorithm
  • Compute the proximity matrix
  • Let each data point be a cluster
  • Repeat
    • Merge the two closest clusters
    • Update the proximity matrix
  • Until only a single cluster remains
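The loop above can be sketched with MIN (single-link) proximity; for simplicity this version recomputes cluster distances each round instead of maintaining the proximity matrix, and stops at a target cluster count rather than a single cluster (toy data is illustrative):

```python
# Minimal agglomerative clustering sketch with MIN (single-link) proximity.

def dist(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q)) ** 0.5

def single_link(c1, c2):
    # MIN: proximity of two clusters = distance of their closest pair of points
    return min(dist(p, q) for p in c1 for q in c2)

def agglomerate(points, num_clusters):
    clusters = [[p] for p in points]          # each data point starts as its own cluster
    while len(clusters) > num_clusters:       # set num_clusters=1 for the full dendrogram order
        # find and merge the two closest clusters
        i, j = min(((i, j) for i in range(len(clusters)) for j in range(i + 1, len(clusters))),
                   key=lambda ij: single_link(clusters[ij[0]], clusters[ij[1]]))
        clusters[i] += clusters.pop(j)
        # (a real implementation would update a cached proximity matrix here)
    return clusters

pts = [(0, 0), (0, 1), (5, 5), (5, 6)]
print(sorted(sorted(c) for c in agglomerate(pts, 2)))
```

Recomputing all pairwise proximities each round makes this O(n³)-ish; keeping the proximity matrix up to date, as the algorithm above prescribes, is what makes practical implementations feasible.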
  3. Pros: does not require assuming a particular number of clusters in advance;
    the resulting dendrogram may correspond to meaningful taxonomies.
    Cons: O(n²) space for the proximity matrix and at least O(n²) time; merge
    decisions are final and cannot be undone; sensitive to noise and outliers.

DBSCAN (Density Clustering)

  1. Density: the number of points within a specified radius (Eps).
    • Core Point: a point with high density (at least MinPts points within Eps).
    • Border Point: a point with low density that lies in the neighbourhood of a core point.
    • Noise Point: neither a core point nor a border point.

  2. Algorithm
    • A cluster is defined as a maximal set of density-connected points.
    • Start from a randomly selected unseen point P.
    • If P is a core point, build a cluster by gradually adding all points that are density-reachable to the current point set.
    • Noise points are discarded (left unlabelled).
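The procedure above can be sketched as follows; `eps` and `min_pts` are the usual radius and density-threshold parameters, the neighbourhood count includes the point itself, and the toy data is illustrative:

```python
# Minimal DBSCAN sketch (pure Python; points must be unique hashable tuples).

def region(points, p, eps):
    # all points within radius eps of p (including p itself)
    return [q for q in points if sum((a - b) ** 2 for a, b in zip(p, q)) <= eps ** 2]

def dbscan(points, eps, min_pts):
    labels = {p: None for p in points}        # None = unseen, -1 = noise, >0 = cluster id
    cid = 0
    for p in points:
        if labels[p] is not None:
            continue
        neighbours = region(points, p, eps)
        if len(neighbours) < min_pts:         # not a core point -> provisionally noise
            labels[p] = -1
            continue
        cid += 1                              # p is a core point: grow a new cluster
        labels[p] = cid
        queue = [q for q in neighbours if q != p]
        while queue:
            q = queue.pop()
            if labels[q] == -1:
                labels[q] = cid               # noise reachable from a core point becomes a border point
            if labels[q] is not None:
                continue
            labels[q] = cid
            q_neigh = region(points, q, eps)
            if len(q_neigh) >= min_pts:       # q is itself a core point: expand further
                queue.extend(q_neigh)
    return labels

pts = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10), (10, 11), (11, 10), (50, 50)]
labels = dbscan(pts, eps=2.0, min_pts=3)
print(labels[(50, 50)])  # -1: the isolated point is discarded as noise
```

Note how the isolated point is left as noise while the two dense blobs come out as separate clusters, with no K supplied in advance.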

  3. Pros:
    • Generate clusters of arbitrary shapes.
    • Robust against noise.
    • No K value required in advance.
    • Somewhat similar to human vision.
