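【Question title】: How can I implement the PAM clustering algorithm using Gower distance in sklearn?
【Posted】: 2021-06-01 03:27:12
【Question】:

I want to implement the PAM algorithm (KMedoids, method='pam') using the Gower distance.

My dataset contains mixed features, numeric and categorical, and several of the categorical features have more than 1000 distinct values.

I found a suitable Gower distance implementation here: https://github.com/wwwjk366/gower/blob/master/gower/gower_dist.py

My problem is that the sklearn-extra implementation of PAM I am using does not implement a metric='gower' option. So I tried to create a callable, but I am finding it hard to wire the pieces together.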

# cat_mask is a boolean list marking which columns of df_ext are categorical
D = gower.gower_matrix(df_ext, cat_features=cat_mask)

# https://scikit-learn.org/stable/modules/generated/sklearn.metrics.pairwise_distances.html
def get_gower():
    return sklearn.metrics.pairwise_distances(D, metric='precomputed')

# https://scikit-learn-extra.readthedocs.io/en/latest/generated/sklearn_extra.cluster.KMedoids.html
kmedoids = sklearn_extra.cluster.KMedoids(df_ext, metric=get_gower, method='pam')
kmedoids.fit(df_ext)

I get this ValueError:

ValueError                                Traceback (most recent call last)
<ipython-input-13-9ae677cd636a> in <module>
      1 # https://scikit-learn-extra.readthedocs.io/en/latest/generated/sklearn_extra.cluster.KMedoids.html
      2 kmedoids = KMedoids(df_ext, metric=get_gower, method='pam')
----> 3 kmedoids.fit(df_ext)

D:\ProgramFiles\anaconda3\lib\site-packages\sklearn_extra\cluster\_k_medoids.py in fit(self, X, y)
    183         random_state_ = check_random_state(self.random_state)
    184 
--> 185         self._check_init_args()
    186         X = check_array(X, accept_sparse=["csr", "csc"])
    187         if self.n_clusters > X.shape[0]:

D:\ProgramFiles\anaconda3\lib\site-packages\sklearn_extra\cluster\_k_medoids.py in _check_init_args(self)
    154 
    155         # Check n_clusters and max_iter
--> 156         self._check_nonnegative_int(self.n_clusters, "n_clusters")
    157         self._check_nonnegative_int(self.max_iter, "max_iter", False)
    158 

D:\ProgramFiles\anaconda3\lib\site-packages\sklearn_extra\cluster\_k_medoids.py in _check_nonnegative_int(self, value, desc, strict)
    144         else:
    145             negative = (value is None) or (value < 0)
--> 146         if negative or not isinstance(value, (int, np.integer)):
    147             raise ValueError(
    148                 "%s should be a nonnegative integer. "

D:\ProgramFiles\anaconda3\lib\site-packages\pandas\core\generic.py in __nonzero__(self)
   1327 
   1328     def __nonzero__(self):
-> 1329         raise ValueError(
   1330             f"The truth value of a {type(self).__name__} is ambiguous. "
   1331             "Use a.empty, a.bool(), a.item(), a.any() or a.all()."

ValueError: The truth value of a DataFrame is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().

I think something is wrong with my callable. Do you know what I am doing wrong?

【Comments】:

    Tags: python-3.x scikit-learn cluster-analysis


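    【Solution 1】:

    K-medoids (PAM) with the Gower metric in Python

    • Data types: numeric and categorical variables
    • Results compared with R
    • Note: consider scaling your numeric data before applying clustering.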
    import pandas as pd 
    import numpy as np
    import gower
    from sklearn.preprocessing import LabelEncoder
    from sklearn_extra.cluster import KMedoids
    
    # Create a dataframe with both numeric and string type columns 
    
    age = [21, 21, 19, 30, 21, 21, 19, 30, 35, 39, 50, 2]
    gender = ['M', 'M', 'N', 'M', 'F', 'F', 'F', 'F', 'F', 'M', 'F', 'M']
    civil_status = ['MARRIED', 'SINGLE', 'SINGLE', 'SINGLE', 'MARRIED', 'SINGLE', 'WIDOW', 'DIVORCED', 'WIDOW', 'MARRIED', 'WIDOW', 'MARRIED']
    salary = [3000.0, 1200.0 , 32000.0, 1800.0 , 2900.0 , 1100.0 , 10000.0, 1500.0, 200.0, 500.0, 50.0, 5000.0]
    available_credit = [2200, 100, 22000, 1100, 2000, 100, 6000, 2200, 6000, 12000, 500, 50]
    
    df_eg = pd.DataFrame({'age': age,
                     'gender': gender,
                      'civil_status': civil_status,
                     'salary': salary,
                     'available_credit': available_credit})
    # Label encode categorical variables (optional: gower_matrix below is given the raw df_eg, not df_eg_encoded)
    
    df_eg_encoded = df_eg.copy() # Avoid Pandas error
    df_eg_encoded[['gender', 'civil_status']] = df_eg_encoded[['gender', 'civil_status']].apply(LabelEncoder().fit_transform)
    
    
    # Apply Gower distance calculation
    
    gower_mat = gower.gower_matrix(df_eg,  cat_features = [False, True, True, False, False])
    # Fit model
    km_model = KMedoids(n_clusters = 3, random_state = 0, metric = 'precomputed', method = 'pam', init =  'k-medoids++').fit(gower_mat)  
    
    clusters = km_model.labels_
    clusters
    > array([1, 1, 2, 1, 1, 0, 0, 0, 0, 1, 0, 1], dtype=int64)
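To make clearer what a precomputed Gower matrix contains, here is a minimal NumPy sketch of the textbook Gower distance: range-normalized absolute differences for numeric columns, simple mismatch for categorical columns, averaged with equal weights. The function name `gower_matrix_sketch` is hypothetical and this is not the `gower` package's code; details such as missing-value handling or feature weights are left out.

```python
import numpy as np

def gower_matrix_sketch(num, cat):
    """Pairwise Gower distances (illustrative only, equal feature weights).

    num: (n, p) numeric columns, cat: (n, q) categorical columns.
    """
    num = np.asarray(num, dtype=float)
    cat = np.asarray(cat, dtype=object)
    # Range of each numeric column over the whole dataset
    ranges = num.max(axis=0) - num.min(axis=0)
    ranges[ranges == 0] = 1.0  # constant columns then contribute 0
    # Numeric part: |x_i - x_j| / range, shape (n, n, p)
    d_num = np.abs(num[:, None, :] - num[None, :, :]) / ranges
    # Categorical part: 0 if equal, 1 otherwise, shape (n, n, q)
    d_cat = (cat[:, None, :] != cat[None, :, :]).astype(float)
    n_features = num.shape[1] + cat.shape[1]
    return (d_num.sum(axis=2) + d_cat.sum(axis=2)) / n_features

# First three rows of the example data above: age, salary, available_credit
num = [[21, 3000.0, 2200], [21, 1200.0, 100], [19, 32000.0, 22000]]
# gender, civil_status
cat = [["M", "MARRIED"], ["M", "SINGLE"], ["N", "SINGLE"]]
D = gower_matrix_sketch(num, cat)
print(D.round(3))
```

The result is a symmetric matrix with a zero diagonal and entries in [0, 1], which is the shape of input that `metric='precomputed'` expects.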
    

    R code

    install.packages("cluster")
    age <- c(21,21,19, 30,21,21,19,30, 35, 39, 50, 2)
    gender <- c('M','M','N','M','F','F','F','F', 'F', 'M', 'F', 'M')
    civil_status <- c('MARRIED','SINGLE','SINGLE','SINGLE','MARRIED','SINGLE','WIDOW','DIVORCED', 'WIDOW', 'MARRIED', 'WIDOW', 'MARRIED')
    salary <-c (3000.0,1200.0 ,32000.0,1800.0 ,2900.0 ,1100.0 ,10000.0,1500.0, 200.0, 500.0, 50.0, 5000.0)
    available_credit <- c (2200,100,22000,1100,2000,100,6000,2200, 6000, 12000, 500, 50)
    X <- data.frame(age, gender, civil_status, salary, available_credit)
    print(X)
    
    library(cluster)
    gower_mat <- daisy(X, metric = c("gower"))
    pamx <- pam(gower_mat, 3)
    print(pamx)
    > Clustering vector:
    > [1] 1 1 2 1 1 3 3 3 3 1 3 1
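The Python labels are 0-based and the R labels are 1-based, so the two vectors cannot be compared element by element. A small sketch (the helper name `same_partition` is hypothetical) that checks whether two labelings describe the same grouping regardless of the label names:

```python
def same_partition(labels_a, labels_b):
    """Return True if two labelings group the samples identically, ignoring label names."""
    if len(labels_a) != len(labels_b):
        return False
    a_to_b, b_to_a = {}, {}
    for a, b in zip(labels_a, labels_b):
        # Each label on one side must map to exactly one label on the other side
        if a_to_b.setdefault(a, b) != b or b_to_a.setdefault(b, a) != a:
            return False
    return True

python_labels = [1, 1, 2, 1, 1, 0, 0, 0, 0, 1, 0, 1]  # km_model.labels_ shown above
r_labels = [1, 1, 2, 1, 1, 3, 3, 3, 3, 1, 3, 1]       # pamx clustering vector shown above
print(same_partition(python_labels, r_labels))  # True
```

For the two outputs shown here, Python's cluster 0 corresponds to R's cluster 3, and clusters 1 and 2 keep their names, so the partitions agree.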
    

    References

    https://pypi.org/project/gower/
    https://scikit-learn-extra.readthedocs.io/en/stable/generated/sklearn_extra.cluster.KMedoids.html
    https://www.rdocumentation.org/packages/cluster/versions/2.1.2/topics/daisy
    https://www.rdocumentation.org/packages/cluster/versions/2.1.2/topics/pam

    【Comments】:

      【Solution 2】:

      I think I found a solution, but it is very slow on my dataset:

      # Implementation ideas are from: https://github.com/wwwjk366/gower/blob/master/gower/gower_dist.py
      # I adapted gower_get to compute the distance for one pair of samples at a time
      # (this is what a callable metric for scikit-learn-extra's KMedoids requires).
      
      # NOTE: extremely slow on my data. Q: Would it be much easier to use a precomputed D distance matrix? - no, even slower...
      def get_gower(x, y, cat_features=cat_mask):
          xi_cat = x[cat_features]
          xi_num = x[np.logical_not(cat_features)]
          xj_cat = y[cat_features]
          xj_num = y[np.logical_not(cat_features)]
          Z = np.array([x, y])
          Z_num = Z[:, np.logical_not(cat_features)]
      #     print('Z.shape', Z.shape)
          weight = np.ones(Z.shape[1])
      #     print('weight', weight.shape)
          feature_weight_cat = weight[cat_features]
          feature_weight_num = weight[np.logical_not(cat_features)]
          feature_weight_sum = weight.sum()
      #     print('feature_weight_sum', feature_weight_sum.shape)
          categorical_features = np.array(cat_features)
          
          num_cols = Z_num.shape[1]
          num_ranges = np.zeros(num_cols)
          num_max = np.zeros(num_cols)
          
          for col in range(num_cols):
              col_array = Z_num[:, col].astype(np.float32)
              # Note: Z holds only x and y, so these are per-pair extremes, not dataset-wide ones.
              col_max = np.nanmax(col_array)
              col_min = np.nanmin(col_array)

              if np.isnan(col_max):
                  col_max = 0.0
              if np.isnan(col_min):
                  col_min = 0.0
              num_max[col] = col_max
              num_ranges[col] = (1 - col_min / col_max) if (col_max != 0) else 0.0
              
          # categorical columns
          sij_cat = np.where(xi_cat == xj_cat, np.zeros_like(xi_cat), np.ones_like(xi_cat))
      #     print('sij_cat', sij_cat.shape)
          sum_cat = np.multiply(feature_weight_cat,sij_cat).sum() 
      
          # numerical columns
          abs_delta=np.absolute(xi_num-xj_num)
          sij_num=np.divide(abs_delta, num_ranges, out=np.zeros_like(abs_delta), where=num_ranges!=0)
      
          sum_num = np.multiply(feature_weight_num,sij_num).sum()
          sums= np.add(sum_cat,sum_num)
          sum_sij = np.divide(sums,feature_weight_sum)
          
          return sum_sij
      
      kmedoids = KMedoids(metric=get_gower, method='pam')
      kmedoids.fit(df)
      

      Anyway, I am still open to feedback; there must be a simpler way :-)
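On the "simpler way": the k-medoids objective only needs pairwise distances, so a distance matrix computed once can replace the per-pair callable. The sketch below illustrates that idea with NumPy only. The helper names are hypothetical, and the exhaustive search is a stand-in for illustration, not the PAM swap procedure and not sklearn_extra's implementation; it is only feasible for very small n.

```python
from itertools import combinations

import numpy as np

def assign_to_medoids(D, medoids):
    """Label each sample with the position of its nearest medoid; return labels and total cost."""
    D = np.asarray(D, dtype=float)
    sub = D[:, list(medoids)]       # distances from every sample to each medoid
    labels = sub.argmin(axis=1)
    cost = sub.min(axis=1).sum()    # the quantity k-medoids minimizes
    return labels, cost

def brute_force_kmedoids(D, k):
    """Try every set of k medoids and keep the cheapest (illustration only)."""
    n = len(D)
    best = min(combinations(range(n), k), key=lambda m: assign_to_medoids(D, m)[1])
    labels, cost = assign_to_medoids(D, best)
    return list(best), labels, cost

# Toy symmetric distance matrix with two obvious groups: {0, 1} and {2, 3}
D = np.array([
    [0.0, 0.1, 0.9, 0.8],
    [0.1, 0.0, 0.8, 0.9],
    [0.9, 0.8, 0.0, 0.2],
    [0.8, 0.9, 0.2, 0.0],
])
medoids, labels, cost = brute_force_kmedoids(D, k=2)
print(medoids, labels, cost)
```

With sklearn_extra, the corresponding call is the one from Solution 1: pass `n_clusters` by keyword, set `metric='precomputed'`, and fit on the Gower matrix instead of the DataFrame. The traceback in the question suggests that `df_ext` was passed positionally and ended up as `n_clusters`, which is what triggered the ambiguous truth value error. Keep in mind that a full n×n matrix needs memory quadratic in the number of samples.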

      【Comments】:
