【Question title】: Kmeans unique word tags
【Posted】: 2020-07-19 23:22:22
【Question】:

I want to get a list of unique tags from a K-Means clustering. I have the following code:

def cluster_tagging(variable_a_taggear):

    document = result[variable_a_taggear]
    vectorizer = TfidfVectorizer(ngram_range=(1, 5))

    X = vectorizer.fit_transform(document)

    true_k = 180
    puntos2 = true_k

    if model_setting == 'MiniBatchKMeans':
        #model = MiniBatchKMeans(n_clusters=true_k, init='k-means++', max_iter=1000, n_init=1)
        pass
    elif model_setting == 'KMeans':
        model = KMeans(n_clusters=true_k, init='k-means++', max_iter=10000000, n_init=1)


    model.fit(X)

    order_centroids = model.cluster_centers_.argsort()[:, ::-1]
    terms = vectorizer.get_feature_names()
    #print(terms[:8])

    cluster_ = []
    key_ = []
    ID = []

    cluster_col = 'Cluster_%s'%(variable_a_taggear)
    keywords_col = 'Keywords_%s'%(variable_a_taggear)
    word_cloud = pd.DataFrame(columns=[cluster_col, keywords_col])

    for i in range(puntos2):
        print('Cluster %s:' % (i))
        cluster_.append(i)
        key_1 = []
        key_1 = list(set(key_1))
        key_.append(key_1)

        for ind in order_centroids[i, :8]:
            print('%s' % terms[ind])
            terms_ = terms[ind]
            key_1.append(terms_)

    print('first key_', key_)
    info = {cluster_col:cluster_,keywords_col:key_}

    word_cloud = pd.DataFrame(info)
    word_cloud.head()


    #print('Prediction')

    predicted = model.predict(vectorizer.transform(document))
    lst2 = result['Ticket ID']
    predictions = pd.DataFrame(list(zip(predicted, lst2)), columns =[cluster_col, 'Ticket ID'])

    #predictions = pd.DataFrame(predicted,result['Ticket ID'])
    predictions.columns = [cluster_col, 'Ticket ID']
    #print(predictions)

    resultado = pd.merge(predictions, word_cloud, left_on=cluster_col, right_on=cluster_col, how='inner')
    print(resultado.head())
    return resultado

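As you can see, because I am using n-grams, the same words show up repeatedly as parts of different n-grams. For example, for one cluster I get the following tags: ['fecha iniciar', 'iniciar', 'modificar fecha iniciar cc', 'proceder modificar fecha iniciar', 'proceder modificar fecha iniciar cc', 'fecha iniciar cc', 'iniciar cc', 'fecha']. How can I get a list of unique words for each cluster?

Thanks

【Comments】:

    Tags: python-3.x k-means tagging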


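    【Answer 1】:

    Question: How can I get a list of unique words for each cluster?

    You can use nltk to split the tags into individual words and numpy.unique to get the unique values in the resulting array.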

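    import numpy as np
    # word_tokenize needs NLTK's tokenizer data, e.g. nltk.download('punkt')
    from nltk.tokenize import word_tokenize
    
    # '...' stands for the remaining tags of the cluster; replace it with the real list
    cluster_tags  = ['fecha iniciar', 'iniciar', ..., 'fecha']
    one_string = ' '.join(cluster_tags)
    # np.unique returns the distinct tokens as a sorted array
    np.unique(word_tokenize(one_string))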
    

    If you are sure that the words are always separated by whitespace, you can simply split them...

    np.unique(' '.join(cluster_tags).split())
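
Note that np.unique returns its result sorted alphabetically. If the original ranking of the terms matters, a minimal sketch of an order-preserving alternative is shown here, using the tag list from the question; the name unique_words is only illustrative.

```python
# Tags for one cluster, copied from the question.
cluster_tags = [
    'fecha iniciar', 'iniciar', 'modificar fecha iniciar cc',
    'proceder modificar fecha iniciar', 'proceder modificar fecha iniciar cc',
    'fecha iniciar cc', 'iniciar cc', 'fecha',
]

# Split every tag on whitespace and flatten into one list of words.
words = ' '.join(cluster_tags).split()

# dict.fromkeys keeps the first occurrence of each word and preserves
# insertion order, so the result is unique but not re-sorted.
unique_words = list(dict.fromkeys(words))
print(unique_words)
```

This relies on dictionaries keeping insertion order, which Python guarantees from version 3.7 onward.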
    

    Bonus tip: if you like, you can also count the frequency of each word.

    # See answer by Max Malysh: https://stackoverflow.com/questions/952914/how-to-make-a-flat-list-out-of-list-of-lists
    from collections import Counter
    from pandas.core.common import flatten
    
    tokenized = [word_tokenize(text) for text in cluster_tags]
    Counter(flatten(tokenized))
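
To apply the same idea to every cluster in the question's loop, something along these lines could work. It is only a sketch: unique_words_per_cluster and top_n are hypothetical names, the small terms and order_centroids lists are toy stand-ins for vectorizer.get_feature_names() and model.cluster_centers_.argsort()[:, ::-1], and plain whitespace splitting is assumed instead of word_tokenize.

```python
def unique_words_per_cluster(order_centroids, terms, top_n=8):
    """For each cluster, collect the unique words of its top_n n-gram terms."""
    keywords = []
    for row in order_centroids:
        words = []
        for ind in row[:top_n]:
            # Each term may be an n-gram such as 'fecha iniciar cc'.
            words.extend(terms[ind].split())
        # Drop repeated words while keeping the order of first appearance.
        keywords.append(list(dict.fromkeys(words)))
    return keywords


# Toy stand-ins (assumptions, not real model output).
terms = ['fecha', 'fecha iniciar', 'iniciar', 'iniciar cc']
order_centroids = [
    [1, 2, 0, 3],  # cluster 0: term indices, highest weight first
    [3, 2, 1, 0],  # cluster 1
]

keywords = unique_words_per_cluster(order_centroids, terms, top_n=2)
print(keywords)
```

The returned list has one entry per cluster, so it could replace key_ when building the word_cloud DataFrame in the question.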
    

    【Comments】:
