【Posted】: 2018-08-29 02:49:31
【Question】:
I tried DBSCAN clustering on tf-idf weighted word2vec vectors, using different epsilon and minpts thresholds for DBSCAN. I also tried OPTICS with different minpts values, but it produced no output at all.
# Import libraries
from sklearn.feature_extraction.text import TfidfVectorizer
from nltk.stem import WordNetLemmatizer
from unidecode import unidecode  # $ pip install unidecode
import gensim
import csv
import nltk
from sklearn.feature_extraction import text
import pandas as pd
import numpy as np
from collections import defaultdict

# Read data: the 'Certi' column of the first 500,000 rows
dat = pd.read_csv('D:\\data_800k.csv', encoding='latin', nrows=500000).Certi.tolist()
wnl = WordNetLemmatizer()  # created but not used below
# nltk.download('punkt')
my_stop_words = text.ENGLISH_STOP_WORDS.union(['education', 'certification', 'certificate', 'certified'])

def tokenize_stop(row):
    # Transliterate to ASCII, lowercase, tokenize, and drop stop words
    az = []
    for j in nltk.word_tokenize(unidecode(row).lower()):
        if j not in my_stop_words:
            az.append(j)
    return az

def preprocess(dat):
    return [tokenize_stop(row) for row in dat]

X = preprocess(dat)
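# word2vec (gensim < 4.0 API; gensim 4.x renamed these to vector_size=, index_to_key and vectors)
model = gensim.models.Word2Vec(X, size=100)
w2v = dict(zip(model.wv.index2word, model.wv.syn0))
# tf-idf over the pre-tokenized documents (identity analyzer)
tfidf = TfidfVectorizer(analyzer=lambda x: x)
tfidf.fit(X)
max_idf = max(tfidf.idf_)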
# Train model
def fit(X):
    tfidf = TfidfVectorizer(analyzer=lambda x: x)
    tfidf.fit(X)
    # if a word was never seen - it must be at least as infrequent
    # as any of the known words - so the default idf is the max of
    # known idf's
    max_idf = max(tfidf.idf_)
    return defaultdict(
        lambda: max_idf,
        [(w, tfidf.idf_[i]) for w, i in tfidf.vocabulary_.items()])

# Actual training
word2weight = fit(X)
# Scale each word2vec vector by its idf weight and average per document
def transform_word2vec_tfidf(X, word2vec, word2weight):
    # Vector dimensionality, taken from any stored vector
    dim = len(next(iter(word2vec.values())))
    return np.array([
        np.mean([word2vec[w] * word2weight[w]
                 for w in words if w in word2vec] or
                [np.zeros(dim)], axis=0)
        for words in X
    ])

export_data_w2v_Tfidf = transform_word2vec_tfidf(X, w2v, word2weight)
np.savetxt('D:\\Azim\\data_500k_w2v_tfidf.csv', export_data_w2v_Tfidf, delimiter=',', fmt='%1.15e')
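To make the weighting step concrete, here is a minimal sketch on toy data. It is not the poster's pipeline: `toy_w2v`, `toy_weights`, and `weighted_mean_vector` are hypothetical stand-ins for the trained `w2v` dict, the `word2weight` mapping, and the transform above. It shows that each document becomes the plain mean of its idf-scaled word vectors, and that a document with no known words becomes the zero vector.

```python
import numpy as np
from collections import defaultdict

# Hypothetical toy inputs standing in for the trained w2v dict and idf weights.
toy_w2v = {
    "java": np.array([1.0, 0.0]),
    "oracle": np.array([0.0, 1.0]),
}
toy_weights = defaultdict(lambda: 2.0, {"java": 1.0, "oracle": 2.0})

def weighted_mean_vector(words, word2vec, word2weight, dim):
    """Mean of idf-scaled word vectors; zero vector if no word is known."""
    vecs = [word2vec[w] * word2weight[w] for w in words if w in word2vec]
    return np.mean(vecs, axis=0) if vecs else np.zeros(dim)

# "unknownword" has no vector, so it is skipped rather than weighted.
doc_known = weighted_mean_vector(["java", "oracle", "unknownword"],
                                 toy_w2v, toy_weights, dim=2)
doc_empty = weighted_mean_vector(["unknownword"], toy_w2v, toy_weights, dim=2)

print(doc_known)  # mean of [1, 0] * 1.0 and [0, 1] * 2.0, i.e. values 0.5 and 1.0
print(doc_empty)  # all zeros
```

One thing that may be worth checking in the real data: every document without any in-vocabulary word lands on the same zero vector, and identical points are at distance zero from each other, which can look like a dense cluster to a density-based algorithm.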
Below are the ELKI screenshots. Can anyone who has managed to cluster text data meaningfully with DBSCAN or any other algorithm share some insight? Thanks.
【Discussion】:
-
Have you tried a larger epsilon? A k-dist plot? An OPTICS plot?
-
@Anony-Mousse Yes, with a larger epsilon it runs out of memory. The same happens with OPTICS.
-
Experiment on a subset first. There is no need to solve scalability before you have found the right approach.
-
I did that for 50,000 documents, then tried those epsilon values on the whole dataset. The results shown in this Stack Overflow question are for the whole dataset.
-
Also, the minpts value is too high.
Tags: python cluster-analysis dbscan elki
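The suggestions in the comments (work on a subset, look at a k-dist plot before picking epsilon, keep minpts modest) can be sketched roughly as follows. This is an illustration under stated assumptions, not the poster's setup: it uses scikit-learn rather than ELKI, the `vectors` matrix is synthetic stand-in data, `min_pts = 10` is arbitrary, and taking a percentile of the k-distance curve is only a crude substitute for reading the knee off the plot by eye.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.neighbors import NearestNeighbors

rng = np.random.RandomState(0)

# Assumption: synthetic stand-in for the exported document vectors
# (three Gaussian blobs in 100 dimensions); replace with the real matrix.
centers = rng.normal(scale=5.0, size=(3, 100))
vectors = np.vstack([c + rng.normal(size=(2000, 100)) for c in centers])

# 1. Work on a random subset first.
subset = vectors[rng.choice(len(vectors), size=1500, replace=False)]

# 2. k-distance curve: distance of each point to its min_pts-th nearest
#    neighbour, counting the point itself (it is returned in column 0 with
#    distance 0, matching how scikit-learn's min_samples counts).
min_pts = 10  # arbitrary illustrative value
nn = NearestNeighbors(n_neighbors=min_pts).fit(subset)
dists, _ = nn.kneighbors(subset)
k_dist = np.sort(dists[:, -1])

# Plotting k_dist and looking for the knee is the usual way to pick epsilon;
# the percentile below is a hypothetical shortcut for this sketch only.
eps = np.percentile(k_dist, 90)

# 3. Run DBSCAN on the subset with the chosen parameters.
labels = DBSCAN(eps=eps, min_samples=min_pts).fit(subset).labels_
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
noise_fraction = float(np.mean(labels == -1))

print("eps:", eps)
print("clusters:", n_clusters)
print("noise fraction:", noise_fraction)
```

If the k-distance curve on the real vectors is nearly flat with no visible knee, that by itself suggests the data has little density structure at that minpts, and tuning epsilon alone may not produce meaningful clusters.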