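【Posted】: 2018-02-26 04:40:03
【Question】:
I have a set of documents and have built a feature matrix from them. I then compute the cosine distance between every pair of documents and pass that cosine distance matrix to the DBSCAN algorithm. My code is below.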
import pandas as pd
import numpy as np
from sklearn.metrics import pairwise_distances
from scipy.spatial.distance import cosine
from sklearn.cluster import DBSCAN
# Initialize some documents
doc1 = {'Science':0.8, 'History':0.05, 'Politics':0.15, 'Sports':0.1}
doc2 = {'News':0.2, 'Art':0.8, 'Politics':0.1, 'Sports':0.1}
doc3 = {'Science':0.8, 'History':0.1, 'Politics':0.05, 'News':0.1}
doc4 = {'Science':0.1, 'Weather':0.2, 'Art':0.7, 'Sports':0.1}
doc5 = {'Science':0.2, 'Weather':0.7, 'Art':0.8, 'Sports':0.9}
doc6 = {'Science':0.2, 'Weather':0.8, 'Art':0.8, 'Sports':1.0}
collection = [doc1, doc2, doc3, doc4, doc5, doc6]
df = pd.DataFrame(collection)
# Fill missing values with zeros
df.fillna(0, inplace=True)
# Get Feature Vectors
feature_matrix = df.values  # DataFrame.as_matrix() was removed in pandas 1.0
print(feature_matrix.tolist())
# Get cosine distance between pairs
sims = pairwise_distances(feature_matrix, metric='cosine')
# Fit DBSCAN
db = DBSCAN(min_samples=1, metric='precomputed').fit(sims)
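To make explicit what `sims` holds, here is a minimal sketch on hypothetical toy vectors (`X_demo` is made up for illustration and is not the data from the post). It suggests that `pairwise_distances(..., metric='cosine')` returns 1 minus the cosine similarity, so despite its name `sims` is a distance matrix with a zero diagonal, which is the kind of input `metric='precomputed'` expects.

```python
import numpy as np
from sklearn.metrics import pairwise_distances
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical toy vectors, not the documents from the post
X_demo = np.array([[0.8, 0.1, 0.0],
                   [0.7, 0.2, 0.0],
                   [0.0, 0.1, 0.9]])

dist = pairwise_distances(X_demo, metric='cosine')
sim = cosine_similarity(X_demo)

# Cosine distance is 1 - cosine similarity
distance_matches = np.allclose(dist, 1.0 - sim)
# Every point is at distance 0 from itself
zero_diagonal = np.allclose(np.diag(dist), 0.0)
print(distance_matches, zero_diagonal)
```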
Next, following sklearn's DBSCAN demo, I plot the clusters. That is, wherever the demo uses X, I pass in sims, which is my cosine distance matrix.
labels = db.labels_
n_clusters_ = len(set(labels)) - (1 if -1 in labels else 0)
print('Estimated number of clusters: %d' % n_clusters_)
core_samples_mask = np.zeros_like(db.labels_, dtype=bool)
core_samples_mask[db.core_sample_indices_] = True
#print(labels)
# Plot result
import matplotlib.pyplot as plt
# Black removed and is used for noise instead.
unique_labels = set(labels)
colors = [plt.cm.Spectral(each)
          for each in np.linspace(0, 1, len(unique_labels))]
for k, col in zip(unique_labels, colors):
    if k == -1:
        # Black used for noise.
        col = [0, 0, 0, 1]
    class_member_mask = (labels == k)
    xy = sims[class_member_mask & core_samples_mask]
    plt.plot(xy[:, 0], xy[:, 1], 'o', markerfacecolor=tuple(col),
             markeredgecolor='k', markersize=14)
    xy = sims[class_member_mask & ~core_samples_mask]
    plt.plot(xy[:, 0], xy[:, 1], 'o', markerfacecolor=tuple(col),
             markeredgecolor='k', markersize=6)
plt.title('Estimated number of clusters: %d' % n_clusters_)
plt.show()
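- My first question: is it correct to replace X with sims, given that X holds coordinate values in the sklearn demo while sims holds cosine distance values?
- My second question: is it possible to draw a given point in red? For example, I would like the point in feature_matrix that represents [0.8, 0.0, 0.0, 0.0, 0.2, 0.9, 0.7] to be red.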
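To illustrate both questions, here is a minimal sketch under stated assumptions, not a definitive answer. Each row of `sims` holds one document's distances to every document, so `sims[:, 0]` and `sims[:, 1]` are distances to the first two documents rather than x/y coordinates. One possible workaround (an assumption, not something from the post) is to L2-normalise the feature vectors and project them to two dimensions with PCA purely for display, while still clustering on the precomputed cosine distances. The names `features` and `target` are hypothetical stand-ins for `feature_matrix` and the row to highlight.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import DBSCAN
from sklearn.decomposition import PCA
from sklearn.metrics import pairwise_distances
from sklearn.preprocessing import normalize

# Hypothetical stand-in for feature_matrix (rows = documents)
features = np.array([
    [0.8, 0.05, 0.15, 0.1, 0.0],
    [0.8, 0.10, 0.05, 0.0, 0.1],
    [0.0, 0.00, 0.10, 0.1, 0.8],
    [0.1, 0.00, 0.00, 0.1, 0.7],
])
target = np.array([0.0, 0.0, 0.1, 0.1, 0.8])  # hypothetical row to draw in red

# Cluster on the precomputed cosine distance matrix, as in the post
dist = pairwise_distances(features, metric='cosine')
labels = DBSCAN(min_samples=1, metric='precomputed').fit(dist).labels_
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)

# For display only: L2-normalise rows so Euclidean geometry tracks
# cosine distance, then project to 2-D with PCA
coords = PCA(n_components=2).fit_transform(normalize(features))

# Locate the row equal to target (np.isclose tolerates float rounding)
is_target = np.all(np.isclose(features, target), axis=1)

fig, ax = plt.subplots()
# All other points, coloured by cluster label
ax.scatter(coords[~is_target, 0], coords[~is_target, 1],
           c=labels[~is_target], cmap='Spectral', edgecolors='k', s=120)
# The chosen point, drawn in red on top
ax.scatter(coords[is_target, 0], coords[is_target, 1],
           c='red', edgecolors='k', s=120)
ax.set_title('Estimated number of clusters: %d' % n_clusters)
```

Calling `plt.show()` afterwards displays the figure. The 2-D projection loses information, so points that look close in the plot are not guaranteed to be close in cosine distance.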
【Discussion】:
Tags: python numpy matplotlib scikit-learn