【Posted】: 2017-01-03 08:31:10
【Question】:
Suppose I have a data frame containing different lines of text, and I want to cluster those lines to find latent topics in the data:
import numpy as np  # needed for np.random.randint below
import pandas as pd

df = pd.DataFrame({
    "id_num": np.random.randint(low=0, high=50, size=10),
    "text": [
        "hello these are words i would like to cluster",
        "hello i would like to go home",
        "home i would like to go please thank you",
        "thank you please apple banana",
        "orange banana apple fruit corn",
        "orange orange orange banana banana banana banana",
        "can you take me home i have had enough of this place",
        "i am bored can we go home",
        "i would like to leave now to go home",
        "apple apple banana",
    ],
})
First I split this dataframe into train and test:
>>> from sklearn.model_selection import train_test_split  # sklearn.cross_validation in releases before 0.18
>>> train, test = train_test_split(df, test_size = 0.40)
>>> train, test = train["text"], test["text"]
Then I start the clustering process:
>>> from sklearn.feature_extraction.text import TfidfVectorizer
>>> from sklearn.cluster import KMeans
>>> vectorizer = TfidfVectorizer()
>>> train_X = vectorizer.fit_transform(train)
>>> test_X = vectorizer.fit_transform(test)
>>> model = KMeans(n_clusters = 2)
>>> model.fit(train_X)
>>> model.predict(test_X)
ValueError: Incorrect number of features. Got 22 features, expected 18.
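Of course, if you run this code on your own machine you will probably get different numbers, and the feature counts might even happen to match. In most cases, though, train_X and test_X will not have the same number of columns.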
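To make the mismatch concrete, here is a minimal sketch in plain Python. The two small splits and the `build_vocabulary` helper are hypothetical, and splitting on whitespace is only a simplification of what `TfidfVectorizer` does. The point is that a vocabulary built separately from each split differs in size, and the same word can end up at a different column index.

```python
# Hypothetical splits; whitespace tokenization is a simplification of
# TfidfVectorizer's default tokenizer.
train_docs = ["hello i would like to go home", "apple apple banana"]
test_docs = ["orange banana apple fruit corn", "i am bored can we go home"]


def build_vocabulary(docs):
    """Map each distinct word to a column index, in sorted order."""
    words = sorted({word for doc in docs for word in doc.split()})
    return {word: index for index, word in enumerate(words)}


# Building a vocabulary per split mimics calling fit_transform twice.
train_vocab = build_vocabulary(train_docs)
test_vocab = build_vocabulary(test_docs)

print(len(train_vocab), len(test_vocab))         # 9 12
print(train_vocab["home"], test_vocab["home"])   # 4 8
```

So even when the two counts happen to agree, column j in one matrix does not necessarily refer to the same word as column j in the other.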
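Has anyone else dealt with this? One way I can think of to make the dimensions equal is a kind of dimensionality reduction: keep only the features (read: words) that occur in both train and test. Another solution, which yields larger matrices, is to pad both matrices with zeros for the words that a given document lacks but the other corpus contains.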
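Both ideas can be sketched with the `vocabulary` parameter of `TfidfVectorizer`, which fixes the columns up front. This is only an illustration under assumptions: the two small splits are hypothetical, and the variable names are mine. Note that either variant looks at the test words when choosing the columns, which may or may not be acceptable for your evaluation.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical splits, for illustration only.
train_docs = ["hello i would like to go home", "apple apple banana"]
test_docs = ["orange banana apple fruit corn", "i am bored can we go home"]

# Tokenize the same way the vectorizer would (the default analyzer
# lowercases and drops one-character tokens such as "i").
analyzer = TfidfVectorizer().build_analyzer()
train_words = {word for doc in train_docs for word in analyzer(doc)}
test_words = {word for doc in test_docs for word in analyzer(doc)}

# Idea 1: keep only the words present in both splits.
common = sorted(train_words & test_words)
common_vectorizer = TfidfVectorizer(vocabulary=common)
train_common = common_vectorizer.fit_transform(train_docs)
test_common = common_vectorizer.transform(test_docs)

# Idea 2: use every word from either split; words a document lacks
# simply get a zero in that column.
union = sorted(train_words | test_words)
union_vectorizer = TfidfVectorizer(vocabulary=union)
train_union = union_vectorizer.fit_transform(train_docs)
test_union = union_vectorizer.transform(test_docs)

print(train_common.shape, test_common.shape)
print(train_union.shape, test_union.shape)
```

In both variants the train and test matrices share the same columns, so a model fitted on one can score the other.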
Are there other ways to solve this?
【Discussion】:
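One common way around the error is to fit the vectorizer on the training text only and call `transform`, not `fit_transform`, on the test text. That reuses the training vocabulary and idf weights, so test-only words are ignored and the column count matches by construction. Below is a minimal sketch assuming a fixed, hand-picked split of the question's sentences (the split, `n_init=10` and `random_state=0` are my choices, not part of the original post).

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

# Fixed split of the question's sentences, chosen by hand for reproducibility.
train_docs = [
    "hello these are words i would like to cluster",
    "hello i would like to go home",
    "home i would like to go please thank you",
    "thank you please apple banana",
    "orange banana apple fruit corn",
    "orange orange orange banana banana banana banana",
]
test_docs = [
    "can you take me home i have had enough of this place",
    "i am bored can we go home",
    "i would like to leave now to go home",
    "apple apple banana",
]

vectorizer = TfidfVectorizer()
train_X = vectorizer.fit_transform(train_docs)  # learns vocabulary and idf
test_X = vectorizer.transform(test_docs)        # reuses them, no refitting

model = KMeans(n_clusters=2, n_init=10, random_state=0)
model.fit(train_X)
labels = model.predict(test_X)

print(train_X.shape, test_X.shape)
print(labels)
```

The trade-off is that words appearing only in the test text contribute nothing to the test vectors, which is usually what you want when the test set stands in for unseen data.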