Group similar words/sentences in a pandas column: answers

[Question title]: Group similar words/sentences in pandas column
[Posted]: 2018-04-10 17:04:49
[Question]:

I tried this code:

Counter(" ".join(df["text"]).split()).most_common(100)

to get the most common words, but what I actually want is to count groups of words that sentences have in common. For example:

1. A123 B234 C345 test data.
2. A123 B234 C345 D555 test data.
3. A123 B234 test data.
4. A123 B234 C345 more data.

I want counts like:

A123 B234 data - 4
A123 B234 test data - 3
A123 B234 C345 test data - 3

I am looking for groups of words that are shared between sentences and have a high count. How can I get this in pandas/Python?

Example sentences:

Money transferred from xyz@abc.com to account no.123
Money transferred from xyz@abc.net to account no.abc
Money failed transferring from xyz@abc. to account no.cde
Money transferred from example@yyy.com to account no.www
Money failed transferring from xyz@abc.com to account no.ttt

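[Comments on the question]:

I don't understand the dataset you want - can you elaborate?
@MaxU The desired dataset is a set of words that match across the sentences, together with their counts.
@jason - How many rows does the DataFrame have? How many unique words are there?
@jezrael I am looking at splitting each sentence into words, grouping the shared words together with their counts, and sorting by count in descending order. Something along those lines, just like my example.
@jason - I am asking because with many rows or many unique words I am not sure my solution will be fast... Please check it.

Tags: python pandas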

[Solution 1]:

Use groupby with multiple columns as input, followed by the size method:

df.groupby(['col1','col2','col3']).size().sort_values()
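
To make the one-liner above concrete, here is a minimal sketch. It assumes (the answer does not say this) that col1, col2 and col3 hold the first three whitespace-separated tokens of each sentence, and it uses the question's sample sentences with the trailing period dropped.

```python
import pandas as pd

# Sample sentences from the question (trailing periods dropped for simplicity).
df = pd.DataFrame({"text": [
    "A123 B234 C345 test data",
    "A123 B234 C345 D555 test data",
    "A123 B234 test data",
    "A123 B234 C345 more data",
]})

# Assumption: col1..col3 are the first three tokens of each sentence.
tokens = df["text"].str.split(expand=True)
parts = tokens.iloc[:, :3].copy()
parts.columns = ["col1", "col2", "col3"]

# Count how many rows share the same (col1, col2, col3) triple.
counts = parts.groupby(["col1", "col2", "col3"]).size().sort_values()
print(counts)
```

Because the rows are grouped on exact positional values, this only counts sentences that agree token by token in those three positions. It does not count word sets that appear at different positions in a sentence.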

[Comments]:

Hi vumaasha. Please explain.

[Solution 2]:

One possible solution:

df = df['col'].str.get_dummies(' ')
print (df)
   A123  B234  C345  D555  data  more  test
0     1     1     1     0     1     0     1
1     1     1     1     1     1     0     1
2     1     1     0     0     1     0     1
3     1     1     1     0     1     1     0

Alternative:

import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

mlb = MultiLabelBinarizer()
df = pd.DataFrame(mlb.fit_transform(df['col'].str.split()),
                  columns=mlb.classes_, 
                  index=df.index)
print (df)
   A123  B234  C345  D555  data  more  test
0     1     1     1     0     1     0     1
1     1     1     1     1     1     0     1
2     1     1     0     0     1     0     1
3     1     1     1     0     1     1     0

Get all combinations of the columns (words), from the maximum length down to min_length:

from itertools import combinations
a = df.columns
min_length = 3
comb = [j for i in range(len(a), min_length -1, -1) for j in combinations(a,i)]
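
The number of combinations grows quickly with the number of unique words, which is presumably why the comments ask about the data size. The sketch below counts the combinations for the seven sample words and for a few vocabulary sizes chosen only for illustration; the helper name n_combinations is hypothetical.

```python
from itertools import combinations
from math import comb  # requires Python 3.8+


def n_combinations(n_words, min_length):
    """Number of word combinations with min_length..n_words members."""
    return sum(comb(n_words, k) for k in range(min_length, n_words + 1))


words = ["A123", "B234", "C345", "D555", "data", "more", "test"]
min_length = 3

# Same enumeration as in the answer: longest combinations first.
generated = [c for i in range(len(words), min_length - 1, -1)
             for c in combinations(words, i)]
print(len(generated), n_combinations(len(words), min_length))

# Illustrative vocabulary sizes (assumed values, not from the question).
for n in (10, 20, 30):
    print(n, n_combinations(n, min_length))
```

For the seven sample words this gives 99 combinations; each extra unique word roughly doubles the total, so the approach is only practical for small vocabularies.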

Count the matching rows for each combination in a list comprehension:

df1 = pd.DataFrame([(', '.join(x), df.loc[:, x].all(axis=1).sum(), len(x)) for x in comb], 
                    columns=['words','count','len'])

TOP = 2
TOP_count = sorted(df1['count'].unique())[-TOP:]
df1 = df1[df1['count'].isin(TOP_count)].sort_values(['count', 'len'], ascending=False)
print (df1)
                     words  count  len
66        A123, B234, data      4    3
30  A123, B234, C345, data      3    4
37  A123, B234, data, test      3    4
64        A123, B234, C345      3    3
68        A123, B234, test      3    3
70        A123, C345, data      3    3
77        A123, data, test      3    3
80        B234, C345, data      3    3
87        B234, data, test      3    3
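
To see where a single count in the table comes from, the following sketch checks one combination at a time. It rebuilds the indicator frame from the sample sentences (trailing periods dropped, which is an assumption about the answerer's input); the helper name count_rows_with_all is hypothetical.

```python
import pandas as pd

sentences = pd.Series([
    "A123 B234 C345 test data",
    "A123 B234 C345 D555 test data",
    "A123 B234 test data",
    "A123 B234 C345 more data",
], name="col")

# One 0/1 column per unique word, one row per sentence.
indicators = sentences.str.get_dummies(sep=" ")


def count_rows_with_all(frame, words):
    """Count the rows in which every word in `words` is present."""
    # all(axis=1) is True only for rows where every selected column is 1;
    # summing the resulting booleans counts those rows.
    return int(frame[list(words)].all(axis=1).sum())


print(count_rows_with_all(indicators, ["A123", "B234", "data"]))
print(count_rows_with_all(indicators, ["A123", "B234", "data", "test"]))
```

The first call counts all four sentences and the second counts three, matching the first and third rows of the table above.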

EDIT:

Pure Python solution:

from itertools import combinations, takewhile
from collections import Counter

min_length = 3
d = Counter()
for a in df['col'].str.split():
    for i in range(len(a), min_length -1, -1):
        for j in combinations(a,i):
            d[j] +=1
#print (d)

#https://stackoverflow.com/a/26831143
def get_items_upto_count(dct, n):
  data = dct.most_common()
  val = data[n-1][1] #get the value of n-1th item
  #Now collect all items whose value is greater than or equal to `val`.
  return list(takewhile(lambda x: x[1] >= val, data))

L = get_items_upto_count(d, 2)

s = pd.DataFrame(L, columns=['val','count'])
print (s)
                        val  count
0        (A123, B234, data)      4
1  (A123, B234, C345, data)      3
2  (A123, B234, test, data)      3
3        (A123, B234, C345)      3
4        (A123, B234, test)      3
5        (A123, C345, data)      3
6        (A123, test, data)      3
7        (B234, C345, data)      3
8        (B234, test, data)      3
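
The loop above keys each combination on the words in sentence order, so the same words in a different order, or a word repeated within one sentence, would be counted separately. A possible variant, shown here as a sketch rather than as the answerer's code, sorts the unique tokens of each sentence first. It is applied to the example sentences from the question, split on whitespace only.

```python
from collections import Counter
from itertools import combinations

sentences = [
    "Money transferred from xyz@abc.com to account no.123",
    "Money transferred from xyz@abc.net to account no.abc",
    "Money failed transferring from xyz@abc. to account no.cde",
    "Money transferred from example@yyy.com to account no.www",
    "Money failed transferring from xyz@abc.com to account no.ttt",
]
min_length = 3

counts = Counter()
for sentence in sentences:
    # Sorted unique tokens make each combination independent of word order.
    tokens = sorted(set(sentence.split()))
    for size in range(min_length, len(tokens) + 1):
        counts.update(combinations(tokens, size))

# Word groups shared by the largest number of sentences.
top = max(counts.values())
shared = [combo for combo, n in counts.items() if n == top]
for combo in sorted(shared, key=len, reverse=True):
    print(combo, top)
```

Only the words Money, from, to and account occur in all five sentences, so the top groups are the subsets of those four words with at least three members.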

[Comments]:

Hi jezrael! I added example sentences at the top. Please take a look :)
@jason - Please see the chat.