How to group a data frame with similar text in Python: answers

【Question title】: How to group data frame with similar text in python
【Posted】: 2021-07-18 07:22:42
【Question description】:

I have a data frame DF like this:

DF = pd.DataFrame({'Code':['abc', 'abc', 'abc', 'abc', 'def'],  
               'Description':['ABC String', 'ABC String', 'ABC String and sth', 'Only sth else', 'ABC String'],     
               'Value':[10, 20, 30, 40, 100]})

I need to group it by Code and by Description. Grouping by Code is easy:

GR = DF.groupby('Code')

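Now I want to continue by grouping on Description, so that all values that are equal or similar (sharing a common part) end up in the same group. Can you help me with a formula to get something like this:

There are probably two problems here: "equal values" and "similar values". Even a hint for the "equal values" case would be great.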

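【Question comments】:

For "equal values", have you tried a pivot table, like this?
Are there many other "equal" strings besides ABC String?
Thanks Amri, I'll take a look. And yes, Serge, there are other strings in the original set.

Tags: python pandas pandas-groupby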

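【Solution 1】:

It really depends on how you define "similar". If you mean "the first 10 characters are the same", you can use string slicing as a second grouping key.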

DF = pd.DataFrame({'Code':['abc', 'abc', 'abc', 'abc', 'def'],  
               'Description':['ABC String', 'ABC String', 'ABC String and sth', 'Only sth else', 'ABC String'],     
               'Value':[10, 20, 30, 40, 100]})  

DF.groupby(["Code", DF.Description.str[:10]])["Value"].sum()
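To make the difference between the "equal values" and the "first 10 characters" cases concrete, here is a small sketch that runs both on the question's data. The variable names `exact` and `prefix` are only illustrative.

```python
import pandas as pd

DF = pd.DataFrame({'Code': ['abc', 'abc', 'abc', 'abc', 'def'],
                   'Description': ['ABC String', 'ABC String', 'ABC String and sth',
                                   'Only sth else', 'ABC String'],
                   'Value': [10, 20, 30, 40, 100]})

# "Equal values": group on the full description text
exact = DF.groupby(['Code', 'Description'])['Value'].sum()

# "Similar values", defined here as sharing the first 10 characters
prefix = DF.groupby(['Code', DF['Description'].str[:10]])['Value'].sum()

print(exact)
print(prefix)
```

With exact grouping, 'ABC String and sth' stays in its own group; with the prefix key it is merged into the 'ABC String' group of the same Code. The slice length 10 only works because the shared part happens to be 10 characters long, so it would need adjusting for other data.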

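【Comments】:

【Solution 2】:

To check for similar strings you can use jellyfish.levenshtein_distance. The idea is to iterate over each Code group, take the most frequent description in that group, and compute the Levenshtein distance of every description to that most frequent one. A distance close to 0 means the strings are similar; a large distance means they are not.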

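from statistics import mode

import jellyfish
import pandas as pd

df = pd.DataFrame({'Code': ['abc', 'abc', 'abc', 'abc', 'def'],
                   'Description': ['ABC String', 'abc string', 'ABC String and sth', 'Only sth else', 'ABC String'],
                   'Value': [10, 20, 30, 40, 100]})

df_list = []
# use a separate loop variable so the original df is not shadowed
for code, group in df.groupby('Code'):
    group = group.copy()  # work on a copy to avoid SettingWithCopyWarning
    # most frequent description in this group
    most_common = mode(group['Description'])
    # edit distance of every description to the most frequent one
    group['distance'] = group['Description'].apply(
        lambda x: jellyfish.levenshtein_distance(x, most_common))
    # relabel the whole group, then keep only rows within the threshold (10)
    group['Description'] = most_common
    df_list.append(group[group['distance'] < 10])

result = pd.concat(df_list).drop('distance', axis=1)
print(result)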

Output:

  Code Description  Value
0  abc  ABC String     10
1  abc  ABC String     20
2  abc  ABC String     30
4  def  ABC String    100
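To see why 'Only sth else' is dropped while the other rows are kept, it can help to look at the distances themselves. The function below is a plain-Python sketch of the standard dynamic-programming Levenshtein distance; it is a stand-in for illustration, not the jellyfish implementation.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions
    and substitutions needed to turn a into b."""
    # previous[j] holds the distance between the processed prefix of a and b[:j]
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]

reference = 'ABC String'
for text in ['ABC String', 'abc string', 'ABC String and sth', 'Only sth else']:
    print(text, levenshtein(text, reference))
```

Note that the comparison is case sensitive, so 'abc string' already costs 4 edits against 'ABC String'. The threshold of 10 used above is a judgment call and depends on how long the strings are.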

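For a better and more accurate analysis, first convert the strings to lowercase and remove whitespace and punctuation, then apply the algorithm.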
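A minimal sketch of that normalization step, using only the standard library. The helper name `normalize` and the extra sample string 'ABC-String!' are made up for illustration; only ASCII punctuation is stripped.

```python
import string

def normalize(text: str) -> str:
    """Lowercase the text, drop ASCII punctuation, and remove all whitespace."""
    lowered = text.lower()
    no_punct = lowered.translate(str.maketrans('', '', string.punctuation))
    return ''.join(no_punct.split())

descriptions = ['ABC String', 'abc string', 'ABC-String!', 'Only sth else']
normalized = [normalize(d) for d in descriptions]
print(normalized)
```

On a data frame this could be applied with something like `df['Description'].map(normalize)` before computing distances, so that differences in case or punctuation alone no longer count as edits.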

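【Comments】:

Thanks a lot @Nk03. This is an interesting approach; I will study it further and adapt it for my final solution.

【Solution 3】:

You can also use fuzzywuzzy, which computes a similarity ratio based on Levenshtein distance. This also works when there are more than two "similar" values.

For example: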

import numpy as np
import pandas as pd
from fuzzywuzzy import fuzz

DF = pd.DataFrame({'Code':['abc', 'abc', 'abc', 'abc', 'def', 'def', 'def', 'abc'],  
               'Description':['ABC String', 'ABC String', 
                              'ABC String and sth', 'Only sth else', 
                              'ABC String', 'CDEFGH', 'CDEFGH and sth', 
                              'CDEFGH and sth',],
               'Value':[10, 20, 30, 40, 50, 60, 70, 80]}) 

# for each unique value in Description
for d in DF.Description.unique():
    # compute the similarity ratio (0-100) to d
    # and set to True if >= a limit
    # (you may have to play around with it)
    DF[d] = DF['Description'].apply(
        lambda x : fuzz.ratio(x, d) >= 60
    )
    # set a name for the group:
    # np.min on strings gives the lexicographically smallest one,
    # which here is also the shortest
    m = np.min(DF[DF[d]==True].Description)
    # assign the group
    DF.loc[DF.Description==d, 'group'] = m

print(DF)

  Code         Description  Value  ABC String          group  \
0  abc          ABC String     10        True     ABC String   
1  abc          ABC String     20        True     ABC String   
2  abc  ABC String and sth     30        True     ABC String   
3  abc       Only sth else     40       False  Only sth else   
4  def          ABC String     50        True     ABC String   
5  def              CDEFGH     60       False         CDEFGH   
6  def      CDEFGH and sth     70       False         CDEFGH   
7  abc      CDEFGH and sth     80       False         CDEFGH   

   ABC String and sth  Only sth else  CDEFGH  CDEFGH and sth  
0                True          False   False           False  
1                True          False   False           False  
2                True          False   False           False  
3               False           True   False           False  
4                True          False   False           False  
5               False          False    True            True  
6               False          False    True            True  
7               False          False    True            True

Now you can group by the newly created group column:

DF.groupby('group').Value.mean()

group
ABC String       27.5
CDEFGH           70.0
Only sth else    40.0
Name: Value, dtype: float64
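The grouping above ignores Code, so row 7 (Code abc) lands in the same group as the def rows. If the groups should be kept separate per Code, as the question asks, both columns can be used as keys. The sketch below is an assumption-laden variant: it swaps fuzzywuzzy for the standard library's difflib.SequenceMatcher so that it runs without extra packages, and the helpers `similarity` and `label` are hypothetical names. The ratios from difflib can differ from those of fuzz.ratio, so the limit of 60 may need tuning.

```python
from difflib import SequenceMatcher

import pandas as pd

def similarity(a: str, b: str) -> int:
    """Similarity on a 0-100 scale (difflib stand-in for fuzz.ratio)."""
    return round(100 * SequenceMatcher(None, a, b).ratio())

def label(desc, candidates, limit=60):
    """Name the group after the lexicographically smallest similar description."""
    return min(c for c in candidates if similarity(desc, c) >= limit)

DF = pd.DataFrame({'Code': ['abc', 'abc', 'abc', 'abc', 'def', 'def', 'def', 'abc'],
                   'Description': ['ABC String', 'ABC String',
                                   'ABC String and sth', 'Only sth else',
                                   'ABC String', 'CDEFGH', 'CDEFGH and sth',
                                   'CDEFGH and sth'],
                   'Value': [10, 20, 30, 40, 50, 60, 70, 80]})

uniques = DF['Description'].unique()
DF['group'] = DF['Description'].map(lambda d: label(d, uniques))

# group by Code first, then by the fuzzy group label
result = DF.groupby(['Code', 'group'])['Value'].mean()
print(result)
```

Unlike the loop above, this version does not add one boolean column per unique description; it only stores the final label.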

【Comments】:

【Solution 4】:

Try adding a boolean column by applying a function:

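# df is assumed to be the question's data frame with lowercase column names
val = 'ABC String'

# 1 if the description contains val, otherwise 0
df['boo'] = df['description'].apply(lambda x: 1 if x.find(val)>=0 else 0)
df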

code	description	value	boo
abc	ABC String	10	1
abc	ABC String	20	1
abc	ABC String and sth	30	1
abc	Only sth else	40	0
def	ABC String	100	1

Then some wrangling:

df = df[df.boo == 1]   # keep only the matching rows
df = df.iloc[:,:-1]    # drop the helper column (the last one)
df

code	description	value
abc	ABC String	10
abc	ABC String	20
abc	ABC String and sth	30
def	ABC String	100
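The same filter can be sketched without the helper column by using pandas' vectorized `Series.str.contains` as a boolean mask. The data frame below recreates the question's data with lowercase column names, which is an assumption based on the output shown above.

```python
import pandas as pd

df = pd.DataFrame({'code': ['abc', 'abc', 'abc', 'abc', 'def'],
                   'description': ['ABC String', 'ABC String', 'ABC String and sth',
                                   'Only sth else', 'ABC String'],
                   'value': [10, 20, 30, 40, 100]})

val = 'ABC String'

# regex=False treats val as a literal substring rather than a pattern
mask = df['description'].str.contains(val, regex=False)
filtered = df[mask]
print(filtered)

# the filtered rows can then be grouped by code
totals = filtered.groupby('code')['value'].sum()
print(totals)
```

This only handles one known substring at a time; when the common part is not known in advance, the distance-based approaches above are a better fit.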

【Comments】: