【Question title】: Mapping matching keyword counts on a column using pandas in Python
【Posted】: 2018-04-06 10:34:41
【Question】:

I have a DataFrame df:

Name    Step     Description
Ram        1     Ram is oNe of the good cricketer
Ram        2     gopal one
Sri        1     Sri is one of the member
Sri        2     ravi good 
Kumar      1     Kumar is a keeper
Madhu      1     good boy
Vignesh    1     oNe little
Pechi      1     one book
mario      1     good randokm
Roger      1     one milita good
bala       1     looks good
raj        1     more one
venk       1     likes good

and a list:

my_list=["one","good"]

I am trying to get the rows whose Description contains at least one keyword from my_list.

I tried mask=df["Description"].str.contains("|".join(my_list),na=False) and I am getting this output_df:

Name    Description
Ram     Ram is one of the good cricketer
Sri     Sri is one of the member        

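I also want to add the keywords found in "Description", and how many there are, as separate columns.

When a df["Name"] value is not its first occurrence, the keywords should not be copied into the keys column, even if "Description" contains them.

My desired output is: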

 Name   Step    Description                          keys        count
 Ram     1     Ram is one of the good cricketer      one,good    2
 Ram     2     gopal one
 Sri     1     Sri is one of the member              one         1
 Sri     2     ravi good
 Kumar   1     Kumar is a keeper
 Madhu   1     good boy                              good        1
 Vignesh 1     oNe little                            oNe         1
 Pechi   1     one book                              one         1 
 mario   1     good randokm good                     good        1
 Roger   1     one milita good                       one,good    2
 bala    1     looks good                            good        1
 raj     1     more one                              one         1
 venk    1     likes good                            good        1

【Comments】:

    Tags: python pandas dataframe data-analysis


    【Solution 1】:

    Create a new mask that also requires the first occurrence of each Name, then apply it:

    import re
    
    my_list=["one","good"]
    
    mask=df["Description"].str.contains("|".join(my_list),na=False,flags=re.IGNORECASE ) & \
         (df.groupby('Name').cumcount() == 0)
    print (mask)
    0      True
    1     False
    2      True
    3     False
    4     False
    5      True
    6      True
    7      True
    8      True
    9      True
    10     True
    11     True
    12     True
    dtype: bool
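One caveat about the pattern built with "|".join(my_list): it matches substrings, so "one" is also found inside a word such as "money", and any regex metacharacters in a keyword would be interpreted rather than matched literally. A minimal sketch, assuming whole-word matching is wanted (the question does not say so). The small frame and the names `demo` and `pattern` are illustrative only:

```python
import re

import pandas as pd

my_list = ["one", "good"]

# Illustrative data, not the frame from the question.
demo = pd.DataFrame({"Description": ["oNe little", "money talks", "looks good", None]})

# Escape each keyword and wrap the alternation in word boundaries.
# A non-capturing group avoids the capture-group warning from str.contains.
pattern = r"\b(?:" + "|".join(map(re.escape, my_list)) + r")\b"

substring_mask = demo["Description"].str.contains("|".join(my_list), case=False, na=False)
word_mask = demo["Description"].str.contains(pattern, case=False, na=False)

print(substring_mask.tolist())  # [True, True, True, False]
print(word_mask.tolist())       # [True, False, True, False]
```

If substring matches are actually desired, the original pattern is fine as it stands.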
    

    extracted = df['Description'].str.findall('(' + '|'.join(my_list) + ')', flags=re.IGNORECASE)
    df.loc[mask, 'keys'] = extracted.str.join(',')
    df.loc[mask, 'count'] = extracted.str.len()
    print (df)
           Name  Step                       Description      keys  count
    0       Ram     1  Ram is oNe of the good cricketer  oNe,good    2.0
    1       Ram     2                         gopal one       NaN    NaN
    2       Sri     1          Sri is one of the member       one    1.0
    3       Sri     2                        ravi good        NaN    NaN
    4     Kumar     1                 Kumar is a keeper       NaN    NaN
    5     Madhu     1                          good boy      good    1.0
    6   Vignesh     1                        oNe little       oNe    1.0
    7     Pechi     1                          one book       one    1.0
    8     mario     1                      good randokm      good    1.0
    9     Roger     1                   one milita good  one,good    2.0
    10     bala     1                        looks good      good    1.0
    11      raj     1                          more one       one    1.0
    12     venk     1                        likes good      good    1.0
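The count column prints as float (2.0) because the rows outside the mask hold NaN. A sketch of one way to keep whole numbers, assuming pandas' nullable Int64 dtype is acceptable. It rebuilds the first few rows of the question's frame so that it runs on its own:

```python
import re

import pandas as pd

df = pd.DataFrame({
    "Name": ["Ram", "Ram", "Sri", "Sri", "Kumar", "Madhu"],
    "Step": [1, 2, 1, 2, 1, 1],
    "Description": [
        "Ram is oNe of the good cricketer",
        "gopal one",
        "Sri is one of the member",
        "ravi good ",
        "Kumar is a keeper",
        "good boy",
    ],
})
my_list = ["one", "good"]
pattern = "|".join(my_list)

# Keyword present AND first row for this Name.
mask = (
    df["Description"].str.contains(pattern, flags=re.IGNORECASE, na=False)
    & (df.groupby("Name").cumcount() == 0)
)

extracted = df["Description"].str.findall(pattern, flags=re.IGNORECASE)

# where() blanks out the rows outside the mask; Int64 keeps integers next to <NA>.
df["keys"] = extracted.str.join(",").where(mask)
df["count"] = extracted.str.len().where(mask).astype("Int64")
print(df[["Name", "keys", "count"]])
```

Using where() instead of df.loc[mask, ...] also creates both columns in one step, whether or not any row matches.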
    

    Edit:

    # join all Descriptions per Name; transform keeps the result the same length as the original
    s = df.groupby('Name')['Description'].transform(','.join)
    print (s)
    0     Ram is oNe of the good cricketer,gopal one
    1     Ram is oNe of the good cricketer,gopal one
    2            Sri is one of the member,ravi good 
    3            Sri is one of the member,ravi good 
    4                              Kumar is a keeper
    5                                       good boy
    6                                     oNe little
    7                                       one book
    8                              good randokm good
    9                                one milita good
    10                                    looks good
    11                                      more one
    12                                    likes good
    Name: Description, dtype: object
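The plain ','.join expects every value to be a string, so a missing Description (NaN, which is a float) makes it raise a TypeError. A minimal sketch of one workaround, assuming missing descriptions should simply contribute nothing to the joined text. The frame `demo` is illustrative:

```python
import numpy as np
import pandas as pd

demo = pd.DataFrame({
    "Name": ["Ram", "Ram", "Sri"],
    "Description": ["Ram is oNe of the good cricketer", np.nan, "Sri is one of the member"],
})

# Drop missing values inside each group before joining, so only real text is combined.
s = demo.groupby("Name")["Description"].transform(lambda x: ",".join(x.dropna()))
print(s.tolist())
```

Filling with fillna("") before the join, as mentioned in the discussion, is another option. It leaves an empty segment in the joined string, which does not affect the keyword search.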
    

    #for mask use new Series s
    mask=s.str.contains("|".join(my_list),na=False,flags=re.IGNORECASE ) & \
         (df.groupby('Name').cumcount() == 0)
    print (mask)
    0      True
    1     False
    2      True
    3     False
    4     False
    5      True
    6      True
    7      True
    8      True
    9      True
    10     True
    11     True
    12     True
    dtype: bool
    

    #extract from new Series s
    extracted = s.str.findall('(' + '|'.join(my_list) + ')', flags=re.IGNORECASE).apply(set)
    df.loc[mask, 'keys'] = extracted.str.join(',')
    df.loc[mask, 'count'] = extracted.str.len()
    print (df)
           Name  Step                       Description          keys  count
    0       Ram     1  Ram is oNe of the good cricketer  good,oNe,one    3.0
    1       Ram     2                         gopal one           NaN    NaN
    2       Sri     1          Sri is one of the member      good,one    2.0
    3       Sri     2                        ravi good            NaN    NaN
    4     Kumar     1                 Kumar is a keeper           NaN    NaN
    5     Madhu     1                          good boy          good    1.0
    6   Vignesh     1                        oNe little           oNe    1.0
    7     Pechi     1                          one book           one    1.0
    8     mario     1                 good randokm good          good    1.0
    9     Roger     1                   one milita good      good,one    2.0
    10     bala     1                        looks good          good    1.0
    11      raj     1                          more one           one    1.0
    12     venk     1                        likes good          good    1.0
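Note that set treats oNe and one as different keywords (hence good,oNe,one with a count of 3 for Ram) and does not guarantee any ordering. A sketch of one alternative, assuming matches should be folded to lowercase and kept in first-seen order. `unique_keys` is a hypothetical helper, not part of pandas:

```python
import re

import pandas as pd

my_list = ["one", "good"]

# Joined descriptions per Name, in the shape produced by the transform step above.
s = pd.Series([
    "Ram is oNe of the good cricketer,gopal one",
    "Sri is one of the member,ravi good ",
    "Kumar is a keeper",
])


def unique_keys(matches):
    """Lowercase the matches and drop repeats, keeping first-seen order."""
    return list(dict.fromkeys(m.lower() for m in matches))


extracted = s.str.findall("|".join(my_list), flags=re.IGNORECASE).apply(unique_keys)
keys = extracted.str.join(",")
counts = extracted.str.len()
print(keys.tolist())    # ['one,good', 'one,good', '']
print(counts.tolist())  # [2, 2, 0]
```

The resulting Series can be assigned through the same mask as before, so rows without a match still end up empty.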
    

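    【Discussion】:

    • I don't want to consider the Step column; I want to apply the logic on the "Name" column, when a Name value appears for the first time. As you can see at index=1, Ram appears a second time, so the keywords on the row with index 1 should not be considered.
    • OK, what do you think cumcount counts exactly?
    • It matches only the first value? Is print (df.groupby('Name').cumcount()) equal to 0?
    • ` 0 0 1 1 2 0 3 1 4 0 `
    • OK, I did fillna("") and then s = df.groupby('Name')['Description'].transform(','.join), and it worked. Otherwise it needs to be changed to s = df.groupby('Name')['Description'].transform(lambda x: ','.join(x.astype(str)))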
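
To make the cumcount question in the comments concrete: groupby('Name').cumcount() numbers the rows within each Name starting from 0, so comparing the result with 0 marks only the first occurrence of each Name. A short sketch on illustrative data. The ~duplicated() form is an equivalent added here for comparison, not something taken from the answer:

```python
import pandas as pd

demo = pd.DataFrame({"Name": ["Ram", "Ram", "Sri", "Sri", "Kumar"]})

# Position of each row inside its Name group.
position = demo.groupby("Name").cumcount()
print(position.tolist())  # [0, 1, 0, 1, 0]

first_by_cumcount = position == 0
first_by_duplicated = ~demo["Name"].duplicated()
print(first_by_cumcount.tolist())    # [True, False, True, False, True]
print(first_by_duplicated.tolist())  # [True, False, True, False, True]
```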