【Question title】: Get count of matching word in string of pandas column with a predefined list
【Posted】: 2020-10-29 08:44:19
【Question】:

I have a DataFrame with index and text columns.

For example:

index | text
1     | "I have a pen, but I lost it today."
2     | "I have pineapple and pen, but I lost it today."

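Now I have a very long list, and I want to match every word in text against that list.

Suppose:

long_list = ['pen', 'pineapple']

I want to create a FunctionTransformer that matches the words in long_list against each word of the column value and, where they match, returns the count.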

index | text                                             | count
1     | "I have a pen, but I lost it today."             | 1
2     | "I have pineapple and pen, but I lost it today." | 2

This is what I tried:

def count_words(df):
    long_list = ['pen', 'pineapple']
    count = 0
    for c in df['tweet_text']:
        if c in long_list:
            count = count + 1
            
    df['count'] = count   
    return df

count_word = FunctionTransformer(count_words, validate=False)

Here is an example of how I built another of my FunctionTransformers:

def convert_twitter_datetime(df):
    df['hour'] = pd.to_datetime(df['created_at'], format='%a %b %d %H:%M:%S +0000 %Y').dt.strftime('%H').astype(int)
    return df

convert_datetime = FunctionTransformer(convert_twitter_datetime, validate=False)

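【Comments】:

  • Why not use the count() function in pandas?
  • @CeliusStingher I am working on a pipeline, so my plan was to create a FunctionTransformer for it, but I am open to any solution! I am still new to this :3
  • Can you clarify your question?

Tags: python pandas dataframe scikit-learn dataset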


【Solution 1】:

Inspired by @Quang Hoang's answer

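import pandas as pd
# Import the class directly; a bare "import sklearn" does not reliably load sklearn.preprocessing
from sklearn.preprocessing import FunctionTransformer

y = ['pen', 'pineapple']

def count_strings(X, y):
    # Group the alternation so that \b applies to every word, not only the first and last
    pattern = r'\b(?:{})\b'.format('|'.join(y))
    return X['text'].str.count(pattern)

string_transformer = FunctionTransformer(count_strings, kw_args={'y': y})
df['count'] = string_transformer.fit_transform(X=df)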

Result

    text                                              count
1   "I have a pen, but I lost it today."                1
2   "I have pineapple and pen, but I lost it today."    2

For the following df2

#df2
      text
1     "I have a pen, but I lost it today. pen pen"
2     "I have pineapple and pen, but I lost it today."

we get

string_transformer.transform(X=df2)
#result
1    3
2    2
Name: text, dtype: int64

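This shows that we have turned the function into an sklearn-style object. To abstract this further, we could pass the column name to count_strings as a keyword argument.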
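
A minimal sketch of that last idea, assuming a DataFrame shaped like the one in the question. The parameter names column and words are hypothetical (they do not appear in the answer), and the pattern wraps the alternation in a non-capturing group so that \b applies to every word:

```python
import re

import pandas as pd
from sklearn.preprocessing import FunctionTransformer


def count_strings(X, column, words):
    # Hypothetical signature: `column` and `words` are illustrative names.
    # Escape each word and group the alternation so \b applies to all of them.
    pattern = r'\b(?:{})\b'.format('|'.join(re.escape(w) for w in words))
    return X[column].str.count(pattern)


df = pd.DataFrame({
    'text': [
        'I have a pen, but I lost it today.',
        'I have pineapple and pen, but I lost it today.',
    ]
})

string_transformer = FunctionTransformer(
    count_strings,
    kw_args={'column': 'text', 'words': ['pen', 'pineapple']},
)
counts = string_transformer.fit_transform(df)
df['count'] = counts
```

With this shape, the same function could be reused for another text column by changing only kw_args.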

【Comments】:

    【Solution 2】:

    Join the elements of the list with |. Find the matching elements with .str.findall() and apply .str.len() to count them.

    p = '|'.join(long_list)
    df = df.assign(count=df.text.str.findall(p).str.len())
                                                 text   count
    0              "I have a pen, but I lost it today."      1
    1  "I have pineapple and pen, but I lost it today."      2
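
One caveat with a bare alternation is that it also matches inside longer words, so pencil would be counted as pen. A small sketch of the difference, using a made-up third row and a word-bounded pattern as one possible alternative (neither is part of the original answer):

```python
import re

import pandas as pd

long_list = ['pen', 'pineapple']
df = pd.DataFrame({'text': [
    'I have a pen, but I lost it today.',
    'I have pineapple and pen, but I lost it today.',
    'I have a pencil.',  # made-up row to show the substring match
]})

# Bare alternation: matches 'pen' anywhere, including inside 'pencil'.
p = '|'.join(long_list)
substring_counts = df.text.str.findall(p).str.len()

# Word-bounded alternation: the group makes \b apply to every alternative.
p_words = r'\b(?:{})\b'.format('|'.join(map(re.escape, long_list)))
word_counts = df.text.str.findall(p_words).str.len()
```

Which behaviour is wanted depends on whether partial matches should count.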
    

    【Comments】:

      【Solution 3】:

      pandas has str.count:

      # matching any of the words
      pattern = r'\b(?:{})\b'.format('|'.join(long_list))
      
      df['count'] = df.text.str.count(pattern)
      

      Output:

         index                                              text  count
      0      1              "I have a pen, but I lost it today."      1
      1      2  "I have pineapple and pen, but I lost it today."      2
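
For a very long word list, a token-based count may be worth comparing against one large regex. The following is only a sketch of that alternative, not the method from the answer above: the helper name count_listed_words is hypothetical, tokens are taken to be runs of \w characters, and matching is case-sensitive.

```python
import re

import pandas as pd

long_list = ['pen', 'pineapple']
word_set = set(long_list)  # set membership stays cheap as the list grows


def count_listed_words(text):
    # Hypothetical helper: split into word tokens, then count those in the set.
    return sum(token in word_set for token in re.findall(r'\w+', text))


df = pd.DataFrame({'text': [
    'I have a pen, but I lost it today.',
    'I have pineapple and pen, but I lost it today.',
]})
df['count'] = df['text'].apply(count_listed_words)
```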
      

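      【Comments】:

      • But I am not so sure about this approach, because the OP said he wants a FunctionTransformer, scikit-learn.org/stable/modules/generated/… I think a function has to be created
      • Thanks @Quang Hoang! Your answer inspired others to solve my problem! Worth an upvote!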