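【Question title】: Pre-process text string with NLTK
【Posted】: 2021-04-25 10:36:57
【Question】:

I have a dataframe A containing docid (document ID), title (article title), lineid (line ID, i.e. the position of the paragraph), text, and tokencount (word count, including whitespace):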

  docid   title  lineid                                         text        tokencount
0     0     A        0   shopping and orders have become more com...                66
1     0     A        1  people wrote to the postal service online...                67
2     0     A        2   text updates really from the U.S. Postal...                43
...

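I want to create a new dataframe based on A that includes title, lineid, count, and query.

query is a text string containing one or more words, such as "data analysis", "text message", or "shopping and orders".

count is the number of occurrences of each word in query.

The new dataframe should look like this: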

title  lemma   count   lineid
  A    "data"    0        0
  A    "data"    1        1
  A    "data"    4        2
  A    "shop"    2        0
  A    "shop"    1        1
  A    "shop"    2        2
  B    "data"    4        0
  B    "data"    0        1
  B    "data"    2        2
  B    "shop"    9        0
  B    "shop"    3        1
  B    "shop"    1        2
...

How can I write a function that generates this new dataframe?


So far, I have created a new dataframe df from A, with an added count column:

df = A[['title','lineid']].copy()  # copy so that adding a column does not trigger SettingWithCopyWarning
df['count'] = 0
df.set_index(['title','lineid'], inplace=True)

I also wrote a function that counts occurrences of the query words.

from collections import Counter

def occurrence_counter(target_string, query):
    data = dict(Counter(target_string.split()))
    count = 0
    for key in query:
        if key in data:
            count += data[key]
    return count
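
One thing to watch with `occurrence_counter`: the loop `for key in query` iterates over whatever `query` is, so a plain string is walked character by character rather than word by word. A minimal sketch of the difference, assuming the query should be split on whitespace first (the sample sentence is made up for illustration):

```python
from collections import Counter

def occurrence_counter(target_string, query):
    # Same function as above: sums the counts of every item in `query`.
    data = dict(Counter(target_string.split()))
    count = 0
    for key in query:
        if key in data:
            count += data[key]
    return count

text = "a data set about data analysis"  # made-up sample sentence

# Passing the raw string iterates over its characters. Only the character
# "a" equals a token here, and it is added once per "a" in the query.
as_string = occurrence_counter(text, "data analysis")

# Splitting first compares whole query words against whole tokens.
as_words = occurrence_counter(text, "data analysis".split())

print(as_string, as_words)  # 4 3
```

Note that the function returns a single total for all query words, whereas the desired output needs one count per word.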

But how can I combine the two into a function that generates the new dataframe?

【Comments】:

    Tags: python pandas dataframe nltk


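    【Solution 1】:

    If I understand correctly, you can use built-in pandas functions: Series.str.count() to count the queries, and melt() to reshape the result into the final column structure.

    Given the sample df: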

    import pandas as pd

    df = pd.DataFrame({'docid': {0: 0, 1: 0, 2: 0}, 'title': {0: 'A', 1: 'A', 2: 'A'}, 'lineid': {0: 0, 1: 1, 2: 2}, 'text': {0: 'shopping and orders have become more com...',  1: 'people wrote to the postal service online...',  2: 'text updates really from the U.S. Postal...'}, 'tokencount': {0: 66, 1: 67, 2: 43}})
    
    #   docid  title  lineid                                          text  tokencount
    # 0     0      A       0   shopping and orders have become more com...          66
    # 1     0      A       1  people wrote to the postal service online...          67
    # 2     0      A       2   text updates really from the U.S. Postal...          43
    

    First count() the queries:

    queries = ['order', 'shop', 'text']
    df = df.assign(**{f'query_{query}': df.text.str.count(query) for query in queries})
    
    #   docid  title  lineid                                          text  tokencount  query_order  query_shop  query_text
    # 0     0      A       0   shopping and orders have become more com...          66            1           1           0
    # 1     0      A       1  people wrote to the postal service online...          67            0           0           0
    # 2     0      A       2   text updates really from the U.S. Postal...          43            0           0           1
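
`Series.str.count()` treats its argument as a regular expression and counts matches anywhere in the string, so `shop` also matches inside `shopping`. That is convenient when a stem-like query should catch inflected forms, but if whole-word counts are wanted instead, one option is to escape the query and wrap it in `\b` word boundaries. A minimal sketch on made-up sample text:

```python
import re

import pandas as pd

text = pd.Series(["shopping at the shop", "no match here"])  # made-up sample

# Substring-style count: "shop" is found in both "shopping" and "shop".
substring_counts = text.str.count("shop")

# Whole-word count: \b anchors the match at word boundaries, and
# re.escape keeps any regex metacharacters in the query literal.
word_counts = text.str.count(rf"\b{re.escape('shop')}\b")

print(substring_counts.tolist())  # [2, 0]
print(word_counts.tolist())       # [1, 0]
```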
    

    Then melt() into the final column structure:

    df.melt(
        id_vars=['title', 'lineid'],
        value_vars=[f'query_{query}' for query in queries],
        var_name='lemma',
        value_name='count',
    ).replace(r'^query_', '', regex=True)
    
    #   title  lineid  lemma  count
    # 0     A       0  order      1
    # 1     A       1  order      0
    # 2     A       2  order      0
    # 3     A       0   shop      1
    # 4     A       1   shop      0
    # 5     A       2   shop      0
    # 6     A       0   text      0
    # 7     A       1   text      0
    # 8     A       2   text      1
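
Since the question asks for a function, the two steps can be wrapped together. The sketch below is one possible arrangement, not the answerer's code: `count_queries` is a hypothetical helper name, the sample rows are made up, and it assumes the input has `title`, `lineid`, and `text` columns and that no query is itself named `title` or `lineid`.

```python
import pandas as pd

def count_queries(df, queries):
    """Return one row per (title, lineid, query) with the match count.

    `count_queries` is a hypothetical helper name, not part of pandas.
    """
    # One column of regex/substring counts per query, in a temporary frame.
    counts = pd.DataFrame({q: df["text"].str.count(q) for q in queries})
    wide = pd.concat([df[["title", "lineid"]], counts], axis=1)
    # Reshape from one column per query to one row per query.
    return wide.melt(
        id_vars=["title", "lineid"],
        value_vars=list(queries),
        var_name="lemma",
        value_name="count",
    )

sample = pd.DataFrame({
    "title": ["A", "A"],
    "lineid": [0, 1],
    "text": ["shopping and orders", "text updates"],  # made-up sample rows
})
result = count_queries(sample, ["order", "shop", "text"])
print(result)
```

Using the query itself as the temporary column name avoids the `query_` prefix and the later `replace()` step, at the cost of the name-collision assumption stated above.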
    

    【Comments】:

      【Solution 2】:

      This should handle your scenario:

      import pandas as pd
      from collections import Counter
      
      query = "data analysis"
      wordlist = query.split(" ")
      #print(wordlist)
      
      # row-wise frequency count of whitespace-separated tokens
      df['text_new'] = df.text.str.split().apply(Counter)
      
      rows = []
      # iterate row by row, emitting one record per (row, query word)
      for index, row in df.iterrows():
          for word in wordlist:
              rows.append({
                  'title': row['title'],
                  'lemma': word,
                  'count': row['text_new'][word],  # a Counter returns 0 for missing words
                  'lineid': row['lineid'],
              })
      
      # DataFrame.append was removed in pandas 2.0, so build the frame in one step
      output = pd.DataFrame(rows, columns=['title', 'lemma', 'count', 'lineid'])
      #print(output)
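
Neither answer uses NLTK, although the question title mentions it. The expected output lists `"shop"` under `lemma`, which looks like a stem, so one reading is that tokens should be stemmed before counting. The sketch below follows that assumption using NLTK's `PorterStemmer`; the regex tokenizer, the `stem_counts` helper name, and the sample rows are all illustrative choices rather than anything given in the thread.

```python
import re
from collections import Counter

import pandas as pd
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

def stem_counts(text):
    # Hypothetical helper: lowercase, keep alphabetic runs only,
    # then reduce each token to its Porter stem before counting.
    tokens = re.findall(r"[a-z]+", text.lower())
    return Counter(stemmer.stem(tok) for tok in tokens)

sample = pd.DataFrame({
    "title": ["A", "A"],
    "lineid": [0, 1],
    "text": ["Shopping and orders", "shops shops data"],  # made-up sample rows
})
stems = ["shop", "data"]  # query words, assumed to be given as stems already

rows = []
for title, lineid, text in zip(sample["title"], sample["lineid"], sample["text"]):
    counts = stem_counts(text)
    for stem in stems:
        rows.append({"title": title, "lemma": stem,
                     "count": counts[stem], "lineid": lineid})

output = pd.DataFrame(rows, columns=["title", "lemma", "count", "lineid"])
print(output)
```

If the query arrives as raw text such as "shopping and orders", its words could be passed through the same `stemmer.stem()` call first so that both sides are compared in stemmed form.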
      

      【Comments】:
