Pre-process text string with NLTK: answers

【Question title】: Pre-process text string with NLTK
【Posted】: 2021-04-25 10:36:57
【Question】:

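I have a dataframe A containing docid (document ID), title (article title), lineid (line ID, i.e. the position of the paragraph), text, and tokencount (word count, including whitespace):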

  docid   title  lineid                                         text        tokencount
0     0     A        0   shopping and orders have become more com...                66
1     0     A        1  people wrote to the postal service online...                67
2     0     A        2   text updates really from the U.S. Postal...                43
...

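I want to create a new dataframe based on A, with the columns title, lineid, count, and query.

query is a text string containing one or more words, such as "data analysis", "text message", or "shopping and orders".

count is the number of occurrences of each word of query.

The new dataframe should look like this: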

title  lemma   count   lineid
  A    "data"    0        0
  A    "data"    1        1
  A    "data"    4        2
  A    "shop"    2        0
  A    "shop"    1        1
  A    "shop"    2        2
  B    "data"    4        0
  B    "data"    0        1
  B    "data"    2        2
  B    "shop"    9        0
  B    "shop"    3        1
  B    "shop"    1        2
...

How can I write a function that generates this new dataframe?

I created a new dataframe df from A, with an added count column.

# .copy() avoids SettingWithCopyWarning when the count column is added
df = A[['title', 'lineid']].copy()
df['count'] = 0
df.set_index(['title', 'lineid'], inplace=True)

I also wrote a function that counts the query words.

from collections import Counter

def occurrence_counter(target_string, query):
    data = dict(Counter(target_string.split()))
    count = 0
    for key in query:
        if key in data:
            count += data[key]
    return count

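But how can I use the two together in a function that generates the new dataframe?

【Comments】:

Tags: python pandas dataframe nltk

【Solution 1】:

If I understand correctly, you can do this with built-in pandas functions: Series.str.count() to count the queries, and melt() to reshape the result into the final column structure.

Given the sample df: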

import pandas as pd

df = pd.DataFrame({
    'docid': [0, 0, 0],
    'title': ['A', 'A', 'A'],
    'lineid': [0, 1, 2],
    'text': [
        'shopping and orders have become more com...',
        'people wrote to the postal service online...',
        'text updates really from the U.S. Postal...',
    ],
    'tokencount': [66, 67, 43],
})

#   docid  title  lineid                                          text  tokencount
# 0     0      A       0   shopping and orders have become more com...          66
# 1     0      A       1  people wrote to the postal service online...          67
# 2     0      A       2   text updates really from the U.S. Postal...          43

First, count() the queries:

queries = ['order', 'shop', 'text']
df = df.assign(**{f'query_{query}': df.text.str.count(query) for query in queries})

#   docid  title  lineid                                          text  tokencount  query_order  query_shop  query_text
# 0     0      A       0   shopping and orders have become more com...          66            1           1           0
# 1     0      A       1  people wrote to the postal service online...          67            0           0           0
# 2     0      A       2   text updates really from the U.S. Postal...          43            0           0           1
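
One thing to keep in mind: Series.str.count() interprets its argument as a regular expression and counts substring matches, so 'shop' is also counted inside 'shopping'. If whole-word counts are what you need, a minimal sketch (assuming the same three sample rows; the names sample, substring_counts and whole_word_counts are only illustrative) is to escape each query and wrap it in word boundaries:

```python
import re

import pandas as pd

# Same sample text as above, rebuilt here so the snippet runs on its own.
sample = pd.DataFrame({
    'title': ['A', 'A', 'A'],
    'lineid': [0, 1, 2],
    'text': [
        'shopping and orders have become more com...',
        'people wrote to the postal service online...',
        'text updates really from the U.S. Postal...',
    ],
})
queries = ['order', 'shop', 'text']

# Substring matching: 'shop' matches inside 'shopping'.
substring_counts = {
    q: sample['text'].str.count(re.escape(q)).tolist() for q in queries
}

# Whole-word matching: \b requires a word boundary on both sides.
whole_word_counts = {
    q: sample['text'].str.count(rf'\b{re.escape(q)}\b').tolist() for q in queries
}

print(substring_counts)
print(whole_word_counts)
```

Which of the two behaviours is right depends on whether 'shopping' should count toward 'shop' in your data. re.escape() only matters when a query contains regex metacharacters such as '.' or '+'.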

Then melt() into the final column structure:

df.melt(
    id_vars=['title', 'lineid'],
    value_vars=[f'query_{query}' for query in queries],
    var_name='lemma',
    value_name='count',
).replace(r'^query_', '', regex=True)

#   title  lineid  lemma  count
# 0     A       0  order      1
# 1     A       1  order      0
# 2     A       2  order      0
# 3     A       0   shop      1
# 4     A       1   shop      0
# 5     A       2   shop      0
# 6     A       0   text      0
# 7     A       1   text      0
# 8     A       2   text      1
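
Since the question asks for a function, the two steps can be wrapped into one. This is only a sketch: the name count_queries and its text_col parameter are hypothetical, and it keeps the substring semantics of Series.str.count() used above.

```python
import pandas as pd


def count_queries(frame, queries, text_col='text'):
    """Return one row per (title, lineid, query) with the match count."""
    # Step 1: one temporary query_<word> column per query.
    counts = frame.assign(**{
        f'query_{q}': frame[text_col].str.count(q) for q in queries
    })
    # Step 2: reshape from wide to long.
    long = counts.melt(
        id_vars=['title', 'lineid'],
        value_vars=[f'query_{q}' for q in queries],
        var_name='lemma',
        value_name='count',
    )
    # Strip the temporary prefix from the lemma column only.
    long['lemma'] = long['lemma'].str.replace(r'^query_', '', regex=True)
    # Column order requested in the question.
    return long[['title', 'lemma', 'count', 'lineid']]


sample = pd.DataFrame({
    'title': ['A', 'A', 'A'],
    'lineid': [0, 1, 2],
    'text': [
        'shopping and orders have become more com...',
        'people wrote to the postal service online...',
        'text updates really from the U.S. Postal...',
    ],
})
result = count_queries(sample, ['order', 'shop', 'text'])
print(result)
```

The input frame is not modified, because assign() and melt() both return new dataframes.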

【Comments】:

【Solution 2】:

This should handle your scenario:

import pandas as pd
from collections import Counter

query = "data analysis"
wordlist = query.split()
#print(wordlist)

# Row-wise frequency count: one Counter per line of text
df['text_new'] = df.text.str.split().apply(Counter)

# Collect one record per (row, query word). The record must be added inside
# the inner loop, otherwise only the last word of each row is kept.
# DataFrame.append was removed in pandas 2.0, so build the frame once at the end.
records = []
for index, row in df.iterrows():
    for word in wordlist:
        records.append({
            'title': row['title'],
            'lemma': word,
            'count': row['text_new'][word],  # a Counter returns 0 for missing words
            'lineid': row['lineid'],
        })

output = pd.DataFrame(records)
#print(output)
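
Counting exact tokens with Counter means that 'shopping' is not counted for the query 'shop'. Since the question mentions NLTK and a lemma column, one possible refinement is to stem the tokens before counting. The sketch below uses NLTK's PorterStemmer, which needs no corpus download. The sample text is made up for illustration, stem_counts is a hypothetical helper, and stemming is a rougher operation than true lemmatization (NLTK's WordNetLemmatizer is an alternative, but it requires downloading the WordNet corpus first).

```python
from collections import Counter

import pandas as pd
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()


def stem_counts(text):
    """Lowercase, split on whitespace, strip edge punctuation, stem, count."""
    tokens = (tok.strip('.,;:!?"\'()') for tok in text.lower().split())
    return Counter(stemmer.stem(tok) for tok in tokens if tok)


# Made-up sample rows (an assumption, not data from the question).
sample = pd.DataFrame({
    'title': ['A', 'A'],
    'lineid': [0, 1],
    'text': [
        'shopping and orders have become more common',
        'people shop online; shops track every order',
    ],
})

# Stem the query words too, so they live in the same space as the tokens.
stems = [stemmer.stem(q) for q in ['shop', 'orders']]

per_row = sample['text'].apply(stem_counts)
records = [
    {'title': title, 'lemma': stem, 'count': counter[stem], 'lineid': lineid}
    for title, lineid, counter in zip(sample['title'], sample['lineid'], per_row)
    for stem in stems
]
stemmed = pd.DataFrame(records)
print(stemmed)
```

With this approach 'shopping', 'shops' and 'shop' all contribute to the 'shop' count, which is closer to the lemma column shown in the question.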

【Comments】: