Tokenise text and create more rows for each row in dataframe

【Question title】: Tokenise text and create more rows for each row in dataframe
【Posted】: 2019-05-24 09:52:38
【Question】:

I want to do this with Python and pandas.

Suppose I have the following:

file_id   text
1         I am the first document. I am a nice document.
2         I am the second document. I am an even nicer document.

In the end I want to have the following:

file_id   text
1         I am the first document
1         I am a nice document
2         I am the second document
2         I am an even nicer document

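So I want to split the text of each file at every full stop and create a new row for each of the resulting tokens.

What is the most efficient way to do this?

【Comments】:

You can use nltk.tokenize.sent_tokenize('text') to split the text into sentences.

Tags: python pandas tokenize

【Solution 1】:

Use: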

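# df is the DataFrame from the question (columns: file_id, text).
s = (df.pop('text')
       .str.rstrip('.')                    # drop the trailing full stop
       .str.split(r'\.\s+', expand=True)   # one column per sentence
       .stack()                            # columns -> rows
       .rename('text')
       .reset_index(level=1, drop=True))   # keep only the original row index

# Join on the index so file_id repeats for every sentence.
df = df.join(s).reset_index(drop=True)
print(df)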
   file_id                         text
0        1      I am the first document
1        1         I am a nice document
2        2     I am the second document
3        2  I am an even nicer document

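Explanation:

First extract the column with DataFrame.pop, remove the trailing . with Series.str.rstrip, and split with Series.str.split; the . is escaped because it is a special regex character. Then reshape the result into a Series with DataFrame.stack, apply Series.rename and Series.reset_index to that Series, and finally use DataFrame.join to attach it to the original DataFrame.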
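The chained expression can be easier to follow when each step is written out on its own. The following is a minimal sketch on the sample data from the question; the intermediate names (trimmed, wide, long, sentences, result) are made up here for illustration and are not part of the original answer.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "file_id": [1, 2],
    "text": [
        "I am the first document. I am a nice document.",
        "I am the second document. I am an even nicer document.",
    ],
})

# 1. Take the text column out of df; df now holds only file_id.
text = df.pop("text")

# 2. Drop the trailing full stop so it does not end up in the last sentence.
trimmed = text.str.rstrip(".")

# 3. Split on a literal "." followed by whitespace; one column per sentence.
wide = trimmed.str.split(r"\.\s+", expand=True)

# 4. Stack the sentence columns into one long Series (index: row, position).
long = wide.stack()

# 5. Name the Series and drop the position level so the index matches df.
sentences = long.rename("text").reset_index(level=1, drop=True)

# 6. Join on the index; each file_id is repeated once per sentence.
result = df.join(sentences).reset_index(drop=True)
print(result)
```

Printing wide and long along the way shows where the reshaping from columns to rows happens.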

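【Comments】:

I was waiting for you, @jezrael! Thanks for your answer (upvoted). It is not very easy to read, though (at least for someone who is not a pandas expert). By the way, how would your answer change if I told you that the text also has to be split at every newline (\n) or forward slash (/)?
Also, by the way, would your code still work if I had other columns to the right of the text column?
@PoeteMaudit - Yes, my code also works with multiple columns.
OK, cool. I mean that I have columns to the right of the text column, but I don't want to do anything special with them beyond what I do with file_id. I only want to split the text column.

【Solution 2】: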
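The two follow-up questions in these comments (extra delimiters, extra columns) can be tried out with a small variation of Solution 1. This is only a sketch under assumptions: the sample rows and the author column are hypothetical, and the regex simply adds \n and / as alternatives to the original pattern. The extra .dropna() is a precaution for rows that produce fewer pieces than others, since newer pandas versions may keep those missing values when stacking.

```python
import pandas as pd

# Hypothetical data: "author" stands in for any column to the right of text.
df = pd.DataFrame({
    "file_id": [1, 2],
    "text": ["First part. Second part\nThird part", "Left/Right."],
    "author": ["ann", "bob"],
})

# Split on a full stop followed by whitespace, on a newline, or on a slash.
pattern = r"\.\s+|\n|/"

s = (df.pop("text")
       .str.rstrip(".")
       .str.split(pattern, expand=True)
       .stack()
       .dropna()                           # discard padding from shorter rows
       .rename("text")
       .reset_index(level=1, drop=True))

# The join repeats file_id and author for every piece of text.
out = df.join(s).reset_index(drop=True)
print(out)
```

Because the text column is popped and joined back, it ends up as the last column; the other columns are carried along unchanged.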

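import pandas as pd

# Sample data from the question.
df = pd.DataFrame({'file_id': [1, 2],
                   'text': ["I am the first document. I am a nice document.",
                            "I am the second document. I am an even nicer document."]})

# Split on "." and keep the non-empty pieces, trimming surrounding whitespace.
df['sents'] = df.text.apply(lambda txt: [x.strip() for x in txt.split(".") if len(x) > 1])
# One column per sentence, then stack the columns into rows keyed by file_id.
df = (df.set_index(['file_id'])
        .apply(lambda x: pd.Series(x['sents']), axis=1)
        .stack()
        .reset_index(level=1, drop=True))
df = df.reset_index()
df.columns = ['file_id', 'text']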
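
A list column like the one built here can also be expanded with DataFrame.explode. The sketch below is an alternative, not the answerer's method, and it assumes pandas 0.25 or later, where explode is available.

```python
import pandas as pd

# Sample data from the question.
df = pd.DataFrame({
    "file_id": [1, 2],
    "text": [
        "I am the first document. I am a nice document.",
        "I am the second document. I am an even nicer document.",
    ],
})

# Replace each text with a list of its non-empty, whitespace-trimmed sentences.
df["text"] = df["text"].apply(
    lambda txt: [part.strip() for part in txt.split(".") if part.strip()]
)

# explode gives every list element its own row and repeats the other columns.
df = df.explode("text").reset_index(drop=True)
print(df)
```

Any other columns in the frame are repeated alongside file_id, so no set_index or column renaming is needed afterwards.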

【Comments】: