How can I extract a pattern from a text when it involves a new line? Answers

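【Question title】: How can I extract a pattern from a text when it involves a new line?
【Posted】: 2019-11-02 20:15:32
【Question】:

Suppose I have the following text in a cell of a dataset (a csv file):

I want to extract the words/phrases that appear after the keywords Decision and reason. I can do it like this: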

import pandas as pd

text = '''Decision: Postpone\n\nreason:- medical history -  information obtained from attending physician\n\nto review with current assessment from Dr Cynthia Dominguez regarding medical history, and current CBC showing actual number of platelet count\n\nmib: F\n'''

keywords = ['decision', 'reason']
# One row, one column per keyword, initialised with empty strings
new_df = pd.DataFrame('', index=[0], columns=keywords)

# Check every line for a keyword and keep what follows the first colon
for line in text.split('\n'):
    for keyword in keywords:
        if keyword in line.lower():
            parts = line.split(':', 1)
            if len(parts) > 1:
                # .loc avoids chained assignment (new_df[keyword][0] = ...)
                new_df.loc[0, keyword] = parts[1]

new_df

However, in some cells the word/phrase appears on a new line after the keyword, and in that case this program fails to extract it:

import pandas as pd

text = '''Decision: Postpone\n\nreason: \n- medical history \n-  information obtained from attending physician\n\nto review with current assessment from Dr Cynthia Dominguez regarding medical history, and current CBC showing actual number of platelet count\n\nmib: F\n'''

keywords = ['decision', 'reason']
new_df = pd.DataFrame('', index=[0], columns=keywords)

# Same line-by-line loop as before; the "reason:" line now has nothing
# useful after the colon, so the reason column stays blank
for line in text.split('\n'):
    for keyword in keywords:
        if keyword in line.lower():
            parts = line.split(':', 1)
            if len(parts) > 1:
                new_df.loc[0, keyword] = parts[1]
new_df

How can I fix this?

【Question comments】:

Tags: python pandas text pattern-matching text-processing

【Solution 1】:

Use a regular expression to split the data into tokens; this reduces the number of loops.

import re
import pandas as pd

text = '''Decision: Postpone\n\nreason: \n- medical history \n-  information obtained from attending physician\n\nto review with current assessment from Dr Cynthia Dominguez regarding medical history, and current CBC showing actual number of platelet count\n\nmib: F\n'''

keywords = ['decision', 'reason']
new_df = pd.DataFrame('', index=[0], columns=keywords)

# Lowercase the text and break it into word tokens
tokens = re.findall(r"[\w']+", text.lower())
for key in keywords:
    index = tokens.index(key)
    if key == 'decision':
        # The decision is the single token right after the keyword
        new_df.loc[0, key] = ''.join(tokens[index + 1:index + 2])
    elif key == 'reason':
        # The reason runs up to the "to review ..." meta text
        meta = tokens.index('review')
        new_df.loc[0, key] = ' '.join(tokens[index + 1:meta - 1])

print(new_df)
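To make the token-based approach easier to follow, here is a minimal sketch of what the token list looks like and how the slice between "reason" and "review" is formed. The input is a shortened version of the question's text, and the second cell (`other_cell`) is a hypothetical one without the "to review ..." paragraph, added only to show that `list.index` raises `ValueError` when the marker word is absent.

```python
import re

# Shortened version of the text from the question
text = ("Decision: Postpone\n\nreason: \n- medical history \n"
        "-  information obtained from attending physician\n\n"
        "to review with current assessment\n\nmib: F\n")

# Word tokens: runs of letters, digits, underscores and apostrophes
tokens = re.findall(r"[\w']+", text.lower())
print(tokens[:6])

start = tokens.index("reason")
stop = tokens.index("review")   # raises ValueError if "review" is absent
# stop - 1 also drops the "to" that precedes "review"
reason = " ".join(tokens[start + 1:stop - 1])
print(reason)

# Hypothetical cell without the "to review ..." paragraph
other_cell = "Decision: Accept\n\nreason: \n- routine case\n"
other_tokens = re.findall(r"[\w']+", other_cell.lower())
try:
    other_tokens.index("review")
    found_marker = True
except ValueError:
    found_marker = False
print(found_marker)
```

Note that tokenising this way discards the punctuation and line structure of the original text, so the extracted reason is a flat sequence of lowercase words.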

【Comments】:

This also extracts the text below the reason ("to review with current assessment from..."), which should not be extracted under reason.
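The text you provided contains text = '''Decision: Postpone\n\nreason: \nmedical history\n\nto review with current assessment from Dr Cynthia Dominguez regarding medical history, and current CBC showing actual number of platelet count\n\nmib: F\n'''; the part after reason is not at the end.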
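Yes, but the "to review with..." part should not be extracted into the reason column, because it is meta-information.
@Kristada673 Thanks for the feedback, I have updated the code.
Thanks. The problem with your code is that it works only for this very specific input text. The dataset I have contains several kinds of text: the meta-information is not always present; when it is, the word "review" does not necessarily appear in all of them; and sometimes the word "reason" is missing, but you can see phrases like "due to medical reasons".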

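【Solution 2】:

If the content may be on another line, then you definitely must not split the source string into lines and then look for all the "tokens" in the current line.

Instead, you should:

prepare a regular expression with 2 capturing groups (keyword and content),
look for matches, e.g. using finditer.

Example code is as follows: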

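import re
import pandas as pd

# text is the multi-line string from the question
keywords = ['decision', 'reason']
# Two named groups: the keyword before the colon, and the content up to
# the next line starting with "word:" or the end of the text
it = re.finditer(r'(?P<kwd>\w+):\n?(?P<cont>.+?(?=\n\w+:|$))',
    text, flags=re.DOTALL)
row = dict.fromkeys(keywords, '')
for m in it:
    kwd = m.group('kwd').lower()
    cont = m.group('cont').strip()
    if kwd in keywords:
        row[kwd] = cont
# DataFrame.append was removed in pandas 2.0, so build the frame from the row
df = pd.DataFrame([row], columns=keywords)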

Of course, the code has to start with import re.
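Since the question mentions that the text lives in cells of a csv dataset, the same pattern can be wrapped in a function and run over every cell. The following is only a sketch under assumptions: the helper name `extract_fields` and the two sample cells are hypothetical, and every cell is assumed to be a string (missing values would need separate handling).

```python
import re
import pandas as pd

KEYWORDS = ["decision", "reason"]

# Same pattern as in the answer: keyword, colon, then content up to the
# next line that starts with "word:" (or the end of the text)
PATTERN = re.compile(r"(?P<kwd>\w+):\n?(?P<cont>.+?(?=\n\w+:|$))",
                     flags=re.DOTALL)


def extract_fields(cell):
    """Return {keyword: content} for one cell; missing keywords map to ''."""
    row = dict.fromkeys(KEYWORDS, "")
    for m in PATTERN.finditer(cell):
        kwd = m.group("kwd").lower()
        if kwd in row:
            row[kwd] = m.group("cont").strip()
    return row


# Hypothetical two-row column standing in for the csv data
cells = pd.Series([
    "Decision: Postpone\n\nreason:- medical history\n\nmib: F\n",
    "Decision: Accept\n\nreason: \n- routine case\n- no findings\n\nmib: T\n",
])

# One output row per input cell, one column per keyword
result = pd.DataFrame([extract_fields(c) for c in cells], columns=KEYWORDS)
print(result)
```

Compiling the pattern once and building the frame from a list of dicts avoids growing a DataFrame row by row.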

Perhaps you should also read up a bit on regular expressions.

【Comments】:
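As a reading aid, the pattern from the example can be written with `re.VERBOSE` so that each piece carries a comment. The second pattern is not part of the answer: it is a variant that additionally stops the content at a blank line, on the assumption that a blank line separates the reason from the "to review ..." meta text raised in the comments on Solution 1. That assumption will not hold for every cell, and the names `FIELD`, `FIELD_STOP_AT_BLANK` and `fields` are illustrative.

```python
import re

# The answer's pattern, spelled out piece by piece
FIELD = re.compile(r"""
    (?P<kwd>\w+):          # keyword: word characters followed by a colon
    \n?                    # optional newline right after the colon
    (?P<cont>.+?           # content: as few characters as possible ...
      (?=\n\w+:|$)         # ... up to the next keyword line or the end
    )
""", flags=re.DOTALL | re.VERBOSE)

# Variant (assumption): a blank line also ends the content
FIELD_STOP_AT_BLANK = re.compile(r"""
    (?P<kwd>\w+):
    \n?
    (?P<cont>.+?
      (?=\n\s*\n|\n\w+:|$) # blank line, next keyword line, or the end
    )
""", flags=re.DOTALL | re.VERBOSE)

text = ("Decision: Postpone\n\nreason: \n- medical history \n"
        "-  information obtained from attending physician\n\n"
        "to review with current assessment from Dr Cynthia Dominguez "
        "regarding medical history, and current CBC showing actual number "
        "of platelet count\n\nmib: F\n")


def fields(pattern, source):
    """Map each lowercased keyword to its stripped content."""
    return {m.group("kwd").lower(): m.group("cont").strip()
            for m in pattern.finditer(source)}


original = fields(FIELD, text)
variant = fields(FIELD_STOP_AT_BLANK, text)
print(original["reason"])
print(variant["reason"])
```

With the original pattern the reason still includes the "to review ..." paragraph; with the variant it ends at the blank line. A cell where a blank line directly follows the keyword would yield an empty value with the variant, so it trades one limitation for another.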