【Question title】: How can I extract a pattern from a text when it involves a new line?
【Posted】: 2019-11-02 20:15:32
【Question】:

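Suppose I have the following text in a cell of a dataset (a CSV file):

I want to extract the words/phrases that appear after the keywords Decision and reason. I can do it like this: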

import pandas as pd

text = '''Decision: Postpone\n\nreason:- medical history -  information obtained from attending physician\n\nto review with current assessment from Dr Cynthia Dominguez regarding medical history, and current CBC showing actual number of platelet count\n\nmib: F\n'''

keywords = ['decision', 'reason']
# One row, one column per keyword, initialised with empty strings
new_df = pd.DataFrame('', index=[0], columns=keywords)

# Check each line on its own and keep whatever follows the first colon
a = text.split('\n')
for cell in a:
    for keyword in keywords:
        if keyword in cell.lower():
            if len(cell.split(':')) > 1:
                # .loc avoids chained assignment (new_df[keyword][0] = ...)
                new_df.loc[0, keyword] = cell.split(':', 1)[1]

new_df

However, in some cells the word/phrase appears on a new line after the keyword, and in that case this program cannot extract it:

import pandas as pd

text = '''Decision: Postpone\n\nreason: \n- medical history \n-  information obtained from attending physician\n\nto review with current assessment from Dr Cynthia Dominguez regarding medical history, and current CBC showing actual number of platelet count\n\nmib: F\n'''

keywords = ['decision', 'reason']
new_df = pd.DataFrame('', index=[0], columns=keywords)

a = text.split('\n')
for cell in a:
    for keyword in keywords:
        if keyword in cell.lower():
            if len(cell.split(':')) > 1:
                # 'reason' gets only whitespace here: its content is on the next lines
                new_df.loc[0, keyword] = cell.split(':', 1)[1]
new_df

How can I solve this?

【Question comments】:

    Tags: python pandas text pattern-matching text-processing


    【Solution 1】:

    Use a regular expression to split the data into tokens; this reduces the amount of looping.

    import re
    import pandas as pd

    text = '''Decision: Postpone\n\nreason: \n- medical history \n-  information obtained from attending physician\n\nto review with current assessment from Dr Cynthia Dominguez regarding medical history, and current CBC showing actual number of platelet count\n\nmib: F\n'''

    keywords = ['decision', 'reason']
    new_df = pd.DataFrame('', index=[0], columns=keywords)
    text = text.lower()
    # Break the text into word tokens; punctuation and newlines are dropped
    tokens = re.findall(r"[\w']+", text)
    for key in keywords:
        if key == 'decision':
            index = tokens.index(key)
            # The single token right after the keyword
            new_df.loc[0, key] = ''.join(tokens[index + 1:index + 2])
        if key == 'reason':
            index = tokens.index(key)
            # Stop before the "to review ..." note (hard-coded for this text)
            meta = tokens.index('review')
            new_df.loc[0, key] = " ".join(tokens[index + 1:meta - 1])

    print(new_df)

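    【Comments】:

    • This also extracts the text below the reason ("to review with current assessment from..."), which should not be extracted under reason.
    • The text you provided contains text = '''Decision: Postpone\n\nreason: \nmedical history\n\nto review with current assessment from Dr Cynthia Dominguez regarding medical history, and current CBC showing actual number of platelet count\n\nmib: F\n'''; what comes after reason is not at the end.
    • Yes, but the "to review with..." part should not be extracted into the reason column, because it is meta-information.
    • @Kristada673 Thanks for the feedback, I have updated the code.
    • Thanks. The problem with your code is that it only works for this very specific input text. The dataset I have contains several kinds of text: the meta-information is not always present; when it is, the word "review" is not necessarily in all of them; and sometimes the word "reason" is missing, but you see phrases such as "due to medical reasons".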
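    【Solution 2】:

    If the content can be on a different line, you definitely cannot split the source string into lines and then look for all the "tokens" in the current line.

    Instead, you should:

    • Prepare a regular expression with 2 capturing groups (keyword and content)
    • Look for matches, e.g. using finditer

    Example code: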

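    import re
    import pandas as pd

    text = '''Decision: Postpone\n\nreason: \n- medical history \n-  information obtained from attending physician\n\nto review with current assessment from Dr Cynthia Dominguez regarding medical history, and current CBC showing actual number of platelet count\n\nmib: F\n'''
    keywords = ['decision', 'reason']
    # kwd: a word followed by a colon; cont: everything up to the next
    # "word:" at the start of a line, or up to the end of the text
    it = re.finditer(r'(?P<kwd>\w+):\n?(?P<cont>.+?(?=\n\w+:|$))',
        text, flags=re.DOTALL)
    row = dict.fromkeys(keywords, '')
    for m in it:
        kwd = m.group('kwd').lower()
        cont = m.group('cont').strip()
        if kwd in keywords:
            row[kwd] = cont
    # DataFrame.append was removed in pandas 2.0, so build the frame from the row
    df = pd.DataFrame([row], columns=keywords)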
    

    Of course, you should start with import re.

    You might also want to read up a bit on regular expressions.
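
    One of the comments under Solution 1 notes that the trailing note ("to review with...") should not end up in the reason column. Below is a minimal sketch of one way to get there with the same finditer idea. It rests on an assumption about the data that the question does not guarantee: a blank line ends a field's content. The names FIELD_RE, extract_fields and cells are made up for illustration, and the list of cells stands in for the CSV column.

```python
import re
import pandas as pd

# Hypothetical pattern: like the two-group regex above, but the content
# also stops at a blank line (assumption: a blank line ends a field).
FIELD_RE = re.compile(
    r'(?P<kwd>\w+):[ \t]*\n?(?P<cont>.+?)(?=\n[ \t]*\n|\n\w+:|\Z)',
    flags=re.DOTALL,
)

def extract_fields(text, keywords):
    """Return {keyword: content} for one cell; missing keywords stay ''."""
    row = dict.fromkeys(keywords, '')
    for m in FIELD_RE.finditer(text):
        kwd = m.group('kwd').lower()
        if kwd in row:
            row[kwd] = m.group('cont').strip()
    return row

# The two sample texts from the question share the same tail
tail = ('\n\nto review with current assessment from Dr Cynthia Dominguez '
        'regarding medical history, and current CBC showing actual number '
        'of platelet count\n\nmib: F\n')
text_inline = ('Decision: Postpone\n\nreason:- medical history -  '
               'information obtained from attending physician' + tail)
text_multiline = ('Decision: Postpone\n\nreason: \n- medical history \n-  '
                  'information obtained from attending physician' + tail)

keywords = ['decision', 'reason']
cells = [text_inline, text_multiline]  # stand-in for the CSV column

# One output row per cell
df = pd.DataFrame([extract_fields(c, keywords) for c in cells],
                  columns=keywords)
print(df)
```

    Cells where the keyword itself is missing (for example only a phrase such as "due to medical reasons") are not covered by this pattern and would need separate handling.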

    【Comments】:
