Python - Parse Text from One Column into Multiple Columns

【Question title】: Python - Parse Text from One Column into Multiple Columns
【Posted】: 2021-08-30 15:22:39
【Question description】:

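I have an Excel file with two columns. The first holds useful identifying information, while the second is a block of text that can vary from row to row. From that second column, I want to parse some pieces of information into separate columns, depending on which of them are present. For example: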

Reference	Block of text
1	Number: there are four people Location: this happened in downtown Time: this happened at midnight

I want to turn it into:

Reference	Block of text	Number	Location	Time
1	Number: there are four people Location: this happened in downtown Time: this happened at midnight	there are four people	this happened in downtown	this happened at midnight

How can I do this with Python?

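【Comments on the question】:

See my answer, and please consider marking it as the solution.
Nikhil, you should follow up on the answers to your question.

Tags: python regex pandas parsing text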

【Solution 1】:

This approach works when the splitters appear in any order and when some of the information is missing.
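
The answer's code searches between every pair of splitters. To see why it also has to check for the third splitter, here is a minimal sketch on a shortened, made-up version of the question's example (the variable names `adjacent` and `non_adjacent` are illustrative only):

```python
import re

# Shortened, illustrative version of the example row from the question.
text = 'Number: four people Location: downtown Time: midnight'

# Adjacent pair: the capture is exactly the value that belongs to 'Number:'.
adjacent = re.search(r'Number:(.*)Location:', text).group(1)

# Non-adjacent pair: the capture also swallows the whole 'Location:' field,
# so it must be discarded by checking whether the third splitter is inside it.
non_adjacent = re.search(r'Number:(.*)Time:', text).group(1)

print(repr(adjacent))
print(repr(non_adjacent))
```

Only captures that do not contain another splitter are kept, which is what makes the result independent of the order of the fields.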

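import re
from typing import Dict, Optional, Tuple

import pandas as pd

SPLITTERS = ['Number:', 'Location:', 'Time:']
# every ordered pair of distinct splitters, e.g. ('Number:', 'Location:')
SPLITTERS_PAIRS = [(s1, s2) for s1 in SPLITTERS for s2 in SPLITTERS if s1 != s2]

BLOCK_OF_TEXT_COLUMN_NAME = 'Block of text'
REFERENCE_COLUMN_NAME = 'Reference'


def extract_info(row) -> Tuple[Optional[str], ...]:

    info: Dict[str, Optional[str]] = {}
    # treat empty cells (None, or NaN when the data is read from Excel) as empty text
    raw_text = row[BLOCK_OF_TEXT_COLUMN_NAME]
    text_in: str = raw_text if isinstance(raw_text, str) else ''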

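    for splitter_pair in SPLITTERS_PAIRS:

        # capture whatever sits between the first and the second splitter of the pair
        re_text_in_between = re.search(
            f'{re.escape(splitter_pair[0])}(.*){re.escape(splitter_pair[1])}', text_in)
        text_in_between = re_text_in_between.group(1) if re_text_in_between else None

        if text_in_between:
            third_splitter = [s for s in SPLITTERS if s not in splitter_pair][0]
            # keep the capture only if the two splitters are adjacent in the text,
            # i.e. the third splitter does not appear between them
            if third_splitter not in text_in_between:
                info[splitter_pair[0]] = text_in_between.strip()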

    # the splitter that appears last in the text has no closing splitter,
    # so its value is everything that follows it; splitters that are absent
    # from the text are skipped here and set to None further down
    for last_splitter in [s for s in SPLITTERS if s in text_in and s not in info]:
        info[last_splitter] = text_in.split(last_splitter)[-1].strip() or None

    # in case a splitter was not present in the text, we add as null
    missing_splitters = [s for s in SPLITTERS if s not in info.keys()]

    for ms in missing_splitters:
        info[ms] = None

    return info[SPLITTERS[0]], info[SPLITTERS[1]], info[SPLITTERS[2]]


df = pd.DataFrame({REFERENCE_COLUMN_NAME:
                       [1, 2, 3, 4],
                   BLOCK_OF_TEXT_COLUMN_NAME:
                       ['Number: there are four people Location: this happened in downtown Time: this happened at midnight',
                       'Location: this happened in Rome Time: this happened at noon Number: there are three people ',
                       'Number: there are eight people',
                        None]})

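# name the new columns after the splitters, without the trailing colon
new_columns = [s.rstrip(':') for s in SPLITTERS]
df[new_columns[0]], df[new_columns[1]], df[new_columns[2]] = \
    zip(*df.apply(extract_info, axis=1))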
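
As an alternative to the pairwise search, the same result can probably be reached with a single regular expression that captures each label and reads its value lazily up to the next label or the end of the text. The following is only a minimal sketch and not part of the original answer: `LABELS`, `FIELD_RE` and `parse_block` are hypothetical names, and it assumes that a label followed by a colon never occurs inside a value.

```python
import re
from typing import Dict, Optional

import pandas as pd

# Assumed label names, taken from the example in the question.
LABELS = ['Number', 'Location', 'Time']

# Alternation of all labels, e.g. "Number|Location|Time".
_LABEL_ALT = '|'.join(re.escape(label) for label in LABELS)

# "<label>: <value>", where the value runs lazily up to the next label
# or to the end of the text (the lookahead does not consume the next label).
FIELD_RE = re.compile(
    rf'({_LABEL_ALT}):\s*(.*?)\s*(?=(?:{_LABEL_ALT}):|$)', re.DOTALL)


def parse_block(text) -> Dict[str, Optional[str]]:
    """Return one entry per label; labels absent from the text map to None."""
    fields: Dict[str, Optional[str]] = dict.fromkeys(LABELS)
    if isinstance(text, str):  # skips None and NaN (empty cells)
        for label, value in FIELD_RE.findall(text):
            fields[label] = value or None
    return fields


df = pd.DataFrame({
    'Reference': [1, 2, 3, 4],
    'Block of text': [
        'Number: there are four people Location: this happened in downtown Time: this happened at midnight',
        'Location: this happened in Rome Time: this happened at noon Number: there are three people ',
        'Location: this happened in Paris',
        None,
    ],
})

# Build one column per label and attach them to the original frame.
parsed = pd.DataFrame(df['Block of text'].map(parse_block).tolist(), index=df.index)
df = df.join(parsed)
```

Because each field is matched on its own, the order of the labels does not matter and a row with a single label, or with no text at all, simply leaves the other columns empty. When the data comes from a real Excel file via `pd.read_excel`, empty cells typically arrive as NaN, which `parse_block` treats the same way as `None`.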

【Discussion】: