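【Question title】: Python - Parse Text from One Column into Multiple Columns
【Posted】: 2021-08-30 15:22:39
【Question description】:

I have an Excel file with two columns. The first holds useful identifying information, while the second is a block of text that varies from row to row. From that second column, I want to parse certain pieces of information into separate columns, depending on which of them are present. For example: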

Reference | Block of text
1 | Number: there are four people Location: this happened in downtown Time: this happened at midnight

I would like to turn it into:

Reference | Block of text | Number | Location | Time
1 | Number: there are four people Location: this happened in downtown Time: this happened at midnight | there are four people | this happened in downtown | this happened at midnight

How can I do this with Python?

【Question comments】:

  • See my answer, and please consider marking it as the solution.
  • Nikhil, you should respond to the answers to your question.

Tags: python regex pandas parsing text


【Solution 1】:

This works whether the splitters appear in any order or some of the information is missing.

import re
from typing import Dict, Optional, Tuple

import pandas as pd

SPLITTERS = ['Number:', 'Location:', 'Time:']
SPLITTERS_PAIRS = [(s1, s2) for s1 in SPLITTERS for s2 in SPLITTERS if s1 != s2]

BLOCK_OF_TEXT_COLUMN_NAME = 'Block of text'
REFERENCE_COLUMN_NAME = 'Reference'


def extract_info(row) -> Tuple[Optional[str], ...]:
    # every splitter starts as None, so splitters absent from the text stay null
    info: Dict[str, Optional[str]] = {s: None for s in SPLITTERS}
    raw = row[BLOCK_OF_TEXT_COLUMN_NAME]
    # empty cells arrive as None or NaN; anything that is not a string counts as empty
    text_in: str = raw if isinstance(raw, str) else ''

    for first, second in SPLITTERS_PAIRS:
        # text between `first` and `second`, when they appear in that order
        match = re.search(f'{re.escape(first)}(.*){re.escape(second)}', text_in)
        if match is None:
            continue
        text_in_between = match.group(1)
        # another splitter in between means the pair is not adjacent, so skip it
        if any(s in text_in_between for s in SPLITTERS if s not in (first, second)):
            continue
        info[first] = text_in_between.strip() or None

    # the splitter that appears last has no splitter after it: take the rest of the string
    present = [s for s in SPLITTERS if s in text_in]
    if present:
        last_splitter = max(present, key=text_in.rfind)
        info[last_splitter] = text_in.split(last_splitter)[-1].strip() or None

    return info[SPLITTERS[0]], info[SPLITTERS[1]], info[SPLITTERS[2]]


df = pd.DataFrame({REFERENCE_COLUMN_NAME:
                       [1, 2, 3, 4],
                   BLOCK_OF_TEXT_COLUMN_NAME:
                       ['Number: there are four people Location: this happened in downtown Time: this happened at midnight',
                       'Location: this happened in Rome Time: this happened at noon Number: there are three people ',
                       'Number: there are eight people',
                        None]})

df[SPLITTERS[0]], df[SPLITTERS[1]], df[SPLITTERS[2]] = \
    zip(*df.apply(extract_info, axis=1))
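
The same idea can also be sketched without comparing splitters pairwise: find every label in one pass and take the text up to the next label. The helper below is only an illustration, not part of the answer above. The name `parse_block` and the `LABELS` tuple are hypothetical, and it assumes each label occurs at most once per block and is always followed by a colon.

```python
import re
from typing import Dict, Optional, Sequence

# Labels taken from the question; LABELS and parse_block are hypothetical names.
LABELS = ("Number", "Location", "Time")


def parse_block(text: Optional[str], labels: Sequence[str] = LABELS) -> Dict[str, Optional[str]]:
    """Map each label to the text that follows it, or None when the label is absent."""
    result: Dict[str, Optional[str]] = {label: None for label in labels}
    if not isinstance(text, str):
        return result  # None or NaN cells yield all-None values

    # One alternation such as "(Number|Location|Time):" finds the labels in the order they occur.
    pattern = re.compile("(" + "|".join(re.escape(label) for label in labels) + "):")
    matches = list(pattern.finditer(text))
    for i, match in enumerate(matches):
        # A value runs from the end of its label to the start of the next label (or end of text).
        end = matches[i + 1].start() if i + 1 < len(matches) else len(text)
        value = text[match.end():end].strip()
        result[match.group(1)] = value or None
    return result


print(parse_block("Location: this happened in Rome Time: this happened at noon"))
```

Because the keys carry no trailing colon, they match the column names asked for in the question. With pandas, something along the lines of `df.join(df['Block of text'].apply(lambda t: pd.Series(parse_block(t))))` should attach them as new columns.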

【Comments】:
