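【Question title】: Python - Parse Text from One Column into Multiple Columns
【Posted】: 2021-08-30 15:22:39
【Question description】:
I have an Excel file with two columns. The first holds useful identifying information, while the second is a block of text that varies from row to row. From that second column I want to parse some of the information into separate columns, depending on what is available in each row. For example: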
| Reference | Block of text |
| --- | --- |
| 1 | Number: there are four people Location: this happened in downtown Time: this happened at midnight |
I would like to turn it into:
| Reference | Block of text | Number | Location | Time |
| --- | --- | --- | --- | --- |
| 1 | Number: there are four people Location: this happened in downtown Time: this happened at midnight | there are four people | this happened in downtown | this happened at midnight |
How can I achieve this with Python?
【Comments】:
Tags:
python
regex
pandas
parsing
text
【Solution 1】:
Here is an approach that works when the splitters appear in any order and when some of the information is missing.
import re
from typing import Dict, Optional, Tuple

import pandas as pd

SPLITTERS = ['Number:', 'Location:', 'Time:']
# every ordered pair of distinct splitters
SPLITTERS_PAIRS = [(s1, s2) for s1 in SPLITTERS for s2 in SPLITTERS if s1 != s2]
BLOCK_OF_TEXT_COLUMN_NAME = 'Block of text'
REFERENCE_COLUMN_NAME = 'Reference'


def extract_info(row) -> Tuple[Optional[str], Optional[str], Optional[str]]:
    info: Dict[str, Optional[str]] = {}
    cell = row[BLOCK_OF_TEXT_COLUMN_NAME]
    # empty cells may hold None or NaN, so anything that is not a string counts as empty
    text_in: str = cell if isinstance(cell, str) else ''
    for first, second in SPLITTERS_PAIRS:
        re_text_in_between = re.search(f'{re.escape(first)}(.*){re.escape(second)}', text_in)
        text_in_between = re_text_in_between.group(1) if re_text_in_between else None
        if text_in_between:
            third_splitter = [s for s in SPLITTERS if s not in (first, second)][0]
            # keep the match only if the two splitters are adjacent in the text
            if third_splitter not in text_in_between:
                info[first] = text_in_between.strip()
    # a splitter that is present but was not captured above is the last one in the string
    for last_splitter in [s for s in SPLITTERS if s not in info and s in text_in]:
        info[last_splitter] = text_in.split(last_splitter)[-1].strip() or None
    # splitters that do not appear in the text are set to None
    for ms in [s for s in SPLITTERS if s not in info]:
        info[ms] = None
    return info[SPLITTERS[0]], info[SPLITTERS[1]], info[SPLITTERS[2]]


df = pd.DataFrame({
    REFERENCE_COLUMN_NAME: [1, 2, 3, 4],
    BLOCK_OF_TEXT_COLUMN_NAME: [
        'Number: there are four people Location: this happened in downtown Time: this happened at midnight',
        'Location: this happened in Rome Time: this happened at noon Number: there are three people ',
        'Number: there are eight people',
        None,
    ],
})

# apply returns one tuple per row; zip(*...) transposes them into one sequence per new column
df[SPLITTERS[0]], df[SPLITTERS[1]], df[SPLITTERS[2]] = \
    zip(*df.apply(extract_info, axis=1))
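For comparison, the same idea can probably be expressed with a single regular expression instead of testing every pair of splitters. The sketch below is not part of the answer above: `LABELS`, `FIELD_PATTERN`, and `parse_block` are hypothetical names, and it assumes each label appears at most once per cell and never occurs inside a value. A lazy match followed by a lookahead captures everything between one label and the next label (or the end of the text), so the order of the labels does not matter and absent labels simply stay `None`.

```python
import re

import pandas as pd

# Labels taken from the question; adjust to the labels in your own data.
LABELS = ["Number", "Location", "Time"]
_ALTERNATION = "|".join(re.escape(label) for label in LABELS)
# "<label>:" followed, lazily, by everything up to the next label or the end of the text.
FIELD_PATTERN = re.compile(
    rf"(?P<label>{_ALTERNATION}):\s*(?P<value>.*?)(?=\s*(?:{_ALTERNATION}):|\Z)",
    flags=re.DOTALL,
)


def parse_block(text):
    """Return a {label: value-or-None} dict for one cell (hypothetical helper)."""
    fields = dict.fromkeys(LABELS)  # every label defaults to None
    if isinstance(text, str):  # skips None and NaN cells
        for match in FIELD_PATTERN.finditer(text):
            fields[match.group("label")] = match.group("value").strip() or None
    return fields


df = pd.DataFrame({
    "Reference": [1, 2, 3, 4],
    "Block of text": [
        "Number: there are four people Location: this happened in downtown Time: this happened at midnight",
        "Location: this happened in Rome Time: this happened at noon Number: there are three people ",
        "Number: there are eight people",
        None,
    ],
})

# Build one column per label from the per-row dicts, then attach them to the original frame.
parsed = pd.DataFrame(df["Block of text"].map(parse_block).tolist(), index=df.index)
df = df.join(parsed)
```

With real data, the frame would presumably come from `pandas.read_excel` rather than being built inline. Adding a label only requires extending `LABELS`, whereas the pairwise approach assumes exactly three splitters.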