Extract specific words from one column and move them to the next row

【Question title】: Extract specific words from one column and move them to the next row
【Posted】: 2023-01-26 18:11:57
【Question description】:

I have a DataFrame like the one below:

Animals	Type	Year
Penguin AVES	Omnivore	2015
Caiman REP	Carnivore	2018
Komodo.Rep	Carnivore	2019
Blue Jay.aves	Omnivore	2015
Peregrine aves Falcon	Carnivore	2016
Iguana+rep	Carnivore	2020
Rep Salamander	Carnivore	2019

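I want to extract specific words (for example AVES and REP) from the values in the "Animals" column and move each one to a new row directly below, keeping the values of the rest of the row. There are several other specific words besides AVES and REP. The data is not very clean (as shown by the spaces, dots, and "+" signs in front of the specific words). The expected new DataFrame is shown below.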

Animals	Type	Year
Penguin AVES	Omnivore	2015
AVES	Omnivore	2015
Caiman REP	Carnivore	2018
REP	Carnivore	2018
Komodo.Rep	Carnivore	2019
Rep	Carnivore	2019
Blue Jay.aves	Omnivore	2015
aves	Omnivore	2015
Peregrine aves Falcon	Carnivore	2016
aves	Carnivore	2016
Iguana+rep	Carnivore	2020
rep	Carnivore	2020
Rep Salamander	Carnivore	2019
Rep	Carnivore	2019

I have managed to extract the specific words located at the end of the value, using the following code provided by @mozway:

out = (
    pd.concat([df, df.assign(Animals=df['Animals'].str.extract(r'(\w+)$'))])
      .sort_index(kind='stable', ignore_index=True)
)

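But I still don't know how to extract specific words from the middle (see Peregrine aves Falcon) or from the start (see Rep Salamander). I plan to use regex, since I find it more flexible for my DataFrame, but I have only just started with Python and have no experience with regex. How should I approach this? Thanks in advance.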

【Question discussion】:

Do you have a whitelist of words to extract? If not, how do you know which one to extract?
@mozway Yes, I have a whitelist.

Tags: python pandas regex dataframe

【Solution 1】:

A variant of my previous answer, using a whitelist of words:

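import re

# whitelist of words to extract
words = ['aves', 'rep']
# alternation of the escaped words: 'aves|rep'
pattern = '|'.join(map(re.escape, words))

# duplicate every row, so that each original row is followed by a copy
out = df.loc[df.index.repeat(2)].reset_index(drop=True)
# in every second row (the copies), keep only the matched whole word
out.loc[1::2, 'Animals'] = out.loc[1::2, 'Animals'].str.extract(fr'\b({pattern})\b', flags=re.I, expand=False)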

Output:

                  Animals       Type  Year
0            Penguin AVES   Omnivore  2015
1                    AVES   Omnivore  2015
2              Caiman REP  Carnivore  2018
3                     REP  Carnivore  2018
4              Komodo.Rep  Carnivore  2019
5                     Rep  Carnivore  2019
6           Blue Jay.aves   Omnivore  2015
7                    aves   Omnivore  2015
8   Peregrine aves Falcon  Carnivore  2016
9                    aves  Carnivore  2016
10             Iguana+rep  Carnivore  2020
11                    rep  Carnivore  2020
12         Rep Salamander  Carnivore  2019
13                    Rep  Carnivore  2019

regex demo
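For anyone who wants to reproduce this end to end, here is a minimal self-contained sketch. The DataFrame is rebuilt by hand from the table in the question, and the pattern includes the `\b` word boundaries described in the discussion. Note that `str.extract` yields NaN for a row in which no whitelisted word is found. The sample data has no such row, so how to handle that case is left open here.

```python
import re
import pandas as pd

# Sample data rebuilt by hand from the table in the question
df = pd.DataFrame({
    'Animals': ['Penguin AVES', 'Caiman REP', 'Komodo.Rep', 'Blue Jay.aves',
                'Peregrine aves Falcon', 'Iguana+rep', 'Rep Salamander'],
    'Type': ['Omnivore', 'Carnivore', 'Carnivore', 'Omnivore',
             'Carnivore', 'Carnivore', 'Carnivore'],
    'Year': [2015, 2018, 2019, 2015, 2016, 2020, 2019],
})

words = ['aves', 'rep']                       # whitelist of words to extract
pattern = '|'.join(map(re.escape, words))     # 'aves|rep'

# Repeat each row twice: even positions hold the original, odd positions the copy
out = df.loc[df.index.repeat(2)].reset_index(drop=True)

# Overwrite 'Animals' in the copies with the first whitelisted word found,
# matched case-insensitively and only as a whole word
out.loc[1::2, 'Animals'] = out.loc[1::2, 'Animals'].str.extract(
    fr'\b({pattern})\b', flags=re.I, expand=False
)

print(out)
```

The original rows stay at the even positions and the extracted words land at the odd positions, so no sorting step is needed afterwards.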

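【Discussion】:

Thanks again. That's a lot to take in. Could you explain the pattern part of this code?
I just added a link to regex101 for a demo. In short, the pattern is \b(aves|rep)\b: it matches aves or rep as a whole word, delimited by word boundaries (\b). The re.I flag makes the pattern case-insensitive.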
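To see what the word boundaries and the `re.I` flag do in practice, here is a small sketch using only the standard `re` module. The helper name `first_word` and the last two sample strings ("Reptile house" and "Komodo_Rep") are made up for illustration and do not come from the question's data.

```python
import re

words = ['aves', 'rep']
# \b(aves|rep)\b, compiled case-insensitively
pattern = re.compile(r'\b(' + '|'.join(map(re.escape, words)) + r')\b', flags=re.I)

def first_word(text):
    """Return the first whitelisted whole word found in text, or None."""
    match = pattern.search(text)
    return match.group(1) if match else None

samples = [
    'Penguin AVES',           # word at the end, upper case
    'Komodo.Rep',             # preceded by a dot
    'Iguana+rep',             # preceded by a plus sign
    'Peregrine aves Falcon',  # word in the middle
    'Reptile house',          # hypothetical: 'Rep' is only part of a longer word
    'Komodo_Rep',             # hypothetical: '_' counts as a word character for \b
]

results = [first_word(s) for s in samples]
print(results)
```

The dot, the plus sign, and the space are all non-word characters, so `\b` matches next to them. An underscore is treated as a word character, so a value such as "Komodo_Rep" would need a different delimiter rule.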