Split a row into multiple rows while keeping the other columns the same (pandas, Python): answers

【Question title】: split row into multiple while keeping other columns same pandas python
【Posted】: 2021-11-19 17:25:47
【Question】:

I have a DataFrame like the one below:

import pandas as pd
df = pd.DataFrame({"order_id":[1,3,7],"order_date":["20/5/2018","22/5/2018","23/5/2018"], "package":["p1","p4","p5,p6"],"package_code":["As he crossed toward the pharmacy at the","he was dancing in the","they were playing football"]})
df

   order_id order_date package                              package_code
0         1  20/5/2018      p1  As he crossed toward the pharmacy at the
1         3  22/5/2018      p4                     he was dancing in the
2         7  23/5/2018   p5,p6                they were playing football

I wrote the function below, which splits a string into groups of 5 words:

s = 'As he crossed toward the pharmacy at the corner '
n = 5

def group_words(s, n):
    words = s.split()
    for i in range(0, len(words), n):
        yield ' '.join(words[i:i+n])

list(group_words(s,n))


['As he crossed toward the', 'pharmacy at the corner']

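I want to take the DataFrame and split the "package_code" column into multiple rows of 5 words each, while keeping the remaining columns the same on every resulting row.

How can I do this?

For example, the first row should become: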

   order_id order_date package              package_code
0         1  20/5/2018      p1  As he crossed toward the
0         1  20/5/2018      p1           pharmacy at the

I tried the following, but it does not give me what I want:

(df.set_index(['order_id', 'order_date'])
   .apply(lambda x: group_words(x, 3))
   .reset_index()) 


          index                                                 0
0       package  <generator object group_words at 0x7fa263e98570>
1  package_code  <generator object group_words at 0x7fa263e98678>
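
A note on why the attempt above produces generator objects: group_words is a generator function, so calling it only creates a generator and its body does not run until something iterates over it. apply calls it once per column and stores the un-iterated result. The sketch below shows the same behaviour in plain Python; the argument 12345 is an arbitrary illustrative value, not from the question.

```python
def group_words(s, n):
    words = s.split()
    for i in range(0, len(words), n):
        yield ' '.join(words[i:i+n])

# Calling a generator function does not execute its body, so even a
# non-string argument raises no error at call time.
gen = group_words(12345, 3)
print(type(gen).__name__)

# The body only runs on iteration; list() forces it.
chunks = list(group_words('they were playing football', 3))
print(chunks)
```

This is why the answers wrap the call in list(...) or avoid the function entirely.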

【Comments on the question】:

Tags: python pandas string

【Solution 1】:

You can wrap the generator in list to materialize it, apply it with Series.map, and then call explode:

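col = 'package_code'
# list() consumes the generator; explode() then gives each chunk its own row
s = df[col].map(lambda x: list(group_words(x,n))).explode()
# drop(columns=col) replaces the original df.drop(col,1); the positional axis argument no longer works in pandas 2.0+
out = s.to_frame().join(df.drop(columns=col)).loc[:,[*df]]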

print(out)

   order_id order_date package                package_code
0         1  20/5/2018      p1    As he crossed toward the
0         1  20/5/2018      p1             pharmacy at the
1         3  22/5/2018      p4       he was dancing in the
2         7  23/5/2018   p5,p6  they were playing football
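
For comparison, the same result can probably be reached without the join by calling DataFrame.explode on the column directly. This is a minimal self-contained sketch, assuming pandas 0.25 or later (where DataFrame.explode exists); it is an alternative spelling, not the answer's own method.

```python
import pandas as pd

def group_words(s, n):
    words = s.split()
    for i in range(0, len(words), n):
        yield ' '.join(words[i:i+n])

df = pd.DataFrame({
    "order_id": [1, 3, 7],
    "order_date": ["20/5/2018", "22/5/2018", "23/5/2018"],
    "package": ["p1", "p4", "p5,p6"],
    "package_code": ["As he crossed toward the pharmacy at the",
                     "he was dancing in the",
                     "they were playing football"],
})

# Replace the column with lists of 5-word chunks, then give every list
# element its own row; the other columns are repeated for each chunk.
out = (df.assign(package_code=df["package_code"].map(lambda s: list(group_words(s, 5))))
         .explode("package_code"))
print(out)
```

Because assign overwrites the existing column in place, the column order is unchanged and no reordering step is needed. Append .reset_index(drop=True) if the repeated index labels are unwanted.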

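【Comments】:

Could you explain out = s.to_frame().join(df.drop(col,1)).loc[:,[*df]]? Everything works, but I would like to understand this part better. I am quite confused about the purpose of loc[:,[*df]]. I understand that you join the two DataFrames after dropping a column from the original one.
@user2543622 loc[:,[*df]] returns the columns in the same order as the original DataFrame; [*df] unpacks the DataFrame's keys (its columns). The idea is to first map the column with your function, then explode it to create a Series. Because the exploded Series has duplicated index labels, I convert it to a DataFrame, join it with the original DataFrame, and then reorder the columns. Hope that clarifies it.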
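
The steps described in this comment can be traced on a tiny example. The two-row frame and the hand-built Series below are hypothetical stand-ins for the question's data and for the exploded result, chosen only to make the intermediate values easy to read.

```python
import pandas as pd

df = pd.DataFrame({"order_id": [1, 3], "package_code": ["a b c", "d"]})

# Stand-in for an exploded Series: index label 0 appears twice.
s = pd.Series(["a b", "c", "d"], index=[0, 0, 1], name="package_code")

# Iterating a DataFrame yields its column labels, so [*df] is the
# original column order as a list.
cols = [*df]

# join aligns on the index, so the row labelled 0 in `rest` is repeated
# for both pieces; the joined frame has package_code as its first column.
rest = df.drop(columns="package_code")
joined = s.to_frame().join(rest)

# .loc[:, cols] restores the original column order.
out = joined.loc[:, cols]
print(out)
```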

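【Solution 2】:

You can use extractall with a small regex ((?:\w+\s+?){1,5} = up to 5 words), so no external function is needed. Note that the pattern requires whitespace after every word, so the last word of each string is not captured, as the output shows: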

(df.drop('package_code', axis=1) # remove the existing column; the join below adds it back
   .join(df['package_code'].str.extractall(r'(?P<package_code>(?:\w+\s+?){1,5})').droplevel(1))
)

Output:

   order_id order_date package               package_code
0         1  20/5/2018      p1  As he crossed toward the 
0         1  20/5/2018      p1               pharmacy at 
1         3  22/5/2018      p4         he was dancing in 
2         7  23/5/2018   p5,p6         they were playing
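
The output above is missing the final word of each string, because the pattern only matches a word when whitespace follows it. One possible adjustment, offered as a sketch rather than as the answer's own pattern, is to match one word followed by up to four "whitespace + word" pairs. It also uses \S+ instead of \w+, on the assumption that punctuation should stay attached to its word.

```python
import pandas as pd

df = pd.DataFrame({
    "order_id": [1, 3, 7],
    "order_date": ["20/5/2018", "22/5/2018", "23/5/2018"],
    "package": ["p1", "p4", "p5,p6"],
    "package_code": ["As he crossed toward the pharmacy at the",
                     "he was dancing in the",
                     "they were playing football"],
})

# One word, then up to four more "whitespace + word" pairs: 1 to 5 words,
# with no whitespace required (or captured) after the last word.
pattern = r"(?P<package_code>\S+(?:\s+\S+){0,4})"

# extractall returns one row per match under a (row label, match number)
# MultiIndex; dropping the match level leaves the original row labels.
chunks = df["package_code"].str.extractall(pattern).droplevel(1)
out = df.drop(columns="package_code").join(chunks)
print(out)
```

With this pattern the chunks carry no trailing space, so they should match what group_words produces for the same input.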

【Comments】: