【Question title】: Split a row into multiple rows while keeping the other columns the same (pandas, Python)
【Posted】: 2021-11-19 17:25:47
【Question】:

I have a DataFrame like the one below:

import pandas as pd
df = pd.DataFrame({"order_id":[1,3,7],"order_date":["20/5/2018","22/5/2018","23/5/2018"], "package":["p1","p4","p5,p6"],"package_code":["As he crossed toward the pharmacy at the","he was dancing in the","they were playing football"]})
df

   order_id order_date package                              package_code
0         1  20/5/2018      p1  As he crossed toward the pharmacy at the
1         3  22/5/2018      p4                     he was dancing in the
2         7  23/5/2018   p5,p6                they were playing football

I wrote the function below, which splits a string into groups of 5 words:

s = 'As he crossed toward the pharmacy at the corner '
n = 5

def group_words(s, n):
    words = s.split()
    for i in range(0, len(words), n):
        yield ' '.join(words[i:i+n])

list(group_words(s,n))


['As he crossed toward the', 'pharmacy at the corner']

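I want to take the DataFrame and split the "package_code" column into multiple rows of 5 words each, while keeping the remaining columns the same in every row.

How can I do this?

For example, the first row should become: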

   order_id order_date package              package_code
0         1  20/5/2018      p1  As he crossed toward the
0         1  20/5/2018      p1           pharmacy at the

I tried the following, but it does not give me what I want:

(df.set_index(['order_id', 'order_date'])
   .apply(lambda x: group_words(x, 3))
   .reset_index()) 


          index                                                 0
0       package  <generator object group_words at 0x7fa263e98570>
1  package_code  <generator object group_words at 0x7fa263e98678>

【Question comments】:

    Tags: python pandas string


    【Solution 1】:

    You can use list to unpack the generator, then combine Series.map with explode:

    col = 'package_code'
    s = df[col].map(lambda x: list(group_words(x, n))).explode()
    # columns= replaces the positional axis argument, which pandas 2.0 no longer accepts
    out = s.to_frame().join(df.drop(columns=col)).loc[:, [*df]]
    

    print(out)
    
       order_id order_date package                package_code
    0         1  20/5/2018      p1    As he crossed toward the
    0         1  20/5/2018      p1             pharmacy at the
    1         3  22/5/2018      p4       he was dancing in the
    2         7  23/5/2018   p5,p6  they were playing football
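
As a possible alternative to the join step, the list-valued column can be assigned back and exploded on the DataFrame directly. This is only a sketch, not part of the original answer: it assumes pandas 0.25 or later (where `DataFrame.explode` is available) and reuses the question's `group_words` and sample data.

```python
import pandas as pd


def group_words(s, n):
    """Yield successive groups of n whitespace-separated words from s."""
    words = s.split()
    for i in range(0, len(words), n):
        yield ' '.join(words[i:i + n])


df = pd.DataFrame({
    "order_id": [1, 3, 7],
    "order_date": ["20/5/2018", "22/5/2018", "23/5/2018"],
    "package": ["p1", "p4", "p5,p6"],
    "package_code": ["As he crossed toward the pharmacy at the",
                     "he was dancing in the",
                     "they were playing football"],
})

n = 5

# Replace each string with a list of 5-word chunks, then give every
# list element its own row; the other columns are repeated as needed.
out = (df.assign(package_code=df["package_code"].map(lambda x: list(group_words(x, n))))
         .explode("package_code"))

print(out)
```

Because `explode` keeps the original column order and repeats the original index labels, no reordering step is needed; add `.reset_index(drop=True)` if a fresh 0..N-1 index is preferred.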
    

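    【Comments】:

    • Could you explain the line that builds out? Everything works, but I would like to understand that part better. I am especially confused about the purpose of loc[:,[*df]]. I understand that you join the two DataFrames after dropping one column from the original.
    • @user2543622 loc[:,[*df]] returns the columns in the same order as the original DataFrame; [*df] unpacks the DataFrame's keys (its columns). The idea is to first map the column with your function, then explode it to create a Series. Since the exploded Series has duplicate index values, I convert it to a DataFrame, join it with the original DataFrame, and then reorder the columns. Hope that clarifies it.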
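
The explanation in the comment above can be followed step by step with the sketch below. The intermediate names `others` and `joined` are introduced here purely for illustration and do not appear in the original answer; `df.drop(columns=col)` is used in place of the positional `axis` argument, on the assumption that a recent pandas version is installed.

```python
import pandas as pd


def group_words(s, n):
    """Yield successive groups of n whitespace-separated words from s."""
    words = s.split()
    for i in range(0, len(words), n):
        yield ' '.join(words[i:i + n])


df = pd.DataFrame({
    "order_id": [1, 3, 7],
    "order_date": ["20/5/2018", "22/5/2018", "23/5/2018"],
    "package": ["p1", "p4", "p5,p6"],
    "package_code": ["As he crossed toward the pharmacy at the",
                     "he was dancing in the",
                     "they were playing football"],
})

col = "package_code"
n = 5

# Step 1: map each string to a list of chunks, then explode the lists.
# The result is a Series whose index repeats the original row label
# once per chunk.
s = df[col].map(lambda x: list(group_words(x, n))).explode()
print(s.index.tolist())

# Step 2: everything except the column being replaced.
others = df.drop(columns=col)  # hypothetical helper name

# Step 3: join on the index, so each chunk picks up the remaining
# columns of the row it came from. package_code now comes first.
joined = s.to_frame().join(others)  # hypothetical helper name
print(list(joined.columns))

# Step 4: iterating a DataFrame yields its column labels, so [*df] is
# the original column order; .loc[:, ...] reorders the columns to match.
out = joined.loc[:, [*df]]
print(list(out.columns))
```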
    【Solution 2】:

    You can use extractall with a short regular expression ((?:\w+\s+?){1,5} = 1 to 5 words), so no external function is needed:

    (df.drop('package_code', axis=1) # remove existing column as we replace after
       .join(df['package_code'].str.extractall(r'(?P<package_code>(?:\w+\s+?){1,5})').droplevel(1))
    )
    

    Output:

       order_id order_date package               package_code
    0         1  20/5/2018      p1  As he crossed toward the 
    0         1  20/5/2018      p1               pharmacy at 
    1         3  22/5/2018      p4         he was dancing in 
    2         7  23/5/2018   p5,p6         they were playing 
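
Note that in the output above the last word of each string is missing ("the", "the", "football"): the pattern requires at least one whitespace character after every word, so a final word that ends the string is never matched. One possible adjustment, offered as a sketch rather than as the answer author's method, is to let each word be followed by either whitespace or the end of the string, then strip the trailing space. It assumes, as the original pattern does, that words consist of `\w` characters only.

```python
import pandas as pd

df = pd.DataFrame({
    "order_id": [1, 3, 7],
    "order_date": ["20/5/2018", "22/5/2018", "23/5/2018"],
    "package": ["p1", "p4", "p5,p6"],
    "package_code": ["As he crossed toward the pharmacy at the",
                     "he was dancing in the",
                     "they were playing football"],
})

# Each word must be followed by whitespace OR by the end of the string,
# so the final word is captured as well.
pattern = r'(?P<package_code>(?:\w+(?:\s+|$)){1,5})'

matches = df["package_code"].str.extractall(pattern)  # MultiIndex: (row, match)
chunks = matches.droplevel(1)["package_code"].str.strip()  # back to row labels

out = df.drop(columns="package_code").join(chunks)

print(out)
```

Strings containing punctuation (apostrophes, commas, hyphens) would be split differently from `str.split()`, since `\w` does not match those characters; replacing `\w+` with `\S+` is one way to bring the behaviour closer to the question's `group_words`.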
    

    【Comments】:
