【Question Title】: Find duplicate rows and move the corresponding data next to the original row
【Posted】: 2021-06-22 21:45:30
【Question】:

I have the following DataFrame:

unique_id   person_id   fruit_name  poduct  guest   
92          11          apple       silver  Miller  
93          12          cherry      bronze  Gus     
967         121         orange      purple  Mike
94          176         apple       silver  Miller  
95          176         banana      gold    John    
96          176         orange      purple  Mike    
445         111         apple       silver  Miller  
100         112         cherry      bronze  Gus     
232         111         apple       silver  Miller  
355         555         cherry      bronze  Gus

I want to take any rows whose person_id value is duplicated and move them next to the original row. Here is an example of the expected output:

unique_id   person_id   fruit_name  poduct  guest   unique_id_1  fruit_name  poduct  guest   unique_id_2  fruit_name  poduct  guest
92          11          apple       silver  Miller
93          12          cherry      bronze  Gus
967         121         orange      purple  Mike
94          176         apple       silver  Miller  95           banana      gold    John    96           orange      purple  Mike
100         112         cherry      bronze  Gus
445         111         apple       silver  Miller  232          apple       silver  Miller
355         555         cherry      bronze  Gus

I'm not sure what I should search for online to achieve this; any suggestions are much appreciated.
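
To make the question reproducible, the first table can be built as below. This is only a sketch: the values are copied from the table above, the column name `poduct` is kept exactly as posted, and the variable name `repeated` is introduced here purely for illustration.

```python
import pandas as pd

# Sample data copied from the first table in the question.
df = pd.DataFrame({
    "unique_id": [92, 93, 967, 94, 95, 96, 445, 100, 232, 355],
    "person_id": [11, 12, 121, 176, 176, 176, 111, 112, 111, 555],
    "fruit_name": ["apple", "cherry", "orange", "apple", "banana",
                   "orange", "apple", "cherry", "apple", "cherry"],
    "poduct": ["silver", "bronze", "purple", "silver", "gold",
               "purple", "silver", "bronze", "silver", "bronze"],
    "guest": ["Miller", "Gus", "Mike", "Miller", "John",
              "Mike", "Miller", "Gus", "Miller", "Gus"],
})

# person_id values that appear on more than one row
# (keep=False flags every occurrence, not just the later ones).
mask = df["person_id"].duplicated(keep=False)
repeated = sorted(df.loc[mask, "person_id"].unique())
```

Here `repeated` holds the two person_id values (111 and 176) whose extra rows should end up beside the first one.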

【Comments】:

  • Why do you want to do this? There may be a better way to solve your underlying problem.

Tags: python-3.x pandas python-2.7 dataframe


【Solution 1】:

Try:

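import pandas as pd

# Split rows: first occurrence of each person_id vs. later duplicates
is_dup = df.duplicated(subset=["person_id"], keep="first")
first = df[~is_dup]
dup = df[is_dup]

# Merge the duplicates back onto their first occurrence by "person_id"
# (a person_id with several duplicates still yields one row per duplicate)
new_df = pd.merge(
    left=first,
    right=dup,
    how="left",
    on=["person_id"],
    suffixes=("_0", "_1"),
)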

【Comments】:

    【Solution 2】:

    This is a "long to wide" reshape.

    You can add a column that numbers each row within its person_id group.

    df['group'] = df.groupby('person_id').cumcount() + 1
    
    >>> df
       unique_id  person_id fruit_name  poduct   guest  group
    0         92         11      apple  silver  Miller      1
    1         93         12     cherry  bronze     Gus      1
    2        967        121     orange  purple    Mike      1
    3         94        176      apple  silver  Miller      1
    4         95        176     banana    gold    John      2
    5         96        176     orange  purple    Mike      3
    6        445        111      apple  silver  Miller      1
    7        100        112     cherry  bronze     Gus      1
    8        232        111      apple  silver  Miller      2
    9        355        555     cherry  bronze     Gus      1
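
The numbering step can be seen in isolation on a smaller frame. This is a minimal sketch; the frame `ids` and its values are made up for illustration and are not part of the original answer.

```python
import pandas as pd

# Hypothetical mini-frame: person_id 176 appears three times, 111 twice.
ids = pd.DataFrame({"person_id": [11, 176, 176, 111, 176, 111]})

# cumcount() numbers the rows inside each person_id group from 0,
# following the original row order, so "+ 1" makes the first row group 1.
ids["group"] = ids.groupby("person_id").cumcount() + 1
```

Rows that share a person_id get 1, 2, 3, ... in the order they appear, even when they are not adjacent, which is what lets the pivot place each duplicate in its own set of columns.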
    

    Then use it in DataFrame.pivot():

    >>> df.pivot(index='person_id', columns='group').sort_index(axis=1, level=1)
    
              fruit_name   guest  poduct unique_id fruit_name   guest  poduct unique_id fruit_name guest  poduct unique_id
    group              1       1       1         1          2       2       2         2          3     3       3         3
    person_id                                                                                                             
    11             apple  Miller  silver      92.0        NaN     NaN     NaN       NaN        NaN   NaN     NaN       NaN
    12            cherry     Gus  bronze      93.0        NaN     NaN     NaN       NaN        NaN   NaN     NaN       NaN
    111            apple  Miller  silver     445.0      apple  Miller  silver     232.0        NaN   NaN     NaN       NaN
    112           cherry     Gus  bronze     100.0        NaN     NaN     NaN       NaN        NaN   NaN     NaN       NaN
    121           orange    Mike  purple     967.0        NaN     NaN     NaN       NaN        NaN   NaN     NaN       NaN
    176            apple  Miller  silver      94.0     banana    John    gold      95.0     orange  Mike  purple      96.0
    555           cherry     Gus  bronze     355.0        NaN     NaN     NaN       NaN        NaN   NaN     NaN       NaN
    

    Then you can flatten and rename the columns.

    out = df.pivot(index='person_id', columns='group').sort_index(axis=1, level=1)
    out.columns = [ f'{x}_{y}' for x, y in out.columns ]
    
    >>> out.reset_index()
       person_id fruit_name_1 guest_1 poduct_1  unique_id_1 fruit_name_2 guest_2 poduct_2  unique_id_2 fruit_name_3 guest_3 poduct_3  unique_id_3
    0         11        apple  Miller   silver         92.0          NaN     NaN      NaN          NaN          NaN     NaN      NaN          NaN
    1         12       cherry     Gus   bronze         93.0          NaN     NaN      NaN          NaN          NaN     NaN      NaN          NaN
    2        111        apple  Miller   silver        445.0        apple  Miller   silver        232.0          NaN     NaN      NaN          NaN
    3        112       cherry     Gus   bronze        100.0          NaN     NaN      NaN          NaN          NaN     NaN      NaN          NaN
    4        121       orange    Mike   purple        967.0          NaN     NaN      NaN          NaN          NaN     NaN      NaN          NaN
    5        176        apple  Miller   silver         94.0       banana    John     gold         95.0       orange    Mike   purple         96.0
    6        555       cherry     Gus   bronze        355.0          NaN     NaN      NaN          NaN          NaN     NaN      NaN          NaN
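
The three steps above (number, pivot, flatten) could be bundled into one helper. The sketch below assumes a hypothetical function name `widen` and a small made-up frame; it follows the same calls used in this answer and leaves the input frame unmodified.

```python
import pandas as pd

def widen(df, key):
    """Hypothetical helper: one output row per `key`, duplicates spread into columns."""
    # Number each row within its key group: 1, 2, 3, ...
    numbered = df.assign(group=df.groupby(key).cumcount() + 1)
    # One column block per group number, ordered by group and then by column name.
    wide = numbered.pivot(index=key, columns="group").sort_index(axis=1, level=1)
    # Flatten the (name, group) column labels into "name_group".
    wide.columns = [f"{name}_{n}" for name, n in wide.columns]
    return wide.reset_index()

# Made-up sample: person 1 has two rows, person 2 has one.
sample = pd.DataFrame({
    "person_id": [1, 2, 1],
    "fruit": ["a", "b", "c"],
    "uid": [10, 20, 30],
})
wide = widen(sample, "person_id")
```

As in the output above, keys without a second row get NaN in the later column blocks, and numeric columns such as `uid` come back as floats because of those missing values.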
    

    Update


    Example of a custom column order:

    order = ['person_id', 'fruit_name', 'unique_id', 'guest', 'poduct']
    out = df.pivot(index='person_id', columns='group')
    out = out[sorted(out.columns, key=lambda idx: (idx[1], order.index(idx[0])))]
    out.columns = [ f'{x}_{y}' for x, y in out.columns ]
    
    >>> out.reset_index()
       person_id fruit_name_1  unique_id_1 guest_1 poduct_1 fruit_name_2  unique_id_2 guest_2 poduct_2 fruit_name_3  unique_id_3 guest_3 poduct_3
    0         11        apple         92.0  Miller   silver          NaN          NaN     NaN      NaN          NaN          NaN     NaN      NaN
    1         12       cherry         93.0     Gus   bronze          NaN          NaN     NaN      NaN          NaN          NaN     NaN      NaN
    2        111        apple        445.0  Miller   silver        apple        232.0  Miller   silver          NaN          NaN     NaN      NaN
    3        112       cherry        100.0     Gus   bronze          NaN          NaN     NaN      NaN          NaN          NaN     NaN      NaN
    4        121       orange        967.0    Mike   purple          NaN          NaN     NaN      NaN          NaN          NaN     NaN      NaN
    5        176        apple         94.0  Miller   silver       banana         95.0    John     gold       orange         96.0    Mike   purple
    6        555       cherry        355.0     Gus   bronze          NaN          NaN     NaN      NaN          NaN          NaN     NaN      NaN
    

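    【Comments】:

    • This is exactly what I was looking for, thank you!
    • I tried to reorder the columns from my question above using the reindex() method, but the result gets stripped. Currently the columns print in alphabetical order. Is there a way to control the order? Say I want to start with person_id and fruit_name. @Karl
    • You can sort the columns with a function. I have added an example.
    • This works as suggested; now I'm trying to digest what you did in that function. Thanks again.
    • It takes the number from each column and sorts by that number first, then by the position of the "name" part in the list you defined. It makes more sense to do this while the columns are still a MultiIndex, so I have edited the answer to use that approach.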
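
The sort key described in the last comment can be tried on plain tuples without pandas. This is a small sketch; the `columns` list is a made-up, shuffled subset of the (name, group) labels, while `order` is the list from the answer.

```python
order = ["person_id", "fruit_name", "unique_id", "guest", "poduct"]

# Hypothetical shuffled (name, group) labels, as they appear in the pivoted columns.
columns = [
    ("guest", 2),
    ("unique_id", 1),
    ("fruit_name", 2),
    ("poduct", 1),
    ("fruit_name", 1),
]

# Sort by the group number first, then by where the name sits in `order`.
ordered = sorted(columns, key=lambda idx: (idx[1], order.index(idx[0])))
# ordered == [("fruit_name", 1), ("unique_id", 1), ("poduct", 1),
#             ("fruit_name", 2), ("guest", 2)]
```

Note that `order.index()` raises a ValueError for any column name missing from `order`, so the list needs to cover every value column in the frame.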