【问题标题】:Sorting complete pandas data frame by a column while preserving sequence order按列对完整的熊猫数据框进行排序,同时保留序列顺序
【发布时间】:2021-02-03 15:35:29
【问题描述】:

想法如下;整个数据框应按第 4 列排序,同时在按 df_train[4] 排序时保持每一行中的确切信息序列(例如,'Table1'、'Table1'、'Table1'、'Table2'、'Table2'、'表3','表3', '表4)

尝试和输出:

import pandas as pd
from pandas import np

df_train = pd.DataFrame({1: ['1234', '4321','4321', '3241', '5432','5821', '4442', '4567', '0952','0952','7426','4422','4121','8885','4231','5932','5932','5555','0532','1111','5555','6321','6321','9955','3333','3333','3333','4132']})
df_train[2] = ['Mayo', 'Mango','Mango','Apple','Kiwi','Watermelon','Orange','Peach','Broccoli','Broccoli','Name','Surname','Location','Carrots','Salad','Water','Water','Wine','Beer','Cup','Rose','Flower','Flower','Paint','Van','Van','Van','Car']
df_train[3] = ['1','2','2','1','1','1','1','1','2','2','1','1','1','1','1','2','2','1','1','1','1','2','2','1','3','3','3','1']
df_train[4] = ['Table1','Table1','Table1','Table2','Table2','Table3','Table3','Table4','Table4','Table4','Table1','Table1','Table1','Table1','Table2','Table2','Table2','Table3','Table4','Table4','Table1','Table1','Table1','Table2','Table3','Table3','Table3','Table4']
output: 
    1        2          3      4
0   1234    Mayo        1   Table1
1   4321    Mango       2   Table1
2   4321    Mango       2   Table1
3   3241    Apple       1   Table2
4   5432    Kiwi        1   Table2
5   5821    Watermelon  1   Table3
6   4442    Orange      1   Table3
7   4567    Peach       1   Table4
8   0952    Broccoli    2   Table4
9   0952    Broccoli    2   Table4
10  7426    Name        1   Table1
11  4422    Surname     1   Table1
12  4121    Location    1   Table1
13  8885    Carrots     1   Table1
14  4231    Salad       1   Table2
15  5932    Water       2   Table2
16  5932    Water       2   Table2
17  5555    Wine        1   Table3
18  0532    Beer        1   Table4
19  1111    Cup         1   Table4
20  5555    Rose        1   Table1
21  6321    Flower      2   Table1
22  6321    Flower      2   Table1
23  9955    Paint       1   Table2
24  3333    Van         3   Table3
25  3333    Van         3   Table3
26  3333    Van         3   Table3
27  4132    Car         1   Table4

尝试 1 和 2:

df_train2 = df_train.sort_values(4, ignore_index=True) # 
df_train2 = df_train.sort_values(4, ignore_index=False) #same issue

输出按所有目标列 4 排序,但无论索引如何,顺序都是相当随意的:

    1        2          3     4
0   1234    Mayo        1   Table1 
1   6321    Flower      2   Table1 -> should be: 4321, Mango, 2, Table1
2   6321    Flower      2   Table1 -> should be: 4321, Mango, 2, Table1
3   5555    Rose        1   Table1 -> should be: 7426, Name, 1, Table1
4   4121    Location    1   Table1 -> should be: 4422, Surname, 1, Table1
5   4422    Surname     1   Table1 -> should be: 4121, Location, 1, Table1
6   7426    Name        1   Table1 ...etc
7   8885    Carrots     1   Table1
8   4321    Mango       2   Table1
9   4321    Mango       2   Table1
10  5432    Kiwi        1   Table2
11  4231    Salad       1   Table2
12  5932    Water       2   Table2
13  5932    Water       2   Table2
14  9955    Paint       1   Table2
15  3241    Apple       1   Table2
16  5555    Wine        1   Table3
17  3333    Van         3   Table3
18  4442    Orange      1   Table3
19  5821    Watermelon  1   Table3
20  3333    Van         3   Table3
21  3333    Van         3   Table3
22  0952    Broccoli    2   Table4
23  0532    Beer        1   Table4
24  4567    Peach       1   Table4
25  0952    Broccoli    2   Table4

尝试 3 和 4:

# Attempt 3:
df_train3 = df_train.copy()
df_train3.loc[3:] = df_train3.loc[3:].sort_values(4).set_index(df_train3.loc[3:].index)

即使从第 3 行切片也没有得到所需的输出

# Attempt 4:
isolate_digit = []
for indx,row in df_train.iterrows():
    isolate_digit.append( int(''.join(filter(str.isdigit,row[4]))))

df_train['order'] = isolate_digit 

甚至尝试从第 4 列中分离出它是整数,可能是因为它是一个字符串('Table1'..etc),因此 alg 没有保留顺序,仍然没有解决问题。

所需的输出,通过手动切片:

# Manually slicing desired output:

df_table_seq1 = df_train.iloc[0:3]  # Table1 sequence
df_table_seq2 = df_train.iloc[10:14] # Table1 seuqence
df_table_seq3 = df_train.iloc[20:23] # Table1 sequence

df_table_seq4 = df_train.iloc[3:5]  # Table2 sequence
df_table_seq5 = df_train.iloc[14:17] # Table2 seuqence
df_table_seq6 = df_train.iloc[23:24] # Table2 sequence

df_table_seq7 = df_train.iloc[5:7]  # Table3 sequence
df_table_seq8 = df_train.iloc[17:18] # Table3 seuqence
df_table_seq9 = df_train.iloc[24:27] # Table3 sequence


df_table_seq10 = df_train.iloc[7:10] # Table4 seuqence
df_table_seq11 = df_train.iloc[18:20] # Table4 sequence
df_table_seq12 = df_train.iloc[27:28] # Table4 sequence

df_final = df_table_seq1+df_table_seq2

df_final = pd.concat([df_table_seq1,df_table_seq2,df_table_seq3,df_table_seq4,df_table_seq5,df_table_seq6,df_table_seq7,df_table_seq8,df_table_seq9,df_table_seq10,df_table_seq11,df_table_seq12], sort = False)

输出应该是这样的:

    1         2          3      4      order
0   1234    Mayo         1    Table1    1
1   4321    Mango        2    Table1    1
2   4321    Mango        2    Table1    1
10  7426    Name         1    Table1    1
11  4422    Surname      1    Table1    1
12  4121    Location     1    Table1    1
13  8885    Carrots      1    Table1    1
20  5555    Rose         1    Table1    1
21  6321    Flower       2    Table1    1
22  6321    Flower       2    Table1    1
3   3241    Apple        1    Table2    2
4   5432    Kiwi         1    Table2    2
14  4231    Salad        1    Table2    2
15  5932    Water        2    Table2    2
16  5932    Water        2    Table2    2
23  9955    Paint        1    Table2    2
5   5821    Watermelon   1    Table3    3
6   4442    Orange       1    Table3    3
17  5555    Wine         1    Table3    3
24  3333    Van          3    Table3    3
25  3333    Van          3    Table3    3
26  3333    Van          3    Table3    3
7   4567    Peach        1    Table4    4
8   0952    Broccoli     2    Table4    4
9   0952    Broccoli     2    Table4    4
18  0532    Beer         1    Table4    4
19  1111    Cup          1    Table4    4
27  4132    Car          1    Table4    4

【问题讨论】:

  • df_train.sort_values(4, kind='mergesort')?
  • 似乎工作正常,我得到了想要的输出!谢谢!
  • 所以我正确理解了这个问题。这意味着您正在寻找稳定排序,而sort_values 的默认算法是快速排序,这是不稳定的。
  • 感谢澄清不知道默认设置,现在有意义。 @QuangHoang

标签: python pandas dataframe


【解决方案1】:

创建一个作为索引副本的辅助列,然后按第 4 列和辅助列排序

df['sort_helper'] = df.index
df.sort_values(['4', 'sort_helper'])

【讨论】:

  • 是的....我使用了df_train_r = df_train.reset_index()df_train_r.sort_values([4,'index']) +1 好答案
  • @mullinscr 谢谢! kind='mergesort' 也完成了它的工作。
猜你喜欢
  • 2021-09-17
  • 2023-01-10
  • 1970-01-01
  • 2019-04-12
  • 1970-01-01
  • 1970-01-01
  • 2014-12-29
  • 1970-01-01
  • 1970-01-01
相关资源
最近更新 更多