【发布时间】:2021-02-03 15:35:29
【问题描述】:
想法如下;整个数据框应按第 4 列排序,同时在按 df_train[4] 排序时保持每一行中的确切信息序列(例如,'Table1'、'Table1'、'Table1'、'Table2'、'Table2'、'表3','表3', '表4)
尝试和输出:
import pandas as pd
from pandas import np
df_train = pd.DataFrame({1: ['1234', '4321','4321', '3241', '5432','5821', '4442', '4567', '0952','0952','7426','4422','4121','8885','4231','5932','5932','5555','0532','1111','5555','6321','6321','9955','3333','3333','3333','4132']})
df_train[2] = ['Mayo', 'Mango','Mango','Apple','Kiwi','Watermelon','Orange','Peach','Broccoli','Broccoli','Name','Surname','Location','Carrots','Salad','Water','Water','Wine','Beer','Cup','Rose','Flower','Flower','Paint','Van','Van','Van','Car']
df_train[3] = ['1','2','2','1','1','1','1','1','2','2','1','1','1','1','1','2','2','1','1','1','1','2','2','1','3','3','3','1']
df_train[4] = ['Table1','Table1','Table1','Table2','Table2','Table3','Table3','Table4','Table4','Table4','Table1','Table1','Table1','Table1','Table2','Table2','Table2','Table3','Table4','Table4','Table1','Table1','Table1','Table2','Table3','Table3','Table3','Table4']
output:
1 2 3 4
0 1234 Mayo 1 Table1
1 4321 Mango 2 Table1
2 4321 Mango 2 Table1
3 3241 Apple 1 Table2
4 5432 Kiwi 1 Table2
5 5821 Watermelon 1 Table3
6 4442 Orange 1 Table3
7 4567 Peach 1 Table4
8 0952 Broccoli 2 Table4
9 0952 Broccoli 2 Table4
10 7426 Name 1 Table1
11 4422 Surname 1 Table1
12 4121 Location 1 Table1
13 8885 Carrots 1 Table1
14 4231 Salad 1 Table2
15 5932 Water 2 Table2
16 5932 Water 2 Table2
17 5555 Wine 1 Table3
18 0532 Beer 1 Table4
19 1111 Cup 1 Table4
20 5555 Rose 1 Table1
21 6321 Flower 2 Table1
22 6321 Flower 2 Table1
23 9955 Paint 1 Table2
24 3333 Van 3 Table3
25 3333 Van 3 Table3
26 3333 Van 3 Table3
27 4132 Car 1 Table4
尝试 1 和 2:
df_train2 = df_train.sort_values(4, ignore_index=True) #
df_train2 = df_train.sort_values(4, ignore_index=False) #same issue
输出按所有目标列 4 排序,但无论索引如何,顺序都是相当随意的:
1 2 3 4
0 1234 Mayo 1 Table1
1 6321 Flower 2 Table1 -> should be: 4321, Mango, 2, Table1
2 6321 Flower 2 Table1 -> should be: 4321, Mango, 2, Table1
3 5555 Rose 1 Table1 -> should be: 7426, Name, 1, Table1
4 4121 Location 1 Table1 -> should be: 4422, Surname, 1, Table1
5 4422 Surname 1 Table1 -> should be: 4121, Location, 1, Table1
6 7426 Name 1 Table1 ...etc
7 8885 Carrots 1 Table1
8 4321 Mango 2 Table1
9 4321 Mango 2 Table1
10 5432 Kiwi 1 Table2
11 4231 Salad 1 Table2
12 5932 Water 2 Table2
13 5932 Water 2 Table2
14 9955 Paint 1 Table2
15 3241 Apple 1 Table2
16 5555 Wine 1 Table3
17 3333 Van 3 Table3
18 4442 Orange 1 Table3
19 5821 Watermelon 1 Table3
20 3333 Van 3 Table3
21 3333 Van 3 Table3
22 0952 Broccoli 2 Table4
23 0532 Beer 1 Table4
24 4567 Peach 1 Table4
25 0952 Broccoli 2 Table4
尝试 3 和 4:
# Attempt 3:
df_train3 = df_train.copy()
df_train3.loc[3:] = df_train3.loc[3:].sort_values(4).set_index(df_train3.loc[3:].index)
即使从第 3 行切片也没有得到所需的输出
# Attempt 4:
isolate_digit = []
for indx,row in df_train.iterrows():
isolate_digit.append( int(''.join(filter(str.isdigit,row[4]))))
df_train['order'] = isolate_digit
甚至尝试从第 4 列中分离出它是整数,可能是因为它是一个字符串('Table1'..etc),因此 alg 没有保留顺序,仍然没有解决问题。
所需的输出,通过手动切片:
# Manually slicing desired output:
df_table_seq1 = df_train.iloc[0:3] # Table1 sequence
df_table_seq2 = df_train.iloc[10:14] # Table1 seuqence
df_table_seq3 = df_train.iloc[20:23] # Table1 sequence
df_table_seq4 = df_train.iloc[3:5] # Table2 sequence
df_table_seq5 = df_train.iloc[14:17] # Table2 seuqence
df_table_seq6 = df_train.iloc[23:24] # Table2 sequence
df_table_seq7 = df_train.iloc[5:7] # Table3 sequence
df_table_seq8 = df_train.iloc[17:18] # Table3 seuqence
df_table_seq9 = df_train.iloc[24:27] # Table3 sequence
df_table_seq10 = df_train.iloc[7:10] # Table4 seuqence
df_table_seq11 = df_train.iloc[18:20] # Table4 sequence
df_table_seq12 = df_train.iloc[27:28] # Table4 sequence
df_final = df_table_seq1+df_table_seq2
df_final = pd.concat([df_table_seq1,df_table_seq2,df_table_seq3,df_table_seq4,df_table_seq5,df_table_seq6,df_table_seq7,df_table_seq8,df_table_seq9,df_table_seq10,df_table_seq11,df_table_seq12], sort = False)
输出应该是这样的:
1 2 3 4 order
0 1234 Mayo 1 Table1 1
1 4321 Mango 2 Table1 1
2 4321 Mango 2 Table1 1
10 7426 Name 1 Table1 1
11 4422 Surname 1 Table1 1
12 4121 Location 1 Table1 1
13 8885 Carrots 1 Table1 1
20 5555 Rose 1 Table1 1
21 6321 Flower 2 Table1 1
22 6321 Flower 2 Table1 1
3 3241 Apple 1 Table2 2
4 5432 Kiwi 1 Table2 2
14 4231 Salad 1 Table2 2
15 5932 Water 2 Table2 2
16 5932 Water 2 Table2 2
23 9955 Paint 1 Table2 2
5 5821 Watermelon 1 Table3 3
6 4442 Orange 1 Table3 3
17 5555 Wine 1 Table3 3
24 3333 Van 3 Table3 3
25 3333 Van 3 Table3 3
26 3333 Van 3 Table3 3
7 4567 Peach 1 Table4 4
8 0952 Broccoli 2 Table4 4
9 0952 Broccoli 2 Table4 4
18 0532 Beer 1 Table4 4
19 1111 Cup 1 Table4 4
27 4132 Car 1 Table4 4
【问题讨论】:
-
df_train.sort_values(4, kind='mergesort')? -
似乎工作正常,我得到了想要的输出!谢谢!
-
所以我正确理解了这个问题。这意味着您正在寻找稳定排序,而
sort_values的默认算法是快速排序,这是不稳定的。 -
感谢澄清不知道默认设置,现在有意义。 @QuangHoang