【Posted】: 2023-01-26 23:56:11
【Question】:
Consider the following DataFrame:
>>> df
   Start  End  Tiebreak
0      1    6  0.376600
1      5    7  0.050042
2     15   20  0.628266
3     10   15  0.984022
4     11   12  0.909033
5      4    8  0.531054
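Whenever the [Start, End] intervals of two rows overlap, I want to drop the row with the lower Tiebreak value. For this example, the result would be: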
>>> df
   Start  End  Tiebreak
2     15   20  0.628266
3     10   15  0.984022
5      4    8  0.531054
I have a double loop that does the job but is inefficient. Is there a way to do this with built-in functions that operate column-wise?
import pandas as pd
import numpy as np

# initial data (Tiebreak is random, so values differ from the sample above)
df = pd.DataFrame({
    'Start': [1, 5, 15, 10, 11, 4],
    'End': [6, 7, 20, 15, 12, 8],
    'Tiebreak': np.random.uniform(0, 1, 6)
})

# checking for overlaps
list_idx_drop = []
for i in range(len(df) - 1):
    for j in range(i + 1, len(df)):
        idx_1 = df.index[i]
        idx_2 = df.index[j]
        cond_1 = (df.loc[idx_1, 'Start'] < df.loc[idx_2, 'End'])
        cond_2 = (df.loc[idx_2, 'Start'] < df.loc[idx_1, 'End'])
        # if the two rows overlap
        if cond_1 and cond_2:
            tie_1 = df.loc[idx_1, 'Tiebreak']
            tie_2 = df.loc[idx_2, 'Tiebreak']
            # mark the row with the lower tiebreak value for deletion
            if tie_1 < tie_2:
                list_idx_drop.append(idx_1)
            else:
                list_idx_drop.append(idx_2)

# drop after the loops: dropping inside them shifts positions mid-iteration
df = df.drop(index=list(set(list_idx_drop)))
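For comparison, here is a minimal sketch of the same pairwise rule written with NumPy broadcasting instead of Python loops. It assumes every pair is compared against the original data, so a row that is itself dropped can still knock out another row, and it assumes the strict inequalities from the question, so intervals that only touch (such as [10, 15] and [15, 20]) do not count as overlapping. The function name `drop_overlapping` and the variable `example` are made up for illustration. The intermediate arrays are n × n, so this trades memory for speed and may not suit very large frames.

```python
import numpy as np
import pandas as pd


def drop_overlapping(df: pd.DataFrame) -> pd.DataFrame:
    """Drop every row that loses a tiebreak against an overlapping row."""
    start = df["Start"].to_numpy()
    end = df["End"].to_numpy()
    tie = df["Tiebreak"].to_numpy()
    pos = np.arange(len(df))

    # overlaps[i, j] is True when the intervals of rows i and j overlap.
    overlaps = (start[:, None] < end[None, :]) & (start[None, :] < end[:, None])

    # loses[i, j] is True when row i should be dropped in favour of row j:
    # a strictly lower tiebreak, or an exact tie where i comes later
    # (mirroring the `else` branch of the double loop).
    later = pos[:, None] > pos[None, :]
    loses = (tie[:, None] < tie[None, :]) | (later & (tie[:, None] == tie[None, :]))

    # A row is dropped if it loses to at least one overlapping row.
    drop = (overlaps & loses).any(axis=1)
    return df[~drop]


# The sample data from the question, with the Tiebreak values fixed.
example = pd.DataFrame({
    "Start": [1, 5, 15, 10, 11, 4],
    "End": [6, 7, 20, 15, 12, 8],
    "Tiebreak": [0.376600, 0.050042, 0.628266, 0.984022, 0.909033, 0.531054],
})
result = drop_overlapping(example)
print(result)  # keeps the rows with index 2, 3 and 5
```

If the frame is too large for n × n arrays, sorting by Start and comparing only neighbouring candidates may be worth exploring, but that is a different algorithm and is not shown here.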
【Discussion】:
Tags: python pandas dataframe for-loop coding-efficiency