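I assume you want to combine all ranges, so that all overlapping ranges are reduced to a single row. I think you need to do this recursively (or rather iteratively, as below), because several ranges can chain together into one large range, not just two. You could do it like this (just replace df with the variable you use to store your dataframe):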
import pandas as pd

# create a dummy key column to produce a cartesian product
df['fake_key'] = 0
right_df = pd.DataFrame(df, copy=True)
right_df.rename({col: col + '_r' for col in right_df if col != 'fake_key'}, axis='columns', inplace=True)
# this variable indicates that we need to perform the loop once more
change = True
# diff and new_diff are used to see if the loop iteration changed something
# (the summed range length is monotonically increasing, btw.)
new_diff = (right_df['se_r'] - right_df['st_r']).sum()
while change:
    diff = new_diff
    joined_df = df.merge(right_df, on='fake_key')
    # drop pairs where the right range starts after the left range ends
    invalid_indexer = joined_df['se'] < joined_df['st_r']
    joined_df.drop(joined_df[invalid_indexer].index, axis='index', inplace=True)
    right_df = joined_df.groupby('st').aggregate({col: 'max' if '_min' not in col else 'min' for col in joined_df})
    # update the ..._min / ..._max fields in the combined range
    for col in ['st_min', 'se_min', 'st_max', 'se_max']:
        col_r = col + '_r'
        col1, col2 = (col, col_r) if 'min' in col else (col_r, col)
        right_df[col_r] = right_df[col1].where(right_df[col1] <= right_df[col2], right_df[col2])
    right_df.drop(['se', 'st_r', 'st_min', 'se_min', 'st_max', 'se_max'], axis='columns', inplace=True)
    right_df.rename({'st': 'st_r'}, axis='columns', inplace=True)
    right_df['fake_key'] = 0
    # now check if we need to iterate once more:
    # stop as soon as the summed range length no longer grows
    # (with <= the loop would never end, because the sum never shrinks)
    new_diff = (right_df['se_r'] - right_df['st_r']).sum()
    change = diff < new_diff
# now all ranges which overlap have the same value for se_r
# so we just need to aggregate on se_r to remove them
result = right_df.groupby('se_r').aggregate({col: 'min' if '_max' not in col else 'max' for col in right_df})
result.rename({col: col[:-2] if col.endswith('_r') else col for col in result}, axis='columns', inplace=True)
result.drop('fake_key', axis='columns', inplace=True)
If you run this on your data, you get:
            st      se  st_min  st_max  se_min  se_max
se_r
923190  922444  923190  922434  922455  923180  923200
929459  928718  929459  922434  928728  923180  929469
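To illustrate why a single pass is not enough, here is a stripped-down sketch of the same loop that keeps only the st and se columns. The three ranges are made up for illustration: A overlaps B and B overlaps C, but A does not overlap C. Unlike the code above, this sketch stops when the se_r values no longer change rather than comparing summed lengths; that is a simplification on my part.

```python
import pandas as pd

# Hypothetical chain of ranges: A=(1,5), B=(4,9), C=(8,12)
df = pd.DataFrame({'st': [1, 4, 8], 'se': [5, 9, 12]})
df['fake_key'] = 0
right_df = df.rename(columns={'st': 'st_r', 'se': 'se_r'})

# se_r of every range after each pass, starting with the input
history = [right_df['se_r'].tolist()]
while True:
    joined = df.merge(right_df, on='fake_key')
    # keep only pairs where the right range starts before the left one ends
    joined = joined[joined['se'] >= joined['st_r']]
    # furthest end reachable from each start in this pass
    reach = joined.groupby('st')['se_r'].max().reset_index()
    reach = reach.rename(columns={'st': 'st_r'})
    reach['fake_key'] = 0
    if reach['se_r'].tolist() == history[-1]:
        break  # nothing changed, so we reached the fixed point
    history.append(reach['se_r'].tolist())
    right_df = reach

print(history)
```

After the first pass A only reaches the end of B (9); it takes a second pass for A to pick up the end of C (12), at which point all three ranges share the same se_r.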
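Please note that if your dataset is larger than a few thousand records, you may need to change the join logic above, which produces a cartesian product. In the first iteration you get a joined_df of size n^2, where n is the number of records in your input dataframe. In later iterations joined_df gets smaller because of the aggregation.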
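A quick way to see the n^2 growth is to join a small frame with itself on a constant key. The toy values below are made up; only the fake_key trick and the se/st_r filter are taken from the code above.

```python
import pandas as pd

# Hypothetical toy frame with n = 4 ranges
toy = pd.DataFrame({'st': [1, 5, 20, 22], 'se': [6, 9, 25, 30]})
toy['fake_key'] = 0
right = toy.rename(columns={'st': 'st_r', 'se': 'se_r'})

# joining on a constant key pairs every row with every row
pairs = toy.merge(right, on='fake_key')

# the filter from the loop removes only part of these pairs afterwards
kept = pairs[pairs['se'] >= pairs['st_r']]

print(len(toy), len(pairs), len(kept))
```

The intermediate frame has 16 rows for 4 input rows, and the full product is materialized before the filter gets a chance to shrink it.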
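I simply ignored this because I don't know how large your dataset is. Avoiding it would make the code a bit more complex. But if you need it, you could create a helper dataframe that lets you "bin" the se values on both dataframes and use the binned value as fake_key. It is not quite regular binning: you have to create a dataframe that contains, for each fake_key, all values in the range (0...fake_key). So, for example, if you define your fake key as fake_key=se//1000, your dataframe would contain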
fake_key  fake_key_join
922       922
922       921
922       920
...       ...
922       0
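If you replace the merge in the loop above with code that merges such a dataframe on fake_key with right_df and the result on fake_key_join with df, you can use the rest of the code and get the same result as above, but without having to produce a full cartesian product.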
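The following is a minimal sketch of one possible reading of this idea, not a drop-in replacement. The toy ranges and the name BIN are made up, and which frame carries which key is my assumption: here df is binned by se, right_df is binned by st_r, and the helper frame links each se bin to all bins from 0 up to itself. The filter on se and st_r is still needed, because ranges in the same bin can fail it.

```python
import pandas as pd

BIN = 1000  # bin width, as in the fake_key = se // 1000 example

# Hypothetical toy ranges
df = pd.DataFrame({'st': [100, 900, 2500, 5200], 'se': [1200, 1800, 3100, 5300]})
right_df = df.add_suffix('_r')

# bin the left ranges by their end and the right ranges by their start
df['fake_key'] = df['se'] // BIN
right_df['fake_key_join'] = right_df['st_r'] // BIN

# helper frame: every fake_key is paired with all bins 0..fake_key
helper = pd.DataFrame(
    [(key, join_key)
     for key in sorted(set(df['fake_key'].tolist()))
     for join_key in range(key, -1, -1)],
    columns=['fake_key', 'fake_key_join'],
)

# two equi-joins replace the single join on a constant key
candidates = df.merge(helper, on='fake_key').merge(right_df, on='fake_key_join')
binned = candidates[candidates['se'] >= candidates['st_r']]

# reference: full cartesian product followed by the same filter
full = df.assign(k=0).merge(right_df.assign(k=0), on='k')
full = full[full['se'] >= full['st_r']]

same_pairs = set(zip(binned['st'], binned['st_r'])) == set(zip(full['st'], full['st_r']))
print(len(candidates), same_pairs)
```

On this toy input the binned join produces 11 candidate pairs instead of 16, and the surviving pairs are the same as with the cartesian product. How much this saves in practice depends on the bin width and on how the ranges are distributed.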