【Posted】: 2020-12-15 12:15:36
【Question】:
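I have a time-series DataFrame with several columns, each containing NaNs at positions that are independent of the other columns.
I have a given length "LEN": the minimum number of valid elements that every sequence should contain in each column. (By "sequence" I mean the values collected from the preceding indices.)
Iterating is extremely time-inefficient, but it looks similar to this: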
LEN = 100

def get_seq_start_where_every_col_has_enough_valids(df, seq_end_ix, LEN):
    # determine the index where every column contains at least "LEN" valid elements
    # (returns None while some column has fewer than LEN valid elements so far)
    first_SEQ_LEN_Sample_start_ix = seq_end_ix
    for col in df.columns:
        col_df = df[col].dropna()
        valid_ix = col_df[col_df.index <= seq_end_ix].index
        if len(valid_ix) < LEN:
            return None
        temp = valid_ix[-LEN]
        if temp < first_SEQ_LEN_Sample_start_ix:
            first_SEQ_LEN_Sample_start_ix = temp
    seq_start_ix = first_SEQ_LEN_Sample_start_ix
    return seq_start_ix

# df is the time-series DataFrame described above
maximum_sequence_len = 0
for seq_end_ix in df.index:  # for every index
    seq_start_ix = get_seq_start_where_every_col_has_enough_valids(
        df, seq_end_ix, LEN)
    if seq_start_ix is None:
        continue  # not enough valid elements in every column yet
    necessary_len = len(df.loc[seq_start_ix:seq_end_ix])
    if maximum_sequence_len < necessary_len:
        maximum_sequence_len = necessary_len
An example:
LEN = 6 # in this example every column needs at least 6 valid elements within the window of preceding rows
print(df)
>>>>
         A    B    C    D    E      F
index
0        1    1    1    1    1      1
1        1    1    1    1    1      1
2        1    1    1    1    1  |   1
3      NaN    1    1  NaN    1  |   1
4      NaN    1    1  NaN    1  |   1
5        1    1    1    1    1  |   1
6        1    1    1    1  NaN  |   1
7      NaN    1    1  NaN    1  |   1
8      NaN    1    1    1    1  |   1
9        1    1    1    1  NaN  |   1
10       1    1    1    1  NaN  |   1
11       1    1    1  NaN  NaN  |   1
12       1    1    1    1  NaN  |   1
13       1    1    1    1  NaN  |   1
14       1  NaN    1    1  NaN  |*  1
16       1    1    1    1    1    NaN
17     NaN    1    1    1    1      1
18     NaN    1    1    1    1    NaN
19       1    1    1    1    1      1
# ==> Result: 13
# *Here, the longest sequence needed to get at least 6 valid elements in EVERY column has a length of 13 (rows 2 to 14, driven by column E). Note that if the other columns contained more NaNs in the marked rows, it would probably have taken more than 13.
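To make the example reproducible, here is a minimal sketch that rebuilds the frame above and checks the marked window ending at index 14. The names `rows`, `starts` and `window_len` are only illustrative and not part of the original code.

```python
import numpy as np
import pandas as pd

n = np.nan
rows = [
    [1, 1, 1, 1, 1, 1],  # 0
    [1, 1, 1, 1, 1, 1],  # 1
    [1, 1, 1, 1, 1, 1],  # 2
    [n, 1, 1, n, 1, 1],  # 3
    [n, 1, 1, n, 1, 1],  # 4
    [1, 1, 1, 1, 1, 1],  # 5
    [1, 1, 1, 1, n, 1],  # 6
    [n, 1, 1, n, 1, 1],  # 7
    [n, 1, 1, 1, 1, 1],  # 8
    [1, 1, 1, 1, n, 1],  # 9
    [1, 1, 1, 1, n, 1],  # 10
    [1, 1, 1, n, n, 1],  # 11
    [1, 1, 1, 1, n, 1],  # 12
    [1, 1, 1, 1, n, 1],  # 13
    [1, n, 1, 1, n, 1],  # 14
    [1, 1, 1, 1, 1, n],  # 16 (index 15 is absent in the example)
    [n, 1, 1, 1, 1, 1],  # 17
    [n, 1, 1, 1, 1, n],  # 18
    [1, 1, 1, 1, 1, 1],  # 19
]
index = list(range(15)) + [16, 17, 18, 19]
df = pd.DataFrame(rows, index=index, columns=list("ABCDEF"))
df.index.name = "index"

LEN = 6
seq_end_ix = 14

# Per column: label of the LEN-th most recent valid entry at or before seq_end_ix.
starts = {
    col: int(df.loc[:seq_end_ix, col].dropna().index[-LEN])
    for col in df.columns
}
seq_start_ix = min(starts.values())   # column E forces the earliest start
window_len = len(df.loc[seq_start_ix:seq_end_ix])

print(starts)
print(seq_start_ix, window_len)       # 2 13
```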
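The problem is that I want to create sequence samples, but I don't know how long they need to be so that every sample has at least "LEN" valid elements in every column.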
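One possible way to get this length without looping over every index and column in Python is sketched below. It is not a tested answer from the thread, just an illustration under two assumptions: the index is unique and sorted, so window length can be counted in row positions, and `min_valid` is at least 1. The function name `max_window_len` and the `demo` frame are hypothetical.

```python
import numpy as np
import pandas as pd

def max_window_len(df, min_valid):
    """Longest window (in rows) needed so that every column holds at least
    `min_valid` non-NaN values, taken over all possible end rows."""
    if min_valid < 1:
        raise ValueError("min_valid must be >= 1")
    valid = df.notna().to_numpy()            # shape (n_rows, n_cols)
    n_rows, n_cols = valid.shape
    counts = valid.cumsum(axis=0)            # valid values up to and including each row
    starts = np.full((n_rows, n_cols), -1)   # -1 marks "not enough valid values yet"
    for j in range(n_cols):                  # loop over columns only, not rows
        valid_pos = np.flatnonzero(valid[:, j])
        k = counts[:, j] - min_valid         # position in valid_pos of the window start
        ok = k >= 0
        starts[ok, j] = valid_pos[k[ok]]
    feasible = (starts >= 0).all(axis=1)     # rows where every column has enough values
    if not feasible.any():
        return 0
    window_start = starts.min(axis=1)        # earliest start demanded by any column
    lengths = np.arange(n_rows) - window_start + 1
    return int(lengths[feasible].max())

# Tiny hypothetical frame to show the call.
demo = pd.DataFrame({"x": [1, np.nan, 1, 1], "y": [1, 1, np.nan, 1]})
demo_result = max_window_len(demo, 2)
print(demo_result)  # 3
```

The cumulative count tells, for each end row, which of the column's valid positions is the `min_valid`-th most recent one, so the remaining Python loop runs once per column rather than once per row and column.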
【Comments】:
- Can you use df.dropna?
- No. If we drop the NaNs, the values in the same rows get dropped as well :( @inspectorG4dget
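The difference the reply refers to can be seen on a small made-up frame (the data here is purely illustrative): `DataFrame.dropna()` removes a whole row as soon as any column is NaN, whereas dropping NaNs per column keeps each column's own valid values.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"A": [1.0, np.nan, 1.0], "B": [1.0, 1.0, np.nan]})

# Row-wise: rows 1 and 2 each contain one NaN, so both rows are removed,
# including the valid value in the other column.
dropped = df.dropna()

# Column-wise: every column keeps all of its own valid values.
per_column = {col: df[col].dropna() for col in df.columns}

print(len(dropped))                   # 1
print(list(per_column["A"].index))    # [0, 2]
print(list(per_column["B"].index))    # [0, 1]
```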
Tags: python pandas dataframe indexing sequence