【Posted】: 2020-07-19 21:28:07
【Question】:
I have a time-series DataFrame with the following structure:
| ID | second | speaker1 | speaker2 | company | ... |
|----|--------|----------|----------|---------|-----|
| A | 1 | 1 | 1 | name1 | |
| A | 2 | 1 | 1 | name1 | |
| A | 3 | 1 | 1 | name1 | |
| B | 1 | 1 | 1 | name2 | |
| B | 2 | 1 | 1 | name2 | |
| B | 3 | 1 | 1 | name2 | |
| B | 4 | 1 | 1 | name2 | |
| C | 1 | 1 | 1 | name3 | |
| C | 2 | 1 | 1 | name3 | |
*Note: speaker1 and speaker2 can each be either 0 or 1; I set them all to 1 here for clarity.
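
For reproducibility, the sample table can be built roughly as follows. This is only a sketch: the elided `...` columns are left out, and the name `rows_per_id` is introduced here for illustration.

```python
import pandas as pd

# Sample data mirroring the table above; the elided "..." columns are omitted.
df = pd.DataFrame({
    "ID": ["A"] * 3 + ["B"] * 4 + ["C"] * 2,
    "second": [1, 2, 3, 1, 2, 3, 4, 1, 2],
    "speaker1": 1,  # scalars are broadcast to every row
    "speaker2": 1,
    "company": ["name1"] * 3 + ["name2"] * 4 + ["name3"] * 2,
})

# Rows per ID; the largest count is the length every group should reach.
rows_per_id = df.groupby("ID").size()
target_len = rows_per_id.max()
```
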
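I want to add rows to each group until every group has the same number of rows (where that number is the row count of the ID with the most rows).
For each new row, I want to fill the `speaker1` and `speaker2` columns with 0 while keeping the values in the other columns the same as for that ID.
So the output should be: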
| ID | second | speaker1 | speaker2 | company | ... |
|:--:|:------:|:--------:|:--------:|:-------:|:---:|
| A | 1 | 1 | 1 | name1 | |
| A | 2 | 1 | 1 | name1 | |
| A | 3 | 1 | 1 | name1 | |
| A | 4 | 0 | 0 | name1 | |
| B | 1 | 1 | 1 | name2 | |
| B | 2 | 1 | 1 | name2 | |
| B | 3 | 1 | 1 | name2 | |
| B | 4 | 1 | 1 | name2 | |
| C | 1 | 1 | 1 | name3 | |
| C | 2 | 1 | 1 | name3 | |
| C | 3 | 0 | 0 | name3 | |
| C | 4 | 0 | 0 | name3 | |
So far I have tried `groupby` with `apply`, but it is very slow because this DataFrame has a lot of rows and columns.

    import pandas as pd

    def add_rows_sec(w):
        """Input: the rows of one ID. Output: those rows padded up to the max call length."""
        # Keep adding rows while this ID is shorter than the longest ID in the full data set
        while w['second'].max() < df['second'].max():
            last_row = w.iloc[[-1]].copy()  # one-row DataFrame; copy so w is not mutated
            last_row['second'] += 1
            last_row['speaker1'] = 0
            last_row['speaker2'] = 0
            # DataFrame.append was removed in pandas 2.0, so use pd.concat
            w = pd.concat([w, last_row])
        return w

    df = df.groupby('ID').apply(add_rows_sec).reset_index(drop=True)

Is there a way to do this with numpy? Something like:

    condition = w['second'].max() < df['second'].max()
    choice = pd.Series([w.ID, w.second + 1, 0, 0, w.company...])
    df = np.select(condition, choice, default = np.nan)

Any help is much appreciated!
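
A note on the numpy idea: `np.select` picks among candidate values element by element for rows that already exist, so by itself it cannot create new rows. A NumPy-based padding could instead repeat each ID's last row with `np.repeat`. The sketch below is one possible approach, not a definitive implementation. The helper name `pad_groups_numpy` and its parameters are hypothetical, and it assumes that `second` runs from 1 upward without gaps within each ID.

```python
import numpy as np
import pandas as pd

# Sample data from the question (the elided "..." columns are omitted).
df = pd.DataFrame({
    "ID": ["A"] * 3 + ["B"] * 4 + ["C"] * 2,
    "second": [1, 2, 3, 1, 2, 3, 4, 1, 2],
    "speaker1": 1,
    "speaker2": 1,
    "company": ["name1"] * 3 + ["name2"] * 4 + ["name3"] * 2,
})

def pad_groups_numpy(df, id_col="ID", time_col="second",
                     fill_cols=("speaker1", "speaker2")):
    """Hypothetical helper: pad every ID up to the largest value of time_col.

    Assumes time_col is gap-free (1, 2, ..., n) within each ID.
    """
    max_sec = df[time_col].max()
    # Last existing row of each ID (the row with the largest time_col).
    last = df.sort_values([id_col, time_col]).drop_duplicates(id_col, keep="last")
    last_sec = last[time_col].to_numpy()
    n_missing = max_sec - last_sec            # rows to add per ID

    # Repeat each ID's last row once per missing second.
    rep = np.repeat(np.arange(len(last)), n_missing)
    pad = last.iloc[rep].copy()

    # Offset of each new row within its own ID's padding: 1, 2, ...
    starts = np.cumsum(n_missing) - n_missing
    within = np.arange(n_missing.sum()) - starts[rep] + 1
    pad[time_col] = last_sec[rep] + within
    pad[list(fill_cols)] = 0                  # new rows get 0 for the speakers

    out = pd.concat([df, pad], ignore_index=True)
    return out.sort_values([id_col, time_col], ignore_index=True)

padded = pad_groups_numpy(df)
```

All other columns (such as `company`) are copied from the ID's last row, so no per-group Python loop is needed.
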
【Comments】:
-
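What does this mean?
"I want to add rows for each unique ID until every ID has the number of rows equal to the ID with the most rows."
-
Basically, just add rows to each group until every group has the same number of rows (where that number is the row count of the ID with the most rows).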
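
To make that explanation concrete, here is a minimal pandas sketch that pads every group by reindexing onto a full (ID, second) grid. It assumes that each (ID, second) pair is unique and that `second` starts at 1 with no gaps, so the missing rows always fall at the end of a group. The names `full_index`, `speaker_cols`, and `other_cols` are introduced here for illustration.

```python
import pandas as pd

# Sample data from the question (the elided "..." columns are omitted).
df = pd.DataFrame({
    "ID": ["A"] * 3 + ["B"] * 4 + ["C"] * 2,
    "second": [1, 2, 3, 1, 2, 3, 4, 1, 2],
    "speaker1": 1,
    "speaker2": 1,
    "company": ["name1"] * 3 + ["name2"] * 4 + ["name3"] * 2,
})

# Full grid: every ID paired with every second from 1 to the overall maximum.
seconds = range(1, int(df["second"].max()) + 1)
full_index = pd.MultiIndex.from_product(
    [df["ID"].unique(), seconds], names=["ID", "second"]
)

# Reindexing creates the missing rows, filled with NaN.
padded = df.set_index(["ID", "second"]).reindex(full_index)

# Speaker columns: new rows get 0.
speaker_cols = ["speaker1", "speaker2"]
padded[speaker_cols] = padded[speaker_cols].fillna(0).astype(int)

# Every other column: carry the ID's existing value forward into the new rows.
other_cols = [c for c in padded.columns if c not in speaker_cols]
padded[other_cols] = padded.groupby(level="ID")[other_cols].ffill()

padded = padded.reset_index()
```

With the sample data, `padded` has four rows per ID, matching the desired output table in the question.
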