【Question title】: Pandas: add rows to each group until condition is met
【Posted】: 2020-07-19 21:28:07
【Question】:

I have a time-series dataframe with the following structure:

| ID | second | speaker1 | speaker2 | company | ... |
|----|--------|----------|----------|---------|-----|
|  A |    1   |     1    |     1    |  name1  |     |
|  A |    2   |     1    |     1    |  name1  |     |
|  A |    3   |     1    |     1    |  name1  |     |
|  B |    1   |     1    |     1    |  name2  |     |
|  B |    2   |     1    |     1    |  name2  |     |
|  B |    3   |     1    |     1    |  name2  |     |
|  B |    4   |     1    |     1    |  name2  |     |
|  C |    1   |     1    |     1    |  name3  |     |
|  C |    2   |     1    |     1    |  name3  |     |

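*Note that speaker1 and speaker2 can each be either 0 or 1; I set them all to 1 here for clarity.

I want to add rows to each group until every group has the same number of rows (where that number is the row count of the ID with the most rows).

For each new row, I want to fill the speaker1 and speaker2 columns with 0 while keeping the values in that ID's other columns the same.

So the output should be: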

| ID | second | speaker1 | speaker2 | company | ... |
|:--:|:------:|:--------:|:--------:|:-------:|:---:|
|  A |    1   |     1    |     1    |  name1  |     |
|  A |    2   |     1    |     1    |  name1  |     |
|  A |    3   |     1    |     1    |  name1  |     |
|  A |    4   |     0    |     0    |  name1  |     |
|  B |    1   |     1    |     1    |  name2  |     |
|  B |    2   |     1    |     1    |  name2  |     |
|  B |    3   |     1    |     1    |  name2  |     |
|  B |    4   |     1    |     1    |  name2  |     |
|  C |    1   |     1    |     1    |  name3  |     |
|  C |    2   |     1    |     1    |  name3  |     |
|  C |    3   |     0    |     0    |  name3  |     |
|  C |    4   |     0    |     0    |  name3  |     |

So far I have tried groupby with apply, but found it very slow because this dataframe has many rows and columns.

import pandas as pd

def add_rows_sec(w):
    'input: dataframe for one ID group, output: dataframe with rows added until max call length'
    max_second = df['second'].max()  # longest duration in the full data set
    while w['second'].max() < max_second:  # keep padding until this group is as long
        last_row = w.iloc[-1].copy()  # copy so the original group is not mutated
        last_row['second'] += 1
        last_row['speaker1'] = 0
        last_row['speaker2'] = 0
        w = pd.concat([w, pd.DataFrame([last_row])])  # DataFrame.append was removed in pandas 2.0
    return w

df.groupby('ID').apply(add_rows_sec).reset_index(drop=True)

Is there a way to do this with numpy? Something like:

condition = w['second'].max() < df['second'].max()
choice = pd.Series([w.ID, w.second + 1, 0, 0, w.company...])
df = np.select(condition, choice, default = np.nan)

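Any help is much appreciated!

【Question comments】:

  • What does this mean? "I want to add rows for each unique ID until every ID has the number of rows equal to the ID with the most rows."
  • Basically, just add rows to each group until every group has the same number of rows (where that number is the row count of the ID with the most rows).

Tags: python pandas numpy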


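【Solution 1】:

Another approach with pandas:

  1. Construct a dataframe that is the Cartesian product of ID and second
  2. Outer join it back to the original dataframe
  3. Fill the missing values according to your specification

No groupby(), no loops.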

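import pandas as pd

df = pd.DataFrame({"ID":["A","A","A","B","B","B","B","C","C"],"second":[1,2,3,1,2,3,4,1,2],"speaker1":[1,1,1,1,1,1,1,1,1],"speaker2":[1,1,1,1,1,1,1,1,1],"company":["name1","name1","name1","name2","name2","name2","name2","name3","name3"]})

# Cartesian product of ID and second (the dummy key foo pairs every ID with every second),
# then an outer join back to the original data
df2 = pd.DataFrame({"ID":df["ID"].unique()}).assign(foo=1).merge(\
    pd.DataFrame({"second":df["second"].unique()}).assign(foo=1)).drop(columns="foo")\
    .merge(df, on=["ID","second"], how="outer")

# Rows are ordered by ID and second, so a forward fill carries each ID's company into its new rows
df2["company"] = df2["company"].ffill()
# Remaining gaps are the speaker columns of the new rows; cast back to int after filling
df2.fillna(0).astype({"speaker1": int, "speaker2": int})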

Output

   ID  second  speaker1  speaker2 company
0   A       1         1         1   name1
1   A       2         1         1   name1
2   A       3         1         1   name1
3   A       4         0         0   name1
4   B       1         1         1   name2
5   B       2         1         1   name2
6   B       3         1         1   name2
7   B       4         1         1   name2
8   C       1         1         1   name3
9   C       2         1         1   name3
10  C       3         0         0   name3
11  C       4         0         0   name3
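
The same Cartesian-product idea can also be sketched with a MultiIndex and reindex instead of two merges. This is only an illustrative variant, not part of the original answer. It assumes that second is an integer column starting at 1 for every ID, that each (ID, second) pair is unique, and that every column other than speaker1 and speaker2 is constant within an ID. The names full_index, padded, speaker_cols and other_cols are made up for this example. The per-ID forward fill uses the built-in groupby(...).ffill() rather than a Python-level apply.

```python
import pandas as pd

df = pd.DataFrame({
    "ID": ["A", "A", "A", "B", "B", "B", "B", "C", "C"],
    "second": [1, 2, 3, 1, 2, 3, 4, 1, 2],
    "speaker1": [1] * 9,
    "speaker2": [1] * 9,
    "company": ["name1"] * 3 + ["name2"] * 4 + ["name3"] * 2,
})

# Every (ID, second) pair that should exist in the padded frame.
full_index = pd.MultiIndex.from_product(
    [df["ID"].unique(), range(1, df["second"].max() + 1)],
    names=["ID", "second"],
)

# Reindexing inserts the missing pairs as all-NaN rows.
padded = df.set_index(["ID", "second"]).reindex(full_index)

# Speaker columns of the new rows become 0; cast back to int because the
# temporary NaN values turned them into floats.
speaker_cols = ["speaker1", "speaker2"]
padded[speaker_cols] = padded[speaker_cols].fillna(0).astype(int)

# The remaining columns are assumed constant per ID, so forward-fill them
# within each ID only (values never leak from one ID into the next).
other_cols = list(padded.columns.difference(speaker_cols))
padded[other_cols] = padded.groupby(level="ID")[other_cols].ffill()

padded = padded.reset_index()
print(padded)
```

Filling within each ID avoids relying on row order, which a plain forward fill over the whole column would do.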

【Comments】:
