【Question title】: Dask Filter Dataframe on Multi-Column Groupby Size
【Posted】: 2019-02-15 06:55:08
【Question】:

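Goal: group a Dask dataframe by multiple columns and filter out the groups that contain fewer than 3 rows.

Following this post: Filtering grouped df in Dask

I can compute the size of each group, but I cannot figure out how to map the result of a multi-column groupby back onto my dataframe. I tried several variations of the following, none of which worked: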

a = input_df.groupby(["FeatureID", "region"])["Target"].size()
s = input_df[["FeatureID", "region"]].map(a)

The same approach does work for a single-column groupby.
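For context, the single-column case works because `Series.map` accepts another Series and looks each value up in that Series' index. Below is a minimal sketch in plain pandas. The sample data and the column name `group_size` are made up for illustration, and a Dask Series is only assumed to behave analogously here.

```python
import pandas as pd

# Hypothetical sample data (not from the question).
df = pd.DataFrame({
    "FeatureID": [4, 5, 4, 4, 5, 4],
    "region": list("aaabbb"),
    "Target": [7, 8, 9, 4, 2, 3],
})

# One grouping column: the result is a Series indexed by FeatureID.
sizes = df.groupby("FeatureID")["Target"].size()

# Series.map looks each FeatureID value up in the index of `sizes`.
df["group_size"] = df["FeatureID"].map(sizes)

# Keep only the rows whose group has at least 3 rows.
filtered = df[df["group_size"] >= 3]
print(filtered)
```

With two grouping columns, `input_df[["FeatureID", "region"]]` is a DataFrame rather than a Series, so there is no single key column to look up in the grouped result. That is why a join on both columns is used instead.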

Solution

Thanks to @jezrael, I came up with the following solution:

a = input_df.groupby(["FeatureID", "region"])["Target"].nunique().to_frame("feature_div")
input_df = input_df.join(a, on=["FeatureID", "region"])

# filter out features below diversity threshold
diversified = input_df[input_df.feature_div >= diversity_threshold]

【Comments】:

    Tags: python pandas dask


    【Answer 1】:

    You need join and to_frame:

    a = input_df.groupby(["FeatureID", "region"])["Target"].size().to_frame('New')
    input_df = input_df.join(a, on=["FeatureID", "region"])
    

    Example:

    import pandas as pd
    from dask import dataframe as dd 
    
    input_df = pd.DataFrame({
             'FeatureID':[4,5,4,5,5,4],
             'region':list('aaabbb'),
             'Target':[7,8,9,4,2,3],
    })
    
    print (input_df)
       FeatureID region  Target
    0          4      a       7
    1          5      a       8
    2          4      a       9
    3          5      b       4
    4          5      b       2
    5          4      b       3
    

    sd = dd.from_pandas(input_df, npartitions=3)
    print (sd)
                  FeatureID  region Target
    npartitions=3                         
    0                 int64  object  int64
    2                   ...     ...    ...
    4                   ...     ...    ...
    5                   ...     ...    ...
    Dask Name: from_pandas, 3 tasks
    
    a = sd.groupby(["FeatureID", "region"])["Target"].size().to_frame('New')
    out = sd.join(a, on=["FeatureID", "region"]).compute()
    print (out)
       FeatureID region  Target  New
    0          4      a       7    2
    1          5      a       8    1
    2          4      a       9    2
    3          5      b       4    2
    4          5      b       2    2
    5          4      b       3    1
    
