【Question title】: groupby and aggregation in Dask
【Posted】: 2022-01-04 19:07:47
【Question】:

My dataframe looks like this:

    import pandas as pd

    # initialize list of lists
    data = [['tom', 10], ['nick', 15], ['juli', 14], ['tom', 10], ['juli', 15]]

    # Create the pandas DataFrame
    df = pd.DataFrame(data, columns=['Name', 'Age'])
    
            Name    Age
        0   tom     10
        1   nick    15
        2   juli    14
        3   tom     10
        4   juli    15

I want to group by `Name` and compute both the count and the number of unique values of `Age`.

With pandas I get this result:

           Age
           count    nunique
    Name        
    juli    2      2
    nick    1      1
    tom     2      1

pandas code:

    types = ['count', 'nunique'] 
    df.groupby('Name').agg({'Age': types})

How can I achieve this in Dask?

In Dask, I can do either `count` or `nunique` on its own...

    import dask.dataframe as daskdf

    ddf = daskdf.from_pandas(df, npartitions=4)
    ddf.groupby('Name').Age.count().to_frame().compute()
               Age
        Name    
        nick    1
        tom     2
        juli    2

【Comments】:

    Tags: python pandas dask dask-distributed dask-dataframe


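    【Answer 1】:

    The advantage of lazy computation is that you can specify the calculations one at a time, while the actual computation is carried out with some optimizations that avoid redundant work.

    Specifically, you can create separate lazy computations for `nunique` and `count`, then combine the computed results: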

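    # calculation with dask
    # (`ddf` is the Dask DataFrame from the question; `dd` is `dask.dataframe`)
    dask_series = ddf.groupby("Name")["Age"]

    # these are lazy results that will need to be computed
    lazy_results = [
        dask_series.nunique().to_frame(name="age_nunique"),
        dask_series.count().to_frame(name="age_count"),
    ]

    # note that concatenation happens on computed results:
    # dd.compute returns a one-element tuple holding the list of computed frames,
    # and the * unpacks that list as the first argument of pd.concat
    print(pd.concat(*dd.compute(lazy_results), axis=1))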
    
    

    Here is the complete snippet for the pandas and Dask calculations:

    import dask.dataframe as dd
    import pandas as pd
    
    # initialize list of lists
    data = [["tom", 10], ["nick", 15], ["juli", 14], ["tom", 10], ["juli", 15]]
    
    # Create the pandas DataFrame
    df = pd.DataFrame(data, columns=["Name", "Age"])
    
    # calculation with pandas
    types = ["count", "nunique"]
    print(df.groupby("Name").agg({"Age": types}))
    #        Age
    #      count nunique
    # Name
    # juli     2       2
    # nick     1       1
    # tom      2       1
    
    # calculation with dask
    ddf = dd.from_pandas(df, npartitions=4)
    dask_series = ddf.groupby("Name")["Age"]
    
    # these are lazy results that will need to be computed
    lazy_results = [
        dask_series.nunique().to_frame(name="age_nunique"),
        dask_series.count().to_frame(name="age_count"),
    ]
    
    # note that concatenation happens on computed results
    print(pd.concat(*dd.compute(lazy_results), axis=1))
    #       age_nunique  age_count
    # Name
    # nick            1          1
    # tom             1          2
    # juli            2          2
    

    【Comments】:
