【Question Title】: MySQL sum over a window that contains a null value returns null
【Posted】: 2019-01-18 16:18:27
【Question】:

I am trying to get the sum of Revenue over the last 3 month rows (excluding the current row) for each Client. A minimal example of my current attempt in Databricks:

import numpy as np
import pandas as pd

cols = ['Client','Month','Revenue']
df_pd = pd.DataFrame([['A',201701,100],
                   ['A',201702,101],
                   ['A',201703,102],
                   ['A',201704,103],
                   ['A',201705,104],
                   ['B',201701,201],
                   ['B',201702,np.nan],
                   ['B',201703,203],
                   ['B',201704,204],
                   ['B',201705,205],
                   ['B',201706,206],
                   ['B',201707,207]                
                  ])
df_pd.columns = cols

spark_df = spark.createDataFrame(df_pd)
spark_df.createOrReplaceTempView('df_sql')

df_out = sqlContext.sql("""
select *, (sum(ifnull(Revenue,0)) over (partition by Client
  order by Client,Month
  rows between 3 preceding and 1 preceding)) as Total_Sum3
  from df_sql
  """)
df_out.show()

+------+------+-------+----------+
|Client| Month|Revenue|Total_Sum3|
+------+------+-------+----------+
|     A|201701|  100.0|      null|
|     A|201702|  101.0|     100.0|
|     A|201703|  102.0|     201.0|
|     A|201704|  103.0|     303.0|
|     A|201705|  104.0|     306.0|
|     B|201701|  201.0|      null|
|     B|201702|    NaN|     201.0|
|     B|201703|  203.0|       NaN|
|     B|201704|  204.0|       NaN|
|     B|201705|  205.0|       NaN|
|     B|201706|  206.0|     612.0|
|     B|201707|  207.0|     615.0|
+------+------+-------+----------+

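As you can see, if a null value exists anywhere in the 3-month window, a null value is returned. I would like to treat nulls as 0, hence the ifnull attempt, but this does not seem to work. I have also tried a case statement to change NULL to 0, with no luck.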

【Question Comments】:

  • Why/how is this related to MySQL? This looks like Apache Spark.
  • And MySQL gets the correct result: db-fiddle.com/f/gabXH7MRvJbymL9fESAzrU/0
  • I suspect the problem is that IFNULL() does not treat NaN as NULL
  • Try SUM(IF(Revenue = NaN, 0, Revenue))
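
To illustrate the suspicion raised in the comments, here is a minimal pure-Python sketch (it does not run Spark) of the frame `rows between 3 preceding and 1 preceding` applied to client B. The helpers `sql_sum`, `ifnull` and `trailing_sum` are hypothetical names that only imitate the SQL semantics, with `None` standing in for NULL; they are assumptions made for illustration, not Spark APIs.

```python
import math

def sql_sum(values):
    """Imitate SQL SUM: skip NULL (None); return NULL when nothing is left to add."""
    present = [v for v in values if v is not None]
    return sum(present) if present else None

def ifnull(value, default):
    """Imitate IFNULL: only NULL (None) is replaced; NaN is an ordinary float and passes through."""
    return default if value is None else value

def trailing_sum(revenue, preceding=3):
    """For each row, sum ifnull(value, 0) over the up-to-3 rows before it (current row excluded)."""
    out = []
    for i in range(len(revenue)):
        frame = revenue[max(0, i - preceding):i]
        out.append(sql_sum([ifnull(v, 0) for v in frame]))
    return out

# Client B from the question, with the missing month stored as NaN.
client_b = [201.0, float("nan"), 203.0, 204.0, 205.0, 206.0, 207.0]
with_nan = trailing_sum(client_b)
# -> [None, 201.0, nan, nan, nan, 612.0, 615.0]

# The same data with NaN replaced by None (the stand-in for NULL).
as_null = [None if math.isnan(v) else v for v in client_b]
with_null = trailing_sum(as_null)
# -> [None, 201.0, 201.0, 404.0, 407.0, 612.0, 615.0]
```

Under these assumptions, the NaN version reproduces the Total_Sum3 column from the question: any frame that contains NaN sums to NaN, because `ifnull` never fires for it. Once the missing value is a real NULL, the replacement with 0 takes effect.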

Tags: apache-spark null apache-spark-sql window-functions


【Solution 1】:

Just coalesce outside the sum:

df_out = sqlContext.sql("""
  select *, coalesce(sum(Revenue) over (partition by Client
  order by Client,Month
  rows between 3 preceding and 1 preceding), 0) as Total_Sum3
  from df_sql
 """)

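【Comments】:

    【Solution 2】:

    It's Apache Spark, my bad! (I am working in Databricks, and I thought it was MySQL under the hood.) Is it too late to change the title?

    @Barmar, you are right: IFNULL() does not treat NaN as null. Thanks to @user6910411 I managed to work out the fix: SO link. I had to convert the numpy NaNs to Spark nulls. The correct code, after the sample df_pd has been created: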

    spark_df = spark.createDataFrame(df_pd)
    
    from pyspark.sql.functions import isnan, col, when
    
    #this converts all NaNs in numeric columns to null:
    spark_df = spark_df.select([
        when(~isnan(c), col(c)).alias(c) if t in ("double", "float") else c 
        for c, t in spark_df.dtypes])
    
    spark_df.createOrReplaceTempView('df_sql')
    
    df_out = sqlContext.sql("""
    select *, (sum(ifnull(Revenue,0)) over (partition by Client
      order by Client,Month
      rows between 3 preceding and 1 preceding)) as Total_Sum3
      from df_sql order by Client,Month
      """)
    df_out.show()
    

    This then gives the desired output:

    +------+------+-------+----------+
    |Client| Month|Revenue|Total_Sum3|
    +------+------+-------+----------+
    |     A|201701|  100.0|      null|
    |     A|201702|  101.0|     100.0|
    |     A|201703|  102.0|     201.0|
    |     A|201704|  103.0|     303.0|
    |     A|201705|  104.0|     306.0|
    |     B|201701|  201.0|      null|
    |     B|201702|   null|     201.0|
    |     B|201703|  203.0|     201.0|
    |     B|201704|  204.0|     404.0|
    |     B|201705|  205.0|     407.0|
    |     B|201706|  206.0|     612.0|
    |     B|201707|  207.0|     615.0|
    +------+------+-------+----------+
    

    Is sqlContext the best way to approach this, or would it be better/more elegant to achieve the same result via pyspark.sql.window?

    【Comments】:
