【Question title】: Spark SQL: count hotel visits per month
【Posted】: 2021-08-23 09:49:54
【Question description】:

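I need to count the number of hotel visits per month.
If a visitor does not leave in the same month he checked in, the stay should count as one visit for every month he spent at the hotel.

John stayed at hotel A across two months, so the hotel gets 1 visit in month 09 and 1 visit in month 10.
Mark stayed at hotel A within a single month, so month 09 gets +1 visit.
Total: month 09 has 2 visits, month 10 has 1 visit.

I started writing a query, but with the CTEs (visits within one month, visits spanning months), unions, and duplicate removal it grew huge, and I feel there should be a more elegant solution.

So, is there a simple way to do this?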

Hotel Visitor In Out
A John 22.09.2020 01.10.2020
A Mark 22.09.2020 29.09.2020

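【Comments】:

    Tags: sql apache-spark-sql


    【Solution 1】:

    Try this spark-sql solution. I added one row for the scenario where in and out fall in different years.

    I used the reference month "2000-01-01"; you can change it to an older date (the first day of a month, earlier than all your data) to suit your requirements.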

    val mdf = spark.sql("""
    select  'A' hotel, 'John' visitor ,to_date('2020-09-22') in, to_date('2020-10-01') out  union all
    select  'A', 'Mark', '2020-09-22', '2020-09-29' union all
    select  'A', 'Mark', '2019-09-22', '2020-09-29' 
    """)
    mdf.show(false)
    
    +-----+-------+----------+----------+
    |hotel|visitor|in        |out       |
    +-----+-------+----------+----------+
    |A    |John   |2020-09-22|2020-10-01|
    |A    |Mark   |2020-09-22|2020-09-29|
    |A    |Mark   |2019-09-22|2020-09-29|
    +-----+-------+----------+----------+
    
    mdf.createOrReplaceTempView("mdf")
    
    spark.sql("""
    select hotel, visitor, 
       int(months_between(out,'2000-01-01'))-int(months_between(in,'2000-01-01'))+1 as no_visits
      from mdf
    """).show(false)
    
    +-----+-------+---------+
    |hotel|visitor|no_visits|
    +-----+-------+---------+
    |A    |John   |2        |
    |A    |Mark   |1        |
    |A    |Mark   |13       |
    +-----+-------+---------+
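
As far as I can tell, truncating months_between against a first-of-month reference date gives a whole-month index for each date, so the expression amounts to "month index of out minus month index of in, plus one". A minimal plain-Python sketch of that arithmetic (the helper name months_touched is made up for illustration, and this is not Spark code):

```python
from datetime import date

def months_touched(check_in: date, check_out: date) -> int:
    """Number of calendar months a stay touches, both ends inclusive."""
    # Index each month as year * 12 + month so year boundaries are handled.
    first = check_in.year * 12 + check_in.month
    last = check_out.year * 12 + check_out.month
    return last - first + 1

stays = [
    ("John", date(2020, 9, 22), date(2020, 10, 1)),
    ("Mark", date(2020, 9, 22), date(2020, 9, 29)),
    ("Mark", date(2019, 9, 22), date(2020, 9, 29)),
]
counts = [months_touched(i, o) for _, i, o in stays]
print(counts)  # [2, 1, 13]
```

The three values match the no_visits column above.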
    

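    Another solution, using a udf():

    Here is the udf. It simply loops over every day from the start date to the end date (as epoch days), formats each day as yyyyMM, and counts the distinct months.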

    def months_between_t(start:java.sql.Date, end:java.sql.Date):Int={
          val st = start.toLocalDate
          val ed = end.toLocalDate
          val days_mm = for( i <- st.toEpochDay.toInt to ed.toEpochDay.toInt )
                     yield java.time.LocalDate.ofEpochDay(i).format(java.time.format.DateTimeFormatter.ofPattern("yyyyMM")).toString
          val months_mm = days_mm.distinct.length
        months_mm
    }
    

    Register the udf:

    import org.apache.spark.sql.functions.{col, udf}

    val udf_months_between = udf ( months_between_t(_:java.sql.Date, _:java.sql.Date):Int)
    

    Now call the udf:

    mdf.withColumn("num_of_visits",udf_months_between(col("in"),col("out"))).show(false)
    
    +-----+-------+----------+----------+-------------+
    |hotel|visitor|in        |out       |num_of_visits|
    +-----+-------+----------+----------+-------------+
    |A    |John   |2020-09-22|2020-10-01|2            |
    |A    |Mark   |2020-09-22|2020-09-29|1            |
    |A    |Mark   |2019-09-22|2020-09-29|13           |
    +-----+-------+----------+----------+-------------+
    

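    【Comments】:

      【Solution 2】:

      One approach is to build a table with an exhaustive list of months from your data, and then map the data onto it.

      The sample data I used: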

      from pyspark.sql import functions as func

      # create data table
      data_ls = [
          ('A', 'blah1', '2020-02-02', '2020-04-16'),
          ('A', 'blah2', '2020-02-02', '2020-03-01'),
          ('A', 'blah3', '2020-12-02', '2021-03-01'),
          ('A', 'blah4', '2020-12-02', '2021-03-01'),
          ('B', 'blah2', '2021-02-02', '2021-03-01')
      ]
      
      data_sdf = spark.sparkContext.parallelize(data_ls).toDF(['hotel', 'person', 'in', 'out']). \
          withColumn('in', func.col('in').cast('date')). \
          withColumn('out', func.col('out').cast('date'))
      
      # +-----+------+----------+----------+
      # |hotel|person|        in|       out|
      # +-----+------+----------+----------+
      # |    A| blah1|2020-02-02|2020-04-16|
      # |    A| blah2|2020-02-02|2020-03-01|
      # |    A| blah3|2020-12-02|2021-03-01|
      # |    A| blah4|2020-12-02|2021-03-01|
      # |    B| blah2|2021-02-02|2021-03-01|
      # +-----+------+----------+----------+
      

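      The query below maps every person onto a manually built month table. The month table is created from the years present in your hotel data, crossed with an exhaustive list of months.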

      data_sdf.createOrReplaceTempView('hotel_data')
      
      spark.sql('''
          select y.hotel, x.yyyymm, count(distinct y.person) as num_visits from (
              select a.mth, b.yr, concat(b.yr, a.mth) as yyyymm from (
              select '01' as mth union all 
              select '02' as mth union all 
              select '03' as mth union all 
              select '04' as mth union all 
              select '05' as mth union all 
              select '06' as mth union all 
              select '07' as mth union all 
              select '08' as mth union all 
              select '09' as mth union all 
              select '10' as mth union all 
              select '11' as mth union all 
              select '12' as mth) a
              cross join (
                  select distinct year(in) as yr from hotel_data
                  union
                  select distinct year(out) as yr from hotel_data) b
              on 1=1) x
          left join hotel_data y
          on x.yyyymm >= date_format(y.in, 'yyyyMM')
          and x.yyyymm <= date_format(y.out, 'yyyyMM')
          where y.hotel is not null
          group by 1, 2
          order by 1, 2
      ''').show()
      
      # +-----+------+----------+
      # |hotel|yyyymm|num_visits|
      # +-----+------+----------+
      # |    A|202002|         2|
      # |    A|202003|         2|
      # |    A|202004|         1|
      # |    A|202012|         2|
      # |    A|202101|         2|
      # |    A|202102|         2|
      # |    A|202103|         2|
      # |    B|202102|         1|
      # |    B|202103|         1|
      # +-----+------+----------+
      
      • The first part (x) builds the exhaustive list of months (yyyyMM). With the data I used, it looks like this:
      mth yr yyyymm
      01 2020 202001
      02 2020 202002
      03 2020 202003
      04 2020 202004
      05 2020 202005
      06 2020 202006
      07 2020 202007
      08 2020 202008
      09 2020 202009
      10 2020 202010
      11 2020 202011
      12 2020 202012
      01 2021 202101
      02 2021 202102
      03 2021 202103
      04 2021 202104
      ... ... ...
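      • The next part joins the hotel data to the month list above, so that each person is mapped to every month that falls within their in-out dates. I format in and out as months (yyyyMM) and check whether a month from the exhaustive month table lies between a person's in and out.
      • After the join, the query counts the distinct persons per hotel per month (months taken from the exhaustive month table).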
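
The three steps above can be sketched in plain Python to check the logic without a Spark session. This is only an illustration of the join and aggregation, not the Spark query itself; the variable names are made up.

```python
from collections import defaultdict
from datetime import date

data = [
    ("A", "blah1", date(2020, 2, 2), date(2020, 4, 16)),
    ("A", "blah2", date(2020, 2, 2), date(2020, 3, 1)),
    ("A", "blah3", date(2020, 12, 2), date(2021, 3, 1)),
    ("A", "blah4", date(2020, 12, 2), date(2021, 3, 1)),
    ("B", "blah2", date(2021, 2, 2), date(2021, 3, 1)),
]

# Part x: every month of every year that appears in the data.
years = sorted({d.year for _, _, i, o in data for d in (i, o)})
month_table = [f"{y}{m:02d}" for y in years for m in range(1, 13)]

# Join + aggregate: collect the distinct persons per (hotel, yyyymm).
visitors = defaultdict(set)
for hotel, person, check_in, check_out in data:
    lo, hi = check_in.strftime("%Y%m"), check_out.strftime("%Y%m")
    for yyyymm in month_table:
        if lo <= yyyymm <= hi:
            visitors[(hotel, yyyymm)].add(person)

num_visits = {key: len(people) for key, people in sorted(visitors.items())}
for (hotel, yyyymm), n in num_visits.items():
    print(hotel, yyyymm, n)
```

Comparing the yyyyMM strings works because they are fixed-width, so string order equals chronological order.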

      【Comments】:

        【Solution 3】:

        Maybe something like this?

        -- Counts visits per calendar month; the year is ignored here.
        select
          mnt.m,
          count(y.Visitor) as visits  -- counts matched rows only, so empty months give 0
        from
          (
            select '01' as m union all
            select '02' as m union all
            select '03' as m union all
            select '04' as m union all
            select '05' as m union all
            select '06' as m union all
            select '07' as m union all
            select '08' as m union all
            select '09' as m union all
            select '10' as m union all
            select '11' as m union all
            select '12' as m
          ) as mnt
          left join your_data y
            on cast(mnt.m as int) between month(y.in) and month(y.out)
        group by mnt.m
        

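        I would be careful with the year in this kind of query.

        【Comments】:

        • What happens if in is in December and out is in February of the following year? I can't see how this works in that case; I don't think it does. It would be better to use dates (the first day of the month for in, the last day of the month for out).
        • It won't work, as I said: "I would be careful with the year in this kind of query." If that worries you, you should point it out in your question (at least in your sample data...)
        • I'm not the OP. But given that assumption, your answer doesn't seem useful for real data. I encourage you to add your assumptions to the answer.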
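
To make the December-to-February concern concrete, here is a small plain-Python sketch with a made-up stay. It compares a month-number-only range check with a (year, month) check; it is an illustration of the problem, not a Spark query.

```python
from datetime import date

check_in, check_out = date(2020, 12, 10), date(2021, 2, 5)  # made-up stay

# Month numbers only: December (12) to February (2) is an empty range.
by_month_only = [m for m in range(1, 13) if check_in.month <= m <= check_out.month]

# (year, month) pairs compare correctly across the year boundary.
lo, hi = (check_in.year, check_in.month), (check_out.year, check_out.month)
by_year_month = [
    (y, m)
    for y in range(check_in.year, check_out.year + 1)
    for m in range(1, 13)
    if lo <= (y, m) <= hi
]

print(by_month_only)  # []
print(by_year_month)  # [(2020, 12), (2021, 1), (2021, 2)]
```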
        【Solution 4】:

        Basically, you need a list of months. You may have a calendar table. Or, if you have someone checking in every month, you can use that:

        -- Spark SQL: trunc(date, 'MM') returns the first day of the month
        select yyyymm, count(*)
        from (select distinct trunc(in, 'MM') as yyyymm
              from t
             ) ym join
             t
             on trunc(in, 'MM') <= yyyymm and
                trunc(out, 'MM') >= yyyymm
        group by yyyymm;
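
The "someone checking in every month" condition matters. A minimal plain-Python sketch of the same logic on the two rows from the question (month_start is a made-up helper standing in for the month truncation) suggests that a month with no check-in never makes it into the month list:

```python
from datetime import date

stays = [
    ("John", date(2020, 9, 22), date(2020, 10, 1)),
    ("Mark", date(2020, 9, 22), date(2020, 9, 29)),
]

def month_start(d: date) -> date:
    """First day of the month that d falls in."""
    return d.replace(day=1)

# Month list derived only from check-in dates, as in the query above.
months = sorted({month_start(i) for _, i, _ in stays})

counts = {
    m: sum(1 for _, i, o in stays if month_start(i) <= m <= month_start(o))
    for m in months
}
print(counts)  # {datetime.date(2020, 9, 1): 2}
```

October is missing because nobody checked in that month, so John's October visit is not counted. With a calendar table as the month source, that gap would not occur.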
        

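        【Comments】:

          【Solution 5】:

          If you want to do it in SQL, you can do it like this (the bracket-quoted identifiers below are SQL Server syntax):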

          with your_data as (
            select 'John' as Visitor, '2020-09-22' as [In], '2020-10-01' as [out] union all
            select 'Mark' as Visitor, '2020-09-22' as [In], '2020-09-29' as [out]
          )
          select 
            my.y as [Year], 
            mnt.m as [Month], 
            count(distinct concat(Visitor,y,m)) as Visitors
          from
            (
              select 'mny' as t, min(year([in])) as y
              from your_data yydd
              union all
              select 'mny' as t,  max(year([out])) as y
              from your_data yydd 
             ) as my
            cross join
            (
              select '01' as m union all
              select '02' as m union all
              select '03' as m union all
              select '04' as m union all
              select '05' as m union all
              select '06' as m union all
              select '07' as m union all
              select '08' as m union all
              select '09' as m union all
              select '10' as m union all
              select '11' as m union all
              select '12' as m
            ) as mnt
            left join your_data y
              on
                (
                  year(y.[in]) = my.y and 
                  mnt.m between month(y.[in]) and case when month(y.[out]) < month(y.[in]) then '12' else month(y.[out]) end
                )
                or
                (
                  year(y.[out]) = my.y and
                  mnt.m between case when month(y.[out]) < month(y.[in]) then '01' else '13' end and month(y.[out])
                )
          where Visitor is not null
          group by my.y, mnt.m
          

          You can test it on this db<>fiddle.

          To check the "cross-year" behaviour, replace the data with these dates:

          with your_data as (
            select 'John' as Visitor, '2020-09-22' as [In], '2020-10-01' as [out] union all
            select 'Mark' as Visitor, '2020-09-22' as [In], '2021-04-29' as [out]
          )
          

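          You can try it on this db<>fiddle.

          【Comments】:

          • If a visitor stays for more than 2 consecutive years, you have to generate the years in between artificially, otherwise this will not work as expected.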
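
One way to avoid the missing in-between years is to generate every month between the earliest check-in and the latest check-out instead of relying on the years present in the data. A minimal plain-Python sketch, assuming a made-up stay and a hypothetical helper named month_range:

```python
from datetime import date

def month_range(first: date, last: date) -> list:
    """All months from first to last, inclusive, as 'yyyyMM' strings."""
    # Zero-based month index, so i // 12 is the year and i % 12 + 1 the month.
    start = first.year * 12 + (first.month - 1)
    end = last.year * 12 + (last.month - 1)
    return [f"{i // 12}{i % 12 + 1:02d}" for i in range(start, end + 1)]

months = month_range(date(2019, 11, 15), date(2022, 2, 3))  # made-up stay
print(months[0], months[-1], len(months))  # 201911 202202 28
```

The resulting list includes all of 2020 and 2021, even though neither year appears as a check-in or check-out year.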