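Here is an approach that uses your data to build a table containing an exhaustive list of months, and then maps the data onto that table.
The sample data I used: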
# create data table
from pyspark.sql import functions as func

data_ls = [
    ('A', 'blah1', '2020-02-02', '2020-04-16'),
    ('A', 'blah2', '2020-02-02', '2020-03-01'),
    ('A', 'blah3', '2020-12-02', '2021-03-01'),
    ('A', 'blah4', '2020-12-02', '2021-03-01'),
    ('B', 'blah2', '2021-02-02', '2021-03-01')
]

# assumes an active SparkSession named `spark`; cast the string columns to dates
data_sdf = spark.sparkContext.parallelize(data_ls).toDF(['hotel', 'person', 'in', 'out']). \
    withColumn('in', func.col('in').cast('date')). \
    withColumn('out', func.col('out').cast('date'))
# +-----+------+----------+----------+
# |hotel|person|        in|       out|
# +-----+------+----------+----------+
# |    A| blah1|2020-02-02|2020-04-16|
# |    A| blah2|2020-02-02|2020-03-01|
# |    A| blah3|2020-12-02|2021-03-01|
# |    A| blah4|2020-12-02|2021-03-01|
# |    B| blah2|2021-02-02|2021-03-01|
# +-----+------+----------+----------+
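The query below maps every person onto a manually built month table. That month table is built from the years present in your hotel data combined with an exhaustive list of the twelve months.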
data_sdf.createOrReplaceTempView('hotel_data')

spark.sql('''
    select y.hotel, x.yyyymm, count(distinct y.person) as num_visits
    from (
        -- x: exhaustive month list, 12 months for every year seen in the data
        select a.mth, b.yr, concat(b.yr, a.mth) as yyyymm
        from (
            select '01' as mth union all
            select '02' as mth union all
            select '03' as mth union all
            select '04' as mth union all
            select '05' as mth union all
            select '06' as mth union all
            select '07' as mth union all
            select '08' as mth union all
            select '09' as mth union all
            select '10' as mth union all
            select '11' as mth union all
            select '12' as mth) a
        cross join (
            -- `in` is a SQL keyword, so the column names are quoted with backticks
            select distinct year(`in`) as yr from hotel_data
            union
            select distinct year(`out`) as yr from hotel_data) b
    ) x
    left join hotel_data y
        on x.yyyymm >= date_format(y.`in`, 'yyyyMM')
        and x.yyyymm <= date_format(y.`out`, 'yyyyMM')
    where y.hotel is not null
    group by 1, 2
    order by 1, 2
''').show()
# +-----+------+----------+
# |hotel|yyyymm|num_visits|
# +-----+------+----------+
# |    A|202002|         2|
# |    A|202003|         2|
# |    A|202004|         1|
# |    A|202012|         2|
# |    A|202101|         2|
# |    A|202102|         2|
# |    A|202103|         2|
# |    B|202102|         1|
# |    B|202103|         1|
# +-----+------+----------+
- The first part (x) builds the exhaustive list of months (yyyyMM). With the data I used, it looks like this:
| mth | yr | yyyymm |
|-----|------|--------|
| 01 | 2020 | 202001 |
| 02 | 2020 | 202002 |
| 03 | 2020 | 202003 |
| 04 | 2020 | 202004 |
| 05 | 2020 | 202005 |
| 06 | 2020 | 202006 |
| 07 | 2020 | 202007 |
| 08 | 2020 | 202008 |
| 09 | 2020 | 202009 |
| 10 | 2020 | 202010 |
| 11 | 2020 | 202011 |
| 12 | 2020 | 202012 |
| 01 | 2021 | 202101 |
| 02 | 2021 | 202102 |
| 03 | 2021 | 202103 |
| 04 | 2021 | 202104 |
| ... | ... | ... |
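- The next part joins the hotel data to the month list above, so that each person is mapped to the months that fall within their in and out dates. I convert in and out to months (yyyyMM format) and check whether a month from the exhaustive month table lies between a person's in month and out month.
- After the join, the query counts the distinct persons per hotel per month (the months coming from the exhaustive month table).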
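To make the three steps above easier to follow without a Spark session, here is a minimal sketch of the same logic in plain Python. It is only an illustration, not a replacement for the query: the helper names `month_table` and `visits_per_month` are made up for this example, and the nested loop stands in for the range join. The comparison works on strings because fixed-width yyyyMM values sort in chronological order.

```python
from collections import defaultdict
from datetime import date

# Same sample rows as the Spark example: (hotel, person, in, out).
stays = [
    ("A", "blah1", date(2020, 2, 2), date(2020, 4, 16)),
    ("A", "blah2", date(2020, 2, 2), date(2020, 3, 1)),
    ("A", "blah3", date(2020, 12, 2), date(2021, 3, 1)),
    ("A", "blah4", date(2020, 12, 2), date(2021, 3, 1)),
    ("B", "blah2", date(2021, 2, 2), date(2021, 3, 1)),
]


def month_table(rows):
    """Step 1 (hypothetical helper): every yyyyMM for each year seen in an in or out date."""
    years = sorted({d.year for _, _, d_in, d_out in rows for d in (d_in, d_out)})
    return [f"{yr}{mth:02d}" for yr in years for mth in range(1, 13)]


def visits_per_month(rows):
    """Steps 2 and 3 (hypothetical helper): range-match months to stays, then count distinct persons."""
    persons = defaultdict(set)
    for yyyymm in month_table(rows):
        for hotel, person, d_in, d_out in rows:
            # Keep the month if it lies between the stay's in month and out month (inclusive).
            if d_in.strftime("%Y%m") <= yyyymm <= d_out.strftime("%Y%m"):
                persons[(hotel, yyyymm)].add(person)
    # Months with no matching stay never get a key, like the `where y.hotel is not null` filter.
    return {key: len(people) for key, people in sorted(persons.items())}


counts = visits_per_month(stays)
for (hotel, yyyymm), num_visits in counts.items():
    print(hotel, yyyymm, num_visits)
```

For the sample rows, `counts[("A", "202004")]` is 1, because only blah1 is still checked in during April 2020, which matches the 202004 row in the Spark output.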