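When it comes to slicing and aggregating data (by time or anything else), the star schema (Kimball star) is a fairly simple yet powerful solution. Suppose that for each click we store the time (to one-second resolution), the user's info, the button ID, and the user's location. To make slicing and dicing easy, I'll start with pre-loaded lookup tables that hold the properties of objects that rarely change; these are called dimension tables in the DW world.
The dimDate table has one row for each day, with a number of attributes (fields) that describe that specific day. The table can be pre-loaded years in advance. If it contains fields like DaysAgo, WeeksAgo, MonthsAgo, YearsAgo, it should be updated once per day; otherwise it can be "load and forget". dimDate allows easy slicing by date attributes, for example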
WHERE [YEAR] = 2009 AND DayOfWeek = 'Sunday'
For ten years of data, the table has only about 3650 rows.
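As a rough illustration of the description above, dimDate could look something like the sketch below. It assumes SQL Server syntax and includes only the columns that the queries in this answer actually use; the data types and the integer format of DateKey are my assumptions, not a required layout. The UPDATE is one possible way to do the once-a-day refresh of the relative-age columns.

```sql
-- Sketch only: types and the yyyymmdd key format are assumptions.
CREATE TABLE dimDate (
    DateKey     int         NOT NULL PRIMARY KEY,  -- e.g. 20090315 (assumed format)
    FullDate    date        NOT NULL UNIQUE,
    [Year]      smallint    NOT NULL,
    [DayOfWeek] varchar(10) NOT NULL,              -- 'Sunday' .. 'Saturday'
    DaysAgo     int         NULL,
    WeeksAgo    int         NULL,
    MonthsAgo   int         NULL,
    YearsAgo    int         NULL
);

-- Daily refresh of the relative-age columns.
-- DATEDIFF counts boundaries crossed, so WeeksAgo/MonthsAgo/YearsAgo are
-- calendar-based; rows for future dates get negative values.
UPDATE dimDate
SET DaysAgo   = DATEDIFF(day,   FullDate, CAST(GETDATE() AS date)),
    WeeksAgo  = DATEDIFF(week,  FullDate, CAST(GETDATE() AS date)),
    MonthsAgo = DATEDIFF(month, FullDate, CAST(GETDATE() AS date)),
    YearsAgo  = DATEDIFF(year,  FullDate, CAST(GETDATE() AS date));
```

With DaysAgo kept current, "today" is simply DaysAgo = 0 and "yesterday" is DaysAgo = 1, with no date arithmetic in the reporting queries.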
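The dimGeography table is pre-loaded with the geographic regions of interest. The number of rows depends on the "geographic resolution" required in reports. It allows slicing like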
WHERE Continent = 'South America'
Once loaded, it rarely changes.
The dimButton table has one row for each button on the site, so a query may have
WHERE PageURL = 'http://…/somepage.php'
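The dimUser table has one row per registered user. It should be loaded with a new user's info as soon as the user registers, or at least the new user's row should be in the table before any other transaction for that user is recorded in the fact tables.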
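The three remaining dimensions described above could be sketched as follows. Again this assumes SQL Server syntax, and only the columns referenced by the queries in this answer are certain; the data types and the Country column are hypothetical placeholders for whatever attributes your reports need.

```sql
-- Sketch only: types are assumptions; add attributes as reports require.
CREATE TABLE dimGeography (
    GeographyKey int         NOT NULL PRIMARY KEY,
    Continent    varchar(30) NOT NULL,
    Country      varchar(60) NULL      -- hypothetical finer-grained level
);

CREATE TABLE dimButton (
    ButtonKey  int          NOT NULL PRIMARY KEY,
    ButtonName varchar(100) NOT NULL,
    PageURL    varchar(400) NOT NULL
);

CREATE TABLE dimUser (
    UserKey int          NOT NULL PRIMARY KEY,
    Email   varchar(254) NOT NULL
);
```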
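To record button clicks, I'll add the factClick table.
The factClick table has one row for each click of a button by a specific user at a point in time. I have used TimeStamp (one-second resolution), ButtonKey and UserKey as a composite primary key to filter out clicks that arrive faster than one per second from the same user on the same button. Note the Hour field: it contains the hour part of TimeStamp as an integer in the range 0-23, to allow easy slicing by hour, for example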
WHERE [HOUR] BETWEEN 7 AND 9
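A minimal sketch of factClick matching that description is shown below, assuming SQL Server syntax and the dimension tables named in this answer. The data types and constraint names are my assumptions; the composite primary key is the part taken directly from the text.

```sql
-- Sketch only: types and constraint names are assumptions.
CREATE TABLE factClick (
    [TimeStamp]  datetime2(0) NOT NULL,   -- one-second resolution
    ButtonKey    int          NOT NULL REFERENCES dimButton (ButtonKey),
    UserKey      int          NOT NULL REFERENCES dimUser (UserKey),
    DateKey      int          NOT NULL REFERENCES dimDate (DateKey),
    [Hour]       tinyint      NOT NULL CHECK ([Hour] BETWEEN 0 AND 23),
    GeographyKey int          NOT NULL REFERENCES dimGeography (GeographyKey),
    -- At most one row per user, button and second.
    CONSTRAINT PK_factClick PRIMARY KEY ([TimeStamp], ButtonKey, UserKey)
);
```

Because the key rejects a second row for the same user, button and second, the load process has to drop or collapse such duplicates before inserting rather than rely on the insert failing.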
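So, now we have to consider:
- How to load the table? Periodically (maybe every hour or every few minutes) from the web log using an ETL tool, or with a low-latency solution based on some kind of event-streaming process.
- How long to keep the information in the table?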
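For the periodic-load option, one possible shape of the load step is sketched below. It assumes a hypothetical staging table stgClick (columns ClickTime as datetime2(0), ButtonKey, UserKey, GeographyKey) that the ETL job has already filled from the web log with dimension keys resolved; none of these names come from the original design.

```sql
-- Sketch only: stgClick and its columns are hypothetical.
INSERT INTO factClick ([TimeStamp], ButtonKey, UserKey, DateKey, [Hour], GeographyKey)
SELECT s.ClickTime,
       s.ButtonKey,
       s.UserKey,
       d.DateKey,
       DATEPART(hour, s.ClickTime),   -- hour part, 0-23
       MIN(s.GeographyKey)            -- pick one value if duplicates disagree
FROM stgClick AS s
JOIN dimDate  AS d ON d.FullDate = CAST(s.ClickTime AS date)
WHERE NOT EXISTS (                    -- skip clicks loaded by an earlier batch
        SELECT 1
        FROM factClick AS f
        WHERE f.[TimeStamp] = s.ClickTime
          AND f.ButtonKey   = s.ButtonKey
          AND f.UserKey     = s.UserKey)
-- Collapse repeated clicks within the same second to a single row.
GROUP BY s.ClickTime, s.ButtonKey, s.UserKey, d.DateKey;
```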
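Regardless of whether the table keeps information for only a day or for a few years, it should be partitioned. ConcernedOfTunbridgeW has explained partitioning in his answer, so I'll skip it here.
Now, a few examples of slicing and dicing by different attributes (including day and hour).
To simplify queries, I'll add a view to flatten the model: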
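/*
To simplify queries flatten the model.
Columns are listed explicitly because the key columns exist on both
sides of each join and view column names must be unique.
Add further dimension attributes as needed.
*/
CREATE VIEW vClicks
AS
SELECT f.[TimeStamp], f.[Hour], f.DateKey, f.ButtonKey, f.UserKey, f.GeographyKey
      ,d.FullDate, d.[Year], d.[DayOfWeek], d.DaysAgo
      ,b.ButtonName, b.PageURL
      ,u.Email
      ,g.Continent
FROM factClick AS f
JOIN dimDate AS d ON d.DateKey = f.DateKey
JOIN dimButton AS b ON b.ButtonKey = f.ButtonKey
JOIN dimUser AS u ON u.UserKey = f.UserKey
JOIN dimGeography AS g ON g.GeographyKey = f.GeographyKey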
Query example
/*
Count number of times specific users clicked any button
today between 7 and 9 AM (7:00 - 9:59)
*/
SELECT [Email]
,COUNT(*) AS [Counter]
FROM vClicks
WHERE [DaysAgo] = 0
AND [Hour] BETWEEN 7 AND 9
AND [Email] IN ('dude45@somemail.com', 'bob46@bobmail.com')
GROUP BY [Email]
ORDER BY [Email]
Suppose that I am interested in data for User = ALL. dimUser is a large table, so I'll create a view without it to speed up queries.
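/*
Because dimUser can be a large table, it is good
to have a view without it, to speed up queries
when user info is not required.
Columns are listed explicitly so that view column names are unique.
*/
CREATE VIEW vClicksNoUsr
AS
SELECT f.[TimeStamp], f.[Hour], f.DateKey, f.ButtonKey, f.GeographyKey
      ,d.FullDate, d.[Year], d.[DayOfWeek], d.DaysAgo
      ,b.ButtonName, b.PageURL
      ,g.Continent
FROM factClick AS f
JOIN dimDate AS d ON d.DateKey = f.DateKey
JOIN dimButton AS b ON b.ButtonKey = f.ButtonKey
JOIN dimGeography AS g ON g.GeographyKey = f.GeographyKey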
Query example
/*
Count number of times a button was clicked on a specific page
today and yesterday, for each hour.
*/
SELECT [FullDate]
,[Hour]
,COUNT(*) AS [Counter]
FROM vClicksNoUsr
WHERE [DaysAgo] IN ( 0, 1 )
AND PageURL = 'http://...MyPage'
GROUP BY [FullDate], [Hour]
ORDER BY [FullDate] DESC, [Hour] DESC
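Suppose that for aggregations we do not need to keep specific user info and are only interested in date, hour, button and geography. Each row in the factClickAgg table holds a counter of clicks for one button, from one geographic area, during one hour.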
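A sketch of factClickAgg at that grain could look like this, assuming SQL Server syntax. The column names are the ones used by the queries in this answer; the data types and the choice of primary key are my assumptions.

```sql
-- Sketch only: types and constraint name are assumptions.
CREATE TABLE factClickAgg (
    DateKey      int     NOT NULL REFERENCES dimDate (DateKey),
    [Hour]       tinyint NOT NULL CHECK ([Hour] BETWEEN 0 AND 23),
    ButtonKey    int     NOT NULL REFERENCES dimButton (ButtonKey),
    GeographyKey int     NOT NULL REFERENCES dimGeography (GeographyKey),
    ClickCount   int     NOT NULL,
    -- One row per day, hour, button and geographic area.
    CONSTRAINT PK_factClickAgg PRIMARY KEY (DateKey, [Hour], ButtonKey, GeographyKey)
);
```

Since the user dimension is gone, the row count is bounded by days × 24 × buttons × geographic areas, regardless of how many users click.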
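The factClickAgg table can be loaded hourly, or only at the end of each day, depending on the requirements for reporting and analytics. For example, if the table is loaded at the end of each day (after midnight), I can use something like: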
/* At the end of each day (after midnight) aggregate data. */
INSERT INTO factClickAgg (DateKey, [Hour], ButtonKey, GeographyKey, ClickCount)
SELECT DateKey
,[Hour]
,ButtonKey
,GeographyKey
,COUNT(*) AS [ClickCount]
FROM vClicksNoUsr
WHERE [DaysAgo] = 1
GROUP BY DateKey
,[Hour]
,ButtonKey
,GeographyKey
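If the nightly job can end up running twice for the same day, the insert above would either double the counts or fail on a key violation, depending on how factClickAgg is constrained. One way to make the load re-runnable, as a sketch in SQL Server syntax, is to clear yesterday's rows first and run this statement and the insert in the same transaction:

```sql
-- Remove rows already aggregated for yesterday so the load can be re-run.
DELETE a
FROM factClickAgg AS a
JOIN dimDate AS d ON d.DateKey = a.DateKey
WHERE d.DaysAgo = 1;
```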
To simplify queries, I'll create a view to flatten the model:
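/*
To simplify queries for aggregated data.
Columns are listed explicitly so that view column names are unique.
*/
CREATE VIEW vClicksAggregate
AS
SELECT f.DateKey, f.[Hour], f.ButtonKey, f.GeographyKey, f.ClickCount
      ,d.FullDate, d.[Year], d.[DayOfWeek], d.DaysAgo
      ,b.ButtonName, b.PageURL
      ,g.Continent
FROM factClickAgg AS f
JOIN dimDate AS d ON d.DateKey = f.DateKey
JOIN dimButton AS b ON b.ButtonKey = f.ButtonKey
JOIN dimGeography AS g ON g.GeographyKey = f.GeographyKey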
Now I can query the aggregated data, for example by day:
/*
Number of times a specific button was clicked
in year 2009, by day
*/
SELECT FullDate
,SUM(ClickCount) AS [Counter]
FROM vClicksAggregate
WHERE ButtonName = 'MyBtn_1'
AND [Year] = 2009
GROUP BY FullDate
ORDER BY FullDate
Or with a few more options:
/*
Number of times specific buttons were clicked
in year 2008, on Saturdays, between 9:00 and 11:59 AM
by users from Africa
*/
SELECT SUM(ClickCount) AS [Counter]
FROM vClicksAggregate
WHERE [Year] = 2008
AND [DayOfWeek] = 'Saturday'
AND [Hour] BETWEEN 9 AND 11
AND Continent = 'Africa'
AND ButtonName IN ( 'MyBtn_1', 'MyBtn_2', 'MyBtn_3' )