【发布时间】:2020-03-17 20:03:39
【问题描述】:
我在数天/数月内收集了许多推文,并正在监控词组(主题标签)。推文的主数据库每天收集应用程序 500 万条推文,并将主题标签提取到单独的表中。计算这些主题标签以显示在一段时间(天/月)内形成的热图。
收集数据库是 MYSQL,其中主推文表是应用程序 500 M 记录,主题标签表是应用程序 175 M 记录。然后复制到 Clickhouse 进行分析。
以冠状病毒为例,下面的列表显示,可以将多个主题标签更好地组合在一起,以提高统计数据的可见性。
问题:
- 如何在一个定义的“标签”或别名中添加用于选择相似词/主题标签的过滤器? 完成
- 如何使用多个别名,每组过滤器/选择标准都使用? 完成
- 如何使用 RANK 或类似的方法不使用累积计数列出,而是使用 RANK?
SELECT (match(hashtag, '[Cc]orona.*|COVID.*|[Cc]ovid.*') ? 'COVID19' : hashtag) as Hashtag,
SUM(CASE when datetime between now() - interval 1 day AND now() then 1 END) "Today",
SUM(CASE when datetime between now() - interval 2 day AND now() - interval 1 day then 1 END) "Today -1",
SUM(CASE when datetime between now() - interval 3 day AND now() - interval 2 day then 1 END) "Today -2",
SUM(CASE when datetime between now() - interval 4 day AND now() - interval 3 day then 1 END) "Today -3",
SUM(CASE when datetime between now() - interval 5 day AND now() - interval 4 day then 1 END) "Today -4",
SUM(CASE when datetime between now() - interval 6 day AND now() - interval 5 day then 1 END) "Today -5",
SUM(CASE when datetime between now() - interval 7 day AND now() - interval 6 day then 1 END) "Today -6",
SUM(CASE when datetime between now() - interval 8 day AND now() - interval 7 day then 1 END) "Today -7"
FROM twitterDBhashtags
group by Hashtag
order by "Today" DESC limit 20;
在 twitterDBhashtags 表上:
id BIGINT(20) PK
hashtag VARCHAR(75)
datetime DATETIME
产生这个结果:
Hashtag Today Today -1 Today -2 Today -3 Today -4 Today -5 Today -6 Today -7
------------------------------------------------------------------------------------------------------------------
COVID19 245 799 253 088 241 731 226 515 249 281 84 088 149 789 117 015
BhulaDungaFirstLook 36 379 34
StPatricksDay 12 622 410 251 233 307 72 194 176
BhulaDungaWithSid 12 595 47
QuarantineLife 10 742 2 339 59 1 1
UPDATE 9 432 534 1 063 340 884 215 336 242
BREAKING 7 038 11 737 10 434 6 985 10 726 4 345 6 748 5 091
SidNaaz 6 012 2 247 4 115 1 692 2 065 241 1 502 1 236
China 5 840 4 803 4 887 5 472 7 039 2 086 3 392 3 748
FamiliesFirst 4 578 420 902 6 480 5 952 1 326
iHeartAwards 4 540 5 274 6 846 5 412 6 747 2 500 6 559 4 767
HomeOfSoul_Satlok 4 341
TrumpVirus 4 094 750 752 1 381 1 935 624 590 1 176
100WAYS 4 055 106 125 22
TEAMWANG 4 014 101 107 78 34 21 160 127
ChineseVirus 3 919 1 3 4 69 32 15 2
ShipsGoingDown 3 755 71
在@vladimir 提供非常好的输入后,使用
SELECT case when match(hashtag, '[Cc]orona.*|COVID.*|[Cc]ovid.*') then 'COVID19'
when match(hashtag, 'Bhula.*') then 'Bhula'
else hashtag END
as Hashtag,
SUM(CASE when datetime between now() - interval 1 day AND now() then 1 END) "Today",
SUM(CASE when datetime between now() - interval 2 day AND now() - interval 1 day then 1 END) "Today -1",
SUM(CASE when datetime between now() - interval 3 day AND now() - interval 2 day then 1 END) "Today -2",
SUM(CASE when datetime between now() - interval 4 day AND now() - interval 3 day then 1 END) "Today -3",
SUM(CASE when datetime between now() - interval 5 day AND now() - interval 4 day then 1 END) "Today -4",
SUM(CASE when datetime between now() - interval 6 day AND now() - interval 5 day then 1 END) "Today -5",
SUM(CASE when datetime between now() - interval 7 day AND now() - interval 6 day then 1 END) "Today -6",
SUM(CASE when datetime between now() - interval 8 day AND now() - interval 7 day then 1 END) "Today -7"
FROM twitterDBhashtags
group by Hashtag
order by "Today" DESC limit 10;
我明白了(请注意这是实时数据,因此上述结果的总和将不准确)
Hashtag Today Today -1 Today -2 Today -3 Today -4 Today -5 Today -6 Today -7
---------------------------------------------------------------------------------------------------------
COVID19 241825 260486 237838 236318 222989 129159 161506 122959
Bhula 35267 22372 856 1 13 4 1 12
StPatricksDay 14776 1147 254 239 271 130 198 167
QuarantineLife 10442 5140 169 1 2
AsiManshiDebut 8900
LuzonLockdown 6764 9
FamiliesFirst 6563 382 439 3285 8854 1307 927
Italy 6516 2617 4590 4493 2710 1725 3287 8885
BREAKING 6391 9878 10726 8603 9830 4305 8464 4992
China 5469 5745 4417 5279 5753 4290 3556 3408
现在,如何将其设为 RANK,而不是计数,按 RANK 排序。
非常感谢任何关于如何进步的想法。
【问题讨论】:
标签: mysql clickhouse