【问题标题】:MySQL/Clickhouse combine/RANK selected results with a set label and keep the remaining uniqueMySQL/Clickhouse 组合/RANK 选择的结果与一组标​​签并保持其余的唯一
【发布时间】:2020-03-17 20:03:39
【问题描述】:

我在数天/数月内收集了许多推文,并正在监控词组(主题标签)。推文的主数据库每天收集应用程序 500 万条推文,并将主题标签提取到单独的表中。计算这些主题标签以显示在一段时间(天/月)内形成的热图。

收集数据库是 MYSQL,其中主推文表是应用程序 500 M 记录,主题标签表是应用程序 175 M 记录。然后复制到 Clickhouse 进行分析。

以冠状病毒为例,下面的列表显示,可以将多个主题标签更好地组合在一起,以提高统计数据的可见性。

问题:

  1. 如何在一个定义的“标签”或别名中添加用于选择相似词/主题标签的过滤器? 完成
  2. 如何使用多个别名,每组过滤器/选择标准都使用? 完成
  3. 如何使用 RANK 或类似的方法不使用累积计数列出,而是使用 RANK?
SELECT (match(hashtag, '[Cc]orona.*|COVID.*|[Cc]ovid.*') ? 'COVID19' : hashtag) as Hashtag,
  SUM(CASE when datetime between now() - interval 1 day AND now() then 1 END) "Today",
  SUM(CASE when datetime between now() - interval 2 day AND now() - interval 1 day then 1 END) "Today -1",
  SUM(CASE when datetime between now() - interval 3 day AND now() - interval 2 day then 1 END) "Today -2",
  SUM(CASE when datetime between now() - interval 4 day AND now() - interval 3 day then 1 END) "Today -3",
  SUM(CASE when datetime between now() - interval 5 day AND now() - interval 4 day then 1 END) "Today -4",
  SUM(CASE when datetime between now() - interval 6 day AND now() - interval 5 day then 1 END) "Today -5",
  SUM(CASE when datetime between now() - interval 7 day AND now() - interval 6 day then 1 END) "Today -6",
  SUM(CASE when datetime between now() - interval 8 day AND now() - interval 7 day then 1 END) "Today -7"
FROM twitterDBhashtags
group by Hashtag 
order by "Today" DESC limit 20;

twitterDBhashtags 表上:

id          BIGINT(20)      PK
hashtag     VARCHAR(75)
datetime    DATETIME

产生这个结果:

Hashtag             Today       Today -1    Today -2    Today -3    Today -4    Today -5    Today -6    Today -7
------------------------------------------------------------------------------------------------------------------
COVID19             245 799     253 088     241 731      226 515     249 281    84 088       149 789    117 015    
BhulaDungaFirstLook 36 379       34                         
StPatricksDay       12 622       410         251         233         307         72         194         176    
BhulaDungaWithSid   12 595       47                         
QuarantineLife      10 742       2 339       59                                             1           1
UPDATE              9 432        534         1 063       340         884         215         336         242    
BREAKING            7 038        11 737      10 434      6 985       10 726      4 345       6 748       5 091    
SidNaaz             6 012        2 247       4 115       1 692       2 065       241         1 502       1 236    
China               5 840        4 803       4 887       5 472       7 039       2 086       3 392       3 748    
FamiliesFirst       4 578        420         902         6 480       5 952       1 326          
iHeartAwards        4 540        5 274       6 846       5 412       6 747       2 500       6 559       4 767    
HomeOfSoul_Satlok   4 341                               
TrumpVirus          4 094        750         752         1 381       1 935       624         590         1 176    
100WAYS             4 055        106         125         22                 
TEAMWANG            4 014        101         107         78          34          21         160         127    
ChineseVirus        3 919        1           3           4           69          32         15           2    
ShipsGoingDown      3 755        71

在@vladimir 提供非常好的输入后,使用

SELECT case when match(hashtag, '[Cc]orona.*|COVID.*|[Cc]ovid.*') then 'COVID19' 
            when match(hashtag, 'Bhula.*') then 'Bhula'
            else hashtag END
            as Hashtag,
  SUM(CASE when datetime between now() - interval 1 day AND now() then 1 END) "Today",
  SUM(CASE when datetime between now() - interval 2 day AND now() - interval 1 day then 1 END) "Today -1",
  SUM(CASE when datetime between now() - interval 3 day AND now() - interval 2 day then 1 END) "Today -2",
  SUM(CASE when datetime between now() - interval 4 day AND now() - interval 3 day then 1 END) "Today -3",
  SUM(CASE when datetime between now() - interval 5 day AND now() - interval 4 day then 1 END) "Today -4",
  SUM(CASE when datetime between now() - interval 6 day AND now() - interval 5 day then 1 END) "Today -5",
  SUM(CASE when datetime between now() - interval 7 day AND now() - interval 6 day then 1 END) "Today -6",
  SUM(CASE when datetime between now() - interval 8 day AND now() - interval 7 day then 1 END) "Today -7"
FROM twitterDBhashtags
group by Hashtag 
order by "Today" DESC limit 10;

我明白了(请注意这是实时数据,因此上述结果的总和将不准确)

Hashtag         Today   Today -1    Today -2    Today -3    Today -4    Today -5    Today -6    Today -7
---------------------------------------------------------------------------------------------------------
COVID19         241825  260486      237838      236318      222989      129159      161506      122959
Bhula           35267   22372       856         1           13          4           1           12
StPatricksDay   14776   1147        254         239         271         130         198         167
QuarantineLife  10442   5140        169         1                       2
AsiManshiDebut  8900                            
LuzonLockdown   6764    9                       
FamiliesFirst   6563    382         439         3285        8854        1307        927 
Italy           6516    2617        4590        4493        2710        1725        3287        8885
BREAKING        6391    9878        10726       8603        9830        4305        8464        4992
China           5469    5745        4417        5279        5753        4290        3556        3408

现在,如何将其设为 RANK,而不是计数,按 RANK 排序。

非常感谢任何关于如何进步的想法。

【问题讨论】:

    标签: mysql clickhouse


    【解决方案1】:

    我会在 WHERE 子句中定义日期期间,而不是在 SELECT 中列出它们:

    SELECT toStartOfDay(datetime) day, match(hashtag, '[Cc]orona.*|COVID.*') ? 'COVID19' : hashtag as hashtag, count() tweets_count
    FROM (
      /* test data */
      SELECT toDateTime(data.1) datetime, data.2 hashtag
      FROM (
        SELECT arrayJoin([
          ('2020-03-01 10:10:10', 'coronavirus'), 
          ('2020-03-01 12:12:12', 'COVID'), 
          ('2020-03-05 10:10:10', 'StPatricksDay'), 
          ('2020-03-15 01:01:01', 'Coronavirus')]) data)
    )
    WHERE datetime >= '2020-03-01 00:00:00' AND datetime < '2020-04-01 00:00:00'
    GROUP BY day, hashtag;
    
    /* result
    ┌─────────────────day─┬─hashtag───────┬─tweets_count─┐
    │ 2020-03-01 00:00:00 │ COVID19       │            2 │
    │ 2020-03-15 00:00:00 │ COVID19       │            1 │
    │ 2020-03-05 00:00:00 │ StPatricksDay │            1 │
    └─────────────────────┴───────────────┴──────────────┘
    */
    

    SELECT (match(hashtag, '[Cc]orona.*|COVID.*') ? 'COVID19' : hashtag) as hashtag, sum(day1) day1, sum(day2) day2, sum(day3) day3, sum(day4) day4, sum(day5) day5, sum(day6) day6, sum(day7) day7
    FROM (
      /* test data */
      SELECT data.1 AS hashtag, data.2 AS day1, data.3 AS day2, data.4 AS day3, data.5 AS day4, data.6 AS day5, data.7 AS day6, data.8 AS day7
      FROM
      (
          SELECT arrayJoin([
            ('coronavirus', 67299, 60633, 53780, 55375, 59866, 27150, 47824), 
            ('COVID', 62502, 50998, 50365, 51554, 50062, 23140, 40908), 
            ('BhulaDungaFirstLook', 35524, 34, 0, 0, 0, 0, 0), 
            ('Coronavirus', 14076, 15297, 12321, 16496, 16263, 7028, 9975), 
            ('CoronavirusOutbreak', 13020, 9410, 2597, 1044, 1853, 950, 2436), 
            ('BhulaDungaWithSid', 12190, 47, 0, 0, 0, 0, 0), 
            ('StPatricksDay', 10426, 374, 244, 233, 282, 79, 213), 
            ('QuarantineLife', 10110, 1477, 56, 0, 1, 0, 0), 
            ('COVID2019', 9892, 2085, 1417, 2009, 2929, 1568, 4918)]) AS data
      ))
    GROUP BY hashtag;
    
    /* result
    ┌─hashtag─────────────┬───day1─┬───day2─┬───day3─┬───day4─┬───day5─┬──day6─┬───day7─┐
    │ COVID19             │ 166789 │ 138423 │ 120480 │ 126478 │ 130973 │ 59836 │ 106061 │
    │ StPatricksDay       │  10426 │    374 │    244 │    233 │    282 │    79 │    213 │
    │ QuarantineLife      │  10110 │   1477 │     56 │      0 │      1 │     0 │      0 │
    │ BhulaDungaFirstLook │  35524 │     34 │      0 │      0 │      0 │     0 │      0 │
    │ BhulaDungaWithSid   │  12190 │     47 │      0 │      0 │      0 │     0 │      0 │
    └─────────────────────┴────────┴────────┴────────┴────────┴────────┴───────┴────────┘
    */
    

    【讨论】:

    • 谢谢。使用 WHERE 而不是使用 SELECT 列出有什么好处(我不是 MYSQL 专业人士)?根据我从您的回复中了解到的情况,我需要对 arrayJoin 中的数据进行硬编码,这将大大违背目的。我确实尝试过使用您建议的“(match(hashtag, '[Cc]orona.*|COVID.*|[Cc]ovid.*') ? 'COVID19' : hashtag) as Hashtag”,这非常好。如何将其扩展为在同一个选择/匹配中添加其他组合(如“COVID19”)并将其他组合(如 fx“BhulaDunga.*”组合为 Bhula?
    • 1.这些查询特定于 ClickHouse,而不是 MySQL 2. arrayJoin 仅用于提供测试数据,您不应该使用它 3. 如何扩展 match - 看看clickhouse.tech/docs/en/query_language/functions/…
    • 感谢您的意见,非常感谢。不幸的是,我对您提供的链接没有多大意义。最初的“匹配”为列出的字符串变体创建了一个别名。这可以扩展到第二个,第三个..别名,它们各自的字符串变体与所述别名相关联吗?
    • 看起来你需要像这样:select case when match(hashtag, '[Cc]orona.*|COVID.*|[Cc]ovid.')然后是 'COVID19' 当匹配(hashtag, 'he.*')然后是 'HELLO' else hashtag end from (select arrayJoin(['corona1', 'Corona', 'hello', 'unknown']) hashtag);*
    • 是的,这似乎可行:-),完美地使用SELECT case when match(hashtag, '[Cc]orona.*|COVID.*|[Cc]ovid.*') then 'COVID19' when match(hashtag, 'Bhula.*') then 'Bhula' else hashtag END as Hashtag, 最后一个问题是对所有不同的“天”进行排名,并仅列出每列的排名。
    猜你喜欢
    • 1970-01-01
    • 1970-01-01
    • 1970-01-01
    • 1970-01-01
    • 1970-01-01
    • 1970-01-01
    • 1970-01-01
    • 1970-01-01
    • 1970-01-01
    相关资源
    最近更新 更多