【Title】: SQL Ranking Functions for Sorting and Aggregating Conversation Data
【Posted】: 2020-12-04 03:14:12
【Question】:

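I'm looking to rank/aggregate conversation data in SQL (BigQuery specifically). Each row of the data represents one sentence. The sample data below has speaker, sentence, and sentence_start columns. desired_rank is the target result (or something similar).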

I believe there should be a window function, such as RANK, LAG, or FIRST_VALUE, that can produce the desired rank programmatically.

The closest I've gotten so far is:

WITH DATA AS (
SELECT 'Speaker A' as speaker, 'Sentence 1' as sentence, 1 as sentence_start, 1 as desired_rank
UNION ALL SELECT 'Speaker A' as speaker, 'Sentence 2' as sentence, 9 as sentence_start, 1 as desired_rank
UNION ALL SELECT 'Speaker B' as speaker, 'Sentence 3' as sentence, 27 as sentence_start, 2 as desired_rank
UNION ALL SELECT 'Speaker C' as speaker, 'Sentence 4' as sentence, 46 as sentence_start, 3 as desired_rank
UNION ALL SELECT 'Speaker A' as speaker, 'Sentence 5' as sentence, 78 as sentence_start, 4 as desired_rank
)
SELECT speaker, sentence, sentence_start, desired_rank,

FIRST_VALUE(sentence_start)
  OVER (
    PARTITION BY speaker
    ORDER BY sentence_start
    RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)

FROM DATA
ORDER BY sentence_start

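The problem with the result is that every Speaker A row gets the value 1, including Sentence 5, which should get 4 (or something similar).

Thanks for your help!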

【Comments】:

    Tags: sql google-bigquery window-functions ranking-functions


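    【Solution 1】:

    Figured it out. Each row needs to be joined to the next row to detect a change of speaker. I also added the complication of Speaker A saying both Sentence 5 and Sentence 6.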

    WITH data AS (
    SELECT          'Speaker A' as speaker, 'Sentence 1' as sentence, 1 as sentence_start, 1 as desired_rank
    UNION ALL SELECT 'Speaker A' as speaker, 'Sentence 2' as sentence, 9 as sentence_start, 1 as desired_rank
    UNION ALL SELECT 'Speaker B' as speaker, 'Sentence 3' as sentence, 27 as sentence_start, 2 as desired_rank
    UNION ALL SELECT 'Speaker C' as speaker, 'Sentence 4' as sentence, 46 as sentence_start, 3 as desired_rank
    UNION ALL SELECT 'Speaker A' as speaker, 'Sentence 5' as sentence, 78 as sentence_start, 4 as desired_rank
    UNION ALL SELECT 'Speaker A' as speaker, 'Sentence 6' as sentence, 90 as sentence_start, 4 as desired_rank
    ),
    -- Attach the start of the following sentence to each row; the last row gets a sentinel value.
    data_ranked AS (
    SELECT speaker, sentence, sentence_start, desired_rank,
    COALESCE(LEAD(sentence_start) OVER (ORDER BY sentence_start ASC), 9999999999999) AS next_sentence_start
    FROM data
    ),
    -- Look up the next sentence and flag the rows after which the speaker changes.
    sentence_information AS (
    SELECT sentence_information.speaker, sentence_information.sentence, sentence_information.sentence_start, sentence_information.next_sentence_start
      , CASE WHEN sentence_information.speaker <> next_sentence_information.speaker THEN TRUE ELSE FALSE END AS next_sentence_speaker_change_indicator
    FROM data_ranked AS sentence_information
      LEFT OUTER JOIN data AS next_sentence_information ON sentence_information.next_sentence_start = next_sentence_information.sentence_start
    ),
    -- Join each row to the previous one: a change flagged on the previous row means this row starts a new paragraph.
    -- The running SUM of those flags numbers the paragraphs (0, 0, 1, 2, 3, 3 for the sample data).
    compiled_sentence_information AS (
    SELECT sentence_information.speaker, sentence_information.sentence, sentence_information.sentence_start, sentence_information.next_sentence_start
    , COALESCE(next_sentence_information.next_sentence_speaker_change_indicator, FALSE) AS speaker_change_indicator
    , CASE WHEN COALESCE(next_sentence_information.next_sentence_speaker_change_indicator, FALSE) THEN 1 ELSE 0 END AS speaker_change_number
    , SUM(CASE WHEN COALESCE(next_sentence_information.next_sentence_speaker_change_indicator, FALSE) THEN 1 ELSE 0 END) OVER (ORDER BY sentence_information.sentence_start ASC) AS speaker_sentence_rank
    , CASE WHEN sentence_information.next_sentence_start = 9999999999999 THEN TRUE ELSE sentence_information.next_sentence_speaker_change_indicator END AS final_sentence_in_paragraph
    FROM sentence_information
      LEFT OUTER JOIN sentence_information AS next_sentence_information ON sentence_information.sentence_start = next_sentence_information.next_sentence_start
    ),
    -- Running concatenation within each paragraph; the last row of a paragraph holds its full text.
    paragraphs AS (
    SELECT *, STRING_AGG(sentence, " ") OVER (PARTITION BY speaker_sentence_rank ORDER BY sentence_start) AS paragraph
    FROM compiled_sentence_information
    )
    -- Keep only the last sentence of each paragraph, which carries the complete text.
    SELECT speaker, paragraph
    FROM paragraphs
    WHERE final_sentence_in_paragraph = TRUE
    ORDER BY sentence_start
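
A shorter route to the same grouping, offered as an alternative sketch and not as this answer's own method: flag each row whose speaker differs from the previous row with LAG, then take a running SUM of the flags. The sketch runs the query through Python's built-in sqlite3 module as a stand-in for BigQuery (an assumption made so it can be run locally; window functions need SQLite 3.25 or later). The names is_new_turn, turn_rank, rank_turns, and build_paragraphs are invented for illustration.

```python
import sqlite3

# Sample rows from the answer above: (speaker, sentence, sentence_start).
ROWS = [
    ("Speaker A", "Sentence 1", 1),
    ("Speaker A", "Sentence 2", 9),
    ("Speaker B", "Sentence 3", 27),
    ("Speaker C", "Sentence 4", 46),
    ("Speaker A", "Sentence 5", 78),
    ("Speaker A", "Sentence 6", 90),
]

QUERY = """
WITH flagged AS (
  SELECT speaker, sentence, sentence_start,
         -- 0 when the previous row has the same speaker, otherwise 1
         -- (the first row has no previous row, so it also gets 1)
         CASE WHEN LAG(speaker) OVER (ORDER BY sentence_start) = speaker
              THEN 0 ELSE 1 END AS is_new_turn
  FROM data
)
SELECT speaker, sentence,
       SUM(is_new_turn) OVER (ORDER BY sentence_start) AS turn_rank
FROM flagged
ORDER BY sentence_start
"""


def rank_turns(rows):
    """Load the rows into an in-memory SQLite table and run the ranking query."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE data (speaker TEXT, sentence TEXT, sentence_start INTEGER)"
    )
    conn.executemany("INSERT INTO data VALUES (?, ?, ?)", rows)
    result = conn.execute(QUERY).fetchall()
    conn.close()
    return result


def build_paragraphs(ranked):
    """Concatenate the sentences that share a turn_rank, in input order."""
    grouped = {}
    for speaker, sentence, turn_rank in ranked:
        grouped.setdefault((turn_rank, speaker), []).append(sentence)
    return [(speaker, " ".join(parts)) for (_, speaker), parts in grouped.items()]


ranked = rank_turns(ROWS)
for row in ranked:
    print(row)
for speaker, paragraph in build_paragraphs(ranked):
    print(speaker, "->", paragraph)
```

For the sample rows, turn_rank comes out as 1, 1, 2, 3, 4, 4, which matches desired_rank. LAG and SUM(...) OVER also exist in BigQuery, so the query text should carry over with little change, though it has only been checked against SQLite here.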
    

    【Comments】:

      【Solution 2】:

      Perhaps one of the RANK, DENSE_RANK, and ROW_NUMBER functions will help. Adding one more row with sentence_start = 9 to DATA brings out the difference between them:

      WITH DATA AS (
        SELECT 'Speaker A' as speaker, 'Sentence 1' as sentence, 1 as sentence_start, 1 as desired_rank
        UNION ALL SELECT 'Speaker A' as speaker, 'Sentence 2' as sentence, 9 as sentence_start, 1 as desired_rank
        UNION ALL SELECT 'Speaker A' as speaker, 'Sentence 2' as sentence, 9 as sentence_start, 1 as desired_rank
        UNION ALL SELECT 'Speaker B' as speaker, 'Sentence 3' as sentence, 27 as sentence_start, 2 as desired_rank
        UNION ALL SELECT 'Speaker C' as speaker, 'Sentence 4' as sentence, 46 as sentence_start, 3 as desired_rank
        UNION ALL SELECT 'Speaker A' as speaker, 'Sentence 5' as sentence, 78 as sentence_start, 4 as desired_rank
      )
      SELECT 
        speaker,
        sentence,
        sentence_start,
        desired_rank,
        RANK() OVER (ORDER BY sentence_start) AS rank,
        DENSE_RANK() OVER (ORDER BY sentence_start) AS dense_rank,
        ROW_NUMBER() OVER (ORDER BY sentence_start) AS row_number
      FROM DATA
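
To see what the three functions return on the tied rows, here is a small sketch that runs an equivalent query through Python's built-in sqlite3 module. SQLite is an assumption here, used as a local stand-in for BigQuery (window functions need SQLite 3.25 or later). The aliases rnk, dense_rnk, and row_num and the helper compare_rankings are invented for illustration.

```python
import sqlite3

# Sample rows from the answer above, including the duplicated sentence_start = 9 row:
# (speaker, sentence, sentence_start, desired_rank).
ROWS_WITH_TIE = [
    ("Speaker A", "Sentence 1", 1, 1),
    ("Speaker A", "Sentence 2", 9, 1),
    ("Speaker A", "Sentence 2", 9, 1),
    ("Speaker B", "Sentence 3", 27, 2),
    ("Speaker C", "Sentence 4", 46, 3),
    ("Speaker A", "Sentence 5", 78, 4),
]

RANKING_QUERY = """
SELECT sentence_start, desired_rank,
       RANK()       OVER (ORDER BY sentence_start) AS rnk,
       DENSE_RANK() OVER (ORDER BY sentence_start) AS dense_rnk,
       ROW_NUMBER() OVER (ORDER BY sentence_start) AS row_num
FROM data
ORDER BY sentence_start, row_num
"""


def compare_rankings(rows):
    """Load the rows into an in-memory SQLite table and compute all three rankings."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE data (speaker TEXT, sentence TEXT, "
        "sentence_start INTEGER, desired_rank INTEGER)"
    )
    conn.executemany("INSERT INTO data VALUES (?, ?, ?, ?)", rows)
    result = conn.execute(RANKING_QUERY).fetchall()
    conn.close()
    return result


for row in compare_rankings(ROWS_WITH_TIE):
    print(row)
# (1, 1, 1, 1, 1)
# (9, 1, 2, 2, 2)
# (9, 1, 2, 2, 3)
# (27, 2, 4, 3, 4)
# (46, 3, 5, 4, 5)
# (78, 4, 6, 5, 6)
```

RANK leaves a gap after the tie, DENSE_RANK does not, and ROW_NUMBER ignores ties altogether. Because all three order by sentence_start alone, none of them reproduces desired_rank for this data: consecutive sentences by the same speaker only share a value when their sentence_start values are equal.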
      

      【Comments】:
