[Question title]: Remove duplicates from a column whose values are separated by a pipe
[Posted]: 2020-08-31 23:35:55
[Question]:

The data looks like this:

category_id   category_name   associated_keys
111           Books           CC34DE5W|SQA7ZZ87|LM24NO3P|SQA7ZZ87
222           Office          LM24NO3P|AAB12B34
444           Furniture       X34YY78Z|LM24NO3P|SQA7ZZ87|SEF5C6T4|CC34DE5W|AAB12B34
222           Office          X34YY78Z|X34YY78Z

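I want to remove the duplicate keys from the associated_keys column in each row (rows that share a category_id stay separate). The output should look like this: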

category_id   category_name   associated_keys
111           Books           CC34DE5W|SQA7ZZ87|LM24NO3P
222           Office          LM24NO3P|AAB12B34
444           Furniture       X34YY78Z|LM24NO3P|SQA7ZZ87|SEF5C6T4|CC34DE5W|AAB12B34
222           Office          X34YY78Z

[Comments]:

    Tags: sql string csv google-bigquery


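    [Solution 1]:

    The following is for BigQuery Standard SQL. It first merges the rows that share a category_id, then removes duplicate keys from the merged string: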

    #standardSQL
    SELECT category_id, category_name, 
      (SELECT STRING_AGG(DISTINCT key, '|') FROM UNNEST(SPLIT(associated_keys, '|')) key) associated_keys
    FROM (
      SELECT category_id, category_name, STRING_AGG(associated_keys, '|') AS associated_keys
      FROM `project.dataset.data` 
      GROUP BY category_id, category_name  
    )   
    

    Applied to the sample data from the question, the output is:

    Row category_id category_name   associated_keys  
    1   111         Books           CC34DE5W|SQA7ZZ87|LM24NO3P   
    2   222         Office          LM24NO3P|AAB12B34|X34YY78Z   
    3   444         Furniture       X34YY78Z|LM24NO3P|SQA7ZZ87|SEF5C6T4|CC34DE5W|AAB12B34       
    

    If you do not want to group by category_id (as in your previous question), use the query below instead:

    #standardSQL
    SELECT category_id, category_name, 
      (SELECT STRING_AGG(DISTINCT key, '|') FROM UNNEST(SPLIT(associated_keys, '|')) key) associated_keys
    FROM `project.dataset.data` 
    

    with this output:

    Row category_id category_name   associated_keys  
    1   111         Books           CC34DE5W|SQA7ZZ87|LM24NO3P   
    2   222         Office          LM24NO3P|AAB12B34    
    3   444         Furniture       X34YY78Z|LM24NO3P|SQA7ZZ87|SEF5C6T4|CC34DE5W|AAB12B34    
    4   222         Office          X34YY78Z
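
    One caveat: the expected output in the question keeps each key at the position of its first occurrence, while STRING_AGG(DISTINCT ...) does not specify the order in which values are concatenated. If the order matters, a minimal sketch along the following lines may help. It assumes the same `project.dataset.data` table; the aliases pos and first_pos are illustrative names that do not appear in the original answer.

    ```sql
    #standardSQL
    SELECT category_id, category_name,
      (
        -- Concatenate the distinct keys in order of first appearance
        SELECT STRING_AGG(key, '|' ORDER BY first_pos)
        FROM (
          -- WITH OFFSET exposes each key's zero-based position in the split array;
          -- MIN(pos) keeps the earliest position of every distinct key
          SELECT key, MIN(pos) AS first_pos
          FROM UNNEST(SPLIT(associated_keys, '|')) AS key WITH OFFSET AS pos
          GROUP BY key
        )
      ) AS associated_keys
    FROM `project.dataset.data`
    ```

    Grouping by key replaces DISTINCT here, so the ORDER BY inside STRING_AGG can refer to the position of each key's first occurrence.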
    

    [Discussion]:

      [Solution 2]:

      Do not store values like this as strings! BigQuery offers arrays. So I will show you how to convert the result from a string to an array:

      with t as (
            select 111 as category_id, 'Books' as category_name, 'CC34DE5W|SQA7ZZ87|LM24NO3P|SQA7ZZ87' as associated_keys union all
            select 222, 'Office', 'LM24NO3P|AAB12B34' union all
            select 444, 'Furniture', 'X34YY78Z|LM24NO3P|SQA7ZZ87|SEF5C6T4|CC34DE5W|AAB12B34' union all
            select 222, 'Office', 'X34YY78Z|X34YY78Z'
           )
      select t.* except (associated_keys),
             (select array_agg(distinct key)
              from unnest(split(t.associated_keys, '|')) key
             ) as associated_keys
      from t;
      

      If you really want to reconstruct a string, you can use string_agg(), but I don't recommend it.

      [Discussion]:
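
      For completeness, here is a hedged sketch of turning the de-duplicated array back into a pipe-separated string. It uses ARRAY_TO_STRING rather than the string_agg() mentioned above, purely to show that the array form can be converted back when a string is required. The single-row CTE is made-up sample data, and the order of the keys in the result is not guaranteed.

      ```sql
      with t as (
            -- hypothetical one-row sample, only for illustration
            select 222 as category_id, 'Office' as category_name, 'X34YY78Z|X34YY78Z|AAB12B34' as associated_keys
           )
      select t.* except (associated_keys),
             -- build the distinct array first, then join its elements with '|'
             array_to_string(
               (select array_agg(distinct key)
                from unnest(split(t.associated_keys, '|')) key
               ), '|') as associated_keys
      from t;
      ```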
