【Question Title】: BigQuery: finding text in a column based on another table
【Posted】: 2021-08-30 11:43:28
【Question】:

I want to return every blocklist term that appears in each product description.

with blocklist as (

    select 'instagram' as blocklist union all
    select 'facebook' as blocklist union all
    select 'whatsapp web'
),
products as (

    select 'seller1' as seller, 'Tenis Nike 43 call me on instagram or facebook'    as product union all
    select 'seller1' as seller, 'TV 42 sansung link whatsapp WEB or INSTAGRAM'      as product union all
    select 'seller2' as seller, 'TV 42 sansung link'                                as product

)
select
     seller
    ,product
    ,blocklists
from
    ?

The expected result looks like this:

seller  | product                                        | blocklists
seller1 | Tenis Nike 43 call me on instagram or facebook | instagram,facebook
seller1 | TV 42 sansung link whatsapp WEB or INSTAGRAM   | whatsapp web,instagram
seller2 | TV 42 sansung link                             | null

Do I need to convert the blocklist into an array and use a regular expression in the select ...?

【Comments】:

    Tags: arrays regex select google-bigquery match


    【Solution 1】:

    This works for your example:

    with blocklist as (
        select 'instagram' as blocklist union all
        select 'facebook' as blocklist union all
        select 'whatsapp web'
    ),
    products as (
        select 'seller1' as seller, 'Tenis Nike 43 call me on instagram or facebook'    as product union all
        select 'seller1' as seller, 'TV 42 sansung link whatsapp WEB or INSTAGRAM'      as product union all
        select 'seller2' as seller, 'TV 42 sansung link'                                as product
    )
    select p.*,
           -- collect every blocklist term found in this product (case-insensitive)
           (select array_agg(bl.blocklist)
            from blocklist bl
            where lower(p.product) like concat('%', lower(bl.blocklist), '%')
           ) as blocklists
    from products p
    

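    【Comments】:

    • Hi Gordon! Thanks a lot. Is it possible to convert this array into a comma-separated string?
    • I changed array_agg to string_agg and it worked! Thank you very much!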
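
The change described in the last comment can be sketched as follows. This is a minimal sketch that assumes the same `blocklist` and `products` sample data as in the answer; the `blocklists` column alias and the `','` delimiter are choices made here for illustration. `string_agg` returns NULL when no rows match, so a product with no blocklist term gets NULL. The order of the terms inside the string is not guaranteed unless an ORDER BY is added inside the aggregate.

```sql
with blocklist as (
    select 'instagram' as blocklist union all
    select 'facebook' union all
    select 'whatsapp web'
),
products as (
    select 'seller1' as seller, 'Tenis Nike 43 call me on instagram or facebook' as product union all
    select 'seller1', 'TV 42 sansung link whatsapp WEB or INSTAGRAM' union all
    select 'seller2', 'TV 42 sansung link'
)
select p.*,
       -- string_agg joins the matching terms into one comma-separated string
       (select string_agg(bl.blocklist, ',')
        from blocklist bl
        where lower(p.product) like concat('%', lower(bl.blocklist), '%')
       ) as blocklists
from products p
```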
    【Solution 2】:

    Consider the approach below:

    select p.*,
      lower(array_to_string(regexp_extract_all(product, r'(?i)' || list), ', ')) blocklists
    from products p, (select string_agg(b.blocklist, '|') list from blocklist b) 
    

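    If applied to the sample data in your question, the output is:

    [output screenshot not preserved]

    You can try it yourself with the query below: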

    with blocklist as (
      select 'instagram' as blocklist union all
      select 'facebook' as blocklist union all
      select 'whatsapp web'
    ), products as (
      select 'seller1' as seller, 'Tenis Nike 43 call me on instagram or facebook'    as product union all
      select 'seller1' as seller, 'TV 42 sansung link whatsapp WEB or INSTAGRAM'      as product union all
      select 'seller2' as seller, 'TV 42 sansung link'                                as product
    )
    select p.*,
      lower(array_to_string(regexp_extract_all(product, r'(?i)' || list), ', ')) blocklists
    from products p, (select string_agg(b.blocklist, '|') list from blocklist b) 
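
Because the pattern is built by concatenating the raw blocklist terms with `|`, any term that contains a regex metacharacter (`?`, `+`, `(` and so on) changes the pattern or makes it fail to parse. The sketch below escapes each term before joining. It assumes the terms are meant as literal text, which the thread does not state, and the character class of metacharacters is a choice made here, not part of the original answer.

```sql
with blocklist as (
  select 'instagram' as blocklist union all
  select 'facebook' union all
  select 'whatsapp web'
), products as (
  select 'seller1' as seller, 'Tenis Nike 43 call me on instagram or facebook' as product union all
  select 'seller1', 'TV 42 sansung link whatsapp WEB or INSTAGRAM' union all
  select 'seller2', 'TV 42 sansung link'
)
select p.*,
  lower(array_to_string(regexp_extract_all(product, r'(?i)' || list), ', ')) blocklists
from products p,
  -- prefix each regex metacharacter with a backslash, then join the terms with |
  (select string_agg(regexp_replace(b.blocklist, r'([.*+?^${}()|\[\]\\])', r'\\\1'), '|') list
   from blocklist b)
```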
    

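    【Comments】:

    • I tried regexp_extract_all, but BigQuery tells me: Cannot parse regular expression: missing argument to repetition operator: ?
    • Check back in a few minutes; I will add a test example to my answer for you to play with.
    • Added the example to my answer!
    • One note: my products table has more than 1 billion rows. I want to avoid a Cartesian product (products x blocklist), because BigQuery returns the message "Cannot query rows larger than 100MB limit". Is there a way to do the match row by row? For example: select p.*, p.product => in list or array of blocklist from products p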
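
One way to phrase the row-by-row match asked about in the last comment is to collapse the blocklist into a single array and unnest it inside a per-row subquery, so `products` is joined to exactly one row. This is only a sketch: the names `terms` and `term` are illustrative, and the thread does not establish whether this avoids the 100MB error on a table of that size.

```sql
with blocklist as (
  select 'instagram' as blocklist union all
  select 'facebook' union all
  select 'whatsapp web'
), products as (
  select 'seller1' as seller, 'Tenis Nike 43 call me on instagram or facebook' as product union all
  select 'seller1', 'TV 42 sansung link whatsapp WEB or INSTAGRAM' union all
  select 'seller2', 'TV 42 sansung link'
)
select p.*,
  -- check each term of the single blocklist array against this row only
  (select string_agg(term, ',')
   from unnest(bl.terms) as term
   where strpos(lower(p.product), lower(term)) > 0
  ) as blocklists
from products p
-- the subquery yields exactly one row, so this join does not multiply product rows
cross join (select array_agg(blocklist) as terms from blocklist) bl
```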