【Question title】: Extract numbers from string in Google BigQuery using regex
【Posted】: 2016-03-21 07:58:15
【Question】:

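I would like to know whether it is possible to extract all numbers from a string in BigQuery using a regular expression.

I thought the following would work, but it only returns the first hit. Is there a way to extract all hits?

My use case is that I basically want to get the largest number from a URL, since that is most likely the post_id I need to join on.

Here is an example of what I mean: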

SELECT
  mystr,
  REGEXP_EXTRACT(mystr, r'(\d+)') AS nums
FROM
  (SELECT 'this is a string with some 666 numbers 999 in it 333' AS mystr),
  (SELECT 'just one number 123 in this one ' AS mystr),
  (SELECT '99' AS mystr),
  (SELECT 'another -2 example 99' AS mystr),
  (SELECT 'another-8766 example 99' AS mystr),
  (SELECT 'http://somedomain.com/2015/12/this-is-a-post-with-id-in-url-99999' AS mystr),
  (SELECT 'http://somedomain.com/2015/12/this-is-a-post-with-id-in-url-99999/gallery/001' AS mystr),
  (SELECT 'http://somedomain.com/2015/12/this-is-a-post-with-id-in-url-99999/print-preview' AS mystr)

The result I get from this is:

[
  {
    "mystr": "this is a string with some 666 numbers 999 in it 333",
    "nums": "666"
  },
  {
    "mystr": "just one number 123 in this one ",
    "nums": "123"
  },
  {
    "mystr": "99",
    "nums": "99"
  },
  {
    "mystr": "another -2 example 99",
    "nums": "2"
  },
  {
    "mystr": "another-8766 example 99",
    "nums": "8766"
  },
  {
    "mystr": "http://somedomain.com/2015/12/this-is-a-post-with-id-in-url-99999",
    "nums": "2015"
  },
  {
    "mystr": "http://somedomain.com/2015/12/this-is-a-post-with-id-in-url-99999/gallery/001",
    "nums": "2015"
  },
  {
    "mystr": "http://somedomain.com/2015/12/this-is-a-post-with-id-in-url-99999/print-preview",
    "nums": "2015"
  }
]

【Comments】:

    Tags: regex google-bigquery


    【Solution 1】:

    As you use regular expressions in BigQuery more and more, you will find that the implementation is currently quite limited:
    BigQuery Regular expression functions
    re2 Syntax

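    So most likely you will soon need to do something like the following.
    Note that for your current specific example, the code below has absolutely no advantage over the simple solution provided by @Cybril.
    This solution is aimed more at the needs you are likely to have in the near future.
    It uses a JavaScript UDF, which gives you the power of the JavaScript regexp implementation.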
    BigQuery User-Defined Functions

    SELECT mystr, MAX(INTEGER(number)) AS max_number FROM JS(
      // input table
      (SELECT mystr FROM
        (SELECT 'this is a string with some 666 numbers 999 in it 333' AS mystr),
        (SELECT 'just one number 123 in this one ' AS mystr),
        (SELECT '99' AS mystr),
        (SELECT 'another -2 example 99' AS mystr),
        (SELECT 'another-8766 example 99' AS mystr),
        (SELECT 'http://somedomain.com/2015/12/this-is-a-post-with-id-in-url-99999' AS mystr),
        (SELECT 'http://somedomain.com/2015/12/this-is-a-post-with-id-in-url-99999/gallery/001' AS mystr),
        (SELECT 'http://somedomain.com/2015/12/this-is-a-post-with-id-in-url-99999/print-preview' AS mystr)
      ) ,
      // input columns
        mystr,
      // output schema
      "[
      {name: 'mystr', type: 'string'},
      {name: 'number', type: 'string'}
      ]",
      // function
      "function(r, emit){
        // match() returns null when the string has no digits, so fall back to an empty array
        var numbers = r.mystr.match(/(\d+)/g) || [];
        for (var i = 0; i < numbers.length; i++) {
          emit({
            mystr: r.mystr,
            number: numbers[i]
          });
        }
      }"
    )
    GROUP BY 1
    

    Of course, you can also move the logic that finds the maximum into the UDF itself, which eliminates the extra GROUP BY.
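
    A minimal sketch of what that could look like, written as a standalone JavaScript function so it can be tried outside BigQuery. The name `maxNumber` and the output field `max_number` are assumptions for illustration; inside the `JS(...)` call the function would stay anonymous and the output schema would need a matching `{name: 'max_number', type: 'integer'}` entry.

    ```javascript
    // Hypothetical variant of the UDF body: emits at most one row per input
    // row, carrying the largest number found, so no GROUP BY is needed.
    function maxNumber(r, emit) {
      // match() returns null when there are no digits at all
      var numbers = r.mystr.match(/\d+/g) || [];
      if (numbers.length === 0) {
        return; // nothing to emit for strings without digits
      }
      // Compare numerically, not as strings
      var max = numbers.reduce(function (best, n) {
        return Math.max(best, parseInt(n, 10));
      }, -Infinity);
      emit({ mystr: r.mystr, max_number: max });
    }

    // Local check with a stand-in emit callback
    var demoRows = [];
    maxNumber(
      { mystr: 'http://somedomain.com/2015/12/this-is-a-post-with-id-in-url-99999/gallery/001' },
      function (row) { demoRows.push(row); }
    );
    console.log(demoRows);
    ```

    Strings without any digits simply produce no output row here; whether that or a NULL row is preferable depends on how the result is joined afterwards.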

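    【Comments】:

    • Thanks. I figured I might need to take the UDF approach, but I am still just learning in this area. Thanks for sharing. I think some people on the BQ team mentioned they are working on a catalog of functions they hope to release at some stage, so I am looking forward to that too.
    【Solution 2】: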

    After some digging, I ended up with this solution:

    SELECT
      mystr,
      GROUP_CONCAT(SPLIT(REGEXP_REPLACE(mystr, r'[^\d]+', ','))) AS nums
    FROM
      (SELECT 'this is a string with some 666 numbers 999 in it 333' AS mystr),
      (SELECT 'just one number 123 in this one ' AS mystr),
      (SELECT '99' AS mystr),
      (SELECT 'another -2 example 99' AS mystr),
      (SELECT 'another-8766 example 99' AS mystr),
      (SELECT 'http://somedomain.com/2015/12/this-is-a-post-with-id-in-url-99999' AS mystr),
      (SELECT 'http://somedomain.com/2015/12/this-is-a-post-with-id-in-url-99999/gallery/001' AS mystr),
      (SELECT 'http://somedomain.com/2015/12/this-is-a-post-with-id-in-url-99999/print-preview' AS mystr)
    

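    How it works:

    • First, the regular expression matches every run of non-digit characters and replaces it with a comma
    • Then SPLIT breaks the result apart; empty results are discarded
    • GROUP_CONCAT is only there to display the result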
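
    The same three steps can be sketched in plain JavaScript to see what each one contributes. This is only an illustration of the idea, not BigQuery code; the function name `extractNumbers` is made up, and the explicit `filter` stands in for the "empty results are discarded" behaviour described above.

    ```javascript
    // Illustrative equivalent of REGEXP_REPLACE + SPLIT from the query above
    function extractNumbers(s) {
      return s
        .replace(/[^\d]+/g, ',')   // every run of non-digits becomes one comma
        .split(',')                // break into pieces at the commas
        .filter(function (piece) { // drop empty pieces from leading/trailing commas
          return piece !== '';
        });
    }

    console.log(extractNumbers('another -2 example 99')); // [ '2', '99' ]
    ```

    Note that the minus sign in `-2` is treated as just another non-digit, which is why the sign is lost.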

    【Comments】:

    • Note that this fails for floating-point or negative numbers ;), but it can easily be improved.
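
    One possible direction for that improvement, sketched in JavaScript (for example inside a UDF) rather than in BigQuery's own regex functions. The pattern and the name `extractSignedNumbers` are assumptions, not something from the answer above.

    ```javascript
    // Sketch: also capture an optional leading minus sign and a decimal part
    function extractSignedNumbers(s) {
      var matches = s.match(/-?\d+(?:\.\d+)?/g) || []; // null when nothing matches
      return matches.map(Number);
    }

    console.log(extractSignedNumbers('another -2 example 99')); // [ -2, 99 ]
    console.log(extractSignedNumbers('price 3.14 and -0.5'));   // [ 3.14, -0.5 ]
    ```

    A caveat for the URL use case: a hyphen used as a separator, as in `another-8766` or `...-in-url-99999`, is read as a minus sign by this pattern, so it is probably not what you want when hunting for a post_id.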