【Question title】: Data profiling on a BigQuery table covering min, max, unique, and null count statistics
【Posted】: 2020-09-26 07:17:21
【Question description】:

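I am looking for a solution to perform data profiling on a BigQuery table, covering statistics for every column in the table. Some of the columns are ARRAY and STRUCT, as shown below.

I have tried several ways to generate a dynamic query that covers the scenarios below, without success. I would really appreciate your help/input.

The metrics I want to compute as part of this solution are:

  • Minimum value
  • Maximum value
  • Minimum length of the field
  • Maximum length of the field
  • Number of unique records per field
  • Number of NULL values in the field
  • Number of non-NULL values in the field
  • Minimum date in DATE or DATETIME fields
  • Maximum date in DATE or DATETIME fields

Sample table data:

Expected output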

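【Question discussion】:

  • How deeply are your fields nested? I mean, I see addresses.phone.primarynumber => 3rd level. Do you want this automated, or do you have a maximum depth?
  • Thanks, Sabri. The maximum depth is 3.

Tags: google-bigquery data-profiling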


【Solution 1】:

This query returns all the columns of a table in a dataset. I excluded STRUCTs because you only need the value columns.

SELECT CONCAT('`', table_catalog, '.', table_schema, '.', table_name, '`') as table_name, field_path, data_type
FROM project.dataset.INFORMATION_SCHEMA.COLUMN_FIELD_PATHS
WHERE table_name = 'table_name'
  AND data_type NOT LIKE 'STRUCT%'

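Using that list (referred to as `columns` below), we generate a SQL query that computes the metrics for all of those columns. Here I have only added the MIN, MAX and COUNT DISTINCT columns, but you can add more by adding new lines to the SELECT part.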

SELECT 
  STRING_AGG(
    CONCAT(
      'SELECT "', field_path, '" as field_path, ',
        -- MIN is aliased as min and MAX as max; each generated column ends with a comma
        'CAST(MIN(', field_path, ') as string) as min, ',
        'CAST(MAX(', field_path, ') as string) as max, ',
        'COUNT(DISTINCT ', field_path, ') as count_distinct ',
      'FROM ', table_name) ,
    ' UNION ALL \n'
  ) as query
FROM columns
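
The other metrics listed in the question (field lengths, NULL and non-NULL counts) could be added as extra lines in the same way. Below is a minimal sketch of what one generated per-column SELECT might look like with those lines included. The `sample` CTE and its values are made up for illustration so the query runs on its own; the aliases `min_length`, `max_length`, `null_count` and `non_null_count` are assumptions, not names from the original answer.

```sql
-- Hypothetical inline data standing in for one column of the real table.
WITH sample AS (
  SELECT 'Alice' AS first_name UNION ALL
  SELECT 'Bob' UNION ALL
  SELECT CAST(NULL AS STRING)
)
SELECT
  'first_name' AS field_path,
  CAST(MIN(first_name) AS STRING) AS min,
  CAST(MAX(first_name) AS STRING) AS max,
  -- Casting to STRING first lets the same line work for non-string columns.
  MIN(LENGTH(CAST(first_name AS STRING))) AS min_length,
  MAX(LENGTH(CAST(first_name AS STRING))) AS max_length,
  COUNT(DISTINCT first_name) AS count_distinct,
  COUNTIF(first_name IS NULL) AS null_count,
  -- COUNT(expr) skips NULLs, so it gives the non-NULL count.
  COUNT(first_name) AS non_null_count
FROM sample
```

Because MIN and MAX are cast to STRING, DATE and DATETIME columns fit the same UNION ALL shape without a separate code path.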

Finally, we run this query with EXECUTE IMMEDIATE, because it is a string:

EXECUTE IMMEDIATE (
  query
)

Putting all of these queries together looks like this:

EXECUTE IMMEDIATE (
  SELECT 
    STRING_AGG(
      CONCAT(
        'SELECT "', field_path, '" as field_path, ',
          -- MIN is aliased as min and MAX as max; each generated column ends with a comma
          'CAST(MIN(', field_path, ') as string) as min, ',
          'CAST(MAX(', field_path, ') as string) as max, ',
          'COUNT(DISTINCT ', field_path, ') as count_distinct ',
        'FROM ', table_name) ,
      ' UNION ALL \n'
    ) as query
  FROM (
    SELECT CONCAT('`', table_catalog, '.', table_schema, '.', table_name, '`') as table_name, field_path, data_type
    FROM project.dataset.INFORMATION_SCHEMA.COLUMN_FIELD_PATHS
    WHERE table_name = 'table_name'
      AND data_type NOT LIKE 'STRUCT%'
  )
)

PS: It only handles STRUCTs for now. Could you show me an example of your ARRAY columns?
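
For ARRAY columns, a field inside the array's elements cannot be reached with a plain dotted path; the array has to be unnested first. The following is a minimal sketch of a single per-field query written that way, not part of the original answer. The `customer` CTE, its values, and the alias `a` are hypothetical and only there to make the query self-contained.

```sql
-- Hypothetical inline data standing in for a table with an ARRAY<STRUCT> column.
WITH customer AS (
  SELECT
    'Alice' AS first_name,
    [STRUCT('10001' AS zip, 'NY' AS state),
     STRUCT('94105' AS zip, 'CA' AS state)] AS addresses
)
SELECT
  'addresses.zip' AS field_path,
  CAST(MIN(a.zip) AS STRING) AS min,
  CAST(MAX(a.zip) AS STRING) AS max,
  COUNT(DISTINCT a.zip) AS count_distinct
FROM customer AS c
-- MIN(c.addresses.zip) would fail here because addresses is an array, not a struct.
-- LEFT JOIN keeps rows whose array is empty or NULL; a comma join would drop them.
LEFT JOIN UNNEST(c.addresses) AS a
```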

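【Discussion】:

  • Thanks Sabri. Here is one of the tables with an ARRAY column. CREATE OR REPLACE TABLE demo.Customer ( addresses ARRAY>, first_name STRING, dob DATE, last_name STRING, id INT64 )
  • Another DDL, with a column depth of 2. CREATE OR REPLACE TABLE demo.Enrollment ( kafka STRUCTPartition INT64, OffSet INT64, Key STRING>, account STRUCT>, address ARRAY>, legalAcceptance ARRAY>, isGuestCheckout BOOL, gaiyoshomenId STRING> )
  • Thanks @Sabri. I need output for all columns, including STRUCTs and ARRAYs. I modified your query to include STRUCTs and ARRAYs, but it returns the following error: **Cannot access field address1 on a value with type ARRAY> at [1:89]**. The query that returns the error is: SELECT "account.billingProfile.address1" as field_path, CAST(MIN(account.billingProfile.address1) as string) as max, CAST(MAX(account.billingProfile.address1) as string) as min, COUNT(DISTINCT account.billingProfile.address1) as count_distinct FROM myproject.demo.Enrollment.
  • No, don't remove that condition from the WHERE clause. It only removes the parent columns, not the child columns. For example, it removes address but keeps address.zip and address.state. As for arrays, I will deal with them when I have time. They are more complicated than structs, which is why I am postponing them for now.
  • Hi Sabri Karagonen: please let me know if you can provide a solution.

【Solution 2】: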

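I don't understand what you mean by Min Length and Max Length, but given the data provided you can do the following.

This query basically has two steps:

  1. Use a WITH clause to create a temporary table containing the flattened data.
  2. Compute the metrics by running one query per column, and use UNION ALL to combine everything into a single table.

Query: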

WITH
  t AS(
  SELECT
    first_name,
    dob,
    last_name,
    a.zip addresses_zip,
    a.state addresses_state,
    a.city addresses_city,
    a.numberOfYears addresses_numberOfYears,
    a.status addresses_status,
    a.phone.primarynumber addresses_phone_primarynumber,
    a.phone.secondary addresses_phone_secondary
  FROM
    <your-table> t,
    t.addresses a 
)

SELECT
  "first_name" AS column,
  COUNT(first_name) total_count,
  COUNT(DISTINCT first_name) total_distinct,
  SUM(
  IF
    (first_name IS NULL,
      1,
      0)) total_null,
  CAST(MIN(first_name) AS string) min_value,
  CAST(MAX(first_name) AS string) max_value
FROM
  t

UNION ALL

SELECT
  "dob" AS column,
  COUNT(dob) total_count,
  COUNT(DISTINCT dob) total_distinct,
  SUM(
  IF
    (dob IS NULL,
      1,
      0)) total_null,
  CAST(MIN(dob) AS string) min_value,
  CAST(MAX(dob) AS string) max_value
FROM
  t

UNION ALL

SELECT
  "last_name" AS column,
  COUNT(last_name) total_count,
  COUNT(DISTINCT last_name) total_distinct,
  SUM(
  IF
    (last_name IS NULL,
      1,
      0)) total_null,
  CAST(MIN(last_name) AS string) min_value,
  CAST(MAX(last_name) AS string) max_value
FROM
  t

UNION ALL

SELECT
  "addresses.zip" AS column,
  COUNT(addresses_zip) total_count,
  COUNT(DISTINCT addresses_zip) total_distinct,
  SUM(
  IF
    (addresses_zip IS NULL,
      1,
      0)) total_null,
  CAST(MIN(addresses_zip) AS string) min_value,
  CAST(MAX(addresses_zip) AS string) max_value
FROM
  t

UNION ALL

SELECT
  "addresses.state" AS column,
  COUNT(addresses_state) total_count,
  COUNT(DISTINCT addresses_state) total_distinct,
  SUM(
  IF
    (addresses_state IS NULL,
      1,
      0)) total_null,
  CAST(MIN(addresses_state) AS string) min_value,
  CAST(MAX(addresses_state) AS string) max_value
FROM
  t

UNION ALL

SELECT
  "addresses.city" AS column,
  COUNT(addresses_city) total_count,
  COUNT(DISTINCT addresses_city) total_distinct,
  SUM(
  IF
    (addresses_city IS NULL,
      1,
      0)) total_null,
  CAST(MIN(addresses_city) AS string) min_value,
  CAST(MAX(addresses_city) AS string) max_value
FROM
  t

UNION ALL

SELECT
  "addresses.numberOfYears" AS column,
  COUNT(addresses_numberOfYears) total_count,
  COUNT(DISTINCT addresses_numberOfYears) total_distinct,
  SUM(
  IF
    (addresses_numberOfYears IS NULL,
      1,
      0)) total_null,
  CAST(MIN(addresses_numberOfYears) AS string) min_value,
  CAST(MAX(addresses_numberOfYears) AS string) max_value
FROM
  t

UNION ALL

SELECT
  "addresses.status" AS column,
  COUNT(addresses_status) total_count,
  COUNT(DISTINCT addresses_status) total_distinct,
  SUM(
  IF
    (addresses_status IS NULL,
      1,
      0)) total_null,
  CAST(MIN(addresses_status) AS string) min_value,
  CAST(MAX(addresses_status) AS string) max_value
FROM
  t

UNION ALL

SELECT
  "addresses.phone.primarynumber" AS column,
  COUNT(addresses_phone_primarynumber) total_count,
  COUNT(DISTINCT addresses_phone_primarynumber) total_distinct,
  SUM(
  IF
    (addresses_phone_primarynumber IS NULL,
      1,
      0)) total_null,
  CAST(MIN(addresses_phone_primarynumber) AS string) min_value,
  CAST(MAX(addresses_phone_primarynumber) AS string) max_value
FROM
  t 

UNION ALL

SELECT
  "addresses.phone.secondary" AS column,
  COUNT(addresses_phone_secondary) total_count,
  COUNT(DISTINCT addresses_phone_secondary) total_distinct,
  SUM(
  IF
    (addresses_phone_secondary IS NULL,
      1,
      0)) total_null,
  CAST(MIN(addresses_phone_secondary) AS string) min_value,
  CAST(MAX(addresses_phone_secondary) AS string) max_value
FROM
  t
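
One thing to keep in mind with the flattened table: the comma join with `t.addresses` repeats each top-level value once per array element and drops rows whose array is empty, so counts for columns such as `first_name` are computed over address rows rather than over table rows. The sketch below illustrates the difference on made-up inline data; the `customer` CTE and the `source` labels are hypothetical.

```sql
-- Hypothetical data: Alice has three addresses, Bob has none.
WITH customer AS (
  SELECT
    'Alice' AS first_name,
    [STRUCT('10001' AS zip), STRUCT('94105' AS zip), STRUCT('60601' AS zip)] AS addresses
  UNION ALL
  SELECT 'Bob', ARRAY<STRUCT<zip STRING>>[]
)
-- Counting over the flattened rows, as the WITH clause above does.
SELECT
  'flattened (comma join)' AS source,
  COUNT(first_name) AS total_count,
  COUNT(DISTINCT first_name) AS total_distinct
FROM customer AS c, c.addresses AS a

UNION ALL

-- Counting over the base table.
SELECT
  'base table',
  COUNT(first_name),
  COUNT(DISTINCT first_name)
FROM customer
```

With this data the flattened row reports 3 and 1, while the base-table row reports 2 and 2. If per-row statistics are wanted for top-level columns, those SELECTs can read from the base table directly and only the array fields need the flattened source.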

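【Discussion】:

  • I am looking for a solution that builds the above query dynamically while handling ARRAY columns, since every ARRAY column needs to be unnested before it can be accessed. Based on the tables provided in the comments above, please let us know how to handle ARRAY columns dynamically.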
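
One possible direction for the dynamic case, offered as a rough sketch rather than a tested solution: combine the string-generation idea from Solution 1 with an UNNEST join for every field whose top-level column is an ARRAY. The `field_paths` CTE below is a hypothetical stand-in for the rows that `INFORMATION_SCHEMA.COLUMN_FIELD_PATHS` would return, so that the query runs on its own; the aliases `t` and `elem` and the output column names are assumptions. It only covers top-level arrays of structs; arrays of scalars and arrays nested inside structs or other arrays are not handled.

```sql
-- Hypothetical stand-in for INFORMATION_SCHEMA.COLUMN_FIELD_PATHS rows;
-- in practice, select table_name, column_name, field_path, data_type from that view.
WITH field_paths AS (
  SELECT '`project.demo.Customer`' AS table_name, 'first_name' AS column_name,
         'first_name' AS field_path, 'STRING' AS data_type
  UNION ALL SELECT '`project.demo.Customer`', 'addresses', 'addresses',
         'ARRAY<STRUCT<zip STRING, phone STRUCT<primarynumber STRING>>>'
  UNION ALL SELECT '`project.demo.Customer`', 'addresses', 'addresses.zip', 'STRING'
  UNION ALL SELECT '`project.demo.Customer`', 'addresses', 'addresses.phone',
         'STRUCT<primarynumber STRING>'
  UNION ALL SELECT '`project.demo.Customer`', 'addresses',
         'addresses.phone.primarynumber', 'STRING'
),
-- Top-level columns whose type is an array.
array_roots AS (
  SELECT column_name AS root
  FROM field_paths
  WHERE field_path = column_name AND data_type LIKE 'ARRAY%'
),
-- Value fields only, each with the expression to aggregate:
-- t.<path> for plain fields, elem.<rest of path> for fields under an array.
leaves AS (
  SELECT
    f.table_name,
    f.field_path,
    r.root,
    IF(r.root IS NULL,
       CONCAT('t.', f.field_path),
       CONCAT('elem', SUBSTR(f.field_path, LENGTH(r.root) + 1))) AS expr
  FROM field_paths AS f
  LEFT JOIN array_roots AS r ON f.column_name = r.root
  WHERE f.data_type NOT LIKE 'STRUCT%'
    AND f.data_type NOT LIKE 'ARRAY%'
)
SELECT
  STRING_AGG(
    CONCAT(
      'SELECT "', field_path, '" AS field_path, ',
      'CAST(MIN(', expr, ') AS STRING) AS min_value, ',
      'CAST(MAX(', expr, ') AS STRING) AS max_value, ',
      'COUNT(DISTINCT ', expr, ') AS count_distinct, ',
      'COUNTIF(', expr, ' IS NULL) AS null_count ',
      'FROM ', table_name, ' AS t',
      -- Fields under an array get an UNNEST join; others read the table directly.
      IF(root IS NULL, '', CONCAT(' LEFT JOIN UNNEST(t.', root, ') AS elem'))
    ),
    ' UNION ALL\n'
  ) AS query
FROM leaves
```

The resulting string could then be passed to EXECUTE IMMEDIATE as in Solution 1. Note that with LEFT JOIN, a row whose array is empty contributes one NULL to `null_count` for each field under that array.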