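【Published】: 2016-03-12 09:43:01
【Question】:
I can view the metadata details of an individual table in BigQuery, but for project estimation I would like to see the metadata for an entire dataset.
The following query isn't working for me:
SELECT * FROM 'dataset'.__TABLES_SUMMARY__ WHERE size_bytes > 0
【Question discussion】:
Tags: google-bigquery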
SELECT SUM(size_bytes) AS bytes
FROM [yourdataset.__TABLES__]
【Discussion】:
The previous answer is correct, but I would like to expand on it.
In BigQuery Standard SQL, you can query the size per dataset as follows:
SELECT
  dataset_id,
  COUNT(*) AS tables,
  SUM(row_count) AS total_rows,
  SUM(size_bytes) AS size_bytes
FROM (
  SELECT * FROM `dataset1.__TABLES__` UNION ALL
  SELECT * FROM `dataset2.__TABLES__` UNION ALL
  ...
)
GROUP BY 1
ORDER BY size_bytes DESC
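Unfortunately, I have not found a way to list all tables across all datasets of a project. Instead, I use the bq command-line tool to generate all of the SELECT ... UNION ALL statements: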
bq ls --format=json | jq -r '.[] | select(.location == "EU") | .id' | sed 's/:/./' | sed 's/\(.*\)/SELECT * FROM `\1.__TABLES__` UNION ALL/'
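To see what the sed stages of that pipeline produce without calling bq or jq, here is a minimal sketch that runs them on two made-up dataset IDs. The IDs and the build_union helper name are assumptions for illustration only; the extra final sed stage is not part of the original answer and is added here to drop the dangling UNION ALL from the last line.

```shell
# Hypothetical dataset IDs in the "project:dataset" form that the
# pipeline above extracts from the .id field.
ids='my-project:dataset1
my-project:dataset2'

# build_union is an illustrative helper name, not part of bq.
build_union() {
  # 1) "project:dataset" -> "project.dataset"
  # 2) wrap each ID in a SELECT ... UNION ALL line
  # 3) remove the trailing UNION ALL from the last line only
  sed 's/:/./' \
    | sed 's/\(.*\)/SELECT * FROM `\1.__TABLES__` UNION ALL/' \
    | sed '$ s/ UNION ALL$//'
}

printf '%s\n' "$ids" | build_union
```

The printed lines can be pasted into the FROM ( ... ) subquery of the Standard SQL statement above.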
【Discussion】:
bq --project_id=... lets you select a specific project. That is how I computed the totals for 5 projects.
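Building on @Luís Bianchin's answer: to avoid writing out multiple UNION ALL queries by hand, we can use SQL scripting. First select all datasets from INFORMATION_SCHEMA, then compute the size of every dataset in the project: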
-- Build one SELECT per dataset in the project, joined with UNION ALL.
DECLARE select_dataset_sql STRING;
DECLARE sql STRING;
SET select_dataset_sql = (
  SELECT
    ARRAY_TO_STRING(
      ARRAY_AGG("SELECT dataset_id, row_count, size_bytes FROM `" || schema_name || ".__TABLES__`"),
      " UNION ALL ") AS sql
  FROM
    -- project-id is a placeholder; replace it with your own project ID.
    `project-id.INFORMATION_SCHEMA.SCHEMATA`
);
-- Wrap the generated UNION ALL in the per-dataset aggregation.
SET sql = FORMAT("""
  SELECT
    dataset_id,
    COUNT(*) AS tables,
    SUM(row_count) AS total_rows,
    SUM(size_bytes) / 1e+6 AS size_mb
  FROM (
    %s
  )
  GROUP BY 1
  ORDER BY size_mb DESC
""", select_dataset_sql);
EXECUTE IMMEDIATE sql;
【Discussion】:
Another convenient approach is to use the Monitoring feature to visualize the size of your datasets.
【Discussion】: