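【Posted】: 2020-10-13 17:43:22
【Question】:
I ingest a lot of data. It comes from many different sources, and all of it ultimately ends up in BigQuery.
I pre-parse it into .jsonl files: one record per line, named after the destination table.
To give a rough idea of scale, here is a sample from the dataset I am working on right now. (All data below is real, just lightly edited/sanitized.)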
% find json -type f -size +2000c -print0 | head -z | sort | wc --files0-from=-
2 387 4737 json/baz_1.jsonl
3 579 7055 json/baz_2.jsonl
1 193 2358 json/baz_3.jsonl
25 4835 58958 json/baz_4.jsonl
37 7161 87467 json/baz_5.jsonl
3 580 7072 json/baz_6.jsonl
15 2897 35393 json/baz_7.jsonl
129 24950 304262 json/baz_8.jsonl
3 373 4221 json/foo_1.jsonl
6 746 8491 json/foo_2.jsonl
224 42701 520014 total
% wc -l *.jsonl
11576 foos.jsonl
20 bars.jsonl
337770 bazzes.jsonl
349366 total
% du -m *.jsonl
3 foos.jsonl
1 bars.jsonl
93 bazzes.jsonl
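This is relatively small for me. Other datasets run to millions of rows / terabytes of data.
Because the data comes from external sources, is usually undocumented, and often does not match its spec or is simply messy (e.g. assorted sentinel values for null, multiple date formats in the same field, etc.), I don't really know the structure in advance.
However, I want a nice, clean, efficient structure in my destination tables: e.g. casting to the correct types such as integer/boolean/date, setting REQUIRED/NULLABLE correctly, knowing which columns are really enums, converting stringified arrays into REPEATED columns, making good guesses about what I can use effectively for partitioning/clustering, and so on.
Some manual work on samples is inevitably needed to infer what is actually going on, but my first pass at this uses jq (version 1.6).
Here is my current code: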
~/.jq
# true for null, "", [] and {}
def isempty(v):
  (v == null or v == "" or v == [] or v == {});

def isnotempty(v):
  (isempty(v) | not);

# recursively drop empty array elements and empty object members
def remove_empty:
  walk(
    if type == "array" then
      map(select(isnotempty(.)))
    elif type == "object" then
      with_entries(select(isnotempty(.value))) # Note: this will remove keys with empty values
    else .
    end
  );

# bag of words: count occurrences of each (stringified) value in a stream
def bow(stream):
  reduce stream as $word ({}; .[($word|tostring)] += 1);

# https://stackoverflow.com/questions/46254655/how-to-merge-json-objects-by-grouping-on-key-with-jq
def add_by(f):
  reduce .[] as $x ({}; ($x|f) as $f | .[$f] += [$x])
  | [.[] | add];

# takes array of {string: #, ...}
# (add_by combines each group with `add`, which merges objects, so a key that
# appears in several inputs keeps the value from the last one)
def merge_counts:
  map(.|to_entries)|flatten | add_by(.key)|from_entries;
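A small illustration of how merge_counts behaves, using toy input shaped like the two OFFICE CODE "length" objects in the sample output further down. add_by combines each group with add, and add on objects merges them, so a key that occurs in more than one input keeps its last value; that is consistent with "lengths" showing "3":184 for OFFICE CODE. merge_counts_sum is a hypothetical name introduced here, not part of the original ~/.jq, and is only a sketch of one way to sum the counts instead. -S is used just to make the key order predictable.

```shell
#!/bin/sh
# add_by and merge_counts are copied from the ~/.jq listing;
# merge_counts_sum is a hypothetical variant added for comparison.
defs='
def add_by(f):
  reduce .[] as $x ({}; ($x|f) as $f | .[$f] += [$x])
  | [.[] | add];
def merge_counts:
  map(.|to_entries)|flatten | add_by(.key)|from_entries;
# Hypothetical variant: add the values up instead of merging the entry objects.
def merge_counts_sum:
  reduce (.[] | to_entries[]) as $e ({}; .[$e.key] += $e.value);
'

# Toy input shaped like the two OFFICE CODE "length" objects
toy='[{"3":852},{"0":22092,"3":184}]'

merge_last_wins() { printf '%s\n' "$toy" | jq -cS "$defs merge_counts"; }
merge_summed()    { printf '%s\n' "$toy" | jq -cS "$defs merge_counts_sum"; }

merge_last_wins   # prints {"0":22092,"3":184}
merge_summed      # prints {"0":22092,"3":1036}
```

The summing variant also avoids building the intermediate arrays of entries, since it folds each key/value pair straight into the running totals.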
induce_schema.sh (line breaks added)
#!/bin/zsh
# Usage: induce_schema.sh FILE.jsonl
# `wc -l < file` prints only the line count; `wc -l file` also prints the
# file name, which pv would treat as a second input file.
pv -cN ingestion -s "$(wc -l < "$1")" -l "$1" |
  # stage 1: one summary record per streamed [path, leaf] event
  jq -c --unbuffered --stream '{"name": (.[0]), "encoded_type": (.[1] | type),
    "tonumber": (.[1] | if (type == "string") then try (tonumber|type) catch type else null end),
    "chars": (.[1] | if (type == "string") then try (split("") | sort | unique | join("")) else null end),
    "length": (.[1] | length), "data": .[1]}' |
  # sed -r 's/[0-9]+(,|])/"array"\1/g' | awk '!_[$0]++' | sort | \
  pv -cN grouping -l |
  # stage 2: slurp everything, one record per (name, type) combination
  jq -sc '. | group_by(.name,.encoded_type,.tonumber)[] | {"name": first|.name,
    "encoded_type": ([(first|.encoded_type),(first|.tonumber)] | unique - [null] | join("_")),
    "allchars": (map(.chars) | join("") | split("") | sort | unique | join("")),
    "count_null": (map(.data | select(. == null)) | length),
    "count_empty": (map(.data | select(. == [] or . == {} or . == "")) | length),
    "count_nonempty": (map(.data | select(. != null and . != "")) | length),
    "unique": (map(.data) | unique | length), "length": bow(.[] | .length)}' |
  pv -cN final -l |
  # stage 3: slurp again, one record per field name
  jq -sc '. | group_by(.name)[] | {"name": first|.name,
    "nullable": (map(.encoded_type) | contains(["null"])),
    "schemas_count": (map(. | select(.encoded_type != "null")) | length),
    "lengths": (map(.length) | merge_counts), "total_nonempty": (map(.count_nonempty) | add),
    "total_null": (map(.count_null) | add), "total_empty": (map(.count_empty) | add),
    "schemas": map(. | select(.encoded_type != "null") | del(.name))}'
Here is partial output for bars.jsonl (line breaks added for readability):
{"name":["FILING CODE"],"nullable":false,"schemas_count":1,
"lengths":{"0":1930,"2":16},
"total_nonempty":16,"total_null":0,"total_empty":1930,
"schemas":[
{"encoded_type":"string","allchars":"EGPWX",
"count_null":0,"count_empty":1930,"count_nonempty":16,"unique":6,
"length":{"0":1930,"2":16}}
]}
{"name":["LAST NAME"],"nullable":true,"schemas_count":1,
"lengths":{"0":416,"5":4650,"6":5648,"7":4796,"4":1934,"8":3042,"9":1362,"10":570,"11":226,"3":284,"14":30,"12":70,"13":54,"16":20,"15":26,"17":10,"18":8,"2":4,"19":2},
"total_nonempty":22736,"total_null":2,"total_empty":416,
"schemas":[
{"encoded_type":"string","allchars":" ABCDEFGHIJKLMNOPQRSTUVWXYZ",
"count_null":0,"count_empty":416,"count_nonempty":22736,"unique":6233,
"length":{"5":4650,"6":5648,"7":4796,"4":1934,"8":3042,"9":1362,"10":570,"11":226,"3":284,"14":30,"12":70,"0":416,"13":54,"16":20,"15":26,"17":10,"18":8,"2":4,"19":2}}
]}
{"name":["NUMBER OF COFFEES"],"nullable":false,"schemas_count":2,
"lengths":{"1":16,"0":4},
"total_nonempty":16,"total_null":0,"total_empty":4,
"schemas":[
{"encoded_type":"number_string","allchars":"1",
"count_null":0,"count_empty":0,"count_nonempty":16,"unique":1,
"length":{"1":16}},
{"encoded_type":"string","allchars":"",
"count_null":0,"count_empty":4,"count_nonempty":0,"unique":1,
"length":{"0":4}}
]}
{"name":["OFFICE CODE"],"nullable":false,"schemas_count":2,
"lengths":{"3":184,"0":22092},
"total_nonempty":1036,"total_null":0,"total_empty":22092,
"schemas":[
{"encoded_type":"number_string","allchars":"0123456789",
"count_null":0,"count_empty":0,"count_nonempty":852,"unique":254,
"length":{"3":852}},
{"encoded_type":"string","allchars":"0123456789ABCDEIJQRSX",
"count_null":0,"count_empty":22092,"count_nonempty":184,"unique":66,
"length":{"0":22092,"3":184}}
]}
{"name":["SOURCE FILE"],"nullable":true,"schemas_count":1,
"lengths":{"0":416,"7":22708},
"total_nonempty":22708,"total_null":23124,"total_empty":416,
"schemas":[
{"encoded_type":"string","allchars":"0123456789F_efil",
"count_null":0,"count_empty":416,"count_nonempty":22708,"unique":30,
"length":{"7":22708,"0":416}}
]}
...
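The point of this is to summarize "what is the structure of this unknown dataset and what is in it" in a form I can easily turn into my BigQuery table schema/parameters, and that points out what I may need to do next to turn the data into something cleaner and more useful than what I was given.
This code works, but those -s (slurp) lines are really hard on the server's RAM. (They simply would not work on a dataset much larger than this; I only added those parts today. On the bazzes dataset it used about 20GB of RAM in total, including swap.)
It also does not detect, for example, any date/time field types.
I believe using @joelpurra's jq + parallel and/or the jq cookbook's reduce inputs should make this more efficient, but I'm having trouble figuring out how.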
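To illustrate the reduce inputs idea mentioned above, here is a minimal sketch, not a drop-in replacement for the pipeline. It assumes flat records (top-level keys only, rather than --stream paths). With -n, jq does not consume the input up front; inputs then yields one record at a time and reduce keeps only per-field counters. profile_fields and the toy records are made up for illustration. It does not count unique values, since that would need memory proportional to the number of distinct values.

```shell
#!/bin/sh
# profile_fields FILE: hypothetical helper, one pass over a .jsonl file.
# Memory grows with the number of distinct fields/types/lengths, not with the row count.
profile_fields() {
  jq -ncS '
    reduce (inputs | to_entries[]) as $e ({};
      # count how often each JSON type occurs per field
      .[$e.key].types[$e.value | type] += 1
      # histogram of string lengths per field (other types are skipped)
      | if ($e.value | type) == "string"
        then .[$e.key].lengths[$e.value | length | tostring] += 1
        else .
        end
    )' "$1"
}

# Toy input (made up), written to a temporary file
sample=$(mktemp)
cat > "$sample" <<'EOF'
{"LAST NAME":"SMITH","OFFICE CODE":"012"}
{"LAST NAME":null,"OFFICE CODE":"A1X"}
{"LAST NAME":"LEE","OFFICE CODE":""}
EOF

profile_fields "$sample"
# prints {"LAST NAME":{"lengths":{"3":1,"5":1},"types":{"null":1,"string":2}},"OFFICE CODE":{"lengths":{"0":1,"3":2},"types":{"string":3}}}
rm -f "$sample"
```

Because each input file produces one small summary object, the same function could in principle be run per file (for example under parallel) and the summaries merged afterwards; that merging step is not shown here.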
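So, I'd appreciate advice on how to make this
- more CPU and RAM efficient
- more useful in other ways (e.g. recognizing date fields, which can be in almost any format)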
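As a rough illustration of what date recognition could look like in the same single-pass style, the sketch below tallies, per field, which coarse date "shape" each value matches. The function name date_shapes, the shape labels, and the four patterns are all assumptions chosen for the example; real data would need more patterns, and a shape match does not prove the value is a valid date (e.g. month 13 still matches). It relies on jq's test, which needs a jq build with regex (Oniguruma) support.

```shell
#!/bin/sh
# date_shapes FILE: hypothetical helper that counts, per field, which rough
# date pattern each value matches. Shape only; values are not validated.
date_shapes() {
  jq -ncS '
    def date_shape:
      if type != "string" then "not_string"
      elif test("^[0-9]{4}-[0-9]{2}-[0-9]{2}$") then "iso_date"
      elif test("^[0-9]{4}-[0-9]{2}-[0-9]{2}[T ][0-9]{2}:[0-9]{2}:[0-9]{2}") then "iso_datetime"
      elif test("^[0-9]{1,2}/[0-9]{1,2}/[0-9]{4}$") then "slash_date"
      elif test("^[0-9]{8}$") then "compact_date"
      else "other"
      end;
    # one pass over the records, keeping only per-field shape counters
    reduce (inputs | to_entries[]) as $e ({};
      .[$e.key][$e.value | date_shape] += 1)
  ' "$1"
}

# Toy input (made up), written to a temporary file
sample=$(mktemp)
cat > "$sample" <<'EOF'
{"filed":"2020-10-13","seen":"10/13/2020"}
{"filed":"2020-10-13 17:43:22","seen":"n/a"}
{"filed":null,"seen":"20201013"}
EOF

date_shapes "$sample"
# prints {"filed":{"iso_date":1,"iso_datetime":1,"not_string":1},"seen":{"compact_date":1,"other":1,"slash_date":1}}
rm -f "$sample"
```

A field whose counts fall almost entirely into one date shape is a candidate for a DATE or TIMESTAMP column; a field split across several shapes is one of the mixed-format cases that still needs a manual look.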
【Discussion】:
Tags: json performance schema jq