Inducing a schema using jq more efficiently: answers

【Question title】: Inducing a schema using jq more efficiently
【Posted】: 2020-10-13 17:43:22
【Question】:

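I ingest a lot of data. It comes from many different sources, and all of it ultimately goes into BigQuery.

I pre-parse it into .jsonl files, one line per record, named after the destination table.

To give a rough idea of scale, here is a sample from a dataset I am working on now. (All of the data below is real, just lightly redacted and cleaned up.)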

% find json -type f -size +2000c -print0 | head -z | sort |  wc --files0-from=-
  2   387   4737 json/baz_1.jsonl
  3   579   7055 json/baz_2.jsonl
  1   193   2358 json/baz_3.jsonl
 25  4835  58958 json/baz_4.jsonl
 37  7161  87467 json/baz_5.jsonl
  3   580   7072 json/baz_6.jsonl
 15  2897  35393 json/baz_7.jsonl
129 24950 304262 json/baz_8.jsonl
  3   373   4221 json/foo_1.jsonl
  6   746   8491 json/foo_2.jsonl
224 42701 520014 total

% wc -l *.jsonl
    11576 foos.jsonl
       20 bars.jsonl
   337770 bazzes.jsonl
   349366 total

% du -m *.jsonl
 3      foos.jsonl
 1      bars.jsonl
93      bazzes.jsonl

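This is relatively small for me. Other datasets run to millions of rows or terabytes of data.

Because the data comes from external sources that are often undocumented, often don't match their spec, or are just plain messy (e.g. various sentinel values for null, multiple date formats in the same field, etc.), I don't really know the structure beforehand.

However, I want a nice, clean, efficient structure in my destination tables: e.g. casting to the correct type such as integer/boolean/date, setting REQUIRED/NULLABLE correctly, knowing which columns are actually enums, converting stringified arrays into REPEATED columns, making a good guess at what I can use effectively for partitioning/clustering, and so on.

Some manual work on samples is unavoidable to infer what is actually going on, but my first pass at this uses jq (version 1.6).

Here is my current code: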

~/.jq

def isempty(v):
 (v == null or v == "" or v == [] or v == {});

def isnotempty(v):
 (isempty(v) | not);

def remove_empty:
  walk(
   if type == "array" then
      map(select(isnotempty(.)))
   elif type == "object" then
      with_entries(select(isnotempty(.value))) # Note: this will remove keys with empty values
   else .
   end
  );

# bag of words
def bow(stream):
  reduce stream as $word ({}; .[($word|tostring)] += 1);

# https://stackoverflow.com/questions/46254655/how-to-merge-json-objects-by-grouping-on-key-with-jq
def add_by(f):
  reduce .[] as $x ({}; ($x|f) as $f | .[$f] += [$x])
  | [.[] | add];

# takes array of {string: #, ...}
def merge_counts:
  map(.|to_entries)|flatten | add_by(.key)|from_entries;

induce_schema.sh (line breaks added)

#!/bin/zsh

pv -cN ingestion -s `wc -l $1` -l $1 | \
jq  -c --unbuffered --stream '{"name": ( .[0]), "encoded_type":( .[1] | type),
   "tonumber": (.[1] |  if (type == "string") then try(tonumber|type) catch type else null end),
   "chars": (.[1] | if(type=="string") then try(split("") | sort | unique | join("")) else null end),
   "length":(.[1] | length),"data":.[1]}' |  \
# sed -r 's/[0-9]+(,|])/"array"\1/g' | awk '!_[$0]++' | sort | \
pv -cN grouping -l | \
jq -sc '. | group_by(.name,.encoded_type,.tonumber)[] | {"name":first|.name,
   "encoded_type":([(first|.encoded_type),(first|.tonumber)]|unique - [null]|join("_")),
   "allchars": (map(.chars) | join("")|split("")|sort|unique|join("")),
   "count_null": (map(.data | select(.==null)) | length),
   "count_empty": (map(.data | select(.==[] or . == {} or . == "")) | length),
   "count_nonempty": (map(.data | select(. != null and . != "")) |length),
   "unique": (map(.data)|unique|length), "length": bow(.[] | .length)  }' | \
pv -cN final -l | \
jq -sc '. | group_by(.name)[] | {"name":first|.name,
   "nullable":(map(.encoded_type) | contains(["null"])),
   "schemas_count":(map(. | select(.encoded_type != "null") )|length),
   "lengths":(map(.length)|merge_counts), "total_nonempty":(map(.count_nonempty)|add),
   "total_null":(map(.count_null)|add), "total_empty": (map(.count_empty) |add),
   "schemas":map(. | select(.encoded_type != "null") | del(.name) )}'

Here is partial output for bars.jsonl (line breaks added for readability):

{"name":["FILING CODE"],"nullable":false,"schemas_count":1,
 "lengths":{"0":1930,"2":16},
 "total_nonempty":16,"total_null":0,"total_empty":1930,
 "schemas":[
  {"encoded_type":"string","allchars":"EGPWX",
   "count_null":0,"count_empty":1930,"count_nonempty":16,"unique":6,
   "length":{"0":1930,"2":16}}
 ]}
{"name":["LAST NAME"],"nullable":true,"schemas_count":1,
 "lengths":{"0":416,"5":4650,"6":5648,"7":4796,"4":1934,"8":3042,"9":1362,"10":570,"11":226,"3":284,"14":30,"12":70,"13":54,"16":20,"15":26,"17":10,"18":8,"2":4,"19":2},
 "total_nonempty":22736,"total_null":2,"total_empty":416,
 "schemas":[
  {"encoded_type":"string","allchars":" ABCDEFGHIJKLMNOPQRSTUVWXYZ",
   "count_null":0,"count_empty":416,"count_nonempty":22736,"unique":6233,
   "length":{"5":4650,"6":5648,"7":4796,"4":1934,"8":3042,"9":1362,"10":570,"11":226,"3":284,"14":30,"12":70,"0":416,"13":54,"16":20,"15":26,"17":10,"18":8,"2":4,"19":2}}
 ]}
{"name":["NUMBER OF COFFEES"],"nullable":false,"schemas_count":2,
 "lengths":{"1":16,"0":4},
 "total_nonempty":16,"total_null":0,"total_empty":4,
 "schemas":[
  {"encoded_type":"number_string","allchars":"1",
   "count_null":0,"count_empty":0,"count_nonempty":16,"unique":1,
   "length":{"1":16}},
  {"encoded_type":"string","allchars":"",
   "count_null":0,"count_empty":4,"count_nonempty":0,"unique":1,
   "length":{"0":4}}
 ]}
{"name":["OFFICE CODE"],"nullable":false,"schemas_count":2,
 "lengths":{"3":184,"0":22092},
 "total_nonempty":1036,"total_null":0,"total_empty":22092,
 "schemas":[
  {"encoded_type":"number_string","allchars":"0123456789",
   "count_null":0,"count_empty":0,"count_nonempty":852,"unique":254,
   "length":{"3":852}}, 
  {"encoded_type":"string","allchars":"0123456789ABCDEIJQRSX",
   "count_null":0,"count_empty":22092,"count_nonempty":184,"unique":66,
   "length":{"0":22092,"3":184}}
 ]}
{"name":["SOURCE FILE"],"nullable":true,"schemas_count":1,
 "lengths":{"0":416,"7":22708},
 "total_nonempty":22708,"total_null":23124,"total_empty":416,
 "schemas":[
  {"encoded_type":"string","allchars":"0123456789F_efil",
   "count_null":0,"count_empty":416,"count_nonempty":22708,"unique":30,
   "length":{"7":22708,"0":416}}
 ]}
...

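The point of this is to get a summary of "what is the structure of this unknown dataset, and what is in it" that I can easily turn into my BigQuery table schema and parameters, and that points me at what I may need to do next to turn the data into something cleaner and more usable than what I was given.

This code works, but those -s (slurp) lines are really hard on server RAM. (They simply would not work if the dataset were much larger than this; I only added those parts today. On the bazzes dataset, it uses about 20 GB of RAM in total, including swap.)

It also doesn't detect, for example, any date/time field types.

I believe it should be possible to make this more efficient using @joelpurra's jq + parallel and/or the jq cookbook's reduce inputs, but I am having difficulty figuring out how.

So, I'd appreciate advice on how to make this

more CPU- and RAM-efficient
more useful in other ways (e.g. recognizing date fields, which could be in almost any format)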

【Question comments】:

Tags: json performance schema jq

【Solution 1】:

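inputs can be used whether or not any parallelization technique is employed.

In a jq module for inducing structural schemas that I wrote some time ago (https://gist.github.com/pkoppstein/a5abb4ebef3b0f72a6ed), there is a filter, schema/1, defined as: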

def schema(stream):
  reduce stream as $x ("null";  typeUnion(.; $x|typeof));

It can therefore be used as suggested by this snippet:

jq -n 'include "schema"; schema(inputs)' FILESPECIFICATIONS

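(This assumes that the file "schema.jq" defining the schema module has been appropriately installed.)

The point here is not so much that schema.jq might be adapted to your particular expectations, but that the above "def" can serve as a guide (whether or not you use jq) for how to write an efficient schema-inference engine, efficient in the sense of being able to handle a large number of instances. That is, you basically only have to write definitions of typeof (which should yield the desired "type" in the most general sense) and of typeUnion (which defines how two types are to be combined).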
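
To make the typeof/typeUnion recipe concrete outside jq, here is a minimal Python sketch. The names type_of, type_union and induce are hypothetical, and the type notation ("JSON" as the catch-all, a one-element list for arrays, missing keys treated as "null") is an assumption made for illustration, not necessarily what schema.jq does. Only the fold that starts from "null" mirrors the def above.

```python
import json
from functools import reduce

def type_of(value):
    """Return a structural type for one parsed JSON value (assumed notation)."""
    if value is None:
        return "null"
    if isinstance(value, bool):          # test bool first: bool is a subclass of int
        return "boolean"
    if isinstance(value, (int, float)):
        return "number"
    if isinstance(value, str):
        return "string"
    if isinstance(value, list):
        # An array's type is a one-element list holding the union of its item types.
        return [reduce(type_union, map(type_of, value), "null")]
    return {key: type_of(item) for key, item in value.items()}

def type_union(a, b):
    """Combine two types into one type that covers both."""
    if a == b:
        return a
    if a == "null":                      # "null" is the identity, as in the reduce above
        return b
    if b == "null":
        return a
    if isinstance(a, list) and isinstance(b, list):
        return [type_union(a[0], b[0])]
    if isinstance(a, dict) and isinstance(b, dict):
        # A key missing on one side is treated as "null" on that side (assumption).
        return {k: type_union(a.get(k, "null"), b.get(k, "null")) for k in {**a, **b}}
    return "JSON"                        # incompatible types: any JSON value

def induce(lines):
    """Fold a stream of JSON lines into one schema, one record at a time."""
    schema = "null"
    for line in lines:
        if line.strip():
            schema = type_union(schema, type_of(json.loads(line)))
    return schema
```

Each line is folded into the running schema and then discarded, so memory use depends on the size of the schema rather than on the number of records. A call might look like induce(open("bars.jsonl")).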

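Of course, inferring a schema can be a tricky business. In particular, schema(stream) will never fail, assuming the inputs are valid JSON. In other words, whether the inferred schema is useful depends largely on how it is used. I find an integrated approach based on these elements to be essential:

1. a schema specification language;
2. a schema inference engine that generates schemas conforming to (1);
3. a schema checker.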
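
As an illustration of the third element, a schema checker can be very small. This Python sketch assumes a hypothetical notation in which a schema is a type name ("null", "boolean", "number", "string", or "JSON" for anything), a one-element list for arrays, or a dict for objects. It also assumes that null is acceptable wherever a value is expected and that object keys may be absent. None of this is taken from schema.jq or JESS; the function name conforms is made up.

```python
def conforms(value, schema):
    """Return True if a parsed JSON value matches a schema in the assumed notation."""
    if schema == "JSON" or value is None:
        return True        # "JSON" accepts anything; null is allowed everywhere (assumption)
    if schema == "null":
        return False       # only null conforms, and that case was handled above
    if schema == "boolean":
        return isinstance(value, bool)
    if schema == "number":
        # bool is a subclass of int in Python, so exclude it explicitly
        return isinstance(value, (int, float)) and not isinstance(value, bool)
    if schema == "string":
        return isinstance(value, str)
    if isinstance(schema, list):
        return isinstance(value, list) and all(conforms(v, schema[0]) for v in value)
    if isinstance(schema, dict):
        # Every key present must be known to the schema and have a conforming value.
        return (isinstance(value, dict)
                and all(k in schema and conforms(v, schema[k]) for k, v in value.items()))
    raise ValueError(f"unknown schema: {schema!r}")
```

Running a checker like this over the full dataset is one way to find out whether a schema induced from a sample actually holds for every record.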

Further thoughts

schema.jq is simple enough to be tailored to more specific requirements, e.g. inferring dates.
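
One way such tailoring might look, sketched in Python rather than jq: classify each string by what it could be converted to, and count the outcomes per field as the records stream past. The function names, the labels such as "number_string", and above all the DATE_FORMATS list are assumptions for illustration; real feeds will need their own list of formats. The sketch also assumes that every line is a JSON object.

```python
import json
import re
from collections import Counter, defaultdict
from datetime import datetime

# Hypothetical candidate formats; real data will need a longer, source-specific list.
DATE_FORMATS = ("%Y-%m-%d", "%m/%d/%Y", "%Y-%m-%dT%H:%M:%S")
# Plain decimal numbers only, so that words like "NAN" or "INF" stay strings.
NUMBER = re.compile(r"[+-]?(\d+(\.\d*)?|\.\d+)([eE][+-]?\d+)?")

def refine(value):
    """Classify one parsed JSON value, looking inside strings for numbers and dates."""
    if value is None:
        return "null"
    if isinstance(value, bool):
        return "boolean"
    if isinstance(value, (int, float)):
        return "number"
    if not isinstance(value, str):
        return "array" if isinstance(value, list) else "object"
    text = value.strip()
    if text == "":
        return "empty_string"
    if NUMBER.fullmatch(text):
        return "number_string"
    for fmt in DATE_FORMATS:
        try:
            datetime.strptime(text, fmt)
            return "date_string:" + fmt
        except ValueError:
            continue                     # not this format, try the next one
    return "string"

def profile(lines):
    """Count refined types per top-level field, one record at a time."""
    counts = defaultdict(Counter)
    for line in lines:
        if line.strip():
            for field, value in json.loads(line).items():
                counts[field][refine(value)] += 1
    return counts
```

A field whose non-null, non-empty values all fall under a single date_string label is a candidate for a date column. Matching the number pattern is not by itself proof of a lossless cast: a code such as 007 matches but would lose its leading zeros.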

You might be interested in JESS ("JSON Extended Structural Schemas"), which combines a JSON-based specification language with jq-oriented tools: https://github.com/pkoppstein/JESS

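【Comments】:

AFAICT, your code trusts the types of the data it ingests. My problem is that the data I ingest is usually not typed: it is all strings. I have to figure out, for example, whether I can convert every .foobar to a number (or a date, or whatever) without loss; otherwise I waste too much space in the destination table and cannot do type-specific operations (numeric comparisons, date sharding, etc.).
@sai - It looks like you have misunderstood both my post and schema.jq. It would probably be simplest if you amended your Q with some examples of the inputs you are dealing with.
@sai - I have added a paragraph that will hopefully make at least a few things clearer.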