【Question title】: Browsing large JSON file
【Posted】: 2017-05-20 09:27:30
【Question description】:

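I have a huge JSON file with some very deep paths in it. I would like to use jq to show me the first N levels of keys and hide the deeper content. Then, once I find a key I am interested in, I want to keep drilling down, showing only N levels below my starting point, much like a text editor that can fold everything below level N. Is this possible?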

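【Question comments】:

  • Why not try it?
  • What do you mean by "huge"? How deep is "deep"? Is the file a single JSON entity or a collection of them? What do you mean by "browsing"? By drilling down, do you mean increasing the value of N? If a vague question deserves a vague answer, I would respond: "It might be worth your while to learn jq." But I would also ask: have you considered another way to find the "interesting" keys?
  • @ceejayoz Try what?
  • @peak I tried to clarify the question. The answers below get me close to what I want.

Tags: json path key schema jq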


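【Solution 1】:

Appended below is a schema-inference program written in jq. It can be useful for understanding the structure of a large JSON object or an array of JSON entities, at least if there is some rhyme or reason to that structure.

Usage: if the JSON entity of interest is in the file input.json and the program below is saved as schema.jq, run: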

jq -f schema.jq input.json

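For a very large file, schema inference can be slow, but it is often faster than some kind of iterative approach. See, for example, the remarks following the example below.

Example

Here is an example using JSON=JEOPARDY_QUESTIONS1.json, a 54MB file (55554625 bytes) from https://raw.githubusercontent.com/alicemaz/super_jeopardy/master/JEOPARDY_QUESTIONS1.json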

$ time jq -c -f schema.jq $JSON
[
  {
    "air_date": "string",
    "answer": "string",
    "category": "string",
    "question": "string",
    "round": "string",
    "show_number": "string",
    "value": "string"
  }
]

real    0m12.868s
user    0m11.713s
sys     0m0.342s

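The u+s (user plus sys) time is noteworthy: it is about two-thirds of the u+s time taken on the same machine by the approach that uses the streaming parser to produce a synopsis of paths (see synopsis.jq on this page). This may seem counterintuitive, given that the JSON file is an array of length 216,930.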

schema.jq

# Schema inference
# Version 0.1
# Author: pkoppstein at gmail dot com
# Requires: jq 1.4 or higher

# This module defines three filters:
#   typeof/0 returns the extended-type of its input;
#   typeUnion(a;b) returns the union of the two specified extended-type values;
#   schema/0 returns the typeUnion of the extended-type values of the entities
#    in the input array, if the input is an array,
#     otherwise it simply returns the "typeof" value of its input.

# Each extended type can be thought of as a set of JSON entities,
# e.g. "number" for the set of JSON numbers, and ["number"] for the
# set of JSON number-valued arrays including [].

# The extended-type values are always JSON entities.
# The possible values are:
# "null", "boolean", "string", "number";
# "scalar" for any combination of non-null scalars;
# [T] where T is an extended type;
# an object all of whose values are extended types;
# "JSON" signifying that no other extended-type value is applicable.

# The extended-type values are defined recursively:
# The extended-type of a scalar value is its JSON type.
# The extended-type of a non-empty array of values all of which have the
#      same JSON type, t, is [t], and similarly for ["scalar"], and ["JSON"].
# The extended-type of [] is ["null"], since that is the extended type of all arrays
#     which have no elements other than null.
# The extended-type of an object is an object with the same keys, but the
#     values of which are the extended-types of the corresponding values.

# typeUnion(a;b) returns the least extended-type value that subsumes both a and b.
# For example:
#  typeUnion("number"; "string") yields "scalar";
#  typeUnion({"a": "number"}; {"b": "string"}) yields {"a": "number", "b": "string"};
#  typeUnion("null"; t) yields t for any valid extended type, t.

def typeUnion(a;b):
  def scalarp: . == "boolean" or . == "string" or . == "number" or . == "scalar";
  a as $a | b as $b
  | if $a == $b then $a
    elif ($a | scalarp) and ($b | scalarp) then "scalar"
    elif $a == "JSON" or $b == "JSON" then "JSON"
    elif ($a|type) == "array" and ($b|type) == "array" then [ typeUnion($a[0]; $b[0]) ]
    elif ($a|type) == "object" and ($b|type) == "object" then
      ((($a|keys) + ($b|keys)) | unique) as $keys
      | reduce $keys[] as $key ( {} ; .[$key] = typeUnion( $a[$key]; $b[$key]) )
    elif $a == "null" or $a == null then $b
    elif $b == "null" or $b == null then $a
    else "JSON"
    end ;

def typeof:
  def typeofArray:
    if length == 0 then ["null"]
    else [reduce .[] as $item (null; typeUnion(.; $item|typeof))]
    end ;
  def typeofObject:
    reduce keys[] as $key (. ; .[$key] |= typeof) ;

  . as $in
  | type
  | if . == "string" or . == "number" or . == "null" or . == "boolean" then .
    elif . == "object" then $in | typeofObject
    else $in | typeofArray
    end ;

# Omit the outermost [] for an array
def schema:
  if type == "array" then reduce .[] as $x ("null";  typeUnion(.; $x|typeof))
  else typeof
  end ;



# Example top-level:
schema
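
For readers who would like to see the type-union idea outside jq, here is a minimal Python sketch of the same rules. It is an illustrative analogue, not part of schema.jq: the names `type_union`, `type_of`, and `schema`, and the sample records, are assumptions made for this example, and Python's `None` stands in for jq's `null`.

```python
import json

# Extended types that count as non-null scalars.
SCALARS = {"boolean", "string", "number", "scalar"}

def type_union(a, b):
    """Least extended type subsuming both a and b (follows typeUnion in schema.jq)."""
    if a == b:
        return a
    if isinstance(a, str) and isinstance(b, str) and a in SCALARS and b in SCALARS:
        return "scalar"
    if a == "JSON" or b == "JSON":
        return "JSON"
    if isinstance(a, list) and isinstance(b, list):
        return [type_union(a[0], b[0])]
    if isinstance(a, dict) and isinstance(b, dict):
        # A key missing on one side contributes None, which yields to the other side.
        return {k: type_union(a.get(k), b.get(k)) for k in sorted(set(a) | set(b))}
    if a == "null" or a is None:
        return b
    if b == "null" or b is None:
        return a
    return "JSON"

def type_of(value):
    """Extended type of a single decoded JSON value."""
    if value is None:
        return "null"
    if isinstance(value, bool):  # check bool before int: bool is a subclass of int
        return "boolean"
    if isinstance(value, (int, float)):
        return "number"
    if isinstance(value, str):
        return "string"
    if isinstance(value, dict):
        return {k: type_of(v) for k, v in value.items()}
    if not value:  # empty list
        return ["null"]
    acc = None
    for item in value:
        acc = type_union(acc, type_of(item))
    return [acc]

def schema(value):
    """For a list, the union of its items' types (outer [] omitted); else type_of."""
    if isinstance(value, list):
        acc = "null"
        for item in value:
            acc = type_union(acc, type_of(item))
        return acc
    return type_of(value)

# Hypothetical sample data, not taken from the thread.
records = json.loads('[{"id": 1, "tags": ["a"]}, {"id": "x", "note": null}]')
print(json.dumps(schema(records), sort_keys=True))
# {"id": "scalar", "note": "null", "tags": ["string"]}
```

The two records disagree on the type of "id", so it widens to "scalar", while keys present in only one record keep the type seen there.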

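【Comments】:

    【Solution 2】:

    Here is a filter that emits a stream of synopses of the paths in the input entity. When depth is positive, only paths of length <= depth are included.

    The synopsis of a path [p1, p2, ...] is constructed by replacing each integer component with ".[]" and prefixing each string component with ".". For example, if i and j are integers, then [i, "keyname", j] is presented as .[].keyname.[]

    Here is an example of the output produced, using jq -r: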

    .[]
    .[].data
    .[].data.children
    .[].data.modhash
    .[].kind
    

    paths_synopsis/1

    # If depth<0 then select paths of length equal to -depth    
    def paths_synopsis(depth):
      [ paths
      | if depth > 0 then select(length <= depth)
        elif (depth < 0) then select(length == -depth)
        else . end
      | [.[]|if type=="number" then "[]" else . end]]
      | unique
      | .[]
      | "." + join(".")
      ;
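
As a rough cross-check of what the filter computes, here is a minimal Python sketch of the same idea. It is an analogue written for illustration, not a translation endorsed by the answer: the helper `paths`, the default `depth=0`, and the sample document are assumptions, and the result is sorted as plain strings, which may order some entries differently from jq's `unique`.

```python
def paths(value, prefix=()):
    """Yield every non-empty path (tuple of keys and indices) below value."""
    if isinstance(value, dict):
        items = value.items()
    elif isinstance(value, list):
        items = enumerate(value)
    else:
        return  # scalars have no children
    for key, child in items:
        path = prefix + (key,)
        yield path
        yield from paths(child, path)

def paths_synopsis(value, depth=0):
    """Sorted unique synopses; depth > 0 caps the length, depth < 0 fixes it."""
    seen = set()
    for path in paths(value):
        if depth > 0 and len(path) > depth:
            continue
        if depth < 0 and len(path) != -depth:
            continue
        # Array indices collapse to "[]"; string keys are kept as they are.
        parts = ("[]" if isinstance(p, int) else p for p in path)
        seen.add("." + ".".join(parts))
    return sorted(seen)

# Hypothetical sample shaped to reproduce the output listed above.
doc = [{"kind": "Listing", "data": {"modhash": "", "children": []}}]

for line in paths_synopsis(doc):
    print(line)
```

Calling `paths_synopsis(doc, 2)` keeps only the synopses of length one or two, and `paths_synopsis(doc, -3)` keeps only those of length exactly three.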
    

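    Very large JSON entities

    jq has a streaming parser for handling very large JSON entities.

    The following filter is intended for use with the jq streaming parser (jq --stream) in a pipeline whose second stage makes the synopses unique, as in this example: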

    jq --arg depth 0 -c --stream -f synopsis.jq input.json | sort -u
    

    In the following formulation, the desired depth limit must be specified on the command line. Specifying 0 means no limit.

    synopsis.jq
    # Usage: jq --arg depth DEPTH -c --stream -f synopsis.jq input.json | sort -u
    # or:    jq --arg depth DEPTH -c --stream -f synopsis.jq input.json | jq -s -c unique[]
    def synopsis(depth):
      select(length == 2)
      | .[0]
      | if depth > 0 then select(length <= depth)
        elif (depth < 0) then select(length == -depth)
        else . end
      | map( if type=="number" then [] else . end) ;
    
    synopsis( $depth | if . then tonumber else 0 end )
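
To make the event handling concrete, here is a small Python sketch that applies the same selection to a hand-written list of stream events. The events are an assumption about what `jq --stream` emits for the input {"a":[1,2],"b":{"c":null}}: [path, leaf] pairs for leaf values and one-element [path] events that close an array or object. The function name and sample are illustrative only.

```python
import json

def synopsis(event, depth=0):
    """Return the synopsis of one stream event, or None if the event is skipped."""
    if len(event) != 2:  # closing events carry only a path, no leaf value
        return None
    path = event[0]
    if depth > 0 and len(path) > depth:
        return None
    if depth < 0 and len(path) != -depth:
        return None
    # Array indices are replaced by an empty list; string keys are kept.
    return [[] if isinstance(p, int) else p for p in path]

# Hand-written events in the shape described above (assumed, not captured output).
events = [
    [["a", 0], 1],
    [["a", 1], 2],
    [["a", 1]],
    [["b", "c"], None],
    [["b", "c"]],
    [["b"]],
]

# Equivalent of piping the compact output through `sort -u`.
unique = sorted({
    json.dumps(s, separators=(",", ":"))
    for s in (synopsis(e) for e in events)
    if s is not None
})
for line in unique:
    print(line)
```

Only leaf events contribute, so a path appears in the result only if some leaf value sits at exactly that path; the two array elements under "a" collapse into a single synopsis.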
    

    Example:

    curl -Ss 'http://forecast.weather.gov/MapClick.php?FcstType=json&lat=39.56&lon=-104.85' |
      jq --arg depth 0 -c --stream -f synopsis.jq |
      sort -u | head -n 50
    

    ["creationDate"]
    ["creationDateLocal"]
    ["credit"]
    ["currentobservation","Altimeter"]
    ["currentobservation","Date"]
    ["currentobservation","Dewp"]
    ["currentobservation","Gust"]
    ["currentobservation","Relh"]
    ["currentobservation","SLP"]
    ["currentobservation","Temp"]
    ["currentobservation","Visibility"]
    ["currentobservation","Weather"]
    ["currentobservation","Weatherimage"]
    ["currentobservation","WindChill"]
    ["currentobservation","Windd"]
    ["currentobservation","Winds"]
    ["currentobservation","elev"]
    ["currentobservation","id"]
    ["currentobservation","latitude"]
    ["currentobservation","longitude"]
    ["currentobservation","name"]
    ["currentobservation","state"]
    ["currentobservation","timezone"]
    ["data","hazard",[]]
    ["data","hazardUrl",[]]

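    【Comments】:

      【Solution 3】:

      If you are interested in looking at objects at a specific depth, you can use getpath and paths. paths returns the paths to all values in the input. You can filter those paths down to the ones of a specific length, then use getpath to get the corresponding values.

      For example, to view all values at depth 3 of the current object: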

      getpath(paths | select(length == 3))
      

      You can then filter and narrow things down from there.
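
The same two steps, enumerating paths and then looking up values by path, can be sketched in Python. This is only an analogue for illustration: the helpers `paths` and `getpath` are local functions named after the jq builtins, and the sample document is hypothetical.

```python
def paths(value, prefix=()):
    """Yield every non-empty path (tuple of keys and indices) below value."""
    if isinstance(value, dict):
        items = value.items()
    elif isinstance(value, list):
        items = enumerate(value)
    else:
        return  # scalars have no children
    for key, child in items:
        path = prefix + (key,)
        yield path
        yield from paths(child, path)

def getpath(value, path):
    """Follow a path of keys and indices down from value."""
    for key in path:
        value = value[key]
    return value

# Hypothetical sample document.
doc = {"store": {"books": [{"title": "A"}, {"title": "B"}], "open": True}}

# Values found at depth exactly 3.
values_at_3 = [getpath(doc, p) for p in paths(doc) if len(p) == 3]
print(values_at_3)

# Paths shorter than 3: shows which keys exist without printing their contents.
shallow = [p for p in paths(doc) if len(p) < 3]
print(shallow)
```

Listing only the short paths gives an overview of the available keys, while the depth-3 lookup returns the actual values found at that level.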

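      【Comments】:

      • This is close to what I want. I would change it to: paths | select(length < 3). I don't want the actual content, just the keys, to get an idea of what data is available.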