【Question title】: JSON -> csv creating header line and padding header if found empty field
【Posted】: 2023-03-17 05:45:01
【Question】:

I have a bash program that takes JSON Lines files, each containing millions of objects like the one below, one per line (see source):

{
  "company_number": "09626947",
  "data": {
    "address": {
      "address_line_1": "Troak Close",
      "country": "England",
      "locality": "Christchurch",
      "postal_code": "BH23 3SR",
      "premises": "9",
      "region": "Dorset"
    },
    "country_of_residence": "United Kingdom",
    "date_of_birth": {
      "month": 11,
      "year": 1979
    },
    "etag": "7123fb76e4ad7ee7542da210a368baa4c89d5a06",
    "kind": "individual-person-with-significant-control",
    "links": {
      "self": "/company/09626947/persons-with-significant-control/individual/FFeqke7T3LvGvX6xmuGqi5SJXAk"
    },
    "name": "Ms Angela Lynette Miller",
    "name_elements": {
      "forename": "Angela",
      "middle_name": "Lynette",
      "surname": "Miller",
      "title": "Ms"
    },
    "nationality": "British",
    "natures_of_control": [
      "significant-influence-or-control"
    ],
    "notified_on": "2016-06-06"
  }
}

My jq query looks like this:

for file in psc_chunk_*; do
jq --slurp --raw-output 'def pad($n): range(0;$n) as $i | 
.[$i]; ([.[] | .data.natures_of_control | length] | max) as $mx |
.[] | 
select(.data) |
[.company_number, .data.kind, .data.address.address_line_1, .data.address.country, .data.address.locality, .data.address.postal_code, .data.address.premises, .data.identification.country_registered, .data.identification.legal_authority, .data.identification.legal_form, .data.identification.place_registered, .data.identification.registration_number, .data.ceased_on, .data.country_of_residence, "\(.data.date_of_birth.year)-\(.data.date_of_birth.month)", .data.etag, .data.links.self, .data.name, .data.name_elements.title, .data.name_elements.forename, .data.name_elements.middle_name, .data.name_elements.surname, .data.nationality, .data.notified_on, (.data.natures_of_control | pad($mx))] |
@csv' "$file" > "$file.csv"
done

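This probably hurts the eyes of many jq experts: it is not efficient at extracting the key:value pairs, and if the provider happens to rename a key, my code will stop working.

Is there a way to flatten all of the JSON to CSV, keeping the keys as headers? An extra difficulty is that there is a list, natures_of_control, with a varying number of entries (I used the pad function for that, to get a rectangular result).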

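【Comments】:

  • Please follow the minimal reproducible example guidelines, especially the "minimal" part. It is not clear to me how you want to handle array-valued keys.
  • Thank you very much for the suggestion. I forgot to write in the question body that I actually want all keys to become columns! That is what you assumed in your answer below. Thanks a lot! I am testing it now.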

Tags: json csv padding jq missing-data


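【Solution 1】:

Here is an approach based on determining the headers programmatically. For illustration, we restrict attention to a single object.

Since jq's paths built-in ignores paths to null, and since one of the requirements here is not to ignore such paths, we first define some filters analogous to paths/0 and paths/1: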

# Generate a stream of all paths, including paths to null
def allpaths:
  def conditional_recurse(f):  def r: ., (select(.!=null) | f | r); r;
  path(conditional_recurse(.[]?)) | select(length > 0);

def allpaths(filter):
  allpaths as $p | getpath($p) as $v | select($v | filter) | $p;
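A small check of what these two filters emit can help when adapting them. The sketch below runs them on a made-up object (the keys a, b and c are illustrative, not from the question) and assumes jq 1.5 or later is on the PATH. One detail worth knowing: allpaths(filter) keeps a path only when filter yields a truthy value, so a filter such as scalars, which passes the value itself through, drops leaves whose value is null or false. Writing `scalars | true` is one possible way to keep them.

```shell
# Made-up sample; the keys a, b and c are illustrative only.
sample='{"a":null,"b":{"c":1}}'

# The two definitions from the answer, kept in a shell variable for reuse.
defs='
def allpaths:
  def conditional_recurse(f): def r: ., (select(.!=null) | f | r); r;
  path(conditional_recurse(.[]?)) | select(length > 0);
def allpaths(filter):
  allpaths as $p | getpath($p) as $v | select($v | filter) | $p;
'

# Every path, including the path to the null stored under "a".
every_path=$(printf '%s\n' "$sample" | jq -c "$defs [allpaths]")

# scalars emits the value itself, so select() sees null for "a" and drops it.
scalar_paths=$(printf '%s\n' "$sample" | jq -c "$defs [allpaths(scalars)]")

# Mapping each scalar to true keeps null (and false) leaves as well.
scalar_paths_keep_null=$(printf '%s\n' "$sample" | jq -c "$defs [allpaths(scalars | true)]")

printf '%s\n' "$every_path" "$scalar_paths" "$scalar_paths_keep_null"
# [["a"],["b"],["b","c"]]
# [["b","c"]]
# [["a"],["b","c"]]
```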

Next, we define a function for abbreviating long paths. You may want to adapt it to your own needs.

# Input: an array denoting a path; output: a string
def abbreviate: if .[-1]|type == "number" then "\(.[-2]):\(.[-1])" else "\(.[-1])" end;
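As a quick illustration of what abbreviate returns, the sketch below applies it to two paths taken from the question's object. It also shows one possible adaptation, a helper named fullname that joins every path component with "_". That helper is hypothetical and not part of the answer. jq 1.5 or later is assumed.

```shell
headers=$(jq -n -c '
  def abbreviate: if .[-1]|type == "number" then "\(.[-2]):\(.[-1])" else "\(.[-1])" end;
  # Hypothetical alternative: keep the whole path, joined with underscores.
  def fullname: map(tostring) | join("_");
  [ ["data","address","country"], ["data","natures_of_control",0] ]
  | { abbreviated: map(abbreviate), full: map(fullname) }
')
printf '%s\n' "$headers"
# {"abbreviated":["country","natures_of_control:0"],"full":["data_address_country","data_natures_of_control_0"]}
```

The short form reads better but can produce the same header for two different paths that end in the same key, whereas the full form stays unique at the cost of longer column names.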

Finally, we put it all together for the single-object case by producing a header row followed by a row of the corresponding values:

[allpaths(scalars)] as $p
| ($p | map(abbreviate) | @csv),
  ([getpath($p[])] | @csv)

Output

For the JSON object in the question, the program above (run with the -r command-line option) produces the following CSV:

"company_number","address_line_1","country","locality","postal_code","premises","region","country_of_residence","month","year","etag","kind","self","name","forename","middle_name","surname","title","nationality","natures_of_control:0","notified_on"
"09626947","Troak Close","England","Christchurch","BH23 3SR","9","Dorset","United Kingdom",11,1979,"7123fb76e4ad7ee7542da210a368baa4c89d5a06","individual-person-with-significant-control","/company/09626947/persons-with-significant-control/individual/FFeqke7T3LvGvX6xmuGqi5SJXAk","Ms Angela Lynette Miller","Angela","Lynette","Miller","Ms","British","significant-influence-or-control","2016-06-06"

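【Comments】:

  • Hi peak! This works great, and I also learned a lot of jq syntax while deciphering your solution. Thank you! After trying it in jqplay with one JSON object per line, I realised that it creates a header row every other line. Is there a way to get just one header, with as many columns for the unpacked natures_of_control as there are entries in the longest natures_of_control array, while leaving the extra fields empty in rows that have fewer entries in their natures_of_control array? Wow, that is complicated, sorry. Anything at this stage would be much appreciated!! :)
  • Tytire Recubans - please see "Efficient processing of a stream of homogeneous objects" elsewhere on this page.
【Solution 2】: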

Here is a solution that handles arrays by converting each array in the input JSON into "colon-separated values":

def atos: map(tostring) | join(":");
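For a feel of what atos does, here is a minimal sketch on two made-up arrays (the values are illustrative only). jq 1.5 or later is assumed.

```shell
joined=$(jq -n -c '
  def atos: map(tostring) | join(":");
  # Strings pass through unchanged; other scalars are rendered as JSON text.
  [ (["x","y"] | atos), ([1,true,null] | atos) ]
')
printf '%s\n' "$joined"
# ["x:y","1:true:null"]
```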

The same generic allpaths filter given elsewhere on this page is also used:

# Generate a stream of all paths, including paths to null
def allpaths:
  def conditional_recurse(f):  def r: ., (select(.!=null) | f | r); r;
  path(conditional_recurse(.[]?)) | select(length > 0);

def allpaths(filter):
  allpaths as $p | getpath($p) as $v | select($v | filter) | $p;

Again for the single-object case, a solution is as follows:

walk( if type == "array" then atos else . end )
| [allpaths(scalars)] as $p
| ($p | map(last) | @csv),
  ([getpath($p[])] | @csv)

Output

For the given input, the output would be:

"company_number","address_line_1","country","locality","postal_code","premises","region","country_of_residence","month","year","etag","kind","self","name","forename","middle_name","surname","title","nationality","natures_of_control","notified_on"
"09626947","Troak Close","England","Christchurch","BH23 3SR","9","Dorset","United Kingdom",11,1979,"7123fb76e4ad7ee7542da210a368baa4c89d5a06","individual-person-with-significant-control","/company/09626947/persons-with-significant-control/individual/FFeqke7T3LvGvX6xmuGqi5SJXAk","Ms Angela Lynette Miller","Angela","Lynette","Miller","Ms","British","significant-influence-or-control","2016-06-06"

Caveat

The solution presented here only applies when every array in the input contains scalar values only.
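To see why the caveat matters, the sketch below feeds the same walk/atos step an array of objects. The input is made up for illustration, and jq 1.6 or later is assumed so that walk is available as a built-in.

```shell
flattened=$(jq -n -r '
  def atos: map(tostring) | join(":");
  # Made-up input: an array whose elements are objects, not scalars.
  {"items":[{"k":1},{"k":2}]}
  | walk( if type == "array" then atos else . end )
  | .items
')
printf '%s\n' "$flattened"
# {"k":1}:{"k":2}
```

Each object is serialised to JSON text before joining, so the resulting cell holds embedded JSON rather than separate columns.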

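Efficient processing of a stream of homogeneous objects

In the following, the stream of objects is assumed to be homogeneous, except that the order of keys within each JSON object does not matter.

Infrastructure

The infrastructure, allpaths and atos, is as above and is not repeated here.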

Helper functions

# input: an object
def paths:
  walk( if type == "array" then atos else . end )
  | [allpaths(scalars)] ;

# input: an array of paths
def headers:
  map(last) | @csv ; 

# input: an object
def row($paths):
  walk( if type == "array" then atos else . end )
  | [getpath($paths[])]
  | @csv ;

Processing the input stream

The following uses input to read the first object and inputs to read the rest, so jq must be invoked with the -n command-line option:

input as $first
| ($first|paths) as $paths
| ($paths | headers),
  ($first | row($paths)),
  (inputs | row($paths))
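Putting the pieces together, a complete invocation might look like the sketch below. The two JSON Lines records are made up for illustration (note the different key order and array lengths), and jq 1.6 or later is assumed so that walk is available as a built-in. In practice the program would be saved to a file and run against each chunk.

```shell
# Two made-up records, one JSON object per line.
records='{"id":"a1","info":{"n":1},"tags":["x","y"]}
{"tags":["z"],"id":"b2","info":{"n":2}}'

csv=$(printf '%s\n' "$records" | jq -n -r '
  def atos: map(tostring) | join(":");
  def allpaths:
    def conditional_recurse(f): def r: ., (select(.!=null) | f | r); r;
    path(conditional_recurse(.[]?)) | select(length > 0);
  def allpaths(filter):
    allpaths as $p | getpath($p) as $v | select($v | filter) | $p;
  # Helper functions from the answer (paths shadows the built-in of that name).
  def paths: walk( if type == "array" then atos else . end ) | [allpaths(scalars)];
  def headers: map(last) | @csv;
  def row($paths): walk( if type == "array" then atos else . end ) | [getpath($paths[])] | @csv;

  # The first object fixes the column set; every later object reuses it.
  input as $first
  | ($first | paths) as $paths
  | ($paths | headers),
    ($first | row($paths)),
    (inputs | row($paths))
')
printf '%s\n' "$csv"
# "id","n","tags"
# "a1",1,"x:y"
# "b2",2,"z"
```

Because only the first object is inspected for paths, a single header row is emitted and the remaining objects are streamed one at a time rather than slurped into memory.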

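【Comments】:

  • Hi @peak - I finally read everything I needed to read in the manual and the wiki to understand your solution, and tried it out. I just wanted to highlight how elegant and portable the final solution for homogeneous objects is. Thank you very much!
  • Hi @peak. I am trying to use your first answer on this json object, and I would like to get the full names. For example, if I have a nested object {"matches" : [ { "address_snippet" : ["integer"] } ] }, I would like the key for the value "integer" to be "matches_address_snippet".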