Convert JSON using jq based on specific constraints: answers

【Question Title】: Convert json using jq based on specific constraints
【Posted】: 2017-01-23 01:02:34
【Question】:

I have a json file "OpenEnded_mscoco_val2014.json". The json file contains 121,512 questions.
Here is a sample:

"questions": [
{
  "question": "What is the table made of?",
  "image_id": 350623,
  "question_id": 3506232
},
{
  "question": "Is the food napping on the table?",
  "image_id": 350623,
  "question_id": 3506230
},
{
  "question": "What has been upcycled to make lights?",
  "image_id": 350623,
  "question_id": 3506231
},
{
  "question": "Is this an Spanish town?",
  "image_id": 8647,
  "question_id": 86472
}

]

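I converted the json to csv using jq -r '.questions | [map(.question), map(.image_id), map(.question_id)] | @csv' OpenEnded_mscoco_val2014_questions.json >> temp.csv
But here the output in the csv is all the questions followed by all the image_ids, which is what the above code does.
The expected output is: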

"What is table made of",350623,3506232
"Is the food napping on the table?",350623,3506230

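Is it also possible to keep only the results having image_id <= 10000, and to group questions having the same image_id? For example, results 1, 2 and 3 of the json could be combined into one record with 3 questions, 1 image_id and 3 question_ids.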

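EDIT: The first problem is solved by the possible duplicate question. I would like to know whether comparison operators can be used on the command line in jq when converting the json file. In this case, I want to get all the fields from the json only if image_id <= 10000.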

【Comments】:

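It's not clear what your first question is.
Possible duplicate of How to convert arbitrary simple JSON to CSV using jq?
I want to filter on the value of image_id using jq.
1. Please fix the question so that the sample input is valid JSON. 2. If the point of the question is that (.questions | length) is so large that you don't want to read the whole file into memory, then please say so. (In that case, jq has a streaming parser that might help.)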

Tags: python json csv filtering jq

【Solution 1】:

With the -r option, the following filter

  .questions[] | [ .[] ] | @csv

produces

"What is the table made of?",350623,3506232
"Is the food napping on the table?",350623,3506230
"What has been upcycled to make lights?",350623,3506231
"Is this an Spanish town?",8647,86472

To filter the data, use select. For example, with the -r option the following filter

  .questions[] | select(.image_id <= 10000) | [ .[] ] | @csv

produces the subset

"Is this an Spanish town?",8647,86472
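
Since the question is also tagged python, here is a minimal Python sketch of the same select-then-CSV step, for comparison only. The function name to_csv and its max_image_id parameter are made up for illustration, and the records are the four samples from the question rather than the real file.

```python
import csv
import io

# Sample records copied from the question; a real file would be read with json.load().
questions = [
    {"question": "What is the table made of?", "image_id": 350623, "question_id": 3506232},
    {"question": "Is the food napping on the table?", "image_id": 350623, "question_id": 3506230},
    {"question": "What has been upcycled to make lights?", "image_id": 350623, "question_id": 3506231},
    {"question": "Is this an Spanish town?", "image_id": 8647, "question_id": 86472},
]


def to_csv(records, max_image_id=None):
    """Return CSV text, one row per record, optionally keeping only image_id <= max_image_id."""
    buffer = io.StringIO()
    # QUOTE_NONNUMERIC quotes strings and leaves numbers bare, much like jq's @csv.
    writer = csv.writer(buffer, quoting=csv.QUOTE_NONNUMERIC, lineterminator="\n")
    for record in records:
        if max_image_id is not None and record["image_id"] > max_image_id:
            continue  # plays the role of select(.image_id <= 10000)
        writer.writerow([record["question"], record["image_id"], record["question_id"]])
    return buffer.getvalue()


print(to_csv(questions, max_image_id=10000), end="")
# prints: "Is this an Spanish town?",8647,86472
```

Naming the three fields explicitly, rather than relying on key order, keeps the column order stable even if the input objects list their keys differently.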

To group the data, use group_by. The following filter

    .questions
  | group_by(.image_id)[]
  | [ .[] | [ .[] ] | @csv ]

produces grouped data

[
  "\"Is this an Spanish town?\",8647,86472"
]
[
  "\"What is the table made of?\",350623,3506232",
  "\"Is the food napping on the table?\",350623,3506230",
  "\"What has been upcycled to make lights?\",350623,3506231"
]

This isn't very useful in this form and is probably not exactly what you want, but it demonstrates the basic approach.

【Discussion】:

【Solution 2】:

1) Given your input (suitably elaborated to make it valid JSON), the following query generates the CSV output shown:

$ jq -r '.questions[] | [.question, .image_id, .question_id] | @csv'

"What is the table made of?",350623,3506232
"Is the food napping on the table?",350623,3506230
"What has been upcycled to make lights?",350623,3506231
"Is this an Spanish town?",8647,86472

The key thing to remember here is that @csv requires a flat array, but as with all jq filters, you can feed it a stream.

2) To filter using the criterion .image_id <= 10000, just interpose the appropriate select/1 filter:

.questions[]
| select(.image_id <= 10000)
| [.question, .image_id, .question_id]
| @csv

3) To sort by image_id, use sort_by(.image_id):

.questions
| sort_by(.image_id)
|.[]
| [.question, .image_id, .question_id]
| @csv

4) To group by .image_id, pipe the output of the following pipeline into a pipeline of your own:

.questions | group_by(.image_id)

However, you will have to decide how you want to combine the grouped objects.
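
As one possible way to combine the groups, the question's own suggestion (3 questions, 1 image_id, 3 question_ids per record) could be sketched in Python as below. This is only an illustration under that assumption: group_by_image and the output keys "questions" and "question_ids" are hypothetical names, not anything defined by jq or by the dataset.

```python
from itertools import groupby
from operator import itemgetter

# Sample records copied from the question.
questions = [
    {"question": "What is the table made of?", "image_id": 350623, "question_id": 3506232},
    {"question": "Is the food napping on the table?", "image_id": 350623, "question_id": 3506230},
    {"question": "What has been upcycled to make lights?", "image_id": 350623, "question_id": 3506231},
    {"question": "Is this an Spanish town?", "image_id": 8647, "question_id": 86472},
]


def group_by_image(records):
    """Return one dict per image_id, collecting its questions and question_ids in lists."""
    # itertools.groupby only merges adjacent items, so sort by the key first.
    ordered = sorted(records, key=itemgetter("image_id"))
    grouped = []
    for image_id, members in groupby(ordered, key=itemgetter("image_id")):
        members = list(members)
        grouped.append({
            "image_id": image_id,
            "questions": [m["question"] for m in members],
            "question_ids": [m["question_id"] for m in members],
        })
    return grouped


for group in group_by_image(questions):
    print(group["image_id"], len(group["questions"]))
# prints:
# 8647 1
# 350623 3
```

Like jq's group_by, this yields the groups in ascending image_id order; because Python's sort is stable, questions within a group keep their original order.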

【Discussion】:

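Regarding the second answer, is it possible to write .question |select(.image_id
In (2), the given filter does emit the constrained output! Did you try it?
Hey @peak, thanks, it all works fine! Is it possible to extract specific question types from the JSON data? For example, I only want questions that start with "How", "What is", etc., using json.load().
5) Consider select(startswith(_)). If you have access to a version of jq with regex support, consider using test/1.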
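
For the json.load() route mentioned in the comment above, a rough Python sketch might look like the following. It assumes the top-level object has a "questions" array as in the sample. The helper name questions_starting_with is hypothetical, and the inline string stands in for the real file, which would be read with json.load() on an open file handle.

```python
import json

# Inline stand-in for the file contents; for a real file, use json.load(file_handle).
raw = """
{"questions": [
  {"question": "What is the table made of?", "image_id": 350623, "question_id": 3506232},
  {"question": "Is the food napping on the table?", "image_id": 350623, "question_id": 3506230},
  {"question": "What has been upcycled to make lights?", "image_id": 350623, "question_id": 3506231},
  {"question": "Is this an Spanish town?", "image_id": 8647, "question_id": 86472}
]}
"""
data = json.loads(raw)


def questions_starting_with(records, prefixes):
    """Return the records whose question text begins with any of the given prefixes."""
    # str.startswith accepts a tuple, so several prefixes can be checked in one call.
    prefixes = tuple(prefixes)
    return [r for r in records if r["question"].startswith(prefixes)]


matches = questions_starting_with(data["questions"], ["How", "What is"])
for match in matches:
    print(match["question"])
# prints: What is the table made of?
```

The match is case-sensitive and anchored at the start of the string, so "What has been upcycled to make lights?" is not selected by the prefix "What is".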