How to convert a multi record multiline JSON into single line per record JSON for AWS Athena?

【Question title】: How to convert a multi record multiline JSON into single line per record JSON for AWS Athena?
【Posted】: 2023-04-03 20:38:01
【Question】:

I want to use JSON files in AWS Athena, but Athena does not support multiline JSON.

I have the following (one of the values is XML):

{
  "id" : 10,
  "name" : "bob",
  "data" : "<some> \n <xml> \n <in here>"
},
{
  "id" : 20,
  "name" : "jane",
  "data" : "<other> \n <xml> \n <in here>"
}

I need the following for Athena:

{ "id" : 10, "name" : "bob", "data" : "<some> <xml> <in here>" },
{ "id" : 20, "name" : "jane", "data" : "<other> <xml> <in here>" }

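I am exporting the data from DB2 with RazorSQL and have tried writing some Python code to "flatten" it, but have not succeeded yet.

Thanks!

【Comments】:

This is not valid JSON syntax, nor is it meaningful Python syntax. Is this content in a file?
Could you be more specific about what the problem is? See How to Ask and the help center.
The actual JSON file is more like an array [ { "prop": "value"}, { "prop" : "value"} ], but it seems Athena only likes it the way my example shows. I tried it and it works in Athena in that format, but don't take my word for it, since I am only just learning it.

Tags: python sql json aws-lambda amazon-athena

【Solution 1】:

I ended up doing something quick and dirty: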

import json

# Load the whole JSON array into memory, then print one record per line.
with open('data.json') as jfile:
    data = json.load(jfile)
    for d in data:
        print(json.dumps(d) + ',')

which printed:

{"id": 200, "name": "bob", "data": "<other> \n <xml> \n <data>"},
{"id": 200, "name": "bob", "data": "<other> \n <xml> \n <data>"},

I just saved the output to another file :P

It failed because the file was too big, but hey... it was close!
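
The snippet in this answer calls json.load, which reads the entire file into memory, and that is a likely reason it failed on a large export. Below is a minimal streaming sketch using only the standard library. It assumes every top-level item is a JSON object, either inside one array or as a comma-separated run like the sample in the question. The names iter_records and to_json_lines are hypothetical helpers, not part of any library.

```python
import io
import json


def iter_records(fp, chunk_size=65536):
    """Yield top-level JSON objects from fp without loading the whole file."""
    decoder = json.JSONDecoder()
    buf = ""
    while True:
        chunk = fp.read(chunk_size)
        buf += chunk
        while True:
            # Skip whitespace and the punctuation that separates records.
            buf = buf.lstrip(" \t\r\n,[]")
            if not buf:
                break
            try:
                obj, end = decoder.raw_decode(buf)
            except json.JSONDecodeError:
                break  # probably an incomplete object: read another chunk
            yield obj
            buf = buf[end:]
        if not chunk:
            if buf:
                # End of input reached with text that never parsed.
                raise ValueError("trailing data is not a complete JSON object")
            return


def to_json_lines(src, dst):
    """Write one compact JSON object per line; return the record count."""
    count = 0
    for record in iter_records(src):
        # json.dumps escapes newlines inside strings, so each record
        # stays on a single physical line.
        dst.write(json.dumps(record) + "\n")
        count += 1
    return count


# Small in-memory demonstration; real use would pass open file objects.
demo_in = io.StringIO('[\n  {"id": 10, "name": "bob"},\n  {"id": 20, "name": "jane"}\n]')
demo_out = io.StringIO()
to_json_lines(demo_in, demo_out)
print(demo_out.getvalue(), end="")
```

Memory use is bounded by the chunk size plus the largest single record. If the input is malformed, the sketch keeps reading until the end of the file before raising, so it is not a validator.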

【Comments】:

【Solution 2】:

Using regular expressions:

import re
html = '''
{
  "id" : 10,
  "name" : "bob",
  "data" : "<some> \n <xml> \n <in here>"
},
{
  "id" : 20,
  "name" : "jane",
  "data" : "<other> \n <xml> \n <in here>"
}
'''


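def replaceReg(html, regex, new):
    # Apply one regular-expression substitution to the text.
    return re.sub(regex, new, html)

html = replaceReg(html, r' \n ', ' ')      # collapse line breaks inside values
html = replaceReg(html, r'\{\s+', '{ ')    # opening brace and first key on one line
html = replaceReg(html, r'\s+\}', ' }')    # closing brace on the same line
html = replaceReg(html, r',\s+', ', ')     # join the remaining key/value lines
html = replaceReg(html, r'\}, ', '}\n')    # one record per line, closing brace kept
print(html)

Result:

{ "id" : 10, "name" : "bob", "data" : "<some> <xml> <in here>" }
{ "id" : 20, "name" : "jane", "data" : "<other> <xml> <in here>" }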
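
Regex substitutions work on raw text, so they can also change braces or commas that happen to appear inside a string value. A JSON-aware alternative is sketched below. It assumes the input is a comma-separated run of objects as in the question (so wrapping it in [ ] makes it a valid array) and that it fits in memory. The names collapse_ws and flatten_records are hypothetical helpers.

```python
import json
import re


def collapse_ws(value):
    """Collapse whitespace runs (including newlines) in strings to one space."""
    if isinstance(value, str):
        return re.sub(r"\s+", " ", value)
    if isinstance(value, list):
        return [collapse_ws(v) for v in value]
    if isinstance(value, dict):
        return {k: collapse_ws(v) for k, v in value.items()}
    return value


def flatten_records(text):
    """Return one compact JSON document per line."""
    stripped = text.strip().rstrip(",")
    # Assumption: input without enclosing brackets is a bare "{...}, {...}" run.
    if not stripped.startswith("["):
        stripped = "[" + stripped + "]"
    records = json.loads(stripped)
    return "\n".join(json.dumps(collapse_ws(r)) for r in records)


# Raw string so that \n stays a JSON escape, as it would in a file on disk.
sample = r'''
{
  "id" : 10,
  "name" : "bob",
  "data" : "<some> \n <xml> \n <in here>"
},
{
  "id" : 20,
  "name" : "jane",
  "data" : "<other> \n <xml> \n <in here>"
}
'''
print(flatten_records(sample))
```

Collapsing whitespace inside values is optional: json.dumps already escapes embedded newlines, so each record would stay on one line even without collapse_ws.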

【Comments】:

【Solution 3】:

You only need to remove the trailing newline (\n) from each line while writing to another file:

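with open('input.txt', 'r') as f_in, open('output.txt', 'w') as f_out:
    # Drop the newline at the end of every line. Note that this joins the
    # whole file, all records included, into a single output line.
    for line in f_in:
        f_out.write(line.replace('\n', ''))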

where input.txt contains this data:

{
  "id" : 10,
  "name" : "bob",
  "data" : "<some> \n <xml> \n <in here>"
},
{
  "id" : 20,
  "name" : "jane",
  "data" : "<other> \n <xml> \n <in here>"
}
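
The loop in this answer removes every newline, so all records end up on one line rather than one line per record. A line-based sketch that keeps the record boundaries follows. It assumes flat records (no nested objects) whose closing brace sits alone on its own line as } or }, exactly as in the input.txt sample. The name join_record_lines is a hypothetical helper.

```python
def join_record_lines(lines):
    """Yield one string per record from pretty-printed, brace-delimited lines."""
    parts = []
    for line in lines:
        text = line.strip()
        # Ignore blank lines and optional enclosing array brackets.
        if not text or text in ("[", "]"):
            continue
        if text in ("}", "},"):
            # Assumption: a record ends on a line holding only its closing
            # brace; the separating comma is dropped.
            parts.append("}")
            yield " ".join(parts)
            parts = []
        else:
            parts.append(text)
    if parts:
        # Leftover lines mean the assumption above did not hold.
        raise ValueError("input ended inside a record")


sample_lines = [
    '{\n',
    '  "id" : 10,\n',
    '  "name" : "bob"\n',
    '},\n',
    '{\n',
    '  "id" : 20,\n',
    '  "name" : "jane"\n',
    '}\n',
]
for record in join_record_lines(sample_lines):
    print(record)
```

To process files, pass the open input file as lines and write each yielded record followed by a newline to the output file. Because the generator holds only one record at a time, it also works for inputs that do not fit in memory.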

【Comments】: