【Posted】: 2021-05-13 21:55:42
【Question】:
I want to use the IMDb open data, but it is provided in TSV format, which is not very convenient.
https://datasets.imdbws.com/title.crew.tsv.gz
tconst directors writers
tt0000238 nm0349785 \N
tt0000239 nm0349785 \N
tt0000240 \N \N
tt0000241 nm0349785 \N
tt0000242 nm0617588 nm0617588
tt0000243 nm0349785 \N
tt0000244 nm0349785 \N
tt0000245 \N \N
tt0000246 nm0617588 \N
tt0000247 nm0002504,nm0005690,nm2156608 nm0000636,nm0002504
tt0000248 nm0808310 \N
tt0000249 nm0808310 \N
tt0000250 nm0005717 \N
tt0000251 nm0177862 \N
I want to convert the TSV data to JSON.
[
{
"tconst": "tt0000247",
"directors": [
"nm0005690",
"nm0002504",
"nm2156608"
],
"writers": [
"nm0000636",
"nm0002504"
]
},
{
"tconst": "tt0000248",
"directors": [
"nm0808310"
],
"writers": [
"\\N"
]
}
]
I can do this with the following command:
jq -rRs 'split("\n")[1:-1] |
map([split("\t")[]|split(",")] | {
"tconst":.[0][0],
"directors":.[1],
"writers":.[2]
}
)' ./title.crew.tsv > ./title.crew.json
However, the file is very large, and I get an out-of-memory error.
1. How can I split this TSV file into multiple JSON files, each containing 1000 records?
./title.crew.page1.json
./title.crew.page2.json
./title.crew.page3.json
2. How can I exclude empty fields? There should be an empty array instead.
"writers": [ "\\N" ] -> "writers": [ ]
UPD (the second question is solved):
jq -rRs 'split("\n")[1:-1] |
map([split("\t")[]|split(",")] |
.[2] |= if .[0] == "\\N" then [] else . end | {
"tconst":.[0][0],
"directors":.[1],
"writers":.[2]
}
)' ./title.crew.tsv > ./title.crew.json
[
{
"tconst": "tt0000247",
"directors": [
"nm0005690",
"nm0002504",
"nm2156608"
],
"writers": [
"nm0000636",
"nm0002504"
]
},
{
"tconst": "tt0000248",
"directors": [
"nm0808310"
],
"writers": []
}
]
Thank you for your answers.
【Discussion】:
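One possible way around the memory error, sketched in Python rather than jq. The `-s` flag makes jq read the whole file into a single string before filtering, so a line-by-line reader avoids holding everything at once. The names `parse_ids`, `read_records` and `write_pages` and the `page_size` parameter are made up for illustration, not part of any library. The sketch assumes three tab-separated columns, one header row, and `\N` as the empty marker in both the directors and writers columns.

```python
import gzip
import itertools
import json
from pathlib import Path

NULL_MARKER = r"\N"  # how the IMDb TSV files mark a missing value


def parse_ids(field):
    """Split a comma-separated ID list; the null marker becomes []."""
    return [] if field == NULL_MARKER else field.split(",")


def read_records(path):
    """Yield one dict per data row, reading the file line by line."""
    # Assumption: a ".gz" suffix means the file is still gzip-compressed.
    opener = gzip.open if str(path).endswith(".gz") else open
    with opener(path, "rt", encoding="utf-8") as handle:
        next(handle, None)  # skip the header row
        for line in handle:
            columns = line.rstrip("\r\n").split("\t")
            if len(columns) < 3:
                continue  # ignore blank or malformed lines
            tconst, directors, writers = columns[:3]
            yield {
                "tconst": tconst,
                "directors": parse_ids(directors),
                "writers": parse_ids(writers),
            }


def write_pages(records, out_dir, stem="title.crew", page_size=1000):
    """Write records to <stem>.page<N>.json files and return the paths."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    records = iter(records)
    written = []
    for page in itertools.count(1):
        # Only one page of records is held in memory at a time.
        chunk = list(itertools.islice(records, page_size))
        if not chunk:
            break
        target = out_dir / f"{stem}.page{page}.json"
        with open(target, "w", encoding="utf-8") as handle:
            json.dump(chunk, handle, indent=2)
        written.append(target)
    return written


# Example call (not executed here):
# write_pages(read_records("title.crew.tsv"), ".")
```

Because `read_records` is a generator, memory use depends on `page_size` rather than on the size of the input. Each page is a complete JSON array, so the files can be loaded independently.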