【Question title】: Both json.load and json.loads are unable to load my jsonl file
【Posted】: 2019-06-27 07:50:23
【Question】:

I am trying to load my jsonl file in Python. I am using the code below and get the following error.

with open("mli_train_v1.jsonl", 'r', encoding='utf-8') as f:
    data = json.loads(f)

The error shown is

TypeError: the JSON object must be str, bytes or bytearray, not 'TextIOWrapper'

So I tried this instead

with open("mli_train_v1.jsonl", 'r') as f:
    data = json.load(f)

and I get the error

JSONDecodeError: Extra data: line 2 column 1 (char 835)

My jsonl file looks like this

{"sentence1": "Labs were notable for Cr 1.7 (baseline 0.5 per old records) and lactate 2.4.", "pairID": "23eb94b8-66c7-11e7-a8dc-f45c89b91419", "sentence1_parse": "(ROOT (S (NP (NNPS Labs)) (VP (VBD were) (ADJP (JJ notable) (PP (IN for) (NP (NP (NP (NN Cr) (CD 1.7)) (PRN (-LRB- -LRB-) (NP (NP (NN baseline) (CD 0.5)) (PP (IN per) (NP (JJ old) (NNS records)))) (-RRB- -RRB-))) (CC and) (NP (NN lactate) (CD 2.4)))))) (. .)))", "sentence1_binary_parse": "( Labs ( ( were ( notable ( for ( ( ( ( Cr 1.7 ) ( -LRB- ( ( ( baseline 0.5 ) ( per ( old records ) ) ) -RRB- ) ) ) and ) ( lactate 2.4 ) ) ) ) ) . ) )", "sentence2": " Patient has elevated Cr", "sentence2_parse": "(ROOT (S (NP (NN Patient)) (VP (VBZ has) (NP (JJ elevated) (NN Cr)))))", "sentence2_binary_parse": "( Patient ( has ( elevated Cr ) ) )", "gold_label": "entailment"}
{"sentence1": "Labs were notable for Cr 1.7 (baseline 0.5 per old records) and lactate 2.4.", "pairID": "23eb979c-66c7-11e7-b76c-f45c89b91419", "sentence1_parse": "(ROOT (S (NP (NNPS Labs)) (VP (VBD were) (ADJP (JJ notable) (PP (IN for) (NP (NP (NP (NN Cr) (CD 1.7)) (PRN (-LRB- -LRB-) (NP (NP (NN baseline) (CD 0.5)) (PP (IN per) (NP (JJ old) (NNS records)))) (-RRB- -RRB-))) (CC and) (NP (NN lactate) (CD 2.4)))))) (. .)))", "sentence1_binary_parse": "( Labs ( ( were ( notable ( for ( ( ( ( Cr 1.7 ) ( -LRB- ( ( ( baseline 0.5 ) ( per ( old records ) ) ) -RRB- ) ) ) and ) ( lactate 2.4 ) ) ) ) ) . ) )", "sentence2": " Patient has normal Cr", "sentence2_parse": "(ROOT (S (NP (NN Patient)) (VP (VBZ has) (NP (JJ normal) (NN Cr)))))", "sentence2_binary_parse": "( Patient ( has ( normal Cr ) ) )", "gold_label": "contradiction"}
{"sentence1": "Labs were notable for Cr 1.7 (baseline 0.5 per old records) and lactate 2.4.", "pairID": "23eb9986-66c7-11e7-9ef9-f45c89b91419", "sentence1_parse": "(ROOT (S (NP (NNPS Labs)) (VP (VBD were) (ADJP (JJ notable) (PP (IN for) (NP (NP (NP (NN Cr) (CD 1.7)) (PRN (-LRB- -LRB-) (NP (NP (NN baseline) (CD 0.5)) (PP (IN per) (NP (JJ old) (NNS records)))) (-RRB- -RRB-))) (CC and) (NP (NN lactate) (CD 2.4)))))) (. .)))", "sentence1_binary_parse": "( Labs ( ( were ( notable ( for ( ( ( ( Cr 1.7 ) ( -LRB- ( ( ( baseline 0.5 ) ( per ( old records ) ) ) -RRB- ) ) ) and ) ( lactate 2.4 ) ) ) ) ) . ) )", "sentence2": " Patient has elevated BUN", "sentence2_parse": "(ROOT (S (NP (NN Patient)) (VP (VBZ has) (NP (JJ elevated) (NN BUN)))))", "sentence2_binary_parse": "( Patient ( has ( elevated BUN ) ) )", "gold_label": "neutral"}

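【Comments】:

  • Your file does not contain a single root JSON object, which is what json.load is designed to read.
  • Didn't you mean to call json.load(f) in your first example? loads() expects a string, not a file handle, so the second thing you tried makes sense. The problem is that your file contains multiple JSON objects, so you need to do for line in f: json.loads(line), or split the lines into multiple files and load them one by one.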
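
Both errors from the question can be reproduced without the real file. The sketch below uses a two-line in-memory string as a stand-in for the JSONL data (the sample records are made up for illustration) and shows that parsing line by line succeeds where the whole-file calls fail.

```python
import io
import json

# Two JSON objects, one per line: a tiny stand-in for a JSONL file.
jsonl_text = '{"gold_label": "entailment"}\n{"gold_label": "contradiction"}\n'

# json.load expects exactly one JSON document, so the second line is "extra data".
try:
    json.load(io.StringIO(jsonl_text))
    load_error = None
except json.JSONDecodeError as exc:
    load_error = exc.msg

# json.loads expects str, bytes or bytearray, not a file-like object.
try:
    json.loads(io.StringIO(jsonl_text))
    loads_error = None
except TypeError as exc:
    loads_error = type(exc).__name__

# Parsing each line on its own works, because each line is a complete document.
records = [json.loads(line) for line in io.StringIO(jsonl_text)]
```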

Tags: python json python-3.x


【Solution 1】:

To read a JSONL file, you have to read it line by line and parse each line separately.

import json

data = []
with open("mli_train_v1.jsonl", 'r', encoding='utf-8') as f:
    for line in f:
        # Each line of a JSONL file is a complete JSON document.
        data.append(json.loads(line))
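
If the file may contain blank lines, or you want to know which line failed to parse, the same loop can be wrapped in a small generator. This is only a sketch: `iter_jsonl` is a hypothetical helper name, not part of the `json` module, and the sample input is made up.

```python
import io
import json

def iter_jsonl(lines):
    """Yield one parsed object per non-blank line (hypothetical helper)."""
    for lineno, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue  # tolerate blank lines between or after records
        try:
            yield json.loads(line)
        except json.JSONDecodeError as exc:
            # Re-raise with the line number so a bad record is easy to find.
            raise ValueError(f"line {lineno}: {exc}") from exc

# In-memory stand-in for an open file; note the blank line in the middle.
sample = io.StringIO('{"gold_label": "entailment"}\n\n{"gold_label": "neutral"}\n')
labels = [rec["gold_label"] for rec in iter_jsonl(sample)]
```

With a real file, pass the open file object instead of the `io.StringIO`, for example `data = list(iter_jsonl(f))` inside the `with open(...) as f:` block.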

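【Comments】:

  • This can be written as data = [json.loads(line) for line in open("mli_train_v1.jsonl", 'r', encoding='utf-8')]
  • @RemcoGerlich That loses the automatic file closing that with gives you, and it has to be rewritten if any statement needs to be added.
  • data = [json.loads(line) for line in f] then, or data = map(json.loads, f) if you want an iterator. The point is that initializing a list and appending to it in a loop is better written as a list comprehension.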
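
The two variants from the comments can be combined with `with` so the file is still closed reliably. The sketch below writes a small temporary file first so that it runs on its own; the file contents and the key `n` are made up for illustration.

```python
import json
import os
import tempfile

# Write a small stand-in JSONL file so the example is runnable on its own.
with tempfile.NamedTemporaryFile("w", suffix=".jsonl", delete=False, encoding="utf-8") as tmp:
    tmp.write('{"n": 1}\n{"n": 2}\n')
    path = tmp.name

# List comprehension inside `with`: the file is closed when the block exits.
with open(path, "r", encoding="utf-8") as f:
    data = [json.loads(line) for line in f]

# map() is lazy, so it must be consumed while the file is still open.
with open(path, "r", encoding="utf-8") as f:
    total = sum(rec["n"] for rec in map(json.loads, f))

os.remove(path)
```

Note that returning the `map` object out of the `with` block and iterating it later would fail, because the file would already be closed by then.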
【Solution 2】:

The following may solve your problem.

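import json
import re

path = 'path/to/your/file'
with open(path) as f:
    # strip() drops the trailing newline so the slice below removes the last comma
    contents = f.read().strip()
# Append a comma after every closing brace. This assumes flat records with no
# nested objects and no '}' inside string values.
contents = re.sub('}', '},', contents)
contents = contents[:-1]  # drop the comma after the final record
contents = '[' + contents + ']'
# Note: this overwrites the original file with a single JSON array.
with open(path, 'w') as f:
    f.write(contents)
with open(path) as f:
    json_contents = json.load(f)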
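
The regex rewrite above treats every `}` as the end of a record and overwrites the input file. If a single JSON array file is really what you want, a more cautious sketch is to parse each line with `json.loads` and write the result to a separate file. The function name `jsonl_to_json_array` and the file names are hypothetical, chosen only for this example.

```python
import json
import os
import tempfile

def jsonl_to_json_array(src_path, dst_path):
    """Parse each non-blank line of src_path and write all records as one JSON array."""
    with open(src_path, "r", encoding="utf-8") as src:
        records = [json.loads(line) for line in src if line.strip()]
    with open(dst_path, "w", encoding="utf-8") as dst:
        json.dump(records, dst)
    return len(records)

# Demo on temporary files; the first record has a '}' inside a string value.
with tempfile.TemporaryDirectory() as tmpdir:
    src = os.path.join(tmpdir, "sample.jsonl")
    dst = os.path.join(tmpdir, "sample.json")
    with open(src, "w", encoding="utf-8") as f:
        f.write('{"text": "a } inside a string"}\n{"text": "b"}\n')
    count = jsonl_to_json_array(src, dst)
    with open(dst, "r", encoding="utf-8") as f:
        converted = json.load(f)
```

Because every line goes through a real JSON parser, braces inside strings and nested objects are handled, and the original JSONL file is left untouched.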

【Comments】:
