【Question title】: Split nested JSON file into two JSONs according to their ID?
【Posted】: 2021-01-28 09:55:33
【Question description】:

I have a nested JSON file that is loaded as a Python dictionary named movies_data, like this:

import json

with open('project_folder/data_movie_absa.json') as infile:
  movies_data = json.load(infile)

The structure is as follows:

{ "review_1": {"tokens": ["Best", "show", "ever", "!"], 
               "movie_user_4": {"aspects": ["O", "B_A", "O", "O"], "sentiments": ["B_S", "O", "O", "O"]},  
               "movie_user_6": {"aspects": ["O", "B_A", "O", "O"], "sentiments": ["B_S", "O", "O", "O"]}}, 

  "review_2": {"tokens": ["Its", "a", "great", "show"], 
               "movie_user_1": {"aspects": ["O", "O", "O", "B_A"], "sentiments": ["O", "O", "B_S", "O"]}, 
               "movie_user_6": {"aspects": ["O", "O", "O", "B_A"], "sentiments": ["O", "O", "B_S", "O"]}},

  "review_3": {"tokens": ["I", "love", "this", "actor", "!"],  
               "movie_user_17": {"aspects": ["O", "O", "O", "B_A", "O"], "sentiments": ["O", "B_S", "O", "O", "O"]}, 
               "movie_user_23": {"aspects": ["O", "O", "O", "B_A", "O"], "sentiments": ["O", "B_S", "O", "O", "O"]}},

  "review_4": {"tokens": ["Bad", "movie"], 
               "movie_user_1": {"aspects": ["O", "B_A"], "sentiments": ["B_S", "O"]}, 
               "movie_user_6": {"aspects": ["O", "B_A"], "sentiments": ["B_S", "O"]}}

...
}

It has 3324 key-value pairs (i.e., up to key review_3224). I want to split this file into two JSON files (train_movies.json and test_movies.json) according to a specific list of keys:

test_IDS = ['review_2', 'review_4']

with open("train_movies.json", "w", encoding="utf-8-sig") as outfile_train, open("test_movies.json", "w", encoding="utf-8-sig") as outfile_test:
  for review_id, review in movies_data.items():
    if review_id in test_IDS:
      outfile = outfile_test
      outfile.write('{"%s": "%s"}' % (review_id, movies_data[review_id]))
      
    else:
      outfile = outfile_train
      outfile.write('{"%s": "%s"}' % (review_id, movies_data[review_id]))
  outfile.close()

For test_movies.json, I get the following structure:

{"review_2": "{'tokens': ['Its', 'a', 'great', 'show'], 
            'movie_user_4': {'aspects': ['O', 'O', 'O', 'B_A'], 'sentiments': ['O', 'O', 'B_S', 'O']}, 
            'movie_user_6': {'aspects': ['O', 'O', 'O', 'B_A'], 'sentiments': ['O', 'O', 'B_S', 'O']}}"}

{"review_4": "{'tokens': ['Bad', 'movie'], 
               'movie_user_1': {'aspects': ['O', 'B_A'], 'sentiments': ['B_S', 'O']},
               'movie_user_6': {'aspects': ['O', 'B_A'], 'sentiments': ['B_S', 'O']}}"}

Unfortunately, this structure has several problems, such as inconsistent quotes (" vs. '), no commas between reviews, and so on. As a result, reading test_movies.json as a JSON file fails:

with open('project_folder/test_movies.json') as infile:
  testing_data = json.load(infile)

Error message:


JSONDecodeError                           Traceback (most recent call last)
<ipython-input-10-3548a718f421> in <module>()
      1 with open('/content/gdrive/My Drive/project_folder/test_movies.json') as infile:
----> 2   testing_data = json.load(infile)

1 frames
/usr/lib/python3.6/json/__init__.py in loads(s, encoding, cls, object_hook, parse_float, parse_int, parse_constant, object_pairs_hook, **kw)
    342         if s.startswith('\ufeff'):
    343             raise JSONDecodeError("Unexpected UTF-8 BOM (decode using utf-8-sig)",
--> 344                                   s, 0)
    345     else:
    346         if not isinstance(s, (bytes, bytearray)):

JSONDecodeError: Unexpected UTF-8 BOM (decode using utf-8-sig): line 1 column 1 (char 0)

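The desired output should have a proper JSON structure like the original movies_data, so that Python can read it correctly as a dict.

Could you help me correct my Python code?

Thank you in advance!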

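【Comments】:

  • The problem you seem to be having is not related to splitting, but to parsing single-quoted nested JSON. Please correct the title accordingly.
  • If I had to guess, I would say the file is not JSON but a list of JSONs, where the value of each "review_i" is a plainly printed Python dictionary. See this for how to parse them.

Tags: python json logic filesplitting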


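【Solution 1】:

Problem

  • The output string written to the file needs to be created with json.dumps.
  • Using Python string formatting, i.e. '{"%s": "%s"}' % (review_id, movies_data[review_id]), inserts the dict's Python repr (single-quoted) as one quoted string, which produces the problems you describe.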
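
To make the difference concrete, here is a minimal sketch comparing the two approaches on a single review. The `review` dict is a shortened copy of review_4 from the question, and the variable names are only illustrative.

```python
import json

# Shortened copy of review_4 from the question (illustrative data only).
review = {"tokens": ["Bad", "movie"],
          "movie_user_1": {"aspects": ["O", "B_A"], "sentiments": ["B_S", "O"]}}

# %s calls str() on the dict, giving a Python repr with single quotes;
# the surrounding "..." then turns the whole repr into one JSON string.
formatted = '{"%s": "%s"}' % ("review_4", review)

# json.dumps serializes the nested structure as real JSON.
dumped = json.dumps({"review_4": review})

print(formatted)
print(dumped)

# For this particular review the formatted text still parses,
# but the value comes back as a str rather than a nested dict.
value_from_formatted = json.loads(formatted)["review_4"]
value_from_dumped = json.loads(dumped)["review_4"]

# Writing several such objects back to back, as the loop in the question does,
# is not one JSON document, so parsing the concatenation fails.
try:
    json.loads(formatted + formatted)
    concatenation_parses = True
except json.JSONDecodeError:
    concatenation_parses = False
```

Collecting the reviews into one dict and serializing it once avoids both problems, which is what the code in this answer does.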

Code

train, test = {}, {}   # Dictionaries for storing training and test data
for review_id, review in movies_data.items():
    if review_id in test_IDS:
        test[review_id] = review
    else:
        train[review_id] = review

# Output Test
with open("test_movies.json", "w") as outfile_test:
    json.dump(test, outfile_test)
    
# Output training
with open("train_movies.json", "w") as outfile_train:
    json.dump(train, outfile_train)
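
Note that the code above opens the output files without encoding="utf-8-sig". That matters for the JSONDecodeError in the question: the traceback complains about a UTF-8 BOM, which is what utf-8-sig writes at the start of a file. The following is a minimal sketch of that behaviour, using a hypothetical scratch file in a temporary directory rather than the question's real paths.

```python
import json
import os
import tempfile

data = {"review_2": {"tokens": ["Its", "a", "great", "show"]}}
# Hypothetical scratch file, used only for this demonstration.
path = os.path.join(tempfile.mkdtemp(), "bom_demo.json")

# Writing with utf-8-sig prepends a byte order mark (BOM) to the file.
with open(path, "w", encoding="utf-8-sig") as f:
    json.dump(data, f)

# Reading as plain utf-8 keeps the BOM in the text, which json rejects.
try:
    with open(path, encoding="utf-8") as f:
        json.load(f)
    bom_error = None
except json.JSONDecodeError as exc:
    bom_error = str(exc)

# Reading with utf-8-sig strips the BOM, so the same file loads fine.
with open(path, encoding="utf-8-sig") as f:
    reloaded = json.load(f)

print(bom_error)
print(reloaded == data)
```

So either write without utf-8-sig, as the answer's code does, or read the file back with the same encoding it was written with.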

Result

Input: file contents of test.json

{ "review_1": {"tokens": ["Best", "show", "ever", "!"], 
               "movie_user_4": {"aspects": ["O", "B_A", "O", "O"], "sentiments": ["B_S", "O", "O", "O"]},  
               "movie_user_6": {"aspects": ["O", "B_A", "O", "O"], "sentiments": ["B_S", "O", "O", "O"]}}, 

  "review_2": {"tokens": ["Its", "a", "great", "show"], 
               "movie_user_1": {"aspects": ["O", "O", "O", "B_A"], "sentiments": ["O", "O", "B_S", "O"]}, 
               "movie_user_6": {"aspects": ["O", "O", "O", "B_A"], "sentiments": ["O", "O", "B_S", "O"]}},

  "review_3": {"tokens": ["I", "love", "this", "actor", "!"],  
               "movie_user_17": {"aspects": ["O", "O", "O", "B_A", "O"], "sentiments": ["O", "B_S", "O", "O", "O"]}, 
               "movie_user_23": {"aspects": ["O", "O", "O", "B_A", "O"], "sentiments": ["O", "B_S", "O", "O", "O"]}},

  "review_4": {"tokens": ["Bad", "movie"], 
               "movie_user_1": {"aspects": ["O", "B_A"], "sentiments": ["B_S", "O"]}, 
               "movie_user_6": {"aspects": ["O", "B_A"], "sentiments": ["B_S", "O"]}}

}

Output: file contents of test_movies.json

{"review_2": {"tokens": ["Its", "a", "great", "show"], "movie_user_1": {"aspects": ["O", "O", "O", "B_A"], "sentiments": ["O", "O", "B_S", "O"]}, "movie_user_6": {"aspects": ["O", "O", "O", "B_A"], "sentiments": ["O", "O", "B_S", "O"]}}, "review_4": {"tokens": ["Bad", "movie"], "movie_user_1": {"aspects": ["O", "B_A"], "sentiments": ["B_S", "O"]}, "movie_user_6": {"aspects": ["O", "B_A"], "sentiments": ["B_S", "O"]}}}

Output: file contents of train_movies.json

{"review_1": {"tokens": ["Best", "show", "ever", "!"], "movie_user_4": {"aspects": ["O", "B_A", "O", "O"], "sentiments": ["B_S", "O", "O", "O"]}, "movie_user_6": {"aspects": ["O", "B_A", "O", "O"], "sentiments": ["B_S", "O", "O", "O"]}}, "review_3": {"tokens": ["I", "love", "this", "actor", "!"], "movie_user_17": {"aspects": ["O", "O", "O", "B_A", "O"], "sentiments": ["O", "B_S", "O", "O", "O"]}, "movie_user_23": {"aspects": ["O", "O", "O", "B_A", "O"], "sentiments": ["O", "B_S", "O", "O", "O"]}}}
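
As a quick sanity check, one can reload both output files and confirm that together they reproduce the original data. The sketch below assumes a reduced version of movies_data (tokens only) and writes to a temporary directory; the file names match the answer, everything else is illustrative.

```python
import json
import os
import tempfile

# Reduced stand-in for movies_data (tokens only, illustrative).
movies_data = {
    "review_1": {"tokens": ["Best", "show", "ever", "!"]},
    "review_2": {"tokens": ["Its", "a", "great", "show"]},
    "review_3": {"tokens": ["I", "love", "this", "actor", "!"]},
    "review_4": {"tokens": ["Bad", "movie"]},
}
test_IDS = {"review_2", "review_4"}  # a set gives fast membership tests

test = {k: v for k, v in movies_data.items() if k in test_IDS}
train = {k: v for k, v in movies_data.items() if k not in test_IDS}

# Write both parts into a temporary directory (assumption: any writable dir works).
out_dir = tempfile.mkdtemp()
for name, part in (("test_movies.json", test), ("train_movies.json", train)):
    with open(os.path.join(out_dir, name), "w", encoding="utf-8") as f:
        json.dump(part, f)

# Reload the files and check that nothing was lost or duplicated.
with open(os.path.join(out_dir, "test_movies.json"), encoding="utf-8") as f:
    test_back = json.load(f)
with open(os.path.join(out_dir, "train_movies.json"), encoding="utf-8") as f:
    train_back = json.load(f)

print(sorted(test_back))
print(sorted(train_back))
print({**train_back, **test_back} == movies_data)
```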

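【Comments】:

  • Thanks for your answer, but I still have two problems: 1- each key-value pair (i.e., review) gets an extra pair of curly braces { }; 2- there are no commas between the key-value pairs. So the output cannot be read as a JSON file :(
  • @AliF -- easy to correct. The mistake came from misunderstanding your desired output format.
  • To solve the comma problem I added outfile.write(result + ",\n"); still thinking about the curly braces.
  • @AliF -- how is the output now?
  • @AliF -- if the format is OK, as an improvement we could make test_ids a random subset of all the keys in the original data. That would be consistent with how training and test data are usually based on a random partition of the dataset.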
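
The random split suggested in the last comment could look roughly like this. It is only a sketch: the helper name `split_ids`, the 20% test fraction, and the fixed seed are assumptions, not part of the original answer.

```python
import random

def split_ids(all_ids, test_fraction=0.2, seed=0):
    """Return a random subset of the IDs to use as the test set (hypothetical helper)."""
    ids = sorted(all_ids)            # sort first so the result is reproducible
    rng = random.Random(seed)        # local generator, does not touch global state
    n_test = round(len(ids) * test_fraction)
    return set(rng.sample(ids, n_test))

# Stand-in for movies_data with ten reviews (illustrative only).
movies_data = {f"review_{i}": {"tokens": []} for i in range(1, 11)}

test_ids = split_ids(movies_data)
test = {k: v for k, v in movies_data.items() if k in test_ids}
train = {k: v for k, v in movies_data.items() if k not in test_ids}

print(len(test), len(train))
```

The resulting train and test dicts can then be written with json.dump exactly as in the answer.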