【Posted】: 2016-05-31 16:20:41
【Question】:
I have about 100,000 JSON files, and I am looping over them to build a bag-of-words model, which is fairly simple. Each JSON file looks like this:
[{"tokens":[{"word":"Voices","lemma":"voice","pos":"NNS","ner":"O"},{"word":"from","lemma":"from","pos":"IN","ner":"O"},{"word":"Russia","lemma":"Russia","pos":"NNP","ner":"LOCATION"}],"dependencies":[{"head":0,"dep":2,"label":"prep_from"}]},{"tokens":[{"word":"Wednesday","lemma":"Wednesday","pos":"NNP","ner":"DATE"},{"word":",","lemma":",","pos":",","ner":"DATE"},{"word":"11","lemma":"11","pos":"CD","ner":"DATE"},
....
What I need is to extract only the values of the "word" key from each file and store that array in a new file, so that every file has an array like this:
["Voices", "from", "Wednesday","Russia", "," ,"11"...]
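To make the shape of the data concrete: the top level of each file is a list of sentence objects, and each sentence object holds a "tokens" list whose entries carry the "word" key. A minimal sketch of the extraction, assuming Python 3 and that structure, is shown here. The helper name `extract_words` is hypothetical, and the sample is the excerpt above with closing brackets added so that it parses.

```python
import json


def extract_words(sentences):
    """Return the "word" value of every token, in document order."""
    return [
        token["word"]
        for sentence in sentences                 # top level: list of sentence dicts
        for token in sentence.get("tokens", [])   # each sentence: list of token dicts
    ]


# The excerpt from the question, closed off by hand for illustration.
sample = json.loads('''
[{"tokens": [{"word": "Voices", "lemma": "voice", "pos": "NNS", "ner": "O"},
             {"word": "from", "lemma": "from", "pos": "IN", "ner": "O"},
             {"word": "Russia", "lemma": "Russia", "pos": "NNP", "ner": "LOCATION"}],
  "dependencies": [{"head": 0, "dep": 2, "label": "prep_from"}]},
 {"tokens": [{"word": "Wednesday", "lemma": "Wednesday", "pos": "NNP", "ner": "DATE"},
             {"word": ",", "lemma": ",", "pos": ",", "ner": "DATE"},
             {"word": "11", "lemma": "11", "pos": "CD", "ner": "DATE"}]}]
''')

words = extract_words(sample)
print(words)  # ['Voices', 'from', 'Russia', 'Wednesday', ',', '11']
```

No conversion to a dictionary is needed for this: iterating over the list and indexing each token dict by "word" is enough.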
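I also have a similar array for all files combined, stored in ../../data/train_jsons/all_words.json
However, json.loads produces a list for each item rather than a dictionary. How can I loop over each file's list to get what I want, and store these per-file word arrays in new files that preserve the JSON file's path name, for example a new file named ../../data/train_jsons/words_for_.........json?
Trying to convert the data to a dictionary and use the key "word" does not seem to work: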
import itertools
import json
import os

for subdir, dirs, files in os.walk('../../data/train_jsons'):
    for file in files:
        filepath = subdir + os.sep + file
        if filepath.endswith(".json"):
            with open(filepath) as data_file:
                data = json.load(data_file)
                # Failing attempt: data is a list of sentence objects, not key/value pairs
                dict = dict(itertools.izip_longest(*[iter(data)] * 2, fillvalue=""))
Speed is a key factor in my solution.
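One way the whole loop could look is sketched below, assuming Python 3 and the list-of-sentences structure shown earlier. The names `words_in`, `output_path_for` and `process_tree` are hypothetical, and so is the naming scheme: the target file name in the question is elided, so this sketch simply prefixes the original file name with "words_for_" and writes the result next to the source file.

```python
import json
import os


def words_in(sentences):
    """Collect the "word" value of every token across all sentences."""
    return [tok["word"] for s in sentences for tok in s.get("tokens", [])]


def output_path_for(filepath):
    # Hypothetical naming scheme: same folder, "words_for_" + original file name.
    folder, name = os.path.split(filepath)
    return os.path.join(folder, "words_for_" + name)


def process_tree(root, combined_name="all_words.json"):
    """Write one word array per input file, plus a combined array under root."""
    all_words = []
    for subdir, dirs, files in os.walk(root):
        for name in files:
            # Skip non-JSON files and anything this function wrote on an earlier run.
            if (not name.endswith(".json")
                    or name.startswith("words_for_")
                    or name == combined_name):
                continue
            filepath = os.path.join(subdir, name)
            with open(filepath, encoding="utf-8") as data_file:
                data = json.load(data_file)
            words = words_in(data)
            with open(output_path_for(filepath), "w", encoding="utf-8") as out:
                json.dump(words, out, ensure_ascii=False)
            all_words.extend(words)
    with open(os.path.join(root, combined_name), "w", encoding="utf-8") as out:
        json.dump(all_words, out, ensure_ascii=False)
    return all_words
```

It would be called as `process_tree('../../data/train_jsons')`. On speed: each file is parsed once and traversed once, so the remaining cost is probably dominated by disk reads and JSON parsing. If that turns out to be too slow, splitting the file list across worker processes is one option to measure, since the files are independent of each other.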
【Comments】:
Tags: python arrays json dictionary nlp