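【Posted】: 2016-04-14 07:01:58
【Question】:
I am trying to write a large JSON file (at least 500 MB) to a database. I wrote a script that works and is memory-friendly, but it is slow. Any suggestions on how to make it more efficient?
My JSON file (remote sensing measurements extracted from Google Earth Engine) has the following format: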
{"type":"FeatureCollection","features":[{"geometry":{"coordinates":[-55.347046,-12.179673],"geodesic":true,"type":"Point"},"id":"LT52240692005129COA00_2","properties":{"B1":null,"B2":null,"B3":null,"B4":null,"B5":null,"B7":null,"description":"","id":0.0,"name":""},"type":"Feature"},{"geometry":{"coordinates":[-52.726481,-13.374343],"geodesic":true,"type":"Point"},"id":"LT52250692005184COA00_10","properties":{"B1":217,"B2":497,"B3":424,"B4":2633,"B5":1722,"B7":747,"description":"","id":8.0,"name":""},"type":"Feature"}]}
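To make the intended table layout concrete, here is a minimal sketch of how one feature in this format flattens into a single row. The helper name `feature_to_row` is hypothetical and used only for illustration; the two features are copied from the sample above, and features whose `B1` is null are skipped, as in my script.

```python
import json

# The two features from the sample above; the first one has no measurements.
sample = (
    '{"type":"FeatureCollection","features":['
    '{"geometry":{"coordinates":[-55.347046,-12.179673],"geodesic":true,"type":"Point"},'
    '"id":"LT52240692005129COA00_2","properties":{"B1":null,"B2":null,"B3":null,'
    '"B4":null,"B5":null,"B7":null,"description":"","id":0.0,"name":""},"type":"Feature"},'
    '{"geometry":{"coordinates":[-52.726481,-13.374343],"geodesic":true,"type":"Point"},'
    '"id":"LT52250692005184COA00_10","properties":{"B1":217,"B2":497,"B3":424,'
    '"B4":2633,"B5":1722,"B7":747,"description":"","id":8.0,"name":""},"type":"Feature"}]}'
)


def feature_to_row(feature):
    """Hypothetical helper: flatten one feature into a dict, or None if B1 is null."""
    props = feature['properties']
    if props['B1'] is None:
        return None
    # The feature id has the form '<scene_id>_<feature_id>'.
    meta = feature['id'].split('_')
    row = dict(props)  # copy, so the parsed JSON is left untouched
    row.update({'scene_id': meta[0], 'feature_id': int(meta[1])})
    return row


rows = [feature_to_row(f) for f in json.loads(sample)['features']]
print(rows)
```

The first feature yields `None` and the second yields a flat dict containing the band values plus `scene_id` and `feature_id`.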
Here is the script that reads the JSON, parses it, and writes it to the database.
import pandas as pd
import json
import sqlite3

# Variables
JSON_file = '../data/LT5oregon.geojson'
db_src = '../data/SR_ee_samples.sqlite'
table_name = 'oregon'
chunk_size = 5000

# Read JSON file
with open(JSON_file) as data_file:
    data = json.load(data_file)

# Create database connection
con = sqlite3.connect(db_src)

# Create empty dataframe
df = pd.DataFrame()

# Initialize count for row index
count = 0

# Main loop
for feature in data['features']:
    json_feature = feature['properties']
    if json_feature['B1'] is not None:
        # Build metadata
        meta = feature['id'].split('_')
        meta_dict = {'scene_id': meta[0], 'feature_id': int(meta[1])}
        # Append meta data to feature data
        json_feature.update(meta_dict)
        # Append row to df
        df = df.append(pd.DataFrame(json_feature, index=[count]))
        count += 1
    if len(df) >= chunk_size:  # When df reaches a certain number of rows, empty it to db
        df.to_sql(name=table_name, con=con, if_exists='append')
        df = pd.DataFrame()

# Write remaining rows to db
df.to_sql(name=table_name, con=con, if_exists='append')
Thanks in advance for any suggestions.
【Comments】:
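One likely hotspot in the script is `df = df.append(...)`, which copies the whole DataFrame on every row. Below is a minimal, unbenchmarked sketch of one alternative: collect plain tuples in a list and write each batch with `sqlite3`'s `executemany`. Note that this swaps pandas `to_sql` for direct SQL, so it is not the asker's method. The names `COLUMNS`, `iter_rows`, and `write_features` are hypothetical, the column list is assumed from the sample feature, and the untyped `CREATE TABLE` is an assumption about the desired schema.

```python
import sqlite3

# Assumed columns: the six band values from the sample plus the two metadata fields.
BANDS = ['B1', 'B2', 'B3', 'B4', 'B5', 'B7']
COLUMNS = BANDS + ['scene_id', 'feature_id']


def iter_rows(features):
    """Yield one tuple per feature that has measurements (B1 is not null)."""
    for feature in features:
        props = feature['properties']
        if props['B1'] is None:
            continue
        meta = feature['id'].split('_')
        yield tuple(props[b] for b in BANDS) + (meta[0], int(meta[1]))


def write_features(con, table_name, features, chunk_size=5000):
    """Insert rows in batches via executemany; return the number of rows written."""
    cols = ', '.join(COLUMNS)
    marks = ', '.join('?' for _ in COLUMNS)
    # table_name is interpolated directly, so it must come from trusted code.
    con.execute(f'CREATE TABLE IF NOT EXISTS {table_name} ({cols})')
    sql = f'INSERT INTO {table_name} ({cols}) VALUES ({marks})'
    batch, written = [], 0
    for row in iter_rows(features):
        batch.append(row)
        if len(batch) >= chunk_size:
            con.executemany(sql, batch)
            written += len(batch)
            batch = []
    if batch:  # flush the remaining rows
        con.executemany(sql, batch)
        written += len(batch)
    con.commit()  # single commit at the end
    return written


# Usage with an in-memory database and two features shaped like the sample.
features = [
    {'id': 'LT52240692005129COA00_2',
     'properties': {'B1': None, 'B2': None, 'B3': None,
                    'B4': None, 'B5': None, 'B7': None}},
    {'id': 'LT52250692005184COA00_10',
     'properties': {'B1': 217, 'B2': 497, 'B3': 424,
                    'B4': 2633, 'B5': 1722, 'B7': 747}},
]
con = sqlite3.connect(':memory:')
n = write_features(con, 'oregon', features, chunk_size=1)
result = con.execute('SELECT scene_id, feature_id, B4 FROM oregon').fetchall()
```

If staying with pandas is preferred, the same idea should apply: append dicts to a list and build one DataFrame per chunk rather than appending row by row.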
Tags: python json sqlite pandas bigdata