Efficiently write JSON to a sqlite database: answers

【Question title】: Efficiently write JSON to sqlite database
【Posted】: 2016-04-14 07:01:58
【Question description】:

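I am trying to write large JSON files (at least 500 MB) to a database. I wrote a script that works and is memory friendly, but it is slow. Any suggestions on how to make it more efficient?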

My JSON file (remote sensing measurements extracted from Google Earth Engine) has the following format:

{"type":"FeatureCollection","features":[{"geometry":{"coordinates":[-55.347046,-12.179673],"geodesic":true,"type":"Point"},"id":"LT52240692005129COA00_2","properties":{"B1":null,"B2":null,"B3":null,"B4":null,"B5":null,"B7":null,"description":"","id":0.0,"name":""},"type":"Feature"},{"geometry":{"coordinates":[-52.726481,-13.374343],"geodesic":true,"type":"Point"},"id":"LT52250692005184COA00_10","properties":{"B1":217,"B2":497,"B3":424,"B4":2633,"B5":1722,"B7":747,"description":"","id":8.0,"name":""},"type":"Feature"}]}

Here is the script that reads the JSON, parses it, and writes it to the database.

import pandas as pd
import json
import sqlite3

# Variables
JSON_file = '../data/LT5oregon.geojson'
db_src = '../data/SR_ee_samples.sqlite'
table_name = 'oregon'
chunk_size = 5000

# Read JSON file
with open(JSON_file) as data_file:    
    data = json.load(data_file)

# Create database connection
con = sqlite3.connect(db_src)

# Create empty dataframe
df = pd.DataFrame()
# Initialize count for row index
count = 0

# Main loop
for feature in data['features']:
    json_feature = feature['properties']
    if json_feature['B1'] is not None:
        # Build metadata
        meta = feature['id'].split('_')
        meta_dict = {'scene_id': meta[0], 'feature_id': int(meta[1])}
        # Append meta data to feature data
        json_feature.update(meta_dict)
        # Append row to df
        df = df.append(pd.DataFrame(json_feature, index=[count]))
        count += 1
        if len(df) >= chunk_size: # When df reaches a certain number of rows, empty it to db
            df.to_sql(name = table_name, con = con, if_exists='append')
            df = pd.DataFrame()

# write remaining rows to db
df.to_sql(name = table_name, con = con, if_exists='append')

Thanks in advance for any suggestions.

【Discussion】:

Tags: python json sqlite pandas bigdata

【Solution 1】:

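I think you would benefit from a profiler (for example line_profiler, or the one in the standard library, cProfile) to find out which part of your code is taking the time.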
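As a rough illustration of the profiling suggestion, the sketch below wraps a single call in the standard library's cProfile and returns a short report sorted by cumulative time. The helper name `profile_call` is hypothetical and not part of the original answer; the idea is to move the main loop of the script into a function and pass that function in.

```python
import cProfile
import io
import pstats


def profile_call(func, *args, **kwargs):
    """Run func under cProfile and return (result, text report).

    Hypothetical helper for illustration only.
    """
    profiler = cProfile.Profile()
    # runcall profiles exactly one call and hands back its return value
    result = profiler.runcall(func, *args, **kwargs)

    # Collect the report in a string instead of printing to stdout
    stream = io.StringIO()
    stats = pstats.Stats(profiler, stream=stream)
    # Show the 10 entries with the largest cumulative time
    stats.sort_stats('cumulative').print_stats(10)
    return result, stream.getvalue()
```

If the guess in this answer is right, the entries for the DataFrame append call should dominate the cumulative-time column of the report.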

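My bet is on the append call on the DataFrame. I suspect it has to copy the whole DataFrame every time in order to keep a contiguous array (as numpy does). Maybe build a list of elements and then create the DataFrame from that?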
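A minimal sketch of that idea, assuming the same GeoJSON layout as in the question: rows are collected in a plain Python list, and a DataFrame is built only once per chunk, just before it is written. The function names `features_to_rows` and `write_features` are made up for this example, and `index=False` is an assumption that differs from the question's script, which also stored the row counter as the index.

```python
import sqlite3

import pandas as pd


def features_to_rows(features):
    """Yield one flat dict per feature whose B1 value is not null."""
    for feature in features:
        props = feature['properties']
        if props['B1'] is None:
            continue
        meta = feature['id'].split('_')
        row = dict(props)  # copy, so the input data is left untouched
        row['scene_id'] = meta[0]
        row['feature_id'] = int(meta[1])
        yield row


def write_features(features, con, table_name, chunk_size=5000):
    """Append rows to table_name in chunks; return the number of rows written."""
    rows = []
    written = 0
    for row in features_to_rows(features):
        rows.append(row)  # appending to a list does not copy earlier rows
        if len(rows) >= chunk_size:
            pd.DataFrame(rows).to_sql(table_name, con,
                                      if_exists='append', index=False)
            written += len(rows)
            rows = []
    if rows:  # write whatever is left after the loop
        pd.DataFrame(rows).to_sql(table_name, con,
                                  if_exists='append', index=False)
        written += len(rows)
    return written
```

With the variables from the question, the call would look like `write_features(data['features'], con, table_name, chunk_size)`.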

【Discussion】:

【Solution 2】:

If you are able to export in JSON Lines format, pandas supports reading the data in chunks, and you can do the following:

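import sqlite3

import pandas as pd

# Placeholder values (not from the original answer); adjust to your setup
json_file = 'samples.jsonl'              # hypothetical path to a JSON Lines export
con = sqlite3.connect('samples.sqlite')  # hypothetical database file
chunksize = 5000

# With lines=True and chunksize set, read_json yields one DataFrame per chunk
with pd.read_json(json_file, orient="records",
                  chunksize=chunksize, lines=True) as reader:
    for chunk in reader:
        chunk.to_sql('table_name',
                     con=con,
                     if_exists='append',
                     index=False)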
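If the data cannot be exported as JSON Lines directly, one possible workaround is a one-off conversion of the FeatureCollection from the question into one JSON object per line. This is only a sketch under that assumption: the function name `feature_collection_to_json_lines` is hypothetical, and the conversion still loads the whole file with `json.load`, as the question's script does.

```python
import json


def feature_collection_to_json_lines(geojson_path, jsonl_path):
    """Write one JSON object per line for each feature with a non-null B1.

    Returns the number of lines written.
    """
    with open(geojson_path) as src:
        data = json.load(src)  # loads the full file into memory

    written = 0
    with open(jsonl_path, 'w') as dst:
        for feature in data['features']:
            props = feature['properties']
            if props['B1'] is None:
                continue
            meta = feature['id'].split('_')
            # Flatten the properties and add the metadata taken from the id
            row = dict(props, scene_id=meta[0], feature_id=int(meta[1]))
            dst.write(json.dumps(row) + '\n')
            written += 1
    return written
```

The resulting file can then be read in chunks with `lines=True`, so later runs do not need to parse the full GeoJSON again.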

【Discussion】: