【Question title】: Efficiently write JSON to sqlite database
【Posted】: 2016-04-14 07:01:58
【Question description】:

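I am trying to write large JSON files (at least 500 MB) to a database. I wrote a script that works and is memory friendly, but it is slow. Any suggestions on how to make it more efficient?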

My JSON file (remote sensing measurements extracted from Google Earth Engine) is formatted as follows:

{"type":"FeatureCollection","features":[{"geometry":{"coordinates":[-55.347046,-12.179673],"geodesic":true,"type":"Point"},"id":"LT52240692005129COA00_2","properties":{"B1":null,"B2":null,"B3":null,"B4":null,"B5":null,"B7":null,"description":"","id":0.0,"name":""},"type":"Feature"},{"geometry":{"coordinates":[-52.726481,-13.374343],"geodesic":true,"type":"Point"},"id":"LT52250692005184COA00_10","properties":{"B1":217,"B2":497,"B3":424,"B4":2633,"B5":1722,"B7":747,"description":"","id":8.0,"name":""},"type":"Feature"}]}

Here is the script that reads the JSON, parses it, and writes it to the database.

import pandas as pd
import json
import sqlite3

# Variables
JSON_file = '../data/LT5oregon.geojson'
db_src = '../data/SR_ee_samples.sqlite'
table_name = 'oregon'
chunk_size = 5000

# Read JSON file
with open(JSON_file) as data_file:    
    data = json.load(data_file)

# Create database connection
con = sqlite3.connect(db_src)

# Create empty dataframe
df = pd.DataFrame()
# Initialize count for row index
count = 0

# Main loop
for feature in data['features']:
    json_feature = feature['properties']
    if json_feature['B1'] is not None:
        # Build metadata
        meta = feature['id'].split('_')
        meta_dict = {'scene_id': meta[0], 'feature_id': int(meta[1])}
        # Append meta data to feature data
        json_feature.update(meta_dict)
        # Append row to df
        df = df.append(pd.DataFrame(json_feature, index=[count]))
        count += 1
        if len(df) >= chunk_size: # When df reaches a certain number of rows, empty it to db
            df.to_sql(name = table_name, con = con, if_exists='append')
            df = pd.DataFrame()

# write remaining rows to db
df.to_sql(name = table_name, con = con, if_exists='append')

Thanks in advance for any suggestions.

【Question comments】:

    Tags: python json sqlite pandas bigdata


    【Solution 1】:

    I think you would benefit from a profiler (e.g. line_profiler, or the one in the standard library) to find out which part of your code takes the time.
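As a rough sketch of the standard-library route, `cProfile` and `pstats` can report where the time goes. The function `slow_append` below is a hypothetical stand-in for the script's main loop, not the asker's actual code, and `profile_call` is just an illustrative helper name.

```python
import cProfile
import io
import pstats


def slow_append(n):
    """Hypothetical stand-in for the script's main loop."""
    rows = []
    for i in range(n):
        # Rebuilding the list on every pass copies all earlier rows,
        # mimicking the cost of repeatedly appending to a DataFrame.
        rows = rows + [i]
    return rows


def profile_call(func, *args):
    """Run func under cProfile and return (result, report text)."""
    profiler = cProfile.Profile()
    result = profiler.runcall(func, *args)
    stream = io.StringIO()
    stats = pstats.Stats(profiler, stream=stream)
    stats.sort_stats("cumulative").print_stats(10)  # ten most expensive entries
    return result, stream.getvalue()


result, report = profile_call(slow_append, 2000)
print(report)
```

In the real script, the main loop would first need to be wrapped in a function so it can be passed to `profile_call`.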

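    My bet is on the append calls to the dataframe. I suspect each call has to copy the whole dataframe to keep the array contiguous (as numpy does). Maybe build a list of elements first, then create the dataframe from it?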
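A minimal sketch of that idea, assuming the feature layout shown in the question: rows are buffered in a plain Python list and a DataFrame is built only once per chunk. The function name `write_features` is hypothetical, and unlike the original script this version passes `index=False`, so no row index column is written.

```python
import sqlite3

import pandas as pd


def write_features(features, con, table_name, chunk_size=5000):
    """Buffer rows in a list and build one DataFrame per chunk."""
    rows = []
    written = 0
    for feature in features:
        props = feature["properties"]
        if props["B1"] is None:
            continue  # skip features without measurements, as in the question
        meta = feature["id"].split("_")
        # Build a new dict so the input data is left unmodified.
        rows.append({**props, "scene_id": meta[0], "feature_id": int(meta[1])})
        if len(rows) >= chunk_size:
            pd.DataFrame(rows).to_sql(table_name, con, if_exists="append", index=False)
            written += len(rows)
            rows = []
    if rows:  # flush the remaining rows
        pd.DataFrame(rows).to_sql(table_name, con, if_exists="append", index=False)
        written += len(rows)
    return written


# Tiny in-memory demo using two trimmed features from the question
features = [
    {"id": "LT52240692005129COA00_2",
     "properties": {"B1": None, "B2": None, "id": 0.0}},
    {"id": "LT52250692005184COA00_10",
     "properties": {"B1": 217, "B2": 497, "id": 8.0}},
]
con = sqlite3.connect(":memory:")
n_written = write_features(features, con, "oregon", chunk_size=1)
```

Appending to a list is cheap because it does not copy the existing rows, so the only per-chunk cost is one DataFrame construction and one `to_sql` call.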

    【Comments】:

      【Solution 2】:

      If you are able to export in JSON Lines format, pandas supports reading the data in chunks, so you can do something like this:

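      import sqlite3

      import pandas as pd

      # Placeholder values (not from the answer); adjust to your own setup
      json_file = 'data.jsonl'
      chunksize = 5000
      con = sqlite3.connect('samples.sqlite')

      # With lines=True and a chunksize, read_json yields DataFrames one chunk at a time
      with pd.read_json(json_file, orient="records",
                        chunksize=chunksize, lines=True) as reader:
          for chunk in reader:
              chunk.to_sql('table_name',
                           con=con,
                           if_exists='append',
                           index=False)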
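The file in the question is a single FeatureCollection object rather than JSON Lines, so it would need converting first if a line-based export is not available. One possible sketch, using only the standard library; the function name `features_to_json_lines` is hypothetical, and it still loads the whole source file into memory once, as the question's script does.

```python
import json


def features_to_json_lines(geojson_path, jsonl_path):
    """Write one flat JSON record per line for each feature that has data."""
    with open(geojson_path) as src:
        collection = json.load(src)  # whole file in memory, as in the question
    count = 0
    with open(jsonl_path, "w") as dst:
        for feature in collection["features"]:
            props = feature["properties"]
            if props["B1"] is None:
                continue  # same filter as the question's script
            meta = feature["id"].split("_")
            record = {**props, "scene_id": meta[0], "feature_id": int(meta[1])}
            dst.write(json.dumps(record) + "\n")
            count += 1
    return count
```

The resulting file has one record per line, which is the shape `pd.read_json(..., lines=True)` expects.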
      

      【Comments】:
