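【Question title】: Appending to an existing Avro file in Python
【Posted】: 2022-12-10 08:13:16
【Question】:

I am exploring the Avro file format and am currently struggling to append data. I seem to overwrite the file on every run. I found an existing thread here saying that I should not pass a schema in order to "append" to an existing file without overwriting it. Even my linter gives this hint: If the schema is not present, presume we're appending. However, if I try to declare the DataFileWriter as DataFileWriter(open("users.avro", "wb"), DatumWriter(), None), the code fails to run.

In short: how do I append values to an existing Avro file without overwriting its current contents?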

import avro.schema
from avro.datafile import DataFileReader, DataFileWriter
from avro.io import DatumReader, DatumWriter

schema = avro.schema.parse(open("user.avsc", "rb").read())
# "wb" truncates users.avro on every run; this is the behavior being asked about
writer = DataFileWriter(open("users.avro", "wb"), DatumWriter(), schema)

print("start appending")
writer.append({"name": "Alyssa", "favorite_number": 256})
writer.append({"name": "Ben", "favorite_number": 12, "favorite_color": "blue"})
writer.close()
print("write successful!")

# Read data from an avro file
with open("users.avro", "rb") as f:
    reader = DataFileReader(f, DatumReader())
    users = [user for user in reader]
    reader.close()

print(f'Schema {schema}')
print(f'Users:\n {users}')

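【Comments】:

  • This may be down to how the file is opened. You currently have wb, and w always overwrites the file. Does ab work?
  • That doesn't seem to work for me.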
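
The file-mode behavior mentioned in the first comment can be checked in isolation with plain bytes and no Avro involved. This is only a minimal sketch; the temporary file name `demo.bin` is made up for the illustration. As the reply above notes, changing the mode by itself did not solve the asker's problem.

```python
import os
import tempfile

# Hypothetical scratch file, used only for this demonstration
path = os.path.join(tempfile.mkdtemp(), "demo.bin")

with open(path, "wb") as fp:  # creates the file
    fp.write(b"first")

with open(path, "wb") as fp:  # "wb" truncates, so b"first" is lost
    fp.write(b"second")
with open(path, "rb") as fp:
    after_wb = fp.read()

with open(path, "ab") as fp:  # "ab" keeps existing bytes and writes at the end
    fp.write(b"third")
with open(path, "rb") as fp:
    after_ab = fp.read()

print(after_wb)
print(after_ab)
```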

Tags: python python-3.x avro


【Solution 1】:

I'm not sure how to do this with the standard avro library, but it can be done with fastavro. See the example below:

from fastavro import parse_schema, writer, reader

schema = {
 "namespace": "example.avro",
 "type": "record",
 "name": "User",
 "fields": [
     {"name": "name", "type": "string"},
     {"name": "favorite_number",  "type": ["int", "null"]},
     {"name": "favorite_color", "type": ["string", "null"]}
 ]
}

parsed_schema = parse_schema(schema)

records = [
    {"name": "Alyssa", "favorite_number": 256},
    {"name": "Ben", "favorite_number": 12, "favorite_color": "blue"},
]

# Write initial 2 records
with open("users.avro", "wb") as fp:
    writer(fp, parsed_schema, records)

# Append third record ("a+b" lets fastavro read the existing header before appending)
with open("users.avro", "a+b") as fp:
    writer(fp, parsed_schema, [{"name": "Chris", "favorite_number": 1}])

# Read all records
with open("users.avro", "rb") as fp:
    for record in reader(fp):
        print(record)

【Comments】:

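    【Solution 2】:

    The solution of skipping the schema is correct, but only if the Avro file has already been set up with the correct schema.

    This code initializes the file with the correct schema. It does not matter whether you use ab or wb mode; just write an empty file with the schema and close it.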

    writer = DataFileWriter(open("reproducible.avro", "ab+"), DatumWriter(), schema)
    writer.close()
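
Before omitting the schema, it can be useful to confirm that the file really has been initialized. A minimal sketch, assuming a hypothetical helper named `has_avro_header` (not part of the avro library): it only checks for the four magic bytes `Obj\x01` that the Avro specification defines for object container files, and says nothing about whether the embedded schema is the one you expect.

```python
import os

# First four bytes of an Avro object container file, per the Avro specification
AVRO_MAGIC = b"Obj\x01"


def has_avro_header(path):
    """Return True if `path` is a file that starts with the Avro magic bytes.

    Hypothetical helper for illustration; it does not validate the schema.
    """
    if not os.path.isfile(path):
        return False
    with open(path, "rb") as fp:
        return fp.read(len(AVRO_MAGIC)) == AVRO_MAGIC
```

If the check returns False, the file should be initialized with the schema first.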
    

    Now write the actual records in append mode (so there is no need to re-read the file!). In ab mode you can omit the schema:

    for i in range(3):
        writer = DataFileWriter(open("reproducible.avro", "ab+"), DatumWriter())
        writer.append(db_entry)
        writer.close()
    

    Finally, read the whole file:

    reader = DataFileReader(open("reproducible.avro", "rb"), DatumReader())
    for data in reader:
        print(data)
    reader.close()
    

    Works for me on Windows, using Python 3.9.13 and avro library 1.11.1.

    For a complete reproducible example, start with the following:

    import avro.schema
    from avro.datafile import DataFileReader, DataFileWriter
    from avro.io import DatumReader, DatumWriter
    import json
    
    schema = {
      "type": "record",
      "name": "recordName",
      "fields": [
        {
          "name": "id",
          "type": "string"
        }
      ]
    }
    
    schema = avro.schema.parse(json.dumps(schema))
    
    db_entry = {
        "id": "random_id"
    }
    

    【Comments】:
