【Question title】: Error when reading avro files in python
【Posted】: 2016-11-22 14:15:08
【Question description】:

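I have successfully installed Apache Avro for Python. I then tried to read Avro files into Python by following these instructions:

https://avro.apache.org/docs/1.8.1/gettingstartedpython.html

I have a bunch of Avro files in a directory that is already set as the working path in Python. Here is my code: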

import avro.schema
from avro.datafile import DataFileReader, DataFileWriter
from avro.io import DatumReader, DatumWriter

reader = DataFileReader(open("part-00000-of-01733.avro", "r"), DatumReader())
for user in reader:
   print (user)
reader.close()

But it returns this error:

Traceback (most recent call last):
  File "I:\DJ data\read avro.py", line 5, in <module>
    reader = DataFileReader(open("part-00000-of-01733.avro", "r"), DatumReader())
  File "I:\Program Files\lib\site-packages\avro_python3-1.8.1-py3.5.egg\avro\datafile.py", line 349, in __init__
    self._read_header()
  File "I:\Program Files\lib\site-packages\avro_python3-1.8.1-py3.5.egg\avro\datafile.py", line 459, in _read_header
    META_SCHEMA, META_SCHEMA, self.raw_decoder)
  File "I:\Program Files\lib\site-packages\avro_python3-1.8.1-py3.5.egg\avro\io.py", line 525, in read_data
    return self.read_record(writer_schema, reader_schema, decoder)
  File "I:\Program Files\lib\site-packages\avro_python3-1.8.1-py3.5.egg\avro\io.py", line 725, in read_record
    field_val = self.read_data(field.type, readers_field.type, decoder)
  File "I:\Program Files\lib\site-packages\avro_python3-1.8.1-py3.5.egg\avro\io.py", line 515, in read_data
    return self.read_fixed(writer_schema, reader_schema, decoder)
  File "I:\Program Files\lib\site-packages\avro_python3-1.8.1-py3.5.egg\avro\io.py", line 568, in read_fixed
    return decoder.read(writer_schema.size)
  File "I:\Program Files\lib\site-packages\avro_python3-1.8.1-py3.5.egg\avro\io.py", line 170, in read
    input_bytes = self.reader.read(n)
  File "I:\Program Files\lib\encodings\cp1252.py", line 23, in decode
    return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x90 in position 863: character maps to <undefined>

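I do know that in the example in the instructions a schema is created first. But what is an .avsc file, and how should I create it and the corresponding schema in my case? Ideally, I would like to read the Avro files into Python and save them to disk as CSV, or hold them as a dataframe or list in Python for further analysis. I am using Python 3 on Windows 7.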

EDITED: I tried Stephane's code, and it returned a new error:

Traceback (most recent call last):
  File "I:\DJ data\read avro.py", line 5, in <module>
    reader = DataFileReader(open("part-00000-of-01733.avro", "rb"), DatumReader())
  File "I:\Program Files\lib\site-packages\avro_python3-1.8.1-py3.5.egg\avro\datafile.py", line 352, in __init__
    self.codec = self.GetMeta('avro.codec').decode('utf-8')
AttributeError: 'NoneType' object has no attribute 'decode'

EDITED2: Stephane's code works fine in most cases, but it sometimes raises an AssertionError like this:

Traceback (most recent call last):
File "I:\DJ data\read avro.py", line 42, in <module>
for user in reader:
File "I:\Program Files\lib\site-packages\avro_python3-1.8.1-py3.5.egg\avro\datafile.py", line 522, in __next__
datum = self.datum_reader.read(self.datum_decoder)
File "I:\Program Files\lib\site-packages\avro_python3-1.8.1-py3.5.egg\avro\io.py", line 480, in read
return self.read_data(self.writer_schema, self.reader_schema, decoder)
File "I:\Program Files\lib\site-packages\avro_python3-1.8.1-py3.5.egg\avro\io.py", line 525, in read_data
return self.read_record(writer_schema, reader_schema, decoder)
File "I:\Program Files\lib\site-packages\avro_python3-1.8.1-py3.5.egg\avro\io.py", line 725, in read_record
field_val = self.read_data(field.type, readers_field.type, decoder)
File "I:\Program Files\lib\site-packages\avro_python3-1.8.1-py3.5.egg\avro\io.py", line 523, in read_data
return self.read_union(writer_schema, reader_schema, decoder)
File "I:\Program Files\lib\site-packages\avro_python3-1.8.1-py3.5.egg\avro\io.py", line 689, in read_union
return self.read_data(selected_writer_schema, reader_schema, decoder)
File "I:\Program Files\lib\site-packages\avro_python3-1.8.1-py3.5.egg\avro\io.py", line 493, in read_data
return self.read_data(writer_schema, s, decoder)
File "I:\Program Files\lib\site-packages\avro_python3-1.8.1-py3.5.egg\avro\io.py", line 503, in read_data
return decoder.read_utf8()
File "I:\Program Files\lib\site-packages\avro_python3-1.8.1-py3.5.egg\avro\io.py", line 248, in read_utf8
input_bytes = self.read_bytes()
File "I:\Program Files\lib\site-packages\avro_python3-1.8.1-py3.5.egg\avro\io.py", line 241, in read_bytes
return self.read(nbytes)
File "I:\Program Files\lib\site-packages\avro_python3-1.8.1-py3.5.egg\avro\io.py", line 171, in read
assert (len(input_bytes) == n), input_bytes
AssertionError: b'BlackRock Group\n\n17 December 2015\n\nFORM 8.3\n\nPUBLIC OPENING POSITION DISCLOSURE/DEALING DISCLOSURE BY\n\nA PERSON WITH INTERESTS IN RELEVANT SECURITIES REPRESENTING 1% OR MORE\n\nRule 8.3 of the Takeover Code (the "Code") \n\n\n   1.         KEY INFORMATION \n \n (a) Full name of discloser:                                                                        BlackRock, Inc. \n-------------------------------------------------------------------------------------------------  ----------------- \n (b) Owner or controller of interests and short positions disclosed, if diffe

【Question comments】:

    Tags: python-3.x avro


    【Solution 1】:

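    You are using Windows and Python 3.

    • In Python 3, open opens files in text mode by default. This means that on every subsequent read, Python tries to decode the file contents from some character set into Unicode.

    • You did not specify an encoding, so Python decodes the contents with the platform default, which on your Windows machine is cp1252 (the 'charmap' codec named in the error).

    • Your Avro file is obviously not cp1252-encoded text, so decoding fails with an exception.

    • As far as I remember, the Avro header is binary content anyway, not text (not sure). So the first thing to try is not letting open decode the file at all: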

    reader = DataFileReader(open("part-00000-of-01733.avro", 'rb'), DatumReader())

    (Note the 'rb': binary mode.)
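The difference between the two modes can be reproduced without Avro installed. The following is a minimal sketch, not part of the original answer: `read_both_ways` is a hypothetical helper, and the payload is just the four Avro magic bytes followed by the byte 0x90 from the traceback, not a real Avro file. It reads the same file once in binary mode and once in text mode with cp1252 forced, so it behaves the same on any platform.

```python
import os
import tempfile

AVRO_MAGIC = b"Obj\x01"  # first four bytes of an Avro object container file


def read_both_ways(payload):
    """Write payload to a temp file, then read it back in binary and in text mode.

    Returns (raw_bytes, text_error), where text_error is the UnicodeDecodeError
    raised by the cp1252 text-mode read, or None if decoding succeeded.
    """
    fd, path = tempfile.mkstemp(suffix=".avro")
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(payload)

        # Binary mode: bytes come back untouched.
        with open(path, "rb") as f:
            raw = f.read()

        # Text mode: every read is decoded, here explicitly with cp1252.
        try:
            with open(path, "r", encoding="cp1252") as f:
                f.read()
            text_error = None
        except UnicodeDecodeError as exc:
            text_error = exc
    finally:
        os.remove(path)
    return raw, text_error


raw, err = read_both_ways(AVRO_MAGIC + b"\x90\x00\x01")
print(raw[:4])
print(type(err).__name__, err.reason if err else "")
```

Byte 0x90 has no mapping in cp1252, which is why the text-mode read fails while the binary read returns the data unchanged.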

    EDIT: For the next problem (the AttributeError), you have hit a known bug that is not fixed in 1.8.1. Until the next version is released, you can do something like this:

    from avro import schema
    from avro import io as avro_io
    from avro.datafile import DataFileReader, DataFileException, VALID_CODECS, SCHEMA_KEY
    from avro.io import DatumReader
    
    
    class MyDataFileReader(DataFileReader):
        def __init__(self, reader, datum_reader):
            """Initializes a new data file reader.
    
            Args:
              reader: Open file to read from.
              datum_reader: Avro datum reader.
            """
            self._reader = reader
            self._raw_decoder = avro_io.BinaryDecoder(reader)
            self._datum_decoder = None  # Maybe reset at every block.
            self._datum_reader = datum_reader
    
            # read the header: magic, meta, sync
            self._read_header()
    
            # ensure codec is valid
            avro_codec_raw = self.GetMeta('avro.codec')
            if avro_codec_raw is None:
                self.codec = "null"
            else:
                self.codec = avro_codec_raw.decode('utf-8')
            if self.codec not in VALID_CODECS:
                raise DataFileException('Unknown codec: %s.' % self.codec)
    
            self._file_length = self._GetInputFileLength()
    
            # get ready to read
            self._block_count = 0
            self.datum_reader.writer_schema = (
                schema.Parse(self.GetMeta(SCHEMA_KEY).decode('utf-8')))
    
    
    reader = MyDataFileReader(open("part-00000-of-01733.avro", "rb"), DatumReader())
    for user in reader:
        print (user)
    reader.close()
    

    It is strange that such a silly bug made it into a release; that is not a sign of mature code!
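For the CSV part of the question, one possible approach is to pass the records yielded by the reader to the standard csv module. This is only a sketch under assumptions: `records_to_csv` is a hypothetical helper, not part of avro, and it assumes every record is a flat dict whose keys match those of the first record. Nested records or arrays would be written as their Python string representation.

```python
import csv


def records_to_csv(records, path):
    """Write an iterable of flat dict records to a CSV file and return the row count."""
    rows = 0
    writer = None
    # newline="" lets the csv module control line endings (important on Windows).
    with open(path, "w", newline="", encoding="utf-8") as out:
        for record in records:
            if writer is None:
                # Column names are taken from the first record (an assumption).
                writer = csv.DictWriter(
                    out, fieldnames=list(record), extrasaction="ignore"
                )
                writer.writeheader()
            writer.writerow(record)
            rows += 1
    return rows
```

With the reader from the code above, usage would look like `records_to_csv(reader, "part-00000.csv")` followed by `reader.close()`. Fields that contain line breaks, such as the long text visible in the AssertionError output, are quoted by the csv module.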

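    【Comments】:

    • I tried your code, but it returned a new error. Please see the edited part of the question. Thanks.
    • Well, it looks like the Avro file does not explicitly specify a codec in its header. When no codec is given, the spec says it should be treated as 'null'. Oddly, the implementation does not seem to know that.
    • Here you go: issues.apache.org/jira/browse/AVRO-1741. The patch does not seem to have been merged into avro python3 1.8.1.
    • Thanks for the code, but it seems to have some errors. I guess in Python 3 we should use "rb" instead of "r" with open, right? Also, inside the class block it says "avro_io" and "VALID_CODECS" are not defined. How can I fix this?
    • Yes, 'rb'. For the undefined symbols, I have added the required imports.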
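The point in the comments about a missing codec, and the question about where the schema comes from, can both be illustrated by reading the file header by hand. The sketch below is written from the Avro specification's description of the container header (magic bytes `Obj\x01`, then a map of string keys to bytes values); it is not the avro library's code, and the helper names `read_long`, `read_bytes`, `read_header_meta` and `codec_of` are hypothetical. It uses only the standard library.

```python
MAGIC = b"Obj\x01"


def read_long(stream):
    """Decode one Avro long (variable-length, zigzag-encoded integer)."""
    shift = 0
    accum = 0
    while True:
        byte = stream.read(1)
        if not byte:
            raise EOFError("unexpected end of data")
        accum |= (byte[0] & 0x7F) << shift
        if not byte[0] & 0x80:  # high bit clear: last byte of this number
            break
        shift += 7
    return (accum >> 1) ^ -(accum & 1)  # undo zigzag encoding


def read_bytes(stream):
    """Read a length-prefixed byte string."""
    n = read_long(stream)
    data = stream.read(n)
    if len(data) != n:
        raise EOFError("unexpected end of data")
    return data


def read_header_meta(stream):
    """Return the header metadata of an Avro container file as {str: bytes}."""
    if stream.read(4) != MAGIC:
        raise ValueError("not an Avro object container file")
    meta = {}
    while True:
        count = read_long(stream)
        if count == 0:  # a zero-length block ends the map
            break
        if count < 0:  # negative count: the block's byte size follows
            count = -count
            read_long(stream)
        for _ in range(count):
            key = read_bytes(stream).decode("utf-8")
            meta[key] = read_bytes(stream)
    return meta


def codec_of(meta):
    """Treat a missing 'avro.codec' entry as the 'null' codec."""
    raw = meta.get("avro.codec")
    return "null" if raw is None else raw.decode("utf-8")
```

Called as `read_header_meta(open("part-00000-of-01733.avro", "rb"))`, this should return the writer's schema as JSON under the key `avro.schema`, so no separate .avsc file is needed just to read the data. `codec_of` applies the same fallback as the patched reader above.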