Is there any way to compare two Avro files to see what differences exist in the data? (Answers)

【Question title】: Is there any way to compare two Avro files to see what differences exist in the data?
【Posted】: 2014-10-14 00:42:19
【Question】:

Ideally, I would like something packaged like SAS proc compare that can give me:

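The number of rows in each dataset
The number of rows that exist in one dataset but not in the other
The variables that exist in one dataset but not in the other
The variables whose format differs between the two files (I realize this would be rare for Avro files, but it would help to see it at a glance without deciphering errors)
The total number of mismatched rows for each column, plus a display of either all mismatches for a column or any 20 mismatches (whichever is smaller)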

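I have worked out one way to make sure the datasets are equivalent, but it is quite inefficient. Suppose we have two Avro files with 100 rows and 5 columns each (one key and four float features). If we join the tables on the key and create new variables holding the difference between the matching features in each dataset, then any non-zero difference indicates a mismatch in the data. From there it is fairly easy to derive the whole list of requirements above, but it seems there may be a more efficient way.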

【Comments】:

Removed the sas tag, since this is not a question about using SAS.

Tags: hadoop avro

【Solution 1】:

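Avro keeps the schema separate from the record data. This means that besides the Avro file with the data you should also have a schema file, usually named something like *.avsc (the schema is also embedded in the header of an Avro data file). That lets you split the task into 3 parts: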

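Compare the schemas. This gives you the fields that have different data types in the two files, the fields that exist in only one of them, and so on. This task is very simple and can be done outside of Hadoop, for example in Python: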

import json

# Load both schema files (.avsc files are plain JSON)
with open('schema1.avsc') as f1, open('schema2.avsc') as f2:
    schema1 = json.load(f1)
    schema2 = json.load(f2)

def print_cross(s1set, s2set, message):
    # Report every field name present in s1set but missing from s2set
    for s in s1set:
        if s not in s2set:
            print(message % s)

s1names = {field['name'] for field in schema1['fields']}
s2names = {field['name'] for field in schema2['fields']}
print_cross(s1names, s2names, 'Field "%s" exists in table1 and does not exist in table2')
print_cross(s2names, s1names, 'Field "%s" exists in table2 and does not exist in table1')

def print_cross2(s1dict, s2dict, message):
    # Report fields present in both schemas whose types differ
    for s in s1dict:
        if s in s2dict and s1dict[s] != s2dict[s]:
            print(message % (s, s1dict[s], s2dict[s]))

# Map field name -> type, stringified so that union types compare as text
s1types = {field['name']: str(field['type']) for field in schema1['fields']}
s2types = {field['name']: str(field['type']) for field in schema2['fields']}
print_cross2(s1types, s2types, 'Field "%s" has type "%s" in table1 and type "%s" in table2')

Here are two example schemas:

{"namespace": "example.avro",
 "type": "record",
 "name": "User",
 "fields": [
     {"name": "name", "type": "string"},
     {"name": "favorite_number",  "type": ["int", "null"]},
     {"name": "favorite_color", "type": ["string", "null"]}
 ]
}

{"namespace": "example.avro",
 "type": "record",
 "name": "User",
 "fields": [
     {"name": "name", "type": "string"},
     {"name": "favorite_number",  "type": ["int"]},
     {"name": "favorite_color", "type": ["string", "null"]},
     {"name": "test", "type": "int"}
 ]
}

Here is the output (from a Python 2 run, hence the u'...' prefixes in the type strings):

[localhost:temp]$ python compare.py 
Field "test" exists in table2 and does not exist in table1
Field "favorite_number" has type "[u'int', u'null']" in table1 and type "[u'int']" in table2

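If the schemas are equal (if they are not, you probably do not need to compare the data at all), you can compare the data as follows. A simple approach that works in every case: compute an MD5 hash for each row and join the two tables on the value of that hash. This gives you the number of rows that are identical in both tables, the number of rows specific to table1 and the number of rows specific to table2. This is easy to do in Hive; here is the code for an MD5 UDF: https://gist.github.com/dataminelab/1050002
To compare field by field, you have to know the primary key of the tables, join the two tables on the primary key and compare the fields.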
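
As an illustration of the row-hash approach outside Hive, here is a minimal Python sketch. It assumes the rows of both files have already been read into lists of dicts (the reading step is not shown), and the names row_md5 and compare_by_hash are hypothetical helpers invented for this example. Serializing each row as JSON with sorted keys before hashing is also an assumption, chosen so that field order does not change the hash.

```python
import hashlib
import json
from collections import Counter

def row_md5(row):
    # Serialize with sorted keys so field order does not affect the hash;
    # default=str is a fallback for values JSON cannot encode directly.
    payload = json.dumps(row, sort_keys=True, default=str)
    return hashlib.md5(payload.encode("utf-8")).hexdigest()

def compare_by_hash(rows1, rows2):
    # Counters (multisets) keep duplicate rows from being collapsed
    h1 = Counter(row_md5(r) for r in rows1)
    h2 = Counter(row_md5(r) for r in rows2)
    return {
        "matched": sum((h1 & h2).values()),         # rows present in both
        "only_in_table1": sum((h1 - h2).values()),  # rows specific to table1
        "only_in_table2": sum((h2 - h1).values()),  # rows specific to table2
    }

# Hypothetical sample data: one identical row, one changed row, one extra row each
table1 = [{"id": 1, "x": 1.0}, {"id": 2, "x": 2.0}, {"id": 3, "x": 3.0}]
table2 = [{"id": 1, "x": 1.0}, {"id": 2, "x": 2.5}, {"id": 4, "x": 4.0}]
result = compare_by_hash(table1, table2)
print(result)
```

Note that a row whose values changed shows up once on each side, because without a key there is no way to tell a changed row from an unrelated one.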

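I have developed comparison functions for tables before, and they usually look like this:

Check that both tables exist and are available
Compare their schemas. If there are any mismatches in the schemas, stop
If a primary key is specified:
1. Join the two tables on the primary key using a full outer join
2. Compute an md5 hash for each row
3. Output the primary keys with a diagnosis (PK exists only in table1, PK exists only in table2, PK exists in both tables but the data does not match)
4. Take a sample of 100 rows for each problematic class, join it with both tables and output it to a "mismatch examples" table
If no primary key is specified:
1. Compute an md5 hash for each row
2. Full outer join of table1 with table2 on the md5 hash value
3. Count the number of matching rows, the number of rows that exist only in table1 and the number of rows that exist only in table2
4. Take a sample of 100 rows for each type of mismatch and output it to a "mismatch examples" table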
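
The primary-key branch of the steps above can be sketched in plain Python as follows. This is only an illustration under stated assumptions: both tables are already loaded as lists of dicts, the schemas have already been found equal, the key is unique within each table, and values are compared for exact equality (for floats a tolerance may be more appropriate). The function name compare_by_key and its parameters are hypothetical.

```python
def compare_by_key(rows1, rows2, key, sample_size=100):
    # Index each table by its primary key (assumed unique)
    t1 = {r[key]: r for r in rows1}
    t2 = {r[key]: r for r in rows2}
    report = {
        "only_in_table1": [],    # keys found only in table1
        "only_in_table2": [],    # keys found only in table2
        "mismatch_counts": {},   # column -> number of mismatched rows
        "mismatch_samples": {},  # column -> up to sample_size (key, v1, v2)
    }
    # Full outer join on the key: every key from either table
    all_keys = list(t1) + [k for k in t2 if k not in t1]
    for k in all_keys:
        if k not in t2:
            report["only_in_table1"].append(k)
        elif k not in t1:
            report["only_in_table2"].append(k)
        else:
            for col, v1 in t1[k].items():
                v2 = t2[k].get(col)
                if v1 != v2:
                    counts = report["mismatch_counts"]
                    counts[col] = counts.get(col, 0) + 1
                    samples = report["mismatch_samples"].setdefault(col, [])
                    if len(samples) < sample_size:
                        samples.append((k, v1, v2))
    return report

# Hypothetical sample data: key "id", two float features
table1 = [{"id": 1, "a": 1.0, "b": 2.0},
          {"id": 2, "a": 3.0, "b": 4.0},
          {"id": 3, "a": 5.0, "b": 6.0}]
table2 = [{"id": 1, "a": 1.0, "b": 2.0},
          {"id": 2, "a": 3.5, "b": 4.0},
          {"id": 4, "a": 7.0, "b": 8.0}]
report = compare_by_key(table1, table2, key="id")
print(report)
```

Comparing the columns directly, rather than a per-row hash, is a deliberate simplification here: it also yields the per-column mismatch counts and samples asked for in the question.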

Developing and debugging a function like this usually takes 4-5 working days.

【Discussion】: