【Question title】: ECDF plot from a truncated MD5
【Posted】: 2019-03-26 23:28:09
【Question description】:

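This link says that a truncated MD5 is uniformly distributed. I wanted to check this with PySpark, so I first generated 1,000,000 UUIDs in Python, as shown below, and then kept only the first three characters of each MD5 hash. But the plot I get does not look like the cumulative distribution function of a uniform distribution. I tried both UUID1 and UUID4, and the results are similar. What is the correct way to verify that a truncated MD5 is uniformly distributed?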

import uuid
import numpy as np
import matplotlib.pyplot as plt
from statsmodels.distributions.empirical_distribution import ECDF
import pandas as pd
import pyspark.sql.functions as f
%matplotlib inline

### Generate 1,000,000 UUID1 

uuid1 = [str(uuid.uuid1()) for i in range(1000000)]  # make a UUID based on the host ID and current time
uuid1_df = pd.DataFrame({'uuid1':uuid1})
uuid1_spark_df =  spark.createDataFrame(uuid1_df)
uuid1_spark_df = uuid1_spark_df.withColumn('hash', f.md5(f.col('uuid1')))\
               .withColumn('truncated_hash3', f.substring(f.col('hash'), 1, 3))

count_by_truncated_hash3_uuid1 = uuid1_spark_df.groupBy('truncated_hash3').count()

uuid1_count_list = [row[1] for row in count_by_truncated_hash3_uuid1.collect()]
ecdf = ECDF(np.array(uuid1_count_list))
plt.figure(figsize = (14, 8))
plt.plot(ecdf.x,ecdf.y)
plt.show()

Edit: I added a histogram. As shown below, it looks more like a normal distribution.

  plt.figure(figsize = (14, 8))
  plt.hist(uuid1_count_list)
  plt.title('Histogram of counts in each truncated hash')
  plt.show()

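【Comments】:

  • If you only use the first 3 nibbles, there are only 4096 possible values. Why not just record how many times each value occurs and then verify that the counts are roughly the same?
  • I added the maximum and minimum counts to the code above; you can see that the range is large.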
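
The count-based check suggested in the comments can be sketched without Spark. This is an illustration under stated assumptions, not the original poster's code: the input is a seeded stream of pseudo-random UUID4-style strings (a stand-in for the 1,000,000 UUIDs in the question), only the standard library is used, and `bucket_counts` is a hypothetical helper name. If the truncated hash is uniform, each of the 4096 buckets receives a Binomial(n, 1/4096) count, so the counts scatter around n/4096 in a roughly bell-shaped way and a fairly wide min-to-max range is expected.

```python
import hashlib
import math
import random
import uuid
from collections import Counter


def bucket_counts(n, prefix_len=3, seed=0):
    """Count how many of n seeded pseudo-random UUID strings land on each
    hex prefix of their MD5 digest (hypothetical helper for illustration)."""
    rng = random.Random(seed)
    counts = Counter()
    for _ in range(n):
        # Assumption: seeded UUID4-style keys replace uuid.uuid1()/uuid4()
        # so that the run is reproducible.
        key = str(uuid.UUID(int=rng.getrandbits(128), version=4))
        digest = hashlib.md5(key.encode("ascii")).hexdigest()
        counts[digest[:prefix_len]] += 1
    return counts


n = 200_000
n_buckets = 16 ** 3  # 3 hex characters -> 4096 possible prefixes
counts = bucket_counts(n)
# Include prefixes that never occurred, so that every bucket is represented.
values = [counts.get(format(b, "03x"), 0) for b in range(n_buckets)]

expected = n / n_buckets
# Standard deviation of a Binomial(n, 1/n_buckets) count.
expected_sd = math.sqrt(n * (1 / n_buckets) * (1 - 1 / n_buckets))
observed_sd = math.sqrt(sum((v - expected) ** 2 for v in values) / n_buckets)
# Pearson chi-square statistic against the uniform hypothesis.
chi2 = sum((v - expected) ** 2 / expected for v in values)

print(f"expected count per bucket: {expected:.1f}")
print(f"min / max count: {min(values)} / {max(values)}")
print(f"observed sd: {observed_sd:.2f}, binomial sd: {expected_sd:.2f}")
print(f"chi-square: {chi2:.0f} on {n_buckets - 1} degrees of freedom")
```

Under uniformity the observed standard deviation should be close to the binomial one, and the chi-square statistic should be close to its degrees of freedom (4095). An ECDF or histogram of these counts therefore describes the sampling noise of the counts, not the distribution of the hash values themselves.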

Tags: python pyspark md5 uniform-distribution


【Solution 1】:

Here is a quick and dirty way to demonstrate it:

import hashlib
import matplotlib.pyplot as plt
import numpy as np
import random

def random_string(n):
    """Returns a uniformly distributed random string of length n."""
    return ''.join(chr(random.randint(0, 255)) for _ in range(n))

# Generate 100K random strings
data = [random_string(10) for _ in range(100000)]
# Compute MD5 hashes
md5s = [hashlib.md5(d.encode()).digest() for d in data]
# Truncate each MD5 to its first three bytes (24 bits) and convert to int
truncated_md5s = [md5[0] * 0x10000 + md5[1] * 0x100 + md5[2] for md5 in md5s]

# (Rather crudely) compute and plot the ECDF    
hist = np.histogram(truncated_md5s, bins=1000)
plt.plot(hist[1], np.cumsum([0] + list(hist[0])))

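【Comments】:

    【Solution 2】:

    The problem with my analysis above was that I was plotting a histogram of the counts per truncated hash. The correct approach is to convert the truncated hash from hexadecimal to decimal and look at how the decimal values are distributed.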

    import uuid
    import numpy as np
    import matplotlib.pyplot as plt
    from statsmodels.distributions.empirical_distribution import ECDF
    import pandas as pd
    import pyspark.sql.functions as f
    from pyspark.sql.types import IntegerType
    %matplotlib inline
    
    ### Generate 1,000,000 UUID1 
    
    uuid1 = [str(uuid.uuid1()) for i in range(1000000)]  
    uuid1_df = pd.DataFrame({'uuid1':uuid1})
    uuid1_spark_df =  spark.createDataFrame(uuid1_df)
    uuid1_spark_df = uuid1_spark_df.withColumn('hash', f.md5(f.col('uuid1')))\
               .withColumn('truncated_hash3', f.substring(f.col('hash'), 1, 3))\
               .withColumn('truncated_hash3_base10', f.conv('truncated_hash3', 16, 10).cast(IntegerType()))
    
    
    truncated_hash3_base10_list = [row[0] for row in 
    uuid1_spark_df.select('truncated_hash3_base10').collect()]
    pd_df = uuid1_spark_df.select('truncated_hash3_base10').toPandas()
    ecdf = ECDF(truncated_hash3_base10_list)
    plt.figure(figsize = (8, 6))
    plt.plot(ecdf.x,ecdf.y)
    plt.show()
    
    plt.figure(figsize = (8, 6))
    plt.hist(truncated_hash3_base10_list)
    plt.show()
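
The same idea can be checked numerically rather than by eye. The sketch below is not part of the original answer: it assumes seeded pseudo-random UUID4-style strings instead of `uuid.uuid1()`, uses `hashlib` in place of Spark's `md5` and `conv`, and introduces the hypothetical helper `truncated_md5_values`. It builds the ECDF of the 12-bit values and reports the largest gap to the CDF of a discrete uniform distribution on 0..4095.

```python
import hashlib
import random
import uuid
from itertools import accumulate


def truncated_md5_values(n, hex_chars=3, seed=1):
    """Return the first hex_chars hex digits of each MD5 digest as an int
    (hypothetical helper for illustration)."""
    rng = random.Random(seed)
    out = []
    for _ in range(n):
        # Assumption: seeded UUID4-style keys, so that the run is reproducible.
        key = str(uuid.UUID(int=rng.getrandbits(128), version=4))
        digest = hashlib.md5(key.encode("ascii")).hexdigest()
        out.append(int(digest[:hex_chars], 16))
    return out


n = 200_000
n_values = 16 ** 3  # 3 hex characters -> integers 0..4095
values = truncated_md5_values(n)

# Frequency of each possible value, then the running sum gives the ECDF.
freq = [0] * n_values
for v in values:
    freq[v] += 1
ecdf = [c / n for c in accumulate(freq)]

# CDF of a discrete uniform distribution on 0..n_values-1.
uniform_cdf = [(v + 1) / n_values for v in range(n_values)]
max_gap = max(abs(e - u) for e, u in zip(ecdf, uniform_cdf))

print(f"max |ECDF - uniform CDF| = {max_gap:.5f}")
```

A small maximum gap (on the order of 1/sqrt(n)) is what a uniform distribution would produce; this is the quantity a Kolmogorov-Smirnov-style comparison looks at.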
    

    【Comments】:
