How can I add a continuous 'Ident' column to a DataFrame in PySpark, instead of using monotonically_increasing_id()? Answers

【Question title】: How can I add continuous 'Ident' column to a dataframe in Pyspark, not as monotonically_increasing_id()?
【Posted】: 2018-03-17 00:05:54
【Question】:

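I have a DataFrame "df" and I want to add a numeric "Ident" column whose values are consecutive. I tried using monotonically_increasing_id(), but the values it produces are not consecutive. As its documentation says: "The generated ID is guaranteed to be monotonically increasing and unique, but not consecutive."

So my question is: how can I do this?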

【Comments】:

Please provide a code sample of what you have already tried, along with the results you got.

Tags: dataframe pyspark pyspark-sql continuous ident

【Solution 1】:

You can try something like this:

df = df.rdd.zipWithIndex().map(lambda x: [x[1]] + [y for y in x[0]]).toDF(['Ident']+df.columns)

This gives you the identifier as the first column, named Ident, with consecutive values from 0 to N-1, where N is the total number of records in df.
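
To see why zipWithIndex() yields consecutive values while monotonically_increasing_id() does not, here is a minimal plain-Python sketch that needs no Spark installation. It is only a model, not Spark's actual code: partitions are represented as ordinary lists, and the function names monotonic_ids and zip_with_index_ids are made up for illustration. The bit layout follows what the Spark documentation describes for the current implementation (partition ID in the upper 31 bits, record number within the partition in the lower 33 bits).

```python
# Plain-Python model of the two ID schemes; no Spark required.
# Assumption: each inner list stands in for one partition of the DataFrame.

def monotonic_ids(partitions):
    """Model of monotonically_increasing_id() as documented:
    partition index in the upper bits, row position in the lower 33 bits."""
    return [
        (p_idx << 33) + row_idx
        for p_idx, rows in enumerate(partitions)
        for row_idx in range(len(rows))
    ]

def zip_with_index_ids(partitions):
    """Model of RDD.zipWithIndex(): each partition continues
    counting where the previous partition stopped."""
    ids, offset = [], 0
    for rows in partitions:
        ids.extend(offset + i for i in range(len(rows)))
        offset += len(rows)
    return ids

# Five rows spread over three hypothetical partitions.
partitions = [["a", "b"], ["c"], ["d", "e"]]

mono = monotonic_ids(partitions)
zipped = zip_with_index_ids(partitions)

print(mono)    # [0, 1, 8589934592, 17179869184, 17179869185]
print(zipped)  # [0, 1, 2, 3, 4]
```

The gaps in the first list appear at every partition boundary, which is why the IDs are unique and increasing but not consecutive. The second scheme has to know how many rows each earlier partition holds, so on a real RDD with more than one partition zipWithIndex() triggers an extra Spark job; that cost, plus the round trip from DataFrame to RDD and back, is the price of the consecutive numbering.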

【Discussion】: