Replacing NA with values from a list in a PySpark DataFrame

【Question title】: Replacing NA with values from a list in a PySpark DataFrame
【Posted】: 2019-10-12 22:05:04
【Question】:

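I have a Spark DataFrame with 20 columns. I want to replace the NA values in selected (numerical) columns with the average of each column.

I have a list of the numerical column names and a list of their averages. I have written the following function, but I am not sure how to apply it to the DataFrame: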

NumColNames=['MinTemp','MaxTemp','Rainfall','WindGustSpeed',\
             'WindSpeed9am', 'WindSpeed3pm', 'Humidity9am', 'Humidity3pm', 'Pressure9am','Pressure3pm']

avgLst=[12,33,44,21,132,35,22,11,4,55]

def replaceNaNum(df, NumColNames,avgLst):
    #iterate through numerical columns names
    for column in NumColNames:
        #iterate through the averages in avgLst
        for avg in avgLst:
            #replace each NA value in every column with the corresponding average 
            df=df.withColumn(column, when(df[column] == 'NA',\
                                                       avg).otherwise(df[column]))
    return df

Any input is appreciated, thanks.

【Question comments】:

Tags: python dataframe pyspark iteration user-defined-functions

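【Solution 1】:

You can use zip here to pair each column name with its corresponding average, then unpack both in a single loop. The nested loops in the question visit every average for every column, whereas each column should only be filled with its own average: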

from pyspark.sql.functions import when

# replace 'NA' in each column with that column's own average
for column, avg in zip(NumColNames, avgLst):
    df = df.withColumn(column, when(df[column] == 'NA', avg).otherwise(df[column]))
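
To see why the nested loops give the wrong result, here is a rough plain-Python sketch that needs no Spark session. It stands in a dict of lists for the DataFrame and assumes the same "replace 'NA', otherwise keep the value" behaviour as the `when(...).otherwise(...)` expression. The helper names `replace_nested` and `replace_zipped` and the sample rows are made up for illustration; only the column names and averages come from the question.

```python
# Plain-Python stand-in for a DataFrame: column name -> list of values.
# The sample rows below are hypothetical, not data from the question.

def replace_nested(data, col_names, averages):
    """Mimics the question's nested loops (every average tried on every column)."""
    out = {name: list(values) for name, values in data.items()}
    for column in col_names:
        for avg in averages:
            # After the first average replaces 'NA', later passes find nothing left
            out[column] = [avg if v == 'NA' else v for v in out[column]]
    return out


def replace_zipped(data, col_names, averages):
    """Mimics the single zip loop: each column is paired with its own average."""
    out = {name: list(values) for name, values in data.items()}
    for column, avg in zip(col_names, averages):
        out[column] = [avg if v == 'NA' else v for v in out[column]]
    return out


cols = ['MinTemp', 'MaxTemp', 'Rainfall']
avgs = [12, 33, 44]
sample = {
    'MinTemp': ['NA', 10],
    'MaxTemp': [30, 'NA'],
    'Rainfall': ['NA', 'NA'],
}

nested = replace_nested(sample, cols, avgs)
zipped = replace_zipped(sample, cols, avgs)

print(nested)  # every 'NA' becomes 12, the first average in the list
print(zipped)  # each 'NA' becomes the average of its own column
```

In this sketch the nested version fills every column with the first average (12), because the first inner pass removes all 'NA' values and the remaining passes have nothing to replace. The zip version fills MaxTemp with 33 and Rainfall with 44, which is presumably what the question intends.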

【Comments】: