Adding a column of fake data to a dataframe in pyspark: Unsupported literal type class - Answers

【Question title】: Adding a column of fake data to a dataframe in pyspark: Unsupported literal type class
【Posted】: 2022-01-03 14:39:52
【Question description】:

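I am trying to add a new column of fake data to my dataset. Take this as an example (it makes no difference what the dataframe is; I need one extra column with unique fake names, and this is just a dummy to play with):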

from faker import Faker

faker = Faker("en_GB")

profiles = [faker.profile() for i in range(0, 100)]
profiles = spark.createDataFrame(profiles)

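I am trying to add a new first-name column, with one name per row. At the moment I am doing this (I know it will not do what I want, but I don't know what else to do):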

profiles = profiles.withColumn('first_name', lit([faker.first_name()] for _ in 'name'))

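However, I keep getting this error:

java.lang.RuntimeException: Unsupported literal type class java.util.ArrayList [[Robin], [Margaret], [Robin], [Victor]]

I would like to keep this as a one-liner, since it is something I need for further analysis.

I think I understand why I am getting the error, but I don't know what to do about it... any ideas are appreciated!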

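【Comments】:

What is your expected output? Currently, you are trying to add the value [[Robin], [Margaret], [Robin], [Victor]] (an array of arrays of strings) to every row of the dataframe.
I want Robin, Margaret, etc. each added to a separate row of the dataframe (added in an edit).
The names are randomly generated; you would have to split the name column and take the first name, but note that some names can have the format Mrs Carole Price. So simply splitting on spaces and taking the first element will not work.
She needs something like: profiles = profiles.withColumn("first_name", F.lit(faker.first_name())). The problem is that faker.first_name() is evaluated once, so the same first name is generated for all rows.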
stackoverflow.com/questions/63269034/…
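
The caveat raised in the comments above, that a name such as Mrs Carole Price breaks a naive "split on spaces and take the first element", can be sketched in plain Python. This is only an illustration: the `TITLES` set and the `first_name_from` helper are assumptions introduced here, not part of Faker or PySpark, and the list of honorifics is certainly incomplete.

```python
# Assumed set of honorifics; extend it for the locale in use.
TITLES = {"mr", "mrs", "ms", "miss", "dr"}


def first_name_from(full_name):
    """Return the first token of full_name that is not a leading title."""
    parts = full_name.split()
    # Drop leading tokens that look like titles, e.g. "Mrs" or "Dr."
    while parts and parts[0].rstrip(".").lower() in TITLES:
        parts = parts[1:]
    return parts[0] if parts else None


naive = "Mrs Carole Price".split()[0]          # picks up the title
skipping_titles = first_name_from("Mrs Carole Price")
```

A helper like this could be wrapped in a UDF and applied to the name column, but the first names it yields are derived from the existing names rather than freshly generated ones.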

Tags: python apache-spark pyspark faker

【Solution 1】:

Try something like this:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from faker import Faker

faker = Faker("en_GB")

spark = SparkSession.builder.getOrCreate()
profiles = [faker.profile() for i in range(0, 100)]
profiles = spark.createDataFrame(profiles)
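# Generate the names on the driver, one per row.
fake_names = [faker.first_name() for _ in range(profiles.count())]
# Caveat: indexing by monotonically_increasing_id only works while the ids
# are consecutive (0..n-1), which holds for a single partition only.
profiles = profiles.withColumn(
    "first_name", F.udf(lambda x: fake_names[x])(F.monotonically_increasing_id())
)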

The fake names need to be generated outside the dataframe. If you use:

profiles.withColumn("first_name", F.lit(faker.first_name()))

you will get the same fake name for every row.
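
The reason every row gets the same name is ordinary Python evaluation order: the argument is computed once on the driver before the column expression is built. The sketch below shows this without Spark. `fake_first_name` and `constant_column` are hypothetical stand-ins for faker.first_name() and F.lit, written only to make the evaluation order visible.

```python
import itertools

_counter = itertools.count(1)


def fake_first_name():
    """Hypothetical stand-in for faker.first_name(): a new value per call."""
    return f"name_{next(_counter)}"


def constant_column(value):
    """Hypothetical stand-in for F.lit: it only ever sees the finished value."""
    return lambda row: value


rows = [{"id": i} for i in range(3)]

# fake_first_name() runs exactly once here, before constant_column is called.
column = constant_column(fake_first_name())
same_for_all = [column(row) for row in rows]

# Calling the generator once per row gives one value per row.
one_per_row = [fake_first_name() for _ in rows]
```

Here `same_for_all` holds one value repeated three times, while `one_per_row` holds three different values. This is why the answer above builds the full list of names first.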

Edit:

row_number example:

from pyspark.sql import Window

fake_names = [faker.first_name() for _ in range(profiles.count())]
# row_number() is 1-based and consecutive, hence the "- 1" when indexing.
window = Window.orderBy("name")  # Or any other unique column, but I guess name is unique here
profiles = profiles.withColumn(
    "first_name", F.udf(lambda x: fake_names[x - 1])(F.row_number().over(window))
)

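【Comments】:

monotonically_increasing_id will not work because it is not consecutive: "The generated ID is guaranteed to be monotonically increasing and unique, but not consecutive."
@CarolinaKaroullas No, because some of the fake names would be repeated. monotonically_increasing_id is not consecutive and can (in this case) lead to duplicates in batches. If your original dataframe has a unique column, you can create a Window over it and use row_number to get unique fake names.
@CarolinaKaroullas Yes, I updated the answer and added an example.
@CarolinaKaroullas Yes, this one is better in terms of logic, but terrible in terms of performance.
@Steven Well, I wanted to help. What are the other options for better performance? I would really like to know. I am still learning.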
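
The comments above point out that the ids from monotonically_increasing_id are unique but not consecutive. The plain-Python sketch below shows why that matters when they are used as list indices. It assumes the layout described in the Spark documentation (partition index in the upper 31 bits, record number within the partition in the lower 33 bits). `simulated_ids` is a hypothetical helper written for this illustration, not a Spark function.

```python
def simulated_ids(partition_sizes):
    """Mimic the documented id layout: (partition index << 33) + record number."""
    ids = []
    for partition_index, size in enumerate(partition_sizes):
        for record_number in range(size):
            ids.append((partition_index << 33) + record_number)
    return ids


fake_names = ["Robin", "Margaret", "Robin", "Victor"]

# All four rows in one partition: ids happen to be valid list indices.
single_partition = simulated_ids([4])

# The same four rows split over two partitions: the second partition's
# ids start at 1 << 33 and are far outside the list.
two_partitions = simulated_ids([2, 2])
out_of_range = [i for i in two_partitions if i >= len(fake_names)]
```

With a single partition, fake_names[x] works by coincidence. With more than one partition, the lookup inside the UDF would raise an IndexError. The row_number version avoids this because its numbers are consecutive.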

【Solution 2】:

Is this what you want?

from faker import Faker

faker = Faker("en_GB")

profiles = [[faker.profile(), faker.first_name()] for i in range(0, 100)]
profiles = spark.createDataFrame(profiles, ["profile", "first_name"])

profiles.show()

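【Comments】:

Not quite, but closer! I just want to add the "first_name" column to the profiles dataframe (or any other dataframe) as a separate column, with one name per row.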