Spark posexplode function runs very slowly: answers

【Question title】: spark posexplode function runs very slow
【Posted】: 2018-07-31 17:38:22
【Question】:

I have a Spark DataFrame stored as ORC with about 10,000 rows and the following schema:

>>> df.printSchema()
root
 |-- contig: string (nullable = true)
 |-- start: integer (nullable = true)
 |-- ref: string (nullable = true)
 |-- alt: string (nullable = true)
 |-- gt: array (nullable = true)
 |    |-- element: integer (containsNull = true)

where the array field gt is a list of 200,000 integers. I want to convert this into a DataFrame with a flat structure:

>>> from pyspark.sql.functions import posexplode
>>> flat = df.select('contig', 'start', 'ref', 'alt', posexplode(df.gt))
>>> flat.explain()
== Physical Plan ==
*Project [contig#0, start#1, ref#2, alt#3, pos#11, col#12]
+- Generate posexplode(gt#4), true, false, [pos#11, col#12]
   +- *FileScan orc [contig#0,start#1,ref#2,alt#3,gt#4] Batched: false, Format: ORC, Location: InMemoryFileIndex[file:/path/to/data], PartitionFilters: [], PushedFilters: [], ReadSchema: struct<contig:string,start:int,ref:string,alt:string,gt:array<int>>
>>> flat.write.orc('/path/to/output/file')

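On a machine with 24 CPU cores and more than 100GB of memory, writing the flattened DataFrame to a file takes more than five hours. Is this just a characteristic of the posexplode function, or is something else wrong?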

【Comments on the question】:

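If you explode 200,000 integers, each input row generates 200,000 output rows. So obviously it gets slow.
True, we are talking about 2 billion output rows here, but is that really so much work for 24 worker threads? The output file is only 1.3GB.
Thanks. I'm not sure about the 72g for the driver, but could you try tuning the executor memory? More specifically, increase it significantly (assuming you have the resources; otherwise, try reducing the driver memory to compensate) and see whether that speeds up execution.
Also, judging from your comment, there seems to be some form of skew that makes your job's parallelism less than ideal. Is the original data partitioned or bucketed in some way?
Personally, I would suggest repartitioning the DataFrame so that each of the 24 executors gets an equal partition before you run posexplode. Then apply posexplode with the withColumn function, and only afterwards use the select function. Give it a try.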
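The workload discussed in these comments can be put into numbers with a few lines of arithmetic. The sketch below uses only the figures quoted in the question (about 10,000 rows, 200,000 array elements each, 24 cores) and treats five hours as a lower bound on the run time, since the question says "more than five hours". Under those assumptions it works out to about one input row per 43 core-seconds, in line with the rate of roughly 1/40 row per second per core reported in the answer.

```python
# Figures quoted in the question; the wall-clock time is a lower bound
# because the question only says "more than five hours".
input_rows = 10_000
array_length = 200_000
cores = 24
wall_clock_seconds = 5 * 3600

# Every input row is exploded into one output row per array element.
output_rows = input_rows * array_length

core_seconds = wall_clock_seconds * cores
input_rows_per_core_second = input_rows / core_seconds
output_rows_per_core_second = output_rows / core_seconds

print(f"output rows: {output_rows:,}")
print(f"input rows per core per second: {input_rows_per_core_second:.4f}")
print(f"output rows per core per second: {output_rows_per_core_second:,.0f}")
```

Seen this way, each core emits only a few thousand small output rows per second, which is why the commenter doubts that the row count alone explains the run time.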

Tags: python apache-spark pyspark spark-dataframe orc

【Solution 1】:

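It looks like Spark is doing something crazy with the rows here. Using an RDD, I was able to get better performance (1/3 row per second per CPU core, versus 1/40 row per second per CPU core with the DataFrame). That is still not very fast, though.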

# sql_context is assumed to be an existing SQLContext
df = sql_context.read.orc('/path/to/source/file')
rdd = df.rdd

def expand(row):
    # Unpack the Row once, then emit one plain tuple per array element
    contig, start, ref, alt, gt = row
    def getrow(index, genotype):
        return contig, start, ref, alt, index, genotype
    return [getrow(index, genotype) for index, genotype in enumerate(gt)]

rdd_flat = rdd.flatMap(expand)
schema = ('contig', 'start', 'ref', 'alt', 'index', 'genotype')
sql_context.createDataFrame(rdd_flat, schema=schema).write.orc('/path/to/output/file')

Interestingly, if I redefine the expand function as

from pyspark.sql import Row

def expand(row):
    # Build a Row object (with keyword arguments) for every array element
    def getrow(index, genotype):
        return Row(
            contig=row.contig,
            start=row.start,
            ref=row.ref,
            alt=row.alt,
            index=index,
            genotype=genotype
        )
    return [getrow(index, genotype) for index, genotype in enumerate(row.gt)]

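it runs about 13 times slower (a single function call takes about 1.4 seconds).

Clearly, Row objects are extremely inefficient.

There is still more to sort out, though. A single core should be able to run the expand function 9 times per second, but the actual performance is 1 row every 3 seconds.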
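A per-call figure like the one above can be measured outside Spark by timing the tuple-based expand function on a single synthetic row. This is only a sketch: the row values ("chr1", 12345, "A", "T" and the genotype pattern) are made up for illustration, and only the array length of 200,000 is taken from the question. The timing will vary by machine, so no expected number is given.

```python
import time

def expand(row):
    # Same tuple-based flattening as the RDD version, without Spark.
    contig, start, ref, alt, gt = row
    return [(contig, start, ref, alt, index, genotype)
            for index, genotype in enumerate(gt)]

# Hypothetical synthetic row: the values are invented, only the array
# length (200,000) matches the question.
row = ("chr1", 12345, "A", "T", [i % 3 for i in range(200_000)])

# Average over a few calls to smooth out timer noise.
repeats = 5
t0 = time.perf_counter()
for _ in range(repeats):
    flat = expand(row)
elapsed = (time.perf_counter() - t0) / repeats

print(f"{len(flat)} output tuples, {elapsed:.3f} s per call, "
      f"{1 / elapsed:.1f} calls per second")
```

Comparing this standalone rate with the rate observed inside the Spark job gives a rough idea of how much time is spent outside the function itself, for example in serialization between the JVM and the Python workers.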

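Edit: found a "solution": use a prestodb query instead of Spark. It processes a little more than 1 row per second per CPU core, which is more than 20 times faster than the DataFrame and 4 times faster than the RDD: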

create table flat (
  contig varchar,
  start int,
  ref varchar,
  alt varchar,
  index bigint,
  genotype tinyint
) 
WITH (format = 'ORC');

insert into flat
select contig, start, ref, alt, index, genotype
from nested cross join unnest(gt) with ordinality as g (genotype, index)
where partition_name='10-70329347';

【Discussion】: