【Question title】: PySpark alphabet index
【Posted】: 2020-11-26 15:38:06
【Question】:

I need to assign letters to my values in alphabetical order. The letters are stored in a list, 'simbol_combinations':

[Column<b'A'>, Column<b'B'>, Column<b'C'>, Column<b'D'>, Column<b'E'>, Column<b'F'>, Column<b'G'>, Column<b'H'>, Column<b'I'>, Column<b'J'>, Column<b'K'>, Column<b'L'>, Column<b'M'>, Column<b'N'>, Column<b'O'>, Column<b'P'>, Column<b'Q'>, Column<b'R'>, Column<b'S'>, Column<b'T'>, Column<b'U'>, Column<b'V'>, Column<b'W'>, Column<b'X'>, Column<b'Y'>, Column<b'Z'>]

I also have the DataFrame 'unque_activity':

+--------------+
|activity_start|
+--------------+
|       Stage_3|
|       Stage_5|
|       Stage_4|
|       Stage_1|
|       Stage_6|
|       Stage_2|
|       Stage_0|
|       Stage_8|
|       Stage_7|
|       Stage_9|
+--------------+

unque_activity = df.select("activity_start").distinct()

I want to combine the DataFrame 'unque_activity' with the list 'simbol_combinations' so that each stage gets one letter. The result I am after:

    stages symbol
0  Stage_0      A
1  Stage_3      B
2  Stage_5      C
3  Stage_2      D
4  Stage_7      E
5  Stage_4      F
6  Stage_8      G
7  Stage_9      H
8  Stage_1      I
9  Stage_6      J

How can I do this? Thanks.

【Comments】:

    Tags: python dataframe apache-spark pyspark rdd


    【Solution 1】:

    In PySpark, I would solve it like this:

    • Create a DataFrame for each of the two lists

    • Sort each one

    • Add a row_num column

    • Join the two DataFrames on row_num

       from pyspark.sql import SparkSession
       from pyspark.sql.types import StringType
       from pyspark.sql.functions import row_number
       from pyspark.sql.window import Window

       spark = SparkSession.builder \
           .appName("Dyn") \
           .getOrCreate()

       activity_start = [
           "Stage_3", "Stage_5", "Stage_4", "Stage_1", "Stage_6", "Stage_2", "Stage_0", "Stage_8", "Stage_7",
           "Stage_9"]

       activity_name = ["A", "B", "C", "D", "E", "F", "G", "H", "I", "J"]

       # Number the stages in sorted order. The ordering is put inside the window
       # so the numbering is deterministic; a separate orderBy() followed by a
       # window ordered by a constant does not guarantee the row order.
       activity_start_df = (
           spark.createDataFrame(activity_start, StringType())
           .withColumnRenamed("value", "activity_start")
           .withColumn("row_num", row_number().over(Window.orderBy("activity_start")))
       )

       # Number the letters the same way.
       activity_name_df = (
           spark.createDataFrame(activity_name, StringType())
           .withColumnRenamed("value", "activity_name")
           .withColumn("row_num", row_number().over(Window.orderBy("activity_name")))
       )

       # Joining on the column name keeps a single row_num column in the result.
       df = activity_start_df.join(activity_name_df, on="row_num").orderBy("row_num")
       df.show(truncate=False)
      

    Output [1]: https://i.stack.imgur.com/c7iwo.png

    Let me know if this helps.
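
    With only a handful of distinct stages, the pairing could also be done on the driver in plain Python before anything is handed to Spark. The following is a minimal sketch under that assumption, not the approach used in the answer above; the helper name `assign_symbols` is hypothetical, and the input list stands in for values collected from 'unque_activity'.

```python
import string


def assign_symbols(stages):
    """Pair each distinct stage with a letter, in sorted order."""
    # Sorting is lexicographic, so "Stage_10" would sort before "Stage_2".
    ordered = sorted(set(stages))
    if len(ordered) > len(string.ascii_uppercase):
        raise ValueError("more distinct values than letters A-Z")
    # zip stops at the shorter input, so only the letters needed are used.
    return list(zip(ordered, string.ascii_uppercase))


# Stand-in for the distinct values of the activity_start column.
activity_start = ["Stage_3", "Stage_5", "Stage_4", "Stage_1", "Stage_6",
                  "Stage_2", "Stage_0", "Stage_8", "Stage_7", "Stage_9"]

pairs = assign_symbols(activity_start)
for stage, symbol in pairs:
    print(stage, symbol)
```

    The resulting list of tuples could then be passed to `spark.createDataFrame(pairs, ["stages", "symbol"])`. This avoids the join, but it only makes sense when the distinct values fit comfortably in driver memory.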

    【Comments】:
