【Question title】: Bind repeated ID with incremental number
【Posted】: 2018-05-07 07:43:42
【Question】:

I am preprocessing some data and have a DataFrame like this:

+------+------+--------------------+
|    id|   ref|                text|
+------+------+--------------------+
|  8309|  3129|3 MO F/U HAIR LOS...|
|  8309|  3129|        4 MO SKIN CK|
|  8309|  3129|      4 MO F/U LM AG|
|  8309|  3129|HAIR LOSS AND SPO...|
|  8309|  3129|    2 MO F/U CONF KC|
|  8309|  3129|SSR AND DISCUSS H...|
|  4569|  1101|F/U LM TO CONFIRM...|
|  4569|  1101|F/U (LF) LM TO CO...|
|  4569|  1101|        FU CONFIRMED|
|  4569|  1101|F/U MRI RESULTS  ...|
|  4569|  1101|F/U AFTER MRI JC ...|
|  4569|  1101|                  FU|
|  4569|  1101|F/U AND NEW PROBL...|
|  4569|  1101|                 F/U|
|  4569|  1101|        FU CONFIRMED|
|  4569|  1101|REVIEW MRI       ...|
|  4569|  1101|REVIEW MRI RESULT...|
+------+------+--------------------+

I want to transform this DataFrame into the following:

+--------+------+--------------------+
|      id|   ref|                text|
+--------+------+--------------------+
|  8309  |  3129|3 MO F/U HAIR LOS...|
|  8309_1|  3129|        4 MO SKIN CK|
|  8309_2|  3129|      4 MO F/U LM AG|
|  8309_3|  3129|HAIR LOSS AND SPO...|
|  8309_4|  3129|    2 MO F/U CONF KC|
|  8309_5|  3129|SSR AND DISCUSS H...|
|  4569  |  1101|F/U LM TO CONFIRM...|
|  4569_1|  1101|F/U (LF) LM TO CO...|
|  4569_2|  1101|        FU CONFIRMED|
|  4569_3|  1101|F/U MRI RESULTS  ...|
+--------+------+--------------------+

I just want to attach a unique number to each repeated ID. It is fine if the numbers are not incremental.

【Comments】:

    Tags: pandas dataframe pyspark


    【Solution 1】:

    Use GroupBy.cumcount to number the rows within each id group, then append that number as a suffix:

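    # Number each row within its id group (0, 1, 2, ...), turn the number into
    # a "_N" suffix, and blank out "_0" so the first occurrence stays unchanged.
    suffix = (df.groupby('id')
                .cumcount()
                .astype(str)
                .radd('_')
                .replace('_0', ''))
    df['id'] = df['id'].astype(str).add(suffix)
    print(df)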
    
             id   ref                  text
    0      8309  3129  3 MO F/U HAIR LOS...
    1    8309_1  3129          4 MO SKIN CK
    2    8309_2  3129        4 MO F/U LM AG
    3    8309_3  3129  HAIR LOSS AND SPO...
    4    8309_4  3129      2 MO F/U CONF KC
    5    8309_5  3129  SSR AND DISCUSS H...
    6      4569  1101  F/U LM TO CONFIRM...
    7    4569_1  1101  F/U (LF) LM TO CO...
    8    4569_2  1101          FU CONFIRMED
    9    4569_3  1101  F/U MRI RESULTS  ...
    10   4569_4  1101  F/U AFTER MRI JC ...
    11   4569_5  1101                    FU
    12   4569_6  1101  F/U AND NEW PROBL...
    13   4569_7  1101                   F/U
    14   4569_8  1101          FU CONFIRMED
    15   4569_9  1101         REVIEW MRI...
    16  4569_10  1101  REVIEW MRI RESULT...
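
As a self-contained illustration of the same idea, here is a minimal sketch that builds a small DataFrame and applies the cumcount suffix. The sample rows are made up for the example (shortened from the question's data), and the variant uses `Series.where` instead of `replace` so the "first occurrence keeps its id" rule is explicit; it is one possible way to write it, not the answer's exact code.

```python
import pandas as pd

# Hypothetical sample data shaped like the question's DataFrame.
df = pd.DataFrame({
    "id": [8309, 8309, 8309, 4569, 4569],
    "ref": [3129, 3129, 3129, 1101, 1101],
    "text": ["4 MO SKIN CK", "4 MO F/U LM AG", "2 MO F/U CONF KC",
             "FU CONFIRMED", "FU"],
})

# cumcount() numbers the rows inside each id group, starting at 0.
counter = df.groupby("id").cumcount()

# Build "_N" suffixes; rows with counter == 0 (first occurrence) get "".
suffix = ("_" + counter.astype(str)).where(counter > 0, "")

df["id"] = df["id"].astype(str) + suffix
print(df["id"].tolist())
# ['8309', '8309_1', '8309_2', '4569', '4569_1']
```

Because cumcount follows the existing row order, the suffixes reflect the order in which the rows appear in the DataFrame.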
    

    【Comments】:

      【Solution 2】:

      In Spark (Scala), you can combine the row_number(), lag(), and when functions with a window specification to get the result you want:

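      import org.apache.spark.sql.expressions._
      import org.apache.spark.sql.functions._
      // The $"col" syntax comes from spark.implicits._, which spark-shell imports automatically.
      // Number the rows within each id; lag shifts the numbers down by one,
      // so the first row of each id gets null and keeps its original id.
      val windowSpec = Window.partitionBy("id").orderBy("ref")
      df.withColumn("rank", lag(row_number().over(windowSpec), 1).over(windowSpec))
          .withColumn("id", when($"rank".isNotNull, concat_ws("_", $"id", $"rank")).otherwise($"id"))
          .drop("rank")
          .show(false)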
      

      You should get the following final DataFrame:

      +-------+----+--------------------+
      |id     |ref |text                |
      +-------+----+--------------------+
      |4569   |1101|F/U LM TO CONFIRM...|
      |4569_1 |1101|F/U (LF) LM TO CO...|
      |4569_2 |1101|        FU CONFIRMED|
      |4569_3 |1101|F/U MRI RESULTS  ...|
      |4569_4 |1101|F/U AFTER MRI JC ...|
      |4569_5 |1101|                  FU|
      |4569_6 |1101|F/U AND NEW PROBL...|
      |4569_7 |1101|                 F/U|
      |4569_8 |1101|        FU CONFIRMED|
      |4569_9 |1101|REVIEW MRI       ...|
      |4569_10|1101|REVIEW MRI RESULT...|
      |8309   |3129|3 MO F/U HAIR LOS...|
      |8309_1 |3129|        4 MO SKIN CK|
      |8309_2 |3129|      4 MO F/U LM AG|
      |8309_3 |3129|HAIR LOSS AND SPO...|
      |8309_4 |3129|    2 MO F/U CONF KC|
      |8309_5 |3129|SSR AND DISCUSS H...|
      +-------+----+--------------------+
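
For readers who think in pandas, the window logic above can be mimicked roughly as follows. This is only an analogy written in pandas, not Spark code: `cumcount() + 1` stands in for row_number(), a grouped `shift(1)` stands in for lag(), and the sample rows are hypothetical.

```python
import pandas as pd

# Hypothetical sample data; only the columns the logic needs.
df = pd.DataFrame({
    "id": [4569, 4569, 4569, 8309, 8309],
    "ref": [1101, 1101, 1101, 3129, 3129],
})

# row_number(): 1-based position of each row inside its id partition.
row_number = df.groupby("id").cumcount() + 1

# lag(..., 1): the previous row's value within the same partition.
# The first row of each partition has no previous row, so it becomes NaN.
rank = row_number.groupby(df["id"]).shift(1)

# when(rank is not null, id + "_" + rank) otherwise id
id_str = df["id"].astype(str)
with_suffix = id_str + "_" + rank.fillna(0).astype(int).astype(str)
new_id = id_str.where(rank.isna(), with_suffix)

print(new_id.tolist())
# ['4569', '4569_1', '4569_2', '8309', '8309_1']
```

One difference worth keeping in mind: pandas keeps the existing row order, whereas a Spark window ordered by ref does not define a unique order among rows that share the same ref, so which row receives which suffix may vary between runs.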
      

      【Comments】:
