【Question title】: Spark DataFrame - dynamic substring based on numbers in strings
【Posted】: 2018-10-01 19:15:27
【Question】:

I need to convert the following Spark DataFrame (Spark 2.0, Scala 2.11):

A   B
a   2*Z12*CA9*ThisnThat10*51827630323*fa2
b   1*C7*Friends5*names1*O2
c   4*19456*helpme6*please
d   2*M13*fin2*na2*325*123456*fancy2

into the following format (in Scala or PySpark):

A   B
a   Z1*CA*ThisnThat*5182763032*fa2
b   C*Friends*names*O
c   1945*helpme*please
d   M1*fin*na*32*12345*fancy2

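Logic used: in each row, the first number gives the length of the substring to take from the next value. The digits left over after that substring give the length for the following value, and so on.

For example, for the first string: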

(2*Z12*CA9*ThisnThat10*51827630323*fa2) - 
* Use the first 2 to break 'Z12' into 'Z1' (two characters) with 2 remaining.  
* Use this 2 to break 'CA9' into 'CA' (two characters) with 9 remaining.  
* Use this 9 to break 'ThisnThat10' into 'ThisnThat' (9 characters) and 10.  
* Use the 10 to break '51827630323' into '5182763032' (10 characters) and 3.  
* Use the 3 to break 'fa2' into 'fa2' (3 characters).  
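
The steps above can be traced in plain Python, independent of Spark. This is only a sketch under stated assumptions: the leading token is a valid integer, every later token is its own text followed by the decimal length of the next token, and an empty or non-numeric remainder is treated as length 0 (the question does not specify this case). The name `trace_steps` is hypothetical.

```python
def trace_steps(s):
    """Return one (token, kept, next_length) tuple per step of the rule."""
    tokens = s.split("*")
    n = int(tokens[0])  # the leading token is only a length (assumed numeric)
    steps = []
    for token in tokens[1:]:
        kept, rest = token[:n], token[n:]
        # Assumption: an empty or non-numeric remainder means length 0.
        n = int(rest) if rest.isascii() and rest.isdigit() else 0
        steps.append((token, kept, n))
    return steps


for token, kept, n in trace_steps("2*Z12*CA9*ThisnThat10*51827630323*fa2"):
    print(f"{token!r} -> {kept!r}, next length {n}")
```

Joining the `kept` values with `*` gives the shortened string for the row.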

I can split the string and create a wide DataFrame with a dynamic number of columns, but I can't figure out a UDF that shortens the strings.

【Comments】:

    Tags: arrays regex scala apache-spark dataframe


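    【Solution 1】:

    You can create a UDF to process column B as shown below. Try validates the integer conversion, and foldLeft walks through the split substrings and applies the required logic.

    Note that a tuple of (String, Int) serves as the foldLeft accumulator, so the transformed string is built up step by step while the computed length (n) is carried forward to the next substring.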

    val df = Seq(
      ("a", "2*Z12*CA9*ThisnThat10*51827630323*fa2"),
      ("b", "1*C7*Friends5*names1*O2"),
      ("c", "4*19456*helpme6*please"),
      ("d", "2*M13*fin2*na2*325*123456*fancy2")
    ).toDF("A", "B")
    
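    import org.apache.spark.sql.functions.udf
    import scala.util.Try

    val processString = udf( (s: String) => {
      val arr = s.split("\\*")
      // The leading token is only a length; fall back to 0 if it is not an integer.
      val firstN = Try(arr.head.toInt).getOrElse(0)

      // Accumulator: (result built so far, length to apply to the next token).
      arr.tail.foldLeft( ("", firstN) ){ case ((result, len), x) =>
        // Whatever follows the first `len` characters is the next length.
        val n = Try( x.drop(len).toInt ).getOrElse(0)
        ( result + "*" + x.take(len), n )
      }._1.drop(1)  // drop the leading "*"; unlike tail, this is safe on an empty result
    } )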
    
    df.select($"A", processString($"B").as("B")).
      show(false)
    // +---+------------------------------+
    // |A  |B                             |
    // +---+------------------------------+
    // |a  |Z1*CA*ThisnThat*5182763032*fa2|
    // |b  |C*Friends*names*O             |
    // |c  |1945*helpme*please            |
    // |d  |M1*fin*na*32*12345*fancy2     |
    // +---+------------------------------+
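
    For the PySpark side of the question, the same fold with a (parts, length) accumulator can be approximated with `functools.reduce`. This is a hedged sketch rather than something the answer provides: `process_string`, `_step` and `_to_len` are hypothetical names, and an empty or non-numeric remainder falls back to 0, similar to the `Try` handling above.

```python
from functools import reduce


def _to_len(text):
    # Assumption: an empty or non-numeric remainder counts as length 0.
    return int(text) if text.isascii() and text.isdigit() else 0


def process_string(s):
    head, *rest = s.split("*")

    def _step(acc, token):
        parts, n = acc
        # Keep the first n characters; the remainder is the next length.
        return parts + [token[:n]], _to_len(token[n:])

    parts, _ = reduce(_step, rest, ([], _to_len(head)))
    return "*".join(parts)


print(process_string("1*C7*Friends5*names1*O2"))
```

    To apply it to a DataFrame, the function could be wrapped with `pyspark.sql.functions.udf(process_string, StringType())` and called on `col("B")`. That step needs a running Spark session and is left out here.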
    

    【Comments】:

      【Solution 2】:

      Suppose you have the following DataFrame (data taken from the question):

      +---+-------------------------------------+
      |A  |B                                    |
      +---+-------------------------------------+
      |a  |2*Z12*CA9*ThisnThat10*51827630323*fa2|
      |b  |1*C7*Friends5*names1*O2              |
      |c  |4*19456*helpme6*please               |
      |d  |2*M13*fin2*na2*325*123456*fancy2     |
      +---+-------------------------------------+
      

      Then you need a recursive function inside the UDF, as follows:

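      import org.apache.spark.sql.functions._
      import scala.collection.mutable.ListBuffer
      import scala.util.Try

      def shorteningUdf = udf((actualStr: String) => {
        val arrayStr = actualStr.split("\\*")
        // The leading token is only the length for the first substring.
        val nextSubStrIndex = Try(arrayStr.head.toInt).getOrElse(0)
        val listBuffer = new ListBuffer[String]
        // Walk the tokens, keeping the first `index` characters of each and
        // parsing the remainder as the length for the next token.
        def recursiveFund(arrayStr2: List[String], index: Int, resultStrBuff: ListBuffer[String]): ListBuffer[String] = arrayStr2 match {
          case head :: Nil => resultStrBuff += head.splitAt(index)._1
          case head :: tail =>
            val splitStr = head.splitAt(index)
            // A non-numeric remainder falls back to 0 instead of throwing.
            recursiveFund(tail, Try(splitStr._2.toInt).getOrElse(0), resultStrBuff += splitStr._1)
          case _ => resultStrBuff
        }
        recursiveFund(arrayStr.tail.toList, nextSubStrIndex, listBuffer).mkString("*")
      })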
      

      So when you call the UDF:

      df.withColumn("B", shorteningUdf(col("B"))).show(false)
      

      you get the desired output:

      +---+------------------------------+
      |A  |B                             |
      +---+------------------------------+
      |a  |Z1*CA*ThisnThat*5182763032*fa2|
      |b  |C*Friends*names*O             |
      |c  |1945*helpme*please            |
      |d  |M1*fin*na*32*12345*fancy2     |
      +---+------------------------------+
      

      I hope this answer helps.

      【Comments】:
