Iterate through a DataFrame and get the indexes: answers

【Question title】: Iterate through a DataFrame and get the indexes
【Posted】: 2019-09-12 23:11:04
【Question】:

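I would like to iterate through the rows and columns of a DataFrame, get the current row/column index, and perform some other operations. Is there a convenient way to get the index number from a row or column?

The goal is to save the output to an xlsx file with the Apache POI library, so iterating over every cell is probably necessary.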

// iterate through each row/column of the DataFrame
myDataframe.foreach{row =>
  row.toSeq.foreach{col =>
    val rowNum = row.???
    val colNum = col.???
    // further operations on the data...
    // like save the output to the xlsx file with the Apache POI
  }
}

I am using Spark 1.6.3 and Scala 2.10.5.

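【Comments】:

A DataFrame is essentially a distributed collection. An Excel file is essentially a local file. If you are sure the data is small enough to be written to a single file, just collect() it and process the result with plain Scala / Java code.
I agree with @LuisMiguelMejíaSuárez: if it fits in an Excel file, it should fit in memory, so you can collect it and process it. foreach is generally used for side effects, which is not what you need here.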

Tags: scala dataframe apache-spark

【Solution 1】:

You can add an index with row_number():

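  import org.apache.spark.sql.expressions.Window
  import org.apache.spark.sql.functions.row_number
  import sqlContext.implicits._ // needed for toDF and $"..."

  val myDataframe = sc.parallelize(List("a", "b", "c", "d")).toDF("value")
  // Prepend a 1-based "index" column; cast first, then alias, so the column keeps its name
  val withIndex = myDataframe.select(row_number().over(Window.orderBy($"value")).cast("int").as("index"), $"*")

  // Iterate over withIndex (not myDataframe): its column 0 holds the row index
  withIndex.foreach { row =>
    val rowNum = row.getInt(0)
    for (colNum <- 0 until row.length) {
      // rowNum and colNum are both available here
    }
  }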

However, if you want to save the DataFrame to an Excel file, you should collect your data and then convert it into an array of arrays (a 2D array).

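  import org.apache.spark.sql.functions.concat_ws

  // Join all columns into one comma-separated string per row, collect to the
  // driver, then split back into cells (assumes no value contains a comma)
  val list: Array[Array[String]] = withIndex
    .select(concat_ws(",", withIndex.columns.map(withIndex(_)): _*))
    .map(s => s.getString(0))
    .collect()
    .map(s => s.split(","))

  // Print each cell as a tuple together with its row and column index
  for (r <- list.indices) {
    for (c <- list(r).indices) {
      println((list(r)(c), ", row:" + r + ", col:" + c))
    }
  }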

(1,, row:0, col:0)
(a,, row:0, col:1)
(2,, row:1, col:0)
(b,, row:1, col:1)
(3,, row:2, col:0)
(c,, row:2, col:1)
(4,, row:3, col:0)
(d,, row:3, col:1)
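
As an alternative to joining and splitting strings, the row and column numbers can also be read off with zipWithIndex once the data is on the driver. The following is only a minimal sketch in plain Scala with no Spark dependency: the hard-coded rows stand in for what something like withIndex.collect().map(_.toSeq) might return, and the names IndexedCells, IndexedCell and indexCells are hypothetical, chosen for illustration.

```scala
object IndexedCells {
  // One cell of the table together with its 0-based position (hypothetical helper type)
  final case class IndexedCell(row: Int, col: Int, value: Any)

  // Pair every value with its 0-based row and column index
  def indexCells(rows: Seq[Seq[Any]]): Seq[IndexedCell] =
    for {
      (row, rowNum) <- rows.zipWithIndex
      (value, colNum) <- row.zipWithIndex
    } yield IndexedCell(rowNum, colNum, value)

  def main(args: Array[String]): Unit = {
    // Assumed stand-in for the collected DataFrame rows
    val rows: Seq[Seq[Any]] = Seq(Seq(1, "a"), Seq(2, "b"), Seq(3, "c"), Seq(4, "d"))
    indexCells(rows).foreach { cell =>
      println(s"value=${cell.value}, row=${cell.row}, col=${cell.col}")
    }
  }
}
```

Because the values are never turned into a delimited string, cells that contain commas stay intact and each value keeps its original type, which may be convenient when the cells are later written out with a library such as Apache POI.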

I don't know how Apache POI works in Scala, but in Java it should look something like this:

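  import java.io.File;
  import java.io.FileInputStream;
  import java.io.FileOutputStream;
  import org.apache.poi.ss.usermodel.Cell;
  import org.apache.poi.ss.usermodel.Row;
  import org.apache.poi.ss.usermodel.Sheet;
  import org.apache.poi.ss.usermodel.Workbook;
  import org.apache.poi.ss.usermodel.WorkbookFactory;

  // excelFilePath must point to an existing workbook
  FileInputStream inputStream = new FileInputStream(new File(excelFilePath));
  Workbook workbook = WorkbookFactory.create(inputStream);
  Sheet newSheet = workbook.createSheet("spark");

  // your data from the DataFrame
  Object[][] bookComments = {
      {"1", "a"},
      {"2", "b"},
      {"3", "c"},
      {"4", "d"},
  };

  // POI row and column indexes are 0-based, so post-increment to start at 0
  int rowCount = 0;

  for (Object[] aBook : bookComments) {
      Row row = newSheet.createRow(rowCount++);

      int columnCount = 0;

      for (Object field : aBook) {
          Cell cell = row.createCell(columnCount++);
          if (field instanceof String) {
              cell.setCellValue((String) field);
          } else if (field instanceof Integer) {
              cell.setCellValue((Integer) field);
          }
      }
  }

  // write the modified workbook to a new file, then release all resources
  FileOutputStream outputStream = new FileOutputStream("JavaBooks.xlsx");
  workbook.write(outputStream);
  workbook.close();
  outputStream.close();
  inputStream.close();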

【Comments】: