Yes, for example:
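import org.apache.spark.sql.functions.col  // col(...) is defined in functions, not in Column
import spark.implicits._                   // needed for .toDF on a local collection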
val df = List(
("1001", "[physics, chemistry]", "pass"),
("1001", "[biology, math]", "fail"),
("3002", "[economics]", "pass"),
("2002", "[physics, chemistry]", "fail")
).toDF("student_id", "subjects", "result")
df.filter(col("student_id").startsWith("3")).show
returns:
+----------+-----------+------+
|student_id|   subjects|result|
+----------+-----------+------+
|      3002|[economics]|  pass|
+----------+-----------+------+
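For JSON-derived input (not strictly relevant to the question, but here is an example using a DataFrame rather than a Dataset; it applies to a Dataset as well), the only subtle difference is how a field inside a struct is referenced: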
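import org.apache.spark.sql.functions._  // col, explode
import spark.implicits._                 // $"..." column syntax
val df = spark.read.json("/FileStore/tables/json_nested_4.txt")
// One output row per element of the schools array
val flattened = df.select($"name", explode($"schools").as("schools_flat"))
// Filter on a top-level string column
flattened.filter(col("name").startsWith("J")).show
// Filter on a field inside the struct, addressed with dot notation
flattened.filter(col("schools_flat.sname").startsWith("u")).show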
Base input and schema:
+-------+----------------+
|   name|    schools_flat|
+-------+----------------+
|Michael|[stanford, 2010]|
|Michael|[berkeley, 2012]|
|   Andy|    [ucsb, 2011]|
| Justin|[berkeley, 2014]|
+-------+----------------+
flattened: org.apache.spark.sql.DataFrame = [name: string, schools_flat: struct<sname: string, year: bigint>]
returns:
+------+----------------+
|  name|    schools_flat|
+------+----------------+
|Justin|[berkeley, 2014]|
+------+----------------+
+----+------------+
|name|schools_flat|
+----+------------+
|Andy|[ucsb, 2011]|
+----+------------+
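
To illustrate the remark that the same approach applies to a Dataset, here is a minimal sketch. The case classes School and Enrolment are hypothetical names chosen to mirror the flattened schema shown above, the rows are typed in by hand instead of being read from the JSON file, and a local SparkSession is assumed (in a notebook or spark-shell, getOrCreate should simply return the existing session).

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

// Hypothetical case classes mirroring the flattened schema:
// [name: string, schools_flat: struct<sname: string, year: bigint>]
case class School(sname: String, year: Long)
case class Enrolment(name: String, schools_flat: School)

// Assumption: a local session; an existing session is reused if there is one.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("startsWith-on-dataset")
  .getOrCreate()
import spark.implicits._

// Hand-written rows standing in for the flattened JSON input.
val ds = Seq(
  Enrolment("Michael", School("stanford", 2010L)),
  Enrolment("Michael", School("berkeley", 2012L)),
  Enrolment("Andy", School("ucsb", 2011L)),
  Enrolment("Justin", School("berkeley", 2014L))
).toDS()

// Column-based filters work on a Dataset exactly as on a DataFrame.
val byName = ds.filter(col("name").startsWith("J"))
val bySchool = ds.filter(col("schools_flat.sname").startsWith("u"))

// Typed alternative: a plain Scala predicate using String.startsWith.
val bySchoolTyped =
  ds.filter((e: Enrolment) => e.schools_flat.sname.startsWith("u"))

byName.show()
bySchool.show()
bySchoolTyped.show()
```

The Column-based form keeps the predicate visible to the query optimizer, while the typed lambda is checked by the compiler but is opaque to Spark, so the Column form is usually the safer default when both are available.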