【Posted】: 2020-03-11 06:41:13
【Question】:
I am trying the following:
from os.path import abspath
from pyspark.sql import SparkSession

def customFunction(rows):
    for row in rows:
        key = row.key  # comes back as a boolean instead of the actual value; row["key"] behaves the same
        val = row.value  # comes back as a boolean instead of the actual value; row["value"] behaves the same
        # do something with key and val

# assumed warehouse path, as in the Spark Hive example; adjust as needed
warehouse_location = abspath("spark-warehouse")

spark = SparkSession \
    .builder \
    .appName("Python Spark SQL Hive integration example") \
    .config("spark.sql.warehouse.dir", warehouse_location) \
    .enableHiveSupport() \
    .getOrCreate()

# spark is an existing SparkSession
spark.sql("CREATE TABLE IF NOT EXISTS src (key INT, value STRING) USING hive")
spark.sql("LOAD DATA LOCAL INPATH 'examples/src/main/resources/kv1.txt' INTO TABLE src")

# Queries are expressed in HiveQL
df = spark.sql("SELECT key, value FROM src")

# assume df holds billions of rows
df.rdd.foreachPartition(customFunction)
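Inside customFunction, the key and val variables hold boolean values. How can we get the actual values of the row attributes?
This runs on AWS EMR 5.29 with Python 2.7, and the Python code is executed via spark-submit.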
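One way to narrow this down, offered as a sketch rather than a confirmed fix: a PySpark `Row` is a tuple subclass, so the two selected columns can be unpacked by position instead of by name, and the type of each field can be logged to see what the partition function actually receives. The names `summarize_partition`, `custom_function`, and `FakeRow` are hypothetical; `FakeRow` is a `namedtuple` stand-in for `Row` so the snippet runs without a cluster.

```python
from collections import namedtuple


def summarize_partition(rows):
    """Return (key, value, key type name, value type name) for each row."""
    seen = []
    for row in rows:
        # Positional unpacking relies only on the row being a tuple with
        # two fields, in the order of "SELECT key, value".
        key, value = row
        seen.append((key, value, type(key).__name__, type(value).__name__))
    return seen


def custom_function(rows):
    """Partition function: log each field together with its type."""
    for key, value, key_type, value_type in summarize_partition(rows):
        print("key=%r (%s) value=%r (%s)" % (key, key_type, value, value_type))


# Stand-in for pyspark.sql.Row (assumption: used here only for a local dry run).
FakeRow = namedtuple("FakeRow", ["key", "value"])
sample = [FakeRow(238, "val_238"), FakeRow(86, "val_86")]

result = summarize_partition(iter(sample))
custom_function(iter(sample))
```

On the cluster the same function would be passed as `df.rdd.foreachPartition(custom_function)`. Under spark-submit on YARN, anything printed inside it typically lands in the executor logs rather than on the driver console.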
【Comments】:
-
Yes, it should be downvoted. What I am trying to do is the PySpark equivalent of stackoverflow.com/questions/45421593/…
Tags: python python-2.7 apache-spark pyspark