【发布时间】:2018-03-11 19:03:30
【问题描述】:
我正在尝试通过 Dataproc UI 提交 pyspark 作业并不断收到错误消息,它似乎没有加载 kafka 流包。
这是我工作中 UI 提供的 REST 命令:
POST /v1/projects/projectname/regions/global/jobs:submit/
{
"projectId": "projectname",
"job": {
"placement": {
"clusterName": "cluster-main"
},
"reference": {
"jobId": "job-33ab811a"
},
"pysparkJob": {
"mainPythonFileUri": "gs://projectname/streaming.py",
"args": [
"--packages org.apache.spark:spark-sql-kafka-0-10_2.11:2.2.0"
],
"jarFileUris": [
"gs://projectname/spark-streaming-kafka-0-10_2.11-2.2.0.jar"
]
}
}
}
我尝试将 kafka 包作为 args 和 jar 文件传递。
这是我的代码 (streaming.py):
from pyspark import SparkContext
from pyspark.sql import SparkSession
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils
import json
sc = SparkContext()
spark = SparkSession.builder.master("local").appName("Spark-Kafka-Integration").getOrCreate()
# < ip > is masked
df = spark \
.readStream \
.format("kafka") \
.option("kafka.bootstrap.servers", "<ip>:9092") \
.option("subscribe", "rsvps") \
.option("startingOffsets", "earliest") \
.load()
df.printSchema()
错误: :java.lang.ClassNotFoundException:找不到数据源:kafka。请在http://spark.apache.org/third-party-projects.html查找包
【问题讨论】:
标签: python pyspark google-cloud-platform google-cloud-dataproc