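[Posted]: 2018-10-05 01:38:30
[Question]:
I compared saving 30,000 records to a Cassandra table through an RDD and through a Dataset. The Dataset save turned out to be about 10 times slower than the RDD save. The table has 4 partition keys.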
DSE version: 5.1.7
Spark version: 2.0.1
Nodes: 6 (20 cores each, 6g)
Using Spark Standalone
We used the following Spark configuration:
- spark.scheduler.listenerbus.eventqueue.size=100000
- spark.locality.wait=1
- spark.dse.continuous_paging_enabled=false
- spark.cassandra.input.fetch.size_in_rows=500
- spark.cassandra.connection.keep_alive_ms=10000
- spark.cassandra.output.concurrent.writes=2000
- num-cpu-cores=48
- memory-per-node=3g
- spark.executor.cores=3
- spark.cassandra.output.ignoreNulls=true
- spark.cassandra.output.throughput_mb_per_sec=10
- spark.serializer=org.apache.spark.serializer.KryoSerializer
- spark.cassandra.connection.local_dc=dc1
- spark.cassandra.connection.compression=LZ4
- spark.cassandra.connection.connections_per_executor_max=20
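For reference, a minimal sketch of how the `spark.*` settings listed above could be set on a `SparkConf` in code. The object name `WriteBenchConf` is hypothetical, and the two entries without the `spark.` prefix (`num-cpu-cores` and the per-node memory) are left out on the assumption that they belong to the job launcher rather than to `SparkConf`.

```scala
import org.apache.spark.SparkConf

// Hypothetical holder object; the values are copied unchanged from the list above.
object WriteBenchConf {
  val conf: SparkConf = new SparkConf()
    .set("spark.scheduler.listenerbus.eventqueue.size", "100000")
    .set("spark.locality.wait", "1")
    .set("spark.dse.continuous_paging_enabled", "false")
    .set("spark.cassandra.input.fetch.size_in_rows", "500")
    .set("spark.cassandra.connection.keep_alive_ms", "10000")
    .set("spark.cassandra.output.concurrent.writes", "2000")
    .set("spark.executor.cores", "3")
    .set("spark.cassandra.output.ignoreNulls", "true")
    .set("spark.cassandra.output.throughput_mb_per_sec", "10")
    .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    .set("spark.cassandra.connection.local_dc", "dc1")
    .set("spark.cassandra.connection.compression", "LZ4")
    .set("spark.cassandra.connection.connections_per_executor_max", "20")
}
```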
Below is the sample code for both cases:
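import com.datastax.spark.connector._
import org.apache.spark.sql.{SaveMode, SparkSession}
val sparkSession = SparkSession.builder().config(conf).getOrCreate()
import sparkSession.implicits._
// RDD path: read the rows for one id, then write them back
val RDD1 = sc.cassandraTable[TableName]("keySpace1", "TableName")
  .where("id = ?", id)
RDD1.saveToCassandra("keySpace1", "TableName")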
// Dataset path: read the rows for one id through the DataFrame source, then append them
val DS1 = sparkSession.read
  .format("org.apache.spark.sql.cassandra")
  .options(Map("table" -> "TableName", "keyspace" -> "keySpace1"))
  .load()
  .where("id = '" + id + "'").as[CaseClassModel]
DS1.write.format("org.apache.spark.sql.cassandra")
  .mode(SaveMode.Append).option("table", "TableName1")
  .option("keyspace", "KeySpace1")
  .save()
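A minimal sketch of how the two save paths could be timed side by side. `SaveTiming` and `timed` are hypothetical helper names, not part of Spark or the connector. Because both pipelines are lazy until the save is called, the measured time for each path covers the read as well as the write.

```scala
// Hypothetical timing helper; pure Scala, no Spark dependency.
object SaveTiming {
  /** Runs `block` once and returns its result together with the elapsed wall-clock time in ms. */
  def timed[A](label: String)(block: => A): (A, Long) = {
    val start = System.nanoTime()
    val result = block
    val elapsedMs = (System.nanoTime() - start) / 1000000L
    println(s"$label took $elapsedMs ms")
    (result, elapsedMs)
  }
}
```

Usage, assuming `RDD1` and `DS1` are defined as in the sample code: `SaveTiming.timed("RDD save") { RDD1.saveToCassandra("keySpace1", "TableName") }`, and likewise wrapping the `DS1.write ... .save()` chain under a "Dataset save" label.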
[Discussion]:
Tags: scala apache-spark spark-dataframe spark-cassandra-connector