Official documentation: Quick Start
Download the JAR package
apache-carbondata-1.5.2-bin-spark2.2.1-hadoop2.7.2.jar
Run locally with Spark Shell
Create sample.csv:
cd carbondata
cat > sample.csv << EOF
id,name,city,age
1,david,shenzhen,31
2,eason,shenzhen,27
3,jarry,wuhan,35
EOF
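Just to make the schema of the sample data explicit, here is a plain-Python sketch that parses the same rows (the data is inlined so the snippet is self-contained; in practice you would open sample.csv):

```python
import csv
import io

# The same sample data created above, inlined for a self-contained example.
SAMPLE = """id,name,city,age
1,david,shenzhen,31
2,eason,shenzhen,27
3,jarry,wuhan,35
"""

# DictReader keys each row by the header line: id, name, city, age.
rows = list(csv.DictReader(io.StringIO(SAMPLE)))
print(len(rows))        # 3
print(rows[0]["city"])  # shenzhen
```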
Put the test file into HDFS first:
hdfs dfs -put sample.csv /carbon/data
bin/spark-shell --jars /Users/xxx/Documents/software/carbondata/apache-carbondata-1.5.2-bin-spark2.2.1-hadoop2.7.2.jar
.....
Welcome to
____ __
/ __/__ ___ _____/ /__
_\ \/ _ \/ _ `/ __/ '_/
/___/ .__/\_,_/_/ /_/\_\ version 2.2.1
/_/
Using Scala version 2.11.8 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_151)
Type in expressions to have them evaluated.
Type :help for more information.
scala> import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.SparkSession
scala> import org.apache.spark.sql.CarbonSession._
import org.apache.spark.sql.CarbonSession._
scala> val carbon = SparkSession.builder().config(sc.getConf).getOrCreateCarbonSession("hdfs://localhost:9000/carbon/data/store", "hdfs://localhost:9000/carbon/metastore/store")
scala> carbon.sql(
| s"""
| | CREATE TABLE IF NOT EXISTS test_table(
| | id string,
| | name string,
| | city string,
| | age Int)
| | STORED AS carbondata
| """.stripMargin)
19/03/17 14:27:07 AUDIT carbon.audit: {"time":"2019年3月17日 下午02时27分07秒","username":"xxx","opName":"CREATE TABLE","opId":"49588049602293","opStatus":"START"}
19/03/17 14:27:10 AUDIT carbon.audit: {"time":"2019年3月17日 下午02时27分10秒","username":"xxx","opName":"CREATE TABLE","opId":"49588049602293","opStatus":"SUCCESS","opTime":"2951 ms","table":"default.test_table","extraInfo":{"bad_record_path":"","local_dictionary_enable":"true","external":"false","sort_columns":"","comment":""}}
res0: org.apache.spark.sql.DataFrame = []
scala> carbon.sql("LOAD DATA INPATH 'hdfs://localhost:9000/carbon/data/sample.csv' INTO TABLE test_table")
19/03/17 14:41:10 AUDIT carbon.audit: {"time":"2019年3月17日 下午02时41分10秒","username":"xxx","opName":"LOAD DATA","opId":"50431785226855","opStatus":"START"}
19/03/17 14:41:13 WARN memory.UnsafeMemoryManager: It is not recommended to set offheap working memory size less than 512MB, so setting default value to 512
19/03/17 14:41:15 AUDIT carbon.audit: {"time":"2019年3月17日 下午02时41分15秒","username":"xxx","opName":"LOAD DATA","opId":"50431785226855","opStatus":"SUCCESS","opTime":"4335 ms","table":"default.test_table","extraInfo":{"SegmentId":"0","DataSize":"1.11KB","IndexSize":"600.0B"}}
res4: org.apache.spark.sql.DataFrame = []
scala> carbon.sql("SELECT * FROM test_table").show()
+---+-----+--------+---+
| id| name| city|age|
+---+-----+--------+---+
| 1|david|shenzhen| 31|
| 2|eason|shenzhen| 27|
| 3|jarry| wuhan| 35|
+---+-----+--------+---+
scala> carbon.sql(
| s"""
| | SELECT city, avg(age), sum(age)
| | FROM test_table
| | GROUP BY city
| """.stripMargin).show()
+--------+--------+--------+
| city|avg(age)|sum(age)|
+--------+--------+--------+
| wuhan| 35.0| 35|
|shenzhen| 29.0| 58|
+--------+--------+--------+
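The GROUP BY result above can be sanity-checked by hand. A minimal plain-Python sketch of the same aggregation over the three sample rows (this is ordinary Python, not CarbonData or Spark):

```python
from collections import defaultdict

# The (city, age) pairs from sample.csv above.
rows = [("shenzhen", 31), ("shenzhen", 27), ("wuhan", 35)]

# Group ages by city, then compute avg(age) and sum(age) per group,
# mirroring the SELECT city, avg(age), sum(age) ... GROUP BY city query.
ages = defaultdict(list)
for city, age in rows:
    ages[city].append(age)

result = {city: (sum(a) / len(a), sum(a)) for city, a in ages.items()}
print(result)  # {'shenzhen': (29.0, 58), 'wuhan': (35.0, 35)}
```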
HDFS Store
By default the metastore location points to …/carbon.metastore; the user can provide their own metastore location to CarbonSession like this:
SparkSession
.builder().config(sc.getConf)
.getOrCreateCarbonSession("<carbon_store_path>", "<local metastore path>")
The data storage location is specified by <carbon_store_path>, e.g. /carbon/data/store, hdfs://localhost:9000/carbon/data/store, or s3a://carbon/data/store.
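The three path forms differ only in their URI scheme, which determines the backing filesystem. A small illustrative sketch (`store_scheme` is a hypothetical helper, not a CarbonData API):

```python
from urllib.parse import urlparse

# Hypothetical helper: classify a carbon store path by its URI scheme.
# A path with no scheme (e.g. /carbon/data/store) is a local filesystem path.
def store_scheme(path: str) -> str:
    scheme = urlparse(path).scheme
    return scheme if scheme else "local"

print(store_scheme("/carbon/data/store"))                       # local
print(store_scheme("hdfs://localhost:9000/carbon/data/store"))  # hdfs
print(store_scheme("s3a://carbon/data/store"))                  # s3a
```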
Here the store path is specified as:
hdfs://localhost:9000/carbon/data/store
The following files are generated in HDFS: