一、Environment Setup
(一) Environment variables
Create a project named Spark_project in the PyCharm IDE, then create a test.py file and configure it:
Then click Environment variables and add the following variables:
PYTHONPATH I:\hadoop-pyspark\spark-2.0.2-bin-hadoop2.6\python
SPARK_HOME I:\hadoop-pyspark\spark-2.0.2-bin-hadoop2.6
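As an alternative to setting these in the PyCharm run configuration, the same two values can be set at the top of the script itself. This is a minimal sketch, assuming the same install path as above; it must run before pyspark is imported:

import os
import sys

# Assumed install path, identical to the run-configuration values above
SPARK_HOME = r"I:\hadoop-pyspark\spark-2.0.2-bin-hadoop2.6"

# Equivalent to setting SPARK_HOME / PYTHONPATH in the PyCharm run configuration
os.environ["SPARK_HOME"] = SPARK_HOME
sys.path.insert(0, os.path.join(SPARK_HOME, "python"))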
(二) Project Structure
Under File--Settings--Project: spark_project--Add Content Root, add:
I:\hadoop-pyspark\spark-2.0.2-bin-hadoop2.6\python\lib\py4j-0.10.3-src.zip
I:\hadoop-pyspark\spark-2.0.2-bin-hadoop2.6\python\lib\pyspark.zip
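If you prefer not to touch the project structure, a rough equivalent is to append the two archives to sys.path at runtime, since Python can import directly from zip files. A minimal sketch, assuming the same paths as above:

import os
import sys

SPARK_HOME = r"I:\hadoop-pyspark\spark-2.0.2-bin-hadoop2.6"

# Appending these zip files to sys.path has the same effect as PyCharm's "Add Content Root"
sys.path.append(os.path.join(SPARK_HOME, "python", "lib", "py4j-0.10.3-src.zip"))
sys.path.append(os.path.join(SPARK_HOME, "python", "lib", "pyspark.zip"))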
二、Testing
from pyspark import SparkConf, SparkContext

# Create a SparkConf and configure Spark
conf = SparkConf().setMaster("local[2]").setAppName("sparktest")

# Create the SparkContext
sc = SparkContext(conf=conf)

# Process the business data
data = [1, 2, 3, 4, 5]
distData = sc.parallelize(data)
print(distData.collect())  # Print the result

# Stop the SparkContext
sc.stop()
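If the configuration above is correct, running test.py in PyCharm should print [1, 2, 3, 4, 5].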
How do you run this on a Spark cluster? You first need to submit the .py file: create a test file on the Linux machine and copy the locally developed script onto it. Note, however, what the Spark documentation says:
""" In practice, when running on a cluster, you will not want to hardcode master in the program, but rather launch the application with
spark-submit and receive it there. However, for local testing and unit tests, you can pass “local” to run Spark in-process. """
- Do not hard-code the master in the program
- Submit the script with spark-submit
from pyspark import SparkConf, SparkContext

# Create a SparkConf and configure Spark; the master is supplied by spark-submit instead of being hard-coded
conf = SparkConf()

# Create the SparkContext
sc = SparkContext(conf=conf)

# Process the business data
data = [1, 2, 3, 4, 5]
distData = sc.parallelize(data)
print(distData.collect())  # Print the result

# Stop the SparkContext
sc.stop()
Go to /root/app/spark-2.0.2-bin-hadoop2.6/bin and run:
[root@hadoop-master bin]# ./spark-submit --master local[2] --name sparktest /root/hadoopdata/test.py
That is all that is needed.
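To run against an actual standalone cluster rather than in local mode, the same command would take the cluster's master URL instead of local[2], for example --master spark://hadoop-master:7077 (7077 is the Spark standalone master's default port; the host name here is only an assumption based on the prompt above).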