
1. Environment Setup

(1) Environment Variables

Create a project named Spark_project in the PyCharm IDE, then create a test.py file and configure it:

Then click Environment variables and add the following two variables:

PYTHONPATH    I:\hadoop-pyspark\spark-2.0.2-bin-hadoop2.6\python
SPARK_HOME    I:\hadoop-pyspark\spark-2.0.2-bin-hadoop2.6
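
If you would rather not depend on the IDE's run configuration, the same setup can be done at the top of the script; a minimal sketch, assuming the install paths above (note that assigning PYTHONPATH via os.environ after the interpreter has started has no effect, so the directories are put on sys.path directly):

import os
import sys

# Equivalent of the run-configuration variables above; adjust to your install
os.environ["SPARK_HOME"] = r"I:\hadoop-pyspark\spark-2.0.2-bin-hadoop2.6"
sys.path.insert(0, r"I:\hadoop-pyspark\spark-2.0.2-bin-hadoop2.6\python")
sys.path.insert(0, r"I:\hadoop-pyspark\spark-2.0.2-bin-hadoop2.6\python\lib\py4j-0.10.3-src.zip")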

(2) Project Structure

Under File--Settings--Project: spark_project--Project Structure--Add Content Root, add:

I:\hadoop-pyspark\spark-2.0.2-bin-hadoop2.6\python\lib\py4j-0.10.3-src.zip

I:\hadoop-pyspark\spark-2.0.2-bin-hadoop2.6\python\lib\pyspark.zip
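
A quick way to verify that both archives are picked up is to import the modules and print where they were loaded from; a small sanity check:

import pyspark
import py4j

# If the content roots are configured correctly, both imports succeed
# and the paths point into the zip archives added above
print(pyspark.__file__)
print(py4j.__file__)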

2. Testing

from pyspark import SparkConf, SparkContext

# Create a SparkConf and configure Spark
conf = SparkConf().setMaster("local[2]").setAppName("sparktest")

# Create the SparkContext
sc = SparkContext(conf=conf)

# Do the actual data processing
data = [1, 2, 3, 4, 5]
distData = sc.parallelize(data)
print(distData.collect())  # print the result

# Stop the SparkContext
sc.stop()
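
If everything is wired up correctly, running test.py prints [1, 2, 3, 4, 5].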

How do you run this on a Spark cluster? You first need to get the .py file onto the cluster: create a test.py on the Linux machine and copy the locally developed script into it. But note the following caveat from the Spark documentation:

""" 
In practice, when running on a cluster, you will not want to hardcode master in the program, but rather launch the application with 
spark-submit and receive it there. However, for local testing and unit tests, you can pass “local” to run Spark in-process.
"""
  • Do not hardcode the master in the program
  • Submit the script with spark-submit instead

The script adjusted for cluster submission:
from pyspark import SparkConf, SparkContext

# Create a SparkConf; master and app name are supplied by spark-submit
conf = SparkConf()

# Create the SparkContext
sc = SparkContext(conf=conf)

# Do the actual data processing
data = [1, 2, 3, 4, 5]
distData = sc.parallelize(data)
print(distData.collect())  # print the result

# Stop the SparkContext
sc.stop()
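
Since the script no longer hardcodes anything, spark-submit decides the master and application name. To confirm what was injected, the SparkContext exposes the master at runtime; for example, before sc.stop():

print(sc.master)  # e.g. local[2] when submitted with --master local[2]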

Go into /root/app/spark-2.0.2-bin-hadoop2.6/bin and run:

[root@hadoop-master bin]# ./spark-submit --master local[2] --name sparktest /root/hadoopdata/test.py

That's it.
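
The same command runs against a real cluster by changing the --master URL; for example, against a standalone master (the host and port here are placeholders, substitute your own):

[root@hadoop-master bin]# ./spark-submit --master spark://hadoop-master:7077 --name sparktest /root/hadoopdata/test.py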

 
