Running Spark, Part 5. Example: WordCount

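import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._

object WordCount {
  def main(args: Array[String]): Unit = {
    val inputPath = "file:///test/kmeans_data.txt"
    val outputPath = "file:///test/result"

    // No SparkConf is passed here; master and app name come from spark-submit.
    val sc = new SparkContext()
    val texts = sc.textFile(inputPath)
    println(sc.master) // shows whether the job runs in local or yarn mode

    val wordCounts = texts
      .flatMap(line => line.split(" ")) // split each line into words
      .map(word => (word, 1))           // pair each word with a count of 1
      .reduceByKey(_ + _)               // sum the counts per word

    // Save the result; this fails if outputPath already exists.
    wordCounts.saveAsTextFile(outputPath)
    sc.stop()
  }
}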
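To make the three transformations concrete, here is a minimal sketch of the same flatMap, map and sum-per-key steps on an ordinary in-memory Scala collection, with no Spark involved. The object name `WordCountLocalSketch`, the helper `countWords` and the sample lines are made up for illustration; `groupBy` stands in for the shuffle that `reduceByKey` performs on a cluster.

```scala
object WordCountLocalSketch {
  // Same three steps as the Spark job, applied to a Seq instead of an RDD.
  def countWords(lines: Seq[String]): Map[String, Int] =
    lines
      .flatMap(line => line.split(" "))  // one element per word
      .map(word => (word, 1))            // pair each word with a count of 1
      .groupBy(_._1)                     // gather the pairs by word
      .map { case (word, pairs) => (word, pairs.map(_._2).sum) } // sum per word

  def main(args: Array[String]): Unit = {
    val sample = Seq("a b a", "b c")     // hypothetical input lines
    // Sort by word so the printed order is stable.
    countWords(sample).toSeq.sortBy(_._1).foreach(println)
  }
}
```

With the sample above, `a` and `b` each end up with a count of 2 and `c` with a count of 1.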

Package the program as a jar with IDEA or sbt, then run it with spark-submit:

Local mode:

# spark-submit --class WordCount --master local file:///export/spark_jar/wordcount.jar

19/04/19 17:36:56 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable

local


Result: [image not preserved]

YARN mode:

Change the paths to HDFS paths:

val inputPath="hdfs://master:9000/test/kmeans_data.txt"

val outputPath="hdfs://master:9000/test/result"
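Instead of editing the source and rebuilding the jar for each mode, the paths could be taken from the command line. The following is only a sketch under that assumption: `PathArgsSketch` and `resolvePaths` are hypothetical names, and the fallback values are the local-mode paths used earlier.

```scala
object PathArgsSketch {
  // Hypothetical helper: use args(0) and args(1) when given,
  // otherwise fall back to the local-mode paths.
  def resolvePaths(args: Array[String]): (String, String) = {
    val input  = if (args.length > 0) args(0) else "file:///test/kmeans_data.txt"
    val output = if (args.length > 1) args(1) else "file:///test/result"
    (input, output)
  }

  def main(args: Array[String]): Unit = {
    val (inputPath, outputPath) = resolvePaths(args)
    println(s"input=$inputPath output=$outputPath")
  }
}
```

spark-submit passes everything after the jar path to `main` as `args`, so the two HDFS paths could be appended to the submit command after `wordcount.jar`.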

# spark-submit --class WordCount --master yarn-client file:///export/spark_jar/wordcount.jar

Warning: Master yarn-client is deprecated since 2.0. Please use master "yarn" with specified deploy mode instead.

19/04/19 18:13:32 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable

19/04/19 18:13:35 WARN Client: Neither spark.yarn.jars nor spark.yarn.archive is set, falling back to uploading libraries under SPARK_HOME.

yarn


Result:

# hadoop fs -ls /test/result

Found 3 items

-rw-r--r-- 1 root supergroup 0 2019-04-19 18:14 /test/result/_SUCCESS

-rw-r--r-- 1 root supergroup 24 2019-04-19 18:14 /test/result/part-00000

-rw-r--r-- 1 root supergroup 24 2019-04-19 18:14 /test/result/part-00001


Error encountered in YARN mode:

ERROR YarnClientSchedulerBackend: YARN application has exited unexpectedly with state FAILED! Check the YARN application

Diagnosis: the failure is on the YARN side, and most YARN failures come down to insufficient memory.

Fix:

Edit yarn-site.xml and add:

<property>
  <name>yarn.nodemanager.pmem-check-enabled</name>
  <value>false</value>
</property>
<property>
  <name>yarn.nodemanager.vmem-check-enabled</name>
  <value>false</value>
</property>

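yarn.nodemanager.pmem-check-enabled

Whether to start a thread that checks how much physical memory each task is using and kills any task that exceeds its allocation. Defaults to true.

yarn.nodemanager.vmem-check-enabled

Whether to start a thread that checks how much virtual memory each task is using and kills any task that exceeds its allocation. Defaults to true.

Setting both to false only stops the NodeManager from killing such tasks; it does not give the job more memory.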
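As a rough illustration of what the two checks compare, here is a simplified model, not the NodeManager's actual code. It assumes the virtual memory limit is the container's physical allocation multiplied by `yarn.nodemanager.vmem-pmem-ratio` (default 2.1); the object name, function names and sample numbers are made up for this sketch.

```scala
object MemCheckSketch {
  // Default value of yarn.nodemanager.vmem-pmem-ratio.
  val DefaultVmemPmemRatio: Double = 2.1

  // Assumed model: virtual memory limit = physical allocation * ratio.
  def vmemLimitMb(containerMb: Long, ratio: Double = DefaultVmemPmemRatio): Double =
    containerMb * ratio

  // True if either check, when enabled, would kill the container.
  def wouldBeKilled(usedPmemMb: Long, usedVmemMb: Long, containerMb: Long,
                    pmemCheck: Boolean = true, vmemCheck: Boolean = true,
                    ratio: Double = DefaultVmemPmemRatio): Boolean = {
    val overPmem = pmemCheck && usedPmemMb > containerMb
    val overVmem = vmemCheck && usedVmemMb > vmemLimitMb(containerMb, ratio)
    overPmem || overVmem
  }

  def main(args: Array[String]): Unit = {
    // Hypothetical container: 1024 MB allocated, 900 MB physical and 2500 MB virtual in use.
    println(wouldBeKilled(usedPmemMb = 900, usedVmemMb = 2500, containerMb = 1024))
    // Same container with both checks disabled, as in the fix above.
    println(wouldBeKilled(900, 2500, 1024, pmemCheck = false, vmemCheck = false))
  }
}
```

In this model a container can stay within its physical allocation and still be killed by the virtual memory check, which is one reason a small job can fail on a memory-constrained test cluster.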