Hadoop official documentation:

https://hadoop.apache.org/docs/

Installing a Hadoop cluster

Configure DNS resolution or the hosts file:

cat > /etc/hosts <<EOF
127.0.0.1   localhost localhost.localdomain localhost4 localhost4.localdomain4
::1         localhost localhost.localdomain localhost6 localhost6.localdomain6
10.3.149.20 hadoop-master
10.3.149.21 hadoop-node1
10.3.149.22 hadoop-node2
EOF

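Before continuing, it is worth a quick check on each machine that the names actually resolve; these are plain glibc lookups, nothing Hadoop-specific:

# confirm the /etc/hosts entries resolve
getent hosts hadoop-master hadoop-node1 hadoop-node2
# and that a node is reachable
ping -c 1 hadoop-node1
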
Configure passwordless SSH for the root user:

ssh-keygen
ssh-copy-id -i .ssh/id_rsa.pub root@hadoop-master
ssh-copy-id -i .ssh/id_rsa.pub root@hadoop-node1
ssh-copy-id -i .ssh/id_rsa.pub root@hadoop-node2
ssh root@hadoop-master 'date'
ssh root@hadoop-node1 'date'
ssh root@hadoop-node2 'date'

Configure passwordless SSH for the hadoop user:

useradd hadoop
echo '123456' | passwd --stdin hadoop
su hadoop
ssh-keygen
ssh-copy-id -i .ssh/id_rsa.pub hadoop@hadoop-master
ssh-copy-id -i .ssh/id_rsa.pub hadoop@hadoop-node1
ssh-copy-id -i .ssh/id_rsa.pub hadoop@hadoop-node2
ssh hadoop@hadoop-master 'date'
ssh hadoop@hadoop-node1 'date'
ssh hadoop@hadoop-node2 'date'
exit

Install Java:

tar -xf jdk-8u231-linux-x64.tar.gz -C /usr/local/

Create a symlink:

cd /usr/local/
ln -sv jdk1.8.0_231/ jdk

Add environment variables (the \$ keeps the literal $ inside the generated file):

cat > /etc/profile.d/java.sh <<EOF
export JAVA_HOME=/usr/local/jdk
export JRE_HOME=\$JAVA_HOME/jre
export CLASSPATH=.:\$JAVA_HOME/lib/dt.jar:\$JAVA_HOME/lib/tools.jar:\$JRE_HOME/lib
export PATH=\$PATH:\$JAVA_HOME/bin:\$JRE_HOME/bin
EOF
. /etc/profile.d/java.sh

Verify the installation:

java -version
javac -version

Install Hadoop:

Hadoop download mirrors:

https://mirrors.tuna.tsinghua.edu.cn/apache/hadoop/common/
http://archive.apache.org/dist/hadoop/common/

For Hadoop 2.7:

http://archive.apache.org/dist/hadoop/common/hadoop-2.7.1/hadoop-2.7.1.tar.gz

Download the package:

wget https://mirrors.tuna.tsinghua.edu.cn/apache/hadoop/common/hadoop-2.10.0/hadoop-2.10.0.tar.gz

Extract and symlink:

tar -xf hadoop-2.10.0.tar.gz -C /usr/local/
cd /usr/local/
ln -sv hadoop-2.10.0/ hadoop

Configure environment variables:

cat > /etc/profile.d/hadoop.sh <<EOF
export HADOOP_HOME=/usr/local/hadoop
export PATH=\$PATH:\$HADOOP_HOME/bin:\$HADOOP_HOME/sbin
EOF

Apply the environment variables:

. /etc/profile.d/hadoop.sh

Create data directories:

# on the master
mkdir -pv /data/hadoop/hdfs/{nn,snn}
# on the worker nodes
mkdir -pv /data/hadoop/hdfs/dn

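Since root SSH trust to the nodes is already in place, the worker directories can also be created from the master in one go. This is just a convenience sketch; running mkdir on each node by hand works equally well:

for n in hadoop-node1 hadoop-node2; do
    ssh root@$n 'mkdir -pv /data/hadoop/hdfs/dn'
done
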
Configuration on the master node:

Enter the configuration directory:

cd /usr/local/hadoop/etc/hadoop

core-site.xml

<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://hadoop-master:8020</value>
    <final>true</final>
  </property>
</configuration>

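fs.defaultFS is the NameNode RPC endpoint that every client and daemon will use. Once the environment variables above are loaded, the value Hadoop actually sees can be double-checked from the shell (a quick sanity check, not a required step):

hdfs getconf -confKey fs.defaultFS
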
yarn-site.xml (note: the property name is yarn.nodemanager.aux-services.mapreduce_shuffle.class, with a hyphen; many copies of this config circulate with the hyphen dropped):

<configuration>
  <property>
    <name>yarn.resourcemanager.address</name>
    <value>hadoop-master:8032</value>
  </property>
  <property>
    <name>yarn.resourcemanager.scheduler.address</name>
    <value>hadoop-master:8030</value>
  </property>
  <property>
    <name>yarn.resourcemanager.resource-tracker.address</name>
    <value>hadoop-master:8031</value>
  </property>
  <property>
    <name>yarn.resourcemanager.admin.address</name>
    <value>hadoop-master:8033</value>
  </property>
  <property>
    <name>yarn.resourcemanager.webapp.address</name>
    <value>hadoop-master:8088</value>
  </property>
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>
  <property>
    <name>yarn.nodemanager.aux-services.mapreduce_shuffle.class</name>
    <value>org.apache.hadoop.mapred.ShuffleHandler</value>
  </property>
  <property>
    <name>yarn.resourcemanager.scheduler.class</name>
    <value>org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler</value>
  </property>
</configuration>

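Ports 8030-8033 are the ResourceManager's RPC endpoints and 8088 is its web UI. Once the cluster is up, the same information is exposed through the ResourceManager REST API, which is handy for quick checks from any machine that can reach the master (assuming curl is installed):

# query the ResourceManager cluster-info endpoint
curl -s http://hadoop-master:8088/ws/v1/cluster/info
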
hdfs-site.xml

<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
  <property>
    <name>dfs.namenode.name.dir</name>
    <value>file:///data/hadoop/hdfs/nn</value>
  </property>
  <property>
    <name>dfs.datanode.data.dir</name>
    <value>file:///data/hadoop/hdfs/dn</value>
  </property>
  <property>
    <name>fs.checkpoint.dir</name>
    <value>file:///data/hadoop/hdfs/snn</value>
  </property>
  <property>
    <name>fs.checkpoint.edits.dir</name>
    <value>file:///data/hadoop/hdfs/snn</value>
  </property>
</configuration>

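dfs.replication=1 keeps a single copy of each block, which is fine for a test cluster but offers no redundancy. The replication factor of files that already exist can be raised later without touching the config; for example, for the /test/fstab file uploaded later in this walkthrough (-w waits until the extra replica is in place):

hdfs dfs -setrep -w 2 /test/fstab
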
mapred-site.xml

<configuration>
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
</configuration>

Create the master file:

cat > master <<EOF
hadoop-master
EOF

Create the slaves file:

cat > slaves <<EOF
hadoop-node1
hadoop-node2
EOF

Annotated reference for common configuration options:

http://blog.51yip.com/hadoop/2020.html

On the worker nodes:

Simply copy the configuration from the master node to the worker nodes:

scp ./* root@hadoop-node1:/usr/local/hadoop/etc/hadoop/
scp ./* root@hadoop-node2:/usr/local/hadoop/etc/hadoop/

Delete the slaves file (all other configuration is the same as on the master):

rm -rf /usr/local/hadoop/etc/hadoop/slaves

Create the log directory:

mkdir /usr/local/hadoop/logs
chmod g+w /usr/local/hadoop/logs/

Change the owner and group:

chown -R hadoop:hadoop /data/hadoop/
cd /usr/local/
chown -R hadoop:hadoop hadoop hadoop/

Starting and stopping the cluster

Format HDFS (once formatting succeeds, the cluster can be started):

su hadoop
[hadoop@hadoop-master ~]$ hadoop namenode -format

Start HDFS first. The output below shows which daemons start on which nodes:

[hadoop@hadoop-master ~]$ start-dfs.sh
Starting namenodes on [hadoop-master]
hadoop-master: starting namenode, logging to /usr/local/hadoop-2.10.0/logs/hadoop-hadoop-namenode-hadoop-master.out
hadoop-node2: starting datanode, logging to /usr/local/hadoop-2.10.0/logs/hadoop-hadoop-datanode-hadoop-node2.out
hadoop-node1: starting datanode, logging to /usr/local/hadoop-2.10.0/logs/hadoop-hadoop-datanode-hadoop-node1.out
Starting secondary namenodes [0.0.0.0]
0.0.0.0: starting secondarynamenode, logging to /usr/local/hadoop-2.10.0/logs/hadoop-hadoop-secondarynamenode-hadoop-master.out

Check the Java processes running on the local node (this command can be run on any node):

~]$ jps
1174 Jps
32632 ResourceManager
32012 NameNode
32220 SecondaryNameNode

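The same check can be run against a worker over SSH, where you would expect to see DataNode (and NodeManager once YARN is up). This assumes the JDK bin directory is on the PATH for non-interactive shells:

ssh hadoop@hadoop-node1 'jps'
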
Then start YARN. The output again shows which daemon starts on each node:

[hadoop@hadoop-master ~]$ start-yarn.sh
starting yarn daemons
starting resourcemanager, logging to /usr/local/hadoop-2.10.0/logs/yarn-hadoop-resourcemanager-hadoop-master.out
hadoop-node2: starting nodemanager, logging to /usr/local/hadoop-2.10.0/logs/yarn-hadoop-nodemanager-hadoop-node2.out
hadoop-node1: starting nodemanager, logging to /usr/local/hadoop-2.10.0/logs/yarn-hadoop-nodemanager-hadoop-node1.out

Or start everything at once:

[hadoop@hadoop-master ~]$ start-all.sh

Check the running state of the Hadoop cluster:

hadoop dfsadmin -report

HDFS overview web UI:

http://10.3.149.20:50070/

Cluster information web UI:

http://10.3.149.20:8088/cluster

Stop the cluster:

stop-dfs.sh
stop-yarn.sh

Or:

stop-all.sh

Using the HDFS filesystem

List a directory:

~]$ hdfs dfs -ls /

Create a directory:

~]$ hdfs dfs -mkdir /test

Upload a file:

~]$ hdfs dfs -put /etc/fstab /test/fstab

View where the file is physically stored: the file's block can be found under the data directory on one of the datanodes. The default block size is 128 MB; a file larger than that is split into multiple blocks, but a file smaller than 128 MB does not actually occupy 128 MB on disk.

]$ cat /data/hadoop/hdfs/dn/current/BP-1469813358-10.3.149.20-1595493741225/current/finalized/subdir0/subdir0/blk_1073741825

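Rather than hunting through datanode directories by hand, fsck can report exactly which blocks a file consists of and where the replicas live (using the /test/fstab path from above):

hdfs fsck /test/fstab -files -blocks -locations
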
Recursive listing:

~]$ hdfs dfs -ls -R /

View a file:

~]$ hdfs dfs -cat /test/fstab

More command help:

https://hadoop.apache.org/docs/r2.10.0/hadoop-project-dist/hadoop-common/FileSystemShell.html

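The same help is also available offline from the shell itself, which is often quicker than the docs:

# full description of a subcommand
hdfs dfs -help put
# short usage line only
hdfs dfs -usage mkdir
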
Word-count job example:

The /usr/local/hadoop/share/hadoop/mapreduce directory contains many example jobs that can be used for testing.

First upload a file for testing (skip this if you already uploaded it in the previous section):

hdfs dfs -mkdir -p /test
hdfs dfs -put /etc/fstab /test/fstab

View help: running the jar with no arguments prints usage information listing the available examples.

yarn jar hadoop-mapreduce-examples-2.10.0.jar

Test: here we pick the word-count example.

cd /usr/local/hadoop/share/hadoop/mapreduce
]$ yarn jar hadoop-mapreduce-examples-2.10.0.jar wordcount /test/fstab /test/count

Running jobs can be watched on this page:

http://10.3.149.20:8088/cluster/apps

View the result of the computation:

]$ hdfs dfs -cat /test/count/part-r-00000

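The output directory may contain several part-* files when a job runs more than one reducer; getmerge concatenates them into a single local file (./wordcount.txt is an arbitrary local destination):

hdfs dfs -getmerge /test/count ./wordcount.txt
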
Common YARN commands:

List running applications:

~]$ yarn application -list

List applications in all states, including finished ones:

~]$ yarn application -list -appStates ALL

Check an application's status:

~]$ yarn application -status application_1595496103452_0001

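Two more commands that come up constantly in day-to-day use, shown with the application id from above: fetching the aggregated logs of a finished job, and killing a running one. (yarn logs requires log aggregation to be enabled, which it is not by default.)

~]$ yarn logs -applicationId application_1595496103452_0001
~]$ yarn application -kill application_1595496103452_0001
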