Hadoop Basics Summary

I. What is Hadoop?

  Hadoop is an open-source platform for distributed storage and distributed computing.

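II. Hadoop has two core components:

  1. HDFS: a distributed file system for storing massive amounts of data

    a. Basic concepts

      - Block

        HDFS splits files into blocks for storage. The default block size is 64 MB in Hadoop 1.x (128 MB since Hadoop 2.x).

        A block is the logical unit of file storage and processing.

      - NameNode

        The management node. It stores file metadata, including:

          (1) the mapping from files to data blocks

          (2) the mapping from data blocks to DataNodes

      - DataNode

        The worker node of HDFS. It stores the data blocks.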
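
To make the three concepts above concrete, here is a minimal sketch in Python. It is not how HDFS is implemented: `ToyNameNode`, `split_into_blocks`, the DataNode names, and the file path are all hypothetical, and every block is simply assigned to the same DataNodes. It only shows how a file size turns into blocks and how the two mapping tables answer the question "where is this file stored?".

```python
BLOCK_SIZE = 64 * 1024 * 1024  # 64 MB, the block size quoted in the text


def split_into_blocks(file_size, block_size=BLOCK_SIZE):
    """Return the sizes of the blocks needed to store file_size bytes."""
    full, rest = divmod(file_size, block_size)
    return [block_size] * full + ([rest] if rest else [])


class ToyNameNode:
    """Hypothetical in-memory model of the two NameNode mapping tables."""

    def __init__(self):
        self.file_to_blocks = {}      # file path -> list of block IDs
        self.block_to_datanodes = {}  # block ID -> list of DataNode names
        self._next_block_id = 0

    def add_file(self, path, file_size, datanodes):
        block_ids = []
        for _ in split_into_blocks(file_size):
            block_id = self._next_block_id
            self._next_block_id += 1
            # Toy placement: every block goes to the same given DataNodes.
            self.block_to_datanodes[block_id] = list(datanodes)
            block_ids.append(block_id)
        self.file_to_blocks[path] = block_ids
        return block_ids

    def locate(self, path):
        """Return (block_id, datanodes) pairs a client needs to read path."""
        return [(b, self.block_to_datanodes[b]) for b in self.file_to_blocks[path]]


namenode = ToyNameNode()
namenode.add_file('/test/big.log', 200 * 1024 * 1024, ['dn1', 'dn2', 'dn3'])
print(len(split_into_blocks(200 * 1024 * 1024)))  # 4 (three full blocks plus 8 MB)
print(namenode.locate('/test/big.log'))
```

Note that the last block only occupies the bytes it actually holds; a 200 MB file ends with an 8 MB block rather than a padded 64 MB one.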

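    b. Data management strategy

      1) Block replicas

        Each data block has three replicas, placed on three nodes across two racks, so that a node or rack failure does not lose data.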
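
The placement rule above (three replicas, three nodes, two racks) can be sketched as follows. This is a simplified illustration, not the HDFS placement policy itself: the real default policy also prefers the writer's own node for the first replica, and the function name `place_replicas` and the topology dictionary are assumptions made for this example.

```python
import random


def place_replicas(topology, rng=random):
    """Pick three distinct DataNodes that sit on exactly two racks.

    topology: dict mapping a rack name to a list of DataNode names.
    """
    racks = {rack: nodes for rack, nodes in topology.items() if nodes}
    double_candidates = [rack for rack, nodes in racks.items() if len(nodes) >= 2]
    if len(racks) < 2 or not double_candidates:
        raise ValueError('need two non-empty racks, one with at least two nodes')
    # One rack holds two replicas on different nodes...
    double_rack = rng.choice(double_candidates)
    pair = rng.sample(racks[double_rack], 2)
    # ...and a different rack holds the remaining replica.
    single_rack = rng.choice([rack for rack in racks if rack != double_rack])
    single = rng.choice(racks[single_rack])
    return [single] + pair


topology = {'rack1': ['dn1', 'dn2', 'dn3'], 'rack2': ['dn4', 'dn5']}
print(place_replicas(topology))
```

With this layout, losing any single node leaves two replicas, and losing a whole rack still leaves at least one.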

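      2) Heartbeat detection

        Each DataNode periodically sends a heartbeat message to the NameNode, which uses it to tell whether the DataNode is still alive.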
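
A minimal sketch of the heartbeat idea, seen from the NameNode side: record when each DataNode last reported, and treat a node as dead once it has been silent for longer than a timeout. The class name `HeartbeatMonitor` and the 30-second timeout are assumptions chosen for illustration; they are not Hadoop's actual names or defaults.

```python
import time


class HeartbeatMonitor:
    """Toy tracker of DataNode heartbeats (hypothetical, not a Hadoop class)."""

    def __init__(self, timeout_seconds):
        self.timeout_seconds = timeout_seconds
        self.last_seen = {}  # DataNode name -> time of its last heartbeat

    def heartbeat(self, datanode, now=None):
        """Record a heartbeat; `now` can be injected to make examples repeatable."""
        self.last_seen[datanode] = time.monotonic() if now is None else now

    def dead_nodes(self, now=None):
        """Return the nodes whose last heartbeat is older than the timeout."""
        now = time.monotonic() if now is None else now
        return sorted(
            node for node, seen in self.last_seen.items()
            if now - seen > self.timeout_seconds
        )


monitor = HeartbeatMonitor(timeout_seconds=30)
monitor.heartbeat('dn1', now=0)
monitor.heartbeat('dn2', now=0)
monitor.heartbeat('dn1', now=25)   # dn1 keeps reporting, dn2 goes silent
print(monitor.dead_nodes(now=40))  # ['dn2']
```

Once a node is considered dead, the blocks it held are under-replicated, which is what triggers re-replication onto other nodes.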

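      3) Secondary NameNode

        The Secondary NameNode periodically fetches the metadata image file (fsimage) and the edit log from the NameNode and merges them into a checkpoint. It is not a hot standby: if the NameNode fails, the checkpoint helps recover the metadata, but the Secondary NameNode does not take over automatically.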
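
The periodic handling of the metadata image and the edit log can be pictured as a merge: replay the logged changes on top of the last image, then start a fresh log. The sketch below uses plain Python dictionaries and tuples as stand-ins; the record format and the function names are hypothetical and do not reflect the real on-disk formats.

```python
def apply_edit(image, edit):
    """Apply one logged change to the in-memory image (hypothetical format)."""
    op, path = edit[0], edit[1]
    if op == 'create':
        image[path] = edit[2]      # edit[2] is the list of block IDs
    elif op == 'delete':
        image.pop(path, None)
    else:
        raise ValueError('unknown operation: {}'.format(op))


def checkpoint(fsimage, edit_log):
    """Merge the edit log into a copy of the image; return (new_image, empty_log)."""
    new_image = dict(fsimage)
    for edit in edit_log:
        apply_edit(new_image, edit)
    return new_image, []


fsimage = {'/a.txt': [0, 1]}
edit_log = [('create', '/b.txt', [2]), ('delete', '/a.txt')]
fsimage, edit_log = checkpoint(fsimage, edit_log)
print(fsimage)   # {'/b.txt': [2]}
print(edit_log)  # []
```

Keeping the log short this way means that a restarting NameNode has fewer edits to replay.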

      4) HDFS file read flow

      5) HDFS file write flow

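      6) Characteristics of HDFS

        Data redundancy provides tolerance of hardware failures.

        Streaming data access: write once, read many times. Written data cannot be modified in place; to change a file, delete it and write it again.

        Designed for large files. Many small files put heavy pressure on the NameNode, which keeps the metadata of every file and block in memory.

      7) Applicability and limitations

        Well suited to batch reads and writes, with high throughput.

        Not suited to interactive applications; low-latency requirements are hard to meet.

        Well suited to write-once, read-many workloads with sequential access.

        Does not support multiple users writing to the same file concurrently.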

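  2. MapReduce: a parallel processing framework that handles task decomposition and scheduling

    a. How MapReduce works

      Divide and conquer: a large job is split into many small subtasks (map), which run in parallel on multiple nodes, and their results are then merged (reduce).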
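
The divide-and-conquer idea can be tried on a single machine. In the sketch below, threads stand in for cluster nodes, and the job (a sum of squares) and all function names are illustrative assumptions rather than anything from the Hadoop API.

```python
from concurrent.futures import ThreadPoolExecutor
from functools import reduce


def split(data, n_chunks):
    """Split data into at most n_chunks contiguous slices of near-equal size."""
    size = max(1, -(-len(data) // n_chunks))  # ceiling division, at least 1
    return [data[i:i + size] for i in range(0, len(data), size)]


def map_task(chunk):
    """Subtask: compute a partial result for one slice of the input."""
    return sum(x * x for x in chunk)


def reduce_task(left, right):
    """Merge two partial results."""
    return left + right


def run_job(data, n_workers=4):
    chunks = split(data, n_workers)
    # Each chunk is processed independently, so the map tasks can run in parallel.
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        partials = list(pool.map(map_task, chunks))
    return reduce(reduce_task, partials, 0)


print(run_job(list(range(1, 101))))  # 338350
```

The important property is that `map_task` never needs data from another chunk, which is what lets the work be spread across machines.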

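    b. How a MapReduce job runs

      1) Basic concepts (Hadoop 1.x terminology; in Hadoop 2.x these roles are handled by YARN)

        - Job & Task

          A Job is broken down into Tasks (MapTask, ReduceTask).

        - JobTracker

          Schedules jobs

          Assigns tasks and monitors their progress

          Monitors the status of the TaskTrackers

        - TaskTracker

          Executes tasks

          Reports task status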

      2) Job execution process

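      3) MapReduce fault tolerance

        Re-execution: a task that fails is run again, up to a configured number of attempts.

        Speculative execution: when a task runs much more slowly than its peers, a duplicate is started on another node, and the result of whichever copy finishes first is used.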
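
Both mechanisms can be illustrated with a short sketch. The function names, the attempt limit of 4, and the timing numbers are assumptions made for this example; a real scheduler also decides where to rerun a task and when a task counts as slow.

```python
def run_with_retries(task, max_attempts=4):
    """Re-execution: call task(attempt) until it succeeds or attempts run out."""
    last_error = None
    for attempt in range(max_attempts):
        try:
            return task(attempt)
        except Exception as error:  # a real scheduler would inspect the failure
            last_error = error
    raise RuntimeError('task failed after {} attempts'.format(max_attempts)) from last_error


def speculative_winner(finish_times):
    """Speculative execution: keep the attempt that finishes first.

    finish_times: dict mapping an attempt name to its finish time in seconds.
    """
    return min(finish_times, key=finish_times.get)


def flaky_task(attempt):
    """Simulated task that fails on its first two attempts."""
    if attempt < 2:
        raise IOError('simulated node failure')
    return 'ok on attempt {}'.format(attempt)


print(run_with_retries(flaky_task))                                  # ok on attempt 2
print(speculative_winner({'original': 95.0, 'speculative': 40.0}))  # speculative
```

Both mechanisms rely on tasks being safe to run more than once, which holds when a task only reads its input split and writes its own output.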

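III. What can it be used for?

  Building large data warehouses, and storing, processing, analyzing, and aggregating data at petabyte scale.

  Examples: search engines, business intelligence, log analysis, data mining.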

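IV. Advantages of Hadoop

  1. High scalability

    Adding hardware increases both performance and capacity.

  2. Low cost

    It runs on clusters of ordinary PCs, and relies on software-level fault tolerance to keep the system reliable.

  3. A mature ecosystem

    Examples: Hive, HBase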

V. HDFS operations

  1. Shell commands

    Common HDFS shell commands:

      Linux-like commands: ls, cat, mkdir, rm, chmod, chown, etc.

      Moving files in and out of HDFS: copyFromLocal, copyToLocal, get (download), put (upload)
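
As a sketch of how these commands look in practice, the Python snippet below builds a few `hdfs dfs` command lines and prints them without executing anything. The helper `hdfs_dfs` and the local file name `mk_copy.txt` are hypothetical; `/test/mk.txt` is the path used later in this post.

```python
import shlex


def hdfs_dfs(*args):
    """Build the argv for an `hdfs dfs` invocation (nothing is executed here)."""
    return ['hdfs', 'dfs'] + list(args)


commands = [
    hdfs_dfs('-mkdir', '-p', '/test'),                # create a directory
    hdfs_dfs('-put', 'mk.txt', '/test/mk.txt'),       # upload a local file
    hdfs_dfs('-ls', '/test'),                         # list the directory
    hdfs_dfs('-cat', '/test/mk.txt'),                 # print the file
    hdfs_dfs('-get', '/test/mk.txt', 'mk_copy.txt'),  # download it again
]

for argv in commands:
    print(' '.join(shlex.quote(part) for part in argv))
```

On a machine where `hdfs` is on the PATH, each list could be passed to `subprocess.run(argv, check=True)` to actually run the command.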

VI. The Hadoop ecosystem

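VII. MapReduce in practice

  This example reads a document and counts how many times each word appears in it.

  First, create hdfs_map.py, which reads the document and emits every word it contains: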

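# hdfs_map.py
import sys


def read_input(file):
    """Yield the list of whitespace-separated words on each input line."""
    for line in file:
        yield line.split()


def main():
    data = read_input(sys.stdin)

    for words in data:
        for word in words:
            # Emit "<word><TAB>1"; Hadoop Streaming treats the text
            # before the first tab as the key.
            print('{}\t1'.format(word))


if __name__ == '__main__':
    main()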

  Next, create hdfs_reduce.py, which sums the counts for each word:

# hdfs_reduce.py

import sys
from operator import itemgetter
from itertools import groupby


def read_mapper_output(file, separator='\t'):
    """Yield [word, count] pairs parsed from the mapper's tab-separated lines."""
    for line in file:
        yield line.rstrip().split(separator, 1)


def main():
    data = read_mapper_output(sys.stdin)

    # Hadoop Streaming sorts the mapper output by key before the reducer runs,
    # so identical words are adjacent and groupby can sum each run.
    for current_word, group in groupby(data, itemgetter(0)):
        total_count = sum(int(count) for _, count in group)

        print('{} {}'.format(current_word, total_count))


if __name__ == '__main__':
    main()
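
Before submitting the job, it can help to check the logic locally. The sketch below reimplements the two scripts as plain functions over lists of lines (the names `map_lines`, `reduce_lines`, and `run_locally` are assumptions for this example) and uses `sorted()` in place of the sort that Hadoop Streaming performs between the map and reduce phases.

```python
from itertools import groupby


def map_lines(lines):
    """Same logic as hdfs_map.py: emit "<word><TAB>1" for every word."""
    for line in lines:
        for word in line.split():
            yield '{}\t1'.format(word)


def reduce_lines(sorted_lines):
    """Same logic as hdfs_reduce.py: sum the counts of each run of equal keys."""
    pairs = (line.rstrip().split('\t', 1) for line in sorted_lines)
    for word, group in groupby(pairs, key=lambda pair: pair[0]):
        yield '{} {}'.format(word, sum(int(count) for _, count in group))


def run_locally(lines):
    # sorted() plays the role of the shuffle/sort phase between map and reduce.
    return list(reduce_lines(sorted(map_lines(lines))))


print(run_locally(['hello hadoop', 'hello hdfs hello']))
# ['hadoop 1', 'hdfs 1', 'hello 3']
```

The same check can be done with the real scripts in a shell pipeline such as `cat mk.txt | python3 hdfs_map.py | sort | python3 hdfs_reduce.py`. Without the sort step, `groupby` would see the same word in several separate runs and report it more than once.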

  Prepare a document named mk.txt with some content in advance, then store it in HDFS.

  Run the MapReduce job from the command line:

hadoop jar /opt/hadoop-2.9.1/share/hadoop/tools/lib/hadoop-streaming-2.9.1.jar -files '/home/zzf/Git/Data_analysis/Hadoop/hdfs_map.py,/home/zzf/Git/Data_analysis/Hadoop/hdfs_reduce.py' -input /test/mk.txt -output /output/wordcount -mapper 'python3 hdfs_map.py' -reducer 'python3 hdfs_reduce.py'

  The run looks like this:

➜  Documents hadoop jar /opt/hadoop-2.9.1/share/hadoop/tools/lib/hadoop-streaming-2.9.1.jar -files '/home/zzf/Git/Data_analysis/Hadoop/hdfs_map.py,/home/zzf/Git/Data_analysis/Hadoop/hdfs_reduce.py' -input /test/mk.txt -output /output/wordcount -mapper 'python3 hdfs_map.py' -reducer 'python3 hdfs_reduce.py'
# Output
18/06/26 16:22:45 INFO Configuration.deprecation: session.id is deprecated. Instead, use dfs.metrics.session-id
18/06/26 16:22:45 INFO jvm.JvmMetrics: Initializing JVM Metrics with processName=JobTracker, sessionId=
18/06/26 16:22:45 INFO jvm.JvmMetrics: Cannot initialize JVM Metrics with processName=JobTracker, sessionId= - already initialized
18/06/26 16:22:46 INFO mapred.FileInputFormat: Total input files to process : 1
18/06/26 16:22:46 INFO mapreduce.JobSubmitter: number of splits:1
18/06/26 16:22:46 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_local49685846_0001
18/06/26 16:22:46 INFO mapred.LocalDistributedCacheManager: Creating symlink: /home/zzf/hadoop_tmp/mapred/local/1530001366609/hdfs_map.py <- /home/zzf/Documents/hdfs_map.py
18/06/26 16:22:46 INFO mapred.LocalDistributedCacheManager: Localized file:/home/zzf/Git/Data_analysis/Hadoop/hdfs_map.py as file:/home/zzf/hadoop_tmp/mapred/local/1530001366609/hdfs_map.py
18/06/26 16:22:47 INFO mapred.LocalDistributedCacheManager: Creating symlink: /home/zzf/hadoop_tmp/mapred/local/1530001366610/hdfs_reduce.py <- /home/zzf/Documents/hdfs_reduce.py
18/06/26 16:22:47 INFO mapred.LocalDistributedCacheManager: Localized file:/home/zzf/Git/Data_analysis/Hadoop/hdfs_reduce.py as file:/home/zzf/hadoop_tmp/mapred/local/1530001366610/hdfs_reduce.py
18/06/26 16:22:47 INFO mapreduce.Job: The url to track the job: http://localhost:8080/
18/06/26 16:22:47 INFO mapred.LocalJobRunner: OutputCommitter set in config null
18/06/26 16:22:47 INFO mapreduce.Job: Running job: job_local49685846_0001
18/06/26 16:22:47 INFO mapred.LocalJobRunner: OutputCommitter is org.apache.hadoop.mapred.FileOutputCommitter
18/06/26 16:22:47 INFO output.FileOutputCommitter: File Output Committer Algorithm version is 1
18/06/26 16:22:47 INFO output.FileOutputCommitter: FileOutputCommitter skip cleanup _temporary folders under output directory:false, ignore cleanup failures: false
18/06/26 16:22:47 INFO mapred.LocalJobRunner: Waiting for map tasks
18/06/26 16:22:47 INFO mapred.LocalJobRunner: Starting task: attempt_local49685846_0001_m_000000_0
18/06/26 16:22:47 INFO output.FileOutputCommitter: File Output Committer Algorithm version is 1
18/06/26 16:22:47 INFO output.FileOutputCommitter: FileOutputCommitter skip cleanup _temporary folders under output directory:false, ignore cleanup failures: false
18/06/26 16:22:47 INFO mapred.Task:  Using ResourceCalculatorProcessTree : [ ]
18/06/26 16:22:47 INFO mapred.MapTask: Processing split: hdfs://localhost:9000/test/mk.txt:0+2267
18/06/26 16:22:47 INFO mapred.MapTask: numReduceTasks: 1
18/06/26 16:22:47 INFO mapred.MapTask: (EQUATOR) 0 kvi 26214396(104857584)
18/06/26 16:22:47 INFO mapred.MapTask: mapreduce.task.io.sort.mb: 100
18/06/26 16:22:47 INFO mapred.MapTask: soft limit at 83886080
18/06/26 16:22:47 INFO mapred.MapTask: bufstart = 0; bufvoid = 104857600
18/06/26 16:22:47 INFO mapred.MapTask: kvstart = 26214396; length = 6553600
18/06/26 16:22:47 INFO mapred.MapTask: Map output collector class = org.apache.hadoop.mapred.MapTask$MapOutputBuffer
18/06/26 16:22:47 INFO streaming.PipeMapRed: PipeMapRed exec [/usr/bin/python3, hdfs_map.py]
18/06/26 16:22:47 INFO Configuration.deprecation: mapred.work.output.dir is deprecated. Instead, use mapreduce.task.output.dir
18/06/26 16:22:47 INFO Configuration.deprecation: map.input.start is deprecated. Instead, use mapreduce.map.input.start
18/06/26 16:22:47 INFO Configuration.deprecation: mapred.task.is.map is deprecated. Instead, use mapreduce.task.ismap
18/06/26 16:22:47 INFO Configuration.deprecation: mapred.task.id is deprecated. Instead, use mapreduce.task.attempt.id
18/06/26 16:22:47 INFO Configuration.deprecation: mapred.tip.id is deprecated. Instead, use mapreduce.task.id
18/06/26 16:22:47 INFO Configuration.deprecation: mapred.local.dir is deprecated. Instead, use mapreduce.cluster.local.dir
18/06/26 16:22:47 INFO Configuration.deprecation: map.input.file is deprecated. Instead, use mapreduce.map.input.file
18/06/26 16:22:47 INFO Configuration.deprecation: mapred.skip.on is deprecated. Instead, use mapreduce.job.skiprecords
18/06/26 16:22:47 INFO Configuration.deprecation: map.input.length is deprecated. Instead, use mapreduce.map.input.length
18/06/26 16:22:47 INFO Configuration.deprecation: mapred.job.id is deprecated. Instead, use mapreduce.job.id
18/06/26 16:22:47 INFO Configuration.deprecation: user.name is deprecated. Instead, use mapreduce.job.user.name
18/06/26 16:22:47 INFO Configuration.deprecation: mapred.task.partition is deprecated. Instead, use mapreduce.task.partition
18/06/26 16:22:47 INFO streaming.PipeMapRed: R/W/S=1/0/0 in:NA [rec/s] out:NA [rec/s]
18/06/26 16:22:47 INFO streaming.PipeMapRed: R/W/S=10/0/0 in:NA [rec/s] out:NA [rec/s]
18/06/26 16:22:47 INFO streaming.PipeMapRed: Records R/W=34/1
18/06/26 16:22:47 INFO streaming.PipeMapRed: MRErrorThread done
18/06/26 16:22:47 INFO streaming.PipeMapRed: mapRedFinished
18/06/26 16:22:47 INFO mapred.LocalJobRunner: 
18/06/26 16:22:47 INFO mapred.MapTask: Starting flush of map output
18/06/26 16:22:47 INFO mapred.MapTask: Spilling map output
18/06/26 16:22:47 INFO mapred.MapTask: bufstart = 0; bufend = 3013; bufvoid = 104857600
18/06/26 16:22:47 INFO mapred.MapTask: kvstart = 26214396(104857584); kvend = 26212876(104851504); length = 1521/6553600
18/06/26 16:22:47 INFO mapred.MapTask: Finished spill 0
18/06/26 16:22:47 INFO mapred.Task: Task:attempt_local49685846_0001_m_000000_0 is done. And is in the process of committing
18/06/26 16:22:47 INFO mapred.LocalJobRunner: Records R/W=34/1
18/06/26 16:22:47 INFO mapred.Task: Task 'attempt_local49685846_0001_m_000000_0' done.
18/06/26 16:22:47 INFO mapred.LocalJobRunner: Finishing task: attempt_local49685846_0001_m_000000_0
18/06/26 16:22:47 INFO mapred.LocalJobRunner: map task executor complete.
18/06/26 16:22:47 INFO mapred.LocalJobRunner: Waiting for reduce tasks
18/06/26 16:22:47 INFO mapred.LocalJobRunner: Starting task: attempt_local49685846_0001_r_000000_0
18/06/26 16:22:47 INFO output.FileOutputCommitter: File Output Committer Algorithm version is 1
18/06/26 16:22:47 INFO output.FileOutputCommitter: FileOutputCommitter skip cleanup _temporary folders under output directory:false, ignore cleanup failures: false
18/06/26 16:22:47 INFO mapred.Task:  Using ResourceCalculatorProcessTree : [ ]
18/06/26 16:22:47 INFO mapred.ReduceTask: Using ShuffleConsumerPlugin: org.apache.hadoop.mapreduce.task.reduce.Shuffle@257adccd
18/06/26 16:22:47 INFO reduce.MergeManagerImpl: MergerManager: memoryLimit=334338464, maxSingleShuffleLimit=83584616, mergeThreshold=220663392, ioSortFactor=10, memToMemMergeOutputsThreshold=10
18/06/26 16:22:47 INFO reduce.EventFetcher: attempt_local49685846_0001_r_000000_0 Thread started: EventFetcher for fetching Map Completion Events
18/06/26 16:22:47 INFO reduce.LocalFetcher: localfetcher#1 about to shuffle output of map attempt_local49685846_0001_m_000000_0 decomp: 3777 len: 3781 to MEMORY
18/06/26 16:22:47 INFO reduce.InMemoryMapOutput: Read 3777 bytes from map-output for attempt_local49685846_0001_m_000000_0
18/06/26 16:22:47 INFO reduce.MergeManagerImpl: closeInMemoryFile -> map-output of size: 3777, inMemoryMapOutputs.size() -> 1, commitMemory -> 0, usedMemory ->3777
18/06/26 16:22:47 INFO reduce.EventFetcher: EventFetcher is interrupted.. Returning
18/06/26 16:22:47 INFO mapred.LocalJobRunner: 1 / 1 copied.
18/06/26 16:22:47 INFO reduce.MergeManagerImpl: finalMerge called with 1 in-memory map-outputs and 0 on-disk map-outputs
18/06/26 16:22:47 INFO mapred.Merger: Merging 1 sorted segments
18/06/26 16:22:47 INFO mapred.Merger: Down to the last merge-pass, with 1 segments left of total size: 3769 bytes
18/06/26 16:22:47 INFO reduce.MergeManagerImpl: Merged 1 segments, 3777 bytes to disk to satisfy reduce memory limit
18/06/26 16:22:47 INFO reduce.MergeManagerImpl: Merging 1 files, 3781 bytes from disk
18/06/26 16:22:47 INFO reduce.MergeManagerImpl: Merging 0 segments, 0 bytes from memory into reduce
18/06/26 16:22:47 INFO mapred.Merger: Merging 1 sorted segments
18/06/26 16:22:47 INFO mapred.Merger: Down to the last merge-pass, with 1 segments left of total size: 3769 bytes
18/06/26 16:22:47 INFO mapred.LocalJobRunner: 1 / 1 copied.
18/06/26 16:22:47 INFO streaming.PipeMapRed: PipeMapRed exec [/usr/bin/python3, hdfs_reduce.py]
18/06/26 16:22:47 INFO Configuration.deprecation: mapred.job.tracker is deprecated. Instead, use mapreduce.jobtracker.address
18/06/26 16:22:47 INFO Configuration.deprecation: mapred.map.tasks is deprecated. Instead, use mapreduce.job.maps
18/06/26 16:22:47 INFO streaming.PipeMapRed: R/W/S=1/0/0 in:NA [rec/s] out:NA [rec/s]
18/06/26 16:22:47 INFO streaming.PipeMapRed: R/W/S=10/0/0 in:NA [rec/s] out:NA [rec/s]
18/06/26 16:22:47 INFO streaming.PipeMapRed: R/W/S=100/0/0 in:NA [rec/s] out:NA [rec/s]
18/06/26 16:22:47 INFO streaming.PipeMapRed: Records R/W=381/1
18/06/26 16:22:47 INFO streaming.PipeMapRed: MRErrorThread done
18/06/26 16:22:47 INFO streaming.PipeMapRed: mapRedFinished
18/06/26 16:22:47 INFO mapred.Task: Task:attempt_local49685846_0001_r_000000_0 is done. And is in the process of committing
18/06/26 16:22:47 INFO mapred.LocalJobRunner: 1 / 1 copied.
18/06/26 16:22:47 INFO mapred.Task: Task attempt_local49685846_0001_r_000000_0 is allowed to commit now
18/06/26 16:22:47 INFO output.FileOutputCommitter: Saved output of task 'attempt_local49685846_0001_r_000000_0' to hdfs://localhost:9000/output/wordcount/_temporary/0/task_local49685846_0001_r_000000
18/06/26 16:22:47 INFO mapred.LocalJobRunner: Records R/W=381/1 > reduce
18/06/26 16:22:47 INFO mapred.Task: Task 'attempt_local49685846_0001_r_000000_0' done.
18/06/26 16:22:47 INFO mapred.LocalJobRunner: Finishing task: attempt_local49685846_0001_r_000000_0
18/06/26 16:22:47 INFO mapred.LocalJobRunner: reduce task executor complete.
18/06/26 16:22:48 INFO mapreduce.Job: Job job_local49685846_0001 running in uber mode : false
18/06/26 16:22:48 INFO mapreduce.Job:  map 100% reduce 100%
18/06/26 16:22:48 INFO mapreduce.Job: Job job_local49685846_0001 completed successfully
18/06/26 16:22:48 INFO mapreduce.Job: Counters: 35
    File System Counters
        FILE: Number of bytes read=279474
        FILE: Number of bytes written=1220325
        FILE: Number of read operations=0
        FILE: Number of large read operations=0
        FILE: Number of write operations=0
        HDFS: Number of bytes read=4534
        HDFS: Number of bytes written=2287
        HDFS: Number of read operations=13
        HDFS: Number of large read operations=0
        HDFS: Number of write operations=4
    Map-Reduce Framework
        Map input records=34
        Map output records=381
        Map output bytes=3013
        Map output materialized bytes=3781
        Input split bytes=85
        Combine input records=0
        Combine output records=0
        Reduce input groups=236
        Reduce shuffle bytes=3781
        Reduce input records=381
        Reduce output records=236
        Spilled Records=762
        Shuffled Maps =1
        Failed Shuffles=0
        Merged Map outputs=1
        GC time elapsed (ms)=0
        Total committed heap usage (bytes)=536870912
    Shuffle Errors
        BAD_ID=0
        CONNECTION=0
        IO_ERROR=0
        WRONG_LENGTH=0
        WRONG_MAP=0
        WRONG_REDUCE=0
    File Input Format Counters 
        Bytes Read=2267
    File Output Format Counters 
        Bytes Written=2287
18/06/26 16:22:48 INFO streaming.StreamJob: Output directory: /output/wordcount
