I. What is Hadoop?

  Hadoop is an open-source platform for distributed storage and distributed computing.


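II. Hadoop has two core components:

  1. HDFS: a distributed file system for storing massive amounts of data

    a. Basic concepts

      - Block

         HDFS splits each file into blocks for storage. The default block size is 64 MB in Hadoop 1.x (128 MB since Hadoop 2.x).

         The block is the logical unit of file storage and processing.

      - NameNode

         The management node. It holds the file metadata, including:

          (1) the mapping from files to data blocks

          (2) the mapping from data blocks to DataNodes


      - DataNode

         The worker node of HDFS. It stores the data blocks.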
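
To make the two mapping tables concrete, here is a minimal sketch in Python. It uses the 64 MB block size mentioned above; the function names, the block id format, and the round-robin replica assignment are all made up for illustration and are not how HDFS names or places blocks.

```python
import math

BLOCK_SIZE = 64 * 1024 * 1024  # 64 MB, the figure used in the text (configurable in HDFS)


def split_into_blocks(file_size, block_size=BLOCK_SIZE):
    """Return (offset, length) pairs, one per block of a file of file_size bytes."""
    count = math.ceil(file_size / block_size)
    return [(i * block_size, min(block_size, file_size - i * block_size))
            for i in range(count)]


# Toy NameNode metadata: the two mapping tables described above.
file_to_blocks = {}      # file path -> list of block ids
block_to_datanodes = {}  # block id  -> list of DataNode names


def register_file(path, file_size, datanodes, replication=3):
    """Record hypothetical metadata for a file (round-robin replica assignment)."""
    block_ids = []
    for index, _ in enumerate(split_into_blocks(file_size)):
        block_id = '{}#blk_{}'.format(path, index)  # illustrative id format
        block_ids.append(block_id)
        block_to_datanodes[block_id] = [
            datanodes[(index + r) % len(datanodes)] for r in range(replication)
        ]
    file_to_blocks[path] = block_ids
    return block_ids
```

A 200 MB file, for example, becomes three full 64 MB blocks plus one 8 MB block; only the last block of a file is smaller than the block size.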


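    b. Data management strategy

      1. Block replicas

         Each data block has three replicas, placed on three nodes across two racks, so that data survives the failure of a node or a rack.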
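
The "three nodes across two racks" rule can be sketched as follows. This is a simplification under stated assumptions: the layout (one replica on the writer's rack, two on a single remote rack) follows the commonly described HDFS default, while `place_replicas` and the `topology` dictionary are hypothetical names, not part of any Hadoop API.

```python
import random


def place_replicas(topology, writer_rack, rng=random):
    """Pick three (rack, node) targets spread over exactly two racks.

    topology: dict mapping rack name -> list of node names (hypothetical layout).
    """
    local_nodes = topology[writer_rack]
    # A remote rack needs at least two nodes to hold the second and third replica.
    remote_racks = [rack for rack, nodes in topology.items()
                    if rack != writer_rack and len(nodes) >= 2]
    if not local_nodes or not remote_racks:
        raise ValueError('need one local node and a remote rack with two nodes')

    first = (writer_rack, rng.choice(local_nodes))
    remote_rack = rng.choice(remote_racks)
    second_node, third_node = rng.sample(topology[remote_rack], 2)
    return [first, (remote_rack, second_node), (remote_rack, third_node)]
```

With this layout, losing a single node or an entire rack still leaves at least one replica readable.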


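      2. Heartbeat detection

         Each DataNode periodically sends a heartbeat message to the NameNode, which uses it to tell whether the DataNode is still alive.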
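
A rough sketch of the NameNode side of heartbeat detection is shown below. The class name, the 30-second timeout, and the injectable clock are assumptions chosen for illustration; real HDFS intervals and timeouts are set through configuration.

```python
import time


class HeartbeatMonitor:
    """Toy NameNode-side tracker of DataNode heartbeats (hypothetical class)."""

    def __init__(self, timeout_seconds=30.0, clock=time.monotonic):
        self.timeout_seconds = timeout_seconds  # illustrative value
        self.clock = clock                      # injectable so it can be tested
        self.last_seen = {}

    def heartbeat(self, datanode):
        """Record that a DataNode reported in just now."""
        self.last_seen[datanode] = self.clock()

    def dead_nodes(self):
        """Return DataNodes whose last heartbeat is older than the timeout."""
        now = self.clock()
        return sorted(node for node, seen in self.last_seen.items()
                      if now - seen > self.timeout_seconds)
```

Once a DataNode is considered dead, the blocks it held have fewer live replicas than required, which is what triggers re-replication onto other nodes.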


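      3. Secondary NameNode

         The Secondary NameNode periodically synchronizes the metadata image file and the edit log. If the NameNode fails, the Secondary NameNode's copy can be used to restore it. Note that this is a recovery aid rather than an automatic hot standby.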
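
The periodic synchronization amounts to replaying the edit log onto the metadata image. The sketch below models that idea only; the dictionary-based image, the `(op, path, value)` log entries, and both function names are invented for illustration and do not reflect the on-disk formats HDFS uses.

```python
def apply_edits(fsimage, edits):
    """Replay an edit log onto a metadata image and return the new image."""
    image = dict(fsimage)  # copy so the input image is left untouched
    for op, path, value in edits:
        if op == 'create':
            image[path] = value
        elif op == 'delete':
            image.pop(path, None)
        else:
            raise ValueError('unknown operation: {}'.format(op))
    return image


def checkpoint(fsimage, edits):
    """Merge image and log; return the merged image and an emptied log."""
    return apply_edits(fsimage, edits), []
```

Keeping the log short this way is what bounds the time a restarted NameNode needs to rebuild its metadata.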


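      4. HDFS file read flow

         The client asks the NameNode for the locations of the file's blocks, then reads the blocks directly from the DataNodes and assembles them into the file.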

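      5. HDFS file write flow

         The client splits the file into blocks and asks the NameNode which DataNodes to use. It writes each block to one DataNode, the block is copied on to the other replicas, and the NameNode's metadata is updated.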

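      6. Characteristics of HDFS

         Data redundancy, with tolerance of hardware failures

         Streaming data access: write once, read many times. Once written, a file cannot be modified in place; to change it, delete it and write it again.

         Suited to large files. Many small files put heavy pressure on the NameNode, which must keep metadata for every file.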
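
The small-file problem is easier to see with numbers. The sketch below counts the metadata objects the NameNode has to track (one per file plus one per block). The 150 bytes per object is a widely quoted rule of thumb and is an assumption here, not a measured value; the function names are hypothetical.

```python
import math

BYTES_PER_OBJECT = 150  # rule-of-thumb assumption, not a measured value


def namenode_objects(file_count, file_size, block_size=64 * 1024 * 1024):
    """Count metadata objects: one per file plus one per block."""
    blocks_per_file = max(1, math.ceil(file_size / block_size))
    return file_count * (1 + blocks_per_file)


def estimated_heap_bytes(file_count, file_size, block_size=64 * 1024 * 1024):
    """Rough NameNode memory needed for the given set of equally sized files."""
    return namenode_objects(file_count, file_size, block_size) * BYTES_PER_OBJECT
```

Storing 1 GB as a single file needs 17 objects (1 file and 16 blocks), while storing the same 1 GB as 1024 files of 1 MB needs 2048 objects, roughly 120 times more metadata for the same amount of data.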

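      7. Applicability and limitations

         Well suited to batch reads and writes, with high throughput

         Not suited to interactive applications; low-latency requirements are hard to meet

         Well suited to write-once, read-many workloads with sequential access

         Does not support multiple users writing to the same file concurrently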


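  2. MapReduce: a parallel processing framework that handles task decomposition and scheduling

    a. How MapReduce works

      Divide and conquer: a large job is split into many small subtasks (map), which are executed in parallel on multiple nodes, and their results are then merged (reduce).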
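
The divide-and-conquer idea can be tried out in plain Python without a cluster. This is an in-memory sketch only: the function names are made up, and the explicit `shuffle` step stands in for the grouping that the framework performs between the map and reduce phases.

```python
from collections import defaultdict


def map_phase(chunk):
    """Map: emit a (word, 1) pair for every word in one chunk of input."""
    return [(word, 1) for word in chunk.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Reduce: combine each key's values into one result."""
    return {key: sum(values) for key, values in groups.items()}


def word_count(chunks):
    pairs = []
    for chunk in chunks:  # on a cluster, each chunk would go to a separate map task
        pairs.extend(map_phase(chunk))
    return reduce_phase(shuffle(pairs))
```

Calling `word_count(['a b a', 'b c'])` treats the two strings as two independent map inputs and merges their counts in the reduce step.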

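    b. MapReduce execution flow

      1. Basic concepts

        - Job & Task

          A Job is split into Tasks (MapTask, ReduceTask).

        - JobTracker

          Schedules jobs

          Assigns tasks and monitors their progress

          Monitors the state of the TaskTrackers

        - TaskTracker

          Executes tasks

          Reports task status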


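      2. Job execution process

         The input is divided into splits, and each split is handled by a map task. The intermediate key-value pairs are sorted and passed to the reduce tasks, which write the final result to HDFS.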

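      3. MapReduce fault tolerance

         Re-execution: a failed task is run again.

         Speculative execution: when a task runs much slower than the others, a duplicate is started on another node and whichever finishes first is used.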
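
Both mechanisms can be sketched in a few lines. Everything here is hypothetical: the function names, the limit of four attempts, and the "progress trails the average by more than 0.2" rule are illustrative choices, not the scheduler's actual policy.

```python
def run_with_retries(task, max_attempts=4):
    """Re-execute a failing task until it succeeds or attempts run out."""
    last_error = None
    for attempt in range(1, max_attempts + 1):
        try:
            return task(attempt)
        except Exception as error:  # a real scheduler would also pick another node
            last_error = error
    raise RuntimeError(
        'task failed after {} attempts'.format(max_attempts)) from last_error


def find_stragglers(progress, lag=0.2):
    """Return tasks whose progress (0.0 to 1.0) trails the average by more than lag."""
    if not progress:
        return []
    average = sum(progress.values()) / len(progress)
    return sorted(task for task, done in progress.items() if average - done > lag)
```

Tasks returned by `find_stragglers` are the candidates for a duplicate attempt on another node.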

 

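III. What can it be used for?

  Building large data warehouses, and storing, processing, analyzing, and aggregating data at PB scale

  Examples: search engines, business intelligence, log analysis, data mining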


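IV. Advantages of Hadoop

  1. High scalability

    Performance and capacity can be increased by adding more hardware.

  2. Low cost

    It runs on clusters of ordinary PCs; reliability comes from fault tolerance in software rather than from expensive hardware.

  3. A mature ecosystem

    For example: Hive, HBase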


V. HDFS operations

  1. Shell commands

    Common HDFS shell commands:

      Linux-like commands: ls, cat, mkdir, rm, chmod, chown

      Transferring files to and from HDFS: copyFromLocal, copyToLocal, get (download), put (upload)
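
To stay in the same language as the scripts later in this post, the sketch below builds these shell commands from Python. It assumes the `hdfs` launcher is on the PATH; the helper names and the local file name `mk_copy.txt` are made up, and the main block only prints the commands so that no cluster is needed.

```python
import subprocess


def hdfs_command(action, *args):
    """Build the argument list for one `hdfs dfs` sub-command."""
    return ['hdfs', 'dfs', '-' + action] + list(args)


def run_hdfs(action, *args):
    """Run the command; requires the `hdfs` binary on PATH and a running HDFS."""
    return subprocess.run(hdfs_command(action, *args), check=True)


if __name__ == '__main__':
    # Print the commands rather than running them, so no cluster is needed.
    for command in (
        hdfs_command('mkdir', '-p', '/test'),                 # create a directory
        hdfs_command('put', 'mk.txt', '/test/mk.txt'),        # upload
        hdfs_command('ls', '/test'),                          # list
        hdfs_command('cat', '/test/mk.txt'),                  # show contents
        hdfs_command('get', '/test/mk.txt', 'mk_copy.txt'),   # download
    ):
        print(' '.join(command))
```

Replacing `print(' '.join(command))` with a call to `run_hdfs` would execute the same steps against a real cluster.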

 

VI. The Hadoop ecosystem


 

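VII. MapReduce in practice

  This example reads a document and counts how many times each word occurs in it.

  First, create hdfs_map.py, which reads the document and emits each word: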

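# hdfs_map.py
# Mapper for Hadoop Streaming: reads text from stdin and emits "word<TAB>1" per word.
import sys


def read_input(file):
    """Yield the list of whitespace-separated words on each input line."""
    for line in file:
        yield line.split()


def main():
    data = read_input(sys.stdin)

    for words in data:
        for word in words:
            # Key and value are separated by a tab, the Streaming default.
            print('{}\t1'.format(word))


if __name__ == '__main__':
    main()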

  Next, create hdfs_reduce.py, which sums the counts for each word:

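# hdfs_reduce.py
# Reducer for Hadoop Streaming: sums the counts emitted by hdfs_map.py for each word.

import sys
from operator import itemgetter
from itertools import groupby


def read_mapper_output(file, separator='\t'):
    """Yield [word, count] pairs parsed from the mapper's output lines."""
    for line in file:
        fields = line.rstrip().split(separator, 1)
        if len(fields) == 2:  # skip blank or malformed lines
            yield fields


def main():
    data = read_mapper_output(sys.stdin)

    # groupby only merges adjacent keys; Hadoop sorts the mapper output by key
    # before it reaches the reducer, so equal words arrive together.
    for current_word, group in groupby(data, itemgetter(0)):
        total_count = sum(int(count) for _, count in group)

        print('{} {}'.format(current_word, total_count))


if __name__ == '__main__':
    main()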

  Create the document mk.txt in advance, add some content to it, and then upload it to HDFS.

  Run the MapReduce job from the command line:

hadoop jar /opt/hadoop-2.9.1/share/hadoop/tools/lib/hadoop-streaming-2.9.1.jar -files '/home/zzf/Git/Data_analysis/Hadoop/hdfs_map.py,/home/zzf/Git/Data_analysis/Hadoop/hdfs_reduce.py' -input /test/mk.txt -output /output/wordcount -mapper 'python3 hdfs_map.py' -reducer 'python3 hdfs_reduce.py'

  The run looks like this:

➜  Documents hadoop jar /opt/hadoop-2.9.1/share/hadoop/tools/lib/hadoop-streaming-2.9.1.jar -files '/home/zzf/Git/Data_analysis/Hadoop/hdfs_map.py,/home/zzf/Git/Data_analysis/Hadoop/hdfs_reduce.py' -input /test/mk.txt -output /output/wordcount -mapper 'python3 hdfs_map.py' -reducer 'python3 hdfs_reduce.py'
# Result
18/06/26 16:22:45 INFO Configuration.deprecation: session.id is deprecated. Instead, use dfs.metrics.session-id
18/06/26 16:22:45 INFO jvm.JvmMetrics: Initializing JVM Metrics with processName=JobTracker, sessionId=
18/06/26 16:22:45 INFO jvm.JvmMetrics: Cannot initialize JVM Metrics with processName=JobTracker, sessionId= - already initialized
18/06/26 16:22:46 INFO mapred.FileInputFormat: Total input files to process : 1
18/06/26 16:22:46 INFO mapreduce.JobSubmitter: number of splits:1
18/06/26 16:22:46 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_local49685846_0001
18/06/26 16:22:46 INFO mapred.LocalDistributedCacheManager: Creating symlink: /home/zzf/hadoop_tmp/mapred/local/1530001366609/hdfs_map.py <- /home/zzf/Documents/hdfs_map.py
18/06/26 16:22:46 INFO mapred.LocalDistributedCacheManager: Localized file:/home/zzf/Git/Data_analysis/Hadoop/hdfs_map.py as file:/home/zzf/hadoop_tmp/mapred/local/1530001366609/hdfs_map.py
18/06/26 16:22:47 INFO mapred.LocalDistributedCacheManager: Creating symlink: /home/zzf/hadoop_tmp/mapred/local/1530001366610/hdfs_reduce.py <- /home/zzf/Documents/hdfs_reduce.py
18/06/26 16:22:47 INFO mapred.LocalDistributedCacheManager: Localized file:/home/zzf/Git/Data_analysis/Hadoop/hdfs_reduce.py as file:/home/zzf/hadoop_tmp/mapred/local/1530001366610/hdfs_reduce.py
18/06/26 16:22:47 INFO mapreduce.Job: The url to track the job: http://localhost:8080/
18/06/26 16:22:47 INFO mapred.LocalJobRunner: OutputCommitter set in config null
18/06/26 16:22:47 INFO mapreduce.Job: Running job: job_local49685846_0001
18/06/26 16:22:47 INFO mapred.LocalJobRunner: OutputCommitter is org.apache.hadoop.mapred.FileOutputCommitter
18/06/26 16:22:47 INFO output.FileOutputCommitter: File Output Committer Algorithm version is 1
18/06/26 16:22:47 INFO output.FileOutputCommitter: FileOutputCommitter skip cleanup _temporary folders under output directory:false, ignore cleanup failures: false
18/06/26 16:22:47 INFO mapred.LocalJobRunner: Waiting for map tasks
18/06/26 16:22:47 INFO mapred.LocalJobRunner: Starting task: attempt_local49685846_0001_m_000000_0
18/06/26 16:22:47 INFO output.FileOutputCommitter: File Output Committer Algorithm version is 1
18/06/26 16:22:47 INFO output.FileOutputCommitter: FileOutputCommitter skip cleanup _temporary folders under output directory:false, ignore cleanup failures: false
18/06/26 16:22:47 INFO mapred.Task:  Using ResourceCalculatorProcessTree : [ ]
18/06/26 16:22:47 INFO mapred.MapTask: Processing split: hdfs://localhost:9000/test/mk.txt:0+2267
18/06/26 16:22:47 INFO mapred.MapTask: numReduceTasks: 1
18/06/26 16:22:47 INFO mapred.MapTask: (EQUATOR) 0 kvi 26214396(104857584)
18/06/26 16:22:47 INFO mapred.MapTask: mapreduce.task.io.sort.mb: 100
18/06/26 16:22:47 INFO mapred.MapTask: soft limit at 83886080
18/06/26 16:22:47 INFO mapred.MapTask: bufstart = 0; bufvoid = 104857600
18/06/26 16:22:47 INFO mapred.MapTask: kvstart = 26214396; length = 6553600
18/06/26 16:22:47 INFO mapred.MapTask: Map output collector class = org.apache.hadoop.mapred.MapTask$MapOutputBuffer
18/06/26 16:22:47 INFO streaming.PipeMapRed: PipeMapRed exec [/usr/bin/python3, hdfs_map.py]
18/06/26 16:22:47 INFO Configuration.deprecation: mapred.work.output.dir is deprecated. Instead, use mapreduce.task.output.dir
18/06/26 16:22:47 INFO Configuration.deprecation: map.input.start is deprecated. Instead, use mapreduce.map.input.start
18/06/26 16:22:47 INFO Configuration.deprecation: mapred.task.is.map is deprecated. Instead, use mapreduce.task.ismap
18/06/26 16:22:47 INFO Configuration.deprecation: mapred.task.id is deprecated. Instead, use mapreduce.task.attempt.id
18/06/26 16:22:47 INFO Configuration.deprecation: mapred.tip.id is deprecated. Instead, use mapreduce.task.id
18/06/26 16:22:47 INFO Configuration.deprecation: mapred.local.dir is deprecated. Instead, use mapreduce.cluster.local.dir
18/06/26 16:22:47 INFO Configuration.deprecation: map.input.file is deprecated. Instead, use mapreduce.map.input.file
18/06/26 16:22:47 INFO Configuration.deprecation: mapred.skip.on is deprecated. Instead, use mapreduce.job.skiprecords
18/06/26 16:22:47 INFO Configuration.deprecation: map.input.length is deprecated. Instead, use mapreduce.map.input.length
18/06/26 16:22:47 INFO Configuration.deprecation: mapred.job.id is deprecated. Instead, use mapreduce.job.id
18/06/26 16:22:47 INFO Configuration.deprecation: user.name is deprecated. Instead, use mapreduce.job.user.name
18/06/26 16:22:47 INFO Configuration.deprecation: mapred.task.partition is deprecated. Instead, use mapreduce.task.partition
18/06/26 16:22:47 INFO streaming.PipeMapRed: R/W/S=1/0/0 in:NA [rec/s] out:NA [rec/s]
18/06/26 16:22:47 INFO streaming.PipeMapRed: R/W/S=10/0/0 in:NA [rec/s] out:NA [rec/s]
18/06/26 16:22:47 INFO streaming.PipeMapRed: Records R/W=34/1
18/06/26 16:22:47 INFO streaming.PipeMapRed: MRErrorThread done
18/06/26 16:22:47 INFO streaming.PipeMapRed: mapRedFinished
18/06/26 16:22:47 INFO mapred.LocalJobRunner:
18/06/26 16:22:47 INFO mapred.MapTask: Starting flush of map output
18/06/26 16:22:47 INFO mapred.MapTask: Spilling map output
18/06/26 16:22:47 INFO mapred.MapTask: bufstart = 0; bufend = 3013; bufvoid = 104857600
18/06/26 16:22:47 INFO mapred.MapTask: kvstart = 26214396(104857584); kvend = 26212876(104851504); length = 1521/6553600
18/06/26 16:22:47 INFO mapred.MapTask: Finished spill 0
18/06/26 16:22:47 INFO mapred.Task: Task:attempt_local49685846_0001_m_000000_0 is done. And is in the process of committing
18/06/26 16:22:47 INFO mapred.LocalJobRunner: Records R/W=34/1
18/06/26 16:22:47 INFO mapred.Task: Task 'attempt_local49685846_0001_m_000000_0' done.
18/06/26 16:22:47 INFO mapred.LocalJobRunner: Finishing task: attempt_local49685846_0001_m_000000_0
18/06/26 16:22:47 INFO mapred.LocalJobRunner: map task executor complete.
18/06/26 16:22:47 INFO mapred.LocalJobRunner: Waiting for reduce tasks
18/06/26 16:22:47 INFO mapred.LocalJobRunner: Starting task: attempt_local49685846_0001_r_000000_0
18/06/26 16:22:47 INFO output.FileOutputCommitter: File Output Committer Algorithm version is 1
18/06/26 16:22:47 INFO output.FileOutputCommitter: FileOutputCommitter skip cleanup _temporary folders under output directory:false, ignore cleanup failures: false
18/06/26 16:22:47 INFO mapred.Task:  Using ResourceCalculatorProcessTree : [ ]
18/06/26 16:22:47 INFO mapred.ReduceTask: Using ShuffleConsumerPlugin: org.apache.hadoop.mapreduce.task.reduce.Shuffle@257adccd
18/06/26 16:22:47 INFO reduce.MergeManagerImpl: MergerManager: memoryLimit=334338464, maxSingleShuffleLimit=83584616, mergeThreshold=220663392, ioSortFactor=10, memToMemMergeOutputsThreshold=10
18/06/26 16:22:47 INFO reduce.EventFetcher: attempt_local49685846_0001_r_000000_0 Thread started: EventFetcher for fetching Map Completion Events
18/06/26 16:22:47 INFO reduce.LocalFetcher: localfetcher#1 about to shuffle output of map attempt_local49685846_0001_m_000000_0 decomp: 3777 len: 3781 to MEMORY
18/06/26 16:22:47 INFO reduce.InMemoryMapOutput: Read 3777 bytes from map-output for attempt_local49685846_0001_m_000000_0
18/06/26 16:22:47 INFO reduce.MergeManagerImpl: closeInMemoryFile -> map-output of size: 3777, inMemoryMapOutputs.size() -> 1, commitMemory -> 0, usedMemory ->3777
18/06/26 16:22:47 INFO reduce.EventFetcher: EventFetcher is interrupted.. Returning
18/06/26 16:22:47 INFO mapred.LocalJobRunner: 1 / 1 copied.
18/06/26 16:22:47 INFO reduce.MergeManagerImpl: finalMerge called with 1 in-memory map-outputs and 0 on-disk map-outputs
18/06/26 16:22:47 INFO mapred.Merger: Merging 1 sorted segments
18/06/26 16:22:47 INFO mapred.Merger: Down to the last merge-pass, with 1 segments left of total size: 3769 bytes
18/06/26 16:22:47 INFO reduce.MergeManagerImpl: Merged 1 segments, 3777 bytes to disk to satisfy reduce memory limit
18/06/26 16:22:47 INFO reduce.MergeManagerImpl: Merging 1 files, 3781 bytes from disk
18/06/26 16:22:47 INFO reduce.MergeManagerImpl: Merging 0 segments, 0 bytes from memory into reduce
18/06/26 16:22:47 INFO mapred.Merger: Merging 1 sorted segments
18/06/26 16:22:47 INFO mapred.Merger: Down to the last merge-pass, with 1 segments left of total size: 3769 bytes
18/06/26 16:22:47 INFO mapred.LocalJobRunner: 1 / 1 copied.
18/06/26 16:22:47 INFO streaming.PipeMapRed: PipeMapRed exec [/usr/bin/python3, hdfs_reduce.py]
18/06/26 16:22:47 INFO Configuration.deprecation: mapred.job.tracker is deprecated. Instead, use mapreduce.jobtracker.address
18/06/26 16:22:47 INFO Configuration.deprecation: mapred.map.tasks is deprecated. Instead, use mapreduce.job.maps
18/06/26 16:22:47 INFO streaming.PipeMapRed: R/W/S=1/0/0 in:NA [rec/s] out:NA [rec/s]
18/06/26 16:22:47 INFO streaming.PipeMapRed: R/W/S=10/0/0 in:NA [rec/s] out:NA [rec/s]
18/06/26 16:22:47 INFO streaming.PipeMapRed: R/W/S=100/0/0 in:NA [rec/s] out:NA [rec/s]
18/06/26 16:22:47 INFO streaming.PipeMapRed: Records R/W=381/1
18/06/26 16:22:47 INFO streaming.PipeMapRed: MRErrorThread done
18/06/26 16:22:47 INFO streaming.PipeMapRed: mapRedFinished
18/06/26 16:22:47 INFO mapred.Task: Task:attempt_local49685846_0001_r_000000_0 is done. And is in the process of committing
18/06/26 16:22:47 INFO mapred.LocalJobRunner: 1 / 1 copied.
18/06/26 16:22:47 INFO mapred.Task: Task attempt_local49685846_0001_r_000000_0 is allowed to commit now
18/06/26 16:22:47 INFO output.FileOutputCommitter: Saved output of task 'attempt_local49685846_0001_r_000000_0' to hdfs://localhost:9000/output/wordcount/_temporary/0/task_local49685846_0001_r_000000
18/06/26 16:22:47 INFO mapred.LocalJobRunner: Records R/W=381/1 > reduce
18/06/26 16:22:47 INFO mapred.Task: Task 'attempt_local49685846_0001_r_000000_0' done.
18/06/26 16:22:47 INFO mapred.LocalJobRunner: Finishing task: attempt_local49685846_0001_r_000000_0
18/06/26 16:22:47 INFO mapred.LocalJobRunner: reduce task executor complete.
18/06/26 16:22:48 INFO mapreduce.Job: Job job_local49685846_0001 running in uber mode : false
18/06/26 16:22:48 INFO mapreduce.Job:  map 100% reduce 100%
18/06/26 16:22:48 INFO mapreduce.Job: Job job_local49685846_0001 completed successfully
18/06/26 16:22:48 INFO mapreduce.Job: Counters: 35
    File System Counters
        FILE: Number of bytes read=279474
        FILE: Number of bytes written=1220325
        FILE: Number of read operations=0
        FILE: Number of large read operations=0
        FILE: Number of write operations=0
        HDFS: Number of bytes read=4534
        HDFS: Number of bytes written=2287
        HDFS: Number of read operations=13
        HDFS: Number of large read operations=0
        HDFS: Number of write operations=4
    Map-Reduce Framework
        Map input records=34
        Map output records=381
        Map output bytes=3013
        Map output materialized bytes=3781
        Input split bytes=85
        Combine input records=0
        Combine output records=0
        Reduce input groups=236
        Reduce shuffle bytes=3781
        Reduce input records=381
        Reduce output records=236
        Spilled Records=762
        Shuffled Maps =1
        Failed Shuffles=0
        Merged Map outputs=1
        GC time elapsed (ms)=0
        Total committed heap usage (bytes)=536870912
    Shuffle Errors
        BAD_ID=0
        CONNECTION=0
        IO_ERROR=0
        WRONG_LENGTH=0
        WRONG_MAP=0
        WRONG_REDUCE=0
    File Input Format Counters
        Bytes Read=2267
    File Output Format Counters
        Bytes Written=2287
18/06/26 16:22:48 INFO streaming.StreamJob: Output directory: /output/wordcount