Hadoop Operations in Detail

This series is structured as follows:
Hadoop Operations in Detail (Part 1): Introduction

Chapter 1. Introduction

    Over the past few years, there has been a fundamental shift in data storage, management, and processing. Companies are storing more data from more sources in more formats than ever before. This isn’t just about being a “data packrat” but rather building products, features, and intelligence predicated on knowing more about the world (where the world can be users, searches, machine logs, or whatever is relevant to an organization). Organizations are finding new ways to use data that was previously believed to be of little value, or far too expensive to retain, to better serve their constituents. Sourcing and storing data is one half of the equation. Processing that data to produce information is fundamental to the daily operations of every modern business.
    Data storage and processing isn’t a new problem, though. Fraud detection in commerce and finance, anomaly detection in operational systems, demographic analysis in advertising, and many other applications have had to deal with these issues for decades. What has happened is that the volume, velocity, and variety of this data has changed, and in some cases, rather dramatically. This makes sense, as many algorithms benefit from access to more data. Take, for instance, the problem of recommending products to a visitor of an ecommerce website. You could simply show each visitor a rotating list of products they could buy, hoping that one would appeal to them. It’s not exactly an informed decision, but it’s a start. The question is what do you need to improve the chance of showing the right person the right product? Maybe it makes sense to show them what you think they like, based on what they’ve previously looked at. For some products, it’s useful to know what they already own. Customers who already bought a specific brand of laptop computer from you may be interested in compatible accessories and upgrades.[1] One of the most common techniques is to cluster users by similar behavior (such as purchase patterns) and recommend products purchased by “similar” users. No matter the solution, all of the algorithms behind these options require data and generally improve in quality with more of it. Knowing more about a problem space generally leads to better decisions (or algorithm efficacy), which in turn leads to happier users, more money, reduced fraud, healthier people, safer conditions, or whatever the desired result might be.
Apache Hadoop is a platform that provides pragmatic, cost-effective, scalable infrastructure for building many of the types of applications described earlier. Made up of a distributed filesystem called the Hadoop Distributed Filesystem (HDFS) and a computation layer that implements a processing paradigm called MapReduce, Hadoop is an open source, batch data processing system for enormous amounts of data. We live in a flawed world, and Hadoop is designed to survive in it by not only tolerating hardware and software failures, but also treating them as first-class conditions that happen regularly. Hadoop uses a cluster of plain old commodity servers with no specialized hardware or network infrastructure to form a single, logical, storage and compute platform, or cluster, that can be shared by multiple individuals or groups. Computation in Hadoop MapReduce is performed in parallel, automatically, with a simple abstraction for developers that obviates complex synchronization and network programming. Unlike many other distributed data processing systems, Hadoop runs the user-provided processing logic on the machine where the data lives rather than dragging the data across the network; a huge win for performance.
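The map-then-reduce flow just described can be sketched in plain Python, with no Hadoop involved. The word-count logic below is a toy illustration of the paradigm (the function names and sample documents are invented for this sketch, not Hadoop's actual API); in a real cluster the framework runs the map and reduce functions in parallel across machines and performs the shuffle over the network.

```python
from collections import defaultdict

def map_phase(document):
    """Map: emit a (word, 1) pair for every word in an input split."""
    for word in document.split():
        yield (word.lower(), 1)

def shuffle(pairs):
    """Shuffle: group intermediate values by key; the framework does
    this between the map and reduce phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Reduce: combine all counts emitted for a single word."""
    return (key, sum(values))

documents = ["the quick brown fox", "the lazy dog"]
pairs = [pair for doc in documents for pair in map_phase(doc)]
counts = dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())
print(counts["the"])  # 2
```

The developer writes only the map and reduce functions; partitioning, scheduling, retries on failure, and moving intermediate data are the framework's job, which is exactly the synchronization and network programming the abstraction obviates.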
    For those interested in the history, Hadoop was modeled after two papers produced by Google, one of the many companies to have these kinds of data-intensive processing problems. The first, presented in 2003, describes a pragmatic, scalable, distributed filesystem optimized for storing enormous datasets, called the Google Filesystem, or GFS. In addition to simple storage, GFS was built to support large-scale, data-intensive, distributed processing applications. The following year, another paper, titled "MapReduce: Simplified Data Processing on Large Clusters," was presented, defining a programming model and accompanying framework that provided automatic parallelization, fault tolerance, and the scale to process hundreds of terabytes of data in a single job over thousands of machines. When paired, these two systems could be used to build large data processing clusters on relatively inexpensive, commodity machines. These papers directly inspired the development of HDFS and Hadoop MapReduce, respectively.
    Interest and investment in Hadoop has led to an entire ecosystem of related software both open source and commercial. Within the Apache Software Foundation alone, projects that explicitly make use of, or integrate with, Hadoop are springing up regularly. Some of these projects make authoring MapReduce jobs easier and more accessible, while others focus on getting data in and out of HDFS, simplify operations, enable deployment in cloud environments, and so on. Here is a sampling of the more popular projects with which you should familiarize yourself:
Apache Hive
    Hive creates a relational database-style abstraction that allows developers to write a dialect of SQL, which in turn is executed as one or more MapReduce jobs on the cluster. Developers, analysts, and existing third-party packages already know and speak SQL (Hive’s dialect of SQL is called HiveQL and implements only a subset of any of the common standards). Hive takes advantage of this and provides a quick way to reduce the learning curve to adopting Hadoop and writing MapReduce jobs. For this reason, Hive is by far one of the most popular Hadoop ecosystem projects.
    Hive works by defining a table-like schema over an existing set of files in HDFS and handling the gory details of extracting records from those files when a query is run. The data on disk is never actually changed, just parsed at query time. HiveQL statements are interpreted and an execution plan of prebuilt map and reduce classes is assembled to perform the MapReduce equivalent of the SQL statement.
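This schema-on-read idea can be sketched in a few lines of Python: a table-like schema is declared over raw delimited text, and records are only parsed out when a query runs. The file content, schema, and column names below are hypothetical, and the sketch stands in for what Hive actually compiles into map and reduce tasks.

```python
# Raw file content as it might sit in HDFS; queries never modify it.
raw = "1\talice\t34\n2\tbob\t29\n3\tcarol\t41\n"

# A table-like schema declared over the raw bytes (hypothetical columns).
schema = [("id", int), ("name", str), ("age", int)]

def scan(raw_text, table_schema):
    """Parse records out of the raw file at query time (schema-on-read)."""
    for line in raw_text.splitlines():
        fields = line.split("\t")
        yield {name: cast(value)
               for (name, cast), value in zip(table_schema, fields)}

# Rough equivalent of: SELECT name FROM users WHERE age > 30
result = [row["name"] for row in scan(raw, schema) if row["age"] > 30]
print(result)  # ['alice', 'carol']
```

Note that `raw` is untouched after the query: the data on disk stays exactly as it was loaded, and only the parsed, in-flight records reflect the schema.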
Apache Pig
    Like Hive, Apache Pig was created to simplify the authoring of MapReduce jobs, obviating the need to write Java code. Instead, users write data processing jobs in a high-level scripting language from which Pig builds an execution plan and executes a series of MapReduce jobs to do the heavy lifting. In cases where Pig doesn’t support a necessary function, developers can extend its set of built-in operations by writing user-defined functions in Java (Hive supports similar functionality as well). If you know Perl, Python, Ruby, JavaScript, or even shell script, you can learn Pig’s syntax in the morning and be running MapReduce jobs by lunchtime.
Apache Sqoop
    Not only does Hadoop not want to replace your database, it wants to be friends with it. Exchanging data with relational databases is one of the most popular integration points with Apache Hadoop. Sqoop, short for “SQL to Hadoop,” performs bidirectional data transfer between Hadoop and almost any database with a JDBC driver. Using MapReduce, Sqoop performs these operations in parallel with no need to write code.
    For even greater performance, Sqoop supports database-specific plug-ins that use native features of the RDBMS rather than incurring the overhead of JDBC. Many of these connectors are open source, while others are free or available from commercial vendors at a cost. Today, Sqoop includes native connectors (called direct support) for MySQL and PostgreSQL. Free connectors exist for Teradata, Netezza, SQL Server, and Oracle (from Quest Software), and are available for download from their respective company websites.
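The parallelism Sqoop gets from MapReduce comes from slicing the source table's key range across map tasks. The sketch below imitates that idea against a throwaway SQLite table (the table, the `key_splits` helper, and the four-mapper setup are all invented for illustration; Sqoop itself talks JDBC to a real RDBMS and runs the slices as parallel map tasks):

```python
import sqlite3

# A stand-in source database; Sqoop would reach a real RDBMS over JDBC.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, amount REAL)")
db.executemany("INSERT INTO orders VALUES (?, ?)",
               [(i, i * 1.5) for i in range(1, 101)])

def key_splits(lo, hi, num_mappers):
    """Partition the primary-key range into per-mapper slices, roughly
    the way Sqoop assigns split ranges to parallel map tasks."""
    step = (hi - lo + 1) // num_mappers
    bounds = [lo + i * step for i in range(num_mappers)] + [hi + 1]
    return list(zip(bounds[:-1], bounds[1:]))

# Each "mapper" pulls only its own slice; in Sqoop these run in parallel.
rows = []
for lo, hi in key_splits(1, 100, 4):
    rows += db.execute(
        "SELECT id, amount FROM orders WHERE id >= ? AND id < ?",
        (lo, hi)).fetchall()

print(len(rows))  # 100
```

Because each slice is an independent range query, the transfers can proceed concurrently with no coordination beyond choosing the split boundaries up front.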
Apache Flume
Apache Flume is a streaming data collection and aggregation system designed to transport massive volumes of data into systems such as Hadoop. It offers native connectivity to, and support for writing directly to, HDFS, and simplifies reliable, streaming data delivery from a variety of sources including RPC services, log4j appenders, syslog, and even the output from OS commands. Data can be routed, load-balanced, replicated to multiple destinations, and aggregated from thousands of hosts by a tier of agents.
Apache Oozie
    It’s not uncommon for large production clusters to run many coordinated MapReduce jobs in a workflow. Apache Oozie is a workflow engine and scheduler built specifically for large-scale job orchestration on a Hadoop cluster. Workflows can be triggered by time or events such as data arriving in a directory, and job failure handling logic can be implemented so that policies are adhered to. Oozie presents a REST service for programmatic management of workflows and status retrieval.
Apache Whirr
    Apache Whirr was developed to simplify the creation and deployment of ephemeral clusters in cloud environments such as Amazon’s AWS. Run as a command-line tool either locally or within the cloud, Whirr can spin up instances, deploy Hadoop, configure the software, and tear it down on demand. Under the hood, Whirr uses the powerful jclouds library so that it is cloud provider-neutral. The developers have put in the work to make Whirr support both Amazon EC2 and Rackspace Cloud. In addition to Hadoop, Whirr understands how to provision Apache Cassandra, Apache ZooKeeper, Apache HBase, ElasticSearch, Voldemort, and Apache Hama.
Apache HBase
    Apache HBase is a low-latency, distributed (nonrelational) database built on top of HDFS. Modeled after Google’s Bigtable, HBase presents a flexible data model with scale-out properties and a very simple API. Data in HBase is stored in a semi-columnar format partitioned by rows into regions. It’s not uncommon for a single table in HBase to be well into the hundreds of terabytes or in some cases petabytes. Over the past few years, HBase has gained a massive following based on some very public deployments such as Facebook’s Messages platform. Today, HBase is used to serve huge amounts of data to real-time systems in major production deployments.
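The row-to-region partitioning works off sorted row keys: each region owns a contiguous slice of the key space, bounded by split keys, and a client routes a row key to a region with a binary search over those splits. The split keys below are invented for the sketch; HBase's real routing also involves region server metadata, which is omitted here.

```python
import bisect

# Hypothetical region split keys. Each region holds a sorted, contiguous
# slice of the table's row-key space:
#   region 0: (-inf, "g")   region 1: ["g", "n")
#   region 2: ["n", "t")    region 3: ["t", +inf)
split_keys = ["g", "n", "t"]

def region_for(row_key):
    """Route a row key to its region index by binary search on splits."""
    return bisect.bisect_right(split_keys, row_key)

assert region_for("apple") == 0
assert region_for("mango") == 1   # "g" <= "mango" < "n"
assert region_for("zebra") == 3
```

When a region grows too large it splits, adding a new boundary key; because keys are sorted, scans over a key range touch only the handful of regions that cover it.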
Apache ZooKeeper
    A true workhorse, Apache ZooKeeper is a distributed, consensus-based coordination system used to support distributed applications. Distributed applications that require leader election, locking, group membership, service location, and configuration services can use ZooKeeper rather than reimplement the complex coordination and error handling that comes with these functions. In fact, many projects within the Hadoop ecosystem use ZooKeeper for exactly this purpose (most notably, HBase).
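Leader election is the canonical example of coordination logic that ZooKeeper saves you from reimplementing. In the standard ZooKeeper recipe, each candidate creates an ephemeral sequential znode under an election path, and the client holding the lowest sequence number is the leader; if its session dies, its znode vanishes and the next-lowest takes over. The in-memory dictionary below is a toy stand-in for that znode tree, not the ZooKeeper client API:

```python
import itertools

# Toy stand-in for an election znode's children:
# sequence number -> client name.
counter = itertools.count()
candidates = {}

def join_election(client):
    """Create an 'ephemeral sequential node' for this candidate."""
    seq = next(counter)
    candidates[seq] = client
    return seq

def leader():
    """The lowest surviving sequence number is the leader."""
    return candidates[min(candidates)]

def session_expired(seq):
    """An expired session deletes the ephemeral node automatically."""
    del candidates[seq]

a = join_election("server-a")
b = join_election("server-b")
assert leader() == "server-a"
session_expired(a)             # leader crashes; its node disappears
assert leader() == "server-b"  # next-lowest candidate takes over
```

The hard parts that the sketch ignores (detecting session loss, watch notifications, avoiding thundering herds) are precisely what ZooKeeper's consensus protocol and recipes handle for you.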
Apache HCatalog
    A relatively new entry, Apache HCatalog is a service that provides shared schema and data access abstraction services to applications within the ecosystem. The long-term goal of HCatalog is to enable interoperability between tools such as Apache Hive and Pig so that they can share dataset metadata information.
    The Hadoop ecosystem is exploding into the commercial world as well. Vendors such as Oracle, SAS, MicroStrategy, Tableau, Informatica, Microsoft, Pentaho, Talend, HP, Dell, and dozens of others have all developed integration or support for Hadoop within one or more of their products. Hadoop is fast becoming (or, as an increasingly growing group would believe, already has become) the de facto standard for truly large-scale data processing in the data center.
    If you’re reading this book, you may be a developer with some exposure to Hadoop looking to learn more about managing the system in a production environment. Alternatively, it could be that you’re an application or system administrator tasked with owning the current or planned production cluster. Those in the latter camp may be rolling their eyes at the prospect of dealing with yet another system. That’s fair, and we won’t spend a ton of time talking about writing applications, APIs, and other pesky code problems. There are other fantastic books on those topics, especially Hadoop: The Definitive Guide by Tom White (O’Reilly). Administrators do, however, play an absolutely critical role in planning, installing, configuring, maintaining, and monitoring Hadoop clusters. Hadoop is a comparatively low-level system, leaning heavily on the host operating system for many features, and it works best when developers and administrators collaborate regularly. What you do impacts how things work.
    It’s an extremely exciting time to get into Apache Hadoop. The so-called big data space is all the rage, sure, but more importantly, Hadoop is growing and changing at a staggering rate. Each new version—and there have been a few big ones in the past year or two—brings another truckload of features for both developers and administrators alike. You could say that Hadoop is experiencing software puberty; thanks to its rapid growth and adoption, it’s also a little awkward at times. You’ll find, throughout this book, that there are significant changes between even minor versions. It’s a lot to keep up with, admittedly, but don’t let it overwhelm you. Where necessary, the differences are called out, and a section in Chapter 4 is devoted to walking you through the most commonly encountered versions.
    This book is intended to be a pragmatic guide to running Hadoop in production. Those who have some familiarity with Hadoop may already know alternative methods for installation or have differing thoughts on how to properly tune the number of map slots based on CPU utilization.[2] That’s expected and more than fine. The goal is not to enumerate all possible scenarios, but rather to call out what works, as demonstrated in critical deployments.

   This is part of my Hadoop article series and learning roadmap; more installments are on the way, so stay tuned.
   GitHub: https://github.com/noseparte
   NPM: https://www.npmjs.com/~noseparte
   Personal site: http://www.noseparte.com/    Copyright © 2017 noseparte
