HDInsight and Hive queries: answers

【Question title】: HDInsight and Hive queries
【Posted】: 2018-04-30 12:42:01
【Question】:

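We are running a POC for HDInsight, and I am new to this technology. What we are doing is trying to send some data to Azure and write some Hive queries against it. We have the first part working: we can push some test data to Azure Blob storage using AzCopy. (I know there are also Azure Tables and Azure Queues.) For the POC, though, Azure Blob is fine.

We can talk to this blob from Visual Studio. However, we also want to look at HDInsight and its MapReduce capabilities.

With that background, here are a few questions: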

 1. Do I need to copy data from Azure Blob to anywhere else to write Hive queries in Ambari, or can Ambari talk directly to data stored in Azure Blob?
 2. Is this the right way to process data (keep the data in Azure Blob and use HDInsight/Ambari to process it)?
 3. If point 2 is correct, does that mean HDInsight is used only for parallel processing with its MapReduce feature?

Any insights are much appreciated.

【Question comments】:

Tags: azure hive azure-hdinsight ambari

【Solution 1】:

1: Yes, HDInsight can read data stored in Blob storage directly, so there is no need to copy it anywhere else first. Examples:

https://docs.microsoft.com/en-us/azure/hdinsight/hadoop/apache-hadoop-linux-tutorial-get-started

https://blogs.msdn.microsoft.com/azuredatalake/2017/04/06/azure-hdinsight-3-6-five-things-that-will-make-data-developer-happy/
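To make the "no copy needed" point concrete, here is a minimal sketch of how a Hive external table can point at files already sitting in a blob container. It only builds the HiveQL text; you would paste the result into the Hive view in Ambari or run it with Beeline. The container, storage account, table, and column names are all hypothetical, and the sketch assumes the files are comma-delimited text and that the storage account is attached to the cluster.

```python
def wasb_uri(container: str, account: str, path: str = "", secure: bool = True) -> str:
    """Build a WASB URI: wasb[s]://<container>@<account>.blob.core.windows.net/<path>."""
    scheme = "wasbs" if secure else "wasb"
    return f"{scheme}://{container}@{account}.blob.core.windows.net/{path.lstrip('/')}"


def external_table_ddl(table: str, columns: dict, location: str) -> str:
    """Return HiveQL that defines an external table over comma-delimited text files."""
    cols = ", ".join(f"{name} {hive_type}" for name, hive_type in columns.items())
    return (
        f"CREATE EXTERNAL TABLE IF NOT EXISTS {table} ({cols})\n"
        "ROW FORMAT DELIMITED FIELDS TERMINATED BY ','\n"
        "STORED AS TEXTFILE\n"
        f"LOCATION '{location}';"
    )


# Hypothetical names, for illustration only.
location = wasb_uri("poc-data", "mystorageaccount", "/test/input/")
ddl = external_table_ddl("poc_events", {"id": "INT", "name": "STRING"}, location)
print(ddl)
```

Because the table is external, Hive only records the schema and the location. The files uploaded with AzCopy stay where they are, and dropping the table does not delete them.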

2: Yes. Depending on what you want to do, you can use Spark, MR, Pig, or Hive to process the data. A good starting point is here: https://www.edx.org/course/processing-big-data-with-hadoop-in-azure-hdinsight

3: Yes, the data is processed using one of the distributed frameworks, such as Spark, MapReduce, Hive, or Pig.
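If the MapReduce model itself is new to you, the following single-process sketch shows the three phases these frameworks distribute across a cluster: map, shuffle, and reduce. It is a conceptual illustration only, not how HDInsight runs jobs; the function names and the sample input are made up.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Reduce: combine each key's values into one result."""
    return {key: sum(values) for key, values in groups.items()}


lines = ["hive on hdinsight", "spark on hdinsight"]  # made-up sample input
pairs = [pair for line in lines for pair in map_phase(line)]
counts = reduce_phase(shuffle(pairs))
print(counts)
```

On a real cluster the map calls run in parallel on different nodes, each working on a different slice of the input, and the framework handles the shuffle between nodes.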

【Comments】:

When data is stored in Azure Blob, is the underlying storage system HDFS?