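【Posted】: 2017-12-05 01:44:58
【Problem description】:
I'm new to Hive and Hadoop. I have configured Hadoop for pseudo-distributed operation with one data node and one name node, all on localhost.
I have a trivial employee table with 4 records. I can select the records in a reasonable time, but anything beyond that takes a very long time. For example: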
0: jdbc:hive2://localhost:10000> select * from emp;
+------------+------------+-------------+-------------+------------+
| emp.empno  | emp.ename  | emp.job     | emp.deptno  | emp.etype  |
+------------+------------+-------------+-------------+------------+
| 7369       | SMITH      | CLERK       | 10          | PART_TIME  |
| 7400       | JONES      | ENGINEER    | 10          | FULL_TIME  |
| 7500       | BROWN      | NIGHTGUARD  | 20          | FULL_TIME  |
| 7510       | LEE        | ENGINEER    | 20          | FULL_TIME  |
+------------+------------+-------------+-------------+------------+
4 rows selected (0.643 seconds)
0: jdbc:hive2://localhost:10000> select * from emp order by empno;
WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
+------------+------------+-------------+-------------+------------+
| emp.empno  | emp.ename  | emp.job     | emp.deptno  | emp.etype  |
+------------+------------+-------------+-------------+------------+
| 7369       | SMITH      | CLERK       | 10          | PART_TIME  |
| 7400       | JONES      | ENGINEER    | 10          | FULL_TIME  |
| 7500       | BROWN      | NIGHTGUARD  | 20          | FULL_TIME  |
| 7510       | LEE        | ENGINEER    | 20          | FULL_TIME  |
+------------+------------+-------------+-------------+------------+
4 rows selected (225.852 seconds)
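What is it doing for all that time? Is there a polling interval that can be reduced? I know Hive is not optimized for small jobs, but this seems ridiculous.
Here are the various configuration files:
hive-site.xml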
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
  <property>
    <name>hive.mapred.mode</name>
    <value>nostrict</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionURL</name>
    <value>jdbc:derby:;databaseName=metastore_db;create=true</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionDriverName</name>
    <value>org.apache.derby.jdbc.EmbeddedDriver</value>
  </property>
  <property>
    <name>hive.exec.local.scratchdir</name>
    <value>/home/hadoop/tmp</value>
  </property>
  <property>
    <name>hive.downloaded.resources.dir</name>
    <value>/home/hadoop/tmp/${hive.session.id}_resources</value>
  </property>
  <property>
    <name>hive.querylog.location</name>
    <value>/home/hadoop/tmp</value>
  </property>
  <property>
    <name>hive.server2.logging.operation.log.location</name>
    <value>/home/hadoop/tmp/operation_logs</value>
  </property>
</configuration>
hdfs.xml
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
  <property>
    <name>dfs.name.dir</name>
    <value>file:///home/hadoop/hadoopinfra/hdfs/namenode</value>
  </property>
  <property>
    <name>dfs.data.dir</name>
    <value>file:///home/hadoop/hadoopinfra/hdfs/datanode</value>
  </property>
</configuration>
core-site.xml
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://localhost:9000</value>
  </property>
  <property>
    <name>hadoop.proxyuser.hive.groups</name>
    <value>*</value>
  </property>
  <property>
    <name>hadoop.proxyuser.hive.hosts</name>
    <value>*</value>
  </property>
  <property>
    <name>hadoop.proxyuser.hadoop.hosts</name>
    <value>*</value>
  </property>
  <property>
    <name>hadoop.proxyuser.hadoop.groups</name>
    <value>*</value>
  </property>
</configuration>
mapred-site.xml
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
</configuration>
yarn-site.xml
<?xml version="1.0"?>
<configuration>
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>
</configuration>
【Discussion】:
-
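What Java memory properties have you given the Hadoop daemons? How much memory does the machine you installed Hadoop on have? Is memory being swapped? Also, the default embedded Derby database is not meant to be fast.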
-
Also, selecting from plain text is slower than from ORC or Parquet, but for reading 4 rows I think something else is wrong.
-
You can use Tez instead of MapReduce. Just run the command set hive.execution.engine=tez;
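A `set` command like the one in the last comment applies only to the current session. If the engine switch is meant to persist, one possible way is to put the same property in hive-site.xml. The fragment below is a minimal sketch, not a verified fix for the 225-second query: it assumes Tez is already installed and configured for this Hadoop setup, which the question does not state. Without a working Tez installation, queries are likely to fail.

```xml
<!-- Sketch: add inside the existing <configuration> element of hive-site.xml. -->
<!-- Assumption: Tez is installed and configured; the question does not confirm this. -->
<property>
  <name>hive.execution.engine</name>
  <value>tez</value>
</property>
```

HiveServer2 reads hive-site.xml at startup, so it would need a restart before the change takes effect.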
Tags: performance hadoop hive