【问题标题】:Hadoop last reduce job stuck at connection timeoutHadoop最后减少作业卡在连接超时
【发布时间】:2013-09-23 13:33:25
【问题描述】:

我的 hadoop 作业卡在最后一个 reduce 任务上。我已经看到很多从 3 个不同主机到单个主机的连接超时。但是,我可以从任何其他机器上 ping 这台有问题的机器。

这是一个 5 节点集群。它是最近建成的。它们具有相同的 hadoop 和 pig 二进制文件。它有3台新机器和2台旧机器。如果我删除了 2 台旧机器,它就可以正常工作了。

有问题的旧机版本:

java version "1.6.0_20"
OpenJDK Runtime Environment (IcedTea6 1.9.13) (6b20-1.9.13-0ubuntu1~10.10.1)
OpenJDK 64-Bit Server VM (build 19.0-b09, mixed mode)

新工作机版本:

java version "1.6.0_24"
OpenJDK Runtime Environment (IcedTea6 1.11.3) (6b24-1.11.3-1ubuntu0.12.04.1)
OpenJDK 64-Bit Server VM (build 20.0-b12, mixed mode)

我还从日志中看到了大量 (180) 的 dup 主机。

2012-08-28 12:34:53,307 INFO org.apache.hadoop.mapred.ReduceTask: attempt_201208231557_0021_r_000008_0 Need another 4 map output(s) where 1 is already in progress
2012-08-28 12:34:53,307 INFO org.apache.hadoop.mapred.ReduceTask: attempt_201208231557_0021_r_000008_0 Scheduled 0 outputs (0 slow hosts and3 dup hosts)
2012-08-28 12:35:53,308 INFO org.apache.hadoop.mapred.ReduceTask: attempt_201208231557_0021_r_000008_0 Need another 4 map output(s) where 1 is already in progress
2012-08-28 12:35:53,308 INFO org.apache.hadoop.mapred.ReduceTask: attempt_201208231557_0021_r_000008_0 Scheduled 0 outputs (0 slow hosts and3 dup hosts)
2012-08-28 12:36:53,309 INFO org.apache.hadoop.mapred.ReduceTask: attempt_201208231557_0021_r_000008_0 Need another 4 map output(s) where 1 is already in progress
2012-08-28 12:36:53,309 INFO org.apache.hadoop.mapred.ReduceTask: attempt_201208231557_0021_r_000008_0 Scheduled 0 outputs (0 slow hosts and3 dup hosts)
2012-08-28 12:37:18,416 WARN org.apache.hadoop.mapred.ReduceTask: attempt_201208231557_0021_r_000008_0 copy failed: attempt_201208231557_0021_m_000002_0 from Han
2012-08-28 12:37:18,417 WARN org.apache.hadoop.mapred.ReduceTask: java.net.SocketTimeoutException: connect timed out
    at java.net.PlainSocketImpl.socketConnect(Native Method)
    at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:327)
    at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:193)
    at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:180)
    at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:384)
    at java.net.Socket.connect(Socket.java:546)
    at sun.net.NetworkClient.doConnect(NetworkClient.java:173)
    at sun.net.www.http.HttpClient.openServer(HttpClient.java:409)
    at sun.net.www.http.HttpClient.openServer(HttpClient.java:530)
    at sun.net.www.http.HttpClient.<init>(HttpClient.java:240)
    at sun.net.www.http.HttpClient.New(HttpClient.java:321)
    at sun.net.www.http.HttpClient.New(HttpClient.java:338)
    at sun.net.www.protocol.http.HttpURLConnection.getNewHttpClient(HttpURLConnection.java:935)
    at sun.net.www.protocol.http.HttpURLConnection.plainConnect(HttpURLConnection.java:876)
    at sun.net.www.protocol.http.HttpURLConnection.connect(HttpURLConnection.java:801)
    at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.getInputStream(ReduceTask.java:1618)
    at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.setupSecureConnection(ReduceTask.java:1575)
    at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.getMapOutput(ReduceTask.java:1483)
    at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.copyOutput(ReduceTask.java:1394)
    at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.run(ReduceTask.java:1326)

2012-08-28 12:37:18,417 INFO org.apache.hadoop.mapred.ReduceTask: Task attempt_201208231557_0021_r_000008_0: Failed fetch #6 from attempt_201208231557_0021_m_000002_0
2012-08-28 12:37:18,417 WARN org.apache.hadoop.mapred.ReduceTask: attempt_201208231557_0021_r_000008_0 adding host Han to penalty box, next contact in 48 seconds
2012-08-28 12:37:18,417 INFO org.apache.hadoop.mapred.ReduceTask: attempt_201208231557_0021_r_000008_0: Got 1 map-outputs from previous failures
2012-08-28 12:37:53,418 INFO org.apache.hadoop.mapred.ReduceTask: attempt_201208231557_0021_r_000008_0 Need another 4 map output(s) where 0 is already in progress
2012-08-28 12:37:53,418 INFO org.apache.hadoop.mapred.ReduceTask: attempt_201208231557_0021_r_000008_0 Scheduled 0 outputs (1 slow hosts and0 dup hosts)
2012-08-28 12:37:53,418 INFO org.apache.hadoop.mapred.ReduceTask: Penalized(slow) Hosts: 
2012-08-28 12:37:53,418 INFO org.apache.hadoop.mapred.ReduceTask: Han Will be considered after: 13 seconds.
2012-08-28 12:38:08,418 INFO org.apache.hadoop.mapred.ReduceTask: attempt_201208231557_0021_r_000008_0 Scheduled 1 outputs (0 slow hosts and0 dup hosts)
2012-08-28 12:38:53,419 INFO org.apache.hadoop.mapred.ReduceTask: attempt_201208231557_0021_r_000008_0 Need another 4 map output(s) where 1 is already in progress
2012-08-28 12:38:53,419 INFO org.apache.hadoop.mapred.ReduceTask: attempt_201208231557_0021_r_000008_0 Scheduled 0 outputs (0 slow hosts and3 dup hosts)
2012-08-28 12:39:53,420 INFO org.apache.hadoop.mapred.ReduceTask: attempt_201208231557_0021_r_000008_0 Need another 4 map output(s) where 1 is already in progress
2012-08-28 12:39:53,420 INFO org.apache.hadoop.mapred.ReduceTask: attempt_201208231557_0021_r_000008_0 Scheduled 0 outputs (0 slow hosts and3 dup hosts)
2012-08-28 12:40:53,421 INFO org.apache.hadoop.mapred.ReduceTask: attempt_201208231557_0021_r_000008_0 Need another 4 map output(s) where 1 is already in progress
2012-08-28 12:40:53,421 INFO org.apache.hadoop.mapred.ReduceTask: attempt_201208231557_0021_r_000008_0 Scheduled 0 outputs (0 slow hosts and3 dup hosts)
2012-08-28 12:41:08,537 WARN org.apache.hadoop.mapred.ReduceTask: attempt_201208231557_0021_r_000008_0 copy failed: attempt_201208231557_0021_m_000002_0 from Han
2012-08-28 12:41:08,538 WARN org.apache.hadoop.mapred.ReduceTask: java.net.SocketTimeoutException: connect timed out
    at java.net.PlainSocketImpl.socketConnect(Native Method)
    at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:327)
    at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:193)
    at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:180)
    at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:384)
    at java.net.Socket.connect(Socket.java:546)
    at sun.net.NetworkClient.doConnect(NetworkClient.java:173)
    at sun.net.www.http.HttpClient.openServer(HttpClient.java:409)
    at sun.net.www.http.HttpClient.openServer(HttpClient.java:530)
    at sun.net.www.http.HttpClient.<init>(HttpClient.java:240)
    at sun.net.www.http.HttpClient.New(HttpClient.java:321)
    at sun.net.www.http.HttpClient.New(HttpClient.java:338)
    at sun.net.www.protocol.http.HttpURLConnection.getNewHttpClient(HttpURLConnection.java:935)
    at sun.net.www.protocol.http.HttpURLConnection.plainConnect(HttpURLConnection.java:876)
    at sun.net.www.protocol.http.HttpURLConnection.connect(HttpURLConnection.java:801)
    at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.getInputStream(ReduceTask.java:1618)
    at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.setupSecureConnection(ReduceTask.java:1575)
    at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.getMapOutput(ReduceTask.java:1483)
    at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.copyOutput(ReduceTask.java:1394)
    at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.run(ReduceTask.java:1326)

2012-08-28 12:41:08,538 INFO org.apache.hadoop.mapred.ReduceTask: Task attempt_201208231557_0021_r_000008_0: Failed fetch #7 from attempt_201208231557_0021_m_000002_0
2012-08-28 12:41:08,538 WARN org.apache.hadoop.mapred.ReduceTask: attempt_201208231557_0021_r_000008_0 adding host Han to penalty box, next contact in 62 seconds
2012-08-28 12:41:08,538 INFO org.apache.hadoop.mapred.ReduceTask: attempt_201208231557_0021_r_000008_0: Got 1 map-outputs from previous failures
2012-08-28 12:41:53,539 INFO org.apache.hadoop.mapred.ReduceTask: attempt_201208231557_0021_r_000008_0 Need another 4 map output(s) where 0 is already in progress
2012-08-28 12:41:53,539 INFO org.apache.hadoop.mapred.ReduceTask: attempt_201208231557_0021_r_000008_0 Scheduled 0 outputs (1 slow hosts and0 dup hosts)
2012-08-28 12:41:53,539 INFO org.apache.hadoop.mapred.ReduceTask: Penalized(slow) Hosts: 
2012-08-28 12:41:53,539 INFO org.apache.hadoop.mapred.ReduceTask: Han Will be considered after: 17 seconds.
2012-08-28 12:42:13,540 INFO org.apache.hadoop.mapred.ReduceTask: attempt_201208231557_0021_r_000008_0 Scheduled 1 outputs (0 slow hosts and0 dup hosts)
2012-08-28 12:42:53,540 INFO org.apache.hadoop.mapred.ReduceTask: attempt_201208231557_0021_r_000008_0 Need another 4 map output(s) where 1 is already in progress
2012-08-28 12:42:53,540 INFO org.apache.hadoop.mapred.ReduceTask: attempt_201208231557_0021_r_000008_0 Scheduled 0 outputs (0 slow hosts and3 dup hosts)

更多日志:

2012-09-06 11:01:47,719 INFO org.apache.hadoop.mapred.TaskTracker: attempt_201209041720_0010_r_000014_0 1.0% reduce > reduce
2012-09-06 11:01:47,720 INFO org.apache.hadoop.mapred.TaskTracker: Task attempt_201209041720_0010_r_000014_0 is done.
2012-09-06 11:01:47,720 INFO org.apache.hadoop.mapred.TaskTracker: reported output size for attempt_201209041720_0010_r_000014_0  was -1
2012-09-06 11:01:47,720 INFO org.apache.hadoop.mapred.TaskTracker: addFreeSlot : current free slots : 6
2012-09-06 11:01:47,769 INFO org.apache.hadoop.mapred.JvmManager: JVM : jvm_201209041720_0010_r_1617080101 exited with exit code 0. Number of tasks it ran: 1
2012-09-06 11:01:47,913 INFO org.apache.hadoop.mapred.JvmManager: JVM : jvm_201209041720_0010_r_-434797671 exited with exit code 0. Number of tasks it ran: 1
2012-09-06 11:01:50,428 INFO org.apache.hadoop.mapred.TaskTracker: Received KillTaskAction for task: attempt_201209041720_0010_r_000014_0
2012-09-06 11:01:50,428 INFO org.apache.hadoop.mapred.TaskTracker: About to purge task: attempt_201209041720_0010_r_000014_0
2012-09-06 11:01:50,429 INFO org.apache.hadoop.mapred.TaskTracker: Received KillTaskAction for task: attempt_201209041720_0010_r_000020_0
2012-09-06 11:01:50,429 INFO org.apache.hadoop.mapred.TaskTracker: About to purge task: attempt_201209041720_0010_r_000020_0

【问题讨论】:

  • 我在 3 小时后杀死了有问题的主机上的 tasktracker 和 datanode,现在它已畅通无阻。仍然不知道根本原因。

标签: hadoop


【解决方案1】:

您是否使用默认的推测执行设置? (4 次尝试)- 这里有一个 article 可能会有所帮助。

【讨论】:

  • 我使用的是默认值。没有更改任何推测执行设置。看我的回答。
【解决方案2】:

这可能是由于硬盘驱动器速度慢。在新机器(三星 830 SSD)上,我看到 3MB/s 的传输速率和慢速机器(旋转硬盘)0.12 MB/s。它可能只是完成了一项大型减速器工作并且太慢了。

【讨论】:

  • 我将硬盘升级为SSD并修复了一台旧机器。但是另一台机器仍然有同样的减速器卡死问题。我打算尝试删除hadoop数据目录试试。
【解决方案3】:

当 HADOOP 作业处于 reduce 阶段时,任务跟踪器需要在数据节点之间洗牌。如果您在从属配置文件中设置数据节点主机名,则会发生连接超时错误,而无法解析主机名。 将主机名和 IP 地址添加到 /etc/hosts 是一种快速的解决方案。

【讨论】:

    【解决方案4】:

    不确定为什么会发生这种情况,但我遇到了同样的问题...我的 datanode 试图访问本地主机上的 namenode,而不是我实际的“主”namenode...

    我通过转到我的 hosts 文件解决了这个问题:

    etc/hosts

    从那里删除 localhost 行。我很确定应该有一种更清洁的方法来解决这个问题,但这个解决方案对我有用。

    如果有人有更简洁的方法来阻止数据节点尝试访问本地主机上的名称节点,请告诉我!!!

    【讨论】:

      猜你喜欢
      • 2014-01-13
      • 1970-01-01
      • 1970-01-01
      • 1970-01-01
      • 2018-08-31
      • 1970-01-01
      • 1970-01-01
      • 2010-11-10
      • 2013-08-25
      相关资源
      最近更新 更多