【Question title】: H2O Deep Water on MS Azure N-Instance (GPU enabled) cannot initialize backend
【Posted】: 2017-08-16 15:59:20
【Question】:

Update 1:

Log files from the H2O Deep Water cloud: https://drive.google.com/file/d/0B_1g718qYsqhcUl4WFQ5S1NKbE0/view?usp=sharing

  • mxnet backend - now resolved (after stopping and starting the VM in Azure)
  • tensorflow backend - still failing

I want to test H2O's Deep Water on MS Azure using a GPU-enabled cloud instance (NC6 - https://azure.microsoft.com/en-us/blog/azure-n-series-general-availability-on-december-1/). However, running H2O Deep Water produces the following errors:

  • mxnet backend: java.lang.RuntimeException: Unable to initialize the native Deep Learning backend: Could not initialize class deepwater.backends.mxnet.MXNetBackend$MXNetLoader
  • tensorflow backend: java.lang.RuntimeException: Unable to initialize the native Deep Learning backend: null

The configuration and setup are as follows:

After provisioning a DSVM on an NC6 virtual machine, I checked the Deep Water prerequisites, CUDA and cuDNN:

sysadmin@DEVSMTTSYGPU002:~$ nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2016 NVIDIA Corporation
Built on Tue_Jan_10_13:22:03_CST_2017
Cuda compilation tools, release 8.0, V8.0.61
sysadmin@DEVSMTTSYGPU002:~$ cat /usr/local/cuda/include/cudnn.h | grep CUDNN_MAJOR -A 2
#define CUDNN_MAJOR      5
#define CUDNN_MINOR      1
#define CUDNN_PATCHLEVEL 10
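
The version check above can also be scripted. The following is a minimal sketch, assuming the header is the one shown at /usr/local/cuda/include/cudnn.h; the function name `parse_cudnn_version` is made up for illustration and is not part of H2O or CUDA.

```python
import re


def parse_cudnn_version(header_text):
    """Return (major, minor, patch) parsed from the text of cudnn.h, or None."""
    found = {}
    for name in ("MAJOR", "MINOR", "PATCHLEVEL"):
        # Match lines such as "#define CUDNN_MAJOR      5".
        pattern = r"^\s*#define\s+CUDNN_%s\s+(\d+)" % name
        match = re.search(pattern, header_text, re.MULTILINE)
        if match is None:
            return None
        found[name] = int(match.group(1))
    return (found["MAJOR"], found["MINOR"], found["PATCHLEVEL"])


# The three lines printed by the grep command in the question.
sample = """#define CUDNN_MAJOR      5
#define CUDNN_MINOR      1
#define CUDNN_PATCHLEVEL 10
"""
print(parse_cudnn_version(sample))  # (5, 1, 10)
```

On the VM, the same function could be fed the real header with `parse_cudnn_version(open("/usr/local/cuda/include/cudnn.h").read())`.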

I then ran the following steps.

Set the environment variables:

  • export CUDA_PATH=/usr/local/cuda
  • export LD_LIBRARY_PATH=$CUDA_PATH/lib64:$LD_LIBRARY_PATH
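
These exports only affect the shell they are typed into and the processes started from it, so the Python session that later starts H2O has to inherit them. A small sketch for checking this from inside Python follows; the helper name `cuda_lib_dir_on_path` and the default path are assumptions taken from the exports above.

```python
import os


def cuda_lib_dir_on_path(environ, cuda_path="/usr/local/cuda"):
    """True if <cuda_path>/lib64 is one of the LD_LIBRARY_PATH entries."""
    wanted = os.path.normpath(os.path.join(cuda_path, "lib64"))
    # LD_LIBRARY_PATH is a colon-separated list; skip empty entries.
    entries = environ.get("LD_LIBRARY_PATH", "").split(":")
    return any(os.path.normpath(entry) == wanted for entry in entries if entry)


# Check the environment of the current Python process.
print(cuda_lib_dir_on_path(os.environ))
```

If this prints False in the session that starts H2O, the native libraries may not be found even though the exports were run in another terminal.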

Install pip for Python 2.7:

  • sudo apt-get install python-pip

Install Deep Water:

  • pip2 install http://s3.amazonaws.com/h2o-deepwater/public/nightly/latest/h2o-3.13.0-py2.py3-none-any.whl

Install libatlas-base-dev:

  • sudo apt-get install libatlas-base-dev

To run the example, I start Python 2.7 and run:

import h2o
h2o.init()

After that I used H2O Flow to create some artificial data and train a simple Deep Water model:

  • createFrame {"dest":"MNIST_SIM_60k","rows":"60000","cols":"784","seed":7595850248774472000,"seed_for_column_types":-1,"randomize":true,"value":0,"real_range":100,"categorical_fraction":"0","factors":5,"integer_fraction":"1","binary_fraction":"0","binary_ones_fraction":"0","time_fraction":0,"string_fraction":0,"integer_range":"127","missing_fraction":"0","response_factors":2,"has_response":true}
  • buildModel 'deepwater', {"model_id":"deepwater-782cc564-497c-4c39-a22a-b6904fb04188","training_frame":"MNIST_SIM_60k","nfolds":0,"response_column":"response","ignored_columns":[],"epochs":"100","ignore_const_cols":true,"network":"auto","activation":"Rectifier","hidden":[100],"problem_type":"dataset","checkpoint":"","autoencoder":false,"balance_classes":false,"score_each_iteration":false,"categorical_encoding":"AUTO","train_samples_per_iteration":-2,"standardize":true,"distribution":"AUTO","score_interval":5,"score_training_samples":10000,"score_validation_samples":0,"score_duty_cycle":0.1,"stopping_rounds":5,"stopping_metric":"AUTO","stopping_tolerance":0,"max_runtime_secs":0,"backend":"tensorflow","image_shape":[0,0],"channels":3,"network_definition_file":"","network_parameters_file":"","mean_image_file":"","export_native_parameters_prefix":"","input_dropout_ratio":0,"hidden_dropout_ratios":[],"overwrite_with_best_model":true,"target_ratio_comm_to_comp":0.05,"seed":-1,"learning_rate":0.001,"learning_rate_annealing":0.000001,"momentum_start":0.9,"momentum_ramp":10000,"momentum_stable":0.9,"classification_stop":0,"shuffle_training_data":true,"mini_batch_size":32,"clip_gradient":10,"sparse":false,"gpu":true,"device_id":[0],"cache_data":true}
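
To see quickly which backend initializes, the Flow model build can be repeated from Python for each backend. This is only a sketch under assumptions: `backend_status` and `build_with_h2o` are hypothetical helper names, `epochs=1` is chosen to keep the check short (the Flow call uses 100), and it assumes that the `H2ODeepWaterEstimator` class shipped with this Deep Water build accepts the same parameter names as the Flow call above.

```python
def backend_status(build_model, backends=("mxnet", "tensorflow")):
    """Call build_model(backend) for each backend; record 'ok' or the error text."""
    status = {}
    for backend in backends:
        try:
            build_model(backend)
            status[backend] = "ok"
        except Exception as err:
            status[backend] = "failed: %s" % err
    return status


def build_with_h2o(backend):
    """Train a small Deep Water model on the frame created in Flow.

    Assumes h2o.init() has been called and the frame MNIST_SIM_60k exists.
    """
    import h2o
    from h2o.estimators.deepwater import H2ODeepWaterEstimator

    frame = h2o.get_frame("MNIST_SIM_60k")
    model = H2ODeepWaterEstimator(backend=backend, epochs=1, hidden=[100],
                                  gpu=True, device_id=[0])
    model.train(y="response", training_frame=frame)
    return model


# On the VM, after h2o.init():
# print(backend_status(build_with_h2o))
```

Passing the build function in as an argument keeps the loop independent of H2O, so it can be tried with a stand-in function before running it against the cluster.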

For both backends (mxnet and tensorflow) I get the errors mentioned above. For tensorflow, the stack trace is:

java.lang.RuntimeException: Unable to initialize the native Deep Learning backend: null
    at hex.deepwater.DeepWaterModelInfo.setupNativeBackend(DeepWaterModelInfo.java:267)
    at hex.deepwater.DeepWaterModelInfo.<init>(DeepWaterModelInfo.java:214)
    at hex.deepwater.DeepWaterModel.<init>(DeepWaterModel.java:227)
    at hex.deepwater.DeepWater$DeepWaterDriver.buildModel(DeepWater.java:131)
    at hex.deepwater.DeepWater$DeepWaterDriver.computeImpl(DeepWater.java:118)
    at hex.ModelBuilder$Driver.compute2(ModelBuilder.java:173)
    at hex.deepwater.DeepWater$DeepWaterDriver.compute2(DeepWater.java:111)
    at water.H2O$H2OCountedCompleter.compute(H2O.java:1255)
    at jsr166y.CountedCompleter.exec(CountedCompleter.java:468)
    at jsr166y.ForkJoinTask.doExec(ForkJoinTask.java:263)
    at jsr166y.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:974)
    at jsr166y.ForkJoinPool.runWorker(ForkJoinPool.java:1477)
    at jsr166y.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:104)

For mxnet, the stack trace is:

java.lang.RuntimeException: Unable to initialize the native Deep Learning backend: Could not initialize class deepwater.backends.mxnet.MXNetBackend$MXNetLoader
    at hex.deepwater.DeepWaterModelInfo.setupNativeBackend(DeepWaterModelInfo.java:267)
    at hex.deepwater.DeepWaterModelInfo.<init>(DeepWaterModelInfo.java:214)
    at hex.deepwater.DeepWaterModel.<init>(DeepWaterModel.java:227)
    at hex.deepwater.DeepWater$DeepWaterDriver.buildModel(DeepWater.java:131)
    at hex.deepwater.DeepWater$DeepWaterDriver.computeImpl(DeepWater.java:118)
    at hex.ModelBuilder$Driver.compute2(ModelBuilder.java:173)
    at hex.deepwater.DeepWater$DeepWaterDriver.compute2(DeepWater.java:111)
    at water.H2O$H2OCountedCompleter.compute(H2O.java:1255)
    at jsr166y.CountedCompleter.exec(CountedCompleter.java:468)
    at jsr166y.ForkJoinTask.doExec(ForkJoinTask.java:263)
    at jsr166y.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:974)
    at jsr166y.ForkJoinPool.runWorker(ForkJoinPool.java:1477)
    at jsr166y.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:104)
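
The mxnet message "Could not initialize class ..." is the wording the JVM normally uses for a NoClassDefFoundError raised after an earlier attempt to initialize that class already failed, so the first failure may be recorded further up in the H2O log. A rough sketch for scanning a downloaded log for such entries follows; the function name `find_loader_failures` and the choice of marker strings are assumptions, and the actual log may word things differently.

```python
ROOT_CAUSE_MARKERS = ("ExceptionInInitializerError", "UnsatisfiedLinkError")


def find_loader_failures(log_text, markers=ROOT_CAUSE_MARKERS):
    """Return (line_number, line) pairs for log lines containing any marker."""
    hits = []
    for number, line in enumerate(log_text.splitlines(), start=1):
        if any(marker in line for marker in markers):
            hits.append((number, line.strip()))
    return hits


# Usage, assuming the log was saved locally as h2o.log (hypothetical file name):
# for number, line in find_loader_failures(open("h2o.log").read()):
#     print(number, line)
```

If the scan finds nothing, the cause has to be looked for elsewhere in the log; an empty result says nothing about whether the native library loaded.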

How can I run H2O Deep Water with at least one backend?

Side note: xgboost with GPU support from H2O works.

Thanks a lot,

Roberto

【Question comments】:

    Tags: h2o


    【Answer 1】:

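    I don't think we have ever tried running on Azure other than with the docker image. Are you using Ubuntu 16.04? If so, it should work, unless the Azure image differs from a standard Ubuntu 16.04. It looks like h2o cannot communicate with the backend. If you can post the full logs from h2o, I can try to see what the problem might be.

    Otherwise, I'd say the easiest way to run it is the docker image, and that is what I recommend. Everything is already installed. The only things you need to install are docker and nvidia-docker. Instructions: https://github.com/h2oai/deepwater#pre-release-docker-image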

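    【Comments】:

    • Comment, part one: Thanks for your answer. First I checked the Ubuntu version; it is "Ubuntu 16.04.3 LTS (GNU/Linux 4.4.0-92-generic x86_64)". I doubt that a regular Ubuntu VM from Azure uses a non-standard Linux build, since I have never run into problems with them. On the other hand, I cannot rule it out.
    • Comment, part two: I then made progress with the mxnet backend: it now works without my having changed anything. The only step between the error above and the now-successful computation was that I stopped the VM and started it again. Maybe Azure applies some restart mechanism to the VM and that was what was needed. I don't know, it's just a guess. So mxnet works now, but tensorflow still fails with the same error as before. I have added the H2O log files at the top of my question.
    • Hi, given that MXNet suddenly started working without any changes, I suspect this is not a standard Ubuntu environment. I looked at the log files and cannot tell why tensorflow is not loaded. There may be version differences in some libraries. To resolve this, you would probably need to build it yourself on that platform. Again, I suggest using the docker image, since we know it works on Azure and elsewhere. github.com/h2oai/deepwater#pre-release-docker-image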