【Posted】:2017-08-16 15:59:20
【Question】:
Update 1:
Log files from the H2O Deep Water cloud: https://drive.google.com/file/d/0B_1g718qYsqhcUl4WFQ5S1NKbE0/view?usp=sharing
- mxnet backend - now resolved (after stopping/starting the VM in Azure)
- tensorflow backend - still fails
I want to test H2O's Deep Water on a GPU-enabled cloud instance on MS Azure (NC6 - https://azure.microsoft.com/en-us/blog/azure-n-series-general-availability-on-december-1/). However, running H2O Deep Water fails with the following errors:
- mxnet backend:
java.lang.RuntimeException: Unable to initialize the native Deep Learning backend: Could not initialize class deepwater.backends.mxnet.MXNetBackend$MXNetLoader
- tensorflow backend:
java.lang.RuntimeException: Unable to initialize the native Deep Learning backend: null
The configuration and setup are as follows:
After provisioning a DSVM on the NC6 virtual machine, I checked the Deep Water prerequisites, CUDA and cuDNN:
sysadmin@DEVSMTTSYGPU002:~$ nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2016 NVIDIA Corporation
Built on Tue_Jan_10_13:22:03_CST_2017
Cuda compilation tools, release 8.0, V8.0.61
sysadmin@DEVSMTTSYGPU002:~$ cat /usr/local/cuda/include/cudnn.h | grep CUDNN_MAJOR -A 2
#define CUDNN_MAJOR 5
#define CUDNN_MINOR 1
#define CUDNN_PATCHLEVEL 10
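The two checks above can also be scripted so the versions are easy to compare across machines. This is only a minimal sketch and not part of H2O: the helper names `parse_nvcc_release` and `parse_cudnn_version` are hypothetical, and it assumes the text has the same format as the output shown above.

```python
import re


def parse_nvcc_release(nvcc_output):
    """Return the CUDA release (e.g. '8.0') found in `nvcc --version` output, or None."""
    match = re.search(r"release\s+(\d+\.\d+)", nvcc_output)
    return match.group(1) if match else None


def parse_cudnn_version(header_text):
    """Return (major, minor, patchlevel) from cudnn.h text, or None if any part is missing."""
    parts = []
    for name in ("CUDNN_MAJOR", "CUDNN_MINOR", "CUDNN_PATCHLEVEL"):
        match = re.search(r"#define\s+" + name + r"\s+(\d+)", header_text)
        if match is None:
            return None
        parts.append(int(match.group(1)))
    return tuple(parts)


if __name__ == "__main__":
    # Sample text copied from the output above; on a real machine, read it from
    # `nvcc --version` and /usr/local/cuda/include/cudnn.h instead.
    nvcc_text = "Cuda compilation tools, release 8.0, V8.0.61"
    cudnn_text = (
        "#define CUDNN_MAJOR 5\n"
        "#define CUDNN_MINOR 1\n"
        "#define CUDNN_PATCHLEVEL 10\n"
    )
    print(parse_nvcc_release(nvcc_text))
    print(parse_cudnn_version(cudnn_text))
```

Which CUDA/cuDNN combination a given Deep Water nightly build expects is not stated here, so the parsed values still have to be compared against the build's own requirements.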
Then I ran the following steps:
Set the environment variables:
export CUDA_PATH=/usr/local/cuda
export LD_LIBRARY_PATH=$CUDA_PATH/lib64:$LD_LIBRARY_PATH
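Because these variables only take effect in the shell where they were exported, it may be worth confirming that the Python process that later starts H2O actually sees them. The following is a rough sketch under that assumption; `cuda_lib_dir_on_path` is a hypothetical helper, not an H2O function, and it only inspects the environment without loading any library.

```python
import os
import posixpath


def cuda_lib_dir_on_path(env):
    """Return True if $CUDA_PATH/lib64 is listed in LD_LIBRARY_PATH of the given mapping."""
    cuda_path = env.get("CUDA_PATH")
    if not cuda_path:
        return False
    wanted = posixpath.normpath(posixpath.join(cuda_path, "lib64"))
    # LD_LIBRARY_PATH is a Linux variable, so entries are separated by ':'.
    entries = env.get("LD_LIBRARY_PATH", "").split(":")
    return any(posixpath.normpath(entry) == wanted for entry in entries if entry)


if __name__ == "__main__":
    # Run this in the same session that will call h2o.init().
    print(cuda_lib_dir_on_path(os.environ))
```

A result of `False` would suggest the exports were made in a different shell or were lost, for example after a VM restart.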
Install pip for Python 2.7:
sudo apt-get install python-pip
Install Deep Water:
pip2 install http://s3.amazonaws.com/h2o-deepwater/public/nightly/latest/h2o-3.13.0-py2.py3-none-any.whl
Install libatlas-base-dev:
sudo apt-get install libatlas-base-dev
To run the example, I start Python 2.7 and run:
import h2o
h2o.init()
After that I used H2O Flow to create some artificial data and train a simple Deep Water model:
createFrame {"dest":"MNIST_SIM_60k","rows":"60000","cols":"784","seed":7595850248774472000,"seed_for_column_types":-1,"randomize":true,"value":0,"real_range":100,"categorical_fraction":"0","factors":5,"integer_fraction":"1","binary_fraction":"0","binary_ones_fraction":"0","time_fraction":0,"string_fraction":0,"integer_range":"127","missing_fraction":"0","response_factors":2,"has_response":true}
buildModel 'deepwater', {"model_id":"deepwater-782cc564-497c-4c39-a22a-b6904fb04188","training_frame":"MNIST_SIM_60k","nfolds":0,"response_column":"response","ignored_columns":[],"epochs":"100","ignore_const_cols":true,"network":"auto","activation":"Rectifier","hidden":[100],"problem_type":"dataset","checkpoint":"","autoencoder":false,"balance_classes":false,"score_each_iteration":false,"categorical_encoding":"AUTO","train_samples_per_iteration":-2,"standardize":true,"distribution":"AUTO","score_interval":5,"score_training_samples":10000,"score_validation_samples":0,"score_duty_cycle":0.1,"stopping_rounds":5,"stopping_metric":"AUTO","stopping_tolerance":0,"max_runtime_secs":0,"backend":"tensorflow","image_shape":[0,0],"channels":3,"network_definition_file":"","network_parameters_file":"","mean_image_file":"","export_native_parameters_prefix":"","input_dropout_ratio":0,"hidden_dropout_ratios":[],"overwrite_with_best_model":true,"target_ratio_comm_to_comp":0.05,"seed":-1,"learning_rate":0.001,"learning_rate_annealing":0.000001,"momentum_start":0.9,"momentum_ramp":10000,"momentum_stable":0.9,"classification_stop":0,"shuffle_training_data":true,"mini_batch_size":32,"clip_gradient":10,"sparse":false,"gpu":true,"device_id":[0],"cache_data":true}
With both backends (mxnet and tensorflow) I get the errors mentioned above. For tensorflow, the stack trace is:
java.lang.RuntimeException: Unable to initialize the native Deep Learning backend: null
at hex.deepwater.DeepWaterModelInfo.setupNativeBackend(DeepWaterModelInfo.java:267)
at hex.deepwater.DeepWaterModelInfo.<init>(DeepWaterModelInfo.java:214)
at hex.deepwater.DeepWaterModel.<init>(DeepWaterModel.java:227)
at hex.deepwater.DeepWater$DeepWaterDriver.buildModel(DeepWater.java:131)
at hex.deepwater.DeepWater$DeepWaterDriver.computeImpl(DeepWater.java:118)
at hex.ModelBuilder$Driver.compute2(ModelBuilder.java:173)
at hex.deepwater.DeepWater$DeepWaterDriver.compute2(DeepWater.java:111)
at water.H2O$H2OCountedCompleter.compute(H2O.java:1255)
at jsr166y.CountedCompleter.exec(CountedCompleter.java:468)
at jsr166y.ForkJoinTask.doExec(ForkJoinTask.java:263)
at jsr166y.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:974)
at jsr166y.ForkJoinPool.runWorker(ForkJoinPool.java:1477)
at jsr166y.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:104)
For mxnet, the stack trace is:
java.lang.RuntimeException: Unable to initialize the native Deep Learning backend: Could not initialize class deepwater.backends.mxnet.MXNetBackend$MXNetLoader
at hex.deepwater.DeepWaterModelInfo.setupNativeBackend(DeepWaterModelInfo.java:267)
at hex.deepwater.DeepWaterModelInfo.<init>(DeepWaterModelInfo.java:214)
at hex.deepwater.DeepWaterModel.<init>(DeepWaterModel.java:227)
at hex.deepwater.DeepWater$DeepWaterDriver.buildModel(DeepWater.java:131)
at hex.deepwater.DeepWater$DeepWaterDriver.computeImpl(DeepWater.java:118)
at hex.ModelBuilder$Driver.compute2(ModelBuilder.java:173)
at hex.deepwater.DeepWater$DeepWaterDriver.compute2(DeepWater.java:111)
at water.H2O$H2OCountedCompleter.compute(H2O.java:1255)
at jsr166y.CountedCompleter.exec(CountedCompleter.java:468)
at jsr166y.ForkJoinTask.doExec(ForkJoinTask.java:263)
at jsr166y.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:974)
at jsr166y.ForkJoinPool.runWorker(ForkJoinPool.java:1477)
at jsr166y.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:104)
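The first line of each trace already hints at where to look next. In Java, "Could not initialize class X" is generally the message of a NoClassDefFoundError raised after the static initializer of X failed on an earlier attempt, so the original failure should appear earlier in the log; a detail of "null" usually means the wrapped exception carried no message. The sketch below encodes that reading; `describe_backend_error` is a hypothetical helper, and its hints are an interpretation, not output produced by H2O.

```python
PREFIX = "Unable to initialize the native Deep Learning backend: "
CLASS_INIT = "Could not initialize class "


def describe_backend_error(first_line):
    """Return a short hint for the first line of a Deep Water backend stack trace."""
    if PREFIX not in first_line:
        return "not a backend initialization error"
    detail = first_line.split(PREFIX, 1)[1].strip()
    if detail.startswith(CLASS_INIT):
        cls = detail[len(CLASS_INIT):]
        return "static initializer of %s failed earlier; find the first error in the log" % cls
    if detail == "null":
        return "wrapped exception had no message; check the full log for its cause"
    return detail


if __name__ == "__main__":
    print(describe_backend_error(
        "java.lang.RuntimeException: Unable to initialize the native Deep Learning "
        "backend: Could not initialize class deepwater.backends.mxnet.MXNetBackend$MXNetLoader"
    ))
    print(describe_backend_error(
        "java.lang.RuntimeException: Unable to initialize the native Deep Learning backend: null"
    ))
```

In both cases the stack traces shown here are not the root cause, which is why the full H2O log linked in the update is the more useful artifact.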
How can I get H2O Deep Water running with at least one backend?
Side note: xgboost with GPU support from H2O works.
Many thanks,
Roberto
【Comments】:
Tags: h2o