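【Question title】: Kaggle TPU Unavailable: failed to connect to all addresses
【Posted】: 2021-08-20 05:31:39
【Question】:

I am new to machine learning. While trying to do digit recognition on a TPU, I ran into a problem that has really been bothering me. The code and the errors it produces are below.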

import tensorflow as tf
from tensorflow.keras import Sequential
from tensorflow.keras.layers import (BatchNormalization, Conv2D, Dense, Dropout,
                                     Flatten, InputLayer, LeakyReLU, MaxPooling2D)

# Connect to the TPU and build a distribution strategy for it.
resolver = tf.distribute.cluster_resolver.TPUClusterResolver(tpu='')
tf.config.experimental_connect_to_cluster(resolver)
tf.tpu.experimental.initialize_tpu_system(resolver)
strategy = tf.distribute.TPUStrategy(resolver)

# Create and compile the model inside the strategy scope.
with strategy.scope():
    Model = Sequential([
        InputLayer((28, 28, 1)),
        Dropout(0.1),
        Conv2D(128, 3, use_bias=False),
        LeakyReLU(0.05),
        BatchNormalization(),
        MaxPooling2D(2, 2),
        Conv2D(64, 3, use_bias=False),
        LeakyReLU(0.05),
        BatchNormalization(),
        MaxPooling2D(2, 2),
        Flatten(),
        Dense(128, use_bias=False),
        LeakyReLU(0.05),
        BatchNormalization(),
        Dense(10, activation='softmax')
    ])
    Model.compile(optimizer='adam',
                  loss='categorical_crossentropy', metrics=['accuracy'])
CancelledError: 4 root error(s) found.
  (0) Cancelled:  Operation was cancelled
     [[node IteratorGetNextAsOptional_1 (defined at <ipython-input-31-44edcf0f3ea7>:3) ]]
  (1) Cancelled:  Iterator was cancelled
     [[node IteratorGetNextAsOptional_6 (defined at <ipython-input-31-44edcf0f3ea7>:3) ]]
  (2) Cancelled:  Operation was cancelled
     [[node IteratorGetNextAsOptional_3 (defined at <ipython-input-31-44edcf0f3ea7>:3) ]]
  (3) Cancelled:  Iterator was cancelled
     [[node IteratorGetNextAsOptional_5 (defined at <ipython-input-31-44edcf0f3ea7>:3) ]]
0 successful operations.
5 derived errors ignored. [Op:__inference_train_function_23675]

Function call stack:
train_function -> train_function -> train_function -> train_function

Then I ran it again and got the following error:

UnavailableError: 9 root error(s) found.
  (0) Unavailable:  failed to connect to all addresses
Additional GRPC error information from remote target /job:localhost/replica:0/task:0/device:CPU:0:
:{"created":"@1629436055.354219684","description":"Failed to pick subchannel","file":"third_party/grpc/src/core/ext/filters/client_channel/client_channel.cc","file_line":4143,"referenced_errors":[{"created":"@1629436055.354217763","description":"failed to connect to all addresses","file":"third_party/grpc/src/core/ext/filters/client_channel/lb_policy/pick_first/pick_first.cc","file_line":398,"grpc_status":14}]}
     [[{{node MultiDeviceIteratorGetNextFromShard}}]]
     [[RemoteCall]]
     [[IteratorGetNextAsOptional]]
     [[cond_11/switch_pred/_107/_78]]
  (1) Unavailable:  failed to connect to all addresses
Additional GRPC error information from remote target /job:localhost/replica:0/task:0/device:CPU:0:
:{"created":"@1629436055.354219684","description":"Failed to pick subchannel","file":"third_party/grpc/src/core/ext/filters/client_channel/client_channel.cc","file_line":4143,"referenced_errors":[{"created":"@1629436055.354217763","description":"failed to connect to all addresses","file":"third_party/grpc/src/core/ext/filters/client_channel/lb_policy/pick_first/pick_first.cc","file_line":398,"grpc_status":14}]}
     [[{{node MultiDeviceIteratorGetNextFromShard}}]]
     [[RemoteCall]]
     [[IteratorGetNextAsOptional]]
     [[TPUReplicate/_compile/_7290104207349758044/_4/_178]]
  (2) Unavailable:  failed to connect to all addresses
Additional GRPC error information from remote target /job:localhost/replica:0/task:0/device:CPU:0:
:{"created":"@1629436055.354219684","description":"Failed to pick subchannel","file":"third_party/grpc/src/core/ext/filters/client_channel/client_channel.cc","file_line":4143,"referenced_errors":[{"created":"@1629436055.354217763","description":"failed to connect to all addresses","file":"third_party/grpc/src/core/ext/filters/client_channel/lb_policy/pick_first/pick_first.cc","file_line":398,"grpc_status":14}]}
     [[{{node MultiDeviceIteratorGetNextFromShard}}]]
     [[RemoteCall]]
     [[IteratorGetNextAsOptional]]
     [[tpu_compile_succeeded_assert/_13543899577889784813/_5/_281]]
  (3) Unavailable:  failed to connect to all addresses
Additional GRPC error information from remote target /job:localhost/replica:0/task:0/device:CPU:0:
:{"created":"@1629436055.354219684","description":"Failed to pick subchannel","file":"third_party/grpc/src/core/ext/filters/client_channel/client_channel.cc","file_line":4143,"referenced_errors":[{"created":"@1629436055.354217763","description":"failed to connect to all addresses","file":"third_party/grpc/src/core/ext/filters/client_channel/lb_policy/pick_first/pick_first.cc","file_line":398,"grpc_status":14}]}
     [[{{node MultiDeviceIteratorGetNextFromShard}}]]
     [[RemoteCall]]
     [[IteratorGetNextAsOptional]]
     [[strided_slice_37 ... [truncated] [Op:__inference_train_function_6939]

Function call stack:
train_function -> train_function -> train_function -> train_function

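I must be missing a `with strategy.scope():` somewhere.

I have tried many times. The same approach works in many other notebooks, but all of those use tf.data.Dataset.

Even so, I cannot figure out what is wrong with this simple digit recognition. I have searched again and again and been stuck here for 2 days, which is really frustrating.

The full code is at https://www.kaggle.com/dacianpeng/digit-hello-world?scriptVersionId=72464286

Version 6 is the TPU version. It differs from Version 5 only by the code above. Please help!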

【Comments】:

    Tags: tensorflow tpu


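    【Answer 1】:

    It looks like your training data is stored locally, which causes the problem, because TPUs can only access data in GCS.

    "TPUs read training data exclusively from GCS (Google Cloud Storage)." See the linked documentation for details.

    You can also check the Stack Overflow post "Colab TPU Error when calling model.fit() : UnimplementedError".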
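To make the GCS suggestion more concrete, here is a minimal sketch of feeding a model from TFRecord files through tf.data. Everything specific in it is an assumption and not taken from the asker's notebook: the feature names `image` and `label`, the helper names `write_tfrecord`, `parse_example` and `load_tfrecords`, and the file name. On Kaggle the file pattern would point at a `gs://` path (the Kaggle TPU docs describe obtaining one with `KaggleDatasets().get_gcs_path()`); the sketch writes to a local temporary directory instead so that it runs anywhere.

```python
import os
import tempfile

import numpy as np
import tensorflow as tf


def write_tfrecord(path, images, labels):
    """Serialize (image, label) pairs into one TFRecord file."""
    with tf.io.TFRecordWriter(path) as writer:
        for image, label in zip(images, labels):
            example = tf.train.Example(features=tf.train.Features(feature={
                "image": tf.train.Feature(
                    bytes_list=tf.train.BytesList(value=[image.tobytes()])),
                "label": tf.train.Feature(
                    int64_list=tf.train.Int64List(value=[int(label)])),
            }))
            writer.write(example.SerializeToString())


def parse_example(serialized):
    """Decode one record into a float image and a one-hot label."""
    features = tf.io.parse_single_example(serialized, {
        "image": tf.io.FixedLenFeature([], tf.string),
        "label": tf.io.FixedLenFeature([], tf.int64),
    })
    image = tf.io.decode_raw(features["image"], tf.uint8)
    image = tf.reshape(tf.cast(image, tf.float32) / 255.0, (28, 28, 1))
    label = tf.one_hot(features["label"], 10)
    return image, label


def load_tfrecords(file_pattern, batch_size=128):
    """Build a batched dataset from every file matching the pattern."""
    files = tf.io.gfile.glob(file_pattern)  # gs:// patterns work here too
    dataset = tf.data.TFRecordDataset(files, num_parallel_reads=tf.data.AUTOTUNE)
    dataset = dataset.map(parse_example, num_parallel_calls=tf.data.AUTOTUNE)
    # drop_remainder=True keeps every batch the same shape.
    return dataset.batch(batch_size, drop_remainder=True).prefetch(tf.data.AUTOTUNE)


# Local stand-in for a GCS bucket, filled with random data (not real digits).
data_dir = tempfile.mkdtemp()
rng = np.random.default_rng(0)
images = rng.integers(0, 256, size=(256, 28, 28, 1), dtype=np.uint8)
labels = rng.integers(0, 10, size=256)
write_tfrecord(os.path.join(data_dir, "train-000.tfrec"), images, labels)

train_ds = load_tfrecords(os.path.join(data_dir, "*.tfrec"))
first_images, first_labels = next(iter(train_ds))
```

The resulting `train_ds` can be passed directly to `Model.fit()`. Whether uploading to GCS is strictly required for a small in-memory dataset is a separate question; Answer 2 below reports success without it.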

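    【Comments】:

    • Thanks, I'll give it a try!
    • @DacianPeng Did you find a solution?
    • @Gagik Does that mean the data has to be uploaded to GCS?
    • For TPU training on Kaggle/Colab, the dataset must be uploaded to Google Cloud Storage rather than downloaded locally. See the datasets section of kaggle.com/docs/tpu.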
    【Answer 2】:

    Fixed by switching the inputs to tf.data.Dataset (no GCS needed).

    With a purely local tf.data.Dataset, calling fit() works. But as soon as ImageDataGenerator() is used, it fails with Unavailable: failed to connect to all addresses.

    # Fixed by changing the inputs to tf.data.Dataset.

    ds1 = tf.data.Dataset.from_tensor_slices((DS1, L1)).batch(128).prefetch(tf.data.AUTOTUNE)
    ds2 = tf.data.Dataset.from_tensor_slices((DS2, L2)).batch(128).prefetch(tf.data.AUTOTUNE)

    ...
    ...


    History = Model.fit(ds1, epochs=Epochs, validation_data=ds2,
                        callbacks=[ReduceLR, Stop], verbose=1)

    # The time per epoch is not stable (sometimes faster, sometimes slower),
    # but most of the time it is roughly the same as on a GPU.
    
    

    It fails again as soon as ImageDataGenerator() is used:

    # Fails again once ImageDataGenerator() is used

    ds1 = tf.data.Dataset.from_generator(
        lambda: ImageModifier.flow(DS1, L1),
        output_signature=(
            tf.TensorSpec(shape=(28, 28, 1), dtype=tf.float32),
            tf.TensorSpec(shape=(10,), dtype=tf.float32)),
    ).batch(128).prefetch(tf.data.AUTOTUNE)

    History = Model.fit(ds1, epochs=Epochs, verbose=1)
    ---------------------------------------------------------------------------
    UnavailableError                          Traceback (most recent call last)
    <ipython-input-107-149f17c4776c> in <module>
          1 Epochs = 15
    ----> 2 History = Model.fit(ds1, epochs=Epochs, verbose=1)
    
    /opt/conda/lib/python3.7/site-packages/tensorflow/python/keras/engine/training.py in fit(self, x, y, batch_size, epochs, verbose, callbacks, validation_split, validation_data, shuffle, class_weight, sample_weight, initial_epoch, steps_per_epoch, validation_steps, validation_batch_size, validation_freq, max_queue_size, workers, use_multiprocessing)
       1100               tmp_logs = self.train_function(iterator)
       1101               if data_handler.should_sync:
    -> 1102                 context.async_wait()
       1103               logs = tmp_logs  # No error, now safe to assign to logs.
       1104               end_step = step + data_handler.step_increment
    
    /opt/conda/lib/python3.7/site-packages/tensorflow/python/eager/context.py in async_wait()
       2328   an error state.
       2329   """
    -> 2330   context().sync_executors()
       2331 
       2332 
    
    /opt/conda/lib/python3.7/site-packages/tensorflow/python/eager/context.py in sync_executors(self)
        643     """
        644     if self._context_handle:
    --> 645       pywrap_tfe.TFE_ContextSyncExecutors(self._context_handle)
        646     else:
        647       raise ValueError("Context is not initialized.")
    
    UnavailableError: 4 root error(s) found.
      (0) Unavailable: {{function_node __inference_train_function_369954}} failed to connect to all addresses
    Additional GRPC error information from remote target /job:localhost/replica:0/task:0/device:CPU:0:
    :{"created":"@1629445773.854930794","description":"Failed to pick subchannel","file":"third_party/grpc/src/core/ext/filters/client_channel/client_channel.cc","file_line":4143,"referenced_errors":[{"created":"@1629445773.854928997","description":"failed to connect to all addresses","file":"third_party/grpc/src/core/ext/filters/client_channel/lb_policy/pick_first/pick_first.cc","file_line":398,"grpc_status":14}]}
         [[{{node MultiDeviceIteratorGetNextFromShard}}]]
         [[RemoteCall]]
         [[IteratorGetNextAsOptional]]
         [[Pad_2/paddings/_130]]
      (1) Unavailable: {{function_node __inference_train_function_369954}} failed to connect to all addresses
    Additional GRPC error information from remote target /job:localhost/replica:0/task:0/device:CPU:0:
    :{"created":"@1629445773.854930794","description":"Failed to pick subchannel","file":"third_party/grpc/src/core/ext/filters/client_channel/client_channel.cc","file_line":4143,"referenced_errors":[{"created":"@1629445773.854928997","description":"failed to connect to all addresses","file":"third_party/grpc/src/core/ext/filters/client_channel/lb_policy/pick_first/pick_first.cc","file_line":398,"grpc_status":14}]}
         [[{{node MultiDeviceIteratorGetNextFromShard}}]]
         [[RemoteCall]]
         [[IteratorGetNextAsOptional]]
         [[strided_slice_36/_238]]
      (2) Unavailable: {{function_node __inference_train_function_369954}} failed to connect to all addresses
    Additional GRPC error information from remote target /job:localhost/replica:0/task:0/device:CPU:0:
    :{"created":"@1629445773.854930794","description":"Failed to pick subchannel","file":"third_party/grpc/src/core/ext/filters/client_channel/client_channel.cc","file_line":4143,"referenced_errors":[{"created":"@1629445773.854928997","description":"failed to connect to all addresses","file":"third_party/grpc/src/core/ext/filters/client_channel/lb_policy/pick_first/pick_first.cc","file_line":398,"grpc_status":14}]}
         [[{{node MultiDeviceIteratorGetNextFromShard}}]]
         [[RemoteCall]]
         [[IteratorGetNextAsOptional]]
         [[IteratorGetNextAsOptional_3/_35]]
      (3) Unavailable: {{function_node __inference_train_function_369954}} failed to connect to all addresses
    Additional GRPC error information from remote target /job:localhost/replica:0/task:0/device:CPU:0:
    :{"created":"@1629445773.854930794","description":"Failed to pick subchannel","file":"third_party/grpc/src/core/ext/filters/client_channel/client_channel.cc","file_line":4143,"referenced_errors":[{"created":"@1629445773.854928997","description":"failed to connect to all addresses","file":"third_party/grpc/src/core/ext/filters/client_channel/lb_policy/pick_first/pick_first.cc","file_line":398,"grpc_status":14}]}
         [[{{node MultiDeviceIteratorGetNextFromShard}}]]
         [[RemoteCall]]
         [[IteratorGetNextAsOptional]]
    0 successful operations.
    5 derived errors ignored.
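A plausible reason for this failure, going by the tf.data documentation, is that Dataset.from_generator() needs its dataset and iterator ops to run in the same process as the Python program that created the generator, which a remote TPU worker cannot satisfy. One way to keep augmentation without a Python generator is to apply tf.image operations through Dataset.map. The sketch below is a different technique, not the asker's pipeline: the `augment` function, the shift of up to 2 pixels, and the random placeholder arrays standing in for `DS1` and `L1` are all assumptions made for illustration.

```python
import numpy as np
import tensorflow as tf


def augment(image, label):
    """Randomly shift a 28x28x1 image by up to 2 pixels in each direction."""
    # Zero-pad to 32x32, then crop a random 28x28 window back out.
    padded = tf.image.resize_with_crop_or_pad(image, 32, 32)
    shifted = tf.image.random_crop(padded, size=(28, 28, 1))
    return shifted, label


# Placeholder arrays standing in for DS1 and L1 (random values, not real digits).
rng = np.random.default_rng(0)
DS1 = rng.random((512, 28, 28, 1), dtype=np.float32)
L1 = tf.one_hot(rng.integers(0, 10, size=512), 10).numpy()

ds1 = (
    tf.data.Dataset.from_tensor_slices((DS1, L1))
    .shuffle(len(DS1))
    .map(augment, num_parallel_calls=tf.data.AUTOTUNE)
    .batch(128, drop_remainder=True)
    .prefetch(tf.data.AUTOTUNE)
)

batch_images, batch_labels = next(iter(ds1))
```

Because `augment` is built only from TensorFlow ops, the whole pipeline is part of the graph and does not call back into Python for each batch. This has not been verified against the asker's notebook, so treat it as a starting point.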
    

    【Comments】:
