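[Posted]: 2021-10-12 14:17:52
[Question]:
I am currently working on a federated learning project using TensorFlow Federated. I was sending requests to the server to check that my code works when I got this error: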
RuntimeError: No default context installed.
You should not expect to get this error using the TFF API.
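However, I only get it under certain conditions.
The scenario is as follows (all code is below):
An HTTP request is sent from the website. The function upload_and_train in routes/developers.py handles the request. Inside it, start_processing is called to begin the training preprocessing (collecting the training data, initializing hyperparameters, and so on). Finally, federated_computation_new is called, which starts the federated learning and is also where it crashes. The crash happens at the call iterative_process.initialize().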
iterative_process = tff.learning.build_federated_averaging_process(model_fn,client_optimizer_fn=lambda: tf.keras.optimizers.SGD(lr=0.5))
state = iterative_process.initialize()
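The confusing part is this. If I run the code locally, everything goes smoothly and the training completes without errors. If I run it on the server, it also works for the first request. After that it crashes and returns the same error (detailed below) on every subsequent request until I restart the server. Then it again works perfectly on the first call and keeps crashing on the following ones.
This problem is driving me crazy and I cannot figure it out. My only remaining idea is that something happens after the first call (a process that is not closed, or something similar) so that subsequent calls do not get a "fresh" start. Even so, it should not happen in the first place.
The full error message is as follows: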
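One way to test the "no fresh start" suspicion is to log which process and thread serves each request. The following is a minimal sketch using only the standard library; `describe_execution_context` is a hypothetical helper name, not part of Flask or TFF.

```python
import os
import threading


def describe_execution_context() -> dict:
    """Return the process and thread identity of the caller."""
    thread = threading.current_thread()
    return {
        "pid": os.getpid(),
        "thread_name": thread.name,
        "thread_id": threading.get_ident(),
        "is_main_thread": thread is threading.main_thread(),
    }


if __name__ == "__main__":
    # Compare the calling thread with a worker thread, the way a threaded
    # server might run successive requests.
    print(describe_execution_context())
    results = []
    worker = threading.Thread(
        target=lambda: results.append(describe_execution_context())
    )
    worker.start()
    worker.join()
    print(results[0])
```

Logging this dictionary at the top of upload_and_train would show whether the first and later requests share a process but run in different threads.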
143.205.173.225 - - [12/Oct/2021 13:18:05] "POST /api/Developers/use_cases/text_processing/developer_id/3/upload_and_train HTTP/1.1" 500 -
INFO:werkzeug:143.205.173.225 - - [12/Oct/2021 13:18:05] "POST /api/Developers/use_cases/text_processing/developer_id/3/upload_and_train HTTP/1.1" 500 -
WARNING:tensorflow:Layer lstm will not use cuDNN kernel since it doesn't meet the cuDNN kernel criteria. It will use generic GPU kernel as fallback when running on GPU
ERROR:main:Exception on /api/Developers/use_cases/text_processing/developer_id/4/upload_and_train [POST]
Traceback (most recent call last):
  File "/home/itec/bogdan/Articonf/smart/tools/federated-training/app/venv/lib/python3.8/site-packages/flask/app.py", line 2447, in wsgi_app
    response = self.full_dispatch_request()
  File "/home/itec/bogdan/Articonf/smart/tools/federated-training/app/venv/lib/python3.8/site-packages/flask/app.py", line 1952, in full_dispatch_request
    rv = self.handle_user_exception(e)
  File "/home/itec/bogdan/Articonf/smart/tools/federated-training/app/venv/lib/python3.8/site-packages/flask/app.py", line 1821, in handle_user_exception
    reraise(exc_type, exc_value, tb)
  File "/home/itec/bogdan/Articonf/smart/tools/federated-training/app/venv/lib/python3.8/site-packages/flask/_compat.py", line 39, in reraise
    raise value
  File "/home/itec/bogdan/Articonf/smart/tools/federated-training/app/venv/lib/python3.8/site-packages/flask/app.py", line 1950, in full_dispatch_request
    rv = self.dispatch_request()
  File "/home/itec/bogdan/Articonf/smart/tools/federated-training/app/venv/lib/python3.8/site-packages/flask/app.py", line 1936, in dispatch_request
    return self.view_functions[rule.endpoint](**req.view_args)
  File "/home/itec/bogdan/Articonf/smart/tools/federated-training/app/venv/lib/python3.8/site-packages/connexion/decorators/decorator.py", line 48, in wrapper
    response = function(request)
  File "/home/itec/bogdan/Articonf/smart/tools/federated-training/app/venv/lib/python3.8/site-packages/connexion/decorators/uri_parsing.py", line 144, in wrapper
    response = function(request)
  File "/home/itec/bogdan/Articonf/smart/tools/federated-training/app/venv/lib/python3.8/site-packages/connexion/decorators/validation.py", line 384, in wrapper
    return function(request)
  File "/home/itec/bogdan/Articonf/smart/tools/federated-training/app/venv/lib/python3.8/site-packages/connexion/decorators/parameter.py", line 121, in wrapper
    return function(**kwargs)
  File "/home/itec/bogdan/Articonf/smart/tools/federated-training/app/routes/developers.py", line 46, in upload_and_train
    last_train_metrics = main_proc.start_processing(use_case,developer_id)
  File "processing/text_processing/main_proc.py", line 17, in start_processing
    state,metrics = federated_computation_new(train_dataset,test_dataset)
  File "processing/text_processing/federated_algorithm.py", line 29, in federated_computation_new
    state = iterative_process.initialize()
  File "/home/itec/bogdan/Articonf/smart/tools/federated-training/app/venv/lib/python3.8/site-packages/tensorflow_federated/python/core/impl/utils/function_utils.py", line 521, in __call__
    return context.invoke(self, arg)
  File "/home/itec/bogdan/Articonf/smart/tools/federated-training/app/venv/lib/python3.8/site-packages/tensorflow_federated/python/core/impl/context_stack/runtime_error_context.py", line 41, in invoke
    self._raise_runtime_error()
  File "/home/itec/bogdan/Articonf/smart/tools/federated-training/app/venv/lib/python3.8/site-packages/tensorflow_federated/python/core/impl/context_stack/runtime_error_context.py", line 23, in _raise_runtime_error
    raise RuntimeError(
RuntimeError: No default context installed.
You should not expect to get this error using the TFF API.
If you are getting this error when testing a module inside of `tensorflow_federated/python/core/...`, you may need to explicitly invoke `execution_contexts.set_local_execution_context()` in the `main` function of your test.
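The last line of the message names `set_local_execution_context()`. Below is a hedged sketch of calling it at the start of each request. It assumes that the installed TFF release exposes the function as `tff.backends.native.set_local_execution_context` (the public name has changed between releases) and that the missing default context is specific to the thread serving the request. Neither assumption is confirmed by the question. `ensure_local_context` and its `tff_module` parameter are hypothetical; passing the module in keeps the sketch importable without TFF.

```python
def ensure_local_context(tff_module):
    """Install a local execution context for the calling thread.

    Assumption: `tff_module` is the imported `tensorflow_federated` package
    and exposes `backends.native.set_local_execution_context`. Other releases
    may use a different name, in which case this raises RuntimeError.
    """
    installer = getattr(
        tff_module.backends.native, "set_local_execution_context", None
    )
    if installer is None:
        raise RuntimeError(
            "set_local_execution_context not found in this TFF version"
        )
    installer()
```

Under those assumptions, calling `ensure_local_context(tff)` as the first statement of federated_computation_new would be one thing to try.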
The first function, which handles the incoming request. The request carries 4 parameters: the 2 identifiers "use_case" and "developer_id", plus 2 formData files containing the training data, which are stored locally.
def upload_and_train(use_case: str, developer_id: int):
    use_case_path = 'processing/'+use_case+'/'
    sys.path.append(use_case_path)
    import main_proc
    app_path = dirname(dirname(abspath(__file__)))
    file_dict = request.files
    db_File_True = file_dict["dataset_file1"]
    db_File_Fake = file_dict["dataset_file2"]
    true_csv_path = os.path.join(app_path+"/"+use_case_path+"db/", "True.csv")
    fake_csv_path = os.path.join(app_path+"/"+use_case_path+"db/", "Fake.csv")
    db_File_True.save(true_csv_path)
    db_File_Fake.save(fake_csv_path)
    time.sleep(5) #wait for the files to be copied before proceeding
    #THEN start processing
    last_train_metrics = main_proc.start_processing(use_case,developer_id) # <============== GOES INTO HERE & CRASHES
    metricsJson = trainMetricsToJSON(last_train_metrics)
    return Response(status=200, response=metricsJson)
The function that starts the preprocessing:
def start_processing(use_case, developer_id:int = 0):
    globals.initialize(use_case,developer_id)
    globals.TRAINER_ID = developer_id
    train_dataset, test_dataset= get_preprocessed_train_test_data()
    state,metrics = federated_computation_new(train_dataset,test_dataset) # <============== GOES INTO HERE & CRASHES
    trained_metrics= metrics['train']
    timestamp = int(time.time())
    globals.DATASET_ID = timestamp
    written_row = save_to_file_CSV(use_case,globals.TRAINER_ID,timestamp,globals.DATASET_ID,trained_metrics['sparse_categorical_accuracy'],trained_metrics['loss'])
    return written_row
The function that performs the federated training:
def federated_computation_new(train_dataset,test_dataset):
    # Training and evaluating the model
    iterative_process = tff.learning.build_federated_averaging_process(model_fn,client_optimizer_fn=lambda: tf.keras.optimizers.SGD(lr=0.5))
    state = iterative_process.initialize() # <============== CRASHES HERE
    print(type(state))
    for n in range(globals.EPOCHS):
        state, metrics = iterative_process.next(state, train_dataset)
        print('round {}, training metrics={}'.format(n+1, metrics))
    evaluation = tff.learning.build_federated_evaluation(model_fn)
    eval_metrics = evaluation(state.model, train_dataset)
    print('Training evaluation metrics={}'.format(eval_metrics))
    test_metrics = evaluation(state.model, test_dataset)
    print('Test evaluation metrics={}'.format(test_metrics))
    #############################################################################################
    #Save Last Trained Model
    import pickle
    with open("processing/"+globals.USE_CASE+"/last_model",'wb') as f:
        pickle.dump(state, f)
    return state,metrics
def model_fn():
    keras_model = get_simple_LSTM_model()
    return tff.learning.from_keras_model(
        keras_model,
        input_spec=globals.INPUT_SPEC,
        loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
        metrics=[tf.keras.metrics.SparseCategoricalAccuracy()])
The function in /home/itec/bogdan/Articonf/smart/tools/federated-training/app/venv/lib/python3.8/site-packages/tensorflow_federated/python/core/impl/utils/function_utils.py, line 521:
def __call__(self, *args, **kwargs):
    context = self._context_stack.current
    arg = pack_args(self._type_signature.parameter, args, kwargs, context)
    return context.invoke(self, arg) # <============== This returns the runtime Error
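The `self._context_stack.current` lookup above suggests a stack with a fallback default. If that stack is stored per thread (an assumption about TFF internals that the question does not confirm), a default installed in one thread would be invisible in another, which would match the symptom of only the first request succeeding. Here is a self-contained sketch of that mechanism in plain Python, with the hypothetical names `ContextStack` and `install_default`. It is not TFF code.

```python
import threading


class ContextStack(threading.local):
    """Per-thread stack; every thread starts with the fallback default."""

    def __init__(self, fallback):
        # threading.local re-runs __init__ with the same arguments the first
        # time each new thread touches the object.
        self._stack = [fallback]

    @property
    def current(self):
        return self._stack[-1]

    def install_default(self, context):
        # Replaces the default only for the calling thread.
        self._stack[0] = context


stack = ContextStack("no-default-context")


def handle_request(results, install):
    """Simulate one request; optionally install a usable default first."""
    if install:
        stack.install_default("local-execution-context")
    results.append(stack.current)


def run_in_thread(install):
    """Run handle_request in a fresh thread and return the context it saw."""
    results = []
    worker = threading.Thread(target=handle_request, args=(results, install))
    worker.start()
    worker.join()
    return results[0]


if __name__ == "__main__":
    print(run_in_thread(install=True))   # thread that installs a default
    print(run_in_thread(install=False))  # later thread sees only the fallback
```

In this toy model the second thread never sees the context installed by the first, so any call that depends on `stack.current` would hit the fallback.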
Thank you very much for your time and patience.
[Discussion]:
Tags: python tensorflow runtime-error tensorflow-federated