Splitting the data into training and testing in federated learning

【Question title】: Splitting the data into training and testing in federated learning
【Posted】: 2022-03-10 21:35:39
【Question】:

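I am new to federated learning and am currently experimenting with a model by following the official TFF documentation. I have run into a problem and hope to find some explanation here.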

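I am using my own dataset. The data is spread over multiple files, and each file is a single client (this is how I plan to build the model). The dependent and independent variables have already been defined.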

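Now, my question is how to split the data into training and testing sets within each client (file) in federated learning, as we usually do in centralized ML models. The following code is what I have implemented so far. Note that my code is inspired by the official documentation and by a post whose application is almost the same as mine, except that it splits the clients themselves into training and testing clients, whereas my goal is to split the data inside each client.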

dataset_paths = {
  'client_0': '/content/drive/MyDrive/Colab Notebooks/1.csv',
  'client_1': '/content/drive/MyDrive/Colab Notebooks/2.csv',
  'client_2': '/content/drive/MyDrive/Colab Notebooks/3.csv'
}
record_defaults = [int(), int(), int(), int(), float(),float(),float(),
                   float(),float(),float(), int(), int(),float(),float(),int()]

@tf.function
def create_tf_dataset_for_client_fn(dataset_path):
   return tf.data.experimental.CsvDataset(dataset_path,
                                          record_defaults=record_defaults,
                                          header=True)

@tf.function
def add_parsing(dataset):
  def parse_dataset(*x):
    ## 'x' holds the dependent variable and 'y' the independent variables
    return OrderedDict([('x', x[-1]), ('y', x[1:-1])])
  return dataset.map(parse_dataset, num_parallel_calls=tf.data.AUTOTUNE)

source = tff.simulation.datasets.FilePerUserClientData(
  dataset_paths, create_tf_dataset_for_client_fn) 

source = source.preprocess(add_parsing)
## Create the dataset for a single client (note: 0-2 evaluates to -2, the second-to-last client id)
dataset_creation=source.create_tf_dataset_for_client(source.client_ids[0-2])
print(dataset_creation)
>>> _VariantDataset element_spec=OrderedDict([('x', TensorSpec(shape=(), dtype=tf.int32, name=None)), ('y', (TensorSpec(shape=(), dtype=tf.int32, name=None), TensorSpec(shape=(), dtype=tf.int32, name=None), TensorSpec(shape=(), dtype=tf.int32, name=None), TensorSpec(shape=(), dtype=tf.float32, name=None), TensorSpec(shape=(), dtype=tf.float32, name=None), TensorSpec(shape=(), dtype=tf.float32, name=None), TensorSpec(shape=(), dtype=tf.float32, name=None), TensorSpec(shape=(), dtype=tf.float32, name=None), TensorSpec(shape=(), dtype=tf.float32, name=None), TensorSpec(shape=(), dtype=tf.int32, name=None)))])>
## Convert one element to NumPy values (I think this is necessary for splitting into training and testing sets)
test= tf.nest.map_structure(lambda x: x.numpy(),next(iter(dataset_creation)))
print(test)
>>> OrderedDict([('x', 1), ('y', (0, 1, 9, 85.0, 7.75, 85.0, 95.0, 75.0, 50.0, 6))])

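My understanding of supervised machine learning is that the data is split into training and testing sets, as in the code below. I am not sure how to do this in federated learning, or whether it would work this way at all.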

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.4, random_state = 42)

So I am looking for an explanation of this problem so that I can move on to the training phase.

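【Question discussion】:

Please edit the question to limit it to a specific problem with enough detail to identify an adequate answer.
Do you have an example of which data you want to split?
Hi @AloneTogether, I have updated the question. I hope it is clear now!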

Tags: python tensorflow tensorflow-datasets tensorflow-federated

【Solution 1】:

See this tutorial. You should be able to create two datasets (train and test) based on the clients and their data:

import tensorflow as tf
import tensorflow_federated as tff
from collections import OrderedDict

record_defaults = [int(), int(), int(), int(), float(),float(),float(),float(),float(),float(), int(), int()]

@tf.function
def create_tf_dataset_for_client_fn(dataset_path):
   return tf.data.experimental.CsvDataset(dataset_path, record_defaults=record_defaults, header=True)
   
@tf.function
def add_parsing(dataset):
  def parse_dataset(*x):
    # The label is a single column (assumed here to be the last one), not the slice x[:-1].
    return OrderedDict([('label', x[-1]), ('features', x[1:-1])])
  return dataset.map(parse_dataset, num_parallel_calls=tf.data.AUTOTUNE)

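def split_train_test(client_ids, train_fraction=0.8):
  # Returns two lists with one entry per client: that client's training rows
  # and that client's test rows.
  train, test = [], []
  for client_id in client_ids:
    d = source.create_tf_dataset_for_client(client_id)
    # Count the rows by folding over the dataset: start at 0, add 1 per row.
    d_length = d.reduce(0, lambda count, _: count + 1).numpy()
    # Shuffle once and keep that order. By default the dataset is reshuffled
    # on every iteration, so take() and skip() could return overlapping rows.
    d = d.shuffle(d_length, seed=42, reshuffle_each_iteration=False)
    train_size = int(d_length * train_fraction)
    train.append(list(d.take(train_size)))  # first 80% of the shuffled rows
    test.append(list(d.skip(train_size)))   # remaining 20%
  return train, test

dataset_paths = {'client1': '/content/client1.csv', 'client2': '/content/client2.csv', 
                 'client3': '/content/client2.csv', 'client4': '/content/client2.csv'}
source = tff.simulation.datasets.FilePerUserClientData(
  dataset_paths, create_tf_dataset_for_client_fn) 

client_ids = sorted(source.client_ids)

federated_train_data, federated_test_data = split_train_test(client_ids)
# Print the training rows of the first client.
print(*federated_train_data[0], sep='\n')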

(<tf.Tensor: shape=(), dtype=int32, numpy=24>, <tf.Tensor: shape=(), dtype=int32, numpy=17>, <tf.Tensor: shape=(), dtype=int32, numpy=27>, <tf.Tensor: shape=(), dtype=int32, numpy=4>, <tf.Tensor: shape=(), dtype=float32, numpy=0.17308392>, <tf.Tensor: shape=(), dtype=float32, numpy=1.889401>, <tf.Tensor: shape=(), dtype=float32, numpy=1.6235029>, <tf.Tensor: shape=(), dtype=float32, numpy=-0.56010467>, <tf.Tensor: shape=(), dtype=float32, numpy=-1.0171211>, <tf.Tensor: shape=(), dtype=float32, numpy=0.43558818>, <tf.Tensor: shape=(), dtype=int32, numpy=40>, <tf.Tensor: shape=(), dtype=int32, numpy=14>)
(<tf.Tensor: shape=(), dtype=int32, numpy=8>, <tf.Tensor: shape=(), dtype=int32, numpy=32>, <tf.Tensor: shape=(), dtype=int32, numpy=14>, <tf.Tensor: shape=(), dtype=int32, numpy=11>, <tf.Tensor: shape=(), dtype=float32, numpy=-0.91828436>, <tf.Tensor: shape=(), dtype=float32, numpy=0.29887632>, <tf.Tensor: shape=(), dtype=float32, numpy=-0.4598584>, <tf.Tensor: shape=(), dtype=float32, numpy=-1.1088414>, <tf.Tensor: shape=(), dtype=float32, numpy=-0.4057387>, <tf.Tensor: shape=(), dtype=float32, numpy=-2.1537204>, <tf.Tensor: shape=(), dtype=int32, numpy=15>, <tf.Tensor: shape=(), dtype=int32, numpy=45>)
(<tf.Tensor: shape=(), dtype=int32, numpy=11>, <tf.Tensor: shape=(), dtype=int32, numpy=17>, <tf.Tensor: shape=(), dtype=int32, numpy=17>, <tf.Tensor: shape=(), dtype=int32, numpy=2>, <tf.Tensor: shape=(), dtype=float32, numpy=0.93560874>, <tf.Tensor: shape=(), dtype=float32, numpy=-2.4382026>, <tf.Tensor: shape=(), dtype=float32, numpy=-1.7638668>, <tf.Tensor: shape=(), dtype=float32, numpy=0.65431964>, <tf.Tensor: shape=(), dtype=float32, numpy=-0.7130539>, <tf.Tensor: shape=(), dtype=float32, numpy=-0.96356>, <tf.Tensor: shape=(), dtype=int32, numpy=15>, <tf.Tensor: shape=(), dtype=int32, numpy=18>)
(<tf.Tensor: shape=(), dtype=int32, numpy=42>, <tf.Tensor: shape=(), dtype=int32, numpy=27>, <tf.Tensor: shape=(), dtype=int32, numpy=34>, <tf.Tensor: shape=(), dtype=int32, numpy=8>, <tf.Tensor: shape=(), dtype=float32, numpy=0.3965425>, <tf.Tensor: shape=(), dtype=float32, numpy=-0.2588629>, <tf.Tensor: shape=(), dtype=float32, numpy=-0.84179455>, <tf.Tensor: shape=(), dtype=float32, numpy=0.114052325>, <tf.Tensor: shape=(), dtype=float32, numpy=-0.9591451>, <tf.Tensor: shape=(), dtype=float32, numpy=0.94621265>, <tf.Tensor: shape=(), dtype=int32, numpy=28>, <tf.Tensor: shape=(), dtype=int32, numpy=7>)
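
The per-client split above boils down to three steps: shuffle a client's rows once, cut at 80%, and keep both parts per client. As a minimal sketch of that idea in plain Python (no TensorFlow involved; `split_client_rows`, the `clients` dict and its rows are hypothetical stand-ins for the CSV files), it could look like this. With `tf.data`, the analogous calls would be `take`/`skip` on a dataset shuffled with `reshuffle_each_iteration=False`.

```python
import random

def split_client_rows(rows, train_fraction=0.8, seed=42):
    """Shuffle one client's rows once, then cut them into train and test parts."""
    shuffled = list(rows)
    # One fixed shuffled order, so the two parts cannot overlap.
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

# Hypothetical clients with different numbers of rows (stand-ins for the CSV files).
clients = {
    'client1': [(i, i * 2) for i in range(5)],
    'client2': [(i, i * 3) for i in range(10)],
}

federated_train, federated_test = {}, {}
for client_id, rows in clients.items():
    federated_train[client_id], federated_test[client_id] = split_client_rows(rows)

for client_id in clients:
    print(client_id, len(federated_train[client_id]), len(federated_test[client_id]))
```

Because the cut point is computed from each client's own row count, clients of different sizes each keep roughly 80% of their rows for training.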

If you follow the tutorial I linked, you should be able to feed the split data directly to tff.learning.from_keras_model.
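
One caveat when doing so: the `input_spec` passed to `tff.learning.from_keras_model` must have exactly two top-level elements, the model inputs and the labels (as a dict, the keys must be `'x'` and `'y'`), whereas a raw CSV row like the ones printed above has one element per column. The sketch below shows the reshaping on a plain tuple. It assumes the label is the last column, which the thread does not settle, and `to_xy` and `raw_row` are hypothetical names.

```python
from collections import OrderedDict

def to_xy(row):
    """Split one raw row into model inputs ('x') and the ground-truth label ('y')."""
    *features, label = row
    return OrderedDict([('x', tuple(features)), ('y', label)])

# Hypothetical raw row: three feature columns followed by the label column.
raw_row = (24, 17, 0.17, 1)
example = to_xy(raw_row)
print(list(example.keys()))
```

In a real pipeline the same reshaping would be applied inside `dataset.map`, as the `add_parsing` function does.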

【Discussion】:

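Thanks for your answer. I would like to ask whether there is another way to append the training and test data: I want to take 80%/20% from each client, because the data size is not the same for all clients.
Does this mean it only takes the first rows and skips from the last rows, without shuffling or taking the data at random?
One last question: passing federated_train_data[0].element_spec to tff.learning.from_keras_model raises the error `The top-level structure in input_spec must contain exactly two top-level elements, as it must specify type information for both inputs to and predictions from the model.` What does this mean?
@Rayan Please ask a new question for this query, as it is unrelated to your original question and is hard to answer without more details.
Could you explain what d.reduce(0, lambda x,_: x+1).numpy() means? Why is it included in the code?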
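
On that last question: `d.reduce(0, lambda x,_: x+1)` counts the elements of the dataset. It is a fold that starts from 0 and adds 1 for every element while ignoring the element itself. The answer then uses the count as the shuffle buffer size and to work out how many rows go to `take` and `skip`, presumably because a dataset read from CSV does not know its length in advance. A plain-Python analogue with `functools.reduce` (the `rows` list is hypothetical, and the argument order differs from `tf.data`, which takes the initial state first) might look like this:

```python
from functools import reduce

# Hypothetical rows standing in for the elements of one client's dataset.
rows = [('a', 1), ('b', 2), ('c', 3), ('d', 4), ('e', 5)]

# Start from 0 and add 1 for every element, ignoring the element itself.
row_count = reduce(lambda count, _: count + 1, rows, 0)

# The count then determines how many rows go to the training split.
train_size = int(row_count * 0.8)
print(row_count, train_size)
```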