How to split a TensorFlow dataset into train, test and validation in a Python script?

【Question title】: How to split a tensorflow dataset into train, test and validation in a Python script?
【Posted】: 2021-02-03 15:45:39
【Question】:

In a Jupyter notebook with TensorFlow 2.0.0, I performed an 80-10-10 train-validation-test split this way:

import tensorflow_datasets as tfds
from os import getcwd
splits = tfds.Split.ALL.subsplit(weighted=(80, 10, 10))

filePath = f"{getcwd()}/../tmp2/"
splits, info = tfds.load('fashion_mnist', with_info=True, as_supervised=True, split=splits, data_dir=filePath)

However, when I try to run the same code locally, I get this error:

AttributeError: type object 'Split' has no attribute 'ALL'

I have seen that I can create two sets this way:

splits, info = tfds.load('fashion_mnist', with_info=True, as_supervised=True, split=['train[:80]','test[80:90]'], data_dir=filePath)

But I don't know how to add a third set.

【Comments】:

Tags: python tensorflow tensorflow-datasets train-test-split

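【Solution 1】:

tfds.Split.ALL.subsplit and tfds.Split.TRAIN.subsplit are apparently deprecated and no longer supported.

Some datasets already come split into train and test. For that case I found the following solution (using the Fashion MNIST dataset as an example):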

splits, info = tfds.load('fashion_mnist', with_info=True, as_supervised=True,
split=['train+test[:80]','train+test[80:90]', 'train+test[90:]'],
data_dir=filePath)
(train_examples, validation_examples, test_examples) = splits


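Edit after comments

The previous code had some errors. First, the official documentation says:

Full dataset ('all'): 'all' is a special split name corresponding to the union of all splits (equivalent to 'train+test+...').

But it did not work when I tried it. 'all' would have helped, but there are other options. The mistake in the previous code was that % must be used, and the slice must be specified for each split separately. I changed the code as follows: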

import tensorflow_datasets as tfds
splits, info = tfds.load('fashion_mnist', with_info=True, as_supervised=True,
split=['train[:80%]+test[:80%]','train[80%:90%]+test[80%:90%]', 'train[90%:]+test[90%:]'],
data_dir='./')
#(train_examples, validation_examples, test_examples) = splits

for el in splits:
    print(el.cardinality())

which prints:

tf.Tensor(56000, shape=(), dtype=int64)
tf.Tensor(7000, shape=(), dtype=int64)
tf.Tensor(7000, shape=(), dtype=int64)
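As a side note, writing the same percent range on every split by hand is easy to get wrong. The following is only a minimal sketch of how the three split strings used above could be generated; `build_percent_splits` is a hypothetical helper written for this illustration, not part of tensorflow_datasets, and it assumes every subset should take the same percent range from each named split.

```python
def build_percent_splits(split_names, percentages):
    """Build one tfds-style split string per subset.

    Hypothetical helper (not a tfds API): each subset takes the same
    percent range from every split listed in split_names.
    """
    if sum(percentages) != 100:
        raise ValueError("percentages must sum to 100")
    specs = []
    start = 0
    for pct in percentages:
        end = start + pct
        # Leave the bound empty at 0% and 100%, as in 'train[:80%]' and 'train[90%:]'.
        lo = "" if start == 0 else f"{start}%"
        hi = "" if end == 100 else f"{end}%"
        specs.append("+".join(f"{name}[{lo}:{hi}]" for name in split_names))
        start = end
    return specs


specs = build_percent_splits(["train", "test"], [80, 10, 10])
for spec in specs:
    print(spec)
```

The resulting list can be passed as the `split=` argument of `tfds.load`, exactly like the hand-written list in the code above.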

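【Comments】:

This doesn't really work. 'train+test[:80]', for example, takes 100% of train and 80% of test; it does not take 80% of the combined train + test.
Also, if you don't add % you are not using a percentage split (so in your case you take the whole train split plus the first 80 examples of test as training examples).
I removed my downvote and upvoted you :) Thanks for fixing this. I can confirm that the 'all' special split name does not work (I checked the GitHub code; they reverted the pull request). So for now I guess the 'all' flag appears only in the documentation, not in the code itself.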
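The point made in these comments can be checked with plain arithmetic, without downloading anything. The sketch below assumes the Fashion MNIST split sizes (60,000 train and 10,000 test examples) and assumes, as the comments describe, that a slice applies only to the split it is attached to. The two helper functions are hypothetical and only mimic the counting; they do not call tfds. All boundaries here fall on whole numbers, so any rounding behaviour is not exercised.

```python
TRAIN_SIZE = 60_000  # Fashion MNIST train split
TEST_SIZE = 10_000   # Fashion MNIST test split


def absolute_slice_len(size, start, stop):
    """Number of examples selected by an absolute slice such as [80:90]."""
    return len(range(size)[start:stop])


def percent_slice_len(size, start_pct, stop_pct):
    """Number of examples selected by a percent slice such as [80%:90%]."""
    return size * stop_pct // 100 - size * start_pct // 100


# First attempt: 'train+test[:80]', 'train+test[80:90]', 'train+test[90:]'
# Each subset is the whole train split plus a few absolute test indices.
first_attempt = [
    TRAIN_SIZE + absolute_slice_len(TEST_SIZE, 0, 80),
    TRAIN_SIZE + absolute_slice_len(TEST_SIZE, 80, 90),
    TRAIN_SIZE + absolute_slice_len(TEST_SIZE, 90, None),
]

# Corrected version: the same percent range applied to train and to test.
corrected = [
    percent_slice_len(TRAIN_SIZE, lo, hi) + percent_slice_len(TEST_SIZE, lo, hi)
    for lo, hi in [(0, 80), (80, 90), (90, 100)]
]

print(first_attempt)  # [60080, 60010, 69910]
print(corrected)      # [56000, 7000, 7000]
```

The corrected counts agree with the cardinalities printed in the edited answer, while the first attempt puts the entire train split into all three subsets.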

【Solution 2】:

For the rock_paper_scissors dataset on tfds, this worked for me:

splits = ['train+test[:80]', 'train+test[80:90]', 'train+test[90:]']

splits, info = tfds.load( 'rock_paper_scissors', split=splits, as_supervised=True, with_info=True)

(train_examples, validation_examples, test_examples) = splits

num_examples = info.splits['train'].num_examples
num_classes = info.features['label'].num_classes

【Comments】:

@Francesco Boi Please check this answer.
@Nikita I don't see what additional information this answer provides compared with my old answer.