How to import EMNIST letters to Keras from file: answers

【Question title】: How to import EMNIST letters to Keras from file
【Posted】: 2019-11-24 13:48:27
【Question description】:

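I am trying to import the EMNIST Letters dataset into an artificial intelligence program I have created (written in Python), but I cannot seem to do it correctly. How should I import it into the following program?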

...
# Import Statements
...


emnist = spio.loadmat("EMNIST/emnist-letters.mat")
...

# The problems appear to originate below--I am trying to set these variables to the corresponding parts of the EMNIST dataset and cannot succeed

x_train = emnist["dataset"][0][0][0][0][0][0]
x_train = x_train.astype(np.float32)

y_train = emnist["dataset"][0][0][0][0][0][1]

x_test = emnist["dataset"][0][0][1][0][0][0]
x_test = x_test.astype(np.float32)

y_test = emnist["dataset"][0][0][1][0][0][1]

train_labels = y_train
test_labels = y_test

x_train /= 255
x_test /= 255

x_train = x_train.reshape(x_train.shape[0], 1, 28, 28, order="A")
x_test = x_test.reshape(x_test.shape[0], 1, 28, 28, order="A")

y_train = keras.utils.to_categorical(y_train, 10)
y_test = keras.utils.to_categorical(y_test, 10)

# Does not work:
plt.imshow(x_train[54000][0], cmap='gray')
plt.show()

# Compilation and Fitting
...

I did not expect any error message at all, but I received this one:

Traceback (most recent call last):
  File "OCIR_EMNIST.py", line 61, in <module>
    y_train = keras.utils.to_categorical(y_train, 10)
  File "/home/user/.local/lib/python3.7/site-packages/keras/utils/np_utils.py", line 34, in to_categorical
    categorical[np.arange(n), y] = 1
IndexError: index 23 is out of bounds for axis 1 with size 10

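Edit: The MNIST dataset is not suitable for this project, because it does not contain handwritten letters; it contains only handwritten digits.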

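【Question comments】:

Tags: python tensorflow machine-learning keras mnist

【Solution 1】:

MNIST is a classic case for learning machine learning and data mining. Here is the code I used to load MNIST when comparing the performance of a CNN, SVR, and decision trees.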

import os
import gzip
import numpy as np


def load_mnist(path, kind='train'):
    """Load MNIST data from `path`"""
    labels_path = os.path.join(path,
                               '%s-labels-idx1-ubyte.gz'
                               % kind)
    images_path = os.path.join(path,
                               '%s-images-idx3-ubyte.gz'
                               % kind)

    # The label file has an 8-byte header before the label bytes.
    with gzip.open(labels_path, 'rb') as lbpath:
        labels = np.frombuffer(lbpath.read(), dtype=np.uint8,
                               offset=8)

    # The image file has a 16-byte header; each image is 28 * 28 = 784 bytes.
    with gzip.open(images_path, 'rb') as imgpath:
        images = np.frombuffer(imgpath.read(), dtype=np.uint8,
                               offset=16).reshape(len(labels), 784)

    return images, labels

With this dataset reader, you can load the dataset by calling the `load_mnist` function alone, which keeps your code tidy.
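For anyone wondering where the offsets 8 and 16 in the reader come from, here is a minimal sketch of parsing the IDX file header. It assumes the usual IDX layout (a big-endian 4-byte magic number whose last byte is the number of dimensions, followed by one 4-byte size per dimension). The helper names `read_idx_header` and `header_size` are made up for illustration and are not part of any library.

```python
import struct


def read_idx_header(raw):
    """Return (magic, dims) parsed from the start of an IDX byte string."""
    # The magic number is big-endian; its lowest byte is the dimension count.
    magic, = struct.unpack(">I", raw[:4])
    ndim = magic & 0xFF
    # One big-endian unsigned 32-bit size follows for each dimension.
    dims = struct.unpack(">" + "I" * ndim, raw[4:4 + 4 * ndim])
    return magic, dims


def header_size(raw):
    """Number of bytes to skip before the data starts."""
    _, dims = read_idx_header(raw)
    return 4 + 4 * len(dims)


# Hand-built headers for illustration, not read from the real files.
label_bytes = struct.pack(">II", 2049, 3) + bytes([5, 0, 4])
image_header = struct.pack(">IIII", 2051, 2, 28, 28)

print(read_idx_header(label_bytes))  # (2049, (3,))
print(header_size(label_bytes))      # 8
print(header_size(image_header))     # 16
```

A label file has one dimension (8 header bytes) and an image file has three (16 header bytes), which matches the `offset` values passed to `np.frombuffer`.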

Or you can simply load it with the Keras datasets module. Details can be found in the Keras documentation.

from keras.datasets import mnist
(x_train, y_train), (x_test, y_test) = mnist.load_data()

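I hope this helps.

【Comments】:

Please see my edit above.

【Solution 2】:

I am not familiar with the EMNIST dataset, but after some research I found that it directly matches the MNIST dataset, found at this link. Since it is the same dataset, I suggest you just use MNIST, although I do not know whether you need this dataset for a specific reason. Using the MNIST dataset through keras is simple: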

import keras

mnist = keras.datasets.mnist  # the MNIST dataset module
(x_train, y_train), (x_test, y_test) = mnist.load_data()  # training and test splits
x_train = x_train / 255.0  # scale pixel values to [0, 1]
x_test = x_test / 255.0

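This normalizes the data points before you send them through whichever machine learning method you want to use. Note that y_train and y_test are just the labels.

I hope this helps; you should get the same dataset in a shorter and easier way.

Edit: Since you are looking for a database of letters rather than just digits, I suggest getting the dataset from this link. The letter-recognition.data file should be the one you can use. It contains the letters, along with 16 features describing each letter. You can then load it as a CSV file, partition the data for training/validation, and run some type of ML on it (I have trained an ANN on this dataset). Note that you may need to convert the letters in the downloaded data file to numeric values for your ground truth (A=0, B=1, ..., Z=25).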
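The letter-to-number conversion and the training/validation partition described above could look roughly like this. It is only a sketch: it assumes each row is laid out as the letter followed by 16 comma-separated integer features, the function names are made up, and the sample rows are synthetic rather than taken from the real file.

```python
import csv
import io


def letter_to_index(letter):
    """Map 'A'..'Z' to 0..25."""
    index = ord(letter.strip().upper()) - ord("A")
    if not 0 <= index < 26:
        raise ValueError("not a letter A-Z: %r" % letter)
    return index


def load_rows(handle):
    """Read rows of 'letter,f1,...,f16' into (features, labels) lists."""
    features, labels = [], []
    for row in csv.reader(handle):
        if not row:
            continue  # skip blank lines
        labels.append(letter_to_index(row[0]))
        features.append([int(value) for value in row[1:]])
    return features, labels


def split(features, labels, train_fraction=0.8):
    """Split in file order; shuffle first if the file is sorted by class."""
    cut = int(len(labels) * train_fraction)
    return (features[:cut], labels[:cut]), (features[cut:], labels[cut:])


# Synthetic rows for illustration only, not taken from the real file.
SAMPLE = "\n".join(
    letter + "," + ",".join(str((i + k) % 16) for k in range(16))
    for i, letter in enumerate("ABZCA")
)

features, labels = load_rows(io.StringIO(SAMPLE))
(train_x, train_y), (val_x, val_y) = split(features, labels)
print(labels)                    # [0, 1, 25, 2, 0]
print(len(train_y), len(val_y))  # 4 1
```

To read the downloaded file instead of the sample string, pass an open file handle to `load_rows`.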

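【Comments】:

Please see my edit above.

【Solution 3】:

Maybe you should take a look at: https://github.com/christianversloot/extra_keras_datasets

It is not a popular library (at the time of writing) and I have not tried it, but it seems easy to use and well documented.

To load the EMNIST dataset with it, you can proceed just as you would with Keras: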

from extra_keras_datasets import emnist
(input_train, target_train), (input_test, target_test) = emnist.load_data(type='balanced')
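Whichever loader is used, the labels still have to be one-hot encoded with a class count that matches the data. The traceback in the question (index 23 out of bounds for size 10) suggests that is where the original program fails: letter labels go well past 9. Below is a minimal NumPy sketch, assuming the Letters split uses 1-based labels from 1 to 26; `one_hot_letters` is a hypothetical helper, not a Keras or extra_keras_datasets function.

```python
import numpy as np

NUM_LETTERS = 26  # assumption: the Letters split uses labels 1..26


def one_hot_letters(labels, num_classes=NUM_LETTERS):
    """One-hot encode 1-based letter labels (hypothetical helper)."""
    # Flatten column vectors such as shape (n, 1), widen the dtype so that
    # uint8 labels cannot underflow, then shift to 0-based indices.
    indices = np.asarray(labels).reshape(-1).astype(np.int64) - 1
    if indices.size and (indices.min() < 0 or indices.max() >= num_classes):
        raise ValueError("label outside 1..%d" % num_classes)
    encoded = np.zeros((indices.size, num_classes), dtype=np.float32)
    encoded[np.arange(indices.size), indices] = 1.0
    return encoded


y = one_hot_letters([[1], [23], [26]])
print(y.shape)                    # (3, 26)
print(y.argmax(axis=1).tolist())  # [0, 22, 25]
```

Under the same assumption, the Keras call from the question would become `keras.utils.to_categorical(y_train - 1, 26)` instead of passing 10 classes.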

【Comments】: