Converting TensorFlow tutorial to work with my own data (answers)

【Question title】: Converting TensorFlow tutorial to work with my own data
【Posted】: 2017-07-07 05:59:08
【Question】:

This is a follow-up to my previous question, Converting from Pandas dataframe to TensorFlow tensor object.

I am now on the next step and need more help. I am trying to replace this line of code

batch = mnist.train.next_batch(100)

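with my own data. I found this answer on StackOverflow: Where does next_batch in the TensorFlow tutorial batch_xs, batch_ys = mnist.train.next_batch(100) come from? But I don't understand:

1) Why .next_batch() doesn't work on my tensor. Did I create it incorrectly?

2) How to implement the pseudocode given in the answer to the .next_batch() question.

I currently have two tensor objects: one holding the parameters I want to train the model on (dataVar_tensor) and one holding the correct results (depth_tensor). I obviously need to keep them paired so that each set of parameters stays with its correct response.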
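For context, next_batch in the tutorial is a method of the MNIST dataset helper class rather than of tf.Tensor, which is why calling it on a tensor fails. Below is a minimal sketch of a next_batch-style helper, assuming the data is still available as NumPy arrays (for example before it was converted to tensors). BatchFeeder, its arguments, and the sample arrays are hypothetical names and values used only for illustration; this is not the tutorial's implementation.

```python
import numpy as np


class BatchFeeder:
    """Hypothetical helper (not part of TensorFlow) that mimics next_batch."""

    def __init__(self, features, labels, seed=0):
        features = np.asarray(features)
        labels = np.asarray(labels)
        if len(features) != len(labels):
            raise ValueError("features and labels must have the same number of rows")
        self._features = features
        self._labels = labels
        self._rng = np.random.RandomState(seed)
        self._order = self._rng.permutation(len(features))
        self._pos = 0

    def next_batch(self, batch_size):
        if batch_size > len(self._features):
            raise ValueError("batch_size is larger than the dataset")
        # Reshuffle once the current pass cannot supply a full batch.
        if self._pos + batch_size > len(self._order):
            self._order = self._rng.permutation(len(self._features))
            self._pos = 0
        idx = self._order[self._pos:self._pos + batch_size]
        self._pos += batch_size
        # The same index array selects rows from both arrays,
        # so each feature row stays paired with its label.
        return self._features[idx], self._labels[idx]


# Made-up data: 10 rows of 2 features, label = sum of the row.
data = np.arange(20, dtype=np.float32).reshape(10, 2)
depth = data.sum(axis=1)

feeder = BatchFeeder(data, depth)
batch_x, batch_y = feeder.next_batch(4)
```

The returned arrays could then be passed through feed_dict to placeholders, in the same place where the tutorial uses the result of mnist.train.next_batch(100).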

Could you please take the time to help me understand what is going on and replace this line of code?

Many thanks

【Comments】:

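Just saw your update to the previous post. Glad to see you got it working. It seems you are going about this backwards: first loading the data from CSV into a DataFrame, then trying to read batches out of the DataFrame? My impression is that the "typical" TF way is to read directly from the CSV file, since TF already has a lot of useful queueing/shuffling/batching functionality built in for that.
Take a look at the mechanism we discussed earlier with someone about reading rows from multiple CSV files. Hopefully it is clear enough: stackoverflow.com/questions/42175609/…
By the way, this would avoid your problem of converting DataFrames to tensors, because everything is sliced and loaded from the CSV straight into tensors, and that happens on demand rather than up front, which saves resources.
@VS_FF I have a text file containing the variables I want to train on, the expected results, and a bunch of other stuff. Are you saying I can do all of the data splitting and preparation directly in TensorFlow? To be honest, I didn't fully understand your example in the other thread.
Yes, it does all of the following: it reads the text line by line and splits each line into a set of observations, the label for those observations, and some other items used for monitoring. TF then packs the line-read operations into batches of a given size and randomizes the sampling, so the file is not read sequentially but sampled at random. The only catch is that it expects a CSV file; I assume yours is also comma- or space-delimited?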

Tags: python tensorflow

【Solution 1】:

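I stripped out the irrelevant parts while preserving formatting and indentation. Hopefully it should be clear now. The code below reads a CSV file in batches of N rows (N is specified in a constant at the top). Each row contains a date (first cell), then a list of floats (480 cells), and a one-hot vector (3 cells). The code then simply prints these dates, floats, and one-hot vectors as it reads them. The place where it prints them is normally where you would actually run your model and feed these values in place of the placeholder variables.

Keep in mind that here it reads every cell as a string and then converts specific cells in the row to floats, simply because the first cell is easier to read as a string. If all your data is numeric, just set the defaults to floats/ints instead of 'a' and get rid of the code that converts strings to floats. The conversion is not needed otherwise!

I added some comments to clarify what it is doing. Let me know if anything is unclear.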

import tensorflow as tf

fileName = 'YOUR_FILE.csv'

try_epochs = 1
batch_size = 3

TD = 1 # this is my date-label for each row, for internal purposes
TS = 480 # this is the list of features, 480 in this case
TL = 3 # this is one-hot vector of 3 representing the label

# set defaults to something (TF requires defaults for the number of cells you are going to read)
rDefaults = [['a'] for row in range((TD+TS+TL))]

# function that reads the input file, line-by-line
def read_from_csv(filename_queue):
    reader = tf.TextLineReader(skip_header_lines=0) # my file has no header row, so skip nothing
    _, csv_row = reader.read(filename_queue) # read one line
    data = tf.decode_csv(csv_row, record_defaults=rDefaults) # use defaults for this line (in case of missing data)
    dateLbl = tf.slice(data, [0], [TD]) # first cell is my 'date-label' for internal purposes
    features = tf.string_to_number(tf.slice(data, [TD], [TS]), tf.float32) # the next 480 cells are the list of features
    label = tf.string_to_number(tf.slice(data, [TD+TS], [TL]), tf.float32) # the remaining 3 cells are the one-hot label
    return dateLbl, features, label

# function that packs each read line into batches of specified size
def input_pipeline(fName, batch_size, num_epochs=None):
    filename_queue = tf.train.string_input_producer(
        [fName],
        num_epochs=num_epochs,
        shuffle=True)  # shuffles the order of the files, not the lines within a file
    dateLbl, features, label = read_from_csv(filename_queue)
    min_after_dequeue = 10000 # min number of rows kept in the queue after a dequeue, so batches are well mixed
    capacity = min_after_dequeue + 3 * batch_size # max number of rows held in the queue at once
    # this packs the above lines into a batch of size you specify:
    dateLbl_batch, feature_batch, label_batch = tf.train.shuffle_batch(
        [dateLbl, features, label], 
        batch_size=batch_size,
        capacity=capacity,
        min_after_dequeue=min_after_dequeue)
    return dateLbl_batch, feature_batch, label_batch

# these are the date label, features, and label:
dateLbl, features, labels = input_pipeline(fileName, batch_size, try_epochs)

with tf.Session() as sess:

    sess.run(tf.global_variables_initializer())
    sess.run(tf.local_variables_initializer()) # needed for the num_epochs counter

    coord = tf.train.Coordinator()
    threads = tf.train.start_queue_runners(coord=coord)

    try:
        while not coord.should_stop():
            # load date-label, features, and label:
            dateLbl_batch, feature_batch, label_batch = sess.run([dateLbl, features, labels])

            print(dateLbl_batch)
            print(feature_batch)
            print(label_batch)
            print('----------')

    except tf.errors.OutOfRangeError:
        print("Done looping through the file")

    finally:
        coord.request_stop()

    coord.join(threads)
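
The slice offsets used in read_from_csv can be checked without TensorFlow. The following is a minimal plain-Python sketch of the same row layout (1 date cell, 480 feature cells, 3 one-hot cells); split_row and the synthetic row are hypothetical, made up for illustration, and stand in for a real line of YOUR_FILE.csv.

```python
import csv
import io

TD, TS, TL = 1, 480, 3  # same layout constants as in the code above


def split_row(cells):
    """Split one parsed CSV row into (date_label, features, one_hot_label)."""
    expected = TD + TS + TL
    if len(cells) != expected:
        raise ValueError("expected %d cells, got %d" % (expected, len(cells)))
    date_label = cells[:TD]                            # mirrors tf.slice(data, [0], [TD])
    features = [float(c) for c in cells[TD:TD + TS]]   # mirrors tf.slice(data, [TD], [TS])
    label = [float(c) for c in cells[TD + TS:]]        # mirrors tf.slice(data, [TD+TS], [TL])
    return date_label, features, label


# Build one synthetic row (made-up values) instead of reading a real file.
row_text = ",".join(["2017-07-07"] + ["0.5"] * TS + ["0", "1", "0"])
cells = next(csv.reader(io.StringIO(row_text)))

date_label, features, label = split_row(cells)
print(date_label, len(features), label)
```

If a row of the real file does not have exactly TD+TS+TL cells, this check fails immediately, which is usually easier to diagnose than an error raised inside the queue runners.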

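【Discussion】:

I think I understand what is happening in this code, thank you. I have been able to get it running on my data. However, I can't see how to edit the code so that I can filter on certain values. For example, in the current case I am only interested in rows where ActualIE = 1. Can this be done? Thanks again for taking the time to help me.
Maybe experiment a bit with the return value of the first function defined? The code can certainly evaluate a condition such as "ActualIE==1", whatever that may be. The part I'm not sure about is whether, for example, tf.train.shuffle_batch would understand that it needs to skip a given row if that function returns null or something along those lines.
I tried adding a while loop to the "read_from_csv" function, but I can't access the numbers inside the tensor object to use as the loop condition. Is there a simple way to do this?
I think it would probably be easier and cleaner to do the filtering outside TF using a DataFrame. That way you let TF do what it is really meant to do. For example, create the original DF = pandas.read_csv(), then another DF_1 = DF[YourCondition == True], then save it with DF_1.to_csv()?
Oh yes, that's such a simple solution, and I didn't think of it! :)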
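
The pandas-based filtering suggested in the last comments might look roughly like the sketch below. Only the ActualIE column name comes from the discussion; the other column names and all values are made up, and an in-memory buffer stands in for the real input and output files.

```python
import io

import pandas as pd

# Made-up sample standing in for the real text file.
raw = io.StringIO(
    "ActualIE,feature_a,depth\n"
    "1,0.10,12.5\n"
    "0,0.20,13.0\n"
    "1,0.30,14.2\n"
)

df = pd.read_csv(raw)

# Boolean indexing keeps only the rows where ActualIE equals 1.
filtered = df[df["ActualIE"] == 1]

# Write without index or header, since the TF reader in the answer
# assumes a file with no header row.
out = io.StringIO()
filtered.to_csv(out, index=False, header=False)
csv_text = out.getvalue()
print(csv_text)
```

With real data, the StringIO objects would be replaced by file paths, and the filtered file would then be the one passed to input_pipeline.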