Assigning variable length list data (from csv) to an 'indicator_column' feature

[Question title]: Assigning variable length list data (from csv) to an 'indicator_column' feature
[Posted]: 2020-09-02 20:10:50
[Question description]:

I have a feature defined as follows:

tf.feature_column.indicator_column(tf.feature_column.categorical_column_with_vocabulary_file(...))

The corresponding 'vocabulary_file' contains integer values like these:

1212

...

Consider training examples like these:

Jack, M, 22, "[10, 20]", 2.33, 1

Sara, F, 24, "[32, 44, 5, 1212]", 5.6, -1

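Each training example contains a variable-length list, e.g. [10, 20] or [32, 44, 5, 1212].

Now I want to read this data from the csv file into the 'indicator_column' feature and then feed the resulting multi-hot representation to a deep model. The decode_csv function only supports float32, float64, int32, int64 and string, so I have a problem with the 'list' type data in the csv.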

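System information:

OS platform: Win8, TensorFlow installed from: binary, TensorFlow version: 1.5, Python version: 3.6, Bazel version: N/A, CUDA/cuDNN version: N/A, GPU model and memory: GPU> none | CPU> AMD (Phenom II x4)

The exact command to reproduce is clear from the description.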

[Question discussion]:

Tags: python tensorflow machine-learning neural-network deep-learning

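[Solution 1]:

There are two problems here. First, the official CSV format has no notion of recursion, that is, of a single cell that is itself several values to be parsed.

If the inner list has a constant size, you can get what you want by calling decode_csv twice (IPython REPL with eager execution enabled):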

In [21]: a, b, c = tf.decode_csv(tf.constant('"Jack","10, 20",1'), ["", "", 0])

In [22]: tf.decode_csv(b, [0, 0])
Out[22]: 
[<tf.Tensor: id=113, shape=(), dtype=int32, numpy=10>,
 <tf.Tensor: id=114, shape=(), dtype=int32, numpy=20>]

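However, the official CSV format does not support variable-length data either: every row should have the same number of fields/columns.

Given all these limitations of CSV, I would suggest the following alternative (assuming you want to stay with text; if not, you could encode your data in TFRecord instead):

Use the tf.data API.
Use TextLineDataset to read lines from the file. See https://www.tensorflow.org/programmers_guide/datasets#consuming_text_data
Write your own line-parsing function in Python and call it with tf.py_func. See https://www.tensorflow.org/programmers_guide/datasets#applying_arbitrary_python_logic_with_tfpy_func.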
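The line-parsing function from the last step could look roughly like the following. This is only a minimal sketch in plain Python: the name parse_line is hypothetical, the field order is assumed from the two example rows in the question, and the tf.py_func wrapping itself is left out.

```python
import csv
import json


def parse_line(line):
    """Parse one row whose 4th field is a quoted list such as "[10, 20]".

    Hypothetical helper; the six-field layout is assumed from the
    example rows in the question.
    """
    # A function called through tf.py_func typically receives bytes.
    if isinstance(line, bytes):
        line = line.decode("utf-8")
    # skipinitialspace=True lets the reader accept the ', "' separators
    # used in the example rows, so the quoted list stays one field.
    row = next(csv.reader([line], skipinitialspace=True))
    name, gender, age, raw_list, score, label = row
    ids = json.loads(raw_list)  # "[10, 20]" -> [10, 20]
    return name, gender, int(age), ids, float(score), int(label)


print(parse_line('Sara, F, 24, "[32, 44, 5, 1212]", 5.6, -1'))
```

Because the list length varies from row to row, the parsed ids would still need to be padded or converted to a sparse form before batching; that part depends on your input pipeline and is not shown here.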

[Discussion]:

[Solution 2]:

You can use sklearn.preprocessing.MultiLabelBinarizer together with tf.feature_column.indicator_column like this:

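import sklearn.preprocessing
import tensorflow as tf
# user_df[column] is expected to hold comma-separated strings; fit() collects the distinct values
mlb = sklearn.preprocessing.MultiLabelBinarizer()
mlb.fit([item.split(",") for item in user_df[column]])
multi_hot_column_dict[column] = tf.feature_column.indicator_column(tf.feature_column.categorical_column_with_vocabulary_list(column, mlb.classes_))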
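To make the target representation concrete, here is a small plain-Python sketch of what a multi-hot vector over a vocabulary looks like. It is an illustration only, not what TensorFlow or scikit-learn run internally: the function name multi_hot and the six-entry vocabulary are made up for the example, and the sketch ignores out-of-vocabulary ids and does not count duplicates.

```python
def multi_hot(ids, vocabulary):
    """Return a 0/1 vector with a 1 at the position of every id present."""
    index = {value: pos for pos, value in enumerate(vocabulary)}
    vector = [0] * len(vocabulary)
    for value in ids:
        if value in index:  # ids outside the vocabulary are skipped
            vector[index[value]] = 1
    return vector


# Hypothetical vocabulary, chosen only to cover the ids in the question's rows.
vocabulary = [5, 10, 20, 32, 44, 1212]

print(multi_hot([10, 20], vocabulary))           # [0, 1, 1, 0, 0, 0]
print(multi_hot([32, 44, 5, 1212], vocabulary))  # [1, 0, 0, 1, 1, 1]
```

Both rows map to vectors of the same length regardless of how many ids they contain, which is what makes the variable-length lists usable as a fixed-size model input.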

[Discussion]: