Pandas data to PyTorch tensor: answers

【Question title】: Pandas data to pytorch tensor
【Posted】: 2021-08-20 00:56:13
【Question】:

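I am trying to convert a pandas DataFrame to a PyTorch tensor in order to run an LSTM model, but I keep getting the error below: a ValueError saying it could not determine the shape of object type 'Series'. It then points to the following code: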

class MicroESDataset(Dataset):

    def __init__(self, sequences):
        self.sequences = sequences

    def __len__(self):
        return len(self.sequences)

    def __getitem__(self, idx):
        sequence, label = self.sequences[idx]
        return dict (
            sequence=torch.Tensor(sequence.to_numpy()),
            label = torch.tensor(label).float ()
        )

Am I missing something very obvious? Thanks

Here is the exact error message and traceback:

ValueError                                Traceback (most recent call last)
<ipython-input-46-fb5c7eb803e1> in <module>()
----> 1 for item in data_module.train_dataloader():
      2   print(item["sequence"].shape)
      3   print(item["label"].shape)
      4   # print(item["label"])
      5   break

3 frames
/usr/local/lib/python3.7/dist-packages/torch/_utils.py in reraise(self)
    427             # have message field
    428             raise self.exc_type(message=msg)
--> 429         raise self.exc_type(msg)
    430
    431

ValueError: Caught ValueError in DataLoader worker process 0.
Original Traceback (most recent call last):
  File "/usr/local/lib/python3.7/dist-packages/torch/utils/data/_utils/worker.py", line 202, in _worker_loop
    data = fetcher.fetch(index)
  File "/usr/local/lib/python3.7/dist-packages/torch/utils/data/_utils/fetch.py", line 44, in fetch
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/usr/local/lib/python3.7/dist-packages/torch/utils/data/_utils/fetch.py", line 44, in <listcomp>
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "<ipython-input-30-36c44aae196d>", line 13, in __getitem__
    label = torch.tensor(label).float()
ValueError: could not determine the shape of object type 'Series'

【Comments】:

Please provide the exact error message and the full traceback.
I have added the exact error message and traceback to the OP.
It looks like label is a Series object, and PyTorch does not know how to handle it.
Does this answer your question? Convert Pandas dataframe to PyTorch tensor?
Please use the DataLoader's num_workers=0 argument for debugging.

Tags: python pandas pytorch

【Solution 1】:

2 columns

First, idx in the Dataset should refer to a row of the pd.DataFrame.

The way to get a row is df.iloc[idx], not [idx] (which looks up the column with that label, probably not what you want; if it is, you should transpose your data).
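The difference between the two lookups can be seen in a minimal sketch. The frame and the variable names below are hypothetical and exist only for illustration:

```python
import pandas as pd

# Hypothetical two-row frame, used only to contrast the two lookups
df = pd.DataFrame({"col1": [1, 2], "col2": [3, 4]})

# Positional row lookup: returns the first row as a Series
first_row = df.iloc[0]
print(first_row.tolist())

# Plain indexing looks up a column label instead
first_column = df["col1"]
print(first_column.tolist())

# An integer is not a column label in this frame, so df[0] raises KeyError
try:
    df[0]
except KeyError as err:
    print("KeyError:", err)
```

With integer column labels, df[0] would silently return a column rather than fail, which is why .iloc is the safer choice inside __getitem__.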

Given that, we can do the following (a dummy pd.DataFrame with only 2 columns, see the code comments):

import pandas as pd
import torch


class MicroESDataset(torch.utils.data.Dataset):
    def __init__(self):
        # Dummy sequences dataframe
        self.sequences = pd.DataFrame({"col1": [1, 2], "col2": [3, 4]})

    def __len__(self):
        return len(self.sequences)

    def __getitem__(self, idx):
        sequence, label = self.sequences.iloc[idx]
        return dict(
            # torch.tensor infers dtype, torch.Tensor is always float
            sequence=torch.tensor(sequence),
            label=torch.tensor(label).float(),
        )


dataset = MicroESDataset()
print(dataset[0])
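To check batching the way the question does, a dataset like this can be wrapped in a DataLoader. The sketch below follows the num_workers=0 suggestion from the comments; FrameDataset is a hypothetical stand-in for MicroESDataset, and it assumes the last column holds the label:

```python
import pandas as pd
import torch
from torch.utils.data import DataLoader, Dataset


class FrameDataset(Dataset):
    """Hypothetical stand-in for MicroESDataset; last column is the label."""

    def __init__(self, frame):
        self.frame = frame

    def __len__(self):
        return len(self.frame)

    def __getitem__(self, idx):
        # Convert the row to a float32 NumPy array before building tensors
        row = self.frame.iloc[idx].to_numpy(dtype="float32")
        return dict(
            sequence=torch.tensor(row[:-1]),
            label=torch.tensor(row[-1]),
        )


frame = pd.DataFrame({"col1": [1, 2], "col2": [3, 4]})
# num_workers=0 keeps loading in the main process, so an error raised in
# __getitem__ surfaces with a direct traceback instead of a worker re-raise
loader = DataLoader(FrameDataset(frame), batch_size=2, num_workers=0)
batch = next(iter(loader))
print(batch["sequence"].shape, batch["label"].shape)
```

The default collate function stacks the per-item tensors, so each value in the batch dict gains a leading batch dimension.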

More columns

If you have more columns (assuming Series may refer to multiple values), you have to:

Get the row first
Slice it by the appropriate columns

Given the above, one could do this (4 columns in this case, the last one being the label, see the code comments):

import pandas as pd
import torch


class MicroESDataset(torch.utils.data.Dataset):
    def __init__(self):
        # Dummy sequences dataframe
        self.sequences = pd.DataFrame(
            {"col1": [1, 2], "col2": [3, 4], "col3": [5, 6], "col4": [7, 8]}
        )

    def __len__(self):
        return len(self.sequences)

    def __getitem__(self, idx):
        # No magic unpacking here!
        row = self.sequences.iloc[idx]
        # Now only columns are left and we can slice with the indices
        # One could also slice by label with row[:"col3"], but positional slicing fits better here
        sequence, label = row.iloc[:-1], row.iloc[-1]
        return dict(
            # Convert the Series to a NumPy array first; torch.tensor may
            # fail on a pandas Series
            sequence=torch.tensor(sequence.to_numpy()),
            label=torch.tensor(label).float(),
        )


dataset = MicroESDataset()
print(dataset[0])
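The traceback in the question points at torch.tensor(label). Assuming, as one of the comments suggests, that label arrives there as a pandas Series, one way to sidestep the error is to go through NumPy before building the tensor. This is a sketch under that assumption; the window, the label, and its index are made up for illustration:

```python
import pandas as pd
import torch

# Hypothetical window and label, shaped like one (sequence, label) pair
features = pd.DataFrame({"col1": [1.0, 2.0, 3.0], "col2": [4.0, 5.0, 6.0]})
# A one-element Series whose index does not start at 0, as a slice taken
# from the middle of a DataFrame would be
label = pd.Series([7.0], index=[3])

# Going through NumPy avoids handing pandas objects to torch directly
sequence_tensor = torch.tensor(features.to_numpy(), dtype=torch.float32)
# squeeze() drops the length-1 axis so the label becomes a scalar tensor
label_tensor = torch.tensor(label.to_numpy(), dtype=torch.float32).squeeze()

print(sequence_tensor.shape, label_tensor.shape)
```

Passing dtype=torch.float32 explicitly keeps the inputs in the default floating-point type that PyTorch layers such as nn.LSTM use, instead of the float64 that NumPy produces by default.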

【Comments】: