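【Posted】: 2021-06-27 08:14:21
【Question】:
I am trying to load a CSV file into a tensor dataset for vertical federated learning. The reference is https://github.com/OpenMined/PyVertical/blob/master/examples/PyVertical%20Example.ipynb
This is how I load the file, which fails: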
# imports assumed from the names used below; add_ids comes from PyVertical
import pandas as pd
import torch
import torch.utils.data as data_utils

train = pd.read_csv('datatrain.csv')  # load data
cols = ["a", "b", "c"]  # select feature columns
train_feature = train[cols]  # dataset with the features
train_target = train['result']  # dataset with the result
# turn them into torch.tensor data
train_feature_tensor = torch.tensor(train_feature.values)
train_target_tensor = torch.tensor(train_target.values)
# put them into a TensorDataset
train_tensor = data_utils.dataset.TensorDataset(train_feature_tensor, train_target_tensor)
# then pass them to add_ids()
temp = add_ids(data_utils.dataset.TensorDataset)
temp.data = train_tensor
traindata_ft = temp(train_tensor)
Output:
'TensorDataset' object has no attribute 'size'
The error is reported at this line:
assert all(tensors[0].size(0) == tensor.size(0) for tensor in tensors)
in:
class TensorDataset(Dataset):
    r"""Dataset wrapping tensors.
    Each sample will be retrieved by indexing tensors along the first dimension.
    Arguments:
        *tensors (Tensor): tensors that have the same size of the first dimension.
    """
    def __init__(self, *tensors):
        assert all(tensors[0].size(0) == tensor.size(0) for tensor in tensors)
        self.tensors = tensors

    def __getitem__(self, index):
        return tuple(tensor[index] for tensor in self.tensors)

    def __len__(self):
        return self.tensors[0].size(0)
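The failure can be reproduced in isolation. This is a minimal sketch with made-up tensor shapes (features, targets and inner are illustrative names, not from my real data): the constructor calls size(0) on every argument, so it only accepts tensors, and passing an existing TensorDataset as an argument raises an AttributeError.

```python
import torch
from torch.utils.data import TensorDataset

# Illustrative stand-ins for the feature and target tensors.
features = torch.zeros(4, 3)
targets = torch.zeros(4)

# Works: every argument is a tensor with the same first dimension.
inner = TensorDataset(features, targets)

# Tensors expose size(); datasets only expose __len__.
tensor_rows = features.size(0)
dataset_rows = len(inner)
dataset_has_size = hasattr(inner, "size")

# Fails: a dataset is passed where a tensor is expected,
# so tensors[0].size(0) inside __init__ cannot be evaluated.
raised = False
try:
    TensorDataset(inner)
except AttributeError:
    raised = True

print(tensor_rows, dataset_rows, dataset_has_size, raised)
```

This mirrors the last line of my code, where the class returned by add_ids is instantiated with train_tensor (already a TensorDataset) rather than with the two tensors.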
As for add_ids(), it is a function that generates a unique id for each data row. The original code is as follows:
def add_ids(cls):
    """Decorator to add unique IDs to a dataset
    Args:
        cls (torch.utils.data.Dataset) : dataset to generate IDs for
    Returns:
        VerticalDataset : A class which wraps cls to add unique IDs as an attribute,
            and returns data, target, id when __getitem__ is called
    """
    class VerticalDataset(cls):
        def __init__(self, *args, **kwargs):
            super().__init__(*args, **kwargs)
            self.ids = np.array([uuid4() for _ in range(len(self))])

        def __getitem__(self, index):
            if self.data is None:
                img = None
            else:
                img = self.data[index]
                img = Image.fromarray(img.numpy(), mode="L")
                if self.transform is not None:
                    img = self.transform(img)
            if self.targets is None:
                target = None
            else:
                target = int(self.targets[index]) if self.targets is not None else None
                if self.target_transform is not None:
                    target = self.target_transform(target)
            id = self.ids[index]
            # Return a tuple of non-None elements
            return (*filter(lambda x: x is not None, (img, target, id)),)

        def __len__(self):
            if self.data is not None:
                return self.data.size(0)
            else:
                return len(self.targets)

        def get_ids(self) -> List[str]:
            """Return a list of the ids of this dataset."""
            return [str(id_) for id_ in self.ids]

        def sort_by_ids(self):
            """
            Sort the dataset by IDs in ascending order
            """
            ids = self.get_ids()
            sorted_idxs = np.argsort(ids)
            if self.data is not None:
                self.data = self.data[sorted_idxs]
            if self.targets is not None:
                self.targets = self.targets[sorted_idxs]
            self.ids = self.ids[sorted_idxs]

    return VerticalDataset
【Discussion】:
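- Hi, what is add_ids supposed to do and return? That is not stated in your code snippet. Please try to post a stackoverflow.com/help/minimal-reproducible-example :) In any case, no PyTorch dataset has a size method, only __len__. size can be called on tensors, however, which is what the assertion seems to expect. You need to find out why you are passing a dataset where a tensor is expected.
- @trialNerror Thanks for the reminder. I have just added an explanation of add_ids to the post. add_ids basically creates a unique id for each data row. Thanks a lot for your comment ;)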
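Following the first comment, one possible direction is to give the constructor the tensors themselves and attach the ids in a small subclass. This is only a sketch under assumptions: TensorDatasetWithIds is a hypothetical name, not part of PyTorch or PyVertical, the sample values are made up, and it has not been checked against the rest of the PyVertical pipeline (which also relies on methods such as get_ids and sort_by_ids).

```python
from uuid import uuid4

import numpy as np
import torch
from torch.utils.data import TensorDataset


class TensorDatasetWithIds(TensorDataset):
    """Hypothetical helper: a TensorDataset that also returns a unique id per row."""

    def __init__(self, *tensors):
        # The parent receives tensors, so its size(0) check can run.
        super().__init__(*tensors)
        # One random UUID string per row, as add_ids does for its datasets.
        self.ids = np.array([str(uuid4()) for _ in range(len(self))])

    def __getitem__(self, index):
        # Append the row id to whatever the parent returns for this index.
        return (*super().__getitem__(index), self.ids[index])


# Made-up values standing in for the CSV feature and target columns.
features = torch.tensor([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]])
targets = torch.tensor([0, 1])

dataset = TensorDatasetWithIds(features, targets)
x, y, row_id = dataset[0]
print(len(dataset), x, y, row_id)
```

The design choice here is to subclass TensorDataset directly instead of reusing add_ids, because the __getitem__ shown in the question expects image-style attributes (data, targets, transform) that a plain TensorDataset does not have.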