Index out of bounds with itertuples: IndexError: index 850 is out of bounds for axis 1 with size 786

【Question title】: index out of bounds with itertuples: IndexError: index 850 is out of bounds for axis 1 with size 786
【Posted】: 2020-08-31 06:50:51
【Question description】:

My dataframe:

   userID  storeID  rating
0       1      662     3.6
1       2      665     3.4
2       3      678     4.0
3       4      500     3.1
4       5      421     2.9


n_users = df.userID.unique().shape[0]
n_stores = df.storeID.unique().shape[0]

I have two questions. If I try to build my training dataset like this:

ratings = np.zeros((n_users, n_stores))
for row in df.itertuples():
    ratings[row[1]-1, row[2]-1] = row[3]

I get this IndexError:

IndexError: index 850 is out of bounds for axis 1 with size 786

【Comments】:

Can you show us the expected output you want?

Tags: python

【Solution 1】:

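As far as I understand, you are trying to create a 2D array of floats, each representing a rating, indexed by user ID on the first axis and store ID on the second axis.

You are creating an array of shape (n_users, n_stores), where n_users and n_stores are the numbers of unique users and stores respectively. When indexing this array,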

for row in df.itertuples():
    ratings[row[1]-1, row[2]-1] = row[3]

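you are using the user/store IDs (shifted by 1) directly as indices. This only works if you know that all user/store IDs run from 1 to the number of unique users/stores, with no gaps in between. For example, in the snippet of the dataframe you showed there are 5 unique users and 5 unique stores, but even if I make a 5 x 5 array, I cannot index the second axis (store ID) directly, because the store ID values are [662, 665, 678, 500, 421], whereas the axis can only be indexed by [0, 1, 2, 3, 4].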

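The IndexError you got occurs on axis 1 (the second axis, the one for store IDs) with index value 850. This means your store IDs are not numbered contiguously from 1 to 786 (the number of unique store IDs). They are just individual integers with gaps between them: index 850 comes from row[2]-1, so there is a store with ID 851, which is larger than the number of unique stores.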
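
If a dense array is still what you want, one option is to remap the raw IDs to contiguous positions first. The following is only a minimal sketch, assuming the five sample rows from the question; it uses `pd.factorize`, which returns integer codes from 0 to n-1 together with the unique values they stand for. The variable names `user_idx`, `store_idx`, `user_ids`, and `store_ids` are made up for illustration.

```python
import numpy as np
import pandas as pd

# Sample rows copied from the question.
df = pd.DataFrame({
    "userID": [1, 2, 3, 4, 5],
    "storeID": [662, 665, 678, 500, 421],
    "rating": [3.6, 3.4, 4.0, 3.1, 2.9],
})

# Diagnostic: with 1-based IDs and no gaps, the largest ID equals the
# number of unique IDs. Here the two values differ, so ID-1 is not a
# valid array index.
print(df["storeID"].max(), df["storeID"].nunique())

# factorize gives each distinct ID a code in 0..n-1, in order of first
# appearance, and returns the distinct IDs so codes can be mapped back.
user_idx, user_ids = pd.factorize(df["userID"])
store_idx, store_ids = pd.factorize(df["storeID"])

# Fill the dense matrix using the contiguous codes instead of raw IDs.
# If a (user, store) pair occurred more than once, the last value would win.
ratings = np.zeros((len(user_ids), len(store_ids)))
ratings[user_idx, store_idx] = df["rating"].to_numpy()
```

`store_ids[j]` gives back the original store ID for column `j`, so the matrix can still be related to the raw data.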

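What you are looking for is more like a dictionary: an arbitrary mapping between keys and values, where the indices (keys) do not have to be contiguous as they do in an array. Specifically, I think whatever you are trying to do will be much easier with a ratings Series indexed by a MultiIndex of userID and storeID: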

>>> indexed_df = df.set_index(['userID', 'storeID'])
>>> indexed_df
                rating
userID storeID
1      662         3.6
2      665         3.4
3      678         4.0
4      500         3.1
5      421         2.9

>>> ratings = indexed_df['rating']
>>> ratings
userID  storeID
1       662        3.6
2       665        3.4
3       678        4.0
4       500        3.1
5       421        2.9
Name: rating, dtype: float64
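
As a rough illustration of how such a Series might be used, the sketch below looks up one rating by its (userID, storeID) pair and then pivots the Series into a 2D user-by-store table with `Series.unstack`. It assumes the sample rows from the question and that each (userID, storeID) pair appears at most once; the name `matrix` is hypothetical.

```python
import pandas as pd

# Sample rows copied from the question.
df = pd.DataFrame({
    "userID": [1, 2, 3, 4, 5],
    "storeID": [662, 665, 678, 500, 421],
    "rating": [3.6, 3.4, 4.0, 3.1, 2.9],
})

ratings = df.set_index(["userID", "storeID"])["rating"]

# Look up a rating by its raw IDs; no contiguity is required.
one_rating = ratings.loc[(3, 678)]

# Move the storeID level into columns. Pairs with no rating are filled
# with 0.0. unstack raises if the index contains duplicate pairs.
matrix = ratings.unstack(fill_value=0.0)
```

The rows and columns of `matrix` are labelled by the original IDs, so `matrix.loc[3, 678]` returns the same rating as the Series lookup.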

【Comments】:

ratings_train, ratings_test = train_test_split(ratings, test_size=0.33, random_state=42); ratings_train.shape, ratings_test.shape; cosine_distances(ratings_train) gives ValueError: Expected 2D array, got 1D array:
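@JeongHunChoi train_test_split expects a 2D array, whereas ratings is a Series (1D). You don't need any of this processing for that: just pass the original dataframe df (or even indexed_df) to train_test_split before doing any other processing.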
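
A possible way to follow that suggestion is sketched below, assuming scikit-learn is installed and using the sample rows from the question. The raw rows are split first, and a 2D user-by-store matrix is built from the training rows only, since `cosine_distances` itself expects 2D input. The names `train_df`, `test_df`, `train_matrix`, and `dist` are hypothetical.

```python
import pandas as pd
from sklearn.metrics.pairwise import cosine_distances
from sklearn.model_selection import train_test_split

# Sample rows copied from the question.
df = pd.DataFrame({
    "userID": [1, 2, 3, 4, 5],
    "storeID": [662, 665, 678, 500, 421],
    "rating": [3.6, 3.4, 4.0, 3.1, 2.9],
})

# Split the raw rating rows before any reshaping.
train_df, test_df = train_test_split(df, test_size=0.33, random_state=42)

# Build a 2D user x store table from the training rows only.
# Missing (user, store) pairs are filled with 0.0.
train_matrix = train_df.pivot_table(
    index="userID", columns="storeID", values="rating", fill_value=0.0
)

# Pairwise cosine distances between users (rows of the matrix).
dist = cosine_distances(train_matrix.to_numpy())
```

With real data, users or stores that appear only in the test rows will be absent from `train_matrix`, which is something to account for when evaluating.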