Pandas data sampling into different ratios using a logic

【Question title】: Pandas data sampling into different ratios using a logic
【Posted】: 2019-10-11 05:21:26
【Question】:

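I have a DataFrame like the one shown below, and I want to sample it using order_id. Each customer's data should be split into three buckets: train (70%), validation (15%) and test (15%). Every customer should be present in all three buckets. The number of order_ids and the items can differ from customer to customer.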

DataFrame:

Customer  Orderid   item_name
   A        1        orange
   A        1        apple
   A        1        banana
   A        2        apple
   A        2        carrot
   A        3        orange
   A        4        grape
   A        4        watermelon
   A        4        banana
   B        1        pineapple
   B        2        banana
   B        3        papaya
   B        3        Lime

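After sampling, all three datasets (train, validation and test) should contain the same number of customers, and the items in validation and test should be a subset of those in train.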

Expected result:

train: should contain all customers and all item_names (70% of complete data)
     customer  item
         A     orange
         A     apple
         A     banana
         A     carrot
         A     grape
         A     watermelon
         B     pineapple
         B     banana
         B     papaya
         B     Lime
validation: should contain all customers; item_names can be a subset of train (15% of complete data)
     customer  item
         A     orange
         A     apple
         A     banana
         B     pineapple
         B     banana
         B     papaya
         B     Lime
test: should contain all customers; item_names can be a subset of train (15% of complete data)
     customer  item
         A     carrot
         A     grape
         A     watermelon
         B     papaya
         B     Lime

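【Comments】:

Take customer A and item orange, for example: there are only 2 entries. In that case it is not possible to split them into 3 buckets. It would help if you could post sample expected data for the 3 buckets.
@parth, I have edited the question. Any input on it?
@Serdar ERİÇ's answer seems to be the simplest way to achieve what you want. However, it will fail if some (customer, item) combinations have very few samples. If you know that is not the case in your real data, you are fine; otherwise you will need to write custom code that randomly samples each (customer, item) combination.

Tags: python pandas data-science training-data sampling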
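
One possible shape for such custom code is sketched below. It is not taken from any answer in this thread: it splits whole orders within each customer (as the question asks) rather than individual (customer, item) rows. The helper name `split_orders_per_customer`, the `bucket` column and the `ratios` and `seed` parameters are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd


def split_orders_per_customer(df, ratios=(0.7, 0.15, 0.15), seed=0):
    """Label every row as train, validation or test, order by order.

    Hypothetical helper: each customer gets at least one order in
    validation and one in test; the remaining orders go to train.
    """
    rng = np.random.default_rng(seed)
    bucket_of = {}
    for customer, group in df.groupby("Customer"):
        orders = group["Orderid"].unique().tolist()
        rng.shuffle(orders)  # random order assignment within the customer
        n = len(orders)
        # Reserve at least one order each for validation and test.
        n_val = max(1, int(round(n * ratios[1])))
        n_test = max(1, int(round(n * ratios[2])))
        n_train = n - n_val - n_test
        if n_train < 1:
            raise ValueError(f"customer {customer!r} has too few orders: {n}")
        labels = ["train"] * n_train + ["validation"] * n_val + ["test"] * n_test
        for order, label in zip(orders, labels):
            bucket_of[(customer, order)] = label
    keys = zip(df["Customer"], df["Orderid"])
    return pd.Series([bucket_of[k] for k in keys], index=df.index, name="bucket")


# The DataFrame from the question.
df = pd.DataFrame({
    "Customer": list("AAAAAAAAA") + list("BBBB"),
    "Orderid": [1, 1, 1, 2, 2, 3, 4, 4, 4, 1, 2, 3, 3],
    "item_name": ["orange", "apple", "banana", "apple", "carrot", "orange",
                  "grape", "watermelon", "banana", "pineapple", "banana",
                  "papaya", "Lime"],
})
df["bucket"] = split_orders_per_customer(df)
print(df.sort_values(["bucket", "Customer", "Orderid"]))
```

With so few orders per customer, the 70/15/15 ratios can only be approximated, and the percentages apply to orders rather than rows. This sketch also does not enforce that validation and test items are a subset of the train items; that constraint would need an extra check or a retry loop.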

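【Solution 1】:

As @Parth mentioned in the comments, you first need a dataset that meets the conditions for this kind of stratified split. You can then create a new column that combines "Customer" and "item_name" and pass it to the "stratify" parameter of the "train_test_split" function, which is part of sklearn.

You can find an example below.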

import pandas as pd
from sklearn.model_selection import train_test_split

# Create sample data
data = {
    "Customer":["A", "A", "A", "A","A","A","A","A","A", "B", "B", "B","B", "B", "B", "B","B","B"],
    "Orderid":[1, 1, 1, 2, 2, 2, 2, 3, 2, 1, 2, 1, 1, 1, 1, 2, 2, 2],
    "item_name":[
        "orange",
        "apple",
        "orange",
        "apple",
        "orange",
        "apple",
        "orange",
        "apple",
        "orange",
        "apple",
        "orange",
        "apple",
        "orange",
        "apple",
        "orange",
        "apple",
        "orange",
        "apple"
       ]
}
# Convert data to dataframe
df = pd.DataFrame(data)
# Create a new column combining "Customer" and "item_name"; it feeds the "stratify" parameter
# of train_test_split, which is part of sklearn.model_selection
df["CustAndItem"] = df["Customer"]+"_"+df["item_name"]

# First split into "train" and "test" sets. In this example 40% of the data goes to "test"
# and 60% to "train"
X_train, X_test, y_train, y_test = train_test_split(df.index,
                                                    df["CustAndItem"],
                                                    test_size=0.4,
                                                    stratify=df["CustAndItem"])

# Get actual data after split operation
df_train = df.loc[X_train].copy(True)
df_test = df.loc[X_test].copy(True)

# Now split the "test" set into "validation" and "test" sets. Here they are split equally
# (test_size=0.5), so each contains 20% of the full set.
X_validate, X_test, y_validate, y_test = train_test_split(df_test.index,
                                                          df_test["CustAndItem"],
                                                          test_size= 0.5,
                                                          stratify=df_test["CustAndItem"])
# Get actual data after split
df_validate = df_test.loc[X_validate]
df_test = df_test.loc[X_test]

# Print results
print(df_train)
print(df_validate)
print(df_test)
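
The two test_size values above combine multiplicatively, which is easy to get wrong when aiming for the 70/15/15 split from the question. A minimal sketch of the arithmetic follows; the helper name `two_stage_fractions` is made up for illustration and is not part of sklearn.

```python
def two_stage_fractions(first_test_size, second_test_size):
    """Return the (train, validation, test) shares of the full data
    produced by two consecutive train_test_split calls."""
    holdout = first_test_size          # share set aside by the first split
    train = 1.0 - holdout
    test = holdout * second_test_size  # share of the holdout kept as test
    validation = holdout - test
    return train, validation, test


# (0.4, 0.5) are the values used in the answer; (0.3, 0.5) targets 70/15/15.
for sizes in [(0.4, 0.5), (0.3, 0.5)]:
    train, validation, test = two_stage_fractions(*sizes)
    print(f"test_size={sizes}: train={train:.0%}, "
          f"validation={validation:.0%}, test={test:.0%}")
```

Note that a smaller holdout leaves fewer rows per label for the second stratified split, which is what the follow-up discussion runs into with the 18-row sample.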

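【Comments】:

Thanks for the reply. If I reduce the test size from 0.4 to 0.3, I get the following error: "ValueError: The least populated class in y has only 1 member, which is too few. The minimum number of groups for any class cannot be less than 2." May I know why?
That is because of the size of the data. For example, consider a DataFrame with 5 rows: 2 share one label, another 2 share a second label, and the last one has a completely different label. If you want to split this data evenly into 2 DataFrames by label, that is impossible, because one label has only 1 row. You are probably hitting this error when splitting the validation and test sets. So you can print df_test and look for rows whose CustAndItem value is unique. Then you can append more rows to balance your data.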
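
Rather than scanning the printed frame by eye, the labels that are too rare to stratify can be listed with value_counts. The sketch below uses a small made-up holdout frame, not the answer's actual output, and the helper name `rare_labels` is an assumption.

```python
import pandas as pd


def rare_labels(labels, min_count=2):
    """Return the labels occurring fewer than `min_count` times, with counts."""
    counts = labels.value_counts()
    return counts[counts < min_count]


# Made-up holdout set standing in for df_test after the first split.
holdout = pd.DataFrame({
    "Customer": ["A", "A", "A", "B", "B", "B"],
    "item_name": ["orange", "orange", "apple", "apple", "apple", "orange"],
})
holdout["CustAndItem"] = holdout["Customer"] + "_" + holdout["item_name"]

# Any label listed here has a single row and cannot be stratified across two sets.
print(rare_labels(holdout["CustAndItem"]))
```

Running the same check on the stratification column before each split shows which (customer, item) combinations need more rows.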