Which technique should be applied to split a large text dataset for data matching? Answers

【Question title】: Which technique should be applied to split a large text dataset for data matching?
【Posted】: 2021-08-20 05:19:18
【Question description】:

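I am working on a binary classification problem with a large text dataset that is meant to be used for data matching. The data is imbalanced, but I am already using a method to deal with that.

I want to try some sklearn classifiers on small subsets of this dataset. Is there a way in sklearn to divide the dataset into N subsets while keeping the class proportions, so that I can then split each of those subsets into train/test and fit a classifier on each subset independently?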

【Question discussion】:

Can you give an example? Something like the input you have and the output/result you want.

Tags: python machine-learning scikit-learn classification record-linkage

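【Solution 1】:

`StratifiedKFold` from sklearn can do the job. Suppose your data is stored in X (features) and y (target).

The splitter divides the dataset into N folds while keeping the class proportions in each fold. Each of the N iterations yields one train/test pair: one fold is the test set and the remaining N-1 folds form the training set. As the code shows, the splitter returns indices, not the split data.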

# Import the splitter (it lives in sklearn.model_selection, not sklearn.preprocessing)
from sklearn.model_selection import StratifiedKFold

# Set N
N = 5

# Initialize a splitter that will divide data into N groups
kf = StratifiedKFold(n_splits=N)

# Append the indices of each of the N splits to a list
idx_splits = []
for idx_1, idx_2 in kf.split(X, y):
    idx_splits.append((idx_1, idx_2))

# Get the third train split (index 2, because lists are zero-based).
# Indexing like this assumes X and y are NumPy arrays.
X[idx_splits[2][0]]
y[idx_splits[2][0]]

# Get the third test split
X[idx_splits[2][1]]
y[idx_splits[2][1]]
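
To get closer to what the question asks for (N disjoint subsets that each keep the class ratio, each one then split into train/test with its own classifier), the test indices of the folds can serve as the subsets. This is only a minimal sketch under assumptions: X and y are NumPy arrays, and the synthetic data from `make_classification`, the 70/30 inner split, and `LogisticRegression` are placeholders that do not come from the original answer.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, train_test_split

# Placeholder data: an imbalanced binary problem (roughly 90% / 10%)
X, y = make_classification(
    n_samples=1000, n_features=10, weights=[0.9, 0.1], random_state=0
)

N = 5
skf = StratifiedKFold(n_splits=N, shuffle=True, random_state=0)

scores = []
subset_sizes = []
# The test indices of the N folds are disjoint and together cover the data,
# so each of them can act as an independent stratified subset.
for _, subset_idx in skf.split(X, y):
    X_sub, y_sub = X[subset_idx], y[subset_idx]
    subset_sizes.append(len(subset_idx))

    # Split this subset into train/test, again keeping the class ratio
    X_train, X_test, y_train, y_test = train_test_split(
        X_sub, y_sub, test_size=0.3, stratify=y_sub, random_state=0
    )

    # Fit one classifier per subset (placeholder model)
    clf = LogisticRegression(max_iter=1000)
    clf.fit(X_train, y_train)
    scores.append(clf.score(X_test, y_test))

print(scores)
```

Each entry in `scores` is the test accuracy of the classifier fitted on one subset. With imbalanced data, accuracy is probably not the metric you want in the end; it is used here only to keep the sketch short.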

【Discussion】:

Thanks for your answer. I would like to split the dataset by percentage, for example 30% and 70%.
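
For a percentage-based split like the 70%/30% mentioned in this comment, `train_test_split` accepts a `test_size` fraction and a `stratify` argument that keeps the class ratio in both parts. A small sketch on made-up arrays (the data is illustrative only):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Illustrative data: 100 samples, 20% of them in the positive class
X = np.arange(200).reshape(100, 2)
y = np.array([0] * 80 + [1] * 20)

# 70% train / 30% test, stratified on the labels
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

print(len(X_train), len(X_test))
print(y_train.mean(), y_test.mean())
```

The two means printed at the end are the positive-class shares of the train and test parts, which should both stay close to the 20% of the full dataset.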

【Solution 2】:

I think sklearn's StratifiedKFold may be exactly what you are looking for. It preserves the class proportions of the original dataset in each fold.
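
One way to check this is to compare the positive-class share in each fold with the share in the full dataset. A quick sketch on made-up labels (the arrays are hypothetical, and the feature matrix is just a placeholder because only `y` drives the stratification):

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Made-up labels: 10% positives
y = np.array([0] * 90 + [1] * 10)
X = np.zeros((len(y), 1))  # placeholder features

skf = StratifiedKFold(n_splits=5)

# Share of the positive class in the test part of every fold
fold_ratios = [y[test_idx].mean() for _, test_idx in skf.split(X, y)]

print(y.mean())
print(fold_ratios)
```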

【Discussion】: