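【Posted】: 2013-04-18 22:25:51
【Question】:
I have some data that I want to split into smaller groups that all keep a common ratio. I wrote a function that takes two arrays as input, computes their size ratio, and then tells me the options for how many groups the data can be split into (if all groups are the same size). Here is the function: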
def cross_validation_group(train_data, test_data):
    import numpy as np
    from calculator import factors
    test_length = len(test_data)
    train_length = len(train_data)
    total_length = test_length + train_length
    # Fraction of all rows that come from the test set.
    ratio = test_length/float(total_length)
    possibilities = factors(total_length)
    print possibilities
    print possibilities[len(possibilities)-1] * ratio
    super_count = 0
    # Report every group size for which the test share is a whole number.
    for i in possibilities:
        if i < len(possibilities)/2:
            pass
        else:
            attempt = float(i * ratio)
            if attempt.is_integer():
                print str(i) + " is an option for total size with " + str(attempt) + " as test size and " + str(i - attempt) + " as train size! This is with " + str(total_length/i) + " folds."
            else:
                pass
    folds = int(raw_input("So how many folds would you like to use? If no possibilities were given that would be sufficient, type 0: "))
    if folds != 0:
        total_size = total_length/folds
        test_size = float(total_size * ratio)
        train_size = total_size - test_size
        columns = train_data[0]
        columns= len(columns)
        groups = np.empty((folds,(test_size + train_size),columns))
        i = 0
        a = 0
        b = 0
        # Fill each fold with its test rows first, then its train rows.
        for j in range (0,folds):
            test_size_new = test_size * (j + 1)
            train_size_new = train_size * j
            total_size_new = (train_size + test_size) * (j + 1)
            cut_off = total_size_new - train_size
            p = 0
            while i < total_size_new:
                if i < cut_off:
                    groups[j,p] = test_data[a]
                    a += 1
                else:
                    groups[j,p] = train_data[b]
                    b += 1
                i += 1
                p += 1
        return groups
    else:
        print "This method cannot be used because the ratio cannot be maintained with equal group sizes other than for the options you were given"
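The `factors` helper imported from `calculator` is not shown in the post. As a rough sketch only, assuming it returns every positive divisor of its argument in ascending order (which is what the indexing of `possibilities` above seems to rely on), a stand-in could look like this:

```python
def factors(n):
    """Return all positive divisors of n in ascending order (assumed behaviour)."""
    small, large = [], []
    d = 1
    while d * d <= n:
        if n % d == 0:
            small.append(d)           # divisor at or below sqrt(n)
            if d != n // d:
                large.append(n // d)  # matching divisor above sqrt(n)
        d += 1
    return small + large[::-1]


# 500 is the combined length of the 357-row and 143-row arrays mentioned later.
divisors = factors(500)
# divisors == [1, 2, 4, 5, 10, 20, 25, 50, 100, 125, 250, 500]
```

This is a hypothetical replacement, not the actual `calculator` module.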
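So my question is: how can I make the number of folds a third input to the function, and change the function so that, instead of iterating to make sure every group has the same size with the correct ratio, each group simply has the correct ratio but may differ in size?
Addendum for @JamesHolderness
Your method is almost perfect, but there is one problem:
With lengths 357 and 143 and 9 folds, this is the returned list: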
[(39, 16), (39, 16), (39, 16), (39, 16), (39, 16), (39, 16), (39, 16), (39, 16), (39, 16)]
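Now, when you sum the columns, you get: 351 144
351 is fine because it is less than 357, but 144 does not work because it is greater than 143! The reason is that 357 and 143 are the lengths of the arrays, so row 144 of that array does not exist...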
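One way to avoid overshooting either array, sketched here as an assumption rather than as the method being discussed, is to split each length on its own with `divmod`, so the per-fold sizes always sum to exactly the array length. The function names `split_sizes` and `fold_sizes` are hypothetical.

```python
def split_sizes(length, folds):
    """Split `length` into `folds` near-equal integer sizes that sum to `length`."""
    base, extra = divmod(length, folds)
    # The first `extra` folds take one additional row each.
    return [base + 1 if i < extra else base for i in range(folds)]


def fold_sizes(len_a, len_b, folds):
    """Pair the per-fold sizes of two arrays; neither column exceeds its total."""
    return list(zip(split_sizes(len_a, folds), split_sizes(len_b, folds)))


sizes = fold_sizes(357, 143, 9)
# sizes == [(40, 16)] * 6 + [(39, 16)] * 2 + [(39, 15)]
# column sums: 357 and 143
```

The ratio within each fold is then only approximately 357:143, but every row is used exactly once and no index runs past the end of either array.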
【Comments】:
-
Do you mean you want to cross-validate using different training sets? That sounds statistically a bit dubious. Is it commonly done in practice?
-
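Yes, this is for cross-validation. No, this is meant to test the similarity between the test set and the training set, to check whether the test data contains anything that the training data does not. Cross-validation is usually done on a single training set, and this could be applied there as well: instead of two arrays, you could pass in the training array and the columns of the training array, and it would do the same thing.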
-
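If your two arrays have sizes m and n, and the irreducible fraction of m divided by n is p/q, then m = k*p and n = k*q. Once you have k, any partition of it leads to a split of the original data that preserves the ratio of elements. Let me know if you need me to elaborate.
-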
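Ah, good old number theory... Unfortunately, this is limiting. I want to be able to have any number of groups, even one that does not divide evenly, because as long as the ratio of one data set to the other stays consistent, the sizes can differ. Does that make sense?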
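The irreducible-fraction idea from the comments can be sketched as follows. This is an illustration only, and the function name `ratio_preserving_sizes` is hypothetical. It also shows why the approach is limiting for the lengths in the question: 357 and 143 share no common factor, so k is 1 and only a single group keeps the exact ratio.

```python
from math import gcd


def ratio_preserving_sizes(m, n, parts):
    """Group sizes with the exact ratio m:n, given a partition `parts` of k = gcd(m, n)."""
    k = gcd(m, n)
    p, q = m // k, n // k  # irreducible fraction p/q, so m = k*p and n = k*q
    if sum(parts) != k:
        raise ValueError("parts must sum to gcd(m, n) = %d" % k)
    # A part of size c contributes c*p rows from one array and c*q from the other.
    return [(c * p, c * q) for c in parts]


groups = ratio_preserving_sizes(300, 200, [50, 30, 20])
# groups == [(150, 100), (90, 60), (60, 40)]

k_question = gcd(357, 143)
# k_question == 1, so the only exact split is the whole data set
```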