Memory allocation error in sklearn random forest classification in Python: answers

【Question title】: Memory allocation error in sklearn random forest classification python
【Posted】: 2019-04-30 18:29:16
【Question description】:

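I am trying to run sklearn random forest classification on 279,900 instances with 5 attributes and 1 class. I get a memory allocation error on the fit line, so the classifier cannot even be trained. Any suggestions on how to resolve this?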

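The data columns are:

x, y, day, week, accuracy

x and y are coordinates, day is the day of the month (1-30), week is the day of the week (1-7), and accuracy is an integer.

Code: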

import csv
from sklearn.ensemble import RandomForestClassifier

# Read the file once: columns 1-5 are the features, column 8 is the class label.
features = []
labels = []
with open("time_data.csv", "r", newline="") as infile:
    for row in csv.reader(infile):
        if len(row) > 0:  # skip blank lines
            features.append([float(v) for v in row[1:6]])
            labels.append(row[8])

# First 251,900 rows for training, the following rows (up to row 279,953) for testing.
train, trainclass = features[:251900], labels[:251900]
test, testclass = features[251900:279953], labels[251900:279953]

print("Done splitting data into test and train data")

clf = RandomForestClassifier(n_estimators=500, max_features="log2", min_samples_split=3, min_samples_leaf=2)
clf.fit(train, trainclass)  # the MemoryError is raised here

print("Done training")
score = clf.score(test, testclass)
print("Done Testing")
print(score)

Error:

line 366, in fit
    builder.build(self.tree_, X, y, sample_weight, X_idx_sorted)
  File "sklearn/tree/_tree.pyx", line 145, in sklearn.tree._tree.DepthFirstTreeBuilder.build
  File "sklearn/tree/_tree.pyx", line 244, in sklearn.tree._tree.DepthFirstTreeBuilder.build
  File "sklearn/tree/_tree.pyx", line 735, in sklearn.tree._tree.Tree._add_node
  File "sklearn/tree/_tree.pyx", line 707, in sklearn.tree._tree.Tree._resize_c
  File "sklearn/tree/_utils.pyx", line 39, in sklearn.tree._utils.safe_realloc
MemoryError: could not allocate 10206838784 bytes

【Question discussion】:

Tags: python scikit-learn random-forest

【Solution 1】:

Try Google Colaboratory. You can connect to either a localhost or a hosted runtime. It worked for me with n_estimators=10000.

【Discussion】:

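【Solution 2】:

I recently ran into the same MemoryError. I fixed it by reducing the size of the training data rather than changing my model parameters. My OOB score was 0.98, which suggests the model is unlikely to be overfitting.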
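
The subsampling idea described above can be sketched roughly as follows. This is a minimal illustration and not the answerer's actual code: the synthetic data, the 25% subset fraction and the 50 trees are all assumptions chosen so the example runs quickly. Setting `oob_score=True` asks the forest to estimate accuracy from the rows each tree did not see during bootstrapping.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for the real data (assumption): 5 numeric features, 3 classes.
rng = np.random.default_rng(0)
X = rng.normal(size=(20000, 5))
y = (X[:, 0] + X[:, 1] > 0).astype(int) + (X[:, 2] > 1).astype(int)

# Train on a random 25% of the rows instead of all of them (fraction is an assumption).
subset = rng.choice(len(X), size=len(X) // 4, replace=False)

clf = RandomForestClassifier(
    n_estimators=50,
    oob_score=True,  # score each row using only the trees that did not train on it
    random_state=0,
)
clf.fit(X[subset], y[subset])

print("rows used:", len(subset))
print("OOB score:", clf.oob_score_)
```

On the real data the same approach would presumably apply to the `train` and `trainclass` lists from the question once they are converted to NumPy arrays. Whether a subset is large enough is a judgment call; comparing the OOB score across a few subset sizes is one way to check.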

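【Discussion】:

【Solution 3】:

From the scikit-learn documentation: "The default values for the parameters controlling the size of the trees (e.g. max_depth, min_samples_leaf, etc.) lead to fully grown and unpruned trees which can potentially be very large on some data sets. To reduce memory consumption, the complexity and size of the trees should be controlled by setting those parameter values."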

So I would try adjusting those parameters. You could also try a memory profiler, or, if your machine has too little RAM, run the analysis on Google Colaboratory.
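
To illustrate the effect the documentation describes, here is a minimal sketch that counts the nodes in an unconstrained forest and in one with `max_depth` and `min_samples_leaf` set. The synthetic data, the helper `total_nodes` and the specific values (`max_depth=6`, `min_samples_leaf=5`, 20 trees) are assumptions for illustration only, not recommended settings. scikit-learn stores per-class values for every node, so memory grows with both the node count and the number of distinct class labels.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Synthetic data (assumption): 5 features and 50 noisy classes, which forces deep trees.
rng = np.random.default_rng(0)
X = rng.normal(size=(5000, 5))
y = rng.integers(0, 50, size=5000)


def total_nodes(forest):
    """Hypothetical helper: sum the node counts of all trees in a fitted forest."""
    return sum(est.tree_.node_count for est in forest.estimators_)


# Default settings: trees grow until the leaves are pure.
unbounded = RandomForestClassifier(n_estimators=20, random_state=0).fit(X, y)

# Constrained settings: cap the depth and require larger leaves.
bounded = RandomForestClassifier(
    n_estimators=20, max_depth=6, min_samples_leaf=5, random_state=0
).fit(X, y)

print("nodes without limits:", total_nodes(unbounded))
print("nodes with max_depth=6:", total_nodes(bounded))
print("per-node value array shape:", bounded.estimators_[0].tree_.value.shape)
```

A tree of depth 6 has at most 127 nodes, so the constrained forest is bounded in size regardless of how many rows it is trained on. Lowering `n_estimators` reduces memory in the same proportional way.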

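【Discussion】:

I tried max_features="log2", min_samples_split=3, min_samples_leaf=2 as my parameters, but I still face the same problem. I may try max_depth. I have 16 GB of RAM.
Given how few features I have, the depth shouldn't get that large, should it?
I would definitely set a max_depth. Decision trees overfit sharply at high depths. A depth of 6 is often enough, but that of course depends on your model.
When I tried it on 25,000 points, it ran. Is it possible to run it on chunks of the data? I suspect not, since in the end the data is the same.
I think you can. But you cannot train two models on different chunks of the data, because the results would differ. Have you tried max_depth?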