Python running out of memory: answers

【Question title】: Python running out of memory
【Posted】: 2016-07-28 15:32:37
【Question description】:

I have the following program. When I run it, I get a MemoryError, specifically at Fpred = F.predict(A) (see below).

import json
data = []
with open('yelp_data.json') as f:
    for line in f:
        data.append(json.loads(line))
star = []
for i in range(len(data)):
    star.append(data[i].values()[10])

attributes = []
for i in range(len(data)):
    attributes.append(data[i].values()[12])


def flatten_dict(dd, separator=' ', prefix=''):
    return { prefix + separator + k if prefix else k : v
         for kk, vv in dd.items()
         for k, v in flatten_dict(vv, separator, kk).items()
         } if isinstance(dd, dict) else { prefix : dd }

flatten_attr = list(flatten_dict(attributes[i], separator = ' ', prefix = '') for i in range(len(attributes)))


from sklearn.feature_extraction import DictVectorizer
v = DictVectorizer(sparse = False)
X = v.fit_transform(flatten_attr)

from sklearn.feature_extraction.text import TfidfTransformer
Transformer = TfidfTransformer()
A = Transformer.fit_transform(X)

from sklearn.linear_model import LinearRegression
from sklearn.cross_validation import train_test_split

from sklearn.neighbors import KNeighborsRegressor
from sklearn.cross_validation import KFold

F = KNeighborsRegressor(n_neighbors = 27)

Ffit = F.fit(A, star)
Fpred = F.predict(A)
Score = F.score(A, star)
print(Score)

My JSON file looks like this:

{"business_id": "vcNAWiLM4dR7D2nwwJ7nCA", "full_address": "4840 E Indian School Rd\nSte 101\nPhoenix, AZ 85018", "hours": {"Tuesday": {"close": "17:00", "open": "08:00"}, "Friday": {"close": "17:00", "open": "08:00"}, "Monday": {"close": "17:00", "open": "08:00"}, "Wednesday": {"close": "17:00", "open": "08:00"}, "Thursday": {"close": "17:00", "open": "08:00"}}, "open": true, "categories": ["Doctors", "Health & Medical"], "city": "Phoenix", "review_count": 7, "name": "Eric Goldberg, MD", "neighborhoods": [], "longitude": -111.98375799999999, "state": "AZ", "stars": 3.5, "latitude": 33.499313000000001, "attributes": {"By Appointment Only": true}, "type": "business"}
{"business_id": "JwUE5GmEO-sH1FuwJgKBlQ", "full_address": "6162 US Highway 51\nDe Forest, WI 53532", "hours": {}, "open": true, "categories": ["Restaurants"], "city": "De Forest", "review_count": 26, "name": "Pine Cone Restaurant", "neighborhoods": [], "longitude": -89.335843999999994, "state": "WI", "stars": 4.0, "latitude": 43.238892999999997, "attributes": {"Take-out": true, "Good For": {"dessert": false, "latenight": false, "lunch": true, "dinner": false, "breakfast": false, "brunch": false}, "Caters": false, "Noise Level": "average", "Takes Reservations": false, "Delivery": false, "Ambience": {"romantic": false, "intimate": false, "touristy": false, "hipster": false, "divey": false, "classy": false, "trendy": false, "upscale": false, "casual": false}, "Parking": {"garage": false, "street": false, "validated": false, "lot": true, "valet": false}, "Has TV": true, "Outdoor Seating": false, "Attire": "casual", "Alcohol": "none", "Waiter Service": true, "Accepts Credit Cards": true, "Good for Kids": true, "Good For Groups": true, "Price Range": 1}, "type": "business"}

$ls -l yelp_data.json

shows a file size of 33524921 bytes.

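Would it help to extract the data I need into a separate file and import that into this program? What can I do to make this program run more efficiently? Thanks!!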

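【Comments】:

Your code is a bit hard to read. It would help to add some comments and to say how big your yelp_data.json is and what format each line of the JSON file has.
Thanks. I am doing that now.
Your code looks as though Python is not your usual language, but I don't think that is the problem. You could tag the question with sklearn, since I guess those functions may use a lot of memory. If you can use generators instead of lists, you will use less memory. You may also want to delete intermediate lists once they are no longer needed.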
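The generator suggestion in the last comment could look roughly like the sketch below. It is only an illustration under assumptions: iter_records is a hypothetical helper, the two sample lines stand in for yelp_data.json, and the field names stars and attributes are taken from the sample records in the question. It reads the file line by line and keeps only the two fields the program uses, so the full list of parsed records (data) never has to sit in memory. Whether this alone removes the MemoryError at F.predict(A) is not established here.

```python
import json
import os
import tempfile


def iter_records(path, keys=("stars", "attributes")):
    """Yield one small dict per line, keeping only the requested keys."""
    with open(path) as f:
        for line in f:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            record = json.loads(line)
            # Drop every field we do not need as soon as the line is parsed.
            yield {k: record.get(k) for k in keys}


# Tiny stand-in for yelp_data.json (hypothetical sample data).
sample_lines = [
    '{"name": "A", "stars": 3.5, "attributes": {"By Appointment Only": true}}',
    '{"name": "B", "stars": 4.0, "attributes": {"Take-out": true}}',
]
with tempfile.NamedTemporaryFile("w", suffix=".json", delete=False) as tmp:
    tmp.write("\n".join(sample_lines) + "\n")
    sample_path = tmp.name

star = []
attributes = []
for rec in iter_records(sample_path):
    star.append(rec["stars"])
    attributes.append(rec["attributes"])

os.remove(sample_path)  # clean up the temporary sample file
print(star)
```

With the real file, the call would be iter_records('yelp_data.json'), and the rest of the program could use star and attributes as before.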

Tags: python memory scikit-learn

【Solution 1】:

This is not related to performance or memory, but you can replace:

for i in range(len(data)):
    star.append(data[i].values()[10])

with:

for item in data:
    star.append(item.values()[10])

data is a list, which is iterable. https://docs.python.org/3/library/stdtypes.html#list

Also, in Python 3, indexing dict values no longer works; you end up with:

    star.append(data[i].values()[10])
TypeError: 'dict_values' object does not support indexing
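As a small illustration of that error, the sketch below uses a made-up record (not one from the Yelp file) and assumes Python 3. Wrapping the view in list() makes positional access work again, but the position still depends on the key order of each record, so it is shown only for contrast with the lookup by name.

```python
# Hypothetical record, used only to demonstrate the behaviour.
record = {"business_id": "x", "stars": 3.5, "type": "business"}

try:
    record.values()[1]  # worked in Python 2, where values() returned a list
    indexing_failed = False
except TypeError as exc:
    indexing_failed = True
    print("TypeError:", exc)

# list() makes the values indexable, but the position depends on key order.
by_position = list(record.values())[1]

# Looking the value up by key does not depend on position at all.
by_name = record["stars"]
print(by_position, by_name)
```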

Since the items in data are dicts parsed from JSON, you probably want to look up attributes by name rather than rely on their position:

for item in data:
    star.append(item['thekeyyourelookingfor'])

Then turn it into a one-liner:

star = [item['thekeyyourelookingfor'] for item in data]

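Edit: Actually, since json.loads reads a JSON string into a dictionary, the order of the attributes is arbitrary, so when you access them by index you will most likely end up with a different attribute from the one you are looking for. I guess you want to read stars here. I would even guess that this is why your code fails: you are giving sklearn input it does not expect.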
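A minimal sketch of why positional access is fragile, using two made-up records whose keys appear in a different order. It assumes Python 3.7 or later, where a dict keeps the order in which json.loads inserted the keys; on older interpreters the order is arbitrary, which is the situation described above.

```python
import json

# Two hypothetical records with the same kind of data but different key order.
a = json.loads('{"stars": 3.5, "city": "Phoenix", "type": "business"}')
b = json.loads('{"city": "De Forest", "type": "business", "stars": 4.0}')

# Positional access picks whatever value happens to come first in each record.
by_position = [list(d.values())[0] for d in (a, b)]

# Access by key always returns the rating, whatever the key order is.
by_name = [d["stars"] for d in (a, b)]

print(by_position)
print(by_name)  # [3.5, 4.0]
```

Passing a list like by_position to a regressor as the target would mix ratings with unrelated strings, so building star from item['stars'] is the safer choice.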

【Comments】:

Try: star = [item.get('stars') for item in data] and attributes = [item.get('attributes') for item in data]. See my edit above.