【Posted】: 2016-07-28 15:32:37
【Question】:
I have the following program. When I run it, I get a MemoryError, specifically at Fpred = F.predict(A) (see below).
import json

data = []
with open('yelp_data.json') as f:
    for line in f:
        data.append(json.loads(line))

star = []
for i in range(len(data)):
    star.append(data[i].values()[10])

attributes = []
for i in range(len(data)):
    attributes.append(data[i].values()[12])

def flatten_dict(dd, separator=' ', prefix=''):
    return { prefix + separator + k if prefix else k : v
             for kk, vv in dd.items()
             for k, v in flatten_dict(vv, separator, kk).items()
             } if isinstance(dd, dict) else { prefix : dd }

flatten_attr = list(flatten_dict(attributes[i], separator = ' ', prefix = '') for i in range(len(attributes)))

from sklearn.feature_extraction import DictVectorizer
v = DictVectorizer(sparse = False)
X = v.fit_transform(flatten_attr)

from sklearn.feature_extraction.text import TfidfTransformer
Transformer = TfidfTransformer()
A = Transformer.fit_transform(X)

from sklearn.linear_model import LinearRegression
from sklearn.cross_validation import train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.cross_validation import KFold

F = KNeighborsRegressor(n_neighbors = 27)
Ffit = F.fit(A, star)
Fpred = F.predict(A)
Score = F.score(A, star)
print(Score)
My JSON file looks like this (one record per line):
{"business_id": "vcNAWiLM4dR7D2nwwJ7nCA", "full_address": "4840 E Indian School Rd\nSte 101\nPhoenix, AZ 85018", "hours": {"Tuesday": {"close": "17:00", "open": "08:00"}, "Friday": {"close": "17:00", "open": "08:00"}, "Monday": {"close": "17:00", "open": "08:00"}, "Wednesday": {"close": "17:00", "open": "08:00"}, "Thursday": {"close": "17:00", "open": "08:00"}}, "open": true, "categories": ["Doctors", "Health & Medical"], "city": "Phoenix", "review_count": 7, "name": "Eric Goldberg, MD", "neighborhoods": [], "longitude": -111.98375799999999, "state": "AZ", "stars": 3.5, "latitude": 33.499313000000001, "attributes": {"By Appointment Only": true}, "type": "business"}
{"business_id": "JwUE5GmEO-sH1FuwJgKBlQ", "full_address": "6162 US Highway 51\nDe Forest, WI 53532", "hours": {}, "open": true, "categories": ["Restaurants"], "city": "De Forest", "review_count": 26, "name": "Pine Cone Restaurant", "neighborhoods": [], "longitude": -89.335843999999994, "state": "WI", "stars": 4.0, "latitude": 43.238892999999997, "attributes": {"Take-out": true, "Good For": {"dessert": false, "latenight": false, "lunch": true, "dinner": false, "breakfast": false, "brunch": false}, "Caters": false, "Noise Level": "average", "Takes Reservations": false, "Delivery": false, "Ambience": {"romantic": false, "intimate": false, "touristy": false, "hipster": false, "divey": false, "classy": false, "trendy": false, "upscale": false, "casual": false}, "Parking": {"garage": false, "street": false, "validated": false, "lot": true, "valet": false}, "Has TV": true, "Outdoor Seating": false, "Attire": "casual", "Alcohol": "none", "Waiter Service": true, "Accepts Credit Cards": true, "Good for Kids": true, "Good For Groups": true, "Price Range": 1}, "type": "business"}
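$ ls -l yelp_data.json
reports a file size of 33524921 bytes.
Is the best I can do to extract the required data into a separate file and import that into this program? What would be a good way to improve this program so that it runs more efficiently? Thanks!!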
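To make the flattening step easier to follow, here is a minimal sketch of what flatten_dict returns for nested attributes like the ones in the sample records above. The sample_attributes dictionary is a hypothetical, abbreviated version of the second record, and the function is rewritten with an explicit if only for readability; it is meant to behave the same as the one-expression version in the program.

```python
def flatten_dict(dd, separator=' ', prefix=''):
    """Flatten nested dicts, joining parent and child keys with `separator`."""
    if not isinstance(dd, dict):
        # Leaf value: return it under the key accumulated so far.
        return {prefix: dd}
    return {
        prefix + separator + k if prefix else k: v
        for kk, vv in dd.items()
        for k, v in flatten_dict(vv, separator, kk).items()
    }


# Hypothetical, abbreviated attributes modeled on the second sample record.
sample_attributes = {
    "Take-out": True,
    "Good For": {"lunch": True, "dinner": False},
    "Noise Level": "average",
    "Price Range": 1,
}

flat = flatten_dict(sample_attributes)
for key in sorted(flat):
    print(key, "=", flat[key])
```

Each nested key becomes one flat key such as "Good For lunch", so the number of columns DictVectorizer creates grows with every distinct nested attribute and every distinct string value.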
【Comments】:
-
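It is a bit hard to read your code. You should add some comments and tell us how large your yelp_data.json is and what the format of each line in the JSON file is.
-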
Thanks. I am working on that.
-
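Your code looks like Python is not your usual language, but I don't think that is the problem. You could tag the question with sklearn, since I guess those functions may use a lot of memory. If you use generators instead of lists, you avoid holding every intermediate result in memory at once. You may also want to delete intermediate lists as soon as they are no longer needed.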
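A minimal sketch of the generator idea, assuming each line is a JSON object with "stars" and "attributes" keys as in the sample records in the question. The function names iter_stars_and_attributes and load_targets_and_features are hypothetical, and the tiny two-record file is made up for the demo. Looking fields up by name also avoids depending on the position of a value in data[i].values().

```python
import json
import os
import tempfile


def iter_stars_and_attributes(path):
    """Yield (stars, attributes) one record at a time from a JSON-lines file."""
    with open(path) as f:
        for line in f:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            record = json.loads(line)
            # Assumption: field names match the sample records in the question.
            yield record["stars"], record.get("attributes", {})


def load_targets_and_features(path):
    """Collect only the two fields the model needs; full records are discarded."""
    star = []
    attributes = []
    for stars, attrs in iter_stars_and_attributes(path):
        star.append(stars)
        attributes.append(attrs)
    return star, attributes


# Demo on a made-up two-record file so the sketch runs on its own.
sample_lines = [
    '{"stars": 3.5, "attributes": {"By Appointment Only": true}, "type": "business"}',
    '{"stars": 4.0, "attributes": {"Take-out": true, "Price Range": 1}, "type": "business"}',
]
with tempfile.NamedTemporaryFile("w", suffix=".json", delete=False) as tmp:
    tmp.write("\n".join(sample_lines) + "\n")
    sample_path = tmp.name

star, attributes = load_targets_and_features(sample_path)
os.remove(sample_path)

print(star)
print(attributes)
```

This only removes the full data list from memory. It does not by itself shrink the feature matrix or the work done inside predict, so it may not be enough to avoid the MemoryError.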
Tags: python memory scikit-learn