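【Posted】: 2017-07-31 07:23:15
【Question】:
This is my first post, so please bear with me.
I have a large (~1GB) JSON file containing tweets I collected through Twitter's Streaming API. I can successfully parse it into a CSV with the fields I need, but it is very slow, even though I am only extracting a handful of entities (user ID, latitude/longitude, and the Twitter date string parsed into a date/time). What approaches can I try to speed this up? It currently takes several hours, and I expect to collect more data....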
import ujson
from datetime import datetime
from dateutil import tz
from csv import writer
import time

def hms_string(sec_elapsed):
    h = int(sec_elapsed / (60 * 60))
    m = int((sec_elapsed % (60 * 60)) / 60)
    s = sec_elapsed % 60.
    return "{}:{:>02}:{:>05.2f}".format(h, m, s)

start_time = time.time()

with open('G:\Programming Projects\GGS 681\dmv_raw_tweets1.json', 'r') as in_file, \
     open('G:\Programming Projects\GGS 681\dmv_tweets1.csv', 'w') as out_file:
    print >> out_file, 'user_id,timestamp,latitude,longitude'
    csv = writer(out_file)
    tweets_count = 0
    for line in in_file:
        tweets_count += 1
        tweets = ujson.loads(line)
        timestamp = []
        lats = ''
        longs = ''
        for tweet in tweets:
            tweet = tweets
            from_zone = tz.gettz('UTC')
            to_zone = tz.gettz('America/New_York')
            times = tweet['created_at']
            for tweet in tweets:
                times = tweets['created_at']
                utc = datetime.strptime(times, '%a %b %d %H:%M:%S +0000 %Y')
                utc = utc.replace(tzinfo=from_zone)  # comment out to parse to utc
                est = utc.astimezone(to_zone)  # comment out to parse to utc
                timestamp = est.strftime('%m/%d/%Y %I:%M:%S %p')  # use %p to differentiate AM/PM
                for tweet in tweets:
                    if tweets['geo'] and tweets['geo']['coordinates'][0]:
                        lats, longs = tweets['geo']['coordinates'][:2]
                    else:
                        pass
        row = (
            tweets['user']['id'],
            timestamp,
            lats,
            longs
        )
        values = [(value.encode('utf8') if hasattr(value, 'encode') else value) for value in row]
        csv.writerow(values)

end_time = time.time()
print "{} to execute this".format(hms_string(end_time - start_time))
【Discussion】:
- Why are you doing an inner iteration over tweets when you have already iterated over it?
- Honestly, probably just user error.
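
Following up on the comment about the inner iteration: below is a minimal sketch, assuming Python 3 and one JSON tweet per line as in the question, that handles each tweet exactly once instead of looping over its keys. The names `tweet_to_row` and `convert` are made up for illustration and are not from the original post. The `to_zone` argument is assumed to be any `tzinfo` object, for instance the `tz.gettz('America/New_York')` value from the question, created once outside the loop. The `ujson` import falls back to the standard `json` module if `ujson` is not installed.

```python
import csv
from datetime import datetime, timezone

try:
    import ujson as json  # the faster parser used in the question, if installed
except ImportError:
    import json  # standard-library fallback

# Same format string as in the question, defined once
TWITTER_FORMAT = '%a %b %d %H:%M:%S +0000 %Y'


def tweet_to_row(tweet, to_zone):
    """Build one CSV row from a decoded tweet dict, reading each field once."""
    utc = datetime.strptime(tweet['created_at'], TWITTER_FORMAT)
    local = utc.replace(tzinfo=timezone.utc).astimezone(to_zone)
    lat, lon = '', ''
    geo = tweet.get('geo')
    # Same geo check as in the question
    if geo and geo['coordinates'][0]:
        lat, lon = geo['coordinates'][:2]
    return (tweet['user']['id'],
            local.strftime('%m/%d/%Y %I:%M:%S %p'),
            lat, lon)


def convert(in_file, out_file, to_zone):
    """Stream line-delimited tweets from in_file into CSV rows in out_file."""
    out = csv.writer(out_file)
    out.writerow(('user_id', 'timestamp', 'latitude', 'longitude'))
    count = 0
    for line in in_file:
        line = line.strip()
        if not line:
            continue  # skip blank lines, if any
        out.writerow(tweet_to_row(json.loads(line), to_zone))
        count += 1
    return count
```

A call might look like `convert(in_file, out_file, tz.gettz('America/New_York'))`, with the output file opened using `newline=''` as the `csv` module recommends on Python 3. How much time this saves has not been measured against the original data.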