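[Posted]: 2016-10-10 10:46:29
[Question]:
I have a folder containing 300+ .txt files with a total size of 15GB+. The files contain tweets, one tweet per line. I have a list of keywords that I want to search the tweets for. I wrote a script that searches every line of every file for each item in the list. If a tweet contains a keyword, the script writes that line to another file. Here is my code: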
# Search each file for every item in keywords
print("Searching the files of " + filename + " for the appropriate keywords...")
for file in os.listdir(file_path):
    f = open(file_path + file, 'r')
    for line in f:
        for key in keywords:
            if re.search(key, line, re.IGNORECASE):
                db.write(line)
This is the format of each line:
{"created_at":"Wed Feb 03 06:53:42 +0000 2016","id":694775753754316801,"id_str":"694775753754316801","text":"me with Dibyabhumi Multiple College students https:\/\/t.co\/MqmDwbCDAF","source":"\u003ca href=\"http:\/\/www.facebook.com\/twitter\" rel=\"nofollow\"\u003eFacebook\u003c\/a\u003e","truncated":false,"in_reply_to_status_id":null,"in_reply_to_status_id_str":null,"in_reply_to_user_id":null,"in_reply_to_user_id_str":null,"in_reply_to_screen_name":null,"user":{"id":5981342,"id_str":"5981342","name":"Lava Kafle","screen_name":"lkafle","location":"Kathmandu, Nepal","url":"http:\/\/about.me\/lavakafle","description":"@deerwalkinc 24000+ tweeps bigdata #Team #Genomics http:\/\/deerwalk.com #Genetic #Testing #population #health #management #BigData #Analytics #java #hadoop","protected":false,"verified":false,"followers_count":24742,"friends_count":23169,"listed_count":1481,"favourites_count":147252,"statuses_count":171880,"created_at":"Sat May 12 04:49:14 +0000 2007","utc_offset":20700,"time_zone":"Kathmandu","geo_enabled":true,"lang":"en","contributors_enabled":false,"is_translator":false,"profile_background_color":"EDECE9","profile_background_image_url":"http:\/\/abs.twimg.com\/images\/themes\/theme3\/bg.gif","profile_background_image_url_https":"https:\/\/abs.twimg.com\/images\/themes\/theme3\/bg.gif","profile_background_tile":false,"profile_link_color":"088253","profile_sidebar_border_color":"FFFFFF","profile_sidebar_fill_color":"E3E2DE","profile_text_color":"634047","profile_use_background_image":true,"profile_image_url":"http:\/\/pbs.twimg.com\/profile_images\/677805092859420672\/kzoS-GZ__normal.jpg","profile_image_url_https":"https:\/\/pbs.twimg.com\/profile_images\/677805092859420672\/kzoS-GZ__normal.jpg","profile_banner_url":"https:\/\/pbs.twimg.com\/profile_banners\/5981342\/1416802075","default_profile":false,"default_profile_image":false,"following":null,"follow_request_sent":null,"notifications":null},"geo":null,"coordinates":null,"place":null,"contributors":null,"is_quote_status":false,"retweet_count":0,"favorite_count":0,"entities":{"hashtags":[],"urls":[{"url":"https:\/\/t.co\/MqmDwbCDAF","expanded_url":"http:\/\/fb.me\/Yj1JW9bJ","display_url":"fb.me\/Yj1JW9bJ","indices":[45,68]}],"user_mentions":[],"symbols":[]},"favorited":false,"retweeted":false,"possibly_sensitive":false,"filter_level":"low","lang":"en","timestamp_ms":"1454482422661"}
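The script works, but it takes a very long time. For ~40 keywords it needs more than 2 hours. Clearly my code is not optimized. What can I do to improve the speed?
P.S. I have read some related questions about searching and speed, but I suspect the problem in my script is that I am using a list of keywords. I have tried some of the suggested solutions, to no avail.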
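[Comments]:
-
Are the tweets saved in raw JSON format?
-
@e4c5 I've edited the question to include that information. I'm not sure whether this is "raw JSON format", because I inherited these files. So far I haven't worked with the Twitter API. My task is the search.
-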
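First, I would compile the regular expressions into a single pattern: pat = re.compile("|".join(keywords), re.IGNORECASE). You are also writing the same line once for every keyword that matches, which looks wrong. Also, if you only want to search the tweet text, searching json.loads(line)["text"] might be faster.
-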
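Now you are talking about a database, while the question title says files. Which is it?
-
@e4c5 Poor choice of words. As I said in the post, it is a list of files containing tweets in the format I showed. Sorry.
Tags: python list python-3.x search twitter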
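The suggestion from the comments (one compiled pattern, each matching line written only once, optionally searching only the tweet text) could look roughly like the sketch below. It is only an illustration under assumptions: the names build_pattern, filter_tweets, out_path and text_only are hypothetical and do not come from the original script, the files are assumed to be UTF-8, the keywords are assumed to be literal strings (hence re.escape), and the output file is assumed to live outside the input folder.

```python
import json
import os
import re


def build_pattern(keywords):
    """Combine all keywords into one case-insensitive alternation."""
    if not keywords:
        raise ValueError("keywords must not be empty")
    # Assumption: keywords are literal text. Drop re.escape if they are
    # meant to be regular expressions themselves.
    return re.compile("|".join(re.escape(k) for k in keywords), re.IGNORECASE)


def filter_tweets(file_path, keywords, out_path, text_only=False):
    """Copy each matching line from the files in file_path to out_path.

    A line is written at most once, however many keywords it contains.
    Returns the number of lines written.
    """
    pattern = build_pattern(keywords)  # compiled once, reused for every line
    written = 0
    with open(out_path, "w", encoding="utf-8") as out:
        for name in sorted(os.listdir(file_path)):
            full = os.path.join(file_path, name)
            if not os.path.isfile(full):
                continue
            with open(full, "r", encoding="utf-8") as f:
                for line in f:
                    haystack = line
                    if text_only:
                        # Search only the decoded "text" field of the tweet.
                        try:
                            record = json.loads(line)
                        except ValueError:
                            continue  # skip lines that are not valid JSON
                        text = record.get("text") if isinstance(record, dict) else None
                        if not isinstance(text, str):
                            continue
                        haystack = text
                    if pattern.search(haystack):
                        out.write(line)
                        written += 1
    return written
```

A call might look like filter_tweets("tweets/", keywords, "matches.txt"). With text_only=False the whole raw line is searched, so a keyword can also match inside fields such as the user description; with text_only=True only the tweet text counts, and JSON escapes such as \/ or \u003c are decoded before matching. Whether parsing every line with json.loads ends up faster or slower than searching the raw line is worth measuring on a sample of the data.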