Searching big files using a list in Python - How can I improve the speed? Answers

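【Question title】: Searching big files using list in Python - How can I improve the speed?
【Posted】: 2016-10-10 10:46:29
【Question】:

I have a folder with 300+ .txt files totalling 15GB+. The files contain tweets, one tweet per line. I have a list of keywords I want to search the tweets for. I wrote a script that searches every line of every file for each item in the list. If a tweet contains a keyword, the script writes that line to another file. This is my code: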

# Search each file for every item in keywords
print("Searching the files of " + filename + " for the appropriate keywords...")
for file in os.listdir(file_path):
    f = open(file_path + file, 'r')
    for line in f:
        for key in keywords:
            if re.search(key, line, re.IGNORECASE):
                db.write(line)

This is the format of each line:

{"created_at":"Wed Feb 03 06:53:42 +0000 2016","id":694775753754316801,"id_str":"694775753754316801","text":"me with Dibyabhumi Multiple College students https:\/\/t.co\/MqmDwbCDAF","source":"\u003ca href=\"http:\/\/www.facebook.com\/twitter\" rel=\"nofollow\"\u003eFacebook\u003c\/a\u003e","truncated":false,"in_reply_to_status_id":null,"in_reply_to_status_id_str":null,"in_reply_to_user_id":null,"in_reply_to_user_id_str":null,"in_reply_to_screen_name":null,"user":{"id":5981342,"id_str":"5981342","name":"Lava Kafle","screen_name":"lkafle","location":"Kathmandu, Nepal","url":"http:\/\/about.me\/lavakafle","description":"@deerwalkinc 24000+ tweeps bigdata  #Team #Genomics  http:\/\/deerwalk.com #Genetic #Testing #population #health #management #BigData #Analytics #java #hadoop","protected":false,"verified":false,"followers_count":24742,"friends_count":23169,"listed_count":1481,"favourites_count":147252,"statuses_count":171880,"created_at":"Sat May 12 04:49:14 +0000 2007","utc_offset":20700,"time_zone":"Kathmandu","geo_enabled":true,"lang":"en","contributors_enabled":false,"is_translator":false,"profile_background_color":"EDECE9","profile_background_image_url":"http:\/\/abs.twimg.com\/images\/themes\/theme3\/bg.gif","profile_background_image_url_https":"https:\/\/abs.twimg.com\/images\/themes\/theme3\/bg.gif","profile_background_tile":false,"profile_link_color":"088253","profile_sidebar_border_color":"FFFFFF","profile_sidebar_fill_color":"E3E2DE","profile_text_color":"634047","profile_use_background_image":true,"profile_image_url":"http:\/\/pbs.twimg.com\/profile_images\/677805092859420672\/kzoS-GZ__normal.jpg","profile_image_url_https":"https:\/\/pbs.twimg.com\/profile_images\/677805092859420672\/kzoS-GZ__normal.jpg","profile_banner_url":"https:\/\/pbs.twimg.com\/profile_banners\/5981342\/1416802075","default_profile":false,"default_profile_image":false,"following":null,"follow_request_sent":null,"notifications":null},"geo":null,"coordinates":null,"place":null,"contributors":null,"is_quote_status":false,"retweet_count":0,"favorite_count":0,"entities":{"hashtags":[],"urls":[{"url":"https:\/\/t.co\/MqmDwbCDAF","expanded_url":"http:\/\/fb.me\/Yj1JW9bJ","display_url":"fb.me\/Yj1JW9bJ","indices":[45,68]}],"user_mentions":[],"symbols":[]},"favorited":false,"retweeted":false,"possibly_sensitive":false,"filter_level":"low","lang":"en","timestamp_ms":"1454482422661"}

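The script works, but it takes a long time. For ~40 keywords it needs more than 2 hours. Obviously my code is not optimised. What can I do to improve the speed?

P.S. I have read some related questions about searching and speed, but I suspect the problem in my script is that I am using a list of keywords. I have tried some of the suggested solutions, to no avail.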

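【Comments】:

Are the tweets saved in raw JSON format?
@e4c5 I edited the post to include the information. I am not sure whether this is "raw JSON format", because I inherited these files. I have not used the Twitter API so far. My task is the searching.
Firstly, I would compile the regexes into one: pat = re.compile("|".join(keywords), re.IGNORECASE). You are also writing the same line several times, once for every keyword that matches, which seems wrong. Also, if you only want to search the text, searching json.loads(line)["text"] might be faster.
Now you are talking about a database, while the question title says files. Which one is it?
@e4c5 Poor choice of words. As I said in the post, it is a list of files containing tweets in the format I showed. Sorry.

Tags: python list python-3.x search twitter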

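【Solution 1】:

1) External library

If you are willing to depend on an external library (and execution time matters more than the one-off installation cost), you may gain some speed by loading each file into a simple Pandas DataFrame and performing the keyword search as a vectorised operation. To get the matching tweets, you could do something like: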

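import pandas as pd
# Each line is one JSON object, so read the file as JSON Lines rather than as CSV.
tweets = pd.read_json("/path/to/file.txt", lines=True)
# Boolean mask: True where the tweet text contains any keyword (case-insensitive).
# str.contains searches anywhere in the string; str.match only anchors at the start.
matched = tweets["text"].str.contains("keyword_a|keyword_b", case=False, na=False)
matched_tweets = tweets[matched]  # Uses the boolean mask above to filter the full dataframe
# You'd then have a mini dataframe of matching tweets in `matched_tweets`.
# You could save it with `matched_tweets.to_json(path, orient="records", lines=True)`.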

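Vectorised DataFrame operations in Pandas are usually fast, so this may be worth investigating.

2) Group your regular expressions

It looks like you are not recording which keyword matched. If that is true, you can group the keywords into a single regular expression, like this: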

# Build and compile the combined pattern once, outside the loop over lines.
keywords_combined = re.compile("|".join(keywords), re.IGNORECASE)
for line in f:
    if keywords_combined.search(line):
        db.write(line)

I have not tested this, but reducing the number of loops per line should save some time.
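
Putting this answer and the comment about precompiling together, a minimal sketch of the whole loop might look like the following. It assumes the keywords are plain strings rather than regular expressions (hence `re.escape`), that the files are UTF-8 text, and that the output file lives outside the input folder. The names `filter_tweets`, `input_dir` and `output_path` are made up for illustration.

```python
import os
import re


def filter_tweets(input_dir, output_path, keywords):
    """Copy every line that contains at least one keyword to output_path.

    Returns the number of lines written.
    """
    # Build one alternation pattern once, instead of one search per keyword per line.
    # re.escape treats keywords as literal text (an assumption; drop it for regex keywords).
    pattern = re.compile("|".join(re.escape(k) for k in keywords), re.IGNORECASE)
    written = 0
    with open(output_path, "w", encoding="utf-8") as out:
        for name in sorted(os.listdir(input_dir)):
            path = os.path.join(input_dir, name)
            if not os.path.isfile(path):
                continue
            # The with-block closes each input file once it has been read.
            with open(path, "r", encoding="utf-8", errors="replace") as f:
                for line in f:
                    # A line is written at most once, however many keywords match.
                    if pattern.search(line):
                        out.write(line)
                        written += 1
    return written
```

Unlike the original loop, this writes a matching line only once and closes each input file after reading it.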

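【Comments】:

【Solution 2】:

Why it is slow

You are running a regex search over a JSON dump, which is not always a good idea. For example, if your keywords include words such as user, time, profile and image, every line will produce a match, because the JSON format of a tweet has all of these words as dictionary keys.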
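
To illustrate the false-positive problem, here is a small sketch that compares searching the raw line with searching only the decoded `text` field, as suggested in the comments. The sample line is a shortened, made-up tweet, and `text_matches` is a hypothetical helper name.

```python
import json
import re

pattern = re.compile("user|profile", re.IGNORECASE)

# A shortened, made-up line in the same shape as the sample tweet.
line = ('{"text":"me with college students",'
        '"user":{"screen_name":"lkafle","profile_text_color":"634047"}}')


def text_matches(raw_line, compiled):
    """Return True if the tweet's text field (not the whole JSON) matches."""
    try:
        tweet = json.loads(raw_line)
    except ValueError:
        return False  # skip lines that are not valid JSON
    if not isinstance(tweet, dict):
        return False
    # Lines without a text field are treated as having empty text.
    return bool(compiled.search(tweet.get("text") or ""))


raw_hit = bool(pattern.search(line))    # True: matches the "user" dictionary key
text_hit = text_matches(line, pattern)  # False: the text itself has no keyword
```

When `text_matches` returns True, the whole original line can still be written out, so the metadata is not lost.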

Besides, the raw JSON is huge: each tweet will be more than 1 KB (this one is 2.1 KB), yet the only relevant part of your sample is:

"text":"me with Dibyabhumi Multiple College students https:\/\/t.co\/MqmDwbCDAF",

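That is less than 100 bytes, and despite recent API changes a typical tweet is still under 140 characters.

Things to try:

Precompile the regex, as suggested by Padraic Cunningham.

Option 1. Load this data into a PostgreSQL JSONB field. JSONB fields are indexable and can be searched very quickly.

Option 2. Load it into any old database, with the contents of the text field in its own column, so that this column can be searched easily.

Option 3. Last but not least, extract only the text field into its own file. You could have a CSV file where the first column is the screen name and the second is the text of the tweet. Your 15GB would shrink to around 1GB.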

In short, what you are doing now is searching the whole farm for the needle, when you only need to search the haystack.
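
As a rough sketch of option 2, the standard-library `sqlite3` module can stand in for "any old database". The table and column names (`tweets`, `text`, `raw`) and the helper names are assumptions for illustration. The raw line is stored next to the text, so the full metadata stays available.

```python
import json
import sqlite3


def _rows(lines):
    """Yield (text, raw_line) pairs, skipping lines without a usable text field."""
    for line in lines:
        try:
            tweet = json.loads(line)
        except ValueError:
            continue  # ignore lines that are not valid JSON
        if isinstance(tweet, dict) and isinstance(tweet.get("text"), str):
            yield tweet["text"], line.rstrip("\n")


def load_tweets(conn, lines):
    """Insert each JSON line into a table that keeps the text in its own column."""
    conn.execute("CREATE TABLE IF NOT EXISTS tweets (text TEXT, raw TEXT)")
    # executemany consumes the generator lazily, so a file can be passed directly.
    conn.executemany("INSERT INTO tweets (text, raw) VALUES (?, ?)", _rows(lines))
    conn.commit()


def search_tweets(conn, keyword):
    """Return the raw JSON lines whose text contains keyword.

    SQLite's LIKE is case-insensitive for ASCII letters by default.
    """
    # Escape LIKE wildcards so the keyword is matched literally.
    escaped = keyword.replace("\\", "\\\\").replace("%", "\\%").replace("_", "\\_")
    cur = conn.execute(
        "SELECT raw FROM tweets WHERE text LIKE ? ESCAPE '\\'",
        ("%" + escaped + "%",),
    )
    return [row[0] for row in cur]
```

A `LIKE` with a leading wildcard cannot use an ordinary index, so a full-text index would be the natural next step if this is still too slow.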

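【Comments】:

The only problem is that I do need the whole line, because I am interested in the metadata each tweet contains.
That only applies to option 3, not to the other two options.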