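【Posted】:2021-09-22 09:02:06
【Question】:
I have about 30,000 URLs in a CSV file, and I need to check whether each URL has meta content. I am using requests_cache to cache the responses in a SQLite database. Even with the cache, a full run takes about 24 hours, so I turned to concurrency. I think I am doing something wrong with out = executor.map(download_site, sites, headers), but I don't know how to fix it.
AttributeError: 'str' object has no attribute 'items'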
import concurrent.futures
import requests
import threading
import time
import pandas as pd
import requests_cache
from PIL import Image
from io import BytesIO

thread_local = threading.local()
df = pd.read_csv("test.csv")
sites = []
for row in df['URLS']:
    sites.append(row)
    # print("URL is shortened")

user_agent = 'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.0.7) Gecko/2009021910 Firefox/3.0.7'
headers={'User-Agent':user_agent,}
requests_cache.install_cache('network_call', backend='sqlite', expire_after=2592000)

def getSess():
    if not hasattr(thread_local, "session"):
        thread_local.session = requests.Session()
    return thread_local.session

def networkCall(url, headers):
    print("In Download site")
    session = getSess()
    with session.get(url, headers=headers) as response:
        print(f"Read {len(response.content)} from {url}")
        return response.content

out = []

def getMeta(meta_res):
    print("Get data")
    for each in meta_res:
        meta = each.find_all('meta')
        for tag in meta:
            if 'name' in tag.attrs.keys() and tag.attrs['name'].strip().lower() in ['description', 'keywords']:
                content = tag.attrs['content']
                if content != '':
                    out.append("Absent")
                else:
                    out.append("Present")
    return out

def allSites(sites):
    with concurrent.futures.ThreadPoolExecutor(max_workers=50) as executor:
        out = executor.map(networkCall, sites, headers)
        return list(out)

if __name__ == "__main__":
    sites = [
        "https://www.jython.org",
        "http://olympus.realpython.org/dice",
    ] * 15000
    start_time = time.time()
    list_meta = allSites(sites)
    print("META ", list_meta)
    duration = time.time() - start_time
    print(f"Downloaded {len(sites)} in {duration} seconds")
    output = getMeta(list_meta)
    df["is it there"] = pd.Series(output)
    df.to_csv('new.csv',index=False, header=True)
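A likely cause of the AttributeError, sketched under stated assumptions: executor.map pairs up its iterable arguments the way zip does, so passing the headers dict as a third argument makes it iterate over the dict's keys. Each call then receives the string 'User-Agent' instead of the dict, and because map stops at the shortest iterable, only one URL is submitted. One possible fix is to bind headers with functools.partial so that map iterates over the URLs only. The names fetch_all and fake_fetch below are hypothetical; fake_fetch stands in for networkCall so the sketch runs without network access.

```python
import concurrent.futures
from functools import partial


def fetch_all(urls, headers, fetch, max_workers=50):
    """Call fetch(url, headers=headers) for every URL with the same headers dict."""
    # partial() pins headers as a keyword argument, so map() iterates over urls only.
    bound_fetch = partial(fetch, headers=headers)
    with concurrent.futures.ThreadPoolExecutor(max_workers=max_workers) as executor:
        # map() yields results in the same order as the input URLs.
        return list(executor.map(bound_fetch, urls))


def fake_fetch(url, headers):
    """Hypothetical stand-in for networkCall; performs no network I/O."""
    return f"{url} fetched with {headers['User-Agent']}"


if __name__ == "__main__":
    headers = {"User-Agent": "example-agent"}
    urls = ["https://www.jython.org", "http://olympus.realpython.org/dice"]
    # What the original call effectively paired up: one URL with the dict's key.
    print(list(zip(urls, headers)))
    # With headers bound, every URL is processed.
    print(fetch_all(urls, headers, fake_fetch))
```

In the question's code this would correspond to executor.map(partial(networkCall, headers=headers), sites). Passing itertools.repeat(headers) as the third argument should behave equivalently.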
【Comments】:
-
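This code cannot run because it is missing several functions. Also keep in mind that even if your code is as efficient as possible, you may still be limited by how long the various URLs take to respond to an HTTP GET.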
-
Have you tried using asyncio? I have had some success with it in speeding up queries across hundreds of pages.
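A minimal sketch of what the asyncio suggestion could look like, not a drop-in replacement for the question's code. It caps the number of in-flight requests with a semaphore and collects results in input order. The names fetch_all, bounded_fetch, and fake_fetch are hypothetical, and fake_fetch only simulates latency with asyncio.sleep; a real version would need an async HTTP client in its place.

```python
import asyncio


async def fetch_all(urls, fetch, max_concurrency=50):
    """Await the coroutine fetch(url) for every URL, at most max_concurrency at a time."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def bounded_fetch(url):
        # The semaphore limits how many fetches run concurrently.
        async with semaphore:
            return await fetch(url)

    # gather() returns results in the same order as the input URLs.
    return await asyncio.gather(*(bounded_fetch(url) for url in urls))


async def fake_fetch(url):
    """Hypothetical stand-in for a real async HTTP request; no network I/O."""
    await asyncio.sleep(0.01)  # simulate waiting on a response
    return len(url)


if __name__ == "__main__":
    urls = ["https://www.jython.org", "http://olympus.realpython.org/dice"]
    print(asyncio.run(fetch_all(urls, fake_fetch)))
```

Because the waiting happens inside the event loop rather than in threads, this style can keep many requests in flight cheaply, but total time is still bounded by how quickly the remote servers respond.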
-
@DarkKnight It is runnable; just comment out the df[test.csv] part. I defined the sites variable under if __name__ == "__main__":.
-
@MariaZentsova Ah, that is my last resort.
-
It is not runnable, because download_sites and get_session are both missing.
Tags: python python-3.x multithreading asynchronous