在本地存储请求并在稍后将它们恢复为 Beautifoul Soup 对象
如果您正在遍历网站页面,您可以使用request 存储每个页面,此处解释。
在脚本所在的同一文件夹中创建文件夹 soupCategory。
为headers 使用任何latest user agent
headers = {'user-agent':'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/13.0 Safari/605.1.15'}
def getCategorySoup():
session = requests.Session()
retry = Retry(connect=7, backoff_factor=0.5)
adapter = HTTPAdapter(max_retries=retry)
session.mount('http://', adapter)
session.mount('https://', adapter)
basic_url = "https://www.somescrappingdomain.com/apartments?adsWithImages=1&page="
t0 = time.time()
j=0
totalPages = 1525 # put your number of pages here
for i in range(1,totalPages):
url = basic_url+str(i)
r = requests.get(url, headers=headers)
pageName = "./soupCategory/"+str(i)+".html"
with open(pageName, mode='w', encoding='UTF-8', errors='strict', buffering=1) as f:
f.write(r.text)
print (pageName, end=" ")
t1 = time.time()
total = t1-t0
print ("Total time for getting ",totalPages," category pages is ", round(total), " seconds")
return
稍后您可以创建 Beautifoul Soup 对象,如 @merlin2011 提到的:
with open("/soupCategory/1.html") as f:
soup = BeautifulSoup(f)