Python - save requests or BeautifulSoup object locally

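【Question title】: Python - save requests or BeautifulSoup object locally
【Posted】: 2014-05-29 22:04:55
【Question】:

I have some code that is quite long, so it takes a long time to run. I would like to simply save either the requests object (in this case "name") or the BeautifulSoup object (in this case "soup") locally, so that I can save time on the next run. Here is the code: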

from bs4 import BeautifulSoup
import requests

url = 'SOMEURL'
name = requests.get(url)
soup = BeautifulSoup(name.content)

【Comments】:

You might find the pickle module useful...
How about saving the HTML source to an .html file?
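
As a rough illustration of the pickle suggestion, here is a minimal sketch that stores only the parts of a response that are usually worth keeping (URL, status code, and body bytes) as a plain dict. The helper names save_page and load_page are hypothetical and not part of requests or BeautifulSoup.

```python
import pickle
from pathlib import Path


def save_page(path: Path, url: str, status_code: int, content: bytes) -> None:
    """Pickle the parts of a response worth keeping, as a plain dict."""
    record = {"url": url, "status_code": status_code, "content": content}
    with path.open("wb") as f:
        pickle.dump(record, f)


def load_page(path: Path) -> dict:
    """Load a record previously written by save_page."""
    # Only unpickle files you created yourself: pickle can run code on load.
    with path.open("rb") as f:
        return pickle.load(f)
```

With the question's variables this could be called as save_page(Path("page.pkl"), url, name.status_code, name.content), and the loaded record's "content" entry can then be passed to BeautifulSoup.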

Tags: python file beautifulsoup scrape

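【Solution 1】:

Since name.content is just HTML, you can dump it to a file and read it back later.

Usually the bottleneck is not the parsing but the network latency of making the request.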

from bs4 import BeautifulSoup
import requests

url = 'https://google.com'
name = requests.get(url)

# name.content is bytes, so write the file in binary mode
with open("/tmp/A.html", "wb") as f:
  f.write(name.content)


# read it back in
with open("/tmp/A.html", "rb") as f:
  soup = BeautifulSoup(f, "html.parser")
  # do something with soup
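
The save-then-reload idea can also be wrapped in a small helper that only touches the network when no saved copy exists yet. This is a minimal sketch, not part of the original answer: cached_fetch and its fetch parameter are hypothetical names, and the download function is passed in so the caching logic stays independent of requests.

```python
from pathlib import Path
from typing import Callable


def cached_fetch(url: str, cache_file: Path, fetch: Callable[[str], bytes]) -> bytes:
    """Return the page body for url, calling fetch only on a cache miss."""
    if cache_file.exists():
        # Cache hit: reuse the bytes saved by an earlier run.
        return cache_file.read_bytes()
    body = fetch(url)  # cache miss: this is the slow network call
    cache_file.parent.mkdir(parents=True, exist_ok=True)
    cache_file.write_bytes(body)
    return body
```

With requests this might be called as cached_fetch(url, Path("/tmp/A.html"), lambda u: requests.get(u).content), and the returned bytes passed to BeautifulSoup(html, "html.parser"). The cached file never expires in this sketch, so delete it to force a fresh download.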

Here is some anecdotal evidence that the bottleneck is in the network.

from bs4 import BeautifulSoup
import requests
import time

url = 'https://google.com'

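t1 = time.perf_counter()  # time.clock() was removed in Python 3.8
name = requests.get(url)
t2 = time.perf_counter()
soup = BeautifulSoup(name.content, "html.parser")
t3 = time.perf_counter()

# request time, then parse time, in seconds
print(t2 - t1, t3 - t2)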

Output, run on a Thinkpad X1 Carbon with a fast campus network:

0.11 0.02
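
A single measurement like this can be noisy. As a sketch of a slightly more robust comparison, each stage could be run several times and the median wall-clock time taken. The helper name median_seconds is hypothetical.

```python
import statistics
import time
from typing import Callable


def median_seconds(stage: Callable[[], object], repeats: int = 5) -> float:
    """Run stage several times and return the median wall-clock duration."""
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()
        stage()
        durations.append(time.perf_counter() - start)
    return statistics.median(durations)
```

For example, median_seconds(lambda: requests.get(url)) could be compared against median_seconds(lambda: BeautifulSoup(name.content, "html.parser")).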

【Comments】:

FYI, you can replace BeautifulSoup(f.read()) with BeautifulSoup(f).
@alecxe, updated. Thanks!

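【Solution 2】:

Store the requests locally and restore them as Beautiful Soup objects later

If you are iterating over the pages of a website, you can store each page using requests, as explained here. Create a folder named soupCategory in the same folder as your script.

Use any recent user agent string for headers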

import time

import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

headers = {'user-agent':'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/13.0 Safari/605.1.15'}

def getCategorySoup():
    # Use one session so the retry policy below applies to every request
    session = requests.Session()
    retry = Retry(connect=7, backoff_factor=0.5)

    adapter = HTTPAdapter(max_retries=retry)
    session.mount('http://', adapter)
    session.mount('https://', adapter)

    basic_url = "https://www.somescrappingdomain.com/apartments?adsWithImages=1&page="
    t0 = time.time()
    totalPages = 1525 # put your number of pages here
    # range() excludes its end value, so add 1 to include the last page
    for i in range(1, totalPages + 1):
        url = basic_url + str(i)
        r = session.get(url, headers=headers)
        pageName = "./soupCategory/" + str(i) + ".html"
        with open(pageName, mode='w', encoding='UTF-8', errors='strict', buffering=1) as f:
            f.write(r.text)
            print(pageName, end=" ")
    t1 = time.time()
    total = t1 - t0
    print("Total time for getting ", totalPages, " category pages is ", round(total), " seconds")

Later you can create the Beautiful Soup object, as @merlin2011 mentioned:

# same relative folder and encoding that were used when saving
with open("./soupCategory/1.html", encoding="UTF-8") as f:
  soup = BeautifulSoup(f, "html.parser")
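
To restore every saved page rather than just the first, the folder can be walked in page order. The following is a minimal sketch assuming the files are named 1.html, 2.html, and so on, as in the saving loop. The helper name iter_saved_pages is hypothetical.

```python
from pathlib import Path
from typing import Iterator, Tuple


def iter_saved_pages(folder: Path) -> Iterator[Tuple[int, str]]:
    """Yield (page_number, html) for files named like 1.html, 2.html, ..."""
    # Keep only files whose stem is a number, matching the naming used when saving.
    paths = [p for p in folder.glob("*.html") if p.stem.isdecimal()]
    # Sort numerically so that 10.html comes after 2.html.
    for path in sorted(paths, key=lambda p: int(p.stem)):
        yield int(path.stem), path.read_text(encoding="utf-8")
```

Each yielded HTML string can then be parsed, for example with: for number, html in iter_saved_pages(Path("./soupCategory")): soup = BeautifulSoup(html, "html.parser").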

【Comments】: