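【Question title】: Python - save requests or BeautifulSoup object locally
【Posted】: 2014-05-29 22:04:55
【Question】:

My code is long and takes a long time to run. I would like to simply save either the requests object (here, "name") or the BeautifulSoup object (here, "soup") locally, so that I can save time on the next run. Here is the code: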

from bs4 import BeautifulSoup
import requests

url = 'SOMEURL'
name = requests.get(url)
soup = BeautifulSoup(name.content)

【Comments】:

  • You may find the pickle module useful...
  • How about saving the HTML source to an .html file?

Tags: python file beautifulsoup scrape


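【Answer 1】:

Since name.content is just the HTML, you can dump it to a file and read it back in later.

Usually the bottleneck is not the parsing but the network latency of making the request.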

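from bs4 import BeautifulSoup
import requests

url = 'https://google.com'
name = requests.get(url)

# name.content is bytes, so the file must be opened in binary mode
with open("/tmp/A.html", "wb") as f:
    f.write(name.content)


# read it back in; binary mode lets BeautifulSoup detect the encoding itself
with open("/tmp/A.html", "rb") as f:
    soup = BeautifulSoup(f, "html.parser")
    # do something with soup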
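
To avoid the network on every later run, the save and read-back steps can be folded into one helper that downloads only when the file is missing. This is a minimal sketch, not part of the original answer: the names load_html and fetch_with_requests, the timeout value, and the fetch parameter (added so the download step can be swapped out) are all assumptions made for illustration.

```python
from pathlib import Path


def fetch_with_requests(url):
    """Download a URL and return the raw response body as bytes."""
    import requests  # imported lazily so the cache logic itself needs no third-party package

    response = requests.get(url, timeout=30)
    response.raise_for_status()
    return response.content


def load_html(url, cache_path, fetch=fetch_with_requests):
    """Return the HTML for url, downloading it only if cache_path does not exist."""
    cache_file = Path(cache_path)
    if cache_file.exists():
        # Cache hit: skip the network entirely.
        return cache_file.read_bytes()
    html = fetch(url)
    cache_file.parent.mkdir(parents=True, exist_ok=True)
    cache_file.write_bytes(html)
    return html
```

With this in place, soup = BeautifulSoup(load_html(url, "/tmp/A.html"), "html.parser") hits the network on the first run only. Delete the file to force a fresh download.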

Here is some anecdotal evidence that the bottleneck is in the network.

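from bs4 import BeautifulSoup
import requests
import time

url = 'https://google.com'

# time.clock() was removed in Python 3.8; perf_counter() measures elapsed wall-clock time
t1 = time.perf_counter()
name = requests.get(url)
t2 = time.perf_counter()
soup = BeautifulSoup(name.content, "html.parser")
t3 = time.perf_counter()

# request time, parse time (in seconds)
print(t2 - t1, t3 - t2)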

Output from a run on a Thinkpad X1 Carbon on a fast campus network:

0.11 0.02
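
A single measurement like this is noisy, so it can help to repeat each step a few times and keep the fastest run. The helper below is only a sketch of that idea. The name best_time and its repeats parameter are hypothetical and not from the original answer.

```python
import time


def best_time(func, *args, repeats=3, **kwargs):
    """Call func several times and return (fastest elapsed seconds, last result)."""
    best = float("inf")
    result = None
    for _ in range(repeats):
        start = time.perf_counter()
        result = func(*args, **kwargs)
        best = min(best, time.perf_counter() - start)
    return best, result
```

For example, best_time(requests.get, url) would time the request, and best_time(BeautifulSoup, name.content, "html.parser") would time the parse. Note that repeating the request sends it to the server several times.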

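【Comments】:

  • FYI, you can replace BeautifulSoup(f.read()) with BeautifulSoup(f)
  • @alecxe, updated. Thanks!
【Answer 2】:

Store the requests locally and restore them later as Beautiful Soup objects

If you are looping over the pages of a site, you can store each page with requests as shown here. First create a folder named soupCategory in the same folder as your script.

For headers, use any recent user agent.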

import time

import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

headers = {'user-agent':'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/13.0 Safari/605.1.15'}

def getCategorySoup():
    # retry failed connections up to 7 times, backing off between attempts
    session = requests.Session()
    retry = Retry(connect=7, backoff_factor=0.5)

    adapter = HTTPAdapter(max_retries=retry)
    session.mount('http://', adapter)
    session.mount('https://', adapter)

    basic_url = "https://www.somescrappingdomain.com/apartments?adsWithImages=1&page="
    t0 = time.time()
    totalPages = 1525  # put your number of pages here
    for i in range(1, totalPages + 1):  # +1 so the last page is included
        url = basic_url + str(i)
        # go through the session so the retry adapter mounted above is actually used
        r = session.get(url, headers=headers)
        pageName = "./soupCategory/" + str(i) + ".html"
        with open(pageName, mode='w', encoding='UTF-8') as f:
            f.write(r.text)
        print(pageName, end=" ")
    t1 = time.time()
    total = t1 - t0
    print("Total time for getting", totalPages, "category pages is", round(total), "seconds")

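Later you can create the Beautiful Soup objects, as @merlin2011 mentioned:

# relative path, matching the folder the pages were written to
with open("./soupCategory/1.html", encoding='UTF-8') as f:
    soup = BeautifulSoup(f, "html.parser")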
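
To restore every saved page rather than just the first, one option is to walk the folder in page order. The generator below is a sketch under the assumption that the files are named 1.html, 2.html, and so on, as in the download loop above. The name iter_saved_pages is hypothetical.

```python
from pathlib import Path


def iter_saved_pages(folder="./soupCategory"):
    """Yield (page_number, html_text) for each saved N.html, in numeric order."""
    # Keep only files whose name is a page number, e.g. 12.html
    pages = [p for p in Path(folder).glob("*.html") if p.stem.isdigit()]
    # Sort numerically so that 10.html comes after 2.html, not before it
    for path in sorted(pages, key=lambda p: int(p.stem)):
        yield int(path.stem), path.read_text(encoding="UTF-8")
```

Each yielded html_text can then be passed to BeautifulSoup(html_text, "html.parser"). Since pages are read one at a time, only the current page is held in memory.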
