【Question title】: Send a request to a website to crawl it every second
【Posted】: 2021-02-13 05:17:52
【Question】:
I want to crawl a website once per second for 4 hours. How can I do that?
My code is below.
import requests
from bs4 import BeautifulSoup

site = requests.get("http://example.com")
soup = BeautifulSoup(site.text, 'html.parser')
r = str(soup).split(",")
update_time = r[0]
price1 = r[2]
price2 = r[3]
print(update_time, price1, price2)
Tags:
python
time
beautifulsoup
request
web-crawler
【Solution 1】:
You can use the time and threading modules.
import requests
from threading import Thread
from time import sleep
from bs4 import BeautifulSoup

def scrape():
    # Fetch the page and split its text on commas.
    site = requests.get("http://example.com")
    soup = BeautifulSoup(site.text, 'html.parser')
    r = str(soup).split(",")
    update_time = r[0]
    price1 = r[2]
    price2 = r[3]
    print(update_time, price1, price2)

# 4 hours = 14400 seconds: start one request thread per second,
# so a slow response does not delay the next request.
for i in range(14400):
    t = Thread(target=scrape)
    t.start()
    sleep(1)
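Both the question and this answer index into the comma-split response (r[0], r[2], r[3]) without checking its length, so a short or empty response would raise an IndexError inside the thread. A minimal sketch of a guarded version of that step follows. The function name parse_prices is hypothetical, and the sample payload is an assumption: the thread never shows what the page actually returns.

```python
def parse_prices(text):
    """Split a comma-separated payload into (update_time, price1, price2).

    Assumes the same layout the original code relies on: field 0 is the
    update time, fields 2 and 3 are the two prices.
    """
    fields = [field.strip() for field in text.split(",")]
    if len(fields) < 4:
        # Fail with a clear message instead of a bare IndexError.
        raise ValueError(
            f"expected at least 4 comma-separated fields, got {len(fields)}"
        )
    return fields[0], fields[2], fields[3]


if __name__ == "__main__":
    # Made-up payload, for illustration only.
    print(parse_prices("12:00:01, ignored, 10.5, 11.0"))
```

Inside scrape(), the call could be wrapped in try/except ValueError so that one malformed response is logged and skipped rather than ending that thread with a traceback.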
【Solution 2】:
You can use the schedule module for this.
import schedule
import time
import requests
from bs4 import BeautifulSoup

def crawl():
    # Fetch the page and split its text on commas.
    site = requests.get("http://example.com")
    soup = BeautifulSoup(site.text, 'html.parser')
    r = str(soup).split(",")
    update_time = r[0]
    price1 = r[2]
    price2 = r[3]
    print(update_time, price1, price2)

# Register crawl() to run once per second.
schedule.every(1).seconds.do(crawl)

# Run any due jobs, then wait a second before checking again.
while True:
    schedule.run_pending()
    time.sleep(1)
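The loop above runs indefinitely. To limit it to the four-hour window, either start and stop the script with crontab or replace while True with a bounded for loop.
You need to install the schedule module before running the script above: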
sudo pip install schedule
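The bounded loop mentioned above can be sketched with the standard library alone. This is one possible approach rather than how schedule works internally: it measures each run against the start time, so a slow request does not push every later run back. The name run_every and its clock and sleep parameters are hypothetical, added so the loop can be tried without waiting four hours.

```python
import time


def run_every(job, interval=1.0, duration=4 * 60 * 60,
              clock=time.monotonic, sleep=time.sleep):
    """Call job() once per interval until duration seconds have elapsed.

    Returns the number of calls made.
    """
    start = clock()
    calls = 0
    while True:
        # Target time of the next run, relative to the start, so that
        # time spent inside job() does not accumulate as drift.
        next_run = start + calls * interval
        if next_run - start >= duration:
            break
        delay = next_run - clock()
        if delay > 0:
            sleep(delay)
        job()
        calls += 1
    return calls


if __name__ == "__main__":
    # Short demo: three calls, one second apart.
    run_every(lambda: print("tick"), interval=1.0, duration=3.0)
```

Passing crawl as job with the default arguments gives roughly 14400 calls over four hours. Unlike the threaded answer, this runs in a single thread, so a request that takes longer than one second makes the following runs execute back to back until the loop catches up.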