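【Posted】: 2021-08-04 09:34:03
【Question】:
I am trying to extract all the URLs found on this page: https://www.scotts.com/en-us/library/lawn-food
I noticed that several URLs are not returned, for example https://www.scotts.com/en-us/library/lawn-food/when-feed-greener-lawn, along with many others.
My code snippet is below: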
import time
from random import randint
import requests
from bs4 import BeautifulSoup, SoupStrainer
import re

def scrape_google_summaries(url):
    time.sleep(randint(0, 2))  # relax and don't let google be angry
    r = requests.get(url)
    content = r.text
    soup = BeautifulSoup(content, "html.parser", parse_only=SoupStrainer('a', href=True))
    summary = []
    for link in soup:  # .find_all('a'):
        summary.append(link.get('href'))
    return summary

output = scrape_google_summaries("https://www.scotts.com/en-us/library/lawn-food")
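【Comments】:
-
The website loads its data with JavaScript. I believe that is why you are not getting the expected results: requests only fetches the initial HTML and never runs the page's scripts.
-
The site's content is loaded by JavaScript. Use Selenium.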
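To make the Selenium suggestion concrete, here is a minimal sketch rather than a definitive fix. It assumes Selenium 4 with Chrome and a matching driver available on the machine. The names `fetch_rendered_html`, `extract_links`, `LinkCollector`, and the `wait_css` parameter are made up for illustration. Which CSS selector signals that the dynamic content has finished loading is site-specific and is not stated anywhere in this thread, so the default below is only a placeholder. The link extraction uses the standard library's `html.parser` so that step has no extra dependencies; BeautifulSoup, as in the question, would work equally well on the rendered HTML.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag, resolved against a base URL."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if href:  # skip anchors with a missing or empty href
            self.links.append(urljoin(self.base_url, href))


def extract_links(html, base_url):
    """Return absolute link URLs from html, de-duplicated, in page order."""
    parser = LinkCollector(base_url)
    parser.feed(html)
    parser.close()
    return list(dict.fromkeys(parser.links))


def fetch_rendered_html(url, wait_css="a[href]", timeout=15):
    """Load url in headless Chrome and return the DOM after scripts have run.

    wait_css is a hypothetical parameter: pass a selector that only matches
    once the JavaScript-loaded content is present on the page.
    """
    # Imported here so extract_links stays usable without Selenium installed.
    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    options = Options()
    options.add_argument("--headless")
    driver = webdriver.Chrome(options=options)
    try:
        driver.get(url)
        # Block until at least one matching element exists, or time out.
        WebDriverWait(driver, timeout).until(
            EC.presence_of_element_located((By.CSS_SELECTOR, wait_css))
        )
        return driver.page_source
    finally:
        driver.quit()  # always release the browser process
```

With a browser available, `extract_links(fetch_rendered_html(url), url)` would be called with the library page URL from the question. If links are still missing, the wait condition is probably satisfied too early, and a selector specific to the late-loading article list is likely needed.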
【Tags】: python python-3.x web-scraping beautifulsoup