【Posted】: 2021-02-08 05:58:43
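【Question】:
I am trying to scrape a forum and loop through every page of each thread link. When I iterate over all the pages with the code below, the output contains many duplicates.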
import requests
from bs4 import BeautifulSoup

lst = []
comments = []
urls = ['https://www.f150forum.com/f118/2019-adding-adaptive-cruise-454662/','https://www.f150forum.com/f118/adaptive-cruise-control-module-300894/']
for url in urls:
    with requests.Session() as req:
        # Pages 1..32 are requested for every thread, however many pages it has.
        for item in range(1, 33):
            response = req.get(f"{url}index{item}/")
            soup = BeautifulSoup(response.content, "html.parser")
            threadtitle = soup.find('h1', attrs={"class": "threadtitle"})
            # One title entry per username link found on the page.
            for item in soup.findAll('a', attrs={"class": "bigusername"}):
                lst.append([threadtitle.text])
            for div in soup.find_all('div', class_="ism-true"):
                try:
                    div.find('div', class_="panel alt2").extract()
                except AttributeError:
                    pass
                try:
                    div.find('label').extract()
                except AttributeError:
                    pass
                result = [div.get_text(strip=True, separator=" ")]
                comments.append(result)
Changing the code as follows removes the duplicates, but it skips the last page of each URL:
# urls and lst are defined as in the first snippet.
comments = []
for url in urls:
    with requests.Session() as req:
        index = 1
        while True:
            response = req.get(url + "index{}/".format(index))
            index += 1
            soup = BeautifulSoup(response.content, "html.parser")
            # This check runs before the page is scraped.
            if 'disabled' in soup.select_one('a#mb_pagenext').attrs['class']:
                break
            posts = soup.find(id="posts")
            threadtitle = soup.find('h1', attrs={"class": "threadtitle"})
            for item in soup.findAll('a', attrs={"class": "bigusername"}):
                lst.append([threadtitle.text])
            for div in soup.find_all('div', class_="ism-true"):
                try:
                    div.find('div', class_="panel alt2").extract()
                except AttributeError:
                    pass  # sometimes there is no 'panel alt2'
                try:
                    div.find('label').extract()
                except AttributeError:
                    pass  # sometimes there is no 'Quote'
                result = [div.get_text(strip=True, separator=" ")]
                comments.append(result)
If I remove the check "if 'disabled' in soup.select_one('a#mb_pagenext').attrs['class']: break", the code loops forever. How can I loop through the pages without getting duplicates?
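One possible way to keep the stop condition without losing the last page is to scrape the current page first and test the "next" link afterwards. The sketch below is only an illustration, not a verified fix for this site. It assumes, as the question's code implies, that the next-page link is a#mb_pagenext and carries the class "disabled" on the last page. The names scrape_thread, fetch and max_pages are hypothetical, and the quote and label stripping from the question is left out for brevity.

```python
from bs4 import BeautifulSoup


def scrape_thread(base_url, fetch, max_pages=500):
    """Collect post texts from every page of one thread, last page included.

    fetch is any callable that takes a URL and returns the page's HTML
    (hypothetical helper, injected so the loop can be tried offline).
    max_pages is a safety cap so the loop can never run forever.
    """
    comments = []
    for index in range(1, max_pages + 1):
        html = fetch(f"{base_url}index{index}/")
        soup = BeautifulSoup(html, "html.parser")

        # Scrape the current page first ...
        for div in soup.find_all("div", class_="ism-true"):
            comments.append(div.get_text(strip=True, separator=" "))

        # ... and only then decide whether this was the last page.
        # Assumption: a missing link is also treated as "no more pages".
        next_link = soup.select_one("a#mb_pagenext")
        if next_link is None or "disabled" in next_link.get("class", []):
            break
    return comments
```

With requests this could be called as scrape_thread(url, lambda u: session.get(u).content), where session is a requests.Session(). Because the break comes after the scraping step, the page on which the link is disabled is still collected.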
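【Comments】:
-
Here is a hint for handling the pagination: 1) get the last page number, or 2) iterate over the pages until the element that leads to the next page can no longer be found. Pick whichever you prefer.
-
I did that in the first part of my code, and it gave many duplicates.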
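If the fixed-range loop is kept, one hedged way to avoid the duplicates is to stop as soon as a page returns exactly the same posts as the page before it. This rests on an unverified assumption: that the duplicates appear because the forum keeps serving the final page for page numbers past the end of the thread. The name collect_until_repeat is hypothetical, and pages stands for any iterable that yields one list of post texts per page.

```python
def collect_until_repeat(pages):
    """Concatenate per-page post lists, stopping at the first repeated page.

    pages: iterable of lists, one list of post texts per requested page.
    A page that is identical to the previous one is treated as a sign
    that the thread has ended (assumption, see the note above).
    """
    collected = []
    previous = None
    for posts in pages:
        if posts == previous:
            break  # same content as the last page: nothing new to collect
        collected.extend(posts)
        previous = posts
    return collected
```

Comparing whole pages rather than single posts means that short replies with identical text on different pages are still kept. Since pages can be a generator, no further requests are made once the loop breaks.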
Tags: python web-scraping beautifulsoup