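【Posted】: 2021-06-23 00:34:14
【Question】:
I'm using Beautiful Soup to extract data from a non-English website. Right now my code only extracts the first ten results of a keyword search. The site is designed so that additional results are reached through a "more" button (a bit like infinite scroll, except that you have to keep clicking "more" to get the next batch of results). When I click "more", the URL doesn't change, so I can't simply iterate over a different URL each time.
I'd really like help with two things.
- Modify the code below so that I can get data from all the pages, not just the first 10 results
- Insert a timer function so that the server doesn't block me
I'm adding a photo of the "more" button because it isn't in English. It appears in blue text at the end of the page.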
import requests, csv, os
from bs4 import BeautifulSoup
from time import strftime, sleep
# make a GET request (requests.get("URL")) and store the response in a response object
responsePA = requests.get('https://www.prothomalo.com/search?q=%E0%A6%A7%E0%A6%B0%E0%A7%8D%E0%A6%B7%E0%A6%A3')
# read the content of the server's response
rawPagePA = responsePA.text
# name the parser explicitly so the result does not depend on which parsers are installed
soupPA = BeautifulSoup(rawPagePA, "html.parser")
# take a look
print(soupPA.prettify())
urlsPA = []  # empty list to store URLs
# select every story card on the results page
for item in soupPA.find_all("div", class_="customStoryCard9-m__story-data__2qgWb"):
    aTag = item.find("a")  # first <a> tag inside the card
    if aTag is not None and aTag.has_attr("href"):  # skip cards without a link
        urlsPA.append(aTag["href"])
print(urlsPA)
# Below I'm getting the data from each of the URLs and storing it in a list
PAlist = []
for link in urlsPA:
    specificpagePA = requests.get(link)  # make a GET request and store the response
    rawAddPagePA = specificpagePA.text  # read the content of the server's response
    PASoup2 = BeautifulSoup(rawAddPagePA, "html.parser")  # parse the response into an HTML tree
    PAcontent = PASoup2.find_all(class_=["story-element story-element-text", "time-social-share-wrapper storyPageMetaData-m__time-social-share-wrapper__2-RAX", "headline headline-type-9 story-headline bn-story-headline headline-m__headline__3vaq9 headline-m__headline-type-9__3gT8S", "contributor-name contributor-m__contributor-name__1-593"])
    # print(PAcontent)
    PAlist.append(PAcontent)
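For the timer request, one common approach is to pause for a short random interval between requests. The following is a minimal sketch, not a guarantee against being blocked: `fetch_politely`, its parameters, and the 1 to 3 second delay range are illustrative assumptions, not limits the site documents.

```python
import random
import time


def fetch_politely(urls, fetch, min_delay=1.0, max_delay=3.0, sleep=time.sleep):
    """Call fetch(url) for each URL, pausing a random interval between calls.

    `fetch` is any callable that takes a URL and returns the page content.
    The delay bounds are assumed values; tune them for the target site.
    """
    results = []
    for index, url in enumerate(urls):
        if index > 0:
            # Pause before every request except the first one
            sleep(random.uniform(min_delay, max_delay))
        results.append(fetch(url))
    return results
```

With the code above this could be called as `fetch_politely(urlsPA, lambda url: requests.get(url, timeout=10).text)`. The `fetch` and `sleep` arguments are injectable so the loop can be tried out without touching the network.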
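【Comments】:
-
If you want to interact with the website, you will probably need selenium. Otherwise, you may be able to find the request that is sent when the button is pressed and simulate it.
-
I agree with @AlexNe. Pressing the "more" button doesn't change the HTML that your Python script receives; it only changes the HTML inside the browser. To "click the more button" from Python, you need to use selenium.
-
Here are the docs
-
Ha, found a tutorial: medium.com/the-andela-way/…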
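The selenium suggestion in the comments can be sketched roughly as follows. This is an outline under assumptions rather than a tested solution for this site: `click_more_until_done` and `scrape_all_results` are hypothetical helper names, the XPath of the "more" button is not given in the question and has to be found by inspecting the page, and `webdriver.Chrome()` assumes Chrome and a matching driver are available locally.

```python
import time


def click_more_until_done(driver, more_button_xpath, pause=2.0,
                          max_clicks=50, sleep=time.sleep):
    """Click the 'more' button until it disappears; return the click count."""
    clicks = 0
    while clicks < max_clicks:
        # "xpath" is the value of selenium's By.XPATH; find_elements returns
        # an empty list (instead of raising) when nothing matches
        buttons = driver.find_elements("xpath", more_button_xpath)
        if not buttons or not buttons[0].is_displayed():
            break
        buttons[0].click()
        clicks += 1
        # Let the next batch load, and avoid hammering the server
        sleep(pause)
    return clicks


def scrape_all_results(search_url, more_button_xpath):
    """Load the search page, expand all results, and return the parsed HTML."""
    # Imported here so the click loop above can be exercised without a browser
    from bs4 import BeautifulSoup
    from selenium import webdriver

    driver = webdriver.Chrome()
    try:
        driver.get(search_url)
        click_more_until_done(driver, more_button_xpath)
        # page_source reflects the DOM after the clicks, not the initial HTML
        return BeautifulSoup(driver.page_source, "html.parser")
    finally:
        driver.quit()
```

The soup returned by `scrape_all_results` can then be passed to the same `find_all` loop over the `customStoryCard9-m__story-data__2qgWb` cards used in the question. If results load slowly, a longer `pause` or an explicit wait may be needed, and `max_clicks` is only a safety cap against an endless loop.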
Tags: python python-3.x web-scraping beautifulsoup infinite-scroll