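【Posted】: 2018-12-08 13:25:50
【Question】:
I've written a script in Python to fetch the links leading to the different articles on a webpage. When I run the script, I get them flawlessly. The problem is that the article links span multiple pages, because there are too many of them to fit on a single page. If I click the next-page button, I can see in the developer tools that it triggers an AJAX call made through a POST request. Since no link is attached to that next-page button, I can't find any way to go to the next page and parse the links from there. I've tried a POST request with form data, but it doesn't seem to work. Where am I going wrong?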
Link to the landing page containing articles
This is the information I get from Chrome dev tools when I click the next-page button:
GENERAL
=======================================================
Request URL: https://www.ncbi.nlm.nih.gov/pubmed/
Request Method: POST
Status Code: 200 OK
Remote Address: 130.14.29.110:443
Referrer Policy: origin-when-cross-origin
RESPONSE HEADERS
=======================================================
Cache-Control: private
Connection: Keep-Alive
Content-Encoding: gzip
Content-Security-Policy: upgrade-insecure-requests
Content-Type: text/html; charset=UTF-8
Date: Fri, 29 Jun 2018 10:27:42 GMT
Keep-Alive: timeout=1, max=9
NCBI-PHID: 396E3400B36089610000000000C6005E.m_12.03.m_8
NCBI-SID: CE8C479DB3510951_0083SID
Referrer-Policy: origin-when-cross-origin
Server: Apache
Set-Cookie: ncbi_sid=CE8C479DB3510951_0083SID; domain=.nih.gov; path=/; expires=Sat, 29 Jun 2019 10:27:42 GMT
Set-Cookie: WebEnv=1Jqk9ZOlyZSMGjHikFxNDsJ_ObuK0OxHkidgMrx8vWy2g9zqu8wopb8_D9qXGsLJQ9mdylAaDMA_T-tvHJ40Sq_FODOo33__T-tAH%40CE8C479DB3510951_0083SID; domain=.nlm.nih.gov; path=/; expires=Fri, 29 Jun 2018 18:27:42 GMT
Strict-Transport-Security: max-age=31536000; includeSubDomains; preload
Transfer-Encoding: chunked
Vary: Accept-Encoding
X-UA-Compatible: IE=Edge
X-XSS-Protection: 1; mode=block
REQUEST HEADERS
========================================================
Accept: text/html, */*; q=0.01
Accept-Encoding: gzip, deflate, br
Accept-Language: en-US,en;q=0.9
Connection: keep-alive
Content-Length: 395
Content-Type: application/x-www-form-urlencoded; charset=UTF-8
Cookie: ncbi_sid=CE8C479DB3510951_0083SID; _ga=GA1.2.1222765292.1530204312; _gid=GA1.2.739858891.1530204312; _gat=1; WebEnv=18Kcapkr72VVldfGaODQIbB2bzuU50uUwU7wrUi-x-bNDgwH73vW0M9dVXA_JOyukBSscTE8Qmd1BmLAi2nDUz7DRBZpKj1wuA_QB%40CE8C479DB3510951_0083SID; starnext=MYGwlsDWB2CmAeAXAXAbgA4CdYDcDOsAhpsABZoCu0IA9oQCZxLJA===
Host: www.ncbi.nlm.nih.gov
NCBI-PHID: 396E3400B36089610000000000C6005E.m_12.03
Origin: https://www.ncbi.nlm.nih.gov
Referer: https://www.ncbi.nlm.nih.gov/pubmed
User-Agent: Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.99 Safari/537.36
X-Requested-With: XMLHttpRequest
FORM DATA
========================================================
p$l: AjaxServer
portlets: id=relevancesortad:sort=;id=timelinead:blobid=NCID_1_120519284_130.14.22.215_9001_1530267709_1070655576_0MetA0_S_MegaStore_F_1:yr=:term=%222015%22%5BDate%20-%20Publication%5D%20%3A%20%223000%22%5BDate%20-%20Publication%5D;id=reldata:db=pubmed:querykey=1;id=searchdetails;id=recentactivity
load: yes
This is my script so far (the GET request works fine when uncommented, but only for the first page):
import requests
from urllib.parse import urljoin
from bs4 import BeautifulSoup

geturl = "https://www.ncbi.nlm.nih.gov/pubmed/?term=%222015%22%5BDate+-+Publication%5D+%3A+%223000%22%5BDate+-+Publication%5D"
posturl = "https://www.ncbi.nlm.nih.gov/pubmed/"

# res = requests.get(geturl,headers={"User-Agent":"Mozilla/5.0"})
# soup = BeautifulSoup(res.text,"lxml")
# for items in soup.select("div.rslt p.title a"):
#     print(items.get("href"))

FormData={
    'p$l': 'AjaxServer',
    'portlets': 'id=relevancesortad:sort=;id=timelinead:blobid=NCID_1_120519284_130.14.22.215_9001_1530267709_1070655576_0MetA0_S_MegaStore_F_1:yr=:term=%222015%22%5BDate%20-%20Publication%5D%20%3A%20%223000%22%5BDate%20-%20Publication%5D;id=reldata:db=pubmed:querykey=1;id=searchdetails;id=recentactivity',
    'load': 'yes'
}

req = requests.post(posturl,data=FormData,headers={"User-Agent":"Mozilla/5.0"})
soup = BeautifulSoup(req.text,"lxml")
for items in soup.select("div.rslt p.title a"):
    print(items.get("href"))
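By the way, when I click the next-page link, the URL in the browser becomes "https://www.ncbi.nlm.nih.gov/pubmed".
I don't want any solution that relies on a browser simulator. Thanks in advance.
【Comments】: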
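One visible difference between the captured request and the script: the browser sends the ncbi_sid and WebEnv cookies, while a bare requests.post starts with none. Whether that is what makes the POST fail is an assumption, not something confirmed here. A minimal sketch of carrying the cookies from the first GET into the POST, using only the standard library (requests.Session plays the same role). The helper names are hypothetical, and the site may no longer answer this endpoint the way it did in the capture:

```python
import http.cookiejar
import urllib.parse
import urllib.request

GET_URL = ("https://www.ncbi.nlm.nih.gov/pubmed/?term=%222015%22%5BDate+-+Publication%5D"
           "+%3A+%223000%22%5BDate+-+Publication%5D")
POST_URL = "https://www.ncbi.nlm.nih.gov/pubmed/"


def make_opener(cookie_jar):
    """Return an opener that stores and resends cookies via cookie_jar."""
    opener = urllib.request.build_opener(
        urllib.request.HTTPCookieProcessor(cookie_jar))
    opener.addheaders = [("User-Agent", "Mozilla/5.0")]
    return opener


def build_post(form_data):
    """Build the AJAX-style POST request without sending it."""
    body = urllib.parse.urlencode(form_data).encode("ascii")
    return urllib.request.Request(
        POST_URL,
        data=body,
        headers={
            # Both headers are copied from the captured request headers.
            "X-Requested-With": "XMLHttpRequest",
            "Referer": "https://www.ncbi.nlm.nih.gov/pubmed",
        },
        method="POST",
    )


def fetch_with_cookies(form_data):
    """GET the search page first, then POST with whatever cookies it set."""
    jar = http.cookiejar.CookieJar()
    opener = make_opener(jar)
    # Assumption: this GET is what makes the server set ncbi_sid / WebEnv.
    opener.open(GET_URL, timeout=30).read()
    with opener.open(build_post(form_data), timeout=30) as resp:
        return resp.read().decode("utf-8", errors="replace")
```

Calling fetch_with_cookies(FormData) with the dictionary from the script would return the response HTML, which can then be passed to BeautifulSoup as before.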
-
Would you consider selenium a browser simulator? Because that's the module you need.
-
You seem to be trying to emulate the wrong request. The first POST, the one sent to pubmed (not /pubmed/), is the one you need.
-
You might be interested in this GitHub repository, which uses NCBI's E-utilities endpoints; I believe it returns much of the same information: github.com/jordibc/entrez
Tags: python python-3.x web-scraping beautifulsoup
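A minimal sketch of paging through the same search with the E-utilities esearch endpoint instead of the HTML page, so no next-page button is involved. The helper names, the page size of 20, and the article URL pattern are assumptions for illustration; retstart and retmax are the parameters that select a page:

```python
import json
import urllib.parse
import urllib.request

ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def esearch_url(term, page, page_size=20):
    """Build the esearch URL for a zero-based page of PubMed results."""
    params = {
        "db": "pubmed",
        "term": term,
        "retstart": page * page_size,  # index of the first result to return
        "retmax": page_size,           # number of results to return
        "retmode": "json",
    }
    return ESEARCH_URL + "?" + urllib.parse.urlencode(params)


def article_links(pmids):
    """Turn PubMed IDs into article URLs (URL pattern is an assumption)."""
    return ["https://www.ncbi.nlm.nih.gov/pubmed/" + pmid for pmid in pmids]


def fetch_page(term, page, page_size=20):
    """Fetch one page of results and return the article links."""
    url = esearch_url(term, page, page_size)
    with urllib.request.urlopen(url, timeout=30) as resp:
        payload = json.load(resp)
    return article_links(payload["esearchresult"]["idlist"])
```

For example, fetch_page('"2015"[Date - Publication] : "3000"[Date - Publication]', page=1) would request the 21st through 40th results of the query used in the question.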