【Question title】: Can't parse the names from the third page onward
【Posted】: 2020-08-26 16:51:29
【Question】:

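I've created a script in python, using the requests module and the BeautifulSoup library, to fetch the names of different members from a website. The script fetches the names from the first and second pages flawlessly. From the third page onward, however, it scrapes the same names again. I can see that the next-page logic lives in the value of __EVENTTARGET, e.g. dnn$ctr410$MemberSearch$grdMembers$ctl00$ctl02$ctl01$ctl07, dnn$ctr410$MemberSearch$grdMembers$ctl00$ctl02$ctl01$ctl09 and so on. The script increments the number accordingly, but the results stay the same after the second page.

To populate results from this website, you only need to click the search button without changing anything. You can then click 2, 3, 4, etc. to go to the corresponding pages.

I've tried with (scrapes data from the first two pages):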

import requests
from bs4 import BeautifulSoup

link = 'https://www.icsi.in/student/Members/MemberSearch.aspx?SkinSrc=%5BG%5DSkins/IcsiTheme/IcsiIn-Bare&ContainerSrc=%5BG%5DContainers/IcsiTheme/NoContainer'

with requests.Session() as s:
    s.headers['User-Agent'] = 'Mozilla/5.0 (Windows NT 6.1; ) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.138 Safari/537.36'
    r = s.get(link)
    soup = BeautifulSoup(r.text,"lxml") 
    payload = {i['name']:i.get('value','') for i in soup.select('input[name]')}
    payload['__EVENTTARGET'] = 'dnn$ctr410$MemberSearch$btnSearch'

    page = 5
    while True:
        r = s.post(link,data=payload)
        soup = BeautifulSoup(r.text,"lxml")
        for item in soup.select("span[id$='_lblFullName']"):
            print(item.text)

        page+=2
        payload = {i['name']:i.get('value','') for i in soup.select('input[name]')}
        if len(str(page))==1:
            payload['__EVENTTARGET'] = 'dnn$ctr410$MemberSearch$grdMembers$ctl00$ctl02$ctl01$ctl0{}'.format(page)
        else:
            payload['__EVENTTARGET'] = 'dnn$ctr410$MemberSearch$grdMembers$ctl00$ctl02$ctl01$ctl{}'.format(page)

        payload['__dnnVariable'] = {'__scdoff':'1','__dnn_pageload':'__dnn_setScrollTop();'}
        payload['ScrollTop'] = '400'

How can I fetch the names from the rest of the pages after the second page?

【Comments】:

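  • It looks like an ASP.NET page. It can send many POST values, not only __EVENTTARGET, and you may have to send all of them, also as a POST request. FIRST: use DevTools in Firefox/Chrome to see all the requests the browser sends when you go to the next page, and check which values it sends and whether it is a GET or a POST request. Your code has to send the same.
  • I probably do send all of them. If you print the payload, you can see the required parameters in it. Thanks.
  • BTW, you can use "{:02}".format(7) to get 07 instead of 7, and then you don't have to check if len(str(page))==1: (BTW: instead of if len(str(page))==1: a simpler comparison would do)
  • When I check payload.keys() and compare it with the keys sent by the web browser, I see keys that the browser does not send, e.g. the keys for the arrow buttons (move to first/last/previous/next page), and these may cause the problem. For example, dnn$ctr410$MemberSearch$grdMembers$ctl00$ctl02$ctl01$ctl02 belongs to the button that moves to the first page.
  • I added the line payload.pop('dnn$ctr410$MemberSearch$grdMembers$ctl00$ctl02$ctl01$ctl02') at the bottom of the while loop to kick that key out of the payload, but it doesn't seem to fix the issue. Thanks.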
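
The zero-padding suggestion from the comments can be sketched as follows. This is only an illustration: pager_target and PAGER_PREFIX are hypothetical names, and the prefix is copied from the __EVENTTARGET values quoted in the question.

```python
# Prefix taken from the __EVENTTARGET values quoted in the question.
PAGER_PREFIX = 'dnn$ctr410$MemberSearch$grdMembers$ctl00$ctl02$ctl01$ctl'


def pager_target(n):
    """Build the control name for pager control number n, zero-padded to two digits."""
    return '{}{:02}'.format(PAGER_PREFIX, n)


print(pager_target(7))   # name ending in ctl07
print(pager_target(11))  # name ending in ctl11
```

With {:02} the one-digit and two-digit cases share one format string, so the if/else on len(str(page)) is no longer needed.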

Tags: python python-3.x web-scraping beautifulsoup python-requests


【Solution 1】:

It starts working if I remove keys like dnn$ctr410$MemberSearch$grdMembers$ctl00$ctl02$ctl01$ctl02 from the payload; these are the keys of the arrow buttons.

    name_length = len('dnn$ctr410$MemberSearch$grdMembers$ctl00$ctl02$ctl01$ctl02')

    for key in list(payload.keys()):
        if key.startswith('dnn') and len(key) == name_length:
            payload.pop(key)
            print(key)
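
The length check above removes every key that starts with dnn and happens to have the same length as a pager button name. A narrower variant, sketched here under the assumption that all pager buttons follow the ...$ctl01$ctlNN naming seen in this thread, matches the names by pattern instead. drop_pager_buttons and the sample dict are made up for illustration.

```python
import re

# Assumed naming scheme for the pager buttons, based on the keys quoted in this thread.
PAGER_BUTTON = re.compile(
    r'^dnn\$ctr410\$MemberSearch\$grdMembers\$ctl00\$ctl02\$ctl01\$ctl\d+$'
)


def drop_pager_buttons(payload):
    """Return a copy of payload without the pager button keys."""
    return {k: v for k, v in payload.items() if not PAGER_BUTTON.match(k)}


# Hypothetical payload with two ordinary fields and two pager buttons.
sample = {
    '__EVENTTARGET': '',
    'dnn$ctr410$MemberSearch$txtCity': '',
    'dnn$ctr410$MemberSearch$grdMembers$ctl00$ctl02$ctl01$ctl02': ' ',
    'dnn$ctr410$MemberSearch$grdMembers$ctl00$ctl02$ctl01$ctl28': ' ',
}
print(sorted(drop_pager_buttons(sample)))
```

Because the function returns a new dict, the original payload stays available for debugging.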

But you can use the approach from αԋɱҽԃ αмєяιcαη's answer to make sure you send only the values that are needed.


import requests
from bs4 import BeautifulSoup

link = 'https://www.icsi.in/student/Members/MemberSearch.aspx?SkinSrc=%5BG%5DSkins/IcsiTheme/IcsiIn-Bare&ContainerSrc=%5BG%5DContainers/IcsiTheme/NoContainer'

with requests.Session() as s:
    s.headers['User-Agent'] = 'Mozilla/5.0 (Windows NT 6.1; ) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.138 Safari/537.36'
    r = s.get(link)
    soup = BeautifulSoup(r.text,"lxml")

    payload = {i['name']:i.get('value','') for i in soup.select('input[name]')}
    payload['__EVENTTARGET'] = 'dnn$ctr410$MemberSearch$btnSearch'

    page = 5
    while True:

        r = s.post(link, data=payload)
        soup = BeautifulSoup(r.text, "lxml")
        for item in soup.select("span[id$='_lblFullName']"):
            print(item.text)

        page += 2

        payload = {i['name']:i.get('value','') for i in soup.select('input[name]')}
        payload['__EVENTTARGET'] = 'dnn$ctr410$MemberSearch$grdMembers$ctl00$ctl02$ctl01$ctl{:02}'.format(page)

        name_length = len('dnn$ctr410$MemberSearch$grdMembers$ctl00$ctl02$ctl01$ctl02')
        for key in list(payload.keys()):
            if key.startswith('dnn') and len(key) == name_length:
                payload.pop(key)
                print(key)

        payload['__dnnVariable'] = {'__scdoff':'1','__dnn_pageload':'__dnn_setScrollTop();'}
        payload['ScrollTop'] = '400'
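
One detail worth a look: the script above assigns a Python dict to payload['__dnnVariable'], while Solution 2 sends the same field as a JSON string. As far as I know, requests form-encodes a dict value by iterating over it, so only the keys would reach the server. Whether the server depends on this field is not established in this thread, so treat the following as a hedged sketch of how to produce the string form.

```python
import json

dnn_variable = {'__scdoff': '1', '__dnn_pageload': '__dnn_setScrollTop();'}

# Compact separators reproduce the text form used in Solution 2's payload.
encoded = json.dumps(dnn_variable, separators=(',', ':'))
print(encoded)
```

The result can then be assigned with payload['__dnnVariable'] = encoded.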

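EDIT: The page uses a more complex system: after 10 pages it displays new links but reuses the old values ctl07, ctl09. Instead of these links I use the name of the arrow button that goes to the next page. At the start its name ends in ctl28, but after 10 pages it has a different value (because there are more links: the page adds a ... link that leads to the next/previous group of 10 pages).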

import requests
from bs4 import BeautifulSoup

link = 'https://www.icsi.in/student/Members/MemberSearch.aspx?SkinSrc=%5BG%5DSkins/IcsiTheme/IcsiIn-Bare&ContainerSrc=%5BG%5DContainers/IcsiTheme/NoContainer'

with requests.Session() as s:
    s.headers['User-Agent'] = 'Mozilla/5.0 (Windows NT 6.1; ) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.138 Safari/537.36'
    r = s.get(link)
    soup = BeautifulSoup(r.text,"lxml")

    payload = {i['name']:i.get('value','') for i in soup.select('input[name]')}
    payload['__EVENTTARGET'] = 'dnn$ctr410$MemberSearch$btnSearch'

    page = 1  # no longer needed to generate links; used only to display the page number
    while True:
        print('page:', page)
        page += 1

        r = s.post(link, data=payload)
        soup = BeautifulSoup(r.text, "lxml")
        for item in soup.select("span[id$='_lblFullName']"):
            print(item.text)

        payload = {i['name']:i.get('value','') for i in soup.select('input[name]')}

        name_length = len('dnn$ctr410$MemberSearch$grdMembers$ctl00$ctl02$ctl01$ctl28')
        for key in list(payload.keys()):
            if key.startswith('dnn') and len(key) == name_length:
                payload.pop(key)
                #print(key)

        # button with arrow to next page

        next_page = soup.select("input[class='rgPageNext']")
        if not next_page:
            break

        next_page = next_page[0]['name']
        print(next_page)
        payload[next_page] = ''

        payload['__dnnVariable'] = {'__scdoff':'1','__dnn_pageload':'__dnn_setScrollTop();'}
        payload['ScrollTop'] = '400'
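
The next-button lookup in the script above can be isolated into a small helper. The sketch below swaps BeautifulSoup for the standard library's HTMLParser so it runs without extra packages, and it matches rgPageNext as one class among possibly several. NextButtonFinder, next_button_name and the sample markup are hypothetical.

```python
from html.parser import HTMLParser


class NextButtonFinder(HTMLParser):
    """Remember the name of the first <input> whose class list contains 'rgPageNext'."""

    def __init__(self):
        super().__init__()
        self.name = None

    def handle_starttag(self, tag, attrs):
        if tag != 'input' or self.name is not None:
            return
        attrs = dict(attrs)
        if 'rgPageNext' in (attrs.get('class') or '').split():
            self.name = attrs.get('name')


def next_button_name(html):
    """Return the name of the next-page button, or None if the page has none."""
    finder = NextButtonFinder()
    finder.feed(html)
    return finder.name


# Hypothetical markup shaped like the pager described above.
sample = '''
<input type="submit" name="grd$ctl02" class="rgPageFirst" />
<input type="submit" name="grd$ctl28" class="rgPageNext" />
'''
print(next_button_name(sample))
print(next_button_name('<p>no pager here</p>'))
```

A None result corresponds to the break in the loop above: there is no further page to request.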

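【Comments】:

  • After 12 or 13 pages, the script starts repeating the results of earlier pages.
  • Did you check it in the browser? It may use different values than you expect. I have now checked the link for page 11, and it uses ctl07 again instead of ctl25, so it works differently from what we expected.
  • Instead of generating the value for '__EVENTTARGET', I now use the name from select("input[class='rgPageNext']")[0]['name'] (the arrow button to the next page), and then I get the correct values even after 10 pages.

【Solution 2】: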

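Actually, you need to include the full POST payload parameters.

This has to be done within a single requests.Session(), because the site's pagination uses a rotating function driven by __dnnVariable, which the server receives with the JS request and which is translated into a loop.

In effect, it means Next.

So I first made a GET request and collected the required params (some of them are dynamic, others are static).

Then I sent a POST request under the same session.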

import requests
import re
from bs4 import BeautifulSoup
from urllib.parse import unquote

data = {
    '__EVENTTARGET': "dnn$ctr410$MemberSearch$btnSearch",
    '__EVENTARGUMENT': '',
    '__VIEWSTATEENCRYPTED': '',
    'dnn$ctlHeader$dnnSearch$Search': 'SiteRadioButton',
    'dnn$ctlHeader$dnnSearch$txtSearch': '',
    'dnn$ctr410$MemberSearch$txtFirstName': '',
    'dnn$ctr410$MemberSearch$txtLastName': '',
    'dnn$ctr410$MemberSearch$ddlMemberType': 0,
    'dnn$ctr410$MemberSearch$txtMembershipNumber': '',
    'dnn$ctr410$MemberSearch$txtCpNumber': '',
    'dnn$ctr410$MemberSearch$txtCity': '',
    'dnn$ctr410$MemberSearch$txtOrganisation': '',
    'dnn$ctr410$MemberSearch$txtAddress2': '',
    'dnn$ctr410$MemberSearch$txtAddress3': '',
    'dnn$ctr410$MemberSearch$txtEmail': '',
    'dnn_ctr410_MemberSearch_grdMembers_ClientState': '',
    'ScrollTop': 432,
    '__dnnVariable': '{"__scdoff":"1","__dnn_pageload":"__dnn_setScrollTop();"}'
}


def main(url):
    with requests.Session() as req:
        r = req.get(url)
        soup = BeautifulSoup(r.content, 'html.parser')
        data['StylesheetManager_TSSM'] = re.search(
            r"hf.value \+= '(.*?)\'", r.text).group(1)
        data['ScriptManager_TSM'] = unquote(soup.findAll('script', src=True)
                                            [2]['src']).split("=", 3)[-1]
        data['__VIEWSTATE'] = soup.find("input", id="__VIEWSTATE").get("value")
        data['__VIEWSTATEGENERATOR'] = soup.find(
            "input", id="__VIEWSTATEGENERATOR").get("value")
        data['__EVENTVALIDATION'] = soup.find(
            "input", id="__EVENTVALIDATION").get("value")

        for _ in range(10):
            r = req.post(url, data=data)
            soup = BeautifulSoup(r.content, 'html.parser')
            names = [name.text for name in soup.select("div.name_head")]
            page = soup.select_one(
                "a.rgCurrentPage").next_sibling['href'].split("'")[1]
            data['__EVENTTARGET'] = page
            data['__EVENTVALIDATION'] = soup.find(
                "input", id="__EVENTVALIDATION").get("value")
            data['__VIEWSTATE'] = soup.find(
                "input", id="__VIEWSTATE").get("value")
            print(names)


main("https://www.icsi.in/student/Members/MemberSearch.aspx")
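
The .split("'")[1] step above pulls the next __EVENTTARGET out of the pager link's href. Assuming the href has the usual ASP.NET shape javascript:__doPostBack('target','argument'), which the split implies but this thread does not show, the same extraction can be sketched with a regular expression that also returns the argument. parse_postback is a hypothetical helper.

```python
import re

# Assumed href shape: javascript:__doPostBack('<target>','<argument>')
POSTBACK = re.compile(r"__doPostBack\('([^']*)','([^']*)'\)")


def parse_postback(href):
    """Return (event_target, event_argument), or None if href is not a postback link."""
    match = POSTBACK.search(href)
    return match.groups() if match else None


href = ("javascript:__doPostBack("
        "'dnn$ctr410$MemberSearch$grdMembers$ctl00$ctl02$ctl01$ctl07','')")
print(parse_postback(href))
print(parse_postback('#'))
```

Returning None for a non-matching href gives the caller a clear signal to stop paging, for instance when the current page link has no following sibling.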

Note: because the search function uses random on the back-end server, each request returns data that is not in sorted order.

Output:

['SH. DILIP RAGHUNATH KOTWAL', 'SH. ARUNODAY ROY MUKHERJEE', 'SH. J SUBRAMANI', 'SH. R KRISHNAMANI', 'SH. R NARAYANASWAMI', 'SH. M V GOPALAKRISHNAN', 'SH. RAJAM KRISHNAMURTHY', 'SH. V SIVASUBRAMANIAN', 'SH. V RAGHAVENDRAN', 'SH. G V AIMAN']
['SH. K J MATHEW', 'SH. K K GHOSH', 'SH. SUBHASH CHANDER DHAWAN', 'SH. BABU RAM MAHESWARI', 'SH. S SWAMINATHAN', 'SH. T S A AIYER', 'SH. KOVILOOR VIJAYARAGHAVACHARI SAMPATHKUMAR', 'SH. M KRISHNAN', 'SH. R N BANSAL', 'SH. N V RAMAN']
['SH. R VENKATARAMANI', 'SH. UTPALENDU ROY CHOUDHURY', 'SH. LAKSHMI NARAYANAN V', 'SH. PARIJAT KUMAR HORE', 'SH. B R VENKATESAN', 'SH. KISHAN GOPAL SOMANI', 'SH. O P GANERIWALA', 'SH. P T KUPPUSWAMY', 'SH. U P MATHUR', 'SH. N N UPADHYAY']
['SH. N K BHANDARI', 'SH. S R C SETTY', 'SH. S V BALASUBRAMANIAN', 'SH. HOSHIE HIRJI MALGHAM', 'SH. KAIKOBAD SORABJI ITALIA', 'SH. K SIVADAS', 'SH. K K SIVARAMAKRISHNAN', 'SH. A CHANDRASEKARAN', 'SH. R PONNAMBALAM', 'SH. T K B VENKATARAMAN']
['SH. NARINDER PAL', 'SH. PARKASH ATAM', 'SH. K A PARTHASARATHY', 'SH. SURESH CHANDRA OSWAL', 'SH. MAHENDRA KANTILAL SHAH', 'SH. V. SANTHANAKRISHNA', 'SH. VASANT NARAYAN GOGATE', 'SH. MANEKLAL PATEL', 'SH. B N VISHWANATH', 'SH. B S L NARAYAN']
['SH. P L N VIJAYANAGAR', 'SH. SHREEPAD MARTAND  KORDE', 'SH. SHIV BHAGWAN KOTHARI', 'SH. R B POPLAI', 'SH. RAMESH KHANNA', 'SH. RAVINDER NATH JOSHI', 'SH. VIDYA SAGAR AGGARWAL', 'SH. ARVIND JAYKUMAR CHAKOTE', 'SH. V RAMASESHAN', 'SH. BADRINARAYAN BALDAWA']
['SH. C GOVINDANKUTTY', 'SH. A G MADHAVAN', 'SH. DHIRAJ NATH BHATTACHARYYA', 'SH. RAMESHWAR LAL INANI', 'SH. RAMESHWARDAS C DAGA', 'SH. R SUBRAMANIAN', 'SH. S M REGE', 'SH. NARENDRA KUMAR KAPOOR', 'SH. K RAMAMURTHI', 'SH. ROOPENDRA NARAYAN ROY']
['SH. KALYAN KUMAR MITRA', 'SH. KALYANASUNDARAM ', 'SH. N A SESHADRI', 'SH. RAJENDRA KUMAR JAIN', 'SH. BISWAJIT SEN', 'SH. RAMKRISHNA NATHOOMAL  AGRAWAL', 'SH. P C SHETH', 'SH. K S NATARAJAN', 'SH. S N DAMLE', 'SH. A M FADIA']
['DR. K N M RAO', 'SH. IYER M. RAMASWAMY', 'SH. DILIP KANTI MAZUMDAR', 'SH. RAM CHANDRA NIGAM', 'SH. SUBRAHMANIAM VISWANATHAN', 'SH. SURESH KUMAR JERATH', 'SH. A Y SRINIVASAN', 'DR. S C GARG', 'SH. CHANDRA PRAKASH SHARDA', 'SH. M P JAIN']
['SH. E S DWARKANATH', 'SH. MYSORE SHAMANNA  RAMACHANDRA', 'SH. SUBHASH CHANDER SINGHAL', 'SH. T T SINHA', 'SH. G R BHANDARI', 'SH. M P GOEL', 'SH. CHOKKANATHAPURAM SUBRAMANIAN  NATESAN', 'SH. V M PATEL', 'SH. BIJOY KUMAR AGARWALLA', 'SH. BAHADUR CHAND JAIN']

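【Comments】:

  • Yes, both of your solutions work. Thanks a trillion.
  • @MITHU You're welcome. If it helped, feel free to accept my answer by ticking the check mark next to it. If you like it, you can upvote it as well.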