【Question title】: How to extract a particular link from a site if it has two links, one that I want and one that I don't?
【Posted】: 2020-06-16 11:12:24
【Question】:

<td>
  <input type="hidden" name="ctl00$ContentPlaceHolder1$dlstCollege$ctl01$hdnInstituteId" id="ContentPlaceHolder1_dlstCollege_hdnInstituteId_1" value="866  " />
  <a id="ContentPlaceHolder1_dlstCollege_hlpkInstituteName_1" href="CollegeDetailedInformation.aspx?Inst=866  ">A.N.A INSTITUTE OF PHARMACEUTICAL SCIENCES & RESEARCH,BAREILLY (866)</a>

  <br />
  <b>Location:</b>
  <span id="ContentPlaceHolder1_dlstCollege_lblAddress_1">13.5 km Bareilly - Delhi road, near rubber factory agras road ,Bareilly</span>

  <br />
  <b>Course:</b>
  <span id="ContentPlaceHolder1_dlstCollege_lblCourse_1">B.Pharm,</span>
  <br />
  <b>Category:</b>
  <span id="ContentPlaceHolder1_dlstCollege_lblInstituteType_1">Private</span>
  <br />
  <b>Web Address:</b>

  <a id="lnkBtnWebURL" href='' target="_blank"></a>
  <br />
</td>
</tr>
<tr>
  <td>
    <input type="hidden" name="ctl00$ContentPlaceHolder1$dlstCollege$ctl02$hdnInstituteId" id="ContentPlaceHolder1_dlstCollege_hdnInstituteId_2" value="486  " />
    <a id="ContentPlaceHolder1_dlstCollege_hlpkInstituteName_2" href="CollegeDetailedInformation.aspx?Inst=486  ">A.N.A.COLLEGE OF ENGINEERING & MANAGEMENT,BAREILLY (486)</a>

    <br />
    <b>Location:</b>
    <span id="ContentPlaceHolder1_dlstCollege_lblAddress_2">13.5 Km. NH-24, Bareilly-Delhi Highway, Near Rubber Factory, Bareilly</span>

    <br />
    <b>Course:</b>
    <span id="ContentPlaceHolder1_dlstCollege_lblCourse_2">B.Tech,M.Tech,</span>
    <br />
    <b>Category:</b>
    <span id="ContentPlaceHolder1_dlstCollege_lblInstituteType_2">Private</span>
    <br />
    <b>Web Address:</b>

    <a id="lnkBtnWebURL" href='http://www.anacollege.org/index.html' target="_blank">http://www.anacollege.org/index.html</a>
    <br />
  </td>
</tr>

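I want to extract one particular URL from this site (e.g. CollegeDetailedInformation.aspx?Inst=866), but each cell of the markup contains two anchor tags, and I don't want the second one (e.g. http://www.anacollege.org/index.html).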


import requests
from bs4 import BeautifulSoup

res = requests.get('https://erp.aktu.ac.in/WebPages/KYC/CollegeList.aspx?City=&CType=&Cu=&Br=&Inst=&IType=')
soup = BeautifulSoup(res.content, 'html.parser')

table = soup.find("table", attrs = {'class':'table table-bordered table-responsive'})

pagelink = []
for anchor in table.findAll('a')[1:]:
    link = anchor['href']
    print(link)
    url = 'https://erp.aktu.ac.in/WebPages/KYC/' + link
    pagelink.append(url)
print(pagelink)

I wrote this code, but it extracts every link:

CollegeDetailedInformation.aspx?Inst=486  
http://www.anacollege.org/index.html
CollegeDetailedInformation.aspx?Inst=602  
http://www.aashlarbschool.com
CollegeDetailedInformation.aspx?Inst=032  
http://www.abes.ac.in
CollegeDetailedInformation.aspx?Inst=290  
http://www.abesit.in
CollegeDetailedInformation.aspx?Inst=913  
http://www.abesitpharmacy.in
CollegeDetailedInformation.aspx?Inst=643  
http://www.vitsald.com
CollegeDetailedInformation.aspx?Inst=1036 
http://www.abss.edu.in

How can I fix this? I only want the links that contain the CollegeDetailedInformation.aspx?Inst= part.

【Discussion】:

    Tags: python python-3.x web-scraping beautifulsoup python-requests


    【Solution 1】:

    The anchor elements that link to the college detail pages have an id attribute beginning with ContentPlaceHolder1_dlstCollege_. So pass that as a regex to the attrs argument of find_all():

    import re
    
    for anchor in table.findAll('a', attrs={"id": re.compile("^ContentPlaceHolder1_dlstCollege_.*")}):
        ...
    

    You can also pass it as the id keyword argument to find_all():

    for anchor in table.findAll('a', id=re.compile("^ContentPlaceHolder1_dlstCollege_.*")):
        ...
    

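    The regex can be made more specific, e.g. "^ContentPlaceHolder1_dlstCollege_hlpkInstituteName_.*", which should match only the links that carry the institute name.

    (I would remove the [1:] you put at the end, since it was probably there to filter out a link at the beginning that you didn't want. If it wasn't, add it back.)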
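A minimal, self-contained sketch of the id-based filter, run against a trimmed copy of the markup from the question instead of the live site. The shortened institute names and the names SAMPLE_HTML and detail_links are illustrative assumptions; the .strip() call is added only because the href values in the question carry trailing spaces.

```python
import re

from bs4 import BeautifulSoup

# Trimmed copy of the question's markup (assumption: names shortened for brevity).
SAMPLE_HTML = """
<table>
<tr><td>
  <a id="ContentPlaceHolder1_dlstCollege_hlpkInstituteName_1" href="CollegeDetailedInformation.aspx?Inst=866  ">A.N.A INSTITUTE (866)</a>
  <a id="lnkBtnWebURL" href="" target="_blank"></a>
</td></tr>
<tr><td>
  <a id="ContentPlaceHolder1_dlstCollege_hlpkInstituteName_2" href="CollegeDetailedInformation.aspx?Inst=486  ">A.N.A.COLLEGE (486)</a>
  <a id="lnkBtnWebURL" href="http://www.anacollege.org/index.html" target="_blank">http://www.anacollege.org/index.html</a>
</td></tr>
</table>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# Keep only anchors whose id starts with the institute-name prefix;
# the lnkBtnWebURL anchors do not match and are skipped.
pattern = re.compile(r"^ContentPlaceHolder1_dlstCollege_hlpkInstituteName_")
detail_links = [a["href"].strip() for a in soup.find_all("a", id=pattern)]

print(detail_links)
# ['CollegeDetailedInformation.aspx?Inst=866', 'CollegeDetailedInformation.aspx?Inst=486']
```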

    【Comments】:

    • Also, use the newer, more standard find_all() rather than findAll().
    【Solution 2】:

    You can use a CSS selector, a[href*=CollegeDetailedInformation], to find all the links you want.

    import requests
    from bs4 import BeautifulSoup
    
    res = requests.get('https://erp.aktu.ac.in/WebPages/KYC/CollegeList.aspx?City=&CType=&Cu=&Br=&Inst=&IType=')
    soup = BeautifulSoup(res.content, 'html.parser')
    
    
    table = soup.find("table", attrs = {'class':'table table-bordered table-responsive'})
    
    allAnchor = table.select("a[href*=CollegeDetailedInformation]")
    
    pagelink = []
    for anchor  in allAnchor:
        link = anchor['href']
        # print(link)
        url = 'https://erp.aktu.ac.in/WebPages/KYC/'+ link
        pagelink.append(url)
    
    print(pagelink)
    

    The output will be:

    ['https://erp.aktu.ac.in/WebPages/KYC/CollegeDetailedInformation.aspx?Inst=968  ',
    'https://erp.aktu.ac.in/WebPages/KYC/CollegeDetailedInformation.aspx?Inst=866  ',
    'https://erp.aktu.ac.in/WebPages/KYC/CollegeDetailedInformation.aspx?Inst=486  ',
    'https://erp.aktu.ac.in/WebPages/KYC/CollegeDetailedInformation.aspx?Inst=602  ',
    'https://erp.aktu.ac.in/WebPages/KYC/CollegeDetailedInformation.aspx?Inst=032  ',
    'https://erp.aktu.ac.in/WebPages/KYC/CollegeDetailedInformation.aspx?Inst=290  ',
    'https://erp.aktu.ac.in/WebPages/KYC/CollegeDetailedInformation.aspx?Inst=913  ',
    'https://erp.aktu.ac.in/WebPages/KYC/CollegeDetailedInformation.aspx?Inst=643  ',
    'https://erp.aktu.ac.in/WebPages/KYC/CollegeDetailedInformation.aspx?Inst=1036 ',
    'https://erp.aktu.ac.in/WebPages/KYC/CollegeDetailedInformation.aspx?Inst=312  ',
    'https://erp.aktu.ac.in/WebPages/KYC/CollegeDetailedInformation.aspx?Inst=986  ',
    'https://erp.aktu.ac.in/WebPages/KYC/CollegeDetailedInformation.aspx?Inst=686  ',
    'https://erp.aktu.ac.in/WebPages/KYC/CollegeDetailedInformation.aspx?Inst=805  ',
    'https://erp.aktu.ac.in/WebPages/KYC/CollegeDetailedInformation.aspx?Inst=225  ',
    'https://erp.aktu.ac.in/WebPages/KYC/CollegeDetailedInformation.aspx?Inst=799  ',
    'https://erp.aktu.ac.in/WebPages/KYC/CollegeDetailedInformation.aspx?Inst=041  ',
    'https://erp.aktu.ac.in/WebPages/KYC/CollegeDetailedInformation.aspx?Inst=952  ',
    
    and so on....
    ]
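The URLs in the output above still end in the trailing spaces that the page puts inside each href. One possible variation, sketched here on a trimmed copy of the question's markup rather than the live site, strips that whitespace and builds the absolute URL with urllib.parse.urljoin instead of string concatenation. BASE_URL, SAMPLE_HTML and page_links are names chosen for this illustration, and the shortened institute names are an assumption.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

# URL of the listing page; relative hrefs are resolved against it.
BASE_URL = "https://erp.aktu.ac.in/WebPages/KYC/CollegeList.aspx"

# Trimmed copy of the question's markup (assumption: names shortened).
SAMPLE_HTML = """
<table>
<tr><td>
  <a id="ContentPlaceHolder1_dlstCollege_hlpkInstituteName_1" href="CollegeDetailedInformation.aspx?Inst=866  ">A.N.A INSTITUTE (866)</a>
  <a id="lnkBtnWebURL" href="" target="_blank"></a>
</td></tr>
<tr><td>
  <a id="ContentPlaceHolder1_dlstCollege_hlpkInstituteName_2" href="CollegeDetailedInformation.aspx?Inst=486  ">A.N.A.COLLEGE (486)</a>
  <a id="lnkBtnWebURL" href="http://www.anacollege.org/index.html" target="_blank">http://www.anacollege.org/index.html</a>
</td></tr>
</table>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# href*= matches any anchor whose href contains the given substring.
page_links = [
    urljoin(BASE_URL, a["href"].strip())
    for a in soup.select('a[href*="CollegeDetailedInformation"]')
]

print(page_links)
```

urljoin replaces the last path segment of BASE_URL with the relative href, so the hard-coded 'https://erp.aktu.ac.in/WebPages/KYC/' prefix is no longer needed.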
    

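    【Comments】:

      【Solution 3】:

      I don't know Python, but the general approach is to fill an array inside a for loop, then look for the entries that contain your filter substring, pick their index, and take whatever is at that index.

      Initialize an empty array outside the loop (if Python allows that), fill it inside the loop, then do something like in_array (in PHP) with your filter: CollegeDetailedInformation.aspx?Inst=?.

      This should be a good start until the Python gurus come along to help.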
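In Python, the approach described above might look like the sketch below. It uses the `in` operator for the substring test instead of PHP's in_array, and it starts from a few of the hrefs printed in the question rather than from a live request. The names all_links, FILTER and wanted are hypothetical.

```python
# A few of the hrefs printed by the question's loop (trailing spaces included).
all_links = [
    "CollegeDetailedInformation.aspx?Inst=486  ",
    "http://www.anacollege.org/index.html",
    "CollegeDetailedInformation.aspx?Inst=602  ",
    "http://www.aashlarbschool.com",
]

# Substring that every wanted link contains.
FILTER = "CollegeDetailedInformation.aspx?Inst="

# Start with an empty list outside the loop and fill it inside the loop.
wanted = []
for link in all_links:
    if FILTER in link:
        wanted.append(link.strip())

print(wanted)
# ['CollegeDetailedInformation.aspx?Inst=486', 'CollegeDetailedInformation.aspx?Inst=602']
```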

      【Comments】:

        【Solution 4】:

        Try the code snippet below. Before running it, also install **lxml** with pip.

        import requests as rq
        from bs4 import BeautifulSoup as bs
        
        res = rq.get('https://erp.aktu.ac.in/WebPages/KYC/CollegeList.aspx?City=&CType=&Cu=&Br=&Inst=&IType=')
        soup = bs(res.content, 'lxml')
        
        table = soup.find("table", attrs = {'class':'table table-bordered table-responsive'})
        
        
        links = [elem.strip() for anchor in table.findAll('a') for _,elem in anchor.attrs.items() if "=" in elem]
        
        print(links)
        

        【Comments】:

          【Solution 5】:

          You can use the CSS selector a[id*="dlstCollege"] to keep only the links you want.

          For example:

          import requests
          from bs4 import BeautifulSoup
          
          res = requests.get('https://erp.aktu.ac.in/WebPages/KYC/CollegeList.aspx?City=&CType=&Cu=&Br=&Inst=&IType=')
          soup = BeautifulSoup(res.content, 'html.parser')
          
          
          table = soup.find("table", attrs = {'class':'table table-bordered table-responsive'})
          
          pagelink = []
          for anchor in table.select('a[id*="dlstCollege"]')[1:]:
                  link = anchor['href']
                  print(link)
                  url = 'https://erp.aktu.ac.in/WebPages/KYC/'+ link
                  pagelink.append(url)
          

          Prints:

          CollegeDetailedInformation.aspx?Inst=866  
          CollegeDetailedInformation.aspx?Inst=486  
          CollegeDetailedInformation.aspx?Inst=602  
          CollegeDetailedInformation.aspx?Inst=032  
          CollegeDetailedInformation.aspx?Inst=290  
          CollegeDetailedInformation.aspx?Inst=913  
          CollegeDetailedInformation.aspx?Inst=643  
          
          ...and so on.
          

          【Comments】:
