【Question title】: How to get all the li tags within a div tag
【Posted】: 2016-02-26 19:06:28
【Question】:

I am scraping a website to get company and product details. It has div tags that contain li tags, and I want to get all the li tags inside a div tag. I am using Python 3.5.1 and BeautifulSoup.

My code:

from bs4 import BeautifulSoup
import urllib.request
import re
r = urllib.request.urlopen('http://i.cantonfair.org.cn/en/ExpExhibitorList.aspx?k=glassware')
soup = BeautifulSoup(r, "html.parser")

links = soup.find_all("a", href=re.compile(r"expexhibitorlist\.aspx\?categoryno=[0-9]+"))
linksfromcategories = ([link["href"] for link in links])

string = "http://i.cantonfair.org.cn/en/"
linksfromcategories = [string + x for x in linksfromcategories]

for link in linksfromcategories:
    response = urllib.request.urlopen(link)
    soup2 = BeautifulSoup(response, "html.parser")
    links2 = soup2.find_all("a", href=re.compile(r"ExpExhibitorList\.aspx\?categoryno=[0-9]+"))
    linksfromsubcategories = ([link["href"] for link in links2])
    linksfromsubcategories = [string + x for x in linksfromsubcategories]
    for link in linksfromsubcategories:
        response = urllib.request.urlopen(link)
        soup3 = BeautifulSoup(response, "html.parser")
        links3 = soup3.find_all("a", href=re.compile(r"ExpExhibitorList\.aspx\?categoryno=[0-9]+"))
        linksfromsubcategories2 = ([link["href"] for link in links3])
        linksfromsubcategories2 = [string + x for x in linksfromsubcategories2]
        for link in linksfromsubcategories2:
            response2 = urllib.request.urlopen(link)
            soup4 = BeautifulSoup(response2, "html.parser")
            companylink = soup4.find_all("a", href=re.compile(r"expCompany\.aspx\?corpid=[0-9]+"))
            companylink = ([link["href"] for link in companylink])
            companylink = [string + x for x in companylink]
            for link in companylink:
                response3 = urllib.request.urlopen(link)
                soup5 = BeautifulSoup(response3, "html.parser")
                companydetail = soup5.find_all("div", id="contact")
                for element in companydetail:
                    companyname = element.a[0].get_text()
                    print (companyname)
                    companyaddress = element.a[1].get_text()
                    print (companyaddress)

And I am getting this error:

Traceback (most recent call last):
  File "D:\python\phase3.py", line 54, in <module>
    lis = companydetail.find_all('li')
AttributeError: 'ResultSet' object has no attribute 'find_all'

【Comments】:

  • It says the error is on line 54, but you have only included 37 lines, and none of them contains the code that raises the error.

Tags: python web-scraping beautifulsoup


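【Solution 1】:

companydetail is a ResultSet. That is, it is an iterable containing many elements (like a list or set). The error occurs because you are calling .find_all() on this ResultSet object. You should iterate over the object and call find_all() on each element in the ResultSet, like this: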

for d in companydetail:
    lis = d.find_all('li')

Or use a list comprehension to get a list of all the lis in companydetail:

lis = [li for d in companydetail for li in d.find_all('li')]
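
To make the difference between a ResultSet and a single Tag concrete, here is a minimal, self-contained sketch. The HTML string and the company name and address in it are invented for illustration; the real page's markup may well differ. It also shows a CSS-selector lookup with select(), which may be a convenient alternative here.

```python
from bs4 import BeautifulSoup

# Hypothetical markup standing in for a company page; the real page may differ.
html = """
<div id="contact">
  <ul>
    <li>Example Glassware Co.</li>
    <li>1 Example Road, Guangzhou</li>
  </ul>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# find_all() returns a ResultSet (a list of Tag objects), not a single Tag.
companydetail = soup.find_all("div", id="contact")

# Call find_all() on each Tag inside the ResultSet, not on the ResultSet itself.
items = [li.get_text(strip=True) for d in companydetail for li in d.find_all("li")]

# Equivalent lookup with a CSS selector: every li nested under div#contact.
items_css = [li.get_text(strip=True) for li in soup.select("div#contact li")]

print(items)
```

If the page is known to contain only one such div, soup.find("div", id="contact") returns a single Tag (or None), and find_all('li') can then be called on it directly after checking for None.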

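【Comments】:

  • What do you mean by "getting the list twice"?
  • From the li tags I get company details such as the name and email ID, but each name and email ID comes back twice. Could it be that I scraped the URL twice or something?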
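
The thread does not establish why the details appear twice. One possible cause, and this is only an assumption, is that the same company link occurs more than once on a listing page, so the same page gets fetched and printed repeatedly. If that is the case, dropping duplicate links before requesting them might help. The helper name unique_in_order and the sample hrefs below are hypothetical.

```python
def unique_in_order(urls):
    """Return urls with duplicates removed, keeping first-seen order."""
    seen = set()
    result = []
    for url in urls:
        if url not in seen:
            seen.add(url)
            result.append(url)
    return result


# Hypothetical hrefs as they might be collected from one listing page.
companylink = [
    "expCompany.aspx?corpid=1",
    "expCompany.aspx?corpid=2",
    "expCompany.aspx?corpid=1",
]

# Deduplicate before fetching, so each company page is requested only once.
companylink = unique_in_order(companylink)
print(companylink)
```

The same helper could be applied to the category and subcategory link lists, since a nested crawl like the one in the question can reach the same page by more than one path.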