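【Question title】:Want to scrape from a web page and its next pages
【Posted】:2020-07-23 04:32:14
【Question description】:

I want to extract the company name, contact person, country, phone and email into an Excel file. I tried the following code, but it writes only one value to the file. How can I loop over the first page and the following pages as well?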

import csv
import re
import requests
import urllib.request
from bs4 import BeautifulSoup
for page in range(10):
        url = "http://www.aepcindia.com/buyersdirectory"
        soup = BeautifulSoup(urllib.request.urlopen(url).read(), 'lxml')
        tbody = soup('div', {'class':'view-content'})#[0].find_all('')
        f = open('filename.csv', 'w', newline = '')
        Headers = "Name,Person,Country,Email,Phone\n"
        csv_writer = csv.writer(f)
        f.write(Headers)
        for i in tbody:
                try:
                    name = i.find("div", {"class":"company_name"}).get_text()
                    person = i.find("div", {"class":"title"}).get_text()
                    country = i.find("div", {"class":"views-field views-field-field-country"}).get_text()
                    email = i.find("div", {"class":"email"}).get_text()
                    phone = i.find("div", {"class":"telephone_no"}).get_text()
                    print(name, person, country, email, phone)
                    f.write("{}".format(name).replace(","," ")+ ",{}".format(person)+ ",{}".format(country)+ ",{}".format(email) + ",{}".format(phone) + "\n")
                except: AttributeError
        f.close()

Here is the link to the web page: http://www.aepcindia.com/buyersdirectory

【Comments】:

    Tags: python beautifulsoup screen-scraping


    【Solution 1】:
    import csv

    import requests
    from bs4 import BeautifulSoup

    # Browser-like User-Agent sent with every request.
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:73.0) Gecko/20100101 Firefox/73.0'}


    def main(url):
        # One session for all requests; the CSV file is opened once, outside the page loop.
        with requests.Session() as req:
            with open("data.csv", 'w', newline="") as f:
                writer = csv.writer(f)
                writer.writerow(["Name", "Title", "Country", "Email", "Phone"])
                for page in range(0, 10):
                    print(f"Extracting Page# {page + 1}")
                    # Substitute the zero-based page index into the "?page={}" placeholder.
                    r = req.get(url.format(page), headers=headers)
                    soup = BeautifulSoup(r.content, 'html.parser')

                    # Collect each field as a column (one list per field).
                    name = [tag.text for tag in soup.select("div.company_name")]
                    title = [tag.text for tag in soup.select("div.title")]
                    country = [tag.text for tag in soup.find_all(
                        "div", class_="field-content", string=True)]
                    email = [tag.a.text for tag in soup.select(
                        "div.email")]
                    phone = [tag.text
                             for tag in soup.select("div.telephone_no")]
                    # zip() pairs the columns into rows; it stops at the shortest list.
                    data = zip(name, title, country, email, phone)
                    writer.writerows(data)


    main("http://www.aepcindia.com/buyersdirectory?page={}")
    

    Output: view-online
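
The solution code above differs from the code in the question in two ways that matter for pagination: the page index is substituted into the URL on every iteration, and the CSV file is opened once, before the page loop, so later pages do not overwrite earlier ones. The following is a minimal offline sketch of just that pattern, not the scraper itself. `fetch_rows` is a hypothetical stand-in for the request and parsing step, the rows it returns are made-up placeholders, and an in-memory buffer replaces the real output file.

```python
import csv
import io

# URL template with a "{}" placeholder for the zero-based page index,
# mirroring the "?page={}" pattern used in the solution above.
URL_TEMPLATE = "http://www.aepcindia.com/buyersdirectory?page={}"


def fetch_rows(url):
    """Hypothetical stand-in for the request + parsing step.

    Returns one placeholder row per page so the sketch runs without
    network access; a real version would download and parse `url`.
    """
    page = url.rsplit("=", 1)[1]
    return [[f"Company {page}", f"Person {page}", "Country", "a@example.com", "000"]]


def scrape(out, pages):
    """Write the header once, then append the rows of every page to `out`."""
    writer = csv.writer(out)
    writer.writerow(["Name", "Title", "Country", "Email", "Phone"])
    for page in range(pages):
        url = URL_TEMPLATE.format(page)  # a different URL on each iteration
        writer.writerows(fetch_rows(url))


# In-memory file for the sketch; a real run would use
# open("data.csv", "w", newline="") instead.
buffer = io.StringIO()
scrape(buffer, pages=3)
lines = buffer.getvalue().splitlines()
print(len(lines))
```

One caveat about the column-wise approach in the solution: `zip()` stops at the shortest input, so if any record on a page lacks one of the fields, rows can be dropped or shifted. Extracting the fields record by record is one way to avoid that, assuming each record has its own container element on the page.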

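    【Comments】:

    • Bro, this works like magic... thank you very much. Could you teach me one more thing? I have a list of search queries in an Excel file that need to be searched on Google, and the results should be saved to another Excel file. How can I do that?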