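【Question title】: Scraped data not printing to csv/xlsx file
【Posted】: 2019-09-05 15:02:00
【Question】:

I am trying to scrape data and store it in a csv or xlsx file, but when I run my code the file comes back empty.

When I stop the iterator with break after one pass through the loop, the code saves one row of the data I want. The end goal is to write this data to the file row by row. If it helps, I am using beautifulsoup4.

The code is as follows: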

from bs4 import BeautifulSoup
import requests
import xlsxwriter

url = 'https://www.rwaq.org/courses'
response = requests.get(url)
soup = BeautifulSoup(response.content, 'html.parser')
base = 'https://www.rwaq.org'

course_div = soup.find_all('div', attrs={'class': 'course-info'})
course_links = [base + item.h3.a['href'] for item in course_div]
row = 0
for link in course_links:
    inner_page = requests.get(link)
    inner_soup = BeautifulSoup(inner_page.content, 'html.parser')
    course_name = inner_soup.find('div', attrs={'class': 'page-title'}).h2.text
    course_lecturer_name = inner_soup.find('div', attrs={'class': 'instructor-details'}).a.text.strip()
    course_desc = inner_soup.find('div', attrs={'class': 'lecture_desc'}).p.text.strip()
    if inner_soup.select_one('#organization div.course-content div:nth-child(4) div.row-fluid ul'):
        course_manhag = inner_soup.select_one('#organization div.course-content div:nth-child(4) div.row-fluid ul').text
    elif inner_soup.select_one('#organization div.course-content div:nth-child(4) div.row-fluid p'):
        course_manhag = inner_soup.select_one('#organization div.course-content div:nth-child(4) div.row-fluid p').text
    else:
        course_manhag = ''

    if inner_soup.select_one('#organization div.course-content div:nth-child(5) div.row-fluid ul'):
        course_require = inner_soup.select_one(
            '#organization div.course-content div:nth-child(5) div.row-fluid ul').text
    elif inner_soup.select_one('#organization div.course-content div:nth-child(5) div.row-fluid p'):
        course_require = inner_soup.select_one('#organization div.course-content div:nth-child(5) div.row-fluid p').text
    else:
        course_require = ''

    if inner_soup.select_one('#organization div.course-content div:nth-child(6) div.row-fluid ul'):
        course_out = inner_soup.select_one('#organization div.course-content div:nth-child(6) div.row-fluid ul').text
    elif inner_soup.select_one('#organization div.course-content div:nth-child(6) div.row-fluid p'):
        course_out = inner_soup.select_one('#organization div.course-content div:nth-child(6) div.row-fluid p').text
    else:
        course_out = ''

    course_company = inner_soup.select_one(
        'body div.container-fluid div div.subject-cover div.cover-info div div.subject-organization p a').text
    course_date_from = inner_soup.select_one('p.subject-date').text.strip()[3:16]
    if inner_soup.select_one('p.subject-date') is True:
        course_date_to = inner_soup.select_one('p.subject-date').text.strip()[31:]
    else:
        course_date_to = ''
    course_status = inner_soup.select_one('p.subject-date span').text
    course_lecturer_link = [base + li.a['href'] for li in
                            inner_soup.find_all("div", attrs={'class': 'instructor-details'})]
    course_iframe = inner_soup.select_one('iframe').attrs["src"]
    course_promo_link = course_iframe[:24] + 'watch?v=' + course_iframe[30:course_iframe.find('?')]
    wb = xlsxwriter.Workbook('file001.xlsx')
    sheet = wb.add_worksheet()
    sheet.write(row, 0, course_promo_link)
    sheet.write_row(row, 1, course_lecturer_link)
    sheet.write(row, 2, course_desc)
    sheet.write(row, 3, course_out)
    sheet.write(row, 4, course_status)
    sheet.write(row, 5, course_name)
    sheet.write(row, 6, course_date_from)
    sheet.write(row, 7, course_date_to)
    sheet.write(row, 8, course_manhag)
    sheet.write(row, 9, course_require)
    row += 1
    wb.close()
    break

【Question comments】:

    Tags: web-scraping beautifulsoup export-to-csv xlsx


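    【Solution 1】:

    I get an error when I run your code, so I could not test this. But the first thing I would try is moving wb = xlsxwriter.Workbook('file001.xlsx') and wb.close() out of the for loop. My initial thought is that you are recreating, and so overwriting, the file on every iteration. So something like: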

    from bs4 import BeautifulSoup
    import requests
    import xlsxwriter
    
    url = 'https://www.rwaq.org/courses'
    response = requests.get(url)
    soup = BeautifulSoup(response.content, 'html.parser')
    base = 'https://www.rwaq.org'
    
    course_div = soup.find_all('div', attrs={'class': 'course-info'})
    course_links = [base + item.h3.a['href'] for item in course_div]
    
    #Initialize your file/workbook
    wb = xlsxwriter.Workbook('C:/file001.xlsx')
    sheet = wb.add_worksheet()
    
    row = 0
    for link in course_links:
        inner_page = requests.get(link)
        inner_soup = BeautifulSoup(inner_page.content, 'html.parser')
        course_name = inner_soup.find('div', attrs={'class': 'page-title'}).h2.text
        course_lecturer_name = inner_soup.find('div', attrs={'class': 'instructor-details'}).a.text.strip()
        course_desc = inner_soup.find('div', attrs={'class': 'lecture_desc'}).p.text.strip()
        if inner_soup.select_one('#organization div.course-content div:nth-child(4) div.row-fluid ul'):
            course_manhag = inner_soup.select_one('#organization div.course-content div:nth-child(4) div.row-fluid ul').text
        elif inner_soup.select_one('#organization div.course-content div:nth-child(4) div.row-fluid p'):
            course_manhag = inner_soup.select_one('#organization div.course-content div:nth-child(4) div.row-fluid p').text
        else:
            course_manhag = ''
    
        if inner_soup.select_one('#organization div.course-content div:nth-child(5) div.row-fluid ul'):
            course_require = inner_soup.select_one(
                '#organization div.course-content div:nth-child(5) div.row-fluid ul').text
        elif inner_soup.select_one('#organization div.course-content div:nth-child(5) div.row-fluid p'):
            course_require = inner_soup.select_one('#organization div.course-content div:nth-child(5) div.row-fluid p').text
        else:
            course_require = ''
    
        if inner_soup.select_one('#organization div.course-content div:nth-child(6) div.row-fluid ul'):
            course_out = inner_soup.select_one('#organization div.course-content div:nth-child(6) div.row-fluid ul').text
        elif inner_soup.select_one('#organization div.course-content div:nth-child(6) div.row-fluid p'):
            course_out = inner_soup.select_one('#organization div.course-content div:nth-child(6) div.row-fluid p').text
        else:
            course_out = ''
    
        course_company = inner_soup.select_one(
            'body div.container-fluid div div.subject-cover div.cover-info div div.subject-organization p a').text
        course_date_from = inner_soup.select_one('p.subject-date').text.strip()[3:16]
        # A Tag is never identical to True, so `is True` always took the else branch
        if inner_soup.select_one('p.subject-date') is not None:
            course_date_to = inner_soup.select_one('p.subject-date').text.strip()[31:]
        else:
            course_date_to = ''
        course_status = inner_soup.select_one('p.subject-date span').text
        course_lecturer_link = [base + li.a['href'] for li in
                                inner_soup.find_all("div", attrs={'class': 'instructor-details'})]
        course_iframe = inner_soup.select_one('iframe').attrs["src"]
        course_promo_link = course_iframe[:24] + 'watch?v=' + course_iframe[30:course_iframe.find('?')]
    
        sheet.write(row, 0, course_promo_link)
        # write() cannot take a list, so join the links into one cell
        sheet.write(row, 1, ', '.join(course_lecturer_link))
        sheet.write(row, 2, course_desc)
        sheet.write(row, 3, course_out)
        sheet.write(row, 4, course_status)
        sheet.write(row, 5, course_name)
        sheet.write(row, 6, course_date_from)
        sheet.write(row, 7, course_date_to)
        sheet.write(row, 8, course_manhag)
        sheet.write(row, 9, course_require)
        row += 1
    
    # Close it once all rows are written    
    wb.close()
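
Since the question also mentions csv, the same open-once, close-once pattern can be sketched with the standard library csv module. This is only a minimal sketch under stated assumptions: scraped_courses, its keys, the example.com URLs, the write_courses helper and the courses.csv filename are all hypothetical stand-ins for the values scraped inside the loop, not part of the original code.

```python
import csv

# Hypothetical stand-in for the values scraped for each course page.
scraped_courses = [
    {"name": "Course A", "status": "open",
     "lecturer_links": ["https://example.com/l/1", "https://example.com/l/2"]},
    {"name": "Course B", "status": "closed", "lecturer_links": []},
]


def write_courses(out_file, courses):
    """Write a header plus one CSV row per course to an already-open text file."""
    writer = csv.writer(out_file)
    writer.writerow(["name", "status", "lecturer_links"])  # header, written once
    for course in courses:
        # A list does not fit in a single cell, so join it into one string.
        links = ", ".join(course["lecturer_links"])
        writer.writerow([course["name"], course["status"], links])
    return len(courses)


# Open the file once, before looping; the with-block closes it once, afterwards.
with open("courses.csv", "w", newline="", encoding="utf-8") as f:
    written = write_courses(f, scraped_courses)
```

Opening the file in "w" mode inside the loop would truncate it on every pass, which is the same effect as creating the Workbook inside the loop.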
    

    【Comments】:

    • Yes, I think that is what I was doing. I will try your code and hope it works. Thanks @chitown88