【Question title】: BeautifulSoup4 scraping cannot reach beyond the first page in a website (Python 3.6)
【Posted】: 2018-08-29 10:12:14
【Question】:

I am trying to scrape the first page through page 14 of this website: https://cross-currents.berkeley.edu/archives?author=&title=&type=All&issue=All&region=All Here is my code:

import requests as r
from bs4 import BeautifulSoup as soup
import pandas 

#make a list of all web pages' urls
webpages=[]
for i in range(15):
    root_url = 'https://cross-currents.berkeley.edu/archives?author=&title=&type=All&issue=All&region=All&page='+ str(i)
    webpages.append(root_url)
    print(webpages)

#start looping through all pages
for item in webpages:  
    headers = {'User-Agent': 'Mozilla/5.0'}
    data = r.get(item, headers=headers)
    page_soup = soup(data.text, 'html.parser')

#find targeted info and put them into a list to be exported to a csv file via pandas
    title_list = [title.text for title in page_soup.find_all('div', {'class':'field field-name-node-title'})]
    title = [el.replace('\n', '') for el in title_list]

#export to csv file via pandas
    dataset = {'Title': title}
    df = pandas.DataFrame(dataset)
    df.index.name = 'ArticleID'
    df.to_csv('example31.csv',encoding="utf-8")

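The output csv file contains the targeted info from the last page only. When I print "webpages", it shows that the URLs of all pages were put into the list correctly. What am I doing wrong? Thanks in advance!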

【Comments】:

    Tags: python pandas web-scraping beautifulsoup


    【Solution 1】:

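    You are overwriting the same output CSV file on every pass through the loop, so only the last page survives. You can call .to_csv() in "append" mode so that new data is added to the end of the existing file: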

    df.to_csv('example31.csv', mode='a', encoding="utf-8", header=False)
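
The difference between the two modes can be checked offline. The following is a minimal sketch, not the code from the question: the `pages` list is a hypothetical stand-in for scraped titles, the file names are made up, and writing the header only when the file does not yet exist is an assumption chosen so the result can be read back with `read_csv`.

```python
import os
import tempfile

import pandas

# Hypothetical stand-ins for the titles scraped from three pages.
pages = [["Title A", "Title B"], ["Title C"], ["Title D", "Title E"]]


def write_pages(path, append):
    """Write one small DataFrame per page, as the loop in the question does."""
    for titles in pages:
        df = pandas.DataFrame({"Title": titles})
        if append:
            # Append mode adds rows to the end of the file; the header is
            # written only on the first call, when the file does not exist.
            df.to_csv(path, mode="a", header=not os.path.exists(path),
                      index=False, encoding="utf-8")
        else:
            # The default mode="w" truncates the file on every call.
            df.to_csv(path, index=False, encoding="utf-8")


with tempfile.TemporaryDirectory() as tmp:
    overwrite_path = os.path.join(tmp, "overwrite.csv")
    append_path = os.path.join(tmp, "append.csv")
    write_pages(overwrite_path, append=False)
    write_pages(append_path, append=True)
    overwritten = pandas.read_csv(overwrite_path)["Title"].tolist()
    appended = pandas.read_csv(append_path)["Title"].tolist()

print(overwritten)  # rows from the last page only
print(appended)     # rows from every page
```

Because append mode never clears the file, rows from an earlier run stay in place; deleting the CSV before rerunning the script avoids stale or duplicated rows.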
    

    Or, better still, collect the titles from every page into a single list and dump them to CSV once:

    #start looping through all pages
    titles = []
    for item in webpages:
        headers = {'User-Agent': 'Mozilla/5.0'}
        data = r.get(item, headers=headers)
        page_soup = soup(data.text, 'html.parser')
    
        #find targeted info and put them into a list to be exported to a csv file via pandas
        title_list = [title.text for title in page_soup.find_all('div', {'class':'field field-name-node-title'})]
    
        titles += [el.replace('\n', '') for el in title_list]
    
    # export to csv file via pandas
    dataset = [{'Title': title} for title in titles]
    df = pandas.DataFrame(dataset)
    df.index.name = 'ArticleID'
    df.to_csv('example31.csv', encoding="utf-8")
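
The accumulate-then-write pattern can be exercised without touching the network by moving the parsing step into a helper. This is only a sketch: `extract_titles` and `sample_pages` are hypothetical names, and the inline HTML is a made-up stand-in that assumes the real pages use exactly the class string `field field-name-node-title`, as in the question.

```python
from bs4 import BeautifulSoup


def extract_titles(html):
    """Return the cleaned title strings found in one page of HTML."""
    page_soup = BeautifulSoup(html, "html.parser")
    divs = page_soup.find_all("div", {"class": "field field-name-node-title"})
    return [div.get_text().replace("\n", "") for div in divs]


# Made-up HTML standing in for two downloaded pages.
sample_pages = [
    '<div class="field field-name-node-title">\nFirst article\n</div>',
    '<div class="field field-name-node-title">\nSecond article\n</div>'
    '<div class="field field-name-node-title">\nThird article\n</div>',
]

# One list lives outside the loop and grows with every page.
titles = []
for html in sample_pages:
    titles.extend(extract_titles(html))

print(titles)
```

In the real script the loop body would pass `data.text` from each request to the helper, and the DataFrame would be built from `titles` once, after the loop.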
    

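    【Discussion】:

    • Thank you so much!! Your first suggestion did not work (still the same result), but "collect the titles into a list and then dump to CSV once" works great!
    【Solution 2】: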

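    In addition to what alexce posted, another approach is to keep appending the DataFrame built inside the loop to a new DataFrame, and then write that one to CSV.

    Declare finalDf as a DataFrame outside the loop: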

    finalDf = pandas.DataFrame()
    

    Then, later:

    for item in webpages:
        headers = {'User-Agent': 'Mozilla/5.0'}
        data = r.get(item, headers=headers)
        page_soup = soup(data.text, 'html.parser')
    
        # find targeted info and put it into a list to be exported to a csv file via pandas
        title_list = [title.text for title in page_soup.find_all('div', {'class':'field field-name-node-title'})]
        title = [el.replace('\n', '') for el in title_list]
    
        # build this page's DataFrame and add it to the running result
        dataset = {'Title': title}
        df = pandas.DataFrame(dataset)
        # DataFrame.append was removed in pandas 2.0; pandas.concat does the same job
        finalDf = pandas.concat([finalDf, df])
    
    # renumber the rows from 0 and write the CSV once, after the loop
    finalDf = finalDf.reset_index(drop=True)
    finalDf.index.name = 'ArticleID'
    finalDf.to_csv('example31.csv', encoding="utf-8")
    

    Note the lines that involve finalDf.

    【Discussion】:

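    • Thanks for your input! This works, except that the index numbers go back to 0 after 19 (instead of 0-400, it is 0-19, then 0-19 again and again). Any idea why that happens?
    • @AshleyLiu I have updated the answer by adding reset_index(); you will now get 0-400 :)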
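
The repeating index described in the comments comes from each per-page DataFrame carrying its own 0-based index, which stacking preserves. A small sketch with made-up data (the `page_frames` list is hypothetical) shows the effect and two ways to renumber:

```python
import pandas

# Two made-up per-page frames, each with its own index starting at 0.
page_frames = [
    pandas.DataFrame({"Title": ["Title A", "Title B"]}),
    pandas.DataFrame({"Title": ["Title C", "Title D"]}),
]

# Plain stacking keeps each frame's original index labels.
stacked = pandas.concat(page_frames)
print(stacked.index.tolist())  # [0, 1, 0, 1]

# reset_index(drop=True) discards the old labels and renumbers from 0.
renumbered = stacked.reset_index(drop=True)
print(renumbered.index.tolist())  # [0, 1, 2, 3]

# ignore_index=True renumbers during the concatenation itself.
direct = pandas.concat(page_frames, ignore_index=True)
print(direct.index.tolist())  # [0, 1, 2, 3]
```

Either form gives one continuous ArticleID column once `index.name` is set and the frame is written out.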