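【Question title】: How can I run a Python command that clicks each link on a page and extracts the title, content, and date for each link?
【Posted】: 2021-01-04 19:52:31
【Question description】:

Using this link: https://1997-2001.state.gov/briefings/statements/2000/2000_index.html. I have a command that clicks each link on the page and pulls out all of the data, but I want to convert it to a CSV file, so I need to run three separate commands to get the title, the paragraphs, and the date of each article on the page (so they can become columns in an Excel sheet). I'm stuck because this page has no 'class' or 'id' attributes. Any suggestions would be very helpful.

Here is my current code: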

    import requests
    from bs4 import BeautifulSoup

    url = 'https://1997-2001.state.gov/briefings/statements/2000/2000_index.html'
    soup = BeautifulSoup(requests.get(url).content, 'html.parser')

    # Follow each statement link on the index page (the slice skips the first 400 links)
    for a in soup.select('td[width="580"] img + a')[400:]:
        u = 'https://1997-2001.state.gov/briefings/statements/2000/' + a['href']
        print(u)
        s = BeautifulSoup(requests.get(u).content, 'html.parser')
        t = s.select_one('td[width="580"], td[width="600"], table[width="580"]:has(td[colspan="2"])').get_text(strip=True, separator='\n')
        # Keep only the text before the end-of-document marker
        print(t.split('[end of document]')[0])
        print('-' * 80)

【Question discussion】:

    Tags: python html selenium web-scraping beautifulsoup


    【Solution 1】:

    You can use this script to save the data as CSV:

    import requests
    import pandas as pd
    from bs4 import BeautifulSoup
    
    
    url = 'https://1997-2001.state.gov/briefings/statements/2000/2000_index.html'
    soup = BeautifulSoup(requests.get(url).content, 'html.parser')
    
    all_data = []
    for a in soup.select('td[width="580"] img + a'):
        # The link text is the date; the text node right after the link is the title
        date = a.text.strip(':')
        title = a.find_next_sibling(string=True).strip(': ')
        u = 'https://1997-2001.state.gov/briefings/statements/2000/' + a['href']
        print(u)
        s = BeautifulSoup(requests.get(u).content, 'html.parser')
        t = s.select_one('td[width="580"], td[width="600"], table[width="580"]:has(td[colspan="2"])').get_text(strip=True, separator='\n')
        # Keep only the text before the end-of-document marker
        content = t.split('[end of document]')[0]
        print(date, title, content)
        all_data.append({
            'url': u,
            'date': date,
            'title': title,
            'content': content
        })
        print('-' * 80)
    
    df = pd.DataFrame(all_data)
    df.to_csv('data.csv', index=False)
    print(df)
    

    Prints:

    ...
    
                                                       url  ...                                            content
    0    https://1997-2001.state.gov/briefings/statemen...  ...  Statement by Philip T. Reeker, Deputy Spokesma...
    1    https://1997-2001.state.gov/briefings/statemen...  ...  Media Note\nDecember 26, 2000\nRenewal of the ...
    2    https://1997-2001.state.gov/briefings/statemen...  ...  Statement by Philip T. Reeker, Deputy Spokesma...
    3    https://1997-2001.state.gov/briefings/statemen...  ...  Notice to the Press\nDecember 21, 2000\nMeetin...
    4    https://1997-2001.state.gov/briefings/statemen...  ...  Statement by Philip T. Reeker, Deputy Spokesma...
    ..                                                 ...  ...                                                ...
    761  https://1997-2001.state.gov/briefings/statemen...  ...  Press Statement by James P. Rubin, Deputy Spok...
    762  https://1997-2001.state.gov/briefings/statemen...  ...  Press Statement by James P. Rubin, Spokesman\n...
    763  https://1997-2001.state.gov/briefings/statemen...  ...  Notice to the Press\nJanuary 6, 2000\nAssistan...
    764  https://1997-2001.state.gov/briefings/statemen...  ...  Press Statement by James P. Rubin, Spokesman\n...
    765  https://1997-2001.state.gov/briefings/statemen...  ...  Press Statement by James P. Rubin, Spokesman\n...
    
    [766 rows x 4 columns]
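
Since the index page has no 'class' or 'id' attributes, the script relies on element position instead: `img + a` selects a link that directly follows an image, and the text node after that link is taken as the title. The following is a minimal sketch of that idea on a small inline HTML snippet. The markup in `SAMPLE_INDEX`, the file names, and the titles are made up for illustration and only approximate what the selector implies about the real page.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, assumed to resemble the index page (not copied from it)
SAMPLE_INDEX = """
<table><tr><td width="580">
<img src="dot.gif"><a href="ps001226.html">12/26/00:</a> First sample title<br>
<img src="dot.gif"><a href="ps001221.html">12/21/00</a>: Second sample title<br>
</td></tr></table>
"""

soup = BeautifulSoup(SAMPLE_INDEX, "html.parser")

rows = []
for a in soup.select('td[width="580"] img + a'):
    # The link text carries the date, possibly with a trailing colon
    date = a.get_text().strip(": ")
    # The next sibling text node carries the title
    title = a.find_next_sibling(string=True).strip(": \n")
    rows.append((date, title, a["href"]))

for row in rows:
    print(row)
```

Stripping both `:` and spaces covers the case where the colon sits inside the link as well as the case where it sits just after it.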
    

    The script also saves data.csv (shown in the original answer as a LibreOffice screenshot).
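
Because the content column is built with `separator='\n'`, each content cell contains line breaks. A CSV file can hold such values as long as the field is quoted. The sketch below uses the standard library `csv` module instead of pandas, purely to show that a multi-line field survives a write and read round trip; the row values are invented sample data.

```python
import csv
import io

# Invented sample row; the content field contains a newline and the title a comma
rows = [
    {
        "url": "https://example.invalid/sample.html",
        "date": "12/26/00",
        "title": "Sample title, with a comma",
        "content": "Media Note\nSecond line of the statement",
    }
]

# Write the rows to an in-memory buffer instead of a file on disk
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["url", "date", "title", "content"])
writer.writeheader()
writer.writerows(rows)

# Read the rows back; quoted fields keep their embedded newlines and commas
buffer.seek(0)
restored = list(csv.DictReader(buffer))

print(restored[0]["content"])
```

When opening such a file in a spreadsheet program, each article should still occupy a single row, with the line breaks kept inside the content cell.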


    Edit: for 1998:

    import requests
    import pandas as pd
    from bs4 import BeautifulSoup
    
    
    url = 'https://1997-2001.state.gov/briefings/statements/1998/1998_index.html'
    soup = BeautifulSoup(requests.get(url).content, 'html.parser')
    
    all_data = []
    for a in soup.select('td[width="580"] img + a, blockquote img + a'):
        # The link text is the date; the text node right after the link is the title
        date = a.text.strip(':')
        title = a.find_next_sibling(string=True).strip(': ')
        u = 'https://1997-2001.state.gov/briefings/statements/1998/' + a['href']
        print(u)
        s = BeautifulSoup(requests.get(u).content, 'html.parser')
        # Skip pages that parse without a <body>
        if not s.body:
            continue
        t = s.select_one('td[width="580"], td[width="600"], table[width="580"]:has(td[colspan="2"]), blockquote, body').get_text(strip=True, separator='\n')
        # Keep only the text before the end-of-document marker
        content = t.split('[end of document]')[0]
        print(date, title, content)
        all_data.append({
            'url': u,
            'date': date,
            'title': title,
            'content': content
        })
        print('-' * 80)
    
    df = pd.DataFrame(all_data)
    df.to_csv('data.csv', index=False)
    print(df)
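
One detail that may matter when adding fallbacks such as `blockquote` or `body` to the selector list: a comma-separated selector list is matched in document order, so `select_one` returns the first element in the page that matches any branch, not the first branch that matches. With `body` in the list, that can be `<body>` itself. The sketch below shows the difference on a made-up page; `select_first_by_priority` is a hypothetical helper, not part of BeautifulSoup.

```python
from bs4 import BeautifulSoup

# Made-up page: a narrow content cell nested inside <body>
SAMPLE_PAGE = """
<html><body>
<p>Navigation text</p>
<table><tr><td width="580">Statement text</td></tr></table>
</body></html>
"""


def select_first_by_priority(soup, selectors):
    """Try each selector in turn and return the first element found (hypothetical helper)."""
    for selector in selectors:
        node = soup.select_one(selector)
        if node is not None:
            return node
    return None


soup = BeautifulSoup(SAMPLE_PAGE, "html.parser")

# Comma-separated list: the earliest matching element in the document wins
combined = soup.select_one('td[width="580"], body')

# Priority list: the first selector that matches anything wins
preferred = select_first_by_priority(soup, ['td[width="580"]', 'body'])

print(combined.name, preferred.name)
```

Whether this matters for the script above depends on the pages: the text is still cut at `[end of document]`, but text that precedes the content cell would be included when `<body>` is the element that gets selected.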
    

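    【Discussion】:

    • Thank you so much, this worked! Since you seem to be an expert at this, do you know what is different about the structure of this link: 1997-2001.state.gov/briefings/statements/1998/1998_index.html and why the same script doesn't seem to run on it (it neither prints the content nor converts to CSV, and there are no errors)?
    • I also tried using that script to scrape this link: 1997-2001.state.gov/statements/2000_index.html, but nothing runs and there are no errors. Do you know what is structurally different, or what needs to change in the script?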
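
A possible explanation for the "nothing runs and there are no errors" symptom: if the CSS selector matches no links on a page, the `for` loop body never executes, so no article is fetched or printed and no exception is raised. The sketch below makes that case visible by failing loudly. `select_links` is a hypothetical helper and `SAMPLE_OTHER_INDEX` is invented markup; the real structure of the other index pages is not shown in this thread.

```python
from bs4 import BeautifulSoup

# Invented markup: links sit in a list, not in a td[width="580"] cell
SAMPLE_OTHER_INDEX = """
<ul>
<li><a href="a.html">01/06/00</a>: First sample title</li>
<li><a href="b.html">01/05/00</a>: Second sample title</li>
</ul>
"""


def select_links(html, selector):
    """Return the links matching selector, or raise if there are none (hypothetical helper)."""
    soup = BeautifulSoup(html, "html.parser")
    links = soup.select(selector)
    if not links:
        raise ValueError(f"selector {selector!r} matched no links")
    return links


# The selector from the answer finds nothing in this markup, so it raises
try:
    select_links(SAMPLE_OTHER_INDEX, 'td[width="580"] img + a')
except ValueError as error:
    print(error)

# A selector adapted to this markup finds both links
print(len(select_links(SAMPLE_OTHER_INDEX, "li a")))
```

Printing `len(soup.select(...))` right after loading an index page is a quick way to check whether the selector needs to be adapted to that page's markup.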