【Question title】: Why am I getting neither output nor an error in web scraping?
【Posted】: 2018-11-22 11:30:10
【Question description】:

I am using BeautifulSoup and requests to do a web scraping task on Google Colab. Here, I am only scraping the headlines from Google News. This is the code:

import requests
from bs4 import BeautifulSoup

def beautiful_soup(url):
    '''DEFINING THE FUNCTION HERE THAT SENDS A REQUEST AND PRETTIFIES THE TEXT
    INTO SOMETHING THAT IS EASY TO READ'''

    request = requests.get(url)
    soup = BeautifulSoup(request.text, "lxml")
    print(soup.prettify())

beautiful_soup('https://news.google.com/?hl=en-IN&gl=IN&ceid=IN:en')

for headlines in soup.find_all('a', {'class': 'VDXfz'}):
   print(headlines.text)

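The problem is that when I run the cell, it shows neither the output (the list of headlines) nor an error. Please help, this has been bothering me for 2 days.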

【Comments】:

    Tags: python web-scraping beautifulsoup python-requests google-colaboratory


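    【Solution 1】:

    You probably need to display the text from the next span element. The function also has to return soup, because the variable is otherwise local to the function. This can be done as follows: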

    import requests
    from bs4 import BeautifulSoup
    
    def beautiful_soup(url):
        '''DEFINING THE FUNCTION HERE THAT SENDS A REQUEST AND PRETTIFIES THE TEXT 
           INTO SOMETHING THAT IS EASY TO READ'''
    
        request = requests.get(url)
        soup = BeautifulSoup(request.text, "lxml")
        #print(soup.prettify())
        return soup
    
    soup = beautiful_soup('https://news.google.com/?hl=en-IN&gl=IN&ceid=IN:en')
    
    for headlines in soup.find_all('a', {'class': 'VDXfz'}):
        print(headlines.find_next('span').text)
    

    This would give you output starting something like:

    I Take Back My Comment, Says Ram Madhav After Omar Abdullah’s Dare to Prove Pakistan Charge
    Ram Madhav Backpedals On "Instruction From Pak" After Omar Abdullah Dare
    National Conference backed PDP to save J&K from uncertainty: Omar Abdullah
    On Ram Madhav ‘instruction from Pak’ barb, Omar Abdullah’s stinging reply
    Make public reports of horse-trading in govt formation in J-K: Omar Abdullah to Guv
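
Why the original loop printed nothing can be illustrated offline. The sketch below assumes (this is a guess at the page layout at the time, not a copy of it) that each `a.VDXfz` link is empty and the headline text sits in a later `span`. The HTML snippet and its headlines are made up for illustration, and `html.parser` is used instead of `lxml` only to avoid an extra dependency.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: an assumption about the page layout, not real Google News HTML.
SAMPLE_HTML = """
<article>
  <a class="VDXfz" href="./articles/1"></a>
  <h3><a href="./articles/1"><span>First headline</span></a></h3>
</article>
<article>
  <a class="VDXfz" href="./articles/2"></a>
  <h3><a href="./articles/2"><span>Second headline</span></a></h3>
</article>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
links = soup.find_all("a", {"class": "VDXfz"})

# The matched links contain no text, so printing link.text shows only empty lines.
empty_texts = [link.text for link in links]

# find_next() walks forward in document order to the first following <span>.
headlines = [link.find_next("span").text for link in links]

print(empty_texts)
print(headlines)
```

If the real page nests its elements differently, the same check (print what `.text` returns for the matched tags) should still show where the headline text actually lives.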
    

    You can write the headlines to a CSV file as follows:

    import requests
    from bs4 import BeautifulSoup
    import csv
    
    def beautiful_soup(url):
        '''DEFINING THE FUNCTION HERE THAT SENDS A REQUEST AND PRETTIFIES THE TEXT 
           INTO SOMETHING THAT IS EASY TO READ'''
    
        request = requests.get(url)
        soup = BeautifulSoup(request.text, "lxml")
        return soup
    
    soup = beautiful_soup('https://news.google.com/?hl=en-IN&gl=IN&ceid=IN:en')
    
    with open('output.csv', 'w', newline='', encoding='utf-8') as f_output:
        csv_output = csv.writer(f_output)
        csv_output.writerow(['Headline'])
    
        for headlines in soup.find_all('a', {'class': 'VDXfz'}):
            headline = headlines.find_next('span').text
            print(headline)
            csv_output.writerow([headline])
    

    At the moment this produces only a single column named Headline.

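    【Comments】:

    • How do I convert this list to a CSV?
    • Which columns would you have? At the moment it is just a single column.
    • This was tested on my local PC, so the file is saved in the current folder. I can't say where Google Colab would save it. I think you need to look at files.download().
    • I wrote files.download('output.csv') and it downloaded the output CSV as many times as there are headlines, and each one was a 0 KB Excel file with no data.
    • I tested it on my local PC and got the output!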
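
On the Colab download problem raised in the comments above: one possible cause (a guess, since the commenter's code is not shown) is that `files.download()` was called inside the loop, while the file was still open and unflushed, which would trigger one empty download per headline. The sketch below writes and closes the file first and only then requests a single download. The `save_headlines` helper and the sample headlines are hypothetical, and `google.colab` exists only inside Colab, so the import is guarded.

```python
import csv

def save_headlines(headlines, path="output.csv"):
    """Write one headline per row under a single 'Headline' column."""
    with open(path, "w", newline="", encoding="utf-8") as f_output:
        csv_output = csv.writer(f_output)
        csv_output.writerow(["Headline"])
        for headline in headlines:
            csv_output.writerow([headline])
    # Leaving the with-block closes the file, so all rows are on disk here.
    return path

# Placeholder data standing in for the scraped headlines.
sample_headlines = ["First headline", "Second, with a comma"]
csv_path = save_headlines(sample_headlines)

try:
    # Only available inside Google Colab.
    from google.colab import files
except ImportError:
    files = None

if files is not None:
    # Called once, after the file has been closed.
    files.download(csv_path)
```

Outside Colab the download step is skipped and the file simply stays in the current working directory.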
    【Solution 2】:

    Run the following script and you should get the desired result. The script is cleaner if you use selectors.

    Using .find_all(), however:

    import requests
    from bs4 import BeautifulSoup
    
    def get_headlines(url):
        request = requests.get(url)
        soup = BeautifulSoup(request.text,"lxml")
        headlines = [item.find_next("span").text for item in soup.find_all("h3")]
        return headlines
    
    if __name__ == '__main__':
        link = 'https://news.google.com/?hl=en-IN&gl=IN&ceid=IN:en'
        for titles in get_headlines(link):
            print(titles)
    

    To do the same with .select(), make this change inside the function:

    headlines = [item.text for item in soup.select("h3 > a > span")]
    return headlines
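
To check that the two approaches pick out the same elements, both can be run against a small inline document. The markup below is a made-up assumption that mirrors the `h3 > a > span` nesting implied by the selector; it is not real Google News HTML, and `html.parser` is used only to avoid the `lxml` dependency.

```python
from bs4 import BeautifulSoup

# Hypothetical markup matching the nesting the selector expects.
SAMPLE_HTML = """
<div>
  <h3><a href="./articles/1"><span>First headline</span></a></h3>
  <h3><a href="./articles/2"><span>Second headline</span></a></h3>
</div>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# Variant 1: find every <h3>, then step forward to the next <span>.
via_find_all = [item.find_next("span").text for item in soup.find_all("h3")]

# Variant 2: a CSS selector that only matches spans nested as h3 > a > span.
via_select = [item.text for item in soup.select("h3 > a > span")]

print(via_find_all == via_select)
```

The two can diverge on messier pages: `find_next("span")` takes the next span anywhere later in the document, even outside the `h3`, whereas the selector only matches spans nested exactly as written.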
    

    【Comments】:
