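【Question title】: How can I get an article's author, upload date, and update date?
【Posted】: 2023-03-20 12:38:01
【Question description】:

This is my article-scraping function. I am now trying to work out how to scrape the author's name, the upload date, and the update date. What approaches could I take to handle a large number of SF Chronicle articles?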

from urllib.request import urlopen as uReq
from bs4 import BeautifulSoup as soup

# Dictionary that collects one list of values per field
dictionary = {}

# Initializing our title key
title_key = 'title'
dictionary.setdefault(title_key, [])

# Initializing our url key
url_key = 'url'
dictionary.setdefault(url_key, [])

# Initializing our author key
author_key = 'author'
dictionary.setdefault(author_key, [])

# Initializing our date key
date_key = 'date'
dictionary.setdefault(date_key, [])

# Initializing our date updated key
date_updated_key = 'date_updated'
dictionary.setdefault(date_updated_key, [])

def article_scraper(url):
    # Opening up the connection, grabbing the page
    uClient = uReq(url)
    page_html = uClient.read()
    uClient.close()

    # HTML parsing
    page_soup = soup(page_html, "html.parser")

    dictionary['url'].append(url)
    dictionary['title'].append(page_soup.title.string)
    # Attempted author lookup: this CSS selector matches <author class="name"> elements
    dictionary['author'].append(page_soup.select("author.name"))
    return dictionary

articles = ['https://www.sfchronicle.com/news/bayarea/heatherknight/article/Special-education-teacher-a-prime-example-of-13560483.php']
article0 = article_scraper(articles[0])

【Question discussion】:

    Tags: python-3.x web-scraping beautifulsoup html-parsing


    【Solution 1】:

    Author name:

    # page_soup is the parsed page from the question's article_scraper
    author = page_soup.find_all("div", {"class": "header-authors-name"})
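
The lookup above returns a list of tags rather than text, and it does not cover the two dates. Below is a minimal sketch of one way to collect all three fields. It assumes the page also embeds schema.org JSON-LD metadata (`datePublished`, `dateModified`), which many news sites do but which has not been verified for SF Chronicle here. `extract_metadata` and `SAMPLE_HTML` are hypothetical names, and the sample markup and its values are made up for illustration.

```python
import json
from bs4 import BeautifulSoup

# Hypothetical markup for illustration only; the real page may differ.
SAMPLE_HTML = """
<html><head>
<title>Sample article</title>
<script type="application/ld+json">
{"@type": "NewsArticle",
 "author": {"@type": "Person", "name": "Jane Doe"},
 "datePublished": "2019-01-25T04:00:00Z",
 "dateModified": "2019-01-26T18:30:00Z"}
</script>
</head><body>
<div class="header-authors-name">Jane Doe</div>
</body></html>
"""


def extract_metadata(page_html):
    """Return author, date and date_updated from page_html (None if absent)."""
    page_soup = BeautifulSoup(page_html, "html.parser")
    result = {"author": None, "date": None, "date_updated": None}

    # First try the class suggested in the answer above and keep only the text.
    names = [div.get_text(strip=True)
             for div in page_soup.find_all("div", {"class": "header-authors-name"})]
    if names:
        result["author"] = ", ".join(names)

    # Then look for JSON-LD metadata (assumed to be present on the page).
    for script in page_soup.find_all("script", {"type": "application/ld+json"}):
        try:
            data = json.loads(script.string or "")
        except json.JSONDecodeError:
            continue  # skip blocks that are empty or not valid JSON
        if not isinstance(data, dict):
            continue  # this sketch only handles a single top-level object
        author = data.get("author")
        if result["author"] is None and isinstance(author, dict):
            result["author"] = author.get("name")
        result["date"] = result["date"] or data.get("datePublished")
        result["date_updated"] = result["date_updated"] or data.get("dateModified")
    return result


metadata = extract_metadata(SAMPLE_HTML)
print(metadata)
```

For a large number of articles, the same function could be called on each page's HTML inside the loop that already calls `article_scraper`, appending each returned value to the matching list in `dictionary`. Fields that are not found stay `None`, so missing metadata on one article does not stop the run.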
    

    【Discussion】:
