Cannot get headlines content while scraping: answers

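【Question title】: Cannot get headlines content while scraping
【Posted】: 2019-10-09 09:06:21
【Question】:

I'm new to scraping, and I have tried several approaches to this problem without getting the result I want. I want to scrape https://www.accesswire.com/newsroom/ and collect all the headlines. The headlines are visible when I inspect the page in a browser, but after scraping with bs4 or selenium I don't get the full page source, and I don't get the headlines.

I have tried time.sleep(10), but that didn't work either. I also used selenium to fetch the page, with the same result. The headlines sit in a div with the class column-15 w-col w-col-9.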

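import time

import requests
from bs4 import BeautifulSoup
# UserAgent is assumed to come from the fake_useragent package
from fake_useragent import UserAgent

ua = UserAgent()
header = {'user-agent': ua.chrome}
url = "https://www.accesswire.com/newsroom/"
response = requests.get(url, headers=header)
time.sleep(12)
soup = BeautifulSoup(response.content, 'html.parser')
time.sleep(12)
headline_Div = soup.find("div", {"class": "column-15 w-col w-col-9"})
print(headline_Div)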

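I just want to get all the headlines and headline links on this page, or at least see the complete page source so that I can process it myself.

【Question comments】:

What did you get?
The site appears to load its content asynchronously and dynamically, so requests and BS4 will not be able to get the page elements. Please include what you tried with selenium, since that is probably the better option.
I got the page source of the site, but not the headlines. The headlines are what I want to scrape.
Here is the selenium code: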
import time
from selenium import webdriver
from bs4 import BeautifulSoup

driver = webdriver.Chrome('C:/Users/MUNTAZIR/Downloads/Compressed/chromedriver_win32/chromedriver.exe')
time.sleep(5)
site_url = "https://www.accesswire.com/newsroom/"
time.sleep(5)
print(site_url)
soup = BeautifulSoup(driver.page_source, 'lxml')
print(soup)

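Tags: python selenium web-scraping beautifulsoup screen-scraping

【Solution 1】:

If fetching and parsing does not work because the content is dynamic, you will need selenium so that a real browser renders the content for you.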

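from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Firefox()
driver.get('https://www.accesswire.com/newsroom/')
# The find_elements_by_* helpers were deprecated and later removed in Selenium 4;
# find_elements with a By locator is the current form.
headline_links = driver.find_elements(By.CSS_SELECTOR, 'a.headlinelink')
headlines = [link.get_attribute('textContent') for link in headline_links]
driver.quit()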

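【Comments】:

【Solution 2】:

You don't need selenium. Just use requests, which is more efficient, together with the API that the page itself uses.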

import re
import requests
from bs4 import BeautifulSoup as bs

# The newsroom page loads its headlines from this endpoint
r = requests.get('https://www.accesswire.com/api/newsroom.ashx')
# Capture the HTML string passed to $('#newslist').after(...)
p = re.compile(r" \$\('#newslist'\)\.after\('(.*)\);")
html = p.findall(r.text)[0]
soup = bs(html, 'lxml')
# Collect (headline text, link) pairs
headlines = [(item.text, item['href']) for item in soup.select('a.headlinelink')]
print(headlines)

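Regex explanation:

The pattern matches the literal text $('#newslist').after(' (preceded by a space) in the response and captures everything that follows it, up to the last ); on the same line. The captured string is the HTML fragment that contains the headline links.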
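
The pattern from the answer can also be tried offline. The sketch below runs it against a made-up response string, since the real payload of the endpoint is not shown in this thread. SAMPLE_RESPONSE, HeadlineParser and extract_headlines are hypothetical names, and the standard library's html.parser is swapped in for BeautifulSoup only to keep the example free of dependencies.

```python
import re
from html.parser import HTMLParser

# Made-up stand-in for the script text returned by the endpoint (an assumption;
# the real payload is not shown in this thread).
SAMPLE_RESPONSE = (
    "$(function () {\n"
    " $('#newslist').after('<div>"
    '<a class="headlinelink" href="/news/1">First headline</a>'
    '<a class="headlinelink" href="/news/2">Second headline</a>'
    "</div>');\n"
    "});\n"
)

# Same pattern as in the answer.
PATTERN = re.compile(r" \$\('#newslist'\)\.after\('(.*)\);")


class HeadlineParser(HTMLParser):
    """Collect (text, href) pairs for <a class="headlinelink"> elements."""

    def __init__(self):
        super().__init__()
        self.headlines = []
        self._in_link = False  # True while inside a headline link
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a" and "headlinelink" in (attrs.get("class") or "").split():
            self._in_link = True
            self._href = attrs.get("href")
            self._text = []

    def handle_data(self, data):
        if self._in_link:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._in_link:
            self.headlines.append(("".join(self._text).strip(), self._href))
            self._in_link = False


def extract_headlines(script_text):
    """Apply the regex, then parse the captured HTML fragment."""
    fragment = PATTERN.findall(script_text)[0]
    parser = HeadlineParser()
    parser.feed(fragment)
    parser.close()
    return parser.headlines


headlines = extract_headlines(SAMPLE_RESPONSE)
print(headlines)
```

Note that the capture group also picks up the closing quote before the final ");". It sits outside any tag, so the parser ignores it here.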

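【Comments】:

Can you recommend some good tutorials/docs on how to improve scraping speed, @dalvenjia?
Using the API as shown above should be the first choice. Beyond that, there is a lot to consider: choosing the right tool for the job, the best selector methods/selector paths, using waits correctly and optimally when they are needed, code structure... I'm sure you can google these. I mostly pick things up via Google. I am new to python, so I look to experienced pythonistas to see how they structure their code and what syntax they use. I would say the above is very efficient.
Another great approach, @QHarr.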