Not able to scrape data in "div" class on WSJ pages

【Question title】: Not able to scrape data in "div" class on WSJ pages
【Posted】: 2020-01-05 07:10:15
【Question】:

I am trying to scrape the text content of articles on the WSJ website. For example, consider the following HTML source:

<div class="article-content ">
       <p>BEIRUT—
      Carlos Ghosn, 
       who is seeking to clear his name in Lebanon, would face a very different path to vindication here, where endemic corruption and the former auto executive’s widespread popularity could influence the outcome of a potential trial. </p> <p>Mr. Ghosn, the former chief of auto makers

I am using the following code:

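import requests
from bs4 import BeautifulSoup

# url is the address of the WSJ article (not shown in the post)
res = requests.get(url)
html = BeautifulSoup(res.text, "lxml")
classid = "article-content "
item = html.find_all("div", {"class": classid})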

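This returns an empty list. I have seen other posts where people suggested adding delays and other workarounds, but none of these work in my case. I plan to use the scraped text for some ML projects.

I have a WSJ subscription and was logged in when I ran the script above.

Any help with this would be greatly appreciated! Thanks.

【Comments】:

Turn off JavaScript in your browser and reload the page. Is the content you want still there?
Yes, I checked both the rendered page and the HTML source.

Tags: python web-scraping beautifulsoup python-requests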

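【Solution 1】:

Your code works fine for me. Just make sure you are searching for the correct "classid". I don't think it will make a difference, but you could try this as an alternative: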

item = html.find_all("div", class_ = classid)
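To check that the two call forms behave the same, here is a minimal sketch that runs against an inline snippet rather than a live WSJ page. The sample markup is an assumption made for illustration, and `html.parser` is used only so the example does not depend on lxml being installed.

```python
from bs4 import BeautifulSoup

# Inline stand-in for the article markup (hypothetical, not fetched from WSJ).
sample = '<div class="article-content "><p>BEIRUT</p><p>Mr. Ghosn</p></div>'
soup = BeautifulSoup(sample, "html.parser")

# Beautiful Soup splits the class attribute on whitespace,
# so search for the bare class name without a trailing space.
classid = "article-content"

by_dict = soup.find_all("div", {"class": classid})
by_keyword = soup.find_all("div", class_=classid)

print(len(by_dict), len(by_keyword))
```

If both lists are non-empty here but empty for the real page, the markup returned by `requests` probably differs from what the browser displays.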

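【Discussion】:

Thanks Sultan. It just doesn't work on my end: html = BeautifulSoup(res.text, "lxml") classid = "article-content" # ends at the last t item = html.find_all("div", class_ = classid) print(item). The output is "[]".
Can you try classid = "article" and see what happens?
Same result. :( Could you share a screenshot of your code and output? Thanks a lot for your help.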

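【Solution 2】:

One thing you can do is confirm that the element exists by inspecting the page with JavaScript in the browser console. Pages are often assembled from background requests, so you may see the element in the rendered page even though it is actually the result of a request to a different URL, or is produced inside a script.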
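The same check can also be made from Python by looking for the element in the raw response text, before any JavaScript has run. The sketch below is only an illustration: the helper name `has_div_class` and both sample strings are hypothetical.

```python
from bs4 import BeautifulSoup


def has_div_class(html_text, class_name):
    """Return True if the raw HTML contains a <div> carrying class_name."""
    soup = BeautifulSoup(html_text, "html.parser")
    return soup.find("div", class_=class_name) is not None


# Hypothetical server responses: one ships the article inline, the other
# only ships a placeholder that a script would fill in later.
static_page = '<div class="article-content"><p>Text</p></div>'
dynamic_page = '<div id="root"></div><script src="app.js"></script>'

print(has_div_class(static_page, "article-content"))
print(has_div_class(dynamic_page, "article-content"))
```

In the question's setting, `html_text` would be `res.text`. A `False` result there suggests that the content is not part of the initial response.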

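【Discussion】:

Thanks. That is what is happening. I searched the HTML output and could not find the tag, so it is generated dynamically. Any ideas on how to proceed?
@user6027414 I don't have a WSJ subscription, so I can't check. However, you could try searching the ('script') tags: if the article is generated by a script, it will show up there, and you would then need to use json.loads. If you find the answer helpful, please accept it.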
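The suggestion of searching the script tags and using `json.loads` could look roughly like the sketch below. The page string, the `article-data` id and the `paragraphs` key are all invented for illustration; the real structure of a WSJ page is not known here and would need to be inspected first.

```python
import json

from bs4 import BeautifulSoup

# Hypothetical page: the article text is embedded as JSON inside a script tag.
page = """
<html><body>
<div id="root"></div>
<script src="app.js"></script>
<script type="application/json" id="article-data">
{"headline": "Example", "paragraphs": ["First paragraph.", "Second paragraph."]}
</script>
</body></html>
"""

soup = BeautifulSoup(page, "html.parser")

paragraphs = []
for script in soup.find_all("script"):
    raw = script.string  # None for empty or external scripts
    if not raw:
        continue
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        continue  # ordinary JavaScript rather than JSON
    if isinstance(data, dict):
        # "paragraphs" is an assumed key, chosen only for this example.
        paragraphs.extend(data.get("paragraphs", []))

print(paragraphs)
```

Scripts that do not parse as JSON are skipped, so the loop can be run over every script tag on a page without knowing in advance which one holds the data.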

【Solution 3】:

Try using select, with the parser set to 'lxml'.

content = [p.text for p in soup.select('.article-content p')]
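The one-liner above assumes a `soup` object already exists. A self-contained sketch of the same selector follows. The sample markup is made up, and `html.parser` stands in for 'lxml' so that the example runs without extra dependencies. Swap in "lxml" if it is installed.

```python
from bs4 import BeautifulSoup

# Hypothetical markup modelled on the snippet in the question.
sample = """
<div class="article-content ">
  <p>BEIRUT: first paragraph.</p>
  <p>Mr. Ghosn, second paragraph.</p>
</div>
<div class="sidebar"><p>Unrelated.</p></div>
"""

soup = BeautifulSoup(sample, "html.parser")

# ".article-content p" matches every <p> nested inside an element
# whose class list contains "article-content".
content = [p.get_text(strip=True) for p in soup.select(".article-content p")]

print(content)
```

Like `find_all`, this only helps when the paragraphs are present in the HTML that `requests` receives.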

【Discussion】: