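BeautifulSoup Doesn't Pull All Elements

【Question title】: BeautifulSoup Doesn't Pull All Elements
【Posted】: 2016-10-21 15:18:40
【Question】:

I am trying to scrape information from http://www.emoryhealthcare.org/locations/offices/advanced-digestive-care-1.html.

I want to scrape the specialties that appear in the lower third of the page, namely "Gastroenterology" and "Internal Medicine". When I inspect the element, I see that each one is an li inside <div class="module bordered specialist">. But when I loop over the soup and print each item found, the result differs from what I expect: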

<div class="module bordered specialist">
<ul>
<li>Cardiac Care</li>
<li>Transplantation</li>
<li>Cancer Care (Oncology)</li>
<li>Diagnostic Radiology</li>
<li>Neurosciences</li>
<li>Mental Health Services</li>
</ul>
</div>

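When I open the site in a browser, I see the values above flash briefly before the content switches to the expected result. Is there a way to improve my chances of scraping the items I actually want?

【Comments】:

It sounds like the page has JavaScript that changes the content after it loads.
You could use selenium and wait a few seconds (however long the change seems to take).

Tags: python web-scraping beautifulsoup

【Solution 1】:

Just use selenium, wait a few seconds, and then parse as before. That seemed to work.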

from selenium import webdriver
import os
import time
from bs4 import BeautifulSoup

# Path to the local chromedriver binary; adjust for your machine.
chromedriver = "/Users/Rafael/chromedriver"
os.environ["webdriver.chrome.driver"] = chromedriver
driver = webdriver.Chrome(chromedriver)
driver.get('http://www.emoryhealthcare.org/locations/offices/advanced-digestive-care-1.html')
time.sleep(5)  # give the page's JavaScript time to swap in the real list
html = driver.page_source
driver.quit()  # close the browser once the rendered HTML has been captured

soup = BeautifulSoup(html, 'lxml')
results = soup.find_all("div", {"class": "module bordered specialist"})
print(results[0].text)  # prints GastroenterologyInternal Medicine

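【Comments】:

Ah, OK, so selenium and time.sleep let the page finish loading before it is parsed?
Yes, exactly. There are more elegant ways to do this by waiting for a specific element to load, but this site seems very consistent and only takes a few seconds.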

【Solution 2】:

You don't need selenium; a simple POST request gets the data:

So you just need to mimic that request:

import requests

# You can change these fields to get different results.
data = {"selectFields": ["Name", "URL", "Specialists"], "filters": {}, "orderBy": {"Name": -1}}

post = "http://www.emoryhealthcare.org/service/findPhysician/api/locations/retrieve"
# Post the data as JSON and create a dict from the returned JSON.
js = requests.post(post, json=data).json()
print(js[u'locations'][0][u'Specialists'])

Running it gives:

In [3]: import requests
   ...:
   ...: data = {"selectFields":["Name","URL","Specialists"],"filters":{},"orderBy":{"Name":-1}}
   ...: post = "http://www.emoryhealthcare.org/service/findPhysician/api/locations/retrieve"
   ...: js = requests.post(post, json=data).json()
   ...: print(js[u'locations'][0][u'Specialists'])
   ...:
[u'Gastroenterology', u'Internal Medicine']

The JSON contains a large amount of data, with almost anything you might want.
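
Since the response covers many locations, picking entries by index gets awkward quickly. Here is a minimal sketch of indexing the parsed JSON by location name. It assumes each entry under 'locations' carries the 'Name' and 'Specialists' fields requested in selectFields. The sample payload, including the location name and URL value, is hypothetical and only mimics the shape implied by the code above.

```python
def specialists_by_location(payload):
    """Map each location's Name to its list of Specialists."""
    return {
        location.get("Name"): location.get("Specialists", [])
        for location in payload.get("locations", [])
    }


# Hypothetical sample shaped like the parsed response; the real one has many entries.
sample = {
    "locations": [
        {
            "Name": "Advanced Digestive Care",
            "URL": "/locations/offices/advanced-digestive-care-1.html",
            "Specialists": ["Gastroenterology", "Internal Medicine"],
        }
    ]
}

lookup = specialists_by_location(sample)
print(lookup["Advanced Digestive Care"])  # ['Gastroenterology', 'Internal Medicine']
```

In practice you would pass the js dict returned by requests.post(...).json() instead of the sample.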
