Beautifulsoup scraping table from website with requests for pandas: answers

【Question title】: Beautifulsoup scraping table from website with requests for pandas
【Posted】: 2018-06-30 13:12:57
【Question】:

I am trying to download the data on this website, https://coinmunity.co/, so that I can work with it later in Python or Pandas. I first tried to load it straight into Pandas via Requests, without success, using this code:

import requests
import pandas as pd
from bs4 import BeautifulSoup

res = requests.get("https://coinmunity.co/")
soup = BeautifulSoup(res.content, 'lxml')
table = soup.find_all('table')[0]
dfm = pd.read_html(str(table), header = 0)
dfm = dfm[0].dropna(axis=0, thresh=4)
dfm.head()

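In most of my attempts I could only reach the information in the header, which seems to be the only table the code sees on this page.

Since that did not work, I tried the same scrape with Requests and BeautifulSoup, but that did not work either. Here is my code: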

import requests
from bs4 import BeautifulSoup

res = requests.get("https://coinmunity.co/")
soup = BeautifulSoup(res.content, 'lxml')
#table = soup.find_all('table')[0]
#table = soup.find_all('div', {'class':'inner-container'})
#table = soup.find_all('tbody', {'class':'_ngcontent-c0'})
#table = soup.find_all('table')[0].findAll('tr')
#table = soup.find_all('table')[0].find('tbody')#.find_all('tbody _ngcontent-c3=""')
table = soup.find_all('p', {'class':'stats change positiveSubscribers'})

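As you can see from the commented-out lines, I have tried several selectors and none of them worked. Is there a way to download that table for use in Pandas/Python that is as clean, simple and fast as possible? Thanks.

【Question comments】:

Tags: python pandas beautifulsoup python-requests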

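【Solution 1】:

Because the content is loaded dynamically after the initial request is made, you won't be able to scrape this data with Requests alone. Here is what I would do: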

from selenium import webdriver
import pandas as pd
from bs4 import BeautifulSoup

# Requires Firefox and geckodriver on your PATH
driver = webdriver.Firefox()
# Wait up to 10 seconds when locating elements
driver.implicitly_wait(10)
driver.get("https://coinmunity.co/")

# Take the HTML as rendered by the browser, then close the browser
html = driver.page_source
driver.quit()

soup = BeautifulSoup(html, 'lxml')

results = []
# Skip the first two rows and read one coin per remaining row
for row in soup.find_all('tr')[2:]:
    data = row.find_all('td')
    name = data[1].find('a').text
    value = data[2].find('p').text
    # get the rest of the data you need about each coin here, then add it to the dictionary that you append to results
    results.append({'name':name, 'value':value})

df = pd.DataFrame(results)

df.head()

   name   value
0  NULS  14,005
1   VEN  84,486
2   EDO  20,052
3  CLUB   1,996
4   HSR   8,433
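
Note that the scraped `value` column holds text with thousands separators ("14,005"), not numbers. A minimal sketch of one way to convert it, assuming every value looks like the ones in the output above (the frame below is rebuilt by hand from that output purely for illustration):

```python
import pandas as pd

# Rows copied from the output above; values are strings, as scraped.
df = pd.DataFrame({
    "name": ["NULS", "VEN", "EDO", "CLUB", "HSR"],
    "value": ["14,005", "84,486", "20,052", "1,996", "8,433"],
})

# Drop the thousands separator, then let pandas pick a numeric dtype.
df["value"] = pd.to_numeric(df["value"].str.replace(",", "", regex=False))
```

After this, the column can be sorted, summed or plotted as integers.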

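You need to make sure geckodriver is installed and on your PATH. I only scraped the name and value of each coin, but getting the rest of the information should be easy.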
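
The row loop in the answer raises IndexError if a row has too few cells, and AttributeError if a cell lacks the expected `<a>` or `<p>`. Here is a rough sketch of a more defensive version, run on a small made-up fragment. The sample markup and the helper name `parse_coin_rows` are assumptions for illustration, not the site's real structure:

```python
from bs4 import BeautifulSoup

# Hypothetical fragment standing in for the rendered page source.
SAMPLE_HTML = """
<table>
  <tr><th>#</th><th>Name</th><th>Subscribers</th></tr>
  <tr><td>1</td><td><a href="#">NULS</a></td><td><p>14,005</p></td></tr>
  <tr><td colspan="3">row without coin data</td></tr>
  <tr><td>2</td><td><a href="#">VEN</a></td><td><p>84,486</p></td></tr>
</table>
"""


def parse_coin_rows(html):
    """Return one {'name', 'value'} dict per row that has the expected cells."""
    soup = BeautifulSoup(html, "html.parser")
    records = []
    for row in soup.find_all("tr"):
        cells = row.find_all("td")
        if len(cells) < 3:
            continue  # header rows (<th> only) and short rows are skipped
        link = cells[1].find("a")
        para = cells[2].find("p")
        if link is None or para is None:
            continue  # cell layout differs from what we expect
        records.append({
            "name": link.get_text(strip=True),
            "value": para.get_text(strip=True),
        })
    return records


records = parse_coin_rows(SAMPLE_HTML)
```

Passing `records` to `pd.DataFrame` gives a frame with the same two columns as in the answer.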

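【Comments】:

I'd strongly recommend using an explicit wait over any kind of implicit wait.
There are 2 reasons: 1. time.sleep(5) is slow, and wastes time if the page loads faster. 2. It is unreliable if the network or the website is slow.
@KeyurPotdar Thanks for pointing that out. I've updated my answer by adding an implicit wait and removing the sleep.
Actually, I was talking about an explicit wait, not implicitly_wait. You can read the reasoning behind it in "When to use explicit wait vs implicit wait in Selenium Webdriver".
Oh, sorry, I misread. I'll try adding an explicit wait. Thanks again @KeyurPotdar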