Fetch HTML table row data using Python 3.6 and Beautiful Soup

【Question title】: Fetch html table row data using python 3.6 beautiful soup
【Posted】: 2018-01-13 09:12:43
【Question】:

I have the HTML table below and want to fetch its data, specifically the "Revenues ($M) $135,987" values in the first body row. How can I do this with Python and BeautifulSoup?

<table data-reactid=".romjx8c48.1.0.5.1:1.4.0.3.1.0.0.0.0.1.0.0.0.0">
 <thead data-reactid=".romjx8c48.1.0.5.1:1.4.0.3.1.0.0.0.0.1.0.0.0.0.0">
  <tr data-reactid=".romjx8c48.1.0.5.1:1.4.0.3.1.0.0.0.0.1.0.0.0.0.0.0">
   <th data-reactid=".romjx8c48.1.0.5.1:1.4.0.3.1.0.0.0.0.1.0.0.0.0.0.0.0" width="200">
   </th>
   <th data-reactid=".romjx8c48.1.0.5.1:1.4.0.3.1.0.0.0.0.1.0.0.0.0.0.0.1:$th-$ millions">
    $ millions
   </th>
   <th data-reactid=".romjx8c48.1.0.5.1:1.4.0.3.1.0.0.0.0.1.0.0.0.0.0.0.1:$th-% change">
    % change
   </th>
  </tr>
 </thead>
 <tbody data-reactid=".romjx8c48.1.0.5.1:1.4.0.3.1.0.0.0.0.1.0.0.0.0.1">
  <tr data-reactid=".romjx8c48.1.0.5.1:1.4.0.3.1.0.0.0.0.1.0.0.0.0.1.$company-data-Revenues ($M)">
   <td class="title" data-reactid=".romjx8c48.1.0.5.1:1.4.0.3.1.0.0.0.0.1.0.0.0.0.1.$company-data-Revenues ($M).0">
    Revenues ($M)
   </td>
   <td data-reactid=".romjx8c48.1.0.5.1:1.4.0.3.1.0.0.0.0.1.0.0.0.0.1.$company-data-Revenues ($M).1">
    $135,987
   </td>
   <td data-reactid=".romjx8c48.1.0.5.1:1.4.0.3.1.0.0.0.0.1.0.0.0.0.1.$company-data-Revenues ($M).2">
    27.1%
   </td>
  </tr>

Script that tries to fetch the data directly from the source page:

import requests
from bs4 import BeautifulSoup as bs

r = requests.get('http://fortune.com/fortune500/amazon-com/')
soup = bs(r.content, 'html.parser')

result = soup.find('div', {'class': 'small-12 columns'})
table = result.find_all('table')[0] # Grab the first table
print(table.find('td', {'data-reactid': '.romjx8c48.1.0.5.1:1.4.0.3.1.0.0.0.0.1.0.0.0.0.1.$company-data-Revenues ($M).1'}).text)

【Question comments】:

Tags: python python-3.x beautifulsoup

【Solution 1】:

Select the td whose "data-reactid" attribute has the value ".romjx8c48.1.0.5.1:1.4.0.3.1.0.0.0.0.1.0.0.0.0.1.$company-data-Revenues ($M).1" and read its text.

from bs4 import BeautifulSoup

html = """<table data-reactid=".romjx8c48.1.0.5.1:1.4.0.3.1.0.0.0.0.1.0.0.0.0">
     <thead data-reactid=".romjx8c48.1.0.5.1:1.4.0.3.1.0.0.0.0.1.0.0.0.0.0">
      <tr data-reactid=".romjx8c48.1.0.5.1:1.4.0.3.1.0.0.0.0.1.0.0.0.0.0.0">
       <th data-reactid=".romjx8c48.1.0.5.1:1.4.0.3.1.0.0.0.0.1.0.0.0.0.0.0.0" width="200">
       </th>
       <th data-reactid=".romjx8c48.1.0.5.1:1.4.0.3.1.0.0.0.0.1.0.0.0.0.0.0.1:$th-$ millions">
        $ millions
       </th>
       <th data-reactid=".romjx8c48.1.0.5.1:1.4.0.3.1.0.0.0.0.1.0.0.0.0.0.0.1:$th-% change">
        % change
       </th>
      </tr>
     </thead>
     <tbody data-reactid=".romjx8c48.1.0.5.1:1.4.0.3.1.0.0.0.0.1.0.0.0.0.1">
      <tr data-reactid=".romjx8c48.1.0.5.1:1.4.0.3.1.0.0.0.0.1.0.0.0.0.1.$company-data-Revenues ($M)">
       <td class="title" data-reactid=".romjx8c48.1.0.5.1:1.4.0.3.1.0.0.0.0.1.0.0.0.0.1.$company-data-Revenues ($M).0">
        Revenues ($M)
       </td>
       <td data-reactid=".romjx8c48.1.0.5.1:1.4.0.3.1.0.0.0.0.1.0.0.0.0.1.$company-data-Revenues ($M).1">
        $135,987
       </td>
       <td data-reactid=".romjx8c48.1.0.5.1:1.4.0.3.1.0.0.0.0.1.0.0.0.0.1.$company-data-Revenues ($M).2">
        27.1%
       </td>
      </tr>
      <tr data-reactid=".romjx8c48.1.0.5.1:1.4.0.3.1.0.0.0.0.1.0.0.0.0.1.$company-data-Profits ($M)">
       <td class="title" data-reactid=".romjx8c48.1.0.5.1:1.4.0.3.1.0.0.0.0.1.0.0.0.0.1.$company-data-Profits ($M).0">
        Profits ($M)
       </td>
       <td data-reactid=".romjx8c48.1.0.5.1:1.4.0.3.1.0.0.0.0.1.0.0.0.0.1.$company-data-Profits ($M).1">
        $2,371.0
       </td>
       <td data-reactid=".romjx8c48.1.0.5.1:1.4.0.3.1.0.0.0.0.1.0.0.0.0.1.$company-data-Profits ($M).2">
        297.8%
       </td>
      </tr>
      </tbody>
    </table>
    """

soup = BeautifulSoup(html, 'html.parser')
print(soup.find('td', {'data-reactid': '.romjx8c48.1.0.5.1:1.4.0.3.1.0.0.0.0.1.0.0.0.0.1.$company-data-Revenues ($M).1'}).text)

Output:

$135,987
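
The lookup above depends on the exact data-reactid string, which is generated by React and may differ between page loads. As a hedged alternative, the sketch below finds rows by their visible label instead. The helper name table_to_dict and the trimmed SAMPLE_HTML are assumptions made for illustration, not part of the original answer.

```python
from bs4 import BeautifulSoup

# Trimmed copy of the table from the answer; the data-reactid attributes are
# left out on purpose because the lookup below does not use them.
SAMPLE_HTML = """
<table>
 <thead><tr><th></th><th>$ millions</th><th>% change</th></tr></thead>
 <tbody>
  <tr><td class="title">Revenues ($M)</td><td>$135,987</td><td>27.1%</td></tr>
  <tr><td class="title">Profits ($M)</td><td>$2,371.0</td><td>297.8%</td></tr>
 </tbody>
</table>
"""


def table_to_dict(html):
    """Map each row label to the texts of the remaining cells (hypothetical helper)."""
    soup = BeautifulSoup(html, "html.parser")
    rows = {}
    for tr in soup.find_all("tr"):
        cells = [td.get_text(strip=True) for td in tr.find_all("td")]
        if len(cells) >= 2:  # the header row has only <th> cells, so it is skipped
            rows[cells[0]] = cells[1:]
    return rows


data = table_to_dict(SAMPLE_HTML)
# dict.get returns None for a missing row instead of raising AttributeError
revenues = data.get("Revenues ($M)")
print(revenues[0] if revenues else "row not found")
```

Because every row ends up in the dictionary, the same call also gives access to the other rows, for example data["Profits ($M)"].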

Update based on the comments:

The page is rendered with JavaScript, so you can use Selenium to render it:

First install Selenium:

sudo pip3 install selenium

Then download the driver from https://sites.google.com/a/chromium.org/chromedriver/downloads. If you are on Windows or Mac, you can use "Chrome Canary", which supports running Chrome headless.

import bs4 as bs
from selenium import webdriver

browser = webdriver.Chrome()

url = "http://fortune.com/fortune500/amazon-com/"
browser.get(url)
html_source = browser.page_source
browser.quit()
soup = bs.BeautifulSoup(html_source, "html.parser")
# print (soup)
tds = soup.find_all('td')
print(tds[1].text)
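
The snippet above opens a visible browser window for a single URL. A minimal sketch of a headless variant that reuses one browser for several URLs could look like the following. The function names make_headless_chrome and fetch_pages are hypothetical, and the code assumes a Selenium version whose webdriver.Chrome accepts the options keyword, plus a chromedriver on the PATH.

```python
def make_headless_chrome():
    """Start Chrome without a visible window (assumes Selenium and chromedriver are installed)."""
    # Imported here so the rest of the sketch can be loaded without Selenium.
    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options

    options = Options()
    options.add_argument("--headless")
    return webdriver.Chrome(options=options)


def fetch_pages(urls, make_driver=make_headless_chrome):
    """Return {url: rendered HTML}, using a single browser session for all URLs."""
    driver = make_driver()
    pages = {}
    try:
        for url in urls:
            driver.get(url)
            pages[url] = driver.page_source
    finally:
        driver.quit()  # always close the browser, even if a page fails to load
    return pages
```

Calling fetch_pages(["http://fortune.com/fortune500/amazon-com/"]) would return the rendered HTML keyed by URL, and each value can then be passed to BeautifulSoup as in the snippet above. Passing the driver factory as a parameter is a design choice that lets the loop be exercised with a stand-in driver.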

Or, for other methods that do not use Selenium, see my answer to "Scraping Google Finance (BeautifulSoup)".

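【Comments】:

Wow, super... Actually I want to fetch all the useful information from fortune.com/fortune500/amazon-com. I tried a script (added to the question, please check) and it gives the error "AttributeError: 'NoneType' object has no attribute 'text'".
The above code works fine. How can I stop the browser and the driver exe from opening when the script runs? I want to run it in the background, because I want to pass about 10 URLs to collect data.
To run Chrome headless (without opening a window), see duo.com/blog/driving-headless-chrome-with-python. The instructions are for Mac, but I think Windows is similar.
I ran into another problem, which I asked about in a separate question. Could you please help: stackoverflow.com/questions/45533571/…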