How do I get my code to read all tables when scraping a website? (Answers)

【Question title】: How do I get my code to read all tables when scraping a website?
【Posted】: 2020-08-15 12:55:51
【Question】:

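I'm brand new to Python, and this site has helped me a lot this semester. I'm hoping you can help me again.

I need to scrape the tables from https://money.cnn.com/data/hotstocks/.

The tables are Most Actives, Gainers, and Losers.

Right now I have this code working: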

     import requests
     from bs4 import BeautifulSoup

     url = 'http://money.cnn.com/data/hotstocks/index.html'
     response = requests.get(url)
     html = response.content

     soup = BeautifulSoup(html, 'html.parser')

     all_stock = soup.find('div', attrs={'id':'wsod_hotStocks'})

     table = all_stock.find('table', attrs={'class': 'wsod_dataTable wsod_dataTableBigAlt'})

     for row in table.findAll('tr'):
         for cell in row.findAll('td'):
             print(cell.text)

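But this only gets me the Most Actives table, and I'm not sure what I need to do to get my code to pick up the other 2 tables on the page.

I'd appreciate any insight into what I'm doing wrong and how to fix it.

I don't know whether I have to write separate code to scrape each table or whether I can adapt what I already have.

Here is the HTML from the website so you can see what I'm working with.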

【Comments】:

You already know how to use .findAll to loop over all the table rows and table cells. Why not use the same approach to loop over all the tables?

Tags: python beautifulsoup python-requests

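【Solution 1】:

You can actually use pandas.read_html(), which reads every table on the page into a nicely formatted structure.

Note: it returns the tables as a list of DataFrames, so you can access each one by list index, for example print(df[0]).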

import pandas as pd

df = pd.read_html("https://money.cnn.com/data/hotstocks/")

print(df)

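【Comments】:

I second this. If the end goal is to save the tables with pandas, it's better to use pandas from the start.
@rpanai Yes, it's as simple as df[0].to_csv("data.csv", index=False)
Oh, thank you! I was told we could use pandas, but our professor didn't teach it, so I was a bit wary of using it. One question: when I use the code you provided and look at the CSV, only the first table is there. Do I need to add more to the code?
@αԋɱҽԃαмєяιcαη ohhhhh, of course, I see now. Sorry about that, and yes, thank you so much!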
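
The CSV question in the comments comes down to df[0] being only the first element of the list that read_html returns. A minimal sketch of writing every table to its own file might look like the following. The helper names save_tables and scrape_hot_stocks, and the out_dir and prefix parameters, are hypothetical and not part of the original answer; read_html also needs a parser backend such as lxml installed.

```python
from pathlib import Path

import pandas as pd


def save_tables(tables, out_dir=".", prefix="table"):
    """Write each DataFrame in `tables` to its own CSV file and return the paths."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    paths = []
    for i, table in enumerate(tables):
        # One file per table: table_0.csv, table_1.csv, ...
        path = out_dir / f"{prefix}_{i}.csv"
        table.to_csv(path, index=False)
        paths.append(path)
    return paths


def scrape_hot_stocks(url="https://money.cnn.com/data/hotstocks/"):
    """Fetch the page and save every table found on it (requires network access)."""
    tables = pd.read_html(url)  # a list of DataFrames, one per <table>
    return save_tables(tables, prefix="hotstocks")
```

Calling scrape_hot_stocks() would then produce one CSV per table instead of only the first, assuming the page is still reachable and still serves its data as HTML tables.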

【Solution 2】:

Remove the following line:

table = all_stock.find('table', attrs={'class': 'wsod_dataTable wsod_dataTableBigAlt'})

Then simply use this updated loop, which searches the whole container div instead of a single table:

for row in all_stock.find_all('tr'):
    for cell in row.find_all('td'):
        print(cell.text)

Full code:

import requests
from bs4 import BeautifulSoup

url = 'http://money.cnn.com/data/hotstocks/index.html'
response = requests.get(url)
html = response.content

soup = BeautifulSoup(html, features='html.parser')

all_stock = soup.find('div', attrs={'id': 'wsod_hotStocks'})

for row in all_stock.find_all('tr'):
    for cell in row.find_all('td'):
        print(cell.text)
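
To see why dropping the find('table', ...) call changes the result, here is a small sketch that runs against an inline HTML string rather than the live page. SAMPLE_HTML and its ticker values are made up for illustration, and the real page's markup may differ.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the real page: one container div holding two tables.
SAMPLE_HTML = """
<div id="wsod_hotStocks">
  <table class="wsod_dataTable wsod_dataTableBigAlt">
    <tr><th>Company</th><th>Price</th></tr>
    <tr><td>AAA</td><td>1.00</td></tr>
  </table>
  <table class="wsod_dataTable wsod_dataTableBigAlt">
    <tr><th>Company</th><th>Price</th></tr>
    <tr><td>BBB</td><td>2.00</td></tr>
  </table>
</div>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
container = soup.find("div", attrs={"id": "wsod_hotStocks"})

# find() stops at the first matching table, so only its cells are reachable.
first_table = container.find("table")
first_only = [td.get_text(strip=True) for td in first_table.find_all("td")]

# Searching the container directly reaches the rows of every table inside it.
all_cells = [
    td.get_text(strip=True)
    for tr in container.find_all("tr")
    for td in tr.find_all("td")
]

print(first_only)  # ['AAA', '1.00']
print(all_cells)   # ['AAA', '1.00', 'BBB', '2.00']
```

One trade-off of this approach is that the cells come back as a single flat stream, so the boundary between one table and the next is lost. Iterating table by table keeps that boundary.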

【Comments】:

【Solution 3】:

Only a small change to your existing code is needed: use find_all instead of find, then loop over the resulting iterable.

import requests
from bs4 import BeautifulSoup

url = 'http://money.cnn.com/data/hotstocks/index.html'
response = requests.get(url)
html = response.content

soup = BeautifulSoup(html, 'html.parser')

all_stock = soup.find('div', attrs={'id': 'wsod_hotStocks'})

# find_all returns every matching table, not just the first one
tables = all_stock.find_all('table', attrs={'class': 'wsod_dataTable wsod_dataTableBigAlt'})

for table in tables:
    print("Next_Table!!")
    for row in table.find_all('tr'):
        for cell in row.find_all('td'):
            print(cell.text)
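
If the goal is to keep the data rather than print it, the same find_all loop can collect the cells into one list of rows per table. The sketch below is one possible shape for that. The function name extract_tables and the sample HTML are hypothetical, and it only assumes the container id used in the question.

```python
from bs4 import BeautifulSoup


def extract_tables(html, container_id="wsod_hotStocks"):
    """Return one list of rows per <table> found inside the container div."""
    soup = BeautifulSoup(html, "html.parser")
    container = soup.find("div", attrs={"id": container_id})
    if container is None:
        return []
    tables = []
    for table in container.find_all("table"):
        rows = []
        for tr in table.find_all("tr"):
            cells = [td.get_text(strip=True) for td in tr.find_all("td")]
            if cells:  # rows that only contain <th> cells come back empty
                rows.append(cells)
        tables.append(rows)
    return tables


# Made-up sample markup standing in for the downloaded page.
SAMPLE_HTML = """
<div id="wsod_hotStocks">
  <table><tr><th>Company</th></tr><tr><td>AAA</td><td>1.00</td></tr></table>
  <table><tr><td>BBB</td><td>2.00</td></tr></table>
</div>
"""

for index, rows in enumerate(extract_tables(SAMPLE_HTML)):
    print(f"Table {index}: {rows}")
```

With the real page, the html argument would be response.content from the requests call above.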

【Comments】: