I cannot scrape table data from multiple URLs using a for loop and BeautifulSoup

【Question title】: I cannot scrape table data from multiple URLs using a for loop and BeautifulSoup
【Posted】: 2018-11-28 15:14:18
【Question description】:

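I am trying to scrape table data from multiple URLs. The table I am looking for is a specific one, which I select by index from the result of .find_all in BeautifulSoup. When I run the script on a single URL, it works fine and returns the table I am looking for. The problem occurs when I use a for loop to scrape the table from multiple URLs and append the results to one DataFrame.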

new_table=pd.DataFrame(columns=range(0,10), index=[0])

k=0
for k in range(0, 11200):
    response=requests.get(urls[k])
    htmls=response.text
    soup=BeautifulSoup(htmls, 'html.parser')

    table=soup.find_all("table")[4]
    row_marker=0
    rows=table.find_all("tr")

    for row in rows:
        column_marker=0
        columns=row.find_all("td")

        for column in columns:
            new_table.iat[row_marker, column_marker]=column.get_text()
            column_marker += 1

    row_marker += 1
    k += 1

new_table

Error:

IndexError                                Traceback (most recent call last)
<ipython-input-132-13c30de3ad5a> in <module>()
      5     soup=BeautifulSoup(htmls, 'html.parser')
      6 
----> 7     table=soup.find_all("table")[4]
      8     row_marker=0
      9     rows=table.find_all("tr")

IndexError: list index out of range

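【Comments】:

First things first: you don't need to increment k with k += 1; your for k in ...: loop already does that. As for the error, you get it because there is no table element at index position [4]. I don't see right now why you have that index there. Try removing it.
OK, I just saw that you said it works for a specific table, which is why it is indexed. So my guess is that not all of the URLs follow this exact structure. For some of your URLs the table is not at index 4, or there is no index 4 at all, hence the error. It is hard to tell without knowing which URLs you are pulling, but I wonder whether you would have to grab all the tables from a URL, look through them, and somehow identify whether each one is the table you want, or else skip it, to get your output. Are you really iterating over more than 11,000 URLs?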
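
The idea in the second comment, picking the table by what it contains rather than by its position, could be sketched roughly as follows. This is only an illustration: the helper name `find_table_by_headers` and the header names "Name" and "Price" are made up for the demo, and it assumes the wanted table can be recognized by its `<th>` cells, which may not hold for the asker's pages.

```python
from bs4 import BeautifulSoup


def find_table_by_headers(html, required_headers):
    """Return the first <table> whose <th> texts include every required header, or None.

    Hypothetical helper: matching on header text is an assumption about the pages.
    """
    soup = BeautifulSoup(html, "html.parser")
    wanted = {h.lower() for h in required_headers}
    for table in soup.find_all("table"):
        headers = {th.get_text(strip=True).lower() for th in table.find_all("th")}
        if wanted <= headers:  # every required header is present
            return table
    return None


# Offline demo with made-up HTML: the wanted table is second on this page.
html = (
    "<table><tr><th>Id</th></tr><tr><td>1</td></tr></table>"
    "<table><tr><th>Name</th><th>Price</th></tr>"
    "<tr><td>apple</td><td>3</td></tr></table>"
)
match = find_table_by_headers(html, ["name", "price"])
first_cell = match.find("td").get_text()
missing = find_table_by_headers(html, ["date"])  # no such header, so None
```

Because the lookup returns None when nothing matches, the calling loop can skip that URL instead of raising IndexError.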

Tags: beautifulsoup index-error

【Solution 1】:

Don't index the list of tables right away; add a check before indexing it:

table = soup.find_all("table")
if len(table) < 5:
    print('no table[4], skip')
    continue
row_marker = 0
rows = table[4].find_all("tr")
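
For context, here is a minimal sketch of how that check might fit into a complete loop. It is not the answerer's code: the function names `extract_rows` and `collect_rows` are hypothetical, the demo feeds in made-up HTML strings instead of calling `requests.get`, and `get_text(strip=True)` is used where the question used plain `get_text()`.

```python
from bs4 import BeautifulSoup


def extract_rows(html, table_index=4):
    """Return the <td> texts of each row of the table at table_index,
    or None when the page has too few tables."""
    soup = BeautifulSoup(html, "html.parser")
    tables = soup.find_all("table")
    if len(tables) <= table_index:
        return None  # same condition as `len(table) < 5` for index 4
    rows = []
    for tr in tables[table_index].find_all("tr"):
        cells = [td.get_text(strip=True) for td in tr.find_all("td")]
        if cells:  # ignore rows that contain no <td> cells
            rows.append(cells)
    return rows


def collect_rows(pages, table_index=4):
    """pages: iterable of (url, html) pairs. Returns (all_rows, skipped_urls)."""
    all_rows, skipped = [], []
    for url, html in pages:
        rows = extract_rows(html, table_index)
        if rows is None:
            skipped.append(url)  # remember which pages lacked the table
            continue
        all_rows.extend(rows)
    return all_rows, skipped


# Offline demo: page "a" has five tables, page "b" has only one.
page_ok = "".join(
    f"<table><tr><td>t{i}</td><td>v{i}</td></tr></table>" for i in range(5)
)
page_short = "<table><tr><td>only</td></tr></table>"
all_rows, skipped = collect_rows([("a", page_ok), ("b", page_short)])
```

In a real run the pairs could come from something like `((u, requests.get(u).text) for u in urls)`, and `pd.DataFrame(all_rows)` would turn the result into a DataFrame. Collecting rows in a list first avoids writing into a preallocated one-row DataFrame with `.iat`, and the `skipped` list shows which URLs did not match the expected structure.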

【Discussion】: