Scraping HTML table data fails with the error 'cannot set a row with mismatched columns'

【Question title】: Scraping HTML table data fails with the error 'cannot set a row with mismatched columns'
【Posted】: 2021-08-18 14:53:41
【Question description】:

I am scraping data from an HTML table, and the code fails with the error 'cannot set a row with mismatched columns'.

import requests
from bs4 import BeautifulSoup
import pandas as pd

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.131 Safari/537.3'}
r = requests.get('https://jleague.co/clubs/sapporo/player/')
soup = BeautifulSoup(r.content, 'lxml')
table = soup.find('table', class_='commonTable playerData')
headers = []

for i in table.find_all('th'):
    title = i.text.strip()
    headers.append(table)

df = pd.DataFrame(columns=headers)

for row in table.find_all('tr')[1:]:
    data = row.find_all('td')
    row_data = [td.text.strip() for td in data]
    length = len(df)
    df.loc[length] = row_data

【Comments】:

I want the output in CSV format.

Tags: python html web-scraping beautifulsoup

【Solution 1】:

To get the table from that page, you can use the following example:

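import requests
from bs4 import BeautifulSoup
import pandas as pd

headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.131 Safari/537.3"
}
# Pass the headers so the User-Agent is actually sent with the request.
r = requests.get("https://jleague.co/clubs/sapporo/player/", headers=headers)
soup = BeautifulSoup(r.content, "lxml")
table = soup.find("table", class_="commonTable playerData")

# Read the header cells from the first row and drop the first one,
# so the number of column names matches the number of <td> cells per row.
header = [th.get_text(strip=True) for th in table.tr.select("th")][1:]

# Collect the cell texts of every row that contains at least one <td>.
all_data = []
for row in table.select("tr:has(td)"):
    tds = [td.get_text(strip=True) for td in row.select("td")]
    all_data.append(tds)

# Build the DataFrame in one step instead of appending row by row.
df = pd.DataFrame(all_data, columns=header)
print(df)
df.to_csv("data.csv", index=False)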

Prints:

                 Name Pos. Height Weight Games Played Goals
0     Takanori SUGENO   GK    179     75            3     0
1        Shunta AWAKA   GK    188     77            0     0
2          Koki OTANI   GK    186     90            4     0
3       Kojiro NAKANO   GK    200     90            1     0
4       Shunta TANAKA   DF    183     68            6     0
5     Takahiro YANAGI   DF    185     80            7     1
6      Akito FUKUMORI   DF    183     75            4     0
7       Toya NAKAMURA   DF    186     78            3     0
8       Shota NISHINO   DF    179     68            0     0
9    Daihachi OKAMURA   DF    183     82            6     0
10    Tomoki TAKAMINE   MF    177     74            7     0
11    LUCAS FERNANDES   MF    174     65            6     1
12       Kazuki FUKAI   MF    179     80            4     1
13      Takuro KANEKO   MF    178     68            6     0
14    Hiroki MIYAZAWA   MF    182     72            3     0
15     Yoshiaki KOMAI   MF    168     64            5     0
16          CHANATHIP   MF    158     56            2     0
17       Takuma ARANO   MF    180     72            6     0
18         Ryota AOKI   MF    174     68            7     2
19      Hiromu TANAKA   MF    174     68            3     0
20         Shinji ONO   MF    175     74            5     0
21         Daiki SUGA   FW    171     69            7     1
22        MILAN TUCIC   FW    186     77            0     0
23   DOUGLAS OLIVEIRA   FW    188     88            7     3
24  Tsuyoshi OGASHIWA   FW    167     67            4     0
25    Taika NAKASHIMA   FW    188     77            4     1
26         Yosei SATO   FW    168     64            0     0
27                JAY   FW    190     89            3     0

The script also saves the table to data.csv.
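
For context, the error from the question can be reproduced offline with pandas alone. The sketch below assumes that the header row yields seven `<th>` cells while each data row yields only six `<td>` cells, which is consistent with the `[1:]` slice in the code above. The `No.` label for the extra header is a placeholder and is not taken from the page.

```python
import pandas as pd

# Hypothetical header and row, shaped like the scraped table:
# seven header labels, but only six values per data row.
header = ["No.", "Name", "Pos.", "Height", "Weight", "Games Played", "Goals"]
row = ["Takanori SUGENO", "GK", "179", "75", "3", "0"]

df = pd.DataFrame(columns=header)
try:
    # Assigning six values to a frame with seven columns fails.
    df.loc[len(df)] = row
    error_message = None
except ValueError as exc:
    error_message = str(exc)
print(error_message)

# Dropping the unmatched first header makes the lengths agree.
fixed = pd.DataFrame([row], columns=header[1:])
print(fixed.shape)
```

Building the frame once from a list of rows, as the answer does, also avoids growing the DataFrame one `.loc` assignment at a time.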

【Comments】:

【Solution 2】:

import requests
import pandas as pd
from bs4 import BeautifulSoup

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.131 Safari/537.3'
}


# Try the following code:
def scrape():
    url = 'https://jleague.co/clubs/sapporo/player/'
    r = requests.get(url, headers=headers)
    soup = BeautifulSoup(r.content, 'lxml')
    players = soup.find('table', class_='commonTable playerData')
    # Collect the text of every <td> cell into one flat list.
    data = []
    for i in players.find_all('td'):
        data.append(i.text.strip())
    # This yields a single column holding every cell, not one row per player.
    df = pd.DataFrame(data, columns=['name'])
    return df


# The following code is from my own solution and does not require pandas.
# I used it to download some data from a website to get some better statistics:
def get_tables_html(url: str, headers: dict) -> list:
    content = requests.get(url, headers=headers).content
    soup = BeautifulSoup(content, 'lxml')
    tables = soup.find_all('table', {'class': 'tableList tableList-b'})
    table_headers = []
    for t in tables:
        # Take the first <th> of each matching table, or None if it has none.
        header = t.find('th')
        table_headers.append(header.text.strip() if header else None)
    return table_headers


table_headers = get_tables_html('https://jleague.co/clubs/sapporo/player/', headers)
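
Since `scrape()` puts every `<td>` text into a single column, the cells may need to be regrouped into one row per player. A minimal sketch of that step follows, assuming every row has the same number of cells (six, judging by the output shown in Solution 1). `chunk_rows` is a hypothetical helper, not part of pandas or BeautifulSoup.

```python
def chunk_rows(cells, n_columns):
    """Group a flat list of cell texts into rows of n_columns cells each."""
    if n_columns <= 0:
        raise ValueError("n_columns must be positive")
    if len(cells) % n_columns:
        raise ValueError("cell count is not a multiple of n_columns")
    return [cells[i:i + n_columns] for i in range(0, len(cells), n_columns)]


# Two rows' worth of cells, copied from the output shown in Solution 1.
flat = ["Takanori SUGENO", "GK", "179", "75", "3", "0",
        "Shunta AWAKA", "GK", "188", "77", "0", "0"]
rows = chunk_rows(flat, 6)
print(rows)
```

The resulting list of rows could then be passed to `pd.DataFrame(rows, columns=...)` with six column names.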

【Comments】: