【Question title】: Scraping HTML table data raises the error 'cannot set a row with mismatched columns'
【Posted】: 2021-08-18 14:53:41
【Question description】:

I am scraping data from an HTML table, and pandas raises the error "cannot set a row with mismatched columns".

import requests
from bs4 import BeautifulSoup
import pandas as pd

headers ={
    'User-Agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.131 Safari/537.3'}
r =requests.get('https://jleague.co/clubs/sapporo/player/')
soup=BeautifulSoup(r.content, 'lxml')
table=soup.find('table',class_='commonTable playerData')
headers=[]

for i in table.find_all('th'):
    title=i.text.strip()
    headers.append(table)

df=pd.DataFrame(columns=headers)

for row in table.find_all('tr')[1:]:
    data=row.find_all('td')
    row_data=[td.text.strip() for td in data]
    length=len(df)
    df.loc[length]=row_data

【Comments】:

  • I would like the output in CSV format.

Tags: python html web-scraping beautifulsoup


【Solution 1】:

To get the table from that page, you can use the following example:

import requests
from bs4 import BeautifulSoup
import pandas as pd

headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.131 Safari/537.3"
}
r = requests.get("https://jleague.co/clubs/sapporo/player/", headers=headers)
soup = BeautifulSoup(r.content, "lxml")
table = soup.find("table", class_="commonTable playerData")

header = [th.get_text(strip=True) for th in table.tr.select("th")][1:]

all_data = []
for row in table.select("tr:has(td)"):
    tds = [td.get_text(strip=True) for td in row.select("td")]
    all_data.append(tds)

df = pd.DataFrame(all_data, columns=header)
print(df)
df.to_csv("data.csv", index=False)

Prints:

                 Name Pos. Height Weight Games Played Goals
0     Takanori SUGENO   GK    179     75            3     0
1        Shunta AWAKA   GK    188     77            0     0
2          Koki OTANI   GK    186     90            4     0
3       Kojiro NAKANO   GK    200     90            1     0
4       Shunta TANAKA   DF    183     68            6     0
5     Takahiro YANAGI   DF    185     80            7     1
6      Akito FUKUMORI   DF    183     75            4     0
7       Toya NAKAMURA   DF    186     78            3     0
8       Shota NISHINO   DF    179     68            0     0
9    Daihachi OKAMURA   DF    183     82            6     0
10    Tomoki TAKAMINE   MF    177     74            7     0
11    LUCAS FERNANDES   MF    174     65            6     1
12       Kazuki FUKAI   MF    179     80            4     1
13      Takuro KANEKO   MF    178     68            6     0
14    Hiroki MIYAZAWA   MF    182     72            3     0
15     Yoshiaki KOMAI   MF    168     64            5     0
16          CHANATHIP   MF    158     56            2     0
17       Takuma ARANO   MF    180     72            6     0
18         Ryota AOKI   MF    174     68            7     2
19      Hiromu TANAKA   MF    174     68            3     0
20         Shinji ONO   MF    175     74            5     0
21         Daiki SUGA   FW    171     69            7     1
22        MILAN TUCIC   FW    186     77            0     0
23   DOUGLAS OLIVEIRA   FW    188     88            7     3
24  Tsuyoshi OGASHIWA   FW    167     67            4     0
25    Taika NAKASHIMA   FW    188     77            4     1
26         Yosei SATO   FW    168     64            0     0
27                JAY   FW    190     89            3     0

It also saves the table to data.csv.
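
To see why the original code fails, here is a small offline sketch. It assumes, as the [1:] slice in the answer suggests, that the header row has one more cell than each data row. The HTML snippet and the "No." header are hypothetical stand-ins for the real page, and html.parser is used only so the sketch does not depend on lxml.

```python
import pandas as pd
from bs4 import BeautifulSoup

# Hypothetical stand-in for the real page: the header row has one extra
# leading <th> ("No." is an assumed name) compared with the data rows.
HTML = """
<table class="commonTable playerData">
  <tr><th>No.</th><th>Name</th><th>Pos.</th></tr>
  <tr><td>Takanori SUGENO</td><td>GK</td></tr>
  <tr><td>Shunta TANAKA</td><td>DF</td></tr>
</table>
"""

soup = BeautifulSoup(HTML, "html.parser")
table = soup.find("table", class_="commonTable playerData")

all_headers = [th.get_text(strip=True) for th in table.find_all("th")]
rows = [
    [td.get_text(strip=True) for td in tr.find_all("td")]
    for tr in table.find_all("tr")[1:]
]

# Three column names but only two cells per row: row assignment is rejected.
df = pd.DataFrame(columns=all_headers)
try:
    df.loc[len(df)] = rows[0]
    error = None
except ValueError as exc:
    error = str(exc)
print(error)

# Dropping the extra leading header makes the widths agree.
fixed = pd.DataFrame(rows, columns=all_headers[1:])
print(fixed)
```

Building the DataFrame once from a list of rows, as the answer does, also avoids growing it row by row with .loc.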

【Comments】:

    【Solution 2】:
    import requests
    import pandas as pd
    from bs4 import BeautifulSoup

    # Try the following code:
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.131 Safari/537.3'
    }

    def scrape():
        url = 'https://jleague.co/clubs/sapporo/player/'
        r = requests.get(url, headers=headers)
        soup = BeautifulSoup(r.content, 'lxml')
        players = soup.find('table', class_='commonTable playerData')
        # Collect the text of every cell into a single flat column
        data = []
        for i in players.find_all('td'):
            data.append(i.text.strip())
        df = pd.DataFrame(data, columns=['name'])
        return df

    # The following code is from my own solution and does not require pandas.
    # I used it to download some data from a website to get some better statistics.
    # Note: the class 'tableList tableList-b' comes from that other site and
    # may need to be changed to match the target page.
    def get_tables_html(url: str, headers: dict) -> list:
        content = requests.get(url, headers=headers).content
        soup = BeautifulSoup(content, 'lxml')
        tables = soup.find_all('table', {'class': 'tableList tableList-b'})
        # First header cell of each matching table, or None if it has none
        table_headers = []
        for t in tables:
            header = t.find('th')
            table_headers.append(header.text.strip() if header else None)
        return table_headers

    table_headers = get_tables_html('https://jleague.co/clubs/sapporo/player/', headers)
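
    Because scrape() stores every cell in a single 'name' column, the row structure is lost. A minimal sketch of regrouping such a flat list into rows follows. It assumes the cells arrive in row-major order and that every row has the same number of cells; the two sample rows and the column names are taken from the output printed in Solution 1.

```python
import pandas as pd

# Flat list shaped like what scrape() collects (sample values only).
cells = [
    "Takanori SUGENO", "GK", "179", "75", "3", "0",
    "Shunta AWAKA", "GK", "188", "77", "0", "0",
]
columns = ["Name", "Pos.", "Height", "Weight", "Games Played", "Goals"]

width = len(columns)
# Slice the flat list into consecutive chunks of `width` cells, one per row.
rows = [cells[i:i + width] for i in range(0, len(cells), width)]

df = pd.DataFrame(rows, columns=columns)
print(df)
```

    If some rows on the page had merged or missing cells, this chunking would misalign the columns, so collecting cells per tr is the safer choice.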
    

    【Comments】:
