How to scrape a website table where the cell values have the same class name? (Answers)

【Question title】: How to scrape a website table where the cell values have the same class name?
【Posted】: 2019-09-23 08:24:41
【Question】:

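I am trying to scrape a (football squad) table from Transfermarkt.com for a project, but some of the columns share the same class name and cannot be told apart.

Columns [2, 10] have unique classes and work fine. I am struggling to find a way to get the rest.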

import requests
from bs4 import BeautifulSoup
import pandas as pd

headers = {'User-Agent':
           'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/47.0.2526.106 Safari/537.36'}

page = "https://www.transfermarkt.com/hertha-bsc-u17/kader/verein/21066/saison_id/2018/plus/1"
pageTree = requests.get(page, headers=headers)
pageSoup = BeautifulSoup(pageTree.content, 'html.parser')

Players = pageSoup.find_all("a", {"class": "spielprofil_tooltip"})
Values = pageSoup.find_all("td", {"class": "zentriert"})

PlayersList = []
ValuesList = []

for i in range(0, 25):
    PlayersList.append(Players[i].text)
    ValuesList.append(Values[i].text)

df = pd.DataFrame({"Players": PlayersList, "Values": ValuesList})

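I would like to scrape all the columns in the rows of that table.

【Comments】:

Get all the <td> elements and use an index to get the value, i.e. value = all_tds[5]
I would get all the <tr> elements, and for every <tr> row I would get its <td> elements. That way each row is handled separately, and I can be sure I am not picking up values from other rows. I can also use an index instead of a class to get the correct value.
pandas has the function pd.read_html(url), which finds all tables in the HTML and converts each one to a DataFrame.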
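
The pd.read_html suggestion in the last comment can be sketched offline as follows. The HTML snippet is a hypothetical, heavily simplified stand-in for the squad table, not the real Transfermarkt markup, and pd.read_html needs an HTML parser installed (lxml, or bs4 together with html5lib).

```python
from io import StringIO

import pandas as pd

# Hypothetical, simplified stand-in for the squad table (not the real markup).
html = """
<table class="items">
  <thead>
    <tr><th>Name</th><th>Height</th><th>Foot</th></tr>
  </thead>
  <tbody>
    <tr><td>Kilian Schubert</td><td>1,80 m</td><td>right</td></tr>
    <tr><td>Raphael Bartell</td><td>1,82 m</td><td>-</td></tr>
  </tbody>
</table>
"""

# read_html returns a list with one DataFrame per <table> it finds.
tables = pd.read_html(StringIO(html))
df = tables[0]
print(df)
```

For the live page, pd.read_html(url) sends its own request without the custom User-Agent used in the question, so it may be safer to fetch the page with requests first and pass StringIO(response.text) instead. That is an assumption about how the site responds, not something confirmed in this thread.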

Tags: python web-scraping beautifulsoup html-parsing

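【Solution 1】:

Using bs4, pandas and CSS selectors. This separates the position (e.g. goalkeeper) from the name. It does not include market value, as no values are given. For any given player, it shows all of the player's nationalities separated by "/", and all "signed from" values separated by "/". Columns that share the same class can be told apart with nth-of-type.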
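
As a minimal offline sketch of the nth-of-type idea, assuming a made-up two-row table rather than the real page: nth-of-type counts an element's position among siblings with the same tag, so the unclassed name cell still counts as the first <td> even though the selector filters on the zentriert class.

```python
from bs4 import BeautifulSoup

# Hypothetical miniature table; the class name is borrowed from the question.
html = """
<table class="items">
  <tr>
    <td>Kilian Schubert</td>
    <td class="zentriert">Sep 9, 2002</td>
    <td class="zentriert">1,80 m</td>
    <td class="zentriert">right</td>
  </tr>
  <tr>
    <td>Raphael Bartell</td>
    <td class="zentriert">Jan 26, 2002</td>
    <td class="zentriert">1,82 m</td>
    <td class="zentriert">-</td>
  </tr>
</table>
"""
soup = BeautifulSoup(html, "html.parser")


def column(n):
    """Text of every td.zentriert that is the n-th <td> within its row."""
    cells = soup.select(f"td.zentriert:nth-of-type({n})")
    return [cell.get_text(strip=True) for cell in cells]


dob = column(2)     # second <td> in each row
height = column(3)  # third <td> in each row
foot = column(4)    # fourth <td> in each row
print(dob, height, foot)
```

The indices in the full script are specific to the layout of the real page at the time the answer was written.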

from bs4 import BeautifulSoup as bs
import requests
import pandas as pd

headers = {'User-Agent' : 'Mozilla/5.0'}
df_headers = ['position_number' , 'position_description' , 'name' , 'dob' , 'nationality' , 'height' , 'foot' , 'joined' , 'signed_from' , 'contract_until']
r = requests.get('https://www.transfermarkt.com/hertha-bsc-u17/kader/verein/21066/saison_id/2018/plus/1', headers = headers)
soup = bs(r.content, 'lxml')

position_number = [item.text for item in soup.select('.items .rn_nummer')]
position_description = [item.text for item in soup.select('.items td:not([class])')]
name = [item.text for item in soup.select('.hide-for-small .spielprofil_tooltip')]
dob = [item.text for item in soup.select('.zentriert:nth-of-type(3):not([id])')]
nationality = ['/'.join([i['title'] for i in item.select('[title]')]) for item in soup.select('.zentriert:nth-of-type(4):not([id])')]
height = [item.text for item in soup.select('.zentriert:nth-of-type(5):not([id])')]
foot = [item.text for item in soup.select('.zentriert:nth-of-type(6):not([id])')]
joined = [item.text for item in soup.select('.zentriert:nth-of-type(7):not([id])')]
signed_from = ['/'.join([item['title'].lstrip(': '), item['alt']])  for item in soup.select('.zentriert:nth-of-type(8):not([id]) [title]')]
contract_until = [item.text for item in soup.select('.zentriert:nth-of-type(9):not([id])')]

df = pd.DataFrame(list(zip(position_number, position_description, name, dob, nationality, height, foot, joined, signed_from, contract_until)), columns = df_headers)
print(df.head())

Sample df.head()

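【Comments】:

A quick follow-up question. If the signed_from column has no item['alt'], I get a KeyError, e.g. [here](transfermarkt.com/rot-weiss-essen-u17/kader/verein/21073/…). I am trying to make it put an empty value wherever data is missing. I tried an if-else statement, but I could not get it to work...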
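
One possible way to avoid that KeyError, sketched on a made-up snippet rather than the real page, is to read attributes with Tag.get, which returns a default instead of raising when the attribute is absent. The helper name describe_transfer and the sample club names are hypothetical.

```python
from bs4 import BeautifulSoup

# Hypothetical cells: complete, missing alt, and missing both attributes.
html = """
<table class="items">
  <tr><td class="zentriert"><img title=": Hertha BSC U16" alt="Hertha BSC U16"></td></tr>
  <tr><td class="zentriert"><img title=": Unknown club"></td></tr>
  <tr><td class="zentriert"><img></td></tr>
</table>
"""
soup = BeautifulSoup(html, "html.parser")


def describe_transfer(tag):
    """Join title and alt with '/', skipping whichever attribute is missing."""
    title = tag.get("title", "").lstrip(": ")  # .get() never raises KeyError
    alt = tag.get("alt", "")
    parts = [part for part in (title, alt) if part]
    return "/".join(parts) if parts else None  # None marks a fully empty cell


signed_from = [describe_transfer(img) for img in soup.select("td.zentriert img")]
print(signed_from)
```

A cell that contains no image at all would still be skipped by this selector, which would shorten the list and misalign it with the other columns. Iterating row by row and indexing the cells within each row avoids that.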

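【Solution 2】:

I would get all the <tr> rows and then use a for loop to get the <td> cells of each row. I can then use an index to pick a column, and use a different method per column to extract its value.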

import requests
from bs4 import BeautifulSoup
import pandas as pd

data = {
    'name': [],
    'data of birth': [],
    'height': [],
    'foot': [],
    'joined': [],
    'contract until': [],
}

headers = {
  'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/47.0.2526.106 Safari/537.36'
}

url = "https://www.transfermarkt.com/hertha-bsc-u17/kader/verein/21066/saison_id/2018/plus/1"
response = requests.get(url, headers=headers)
soup = BeautifulSoup(response.content, 'html.parser')

all_tr = soup.find_all('tr', {'class': ['odd', 'even']})
print('rows:', len(all_tr))

for row in all_tr:
    all_td = row.find_all('td', recursive=False)

    print('columns:', len(all_td))
    for column in all_td:
        print(' >', column.text)

    data['name'].append( all_td[1].text.split('.')[0][:-1] )
    data['data of birth'].append( all_td[2].text[:-5])
    data['height'].append( all_td[4].text )
    data['foot'].append( all_td[5].text )
    data['joined'].append( all_td[6].text )
    data['contract until'].append( all_td[8].text )


df = pd.DataFrame(data)
print(df.head())

Result:

               name data of birth  height   foot       joined contract until
0   Kilian Schubert   Sep 9, 2002  1,80 m  right  Jul 1, 2018              -
1   Raphael Bartell  Jan 26, 2002  1,82 m      -  Jul 1, 2018              -
2  Till Aufderheide  Jun 15, 2002  1,79 m      -  Jul 1, 2018              -
3  Milan Kremenovic   Mar 8, 2002  1,91 m      -  Jul 1, 2018     30.06.2020
4      Adnan Alagic   Jul 4, 2002  1,86 m  right  Jul 1, 2018     30.06.2021

【Comments】: