Python web scraping not showing all rows with BeautifulSoup: answers

【Question title】: Python webscraping not showing all rows with BeautifulSoup
【Posted】: 2019-09-25 21:09:54
【Question】:

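I am trying to scrape the squad overview from several Transfermarkt pages and found that rows are missing for some of them.

Here are two example pages:

Working: contains all rows, here.

Not working: rows are missing, here.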

from bs4 import BeautifulSoup as bs
import requests
import pandas as pd

headers = {'User-Agent' : 'Mozilla/5.0'}
df_headers = ['position_number' , 'position_description' , 'name' , 'dob' , 'nationality' , 'height' , 'foot' , 'joined' , 'signed_from' , 'contract_until']
r = requests.get('https://www.transfermarkt.com/grasshopper-club-zurich-u17/kader/verein/59526/saison_id/2018/plus/1', headers = headers)
soup = bs(r.content, 'html.parser')

position_number = [item.text for item in soup.select('.items .rn_nummer')]
position_description = [item.text for item in soup.select('.items td:not([class])')]
name = [item.text for item in soup.select('.hide-for-small .spielprofil_tooltip')]
dob = [item.text for item in soup.select('.zentriert:nth-of-type(3):not([id])')]
nationality = ['/'.join([i['title'] for i in item.select('[title]')]) for item in soup.select('.zentriert:nth-of-type(4):not([id])')]
height = [item.text for item in soup.select('.zentriert:nth-of-type(5):not([id])')]
foot = [item.text for item in soup.select('.zentriert:nth-of-type(6):not([id])')]
joined = [item.text for item in soup.select('.zentriert:nth-of-type(7):not([id])')]
signed_from = ['/'.join([item['title'].lstrip(': '), item['alt']])  for item in soup.select('.zentriert:nth-of-type(8):not([id]) [title]')]
contract_until = [item.text for item in soup.select('.zentriert:nth-of-type(9):not([id])')]

df = pd.DataFrame(list(zip(position_number, position_description, name, dob, nationality, height, foot, joined, signed_from, contract_until)), columns = df_headers)
print(df)

df.to_csv(r'Uljanas-MacBook-Air-2:~ uljanadufour$\grasshopper18.csv')

This is what I get for a page that should contain 22 rows.

  position_number  ... contract_until
0               -  ...              -
1               -  ...              -
2               -  ...              -
3               -  ...              -
4               -  ...              -
5               -  ...              -
6               -  ...              -
7               -  ...              -
8               -  ...     30.06.2019

[9 rows x 10 columns]

Process finished with exit code 0

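I can't work out why it works for some pages and not for others. Any help would be much appreciated.

【Comments】:

Use print() to see what is in the variables. If one of the variables has only 9 items while the others have more, zip() will create only 9 rows - it always uses the shortest list to create the data.
Yes, I can see that the signed_from variable seems to be the one that comes up short in those cases. Is there a workaround to get empty strings instead of a shortened list?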
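
The truncation described in the comments can be reproduced without any scraping. The following is a minimal sketch, assuming three made-up column lists (the player and club names are hypothetical placeholders, not data from the page); it prints the list lengths to find the short one and contrasts zip() with itertools.zip_longest().

```python
from itertools import zip_longest

# Hypothetical column lists; signed_from is one item short, as happens
# when one player's row has no "signed from" entry on the page.
name = ["Player A", "Player B", "Player C"]
joined = ["01.07.2018", "01.07.2017", "01.07.2018"]
signed_from = ["Club X/Club X", "Club Y/Club Y"]

columns = {"name": name, "joined": joined, "signed_from": signed_from}

# Print the length of every list to spot the one that is too short.
lengths = {key: len(values) for key, values in columns.items()}
print(lengths)

# zip() stops at the shortest input, so the third row is silently dropped.
truncated = list(zip(name, joined, signed_from))
print(len(truncated))

# zip_longest() pads the shorter lists with fillvalue instead of truncating.
padded = list(zip_longest(name, joined, signed_from, fillvalue=""))
print(len(padded))
```

Note that zip_longest() only pads at the end of the short list. If the missing value belongs to a row in the middle, every later value shifts up by one row, so padding keeps the row count but not necessarily the alignment. Emitting a placeholder per row while scraping avoids that.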

Tags: python web-scraping beautifulsoup xml-parsing html-parsing

【Solution 1】:

The problem is in this line:

signed_from = ['/'.join([item['title'].lstrip(': '), item['alt']])  for item in soup.select('.zentriert:nth-of-type(8):not([id]) [title]')]

You can change it like this:

signed_from = ['/'.join([item.find('img')['title'].lstrip(': '), item.find('img')['alt']])  if item.find('a') else '' for item in soup.select('.zentriert:nth-of-type(8):not([id])')]
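
To see why selecting the cell rather than the image keeps the list at full length, here is a small self-contained sketch. The HTML is a hypothetical, heavily simplified stand-in for the real Transfermarkt table (two columns instead of nine, so the selector uses nth-of-type(2)), and the names old_signed_from and new_signed_from are only for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified markup: the second row's "signed from" cell is empty.
html = """
<table class="items">
  <tr>
    <td class="zentriert">Player A</td>
    <td class="zentriert"><a href="#"><img title=": Club X" alt="Club X"></a></td>
  </tr>
  <tr>
    <td class="zentriert">Player B</td>
    <td class="zentriert"></td>
  </tr>
</table>
"""
soup = BeautifulSoup(html, "html.parser")

# Selecting the images directly: an empty cell contributes nothing,
# so the list ends up shorter than the number of rows.
old_signed_from = [
    "/".join([img["title"].lstrip(": "), img["alt"]])
    for img in soup.select(".zentriert:nth-of-type(2) [title]")
]

# Selecting the cells: every row yields exactly one entry,
# with an empty string where the cell has no image.
new_signed_from = []
for cell in soup.select(".zentriert:nth-of-type(2)"):
    img = cell.find("img")
    if img is not None:
        club = "/".join([img.get("title", "").lstrip(": "), img.get("alt", "")])
    else:
        club = ""
    new_signed_from.append(club)

print(old_signed_from)
print(new_signed_from)
```

Because the placeholder is produced in the same position as the row it belongs to, the values stay aligned with the other columns when they are zipped together.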

【Comments】: