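【Posted】: 2018-12-21 11:08:07
【Question】:
Below is a web scraper that uses Beautiful Soup to scrape a team roster from a website. Each column of data is put into an array and then written to a CSV file in a loop. I want to scrape the team name ("Team" in the code), but I am struggling to incorporate the meta tag (see the HTML below) into my CSV writerow loop.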
<meta property="og:site_name" content="Tampa Bay Rays" />
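I think the problem is that the length of the "Team" array does not match the length of the other columns. For example, my current code prints arrays like these: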
[Player A, Player B, Player C]
[46,36,33]
[Tampa Bay Rays]
But I need the team array (the last one) to match the length of the first two arrays, like this:
[Player A, Player B, Player C]
[46,36,33]
[Tampa Bay Rays, Tampa Bay Rays, Tampa Bay Rays]
Does anyone know how to make this meta tag adjustment in my writerow CSV loop? Thanks in advance!
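The length mismatch described above could be handled by repeating the single team value once per player. The following is only a minimal sketch under stated assumptions: it uses the sample values from the question, the variable names `team_name`, `teams`, and `buffer` are hypothetical, and it writes to an in-memory `io.StringIO` buffer instead of a real file so it can run without network access.

```python
import csv
import io

# Sample values taken from the arrays shown in the question.
names = ["Player A", "Player B", "Player C"]
numbers = ["46", "36", "33"]
team_name = "Tampa Bay Rays"  # single value, e.g. read from the meta tag

# Repeat the single team value so the list has one entry per player.
teams = [team_name] * len(names)

# In-memory buffer used here in place of a file (assumption for illustration).
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["Name", "Number", "Team"])

# zip() walks the equal-length columns in parallel, one row per player.
for row in zip(names, numbers, teams):
    writer.writerow(row)

csv_text = buffer.getvalue()
print(csv_text)
```

Because `zip()` stops at the shortest input, padding the team list to `len(names)` first keeps every player row in the output.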
import requests
import csv
from bs4 import BeautifulSoup
page=requests.get('http://m.rays.mlb.com/roster/')
soup=BeautifulSoup(page.text, 'html.parser')
#Remove Unwanted Links
last_links=soup.find(class_='nav-tabset-container')
last_links.decompose()
side_links=soup.find(class_='column secondary span-5 right')
side_links.decompose()
#Generate CSV
f=csv.writer(open('MLB_Active_Roster.csv','w',newline=''))
f.writerow(['Name','Number','Hand','Height','Weight','DOB','Team'])
#Find Player Name Links
player_list=soup.find(class_='layout layout-roster')
player_list_items=player_list.find_all('a')
#Extract Player Name Text
names=[player_name.contents[0] for player_name in player_list_items]
#Find Player Number
number_list=soup.find(class_='layout layout-roster')
number_list_items=number_list.find_all('td',index='0')
#Extract Player Number Text
number=[player_number.contents[0] for player_number in number_list_items]
#Find B/T
hand_list=soup.find(class_='layout layout-roster')
hand_list_items=hand_list.find_all('td',index='3')
#Extract B/T
handedness=[player_hand.contents[0] for player_hand in hand_list_items]
#Find Height
height_list=soup.find(class_='layout layout-roster')
height_list_items=height_list.find_all('td',index='4')
#Extract Height
height=[player_height.contents[0] for player_height in height_list_items]
#Find Weight
weight_list=soup.find(class_='layout layout-roster')
weight_list_items=weight_list.find_all('td',index='5')
#Extract Weight
weight=[player_weight.contents[0] for player_weight in weight_list_items]
#Find DOB
DOB_list=soup.find(class_='layout layout-roster')
DOB_list_items=DOB_list.find_all('td',index='6')
#Extract DOB
DOB=[player_DOB.contents[0] for player_DOB in DOB_list_items]
#Find Team Name
team_list=soup.find('meta',property='og:site_name')
Team=[team_name.contents[0] for team_name in team_list]
print(Team)
#Loop Excel Rows
for i in range(len(names)):
    f.writerow([names[i],number[i],handedness[i],height[i],weight[i],DOB[i],Team[i]])
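For the team name lookup in the script above, one thing that may help is that the value of a `<meta>` tag is stored in its `content` attribute rather than in child nodes, so it can be read as a single string and reused for every row. This is a hedged sketch, not a tested fix for the live page: it parses the HTML snippet from the question as an inline string, and the names `html`, `meta`, `team_name`, and `rows` are hypothetical.

```python
from bs4 import BeautifulSoup

# Inline HTML standing in for the downloaded page (assumption for illustration).
html = (
    '<html><head>'
    '<meta property="og:site_name" content="Tampa Bay Rays" />'
    '</head><body></body></html>'
)
soup = BeautifulSoup(html, "html.parser")

# Locate the meta tag by its property attribute.
meta = soup.find("meta", attrs={"property": "og:site_name"})

# Read the attribute value; fall back to an empty string if the tag is missing.
team_name = meta["content"] if meta is not None else ""

# Reuse the same scalar for every player instead of indexing a one-element list.
names = ["Player A", "Player B", "Player C"]
rows = [[name, team_name] for name in names]
print(rows)
```

With a scalar `team_name`, the final `writerow` call in the loop would pass `team_name` in place of `Team[i]`, which avoids the index error that a one-element list would cause on the second row.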
【Discussion】:
Tags: python arrays web-scraping beautifulsoup meta-tags