【Question Title】: Webscraping with Python and Selenium
【Posted】: 2020-02-05 06:05:29
【Question】:

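I've been working on a project to scrape schedules from an amateur hockey website and export them as csv, in a format acceptable for upload to the Sports Engine application. I've managed to get the data I want as plain text, but now I need to figure out how to convert it so that it can be exported as csv.

Here is sample output from the script, abbreviated for brevity.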

AL1602 · Nov 6 · Atom A League · FVC Flight 3FINALMSA Arena · Abbotsford, BCLANGLEY MHA ATOM A4 EAGLES2 - 6ABBOTSFORD ATOM A2 HAWKS AL1607 · Nov 10 · Atom A League · FVC Flight 3FINALMission Leisure Center · North · Mission, BCTime change due to ice conflict CSABBOTSFORD ATOM A2 HAWKS5 - 4MISSION MHA ATOM A2

Here is sample output from the same script when it calls print(tables) instead, which shows the markup rather than just the text.

[<tr class="gamelist-row"><td class="game-details"><div class="game-meta text-muted">AL1602 · Nov 6<a class="text-muted" href="/leagues/786?scheduleId=1265&groupId=5" title="Atom A League · FVC Flight 3"> · Atom A League · FVC Flight 3</a></div><div class="game-time">FINAL</div><div class="game-arena">MSA Arena<span class="text-muted"> · Abbotsford, BC</span></div></td><td><div class="game-matchup"><a class="team-link" href="/teams/4688?scheduleId=1265&groupId=5"><div class="d-flex flex-row" style="min-width: 125px;"><div class="pr-2"><div alt="LANGLEY MHA ATOM A4 EAGLES" class="team-logo" style='background-image: url("https://s3-ca-central-1.amazonaws.com/hisports-logos/1537488764672.png");'></div></div><div class="d-flex flex-fill flex-column justify-content-center"><span class="team-name text-uppercase">LANGLEY MHA ATOM A4 EAGLES</span></div></div></a><div class="game-result score"><div class="result result-loss">2</div><span class="text-muted"> - </span><div class="result result-win">6</div></div><a class="team-link" href="/teams/4326?scheduleId=1265&groupId=5"><div class="d-flex flex-row flex-row-reverse" style="min-width: 125px;"><div class="pl-2"><div alt="ABBOTSFORD ATOM A2 HAWKS" class="team-logo" style='background-image: url("https://s3-ca-central-1.amazonaws.com/hisports-logos/1538567502609.jpg");'></div></div><div class="d-flex flex-fill flex-column justify-content-center"><span class="team-name text-uppercase text-right">ABBOTSFORD ATOM A2 HAWKS</span></div></div></a></div></td></tr>, <tr class="gamelist-row"><td class="game-details"><div class="game-meta text-muted">AL1607

Here is the script.

from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from bs4 import BeautifulSoup

#launch url
url = "https://games.pcaha.ca/teams/4326"

#create a new Firefox session
driver = webdriver.Firefox()
driver.implicitly_wait(30)
driver.get(url)

#After opening the url above, Selenium finds the table with the schedule
#find_elements("id", ...) is accepted by both Selenium 3 and Selenium 4
games = driver.find_elements("id", "table-responsive")

#Selenium hands the page source to Beautiful Soup
soupsource=BeautifulSoup(driver.page_source, 'lxml')
soupsource.prettify()

#Beautiful Soup grabs the class gamelist-row
tables = soupsource.find_all("tr", class_="gamelist-row")

# prints out the text only
for x in tables:
    print(x.text)


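【Comments】:

  • Can you show the expected result for the sample?
  • Can you post the output of the script?
  • @Marco Here is sample output from the script, abbreviated for brevity. AL1602 · Nov 6 · Atom A League · FVC Flight 3FINALMSA Arena · Abbotsford, BCLANGLEY MHA ATOM A4 EAGLES2 - 6ABBOTSFORD ATOM A2 HAWKS AL1607 · Nov 10 · Atom A League · FVC Flight 3FINALMission Leisure Center · North · Mission, BCTime change due to ice conflict CSABBOTSFORD ATOM A2 HAWKS5 - 4MISSION MHA ATOM A2
  • @wedge22 It's not just for me, you can put it in the post. And judging from this, are the fields separated by tabs and the rows separated by CR?
  • @Marco I've added more information to the original post, which should answer your question.

Tags: python selenium csv web-scraping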


【Solution 1】:
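import csv

# newline='' stops the csv module from writing blank lines between rows on Windows
with open('file.csv', mode='w', newline='') as csv_file:
    # the keys passed to writerow() must match these fieldnames
    fieldnames = ['field1', 'field2', 'field3']
    writer = csv.DictWriter(csv_file, fieldnames=fieldnames)

    writer.writeheader()
    writer.writerow({'field1': 'John Smith', 'field2': 'Accounting', 'field3': 'November'})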

Try this small snippet to write to a csv file. Modify it to suit your needs!
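To connect this to the scraper in the question, one possible approach is to read each field from its own tag instead of using the concatenated x.text, then pass the resulting dicts to csv.DictWriter. The following is only a rough sketch: the class names (game-meta, game-time, game-arena, team-name, result) come from the print(tables) output in the question, while the column names, the parse_row helper and the trimmed SAMPLE_HTML are made up for illustration. The columns Sports Engine actually expects are not given in the question, so they would need to be adjusted.

```python
import csv
import io

from bs4 import BeautifulSoup

# One row trimmed from the print(tables) output in the question (logos and
# layout divs removed). In the real script this would be driver.page_source.
SAMPLE_HTML = """
<table><tr class="gamelist-row"><td class="game-details">
<div class="game-meta text-muted">AL1602 · Nov 6<a class="text-muted" href="/leagues/786"> · Atom A League · FVC Flight 3</a></div>
<div class="game-time">FINAL</div>
<div class="game-arena">MSA Arena<span class="text-muted"> · Abbotsford, BC</span></div></td>
<td><div class="game-matchup">
<a class="team-link" href="/teams/4688"><span class="team-name text-uppercase">LANGLEY MHA ATOM A4 EAGLES</span></a>
<div class="game-result score"><div class="result result-loss">2</div><span class="text-muted"> - </span><div class="result result-win">6</div></div>
<a class="team-link" href="/teams/4326"><span class="team-name text-uppercase text-right">ABBOTSFORD ATOM A2 HAWKS</span></a>
</div></td></tr></table>
"""

# Hypothetical column names; rename them to whatever the upload format needs.
FIELDNAMES = ["game_id", "date", "league", "status", "arena", "city",
              "team_1", "score_1", "team_2", "score_2"]


def parse_row(row):
    """Turn one <tr class="gamelist-row"> tag into a flat dict."""
    # "AL1602 · Nov 6 · Atom A League · FVC Flight 3": split on the first two
    # dots only, so the league keeps its own " · " (assumes all three parts exist).
    meta = row.find("div", class_="game-meta").get_text()
    game_id, date, league = [part.strip() for part in meta.split("·", 2)]

    # "MSA Arena · Abbotsford, BC": the city is assumed to follow the last dot.
    arena_text = row.find("div", class_="game-arena").get_text()
    arena, city = [part.strip() for part in arena_text.rsplit("·", 1)]

    teams = [t.get_text(strip=True) for t in row.find_all("span", class_="team-name")]
    scores = [s.get_text(strip=True) for s in row.find_all("div", class_="result")]
    scores += [""] * (2 - len(scores))  # games without a result get empty scores

    return {
        "game_id": game_id, "date": date, "league": league,
        "status": row.find("div", class_="game-time").get_text(strip=True),
        "arena": arena, "city": city,
        "team_1": teams[0], "score_1": scores[0],
        "team_2": teams[1], "score_2": scores[1],
    }


soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
tables = soup.find_all("tr", class_="gamelist-row")

# Written to an in-memory buffer here; open('file.csv', 'w', newline='')
# would work the same way.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=FIELDNAMES)
writer.writeheader()
for x in tables:
    writer.writerow(parse_row(x))

csv_text = buffer.getvalue()
print(csv_text)
```

Since values such as "Abbotsford, BC" contain a comma, the csv module quotes them automatically, which is one reason to prefer it over joining strings by hand.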

【Comments】:
