【Question title】: How to web scrape from two websites in one script?
【Posted】: 2019-12-22 23:34:12
【Question description】:

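I am currently working on a model and need to collect information, not only about match results (this link: https://www.hltv.org/stats/teams/matches/4991/fnatic?startDate=2019-01-01&endDate=2019-12-31), but I would also like the script to open another link found in the HTML source. That link is available in the source code and leads to a page with the detailed results of each match (such as who won which round: https://www.hltv.org/stats/matches/mapstatsid/89458/cr4zy-vs-fnatic?startDate=2019-01-01&endDate=2019-12-31&contextIds=4991&contextTypes=team). My main goal is to find out who won the match (from the first link) and who won the first round of each match (from the second link). Is this possible? Here is my current script: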

import requests
import pandas as pd
from bs4 import BeautifulSoup

r = requests.get('https://www.hltv.org/stats/teams/maps/6665/Astralis')
soup = BeautifulSoup(r.text, 'html.parser')
results = soup.find_all('tr')
AstralisResults = []

# Skip the header row; every remaining <tr> is one map result.
for result in results[1:]:
    date = result.contents[1].text
    event = result.contents[3].text
    opponent = result.contents[7].text
    Map = result.contents[9].text
    # Leading apostrophe keeps spreadsheet tools from reading the score as a date.
    Score = "'" + result.contents[11].text
    WinorLoss = result.contents[13].text
    AstralisResults.append((date,event,opponent,Map,Score,WinorLoss))

df5 = pd.DataFrame(AstralisResults,columns=['date','event','opponent','Map','Score','WinorLoss'])
df5.to_csv('AstralisResults.csv',index=False,encoding='utf-8')

So I am looking for the following information:

Date | Opponent | Map | Score | Result | Round1Result |

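【Comments】:

  • What is Score, and how does it differ from Result? Perhaps show an example output row with actual numbers from the given links.
  • I also don't see the second link contained in the first one. Are you actually getting any data at the moment? The site appears to be protected by Cloudflare.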

Tags: python pandas csv web-scraping beautifulsoup


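【Solution 1】:

It looks like the site blocks you if you scrape too fast, so I had to add a delay between requests. There are ways to make this code more efficient, but overall I think it does what you are asking for: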

from bs4 import BeautifulSoup
import requests
import pandas as pd
import time


headers={'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/76.0.3809.100 Safari/537.36'}

r = requests.get('https://www.hltv.org/stats/teams/matches/4991/fnatic?startDate=2019-01-01&endDate=2019-12-31' , headers=headers)
print (r)


soup = BeautifulSoup(r.text, 'html.parser')
results = soup.find_all('tr')
df5 = pd.DataFrame()

cnt=1
for result in results[1:]:
    print ('%s of %s' %(cnt, len(results)-1))
    date = result.contents[1].text
    event = result.contents[3].text
    opponent = result.contents[7].text
    Map = result.contents[9].text
    Score = "'" + result.contents[11].text
    WinorLoss = result.contents[13].text

    round_results = result.find('td', {'class':'time'})
    link = round_results.find('a')['href']


    r2 = requests.get('https://www.hltv.org' + link ,headers=headers)
    soup2 = BeautifulSoup(r2.text, 'html.parser')
    round_history = soup2.find('div', {'class':'standard-box round-history-con'})

    teams = round_history.find_all('img', {'class':'round-history-team'})
    teams_list = [ x['title'] for x in teams ]



    # Map each round to its winner. The first team row shows an
    # 'emptyHistory' image for every round that team did not win.
    rounds_winners = {}
    row = round_history.find('div',{'class':'round-history-team-row'})
    for n, each in enumerate(row.find_all('img',{'class':'round-history-outcome'}), start=1):
        if 'emptyHistory' in each['src']:
            winner = teams_list[1]
        else:
            winner = teams_list[0]

        rounds_winners['Round%02dResult' %n] = winner


    round_row_df = pd.DataFrame.from_dict(rounds_winners,orient='index').T

    temp_df = pd.DataFrame([[date,event,opponent,Map,Score,WinorLoss]],columns=['date','event','opponent','Map','Score','WinorLoss'])
    temp_df = temp_df.merge(round_row_df, left_index=True, right_index=True)

    # DataFrame.append was removed in pandas 2.0; pd.concat does the same job.
    df5 = pd.concat([df5, temp_df], sort=True).reset_index(drop=True)
    time.sleep(.5)
    cnt+=1

df5 = df5[['date','event','opponent','Map','Score','WinorLoss', 'Round01Result']]
df5 = df5.rename(columns={'date':'Date',
                          'event':'Event',
                          'WinorLoss':'Result',
                          'Round01Result':'Round1Result'})

df5.to_csv('AstralisResults.csv',index=False,encoding='utf-8')

Output:

print (df5.head(10).to_string())
       Date                                 Event     opponent       Map     Score Result Round1Result
0  20/07/19  Europe Minor - StarLadder Major 2019        CR4ZY     Dust2  '13 - 16      L       fnatic
1  20/07/19  Europe Minor - StarLadder Major 2019        CR4ZY     Train  '13 - 16      L       fnatic
2  19/07/19  Europe Minor - StarLadder Major 2019  mousesports   Inferno   '8 - 16      L  mousesports
3  19/07/19  Europe Minor - StarLadder Major 2019  mousesports     Dust2  '13 - 16      L       fnatic
4  17/07/19  Europe Minor - StarLadder Major 2019        North     Train   '16 - 9      W       fnatic
5  17/07/19  Europe Minor - StarLadder Major 2019        North      Nuke   '16 - 2      W       fnatic
6  17/07/19  Europe Minor - StarLadder Major 2019      Ancient    Mirage   '16 - 7      W       fnatic
7  04/07/19                  ESL One Cologne 2019     Vitality  Overpass  '17 - 19      L     Vitality
8  04/07/19                  ESL One Cologne 2019     Vitality    Mirage  '16 - 19      L       fnatic
9  03/07/19                  ESL One Cologne 2019     Astralis      Nuke   '6 - 16      L       fnatic
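
The delay this answer relies on can be factored into a small helper. The following is only a minimal sketch and is not part of the original answer: `polite_get`, `min_delay`, `max_retries` and `FakeResponse` are hypothetical names, and the fetch function is passed in so the example runs without network access. In the real script you might pass something like `lambda url: requests.get(url, headers=headers)`.

```python
import time


def polite_get(fetch, url, min_delay=0.5, max_retries=3, sleep=time.sleep):
    """Call fetch(url), pausing after each attempt and backing off on failure.

    fetch is any callable that returns an object with a status_code
    attribute (for example a thin wrapper around requests.get).
    All names here are illustrative assumptions.
    """
    delay = min_delay
    response = None
    for _ in range(max_retries):
        response = fetch(url)
        if response.status_code == 200:
            sleep(min_delay)  # fixed pause after a successful request
            return response
        sleep(delay)          # wait before retrying
        delay *= 2            # and wait longer next time
    return response           # last response, even if it was not a 200


class FakeResponse:
    """Stand-in for a real HTTP response, used only for this demo."""

    def __init__(self, status_code):
        self.status_code = status_code


# Simulate a server that rejects the first request and accepts the second.
_codes = iter([429, 200])
waits = []  # records the pauses instead of actually sleeping
result = polite_get(lambda url: FakeResponse(next(_codes)),
                    "https://example.invalid/page",
                    sleep=waits.append)
print(result.status_code, waits)
```

Doubling the wait after each failed attempt is one common choice; whether it is enough for this particular site is an assumption that would need testing.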

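【Comments】:

  • This does not return the above; instead it returns the map and the result of every individual round. Is there a way to get a csv file like the output mentioned above? Can it be limited to the first round only?
  • @Kreeshee, did you look at the output? It does return that result. You need to check all the columns (I reordered them to show you this output, but it is all there). Second, yes, you can limit it to the first round. Have you not used pandas before? I have updated the solution.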
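
If only the first round matters, the winner can also be decided before any DataFrame is built. The sketch below is a hedged illustration, not the answer's code: `first_round_winner` is a hypothetical helper, the `src` strings are made up, and the rule that `'emptyHistory'` marks a round the first team did not win is taken from the answer's loop rather than verified against the site.

```python
def first_round_winner(outcome_srcs, teams):
    """Return the name of the team that won round 1, or None if unknown.

    outcome_srcs: `src` values of the outcome images in the first team's
                  row, in round order.
    teams:        [first_team, second_team], in the order they appear
                  in the round history.
    Assumption (from the answer's code): a src containing 'emptyHistory'
    means the first team did not win that round.
    """
    if not outcome_srcs:
        return None  # no round history available
    first_team, second_team = teams
    if "emptyHistory" in outcome_srcs[0]:
        return second_team
    return first_team


# Hypothetical src values, for illustration only.
srcs = ["/img/emptyHistory.svg", "/img/ct_win.svg"]
print(first_round_winner(srcs, ["CR4ZY", "fnatic"]))  # prints: fnatic
```

In the answer's loop, `outcome_srcs` would correspond to `[img['src'] for img in row.find_all('img', {'class': 'round-history-outcome'})]` and `teams` to `teams_list`.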