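【Posted】: 2018-06-11 01:20:41
【Question】:
How do I handle a page that renders incorrectly, so that the data cannot be scraped correctly, similar to this?
I tried to implement something like the script below, but had no luck because the page structure is not simple. Any idea how to handle mismatched data? The page intermittently renders in a way that leaves the scraped columns uneven.
Desired output: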
Azam FC v Mwenge 1.8 https://www.bet365.com.au/#/AC/B1/C1/D13/E104/F16/S1/
Western Sydney Wanderers v Melbourne City 2.87 https://www.bet365.com.au/#/AC/B1/C1/D13/E101/F16/S1/
Sydney FC v Newcastle Jets 1.53 https://www.bet365.com.au/#/AC/B1/C1/D13/E101/F16/S1/
Actual output:
Azam FC v Mwenge 1.8 https://www.bet365.com.au/#/AC/B1/C1/D13/E104/F16/S1/
Western Sydney Wanderers v Melbourne City 1.53 https://www.bet365.com.au/#/AC/B1/C1/D13/E101/F16/S1/
The 1.53 should belong to Sydney FC, not Western Sydney Wanderers.
script.py
import collections
import csv
import time
from selenium import webdriver
from selenium.common.exceptions import TimeoutException, NoSuchElementException
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support.ui import WebDriverWait as wait
driver = webdriver.Chrome()
driver.set_window_size(1024, 600)
driver.maximize_window()
driver.get('https://www.bet365.com.au/#/AS/B1/')
def page_counter():
    # Yield 0, 1, 2, ... to number the coupon links
    for x in range(1000):
        yield x

count = page_counter()
# Wait for the "Main Lists" coupon links to become clickable, then collect their labels
clickMe = WebDriverWait(driver, 10).until(EC.element_to_be_clickable((By.XPATH, '//div[div/div/text()="Main Lists"]//div[starts-with(@class, "sm-CouponLink_Label") and normalize-space()]')))
coupon_labels = [x.text for x in driver.find_elements(By.XPATH, '//div[div/div/text()="Main Lists"]//div[starts-with(@class, "sm-CouponLink_Label") and normalize-space()]')]
links = dict((next(count) + 1, e) for e in coupon_labels)
# Visit the coupons in descending key order
desc_links = collections.OrderedDict(sorted(links.items(), reverse=True))
for key, label in desc_links.items():
    # Reload the list page and open the coupon with this label
    driver.get('https://www.bet365.com.au/#/AS/B1/')
    clickMe = WebDriverWait(driver, 10).until(EC.element_to_be_clickable((By.XPATH, '//div[div/div/text()="Main Lists"]//div[starts-with(@class, "sm-CouponLink_Label") and normalize-space()]')))
    driver.find_element(By.XPATH, f'//div[contains(text(), "{label}")]').click()
    groups = '/html/body/div[1]/div/div[2]/div[1]/div/div[2]/div[2]/div/div/div[2]/div'
    # NOTE: an XPath that starts with "//" searches the whole document even when
    # it is called on an element; a leading ".//" would restrict it to that element.
    xp_match_link = "//div//div[contains(@class, 'sl-CouponParticipantWithBookCloses_Name ')]"
    xp_bp1 = "//div[contains(@class, 'gl-Market_HasLabels')]/following-sibling::div[contains(@class, 'gl-Market_PWidth-12-3333')][1]//div[contains(@class, 'gl-ParticipantOddsOnly')]"
    try:
        # wait for the data to populate the tables
        wait(driver, 5).until(EC.element_to_be_clickable((By.XPATH, xp_bp1)))
        time.sleep(2)
        data = []
        for elem in driver.find_elements(By.XPATH, groups):
            try:
                match_link = elem.find_element(By.XPATH, xp_match_link).get_attribute('href')
            except NoSuchElementException:
                match_link = None
            try:
                bp1 = elem.find_element(By.XPATH, xp_bp1).text
            except NoSuchElementException:
                bp1 = None
            data.append([bp1, match_link])
            # data.append([match_link, bp1, ba1, bp3, ba3])
        print(data)
        url1 = driver.current_url
        # Append this coupon's rows to the CSV file
        with open('C:\\daw.csv', 'a', newline='', encoding="utf-8") as outfile:
            writer = csv.writer(outfile)
            for row in data:
                writer.writerow(row)
    except TimeoutException as ex:
        # The odds never appeared on this coupon; move on to the next one
        pass
    except NoSuchElementException as ex:
        print(ex)
        break
driver.close()
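【Comments】:
-
Could you provide the URL of the page where this unstructured data appears?
-
@VikasOjha This is one of the pages where you can see the data failing to load properly, which leaves the data uneven: bet365.com.au#/AC/B1/C1/D13/E40/F443/S1. It seems to happen from time to time rather than on every load.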
-
Well, here is the reason: the groups XPath returns 12 nodes while xp_match_link returns 7 nodes. You need to find a better way to write these XPaths so that they stay consistent.
-
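One way to keep the lookups consistent is to find each row first and then search for the name and the odds only inside that row, so a missing value becomes None instead of shifting later values up. The sketch below illustrates the idea with the standard library's xml.etree.ElementTree standing in for Selenium elements. The markup and the class names row, name, and odds are invented for the illustration and are not the real bet365 structure. In Selenium the equivalent would be elem.find_element(By.XPATH, ".//div[...]"), where the leading dot matters because an XPath beginning with "//" searches the whole document.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup: every match row wraps its own name and, when the page
# rendered it, its own odds. The second row is missing its odds on purpose.
PAGE = """
<div id="coupon">
  <div class="row"><div class="name">Azam FC v Mwenge</div><div class="odds">1.8</div></div>
  <div class="row"><div class="name">Western Sydney Wanderers v Melbourne City</div></div>
  <div class="row"><div class="name">Sydney FC v Newcastle Jets</div><div class="odds">1.53</div></div>
</div>
""".strip()


def text_or_none(row, path):
    """Return the text of the first node matching path inside row, else None."""
    node = row.find(path)
    return node.text if node is not None else None


def scrape_rows(markup):
    """Pair each name with the odds found in the same row."""
    root = ET.fromstring(markup)
    rows = []
    for row in root.findall(".//div[@class='row']"):
        # The leading "." keeps each lookup inside the current row.
        name = text_or_none(row, ".//div[@class='name']")
        odds = text_or_none(row, ".//div[@class='odds']")
        rows.append((name, odds))
    return rows


rows = scrape_rows(PAGE)
for name, odds in rows:
    print(name, odds)
```

Because each lookup is scoped to its row, the 1.53 stays with Sydney FC and the row with no odds is reported as None rather than borrowing a value from the next match.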
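I couldn't find a page that is misaligned, so it is hard to comment. You need to provide a page that shows the problem. One thing you could do is count the number of matches and the number of odds, and if the two numbers are not equal, reload the page, skip it, or handle it some other way.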
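The count-and-compare idea from this comment could look roughly like the sketch below. The names scrape_consistent and fake_fetch are hypothetical. With Selenium, the fetch callable would presumably reload the page with driver.get and return the texts of the match elements and the odds elements as two lists; here a stand-in returns canned lists so the example runs without a browser.

```python
import time


def scrape_consistent(fetch, max_attempts=3, delay=0.0):
    """Call fetch() until it returns equally long name and odds lists.

    Returns the paired rows, or None if the counts never agree.
    """
    for attempt in range(1, max_attempts + 1):
        names, odds = fetch()
        if len(names) == len(odds):
            return list(zip(names, odds))
        print(f"attempt {attempt}: {len(names)} matches vs {len(odds)} odds, retrying")
        time.sleep(delay)
    return None  # counts never agreed, so the caller should skip this page


# Stand-in for a page that renders incompletely on the first load only.
_loads = iter([
    (["A v B", "C v D"], ["1.8"]),
    (["A v B", "C v D"], ["1.8", "2.87"]),
])


def fake_fetch():
    return next(_loads)


result = scrape_consistent(fake_fetch)
print(result)
```

Equal counts do not prove that every value sits in the right row, so this check is best treated as a cheap guard against obviously broken loads rather than a full fix.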
Tags: python selenium web-scraping