Dealing with uneven data: answers

【Question title】: Dealing with uneven data
【Posted】: 2018-06-11 01:20:41
【Question】:

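How can I handle pages that load with errors, so that the data cannot be scraped correctly, similar to this?

I tried to implement something similar below, but with no luck, because the page structure is not simple. Any idea how to deal with unequal data, given that the page randomly causes the data to become uneven?

Desired output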

 Azam FC v Mwenge    1.8    https://www.bet365.com.au/#/AC/B1/C1/D13/E104/F16/S1/
 Western Sydney Wanderers v Melbourne City    2.87    https://www.bet365.com.au/#/AC/B1/C1/D13/E101/F16/S1/
 Sydney FC v Newcastle Jets    1.53    https://www.bet365.com.au/#/AC/B1/C1/D13/E101/F16/S1/

The actual output looks like

 Azam FC v Mwenge    1.8    https://www.bet365.com.au/#/AC/B1/C1/D13/E104/F16/S1/
 Western Sydney Wanderers v Melbourne City    1.53    https://www.bet365.com.au/#/AC/B1/C1/D13/E101/F16/S1/

The 1.53 should belong to Sydney FC, not Western Sydney Wanderers.

script.py

 import collections
 import csv
 import time

 from selenium import webdriver
 from selenium.common.exceptions import TimeoutException, NoSuchElementException
 from selenium.webdriver.common.by import By
 from selenium.webdriver.support import expected_conditions as EC
 from selenium.webdriver.support.ui import WebDriverWait
 from selenium.webdriver.support.ui import WebDriverWait as wait

 driver = webdriver.Chrome()
 driver.set_window_size(1024, 600)
 driver.maximize_window()


 driver.get('https://www.bet365.com.au/#/AS/B1/')
 driver.get('https://www.bet365.com.au/#/AS/B1/')


 def page_counter():
     for x in range(1000):
         yield x

 count = page_counter()

 clickMe = WebDriverWait(driver, 10).until(EC.element_to_be_clickable((By.XPATH, ('//div[div/div/text()="Main Lists"]//div[starts-with(@class, "sm-CouponLink_Label") and normalize-space()]'))))
 coupon_lables = [x.text for x in driver.find_elements_by_xpath('//div[div/div/text()="Main Lists"]//div[starts-with(@class, "sm-CouponLink_Label") and normalize-space()]')]

 links = dict((next(count) + 1, e) for e in coupon_lables)
 desc_links = collections.OrderedDict(sorted(links.items(), reverse=True))
 for key, label in desc_links.items():
     driver.get('https://www.bet365.com.au/#/AS/B1/')
     clickMe = WebDriverWait(driver, 10).until(EC.element_to_be_clickable((By.XPATH, ('//div[div/div/text()="Main Lists"]//div[starts-with(@class, "sm-CouponLink_Label") and normalize-space()]'))))
     driver.find_element_by_xpath(f'//div[contains(text(), "' + label + '")]').click()

     groups = '/html/body/div[1]/div/div[2]/div[1]/div/div[2]/div[2]/div/div/div[2]/div'
     xp_match_link = "//div//div[contains(@class, 'sl-CouponParticipantWithBookCloses_Name ')]"
     xp_bp1 = "//div[contains(@class, 'gl-Market_HasLabels')]/following-sibling::div[contains(@class, 'gl-Market_PWidth-12-3333')][1]//div[contains(@class, 'gl-ParticipantOddsOnly')]"

     try:
         # wait for the data to populate the tables
         wait(driver, 5).until(EC.element_to_be_clickable((By.XPATH, (xp_bp1))))
         time.sleep(2)

         data = []
         for elem in driver.find_elements_by_xpath(groups):
             try:
                 match_link = elem.find_element_by_xpath(xp_match_link) \
                     .get_attribute('href')
             except:
                 match_link = None

             try:
                 bp1 = elem.find_element_by_xpath(xp_bp1).text
             except:
                 bp1 = None

             data.append([bp1, match_link])
             # data.append([match_link, bp1, ba1, bp3, ba3])
         print(data)
         url1 = driver.current_url

         with open('C:\\daw.csv', 'a', newline='',
                   encoding="utf-8") as outfile:
             writer = csv.writer(outfile)
             for row in data:
                 writer.writerow(row)

     except TimeoutException as ex:
         pass
     except NoSuchElementException as ex:
         print(ex)
         break

 driver.close()

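【Question comments】:

Could you provide the URL of the page where this unstructured data appears?
@VikasOjha This is one of the pages where you can see the data failing to load properly, which produces the uneven data: bet365.com.au#/AC/B1/C1/D13/E40/F443/S1. It seems to pop up from time to time rather than on every load.
Well, that's the reason: the groups xpath yields 12 nodes, while xp_match_link yields 7 nodes. You need to find a better way to write these xpaths so that they stay consistent.
I couldn't find a page where the data was misaligned, so it's hard to comment. You need to provide a page that shows the problem. One thing you could do is count the matches and the odds, and if the two numbers are not equal, reload the page, skip scraping it, or something similar.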
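
The count-and-reload idea from the last comment could be sketched roughly as follows. This is only an illustration, not code from the thread: `scrape_aligned`, `fetch_columns` and `reload_page` are hypothetical names, and the two callables stand in for whatever Selenium calls collect the match names and the odds and refresh the page.

```python
import time


def scrape_aligned(fetch_columns, reload_page, max_attempts=3, delay=1.0):
    """Return (name, odds) pairs only when both columns have the same length.

    fetch_columns: hypothetical callable returning (names, odds) as two lists.
    reload_page:   hypothetical callable that reloads the page.
    """
    for attempt in range(max_attempts):
        names, odds = fetch_columns()
        if len(names) == len(odds):
            return list(zip(names, odds))
        # The counts differ, so pairing by index would attach odds
        # to the wrong match. Reload and try again.
        if attempt < max_attempts - 1:
            reload_page()
            time.sleep(delay)
    # The page never loaded consistently; the caller should skip it.
    return None
```

With the script above, `fetch_columns` might wrap two `driver.find_elements_by_xpath(...)` calls and `reload_page` might be `driver.refresh`. Returning `None` lets the caller skip a page that never loads consistently instead of writing shifted rows to the CSV.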

Tags: python selenium web-scraping

【Solution 1】:

It should work if you change the following xpath:

xp_match_link = "//div//div[contains(@class, 'sl-CouponParticipantWithBookCloses_NameContainer ')]"
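
A related point that may be worth checking, offered as an aside rather than as part of this answer: an XPath that begins with `//` is evaluated from the document root even when it is passed to `find_element_by_xpath` on an element such as `elem`, so each group could return the first match on the whole page. Prefixing the expression with `.` scopes the search to the element. The helper name `scoped` below is hypothetical.

```python
def scoped(xpath):
    """Make an absolute XPath ('/...' or '//...') relative to the context
    element by prefixing '.'; already-relative paths are returned unchanged."""
    return "." + xpath if xpath.startswith("/") else xpath


xp_match_link = "//div//div[contains(@class, 'sl-CouponParticipantWithBookCloses_NameContainer ')]"

# Inside the loop this could be used as
# elem.find_element_by_xpath(scoped(xp_match_link))
print(scoped(xp_match_link))
```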

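【Discussion】:

The groups xpath seems entirely correct, since it captures the two separate tables on the page.
I get [1.80, None]. That is caused by the groups xpath. I could remove it entirely and scrape the data fine, but if a button is dropped because of a glitch, I end up with uneven data.
You could get rid of the groups xpath entirely and use the other xpaths directly. That would be better.
Sadly, this is a similar problem to .stackoverflow.com/questions/47922167/…. It only seems possible to get the elements when there are no gaps, and I can't find much on this.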
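
For the case raised above, where an odds button is missing and index-based pairing shifts every later price up a row, one possible workaround is to pair names and prices by their vertical position on the page instead of by index. The sketch below is an assumption-laden illustration, not something proposed in the thread: it assumes the name cell and the odds cell of a fixture are rendered at roughly the same `y` offset (in Selenium, `element.location['y']`), and `align_by_row` and `tolerance` are hypothetical names.

```python
def align_by_row(names, odds, tolerance=5):
    """Pair each (y, name) with the (y, price) whose y lies within
    `tolerance` pixels. Names with no nearby price get None instead of
    inheriting the price of a later row."""
    remaining = sorted(odds)
    rows = []
    for y, name in sorted(names):
        match = next((o for o in remaining if abs(o[0] - y) <= tolerance), None)
        if match is not None:
            remaining.remove(match)  # each price is used at most once
        rows.append((name, match[1] if match else None))
    return rows


# Illustrative positions only; the middle fixture has lost its odds button.
names = [
    (100, "Azam FC v Mwenge"),
    (130, "Western Sydney Wanderers v Melbourne City"),
    (160, "Sydney FC v Newcastle Jets"),
]
odds = [(101, "1.8"), (161, "1.53")]

for row in align_by_row(names, odds):
    print(row)
```

In this example the 1.53 stays with Sydney FC v Newcastle Jets, and the Western Sydney Wanderers row is reported with `None` rather than a borrowed price.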