【Question title】: Can't scrape three fields from a table with complicated layout
【Posted】: 2019-08-09 09:00:58
【Question】:

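I've created a script in Python with Selenium to parse three fields, franking credit, gross dividend and further information, from a table available on a website. The last two fields only show up when the browser clicks the round yellow buttons with plus signs.

Once the buttons are clicked, they turn red to indicate that the information is now displayed.

My script can click all the buttons, but it fails to scrape the three fields from the table.

I've attached an image to show what it really looks like.

I know that if I send a POST HTTP request to https://www.sharedividends.com.au/wp-content/custom/ajaxfile.php?code=MLT with the relevant payload, I can get all the table fields as JSON, but that is not the approach I want to take.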

Website link

I've tried with:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

url = "https://www.sharedividends.com.au/mlt-dividend-history/"

driver = webdriver.Chrome()

driver.get(url)

table = driver.find_element_by_css_selector("#divTable")
driver.execute_script("arguments[0].scrollIntoView();",table)

for items in driver.find_elements_by_css_selector("td.sorting_1"):
    driver.execute_script("arguments[0].scrollIntoView();",items)
    items.click()

for elems in driver.find_elements_by_css_selector("#divTable tbody tr"):
    franking_credit = elems.find_elements_by_css_selector("td")[5].text
    gross_divident = elems.find_elements_by_css_selector("td")[6].text
    further_info = elems.find_elements_by_css_selector("td")[7].text
    print(franking_credit,gross_divident,further_info)

driver.quit()

When I run the above script, it throws IndexError: list index out of range, pointing at the franking_credit = line.
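The error can be reproduced without a browser. A minimal sketch, assuming that some of the matched rows hold fewer than six cells; the row contents and the helper name pick_fields are made up for illustration:

```python
# Hypothetical stand-ins for the <td> texts of each matched <tr>.
rows = [
    ["c0", "c1", "c2", "c3", "c4", "$ 0.0446", "$ 0.1486", "info"],  # data row: 8 cells
    ["detail"],  # extra row: 1 cell
]

# Indexing a short row reproduces the error from the script.
try:
    rows[1][5]
except IndexError as exc:
    print("IndexError:", exc)


def pick_fields(cells):
    """Return cells 5, 6 and 7, or None when the row is too short."""
    if len(cells) < 8:
        return None
    return cells[5], cells[6], cells[7]


results = [pick_fields(cells) for cells in rows]
print(results)
```

Checking the cell count before indexing avoids the exception, although it does not by itself explain where the short rows come from.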

This is what the table looks like. In the image below, I've marked the three fields of the table that I'm interested in.

Image link

How can I parse the three fields from that table?

【Comments】:

    Tags: python python-3.x selenium selenium-webdriver web-scraping


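    【Solution 1】:

    You are getting this error because, when the automation script runs, the table shows 20 rows with some other attributes instead of 10 rows. Try the code below, which restricts the row selector to tr[role='row'].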

    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support.ui import WebDriverWait
    from selenium.webdriver.support import expected_conditions as EC
    
    url = "https://www.sharedividends.com.au/mlt-dividend-history/"
    
    driver = webdriver.Chrome()
    
    driver.get(url)
    
    table = driver.find_element_by_css_selector("#divTable")
    driver.execute_script("arguments[0].scrollIntoView();",table)
    
    for items in driver.find_elements_by_css_selector("td.sorting_1"):
        driver.execute_script("arguments[0].scrollIntoView();",items)
        items.click()
    
    for elems in driver.find_elements_by_css_selector("#divTable tbody tr[role='row']"):
        franking_credit = elems.find_elements_by_css_selector("td")[5].text
        gross_divident = elems.find_elements_by_css_selector("td")[6].get_attribute('textContent')
        further_info = elems.find_elements_by_css_selector("td")[7].get_attribute('textContent')
        print(franking_credit, gross_divident,further_info)
    

    Console output:

    $ 0.0446 $ 0.1486 10.4C FRANKED @ 30%; DRP NIL DISCOUNT
    
    $ 0.0107 $ 0.0357 2.5C FRANKED@30%; SP ECIAL; DRP SUSP
    
    $ 0.0386 $ 0.1286 9C FRANKED @ 30%; DR P NIL DISCOUNT
    
    $ 0.0437 $ 0.1457 10.2C FRANKED @ 30%; DRP NIL DISCOUNT
    
    $ 0.0377 $ 0.1257 8.8C FRANKED @ 30%; DRP NIL DISCOUNT
    
    $ 0.0429 $ 0.1429 10C FRANKED @ 30%; D RP NIL DISCOUNT
    
    $ 0.0373 $ 0.1243 8.7C FRANKED @ 30%; DRP NIL DISCOUNT
    
    $ 0.0424 $ 0.1414 9.9C FRANKED @ 30%; DRP NIL DISCOUNT
    
    $ 0.0373 $ 0.1243 8.7C FRANKED @ 30%; DRP
    
    $ 0.0441 $ 0.1471 10.3C FR@30%;0.4C SP ECIAL;DRP;NIL DIS
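The find_elements_by_css_selector helpers used above were removed in Selenium 4.3, so on a current install the row loop needs the find_elements(by, value) form. A sketch of that loop with the same selectors follows. The function name extract_fields is hypothetical, the locator string "css selector" is the value of Selenium's By.CSS_SELECTOR, and reading textContent for all three cells is a simplification of the answer's mix of .text and textContent:

```python
CSS = "css selector"  # same string as selenium.webdriver.common.by.By.CSS_SELECTOR


def extract_fields(driver):
    """Collect (franking_credit, gross_divident, further_info) per data row."""
    fields = []
    for row in driver.find_elements(CSS, "#divTable tbody tr[role='row']"):
        cells = row.find_elements(CSS, "td")
        if len(cells) < 8:
            # Skip any row that does not carry all eight columns.
            continue
        fields.append(tuple(
            cells[i].get_attribute("textContent").strip() for i in (5, 6, 7)
        ))
    return fields
```

With a live driver this would be called as extract_fields(driver) after the buttons have been clicked.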
    

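    【Comments】:

    • The first for loop, which is supposed to reveal the content, is completely redundant, because the expected elements are already present.
    • @SIM: When you run the automation code, you should see additional redundant rows with other class names; that is why the OP runs into the index problem.
    【Solution 2】:

    This should do the trick!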

    from selenium import webdriver
    
    driver = webdriver.Chrome('chromedriver/chromedriver.exe')
    
    driver.get("https://www.sharedividends.com.au/mlt-dividend-history/")
    
    for button in driver.find_elements_by_class_name("sorting_1"):
        button.click()
    
    # Returns first part of the info
    for item in driver.find_elements_by_xpath("//tr[@role='row']/td"):
        print(item.text)
    
    # Returns second part of info
    for a in driver.find_elements_by_xpath("//ul[@class='dtr-details']/li"):
        print(a.text)
    

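    Output: this

    【Comments】:

    • I don't understand what you mean. Could you tell me more about this approach?
    • Selenium has a driver.find_elements_by_xpath method. Use it like this: driver.find_elements_by_xpath("//tr[@role='row']/td"). This returns a list; just search and/or iterate over that list to find the information you need.
    • @MITHU I've edited my answer; it should help you a lot now.
    • You located everything accurately, but to keep my current implementation intact I needed the .get_attribute('textContent') call.
    • Yes, get_attribute("textContent") can replace .text. I just tested it. Hope this is the answer that works for you!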
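The difference discussed in these comments is that Selenium's .text returns only text that is rendered as visible, while get_attribute("textContent") also returns the text of hidden nodes. A small sketch of a helper that tries .text first and falls back to textContent; the name cell_text is hypothetical:

```python
def cell_text(element):
    """Return the visible text if there is any, otherwise the raw textContent."""
    visible = (element.text or "").strip()
    if visible:
        return visible
    # Hidden cells yield an empty .text, so read the DOM property instead.
    return (element.get_attribute("textContent") or "").strip()
```

Any object exposing .text and .get_attribute(), such as a Selenium WebElement, can be passed in.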
    【Solution 3】:

    To extract the data from the three fields Franking Credit, Gross Dividend and Further Information, you have to induce WebDriverWait for visibility_of_all_elements_located(), and you can use the following Locator Strategies:

    • Code block:

      from selenium import webdriver
      from selenium.webdriver.common.by import By
      from selenium.webdriver.support.ui import WebDriverWait
      from selenium.webdriver.support import expected_conditions as EC
      
      chrome_options = webdriver.ChromeOptions() 
      chrome_options.add_argument("start-maximized")
      chrome_options.add_argument('disable-infobars')
      driver = webdriver.Chrome(options=chrome_options, executable_path=r'C:\Utility\BrowserDrivers\chromedriver.exe')
      driver.get("https://www.sharedividends.com.au/mlt-dividend-history/")
      driver.execute_script("arguments[0].scrollIntoView();", WebDriverWait(driver, 20).until(EC.visibility_of_element_located((By.CSS_SELECTOR, "table#divTable"))))
      for elem in WebDriverWait(driver, 20).until(EC.visibility_of_all_elements_located((By.XPATH, "//table[@aria-describedby='divTable_info']//tbody//tr/td[@class='sorting_1']"))):
          elem.click()
      all_fc = [my_elem.get_attribute("innerHTML") for my_elem in WebDriverWait(driver, 5).until(EC.visibility_of_all_elements_located((By.XPATH, "//table[@aria-describedby='divTable_info']//tbody//tr//td[position()=6]")))]
      all_gd = [my_elem.get_attribute("innerHTML") for my_elem in WebDriverWait(driver, 5).until(EC.visibility_of_all_elements_located((By.XPATH, "//table[@aria-describedby='divTable_info']//tbody//tr//td[position()=7]")))]
      all_fi = [my_elem.get_attribute("innerHTML") for my_elem in WebDriverWait(driver, 5).until(EC.visibility_of_all_elements_located((By.XPATH, "//table[@aria-describedby='divTable_info']//tbody//tr[@class='child']//li//span[@class='dtr-data']")))]
      for x,y,z in zip(all_fc, all_gd, all_fi):
          print(x,y,z)
      
    • Console output:

      $ 0.0446 $ 0.1486 10.4C FRANKED @ 30%; DRP NIL DISCOUNT
      
      $ 0.0107 $ 0.0357 2.5C FRANKED@30%; SP ECIAL; DRP SUSP
      
      $ 0.0386 $ 0.1286 9C FRANKED @ 30%; DR P NIL DISCOUNT
      
      $ 0.0437 $ 0.1457 10.2C FRANKED @ 30%; DRP NIL DISCOUNT
      
      $ 0.0377 $ 0.1257 8.8C FRANKED @ 30%; DRP NIL DISCOUNT
      
      $ 0.0429 $ 0.1429 10C FRANKED @ 30%; D RP NIL DISCOUNT
      
      $ 0.0373 $ 0.1243 8.7C FRANKED @ 30%; DRP NIL DISCOUNT
      
      $ 0.0424 $ 0.1414 9.9C FRANKED @ 30%; DRP NIL DISCOUNT
      
      $ 0.0373 $ 0.1243 8.7C FRANKED @ 30%; DRP
      
      $ 0.0441 $ 0.1471 10.3C FR@30%;0.4C SP ECIAL;DRP;NIL DIS
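One caveat with the zip() call in the code above: zip stops at the shortest list, so if one locator matches fewer elements than the others, the remaining rows are dropped silently. A sketch using itertools.zip_longest to make such a mismatch visible; the sample values are shortened from the output above, one entry is left out on purpose, and the placeholder string is arbitrary:

```python
from itertools import zip_longest

all_fc = ["$ 0.0446", "$ 0.0107"]
all_gd = ["$ 0.1486", "$ 0.0357"]
all_fi = ["10.4C FRANKED @ 30%; DRP NIL DISCOUNT"]  # one entry missing on purpose

# zip() alone would yield a single row here; zip_longest keeps both
# and fills the gap with a visible placeholder.
rows = list(zip_longest(all_fc, all_gd, all_fi, fillvalue="<missing>"))
for fc, gd, fi in rows:
    print(fc, gd, fi)
```

Comparing the three list lengths before pairing them would be an equally simple check.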
      

    【Comments】:
