【Question title】: BeautifulSoup and Pandas read_html is not pulling all of the rows in a table
【Posted】: 2022-02-08 05:36:23
【Question】:

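When I scrape a table from the website, the bottom 5 rows of data are missing and I can't figure out how to pull them. I am using a combination of BeautifulSoup and Selenium. I thought the rows were not loading, so I tried scrolling to the bottom with Selenium, but that still didn't work.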

Code trials:

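import bs4 as bs
import pandas as pd
from selenium import webdriver

site = 'https://fbref.com//en/comps/15/10733/schedule/2020-2021-League-One'
PATH = my_path  # placeholder: path to the local chromedriver executable
driver = webdriver.Chrome(PATH)
driver.get(site)
# Parse the rendered page source with BeautifulSoup
webpage = bs.BeautifulSoup(driver.page_source, features='html.parser')

# Locate the schedule table by its full class string
table = webpage.find('table', {'class': 'stats_table sortable min_width now_sortable'})
print(table.prettify())
df = pd.read_html(str(table))[0]

print(df.tail())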

Could you please help me scrape the whole table?

【Question comments】:

  • There is no need to use selenium here; requests is enough, and it is faster. The table is not loaded dynamically.
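
The comment's suggestion can be sketched roughly as follows. This is a minimal sketch and not the commenter's own code: the helper names `schedule_from_html` and `fetch_schedule`, the `User-Agent` header, and the idea of picking the table by its `Referee` column are all assumptions made for illustration. Recent pandas versions deprecate passing a literal HTML string to `read_html`, so the text is wrapped in `StringIO`.

```python
from io import StringIO

import pandas as pd
import requests

URL = "https://fbref.com//en/comps/15/10733/schedule/2020-2021-League-One"


def schedule_from_html(html: str, required_column: str = "Referee") -> pd.DataFrame:
    """Return the first table in `html` whose header contains `required_column`.

    Selecting by column name is an assumption; it avoids depending on
    CSS classes that may only be added by JavaScript in the browser.
    """
    for table in pd.read_html(StringIO(html)):
        if required_column in table.columns:
            return table
    raise ValueError(f"no table with a {required_column!r} column was found")


def fetch_schedule(url: str = URL) -> pd.DataFrame:
    """Download the page with requests and parse the schedule table."""
    # The User-Agent value is an assumption; some sites reject the default one.
    response = requests.get(url, headers={"User-Agent": "Mozilla/5.0"}, timeout=30)
    response.raise_for_status()
    return schedule_from_html(response.text)


if __name__ == "__main__":
    df = fetch_schedule()
    print(df.tail())
```

Keeping the parsing step in its own function means it can be checked against a saved copy of the page without hitting the site again.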

Tags: python pandas dataframe selenium beautifulsoup


【Solution 1】:

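To extract all the rows from the table on the website using only Selenium, induce WebDriverWait for visibility_of_element_located(), then pass the table's outerHTML to Pandas read_html to get a DataFrame. You can use either of the following Locator Strategies: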

  • Using CSS_SELECTOR:

    tabledata = WebDriverWait(driver, 20).until(EC.visibility_of_element_located((By.CSS_SELECTOR, "table.stats_table.sortable.min_width.now_sortable"))).get_attribute("outerHTML")
    tabledf = pd.read_html(tabledata)
    print(tabledf)
    
  • Using XPATH:

    driver.get('https://fbref.com//en/comps/15/10733/schedule/2020-2021-League-One')
    data = WebDriverWait(driver, 20).until(EC.visibility_of_element_located((By.XPATH, "//table[@class='stats_table sortable min_width now_sortable']"))).get_attribute("outerHTML")
    df = pd.read_html(data)
    print(df)
    
  • Note: You have to add the following imports:

    from selenium.webdriver.support.ui import WebDriverWait
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    
  • Console output:

    [              Round   Wk  Day  ...             Referee  Match Report                         Notes
    0    Regular Season    1  Sat  ...  Charles Breakspear  Match Report                           NaN
    1    Regular Season    1  Sat  ...       Andrew Davies  Match Report                           NaN
    2    Regular Season    1  Sat  ...       Kevin Johnson  Match Report                           NaN
    3    Regular Season    1  Sat  ...   Anthony Backhouse  Match Report                           NaN
    4    Regular Season    1  Sat  ...        Marc Edwards  Match Report                           NaN
    ..              ...  ...  ...  ...                 ...           ...                           ...
    685     Semi-finals  NaN  Tue  ...       Robert Madley  Match Report                    Leg 1 of 2
    686     Semi-finals  NaN  Wed  ...         Craig Hicks  Match Report                    Leg 1 of 2
    687     Semi-finals  NaN  Fri  ...        Keith Stroud  Match Report     Leg 2 of 2; Blackpool won
    688     Semi-finals  NaN  Sat  ...   Michael Salisbury  Match Report  Leg 2 of 2; Lincoln City won
    689           Final  NaN  Sun  ...     Tony Harrington  Match Report                           NaN
    
        [690 rows x 13 columns]]
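
Note that `pd.read_html` returns a list of DataFrames, which is why the output above is wrapped in square brackets. The following is a minimal sketch of a follow-up step, assuming that the table also contains blank spacer rows and repeated header rows that you may want to drop; whether they are present on this page is an assumption, and `tidy_schedule` is a hypothetical helper name.

```python
import pandas as pd


def tidy_schedule(tables: list[pd.DataFrame]) -> pd.DataFrame:
    """Take the first parsed table and drop spacer / repeated-header rows."""
    df = tables[0]
    # Spacer rows, if any, parse as rows in which every cell is NaN.
    df = df.dropna(how="all")
    # A repeated header row carries the column name itself as the cell value.
    first_col = df.columns[0]
    df = df[df[first_col] != first_col]
    return df.reset_index(drop=True)
```

With the answer's code this could be called as `tidy_schedule(df)`, after which `.tail()` should still show the semi-final and final rows.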
    

【Comments】:

  • Thanks!! This was really helpful.