【Question title】: How do I use python requests to web scrape the filtered results?
【Posted】: 2020-06-18 20:08:39
【Question】:

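I am trying to scrape the filtered results from https://www.gurufocus.com/insider/summary. Right now I can only get the information from the first page. What I really want is to filter on a few industries and get the related data (you can see "Industry" in the filter area). But when I select an industry the URL does not change, so I cannot scrape directly from the URL. I have seen people say you can use requests.post to get the data, but I don't really know how that works.

Here is some of my code so far.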

import requests
from bs4 import BeautifulSoup

TradeUrl = "https://www.gurufocus.com/insider/summary"
r = requests.get(TradeUrl)
data = r.content
soup = BeautifulSoup(data, 'html.parser')

ticker = []
for tk in soup.find_all('td',{'class': 'table-stock-info', 'data-column': 'Ticker'}):
    ticker.append(tk.text)

What should I do if I only need the tickers for the Financial Services industry?

【Question comments】:

    Tags: python web-scraping beautifulsoup python-requests


    【Solution 1】:

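    The problem with using a POST request as suggested is that the request needs an authorization token, and that token has an expiry time. You can see the POST request in Chrome or Firefox: right-click the page -> Inspect -> Network, then select an Industry, click the POST request and click Cookies. There is a cookie, password_grant_custom.client.expires, that holds the timestamp after which the authorization no longer works.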
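
As a rough sketch of what the requests.post route would involve, assuming you copy a still-valid token from the browser: the endpoint URL, the `Bearer` header scheme, the `industry` payload field and the seconds-based timestamp below are all hypothetical placeholders, not the site's real API. The real values are whatever the POST request in the Network tab shows.

```python
import time

# Hypothetical placeholder: copy the real endpoint from the Network tab.
API_URL = "https://example.com/api/insider/filter"


def token_is_expired(expires_at, now=None):
    """Return True once the Unix timestamp `expires_at` (seconds) has passed."""
    if now is None:
        now = time.time()
    return now >= expires_at


def build_post_kwargs(token, industry):
    """Build keyword arguments suitable for requests.post(**kwargs)."""
    return {
        "url": API_URL,
        # Assumed header scheme; the site may expect something different.
        "headers": {"Authorization": f"Bearer {token}"},
        # Assumed payload shape; mirror the request body seen in the browser.
        "json": {"industry": industry},
    }


kwargs = build_post_kwargs("token-copied-from-browser", "Financial Services")
print(kwargs["json"])
```

With `requests` imported, `requests.post(**kwargs)` would send the request. Once `token_is_expired(...)` returns True the token has to be copied again by hand, which is why this route is awkward for repeated scraping.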

    However, you can use Selenium to scrape the data from all of the pages.

    First install Selenium:

    `sudo pip3 install selenium` on Linux or `pip install selenium` on Windows
    

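    Then get a driver from https://sites.google.com/a/chromium.org/chromedriver/downloads, picking the release that matches your Chrome version, and extract it from the zip file.

    Note that on Windows you need to add the path to chromedriver to this call: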

    driver = webdriver.Chrome(options=options)
    

    On Linux, copy chromedriver to /usr/local/bin/chromedriver
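
Before launching the browser it can help to check that the driver is actually reachable. The following is a minimal sketch using only the standard library; `locate_driver` and its `extra_dirs` parameter are hypothetical helpers for illustration, not part of Selenium.

```python
import os
import shutil


def locate_driver(name="chromedriver", extra_dirs=()):
    """Return the full path of the driver executable, or None if it is not found."""
    # Search the extra directories first, then the normal PATH.
    search_path = os.pathsep.join([*extra_dirs, os.environ.get("PATH", "")])
    return shutil.which(name, path=search_path)


driver_path = locate_driver()
if driver_path is None:
    print("chromedriver not found; add its folder to PATH or pass it in extra_dirs")
else:
    print("Using", driver_path)
```

With Selenium 4, a path found this way can be handed over explicitly through `Service(executable_path=driver_path)` from `selenium.webdriver.chrome.service`, passed as `webdriver.Chrome(service=..., options=options)`.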

    from selenium import webdriver
    from selenium.webdriver.common.by import By
    import selenium.webdriver.support.expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait
    from selenium.common.exceptions import TimeoutException
    from bs4 import BeautifulSoup
    import time
    
    # Start with the driver maximised to see the drop down menus properly
    options = webdriver.ChromeOptions()
    options.add_argument("--start-maximized")
    driver = webdriver.Chrome(options=options)
    driver.get('https://www.gurufocus.com/insider/summary')
    
    # Set the page size to 100 to reduce page loads
    driver.find_element(By.XPATH, "//span[contains(text(),'40 / Page')]").click()
    page_size_option = WebDriverWait(driver, 10).until(
        EC.presence_of_element_located((
            By.XPATH,
            "//div[contains(text(),'100')]"))
    )
    page_size_option.click()
    
    # Wait for the page to load and don't overload the server
    time.sleep(2)
    
    # select Industry
    driver.find_element(By.XPATH, "//span[contains(text(),'Industry')]").click()
    
    # Select Financial Services
    element = WebDriverWait(driver, 5).until(
        EC.presence_of_element_located((
            By.XPATH,
            "//span[contains(text(),'Financial Services')]"))
    )
    element.click()
    
    ticker = []
    
    while True:
        # Wait for the page to load and don't overload the server
        time.sleep(6)
        # Parse the HTML
        soup = BeautifulSoup(driver.page_source, 'html.parser')
        for tk in soup.find_all('td', {'class': 'table-stock-info', 'data-column': 'Ticker'}):
            ticker.append(tk.text)
        try:
            # Move to the next page
            element = WebDriverWait(driver, 5).until(EC.element_to_be_clickable((By.CLASS_NAME, 'btn-next')))
            element.click()
        except TimeoutException as ex:
            # No more pages so break
            break
    driver.quit()
    
    print(len(ticker))
    print(ticker)
    

    Output

    4604
    ['PUB   ', 'ARES   ', 'EIM   ', 'CZNC   ', 'SSB   ', 'CNA   ', 'TURN   ', 'FNF   ', 'EGIF   ', 'NWPP  etc...
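
The scraped strings keep the table cell's trailing whitespace, and because each table row is a trade rather than a company, the same ticker can presumably appear more than once (an assumption based on the count above). A small sketch of tidying the list afterwards; `clean_tickers` is a hypothetical helper name:

```python
def clean_tickers(raw_tickers):
    """Strip whitespace, drop empty strings and remove duplicates, keeping first-seen order."""
    stripped = (t.strip() for t in raw_tickers)
    # dict.fromkeys keeps insertion order, so it works as an ordered set.
    return list(dict.fromkeys(t for t in stripped if t))


sample = ['PUB   ', 'ARES   ', 'EIM   ', 'PUB   ', '   ']
print(clean_tickers(sample))  # ['PUB', 'ARES', 'EIM']
```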
    

    Update

    If you want to scrape all of the data from all pages and/or write it to a csv file, use pandas:

    from selenium import webdriver
    from selenium.webdriver.common.by import By
    import selenium.webdriver.support.expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait
    from selenium.common.exceptions import TimeoutException
    import pandas as pd
    import time
    
    # Start with the driver maximised to see the drop down menus properly
    options = webdriver.ChromeOptions()
    options.add_argument("--start-maximized")
    driver = webdriver.Chrome(options=options)
    driver.get('https://www.gurufocus.com/insider/summary')
    
    # Set the page size to 100 to reduce page loads
    driver.find_element(By.XPATH, "//span[contains(text(),'40 / Page')]").click()
    page_size_option = WebDriverWait(driver, 10).until(
        EC.presence_of_element_located((
            By.XPATH,
            "//div[contains(text(),'100')]"))
    )
    page_size_option.click()
    
    # Wait for the page to load and don't overload the server
    time.sleep(2)
    
    # select Industry
    driver.find_element(By.XPATH, "//span[contains(text(),'Industry')]").click()
    
    # Select Financial Services
    element = WebDriverWait(driver, 5).until(
        EC.presence_of_element_located((
            By.XPATH,
            "//span[contains(text(),'Financial Services')]"))
    )
    element.click()
    
    
    columns = [
        'Ticker', 'Links', 'Company', 'Price1', 'Insider Name', 'Insider Position',
        'Date', 'Buy/Sell', 'Insider Trading Shares', 'Shares Change', 'Price2',
        'Cost(000)', 'Final Share', 'Price Change Since Insider Trade (%)',
        'Dividend Yield %', 'PE Ratio', 'Market Cap ($M)', 'None'
    ]
    df = pd.DataFrame(columns=columns)
    
    
    while True:
        # Wait for the page to load and don't overload the server
        time.sleep(6)
        # Parse the HTML
        page_df = pd.read_html(driver.page_source, attrs={'class': 'data-table'})[0]
        df = pd.concat([df, page_df], ignore_index=True)
    
        try:
            # Move to the next page
            element = WebDriverWait(driver, 5).until(EC.element_to_be_clickable((By.CLASS_NAME, 'btn-next')))
            element.click()
        except TimeoutException as ex:
            # No more pages so break
            break
    driver.quit()
    
    # Write to csv
    df.to_csv("Financial_Services.csv", encoding='utf-8', index=False)
    

    Updated in response to the comments: first download the Firefox driver, geckodriver, from https://github.com/mozilla/geckodriver/releases and extract it. Again, on Windows you need to add the path to geckodriver to driver = webdriver.Firefox(), or on Linux copy geckodriver to /usr/local/bin/geckodriver

    from selenium import webdriver
    from selenium.webdriver.common.by import By
    import selenium.webdriver.support.expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait
    from selenium.common.exceptions import TimeoutException
    import pandas as pd
    import time
    
    # Start with the driver maximised to see the drop down menus properly
    driver = webdriver.Firefox()
    driver.maximize_window()
    driver.get('https://www.gurufocus.com/insider/summary')
    
    # Set the page size to 100 to reduce page loads
    driver.find_element(By.XPATH, "//span[contains(text(),'40 / Page')]").click()
    page_size_option = WebDriverWait(driver, 10).until(
        EC.presence_of_element_located((
            By.XPATH,
            "//div[contains(text(),'100')]"))
    )
    page_size_option.click()
    
    # Wait for the page to load and don't overload the server
    time.sleep(2)
    
    # select Industry
    driver.find_element(By.XPATH, "//span[contains(text(),'Industry')]").click()
    
    # Select Financial Services
    element = WebDriverWait(driver, 5).until(
        EC.presence_of_element_located((
            By.XPATH,
            "//span[contains(text(),'Financial Services')]"))
    )
    element.click()
    
    columns = [
        'Ticker', 'Links', 'Company', 'Price1', 'Insider Name', 'Insider Position',
        'Date', 'Buy/Sell', 'Insider Trading Shares', 'Shares Change', 'Price2',
        'Cost(000)', 'Final Share', 'Price Change Since Insider Trade (%)',
        'Dividend Yield %', 'PE Ratio', 'Market Cap ($M)', 'None'
    ]
    df = pd.DataFrame(columns=columns)
    page_limit = 5
    page = 0
    
    while True:
        # Wait for the page to load and don't overload the server
        time.sleep(6)
        # Parse the HTML
        page_df = pd.read_html(driver.page_source, attrs={'class': 'data-table'})[0]
        df = pd.concat([df, page_df], ignore_index=True)
    
        # Stop after page limit is reached.
        page = page + 1
        if page >= page_limit:
            break
    
        try:
            # Move to the next page
            element = WebDriverWait(driver, 5).until(EC.element_to_be_clickable((By.CLASS_NAME, 'btn-next')))
            element.click()
        except TimeoutException as ex:
            # No more pages so break
            break
    
    driver.quit()
    
    # Write to csv
    df.to_csv("Financial_Services.csv", encoding='utf-8', index=False)
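
The page-limit logic in the loop above can be looked at separately from the browser. The sketch below is one way to express it, not the answer's own code: `collect_pages`, `read_page` and `go_to_next_page` are hypothetical names, and a fake eight-page site stands in for Selenium.

```python
def collect_pages(read_page, go_to_next_page, page_limit=None):
    """Collect one result per page until the limit is hit or no next page exists.

    read_page() returns the data for the current page.
    go_to_next_page() returns False when there is no further page.
    """
    results = []
    while True:
        results.append(read_page())
        # Stop before clicking "next" once enough pages have been read.
        if page_limit is not None and len(results) >= page_limit:
            break
        if not go_to_next_page():
            break
    return results


# Stand-in for the browser: a fake site with 8 pages.
state = {"page": 1}
last_page = 8


def read_page():
    return f"rows of page {state['page']}"


def go_to_next_page():
    if state["page"] >= last_page:
        return False
    state["page"] += 1
    return True


first_five = collect_pages(read_page, go_to_next_page, page_limit=5)
print(len(first_five))  # 5
```

In the Selenium script, `read_page` would wrap the `pd.read_html(driver.page_source, ...)` call and `go_to_next_page` would wrap the `WebDriverWait(...).until(...)` click, returning False when a `TimeoutException` is raised.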
    

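    【Discussion】:

    • Hi! Thanks a lot, but I tried running the code and got the error element click intercepted: Element <span>...</span> is not clickable at point (502, 81). Do you know how to deal with this?
    • I tried a few more times and somehow it worked this time. How can I limit the pages? What if I only want the first five pages?
    • I have seen the error Element <span>...</span> is not clickable at point before; it is caused by a bug in chromedriver, see stackoverflow.com/questions/11908249/… The easiest way to fix it is to use the Firefox driver, geckodriver: github.com/mozilla/geckodriver/releases I will post an update shortly that addresses both of your comments.
    • Sorry for the late reply, and thank you so much for the detailed answer!! It was really helpful and my problem is solved now. I really appreciate it! I'll try to learn some Selenium from now on, haha
    • You're welcome. If this answered your question, feel free to accept the answer using the button on the left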